From 1106ff93c0727ac2861fbc255e678115253088d4 Mon Sep 17 00:00:00 2001 From: tbartley94 <90423858+tbartley94@users.noreply.github.com> Date: Tue, 9 Nov 2021 15:18:19 -0500 Subject: [PATCH] WFST_tutorial for ITN development (#3128) * Pushing WFST_tutorial for open draft. (Still need to review collab code. Signed-off-by: tbartley94 * Checked tutorial code for WFST_Tutorial is properly functioning. Also included some formatting edits. Signed-off-by: tbartley94 * Responding to editorial comments for WFST_tutorial Signed-off-by: tbartley94 * Added images to folder and wrote README for tutorials Signed-off-by: tbartley94 * Few more editorial changes to explain permutations in classification. Signed-off-by: tbartley94 * Updated tutorials documentation page. Signed-off-by: tbartley94 * Forgot links for README Signed-off-by: tbartley94 * TOC links were dead Signed-off-by: tbartley94 * More dead links to fix. Signed-off-by: tbartley94 * removing collab install and appending a warning instead. Signed-off-by: tbartley94 * Update WFST_Tutorial.ipynb Signed-off-by: tbartley94 --- docs/source/starthere/tutorials.rst | 3 + tutorials/text_processing/README.md | 24 + tutorials/text_processing/WFST_Tutorial.ipynb | 7196 +++++++++++++++++ tutorials/text_processing/images/cent.PNG | Bin 0 -> 10352 bytes .../text_processing/images/cent_to_100.PNG | Bin 0 -> 11898 bytes .../text_processing/images/cent_vingt_bad.PNG | Bin 0 -> 13247 bytes .../images/cent_vingt_good.PNG | Bin 0 -> 31222 bytes .../images/cent_vingt_to_120.PNG | Bin 0 -> 10937 bytes .../text_processing/images/dix_to_digits.PNG | Bin 0 -> 17537 bytes .../images/dix_to_digits_with_insert.PNG | Bin 0 -> 27468 bytes .../text_processing/images/romanization.PNG | Bin 0 -> 36700 bytes 11 files changed, 7223 insertions(+) create mode 100644 tutorials/text_processing/README.md create mode 100644 tutorials/text_processing/WFST_Tutorial.ipynb create mode 100644 tutorials/text_processing/images/cent.PNG create mode 100644 
tutorials/text_processing/images/cent_to_100.PNG create mode 100644 tutorials/text_processing/images/cent_vingt_bad.PNG create mode 100644 tutorials/text_processing/images/cent_vingt_good.PNG create mode 100644 tutorials/text_processing/images/cent_vingt_to_120.PNG create mode 100644 tutorials/text_processing/images/dix_to_digits.PNG create mode 100644 tutorials/text_processing/images/dix_to_digits_with_insert.PNG create mode 100644 tutorials/text_processing/images/romanization.PNG diff --git a/docs/source/starthere/tutorials.rst b/docs/source/starthere/tutorials.rst index 823db6a8b..ee7c33213 100644 --- a/docs/source/starthere/tutorials.rst +++ b/docs/source/starthere/tutorials.rst @@ -139,3 +139,6 @@ To run a tutorial: * - Text Processing - Inverse Text Normalization for ASR - `Inverse Text Normalization `_ + * - Text Processing + - Constructing Normalization Grammars with WFSTs + - `WFST Tutorial `_ diff --git a/tutorials/text_processing/README.md b/tutorials/text_processing/README.md new file mode 100644 index 000000000..07e4ac0ea --- /dev/null +++ b/tutorials/text_processing/README.md @@ -0,0 +1,24 @@ +# NeMo Text Processing Tutorials + +The NeMo Text Processing module provides support for both Text Normalization (TN) and +Inverse Text Normalization (ITN) in order to aid upstream and downstream text processing. +The included tutorials are intended to help you quickly become familiar with the interface +of the module, as well as guiding you in creating and deploying your own grammars for individual +text processing needs. + +If you wish to learn more about how to use NeMo for Text Normalization tasks (e.g. conversion +of symbolic strings to verbal form - such as `15` -> "fifteen"), please see the [`Text Normalization`](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Text_Normalization.ipynb) +tutorial. 
+ +If you wish to learn more about Inverse Text Normalization - the inverse task of converting +from verbalized strings to symbolic written form, as may be encountered in downstream ASR - +consult the [`Inverse Text Normalization`](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Inverse_Text_Normalization.ipynb) tutorial. + +For those curious about constructing grammars tailored to specific languages and use cases, +you may be interested in working through the [`WFST Tutorial`](https://github.com/NVIDIA/NeMo/blob/stable/tutorials/text_processing/WFST_Tutorial.ipynb), which goes through NeMo's Normalization +process in detail. + +As NeMo Text Processing utilizes Weighted Finite State Transducer (WFST) graphs to construct its +grammars, a working knowledge of [Finite State Automata](https://en.wikipedia.org/wiki/Finite-state_machine) (FSA) and/or regular languages is suggested. +Further, we recommend becoming functionally familiar with the [`pynini` library](https://www.openfst.org/twiki/bin/view/GRM/Pynini) - which functions +as the backend for graph construction - and [Sparrowhawk](https://github.com/google/sparrowhawk) - which NeMo utilizes for grammar deployment. \ No newline at end of file diff --git a/tutorials/text_processing/WFST_Tutorial.ipynb b/tutorials/text_processing/WFST_Tutorial.ipynb new file mode 100644 index 000000000..decd61057 --- /dev/null +++ b/tutorials/text_processing/WFST_Tutorial.ipynb @@ -0,0 +1,7196 @@ +{ + "cells": [ + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Qq1Hz6CKWdwl", + "outputId": "3d8f5bd6-f10e-431d-9039-eb88164fbb95" + }, + "outputs": [], + "source": [ + "### WARNING: This notebook will not work in a Colab environment. 
\n", + "\n", + "BRANCH= \"r1.5.0\"\n", + "\n", + "!git clone -b $BRANCH https://github.com/NVIDIA/NeMo\n", + "%cd NeMo\n", + "!./reinstall.sh" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import pynini\n", + "import nemo_text_processing\n", + "\n", + "from pynini.lib import pynutil" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_text_processing.text_normalization.en.graph_utils import GraphFst, NEMO_DIGIT, delete_space, NEMO_SIGMA, NEMO_NOT_QUOTE, delete_extra_space, NEMO_NON_BREAKING_SPACE\n", + "from nemo_text_processing.text_normalization.normalize import Normalizer\n", + "\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.cardinal import CardinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.decimal import DecimalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.money import MoneyFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.ordinal import OrdinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.punctuation import PunctuationFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.time import TimeFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.whitelist import WhiteListFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.word import WordFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.cardinal import CardinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.decimal import DecimalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.money import MoneyFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.ordinal import OrdinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.time import 
TimeFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.whitelist import WhiteListFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.word import WordFst\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T0JxcvuPHvn9" + }, + "source": [ + "NeMo's Text Processing module uses Weighted Finite State Transducers (WFST) to deploy grammars for both efficient text normalization (TN) and inverse text normalization (ITN). In this tutorial, you will learn to build a normalization grammar from the ground up to use in your own text processing tasks." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Table of Contents\n", + "- [WFSTs](#wfsts)\n", + "- [NeMo Text Processing](#nemo-text-processing)\n", + "- [Getting Started](#getting-started)\n", + "- [Cardinal WFST](#cardinal-wfst)\n", + "- [Ordinal WFST](#ordinal-wfst)\n", + "- [Decimal WFST](#decimal-wfst)\n", + "- [Money WFST](#money-wfst)\n", + "- [Time WFST](#time-wfst)\n", + "- [WhiteList WFST](#whitelist-wfst)\n", + "- [Word and Punctuation WFST](#word-and-punctuation-wfst)\n", + "- [Other Classes](#other-classes)\n", + "- [Tokenize and Classify](#tokenize-and-classify)\n", + "- [Verbalize and Verbalize Final](#verbalize-and-verbalize-final)\n", + "- [Deployment](#deployment)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lMUovcMsfXyI" + }, + "source": [ + "# WFSTs " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Y1ejNMLbH1jM" + }, + "source": [ + "WFSTs are a form of [Finite State Machines](https://en.wikipedia.org/wiki/Finite-state_machine) used to graph relations between regular languages (or [regular expressions](https://en.wikipedia.org/wiki/Regular_expression)). For our purposes, they can be defined by two major properties:\n", + "\n", + "1. Mappings between accepted input and output expressions for text substitution\n", + "2. 
Path weighting to direct graph traversal" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nNg45ZuaP_A8" + }, + "source": [ + "For example, consider a simple normalization task of mapping the word \"cent\" (French for \"one hundred\") to the numerical representation `100`. We would begin with a Finite State representation of the regex `/cent/`:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uxo7gUkW_XKT" + }, + "source": [ + "![cent.png](images/cent.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fahsjMVFlbCa" + }, + "source": [ + "And then create a mapping to the text string `100`:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IMJ-fNSk_jXC" + }, + "source": [ + "![cent_to_100.png](images/cent_to_100.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bPKW0I4yAGUb" + }, + "source": [ + "*Note: Null characters are expressed as `ε` by convention*" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_0NK3aW5nG9C" + }, + "source": [ + "This would give us a WFST with universal path weights. (By default, `pynini` uses [tropical semirings](https://en.wikipedia.org/wiki/Tropical_semiring) for arcs, giving each arc a default weight of `0`.)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CzBc9D3qTGJ-" + }, + "source": [ + "Now, let us consider expanding our model. To indicate values between `100` and `200`, French uses the number scheme of `cent + digit`. For example, `120` would be pronounced as \"cent-vingt\". To create the appropriate output string, we would now want to map \"cent\" to `1` and the remaining aspect of our string to the appropriate digit representation." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GRrKNQRjFDoL" + }, + "source": [ + "![cent_vingt_to_120.png](images/cent_vingt_to_120.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jLpm4mufAfUz" + }, + "source": [ + "However this would make our graph [non-deterministic](https://en.wikipedia.org/wiki/Nondeterministic_algorithm) - it will have multiple possibilities for termination. Now an input of \"cent-vingt\" could have the outcome of `100` or `10020` when only one is correct. " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "![cent_vingt_bad.png](images/cent_vingt_bad.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c-GJTpgIAf7S" + }, + "source": [ + "To correct this, we may add a new end state and a weight to the path that accepts the input without `s`:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6GJcsdttGg_S" + }, + "source": [ + "![cent_vingt_good.png](images/cent_vingt_good.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mHft1gzsAipc" + }, + "source": [ + "Now, we can guarantee an ideal mapping by relying on a shortest-path (smallest-weight) heuristic: traversal of the graph will prioritize longer inputs, only converting \"cent\" to `100` when a larger input isn't available. As such, we've now removed the undesired output `10020` while preserving our desired coverage in string mapping. \n", + "\n", + "This use of weights to ensure predictable behavior allows WFSTs to exploit the efficiency of standard graph traversal algorithms while also maintaining versatility. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8Ik4PBXafSSB" + }, + "source": [ + "# NeMo Text Processing " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b2fcWKhqYVF5" + }, + "source": [ + "Following [Google's Kestrel](https://www.researchgate.net/publication/277932107_The_Kestrel_TTS_text_normalization_system) framework, NeMo deploys two composite WFSTs for text normalization. They are as follows:\n", + "1. A *classifier* (or tagger) to label potential tokens by 'semiotic class' (e.g. currency, ordinal number, street address)\n", + "2. A *verbalizer* to render a tagged token in conventional written form\n", + "\n", + "For example, consider the sentence: <>\n", + "\n", + "For an ITN task, a tokenizer would identify the following tokens:\n", + "\n", + "`[\"le\" ,\"premier\", \"juillet\", \"il\", \"a\", \"mangé\", \"trente-cinq\", \"pommes\"]`\n", + "\n", + "and provide each a class token: \n", + "\n", + "- `tokens { name: \"le\" }`\n", + "- `tokens { date { day: \"1\" month: \"juillet\" } } ` \n", + "- `tokens { name: \"il\" }` \n", + "- `tokens { name: \"a\" }` \n", + "- `tokens { name: \"mangé\" }`\n", + "- `tokens { cardinal { integer: \"35\" } }` \n", + "- `tokens { name: \"pommes\" }`\n", + "\n", + "These tokens are then passed to a 'verbalizer' WFST, which renders each token in a conventional written form:\n", + "\n", + "- `tokens { name: \"le\" }` -> `le` \n", + "- `tokens { date { day: \"1\" month: \"juillet\" } } ` -> `1ᵉʳ` \n", + "- `tokens { name: \"il\" }` -> `juillet`\n", + "- `tokens { name: \"il\" }` -> `il` \n", + "- `tokens { name: \"a\" }` -> `a`\n", + "- `tokens { name: \"mangé\" }` -> `mangé` \n", + "- `tokens { cardinal { integer: \"35\" } }` -> `35` \n", + "- `tokens { name: \"pommes\" }` -> `pommes`\n", + "\n", + "and merged into a normalized string:\n", + "\n", + "`le 1ᵉʳ juillet il a mangé 35 pommes`\n", + "\n", + "With the equivalent TN task being the reverse process. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_n-5JExAbvwr" + }, + "source": [ + "A few things to note: \n", + "- Each class token has a unique set of field names that must be parsed by the classifier. The default field names for NeMo are chosen to mirror the syntax in [Sparrowhawk](https://github.com/google/sparrowhawk) to enable deployment. If these fields are not exact, you will not be able to use Sparrowhawk.\n", + "- NeMo assumes no punctuation (unless explicitly provided in the grammar) and all lower casing to ease integration with upstream ASR.\n", + "- The `name` class token is default for any token that does not require processing. It will be left 'as is.'\n", + "- You may note how the tokenizer performed the conversion of `premier` to `1` while the verbalizer normalized `1` -> `1ᵉʳ`. Such decisions are implementation dependent and will vary depending on preference and language. (That is, normalization from `premier` -> `1ᵉʳ` could have been a tokenization step.)\n", + "- By default, NeMo will create several permutations of key values in a token to ease normalization. That is, given the token `tokens { date { day: \"1\" month: \"juillet\" } }`, it will also produce paths for `tokens { date { month: \"juillet\" day: \"1\" } }`. To prevent this and avoid ambiguity in verbalizer input, tokens can be assigned a `preserve_order` attribute to prevent permutation. (e.g. `tokens { date { day: \"1\" month: \"juillet\" preserve_order: true } }`) (We will discuss this [later in the tutorial](#verbalizer).)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## WFST Classes" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "NeMo Text Processing's base languages currently support only the following semiotic classes to permit integration with Sparrowhawk deployment. 
\n", + "\n", + "- CARDINAL\n", + "- ORDINAL\n", + "- DECIMAL\n", + "- FRACTION\n", + "- MEASURE\n", + "- MONEY\n", + "- TIME\n", + "- DATE\n", + "- ELECTRONIC\n", + "- TELEPHONE\n", + "- WHITELIST\n", + "- WORD\n", + "- PUNCTUATION\n", + "\n", + "For this tutorial, we will be focusing on the following classes:\n", + "- CARDINAL\n", + "- ORDINAL\n", + "- DECIMAL\n", + "- MONEY\n", + "- TIME\n", + "- WHITELIST\n", + "- WORD\n", + "- PUNCTUATION\n", + "\n", + "While not comprehensive, these classes will provide enough foundation and exposure to edge cases that you will feel comfortable constructing grammars for the other classes.\n", + "\n", + "**NOTE**: *If you intend to only develop for personal use with NeMo, you may rename these classes as desired. However, Sparrowhawk integration\n", + "REQUIRES use of only these tags and their assigned attributes. For a list of Sparrowhawk tokens and attributes, [consult the Sparrowhawk repository](https://github.com/yzhang123/sparrowhawk/blob/test/src/proto/semiotic_classes.proto)*" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Further Reading" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If you wish to learn more about NeMo Text Processing, you may wish to consult the following:\n", + "- [Y. Zhang, E. Bakhturina, K. Gorman, and B. 
Ginsburg, \"NeMo Inverse Text Normalization: From Development To Production\"](https://arxiv.org/pdf/2104.05055.pdf)\n", + "- [NeMo's Text Normalization Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/nemo_text_processing/intro.html) \n", + "- [NeMo's Text Normalization Deployment Documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/tools/text_processing_deployment.html)\n", + "- NeMo's [Text Normalization Tutorial](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Text_Normalization.ipynb) or [Inverse Text Normalization](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Inverse_Text_Normalization.ipynb) tutorials\n", + "- [Sparrowhawk Documentation](https://github.com/google/sparrowhawk)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For further information regarding WFSTs, please see:\n", + "- [D. Jurafsky and J. Martin, *Speech and Language Processing*, Ch. 2](https://web.stanford.edu/~jurafsky/slp3/2.pdf)\n", + "- [K. Gorman and R. Sproat, *Finite-State Text Processing*](http://www.morganclaypoolpublishers.com/catalog_Orig/product_info.php?products_id=1636)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XFdXRcnUfI25" + }, + "source": [ + "# Getting Started \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K3Zl3VwqdYqL" + }, + "source": [ + "To begin tokenizer development, make sure you have [installed NeMo from source](https://github.com/NVIDIA/NeMo)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rGg7Bf13FXgc" + }, + "source": [ + "For this tutorial, we will focus on developing an Inverse Text Normalization system, such as one you may encounter in downstream ASR processing. 
As such, we will navigate to\n", + "`nemo_text_processing/inverse_text_normalization` and create a directory for our target language (French) and subdirectories\n", + "for `taggers` and `verbalizers`. You may also wish to create a `data` subdirectory to ease navigation.\n", + "\n", + "(Note, for text normalization, the suggested directory structure would be the same within the `nemo_text_processing/text_normalization` folder. In fact, many of NeMo's grammars actively share between.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "T58E4pU4FN3A" + }, + "outputs": [], + "source": [ + "LANGUAGE= \"MY_LANGUAGE\" # Change this to your desired language" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "_PAyEPhaFpCD", + "outputId": "23d034d2-8c93-4e8b-e3ce-5ba9e870f82d" + }, + "outputs": [], + "source": [ + "%cd nemo_text_processing/inverse_text_normalization/\n", + "!mkdir {LANGUAGE}\n", + "!mkdir \"{LANGUAGE}/taggers\"\n", + "!mkdir \"{LANGUAGE}/verbalizers\"\n", + "!mkdir \"{LANGUAGE}/data\"\n", + "%cd {LANGUAGE}\n", + "!pwd && ls" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Dependencies" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O1vfz-bUFpwz" + }, + "source": [ + "All WFSTs deployed in NeMo inherit from the `GraphFst` class.\n", + "While in most cases you can simply import from a pre-existing `graph_utils.py`, you may occasionally find it helpful for deployment to keep a copy \n", + "in your working directory for language specific edits. (For our purposes, we will be utilizing `nemo_text_processing.text_normalization.en.graph_utils`, which serves as default for NeMo's grammars.)\n", + "\n", + "You may also wish to keep a copy of `utils.py` (found in each language system's directory)\n", + "in your working directory to assist with pathing. 
(Make sure to adjust the imports towards your language.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "3OME84EmOQ4h", + "outputId": "6eea17f9-aae9-4176-ae35-3d1f0e94b4ea" + }, + "outputs": [], + "source": [ + "!cp ../../text_normalization/en/graph_utils.py .\n", + "!cp ../../text_normalization/en/utils.py .\n", + "! cd ../../.." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "For development, we utilize `nemo_text_processing` and `pynini` (a Python library for efficient WFST construction and traversal, installed with `NeMo-toolkit` by default). " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "While this tutorial will attempt to make use of `pynini` tools transparent, it does assume some familiarity with its syntax. For a more in-depth guide, the following will provide a function overview:\n", + "\n", + "- [K. Gorman, Pynini: A Python library for weighted finite-state grammar compilation](https://aclanthology.org/W16-2409.pdf)\n", + "- [K. Gorman, Pynini Tutorial](http://wellformedness.com/courses/pynini/)\n", + "- [Pynini Documentation](https://www.openfst.org/twiki/bin/view/GRM/PyniniDocs) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will also import the `pynutil` module for access to some extra functionality, along with writing a simple helper function for printing `pynini` graphs through the previously discussed 'shortest-path' heuristic." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sz18Ui8-8Kf4" + }, + "outputs": [], + "source": [ + "from pynini.lib import pynutil\n", + "\n", + "def apply_fst(text, fst):\n", + "    \"\"\" Given a string input, prints the output string\n", + "    produced by traversing the path with lowest weight.\n", + "    If no valid path accepts input string, prints an\n", + "    error.\n", + "    \"\"\"\n", + "    try:\n", + "        print(pynini.shortestpath(text @ fst).string())\n", + "    except pynini.FstOpError:\n", + "        print(f\"Error: No valid output with given input: '{text}'\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Cardinal WFST " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rOyLZb9DgLoh" + }, + "source": [ + "The vast majority of ITN tasks require the ability to recognize and denormalize numbers. As such, we will begin with developing a Classifier and Verbalizer for Cardinal (integer) numbers. (e.g. `-3,-2,-1,0,1,2,3,4,5....99,100,101...`)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9GZQkH1V89kh" + }, + "source": [ + "## Grammar" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We will begin by first constructing a Cardinal WFST, using French as an example language. While your target language will obviously differ greatly from our example, you will likely find several similarities, such as:\n", + "- Use of a (semi) regular decimal (base-10) counting system. (A common - but not universal - feature of natural languages.)\n", + "- Incorporation of several irregularities requiring contingencies in our WFST construction. (e.g. a pseudo vigesimal (base-20) series.)\n", + "- Use of gender and number agreement in enumeration." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Digits" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NzJ2DIwc_TT3" + }, + "source": [ + "We shall begin with the first decimal place. As these numbers serve as the building blocks for the rest of our WFST, we shall begin by explicitly calling their WFST mappings with `pynini.string_map`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "u0H4qg4BjYfB" + }, + "outputs": [], + "source": [ + "zero = pynini.string_map([(\"zéro\",\"0\")]) # French only pronounces zeroes as stand alone\n", + "digits = pynini.string_map([ # pynini function that creates explicit input-output mappings for a WFST\n", + "\t\t\t\t(\"un\",\"1\"),\n", + "\t\t\t\t(\"une\",\"1\"),\n", + "\t\t\t\t(\"deux\",\"2\"),\n", + "\t\t\t\t(\"trois\",\"3\"),\n", + "\t\t\t\t(\"quatre\",\"4\"),\n", + "\t\t\t\t(\"cinq\",\"5\"),\n", + "\t\t\t\t(\"six\",\"6\"),\n", + "\t\t\t\t(\"sept\",\"7\"),\n", + "\t\t\t\t(\"huit\",\"8\"),\n", + "\t\t\t\t(\"neuf\",\"9\")\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0nHjY-NNjdWQ" + }, + "source": [ + "We may also simply write a `tsv` file in a separate data folder " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "- zéro\t0\n", + "- un\t1\n", + "- une\t1\n", + "- deux\t2\n", + "- trois\t3\n", + "- quatre\t4\n", + "- cinq\t5\n", + "- six\t6\n", + "- sept\t7\n", + "- huit\t8\n", + "- neuf\t9" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xicKcZLEzQTg" + }, + "source": [ + "and import with `string_file`" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "`digits = pynini.string_file(\"data/digits.tsv\")`\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "If utils.py is in working directory you may also use `get_abs_path`, which will always call paths relative to your {LANGUAGE} directory:\n", + "\n", + "`from 
nemo_text_processing.inverse_text_normalization.{LANGUAGE}.utils import get_abs_path`\n", + "\n", + "`digits = pynini.string_file(get_abs_path(\"data/digits.tsv\"))`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yPccmicQkYAB" + }, + "source": [ + "While we will use `string_map` throughout this tutorial, please note that NeMo employs the latter option for maintainability and recommends its use instead." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Teens" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FQJiJcVMrNmC" + }, + "source": [ + "Let us consider our next set of numbers:\n", + "- 10 - dix\n", + "- 11 - onze\n", + "- 12 - douze\n", + "- 13 - treize\n", + "- 14 - quatorze\n", + "- 15 - quinze\n", + "- 16 - seize\n", + "- 17 - dix-sept\n", + "- 18 - dix-huit\n", + "- 19 - dix-neuf\n", + "\n", + "Like before, we can simply use `string_map` to compose a WFST for them. But note how there is some redundancy in the number set: `17`, `18`, and `19` are all of the form `dix + digit`. It would be more efficient to reuse our prior WFST in these cases than to create new arcs, states, and weights. \n", + "\n", + "We can achieve this using pynini's string concatenation function to extend the accepted input strings. First we will create a WFST for `11-16`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "orSgBwyXsfY5" + }, + "outputs": [], + "source": [ + "teens = pynini.string_map([\n", + "\t\t\t\t(\"onze\",\"11\"),\n", + "\t\t\t\t(\"douze\",\"12\"),\n", + "\t\t\t\t(\"treize\",\"13\"),\n", + "\t\t\t\t(\"quatorze\",\"14\"),\n", + "\t\t\t\t(\"quinze\",\"15\"),\n", + "\t\t\t\t(\"seize\",\"16\"),\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "s1yIgigdtriQ" + }, + "source": [ + "Now, we will create a `tens` WFST that is responsible for mapping all instances of \"dix\" and concatenate it (accomplished with the overloaded `+` operator) with the prior `digits` WFST. (Any hyphen in between is deleted by a small `delete_hyphen` WFST.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CzwZrFCkt87W" + }, + "outputs": [], + "source": [ + "tens = pynini.string_map([(\"dix\", \"1\")])\n", + "delete_hyphen = pynini.closure(pynutil.delete(\"-\"), 0, 1) # Applies the deletion zero or one times. Equivalent to regex /-?/\n", + "\n", + "graph_tens = tens + delete_hyphen + digits" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2knCwybmuTDn" + }, + "source": [ + "We now can combine the `teens` and `graph_tens` WFSTs through the union operation (done with the overloaded `|` operator), allowing our choice of either graph." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WIRJ4PE7uRrl" + }, + "outputs": [], + "source": [ + "graph_tens_and_teens = graph_tens | teens" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TGkzKoeuxbeA" + }, + "source": [ + "Let's see if it works through the string function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "v2iD0_HnxdUV", + "outputId": "1d8f434f-ff8a-4c85-b8d0-1127e4587ddf" + }, + "outputs": [], + "source": [ + "apply_fst(\"dix-huit\", graph_tens_and_teens)\n", + "apply_fst(\"seize\", graph_tens_and_teens)\n", + "apply_fst(\"dix\", graph_tens_and_teens)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Yh2f-3rux8_2" + }, + "source": [ + "The first two worked, but why did we get an error with \"dix\"? If you look back, you'll notice that while our graph has a mapping from \"dix\" to `1` - the concatenation with `digits` makes the assumption that some input from those strings will follow. That is, we left no opportunity for an *omission* of `digits`.\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OM_eJYlV1UVp" + }, + "source": [ + "![dix_to_digits.png](images/dix_to_digits.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "M4xCMKRA1Wzw" + }, + "source": [ + "You may also note that this issue would hold also if we wanted to normalize only digits - our graph would error out since it's expecting a `tens` or input first. \n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XJHnlJCm1dPv" + }, + "source": [ + "We can fix both of these problems by allowing an option to simply insert a zero without any extra input. 
(Much like our \"cent\" example.)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9_vvJ9Bl1dYQ" + }, + "source": [ + "![dix_to_digits_with_insert.png](images/dix_to_digits_with_insert.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hJq3uoMN2OcC" + }, + "source": [ + "This may be accomplished through use of the `pynutil.insert` function:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7h9xuNfA081P" + }, + "outputs": [], + "source": [ + "graph_digits = digits | pynutil.insert(\"0\") # inserts zero if no digit follows" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fA_L_6Ky2SHm" + }, + "source": [ + "And for `graph_tens`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jelVA81o2RXu" + }, + "outputs": [], + "source": [ + "tens = tens | pynutil.insert(\"0\") | tens + delete_hyphen\n", + "graph_tens = tens + graph_digits" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Gb5uhpGr3I4X" + }, + "source": [ + "Bringing everything together:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bLkDddkA3Stu" + }, + "outputs": [], + "source": [ + "graph_teens_and_tens = graph_tens | teens\n", + "graph_all = graph_teens_and_tens | zero " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DESDKScv3r3P" + }, + "source": [ + "Let us now check our tests:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7wrDNXuD3oh9", + "outputId": "661d2526-5aa0-4640-9285-bca15cd56c75" + }, + "outputs": [], + "source": [ + "apply_fst(\"dix-huit\", graph_all) \n", + "apply_fst(\"seize\" , graph_all)\n", + "apply_fst(\"dix\" , graph_all) \n", + "apply_fst(\"une\" , graph_all) \n", + "apply_fst(\"trois\" , graph_all) \n", + "apply_fst(\"quatre\" , graph_all) \n", + "apply_fst(\"zéro\" , graph_all)" + ] + 
}, + { + "cell_type": "markdown", + "metadata": { + "id": "Tz_k3NoB66Bv" + }, + "source": [ + "Now we have no more errors - albeit at the cost of leading zeroes. (We will take care of this later in the section.)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Tens" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2dJZAhE57an3" + }, + "source": [ + "Now that we've taken care of the teens, we can proceed with the rest of the tens. Like many languages, French employs a (fairly) regular schema of: `tens_digit + ones_digit` for 20-100. Indeed, we can summarize 20-69 in the following template:\n", + "\n", + "- 20 - vingt\n", + "- 21 - vingt-et-un\n", + "- 22 - vingt-deux\n", + "- 23 - vingt-trois\n", + "- 24 - vingt-quatre\n", + "- 25 - vingt-cinq\n", + "- 26 - vingt-six\n", + "- 27 - vingt-sept\n", + "- 28 - vingt-huit\n", + "- 29 - vingt-neuf\n", + "- 30 - trente\n", + "- 31 - trente-et-un\n", + "- 32 - trente-deux\n", + "- 33 - trente-trois\n", + "...\n", + "- 40 - quarante\n", + "...\n", + "- 50 - cinquante\n", + "...\n", + "- 60 - soixante\n", + "..." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "BuaxVG35UKcs" + }, + "source": [ + "Expanding `tens` to accommodate this template is fairly easy: we simply extend our earlier `string_map` for the new terms in the 'tens place.' 
From there, we once again concatenate the `digits` WFST (along with a simple WFST to delete the occurrence of the \"-et-\" term that occasionally occurs.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qAnXlRkR32wt" + }, + "outputs": [], + "source": [ + "tens = pynini.string_map([\n", + "\t\t\t\t(\"dix\", \"1\"),\n", + "\t\t\t\t(\"vingt\",\"2\"),\n", + "\t\t\t\t(\"trente\",\"3\"),\n", + "\t\t\t\t(\"quarante\",\"4\"),\n", + "\t\t\t\t(\"cinquante\",\"5\"),\n", + "\t\t\t\t(\"soixante\",\"6\"),\n", + "\t\t])\n", + "\n", + "graph_et = pynutil.delete(\"-et-\")\n", + "\n", + "tens = tens | pynutil.insert(\"0\") | tens + pynutil.delete(\"-\") | tens + graph_et\n", + "\n", + "graph_tens = tens + graph_digits\n", + "graph_teens_and_tens = graph_tens | teens\n", + "graph_all = graph_teens_and_tens | zero " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-hJwqPDx8I2R" + }, + "source": [ + "#### Special Cases: 70-99" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zvBLvJdY9XPA" + }, + "source": [ + "However, things get tricky once we go beyond the 60s. Here, standard French possesses a notorious pseudo-vigesimal (base-20) system. 
For numbers 70-99:\n", + "\n", + "- 70 - soixante-dix <- Literally in English: \"sixty-ten\"\n", + "- 71 - soixante-et-onze <- Literally in English: \"sixty-and-eleven\"\n", + "- 72 - soixante-douze\n", + "- 73 - soixante-treize\n", + "- 74 - soixante-quatorze\n", + "- 75 - soixante-quinze\n", + "- 76 - soixante-seize\n", + "- 77 - soixante-dix-sept\n", + "- 78 - soixante-dix-huit\n", + "- 79 - soixante-dix-neuf\n", + "- 80 - quatre-vingts <- Literally in English: \"four-twenties\"\n", + "- 81 - quatre-vingt-un\n", + "- 82 - quatre-vingt-deux\n", + "- 83 - quatre-vingt-trois\n", + "- 84 - quatre-vingt-quatre\n", + "- 85 - quatre-vingt-cinq\n", + "- 86 - quatre-vingt-six\n", + "- 87 - quatre-vingt-sept\n", + "- 88 - quatre-vingt-huit\n", + "- 89 - quatre-vingt-neuf\n", + "- 90 - quatre-vingt-dix <- Literally in English: \"four-twenties-ten\"\n", + "- 91 - quatre-vingt-onze\n", + "- 92 - quatre-vingt-douze\n", + "- 93 - quatre-vingt-treize\n", + "- 94 - quatre-vingt-quatorze\n", + "- 95 - quatre-vingt-quinze\n", + "- 96 - quatre-vingt-seize\n", + "- 97 - quatre-vingt-dix-sept\n", + "- 98 - quatre-vingt-dix-huit\n", + "- 99 - quatre-vingt-dix-neuf" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HQNiwFDyVV_3" + }, + "source": [ + "As before, we want to take advantage of as much redundancy as we can without creating additional ambiguities that will impede graph traversal. \n", + "\n", + "We first note that - despite repeating prior words - \"quatre-vingt\" can be mapped to `8` without introducing ambiguity. This is because, despite \"quatre\" and \"vingt\" being present in our prior graphs, our WFST has no pathing for them in this exact order. As such, we can simply add it to `tens` and immediately improve our coverage for 81-89. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AvJqaHhE9Wbd" + }, + "outputs": [], + "source": [ + "tens = pynini.string_map([\n", + "\t\t\t\t(\"dix\", \"1\"),\n", + "\t\t\t\t(\"vingt\",\"2\"),\n", + "\t\t\t\t(\"trente\",\"3\"),\n", + "\t\t\t\t(\"quarante\",\"4\"),\n", + "\t\t\t\t(\"cinquante\",\"5\"),\n", + "\t\t\t\t(\"soixante\",\"6\"),\n", + " (\"quatre-vingt\", \"8\")\n", + "\t\t])\n", + "tens = tens | pynutil.insert(\"0\") | tens + delete_hyphen | tens + graph_et\n", + "graph_tens = tens + graph_digits\n", + "graph_teens_and_tens = graph_tens | teens\n", + "graph_all = graph_teens_and_tens | zero " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0_DtcpZxZTzX" + }, + "source": [ + "Of course, now we permit the occurrence of:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "V2leANlDhCvj", + "outputId": "db8d5d02-c848-4e50-df23-d8499538281c" + }, + "outputs": [], + "source": [ + "apply_fst(\"quatre-vingt\", graph_all)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_X_ef3sihCHH" + }, + "source": [ + "which is invalid (French uses the plural \"quatre-vingt**s**\" here.) " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vgKT903Y6rIQ" + }, + "source": [ + "Should we alter the grammar because of this? Such a decision will largely be dependent on your intended implementation and design aims. If you see the question of 'legal' tokens as a responsibility of your upstream model, then there is no need for any alteration: \"quatre-vingt\" as a standalone token will simply not occur, so there is no input to be concerned with.\n", + "\n", + "However, if your ITN grammars are developed for an environment with low-fidelity ASR and/or where mistaken transcriptions incur heavy loss (e.g. 
ASR for driving directions, telephone-numbers, banking) then you may wish to err on the side of caution." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Hf_FghLT7jdY" + }, + "source": [ + "If we wanted to go for the latter, we would want to mark that \"quatre-vingts\" maps **only** to `80`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "JliFTF3mZSsJ" + }, + "outputs": [], + "source": [ + "quatre_vingt_plural = pynini.string_map([\n", + " (\"quatre-vingts\", \"80\")\n", + "\t\t])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "81_b3XPbicT1" + }, + "source": [ + "And that \"quatre vingt\" can only accompany non-zero digits:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "E4_dmg6uin2j" + }, + "outputs": [], + "source": [ + "quatre_vingt_singular = pynini.string_map([\n", + " (\"quatre-vingt-\", \"8\") # Note that the hyphen can be assumed now\n", + "\t\t])\n", + "graph_digits_without_zero = pynini.string_map([\n", + "\t\t\t\t(\"un\",\"1\"),\n", + "\t\t\t\t(\"une\",\"1\"),\n", + "\t\t\t\t(\"deux\",\"2\"),\n", + "\t\t\t\t(\"trois\",\"3\"),\n", + "\t\t\t\t(\"quatre\",\"4\"),\n", + "\t\t\t\t(\"cinq\",\"5\"),\n", + "\t\t\t\t(\"six\",\"6\"),\n", + "\t\t\t\t(\"sept\",\"7\"),\n", + "\t\t\t\t(\"huit\",\"8\"),\n", + "\t\t\t\t(\"neuf\",\"9\")\n", + "])\n", + "graph_eighties = (quatre_vingt_singular + graph_digits_without_zero) | quatre_vingt_plural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mL7jpekV8VgP" + }, + "source": [ + "For the `70`'s and `90`'s, we would likewise need to form exclusive configurations for their number series, rewriting digits to recognize \"onze\", \"douze\", \"treize\"... as `1,2,3....` (Note, we'll have to separate `71` and `91` to manage \"soixante-**et**-onze\" vs. 
\"quatre-vingt-onze\".)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "y3dYkwK29zCX" + }, + "outputs": [], + "source": [ + "seventy_and_ninety = pynini.string_map([\n", + " (\"soixante-dix\", \"70\"),\n", + " (\"quatre-vingt-dix\", \"90\"),\n", + "\t\t])\n", + "\n", + "seventy_and_ninety_tens = pynini.string_map([\n", + " (\"soixante-\", \"7\"),\n", + " (\"quatre-vingt-\", \"9\"),\n", + "\t\t])\n", + "\n", + "seventy_and_ninety_one = pynini.string_map([\n", + " (\"soixante-et-onze\", \"71\"),\n", + " (\"quatre-vingt-onze\", \"91\"),\n", + "\t\t])\n", + "\n", + "seventy_and_ninety_digits = pynini.string_map([ \n", + "\t\t\t\t(\"douze\",\"2\"),\n", + "\t\t\t\t(\"treize\",\"3\"),\n", + "\t\t\t\t(\"quatorze\",\"4\"),\n", + "\t\t\t\t(\"quinze\",\"5\"),\n", + "\t\t\t\t(\"seize\",\"6\"),\n", + "\t\t\t\t(\"dix-sept\",\"7\"), # 77-79 and 97-99 are built from dix plus the regular digit.\n", + "\t\t\t\t(\"dix-huit\",\"8\"),\n", + "\t\t\t\t(\"dix-neuf\",\"9\")\n", + "])\n", + "\n", + "graph_seventies_and_nineties = (seventy_and_ninety_tens + seventy_and_ninety_digits) | seventy_and_ninety | seventy_and_ninety_one " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4NCrCwEH9HVg" + }, + "source": [ + "Now we union them with our original `tens` series:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "psGCgxaH-btn" + }, + "outputs": [], + "source": [ + "tens = pynini.string_map([\n", + "\t\t\t\t(\"dix\", \"1\"),\n", + "\t\t\t\t(\"vingt\",\"2\"),\n", + "\t\t\t\t(\"trente\",\"3\"),\n", + "\t\t\t\t(\"quarante\",\"4\"),\n", + "\t\t\t\t(\"cinquante\",\"5\"),\n", + "\t\t\t\t(\"soixante\",\"6\"),\n", + "\t\t])\n", + "tens = tens | pynutil.insert(\"0\") | tens + delete_hyphen | tens + graph_et\n", + "\n", + "graph_tens = tens + graph_digits\n", + "graph_tens_with_special_cases = graph_tens | graph_seventies_and_nineties | graph_eighties\n", + "graph_teens_and_tens = graph_tens_with_special_cases | 
teens\n", + "graph_all = graph_teens_and_tens | zero " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xWjSAGRX_s0H" + }, + "source": [ + "Making sure test cases work:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "kapWmgos-xcn", + "outputId": "5e9c6f5c-1450-495f-cadf-2945355b651c" + }, + "outputs": [], + "source": [ + "apply_fst(\"quatre-vingt-treize\" , graph_all)\n", + "apply_fst(\"quatre-vingts\", graph_all)\n", + "apply_fst(\"quatre-vingt-deux\", graph_all)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hNUepfKZ_vS_" + }, + "source": [ + "And the other cases fail as expected:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "wo2pCOXGAgYn", + "outputId": "0bbe2792-8bc9-40f7-dd28-4745bd1390e3" + }, + "outputs": [], + "source": [ + "apply_fst(\"quatre-vingt\", graph_all)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4VPuCTTtigh-" + }, + "source": [ + "Of course, there are other ways we could have reconfigured the grammar: we could simply make specific graphs for multiples of ten (`10,20,30..`) and all cases where \"-et-\" occurs (`21,31,41,51...91`). \n", + "\n", + "But this ignores a more important question: was any of this necessary in the first place? All these extra grammars did was simply expand coverage for thirty additional cardinals. And they still didn't exclude all faulty inputs! 
Note the following cases:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KICvpeewCFyH", + "outputId": "174dd910-7329-4a5f-a5b0-5e796a174217" + }, + "outputs": [], + "source": [ + "apply_fst(\"dix-une\", graph_all) # supposed to be \"onze\"\n", + "apply_fst(\"dix-deux\", graph_all) # supposed to be \"douze\"\n", + "apply_fst(\"vingt-un\", graph_all) # supposed to be \"vingt-et-un\"\n", + "apply_fst(\"trente-un\", graph_all) # supposed to be \"trente-et-un\"" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0D130jIVCLp2" + }, + "source": [ + "We *still* need to address possible edge cases!\n", + "\n", + "All of this is to say that knowing your input domain before construction is imperative, as small decisions can easily determine your output range later down the line.\n", + "\n", + "Indeed, if you're particularly concerned with limiting input possibilities, it may be valid simply to write all unique options within a `string_map`. While a tad inelegant, it certainly assists in controlling your outputs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "RSp9w5ayA9ii" + }, + "outputs": [], + "source": [ + "graph_tens_special = pynini.string_map([\n", + "\t\t\t\t(\"soixante-dix\", \"70\"),\n", + "\t\t\t\t(\"soixante-et-onze\",\"71\"),\n", + " (\"soixante-douze\",\"72\"),\n", + "\t\t\t\t(\"soixante-treize\",\"73\"),\n", + "\t\t\t\t(\"soixante-quatorze\",\"74\"),\n", + "\t\t\t\t(\"soixante-quinze\",\"75\"),\n", + "\t\t\t\t(\"soixante-seize\",\"76\"),\n", + " (\"soixante-dix-sept\",\"77\"),\n", + " (\"soixante-dix-huit\",\"78\"),\n", + "\t\t\t\t(\"soixante-dix-neuf\",\"79\"),\n", + " (\"quatre-vingts\", \"80\"),\n", + " (\"quatre-vingt-un\", \"81\"),\n", + " (\"quatre-vingt-une\", \"81\"),\n", + "\t\t\t\t(\"quatre-vingt-deux\",\"82\"),\n", + " (\"quatre-vingt-trois\",\"83\"),\n", + " (\"quatre-vingt-quatre\",\"84\"),\n", + " (\"quatre-vingt-cinq\",\"85\"),\n", + " (\"quatre-vingt-six\",\"86\"),\n", + " (\"quatre-vingt-sept\",\"87\"),\n", + " (\"quatre-vingt-huit\",\"88\"),\n", + " (\"quatre-vingt-neuf\",\"89\"),\n", + " (\"quatre-vingt-dix\",\"90\"),\n", + " (\"quatre-vingt-onze\",\"91\"),\n", + " (\"quatre-vingt-douze\",\"92\"),\n", + " (\"quatre-vingt-treize\",\"93\"),\n", + " (\"quatre-vingt-quatorze\",\"94\"),\n", + " (\"quatre-vingt-quinze\",\"95\"),\n", + " (\"quatre-vingt-seize\",\"96\"),\n", + " (\"quatre-vingt-dix-sept\",\"97\"),\n", + " (\"quatre-vingt-dix-huit\",\"98\"),\n", + " (\"quatre-vingt-dix-neuf\",\"99\"),])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NUPs1qOUg-hE" + }, + "source": [ + "Which is more efficient? Once again, it is dependent on your language and implementation. 
If we simply compare each graph by its number of states:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "sQ9GsIkNzxsU", + "outputId": "d70ca927-9c43-4f49-846c-c181e725e011" + }, + "outputs": [], + "source": [ + "constructed_version = (graph_seventies_and_nineties | graph_eighties)\n", + "constructed_version.num_states()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Xsgdu5TYx09_", + "outputId": "5812912f-883b-42e8-afbf-3ec4a0170345" + }, + "outputs": [], + "source": [ + "string_map_version = graph_tens_special\n", + "string_map_version.num_states()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9jzn_U7s0Sit" + }, + "source": [ + "We see that their numbers of states (graph vertices) are almost equal. Yet, if we use `pynini.optimize` - a method that calls a suite of WFST minimization algorithms: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "7YtqhOY90iF0", + "outputId": "26f0f51b-b00d-4f5a-9b2f-330c9812666a" + }, + "outputs": [], + "source": [ + "constructed_version.optimize()\n", + "constructed_version.num_states()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "y93SqnOf0qa8", + "outputId": "74efcbfa-a272-4fc6-e36e-f1e31c6df221" + }, + "outputs": [], + "source": [ + "string_map_version.optimize()\n", + "string_map_version.num_states()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2cTdQj9L0xhl" + }, + "source": [ + "We see that the latter has significantly more graph vertices. \n", + "\n", + "So the decision will be dependent on your ITN needs, language, concern with efficiency, and design philosophy. 
Further, even decisions of language dialect will have an influence. \n", + "(e.g. Belgian, Canadian, and Swiss dialects of French will dispense with elements of the vigesimal system for the decimal schema.)\n", + "\n", + "**N.B.** *For reference: while `nemo_text_processing` grammars aim to minimize invalid productions, they assume input tokens are valid strings for a target language. (e.g. The mapping of \"quatre-vingt\" to `80` is permitted since it is not likely to occur in a valid French string.)* " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V1djCnvY3CjW" + }, + "source": [ + "For more information on optimization algorithms for WFSTs, please see:\n", + "\n", + "- [M. Mohri, \"Generic epsilon-removal and input epsilon-normalization algorithms for weighted transducers\"](https://cs.nyu.edu/~mohri/pub/ijfcs.pdf)\n", + "- [M. Mohri, \"Weighted automata algorithms\"](https://cs.nyu.edu/~mohri/pub/hwa.pdf)\n", + "- [K. Thompson, \"Programming techniques: regular expression search algorithm\"](http://www.oilshell.org/archive/Thompson-1968.pdf)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Hundreds\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dqPUdVBbi6gU" + }, + "source": [ + "\n", + "Moving on to the case of three digit cardinals (\"hundreds\"), it is likely that your chosen language becomes more regular in its schema. 
For instance, practically all French numbers `>100` obey the following:\n", + "\n", + "- `digit_from_1_to_9 + word_for_hundred + digit_from_1_to_99`\n", + "\n", + "For example:\n", + "- `203` - \"deux-cent-trois\"\n", + "- `530` - \"cinq-cent-trente\"\n", + "- `880` - \"huit-cent-quatre-vingts\"\n", + "\n", + "As such, we can write a simple `hundreds` WFST as:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "lOt-gc-FiF-X" + }, + "outputs": [], + "source": [ + "hundreds = graph_digits + delete_hyphen + pynutil.delete(\"cent\") + delete_hyphen + graph_all" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "Fyn1uL_NoEiz", + "outputId": "d491680b-1b3e-4762-8470-497833b82b0e" + }, + "outputs": [], + "source": [ + "apply_fst(\"deux-cent-trois\", hundreds)\n", + "apply_fst(\"huit-cent-quatre-vingts\", hundreds)\n", + "apply_fst(\"cinq-cent-trente\" , hundreds) " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qDjq_KfnoD5C" + }, + "source": [ + "Indeed, the use of French only presents two complications:\n", + "- French uses *only* the word \"cent\" for `100`. (Instead of \"un cent\".)\n", + "- 'Pure' multiples of a hundred (`200,300,400....`) use the plural \"cents\".\n", + "\n", + "The second one is the easier of the two so let's start there. There are actually two options open to us. First, we could treat \"cents\" the same way as we did \"cent\" in the base case and simply delete it. From there, the lack of any following inputs will allow the WFST to insert the trailing zeroes as appropriate." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "m2F-sumbxqLE" + }, + "outputs": [], + "source": [ + "cents = pynini.accep(\"cent\") | pynini.accep(\"cents\") # Creates a Finite State (Accep)tor, mapping inputs back to themselves\n", + "hundreds = graph_digits + delete_hyphen + pynutil.delete(cents) + delete_hyphen + graph_all" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VisQu_Etx-QB" + }, + "source": [ + "Or we can use it as a cue to 'shortcut' the WFST to immediately insert zeroes." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VspiTN5Vxxjl" + }, + "outputs": [], + "source": [ + "graph_cents = pynini.cross(\"cents\", \"00\") # Creates a single input-output mapping\n", + "hundreds = graph_digits + delete_hyphen + ((pynutil.delete(\"cent\") + delete_hyphen + graph_all) | graph_cents)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "meVn5BiyyX5v" + }, + "source": [ + "For the case of solitary \"cent\", we need to make sure our output is `1` only in the case that no digit precedes the occurrence. Here we need to be confident in the structure of our WFST and that any possible ambiguity has been dealt with by this point. 
(Something to keep in mind as we move to the thousands.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "277Z-zLWyWAf" + }, + "outputs": [], + "source": [ + "graph_cent = pynini.cross(\"cent\", \"1\")\n", + "graph_hundreds_first_digit = (graph_digits + delete_hyphen + pynutil.delete(cents)) | graph_cent\n", + "graph_hundreds = graph_hundreds_first_digit + delete_hyphen + graph_all" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "FNZlJsvS_Yvt", + "outputId": "e85ae561-e7a1-4b6a-e394-f0194fdb89e7" + }, + "outputs": [], + "source": [ + "apply_fst(\"trois-cents\", graph_hundreds) \n", + "apply_fst(\"cent\", graph_hundreds)\n", + "apply_fst(\"cent-trois\", graph_hundreds) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Thousands" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e7Dy5slLzp-K" + }, + "source": [ + "For quite a few languages, managing the WFST for the thousands place is the last aspect to figure out, as the higher powers of ten reuse the same schema. (For those working with counting systems that reserve special terms for \"ten-thousand\" (e.g. Chinese-derived counting systems), you may need to extend unique coverage to the next power of ten.)\n", + "\n", + "For French, the question of thousands is rather simple: `digits_from_1_to_999 + mille + digits_from_1_to_999`\n", + "\n", + "The only exception is that any expression of one thousand drops the leading digit. 
\n", + "- `1,000` -> \"mille\"\n", + "- `1,001` -> \"mille-un\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AvsnAAiPzlu_" + }, + "outputs": [], + "source": [ + "graph_one_thousand = pynini.cross(\"mille\", \"1\")\n", + "graph_many_thousand = graph_hundreds + delete_hyphen + pynutil.delete(\"mille\")\n", + "\n", + "graph_thousands = (graph_one_thousand | graph_many_thousand) + delete_hyphen + graph_hundreds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "i3m9TG7Y4tkl", + "outputId": "d3f1f81d-c463-4934-9df7-3b8f2b67798f" + }, + "outputs": [], + "source": [ + "apply_fst(\"cent-mille-deux-cents\", graph_thousands)\n", + "apply_fst(\"deux-cent-mille-deux-cents\", graph_thousands)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NoevSTZGGT17" + }, + "source": [ + "### Weighting" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "A2gcVIZM0-iv" + }, + "source": [ + "Question: will this cover all our grammar so far? (Hint: what assumptions were made about \"cent\"/\"cents\"?)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "cCFtPhr1BjAc", + "outputId": "048e0d93-a4a8-4f4e-d461-bfd70e911aff" + }, + "outputs": [], + "source": [ + "apply_fst(\"deux-mille-un\", graph_thousands)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Ne-7L9Cd4t-8" + }, + "source": [ + "Once again, we need to introduce the possibility of the prior power of ten not occurring in the string. There must be an option for simply inserting a string of `0` in place of the omitted \"cent\"." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iockqXdn-aG4" + }, + "source": [ + "Further, we want to be careful with how cavalier we have been with insertions. 
Consider the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "bxJlSnj2-Xw3", + "outputId": "6722e5ef-8a7f-43e1-84fe-b3f5f18307e1" + }, + "outputs": [], + "source": [ + "apply_fst(\"mille-cent-un\", graph_thousands) # Should be 1101\n", + "apply_fst(\"mille-cent\", graph_thousands) # 1100" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fq5zEayA-kOx" + }, + "source": [ + "It appears that our WFST has developed a tendency to simply 'ignore' some of these higher powers. Let us return to our code for `graph_hundreds` and `graph_thousands`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "S2aV1KQ4-1iP" + }, + "outputs": [], + "source": [ + "graph_cents = pynini.cross(\"cents\", \"00\")\n", + "graph_cent = pynini.cross(\"cent\", \"1\")\n", + "graph_hundreds_first_digit = (graph_digits + delete_hyphen + pynutil.delete(cents)) | graph_cent\n", + "graph_hundreds = (graph_hundreds_first_digit + delete_hyphen | pynutil.insert(\"0\")) + graph_all \n", + "\n", + "graph_one_thousand = pynini.cross(\"mille\", \"1\")\n", + "graph_many_thousand = graph_hundreds + delete_hyphen + pynutil.delete(\"mille\")\n", + "graph_thousands = (graph_one_thousand | graph_many_thousand) + delete_hyphen + graph_hundreds" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9avwOIkk-9qt" + }, + "source": [ + "Recall that throughout we have provided options for simply inserting zeroes in the case of omitted numbers? That tendency has finally caught up with us. The use of our previous `graph_hundreds` in `graph_many_thousand` now allows our graph to insert a string of `0`'s without penalty. \n", + "\n", + "You may note that this is very similar to the \"cents\" example brought up at the beginning, presenting a similar solution. 
We can control this output by making it too costly to traverse unless absolutely necessary for the graph. This can be accomplished simply by appending a weight to the insertion for hundreds:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MQG3j0U8CUAQ" + }, + "outputs": [], + "source": [ + "graph_hundreds = (graph_hundreds_first_digit + delete_hyphen | pynutil.insert(\"0\", weight=.1)) + graph_all \n", + "\n", + "graph_one_thousand = pynini.cross(\"mille\", \"1\")\n", + "graph_many_thousand = graph_hundreds + delete_hyphen + pynutil.delete(\"mille\")\n", + "graph_thousands = (graph_one_thousand | graph_many_thousand) + delete_hyphen + graph_hundreds" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "KNHhrYZ7Ca58", + "outputId": "a7d07372-733d-4837-c1e9-1dc58ba2b87c" + }, + "outputs": [], + "source": [ + "apply_fst(\"mille-cent-un\", graph_thousands)\n", + "apply_fst(\"mille-cent\", graph_thousands)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "51yPEaf2EkbD" + }, + "source": [ + "Why choose a weight of `.1`? Quite simply: it's arbitrary. As mentioned earlier, the default graph in `pynini` is a tropical semiring, which uses the `min` function to select among two arcs for path traversal. Since all our paths so far are weight `0`, any positive value will ensure that it is a last option among path traversal. (Note, this conversely entails any negative weight path will be prioritized.)\n", + "\n", + "That we chose this number as a small value comes from a place of caution: the tropical semiring uses an additive function to calculate the total weight of an entire path to traverse a WFST. As our grammars can easily become massive, this means that small weights can have major impact down the line. 
Further, by constraining path weights to small values, we can have general certainty towards the maximum weight of any individual graph, allowing us to add constraints regarding maximum token length and token hierarchy. (As explained in [later sections](#classifyweights).) As such, when using weights in a localized setting, it is best to use small values to avoid unforeseen escalation. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iScKgvRxGt-B" + }, + "source": [ + "### Higher Powers\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rtHEd6OE2WSg" + }, + "source": [ + "At this point, we can propose a general heuristic for escalating to higher powers of ten: they always need a way for their absence to be accommodated in the WFST. Further, they require some weighting to prevent this absence from developing into a string of omitted values. To avoid further bumps, we'll take care of this now with `graph_thousands`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "iZMN7wcE2lH5" + }, + "outputs": [], + "source": [ + "graph_one_thousand = pynini.cross(\"mille\", \"1\")\n", + "graph_many_thousand = graph_hundreds + delete_hyphen + pynutil.delete(\"mille\")\n", + "graph_thousands = (graph_one_thousand | graph_many_thousand | pynutil.insert(\"000\", weight=.001)) + delete_hyphen + graph_hundreds" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fkc3LIH824P7" + }, + "source": [ + "\n", + "For the rest of French (and many other languages), the rest of the work is simply repeating the prior pattern for the thousands element: \n", + "`hundreds + word_for_higher_power + hundreds.....` Of course there will be some variation in this schema, but the recursion should be regular. (It is rather rare that languages appropriate unique forms for these higher counts.) 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qGnK4ARX4Nay" + }, + "source": [ + "To finish French, we can list off the following equivalent for higher powers of ten:\n", + "- `million` - \"million/millions\" \n", + "- `billion` - \"milliard/milliards\"\n", + "- `trillion` - \"billion/billions\"\n", + "\n", + "Like the \"cent/cents\" rule, these values alternate with a plural form in the case of multiples of the value. Writing them out:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "sBu7-dub4vxz" + }, + "outputs": [], + "source": [ + "millions = pynini.accep(\"million\") | pynini.accep(\"millions\")\n", + "graph_millions = ((graph_hundreds + delete_hyphen + pynutil.delete(millions) + delete_hyphen) | pynutil.insert(\"000\", weight=.1) # We need three zeroes now\n", + " ) + graph_thousands" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LmMeCHXr5Bb5" + }, + "outputs": [], + "source": [ + "billions = pynini.accep(\"milliards\") | pynini.accep(\"milliard\")\n", + "graph_billions = ((graph_hundreds + delete_hyphen + pynutil.delete(billions) + delete_hyphen)| pynutil.insert(\"000\",weight=.1) # We need three zeroes now\n", + " ) + graph_millions" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CIRIeQEg5B0J" + }, + "outputs": [], + "source": [ + "trillions = pynini.accep(\"billion\") | pynini.accep(\"billions\")\n", + "graph_trillions = ((graph_hundreds + delete_hyphen + pynutil.delete(trillions) + delete_hyphen) | pynutil.insert(\"000\",weight=.1) # We need three zeroes now\n", + " ) + graph_billions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sRNUPx-15J1v" + }, + "source": [ + "Bringing all together:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0dLOWm_B5SwQ" + }, + "outputs": [], + "source": [ + "graph = graph_trillions | zero" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "nBFE3BrN6IPR" + }, + "source": [ + "Let's try it out:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "id": "6lWwtR1S6LI4", + "outputId": "3a6740ee-9e92-4500-c2c8-965131167e58" + }, + "outputs": [], + "source": [ + "example = \"deux-cent-milliard-quatre-million-deux-cent-quatre-vingt-onze\"\n", + "apply_fst(example, graph) " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Finishing Touches" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-w3KgX6C6mff" + }, + "source": [ + "Now that we have our cardinal in place, we can take care of that stylistic issue of the leading zeroes. For this, we want to develop a 'filter' that deletes all zeroes preceding the first non-zero in the string, and leave the rest 'as is.'\n", + "\n", + "First let us create the filter by calling on `NEMO_DIGIT`- a `graph_util` WFST that only permits digits as input. With it, we'll create a WFST that will delete all leading zeroes in a string. We then compose this (using `@`) onto our original graph, creating a new graph that accepts inputs from our original but emits only the outputs of `clean_cardinal`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 290 + }, + "id": "EA4VnRe6FO-2", + "outputId": "59e412b3-a445-4172-ee64-b0f80281a167" + }, + "outputs": [], + "source": [ + "delete_leading_zeroes = pynutil.delete(pynini.closure(\"0\")) # will delete all zeroes under closure. 
Equivalent to regex * operator\n", + "stop_at_non_zero = pynini.difference(NEMO_DIGIT, \"0\") # creates a graph that accepts all input-outputs from NEMO_DIGIT except 0\n", + "rest_of_cardinal = pynini.closure(NEMO_DIGIT) # accepts all digits that may follow\n", + "\n", + "clean_cardinal = delete_leading_zeroes + stop_at_non_zero + rest_of_cardinal\n", + "clean_cardinal = clean_cardinal | \"0\" # We don't want to ignore the occurrence of zero\n", + "\n", + "graph = graph @ clean_cardinal " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "piP9nqQkHpo3" + }, + "source": [ + "Now our WFST will output our numbers as normal:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dnQ9odSpIAB7" + }, + "outputs": [], + "source": [ + "apply_fst(example, graph)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Final Notes\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p7zt8lVsK2rY" + }, + "source": [ + "We have finally formulated a grammar that will process French cardinals into a numeric representation. Of course, not every grammar you write will be for French. But several of the principles we've worked through will be invaluable in your own development. Before moving on, here's a quick summary of (almost) universal points to take away for WFST construction.\n", + "- Decide at the beginning of construction the level of constraint you wish for your grammar. Is it necessary to have a specific domain or can you rely on upstream models to narrow your input possibilities for you? \n", + "- Work iteratively upwards from the smallest place value of your numeric system. This will assist you in forming building blocks for larger values. \n", + "- Always allow for the possibility of omission of previous place values. 
(Not every number in the thousands will contain mention of the hundreds place.)\n", + "- For each place value, consider how the sub-grammar will affect the preceding and following place values. Are there exceptions that you've built into the grammar that may become problematic later on?\n", + "- Utilize weights for default insertions to limit path traversal to only final options. When doing so, use small values to avoid escalating problems in your larger grammar." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nvyHg1bQIIHD" + }, + "source": [ + "With that handled, we can move on to converting this grammar into a Classifier." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gJ1YJUvhIZwm" + }, + "source": [ + "## Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "q2L2x0crIeXQ" + }, + "source": [ + "Now that we have a grammar that will convert individual tokens into number strings, we want to focus on building it into a classifier to properly tag candidate tokens. This requires a few properties:\n", + "- It recognizes any valid token and permits traversal through the WFST graph\n", + "- Conversely, it does not allow invalid tokens to traverse the WFST graph\n", + "- It properly disambiguates overlap among ambiguous cases\n", + "- It assigns the proper attributes to a classified token\n", + "\n", + "While this seems like a lot, in practice this just means that your grammar will need a few more tweaks to improve exclusivity." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ArEYn7RWKcYI" + }, + "source": [ + "NeMo ITN performs token classification through a series of `GraphFst` classes and assumes deployment of your grammars through an object that inherits from this class. 
As such, you will need to instantiate your grammar as a `CardinalFst`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 368 + }, + "id": "GWgMSybqLqiS", + "outputId": "597c00ae-0f62-417f-888c-88c81c24a3fc" + }, + "outputs": [], + "source": [ + "class CardinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"cardinal\", kind=\"classify\")\n", + " # Rest of the grammar here\n", + " # ....... \n", + " #........." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SIE8dNQlL52G" + }, + "source": [ + "While the naming convention may vary, the `name` and `kind` properties must be set accordingly to permit Sparrowhawk integration.\n", + "\n", + "Further, the resulting graph must produce the classified token within the following format:\n", + "`token { cardinal { integer: \"DIGIT_STRING\" } }`\n", + "\n", + "This is accomplished by a series of string insertions:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "aC_c64KSNTCg" + }, + "outputs": [], + "source": [ + "class CardinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"cardinal\", kind=\"classify\")\n", + " # Rest of the grammar here\n", + " # ....... \n", + " #.........\n", + " self.fst = pynutil.insert(\"integer: \\\"\") + graph + pynutil.insert(\"\\\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AGLQxOSzOK1F" + }, + "source": [ + "Followed by a call of the parent `GraphFst.add_tokens()` method:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Jz-UXFipORps" + }, + "outputs": [], + "source": [ + "class CardinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"cardinal\", kind=\"classify\")\n", + " # Rest of the grammar here\n", + " # ....... 
\n", + " #.........\n", + " self.fst = pynutil.insert(\"integer: \\\"\") + graph + pynutil.insert(\"\\\"\")\n", + " final_graph = self.add_tokens(graph)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gh23S7BHOY0r" + }, + "source": [ + "Which will insert the appropriate formatting. Note that this formatting must be exact: a single space must follow each field name and each value must be within escaped double quotes.\n", + "\n", + "In the event that you also wish for `CardinalFst` to indicate negative values, the optional `negative: ` property may be used.\n", + "\n", + "For instance, French indicates negative values by prefacing the quantity with \"moins.\" As such:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3JbTn35cOx0k" + }, + "outputs": [], + "source": [ + "optional_minus_graph = pynini.closure(\n", + " pynutil.insert(\"negative: \") + pynini.cross(\"moins\", \"\\\"-\\\"\") + \" \", 0, 1 # Note the extra space to separate the value from the integer field\n", + ")\n", + "\n", + "final_graph = optional_minus_graph + pynutil.insert(\"integer: \\\"\") + graph + pynutil.insert(\"\\\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DCs1048v6N0K" + }, + "source": [ + "All together, your `CardinalFst` ultimately serves as a wrapper for your grammar, save with the addition of a few insertions to assist processing:\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eo6uEz1s5TJY" + }, + "outputs": [], + "source": [ + "class CardinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"cardinal\", kind=\"classify\")\n", + " \n", + " ### Cardinal Grammar....\n", + " ### .....\n", + " graph = graph_trillions | zero \n", + "\n", + " ### Formatting grammar....\n", + " ### .....\n", + " graph = graph @ clean_cardinal\n", + "\n", + " ### Token insertion\n", + " optional_minus_graph = pynini.closure(\n", + " 
pynutil.insert(\"negative: \") + pynini.cross(\"moins\", \"\\\"-\\\"\") + \" \", 0, 1\n", + " )\n", + "\n", + " final_graph = optional_minus_graph + pynutil.insert(\"integer: \\\"\") + graph + pynutil.insert(\"\\\"\")\n", + "\n", + " final_graph = self.add_tokens(final_graph) # inserts the cardinal tag\n", + "\n", + " self.fst = final_graph" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MFIMdLCoZzLK" + }, + "source": [ + "Let's see a demonstration. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4CF6Iz9NZ7R_" + }, + "outputs": [], + "source": [ + "cardinal = CardinalFst().fst\n", + "\n", + "example = \"moins deux-cent-quatre\"\n", + "\n", + "apply_fst(example, cardinal)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Verbalizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uvUqpC_Q8FSt" + }, + "source": [ + "The verbalizer can be both the most crucial and the simplest part of building each grammar. On one hand, it is the component that finalizes all of your previous work. If it is unable to properly normalize your text, everything has been for naught.\n", + "\n", + "On the other hand, your previous work has vastly limited the unpredictability of your input. Recall from our initial demonstration of the classifier-verbalizer system that an input like <> becomes:\n", + "\n", + "- `tokens { name: \"le\" }`\n", + "- `tokens { date { day: \"1\" month: \"juillet\" } }` \n", + "- `tokens { name: \"il\" }` \n", + "- `tokens { name: \"a\" }` \n", + "- `tokens { name: \"mangé\" }`\n", + "- `tokens { cardinal { integer: \"35\" } }` \n", + "- `tokens { name: \"pommes\" }`\n", + "\n", + "Part of the purpose of the two-stage setup is that the input space for each verbalizer is obvious: it's simply the name of its semiotic class. 
As such, we only need to write our grammar to recognize its class, remove tokens accordingly, and then manage the attributes of each semiotic token." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "et1GgmBuAWzY" + }, + "source": [ + "We will begin as we did with our classifier and create a class to inherit from the `GraphFst` utility class:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "NNKpgWtkAgEW" + }, + "outputs": [], + "source": [ + "class CardinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"cardinal\", kind=\"verbalize\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "OyAV39NsAqSN" + }, + "source": [ + "One of the useful aspects of the `GraphFst` utility is that it already possesses a built-in graph that will recognize and remove semiotic tokens: `delete_tokens`. As such we need only concern ourselves with managing the properties of the Cardinal class:\n", + "- `integer`\n", + "- `negative`\n", + "\n", + "Here, the desired written format of your chosen language will dictate how you proceed. For French, we have the following rules for Cardinal numbers:\n", + "- A negative sign is written before the numeral.\n", + "- Cardinal numbers representing quantities (e.g. \"mille euros\"/ \"one thousand dollars\") are written with spaces in-between every three digits. (e.g. `1 000`)\n", + "- Cardinal numbers representing place in a sequence or addresses (\"page mille\"/\"page one thousand\") are written without spacing. (`1000`)\n", + "\n", + "The first property seems easy enough to handle: write a grammar that simply removes the `negative` formatting, leaving only `-`. (Recall that our Classifier only inserted the string if it was present.) \n", + "\n", + "For the final two, we may note that our intention to develop WFSTs for the Decimal, Measure, and Money classes already will cover most desired quantities. 
As such, we can leave the issue of spacing to those instances and let the Cardinal WFST default to the non-spacing case. (Note that this will be helpful with Time, Date, Telephone, Electronic, and Ordinal classes as they will not use the spacing format either. It is usually better to reserve specific formatting rules to other classes and let the Cardinal serve as a default.)\n", + "\n", + "As such, we just need our WFST to remove the `integer` property and `negative` property (if it occurs). These can be managed through the `pynutil.delete` function, as seen in the following:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 368 + }, + "id": "6MF2I6SLU7nf", + "outputId": "0437c4af-5c96-4122-8af0-ca37723c7228" + }, + "outputs": [], + "source": [ + "class CardinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"cardinal\", kind=\"verbalize\")\n", + " \n", + " # Removes the negative attribute and leaves the sign if occurs\n", + " optional_sign = pynini.closure(\n", + " pynutil.delete(\"negative:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.accep(\"-\")\n", + " + pynutil.delete(\"\\\"\")\n", + " + delete_space,\n", + " 0,\n", + " 1,\n", + " )\n", + " \n", + " # removes integer aspect\n", + " graph = (\n", + " pynutil.delete(\"integer:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1) # Accepts at least one digit\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + " \n", + " graph = optional_sign + graph # concatenates two properties\n", + "\n", + " delete_tokens = self.delete_tokens(graph) # removes semiotic class tag\n", + "\n", + " self.fst = delete_tokens.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QSX2KlZJbRAA" + }, + "source": [ + "Let's see if it will properly render a given token:" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": { + "id": "JxaLm2k0bYIJ" + }, + "outputs": [], + "source": [ + "cardinal = CardinalFst().fst\n", + "example = 'cardinal { negative: \"-\" integer: \"204\" }'\n", + "\n", + "apply_fst(example, cardinal)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Bc0-QCBHWg-8" + }, + "source": [ + "That's it! We've now completed all aspects of our `CardinalFst` from grammar writing to Verbalization. While we still have quite a few semiotic classes left, you will find that they build off the `CardinalFst` quite easily, making progression much simpler and more straightforward.\n", + "\n", + "Before proceeding, there are a few things to note:\n", + "- `delete_tokens` is called on the completed graph, despite the token class occurring first in the tokenized string. This is because the function intersects with an initial WFST that deletes the tags. As such, the function must be passed a completed graph.\n", + "- In our initial example, all tokens were enclosed within a `token` category. Insertion and deletion of this category is managed by the main [Classifier](#tokenize-and-classify) and [Verbalizer](#verbalize-and-verbalize-final) respectively and is not a concern during individual class grammar development.\n", + "- Earlier in the tutorial we noted that NeMo ITN permutes all WFSTs unless the `preserve_order` tag is passed as part of the Classifier. This allows you to ignore possible variation in designing the verbalizer and focus on whatever form of processing is easiest for the grammar. That is, the decision to process the `negative` property before the `integer` property is not a consequence of the French language but is instead made because it is easier to write out with `pynini`. \n", + "- Conversely, if your language is completely invariant in this regard, it may be more efficient to pass `preserve_order` through the Classifier and manage the property here in the Verbalizer. 
This allows NeMo ITN to avoid building states and arcs for each permutation, reducing graph size and compiling time." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aFUrbSdJ8Wk7" + }, + "source": [ + "# Ordinal WFST " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "w1b0Z7f5Z9Ar" + }, + "source": [ + "Ordinals are the class of numbers used for enumerating order or placement of entities in a series. In some languages, they are simply derivations of cardinal numbers. For instance, English enumerates order as `first, second, third, fourth, fifth....` After the third ordinal, they become a regular pattern of `cardinal + 'th'`.\n", + "\n", + "Meanwhile, other languages may reserve specific counting systems for ordinals. For example, while Korean uses a Chinese derived counting system for several Cardinal related tasks, it uses derivations from a native counting system for ordering:\n", + "\n", + "**Cardinal**/**Ordinal** = **English**\n", + "- il/cheot-jae = \"First\"\n", + "- i/dul-jae = \"Second\"\n", + "- sam/set-jae = \"Third\"\n", + "- sa/net-jae = \"Fourth\"\n", + "- o/daseot-jae = \"Fifth\"\n", + "\n", + "If your language is of the latter variety, you will likely need to begin development of Ordinal WFST by repeating Cardinal WFST development before proceeding. (Or make it part of your previous Cardinal WFST and combine the two with a `union` operation.) While you can extend coverage to the level of Cardinal WFST, you will find most Ordinals to be sufficiently covered by only enumerating to a few hundreds. (e.g. Is it common in your language to speak of the \"one millionth\" in an order and/or write out `1,000,000th`?)\n", + "\n", + "For this portion of the tutorial, we will focus on the first type of ordinals - those that are primarily derived by altering Cardinals." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "oq_xA8NPiANw" + }, + "source": [ + "## Grammar" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lhjcQS6oiD_w" + }, + "source": [ + "Continuing with our example language, we first begin by laying out our expected inputs and pinpointing a regular pattern to guide our WFSTs. We note the following examples:\n", + "\n", + " **English = French**\n", + " - \"first\" = \"premier/première\"\n", + " - \"second\" = \"second/seconde/deuxième\"\n", + " - \"third\" = \"troisième\"\n", + " - \"fourth\" = \"quatrième\"\n", + " - \"fifth\" = \"cinquième\"\n", + " - \"sixth\" = \"sixième\"\n", + " - \"seventh\" = \"septième\"\n", + "\n", + "From our example inputs, it appears that spelling of French Ordinals follows a general format of: `cardinal + ième`. The only exceptions appear to be in the case of the first and second Ordinals - for which completely different roots appear - and the fourth and the fifth Ordinals - where the former drops the \"e\" at the end of the root (`quatre -> quatr`) and the latter appends a \"u\" (`cinq -> cinqu`). \n", + "\n", + "For the expected outputs, we observe the following examples:\n", + " - \"premier/première\" -> `1ᵉʳ/1ʳᵉ`\n", + " - \"second/seconde\" -> `2ᵈ/2ᵈᵉ`\n", + " - \"deuxième\" -> `2ᵉ`\n", + " - \"troisième\" -> `3ᵉ`\n", + " - \"quatrième\" -> `4ᵉ`\n", + " - \"cinquième\" -> `5ᵉ`\n", + " - \"sixième\" -> `6ᵉ`\n", + " - \"septième\" -> `7ᵉ`\n", + "\n", + "It appears that the output is simply the cardinal number of the root with an associated superscript. Since we have already constructed the Cardinal WFST, this means that the job of constructing an Ordinal WFST is simply a case of recognizing the cardinal root for the input and then utilizing a preconstructed Cardinal grammar to render the proper form alongside an associated superscript. 
That is, our tasks are to:\n", + "- Identify the proper superscript for the ordinal\n", + "- Change the ordinal back into a cardinal\n", + "- Use the Cardinal WFST to transform the cardinal into normalized form\n", + "- Properly render the ordinal using the normalized cardinal and proper superscript\n", + "\n", + "As information regarding the superscript will need to be conveyed through development of the Classifier, we will begin with creating the grammar necessary for rendering the ordinal as its cardinal root. \n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AOUVZhiwT7hE" + }, + "source": [ + "### Stripping Suffixes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5nw0_lOTsEik" + }, + "source": [ + "Since French forms Ordinals by appending a suffix to Cardinals, we should start by creating a WFST to remove the suffix. Assuming that our grammar processes one token at a time, this means that we just need a WFST that will accept all tokens that end with \"ième\" and then delete the suffix from that token:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Rk89LhsxsHTO" + }, + "outputs": [], + "source": [ + "strip_morpheme = pynutil.delete(\"ième\") # deletes suffix\n", + "graph_strip_morpheme = NEMO_SIGMA + strip_morpheme # accepts all strings until passed suffix, then deletes suffix" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pLg-PzdntV4N" + }, + "source": [ + "Now we have a graph that permits all characters in a word token and deletes the ordinal suffix. (Note that this also means that the graph won't accept tokens without the suffix, helping us avoid false inputs.) \n", + "\n", + "We can now compose this graph with our Cardinal WFST to strip the suffixes from ordinals and treat them as cardinals. However, recall that our `CardinalFst` also inserted its own class tag. 
Obviously, we do not want to do this here as it will disrupt the formatting of the token. Instead, we should create a new subgraph *within* the `CardinalFst` class that will only produce the cardinals without tokens." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class CardinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"cardinal\", kind=\"classify\")\n", + " \n", + " ### Cardinal Grammar....\n", + " ### .....\n", + " graph = graph_trillions | zero \n", + "\n", + " ### Formatting grammar....\n", + " ### .....\n", + " graph = graph @ clean_cardinal\n", + " \n", + " ### NEW GRAPH\n", + " self.just_cardinals = graph # will produce cardinals without formatting\n", + "\n", + " ### Token insertion\n", + " optional_minus_graph = pynini.closure(\n", + " pynutil.insert(\"negative: \") + pynini.cross(\"moins\", \"\\\"-\\\"\") + \" \", 0, 1\n", + " )\n", + "\n", + " final_graph = optional_minus_graph + pynutil.insert(\"integer: \\\"\") + graph + pynutil.insert(\"\\\"\")\n", + "\n", + " final_graph = self.add_tokens(final_graph)\n", + "\n", + " self.fst = final_graph" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now we call it for our graph:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vxDgBa4_t1nD" + }, + "outputs": [], + "source": [ + "graph_cardinal = CardinalFst().just_cardinals \n", + "graph_ordinal_regular_suffix = graph_strip_morpheme @ graph_cardinal" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hSpk5M7BuXRz" + }, + "source": [ + "Let's see if it works and gives us the desired cardinal:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7cJ7fieouY2r" + }, + "outputs": [], + "source": [ + "example = \"sixième\" # derived from six/6\n", + "apply_fst(example, graph_ordinal_regular_suffix)" + ] + }, + { + "cell_type": "markdown", + "metadata": { 
+ "id": "GtEuV7sOuxek" + }, + "source": [ + "Now we can consider the edge cases. Beyond the first and second ordinals, French exhibits irregular behavior in the following cases:\n", + "- If the cardinal root ends with an \"e\", the \"e\" is dropped before adding the suffix (e.g. \"quatrième\"). \n", + "- Cardinals ending with \"cinq\", \"neuf\", and \"dix\" change their endings to \"cinqu\", \"neuv\" , and \"diz\" before appending the suffix, respectively. \n", + "\n", + "We could start by proposing a WFST that replaces the suffix \"ième\" with \"e\" and then compose this onto the Cardinal WFST. If it is a legitimate cardinal, then there will be a path through `CardinalFst` and the integer will be rendered as normal. \n", + "\n", + "Meanwhile, the case of \"dix\", \"cinq\", and \"neuf\" would each require a distinct WFST as they are each a consequence of different rules of orthography and phonology. Like the case with \"e\", we could change each back to its root and then see if the `CardinalFst` will permit a path with the new input. 
\n", + "\n", + "It is at this point that we can do a cost-benefit analysis and realize that all these cases can be managed by an explicit `string_map/string_file`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_9KTNQeIw4sq" + }, + "outputs": [], + "source": [ + "graph_root_change = pynini.string_map([(\"quatrième\", \"quatre\"),\n", + " (\"cinquième\",\t\"cinq\"),\n", + " (\"neuvième\",\t\"neuf\"),\n", + " (\"onzième\",\t\"onze\"),\n", + " (\"douzième\",\t\"douze\"),\n", + " (\"treizième\",\t\"treize\"),\n", + " (\"quatorzième\",\t\"quatorze\"),\n", + " (\"quinzième\",\t\"quinze\"),\n", + " (\"seizième\",\t\"seize\"),\n", + " (\"trentième\",\t\"trente\"),\n", + " (\"quarantième\",\t\"quarante\"),\n", + " (\"cinquantième\",\t\"cinquante\"),\n", + " (\"soixantième\",\t\"soixante\"),\n", + " (\"millième\",\t\"mille\"),\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eo2_keFVqaY4" + }, + "source": [ + "We could then concatenate these with a WFST that accepts all tokens with these endings and then change the endings as desired. These will provide the cardinal roots just as effectively. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O7I29ezmxylx" + }, + "source": [ + "The same can be said for \"premier/première\" and \"second/seconde\":" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "3JZoz51VyGS6" + }, + "outputs": [], + "source": [ + "graph_firsts = pynini.string_map([(\"premier\", \"un\"),(\"première\", \"un\")])\n", + "graph_seconds = pynini.string_map([(\"second\", \"deux\"),(\"seconde\", \"deux\")])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NJ9BGGAwyTQ5" + }, + "source": [ + "*Note: We graph separately to manage their different superscripts later on.*\n", + "\n", + "Depending on your language of focus, the choice of implicitly reversing the root token or explicitly mapping back to root will be the most efficient, but it is worth considering both options if only to check your understanding of the language." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8PgVwDRRq9gr" + }, + "source": [ + "Putting our grammar together, we have:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ko2kAeKwrRSH" + }, + "outputs": [], + "source": [ + "strip_morpheme = pynutil.delete(\"ième\") # deletes suffix\n", + "\n", + "graph_root_change = pynini.string_map([(\"quatrième\", \"quatre\"),\n", + " (\"cinquième\",\t\"cinq\"),\n", + " (\"neuvième\",\t\"neuf\"),\n", + " (\"onzième\",\t\"onze\"),\n", + " (\"douzième\",\t\"douze\"),\n", + " (\"treizième\",\t\"treize\"),\n", + " (\"quatorzième\",\t\"quatorze\"),\n", + " (\"quinzième\",\t\"quinze\"),\n", + " (\"seizième\",\t\"seize\"),\n", + " (\"trentième\",\t\"trente\"),\n", + " (\"quarantième\",\t\"quarante\"),\n", + " (\"cinquantième\",\t\"cinquante\"),\n", + " (\"soixantième\",\t\"soixante\"),\n", + " (\"millième\",\t\"mille\"),\n", + "])\n", + "\n", + "# Component will accept all tokens that end with desired strings\n", + "graph_get_cardinal = NEMO_SIGMA + (strip_morpheme | 
graph_root_change) \n", + "\n", + "graph_firsts = pynini.string_map([(\"premier\", \"un\"),(\"première\", \"un\")])\n", + "graph_seconds = pynini.string_map([(\"second\", \"deux\"),(\"seconde\", \"deux\")])\n", + "\n", + "graph_get_cardinal = pynini.union(graph_firsts, graph_seconds, graph_get_cardinal) \n", + "\n", + "graph_cardinal = CardinalFst().just_cardinals\n", + "\n", + "graph_ordinal = graph_get_cardinal @ graph_cardinal" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ESxY3LsCdE8q" + }, + "outputs": [], + "source": [ + "apply_fst(\"sixième\", graph_ordinal)\n", + "apply_fst(\"première\", graph_ordinal)\n", + "apply_fst(\"seconde\", graph_ordinal)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qo_g8UdoUFJB" + }, + "source": [ + "## Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kemhdKAjzEIa" + }, + "source": [ + "Now that we've found a way to pass the work of the Ordinal grammar back onto the Cardinal grammar, we can move onto the Classifier. Like before, we need to inherit from `GraphFst` to properly insert token formatting and required attributes. As well, we will again use the `integer` property to tag our digit string.\n", + "\n", + "Indeed, the only major difference between the Ordinal Classifier and the Cardinal Classifier is the replacement of optional `negative` attribute with the `morphosyntactic_feature` attribute to indicate the superscript function." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EHM4Y3TW2nXT" + }, + "source": [ + "Since we are relying on the `CardinalFst` class in our grammar, we want to consider how to instantiate an instance of it. Since our ultimate goal is to build a Classifier that unites all semiotic classes, it makes sense to simply use the `CardinalFst` that we will need to call for our ITN and pass it as an argument to our new class." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 273 + }, + "id": "KsmPhWSa3LF_", + "outputId": "9e881ca9-a926-4249-dda8-9c52175569b5" + }, + "outputs": [], + "source": [ + "def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"ordinal\", kind=\"classify\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CtBQ-udB3S5Q" + }, + "source": [ + "To clear up the namespace, we will now be importing from the NeMo implementation of `CardinalFst` for French." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L-JAcidf4QQg" + }, + "outputs": [], + "source": [ + "from nemo_text_processing.inverse_text_normalization.fr.taggers.cardinal import CardinalFst\n", + "\n", + "class OrdinalFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"ordinal\", kind=\"classify\")\n", + " graph_cardinal = cardinal.graph_no_exception # NeMo equivalent to self.just_cardinals" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FQfkAqZavCAB" + }, + "source": [ + "We now add in our grammar:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uUQ4BLuivGut" + }, + "outputs": [], + "source": [ + "class OrdinalFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"ordinal\", kind=\"classify\")\n", + " graph_cardinal = cardinal.graph_no_exception # may replace\n", + "\n", + " strip_morpheme = pynutil.delete(\"ième\") # deletes suffix\n", + "\n", + " graph_root_change = pynini.string_map([(\"quatrième\", \"quatre\"),\n", + " (\"cinquième\",\t\"cinq\"),\n", + " (\"neuvième\",\t\"neuf\"),\n", + " (\"onzième\",\t\"onze\"),\n", + " (\"douzième\",\t\"douze\"),\n", + " (\"treizième\",\t\"treize\"),\n", + " (\"quatorzième\",\t\"quatorze\"),\n", + " (\"quinzième\",\t\"quinze\"),\n", + " (\"seizième\",\t\"seize\"),\n", 
+ " (\"trentième\",\t\"trente\"),\n", + " (\"quarantième\",\t\"quarante\"),\n", + " (\"cinquantième\",\t\"cinquante\"),\n", + " (\"soixantième\",\t\"soixante\"),\n", + " (\"millième\",\t\"mille\"),\n", + " ])\n", + " \n", + " # Component will accept all tokens that end with desired strings\n", + " graph_get_cardinal = NEMO_SIGMA + (strip_morpheme | graph_root_change) \n", + "\n", + " graph_firsts = pynini.string_map([(\"premier\", \"un\"),(\"première\", \"un\")])\n", + " graph_seconds = pynini.string_map([(\"second\", \"deux\"),(\"seconde\", \"deux\")])\n", + "\n", + " graph_get_cardinal = pynini.union(graph_firsts, graph_seconds, graph_get_cardinal) \n", + "\n", + " graph_ordinal = graph_get_cardinal @ graph_cardinal\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "F_6EXPRMvnp2" + }, + "source": [ + "Now we come to the `morphosyntactic_features` property - a linguistic term for aspects of a word related to grammar. If intending to deploy your WFST through Sparrowhawk, this is the only ordinal property that is permitted (outside of the universal properties like `preserve_order`) and thus must carry all information regarding how to properly normalize the ordinal. (If Sparrowhawk deployment is not necessary, you may add additional properties to the tag.)\n", + "\n", + "How should we convey this information? Since the Verbalizer will be the main interface for our tags, it really does not matter - so long as we can reliably process the features. For the purposes of French, we just need `morphosyntactic_features` to decide the following:\n", + "- Insert the specific superscripts for \"premier/première\" or \"second/seconde\"\n", + "- Insert \"ᵉ\" otherwise\n", + "\n", + "We will also introduce another aspect of French Ordinals: they can be either plural or singular, identified by the suffix \"s\" on input and superscript \"ˢ\" on output. 
As such, our `morphosyntactic_features` should also decide the additional property:\n", + "- Insert the plural superscript " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "atctz6p-2GtV" + }, + "source": [ + "Since the default superscript is nearly universal, we will just specify this in our WFST and focus on the second and first ordinals as specific cases. We will create a `graph_morpheme` component that inserts the default superscript - indicated with a standard \"e\" to avoid possible encoding issues. We will then append a WFST that will graph any possible plural marker - \"s\" - as part of the `morphosyntactic_features`. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ui99osyP2UuQ" + }, + "outputs": [], + "source": [ + "graph_morpheme = pynutil.insert(\"e\") # Insert e superscript\n", + "graph_plural = pynini.closure(pynini.accep(\"s\"), 0, 1) # We create an acceptor since we must process the possible \"s\"\n", + "\n", + "graph_morpheme_component = graph_morpheme + graph_plural\n", + "\n", + "graph_morphosyntactic_features = (pynutil.insert(\" morphosyntactic_features: \\\"\") \n", + " + graph_morpheme_component\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QAlqubA25gq0" + }, + "source": [ + "Introducing the `integer` feature:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rs2TyIBc5la6" + }, + "outputs": [], + "source": [ + "graph_reg_ordinals = graph_get_cardinal @ graph_cardinal # Rewriting ordinals to remove the first and second ordinal.\n", + "\n", + "graph_ordinal = pynutil.insert(\"integer: \\\"\") + graph_reg_ordinals + pynutil.insert(\"\\\"\")\n", + "graph_ordinal += graph_morphosyntactic_features" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xoqk20Pi2gT8" + }, + "source": [ + "For the first and second ordinals, we can explicitly state their mappings, as these occurrences are invariable. 
(First and second ordinals do not need to accommodate being the endings of other terms.) As such, we can just have mappings from the token to the superscripts." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "54aqdH_P63Ea" + }, + "outputs": [], + "source": [ + "firsts = pynini.string_map([(\"premier\", \"er\"), (\"première\",\"re\")])\n", + "firsts += graph_plural # Still accepts plural marker in superscript\n", + "seconds = pynini.string_map([(\"second\", \"d\"),(\"seconde\", \"de\")])\n", + "seconds += graph_plural \n", + "\n", + "graph_firsts = pynutil.insert(\"integer: \\\"1\\\" morphosyntactic_features: \\\"\") + firsts\n", + "graph_seconds = pynutil.insert(\"integer: \\\"2\\\" morphosyntactic_features: \\\"\") + seconds" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D2vQ4m7o7p84" + }, + "source": [ + "Placing them in our class:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "w_JKT8JMf-Mz" + }, + "outputs": [], + "source": [ + "class OrdinalFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"ordinal\", kind=\"classify\")\n", + " graph_cardinal = cardinal.graph_no_exception # may replace\n", + "\n", + " strip_morpheme = pynutil.delete(\"ième\") # deletes suffix\n", + "\n", + " graph_root_change = pynini.string_map([(\"quatrième\", \"quatre\"),\n", + " (\"cinquième\",\t\"cinq\"),\n", + " (\"neuvième\",\t\"neuf\"),\n", + " (\"onzième\",\t\"onze\"),\n", + " (\"douzième\",\t\"douze\"),\n", + " (\"treizième\",\t\"treize\"),\n", + " (\"quatorzième\",\t\"quatorze\"),\n", + " (\"quinzième\",\t\"quinze\"),\n", + " (\"seizième\",\t\"seize\"),\n", + " (\"trentième\",\t\"trente\"),\n", + " (\"quarantième\",\t\"quarante\"),\n", + " (\"cinquantième\",\t\"cinquante\"),\n", + " (\"soixantième\",\t\"soixante\"),\n", + " (\"millième\",\t\"mille\"),\n", + " ])\n", + " \n", + " # Component will accept all tokens that end with desired 
strings\n", + " graph_get_cardinal = NEMO_SIGMA + (strip_morpheme | graph_root_change) \n", + "\n", + " # Graph will map ordinals beyond second ordinal to their cardinals\n", + " graph_reg_ordinals = graph_get_cardinal @ graph_cardinal\n", + "\n", + " # Graphing morphosyntactic_features\n", + " graph_morpheme = pynutil.insert(\"e\") # Insert e superscript\n", + " graph_plural = pynini.accep(\"s\").ques # ques is equivalent to pynini.closure(fst, 0, 1)\n", + "\n", + " graph_morpheme_component = graph_morpheme + graph_plural\n", + "\n", + " graph_morphosyntactic_features = (pynutil.insert(\" morphosyntactic_features: \\\"\") \n", + " + graph_morpheme_component\n", + " )\n", + "\n", + " # Adding in the `integer` property:\n", + " graph_ordinal = pynutil.insert(\"integer: \\\"\") + graph_reg_ordinals + pynutil.insert(\"\\\"\")\n", + " graph_ordinal += graph_morphosyntactic_features \n", + "\n", + " # Case of first and second ordinals\n", + " firsts = pynini.string_map([(\"premier\", \"er\"), (\"première\",\"re\")])\n", + " firsts += graph_plural # Still accepts plural marker in superscript\n", + " seconds = pynini.string_map([(\"second\", \"d\"),(\"seconde\", \"de\")])\n", + " seconds += graph_plural \n", + "\n", + " graph_firsts = pynutil.insert(\"integer: \\\"1\\\" morphosyntactic_features: \\\"\") + firsts\n", + " graph_seconds = pynutil.insert(\"integer: \\\"2\\\" morphosyntactic_features: \\\"\") + seconds\n", + "\n", + " # All together\n", + " graph_ordinal = pynini.union(graph_ordinal, graph_firsts, graph_seconds)\n", + " self.fst = graph_ordinal.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CpGHVg6chmA0" + }, + "source": [ + "Trying it out on some examples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "b5DL3PZRhpc8" + }, + "outputs": [], + "source": [ + "cardinal = CardinalFst()\n", + "ordinal = OrdinalFst(cardinal).fst\n", + "\n", + "apply_fst(\"premier\", ordinal)\n", +
"apply_fst(\"premiers\", ordinal)\n", + "apply_fst(\"seconde\", ordinal)\n", + "apply_fst(\"douzièmes\", ordinal)\n", + "apply_fst(\"cent-cinquièmes\", ordinal)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MNQVgiv-UK29" + }, + "source": [ + "### Special Tokens" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UdiNAHGh71O9" + }, + "source": [ + "If you are particularly astute, you may have noticed that we have not closed the quotation marks around the `morphosyntactic_features` throughout, despite doing so for `integer`. This is not a typo, as there is one more aspect of the Classifier that must be addressed: special cases.\n", + "\n", + "For your language, you may notice that there are occasional exceptions to writing rules that are signaled by a specific vocabulary token in a string. As this must be communicated to our Verbalizer, it is important that we signal this vocabulary through our Classifier. \n", + "\n", + "For French, this can occur in the normalization of centuries. When using Ordinals to indicate centuries, French commonly writes with Roman numerals. For example:\n", + "- \"Fifth century\" -> \"cinquième siècle\" -> `Vᵉ siècle` \n", + "- \"Twentieth century\" -> \"vingtième siècle\" -> `XXᵉ siècle` \n", + "\n", + "As such, we must allow our Classifier to pass on the information that \"siècle\" follows an ordinal to our Verbalizer, so it may normalize with Roman numerals. We accomplish this by appending a WFST that accepts special tokens that follow our Ordinals, adding them to our `morphosyntactic_features` attribute with a forward slash as the delimiter."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "MsWnT4BfQKcC" + }, + "outputs": [], + "source": [ + "special_tokens = pynini.accep(\"siècle\")\n", + "\n", + "graph_special_tokens = delete_space + pynutil.insert(\"/\") + special_tokens # We need to delete the space in between this token and the following one.\n", + "graph_special_tokens = pynini.closure(graph_special_tokens, 0, 1)\n", + "\n", + "graph_ordinal += graph_special_tokens + pynutil.insert(\"\\\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "698_n5SFQ_jP" + }, + "source": [ + "*Once again, it is advised to retain a tsv file in `data` to quickly append these key-words.*\n", + "\n", + "Having taken care of the special case, we may now call `add_tokens` and complete the graph (fully written out below)." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "nZ1dkft0Riou" + }, + "outputs": [], + "source": [ + "class OrdinalFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"ordinal\", kind=\"classify\")\n", + " graph_cardinal = cardinal.graph_no_exception # may replace\n", + "\n", + " strip_morpheme = pynutil.delete(\"ième\") # deletes suffix\n", + "\n", + " graph_root_change = pynini.string_map([(\"quatrième\", \"quatre\"),\n", + " (\"cinquième\",\t\"cinq\"),\n", + " (\"neuvième\",\t\"neuf\"),\n", + " (\"onzième\",\t\"onze\"),\n", + " (\"douzième\",\t\"douze\"),\n", + " (\"treizième\",\t\"treize\"),\n", + " (\"quatorzième\",\t\"quatorze\"),\n", + " (\"quinzième\",\t\"quinze\"),\n", + " (\"seizième\",\t\"seize\"),\n", + " (\"trentième\",\t\"trente\"),\n", + " (\"quarantième\",\t\"quarante\"),\n", + " (\"cinquantième\",\t\"cinquante\"),\n", + " (\"soixantième\",\t\"soixante\"),\n", + " (\"millième\",\t\"mille\"),\n", + " ])\n", + " \n", + " # Component will accept all tokens that end with desired strings\n", + " graph_get_cardinal = NEMO_SIGMA + (strip_morpheme | 
graph_root_change) \n", + "\n", + " # Graph will map ordinals beyond second ordinal to their cardinals\n", + " graph_reg_ordinals = graph_get_cardinal @ graph_cardinal\n", + "\n", + " # Graphing morphosyntactic_features\n", + " graph_morpheme = pynutil.insert(\"e\") # Insert e superscript\n", + " graph_plural = pynini.accep(\"s\").ques # We create an acceptor since we must process the possible \"s\"\n", + "\n", + " graph_morpheme_component = graph_morpheme + graph_plural\n", + "\n", + " graph_morphosyntactic_features = (pynutil.insert(\" morphosyntactic_features: \\\"\") \n", + " + graph_morpheme_component\n", + " )\n", + "\n", + " # Adding in the `integer` property:\n", + " graph_ordinal = pynutil.insert(\"integer: \\\"\") + graph_reg_ordinals + pynutil.insert(\"\\\"\")\n", + " graph_ordinal += graph_morphosyntactic_features \n", + "\n", + " # Case of first and second ordinals\n", + " firsts = pynini.string_map([(\"premier\", \"er\"), (\"première\",\"re\")])\n", + " firsts += graph_plural # Still accepts plural marker in superscript\n", + " seconds = pynini.string_map([(\"second\", \"d\"),(\"seconde\", \"de\")])\n", + " seconds += graph_plural \n", + "\n", + " graph_firsts = pynutil.insert(\"integer: \\\"1\\\" morphosyntactic_features: \\\"\") + firsts\n", + " graph_seconds = pynutil.insert(\"integer: \\\"2\\\" morphosyntactic_features: \\\"\") + seconds\n", + "\n", + " # All together (so that every ordinal receives the special-token handling and the closing quote)\n", + " graph_ordinal = pynini.union(graph_ordinal, graph_firsts, graph_seconds)\n", + "\n", + " # Special tokens\n", + " special_tokens = pynini.accep(\"siècle\")\n", + "\n", + " graph_special_tokens = delete_space + pynutil.insert(\"/\") + special_tokens # We need to delete the space in between this token and the following one.\n", + " graph_special_tokens = pynini.closure(graph_special_tokens, 0, 1)\n", + "\n", + " graph_ordinal += graph_special_tokens + pynutil.insert(\"\\\"\")\n", + "\n", + " # Finishing\n", + " graph_ordinal = self.add_tokens(graph_ordinal)\n", + " self.fst = graph_ordinal.optimize()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id":
"7a4zBo-YS1QD" + }, + "source": [ + "## Verbalizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zYbrcGyGS2rW" + }, + "source": [ + "The initial part of the Ordinal Verbalizer is similar to the Cardinal WFST: we simply need to build a Verbalizer that inherits from `GraphFst` and removes the `integer` property tag. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "KUv99A_rYjb9" + }, + "outputs": [], + "source": [ + "class OrdinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"ordinal\", kind=\"verbalize\")\n", + " graph_integer = (\n", + " pynutil.delete(\"integer:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zKCt_EapZXGW" + }, + "source": [ + "Now we need to manage the `morphosyntactic_features` component. The first steps seem simple enough: delete the property tag and replace the superscript indicators with the actual superscripts. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yoa_mXMLabrU" + }, + "outputs": [], + "source": [ + " # Create mappings for all superscripts\n", + " superscript = pynini.union(\n", + " pynini.cross(\"e\", \"ᵉ\"), # only delete first quote since there may be more features\n", + " pynini.cross(\"d\", \"ᵈ\"),\n", + " pynini.cross(\"r\", \"ʳ\"),\n", + " pynini.cross(\"s\", \"ˢ\"),\n", + " )\n", + "\n", + " # Append to deletion of feature property. 
Note that we use plus closure for multiple superscripts.\n", + " graph_morphosyntactic_features = pynutil.delete(\" morphosyntactic_features: \\\"\") + superscript.plus" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xOA7_MsUrSJS" + }, + "source": [ + "### Romanization" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "K_SaG0DUa2t7" + }, + "source": [ + "Now we come to the possible Romanization component. Since we need to graph the superscript components as following the number, we want to design our graph so that `morphosyntactic_features` is the last component of the graph. However, we do not know whether we need Romanization until we see the `morphosyntactic_features` component. As such, we need to design our graph such that two options are available initially for an input, but only one allows full traversal." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7dalc-tablG-" + }, + "source": [ + "![romanization.png](images/romanization.PNG)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "mPTNCddNcEEE" + }, + "source": [ + "In cases where your WFST decisions are dependent on later parts of an input string, permitting the union of two separate paths when only one is valid usually helps, as a standard pathing heuristic will choose only the valid path. \n", + "\n", + "In the case of French, this would require us to separate our Verbalizer into two parts: one for Arabic numerals and one for Roman numerals. For the Arabic WFST, we simply conclude the graph. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0YSy1PYOcuyD" + }, + "outputs": [], + "source": [ + "graph_integer = (\n", + " pynutil.delete(\"integer:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + "graph_arabic = graph_integer + graph_morphosyntactic_features + pynutil.delete(\"\\\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nnXjUU5Pf7Sh" + }, + "source": [ + "For the Roman graph, things get a bit trickier. Ideally, we would want to build a WFST that maps each digit of `graph_arabic` to a Roman equivalent. However, consider the following examples:\n", + "- 1 -> I\n", + "- 10 -> X\n", + "- 11 -> XI\n", + "- 100 -> C\n", + "- 101 -> CI\n", + "- 110 -> CX\n", + "- 111 -> CXI\n", + "\n", + "Since Roman numerals do not preserve powers of ten through digit placement, we will need to design separate FSTs for each digit position and apply them accordingly. As this can quickly become intensive, we will only enumerate the Ordinals from 1 to 99. 
(Note: We are doing this to accommodate centuries; there is little likelihood that any century beyond the 99th will be used in regular strings.)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3-fQHMc2iQrz" + }, + "source": [ + "First we design our graphs for converting from Arabic to Roman numerals:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d6PDySykiXTh" + }, + "outputs": [], + "source": [ + "digits = pynini.string_map([(\"1\", \"I\"),\n", + " (\"2\",\t\"II\"),\n", + " (\"3\",\t\"III\"),\n", + " (\"4\",\t\"IV\"),\n", + " (\"5\",\t\"V\"),\n", + " (\"6\",\t\"VI\"),\n", + " (\"7\",\t\"VII\"),\n", + " (\"8\",\t\"VIII\"),\n", + " (\"9\",\t\"IX\"),\n", + " ])\n", + "tens = pynini.string_map([(\"1\", \"X\"),\n", + " (\"2\",\t\"XX\"),\n", + " (\"3\",\t\"XXX\"),\n", + " (\"4\",\t\"XL\"),\n", + " (\"5\",\t\"L\"),\n", + " (\"6\",\t\"LX\"),\n", + " (\"7\",\t\"LXX\"),\n", + " (\"8\",\t\"LXXX\"),\n", + " (\"9\",\t\"XC\"),\n", + " ])\n", + "zero = pynutil.delete(\"0\") # No Roman representation for zero." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wb-LmwJdk59m" + }, + "source": [ + "Now we build two separate filters: one will accept only single-digit Arabic numerals and the other will accept two-digit Arabic numerals. For this we can use `NEMO_DIGIT`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DW3oD7Hbli2X" + }, + "outputs": [], + "source": [ + "map_one_digit = NEMO_DIGIT\n", + "map_two_digits = NEMO_DIGIT ** 2 # pynini overloads the exponent operator to allow self-concatenation."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xtYKLy9AmJZS" + }, + "source": [ + "We now build mappings from one- and two-digit Arabic numerals to Roman numerals, composing them onto the filters:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "dUy7uEUXmT_g" + }, + "outputs": [], + "source": [ + "graph_one_digit_romans = NEMO_DIGIT @ digits\n", + "\n", + "graph_two_digit_romans = tens + (digits | zero)\n", + "graph_two_digit_romans = map_two_digits @ graph_two_digit_romans\n", + "\n", + "graph_romans = graph_one_digit_romans | graph_two_digit_romans" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JEinyAMdm7RJ" + }, + "source": [ + "We now compose onto `graph_integer` and take care of the occurrence of \"siècle\":" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ERO19BbynPNX" + }, + "outputs": [], + "source": [ + "graph_romans = (graph_integer @ graph_romans) + graph_morphosyntactic_features\n", + "graph_romans += pynini.cross(\"/\", \" \") + \"siècle\" + pynutil.delete(\"\\\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zN-fwrCGoToQ" + }, + "source": [ + "We finalize with a union and a call to `delete_tokens`. The complete Verbalizer is now:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kr2wcToAofWB" + }, + "outputs": [], + "source": [ + "class OrdinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"ordinal\", kind=\"verbalize\")\n", + "\n", + " # Maps integer and removes attribute\n", + " graph_integer = (\n", + " pynutil.delete(\"integer:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + "\n", + " # Create mappings for all superscripts\n", + " superscript = pynini.union(\n", + " pynini.cross(\"e\", \"ᵉ\"), # only delete first quote since there may be more 
features\n", + " pynini.cross(\"d\", \"ᵈ\"),\n", + " pynini.cross(\"r\", \"ʳ\"),\n", + " pynini.cross(\"s\", \"ˢ\"),\n", + " )\n", + "\n", + " # Append to deletion of feature property. Note that we use plus closure for multiple superscripts.\n", + " graph_morphosyntactic_features = pynutil.delete(\" morphosyntactic_features: \\\"\") + superscript.plus\n", + "\n", + " # Writing WFST for arabic\n", + " graph_arabic = graph_integer + graph_morphosyntactic_features + pynutil.delete(\"\\\"\")\n", + "\n", + " # Mapping Roman numerals\n", + " digits = pynini.string_map([(\"1\", \"I\"),\n", + " (\"2\",\t\"II\"),\n", + " (\"3\",\t\"III\"),\n", + " (\"4\",\t\"IV\"),\n", + " (\"5\",\t\"V\"),\n", + " (\"6\",\t\"VI\"),\n", + " (\"7\",\t\"VII\"),\n", + " (\"8\",\t\"VIII\"),\n", + " (\"9\",\t\"IX\"),\n", + " ])\n", + " tens = pynini.string_map([(\"1\", \"X\"),\n", + " (\"2\",\t\"XX\"),\n", + " (\"3\",\t\"XXX\"),\n", + " (\"4\",\t\"XL\"),\n", + " (\"5\",\t\"L\"),\n", + " (\"6\",\t\"LX\"),\n", + " (\"7\",\t\"LXX\"),\n", + " (\"8\",\t\"LXXX\"),\n", + " (\"9\",\t\"XC\"),\n", + " ])\n", + " zero = pynutil.delete(\"0\") # No Roman representation for zero.\n", + "\n", + " # filters for Roman digits\n", + " map_one_digit = NEMO_DIGIT\n", + " map_two_digits = NEMO_DIGIT ** 2 # pynini overloads the exponent function to allow self-concatenation.\n", + "\n", + " # Composing onto roman digits\n", + " graph_one_digit_romans = NEMO_DIGIT @ digits\n", + "\n", + " graph_two_digit_romans = tens + (digits | zero)\n", + " graph_two_digit_romans = map_two_digits @ graph_two_digit_romans\n", + "\n", + " graph_romans = graph_one_digit_romans | graph_two_digit_romans\n", + "\n", + " # Writing WFST for Roman\n", + " graph_romans = (graph_integer @ graph_romans) + graph_morphosyntactic_features\n", + " graph_romans += pynini.cross(\"/\", \" \") + \"siècle\" + pynutil.delete(\"\\\"\")\n", + "\n", + " # Final composition\n", + " graph = (graph_romans | graph_arabic)\n", + "\n", + " delete_tokens = 
self.delete_tokens(graph)\n", + " self.fst = delete_tokens.optimize()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Trying out our examples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "example_regular = 'ordinal { integer: \"12\" morphosyntactic_features: \"es\" }'\n", + "example_roman = 'ordinal { integer: \"12\" morphosyntactic_features: \"es/siècle\" }'\n", + "\n", + "fst = OrdinalFst().fst\n", + "\n", + "apply_fst(example_regular, fst)\n", + "apply_fst(example_roman, fst)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yBgLhTq9pWZe" + }, + "source": [ + "We have now completed an Ordinal WFST from the ground up, allowing a separate numbering system for special cases." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-W1-BMVJUXXk" + }, + "source": [ + "## Final notes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kR7E64P4pPU_" + }, + "source": [ + "Before moving on, there are some key takeaways that you may find useful for most (if not all) languages:\n", + "- Many ordinal systems rely on alteration of Cardinals. Even in the Korean example, the ordinal uses a pre-existing counting system and adds a suffix to indicate ordering. As such, your Ordinal WFST will likely follow this tutorial's structure of changing the Ordinal to its original root and then relying on your Cardinal WFST for the majority of processing.\n", + "- The `morphosyntactic_features` property will carry the vast majority of information necessary for normalization through your Verbalizer.\n", + "- While not all writing systems share the quirk of using Roman numerals in reference to centuries, you will likely find cases in your language where a specific token indicates unique rules for a semiotic class. Carrying this information to the Verbalizer is usually the simplest means of preserving the token while also facilitating normalization. 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Rx8-LuJOUaa5" + }, + "source": [ + "# Decimal WFST " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "D2MRXYxz8TGA" + }, + "source": [ + "\n", + "If the Cardinal WFST is the most crucial element of a normalization grammar, the construction of the Decimal WFST is a close second. Much like in the case of constructing Ordinals from Cardinal grammars, many aspects of the Decimal WFST will be reused throughout your other semiotic classes.\n", + "\n", + "To get started, you should study the numerical conventions in your language. In particular, you should take note of the following:\n", + "- How is the decimal component of a number pronounced in your language of focus? (e.g. The English number `1.33` can be verbalized as \"one point three three\" or \"one and thirty-three hundredths.\")\n", + "- What is the punctuation mark used for decimal demarcation? (Writing systems in North America generally use `.`, while many European countries use `,`.)\n", + "- Are there general rules regarding pronunciation/formatting of numbers past the decimal demarcation? (e.g. Does your language pronounce each digit or pronounce as a series of three-digit numbers?)\n", + "\n", + "Such questions will likely require some deep familiarity with the language, and it may help to ask a native speaker for some input. Of course, the level of depth is dependent on your needs, but researching these questions will help your normalization system appear more organic." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UsK78ib4N-gb" + }, + "source": [ + "## Grammar" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "p4CLOOA9OAwZ" + }, + "source": [ + "In the case of French, we have the following guidelines:\n", + "- French uses the comma ( `,` ) for decimal delineation. It is articulated as \"virgule\".\n", + "- Decimals can be read as a series of digits or grouped as Cardinal numbers arbitrarily. 
(e.g. `.333` can be \"virgule trois trois trois\" or \"virgule trois-cent-trente-trois\".) \n", + "\n", + "As such, our grammar needs to accommodate the following pattern: \n", + "\n", + "`cardinal + \"virgule\" + string_of_cardinals`\n", + "\n", + "Given our experience with our previous WFSTs, this seems simple enough. We assume we have an instance of `CardinalFst` available and create a subcomponent to map the integer portion of a decimal:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XSp9FTzhf0XZ" + }, + "outputs": [], + "source": [ + "cardinal = CardinalFst().graph_no_exception # NeMo equivalent of just_cardinals\n", + "\n", + "# place cardinal under an optional closure to permit values below 1 (no spoken integer part)\n", + "graph_integer = pynini.closure(cardinal, 0, 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bk3_3iawgAZE" + }, + "source": [ + "Concatenate it with a subcomponent that detects the delimiter \"virgule\":" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "UMzfAKkngH6z" + }, + "outputs": [], + "source": [ + "delete_virgule = pynutil.delete(\"virgule\")\n", + "graph_decimal = graph_integer + delete_space + delete_virgule" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GXjbtbLYgn17" + }, + "source": [ + "And permit the occurrence of several strings of cardinals to follow:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LMMNBJz8gtTA" + }, + "outputs": [], + "source": [ + "graph_string_of_cardinals = delete_space + cardinal\n", + "graph_string_of_cardinals = pynini.closure(graph_string_of_cardinals, 1)\n", + "\n", + "graph_decimal += graph_string_of_cardinals" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jTgnRLddhGdE" + }, + "source": [ + "Let us try an example:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "D4rjDh0ShJAp" + }, + "outputs": [], + "source": [ + "example = 
\"trois virgule trois cinquante-cinq\" \n", + "apply_fst(example, graph_decimal) # Should output only the cardinals in the string" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RfD1d9JOioyl" + }, + "source": [ + "### Ambiguity?" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3IaI1mCIe_6i" + }, + "source": [ + "Note that our decision to include multiple strings of cardinals after the decimal marker has introduced some ambiguity into our WFST. Consider a decimal number followed by an integer series (e.g. `2.5, 5, 6`). Now what should be an application of one `DecimalFst` and two applications of a `CardinalFst` can be interpreted as a single `DecimalFst` application (e.g. `2.556`). What can be done?\n", + "\n", + "While we will address this in greater depth later (see [Tokenize and Classify](#tokenize-and-classify)), the short answer is that cases such as these must be calibrated according to use and linguistic intuition. As this is an inherent ambiguity in the language and its writing system, we can never truly remove this possibility without restricting our ability to model the language. However, we can rely on a few logical assumptions to guide our decision making:\n", + "- Unless the grammar is deployed in a restrictive setting (e.g. a financial or similar environment where strings of numbers are often read in series) it's not likely for a valid string to exhibit this level of ambiguity. Speakers typically try to reduce possible ambiguity in their language production and would likely rephrase to avoid issues such as these. [See Grice's maxims](https://en.wikipedia.org/wiki/Cooperative_principle).\n", + "- While a language may allow a specific string by *rule*, speakers may typically avoid it *in practice* due to conventions or difficulty. 
In our case, while it may be possible to read `2,100 05` as \"deux virgule dix-mille-cinq\" (\"two point ten-thousand and five\"), it's dubious that a speaker would find that easier to read than \"deux virgule un zéro zéro zéro cinq\". (The place value of large strings tends to take longer to recognize.)\n", + "\n", + "While hardly satisfying, these two points will allow us to dismiss *some* worry. With the former observation being outside our grammar's ability to manage, we accommodate the latter point by using an alternate WFST from our `CardinalFst`: `numbers_up_to_million`. (To utilize in your own language, create a WFST in the Cardinal class right before building up to `graph_millions`. Again, calling `optimize` is advised.)\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "piNe1AWspa4J" + }, + "outputs": [], + "source": [ + "cardinal = CardinalFst().numbers_up_to_million\n", + "\n", + "# place cardinal under an optional closure to permit values below 1 (no spoken integer part)\n", + "graph_integer = pynini.closure(cardinal, 0, 1)\n", + "\n", + "delete_virgule = pynutil.delete(\"virgule\")\n", + "graph_decimal = graph_integer + delete_space + delete_virgule\n", + "\n", + "graph_string_of_cardinals = delete_space + cardinal\n", + "graph_string_of_cardinals = pynini.closure(graph_string_of_cardinals, 1)\n", + "\n", + "graph_decimal += graph_string_of_cardinals" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B1gglt0tfM5V" + }, + "source": [ + "## Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fVkOWkncgOZc" + }, + "source": [ + "As with our previous WFSTs, the main duty for the classifier is inserting the necessary properties for the semiotic token. 
For the `decimal` tag, the following properties are used:\n", + "- `integer_part` - indicates value before decimal marker\n", + "- `fractional_part` - indicates values after the decimal marker\n", + "- `negative` - indicates if value is positive or negative (Optional)\n", + "- `quantity` - designates if decimal is with regard to a specific quantity. (See Quantities.)\n", + "\n", + "We can begin by inserting the `integer_part` around our `cardinal` subcomponent and the `fractional_part` around our `graph_string_of_cardinals`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "_zw_cDszh-fB" + }, + "outputs": [], + "source": [ + "graph_integer = pynutil.insert(\"integer_part: \\\"\") + cardinal + pynutil.insert(\"\\\" \")\n", + "graph_fractional = pynutil.insert(\"fractional_part: \\\"\") + graph_string_of_cardinals + pynutil.insert(\"\\\"\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bxlnn_7tiQMn" + }, + "source": [ + "We then concatenate them together with a component that recognizes and removes the decimal separator." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BxNS9_AwiWHf" + }, + "outputs": [], + "source": [ + "graph_integer_or_none = graph_integer | pynutil.insert(\"integer_part: \\\"0\\\" \", weight=.1) # In cases where there is no preceding integer\n", + "graph_decimal_no_sign = graph_integer_or_none + delete_space + pynutil.delete(\"virgule\") + graph_fractional" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "b7uGfsi4i5UI" + }, + "source": [ + "*Note that we allow insertion of 0 if there is no integer, to accommodate reading of only decimal values.*\n", + "\n", + "Now we allow the possibility of negative values. 
(Recall that French uses \"moins\" to indicate the negative.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VsP79naojQZR" + }, + "outputs": [], + "source": [ + "graph_negative = pynini.cross(\"moins\", \"negative: \\\"-\\\" \") + delete_space\n", + "graph_decimal = graph_negative + graph_decimal_no_sign" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "QTcvq5HqllqW" + }, + "outputs": [], + "source": [ + "example = \"moins deux virgule cent-quatre\"\n", + "apply_fst(example, graph_decimal)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FVKuGj_9mZ75" + }, + "source": [ + "Placing this within a `DecimalFst` class, we have:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tXwr32ermesp" + }, + "outputs": [], + "source": [ + "class DecimalFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"decimal\", kind=\"classify\")\n", + " cardinal = cardinal.numbers_up_to_million\n", + " delete_virgule = pynutil.delete(\"virgule\")\n", + "\n", + " graph_integer = pynutil.insert(\"integer_part: \\\"\") + cardinal + pynutil.insert(\"\\\" \") + delete_space\n", + " graph_integer_or_none = graph_integer | pynutil.insert(\"integer_part: \\\"0\\\" \", weight=.001) # For cases where no integer precedes the decimal separator\n", + "\n", + " graph_string_of_cardinals = delete_space + cardinal\n", + " graph_string_of_cardinals = pynini.closure(graph_string_of_cardinals, 1)\n", + " graph_fractional = pynutil.insert(\"fractional_part: \\\"\") + graph_string_of_cardinals + pynutil.insert(\"\\\"\")\n", + "\n", + " graph_decimal_no_sign = graph_integer_or_none + delete_virgule + graph_fractional \n", + "\n", + " graph_negative = pynini.cross(\"moins\", \"negative: \\\"-\\\" \") + delete_space\n", + " graph_negative = pynini.closure(graph_negative, 0, 1)\n", + "\n", + " graph_decimal = graph_negative + 
graph_decimal_no_sign\n", + "\n", + " graph = self.add_tokens(graph_decimal)\n", + " self.fst = graph.optimize()\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gjxI5mEKfHLo" + }, + "source": [ + "### Quantities" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3WuwWPf3py7G" + }, + "source": [ + "Recalling our earlier remarks regarding convention in language use, you may find a need to adjust the `DecimalFst` when processing specific values. For instance, consider the following equivalences from English:\n", + "- `1,500,000` = \"one million five hundred thousand\" = \"one point five million\" = `1.5 million`\n", + "- `2,750,000` = \"two million seven hundred and fifty thousand\" = \"two point seven five million\" = `2.75 million`\n", + "\n", + "For large numbers, there is a tendency to use the decimal system as though one is describing a quantity. Notably, there is a minimum value for which this is comfortable. (A speaker of English may say \"three point five trillion\" but \"three point five hundred\" comes off as odd.)\n", + "\n", + "This behavior can occur in other languages. For example, the amount of `$1,500,000` may be read in French as \"un virgule cinq million de dollars\" (\"one point five million dollars\"). " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RgMBIKlYdsGz" + }, + "source": [ + "Our Classifier can be made to accommodate this behavior: we simply need to repeat what we did for `OrdinalFst` and set aside several key terms to trigger our model. For French, we will choose all terms added for values of a million or more. 
(Chosen empirically.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vEcsUXw5fUEe" + }, + "outputs": [], + "source": [ + "suffix = pynini.union(\n", + " \"million\",\n", + " \"millions\",\n", + " \"milliard\",\n", + " \"milliards\",\n", + " \"billion\",\n", + " \"billions\",\n", + " \"billiard\",\n", + " \"billiards\",\n", + " \"trillion\",\n", + " \"trillions\",\n", + " \"trilliard\",\n", + " \"trilliards\",\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wIIUAsR-fgQA" + }, + "source": [ + "We will then need to use a WFST to graph any numbers that precede these amounts. Note that, unlike for our `DecimalFst`, we need to permit cardinals as well as decimals. This is because we want to be able to normalize a phrase like \"three million\" to `3 million`, as this will be less obtrusive than `3,000,000`.\n", + "\n", + "As such, we will pass both a cardinal graph and a decimal graph to `get_quantity`. Since these are both utilized for our `DecimalFst`, it would be more efficient to just pass them along as function/class variables."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "yern-idtycWg" + }, + "outputs": [], + "source": [ + "def get_quantity(decimal, cardinal_up_to_thousand):\n", + " suffix = pynini.union(\n", + " \"million\",\n", + " \"millions\",\n", + " \"milliard\",\n", + " \"milliards\",\n", + " \"billion\",\n", + " \"billions\",\n", + " \"billiard\",\n", + " \"billiards\",\n", + " \"trillion\",\n", + " \"trillions\",\n", + " \"trilliard\",\n", + " \"trilliards\",\n", + " )\n", + " # The French WFST that this borrows from has not removed leading zeroes yet.\n", + " numbers = cardinal_up_to_thousand @ (\n", + " pynutil.delete(pynini.closure(\"0\")) + pynini.difference(NEMO_DIGIT, \"0\") + pynini.closure(NEMO_DIGIT)\n", + " )\n", + " res = (\n", + " pynutil.insert(\"integer_part: \\\"\")\n", + " + numbers\n", + " + pynutil.insert(\"\\\"\")\n", + " + (\n", + " pynini.union(delete_hyphen, delete_extra_space)\n", + " ) # Can be written either as 'deux-millions' or 'deux millions' depending on whether it registers as a noun or part of cardinal.\n", + " + pynutil.insert(\" quantity: \\\"\")\n", + " + suffix\n", + " + pynutil.insert(\"\\\"\")\n", + " )\n", + " # Union with decimal to permit either a cardinal or decimal representation.\n", + " res |= decimal + delete_extra_space + pynutil.insert(\" quantity: \\\"\") + suffix + pynutil.insert(\"\\\"\")\n", + " return res" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uT4LMo8ADBAq" + }, + "source": [ + "We can now insert this into our Classifier, producing the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d2KrCuyGDLwh" + }, + "outputs": [], + "source": [ + "class DecimalFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"decimal\", kind=\"classify\")\n", + " quantities_cardinal = cardinal.graph_hundreds_component_at_least_one_none_zero_digit\n", + " cardinal = 
cardinal.graph_no_exception\n", + " delete_virgule = pynutil.delete(\"virgule\")\n", + "\n", + " graph_integer = pynutil.insert(\"integer_part: \\\"\") + cardinal + pynutil.insert(\"\\\" \") + delete_space\n", + " graph_integer_or_none = graph_integer | pynutil.insert(\"integer_part: \\\"0\\\" \", weight=.001) # For cases where no integer precedes the decimal separator\n", + "\n", + " graph_string_of_cardinals = delete_space + cardinal\n", + " graph_string_of_cardinals = pynini.closure(graph_string_of_cardinals, 1)\n", + " graph_fractional = pynutil.insert(\"fractional_part: \\\"\") + graph_string_of_cardinals + pynutil.insert(\"\\\"\")\n", + "\n", + " graph_decimal_no_sign = graph_integer_or_none + delete_virgule + graph_fractional \n", + "\n", + " graph_negative = pynini.cross(\"moins\", \"negative: \\\"-\\\" \") + delete_space\n", + " graph_negative = pynini.closure(graph_negative, 0, 1)\n", + " graph_decimal = graph_negative + graph_decimal_no_sign\n", + "\n", + " # Union default decimal with version that accepts quantities\n", + " graph_decimal |= graph_negative + get_quantity(\n", + " graph_decimal_no_sign, quantities_cardinal\n", + " )\n", + " final_graph = self.add_tokens(graph_decimal)\n", + " self.fst = final_graph.optimize()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "cD-eKqO6qTyh" + }, + "outputs": [], + "source": [ + "cardinal = CardinalFst()\n", + "decimal = DecimalFst(cardinal).fst\n", + "example = \"trois virgule cent-quatre billion\"\n", + "apply_fst(example, decimal)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HiSLKF3RfRZA" + }, + "source": [ + "## Verbalizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QnkOV5FlteQA" + }, + "source": [ + "As before, the Verbalizer is responsible for removing the formatting and rendering a given token in conventional form. 
As the process remains similar to Ordinals and Cardinals (deleting strings in a regular manner), we will instead focus on a unique concern for `DecimalFst`: numeral spacing.\n", + "\n", + "In some writing systems, decimal numbers and other long numerals are typically not written as a single unbroken string, instead using punctuation to group digits for clarity. For example, in the United States, integers of a thousand or more are separated by commas every three digits:\n", + "- `12345.678` -> `12,345.678`\n", + "\n", + "A similar rule occurs in French, save that it employs spaces on each side of the decimal marker:\n", + "- `12345,6789` -> `12 345,678 9`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2h4WQZ1a4Cpc" + }, + "source": [ + "While simple enough, this rule poses a slight complication: it works from the left and right of the decimal separator, whereas WFSTs process linearly from the beginning (or end) of strings. As such, we will need to break the formatting rule into two components: one for the integer component and one for the decimal component." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ViOFNdZw4-qu" + }, + "source": [ + "Starting with the integer component, we need our subcomponent to recognize every three digits and insert a space before. We can achieve this with some `graph_utils` helper objects - `NEMO_DIGIT` and `NEMO_NON_BREAKING_SPACE`, which accept all digits and non-breaking spaces, respectively. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Z36be2Vo5VbR" + }, + "outputs": [], + "source": [ + "every_three_digits = NEMO_DIGIT ** 3 # accepts a string of three digits\n", + "space_every_three_integer = pynini.closure(NEMO_NON_BREAKING_SPACE + every_three_digits) # inserts space before every three digits."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RSB2gGH-5vwi" + }, + "source": [ + "However, we cannot let the component insert spaces when there are *only* three digits (e.g. `100`.) As such, we need to make sure the insertion only begins starting from the beginning of a string (e.g. when there is a string between one and three digits.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "wfWp3ghH6mDQ" + }, + "outputs": [], + "source": [ + "space_every_three_integer = pynini.closure(NEMO_DIGIT, 1, 3) + space_every_three_integer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NJrQYSfA6vyu" + }, + "source": [ + "For the case of the decimal spacing, we simply reverse the logic:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "vBP6ncTp6yXX" + }, + "outputs": [], + "source": [ + "space_every_three_decimal = pynini.closure(NEMO_NON_BREAKING_SPACE + every_three_digits)\n", + "space_every_three_decimal = space_every_three_decimal + pynini.closure(NEMO_DIGIT, 1, 3)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WRXPN_gk69VV" + }, + "source": [ + "Placed into our Verbalizer, we would see the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "h49eztvs7BXH" + }, + "outputs": [], + "source": [ + "class DecimalFst(GraphFst):\n", + " \"\"\"\n", + " Finite state transducer for verbalizing decimal, e.g.\n", + " decimal { negative: \"true\" integer_part: \"12\" fractional_part: \"5006\" quantity: \"billion\" } -> -12.5006 billion\n", + " \"\"\"\n", + "\n", + " def __init__(self):\n", + " super().__init__(name=\"decimal\", kind=\"verbalize\")\n", + "\n", + " # Need parser to group digits by threes\n", + " exactly_three_digits = NEMO_DIGIT ** 3\n", + " at_most_three_digits = pynini.closure(NEMO_DIGIT, 1, 3)\n", + "\n", + " space_every_three_integer = (\n", + " at_most_three_digits + 
(pynutil.insert(NEMO_NON_BREAKING_SPACE) + exactly_three_digits).closure()\n", + " )\n", + " space_every_three_decimal = (\n", + " pynini.accep(\",\")\n", + " + (exactly_three_digits + pynutil.insert(NEMO_NON_BREAKING_SPACE)).closure()\n", + " + at_most_three_digits\n", + " )\n", + " group_by_threes = space_every_three_integer | space_every_three_decimal\n", + " self.group_by_threes = group_by_threes\n", + "\n", + " optional_sign = pynini.closure(pynini.cross(\"negative: \\\"true\\\"\", \"-\") + delete_space, 0, 1)\n", + " integer = (\n", + " pynutil.delete(\"integer_part:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_NOT_QUOTE, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + " integer = integer @ group_by_threes\n", + " optional_integer = pynini.closure(integer + delete_space, 0, 1)\n", + " fractional = (\n", + " pynutil.insert(\",\")\n", + " + pynutil.delete(\"fractional_part:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_NOT_QUOTE, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + " fractional = fractional @ group_by_threes\n", + " optional_fractional = pynini.closure(fractional + delete_space, 0, 1)\n", + " quantity = (\n", + " pynutil.delete(\"quantity:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_NOT_QUOTE, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + " optional_quantity = pynini.closure(pynutil.insert(\" \") + quantity + delete_space, 0, 1)\n", + " graph = (optional_integer + optional_fractional + optional_quantity).optimize()\n", + " self.numbers = graph # Saving just the part of the graph used for numbers\n", + " graph = optional_sign + graph\n", + " delete_tokens = self.delete_tokens(graph)\n", + " self.fst = delete_tokens.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Trying out some examples:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + 
"metadata": {}, + "outputs": [], + "source": [ + "fst = DecimalFst().fst\n", + "\n", + "example1 = 'decimal { integer_part: \"3\" fractional_part: \"10453\" quantity: \"billion\" }'\n", + "example2 = 'decimal { integer_part: \"22323\" fractional_part: \"104553\" }'\n", + "\n", + "apply_fst(example1, fst)\n", + "apply_fst(example2, fst)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CZbshZCW8clI" + }, + "source": [ + "# Money WFST " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xuiv8HMz7yjm" + }, + "source": [ + "Now that we've handled some of the foundational classes, it's time to see how they build up to permit more concrete ones. Let's see how the previous WFSTs assist in building a WFST for normalizing currency: the `MoneyFst`. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wTU2c7MtUpqF" + }, + "source": [ + "## Grammar" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qqyRm8Ru8TDf" + }, + "source": [ + "While the exact phrasing will vary, a valid string for currency will possess the following qualities:\n", + "- A major and/or minor denomination of currency\n", + "- A numeric quantity of the denomination \n", + "\n", + "As our `CardinalFst` and `DecimalFst` already allow us to normalize the quantity, the only issue for `MoneyFst` is to graph the amounts and build a vocabulary to recognize the denominations.\n", + "\n", + "For French, we will use the following examples to build upon:\n", + "- \"un euro\" -> `1 €`\n", + "- \"deux euros\" -> `2 €` \n", + "- \"deux euros cinq\" -> `2,05 €` \n", + "- \"cinq centimes\" -> `0,05 €`\n", + "- \"deux billions d'euros\" -> `2 billions d'euros`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FMqUir9n9_cA" + }, + "source": [ + "These suggest the following requirements of our grammar:\n", + "- There must be a mapping from \"euro\" and \"centime\" to `€` in our vocabulary\n", + "- This mapping must allow both singular and 
plural forms\n", + "- The currency name is spoken between the major and minor amounts (\"un euro cinq\" and not \"un cinq euro\")\n", + "- Large quantities of currency are left 'as is' instead of normalized\n", + "\n", + "We may deal with the vocabulary in the typical fashion:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "XN9nbNhB-vEV" + }, + "outputs": [], + "source": [ + "major_currency = pynini.string_map([(\"euro\", \"€\")])\n", + "minor_currency = pynini.string_map([(\"centime\", \"€\")])\n", + "\n", + "graph_plural = pynutil.delete(\"s\").ques\n", + "\n", + "major_currency += graph_plural\n", + "minor_currency += graph_plural" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "3aHrm1qPAc-f" + }, + "source": [ + "Moving to the numbers, note that we need to prepend a leading zero to single-digit fractional currency amounts (\"five cents\" -> `$0.05`). We bring back the subgraph from `CardinalFst` that maps tokens to numbers without tokenization to assist with this:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "jwi-yQW1AjvG" + }, + "outputs": [], + "source": [ + "from nemo_text_processing.inverse_text_normalization.fr.taggers import cardinal\n", + "\n", + "cardinal_graph = cardinal.CardinalFst()\n", + "graph_cardinal = cardinal_graph.graph_no_exception # graphs cardinals w/o tokenization\n", + "\n", + "add_leading_zero_to_double_digit = (NEMO_DIGIT + NEMO_DIGIT) | (pynutil.insert(\"0\") + NEMO_DIGIT)\n", + "graph_fractional_values = graph_cardinal @ add_leading_zero_to_double_digit" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, let us consider how to manage large quantities of currency. In our example (\"deux billions d'euros\" -> `2 billions d'euros`) we see that its behavior mirrors that of our `get_quantity` portion of `DecimalFst`. 
As such, it would be useful if there were a subcomponent of that graph that we could use here. As in the case of `CardinalFst`, let us go back and create a subgraph for later use. Since all our quantities are positive, this would be best accomplished right before incorporating the `negative` property, creating a `self.final_graph_wo_negative`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class DecimalFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst):\n", + " super().__init__(name=\"decimal\", kind=\"classify\")\n", + " quantities_cardinal = cardinal.graph_hundreds_component_at_least_one_none_zero_digit\n", + " cardinal = cardinal.graph_no_exception\n", + " delete_virgule = pynutil.delete(\"virgule\")\n", + "\n", + " graph_integer = pynutil.insert(\"integer_part: \\\"\") + cardinal + pynutil.insert(\"\\\" \") + delete_space\n", + " graph_integer_or_none = graph_integer | pynutil.insert(\"integer_part: \\\"0\\\" \", weight=.001) # For cases where no integer precedes the decimal separator\n", + "\n", + " graph_string_of_cardinals = delete_space + cardinal\n", + " graph_string_of_cardinals = pynini.closure(graph_string_of_cardinals, 1)\n", + " graph_fractional = pynutil.insert(\"fractional_part: \\\"\") + graph_string_of_cardinals + pynutil.insert(\"\\\"\")\n", + "\n", + " graph_decimal_no_sign = graph_integer_or_none + delete_virgule + graph_fractional \n", + "\n", + " ### NEW GRAPH HERE\n", + " # `cardinal` was reassigned to a graph above, so reuse `quantities_cardinal` here\n", + " self.final_graph_wo_negative = graph_decimal_no_sign | get_quantity(\n", + " graph_decimal_no_sign, quantities_cardinal\n", + " )\n", + " \n", + " graph_negative = pynini.cross(\"moins\", \"negative: \\\"-\\\" \") + delete_space\n", + " graph_negative = pynini.closure(graph_negative, 0, 1)\n", + " graph_decimal = graph_negative + graph_decimal_no_sign\n", + "\n", + " # Union default decimal with version that accepts quantities\n", + " graph_decimal |= 
graph_negative + get_quantity(\n", + " graph_decimal_no_sign, quantities_cardinal\n", + " )\n", + " final_graph = self.add_tokens(graph_decimal)\n", + " self.fst = final_graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This allows us to change our grammar to:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_text_processing.inverse_text_normalization.fr.taggers import cardinal, decimal\n", + "\n", + "cardinal_graph = cardinal.CardinalFst()\n", + "decimal_graph = decimal.DecimalFst(cardinal_graph)\n", + "\n", + "graph_cardinal = cardinal_graph.graph_no_exception # graphs cardinals w/o tokenization\n", + "graph_decimal = decimal_graph.final_graph_wo_negative # graphs positive decimals w/o tokenization\n", + "\n", + "add_leading_zero_to_double_digit = (NEMO_DIGIT + NEMO_DIGIT) | (pynutil.insert(\"0\") + NEMO_DIGIT)\n", + "graph_fractional_values = graph_cardinal @ add_leading_zero_to_double_digit" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "L1RHoW-TLzIz" + }, + "source": [ + "Note that by doing this, we're also incorporating the formatting from the `decimal` class up to this point. Since these overlap with the `money` class (see next section), we have saved ourselves some work. \n", + "\n", + "Since we already made `get_quantity` part of our `DecimalFst`, we can avoid dealing with large quantities now. However, this does mean we still need a way to leave currencies 'as is' without normalization. We can do this by using the `project` method, which will create a WFST that accepts either all valid inputs or all valid outputs of another WFST (depending on the argument)."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7l_TLtJkMluU" + }, + "outputs": [], + "source": [ + "major_currency_no_normalize = major_currency.project(\"input\")\n", + "apply_fst(\"euro\", major_currency_no_normalize)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "raBdHc_WXEpG" + }, + "source": [ + "We then append this WFST with a WFST that recognizes prepositions commonly used before large values of currency (\"d'\", \"des\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CEuxiVgDXRBf" + }, + "outputs": [], + "source": [ + "graph_preposition = pynini.union(\"des \", \"d'\") # Used for large amounts (billions de euros)\n", + "major_currency_no_normalize = pynini.closure(graph_preposition, 0, 1) + major_currency.project(\"input\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FlXmf8Fq_Rm1" + }, + "source": [ + "## Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T5BBuQRzLuXS" + }, + "source": [ + "For the Money semiotic class, we have available the following properties for tokenization:\n", + "- `integer_part`\n", + "- `fractional_part` \n", + "- `currency`\n", + "\n", + "Laying the initial groundwork seems simple enough. 
We first instantiate our `MoneyFst` classifier with our initial grammars:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EZaCeHcFWVP3" + }, + "outputs": [], + "source": [ + "class MoneyFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst, decimal: GraphFst):\n", + " super().__init__(name=\"money\", kind=\"classify\")\n", + " major_currency = pynini.string_map([(\"euro\", \"€\")])\n", + " minor_currency = pynini.string_map([(\"centime\", \"€\")])\n", + "\n", + " graph_plural = pynutil.delete(\"s\").ques\n", + "\n", + " major_currency += graph_plural\n", + " minor_currency += graph_plural\n", + "\n", + " major_currency_no_normalize = major_currency.project(\"input\")\n", + " graph_preposition = pynini.union(\"des \", \"d'\") # Used for large amounts (billions de euros)\n", + " major_currency_no_normalize = graph_preposition + major_currency.project(\"input\")\n", + "\n", + " graph_cardinal = cardinal.graph_no_exception\n", + " graph_decimal = decimal.final_graph_wo_negative\n", + "\n", + " add_leading_zero_to_double_digit = (NEMO_DIGIT + NEMO_DIGIT) | (pynutil.insert(\"0\") + NEMO_DIGIT)\n", + " graph_fractional_values = graph_cardinal @ add_leading_zero_to_double_digit" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_bpkXroLWaBo" + }, + "source": [ + "Let us now manage the `currency` property. We have the following scenarios to consider:\n", + "- Major denomination only\n", + "- Minor denomination only\n", + "- Major denomination and implicit minor denomination (\"cinq euro trois\")\n", + "- Major denomination and explicit minor denomination (\"cinq euros et trois centimes\")\n", + "- Large quantities of euros (\"cinq billion des euros\")\n", + "\n", + "Note how across cases the use of `graph_cardinal` and `graph_decimal` will be applied differently. Further, we may have varying orders in which tags are assigned proper values. 
For instance, if we have only a minor denomination we would assign `fractional_part` before `currency`. Meanwhile, a major denomination with an implicit minor denomination would follow the order `integer_part`, `currency`, `fractional_part`. While we could try to figure out a way to preserve order, recall that the use of permutations in NeMo ITN makes that unnecessary: we can assume the desired order of tags reaches our Verbalizer without making overt efforts in our Classifier! \n", + "\n", + "For example, let's say we need to process \"five dollars\" as `$5.00`. Processed linearly, we could get a token sequence along the lines of: `{ integer_part: \"5\" currency: \"$\" }`. If we passed this token array straight to a Verbalizer, we would need to configure a graph that effectively reverses the order so we could parse the `currency` field prior to the `integer_part` field, perhaps something along the lines of: \n", + "\n", + "`pynutil.insert(\"$\") + delete_space + pynutil.delete('integer_part: \\\"') +.... + pynutil.delete('currency: \"$\"')`\n", + "\n", + "But since NeMo creates permutations of our Classifier outputs, this is unnecessary. We can simply assume whatever would be the most convenient order for us (e.g. `{ currency: \"$\" integer_part: \"5\" }`) and build our Verbalizer around that:\n", + "\n", + "`pynutil.delete('currency: \\\"') + NEMO_SIGMA + pynutil.delete('\\\" integer_part: \\\"') + NEMO_DIGIT +...`\n", + "\n", + "Along with helping to keep our script simpler (we can focus simply on tokenization and not worry about what input order our Verbalizers will accept), this also allows us to overcome structural constraints of WFSTs, namely that they are [limited in reordering text strings](https://en.wikipedia.org/wiki/Pushdown_automaton)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fMZ13D2Dh9ZF" + }, + "source": [ + "Keeping this in mind, let's begin mapping the proper tags. 
Since they're relatively simple, we can start with only major and minor denominations:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "EtwWLp7VbbjM" + }, + "outputs": [], + "source": [ + "graph_integer_component = pynutil.insert(\"integer_part: \\\"\") + graph_cardinal + pynutil.insert(\"\\\"\")\n", + "graph_fractional_component = pynutil.insert(\"fractional_part: \\\"\") + graph_fractional_values + pynutil.insert(\"\\\"\")\n", + "\n", + "graph_major_currency = pynutil.insert(\" currency: \\\"\") + major_currency + pynutil.insert(\"\\\"\")\n", + "graph_minor_currency = pynutil.insert(\" currency: \\\"\") + minor_currency + pynutil.insert(\"\\\"\")\n", + "\n", + "graph_only_major_money = graph_integer_component + delete_space + graph_major_currency\n", + "graph_only_minor_money = graph_fractional_component + delete_space + graph_minor_currency " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XTmxrK4DmS39" + }, + "source": [ + "Now we may append the case of an implicit `fractional_part` to `graph_only_major_money`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Zvzn3pQinkT0" + }, + "outputs": [], + "source": [ + "implicit_fractional_part = delete_space + pynutil.insert(\"fractional_part: \\\"\") + graph_fractional_values + pynutil.insert(\"\\\"\") \n", + "implicit_fractional_part = pynini.closure(implicit_fractional_part, 0, 1) " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "tKFZkCVmn1OX" + }, + "source": [ + "And the explicit fractional portion:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d_h0pTlMn3jz" + }, + "outputs": [], + "source": [ + "delete_et = pynutil.delete(\"et \") # Sometimes prefaces the minor currency\n", + "delete_et = pynini.closure(delete_et, 0 , 1)\n", + "\n", + "delete_minor = pynutil.delete(minor_currency.project(\"input\")) # to remove the minor currency\n", + "\n", + 
"explicit_fractional_part = pynutil.insert(\"fractional_part: \\\"\") + graph_fractional_values + pynutil.insert(\"\\\"\") \n", + "explicit_fractional_part = delete_space + delete_et + explicit_fractional_part + delete_space + delete_minor\n", + "explicit_fractional_part = pynini.closure(explicit_fractional_part, 0, 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rvnpAudgo-o3" + }, + "source": [ + "We join them together:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qYzlIRWTpD8e" + }, + "outputs": [], + "source": [ + "graph_major_money = graph_only_major_money + (implicit_fractional_part | explicit_fractional_part)\n", + "graph_standard_money = graph_major_money | graph_only_minor_money" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TzeaKXVzpYs8" + }, + "source": [ + "Finishing with the case of large quantities of money, we need to use `graph_decimal` so we can exploit its ability to map quantities. Note that since we are using a pre-existing WFST, we can skip inserting the tags ourselves, since this is already done by the Decimal WFST. As long as we remember to process this aspect with our Verbalizer, we can spare ourselves the extra step."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LnqX9mGFpmJm" + }, + "outputs": [], + "source": [ + "graph_large_money = pynutil.insert(\" currency: \\\"\") + major_currency_no_normalize + pynutil.insert(\"\\\"\")\n", + "graph_large_money = graph_decimal + delete_space + graph_large_money" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "24TUZnJKqgPA" + }, + "source": [ + "Altogether, this would give the following Classifier:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "B7-muCO2qizg" + }, + "outputs": [], + "source": [ + "class MoneyFst(GraphFst):\n", + " def __init__(self, cardinal: GraphFst, decimal: GraphFst):\n", + " super().__init__(name=\"money\", kind=\"classify\")\n", + " major_currency = pynini.string_map([(\"euro\", \"€\")])\n", + " minor_currency = pynini.string_map([(\"centime\", \"€\")])\n", + "\n", + " graph_plural = pynutil.delete(\"s\").ques\n", + "\n", + " major_currency += graph_plural\n", + " minor_currency += graph_plural\n", + "\n", + " major_currency_no_normalize = major_currency.project(\"input\")\n", + " graph_preposition = pynini.union(\"des \", \"d'\") # Used for large amounts (billions de euros)\n", + " major_currency_no_normalize = graph_preposition + major_currency.project(\"input\")\n", + "\n", + " graph_cardinal = cardinal.graph_no_exception\n", + " graph_decimal = decimal.final_graph_wo_negative\n", + "\n", + " add_leading_zero_to_double_digit = (NEMO_DIGIT + NEMO_DIGIT) | (pynutil.insert(\"0\") + NEMO_DIGIT)\n", + " graph_fractional_values = graph_cardinal @ add_leading_zero_to_double_digit\n", + "\n", + " graph_integer_component = pynutil.insert(\"integer_part: \\\"\") + graph_cardinal + pynutil.insert(\"\\\"\")\n", + " graph_fractional_component = pynutil.insert(\"fractional_part: \\\"\") + graph_fractional_values + pynutil.insert(\"\\\"\")\n", + "\n", + " graph_major_currency = pynutil.insert(\" currency: \\\"\") + major_currency 
+ pynutil.insert(\"\\\"\")\n", + " graph_minor_currency = pynutil.insert(\" currency: \\\"\") + minor_currency + pynutil.insert(\"\\\"\")\n", + "\n", + " graph_only_major_money = graph_integer_component + delete_space + graph_major_currency\n", + " graph_only_minor_money = graph_fractional_component + delete_space + graph_minor_currency \n", + "\n", + " implicit_fractional_part = delete_space + pynutil.insert(\"fractional_part: \\\"\") + graph_fractional_values + pynutil.insert(\"\\\"\") \n", + " implicit_fractional_part = pynini.closure(implicit_fractional_part, 0, 1) \n", + "\n", + "\n", + " delete_et = pynutil.delete(\"et \") # Sometimes prefaces the minor currency\n", + " delete_et = pynini.closure(delete_et, 0 , 1)\n", + "\n", + " delete_minor = pynutil.delete(minor_currency.project(\"input\")) # to remove the minor currency\n", + "\n", + " explicit_fractional_part = pynutil.insert(\"fractional_part: \\\"\") + graph_fractional_values + pynutil.insert(\"\\\"\") \n", + " explicit_fractional_part = delete_space + delete_et + explicit_fractional_part + delete_space + delete_minor\n", + " explicit_fractional_part = pynini.closure(explicit_fractional_part, 0, 1)\n", + "\n", + " graph_major_money = graph_only_major_money + (implicit_fractional_part | explicit_fractional_part)\n", + "\n", + " graph_large_money = pynutil.insert(\" currency: \\\"\") + major_currency_no_normalize + pynutil.insert(\"\\\"\")\n", + " graph_large_money = graph_decimal + delete_space + graph_large_money\n", + "\n", + " final_graph = graph_large_money | graph_major_money | graph_only_minor_money\n", + "\n", + " final_graph = self.add_tokens(final_graph)\n", + " self.fst = final_graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see the results:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_text_processing.inverse_text_normalization.fr.taggers import decimal, 
cardinal\n", + "\n", + "cardFst = cardinal.CardinalFst()\n", + "decFst = decimal.DecimalFst(cardFst)\n", + "\n", + "moneyFst = MoneyFst(cardFst, decFst).fst\n", + "\n", + "example = \"douze virgule cinq billions d'euros\"\n", + "\n", + "apply_fst(example, moneyFst)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "gxdcyuLmAZZa" + }, + "source": [ + "## Verbalizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZZFDWNwY6sOG" + }, + "source": [ + "By this point, creating the Verbalizer should be rather straightforward: delete the expected tokens and perform any specific formatting that was not handled by the Classifier. \n", + "\n", + "In fact, it is so straightforward that much of the work does not even need to be explicitly managed by the Verbalizer. As mentioned previously, two of the properties we inserted in our Classifier were already referenced in our `DecimalFst`: `integer_part` and `fractional_part`. We even went so far as to directly call a component of `DecimalFst` in our Classifier. As such, outside of the `currency` property, there is little in our Money token that differs from a standard Decimal token. Indeed, even the normalized forms are similar (`200,5` vs. `200,5 €`). " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T7sgH0t79tmU" + }, + "source": [ + "Given these similarities, we can save ourselves some work and simply use the Decimal Verbalizer to manage much of the normalization. 
Let's look at the basic format of our `MoneyFst` verbalizer, writing it so it accepts a `DecimalFst` as input:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "BEu8nITP9mSG" + }, + "outputs": [], + "source": [ + "class MoneyFst(GraphFst):\n", + " def __init__(self, decimal: GraphFst):\n", + " super().__init__(name=\"money\", kind=\"verbalize\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JYVLou5N-Dk8" + }, + "source": [ + "We first handle the `currency` property, deleting the tag and its quotes while keeping the value:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LO35tJ7G-H6N" + }, + "outputs": [], + "source": [ + "class MoneyFst(GraphFst):\n", + " def __init__(self, decimal: GraphFst):\n", + " super().__init__(name=\"money\", kind=\"verbalize\")\n", + " unit = (\n", + " pynutil.delete(\"currency:\")\n", + " + delete_extra_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_NOT_QUOTE, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "bDS8XSII-Dpd" + }, + "source": [ + "Now consider: we need to normalize an integer component, a fractional component, and a decimal separator between them. Since NeMo automatically generates all permutations of the tags, we are free to assume whichever order is most convenient. As such, we can assume we get the exact order that is accepted by our `DecimalFst`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "VtGfpjVA-r3u" + }, + "outputs": [], + "source": [ + "class MoneyFst(GraphFst):\n", + " def __init__(self, decimal: GraphFst):\n", + " super().__init__(name=\"money\", kind=\"verbalize\")\n", + " unit = (\n", + " pynutil.delete(\"currency:\")\n", + " + delete_extra_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_NOT_QUOTE, 1)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + " graph = decimal.numbers + delete_space + unit\n", + " delete_tokens = self.delete_tokens(graph)\n", + " self.fst = delete_tokens.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ZefxZLIU-uRU" + }, + "source": [ + "It is as simple and compact as appending the `unit` component to the preexisting `decimal.numbers`. \n", + "\n", + "This feature is worth keeping in mind as you build up to more concrete classes: the combination of guaranteed tag permutations and prebuilt Verbalizers makes the addition of semiotic classes progressively simpler despite the growing complexity of your entire grammar." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WydC7Cn28l5Y" + }, + "source": [ + "# Time WFST " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VelunbumCJJe" + }, + "source": [ + "Our next composite graph will be for the Time WFST. Here, you may see more variation between your language and our example than with our previous classes. There are a number of reasons for this. Among them: while there may be some standard cross-linguistic patterns regarding time (e.g. `quantity_of_hours + quantity_of_minutes`), the use of various equivalent phrases can make an exhaustive grammar incredibly specific (e.g. consider managing \"twelve fifteen\", \"twelve and a quarter\", \"quarter past twelve\", \"quarter after twelve\", and \"forty five until one\" all together). 
You may find yourself drawing upon WFSTs that accommodate Cardinals, Fractions, and some basic subtraction.\n", + "\n", + "As such, we are going to focus on those aspects of the Time WFST that are necessary for a functional normalization of time-related phrases, saving a more exhaustive grammar for your own specific languages and use cases." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8wqb28wzATOR" + }, + "source": [ + "## Grammar" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AVntDM3AEz0v" + }, + "source": [ + "For our Time WFST, we will focus on the following aspects:\n", + "- Use of a 24- or 12-hour base\n", + "- Use of fraction terminology (e.g. \"quarter\" = `15`)\n", + "- Accommodation of keywords (\"noon\", \"midnight\")\n", + "- Counting backwards from the hour (\"ten to five\", \"five to three\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "seU9hTbgFgu7" + }, + "source": [ + "We'll start with the basic system.\n", + "\n", + "For French, time operates on a twenty-four-hour system, with the zeroth hour being midnight. Time is given in the following format:\n", + "\n", + "`cardinal + heure(s) + (cardinal)` \n", + "\n", + "This is normalized as:\n", + "\n", + "`cardinal h (cardinal)`\n", + "\n", + "For instance, for `3:03`, we would have:\n", + "- input: \"trois heures trois\"\n", + "- output: `3 h 03`\n", + "\n", + "As such, our grammar needs to utilize a Cardinal WFST and have a means to accept \"heures\" from the input. Taking care of the latter case is simple enough:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "HTSVxf4fI_ND" + }, + "outputs": [], + "source": [ + "graph_heures = pynini.accep(\"heure\") + pynini.accep(\"s\").ques\n", + "graph_heures = pynutil.delete(graph_heures)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6LW7pXaXJSZa" + }, + "source": [ + "For the cardinals, we could pass an instance of `CardinalFst` to our graph. 
But do we really need that level of coverage? We only really need to cover the numbers 0 to 59, which we could simply write a new WFST for. Further, it may be beneficial to let our graph rule out possible ambiguity. While we will not cover it in our tutorial, you may in the future find it necessary to build a WFST for Measurements, of which quantities of time may play a part. Would it not be helpful for your WFST to know that \"thirty hours\" could only ever be a measurement instead of a possible time of day?\n", + "\n", + "Given the small amount of effort required and the quick benefit, we choose to make our hours and minutes explicit in the Time WFST." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "R4aa06ZPLKIR" + }, + "outputs": [], + "source": [ + "hours = pynini.string_map([\n", + " (\"zéro\",\"0\"),\n", + " (\"une\",\"1\"),\n", + " (\"deux\",\"2\"),\n", + " (\"trois\",\"3\"),\n", + " (\"quatre\",\"4\"),\n", + " (\"cinq\",\"5\"),\n", + " (\"six\",\"6\"),\n", + " (\"sept\",\"7\"),\n", + " (\"huit\",\"8\"),\n", + " (\"neuf\",\"9\"),\n", + " (\"dix\",\"10\"),\n", + " (\"onze\",\"11\"),\n", + " (\"douze\",\"12\"),\n", + " (\"treize\",\"13\"),\n", + " (\"quatorze\",\"14\"),\n", + " (\"quinze\",\"15\"),\n", + " (\"seize\",\"16\"),\n", + " (\"dix-sept\",\"17\"),\n", + " (\"dix-huit\",\"18\"),\n", + " (\"dix-neuf\",\"19\"),\n", + " (\"vingt\",\"20\"),\n", + " (\"vingt-et-une\",\"21\"),\n", + " (\"vingt et une\",\"21\"),\n", + " (\"vingt-deux\",\"22\"),\n", + " (\"vingt-trois\",\"23\"),\n", + " (\"vingt-quatre\",\"24\"),\n", + "])\n", + "minutes = pynini.string_map([\n", + " (\"une\", \"01\"),\n", + " (\"deux\", \"02\"),\n", + " (\"trois\", \"03\"),\n", + " (\"quatre\", \"04\"),\n", + " (\"cinq\", \"05\"),\n", + " (\"six\", \"06\"),\n", + " (\"sept\", \"07\"),\n", + " (\"huit\", \"08\"),\n", + " (\"neuf\", \"09\"),\n", + " (\"dix\", \"10\"),\n", + " (\"onze\", \"11\"),\n", + " (\"douze\", \"12\"),\n", + " (\"treize\", 
\"13\"),\n", + " (\"quatorze\", \"14\"),\n", + " (\"quinze\", \"15\"),\n", + " (\"seize\", \"16\"),\n", + " (\"dix-sept\", \"17\"),\n", + " (\"dix-huit\", \"18\"),\n", + " (\"dix-neuf\", \"19\"),\n", + " (\"vingt\", \"20\"),\n", + " (\"vingt-et-une\", \"21\"),\n", + " (\"vingt et une\", \"21\"),\n", + " (\"vingt-deux\", \"22\"),\n", + " (\"vingt-trois\", \"23\"),\n", + " (\"vingt-quatre\", \"24\"),\n", + " (\"vingt-cinq\", \"25\"),\n", + " (\"vingt-six\", \"26\"),\n", + " (\"vingt-sept\", \"27\"),\n", + " (\"vingt-huit\", \"28\"),\n", + " (\"vingt-neuf\", \"29\"),\n", + " (\"trente\", \"30\"),\n", + " (\"trente-et-une\", \"31\"),\n", + " (\"trente et une\", \"31\"),\n", + " (\"trente-deux\", \"32\"),\n", + " (\"trente-trois\", \"33\"),\n", + " (\"trente-quatre\", \"34\"),\n", + " (\"trente-cinq\", \"35\"),\n", + " (\"trente-six\", \"36\"),\n", + " (\"trente-sept\", \"37\"),\n", + " (\"trente-huit\", \"38\"),\n", + " (\"trente-neuf\", \"39\"),\n", + " (\"quarante\", \"40\"),\n", + " (\"quarante-et-une\", \"41\"),\n", + " (\"quarante et une\", \"41\"),\n", + " (\"quarante-deux\", \"42\"),\n", + " (\"quarante-trois\", \"43\"),\n", + " (\"quarante-quatre\", \"44\"),\n", + " (\"quarante-cinq\", \"45\"),\n", + " (\"quarante-six\", \"46\"),\n", + " (\"quarante-sept\", \"47\"),\n", + " (\"quarante-huit\", \"48\"),\n", + " (\"quarante-neuf\", \"49\"),\n", + " (\"cinquante\", \"50\"),\n", + " (\"cinquante-et-une\", \"51\"),\n", + " (\"cinquante et une\", \"51\"),\n", + " (\"cinquante-deux\", \"52\"),\n", + " (\"cinquante-trois\", \"53\"),\n", + " (\"cinquante-quatre\", \"54\"),\n", + " (\"cinquante-cinq\", \"55\"),\n", + " (\"cinquante-six\", \"56\"),\n", + " (\"cinquante-sept\", \"57\"),\n", + " (\"cinquante-huit\", \"58\"),\n", + " (\"cinquante-neuf\", \"59\"),\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4SmNsNKLM9cC" + }, + "source": [ + "Now that we've managed the basic graph, we can address some of the more niche rules of French 
timekeeping.\n", + "\n", + "To start, French employs some colloquialisms that will be familiar to English speakers: minutes that are multiples of fifteen are referred to as fractions of the hour. In particular:\n", + "- `5 h 15` -> \"cinq heures **et quart**\"\n", + "- `5 h 30` -> \"cinq heures **et demie**\"\n", + "- `5 h 45` -> \"cinq heures **et trois quarts**\"\n", + "\n", + "We thus need a means of rendering these as their numerical equivalents:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "xHe3nfrpSlrE" + }, + "outputs": [], + "source": [ + "# Mapping 'et demi', 'et quart' and 'et trois quarts'\n", + "graph_et = pynutil.delete(\"et\") + delete_space\n", + "\n", + "graph_demi = pynini.accep(\"demi\")\n", + "graph_demi += pynini.accep(\"e\").ques # people vary on feminine or masculine form\n", + "graph_demi = pynini.cross(graph_demi, \"30\")\n", + "\n", + "graph_quart = pynini.accep('quart')\n", + "graph_quart = pynini.cross(graph_quart, '15')\n", + "graph_trois_quart = pynini.cross(\"trois quarts\", \"45\")\n", + "\n", + "graph_fractions = graph_demi | graph_quart | graph_trois_quart\n", + "graph_fractions = graph_et + graph_fractions" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "HD2wobIQS3fX" + }, + "source": [ + "Also like English, French uses keywords to designate specific times. Noon and midnight are \"midi\" and \"minuit\" respectively." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ahbkiZFuTN2t" + }, + "outputs": [], + "source": [ + "# Midi and minuit\n", + "graph_midi = pynini.cross(\"midi\", \"12\")\n", + "graph_minuit = pynini.cross(\"minuit\", \"0\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6OyMoqfZTX1U" + }, + "source": [ + "Now it's time to throw a wrench into things: counting backwards from the hour. 
How are we to get what is essentially a graph to do the subtraction necessary for \"ten to twelve\" to become `11:50`?\n", + "\n", + "Easy: we build the subtraction into the graph itself. That is, we map the hours and minutes produced by our graph onto another graph that outputs the shifted values: the hour moved back by one and the minutes subtracted from sixty.\n", + "\n", + "Let's take our \"ten to twelve\" example. Normally \"ten\" would map to `10` and \"twelve\" to `12`. But with these new graphs, the detection of the pattern `minute + to + hour` would signal that `10` should now become `50` and `12` become `11`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uMWifbm1VQjP" + }, + "source": [ + "Let us do this for our French example. Luckily enough, the signal that a French string is counting backwards from the hour is regular: it follows the pattern `cardinal + heures + moins + minutes`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "c4bV3T1pViCH" + }, + "outputs": [], + "source": [ + "hours_to = pynini.string_map([\n", + " (\"1\",\"0\"),\n", + " (\"2\",\"1\"),\n", + " (\"3\",\"2\"),\n", + " (\"4\",\"3\"),\n", + " (\"5\",\"4\"),\n", + " (\"6\",\"5\"),\n", + " (\"7\",\"6\"),\n", + " (\"8\",\"7\"),\n", + " (\"9\",\"8\"),\n", + " (\"10\",\"9\"),\n", + " (\"11\",\"10\"),\n", + " (\"12\",\"11\"),\n", + " (\"13\",\"12\"),\n", + " (\"14\",\"13\"),\n", + " (\"15\",\"14\"),\n", + " (\"16\",\"15\"),\n", + " (\"17\",\"16\"),\n", + " (\"18\",\"17\"),\n", + " (\"19\",\"18\"),\n", + " (\"20\",\"19\"),\n", + " (\"21\",\"20\"),\n", + " (\"22\",\"21\"),\n", + " (\"23\",\"22\"),\n", + " (\"24\",\"23\"),\n", + " (\"0\",\"23\"),\n", + "])\n", + "minutes_to = pynini.string_map([\n", + " (\"59\", \"01\"),\n", + " (\"58\", \"02\"),\n", + " (\"57\", \"03\"),\n", + " (\"56\", \"04\"),\n", + " (\"55\", \"05\"),\n", + " (\"54\", \"06\"),\n", + " (\"53\", \"07\"),\n", + " (\"52\", \"08\"),\n", + " (\"51\", \"09\"),\n", + " (\"50\", \"10\"),\n", + " (\"49\", \"11\"),\n", + 
" (\"48\", \"12\"),\n", + " (\"47\", \"13\"),\n", + " (\"46\", \"14\"),\n", + " (\"45\", \"15\"),\n", + " (\"44\", \"16\"),\n", + " (\"43\", \"17\"),\n", + " (\"42\", \"18\"),\n", + " (\"41\", \"19\"),\n", + " (\"40\", \"20\"),\n", + " (\"39\", \"21\"),\n", + " (\"38\", \"22\"),\n", + " (\"37\", \"23\"),\n", + " (\"36\", \"24\"),\n", + " (\"35\", \"25\"),\n", + " (\"34\", \"26\"),\n", + " (\"33\", \"27\"),\n", + " (\"32\", \"28\"),\n", + " (\"31\", \"29\"),\n", + " (\"30\", \"30\"),\n", + " (\"29\", \"31\"),\n", + " (\"28\", \"32\"),\n", + " (\"27\", \"33\"),\n", + " (\"26\", \"34\"),\n", + " (\"25\", \"35\"),\n", + " (\"24\", \"36\"),\n", + " (\"23\", \"37\"),\n", + " (\"22\", \"38\"),\n", + " (\"21\", \"39\"),\n", + " (\"20\", \"40\"),\n", + " (\"19\", \"41\"),\n", + " (\"18\", \"42\"),\n", + " (\"17\", \"43\"),\n", + " (\"16\", \"44\"),\n", + " (\"15\", \"45\"),\n", + " (\"14\", \"46\"),\n", + " (\"13\", \"47\"),\n", + " (\"12\", \"48\"),\n", + " (\"11\", \"49\"),\n", + " (\"10\", \"50\"),\n", + " (\"09\", \"51\"),\n", + " (\"08\", \"52\"),\n", + " (\"07\", \"53\"),\n", + " (\"06\", \"54\"),\n", + " (\"05\", \"55\"),\n", + " (\"04\", \"56\"),\n", + " (\"03\", \"57\"),\n", + " (\"02\", \"58\"),\n", + " (\"01\", \"59\"),\n", + "])\n", + "graph_moins = pynutil.delete(\"moins\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XOKETkIYZy5M" + }, + "source": [ + "Why graph the digits instead of the tokens themselves? Along with avoiding some minor repetition and making editing more apparent, it allows this subgraph to be ported to other languages - if so desired.\n", + "\n", + "Further, it helps us illustrate a helpful idea within this tutorial: as long as a pattern is regular and/or finite, it is no major issue to accomodate it in our graph, regardless of mathematic or logic system it employs." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DJbFiD2fAUc5" + }, + "source": [ + "## Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cK0SGXntaDkI" + }, + "source": [ + "Once again we place the grammar within the proper child class of `GraphFst`. We also insert the proper tags for the `Time` class, which are:\n", + "- `hours`\n", + "- `minutes`\n", + "- `suffix` (explained within this section)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9Eq5r-_VbBIg" + }, + "outputs": [], + "source": [ + "graph_hours_component = pynini.union(hours, graph_midi, graph_minuit)\n", + "graph_hours_component = pynutil.insert(\"hours: \\\"\") + graph_hours_component + pynutil.insert(\"\\\"\")\n", + "\n", + "graph_minutes_component = (\n", + " pynutil.insert(\" minutes: \\\"\") + pynini.union(minutes, graph_fractions) + pynutil.insert(\"\\\"\")\n", + ") \n", + "graph_minutes_component = delete_space + graph_minutes_component\n", + "\n", + "graph_time_standard = (graph_hours_component + delete_space + graph_heures \n", + " + pynini.closure(graph_minutes_component, 0, 1))" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2avfS3IacSiC" + }, + "source": [ + "We now set up the alternate graph that allows backwards counting. Note that this is triggered by the occurrence of \"moins\" between the hour and minute components." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "TmpwisOVcn0T" + }, + "outputs": [], + "source": [ + "graph_hours_to_component = hours | graph_midi | graph_minuit\n", + "graph_hours_to_component @= hours_to\n", + "graph_hours_to_component = pynutil.insert(\"hours: \\\"\") + graph_hours_to_component + pynutil.insert(\"\\\"\")\n", + "graph_hours_to_component = graph_hours_to_component + delete_space + graph_heures\n", + "\n", + "graph_minutes_to_component = (minutes | graph_demi | # No 'et' in fractions\n", + " (pynutil.delete(\"le \") + graph_quart) | graph_trois_quart)\n", + "graph_minutes_to_component @= minutes_to\n", + "graph_minutes_to_component = pynutil.insert(\" minutes: \\\"\") + graph_minutes_to_component + pynutil.insert(\"\\\"\")\n", + "\n", + "graph_time_to = graph_hours_to_component + delete_space + graph_moins + delete_space + graph_minutes_to_component" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "FkO4tRRfdQT4" + }, + "source": [ + "We now join it with our main component, allowing us to graph all times:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "0O0vUVizdU8c" + }, + "outputs": [], + "source": [ + "graph_time = graph_time_standard | graph_time_to" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "jbX4JV-LdY3Y" + }, + "source": [ + "Once again we throw a wrench into things with the `suffix` feature. As in the case of Ordinals and Decimals, key-words can play into our Time WFST. For French, this occurs with the words \"du matin\", \"de l'après-midi\", and \"du soir\". (Respectively: \"in the morning\", \"in the afternoon\", and \"in the evening\".) Much like in English, these phrases alter how we write down the time. But instead of indicating `a.m.` or `p.m.`, these indicate *what hour system is used*. 
For example:\n", + "- \"deux heures du matin\" -> `2 h` = `2:00 a.m.`\n", + "- \"deux heures de l'après-midi\" -> `14 h` = `2:00 p.m.`\n", + "\n", + "Only a twelve hour system is used when these suffixes accompany the time. As such, our Classifier will need to either adjust the times like in the case of counting backwards or must pass the information to the Verbalizer so it can adjust. \n", + "\n", + "Since our Classifier is long enough as is, we will simply store this information in the `suffix` property and allow the Verbalizer to manage." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "OqVa78zRgJw9" + }, + "outputs": [], + "source": [ + "graph_suffix_am = pynini.cross(\"du matin\", \"am\")\n", + "graph_suffix_pm = pynini.string_map([(\"de l'après-midi\", \"pm\"),(\"du soir\", \"pm\")])\n", + "\n", + "graph_suffix = pynini.cross(graph_suffix_am, \"am\") | pynini.cross(graph_suffix_pm, \"pm\")\n", + "\n", + "graph_suffix_component = pynutil.insert(\" suffix: \\\"\") + graph_suffix + pynutil.insert(\"\\\"\")\n", + "graph_suffix_component = delete_space + graph_suffix_component\n", + "graph_suffix_component = pynini.closure(graph_suffix_component, 0, 1)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-LaJMIjUf1XR" + }, + "source": [ + "And we append to our graph:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "76myCFiggX3E" + }, + "outputs": [], + "source": [ + "class TimeFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"time\", kind=\"classify\")\n", + " \"\"\"grammar omitted for length\n", + " ....\n", + " ....\n", + " ....\n", + " \"\"\"\n", + " graph_hours_component = pynini.union(hours, graph_midi, graph_minuit)\n", + " graph_hours_component = pynutil.insert(\"hours: \\\"\") + graph_hours_component + pynutil.insert(\"\\\"\")\n", + "\n", + " graph_minutes_component = (\n", + " pynutil.insert(\" minutes: \\\"\") + pynini.union(minutes, 
graph_fractions) + pynutil.insert(\"\\\"\")\n", + " ) \n", + " graph_minutes_component = delete_space + graph_minutes_component\n", + "\n", + " graph_time_standard = (graph_hours_component + delete_space + graph_heures \n", + " + pynini.closure(graph_minutes_component, 0, 1))\n", + "\n", + " graph_hours_to_component = hours | graph_midi | graph_minuit\n", + " graph_hours_to_component @= hours_to\n", + " graph_hours_to_component = pynutil.insert(\"hours: \\\"\") + graph_hours_to_component + pynutil.insert(\"\\\"\")\n", + " graph_hours_to_component = graph_hours_to_component + delete_space + graph_heures\n", + "\n", + " graph_minutes_to_component = (minutes | graph_demi | # No 'et' in fractions\n", + " (pynutil.delete(\"le \") + graph_quart) | graph_trois_quart)\n", + " graph_minutes_to_component @= minutes_to\n", + " graph_minutes_to_component = pynutil.insert(\" minutes: \\\"\") + graph_minutes_to_component + pynutil.insert(\"\\\"\")\n", + "\n", + " graph_time_to = graph_hours_to_component + delete_space + graph_moins + delete_space + graph_minutes_to_component\n", + "\n", + " graph_time_no_suffix = graph_time_standard | graph_time_to\n", + "\n", + " graph_suffix_am = pynini.cross(\"du matin\", \"am\")\n", + " graph_suffix_pm = pynini.string_map([(\"de l'après-midi\", \"pm\"),(\"du soir\", \"pm\")])\n", + "\n", + " graph_suffix = pynini.cross(graph_suffix_am, \"am\") | pynini.cross(graph_suffix_pm, \"pm\")\n", + "\n", + " graph_suffix_component = pynutil.insert(\" suffix: \\\"\") + graph_suffix + pynutil.insert(\"\\\"\")\n", + " graph_suffix_component = delete_space + graph_suffix_component\n", + " graph_suffix_component = pynini.closure(graph_suffix_component, 0, 1)\n", + " \n", + " final_graph = graph_time_no_suffix + graph_suffix_component\n", + "\n", + " final_graph = self.add_tokens(final_graph)\n", + "\n", + " self.fst = final_graph.optimize()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's see how we did:" + ] + }, + { + 
"cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "time = TimeFst().fst\n", + "example = \"quatre heures moins cinq\"\n", + "apply_fst(example, time)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lPlJ1qyeAWOL" + }, + "source": [ + "## Verbalizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CrO-xtJ87PEl" + }, + "source": [ + "The initial part of the Verbalizer should appear familiar. We delete the property tags `hours` and `minutes`, making sure they preserve the actual values for formatting." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "fCzZKR7ek0Mz" + }, + "outputs": [], + "source": [ + "hour = (\n", + " pynutil.delete(\"hours:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1, 2)\n", + " + pynutil.delete(\"\\\"\")\n", + ")\n", + "minute = (\n", + " pynutil.delete(\"minutes:\")\n", + " + delete_extra_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1, 2)\n", + " + pynutil.delete(\"\\\"\")\n", + ")\n", + "graph = hour + delete_extra_space + pynutil.insert(\"h\") + minute.ques" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WnVV9GUKk-b7" + }, + "source": [ + "We then deal with the case of `suffix`. We first note that if the suffix is for a morning time (before noon), then there is no further conversion that is needed. We may simply delete the property and its value." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "haOEiSbglc6s" + }, + "outputs": [], + "source": [ + "day_suffixes = pynutil.delete(\"suffix: \\\"am\\\"\")\n", + "\n", + "graph = hour + delete_extra_space + pynutil.insert(\"h\") + minute.ques + delete_space + day_suffixes.ques" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wL0FNg6Xlhb-" + }, + "source": [ + "Meanwhile, the post-noon suffixes require us to shift the hours value by twelve. Much like in the case of counting backwards from the hour, we can simply create a WFST to do this addition work for us." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "YLrabUNplwG7" + }, + "outputs": [], + "source": [ + "hour_to_night = pynini.string_map([\n", + " (\"1\", \"13\"),\n", + " (\"2\", \"14\"),\n", + " (\"3\", \"15\"),\n", + " (\"4\", \"16\"),\n", + " (\"5\", \"17\"),\n", + " (\"6\", \"18\"),\n", + " (\"7\", \"19\"),\n", + " (\"8\", \"20\"),\n", + " (\"9\", \"21\"),\n", + " (\"10\", \"22\"),\n", + " (\"11\", \"23\"), # Note that 12 and 24 would be phrased \"midi\" and \"minuit\" respectively\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X0-z-qJAmIiI" + }, + "source": [ + "We then create an alternate graph in which this conversion is applied to the hours value whenever a post-noon suffix is present, and take its union with our earlier graph:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "8CdEmo9NmN7u" + }, + "outputs": [], + "source": [ + "night_suffixes = pynutil.delete(\"suffix: \\\"pm\\\"\")\n", + "graph |= (\n", + " hour @ hour_to_night\n", + " + delete_extra_space\n", + " + pynutil.insert(\"h\")\n", + " + minute.ques\n", + " + delete_space\n", + " + night_suffixes\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YnoIkZBqmaTo" + }, + "source": [ + "Giving us a final Verbalizer of:" + ] + }, + { + "cell_type": "code", + 
"execution_count": null, + "metadata": { + "id": "ZfXimvFBmdDD" + }, + "outputs": [], + "source": [ + "class TimeFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"time\", kind=\"verbalize\")\n", + "\n", + " hour_to_night = pynini.string_map([\n", + " (\"1\", \"13\"),\n", + " (\"2\", \"14\"),\n", + " (\"3\", \"15\"),\n", + " (\"4\", \"16\"),\n", + " (\"5\", \"17\"),\n", + " (\"6\", \"18\"),\n", + " (\"7\", \"19\"),\n", + " (\"8\", \"20\"),\n", + " (\"9\", \"21\"),\n", + " (\"10\", \"22\"),\n", + " (\"11\", \"23\"),\n", + "])\n", + "\n", + " day_suffixes = pynutil.delete(\"suffix: \\\"am\\\"\")\n", + " night_suffixes = pynutil.delete(\"suffix: \\\"pm\\\"\")\n", + "\n", + " hour = (\n", + " pynutil.delete(\"hours:\")\n", + " + delete_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1, 2)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + " minute = (\n", + " pynutil.delete(\"minutes:\")\n", + " + delete_extra_space\n", + " + pynutil.delete(\"\\\"\")\n", + " + pynini.closure(NEMO_DIGIT, 1, 2)\n", + " + pynutil.delete(\"\\\"\")\n", + " )\n", + "\n", + " graph = hour + delete_extra_space + pynutil.insert(\"h\") + minute.ques + delete_space + day_suffixes.ques\n", + "\n", + " graph |= (\n", + " hour @ hour_to_night\n", + " + delete_extra_space\n", + " + pynutil.insert(\"h\")\n", + " + minute.ques\n", + " + delete_space\n", + " + night_suffixes\n", + " )\n", + " delete_tokens = self.delete_tokens(graph)\n", + " self.fst = delete_tokens.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "e5tPcCaSYuhY" + }, + "source": [ + "If you've noticed, the Verbalizer process has become simpler as we've progressed through our WFSTs. Commonly, you will seldom need to even provide the amount of overhead we've seen in `TimeFst`, `MoneyFst`, and `OrdinalFst`, and the majority of this component is simply removing tokens as an intermediary step, as we'll see for our Name class." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iHmRe3UIhyIH" + }, + "source": [ + "# WhiteList WFST " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8kMn2qB9bVFy" + }, + "source": [ + "\n", + "While developing your grammars, you may encounter tokens that refuse standard categorization and yet still require normalization. For example, you may need to render \"Mister Brown\" as `Mr. Brown` or \"H M S Nelson\" as `H.M.S. Nelson`. As these cases are rather specific, they lack a regular pattern for a specific classifier. (What is it about \"mister\" as a token that requires normalization, as opposed to \"Brown\"?) Instead, we need to explicitly list their input-output mappings (i.e. a whitelist).\n", + "\n", + "For NeMo, this is performed through the `WhiteListFst`:\n", + "\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6B4oPXYcccWs" + }, + "source": [ + "## Grammar" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RThTLUCRceOO" + }, + "source": [ + "`WhiteListFst` is essentially just a wrapper for a `string_map` or `string_file` mapping with the appropriate formatting for deployment. Per our example, we can make a graph with the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "eIOOb_wJdMMx" + }, + "outputs": [], + "source": [ + "graph = pynini.string_map([\n", + " (\"mister\", \"mr.\"),\n", + " (\"h m s\", \"h.m.s\"),\n", + " (\"doctor\", \"dr.\")\n", + "])" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "O5kTXwmPZ9Tt" + }, + "source": [ + "As previously mentioned, here is where the use of `string_file` will make maintenance much easier. Discovering whitelist mappings is an iterative process and you will more than likely need to return to your list throughout development. 
For instance, it may be obvious that tokens such as \"madame\", \"miss\", and \"esquire\" need abbreviations, but would you think of providing abbreviations for \"the right honorable\" or \"tennessee valley authority\"? Keeping a tsv file available for quick insertions greatly assists here." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "RC5Cf-Z8dYVk" + }, + "source": [ + "## Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "144nvAHEdfBJ" + }, + "source": [ + "Unlike for our other WFSTs, there is no specific semiotic class for `WhiteListFst`. It instead falls under the default Name class to designate there is no need for further processing beyond obligatory tokenization. Indeed, we can simply insert the token ourselves instead of calling `add_tokens`." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oPkrmg2gdznd" + }, + "outputs": [], + "source": [ + "class WhiteListFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"whitelist\", kind=\"classify\")\n", + "\n", + "        whitelist = pynini.string_map([\n", + "            (\"mister\", \"mr.\"),\n", + "            (\"h m s\", \"h.m.s\"),\n", + "            (\"doctor\", \"dr.\")])\n", + "        graph = pynutil.insert(\"name: \\\"\") + convert_space(whitelist) + pynutil.insert(\"\\\"\")\n", + "        self.fst = graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "B05kdSIdd2dv" + }, + "source": [ + "## Verbalizer" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Since the whitelisted token has already been rendered in the desired normalized form, all that is necessary is to strip the `name` token and render the string 'as is'. 
This can be done through the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gaq3voIYiUCA" + }, + "outputs": [], + "source": [ + "class WhiteListFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"whitelist\", kind=\"verbalize\")\n", + "        graph = (\n", + "            pynutil.delete(\"name:\")\n", + "            + delete_space\n", + "            + pynutil.delete(\"\\\"\")\n", + "            + pynini.closure(NEMO_CHAR - \" \", 1)\n", + "            + pynutil.delete(\"\\\"\")\n", + "        )\n", + "        graph = graph @ pynini.cdrewrite(pynini.cross(u\"\\u00A0\", \" \"), \"\", \"\", NEMO_SIGMA)  # Converts non-breaking spaces (inserted by convert_space) back to regular spaces\n", + "        self.fst = graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "cUE7Gg35bWKb" + }, + "source": [ + "While the graph is largely self-explanatory, take note that the default implementation assumes a character string without spacing. If you intend to include additional formatting in your normalization (e.g. `H. M. S.` instead of `H.M.S.`), you may need to adjust the graph to expand coverage." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_o_a15Fg7niv" + }, + "source": [ + "# Word and Punctuation WFST " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Zi6lP7mTmnUV" + }, + "source": [ + "Continuing with the Name class, we will conclude with the Word and Punctuation WFSTs. These are among the simplest and most crucial classes of the entire ITN system, as they classify all tokens that are not caught by other semiotic classes. 
Since these other tokens make up the majority of all strings your normalization system will encounter, they are essential for general functionality.\n", + "\n", + "However, they escape discussion as their function is self-evident: since they function as default classes, tokens only reach Word WFST and Punctuation WFST if they have not been accepted by the other WFSTs. As such, we can simply accept the tokens as they are, providing them a `name` tag."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9zCqczLqp5NW" + }, + "source": [ + "## Classifier" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "eUWum5U0p99c" + }, + "source": [ + "For instance, consider the `WordFst` Classifier in its entirety:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "CCZSTeDHofDl" + }, + "outputs": [], + "source": [ + "class WordFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"word\", kind=\"classify\")\n", + "        word = pynutil.insert(\"name: \\\"\") + pynini.closure(NEMO_NOT_SPACE, 1) + pynutil.insert(\"\\\"\")\n", + "        self.fst = word.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9ys2VpjjoiEC" + }, + "source": [ + "It just processes the entire token string with the `NEMO_NOT_SPACE` utility WFST (which accepts any character that is not a space). For your language, you may simply use one of the preexisting `WordFst` implementations.\n", + "\n", + "Depending on language, the `PunctuationFst` may require some (minimal) adjustment. Note the following:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "Mnnd3PVMpF4t" + }, + "outputs": [], + "source": [ + "class PunctuationFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"punctuation\", kind=\"classify\")\n", + "\n", + "        s = \"!#$%&\\'()*+,-./:;<=>?@^_`{|}~\"\n", + "        punct = pynini.union(*s)\n", + "\n", + "        graph = pynutil.insert(\"name: \\\"\") + punct + pynutil.insert(\"\\\"\")\n", + "\n", + "        self.fst = graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_afW02LXpLtz" + }, + "source": [ + "If your language uses other punctuation than that in the `s` string (or reserves some of the punctuation as characters), you may simply edit `s` to accommodate. 
\n", + "\n", + "For instance, French has a unique quotation style that utilizes guillemets \"« »\". 
We may add their Unicode codepoints (to avoid encoding issues) to `s`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "mgfZIKzVplVm" + }, + "outputs": [], + "source": [ + "class PunctuationFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"punctuation\", kind=\"classify\")\n", + "\n", + "        s = \"!#$%&\\'()*+,-./:;<=>?@^_`{|}~\"\n", + "        guillemets = \"\\u00AB\" + \"\\u00BB\"  # quotation marks in French.\n", + "        s += guillemets\n", + "        punct = pynini.union(*s)\n", + "\n", + "        graph = pynutil.insert(\"name: \\\"\") + punct + pynutil.insert(\"\\\"\")\n", + "\n", + "        self.fst = graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "6Upb5-wcp_7H" + }, + "source": [ + "## Verbalizer" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ufWT1T6GqCCT" + }, + "source": [ + "Note that `PunctuationFst` and `WordFst` both encode with the `name` property. This leaves no differentiation between the two for a Verbalizer. This makes sense as there are no particular formatting rules for them; they simply need a placeholder tag to avoid alteration between the Classifier and Verbalizer steps. 
Once passed to the verbalizer, they are rendered as normal by simply removing the tag (this is practically identical to the `WhiteListFst`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "LqyhqQKZqcph" + }, + "outputs": [], + "source": [ + "class WordFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"word\", kind=\"verbalize\")\n", + "        chars = pynini.closure(NEMO_CHAR - \" \", 1)\n", + "        char = pynutil.delete(\"name:\") + delete_space + pynutil.delete(\"\\\"\") + chars + pynutil.delete(\"\\\"\")\n", + "        graph = char @ pynini.cdrewrite(pynini.cross(u\"\\u00A0\", \" \"), \"\", \"\", NEMO_SIGMA)  # Converts non-breaking spaces back to regular spaces\n", + "\n", + "        self.fst = graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lGbrUkcpapyi" + }, + "source": [ + "For many languages, the writing of your `WordFst` and `PunctuationFst` (both Classifiers and Verbalizers) will require no more than duplicating the preexisting grammars found in NeMo Text Processing." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "5y9jhkhQ7p4W" + }, + "source": [ + "# Other Classes " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j1mgnISmiu-g" + }, + "source": [ + "While the preceding discussion should be suitable for development of the remaining classes, some helpful notes may be of use before continuing:\n", + "- Fraction WFST: This is the last of the 'fundamental' classes and should take priority after completion of the Decimal WFST. It operates very similarly to the Ordinal WFST in that you wish to recover the Cardinal roots for the numerator and denominator prior to tagging. 
Its properties are: `negative`, `integer_part`, `numerator`, and `denominator`.\n", + "- Measure WFST: Like the Money WFST, this will require management of several 'parent' WFSTs (Fraction, Cardinal, Decimal) to be suitably comprehensive. 
You may also find it more productive to compose new measurement units instead of simply listing them all (e.g. micrometers, petameters, miles per hour, feet per second). Its properties are `negative` and `units`, and it allows subgraphs of the `cardinal`, `decimal`, and `fraction` classes. (That is, it allows tokenization within the tokenization.)\n", + "- Date WFST: Depending on writing conventions, this may vary in complexity. For instance, English speakers may write dates as `01/01/2021` or `Jan. 1 2021`. Are there specific use cases where one is preferred or should you simply decide on a format? Further, you may wish to take advantage of the `preserve_order` property to avoid possible unwanted verbalizations (some implementations will permit both `Jan. 1` and `1 Jan.` if you are not careful). Its properties are: `month`, `day`, and `year`. \n", + "- Telephone WFST: These will be heavily dependent not only on writing conventions but even regional preference. For instance, the U.S. commonly uses a ten-digit system broken into the following sequence: `###-###-####`. Meanwhile, mainland France breaks a ten-digit sequence into groups of two: `##-##-##-##-##`. Take careful note of how your language's target region verbalizes these figures and leave room for some variation in development. The `telephone` class has only one property: `number_part`. \n", + "- Electronic WFST: For normalizing email addresses or URLs, you will need to develop for the `electronic` class. The main concerns will be managing alphanumeric strings and parsing the reserved symbols used for protocols and domains. (How does your target language pronounce \"https://\"? www? '.' 
or '@'?) Depending on whether you are normalizing a URL or an email, the following properties will be needed:\n", + "  - email: `username`, `domain`\n", + "  - url: `protocol` (Sparrowhawk allows further detail here but NeMo passes the entire URL through the `protocol` property)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-i25X8mK90n3" + }, + "source": [ + "# Tokenize and Classify " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v4bcigU6b9ss" + }, + "source": [ + "We are now ready to build a general Classifier for our entire language. Upon completion of your grammars, the next step is to unite them in a general Classifier WFST, preferably located within a `tokenize_and_classify.py` file. This WFST will be responsible for determining the appropriate semiotic class for each token in your string and processing the necessary properties for normalization.\n", + "\n", + "For this section, we will focus on the following: grammar composition, assignment of weights, and importing/exporting as a FAR file. Since we will need to work with some instantiated graphs, let's preload them before proceeding. 
(Note the compilation time.)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "from nemo_text_processing.inverse_text_normalization.fr.taggers.cardinal import CardinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.decimal import DecimalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.money import MoneyFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.ordinal import OrdinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.punctuation import PunctuationFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.time import TimeFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.whitelist import WhiteListFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.taggers.word import WordFst\n", + "\n", + "cardinal = CardinalFst()\n", + "cardinal_graph = cardinal.fst\n", + "\n", + "ordinal = OrdinalFst(cardinal)\n", + "ordinal_graph = ordinal.fst\n", + "\n", + "decimal = DecimalFst(cardinal)\n", + "decimal_graph = decimal.fst\n", + "\n", + "whitelist_graph = WhiteListFst().fst\n", + "word_graph = WordFst().fst\n", + "time_graph = TimeFst().fst\n", + "money_graph = MoneyFst(cardinal, decimal).fst\n", + "punct_graph = PunctuationFst().fst" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MIv58eSocOV1" + }, + "source": [ + "## Grammar" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "k_RPlnfVdG5E" + }, + "source": [ + "As for all previous grammars, the `tokenize_and_classify` grammar inherits from `GraphFst` as an individual class: `ClassifyFst`. 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "WHKG4c2WdW0G" + }, + "outputs": [], + "source": [ + "class ClassifyFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "j9_I6DJmdcOG" + }, + "source": [ + "This class is responsible for instantiating all subgraphs and passing necessary dependencies:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "4YtmcxLOdlas" + }, + "outputs": [], + "source": [ + "class ClassifyFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")\n", + "\n", + "        cardinal = CardinalFst()\n", + "        cardinal_graph = cardinal.fst\n", + "\n", + "        ordinal = OrdinalFst(cardinal)\n", + "        ordinal_graph = ordinal.fst\n", + "\n", + "        decimal = DecimalFst(cardinal)\n", + "        decimal_graph = decimal.fst\n", + "\n", + "        whitelist_graph = WhiteListFst().fst\n", + "        word_graph = WordFst().fst\n", + "        time_graph = TimeFst().fst\n", + "        money_graph = MoneyFst(cardinal, decimal).fst\n", + "        punct_graph = PunctuationFst().fst" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y5vGvv3HeAY9" + }, + "source": [ + "We then join all the grammars together so `ClassifyFst` can apply them. Rather unceremoniously, this is accomplished by performing a union across all grammars (excluding `PunctuationFst`, to assist tokenization). 
We then follow this union by inserting the `tokens` class around the resulting formatting (required for processing):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "oocgPQ5geZJO" + }, + "outputs": [], + "source": [ + "class ClassifyFst(GraphFst):\n", + "    def __init__(self):\n", + "        super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")\n", + "\n", + "        cardinal = CardinalFst()\n", + "        cardinal_graph = cardinal.fst\n", + "\n", + "        ordinal = OrdinalFst(cardinal)\n", + "        ordinal_graph = ordinal.fst\n", + "\n", + "        decimal = DecimalFst(cardinal)\n", + "        decimal_graph = decimal.fst\n", + "\n", + "        whitelist_graph = WhiteListFst().fst\n", + "        word_graph = WordFst().fst\n", + "        time_graph = TimeFst().fst\n", + "        money_graph = MoneyFst(cardinal, decimal).fst\n", + "        punct_graph = PunctuationFst().fst\n", + "\n", + "        classify = (\n", + "            time_graph\n", + "            | whitelist_graph\n", + "            | decimal_graph\n", + "            | cardinal_graph\n", + "            | ordinal_graph\n", + "            | money_graph\n", + "            | word_graph\n", + "        )\n", + "        token = pynutil.insert(\"tokens { \") + classify + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ASWDXWQjfLEU" + }, + "source": [ + "Our graph is now able to process an individual token. But what about a string? Here you will need to be mindful of the tokenization behavior for your language and decide on your desired treatment of punctuation (hence its exclusion from the main graph). \n", + "\n", + "For our purposes, we will assume the convention of space and punctuation serving as token separators. 
We graph punctuation as individual tokens" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "r6WztK2jwhFt" + }, + "outputs": [], + "source": [ + "punct_graph = PunctuationFst().fst\n", + "punct = pynutil.insert(\"tokens { \") + punct_graph + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9T2rT89jw3T1" + }, + "source": [ + "and join the `punct` graph with our `tokens` graph (inserting spaces between tokens for formatting)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "rGtVOK-txKOP" + }, + "outputs": [], + "source": [ + "token = \"PLACEHOLDER\"\n", + "token_plus_punct = (\n", + "    pynini.closure(punct + pynutil.insert(\" \")) + token + pynini.closure(pynutil.insert(\" \") + punct)\n", + ")  # Note the use of closure in case there are multiple punctuation marks\n", + "graph = token_plus_punct + pynini.closure(delete_extra_space + token_plus_punct)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "_gixfQ69xWPe" + }, + "source": [ + "then strip any leading and trailing space: \n", + "\n", + "`graph = delete_space + graph + delete_space`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DWnmazWecyUG" + }, + "source": [ + "## Weighting " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "egHbwIbMx-hT" + }, + "source": [ + "Were we to leave our `ClassifyFst` like this, we would undoubtedly encounter a mountain of errors. What will stop our graph from treating punctuation that is part of a previous grammar as a token separator (e.g. \"vingt-et-un\")? How do we ensure that a currency string isn't treated as solely a decimal string with a `name` token following?\n", + "\n", + "As in previous cases, the solution lies in our choice of weights for the grammar."
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y3U7_M8CyxZ1" + }, + "source": [ + "Let us return to the main graph:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "9VXe1dfsy3Be" + }, + "outputs": [], + "source": [ + "classify = (\n", + "    time_graph\n", + "    | whitelist_graph\n", + "    | decimal_graph\n", + "    | cardinal_graph\n", + "    | ordinal_graph\n", + "    | money_graph\n", + "    | word_graph\n", + ")\n", + "punct = pynutil.insert(\"tokens { \") + punct_graph + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aY4vOFqxy5ua" + }, + "source": [ + "Beyond the path weights that we explicitly added, these graphs are currently weightless. Since we want the graphs themselves to be the general determiners of a path, let us use some default weights an order of magnitude beyond our path weights (we use `pynutil.add_weight`):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "bthyt_Le2rsA" + }, + "outputs": [], + "source": [ + "classify = (\n", + "    pynutil.add_weight(time_graph, 1)\n", + "    | pynutil.add_weight(whitelist_graph, 1)\n", + "    | pynutil.add_weight(decimal_graph, 1)\n", + "    | pynutil.add_weight(cardinal_graph, 1)\n", + "    | pynutil.add_weight(ordinal_graph, 1)\n", + "    | pynutil.add_weight(money_graph, 1)\n", + "    | pynutil.add_weight(word_graph, 1)\n", + ")\n", + "punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, 1) + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xMNIJbzj3MMP" + }, + "source": [ + "Let's see what logical adjustments should be made. First off, we know that we want each class token to span the largest string possible. (For example, we don't want \"quatre-vingt\" to be rendered as two `cardinal` classes with a hyphen in between.) As such, we want to penalize our graph for using more than one token. 
We can do so by establishing the following constraint: the sum of two or more tokens cannot be less than the weight of a single token. Or, for any pair of tokens `w_1` and `w_2`, their sum must always be greater than any other individual token (including themselves):\n", + "\n", + "`w_1 + w_2 > k >= w`\n", + "\n", + "To keep things simple, let us make the upper limit `2`. This means we should increase all the weights to keep our constraint:\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "classify = (\n", + " pynutil.add_weight(time_graph, 1.1)\n", + " | pynutil.add_weight(whitelist_graph, 1.1)\n", + " | pynutil.add_weight(decimal_graph, 1.1)\n", + " | pynutil.add_weight(cardinal_graph, 1.1)\n", + " | pynutil.add_weight(ordinal_graph, 1.1)\n", + " | pynutil.add_weight(money_graph, 1.1)\n", + " | pynutil.add_weight(word_graph, 1.1)\n", + " )\n", + "punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, 1.1) + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Do we want this constraint to include all tokens? Imagine if we had a string of multiple semiotic tokens in a row. Since this string's combined weight would be larger than any single class token, a grammar that served as a universal acceptor (i.e. `word_graph`) would be preferred over these individual classes. This would be obviously incorrect. 
As such, we want to make sure that `word_graph` would only be traversed when there is truly no other option:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "qc_CU2ro63eg" + }, + "outputs": [], + "source": [ + "classify = (\n", + "    pynutil.add_weight(time_graph, 1.1)\n", + "    | pynutil.add_weight(whitelist_graph, 1.1)\n", + "    | pynutil.add_weight(decimal_graph, 1.1)\n", + "    | pynutil.add_weight(cardinal_graph, 1.1)\n", + "    | pynutil.add_weight(ordinal_graph, 1.1)\n", + "    | pynutil.add_weight(money_graph, 1.1)\n", + "    | pynutil.add_weight(word_graph, 100)\n", + ")\n", + "punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, 1.1) + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Now, even with a string of fifty different class tokens, `word_graph` would still not be considered as a path to traverse." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fW8C3vD-7Dbl" + }, + "source": [ + "Next, let us consider our foundational graph: `cardinal_graph`. As Cardinals occur in practically all our WFSTs, it's possible for `cardinal_graph` to apply in almost all cases. Yet, we've specifically invoked `CardinalFst` when it was required in any of the other classes, so it will never be needed in any of those cases. This means that we want all those graphs to have *priority* over `cardinal_graph`. As such, we will increase its weight so it takes second-lowest precedence (while still paying attention to the combined weight constraint). 
" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "97UwGaEn8pj7" + }, + "outputs": [], + "source": [ + "classify = (\n", + " pynutil.add_weight(time_graph, 1.1)\n", + " | pynutil.add_weight(whitelist_graph, 1.1)\n", + " | pynutil.add_weight(decimal_graph, 1.1)\n", + " | pynutil.add_weight(cardinal_graph, 1.2)\n", + " | pynutil.add_weight(ordinal_graph, 1.1)\n", + " | pynutil.add_weight(money_graph, 1.1)\n", + " | pynutil.add_weight(word_graph, 100)\n", + " )\n", + "punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, 1.1) + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0d9Lw4Ot88_B" + }, + "source": [ + "This form of thinking can be applied to all the 'foundational' graphs you may develop: the dependent graphs should take higher precedence than the graphs they borrow from. For instance, since `money_graph` utilizes `decimal_graph`, we know it should take precedence. However, since `decimal_graph` borrows from `cardinal_graph`, its weight must still be less than `1.2`. 
As such: " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "-wF8cgLK9tpU" + }, + "outputs": [], + "source": [ + "classify = (\n", + "    pynutil.add_weight(time_graph, 1)\n", + "    | pynutil.add_weight(whitelist_graph, 1)\n", + "    | pynutil.add_weight(decimal_graph, 1.1)\n", + "    | pynutil.add_weight(cardinal_graph, 1.2)\n", + "    | pynutil.add_weight(ordinal_graph, 1)\n", + "    | pynutil.add_weight(money_graph, 1.09)\n", + "    | pynutil.add_weight(word_graph, 100)\n", + ")\n", + "punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, 1) + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "huMzDoZ2-FD2" + }, + "source": [ + "For those classes that don't seem affected, we can set their weights to the same value as the graphs just below the 'foundation' graphs, simply to prevent prioritization when not required.\n", + "\n", + "Meanwhile, `whitelist_graph` should take precedence over all others, as it may contain unique normalizations that may get accidentally caught by the other graphs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gWG6ttyd-bbD" + }, + "outputs": [], + "source": [ + "classify = (\n", + "    pynutil.add_weight(time_graph, 1.1)\n", + "    | pynutil.add_weight(whitelist_graph, 1.07)\n", + "    | pynutil.add_weight(decimal_graph, 1.1)\n", + "    | pynutil.add_weight(cardinal_graph, 1.2)\n", + "    | pynutil.add_weight(ordinal_graph, 1.1)\n", + "    | pynutil.add_weight(money_graph, 1.08)\n", + "    | pynutil.add_weight(word_graph, 100)\n", + ")\n", + "punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, 1.1) + pynutil.insert(\" }\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "1TH08f8O-fWx" + }, + "source": [ + "Keep in mind that building weights in this manner is hardly a rule for grammar development and is instead intended as a means to initialize weights for empirical development. 
You will find that actual strings will cause unexpected behavior that requires fine-tuning. \n", + "\n", + "For instance, the Classifier for French in NeMo ITN benefits from having varying precedence for some weights, as seen in the following excerpt:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "gKdkyDK3_r46" + }, + "outputs": [], + "source": [ + "class ClassifyFst(GraphFst):\n", + "    \"\"\"\n", + "    Final class that composes all other classification grammars. This class can process an entire sentence, that is lower cased.\n", + "    For deployment, this grammar will be compiled and exported to OpenFst Finite State Archive (FAR) File. \n", + "    More details to deployment at NeMo/tools/text_processing_deployment.\n", + "\n", + "    Args:\n", + "        cache_dir: path to a dir with .far grammar file. Set to None to avoid using cache.\n", + "        overwrite_cache: set to True to overwrite .far files\n", + "    \"\"\"\n", + "\n", + "    def __init__(self, cache_dir: str = None, overwrite_cache: bool = False):\n", + "        super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")\n", + "\n", + "        far_file = None\n", + "        if cache_dir is not None and cache_dir != \"None\":\n", + "            os.makedirs(cache_dir, exist_ok=True)\n", + "            far_file = os.path.join(cache_dir, \"_fr_itn.far\")\n", + "        if not overwrite_cache and far_file and os.path.exists(far_file):\n", + "            self.fst = pynini.Far(far_file, mode=\"r\")[\"tokenize_and_classify\"]\n", + "            logging.info(f\"ClassifyFst.fst was restored from {far_file}.\")\n", + "        else:\n", + "            logging.info(f\"Creating ClassifyFst grammars.\")\n", + "\n", + "            cardinal = CardinalFst()\n", + "            cardinal_graph = cardinal.fst\n", + "\n", + "            fraction = FractionFst(cardinal)\n", + "            fraction_graph = fraction.fst\n", + "\n", + "            ordinal = OrdinalFst(cardinal)\n", + "            ordinal_graph = ordinal.fst\n", + "\n", + "            decimal = DecimalFst(cardinal)\n", + "            decimal_graph = decimal.fst\n", + "\n", + "            measure_graph = 
MeasureFst(cardinal=cardinal, decimal=decimal, fraction=fraction).fst\n", + "            date_graph = DateFst(cardinal).fst\n", + "            word_graph = WordFst().fst\n", + "            time_graph = TimeFst().fst\n", + "            money_graph = MoneyFst(cardinal, decimal).fst\n", + "            whitelist_graph = WhiteListFst().fst\n", + "            punct_graph = PunctuationFst().fst\n", + "            electronic_graph = ElectronicFst().fst\n", + "            telephone_graph = TelephoneFst().fst\n", + "\n", + "            classify = (\n", + "                pynutil.add_weight(whitelist_graph, 1.01)\n", + "                | pynutil.add_weight(time_graph, 1.05)\n", + "                | pynutil.add_weight(date_graph, 1.09)\n", + "                | pynutil.add_weight(decimal_graph, 1.08)\n", + "                | pynutil.add_weight(measure_graph, 1.1)\n", + "                | pynutil.add_weight(cardinal_graph, 1.1)\n", + "                | pynutil.add_weight(ordinal_graph, 1.1)\n", + "                | pynutil.add_weight(fraction_graph, 1.09)\n", + "                | pynutil.add_weight(money_graph, 1.07)\n", + "                | pynutil.add_weight(telephone_graph, 1.1)\n", + "                | pynutil.add_weight(electronic_graph, 1.1)\n", + "                | pynutil.add_weight(word_graph, 100)\n", + "            )\n", + "\n", + "            punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, weight=1.1) + pynutil.insert(\" }\")\n", + "            token = pynutil.insert(\"tokens { \") + classify + pynutil.insert(\" }\")\n", + "            token_plus_punct = (\n", + "                pynini.closure(punct + pynutil.insert(\" \")) + token + pynini.closure(pynutil.insert(\" \") + punct)\n", + "            )\n", + "\n", + "            graph = token_plus_punct + pynini.closure(delete_extra_space + token_plus_punct)\n", + "            graph = delete_space + graph + delete_space\n", + "\n", + "            self.fst = graph.optimize()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qc4B_0rNcQZu" + }, + "source": [ + "## FAR import/export" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0nRRPvy-AYsA" + }, + "source": [ + "While working through these code excerpts, you may have noticed some latency with each instantiation of our WFSTs (notably wherever `CardinalFst` was involved). 
This is because the `pynini.optimize` that we call with each graph's instantiation is computationally expensive. For our ultimate purpose of deployment, it seems a waste of resources to recreate stable graphs for each use.\n", + "\n", + "To address this, NeMo ITN supports WFST caching through use of `pynini.Far`, storing and recovering Classify grammars as FAR (Fst ARchives).\n", + "\n", + "Let us update our `ClassifyFst` to permit passing a cache and allowing overwriting (for development):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5XgWevUzD1AE" + }, + "outputs": [], + "source": [ + "class ClassifyFst(GraphFst):\n", + "    def __init__(self, cache_dir: str = None, overwrite_cache: bool = False):\n", + "        super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l28GMR70ESz0" + }, + "source": [ + "For storing our graphs as FARs, we can use `graph_utils.generator_main`, which saves our WFSTs by type for easier management. 
As arguments, it takes a string file name and a dict mapping of WFST type to graph:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "AzTkcmAWFLYm" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "class ClassifyFst(GraphFst):\n", + "    def __init__(self, cache_dir: str = None, overwrite_cache: bool = False):\n", + "        super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")\n", + "        # Grammar here\n", + "        # ....\n", + "        if cache_dir is not None and cache_dir != \"None\":\n", + "            os.makedirs(cache_dir, exist_ok=True)\n", + "            far_file = os.path.join(cache_dir, \"_fr_itn.far\")\n", + "            generator_main(far_file, {\"tokenize_and_classify\": self.fst})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wz8wjCQSD6eJ" + }, + "source": [ + "We pair this with the ability to load from cache (note the `\"tokenize_and_classify\"` key being passed):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "FRFYgMmuD_53" + }, + "outputs": [], + "source": [ + "import os\n", + "\n", + "class ClassifyFst(GraphFst):\n", + "    def __init__(self, cache_dir: str = None, overwrite_cache: bool = False):\n", + "        super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")\n", + "        # Resolve the cache path first so it can be checked before building the grammar\n", + "        far_file = None\n", + "        if cache_dir is not None and cache_dir != \"None\":\n", + "            os.makedirs(cache_dir, exist_ok=True)\n", + "            far_file = os.path.join(cache_dir, \"_fr_itn.far\")\n", + "        if not overwrite_cache and far_file and os.path.exists(far_file):\n", + "            self.fst = pynini.Far(far_file, mode=\"r\")[\"tokenize_and_classify\"]\n", + "        else:\n", + "            # Grammar here\n", + "            # ....\n", + "            if far_file:\n", + "                generator_main(far_file, {\"tokenize_and_classify\": self.fst})\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ib9nggZxF38s" + }, + "source": [ + "Producing our `ClassifyFst` as:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "d2BZyx6sGGg2" + }, + "outputs": [], + "source": [ + "class 
ClassifyFst(GraphFst):\n", + " def __init__(self, cache_dir: str = None, overwrite_cache: bool = False):\n", + " super().__init__(name=\"tokenize_and_classify\", kind=\"classify\")\n", + "\n", + " far_file = None\n", + " if cache_dir is not None and cache_dir != \"None\":\n", + " os.makedirs(cache_dir, exist_ok=True)\n", + " far_file = os.path.join(cache_dir, \"_fr_itn.far\")\n", + " if not overwrite_cache and far_file and os.path.exists(far_file):\n", + " self.fst = pynini.Far(far_file, mode=\"r\")[\"tokenize_and_classify\"]\n", + " else:\n", + " cardinal = CardinalFst()\n", + " cardinal_graph = cardinal.fst\n", + "\n", + " ordinal = OrdinalFst(cardinal)\n", + " ordinal_graph = ordinal.fst\n", + "\n", + " decimal = DecimalFst(cardinal)\n", + " decimal_graph = decimal.fst\n", + "\n", + " word_graph = WordFst().fst\n", + " time_graph = TimeFst().fst\n", + " money_graph = MoneyFst(cardinal, decimal).fst\n", + " whitelist_graph = WhiteListFst().fst\n", + " punct_graph = PunctuationFst().fst\n", + "\n", + " classify = (\n", + " pynutil.add_weight(time_graph, 1.1)\n", + " | pynutil.add_weight(whitelist_graph, 1.01)\n", + " | pynutil.add_weight(decimal_graph, 1.09)\n", + " | pynutil.add_weight(cardinal_graph, 1.1)\n", + " | pynutil.add_weight(ordinal_graph, 1.09)\n", + " | pynutil.add_weight(money_graph, 1.08)\n", + " | pynutil.add_weight(word_graph, 100)\n", + " )\n", + "\n", + " punct = pynutil.insert(\"tokens { \") + pynutil.add_weight(punct_graph, weight=1.1) + pynutil.insert(\" }\")\n", + " token = pynutil.insert(\"tokens { \") + classify + pynutil.insert(\" }\")\n", + " token_plus_punct = (\n", + " pynini.closure(punct + pynutil.insert(\" \")) + token + pynini.closure(pynutil.insert(\" \") + punct)\n", + " )\n", + "\n", + " graph = token_plus_punct + pynini.closure(delete_extra_space + token_plus_punct)\n", + " graph = delete_space + graph + delete_space\n", + "\n", + " self.fst = graph.optimize()\n", + "\n", + " if 
far_file:\n", + " generator_main(far_file, {\"tokenize_and_classify\": self.fst})" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nEhY6wKKtfhn" + }, + "source": [ + "You should find that caching vastly speeds up compilation time." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rTtCnC5w95CI" + }, + "source": [ + "# Verbalize and Verbalize Final " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "H9y5yuk1HaGj" + }, + "source": [ + "Our last step is to create a universal Verbalizer for all classes. This is very similar to the development of `ClassifyFst`, except that the Verbalizer breaks its normalization task into two components:\n", + "- `VerbalizeFst`, which removes formatting for each token\n", + "- `VerbalizeFinalFst`, which extends `VerbalizeFst` across all tokens in a string\n", + "\n", + "Why two components when `tokenize_and_classify` was one? Because Sparrowhawk already performs all the functionality of `VerbalizeFinalFst`, so including it would break deployment. Without it, however, your grammar could not function within NeMo on its own. So we separate the two to allow the best of both worlds." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vUawTJVuH8iR" + }, + "source": [ + "## VerbalizeFst" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xghiBV06IIWU" + }, + "source": [ + "Much like `ClassifyFst`, `VerbalizeFst` instantiates all its subgraphs and then joins them together under a union operation. However, it does not need to employ weighting. Why? Because `ClassifyFst` has assigned each token a specific class. 
As each class is unique, there is no possibility that a subgraph will be employed for the wrong token.\n", + "\n", + "As such, our `VerbalizeFst` is formed by a simple union operation across all previous Verbalizer graphs:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "uMVCqCvsIt2v" + }, + "outputs": [], + "source": [ + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.cardinal import CardinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.decimal import DecimalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.money import MoneyFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.ordinal import OrdinalFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.time import TimeFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.whitelist import WhiteListFst\n", + "from nemo_text_processing.inverse_text_normalization.fr.verbalizers.word import WordFst\n", + "\n", + "class VerbalizeFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"verbalize\", kind=\"verbalize\")\n", + " cardinal = CardinalFst()\n", + " cardinal_graph = cardinal.fst\n", + " ordinal_graph = OrdinalFst().fst\n", + " decimal = DecimalFst()\n", + " decimal_graph = decimal.fst\n", + " whitelist_graph = WhiteListFst().fst\n", + " money_graph = MoneyFst(decimal=decimal).fst\n", + " time_graph = TimeFst().fst\n", + " graph = (\n", + " time_graph\n", + " | whitelist_graph\n", + " | money_graph\n", + " | ordinal_graph\n", + " | decimal_graph\n", + " | cardinal_graph\n", + " )\n", + " self.fst = graph" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wap-LU6EI2Iu" + }, + "source": [ + "## Verbalize Final" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TYaEt_0tI47t" + }, + "source": [ + "With `VerbalizeFst` complete, we now extend our graph to cover any series of 
tokens. All this requires is deletion of the `tokens` formatting (note the absence of such in our previous graph) and use of closure for any series of one or more tokens.\n", + "\n", + "This provides the following graph:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "L-9lJNE6JPCW" + }, + "outputs": [], + "source": [ + "\n", + "class VerbalizeFinalFst(GraphFst):\n", + " def __init__(self):\n", + " super().__init__(name=\"verbalize_final\", kind=\"verbalize\")\n", + " verbalize = VerbalizeFst().fst\n", + " word = WordFst().fst\n", + " types = verbalize | word\n", + " graph = (\n", + " pynutil.delete(\"tokens\")\n", + " + delete_space\n", + " + pynutil.delete(\"{\")\n", + " + delete_space\n", + " + types\n", + " + delete_space\n", + " + pynutil.delete(\"}\")\n", + " )\n", + " graph = delete_space + pynini.closure(graph + delete_extra_space) + graph + delete_space\n", + " self.fst = graph" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "WwMKFw-QJVgm" + }, + "source": [ + "Unlike `ClassifyFst`, NeMo ITN does not cache `VerbalizeFst` or `VerbalizeFinalFst`. (While you are welcome to provide such functionality in your own development, keep in mind that the limited complexity of our Verbalizers makes compilation times less significant.)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7U21AZearZMK" + }, + "source": [ + "# Deployment " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "VrSccoh9K6JK" + }, + "source": [ + "Now that we have done all the groundwork, we can finally move to deployment. This final section covers only the minor code alterations required to call your language through NeMo ITN and to deploy through Sparrowhawk. For further information on using NeMo ITN, please see [this tutorial](https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/text_processing/Inverse_Text_Normalization.ipynb). 
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0Le2aJvFIAKd" + }, + "source": [ + "## InverseNormalize" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r2R3TUCDLi5-" + }, + "source": [ + "NeMo calls upon the `InverseNormalizer` class for all ITN tasks. Given a string and language, it will instantiate both the `ClassifyFst` and the `VerbalizeFinalFst` for the given language. (Note: `VerbalizeFinalFst` is needed only here; under Sparrowhawk deployment its functions are managed by Sparrowhawk itself.) To make your language deployable in the general NeMo ITN system, you must make these classes available for instantiation. (For more information, see the [source code](https://github.com/NVIDIA/NeMo/blob/main/nemo_text_processing/inverse_text_normalization/inverse_normalize.py).)\n", + "\n", + "Doing so requires only two changes. The first is providing a string to identify your language as an option for `parse_args` ([ISO codes are advised](https://en.wikipedia.org/wiki/List_of_ISO_639-1_codes)):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "tfv4Ee3ML-Fg" + }, + "outputs": [], + "source": [ + "def parse_args():\n", + " parser = ArgumentParser()\n", + " parser.add_argument(\"input_string\", help=\"input string\", type=str)\n", + " parser.add_argument(\"--language\", help=\"language\", choices=['en', 'de', 'es', 'ru', 'fr', 'MY_LANGUAGE'], default=\"en\", type=str)\n", + " parser.add_argument(\"--verbose\", help=\"print info for debugging\", action='store_true')\n", + " parser.add_argument(\"--overwrite_cache\", help=\"set to True to re-create .far grammar files\", action=\"store_true\")\n", + " parser.add_argument(\n", + " \"--cache_dir\",\n", + " help=\"path to a dir with .far grammar file. 
Set to None to avoid using cache\",\n", + " default=None,\n", + " type=str,\n", + " )\n", + " return parser.parse_args()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "awVl5nAsMUTl" + }, + "source": [ + "The next is to import your `ClassifyFst` and `VerbalizeFinalFst` in `__init__`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "class InverseNormalizer(Normalizer):\n", + " def __init__(self, lang: str = 'en', cache_dir: str = None, overwrite_cache: bool = False):\n", + "\n", + " if lang == 'en':\n", + " from nemo_text_processing.inverse_text_normalization.en.taggers.tokenize_and_classify import ClassifyFst\n", + " from nemo_text_processing.inverse_text_normalization.en.verbalizers.verbalize_final import (\n", + " VerbalizeFinalFst,\n", + " )\n", + " # Other languages\n", + " # ....\n", + " elif lang == 'MY_LANGUAGE':\n", + " from nemo_text_processing.inverse_text_normalization.MY_LANGUAGE.taggers.tokenize_and_classify import ClassifyFst\n", + " from nemo_text_processing.inverse_text_normalization.MY_LANGUAGE.verbalizers.verbalize_final import (\n", + " VerbalizeFinalFst,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TI1PuejLMxdI" + }, + "source": [ + "And you're done! NeMo will handle the rest. " + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "xrksINQoICfj" + }, + "source": [ + "## Sparrowhawk" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "rP9-dmMJSg3h" + }, + "source": [ + "Sparrowhawk is an open-source implementation of Google's Kestrel Text Normalization system. Functionally, it operates similarly to NeMo ITN (the two-step Classify and Verbalize functions stem from [intentional NeMo integration](https://arxiv.org/pdf/2104.05055.pdf)) but is better optimized for backend deployment. 
\n", + "\n", + "Like the preceding section, this portion of the tutorial will highlight a few necessary edits so you may deploy your normalization system." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "u1eGMGxkVZmM" + }, + "source": [ + "### Grammar Export" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "v9dr0E-uVgoT" + }, + "source": [ + "The first step in deploying your grammar is exporting both the `ClassifyFst` and `VerbalizeFst` WFSTs as FAR files. This is done through `pynini_export.py`, found in `NeMo/tools/text_processing_deployment`. To allow export of your grammar, we must make edits similar to those we made for `inverse_normalize.py`." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qtek2bMMWbMj" + }, + "source": [ + "First, append your language to the list of accepted strings in `parse_args`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "5pTGX9YAWiTZ" + }, + "outputs": [], + "source": [ + "\n", + "def parse_args():\n", + " parser = ArgumentParser()\n", + " parser.add_argument(\"--output_dir\", help=\"output directory for grammars\", required=True, type=str)\n", + " parser.add_argument(\"--language\", help=\"language\", choices=[\"en\", \"de\", \"es\", \"ru\", 'fr', 'MY_LANGUAGE'], type=str, default='en')\n", + " parser.add_argument(\n", + " \"--grammars\", help=\"grammars to be exported\", choices=[\"tn_grammars\", \"itn_grammars\"], type=str, required=True\n", + " )\n", + " parser.add_argument(\n", + " \"--input_case\", help=\"input capitalization\", choices=[\"lower_cased\", \"cased\"], default=\"cased\", type=str\n", + " )\n", + " parser.add_argument(\"--overwrite_cache\", help=\"set to True to re-create .far grammar files\", action=\"store_true\")\n", + " parser.add_argument(\n", + " \"--cache_dir\",\n", + " help=\"path to a dir with .far grammar file. 
Set to None to avoid using cache\",\n", + " default=None,\n", + " type=str,\n", + " )\n", + " return parser.parse_args()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fm3CTmdLWlUt" + }, + "source": [ + "Then import your `ClassifyFst` and `VerbalizeFst` in `main`:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "LANG=\"FOO\"\n", + "\n", + "if LANG == 'en':\n", + " from nemo_text_processing.inverse_text_normalization.en.taggers.tokenize_and_classify import (\n", + " ClassifyFst as ITNClassifyFst,\n", + " )\n", + " from nemo_text_processing.inverse_text_normalization.en.verbalizers.verbalize import (\n", + " VerbalizeFst as ITNVerbalizeFst,\n", + " )\n", + "# Other languages\n", + "# ...\n", + "elif LANG == 'MY_LANGUAGE':\n", + " from nemo_text_processing.inverse_text_normalization.MY_LANGUAGE.taggers.tokenize_and_classify import (\n", + " ClassifyFst as ITNClassifyFst,\n", + " )\n", + " from nemo_text_processing.inverse_text_normalization.MY_LANGUAGE.verbalizers.verbalize import (\n", + " VerbalizeFst as ITNVerbalizeFst,\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JFgGhCMMW3UQ" + }, + "source": [ + "### Deployment" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "V8RH0aGbW41U" + }, + "source": [ + "By default, NeMo ITN is structured to allow deployment through a Docker-based backend. This involves building a container from file, exporting your grammars to the container, and then deploying Sparrowhawk for processing.\n", + "\n", + "NeMo automates this entire process through `export_grammars.sh`, which will automatically compile your grammars for deployment (assuming you edited `pynini_export.py` appropriately) and mount them in a container for you. 
For our purposes, `export_grammars.sh` only requires the following arguments:\n", + "- `LANGUAGE` - the string you have used throughout to indicate your language\n", + "- `GRAMMARS` - only accepts `itn_grammars` (Inverse Text Normalization) or `tn_grammars` (Text Normalization)\n", + "\n", + "For instance, we would call our French ITN with:" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "KYdbawAfZIco" + }, + "source": [ + "`bash export_grammars.sh --GRAMMARS=itn_grammars --LANGUAGE={LANGUAGE}`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UXVr2twdZMO2" + }, + "source": [ + "This will return the Docker prompt for further normalization." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TDoVUxCE-Dax" + }, + "source": [ + "# Final Notes" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Fw-9mU7ql8iY" + }, + "source": [ + "Congratulations, you have now constructed an entire ITN system from the ground up! While your experience will vary with each language, you will find several commonalities that will assist you in further development. \n", + "\n", + "If you are interested in working further with your language's WFSTs, you may wish to construct a TN system. Broadly, this is accomplished by inverting your previous graphs (`pynini.invert` may assist here) and changing your outputs to avoid indeterminacy (i.e. decide on one canonical output for your grammar for each class). 
But outside of such grammar specific edits, you repeat many of the steps exhibited here, such as:\n", + "- Use of a two step classifier-verbalizer system\n", + "- Same semiotic classes for tagging\n", + "- Inheritance of `GraphFst`\n", + "- Minor import edits to `pynini_export` and `export_grammar`" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "WFST Tutorial.ipynb", + "provenance": [], + "toc_visible": true + }, + "interpreter": { + "hash": "fbc643a332f9d7801191710b24a8a955d342df4f32791f7fb65121dc4784751f" + }, + "kernelspec": { + "display_name": "Python 3.9.7 64-bit ('wfst_tutorial': conda)", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.9.7" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/tutorials/text_processing/images/cent.PNG b/tutorials/text_processing/images/cent.PNG new file mode 100644 index 0000000000000000000000000000000000000000..457006f9bde2dd71d2654359aa107e90a322004c GIT binary patch literal 10352 zcmds-=UY=v)b0@ll_Dx2(ve;Sq<2v1UAj_5=@3vt5kgTB5Tr>blprOL&;&wnDhLuf zBm@XWr9?_VT0-Y+pZCK#|H6Adq+Xe8uk4vwbFcfi;vSjk($lcfP*70N>+5NmQ&3Q* z0`Cj1Py^3zuxqlw%LTBx?n8>|VUAVc!zK3z#t$ecYLjVCoT-4%S6}Eo0aH+1>;Ct+ z(C1g~LO~%+rLXnC@|7cTjyBt2;;?V~ph>HO1sULBpH_|$zma8Rmu2L^FML&`2r%}>wPZt}&yH{{bZF7gx-X4Do63^C$rDy({x_3Or^uPN`%Fh?VPL$)$PLhgbx zkm1h!-BZ<(2p5u)+wk;pUdKYiWZOJ;0Ema;!gW1L3JRXbz)DK<2f4^SN zRU=Jx({ZUzZJ;Jh+_D^kq%>o%$vy8)kSc>kwC2czpie$OeRKG?n_4;Ms~$(-0HIlTxv7a$uxIk>Dk?)(DVwkFER*Q z?YqM(gEAwiHv*|q|8C$CmM>ZLS(3`8V*PyR`ZUjr;zW7rSk9pAmZP-_V@=JdC!Jv} z5I4enIN4#B+;u)@@82)rw)Ep8Ja}uKLKUs@VKcX&UWo6|y7l>=9SsXW%3{2*sxs>Yt*`@y`nAm`WMqN|>;q&<` zHA!lTmv_sIv%j12Xvs2Tk%UbM}q?t(faexy* zbfVdo^O=#=zV=>vv%relY~nTfz#=K1CS=-pp-4m5yQ(3Ro+t8X0tN+E@p#^=-*Kco 
zuJFCFM(4tu_Jxxb1$;r*a;Du31wSk#G!q|zAvs$OB=r`+g`+Xqk5H23*|u1xTzy^Q-X2YbmM-tyeZ>bA{R zG>ahE4nk}oo}Yic%-!4-e@|_9P)5G+0BzIxEV*lV>xp~5>&gghVYx73b-Z*1*qEPJ zX*;Q^Kct5?IYu11lJ9E$@{2sqrKdS1d)9Q2q`Qviwl;>0IyfD+fn__*#8g@uMs8bZ z=+afQ2l7bHlNSF{$gLuOLVoZJPdChBU(h&5`g;;(K*LwkH(JRlXDd0Ly*bR*v~L@6 z(8|Wa^$ghME$(-?fvI$wUaWX}HXzv*0I?zWFq%#sf6}O-IT}$SaDW%W$vJSiAh~iF zv00)Vgi(CtR~j)A+_}pMVF*9n$IaDSm&L%V=|j&lY#difymXgRlVS=^jp{dy6~rA1 zm5)bPc7WZ1BL(F7qea{%j1?S+WTg8N)VuoSh)(C~!{3t6oQ`L3KG(}>?bKCI{&qJm zZ;X^ikn3CrW{a;xr5<7@$``8~h3KDmk!HL%#|q49zQ4K5QsVV?t=jOprz2Fy%N zdD!%hEsS4()avy*%i*URj?+T)hGo*1EdHb~Jp_M)OdX zokOjdOH@M({=qRA@-b?>hd#32oXgZ*+$A7ClskR5d{O!3Z>=VUFlh6lg?-UU+xMTF zZCi~k2<67sNgeI(Vgskq1uM}@t$mk4oKznT#Q$;%uvM9P76_~EV=O>wBWzwcy{8fp zikZSk>msT(7>6_$ve|Pnr~kxeW>1 zu^K0zI=`}zzqeA<`$kz4Rq_jVwkOo2V2k#f^tR-J$4r8^yNPRq56+lec@XB5 zov#rKmMqfOki9kU{tn`bC>ZC8KV&84OYj0Yx@&I!m4~pgQW%j?b5e->01a!vbO1e`sCXo zNOenp9pormGrT0JqM=y&r(7$U&8%fMYtS0vA%uJai9a^RWq&5!CjYjf*U;2UxwtF*SX#tl zS_MzDf&8jeX*ho(+YYp+lnThK$p=au62hW=np2&t3VXZK)_1C2xgmSIMfuzL_(qTbPsAU`1+!l@zCzj z=GTLAc%eC?vO%Q@&A!X3z1OV}-gaAeWRi%S-KvbW^p^zH`rf-qAugsYU^yBLJ}W@I z%LJFMC@1N8PcE$wdZCewoW5J&;Hk5wVr%XudOq`~fSH^A4pORe$4$R%dQrK|){keBpTinvlR}{KxlcjcW zGTJxvy!180AEOW~V)Q7QsK}w2-!08@xkoaFM*TVaRS!z)Uxns~h=|BcJUiMHA!zsU z_iGlm2s0RV_-&Tm)H<8JO%c6FeKGfo%-lX_0Nm?;%4OxoeSf#s}I(Pj&O5ktI?0-usQLWU7Q4Z z5f=Iz>u+I=)Ah*QGB83~xb~H2j0hN3W{>|fy*=BS&Q$ZM4t3bgefBijE-dZ0@)hMw zO!k;mfhpDt!+%+vlp>GCVo7;xoi~yXF7_Yp67bIpPbPhebrS1BVm<9%JP?@PD~LG8 zFSM!=QDMbg0|e2a?{FY0r)@_(p!+sS@TmBE$GB=^CEbsqtuu69hixzSEtiWL>H_bM zw_>md+l$0?+<5#)yK>df1{U0rq2B*Va4v?q_Hn%qdFTdnvI}*pI+5RIHAD8`D?`N7 z7<%@VhAiXB?biD#gkBwIibrNk&dP<|5 zf{8Zbz79GkN1kV)Z^gQ09+Izz=rbv3$%k5>#2pt*60E{MXNcMS8~?ULIqRRs#xvew z*US7~ppvEEf8y%ux>>fDBjcW=2HRfxARv>1*@-m}G7wv6Hv7rc@#>;}sn4GdSU73g z*HmThgZx~81=-@rLr$fm3D_}HM9;EoLRD4C|ApdnH~a;vRJ^QvS0UB8Y6&cw!9J-gvqIaavhTMX>0UMjE^3Qj6y(7v+Q}x3V{DeX2BuWD}`T|z9RiPIqDF0g&;2oO`Z2z9|0*1mMhK*YVeb=LfH(R{A^ zm&njVg3M7s^b%F-WsVoG9jy3?>C~m7lw-JvKth0}suh!^+5`GxA9aU2!~vQkY(gEo 
zGFufUq6(L|DK>R~T!_w_UDhK7pKB5!6$<)@;|*)*&5W(x=Swb^#aakL<8RpIjDKLM zl1@{hfreqm%pE+7e|0Ik5N|%!Amqy$h5F9d38)6D_+c&mW+%$QqOIfI4|vm{S{jz28$o&dPAc>Zb{|sI8WM7u|DAZ@n&Vg&tMn z1i7g$xUdm$*`YQ_F&d~|T19n{|Cuyl6q5 z)$s4|mBEjVpTYekKI7$$@1N|OPZJI7?rIi^J`L}ofPiUs!-I6Zf32OuBF_$eO%*Md zLUaa{#2H!?%G%p!Gn3;qcX#FBJ6~RnludHu4(Ys8jdyn&K(Oh?qV+pTv);?R*gFXc z>w$YXD$v>+AMxe%T5UzGBZ9{H^x4p8X1>GZQ>4uB>QsLqT&gES6gy>M613!4CAW;v z%62@}$@MhVB@bDOYv*@u%(QnlNz_rENDkfOv!h)7tfMEJQ|{Ciu~`% zvQR4>Prw?tEc@-VnY*$*wEe29#v$67AEc&KR#p&-oPHWbKWiG&SR&BOX66eYs!-V9 z^)|w0S8XVXE4gFYnhs0sSL`q7rtUAXsd!J`g&5W2;3WEYdgI+tmg!?;s7p5u|EUXd z!#T9e(rw(vz_92o9e+bLe8+C@(`e>-u9f$vO~;EB#w)1k6)HwM+ksNJ@;md4Snh~d z_~U-71u^=Qb(^6toAaMht;uH#z*g%aUX810xE;i^nnh^cpR}7R2pAhe!%XuBHVaIb z$3@5r=TIZ!K`q=Z1EQ7YnLRsm|8_dK$ZTX)b&c%4@NS|PHUT?X5}H#lvz<4uciOF4 z4ppg_7jyqyE0=YzN8pkbk5k{nw3G2~Fb&Wgy=Vja1A5VAA_7aKcLEQP15m6g)s@Tp zu|O&)T5D)%@HG`%p7-wB5gt>cV%0;=7oAH-rydYvG7Kb!TSeq>xdudrPZ^b|RA*|& zd5U91W0vzHnWhfXOogFB<-f9aGMLuk9V2r(Uv|Zy4jYh$a}pb?sWWU{fBz{?4$@dQ z8wmdF=9nb4;o1PVuX$+?64vZETL}OPJ|K@SjXYnO#BP=Qj}%p6y<@jyjbxnnxexa~ z#BzoHAnJE+fC)|2c^by3Y3EqZSs*7(DCOxaAk(`XBAyp}=2d0<`9{mHaee}Z7}aGN zSPMmyOo8%|_`rtQg=aPPK|{1zX7OqbGbuPZKR4Xa%Tpj3I9mh`{RF%l8zW6=kmEnP z&IXP5)lR!qXdhzmQh?h)ws_A=#~(O0`CTPZ`eJ`lW72S$!Q<6!j}K^tY-DEF6KSQq zvx?<`EIc`2Ipg19kREb&lG(BusY}@FAkWt)o94<75wGUtdsuoQZ^dM39?W>xOq0{$ zW)~w1lxIk@;1OS^=X@NK>Jb|go@Mzf5fA?-{2WyNi*Hwb z{m$;THp>yb_L-gRS7WYsUdpMM^&t7P9pV-;#E)Z2ImYyEF7N@%a6!WeLP4Q~!`Nep(m)d&2dvrr01 zL1eRR23s1ct*MnFVnugPmS&W_o4tE<1zPGvn$6F=UqOxLM3zCT!e*CJOG``L&QHm~ z8Vz{BEh(-^FsGoAPO+?*ihT0zF>Y!kAwfE2Pi76}33^7~R07(q>1xZB2tv=rxwyiv zTU+iFFXVT#$=O-N@#Ky>Vwu+dP%ojtGZNXcQ z43*`EB@~ufiy=VB-LVV`kd4|{AOXkx<*Vv^!~;Z(`I~zF_FXNL1IL>ZCkE71(}$R17_?2Y(%uDk;2zG^jle>sGz~RF8hX>+heY1WX3`b3CwoM zC86slUBUAq?#Bx+ApADilOI}bO9ZbpuHEv-q5Oe~R#H125Po*!A(4u#r6OJF@E;%P zp03W;l7emKVc}p~7i7b@mOfV0_07(w+y4CZl0G@&&;9w+v z-H8b!>>!7sMzzPi1~MVCs^?)1-*g6;Es#^+P8!bNerNO{f)7E(6L2;eaBZ=gZ#7yC zWz5-sBKG6H^hpND_HKm?vw(oxM&bwJ&A1`MwZWe2S$cO-8K72w=K&@xTHct}ejb|> 
zSY_xGQ<7rcxa##cTcjNr2;m)|PLj{CKpF?G$RoZbY}ojEWwi*cd^9Z+DFKEbb6O=^ z3ya}pu{r1n0k=` zGX};(YpcHzk!OX4UqZoM>)+97%i7uaVhaA0r{_9(ct<#N(yC-^r2LkVE==|}Z zrWUoTlci?B)te>iH+5fAx$A!MfhQ~UUhF@$XRaU|e$YVACi{bETs9!lR!Qt&+D(ui zvR(+pIn0@^!gDcZx}3`qq9*7YRxfu@yYwBa6J>2`-w_BzKUeI%4>r2XW4hc8Tdu=- zO1`EqA6+d>Gg0n8#hCDg0Z1cIqcqn)MDA^hF*y_P`A1IQZ3+Vy1vtTJBgI{$@8Lb+ ztidjZp?L{woIN;EjT-H$FQZ|k;Ww|^RHlf)ypLBa%XfpVhjniEviq8Dn{Rq6GIA7` z&WkKhN1kl=0aqM-mwR@4+y=8V0{DA=sFS~6Sl%-HOQrHzgIi2Dq?ZtHd( zn~oU$hqXMnh4Kj<*Qb{6+o%5XTu#&um(qMqD|&l%+MHU3JmIW=!jDOT8Vzkgq{L$g zY1(~`D?s$bth`%HVUD0>PFsUQ5jcsviz5SZu?9^$() zqPm+9?k8HBWsIc`n(-7OGqzR~Ueq7Eti{a*`Z%dL5H*PPK_S%LYAxd5CTUN7uk0=o z__24yN)1}}oAJyKD4wnT8g;o1#uIVdlKxQEyko|>JEjE|pDb9sw>r@WOy`@1;dKnc z0#H|G4E1Pskt~G?TFLgO3%JrO$2C%68;yEOJK(YtV|)BeH8>wO{7v!jkBgk^;|$Kj z(;6$9JJdJLhYk?+V*we7E*Q8s8EGn0#aS~`AIUK?su>!#czET7GWY}Atu;&hnQ7HUXo>Upx-hL^2V*mg92V(pZ?paAfy zYNfM-#W}7^W=wBNXkM*%Z#8X{eoWQrfak2^y{uR&(F`zGoT?#Gth)JI=R8ouwyOOo znJ$6~WJtY0@jFqofUBe8j`M>6AmxW+qLc5S@hEzexO9$}9mUauAj(14Bci zOS64?d=Z)vKn+lQno*~8Pn=M}9};SJ0JDP3joTD~-Z#GnK$GLIWAHn4%=yM_8A5++ z%vhXl4LB|IX{flN8!4DjKQ;+GU`!IxLf@^^{&A3XiQ^VifAqJx;02IVv4NuV(204W zzJ%&lRxe-SIB_@N`vtF_ zB6q-01n>X@k>0`S8VJdpS+d+N+fH!Gl}|mu8SSqHRLZJoc(R~?BfXC#QE}!D6%6^l ziWaLu5oF>@t1Y4^2$o;n_z3N&VxRu31Q&s|vT0HK6IVpTuPrVT*uSn>VyszAo&_gEWVr^i zBozP*7m4>ChqH^e_L@#Ux`&W!NxRP?y`NqU==>lD&wx297W=0o1d+15-=vc(M| zgVrusKEGokDp&iDl+m(2aJkpXVkqOGak^*Gq`=ileyA`ZV3^t5@kb(~WBlbmN;p?= zaG+{a#hve9bG>3GB!)AiuniczvABl7y`_CQekB7b^5<;cYz@kGjor7|c`XZKlv+f` zuWhy2H0{QIO!YU{U%2?>#bHigPOij@D=PrE`OI>oK^_1#)_{Z0COQ^k`QdRO-;G%y z+YoDKf!f69<(Xn0kjqGN9tTN|6$QI#T(NYlHwVa?auQZ++BFYCs#|~qRhfBxQfh+k zk9|7t=d+`AJcW3*fN6tCqhql$qcZe$$kIpM8Gkb}}V9(wE zM(6mIG1P`CNS{}CzS~_H0*~B(#Aa^YnU;ib)7P~M)k-Tu&DWi6{)*)`srsRcsHySu z{`+6eMh^>hmQCQ?Ul6CJQAQ;cC4tgQr)1OTYMeBVgGCUN;Yd!zbDWubto8e8fMXKD z=z)PD04Wl=u4b|EM^LpDC#%oM-|`=OS2OIW@m6f@{`jualbqbpM$8e~y#cbkb0}%M z0^hm)lHn>7RKx%mF+hH4AM+xOwaXvT{&vs=3mt0D?7_+rs`76%*>?Q 
z73H7f9b%Wa`?&pb_uvh|8SHT(%;Ra<_yEgQNoH$lmx)9!kQlcA2bS(C=fl=`}dOf7F!L2Q$)fFfv}Yb(wC zm^m{Z`}o70`?(_(m+e>TUyq!f0mzgSYhkK}t;Xgvv??L6#Qz82xfxdkl_xd%Kf>l_ z)F=l5l``mSFEzW*Q+D1fFoVJJcL_gcvR!zbPyF>BiMluBz-3h=+T=%9`G2!-J7B z8iVF5{3?YEzCrp01-@+G%*Op~qF9!)Y^!KU42nM~+BjeulDVM99ES&bnowwtBG8k8 z$G`gXDRzc60;#9Itl_E}?D-6~nfbJ4#fU7lDyHIc&3E&o6JhjaVLuMhT7}YE!1a7L7k-yJ{Io9nNstV8iGnN2^+I30vD1J#QD= z&#-cFlv(4y6!2@}J7@;w87JzPY=w9UzbehP$)(_%2fAJU3vLyCw72?P z_a>8;=-)?@bp}VDy)5fIrAq__z4XizDe2#c6(Jl2`}pZaGpfYy82WnJXctEabDr1cYqm=?_rc6sBic&q5yEyIGcEBS z`OdHKx9PKp4pm#v=`XtpUm5$G?N#hKWzqkgP1<7ghIOhG1%>|T?Ch*O>p$zRah1yK z80f6=;WHOAT@Caop&a|K{YkB;{{>)!%coc{ECQ08<;Y7xRH}r_yX(t)LCTJU-yb&3 zw4(p(?1|Rd2%}h8-de497XBm#xMnB1ys{+gs^qr0>$+^lTrhxDNhla2JT}^EFR?aWf3~^AO Uy$Sl)Uq_*@ZK75E@afzC15*x0^Z)<= literal 0 HcmV?d00001 diff --git a/tutorials/text_processing/images/cent_to_100.PNG b/tutorials/text_processing/images/cent_to_100.PNG new file mode 100644 index 0000000000000000000000000000000000000000..ee595ef7d91fb5299e61b627614cd60a0d77bb63 GIT binary patch literal 11898 zcmeIY_dlE8A3okeX%$sPOO4igq4uUIwTV4LY?_!gV#I1GYE^61ti6dnVivXc-lIi` zy@?o~+xzohe7~O`l9M}+V<>&~5fN5JP% z!Uw>6m#w-3Fx++3QjocW9DIoZF7Q4`gQV}=DUTw$G{*<7AAVFcbiH$jr2Y1Fx63Kt z>dqZ5QYBew9Z!>;1p-eU-HR?~`cRKIk->U5)thf==xCk*-ETE3;kxIo z?xoIpAGrT-6};OCf6607-vvf>vi{p|f4xp10EV}6Jfy(j8D>TY4ExVo-vYxjTYwiZ z{MP<|fBgT#;;3Jp&N;T^+|X5CjF>ZaTx?XT?kYH&^J(qak5m z+^Jx)eMP7t0+R#EQ%gn9QX1LwRt}17B|tnDCQ^J3d(iPlj=X|`g1d>sBTsHaiD}r# zsmYqFe&coJgq~5({gWUnlWDiT#f9U&rKJ?V3k>?Ok6U=k)W7M4plvMbpp#}mS07?M zC~;=&I&B`3xKq2d_@;K=1=9j?9S|IF?xOub=XZiq{Fpaj2Q;jCju5=5uv2wYxZv9rC}nG&&6o4Skrt)7syMZq^_q^+-G5!bQ8{#ebQE+$0mMQYkrqiMgEV= z?<%|Kyc0me3*+C6j9@u&{p>}>7LyPU$kAEL&B9B!VQF+55w`$2iz-i=|1}C{=jLV$ zPZWa}dhIox_2c$hoFsAE{6|}*E%n}-isQbLmYt7}S_$Zp@L48z z%jGCZ4B_UNLDH6+>xKp@$z(}iM17F)%3PY~zlR>`eQW<-9%Ay+lCq*CXPcuO7wQug z=)C?c+^2uiukeo51Y%9^x%@fTb$t-Su{i0N>`;@~r0jPzAiTie?!n!-ACOvHb#skN zKAAFIQnPwAcQ&hMb=L;;J))`VQyHXDl{(Cm#)p z52-P8hrI3 
z6|NUP^opESO(5B+8Fl3i$_%0>M7wSP56l<%fY?_5&eD9%$~W<(-d^Q1<^3MlA&C!N zjFc1C1M<;?JL@?N=j(fc=go&OCP7?Qf#rW@hgl$^rA0+Kvc`Qzw61GVNl}>gxoj zJYape1okKI)@Nq6)lAmLbru(_$(3-<43!qGVIc#IQf^wrB)>P2Ix2QtUA64rwTYU3 znM%KsHm+FkSn{o5mB*<_*i#4mR6+Q`bGESXas^V8Ls)kp!m?^ez;qLsmh0JSFc;y) zF~%~@w6!UKk0}t{32TAotGWt|%dM=Rh!EH64ykg5XkcEBmYPc8E;*2OnGR6g7}3$u;qFYv6oOb1cr1@oU_(FqEvLk~LF+!<+WZS` z^<7-d%k7Q+{m9cF!6O?oZYu4hp_kZosTVA1`z{OBU`Y#As`(8?!pS|>t|2zxno9y#9}q@jvUfCCKyQV%V6{WfQ{Ux zBO`{F{)~$g^ny4%V)$X&Ymkjp+#%h`5?SzCPsJRZQ}=$p)Gke`FRxZdHXbVLQg?!*yCh&A9QXH%o(v$JvG`c z5&&xFQZ)TZUr637l0gnfdRv2m^P;(eJXwcIpoR{c#bvic0MgS9*OP=2nhYz`pLO7& zbCq^xsp^;Yqrj3->2xy@z4`TGnLJg&i$y_Ju87W7yQh`l=DFx3KC1DU=Z~92kNWu` zFm16GjGqKaj||V!qk4p{9ElKIyra78eL}i8*K}-qvnBrViA`sC)pD}5JJq5$ZVo5Jz*j0NvH$QJ!Am}vBS%%y^n7)FIPp&xnWS?p&Ja{km zZ{2P33<8-q6_aNCuqxXJGgJiV&fIFA`?s$^BhrSa$4 zXIV=1<^it1_?%ve5~7V1S0K8d13~1*d7xzU?f^ZgV!+2#%-jv{WKSn?q`eOWBlSi4 zg+4~dU8^ji4zE~3(fS*{wBFn0$UpwomMW^6nxO?%QYeJBxA&&|lor3F^vUb<#kd?6 zrDs(bUwBmg2*@lgc-;78^jEs2cY{*O0jsR+whQps%@wV8-2`sm3~ z3AYUmTyrdhE#jMVz3Ka|?~n10IQIc;Biv&!R{3E_C?_~9tRfL_4>Wf<3c+T(>o{-n zGjk{CD@*%MQO`!pDjOA?O}zqJ<$V+K)k@28ld(arfcfh^Shd}J;*?|_teQQS)AE)L zSes3zDczza-TdO>wI!~PJZBoMvu|qgvexYh$E<*QKJ}82^%IoLo4@~1)%|mK3~!uC zt7#ZVMAK4Ac1Jm+C_+{7K>(I=8nIR4kTY5HH<*ScljBWtXUBs*O(WMB10R0N;?8|$ zj)a@@m8czA1`tB+FnTjD&9AKJzhyJy zMWt?0VVtqkuM4h$Gi+uko$iZ$D>Ud2bm07f`WP4QIf#2r z3Kz7wT-F+^ELzn5fvGuT&=zHO{TznSmZ%cRD)MnudW;6DphxnHe^+Nq9?U&C#$(_s zZZaajaoyyRt~1_+Fdh1~PLYs_5L@jLiIelp!||Jnw^KYi=}Xg>si%TmCZelx@@g=G z7}WypxKIlq2WyBX*FaZ~`MvB>Y%?O&Y1%DY7i@7h8hc!)&ozC(aRqQvyJgeVCy)(0 zwTXxqJ)}z2WvLbBu&2Hzpi;g-_enW$yJ)dFhj7eJ>Od;TCTQS`i*~)Mu=eV8u1b`a zwqyniT-}dNqL|vZ?Hf^R+jjuOBAHO9^RDVNxG$-D0>X;nhQ0s-LgHa%c%(>`r4zl( z4$S`&bH01pxYStlZ19(TulKRQeb#ZImpWT^{a+Ug&f>(L3x9)^@awW-mLXQXEJ5^$ z7|MPE*Jqi7H_04neisFpc~s>9Z>s)SSMunhb*?R<-4CgrTGuB)pQ*5El{)A`sYP!S zA=LeM}eA=7(8qW~$*(X)kykzp&Bs 
z&x}tF@JwVW+rOLqplTzsXPP=ZkJl7JK)}j*gyWyN|_>J4el!c&)`9dr!IpsI}hvpN_%k@tvv|+DFk;C*%VR`tz3@#QlFSP@gIKpOE^?{ZBn>p~`|JNe?9W14syqhOk-4Y&tkx1smWZvs~pBSb? z`|_=483*jyIMS<%HOR&;<&~6qO7EB~<*?vk6;-abu9iGkB9Tn;Lg(y`bJ>5sJaAN( zx~lTsu|uLdLDKvi4`KkJ3%1s*rJ z82!w^rcSvpdaci#W$2x&f~!>4sFT(iK`B!>dQq2b4p{a}Y*JL<<*NizvSOE9&`M-X z*|35g=F3x`2Uo)uC$=bB-0a@zsfQm6Gq9W|xbgQfAWn{6ub1LfKE#MjPR?n^jSghi z`lfQJ_s8@FR^}BjVsxxqLB(&B%p}7{YhipL?72c9=;1_02wl?k%LA?^;}4zm z8B?5m%$CUycB^ZDCieD|mk^Mfv8pDRwX}cxo@QkD@ZXSCW~eQ3;(y8&2)j)d-GneI z{!@PgUqT2&erLqApcD}vei1_Lw(~ewYEr#;`P0}Xlp|EVD8P$!Q??A6u*VrAt&!`c zkNvy(ZaRB%8^JR`T-%GVzbabG&EuOFNHnI6N>}g!cn4(mXFi$&Vi(vL? zvU+2;o4Duqg6CkH2dBFh2$F)q^WwjlPU?xo(DkC6lJ3U{rrj}DM$Qhp_(X<0o5_U2 z4CdUq@6fJ`&B|44PsMskH$Di0RKWT@hw98oN3Q4oD&<7Trnv+tkIp(*S#xm@P7Hf| zRORMhYSNU6lNgwseO!Qea_L?r)Gj_U;Y-{aeuHQ1#4@)8pJc+;v-2yaHF(c7j>U;7 z6_QiyW#HpyX*$_$CR^tgT{IiZ>2Mw{Dv(~#Jihq@C=em79Z*0`7TC$A$ z4T-eDVCoCQ0Nt*bDlxxTuM&nb1zZci&vMl*-_uWeFLZ-o9pk6?Gu?jBA6hUcm+J|& z`JPO(X+3N~+Pa=7TG{hRVkBA5x|h_^pr7$FGspSWca+-iR@oXBqq0wmje4fOCxju> z{X8*U5kib=TVrJ#N`J`DcAIgII0!li-x+PC?JTSAO*U6xr>-qFe)V7!tx2zSCip-A zqMF9f4g$Fys8$O2%P*y|pw>)E3SMf&Hc!7;^%ED0a<|!&BO#_08~6KlPWjk{v-H#z z1>!t#y;=;^!>vv|9BC8H`-LcMh*K#TlFE+42-l73fRdn{q}+lXr!w628L~;s`M&9K z$RmqcPwjeV_K$mo<@=GJy*(u~ew+R%hw4y`Rr*&^{XWVJN^fJSLjp_$FL8OId8Zxz z!ng;SE{~Msya?_R;7%$yU{6*j^(*lIM43sdpfgnf8o@0GHTvDZN&75}g@bBRon3|Y zYbH!dgTrFqYkW)Nt3ZYds=Auczz&ouHsR-yv`k*W;A{D(^^(fe$6RV?|8F})>F>z8 zzAu!Bmi+ik^D(M0ndc+Y90nhPb-yb#v7+U7VeXDgOIW)+ViJ+B$}z!P$rgOcy`K9` zIIEItbu*2la;7a4ye4EoxB;|%Zp=Z9`S%?P2&mnuPMh!z?W-0Q#%;UelSc1Sw^2fP zZq|O9&aMv)uATtB)vU`vhx8>Oy7EoUv_*}Kt@c_!+s}zCsNy7S2H!<)T>xUmM#bkk zXbSLp$f`2?+68ya-a8{ZlJTv++}&zSXjNATM&PfjaaBT@TaE?m3hY)H6&j>F5hfwA@mUHE$1y*R z1r~%;eH6Mp+;*i+?og-d-ne&o2aE|rJPxx}$5EcC@JZ(!)R@m~6FVG4~XsuCC7Vk5W6R`p$LYF?15~JO@uSJ8hQQ=8oO~qN^X~T(bY(Hj!6Nqyb`!jV>j$;mlp)gJ9)dqG zBDqn&1NR#4OE6!WFAHPR=a2M_K7Fv z0H(fWmaA`~cM}8aE_TMf}i#admNO;D_Zl0>P+JO%}nhZSS4KdBzY;q zBVe$#QKin`vW|6_J(HHuXdg 
zdLRGqiYL`Qy;O{nynyN_iDL=Tfj#WA;}pejUIx`z+Z@Aoa{ua?6Ku?fb0?7UB|s(R zrq3ouYPWsolR%>JEB58Zv=dH9rE*m6Z-*J{pTt+lN7iG;1z7x}L?d)Z3j!rHZFc|6 zLB3#d98|zZan&$A#YxR02}geo1}C1Re(}|(^%}eUVb)S()#uarjR?SIjkpIiFat}g zCa>Oa9 z@)2Oj5(iVoL;NOB?agt#KaHmCYT#o>)hr1^i{sQUR8q@q7qG9Bltwbsi<`OzwmH3T z)?A3Zn`-P!h|!h8_!ToJWCmNYFemTaN%@Q_&5?Y%q&^qA+ef94UV*R-fX}+0; z2?N$$MBeI&M$e>uc?yKZdB5;6bcnuq83}|ov^;Q`Af4W95XIarOcPI0F?Eae5C3C9 zvE^9oC;l_V@V)?3_Y<(L_s3Ik9+Rn4A$(eW=4XVXS?rfW99r=qHoh8(;Mvy{mJHkk zFF(F}OD;lQoYUIK=ZQz<{<%?=p{AttK1f$~1wD~KAtzs(Z3p5>gOI98VpoUbGzD3$ z7vNLyT97a@hA zj;N{rOwFkOtgQvY}YV+&YL?igU=Kp+ud%Tl3RIJw_Bh+H%X9AZ|IqBjr z&DGq=m$a!Q3^9YrlyVspfed(BfI^>pBUOsfFfnmtRl32hry;YYw^zC?pWA3_X+53o z&Uw!G>kC=8JWtwRfG1YGfwpU@uKnp}=P%3ej*TcRXGLFNbaRUpez4Ic@zO|y1jsp7 z1#$922-^a%UbM0Nun9?tYoIuKFq0(;)o1iooe^r2J1cwB+B0Ij4rSdT2}}5p+=>c5 zZ4lu$iL%QNWB|dchFjZnaNBu*of2-SZbVSiGGYyQAp9- z8o3S&LUiq{ZP75CaGVgF9~q1V9#9Sv-Wp}l$FE><3yVGqYsVrYGb2@0JF@d#o(in` z5=#yN(v&gG{9yn#o-UTZX}^u=)TyYj&{zqHACs2ssL46OW&2r21DqeQSI2qs7um2| z;g7o|eWPC%lnSj$ppgWTMgl8YfMTbOyL5gdm}Vu-vlK5zcI>D2D+Om#eIYEL5FJS~ z$%+{-2*;r83fR(yho3x)Gpw|knbLl*kpbNLG=)RJSa|GKHS44I*_YOTTK+u1OAwmh z^(vWKS`d%mHh(i;5njB4Hehm-jSE`%xYYzA@}E?nrl(*4>a-e?N6UOptR@0RG-_1m zN#bn@xy3FE*NOC)1T+s<<*DIe{Kb{`X^d0ep3S==@i-3vj(lbhRN&8lb2 zK>V8{Li#3NIaef7*i~}8tHp^(1rYCD$PDkbl&v+lz({fLyJ6S~bB@ z+Gx?DvJ_&95BJ`lE(cbCtXMbeZ$zT4ot*>r`>XfHp=kjT{_P>89?X%kBT5A+Kj`s} zUg>D+npb0!er*A1sOJA5c)Dtd&U@Ru9=?Tk#434x7rah;CA56@;$>Dfa73-mb5>1m&s6S+>a#^w!pzHW#lVHbm|!!^ku_=g$mOFrI}(z^ zK;RkcmHA!!Bri6ycIo_~=E=)fR4(XfW7J+Vu2Iqevs_f=4Zxuk4ry;=K;x~EJQq{q zF;G=|fAF0~`cd1qJxenof}y=4HhoP%R;fUuasfbvr?``91@*{xVc`O?|ILN6{(|n| z0L*Lr@E<_eJmyp>rvFs~C>w4349jTwO~FS zsLuwVoNOsyx^%w^nGUkw+{{#UG4K>vY}79`*^M`SmR+#Ubtot(CPECGC>MHTnScQF zW_x~*KJMPmtf3QtVQz?ikfcA2a{T~wUz^I`N%4+uHD~!4in|Jt)N8v+lC*CE&Pu|c zu6EuEi7nbooheh}&&ebymSVDy)9*GKzfItHg9{~`%2Y%`PLY~@HNr4M!>LQ z)LKF-kj>&b1YnG}oKB*-@dnj0s%dnsk0i-v0MT#2^yWf8nUn^V6ZfRL7@H#MYOHiv z;L&r2rB}*TrB4b^KNxEUM0nb5Jo#Xy)lNWQa~NtJdV3TBkT@vz-ryYkIiHN9UyDaT 
zm-APd8q;PCv<|Md-?;%YFk>w=a9PA2`;X}%C62J>TiJ&0Xnv3k(Wm2`xv4m&IY4r^ zhwK-8=wSuhD=zIlWKCICjs_L+Z(eO0e1DzpEA>aEZZ#LD`yant#PlO&?AC) zFSnCQRUQ5Gd67wV{BH4o7wA_l@pV zX1%eEkyVz;+dBuUfID-~F`#_ToGI0pK_PAm$0^X)eCj-Ju+se^$E31Oi+DEK@*VqsG zVfp_B25kmvwXys-46$>uoE9;WT><#EYPgK6L-a<@Pm>BF#*6snR#s)*!lMJHmMgR} zHJfhPrh=CD@Wi`(xiz5$H|?V_bt%-Sd~ZbVS?P(&cYiyo_u8r{YKFDrZPf;Z74~w zFYYx-!*)fuggldjfi@goO*}qnYV@s11j6Anms|0-yCOB92 z7%E+8Lp&nG_oTm8=H~j!0iUEsI`9F?`)~!IRRy++E7fAop2d?8jSOciaI)m=NwAf0 zl(t;gmyq{yH55B|z4sY^pG&kQM`_5!`@)gV|I%3;YshL-A`|Kf8;EwOU6`a}jINLm z3jW5fXk=GX-46uxx-waunOYmrm+mSjjn*L+(qhdKwlICC0O~t!)DPkaWr;K|>eFo$ zG0)e)YsOzq)%7l67%jb3VaH1m;cFkJyMU%l2J4B)w_ zNtD*1Ka?;L6OsRKWf1Ut{8J{YrT063)%11A`X>afe6#|%U&{C1xAO@m(jN@(3qe-x zT-*{Eklh}(8XZqmIICj^?2iG3zHT&ke6DKRBD~y2sSQ{lX&${d)3G&t$RgWnB+SJQ zy=8}WsQ&%RKfoF}(VJ<`#N)6JnMsMDFLjAH|8k+3Q&pA`B~p8 zN=wTx%~v2~Er7*AtV#W!NrDm)4jNr#Vrn?^+>wI&WVWD7$H=v79f*>JUHX$#zx`GM zq<|z~m5BHC9ms(0{xf}%bo`u;o*A&M&{(U7P2^Id3JclKzRq9lHtgB=?FA5VO|J@0 zOvn+b@XZ7IRgJZAc;qc-6DP0IUKr3b^yW$Y8)>ixG3@J>#SI{G^8*@by)UFB=Cc** z`eJK`*Vs-MTAnHb^^6Jz<*PACcq=R9#yh6c3q0Ece(I6~!kiVEJ15~NryAduJTJux z6w=qHL)E??OLqw}0n-wPy7U32BXbTkV}cuG1r5OqD`LAF z@UZakK(vGJY_Z#^AY)1PO;(0sYz7 z;2k>P(w5$i<=l^P#_C5)qJZpnJ-(I1a&RQ4b1Qbw8;D^u*b9T~d&1%=Y)UNlXiX z^m*)ry|r)cH6DFysC9V8z~1FMJX#n!qy$Pc%8zY;lmwBxR{5R)~FI3Pq1 z04xi7*MX*FSpCBlHyI=6rYN4U?Q;s3e%N^cIZ5A&Z)(LorA+cJ@p%7 zt1C3ZVY*FVKX8&|_~}jS_4Z^*vctoPGKrIZ+j|brbKXxd{b|^1o+F7O(E+75~zv)7jGiH5fDh zo_=W*7%C7Pt>Z%5ek*OA!%BYTKl_-7GivdQ^q#Up2~HZctCbBknwB7JpJ;G{Tp=g> zQPf|?!;5~KaM;=GxlTN!Wz=D<*ovSIbJNT_=*K2}ikeIob1ba**_;zTM0xw7R zDN=D15k|x#6e+UoU?8;47vs@ns14h(GApsj^K9_3ZJI@UD8qxnz*nAh$ExF#j`^ofwCcr~g z&ZFLd+;s5DY*w$^GtUc5-k6=mf7XRhoyxd^(*Cbe@Lz7SV+DqT!piQlA$Sm~7+KL* z^^;}51mMmvCrZbcU@yHy$7SySnftzIve1CO5wpCt<7FdmA9+iili`cW`Dy>;8n7$X zwLr5UR`!0%c?z5!=Xg5UhUycwP$F@Fb#cBjGmGKYY`g!pf)?g1vzFYlUD2m%XqY(gC8uBL zR#t&%XuPP19gw$snr_MX|9rXsKl2{{jKKPbQPdjj8n=A~9{rAzoSH0B#w_6f0JXa& AY5)KL literal 0 HcmV?d00001 diff --git 
a/tutorials/text_processing/images/cent_vingt_bad.PNG b/tutorials/text_processing/images/cent_vingt_bad.PNG new file mode 100644 index 0000000000000000000000000000000000000000..f8c56586641647385ee9e5687322720fa2703d2e GIT binary patch literal 13247 zcmeIZ=Q~{A`~EEvBodKmiSC^!(IZ-P5~7Y?V$=|WVTj%d5(%Ot`si(xQ6d<9M6~E- z)L{^v(aY%kw$Ic1Ke&(Y17$m~*IsK~>$yBZG|Vm#jF4OlmDBs|Idc`NpXtFynJ6@U*Ee~GwX!K zVtZo(51yJhp-&)~p`mwy7kG=*g3LhdW)$q<^**BA@@hUyx3{+!D#-9HK3P?kgQBLK zq=X^8$^RuVdrq|YZ&Z_?O`y{pyO87X=KjuLtt-e2OE}xZ6E^~u$d$DQ9en4 z-)41^1YyFuFU2;)9({(}hU3edXG-2y{6VqI_^hNOnwoHY#=qK3Jtu7tX!Hz|$A}Ij zMuwo`7ZpI5>`Zg(msm!hZMkiSsVM@1Apdz4j~Cuw8K?=)qFg0RIm_)wIs3PI@;2}M zx$89WPj7sGZ>a}1NAG`eaWp&YzwI|G9CURyi_0jaxek~3ItoUyWa0joPw8u$t}eF4 zXHUm!dP6B?R|$V*qHLw{C6&Db4I5Uo`!Hv}RC`B95kd5E&J^)dG!SwhUmUI@%i;S2 za&q)cEG}&impkPyE2h9N$4%X$Ql7ti_2O;g`Cbo>E?zjK?L~#pK4d?2_Pk@-wTFMX zo3~u#k(v8ZM91-We77^zI$EOR^60L`afrm(MrB`)Qe-c)*YrER zAflyp)~n+$DWVRE7iB@?4Kp@n=u>Eojh%?+&Wzu7wQ$1dUu`2JTTSU*MZYGyS3L)7 zn7*}wOeX>Ku@&P%QSkiXxP@1>N@+*AgrBnn@as#{TGvTvhyA)3&vGj+(8Kc~aLb8&I8l@L6Pm~30< zjHy}KQRPY(S9Vp7=b*Ex@c^R2THKaSi1B!fm-%C`RtG*3T=;zmQg&f3d~w^fxhOmAzzH69HTa>0{U* z(bJlA;89sXU7oMVrCMC98+mTTLu@tEUJ7k`$X=Zcq$cnh)z1>C#jpSiSns`SWt5uE zS!wG1@AEP->#_XakyS-9O!U0p3vF#HC3m`-3&9{jTjIKxgqr3_RYA2cxpYd^&(RCa z+bZa2SAGH0+QF?F(f#GV%I4fV0`782k3!m>oLwF@I}3t?{@Kvw3k-g%W^`^Gl0F!G zO@N3eb7lGupATG}$E3eQ6rc;o50tnc=AfBA>gEcRTWTAbm^hR*Y)S5KR1Vb3yjcc< zv_c_w@At|0SZI=ACXreft&!=w0j=tf<>pP;(KmW?1p@DbGJEB&& z%LK%D@|54=2!8=xA{-9?dxVA6;2dk1wIfE>v$Clqjl8h-TyDS-k;0YX{Rh^^U;1Qd zaL{sQFR^CgNNVF}>juTcXDD^Srjd*|GNuo|vsRir7@!-aSApZL16qN5)HR|S|>L?)TKhEGv}xZOIUnH`SCU$dyODp)V~bd z&mPEb8HBf!(!_O8oCcr3+eR^%1r$wUhczf&ay0OMRCE2+-h*4v9Gl-K$hxW zNJhU|AQXCaF?+QMy{omrbSR-;wXQBkP7@%1R+)LxJDNiS%=gz0a_GmV=8yhEALp!w zYLUIWu?)gL2F5fjpR0BK8)!$C)pw85CAr3)4&-f#W<3zu(q1t&FfbU42f-&xSe|Wq zO}lhXs+zQq;c)32A!8eO$I=}J)}mfwsqMzflZ)-3ZKw|PHq~8uEl`VxfB|8~&w0Q3 z>a5u*W`5{*(9-w2fnu~vcDKm8YF@1(3dh=Xb1K5oORuSK^Ys0$hv-oUS(X@2z(`G* 
zBUVmUGHZI)*ZlRmQ=iqBALotEo-d1=!PbuVlA0amrnqEx!%J|B>xePgFD$-=NQI*R zRC~>)+l_$!9)X75QNyrbf_~(TPU#sWzCnqMke(K z^7#YB@s5**FbUwa*_35wW-cSm4`CCEdBIiGn=I4{<2GyAHR4g*=>rc+{H()>w86vZ z*?pT3o3}3f^vz=nX~WzziawW_8WImbh@u3NKf&#h>mOqdhPR`ZR=aafh4qid;HNfg zrC|ZQjG*cZbVN$W=vuJF)#f3m7{uz0{Dd>sw*4scK{JawpXNwieis*%V??p@nOa|- zTX{0IeppWp9nZS}ntVS@gh2P*Z^N%*leuKe69m_-1RjBVgpVghTWNBC)^%3(>qmB~ zgti4@giNmJ5@^|rdo(2Oc;P&ZQgeT%J*&8wn!WVg2hP^G(a1M;a}|cdugyAM-=}H- z^~^gRU9HtW;-o6oGNqZGfx)@jEac|@X|xm(%FTxHP_;_F7+eL`1`;j56WAoZ=eeuvPL@CoT9)op`F^7p@(On@p)fy0q#JJg<=ry zk0$2P(Am({hgADy^YqpEs3`GjwG>!!j8XJfX*(F3u#i5HfqW2h8FT&Hc-dj7iY#Q3T>PJMnnwtwZ=L#HN> z%k_NzxAi8%KCvP2Nd)+?zMSy-SqR66z)s@7_6t^s<&gd=!`6~%mvK|-+oxmVso}o3 zl+CS|#9~z`_SnR8Mw;1r5B!FZooMzmS?1^t-bCJ6a^G(%Nwl-leU?QLUt`jzs%Bb)r$regYQl^%wWF{1XQQ_tShC4 zonAHd;Cqi3;GiUmiFt9=@dVn`Nyx^gt#y*U#SnCT^i_5$7VfDhhwQ2<-4J2kWh~qz zz(*X1r=(6brrbAKhki_qJ3%IzU3x`I-8>%h@T(l$SyFc~T>fD9nU48On0<(?7m+u?bzWRprPS%fal=8#B#p;fWr}{RzAK z77I5!^;4G*oxE(>SV{~!2bSp7{its&?x%+fecBx_G*A1!?V5XciEwTR9uLFjQ8?Pqq-7<_6$`pjfLALM{ZhX6{n3ll_}MW>EJTaRnME|o3*ju+ zLxV~%28;>tg83dfmD{H?1V!DUr&L8NZQnY|TB4w0=keY&rl(X9zvpR?cVMnZ4W-pE z6{atZR5e4WZks(BkEeS%Xgm6n@GLN&FX|t^SK##7=SSwN&|&*c_a&{3ZB|{z7kl5> zguaNl;bpdcVw#BbtE6@6?g~U9bUjORE_N-SU)JS1)2wYoiSEClOLF{vyGjyr7kiEd z+2dC!S=jF@agc_N6Ohh9Pa1 zN!7Hpb0>zoy#mwkk!Bqya(1yV?-9dyrCY0P@~U?gtX3;c@@d%vHytc0GY^xR&-0P0 zI$?MGm@tSs_+Ej$c4yVQ*DoIpQHW1l*4Tf05)lK8{e>-QO-)S$|B|}(YtM;yUlxTD zaO{bW;&l&Z$zQ76nLfWwrTnHxc75AzKAfu_-uaL*9%;vqOq)!xr=;9mm>VJ z)bv*PRxsar*3!*i-1Ma<66913k4BT}d;U-cFHx~W?P5Cq*~6BffmM|{vlBFUJ{ukm z3#Ges(3f%ICMv99IUi_%a_9cZy%Qc$QPDEgd6CI))}(G?O6fUNeDHr{;R3OY^YWe{ zZQmshNAxyl3y==(!3ssY>V@{Cwq4$hqaz)1nB@08?`rHb;1cFzT}rW!gC3=VAsXrS zl$w8BTNc_r({HDRvguf}F($q_qB?dELb1(%z3@x=o!rKuo;WdZ63Tl1do<5(O%jIj z(6g_iij3X_icJ?$?v-?}3>*Sa-mVmz{4~(#KMNl2AUFi#T!ZuMvPPDddzo~9Y(DAR zWODD#bowCH*ZaVYo{IhcEJjNXb|t2kW)dDPy{F>mKWUiwOq#W76f5+V7|GaI<;ZQ} zO2ys*vlI7Nn7{FF!xMwNyLA|dv5zzE9JsIyBcg`^Pf?C&4=s66S=QBdaH=yNDE+Z) 
z2t}2Q+c3$_<;*+I-`r}x3N+G_aH?CP^K-neB7sFm7NKk$>=pk#U|;@gE4xqcr;F6L z60RGV&Ci+sVAJXqY;)%)=@^R3$m(b4Y-gLs#ikB1f=pvYw}U;phha6PK?rI-h54>R&f`rF;()74xMF667JHZzTQ`5>@+Ez9}CAp>vIK)hHi!L5>>o= zH$u*0~IT_QeHaq9Lb6gfyh{J4rYN0)rbp6-SsiU;Wd>)X6SzrXgPZuy^=*+Aw?JbfmX5 z)K(32lND`F>$~z@`B~v+&s7|JqeB$)%ZmCp4EsBp-~7ripFQ9h*bc;0)??h2&FC1h zzevAc;Nc*3%iet17(-Ll0kJvoTQ1R*t&*VFKiV;ImuMYz=^5{ZC{Gja@;+&>nn$j+ zX(c2IsX7Gi$0AZ@p8@q2(~f%N$uK7FJ7vkMjp}jpE^GZBtI3Ipe|z;s`YF)T6Un0P zZ{eS96$t_uQ2VA-8Mchlq*Px*Rx77*#u^;I{uHg{i?o(bPEEb7;!v(BM^?X#l!ZVa zRi#FAP$29R8l*V(*&#E3Z=N=2?8mpsMj<1o_Jnk`WKZ%g8lEL6bLhdJF&FG%6Cy5Y zJwh`rTx6YBlh;ZS;<1+Km32j)bRoCl{cu9Lqlzh)o{5H*>y=`UnWJmEy`gNocWe@$ zgP(>4#H(`c1qfALF(y*F`n=RuJVmoc>xGup8{WC!rw~0xxA6~n_yh8E3Pto`jC%dJic`10zNtTj)Ak2xE&}x-_rG#PQbwP zhZS89tx@w-{D~!%bX+HfTb4SJhad2VU}A)W0%-j|UVSSBM8Ws2<@P28f-?}*!6}}M`mLq}2~cm8 z4Y&&SP#a-BtS_Q!y6@h+rgnyQ6iVly@-i?oO2v6j^J&$yL*s1ALu5eblF*;IOO@T+Y#=f!XHH2Z?4dcC#3Z%9VT(Obyy zv-FOn5I%GlrPTv%#*faTUoO`4M*&tUW91gk{gQaTv)Q3zpwZUy>z4-LN*QO&e9OD3 zc}~6ZEYlU}jZnp<0E}PYA=1J_bSkmUae)PM^w!8%eLlhvBV8`QayDkYpJeVT$_+kr zm<>4ckjs7yKYrEH2dW}0YIdSGO~QZJxxAzEshPM?QE>*#7m zZ8z_u5YLURu~h}=bgj!)!A)aK^6>`AJt`&j0{inNv6WpzVVI|4&K3~}TJ(uc$lI)O zJ7Ak}C^SA%VstwbU~^@h?FH>!US+tg&((-PV_#&pMTCfOCj>1bEc?>4|o!yASWz7`^bYY(RX;n(Lf(H*Qbs5I=jkz{w}0$ zubcM^&2y*8U+tIucev8#pyX|WWJYsrI{dGl1Mns|tts*vc8iXQc99D+#M%a-QT8vL z+D`rL#DCH%CpKEwP_iCQLlTW8950-FE4;4xS5@DHD{iFp2rPG+dEsnk@y(a`W>Zka zKvH48^#4W}KS?6>JhqvwuK@N{<31h?%IREO?0@%#IZ=8HR;>;*`LPK}rhg9QGe=b$ zdgNjP-B$WNL`h!2X?Sz9rky0ss~erAPC=b%Z7*jg0);hN=fyDJH=MSf!|viin_Ybf zDXVFiM7O6|<8G9*#^u!*OCH$7WHV5wM{hlN9pvmAcUE+Jx`dF;XDbQ5CU{K@3=F0q zHFfn!+}Pyg8dTfvdLBneB*P6nfj-A5l{Jv+_2T+}vJ#GM*w#AtVO1`Un4{x`7_aI= z{Yj^=7OF-e9t%6j?^O)fT9~o65;;}#e){&rf?17II&#wUpwvl}q7wQej&0*@jWGqn z!)7}RH@F0OHr@`5jgIfFWz7YEqC7;o$yoh=ZgvPTFtXp5eJX2o3E)U#ccEIDRPkwMsk15NE%g0gSOaU3O8E@kea}sH ziO()1OGz?!P4d`w;SOaLok%AUeZ-LIqe}%f-0OT5Wh-rBH>)4lSftID+0BBr=?q`dHw9V}%>JIH0B3rL-#BKS)_RThcvzG^7Q+mkJ8Xn~^ 
z2wNLpc0M8HHp=N}upL(8J7qNU7&Ry({rm>F(<5{5cY9Vw@EHF+Ybsm5nAKZ5JJe1j=P2;>O{?nZj1s9pH)_lY+c@y9+P36~8}9r<)iKMa^tXTPbb z4Ud20XEDAXii?XQG@hqBhf61Yrg&fj)L&wm=Zjo8EpGVFoaQ0Qz#U?v&{t1YnDTkc zYM=P*A|u%{&vgxSg!@SiMJHe<894v-q5{qsul13lGP(q4t@!I8;t_b&3_zn{HMIT| zSm555Qva^0y=bAx(>qk7r5CxZ7k`^xL8Z@LQ_3ttbvZTD-X1sRFk(TN`K=K4!T!81 zX8`l%oN?#dEu?%`1RlL*SJ_9)cy6D)pPZ{?(>$-y@FX)z%%(ejaTjWJlV?iXIqZ`3 zaRTWIKzLDSy{X~@qcketLliR?1fZ{O^0ZoxD}6v+S&o0+y9v!KJP`0Dnsw=9#%Yl} zpnSkcM+fOPKEa~ z3q1jRtj2a(DK%k>z3*dvt`J9~H@Ch+dp z-$pb{23=kB0cuE7X>_xJLj%F0-n|H^nj+(e!V!9f2WsKc*ug|m`3;M4NAO;wd?EEZ?> z$)Nt}v%g$rZH978nF>)$@DXNOr+ALZ@Y_mRnY?-z(6dgJ>F0}FOwgJlR4tr>rLqN1Y8#A3vbl2LWl;h(B2b|D?-o)}?o zYfa5#g*LP_fSsJ~@+3Hg)!$fuBL+|D3XHvUC~sPZ?md&h$TcZ)W)BGAdJoKX!h>}r z&ONfj{aei{C;gTZj5g|WCi_yh>Q-19K;e`7HkM{_Cb<%RT}O|-BrXmxn|?At`%=cK z6zAn0q_hi^=f44&BJR=XLuICaP~MoEBtdE0fkzEy{69T_6|0kR775S)fdh60;ic+i zaQ1}PYyfP2L!%shS}YS2FIGGp03;DH*5lt~(^}kDr{m4No>Q`F~?xh>>&Bf>?+=MGiaO?fiN^46cGo6upN& zoCHdGoYwETDr^K=E7$d_7?5INyoMzoL=$X>a+C;3sg#9~SkN)RXi!X~8534*`Z zvP&q;%UbH^QY0u*dw%@4<1h*+I!$GWU&sEjk4!%VBX5uvxS%C%bB4oZP8Q=lrvR~K z5y}(eTi^*0`2C=(qo6X*rN2*OL;&g@dUmHBz;REF%zg5kvI0&vorCZwebQ2*xtPsQ z)S4fve-EArtY9>y`bPB=Q-p1?3%N-8*6c3&hK*m5XWY0UWmecgQApk&?w@640)CDb zC*4Ls297qG5}9HBUkiAxw}B9))Fw7#CB{zbC;L6$PJSGT$Q8bYG7n1V+lx2N%$WO` z))x`Cx~%0VVck{=*`k;bLVx9VpM{+ON2REx&8+WP5>hje1t~%c>l%KF-LLrjEo2Z_kurB-jx$M( z)=$Yr#@cC1s9(F4iylq-A=l~`%fMn+oEgP)=}u;^G?_0WPtw;;@#QsB{ns=}?=oIl zzq88s5NZ@RPo01qB57BSO}4hp3_!`W1cbw@4FxU_3gjk(TQA>Ut7A1J6!rK6x=xwQ645ek%RhP(k9Wdn6xTk2l%nT%CNrJCM{fS$GFq8X)`W9>r18h)#n!wy)@LDYA$s=6FYZDXSNt=CAt)yFeC zVdBZ64r8@4-96huSALa1?GLZ8?!<;W{IzTjc!%A0rZGn-{^%my0&@%g=QNSk2Y~`65XuC7inEx3@4#QiVywVih>Msi=z9U zm*UN|kNpemtAcf_?eFvD7rXe1#^FI{3*pbpg94B#*_JN4ackHTxGbhR2UpdvS+xN& z!P;PL$5RkC%m_CdWKkwfAZQIh&IVDoJ8z!9i#pG6FDM@%ZpvKJXER-TR(Xie- zu<)^#MR5DB^)IPg6v_Q{;p0#~OFp`Lx4ATuEobFSxW^lFzA$?xU!3<6l*?8D8RHO; zZENA~vaH+J!l03!?0NV5$&Ph=a+M5^EqerTRBDM(!!iR16@>C zjovOVR!6@HgCN|+QJcH+f)ED7{Fq9-xv904b+JaP+AO?~4z=Umcuh&)athq58g36& 
z9X05Kipq$2Ydn6N;O(1P`>aVx&@cA!RXRmF!rCT={yo(mKpbT;Vlulm1JZAJg`c@+ z2K2_S3M?rQ_cKj^(o9(YGDA|_Uypk&4KXz3uU6(AEJuK1!2s@lT zf7_BH%lx{K=MPZZ&RrOm5i&tG)Mr;4x)q5ivIPU8&d7U~-t@vBr72A`&*VoQY47X@ z0&WN|lc>6XVT_LRMx(5NCac7loBViS+*0!Cd_h+5Xw+_2h0t{Ua%l0LzTM};E*{zP z-DK~(%5a}rLfWz)Lt`J;$(Y1G77EmHWoGB8Wpt8HHtv_+*EoW9q`7u;qcr2t-7Vv2 zg6Nq~5C2akpPe|?^4fWFqqu%8*P$0lF|wzHNPaW;U^P%fO#K!!xw|$NO%D{-$+qA) zU9O;?rvAI_P7Z6eK%4!F|0`TSR+ZWfeOT%|*=UAcIlyQMe&C^Il~?j8MW}1+s4#2Q z-ZBqmbNPFal{?M|1;&B~ikEsLwG-e3gVoqD#u)INCl?O{5-Q6`|CaKbpx8S%AdIO6 zWS+2DzroB>Vt7sC5v z2;{NBm9q|U^gq-a3}4kiz?PfmY`WJ>-AA*v*OXM$Ekk2Rm1;aH8YG_I7Z?JxlO;HX z#^Ry$AT^S-I;OZJK zOM;tq2u19*cf|}QnJKr+yzQ;5%IR`M{gRgba>G{6t$OWp=ZoCL*%t<89u8+{{(mZb zv4DxAlMr%;F5rf{r)Rwkx${CtR2QSgPrI}UZ5#m6x}20L7LANl0#C9EI;xzOq&SB7 zW!mJ++CtjUOW{IUqT`>G=}%pLv`qGecfHUt+W`c~?p0X_n#>)~LZwy#_osj9*nHJE zaxX+nvh~3#Bk!qm!K2PuT5aDoV{hJmeyZSZmjkxM`(DAV-AooPOeWbUSyAVHk^x zDnkM^d~nNaqoNZ#qS2o7aj?8FZ)@S&6a~ApkL+D#c)bRANPYr-sBR5Fnm1=^$q@)e zAUGL6EXCEMbL+{G#l^kF2FPjir@4ulahj6cCvbAQN*Sf-ZUElFW}p7NmEJgGE4d7c zBr_LqXId4BLVf0kKuqIgr)q=DDvL~*ZnG%Uup_|2WMTvkiy+I_No%H4Qg+&?(f8Qh zof^Mjv?E@_53Jd#sM$8rOE}p5;@RQV@olrVErNFxVUII}oP7Jp;uylach4}aAuX4* z;mpMt40|%=jqo4J<11xF9vP{h$Q(^Cpol&%HS5)gnH%*?oZvwVGYPfz+UDN>NZiJZ zZS$M1EYG#`T~kU>{D5oWN1@9{p8B=*H680qSEU|2|J$Fx{dYoHzxivCxTT80T>WA) z63z4;g$D&hHJ^``2QHzOUrB!in>tu+9RiYOGMqd7{p2Vh{DdEV!&P*2OkF?;ktBQU zoCl`qDv`nvO=aKbb-(`GmA&SombZ(X@04UFU3gw zs-1j&W#6@ zsz3buQ)12F_V3yfL(P<+Z@yh59TUp&XY8%D9KoBgF4~MX+pK0jU zFqxpe&36WXChv<{Nes)wzbA7q@OTJz+8Vn&#%UNC9z>wup8^fAf#NS5ZU)5_9U^Qz z-t??K!?|NX_6qUn9WS7j(Jtn)$br0|G}6-20*|9QczJf1N_ZV!Ub~s5oM1t221Q7u zHJglD1gFN@^oY#sg%eWD@UYd))#cIDg(yU)a?kAqhxF6k`A?@w@6_q>H{YTAa_1i! 
zf{JW~RWrS{=CxKlDYOF{%>rqKtoHC?w{h0zr5WODlkQeil3 zXvvw|$2ydIY-a>9MAC1Bjn$`p-jZ$mdjT{VmDUCU(HgW%ss}x5HmPknjLHIQ*=i@W z0jC_aOL%?tFmU;GCsiv$R&d9}_vJ8;O5j{$cf-t~fS;Lt2Zwfqf3&8|;5{2FBL@cu z*Kb)7;76mCx!XZeHcVZBvz;a-~u*YL43$m2QLl=dIOSnPy_6uo?-`=UK@PMjfUVnu=Tc1^LNdXhF z`0psxD0lC6vSyxp84MAF(45w59i9NBnM$-L@oYI7rO*)me)v3JK(hKhzH)SABss=^ z@u}yA^M@-l&&l?(j2H2Y%-ZB5>!of+)PF{-Ueb_1JA6*-$bY1?Fm(7c{|#xiu4bxO zAA!#9&TpUte_GW{7~RRw){Gr0Ry>vA-!49nekOP;oI?fI4&E>;71epPrX2|p+LldHx-x$+Fl6IYjq7239<*g@i@yqJ{T#l&hdlR2n~r8uO=nGS z*m`-nC*Fw{1FV!gZPZ#5bidUF>e@(+*5PCA$i#Ug^V8;S#?AEU2k@|4%b zn@^%(5{I})YW2l}{C%I8`yXq3%czeY04z(3->hiUO`z3}*Q5RcpyO9_!RON(i{o;a zyOd|oSIUh7j#i|kI<)OxD(|={l}>-&Q;7YV;|JIjHXqZ=Jj8PwTJQmlUzvS%;d;-5 zx&ki0>IcXAf;{m2!W(;PKV*N<2H$=V{Ac*ssd{AXbdu(pvIS_Gaf zc^dMQ7L>7qy0wR~b9~CQBoX9~Q~K`e-_OsvgOJ5M#)XBEN=5GiXHmhF!=icMiA@7j zRtFLOY@#`a6TKK@Y)HeP@T5Crc=cIs(1+f%i2D3dpdh9T>Co3MY)-H!=6y@^tT-}7 zVNu@3Z!Vb3(YLCJfaUeMKz`P#jKkyco9`7o3>=1gEC6UyH62MzOzim>Af7Vsxk~Jq z?W0%+&NYW>y#tQu+_>cz9|5=Oe;OlgwI9U&&jm7ANbkE|JYyL0rWBF88w#EhN_)I7QwC+{gt;P(iaXYVnfavG?KV;9@u>W1P=d45$ z#Y4vI9vk2~RwIDA;+TnZb$Dh1F`sZO*1jNp!&^GvUk944 z+c!{{Hwr&2y|JqHAl-+l+lJ0j6o=I{BO-(*71&YLLlQF7C8J|A`u*zj%*^%R#UIY}Qi7;8U*iXn6_iyQyG`9`i2q?wKz4si% yM#wp!4qq{+CHgq5^nWiq{Qvk}k5Hj2N^}6`12TCPc+rN4NL5KwvFw#)$o~UwDZU{9 literal 0 HcmV?d00001 diff --git a/tutorials/text_processing/images/cent_vingt_good.PNG b/tutorials/text_processing/images/cent_vingt_good.PNG new file mode 100644 index 0000000000000000000000000000000000000000..88c0323dbcbefcf7ff5afb297254aee0b97230d9 GIT binary patch literal 31222 zcmeFZ`9G9j{02M-Me-?28Clw_k#%IRgzU>$Vys~pd)dhnQQ5P_*h0p>?^{TSjCJg5 zWHl_w8ly;gaQ3m8TF$Nd)DoDJl4T#qoun69mHaf$)nc zVt&g40`Uk@c>Yw&-3UK<}S^H~=YDd^@f`Nc9*8aHHWBl`E5DETnn?a#{r8$MGs4&1a0vCk zizh#?&k+5)`uLbw;@<`2y1XCxzbn?+|GU}$tr-OJe_I0r`M=}ze^G-F6#oTBKZzl} zs{T-^uU|M*SLw~IQ2{y0`u0Hw8y2-LtH|0tURZ#q)YbqgpWwE@uipj*l{($E;Mi)*nB# zhl9sb6N34+YBVZdxbG^9I_}03D`q^!=EHn}jAQP2jaPDNg~EGbVVFNyYq->l2!ySi 
zq}GOxa%LaS5N=I!uOUb)eno^)+03jVCt0GCR^$;kRLgWFu3Eb+AEgnN@HqY31CETo z6o{Hxa!&C^%@+oV?_&?Gtc1cSZB*uSTCu-_m6wHGum}kiy>Sc}PYEG2M2f^%XqtAC z(bhdo82bF$&+-o~9LbVVee`|nPly;ija9bF2zj`xMf8acF0S(~(jH^AL`&1?IwzB1 zyP0YAq$UUZDpG#*8J+x>_}KzPfPt$ybVg6prW5bQBVYj^ulIyYYDO5FE4B?{u5Fv! zB@#m*<+30-`B`9&zWMy6jc!N~4a1mzro8>IKyZG%=BwP!FOH@G+*@Nt+a=_2I2E;l z5EeRD`id-(lXCLMe^dj%^@kCE>D#hJV#%u)YdOML70^&d=xKW{cc}89 z`b}17)Xc`NeI{o`#N=>jLiFE3%h!abzrO=gX_G+zV>9>Jrl<6{aJwrUR^~Tma#FQc z)^%TTO}kKzXA9C9K{)%oUTT`jc_loz9KL^bY6NeRTIBEosYKt!;}$V3A#iz9NRodn zbxip9)BB<+Eh?vo2E&>hm16y&N=^vm2{9p2zod`Dq4~uUdMY4l@GyosQj5wwJ{SAf z@na=ZyIJ2cuTl`V)TKu$t(^rkm=!fc-TP<-M+%1RUdDEK_RYJJy6X#u)VeTQRbT=$L6T%z(wxa zllss`mN_=ti3vBZMt=L{hstr8D;USdsPt^pqvZXv zd^8nSz2$V+d$_E#1uUj&Cp(q3yMFh~!T`j2;Sb;nt3Gs9W$$kv^3YG}{F7tf&;#V* zXe_3Un=dIwOhp^r*P^H|uJZ5f>BQ#_N2y)2ep@!Cd|S`j(}i24z|>!^%tWq+VXEL27gcy(%xB7F*D0RxtEQ_P;U#o@=W*>ee-?_aana+ zD1*9rFj*Kse4%op(Z}9G3DP%i1%HtwouZS5l^ zX*p*TTfn~cK^sjv^(HL7Nj5r1RQm96YxGWs#qAjsnAz!~;+&LbD`or zB~CUoHe2sTg!XR^=$EJVN^L|bR|5q22l{4*{&(Sn2n&B>C9AT0{2FCae5*FH+E>$n z)~MGFoOK->mX<2HzcQsWYJtYH@vFTETr9^WET~2sa4Nd)-ey`H3vNM8a5cvw~NRy@}?b z)@}6GKqgU&z98GCK@N!o6|ysOXwNL=UdT8t0J{tTvic3A{jR?Oceehk~+l9je9Wa0h%swh{#EIr5Iq2xWg0EB}1 z?U^fN&-&5KF&ww9-`u>M{DzL(!Y}J6xk|NFUF1?b!t@cE#(JFi5DBwUy&m6~>x5U^W@fZ%&;^)#~LFR-Ox+ z__ObRN_i9wQ6qPL+!S4!sEsB{5%T8((_`vwhN!LBeR=AaHTiFXqJ#-fELaZFCT!oF zEDbGpIkoiKzX zpubeu0ZJ4}U=dGLZKyQM1)FFd>utP*Jtw;4{%YlF=Ihk^3A`?I#I0!q?h(zy=smkA zlpi~9BKO`Ro5GL^eH9^){oDAW1$#dmli}sW>|FsH$_6d$6~vb}f0Dv9k%d3avwpfB zTE`nLNUK)0JQMrU`Vo8Bku9Sm29A?Zd~RAn5&Qfi8`~WFxf8nK--xdM8xbWL=UD7J zny#-e^vy+MpMm?4>1*0=n9SVw(brb~dRJQ|c?1g6OO1SvyYK$VSZggY4}>!dWnrC)U26`ps3|DAc5!=GVB{4@;X`OWQOA$nnfW=%^^LPaeU*4q56M#uD$yU?Un zs7R)jNvWScpJ3*QR+m!<$qM_I;~webPxld$k$5-`0YRWicIP#64g?$4`e?96*^cr27z z-Mp4=kH000J(&Y1$JZ2HJLuLFwQ#Ufshf^^*7?TziqHL2A8qN1u7WRCE7~>~RNE(GZ`i-RdW*1T zmoFz-k;{SPf-yJQS(^UM!>uY2Xaj;Beb*&Lp&J<g&U1qaTCV#GB+OrfP!;5(w^O z8%zjl^Ms3g#4tz{^8WYtW3o{kN;lKQk9~l6P{rxDO36AxYX~+nL(c7)dEMFiY@9B_ 
zjqk%3V>NQIF?DrTU1zLp=A~7^> zC@~k_fa1NAk&V-;kAJJFG>{mJbl-?P@yxj3fHTF3mX!D5CU14!wZ76Rc0bhA?Y)*;lL7fNNd$1-p_r5VWjO1PU-rvV9VwS8K!im7p$2GA$3EC9Dl#L<< zs^>cXP4S})KoI|NQaNSCZ<{{AL{J*-u1JrBqSYtg_TzP4So|-_re2!Ev@G!lC2r?b zVD9|=)>t8>>~CsQ`?K=s8l#=fj1YQk%YtVkBRhhVar~6|hyOJO)jkR&{yvn!Vn>LV zPuuK6;;WmtxW{UR%J59Rd^YI}!oSAX8rLhTbV~;2Tp5p68Jh} z?D77x4I$Xg61cJ*@{Aw&n@&BWmVz^ef@m53QY{F5d?EH8&F+Kfm+5VVKV#SdD@BETBC~|L5S47flUnnoXH27%#kuS<; zE5taSPS>FCVdLqxtJP?c&mMPKx7ccY7;Nm9ZRIEuxmr*K8+T}I7`6Cf*yTq=vSuUx zH&UjH<$RM|(0NIHCrkRQBNQs!ndGsMuyDH5Bl5+_GpA-Yn7d1O*uZSs%)%nGX0@=c zYj=478$c~(#q71lzq31_miWckXS9B&TO5;?)_i)>&TNYH*C!#rR!&qzqx!fx$#W$K*%i;f(8*m?{w_m&qkceQ zvxx+Q!L-$H{-9hasG1UB@jWd$T1a%5Fs1>RrI9cq+9VdA!)9bqJy=$l6B0nag*b$L zNy3B9I0l$)yy)RffSsQl17G72*1?uc>E(>6Uf6sbqMJ;z)@{RHM<4ZIAVZ$IY$Z1j zJ*b1go*oZ3!fXm_Fy51$l^Twy04i9qo=Bo?>8B&SgRf11D#ErTaMU`lz?Qxn%+4l2 zqqA)kY!Ym@_=at8NC?)x4v*X)F{4i8|Mrk+60_&bGHI;$)N{gR*l-w0%=t0%DyzlC zl5kMNzJ=s&ZFzD`>$k?D(p1`qT-@A(S*bp?oR1!fTpYGbD{E`VCP;c!VoQ%_Lmo|J z0U?o9gdj0KpSn1kzDO80^1@1N{iH#Tm7|ceKR+#${^k|TN_ItyF2tLymER#{^W0mp zP+MzJJr+0PfMhJfH)Y`8gj^J#-XNv9Cb_GAG8VI!vBqLS#Ch`fc(q6*nq3Wcak?mN z7Z-`)<|LU96Pe;;dAs?JN=QgpxW+BzB#zh6+PFx-GKOex)G{?8LhAU}LxbvGqzuo? 
z;->oTPHtE9*iMhM!$MbrS<2gO1*d86gVAM#uCaH;B$f|aoLIbLS3q61q`jD-MHx-|>X2UA`> zDZA57J(8|sxTU@fY@a-fU6~@SkcMG=arBlgmBo+jps-RsGAgKf7Rj=3D5FR>(u3o_ z>wB!qQiouc_)1@J)a-_^E!5JsHYXe=n9$o6$go@dKdv$e{8UYa4}}Xv(iZD^%*V=H z?2q0$KIKq)u@^-~&1|bGvGrU$c5Te2NEnV$E_te85#p#{F>C~`^FgEu!i#v?YW^Jh9Zn8In?u4;5x z6M?Xq@;R}?{H2X|XgnLL2k~a1^`>!oYoWHWLTU(@@$l(Zgl~DXHNW7R!uE*(CJHDJng*3gGw-kS}(O3n?E?^Ta z%N%3EGV*U*EOIzz+cH_;IlR~tn-vH3`pwqMdZb{9YdIEGA0(O1)En#1RjT ziJ&gNo<;e$kuy0*7Vu;({{TdkN_fy54$JUSs_4-?6E(E$lf5~J8E`mCmkpW7ApYe- z$OH@;RCFrxn&J)?k2%~kNS-nA<^I+fQ9xm2WIbIrbCekeb3{+p%J3a!`lyRlCg7uV=6Gm61f8zo|3ua8DGEoN0T$Ve_bB0DoPfh-%tK!o;_3fUHZX;2bK1H zq&E#r?CZDWc6w!~6KZS4c8(TP>vTz3S-@FAby+684odz$!JrpJp=FbrK5eF=GIW0R z$vQ!`_?Sl0x|rhCjl7@T9U2U=8H*_;6Vu<2YjqsND(tD@6b(qN_ik5B`_%E|=Rsk( zyHC?jw%S>+A6R^>LR9lw5{aLp72Ux)j}uQ?(-h72n@*QFe^|K&II@4XHQsm1V{vq? z3M~D!Djfbr6jHNV(#|aCd-U7bw(p>L!+k`(U6$@l7xEIGHYi+T+~~7Ey^@~|DH9V6 zniR3+;3)8m@E?M_Apu*GgRgc4^YE5Rs^-2Q#v}V)q-KSF0KBu{R3y z-+?u1O5yVX16rqp{0h1_+wfhGzBr2D;bJm=Z;;&P`~zcKUA$o$$4W5BM7ql-qxw*?KT#Y>D>D-FCGZrp) z?u*ZUyw?S_$-jP#b}o@0t+~wzF^m9E@p*651EsZhNf-wMmqG3Q+yntjt%v+;Cfvzs zq9+u@IVs_)=Y`*ot;Ktc0)uZpGFnSvcp{~V@Flf~x%0KD7-P$6la)clY5~b%E#KG3 z9nB1tvY>f;FzL3QK=fWiUS1{j>WNX#c#5?RC!?F5*YR=|KJY%NXN%El>iI4^;)L?a zFo6t+DzY1_^a|RI35myAFZHIuWNRB_UxHliA#P#Tc(yNiPd>by9YOKmvZz(#Rn`P^ zqObew(FDXHE~zDpiPiq*SGq4xVL}qPl30d5VFQH^^Tdv{H z^2{hkA;^*eIEE$+aDF%w?eaGTQX1jGr@9Su)ZzUME3Y!sKgk4LEnzo_W>YC05Wu-m zK*T~;h)~e@74Heh7OKk4T#VydamzJiLB$Zor11kr3TRGpS*>3fllTz5_pr12SkpaY zY^wD1xof-cP1lLn^5caD$%pB)0L@ekR0C|22-}cvot^erWU6BGvn3SW44>;Z4yG*x za~Bo|hPlkwN!qoFOu3tR4I)rI=()n#T&`F7=^r7HA>WS{7#potLAtdZ7tMo>xQ^R> z+450OxJq%GGvGz*ghL%3hOj!0R+D<&L-zcG{~gN|4`gD`(UjMQUE?DS5$|K?MDN=d z=LAqwMdoLFw<$n1E6>x~JKw*EYdzcgv=*9G(*llvAyz+#_Aoc#Q9X(ZQ7TL)MA z3qvlPp&a||V_O!_rL@Wslhb!9xoNM0b8|A%GsPItqD~l(6MPk?J_YNeWf5Zru})mg zTIuh#7~JgatDVH@Q#NOo1gqVpzLudIj1um}qKipRw)7db$Juy^NSpQVC-!B{WQ;%G z2F2Fw129`By#0Y#(z=8BZ4m*MG;ETxdR^XR&2#p{K*G|uLD*IQ5qzJH| 
zu$_q8q{|ol)k}Z<37J$?Rvj@5PI}M26B>0#lL^O3Bfm39H}iJ0nH*0`?*AClt7zDy zwPrXPJU>4VBhf4v2grA_*V`t&q+gMHRMqdp?KdCo*o#mlYEnn{;V=8NX5pWX z-sZ9m{_TjKgeJ@P6wIM4aLfL=LFWZ?=)!nn(bu!j%NEc!ZS`(U@;Ti%{lf1x1zhu5 zuUWK(g)dwtcN{gPRt@IH|QxJwV4uWLr`GDIo00T8J@Ng+fGNPJ_meLfY>Z5{56*ivaP>r!^e zGkg4O5MXsO9etdTbUVMalMB|A)02*&S56ii9Qvu@cNtP_o4p>f--_OM*=oHx&MW18 zvJZ~IkNx>iL8`%n?~2RIamaHkv6?ZsjUsR4C(1|ld<$inJsIeS&M{J}XyQz>cq#AW zpAz!04;X3dm1=*j(O=blTXo+N&5P>l6i-gB&peg0Z#;8l?4w@)=w!iBCZ98&>U&<} zRq-Ux)_g9%a?HAtZ=2=fP=5TCfo}Vh?}hilRfu2q%W|vFcs}>JFp)9T4Eb$GKLU}F zIXLSG;w6iTiHX<9kQ>cUMWdm?k{Mw|jpt)Se<^)jV>pMdcU~Hb;k8I6Wf8H>Z`dD0 z0$f@eh^4!vNu@m4qBTHHWtgOib;*QS%Zt6`RR7jhH)EDZ zHryk9u{H#Wyh1GTLP2~aujsoN_o$fSiB|HBOw(MapfSE%B?z}Ny{f!R{-3GSPdSc75s+;f~uMI$2gqjwd>o8#_ z2&2p7z1vrr(;?U+F_2a(hF;11nu$R%z5aRqiVf8YaeR2q%F4=wq+pi+Qggdj$uhOn zVbdf9tKlX0NZj7A@ia2UQA~y$wgUNy6OmyYYg~%Lu?1GOkA%imjfg_W~ zLItb&ozV>h4k&1OxP`O&(!#y zXp#L1CpE_R8YO?bF=mejX|=x86%VI}-gW7=qfUObbhr56rS(ucDh|>Qj16hHD@qKo zW+vC%5^(eqx8~Z%Rc|Ig)%|YYu&3jM(~wU{k@O1s9#ov>68rA`Qh#&NLR?;u|F!G@ z>GS2QE@lJGJ%C}`^=gtEER)?Pg1nGz=jhzWwGyigLmxYS9kt2`d*&N`gX*l33M%w# z=;)|j`rOo}%(3rIEwlT~yZH1rG;}KES!d_@7&y$Yq`09gm$M$0S&Wjd6SyLQ?E>=c zmN*QM*hvDQ6*a;GS(Ih9FV-ZxSMrLc3fqY&=!lqqs&g#>bjDH^pbB&D2kQRs^mS|~ zgP?6O`bsP3w6$VZRp%5yPon=r(SfjwqNws&&!N&f8Xa;3gn9= z&d5olz8Pa3lb+`f3dW#?zvgXd^rP^7*+UR-;LLeGx)7`p89<39A2*J$c@srx$ zM56Vjp{DTf+x&ULjCO5h^PkvLJ{~cedunJ9B{8{8@GtPk@DznRXk?g}oq93wnLKH8 z@yE?~*EI%<=H27+`+0ejw%Tpldw2~Hd4BQf;-nMpUYK!FMP!r6LLiTU3oVbdSX<_Vi5;dqw+Rx zeoC~^mHj)xdR#f_ai42y|8B0fuA#8=IN1lEUf+w8>G5L{F}^_}Yp(@!5{TbRuD1MT z?qj>5j$(mWVQ=oWgU5Xzt!FHOa8*`x0jQ86l8P2*s-o9ERf*fOtOcw<-plsyq^^t} z;z-K`p|=Crf=QQGS2=GVkUbWsb>Cq#mA=#>AeXMsaH@mlbn8-u`B)X^*HX=sWd6<# zLyes51}%Kg;HlWg;W)4?!_-c-cE7aOUhWk2NFEq?UF=t00;(+jU{Ke1bZ7vm5cY1# zsRq|hIc(St^#R^6OjMZ|^cS|IO;Y&6>&*G``dey>{&{W-ebVVXsG0e;-k9&j+1Ux# z4&xKVC3lL*e#rpw;lnhYP>_*o5#rLP0QW3>i&mxZZk$qwDW)ASWkhskL`Yem?^iXB z7LuY@S#0qMR@uA*AbXgf?`BDl`ge#eoq&8YnL!NkGx)YfKKM+-t>`yzU6<=pzccEr 
z@Wa{EhcJW7r$Are-L0??lgzybw7G{D9j$YNBV?D?Pe{|rcO4z z@(L<}*1r{0CtP!J(s)s}>UE%PDMl~=(B3+Oc!{?be^yXL(W1H?Mo#wIHvhCtqK|C) z(q${Vu>c5~QFjf!Gz;ssyDp8ZQC#jJF#ZNCmvpX!fYXAqc%Wf2uXY~4wqeW6 zCb4vGm=ll)$8ye-g_GIzm=sjtUf}ZtI$4Ivs$p?pORJVjLQ6>Wal@G($+hCf&$?CfG%AZrgQSoC(j5H4XPkVHRC6St;0dtB)|p2$S!Ii}OUE*@ zWQ6$!x5w+IudpZ0*mFMYbk@w$auDbE7;hgb55LxMvbQq$duZ}fuRlcXU2jBq)xE2X z1XPG9NWbL1SH5t!V%N`<rCoiu|_y9+L`te@)i@InG+B^GN2u5`IrEqk3L2PVnS_A`3fM+T3n&IOe<_mvH zUPq@Ag-CC*Al55|`XP;4|&gTSf9S~}KaXg&1(ZX0Okp8R#ZBxW;iR%RA zq2zfWYL~Tt1@JV4%KV=uT`{ObDJgg5nsA#sdogMNlR!#IUoCHAvpLh8cAzFy36fyB z;B(C@LuFb+PGUla%0yQ;e2$Fh@6a@}M)(J$`W)xfJ>a*1kS1mmg6ul|*W&9l9ZuG$ ztljaT!pmXBo+fiP(f_cnxG3bb7!irLO!|v1=#9lc3Xs) z;j(|Uzfj{D-a#h9qQ0b+9x;NvR=bJee1#6r7@AF>eSa-HJnp_a_rA&36(DsvSKY@v z%^{h{{V+;9R6s4nwk72tWtFMzHC(PF}Kk zxw)N23}9FNBJb?ixMMhYrR!UOE?b7%##H6ZYc^bVOWg%*+;%RG&FyIZ%gMnJ&|e%F zO;{)7S-i~!U)SS>0bg=cdIesfo$^i$Aq!3t)3SjJu_l$!KeIf8f5=yJ*S1-sq=X1iY70#2hFux%S``4IAsnBg9rXs)R~JofZyP9jgBX;dz$~ei2AkStnQC;W&S?+2400 zC&w0n?x41`uT|H*(ReXcV9^DXER=oCJbf96*dGdu z@}$7oEIR^qt4T@lxX7IT{{CLGg6`YR&pNCXN?BXRPZIG|b6+ zVClX#5K>j8=+sJQNe5eEwQf{A@1e&kP)B@SmHITALq?X%vviwn3+g~NuY2%6kI-6e z9p55tm!X#YJM_dWn`4SVERc0Uq%P$t7!#F57!O+uVYQe9y^nWc?M8-@gt%U(OJfgI z|Gi-gPxbvvSsgjwHMlI3UE&(g&km+H;)ny^LDZPlC2u;ggTYsYh1c!&E6G$9iD-S? 
zAQZ=Vdcklby+1)eUHX82X~lIW$7pAsrDF#~!2QQ6lZx8kbQ6Zzlbxd&Bt6S(meNDk zw#EVBboj^10OlH@%n%k9u8iCUj2a5RkWIFZ)aj}L z&JYhbx0BMlA5Z+)sZaNceD~a5HNW0yIH(owy>9$=OV(=DqoM&P!~-Vj%u8ybMPfOW zuLsMR1Ae=+ zfvKUxl)H(*g-k+o^w#K!_4rWz_b=QxfQtM^sRbn#SG6v6vNjiYcl>xCzv{5q)+Sda zG%^3hc)ci5b&{FY7#Mo03CB~sXr>wzNGG7b$mo==jd!EJ2R<9NTCvoLFD>}p6(eh} zVW~G}Vx1M2_~*>4;w_oi&BW$=>}WKZn>d6vMhsqJxa#nx@39!Y2XsM|f4)UgIrVOp z_dgG6hRS?C*}&hSj^}fAgUH!mng{N(m>$1Bn73{I3&u^SE}&|9lv*Gg5_Qo;9&p7v zI=uTeR(SYuTvWRkc#S;HYwyX0Ix4+8mG@(x?UvY4_FbgA4N9MB2$IoKfLtp^yBmRW zKUu)>Scy?(GW3mLb=85#t%(|rO>STlS;5Jog*%;SMdD=k z>}4Swv?W9---rL3;a6J4#+a_(ed4bnl>Gyj$&`RPFQ~)dXoaVXh9fE9XUG7QZN5lF zw0y#=jS}E`-Ra8BE3~fq5`LlCxAk=h-v_aclx8s`xmz%4JOAn9IPjAK70@ zVu|ZjpnQ8GF6OWXNcp;s`2gT5gGhu8pGsa6M+kD;3NH3FVwrA7Z|Ntb=LaXI%Sf09 zHt#f7)BsDJ;n*A~_`)ES%u4O|qhCxmZvwY~`g+Hk3S*t^f6mjWB#ZbGAN3AvwmM{S zzLeU%Z!J(^7J^{50Z!Qq%7MdP4#vZdb~~1?Na>5yoymSj0zYV62ep;exb(7Ylgg?F zZC_3Xi*@(aH}&Dqp8B!>&@$*AN0GLJO6bziFOBQT$Zb&OEgv0%=IA3|F(m4m%Bf&y zKjh`~3Awv4Lvy=>{VPw}w+dwkOaikt2{H;@cQnRE`V| zq?vkeIC2U+uvF%EdA;`$O+Q#2s^ljlcPkI+HX~8WsrnWxUuyT9w7k6=E<7Vd7j9JvCFzfQt@hV42LeR@o*=EAtNvybHKbvq#v~*G zxqyEhVFqfjZdG@Og|G}}$FPgAyNolaTz<29k~KURy79rM!SKCnW5?)V5Bob1)~vy4 z(UJ8!@!K;VRsAtnQQ~2_VGDbTcn8_vx6Yost7Cjh-Cn09GVG~Gb%o2qyG%B=g0$yV z`sN9O-FJ4w5hYo>>WX#n3jW0X{rUt}ELU{SGjI@mCgxvo&sNN+a7*m6Lvb^iy}~Pf z`BhC3ntucL5YL}^j4vOUC=g{+!z4(YZnp4`k2CHp_PE)wS{@R$E|IyTQ*PmX+oBAg z(q%4qt zeuWyQSy7;5pAF~I?{#ntOd-Ayk8oVNcf z8$wJ%f03#<=o~jQg;{}dD`Q6sHttxs#ozpG=vVGR8_SN}CxSdggsFvY1&BjOvzwy} z$4Ux%7}3ocwyYNwVFH5+18=lOYWB7|F^jW(2xN-q3g^XHHjkS?^P4NwlDm-zw3#i; z(OA%0F6wI-*ir!!*!Jr+Z*#1+|8|_)sJN^vT+MY6i`=-p`CQAzrJ}Kn;u;-s zx2OjUdq3Z4bU7<^-5yIrRe|h0)Ib*NUas@_=YWD54`Lk6I1n%D?qWWs)_Po(t`m^u z$FmLaxdn4d*G6zskR72}ijcq@ydw4v{FjNCo~xqjs9cdjGxBzqDr@ms>Tih)Dz+%M zsXCmNxt7e=Goln+4?{f{S}u|+o-MUS&k;hQVLS9yU)YSSnE>mkYp_Nj(!iNQY zkk#RTKg>ZV@zXt}L|+#Fh(0U&E-SlZ^>UXS2H_fBJW6i=!20E<aUUTQGcFqa0gPVaw?>HNFm%xD*6*y|8Y5 
zd-x2XvSpBQLm+C;wY9Vk#kn1J*G2~>(THt3w8Be#2Eo=d33UluXD$5|lOo-Us~1O>_?JPNv@wT=?|!Lt|$O!_?cah7z{vIh6+ve87@eeo1g`-x^_=gVHES5O8&0 zGjO~)v&GzFydH5E4ilF8HDGRB)aI{Hz)l{(&a?T&MyQqcKU+NUD|xxmM|i5uZ(B)W zgUA>+ubmejE>$^gq21=*kR?T>MCG&mH*n~$LYz<8_T)EtQp`g7ckHV0#s>Rff}HO( zQ}wM#+r~vw-&bSr>-<`0p@M8rn@49a!8^Q_=;J$0DJX<5bi!?tw{O4(GhGPQ;6z>P zscTaX8hb|tRS^lJ$BY)fS^CVSvIdh$>93hO`fPosKChQt%f4Drq3IS@S%Y8?3tc%@ zc)(DQx=KjP7iZtl=Wv6GemNgdbaPdDqc$mQfg-c9R`HlQN%-A<9YCUJJpgpqR+R_W z%n1z63o2rtgrNBQ2;@hXeg*a;usqH~z? zNMehd6v{m4DotMh_Ts&UE#6+a@}Bwjqe6jynwl=X5b>C|%wFo6`K9C*=IVTFHAho-w;w!zEa(ykL9yalCEQJHXf&bRYmSxPF8 zhw8y#4f{MMZWKnnb(UV+ZuA|IO_|LYfwrqUBv^T)?OioGy?TTJV3s3 z;*HXY{@%1X6bha2T0!=t`n?6$2mEQyRSQw4iO3sC2diOpIxm|ip@w_# z-}={TXd&VYNgDddgj^aGFKM*F-V5W@J))l>c>LD9^g#7?GRhIuD=RL(M~JUKt`h1b_B~0T zxDaP{Y$DpP$PLls0ws^CRX`M*?XN^Og6J4C2J@8Wno_6SNQ(Y2piD4kUF?y5DE#mF zP}Bl`z0~H0VP6?}yhvOH4!R^cqV;{5Uff7&Fv>jn=Bq;6pxfo-ByUOB(WLgT>t~wH z(-(i=ZyNByQ*VBwWtmYp8X-(whHu2!*=IIq92Or{>T0~|^Jt4-#_{4Y*jl-!G!q*^ zs+?*LsVsuR-d5T>4L>qjW;*CMh!sk&W>!Pgpab(eJWT_2ZVo3#niVg6LB;Oupy1i} zzF|)d*yw?PZYVHkO`nI{WRLLlwA(m2*w~1J{EJW?;<+=`z_V4|0gnCe{J71*? 
z2Y$1`g6on6Jq* z6Iy{Oi52F@F{-L>`o8Kd5T)EF`WVyrw7H!j`tbFuaPJ$^#=7M0=!MV{ctLbk6I)YR z8`NgxKvPq5HtxL_kWkJJyupF5&Ah(gd#>c<;~95JiS6Ob$fYQg5WA|$p~5UV&L`jR zHPL7gES`<<9XZv?mF^Z4+e%dLaf*I>lBm-JRD&iz8rMxyR;^r6E`R#&9#M)8zn%5K z;rh=rsv9bXBz^|}C0(O~Iu^T3-yrqX=c$9JMS1&dzJBwdsqEWP#lZ$LI7n52Pxv0I84TRcoKa(^naU}|t=oiCe?)qig(L>aVJ5~9s#1e&{ zo?R>)q{xNr)xIvFqSid~C!3ZRTU`{O)83!@4NQIbxS&L!S2WrK7bF6zZSScQwLn+1 z05oMF+IO^aQpv^&4mV48UdEoPNpq>%A@Z%NXV9=qd`>GWreuXy>q3(yFPlo=?atTUU5G~Lf z0wjrj)4L=t^|Y`viT_-Te}h|9m(<;sCh+`EU}lSu@}nJN-KrLW--pXDKNML21Mo@e zg^?zm>i4t#@iP~|&P=O>&?_4_e)w#+bJZpUUw&%FqV**Ljgz_{|W6FKdrQi4_8*~=h3z#C>YyxN!X0YUsP zJ%Lh+^pSY^Of?W`CZYe1-9x?7pT6GB{Gb3)C5QWI=&h9J^lZnwO~sb#?xhR+WwM)S zYh_W{xM)psh6`^@c0li&y#kBL_qwq}tXc0XlpPYdq8{!%$o@?tQA+68n(=>3(iXoW zyRKW7lXg^9f-~CWfZ|RO@vp?REWrt!$``U|5WKqnvS2bs(>N(l*lfbCeW-coC!}-;;l0JE|JIo}R+0G3&gV2T#uLf-}io zP#y)&WE&_O{;|hCI1<7B1vA~7R&$g%+a0hETe^Zt%A(6w$G?xi`?fit*&ex>oI_2s z!X8`M0BYgvI>iHBP3F2~0Z(hq@eV|+RhJMLH%1+c70WPwSv%|P`<2SX5?Hj+Ly-q( z9}+cyFu>vCWHjp+NeQa zdD;E%=};8saL8H+19tE3Nwm_o1G)p@&@gxmCCMn{{h>!gF>+~2sh?XECfV>SG=a{m z>E_zUbk>$^28AEj9aMz^S~SP6Y?f<)jux-esQY$Gi5G0^m+l!~D<-1-7$3LXimCP$ zf!|J@O!!#}+A+KP%Cq_~wE^Pes+5}}oFc8uq;6>6f%YVg!F@Ng*A)!&XNFPE+SX?U z*whfp&h1{WF@h!(J8B5nq z0cB_9f3v{e@<97^j)xhrj;{|*L^%v7BYRRNk9~;efT`Wl=D9|6P6NlSuggeJsvJJy z*YZ_iMJoImv2o0A4ShGjK9$;4kD_im3AZpX`(Se{< z@Yp{-U=ZU$gPobeuI}>+(=iz0Ln$}SBgs823=^)@jLzZZSUELG=diUr`&CQ8~Xt~{-w6+*( zF2%=8-DHrpRi#}KRPw9gs-r{RXyYD(%B?oW@q_ZYaPis)e-?UDH!jY28+J9TgrLyW zTV}y@6U!!%Z}h}~owTu*{M;Hj4aTg~@W;GhNR5o0=ty%=B zsh#49{gN!|kQHmx19>EipRhN06H{hNpD0T<@qZ$bXhzMZwR2;87G2J{xP;9xI*Dk# z3glCn43upsBdY$>OyEcRv-bFv!2fN^Hm}YBbcZZNLnPvPa;`0wiOFG)*ZycyRUp{q zU9;3w_WgtxJbQZOMlXqs9znE(zV4JhSy^4h0(}z;Ix0*-c#ulWNmJELMtxv%E#A<3 zmtqd|LBT4jx&#_Ei9Q~z_kMi3+m7YK*_-dLiLde1;lCj4MR$Om2COVjQfYO)@#6gd zg;)NfQxU=cj5BUF>m6ArHs7;V<1@SUFoNTM68bvK1@kIW+?|T?Og>YBwO9L&#nSt! 
zMvwasB}Q>3xjo^@3&KdG%s9zes|)fKYj*#Q`YT7}LW6;q@`IH-st*Hj)h0@{aFQ>u zgyRIWSH$}0cB87t$?)i|R5+Y?%@q$ITbF?e6w27;pFqc>_eAyqvWw>W_p9Vzn>^k{ z@9JlGL5&kf(62N48td)2H#5U9>upb1GmqT9+G-}F3|8z9zU?v6Uf%pB?3pn8uT}oi z$)c<_4^cw(N%@tbc2f6OWts>f;UYNtZ?!Mm)hb&9$3bV z%%~QbCL1@-i!>goBn~(9$uTB0988TTIrl3p09hOc^ro+RK(p1W6*{L*ta%zhHoCYp z)li3siof~zg>W*rt~pNI?hmaX*M~&dE`EI5TZQf;ylL=c;wmGo3y4=YiBdj7#a;6o z%eLh-G(Oh}lh5}F(NthVr}u$?l~^atB7{CppuuL}X@xNvkh^%3T@{&!+o;IYwyT8g z5juVat9_Le^Swi#W_10$6v!TcS~1PY3{vH5m9l-0Ytl2Y;R;_LcoX<~RRBJ9dTJ_= zm`;}n`j@_SE}(|Xi`z&s)@zOzoi=@dNa5aF!_{obpWrw@en#4W8kd%7i%L;&N8}Yo zSG4wuw`bS++fTd#?u~ZK;Xn5r(NjsU>ArJbYVzujk>+*r1M~i*_BJZb^ylaqK#1BXI=+vMk_8 z7`4>(jeLKg$lLhXj^8uJBv_r>pSn()N2}O=Uj2ulLB(+UGlnfTw1A(Aa#b0d4;9a4 z(*?|}nSbmmwQ68zU)yyJ$8OQZa?y3`o|ms%>FlU}gOEG&uipp+XaRyNI}sYJYv>Z| zn+gn4Lp)S&e?Q9~ulk(r4cnvEON8Cf#%6s(f8hK6EgKYHT2e;|{U_(Hx;!c71Y4d9 z!K_mRvH4$c~pE#Fq8^J8;IbwUeLT-37E&}+;8-g91 zQ71hTJ3uc(tgt|1b-aHbUWd`~>1H!WABeH5p#R1NQJ#mLg91oR=0C__YQ)$+Qw@qM zfPFmtF`x{ZUIvhfHu<1OEA4lHOaa~{Q(iX|XUzo0BCo{uMuUuWFF6pc8TYMPFsh#) zKJr>yHw?61?0~mYt4l+$`0orW(OuPN^T%=vKx-r=x#I`-`js952{&Bo*vfa3;28mx z7{dGeL3Zu-(nfK831}s$LG&^Fpm=Bl%APloKu*Avmi8yTZewu$e(1ScaI@ClG5fKT zo?b%3ab(A-BluATfpNI_KGzk+QsnhN0KlNX-!aUN{9YV_HsmKz>H~@>nIRT7tJ1 z5!_t~Y%e2rg<=vpR{AMJlfVHAQc5F8Da8ao%!S64@CQm_HOWtj;CHGH@yXdurG+;* z<`xO=kHB->n30~u)1#TW5Zm>rc$KhF}4Vv$NSD7**E55+lQehc0_h%&1eosAvQA zvT-|8QNohY>n3XO+x_xN=>JTI?CgTVvxfd)+b>U9{Xgw}_dnI||Np@Yb*#!LN<+hn zBvi@@*?SdfOW`1U6QL9;l98Qp?0q=sq-8|*JXYz&u|oF#Jg(FG^SyokgYPfz*N?s2 z&hvU+&+ED#<3298;x0p-T(Q&+-fR$s#(#=eNd0-rcU9vi*>$jmz6{df=C`hc?psBG zbW@j+v$d5~zrItH@78?c)>0ne@fx3G5X4LrN?_rjg{o>V*MV~Y&kp3)ouD`QNhl2} za+3ZD!;}OL^}7lvVw;}qK5EnVKpJ1CF3BtFRgP48J7CUQdbzq|LrTovEsLjF>YQe( z>#q+11>VuS`z`GjW`>5k+UHPfc?*ZPn#?ROBIKoyr#cGm7{ko~#d!cWA26)fCcC=y z>6sA1_?h0-VNh3k9EhbeUGinjJQA5rWybOG-)A9(h(pc8>oNO4LZ`KD6db%V7JeuW zR~Cc%#Z4V}1N04Bb^>YUgEDu7OD`^W^#XAV0KY2YGbFGx^Zz~z(|}l`eYo5_OFZga&?ev5hFJ30l;Qs>7dDIu>|T{wCsyYmxQznt>fy+-s` 
zq^FhS)fCKCzX0}Qb(~qefbeOe&T$WZ61d>EDRYfj;hbH>s_GFaYX}Atd*&v#klHdG zFX;Ys%0T=47vFyV1hP>$246XH)%bFO|+Ds4D3vK^fIUuZHeUGnA!jH__b4@5yO z4wW?2vAk+pZl7EMA}%d=_X->xKAT&5*G|ve8Y1Pb1mk|*wU!V|%`;nxRwH6hdHUcinQsnr%Op zb=)fIw;AxH4*}*k9tt|>`I)V>L>AUk6Aa*fx@OF6Bmn~?|H8Uk zLo92h4pFhaXEA!*0}hJK^>l+F{g}?dH5L{YTfrjbMVG4S3s9aDBu;DP&UbAiU_$@! zHV}8{PQ71AA<;3hhKN`O_q`6r4Iy@rZD{{8z-rOJS?Z-r|HOKs(FQ~u$uojJkOXw2 zX!4IY_#a$CNPQEK1aaY~yPyTuawlY1x2#ayEz&#Z z?e2T#ik=%^cejvhiF_jc&ZF5C1VH6!by36gINtUXshX$l2E3PgYend>bTkPUlU2(I zN1=TLWk&8J;H<~t`fN;^=(Z6Lkf-&Xu=T`;w50^0!?_%km1S-zm!4aUj?niZ2b|X0 z4KQ?|f{DM(ce2~)9DM?ix8TBD1dI99 zlxq7^5`>hkJNP&&ak-7z0FsPhRZI|nZz}U!q*1jc4_13trai^0JzaKfHxx;NjZn{y zsz2AeH40ohwa192PC$PkO8$A)cEcIt*mc!1hK}{z2b+@C%gHlxK9mnykNEs$0d@=q zE`9&O8$uDTZ_Hf^%}qar(fBL!hPeP=a~_)hz1M(WifGV;FSf-Bype7AGnvUFdz_C> z3ut*RA2guJH57fna@%+F;mFsDG>QQ48xQg#aWcBrzD@QLD7+ z=?(u85E(tRKFkhSYcbB1yG#qZ9OeY$Ed#C|(1kPsb!BgyV3h4kMPBBtqe0Tu)v#B7 zw_<)tJD02u2e3^!Jq1mkXXiC%u|bbXk+`JXHyhSb+HcK57ieC_{RGUxk zEW|;Nk9EL!PSIo$j^qjj^Sd2amA-TuJ$HB}QZ}ojsm~IxvO7T*yxRbfNw)UNT`VSd8Ar$-($zI@8&0Bc4OrOt7St0$8Tb|QX~?6Jiqas z;ny(U$$a0h+?M5O_wz@>rtk-rW6BH~}3izS4NQ@$Bo2YTPfzMSFs zu!9_?z6g7lZNPMgexY5HqwM&*2%E$F@$RcUYmH&tXTBbk7qrI6kG1TK(Y~mzu5P3C zykO>zO0AvG{Qmt56+mxk0wvaT0yl8K1+}DWjCi@dPqAAZuVu4PJK#`vWQni}B~)3h zl9ra%Z@o9U`)$jUn8Yu3m|i*rY`y0E$2n4ap6YDQ?a?}e;66``8%mMBdJ zmK7JwGtzYr719T@j&`c;9BRk00=#Hn01xl;@es3z4-2p2N;@wRtnD-X zN=GZ$7igqyftlj6_8G+b`_{bYgM5yHctyAvxN*JNX~c;f(>+;79{&5&@=G+m$gTGBZ2qbF0T>K!(-d_E}<}$?QhV_xp$Pqr$#Irb~%BA?3AV@9S*= zl$|GV_^L;~&pPDwF5M;m7TFGHQ%=@N)g5}57%dwF+N&tgTWSST>rp39#@bUpb4@|;3C~HK+E@+cvF&t`L62BqCTYesc%31$qf5#ZYG+o z>pDBDAHPzmGVtjIx8%%h*$|>`N1cF$lfOC%Np=|!wA_H8mHg$|UaBX-P(3{(w?I9; zfx<{IEmNH=Z%$Lpnc59tCKAktoW9(ki7y~2*k{HC`$zrb?Z>saxLCH&qh~=_t3r3F zsWIjZPB;Aw9IcivrorOgA8E_0Jr;{Vx&=gWPmHP(w`A69NrzvmWNxAOAr+SR;fYHj zK4#7~5jv!o--N`Rsk|4HMIU`Vb8-5Bs&xwO-Zles^J%-Y@$FuqkNh3=*@=04fZs^N z^Kc@>6$2!nAw@hpq+X+JW;HTy9{=PUo~JoE>6f~?3z^M# 
z7_^U_J(qhfzOflpD65mg7Y+4v$oj}QO<)_lvK4mJH<4wn^Q&De+(n|96rOMGInVReQ6Auv0C6Gp%z?GssUtv@*15OXc<|1v?{#jDR z*IQJ0Rh%wiPYO#b=-!lF;WpBk&-4{_E-HbPsJ1G9J=#46%XpCj3p z%bW3>cSd%?pUDrSQ`7PkXJ5;GV8V0(@|_Avrr?n9ARJq00IPIeZlG6M#BF*UM^&qn z2(MGHH`mYLuJ*B!R~n*6C;_6WL^i~m-9CTKcXM$*@8I&KVZ;69M5Rhn?2(I z@&=+A_KkEkJ)s_IlmU@?L8gYva4Q3v```oS_t^RB+*hz+xh6KfzcX2Fe)v;aK+x!h zcyqNFJ~YQP&)R1LNWFF$xIpjAKUKV1#*j!je23Fp+3=cX@{1#ie_T9reS-Z<135a> zGrrQlV2QO#6*X))u}|9RzUDxrp<*dZCq_On0&Geint3v=YDr?MN!uvpKsk)7hQq|c zeH&An(PA|N`fgj`H1gHv86sv4yn5@uU9(;dHYdNiz%v>9yzf4P>yw&f_iEqqTGc!MQ$!su;c0ZjcDCsRAF_E>ep8@3F4lvUxod6w_`pWqn^1A{<)h$Z-$b-nMW@kwE z^Sr#eqlSAPWW<;M3@W*=o{GJd0AV2g-Mi07WSX3aUrF|FmVT+LaLw5=wOZuWdC=T% zG>0vjw`b_M5;%fZ@L$$^lVl>}@jq7^iVPX@u_Fv}W}r94vOo{eWkPQR8-I;fp{8sw z_Uju_&PX<&E{>ip$Xf?;S2b#7V~P8`W+ph^i1MnqFB?=#QMEOhHy@T4BtK9WqQtO8 zoxIvMD&)4&q9)BxrtIKs=%|`&O;sUxm5p=t_j}DZP1A(r$u}j)#CS#0JbLC?UKCw@=6slcas#m+&e*mDtL(#Bq6-wh9_0Cv;l=T>UP@YfFQd`K zhxL^ew(6K8^A5D3{E^Zod9m2K0=(PCv4Qa!IVkzb$u(2>QYbhPw#T-1q&uJG2-HG{P_k-nIjh5Z~GvI3uI0~L9nEmY=3jKa?A zg_efg^vQa#N^<-_!`L(A>71|*RQQpDoOPn~OaX~9TK{Ul!Fpw@y3TG{EhV{Z=wasfOkl^h-&mVv@M1Ux(^+g zaFMs<`W1ruo?TCPAoR&R+?j68&m$B9^|rQ4$EC`=yzgFsEdSoUTe&7XF4p)J8{<+h zhwqT|-fG5UwCnvT~u zdlq_r5){&e%TAdqZeB+$EW|U+G<#D=Qityt8XJR<4mqRVD3Y3}LESwIC!-W$%FS75 zg)n(n&)tC&SrXP=Stb9AI1V>ZY??J85$$0W$jF)&xCXGMJw^Wow zDvp{4^LJ_kMT5!%0ho?+Iw+d zN2_9 zmACrGZI`0?36uy#$3I9(7Pb&_!{@|UD3~xrrsuQw6?E^F`Wk*{<~NH{c;Cik!;u;c!(9rcCnN@FDiqrb{M)r6@&2C;?Y@wWvr_%baR1d&t3J}a`5ufZX7{1aoHCl&y!!V zkmZu6DuAJWVN`O`D_MqQJ=d!_A&)&QNqh0WqT_Ag$oEIbTV*9bI6lFX)OR@k0Z2m1 z7@qdM4XO@P8iCU4#=S=ZNoNEpsR1A4Hpg{bgcN(mJniHGs^>G!2yA z;YECB=W}zSH1YKxAsE_`_Se+58S-mQ8h>!SYVhcRb9Sh2sEo&g*qz!iT(|P^>apQF zf(C5z9$*1+#C)91#Zk?bVihp@;RbN&QEh*gnuvXF6c{iy$angpJwJP^h-VGg;j)Km6 z3AIBv@5W=bH*d(6lW#@{dlE0D+f!JPe$U+PN8l)&CQjBFGqs0&OWGv^!ZVmEse*Gs zIG4!>z*yt-1+r#v`qTTi)vTi>BgVI`Z0e5t=^ChL1K>dBGBsJ_h@T1k^Qw;-Y!yl_(T@x5|uvETfL@fWOl}lUz`F z8rup=M~H4v2u9BDJKS!;cAMX3s1NdgC>_?)xdDDYu+OlQ}Va+;Q091YxQ 
zEohGmWh7y;!!nGXTwys-q#Sy{mNKHa3A}xf*He&7DFbBrE!b_PCfNt zWM*Donc(cVkBDe**R5F@k`$PJ!h@`967iBZHY-=Kd+l0QOJ%I^=k88b3B#lBa&j8) ziC$dnFqkHWu>Fby-v;C+0({aNn-eC;M(lUjlHO7l@jvH?E0Mbnv+v#2R{3yborDzkvM^q(+X8wr&yJlHh7Z%KtOXP95yHR@r77St!s@@45M=oU5o5LZY@XDw>x&~Q?+gt#lb!w7DiHpcVVR8>|HoN zk(t^i>t>j>1f`p%vU1?`+tgUW3RQ!as_tw0`nRDjIPa=4hV2qOa&F+Xkoa$$hWF){ z(N^Sf^~(2eV!swbgQr)pV{l1B%IWxxc09jMg3aW+qkyN5$9uWQemZ_QYHq&E&r2kTSXz6fq%PkLT&8DsNZ3Ft-U9VEh>V(x%s8rZMojqB1Z*Dy~J>Pg{K;KDemM>u9YfUsL@j9;G>fRY4K7~{{sF&({;rM_6P8U43 zF-qH;Vbi_x+X1VK&=lLysW!vA&W#-^ic?m9_!FRLzf z0-=ba*V${FAz<*9F~Eb722X0~*FKqsx(}giw{%nQL3C_bl2j(qA(LH}(`XRrb5orD zpxEv!Q{k2KsoH$0r$-1&&!eUAz^j2cR}A0-wZ-k^12qpHJA{E%=Y~i?xdz2K_ zf&}H1f`a!hHncBTkFP4r%h@hq8YA0WuvR<2Og$3Vk! zA9C@l^Psc^S+HEG4Xfw-f@Hh}aji%1I!h|>V;`|ko%Dpjv_X24zA#GHSz!>5jX&@} zU2$fxdUHA2Ceyn@+@0f?kvSf97i9SSuBEY>to>#neIrtmFZ0b*&4-x@OP6f7%`7>5 z6WKeTAR&nuyN=d0byCO>)c)g3&bx^BVG{ z!Lf3U1hXFOJgpQs*tv8KKZK~Yo+V{%G2&d*20xkmq(K(Th)mGLvm@(~g+Fn3oI@Iv z8`AZRLun(j&c9K6v%jrvd{6~fG~P1Zu5imWX_#adw%Kj_kz?i zZafR0G?(uRY5ZBp>vGN5r1)syEpTT}oeWqdVv2h1kLfS8v!ZJajs1ObYOTjvFF zD|ymVJcyOo3W-LS8(y_9h|c(cUs8ryh>d}~9jJy0tdW$V7 z*6nurJn0yUDe~ca$;G^uxo*_+=$?J1B(cc;qwP1R%=^E9L-XUh+U=UgJ-i&L^O3J# zuaehZn_PdWaUl`=G;5P2`)qb;yJ6qn~zNaY0u5 zf3(FY>`~$g(}l3O`uIps@AHHaMEPgqg|AFfU>&$>6l-uxPJDFfG&1R3I}7-WG8Dn&tO_PC{|L>gWo7_2``R)-J2CP<#Axs z^H<$Vkco#Ct_xVmb8&J8${KI9(6c))ejEE8C?84nAANkYCBt+1cQRuZFYm;)cT!yn zR*sVcdGb2%B)Z}z<}A{iTLjZioh^ohtcYq}b6^YS#%9< zRU}B2iF#Id^7`0?l5#FJU|!&`=d5|&h%>gctT^4he~8ja+wNqEsR%@yd^l4|r{|G% z8MRzY5R^(iX2P^xRtR3fp}aO8?lTWmooFwnMEx9;YJQjYKK)ETINbsyucyLmp%W#+ z((IX|q}aaYWaQQMPHpKl=H^wsBgzAhhj(xhivf!xE63}N84l(>ASM)_tI&D}q@HISODNX+~rEgH= zd@VHi3B6mS5~8?O z*G~&Sa%Ya-rbz%%aj);RRopg|oiWH#7J>C;=9>WSb*&A)dJb*_z3alN3!oPws(?_p zPV4UvL!)Y^Ow7zwE6~#8&hO!VmL6R*I2nG$PsRAE1?PGnfQ0`>lN3{u^Gsl$zITTr z9arg$0naRC5Gf!Kz`%l)shJ5$UMfS5Yo5o?i~giZfbzY%X{lTHgC)FWDQ{VV`zUv@ z6JFyg2-m4M1Nr%tMH8 zZU&;;KkUdGwMO@d+RQsrs$zTu06L98AZUag`JnPtXJ=<`8RZ~7SJmPe)bCN3l|BRE 
z?$S5l=0utyS9*{LfHlBZXnhVRE@bc^^+0@qx`sv*J|n@3jsYaN1wh(T=*ztw4&J`o zbI4zhAi>9XaM&Hth7DZ4OAU--72SfBbTjjKnQL2qE7(64!?_((6Tlb;=AdNMSQ&*z zW~1r4wZ#g>-3Joy%HzBD%wfWd@KvPwp29l=a9NliP+MR4AJvh#4>dqfBjPM8-qqm zn)fEv9RNPk-`Q$W^H`wVqGLw92D3d!+<~32u!I)RuD?S;F_(@R+OPS4|F2%QXZ04f z#SXe{L}o&dfiIiKB8IGoY}FnsUHmW)iAhP-(&$6f?@2>biXu5C{w{P%OK2EMHNxM* z*O%SiaFqMt2qK}1|MNFWF&D66hMeR)2Z!F z_?A8>Zp!ob!K%U~{UDGwbIo3Tx?SY5C$ztL6y*9_Bc#FIM=$>!awv5cDbPQe4`5;1 zE9#`t8uPn+gw;202k8hkxo+tRJ$zMRW%Sr5= z-QDXoL8)QV4I;{xreJD4wKRD9nMgnNEznJGe%f6GyBquMA(CbQn?>An$wn^81(Mvq z9~7C1V&bQuf?5h-1r*<2qm>&R97rqVHP~zp8n8P4Ag5!DbnnLM{ASW8aN``0N;%`! zUM$sfla2r;ItXEhp)gH>Kex||F=!V!YrNC~FVRF&XfqfP5#TP5|NR@}2gN3f9FRqn zrCvuxt&y)>MHe6iK1Sm(9|nw95}oZe!H^sF0Li5aYzBPn(b?pT#)4%;ViN73@p(-QLeb*F0&j4%pDXT6Srq4gw0I7 zQL&C@g7#l}QQ(=_5jYZ7=f~nuLN|1IA_@M&&lnvrbu=i-_5ufa6Y~}PUPf(e;Cthw z2Am9h$!{GR1OaHkU;y)lL+vO8@T1UxKP30!AxveZ8MYgaTMOh~;~RItxkHcox8Uyu z{_nv8Nv+u?r{t0zLZJ|wo_@mB`j$I5=QP5ZGiApleGeAS-}{RW%)>uL4}r;+&rS55 zP*5K*-FMLa*Y0)*5Wf3SCx`mW)T^N6Sv|m2q7nQ>#u(3m!>0^cIcpbMF1|xNpk~OS zzWiSgdKF12saWXS$erlX)6)aJh8)sVa-Y7f)VMB#3fNqDHC1%q4(7ZqWN@by*f?)o z2QSM?NjbD2?g7je>Ymu}3&x7NWf;sZ8syNXIg7O9RP*)z+}1@%SCN?w|7bOSOIJ%P z6nRZx#*r1Tyb^-+(#y-wo!BQ1Os9E-5lE)|3#_zLF zZxoV>r#nOL9{``mQPjS${#AxpcF^uTJi54-lardDFZiTl9~zO#DQ_|SB7w8YGQExy6t5E3Ym^)pL6IL@kj-{iJEgj zM0%Ld3?q$ygjR#Q?bgQXz5eot_wuWkuq+?%I&MsukiRpMYGMuVKv(b#qN8#nW+ueXXdl>g1q4(6=G?=S>4rrJraceLP)neaA0FsHgBAz;_utD!VK;uJP9x05e=o}X v??sttSN!*S-Tz#ai6-O!?=8abxaYSM$7S*_oZufuqft}VR?53{)Bk?}#4XJ` literal 0 HcmV?d00001 diff --git a/tutorials/text_processing/images/cent_vingt_to_120.PNG b/tutorials/text_processing/images/cent_vingt_to_120.PNG new file mode 100644 index 0000000000000000000000000000000000000000..017c98261949e56473b0aac471c7b05687e9ef33 GIT binary patch literal 10937 zcmeHt=UWr&7cI!a0*VI_Q0h^SA|-UC89)&WMM^?%0umrJDWMk?=}n~v0RahwUK1cd zq!%dyp@Tts3rI;IK*Al*{U7fA^7}B8OrFW}zO!eSwb$BFx-XtyV&P$7U|_iP{MloD z28J^b1_s6(=g$Jan5o^y10Rf@`cEG*lnwB$0T+MSKh%E6z)+dMdScBCTwi$o%*2y{ 
z;fmLvgYhR!(3gQh3W`({%q$k=+kYN8TUci)EjJA5oU_vZb_ z(AY~KPenSzP0ZN?+IgU|b**PDW-AQM)hlh!adKJ^j16tu_OcK96cW=HUp(_b9n&5R z`pF)=_*DOEpf6xsVa2E68n6W~9&2F#|1kXT{67o)f3d)uGrwnji0R{2A+smP#MAo- z1Y(&o?IyR-l^}-mDwTm$?!Z@iN+tB zSYj~B4C{*0Ocf`&I+fL8$8CQG*S1niuQ|&za?~SLaoKgp2lXj`vCnvN6BJqT_ z9paPl2*dv8LEl9P_;+vCj$1~(sRsU^#_L+w5=fb}-(pYC*F{wtaB^}+5z$JPJf2~j zH~&21!$FnPMpEzI+b2HxSwR#Nteo6(U%Ksp!U|vYjK}xMl?obVdV1^8t8>;xs%SS0 z31xEPDakFQnzqQ};5%kwpT0M!LNs^p;Pggy*(QKt%l#QE&8!R`ba?g_)d=6-GWGMi zHU6S3cBNVcI}RvqH@187NWNsqzHXwLCh|8jDu*$mOE9Ey@>Qbt6`|mOX~m5aTQAmQ^7_9%PgFcnJ_Yuf*2bpPr;Il+vKS?djWvL~|7wM1%>rvu!aniE zN)P#y*X^v5v{Kc}9RAtNPql=5PrJxf6ZAFJvr@l{RmbxGTRf?$cTBeS>b!10A-tek zS+`zPd}XELY%(?iuCYe5@+oid+0NQk{H-lRCun2$hc%yWSp|_tN$NG?&y?whwg-vR zsK0Bc;Kh^Mg7=-%rlWP=qaPgpiV*BuO0{wwvkIXZO2ikU{BJgx)sisNNZ=T7Zl*57 zW)b7s>DND%t*JS!{>WJ-VTs;&{_f9ic-mMjztUJ%D*ZFWc=(GM;%RY)sZZhZV76N@ zb&31H!W~#YnRI4|;D#{UC^A{8JymcjZ^DqK4B6Gb=w7-m-JVsR;o5-Cti?)>ptNP4 z@*|Q=Y@%*oSvi7bxWEE#mo-sL%_<-ZnO8BNXom|4l&xggDY~(BtsvV~0!bUx9mymH zY&VSuJ@r3QpdFE`iFv2`f64M$i*!L%zLu1TEir$X+0Rv6N`mz9`^;FE@8eJD0SB2w zKZ6b|lxT?dNOsstrceAydh=RzcG=Ze?|ud%2B?a1qNO#T6n?$u=nK89yecr{$702E zD{W@B^hikt*7rJrfhDw;ef0#2C7;p{RIvAu4a=X1c?KLBZPGsjw!sIH1A5z`$`Wi3 zet&BpqyrJ|Vq1H8U7?r`s*okK@9J8(B@ znsqT^;Gj9^c*;}X2Q{^ueR?94|KOcK`EBc}UwH}2Ropk%=_A`yK)P{`bc(SY^xq8U zbxk{H+3kpxOlD%ypob;R$YTD@P#?_-*ix7l6eQyk9_&6-vi;jZBIxcR>$|x7Q{6NE z8>Ox~lr@{jOzvcyk)@xV=U%@Y4y?L`R?=R{g&ARQ`*F?o2xU7fAI>phg>bS-k2zEV zlM$C-ZpD~n@yPwUIJuu`>uZ^h(|T{tQby-gPi=O!jrL46g^er&cRG@8x;B$nhigawhYuxQ^_sgr)rRKI&^JwO;7DM~pLVv@8CdjgVdk!Tx|pRL zX&OI1oXo1p_65Yln(wK;Kvh(nRpOO>@_3}3{8{!ltzVAl74T+rIPCp^QfXXj{7)Sq z_4pmjBkDQN%=VCFoDtSZ@vwz;sfG)0;%{i#wE!*-3qHURCliO|5HSuZPnsPYpL^I@ z5}^w$&rp+Uru~*HT{#f~rmu~Pk?+b@#%i6$8!f2zgAX8b6}YvkmVY2}<1UXZ(Xnp_ zu*yM4b)Ztm#v*|{sjx8bk3|B}AneSrGTw6#JnrJ+@_Vj@A{BP!uNY=)kv6ZVMRp&C zRXczrJOj>^EG$={Dii@FF343vy11t=prI2=A-NW1d1tVEZWyFY*j^5>yy^ks;bb25 zBt)ck8YIWhvq1Jx1{n{V%uByk3808`?x>v7o6{=&GMDnsrT^qzZ7K-Z+_?vXF%AwG z`Amuwm)OKu6gfY{w-!% 
z3VY$huy1}Z^!d``zYEbV1TE)Nk6YD3`*L=b_>cq3B~)O-PgKC>De3sRNeF~DT?v*k z#b}3eR*g%yV}jBIlQaLte}HFP9aJk>?Y3zza;Ihw{4zN6dX<&Ix~;*Nvf2ndopg2mD!IZcRr<&L0Lq3H534hnhZAJ05(n zr-fy90ofdh-)$WDn`{=c zaV)~Qo;-pA*8W}eaMC5n-kmrZ^4=_WkIfAy-OUxRHR*POkPnecV$X4;#1+LE!JcS| zey;Ru5X~z(F%7;AcN9GP25 zUvB^9)k;Ni)cojt28zw*U&>&q=R~5$H?&4oiE+9x|4k!DxK-3O2dVNF=w&fEu=BX7 z&q%<_+Vrg5@Ny}wTGaY)0TiD0!$DT2>#`J0HaM>&;<{|G;U@dQj6Y;jDA$H|S7pxR zVew3oi>@_%s<2+s?NMXKmk(`8y{}dk;3lytnkH9SL4_r@KlsNV{W>Kjn(g{Y20@3F zLoEBV?bf^2KRdI}aZDG-1}%Q}4-lAk#WWkBZiq~@4=kG`79qQ2YvD&Gds?)L2|vDi zohhFzxeR;w<*@@JWcXKSg1kcWzQ*KjcEJIi*vM_oM$GPZE2~7!;Pd3(JO`U_iqo8* zlz!y!-uM+eP-?&xkZKvt{EHY+iI@Vx>_ispG0UfGZ4lZ5{Ti3atB-8u8d=(u==u^o z+OH+7fO-k`zJa-6|NiK6TX05Jne`C00f_Typ7Lrt;Bho~x$2 z{-`~FL}$>|=uSriO1w_0_!7+g@wN1O7%ckQwqmW}M8+VJByo-T5d6eL)bsD~qz)%EHb`M(1Ibpvy?V0|4P zz_)hxtJ6$5MJ5a5@*b_)U^5i8Y$l~6G&;=R)0J!E#%%3sA@+B2dT#{8?R1c@SafX* z@@Ss{HopsBUs$fz_YFm+5tHQ(3m*a*0&+CzcjDFrQU%(MlRYbYD z(V`JdK+5g<%u)#tw?I4N?~!X)vjb8M z%k7+8OVN=er)J(1D{c#~k3nU6$#gvZ$O$K#+UwyEZeBSJ;6GBOGR;DU-IvsHtYfYguH&Tg8N
56PC!@D%{3?%VIYN8`2{zANi+hPynEx!`0~ZrL=GGCU`tN>Cg6 zv|Xzk`B=6U%AgXXR@&G&;{pmkJqS5Xl^4GUO8cxrx(wO5oao)m&VdpGMHYpH8t8u^ z6iyVuO7)8ANxW}&^s2{2Hb}KLsba!YbAdszxE6d>?bvxx5-}HwYYBO1ZzXb)B*p0T)^RsVB7azPA-k5SKuN~Ap1>M_j+=~a_ z2W3O$&B6Iq8Y92sI`<5|3NwGmk>r#KnEz)4%=I~3>ofnhcLwogkv}48H9Y153q%`s zG$8^ZKKZ3hBu%Wx8^><@_Z>JZH=VBeISA+96-oRu>6YJp`F_Y+e(FpLjDGXj>hvge z`rWJd=ogA}Zm}`ikuLCX!?cIT9h!HOT*d}3 zURROv|HrSO4J!65uFTw*c~91-_g}hV0?(IhGoXjY(Rd%nMGV{QOc_Py+e+u;O!?FG_*?&`EgZ@h)VmN%6m2T3H(`!XfyspX?K^^@5|oCa0WocxZd2_C3p+atRr~-B;NB{gacuL0Gz7 zig2A&+|KY{ASU%sPpqDn+QTor@AQW_4q|r+JbEhq(hEoM026)mhqQ^XXM+=%-GyRp z65;HKWNE#2NKlp|ynYjjPg}j$PS7fqF=#vSkX#$sn?waWY_~gA*^M~W7`!}zHK5n? zYN_yk=O$l--dE1EHu&mN@m6yawMML)cmkE`K{ z@aI&~<3&_4)oXed8zeIlt7MGF;n3H>VBt6SI+e9-rbonv*VT-tvNN6n5sc;MO|1^o z3W3X3qYN|X8RWsWo#U2v)5yk`jTW{d;c}0&@boQRjgidZcvo(2b5qhmX-Ok>pe&y? zl?B(UdJvju#1^SIh!`)9%XTqCdeSmXP?VD39Rn7d@*#TO`b=&Yl??QUlPjfHq(I=q zxlr$H@kdnyW%C~zr`rCOaW!?_3a6GeYl_#myEvjWwx{GAhyTSg3s@I%$nFW#pA2<; z4&3c=b;2yVg^`c9a1!pH;hrm8s%PFPkZMMpKlUqlw!U}+bo0tjeVlDpHrsh1U}^e| z%nix4?Qh`S?pvseRDb{qYv5`MM0O4rC%-aK9PecD9K$+jOX2W17CEa^v_^pOs6jg4;x| z`Y4c5hX(o(EjjwIp&>4NVh7+bu=U`{kAs|J2vIqzae-F)JmsHkZ=gi938DtzQ= zYuhEh0nh{n$F%xVUJjh3yzZb9*1ES`#hbd6iUZ*VkRbV2`erMg&JdY8+18};R1{{^ z&2J17JS>hgIr@vwc*Jwlujm@I+g#v9f4No zwvgR9<*%QU->O^KJa`O(mRCZjovY)HMKqU3)7?LwS~kx)O`m3^&$Idl@Ae3}YRbgr z8yKsLv_j9-ZXUtmy_RXh?>Wb4^H{j)M?q?;xE2*NA1#WsjaNYRUr#IAEa`f3NxC2P z;|fP+e#{xrGkS5ZNhl_!uy0?!aX5Cw=4W$;8@v=PFf+I>C?n%PW5}+-23ebG6_=+> zIZ0HvvWV7|BOQO&H=q7cyNodN7)-#&nvYAid2t)zMd78*7Cy71B@OFETcW=2(QFMz z>%|-reWONJvo8Jdhv#&VNC^c~Yg$Dcn?7}i23z-Hy}0?z+?o)J zJjv_ggo8}?&g-t4TODmsUKv%H?(vam)0?CZYHXFqvJkHR^ALWoUt!!+9M3_x4#f=n z)Fsa_^5XR%B~ne|badeQGhiHP)G1#4BqMBc4d7JI5p7Gy3VU# zBj|$j+(u${_)~3~VCchszzgZX_#e28(tNH&5`BLfHA`tUInXCFfEs z=)UC*>mt~EljiXAR}}##q;?35Ry7Sb3(6z|p=&VGEm%J0u0}`HWJpj$A0=m!ALgi= zlM!WI;6+g_GHH4|_*t$O0+XSmEkoY7#EzLdrT-xhY3;B<<>QhdbLC#AzARnGqO5Mx z;e8OSWTA!~r_EuJ2{x)R40mLI$-M+*(fNY)yyzG&&wZ%veZD*b70(E>y 
zGQU#)=*zR_;el~p8^ER3mtP%|qfgcy(uV)>Y7+m1+rJNR>)n+VZp~B_E$dfX1P)X` zw7)M8xuDV6rk@1%L6@CKR}mnH;X2tXyuqG z_*OA2*!|eN`P_Z5g^P&La>3`KPK`C+0_}`t0K!numu9_z0wTM&fTXh z8$B`nPUy3XRt=K16_=x5@rdScI~H>pEH4!}6f{U9N}w*{E-qohm*8%L8?8hHbqlu* zK)0gO8Xza(%BzZZqV^xcDeWSV)FqKsB)J&hwmDgv9U6{xZrthoztMZNKcWM%{AzM4 z*}K590hmd5^^_G**Y0F@pXXH=<1z@E-v8M>^K?GVKB>b<&n! z%I5v_JAHz*mars<$RCEE>ukZfe>ZiYm^amP5Z_HOp5X>C^tcw00Zx>QS#y>3_d{7*53}m75dT_T;7fZSl!&12Y954m0s|{%mPMs?>1xAByWy; zRdXG@fy<;_p^Y$lz1HSUyv5){pRzwC=)yU{z}NHb1EVg)K)`Wqsc(j>0qxHWp8=x#SZDM`B@Ub*C#XRWrn1_UCXD-tpc5kkU1cg zP~Pv70DZtSKqCgGO0|8}_>?#8-l|QCJ_QDR zi$bI{?hSaFX(UJ}UqciyMh=%-sVG6rVtBD%%OsMackR|)hUyBLS-I*!{g_#AGn|9` zpPHr*v_w}P+5+m8r;Y5s273=EA8#&TlGsW` zQMNe1V`jO&1l4@)11QJ&W{rF$8O3r~U)nEB(piKPeOV4)X)u?Ul$tt?c!;M_%&^Qo z=OK)N-0$3w7k~tx)Gx@v5{P@~5!q53!hvF0tCewuA0Yb zEQ>nQ#9Xd$F|O^VgRIPsUkSJ2WUnOD4YlxzkiBVFT)&I3rC{Lix8Q_4t`g>{4WJHS z%cBaFFf#T7HOEj@?A1SLkmBo#r+EVwX89ss-6tI=AoiEF$cGoQ1~H>ig%}_ePZlF> zzP??i%M)_fW539v4Uq%3+5uTADt0%zb3IpcBBu6IqgT%V+Ma;dzP)Ro9fX;++VoMm zH(}oO(RBu3M{kkn`gK2g@3cbC9xjOligr%AHRETnh3R5m7jAS5s$~5J2yPEEed%s#ML zi>S$7(~}S(Mx&p7vAX$9uBoo3TPir{YA%9p(Tj2M;*R<=xhaQR16HrzOC$oQL#Ymg zK2a#pC6mXkaHSn;3xVX5ytA3#F9>~pd7$%Ie#qsaB0x1$*Ne+H=fIi1&75r{qT>i^ za;>aqFymo3-V#Y&W*vSKmwgwCyjV4$xG}L(TC}2yatLF{G(!I3?ns@R0BG~sl=;_x z~VAr69|hKl;Yag#7*1b**G;PKmqwNTKS@V z5%o4!g|?p<@Hp540OO@u*<~efUX2v!$T^i%hJ#gV0`+tB;<6Fr7URk)@2*z$M|=Q6 zW7Ng5KNT_oz$aD@!-FWHEQjpub^CnY1q2`vPIf_|&^nOA4?WsgZC)Op znOh!iTQl`hj1fKBwrAelZRjXpy6u&<8ds-#h-f9GRmGVP9p1?Fbg?i$B$J?T@w%Pa zfaaJ18@KQQ*pPfRE-~Xd%Gco%b?}oIKz66DxQb4NN+zqbyeLx2Pn4-&>wQ8-dptMV z0(Ct_#{qIA_)H|KhqHOU`rRndHEvQCvpS#hdrw$X>Q^tF?sN(I0AkL_F$P~3X$tSe zt)VimY*$hRlyj7{o3DAA4K4wsLoI8lL}mLGb+h#w+2qj1!jEDMrQg+{u`@TmVcYcsV>z02J^R zZ=cl{ z+gI5*bsesTVywj{F(m_^6+CBdO;Z51bE0Kl z1ebT^9Npe=ja&Pw9`{1)-M+N|ua-})^?Wsg)*=bAWOE7C(r%K~#dK$p?oONXZswr# zg?HkzrIKdGECXErK$B_K^V--k3$e_MhUlL!LobFf z@tac%Z_f!-sEL0HO`h8QF5zay6m+y-86)Pj8zHXiO$_qEJ^5KwTsF17%M-dN$lim?dr+ 
zqV9k=w=Q1%bHDaW(ZDD$k}wrN@*^Vj=fu5C^K!mN+N@x2Z{(kSC=8GF}002}Es%hI+6F^aYxBkqfSQq^h z3=Ec?`qo&JA1sk{NL=1UU~*VFy_ZbjMfQp z<^%!#-)vrM;n?&gxtW*eN{)%YoUUET^$3+@W>94-RGtPVC|reJ9{$MVfs1(NeY(J7 zo1dG(^0c9CQS+Sx2%LOf&yktackIDr$q`ShO^ zm6yu_j29}Cw2nKIhcx7goZ6b^GIO^6Z$JgehL!T}0{;!=$bd$_wr27)mMJ=kye!5U zHynReG(=8#zw`er@c-NbnXtZ7CU2^qF$I2x S0(c?AbM+UG%O1Ua_kRF$9h)uy literal 0 HcmV?d00001 diff --git a/tutorials/text_processing/images/dix_to_digits.PNG b/tutorials/text_processing/images/dix_to_digits.PNG new file mode 100644 index 0000000000000000000000000000000000000000..0e35d665c6b5ac404283e4d8248e58939f459755 GIT binary patch literal 17537 zcmeIa7d4Csq5>lD1tmpNT0ulwqy~`gMrnqUZUm$i>F&;QjWeD$6-p9hiB9w(ls$yZ?7Q({1g>&}~ zc!wyB%@;gwIjPFL#exk}Z-N&%7UBxxSXfox@Gsxvg4g#PAUaN1SmaEYf45{+8UJEo zNs`M-imSUDZZF~|X{gd6 zb~nemTMZ~yFbk}|bOAJ&XNF|uZSeRWd)F27(w(H^0p_VhKXVH_e#zp1Z-<2?9A=CM z9-ml<(U_Ni8UJtp&rzMP>=ud)eU3CVl$Eo(zfk1fc|ct>EP7!xRjkG`Rc$8kGAfQ& ze04mp#Ah?3%x66bp>hA^Um_JjUSiPfS|XpwRbr29L(M@eG}H>-sf>u6r0Ld}fAzJX zJNmcAd7(q8{G}F|KAK-?MRZ-Z=cev|KuwiCI#F%+^_gn(@p73pODs<2wJ7-}dJgf! zrmK61*_y_r!?`b%d~z=6nHQGEBrtUSyq)G7##W)B6h1Y)C*GBUzdqfbH1yieuSfb` zWL4bL$dlQ~iQrNNqZFNAU+f+g-d^~M+3fFTM;e@8zCMo$I!4o%`O0qfrT>Vk-LWF35FCBKO;i*fe*enu)5ZU0{6qp5Bv_4`B;P ze+>Dns;^(hte9rWS2$fCo)u*X0mX;>5bxD^t;qs~ z)Wc>F5-L2vXH8Z0&~CXYjTWge@VTJW7dz+k3fB)Oluj}ja zCfMrx%+X2ap3_|l`R=@#pQx+k)^?6!ze3z<>~KT+SSWP*psRW&Z5tNN&7Q(fe@B!T z##&@}6!7gBUO(MkYHdc%H&h%wH$l%-O-IE&GV&gytjDGkk)_=H?spJENLAo_eP*#p z-?rj;w3ceG*=(2&&YgILmX`jSozG6yx9`fDA! 
zc0v)=_$RUkJA1aqqHtx!1ZynTN%;A|++NQU`=k9I5&9J?{^2<6)psII#dvQF6)dfpR182x4r}eCLJ7k}cdT7}CKmB0GX~>R zdCbC(^7v-X%pd6p{hm})(g?&Os>i19V>|>=s2CGQe884lAFK_bz}nT4MB6Mhj!~YU zPwUwiuV;7-dqQ3~u0ULNn$|ThWJD?<)|wu3O&@;Vzq)rTFGpynv1@lBA&CDGh#C#k zUwm3olf|(}FFf)lLs^u?h3C-^QYCE!I|UO7+LK9Q8tfvtEEfh94}{Q_S-bX+E=r~DM$t%osuk~?{^iucb?@_YHrrpt8BHLz5F#%mdFx^WHo zNJ~itEl~npTCoN^p(=azzL5s4f|`DzS7I++hdn52%p zc3TS@u zadRVf@6-K9f#6Dh7?>_YBjl?7r{|m26YszNcInh4m%R#dMR?pi(Qe~;vZzeI>#+)~ zUz+=)6T*IG??+Q&w&DjVULr@hXat=rRkho1t}oy&j=e{6F|;_oO3l>HqvF$l;oxn( zcKTQ27iPoJaRl|%f7u=qS&^c9-E|Ird?`qC8WzwDCeh(%8M{7dztLHo^e3fcAfs4f z=9k1OiS7H|VMlp92jzx(pMq#wsyEH)5iaO+erUaEmH*OiVP@{mcOPF6+pMkPEr!#I z8@i=cUk*IyI%v%U^QbQAgY(%L8&#Or%Kfs4C+40gP?+E}VQAf4<8!#!ZfFRexmCA% zX1pHs$gmPFITZUsKY9=5n0>xkn0Y9~h~!6lo8XG>{}7n5Akbut_ILKUJ>&WSH}x9} z0{cwHb8EUBo`2DMuE&Gs`&-jsk2K`eu6)ygt&Cv&7?nI@+C$@3ySDXHZf#^`W#u$| zS5>#(#tQdAwPlkYr(uis(Z*QroP|9CX7Cl8EV+mD@n4XImF#h?yC*s#h3~ykN^+3u z-fUhi4%?j4G9rXye=MRu`25M)b+`4TWYG<|xZ4u-zNn5yE5nB81e&vfllCvItoi~HL}Q<~038n$|#`4F*I2OBGpD-n5f z(;jBM5~e`MB_!OTc|cO8@3mcLeRY0x9>2rXwGbe3c2CoE=0QX({bHG3!!V_%bq2m8 zX*8-!IKPnPdA7f)M9iQA7Alyd-zxK4jJe?SpsA7XS$_Q4pm4?N)31YGvS#BGq)DT1 zNO)z<8@2m#3Xi=eD~#HuI~eJNAEA431O-?JNjbHnzdxmL`uN-S2sNc*784uh%hN{; z!P_S)7S`b#xwUy~PZgVp!uQdP3DcG#7KCfFymF*t(RE&d6Qoc3u;GO!r}BHJYfT)x ztX;D$Z#`#jdzCR7xRecEq)UIL)FJZq*Y;qWXz0T*e*_RyFQ*Q(YM%^JGbn zKz7OgZ^)B`_qJ6WwQTot=jv^@TCPDNp3wWW;Z8sEqQ~V9dkoo+0rpt=?j{KI z==nl^Fx$Y?zQ9?DOoV4iPug>doN2OB;&fj1)-N1Z9a;&8) znr;j(hdkM7AfolV{Mbsyr_$ux#72#f;HyALkOe3A>oj@7T@;ECIUy-+a8(5JEkv>) z!##;Xr|n3gol!mdSacLihI(yhf8CEr1gVJNS)@vy? z0p700;YqU1_~>P|<2F#+4ICx(VoHc9xC?Z>uKJZByg)S}NVeH!YYmOhT18Nos6M?0 z&GbaEw&LF9&JRo%^*KMKIrV2+%s32(bj@unhn4MbW5@HCLdX<`RFb>f@Tt&WV7h0? 
zsZl?4KuK-A)PgEFjFNcp(X9W}6klJ>5@aS%{LWJK{F;6Bjuv`8pEm_nL<~4W78=`I zd3S{LEl{6v_>X;eK>~w)wb`DlLxF16#gH8V(D;u~?+PvUAp#W#b8-ZNF(- zBFV_>T{$;s(V@v^uf=h4yV?&vZy9^r!S{rJ$-?{;i( zYRsacJ~*Iqs}Ln+$66S*ybIUWjSbyB8hDu?BbVGce42#M60(xTe9x z-u#zR4m|2pOIoUiepfE7xXT`#GhOJ#fUXNmfbVKpPB>9v;Y%SjcnH+SC3CL7C4Pg+hCe04I$BM*jtCU0a8(J`&! z`Yc_k)&NTAHMbTnkWcKz8Fh4YhJA(peBpKGe0!i+sn+D`Id#bpkWOiZY-@G8mQ4LC z5gamK5#0%i-M}D`Vc)H}x>6WnMN5l5pXKOxkHrD*_Ob@7YgOhMikFsAPH*rMns?yx z4R4<@$V3vrck>n;2RPwm*kG$@bf^Y|75^n&I!c~0iHw-J3nfiEnb%8qUD*)iMd&Xj z2f@ZHt|SXH7fRDTVbZ?@g$EhJhPuB}T}|V+EC*8^Cj1@5mCwB@kMdQ;#4qzZ>QzmSF_1l$iFh|<((A18yICSsaOMR9^|(P_%i32b zBOu-zX`4__FV=f}NKiX?uT3K4!BkynknrZqZ@IOZg$*wA^c?~Qyy<+kDQ>82$HIi7 z{qh$$LY`I2=vNdf8L4C??_fl&)#~gl53OK4&1*%(HQ`e!LvS6&Wf*T;dzE(%N2qnX zUnmL3X|Z;0c=u`%2lAw9&a_rl=eBfFfCp2FvrAR~mZ^TOu%dKyQYYvqs{FfY@@kDi##)}%^+3W-m}alTVJ@b*bP*30sm`eEL^jj7N7%@ev)vNH{rM`z%DonP?XHvA9jbcKy6!E5MX+vYa}H<<6`NwSF46 zg>1cif6f7oHkaJ!?CK|k)rLQ(c{FC)QJWcY*8F^#WW|>+Vkh&$7%l^ovN|FQ>^aiv zZ-=K*QTABimZaM2N#DGUk1XwlaOc+ec}oR{WVsIR{jF}3f*0#Yw{u4I^=bxnI9B6i zB#H;5_vcScg@&~q_`Iu)AiQ|IJI^rVV(wJc&v$J%y?WNR6{fG|y?VwsKR3A|Xmb28 zgsNLp2fi&hW2O2RS1x@dleRcv3t;|Qp}6F=Gad2^HS5#OD_s60_> zOidZ@(|~^e+P1=lfxYn1XK z<^#3M3zZf{#1U^#ViJbRvkiV`1Eu9;#_ zC8My1=u;%lnOyL$H00YXx7Xx}Xm*BZAwXyD-)Ffm1Idj3bboa^wmsLE`0CUM6+g{T zZEtc}UPnvvC60J&kH3-}3Hw_&P(432pedEj%{yjfkwf6xZ;txnV-YmVpBkaAn1Um; z_I%fE$wLMaeEchrNO0S1S^38|fiz`L+no1zG5|nE)P0HwOvlA1e~o8r>~XT_=E6QX z=`kIpK^m|9bH{NyDQ~k5;PV2){bxTZLh`G((c{|4+H6kb+Tor>y32OqjokUCr2KZ># zwwNd8|EVto(s(p}s2fn8Q@4zzpK z)=^e4RcF0|cOZ>LXj0sl%jZOzgYBsc#xms@ruu=X;{{38o7$ynS$ z>V4*9MPffC;l0l_-{$L}h0rbl&AZYNm09dIFalRY?>(t?W4;IuTBhU=51;t|IUV{v zrGb)XdEC{Lb61Xp0^K__v;Q$!qXM57hTLM5&<1Ec&8NzGimerQ&wMy6rf_yiqh$Z< z(u9Nfk5?aeX7}(JJJaXKZWbHygvOz~>O88bdiYB*=w~)<;&Tl{lVM`3MR*;IP{A|% z1nK9u&c~=Cn-;4iIy*b^P2@NUMrhml`JSRzawfrDtvf=C&2DZFG;0<-x60wTzophT zaOM$Oh93EL%S_i!4k*zXaQQlI**E8Ec+L9%vv_j2g-}A!6Pt20w^j(bpz@$1&y0Oa z^YDa9-jStgOmSHlhFbQd)sQ|F7rzXoQOTAoyknHflIs+i(l&XaWNg}Da3O?rowtR* 
zdg=6Zpv~t3rOOx~up*km!~+WWU$u#5@>t7oV-|J$t_-i8V&98x{>A%y`+z^fTSOjh zA+*G<4rA?B$+!qyOY&Q63-=QWj2fMO5X-#2_%vA_#==i{d~H~384Wv15+rhsEAiGV zY)K%Y@A|N5y4Xx5^&#d5j*zZLScfJNvly=>BA?HP_?-;zrG~DOY;zY z$Q|xZ8Z~MxbAmFBWLIG#{jzLB!JsZnnVIn&FYX+utfXZ$phBASGD?8 zQSsYN*3ptgt#UB@_DD`LmWdOI&z*A_QE9@I8O<#Gx$*s|_)_K}9aw(@Dwvp&VWvvM zTRY3qIjX+g-AQqokbmKqcMqVwmB1=?;68f2o!j>?`1Cd4)YGoGB&Ir}Ro;L*6F8-5 zh}XzA;{oWU^uSq;G0FWRnwr!ZjqUGc*}$JtfbTF|zvzFc$@r~I%Y65ORPnVq!c+cd3-!8?+&!`X1_)d` zxnEP;R;ulJ!}zE<451c3#b!3M=M)HT-jq5!3SAj4;kyFzc^zG7$=ohK@@>%A8c#(E(b&WTe^AdTDU@hN^5X(n3D?DrP~{1(KO#%DjAJ0nM`}~hdnu0 z{w)Co`+*;iF!SM15K;lsc;uDlA(&kaJG&YBavk(Ouu7f9WNCX>?nA%|biTg4=BEAh zH@?Bv)A)Sf*#}!w1L__TZ*gd+B|=N;J>*?{9*&9Nuxi8}Vs}$q;j3GD1eJsebF!2b zRq&cul2*Dq{_OPDsC1q8)7yQtoE?6E#r$DJ8}(uxCdJFGz4qVSw(Pqo@9p^px8oa& z%%H6p^=a4U^1KRZPbY;+HC5H$)Jf*ilf95g-VtHm4g0@m>zP~RhRgWR&Z)MxPs86) za~bMo+tadX{6xLw%W+ERCumgZPuD05H(xEz`Bc~6Z*HzpCnTo+>BG&Uz6P}&ME?6* zKSKo>PN?}Tat|O?aE%J8OnL;{I}SG|T*`aY2i%jg;R??*p|7@T->YKV&6sqbF0Df2~LAy>srZt)Mm*k z?+r^={8q%&-76e7y*_mh{DG_6ap8O=4n>Q(amD+IpDCwRi~BR@ULEG6u2$B{B;!3e zB<3fS#J0VAFt8xivtR4KB4P`^Z`L{Rd5C?=6;yBzeWCal~2iklaWaZVK<;)OV3c4)2JxoGgkyIy;#$ zAu{-*qW6yjYeNnymM-X<9rtTYJWAMnU*?!$UZi=oudy6XC;nOBC5Pp`jL9tL>Rn{; z4#VVKq)Rudu@e|?ZKVYBlv+qO3a^Sy8DnCY0=Gh@r)gqeBFU;8vf zR%+^AVCRu1m-Iw!Y7|$CtN|Xxp$+;nS*8MxVk>OL5#7X1U0>|Vq;hsYxuWXqnnE^LF(vvCJv(vspj5Mb zye<@P>->oVbG@9$8|GHlpCxa^UdvDYDP~6frZjz!-u^qB-PzW1du}$25~brG)4Z8a zc(`6p$(Bs<@>DiIdpsr`alK&QdXxYo97f-R|Lw$g;T(@<51?^U;jU>EBTf1!ey{0a z-k05Je(Cf}mTv3xX;yNCUQNF7Hnv2JATS`1I@oLNZD}ILZiW;*4^1|&g znJ48?1Dhkpt*515*jNduKQw8q6|J6+D3vA>H+=CAdj*QFbtjpW1_p*ebUo`{^V9eG z#ONf0ujV?H#{PE8s^+^IF*nsWb=tmGU1yR@vREcwzr?Fs3&(63|VH zXF>u30`ON;aiuxe8W-#u_KK~0dX0k#t87YZ2prS55s^4T(Fm-A7!8KzFej&a+pSbt>bAGP;^%i6!$qXclG?!?oP$!Mcm$8Lpb|CfW>;HAH{*N|8Q5J{ zSj&EY3F+`Nr1!yByJhKlu4nV2(7Zx70A;UmW>zT(q5}Hd_sJNk3E-fu@@Z7 z5=ZWMNoyvqom`g?Zfij_-O20tl!04FmaxDOTP9x3$LjU z{LvyO3%O8(k{s>T`k@GGV!id(b#kXc&WoLBvn@bazzB!*^(SLDkBG^%03?#?;Q~C^ 
zq@F>)ecc5y@wIsD;A;&er$vTWbHotUlbXLj5Y0JFU`E1rg24}HP%W8X(v>s}@4Eiz zze@1}w_Z6JKi|>5$ z=H>-UC`GdxxQ5RXEV|jRR5SZH{^?KHu&?7Ue>_beO#?x8O_Uv-PVNA0ev~lwc@k zS#I5xqJX+hUhg5zd1zFNs3uRDtRaTdL8Xi`4pm*9v|aAY8*Jg47mTJ1a+gxAeVF`L z3U(wrk$|>hIW*c197WLmKngtG{=MVeb1lMmbgsoYP9Bc*ymPego%r}faQQRoml1e5 zB;l32nyuZ-L6QLlnbcRdWr*IU%ziLq$VmB>NN+FslX$4v``;*?i-1Iec)k`->kVh( z5vcOZSN2Qj7HRA&h^b?9?IJ(}E?>2hIuv1SiqU^|_XoJ$eA^)uvl;Hc?-g!ilm=i@ zDB+peTB}$mx|CJ;7HooZp{UmSL1diuRz6Nur+#jBNNEUbQZdaoiGre6UV;0fWfIf4 zLQmVp4zvzBS6d8|RWQ8g%6O1}C8BEebX{G!s$P;zG0w&2a^l*&w(lQJw&^4{2{^Tn zDd8u4&r~)0_@L9TZ7+_t3c+6SKDUmO}b znj;mF23N_I-c%x9P!hXZsf`ceH&lJa3!Qh9uR{BQK2hUAYHwPagXjgCx6n3I)W?}~ zTp=A#D4WYMC@2UXgio=x1}T=U{If=tyZ5K(!bh`t0f;hCpp#SJUe+si+~K=x8n?69 zyp_xK@AKD^Ez)+!0C%0$b1x{SNHr0WrDxY9m5JNYhvVM74c5CH+M1=V5L1_hs#Eij zS7T@GZt*xsL#}}CJlKkIqxK~$cB`W33~*>(WL74vf_~{Vr~byacl>*A%?ey0lF6jZ z__wmARhjh#^c6TlJW4&+v@yPO*Y@|BXyh89qgqBjtF=I2aNul=Qv1D?U@7YywRo^%TMTy~kZbKoz zYwvabu$NzKyMhT7^@6MNxe?6I-2Zmqx#08jI87kbba)v!St%Qb6^S$&4ecGWTUvzw z)M&=9Kn?uTh41sSmCHKi9BslDXURCE+R&ot~3iC$lBT2Od&@H%+AqW#*j-VHsYf+%1HTd?oEcukAZ48U+>7{(BE9)1Uu zrUe-Ze!HCj6!@jp0}z3zD=v1FYU_Wu_eN#`prB!UlFg1&pae^=$5h#M5Plb0v^QYB zhW#xQwkp~U8yr|z%yQ$Nk^kYb4@Z6F_;~dBjmP=2x2A#XqKj55nIY*e6=73TrTz1_ z;8gjP)nim){|gnS3(Z#|fU6KNz-N@9hl%?W>AYK$m`DWkEe)j$is9DxSf8#puOYBZs% z2hw({99QKqvl51hj|rPrB@=;gvt0s}fPxmxRK4%jIh^^Yw!b0(C$Z=Lk&i*21I@=!HiD}bhHxu!kjUGJS3OgblN z?A>CD3X`mMpY>C!uE~#ckzy+h;vP+IVqR5!7vk7WVez^>gpuy9%o^RCt(gK#9t+6j z)!}fT=W}N5SH}~2i}Ky=a}g223aY=L78L`3j;V404miDf;7ak__!B6C!j7bK3-})~ zI~i)*B1^fz(pW9EG#$~nB7NKpZZj$w=m(-h%Di@(3SswbH%7CEkqPfrSWOJ7`nTh< ztC=O65{BV#<9WLmyU;&@)Xw^fe2k#tiKutu!F(%dc)`gxyU99+$+YyBslS+i^gFHq z+zOA${JO0WtK|5t^fi^aG>xoL(wBr*w7~Lqr%*EdV&0ULP~az_dY|Zh1RBd@4xOcxBD(#b-vj zYQG3e66rKBdquCl!QjJ{aO0jq9Mgm#{6{DNTg;mJcWbu~T$}B`!pf>ShjQ(`BB(Xu zum!d)29s2_-J+_EJFpkA4M|bbU2RRF>x98W`bP5IMup5CQ7U!qIkMur()Mj<1qU=- zt%s`MASvZRmFbfXL=(6O+xt%dVp^+>KeL5d42?yP%(?$fFj}XN-EklezeYo=0)N&h!H2UY#yFQpoM>|=MD5#AP080j6qC4nUoY2Ib55|6*w*G7Wb2{6~DWpDI0&K 
zMFxXGnm+CU$y~z$OOu! zarXU$yy?*dhRD}BbGuRFo3DtY7t|T|?^!={4Q61)h%aF-LN~ZU5(&?ymmKkVr@Zq9 zrQgucbSUsti1YY$iSAUi?8_NY%>e|kU7HvrZBX0-G20q>>=hPe*~fH#yuIbUW;T$p zIrnSHW1XhZw%zv%Tty5}83R(;^{m>6y!?3FyV2A}fMBjezLCSEQ+zsKAU#D>n-Ij7 z3E@#Yd^)14OduccTN_GE-x>jEwgM_a1#t01MqLp~5wsf0k;d!kZqt0(B7eW%9=3M{ zLZ6NQkqdx@c6=aLllP;A=G$F9qjgXpO>llX_4pmel4b$~qVECebO@+)q4th{Y1aSH z0Wh0-J=cKSEyUYAr#OFkzC>-?{Q4i$hi5P_ZN2831exB-a~LC%T! z97wn>V~-yIpFv( zhVxc=gJ8S=U{m-H>V%t%-FCo!Vfc{`fbMp9@U9t>$V9S>8MJ;kpW0(38jkzW&-`3# z)9j6-Y>AcEjo*GB3n-w7aB*>m_rzS$+8}NN%$j!0t_1_BBH>S@(D3XH6afRdJU%{N zz;>}6B0>VFKt)&AdMAgJ{KiJ@hBrB%DLY?z9$GG9hl=r%CJ~rP?R$fqI^33+dY5j| z`}%BsSiqt49t$8G2HzQjCNvm27{~{EXHUk;@LoBsy(_#@;9tyxM#J>U(83^S4tH3Q z-wTM+W=VHe9rbAk9RVfvxx9;XHF7@0C}%=zq8du z_^a2lub$3&=i7oq0}g5ORzP}bMcbS}){rBHizE-_=QRn_VOB`V1Ip3^AT<)|Nn`AL z^IHU*lVQ$c@v^6EHViC^d_K; z;pB9ApjrcQ<|s(Sy2B&7Z90*)_yHv!xnp+X;N>0p=?+{vu+00=mR@Je+kU)dD>ue(vlRBRr&ZNWO;b%!w`4FMEu ziB6Z8mar%OZ%%ghf!PsI;ZE1RIoAcfIVVMGr@)DLcRal@mgsyYqJZ%^Y5t4kU zJLG*0k#inEwH&ac$C{TaZ4gzM5AI*P6ScHcm;X$!8r17X^mxVuPX6i8ac(_$@GMl4 z$G7h}>jfXr!u>s_u4(kuMK8lQ%$@iJ@%)UH3Js)r?`o7Hr8<@4I@e(T#2IPmxtR~g zjq(PHN2kc2f4+$sBIs#NhjUx)S6N?S`F*A0cO8xfA>IfP{L1Itx$p{zuS=oCL~Fu| zI(W}1*F^rPDT(AmZzbXhnalNlcl;+G51j(K<$xVE*6N&>%890Yei!av%=cbPG9MfO zg};|YJJ4X4dg5cvpYHuhcalsn4V5lbOix3AqFWr1^9mzK*8q+Ji*_!ZYDxQhwtK&S zDqdKIv^@1`6r3Rj3Srj=#x-alX+wbtPnSRBY_i0e&r~%X#!*HL`qkz|m1P~9UeSB` zt*x!0bg>G6$WUq?`hMT5|dZXa?8A zZes<2uI2W@DzGn%*DV_WiT^Ovl#qRL!N6^C6+IC6K(P1z8;0mQXEcHu$ll@PMZeFD zWqN2yao15X)+GNVU4}G{WNFBgLicaLzD`=pLsTmQSpLH`ey8Qv1IjZ%`FxLm66pXUb8TTc7 z(zG6L`-I^@)?S8y{`m&yr}nseu9A&?M?lG)i>c$COWUWTCZpqUA@{D<~ zC~6KkkCHafK550xCZK%TJ8J<%yH9k;{U0Yo>w7Ef0qW7Tzo%5)I9m}bz>?+_og7w# z2@n{h+{)Afv`h;g2LDO`Wdk`}q0T|*6K#x`IOCFXFv6vhWZPXk3zm*)jDKXuAF^F) z(XkiL0#N)ceVt?Y{^TH#1#hIuj4`u6j1lhhO=rBpFeFO>Z+N~e0S%dH$ARTwrWn%W zt1z02ogK(9QL0=s#5`vs!UBWUK*U}h{8_1ymWGu__%4KOb#>J@8OQ8x3BmhM?cbX` zfZVHc37bxn>Xl994lDY=3h)u0^6umKlNfEHOEvko!R$aW-U9=Ui<81k-)bKDxr$~m 
z=pi{>V;q6&XJNA!N+dPCAJjL~Tp-<#@aDN)pQ;h4j@?|Jh_#aquikkZKxe}XGHd&R zcYhqKC;q1t&yDPmG$(6>f!kNljgu5UXxV$4W2jlx3Wx@%^4@^~r{k}2m({v7hGfXqZ?EJBitIep-`$rUi?&p<&x>3u_)bw>vO-m)H|1Nyyn#E9=NcWh^rl1)gmUD=JAK`;ume>f z`*xeEu~2G1(C2x3EA)wumkfDn?MRjO4*QbVuEYz4_3niznc?3Q~*Ag z!*NiYcJL&(Zf);w9j39rg#9^jyd!dZ!D-}3=oWTFnI(oi+L~TsaEjBdvnD%io)4m> zaqk0rV6AY;_iYd8`G71_JcAN%ClFYoP~}*^SsiH7wb#NR(ywI8R3{K#{=#|p7x2ak z4#~#@oxb0~dtG6+`20sFX-DMUhK-^)jCN_!=LE3~Z4CtL?x2N!sEHLuho6*+NY&;^xdws37upBGV6~F6vPsUC(yXlS zDe|^{yFeB$Qcy>{n!@GuavPE9v6k$K|Bz+`N9Z?*a&C&j%K4xmvz{u^I43KXhdT+m zq8}d$r395h|H9vQ1)tFzRNp@VS-7ada{vU&iu04*-4mZxK~RxB*&xRv=^MPm&i{A( z4`~Dx*3`Xw8Nd7y{}I=uUaJ((sDWc2lP|Kt=i>Oqx7ru$76V7=>Pw;Vg#k3gtfa{X zFpwMKgx{N1m4Ag6~l86w?U02rSH zcCd0iEJ!erJP5A;0!2Y7rm$~i5{&tvGadY5#7vflzE#Zdh1ao7yD5*Vqod=bk>6Fm zo@CEp3-bwIOx~T)`1JsqJ>cTe)kztQ<H z;EV^Fnr)5eM-`cwl|haBYY+ zYC=(HXtyOHgn~z%E52O2Do6Ks5s;-1dD4-w%a)k7DZ!at-)wvEfY{Kj#=M%!B3D!m z{8M_di#bD?neRXpbDcy=gqKMZBkU^eoE9-0<44YIx(`AC6^w??pg%>4H_4YD~^ns`+CO4Gl~ zCDZrZsuedh-6D`?9)JrC5TCCUQyiqoi&A(k6bB8;_YOg*ECSq&GA0;`i_mE#DYfl8cmMJvOIRqT*^g$4f4!zjSl(P4 zeccRv7r_G`8osW*QSP(K~=0Rp)Cfa z+>6zAwg7?N?*tsb{&vky9m4PhCmfH?!}M_r0S*7iYzpj#K1>4}DIcz1VF%x2MgZ^B z`@g`;mQ!6_UE(th8X{_7w`}2NHb6?FynQin8=6%=PCrxC$N2Uh^B^hf9DSFu>6B4Z zs1Lwu(*Xs1M<7>f$D$c&{$z;`GmsX_ zWB-hZ-KYZrS;Mf2MoE zRpRH0GTqNRC@##X))Yv5anP|gK3UuA!KtKR7PqQe!78YhCo#RISy-Bp`Q>jJ|vWMUTdXS8*FHB{1Z&5Xf56wX%(Ka>NGR zRvLDXzy09#8v}<9-09aoo2qwy(E_DKQ#_L_`9(9Z!jx-#2j|-x`XFOr5Q3T=9DD~o z4a$QhQsFe6MIG_3NK2AaVHz+;&9@*dJemA|k(Xn+GAw2M{F6s=GcS|4gsCwl`qMhkI^&D9;P}ceUb3Fqw zPzRhMX0HKQ3FSSdfn7C9n8X`rm?&j&CXBXvBbkU=(uiuPL;(@pt5W1K-Jy5 zydy|Ar<8_YKnl)c8A{8L)AyUtOFukT3!fh0-~h3&y$#EPWMo zfrKuuYOwhQY=-T4vz&ti>h>XQLl?TXvLkSBio9%~FtCx!;{2^<3EtNtISfaja+ zk7u6(;n0LLdaUw&S5~HMf75yf)GQfyA^3kT?X{6idrl2vtB-#GH&zO)dp){kYU4WG zzlDV9;=zc|^ZPOzX8o<_<03S>Vkc>^Ii1AF8fqOOGl44weuaVcv zp(+Thz)&kfId@;LF}2eD_=Bf!L+PRE3dP>4~(b?@H#g>_5e!9DOV zbfU2*;E!AOsxsnOCH>?Z;4gPg#T3M_u*$-5FF#^~zvFy*ZVwB~ zd=V-srtYG@GkbqxMExASFA~Jq=1oLIXtqNTo 
z4bM4_>#e1@ECtS`y6tXQ)o#!91frhH{VXxha_Oy<6)~DLmmNBF(UC z#U^pFX*!N`bryC1-zi}WU!NMH%Kxw)@=Tf0uJw7@m#t?6BXlgw{G8X30SFR3he&5s z6B=RI6|ygUxhK(rWva?)(z_SUVAvgY?53b=+ezd}5JcfL(W+xzw~}0ntWz#u9#K;| zo^$z!cc}Gc=zn|0J-a})G)-GK^i7)=7ANzbb&5lMJWeLN*+S018?NaIq4K&u%jNaF zbV21LSsy&3->h0h6nB2=;~qa^ty_NT#q%uv`Tu1bWz%P`r|hO3k5EAQ_mWXGS)oE|F}7bIO3oo$u2G+yjB3@Cl*^w^ z2h*DFp8apD1P@H!kbDLNu$C^r<6c>0RIe&UKP*wl8|otdnoA`;xGjOMjnlB~E_9+D;1<$keK zg$Knn+!gsP2 zhB5W6!rMRE8^cAeCs;yjudgp`T2iI*;U6_{?@me&W!%Et{A+b6CK$(SYz#`>(7SU{ za;zIwR?}l=Qen@Eb!twCwT+(GwBzv}oyV5!4g>d%oRZp2S zm#XC|_OqEo{!!hW2Gs0#kzFtZ-KTONyg;|ySm{$>{gg7aoA)c>l{A`XUzQYR8s}&lHda zcV+YEa~gK*|G0q*nCO`#e08Mel{i%CvXR~7uZ$BB|=k%Ii zuafv$Lf{WH#D(`eo?Sg7#afO`Ey*9BP8ajJ3|bQ~Wvia}R9o z-=?CE=P->%=w!n1T8a}Zz4_>;^`!B4$JAWsIagDg&5?YQU+2oDIHT?p9|I_4$42rr z9JVL^QWbb&E`1ZPdPt@I$Q?=#76MIDbD@RpKdDKmpoRClQR z7HrpT+`y|zyP(?P=JTkA))~%`)6NvXfs$OeG?~>hnou3Jj~HCO!(8;k`>-kxi;0v9 z8cdwd-Yd9Un|T-sp$X13yi)a6M`b!Sdp};Pwp^UJYuX@I@8p#GZEHWUWT{3(< z-e$drWqY;XP31{4l;!BlB`Jv=6)tr0zU6sR{d&O%?QC^M1rlelg$C!}uXA05+)j=x zjBE8J%HDe%k2cA%*7?3JzLhS{6jhX@KVEkN5)!YRMw)e706xVg0j|T<&L`x@Y1yp2 zwL?7KnFS&AjQt!#n)|0}nDA9<+ zH)(#n@il+z*3ULvHhD<%J&;N8M`A7iN|8WL57Y%mJhUg3i9V}+XD9lt(zV}18-B!A zKqN!q7?UQYu}^(`qKqUJN@3PKt4%~|zjKJX0D;k$dh1JE=EG$HkuISd$DOHB5wAJd zwUoI4Do?X&RgpBW>pBsi=>C-gSrj6)WPLa{+<+GcD*kFp+DQmhtp0?s{?bYuUUJZ z$fJvC9OSFvQb@Z?E#eSa2`VuEfYM?-JSXAuFui|;%Ht?fTTtj|L~d;h+_ZK!-&F*` z;^1&SOwZ_!=vZD}mc@Ht%F|IA{x>GQ&%NIJU!ir&W}T*8$93&P4d!XAny*fDMCwXk z{8&x2U|riH2te@TQwh%$>FI%j-*RYEy}Zo|fJ`u{a$gYl*!%V~`qvCGr-SjFj!!HC z^ZmFSqgHJ-Una$B!hW5R{=uP_9I?W~+7z1=cSNoUQ29v+jdIy|WcKwV^`Far;No2= zK>UM~YG8dydfHZUn2TW2ng*5iap72sx%|35CZ}^*@Z6gvdHt1rtEfU?UhDCuSl&gb zxmpU-_`Dj*h6_K6SR;#uA5?WaPxdJP^4i1;kx=_bg*uhx4F!29#12&Ag=RhLoxE7y zIhV}IRH8pD6(I(0xupL*brLu4>w(Da!+-hr`3}HnQ$^TVk4Mw(Q#cDfqAyOSZFr|E z#3j^51YM^p;U>jh$Dq!g3z2Bd3?J+J3fRBe`dj-Kuk};%)tI%(njLZxd zuZoYlv(>L>s5?opu7Uz}=>O=&(((G=_OsL|U;hRy(W*Sp{>1>Q-M901kw_=)UB^98 zM$UUuu#0`}o4AA5huf^*MAb 
zby3Y1B_gw_1;h<=7Rk!&NOM!ITg%~5`2{!s1O^zLCXGqFs>2%xpEWgmTnM109-IvsE~pY|M$NmDHa|JvL2s&C#y6 z`_~m}uNap8a_=2)6fsflQV{b){PS8Q0v})>PZQIjLA>^Y!jesQl&$zL0stM9jAilk z)S80c637f=XPDE~orH=DT}y{kTE*S_;yze8USgkNj;-P2AMw3Fr|fu4XYG8Z@JfxT zti&zI9M4qUL$oj1>;_NaN$?R3T;*OUNtn*;IDV+Y@cVy_%H!3JW7^$yY>C04JX1X_ zKkt3vi+Pvox|)!(_H5MI1u%l%D3`O9?^hYtB_7#>k5sWD?$>Kp4SX$zRUMcTO$wL zCM!P^(%v>p>f`iOq7sk1K^C^yp$OG+?hcU3jN?RM5tZMd!{n3q9qci=<;+ju$??i} z1H^su&tL4*VuN74p?K-G@M`+;I&o~iLTx?>+ffH#9$F5&YF+2Id!nekF8}Jlvtu9j zJ2J&+zdqirs40vGDfZp%?_yQLL8HG4rEm3(>$?>%d@b<5zfyOtK^j^4?c>cAI-I~aF)Ub=oFO^I}`hD zz#V#TUyv~QU(AEj80(Ck9VUnCEN3a>Acc?3D{(sK?UKw${iG8o5(9Tz)Zu^# zuUvX|WF_OG$jzD3uTO7El#LfQhI5l3>HM2Gq=T=~_1(k9pfqu{Lafg8exJAVJXdw^~!k6I$tb@BD3|aPnSEu04AT*^m6aoLL{sl~}DX$8C~KGPCg_Z;Rrp zIE=&qwr559`@N+}94Kh>EE=f*?iPMX8u?`ov<4OdTdT^y(%APc$|jtqt(pKI?G<1A#x5|1J)pHqrfs=|`q; zP#>%6M(4{-TAQ^`p1cbg@Z|PxiLJ5XCfk2q31iE1?uQC%z0V4YxeU7=RS=p(Ik~1T z3{-w?Q*A~XqdC-v)Cjcx+P#Cl2Vg(AM-0YPpi)L`?Q;*$Zr5Im1psnFUfI_gi<({M zfE_t{)~v(7u7*$uj!}g@zP=;^uE|F!0&3v;7&d4A7@4Bjc{ESRcUB(3NVR=re>RfvU7(O)ku@=e1C-%<^q7}sE zE1)Pk@49K3$(Q7miX-6C0hB@rN#UAWlHUk(r{hsAP9A>KkyvM-O@xqh21P3h{9^fv z7)B3Xza>Njg?B}K|8EI-t1721T4r%sb{l~W7PU+4-QO- zIox%n&F84tf1Wt6Bu zAmm&%_B@P2P(Xx0a6Rt3HK`K$bz`lve{k1r=dCe_T@j zx{r^f`|vk&>1l;{M5UQ-3F7U55IPyI(68}E-;pQ+zhv?whQzn5necm_Z{&LGa|_&4 z_Nr5Y%>a%Ud8bU>toAmn3t*%L$MwP1v$_c+9pO+Qu5m}%VZWeE*fYP?QJH#-jDwvv zOlncI2owz)fCq@ymR8}6f&LkAC!8qaP$=hxU$}keMsAulEP5I+dfe*V1eXZbl8+Rd z6o*&n`6T+NzmK1}xd90xA1>9jEC6Q*X{yxY;spDC1~b#^F`v!T@s{hAD)?~M+1bSY zpaeleRm7_T>}S`>;0P29z1zFLEn)PG;&&KaoV##11!|x?^H7kz?I9PZuRjrH`+&k* zFY}FaWvvLDEl1N%s{#}xoI?cv$FvUU<8;#4vP0M-8#xJ@TqUql6@ofN*aSnxCw7;; z4;A-dQ|$yI+m|8RgO+!a<%haiOMd%twCPIVGu8Eoder8`y15f;BZwtrBi8bK* zUtIWSNf{1Rxt%`OT3^)S*R6UQ52d-%B@hb(L}xfLq#p;D<%g(|4qfpka}5;NaqS_M zOXm_uY1|gnkddv$S|6x_?z#22d2JiFRmkyOOy z>){~SALAi$64ev2wIE@S^LXwEo)XqyC4 zBAu~X`fyk@|M_|*ZnacsLl-Ad21^~c#v%u7fo#HE{HxY;?e|XIS0lUcl@GcY_$_{SAC{Jkv2*A;;rqJg%q|?`c(YtuvTscdDEOa|Z 
z+vtcwAHez)lE|BDbPI0zLbf>3(H#T@La1cLdvHs}vCiZ(lXLHD*c{N%^!8i}s7W_z zA>e|nbLZ{GgprIEYa2yMCF%!zRvg0sme{%iYJo}UL5FMcIT?nK)qt8KeFC*d&XZoP z)|wfFk@F9s5-A&buS(|%=PL9A^rtJ~!(pnUbZehWCu`{Cw!P>n zr$!?v&^bg^5ncAS>;m7-rKy{UrBHKCO_f%DmCldei08|8VYInvd7 zpGtRcGREG=+u`rTKC526r0{qj&vb8P#?+7}E4~^kuxea?+Ms6QOygJYGeth+f=!+H zxMh87jCqqL ztc~x*WMQfPV+x1%|*8&BhDx3DFKrc;tssKHC8I<(qqxi5Xafe~J zkO}M`nGnLE)#=EUhK$m6InsLoJtW7*UAzxBo3O0tBc2TNPfvHUi>w?Z&^}FPA~sSX z`_4MEMEKrg;p_cgQpWGB9f|@QSvejBxNLe~Vl!~n#^0`Wvc|Mmmasl=+oG@Q!8IWF zmyi%1xx>ujj+~k46k#$P&zuC;OaZt26DTCX?p*~WE1Qbp2V6sS6N=*N7uTTlC2I2a z4gUCf8<(p%6G=(BTKJJpEN|Rfj=ogmkJ)6_R_Jr~3^>K?DG$S`plwa~(D}*Ebma{Z zX!E>*{96Cz<|6oZus7B7>Rd|rAJ7V2(>zWR;GqMa)Fo45c`3Sd5T|($bdT6dkeZUv zC`+jHO8w)C%o%U_(=;G9|HxnS&fyCYgk-ZSJ0eN1-!ADg%AI~vjy$44fwE>vYDn&j zP|IVs6(y-d9OUu{)miQ?QLT`@b@{wpp&Y^0Pl=~8Sed&xaJJV@z+0TBPMZ=8ME;F; z;k@+kxrjc#CVlk@ozV!&%%!~}`AYK`bwY%ZP*cvZl1LMr;%aBrEG3w&nQtHU;^Brr z=LA~p?a`n_4tMhcDK)ul*h`%l$to&`P4f7d&BKW*h=abgm60%fadXwUJBYc&EKWLe z8mM&1X|{YvnZ5_^K%t#`m1H?o$)*9FrmWBj>7tmn&+Tqf*0g)ssLxO0x`=y8eYC73 zqW|;}t_s@txL@2V&*a-`FM$_6sYr;WDtpJCY4JBV`_LuQ4c+Z?H`~G!1G``Jk;m0D zc5+^u=X}4fW(PT7JiicA&Qk%c0=DSfW?ZKJgd_AL71Lx}>EULrJO%^RO;PW5G${;IU?Lo(YMsh?8wNY)LI8eUFEbZqbUnN&kRI zB0uXTVrxFlfJ>={Ilgy%JUVA+oapJXZ92;(V9?#uxFGt#5P z`Ud*ib`B{K(Bp)W%ZG!C5S#hAG--9>=rCx0t(yzkrndj)ABUkf;cvRFM~KNx2Fk-e z-n7F*d~qd)BDToNf}2HU|QD6WdqL$~;E;=H#`xW4AGx{*pK2>HAJNUJ>jneGl& zk-_ zCqL}2H0h1U2mba-lZI>Otn<88oG8Q{J?pFwxCHHmuAyus0h0;I^)^uftX*+C1&TOal%~MI8;f^2KsEjdf3zBd8`uQ zQskisR;STtL-P9B)>EVQV(stRw$>fHN1sB2zdT(iFnw};;gq-53A6;12Z$aM2wl~0 zrb5XRIKPk?w=Q2}XfSJi5BSiL3L5uEjDWif4U>lVKRMxHT!vv5{D_?I$HLDt!=;Wu z6Sz?yedM` zA3%i?+q&C8CNqIOmZ41x-Z`|kAC@arm?nrD-zFEixvYR&>C3%>NJ@yN0vz@3uez4m zDigh9M2Pv2ANpTWML((A)cj9CphzbDgKW)f9~|TYDJIfz5rae^`;!%sz=pt`gFPDY zE2|TnIT^p%_5^r;y-_^^4GnS3QBH+@wBvr|;v|01te5a7Z}*3JQa#V%H{R@xJFBc& zk;xwH>2Re;B`2YK$3Ml~L)}?5Ib2Tn79%_S{5R?lf$Q^a5~oLBKzJ|5DZ=i?G%8=k zss0MSL-6*EiugUhPEN2KE7(#+Kp<#j{>{O5&8#5R0 
zD||7rxeFfxLf4jKoc(cFtIz$UY?TmJ9zC3NjQGj&L`jYT;-He3>$pfsRH9D;O-hMinc1N_Iec40$&3l{^{{1bW!BZ~c{tkIVl^m@>CaxCPj%xHp^#p_KJnPDG7rFn_ER{8A$0Ju90kz)MIS|(-xoCZZEu?@` zA6f8u3Lq1z$!76C<`nAE!vlV3i-A-j1>6JKS?l(0-=OrD&C((y%QgSJIxaL6?jSJY|d>I z*vPfy^cn99F>#SEgQEQA=GkPMC=21qLxyQv&<6v0aELXNuAyCY zNM{vWOl)VEeTC|VUwIUvtR^EIX;rsA->!5)cwUyj3^kaWBm#uZ&3f3o5lCD*%M*h! z@!xV)20F#sGW2m45e$%{W&L5(t{r5q$K%vGnq5<@foV`{l1`FnHZ0W(*q`dR=ruTb zmdbqqbTE{hED_F%qhL&~-(tVkpMxFyTYX4eWMc77C;n_Db=o$3c>BY*cW(jfO*bHD zoAV;o|2sp_%(`6Ce7@O(uuWsputOFFy9kXABeMP7%@9puKKG=q!^G9rIOMb5lKMZJ z`wpZGaGZj(Gk)Qr;6jd7Qin+kk}#i<$=`yNQeuzF!Wb(dNl}Z7qz`0ac^0)ff}s)@ zq8Bu$WSg$+K(ZSSQdQrgyFBY?qI{6|+`+}eBuZid)aoE|fOY{w+=tXIct=|O$Dpud z!5Oy+zuRUsy*4rONzrhY8e;w6N~nnCD?u^wyPZs;PCyDGJTe+o%h~_eos=EK zj9yYBStC`-8z;#XEVqdNreEP7M_p$er~Rs&O(u*S4vTO;iFc5i(Z6lhXkTVx{xFJ# z^f!mY`r!L^6(*Hk3lE@(;|9H*5@M{p;m(pqF+U!8r+o0T-))$?yNk?b=7^WE<3+fK zP;P1^v=|6skybxYPd*dvDJ2@s0_0A!rMA~1z!6>P0*Djs{cp_8H`i_z4#Mld{H-UB z?lSkt7XFOo=tT%b)Z})b;Kfh?KSYK0(`@(4{ch}GUuvjxIDi}eShEhpGW zf!DG$Gni+TU)h$Q$k@KGLKtzh?tQGQ)NCn6fvZ>4Rs8U}PCtNSZ_u7eF5ii*64ReAHhTm-rR5AeluarAJ*DtaiE)lsZJ0f^LkQlq%hx zupbcWi!xJGzM8F~(nA{VmvvKGBsR6CVLZiXsx?G|K;zTo}*t`BU5`mrPEEj*R6ZmI=R3R`oT==}F zGK2pdf`eLnDd8U<=dwNN%;lz_zkV$$(cXr>-6{TO-A#FmQU#usQNQRB$;HlNyVTKh zu2=8C{(py%SitE1Wdcc6`gSmM61X|Wj+)WaHX=80c~L{Xc9A$zuLK8{51HoH#F2i) zupX8`K45$!Lje_nZ9Vow0PBqRr1tibjrs8K&rc#j=V;Jn;Pd)0>LyMye-h1mRdcPk z(^`%Qt$%N|E3TzP%-XoumVwH5M$KIh|E5#AIFM~pYrA0G@H5t2k5|(=EzxU`J_(>f z0f1SCk0Q8YeOXtT8r*&)#0Q)_xML9BkzwR}RE?}bVV>Zr@Wr;;k7n_41*U3K=CLHn z?y9J{P3{b1@2py3HU%6CUS!w}4A^F9FSMG@qkcqFLUAEHOeqYBg;#42g^qp=-%p$+G4puj_ z1@aDjN~EMm?shKhNK#g58DsId5of72(ljL|0Rzw0t@1)EHUKyOHQDKadw*mMZ~vU;>Mf zoa*mnMf@(LB7h!wLYad?v;n{aSwiDmT)mZ-$@k4dB;G@52Rh(g3<`=v;^L;grwLbMm zZv_Fxbc6*bOYQXe7`6#18^;cf`bhA8Be2+IFW^u?aR&s~lI>|OB&z}7KZuTaFcJZ@5BqdG2P z8dGCTM=BxqUd$;yET&3@Fl=obLg0Pbz=t4M?nxO|1^Xj8lu^eHBh5@T20pi9k!DVp z7Fo%Tja?Jx>~X0tLDI$vUFG*s<3zLdj+}D0H`H17b8zLWvKMJE-J;0xE(Zg@VKSz7 
zFhXH|jlX6_a?sOcDtV@|=Z}IP2Amk^BXlOXzpG0M;_S5Vo!9e97;{8ce)cG9i>#Knt%4`k?p-Kw zsCj$twQreR`5pl!R%Wi00?WgZ@R}r|cy>WZv1kjUe+Rdj+!8!CTy8m|S7SC;L zYxjizc5;_X;N0so$r;dCqo+p|ysULn8NwY^VS+#L6>g7{|4;yx5hc(x?sv#^HV$8**Wo0iLr1eE(ze*0=4_#?eR|0;Cq*Z1~AJF#D#$ z#rR2u;m8n!(=R^^fnSPS&Lld;L7C3=Af4vDlkIOhXt;D%{%jj+9G|G~&JYtEtnRq> zr#gG--h8jllm1VDC6J)L+`<304NMI;RY3WaJi%RCupA80%S}a?53dz0Rg(6@UOP67 zJyN`LAOGm)UI8uUQ(A50AYKL|7Sxx#+!7j3H&>qNix5ch$%IqxU@Y9mUP>jP``RO{ zMws*!60H7={c;X0Sbz9w_|^0YCSf+$9Wp`kfDhZSSR)DqZn3A%QH&t?xfGkH^QIn&vRPS^tTVgh{y!}CI7&lYHR zQhrjl$FoX6ZcGv1e1$5$QUG&VEePE-b9zXdN?&kqVQI5K`3_5(Pc*O(C42PdRX-(7 zvp6lt0M*Qs?Jr$L?8Zf=wglHszQ3P3#~qxRf>^LfZZ92m!*6KZ_eB6&w&m5od>x^K+CN{HItA zu1klKj&cd*Q^<;v+x5KkL@K7(tEY*Nvjt&nzra$mc3BOSQEn>>*8hF!-awXC;dDv=;>7{urB4$Q#Lq1#yR`V$KnYLMrf~y<2G!nZa$nY{FO}wjSyFt74L?WX$M+64C+~t>_Luym)iVKj`TJUE zEWoj`d67!bK04kU{i9==;z%=(gLmg|h)LqUC~qvH=gYZSewNhZbF0_K7oXBZH2%>A zf)p)Mehmz`3b?x4^F?-)K_*{4!@WJy=kw{fmLc!8A}e$6Wq?;(*7lHD690i<`P=&y zPYF*hAPn%2;o53OSIb5Og=IeQkqAu)s3ZYfx9x#_Ih;kp@8$*=yQYAmLNv)6+xzAg zbsf+=d}&XuaBiBnn=f`FjYp>Wa4tL-f$qr#`K+=N4`jgpd98Yre!JoZRbt)Gtv$Af zKr%>(Tg?|%a?ztVR>9m|hm9e6aCb4ZuEOX9RJdMnohbo^}q<=6&F4KG_9P)QpPplqw2Z!B&;YrxL9+jjT_r+ESfm+Wryd$ zT`_eL*-(}KXW*(A1J_}shs5K%DohTX3ClH|%?2Ny|42e;>jHP+weXO_i4TFW8SLO! 
zNkg>5?-gJm&xY%b4*z#!|6*k0R>CmQElJ8Q&xb*A_4uZze<6l4bZV@zQ3+t}QghKH zbP!pK5;U3EW682Xl$`zc1-411?XbL%*i*~^$-vH!h%Z8mJ~``(8DLmJVXI$Yt7z3g zB@FYnwkO!_-uPt#y#SoqLnSFf$tu9RW>QvIxP%}NGK7)A(9y*nLCiidq!zP}zq;!+ z^wu0R%?5+=Jl}>0)9zEcq_=oE&DhGIKnCW3oEf7lF#{R^QXN(1rGm$p_0R0J65<=z zly8?b>@v9w07(ESj)RpZ(L&OFL;(~|&&9Hsbi3h+hB*l)9P@l2ZmfWVFnpU#PrRY_2~WFBeHoqK<>%4eLo17_0_L$M?#9%8WsaiiQQdkEaIU94m5v zUxBW#0`3=ta=71~^BV=^Jh9tY4GURh8BPM~CswBmGZul}|rYT~EauP#r&GC;aOus7I5z#>spHZTd zj*%&%q(Jvq*{oT?fEN})%ZWDtTIeNF!^Jv_&XB{IQpkoQ(~_O;bW>G zwUb1DOtP+r8#RGc5+CPI3o3d&JOKR%%t3sBVU5Vifi6F>g)kV0pxMG$k3m@x8o@2S z>E#@VkGoV|?32Lf??8R@=%#Rv`0>Nc#o{#nZ)F`6BO*l}bKAfCJ-!DG25Vpd*(&n* z%ui!noFQr9YV#A@ulHQ^eA^rD*0qcGh7%PwJ)U{CIGOd}MQ9$&!#WhyeUEo%140ZJg~<7FIb3DA5lGa(sf;lHTh*Zr2JSxj?i-e%Id4pK$Cnl z_&EvVO92>B&9Bk31 z5n4&~hjI=xpj%;(F>az;$U-hO}=GDM5>x z3#f&S|efGHV6l2Q}Et3*8MMA_eJ3vTr;(8@x#Np={pjx`Fir04^Ke#lU|(v{Bw zQ_UnmU%-qYH6{Z3r{EqiirE6c#l?l_ja*v>A6^1j8D=;KtPI3kSW)Ty9Y%w81_(xs z?*UK@$j|`OX6G+fgUg+RE+UHnCoxhK!=PL z;GFs=19wJu9}pZq2j&#`>snH+}cRwEqTEWAu6$?rZ)htmav{NX+OTypKnLpRmzJ>kS_9P*HN0R59DDb_Z+t zXo1G?d@7ipVB9;t+3i+$2s>Hvl*0(5C{{ zCOyWs*46Z6gHlY0`6xB|70ZaA1kS_bq6g;d@CgY@6H;v;f6#RQ0euq_*piCM#@;z* zbPAMwz;4%p&u}^Y2koEXRDW!H$d*vyxK#c|uzHB9J%`04@U;)1%A)S~DL5A-6~8^= z5&IXERGEkFG@8Ij_^uF))f;lQA`7D{_Dqukm`=witr9~_vGQ_jp~2>3qO#^-T=ZLG z9L8uG=Q0NHQqQ?-e3TB(L$?R|^hoW8hjs?sU=BiZ&TVLsYiFve(9A;;NHjo#bdv<9 z8P&fs)59gC^oFD=>M1yUrRCST^-gf}OzMxw1P;tVC9DDk}*tPO5gkZ*3Un9`HH5iq`yCASo9BLY^B*RNvU{`rnJhMsayBSA8U(j zao-zxwgQ}U!HsLOhQM1TJJX*1Ep&v|G-{ZxA-v5qDC|LI4zfVoQHKN38dG^MIA z5(dL~ymO$VGOzwG1GKoVp9v|Tv)ZK2Yk+t7Ps4JQoZT=t>t`@LBprq)7$5ckZpr`| zvSWXdV=s5gz3R&I4Koo5l={jV(DRWw2^3#5ODT-C>~DV~{auc&iK|=**gY&j&W>jO zFBiGNfPs%ca{Rjkk3Yt1Rqh#wB5+{)3@j?rb2i-KoW!fGzyw6*Ixjm$hdd`*JWlxC zV3+U?w(kAL{li4rh^kxy{h>s=IOwcUc;$p(rV-`|>P6=OtG$A!jcD+Pb0vj%xcf?| zt>1>{1ov>gVrtU|Zk!%8Q@Bw4(cL$V1WUdrYWn`0*dYDZ^a%I-!Df%(5gVeh3JiKf zltPfk7B%+(*im`PuYtYB3Ra$*JfFTQ-h_RqYX1%g)XEkxfdSx+leYo^qjG{g9fWK7 zrTd3k@mU9~<2SYxRAO`#R5US-W#0s^ax)RBw)mgX>AAZ?XrbyXC%NbgHebYc0 
z8?=2(^ffpqz2!z*V17&lBKDLC6FHtQJkRCy`6Ouhf}inSrSUgG+kM>AYdz1Ol}Tmb+6lW?Z#fN;D{fZa@Bw|@4eu4E;P46WNLM8_3Zur@0K^$Y zP6g&zfsKmXeXkW%MtpKJm~x;zJ!G3{*=ZYyeJ0T5J}@$x+%5SSx@K^1@U5HTC0I{C z_zY`+D0<4uTJMp?6Js$d`lXrzYeCN$Exb*wiL-5s0Ap07MybEybhj)j>mtpeZw3XC zr#uwXIH^p3njj0CL2MOA`gDY0j8B+JZ*5Z7axgL`5FSV!AM_wk*e&${ILd(~0L7*8 z@pNL4&mQZWA-J>Q$s!I{X&t~4MWpS8^dsf~Jvg_qa3JVSS9SohvfOi?SbT3ZGC!*m zWbw!R%DT3HAT{yXkt7)B<>C7W-Qac_#obd@IN@q{Xzv95<){3w2%!!0w|h!)S!=27 zt{HQ`p+J>z# zK9XefWKDqB`#qS?=#9cE=%bvjySctdB3_R5Xh&|?-lAr0h|h`o2c-Zmwn(ctZco%B zMBL>aw#ND@J513ujo<*}ahYQ4vO0kc_!<9uU1T)H!N;zU7y4iXGPzsqapFTFQL7M2 zA%O!A3ow?82HmAZ&E>zC#LoW;%!qx0VNp}!@jgnENp)Gm0i2$>wDGC7!9FmmCTK24 zZfg042Vna0aW1+-*W6dH3ux`3xPOnNPQ;#t{Uz2U9;gq|RhppjE;I#%u zHaI|ER)u7HuWYc|N(Ti3mAe@0@xT)q@+bV?G5QU7!5|?ql!t;qoUf#|0k-$ap0dFL z4Q3QDFM}A%yDR2GCeW-b1Z)@N0J}u`bcF%2dlG~G`-XZrE&@|20N;$&ekxkb2z_!r z$F8H~dAvQzX~z;{pLq-GWF$B7lVComhaif8TeA&31o)|l_=$l(T?DkJ@ddXZPj-Qh zuqjL#4?Lt1)IsnNB2e)YwMn%d!5tP5x%h0x_vFje(B}wn1w&E#{geL&iNKf{XE9bp zFHJDZQ~orwt^zt>mn5(W<6r$|NxLCWXXXy(-K>|ND>q{FVQxx1NO=u%Cj-HeP#FP? zyd?_@I5NQH+k`0y7(b5NPE|^p$77?BujgeA2LMm|1J?wSrfAVn6#LH?*wFJsF!I|z z4U%w@2?q?=5Sjoq>vRJPY;3u?o@NP@cIajAmh68Gu3!T^5CtZ%zF8WAYyv_ZHx};! 
z)twc>fa+iuOlJ=mLx7^$30n3~`CW8vf#DxBv~CM{Hpu)YaI7km0lkuU=jx9_Erzn7 z*l8kak1{kdc3=_hvR;lqr`=N`*ZTxEK6pHPrV+RzDdT%|L2Nv%<1r_R)f^b-1c{|j z>X);9FakvEe%&4boPztsph8Ak<>gQoFas}A8SXZ~$G+-y6|N z6ZW8)mq6$&2Xx&J5OGTtM{07BV_fbQ0%H!_9-zmYGkSP^d8l;Lsu?l(9^jkxuGyd4 z`c+^&stg`j3g(R?tp=xQL}F?md7UJR(d7fx>t#6}5CT)p^mC$#+x_vF`Mzzh&$qOI zM`Px{_(ZloD-YkzYXK$)LU$5W{T3^*o{K!RukLvmcAcODZVC>ed~Msw zkJZeiH~!-o*5K3s=4sI7R@2p<>|90z;J2$hJ}UxE5G~9pbcJtCJPhNcr-_L$3p3ZD za2OO**n-!73lhvIYXmi1A?@R1ww6m^g;OG^YMVA2Er>+}7N4YP^b7np6Uwt9?kyu# zFyHfi9Qee0oc>FBDpqyxV2l@(!s)umwvgZD8Mc$LZhE%BbC8sMt1<=nN*ZQ$LB`7$%u6%xZbS2@yiWP1L0Y-=PPU!#yx56!y#MI=!ho=MS2?J4(!$+# zIU+;~0aSDS|4pt%5Mf*@%bc2dC#tHT5db$L1By!oTnw^I1AqDdwRhhCRQKT@KZ!C@ zSqT{#*;G~(+0G$*@1kUsl`YjRBc#mAcFbdwJx=bpDcKw|aT^)O%9f1p^>KgyiSOg_ z{WWv~QiQ7{>71Bkldkv z`YAWrJ=}pTUrmAln-hUXJU9jW1RmkluN}Uz4c;Y9bfI2-b<~*eT*>Tpm zBpHc_S6rb>l{3XoKES1naSJ0%n@-TRqF=T%bQFB|LDu}ahYnw0b9^v}iaG`1?#DAX zM{A$V{qtmrE5(5H+Wv?La2;~Ua3VJs(tTizsp+%K*3nf7%4C8s8J4B{!`1VjZuK49ekFZA^R5E>G!KISx2D4mG#TDB$(OGyl8gZhIg5DUaNEa%0`xYCRM@c$y z6fFsKwCFjfzeP^!Oa@e!0ExJO)sW7za0fcSTJ2uZcnNZA@rFZA@;a+FQvY zd+y?d8SyQ5=_av0QtWbOnZ4KLdgE3{`dtWX46yx}**&nq8 z{!Zv$gpi2qGdnMP0Ws7B^$i%wt` zAGgM{8U$%sW-VWvrWC#nESGWyseyN;4_}zTHi&PFFLUJNE3P+uyZcKSK9d}7 zy*|kHvI^DFFMkEnv}pim=((QEwPIkta^HB+m_!jT67CBv9%uV$Y>?tLkV5!Otj>X_ zasW)l7c<9eu4pBj{PIl&zDSNlLMWRb?yk9dPsX)N&-d5J(yVLGWm^Ux!MoRY zKThR|Gc$Nr-PY&n+HAyx`qWU~*j^m!uHY;xt|R6Lc55m<`&QY??4(?k{m(5TFND#K z*Ytk{FTm2g1@tTF@(X6)fRM1{&Ey@R>V@bfg=Fc`IK}#d0K1iqO$ZK~e4fjux>^6; zJnx?t5V1OLti>S}8yDHxcnh6-^?GqVIyPMUrUhQg*M6QkryI!G(2$Xw6$l_O)@VdEMIMr2*wJHJvQCixJj?^-e!mX&uJD*F-D)T1pZy%7fG$TO6e548ENtW{WH>+A%iH32L@ik2EMO26hz!qARXeK_u$y2}rrB_<+LEX@;xh*U#BC zOtVZ2W^EU#FPz{yDL>tR@rl3Lwcysa%m)kQZUb!SL<52N1`fAqk!?bTo4*;s3< z$-4j93vFhgChVud9Vz*ERQb>!uBB9oh%^rZDBJ$JO)vGsBi58eSfyD4hKH$v-8nH? 
zeR;X5W~F!pgjA7{iZ;8LXr_I;Y5T7QN=vj;A-&@wENvSXEIoAR9^MVDi}?haw4El( z^xs4#`Ok$~!gKKJ2w2PbSAxA!n0uVM1s~cf>t$W*Ic9O(!jmxl{tPx4UAr=Day!si z|L4t)xuZmsR3>&dPQ(7Wx_yZzkrHK4&)| z^@>)x5GIs)IzFCYR|_o;=XWYa3Q_dD`alRaOr66);`QN#BJE8$#no7Pfv2h)2RSqb{+Y|D%hl&U2zC zR%9nd#?$?ossc`M3w)V&gZXmCCpz8z&@~|me9YE{;GQDYtcQPU zXSmdmJq3eTNhdCml)DzwSrNg`W(Jo|%O4)|j6l`4a<%HA&-0rKCiXnD0^Snk_6C$X zW;;RG+9Eu3L=Aof0lWa7=WTx(!gc=MOVk>@)GE6^!!V?8fXh56)O!o&I(e%S`jlZk zeNBG^g=uyyN})1jC^UCSk&#^@2zhOL#dq{vr8Ni*Unzm+j_i*q-V#yR$_;$Q#opm9 z19B!OUF@Y{F17}_TwmxTo_$uZ5FpsfZ;)n+GsyiwqFjGUO>k-vIpQQy)_~tAtL{~0 zk?y7;4u*Z3pZZbhA|Z8vtmp6YMSr}G6WGu7B3crhrtUxFzh`q>T07Cz zl}GM_be3-?58WawCgweYAxrZpuf|R##`ox~fp@6zZhbiD9^*p6?s-))P68e&Nsz7Z zg%3;MI@TMP1!xrC7Ho1j(7dG7UWWNX^hs&0WY;DN;XP8$PDITpfgP~xz(@!X8my}@ ziN0!%itcy}y{*xIh1s?5J zOLNCyE=0nh71qGZ#wk-+B{{}K0!5)hQk&u|k4)3*Y6y$-)gnHghHkU#-+qrfnYe?L z!}%TLGSmG$F~RIf-O1%s*i`kzF7`H@2@!!(j^(P;20&$L*S3yB7Zg0a?D@DaI7@{D z)b&T7PyDbDs-1~JP*?Y=P@QY+D^|=>P)&7;E1@WH7qb?VvAFwehYht0IY|vj2D+BL ze>W&2?3#L8+~TdnxA6X2(}4}kKU$dA>R-r* znA;*o_3vr=H(p6FGQgZ*SV;zve7{>s?5#mkz~h{xgqr@M(cm3%bFID+wRNF}RU&KM zNFf2dBG#opR2VG@B~HfttbdAmcaN!3{xGZRh$%$~bmK75TN?8wUe+(NI!{!wcR4x^ zJfgv-@PN-KT}TTT^0MywY_OILIQl(TB4!o!^R6k8zqqkkF#}O1uT%|sX7%dh8b zd7h*g8~m-os00+DSW}ra@hsKyqMGeq;3BLb17o|kR@%AQuo)G1uTJt7gi4B-ne4)* zfX~Bbs;!&Toc(uee3{;y7cGMp~4N^K)a@Ov5=QN00REjK2d#&nDgP<-$GXmI|y zi~AV}#%G@;_lA_EX$F;|Wl%kgUMgRHyxPgt7C}5l)hH|a9OVr^cp}y|jK3Jj2WgUE ztL0Jpd{J!vq1(LwMC0?L9M7%9t?6p-Y4rw-5B6_fNYGq>n!Oa_PyP&lBVs@PLfNQuogO+qh6qK+pAdq3JPR%FY!;iZAzgu-VhJ za*#a?e1#5LG(bHK)AOrf7S6v*!c8IjLk2-6B5KT|N;jN(b$Hx`_BXB6DObfmF{APo zv|rVGuL8}B4=SKJc(>gWMCx)Qtbcr1X!>4xj$W{+LhnlGRCEIlX-2+ia|G?AguA`V z*T$OCHuYx-NRHg6TZOLrnufA0fH0yTyhLC3I)E(DFR$Tyr=QDi)#+T`v?;F%&UG{V z%a!_}%I~H$)1r)b6cI0o*x~2ZP*xtUnF=S?H4;vK0v64{O(HJShIffMnvEIvP`@lS zIvnbe@p=$HI+){=Bp;i7_Pv80iJHF`|BPEWA8pxa1k8{?QKh~fMiXrc5DXm3e{S~D zU0^8aYbjoo7a|>18_io0nX3>o-CSl3jv*?_ACCk7UpCLlwYayz-G23&GowP0=fh^Eue9}e6g2%`JDa5>kuS4U 
zBImzI-KzR%hUK3pL}KFWROyt`1dxt7bNbw9M6$tjnc!;_I6Eyx(22&92E5F<7l152> zk7^Ec>{yg+!OCO1_jwFRA1T%!990o$WdteDNrPQ=#+>|*{?n0Cd$ux$9QxP5=d%iZ z+jeoy6y;IatJV$&KlVs5ONi0o6YLV(Y7ZQPTNCAAux+=Bl5g+*turS#RQv5LkITS; z=^VjP^QG}w&F}d^m3AfJu|B%m+FPiBP{Qg7m`irQ|g^`-bz0^+wI% z;#qBmLVXM=H@&w@rb7Yr$9rHF?x#O5Hw{LGEhpU?s{&T=s%`npJ0|u!>i2y#!7K2f zdwkGlgfK1cJqcU%z6&A7YwL#IT_vuLPhR*$TGNr%L&Ur%m}pl=t!7cxVpV=-pMV5ZvM`dh9g z@|~6m>86ZD%dBC#{@jrn)9q&s-v7|3?QVrGHChT0GM&Noww>#07-)`9ERd$xX+!a> zNVNQdzdwDB8sWpHlw_MA?#gk5cUWZ-KxC~Qbu5rRl9bqF24AAgAA~p}`r9T#cy|B# z1Ac?IO0v>s*Egt8n0ZXJNRv84sSrb5>GEJvQrdkT7jjR{#$QcT2dQU~9@LuN0L$h= zq%6d;`Vlj;!u-B1ELg_O1hEqLig}%V*-7-y#xVYU#I0!*RQpbY*#TohDt^fky!!5+ zj5=@VIU%-_BUC9^1iOZMEO2TeA(&;zKOOdv^NdM&WDs0pyS%}v_u=S!1vKG%J0d^j zp8kL)et&t;PsVtQ+xs8@|2$?l5ljS@GS8-l+3dc3G}f>;`WqCKg9*hUr(PWv>LO)ZE{AB6 zL}QEKNssc?y^j)#I5WKC{fZeA7Zi@cv)@oxax2}{ldjkpcQ>K#kofcuqDvWvn!WO$ zR#_*ByhB04+Wd1npRcv@Su*+wR4QyIi-(GDz=P4Lw}qW#&%c+;vK`E5HT*k~NJF$N zN`1!Mb;EDdeuGF%yZD%BJJEFyQ9=`Yop_^4a2y` zEr6PdtfM?#E|j;pBv4RDn=o6}*F$j0=PY5OCqx?C#)7nKxiO@xU5AZ`{hR#VN#?&| zPP+LUa$}50r=C7svd>!W9xk(u(3;Gf`chdqq7$C;i4BJ0X}9Ko)K^=72mk2Vt|3z` zd!KIz)0c5@=uU`$eG%rgSzt(-Q}~0S#bnq-55{A@0{oMI2b!#7nszk+qx@mQoVF2R z`nIQ~rp}oEWuf$_lkNf|ia9039%tP8HJS?#T=N#~CC5<$G77&5`LBb(4$97TkEOCN zdyEUrHw!0O&&h5^!Gqc6l%*XN=X#(z+KZO5{`%I8SKv|c@`VW1gRc2oD}J{rsV93% zgT;5De>mU|V)tLVc2te&k0w?xKuIRL_Z>tG#^3mho8Pk*4ExuuluLe8b~6xZ5^{#Z zi5w2ScM54O18=XJT?n!lxMCEw`X_e7zhxP28! 
zP73p*4G3bUH0X6%hxVF!y8fHvFKy@7?T%u#)T2FpyfoS8$V!!LS1Wgo5{#M;c zxW5?gL4l_jd0byi7-iDpYA}}uMs~~06|WYYLx%Y2J7ds9-~ntsyy~`W9vtG}IDF+} z1agvuJefx`<4)d3=kw(rnIRs0Qo6bY_f^sfJDd>{FA=4XU2EB|F&t8Pd;Q^@sR+>N zJFFhbfAq>AVWD5o$$LzhH^7X|1S_`3{4HM_b47q&;bTUs#Y9*k`%b=o1z_zjG^}zM zbm-wTCUdaAvh56$5e6=H{VPNj01R(G^LYs zL!yNJX}+#Xw{V`DpIKP<9@U$p25zAy@h72<-}1a)97qS-zPI#R%N6WQ0uiO{CcuXp z5R&wc#(7AngzkV0`=K%l&|L5#iCJT$s%$RX>*%mk0fRre`TO4C$fqxE?=KV@Ua}So zuaRr!xByXsRj8WRa+x~wK@;>r$nn+0`~sOBg&T0nfGEOu+0`v^3DFYM<`FJM$7rNR`6U6}GEg3+C!B5o<8~S$ zR3#xiFcx1mgsGn%Os1V-$@nWfkXzL@j~jy}kD-4$KH>_-8ehb4xlvxFv?VI8JO$rs z6P{wqb z59XY_T7=3wrk0@_8;6?8p^B9L_{@?hy*kqYO+hLAg7Lz`?4CMGxsrW9(Esp2YZzJ_ zF3a4w$bU=&u7;Js?^yhk?{Ukr0`yeJRr8V#aF$F(GKsa;!;km*)dn_{Jo)2%T=A)T z$G~J*q#m!`r(!;NeBDoBlcpTGuhH|RIiF%si1}PuGOb)HhoP)N_yJap&kpvj7>cKP zF&$tGc@}wQAZ{4^GkAcbc76w^PXpjR_ns?Ztw1-}Tmv;=nU_Iw%vI_56bM-2S1X}J zsQpm`oWv(S$N>hS{~-L5Syqy@@}af(9Gz9FfyEosuf?wGqUnYIdtFsYDPE~N7r$g#yftoNET(|%kqmrH zG9Z4hiPYt5)a7TZuiy}h_9SqKqu<`s$n2Sp z-#~XE^bLcJ9JyEj&dTkwD3D71!MN-;FY*U2lReKJT)n)oNQN?ab9_8b;&2;kDDtTh z7g~oDLj#2)7jkUF3%dvW2{hcif#Y=Cd?*h5@n=qdG|4xR=0wRuK(K=wm=j{MN7F@@owP%0NpUpW&i*H literal 0 HcmV?d00001 diff --git a/tutorials/text_processing/images/romanization.PNG b/tutorials/text_processing/images/romanization.PNG new file mode 100644 index 0000000000000000000000000000000000000000..f7f0a890596c9ac07d42b8474344ef1076116724 GIT binary patch literal 36700 zcmeFZ_g9l!)GZ8xSiypdfD{2iKv24bF1;%VNLOhIp{PJ;QUn1J=}k&#Qlue?Idu=S&5nvHYE4r`>9Lm;Ve(wM(e&PzwSS)d2a>B?>aogYkrS!hR~pZ&UgG1n zefuFMR$Hy}bhxrw8RPQ@DfdMhf2fGu%vaVq9enR}@V|Z1Mjr}O(&9kC{mOt5A-zUW{7i(z#`*3pW?h>f2cO-SKB(Y_@1!sl+HYnQq8(3HNeoeTGxrxP28`Po}Gt zyefFGw-xW!FK6;TBEqiy19O#a*Ja;mkm^$+k86X$YF@HcTyc4;nGb1Ih}Eg$jr+pD zO*@;7UMZ=s&8)7?kLG&Q9FDlCs1)sw3rApr*Zx|^Y=WU)EWOmkfYjI=8@#-w&V6ZE zb9>x@V{xQ7xvN=bsqoWGCwFv$GJ~{AlP6Ia=Uk60Y290DttUm5*5l3kA$C$ub(OWJ?w(i7IB}w;|yOq=hdrMx37V{SqwSeo6C|OOOMB0+YVA` zJ0YH{cTB9j^P0)4C`3=4k*O&^x7vpop}Mi0E*qJ@@3?iSE>Tf&B9CjYqQ%{|{VG>` 
zq~zOy5A~eh3)x{azk@HhtUn=>hM=bJvhgXi!P@CGW4^=G8fHL?2G4z`@4wZ zr>PR~q)h^@zRPCxgCZ!|S=`4Z+Zs!cNOP_`bIuE9Ke=4q=2EY5LCm2~y!T#nzcLEl zy7M;zaSnX>@Ba>px8`fGRA=o*pG=w$VsS;`XgAuaW>D=}|1uU{UJC~sAN1dQT8xa0 z#37Mv*T#nEE(9Fz7EE%r_-&1tyjm(M2$gG_j*`&>ktBlxTdR2Q_yq$+Y}(I{zxYl+ z`7}u{lHM#n{4qa2f2;)N$#=gp)Jm21Z}=uPRp9o2&!1UE5fL;`AT@}|QR@KbQTu$) zDL<2KoT~y!y1Zo@%gmZf3pPEBvgwL4QZ5q>6@ivYN^dOVcMtcL@1oC)lJ_}$7H*=z z**u*+ez}1tOV3G7M?OHfk7?zHLxB?L~UuZ742R~x@_`sbn!z%|>PRDQqntXzas)qg; z#u0vmFj+0oWk*4lZlkUY?yvXUtpb6EBA2uvFyI&jPyP3tr`Q^p%Rg9;P<;Z3xS=wo zJLPSMO{gjN<}Hl*@oO$~Q9>u?PYpYcL1yJQT}H<1bxjmqcmw!?Nk7ldn_**lU}2>Z z$D1FD`WL=hEY?gB)+2ZH2twUV{>9XvhF{rT68e$7mz@)mX(d1V7mR<^P_S zo#C^@FV*Qj6RVERgN6%RHZ_npk)HTUZ4jOY(f=(9vXB)#L4_79Z+@PfbBo>UH??PG z1_BE^e$dh-3fk>d1jqD&)dI5igOOm6`wg5HVHHQ33YZe6Bm|-3U1eS<{1o+yM!URq z@3ZCm$6o)wo|^RY1qV4!*@7Zf3h?Fp4lR}7D|Fxa6Nbyv$m>JOq#d9zX7H}{*&YJu z69R`R@uwJ5S{T2S!g#AqpKtMA{*bT^chBd%*AH4Z2@^{1=dgBAkCP`)x(v814HqH! z4HiRLU0#sZ*gPkmQ4Tqm0i50!Ekl|5$M3^twe;%mkY?%)qi=Mv8Q;Ez{+!#+TmE=z zy#C4h2xY|g`P~YzC`#xri+Mk8XCMmUdT6IkKD+Z+dwk(Cf12c@bsw*7zKDXOT zrO#~NP7*+gf4-F`+DW(i_tfj9TqliK+lj>=6I@caW1{%96Kz&(^cN&l1e|I|jO_!6 z%^Q7)GiE*DZg3TsHGLkRI>qK5W0_hekk{#Noc4h|n}~Vq>pZLl3&qGEwI-N+ zQ4DHw>!+`zOPYk_YI$j!_+2!pmVqD?vk1c`8gJzO^`|_xVQM#)lC!RQfa{>&xpO;d zmF+b4B)t-lu&^>)eRUre>{)*sb=GhZ?HsmEZZJ~mgI`D7=On^-P3E-sB2aum$7wYBbAv`-tl&)pk@aBbbqCy$*wdv;~$d?hqz#>+a5!BiBg6}O9U zJKIZmKZgsb4?i!ITv7A*6>>C3NP!P_C__uN_~Cft?+Blfq5j2gbO}=XUcaDok8<|U z6BsMdN&~`Ba3O>WGA1*8S7FJtCI?k!Mb(i|$AUSD)_(LIXaNMV0%a$EJ;mXaYlp!rLg{@Y!Kp2VTN?J4J*hCR1HR0kp`6)1~-L_?Fq_qVIMaT`|=v^_32rl%{#n_~a+2GEsw)nXr?>n>vGu%*~_M{&&IoTmOKRog3 z2aQ;OvVO>JgN)uU+;wr=ql;NqqY62yt}YtMkClZcf1SMy60#zS(%BE>i=V!VSa(q4 z>tZvxSu45Bd$VUbw@gpH@k$fP@b;{IiWnD{k#;vvzx!8on~Cs9ZML&xao+P+{?j%~ zv3F-<`I7Zc$?@H`3zfTRVSF>>QBYX;aKPs@qp^qoTHMO6=oTfg_)TaFx^n z@q9p34NuJHot!VYGX5ofvQNG|wa7c;xAphy;u7muD)j`^=07Uh6dbA=KYnuLW7*y2+K{{?!P+RZZsd| z&gMN&v0S+19k=3JH5@c=#OPyn)xRJ9GJIxR&~ZrOUO*Em5hg3WojGU0orl*8%T^jF 
z^U(cx@9X4fwS8}uHd-b16k)Ytn!$5QfxJ?UfYH4OM(n*G*8bV~uiyOvII=s?QHJjS z7m-i%28xm!d1Xo#JC`Ii-aGL5K;oVZ=q>-1CQZFK!IE94LnlqDcBmAm4+!z9UyKow zra1_9Hkxii1r(FLrhG4(8VCf7e=k$^dC#hZz4;N#v-3Zs-7JiScX@x0 zDedV+?Jg+yqZ^N3;7cI&YU|;2w8F}ectd$ zRG)s15^8qDKvbn*#4~O}B+md=l?>1G$>-xtPUJp`e%FSn&7)oO=RIYzw zVB>=ahncYA-7(v)EKMVq_^T#vJ=eW{9;6RzmnQLDU!cXas5KMu2?;QO^UeZ95|~)# z4-2`H#sZDxV~*rDUq8?rvIsv(u!Vg;V`y9+bem7Li?06pLFs)y;|T5Ca!U>g#{Nrp zzh?_&CL4)0mQU9CP6)?|=Rjrj#NO&kO%4lda>Lkb$cJR&BxyIrAp@3dPt}%XsS%IO ztpq1?xdwe+p7GoQlfq0!RajUxqce5`sdDAbpxa%!7J+sh)Ah zZI25M&&QRSnhtq9jZNqij@@q{#lO$8EKQ(Cta86mG>%h38T(lYl$2TTALxPwgZ4J2 zw5ea7Ja!AlLlcCcG|JxlVI2MqmT}b2X$d+Ny@X-xz1OUUKh9yjMWd%Je-x(7*zSgg zy29Q(K1F!Yrh#6cXk!*Ur$^&v`sjp95x>pFxF7R8hn}uW|8K4Lgo7+@6oN8)i^t2F#hqwVettyl`-> zktudP_m{)jqWD;*+I9!6?(1)JMWM;vqV!aO#>Z6y@qq6}aC^J7^glBz{t#cu46mKc zTps^8W3fs5e;$4DajLPqm-)}T{FJTy9LJ@9?b&AhGr>3Ud6K?Xj{dO@S{CULBg#Vc zjE$B5O=kr&RA~=s(9sfxe2&x^rJzhhxu2X=Y~FuVk|C0w?Gxie`3oZh1RV@)WA20# zt9ZO*f=JHmQp(q^Xp;)5Tdv0mfZh9+Wtnz-m#$Ne88^rng|965qvGh~bCJbVRo*cA zvW#YNT8eAaZ*Xz(wae5qAr6}n??>fJ|>+r3* z*YIne=T>aIxIRK{IP2NjVDp!HB`gKOA!;Q#(*ko{2CNEqxob-2XC}%Sb4+|0u!#u? zo$1nUw&PW{XxG#NxA{K$a!u0e|E!ly1|c5Zrk^Q~(4?y<*9MnJ1FW}RL{%#o@n?$P z`C3i&8|iuCobHBgzK~#iGc5$hDn!{%5RBWNu7ZgmJLTRu-qA|F8E>7 ztA1PGz5OPhjTo*+6@=KTJ|}0G{oJFSKHQ^f7}MW#TEUwTBcY*{JZ2=1bsNxQy?npl zFG8s9UTq(R>}i}$GM^cD$V|MY{d}dte&8wcuoQ{Fc7c?oNy{Cy(64ahZoN!RLpPSk z{H0YmoEAi_x+(f2*PkrQ=Wp&3J++MEU!CiEoqVfH=S?k%0GoJP^>teO)kC4X|99<6 zJ(NpNX^jUe#vl+DSngGPep67bT7wV7JsI_r! 
z<wcl1EIhBD`@|SQg)wYMGIJ7|N;9pc#HgWGOR$4-9pse-5 z*$HPxa@0S%PiO5V#CAsR;EB$BU&dkxHwRPN^AuBiy zlV&~$$EQN)A)unaPLSJ;xHSA;cDIw=h`p)btSG~jr%^eb)5L3SodM~V=Y@zIP9gAW zS(WQ$cNpCE?%NwUWBRAWtSH6wcB4%(OZ5qOk8KP}*M_R(jBD5kl=(%hMAid7x7~Qb zy!JExLglgVUpD5?#d@ZFnMH!}#K ziN38pF`Je@<776dH%7vRP6q}n6k<=BdkS&a` z^Kg@rOz-(S?wKZawxrX3`q9yrqhc#9Y*|^B-N#ky>hWbegM4wG1t!M7w47_N6q2AT zoUnf=9Y>(I^5{!Szj@L?6fF}XmgG~EVepj351C7>xLmZ}_-yi<0sq2i?H`p}#c(Jp z++cC>(!;i=IWtRs?yG{ki~^HwmNGwCOE#G2cz9)}`(1Z1?&gmKuQIORs#oVjA?ZX| zPq>9<+)3#$6drlB$6SNm(BkpR2^k5Sm2CCT9u?N7aoY~{`8!sm->Q>tIw}4oLYBv%|*-|@K^DG{m3-J&q#_-uaYE8DplMrD^ z07Gd$8xXd$V*%{i`(q+Ya2#-fEcytX*9UevGmsiCkTp*9v3|IZk3W;j+joEZyl2}e z#{Qv(jb665-sH4hyz1cnNO?Uv%drIPZBuJO$J}9BQq?EAWUjjzJIUJ6n`oiBKZn(7 zMk*KXH5nMSbN;}c=blS2CW`;PUVFmOAfK9QAeB;YsA?{|A{*;&sIz$__|d|+?yweJ z&6lVXb>{s$c?32h+2Y05llhhJB#SiEx3qim2F~Q#(Z|J|^f*IX$R!=g|5w+S-V}6u zE$P1>g4A+5_PuH?{nyg*TLxLnkooQ(yYjGM_V5I4D04wD;`yYDQAE%6+oc;U#y&T- z>ExbFlT(L<(ie7RL)4EH2pw+Ib|VAMP5+_7;Q|mjCF!RrAiic`MvFv(Fkp^Fpuka2Y@8 zJ39oioE8NCZ9wxt*^Y^0v##oXc~<|SiL=Ty?${z@P^e(5>-q`@!*|Y=cu~`d1!Z~^ zDBYi5b9}0|e8zEZ5-2az-t4*^N#-v6G<4@<&M@+RT&)JAV(jtm=f)shTIMr+dG6lT zKU|9r*r9VW5HYs)^JoMb4#C>_9&CumvxTu!d3gx?rcXXo&bQ8Vd!t{WXMcC27ye@0 zF1bsj;|4d7JgQNaHI+CQRG0L8I`;*I{fgJ?U3_z`RGinGeK!AuadcB8jtlZTSyz~t zw4Xi8P6tGksQbdhT~gbTb314W_~((H8M}H-YwfJM1@M-N+&0s;{q=|zDJrG^PJ`JW z@HcMX+(wK{y1)a4fzHtC%9$?_n(>HkZk!6lV&FkD@J|1qbM6od%?fK4IO79MuS0z?2(2 ze#}#l{_4Fu5pM}7T_KYEL1=3kkcLVo+|j!0Kwj#Z3OI5m(ktc<-ht2$S$h2m5zJ%t zp6^5IVs^NrWx2rrZt;6eg}n^Y9u<|gzLxkW%>vo&)k$;7i5JRKjLdD>65ZmyODom= zAQT#ZPV{$h1l(b=bdM!&cYOeog%P{o<|FcL8}yws^9Y}+FpL{et{MR&s!~4;D2WZn>6XMb@+Iy{1dkHMh#vrJ z{_z|gllAM$uD^aD9(qHNM+d|JUX)3-n@3YCLodUq3nn$TbV^Lj{#Oi&*ow1V&u|~o ztO6m~eRME&w0p?run_X;Xn$UTclv@N?wFg$=7DvLrdL2qGi|HpjV(Ql#(};v`-4$m z+#{GB$T2BWt{M9~)6v@*-zNi(4q8d^odb8|2#RUv<(S=E0`?XG`3R*y#qO1bDmP!| z*-mq9jWw;dPoc`DdkaGJ={VT*$u?H!6>zZ)9zJ}k_jhZCO|hb7Z>jhd#?(Qzw-2+p zxrY+--}kB}0A0h;HO0)x$Y`$qQGfzs;-&gjN6x2!qeD@@J$GM0s?uYvsS`ToM?xkf 
zC-+=6v3Sr&Ol+seo3&n+mZ?X$IHmr4WOPRCRf7Zp{K(yz-=ucL@w%I8AiU-G5X zVUHnlq=HQUD(zS;5X*IBGq!;JH3SSdkh7sLyY}`2vzQu1r?Da)>s+$_HbBxy1Yu!N zhFXQ7?ab@yKWRhcRqVxdJVT3#$@f@4K?c8$(VnAyU>+0$fVUu-j*)oootEcG5{iVtU=sme>!~vfN*aXh2;>!JvQ zudwQ8=si!Q2S$W!5^STC%hv}fH5Rk2!ByzZj__XVBXeN~O5}Ri(?#vxzup{F-^Nqr zYH4bQ7b!FEj-y!+{c@!ANr9cs2XE}N%j>X?&-#^Rir{K@1BZ`8^Mv!)pp*Ikwm_Az zKXJCWq)Kb-%!86_YC*Y6L+zI-X|XHAy!CZs%Y5A@q|yOWPH!vn*c*7|W3aShh|EC3Hh+*o2}lCrYr4L^ zJ`+g85EChh?eyy#4D}b5Kp8GCEvm{;(d+VZq2w|_L0bm|7EVYs+8*x>_lIyJtK!~p z0@W?+j9g0VpHSABcid4g;M+CrFCgQvukJ$Z7dSUwA8d`TZ=&_nRvJG=L9~DYf^~{W z(^mPFjIv=5#Otrf6(oapb3&G{*UJ}_p_L)bd~Y2(>#W&a#NF1I zT8vrGihe2Xv5ASpsI6pg>S%u@U3aZ%grv9T%wHtkzHsBrWpp8Tyy*KtFF+kKQEDsO zw)a4PqD-fj>JmWm zqDsnCi|S&~T2?%zq61;Lyrgk1wKdbDEt_pqoU9|$52JqFQVvw8alD)TK+|i%;A>|+ zx2x^Lvv)UJ2rJsi^@$7^y>KsA)4y-$X8(P^Acce?{TQ901HTtOMN6tKfq&}Vq{GDx zqtc0nwGjQ}T6IGJKDNr&Lps{GL6e$DuDQ<_rZMMty5jlLSUwiLAQS3)Sl}gp4(XBd z#ir|2mq8wZb{y_+=(~B8szgP}?N1H57^|3+OB6d@?!c=(Q=0=;4Fo$E?4U4x z0OT*C>@oqBC({pPjp;#ud%@2=pVE_WFxhfRD|prYI@(-ik>?U zZB0#O@dhbhx8AI`dE95YbkdL28Oo+Kx&2%g(}o{Q4aoX6FW`ElRg; z2zlji<+}n3$(Yc-x7NX-O(_-Di=Nqf%ifNq-E;O`@W8>egVK%*H^it0fsTMZgDkr| z-Xee>;CMv84qGEV#vpeAnwyAOYMG6ci-LQT^KhIX5*b+J`I4 z@L?QG{1bB{R_8(W`^}A_lv?NAmO#v(()!%-1neZ!AWS)7#Kd}X(!qIrw_Xkn+pKoz zmqf0(xw)lX*@~8|!DRV6EUKPq$)c*^McTdk{pM5?QQ;ucxsR-ub_I43OSZzO=`mTn zc#~3l`3g%qx@a|ayt25oF12e&KdJB68m19V9wAy#{Q+f?c?FnRE|4EZN>U&*di?-^ z+Uj}>@6F#T1;2PwcRv~0Q;Xa7FS=hVrO8urKCmGB+3&YgtPW?5;M>La{VfOEINMU& zd-RU?%UkqngB0Ylhl2=xtwYaOo_GTf)H$&b>xj{#W$U9jq56p-Kj)lCBV*H_v__i^ z58800i#6%IDzyB>Su;n<3s~@l;z@iQRLPTpqwf?%En(o zwj@kbMV2SR{L9WX&>X*k28}hICLDTfIs-*bI_Bh4WGZ{C3Hw3&&9EC>z)iKCq-c+3 zH0j-DmO+#hiKOTtHCk9&0VA05cg&1Yy?Q=Xtn}n!KbbWfDLO5qV#;n$p@jKYN6BBdunBbB20H@I40HY~1uCdk3K6|zQ8zt>bUM7U+s!0-Vt z5=v<8;6KM4(k})e^U~Q({sq5up{Knw*rBy3#o1PwvHwVP$&n51o_h zBijKP5J8_HsrPDI7lEiNk_!244nS&`)lkG`no|rq4LpJN=jmC*HCzBvH1ttk_M#qK zNN}ESsXjU+r{>n3H!*qfkLU@8@YA7oJxM9QzZ#iX-tybq z6rnFoACG>NdiHowSYx$>E 
zL1KS3_AIPxF3ok7EWgd#REk$~Ehc4?uKrjgXmCK}2FUE_Z}Wtd`ntFzhtmd^ymI^0 zIs9xzW3~>;QWEgLJ!%FkZPdkX8Niox(>8nA)lUKTLVz(% z<^TXl-KteR8G<-UQeG~9O>#hHxukl9CF=l?oVHZ1sjX9;HSzc4I6M3MNq%L0S%(*m z>wr6|+rk4A{JJ*?UBvEB9DynaqfdnQN9tDUJAL_Go|4zWdY-4SVxUl^qMvv7MVrkGU=b8NKPkGgkPrmUYJ}9d zd>_=;l_bSwAHZek%;CT{aLRY%fgEUsVT#bfrlsZuL>T>*7`d>zd}{k%o>%zfy$qc? z#^+6`3IYC4V$@_fgE8E_c%%}5s2&QKis#i8ZKKx+fSkmT_m<&l zxBZCs>&-!LFbm?U8ZJ6D(dhOa@@uIi^OknfSZ!r*h8(U1$w2>}h5)!w&PeZNc@>H` zvDVV|Ofw)^M9%3_;5kA&Yj(eCzJ|Ma%D3MY`c$!6af|0+Io9Ft@b%ns?~S>xD4ErT zR%;2|o$&{M7u9HH@%WJL@ifq?Q=ocy^FDG zep3&)qAB4>U55}Tz3Myy=Tse8#Er!-+<8NDL3UvBf@PDt2pY{O0=hpE!e{@-N!=O~ zT0)@stWW}*^x;-{K%QN7_}u5x(!}5k@?)OX3j>%9#VW4q`Kqd_*FYb)0eq-LZM3+r zUctMC8o?%gE;cAq*pKKl)&EI7x(+TDmaz{zFeGja_l3CAt5%5%ZU8e*Y_@V3uLYeF z&0m(&L#sQ$TuWH`C^Y$jLjmtMDPEXa=zp-+SdR`wA03huQ27UgeOq{{{G=m|8s1#? z?=to1Pr6~WeR~prnx4KT~2sN|L#YsKuO3aY6bssH~gy0TBu&g5GSN|K1MV2 z0tfrHJaN3bpc(LL2xlK(z(J(hLebtgigKvm10>zG&kL<a zL)cq5N-VGvV`LVgClz1H!h3z6Svs`xKyHbn+59#KSN#_UKZWfB@Pp8GH{%w=_K@qzh1U1QrU!%%^Fs4xlI~ zC|*fd#)}7iRvWIaE&d6xo!hd!rGfi}u1e@6|IIb3AyPX5uo z4t$qgpr8vnAU%?HBiCD{KGC;$<-UnpE<^RvrZFO)zng)zXgjpQ$5T|Gq-XPz6u)}c zN0o-ArltxK`Q0V%mDDjHUQFUFd=~Q%`~&hJs}8-*a94phR1jOqum1kI=1s6ec6sOMa{X-{&dLY4mQq{jsJYoi8?n>{3`Cet$GqY4& zo&;R$_f}pm6P5F&#uWDkBcdpns&UNyZ#WGkgy6vlj2 z+!j4kJ&)k=8D8#FzShYUU!BCe$Cj2tplPo}m5Y0^MI+p{5&!bOOJNx)Au&KBIv>_8 zBRTcjZ5Omx1mhRbf4A0EdYOHUSQq)gvVjz%2YnwaMuDQz8^6TtzP&w@j~@VzRaYXj zBG+b62OzL+0@I@7n1i-{YfV~y#=DbWDW>mK9Tj^FEckkF#>t@&$wDvCLj4N4)w{hk zR%tWli*=>g?DNeF^y~say3k>(b9b(&s)}+gZ(0)Y3|R92px(y1l~duCrcy9e=fE=T zzMb6az$9z|9aSlyw@$}x5&`LOQj`NEGx~(2q$UM}*kGqO3!93gu9=Gz>p#KxG?(l4 zM65bHzsyQSFmZW7szwhfY1_EoEjSy_m%&*`pZig-3*FaAA!g$HGcn6efQQAt#^OO0 zeR}qY2)J2YcC;F+Gta1n8$I)4=jpw)e9JD0dy_4`BwnIt8-V5JA5+_~FJ`;=G83ZZ z_d4L^}_E8h|rbpVW$tg5Q&sy})vN(E}jGcO4s zY@PO-Uupvkr|%2xu$4Y(Jsi6!NgOdC4!|tijt)_iX9cxwx6|#vt6po3mZ%{EXY%)7 zHJk*el(LiSZ`^Y;tq-Cm9g9m}D`bw3kLM?)V?uO1zN`Se9?vXh|MANqyYI>$G706r 
z_*`^(;aj*)!Fb-!pFc&PexVxoClmY>F2sI4Vx>VF=Nre=53=tUBP9P%x~`pbyDVl;k%U@6kO&qDTXu@No9|eCXP5DU%i3f#V zs?E3LLX6unk=7PNXnEAd)RDi`*dne8Rl#%6%-8$2ZE@LhfW@Qzh(NybPiR(NJ~Ozd z@Y1XVFPb8L2@~<5kAbOomgqpu_w=a5Ctt=9Z6+g*+;4@S^LzF@>fB@qi0ssTNGKN z*=-=W`MEi4p5h#UPqZHoFG?$D;bS$Le(g1u z=NtC@v$L})558c-2VDcy7<$i-5Nc^~06X3VYs#h};&6@nlUQg+=}n5A6OuUxtE zadjcpGW$&c5>ierPv5=h8Hvan(fa^Olf?M^a4D@>G-j`~u<)MYAOLSK#-9e$zYZ8} zQ4W}v2+T+BN>CNP*#qc%+}BB7DB=%^1YVLMpX(_CB*i9D5&Hc!WyA^9BJW!B!4p!$ zF4q33?PlLLn763_HyR*<)3@bdFa19>PXnk~XU=Jg6~VY>^2WWN%D6x7MxkvMzkUAs z`JhQ2JKg{94NtRQU)`q2&SJPVtQU2CP+`0_u;Sr#az8#PC3bM4$f%$wF2SeU-3&&Q=uU ze5&3wss4k(bq1m4_ep>%Uz@C_s0foe2QRAT#UoV9uP;9i2J=GSJdacugr1ZdGP|2O z!C?Hg)YMc^K5=MXt)RZ#*5mP)9c+Hlj9eXnw|B~Kw(_1$dr_V2RYUaX9D!+}wjZ^Bz7 zaIKnfs{prv$%i~3(5xOtdg4Fn?xs6?gkFv*qD0~+59$4UF2~U82Fh?dl~rqk*~c@M zr;sbs^c66^YO5BXE+C(c&+P6+Qg{hl+~U`Dkin83;{3n_hK7%9@hwZ?)}7~QRXnkQ z`;}9}pHTbMB~a%HDEXA^NY~Bhc3e#$FT2EdiuZs~rG@~Tzdypka#Z155V8qt?hiLG z_|kxb_de!i1a#2}wTN+uf;>K3=}K-W`;T41bP`4x1_u%hn%dnb5JSB?uW-ZY*Tw>Uq%A+MRF$j5ei>yftB@8|>wC)gVm^_!xR`+ylcfv~MskG1w;T zj=*5qW-qge#mkU`h(Bd5+h(3=UcGn#&80nYfF$h)GR>l@yMqGUruvDaqa#?JqtJif zd>xSH4GOaOm%FpcIH4E;h~;-zYj{UhqGmar!rsECS71CW4f4v2Xa_`|HNTQrN)pSz zy-5)Zz^6pusr`u;1oGCu#RS7PoAk#NPfwm^FQfool5~2P;=nZsz^9@q)C(BJA%lxN z;nZ_1ctpBTfG2Kg)A7Hv$!%yDle_*e>Iv)s%I+E_Kpb$YtIHa91N{W=7X&D*J>URV z>&l{~Y7R5(Pe!pwXo11L)*+fm%23Y>pd{|HlXG7|HDiB9F!g{jo=H2>U8$QzyLVk-P&M2UkF{DcWopC3@W+QBG?;016V~ z$PWy{VAX%?->re*td1eBSr5RNE2~6`UM##-fDA)AyEgoxLtW_vQ*42O1DMUe3sIg> z*1fD^a|>a{u&u;0)4h+?x$tsX{#0Wj4}{;tUESCbmKOmBTgj7L(y4N!A?>>K`lQq_GU`^0J&w`-g@QPv8g<4M(hzWYioZF5OkuhSSL4mMX0VM*1>t z?;g0VRC+_*`;sP}G3Mo9TrR%Ap23VytqB({yPG-Sz(vFWfCy}MoSYO@tV!WMehqMX zIH90Iz0d8W#FByOmEpE0-Ap4|mX?u^_4<-m2C7Q@e;U+IxUeWzxCgt zzLjJ@ee_6oMcr|>D7jmNUTD0UE?H})KVWk6HRG?IB_p0hYWj)NCzTI?rM8xrY)SJn z3XHC(QEobzMDW$I$5Uka{1P>MyDNpKv+u4(+!4HcHyJ3@SW3uKHB*Lfd}T+l%F*?; zzNLL}4M2)r`2@?2Y)N;HrcCsSzjMrWf05+gxg-#*mZUMha<=HBCGRb6x4iiIUB};L zUT<1717QUFc@cPU?UQER2hUGhzj3D&zSDB2U-&zvXH#cpeMUk$X=fqsonVLs@$4fD 
z|04RIY*Tbuf)7d@nqKw{KUium#Vl(3?gntPRskwP%9JQ|oE6PLI%P8VA3{OD1vC5x<&6JFDyac87}GEk6S zR0Ro5Jk|OkQqtUwod|SfaHCnAJnH;b8|POHIK@EYcH%j3&%O)*>)jGN0rL*Ia$jMe zTQG-iW+`9Vq@_uFaN=nwVe|86&&&>1h)|9v5&vp7(IoX-J;i6y!Uq;8K1!Kd^j*0M zFR5&2$zSQbq<-V3L#AJy1&H=_Fk#$DX^9-^)u)l~i{CUqp-_&VIPe7+Vfv4ITKr2) zz6kAuk`umPE!CmsDA~fA*6Ql1t#4j;KuFXd$6Qi}_vPJ)gPS0z{Z8KjB!LQA7MbsphhGTAS{l;~- zT-!>^m13>;B=@>V$wmS>8Y7HbC)^^XQvNX$YKw3t;U*oe+ zQ6|khWg(M8(sRWT@i}!AcNc*&CywIL9Rb61-hHM9_hkx4^1pkt>fAA}`z5?JuTzc( zj-Hjv|Ao+RKs_J=V$v%CxrnR%nq&ElmBV3wc&}R|V_pLCF?`2ot87hL;zKwiAGVhA z-FWn98Yo?XaJXKa5So|^8TAKioPsN z{ok>Iq1ij6Q(eq0K3i4r;JjBg^DlWXzAZ+^Eqhyiz2oBxAb=OU&G)Q#Ak-A?hwN>9 zn(IN^Fn;UsnvmHHJFC^ppI5=0VNr41Uw_k~KVQZs?SdCTi{MFWOJaO)H;|>YT1V-1 zEzjmEP#|jht8V_)=nrDY)R3fxV_r=g_>T7`CD?)xfH%pV7m@tnRXY17!Q6`fryoQ! ztmFoNI+9^Fn=zrN{~T3J9mS*N*348cxCpFilQ!QerUhZ$3tP1#2J$!cUscI=G$6@Z zUu!nIYJEW-nCS2agwj~vVY8+G^awxg&>2cF>&dREkks3mfm5ym9?Zi9mEr}IFAJyh z>kstGG(|9&%cJ=D%Wzwdp4fDJW1K>hO82$leBH=+ohy=Cvc(#oSJvH=kJr;$H}Oh) z1LY1BDeSF7E59uxlGF;ofwkJfclUQ>UfF8)jcpdfNq-~mg5f%=mc{-3_9h&pl)3W? 
z<)k}%o%g;&Ui?->Not6SL6hal3TG$|T>+VWZ;!sjpqGq!?)D*%e=$VDZW-LGDCXB% zaf%-)R@7p;)MU>5Ey&#Wngg&D2WK!*ar)WW*`-Y71v5Z8h1W>HJkrqijFDHTLL05_ z2dbcM+HLO^mJHYF&bT1cuL^iy0IViEZnz%7VXO`~ESA%FkS-O>A6}$7Ad@@;Y}2!Q zf(onz)O^?-PM~ejf8HK|ZWs3Y10vmLsR-c=DNcAm+<1S@*3y7?S56jZH>)vdal+HN zPb^aW({@0<{9P(V>=45zpI@ImB5k}0wWAZ$pNIS??Ix$eAo%hN?W0Zr^A=ft9NR zDLXQHFbWLGqsw$1Bn*6-=e>V+f3PqRXv~m-g?d4dx68_{8&NqHus(S0H##@Nbgb@d2X3vhXdz%nx742)Rn zs4J8o6cI@E&x-k#TC@8>VjjZL#vz}fqXW*$k&%)tA^DMUy5@8xIeq22$n&?BX?<}i zeK&kc)k5@)%{?j5P8LRlw*lCQ>QJTlFf3F1So~3}8HS$xR!^Bu4FiS$@9NQiZI9=WCSM~C zz|qYAyLUdDP|Ds}q7b!k8hbDx*bpsd71{JJ>(Rqt#l8_^;%xzeDmdEomBi^~hpTT1 z!77f|at=OwS7cw~0dAV2i%BUQHa`yf!qG8ByHloH29$wuF2}F!O>M*Cf7i$;Fbt~2 z`f#Cj7>lTUeRM!tg@0tOzW1PQD*n~boS!?eWT42_1>NnhfJ^ct0o)6@|H*L!E*gCo z;>0A!p34-~ewSDstrQyskxKo(~BNJ&%j?d_M!OJDnFRK3)g2?tZ_DvvmP z%I|x;1fijk=+pa@6(@D)w!IpLGvz+8_FoplRWz#!@cAx#+G+N^Gbl2{1T|gI7R+@w z3wf-=f4=x!(u;wz6c4{;d;NsJ{B{PAZj?0`-W>)lufz5%;M(kyRpxK-Z4+>#!kVs; zx_{5+a{bPGs20g$#{oZ8R904}eE>q)${qzKDZc(s5~chI0gyyxz?sRQR4ay$EBHrb zfvl;w>G$qi@4OM98_6SEG*1KNu+}`cI>`=u(gPx9T-Ib7o>_51aeXQ8m5vH&!`@QG z@y=i@Ta(|?b4CvX`I5EDpZg3tFm85Y$ipCnkO$E58p{8eSo_%}@8JN-??p^Dc~ol; zrLoqq8eHqNaVs~}lm6+4OuB3B0=@#g&@Tkw4iRT!m?@X(ALtMo6g2Q6o-T_0p~;Eb z{z-o601vUBma^*VVQ}jGK#?dWr|@fSM+f~bv3+sputt#^wzTe@ao@%@)1Q_UCU*1x zjGIz0he(qKrOjW3UDaNkp#6MfbeRxg`>_5_7xM3TYnd4s_tr$rhmu!@xxtjNR7=}l z-c`|;PplUmr;?F-Gd(Jqgw%V{D2;U=Uv|$hy|_sk{j!=~6?4$HL$b|SvyG6Y+-)QxWZ#MG#yXa?+1If%sAL~x8%y+_*YrHk_xry8!FwFf zPq*W~-Q${TIj{5loS!B1jK}EAR5zE(5_GX&bQ5`Na&&nPwPIs-F^fG!oo9YvYHqY? 
zx`g6h7OUM)a(YsupGl35cKXiZ=3#NL4d0g?>Gn_DEt{agDEz4dWb$64s%Xyx+h*Y}omy zE}GSA7+#j&;&pqt%$ic^k~cScM6<&KJ9Ty&%Lb-(O@1ycEGV}z9_?#l$Jne!q+oCB zrfv;5wIp8fk@YKENNe!9$z`P|73~hDue{0A4FjqabZT3tGPi}njN z!m)D{WZ92%sMNTEDNAc%&eh?w&Vb7k7miP;vH_?;*WwW?_!AtTRGT{Y@>@M zavgKJd@H}VhxtSq>sT3wCF-)cBXos^rP1ozV>9Wt56v29;`o*IE^2n<;&h(8dAUg2 zTC|ityr7g{qaHXnx+0xy*K(Ax`k9KdWw>DOo1$nr_uvW4m7Mo(YzCLC7_PUUTPe4j z@Y-XZ+}^a_P+*szuRlpG^BlJ>+n9emVq6n5M}ZF@ zH(XLxhlFZU9$5&Pp)N5DG(m=g=}s<*zF?q`>~Gu&#;9Y3X09$Gf{7)Pc)jj zcfFne{F4RlXtVB9p+myPn}=4n*aQE-IW2nPhYGhFzZ+{^qT4EnoJ?T)x|n?mM@cmT zNC_{)iXl^=+2gPU4`#Kb6`SRcz2uMl?o1cfmzrF9ukOu^{)Q%=-brj6h}b|pN;f2(1IJI7Dt=$TQ zmW~Qno8oK8Dl1B<44~T>2LP4j3Fk30Xww+PvoU#4T&Jt)eBJ)%I+`RPb1 zY;qpS${5fUxz@U8(~4dF2QBPA~$3Z4CKT_gFO-aqHTkbuKGo2Z`mN!V5?u8k!0 z3Ue8)HvXuOqwmMZhK060-D2Iv#P{UaqO&<-hX>QwtKX>J?ZPzf(0APUD4b*An}HTn zy!=(!KaKU5lIptjm4O>F=QjJna7Ro|%;7V$m&xtss-K7V4W>U9yL8{Q9(biAFno%n zb{xqM;?uP#BjJR%j0qpV7Q=cm+M#5aDHMym#EUv_uvYA)ZR?0!!LR{FawB}Z8rGHl zCD9z5n3K4VOwOFg($%hu7)gKm-9>r*fQvd=02G;+mjz1{DAZG)**Wo7rUXU05_`f zbuu&SS>q{jzWh5=(sKO9Mea@c6)@Xn-A6QQ`Ry&y!em`PP{E`U$apLfJa;0I{&i~$ z4BHAdwgu3Y-6Y)NSRFK}lE~Nju0_UC*>!!KJ+Q4_=Gofn*3&Mr6IRGcLN|ozo*tJ> zR?TS6r~Jd6?EkDRFLt!?DX)JKEIfjF4U9fisH(5hwpI9i2*3r$$0>5zg}&kGSNi-e z9~u>9p?AiSmhP;UEM9L8mQjwgI2^%iYCr z#eRHC`pU))nc$OVMV1|DGa2npp6Np}S>{Z5jVHjKYb9mEhQCN-d-~Ero1v;I8N#vL zpYSBRQz$@O!9;C(gOAj(!=L5EKmGp(aZeb!`?^*A06(n)V~*QR&CNcP)i->4b0?xX zP#uEjFO4xv8LHwr{JI{USZKT%`WRONt9ZDf{rs1>F;#`rr2_;cs{3S$jR&Z{$7|$IBhWslCv}fjRlw z^GNk8^grA`b(8dRrQE`*8nuL_j(VbABoweXC){Zo)qvMVOL30oQe3+o#_jB+S#F;B zMTa;3#=A!uU-*`B|7LsJKxuh=Z%FBa>5qd$NBZnFIs|Vc5CcUNv^@EM3nDu?0vR@I z*eK4PDm@EW>0<~Yl}HzFKa{1%3u6g5+xYt<%<18?7i|za`>|@RWMqNXDA#)=|EJ%k z8azQoIVMZ8dLb*8Jf>QB0AqrOY>t1KU(c%I)u`5p8L6H8oBL$;%RkK%54pgGaxi@-J3D*W0XSJogkxDa*n|+aBuBdY zfyGrZq`UV_G^jn9u5?PpjJPxl;27|4%H9@uSdQE#(t?n5)%AcZ(J^JOn~6Em?I{1j z1dFnTd5_Tq=Gb2FR4F?C`?1`Of&iS<&unKS*C{#X9E-l2vuaM^T--tAkNX_h{`~|+ z55L4eeJw7)`O&IYBj%E}yb#=KO`yv^t~-_!(n&4_6NuhLpih+NzqTfQAr{+cdw5q1 
z=(ab;4Z%8g9NFi03#Id(;M)wxWBEoER@UZRgCQ`VZEci#?(!iS#?= zE5nt-n*4)cIP()2PcCMe4_1lrx?%Si0wb?s*jdQ{XJLM)8l(%rl>m0SZ7B(?2+2n| zmJ!Suvoh5=e>l=ZGeKs+@zDhAAqfvDxgMB~72tPA1O^`Pe(!dhE4|)3!MN&!lBGwb z#4344k(V{t@NhEo^LInN8Wplf0we40Tzmgvn@}^L_eyDxKw;@Jz5Bz=`d;u7dD!mC z?*?78hV4h&yb>Rfl*f4OE;Mwu`*2&W^Dexw|Jdo272H~qU3m0S688n1zJC2ax3GN3 z&bKf}SKUS>4#%@trMqE@a4_>W2cDR>h7YscFP%W+jXcba`k$+$U&KsqlesWuOAn+Y zz@9E1hjlZyFMKkfFIeMYjM*`pV$!9bRMaQcFMPm8d<;{J3FE4YSe^RdnhRpTAoyJy1LLY703Uj*e)$EPe0oR z^`^=+{=AJ_a^{q$2VdN6eoF`}@@J@mKFdV_?MU$ud`TP1DQ2&Ua=e=WsktNM7h56sJ>Cg}F;} zp9lF8`rfM~zqyPp4@uea)IH(M8Qrj~o8@)+iB;a2vp$4^`kryw-5hVPkB*K6X(%?e zJPC?v97LBobuZkxbAaX7yIXW*+ss3@CTF?c6Oe~Q*VuTr<`6<`L9>v6Lo-Z<{i6}# zn1HCmpom8;KzO!@Tf}DHT!pTbp z9EcLro32M1&(#CIls|!_Co)8D>XZ8>WC`Wb^IBkoc%7Gze@;y2AG)V4Q}CWqPj^v^ zZ+DnY_i{;am|bl522t@bfTKCLXN)qNR9YwfSOtzi_Z_ z8UHU!Rf(emr~pIrIyH}>Gkk9|%o7&<<6bq$>yRG5NO&oWHUs;BIrTf$-Kfj^@=r=+ zo7i$CnxacIS1jwIryGkcWGC`;uT<;LxLw-{9<p%WiQqr|5;7T@qk6@ zvs9v4ZU8p3e~8BP7N70sRNdov2II3-n-$SjAes3HT;#WSrVTn#UPD;sw3cp;B0X51 zf(AUmpL`f6+JaZ>j&1;&Z51rqdu3?mm?1|vvaFwh$<)QKyvhRPIN!#SPUD0BfQ)34 zpv`61zYWYv7eK(T#DuHZhiIX@`sJ&&PGxx z;=IUe$a!2N_2(7O*{>TRPs5XMIh@>ybH9jDsLf+EG&Dq~FBE`5!T!n~yPGKOMJYtWniyZl~nAnlfGC&zo0Qa1`JBAD~`-+JCb+FH!8$D71Q;q*xiZ~J6AolG z_ZybNVumKj;$`Dd6KF;U+()8MK^e{)5IfOY;96Qcl@c_kD-ueCio1teBQ^w*a8d1-uW66kaeP&<4wCe z9sVr**$J_vx0Hd-!QTfURvm4(1kZ`ngyY|T^)@UXY8~(3JM^N1@0rmMD!*Zq)c0sF z`vtgK;Sa%56x@Y7b`Vn-*u!MLMfv)UMvN2!S2?z84?-kZndxb^tUlx3t>XfXY!f0T zArU;pdB41ftD4=PX#G_NerxSDN9w>==C;{U50~zXQTh+ooHU@Ub<>eQclX)b=XEMb zG4d_vf2pWa4OK6lIvZm`9bcZrAc!o5N;gl0Z(uBfSBuvC=ZsOEFB>g6Z~1@7OTV zgvncFZuUUlbpic<^mKW{tySHt>a6Nhb_rmjk@3qrJ^P6ERKxYAXQGZLdljUw-wS?# zz_bqp26B9rb!*6uA)IN`$2V*WJy%FKs~^e!XK>3f`^yg_`qGk+VZb^;^@niJGuGvM{J$SRRTKT572Z@RX@1(noa=r5vCHoc{YQ2 z>UZ?FzAAC;=4(X;^2fvz(v%?^DP5nNw~dYMGTLXl1#4^jfICnDw!>QAEWu9O>s|?H zfJ8SEhzIWrZ07M6@&}-@g7gDcK#6;0my+et_@!);1^F6tXV?vy{bJ_=;{u1=l=4;X zloe_(B~o(>T%&S{~ANq$8HoK#$ z{0iWmK}Wi0y|gi%`Si#oRfvo93~$wCJW)FII71 
z>jQ?PWX|cR^pRRWKdL>BKF#B~v~ZN|(rMJsaxl~;n;n?Uy&V-BYX?00A<44N`Tplx z(lt_B8kqzhAp!l%zS3nc)&_S)KZj$v@e1Y|N_~?wUbNc!M{*t5s{9k^f)C-Ebi^KYyCZoHmxz-K3i)JOZ}o; zO!5Vw>`ZO65x=dNS@$)i^?sbgS+I~Gh)A^_?2fT-b8tksLiDfk_{YVGZd15#3B}oTN z893f^UHw&hcJ5P83`9CM{YMyPz0SlvH+HJ-scr=#nIjw+LTzG1O4oeIo$(HP0j+aR zPHutv_)AJ`h_wny=z)lZRmqw7P-+vbCWu&VsIoN+H61!8CiZ0q z<*E;4eeo1kourt!*#nj}@y{}pI_*uGV>I_VS#(()%agZ!QuQnXdM!C`tE;Pj%)pYt zj`2}`=U&Sqgv!+Ycne66)4qc4Q_i`R(MFhplzgk)SH+<5*Z`$HIi|l#i`nhG?#at~ zSP~gb{Xt1Oc=x={2*Y45N%hdai|~SE2@ce1ejVQ0q>|!p+J5Sj9hnmyz+c z<@}=WA+Bfi-P@78LU|PvEL%}APU8))=eOCw69$KU2&*(~zeNSy;Bzr+KS0{`8X^)Z zx9)-kC{F9)$Wcbb_XT|be+}j)k<3+&xU(X)$-#$(dSKp~+=`2TjV0|8_14`}f*6=A zEcYazerA+&=dgdtjydO(@Llr4535JHLCc5Jnjp532LH#SUOu}TbdD3!^+U;`E_1N0 zvS~v(0d8aNn5}ikJH-v-kQx;Fo^~O!Yl}XpwhsQJ z>Gffw0PF7Q#thTxZcXXh?Vm4^&gPKR6o;^tBBj{+*RF<#pcg!_C3sgI69810tM;{_ zKX;wKF-t_MM~0z(4#V?5>-E<(HzL9KX26(-XXKZrqhlVTW5_jD8CPDiOb9?Ryqar~ zweP=T2#zI?ynqH7y?MI>bfKb;hLn6jaZ?00ab;Qi_|a?fLh$IHz9F3FCY3``Av&D_ zUWAaxyovYwF)w(qT19r(PL8nw`>X&vOydU~Xl5WLS31}8*xh7LdFlQUAEx2w%<=TI*LYs|4RACy!oSM-e=V5bo62BMwn zYBBRY9bljyn#_54UrI1DN&kVYKEC2Co*{9{TV&-iQPnMU&q&WIX5!J~u zr+9?`1`gM$?n2ms^>04}Xu1{0ScUJeDh?4L#xBSved|w7OH%@N$UU5U&yPT$py;tb zT&#NoB_ua@b=#Q#hx8oM;r^+?sc5Yk7}vdGvtomqfLRmPt25uow)1&KIV)F7!#`LWNcR`ZOoxiG>akWRd}%AuRtIOAp*FE(2?8lOzO1;Fh^s)h{4k0s!%4tnoqvK%?o-RP!Z zA0lG>dEkzvZ<6d1{WNaKL7cM8B#DUOg@vydf5Yb)+Wn$`(Q9DA=!vMUVoE0+UKIk4 z%Il+1ZNL=UU~MM3U0VJnV2GKA$XeMq~Jlq{haprA8H(6-^uZ58|uCJ zz;5tuutiBtpVvE_o0r!%qA|Mx(hp5G?u8dRoQj8C)^(hGyzXW&@*lRDL!@iI}Xdkq*LF3PkaFC!v&DBu2>cLO%p5VG|>V8K3 zZNvc8E4c=0`Yz-C%?YQQMS>V#H zl6T&pd>kW8HY{cq2fZA{7iLY1-O)53Vb~ws-};Xf@XbH@S~8zTF@}Nit{J*G^M%|E zxj(;M!&}{9d_a<7eEwMHB`2t{-!?{E}>8fdmb&|w`~Qi7jH8uk<^jKk#p;PwL1~|?s9_+vJbkkSD>|c=BfQ4XkR^x z)ba0d%SJl?_!=G1UD_~1J1Cr6U;62gba|qJK%)O=5&$LhHPYjuim=7p#)P7Z%T~lC z^y#bV0}*ua4%$5eE0QG4#Y6&?Bva`hx9llsC$0NLy+ec&KiWN(x-vjOd z?FD_%lIy%RRqZ&<n=1lY&i-2mT;^j?D zWco;6h1~Xv9`oYz$O14pZTfaUhl!)3Z8{_$XV#OYJ94%&LK3kWa=m*KjdnQ8-4!o( 
zx-+lXYcb2+_?}kXuU7!W_G-T9%`1s%Y;=?3@C0du`!zkU?>)nFA!{gb4dWdjJO|Dl zyA_h%eMOhiv5MHfUNBl}-X7j!BtOX3yDItvZP%g8DMTh8`3#YP z{u%+CUP)0A5y#h3NhcIDF>#=svpzsSxVJS8sO;h;W}BzhdKkJN)ziqeK$gezshx(SX2vdaR!f8xm94bi4ZHCn-4=8@~az4n= zshi}EkyVesm!)S?a^rqViG{l~c_Qy*nNjN6>F-3Z+QAQ>wkoH7Y1xT{ufsz&4op9B zE#;Ma?@ardXPd%qUGZm2t?WiJ^}WHJJ0+F={^~?*fm$ z>T?u>7pRTXz(3*)?Zo)FYF{4O3J^TuJCjw{8l>E!^N8JE)s^qcIyvrwLW0kQ;kun5DajY}B)i$?PRI^Cya;f85- zm4)&6=1QD#pTEYqNHa#!ZxYXwgL3c5dP((tRLQXiZKkbmy;k$FVdxBSK`p7m9$qU` zN7mv>1QR3Gc7+1cMWcJ~OW<)HrWD0LXE9oXUr@q_R|1Pp-0a=+9dzeU*f_qj4z6F% z{Amip&y3zcYfgA`on5<3Gp(&17e^r^ba1&Nj8Biz8fc9HWRl43%+`uRJaWRf@sQh9 zzK~_0Xmbt2Q;AyyjI;6~G&5`LM=X(bMa5gPkAS;Ww&|OeV}l>g3(03G+Wp#vkwa|g zUna+QiL8r3>npsfxaT;)NwIvN&v-JOcEFSw$D>ksmAtN0ga0z|(P33_DYhn0v$C{+ zb1`1j}GE3k9W)6u7B^pd0 ze&b7yVX@|GCxN!%xJpk%cDDazoNJq;b#{0H9^3H`vuDmFDTiTCKe&cHg31?>E!wNfxsLJxO2=V8hk@uZTUVIcZ*5e>5T(!_e8ZKTjiP>q0!d$NG;{L zSZ|C7Hf$Py_Sj&_IsjndK*QUTvu>U5Wr(5CrKU^C{hpbSj23M?$VkHuW6VT$Wv_cD z)*V1x7Wd~J*@ysQ%LxP{g&Lu@B_$>ElU1PN!SM;TLg!(Zk%6eCGO}L#X2kW$jw~{? 
z(edpIxZ82C6=sCfFPN-!Qzk;a{<8;qH#fQmc)`yNxGz=1A9YbV`t~> z$VXqStbozTTmKj_hH}8K$H6BJWoye{Cz%##+D{xS7~RsW2hdp%3BfUeY_cV&Wle}u z$!Wj^nu=V{g_*)&F_jlRV;aR8?oU5LJK*HRm{bBqjfO)RzmmKy=Do`vNd|HX?!q8l`LPzivQ@FM`pd%7#Up@eA) zzZ7nhP=t{v>7PSywx9ND&vc~+H7*ge_@q{GOXH189~myc)s!b2Jde=MymR=}6?sH` zNJv}7UUl^enfsg16x4gp7-rzso!)jJ~Pqmt_sKGl&E3O*G>n(3VZ+uy(6(^K-AAFkkw*A19*tl6-ptafB|NRd^A zeT|x{JeP-SGxI$`948Qwa!#^Kw7JHqxKh0>bT?l#nt%EzuJ`+Fo19Iy!O%Kf^A@^5 zAk*s!RxN(C?m5#VpFnT4n)1w=4SOTZnhE<=gvoMnT%9#}YUNhz)HP_tV&XgV6*Grz zPsFo(W4K-1e= zKvByn1c!oHY=BXsM(s8C$|Tflht?001w}aqg&%40(7$PUB31{K*EN%{%<~mqk+@#G zExK&%=ZC!1g&An%3!tl~;Rwe1Nawmz07q*6{DN6)uT-v%D*%Q~(Cpes5jNsmc%)X+ zJP;1nlS0vsv{ifj%~I4!A{H}f1$Jd8LPvewpIWOEHf zJ?ULgl88>`ZIu#L$^346e*vm}8M3b8*`t=fG%HdH_Qj>m(_%5Fr_0(o>rWWln+vfa zj9TyJ3FnysGRO*XYKz@k?NK?048w#|Ze>RfJkj6oIsoc5_NrzU!z1fiMJVAttws** zA~!Qm%fz&%>eZ>dJY^pE15WZz5g-~*tMeKU3gGigs#d%;_I!i7eP!S}3k@v|b!AGE z6i>`JK3H`@F$$Ow*ZF9x}@Vj7+Zf6jda0c@Y!^rI4C|U0Qhdjqgo}1 z%)jUuPBb9?&Vfyt1eo0GJg$9aVR?&b>(*y%lcFTMS3b_Vp}y%BTJs}2W$$;lnH*l` ze8p!bd}!M~x;NIEGZtt7z2?GNmH>c_^)Iq+z*5w5d#2nHx*fBDRg=1ZM7+kjF&LoC z&$Qdn+y6`n5)@L9Y-R7AkgaPyMF*({$jPU1#yV7k2bi&S%UkonS^%|w@?3EPWU3ix z|8^PQdGzcqb|y#r10i2{HAh4~$xv+2=MH3(MMcGCy)gPT3H7!tM8p8UlQg=lZi(@@ z%V<4wY)T39fmo4y;=n3|Q-%^{a-F!!Tp|3XLokBb9ezDd!3PYcFq6d-}ky($% z|4nDksn$Uv#|MPJi)t$gt-_TP4+@b3-y2FIi_wKfAC0Hx(?0)ozLDN5%3L5uDq_v>YKhRpxdFd?UCB*W=1(v|k zS%@|^FIqplRRnK5PS0vT#(F`urz6V~fxptg{xgH41~t(0QqJcR%Rb`WL|k`IeO9YD z9>~nfN0Xz~ef^3nejvuM&{@=b%oOwX*s_@HfR?Mp ztrMp2@Z*_}uKR@8MkZP(+v;apr_XE+cwQ1w(3+rWX+@@6R{8#@t;F37OPgbF|i=l?O6qRQqT=qu0Y{ytNjRq(69SEbk9BU}~l3 z6rRCXOPX;)YRfe9ReiDjfKWJD5{6&qf?i57YO6OZb+AEsRGTFU)*)yl#n*j7`nKYp zZ$|mLnIC3u8MMA4+TUWNpRwpl%16Z`v7aXq-|X@QY4E;kd^!YJnH2;eE;HVz^}*nf8Di+5LV(P4KC0>2z+`{sd0)n=2u(XVsr+a3 zAKi?~t?S~!(}lW|&JaWv(a|;Wfh^9QPsKiV$S2Yl?qy`LeC$6cay?e13YLOJ4}RSv z-a;G?_96`dO{DdT4^;hm5hcd8X}5Z031bPfP#0JPVQqWJcp|C@Y#d>5x4cP#_PFXoKe`zb{-)q9AuN~rX3a8GY^27II+5<@j0rnXW~Ni zm6Fr&75^r->LDBpfKk$C}d4Z5(*AUl}W(X)wDp3d71zFolSK zl01|L@>>i$_3Y%z&rU1TcN-X0(6 
z)Og0OtXnZGsLq8X2W?T4gZ6{kv>EB2XfSk#!DiH8Odz@tiDU+h2jSaBCY-xrN@jZqm=aDVuH(?K82U#_Im-1MUM+r1H@jn>j$8 zU0}_DtS|YHIhZ6Q3M#XoJMp(@cMW|8+H<$0O5jBA5Z&}N6nYoR_IP7BHC{Cm)Z!2h zd;sPkdS|!6nR4eJ<49huoskNtyjdl>ICRIb9|ZZO3;jt;>U?K(c-+?#6tLx!F19p{`SK3}k7Da%L ztv?XVlYA6zj7Vziap*b3p@~^$5XspHyPIWp^OBF~U7LWQ{UPQ*AH}o0kN1tQ_k_Sw zYBH_>SkJ<(sTc<_7>+d4(0Z3>>`8&x!GZe7xrRjcVG#n9V~X>u)52eoxLu4S^~OJa z4|cxr`+!9hXmrNh2w^ABFAo&k3?XT||NhKNrd9on9Px;#mCr~I>_i7fp?;L2;i`wS z{Qah?WsW3GCM=ZNQjHEHxve>2aaFHj=lZ{?y?-Mj0!-iwIdV+Hp07}&lLcJi#5~MK z!R_XMGHYoXIFK=rRaI(xg-C`hTB1k;3?*KAC%iYMU_!FIap@m5n4 zAQ86gkh``=#8kaNPpaSWP`frCemRd-taia-$*2O5sX8GtvTc_q5d+s+`%(}Q_A#^ zQh3MpO5&rSavHSMrpQ*t!we05^53j*5EH_a*S0b@KxDkZcv|SQ;FCVYb!I-5Dknnj zdFj6g84c{Y0r=k`1spLmWO)aXX60KvK-59QvPzluHhi38~I zzPv8GE?_T2qt`>z%n!bLhZy>?lLqi@BYbazOLfYHuFV8P?25PV~Jps66}r zzPX>P70<)cnoy8*CleKq_((<|FVWn`RHwz&Zo<#2 za62~q+rZI)$sTG~B(lo}T!S3R-vpf!-%ZCIdngR1xY!#>d?>{&;FTM5Bfb1mp0O zgnfeS2JTT?yPv&9DODh3)v;Tl)Guc~^4Lu`@OHRTCsDiAm#lXW+Q;)qJUim(;=e+TS*PJk9`HcW-7~H&SBroFxkcZb z5)7Ji4e)DXGlU1vr%#17*1qTkP((Ut*YJ9m;0$F@pF@BBUvT*9;7POpy8}SJH40^x zDo94pS2VD&u~LZh3F0yawn17Lfl4s)ZUVuNC7?-=GQ)vHSpW;6Q!`(7=3T0$zjlC& zr>AEIKN;b(!ippUc943AfO`SDAdY3~HG%U&vl~`F$UkXln66MCO8Cj&pIYLOe@RR2 zFp!@(?o$6Y^5d?**93mrcZzx!!cX!asErl!pa1)j|2>!gy@`+~;lEGM|9@rU5$DpM YEBwa}7k=G6i(ErBWt39h^}CP%56c{m;{X5v literal 0 HcmV?d00001