Update text norm docs (#2137)

* move do_training flag to config

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* finished cardinal tagger

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* cardinal working

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix cardinal negative

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* decimal

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* ordinals

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* added measure

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* added money

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* date

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* added time

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* finished draft

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fixed tests

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* started fixing time, measure

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fixed suppletive for measure

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fixed ambiguity between .2 and regular punctuation

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* support for optional comma separating every three digits for cardinal, all other classes automatically uses this too

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* added quantity

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* adding whitelist

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fixed some date formats

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fixed some date formats

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* added more date formats (#2090)

* added more date formats

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* update

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* configure input capitalization (#2087)

* adding configuration to distinguish between lower_cased and cased input, affects whitelist, word classes

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix style

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* adding option to set input_case to args

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix jenkins

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* generalized export to both itn and tn (#2093)

* generalized export to both itn and tn

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix jenkins

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* adding to jenkins

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix date

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* keep line break

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* added test for word

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* added telephone (#2098)

* added telephone

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* adding commas as separator between phone number

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fixing punctuation and simplifying SH export

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix lgtm

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* updated docstring

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* updated docstring

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* refactored itn to have class for denormalizer

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* Itn punctuation (#2106)

* fixing punctuation and simplifying SH export

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix lgtm

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* changed variable names

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* updated docs

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* save tutorial

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* deleted sentence boundary for ITN

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* style

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix citation

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* sign off

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* cleaned first 6200 lines of whitelist

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* WFST Normalization support for emails/electronic (#2092)

* wip

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* added email tagging and verbalization

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* test added

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update test

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* update electronic to handle digits and common domains

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* docstring fix

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* fix lgtm

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* adding doc string, adding more info to docs

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* deleted token parser from ITN

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* reuse token parser from tn

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* deleting whitelist items

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix for wfst normalization (#2134)

Signed-off-by: ekmb <ebakhturina@nvidia.com>

* add whitelist

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* update_text_processing_docs

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* changing tn to cased by default

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* change to cased

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

* fix jenkins

Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
Yang Zhang 2021-04-29 14:59:25 -07:00 committed by GitHub
parent d017cfcd16
commit c5426c871d
GPG key ID: 4AEE18F83AFDEB23
11 changed files with 15 additions and 878 deletions

Jenkinsfile

@@ -153,7 +153,7 @@ pipeline {
   stage('L2: TN') {
     steps {
       sh 'cd tools/text_processing_deployment && python pynini_export.py --output=/home/TestData/nlp/text_norm/output/ --grammars=tn_grammars && ls -R /home/TestData/nlp/text_norm/output/ && echo ".far files created "|| exit 1'
-      sh 'cd nemo_text_processing/text_normalization/ && python run_predict.py --input=/home/TestData/nlp/text_norm/ci/test.txt --output=/home/TestData/nlp/text_norm/output/test.pynini.txt --verbose'
+      sh 'cd nemo_text_processing/text_normalization/ && python run_predict.py --input=/home/TestData/nlp/text_norm/ci/test.txt --input_case="lower_cased" --output=/home/TestData/nlp/text_norm/output/test.pynini.txt --verbose'
       sh 'cmp --silent /home/TestData/nlp/text_norm/output/test.pynini.txt /home/TestData/nlp/text_norm/ci/test_goal_py.txt || exit 1'
       sh 'rm -rf /home/TestData/nlp/text_norm/output/*'
     }


@@ -56,6 +56,8 @@ Example prediction run:
The input is expected to be lower-cased. Punctuation are outputted with separating spaces after semiotic tokens, e.g. `"i see, it is ten o'clock..."` -> `"I see, it is 10:00 . . ."`.
Inner-sentence white-space characters in the input are not maintained.
Data Cleaning for Evaluation


@@ -56,7 +56,8 @@ Example prediction run:
python run_prediction.py <--input INPUT_TEXT_FILE> <--output OUTPUT_PATH> [--input_case INPUT_CASE]
``INPUT_CASE`` specifies whether the input is lower-cased or cased. Punctuation are outputted with separating spaces after semiotic tokens, e.g. `"I see, it is 10:00..."` -> `"I see, it is ten o'clock . . ."`.
``INPUT_CASE`` specifies whether to treat the input as lower-cased or case sensitive. By default treat the input as cased since this is more informative, especially for abbreviations. Punctuation are outputted with separating spaces after semiotic tokens, e.g. `"I see, it is 10:00..."` -> `"I see, it is ten o'clock . . ."`.
Inner-sentence white-space characters in the input are not maintained.
Evaluation
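
The documentation change above argues that cased input is more informative for abbreviations. A toy sketch of why, assuming a plain dictionary lookup: the `WHITELIST` entries and the `expand` helper are hypothetical illustrations and are not taken from NeMo, whose grammars are WFST-based rather than dictionary-based.

```python
# Hypothetical whitelist for illustration only; not NeMo's data.
WHITELIST = {"Dr.": "doctor", "IT": "I T"}


def expand(text: str, input_case: str = "cased") -> str:
    """Replace whitelisted tokens, matching keys according to input_case."""
    if input_case not in ("lower_cased", "cased"):
        raise ValueError(f"unsupported input_case: {input_case}")
    if input_case == "cased":
        table = WHITELIST
    else:
        # Lower-cased input forces lower-cased keys, which discards the
        # capitalization that separates "IT" from the pronoun "it".
        table = {key.lower(): value for key, value in WHITELIST.items()}
    return " ".join(table.get(token, token) for token in text.split())


cased_result = expand("IT broke it", input_case="cased")
lower_result = expand("it broke it", input_case="lower_cased")
```

In cased mode only the capitalized abbreviation matches, while in lower-cased mode the pronoun is expanded as well, which is the ambiguity the new default avoids.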


@@ -5,4 +5,4 @@ Introduction
 ------------
 NeMo's `nemo_text_processing` is a Python package that is installed with the `nemo_toolkit`.
-See `documentation <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/>`_ for details.
+See [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/) for details.

File diff suppressed because it is too large


@@ -212,7 +212,7 @@ def parse_args():
     parser = ArgumentParser()
     parser.add_argument("input_string", help="input string", type=str)
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     parser.add_argument("--verbose", help="print info for debugging", action='store_true')
     return parser.parse_args()
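
The effect of this default change on callers can be sketched with a standalone parser. This is a minimal illustration, not the project's code: `build_parser` is a hypothetical helper that mirrors the `--input_case` argument from the hunk above, with the default passed in so both versions can be compared.

```python
from argparse import ArgumentParser


def build_parser(default_case: str) -> ArgumentParser:
    """Hypothetical helper: same --input_case argument, configurable default."""
    parser = ArgumentParser()
    parser.add_argument("input_string", help="input string", type=str)
    parser.add_argument(
        "--input_case",
        help="input capitalization",
        choices=["lower_cased", "cased"],
        default=default_case,
        type=str,
    )
    return parser


# Before the commit: omitting the flag selects lower-cased handling.
old_args = build_parser("lower_cased").parse_args(["We paid $123 for this desk."])

# After the commit: omitting the flag selects cased handling.
new_args = build_parser("cased").parse_args(["We paid $123 for this desk."])

# Callers that rely on the old behavior must now pass the flag explicitly.
explicit = build_parser("cased").parse_args(
    ["we paid $123 for this desk.", "--input_case", "lower_cased"]
)
```

Passing the flag explicitly is what the Jenkinsfile change in this commit does for the TN test, which runs on lower-cased input.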


@@ -35,7 +35,7 @@ def parse_args():
     parser = ArgumentParser()
     parser.add_argument("--input", help="input file path", type=str)
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     parser.add_argument(
         "--cat",


@@ -58,7 +58,7 @@ def parse_args():
     parser.add_argument("--input", help="input file path", required=True, type=str)
     parser.add_argument("--output", help="output file path", required=True, type=str)
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     parser.add_argument("--verbose", help="print meta info for debugging", action='store_true')
     return parser.parse_args()


@@ -21,7 +21,7 @@
 # echo "two dollars fifty" | ../../src/bin/normalizer_main --config=sparrowhawk_configuration.ascii_proto
 GRAMMARS=${1:-"itn_grammars"} # tn_grammars
-INPUT_CASE=${2:-"lower_cased"}
+INPUT_CASE=${2:-"cased"}
 python pynini_export.py --output_dir=. --grammars=${GRAMMARS} --input_case=${INPUT_CASE}
 find . -name "Makefile" -type f -delete
 bash docker/build.sh


@@ -85,7 +85,7 @@ def parse_args():
         "--grammars", help="grammars to be exported", choices=["tn_grammars", "itn_grammars"], type=str, required=True
     )
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     return parser.parse_args()


@@ -158,8 +158,8 @@
 "source": [
   "from nemo_text_processing.text_normalization.normalize import Normalizer\n",
   "# creates normalizer object that works on lower cased input\n",
-  "normalizer = Normalizer(input_case='lower_cased')\n",
-  "raw_text = \"we paid $123 for this desk.\"\n",
+  "normalizer = Normalizer(input_case='cased')\n",
+  "raw_text = \"We paid $123 for this desk.\"\n",
   "normalizer.normalize(raw_text, verbose=False)"
 ],
 "execution_count": null,