Update text norm docs (#2137)
* move do_training flag to config
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* finished cardinal tagger
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* cardinal working
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix cardinal negative
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* decimal
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* ordinals
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* added measure
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* added money
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* date
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* added time
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* finished draft
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fixed tests
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* started fixing time, measure
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fixed suppletive for measure
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fixed ambiguity between .2 and regular punctuation
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* support for optional comma separating every three digits for cardinal, all other classes automatically uses this too
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* added quantity
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* adding whitelist
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fixed some date formats
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fixed some date formats
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* added more date formats (#2090)
* added more date formats
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* update
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* configure input capitalization (#2087)
* adding configuration to distinguish between lower_cased and cased input, affects whitelist, word classes
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix style
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* adding option to set input_case to args
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix jenkins
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* generalized export to both itn and tn (#2093)
* generalized export to both itn and tn
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix jenkins
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* adding to jenkins
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix date
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* keep line break
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* added test for word
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* added telephone (#2098)
* added telephone
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* adding commas as separator between phone number
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fixing punctuation and simplifying SH export
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix lgtm
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* updated docstring
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* updated docstring
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* refactored itn to have class for denormalizer
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* Itn punctuation (#2106)
* fixing punctuation and simplifying SH export
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix lgtm
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* changed variable names
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* updated docs
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* save tutorial
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* deleted sentence boundary for ITN
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* style
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix citation
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* sign off
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* cleaned first 6200 lines of whitelist
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* WFST Normalization support for emails/electronic (#2092)
* wip
  Signed-off-by: ekmb <ebakhturina@nvidia.com>
* added email tagging and verbalization
  Signed-off-by: ekmb <ebakhturina@nvidia.com>
* test added
  Signed-off-by: ekmb <ebakhturina@nvidia.com>
* update test
  Signed-off-by: ekmb <ebakhturina@nvidia.com>
* update electronic to handle digits and common domains
  Signed-off-by: ekmb <ebakhturina@nvidia.com>
* docstring fix
  Signed-off-by: ekmb <ebakhturina@nvidia.com>
* fix lgtm
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* adding doc string, adding more info to docs
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* deleted token parser from ITN
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* reuse token parser from tn
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* deleting whitelist items
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix for wfst normalization (#2134)
  Signed-off-by: ekmb <ebakhturina@nvidia.com>
* add whitelist
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* update_text_processing_docs
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* changing tn to cased by default
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* change to cased
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>
* fix jenkins
  Signed-off-by: Yang Zhang <yangzhang@nvidia.com>

Co-authored-by: Evelina <10428420+ekmb@users.noreply.github.com>
parent d017cfcd16
commit c5426c871d
Jenkinsfile (vendored, 2 changes)
@@ -153,7 +153,7 @@ pipeline {
     stage('L2: TN') {
       steps {
         sh 'cd tools/text_processing_deployment && python pynini_export.py --output=/home/TestData/nlp/text_norm/output/ --grammars=tn_grammars && ls -R /home/TestData/nlp/text_norm/output/ && echo ".far files created "|| exit 1'
-        sh 'cd nemo_text_processing/text_normalization/ && python run_predict.py --input=/home/TestData/nlp/text_norm/ci/test.txt --output=/home/TestData/nlp/text_norm/output/test.pynini.txt --verbose'
+        sh 'cd nemo_text_processing/text_normalization/ && python run_predict.py --input=/home/TestData/nlp/text_norm/ci/test.txt --input_case="lower_cased" --output=/home/TestData/nlp/text_norm/output/test.pynini.txt --verbose'
         sh 'cmp --silent /home/TestData/nlp/text_norm/output/test.pynini.txt /home/TestData/nlp/text_norm/ci/test_goal_py.txt || exit 1'
         sh 'rm -rf /home/TestData/nlp/text_norm/output/*'
       }
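The `cmp --silent ... || exit 1` step in the hunk above fails the stage when the produced file differs byte-for-byte from the golden file. A minimal Python sketch of the same check follows, assuming the standard-library `filecmp` module; the helper name `outputs_match` and the temporary files are hypothetical and are not part of the Jenkinsfile or of NeMo.

```python
import filecmp
import tempfile
from pathlib import Path


def outputs_match(produced: Path, expected: Path) -> bool:
    """Return True when both files hold identical bytes (similar to `cmp --silent`)."""
    # shallow=False forces a content comparison instead of trusting os.stat() metadata.
    return filecmp.cmp(produced, expected, shallow=False)


# Demo on throwaway files; the file names only mirror the ones used in the CI step.
with tempfile.TemporaryDirectory() as tmp:
    produced = Path(tmp) / "test.pynini.txt"
    expected = Path(tmp) / "test_goal_py.txt"

    produced.write_text("ten o'clock\n", encoding="utf-8")
    expected.write_text("ten o'clock\n", encoding="utf-8")
    same = outputs_match(produced, expected)

    expected.write_text("10:00\n", encoding="utf-8")
    filecmp.clear_cache()  # drop cached results before comparing the rewritten file
    different = outputs_match(produced, expected)
```

A CI wrapper could call `sys.exit(1)` when `outputs_match` returns `False`, which plays the role of `|| exit 1` in the shell step.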
@@ -56,6 +56,8 @@ Example prediction run:


 The input is expected to be lower-cased. Punctuation are outputted with separating spaces after semiotic tokens, e.g. `"i see, it is ten o'clock..."` -> `"I see, it is 10:00 . . ."`.
+Inner-sentence white-space characters in the input are not maintained.
+


 Data Cleaning for Evaluation
@@ -56,7 +56,8 @@ Example prediction run:

 python run_prediction.py <--input INPUT_TEXT_FILE> <--output OUTPUT_PATH> [--input_case INPUT_CASE]

-``INPUT_CASE`` specifies whether the input is lower-cased or cased. Punctuation are outputted with separating spaces after semiotic tokens, e.g. `"I see, it is 10:00..."` -> `"I see, it is ten o'clock . . ."`.
+``INPUT_CASE`` specifies whether to treat the input as lower-cased or case sensitive. By default treat the input as cased since this is more informative, especially for abbreviations. Punctuation are outputted with separating spaces after semiotic tokens, e.g. `"I see, it is 10:00..."` -> `"I see, it is ten o'clock . . ."`.
+Inner-sentence white-space characters in the input are not maintained.


 Evaluation
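The added documentation line says cased input is "more informative, especially for abbreviations". The toy sketch below illustrates why that can be so. It is not the WFST grammar used by `nemo_text_processing`: the whitelist entries and the `expand` helper are hypothetical, and only the two `input_case` values come from the diff.

```python
# Hypothetical toy whitelist; these entries are NOT taken from nemo_text_processing.
TOY_WHITELIST = {"US": "United States", "Dr.": "doctor"}


def expand(tokens, input_case="cased"):
    """Replace whitelisted tokens, matching either exactly or on lower-cased keys."""
    if input_case not in ("lower_cased", "cased"):
        raise ValueError(f"unsupported input_case: {input_case!r}")
    if input_case == "lower_cased":
        # Lower-cased input can only be matched against lower-cased keys.
        table = {key.lower(): value for key, value in TOY_WHITELIST.items()}
    else:
        table = TOY_WHITELIST
    return [table.get(token, token) for token in tokens]


cased = expand("Tell us about the US".split(), input_case="cased")
lower = expand("tell us about the us".split(), input_case="lower_cased")

print(cased)  # the pronoun "us" is left alone, only "US" is expanded
print(lower)  # both occurrences of "us" are expanded, because case no longer separates them
```

In this toy setup the lower-cased run cannot tell the pronoun from the abbreviation, which is the kind of ambiguity a cased default avoids.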
@@ -5,4 +5,4 @@ Introduction
 ------------

 NeMo's `nemo_text_processing` is a Python package that is installed with the `nemo_toolkit`.
-See `documentation <https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/>`_ for details.
+See [documentation](https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/) for details.
File diff suppressed because it is too large
@@ -212,7 +212,7 @@ def parse_args():
     parser = ArgumentParser()
     parser.add_argument("input_string", help="input string", type=str)
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     parser.add_argument("--verbose", help="print info for debugging", action='store_true')
     return parser.parse_args()
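The effect of the new default can be checked in isolation. The sketch below rebuilds only the three arguments visible in the hunk above with the standard-library `argparse`; the `build_parser` wrapper is a hypothetical name introduced so the parser can be exercised with an explicit argument list instead of `sys.argv`.

```python
from argparse import ArgumentParser


def build_parser() -> ArgumentParser:
    """Rebuild the arguments shown in the hunk, using the post-commit default."""
    parser = ArgumentParser()
    parser.add_argument("input_string", help="input string", type=str)
    parser.add_argument(
        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
    )
    parser.add_argument("--verbose", help="print info for debugging", action="store_true")
    return parser


# Without the flag, the input is now treated as cased.
args = build_parser().parse_args(["We paid $123 for this desk."])

# Callers that relied on the old behaviour have to opt in explicitly.
legacy_args = build_parser().parse_args(["we paid $123 for this desk.", "--input_case", "lower_cased"])
```

Because `choices` is set, any other value for `--input_case` makes `argparse` print a usage error and exit, so a misspelled case name does not silently fall back to the default.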
@@ -35,7 +35,7 @@ def parse_args():
     parser = ArgumentParser()
     parser.add_argument("--input", help="input file path", type=str)
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     parser.add_argument(
         "--cat",
@@ -58,7 +58,7 @@ def parse_args():
     parser.add_argument("--input", help="input file path", required=True, type=str)
     parser.add_argument("--output", help="output file path", required=True, type=str)
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     parser.add_argument("--verbose", help="print meta info for debugging", action='store_true')
     return parser.parse_args()
@@ -21,7 +21,7 @@
 # echo "two dollars fifty" | ../../src/bin/normalizer_main --config=sparrowhawk_configuration.ascii_proto

 GRAMMARS=${1:-"itn_grammars"} # tn_grammars
-INPUT_CASE=${2:-"lower_cased"}
+INPUT_CASE=${2:-"cased"}
 python pynini_export.py --output_dir=. --grammars=${GRAMMARS} --input_case=${INPUT_CASE}
 find . -name "Makefile" -type f -delete
 bash docker/build.sh
@@ -85,7 +85,7 @@ def parse_args():
         "--grammars", help="grammars to be exported", choices=["tn_grammars", "itn_grammars"], type=str, required=True
     )
     parser.add_argument(
-        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="lower_cased", type=str
+        "--input_case", help="input capitalization", choices=["lower_cased", "cased"], default="cased", type=str
     )
     return parser.parse_args()

@@ -158,8 +158,8 @@
    "source": [
     "from nemo_text_processing.text_normalization.normalize import Normalizer\n",
     "# creates normalizer object that works on lower cased input\n",
-    "normalizer = Normalizer(input_case='lower_cased')\n",
-    "raw_text = \"we paid $123 for this desk.\"\n",
+    "normalizer = Normalizer(input_case='cased')\n",
+    "raw_text = \"We paid $123 for this desk.\"\n",
     "normalizer.normalize(raw_text, verbose=False)"
    ],
    "execution_count": null,