[BERT/PyT][BERT/TF] Switch back to the original server for data download
* update - wiki download
parent 2b0daf392a
commit 04988752a8
@@ -280,7 +280,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or to repeatedly retry our script by running the following until the required number of files have been downloaded:
 `/workspace/bert/data/create_datasets_from_start.sh wiki_books`

-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download completes. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded and extraction fails. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.

 6. Start pretraining.
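Not part of the diff: the recovery step in the new note can be scripted. Below is a minimal, illustrative Python sketch, assuming only that `wikicorpus_en.xml.bz2` sits in a download directory supplied by the caller (the exact path is not shown in this hunk).

```python
# Illustrative sketch of the note above, not code from this repository.
import os
import sys

def remove_partial_dump(download_dir):
    """Delete a partially downloaded Wikipedia dump so the data script
    re-downloads it instead of treating it as complete."""
    dump = os.path.join(download_dir, 'wikicorpus_en.xml.bz2')
    if os.path.exists(dump):
        os.remove(dump)
        print(f'Removed partial dump {dump}; re-run the data download script.')
    else:
        print(f'No partial dump found at {dump}.')

if __name__ == '__main__':
    # Argument: the directory the download script writes into (hypothetical usage).
    remove_partial_dump(sys.argv[1])
```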
@@ -28,8 +28,8 @@ class WikiDownloader:
         self.language = language
         # Use a mirror from https://dumps.wikimedia.org/mirrors.html if the below links do not work
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }

         self.output_files = {
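The downloader's fetch logic is not part of this hunk. As a hedged illustration of how the `download_urls`/`output_files` mappings might be consumed, assuming the class also holds a `save_path` (declared elsewhere in the file, not shown here):

```python
# Illustrative helper, not code from this repository.
import os
import urllib.request

def download_dump(language, download_urls, output_files, save_path):
    """Fetch the Wikipedia dump for `language` into `save_path`."""
    url = download_urls[language]
    target = os.path.join(save_path, output_files[language])
    if os.path.exists(target):
        # Mirrors the README note: a partially downloaded file would be
        # mistaken for a finished one, so remove it before retrying.
        print(f'{target} already exists; remove it first if it is incomplete.')
        return target
    os.makedirs(save_path, exist_ok=True)
    urllib.request.urlretrieve(url, target)
    return target
```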
@@ -269,7 +269,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or to repeatedly retry our script by running the following until the required number of files have been downloaded:
 `bash scripts/data_download.sh wiki_books`

-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download completes. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded and extraction fails. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.

 4. Download the pretrained models from NGC.
@@ -26,8 +26,8 @@ class WikiDownloader:

         self.language = language
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }

         self.output_files = {
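The PyTorch variant keeps a comment pointing at https://dumps.wikimedia.org/mirrors.html as a fallback when the primary server fails. A hedged sketch of swapping in a mirror host without editing the file, using the mirror this commit removes (`dumps.wikimedia.your.org`) purely as an example:

```python
# Illustrative only, not code from this repository.
PRIMARY = 'https://dumps.wikimedia.org'
MIRROR = 'https://dumps.wikimedia.your.org'  # any host from https://dumps.wikimedia.org/mirrors.html

def with_mirror(url, mirror=MIRROR):
    """Replace the primary dump host with a mirror, keeping the dump path intact."""
    return url.replace(PRIMARY, mirror, 1)

print(with_mirror('https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'))
# https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```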