[BERT/PyT][BERT/TF] Switch back to the original server for data download

* update - wiki download
Sharath TS 2021-02-25 14:13:53 -08:00 committed by GitHub
parent 2b0daf392a
commit 04988752a8
4 changed files with 6 additions and 6 deletions


@@ -280,7 +280,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus
 Users are welcome to download BookCorpus from other sources to match our accuracy, or repeatedly try our script until the required number of files are downloaded by running the following:
 `/workspace/bert/data/create_datasets_from_start.sh wiki_books`
-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download completes. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file exists, the script assumes the download succeeded, which causes the extraction to fail. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
 6. Start pretraining.
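The note added above says a leftover partial `wikicorpus_en.xml.bz2` makes the script treat the download as finished and then fail during extraction. Below is a minimal sketch of one way to spot such a partial file before rerunning `create_datasets_from_start.sh wiki_books`; the local file name and the Content-Length comparison are assumptions for illustration, not part of the repository's scripts.

```python
# Minimal sketch (not part of the repository): detect a partial
# wikicorpus_en.xml.bz2 before rerunning the download step.
# The local path and the Content-Length check are illustrative assumptions.
import os
import urllib.request

DUMP_URL = 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'
LOCAL_FILE = 'wikicorpus_en.xml.bz2'  # assumed download location

def download_is_complete(url: str, path: str) -> bool:
    """Compare the local file size with the size reported by the server."""
    if not os.path.exists(path):
        return False
    head = urllib.request.Request(url, method='HEAD')
    with urllib.request.urlopen(head) as response:
        expected = int(response.headers.get('Content-Length', -1))
    return expected > 0 and os.path.getsize(path) == expected

if __name__ == '__main__':
    if not download_is_complete(DUMP_URL, LOCAL_FILE):
        # A leftover partial file makes the script assume the download
        # succeeded, so remove it and rerun the download step.
        if os.path.exists(LOCAL_FILE):
            os.remove(LOCAL_FILE)
        print('Partial or missing dump removed; rerun the download script.')
```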


@@ -28,8 +28,8 @@ class WikiDownloader:
         self.language = language
         # Use a mirror from https://dumps.wikimedia.org/mirrors.html if the below links do not work
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }
         self.output_files = {

@@ -269,7 +269,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpus
 Users are welcome to download BookCorpus from other sources to match our accuracy, or repeatedly try our script until the required number of files are downloaded by running the following:
 `bash scripts/data_download.sh wiki_books`
-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download completes. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file exists, the script assumes the download succeeded, which causes the extraction to fail. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
 4. Download the pretrained models from NGC.
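Because the failure described in this note only shows up at extraction time, another hedged option is to verify the archive itself is readable beforehand; the sketch below streams once through the bz2 file and treats a decompression error as a truncated download. The file name is assumed for illustration.

```python
# Hedged sketch (assumed file name, not the repository's code): a truncated
# wikicorpus_en.xml.bz2 raises a decompression error, so streaming through the
# archive once is a cheap way to catch a broken download before extraction.
import bz2

def archive_is_intact(path: str, chunk_size: int = 1 << 20) -> bool:
    """Return True if the bz2 archive decompresses end to end without errors."""
    try:
        with bz2.open(path, 'rb') as archive:
            while archive.read(chunk_size):
                pass
        return True
    except (OSError, EOFError):
        return False

if not archive_is_intact('wikicorpus_en.xml.bz2'):
    print('Archive is truncated or corrupt; delete it and restart the download.')
```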


@@ -26,8 +26,8 @@ class WikiDownloader:
         self.language = language
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }
         self.output_files = {