[BERT/PyT][BERT/TF] Switch back to the original server for data download
* update - wiki download
parent 2b0daf392a
commit 04988752a8
@@ -280,7 +280,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or to repeatedly retry our script by running the following until the required number of files have been downloaded:
 `/workspace/bert/data/create_datasets_from_start.sh wiki_books`

-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download completes. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded and extraction fails. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.

 6. Start pretraining.
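Not part of the diff: the recovery step in the new note can be scripted. Below is a minimal, illustrative Python sketch, assuming only that `wikicorpus_en.xml.bz2` sits in a download directory supplied by the caller (the exact path is not shown in this hunk).

```python
# Illustrative sketch of the note above, not code from this repository.
import os
import sys

def remove_partial_dump(download_dir):
    """Delete a partially downloaded Wikipedia dump so the data script
    re-downloads it instead of treating it as complete."""
    dump = os.path.join(download_dir, 'wikicorpus_en.xml.bz2')
    if os.path.exists(dump):
        os.remove(dump)
        print(f'Removed partial dump {dump}; re-run the data download script.')
    else:
        print(f'No partial dump found at {dump}.')

if __name__ == '__main__':
    # Argument: the directory the download script writes into (hypothetical usage).
    remove_partial_dump(sys.argv[1])
```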
@@ -28,8 +28,8 @@ class WikiDownloader:
         self.language = language
         # Use a mirror from https://dumps.wikimedia.org/mirrors.html if the below links do not work
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }

         self.output_files = {
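The downloader's fetch logic is not part of this hunk. As a hedged illustration of how the `download_urls`/`output_files` mappings might be consumed, assuming the class also holds a `save_path` (declared elsewhere in the file, not shown here):

```python
# Illustrative helper, not code from this repository.
import os
import urllib.request

def download_dump(language, download_urls, output_files, save_path):
    """Fetch the Wikipedia dump for `language` into `save_path`."""
    url = download_urls[language]
    target = os.path.join(save_path, output_files[language])
    if os.path.exists(target):
        # Mirrors the README note: a partially downloaded file would be
        # mistaken for a finished one, so remove it before retrying.
        print(f'{target} already exists; remove it first if it is incomplete.')
        return target
    os.makedirs(save_path, exist_ok=True)
    urllib.request.urlretrieve(url, target)
    return target
```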
@@ -269,7 +269,7 @@ The pretraining dataset is 170GB+ and takes 15+ hours to download. The BookCorpu
 Users are welcome to download BookCorpus from other sources to match our accuracy, or to repeatedly retry our script by running the following until the required number of files have been downloaded:
 `bash scripts/data_download.sh wiki_books`

-Note: Not using BookCorpus can potentially change final accuracy on a few downstream tasks.
+Note: Ensure the Wikipedia download completes. If the download breaks for any reason, remove the output file `wikicorpus_en.xml.bz2` and start again; if a partially downloaded file is present, the script assumes the download succeeded and extraction fails. Not using BookCorpus can potentially change final accuracy on a few downstream tasks.

 4. Download the pretrained models from NGC.
@@ -26,8 +26,8 @@ class WikiDownloader:

         self.language = language
         self.download_urls = {
-            'en' : 'https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
-            'zh' : 'https://dumps.wikimedia.your.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
+            'en' : 'https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2',
+            'zh' : 'https://dumps.wikimedia.org/zhwiki/latest/zhwiki-latest-pages-articles.xml.bz2'
         }

         self.output_files = {
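The PyTorch variant keeps a comment pointing at https://dumps.wikimedia.org/mirrors.html as a fallback when the primary server fails. A hedged sketch of swapping in a mirror host without editing the file, using the mirror this commit removes (`dumps.wikimedia.your.org`) purely as an example:

```python
# Illustrative only, not code from this repository.
PRIMARY = 'https://dumps.wikimedia.org'
MIRROR = 'https://dumps.wikimedia.your.org'  # any host from https://dumps.wikimedia.org/mirrors.html

def with_mirror(url, mirror=MIRROR):
    """Replace the primary dump host with a mirror, keeping the dump path intact."""
    return url.replace(PRIMARY, mirror, 1)

print(with_mirror('https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2'))
# https://dumps.wikimedia.your.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2
```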