diff --git a/official/nlp/mass/README.md b/official/nlp/mass/README.md
index 0df1841d93ad018c89d56925dcf8c825749ec34f..eb3826d455c44d9c652ca9e5028a85921e0cf0fd 100644
--- a/official/nlp/mass/README.md
+++ b/official/nlp/mass/README.md
@@ -69,11 +69,9 @@ Note that you can run the scripts based on the dataset mentioned in original pap
 
 Dataset used:
 
-- monolingual English data from News Crawl dataset(WMT 2019) for pre-training.
-- Gigaword Corpus(Graff et al., 2003) for Text Summarization.
-- Cornell movie dialog corpus(DanescuNiculescu-Mizil & Lee, 2011).
-
-Details about those dataset could be found in [MASS: Masked Sequence to Sequence Pre-training for Language Generation](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf).
+- [monolingual English data from News Crawl dataset](https://www.statmt.org/wmt16/translation-task.html)(WMT 2019) for pre-training.
+- [Gigaword Corpus](https://github.com/harvardnlp/sent-summary)(Graff et al., 2003) for Text Summarization.
+- [Cornell movie dialog corpus](https://github.com/suriyadeepan/datasets/tree/master/seq2seq/)(DanescuNiculescu-Mizil & Lee, 2011).
 
 # Features
 
diff --git a/official/nlp/mass/README_CN.md b/official/nlp/mass/README_CN.md
index fd0503339e70026473c893930d97b27b8e84f536..65b47f13c3c7495b9f9b8d475d6d59e93d15b5dc 100644
--- a/official/nlp/mass/README_CN.md
+++ b/official/nlp/mass/README_CN.md
@@ -67,11 +67,9 @@ MASS网络由Transformer实现,Transformer包括多个编码器层和多个解
 
 本文运用数据集包括:
 
-- News Crawl数据集(WMT,2019年)的英语单语数据,用于预训练
-- Gigaword语料库(Graff等人,2003年),用于文本摘要
-- Cornell电影对白语料库(DanescuNiculescu-Mizil & Lee,2011年)
-
-数据集相关信息,参见[MASS:语言生成的隐式序列到序列预训练](https://www.microsoft.com/en-us/research/uploads/prod/2019/06/MASS-paper-updated-002.pdf)。
+- [News Crawl数据集](https://www.statmt.org/wmt16/translation-task.html)(WMT,2019年)的英语单语数据,用于预训练
+- [Gigaword语料库](https://github.com/harvardnlp/sent-summary)(Graff等人,2003年),用于文本摘要
+- [Cornell电影对白语料库](https://github.com/suriyadeepan/datasets/tree/master/seq2seq/)(DanescuNiculescu-Mizil & Lee,2011年)
 
 ## 特性
 