Unverified commit e310aabf, authored by i-robot, committed by Gitee

!2455 fix mass readme

Merge pull request !2455 from qujianwei/fix_mass_readme
parents cf998c98 33115787
README.md:

@@ -15,7 +15,6 @@
- [Generate Dataset](#generate-dataset)
- [News Crawl Corpus](#news-crawl-corpus)
- [Gigaword Corpus](#gigaword-corpus)
- [Cornell Movie Dialog Corpus](#cornell-movie-dialog-corpus)
- [Configuration](#configuration)
- [Training & Evaluation process](#training--evaluation-process)
- [Weights average](#weights-average)
@@ -34,7 +33,6 @@
- [Performance](#performance)
- [Results](#results)
- [Fine-Tuning on Text Summarization](#fine-tuning-on-text-summarization)
- [Fine-Tuning on Conversational Response Generation](#fine-tuning-on-conversational-response-generation)
- [Training Performance](#training-performance)
- [Inference Performance](#inference-performance)
- [Description of random situation](#description-of-random-situation)
@@ -71,7 +69,6 @@ Dataset used:
- [monolingual English data from the News Crawl dataset](https://www.statmt.org/wmt16/translation-task.html) (WMT 2019) for pre-training.
- [Gigaword Corpus](https://github.com/harvardnlp/sent-summary) (Graff et al., 2003) for text summarization.
- [Cornell Movie Dialog corpus](https://github.com/suriyadeepan/datasets/tree/master/seq2seq/) (Danescu-Niculescu-Mizil & Lee, 2011).
# Features
@@ -174,7 +171,6 @@ MASS script and code structure are as follows:
├── weights_average.py // Average multiple model checkpoints into NPZ format (see the sketch below).
├── news_crawl.py // Create the News Crawl dataset for pre-training.
├── gigaword.py // Create the Gigaword Corpus.
├── cornell_dialog.py // Create the Cornell Movie Dialog dataset for conversational response.
```
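For context, checkpoint averaging combines the parameters of several saved models into a single set of weights. Below is a minimal NumPy sketch of the idea behind `weights_average.py`; the `load_checkpoint` helper and the name-to-array parameter layout are assumptions for illustration, not the repository's actual API.

```python
import numpy as np

def average_checkpoints(ckpt_files, load_checkpoint, out_path="averaged.npz"):
    """Average same-named parameters across checkpoints and save them as NPZ.

    load_checkpoint is assumed to return a dict mapping parameter names to
    NumPy arrays; this is a sketch, not the repository's implementation.
    """
    sums = {}
    for path in ckpt_files:
        for name, value in load_checkpoint(path).items():
            # Accumulate in float64 to limit rounding error across checkpoints.
            sums[name] = sums.get(name, 0.0) + np.asarray(value, dtype=np.float64)
    averaged = {name: (total / len(ckpt_files)).astype(np.float32)
                for name, total in sums.items()}
    np.savez(out_path, **averaged)
```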
@@ -276,7 +272,7 @@ For more details, please refer to the source file.
### Generate Dataset
As mentioned above, three corpora are used in MASS, and dataset generation scripts are provided for them.
As mentioned above, two corpora are used in MASS, and dataset generation scripts are provided for them.
#### News Crawl Corpus
@@ -339,34 +335,6 @@ python gigaword.py --train_src /{path}/gigaword/train_src.txt \
--max_len 64
```
#### Cornell Movie Dialog Corpus
The script can be found in `cornell_dialog.py`.
Major parameters in `cornell_dialog.py`:
```bash
--src_folder: Corpus folder.
--existed_vocab: Persisted vocabulary file.
--train_prefix: Train source and target file prefix. Default: train.
--test_prefix: Test source and target file prefix. Default: test.
--noise_prob: Probability of adding noise to the training data. Default: 0.
--output_folder: Output dataset files folder path.
--max_len: Maximum sentence length. Sentences longer than `max_len` are dropped (see the sketch after this list).
--valid_prefix: Optional. Validation source and target file prefix. Default: valid.
```
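Note that `--max_len` acts as a hard filter rather than truncation. A minimal sketch of the rule, assuming the limit applies to both the tokenized source and target sides:

```python
def keep_pair(src_tokens, tgt_tokens, max_len=64):
    """Keep a sentence pair only if both sides fit within max_len tokens."""
    return len(src_tokens) <= max_len and len(tgt_tokens) <= max_len
```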
Sample code:
```bash
python cornell_dialog.py --src_folder /{path}/cornell_dialog \
--existed_vocab /{path}/mass/vocab/all_en.dict.bin \
--train_prefix train \
--test_prefix test \
--noise_prob 0.1 \
--output_folder /{path}/cornell_dialog_dataset \
--max_len 64
```
## Configuration
Almost all of the required options and arguments can be assigned conveniently, including the training platform, dataset and model configurations, and optimizer arguments. Optional features such as loss scaling and checkpointing are also enabled by setting the corresponding options.
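For illustration only, here is a sketch of what such a JSON configuration might contain and how it can be produced; the file name and every field below are assumptions, not the repository's actual schema:

```python
import json

# Hypothetical configuration values -- illustrative only; consult the JSON
# templates shipped with the repository for the real schema.
example_config = {
    "dataset_config": {"epochs": 50, "batch_size": 192, "max_len": 64},
    "model_config": {"hidden_size": 1024, "num_layers": 6},
    "optimizer_config": {"optimizer": "Adam", "lr": 1e-4},
    "loss_scale_config": {"init_loss_scale": 65536},
    "checkpoint_config": {"save_ckpt_steps": 2500, "keep_ckpt_max": 50},
}

with open("example_config.json", "w") as f:
    json.dump(example_config, f, indent=2)
```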
@@ -679,15 +647,7 @@ with 3.8M training data are as follows:
| Method | RG-1(F) | RG-2(F) | RG-L(F) |
|:---------------|:--------------|:-------------|:-------------|
| MASS | 38.73 | 19.71 | 35.96 |

### Fine-Tuning on Conversational Response Generation

The comparison between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus is as follows:

| Method | Data = 10K | Data = 110K |
|--------------------|------------------|-----------------|
| MASS | Ongoing | Ongoing |
### Training Performance
@@ -697,13 +657,13 @@ The comparison between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; memory 755GB; OS Euler2.8 |
| uploaded Date | 06/21/2021 |
| MindSpore Version | 1.2.1 |
| Dataset | News Crawl 2007-2017 English monolingual corpus, Gigaword corpus, Cornell Movie Dialog corpus |
| Dataset | News Crawl 2007-2017 English monolingual corpus, Gigaword corpus |
| Training Parameters | Epoch=50, steps=XXX, batch_size=192, lr=1e-4 |
| Optimizer | Adam |
| Loss Function | Label smoothed cross-entropy criterion |
| outputs | Sentence and probability |
| Loss | Lower than 2 |
| Accuracy | For conversational response, PPL=23.52; for text summarization, RG-1=29.79. |
| Accuracy | For text summarization, RG-1=45.98. |
| Speed | 611.45 sentences/s |
| Total time | --/-- |
| Params (M) | 44.6M |
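For reference, here is a minimal NumPy sketch of a label-smoothed cross-entropy criterion like the one named in the table above; the smoothing value `eps=0.1` and the uniform-distribution formulation are assumptions about the exact variant used:

```python
import numpy as np

def label_smoothed_cross_entropy(log_probs, targets, eps=0.1):
    """log_probs: [N, V] log-softmax outputs; targets: [N] gold token ids."""
    n = log_probs.shape[0]
    nll = -log_probs[np.arange(n), targets]   # loss against the gold labels
    uniform = -log_probs.mean(axis=-1)        # loss against a uniform target
    return float(((1.0 - eps) * nll + eps * uniform).mean())
```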
@@ -716,10 +676,10 @@ The comparison between MASS and other baseline methods in terms of PPL on the Cornell Movie Dialog corpus
| Resource | Ascend 910; OS Euler2.8 |
| uploaded Date | 06/21/2021 |
| MindSpore Version | 1.2.1 |
| Dataset | Gigaword corpus, Cornell Movie Dialog corpus |
| Dataset | Gigaword corpus |
| batch_size | --- |
| outputs | Sentence and probability |
| Accuracy | PPL=23.52 for conversational response, RG-1=29.79 for text summarization. |
| Accuracy | RG-1=45.98 for text summarization. |
| Speed | ---- sentences/s |
| Total time | --/-- |
README_CN.md:

@@ -15,7 +15,6 @@
- [Generate Dataset](#generate-dataset)
- [News Crawl Corpus](#news-crawl-corpus)
- [Gigaword Corpus](#gigaword-corpus)
- [Cornell Movie Dialog Corpus](#cornell-movie-dialog-corpus)
- [Configuration](#configuration)
- [Training & Evaluation process](#training--evaluation-process)
- [Weights average](#weights-average)
@@ -34,7 +33,6 @@
- [Performance](#performance)
- [Results](#results-1)
- [Fine-Tuning on Text Summarization](#fine-tuning-on-text-summarization)
- [Fine-Tuning on Conversational Response Generation](#fine-tuning-on-conversational-response-generation)
- [Training Performance](#training-performance)
- [Inference Performance](#inference-performance)
- [Description of random situation](#description-of-random-situation)
@@ -69,7 +67,6 @@ The MASS network is implemented with Transformer, which consists of multiple encoder layers and multiple decoder layers
- English monolingual data from the [News Crawl dataset](https://www.statmt.org/wmt16/translation-task.html) (WMT, 2019), for pre-training
- [Gigaword Corpus](https://github.com/harvardnlp/sent-summary) (Graff et al., 2003), for text summarization
- [Cornell Movie Dialog corpus](https://github.com/suriyadeepan/datasets/tree/master/seq2seq/) (Danescu-Niculescu-Mizil & Lee, 2011)
## Features
@@ -176,7 +173,6 @@ The MASS script and code structure are as follows:
├── weights_average.py // Average multiple model checkpoints and convert them to NPZ format
├── news_crawl.py // Create the News Crawl dataset for pre-training
├── gigaword.py // Create the Gigaword Corpus
├── cornell_dialog.py // Create the Cornell Movie Dialog dataset for conversational response
```
@@ -278,7 +274,7 @@ print([vocabulary.index[s] for s in sentence])
### Generate Dataset
As mentioned above, three corpora are used in MASS, and the dataset generation scripts are provided for them.
As mentioned above, two corpora are used in MASS, and the dataset generation scripts are provided for them.
#### News Crawl Corpus
@@ -341,34 +337,6 @@ python gigaword.py --train_src /{path}/gigaword/train_src.txt \
--max_len 64
```
#### Cornell Movie Dialog Corpus
The dataset generation script is `cornell_dialog.py`.
Major parameters in `cornell_dialog.py`:
```bash
--src_folder: Corpus folder.
--existed_vocab: Persisted vocabulary file.
--train_prefix: Train source and target file prefix. Default: train.
--test_prefix: Test source and target file prefix. Default: test.
--noise_prob: Probability of adding noise to the training data. Default: 0.
--output_folder: Output dataset files folder path.
--max_len: Maximum sentence length. Sentences longer than `max_len` are dropped.
--valid_prefix: Optional. Validation source and target file prefix. Default: valid.
```
Sample code:
```bash
python cornell_dialog.py --src_folder /{path}/cornell_dialog \
--existed_vocab /{path}/mass/vocab/all_en.dict.bin \
--train_prefix train \
--test_prefix test \
--noise_prob 0.1 \
--output_folder /{path}/cornell_dialog_dataset \
--max_len 64
```
## Configuration
The JSON files under the `config/` directory are template configuration files,
@@ -680,15 +648,7 @@ bash run_infer_310.sh [MINDIR_PATH] [CONFIG] [VOCAB] [OUTPUT] [NEED_PREPROCESS]
| Method | RG-1(F) | RG-2(F) | RG-L(F) |
|:---------------|:--------------|:-------------|:-------------|
| MASS | 38.73 | 19.71 | 35.96 |

### Fine-Tuning on Conversational Response Generation

The table below shows the perplexity (PPL) of MASS on the Cornell Movie Dialog corpus, compared with the other two baseline methods.

| Method | Data = 10K | Data = 110K |
|--------------------|------------------|-----------------|
| MASS | Ongoing | Ongoing |
### Training Performance
@@ -698,13 +658,13 @@
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; memory 755GB; OS Euler2.8 |
| Uploaded Date | 2021-06-21 |
| MindSpore Version | 1.2.1 |
| Dataset | News Crawl 2007-2017 English monolingual corpus, Gigaword corpus, Cornell Movie Dialog corpus |
| Dataset | News Crawl 2007-2017 English monolingual corpus, Gigaword corpus |
| Training Parameters | Epoch=50, steps=XXX, batch_size=192, lr=1e-4 |
| Optimizer | Adam |
| Loss Function | Label smoothed cross-entropy criterion |
| Outputs | Sentence and probability |
| Loss | Lower than 2 |
| Accuracy | PPL=23.52 for conversational response; RG-1=29.79 for text summarization |
| Accuracy | RG-1=45.98 for text summarization |
| Speed | 611.45 sentences/s |
| Total time | |
| Params (M) | 44.6M |
@@ -717,10 +677,10 @@
| Resource | Ascend 910; OS Euler2.8 |
| Uploaded Date | 2021-06-21 |
| MindSpore Version | 1.2.1 |
| Dataset | Gigaword corpus, Cornell Movie Dialog corpus |
| Dataset | Gigaword corpus |
| batch_size | --- |
| Outputs | Sentence and probability |
| Accuracy | PPL=23.52 for conversational response; RG-1=29.79 for text summarization |
| Accuracy | RG-1=45.98 for text summarization |
| Speed | ---- sentences/s |
| Total time | --/-- |
cornell_dialog.py (deleted):

# Copyright 2020 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Generate Cornell Movie Dialog dataset."""
import os
import argparse
from src.dataset import BiLingualDataLoader
from src.language_model import NoiseChannelLanguageModel
from src.utils import Dictionary
parser = argparse.ArgumentParser(description='Generate Cornell Movie Dialog dataset file.')
parser.add_argument("--src_folder", type=str, default="", required=True,
                    help="Raw corpus folder.")
parser.add_argument("--existed_vocab", type=str, default="", required=True,
                    help="Existing vocabulary file.")
parser.add_argument("--train_prefix", type=str, default="train", required=False,
                    help="Prefix of train file.")
parser.add_argument("--test_prefix", type=str, default="test", required=False,
                    help="Prefix of test file.")
parser.add_argument("--valid_prefix", type=str, default=None, required=False,
                    help="Prefix of valid file.")
parser.add_argument("--noise_prob", type=float, default=0., required=False,
                    help="Probability of adding noise.")
parser.add_argument("--max_len", type=int, default=32, required=False,
                    help="Max length of sentence.")
parser.add_argument("--output_folder", type=str, default="", required=True,
                    help="Dataset output path.")
if __name__ == '__main__':
    args, _ = parser.parse_known_args()

    train_src_file = ""
    train_tgt_file = ""
    test_src_file = ""
    test_tgt_file = ""
    valid_src_file = ""
    valid_tgt_file = ""

    # Locate the train/test/valid source and target files by their prefixes.
    for file in os.listdir(args.src_folder):
        if file.startswith(args.train_prefix) and "src" in file and file.endswith(".txt"):
            train_src_file = os.path.join(args.src_folder, file)
        elif file.startswith(args.train_prefix) and "tgt" in file and file.endswith(".txt"):
            train_tgt_file = os.path.join(args.src_folder, file)
        elif file.startswith(args.test_prefix) and "src" in file and file.endswith(".txt"):
            test_src_file = os.path.join(args.src_folder, file)
        elif file.startswith(args.test_prefix) and "tgt" in file and file.endswith(".txt"):
            test_tgt_file = os.path.join(args.src_folder, file)
        elif args.valid_prefix and file.startswith(args.valid_prefix) and "src" in file and file.endswith(".txt"):
            valid_src_file = os.path.join(args.src_folder, file)
        elif args.valid_prefix and file.startswith(args.valid_prefix) and "tgt" in file and file.endswith(".txt"):
            valid_tgt_file = os.path.join(args.src_folder, file)
        else:
            continue

    vocab = Dictionary.load_from_persisted_dict(args.existed_vocab)

    if train_src_file and train_tgt_file:
        # Noise (--noise_prob) is injected into the training split only.
        BiLingualDataLoader(
            src_filepath=train_src_file,
            tgt_filepath=train_tgt_file,
            src_dict=vocab, tgt_dict=vocab,
            src_lang="en", tgt_lang="en",
            language_model=NoiseChannelLanguageModel(add_noise_prob=args.noise_prob),
            max_sen_len=args.max_len
        ).write_to_tfrecord(
            path=os.path.join(args.output_folder, "train_cornell_dialog.tfrecord")
        )

    if test_src_file and test_tgt_file:
        BiLingualDataLoader(
            src_filepath=test_src_file,
            tgt_filepath=test_tgt_file,
            src_dict=vocab, tgt_dict=vocab,
            src_lang="en", tgt_lang="en",
            language_model=NoiseChannelLanguageModel(add_noise_prob=0.),
            max_sen_len=args.max_len
        ).write_to_tfrecord(
            path=os.path.join(args.output_folder, "test_cornell_dialog.tfrecord")
        )

    # valid_src_file/valid_tgt_file are already full paths from the scan above,
    # so they are used directly; the check also guards against missing files.
    if args.valid_prefix and valid_src_file and valid_tgt_file:
        BiLingualDataLoader(
            src_filepath=valid_src_file,
            tgt_filepath=valid_tgt_file,
            src_dict=vocab, tgt_dict=vocab,
            src_lang="en", tgt_lang="en",
            language_model=NoiseChannelLanguageModel(add_noise_prob=0.),
            max_sen_len=args.max_len
        ).write_to_tfrecord(
            path=os.path.join(args.output_folder, "valid_cornell_dialog.tfrecord")
        )

    print(f" | Vocabulary size: {vocab.size}.")