diff --git a/official/nlp/transformer/README.md b/official/nlp/transformer/README.md index 22c6cdef300321cf57dfda00d894b76cab271e86..aba3b341c7066ce0854857f499625349c87dd622 100644 --- a/official/nlp/transformer/README.md +++ b/official/nlp/transformer/README.md @@ -59,15 +59,36 @@ Note that you can run the scripts based on the dataset mentioned in original pap After dataset preparation, you can start training and evaluation as follows: +In the Ascend environment + +```bash +# run training example +bash scripts/run_standalone_train.sh Ascend [DEVICE_ID] [EPOCH_SIZE] [GRADIENT_ACCUMULATE_STEP] [DATA_PATH] +# EPOCH_SIZE: 52 is recommended; GRADIENT_ACCUMULATE_STEP: 8 or 1 + +# run distributed training example +bash scripts/run_distribute_train_ascend.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [RANK_TABLE_FILE] [CONFIG_PATH] +# EPOCH_SIZE: 52 is recommended + +# run evaluation example +bash scripts/run_eval.sh Ascend [DEVICE_ID] [MINDRECORD_DATA] [CKPT_PATH] [CONFIG_PATH] +# CONFIG_PATH must be consistent with the config used for training +``` + +In the GPU environment + ```bash # run training example -bash scripts/run_standalone_train_ascend.sh Ascend 0 52 /path/ende-l128-mindrecord +bash scripts/run_standalone_train.sh GPU [DEVICE_ID] [EPOCH_SIZE] [GRADIENT_ACCUMULATE_STEP] [DATA_PATH] +# EPOCH_SIZE: 52 is recommended; GRADIENT_ACCUMULATE_STEP: 8 or 1 # run distributed training example -bash scripts/run_distribute_train_ascend.sh 8 52 /path/ende-l128-mindrecord rank_table.json ./default_config.yaml +bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [CONFIG_PATH] +# EPOCH_SIZE: 52 is recommended # run evaluation example -python eval.py > eval.log 2>&1 & +bash scripts/run_eval.sh GPU [DEVICE_ID] [MINDRECORD_DATA] [CKPT_PATH] [CONFIG_PATH] +# CONFIG_PATH must be consistent with the config used for training ``` - Running on [ModelArts](https://support.huaweicloud.com/modelarts/) @@ -322,25 +343,28 @@ Parameters for learning rate: - Run `run_standalone_train.sh` for non-distributed training of 
Transformer model. ``` bash - bash scripts/run_standalone_train.sh DEVICE_TARGET DEVICE_ID EPOCH_SIZE GRADIENT_ACCUMULATE_STEP DATA_PATH + bash scripts/run_standalone_train.sh [DEVICE_TARGET] [DEVICE_ID] [EPOCH_SIZE] [GRADIENT_ACCUMULATE_STEP] [DATA_PATH] ``` - Run `run_distribute_train_ascend.sh` for distributed training of Transformer model. ``` bash - bash scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH + # Ascend environment + bash scripts/run_distribute_train_ascend.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [RANK_TABLE_FILE] [CONFIG_PATH] + # GPU environment + bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [CONFIG_PATH] ``` **Attention**: data sink mode can not be used in transformer since the input data have different sequence lengths. ## [Evaluation Process](#contents) -- Set options in `default_config.yaml`. Make sure the 'data_file', 'model_file' and 'output_file' are set to your own path. +- Set options in [CONFIG_PATH], which should be consistent with the config used for training. Make sure 'device_target', 'data_file', 'model_file' and 'output_file' are set to your own paths. - Run `eval.py` for evaluation of Transformer model. ```bash - python eval.py + python eval.py --config_path=[CONFIG_PATH] ``` - Run `process_output.sh` to process the output token ids to get the real translation results. 
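Taken together, the two evaluation steps above amount to one `run_eval.sh` invocation followed by `process_output.sh`. As a sketch (the mindrecord, checkpoint, and config paths below are placeholders, not files shipped with the repository), the command can be composed from variables and printed for review before it is actually run:

```shell
# Sketch with placeholder paths: compose the evaluation command so the full
# invocation can be reviewed before running it on real data.
CONFIG_PATH=./default_config_large.yaml   # must match the config used for training
DEVICE_ID=0
MINDRECORD_DATA=/path/newstest2014-l128-mindrecord
CKPT_PATH=/path/transformer.ckpt

EVAL_CMD="bash scripts/run_eval.sh Ascend $DEVICE_ID $MINDRECORD_DATA $CKPT_PATH $CONFIG_PATH"
echo "$EVAL_CMD"
# After run_eval.sh finishes, process_output.sh turns the output token ids
# back into text so BLEU can be computed.
```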
@@ -390,33 +414,33 @@ Inference result is saved in current path, 'output_file' will generate in path s #### Training Performance -| Parameters | Ascend | -| -------------------------- | -------------------------------------------------------------- | -| Resource | Ascend 910; OS Euler2.8 | -| uploaded Date | 07/05/2021 (month/day/year) | -| MindSpore Version | 1.3.0 | -| Dataset | WMT Englis-German | -| Training Parameters | epoch=52, batch_size=96 | -| Optimizer | Adam | -| Loss Function | Softmax Cross Entropy | -| BLEU Score | 28.7 | -| Speed | 400ms/step (8pcs) | -| Loss | 2.8 | -| Params (M) | 213.7 | -| Checkpoint for inference | 2.4G (.ckpt file) | +| Parameters | Ascend | GPU | +| -------------------------- | -------------------------------------------| --------------------------------| +| Resource | Ascend 910; OS Euler2.8 | GPU(Tesla V100 SXM2) | +| uploaded Date | 07/05/2021 (month/day/year) | 12/21/2021 (month/day/year) | +| MindSpore Version | 1.3.0 | 1.5.0 | +| Dataset | WMT English-German | WMT English-German | +| Training Parameters | epoch=52, batch_size=96 | epoch=52, batch_size=96 | +| Optimizer | Adam | Adam | +| Loss Function | Softmax Cross Entropy | Softmax Cross Entropy | +| BLEU Score | 28.7 | 29.1 | +| Speed | 400ms/step (8pcs) | 337ms/step (8pcs) | +| Loss | 2.8 | 2.9 | +| Params (M) | 213.7 | 213.7 | +| Checkpoint for inference | 2.4G (.ckpt file) | 2.4G (.ckpt file) | | Scripts | [Transformer scripts](https://gitee.com/mindspore/models/tree/master/official/nlp/transformer) | #### Evaluation Performance -| Parameters | Ascend | -| ------------------- | --------------------------- | -| Resource | Ascend 910; OS Euler2.8 | -| Uploaded Date | 07/05/2021 (month/day/year) | -| MindSpore Version | 1.3.0 | -| Dataset | WMT newstest2014 | -| batch_size | 1 | -| outputs | BLEU score | -| Accuracy | BLEU=28.7 | +| Parameters | Ascend | GPU | +| ------------------- | --------------------------- | ----------------------------| +| Resource | Ascend 910; OS Euler2.8 | GPU(Tesla V100 SXM2) | +| Uploaded Date | 07/05/2021 (month/day/year) | 12/21/2021 (month/day/year) | +| MindSpore Version | 1.3.0 | 1.5.0 | +| Dataset | WMT newstest2014 | WMT newstest2014 | +| batch_size | 1 | 1 | +| outputs | BLEU score | BLEU score | +| Accuracy | BLEU=28.7 | BLEU=29.1 | ## [Description of Random Situation](#contents) diff --git a/official/nlp/transformer/README_CN.md b/official/nlp/transformer/README_CN.md index 24b75eae4a414392cde75cf8f7767ad088246d0a..e49963b45c6b709374d33d13dea0587f8fdb4300 100644 --- a/official/nlp/transformer/README_CN.md +++ b/official/nlp/transformer/README_CN.md @@ -49,8 +49,8 @@ Transformer具体包括六个编码模块和六个解码模块。每个编码模 ## 环境要求 -- 硬件(Ascend处理器) - - 使用Ascend处理器准备硬件环境。 +- 硬件(Ascend处理器/GPU处理器) + - 使用Ascend/GPU处理器准备硬件环境。 - 框架 - [MindSpore](https://gitee.com/mindspore/mindspore) - 如需查看详情,请参见如下资源: @@ -61,15 +61,36 @@ Transformer具体包括六个编码模块和六个解码模块。每个编码模 数据集准备完成后,请按照如下步骤开始训练和评估: +Ascend环境下: + +```bash +# 运行训练示例 +bash scripts/run_standalone_train.sh Ascend [DEVICE_ID] [EPOCH_SIZE] [GRADIENT_ACCUMULATE_STEP] [DATA_PATH] +# EPOCH_SIZE 推荐52, GRADIENT_ACCUMULATE_STEP 推荐8或者1 + +# 运行分布式训练示例 +bash scripts/run_distribute_train_ascend.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [RANK_TABLE_FILE] [CONFIG_PATH] +# EPOCH_SIZE 推荐52 + +# 运行评估示例 +bash scripts/run_eval.sh Ascend [DEVICE_ID] [MINDRECORD_DATA] [CKPT_PATH] [CONFIG_PATH] +# CONFIG_PATH要和训练时保持一致 +``` + +GPU环境下: + ```bash # 运行训练示例 -bash scripts/run_standalone_train_ascend.sh Ascend 0 52 /path/ende-l128-mindrecord +bash scripts/run_standalone_train.sh GPU [DEVICE_ID] [EPOCH_SIZE] [GRADIENT_ACCUMULATE_STEP] [DATA_PATH] +# EPOCH_SIZE 推荐52, GRADIENT_ACCUMULATE_STEP 推荐8或者1 # 运行分布式训练示例 -bash scripts/run_distribute_train_ascend.sh 8 52 /path/ende-l128-mindrecord rank_table.json ./default_config.yaml +bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [CONFIG_PATH] +# EPOCH_SIZE 推荐52 # 运行评估示例 -python eval.py > eval.log 2>&1 & +bash scripts/run_eval.sh GPU [DEVICE_ID] [MINDRECORD_DATA] [CKPT_PATH] [CONFIG_PATH] +# CONFIG_PATH要和训练时保持一致 ``` - 在 ModelArts 进行训练 (如果你想在modelarts上运行,可以参考以下文档 [modelarts](https://support.huaweicloud.com/modelarts/)) @@ -325,25 +346,28 @@ Parameters for learning rate: - 运行`run_standalone_train.sh`,进行Transformer模型的非分布式训练。 ``` bash - bash scripts/run_standalone_train.sh DEVICE_TARGET DEVICE_ID EPOCH_SIZE GRADIENT_ACCUMULATE_STEP DATA_PATH + bash scripts/run_standalone_train.sh [DEVICE_TARGET] [DEVICE_ID] [EPOCH_SIZE] [GRADIENT_ACCUMULATE_STEP] [DATA_PATH] ``` - 运行`run_distribute_train_ascend.sh`,进行Transformer模型的分布式训练。 ``` bash - bash scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH + # Ascend environment + bash scripts/run_distribute_train_ascend.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [RANK_TABLE_FILE] [CONFIG_PATH] + # GPU environment + bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [EPOCH_SIZE] [DATA_PATH] [CONFIG_PATH] ``` **注意**:由于网络输入中有不同句长的数据,所以数据下沉模式不可使用。 ### 评估过程 -- 在`default_config.yaml`中设置选项。确保已设置了‘data_file'、'model_file’和'output_file'文件路径。 +- 在[CONFIG_PATH]中设置选项,此时的[CONFIG_PATH]要和训练时保持一致。确保已设置了'device_target'、'data_file'、'model_file'和'output_file'文件路径。 - 运行`eval.py`,评估Transformer模型。 ```bash - python eval.py + python eval.py --config_path=[CONFIG_PATH] ``` - 运行`process_output.sh`,处理输出标记ids,获得真实翻译结果。 @@ -393,33 +417,33 @@ bash run_infer_310.sh [MINDIR_PATH] [NEED_PREPROCESS] [DEVICE_ID] #### 训练性能 -| 参数 | Ascend | -| -------------------------- | -------------------------------------------------------------- | -| 资源 | Ascend 910;系统 Euler2.8 | -| 上传日期 | 2021-07-05 | -| MindSpore版本 | 1.3.0 | -| 数据集 | WMT英-德翻译数据集 | -| 训练参数 | epoch=52, batch_size=96 | -| 优化器 | Adam | -| 损失函数 | Softmax Cross Entropy | -| BLEU分数 | 28.7 | -| 速度 | 400毫秒/步(8卡) | -| 损失 | 2.8 | -| 参数 (M) | 213.7 | -| 推理检查点 | 2.4G (.ckpt文件) | -| 脚本 | <https://gitee.com/mindspore/models/tree/master/official/nlp/transformer> | +| 参数 | Ascend | GPU | +| -------------------------- | -------------------------------- | --------------------------------| +| 资源 | Ascend 910;系统 Euler2.8 | GPU(Tesla V100 SXM2) | +| 上传日期 | 2021-07-05 | 2021-12-21 | +| MindSpore版本 | 1.3.0 | 1.5.0 | +| 数据集 | WMT英-德翻译数据集 | WMT英-德翻译数据集 | +| 训练参数 | epoch=52, batch_size=96 | epoch=52, batch_size=96 | +| 优化器 | Adam | Adam | +| 损失函数 | Softmax Cross Entropy | Softmax Cross Entropy | +| BLEU分数 | 28.7 | 29.1 | +| 速度 | 400毫秒/步(8卡) | 337毫秒/步(8卡) | +| 损失 | 2.8 | 2.9 | +| 参数 (M) | 213.7 | 213.7 | +| 推理检查点 | 2.4G (.ckpt文件) | 2.4G (.ckpt文件) | +| 脚本 | <https://gitee.com/mindspore/models/tree/master/official/nlp/transformer> | #### 评估性能 -| 参数 | Ascend | -| ------------------- | --------------------------- | -|资源| Ascend 910;系统 Euler2.8 | -| 上传日期 | 2021-07-05 | -| MindSpore版本 | 1.3.0 | -| 数据集 | WMT newstest2014 | -| batch_size | 1 | -| 输出 | BLEU score | -| 准确率 | BLEU=28.7 | +| 参数 | Ascend | GPU | +| ------------------- | --------------------------- | ----------------------------| +|资源| Ascend 910;系统 Euler2.8 | GPU(Tesla V100 SXM2) | +| 上传日期 | 2021-07-05 | 2021-12-21 | +| MindSpore版本 | 1.3.0 | 1.5.0 | +| 数据集 | WMT newstest2014 | WMT newstest2014 | +| batch_size | 1 | 1 | +| 输出 | BLEU score | BLEU score | +| 准确率 | BLEU=28.7 | BLEU=29.1 | ## 随机情况说明 diff --git a/official/nlp/transformer/scripts/run_distribute_train_ascend.sh b/official/nlp/transformer/scripts/run_distribute_train_ascend.sh index 41be7cdef414660863d77b4eb8f34aefee89bf53..3c6f5b0135e6f62091a69b9bcd5b69ded21c3c35 100644 --- a/official/nlp/transformer/scripts/run_distribute_train_ascend.sh +++ b/official/nlp/transformer/scripts/run_distribute_train_ascend.sh @@ -16,7 +16,7 @@ if [ $# != 5 ] ; then echo "==============================================================================================================" echo "Please run the script as: " -echo "sh run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH" +echo "bash scripts/run_distribute_train_ascend.sh DEVICE_NUM EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH" echo "for example: sh run_distribute_train_ascend.sh 8 52 /path/ende-l128-mindrecord00 /path/hccl.json ./default_config_large.yaml" echo "It is better to use absolute path." echo "==============================================================================================================" diff --git a/official/nlp/transformer/scripts/run_distribute_train_ascend_multi_machines.sh b/official/nlp/transformer/scripts/run_distribute_train_ascend_multi_machines.sh index 14fd9d12d6bea02a4e0ee5eef445577ea3cc2bbe..6577f0d9894fb2b22171cb68614fcfa6aee2e546 100644 --- a/official/nlp/transformer/scripts/run_distribute_train_ascend_multi_machines.sh +++ b/official/nlp/transformer/scripts/run_distribute_train_ascend_multi_machines.sh @@ -16,7 +16,7 @@ if [ $# != 6 ] ; then echo "==============================================================================================================" echo "Please run the script as: " -echo "sh run_distribute_train_ascend_multi_machines.sh DEVICE_NUM SERVER_ID EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH" +echo "bash scripts/run_distribute_train_ascend_multi_machines.sh DEVICE_NUM SERVER_ID EPOCH_SIZE DATA_PATH RANK_TABLE_FILE CONFIG_PATH" echo "for example: sh run_distribute_train_ascend_multi_machines.sh 32 0 52 /path/ende-l128-mindrecord00 /path/hccl.json ./default_config_large.yaml" echo "It is better to use absolute path." 
echo "==============================================================================================================" diff --git a/official/nlp/transformer/scripts/run_distribute_train_gpu.sh b/official/nlp/transformer/scripts/run_distribute_train_gpu.sh index a7a884a78d14cf2e5babd215338f2ede6d4eb2e7..e878616ae72595efabaca5db6fd9eb545cd1b306 100644 --- a/official/nlp/transformer/scripts/run_distribute_train_gpu.sh +++ b/official/nlp/transformer/scripts/run_distribute_train_gpu.sh @@ -16,7 +16,7 @@ if [ $# != 4 ] ; then echo "==============================================================================================================" echo "Please run the script as: " -echo "sh run_distribute_train_gpu.sh DEVICE_NUM EPOCH_SIZE DATA_PATH CONFIG_PATH" +echo "bash scripts/run_distribute_train_gpu.sh DEVICE_NUM EPOCH_SIZE DATA_PATH CONFIG_PATH" echo "for example: sh run_distribute_train_gpu.sh 8 55 /path/ende-l128-mindrecord00 ./default_config_large_gpu.yaml" echo "It is better to use absolute path." 
echo "==============================================================================================================" diff --git a/official/nlp/transformer/scripts/run_eval.sh b/official/nlp/transformer/scripts/run_eval.sh index 8a2d2a1fe2a823bfd38927bab20966af380396c5..9628e782092804f5cb6d1be03dfb0c1c234a1533 100644 --- a/official/nlp/transformer/scripts/run_eval.sh +++ b/official/nlp/transformer/scripts/run_eval.sh @@ -16,7 +16,7 @@ if [ $# != 5 ] ; then echo "==============================================================================================================" echo "Please run the script as: " -echo "sh run_eval.sh DEVICE_TARGET DEVICE_ID MINDRECORD_DATA CKPT_PATH CONFIG_PATH" +echo "bash scripts/run_eval.sh DEVICE_TARGET DEVICE_ID MINDRECORD_DATA CKPT_PATH CONFIG_PATH" echo "for example: sh run_eval.sh Ascend 0 /your/path/evaluation.mindrecord /your/path/checkpoint_file ./default_config_large_gpu.yaml" echo "Note: set the checkpoint and dataset path in default_config.yaml" echo "==============================================================================================================" diff --git a/official/nlp/transformer/scripts/run_standalone_train.sh b/official/nlp/transformer/scripts/run_standalone_train.sh index 50a7779ee8fb6487ecca0a0035120d634c1ece7a..3250dff510b0bcd2afb7b6e8a6a79b4515bcae9c 100644 --- a/official/nlp/transformer/scripts/run_standalone_train.sh +++ b/official/nlp/transformer/scripts/run_standalone_train.sh @@ -16,7 +16,7 @@ if [ $# != 5 ] ; then echo "==============================================================================================================" echo "Please run the script as: " -echo "sh run_standalone_train.sh DEVICE_TARGET DEVICE_ID EPOCH_SIZE GRADIENT_ACCUMULATE_STEP DATA_PATH" +echo "bash scripts/run_standalone_train.sh DEVICE_TARGET DEVICE_ID EPOCH_SIZE GRADIENT_ACCUMULATE_STEP DATA_PATH" echo "for example: sh run_standalone_train.sh Ascend 0 52 8 /path/ende-l128-mindrecord00" echo "It is better to use absolute path." echo "=============================================================================================================="
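The usage messages above are printed by a positional-argument guard at the top of each `run_*` script (the `if [ $# != 5 ]` checks in the hunks). A minimal runnable sketch of that pattern follows; `check_args` is an illustrative helper name of our own, not a function from the repository:

```shell
# check_args is an illustrative helper, not part of the repository: it mirrors
# the `if [ $# != N ]` guard at the top of each run_* script.
check_args() {
  local expected=$1
  shift
  if [ $# -ne "$expected" ]; then
    echo "Please run the script as: "
    echo "bash scripts/run_standalone_train.sh DEVICE_TARGET DEVICE_ID EPOCH_SIZE GRADIENT_ACCUMULATE_STEP DATA_PATH"
    return 1
  fi
  echo "arguments ok"
}

# Five arguments, matching run_standalone_train.sh's expectation:
check_args 5 Ascend 0 52 8 /path/ende-l128-mindrecord00
```

Returning a non-zero status (rather than calling `exit`) keeps the guard usable both when sourced and when run as a script.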