diff --git a/official/nlp/bert/README.md b/official/nlp/bert/README.md
index 6299de59fd4aa31ddcb78954ab8971965543f3f4..a95c91c62a708fa6a3009f8ac2a9e6100ac218ed 100644
--- a/official/nlp/bert/README.md
+++ b/official/nlp/bert/README.md
@@ -61,7 +61,7 @@ The backbone structure of BERT is transformer. For BERT_base, the transformer co
     - Extract and refine texts in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). The commands are as follows:
         - pip install wikiextractor
         - python -m wikiextractor.WikiExtractor -o <output file path> -b <output file size> <Wikipedia dump file>
-    - Convert the dataset to TFRecord format. Please refer to create_pretraining_data.py file in [BERT](https://github.com/google-research/bert) repository and download vocab.txt here, if AttributeError: module 'tokenization' has no attribute 'FullTokenizer' occur, please install bert-tensorflow.
+    - The text data extracted by `WikiExtractor` cannot be used for training directly; you need to preprocess it and convert the dataset to TFRecord format. Please refer to the create_pretraining_data.py file in the [BERT](https://github.com/google-research/bert) repository and download vocab.txt there. If AttributeError: module 'tokenization' has no attribute 'FullTokenizer' occurs, please install bert-tensorflow.
 - Create fine-tune dataset
     - Download dataset for fine-tuning and evaluation such as [CLUENER](https://github.com/CLUEbenchmark/CLUENER2020), [TNEWS](https://github.com/CLUEbenchmark/CLUE), [SQuAD v1.1 train dataset](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json), [SQuAD v1.1 eval dataset](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json), etc.
     - Convert dataset files from JSON format to TFRECORD format, please refer to run_classifier.py or run_squad.py file in [BERT](https://github.com/google-research/bert) repository.
@@ -91,7 +91,7 @@ bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.jso
 # run fine-tuning and evaluation example
 - If you are going to run a fine-tuning task, please prepare a checkpoint generated from pre-training.
-- Set bert network config and optimizer hyperparameters in `finetune_eval_config.py`.
+- Set bert network config and optimizer hyperparameters in `task_[DOWNSTREAM_TASK]_config.yaml`.

 - Classification task: Set task related hyperparameters in scripts/run_classifier.sh.
 - Run `bash scripts/run_classifier.sh` for fine-tuning of BERT-base and BERT-NEZHA model.
@@ -120,7 +120,7 @@ bash scripts/run_distributed_pretrain_for_gpu.sh 8 40 /path/cn-wiki-128
 # run fine-tuning and evaluation example
 - If you are going to run a fine-tuning task, please prepare a checkpoint generated from pre-training.
-- Set bert network config and optimizer hyperparameters in `finetune_eval_config.py`.
+- Set bert network config and optimizer hyperparameters in `task_[DOWNSTREAM_TASK]_config.yaml`.

 - Classification task: Set task related hyperparameters in scripts/run_classifier.sh.
 - Run `bash scripts/run_classifier.sh` for fine-tuning of BERT-base and BERT-NEZHA model.
@@ -179,7 +179,7 @@ If you want to run in modelarts, please check the official documentation of [mod
 #          1. Add ”enable_modelarts=True“
 #          2. Set other parameters, other parameter configuration can refer to `run_ner.sh`(or run_squad.sh or run_classifier.sh) under the folder '{path}/bert/scripts/'.
 #          Note that vocab_file_path, label_file_path, train_data_file_path, eval_data_file_path, schema_file_path fill in the relative path relative to the path selected in step 7.
-#          Finally, "config_path=../../*.yaml" must be added on the web page (select the *.yaml configuration file according to the downstream task)
+#          Finally, "config_path=/path/*.yaml" must be added on the web page (select the *.yaml configuration file according to the downstream task)
 # (6) Upload the dataset to S3 bucket.
 # (7) Check the "data storage location" on the website UI interface and set the "Dataset path" path (there is only data or zip package under this path).
 # (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
@@ -190,13 +190,11 @@ If you want to run in modelarts, please check the official documentation of [mod

 For distributed training on Ascend, an hccl configuration file with JSON format needs to be created in advance.

-For distributed training on single machine, [here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_single_machine_multi_rank.json) is an example hccl.json.
-
-For distributed training among multiple machines, training command should be executed on each machine in a small time interval. Thus, an hccl.json is needed on each machine. [here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_multi_machine_multi_rank.json) is an example of hccl.json for multi-machine case.
-
 Please follow the instructions in the link below to create an hccl.json file in need: [https://gitee.com/mindspore/models/tree/master/utils/hccl_tools](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools).

+For distributed training among multiple machines, the training command should be executed on each machine within a small time interval. Thus, an hccl.json is needed on each machine. [merge_hccl](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools#merge_hccl) is a tool to create the hccl.json for the multi-machine case.
+
 For dataset, if you want to set the format and parameters, a schema configuration file with JSON format needs to be created, please refer to [tfrecord](https://www.mindspore.cn/docs/programming_guide/en/master/dataset_loading.html#tfrecord) format.

 ```text
@@ -437,14 +435,14 @@ options:
 ## Options and Parameters

-Parameters for training and evaluation can be set in file `config.py` and `finetune_eval_config.py` respectively.
+Parameters for pre-training and downstream tasks can be set in the corresponding yaml config files.

 ### Options

 ```text
 config for lossscale and etc.
     bert_network                    version of BERT model: base | nezha, default is base
-    batch_size                      batch size of input dataset: N, default is 16
+    batch_size                      batch size of input dataset: N, default is 32
     loss_scale_value                initial value of loss scale: N, default is 2^32
     scale_factor                    factor used to update loss scale: N, default is 2
     scale_window                    steps for once updatation of loss scale: N, default is 1000
@@ -690,7 +688,7 @@ The result will be as follows:
 We only support export with fine-tuned downstream task model and yaml config file, because the pretrained model is useless in inferences task.

 ```shell
-python export.py --config_path [../../*.yaml] --export_ckpt_file [CKPT_PATH] --export_file_name [FILE_NAME] --file_format [FILE_FORMAT]
+python export.py --config_path [/path/*.yaml] --export_ckpt_file [CKPT_PATH] --export_file_name [FILE_NAME] --file_format [FILE_FORMAT]
 ```

 - Export on ModelArts (If you want to run in modelarts, please check the official documentation of [modelarts](https://support.huaweicloud.com/modelarts/), and you can start as follows)
@@ -713,7 +711,7 @@ python export.py --config_path [../../*.yaml] --export_ckpt_file [CKPT_PATH] --e
 #          3. Add ”export_file_name=bert_ner“
 #          4. Add ”file_format=MINDIR“
 #          5. Add ”label_file_path:{path}/*.txt“('label_file_path' refers to the relative path relative to the folder selected in step 7.)
-# Finally, "config_path=../../*.yaml" must be added on the web page (select the *.yaml configuration file according to the downstream task)
+# Finally, "config_path=/path/*.yaml" must be added on the web page (select the *.yaml configuration file according to the downstream task)
 # (7) Check the "data storage location" on the website UI interface and set the "Dataset path" path.
 # (8) Set the "Output file path" and "Job log path" to your path on the website UI interface.
 # (9) Under the item "resource pool selection", select the specification of a single card.
@@ -753,7 +751,7 @@ Currently, the ONNX model of Bert classification task can be exported, and third
 - export ONNX

 ```shell
-python export.py --config_path [../../task_classifier_config.yaml] --file_format ["ONNX"] --export_ckpt_file [CKPT_PATH] --num_class [NUM_CLASS] --export_file_name [EXPORT_FILE_NAME]
+python export.py --config_path [/path/*.yaml] --file_format ["ONNX"] --export_ckpt_file [CKPT_PATH] --num_class [NUM_CLASS] --export_file_name [EXPORT_FILE_NAME]
 ```

 'CKPT_PATH' is mandatory, it is the path of the CKPT file that has been trained for a certain classification task model.
@@ -765,7 +763,7 @@ After running, the ONNX model of Bert will be saved in the current file director
 - Load ONNX and inference

 ```shell
-python run_eval_onnx.py --config_path [../../task_classifier_config.yaml] --eval_data_file_path [EVAL_DATA_FILE_PATH] -export_file_name [EXPORT_FILE_NAME]
+python run_eval_onnx.py --config_path [/path/*.yaml] --eval_data_file_path [EVAL_DATA_FILE_PATH] --export_file_name [EXPORT_FILE_NAME]
 ```

 'EVAL_DATA_FILE_PATH' is mandatory, it is the eval data of the dataset used by the classification task.
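As an illustration of the preprocessing flow described in the reworded dataset step of the README.md hunk above (not part of the patch): the flags below are those of create_pretraining_data.py from the google-research/bert repository, while all paths, the dump file name, and the output size/sequence length are placeholders chosen to match the cn-wiki-128 naming used elsewhere in the README.

```shell
# Illustrative sketch only: preprocess a Wikipedia dump into TFRecord for BERT pre-training.
pip install wikiextractor bert-tensorflow

# Extract plain text from the dump (command as documented in the README; paths are placeholders).
python -m wikiextractor.WikiExtractor -o extracted -b 100M zhwiki-latest-pages-articles.xml.bz2

# Convert the extracted text to TFRecord with the script from google-research/bert.
# vocab.txt comes from a released BERT checkpoint; seq length 128 matches the cn-wiki-128 dataset name.
python create_pretraining_data.py \
  --input_file=extracted/AA/wiki_00 \
  --output_file=/path/cn-wiki-128/wiki_00.tfrecord \
  --vocab_file=/path/vocab.txt \
  --do_lower_case=True \
  --max_seq_length=128 \
  --max_predictions_per_seq=20 \
  --masked_lm_prob=0.15 \
  --dupe_factor=5
```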
diff --git a/official/nlp/bert/README_CN.md b/official/nlp/bert/README_CN.md
index 3329b19b076029690270d924d67c24ff5ba65335..e54793df3ac6edfa2c2dfbf394f12e09b5c9ed68 100644
--- a/official/nlp/bert/README_CN.md
+++ b/official/nlp/bert/README_CN.md
@@ -63,7 +63,7 @@ BERT的主干结构为Transformer。对于BERT_base,Transformer包含12个编
     - 使用[WikiExtractor](https://github.com/attardi/wikiextractor)提取和整理数据集中的文本,使用步骤如下:
         - pip install wikiextractor
         - python -m wikiextractor.WikiExtractor -o <output file path> -b <output file size> <Wikipedia dump file>
-    - 将数据集转换为TFRecord格式。详见[BERT](https://github.com/google-research/bert)代码仓中的create_pretraining_data.py文件,同时下载对应的vocab.txt文件, 如果出现AttributeError: module 'tokenization' has no attribute 'FullTokenizer’,请安装bert-tensorflow。
+    - `WikiExtractor`提取出来的原始文本并不能直接使用,还需要将数据集预处理并转换为TFRecord格式。详见[BERT](https://github.com/google-research/bert)代码仓中的create_pretraining_data.py文件,同时下载对应的vocab.txt文件, 如果出现AttributeError: module 'tokenization' has no attribute 'FullTokenizer’,请安装bert-tensorflow。
 - 生成下游任务数据集
     - 下载数据集进行微调和评估,如[CLUENER](https://github.com/CLUEbenchmark/CLUENER2020)、[TNEWS](https://github.com/CLUEbenchmark/CLUE)、[SQuAD v1.1训练集](https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json)、[SQuAD v1.1验证集](https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json)等。
     - 将数据集文件从JSON格式转换为TFRecord格式。详见[BERT](https://github.com/google-research/bert)代码仓中的run_classifier.py或run_squad.py文件。
@@ -97,7 +97,7 @@ bash scripts/run_distributed_pretrain_ascend.sh /path/cn-wiki-128 /path/hccl.jso
 # 运行微调和评估示例
 - 如需运行微调任务,请先准备预训练生成的权重文件(ckpt)。
-- 在`finetune_eval_config.py`中设置BERT网络配置和优化器超参。
+- 在`task_[DOWNSTREAM_TASK]_config.yaml`中设置BERT网络配置和优化器超参。

 - 分类任务:在scripts/run_classifier.sh中设置任务相关的超参。
 - 运行`bash scripts/run_classifier.sh`,对BERT-base和BERT-NEZHA模型进行微调。
@@ -130,7 +130,7 @@ bash scripts/run_distributed_pretrain_for_gpu.sh 8 40 /path/cn-wiki-128
 # 运行微调和评估示例
 - 如需运行微调任务,请先准备预训练生成的权重文件(ckpt)。
-- 在`finetune_eval_config.py`中设置BERT网络配置和优化器超参。
+- 在`task_[DOWNSTREAM_TASK]_config.yaml`中设置BERT网络配置和优化器超参。

 - 分类任务:在scripts/run_classifier.sh中设置任务相关的超参。
 - 运行`bash scripts/run_classifier.sh`,对BERT-base和BERT-NEZHA模型进行微调。
@@ -187,7 +187,7 @@ bash scripts/run_distributed_pretrain_for_gpu.sh 8 40 /path/cn-wiki-128
 #          1. 添加 ”enable_modelarts=True“
 #          2. 添加其它参数,其它参数配置可以参考 './scripts/'下的 `run_ner.sh`或`run_squad.sh`或`run_classifier.sh`
 #          注意vocab_file_path,label_file_path,train_data_file_path,eval_data_file_path,schema_file_path填写相对于第7步所选路径的相对路径。
-#          最后必须在网页上添加 “config_path=../../*.yaml”(根据下游任务选择 *.yaml 配置文件)
+#          最后必须在网页上添加 “config_path=/path/*.yaml”(根据下游任务选择 *.yaml 配置文件)
 # (6) 上传你的 数据 到 s3 桶上
 # (7) 在网页上勾选数据存储位置,设置“训练数据集”路径(该路径下仅有 数据/数据zip压缩包)
 # (8) 在网页上设置“训练输出文件路径”、“作业日志路径”
@@ -198,11 +198,11 @@ bash scripts/run_distributed_pretrain_for_gpu.sh 8 40 /path/cn-wiki-128

 在Ascend设备上做分布式训练时,请提前创建JSON格式的HCCL配置文件。

-在Ascend设备上做单机分布式训练时,请参考[here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_single_machine_multi_rank.json)创建HCCL配置文件。
+在Ascend设备上做单机分布式训练时,请参考[hccl_tools](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools)创建HCCL配置文件。

-在Ascend设备上做多机分布式训练时,训练命令需要在很短的时间间隔内在各台设备上执行。因此,每台设备上都需要准备HCCL配置文件。请参考[here](https://gitee.com/mindspore/mindspore/tree/master/config/hccl_multi_machine_multi_rank.json)创建多机的HCCL配置文件。
+在Ascend设备上做多机分布式训练时,训练命令需要在很短的时间间隔内在各台设备上执行。因此,每台设备上都需要准备HCCL配置文件。请参考[merge_hccl](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools#merge_hccl)创建多机的HCCL配置文件。

-如需设置数据集格式和参数,请创建JSON格式的模式配置文件,详见[TFRecord](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/dataset_loading.html#tfrecord)格式。
+如需设置数据集格式和参数,请创建JSON格式的schema配置文件,详见[TFRecord](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/dataset_loading.html#tfrecord)格式。

 ```text
 For pretraining, schema file contains ["input_ids", "input_mask", "segment_ids", "next_sentence_labels", "masked_lm_positions", "masked_lm_ids", "masked_lm_weights"].
@@ -440,14 +440,14 @@ options:
 ## 选项及参数

-可以在`config.py`和`finetune_eval_config.py`文件中分别配置训练和评估参数。
+可以在yaml配置文件中分别配置预训练和下游任务的参数。

 ### 选项

 ```text
 config for lossscale and etc.
     bert_network                    BERT模型版本,可选项为base或nezha,默认为base
-    batch_size                      输入数据集的批次大小,默认为16
+    batch_size                      输入数据集的批次大小,默认为32
     loss_scale_value                损失放大初始值,默认为2^32
     scale_factor                    损失放大的更新因子,默认为2
     scale_window                    损失放大的一次更新步数,默认为1000
@@ -662,7 +662,7 @@ bash scripts/squad.sh
 - 在本地导出

 ```shell
-python export.py --config_path [../../*.yaml] --export_ckpt_file [CKPT_PATH] --export_file_name [FILE_NAME] --file_format [FILE_FORMAT]
+python export.py --config_path [/path/*.yaml] --export_ckpt_file [CKPT_PATH] --export_file_name [FILE_NAME] --file_format [FILE_FORMAT]
 ```

 - 在ModelArts上导出
@@ -686,7 +686,7 @@ python export.py --config_path [../../*.yaml] --export_ckpt_file [CKPT_PATH] --e
 #          3. 添加 ”export_file_name=bert_ner“
 #          4. 添加 ”file_format=MINDIR“
 #          5. 添加 ”label_file_path:{path}/*.txt“('label_file_path'指相对于第7步所选文件夹的相对路径)
-# 最后必须在网页上添加 “config_path=../../*.yaml”(根据下游任务选择 *.yaml 配置文件)
+# 最后必须在网页上添加 “config_path=/path/*.yaml”(根据下游任务选择 *.yaml 配置文件)
 # (7) 在网页上勾选数据存储位置,设置“训练数据集”路径
 # (8) 在网页上设置“训练输出文件路径”、“作业日志路径”
 # (9) 在网页上的’资源池选择‘项目下, 选择单卡规格的资源
@@ -729,7 +729,7 @@ F1 0.931243
 - 导出ONNX

 ```shell
-python export.py --config_path [../../task_classifier_config.yaml] --file_format ["ONNX"] --export_ckpt_file [CKPT_PATH] --num_class [NUM_CLASS] --export_file_name [EXPORT_FILE_NAME]
+python export.py --config_path [/path/*.yaml] --file_format ["ONNX"] --export_ckpt_file [CKPT_PATH] --num_class [NUM_CLASS] --export_file_name [EXPORT_FILE_NAME]
 ```

 `CKPT_PATH`为必选项, 是某个分类任务模型训练完毕的ckpt文件路径。
@@ -741,7 +741,7 @@ python export.py --config_path [../../task_classifier_config.yaml] --file_format
 - 加载ONNX并推理

 ```shell
-python run_eval_onnx.py --config_path [../../task_classifier_config.yaml] --eval_data_file_path [EVAL_DATA_FILE_PATH] --export_file_name [EXPORT_FILE_NAME]
+python run_eval_onnx.py --config_path [/path/*.yaml] --eval_data_file_path [EVAL_DATA_FILE_PATH] --export_file_name [EXPORT_FILE_NAME]
 ```

 `EVAL_DATA_FILE_PATH`为必选项, 是该分类任务所用数据集的eval数据。
diff --git a/official/nlp/bert/scripts/run_distributed_pretrain_ascend.sh b/official/nlp/bert/scripts/run_distributed_pretrain_ascend.sh
index 606e0922ce457663254cc6c0a11d6c85150cf05c..28700f8901f72d1c5cd6b2fbbd83adbb8a759d14 100644
--- a/official/nlp/bert/scripts/run_distributed_pretrain_ascend.sh
+++ b/official/nlp/bert/scripts/run_distributed_pretrain_ascend.sh
@@ -16,8 +16,8 @@

 echo "=============================================================================================================="
 echo "Please run the script as: "
-echo "bash run_distributed_pretrain_ascend.sh DATA_DIR RANK_TABLE_FILE"
-echo "for example: bash run_distributed_pretrain_ascend.sh /path/dataset /path/hccl.json"
+echo "bash scripts/run_distributed_pretrain_ascend.sh DATA_DIR RANK_TABLE_FILE"
+echo "for example: bash scripts/run_distributed_pretrain_ascend.sh /path/dataset /path/hccl.json"
 echo "It is better to use absolute path."
 echo "For hyper parameter, please note that you should customize the scripts: '{CUR_DIR}/scripts/ascend_distributed_launcher/hyper_parameter_config.ini' "
diff --git a/official/nlp/bert/scripts/run_distributed_pretrain_for_gpu.sh b/official/nlp/bert/scripts/run_distributed_pretrain_for_gpu.sh
index 8d0fccd26c5e449c8a208d6780783b5e9794cc65..770dab31195c38fd8a1e77bc93c5cc643399b20e 100644
--- a/official/nlp/bert/scripts/run_distributed_pretrain_for_gpu.sh
+++ b/official/nlp/bert/scripts/run_distributed_pretrain_for_gpu.sh
@@ -16,8 +16,8 @@

 echo "=============================================================================================================="
 echo "Please run the script as: "
-echo "bash run_distributed_pretrain.sh DEVICE_NUM EPOCH_SIZE DATA_DIR SCHEMA_DIR"
-echo "for example: bash run_distributed_pretrain.sh 8 40 /path/zh-wiki/ /path/Schema.json"
+echo "bash scripts/run_distributed_pretrain.sh DEVICE_NUM EPOCH_SIZE DATA_DIR SCHEMA_DIR"
+echo "for example: bash scripts/run_distributed_pretrain.sh 8 40 /path/zh-wiki/ [/path/Schema.json](optional)"
 echo "It is better to use absolute path."
 echo "=============================================================================================================="
diff --git a/official/nlp/bert/scripts/run_standalone_pretrain_ascend.sh b/official/nlp/bert/scripts/run_standalone_pretrain_ascend.sh
index 329958a08bb52a94b44af2eb4be0cffb8db08b77..f81ed7bab7671f0769c3a2847dd4fd7a66eae118 100644
--- a/official/nlp/bert/scripts/run_standalone_pretrain_ascend.sh
+++ b/official/nlp/bert/scripts/run_standalone_pretrain_ascend.sh
@@ -16,8 +16,8 @@

 echo "=============================================================================================================="
 echo "Please run the script as: "
-echo "bash run_standalone_pretrain_ascend.sh DEVICE_ID EPOCH_SIZE DATA_DIR SCHEMA_DIR"
-echo "for example: bash run_standalone_pretrain_ascend.sh 0 40 /path/zh-wiki/ /path/Schema.json"
+echo "bash scripts/run_standalone_pretrain_ascend.sh DEVICE_ID EPOCH_SIZE DATA_DIR SCHEMA_DIR"
+echo "for example: bash scripts/run_standalone_pretrain_ascend.sh 0 40 /path/zh-wiki/ [/path/Schema.json](optional)"
 echo "=============================================================================================================="

 DEVICE_ID=$1
diff --git a/official/nlp/bert/scripts/run_standalone_pretrain_for_gpu.sh b/official/nlp/bert/scripts/run_standalone_pretrain_for_gpu.sh
index 74f8e7846285dc70dca72def60a7d40827ef6998..eb7cc6a589d99fb14d983edcee618013236717f1 100644
--- a/official/nlp/bert/scripts/run_standalone_pretrain_for_gpu.sh
+++ b/official/nlp/bert/scripts/run_standalone_pretrain_for_gpu.sh
@@ -16,8 +16,8 @@

 echo "=============================================================================================================="
 echo "Please run the script as: "
-echo "bash run_standalone_pretrain.sh DEVICE_ID EPOCH_SIZE DATA_DIR SCHEMA_DIR"
-echo "for example: bash run_standalone_pretrain.sh 0 40 /path/zh-wiki/ /path/Schema.json"
+echo "bash scripts/run_standalone_pretrain.sh DEVICE_ID EPOCH_SIZE DATA_DIR SCHEMA_DIR"
+echo "for example: bash scripts/run_standalone_pretrain.sh 0 40 /path/zh-wiki/ [/path/Schema.json](optional)"
 echo "=============================================================================================================="

 DEVICE_ID=$1