Commit 8c209046 authored by anzhengqi

modify readme of network tinybert and HRNet

parent 2af5d3d0
......@@ -50,14 +50,63 @@ The backbone structure of TinyBERT is transformer, the transformer contains four
# [Dataset](#contents)
- Create dataset for general distill phase
- Download the [enwiki](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) dataset for pre-training.
- Extract and refine texts in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). The commands are as follows:
- pip install wikiextractor
- python -m wikiextractor.WikiExtractor [Wikipedia dump file] -o [output file path] -b 2G
- Download the [BERT](https://github.com/google-research/bert) repository and [BERT-Base, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip), which contains `vocab.txt`, `bert_config.json`, and the pre-trained model.
- Use `create_pretraining_data.py` to convert the extracted text to TFRecord format, as described in the BERT readme (see the sketch after the conversion block below). If `AttributeError: module 'tokenization' has no attribute 'FullTokenizer'` occurs, install bert-tensorflow.
- Convert the TensorFlow model to a MindSpore model
```bash
cd scripts/ms2tf
python ms_and_tf_checkpoint_transfer_tools.py --tf_ckpt_path=PATH/model.ckpt \
                                               --new_ckpt_path=PATH/ms_model_ckpt.ckpt \
                                               --transfer_option=tf2ms
# Note: a TensorFlow checkpoint consists of three files (.data, .index and .meta); pass tf_ckpt_path up to and including the *.ckpt prefix
```
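The exact `create_pretraining_data.py` usage is documented in the BERT readme; the block below is only a rough sketch of that step, assuming the BERT repository files and the WikiExtractor output sit in the current directory. The paths and hyper-parameter values are placeholders (BERT's usual defaults), not values prescribed by TinyBERT.

```bash
# Sketch only: convert one extracted wiki text file into a TFRecord shard.
# Adjust the paths and values to your own layout and pre-training config.
pip install bert-tensorflow   # provides the `tokenization` module used by the script
python create_pretraining_data.py \
    --input_file=extracted/AA/wiki_00 \
    --output_file=bert0.tfrecord \
    --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --dupe_factor=5
```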
- Create dataset for task distill phase
- Download the [GLUE](https://github.com/nyu-mll/GLUE-baselines) dataset for the task distill phase; use `download_glue_data.py` to download the SST-2, MNLI, and QNLI datasets.
- Convert the datasets to TFRecord format with `run_classifier.py` from the [BERT](https://github.com/google-research/bert) repository, following its readme (see the invocation sketch after the code block below). To convert the SST-2 dataset, add the code from [PR:327](https://github.com/google-research/bert/pull/327); to convert the QNLI dataset, add the following code by analogy with SST-2. The training, evaluation, and prediction parts of `run_classifier.py` are not needed for conversion and can be commented out. Set task_name to SST2, bert_config_file to `bert_config.json`, and max_seq_length to 64.
```python
...
class QnliProcessor(DataProcessor):
  """Processor for the QNLI data set (GLUE version)."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")),
        "dev_matched")

  def get_labels(self):
    """See base class."""
    return ["entailment", "not_entailment"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      if i == 0:
        continue
      guid = "%s-%s" % (set_type, line[0])
      text_a = line[1]
      text_b = line[2]
      label = line[-1]
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples
...
    "qnli": QnliProcessor,
...
```
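As a rough guide only, once the training/evaluation/prediction code has been commented out, the conversion can be launched per task roughly as follows. This is a hedged sketch: the paths are placeholders, and it assumes the flag parsing of `run_classifier.py` is kept unchanged.

```bash
# Sketch only: produce TFRecord files for one GLUE task with the modified run_classifier.py.
python run_classifier.py \
    --task_name=SST2 \
    --data_dir=glue_data/SST-2 \
    --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
    --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json \
    --max_seq_length=64 \
    --output_dir=./sst2_tfrecord
# Repeat with --task_name=QNLI (after adding QnliProcessor) and --task_name=MNLI.
```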
# [Environment Requirements](#contents)
......@@ -587,6 +636,7 @@ Inference result is saved in current path, you can find result like this in acc.
#### Inference Performance
> SST2 dataset
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
......@@ -597,7 +647,9 @@ Inference result is saved in current path, you can find result like this in acc.
| batch_size | 32 | 32 |
| Accuracy | 0.902777 | 0.9086 |
| Model for inference | 74M(.ckpt file) | 74M(.ckpt file) |
> QNLI dataset
| Parameters | Ascend | GPU |
| -------------- | ----------------------------- | ------------------------- |
| Model Version | | |
......@@ -608,7 +660,9 @@ Inference result is saved in current path, you can find result like this in acc.
| batch_size | 32 | 32 |
| Accuracy | 0.8860 | 0.8755 |
| Model for inference | 74M(.ckpt file) | 74M(.ckpt file) |
> MNLI dataset
| Parameters | Ascend | GPU |
| -------------- | ----------------------------- | ------------------------- |
| Model Version | | |
......
......@@ -55,14 +55,63 @@ The backbone structure of the TinyBERT model is the transformer, which contains four encoder modules
# Dataset
- Create the dataset for the general distill phase
- Download the [enwiki](https://dumps.wikimedia.org/enwiki/latest/enwiki-latest-pages-articles.xml.bz2) dataset for pre-training.
- Extract and refine texts in the dataset with [WikiExtractor](https://github.com/attardi/wikiextractor). The steps are as follows:
- pip install wikiextractor
- python -m wikiextractor.WikiExtractor [Wikipedia dump file] -o [output file path] -b 2G
- Download the [BERT](https://github.com/google-research/bert) repository and [BERT-Base, Uncased](https://storage.googleapis.com/bert_models/2018_10_18/uncased_L-12_H-768_A-12.zip), which contains the `vocab.txt`, `bert_config.json`, and pre-trained model needed for the conversion.
- Use `create_pretraining_data.py` to convert the downloaded files into TFRecord datasets; see the BERT readme for detailed usage, and the sketch after the conversion block below. Step 2 produces multiple text files to use as input_file; convert them into `bert0.tfrecord-bertx.tfrecord`. If `AttributeError: module 'tokenization' has no attribute 'FullTokenizer'` occurs, install bert-tensorflow.
- Convert the downloaded TensorFlow model to a MindSpore model
```bash
cd scripts/ms2tf
python ms_and_tf_checkpoint_transfer_tools.py --tf_ckpt_path=PATH/model.ckpt \
                                               --new_ckpt_path=PATH/ms_model_ckpt.ckpt \
                                               --transfer_option=tf2ms
# Note: a TensorFlow checkpoint consists of three files (.data, .index and .meta); pass tf_ckpt_path up to and including the *.ckpt prefix
```
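For reference, a minimal sketch of the `create_pretraining_data.py` step described above; the paths, output name, and hyper-parameter values are placeholders rather than values prescribed by TinyBERT, and the BERT repository is assumed to be in the current directory.

```bash
# Sketch only: convert one WikiExtractor output file into a TFRecord shard (bert0.tfrecord, ...).
pip install bert-tensorflow   # provides the `tokenization` module used by the script
python create_pretraining_data.py \
    --input_file=extracted/AA/wiki_00 \
    --output_file=bert0.tfrecord \
    --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
    --do_lower_case=True \
    --max_seq_length=128 \
    --max_predictions_per_seq=20 \
    --masked_lm_prob=0.15 \
    --dupe_factor=5
```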
- Create the dataset for the task distill phase
- Download datasets for fine-tuning and evaluation, such as [GLUE](https://github.com/nyu-mll/GLUE-baselines); use the `download_glue_data.py` script to download the SST-2, MNLI, and QNLI datasets.
- Convert the dataset files to TFRecord format using `run_classifier.py` from the BERT repository downloaded in step 3 of the general distill phase, following its readme (see the invocation sketch after the code block below). Converting the SST-2 dataset requires [PR:327](https://github.com/google-research/bert/pull/327), which is unmerged and must be added to the code manually; converting the QNLI dataset requires inserting the following code at the appropriate places in `run_classifier.py`, by analogy with SST-2. The training, evaluation, and prediction code in `run_classifier.py` is not needed for the conversion and can be commented out, keeping only the dataset-conversion code. Set task_name to SST2, bert_config_file to the `bert_config.json` downloaded in the general distill phase, and max_seq_length to 64.
```python
...
class QnliProcessor(DataProcessor):
  """Processor for the QNLI data set (GLUE version)."""

  def get_train_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")

  def get_dev_examples(self, data_dir):
    """See base class."""
    return self._create_examples(
        self._read_tsv(os.path.join(data_dir, "dev.tsv")),
        "dev_matched")

  def get_labels(self):
    """See base class."""
    return ["entailment", "not_entailment"]

  def _create_examples(self, lines, set_type):
    """Creates examples for the training and dev sets."""
    examples = []
    for (i, line) in enumerate(lines):
      if i == 0:
        continue
      guid = "%s-%s" % (set_type, line[0])
      text_a = line[1]
      text_b = line[2]
      label = line[-1]
      examples.append(
          InputExample(guid=guid, text_a=text_a, text_b=text_b, label=label))
    return examples
...
    "qnli": QnliProcessor,
...
```
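Again as a rough guide only, the per-task conversion can then be launched roughly as follows; this is a hedged sketch (placeholder paths, and it assumes the flag parsing of `run_classifier.py` is left intact while the training/evaluation/prediction code is commented out).

```bash
# Sketch only: produce TFRecord files for one GLUE task with the modified run_classifier.py.
python run_classifier.py \
    --task_name=SST2 \
    --data_dir=glue_data/SST-2 \
    --vocab_file=uncased_L-12_H-768_A-12/vocab.txt \
    --bert_config_file=uncased_L-12_H-768_A-12/bert_config.json \
    --max_seq_length=64 \
    --output_dir=./sst2_tfrecord
# Repeat with --task_name=QNLI (after adding QnliProcessor) and --task_name=MNLI.
```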
# Environment Requirements
......@@ -583,6 +632,7 @@ bash run_infer_310.sh [MINDIR_PATH] [DATASET_PATH] [SCHEMA_DIR] [DATASET_TYPE] [
#### Inference Performance
> SST2 dataset
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------- | ------------------------- |
| Model Version | | |
......@@ -593,7 +643,9 @@ bash run_infer_310.sh [MINDIR_PATH] [DATASET_PATH] [SCHEMA_DIR] [DATASET_TYPE] [
| batch_size | 32 | 32 |
| Accuracy | 0.902777 | 0.9086 |
| Model for inference | 74M (.ckpt file) | 74M (.ckpt file) |
> QNLI dataset
| Parameters | Ascend | GPU |
| -------------- | ----------------------------- | ------------------------- |
| Model Version | | |
......@@ -604,7 +656,9 @@ bash run_infer_310.sh [MINDIR_PATH] [DATASET_PATH] [SCHEMA_DIR] [DATASET_TYPE] [
| batch_size | 32 | 32 |
| Accuracy | 0.8860 | 0.8755 |
| Model for inference | 74M (.ckpt file) | 74M (.ckpt file) |
> MNLI dataset
| Parameters | Ascend | GPU |
| -------------- | ----------------------------- | ------------------------- |
| Model Version | | |
......
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
mindspore and tensorflow checkpoint transfer tools
You only need a tf checkpoint to create a mindspore one while using 'tf2ms'.
But you need both two: an existed tf checkpoint and a mindspore one, while using 'ms2tf'.
example:
python ms_and_tf_checkpoint_transfer_tools.py \
--tf_ckpt_path=./model.ckpt-28252 \
--new_ckpt_path=./new_ckpt.ckpt \
--tarnsfer_option=tf2ms
"""
import argparse
import tensorflow as tf
from mindspore.common.tensor import Tensor
from mindspore.train.serialization import load_checkpoint, save_checkpoint
from ms2tf_config import param_name_dict as ms2tf_param_dict, transpose_list

def convert_ms_2_tf(tf_ckpt_path, ms_ckpt_path, new_ckpt_path):
    """
    convert ms checkpoint to tf checkpoint
    """
    # load the MS checkpoint and pull each parameter out as a numpy array
    ms_param_dict = load_checkpoint(ms_ckpt_path)
    for name in ms_param_dict.keys():
        if isinstance(ms_param_dict[name].data, Tensor):
            ms_param_dict[name] = ms_param_dict[name].data.asnumpy()

    convert_count = 0
    with tf.Session() as sess:
        # convert ms shape to tf
        print("start convert parameter ...")
        new_var_list = []
        for var_name, shape in tf.contrib.framework.list_variables(tf_ckpt_path):
            if var_name in ms2tf_param_dict:
                # the variable has a MindSpore counterpart: use its value, transposing if required
                ms_name = ms2tf_param_dict[var_name]
                new_tensor = tf.convert_to_tensor(ms_param_dict[ms_name])
                if ms_name in transpose_list:
                    new_tensor = tf.transpose(new_tensor, (1, 0))
                if new_tensor.shape != tuple(shape):
                    raise ValueError("shape is not matched after transpose!! {}, {}"
                                     .format(str(new_tensor.shape), str(tuple(shape))))
                var = tf.Variable(new_tensor, name=var_name)
                convert_count = convert_count + 1
            else:
                # no mapping: copy the variable unchanged from the original tf checkpoint
                var = tf.Variable(tf.contrib.framework.load_variable(tf_ckpt_path, var_name), name=var_name)
            new_var_list.append(var)
        print('convert value num: ', convert_count, " of ", len(ms2tf_param_dict))

        # saving tf checkpoint
        print("start saving ...")
        saver = tf.train.Saver(var_list=new_var_list)
        sess.run(tf.global_variables_initializer())
        saver.save(sess, new_ckpt_path)
        print("tf checkpoint was saved in :", new_ckpt_path)
    return True


def convert_tf_2_ms(tf_ckpt_path, ms_ckpt_path, new_ckpt_path):
    """
    convert tf checkpoint to ms checkpoint
    """
    # invert the ms -> tf name mapping so tf variable names can be looked up from ms names
    tf2ms_param_dict = dict(zip(ms2tf_param_dict.values(), ms2tf_param_dict.keys()))
    new_params_list = []
    flag_tf1 = tf.__version__[0] == '1'
    session = tf.compat.v1.Session()
    count = 0
    for ms_name in tf2ms_param_dict.keys():
        count += 1
        param_dict = {}
        tf_name = tf2ms_param_dict[ms_name]
        data = tf.train.load_variable(tf_ckpt_path, tf_name)
        if ms_name in transpose_list:
            # transposed weights must be evaluated back to numpy before wrapping in a Tensor
            data = tf.transpose(data, (1, 0))
            data = data.eval(session=session) if flag_tf1 else data.numpy()
        param_dict['name'] = ms_name
        param_dict['data'] = Tensor(data)
        new_params_list.append(param_dict)

    print("start saving checkpoint ...")
    save_checkpoint(new_params_list, new_ckpt_path)
    print("ms checkpoint was saved in :", new_ckpt_path)
    return True


def main():
    """
    tf checkpoint transfer to ms or ms checkpoint transfer to tf
    """
    parser = argparse.ArgumentParser(description='checkpoint transfer.')
    parser.add_argument("--tf_ckpt_path", type=str, default='./tf-bert/bs64k_32k_ckpt_model.ckpt-28252',
                        help="TensorFlow checkpoint dir, default is: './tf-bert/bs64k_32k_ckpt_model.ckpt-28252'.")
    parser.add_argument("--ms_ckpt_path", type=str, default='./ms-bert/large_en.ckpt',
                        help="MindSpore checkpoint dir, default is: './ms-bert/large_en.ckpt'.")
    parser.add_argument("--new_ckpt_path", type=str, default='./new_ckpt/new_bert_large_en.ckpt',
                        help="New checkpoint dir, default is: './new_ckpt/new_bert_large_en.ckpt'.")
    parser.add_argument("--transfer_option", type=str, default='ms2tf', choices=['ms2tf', 'tf2ms'],
                        help="option of transfer ms2tf or tf2ms, default is ms2tf.")
    args_opt = parser.parse_args()

    if args_opt.transfer_option == 'ms2tf':
        print("start ms2tf option ...")
        tf_ckpt_path = args_opt.tf_ckpt_path
        ms_ckpt_path = args_opt.ms_ckpt_path
        new_ckpt_path = args_opt.new_ckpt_path
        convert_ms_2_tf(tf_ckpt_path, ms_ckpt_path, new_ckpt_path)
    elif args_opt.transfer_option == 'tf2ms':
        print("start tf2ms option ...")
        tf_ckpt_path = args_opt.tf_ckpt_path
        ms_ckpt_path = args_opt.ms_ckpt_path
        new_ckpt_path = args_opt.new_ckpt_path
        convert_tf_2_ms(tf_ckpt_path, ms_ckpt_path, new_ckpt_path)


if __name__ == "__main__":
    main()
......@@ -308,10 +308,6 @@ bash scripts/run_eval.sh [DEVICE_ID] [DATASET_PATH] [CHECKPOINT_PATH]
python torch2mindspore.py --pth_path [TORCH_MODEL_PATH] --ckpt_path [OUTPUT_MODEL_PATH]
```
2. Train it yourself
HRNet image classification has a MindSpore implementation with hyper-parameters that reach the target accuracy, so you can train a pre-trained model yourself. For details, see: [HRNet image classification - MindSpore implementation](https://git.openi.org.cn/OpenModelZoo/HRNet-cls)
### Usage
#### Running on Ascend
......