diff --git a/official/cv/mobilenetv2_quant/Dockerfile b/official/cv/mobilenetv2_quant/Dockerfile
deleted file mode 100644
index 053bf80cd2309a41b6033b3e8d5ab4f87d41bd5e..0000000000000000000000000000000000000000
--- a/official/cv/mobilenetv2_quant/Dockerfile
+++ /dev/null
@@ -1,5 +0,0 @@
-ARG FROM_IMAGE_NAME
-FROM ${FROM_IMAGE_NAME}
-
-COPY requirements.txt .
-RUN pip3.7 install -r requirements.txt
\ No newline at end of file
diff --git a/official/cv/mobilenetv2_quant/scripts/docker_start.sh b/official/cv/mobilenetv2_quant/scripts/docker_start.sh
deleted file mode 100644
index 2e237d159869ed871ae32b4e8af8e7f1499c7182..0000000000000000000000000000000000000000
--- a/official/cv/mobilenetv2_quant/scripts/docker_start.sh
+++ /dev/null
@@ -1,35 +0,0 @@
-#!/bin/bash
-# Copyright 2022 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-
-docker_image=$1
-data_dir=$2
-model_dir=$3
-
-docker run -it --ipc=host \
-              --device=/dev/davinci0 \
-              --device=/dev/davinci1 \
-              --device=/dev/davinci2 \
-              --device=/dev/davinci3 \
-              --device=/dev/davinci4 \
-              --device=/dev/davinci5 \
-              --device=/dev/davinci6 \
-              --device=/dev/davinci7 \
-              --device=/dev/davinci_manager \
-              --device=/dev/devmm_svm --device=/dev/hisi_hdc \
-              -v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-              -v /usr/local/Ascend/add-ons/:/usr/local/Ascend/add-ons/ \
-              -v ${model_dir}:${model_dir} \
-              -v ${data_dir}:${data_dir}  \
-              -v /root/ascend/log:/root/ascend/log ${docker_image} /bin/bash
\ No newline at end of file
diff --git a/research/nlp/ternarybert/README.md b/research/nlp/ternarybert/README.md
deleted file mode 100644
index 4f9cb8c16b99070b131acf32ad027af0af1343df..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/README.md
+++ /dev/null
@@ -1,312 +0,0 @@
-
-# Contents
-
-- [Contents](#contents)
-- [TernaryBERT Description](#ternarybert-description)
-- [Model Architecture](#model-architecture)
-- [Dataset](#dataset)
-- [Environment Requirements](#environment-requirements)
-- [Quick Start](#quick-start)
-- [Script Description](#script-description)
-    - [Script and Sample Code](#script-and-sample-code)
-    - [Script Parameters](#script-parameters)
-        - [Train](#train)
-        - [Eval](#eval)
-    - [Options and Parameters](#options-and-parameters)
-        - [Parameters](#parameters)
-    - [Training Process](#training-process)
-        - [Training](#training)
-    - [Evaluation Process](#evaluation-process)
-        - [Evaluation](#evaluation)
-            - [evaluation on STS-B dataset](#evaluation-on-sts-b-dataset)
-    - [Model Description](#model-description)
-    - [Performance](#performance)
-        - [Training Performance](#training-performance)
-        - [Inference Performance](#inference-performance)
-- [Description of Random Situation](#description-of-random-situation)
-- [ModelZoo Homepage](#modelzoo-homepage)
-
-# [TernaryBERT Description](#contents)
-
-[TernaryBERT](https://arxiv.org/abs/2009.12812) ternarizes the weights of a fine-tuned [BERT](https://arxiv.org/abs/1810.04805) or [TinyBERT](https://arxiv.org/abs/1909.10351) model and achieves competitive performance on natural language processing tasks. According to the paper, TernaryBERT outperforms other BERT quantization methods and achieves performance comparable to the full-precision model while being 14.9x smaller.
-
-[Paper](https://arxiv.org/abs/2009.12812): Wei Zhang, Lu Hou, Yichun Yin, Lifeng Shang, Xiao Chen, Xin Jiang and Qun Liu. [TernaryBERT: Distillation-aware Ultra-low Bit BERT](https://arxiv.org/abs/2009.12812). arXiv preprint arXiv:2009.12812.
-
-# [Model Architecture](#contents)
-
-The backbone of TernaryBERT is a transformer with six encoder modules. Each encoder contains one self-attention module, and each self-attention module contains one attention module. The pretrained teacher and student models are provided [here](https://download.mindspore.cn/model_zoo/research/nlp/ternarybert/).
-
-# [Dataset](#contents)
-
-- Download the GLUE dataset for task distillation. To convert the dataset files from JSON format to TFRecord format, refer to run_classifier.py in the [BERT](https://github.com/google-research/bert) repository.
-
-# [Environment Requirements](#contents)
-
-- Hardware (GPU)
-    - Prepare hardware environment with GPU processor.
-- Framework
-    - [MindSpore](https://gitee.com/mindspore/mindspore)
-- For more information, please check the resources below:
-    - [MindSpore Tutorials](https://www.mindspore.cn/tutorials/en/master/index.html)
-    - [MindSpore Python API](https://www.mindspore.cn/docs/api/en/master/index.html)
-- Software:
-    - sklearn
-
-# [Quick Start](#contents)
-
-After installing MindSpore via the official website, you can start training and evaluation as follows:
-
-```bash
-
-# run training example
-
-bash scripts/run_train.sh [TASK_NAME] [DEVICE_TARGET] [TEACHER_MODEL_DIR] [STUDENT_MODEL_DIR] [DATA_DIR] [DEVICE_ID](optional)
-
-# The script reads TASK_NAME, DEVICE_TARGET, TEACHER_MODEL_DIR, STUDENT_MODEL_DIR, DATA_DIR and the optional DEVICE_ID (default 0) from its positional arguments.
-
-# run evaluation example
-
-bash scripts/run_eval.sh [TASK_NAME] [DEVICE_TARGET] [MODEL_DIR] [DATA_DIR]
-
-# The script reads TASK_NAME, DEVICE_TARGET, MODEL_DIR and DATA_DIR from its four positional arguments.
-```
-
-# [Script Description](#contents)
-
-## [Script and Sample Code](#contents)
-
-```text
-
-.
-└─ternarybert
-  ├─README.md
-  ├─scripts
-    ├─run_train.sh                  # shell script for training phase
-    ├─run_eval.sh                   # shell script for evaluation phase
-  ├─src
-    ├─__init__.py
-    ├─assessment_method.py          # assessment method for evaluation
-    ├─cell_wrapper.py               # cell for training
-    ├─config.py                     # parameter configuration for training and evaluation phase
-    ├─dataset.py                    # data processing
-    ├─quant.py                      # function for quantization
-    ├─tinybert_model.py             # backbone code of network
-    ├─utils.py                      # util function
-  ├─__init__.py
-  ├─train.py                        # train net for task distillation
-  ├─eval.py                         # evaluate net after task distillation
-
-```
-
-## [Script Parameters](#contents)
-
-### Train
-
-```text
-
-usage: train.py    [-h] [--device_target GPU] [--do_eval {true,false}] [--epoch_size EPOCH_SIZE]
-                   [--device_id DEVICE_ID] [--do_shuffle {true,false}] [--enable_data_sink {true,false}] [--save_ckpt_step SAVE_CKPT_STEP]
-                   [--eval_ckpt_step EVAL_CKPT_STEP] [--max_ckpt_num MAX_CKPT_NUM] [--data_sink_steps DATA_SINK_STEPS]
-                   [--teacher_model_dir TEACHER_MODEL_DIR] [--student_model_dir STUDENT_MODEL_DIR] [--data_dir DATA_DIR]
-                   [--output_dir OUTPUT_DIR] [--task_name {sts-b,qnli,mnli}] [--dataset_type DATASET_TYPE] [--seed SEED]
-                   [--train_batch_size TRAIN_BATCH_SIZE] [--eval_batch_size EVAL_BATCH_SIZE]
-
-options:
-    --device_target                 Device on which the code will run: "GPU"
-    --do_eval                       Do eval task during training or not: "true" | "false", default is "true"
-    --epoch_size                    Epoch size for train phase: N, default is 3
-    --device_id                     Device id: N, default is 0
-    --do_shuffle                    Enable shuffle for train dataset: "true" | "false", default is "true"
-    --enable_data_sink              Enable data sink: "true" | "false", default is "true"
-    --save_ckpt_step                If do_eval is false, a checkpoint is saved every save_ckpt_step steps: N, default is 50
-    --eval_ckpt_step                If do_eval is true, evaluation is run every eval_ckpt_step steps: N, default is 50
-    --max_ckpt_num                  The number of checkpoints will not be larger than max_ckpt_num: N, default is 50
-    --data_sink_steps               Sink steps for each epoch: N, default is 1
-    --teacher_model_dir             The checkpoint directory of teacher model: PATH, default is ""
-    --student_model_dir             The checkpoint directory of student model: PATH, default is ""
-    --data_dir                      Data directory: PATH, default is ""
-    --output_dir                    The output checkpoint directory: PATH, default is "./"
-    --task_name                     The name of the task to train: "sts-b" | "qnli" | "mnli", default is "sts-b"
-    --dataset_type                  The type of the dataset: "tfrecord" | "mindrecord", default is "tfrecord"
-    --seed                          The random seed: N, default is 1
-    --train_batch_size              Batch size for training: N, default is 16
-    --eval_batch_size               Eval Batch size in callback: N, default is 32
-
-```
-
-### Eval
-
-```text
-
-usage: eval.py    [-h] [--device_target GPU] [--device_id DEVICE_ID] [--model_dir MODEL_DIR] [--data_dir DATA_DIR]
-                  [--task_name {sts-b,qnli,mnli}] [--dataset_type DATASET_TYPE] [--batch_size BATCH_SIZE]
-
-options:
-    --device_target                 Device on which the code will run: "GPU"
-    --device_id                     Device id: N, default is 0
-    --model_dir                     The checkpoint directory of model: PATH, default is ""
-    --data_dir                      Data directory: PATH, default is ""
-    --task_name                     The name of the task to evaluate: "sts-b" | "qnli" | "mnli", default is "sts-b"
-    --dataset_type                  The type of the dataset: "tfrecord" | "mindrecord", default is "tfrecord"
-    --batch_size                    Batch size for evaluating: N, default is 32
-
-```
-
-## Parameters
-
-`config.py` contains the parameters for the GLUE tasks, training, the optimizer, evaluation, the teacher BERT model and the student BERT model.
-
-```text
-
-Parameters for glue task:
-    num_labels                      the number of labels: N
-    seq_length                      length of input sequence: N
-    task_type                       the type of task: "classification" | "regression"
-    metrics                         the eval metric for task: Accuracy | F1 | Pearsonr | Matthews
-
-Parameters for train:
-    batch_size                      batch size of input dataset: N, default is 16
-    loss_scale_value                initial value of loss scale: N, default is 2^16
-    scale_factor                    factor used to update loss scale: N, default is 2
-    scale_window                    number of steps between loss scale updates: N, default is 50
-
-Parameters for optimizer:
-    learning_rate                   value of learning rate: Q, default is 5e-5
-    end_learning_rate               value of end learning rate: Q, must be positive, default is 1e-14
-    power                           power: Q, default is 1.0
-    weight_decay                    weight decay: Q, default is 1e-4
-    eps                             term added to the denominator to improve numerical stability: Q, default is 1e-6
-    warmup_ratio                    the ratio of warmup steps to total steps: Q, default is 0.1
-
-Parameters for eval:
-    batch_size                      batch size of input dataset: N, default is 32
-
-Parameters for teacher bert network:
-    seq_length                      length of input sequence: N, default is 128
-    vocab_size                      size of the vocabulary: N, must be consistent with the dataset you use. Default is 30522
-    hidden_size                     size of bert encoder layers: N
-    num_hidden_layers               number of hidden layers: N
-    num_attention_heads             number of attention heads: N, default is 12
-    intermediate_size               size of intermediate layer: N
-    hidden_act                      activation function used: ACTIVATION, default is "gelu"
-    hidden_dropout_prob             dropout probability for BertOutput: Q
-    attention_probs_dropout_prob    dropout probability for BertAttention: Q
-    max_position_embeddings         maximum length of sequences: N, default is 512
-    save_ckpt_step                  step interval for saving checkpoints: N, default is 100
-    max_ckpt_num                    maximum number for saving checkpoint: N, default is 1
-    type_vocab_size                 size of token type vocab: N, default is 2
-    initializer_range               initialization value of Normal: Q, default is 0.02
-    use_relative_positions          use relative positions or not: True | False, default is False
-    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
-    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float32
-
-Parameters for student bert network:
-    seq_length                      length of input sequence: N, default is 128
-    vocab_size                      size of the vocabulary: N, must be consistent with the dataset you use. Default is 30522
-    hidden_size                     size of bert encoder layers: N
-    num_hidden_layers               number of hidden layers: N
-    num_attention_heads             number of attention heads: N, default is 12
-    intermediate_size               size of intermediate layer: N
-    hidden_act                      activation function used: ACTIVATION, default is "gelu"
-    hidden_dropout_prob             dropout probability for BertOutput: Q
-    attention_probs_dropout_prob    dropout probability for BertAttention: Q
-    max_position_embeddings         maximum length of sequences: N, default is 512
-    save_ckpt_step                  step interval for saving checkpoints: N, default is 100
-    max_ckpt_num                    maximum number for saving checkpoint: N, default is 1
-    type_vocab_size                 size of token type vocab: N, default is 2
-    initializer_range               initialization value of Normal: Q, default is 0.02
-    use_relative_positions          use relative positions or not: True | False, default is False
-    dtype                           data type of input: mstype.float16 | mstype.float32, default is mstype.float32
-    compute_type                    compute type in BertTransformer: mstype.float16 | mstype.float32, default is mstype.float32
-    do_quant                        do activation quantization or not: True | False, default is True
-    embedding_bits                  the quant bits of embedding: N, default is 2
-    weight_bits                     the quant bits of weight: N, default is 2
-    cls_dropout_prob                dropout probability for BertModelCLS: Q
-    activation_init                 initialization value of activation quantization: Q, default is 2.5
-    is_lgt_fit                      use label ground truth loss or not: True | False, default is False
-
-```
-
-## [Training Process](#contents)
-
-### Training
-
-Before running the commands below, make sure that `teacher_model_dir`, `student_model_dir` and `data_dir` have been set. Use absolute paths, e.g. "/home/xxx/model_dir/".
-
-```text
-
-python
-    python train.py --task_name='sts-b' --device_target="GPU" --teacher_model_dir='/home/xxx/model_dir/' --student_model_dir='/home/xxx/model_dir/' --data_dir='/home/xxx/data_dir/'
-shell
-    bash scripts/run_train.sh [TASK_NAME] [DEVICE_TARGET] [TEACHER_MODEL_DIR] [STUDENT_MODEL_DIR] [DATA_DIR] [DEVICE_ID](optional)
-
-```
-
-The shell command above runs in the background and writes its output to the file train_log.txt. The python command runs in the foreground and prints its results to the console. After training, checkpoint files are saved under the script folder by default. The evaluation metric is reported as follows:
-
-```text
-
-step: 50, Pearsonr 72.50008506516072, best_Pearsonr 72.50008506516072
-step 100, Pearsonr 81.3580301181608, best_Pearsonr 81.3580301181608
-step 150, Pearsonr 83.60461724688754, best_Pearsonr 83.60461724688754
-step 200, Pearsonr 82.23210161651377, best_Pearsonr 83.60461724688754
-...
-step 1050, Pearsonr 87.5606067964618332, best_Pearsonr 87.58388835685436
-
-```
-
-## [Evaluation Process](#contents)
-
-### Evaluation
-
-After training has finished, you can evaluate the trained model as follows.
-
-#### evaluation on STS-B dataset
-
-```text
-
-python
-    python eval.py --task_name='sts-b' --device_target="GPU" --model_dir='/home/xxx/model_dir/' --data_dir='/home/xxx/data_dir/'
-shell
-    bash scripts/run_eval.sh [TASK_NAME] [DEVICE_TARGET] [MODEL_DIR] [DATA_DIR]
-
-
-```
-
-The shell command above runs in the background and writes its output to the file eval_log.txt. The python command runs in the foreground and prints its results to the console. The metric value on the evaluation dataset is reported as follows:
-
-```text
-
-eval step: 0, Pearsonr: 96.91109003302263
-eval step: 1, Pearsonr: 95.6800637493701
-eval step: 2, Pearsonr: 94.23823082886167
-...
-The best Pearsonr: 87.58388835685437
-
-```
-
-## [Model Description](#contents)
-
-## [Performance](#contents)
-
-### Training Performance
-
-| Parameters        | GPU                                                                    |
-| ----------------- | ---------------------------------------------------------------------- |
-| Model Version     | TernaryBERT                                                            |
-| Resource          | NV GeForce GTX1080ti, cpu 2.00GHz, cores 56, mem 251G, os Ubuntu 16.04 |
-| Date              | 2021-6-10                                                              |
-| MindSpore Version | 1.2.0                                                                  |
-| Dataset           | STS-B                                                                  |
-| batch_size        | 16                                                                     |
-| Metric value      | 87.5                                                                   |
-
-# [Description of Random Situation](#contents)
-
-In train.py, we set do_shuffle to shuffle the dataset.
-
-In config.py, we set hidden_dropout_prob, attention_probs_dropout_prob and cls_dropout_prob to randomly drop some network nodes.
-
-# [ModelZoo Homepage](#contents)
-
-Please check the official [homepage](https://gitee.com/mindspore/models).
diff --git a/research/nlp/ternarybert/__init__.py b/research/nlp/ternarybert/__init__.py
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/research/nlp/ternarybert/eval.py b/research/nlp/ternarybert/eval.py
deleted file mode 100644
index ba75607d6d81b2d87ec38c17d8eae94da0812cbf..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/eval.py
+++ /dev/null
@@ -1,108 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""eval standalone script"""
-
-import os
-import re
-import argparse
-
-from mindspore import context
-from mindspore.train.serialization import load_checkpoint, load_param_into_net
-from src.dataset import create_dataset
-from src.config import eval_cfg, student_net_cfg, task_cfg
-from src.tinybert_model import BertModelCLS
-
-
-DATA_NAME = 'eval.tf_record'
-
-
-def parse_args():
-    """
-    parse args
-    """
-    parser = argparse.ArgumentParser(description='ternarybert evaluation')
-    parser.add_argument('--device_target', type=str, default='GPU', choices=['Ascend', 'GPU'],
-                        help='Device where the code will be implemented. (Default: GPU)')
-    parser.add_argument('--device_id', type=int, default=0, help='Device id. (Default: 0)')
-    parser.add_argument('--model_dir', type=str, default='', help='The checkpoint directory of model.')
-    parser.add_argument('--data_dir', type=str, default='', help='Data directory.')
-    parser.add_argument('--task_name', type=str, default='sts-b', choices=['sts-b', 'qnli', 'mnli'],
-                        help='The name of the task to train. (Default: sts-b)')
-    parser.add_argument('--dataset_type', type=str, default='tfrecord', choices=['tfrecord', 'mindrecord'],
-                        help='The type of the dataset. (Default: tfrecord)')
-    parser.add_argument('--batch_size', type=int, default=32, help='Batch size for evaluating')
-    return parser.parse_args()
-
-
-def get_ckpt(ckpt_file):
-    lists = os.listdir(ckpt_file)
-    lists.sort(key=lambda fn: os.path.getmtime(os.path.join(ckpt_file, fn)))
-    return os.path.join(ckpt_file, lists[-1])
-
-
-def do_eval_standalone(args_opt):
-    """
-    do eval standalone
-    """
-    ckpt_file = os.path.join(args_opt.model_dir, args_opt.task_name)
-    ckpt_file = get_ckpt(ckpt_file)
-    print('ckpt file:', ckpt_file)
-    task = task_cfg[args_opt.task_name]
-    student_net_cfg.seq_length = task.seq_length
-    eval_cfg.batch_size = args_opt.batch_size
-    eval_data_dir = os.path.join(args_opt.data_dir, args_opt.task_name, DATA_NAME)
-
-    context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=args_opt.device_id)
-
-    eval_dataset = create_dataset(batch_size=eval_cfg.batch_size,
-                                  device_num=1,
-                                  rank=0,
-                                  do_shuffle=False,
-                                  data_dir=eval_data_dir,
-                                  data_type=args_opt.dataset_type,
-                                  seq_length=task.seq_length,
-                                  task_type=task.task_type,
-                                  drop_remainder=False)
-    print('eval dataset size:', eval_dataset.get_dataset_size())
-    print('eval dataset batch size:', eval_dataset.get_batch_size())
-
-    eval_model = BertModelCLS(student_net_cfg, False, task.num_labels, 0.0, phase_type='student')
-    param_dict = load_checkpoint(ckpt_file)
-    new_param_dict = {}
-    for key, value in param_dict.items():
-        new_key = re.sub('tinybert_', 'bert_', key)
-        new_key = re.sub('^bert.', '', new_key)
-        new_param_dict[new_key] = value
-    load_param_into_net(eval_model, new_param_dict)
-    eval_model.set_train(False)
-
-    columns_list = ["input_ids", "input_mask", "segment_ids", "label_ids"]
-    callback = task.metrics()
-    for step, data in enumerate(eval_dataset.create_dict_iterator()):
-        input_data = []
-        for i in columns_list:
-            input_data.append(data[i])
-        input_ids, input_mask, token_type_id, label_ids = input_data
-        _, _, logits, _ = eval_model(input_ids, token_type_id, input_mask)
-        callback.update(logits, label_ids)
-        print('eval step: {}, {}: {}'.format(step, callback.name, callback.get_metrics()))
-    metrics = callback.get_metrics()
-    print('The best {}: {}'.format(callback.name, metrics))
-
-
-if __name__ == '__main__':
-    args = parse_args()
-    do_eval_standalone(args)
diff --git a/research/nlp/ternarybert/mindspore_hub_conf.py b/research/nlp/ternarybert/mindspore_hub_conf.py
deleted file mode 100644
index 5ff07d3b4acf997e229cff7785d13fabcdb7010b..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/mindspore_hub_conf.py
+++ /dev/null
@@ -1,57 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""Bert hub interface for bert base"""
-
-from src.tinybert_model import BertModel
-from src.tinybert_model import BertConfig
-import mindspore.common.dtype as mstype
-
-tinybert_student_net_cfg = BertConfig(
-    seq_length=128,
-    vocab_size=30522,
-    hidden_size=768,
-    num_hidden_layers=6,
-    num_attention_heads=12,
-    intermediate_size=3072,
-    hidden_act="gelu",
-    hidden_dropout_prob=0.1,
-    attention_probs_dropout_prob=0.1,
-    max_position_embeddings=512,
-    type_vocab_size=2,
-    initializer_range=0.02,
-    use_relative_positions=False,
-    dtype=mstype.float32,
-    compute_type=mstype.float32,
-    do_quant=True,
-    embedding_bits=2,
-    weight_bits=2,
-    weight_clip_value=3.0,
-    cls_dropout_prob=0.1,
-    activation_init=2.5,
-    is_lgt_fit=False
-)
-
-
-def create_network(name, *args, **kwargs):
-    """
-    Create tinybert network.
-    """
-    if name == "ternarybert":
-        if "seq_length" in kwargs:
-            tinybert_student_net_cfg.seq_length = kwargs["seq_length"]
-        is_training = kwargs.get("is_training", False)
-        return BertModel(tinybert_student_net_cfg, is_training, *args)
-    raise NotImplementedError(f"{name} is not implemented in the repo")
diff --git a/research/nlp/ternarybert/requirements.txt b/research/nlp/ternarybert/requirements.txt
deleted file mode 100644
index 5a619e015a5ba164371d32865c44aa47a3aa416c..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/requirements.txt
+++ /dev/null
@@ -1,2 +0,0 @@
-numpy
-easydict
diff --git a/research/nlp/ternarybert/scripts/run_eval.sh b/research/nlp/ternarybert/scripts/run_eval.sh
deleted file mode 100644
index 09e6782e45776401163ca3681b57d53278e6ad75..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/scripts/run_eval.sh
+++ /dev/null
@@ -1,43 +0,0 @@
-#!/bin/bash
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-if [ $# != 4 ]
-then
-    echo "============================================================================================================"
-    echo "Please run the script as: "
-    echo "bash scripts/run_eval.sh [TASK_NAME] [DEVICE_TARGET] [MODEL_DIR] [DATA_DIR]"
-    echo "============================================================================================================"
-exit 1
-fi
-
-echo "===============================================start evaling================================================"
-
-task_name=$1
-device_target=$2
-model_dir=$3
-data_dir=$4
-
-mkdir -p ms_log
-PROJECT_DIR=$(cd "$(dirname "$0")" || exit; pwd)
-CUR_DIR=`pwd`
-export GLOG_log_dir=${CUR_DIR}/ms_log
-export GLOG_logtostderr=0
-python ${PROJECT_DIR}/../eval.py \
-    --task_name=$task_name \
-    --device_target=$device_target \
-    --device_id=0 \
-    --model_dir=$model_dir \
-    --data_dir=$data_dir > eval_log.txt 2>&1 &
diff --git a/research/nlp/ternarybert/scripts/run_train.sh b/research/nlp/ternarybert/scripts/run_train.sh
deleted file mode 100644
index 1d71710ae8d4f51a4d30ce2bc0771dc1682b1dab..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/scripts/run_train.sh
+++ /dev/null
@@ -1,58 +0,0 @@
-#!/bin/bash
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-if [[ $# != 5 && $# != 6 ]]
-then
-    echo "============================================================================================================"
-    echo "Please run the script as: "
-    echo "bash scripts/run_train.sh [TASK_NAME] [DEVICE_TARGET] [TEACHER_MODEL_DIR] [STUDENT_MODEL_DIR] [DATA_DIR] [DEVICE_ID]"
-    echo "DEVICE_ID is optional, it can be set by command line argument DEVICE_ID , otherwise the value is zero"
-    echo "============================================================================================================"
-exit 1
-fi
-
-echo "===============================================start training==============================================="
-
-task_name=$1
-device_target=$2
-teacher_model_dir=$3
-student_model_dir=$4
-data_dir=$5
-
-if [ $# == 6 ]
-then
-device_id=$6
-else
-device_id=0
-fi
-
-if [ "$device_target" == "GPU" ]
-then
-    export CUDA_VISIBLE_DEVICES=$device_id
-fi
-
-mkdir -p ms_log
-PROJECT_DIR=$(cd "$(dirname "$0")" || exit; pwd)
-CUR_DIR=`pwd`
-export GLOG_log_dir=${CUR_DIR}/ms_log
-export GLOG_logtostderr=0
-python ${PROJECT_DIR}/../train.py \
-    --task_name=$task_name \
-    --device_target=$device_target \
-    --device_id=$device_id \
-    --teacher_model_dir=$teacher_model_dir \
-    --student_model_dir=$student_model_dir \
-    --data_dir=$data_dir > train_log.txt 2>&1 &
diff --git a/research/nlp/ternarybert/src/__init__.py b/research/nlp/ternarybert/src/__init__.py
deleted file mode 100644
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..0000000000000000000000000000000000000000
diff --git a/research/nlp/ternarybert/src/assessment_method.py b/research/nlp/ternarybert/src/assessment_method.py
deleted file mode 100644
index 12208c5cb3879cba0f827b671e94fca741c68d98..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/src/assessment_method.py
+++ /dev/null
@@ -1,115 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""assessment methods"""
-
-import numpy as np
-
-
-class Accuracy:
-    """Accuracy"""
-    def __init__(self):
-        self.acc_num = 0
-        self.total_num = 0
-        self.name = 'Accuracy'
-
-    def update(self, logits, labels):
-        labels = labels.asnumpy()
-        labels = np.reshape(labels, -1)
-        logits = logits.asnumpy()
-        logit_id = np.argmax(logits, axis=-1)
-        self.acc_num += np.sum(labels == logit_id)
-        self.total_num += len(labels)
-
-    def get_metrics(self):
-        return self.acc_num / self.total_num * 100.0
-
-
-class F1:
-    """F1"""
-    def __init__(self):
-        self.logits_array = np.array([])
-        self.labels_array = np.array([])
-        self.name = 'F1'
-
-    def update(self, logits, labels):
-        labels = labels.asnumpy()
-        labels = np.reshape(labels, -1)
-        logits = logits.asnumpy()
-        logits = np.argmax(logits, axis=1)
-        self.labels_array = np.concatenate([self.labels_array, labels]).astype(bool)
-        self.logits_array = np.concatenate([self.logits_array, logits]).astype(bool)
-
-    def get_metrics(self):
-        if len(self.labels_array) < 2:
-            return 0.0
-        tp = np.sum(self.labels_array & self.logits_array)
-        fp = np.sum((~self.labels_array) & self.logits_array)
-        fn = np.sum(self.labels_array & (~self.logits_array))
-        p = tp / (tp + fp)
-        r = tp / (tp + fn)
-        return 2.0 * p * r / (p + r) * 100.0
-
-
-class Pearsonr:
-    """Pearsonr"""
-    def __init__(self):
-        self.logits_array = np.array([])
-        self.labels_array = np.array([])
-        self.name = 'Pearsonr'
-
-    def update(self, logits, labels):
-        labels = labels.asnumpy()
-        labels = np.reshape(labels, -1)
-        logits = logits.asnumpy()
-        logits = np.reshape(logits, -1)
-        self.labels_array = np.concatenate([self.labels_array, labels])
-        self.logits_array = np.concatenate([self.logits_array, logits])
-
-    def get_metrics(self):
-        if len(self.labels_array) < 2:
-            return 0.0
-        x_mean = self.logits_array.mean()
-        y_mean = self.labels_array.mean()
-        xm = self.logits_array - x_mean
-        ym = self.labels_array - y_mean
-        norm_xm = np.linalg.norm(xm)
-        norm_ym = np.linalg.norm(ym)
-        return np.dot(xm / norm_xm, ym / norm_ym) * 100.0
-
-
-class Matthews:
-    """Matthews"""
-    def __init__(self):
-        self.logits_array = np.array([])
-        self.labels_array = np.array([])
-        self.name = 'Matthews'
-
-    def update(self, logits, labels):
-        labels = labels.asnumpy()
-        labels = np.reshape(labels, -1)
-        logits = logits.asnumpy()
-        logits = np.argmax(logits, axis=1)
-        self.labels_array = np.concatenate([self.labels_array, labels]).astype(bool)
-        self.logits_array = np.concatenate([self.logits_array, logits]).astype(bool)
-
-    def get_metrics(self):
-        if len(self.labels_array) < 2:
-            return 0.0
-        tp = np.sum(self.labels_array & self.logits_array)
-        fp = np.sum((~self.labels_array) & self.logits_array)
-        fn = np.sum(self.labels_array & (~self.logits_array))
-        tn = np.sum((~self.labels_array) & (~self.logits_array))
-        return (tp * tn - fp * fn) / np.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)) * 100.0
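Reviewer note on the metrics in assessment_method.py above: the F1 and Matthews classes both reduce to the four confusion-matrix counts. The following is a minimal NumPy-only sketch of that computation, not the code from the deleted file. The helper names `confusion_counts`, `f1_percent` and `matthews_percent` are hypothetical, and the zero-denominator guards are an assumption added here (the deleted classes only guard against fewer than two samples).

```python
import numpy as np


def confusion_counts(labels, preds):
    """Return (tp, fp, fn, tn) for binary labels and binary predictions."""
    labels = np.asarray(labels).astype(bool)
    preds = np.asarray(preds).astype(bool)
    tp = int(np.sum(labels & preds))      # positive label, positive prediction
    fp = int(np.sum(~labels & preds))     # negative label, positive prediction
    fn = int(np.sum(labels & ~preds))     # positive label, negative prediction
    tn = int(np.sum(~labels & ~preds))    # negative label, negative prediction
    return tp, fp, fn, tn


def f1_percent(labels, preds):
    """F1 score in percent; 2*p*r/(p+r) simplifies to 2*tp/(2*tp+fp+fn)."""
    tp, fp, fn, _ = confusion_counts(labels, preds)
    denom = 2 * tp + fp + fn
    # Assumption: report 0.0 instead of dividing by zero.
    return 0.0 if denom == 0 else 200.0 * tp / denom


def matthews_percent(labels, preds):
    """Matthews correlation coefficient in percent."""
    tp, fp, fn, tn = confusion_counts(labels, preds)
    denom = np.sqrt(float((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)))
    # Assumption: report 0.0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom * 100.0
```

Both scores are symmetric under swapping false positives and false negatives, so the naming of those two counts does not affect the reported numbers.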
diff --git a/research/nlp/ternarybert/src/cell_wrapper.py b/research/nlp/ternarybert/src/cell_wrapper.py
deleted file mode 100644
index 04906f434c1782b2f97639a7ec7aacf13217f9fa..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/src/cell_wrapper.py
+++ /dev/null
@@ -1,513 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""Train Cell."""
-
-import mindspore.nn as nn
-from mindspore import context
-from mindspore.ops import operations as P
-from mindspore.ops import functional as F
-from mindspore.ops import composite as C
-from mindspore.common.tensor import Tensor
-from mindspore.common import dtype as mstype
-from mindspore.common.parameter import Parameter
-from mindspore.communication.management import get_group_size
-from mindspore.nn.wrap.grad_reducer import DistributedGradReducer
-from mindspore.context import ParallelMode
-from mindspore.train.serialization import load_checkpoint, load_param_into_net
-from .tinybert_model import BertModelCLS
-from .quant import QuantizeWeightCell
-from .config import gradient_cfg
-
-
-class ClipByNorm(nn.Cell):
-    r"""
-        Clips tensor values to a maximum :math:`L_2`-norm.
-
-        The output of this layer remains the same if the :math:`L_2`-norm of the input tensor
-        is not greater than the argument clip_norm. Otherwise the tensor will be normalized as:
-
-        .. math::
-            \text{output}(X) = \frac{\text{clip_norm} * X}{L_2(X)},
-
-        where :math:`L_2(X)` is the :math:`L_2`-norm of :math:`X`.
-
-        Args:
-            axis (Union[None, int, tuple(int)]): Compute the L2-norm along the specific dimension.
-                                                Default: None, which computes over all dimensions.
-
-        Inputs:
-            - **input** (Tensor) - Tensor of shape N-D. The type must be float32 or float16.
-            - **clip_norm** (Tensor) - A scalar Tensor of shape :math:`()` or :math:`(1)`,
-              or a Tensor whose shape can be broadcast to the input shape.
-
-        Outputs:
-            Tensor, clipped tensor with the same shape as the input, whose type is float32.
-
-        Supported Platforms:
-            ``Ascend`` ``GPU``
-
-        Examples:
-            >>> net = nn.ClipByNorm()
-            >>> input = Tensor(np.random.randint(0, 10, [4, 16]), mindspore.float32)
-            >>> clip_norm = Tensor(np.array([100]).astype(np.float32))
-            >>> output = net(input, clip_norm)
-            >>> print(output.shape)
-            (4, 16)
-
-        """
-
-    def __init__(self):
-        super(ClipByNorm, self).__init__()
-        self.reduce_sum = P.ReduceSum(keep_dims=True)
-        self.select_ = P.Select()
-        self.greater_ = P.Greater()
-        self.cast = P.Cast()
-        self.sqrt = P.Sqrt()
-        self.max_op = P.Maximum()
-        self.shape = P.Shape()
-        self.reshape = P.Reshape()
-        self.fill = P.Fill()
-        self.expand_dims = P.ExpandDims()
-        self.dtype = P.DType()
-
-    def construct(self, x, clip_norm):
-        """Scale x so that its L2-norm does not exceed clip_norm."""
-        mul_x = F.square(x)
-        if mul_x.shape == (1,):
-            l2sum = self.cast(mul_x, mstype.float32)
-        else:
-            l2sum = self.cast(self.reduce_sum(mul_x), mstype.float32)
-        cond = self.greater_(l2sum, 0)
-        ones_ = self.fill(self.dtype(cond), self.shape(cond), 1.0)
-        l2sum_safe = self.select_(cond, l2sum, self.cast(ones_, self.dtype(l2sum)))
-        l2norm = self.select_(cond, self.sqrt(l2sum_safe), l2sum)
-
-        intermediate = x * clip_norm
-
-        max_norm = self.max_op(l2norm, clip_norm)
-        values_clip = self.cast(intermediate, mstype.float32) / self.expand_dims(max_norm, -1)
-        values_clip = self.reshape(values_clip, self.shape(x))
-        values_clip = F.identity(values_clip)
-        return values_clip
-
-
-clip_grad = C.MultitypeFuncGraph("clip_grad")
-# pylint: disable=consider-using-in
-
-
-@clip_grad.register("Number", "Number", "Tensor")
-def _clip_grad(clip_type, clip_value, grad):
-    """
-    Clip gradients.
-
-    Inputs:
-        clip_type (int): The way to clip, 0 for 'value', 1 for 'norm'.
-        clip_value (float): Specifies how much to clip.
-        grad (Tensor): Gradient of a single parameter.
-
-    Outputs:
-        Tensor, the clipped gradient.
-    """
-    if clip_type != 0 and clip_type != 1:
-        return grad
-    dt = F.dtype(grad)
-    if clip_type == 0:
-        new_grad = C.clip_by_value(grad, F.cast(F.tuple_to_array((-clip_value,)), dt),
-                                   F.cast(F.tuple_to_array((clip_value,)), dt))
-    else:
-        new_grad = ClipByNorm()(grad, F.cast(F.tuple_to_array((clip_value,)), dt))
-    return new_grad
-
-
-grad_scale = C.MultitypeFuncGraph("grad_scale")
-reciprocal = P.Reciprocal()
-
-
-@grad_scale.register("Tensor", "Tensor")
-def tensor_grad_scale(scale, grad):
-    return grad * reciprocal(scale)
-
-
-class ClipGradients(nn.Cell):
-    """
-    Clip gradients.
-
-    Inputs:
-        grads (list): List of gradient tuples.
-        clip_type (int): The way to clip, 0 for 'value', 1 for 'norm'.
-        clip_value (float): Specifies how much to clip.
-
-    Returns:
-        List, a list of clipped_grad tuples.
-    """
-    def __init__(self):
-        super(ClipGradients, self).__init__()
-        self.clip_by_norm = nn.ClipByNorm()
-        self.cast = P.Cast()
-        self.dtype = P.DType()
-
-    def construct(self,
-                  grads,
-                  clip_type,
-                  clip_value):
-        """clip gradients"""
-        if clip_type != 0 and clip_type != 1:
-            return grads
-        new_grads = ()
-        for grad in grads:
-            dt = self.dtype(grad)
-            if clip_type == 0:
-                t = C.clip_by_value(grad, self.cast(F.tuple_to_array((-clip_value,)), dt),
-                                    self.cast(F.tuple_to_array((clip_value,)), dt))
-            else:
-                t = self.clip_by_norm(grad, self.cast(F.tuple_to_array((clip_value,)), dt))
-            new_grads = new_grads + (t,)
-        return new_grads
-
-
-class SoftmaxCrossEntropy(nn.Cell):
-    """SoftmaxCrossEntropy loss"""
-    def __init__(self):
-        super(SoftmaxCrossEntropy, self).__init__()
-        self.log_softmax = P.LogSoftmax(axis=-1)
-        self.softmax = P.Softmax(axis=-1)
-        self.reduce_mean = P.ReduceMean()
-        self.cast = P.Cast()
-
-    def construct(self, predicts, targets):
-        likelihood = self.log_softmax(predicts)
-        target_prob = self.softmax(targets)
-        loss = self.reduce_mean(-target_prob * likelihood)
-
-        return self.cast(loss, mstype.float32)
-
-
-class BertNetworkWithLoss(nn.Cell):
-    """
-    Provide the task-distillation loss between a teacher and a student BERT network.
-    Args:
-        teacher_config (BertConfig): The config of the teacher BertModel.
-        is_training (bool): Specifies whether to use the training mode.
-        use_one_hot_embeddings (bool): Specifies whether to use one-hot for embeddings. Default: False.
-    Returns:
-        Tensor, the loss of the network.
-    """
-    def __init__(self, teacher_config, teacher_ckpt, student_config, student_ckpt,
-                 is_training, task_type, num_labels, use_one_hot_embeddings=False,
-                 temperature=1.0, dropout_prob=0.1):
-        super(BertNetworkWithLoss, self).__init__()
-        # load teacher model
-        self.teacher = BertModelCLS(teacher_config, False, num_labels, dropout_prob,
-                                    use_one_hot_embeddings, "teacher")
-        param_dict = load_checkpoint(teacher_ckpt)
-        new_param_dict = {}
-        for key, value in param_dict.items():
-            new_key = 'teacher.' + key
-            new_param_dict[new_key] = value
-        load_param_into_net(self.teacher, new_param_dict)
-
-        # no_grad
-        self.teacher.set_train(False)
-        params = self.teacher.trainable_params()
-        for param in params:
-            param.requires_grad = False
-        # load student model
-        self.bert = BertModelCLS(student_config, is_training, num_labels, dropout_prob,
-                                 use_one_hot_embeddings, "student")
-        param_dict = load_checkpoint(student_ckpt)
-        new_param_dict = {}
-        for key, value in param_dict.items():
-            new_key = 'bert.' + key
-            new_param_dict[new_key] = value
-        load_param_into_net(self.bert, new_param_dict)
-        self.cast = P.Cast()
-        self.teacher_layers_num = teacher_config.num_hidden_layers
-        self.student_layers_num = student_config.num_hidden_layers
-        self.layers_per_block = int(self.teacher_layers_num / self.student_layers_num)
-        self.is_att_fit = student_config.is_att_fit
-        self.is_rep_fit = student_config.is_rep_fit
-        self.is_lgt_fit = student_config.is_lgt_fit
-        self.task_type = task_type
-        self.temperature = temperature
-        self.loss_mse = nn.MSELoss()
-        self.lgt_fct = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
-        self.select = P.Select()
-        self.zeroslike = P.ZerosLike()
-        self.dtype = student_config.dtype
-        self.num_labels = num_labels
-        self.soft_cross_entropy = SoftmaxCrossEntropy()
-        self.compute_type = student_config.compute_type
-        self.embedding_bits = student_config.embedding_bits
-        self.weight_bits = student_config.weight_bits
-        self.weight_clip_value = student_config.weight_clip_value
-        self.reshape = P.Reshape()
-
-    def construct(self,
-                  input_ids,
-                  input_mask,
-                  token_type_id,
-                  label_ids):
-        """task distill network with loss"""
-        # teacher model
-        teacher_seq_output, teacher_att_output, teacher_logits, _ = self.teacher(input_ids, token_type_id, input_mask)
-        # student model
-        student_seq_output, student_att_output, student_logits, _ = self.bert(input_ids, token_type_id, input_mask)
-        total_loss = 0
-        if self.is_att_fit:
-            selected_teacher_att_output = ()
-            selected_student_att_output = ()
-            for i in range(self.student_layers_num):
-                selected_teacher_att_output += (teacher_att_output[(i + 1) * self.layers_per_block - 1],)
-                selected_student_att_output += (student_att_output[i],)
-            att_loss = 0
-            for i in range(self.student_layers_num):
-                student_att = selected_student_att_output[i]
-                teacher_att = selected_teacher_att_output[i]
-                student_att = self.select(student_att <= self.cast(-100.0, mstype.float32),
-                                          self.zeroslike(student_att),
-                                          student_att)
-                teacher_att = self.select(teacher_att <= self.cast(-100.0, mstype.float32),
-                                          self.zeroslike(teacher_att),
-                                          teacher_att)
-                att_loss += self.loss_mse(student_att, teacher_att)
-            total_loss += att_loss
-        if self.is_rep_fit:
-            selected_teacher_seq_output = ()
-            selected_student_seq_output = ()
-            for i in range(self.student_layers_num + 1):
-                selected_teacher_seq_output += (teacher_seq_output[i * self.layers_per_block],)
-                selected_student_seq_output += (student_seq_output[i],)
-            rep_loss = 0
-            for i in range(self.student_layers_num + 1):
-                student_rep = selected_student_seq_output[i]
-                teacher_rep = selected_teacher_seq_output[i]
-                rep_loss += self.loss_mse(student_rep, teacher_rep)
-            total_loss += rep_loss
-        if self.task_type == 'classification':
-            cls_loss = self.soft_cross_entropy(student_logits / self.temperature, teacher_logits / self.temperature)
-            if self.is_lgt_fit:
-                student_logits = self.cast(student_logits, mstype.float32)
-                label_ids_reshape = self.reshape(self.cast(label_ids, mstype.int32), (-1,))
-                lgt_loss = self.lgt_fct(student_logits, label_ids_reshape)
-                total_loss += lgt_loss
-        else:
-            student_logits = self.reshape(student_logits, (-1,))
-            label_ids = self.reshape(label_ids, (-1,))
-            cls_loss = self.loss_mse(student_logits, label_ids)
-        total_loss += cls_loss
-        return self.cast(total_loss, mstype.float32)
-
-
-class BertTrainWithLossScaleCell(nn.Cell):
-    """
-    Specifically defined for finetuning where only four input tensors are needed.
-    """
-    def __init__(self, network, optimizer, scale_update_cell=None):
-        super(BertTrainWithLossScaleCell, self).__init__(auto_prefix=False)
-        self.network = network
-        self.network.set_grad()
-        self.weights = optimizer.parameters
-        self.optimizer = optimizer
-        self.grad = C.GradOperation(get_by_list=True,
-                                    sens_param=True)
-        self.reducer_flag = False
-        self.allreduce = P.AllReduce()
-        self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
-        if self.parallel_mode in [ParallelMode.DATA_PARALLEL, ParallelMode.HYBRID_PARALLEL]:
-            self.reducer_flag = True
-        self.grad_reducer = F.identity
-        self.degree = 1
-        if self.reducer_flag:
-            self.degree = get_group_size()
-            self.grad_reducer = DistributedGradReducer(optimizer.parameters, False, self.degree)
-        self.clip_type = gradient_cfg.clip_type
-        self.clip_value = gradient_cfg.clip_value
-        self.is_distributed = (self.parallel_mode != ParallelMode.STAND_ALONE)
-        self.cast = P.Cast()
-        self.alloc_status = P.NPUAllocFloatStatus()
-        self.get_status = P.NPUGetFloatStatus()
-        self.clear_before_grad = P.NPUClearFloatStatus()
-        self.reduce_sum = P.ReduceSum(keep_dims=False)
-        self.base = Tensor(1, mstype.float32)
-        self.less_equal = P.LessEqual()
-        self.hyper_map = C.HyperMap()
-        self.loss_scale = None
-        self.loss_scaling_manager = scale_update_cell
-        if scale_update_cell:
-            self.loss_scale = Parameter(Tensor(scale_update_cell.get_loss_scale(), dtype=mstype.float32))
-
-        self.saved_params = self.weights.clone(prefix='saved')
-        self.length = len(self.weights)
-        self.quant_embedding_list = []
-        self.quant_weight_list = []
-        for i, key in enumerate(self.saved_params):
-            if 'embedding_lookup' in key.name:
-                self.quant_embedding_list.append(i)
-            elif 'weight' in key.name and 'dense_1' not in key.name:
-                self.quant_weight_list.append(i)
-        self.quant_embedding_list_length = len(self.quant_embedding_list)
-        self.quant_weight_list_length = len(self.quant_weight_list)
-
-        self.quantize_embedding = QuantizeWeightCell(num_bits=network.embedding_bits,
-                                                     compute_type=network.compute_type,
-                                                     clip_value=network.weight_clip_value)
-        self.quantize_weight = QuantizeWeightCell(num_bits=network.weight_bits,
-                                                  compute_type=network.compute_type,
-                                                  clip_value=network.weight_clip_value)
-
-    @C.add_flags(has_effect=True)
-    def construct(self,
-                  input_ids,
-                  input_mask,
-                  token_type_id,
-                  label_ids,
-                  sens=None):
-        """Defines the computation performed."""
-        weights = self.weights
-        for i in range(self.length):
-            F.assign(self.saved_params[i], weights[i])
-
-        for i in range(self.quant_embedding_list_length):
-            quant_embedding = self.quantize_embedding(weights[self.quant_embedding_list[i]])
-            F.assign(weights[self.quant_embedding_list[i]], quant_embedding)
-
-        for i in range(self.quant_weight_list_length):
-            quant_weight = self.quantize_weight(weights[self.quant_weight_list[i]])
-            F.assign(weights[self.quant_weight_list[i]], quant_weight)
-
-        if sens is None:
-            scaling_sens = self.loss_scale
-        else:
-            scaling_sens = sens
-
-        # alloc status and clear should be right before grad operation
-        init = self.alloc_status()
-        self.clear_before_grad(init)
-        grads = self.grad(self.network, weights)(input_ids,
-                                                 input_mask,
-                                                 token_type_id,
-                                                 label_ids,
-                                                 self.cast(scaling_sens,
-                                                           mstype.float32))
-        # apply grad reducer on grads
-        grads = self.grad_reducer(grads)
-        grads = self.hyper_map(F.partial(grad_scale, scaling_sens * self.degree), grads)
-        grads = self.hyper_map(F.partial(clip_grad, self.clip_type, self.clip_value), grads)
-
-        for i in range(self.length):
-            param = F.depend(self.saved_params[i], grads)
-            F.assign(weights[i], param)
-
-        self.get_status(init)
-        flag_sum = self.reduce_sum(init, (0,))
-        if self.is_distributed:
-            # sum overflow flag over devices
-            flag_reduce = self.allreduce(flag_sum)
-            cond = self.less_equal(self.base, flag_reduce)
-        else:
-            cond = self.less_equal(self.base, flag_sum)
-        overflow = cond
-        if sens is None:
-            overflow = self.loss_scaling_manager(self.loss_scale, cond)
-        if overflow:
-            succ = False
-        else:
-            succ = self.optimizer(grads)
-        return succ
-
-
-class BertTrainCell(nn.Cell):
-    """
-    Specifically defined for finetuning where only four input tensors are needed.
-    """
-    def __init__(self, network, optimizer, sens=1.0):
-        super(BertTrainCell, self).__init__(auto_prefix=False)
-        self.network = network
-        self.network.set_grad()
-        self.weights = optimizer.parameters
-        self.optimizer = optimizer
-        self.sens = sens
-        self.grad = C.GradOperation(get_by_list=True,
-                                    sens_param=True)
-        self.clip_type = gradient_cfg.clip_type
-        self.clip_value = gradient_cfg.clip_value
-        self.reducer_flag = False
-        self.parallel_mode = context.get_auto_parallel_context("parallel_mode")
-        if self.parallel_mode in [ParallelMode.DATA_PARALLEL, ParallelMode.HYBRID_PARALLEL]:
-            self.reducer_flag = True
-        self.grad_reducer = F.identity
-        self.degree = 1
-        if self.reducer_flag:
-            mean = context.get_auto_parallel_context("gradients_mean")
-            self.degree = get_group_size()
-            self.grad_reducer = DistributedGradReducer(optimizer.parameters, mean, self.degree)
-        self.is_distributed = (self.parallel_mode != ParallelMode.STAND_ALONE)
-        self.cast = P.Cast()
-        self.hyper_map = C.HyperMap()
-
-        self.saved_params = self.weights.clone(prefix='saved')
-        self.length = len(self.weights)
-        self.quant_embedding_list = []
-        self.quant_weight_list = []
-        for i, key in enumerate(self.saved_params):
-            if 'embedding_lookup' in key.name and 'min' not in key.name and 'max' not in key.name:
-                self.quant_embedding_list.append(i)
-            elif 'weight' in key.name and 'dense_1' not in key.name:
-                self.quant_weight_list.append(i)
-        self.quant_embedding_list_length = len(self.quant_embedding_list)
-        self.quant_weight_list_length = len(self.quant_weight_list)
-
-        self.quantize_embedding = QuantizeWeightCell(num_bits=network.embedding_bits,
-                                                     compute_type=network.compute_type,
-                                                     clip_value=network.weight_clip_value)
-        self.quantize_weight = QuantizeWeightCell(num_bits=network.weight_bits,
-                                                  compute_type=network.compute_type,
-                                                  clip_value=network.weight_clip_value)
-
-    def construct(self,
-                  input_ids,
-                  input_mask,
-                  token_type_id,
-                  label_ids):
-        """Defines the computation performed."""
-        weights = self.weights
-        for i in range(self.length):
-            F.assign(self.saved_params[i], weights[i])
-
-        for i in range(self.quant_embedding_list_length):
-            quant_embedding = self.quantize_embedding(weights[self.quant_embedding_list[i]])
-            F.assign(weights[self.quant_embedding_list[i]], quant_embedding)
-
-        for i in range(self.quant_weight_list_length):
-            quant_weight = self.quantize_weight(weights[self.quant_weight_list[i]])
-            F.assign(weights[self.quant_weight_list[i]], quant_weight)
-
-        grads = self.grad(self.network, weights)(input_ids,
-                                                 input_mask,
-                                                 token_type_id,
-                                                 label_ids,
-                                                 self.cast(F.tuple_to_array((self.sens,)),
-                                                           mstype.float32))
-        # apply grad reducer on grads
-        grads = self.grad_reducer(grads)
-        grads = self.hyper_map(F.partial(clip_grad, self.clip_type, self.clip_value), grads)
-
-        for i in range(self.length):
-            param = F.depend(self.saved_params[i], grads)
-            F.assign(weights[i], param)
-
-        succ = self.optimizer(grads)
-        return succ
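Reviewer note on the ClipByNorm cell in cell_wrapper.py above: its construct method multiplies the input by clip_norm and divides by max(L2-norm, clip_norm). A minimal NumPy sketch of that arithmetic follows. The function name `clip_by_norm` is hypothetical, and the sketch assumes a single global norm and a strictly positive clip_norm; it does not reproduce the dtype casts or the zero-norm select logic of the MindSpore cell.

```python
import numpy as np


def clip_by_norm(x, clip_norm):
    """NumPy stand-in for norm clipping over all elements of x."""
    x = np.asarray(x, dtype=np.float32)
    l2norm = np.sqrt(np.sum(np.square(x)))
    # Dividing by max(norm, clip_norm) leaves x unchanged when the norm is
    # at most clip_norm and rescales x to norm == clip_norm otherwise.
    return x * clip_norm / np.maximum(l2norm, clip_norm)
```

For example, `clip_by_norm([3.0, 4.0], 1.0)` rescales a vector of norm 5 down to norm 1, while the same input with `clip_norm=10.0` passes through unchanged.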
diff --git a/research/nlp/ternarybert/src/config.py b/research/nlp/ternarybert/src/config.py
deleted file mode 100644
index 4b664abe924191f9012ffb2489b9efefa76bed75..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/src/config.py
+++ /dev/null
@@ -1,104 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""config script"""
-
-from easydict import EasyDict as edict
-import mindspore.common.dtype as mstype
-from .tinybert_model import BertConfig
-from .assessment_method import Accuracy, F1, Pearsonr, Matthews
-
-
-gradient_cfg = edict({
-    'clip_type': 1,
-    'clip_value': 1.0
-})
-
-task_cfg = edict({
-    "sst-2": edict({"num_labels": 2, "seq_length": 64, "task_type": "classification", "metrics": Accuracy}),
-    "qnli": edict({"num_labels": 2, "seq_length": 128, "task_type": "classification", "metrics": Accuracy}),
-    "mnli": edict({"num_labels": 3, "seq_length": 128, "task_type": "classification", "metrics": Accuracy}),
-    "cola": edict({"num_labels": 2, "seq_length": 64, "task_type": "classification", "metrics": Matthews}),
-    "mrpc": edict({"num_labels": 2, "seq_length": 128, "task_type": "classification", "metrics": F1}),
-    "sts-b": edict({"num_labels": 1, "seq_length": 128, "task_type": "regression", "metrics": Pearsonr}),
-    "qqp": edict({"num_labels": 2, "seq_length": 128, "task_type": "classification", "metrics": F1}),
-    "rte": edict({"num_labels": 2, "seq_length": 128, "task_type": "classification", "metrics": Accuracy})
-})
-
-train_cfg = edict({
-    'batch_size': 16,
-    'loss_scale_value': 2 ** 16,
-    'scale_factor': 2,
-    'scale_window': 50,
-    'optimizer_cfg': edict({
-        'AdamWeightDecay': edict({
-            'learning_rate': 5e-5,
-            'end_learning_rate': 1e-14,
-            'power': 1.0,
-            'weight_decay': 1e-4,
-            'eps': 1e-6,
-            'decay_filter': lambda x: 'layernorm' not in x.name.lower() and 'bias' not in x.name.lower(),
-            'warmup_ratio': 0.1
-        }),
-    }),
-})
-
-eval_cfg = edict({
-    'batch_size': 32,
-})
-
-teacher_net_cfg = BertConfig(
-    seq_length=128,
-    vocab_size=30522,
-    hidden_size=768,
-    num_hidden_layers=6,
-    num_attention_heads=12,
-    intermediate_size=3072,
-    hidden_act="gelu",
-    hidden_dropout_prob=0.0,
-    attention_probs_dropout_prob=0.0,
-    cls_dropout_prob=0.0,
-    max_position_embeddings=512,
-    type_vocab_size=2,
-    initializer_range=0.02,
-    use_relative_positions=False,
-    dtype=mstype.float32,
-    compute_type=mstype.float32,
-    do_quant=False
-)
-student_net_cfg = BertConfig(
-    seq_length=128,
-    vocab_size=30522,
-    hidden_size=768,
-    num_hidden_layers=6,
-    num_attention_heads=12,
-    intermediate_size=3072,
-    hidden_act="gelu",
-    hidden_dropout_prob=0.0,
-    attention_probs_dropout_prob=0.0,
-    max_position_embeddings=512,
-    type_vocab_size=2,
-    initializer_range=0.02,
-    use_relative_positions=False,
-    dtype=mstype.float32,
-    compute_type=mstype.float32,
-    do_quant=True,
-    embedding_bits=2,
-    weight_bits=2,
-    weight_clip_value=3.0,
-    cls_dropout_prob=0.0,
-    activation_init=2.5,
-    is_lgt_fit=False
-)
diff --git a/research/nlp/ternarybert/src/dataset.py b/research/nlp/ternarybert/src/dataset.py
deleted file mode 100644
index fe867e90c537e088daf3b893e75b5ee521e00498..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/src/dataset.py
+++ /dev/null
@@ -1,60 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""create tinybert dataset"""
-
-from enum import Enum
-import mindspore.common.dtype as mstype
-import mindspore.dataset.engine as de
-import mindspore.dataset.transforms.c_transforms as C
-
-
-class DataType(Enum):
-    """Enumerate supported dataset formats"""
-    TFRECORD = 1
-    MINDRECORD = 2
-
-
-def create_dataset(batch_size=32, device_num=1, rank=0, do_shuffle=True, data_dir=None,
-                   data_type='tfrecord', seq_length=128, task_type='classification', drop_remainder=True):
-    """create tinybert dataset"""
-    if isinstance(data_dir, list):
-        data_files = data_dir
-    else:
-        data_files = [data_dir]
-
-    columns_list = ["input_ids", "input_mask", "segment_ids", "label_ids"]
-
-    if data_type == 'mindrecord':
-        ds = de.MindDataset(data_files, columns_list=columns_list, shuffle=do_shuffle, num_shards=device_num,
-                            shard_id=rank)
-    else:
-        ds = de.TFRecordDataset(data_files, columns_list=columns_list, shuffle=do_shuffle, num_shards=device_num,
-                                shard_id=rank, shard_equal_rows=(device_num != 1))
-
-    if device_num == 1 and do_shuffle is True:
-        ds = ds.shuffle(10000)
-
-    type_cast_op = C.TypeCast(mstype.int32)
-    slice_op = C.Slice(slice(0, seq_length, 1))
-    label_type = mstype.int32 if task_type == 'classification' else mstype.float32
-    ds = ds.map(operations=[type_cast_op, slice_op], input_columns=["segment_ids"])
-    ds = ds.map(operations=[type_cast_op, slice_op], input_columns=["input_mask"])
-    ds = ds.map(operations=[type_cast_op, slice_op], input_columns=["input_ids"])
-    ds = ds.map(operations=[C.TypeCast(label_type), slice_op], input_columns=["label_ids"])
-    # apply batch operations
-    ds = ds.batch(batch_size, drop_remainder=drop_remainder)
-
-    return ds
diff --git a/research/nlp/ternarybert/src/quant.py b/research/nlp/ternarybert/src/quant.py
deleted file mode 100644
index d88a97d645ae67ae516c56b5a8d4f4cea9695283..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/src/quant.py
+++ /dev/null
@@ -1,171 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""Quantization function."""
-
-from mindspore.common import dtype as mstype
-from mindspore.common.parameter import Parameter
-from mindspore.ops import operations as P
-from mindspore.ops import functional as F
-from mindspore.ops import composite as C
-from mindspore import nn
-
-
-class QuantizeWeightCell(nn.Cell):
-    """
-    The ternary fake quant op for weight.
-
-    Args:
-        num_bits (int): The bit number of quantization, supporting 2 to 8 bits. Default: 8.
-        compute_type (:class:`mindspore.dtype`): Compute type in QuantizeWeightCell. Default: mstype.float32.
-        clip_value (float): Clips weight to be in [-clip_value, clip_value]. Default: 1.0.
-        per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
-
-    Inputs:
-        - **weight** (Parameter) - Parameter of shape :math:`(N, C_{in}, H_{in}, W_{in})`.
-
-    Outputs:
-        Parameter of shape :math:`(N, C_{out}, H_{out}, W_{out})`.
-    """
-
-    def __init__(self, num_bits=8, compute_type=mstype.float32, clip_value=1.0, per_channel=False):
-        super(QuantizeWeightCell, self).__init__()
-        self.num_bits = num_bits
-        self.compute_type = compute_type
-        self.clip_value = clip_value
-        self.per_channel = per_channel
-
-        self.clamp = C.clip_by_value
-        self.abs = P.Abs()
-        self.sum = P.ReduceSum()
-        self.nelement = F.size
-        self.div = P.Div()
-        self.cast = P.Cast()
-        self.max = P.ReduceMax()
-        self.min = P.ReduceMin()
-        self.round = P.Round()
-
-    def construct(self, weight):
-        """quantize weight cell"""
-        tensor = self.clamp(weight, -self.clip_value, self.clip_value)
-        if self.num_bits == 2:
-            if self.per_channel:
-                n = self.nelement(tensor[0])
-                m = self.div(self.sum(self.abs(tensor), 1), n)
-                thres = 0.7 * m
-                pos = self.cast(tensor[:] > thres[0], self.compute_type)
-                neg = self.cast(tensor[:] < -thres[0], self.compute_type)
-                mask = self.cast(self.abs(tensor)[:] > thres[0], self.compute_type)
-                alpha = F.reshape(self.sum(self.abs(mask * tensor), 1) / self.sum(mask, 1), (-1, 1))
-                output = alpha * pos - alpha * neg
-            else:
-                n = self.nelement(tensor)
-                m = self.div(self.sum(self.abs(tensor)), n)
-                thres = 0.7 * m
-                pos = self.cast(tensor > thres, self.compute_type)
-                neg = self.cast(tensor < -thres, self.compute_type)
-                mask = self.cast(self.abs(tensor) > thres, self.compute_type)
-                alpha = self.sum(self.abs(mask * self.cast(tensor, self.compute_type))) / self.sum(mask)
-                output = alpha * pos - alpha * neg
-        else:
-            tensor_max = self.cast(self.max(tensor), self.compute_type)
-            tensor_min = self.cast(self.min(tensor), self.compute_type)
-            s = (tensor_max - tensor_min) / (2 ** self.cast(self.num_bits, self.compute_type) - 1)
-            output = self.round(self.div(tensor - tensor_min, s)) * s + tensor_min
-        return output
-
-
-class QuantizeWeight:
-    """
-    Quantize weight into specified bit.
-
-    Args:
-        num_bits (int): The bit number of quantization, supporting 2 to 8 bits. Default: 2.
-        compute_type (:class:`mindspore.dtype`): Compute type in QuantizeWeight. Default: mstype.float32.
-        clip_value (float): Clips weight to be in [-clip_value, clip_value]. Default: 1.0.
-        per_channel (bool): Quantization granularity based on layer or on channel. Default: False.
-
-    Inputs:
-        - **weight** (Parameter) - Parameter of shape :math:`(N, C_{in}, H_{in}, W_{in})`.
-
-    Outputs:
-        Parameter of shape :math:`(N, C_{out}, H_{out}, W_{out})`.
-    """
-
-    def __init__(self, num_bits=2, compute_type=mstype.float32, clip_value=1.0, per_channel=False):
-        self.num_bits = num_bits
-        self.compute_type = compute_type
-        self.clip_value = clip_value
-        self.per_channel = per_channel
-
-        self.clamp = C.clip_by_value
-        self.abs = P.Abs()
-        self.sum = P.ReduceSum()
-        self.nelement = F.size
-        self.div = P.Div()
-        self.cast = P.Cast()
-        self.max = P.ReduceMax()
-        self.min = P.ReduceMin()
-        self.floor = P.Floor()
-
-    def construct(self, weight):
-        """quantize weight"""
-        tensor = self.clamp(weight, -self.clip_value, self.clip_value)
-        if self.num_bits == 2:
-            if self.per_channel:
-                n = self.nelement(tensor[0])
-                m = self.div(self.sum(self.abs(tensor), 1), n)
-                thres = 0.7 * m
-                pos = self.cast(tensor[:] > thres[0], self.compute_type)
-                neg = self.cast(tensor[:] < -thres[0], self.compute_type)
-                mask = self.cast(self.abs(tensor)[:] > thres[0], self.compute_type)
-                alpha = F.reshape(self.sum(self.abs(mask * tensor), 1) / self.sum(mask, 1), (-1, 1))
-                output = alpha * pos - alpha * neg
-            else:
-                n = self.nelement(tensor)
-                m = self.div(self.sum(self.abs(tensor)), n)
-                thres = 0.7 * m
-                pos = self.cast(tensor > thres, self.compute_type)
-                neg = self.cast(tensor < -thres, self.compute_type)
-                mask = self.cast(self.abs(tensor) > thres, self.compute_type)
-                alpha = self.sum(self.abs(mask * tensor)) / self.sum(mask)
-                output = alpha * pos - alpha * neg
-        else:
-            tensor_max = self.max(tensor)
-            tensor_min = self.min(tensor)
-            s = (tensor_max - tensor_min) / (2 ** self.num_bits - 1)
-            output = self.floor(self.div((tensor - tensor_min), s) + 0.5) * s + tensor_min
-        return output
-
-
-def convert_network(network, embedding_bits=2, weight_bits=2, clip_value=1.0):
-    quantize_embedding = QuantizeWeight(num_bits=embedding_bits, clip_value=clip_value)
-    quantize_weight = QuantizeWeight(num_bits=weight_bits, clip_value=clip_value)
-    for name, param in network.parameters_and_names():
-        if 'bert_embedding_lookup' in name and 'min' not in name and 'max' not in name:
-            quantized_param = quantize_embedding.construct(param)
-            param.set_data(quantized_param)
-        elif 'weight' in name and 'dense_1' not in name:
-            quantized_param = quantize_weight.construct(param)
-            param.set_data(quantized_param)
-
-
-def save_params(network):
-    return {name: Parameter(param, 'saved_params') for name, param in network.parameters_and_names()}
-
-
-def restore_params(network, params_dict):
-    for name, param in network.parameters_and_names():
-        param.set_data(params_dict[name])
diff --git a/research/nlp/ternarybert/src/tinybert_model.py b/research/nlp/ternarybert/src/tinybert_model.py
deleted file mode 100644
index ccec6182d5150db795c73541269a60847e4f696f..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/src/tinybert_model.py
+++ /dev/null
@@ -1,1191 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""Bert model."""
-
-import math
-import copy
-import numpy as np
-import mindspore.common.dtype as mstype
-import mindspore.nn as nn
-import mindspore.ops.functional as F
-from mindspore.common.initializer import Normal, initializer
-from mindspore.ops import operations as P
-from mindspore.ops import composite as C
-from mindspore.common.tensor import Tensor
-from mindspore.common.parameter import Parameter
-from mindspore import context
-from mindspore.nn.layer.quant import FakeQuantWithMinMaxObserver as FakeQuantWithMinMax
-from mindspore.nn.layer import get_activation
-
-
-class DenseGeLU(nn.Cell):
-    r"""
-    The dense connected layer with GeLU activation function.
-
-    Args:
-        in_channels (int): The number of channels in the input space.
-        out_channels (int): The number of channels in the output space.
-        weight_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The dtype
-            is same as input x. The values of str refer to the function `initializer`. Default: 'normal'.
-        bias_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable bias_init parameter. The dtype is
-            same as input x. The values of str refer to the function `initializer`. Default: 'zeros'.
-        has_bias (bool): Specifies whether the layer uses a bias vector. Default: True.
-
-    Inputs:
-        - **input** (Tensor) - Tensor of shape :math:`(*, in\_channels)`.
-
-    Outputs:
-        Tensor of shape :math:`(*, out\_channels)`.
-    """
-
-    def __init__(self,
-                 in_channels,
-                 out_channels,
-                 weight_init='normal',
-                 bias_init='zeros',
-                 has_bias=True):
-        super(DenseGeLU, self).__init__()
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.has_bias = has_bias
-
-        self.weight = Parameter(initializer(weight_init, [out_channels, in_channels]), name="weight")
-
-        if self.has_bias:
-            self.bias = Parameter(initializer(bias_init, [out_channels]), name="bias")
-
-        self.matmul = P.MatMul(transpose_b=True)
-        self.bias_add = P.BiasAdd()
-
-        self.tanh = P.Tanh()
-
-    def construct(self, x):
-        output = self.matmul(x, self.weight)
-        if self.has_bias:
-            output = self.bias_add(output, self.bias)
-        return 0.5 * output * (1.0 + self.tanh(0.7978845608028654 * (output + 0.044715 * output * output * output)))
-
-
-class GatherV2Quant(nn.Cell):
-    """
-    The fake quant gather.
-
-    Args:
-        activation_init (float): Initial clip value of the activation fake-quant range. Default: 6.
-    """
-
-    def __init__(self, activation_init=6):
-        super(GatherV2Quant, self).__init__()
-        self.gather = P.Gather()
-
-        self.fake_quant_input = FakeQuantWithMinMax(min_init=-activation_init, max_init=activation_init, ema=True,
-                                                    symmetric=False)
-        self.fake_quant_output = FakeQuantWithMinMax(min_init=-activation_init, max_init=activation_init, ema=True,
-                                                     symmetric=False)
-
-    def construct(self, x, y, z):
-        x = self.fake_quant_input(x)
-        output = self.gather(x, y, z)
-        output = self.fake_quant_output(output)
-        return output
-
-
-class DenseQuant(nn.Cell):
-    """
-    The fake quant fully connected layer.
-
-    Args:
-        in_channels (int): The number of channels in the input space.
-        out_channels (int): The number of channels in the output space.
-        weight_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable weight_init parameter. The dtype
-            is same as input x. The values of str refer to the function `initializer`. Default: 'normal'.
-        bias_init (Union[Tensor, str, Initializer, numbers.Number]): The trainable bias_init parameter. The dtype is
-            same as input x. The values of str refer to the function `initializer`. Default: 'zeros'.
-        has_bias (bool): Specifies whether the layer uses a bias vector. Default: True.
-        activation (str): Name of the activation function applied to the output of the fully connected layer,
-            e.g. 'relu'. Default: None.
-        activation_init (float): Initial clip value of the activation fake-quant range. Default: 6.
-    """
-
-    def __init__(self,
-                 in_channels,
-                 out_channels,
-                 weight_init='normal',
-                 bias_init='zeros',
-                 has_bias=True,
-                 activation=None,
-                 activation_init=6):
-        super(DenseQuant, self).__init__()
-        self.in_channels = in_channels
-        self.out_channels = out_channels
-        self.has_bias = has_bias
-        self.weight = Parameter(initializer(weight_init, [out_channels, in_channels]), name="weight")
-        if self.has_bias:
-            self.bias = Parameter(initializer(bias_init, [out_channels]), name="bias")
-        self.matmul = P.MatMul(transpose_b=True)
-        self.bias_add = P.BiasAdd()
-        self.activation = get_activation(activation)
-        self.activation_flag = self.activation is not None
-
-        self.fake_quant_input = FakeQuantWithMinMax(min_init=-activation_init, max_init=activation_init, ema=True,
-                                                    symmetric=False)
-        self.fake_quant_output = FakeQuantWithMinMax(min_init=-activation_init, max_init=activation_init, ema=True,
-                                                     symmetric=False)
-
-    def construct(self, x):
-        x = self.fake_quant_input(x)
-        output = self.matmul(x, self.weight)
-        output = self.fake_quant_output(output)
-        if self.has_bias:
-            output = self.bias_add(output, self.bias)
-        if self.activation_flag:
-            return self.activation(output)
-        return output
-
-
-class BatchMatMulQuant(nn.Cell):
-    """
-    The fake quant batch matmul.
-
-    Args:
-        transpose_a (bool): If True, `a` is transposed before multiplication. Default: False
-        transpose_b (bool): If True, `b` is transposed before multiplication. Default: False
-        activation_init (float): Initial clip value of the activation fake-quant range. Default: 6.
-    """
-
-    def __init__(self, transpose_a=False, transpose_b=False, activation_init=6):
-        super(BatchMatMulQuant, self).__init__()
-        self.batch_matmul = P.BatchMatMul(transpose_a=transpose_a, transpose_b=transpose_b)
-
-        self.fake_quant_x = FakeQuantWithMinMax(min_init=-activation_init, max_init=activation_init, ema=True,
-                                                symmetric=False, narrow_range=False)
-        self.fake_quant_y = FakeQuantWithMinMax(min_init=-activation_init, max_init=activation_init, ema=True,
-                                                symmetric=False, narrow_range=True)
-        self.fake_quant_output = FakeQuantWithMinMax(min_init=-activation_init, max_init=activation_init, ema=True,
-                                                     symmetric=False, narrow_range=False)
-
-    def construct(self, x, y):
-        x = self.fake_quant_x(x)
-        y = self.fake_quant_y(y)
-        output = self.batch_matmul(x, y)
-        output = self.fake_quant_output(output)
-        return output
-
-
-class BertConfig:
-    """
-    Configuration for `BertModel`.
-
-    Args:
-        seq_length (int): Length of input sequence. Default: 128.
-        vocab_size (int): Size of the vocabulary (number of embedding vectors). Default: 32000.
-        hidden_size (int): Size of the bert encoder layers. Default: 768.
-        num_hidden_layers (int): Number of hidden layers in the BertTransformer encoder
-                           cell. Default: 12.
-        num_attention_heads (int): Number of attention heads in the BertTransformer
-                             encoder cell. Default: 12.
-        intermediate_size (int): Size of intermediate layer in the BertTransformer
-                           encoder cell. Default: 3072.
-        hidden_act (str): Activation function used in the BertTransformer encoder
-                    cell. Default: "gelu".
-        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
-        attention_probs_dropout_prob (float): The dropout probability for
-                                      BertAttention. Default: 0.1.
-        max_position_embeddings (int): Maximum length of sequences used in this
-                                 model. Default: 512.
-        type_vocab_size (int): Size of token type vocab. Default: 16.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
-        dtype (:class:`mindspore.dtype`): Data type of the input. Default: mstype.float32.
-        compute_type (:class:`mindspore.dtype`): Compute type in BertTransformer. Default: mstype.float32.
-        do_quant (bool): Do quantization or not. Default: False.
-        embedding_bits (int): Quant bits for embedding. Default: 2.
-        weight_bits (int): Quant bits for weight of dense. Default: 2.
-        weight_clip_value (float): Clip value for weight. Default: 1.0.
-        cls_dropout_prob (float): The dropout probability for Classifier. Default: 1.0.
-        activation_init (float): Init clip value for quantization. Default: 2.5.
-        is_att_fit (bool): If do attention based distillation or not. Default: True.
-        is_rep_fit (bool): If do hidden states based distillation or not. Default: True.
-        is_lgt_fit (bool): If use label loss or not. Default: True.
-        export (bool): If True, replace some operators for export. Default: False.
-    """
-
-    def __init__(self,
-                 seq_length=128,
-                 vocab_size=32000,
-                 hidden_size=768,
-                 num_hidden_layers=12,
-                 num_attention_heads=12,
-                 intermediate_size=3072,
-                 hidden_act="gelu",
-                 hidden_dropout_prob=0.1,
-                 attention_probs_dropout_prob=0.1,
-                 max_position_embeddings=512,
-                 type_vocab_size=16,
-                 initializer_range=0.02,
-                 use_relative_positions=False,
-                 dtype=mstype.float32,
-                 compute_type=mstype.float32,
-                 do_quant=False,
-                 embedding_bits=2,
-                 weight_bits=2,
-                 weight_clip_value=1.0,
-                 cls_dropout_prob=1.0,
-                 activation_init=2.5,
-                 is_att_fit=True,
-                 is_rep_fit=True,
-                 is_lgt_fit=True,
-                 export=False):
-        self.seq_length = seq_length
-        self.vocab_size = vocab_size
-        self.hidden_size = hidden_size
-        self.num_hidden_layers = num_hidden_layers
-        self.num_attention_heads = num_attention_heads
-        self.hidden_act = hidden_act
-        self.intermediate_size = intermediate_size
-        self.hidden_dropout_prob = hidden_dropout_prob
-        self.attention_probs_dropout_prob = attention_probs_dropout_prob
-        self.max_position_embeddings = max_position_embeddings
-        self.type_vocab_size = type_vocab_size
-        self.initializer_range = initializer_range
-        self.use_relative_positions = use_relative_positions
-        self.dtype = dtype
-        self.compute_type = compute_type
-        self.do_quant = do_quant
-        self.embedding_bits = embedding_bits
-        self.weight_bits = weight_bits
-        self.weight_clip_value = weight_clip_value
-        self.cls_dropout_prob = cls_dropout_prob
-        self.activation_init = activation_init
-        self.is_att_fit = is_att_fit
-        self.is_rep_fit = is_rep_fit
-        self.is_lgt_fit = is_lgt_fit
-        self.export = export
-
-
-class EmbeddingLookup(nn.Cell):
-    """
-    An embedding lookup table with a fixed dictionary and size.
-
-    Args:
-        vocab_size (int): Size of the dictionary of embeddings.
-        embedding_size (int): The size of each embedding vector.
-        embedding_shape (list): [batch_size, seq_length, embedding_size], the shape of
-                         each embedding vector.
-        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-    """
-
-    def __init__(self,
-                 vocab_size,
-                 embedding_size,
-                 embedding_shape,
-                 use_one_hot_embeddings=False,
-                 initializer_range=0.02,
-                 do_quant=False,
-                 activation_init=2.5):
-        super(EmbeddingLookup, self).__init__()
-        self.vocab_size = vocab_size
-        self.use_one_hot_embeddings = use_one_hot_embeddings
-        self.embedding_table = Parameter(initializer
-                                         (Normal(initializer_range),
-                                          [vocab_size, embedding_size]),
-                                         name='embedding_table')
-        self.expand = P.ExpandDims()
-        self.shape_flat = (-1,)
-        if do_quant:
-            self.gather = GatherV2Quant(activation_init=activation_init)
-        else:
-            self.gather = P.Gather()
-        self.one_hot = P.OneHot()
-        self.on_value = Tensor(1.0, mstype.float32)
-        self.off_value = Tensor(0.0, mstype.float32)
-        self.array_mul = P.MatMul()
-        self.reshape = P.Reshape()
-        self.shape = tuple(embedding_shape)
-
-    def construct(self, input_ids):
-        """embedding lookup"""
-        flat_ids = self.reshape(input_ids, self.shape_flat)
-        if self.use_one_hot_embeddings:
-            one_hot_ids = self.one_hot(flat_ids, self.vocab_size, self.on_value, self.off_value)
-            output_for_reshape = self.array_mul(
-                one_hot_ids, self.embedding_table)
-        else:
-            output_for_reshape = self.gather(self.embedding_table, flat_ids, 0)
-        output = self.reshape(output_for_reshape, self.shape)
-        return output, self.embedding_table
-
-
-class EmbeddingPostprocessor(nn.Cell):
-    """
-    Postprocessors apply positional and token type embeddings to word embeddings.
-
-    Args:
-        embedding_size (int): The size of each embedding vector.
-        embedding_shape (list): [batch_size, seq_length, embedding_size], the shape of
-                         each embedding vector.
-        use_token_type (bool): Specifies whether to use token type embeddings. Default: False.
-        token_type_vocab_size (int): Size of token type vocab. Default: 16.
-        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-        max_position_embeddings (int): Maximum length of sequences used in this
-                                 model. Default: 512.
-        dropout_prob (float): The dropout probability. Default: 0.1.
-    """
-
-    def __init__(self,
-                 use_relative_positions,
-                 embedding_size,
-                 embedding_shape,
-                 use_token_type=False,
-                 token_type_vocab_size=16,
-                 use_one_hot_embeddings=False,
-                 initializer_range=0.02,
-                 max_position_embeddings=512,
-                 dropout_prob=0.1):
-        super(EmbeddingPostprocessor, self).__init__()
-        self.use_token_type = use_token_type
-        self.token_type_vocab_size = token_type_vocab_size
-        self.use_one_hot_embeddings = use_one_hot_embeddings
-        self.max_position_embeddings = max_position_embeddings
-        self.embedding_table = Parameter(initializer
-                                         (Normal(initializer_range),
-                                          [token_type_vocab_size,
-                                           embedding_size]),
-                                         name='embedding_table')
-        self.shape_flat = (-1,)
-        self.one_hot = P.OneHot()
-        self.on_value = Tensor(1.0, mstype.float32)
-        self.off_value = Tensor(0.1, mstype.float32)
-        self.array_mul = P.MatMul()
-        self.reshape = P.Reshape()
-        self.shape = tuple(embedding_shape)
-        self.layernorm = nn.LayerNorm((embedding_size,))
-        self.dropout = nn.Dropout(1 - dropout_prob)
-        self.gather = P.Gather()
-        self.use_relative_positions = use_relative_positions
-        self.slice = P.StridedSlice()
-        self.full_position_embeddings = Parameter(initializer
-                                                  (Normal(initializer_range),
-                                                   [max_position_embeddings,
-                                                    embedding_size]),
-                                                  name='full_position_embeddings')
-
-    def construct(self, token_type_ids, word_embeddings):
-        """embedding postprocessor"""
-        output = word_embeddings
-        if self.use_token_type:
-            flat_ids = self.reshape(token_type_ids, self.shape_flat)
-            if self.use_one_hot_embeddings:
-                one_hot_ids = self.one_hot(flat_ids,
-                                           self.token_type_vocab_size, self.on_value, self.off_value)
-                token_type_embeddings = self.array_mul(one_hot_ids,
-                                                       self.embedding_table)
-            else:
-                token_type_embeddings = self.gather(self.embedding_table, flat_ids, 0)
-            token_type_embeddings = self.reshape(token_type_embeddings, self.shape)
-            output += token_type_embeddings
-        if not self.use_relative_positions:
-            _, seq, width = self.shape
-            position_embeddings = self.slice(self.full_position_embeddings, (0, 0), (seq, width), (1, 1))
-            position_embeddings = self.reshape(position_embeddings, (1, seq, width))
-            output += position_embeddings
-        output = self.layernorm(output)
-        output = self.dropout(output)
-        return output
-
-
-class BertOutput(nn.Cell):
-    """
-    Apply a linear computation to hidden status and a residual computation to input.
-
-    Args:
-        in_channels (int): Input channels.
-        out_channels (int): Output channels.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-        dropout_prob (float): The dropout probability. Default: 0.1.
-        compute_type (:class:`mindspore.dtype`): Compute type in BertTransformer. Default: mstype.float32.
-    """
-
-    def __init__(self,
-                 in_channels,
-                 out_channels,
-                 initializer_range=0.02,
-                 dropout_prob=0.1,
-                 compute_type=mstype.float32,
-                 do_quant=False,
-                 activation_init=2.5
-                 ):
-        super(BertOutput, self).__init__()
-        if do_quant:
-            self.dense = DenseQuant(in_channels, out_channels,
-                                    weight_init=Normal(initializer_range),
-                                    activation_init=activation_init).to_float(compute_type)
-        else:
-            self.dense = nn.Dense(in_channels, out_channels,
-                                  weight_init=Normal(initializer_range)).to_float(compute_type)
-        self.dropout = nn.Dropout(1 - dropout_prob)
-        self.add = P.Add()
-        self.is_gpu = context.get_context('device_target') == "GPU"
-        if self.is_gpu:
-            self.layernorm = nn.LayerNorm((out_channels,)).to_float(mstype.float32)
-            self.compute_type = compute_type
-        else:
-            self.layernorm = nn.LayerNorm((out_channels,)).to_float(compute_type)
-
-        self.cast = P.Cast()
-
-    def construct(self, hidden_status, input_tensor):
-        """bert output"""
-        output = self.dense(hidden_status)
-        output = self.dropout(output)
-        output = self.add(input_tensor, output)
-        output = self.layernorm(output)
-        if self.is_gpu:
-            output = self.cast(output, self.compute_type)
-        return output
-
-
-class RelaPosMatrixGenerator(nn.Cell):
-    """
-    Generates matrix of relative positions between inputs.
-
-    Args:
-        length (int): Length of one dim for the matrix to be generated.
-        max_relative_position (int): Max value of relative position.
-    """
-
-    def __init__(self, length, max_relative_position):
-        super(RelaPosMatrixGenerator, self).__init__()
-        self._length = length
-        self._max_relative_position = Tensor(max_relative_position, dtype=mstype.int32)
-        self._min_relative_position = Tensor(-max_relative_position, dtype=mstype.int32)
-        self.range_length = -length + 1
-        self.tile = P.Tile()
-        self.range_mat = P.Reshape()
-        self.sub = P.Sub()
-        self.expanddims = P.ExpandDims()
-        self.cast = P.Cast()
-
-    def construct(self):
-        """position matrix generator"""
-        range_vec_row_out = self.cast(F.tuple_to_array(F.make_range(self._length)), mstype.int32)
-        range_vec_col_out = self.range_mat(range_vec_row_out, (self._length, -1))
-        tile_row_out = self.tile(range_vec_row_out, (self._length,))
-        tile_col_out = self.tile(range_vec_col_out, (1, self._length))
-        range_mat_out = self.range_mat(tile_row_out, (self._length, self._length))
-        transpose_out = self.range_mat(tile_col_out, (self._length, self._length))
-        distance_mat = self.sub(range_mat_out, transpose_out)
-        distance_mat_clipped = C.clip_by_value(distance_mat,
-                                               self._min_relative_position,
-                                               self._max_relative_position)
-        # Shift values to be >=0. Each integer still uniquely identifies a
-        # relative position difference.
-        final_mat = distance_mat_clipped + self._max_relative_position
-        return final_mat
-
-
-class RelaPosEmbeddingsGenerator(nn.Cell):
-    """
-    Generates tensor of size [length, length, depth].
-
-    Args:
-        length (int): Length of one dim for the matrix to be generated.
-        depth (int): Size of each attention head.
-        max_relative_position (int): Maximum value of relative position.
-        initializer_range (float): Initialization value of Normal.
-        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
-    """
-
-    def __init__(self,
-                 length,
-                 depth,
-                 max_relative_position,
-                 initializer_range,
-                 use_one_hot_embeddings=False):
-        super(RelaPosEmbeddingsGenerator, self).__init__()
-        self.depth = depth
-        self.vocab_size = max_relative_position * 2 + 1
-        self.use_one_hot_embeddings = use_one_hot_embeddings
-        self.embeddings_table = Parameter(
-            initializer(Normal(initializer_range),
-                        [self.vocab_size, self.depth]),
-            name='embeddings_for_position')
-        self.relative_positions_matrix = RelaPosMatrixGenerator(length=length,
-                                                                max_relative_position=max_relative_position)
-        self.reshape = P.Reshape()
-        self.one_hot = P.OneHot()
-        self.on_value = Tensor(1.0, mstype.float32)
-        self.off_value = Tensor(0.0, mstype.float32)
-        self.shape = P.Shape()
-        self.gather = P.Gather()  # index_select
-        self.matmul = P.BatchMatMul()
-
-    def construct(self):
-        """position embedding generation"""
-        relative_positions_matrix_out = self.relative_positions_matrix()
-        # Generate embedding for each relative position of dimension depth.
-        if self.use_one_hot_embeddings:
-            flat_relative_positions_matrix = self.reshape(relative_positions_matrix_out, (-1,))
-            one_hot_relative_positions_matrix = self.one_hot(
-                flat_relative_positions_matrix, self.vocab_size, self.on_value, self.off_value)
-            embeddings = self.matmul(one_hot_relative_positions_matrix, self.embeddings_table)
-            my_shape = self.shape(relative_positions_matrix_out) + (self.depth,)
-            embeddings = self.reshape(embeddings, my_shape)
-        else:
-            embeddings = self.gather(self.embeddings_table,
-                                     relative_positions_matrix_out, 0)
-        return embeddings
-
-
-class SaturateCast(nn.Cell):
-    """
-    Performs a safe saturating cast. This operation applies proper clamping before casting to prevent
-    the danger that the value will overflow or underflow.
-
-    Args:
-        src_type (:class:`mindspore.dtype`): The type of the elements of the input tensor. Default: mstype.float32.
-        dst_type (:class:`mindspore.dtype`): The type of the elements of the output tensor. Default: mstype.float32.
-    """
-
-    def __init__(self, src_type=mstype.float32, dst_type=mstype.float32):
-        super(SaturateCast, self).__init__()
-        np_type = mstype.dtype_to_nptype(dst_type)
-        min_type = np.finfo(np_type).min
-        max_type = np.finfo(np_type).max
-        self.tensor_min_type = Tensor([min_type], dtype=src_type)
-        self.tensor_max_type = Tensor([max_type], dtype=src_type)
-        self.min_op = P.Minimum()
-        self.max_op = P.Maximum()
-        self.cast = P.Cast()
-        self.dst_type = dst_type
-
-    def construct(self, x):
-        """saturate cast"""
-        out = self.max_op(x, self.tensor_min_type)
-        out = self.min_op(out, self.tensor_max_type)
-        return self.cast(out, self.dst_type)
-
-
-class BertAttention(nn.Cell):
-    """
-    Apply multi-headed attention from "from_tensor" to "to_tensor".
-
-    Args:
-        from_tensor_width (int): Size of last dim of from_tensor.
-        to_tensor_width (int): Size of last dim of to_tensor.
-        from_seq_length (int): Length of from_tensor sequence.
-        to_seq_length (int): Length of to_tensor sequence.
-        num_attention_heads (int): Number of attention heads. Default: 1.
-        size_per_head (int): Size of each attention head. Default: 512.
-        query_act (str): Activation function for the query transform. Default: None.
-        key_act (str): Activation function for the key transform. Default: None.
-        value_act (str): Activation function for the value transform. Default: None.
-        has_attention_mask (bool): Specifies whether to use attention mask. Default: False.
-        attention_probs_dropout_prob (float): The dropout probability for
-                                      BertAttention. Default: 0.0.
-        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-        do_return_2d_tensor (bool): If True, return a 2D tensor; if False, return a 3D
-                             tensor. Default: False.
-        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
-        compute_type (:class:`mindspore.dtype`): Compute type in BertAttention. Default: mstype.float32.
-    """
-
-    def __init__(self,
-                 from_tensor_width,
-                 to_tensor_width,
-                 from_seq_length,
-                 to_seq_length,
-                 num_attention_heads=1,
-                 size_per_head=512,
-                 query_act=None,
-                 key_act=None,
-                 value_act=None,
-                 has_attention_mask=False,
-                 attention_probs_dropout_prob=0.0,
-                 use_one_hot_embeddings=False,
-                 initializer_range=0.02,
-                 do_return_2d_tensor=False,
-                 use_relative_positions=False,
-                 compute_type=mstype.float32,
-                 do_quant=False,
-                 activation_init=2.5
-                 ):
-        super(BertAttention, self).__init__()
-        self.from_seq_length = from_seq_length
-        self.to_seq_length = to_seq_length
-        self.num_attention_heads = num_attention_heads
-        self.size_per_head = size_per_head
-        self.has_attention_mask = has_attention_mask
-        self.use_relative_positions = use_relative_positions
-        self.scores_mul = Tensor([1.0 / math.sqrt(float(self.size_per_head))], dtype=compute_type)
-        self.reshape = P.Reshape()
-        self.shape_from_2d = (-1, from_tensor_width)
-        self.shape_to_2d = (-1, to_tensor_width)
-        weight = Normal(initializer_range)
-        units = num_attention_heads * size_per_head
-
-        if do_quant:
-            self.query_layer = DenseQuant(from_tensor_width,
-                                          units,
-                                          activation=query_act,
-                                          weight_init=weight,
-                                          activation_init=activation_init).to_float(compute_type)
-            self.key_layer = DenseQuant(to_tensor_width,
-                                        units,
-                                        activation=key_act,
-                                        weight_init=weight,
-                                        activation_init=activation_init).to_float(compute_type)
-            self.value_layer = DenseQuant(to_tensor_width,
-                                          units,
-                                          activation=value_act,
-                                          weight_init=weight,
-                                          activation_init=activation_init).to_float(compute_type)
-            self.matmul_trans_b = BatchMatMulQuant(transpose_b=True, activation_init=activation_init)
-            self.matmul = BatchMatMulQuant(activation_init=activation_init)
-        else:
-            self.query_layer = nn.Dense(from_tensor_width,
-                                        units,
-                                        activation=query_act,
-                                        weight_init=weight).to_float(compute_type)
-            self.key_layer = nn.Dense(to_tensor_width,
-                                      units,
-                                      activation=key_act,
-                                      weight_init=weight).to_float(compute_type)
-            self.value_layer = nn.Dense(to_tensor_width,
-                                        units,
-                                        activation=value_act,
-                                        weight_init=weight).to_float(compute_type)
-            self.matmul_trans_b = P.BatchMatMul(transpose_b=True)
-            self.matmul = P.BatchMatMul()
-        self.shape_from = (-1, from_seq_length, num_attention_heads, size_per_head)
-        self.shape_to = (-1, to_seq_length, num_attention_heads, size_per_head)
-        self.multiply = P.Mul()
-        self.transpose = P.Transpose()
-        self.trans_shape = (0, 2, 1, 3)
-        self.trans_shape_relative = (2, 0, 1, 3)
-        self.trans_shape_position = (1, 2, 0, 3)
-        self.multiply_data = Tensor([-10000.0,], dtype=compute_type)
-        self.softmax = nn.Softmax()
-        self.dropout = nn.Dropout(1 - attention_probs_dropout_prob)
-        if self.has_attention_mask:
-            self.expand_dims = P.ExpandDims()
-            self.sub = P.Sub()
-            self.add = P.Add()
-            self.cast = P.Cast()
-            self.get_dtype = P.DType()
-        if do_return_2d_tensor:
-            self.shape_return = (-1, num_attention_heads * size_per_head)
-        else:
-            self.shape_return = (-1, from_seq_length, num_attention_heads * size_per_head)
-        self.cast_compute_type = SaturateCast(dst_type=compute_type)
-        if self.use_relative_positions:
-            self._generate_relative_positions_embeddings = \
-                RelaPosEmbeddingsGenerator(length=to_seq_length,
-                                           depth=size_per_head,
-                                           max_relative_position=16,
-                                           initializer_range=initializer_range,
-                                           use_one_hot_embeddings=use_one_hot_embeddings)
-
-    def construct(self, from_tensor, to_tensor, attention_mask):
-        """bert attention"""
-        # reshape 2d/3d input tensors to 2d
-        from_tensor_2d = self.reshape(from_tensor, self.shape_from_2d)
-        to_tensor_2d = self.reshape(to_tensor, self.shape_to_2d)
-        query_out = self.query_layer(from_tensor_2d)
-        key_out = self.key_layer(to_tensor_2d)
-        value_out = self.value_layer(to_tensor_2d)
-        query_layer = self.reshape(query_out, self.shape_from)
-        query_layer = self.transpose(query_layer, self.trans_shape)
-        key_layer = self.reshape(key_out, self.shape_to)
-        key_layer = self.transpose(key_layer, self.trans_shape)
-        attention_scores = self.matmul_trans_b(query_layer, key_layer)
-        # use_relative_position, supplementary logic
-        if self.use_relative_positions:
-            # relations_keys is [F|T, F|T, H]
-            relations_keys = self._generate_relative_positions_embeddings()
-            relations_keys = self.cast_compute_type(relations_keys)
-            # query_layer_t is [F, B, N, H]
-            query_layer_t = self.transpose(query_layer, self.trans_shape_relative)
-            # query_layer_r is [F, B * N, H]
-            query_layer_r = self.reshape(query_layer_t,
-                                         (self.from_seq_length,
-                                          -1,
-                                          self.size_per_head))
-            # key_position_scores is [F, B * N, F|T]
-            key_position_scores = self.matmul_trans_b(query_layer_r,
-                                                      relations_keys)
-            # key_position_scores_r is [F, B, N, F|T]
-            key_position_scores_r = self.reshape(key_position_scores,
-                                                 (self.from_seq_length,
-                                                  -1,
-                                                  self.num_attention_heads,
-                                                  self.from_seq_length))
-            # key_position_scores_r_t is [B, N, F, F|T]
-            key_position_scores_r_t = self.transpose(key_position_scores_r,
-                                                     self.trans_shape_position)
-            attention_scores = attention_scores + key_position_scores_r_t
-        attention_scores = self.multiply(self.scores_mul, attention_scores)
-        if self.has_attention_mask:
-            attention_mask = self.expand_dims(attention_mask, 1)
-            multiply_out = self.sub(self.cast(F.tuple_to_array((1.0,)), self.get_dtype(attention_scores)),
-                                    self.cast(attention_mask, self.get_dtype(attention_scores)))
-            adder = self.multiply(multiply_out, self.multiply_data)
-            attention_scores = self.add(adder, attention_scores)
-        attention_probs = self.softmax(attention_scores)
-        attention_probs = self.dropout(attention_probs)
-        value_layer = self.reshape(value_out, self.shape_to)
-        value_layer = self.transpose(value_layer, self.trans_shape)
-        context_layer = self.matmul(attention_probs, value_layer)
-        # use_relative_position, supplementary logic
-        if self.use_relative_positions:
-            # relations_values is [F|T, F|T, H]
-            relations_values = self._generate_relative_positions_embeddings()
-            relations_values = self.cast_compute_type(relations_values)
-            # attention_probs_t is [F, B, N, T]
-            attention_probs_t = self.transpose(attention_probs, self.trans_shape_relative)
-            # attention_probs_r is [F, B * N, T]
-            attention_probs_r = self.reshape(
-                attention_probs_t,
-                (self.from_seq_length,
-                 -1,
-                 self.to_seq_length))
-            # value_position_scores is [F, B * N, H]
-            value_position_scores = self.matmul(attention_probs_r,
-                                                relations_values)
-            # value_position_scores_r is [F, B, N, H]
-            value_position_scores_r = self.reshape(value_position_scores,
-                                                   (self.from_seq_length,
-                                                    -1,
-                                                    self.num_attention_heads,
-                                                    self.size_per_head))
-            # value_position_scores_r_t is [B, N, F, H]
-            value_position_scores_r_t = self.transpose(value_position_scores_r,
-                                                       self.trans_shape_position)
-            context_layer = context_layer + value_position_scores_r_t
-        context_layer = self.transpose(context_layer, self.trans_shape)
-        context_layer = self.reshape(context_layer, self.shape_return)
-        return context_layer, attention_scores
-
-
-class BertSelfAttention(nn.Cell):
-    """
-    Apply self-attention.
-
-    Args:
-        seq_length (int): Length of input sequence.
-        hidden_size (int): Size of the bert encoder layers.
-        num_attention_heads (int): Number of attention heads. Default: 12.
-        attention_probs_dropout_prob (float): The dropout probability for
-                                      BertAttention. Default: 0.1.
-        use_one_hot_embeddings (bool): Specifies whether to use one_hot encoding form. Default: False.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
-        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
-        compute_type (:class:`mindspore.dtype`): Compute type in BertSelfAttention. Default: mstype.float32.
-    """
-
-    def __init__(self,
-                 seq_length,
-                 hidden_size,
-                 num_attention_heads=12,
-                 attention_probs_dropout_prob=0.1,
-                 use_one_hot_embeddings=False,
-                 initializer_range=0.02,
-                 hidden_dropout_prob=0.1,
-                 use_relative_positions=False,
-                 compute_type=mstype.float32,
-                 do_quant=False,
-                 activation_init=2.5
-                 ):
-        super(BertSelfAttention, self).__init__()
-        if hidden_size % num_attention_heads != 0:
-            raise ValueError("The hidden size (%d) is not a multiple of the number "
-                             "of attention heads (%d)" % (hidden_size, num_attention_heads))
-        self.size_per_head = int(hidden_size / num_attention_heads)
-        self.attention = BertAttention(
-            from_tensor_width=hidden_size,
-            to_tensor_width=hidden_size,
-            from_seq_length=seq_length,
-            to_seq_length=seq_length,
-            num_attention_heads=num_attention_heads,
-            size_per_head=self.size_per_head,
-            attention_probs_dropout_prob=attention_probs_dropout_prob,
-            use_one_hot_embeddings=use_one_hot_embeddings,
-            initializer_range=initializer_range,
-            use_relative_positions=use_relative_positions,
-            has_attention_mask=True,
-            do_return_2d_tensor=True,
-            compute_type=compute_type,
-            do_quant=do_quant,
-            activation_init=activation_init
-        )
-        self.output = BertOutput(in_channels=hidden_size,
-                                 out_channels=hidden_size,
-                                 initializer_range=initializer_range,
-                                 dropout_prob=hidden_dropout_prob,
-                                 compute_type=compute_type,
-                                 do_quant=do_quant,
-                                 activation_init=activation_init
-                                 )
-        self.reshape = P.Reshape()
-        self.shape = (-1, hidden_size)
-
-    def construct(self, input_tensor, attention_mask):
-        """bert self attention"""
-        input_tensor = self.reshape(input_tensor, self.shape)
-        attention_output, attention_scores = self.attention(input_tensor, input_tensor, attention_mask)
-        output = self.output(attention_output, input_tensor)
-        return output, attention_scores
-
-
-class BertEncoderCell(nn.Cell):
-    """
-    Encoder cells used in BertTransformer.
-
-    Args:
-        hidden_size (int): Size of the bert encoder layers. Default: 768.
-        seq_length (int): Length of input sequence. Default: 512.
-        num_attention_heads (int): Number of attention heads. Default: 12.
-        intermediate_size (int): Size of intermediate layer. Default: 3072.
-        attention_probs_dropout_prob (float): The dropout probability for
-                                      BertAttention. Default: 0.02.
-        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
-        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
-        hidden_act (str): Activation function. Default: "gelu".
-        compute_type (:class:`mindspore.dtype`): Compute type in attention. Default: mstype.float32.
-    """
-
-    def __init__(self,
-                 hidden_size=768,
-                 seq_length=512,
-                 num_attention_heads=12,
-                 intermediate_size=3072,
-                 attention_probs_dropout_prob=0.02,
-                 use_one_hot_embeddings=False,
-                 initializer_range=0.02,
-                 hidden_dropout_prob=0.1,
-                 use_relative_positions=False,
-                 hidden_act="gelu",
-                 compute_type=mstype.float32,
-                 do_quant=False,
-                 activation_init=2.5,
-                 export=False
-                 ):
-        super(BertEncoderCell, self).__init__()
-        self.attention = BertSelfAttention(
-            hidden_size=hidden_size,
-            seq_length=seq_length,
-            num_attention_heads=num_attention_heads,
-            attention_probs_dropout_prob=attention_probs_dropout_prob,
-            use_one_hot_embeddings=use_one_hot_embeddings,
-            initializer_range=initializer_range,
-            hidden_dropout_prob=hidden_dropout_prob,
-            use_relative_positions=use_relative_positions,
-            compute_type=compute_type,
-            do_quant=do_quant,
-            activation_init=activation_init
-        )
-        if do_quant:
-            self.intermediate = DenseQuant(in_channels=hidden_size,
-                                           out_channels=intermediate_size,
-                                           activation=hidden_act,
-                                           weight_init=Normal(initializer_range),
-                                           activation_init=activation_init).to_float(compute_type)
-        else:
-            if export and hidden_act == "gelu":
-                self.intermediate = DenseGeLU(in_channels=hidden_size,
-                                              out_channels=intermediate_size,
-                                              weight_init=Normal(initializer_range)).to_float(compute_type)
-            else:
-                self.intermediate = nn.Dense(in_channels=hidden_size,
-                                             out_channels=intermediate_size,
-                                             activation=hidden_act,
-                                             weight_init=Normal(initializer_range)).to_float(compute_type)
-        self.output = BertOutput(in_channels=intermediate_size,
-                                 out_channels=hidden_size,
-                                 initializer_range=initializer_range,
-                                 dropout_prob=hidden_dropout_prob,
-                                 compute_type=compute_type,
-                                 do_quant=do_quant,
-                                 activation_init=activation_init
-                                 )
-
-    def construct(self, hidden_states, attention_mask):
-        """bert encoder cell"""
-        # self-attention
-        attention_output, attention_scores = self.attention(hidden_states, attention_mask)
-        # feed-forward (intermediate) layer
-        intermediate_output = self.intermediate(attention_output)
-        # add and normalize
-        output = self.output(intermediate_output, attention_output)
-        return output, attention_scores
-
-
-class BertTransformer(nn.Cell):
-    """
-    Multi-layer bert transformer.
-
-    Args:
-        hidden_size (int): Size of the encoder layers.
-        seq_length (int): Length of input sequence.
-        num_hidden_layers (int): Number of hidden layers in encoder cells.
-        num_attention_heads (int): Number of attention heads in encoder cells. Default: 12.
-        intermediate_size (int): Size of intermediate layer in encoder cells. Default: 3072.
-        attention_probs_dropout_prob (float): The dropout probability for
-                                      BertAttention. Default: 0.1.
-        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
-        initializer_range (float): Initialization value of Normal. Default: 0.02.
-        hidden_dropout_prob (float): The dropout probability for BertOutput. Default: 0.1.
-        use_relative_positions (bool): Specifies whether to use relative positions. Default: False.
-        hidden_act (str): Activation function used in the encoder cells. Default: "gelu".
-        compute_type (:class:`mindspore.dtype`): Compute type in BertTransformer. Default: mstype.float32.
-        return_all_encoders (bool): Specifies whether to return all encoders. Default: False.
-    """
-
-    def __init__(self,
-                 hidden_size,
-                 seq_length,
-                 num_hidden_layers,
-                 num_attention_heads=12,
-                 intermediate_size=3072,
-                 attention_probs_dropout_prob=0.1,
-                 use_one_hot_embeddings=False,
-                 initializer_range=0.02,
-                 hidden_dropout_prob=0.1,
-                 use_relative_positions=False,
-                 hidden_act="gelu",
-                 compute_type=mstype.float32,
-                 return_all_encoders=False,
-                 do_quant=False,
-                 activation_init=2.5,
-                 export=False
-                 ):
-        super(BertTransformer, self).__init__()
-        self.return_all_encoders = return_all_encoders
-        layers = []
-        for _ in range(num_hidden_layers):
-            layer = BertEncoderCell(hidden_size=hidden_size,
-                                    seq_length=seq_length,
-                                    num_attention_heads=num_attention_heads,
-                                    intermediate_size=intermediate_size,
-                                    attention_probs_dropout_prob=attention_probs_dropout_prob,
-                                    use_one_hot_embeddings=use_one_hot_embeddings,
-                                    initializer_range=initializer_range,
-                                    hidden_dropout_prob=hidden_dropout_prob,
-                                    use_relative_positions=use_relative_positions,
-                                    hidden_act=hidden_act,
-                                    compute_type=compute_type,
-                                    do_quant=do_quant,
-                                    activation_init=activation_init,
-                                    export=export
-                                    )
-            layers.append(layer)
-        self.layers = nn.CellList(layers)
-        self.reshape = P.Reshape()
-        self.shape = (-1, hidden_size)
-        self.out_shape = (-1, seq_length, hidden_size)
-
-    def construct(self, input_tensor, attention_mask):
-        """bert transformer"""
-        prev_output = self.reshape(input_tensor, self.shape)
-        all_encoder_layers = ()
-        all_encoder_atts = ()
-        all_encoder_outputs = ()
-        all_encoder_outputs += (prev_output,)
-        for layer_module in self.layers:
-            layer_output, encoder_att = layer_module(prev_output, attention_mask)
-            prev_output = layer_output
-            if self.return_all_encoders:
-                all_encoder_outputs += (layer_output,)
-                layer_output = self.reshape(layer_output, self.out_shape)
-                all_encoder_layers += (layer_output,)
-                all_encoder_atts += (encoder_att,)
-        if not self.return_all_encoders:
-            prev_output = self.reshape(prev_output, self.out_shape)
-            all_encoder_layers += (prev_output,)
-        return all_encoder_layers, all_encoder_outputs, all_encoder_atts
-
-
-class CreateAttentionMaskFromInputMask(nn.Cell):
-    """
-    Create attention mask according to input mask.
-
-    Args:
-        config (Class): Configuration for BertModel.
-    """
-
-    def __init__(self, config):
-        super(CreateAttentionMaskFromInputMask, self).__init__()
-        self.input_mask = None
-        self.cast = P.Cast()
-        self.reshape = P.Reshape()
-        self.shape = (-1, 1, config.seq_length)
-
-    def construct(self, input_mask):
-        attention_mask = self.cast(self.reshape(input_mask, self.shape), mstype.float32)
-        return attention_mask
-
-
-class BertModel(nn.Cell):
-    """
-    Bidirectional Encoder Representations from Transformers.
-
-    Args:
-        config (Class): Configuration for BertModel.
-        is_training (bool): True for training mode. False for eval mode.
-        use_one_hot_embeddings (bool): Specifies whether to use one hot encoding form. Default: False.
-    """
-
-    def __init__(self,
-                 config,
-                 is_training,
-                 use_one_hot_embeddings=False):
-        super(BertModel, self).__init__()
-        config = copy.deepcopy(config)
-        if not is_training:
-            config.hidden_dropout_prob = 0.0
-            config.attention_probs_dropout_prob = 0.0
-        self.seq_length = config.seq_length
-        self.hidden_size = config.hidden_size
-        self.num_hidden_layers = config.num_hidden_layers
-        self.embedding_size = config.hidden_size
-        self.token_type_ids = None
-        self.last_idx = self.num_hidden_layers - 1
-        output_embedding_shape = [-1, self.seq_length, self.embedding_size]
-        self.bert_embedding_lookup = EmbeddingLookup(
-            vocab_size=config.vocab_size,
-            embedding_size=self.embedding_size,
-            embedding_shape=output_embedding_shape,
-            use_one_hot_embeddings=use_one_hot_embeddings,
-            initializer_range=config.initializer_range,
-            do_quant=config.do_quant,
-            activation_init=config.activation_init)
-        self.bert_embedding_postprocessor = EmbeddingPostprocessor(
-            use_relative_positions=config.use_relative_positions,
-            embedding_size=self.embedding_size,
-            embedding_shape=output_embedding_shape,
-            use_token_type=True,
-            token_type_vocab_size=config.type_vocab_size,
-            use_one_hot_embeddings=use_one_hot_embeddings,
-            initializer_range=0.02,
-            max_position_embeddings=config.max_position_embeddings,
-            dropout_prob=config.hidden_dropout_prob)
-        self.bert_encoder = BertTransformer(
-            hidden_size=self.hidden_size,
-            seq_length=self.seq_length,
-            num_attention_heads=config.num_attention_heads,
-            num_hidden_layers=self.num_hidden_layers,
-            intermediate_size=config.intermediate_size,
-            attention_probs_dropout_prob=config.attention_probs_dropout_prob,
-            use_one_hot_embeddings=use_one_hot_embeddings,
-            initializer_range=config.initializer_range,
-            hidden_dropout_prob=config.hidden_dropout_prob,
-            use_relative_positions=config.use_relative_positions,
-            hidden_act=config.hidden_act,
-            compute_type=config.compute_type,
-            return_all_encoders=True,
-            do_quant=config.do_quant,
-            activation_init=config.activation_init,
-            export=config.export
-        )
-        self.cast = P.Cast()
-        self.dtype = config.dtype
-        self.cast_compute_type = SaturateCast(dst_type=config.compute_type)
-        self.slice = P.StridedSlice()
-        self.squeeze_1 = P.Squeeze(axis=1)
-        if config.do_quant:
-            self.dense = DenseQuant(self.hidden_size, self.hidden_size,
-                                    activation="tanh",
-                                    weight_init=Normal(config.initializer_range),
-                                    activation_init=config.activation_init).to_float(config.compute_type)
-        else:
-            self.dense = nn.Dense(self.hidden_size, self.hidden_size,
-                                  activation="tanh",
-                                  weight_init=Normal(config.initializer_range)).to_float(config.compute_type)
-        self._create_attention_mask_from_input_mask = CreateAttentionMaskFromInputMask(config)
-
-    def construct(self, input_ids, token_type_ids, input_mask):
-        """bert model"""
-        # embedding
-        word_embeddings, embedding_tables = self.bert_embedding_lookup(input_ids)
-        embedding_output = self.bert_embedding_postprocessor(token_type_ids, word_embeddings)
-        # attention mask [batch_size, seq_length, seq_length]
-        attention_mask = self._create_attention_mask_from_input_mask(input_mask)
-        # bert encoder
-        encoder_output, encoder_layers, layer_atts = self.bert_encoder(self.cast_compute_type(embedding_output),
-                                                                       attention_mask)
-        sequence_output = self.cast(encoder_output[self.last_idx], self.dtype)
-        # pooler
-        batch_size = P.Shape()(input_ids)[0]
-        sequence_slice = self.slice(sequence_output,
-                                    (0, 0, 0),
-                                    (batch_size, 1, self.hidden_size),
-                                    (1, 1, 1))
-        first_token = self.squeeze_1(sequence_slice)
-        pooled_output = self.dense(first_token)
-        pooled_output = self.cast(pooled_output, self.dtype)
-        encoder_outputs = ()
-        for output in encoder_layers:
-            encoder_outputs += (self.cast(output, self.dtype),)
-        attention_outputs = ()
-        for output in layer_atts:
-            attention_outputs += (self.cast(output, self.dtype),)
-        return sequence_output, pooled_output, embedding_tables, encoder_outputs, attention_outputs
-
-
-class BertModelCLS(nn.Cell):
-    """
-    This class is responsible for classification task evaluation,
-    i.e. mnli(num_labels=3), qnli(num_labels=2), qqp(num_labels=2).
-    The returned output represents the final logits, as the results of log_softmax are proportional to those of softmax.
-    """
-
-    def __init__(self, config, is_training, num_labels=2, dropout_prob=0.0,
-                 use_one_hot_embeddings=False, phase_type="student"):
-        super(BertModelCLS, self).__init__()
-        self.bert = BertModel(config, is_training, use_one_hot_embeddings)
-        self.cast = P.Cast()
-        self.weight_init = Normal(config.initializer_range)
-        self.log_softmax = P.LogSoftmax(axis=-1)
-        self.dtype = config.dtype
-        self.num_labels = num_labels
-        self.phase_type = phase_type
-        self.dense_1 = nn.Dense(config.hidden_size, self.num_labels, weight_init=self.weight_init,
-                                has_bias=True).to_float(config.compute_type)
-        self.relu = nn.ReLU()
-        self.is_training = is_training
-        if is_training:
-            self.dropout = nn.Dropout(1 - dropout_prob)
-        self.export = config.export
-
-    def construct(self, input_ids, token_type_id, input_mask):
-        """classification bert model"""
-        _, pooled_output, _, seq_output, att_output = self.bert(input_ids, token_type_id, input_mask)
-        cls = self.cast(pooled_output, self.dtype)
-        cls = self.relu(cls)
-        if self.is_training:
-            cls = self.dropout(cls)
-        logits = self.dense_1(cls)
-        if self.export:
-            return logits
-        logits = self.cast(logits, self.dtype)
-        log_probs = self.log_softmax(logits)
-        return seq_output, att_output, logits, log_probs
diff --git a/research/nlp/ternarybert/src/utils.py b/research/nlp/ternarybert/src/utils.py
deleted file mode 100644
index 7256c4c33e4de13ff6214d1d3954c22730a25cd8..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/src/utils.py
+++ /dev/null
@@ -1,182 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""ternarybert utils"""
-
-import os
-import time
-import numpy as np
-from mindspore import Tensor
-from mindspore.common import dtype as mstype
-from mindspore.train.callback import Callback
-from mindspore.train.serialization import save_checkpoint
-from mindspore.ops import operations as P
-from mindspore.nn.learning_rate_schedule import LearningRateSchedule, PolynomialDecayLR, WarmUpLR
-from .quant import convert_network, save_params, restore_params
-
-
-class ModelSaveCkpt(Callback):
-    """
-    Saves checkpoint.
-    If the loss is NAN or INF, training is terminated.
-    Args:
-        network (Network): The train network for training.
-        save_ckpt_step (int): The step to save checkpoint.
-        max_ckpt_num (int): The max checkpoint number.
-    """
-    def __init__(self, network, save_ckpt_step, max_ckpt_num, output_dir, embedding_bits=2, weight_bits=2,
-                 clip_value=1.0):
-        super(ModelSaveCkpt, self).__init__()
-        self.count = 0
-        self.network = network
-        self.save_ckpt_step = save_ckpt_step
-        self.max_ckpt_num = max_ckpt_num
-        self.output_dir = output_dir
-        if not os.path.exists(output_dir):
-            os.makedirs(output_dir)
-        self.embedding_bits = embedding_bits
-        self.weight_bits = weight_bits
-        self.clip_value = clip_value
-
-    def step_end(self, run_context):
-        """step end and save ckpt"""
-        cb_params = run_context.original_args()
-        if cb_params.cur_step_num % self.save_ckpt_step == 0:
-            saved_ckpt_num = cb_params.cur_step_num / self.save_ckpt_step
-            if saved_ckpt_num > self.max_ckpt_num:
-                oldest_ckpt_index = saved_ckpt_num - self.max_ckpt_num
-                path = os.path.join(self.output_dir, "ternary_bert_{}_{}.ckpt".format(int(oldest_ckpt_index),
-                                                                                      self.save_ckpt_step))
-                if os.path.exists(path):
-                    os.remove(path)
-            params_dict = save_params(self.network)
-            convert_network(self.network, self.embedding_bits, self.weight_bits, self.clip_value)
-            save_checkpoint(self.network, os.path.join(self.output_dir,
-                                                       "ternary_bert_{}_{}.ckpt".format(int(saved_ckpt_num),
-                                                                                        self.save_ckpt_step)))
-            restore_params(self.network, params_dict)
-
-
-class LossCallBack(Callback):
-    """
-    Monitor the loss in training.
-    """
-    def __init__(self, per_print_times=1):
-        super(LossCallBack, self).__init__()
-        if not isinstance(per_print_times, int) or per_print_times < 0:
-            raise ValueError("per_print_times must be int and >= 0")
-        self._per_print_times = per_print_times
-
-    def step_end(self, run_context):
-        """step end and print loss"""
-        cb_params = run_context.original_args()
-        print("epoch: {}, step: {}, outputs are {}".format(cb_params.cur_epoch_num,
-                                                           cb_params.cur_step_num,
-                                                           str(cb_params.net_outputs)))
-
-
-class StepCallBack(Callback):
-    """
-    Monitor the time in training.
-    """
-    def __init__(self):
-        super(StepCallBack, self).__init__()
-        self.start_time = 0.0
-
-    def step_begin(self, run_context):
-        self.start_time = time.time()
-
-    def step_end(self, run_context):
-        time_cost = time.time() - self.start_time
-        cb_params = run_context.original_args()
-        print("step: {}, second_per_step: {}".format(cb_params.cur_step_num, time_cost))
-
-
-class EvalCallBack(Callback):
-    """Evaluation callback"""
-    def __init__(self, network, dataset, eval_ckpt_step, save_ckpt_dir, embedding_bits=2, weight_bits=2,
-                 clip_value=1.0, metrics=None):
-        super(EvalCallBack, self).__init__()
-        self.network = network
-        self.global_metrics = 0.0
-        self.dataset = dataset
-        self.eval_ckpt_step = eval_ckpt_step
-        self.save_ckpt_dir = save_ckpt_dir
-        self.embedding_bits = embedding_bits
-        self.weight_bits = weight_bits
-        self.clip_value = clip_value
-        self.metrics = metrics
-        if not os.path.exists(save_ckpt_dir):
-            os.makedirs(save_ckpt_dir)
-
-    def step_end(self, run_context):
-        """step end and do evaluation"""
-        cb_params = run_context.original_args()
-        if cb_params.cur_step_num % self.eval_ckpt_step == 0:
-            params_dict = save_params(self.network)
-            convert_network(self.network, self.embedding_bits, self.weight_bits, self.clip_value)
-            self.network.set_train(False)
-            callback = self.metrics()
-            columns_list = ["input_ids", "input_mask", "segment_ids", "label_ids"]
-            for data in self.dataset:
-                input_data = []
-                for i in columns_list:
-                    input_data.append(data[i])
-                input_ids, input_mask, token_type_id, label_ids = input_data
-                _, _, logits, _ = self.network(input_ids, token_type_id, input_mask)
-                callback.update(logits, label_ids)
-            metrics = callback.get_metrics()
-
-            if metrics > self.global_metrics:
-                self.global_metrics = metrics
-                eval_model_ckpt_file = os.path.join(self.save_ckpt_dir, 'eval_model.ckpt')
-                if os.path.exists(eval_model_ckpt_file):
-                    os.remove(eval_model_ckpt_file)
-                save_checkpoint(self.network, eval_model_ckpt_file)
-            print('step {}, {} {}, best_{} {}'.format(cb_params.cur_step_num,
-                                                      callback.name,
-                                                      metrics,
-                                                      callback.name,
-                                                      self.global_metrics))
-            restore_params(self.network, params_dict)
-            self.network.set_train(True)
-
-
-class BertLearningRate(LearningRateSchedule):
-    """
-    Warmup-decay learning rate for Bert network.
-    """
-    def __init__(self, learning_rate, end_learning_rate, warmup_steps, decay_steps, power):
-        super(BertLearningRate, self).__init__()
-        self.warmup_flag = False
-        if warmup_steps > 0:
-            self.warmup_flag = True
-            self.warmup_lr = WarmUpLR(learning_rate, warmup_steps)
-        self.decay_lr = PolynomialDecayLR(learning_rate, end_learning_rate, decay_steps, power)
-        self.warmup_steps = Tensor(np.array([warmup_steps]).astype(np.float32))
-
-        self.greater = P.Greater()
-        self.one = Tensor(np.array([1.0]).astype(np.float32))
-        self.cast = P.Cast()
-
-    def construct(self, global_step):
-        decay_lr = self.decay_lr(global_step)
-        if self.warmup_flag:
-            is_warmup = self.cast(self.greater(self.warmup_steps, global_step), mstype.float32)
-            warmup_lr = self.warmup_lr(global_step)
-            lr = (self.one - is_warmup) * decay_lr + is_warmup * warmup_lr
-        else:
-            lr = decay_lr
-        return lr
diff --git a/research/nlp/ternarybert/train.py b/research/nlp/ternarybert/train.py
deleted file mode 100644
index 5410b90c53dee56abbeb3aaeaf15c03d20efe058..0000000000000000000000000000000000000000
--- a/research/nlp/ternarybert/train.py
+++ /dev/null
@@ -1,177 +0,0 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-# http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-# ============================================================================
-
-"""task distill script"""
-
-import os
-import argparse
-import ast
-
-from mindspore import context
-from mindspore.train.model import Model
-from mindspore.nn.optim import AdamWeightDecay
-from mindspore import set_seed
-from mindspore.train.callback import TimeMonitor
-
-from src.dataset import create_dataset
-from src.utils import StepCallBack, ModelSaveCkpt, EvalCallBack, BertLearningRate
-from src.config import train_cfg, eval_cfg, teacher_net_cfg, student_net_cfg, task_cfg
-from src.cell_wrapper import BertNetworkWithLoss, BertTrainCell
-
-WEIGHTS_NAME = 'eval_model.ckpt'
-EVAL_DATA_NAME = 'eval.tf_record'
-TRAIN_DATA_NAME = 'train.tf_record'
-
-
-def parse_args():
-    """
-    parse args
-    """
-    parser = argparse.ArgumentParser(description='ternarybert task distill')
-    parser.add_argument('--device_target', type=str, default='GPU', choices=['Ascend', 'GPU'],
-                        help='Device where the code will be run. (Default: GPU)')
-    parser.add_argument('--do_eval', type=ast.literal_eval, default=True,
-                        help='Do eval task during training or not. (Default: True)')
-    parser.add_argument('--epoch_size', type=int, default=3, help='Epoch size for train phase. (Default: 3)')
-    parser.add_argument('--device_id', type=int, default=0, help='Device id. (Default: 0)')
-    parser.add_argument('--do_shuffle', type=ast.literal_eval, default=True,
-                        help='Enable shuffle for train dataset. (Default: True)')
-    parser.add_argument('--enable_data_sink', type=ast.literal_eval, default=True,
-                        help='Enable data sink. (Default: True)')
-    parser.add_argument('--save_ckpt_step', type=int, default=50,
-                        help='If do_eval is False, the checkpoint will be saved every save_ckpt_step. (Default: 50)')
-    parser.add_argument('--eval_ckpt_step', type=int, default=50,
-                        help='If do_eval is True, the evaluation will be run every eval_ckpt_step. (Default: 50)')
-    parser.add_argument('--max_ckpt_num', type=int, default=10,
-                        help='The number of checkpoints will not be larger than max_ckpt_num. (Default: 10)')
-    parser.add_argument('--data_sink_steps', type=int, default=100, help='Sink steps for each epoch. (Default: 100)')
-    parser.add_argument('--teacher_model_dir', type=str, default='', help='The checkpoint directory of teacher model.')
-    parser.add_argument('--student_model_dir', type=str, default='', help='The checkpoint directory of student model.')
-    parser.add_argument('--data_dir', type=str, default='', help='Data directory.')
-    parser.add_argument('--output_dir', type=str, default='./', help='The output checkpoint directory.')
-    parser.add_argument('--task_name', type=str, default='sts-b', choices=['sts-b', 'qnli', 'mnli'],
-                        help='The name of the task to train. (Default: sts-b)')
-    parser.add_argument('--dataset_type', type=str, default='tfrecord', choices=['tfrecord', 'mindrecord'],
-                        help='The format of the dataset files. (Default: tfrecord)')
-    parser.add_argument('--seed', type=int, default=1, help='The random seed')
-    parser.add_argument('--train_batch_size', type=int, default=16, help='Batch size for training')
-    parser.add_argument('--eval_batch_size', type=int, default=32, help='Eval Batch size in callback')
-    return parser.parse_args()
-
-
-def run_task_distill(args_opt):
-    """
-    run task distill
-    """
-    task = task_cfg[args_opt.task_name]
-    teacher_net_cfg.seq_length = task.seq_length
-    student_net_cfg.seq_length = task.seq_length
-    train_cfg.batch_size = args_opt.train_batch_size
-    eval_cfg.batch_size = args_opt.eval_batch_size
-    teacher_ckpt = os.path.join(args_opt.teacher_model_dir, args_opt.task_name, WEIGHTS_NAME)
-    student_ckpt = os.path.join(args_opt.student_model_dir, args_opt.task_name, WEIGHTS_NAME)
-    train_data_dir = os.path.join(args_opt.data_dir, args_opt.task_name, TRAIN_DATA_NAME)
-    eval_data_dir = os.path.join(args_opt.data_dir, args_opt.task_name, EVAL_DATA_NAME)
-    save_ckpt_dir = os.path.join(args_opt.output_dir, args_opt.task_name)
-
-    if args_opt.device_target == "Ascend":
-        context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target, device_id=args_opt.device_id)
-    elif args_opt.device_target == "GPU":
-        context.set_context(mode=context.GRAPH_MODE, device_target=args_opt.device_target)
-    else:
-        raise Exception("Target error, GPU or Ascend is supported.")
-
-    rank = 0
-    device_num = 1
-    train_dataset = create_dataset(batch_size=train_cfg.batch_size,
-                                   device_num=device_num,
-                                   rank=rank,
-                                   do_shuffle=args_opt.do_shuffle,
-                                   data_dir=train_data_dir,
-                                   data_type=args_opt.dataset_type,
-                                   seq_length=task.seq_length,
-                                   task_type=task.task_type,
-                                   drop_remainder=True)
-    dataset_size = train_dataset.get_dataset_size()
-    print('train dataset size:', dataset_size)
-    eval_dataset = create_dataset(batch_size=eval_cfg.batch_size,
-                                  device_num=device_num,
-                                  rank=rank,
-                                  do_shuffle=args_opt.do_shuffle,
-                                  data_dir=eval_data_dir,
-                                  data_type=args_opt.dataset_type,
-                                  seq_length=task.seq_length,
-                                  task_type=task.task_type,
-                                  drop_remainder=False)
-    print('eval dataset size:', eval_dataset.get_dataset_size())
-
-    if args_opt.enable_data_sink:
-        repeat_count = args_opt.epoch_size * dataset_size // args_opt.data_sink_steps
-        time_monitor_steps = args_opt.data_sink_steps
-    else:
-        repeat_count = args_opt.epoch_size
-        time_monitor_steps = dataset_size
-
-    netwithloss = BertNetworkWithLoss(teacher_config=teacher_net_cfg, teacher_ckpt=teacher_ckpt,
-                                      student_config=student_net_cfg, student_ckpt=student_ckpt,
-                                      is_training=True, task_type=task.task_type, num_labels=task.num_labels)
-    params = netwithloss.trainable_params()
-    optimizer_cfg = train_cfg.optimizer_cfg
-    lr_schedule = BertLearningRate(learning_rate=optimizer_cfg.AdamWeightDecay.learning_rate,
-                                   end_learning_rate=optimizer_cfg.AdamWeightDecay.end_learning_rate,
-                                   warmup_steps=int(dataset_size * args_opt.epoch_size *
-                                                    optimizer_cfg.AdamWeightDecay.warmup_ratio),
-                                   decay_steps=int(dataset_size * args_opt.epoch_size),
-                                   power=optimizer_cfg.AdamWeightDecay.power)
-    decay_params = list(filter(optimizer_cfg.AdamWeightDecay.decay_filter, params))
-    other_params = list(filter(lambda x: not optimizer_cfg.AdamWeightDecay.decay_filter(x), params))
-    group_params = [{'params': decay_params, 'weight_decay': optimizer_cfg.AdamWeightDecay.weight_decay},
-                    {'params': other_params, 'weight_decay': 0.0},
-                    {'order_params': params}]
-
-    optimizer = AdamWeightDecay(group_params, learning_rate=lr_schedule, eps=optimizer_cfg.AdamWeightDecay.eps)
-
-    netwithgrads = BertTrainCell(netwithloss, optimizer=optimizer)
-
-    if args_opt.do_eval:
-        eval_dataset = list(eval_dataset.create_dict_iterator())
-        callback = [TimeMonitor(time_monitor_steps),
-                    EvalCallBack(network=netwithloss.bert,
-                                 dataset=eval_dataset,
-                                 eval_ckpt_step=args_opt.eval_ckpt_step,
-                                 save_ckpt_dir=save_ckpt_dir,
-                                 embedding_bits=student_net_cfg.embedding_bits,
-                                 weight_bits=student_net_cfg.weight_bits,
-                                 clip_value=student_net_cfg.weight_clip_value,
-                                 metrics=task.metrics)]
-    else:
-        callback = [TimeMonitor(time_monitor_steps), StepCallBack(),
-                    ModelSaveCkpt(network=netwithloss.bert,
-                                  save_ckpt_step=args_opt.save_ckpt_step,
-                                  max_ckpt_num=args_opt.max_ckpt_num,
-                                  output_dir=save_ckpt_dir,
-                                  embedding_bits=student_net_cfg.embedding_bits,
-                                  weight_bits=student_net_cfg.weight_bits,
-                                  clip_value=student_net_cfg.weight_clip_value)]
-    model = Model(netwithgrads)
-    model.train(repeat_count, train_dataset, callbacks=callback,
-                dataset_sink_mode=args_opt.enable_data_sink,
-                sink_size=args_opt.data_sink_steps)
-
-
-if __name__ == '__main__':
-    args = parse_args()
-    set_seed(args.seed)
-    run_task_distill(args)