diff --git a/research/nlp/transformer_xl/README.md b/research/nlp/transformer_xl/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..0e60df4873c4c01e67d3b2a26d3ad83d8d31ce46
--- /dev/null
+++ b/research/nlp/transformer_xl/README.md
@@ -0,0 +1,298 @@
+# Contents
+
+- [Contents](#contents)
+    - [Transformer-XL Description](#transformer-xl-description)
+    - [Model Architecture](#model-architecture)
+    - [Dataset](#dataset)
+    - [Environment Requirements](#environment-requirements)
+    - [Quick Start](#quick-start)
+    - [Script Description](#script-description)
+        - [Script and Sample Code](#script-and-sample-code)
+        - [Script Parameters](#script-parameters)
+            - [Training Script Parameters](#training-script-parameters)
+            - [Running Options](#running-options)
+            - [Network Parameters](#network-parameters)
+        - [Dataset Preparation](#dataset-preparation)
+        - [Training Process](#training-process)
+        - [Evaluation Process](#evaluation-process)
+    - [Model Description](#model-description)
+        - [Performance](#performance)
+            - [Training Performance](#training-performance)
+            - [Evaluation Performance](#evaluation-performance)
+    - [Description of Random Situation](#description-of-random-situation)
+    - [ModelZoo Homepage](#modelzoo-homepage)
+
+## [Transformer-XL Description](#contents)
+
+Transformer-XL is an extension of the Transformer designed to model long sequences. It combines the strengths of
+RNN-style sequence modeling with the Transformer's self-attention mechanism by introducing a recurrence mechanism
+and relative positional encoding: the attention module is applied to each segment of the input, and the recurrence
+mechanism learns the dependencies between consecutive segments. Transformer-XL achieved SoTA results on language
+modeling datasets such as enwik8 and text8.
+
+[Paper](https://arxiv.org/abs/1901.02860):  Dai Z, Yang Z, Yang Y, et al. Transformer-xl: Attentive language models
+beyond a fixed-length context[J]. arXiv preprint arXiv:1901.02860, 2019.
+
+## [Model Architecture](#contents)
+
+Transformer-XL uses the Transformer as its backbone and adds a recurrence mechanism and relative positional
+encoding to it.
+
+## [Dataset](#contents)
+
+Both datasets below include training and evaluation data. We recommend running `bash getdata.sh`, which downloads and preprocesses them automatically.
+
+- [enwik8](http://mattmahoney.net/dc/enwik8.zip)
+
+The enwik8 dataset is derived from Wikipedia and is commonly used to measure how well a model compresses data. It contains 100MB of unprocessed Wikipedia text.
+
+If you download enwik8 directly from the link above, download and run [prep_enwik8.py](https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py) to preprocess it.
+
+Dataset size:
+
+- Training set: 88,982,818 characters in total
+- Validation set: 4,945,742 characters in total
+- Test set: 36,191 characters in total
+
+Dataset format: TXT text
+
+Dataset directory structure:
+
+```text
+└─data
+  ├─enwik8
+    ├─train.txt       # Training set
+    ├─train.txt.raw   # Training set(unprocessed)
+    ├─valid.txt       # Validation set
+    ├─valid.txt.raw   # Validation set(unprocessed)
+    ├─test.txt        # Test set
+    └─test.txt.raw    # Test set(unprocessed)
+```
+
+- [text8](http://mattmahoney.net/dc/text8.zip)
+
+text8 also contains 100MB of Wikipedia text. Unlike enwik8, it keeps only the 26 letters and spaces; all other characters are removed.
+
+If you download text8 directly from the link above, run `prep_text8.py` to preprocess it.
+
+Dataset size:
+
+- Training set: 89,999,999 characters in total
+- Validation set: 4,999,999 characters in total
+- Test set: 5,000,000 characters in total
+
+Dataset format: TXT text
+
+Dataset directory structure:
+
+```text
+└─data
+  ├─text8
+    ├─train.txt       # Training set
+    ├─train.txt.raw   # Training set(unprocessed)
+    ├─valid.txt       # Validation set
+    ├─valid.txt.raw   # Validation set(unprocessed)
+    ├─test.txt        # Test set
+    └─test.txt.raw    # Test set(unprocessed)
+```
+
+## [Environment Requirements](#contents)
+
+- Hardware(Ascend/GPU)
+    - Prepare hardware environment with Ascend or GPU processor.
+- Framework
+    - [MindSpore](https://gitee.com/mindspore/mindspore)
+- For more information, please check the resources below:
+    - [MindSpore Tutorials](https://www.mindspore.cn/tutorials/en/master/index.html)
+    - [MindSpore Python API](https://www.mindspore.cn/docs/api/en/master/index.html)
+
+## [Quick Start](#contents)
+
+- Running on GPU
+
+After dataset preparation, you can start training and evaluation as follows:
+
+```bash
+# Adjust the hyperparameters in enwik8_base.yaml as needed.
+# [DATA_NAME] must be one of: enwik8, text8.
+# [TRAIN_URL] can be a plain name such as "experiments", in which case the training output directory is created automatically under "/script/train/experiments-enwik8", or a path such as "/home/mindspore/transformer-xl/enwik8_8p", in which case the trained model is saved in that directory.
+
+# run training example
+bash run_standalone_train_gpu.sh [DEVICE_ID] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+# for example: bash run_standalone_train_gpu.sh 0 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+
+# run distributed training example
+bash run_distribute_train_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+# for example: bash run_distribute_train_gpu.sh 4 0,1,2,3 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+
+# run evaluation example
+bash run_eval_gpu.sh [DATA_URL] [DATA_NAME] [CKPT_PATH] [CONFIG_PATH] [DEVICE_ID(optional)]
+# for example: bash run_eval_gpu.sh  /home/mindspore/transformer-xl/data/enwik8/ enwik8 /home/mindspore/transformer-xl/script/experiments-enwik8/20220416-140816/model7.ckpt /home/mindspore/transformer-xl/yaml/enwik8_base.yaml 0
+```
+
+## [Script Description](#contents)
+
+### [Script and Sample Code](#contents)
+
+```text
+.
+└─Transformer-XL
+  ├─README.md             // descriptions about Transformer-XL
+  ├─README_CN.md          // descriptions about Transformer-XL
+  ├─script
+    ├─run_distribute_train_gpu.sh   // shell script for distributed training on GPU
+    ├─run_standalone_train_gpu.sh   // shell script for training on GPU
+    └─run_eval_gpu.sh               // shell script for testing on GPU
+  ├─src
+    ├─callback
+      ├─eval.py           // callback function(eval)
+      ├─flag.py           // callback function(flag)
+      └─log.py            // callback function(log)
+    ├─loss_fn
+      └─ProjectedAdaptiveLogSoftmaxLoss.py    // loss
+    ├─metric
+      └─calc.py               // get bpc and ppl
+    ├─model
+      ├─attn.py               // Attention code
+      ├─dataset.py            // get dataset
+      ├─embedding.py          // PositionalEmbedding and AdaptiveEmbedding
+      ├─layer.py              // layer code
+      ├─mem_transformer.py    // Transformer-XL model
+      ├─positionwiseFF.py     // positionwiseFF
+      └─vocabulary.py         // construct vocabulary
+    ├─model_utils
+      ├─config.py             // parameter configuration
+      ├─device_adapter.py     // device adapter
+      ├─local_adapter.py      // local adapter
+      └─moxing_adapter.py     // moxing adapter
+    ├─utils
+      ├─additional_algorithms.py  // General method
+      ├─dataset_util.py           // Interface to get dataset
+      └─nnUtils.py                // Basic method
+  ├─yaml
+      ├─enwik8_base.yaml          // parameter configuration on gpu
+      ├─enwik8_large.yaml         // parameter configuration on gpu
+      └─text8_large.yaml          // parameter configuration on gpu
+  ├─getdata.sh                    // shell script for preprocessing dataset
+  ├─eval.py                       // evaluation script
+  └─train.py                      // training script
+```
+
+### [Script Parameters](#contents)
+
+#### Training Script Parameters
+
+```text
+usage:
+train.py
+To set parameters, edit the ./enwik8_base.yaml file.
+To use a different configuration file, change the --config_path parameter at line 130 of /src/model_utils/config.py.
+
+```
+
+#### Network Parameters
+
+```text
+Parameters for dataset and network (Training/Evaluation):
+    n_layer       number of total layers: N, default is 12
+    d_model       dimension of model, default is 512
+    n_head        number of heads, default is 8
+    d_head        head dimension, default is 64
+    d_inner       inner dimension in FF, default is 2048
+    dropout       global dropout rate: Q, default is 0.1
+    dropatt       attention probability dropout rate: Q, default is 0.0
+    max_step      maximum of step: N, default is 400000
+    tgt_len       number of tokens to predict, default is 512
+    mem_len       length of the retained previous hidden states (memory), default is 512
+    eval_tgt_len  number of tokens to predict for evaluation, default is 128
+    batch_size    batch size of input dataset: N, default is 22
+
+Parameters for learning rate:
+    lr            value of learning rate: Q, default is 0.00025
+    warmup_step   steps of the learning rate warm up: N, default is 0
+```
+
+### [Dataset Preparation](#contents)
+
+- Download the dataset and configure DATA_PATH
+
+### [Training Process](#contents)
+
+- Set options in `enwik8_base.yaml`, including loss_scale, learning rate and network hyperparameters.
+
+- Run `run_standalone_train_gpu.sh` for training of Transformer-XL model.
+
+    ```bash
+    # run training example
+    bash run_standalone_train_gpu.sh [DEVICE_ID] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+    # for example: bash run_standalone_train_gpu.sh 0 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+    ```
+
+- Run `run_distribute_train_gpu.sh` for distributed training of Transformer-XL model.
+
+    ```bash
+    # run distributed training example
+    bash run_distribute_train_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+    # for example: bash run_distribute_train_gpu.sh 4 0,1,2,3 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+    ```
+
+### [Evaluation Process](#contents)
+
+- Set options in `enwik8_base.yaml`. Make sure `datadir` is set to your own dataset path.
+
+- Run `run_eval_gpu.sh` for evaluation of Transformer-XL model.
+
+    ```bash
+    # run evaluation example
+    bash run_eval_gpu.sh [DATA_URL] [DATA_NAME] [CKPT_PATH] [CONFIG_PATH] [DEVICE_ID(optional)]
+    # for example: bash run_eval_gpu.sh  /home/mindspore/transformer-xl/data/enwik8/ enwik8 /home/mindspore/transformer-xl/script/experiments-enwik8/20220416-140816/model7.ckpt /home/mindspore/transformer-xl/yaml/enwik8_base.yaml 0
+    ```
+
+## [Model Description](#contents)
+
+### [Performance](#contents)
+
+#### Training Performance
+
+| Parameters                 | GPU                                    |
+| -------------------------- | -------------------------------------- |
+| Resource                   | MindSpore                              |
+| Uploaded Date              | 04/22/2022 (month/day/year)            |
+| MindSpore Version          | 1.6.1                                  |
+| Dataset                    | enwik8                                 |
+| Training Parameters        | max_step=400000, batch_size=22         |
+| Optimizer                  | Adam                                   |
+| Loss Function              | Softmax Cross Entropy                  |
+| BPC Score                  | 1.07906                                |
+| Speed                      | 421.24 ms/step (1p, bsz=8)             |
+| Loss                       | 0.75                                   |
+| Checkpoint for inference   | 1.45G (.ckpt file)                     |
+| Scripts                    | Transformer-XL scripts                 |
+
+#### Evaluation Performance
+
+| Parameters          | GPU                         |
+| ------------------- | --------------------------- |
+| Resource            | MindSpore                   |
+| Uploaded Date       | 04/22/2022 (month/day/year) |
+| MindSpore Version   | 1.6.1                       |
+| Dataset             | enwik8                      |
+| batch_size          | 22                          |
+| outputs             | loss,bpc                    |
+| Loss                | 0.75                        |
+| BPC Score           | 1.07906                     |
+
+## [Description of Random Situation](#contents)
+
+There are three sources of randomness:
+
+- Shuffle of the dataset.
+- Initialization of some model weights.
+- Dropout operations.
+
+Some seeds have already been set in train.py to avoid the randomness of dataset shuffle and weight initialization. To
+disable dropout, set the `dropout` and `dropatt` parameters to 0 in the YAML configuration file, for example `enwik8_base.yaml`.
+
+## [ModelZoo Homepage](#contents)
+
+Please check the official [homepage](https://gitee.com/mindspore/models).
diff --git a/research/nlp/transformer_xl/README_CN.md b/research/nlp/transformer_xl/README_CN.md
new file mode 100644
index 0000000000000000000000000000000000000000..6a4f3869df42c111c32b33d8fffe079866d4cc1b
--- /dev/null
+++ b/research/nlp/transformer_xl/README_CN.md
@@ -0,0 +1,296 @@
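+# Contents
+
+- [Contents](#contents)
+    - [Transformer-XL Description](#transformer-xl-description)
+    - [Model Architecture](#model-architecture)
+    - [Dataset](#dataset)
+    - [Environment Requirements](#environment-requirements)
+    - [Quick Start](#quick-start)
+    - [Script Description](#script-description)
+        - [Script and Sample Code](#script-and-sample-code)
+        - [Script Parameters](#script-parameters)
+            - [Training Script Parameters](#training-script-parameters)
+            - [Running Options](#running-options)
+            - [Network Parameters](#network-parameters)
+        - [Dataset Preparation](#dataset-preparation)
+        - [Training Process](#training-process)
+        - [Evaluation Process](#evaluation-process)
+    - [Model Description](#model-description)
+        - [Performance](#performance)
+            - [Training Performance](#training-performance)
+            - [Evaluation Performance](#evaluation-performance)
+    - [Description of Random Situation](#description-of-random-situation)
+    - [ModelZoo Homepage](#modelzoo-homepage)
+
+## Transformer-XL Description
+
+Transformer-XL is an improvement on the Transformer that mainly addresses long sequences. It combines the strengths of
+RNN sequence modeling and the Transformer's self-attention mechanism by introducing a recurrence mechanism and relative
+positional encoding: the Transformer attention module is applied to each segment of the input, and the recurrence mechanism learns the dependencies between consecutive segments. It achieved SoTA results on language modeling datasets such as enwik8 and text8.
+
+[Paper](https://arxiv.org/abs/1901.02860):  Dai Z, Yang Z, Yang Y, et al. Transformer-xl: Attentive language models beyond
+a fixed-length context[J]. arXiv preprint arXiv:1901.02860, 2019.
+
+## Model Architecture
+
+Transformer-XL uses the Transformer as its backbone and adds a recurrence mechanism and relative positional encoding to it.
+
+## Dataset
+
+The following datasets include training and evaluation data. We recommend running `bash getdata.sh` to download and preprocess them automatically.
+
+- [enwik8](http://mattmahoney.net/dc/enwik8.zip)
+
+The enwik8 dataset is derived from Wikipedia and is commonly used to measure how well a model compresses data. It contains 100MB of unprocessed Wikipedia text.
+
+If you download enwik8 directly from the link above, download and run [prep_enwik8.py](https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py) to preprocess it.
+
+Dataset size:
+
+- Training set: 88,982,818 characters in total
+- Validation set: 4,945,742 characters in total
+- Test set: 36,191 characters in total
+
+Dataset format: TXT text
+
+Dataset directory structure: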
+
+```text
+└─data
+  ├─enwik8
+    ├─train.txt       # Training set
+    ├─train.txt.raw   # Training set (unprocessed)
+    ├─valid.txt       # Validation set
+    ├─valid.txt.raw   # Validation set (unprocessed)
+    ├─test.txt        # Test set
+    └─test.txt.raw    # Test set (unprocessed)
+```
+
+- [text8](http://mattmahoney.net/dc/text8.zip)
+
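+text8 also contains 100MB of Wikipedia text. Unlike enwik8, it keeps only the 26 letters and spaces; all other characters are removed.
+
+If you download text8 directly from the link above, run `prep_text8.py` to preprocess it.
+
+Dataset size:
+
+- Training set: 89,999,999 characters in total
+- Validation set: 4,999,999 characters in total
+- Test set: 5,000,000 characters in total
+
+Dataset format: TXT text
+
+Dataset directory structure: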
+
+```text
+└─data
+  ├─text8
+    ├─train.txt       # Training set
+    ├─train.txt.raw   # Training set (unprocessed)
+    ├─valid.txt       # Validation set
+    ├─valid.txt.raw   # Validation set (unprocessed)
+    ├─test.txt        # Test set
+    └─test.txt.raw    # Test set (unprocessed)
+```
+
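+## Environment Requirements
+
+- Hardware (Ascend processor)
+    - Prepare the hardware environment with an Ascend processor.
+- Framework
+    - [MindSpore](https://gitee.com/mindspore/mindspore)
+- For more information, see the following resources:
+    - [MindSpore Tutorials](https://www.mindspore.cn/tutorials/zh-CN/master/index.html)
+    - [MindSpore Python API](https://www.mindspore.cn/docs/api/zh-CN/master/index.html)
+
+## Quick Start
+
+- Running on GPU
+
+After preparing the dataset, start training and evaluation as follows: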
+
+```bash
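+# Adjust the hyperparameters in enwik8_base.yaml as needed.
+# [DATA_NAME] must be one of: enwik8, text8.
+# [TRAIN_URL] can be a plain name, in which case the training output directory is created automatically under /script/train/ with that name, or a path such as "/home/mindspore/transformer-xl/enwik8_8p", in which case the trained model is saved in that directory.
+
+# run standalone (non-distributed) training example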
+bash run_standalone_train_gpu.sh [DEVICE_ID] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+# for example: bash run_standalone_train_gpu.sh 0 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+
+# run distributed training example
+bash run_distribute_train_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+# for example: bash run_distribute_train_gpu.sh 4 0,1,2,3 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+
+# run evaluation example
+bash run_eval_gpu.sh [DATA_URL] [DATA_NAME] [CKPT_PATH] [CONFIG_PATH] [DEVICE_ID(optional)]
+# for example: bash run_eval_gpu.sh  /home/mindspore/transformer-xl/data/enwik8/ enwik8 /home/mindspore/transformer-xl/script/experiments-enwik8/20220416-140816/model7.ckpt /home/mindspore/transformer-xl/yaml/enwik8_base.yaml 0
+```
+
+## Script Description
+
+### Script and Sample Code
+
+```text
+.
+└─Transformer-XL
+  ├─README.md             // descriptions about Transformer-XL
+  ├─README_CN.md          // descriptions about Transformer-XL
+  ├─script
+    ├─run_distribute_train_gpu.sh   // shell script for distributed training on GPU
+    ├─run_standalone_train_gpu.sh   // shell script for training on GPU
+    └─run_eval_gpu.sh               // shell script for testing on GPU
+  ├─src
+    ├─callback
+      ├─eval.py           // callback function(eval)
+      ├─flag.py           // callback function(flag)
+      └─log.py            // callback function(log)
+    ├─loss_fn
+      └─ProjectedAdaptiveLogSoftmaxLoss.py    // loss
+    ├─metric
+      └─calc.py               // get bpc and ppl
+    ├─model
+      ├─attn.py               // Attention code
+      ├─dataset.py            // get dataset
+      ├─embedding.py          // PositionalEmbedding and AdaptiveEmbedding
+      ├─layer.py              // layer code
+      ├─mem_transformer.py    // Transformer-XL model
+      ├─positionwiseFF.py     // positionwiseFF
+      └─vocabulary.py         // construct vocabulary
+    ├─model_utils
+      ├─config.py             // parameter configuration
+      ├─device_adapter.py     // device adapter
+      ├─local_adapter.py      // local adapter
+      └─moxing_adapter.py     // moxing adapter
+    ├─utils
+      ├─additional_algorithms.py  // General method
+      ├─dataset_util.py           // Interface to get dataset
+      └─nnUtils.py                // Basic method
+  ├─yaml
+      ├─enwik8_base.yaml          // parameter configuration on gpu
+      ├─enwik8_large.yaml         // parameter configuration on gpu
+      └─text8_large.yaml          // parameter configuration on gpu
+  ├─getdata.sh                    // shell script for preprocessing dataset
+  ├─eval.py                       // evaluation script
+  └─train.py                      // training script
+```
+
+### Script Parameters
+
+#### Training Script Parameters
+
+```text
+usage:
+train.py
+To set parameters, edit the ./enwik8_base.yaml file.
+To use a different configuration file, change the --config_path parameter at line 130 of /src/model_utils/config.py.
+```
+
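+#### Network Parameters
+
+```text
+Parameters for dataset and network (Training/Fine-tuning/Evaluation):
+    n_layer       number of layers: N, default is 12
+    d_model       dimension of model, default is 512
+    n_head        total number of attention heads, default is 8
+    d_head        dimension of each attention head, default is 64
+    d_inner       dimension of the feed-forward network, default is 2048
+    dropout       dropout rate of the output layer: Q, default is 0.1
+    dropatt       dropout rate of the attention layer: Q, default is 0.0
+    max_step      number of iterations: N, default is 400000
+    tgt_len       target (label) sequence length, default is 512
+    mem_len       memory length, default is 512
+    eval_tgt_len  target (label) sequence length during evaluation, default is 128
+    batch_size    batch size of input dataset: N, default is 22
+
+Parameters for learning rate:
+    lr            learning rate: Q, default is 0.00025
+    warmup_step   number of learning rate warm-up steps: N, default is 0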
+```
+
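+### Dataset Preparation
+
+- Run `bash getdata.sh`; the script creates the `./data` directory and automatically downloads the datasets into it.
+
+- Download the dataset and configure DATA_PATH.
+
+### Training Process
+
+- Pass the paths directly as arguments to the shell script, or set the options in `enwik8_base.yaml`, making sure 'datadir' points to the dataset path. Other settings include loss_scale, the learning rate, and the network hyperparameters.
+
+- Run `run_standalone_train_gpu.sh` for standalone (non-distributed) training of the Transformer-XL model.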
+
+    ```bash
+    # run standalone (non-distributed) training example
+    bash run_standalone_train_gpu.sh [DEVICE_ID] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+    # for example: bash run_standalone_train_gpu.sh 0 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+    ```
+
+- Run `run_distribute_train_gpu.sh` for distributed training of the Transformer-XL model.
+
+    ```bash
+    # run distributed training example
+    bash run_distribute_train_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)] [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]
+    # for example: bash run_distribute_train_gpu.sh 4 0,1,2,3 /home/mindspore/transformer-xl/data/enwik8/ enwik8 experiments /home/mindspore/transformer-xl/yaml/enwik8_base.yaml
+    ```
+
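+### Evaluation Process
+
+- Pass the paths directly as arguments to the shell script, or set the options in `enwik8_base.yaml`, including the 'load_path' file path.
+
+- Run `run_eval_gpu.sh` to evaluate the Transformer-XL model.
+
+    ```bash
+    # run evaluation example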
+    bash run_eval_gpu.sh [DATA_URL] [DATA_NAME] [CKPT_PATH] [CONFIG_PATH] [DEVICE_ID(optional)]
+    # for example: bash run_eval_gpu.sh  /home/mindspore/transformer-xl/data/enwik8/ enwik8 /home/mindspore/transformer-xl/script/experiments-enwik8/20220416-140816/model7.ckpt /home/mindspore/transformer-xl/yaml/enwik8_base.yaml 0
+    ```
+
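+## Model Description
+
+### Performance
+
+#### Training Performance
+
+| Parameters                 | GPU                             |
+| -------------------------- | ------------------------------- |
+| Resource                   | MindSpore                       |
+| Uploaded Date              | 2022-04-22                      |
+| MindSpore Version          | 1.6.1                           |
+| Dataset                    | enwik8                          |
+| Training Parameters        | max_step=400000, batch_size=22  |
+| Optimizer                  | Adam                            |
+| Loss Function              | Softmax Cross Entropy           |
+| BPC Score                  | 1.07906                         |
+| Speed                      | 421.24 ms/step (1p, bsz=8)      |
+| Loss                       | 0.75                            |
+| Checkpoint for inference   | 1.45G (.ckpt file)              |
+| Scripts                    | Transformer-XL script           |
+
+#### Evaluation Performance
+
+| Parameters          | GPU                         |
+| ------------------- | --------------------------- |
+| Resource            | MindSpore                   |
+| Uploaded Date       | 2022-04-22                  |
+| MindSpore Version   | 1.6.1                       |
+| Dataset             | enwik8                      |
+| batch_size          | 22                          |
+| outputs             | loss, BPC score             |
+| Loss                | 0.75                        |
+| BPC Score           | 1.07906                     |
+
+## Description of Random Situation
+
+There are three sources of randomness:
+
+- Shuffle of the dataset.
+- Initialization of some model weights.
+- Dropout operations.
+
+Some seeds have already been set in train.py to avoid the randomness of dataset shuffle and weight initialization. To disable dropout, set the `dropout` and `dropatt` parameters to 0 in the YAML configuration file, for example `enwik8_base.yaml`.
+
+## ModelZoo Homepage
+
+Please check the official [homepage](https://gitee.com/mindspore/models).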
+
diff --git a/research/nlp/transformer_xl/eval.py b/research/nlp/transformer_xl/eval.py
new file mode 100644
index 0000000000000000000000000000000000000000..b8e1bc7adaa3983841122e84b622c68ed95ed492
--- /dev/null
+++ b/research/nlp/transformer_xl/eval.py
@@ -0,0 +1,77 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import argparse
+from mindspore import load_checkpoint, context
+from mindspore.dataset import GeneratorDataset
+from src.callback.eval import doEval
+from src.metric.calc import bpc
+from src.model.mem_transformer import MemTransformerLM
+from src.model_utils.config import config
+from src.utils.dataset_util import get_dataset
+
+if __name__ == '__main__':
+    parser = argparse.ArgumentParser(description='Transformer-XL evaluation running')
+    parser.add_argument('--datadir', default='./data/enwik8',
+                        help='Directory containing the dataset.')
+    parser.add_argument('--dataset', default='enwik8',
+                        help='Dataset Name.', choices=["enwik8", "text8"])
+    parser.add_argument('--ckpt_path', default="./model0.ckpt", help='Path to the model checkpoint file.')
+    parser.add_argument("--device", type=str, default="GPU", help="Device Target, default GPU",
+                        choices=["Ascend", "GPU"])
+    parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
+
+    args = parser.parse_args()
+    datadir = args.datadir
+    dataset = args.dataset
+    device_id = args.device_id
+
+    dataset = get_dataset(datadir, dataset)
+    ntokens = len(dataset.vocab)
+
+    context.set_context(device_id=device_id)
+    context.set_context(mode=context.GRAPH_MODE, device_target=args.device, max_device_memory="39.0GB",
+                        enable_graph_kernel=True)
+
+    # Because of the memory (mems) mechanism, the valid and test datasets cannot be sharded across multiple devices
+    valid_dataset = GeneratorDataset(source=dataset.get_valid_generator(), column_names=['data', 'target'],
+                                     shuffle=False)
+    test_dataset = GeneratorDataset(source=dataset.get_test_generator(), column_names=['data', 'target'],
+                                    shuffle=False)
+
+    # adaptive softmax / embedding
+    cutoffs = []
+    net = MemTransformerLM(ntokens, config.n_layer, config.n_head, config.d_model,
+                           config.d_head, config.d_inner, config.dropout, config.dropatt, batch_size=config.batch_size,
+                           d_embed=config.d_embed, div_val=config.div_val,
+                           pre_lnorm=config.pre_lnorm, tgt_len=config.tgt_len,
+                           ext_len=config.ext_len, mem_len=config.mem_len, eval_tgt_len=config.eval_tgt_len,
+                           cutoffs=cutoffs, same_length=config.same_length, clamp_len=config.clamp_len)
+
+    # Load the trained weights from the checkpoint given by --ckpt_path
+    model_filename = args.ckpt_path
+    print(model_filename)
+    load_checkpoint(net=net, ckpt_file_name=model_filename)
+
+    valid_loss = doEval(net, valid_dataset, config.tgt_len, config.ext_len, config.mem_len, config.eval_tgt_len)
+    test_loss = doEval(net, test_dataset, config.tgt_len, config.ext_len, config.mem_len, config.eval_tgt_len)
+
+    print('=' * 100)
+    if config.dataset in ['enwik8', 'text8']:
+        print('| End of valid | valid loss {:5.2f} | valid bpc {:9.5f}'.format(
+            valid_loss, bpc(valid_loss)))
+        print('| End of test | test loss {:5.2f} | test bpc {:9.5f}'.format(
+            test_loss, bpc(test_loss)))
+    print('=' * 100)
diff --git a/research/nlp/transformer_xl/getdata.sh b/research/nlp/transformer_xl/getdata.sh
new file mode 100644
index 0000000000000000000000000000000000000000..3aa90345c108048a22966b5a2782a6dda971aa38
--- /dev/null
+++ b/research/nlp/transformer_xl/getdata.sh
@@ -0,0 +1,43 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+echo "=== Acquiring datasets ==="
+echo "---"
+
+mkdir -p data
+cd data || exit 1
+
+echo "- Downloading enwik8 (Character)"
+if [[ ! -d 'enwik8' ]]; then
+    mkdir -p enwik8
+    cd enwik8
+    wget --continue http://mattmahoney.net/dc/enwik8.zip --no-check-certificate
+    wget https://raw.githubusercontent.com/salesforce/awd-lstm-lm/master/data/enwik8/prep_enwik8.py --no-check-certificate
+    python3 prep_enwik8.py
+    cd ..
+fi
+
+echo "- Downloading text8 (Character)"
+if [[ ! -d 'text8' ]]; then
+    mkdir -p text8
+    cd text8
+    wget --continue http://mattmahoney.net/dc/text8.zip --no-check-certificate
+    python3 ../../prep_text8.py
+    cd ..
+fi
+
+echo "---"
+echo "Happy language modeling :)"
diff --git a/research/nlp/transformer_xl/prep_text8.py b/research/nlp/transformer_xl/prep_text8.py
new file mode 100644
index 0000000000000000000000000000000000000000..fa586ba0545d08188e4c22427100a06d3a39aed4
--- /dev/null
+++ b/research/nlp/transformer_xl/prep_text8.py
@@ -0,0 +1,46 @@
+#!/usr/bin/env python
+# coding=utf-8
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+import os
+import sys
+import zipfile
+
+if os.path.exists('train.txt'):
+    print('Tokenized text8 already exists - skipping processing')
+    sys.exit()
+
+zipfile.ZipFile('text8.zip').extractall()
+with open('text8', 'r', encoding='utf-8') as f:
+    data = f.read()
+
+print('Length of text8: {}'.format(len(data)))
+
+# Split text8: the last 2 * num_test_chars characters are held out, half for validation and half for test
+num_test_chars = 5000000
+train_data = data[: -2 * num_test_chars]
+valid_data = data[-2 * num_test_chars: -num_test_chars]
+test_data = data[-num_test_chars:]
+
+for fn, part in [('train.txt', train_data), ('valid.txt', valid_data), ('test.txt', test_data)]:
+    print('{} will have {} characters'.format(fn, len(part)))
+    print('- Tokenizing...')
+    # Change space ' ' to underscore '_' and separate the characters with spaces
+    part_str = ' '.join(['_' if c == ' ' else c for c in part.strip()])
+    print('- Writing...')
+    with open(fn, 'w', encoding='utf-8') as f:
+        f.write(part_str)
+    with open(fn + '.raw', 'w', encoding='utf-8') as f:
+        f.write(part)
diff --git a/research/nlp/transformer_xl/requirements.txt b/research/nlp/transformer_xl/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..ae46c79cafd7b9d321a9dc2c8eaf53608e53b0cd
--- /dev/null
+++ b/research/nlp/transformer_xl/requirements.txt
@@ -0,0 +1,3 @@
+numpy
+easydict
+pyyaml
diff --git a/research/nlp/transformer_xl/script/run_distribute_train_gpu.sh b/research/nlp/transformer_xl/script/run_distribute_train_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..83de2471622cc73c8c28001bdbc88e33f95fd396
--- /dev/null
+++ b/research/nlp/transformer_xl/script/run_distribute_train_gpu.sh
@@ -0,0 +1,63 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+if [ $# != 6 ]; then
+  echo "Usage: bash run_distribute_train_gpu.sh [DEVICE_NUM] [VISIBLE_DEVICES(0,1,2,3,4,5,6,7)]
+   [DATA_DIR] [DATA_NAME] [TRAIN_URL] [CONFIG_PATH]"
+  exit 1
+fi
+
+if [ $1 -lt 1 ] || [ $1 -gt 8 ]; then
+  echo "error: DEVICE_NUM=$1 is not in (1-8)"
+  exit 1
+fi
+
+DATA_DIR=$3
+DATA_NAME=$4
+TRAIN_URL=$5
+CONFIG_PATH=$6
+
+echo "DATA_DIR="$DATA_DIR
+echo "DATA_NAME="$DATA_NAME
+echo "TRAIN_URL="$TRAIN_URL
+echo "CONFIG_PATH="$CONFIG_PATH
+
+export CONFIG_PATH=${CONFIG_PATH}
+export DEVICE_NUM=$1
+export RANK_SIZE=$1
+
+BASEPATH=$(
+  cd "$(dirname $0)" || exit
+  pwd
+)
+
+export PYTHONPATH=${BASEPATH}:$PYTHONPATH
+if [ -d "./train" ]; then
+  rm -rf ./train
+fi
+mkdir ./train
+cd ./train || exit
+
+export CUDA_VISIBLE_DEVICES="$2"
+
+echo "Start Training :)"
+
+if [ $1 -gt 1 ]; then
+  mpirun -np $1 --allow-run-as-root --output-filename log_output --merge-stderr-to-stdout \
+  python ${BASEPATH}/../train.py --device="GPU" --datadir=$DATA_DIR --dataset=$DATA_NAME --train_url=$TRAIN_URL >train_gpu.log 2>&1 &
+else
+  python ${BASEPATH}/../train.py --device="GPU" --datadir=$DATA_DIR --dataset=$DATA_NAME --train_url=$TRAIN_URL >train_gpu.log 2>&1 &
+fi
diff --git a/research/nlp/transformer_xl/script/run_eval_gpu.sh b/research/nlp/transformer_xl/script/run_eval_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..7d25944f446dea3aa2b9b845d3010cc410d4c58f
--- /dev/null
+++ b/research/nlp/transformer_xl/script/run_eval_gpu.sh
@@ -0,0 +1,69 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+if [ $# -lt 4 ] ||  [ $# -gt 5 ]
+then
+    echo "Usage: bash run_eval_gpu.sh [DATA_DIR] [DATA_NAME] [CKPT_PATH] [CONFIG_PATH] [DEVICE_ID(optional)]"
+exit 1
+fi
+
+export DEVICE_ID=0
+
+if [ $# = 5 ] ; then
+  export DEVICE_ID=$5
+fi;
+
+
+get_real_path(){
+  if [ "${1:0:1}" == "/" ]; then
+    echo "$1"
+  else
+    echo "$(realpath -m $PWD/$1)"
+  fi
+}
+
+DATA_DIR=$(get_real_path $1)
+
+if [ ! -d $DATA_DIR ]
+then
+    echo "error: DATA_DIR=$DATA_DIR is not a directory"
+exit 1
+fi
+
+DATA_NAME=$2
+CKPT_PATH=$3
+CONFIG_PATH=$4
+
+echo "DATA_DIR="$DATA_DIR
+echo "DATA_NAME="$DATA_NAME
+echo "CKPT_PATH="$CKPT_PATH
+echo "CONFIG_PATH="$CONFIG_PATH
+
+export CONFIG_PATH=${CONFIG_PATH}
+export DEVICE_NUM=1
+export RANK_SIZE=$DEVICE_NUM
+export RANK_ID=0
+if [ -d "eval" ];
+then
+    rm -rf ./eval
+fi
+mkdir ./eval
+
+env > env.log
+
+echo "Start evaluation for device $DEVICE_ID :)"
+
+python ../eval.py --device_id=$DEVICE_ID --datadir=$DATA_DIR --dataset=$DATA_NAME --ckpt_path=$CKPT_PATH --device="GPU" &> eval.log &
diff --git a/research/nlp/transformer_xl/script/run_standalone_train_gpu.sh b/research/nlp/transformer_xl/script/run_standalone_train_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..c0021e54bea1fcf8e55c5833d81bea9087d65e2b
--- /dev/null
+++ b/research/nlp/transformer_xl/script/run_standalone_train_gpu.sh
@@ -0,0 +1,44 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+if [ $# -lt 5 ]; then
+    echo "Usage: bash run_standalone_train_gpu.sh [DEVICE_ID] [DATA_DIR] [DATA_NAME]
+     [TRAIN_URL] [CONFIG_PATH]"
+    exit 1
+fi
+
+DEVICE_ID=$1
+DATA_DIR=$2
+DATA_NAME=$3
+TRAIN_URL=$4
+CONFIG_PATH=$5
+
+echo "DATA_DIR="$DATA_DIR
+echo "DATA_NAME="$DATA_NAME
+echo "TRAIN_URL="$TRAIN_URL
+echo "CONFIG_PATH="$CONFIG_PATH
+
+export CONFIG_PATH=${CONFIG_PATH}
+
+if [ -d "./train_stand" ]; then
+  rm -rf ./train_stand
+fi
+mkdir ./train_stand
+cd ./train_stand || exit
+
+echo "Start training for device $DEVICE_ID :)"
+
+CUDA_VISIBLE_DEVICES=$DEVICE_ID python ../../train.py --device="GPU" --datadir=$DATA_DIR --dataset=$DATA_NAME --train_url=$TRAIN_URL > train_stand_gpu.log 2>&1 &
diff --git a/research/nlp/transformer_xl/src/__init__.py b/research/nlp/transformer_xl/src/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..602527cd720c8d268599dbaef190ba1cf1eb6f2b
--- /dev/null
+++ b/research/nlp/transformer_xl/src/__init__.py
@@ -0,0 +1,14 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
diff --git a/research/nlp/transformer_xl/src/callback/eval.py b/research/nlp/transformer_xl/src/callback/eval.py
new file mode 100644
index 0000000000000000000000000000000000000000..f4408ce34b689901abfb382284bbc3bb1ac36f6c
--- /dev/null
+++ b/research/nlp/transformer_xl/src/callback/eval.py
@@ -0,0 +1,86 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import time
+import os
+import numpy as np
+from mindspore.train.callback import Callback
+from mindspore import save_checkpoint
+from src.model_utils.device_adapter import get_device_id
+from src.model_utils.config import config
+from src.metric.calc import bpc, ppl
+
+
+def doEval(net, dataset, tgt_len, ext_len, mem_len, eval_tgt_len):
+    """Separate eval for valid and test"""
+    net.set_train(tgt_len, ext_len, mem_len, eval_tgt_len, False)
+    total_len, total_loss = 0, 0.
+    idx = 0
+    for data, target in dataset.create_tuple_iterator():
+        loss = net(data, target, idx)
+        idx = 1
+        seq_len = target.shape[0]
+        total_loss += seq_len * loss
+        total_len += seq_len
+        if net.is_first_iteration:
+            net.add_flags_recursive(is_first_iteration=False)
+
+    test_loss = total_loss / total_len
+    test_loss = np.mean(test_loss.asnumpy())
+    net.set_train(tgt_len, ext_len, mem_len, eval_tgt_len, True)
+    return test_loss
+
+
+class EvalDuringTrain(Callback):
+    def __init__(self, dataset, per_print_times, tgt_len, ext_len, mem_len,
+                 eval_tgt_len):
+        super(EvalDuringTrain, self).__init__()
+        self.dataset = dataset
+        self._per_print_times = per_print_times
+        self.best_val_loss = None
+        self.tgt_len = tgt_len
+        self.ext_len = ext_len
+        self.mem_len = mem_len
+        self.eval_tgt_len = eval_tgt_len
+
+    def step_end(self, run_context):
+        """Called after each step finished."""
+        device_id = get_device_id()
+        cb_params = run_context.original_args()
+        train_step = cb_params.cur_epoch_num
+        if self._per_print_times != 0 and train_step % self._per_print_times == 0:
+            eval_start_time = time.time()
+            net = cb_params.network
+
+            valid_loss = doEval(net, self.dataset, tgt_len=self.tgt_len, ext_len=self.ext_len, mem_len=self.mem_len,
+                                eval_tgt_len=self.eval_tgt_len)
+
+            print('-' * 100)
+            log_str = '| Eval {:3d} at step {:>8d} | time: {:5.2f}s ' \
+                      '| valid loss {:5.2f}'.format(train_step // self._per_print_times, train_step,
+                                                    (time.time() - eval_start_time), valid_loss)
+            if config.dataset in ['enwik8', 'text8']:
+                log_str += ' | valid bpc {:9.5f}'.format(bpc(valid_loss))
+            else:
+                log_str += ' | valid ppl {:9.3f}'.format(ppl(valid_loss))
+            print(log_str)
+            print('-' * 100)
+
+            if self.best_val_loss is None or valid_loss < self.best_val_loss:
+                model_filename = os.path.join(config.train_url, 'model' + str(device_id) + '.ckpt')
+                optimizer_filename = os.path.join(config.train_url, 'optimizer' + str(device_id) + '.ckpt')
+                save_checkpoint(net, model_filename)
+                save_checkpoint(cb_params.optimizer, optimizer_filename)
+                self.best_val_loss = valid_loss
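A minimal sketch of the length-weighted averaging that `doEval` appears to apply: each batch loss is weighted by its sequence length, so a short final batch does not count as much as a full one. The function name `weighted_mean_loss` and the plain-float inputs are assumptions for illustration; the callback itself works on MindSpore tensors.

```python
def weighted_mean_loss(batches):
    """Hypothetical helper: batches is an iterable of (seq_len, mean_loss) pairs."""
    total_len, total_loss = 0, 0.0
    for seq_len, loss in batches:
        total_loss += seq_len * loss  # undo the per-batch mean
        total_len += seq_len
    return total_loss / total_len


# A full batch of 4 positions and a short batch of 2 positions.
result = weighted_mean_loss([(4, 1.0), (2, 4.0)])
print(result)
```

A plain average of the two batch losses would give 2.5, while the weighted form gives 2.0 because the longer batch contributes more positions.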
diff --git a/research/nlp/transformer_xl/src/callback/flag.py b/research/nlp/transformer_xl/src/callback/flag.py
new file mode 100644
index 0000000000000000000000000000000000000000..cffae73a469bb8c2a65db59da6116b72d4fd4371
--- /dev/null
+++ b/research/nlp/transformer_xl/src/callback/flag.py
@@ -0,0 +1,26 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+from mindspore.train.callback import Callback
+
+
+class FlagModifiedCallback(Callback):
+
+    def step_end(self, run_context):
+        """Called after each step finished."""
+        cb_params = run_context.original_args()
+        net = cb_params.network
+        if net.is_first_iteration:
+            net.add_flags_recursive(is_first_iteration=False)
diff --git a/research/nlp/transformer_xl/src/callback/log.py b/research/nlp/transformer_xl/src/callback/log.py
new file mode 100644
index 0000000000000000000000000000000000000000..eff31b2e560ddd9fc208aaa9733089e7ebe219b7
--- /dev/null
+++ b/research/nlp/transformer_xl/src/callback/log.py
@@ -0,0 +1,68 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import math
+import time
+import numpy as np
+
+import mindspore as ms
+from mindspore import Tensor
+from mindspore.train.callback import LossMonitor
+
+from src.metric.calc import bpc, ppl
+from src.model_utils.config import config
+
+
+class TrainLogger(LossMonitor):
+    def __init__(self, per_print_times, n_batch):
+        super(TrainLogger, self).__init__(per_print_times)
+        self.log_start_time = 0
+        self.n_batch = n_batch
+        self.train_loss = 0.0
+        self.log_start_time = time.time()
+
+    def step_end(self, run_context):
+        """Called after each step finished."""
+        cb_params = run_context.original_args()
+        train_step = cb_params.cur_epoch_num
+
+        loss = cb_params.net_outputs
+
+        if isinstance(loss, (tuple, list)):
+            if isinstance(loss[0], Tensor) and isinstance(loss[0].asnumpy(), np.ndarray):
+                loss = loss[0]
+
+        if isinstance(loss, Tensor) and isinstance(loss.asnumpy(), np.ndarray):
+            loss = np.mean(loss.asnumpy())
+
+        self.train_loss += loss
+        if self._per_print_times != 0 and train_step % self._per_print_times == 0:
+            epoch = math.ceil(train_step / self.n_batch)
+            cur_loss = self.train_loss / self._per_print_times
+            elapsed = time.time() - self.log_start_time
+            batch = train_step % (self.n_batch + 1) + (0 if epoch == 1 else 1)
+            optimizer = cb_params.optimizer
+            train_step_t = Tensor(train_step, ms.int32)
+            lr = optimizer.learning_rate(train_step_t).asnumpy()
+            log_str = '| epoch {:3d} step {:>8d} | {:>6d} batches | lr {:.3g} ' \
+                      '| ms/step {:5.2f} | loss {:5.2f}'.format(epoch, train_step, batch, lr,
+                                                                elapsed * 1000 / self._per_print_times, cur_loss)
+            if config.dataset in ['enwik8', 'text8']:
+                log_str += ' | bpc {:9.5f}'.format(bpc(cur_loss))
+            else:
+                log_str += ' | ppl {:9.3f}'.format(ppl(cur_loss))
+            print(log_str)
+            self.train_loss = 0.0
+            self.log_start_time = time.time()
diff --git a/research/nlp/transformer_xl/src/loss_fn/ProjectedAdaptiveLogSoftmaxLoss.py b/research/nlp/transformer_xl/src/loss_fn/ProjectedAdaptiveLogSoftmaxLoss.py
new file mode 100644
index 0000000000000000000000000000000000000000..7c37aeac10770e57f123eccb4f17e2c51987e2b1
--- /dev/null
+++ b/research/nlp/transformer_xl/src/loss_fn/ProjectedAdaptiveLogSoftmaxLoss.py
@@ -0,0 +1,95 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import mindspore as ms
+import mindspore.nn as nn
+import mindspore.ops as P
+from mindspore.nn import LossBase
+from mindspore.ops import Zeros
+from mindspore.ops import ExpandDims, Concat, Squeeze
+from src.utils.additional_algorithms import linear
+
+
+class ProjectedAdaptiveLogSoftmaxLoss(LossBase):
+    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1, tie_projs=None,
+                 keep_order=False):
+        super(ProjectedAdaptiveLogSoftmaxLoss, self).__init__()
+        self.squeeze_1 = Squeeze(1)
+        self.gather = P.GatherD()
+        self.zeros = Zeros()
+        self.expandDims = ExpandDims()
+        self.concat_0 = Concat(0)
+        self.log_softmax_n_1 = nn.LogSoftmax()
+        self.log_softmax_1 = nn.LogSoftmax(1)
+        if tie_projs is None:
+            tie_projs = [False]
+        self.n_token = n_token
+        self.d_embed = d_embed
+        self.d_proj = d_proj
+
+        self.cutoffs = cutoffs + [n_token]
+        self.cutoff_ends = [0] + self.cutoffs
+        self.div_val = div_val
+
+        self.shortlist_size = self.cutoffs[0]
+        self.n_clusters = len(self.cutoffs) - 1
+        self.head_size = self.shortlist_size + self.n_clusters
+
+        if self.n_clusters > 0:
+            self.cluster_weight = ms.Parameter(self.zeros((self.n_clusters, self.d_embed), ms.float32))
+            self.cluster_bias = ms.Parameter(self.zeros(self.n_clusters, ms.float32))
+
+        self.out_layers = nn.CellList()
+        parameters = []
+
+        if div_val == 1:
+            for i in range(len(self.cutoffs)):
+                if d_proj != d_embed:
+                    parameters.append(
+                        ms.Parameter(self.zeros((d_proj, d_embed), ms.float32))
+                    )
+
+            self.out_layers.append(nn.Dense(d_embed, n_token))
+        else:
+            for i in range(len(self.cutoffs)):
+                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]
+                d_emb_i = d_embed // (div_val ** i)
+
+                parameters.append(
+                    ms.Parameter(self.zeros((d_proj, d_emb_i), ms.float32))
+                )
+
+                self.out_layers.append(nn.Dense(d_emb_i, r_idx - l_idx))
+
+        self.out_projs = ms.ParameterTuple(parameters)
+        self.keep_order = keep_order
+
+    def _compute_logit(self, hidden, weight, bias, proj=None):
+        if proj is None:
+            logit = linear(hidden, weight, bias)
+        else:
+            proj_hid = linear(hidden, proj.T)
+            logit = linear(proj_hid, weight, bias)
+        return logit
+
+    def construct(self, hidden, target):
+        """
+            hidden :: [len*bsz x d_proj]
+            target :: [len*bsz]
+        """
+
+        logit = self.out_layers[0](hidden)
+        nll = self.squeeze_1(self.gather(-self.log_softmax_n_1(logit), 1, self.expandDims(target, 1)))
+        return self.get_loss(nll)
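As far as this diff shows, `construct` uses only the first output layer, so the loss reduces to a negative log-likelihood taken from a log-softmax over the full vocabulary. A NumPy sketch of that computation, assuming 2-D logits and integer targets; `nll_from_logits` is a hypothetical name and the code is not the MindSpore implementation.

```python
import numpy as np


def nll_from_logits(logits, target):
    """Hypothetical helper: per-position NLL for logits [N, V] and target [N]."""
    # Subtract the row maximum for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick the log-probability of the target token in every row and negate it.
    return -log_probs[np.arange(target.shape[0]), target]


uniform_nll = nll_from_logits(np.zeros((2, 4)), np.array([0, 3]))
print(uniform_nll)
```

With all-zero logits every token is equally likely, so each entry equals `log(4)`.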
diff --git a/research/nlp/transformer_xl/src/metric/calc.py b/research/nlp/transformer_xl/src/metric/calc.py
new file mode 100644
index 0000000000000000000000000000000000000000..3d5b8bd2d5f20bf6b1c69a511f58ce6e764921fc
--- /dev/null
+++ b/research/nlp/transformer_xl/src/metric/calc.py
@@ -0,0 +1,59 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import math
+from mindspore.nn import Metric
+
+
+def bpc(loss):
+    return loss / math.log(2)
+
+
+def ppl(loss):
+    return math.exp(loss)
+
+
+class BPC(Metric):
+    def __init__(self):
+        super(BPC, self).__init__()
+
+        self.loss = 0.0
+        self.log_2 = math.log(2)
+
+    def clear(self):
+        """Clears the internal evaluation result."""
+        self.loss = 0.0
+
+    def update(self, loss):
+        self.loss = loss
+
+    def eval(self):
+        return self.loss / self.log_2
+
+
+class PPL(Metric):
+    def __init__(self):
+        super(PPL, self).__init__()
+        self.loss = 0.0
+
+    def clear(self):
+        """Clears the internal evaluation result."""
+        self.loss = 0.0
+
+    def update(self, loss):
+        self.loss = loss
+
+    def eval(self):
+        return math.exp(self.loss)
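A short worked example of the two conversions defined above, assuming the loss is a mean cross-entropy in nats. The names `nats_to_bpc` and `nats_to_ppl` are illustrative stand-ins for `bpc` and `ppl`.

```python
import math


def nats_to_bpc(loss):
    """Bits per character: change the logarithm base from e to 2."""
    return loss / math.log(2)


def nats_to_ppl(loss):
    """Perplexity: exponentiate the mean cross-entropy."""
    return math.exp(loss)


# A model that is undecided between two equally likely symbols has loss ln(2).
loss = math.log(2)
print(nats_to_bpc(loss), nats_to_ppl(loss))
```

Such a model costs one bit per character and has a perplexity of about 2, which is a quick sanity check when reading the training logs.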
diff --git a/research/nlp/transformer_xl/src/model/attn.py b/research/nlp/transformer_xl/src/model/attn.py
new file mode 100644
index 0000000000000000000000000000000000000000..58ecc7f145c5a77488b1fb142d2bee65a7c706eb
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model/attn.py
@@ -0,0 +1,165 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+# MindSpore r1.7
+# import os
+# os.environ["PATH"] = os.environ["PATH"] + ":/usr/local/cuda/bin/"
+# from mindspore.ops import Einsum
+import mindspore as ms
+import mindspore.nn as nn
+from mindspore.nn import Tril, Triu
+from mindspore.ops import Zeros, Ones
+from mindspore.ops import ExpandDims, Concat, Split
+from mindspore.ops import Transpose, BatchMatMul, Tile
+from mindspore.ops import Softmax, Mul
+from src.utils.additional_algorithms import MaskerFill
+
+class RelMultiHeadAttn(nn.Cell):
+    def __init__(self, n_head, d_model, d_head, dropout, dropatt=0.0, pre_lnorm=False):
+        super(RelMultiHeadAttn, self).__init__()
+
+        self.zeros, self.ones = Zeros(), Ones()
+        self.expandDims, self.concat_0, self.concat_1 = ExpandDims(), Concat(0), Concat(1)
+        self.split_n_1_2, self.split_n_1_3 = Split(-1, 2), Split(-1, 3)
+        self.tril, self.triu = Tril(), Triu()
+        self.transpose = Transpose()
+        self.batchMatMul = BatchMatMul()
+        self.tile = Tile()
+        self.maskerFill = MaskerFill()
+        self.softmax_1 = Softmax(1)
+        self.mul = Mul()
+        self.n_head = n_head
+        self.d_model = d_model
+        self.d_head = d_head
+        self.dropout = dropout
+
+        self.qkv_net = nn.Dense(d_model, 3 * n_head * d_head, has_bias=False)
+        self.drop = nn.Dropout(1 - dropout, dtype=ms.float32)
+        self.dropatt = nn.Dropout(1 - dropatt, dtype=ms.float32)
+        self.o_net = nn.Dense(n_head * d_head, d_model, has_bias=False)
+        self.layer_norm = nn.LayerNorm([d_model])
+        self.scale = 1 / (d_head ** 0.5)
+        self.pre_lnorm = pre_lnorm
+        self.negative_inf = -1e9
+
+    def _rel_shift(self, x, zero_triu=False):
+        zero_pad = self.zeros((x.shape[0], 1, x.shape[2], x.shape[3]), x.dtype)
+        x_padded = self.concat_1((zero_pad, x))
+        x_padded = x_padded.view(x.shape[1] + 1, x.shape[0], x.shape[2], x.shape[3])
+
+        x = x_padded[1:].reshape(x.shape)
+
+        if zero_triu:
+            _ones = self.ones((x.shape[0], x.shape[1]))
+            x = x * self.tril(_ones, x.shape[1] - x.shape[0])[:, :, None, None]
+
+        return x
+
+    def construct(self, w, r, r_w_bias, r_r_bias, mems=None, attn_mask=None):
+        raise NotImplementedError
+
+
+class RelPartialLearnableMultiHeadAttn(RelMultiHeadAttn):
+    def __init__(self, *args, **kwargs):
+        super(RelPartialLearnableMultiHeadAttn, self).__init__(*args, **kwargs)
+
+        self.r_net = nn.Dense(self.d_model, self.n_head * self.d_head, has_bias=False)
+
+    def construct(self, w, r, r_w_bias, r_r_bias, mems=None, attn_mask=None):
+        qlen, rlen, bsz = w.shape[0], r.shape[0], w.shape[1]
+
+        if not self.is_first_iteration:
+            cat = self.concat_0([mems, w])
+            if self.pre_lnorm:
+                w_heads = self.qkv_net(self.layer_norm(cat))
+            else:
+                w_heads = self.qkv_net(cat)
+            r_head_k = self.r_net(r)
+
+            w_head_q, w_head_k, w_head_v = self.split_n_1_3(w_heads)
+            w_head_q = w_head_q[-qlen:]
+        else:
+            if self.pre_lnorm:
+                w_heads = self.qkv_net(self.layer_norm(w))
+            else:
+                w_heads = self.qkv_net(w)
+            r_head_k = self.r_net(r)
+            w_head_q, w_head_k, w_head_v = self.split_n_1_3(w_heads)
+
+        klen = w_head_k.shape[0]
+
+        w_head_q = w_head_q.view(qlen, bsz, self.n_head, self.d_head)  # qlen x bsz x n_head x d_head
+        w_head_k = w_head_k.view(klen, bsz, self.n_head, self.d_head)  # klen x bsz x n_head x d_head
+        w_head_v = w_head_v.view(klen, bsz, self.n_head, self.d_head)  # klen x bsz x n_head x d_head
+
+        r_head_k = r_head_k.view(rlen, self.n_head, self.d_head)  # rlen x n_head x d_head
+
+        # compute attention score
+        rw_head_q = w_head_q + r_w_bias  # qlen x bsz x n_head x d_head
+        rr_head_q = w_head_q + r_r_bias
+
+        # qlen x klen x bsz x n_head
+
+        AC = self.transpose(
+            self.batchMatMul(self.transpose(rw_head_q, (1, 2, 0, 3)), self.transpose(w_head_k, (1, 2, 3, 0))),
+            (2, 3, 0, 1))
+
+        rr_head_q_t = self.transpose(rr_head_q, (1, 2, 0, 3))
+        r_head_k_t = self.transpose(r_head_k, (1, 2, 0))
+        BD = self.transpose(
+            self.batchMatMul(rr_head_q_t, self.tile(self.expandDims(r_head_k_t, 0), (rr_head_q_t.shape[0], 1, 1, 1))),
+            (2, 3, 0, 1))
+
+        BD = self._rel_shift(BD)
+
+        # [qlen x klen x bsz x n_head]
+        attn_score = AC + BD
+
+        attn_score *= self.scale
+
+        # compute attention probability
+        if attn_mask is not None:
+            if attn_mask.ndim == 2:
+                attn_mask_ = self.tile(self.expandDims(self.expandDims(attn_mask, -1), 0),
+                                       (1, 1, attn_score.shape[2], attn_score.shape[3]))
+                attn_score = self.maskerFill(attn_score, attn_mask_, self.negative_inf)
+            elif attn_mask.ndim == 3:
+                attn_mask_ = self.tile(self.expandDims(attn_mask, -1), (1, 1, attn_score.shape[2], attn_score.shape[3]))
+                attn_score = self.maskerFill(attn_score, attn_mask_, self.negative_inf)
+
+        # [qlen x klen x bsz x n_head]
+        attn_prob = self.softmax_1(attn_score)
+        attn_prob = self.dropatt(attn_prob)
+        # compute attention vector
+        attn_vec = self.transpose(
+            self.batchMatMul(self.transpose(attn_prob, (2, 3, 0, 1)), self.transpose(w_head_v, (1, 2, 0, 3))),
+            (2, 0, 1, 3))
+
+        # [qlen x bsz x n_head x d_head]
+        attn_vec = attn_vec.reshape(
+            attn_vec.shape[0], attn_vec.shape[1], self.n_head * self.d_head)
+
+        # linear projection
+        attn_out = self.o_net(attn_vec)
+        attn_out = self.drop(attn_out)
+
+        if self.pre_lnorm:
+            # residual connection
+            output = w + attn_out
+        else:
+            # residual connection + layer normalization
+            output = self.layer_norm(w + attn_out)
+
+        return output
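A NumPy sketch of the pad-and-reshape trick used by `_rel_shift`, under the assumption that the first two axes are `qlen` and `klen`. It is an illustration of the indexing only and is not the MindSpore code; the standalone function `rel_shift` is a hypothetical name.

```python
import numpy as np


def rel_shift(x):
    """Shift row i of a [qlen, klen, ...] array left by (qlen - 1 - i) positions."""
    qlen, klen = x.shape[:2]
    tail = x.shape[2:]
    # Prepend one zero column, then reinterpret the buffer with the axes swapped.
    zero_pad = np.zeros((qlen, 1) + tail, dtype=x.dtype)
    x_padded = np.concatenate([zero_pad, x], axis=1)
    x_padded = x_padded.reshape((klen + 1, qlen) + tail)
    # Dropping the first row realigns every query row with its relative offsets.
    return x_padded[1:].reshape(x.shape)


scores = np.array([[1, 2, 3],
                   [4, 5, 6],
                   [7, 8, 9]])
shifted = rel_shift(scores)
print(shifted)
```

The last row is unchanged and each earlier row moves one more step to the left. The entries that wrap around into the upper-right corner are leftovers, which the causal attention mask (or `zero_triu`) is expected to hide.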
diff --git a/research/nlp/transformer_xl/src/model/dataset.py b/research/nlp/transformer_xl/src/model/dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..2e314f6339c9a2dcbe68c3d1214dd2c86c646c5e
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model/dataset.py
@@ -0,0 +1,157 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import os
+import numpy as np
+from src.model.vocabulary import Vocab
+
+
+class Generator:
+    # Ordered iterator over a flat token stream, cut into fixed-length segments
+    def __init__(self, _data, batch_size, tgt_len, ext_len=None):
+        super(Generator, self).__init__()
+        self.bsz = batch_size
+        self.bptt = tgt_len
+        self.ext_len = ext_len if ext_len is not None else 0
+
+        # Work out how cleanly we can divide the dataset into bsz parts.
+        self.n_step = _data.size // self.bsz
+
+        # Trim off any extra elements that wouldn't cleanly fit (remainders).
+        _data = _data[:self.n_step * self.bsz]
+
+        # Evenly divide the data across the bsz batches.
+        self._data = _data.reshape(self.bsz, -1).T
+        self._data = self._data.astype(np.int32)
+
+        # Number of mini-batches
+        self.n_batch = self.n_step // self.bptt
+
+    def __getitem__(self, item):
+        item *= self.bptt
+        _seq_len = min(self.bptt, self._data.shape[0] - 1 - item)
+
+        end_idx = item + _seq_len
+        beg_idx = max(0, item - self.ext_len)
+
+        _data = self._data[beg_idx: end_idx]
+        _target = self._data[item + 1:item + 1 + _seq_len]
+        return _data, _target
+
+    def __len__(self):
+        return self.n_batch
+
+
+class VariableGenerator(Generator):
+    def __init__(self, _data, batch_size, tgt_len, ext_len=None, start=0, std=5, min_len=5, max_deviation=3):
+        super(VariableGenerator, self).__init__(_data, batch_size, tgt_len, ext_len)
+        self.start = start
+        self.std = std
+        self.min_len = min_len
+        self.max_deviation = max_deviation
+        self.max_len = self.bptt + max_deviation * std
+
+        self.bptt_arr = []
+        j = start
+        while j < self._data.shape[0] - 2:
+            bptt = self.bptt if np.random.random() < 0.95 else self.bptt / 2.
+            bptt = min(self.max_len, max(self.min_len, int(np.random.normal(bptt, self.std))))
+            self.bptt_arr.append(bptt)
+            _seq_len = min(bptt, self._data.shape[0] - 1 - j)
+            j += _seq_len
+        self.len = len(self.bptt_arr)
+        self.index = 0
+
+    def __getitem__(self, item):
+        bptt = self.bptt_arr[self.index]
+        self.index += 1
+        _seq_len = min(bptt, len(self._data) - 1 - item)
+
+        end_idx = item + _seq_len
+        beg_idx = max(0, item - self.ext_len)
+
+        _data = self._data[beg_idx:end_idx]
+        _target = self._data[item + 1:item + 1 + _seq_len]
+        return _data, _target
+
+    def __len__(self):
+        return self.len
+
+
+class AbstractDataset:
+    def __init__(self, path, _dataset, *_args, **kwargs):
+        super(AbstractDataset, self).__init__()
+        self.path = path
+        self.dataset = _dataset
+        self.args = _args
+        self.kwargs = kwargs
+
+    def write(self):
+        pass
+
+    def get_train_generator(self):
+        return self.train_generator
+
+    def get_valid_generator(self):
+        return self.valid_generator
+
+    def get_test_generator(self):
+        return self.test_generator
+
+
+class Enwik8_Dataset(AbstractDataset):
+    def __init__(self, path, _dataset, batch_size, tgt_len, *_args, ext_len=None, eval_tgt_len=None, varlen=False,
+                 **kwargs):
+        super(Enwik8_Dataset, self).__init__(path, _dataset, *_args, **kwargs)
+        self.vocab = Vocab()
+        self.vocab.count_file(os.path.join(path, 'train.txt'))
+        self.vocab.count_file(os.path.join(path, 'valid.txt'))
+        self.vocab.count_file(os.path.join(path, 'test.txt'))
+        self.vocab.build_vocab()
+        self.train = self.vocab.encode_file(
+            os.path.join(path, 'train.txt'), ordered=True, add_eos=False)
+        self.valid = self.vocab.encode_file(
+            os.path.join(path, 'valid.txt'), ordered=True, add_eos=False)
+        self.test = self.vocab.encode_file(
+            os.path.join(path, 'test.txt'), ordered=True, add_eos=False)
+        self.train_generator = getGenerator(self.train, batch_size, tgt_len, ext_len, varlen)
+        self.valid_generator = getGenerator(self.valid, batch_size, eval_tgt_len, ext_len, varlen)
+        self.test_generator = getGenerator(self.test, batch_size, eval_tgt_len, ext_len, varlen)
+
+
+class Text8_Dataset(AbstractDataset):
+    def __init__(self, path, _dataset, batch_size, tgt_len, *_args, ext_len=None, eval_tgt_len=None, varlen=False,
+                 **kwargs):
+        super(Text8_Dataset, self).__init__(path, _dataset, *_args, **kwargs)
+        self.vocab = Vocab()
+        self.vocab.count_file(os.path.join(path, 'train.txt'))
+        self.vocab.count_file(os.path.join(path, 'valid.txt'))
+        self.vocab.count_file(os.path.join(path, 'test.txt'))
+        self.vocab.build_vocab()
+        self.train = self.vocab.encode_file(
+            os.path.join(path, 'train.txt'), ordered=True, add_eos=False)
+        self.valid = self.vocab.encode_file(
+            os.path.join(path, 'valid.txt'), ordered=True, add_eos=False)
+        self.test = self.vocab.encode_file(
+            os.path.join(path, 'test.txt'), ordered=True, add_eos=False)
+        self.train_generator = getGenerator(self.train, batch_size, tgt_len, ext_len, varlen)
+        self.valid_generator = getGenerator(self.valid, batch_size, eval_tgt_len, ext_len, varlen)
+        self.test_generator = getGenerator(self.test, batch_size, eval_tgt_len, ext_len, varlen)
+
+
+def getGenerator(_data, batch_size, tgt_len, ext_len=None, varlen=False, start=0, std=5, min_len=5, max_deviation=3):
+    if varlen:
+        return VariableGenerator(_data, batch_size, tgt_len, ext_len, start, std, min_len, max_deviation)
+    return Generator(_data, batch_size, tgt_len, ext_len)
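Review note: the slicing in `Generator.__getitem__` above can be checked in isolation. The sketch below is a minimal plain-Python restatement of that index arithmetic; the function name `bptt_window` and the toy token list are hypothetical and are not part of the patch.

```python
def bptt_window(data, index, bptt, ext_len=0):
    """Return (inputs, targets) for the index-th window, mirroring Generator.__getitem__."""
    start = index * bptt
    # The last window may be shorter; one token is reserved for the shifted target.
    seq_len = min(bptt, len(data) - 1 - start)
    end = start + seq_len
    # Inputs may reach back ext_len tokens for extended context.
    beg = max(0, start - ext_len)
    inputs = data[beg:end]
    # Targets are the inputs shifted by one position, without the extended context.
    targets = data[start + 1:start + 1 + seq_len]
    return inputs, targets


tokens = list(range(10))  # toy stand-in for an encoded corpus
inputs, targets = bptt_window(tokens, index=1, bptt=4, ext_len=2)
print(inputs, targets)
```

With `ext_len > 0` the input is longer than the target, which is why the model later keeps only the last `tgt_len` hidden states for the loss.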
diff --git a/research/nlp/transformer_xl/src/model/embedding.py b/research/nlp/transformer_xl/src/model/embedding.py
new file mode 100644
index 0000000000000000000000000000000000000000..f6e3869f47f6ebae38d75e3048e2988b8f79c3d0
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model/embedding.py
@@ -0,0 +1,84 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import numpy as np
+import mindspore as ms
+import mindspore.nn as nn
+from mindspore.ops import Zeros, Concat, BroadcastTo
+from mindspore.ops import Sin, Cos
+from mindspore.numpy import outer
+from src.utils.additional_algorithms import linear
+
+
+class PositionalEmbedding(nn.Cell):
+    def __init__(self, demb):
+        super(PositionalEmbedding, self).__init__()
+        self.concat_n_1 = Concat(-1)
+        self.sin = Sin()
+        self.cos = Cos()
+        self.demb = demb
+        self.inv_freq = ms.Tensor(1 / (10000 ** (np.arange(0.0, demb, 2.0) / demb)), ms.float32)
+
+    def construct(self, pos_seq, bsz=None):
+        sinusoid_inp = outer(pos_seq, self.inv_freq)
+        pos_emb = self.concat_n_1([self.sin(sinusoid_inp), self.cos(sinusoid_inp)])
+
+        if bsz is not None:
+            return BroadcastTo((-1, bsz, -1))(pos_emb[:, None, :])
+        return pos_emb[:, None, :]
+
+
+class AdaptiveEmbedding(nn.Cell):
+    def __init__(self, n_token, d_embed, d_proj, cutoffs, div_val=1):
+        super(AdaptiveEmbedding, self).__init__()
+        self.zeros = Zeros()
+        self.n_token = n_token
+        self.d_embed = d_embed
+
+        self.cutoffs = cutoffs + [n_token]
+        self.div_val = div_val
+        self.d_proj = d_proj
+
+        self.emb_scale = d_proj ** 0.5
+
+        self.cutoff_ends = [0] + self.cutoffs
+
+        self.emb_layers = nn.CellList()
+        parameters = []
+        if div_val == 1:
+            self.emb_layers.append(
+                nn.Embedding(n_token, d_embed)
+            )
+            if d_proj != d_embed:
+                parameters.append(ms.Parameter(self.zeros((d_proj, d_embed), ms.float32)))
+        else:
+            for i in range(len(self.cutoffs)):
+                l_idx, r_idx = self.cutoff_ends[i], self.cutoff_ends[i + 1]
+                d_emb_i = d_embed // (div_val ** i)
+                self.emb_layers.append(nn.Embedding(r_idx - l_idx, d_emb_i))
+                parameters.append(ms.Parameter(self.zeros((d_proj, d_emb_i), ms.float32)))
+        self.emb_projs = ms.ParameterTuple(parameters)
+
+    def construct(self, inp):
+        if self.div_val == 1:
+            embed = self.emb_layers[0](inp)
+            if self.d_proj != self.d_embed:
+                embed = linear(embed, self.emb_projs[0])
+        else:  # only emb_layers[0] is used; per-cutoff lookup is not implemented
+            embed = self.emb_layers[0](inp)
+
+        embed *= self.emb_scale
+
+        return embed
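Review note: `PositionalEmbedding.construct` takes the outer product of the position sequence with the inverse frequencies, then concatenates the sine and cosine halves. The sketch below reproduces that computation with the standard library only, as a way to sanity-check shapes and values; `sinusoid_embedding` is a hypothetical helper name, and it omits the batch axis that the MindSpore cell adds.

```python
import math


def sinusoid_embedding(pos_seq, demb):
    """Return one row per position: [sin(pos * f_0), ..., cos(pos * f_0), ...]."""
    # Same frequencies as the patch: 1 / 10000 ** (i / demb) for even i.
    inv_freq = [1.0 / (10000 ** (i / demb)) for i in range(0, demb, 2)]
    rows = []
    for pos in pos_seq:
        angles = [pos * f for f in inv_freq]
        rows.append([math.sin(a) for a in angles] + [math.cos(a) for a in angles])
    return rows


klen = 4
# Relative positions run from klen - 1 down to 0, as in MemTransformerLM.construct.
pos_seq = list(range(klen - 1, -1, -1))
emb = sinusoid_embedding(pos_seq, demb=4)
print(len(emb), len(emb[0]))
```

Each row has `demb` entries when `demb` is even, so the result has shape `(klen, demb)` before the batch axis is inserted.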
diff --git a/research/nlp/transformer_xl/src/model/layer.py b/research/nlp/transformer_xl/src/model/layer.py
new file mode 100644
index 0000000000000000000000000000000000000000..d37898bfabed0327f39ddfcc2be280f23a619dbd
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model/layer.py
@@ -0,0 +1,32 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+from mindspore.nn import Cell
+from src.model.attn import RelPartialLearnableMultiHeadAttn
+from src.model.positionwiseFF import PositionwiseFF
+
+
+class RelPartialLearnableDecoderLayer(Cell):
+    def __init__(self, n_head, d_model, d_head, d_inner, dropout,
+                 **kwargs):
+        super(RelPartialLearnableDecoderLayer, self).__init__()
+
+        self.attn = RelPartialLearnableMultiHeadAttn(n_head, d_model, d_head, dropout, **kwargs)
+        self.pos_ff = PositionwiseFF(d_model, d_inner, dropout, pre_lnorm=kwargs.get('pre_lnorm'))
+
+    def construct(self, dec_inp, r, r_w_bias, r_r_bias, mems=None, attn_mask=None):
+        output = self.attn(dec_inp, r, r_w_bias, r_r_bias, attn_mask=attn_mask, mems=mems)
+        output = self.pos_ff(output)
+        return output
diff --git a/research/nlp/transformer_xl/src/model/mem_transformer.py b/research/nlp/transformer_xl/src/model/mem_transformer.py
new file mode 100644
index 0000000000000000000000000000000000000000..555d925017de91e598f5083a88c5339ee443ffaf
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model/mem_transformer.py
@@ -0,0 +1,233 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import mindspore.numpy as np
+import mindspore as ms
+import mindspore.nn as nn
+import mindspore.ops as P
+from mindspore import Tensor
+from mindspore import Parameter
+from mindspore.ops import Zeros, Ones
+from mindspore.ops import ExpandDims, Concat
+from mindspore.ops import clip_by_value
+from mindspore.nn import Tril, Triu
+from src.loss_fn.ProjectedAdaptiveLogSoftmaxLoss import ProjectedAdaptiveLogSoftmaxLoss
+from src.model.embedding import AdaptiveEmbedding, PositionalEmbedding
+from src.model.layer import RelPartialLearnableDecoderLayer
+
+
+class MemTransformerLM(nn.Cell):
+    def __init__(self, n_token, n_layer, n_head, d_model, d_head, d_inner,
+                 dropout, dropatt, batch_size, d_embed=None,
+                 div_val=1, pre_lnorm=False,
+                 tgt_len=None, ext_len=None, mem_len=None, eval_tgt_len=None,
+                 cutoffs=None, sample_softmax=-1, tie_weight=True, tie_projs=None,
+                 same_length=False, clamp_len=-1):
+        super(MemTransformerLM, self).__init__()
+
+        if tie_projs is None:
+            tie_projs = [False]
+        if cutoffs is None:
+            cutoffs = []
+        self.assign = P.Assign()
+        self.zeros, self.ones = Zeros(), Ones()
+        self.expandDims, self.concat_0, self.concat_1 = ExpandDims(), Concat(0), Concat(1)
+        self.tril, self.triu = Tril(), Triu()
+
+        self.n_token = n_token
+
+        d_embed = d_model if d_embed is None else d_embed
+        self.d_embed = d_embed
+        self.d_model = d_model
+        self.n_head = n_head
+        self.d_head = d_head
+        self.batch_size = batch_size
+
+        self.word_emb = AdaptiveEmbedding(n_token, d_embed, d_model, cutoffs,
+                                          div_val=div_val)
+        self.drop = nn.Dropout(1 - dropout, dtype=ms.float32)
+        self.n_layer = n_layer
+        self.tgt_len = tgt_len
+        self.mem_len = mem_len
+        self.ext_len = ext_len
+        self.eval_tgt_len = eval_tgt_len
+        self.max_klen = tgt_len + ext_len + mem_len
+        self.layers = nn.CellList()
+
+        for i in range(n_layer):
+            self.layers.append(
+                RelPartialLearnableDecoderLayer(
+                    n_head, d_model, d_head, d_inner, dropout, dropatt=dropatt, pre_lnorm=pre_lnorm)
+            )
+
+        self.sample_softmax = sample_softmax
+        # use sampled softmax
+        if self.sample_softmax > 0:
+            self.out_layer = nn.Dense(d_model, n_token)
+            if tie_weight:
+                self.out_layer.weight = self.word_emb.emb_layers[0].embedding_table
+            self.tie_weight = tie_weight
+
+        # use adaptive softmax (including standard softmax)
+        else:
+            self.crit = ProjectedAdaptiveLogSoftmaxLoss(n_token, d_embed, d_model,
+                                                        cutoffs, div_val=div_val)
+
+            if tie_weight:
+                for i in range(len(self.crit.out_layers)):
+                    self.crit.out_layers[i].weight = self.word_emb.emb_layers[i].embedding_table
+
+            if tie_projs:
+                for i, tie_proj in enumerate(tie_projs):
+                    if tie_proj and div_val == 1 and d_model != d_embed:
+                        self.crit.out_projs[i] = self.word_emb.emb_projs[0]
+                    elif tie_proj and div_val != 1:
+                        self.crit.out_projs[i] = self.word_emb.emb_projs[i]
+
+        self.same_length = same_length
+        self.clamp_len = Tensor(clamp_len, ms.float32)
+        self.min_clamp_len = Tensor(0, ms.float32)
+
+        self._create_params()
+
+        self.add_flags_recursive(is_first_iteration=True)
+
+    def backward_compatible(self):
+        self.sample_softmax = -1
+
+    def _create_params(self):
+        self.pos_emb = PositionalEmbedding(self.d_model)
+        self.r_w_bias = Parameter(self.zeros((self.n_head, self.d_head), ms.float32))
+        self.r_r_bias = Parameter(self.zeros((self.n_head, self.d_head), ms.float32))
+        self.mems = Parameter(
+            self.zeros((self.n_layer, self.mem_len, self.batch_size, self.d_model), ms.float32),
+            requires_grad=False)
+        self.valid_mems = Parameter(
+            self.zeros((self.n_layer, self.mem_len + self.tgt_len - self.eval_tgt_len, self.batch_size, self.d_model),
+                       ms.float32), requires_grad=False)
+        self.empty_valid_mems = Parameter(
+            self.zeros((self.n_layer, self.mem_len + self.tgt_len - self.eval_tgt_len, self.batch_size, self.d_model),
+                       ms.float32), requires_grad=False)
+
+    def reset_length(self, tgt_len, ext_len, mem_len):
+        self.tgt_len = tgt_len
+        self.mem_len = mem_len
+        self.ext_len = ext_len
+        return True
+
+    def _update_mems(self, hids, qlen, mlen):
+        if self.training:  # training: cache hidden states into self.mems
+            if self.mem_len > 0:
+                # There are `mlen + qlen` steps that can be cached into mems
+                # For the next step, the last `ext_len` of the `qlen` tokens
+                # will be used as the extended context. Hence, we only cache
+                # the tokens from `mlen + qlen - self.ext_len - self.mem_len`
+                # to `mlen + qlen - self.ext_len`.
+                for i, h in enumerate(hids):
+                    hids[i] = self.expandDims(h, 0)
+
+                # graph mode does not support the built-in max(), so use conditional expressions
+                end_idx = mlen if qlen - self.ext_len < 0 else qlen - self.ext_len + mlen
+                beg_idx = 0 if end_idx - self.mem_len < 0 else end_idx - self.mem_len
+                cat = self.concat_0(hids)
+                cat = self.concat_1((self.mems, cat))
+                cat = cat[:, beg_idx:end_idx]
+                self.assign(self.mems, cat)
+        else:  # evaluation: cache hidden states into self.valid_mems
+            if self.mem_len > 0:
+                for i, h in enumerate(hids):
+                    hids[i] = self.expandDims(h, 0)
+
+                if self.is_first_iteration:
+                    cat = self.concat_0(hids)
+                    cat = self.sameShape(cat, self.valid_mems)
+                    self.assign(self.valid_mems, cat)
+                else:
+                    end_idx = mlen if qlen - self.ext_len < 0 else qlen - self.ext_len + mlen
+                    beg_idx = 0 if end_idx - self.mem_len < 0 else end_idx - self.mem_len
+                    cat = self.concat_0(hids)
+                    cat = self.concat_1((self.valid_mems, cat))
+                    cat = cat[:, beg_idx:end_idx]
+                    self.assign(self.valid_mems, cat)
+        return True
+
+    def sameShape(self, a, b):
+        c = self.zeros((a.shape[0], b.shape[1] - a.shape[1], a.shape[2], a.shape[3]), ms.float32)
+        a = self.concat_1((c, a))
+        return a
+
+    def set_train(self, tgt_len, ext_len, mem_len, eval_tgt_len, mode=True):
+        super(MemTransformerLM, self).set_train(mode=mode)
+        if mode:
+            # Switch back to the training mode
+            self.reset_length(tgt_len, ext_len, mem_len)
+        else:
+            # If the model does not use memory at all, make the ext_len longer.
+            # Otherwise, make the mem_len longer and keep the ext_len the same.
+            self.assign(self.valid_mems, self.empty_valid_mems)
+            self.add_flags_recursive(is_first_iteration=True)
+            if mem_len == 0:
+                self.reset_length(eval_tgt_len,
+                                  ext_len + tgt_len - eval_tgt_len, mem_len)
+            else:
+                self.reset_length(eval_tgt_len,
+                                  ext_len, mem_len + tgt_len - eval_tgt_len)
+        return True
+
+    def construct(self, data, target, idx=None):
+
+        tgt_len = target.size
+        qlen, _ = data.shape
+        word_emb = self.word_emb(data)
+
+        mems = self.mems if self.training else self.valid_mems
+        mlen = 0 if self.is_first_iteration \
+            else (self.mem_len if self.training else self.mem_len + self.tgt_len - self.eval_tgt_len)
+
+        klen = qlen + mlen
+        all_ones = np.ones((qlen, klen), ms.int16)
+
+        if self.same_length:
+            mask_len = klen - self.mem_len
+            if mask_len > 0:
+                mask_shift_len = qlen - mask_len
+            else:
+                mask_shift_len = qlen
+            dec_attn_mask = np.expand_dims(np.triu(all_ones, 1 + mlen)
+                                           + np.tril(all_ones, -mask_shift_len), -1)
+        else:
+            dec_attn_mask = np.expand_dims(np.triu(all_ones, 1 + mlen), -1)
+
+        hids = []
+
+        pos_seq = np.arange(klen - 1, -1, -1, dtype=word_emb.dtype)
+        if self.clamp_len > 0:
+            pos_seq = clip_by_value(pos_seq, clip_value_min=self.min_clamp_len, clip_value_max=self.clamp_len)
+        pos_emb = self.pos_emb(pos_seq)
+
+        core_out = self.drop(word_emb)
+        pos_emb = self.drop(pos_emb)
+
+        for i, layer in enumerate(self.layers):
+            hids.append(core_out)
+            core_out = layer(core_out, pos_emb, self.r_w_bias, self.r_r_bias, attn_mask=dec_attn_mask, mems=mems[i])
+
+        hidden = self.drop(core_out)
+
+        self._update_mems(hids, qlen, mlen)
+
+        pred_hid = hidden[-tgt_len:]
+        loss = self.crit(pred_hid.reshape(-1, pred_hid.shape[-1]), target.reshape(-1))
+        return loss
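Review note: two pieces of index arithmetic in `mem_transformer.py` are easy to get wrong, namely the `[beg_idx:end_idx]` window kept by `_update_mems` and the upper-triangular mask built in `construct`. The sketch below restates both in plain Python so they can be checked by hand. The helper names `mem_window` and `causal_mask` are hypothetical, and the sketch covers only the default `same_length=False` mask.

```python
def mem_window(qlen, mlen, mem_len, ext_len=0):
    """Return (beg_idx, end_idx) of the hidden states kept as memory."""
    # Equivalent to the conditional expressions in _update_mems,
    # which avoid max() because of graph-mode restrictions.
    end_idx = mlen + max(0, qlen - ext_len)
    beg_idx = max(0, end_idx - mem_len)
    return beg_idx, end_idx


def causal_mask(qlen, mlen):
    """Return a qlen x (qlen + mlen) mask where 1 marks a blocked key position."""
    klen = qlen + mlen
    # Same pattern as triu(ones, 1 + mlen): query i may see all mlen memory
    # slots plus current positions up to and including i.
    return [[1 if j > i + mlen else 0 for j in range(klen)] for i in range(qlen)]


print(mem_window(qlen=4, mlen=6, mem_len=6))
for row in causal_mask(qlen=2, mlen=1):
    print(row)
```

The window always has length at most `mem_len`, so the cached memory does not grow across segments.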
diff --git a/research/nlp/transformer_xl/src/model/positionwiseFF.py b/research/nlp/transformer_xl/src/model/positionwiseFF.py
new file mode 100644
index 0000000000000000000000000000000000000000..3e430947da64c028d6f3abd72589dfc096e13b46
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model/positionwiseFF.py
@@ -0,0 +1,55 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import mindspore as ms
+from mindspore.nn import Cell
+import mindspore.nn as nn
+
+
+class PositionwiseFF(Cell):
+    def __init__(self, d_model, d_inner, dropout, pre_lnorm=False):
+        super(PositionwiseFF, self).__init__()
+
+        self.d_model = d_model
+        self.d_inner = d_inner
+        self.dropout = dropout
+
+        if dropout == 0.0:
+            self.CoreNet = nn.SequentialCell(
+                nn.Dense(d_model, d_inner), nn.ReLU(),
+                nn.Dense(d_inner, d_model),
+            )
+        else:
+            self.CoreNet = nn.SequentialCell(
+                nn.Dense(d_model, d_inner), nn.ReLU(),
+                nn.Dropout(1 - dropout, dtype=ms.float32),
+                nn.Dense(d_inner, d_model),
+                nn.Dropout(1 - dropout, dtype=ms.float32),
+            )
+        self.layer_norm = nn.LayerNorm([d_model])
+        self.pre_lnorm = pre_lnorm
+
+    def construct(self, inp):
+        if self.pre_lnorm:
+            # layer normalization + positionwise feed-forward
+            core_out = self.CoreNet(self.layer_norm(inp))
+            # residual connection
+            output = core_out + inp
+        else:
+            # positionwise feed-forward
+            core_out = self.CoreNet(inp)
+            # residual connection + layer normalization
+            output = self.layer_norm(inp + core_out)
+        return output
diff --git a/research/nlp/transformer_xl/src/model/vocabulary.py b/research/nlp/transformer_xl/src/model/vocabulary.py
new file mode 100644
index 0000000000000000000000000000000000000000..2ef036690415785e0b20898b4bf73c3cd9bda0d0
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model/vocabulary.py
@@ -0,0 +1,186 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+from collections import Counter, OrderedDict
+import numpy as np
+import mindspore as ms
+from mindspore.ops import Concat
+
+class Vocab:
+    def __init__(self, special=None, min_freq=0, max_size=None, lower_case=True,
+                 delimiter=None, vocab_file=None):
+        self.concat_0 = Concat(0)
+        if special is None:
+            special = []
+        self.counter = Counter()
+        self.special = special
+        self.min_freq = min_freq
+        self.max_size = max_size
+        self.lower_case = lower_case
+        self.delimiter = delimiter
+        self.vocab_file = vocab_file
+
+        self.sym2idx = OrderedDict()
+        self.unk_idx = None
+        self.idx2sym = []
+
+    def tokenize(self, line, add_eos=False, add_double_eos=False):
+        line = line.strip()
+        # convert to lower case
+        if self.lower_case:
+            line = line.lower()
+
+        # empty delimiter '' will evaluate False
+        if self.delimiter == '':
+            symbols = line
+        else:
+            symbols = line.split(self.delimiter)
+
+        if add_double_eos:  # lm1b
+            symbols = ['<S>'] + symbols + ['<S>']
+        elif add_eos:
+            symbols = symbols + ['<eos>']
+        return symbols
+
+    def count_file(self, path, verbose=False, add_eos=False):
+        if verbose: print('counting file {} ...'.format(path))
+
+        sents = []
+        with open(path, 'r', encoding='utf-8') as f:
+            for idx, line in enumerate(f):
+                if verbose and idx > 0 and idx % 500000 == 0:
+                    print('    line {}'.format(idx))
+                symbols = self.tokenize(line, add_eos=add_eos)
+                self.counter.update(symbols)
+                sents.append(symbols)
+
+        return sents
+
+    def count_sents(self, sents, verbose=False):
+        """
+            sents : a list of sentences, each a list of tokenized symbols
+        """
+        if verbose: print('counting {} sents ...'.format(len(sents)))
+        for idx, symbols in enumerate(sents):
+            if verbose and idx > 0 and idx % 500000 == 0:
+                print('    line {}'.format(idx))
+            self.counter.update(symbols)
+
+    def _build_from_file(self, vocab_file):
+        self.idx2sym = []
+        self.sym2idx = OrderedDict()
+
+        with open(vocab_file, 'r', encoding='utf-8') as f:
+            for line in f:
+                symb = line.strip().split()[0]
+                self.add_symbol(symb)
+        self.unk_idx = self.sym2idx['<UNK>']
+
+    def build_vocab(self):
+        if self.vocab_file:
+            print('building vocab from {}'.format(self.vocab_file))
+            self._build_from_file(self.vocab_file)
+            print('final vocab size {}'.format(len(self)))
+        else:
+            print('building vocab with min_freq={}, max_size={}'.format(
+                self.min_freq, self.max_size))
+            self.idx2sym = []
+            self.sym2idx = OrderedDict()
+
+            for sym in self.special:
+                self.add_special(sym)
+
+            for sym, cnt in self.counter.most_common(self.max_size):
+                if cnt < self.min_freq: break
+                self.add_symbol(sym)
+
+            print('final vocab size {} from {} unique tokens'.format(
+                len(self), len(self.counter)))
+
+    def encode_file(self, path, ordered=False, verbose=False, add_eos=True,
+                    add_double_eos=False):
+        if verbose: print('encoding file {} ...'.format(path))
+
+        encoded = []
+        if not ordered:
+            with open(path, 'r', encoding='utf-8') as f:
+                for idx, line in enumerate(f):
+                    if verbose and idx > 0 and idx % 500000 == 0:
+                        print('    line {}'.format(idx))
+                    symbols = self.tokenize(line, add_eos=add_eos,
+                                            add_double_eos=add_double_eos)
+                    encoded.append(self.convert_to_tensor(symbols))
+
+        else:
+            with open(path, 'r', encoding='utf-8') as f:
+                for idx, line in enumerate(f):
+                    if verbose and idx > 0 and idx % 500000 == 0:
+                        print('    line {}'.format(idx))
+                    symbols = self.tokenize(line, add_eos=add_eos,
+                                            add_double_eos=add_double_eos)
+                    symbols_indices = self.get_indices(symbols)
+                    encoded.append(symbols_indices)
+        encoded = np.concatenate(encoded, axis=0).astype(np.int64)
+
+        return encoded
+
+    def encode_sents(self, sents, ordered=False, verbose=False):
+        if verbose: print('encoding {} sents ...'.format(len(sents)))
+        encoded = []
+        for idx, symbols in enumerate(sents):
+            if verbose and idx > 0 and idx % 500000 == 0:
+                print('    line {}'.format(idx))
+            encoded.append(self.convert_to_tensor(symbols))
+
+        if ordered:
+            encoded = self.concat_0(encoded)
+
+        return encoded
+
+    def add_special(self, sym):
+        if sym not in self.sym2idx:
+            self.idx2sym.append(sym)
+            self.sym2idx[sym] = len(self.idx2sym) - 1
+            setattr(self, '{}_idx'.format(sym.strip('<>')), self.sym2idx[sym])
+
+    def add_symbol(self, sym):
+        if sym not in self.sym2idx:
+            self.idx2sym.append(sym)
+            self.sym2idx[sym] = len(self.idx2sym) - 1
+
+    def get_sym(self, idx):
+        return self.idx2sym[idx]
+
+    def get_idx(self, sym):
+        if sym in self.sym2idx:
+            return self.sym2idx[sym]
+        return self.sym2idx.get(sym, self.unk_idx)
+
+    def get_symbols(self, indices):
+        return [self.get_sym(idx) for idx in indices]
+
+    def get_indices(self, symbols):
+        return [self.get_idx(sym) for sym in symbols]
+
+    def convert_to_tensor(self, symbols):
+        return ms.Tensor(self.get_indices(symbols), dtype=ms.int64)
+
+    def convert_to_sent(self, indices, exclude=None):
+        if exclude is None:
+            return ' '.join([self.get_sym(idx) for idx in indices])
+        return ' '.join([self.get_sym(idx) for idx in indices if idx not in exclude])
+
+    def __len__(self):
+        return len(self.idx2sym)
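Review note: the `Vocab` flow used by the dataset classes is count, then build, then encode. The sketch below is a reduced standard-library version of that flow, useful for checking the frequency-ordered index assignment. `build_vocab` and `encode` here are hypothetical free functions, not the methods in the patch, and they skip special symbols, lower-casing, and the vocabulary-file path.

```python
from collections import Counter


def build_vocab(lines, min_freq=0, max_size=None):
    """Count whitespace-separated symbols and index them by descending frequency."""
    counter = Counter()
    for line in lines:
        counter.update(line.strip().split())
    # most_common() is sorted by count, so filtering on min_freq matches
    # the early break in Vocab.build_vocab.
    idx2sym = [sym for sym, cnt in counter.most_common(max_size) if cnt >= min_freq]
    sym2idx = {sym: idx for idx, sym in enumerate(idx2sym)}
    return idx2sym, sym2idx


def encode(line, sym2idx, unk_idx=None):
    """Map symbols to indices, falling back to unk_idx for unseen symbols."""
    return [sym2idx.get(sym, unk_idx) for sym in line.strip().split()]


idx2sym, sym2idx = build_vocab(["a b a", "b a c"])
print(idx2sym)
print(encode("c a b", sym2idx))
```

Because no `<UNK>` symbol is added unless a vocabulary file supplies one, unseen symbols map to `None` in this sketch, just as `unk_idx` stays `None` in the patch when the vocabulary is built from counts.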
diff --git a/research/nlp/transformer_xl/src/model_utils/__init__.py b/research/nlp/transformer_xl/src/model_utils/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..602527cd720c8d268599dbaef190ba1cf1eb6f2b
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model_utils/__init__.py
@@ -0,0 +1,14 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
diff --git a/research/nlp/transformer_xl/src/model_utils/config.py b/research/nlp/transformer_xl/src/model_utils/config.py
new file mode 100644
index 0000000000000000000000000000000000000000..dd048184190a3221428bb62f7920ea34cd4337dd
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model_utils/config.py
@@ -0,0 +1,139 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+"""Parse arguments"""
+
+import os
+import ast
+import argparse
+import time
+from pprint import pformat
+import yaml
+
+
+class Config:
+    """
+    Configuration namespace. Convert dictionary to members.
+    """
+
+    def __init__(self, cfg_dict):
+        for k, v in cfg_dict.items():
+            if isinstance(v, (list, tuple)):
+                setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v])
+            else:
+                setattr(self, k, Config(v) if isinstance(v, dict) else v)
+
+    def __str__(self):
+        return pformat(self.__dict__)
+
+    def __repr__(self):
+        return self.__str__()
+
+
+def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="default_config.yaml"):
+    """
+    Parse command line arguments to the configuration according to the default yaml.
+
+    Args:
+        parser: Parent parser.
+        cfg: Base configuration.
+        helper: Helper description.
+        cfg_path: Path to the default yaml config.
+    """
+    parser = argparse.ArgumentParser(description="[REPLACE THIS at config.py]",
+                                     parents=[parser])
+    helper = {} if helper is None else helper
+    choices = {} if choices is None else choices
+    for item in cfg:
+        if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict):
+            help_description = helper[item] if item in helper else "Please refer to {}".format(cfg_path)
+            choice = choices[item] if item in choices else None
+            if isinstance(cfg[item], bool):
+                parser.add_argument("--" + item, type=ast.literal_eval, default=cfg[item], choices=choice,
+                                    help=help_description)
+            else:
+                parser.add_argument("--" + item, type=type(cfg[item]), default=cfg[item], choices=choice,
+                                    help=help_description)
+    args = parser.parse_args()
+    return args
+
+
+def parse_yaml(yaml_path):
+    """
+    Parse the yaml config file.
+
+    Args:
+        yaml_path: Path to the yaml config.
+    """
+    with open(yaml_path, 'r') as fin:
+        try:
+            cfgs = yaml.load_all(fin.read(), Loader=yaml.FullLoader)
+            cfgs = [x for x in cfgs]
+            if len(cfgs) == 1:
+                cfg_helper = {}
+                cfg = cfgs[0]
+                cfg_choices = {}
+            elif len(cfgs) == 2:
+                cfg, cfg_helper = cfgs
+                cfg_choices = {}
+            elif len(cfgs) == 3:
+                cfg, cfg_helper, cfg_choices = cfgs
+            else:
+                raise ValueError("At most 3 docs (config, description for help, choices) are supported in config yaml")
+            print(cfg_helper)
+        except yaml.YAMLError as err:
+            raise ValueError("Failed to parse yaml: {}".format(yaml_path)) from err
+    return cfg, cfg_helper, cfg_choices
+
+
+def merge(args, cfg):
+    """
+    Merge the base config from yaml file and command line arguments.
+
+    Args:
+        args: Command line arguments.
+        cfg: Base configuration.
+    """
+    args_var = vars(args)
+    for item in args_var:
+        cfg[item] = args_var[item]
+    return cfg
+
+
+def reset_config(args):
+    if args.d_embed < 0:
+        args.d_embed = args.d_model
+    args.train_url = '{}-{}'.format(args.train_url, args.dataset)
+    args.train_url = os.path.join(args.train_url, time.strftime('%Y%m%d-%H%M%S'))
+    if not os.path.exists(args.train_url):
+        os.makedirs(args.train_url, exist_ok=True)
+
+
+def get_config():
+    """
+    Get Config according to the yaml file and cli arguments.
+    """
+    parser = argparse.ArgumentParser(description="Mindspore Transformer Language Model", add_help=False)
+
+    config_path = os.environ['CONFIG_PATH']
+    default, helper, choices = parse_yaml(config_path)
+    args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=config_path)
+
+    reset_config(args)
+    final_config = merge(args, default)
+    return Config(final_config)
+
+
+config = get_config()
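Note on `parse_cli_to_yaml` and `merge`: the sketch below shows, under simplifying assumptions, how scalar YAML defaults become command-line options and how the parsed values overwrite those defaults. The `defaults` dictionary and the argument list are hypothetical stand-ins for the first YAML document and for `sys.argv`; `build_parser` is an illustrative helper, not a function from this patch.

```python
import argparse
import ast


def build_parser(defaults):
    """Create one --option per scalar default, typed after the default's type."""
    parser = argparse.ArgumentParser(description="config override sketch")
    for name, value in defaults.items():
        if isinstance(value, (list, dict)):
            continue  # nested values are not exposed on the command line
        # bool("False") is True, so booleans are parsed as Python literals instead.
        arg_type = ast.literal_eval if isinstance(value, bool) else type(value)
        parser.add_argument("--" + name, type=arg_type, default=value)
    return parser


def merge(args, defaults):
    """Overwrite the defaults with whatever the parser produced."""
    merged = dict(defaults)
    merged.update(vars(args))
    return merged


# Hypothetical defaults standing in for the first YAML document.
defaults = {"lr": 0.00025, "batch_size": 22, "varlen": False, "cutoffs": []}
args = build_parser(defaults).parse_args(["--batch_size", "8", "--varlen", "True"])
final = merge(args, defaults)
print(final)
```

Options left off the command line keep their YAML defaults. List-valued entries such as `cutoffs` can only be changed in the YAML file, because no option is generated for them.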
diff --git a/research/nlp/transformer_xl/src/model_utils/device_adapter.py b/research/nlp/transformer_xl/src/model_utils/device_adapter.py
new file mode 100644
index 0000000000000000000000000000000000000000..98b8c7ac4807141e7a736893e743c341976e891f
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model_utils/device_adapter.py
@@ -0,0 +1,26 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+"""Device adapter for ModelArts"""
+from .config import config
+
+if config.enable_modelarts:
+    from .moxing_adapter import get_device_id, get_device_num, get_rank_id, get_job_id
+else:
+    from .local_adapter import get_device_id, get_device_num, get_rank_id, get_job_id
+
+__all__ = [
+    "get_device_id", "get_device_num", "get_rank_id", "get_job_id"
+]
diff --git a/research/nlp/transformer_xl/src/model_utils/local_adapter.py b/research/nlp/transformer_xl/src/model_utils/local_adapter.py
new file mode 100644
index 0000000000000000000000000000000000000000..33edc2d0df14791f377be28ed4fe48fe9a681325
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model_utils/local_adapter.py
@@ -0,0 +1,63 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+"""Local adapter"""
+
+import os
+from .config import config
+
+
+def get_device_id():
+    if config.device == "Ascend":
+        device_id = os.getenv('DEVICE_ID', '0')
+    elif config.device == "GPU":
+        device_id = os.getenv('OMPI_COMM_WORLD_LOCAL_RANK', '0')
+    else:
+        device_id = 0
+    return int(device_id)
+
+
+def get_device_num():
+    if config.device == "Ascend":
+        local_device_num = os.getenv('RANK_SIZE', '1')
+    elif config.device == "GPU":
+        local_device_num = os.getenv('OMPI_COMM_WORLD_SIZE', '1')
+    else:
+        local_device_num = 1
+    return int(local_device_num)
+
+
+def get_local_device_num():
+    if config.device == "Ascend":
+        local_device_num = min(get_device_num(), 8)
+    elif config.device == "GPU":
+        local_device_num = os.getenv('OMPI_COMM_WORLD_LOCAL_SIZE', '1')
+    else:
+        local_device_num = 1
+    return int(local_device_num)
+
+
+def get_rank_id():
+    if config.device == "Ascend":
+        global_rank_id = os.getenv('RANK_ID', '0')
+    elif config.device == "GPU":
+        global_rank_id = os.getenv('OMPI_COMM_WORLD_RANK', '0')
+    else:
+        global_rank_id = 0
+    return int(global_rank_id)
+
+
+def get_job_id():
+    return "Local Job"
diff --git a/research/nlp/transformer_xl/src/model_utils/moxing_adapter.py b/research/nlp/transformer_xl/src/model_utils/moxing_adapter.py
new file mode 100644
index 0000000000000000000000000000000000000000..8edf03c97fe62fa63228e1f012a88d7a80a7be3a
--- /dev/null
+++ b/research/nlp/transformer_xl/src/model_utils/moxing_adapter.py
@@ -0,0 +1,146 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+"""Moxing adapter for ModelArts"""
+
+import os
+import functools
+from mindspore import context
+from .config import config
+
+_global_sync_count = 0
+
+
+def get_device_id():
+    if config.device.lower() == "ascend":
+        device_id = os.getenv('DEVICE_ID', '0')
+    elif config.device.lower() == "gpu":
+        device_id = os.getenv('OMPI_COMM_WORLD_LOCAL_RANK', '0')
+    else:
+        device_id = 0
+    return int(device_id)
+
+
+def get_device_num():
+    if config.device.lower() == "ascend":
+        local_device_num = os.getenv('RANK_SIZE', '1')
+    elif config.device.lower() == "gpu":
+        local_device_num = os.getenv('OMPI_COMM_WORLD_SIZE', '1')
+    else:
+        local_device_num = 1
+    return int(local_device_num)
+
+
+def get_local_device_num():
+    if config.device.lower() == "ascend":
+        local_device_num = min(get_device_num(), 8)
+    elif config.device.lower() == "gpu":
+        local_device_num = os.getenv('OMPI_COMM_WORLD_LOCAL_SIZE', '1')
+    else:
+        local_device_num = 1
+    return int(local_device_num)
+
+
+def get_rank_id():
+    if config.device.lower() == "ascend":
+        global_rank_id = os.getenv('RANK_ID', '0')
+    elif config.device.lower() == "gpu":
+        global_rank_id = os.getenv('OMPI_COMM_WORLD_RANK', '0')
+    else:
+        global_rank_id = 0
+    return int(global_rank_id)
+
+
+def get_job_id():
+    job_id = os.getenv('JOB_ID', '')
+    job_id = job_id if job_id else "default"
+    return job_id
+
+
+def sync_data(from_path, to_path):
+    """
+    Download data from remote OBS to a local directory if from_path is a remote URL and to_path is a local path.
+    Conversely, upload data from a local directory to remote OBS if from_path is local and to_path is remote.
+    """
+    import moxing as mox
+    import time
+    global _global_sync_count
+    sync_lock = "/tmp/copy_sync.lock" + str(_global_sync_count)
+    _global_sync_count += 1
+
+    # Each server contains 8 devices at most.
+    if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
+        print("from path: ", from_path)
+        print("to path: ", to_path)
+        mox.file.copy_parallel(from_path, to_path)
+        print("===finish data synchronization===")
+        try:
+            os.mknod(sync_lock)
+        except IOError:
+            pass
+        print("===save flag===")
+
+    while True:
+        if os.path.exists(sync_lock):
+            break
+        time.sleep(1)
+
+    print("Finish sync data from {} to {}.".format(from_path, to_path))
+
+
+def moxing_wrapper(pre_process=None, post_process=None):
+    """
+    Moxing wrapper to download dataset and upload outputs.
+    """
+
+    def wrapper(run_func):
+        @functools.wraps(run_func)
+        def wrapped_func(*args, **kwargs):
+            # Download data from data_url
+            if config.enable_modelarts:
+                if config.data_url:
+                    sync_data(config.data_url, config.data_path)
+                    print("Dataset downloaded: ", os.listdir(config.data_path))
+                if config.checkpoint_url:
+                    sync_data(config.checkpoint_url, config.load_path)
+                    print("Preload downloaded: ", os.listdir(config.load_path))
+                if config.train_url:
+                    sync_data(config.train_url, config.output_path)
+                    print("Workspace downloaded: ", os.listdir(config.output_path))
+
+                context.set_context(save_graphs_path=os.path.join(config.output_path, str(get_rank_id())))
+                config.device_num = get_device_num()
+                config.device_id = get_device_id()
+                if not os.path.exists(config.output_path):
+                    os.makedirs(config.output_path)
+
+                if pre_process:
+                    pre_process()
+
+            # Run the main function
+            run_func(*args, **kwargs)
+
+            # Upload data to train_url
+            if config.enable_modelarts:
+                if post_process:
+                    post_process()
+
+                if config.train_url:
+                    print("Start to copy output directory")
+                    sync_data(config.output_path, config.train_url)
+
+        return wrapped_func
+
+    return wrapper
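Note on `sync_data`: the function uses a lock file as a simple barrier, so that one process per server performs the copy while the others wait for the flag to appear. The sketch below isolates that pattern without `moxing`. `sync_once`, `copy_fn`, and `is_leader` are hypothetical names, and the "copy" is a stand-in callback rather than `mox.file.copy_parallel`.

```python
import os
import tempfile
import time


def sync_once(copy_fn, lock_path, is_leader, poll_seconds=0.01):
    """Run copy_fn on the leader only, then block until the lock file exists."""
    if is_leader and not os.path.exists(lock_path):
        copy_fn()
        # An empty file is the "copy finished" flag that the followers poll for.
        with open(lock_path, "w"):
            pass
    while not os.path.exists(lock_path):
        time.sleep(poll_seconds)


copied = []
with tempfile.TemporaryDirectory() as tmp_dir:
    lock = os.path.join(tmp_dir, "copy_sync.lock0")
    sync_once(lambda: copied.append("data"), lock, is_leader=True)
    # A follower arriving afterwards sees the flag and returns without copying.
    sync_once(lambda: copied.append("data"), lock, is_leader=False)
print(copied)
```

The sketch creates the flag with `open(..., "w")` instead of `os.mknod`, which is only available on Unix. As in the patch, a follower waits indefinitely if the leader fails before writing the flag, so a timeout may be worth adding.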
diff --git a/research/nlp/transformer_xl/src/utils/additional_algorithms.py b/research/nlp/transformer_xl/src/utils/additional_algorithms.py
new file mode 100644
index 0000000000000000000000000000000000000000..cb9dc40008e6d2b0734b37e11d93d772646ff093
--- /dev/null
+++ b/research/nlp/transformer_xl/src/utils/additional_algorithms.py
@@ -0,0 +1,82 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import math
+import mindspore as ms
+import mindspore.ops as P
+import mindspore.common.dtype as mstype
+from mindspore import Tensor, Parameter
+from mindspore.nn import Cell, MatMul
+from mindspore.nn.learning_rate_schedule import LearningRateSchedule
+
+matMul_tb = MatMul(transpose_x2=True)
+
+
+def linear(_input, weight, bias=None):
+    r"""
+    Applies a linear transformation to the incoming data: :math:`y = xA^T + b`.
+
+    Shape:
+
+        - Input: :math:`(N, *, in\_features)` where `*` means any number of
+          additional dimensions
+        - Weight: :math:`(out\_features, in\_features)`
+        - Bias: :math:`(out\_features)`
+        - Output: :math:`(N, *, out\_features)`
+    """
+    output = matMul_tb(_input, weight)
+    if bias is not None:
+        output += bias
+    return output
+
+
+class MaskerFill(Cell):
+    def __init__(self):
+        super(MaskerFill, self).__init__()
+        self.select = P.Select()
+        self.fill = P.Fill()
+        self.cast = P.Cast()
+
+    def construct(self, inputs, mask, value):
+        mask = self.cast(mask, mstype.bool_)
+        masked_value = self.fill(ms.float32, inputs.shape, value)
+        output = self.select(mask, masked_value, inputs)
+        return output
+
+
+class CosineAnnealingLR(LearningRateSchedule):
+    def __init__(self, total_step, lr, min_lr=0):
+        super(CosineAnnealingLR, self).__init__()
+
+        self.min_lr = Parameter(Tensor(min_lr, ms.float32))
+        self.lr = Parameter(Tensor(lr, ms.float32))
+        self.max_lr = Parameter(Tensor(lr, ms.float32))
+        self.T_max = Parameter(Tensor(total_step, ms.float32))
+
+        self.cos = P.Cos()
+        self.pi = Parameter(Tensor(math.pi, ms.float32))
+        self.cast = P.Cast()
+
+    def construct(self, global_step):
+        global_step = self.cast(global_step, ms.float32)
+        if global_step <= 0:
+            self.lr = self.max_lr
+        elif (global_step - 1 - self.T_max) % (2 * self.T_max) == 0:
+            self.lr += (self.max_lr - self.min_lr) * (1 - self.cos(self.pi / self.T_max)) / 2
+        else:
+            self.lr = (1 + self.cos(self.pi * global_step / self.T_max)) /\
+                      (1 + self.cos(self.pi * (global_step - 1) / self.T_max)) * (self.lr - self.min_lr) + self.min_lr
+
+        return self.lr
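Note on `CosineAnnealingLR.construct`: the `else` branch is a step-to-step update. Assuming `lr` starts at `max_lr` and the update runs once per step up to `T_max`, it should reproduce the closed form `min_lr + (max_lr - min_lr) * (1 + cos(pi * t / T_max)) / 2`, because the ratio of consecutive cosine terms telescopes. The plain-Python sketch below checks this numerically. The function names and the numbers are illustrative and are not taken from the training configuration.

```python
import math


def cosine_closed_form(step, total_step, max_lr, min_lr=0.0):
    """Closed-form cosine annealing value at a given step."""
    return min_lr + (max_lr - min_lr) * (1 + math.cos(math.pi * step / total_step)) / 2


def cosine_recursive(total_step, max_lr, min_lr=0.0):
    """Values for steps 0..total_step using the step-to-step update."""
    lr = max_lr
    values = [lr]
    for step in range(1, total_step + 1):
        ratio = (1 + math.cos(math.pi * step / total_step)) / \
                (1 + math.cos(math.pi * (step - 1) / total_step))
        lr = ratio * (lr - min_lr) + min_lr
        values.append(lr)
    return values


# Illustrative numbers only.
total_step, max_lr, min_lr = 50, 2.5e-4, 1e-6
recursive = cosine_recursive(total_step, max_lr, min_lr)
closed = [cosine_closed_form(s, total_step, max_lr, min_lr) for s in range(total_step + 1)]
max_gap = max(abs(r - c) for r, c in zip(recursive, closed))
print(max_gap)
```

The denominator of the ratio is zero at `step - 1 == total_step`, which is why the scheduler needs a separate branch for the first step after `T_max`.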
diff --git a/research/nlp/transformer_xl/src/utils/dataset_util.py b/research/nlp/transformer_xl/src/utils/dataset_util.py
new file mode 100644
index 0000000000000000000000000000000000000000..69285269cad2ce355c507d018a800d94cb560cb5
--- /dev/null
+++ b/research/nlp/transformer_xl/src/utils/dataset_util.py
@@ -0,0 +1,29 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+from src.model_utils.config import config
+import src.model.dataset as ds
+
+
+def get_dataset(datadir, dataset):
+    if dataset == 'enwik8':
+        dataset = ds.Enwik8_Dataset(path=datadir, _dataset=dataset, batch_size=config.batch_size,
+                                    tgt_len=config.tgt_len, ext_len=config.ext_len, eval_tgt_len=config.eval_tgt_len,
+                                    varlen=config.varlen)
+    elif dataset == 'text8':
+        dataset = ds.Text8_Dataset(path=datadir, _dataset=dataset, batch_size=config.batch_size, tgt_len=config.tgt_len,
+                                   ext_len=config.ext_len, eval_tgt_len=config.eval_tgt_len, varlen=config.varlen)
+
+    return dataset
diff --git a/research/nlp/transformer_xl/src/utils/nnUtils.py b/research/nlp/transformer_xl/src/utils/nnUtils.py
new file mode 100644
index 0000000000000000000000000000000000000000..a5dce844859627d4a2567791d4f6bc1adc900635
--- /dev/null
+++ b/research/nlp/transformer_xl/src/utils/nnUtils.py
@@ -0,0 +1,67 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import mindspore
+from mindspore import Tensor
+from mindspore.common import initializer as init
+
+
+def uniform_(tensor, a=0., b=1.):
+    r"""Fills the input Tensor with values drawn from the uniform
+        distribution :math:`\mathcal{U}(a, b)`.
+
+        Args:
+            tensor: an n-dimensional `mindspore.Tensor`
+            a: the lower bound of the uniform distribution
+            b: the upper bound of the uniform distribution
+
+        Examples:
+            #>>> w = torch.empty(3, 5)
+            #>>> nn.init.uniform_(w)
+        """
+    tensor += Tensor(dtype=mindspore.float32, init=init.Zero(), shape=tensor.shape).fill((b - a) / 2)
+    init.Uniform((b - a) / 2)(tensor.asnumpy())
+
+
+def normal_(tensor, mean=0., std=1.):
+    r"""Fills the input Tensor with values drawn from the normal
+        distribution :math:`\mathcal{N}(\text{mean}, \text{std}^2)`.
+
+        Args:
+            tensor: an n-dimensional `mindspore.Tensor`
+            mean: the mean of the normal distribution
+            std: the standard deviation of the normal distribution
+
+        Examples:
+            #>>> w = torch.empty(3, 5)
+            #>>> nn.init.normal_(w)
+    """
+
+    init.Normal(mean=mean, sigma=std)(tensor.asnumpy())
+
+
+def constant_(tensor, val):
+    r"""Fills the input Tensor with the value :math:`\text{val}`.
+
+        Args:
+            tensor: an n-dimensional `mindspore.Tensor`
+            val: the value to fill the tensor with
+
+        Examples:
+            #>>> w = torch.empty(3, 5)
+            #>>> nn.init.constant_(w, 0.3)
+    """
+    constant_init = init.Constant(value=val)
+    constant_init(tensor.asnumpy())
diff --git a/research/nlp/transformer_xl/train.py b/research/nlp/transformer_xl/train.py
new file mode 100644
index 0000000000000000000000000000000000000000..2b6406a7683c545ec72dc7d7357462635d245107
--- /dev/null
+++ b/research/nlp/transformer_xl/train.py
@@ -0,0 +1,334 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""Transformer training script."""
+
+import math
+import argparse
+import numpy as np
+import mindspore as ms
+from mindspore.communication import init
+import mindspore.nn.optim as optim
+import mindspore.context as context
+from mindspore.dataset import GeneratorDataset
+from mindspore.train.model import Model
+from src.callback.eval import EvalDuringTrain, doEval
+from src.callback.log import TrainLogger
+from src.callback.flag import FlagModifiedCallback
+from src.model.mem_transformer import MemTransformerLM
+from src.model_utils.device_adapter import get_device_id, get_device_num, get_rank_id
+from src.model_utils.config import config
+from src.utils.dataset_util import get_dataset
+from src.utils.nnUtils import uniform_, normal_, constant_
+from src.metric.calc import bpc
+
+
+def init_weight(weight, _config):
+    if _config.init == 'uniform':
+        uniform_(weight, -_config.init_range, _config.init_range)
+    elif _config.init == 'normal':
+        normal_(weight, 0.0, _config.init_std)
+
+
+def init_bias(bias):
+    constant_(bias, 0.0)
+
+
+def weights_init_Dense(m, config1):
+    if hasattr(m, 'weight') and m.weight is not None:
+        init_weight(m.weight, config1)
+    if hasattr(m, 'bias') and m.bias is not None:
+        init_bias(m.bias)
+
+
+def weights_init_AdaptiveEmbedding(m, config1):
+    if hasattr(m, 'emb_projs'):
+        for i in range(len(m.emb_projs)):
+            if m.emb_projs[i] is not None:
+                normal_(m.emb_projs[i], 0.0, config1.proj_init_std)
+
+
+def weights_init_ProjectedAdaptiveLogSoftmax(m, config1):
+    if hasattr(m, 'cluster_weight') and m.cluster_weight is not None:
+        init_weight(m.cluster_weight, config1)
+    if hasattr(m, 'cluster_bias') and m.cluster_bias is not None:
+        init_bias(m.cluster_bias)
+    if hasattr(m, 'out_projs'):
+        for i in range(len(m.out_projs)):
+            if m.out_projs[i] is not None:
+                normal_(m.out_projs[i], 0.0, config1.proj_init_std)
+
+
+def weights_init_LayerNorm(m, config1):
+    if hasattr(m, 'weight'):
+        normal_(m.weight, 1.0, config1.init_std)
+    if hasattr(m, 'bias') and m.bias is not None:
+        init_bias(m.bias)
+
+
+def weights_init_TransformerLM(m, config1):
+    if hasattr(m, 'r_emb'):
+        init_weight(m.r_emb, config1)
+    if hasattr(m, 'r_w_bias'):
+        init_weight(m.r_w_bias, config1)
+    if hasattr(m, 'r_r_bias'):
+        init_weight(m.r_r_bias, config1)
+    if hasattr(m, 'r_bias'):
+        init_bias(m.r_bias)
+
+
+def weights_init(m, config1):
+    classname = m.__class__.__name__
+    if classname.find('Dense') != -1:
+        weights_init_Dense(m, config1)
+    elif classname.find('AdaptiveEmbedding') != -1:
+        weights_init_AdaptiveEmbedding(m, config1)
+    elif classname.find('Embedding') != -1:
+        if hasattr(m, 'weight'):
+            init_weight(m.weight, config1)
+    elif classname.find('ProjectedAdaptiveLogSoftmax') != -1:
+        weights_init_ProjectedAdaptiveLogSoftmax(m, config1)
+    elif classname.find('LayerNorm') != -1:
+        weights_init_LayerNorm(m, config1)
+    elif classname.find('TransformerLM') != -1:
+        weights_init_TransformerLM(m, config1)
+
+
+def get_optimizer(_config, net, scheduler):
+    """
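+    Build the optimizer selected by _config.optim: 'sgd', 'adam' or 'adagrad'.
+    Args:
+        _config: Training configuration (optim, lr, mom, sample_softmax).
+        net: Network whose trainable parameters are optimized.
+        scheduler: Learning-rate scheduler (currently unused).
+
+    Returns:
+        optimizer: Optimizer for the dense parameters.
+        optimizer_sparse: Optimizer for the sparse parameters; None unless sample_softmax > 0.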
+    """
+    optimizer = optimizer_sparse = None
+    lr = dynamic_lr()
+    if _config.optim.lower() == 'sgd':
+        if _config.sample_softmax > 0:
+            dense_params, sparse_params = [], []
+            for param in net.trainable_params():
+                if len(param) == len(net.word_emb.embedding_table):
+                    sparse_params.append(param)
+                else:
+                    dense_params.append(param)
+            optimizer_sparse = optim.SGD(sparse_params, learning_rate=_config.lr * 2)
+            optimizer = optim.SGD(dense_params, learning_rate=_config.lr, momentum=_config.mom)
+        else:
+            optimizer = optim.SGD(net.trainable_params(), learning_rate=_config.lr,
+                                  momentum=_config.mom)
+    elif _config.optim.lower() == 'adam':
+        if _config.sample_softmax > 0:
+            dense_params, sparse_params = [], []
+            for param in net.trainable_params():
+                if len(param) == len(net.word_emb.embedding_table):
+                    sparse_params.append(param)
+                else:
+                    dense_params.append(param)
+            optimizer_sparse = optim.SparseAdam(sparse_params, lr=lr)
+            optimizer = optim.Adam(dense_params, learning_rate=lr)
+        else:
+            optimizer = optim.Adam(net.trainable_params(), learning_rate=lr)
+    elif _config.optim.lower() == 'adagrad':
+        optimizer = optim.Adagrad(net.trainable_params(), learning_rate=lr)
+    return optimizer, optimizer_sparse
+
+
+def rsqrt_decay(warmup_steps, current_step):
+    return float(max([current_step, warmup_steps])) ** -0.5
+
+
+def linear_warmup_learning_rate(current_step, warmup_steps, base_lr, init_lr):
+    lr_inc = (float(base_lr) - float(init_lr)) / float(warmup_steps)
+    learning_rate = float(init_lr) + lr_inc * current_step
+    return learning_rate
+
+
+def a_cosine_learning_rate(current_step, base_lr, warmup_steps, total_steps):
+    decay_steps = total_steps - warmup_steps
+    linear_decay = (total_steps - current_step) / decay_steps
+    cosine_decay = 0.5 * (1 + math.cos(math.pi * 2 * 0.47 * current_step / decay_steps))
+    decayed = linear_decay * cosine_decay + 0.00001
+    learning_rate = decayed * base_lr
+    return learning_rate
+
+
+def dynamic_lr():
+    """dynamic learning rate generator"""
+    base_lr = config.lr
+    total_steps = int(config.max_step)
+    warmup_steps = int(config.warmup_step)
+    lr = []
+    for i in range(total_steps):
+        if i < warmup_steps:
+            lr.append(linear_warmup_learning_rate(i, warmup_steps, base_lr, base_lr * config.warmup_ratio))
+        else:
+            lr.append(a_cosine_learning_rate(i, base_lr, warmup_steps, total_steps))
+    return lr
+
+
+def get_scheduler(_config):
+    scheduler = scheduler_sparse = None
+    if _config.scheduler == 'cosine':
+        # here we do not set eta_min to lr_min to be backward compatible
+        # because in previous versions eta_min is default to 0
+        # rather than the default value of lr_min 1e-6
+        from src.utils.additional_algorithms import CosineAnnealingLR
+
+        scheduler = CosineAnnealingLR(total_step=_config.max_step, lr=_config.lr, min_lr=_config.eta_min)
+
+    elif _config.scheduler == 'inv_sqrt':
+        pass
+
+    elif _config.scheduler == 'dev_perf':
+        pass
+    elif _config.scheduler == 'constant':
+        pass
+    return scheduler, scheduler_sparse
+
+
+def set_seed():
+    np.random.seed(config.seed)
+    ms.set_seed(config.seed)
+
+
+def main():
+    # Set the random seed manually for reproducibility.
+    set_seed()
+
+    parser = argparse.ArgumentParser(description='Transformer-XL train running')
+    parser.add_argument('--datadir', default='./data/enwik8',
+                        help='Directory that contains the dataset.')
+    parser.add_argument('--dataset', default='enwik8',
+                        help='Dataset Name.', choices=["enwik8", "text8"])
+    parser.add_argument('--train_url', default="./", help='Directory of training output.')
+    parser.add_argument("--device", type=str, default="GPU", help="Device Target, default GPU",
+                        choices=["Ascend", "GPU"])
+
+    args = parser.parse_args()
+    datadir = args.datadir
+    dataset = args.dataset
+
+    device_id = get_device_id()
+    device_num = get_device_num()
+
+    if config.device == 'Ascend':
+        context.set_context(mode=context.GRAPH_MODE, device_target="Ascend", device_id=device_id)
+        if device_num > 1:
+            context.reset_auto_parallel_context()
+            context.set_auto_parallel_context(device_num=device_num, parallel_mode=context.ParallelMode.DATA_PARALLEL,
+                                              gradients_mean=True)
+            init()
+
+    elif config.device == 'GPU':
+        context.set_context(mode=context.GRAPH_MODE, device_target="GPU", max_device_memory="39.0GB",
+                            enable_graph_kernel=True)
+        if device_num > 1:
+            init()
+            context.reset_auto_parallel_context()
+            context.set_auto_parallel_context(device_num=device_num, parallel_mode=context.ParallelMode.DATA_PARALLEL,
+                                              gradients_mean=True)
+    else:
+        context.set_context(mode=context.PYNATIVE_MODE, device_target="CPU")
+
+    ###############################################################################
+    # Load data
+    ###############################################################################
+
+    dataset = get_dataset(datadir, dataset)
+    ntokens = len(dataset.vocab)
+    config.n_token = ntokens
+
+    # adaptive softmax / embedding
+    cutoffs = []
+
+    ###############################################################################
+    # Build the model
+    ###############################################################################
+
+    net = MemTransformerLM(ntokens, config.n_layer, config.n_head, config.d_model,
+                           config.d_head, config.d_inner, config.dropout, config.dropatt,
+                           batch_size=config.batch_size,
+                           d_embed=config.d_embed, div_val=config.div_val,
+                           pre_lnorm=config.pre_lnorm, tgt_len=config.tgt_len,
+                           ext_len=config.ext_len, mem_len=config.mem_len, eval_tgt_len=config.eval_tgt_len,
+                           cutoffs=cutoffs, same_length=config.same_length, clamp_len=config.clamp_len)
+
+    # ensure embedding init is not overridden by out_layer in case of weight sharing
+    weights_init(net, config)
+    weights_init(net.word_emb, config)
+
+    config.n_all_param = sum([p.size for p in net.trainable_params()])
+    config.n_nonemb_param = sum([p.size for p in net.layers.trainable_params()])
+
+    # scheduler
+    scheduler, _ = get_scheduler(config)
+    # optimizer
+    optimizer, _ = get_optimizer(config, net, scheduler)
+
+    if device_id == 0:
+        print('=' * 100)
+        for k, v in config.__dict__.items():
+            print('    - {} : {}'.format(k, v))
+        print('=' * 100)
+        print('#params = {}'.format(config.n_all_param))
+        print('#non emb params = {}'.format(config.n_nonemb_param))
+
+    ###############################################################################
+    # Training code
+    ###############################################################################
+
+    config.n_batch = dataset.get_train_generator().n_batch
+    config.max_epoch = math.ceil(config.max_step / config.n_batch)
+
+    rank_size, rank_id = get_device_num(), get_rank_id()
+
+    train_dataset = GeneratorDataset(source=dataset.get_train_generator(), column_names=['data', 'target'],
+                                     num_shards=rank_size, shard_id=rank_id, shuffle=False)
+    # Because of the mems mechanism, the valid and test datasets cannot be sharded across multiple devices
+    valid_dataset = GeneratorDataset(source=dataset.get_valid_generator(), column_names=['data', 'target'],
+                                     shuffle=False)
+    test_dataset = GeneratorDataset(source=dataset.get_test_generator(), column_names=['data', 'target'],
+                                    shuffle=False)
+
+    # Train #
+
+    flagModifiedCallback = FlagModifiedCallback()
+    train_log = TrainLogger(per_print_times=config.log_interval, n_batch=config.n_batch)
+    evalDuringTrain = EvalDuringTrain(dataset=valid_dataset, per_print_times=config.eval_interval,
+                                      tgt_len=config.tgt_len, ext_len=config.ext_len, mem_len=config.mem_len,
+                                      eval_tgt_len=config.eval_tgt_len)
+
+    model = Model(network=net, loss_fn=None, optimizer=optimizer, metrics=None)
+    model.train(config.max_step, train_dataset, sink_size=1,
+                callbacks=[flagModifiedCallback, train_log, evalDuringTrain])
+
+    # Test #
+
+    if device_id == 0:
+        test_loss = doEval(net=net, dataset=test_dataset, tgt_len=config.tgt_len, ext_len=config.ext_len,
+                           mem_len=config.mem_len, eval_tgt_len=config.eval_tgt_len)
+        print('=' * 100)
+        if config.dataset in ['enwik8', 'text8']:
+            print('| End of training | test loss {:5.2f} | test bpc {:9.5f}'.format(
+                test_loss, bpc(test_loss)))
+        print('=' * 100)
+
+
+if __name__ == '__main__':
+    main()
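Note on `train.py`: the final report converts the test loss to bits per character through `bpc`, and `config.max_epoch` is derived from `config.max_step` and the number of batches per epoch. The sketch below shows both computations in isolation. It assumes that `bpc` divides a natural-log cross-entropy by ln 2, which is the usual definition but is not shown in this diff; `bits_per_char` and `epochs_needed` are hypothetical names.

```python
import math


def bits_per_char(loss_nats: float) -> float:
    """Convert a cross-entropy loss in nats to bits per character (assumed definition)."""
    return loss_nats / math.log(2)


def epochs_needed(max_step: int, n_batch: int) -> int:
    """Number of passes over the data needed to reach max_step, as computed in train.py."""
    return math.ceil(max_step / n_batch)


if __name__ == "__main__":
    print(round(bits_per_char(0.7), 5))
    print(epochs_needed(400000, 1500))
```

Rounding up means the last epoch may be cut short once `max_step` is reached.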
diff --git a/research/nlp/transformer_xl/tran_model/Transformer-XL model transform.md b/research/nlp/transformer_xl/tran_model/Transformer-XL model transform.md
new file mode 100644
index 0000000000000000000000000000000000000000..b438acf601fa85458d4be533f5d1287ba48aacd2
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/Transformer-XL model transform.md	
@@ -0,0 +1,58 @@
+## Transformer-XL model transform
+
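+The authors' GitHub repository provides both a PyTorch and a TensorFlow implementation. This document describes how to convert models trained with either of them into MindSpore .ckpt models, and lists the concrete steps.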
+
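+The original Transformer-XL source code requires old framework versions and does not run correctly under newer ones, so it is strongly recommended to first set up the environment required by the authors' code. The conversion works as follows: in that environment, train or download the model and dump its parameters as NumPy arrays into a .pkl file; then switch to the MindSpore environment, load the .pkl parameters into the MindSpore model, and save it as a .ckpt file. To verify that the converted model works, inference on the test dataset is run after the checkpoint is saved.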
+
+Official source code of the paper (PyTorch and TensorFlow versions): [link](https://github.com/kimiyoung/transformer-xl)
+
+enwik8_large and text8_large models provided by the authors: [enwik8_large](http://curtis.ml.cmu.edu/datasets/pretrained_xl/tf_enwiki8/) ; [text8_large](http://curtis.ml.cmu.edu/datasets/pretrained_xl/tf_text8/)
+
+### PyTorch2MindSpore
+
+Required environment:
+
+- Python: 3.7.5
+
+- PyTorch: 0.4.0
+
+```shell
+# Step 1: copy /tran_model/torch_get_param.py and /tran_model/torch_get_param.sh into the /pytorch/ directory of the original source code
+cp "/home/transformer-xl/tran_model/torch_get_param.py" "/home/txl_author/pytorch/torch_get_param.py" 
+cp "/home/transformer-xl/tran_model/torch_get_param.sh" "/home/txl_author/pytorch/torch_get_param.sh" 
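+# Step 2: in the PyTorch 0.4.0 environment, run torch_get_param.sh to extract the model parameters, convert them to NumPy format, and save them as a .pkl file. [DATA_SET] is the dataset name (e.g. enwik8 or text8); [WORK_DIR] is the directory containing the model, which PyTorch training saves as model.pt by default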
+cd /home/txl_author/pytorch/
+bash torch_get_param.sh [DATA_SET] [WORK_DIR]
+# bash torch_get_param.sh "enwik8" "/home/ganweikang/project/txl_torch/pytorch/LM-TFM-enwik8/20220322-202922/"
+# Step 3: switch to a newer PyTorch version and convert the parameters in model.state_dict to NumPy. [WORK_DIR] is the directory containing the enwik8_base.pkl saved in Step 2
+cd /home/transformer-xl/tran_model/torch2msp
+bash torch2numpy.sh [DATA_SET] [WORK_DIR]
+# bash torch2numpy.sh "enwik8" "/home/ganweikang/project/txl_torch/pytorch/"
+# Step 4: switch to the MindSpore environment and run torch2msp.sh to load the NumPy-format .pkl file into the MindSpore model, save it as a .ckpt file, and run inference once on the test dataset
+cd /home/transformer-xl/tran_model/
+bash torch2msp.sh [DATA_DIR] [DATA_NAME] [TORCH_PT_PATH] \
+     [CONFIG_PATH] [DEVICE_ID(optional)]
+# bash torch2msp.sh "/home/ganweikang/project/transformer-xl/data/enwik8/" "enwik8" "/home/ganweikang/project/txl_0512/tran_model/torch2msp/enwik8_base.pkl" "/home/ganweikang/project/txl_0512/yaml/enwik8_base_eval.yaml"
+
+```
+
+
+
+### TensorFlow2MindSpore
+
+Required environment:
+
+- Python: 2.7
+
+- TensorFlow: 1.12.0
+
+```shell
+# Step 1: copy /tran_model/tf2msp/tf_get_param.py and /tran_model/tf2msp/tf_get_param.sh into the /tf/ directory of the original source code
+cp "/home/transformer-xl/tran_model/tf2msp/tf_get_param.py" "/home/txl_author/tf/tf_get_param.py"
+cp "/home/transformer-xl/tran_model/tf2msp/tf_get_param.sh" "/home/txl_author/tf/tf_get_param.sh"
+# Step 2: in the TensorFlow environment, run tf_get_param.sh to extract the model parameters, convert them to NumPy format, and save them as a .pkl file. [DATA_SET] is the dataset name, e.g. enwik8 or text8.
+cd /home/txl_author/tf/
+bash tf_get_param.sh [DATA_SET]
+# Step 3: switch to the MindSpore environment and run tf2msp.sh to load the .pkl file into the MindSpore model, save it as a .ckpt file, and run inference once on the test dataset
+bash tf2msp.sh "/home/transformer-xl/data/text8" "text8" "/home/txl_author/tf/text8_large.pkl" "/home/transformer-xl/yaml/text8_large_eval.yaml"
+```
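Note on the TensorFlow path: the `.pkl` file is written under Python 2.7 and read under Python 3 with `pickle.load(f, encoding='bytes')`. With that setting, Python 2 `str` keys come back as `bytes`, while `unicode` keys come back as `str`. Which of the two applies depends on how the dumping script builds its keys, so the following is only a defensive sketch; `normalize_keys` is a hypothetical helper and is not part of this patch.

```python
import pickle


def normalize_keys(params: dict) -> dict:
    """Return a copy of params whose bytes keys are decoded to str."""
    return {
        (k.decode("utf-8") if isinstance(k, bytes) else k): v
        for k, v in params.items()
    }


if __name__ == "__main__":
    # Round trip through pickle protocol 2, the highest protocol Python 2.7 can write.
    blob = pickle.dumps({"transformer/r_w_bias": [0.0, 1.0]}, protocol=2)
    loaded = pickle.loads(blob, encoding="bytes")
    print(sorted(normalize_keys(loaded)))
```

Applying such a helper right after loading would keep the later `tf_dict[tf_name]` lookups independent of the key type.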
diff --git a/research/nlp/transformer_xl/tran_model/key_mapping.py b/research/nlp/transformer_xl/tran_model/key_mapping.py
new file mode 100644
index 0000000000000000000000000000000000000000..44d736b5f46dca2119770baa2d7eda65b3015511
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/key_mapping.py
@@ -0,0 +1,76 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+"""
+Build mapping file for model parameter transformation
+"""
+
+
+# mindspore -> pytorch
+def msp2torch():
+    param_dict = {}
+    param_dict["r_w_bias"] = 'r_w_bias'
+    param_dict["r_r_bias"] = 'r_r_bias'
+    param_dict['word_emb.emb_layers.0.embedding_table'] = 'word_emb.emb_layers.0.weight'
+    for i in range(0, 12):
+        param_dict[str(i) + '.attn.qkv_net.weight'] = 'layers.' + str(i) + '.dec_attn.qkv_net.weight'
+        param_dict[str(i) + '.attn.o_net.weight'] = 'layers.' + str(i) + '.dec_attn.o_net.weight'
+        param_dict[str(i) + '.attn.layer_norm.gamma'] = 'layers.' + str(i) + '.dec_attn.layer_norm.weight'
+        param_dict[str(i) + '.attn.layer_norm.beta'] = 'layers.' + str(i) + '.dec_attn.layer_norm.bias'
+        param_dict[str(i) + '.attn.r_net.weight'] = 'layers.' + str(i) + '.dec_attn.r_net.weight'
+        param_dict[str(i) + '.pos_ff.CoreNet.0.weight'] = 'layers.' + str(i) + '.pos_ff.CoreNet.0.weight'
+        param_dict[str(i) + '.pos_ff.CoreNet.0.bias'] = 'layers.' + str(i) + '.pos_ff.CoreNet.0.bias'
+        param_dict[str(i) + '.pos_ff.CoreNet.3.weight'] = 'layers.' + str(i) + '.pos_ff.CoreNet.3.weight'
+        param_dict[str(i) + '.pos_ff.CoreNet.3.bias'] = 'layers.' + str(i) + '.pos_ff.CoreNet.3.bias'
+        param_dict[str(i) + '.pos_ff.layer_norm.gamma'] = 'layers.' + str(i) + '.pos_ff.layer_norm.weight'
+        param_dict[str(i) + '.pos_ff.layer_norm.beta'] = 'layers.' + str(i) + '.pos_ff.layer_norm.bias'
+    param_dict['crit.out_layers.0.weight'] = 'crit.out_layers.0.weight'
+    param_dict['crit.out_layers.0.bias'] = 'crit.out_layers.0.bias'
+    param_dict['pos_emb.inv_freq'] = 'pos_emb.inv_freq'
+    with open('msp2torch_base.txt', 'w') as f:
+        for key, value in param_dict.items():
+            line = '%s:%s\n' % (key, value)
+            f.write(line)
+    return param_dict
+
+
+# tf -> msp
+def tf2msp():
+    param_dict = {}
+    param_dict["transformer/r_w_bias"] = 'r_w_bias'
+    param_dict["transformer/r_r_bias"] = 'r_r_bias'
+    param_dict['transformer/adaptive_embed/lookup_table'] = 'word_emb.emb_layers.0.embedding_table'
+    for i in range(0, 24):
+        param_dict['transformer/layer_' + str(i) + '/rel_attn/qkv/kernel'] = str(i) + '.attn.qkv_net.weight'
+        param_dict['transformer/layer_' + str(i) + '/rel_attn/o/kernel'] = str(i) + '.attn.o_net.weight'
+        param_dict['transformer/layer_' + str(i) + '/rel_attn/r/kernel'] = str(i) + '.attn.r_net.weight'
+        param_dict['transformer/layer_' + str(i) + '/rel_attn/LayerNorm/gamma'] = str(i) + '.attn.layer_norm.gamma'
+        param_dict['transformer/layer_' + str(i) + '/rel_attn/LayerNorm/beta'] = str(i) + '.attn.layer_norm.beta'
+        param_dict['transformer/layer_' + str(i) + '/ff/layer_1/kernel'] = str(i) + '.pos_ff.CoreNet.0.weight'
+        param_dict['transformer/layer_' + str(i) + '/ff/layer_1/bias'] = str(i) + '.pos_ff.CoreNet.0.bias'
+        param_dict['transformer/layer_' + str(i) + '/ff/layer_2/kernel'] = str(i) + '.pos_ff.CoreNet.3.weight'
+        param_dict['transformer/layer_' + str(i) + '/ff/layer_2/bias'] = str(i) + '.pos_ff.CoreNet.3.bias'
+        param_dict['transformer/layer_' + str(i) + '/ff/LayerNorm/gamma'] = str(i) + '.pos_ff.layer_norm.gamma'
+        param_dict['transformer/layer_' + str(i) + '/ff/LayerNorm/beta'] = str(i) + '.pos_ff.layer_norm.beta'
+    with open('tf2msp_large.txt', 'w') as f:
+        for key, value in param_dict.items():
+            line = '%s:%s\n' % (key, value)
+            f.write(line)
+    return param_dict
+
+
+if __name__ == '__main__':
+    msp2torch()
+    tf2msp()
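Note on `key_mapping.py`: both functions hard-code the layer count (12 for the PyTorch base model, 24 for the TensorFlow large model). A minimal sketch of the same TensorFlow-to-MindSpore table with the layer count as a parameter follows, together with a parser for the `key:value` text format. `build_tf2msp` and `parse_mapping` are hypothetical names; the parameter names themselves are copied from `tf2msp()`.

```python
import io

# (TensorFlow suffix, MindSpore suffix) pairs for one layer, copied from tf2msp().
LAYER_PAIRS = [
    ("rel_attn/qkv/kernel", "attn.qkv_net.weight"),
    ("rel_attn/o/kernel", "attn.o_net.weight"),
    ("rel_attn/r/kernel", "attn.r_net.weight"),
    ("rel_attn/LayerNorm/gamma", "attn.layer_norm.gamma"),
    ("rel_attn/LayerNorm/beta", "attn.layer_norm.beta"),
    ("ff/layer_1/kernel", "pos_ff.CoreNet.0.weight"),
    ("ff/layer_1/bias", "pos_ff.CoreNet.0.bias"),
    ("ff/layer_2/kernel", "pos_ff.CoreNet.3.weight"),
    ("ff/layer_2/bias", "pos_ff.CoreNet.3.bias"),
    ("ff/LayerNorm/gamma", "pos_ff.layer_norm.gamma"),
    ("ff/LayerNorm/beta", "pos_ff.layer_norm.beta"),
]


def build_tf2msp(n_layer: int) -> dict:
    """Build the TensorFlow -> MindSpore name table for a model with n_layer layers."""
    mapping = {
        "transformer/r_w_bias": "r_w_bias",
        "transformer/r_r_bias": "r_r_bias",
        "transformer/adaptive_embed/lookup_table": "word_emb.emb_layers.0.embedding_table",
    }
    for i in range(n_layer):
        for tf_suffix, msp_suffix in LAYER_PAIRS:
            mapping["transformer/layer_%d/%s" % (i, tf_suffix)] = "%d.%s" % (i, msp_suffix)
    return mapping


def parse_mapping(lines) -> dict:
    """Parse 'key:value' lines; split only once so a ':' inside the value survives."""
    result = {}
    for line in lines:
        line = line.strip()
        if line:
            key, value = line.split(":", 1)
            result[key] = value
    return result


if __name__ == "__main__":
    mapping = build_tf2msp(24)
    text = "".join("%s:%s\n" % kv for kv in mapping.items())
    assert parse_mapping(io.StringIO(text)) == mapping
    print(len(mapping))
```

Passing the layer count from the model config would let one table serve both base and large models, at the cost of a small change to the calling scripts.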
diff --git a/research/nlp/transformer_xl/tran_model/tf2msp.sh b/research/nlp/transformer_xl/tran_model/tf2msp.sh
new file mode 100644
index 0000000000000000000000000000000000000000..696b3a63b20155fbe80f46ffb329487c4ceaba87
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/tf2msp.sh
@@ -0,0 +1,77 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+
+echo 'Preprocess key mapping...'
+python key_mapping.py
+
+echo 'Convert TensorFlow model to MindSpore model.'
+if [ $# -lt 4 ] ||  [ $# -gt 5 ]
+then
+    echo "Usage: bash tf2msp.sh [DATA_DIR] [DATA_NAME] [TF_PT_PATH]
+     [CONFIG_PATH] [DEVICE_ID(optional)]"
+exit 1
+fi
+
+export DEVICE_ID=0
+if [ $# = 5 ] ; then
+  export DEVICE_ID=$5
+fi;
+
+get_real_path(){
+  if [ "${1:0:1}" == "/" ]; then
+    echo "$1"
+  else
+    echo "$(realpath -m $PWD/$1)"
+  fi
+}
+
+DATA_DIR=$(get_real_path $1)
+if [ ! -d $DATA_DIR ]
+then
+    echo "error: DATA_DIR=$DATA_DIR is not a directory"
+exit 1
+fi
+
+DATA_NAME=$2
+PT_PATH=$3
+CONFIG_PATH=$4
+
+echo "DATA_DIR="$DATA_DIR
+echo "DATA_NAME="$DATA_NAME
+echo "PT_PATH="$PT_PATH
+echo "CONFIG_PATH="$CONFIG_PATH
+
+export CONFIG_PATH=${CONFIG_PATH}
+export DEVICE_NUM=1
+export RANK_SIZE=$DEVICE_NUM
+export RANK_ID=0
+
+if [ ! -d "tf2msp_model" ];
+then
+    mkdir ./tf2msp_model
+fi
+
+echo "Start evaluation for device $DEVICE_ID :)"
+
+python ./tf2msp/tf2msp.py \
+  --device_id=$DEVICE_ID \
+  --datadir=$DATA_DIR \
+  --dataset=$DATA_NAME \
+  --pt_path=$PT_PATH \
+  --device="GPU" &> tf2msp_$DATA_NAME.log &
+
+
diff --git a/research/nlp/transformer_xl/tran_model/tf2msp/tf2msp.py b/research/nlp/transformer_xl/tran_model/tf2msp/tf2msp.py
new file mode 100644
index 0000000000000000000000000000000000000000..1fad7f9c2159b10b12fc0b50c0d59568aab2f960
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/tf2msp/tf2msp.py
@@ -0,0 +1,102 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import sys
+import argparse
+import pickle
+sys.path.insert(0, '../')  # must run before the `src` imports below to take effect
+import mindspore
+import mindspore.ops as ops
+from mindspore import context
+from mindspore.dataset import GeneratorDataset
+from mindspore import save_checkpoint
+from src.metric.calc import bpc
+from src.model.mem_transformer import MemTransformerLM
+from src.model_utils.config import config
+from src.utils.dataset_util import get_dataset
+from src.callback.eval import doEval
+
+
+parser = argparse.ArgumentParser(description='Convert a TensorFlow model to a MindSpore model.')
+parser.add_argument('--datadir', default='./data/enwik8',
+                    help='Directory contains enwik8 dataset.')
+parser.add_argument('--dataset', default='enwik8',
+                    help='Dataset Name.', choices=["enwik8", "text8"])
+parser.add_argument('--pt_path', default="./model.pt", help='Path to the pickled model parameters (.pkl).')
+parser.add_argument("--device", type=str, default="GPU", help="Device Target, default GPU",
+                    choices=["Ascend", "GPU"])
+parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
+args = parser.parse_args()
+datadir = args.datadir
+dataset = args.dataset
+pt_path = args.pt_path
+device_id = args.device_id
+
+numpy_param_path = pt_path
+with open(numpy_param_path, 'rb') as f:
+    tf_dict = pickle.load(f, encoding='bytes')
+
+dataset = get_dataset(datadir, dataset)
+ntokens = len(dataset.vocab)
+
+context.set_context(device_id=device_id)
+context.set_context(mode=context.GRAPH_MODE, device_target="GPU", max_device_memory="39.0GB",
+                    enable_graph_kernel=True)
+
+test_dataset = GeneratorDataset(source=dataset.get_test_generator(), column_names=['data', 'target'],
+                                shuffle=False)
+
+cutoffs = []
+net = MemTransformerLM(ntokens, config.n_layer, config.n_head, config.d_model,
+                       config.d_head, config.d_inner, config.dropout, config.dropatt, batch_size=config.batch_size,
+                       d_embed=config.d_embed, div_val=config.div_val,
+                       pre_lnorm=config.pre_lnorm, tgt_len=config.tgt_len,
+                       ext_len=config.ext_len, mem_len=config.mem_len, eval_tgt_len=config.eval_tgt_len,
+                       cutoffs=cutoffs, same_length=config.same_length, clamp_len=config.clamp_len)
+
+net_dict = {}
+with open('./tf2msp_large.txt', 'r') as f:
+    for line in f.readlines():
+        tf_name, msp_name = line.strip().split(":")
+        net_dict[msp_name] = tf_dict[tf_name]
+
+# TF dense kernels are stored as (in, out); the MindSpore weights are (out, in), so they are transposed below.
+
+for k in net.parameters_dict():
+    if k in ('mems', 'valid_mems', 'empty_valid_mems'):
+        continue
+    if k in ('pos_emb.inv_freq', 'crit.out_layers.0.weight', 'crit.out_layers.0.bias'):
+        continue
+    if 'attn.qkv_net.weight' in k or 'attn.r_net.weight' in k or \
+            'attn.o_net.weight' in k or 'pos_ff.CoreNet.0.weight' in k or \
+            'pos_ff.CoreNet.3.weight' in k:
+        a = mindspore.Tensor(net_dict[k].transpose((1, 0)))
+        net.parameters_dict()[k].set_data(a)
+    else:
+        net.parameters_dict()[k].set_data(mindspore.Tensor(net_dict[k]))
+
+print('load net param')
+
+save_path = './tf2msp_model/' + str(args.dataset) + '_model.ckpt'
+save_checkpoint(net, save_path)
+
+test_loss = doEval(net, test_dataset, config.tgt_len, config.ext_len, config.mem_len, config.eval_tgt_len)
+
+print('=' * 100)
+if config.dataset in ['enwik8', 'text8']:
+    print('| End of test | test loss {:5.2f} | test bpc {:9.5f}'.format(
+        test_loss, bpc(test_loss)))
+
+print('=' * 100)
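Note on `tf2msp.py`: the loading loop indexes `net_dict[k]` directly, so a parameter that is missing from the `.pkl` file surfaces as a bare `KeyError` partway through loading. One possible pre-check is sketched below. `check_coverage` and `needs_transpose` are hypothetical helpers; the skipped names and the transposed suffixes are copied from the loop above.

```python
# Names the loading loop skips explicitly.
SKIPPED = {
    "mems", "valid_mems", "empty_valid_mems",
    "pos_emb.inv_freq", "crit.out_layers.0.weight", "crit.out_layers.0.bias",
}

# Weights stored as (in, out) by TensorFlow and therefore transposed on load.
TRANSPOSED_SUFFIXES = (
    "attn.qkv_net.weight", "attn.r_net.weight", "attn.o_net.weight",
    "pos_ff.CoreNet.0.weight", "pos_ff.CoreNet.3.weight",
)


def needs_transpose(name: str) -> bool:
    """True for weights that must be transposed; uses endswith where the script uses `in`."""
    return name.endswith(TRANSPOSED_SUFFIXES)


def check_coverage(model_names, source_names):
    """Return (missing, unused): names the model needs but the source lacks, and the reverse."""
    wanted = set(model_names) - SKIPPED
    have = set(source_names)
    return sorted(wanted - have), sorted(have - wanted)


if __name__ == "__main__":
    missing, unused = check_coverage(
        ["mems", "r_w_bias", "0.attn.o_net.weight"],
        ["r_w_bias", "extra"],
    )
    print("missing:", missing)
    print("unused:", unused)
```

Running such a check before the loop would report every mismatch at once instead of stopping at the first one.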
diff --git a/research/nlp/transformer_xl/tran_model/tf2msp/tf_get_param.py b/research/nlp/transformer_xl/tran_model/tf2msp/tf_get_param.py
new file mode 100644
index 0000000000000000000000000000000000000000..fd639512028c50127ce8170cb7142de17f7d078e
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/tf2msp/tf_get_param.py
@@ -0,0 +1,371 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+from __future__ import absolute_import
+from __future__ import division
+from __future__ import print_function
+
+import tensorflow as tf
+import model
+import data_utils
+import numpy as np
+from absl import flags
+from gpu_utils import assign_to_gpu
+
+# GPU config
+flags.DEFINE_integer("num_hosts", default=1,
+                     help="Number of TPU hosts")
+flags.DEFINE_integer("num_core_per_host", default=8,
+                     help="Number of cores per host")
+
+# Experiment (data/checkpoint/directory) config
+flags.DEFINE_string("data_dir", default="",
+                    help="Path to tf-records directory.")
+flags.DEFINE_string("record_info_dir", default="",
+                    help="Path to local directory containing filenames.txt.")
+flags.DEFINE_string("corpus_info_path", default="",
+                    help="Path to corpus-info.json file.")
+flags.DEFINE_string("model_dir", default=None,
+                    help="Estimator model_dir.")
+flags.DEFINE_bool("do_train", default=True,
+                  help="Whether to run training.")
+flags.DEFINE_bool("do_eval", default=False,
+                  help="Whether to run eval on the dev set.")
+flags.DEFINE_string("eval_ckpt_path", None,
+                    help="Checkpoint path for do_test evaluation."
+                         "If set, model_dir will be ignored."
+                         "If unset, will use the latest ckpt in model_dir.")
+flags.DEFINE_string("warm_start_path", None,
+                    help="Checkpoint path for warm start."
+                         "If set, will clear Adam states."
+                         "Note that the new model_dir should be different"
+                         " from warm_start_path.")
+
+# Optimization config
+flags.DEFINE_float("learning_rate", default=2.5e-4,
+                   help="Maximum learning rate.")
+flags.DEFINE_float("clip", default=0.25,
+                   help="Gradient clipping value.")
+# for cosine decay
+flags.DEFINE_float("min_lr_ratio", default=0.004,
+                   help="Minimum ratio learning rate.")
+flags.DEFINE_integer("warmup_steps", default=0,
+                     help="Number of steps for linear lr warmup.")
+
+# Training config
+flags.DEFINE_integer("train_batch_size", default=60,
+                     help="Size of train batch.")
+flags.DEFINE_integer("eval_batch_size", default=60,
+                     help="Size of valid batch.")
+flags.DEFINE_integer("train_steps", default=100000,
+                     help="Total number of training steps.")
+flags.DEFINE_integer("iterations", default=500,
+                     help="Number of iterations per repeat loop.")
+flags.DEFINE_integer("save_steps", default=10000,
+                     help="number of steps for model checkpointing.")
+
+# Evaluation config
+flags.DEFINE_bool("do_test", default=False,
+                  help="Run on the test set.")
+flags.DEFINE_integer("max_eval_batch", default=-1,
+                     help="Set -1 to turn off. Only used in test mode.")
+flags.DEFINE_bool("do_eval_only", default=False,
+                  help="Run evaluation only.")
+flags.DEFINE_integer("start_eval_steps", default=10000,
+                     help="Which checkpoint to start with in `do_eval_only` mode.")
+flags.DEFINE_string("eval_split", "valid",
+                    help="Which data split to evaluate.")
+
+# Model config
+flags.DEFINE_integer("tgt_len", default=70,
+                     help="Number of steps to predict")
+flags.DEFINE_integer("mem_len", default=70,
+                     help="Number of steps to cache")
+flags.DEFINE_bool("same_length", default=False,
+                  help="Same length attention")
+flags.DEFINE_integer("clamp_len", default=-1,
+                     help="Clamp length")
+
+flags.DEFINE_integer("n_layer", default=6,
+                     help="Number of layers.")
+flags.DEFINE_integer("d_model", default=500,
+                     help="Dimension of the model.")
+flags.DEFINE_integer("d_embed", default=500,
+                     help="Dimension of the embeddings.")
+flags.DEFINE_integer("n_head", default=10,
+                     help="Number of attention heads.")
+flags.DEFINE_integer("d_head", default=50,
+                     help="Dimension of each attention head.")
+flags.DEFINE_integer("d_inner", default=1000,
+                     help="Dimension of inner hidden size in positionwise feed-forward.")
+flags.DEFINE_float("dropout", default=0.1,
+                   help="Dropout rate.")
+flags.DEFINE_float("dropatt", default=0.1,
+                   help="Attention dropout rate.")
+flags.DEFINE_bool("untie_r", default=False,
+                  help="untie r_w_bias and r_r_bias")
+
+# Adaptive Softmax / Embedding
+flags.DEFINE_bool("tie_weight", default=True,
+                  help="Tie embedding and softmax weight.")
+flags.DEFINE_integer("div_val", default=1,
+                     help="Divide the embedding size by this val for each bin")
+flags.DEFINE_bool("proj_share_all_but_first", default=False,
+                  help="True to share all but first projs, False not to share.")
+flags.DEFINE_bool("proj_same_dim", default=True,
+                  help="Project the bin with the same dimension.")
+
+# Parameter initialization
+flags.DEFINE_enum("init", default="normal",
+                  enum_values=["normal", "uniform"],
+                  help="Initialization method.")
+flags.DEFINE_float("init_std", default=0.02,
+                   help="Initialization std when init is normal.")
+flags.DEFINE_float("proj_init_std", default=0.01,
+                   help="Initialization std for embedding projection.")
+flags.DEFINE_float("init_range", default=0.1,
+                   help="Initialization std when init is uniform.")
+
+FLAGS = flags.FLAGS
+
+
+def get_model_fn(n_token, cutoffs):
+    def model_fn(inp, tgt, mems, is_training):
+        inp = tf.transpose(inp, [1, 0])
+        tgt = tf.transpose(tgt, [1, 0])
+
+        if FLAGS.init == "uniform":
+            initializer = tf.initializers.random_uniform(
+                minval=-FLAGS.init_range,
+                maxval=FLAGS.init_range,
+                seed=None)
+        elif FLAGS.init == "normal":
+            initializer = tf.initializers.random_normal(
+                stddev=FLAGS.init_std,
+                seed=None)
+        # Defined for both init modes: model.transformer always receives proj_initializer.
+        proj_initializer = tf.initializers.random_normal(
+            stddev=FLAGS.proj_init_std, seed=None)
+
+        tie_projs = [False for _ in range(len(cutoffs) + 1)]
+        if FLAGS.proj_share_all_but_first:
+            for i in range(1, len(tie_projs)):
+                tie_projs[i] = True
+
+        loss, new_mems = model.transformer(
+            dec_inp=inp,
+            target=tgt,
+            mems=mems,
+            n_token=n_token,
+            n_layer=FLAGS.n_layer,
+            d_model=FLAGS.d_model,
+            d_embed=FLAGS.d_embed,
+            n_head=FLAGS.n_head,
+            d_head=FLAGS.d_head,
+            d_inner=FLAGS.d_inner,
+            dropout=FLAGS.dropout,
+            dropatt=FLAGS.dropatt,
+            initializer=initializer,
+            proj_initializer=proj_initializer,
+            is_training=is_training,
+            mem_len=FLAGS.mem_len,
+            cutoffs=cutoffs,
+            div_val=FLAGS.div_val,
+            tie_projs=tie_projs,
+            input_perms=None,
+            target_perms=None,
+            head_target=None,
+            same_length=FLAGS.same_length,
+            clamp_len=FLAGS.clamp_len,
+            use_tpu=False,
+            untie_r=FLAGS.untie_r,
+            proj_same_dim=FLAGS.proj_same_dim)
+
+        # number of parameters
+        num_params = sum([np.prod(v.shape) for v in tf.trainable_variables()])
+        tf.logging.info('#params: {}'.format(num_params))
+
+        # format_str = '{{:<{0}s}}\t{{}}'.format(
+        #     max([len(v.name) for v in tf.trainable_variables()]))
+        # for v in tf.trainable_variables():
+        #   tf.logging.info(format_str.format(v.name, v.get_shape()))
+
+        if is_training:
+            all_vars = tf.trainable_variables()
+            grads = tf.gradients(loss, all_vars)
+            grads_and_vars = list(zip(grads, all_vars))
+            return loss, new_mems, grads_and_vars
+        return loss, new_mems
+
+    return model_fn
+
+
+def single_core_graph(n_token, cutoffs, is_training, inp, tgt, mems):
+    model_fn = get_model_fn(
+        n_token=n_token,
+        cutoffs=cutoffs)
+
+    model_ret = model_fn(
+        inp=inp,
+        tgt=tgt,
+        mems=mems,
+        is_training=is_training)
+
+    return model_ret
+
+
+def evaluate(n_token, cutoffs, ps_device):
+    ##### Get input function and model function
+    eval_input_fn, eval_record_info = data_utils.get_input_fn(
+        record_info_dir=FLAGS.record_info_dir,
+        split=FLAGS.eval_split,
+        per_host_bsz=FLAGS.eval_batch_size,
+        tgt_len=FLAGS.tgt_len,
+        num_core_per_host=FLAGS.num_core_per_host,
+        num_hosts=1,
+        use_tpu=False)
+
+    num_batch = eval_record_info["num_batch"]
+    if FLAGS.max_eval_batch > 0:
+        num_batch = FLAGS.max_eval_batch
+    tf.logging.info("num of batches {}".format(num_batch))
+
+    ##### Create computational graph
+    eval_set = eval_input_fn({
+        "batch_size": FLAGS.eval_batch_size,
+        "data_dir": FLAGS.data_dir})
+
+    input_feed, label_feed = eval_set.make_one_shot_iterator().get_next()
+
+    inputs = tf.split(input_feed, FLAGS.num_core_per_host, 0)
+    labels = tf.split(label_feed, FLAGS.num_core_per_host, 0)
+
+    per_core_bsz = FLAGS.eval_batch_size // FLAGS.num_core_per_host
+    tower_mems, tower_losses, tower_new_mems = [], [], []
+
+    for i in range(FLAGS.num_core_per_host):
+        with tf.device(assign_to_gpu(i, ps_device)), \
+                tf.variable_scope(tf.get_variable_scope(), reuse=tf.AUTO_REUSE):
+            mems_i = [tf.placeholder(tf.float32,
+                                     [FLAGS.mem_len, per_core_bsz, FLAGS.d_model])
+                      for _ in range(FLAGS.n_layer)]
+
+            loss_i, new_mems_i = single_core_graph(
+                n_token=n_token,
+                cutoffs=cutoffs,
+                is_training=False,
+                inp=inputs[i],
+                tgt=labels[i],
+                mems=mems_i)
+
+            tower_mems.append(mems_i)
+            tower_losses.append(loss_i)
+            tower_new_mems.append(new_mems_i)
+
+    saver = tf.train.Saver()
+
+    with tf.Session(config=tf.ConfigProto(allow_soft_placement=True)) as sess:
+        sess.run(tf.global_variables_initializer())
+
+        if FLAGS.eval_ckpt_path is None:
+            eval_ckpt_path = tf.train.latest_checkpoint(FLAGS.model_dir)
+        else:
+            eval_ckpt_path = FLAGS.eval_ckpt_path
+        tf.logging.info("Evaluate {}".format(eval_ckpt_path))
+        saver.restore(sess, eval_ckpt_path)
+
+        print("=" * 100)
+        graph = sess.graph
+        # print([node.name for node in graph.as_graph_def().node])
+
+        # r_w_bias(8,128) --> transformer/r_w_bias(8,128)
+        # r_r_bias(8,128) --> transformer/r_r_bias(8,128)
+
+        # 0.attn.qkv_net.weight(3072, 1024) --> transformer/layer_0/rel_attn/qkv/kernel(1024, 3072)
+        # 0.attn.o_net.weight(1024,1024) --> transformer/layer_0/rel_attn/o/kernel(1024, 1024)
+        # 0.attn.r_net.weight(1024,1024) --> transformer/layer_0/rel_attn/r/kernel(1024, 1024)
+        # 0.attn.layer_norm.gamma(1024,1) --> transformer/layer_0/rel_attn/LayerNorm/gamma(1024,1)
+        # 0.attn.layer_norm.beta(1024,1) --> transformer/layer_0/rel_attn/LayerNorm/beta(1024,1)
+        # 0.pos_ff.CoreNet.0.weight(3072, 1024) --> transformer/layer_0/ff/layer_1/kernel(1024, 3072)
+        # 0.pos_ff.CoreNet.0.bias(3072,1) --> transformer/layer_0/ff/layer_1/bias(3072,1)
+        # 0.pos_ff.CoreNet.3.weight(1024, 3072) --> transformer/layer_0/ff/layer_2/kernel(3072, 1024)
+        # 0.pos_ff.CoreNet.3.bias(1024,1) --> transformer/layer_0/ff/layer_2/bias(1024,1)
+        # 0.pos_ff.layer_norm.gamma(1024,1) --> transformer/layer_0/ff/LayerNorm/gamma(1024,1)
+        # 0.pos_ff.layer_norm.beta(1024,1) --> transformer/layer_0/ff/LayerNorm/beta(1024,1)
+
+        # word_emb.emb_layers.0.embedding_table(204,1024) --> transformer/adaptive_embed/lookup_table(204, 1024)
+        # crit.out_layers.0.bias(204,) -->
+
+        print("*" * 100)
+        param_dict = {}
+        param_dict["transformer/r_w_bias"] = 'r_w_bias'
+        param_dict["transformer/r_r_bias"] = 'r_r_bias'
+        param_dict['transformer/adaptive_embed/lookup_table'] = 'word_emb.emb_layers.0.embedding_table'
+        for i in range(0, 24):
+            param_dict['transformer/layer_' + str(i) + '/rel_attn/qkv/kernel'] = str(i) + '.attn.qkv_net.weight'
+            param_dict['transformer/layer_' + str(i) + '/rel_attn/o/kernel'] = str(i) + '.attn.o_net.weight'
+            param_dict['transformer/layer_' + str(i) + '/rel_attn/r/kernel'] = str(i) + '.attn.r_net.weight'
+            param_dict['transformer/layer_' + str(i) + '/rel_attn/LayerNorm/gamma'] = str(i) + '.attn.layer_norm.gamma'
+            param_dict['transformer/layer_' + str(i) + '/rel_attn/LayerNorm/beta'] = str(i) + '.attn.layer_norm.beta'
+            param_dict['transformer/layer_' + str(i) + '/ff/layer_1/kernel'] = str(i) + '.pos_ff.CoreNet.0.weight'
+            param_dict['transformer/layer_' + str(i) + '/ff/layer_1/bias'] = str(i) + '.pos_ff.CoreNet.0.bias'
+            ###############
+            param_dict['transformer/layer_' + str(i) + '/ff/layer_2/kernel'] = str(i) + '.pos_ff.CoreNet.3.weight'
+            param_dict['transformer/layer_' + str(i) + '/ff/layer_2/bias'] = str(i) + '.pos_ff.CoreNet.3.bias'
+            ###############
+            param_dict['transformer/layer_' + str(i) + '/ff/LayerNorm/gamma'] = str(i) + '.pos_ff.layer_norm.gamma'
+            param_dict['transformer/layer_' + str(i) + '/ff/LayerNorm/beta'] = str(i) + '.pos_ff.layer_norm.beta'
+
+        tf_dict = {}
+        for node in graph.as_graph_def().node:
+            if node.name in param_dict.keys():
+                print(node.name)
+                node_data = graph.get_operation_by_name(node.name).outputs[0]
+                data_np = sess.run(node_data)
+                print(type(data_np))
+                print(data_np.shape)
+                print(data_np)
+
+                tf_dict[node.name] = data_np
+                print("*" * 100)
+
+        import pickle
+
+        if 'enwik8' in FLAGS.model_dir:
+            with open('./enwik8_large.pkl', 'wb') as f:
+                pickle.dump(tf_dict, f)
+        if 'text8' in FLAGS.model_dir:
+            with open('./text8_large.pkl', 'wb') as f:
+                pickle.dump(tf_dict, f)
+
+        print("=" * 100)
+        print("finish!")
+
+
+def main(unused_argv):
+    del unused_argv  # Unused
+
+    tf.logging.set_verbosity(tf.logging.INFO)
+
+    # Get corpus info
+    corpus_info = data_utils.get_corpus_info(FLAGS.corpus_info_path)
+    n_token = corpus_info["vocab_size"]
+    cutoffs = corpus_info["cutoffs"][1:-1]
+    tf.logging.info("n_token {}".format(n_token))
+
+    evaluate(n_token, cutoffs, "/gpu:0")
+
+
+if __name__ == "__main__":
+    tf.app.run()
diff --git a/research/nlp/transformer_xl/tran_model/tf2msp/tf_get_param.sh b/research/nlp/transformer_xl/tran_model/tf2msp/tf_get_param.sh
new file mode 100644
index 0000000000000000000000000000000000000000..323c60aa88f5fa2688597fa9e7f899ca1df0804f
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/tf2msp/tf_get_param.sh
@@ -0,0 +1,140 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+export CUDA_VISIBLE_DEVICES="0,1,2,3"
+
+echo 'Convert TensorFlow model parameters to NumPy.'
+if [ $# -lt 1 ] ; then
+    echo "Usage: bash tf_get_param.sh [DATA_SET]"
+    exit 1
+fi
+
+if [ "$1" == "enwik8" ]; then
+  # Data
+  DATA_ROOT=./
+  DATA_DIR=${DATA_ROOT}/pretrained_xl/tf_enwik8/data
+  MODEL_DIR=${DATA_ROOT}/pretrained_xl/tf_enwik8/model
+
+  # Model
+  N_LAYER=24
+  D_MODEL=1024
+  D_EMBED=1024
+  N_HEAD=8
+  D_HEAD=128
+  D_INNER=3072
+
+  # Testing
+  TEST_TGT_LEN=128
+  TEST_MEM_LEN=3800
+  TEST_CLAMP_LEN=1000
+
+  TEST_CKPT_PATH=${MODEL_DIR}/model.ckpt-0
+  TEST_BSZ=2
+  TEST_NUM_CORE=2
+
+  echo 'Preprocess test set...'
+  python data_utils.py \
+    --data_dir=${DATA_DIR}/ \
+    --dataset=enwik8 \
+    --tgt_len=${TEST_TGT_LEN} \
+    --per_host_test_bsz=${TEST_BSZ} \
+    --num_passes=1 \
+    --use_tpu=False
+
+  echo 'Run evaluation on test set...'
+  python tf_get_param.py \
+      --data_dir=${DATA_DIR}/tfrecords \
+      --record_info_dir=${DATA_DIR}/tfrecords/ \
+      --corpus_info_path=${DATA_DIR}/corpus-info.json \
+      --eval_ckpt_path=${TEST_CKPT_PATH} \
+      --model_dir=EXP-enwik8 \
+      --n_layer=${N_LAYER} \
+      --d_model=${D_MODEL} \
+      --d_embed=${D_EMBED} \
+      --n_head=${N_HEAD} \
+      --d_head=${D_HEAD} \
+      --d_inner=${D_INNER} \
+      --dropout=0.0 \
+      --dropatt=0.0 \
+      --tgt_len=${TEST_TGT_LEN} \
+      --mem_len=${TEST_MEM_LEN} \
+      --clamp_len=${TEST_CLAMP_LEN} \
+      --same_length=True \
+      --eval_batch_size=${TEST_BSZ} \
+      --num_core_per_host=${TEST_NUM_CORE} \
+      --do_train=False \
+      --do_eval=True \
+      --eval_split=test
+
+fi
+
+if [ "$1" == "text8" ]; then
+  # Data
+  DATA_ROOT=./
+  DATA_DIR=${DATA_ROOT}/pretrained_xl/tf_text8/data
+  MODEL_DIR=${DATA_ROOT}/pretrained_xl/tf_text8/model
+
+  # Model
+  N_LAYER=24
+  D_MODEL=1024
+  D_EMBED=1024
+  N_HEAD=8
+  D_HEAD=128
+  D_INNER=3072
+
+  # Testing
+  TEST_TGT_LEN=128
+  TEST_MEM_LEN=3800
+  TEST_CLAMP_LEN=1000
+
+  TEST_CKPT_PATH=${MODEL_DIR}/model.ckpt-0
+  TEST_BSZ=2
+  TEST_NUM_CORE=2
+
+  echo 'Preprocess test set...'
+  python data_utils.py \
+    --data_dir=${DATA_DIR}/ \
+    --dataset=text8 \
+    --tgt_len=${TEST_TGT_LEN} \
+    --per_host_test_bsz=${TEST_BSZ} \
+    --num_passes=1 \
+    --use_tpu=False
+
+  echo 'Run evaluation on test set...'
+  python tf_get_param.py \
+      --data_dir=${DATA_DIR}/tfrecords \
+      --record_info_dir=${DATA_DIR}/tfrecords/ \
+      --corpus_info_path=${DATA_DIR}/corpus-info.json \
+      --eval_ckpt_path=${TEST_CKPT_PATH} \
+      --model_dir=EXP-text8 \
+      --n_layer=${N_LAYER} \
+      --d_model=${D_MODEL} \
+      --d_embed=${D_EMBED} \
+      --n_head=${N_HEAD} \
+      --d_head=${D_HEAD} \
+      --d_inner=${D_INNER} \
+      --dropout=0.0 \
+      --dropatt=0.0 \
+      --tgt_len=${TEST_TGT_LEN} \
+      --mem_len=${TEST_MEM_LEN} \
+      --clamp_len=${TEST_CLAMP_LEN} \
+      --same_length=True \
+      --eval_batch_size=${TEST_BSZ} \
+      --num_core_per_host=${TEST_NUM_CORE} \
+      --do_train=False \
+      --do_eval=True \
+      --eval_split=test
+fi
\ No newline at end of file
diff --git a/research/nlp/transformer_xl/tran_model/torch2msp.sh b/research/nlp/transformer_xl/tran_model/torch2msp.sh
new file mode 100644
index 0000000000000000000000000000000000000000..24d998976ad43c95e4604e6b1fe95de023a07ea0
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/torch2msp.sh
@@ -0,0 +1,77 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+
+echo 'Preprocess key mapping...'
+python key_mapping.py
+
+echo 'Convert PyTorch model to MindSpore model.'
+if [ $# -lt 4 ] ||  [ $# -gt 5 ]
+then
+    echo "Usage: bash torch2msp.sh [DATA_DIR] [DATA_NAME] [TORCH_PT_PATH]
+     [CONFIG_PATH] [DEVICE_ID(optional)]"
+exit 1
+fi
+
+export DEVICE_ID=0
+if [ $# = 5 ] ; then
+  export DEVICE_ID=$5
+fi;
+
+get_real_path(){
+  if [ "${1:0:1}" == "/" ]; then
+    echo "$1"
+  else
+    echo "$(realpath -m $PWD/$1)"
+  fi
+}
+
+DATA_DIR=$(get_real_path $1)
+if [ ! -d $DATA_DIR ]
+then
+    echo "error: DATA_DIR=$DATA_DIR is not a directory"
+exit 1
+fi
+
+DATA_NAME=$2
+PT_PATH=$3
+CONFIG_PATH=$4
+
+echo "DATA_DIR="$DATA_DIR
+echo "DATA_NAME="$DATA_NAME
+echo "PT_PATH="$PT_PATH
+echo "CONFIG_PATH="$CONFIG_PATH
+
+export CONFIG_PATH=${CONFIG_PATH}
+export DEVICE_NUM=1
+export RANK_SIZE=$DEVICE_NUM
+export RANK_ID=0
+
+if [ ! -d "torch2msp_model" ];
+then
+    mkdir ./torch2msp_model
+fi
+
+echo "Start evaluation for device $DEVICE_ID :)"
+
+python ./torch2msp/torch2msp.py \
+  --device_id=$DEVICE_ID \
+  --datadir=$DATA_DIR \
+  --dataset=$DATA_NAME \
+  --pt_path=$PT_PATH \
+  --device="GPU" &> torch2msp_$DATA_NAME.log &
+
+
diff --git a/research/nlp/transformer_xl/tran_model/torch2msp/torch2msp.py b/research/nlp/transformer_xl/tran_model/torch2msp/torch2msp.py
new file mode 100644
index 0000000000000000000000000000000000000000..8903f0854049d2d1259d16f9ebfd595d2a3a081e
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/torch2msp/torch2msp.py
@@ -0,0 +1,91 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import sys
+import argparse
+import pickle
+
+sys.path.insert(0, '../')  # must run before the `src` imports below
+import mindspore
+from mindspore import context
+from mindspore import save_checkpoint
+from mindspore.dataset import GeneratorDataset
+from src.utils.dataset_util import get_dataset
+from src.metric.calc import bpc
+from src.model.mem_transformer import MemTransformerLM
+from src.model_utils.config import config
+from src.callback.eval import doEval
+
+parser = argparse.ArgumentParser(description='Convert a PyTorch model to a MindSpore model.')
+parser.add_argument('--datadir', default='./data/enwik8',
+                    help='Directory containing the dataset.')
+parser.add_argument('--dataset', default='enwik8',
+                    help='Dataset name.', choices=["enwik8", "text8"])
+parser.add_argument('--pt_path', default="./model.pt", help='Path to the pickled parameter dict.')
+parser.add_argument("--device", type=str, default="GPU", help="Device Target, default GPU",
+                    choices=["Ascend", "GPU"])
+parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
+args = parser.parse_args()
+datadir = args.datadir
+dataset = args.dataset
+pt_path = args.pt_path
+device_id = args.device_id
+
+numpy_param_path = pt_path
+with open(numpy_param_path, 'rb') as f:
+    torch_dict = pickle.load(f)
+
+dataset = get_dataset(datadir, dataset)
+ntokens = len(dataset.vocab)
+
+context.set_context(device_id=device_id)
+context.set_context(mode=context.GRAPH_MODE, device_target=args.device, max_device_memory="39.0GB",
+                    enable_graph_kernel=True)
+
+test_dataset = GeneratorDataset(source=dataset.get_test_generator(), column_names=['data', 'target'],
+                                shuffle=False)
+
+cutoffs = []
+net = MemTransformerLM(ntokens, config.n_layer, config.n_head, config.d_model,
+                       config.d_head, config.d_inner, config.dropout, config.dropatt, batch_size=config.batch_size,
+                       d_embed=config.d_embed, div_val=config.div_val,
+                       pre_lnorm=config.pre_lnorm, tgt_len=config.tgt_len,
+                       ext_len=config.ext_len, mem_len=config.mem_len, eval_tgt_len=config.eval_tgt_len,
+                       cutoffs=cutoffs, same_length=config.same_length, clamp_len=config.clamp_len)
+
+net_dict = {}
+with open('./msp2torch_base.txt', 'r') as f:
+    for line in f.readlines():
+        msp_name, pytorch_name = line.strip().split(":")
+        net_dict[msp_name] = torch_dict[pytorch_name]
+
+for k in net.parameters_dict():
+    if k in ('mems', 'valid_mems', 'empty_valid_mems'):
+        continue
+    net.parameters_dict()[k].set_data(mindspore.Tensor(net_dict[k]))
+
+print('load net param')
+
+save_path = './torch2msp_model/' + str(args.dataset) + '_model.ckpt'
+save_checkpoint(net, save_path)
+
+test_loss = doEval(net, test_dataset, config.tgt_len, config.ext_len, config.mem_len, config.eval_tgt_len)
+
+print('=' * 100)
+if config.dataset in ['enwik8', 'text8']:
+    print('| End of test | test loss {:5.2f} | test bpc {:9.5f}'.format(
+        test_loss, bpc(test_loss)))
+
+print('=' * 100)
diff --git a/research/nlp/transformer_xl/tran_model/torch2msp/torch2numpy.py b/research/nlp/transformer_xl/tran_model/torch2msp/torch2numpy.py
new file mode 100644
index 0000000000000000000000000000000000000000..4fbfc06bdd90eef5a674b47d2c8b1737c7d239b7
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/torch2msp/torch2numpy.py
@@ -0,0 +1,48 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+import argparse
+import os
+import pickle
+import torch
+
+# PyTorch 1.x is required for the conversion to NumPy.
+parser = argparse.ArgumentParser(description='Convert PyTorch model parameters to NumPy arrays.')
+parser.add_argument('--dataset', default='enwik8',
+                    help='Dataset Name.', choices=["enwik8", "text8"])
+parser.add_argument('--work_dir', default="./", help='Directory containing the *_base.pkl state dict.')
+args = parser.parse_args()
+
+torch_model_path = args.work_dir
+torch_param = None
+if args.dataset == 'enwik8':
+    torch_param = torch.load(os.path.join(torch_model_path, 'enwik8_base.pkl'))
+if args.dataset == 'text8':
+    torch_param = torch.load(os.path.join(torch_model_path, 'text8_base.pkl'))
+
+if not torch_param:
+    print('no torch param model.')
+    raise SystemExit(1)
+torch_dict = {}
+for key in torch_param.keys():
+    torch_dict[key] = torch_param[key].cpu().numpy()
+
+if args.dataset == 'enwik8':
+    with open('./enwik8_base.pkl', 'wb') as f:
+        pickle.dump(torch_dict, f)
+if args.dataset == 'text8':
+    with open('./text8_base.pkl', 'wb') as f:
+        pickle.dump(torch_dict, f)
+print('finish!')
diff --git a/research/nlp/transformer_xl/tran_model/torch2msp/torch2numpy.sh b/research/nlp/transformer_xl/tran_model/torch2msp/torch2numpy.sh
new file mode 100644
index 0000000000000000000000000000000000000000..eaa7b55d5907cf855a0c9e5edaa0e318e160addf
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/torch2msp/torch2numpy.sh
@@ -0,0 +1,25 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+echo 'Convert PyTorch state dict to NumPy.'
+if [ $# -lt 2 ] ; then
+    echo "Usage: bash torch2numpy.sh [DATA_SET] [WORK_DIR]"
+exit 1
+fi
+
+python torch2numpy.py \
+    --dataset $1 \
+    --work_dir $2
\ No newline at end of file
diff --git a/research/nlp/transformer_xl/tran_model/torch2msp/torch_get_param.py b/research/nlp/transformer_xl/tran_model/torch2msp/torch_get_param.py
new file mode 100644
index 0000000000000000000000000000000000000000..ce527889edd8a2cb68f2e45630c549028dddebde
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/torch2msp/torch_get_param.py
@@ -0,0 +1,44 @@
+# coding: utf-8
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+import argparse
+import os
+import torch
+
+parser = argparse.ArgumentParser(description='PyTorch Transformer Language Model')
+parser.add_argument('--dataset', type=str, default='wt103',
+                    choices=['wt103', 'lm1b', 'enwik8', 'text8'],
+                    help='dataset name')
+parser.add_argument('--cuda', action='store_true',
+                    help='use CUDA')
+parser.add_argument('--work_dir', type=str, required=True,
+                    help='path to the work_dir')
+
+args = parser.parse_args()
+
+device = torch.device("cuda" if args.cuda else "cpu")
+
+# Load the best saved model.
+with open(os.path.join(args.work_dir, 'model.pt'), 'rb') as f:
+    model = torch.load(f)
+model.backward_compatible()
+model = model.to(device)
+
+# Save the model parameters.
+if 'enwik8' in args.dataset:
+    torch.save(model.state_dict(), 'enwik8_base.pkl')
+if 'text8' in args.dataset:
+    torch.save(model.state_dict(), 'text8_base.pkl')
+print('finish')
diff --git a/research/nlp/transformer_xl/tran_model/torch2msp/torch_get_param.sh b/research/nlp/transformer_xl/tran_model/torch2msp/torch_get_param.sh
new file mode 100644
index 0000000000000000000000000000000000000000..9ef29b4b6be7211782f4604ae76b05e0510d39c1
--- /dev/null
+++ b/research/nlp/transformer_xl/tran_model/torch2msp/torch_get_param.sh
@@ -0,0 +1,26 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+echo 'Extract the state dict from the PyTorch model.'
+if [ $# -lt 2 ] ; then
+    echo "Usage: bash torch_get_param.sh [DATA_SET] [WORK_DIR]"
+exit 1
+fi
+
+python torch_get_param.py \
+    --cuda \
+    --dataset $1 \
+    --work_dir $2
\ No newline at end of file
diff --git a/research/nlp/transformer_xl/yaml/enwik8_base.yaml b/research/nlp/transformer_xl/yaml/enwik8_base.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..0473d99ad5a37f7c587df81b8353549f8e24e8dd
--- /dev/null
+++ b/research/nlp/transformer_xl/yaml/enwik8_base.yaml
@@ -0,0 +1,94 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
+enable_modelarts: False
+# Url for modelarts
+data_url: ""
+train_url: "experiments"
+checkpoint_url: "experiments"
+
+# Path for local
+datadir: "/home/mindspore/msp_txl/official/nlp/transformer_xl/data/enwik8"
+dataset: "enwik8"
+ckpt_path: "/home/mindspore/msp_txl/official/nlp/transformer_xl/script/experiments-enwik8/20220416-140816/model0.ckpt"
+device: "GPU"
+device_id: 0
+
+# ==============================================================================
+# Training options
+
+n_layer: 12
+n_head: 8
+d_head: 64
+d_embed: 512
+d_model: 512
+d_inner: 2048
+dropout: 0.1
+dropatt: 0.0
+optim: "adam"
+scheduler: "cosine"
+lr: 0.00025
+lr_min: 0.0
+warmup_step: 0
+max_step: 400000
+log-interval: 200
+eval-interval: 4000
+batch_size: 22
+tgt_len: 512
+eval_tgt_len: 128
+mem_len: 512
+clamp_len: -1
+init: "normal"
+emb_init: "normal"
+init_range: 0.1
+emb_init_range: 0.01
+init_std: 0.02
+proj_init_std: 0.01
+mom: 0.0
+decay_rate: 0.5
+clip: 0.25
+batch_chunk: 1
+seed: 1111
+div_val: 1
+attn_type: 0
+ext_len: 0
+eta_min: 0.0
+max_eval_steps: -1
+sample_softmax: -1
+patience: 0
+adaptive: False
+varlen: False
+pre_lnorm: False
+same_length: False
+
+# Model Description
+
+
+
+---
+# Config description for each option
+enable_modelarts: 'Whether training on modelarts, default: False'
+data_url: 'Dataset url for obs'
+train_url: 'Training output url for obs'
+data_path: 'Dataset path for local'
+output_path: 'Training output path for local'
+
+device_target: 'Target device type'
+enable_profiling: 'Whether enable profiling while training, default: False'
+
+
+---
+device_target: [ 'Ascend', 'GPU', 'CPU' ]
diff --git a/research/nlp/transformer_xl/yaml/enwik8_large.yaml b/research/nlp/transformer_xl/yaml/enwik8_large.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..b8e18c2eba8c032a306465915cbc65b9c6d22513
--- /dev/null
+++ b/research/nlp/transformer_xl/yaml/enwik8_large.yaml
@@ -0,0 +1,94 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
+enable_modelarts: False
+# Url for modelarts
+data_url: ""
+train_url: "experiments"
+checkpoint_url: "experiments"
+
+# Path for local
+datadir: "/home/mindspore/msp_txl/official/nlp/transformer_xl/data/enwik8"
+dataset: "enwik8"
+ckpt_path: "/home/mindspore/msp_txl/official/nlp/transformer_xl/script/experiments-enwik8/20220416-140816/model0.ckpt"
+device: "GPU"
+device_id: 0
+
+# ==============================================================================
+# Training options
+
+n_layer: 24
+n_head: 8
+d_head: 128
+d_embed: -1
+d_model: 1024
+d_inner: 3072
+dropout: 0.15
+dropatt: 0.15
+optim: "adam"
+scheduler: "cosine"
+lr: 0.00025
+lr_min: 0.0
+warmup_step: 0
+max_step: 400000
+log-interval: 200
+eval-interval: 4000
+batch_size: 64
+tgt_len: 786
+eval_tgt_len: 128
+ext_len: 0
+mem_len: 786
+clamp_len: -1
+init: "normal"
+emb_init: "normal"
+init_range: 0.1
+emb_init_range: 0.01
+init_std: 0.02
+proj_init_std: 0.01
+mom: 0.0
+decay_rate: 0.5
+clip: 0.25
+batch_chunk: 1
+seed: 1111
+div_val: 1
+attn_type: 0
+eta_min: 0.0
+max_eval_steps: -1
+sample_softmax: -1
+patience: 0
+adaptive: False
+varlen: False
+pre_lnorm: False
+same_length: False
+
+# Model Description
+
+
+
+---
+# Config description for each option
+enable_modelarts: 'Whether training on modelarts, default: False'
+data_url: 'Dataset url for obs'
+train_url: 'Training output url for obs'
+data_path: 'Dataset path for local'
+output_path: 'Training output path for local'
+
+device_target: 'Target device type'
+enable_profiling: 'Whether enable profiling while training, default: False'
+
+
+---
+device_target: [ 'Ascend', 'GPU', 'CPU' ]
diff --git a/research/nlp/transformer_xl/yaml/text8_large.yaml b/research/nlp/transformer_xl/yaml/text8_large.yaml
new file mode 100644
index 0000000000000000000000000000000000000000..b8e18c2eba8c032a306465915cbc65b9c6d22513
--- /dev/null
+++ b/research/nlp/transformer_xl/yaml/text8_large.yaml
@@ -0,0 +1,94 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
+enable_modelarts: False
+# Url for modelarts
+data_url: ""
+train_url: "experiments"
+checkpoint_url: "experiments"
+
+# Path for local
+datadir: "/home/mindspore/msp_txl/official/nlp/transformer_xl/data/text8"
+dataset: "text8"
+ckpt_path: "/home/mindspore/msp_txl/official/nlp/transformer_xl/script/experiments-enwik8/20220416-140816/model0.ckpt"
+device: "GPU"
+device_id: 0
+
+# ==============================================================================
+# Training options
+
+n_layer: 24
+n_head: 8
+d_head: 128
+d_embed: -1
+d_model: 1024
+d_inner: 3072
+dropout: 0.15
+dropatt: 0.15
+optim: "adam"
+scheduler: "cosine"
+lr: 0.00025
+lr_min: 0.0
+warmup_step: 0
+max_step: 400000
+log-interval: 200
+eval-interval: 4000
+batch_size: 64
+tgt_len: 786
+eval_tgt_len: 128
+ext_len: 0
+mem_len: 786
+clamp_len: -1
+init: "normal"
+emb_init: "normal"
+init_range: 0.1
+emb_init_range: 0.01
+init_std: 0.02
+proj_init_std: 0.01
+mom: 0.0
+decay_rate: 0.5
+clip: 0.25
+batch_chunk: 1
+seed: 1111
+div_val: 1
+attn_type: 0
+eta_min: 0.0
+max_eval_steps: -1
+sample_softmax: -1
+patience: 0
+adaptive: False
+varlen: False
+pre_lnorm: False
+same_length: False
+
+# Model Description
+
+
+
+---
+# Config description for each option
+enable_modelarts: 'Whether training on modelarts, default: False'
+data_url: 'Dataset url for obs'
+train_url: 'Training output url for obs'
+data_path: 'Dataset path for local'
+output_path: 'Training output path for local'
+
+device_target: 'Target device type'
+enable_profiling: 'Whether enable profiling while training, default: False'
+
+
+---
+device_target: [ 'Ascend', 'GPU', 'CPU' ]