diff --git a/research/audio/FastSpeech/README.md b/research/audio/FastSpeech/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b9e41e589ff152be77005cdb5d19a1944284bcc5 --- /dev/null +++ b/research/audio/FastSpeech/README.md @@ -0,0 +1,350 @@ +# Contents + +- [Contents](#contents) + - [FastSpeech Description](#fastspeech-description) + - [Model Architecture](#model-architecture) + - [Dataset](#dataset) + - [Environment Requirements](#environment-requirements) + - [Quick Start](#quick-start) + - [Script Description](#script-description) + - [Script and Sample Code](#script-and-sample-code) + - [Script Parameters](#script-parameters) + - [Training Process](#training-process) + - [Standalone Training](#standalone-training) + - [Distribute Training](#distribute-training) + - [Evaluation Process](#evaluation-process) + - [Checkpoints preparation](#checkpoints-preparation) + - [Evaluation](#evaluation) + - [Model Export](#model-export) + - [Model Description](#model-description) + - [Performance](#performance) + - [Training Performance](#training-performance) + - [Evaluation Performance](#evaluation-performance) + - [ModelZoo Homepage](#modelzoo-homepage) + +## [FastSpeech Description](#contents) + +Neural network based end-to-end text to speech (TTS) has significantly improved +the quality of synthesized speech. TTS methods usually first generate mel-spectrogram from text, +and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet (WaveGlow in that work). +Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is +usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). +In this work, we use feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we use previously extracted attention alignments from an encoder-decoder +based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target +mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and +repeating in particularly hard cases, and can adjust voice speed smoothly. + +[Paper](https://arxiv.org/pdf/1905.09263v5.pdf): FastSpeech: Fast, Robust and Controllable Text to Speech. + +## [Model Architecture](#contents) + +The architecture for FastSpeech is a feed-forward structure based on self-attention in Transformer +and 1D convolution. This structure is called Feed-Forward Transformer (FFT). Feed-Forward Transformer stacks multiple FFT blocks for phoneme to mel-spectrogram +transformation, with N blocks on the phoneme side, and N blocks on the mel-spectrogram side, with +a length regulator in between to bridge the length gap between the phoneme and mel-spectrogram sequence. +Each FFT block consists of a self-attention and 1D convolutional network. +The self-attention network consists of a multi-head attention to extract the cross-position information. +Different from the 2-layer dense network in Transformer, FastSpeech uses a 2-layer 1D convolutional network with ReLU activation. 
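+
+As an illustration, a minimal sketch of this position-wise convolutional feed-forward could look as follows (a simplified example, not the exact code from `src/transformer/sublayers.py`); the default dimensions mirror `encoder_dim`, `encoder_conv1d_filter_size` and `fft_conv1d_kernel` from `default_config.yaml`, with `pad_mode='same'` standing in for the explicit `fft_conv1d_padding`:
+
+```python
+import mindspore.nn as nn
+import mindspore.ops as ops
+
+
+class PositionWiseConvFFN(nn.Cell):
+    """2-layer 1D convolutional feed-forward used inside each FFT block."""
+    def __init__(self, d_model=256, d_inner=1024, kernels=(9, 1), dropout=0.1):
+        super().__init__()
+        self.conv1 = nn.Conv1d(d_model, d_inner, kernel_size=kernels[0], pad_mode='same')
+        self.conv2 = nn.Conv1d(d_inner, d_model, kernel_size=kernels[1], pad_mode='same')
+        self.relu = nn.ReLU()
+        self.layer_norm = nn.LayerNorm((d_model,))
+        self.dropout = nn.Dropout(keep_prob=1.0 - dropout)  # MindSpore 1.x Dropout uses keep_prob
+        self.transpose = ops.Transpose()
+
+    def construct(self, x):
+        # x: (batch, time, d_model)
+        residual = x
+        out = self.transpose(x, (0, 2, 1))                # (batch, d_model, time) for Conv1d
+        out = self.relu(self.conv1(out))
+        out = self.transpose(self.conv2(out), (0, 2, 1))  # back to (batch, time, d_model)
+        out = self.dropout(out)
+        return self.layer_norm(out + residual)
+```
+
+In the complete FFT block, a multi-head self-attention sub-layer runs before this feed-forward, and both sub-layers are wrapped with residual connections and layer normalization.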
+
+The motivation is that adjacent hidden states are more closely related in the character/phoneme and mel-spectrogram sequences of speech tasks.
+
+## [Dataset](#contents)
+
+We use the LJSpeech-1.1 dataset and alignments previously extracted by the teacher model.
+
+Dataset description: 3.8 GB of .wav files with annotated text (English speech only).
+
+- [Download](https://keithito.com/LJ-Speech-Dataset/) LJSpeech and extract it into your `datasets` folder.
+- [Download](https://github.com/xcmyz/FastSpeech/blob/master/alignments.zip) alignments and unzip them into the extracted LJSpeech dataset folder.
+
+> The original LJSpeech-1.1 dataset is not split into train/test parts.
+> We manually split it into 13000/100 (train/test) by selecting 100 test indices, which are stored in preprocess.py.
+> The indices are fixed, so you can reproduce our results.
+> You can also select your own indices and put them into _INDICES_FOR_TEST in preprocess.py.
+
+The original dataset structure is as follows:
+
+```text
+.
+└── LJSpeech-1.1
+  ├─ alignments/
+  ├─ wavs/
+  └─ metadata.csv
+```
+
+Note: Before pre-processing the dataset, you need to prepare the environment and install the requirements.
+The preprocessing script uses ~3.5 GB of GPU memory, so you can restrict the visible GPU devices if necessary.
+
+From the project folder, run the `preprocess.py` script located in the `data` folder with the following command:
+
+```bash
+python -m data.preprocess --dataset_path [PATH_TO_DATASET_FOLDER]
+```
+
+- PATH_TO_DATASET_FOLDER - path to the dataset root.
+
+Processed data will also be saved into the PATH_TO_DATASET_FOLDER folder.
+
+After pre-processing the data, the dataset structure should be as follows:
+
+```text
+.
+└── LJSpeech-1.1
+  ├─ alignments/
+  ├─ mels/
+  ├─ metadata.csv
+  ├─ metadata.txt
+  ├─ train_indices.txt
+  ├─ validation.txt
+  └─ wavs/
+```
+
+## [Environment Requirements](#contents)
+
+- Hardware (GPU).
+- Prepare the hardware environment with a GPU processor.
+- Framework
+    - [MindSpore](https://www.mindspore.cn/install/en)
+- For more information, please check the resources below:
+    - [MindSpore Tutorials](https://www.mindspore.cn/tutorials/en/master/index.html)
+    - [MindSpore Python API](https://www.mindspore.cn/docs/api/en/master/index.html)
+
+Note: We use MindSpore 1.6.0 for GPU, so make sure that you install version 1.6.0 or newer.
+
+## [Quick Start](#contents)
+
+After installing MindSpore through the official website, you can follow the steps below for training and evaluation.
+In particular, before training, install the requirements with the command `pip install -r requirements.txt`.
+
+Then run the training scripts as shown below.
+
+```example
+# Run standalone training example
+bash scripts/run_standalone_train_gpu.sh [DEVICE_ID] [LOGS_CKPT_DIR] [DATASET_ROOT]
+
+# Run distributed training example
+bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [LOGS_CKPT_DIR] [DATASET_ROOT]
+```
+
+## [Script Description](#contents)
+
+### [Script and Sample Code](#contents)
+
+```contents
+.
+└─FastSpeech + ├─README.md + ├─requirements.txt + ├─data + │ └─preprocess.py # data preprocessing script + ├─scripts + │ ├─run_distribute_train_gpu.sh # launch distribute train on GPU + │ ├─run_eval_gpu.sh # launch evaluation on GPU + │ └─run_standalone_train_gpu.sh # launch standalone train on GPU + ├─src + │ ├─audio + │ │ ├─__init__.py + │ │ ├─stft.py # audio processing scripts + │ │ └─tools.py # audio processing tools + │ ├─cfg + │ │ ├─__init__.py + │ │ └─config.py # config parser + │ ├─deepspeech2 + │ │ ├─__init__.py + │ │ ├─dataset.py # audio parser script for DeepSpeech2 + │ │ └─model.py # model scripts + │ ├─import_ckpt + │ │ ├─__init__.py + │ │ ├─import_deepspeech2.py # importer for DeepSpeech2 from < 1.5 MS versions + │ │ └─import_waveglow.py # importer for WaveGlow from .pickle + │ ├─text + │ │ ├─__init__.py + │ │ ├─cleaners.py # text cleaners scripts + │ │ ├─numbers.py # numbers to text preprocessing scripts + │ │ └─symbols.py # symbols dictionary + │ ├─transformer + │ │ ├─__init__.py + │ │ ├─constants.py # constants for transformer + │ │ ├─layers.py # layers initialization + │ │ ├─models.py # model blocks + │ │ ├─modules.py # model modules + │ │ └─sublayers.py # model sublayers + │ ├─waveglow + │ │ ├─__init__.py + │ │ ├─layers.py # model layers + │ │ ├─model.py # model scripts + │ │ └─utils.py # utils tools + │ ├─__init__.py + │ ├─dataset.py # create dataset + │ ├─metrics.py # metrics scripts + │ ├─model.py # model scripts + │ ├─modules.py # model modules + │ └─utils.py # utilities used in other scripts + ├─default_config.yaml # default configs + ├─eval.py # evaluation script + ├─export.py # export to MINDIR script + └─train.py # training script +``` + +### [Script Parameters](#contents) + +```parameters +all parameters and descriptions, except --config_path, stored into default_config.yaml + +usage: train.py [--config_path CONFIG_PATH] + [--device_target DEVICE_TARGET] + [--device_id DEVICE_ID] + [--logs_dir LOGS_DIR] + [--dataset_path DATASET_PATH] + [--epochs EPOCHS] + [--lr_scale LR_SCALE] +``` + +### [Training Process](#contents) + +#### Standalone Training + +```bash +bash scripts/run_standalone_train_gpu.sh [DEVICE_ID] [LOGS_CKPT_DIR] [DATASET_PATH] +``` + +The above command will run in the background, you can view the result through the generated standalone_train.log file. +After training, you can get the training loss and time logs in chosen logs dir: + +```log +epoch: 200 step: 406, loss is 0.8701540231704712 +epoch time: 168215.485 ms, per step time: 413.072 ms +``` + +The model checkpoints will be saved in logs outputs directory. + +#### Distribute Training + +```bash +bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [LOGS_CKPT_DIR] [DATASET_PATH] +``` + +The above shell script will run distributed training in the background. 
+After training, you can get the training results:
+
+```log
+epoch: 200 step: 50, loss is 0.9151536226272583
+epoch: 200 step: 50, loss is 0.9770485162734985
+epoch: 200 step: 50, loss is 0.9304656982421875
+epoch: 200 step: 50, loss is 0.8000383377075195
+epoch: 200 step: 50, loss is 0.8380972146987915
+epoch: 200 step: 50, loss is 0.854132890701294
+epoch: 200 step: 50, loss is 0.8262668251991272
+epoch: 200 step: 50, loss is 0.8031083345413208
+epoch time: 25208.625 ms, per step time: 504.173 ms
+epoch time: 25207.587 ms, per step time: 504.152 ms
+epoch time: 25206.404 ms, per step time: 504.128 ms
+epoch time: 25210.164 ms, per step time: 504.203 ms
+epoch time: 25210.281 ms, per step time: 504.206 ms
+epoch time: 25210.364 ms, per step time: 504.207 ms
+epoch time: 25210.161 ms, per step time: 504.203 ms
+epoch time: 25059.312 ms, per step time: 501.186 ms
+```
+
+Note: These are just example logs; the actual values may vary.
+
+### [Evaluation Process](#contents)
+
+#### Checkpoints preparation
+
+Before starting the evaluation process, you need to import the WaveGlow vocoder checkpoint (used to generate audio from the FastSpeech output mel-spectrograms) and the DeepSpeech2 checkpoint (used to compute the metrics).
+
+- [Download](https://download.mindspore.cn/model_zoo/r1.3/deepspeech2_gpu_v130_librispeech_research_audio_bs20_avgwer11.34_avgcer3.79/) the DeepSpeech2 checkpoint (saved with MindSpore < 1.5, so it cannot be loaded directly by newer MindSpore versions).
+
+To import the checkpoints, follow the steps below:
+
+- Run `import_deepspeech2.py`. The converted checkpoint will be saved in the same directory as the original and named `DeepSpeech2.ckpt`.
+
+```bash
+# from the project root folder
+python -m src.import_ckpt.import_deepspeech2 --ds_ckpt_url [CKPT_URL]  # weights in .ckpt format
+```
+
+- To get WaveGlow, take the following steps. We convert the original [checkpoint](https://drive.google.com/file/d/1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF/view) to the .pickle format with numpy weights using the code below (the `glow.py` model definition must first be downloaded from the original WaveGlow [implementation](https://github.com/NVIDIA/waveglow)).
+
+```python
+# run this script in the same directory as glow.py
+import pickle
+
+import torch
+
+waveglow = torch.load(checkpoint_url)['model']  # checkpoint_url points to the original .pt object
+waveglow = waveglow.remove_weightnorm(waveglow)
+numpy_weights = {key: value.detach().numpy() for key, value in waveglow.named_parameters()}
+
+# save numpy_weights in .pickle format (any output path works, pass it to import_waveglow.py later)
+with open('waveglow_weights.pickle', 'wb') as file:
+    pickle.dump(numpy_weights, file)
+```
+
+Note: The original checkpoint is stored in the PyTorch format (.pth), so you need to install PyTorch before running the code above.
+
+- To import the .pickle WaveGlow checkpoint, run `import_waveglow.py`. The converted checkpoint will be saved in the same directory as the original and named `WaveGlow.ckpt`.
+
+```bash
+# from the project root folder
+python -m src.import_ckpt.import_waveglow --wg_ckpt_url [CKPT_URL]  # weights in .pickle format
+```
+
+#### Evaluation
+
+Before evaluation, make sure that you have the trained FastSpeech.ckpt, the converted WaveGlow.ckpt, and the converted DeepSpeech2.ckpt.
+To start evaluation, run the command below.
+
+```bash
+bash scripts/run_eval_gpu.sh [DEVICE_ID] [DATASET_PATH] [FS_CKPT_URL] [WG_CKPT_URL] [DS_CKPT_URL]
+```
+
+The above shell script will run in the background. You can view the results through the file "eval.log".
+ +```text +==========Evaluation results========== +Mean Frechet distance 201.42256 +Mean Kernel distance 0.02357 +Generated audios stored into results +``` + +### [Model Export](#contents) + +You can export the model to mindir format by running the following python script: + +```bash +python export.py --fs_ckpt_url [FS_CKPT_URL] +``` + +## [Model Description](#contents) + +### [Performance](#contents) + +#### Training Performance + +| Parameters | GPU (1p) | GPU (8p) | +| -------------------------- |----------------------------------------------------------- |---------------------------------------------------------------------- | +| Model | FastSpeech | FastSpeech | +| Hardware | 1 Nvidia Tesla V100-PCIE, CPU @ 3.40GHz | 8 Nvidia RTX 3090, Intel Xeon Gold 6226R CPU @ 2.90GHz | +| Upload Date | 14/03/2022 (day/month/year) | 14/03/2022 (day/month/year) | +| MindSpore Version | 1.6.0 | 1.6.0 | +| Dataset | LJSpeech-1.1 | LJSpeech-1.1 | +| Training Parameters | epochs=200, batch_size=32, warmup_steps=5000, lr_scale=1 | epochs=300, batch_size=32 (per device), warmup_steps=5000, lr_scale=2 | +| Optimizer | Adam (beta1=0.9, beta2=0.98, eps=1e-9) | Adam (beta1=0.9, beta2=0.98, eps=1e-9) | +| Loss Function | MSE, L1 | MSE, L1 | +| Speed | ~412 ms/step | ~504 ms/step | +| Total time | ~9.3 hours | ~2.1 hours | + +Note: lr scheduler was taken from [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) paper. + +#### Evaluation Performance + +| Parameters | Trained on GPU (1p) | Trained on GPU (8p) | +| ------------------- |-------------------------------------------------------- |----------------------------------------------------------- | +| Model | FastSpeech | FastSpeech | +| Resource | 1 Nvidia Tesla V100-PCIE, CPU @ 3.40GHz | 1 Nvidia Tesla V100-PCIE, CPU @ 3.40GHz | +| Upload Date | 14/03/2022 (day/month/year) | 14/03/2022 (day/month/year) | +| MindSpore Version | 1.6.0 | 1.6.0 | +| Dataset | LJSpeech-1.1 | LJSpeech-1.1 | +| Batch_size | 1 | 1 | +| Outputs | Mel-spectrogram, mel duration | Mel-spectrogram, mel duration | +| Metric | (classifier distances) Frechet 201.42256, Kernel 0.02357 | (classifier distances) Frechet 203.89236, Kernel 0.02386 | + +## [ModelZoo Homepage](#contents) + + Please check the official [homepage](https://gitee.com/mindspore/models). diff --git a/research/audio/FastSpeech/data/preprocess.py b/research/audio/FastSpeech/data/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..95c3d758c56f6c1b3d45b6d24871ca8a67e43012 --- /dev/null +++ b/research/audio/FastSpeech/data/preprocess.py @@ -0,0 +1,128 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Dataset preprocess script.""" +import os +from pathlib import Path + +import numpy as np + +from src.audio.tools import get_mel +from src.cfg.config import config as hp + +# Original dataset contains 13100 samples and not splited into parts. 
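+# (13100 = 13000 train + 100 test, as described in the Dataset section of the README.)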
+# We manually selected 100 test indices and fixed it to be able to reproduce results. +_INDICES_FOR_TEST = ( + 3667, 8887, 10353, 7657, 1498, 2758, 4913, 1697, 5653, 1911, + 12032, 8925, 11517, 5881, 6575, 120, 6232, 11680, 8433, 1728, + 12771, 11738, 6574, 12918, 9836, 7556, 2231, 7916, 5985, 3148, + 2596, 1709, 5841, 5383, 6248, 9831, 7667, 10944, 2833, 614, + 11990, 6894, 12645, 5422, 12015, 447, 7108, 2973, 9937, 11938, + 3626, 11406, 2853, 6379, 1621, 3981, 5486, 3902, 10925, 4249, + 6518, 3376, 1998, 10250, 10145, 7325, 2665, 61, 2709, 11683, + 8776, 10979, 8834, 4805, 4565, 2577, 9369, 4422, 8212, 5871, + 10721, 6046, 5129, 9610, 821, 4378, 693, 10500, 5027, 1663, + 6946, 2460, 6068, 4329, 11001, 10122, 9154, 6990, 8908, 2530, +) + + +def preprocess_ljspeech(root_dir): + """Preprocess LJSpeech dataset.""" + in_dir = root_dir + out_dir = os.path.join(in_dir, 'mels') + + if not os.path.exists(out_dir): + os.makedirs(out_dir, exist_ok=True) + + metadata = build_from_path(in_dir, out_dir) + write_metadata(metadata, in_dir) + train_test_split(in_dir) + + +def write_metadata(metadata, out_dir): + """Write clear metadata.""" + with Path(out_dir, 'metadata.txt').open('w', encoding='utf-8') as file: + for m in metadata: + file.write(m + '\n') + + +def build_from_path(in_dir, out_dir): + """Get text and preprocess .wavs to mels.""" + index = 1 + texts = [] + + with Path(in_dir, 'metadata.csv').open('r', encoding='utf-8') as file: + for line in file.readlines(): + if index % 100 == 0: + print("{:d} Done".format(index)) + + parts = line.strip().split('|') + wav_path = os.path.join(in_dir, 'wavs', '%s.wav' % parts[0]) + text = parts[2] + texts.append(_process_utterance(out_dir, index, wav_path, text)) + + index = index + 1 + + return texts + + +def _process_utterance(out_dir, index, wav_path, text): + """Preprocess .wav to mel and save.""" + # Compute a mel-scale spectrogram from the wav: + mel_spectrogram = get_mel(wav_path) + + # Write the spectrograms to disk: + mel_filename = 'ljspeech-mel-%05d.npy' % index + np.save( + os.path.join(out_dir, mel_filename), + mel_spectrogram.T, + allow_pickle=False + ) + + return text + + +def train_test_split(folder_path): + """Prepare data for training and validation format.""" + test_indices = np.array(_INDICES_FOR_TEST) + + with Path(folder_path, 'metadata.csv').open('r') as file: + metadata = file.readlines() + dataset_size = len(metadata) + + test_metadata = [] + all_indices = np.arange(dataset_size) + train_indices = np.delete(all_indices, test_indices) + + with Path(folder_path, 'train_indices.txt').open('w') as file: + for i in train_indices: + file.write(f'{i}\n') + + for i, line in enumerate(metadata): + if i in test_indices: + wav_name, _, text = line.strip().split('|') + test_data = f'{wav_name}|{text}\n' + test_metadata.append(test_data) + + with Path(folder_path, 'validation.txt').open('w') as file: + for line in test_metadata: + file.write(line) + + +def main(): + preprocess_ljspeech(hp.dataset_path) + + +if __name__ == "__main__": + main() diff --git a/research/audio/FastSpeech/default_config.yaml b/research/audio/FastSpeech/default_config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..c1ade25217a7e1d706556b08c25f3e188c30e91e --- /dev/null +++ b/research/audio/FastSpeech/default_config.yaml @@ -0,0 +1,144 @@ +# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing) + +# Mel +num_mels: 80 +text_cleaners: ['english_cleaners'] + +# FastSpeech +vocab_size: 300 
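+# (vocab_size above is expected to cover all symbol ids produced by src/text/symbols.py)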
+max_seq_len: 3000 +encoder_dim: 256 +encoder_n_layer: 4 +encoder_head: 2 +encoder_conv1d_filter_size: 1024 +decoder_dim: 256 +decoder_n_layer: 4 +decoder_head: 2 +decoder_conv1d_filter_size: 1024 +fft_conv1d_kernel: [9, 1] +fft_conv1d_padding: [4, 0] +duration_predictor_filter_size: 256 +duration_predictor_kernel_size: 3 +dropout: 0.1 + +# Train +batch_size: 32 # per one device +epochs: 200 +n_warm_up_step: 5000 +lr_scale: 1 +mel_max_length: 900 +character_max_length: 200 +keep_checkpoint_max: 10 + +# Eval +mel_val_len: 3500 + +# Other +alpha: 1 +device_target: 'GPU' +device_id: 0 +device_start: 0 +logs_dir: 'logs' +output_dir: 'results' +dataset_path: '/path/to/LJSpeech-1.1' +fs_ckpt_url: '/path/to/fastspeech/ckpt' +wg_ckpt_url: '/path/to/waveglow/ckpt' +ds_ckpt_url: '/path/to/deepspeech/ckpt' + +# WaveGlow +wg_n_mel_channels: 80 +wg_n_flows: 12 +wg_n_group: 8 +wg_n_early_every: 4 +wg_n_early_size: 2 +wg_n_layers: 8 +wg_n_channels: 256 +wg_kernel_size: 3 +wg_wav_value: 32768 +wg_sampling_rate: 22050 + +# DeepSpeech2 +ds_sampling_rate: 16000 +ds_window_size: 0.02 +ds_window_stride: 0.01 +ds_window: 'hanning' +ds_rnn_type: 'LSTM' +ds_hidden_size: 1024 +ds_hidden_layers: 5 +ds_lookahead_context: 20 +labels: "'ABCDEFGHIJKLMNOPQRSTUVWXYZ _" + +# Audio +au_max_wav_value: 32768 +au_sampling_rate: 22050 +au_filter_length: 1024 +au_hop_length: 256 +au_win_length: 1024 +au_n_mel_channels: 80 +au_mel_fmin: 0 +au_mel_fmax: 8000 + +--- +# Config description for each option +num_mels: "Number of channels of mel-spectrogram." +text_cleaners: "Chosen language pipeline." +vocab_size: "Vocabulary size" +max_seq_len: "Max sequence length." +encoder_dim: "Encoder dimension." +encoder_n_layer: "Number of encoder layers." +encoder_head: "Number of encoders head." +encoder_conv1d_filter_size: "Conv out filters of encoder." +decoder_dim: "Decoder dimension." +decoder_n_layer: "Number of decoder layers." +decoder_head: "Number of decoder head." +decoder_conv1d_filter_size: "Conv out filters of decoder." +fft_conv1d_kernel: "Conv kernel size of FFT block." +fft_conv1d_padding: "Conv padding of FFT block." +duration_predictor_filter_size: "Conv out filters of duration predictor." +duration_predictor_kernel_size: "Conv kernel size of duration predictor." +dropout: "Dropout ratio." +batch_size: "Batch size for training." +epochs: "Num of training epochs." +n_warm_up_step: "Num of warmup steps." +lr_scale: "Learning rate multiplier." +mel_max_length: "Pad all samples of mels to max len during training." +character_max_length: "Pad all samples of character sequences to max len during training." +keep_checkpoint_max: "Save last N checkpoints during train." +mel_val_len: "Max mel length at validation." +alpha: "Speech speed regulator." +device_target: "Target device platform." +device_id: "Device id of the target platform." +device_start: "Main device for distribute training." +logs_dir: "Output logs dir." +output_dir: "Output dir for synthesized audio." +dataset_path: "Path to dataset folder." +fs_ckpt_url: "Path to FastSpeech checkpoint." +wg_ckpt_url: "Path to WaveGlow checkpoint." +ds_ckpt_url: "Path to DeepSpeech2 checkpoint." +wg_n_mel_channels: "WaveGlow num of mel-spectrogram channels." +wg_n_flows: "WaveGlow num cells." +wg_n_group: "WaveGlow num layers in cell." +wg_n_early_every: "WaveGlow add noise every." +wg_n_early_size: "WaveGlow param." +wg_n_layers: "WaveGlow num layers." +wg_n_channels: "WaveGlow num channels." +wg_kernel_size: "WaveGlow kernel size." +wg_wav_value: "WaveGlow audio wav value." 
+wg_sampling_rate: "WaveGlow audio sampling rate." +ds_sampling_rate: "DeepSpeech2 audio param." +ds_window_size: "DeepSpeech2 window size." +ds_window_stride: "DeepSpeech2 window stride." +ds_window: "DeepSpeech2 window type." +ds_rnn_type: "DeepSpeech2 rnn type." +ds_hidden_size: "DeepSpeech2 size of hidden layer." +ds_hidden_layers: "DeepSpeech2 num hidden layers." +ds_lookahead_context: "DeepSpeech2 param." +labels: "Symbols for the DeepSpeech2 model." +au_max_wav_value: "DeepSpeech2 audio max wav value." +au_sampling_rate: "DeepSpeech2 audio sampling rate." +au_filter_length: "DeepSpeech2 audio filter length." +au_hop_length: "DeepSpeech2 audio hop length." +au_win_length: "DeepSpeech2 audio window length." +au_n_mel_channels: "DeepSpeech2 audio num mel channels." +au_mel_fmin: "DeepSpeech2 audio mel fmin." +au_mel_fmax: "DeepSpeech2 audio mel fmax." \ No newline at end of file diff --git a/research/audio/FastSpeech/eval.py b/research/audio/FastSpeech/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..1613f08363104b381a125d86216ac2dedae52cba --- /dev/null +++ b/research/audio/FastSpeech/eval.py @@ -0,0 +1,206 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Evaluation script.""" +import os +from pathlib import Path + +import numpy as np +from mindspore import Tensor +from mindspore import context +from mindspore import dtype as mstype +from mindspore import load_checkpoint +from mindspore.common import set_seed +from scipy.io.wavfile import write + +from src.cfg.config import config as hp +from src.dataset import get_val_data +from src.deepspeech2.dataset import LoadAudioAndTranscript +from src.deepspeech2.model import DeepSpeechModel +from src.metrics import frechet_classifier_distance_from_activations +from src.metrics import kernel_classifier_distance_and_std_from_activations +from src.model import FastSpeech +from src.model import FastSpeechEval +from src.waveglow.model import WaveGlow + +set_seed(1) + + +def save_audio(audio, audio_length, save_root_dir, name, audio_cfg): + """Process raw audio and save as .wav audio file.""" + audio_length = int(audio_length.asnumpy()) + audio = audio[:, :audio_length] * audio_cfg['wav_value'] + audio = (audio.asnumpy().squeeze()).astype('int16') + + audio_path = os.path.join(save_root_dir, name + '_synth.wav') + write(audio_path, audio_cfg['sampling_rate'], audio) + + return audio_path + + +def get_waveglow(ckpt_url): + """ + Init WaveGlow vocoder model with weights. + Used to generate realistic audio from mel-spectrogram. 
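+    Returns the model together with an audio config dict that is later consumed by save_audio().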
+ """ + wn_config = { + 'n_layers': hp.wg_n_layers, + 'n_channels': hp.wg_n_channels, + 'kernel_size': hp.wg_kernel_size + } + + audio_config = { + 'wav_value': hp.wg_wav_value, + 'sampling_rate': hp.wg_sampling_rate + } + + model = WaveGlow( + n_mel_channels=hp.wg_n_mel_channels, + n_flows=hp.wg_n_flows, + n_group=hp.wg_n_group, + n_early_every=hp.wg_n_early_every, + n_early_size=hp.wg_n_early_size, + wn_config=wn_config + ) + + load_checkpoint(ckpt_url, model) + model.set_train(False) + + return model, audio_config + + +def get_deepspeech(ckpt_url): + """ + Init DeepSpeech2 model with weights. + Used to get activations from lstm layers to compute metrics. + """ + spect_config = { + 'sampling_rate': hp.ds_sampling_rate, + 'window_size': hp.ds_window_size, + 'window_stride': hp.ds_window_stride, + 'window': hp.ds_window + } + + model = DeepSpeechModel( + batch_size=1, + rnn_hidden_size=hp.ds_hidden_size, + nb_layers=hp.ds_hidden_layers, + labels=hp.labels, + rnn_type=hp.ds_rnn_type, + audio_conf=spect_config, + bidirectional=True + ) + + load_checkpoint(ckpt_url, model) + model.set_train(False) + + return model, spect_config + + +def get_fastspeech(ckpt_url): + """ + Init FastSpeech model with weights. + Used to generate mel-spectrogram from sequence (text). + """ + model = FastSpeech() + + load_checkpoint(ckpt_url, model) + model.set_train(False) + + return model + + +def activation_from_audio(loader, model, path): + """ + Compute activations of audio to get metric. + + Args: + loader (class): Audio loader. + model (nn.Cell): DeepSpeech2 model. + path (str): Path to the audio. + + Returns: + activation (np.array): Activations from last lstm layer. + """ + metric_mel = loader.parse_audio(audio_path=path) + metric_mel_len = Tensor([metric_mel.shape[1]], mstype.float32) + metric_mel_padded = np.pad(metric_mel, (0, hp.mel_val_len - metric_mel.shape[1]))[:metric_mel.shape[0], :] + metric_mel_padded = Tensor(np.expand_dims(np.expand_dims(metric_mel_padded, 0), 0), mstype.float32) + + _, output_length, activation = model(metric_mel_padded, metric_mel_len) + output_length = int(output_length.asnumpy()) + + activation = activation.asnumpy().transpose((1, 0, 2)).squeeze() + clear_activation = activation[:output_length, :] + + return clear_activation + + +def main(args): + fastspeech = get_fastspeech(args.fs_ckpt_url) + waveglow, audio_config = get_waveglow(args.wg_ckpt_url) + deepspeech, spect_config = get_deepspeech(args.ds_ckpt_url) + + audio_loader = LoadAudioAndTranscript(spect_config) + + model = FastSpeechEval( + mel_generator=fastspeech, + vocoder=waveglow, + config=args + ) + + data_list = get_val_data(hp.dataset_path) + + if not os.path.exists(hp.output_dir): + os.makedirs(hp.output_dir, exist_ok=True) + + frechet, kernel = [], [] + + for sequence, src_pos, target_audio_path in data_list: + raw_audio, audio_len = model.get_audio(sequence, src_pos) + + audio_path = save_audio( + audio=raw_audio, + audio_length=audio_len, + save_root_dir=args.output_dir, + audio_cfg=audio_config, + name=Path(target_audio_path).stem + ) + + activation = activation_from_audio(audio_loader, deepspeech, audio_path) + activation_target = activation_from_audio(audio_loader, deepspeech, target_audio_path) + + frechet_distance = frechet_classifier_distance_from_activations( + activations1=activation, + activations2=activation_target, + ) + + kernel_distance, _ = kernel_classifier_distance_and_std_from_activations( + activations1=activation, + activations2=activation_target, + ) + + frechet.append(frechet_distance) + 
kernel.append(kernel_distance) + + print('=' * 10 + 'Evaluation results' + '=' * 10) + print(f'Mean Frechet distance {round(float(np.mean(np.array(frechet))), 5)}') + print(f'Mean Kernel distance {round(float(np.mean(np.array(kernel))), 5)}') + print(f'Generated audios stored into {args.output_dir}') + + +if __name__ == "__main__": + context.set_context(mode=context.GRAPH_MODE, device_target=hp.device_target) + context.set_context(device_id=hp.device_id) + main(hp) diff --git a/research/audio/FastSpeech/export.py b/research/audio/FastSpeech/export.py new file mode 100644 index 0000000000000000000000000000000000000000..f3fb0b5dc5851471feee0120d82f7ba838a9a052 --- /dev/null +++ b/research/audio/FastSpeech/export.py @@ -0,0 +1,56 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Run export""" +from pathlib import Path + +import numpy as np +from mindspore import Tensor +from mindspore import context +from mindspore import dtype as mstype +from mindspore import load_checkpoint +from mindspore.train.serialization import export + +from src.cfg.config import config as default_config +from src.model import FastSpeech + + +def run_export(config): + """ + Export model to MINDIR. 
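+    The exported file is written next to the checkpoint, reusing the checkpoint file stem as its name.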
+ """ + model = FastSpeech() + + load_checkpoint(config.fs_ckpt_url, model) + model.set_train(False) + + input_1 = Tensor(np.ones([1, config.character_max_length]), dtype=mstype.float32) + input_2 = Tensor(np.ones([1, config.character_max_length]), dtype=mstype.float32) + name = Path(config.fs_ckpt_url).stem + path = Path(config.fs_ckpt_url).resolve().parent + save_path = str(Path(path, name)) + + export(model, input_1, input_2, file_name=save_path, file_format='MINDIR') + print('Model exported successfully!') + print(f'Path to exported model {save_path}.mindir') + + +if __name__ == "__main__": + context.set_context( + mode=context.GRAPH_MODE, + device_target=default_config.device_target, + device_id=default_config.device_id, + ) + + run_export(default_config) diff --git a/research/audio/FastSpeech/requirements.txt b/research/audio/FastSpeech/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..abb72cbec91fd6c166d067a844aa1cbc77f9984e --- /dev/null +++ b/research/audio/FastSpeech/requirements.txt @@ -0,0 +1,6 @@ +PyYAML +scipy>=1.5.3 +inflect>=5.4.0 +Unidecode>=1.3.3 +librosa>=0.9.1 +SoundFile>=0.10.3.post1 \ No newline at end of file diff --git a/research/audio/FastSpeech/scripts/run_distribute_train_gpu.sh b/research/audio/FastSpeech/scripts/run_distribute_train_gpu.sh new file mode 100644 index 0000000000000000000000000000000000000000..53a8804c2c83a45e741117314b05e928523a6a7f --- /dev/null +++ b/research/audio/FastSpeech/scripts/run_distribute_train_gpu.sh @@ -0,0 +1,53 @@ +#!/bin/bash +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +if [[ $# -ne 3 ]]; then + echo "Usage: bash ./scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [LOGS_CKPT_DIR] [DATASET_ROOT]" +exit 1; +fi + +export RANK_SIZE=$1 + +get_real_path(){ + if [ "${1:0:1}" == "/" ]; then + echo "$1" + else + realpath -m "$PWD/$1" + fi +} + +CONFIG_FILE_BASE="./default_config.yaml" +CONFIG_FILE=$(get_real_path "$CONFIG_FILE_BASE") +DATASET_ROOT=$(get_real_path "$3") +LOGS_ROOT=$(get_real_path "$2") + +if [ ! 
-d "$LOGS_ROOT" ]; then + mkdir "$LOGS_ROOT" + mkdir "$LOGS_ROOT/training_configs" +fi + +cp ./*.py "$LOGS_ROOT"/training_configs +cp ./*.yaml "$LOGS_ROOT"/training_configs +cp -r ./src "$LOGS_ROOT"/training_configs + +mpirun -n $1 --allow-run-as-root \ + python train.py \ + --device_target="GPU" \ + --logs_dir="$LOGS_ROOT" \ + --dataset_path="$DATASET_ROOT" \ + --config_path="$CONFIG_FILE" \ + --epochs=300 \ + --lr_scale=2 \ + > "$LOGS_ROOT"/distribute_train.log 2>&1 & diff --git a/research/audio/FastSpeech/scripts/run_eval_gpu.sh b/research/audio/FastSpeech/scripts/run_eval_gpu.sh new file mode 100644 index 0000000000000000000000000000000000000000..01ceb3872234765780a580ef532e2e5bc7c41619 --- /dev/null +++ b/research/audio/FastSpeech/scripts/run_eval_gpu.sh @@ -0,0 +1,49 @@ +#!/bin/bash +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +if [[ $# -ne 5 ]]; then + echo "Usage: bash ./scripts/run_eval_gpu.sh [DEVICE_ID] [DATASET_PATH] [FS_CKPT_URL] [WG_CKPT_URL] [DS_CKPT_URL]" +exit 1; +fi + +export CUDA_VISIBLE_DEVICES=$1 + +get_real_path(){ + if [ "${1:0:1}" == "/" ]; then + echo "$1" + else + realpath -m "$PWD/$1" + fi +} + +CONFIG_FILE_BASE="./default_config.yaml" +OUTPUT_DIR_BASE="./results" +OUTPUT_ROOT=$(get_real_path "$OUTPUT_DIR_BASE") +CONFIG_FILE=$(get_real_path "$CONFIG_FILE_BASE") +DATASET_ROOT=$(get_real_path "$2") +FS_CKPT=$(get_real_path "$3") +WG_CKPT=$(get_real_path "$4") +DS_CKPT=$(get_real_path "$5") + +python eval.py \ + --device_target="GPU" \ + --device_id=0 \ + --output_dir="$OUTPUT_ROOT" \ + --dataset_path="$DATASET_ROOT" \ + --config_path="$CONFIG_FILE" \ + --fs_ckpt_url="$FS_CKPT" \ + --wg_ckpt_url="$WG_CKPT" \ + --ds_ckpt_url="$DS_CKPT" \ + > eval.log 2>&1 & diff --git a/research/audio/FastSpeech/scripts/run_standalone_train_gpu.sh b/research/audio/FastSpeech/scripts/run_standalone_train_gpu.sh new file mode 100644 index 0000000000000000000000000000000000000000..fd1a5bc41c8d97c4cd992a557d15a76cfecf9299 --- /dev/null +++ b/research/audio/FastSpeech/scripts/run_standalone_train_gpu.sh @@ -0,0 +1,51 @@ +#!/bin/bash +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +if [[ $# -ne 3 ]]; then + echo "Usage: bash ./scripts/run_standalone_train_gpu.sh [DEVICE_ID] [LOGS_CKPT_DIR] [DATASET_ROOT]" +exit 1 +fi + +export CUDA_VISIBLE_DEVICES=$1 + +get_real_path(){ + if [ "${1:0:1}" == "/" ]; then + echo "$1" + else + realpath -m "$PWD/$1" + fi +} + +CONFIG_FILE_BASE="./default_config.yaml" +CONFIG_FILE=$(get_real_path "$CONFIG_FILE_BASE") +DATASET_ROOT=$(get_real_path "$3") +LOGS_ROOT=$(get_real_path "$2") + +if [ ! -d "$LOGS_ROOT" ]; then + mkdir "$LOGS_ROOT" + mkdir "$LOGS_ROOT/training_configs" +fi + +cp ./*.py "$LOGS_ROOT"/training_configs +cp ./*.yaml "$LOGS_ROOT"/training_configs +cp -r ./src "$LOGS_ROOT"/training_configs + +python train.py \ + --device_target="GPU" \ + --device_id=0 \ + --logs_dir="$LOGS_ROOT" \ + --dataset_path="$DATASET_ROOT" \ + --config_path="$CONFIG_FILE" \ + > "$LOGS_ROOT"/standalone_train.log 2>&1 & diff --git a/research/audio/FastSpeech/src/__init__.py b/research/audio/FastSpeech/src/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/audio/__init__.py b/research/audio/FastSpeech/src/audio/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/audio/stft.py b/research/audio/FastSpeech/src/audio/stft.py new file mode 100644 index 0000000000000000000000000000000000000000..a4242e4457b4f907b27ddb6f2c05efc0fff3dca4 --- /dev/null +++ b/research/audio/FastSpeech/src/audio/stft.py @@ -0,0 +1,148 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Tacotron module.""" +import numpy as np +from librosa.filters import mel as librosa_mel_fn +from librosa.util import pad_center +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore.ops import Conv2D +from scipy.signal import get_window + + +class STFT: + """Mel-spectrogram transformer.""" + def __init__( + self, + filter_length=800, + hop_length=200, + win_length=800, + window='hann' + ): + super().__init__() + self.filter_length = filter_length + self.hop_length = hop_length + self.win_length = win_length + self.window = window + self.forward_transform = None + + scale = self.filter_length / self.hop_length + fourier_basis = np.fft.fft(np.eye(self.filter_length)) + + cutoff = int((self.filter_length / 2 + 1)) + fourier_basis = np.vstack( + [ + np.real(fourier_basis[:cutoff, :]), + np.imag(fourier_basis[:cutoff, :]) + ] + ) + + forward_basis = fourier_basis[:, None, :].astype(np.float32) + inverse_basis = np.linalg.pinv(scale * fourier_basis).T[:, None, :].astype(np.float32) + + if window is not None: + assert filter_length >= win_length + # get window and zero center pad it to filter_length + fft_window = get_window(window, win_length, fftbins=True) + fft_window = pad_center(fft_window, size=filter_length) + fft_window = np.array(fft_window, np.float32) + + # window the bases + forward_basis *= fft_window + inverse_basis *= fft_window + + self.forward_basis = forward_basis.astype(np.float32) + self.inverse_basis = inverse_basis.astype(np.float32) + + self.conv = Conv2D( + out_channel=self.forward_basis.shape[0], + kernel_size=self.forward_basis.shape[1:], + stride=self.hop_length, + pad_mode='pad', + pad=0 + ) + + def transform(self, input_data): + """Transforms input wav to raw mel-spect data.""" + num_batches = input_data.shape[0] + num_samples = input_data.shape[1] + + input_data = input_data.reshape(num_batches, 1, num_samples) + input_data = np.pad(np.squeeze(input_data), int(self.filter_length / 2), mode='reflect') + + input_data = np.expand_dims(np.expand_dims(np.expand_dims(input_data, 0), 0), 0) + + forward_transform = self.conv( + Tensor(input_data, mstype.float32), + Tensor(np.expand_dims(self.forward_basis, 1), mstype.float32), + ) + + forward_transform = forward_transform.asnumpy().squeeze(2) + + cutoff = int((self.filter_length / 2) + 1) + real_part = forward_transform[:, :cutoff, :] + imag_part = forward_transform[:, cutoff:, :] + + magnitude = np.sqrt(real_part ** 2 + imag_part ** 2) + phase = np.arctan2(imag_part, real_part) + + return magnitude, phase + + +class TacotronSTFT: + """Tacotron.""" + def __init__( + self, + filter_length=1024, + hop_length=256, + win_length=1024, + n_mel_channels=80, + sampling_rate=22050, + mel_fmin=0.0, + mel_fmax=8000.0 + ): + super().__init__() + self.n_mel_channels = n_mel_channels + self.sampling_rate = sampling_rate + self.stft_fn = STFT(filter_length, hop_length, win_length) + + self.mel_basis = librosa_mel_fn( + sr=sampling_rate, + n_fft=filter_length, + n_mels=n_mel_channels, + fmin=mel_fmin, + fmax=mel_fmax + ) + + def spectral_normalize(self, x): + """Normalize magnitudes.""" + output = np.log(np.clip(x, a_min=1e-5, a_max=np.max(x))) + return output + + def mel_spectrogram(self, y): + """ + Computes mel-spectrogram from wav. + + Args: + y (np.array): Raw mel-spectrogram with shape (B, T) in range [-1, 1]. + + Returns: + mel_output (np.array): Mel-spectrogram with shape (B, n_mel_channels, T). 
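+                Values are log-compressed by spectral_normalize (natural log, clipped below at 1e-5).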
+ """ + magnitudes, _ = self.stft_fn.transform(y) + mel_output = np.matmul(self.mel_basis, magnitudes) + mel_output = self.spectral_normalize(mel_output) + + return mel_output diff --git a/research/audio/FastSpeech/src/audio/tools.py b/research/audio/FastSpeech/src/audio/tools.py new file mode 100644 index 0000000000000000000000000000000000000000..59cdc06cfc37cf38149eaa8ef0a783847f2899d2 --- /dev/null +++ b/research/audio/FastSpeech/src/audio/tools.py @@ -0,0 +1,47 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Preprocessing tools.""" +import numpy as np +from scipy.io.wavfile import read + +from src.audio import stft +from src.cfg.config import config + +_stft = stft.TacotronSTFT( + config.au_filter_length, + config.au_hop_length, + config.au_win_length, + config.au_n_mel_channels, + config.au_sampling_rate, + config.au_mel_fmin, + config.au_mel_fmax, +) + + +def load_wav_to_array(full_path): + """Load wav file as numpy array.""" + sampling_rate, data = read(full_path) + return data.astype(np.float32), sampling_rate + + +def get_mel(filename): + """Process loaded audio to mel-spectrogram.""" + audio, _ = load_wav_to_array(filename) + audio_norm = audio / config.au_max_wav_value + audio_norm = np.expand_dims(audio_norm, 0) + melspec = _stft.mel_spectrogram(audio_norm) + melspec = np.squeeze(melspec, 0) + + return melspec diff --git a/research/audio/FastSpeech/src/cfg/__init__.py b/research/audio/FastSpeech/src/cfg/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/cfg/config.py b/research/audio/FastSpeech/src/cfg/config.py new file mode 100644 index 0000000000000000000000000000000000000000..e2cb9ec894cdbb2b9febc07b472954ba2096b817 --- /dev/null +++ b/research/audio/FastSpeech/src/cfg/config.py @@ -0,0 +1,129 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Parse arguments""" +import argparse +import ast +from pprint import pformat + +import yaml + + +class Config: + """ + Configuration namespace, convert dictionary to members. 
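+    Nested dictionaries are converted recursively; lists and tuples of dictionaries become lists of Config objects.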
+ """ + def __init__(self, cfg_dict): + for k, v in cfg_dict.items(): + if isinstance(v, (list, tuple)): + setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v]) + else: + setattr(self, k, Config(v) if isinstance(v, dict) else v) + + def __str__(self): + return pformat(self.__dict__) + + def __repr__(self): + return self.__str__() + + +def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="default_config.yaml"): + """ + Parse command line arguments to the configuration according to the default yaml. + + Args: + parser (argparse.ArgumentParser): Parent parser. + cfg (dict): Base configuration. + helper (dict): Helper description. + choices (dict): Choices. + cfg_path (str): Path to the default yaml config. + """ + helper = {} if helper is None else helper + choices = {} if choices is None else choices + for item in cfg: + if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict): + help_description = helper[item] if item in helper else f"Please reference to {cfg_path}" + choice = choices[item] if item in choices else None + if isinstance(cfg[item], bool): + parser.add_argument("--" + item, type=ast.literal_eval, default=cfg[item], choices=choice, + help=help_description) + else: + parser.add_argument("--" + item, type=type(cfg[item]), default=cfg[item], choices=choice, + help=help_description) + args = parser.parse_args() + return args + + +def parse_yaml(yaml_path): + """ + Parse the yaml config file. + + Args: + yaml_path (str): Path to the yaml config. + """ + with open(yaml_path, 'r') as fin: + try: + cfgs_raw = yaml.load_all(fin.read(), Loader=yaml.FullLoader) + cfgs = [] + for cf in cfgs_raw: + cfgs.append(cf) + + if len(cfgs) == 1: + cfg_helper = {} + cfg = cfgs[0] + cfg_choices = {} + elif len(cfgs) == 2: + cfg, cfg_helper = cfgs + cfg_choices = {} + elif len(cfgs) == 3: + cfg, cfg_helper, cfg_choices = cfgs + else: + raise ValueError("At most 3 docs (config, description for help, choices) are supported in config yaml") + except ValueError("Failed to parse yaml") as err: + raise err + + return cfg, cfg_helper, cfg_choices + + +def merge(args, cfg): + """ + Merge the base config from yaml file and command line arguments. + + Args: + args (argparse.Namespace): Command line arguments. + cfg (dict): Base configuration. + """ + args_var = vars(args) + for item in args_var: + cfg[item] = args_var[item] + + return cfg + + +def get_config(): + """ + Get Config according to the yaml file and cli arguments. + """ + parser = argparse.ArgumentParser(description="default name", add_help=False) + parser.add_argument("--config_path", type=str, default="default_config.yaml", help="Config file path.") + + path_args, _ = parser.parse_known_args() + default, helper, choices = parse_yaml(path_args.config_path) + args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=path_args.config_path) + final_config = merge(args, default) + + return Config(final_config) + + +config = get_config() diff --git a/research/audio/FastSpeech/src/dataset.py b/research/audio/FastSpeech/src/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..e72e66213e867feab300003b59edf5352fb51bbb --- /dev/null +++ b/research/audio/FastSpeech/src/dataset.py @@ -0,0 +1,165 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Data preprocessing.""" +import os +from pathlib import Path + +import numpy as np +from mindspore import Tensor +from mindspore import dtype as mstype + +from src.cfg.config import config as hp +from src.text import text_to_sequence +from src.utils import pad_1d_tensor +from src.utils import pad_2d_tensor +from src.utils import process_text + + +def get_data_to_buffer(): + """ + Put data to memory, for faster training. + """ + with Path(hp.dataset_path, 'train_indices.txt').open('r') as file: + train_part = np.array([i[:-1] for i in file.readlines()], np.int32) + train_part.sort() + + buffer = list() + raw_text = process_text(os.path.join(hp.dataset_path, "metadata.txt")) + + for i in train_part: + mel_gt_name = os.path.join(hp.dataset_path, 'mels', "ljspeech-mel-%05d.npy" % (i+1)) + mel_gt_target = np.load(mel_gt_name) + + duration = np.load(os.path.join(hp.dataset_path, 'alignments', str(i)+".npy")) + + character = raw_text[i][: len(raw_text[i])-1] + character = np.array(text_to_sequence(character, hp.text_cleaners)) + + buffer.append( + { + "text": character, + "duration": duration, + "mel_target": mel_gt_target + } + ) + + return buffer + + +def reprocess_tensor(data_dict): + """ + Prepare data for training. + Apply padding for all samples, in reason of static graph. + + Args: + data_dict (dict): Dictionary of np.array type data. + + Returns: + out (dict): Dictionary with prepared data for training, np.array type. + """ + text = data_dict["text"] + mel_target = data_dict["mel_target"] + duration = data_dict["duration"] + + max_len = hp.character_max_length + length_text = text.shape[0] + src_pos = np.pad([i+1 for i in range(int(length_text))], (0, max_len-int(length_text)), 'constant') + + max_mel_len = hp.mel_max_length + length_mel = mel_target.shape[0] + mel_pos = np.pad([i+1 for i in range(int(length_mel))], (0, max_mel_len-int(length_mel)), 'constant') + + text = pad_1d_tensor(text) + duration = pad_1d_tensor(duration) + mel_target = pad_2d_tensor(mel_target) + + out = { + "text": text, # shape (hp.character_max_length) + "src_pos": src_pos, # shape (hp.character_max_length) + "mel_pos": mel_pos, # shape (hp.mel_max_length) + "duration": duration, # shape (hp.character_max_length) + "mel_target": mel_target, # shape (hp.mel_max_length, hp.num_mels) + "mel_max_len": max_mel_len, + } + + return out + + +def preprocess_data(buffer): + """ + Prepare data for training. + + Args: + buffer (list): Raw data inputs. + + Returns: + preprocessed_data (list): Padded and converted data, ready for training. + """ + preprocessed_data = [] + for squeeze_data in buffer: + db = reprocess_tensor(squeeze_data) + + preprocessed_data.append( + ( + db["text"].astype(np.float32), + db["src_pos"].astype(np.float32), + db["mel_pos"].astype(np.float32), + db["duration"].astype(np.int32), + db["mel_target"].astype(np.float32), + db["mel_max_len"], + ) + ) + + return preprocessed_data + + +class BufferDataset: + """ + Dataloader. 
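+    All samples are padded and converted once in preprocess_data, so __getitem__ is a simple list lookup.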
+ """ + def __init__(self, buffer): + self.length_dataset = len(buffer) + self.preprocessed_data = preprocess_data(buffer) + + def __len__(self): + return self.length_dataset + + def __getitem__(self, idx): + return self.preprocessed_data[idx] + + +def get_val_data(data_url): + """Get validation data.""" + data_list = list() + with Path(data_url, 'validation.txt').open('r') as file: + data_paths = file.readlines() + + root_wav_path = os.path.join(data_url, 'wavs') + wav_paths = [root_wav_path + '/' + raw_path.split('|')[0] + '.wav' for raw_path in data_paths] + val_txts = [raw_path.split('|')[1][:-1] for raw_path in data_paths] + + for orig_text, wav_path in zip(val_txts, wav_paths): + sequence = text_to_sequence(orig_text, hp.text_cleaners) + sequence = np.expand_dims(sequence, 0) + + src_pos = np.array([i + 1 for i in range(sequence.shape[1])]) + src_pos = np.expand_dims(src_pos, 0) + + sequence = Tensor([np.pad(sequence[0], (0, hp.character_max_length - sequence.shape[1]))], mstype.float32) + src_pos = Tensor([np.pad(src_pos[0], (0, hp.character_max_length - src_pos.shape[1]))], mstype.float32) + + data_list.append([sequence, src_pos, wav_path]) + + return data_list diff --git a/research/audio/FastSpeech/src/deepspeech2/__init__.py b/research/audio/FastSpeech/src/deepspeech2/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/deepspeech2/dataset.py b/research/audio/FastSpeech/src/deepspeech2/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..04fd56169c82951b3714ea465ebf19fcca6f5893 --- /dev/null +++ b/research/audio/FastSpeech/src/deepspeech2/dataset.py @@ -0,0 +1,69 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Audio parser script.""" +import librosa +import numpy as np +import soundfile as sf + + +class LoadAudioAndTranscript: + """ + Parse audio and transcript. + """ + def __init__( + self, + audio_conf=None, + normalize=False, + labels=None + ): + super().__init__() + self.window_stride = audio_conf['window_stride'] + self.window_size = audio_conf['window_size'] + self.sample_rate = audio_conf['sampling_rate'] + self.window = audio_conf['window'] + self.is_normalization = normalize + self.labels = labels + + def load_audio(self, path): + """ + Load audio. + """ + sound, _ = sf.read(path, dtype='int16') + sound = sound.astype('float32') / 32767 + if len(sound.shape) > 1: + if sound.shape[1] == 1: + sound = sound.squeeze() + else: + sound = sound.mean(axis=1) + + return sound + + def parse_audio(self, audio_path): + """ + Parse audio. 
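+        Returns a log(1 + magnitude) STFT spectrogram computed with librosa, optionally mean/std normalized.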
+ """ + audio = self.load_audio(audio_path) + n_fft = int(self.sample_rate * self.window_size) + win_length = n_fft + hop_length = int(self.sample_rate * self.window_stride) + d = librosa.stft(y=audio, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=self.window) + mag, _ = librosa.magphase(d) + mag = np.log1p(mag) + if self.is_normalization: + mean = mag.mean() + std = mag.std() + mag = (mag - mean) / std + + return mag diff --git a/research/audio/FastSpeech/src/deepspeech2/model.py b/research/audio/FastSpeech/src/deepspeech2/model.py new file mode 100644 index 0000000000000000000000000000000000000000..73e87f34a689be343d2eb4f205dcd092c605da87 --- /dev/null +++ b/research/audio/FastSpeech/src/deepspeech2/model.py @@ -0,0 +1,315 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""DeepSpeech2 model.""" +import math + +import numpy as np +from mindspore import Tensor +from mindspore import nn +from mindspore.ops import operations as P + + +class SequenceWise(nn.Cell): + """SequenceWise FC Layers.""" + def __init__(self, module): + super().__init__() + self.module = module + self.reshape_op = P.Reshape() + self.shape_op = P.Shape() + self._initialize_weights() + + def construct(self, x): + sizes = self.shape_op(x) + t, n = sizes[0], sizes[1] + x = self.reshape_op(x, (t * n, -1)) + x = self.module(x) + x = self.reshape_op(x, (t, n, -1)) + + return x + + def _initialize_weights(self): + """Init weights.""" + self.init_parameters_data() + for _, m in self.cells_and_names(): + if isinstance(m, nn.Dense): + m.weight.set_data( + Tensor( + np.random.uniform( + -1. / m.in_channels, + 1. / m.in_channels, + m.weight.data.shape + ).astype("float32") + ) + ) + + if m.bias is not None: + m.bias.set_data( + Tensor( + np.random.uniform( + -1. / m.in_channels, + 1. / m.in_channels, + m.bias.data.shape).astype("float32") + ) + ) + + +class MaskConv(nn.Cell): + """ + MaskConv architecture. + MaskConv is actually not implemented in this part + because some operation in MindSpore is not supported. + Lengths is kept for future use. 
+    """
+
+    def __init__(self):
+        super().__init__()
+        self.zeros = P.ZerosLike()
+        self.conv1 = nn.Conv2d(
+            in_channels=1,
+            out_channels=32,
+            kernel_size=(41, 11),
+            stride=(2, 2),
+            pad_mode='pad',
+            padding=(20, 20, 5, 5)
+        )
+
+        self.bn1 = nn.BatchNorm2d(num_features=32)
+        self.conv2 = nn.Conv2d(
+            in_channels=32,
+            out_channels=32,
+            kernel_size=(21, 11),
+            stride=(2, 1),
+            pad_mode='pad',
+            padding=(10, 10, 5, 5)
+        )
+
+        self.bn2 = nn.BatchNorm2d(num_features=32)
+        self.tanh = nn.Tanh()
+        self._initialize_weights()
+        self.module_list = nn.CellList(
+            [
+                self.conv1,
+                self.bn1,
+                self.tanh,
+                self.conv2,
+                self.bn2,
+                self.tanh
+            ]
+        )
+
+    def construct(self, x):
+        for module in self.module_list:
+            x = module(x)
+
+        return x
+
+    def _initialize_weights(self):
+        """
+        Parameter initialization.
+        """
+        self.init_parameters_data()
+        for _, m in self.cells_and_names():
+            if isinstance(m, nn.Conv2d):
+                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
+                m.weight.set_data(Tensor(np.random.normal(0, np.sqrt(2. / n), m.weight.data.shape).astype("float32")))
+                if m.bias is not None:
+                    m.bias.set_data(
+                        Tensor(np.zeros(m.bias.data.shape, dtype="float32")))
+            elif isinstance(m, nn.BatchNorm2d):
+                m.gamma.set_data(
+                    Tensor(np.ones(m.gamma.data.shape, dtype="float32")))
+                m.beta.set_data(
+                    Tensor(np.zeros(m.beta.data.shape, dtype="float32")))
+
+
+class BatchRNN(nn.Cell):
+    """
+    BatchRNN architecture.
+
+    Args:
+        batch_size (int): Number of samples per step in training.
+        input_size (int): Dimension of the input tensor.
+        hidden_size (int): RNN hidden size.
+        num_layers (int): Number of RNN layers.
+        bidirectional (bool): Whether to use a bidirectional RNN. Currently, only the bidirectional RNN is implemented.
+        batch_norm (bool): Whether to use BN in the RNN.
+        rnn_type (str): RNN type to use. Currently, only LSTM is supported.
+    """
+
+    def __init__(
+        self,
+        batch_size,
+        input_size,
+        hidden_size,
+        num_layers,
+        bidirectional=False,
+        batch_norm=False,
+        rnn_type='LSTM',
+    ):
+        super().__init__()
+        self.batch_size = batch_size
+        self.input_size = input_size
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+        self.rnn_type = rnn_type
+        self.bidirectional = bidirectional
+        self.has_bias = True
+        self.is_batch_norm = batch_norm
+        self.num_directions = 2 if bidirectional else 1
+        self.reshape_op = P.Reshape()
+        self.shape_op = P.Shape()
+        self.sum_op = P.ReduceSum()
+
+        input_size_list = [input_size]
+        for i in range(num_layers - 1):
+            input_size_list.append(hidden_size)
+        layers = []
+
+        for i in range(num_layers):
+            layers.append(
+                nn.LSTM(
+                    input_size=input_size_list[i],
+                    hidden_size=hidden_size,
+                    bidirectional=bidirectional,
+                    has_bias=self.has_bias
+                )
+            )
+
+        self.lstms = nn.CellList(layers)
+
+        if batch_norm:
+            batch_norm_layer = []
+            for i in range(num_layers - 1):
+                batch_norm_layer.append(nn.BatchNorm1d(hidden_size))
+            self.batch_norm_list = batch_norm_layer
+
+    def construct(self, x):
+        for i in range(self.num_layers):
+            if self.is_batch_norm and i > 0:
+                x = self.batch_norm_list[i - 1](x)
+            x, _ = self.lstms[i](x)
+            if self.bidirectional:
+                size = self.shape_op(x)
+                x = self.reshape_op(x, (size[0], size[1], 2, -1))
+                x = self.sum_op(x, 2)
+        return x
+
+
+class DeepSpeechModel(nn.Cell):
+    """
+    DeepSpeech2 architecture.
+
+    Args:
+        batch_size (int): Number of samples per step in training.
+        rnn_type (str): RNN type to use.
+        labels (str): String containing all the possible symbols to map to.
+        rnn_hidden_size (int): RNN hidden size.
+        nb_layers (int): Number of RNN layers.
+ audio_conf: Config containing the sample rate, window and the window length/stride in seconds. + bidirectional (bool): Use bidirectional rnn. + """ + + def __init__( + self, + batch_size, + labels, + rnn_hidden_size, + nb_layers, + audio_conf, + rnn_type='LSTM', + bidirectional=True, + ): + super().__init__() + self.batch_size = batch_size + self.hidden_size = rnn_hidden_size + self.hidden_layers = nb_layers + self.rnn_type = rnn_type + self.audio_conf = audio_conf + self.labels = list(labels) + self.bidirectional = bidirectional + self.reshape_op = P.Reshape() + self.shape_op = P.Shape() + self.transpose_op = P.Transpose() + self.add = P.Add() + self.div = P.Div() + + sample_rate = self.audio_conf['sampling_rate'] + window_size = self.audio_conf['window_size'] + num_classes = len(self.labels) + + self.conv = MaskConv() + # This is to calculate + self.pre, self.stride = self.get_conv_num() + self.num_iters = list(range(len(self.stride))) + + # Based on above convolutions and spectrogram size using conv formula (W - F + 2P)/ S+1 + rnn_input_size = int(math.floor((sample_rate * window_size) / 2) + 1) + rnn_input_size = int(math.floor(rnn_input_size + 2 * 20 - 41) / 2 + 1) + rnn_input_size = int(math.floor(rnn_input_size + 2 * 10 - 21) / 2 + 1) + rnn_input_size *= 32 + + self.rnn = BatchRNN( + batch_size=self.batch_size, + input_size=rnn_input_size, + num_layers=nb_layers, + hidden_size=rnn_hidden_size, + bidirectional=bidirectional, + batch_norm=False, + rnn_type=self.rnn_type, + ) + + fully_connected = nn.Dense(rnn_hidden_size, num_classes, has_bias=False) + self.fc = SequenceWise(fully_connected) + + def construct(self, x, lengths): + """ + Forward. + """ + output_lengths = self.get_seq_lens(lengths) + x = self.conv(x) + sizes = self.shape_op(x) + x = self.reshape_op(x, (sizes[0], sizes[1] * sizes[2], sizes[3])) + x = self.transpose_op(x, (2, 0, 1)) + x = self.rnn(x) + + activations = x.copy() + + x = self.fc(x) + + return x, output_lengths, activations + + def get_seq_lens(self, seq_len): + """ + Given a 1D Tensor or Variable containing integer sequence lengths, + return a 1D tensor or variable containing the size sequences + that will be output by the network. + """ + for i in self.num_iters: + seq_len = self.add(self.div(self.add(seq_len, self.pre[i]), self.stride[i]), 1) + + return seq_len + + def get_conv_num(self): + """Get number of convs.""" + p, s = [], [] + for _, cell in self.conv.cells_and_names(): + if isinstance(cell, nn.Conv2d): + kernel_size = cell.kernel_size + padding_1 = int((kernel_size[1] - 1) / 2) + temp = 2 * padding_1 - cell.dilation[1] * (cell.kernel_size[1] - 1) - 1 + p.append(temp) + s.append(cell.stride[1]) + + return p, s diff --git a/research/audio/FastSpeech/src/import_ckpt/__init__.py b/research/audio/FastSpeech/src/import_ckpt/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/import_ckpt/import_deepspeech2.py b/research/audio/FastSpeech/src/import_ckpt/import_deepspeech2.py new file mode 100644 index 0000000000000000000000000000000000000000..0402cbba3f7283d79bb1d1ca5586798ea33e3f21 --- /dev/null +++ b/research/audio/FastSpeech/src/import_ckpt/import_deepspeech2.py @@ -0,0 +1,84 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""DeepSpeech2 checkpoint converter.""" +from pathlib import Path + +import numpy as np +from mindspore import Parameter +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import load_checkpoint +from mindspore import save_checkpoint + +from src.cfg.config import config +from src.deepspeech2.model import DeepSpeechModel + + +def main(ckpt_url): + spect_config = { + 'sampling_rate': config.ds_sampling_rate, + 'window_size': config.ds_window_size, + 'window_stride': config.ds_window_stride, + 'window': config.ds_window + } + # Initialize model to get new lstm params names + model = DeepSpeechModel( + batch_size=1, + rnn_hidden_size=config.ds_hidden_size, + nb_layers=config.ds_hidden_layers, + labels=config.labels, + rnn_type=config.ds_rnn_type, + audio_conf=spect_config, + bidirectional=True + ) + + filter_prefix = ['moment1', 'moment2', 'step', 'learning_rate', 'beta1_power', 'beta2_power'] + lstm_old_names = ['RNN.weight0', 'RNN.weight1', 'RNN.weight2', 'RNN.weight3', 'RNN.weight4'] + new_params = model.trainable_params() + old_params = load_checkpoint(ckpt_url, filter_prefix=filter_prefix) + names_and_shapes = {param.name: param.shape for param in new_params} + + lstm_weights = {} + # Reprocess flatten weights of LSTM from < 1.5 mindspore versions to new. + for layer, old_layer in zip(range(0, 5), lstm_old_names): + previous = 0 + for i in np.array(list(names_and_shapes.keys())[layer * 8 + 6: layer * 8 + 14])[[0, 2, 1, 3, 4, 6, 5, 7]]: + weights = old_params[old_layer][int(previous): int(previous + np.prod(names_and_shapes[i]))].asnumpy() + weights_shaped = weights.reshape(names_and_shapes[i]) + lstm_weights[i] = weights_shaped + + previous += np.prod(names_and_shapes[i]) + + # Remove lstm layers to the load remaining layers + old_params.pop(old_layer) + + # Put remaining weights into dictionary + for remaining_key, remaining_param in old_params.items(): + lstm_weights[remaining_key] = remaining_param.asnumpy() + + # Process to checkpoint save format + save_params = [] + for key, value in lstm_weights.items(): + save_params.append({'name': key, 'data': Parameter(Tensor(value, mstype.float32), name=key)}) + + save_name = Path(Path(ckpt_url).parent, 'DeepSpeech2.ckpt') + save_checkpoint(save_params, str(save_name)) + + print('Successfully converted checkpoint') + print(f'New checkpoint path {save_name}') + + +if __name__ == "__main__": + main(config.ds_ckpt_url) diff --git a/research/audio/FastSpeech/src/import_ckpt/import_waveglow.py b/research/audio/FastSpeech/src/import_ckpt/import_waveglow.py new file mode 100644 index 0000000000000000000000000000000000000000..15093c243961562919ddf02966c6e3d2d63c182a --- /dev/null +++ b/research/audio/FastSpeech/src/import_ckpt/import_waveglow.py @@ -0,0 +1,87 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""WaveGlow checkpoint converter.""" +import pickle +from pathlib import Path + +import numpy as np +from mindspore import Parameter +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import save_checkpoint + +from src.cfg.config import config +from src.waveglow.model import WaveGlow + + +def main(ckpt_url): + with Path(ckpt_url).open('rb') as file: + waveglow_np_params = pickle.load(file) + + wn_config = { + 'n_layers': config.wg_n_layers, + 'n_channels': config.wg_n_channels, + 'kernel_size': config.wg_kernel_size + } + + # Initialize model to get true names + model = WaveGlow( + n_mel_channels=config.wg_n_mel_channels, + n_flows=config.wg_n_flows, + n_group=config.wg_n_group, + n_early_every=config.wg_n_early_every, + n_early_size=config.wg_n_early_size, + wn_config=wn_config + ) + names_and_shapes = {key: param.shape for key, param in model.parameters_and_names()} + + # Put similar names into blocks + wn_names = list(waveglow_np_params.keys())[2: 2 + 38 * 12] + convinv_names = list(waveglow_np_params.keys())[-12:] + ordered_names = list(waveglow_np_params.keys())[:2] + + # Mindspore order of weights into same block + indexes_weighs = np.concatenate((np.arange(1, 34, 2), np.array([34, 37]))) + indexes_biases = np.concatenate((np.arange(0, 34, 2), np.array([35, 36]))) + + for block_num in reversed(range(12)): + block_layers = wn_names[block_num * 38: 38 * (block_num + 1)] + for layer_index_weight, layer_index_bias in zip(indexes_weighs, indexes_biases): + ordered_names.append(block_layers[layer_index_weight]) + ordered_names.append(block_layers[layer_index_bias]) + ordered_names.append(convinv_names[block_num]) + + # Reshape weights and process inverted convolutions + processed_weights = {} + for torch_name, mindspore_name in zip(ordered_names, list(names_and_shapes.keys())): + weights = waveglow_np_params[torch_name] + if torch_name.startswith('convinv'): + weights = np.linalg.inv((np.squeeze(weights))) + weights = np.expand_dims(weights, -1) + processed_weights[mindspore_name] = weights.reshape(names_and_shapes[mindspore_name]) + + save_params = [] + for key, value in processed_weights.items(): + save_params.append({'name': key, 'data': Parameter(Tensor(value, mstype.float32), name=key)}) + + save_name = Path(Path(ckpt_url).parent, 'WaveGlow.ckpt') + save_checkpoint(save_params, str(save_name)) + + print('Successfully converted checkpoint') + print(f'New checkpoint path {save_name}') + + +if __name__ == "__main__": + main(config.wg_ckpt_url) diff --git a/research/audio/FastSpeech/src/metrics.py b/research/audio/FastSpeech/src/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..981ae91d14919e5460cafa399eb28c9410eb913b --- /dev/null +++ b/research/audio/FastSpeech/src/metrics.py @@ -0,0 +1,155 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Metrics scripts.""" +import numpy as np + + +def kernel_classifier_distance_and_std_from_activations( + activations1, + activations2, + max_block_size=1024, + dtype=np.float32, +): + """Compute kernel distance between two activations.""" + n_r = activations1.shape[0] + n_g = activations2.shape[0] + + n_bigger = np.maximum(n_r, n_g) + n_blocks = np.ceil(n_bigger / max_block_size).astype(np.int32) + + v_r = n_r // n_blocks + v_g = n_g // n_blocks + + n_plusone_r = n_r - v_r * n_blocks + n_plusone_g = n_g - v_g * n_blocks + + sizes_r = np.concatenate([np.full([n_blocks - n_plusone_r], v_r), np.full([n_plusone_r], v_r + 1)], axis=0) + + sizes_g = np.concatenate([ + np.full([n_blocks - n_plusone_g], v_g), + np.full([n_plusone_g], v_g + 1)], axis=0) + + zero = np.zeros([1], dtype=np.int32) + inds_r = np.concatenate([zero, np.cumsum(sizes_r)], axis=0) + inds_g = np.concatenate([zero, np.cumsum(sizes_g)], axis=0) + + dim = activations1.shape[1] + + def compute_kid_block(i): + """Computes the ith block of the KID estimate.""" + r_s = inds_r[i] + r_e = inds_r[i + 1] + r = activations1[r_s:r_e] + m = (r_e - r_s).astype(dtype) + + g_s = inds_g[i] + g_e = inds_g[i + 1] + g = activations2[g_s:g_e] + n = (g_e - g_s).astype(dtype) + + k_rr = (np.matmul(r, r.T) / dim + 1) ** 3 + k_rg = (np.matmul(r, g.T) / dim + 1) ** 3 + k_gg = (np.matmul(g, g.T) / dim + 1) ** 3 + + out = (-2 * np.mean(k_rg) + (np.sum(k_rr) - np.trace(k_rr)) / + (m * (m - 1)) + (np.sum(k_gg) - np.trace(k_gg)) / (n * (n - 1))) + + return out.astype(dtype) + + ests = np.array([compute_kid_block(i) for i in range(n_blocks)]) + + mn = np.mean(ests) + + n_blocks_ = n_blocks.astype(dtype) + + if np.less_equal(n_blocks, 1): + var = np.array(float('nan'), dtype=dtype) + else: + var = np.sum(np.square(ests - mn)) / (n_blocks_ - 1) + + return mn, np.sqrt(var / n_blocks_) + + +def frechet_classifier_distance_from_activations( + activations1, + activations2, +): + """Compute frechet distance between two activations.""" + activations1 = activations1.astype(np.float64) + activations2 = activations2.astype(np.float64) + + m = np.mean(activations1, axis=0) + m_w = np.mean(activations2, axis=0) + + # Calculate the unbiased covariance matrix of first activations. + num_examples_real = activations1.shape[0] + sigma = num_examples_real / (num_examples_real - 1) * np.cov(activations1.T) + # Calculate the unbiased covariance matrix of second activations. + num_examples_generated = activations2.shape[0] + sigma_w = num_examples_generated / (num_examples_generated - 1) * np.cov(activations2.T) + + def _calculate_fid(m, m_w, sigma, sigma_w): + """Returns the Frechet distance given the sample mean and covariance.""" + # Find the Tr(sqrt(sigma sigma_w)) component of FID + sqrt_trace_component = trace_sqrt_product(sigma, sigma_w) + + # Compute the two components of FID. + + # First the covariance component. + # Here, note that trace(A + B) = trace(A) + trace(B) + trace = np.trace(sigma + sigma_w) - 2.0 * sqrt_trace_component + + # Next the distance between means. 
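+        # Together with the covariance term above this assembles the closed-form FID
+        # between two Gaussians: ||m - m_w||^2 + Tr(sigma + sigma_w - 2 * sqrt(sigma sigma_w)).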
+ mean = np.sum(squared_difference(m, m_w)) + + # Equivalent to L2 but more stable. + fid = trace + mean + + return fid.astype(np.float64) + + result = tuple( + _calculate_fid(m_val, m_w_val, sigma_val, sigma_w_val) for + m_val, m_w_val, sigma_val, sigma_w_val in + zip([m], [m_w], [sigma], [sigma_w]) + ) + + return result[0] + + +def squared_difference(m, w): + arr = [] + for i, j in zip(m, w): + arr.append((i - j) ** 2) + arr = np.array(arr) + + return arr + + +def trace_sqrt_product(sigma, sigma_v): + # Note sqrt_sigma is called "A" in the proof above + sqrt_sigma = _symmetric_matrix_square_root(sigma) + + # This is sqrt(A sigma_v A) above + sqrt_a_sigmav_a = np.matmul(sqrt_sigma, np.matmul(sigma_v, sqrt_sigma)) + + return np.trace(_symmetric_matrix_square_root(sqrt_a_sigmav_a)) + + +def _symmetric_matrix_square_root(mat, eps=1e-10): + u, s, v = np.linalg.svd(mat) + # sqrt is unstable around 0, just use 0 in such case + si = np.where(np.less(s, eps), s, np.sqrt(s)) + + return np.matmul(np.matmul(u, np.diag(si)), v) diff --git a/research/audio/FastSpeech/src/model.py b/research/audio/FastSpeech/src/model.py new file mode 100644 index 0000000000000000000000000000000000000000..6d2f92fcc998122041a7ec2cd580c9a1a576c665 --- /dev/null +++ b/research/audio/FastSpeech/src/model.py @@ -0,0 +1,256 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""FastSpeech model.""" +import mindspore.numpy as msnp +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops +from mindspore.common.initializer import XavierUniform +from mindspore.common.initializer import initializer + +from src.cfg.config import config as hp +from src.modules import CBHG +from src.modules import LengthRegulator +from src.transformer.models import Decoder +from src.transformer.models import Encoder + + +class FastSpeech(nn.Cell): + """FastSpeech model.""" + def __init__(self): + super().__init__() + self.encoder = Encoder( + n_src_vocab=hp.vocab_size, + len_max_seq=hp.vocab_size, + d_word_vec=hp.encoder_dim, + n_layers=hp.encoder_n_layer, + n_head=hp.encoder_head, + d_k=hp.encoder_dim // hp.encoder_head, + d_v=hp.encoder_dim // hp.encoder_head, + d_model=hp.encoder_dim, + d_inner=hp.encoder_conv1d_filter_size, + dropout=hp.dropout, + ) + + self.length_regulator = LengthRegulator() + + self.decoder = Decoder( + len_max_seq=hp.max_seq_len, + n_layers=hp.decoder_n_layer, + n_head=hp.decoder_head, + d_k=hp.decoder_dim // hp.decoder_head, + d_v=hp.decoder_dim // hp.decoder_head, + d_model=hp.decoder_dim, + d_inner=hp.decoder_conv1d_filter_size, + dropout=hp.dropout + ) + + num_mels = hp.num_mels + decoder_dim = hp.decoder_dim + + self.mel_linear = nn.Dense( + decoder_dim, + num_mels, + weight_init=initializer( + XavierUniform(), + [num_mels, decoder_dim], + mstype.float32 + ) + ) + + self.last_linear = nn.Dense( + num_mels * 2, + num_mels, + weight_init=initializer( + XavierUniform(), + [num_mels, num_mels * 2], + mstype.float32 + ) + ) + + self.postnet = CBHG( + in_dim=num_mels, + num_banks=8, + projections=[256, hp.num_mels], + ) + + self.expand_dims = ops.ExpandDims() + self.argmax = ops.ArgMaxWithValue(axis=-1) + self.broadcast = ops.BroadcastTo((-1, -1, num_mels)) + + self.ids_linspace = msnp.arange(hp.mel_max_length) + self.zeros_mask = msnp.zeros((hp.batch_size, hp.mel_max_length, hp.num_mels)) + + def mask_tensor(self, mel_output, position): + """ + Make mask for tensor, to ignore padded cells. + """ + lengths = self.argmax(position)[1] + + ids = self.ids_linspace + + mask = (ids < self.expand_dims(lengths, 1)).astype(mstype.float32) + mask_bool = self.broadcast(self.expand_dims(mask, -1)).astype(mstype.bool_) + + mel_output = msnp.where(mask_bool, mel_output, self.zeros_mask) + + return mel_output + + def construct( + self, + src_seq, + src_pos, + mel_pos=None, + mel_max_length=None, + length_target=None, + alpha=1.0 + ): + """ + Predict mel-spectrogram from sequence. + + Args: + src_seq (Tensor): Tokenized text sequence. Shape (hp.batch_size, hp.character_max_length) + src_pos (Tensor): Positions of the sequences. Shape (hp.batch_size, hp.character_max_length) + mel_pos (Tensor): Positions of the mels. Shape (hp.batch_size, hp.mel_max_length) + mel_max_length (int): Max mel length. + length_target (Tensor): Duration of the each phonema. Shape (hp.batch_size, hp.character_max_length) + alpha (int): Regulator of the speech speed. 
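+
+        Returns:
+            In training mode: (mel_output, mel_postnet_output, duration_predictor_output).
+            In inference mode: (mel_output, mel_postnet_output, mel_len).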
+ """ + encoder_output = self.encoder(src_seq, src_pos) + + if self.training: + length_regulator_output, duration_predictor_output = self.length_regulator( + encoder_output, + target=length_target, + alpha=alpha, + mel_max_length=mel_max_length, + ) + + decoder_output = self.decoder(length_regulator_output, mel_pos) + + mel_output = self.mel_linear(decoder_output) + mel_output = self.mask_tensor(mel_output, mel_pos) + + residual = self.postnet(mel_output) + residual = self.last_linear(residual) + + mel_postnet_output = mel_output + residual + mel_postnet_output = self.mask_tensor(mel_postnet_output, mel_pos) + + return mel_output, mel_postnet_output, duration_predictor_output + + length_regulator_output, decoder_pos, mel_len = self.length_regulator(encoder_output, alpha=alpha) + + decoder_output = self.decoder(length_regulator_output, decoder_pos) + + mel_output = self.mel_linear(decoder_output) + + residual = self.postnet(mel_output) + residual = self.last_linear(residual) + + mel_postnet_output = mel_output + residual + + return mel_output, mel_postnet_output, mel_len + + +class LossWrapper(nn.Cell): + """ + Training wrapper for model. + """ + def __init__(self, model): + super().__init__() + self.model = model + + self.mse_loss = nn.MSELoss() + self.l1_loss = nn.L1Loss() + + def construct( + self, + character, + src_pos, + mel_pos, + duration, + mel_target, + max_mel_len, + ): + """ + FastSpeech with loss. + + Args: + character (Tensor): Tokenized text sequence. Shape (hp.batch_size, hp.character_max_length) + src_pos (Tensor): Positions of the sequences. Shape (hp.batch_size, hp.character_max_length) + mel_pos (Tensor): Positions of the mels. Shape (hp.batch_size, hp.mel_max_length) + duration (Tensor): Target duration. Shape (hp.batch_size, hp.character_max_length) + mel_target (Tensor): Target mel-spectrogram. Shape (hp.batch_size, hp.mel_max_length, hp.num_mels) + max_mel_len (list): Max mel length. + + Returns: + total_loss (Tensor): Sum of 3 losses. + """ + max_mel_len = max_mel_len[0] + mel_output, mel_postnet_output, duration_predictor_output = self.model( + character, + src_pos, + mel_pos=mel_pos, + mel_max_length=max_mel_len, + length_target=duration, + ) + + mel_loss = self.mse_loss(mel_output, mel_target) + mel_postnet_loss = self.mse_loss(mel_postnet_output, mel_target) + duration_predictor_loss = self.l1_loss(duration_predictor_output, duration) + + total_loss = mel_loss + mel_postnet_loss + duration_predictor_loss + + return total_loss + + +class FastSpeechEval: + """FastSpeech with vocoder for evaluation.""" + def __init__( + self, + mel_generator, + vocoder, + config, + ): + super().__init__() + self.mel_generator = mel_generator + self.vocoder = vocoder + + self.alpha = config.alpha + self.vocoder_stride = vocoder.upsample.stride[1] + self.zeros_mask = msnp.zeros((1, config.num_mels, config.mel_max_length)) + + x_grid = msnp.arange(0, config.mel_max_length) + y_grid = msnp.arange(0, config.num_mels) + + self.transpose = ops.Transpose() + self.grid = ops.ExpandDims()(msnp.meshgrid(x_grid, y_grid)[0], 0) + + def get_audio(self, src_seq, src_pos): + """ + Generate mel-spectrogram from sequence, + generate raw audio from mel-spectrogram by vocoder. 
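+
+        Illustrative call pattern (eval_model and data_url are placeholder names;
+        sequence and src_pos come from get_val_data in the dataset module):
+
+            sequence, src_pos, _ = get_val_data(data_url)[0]
+            audio, audio_len = eval_model.get_audio(sequence, src_pos)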
+        """
+        _, mel, mel_len = self.mel_generator(src_seq, src_pos, alpha=self.alpha)
+
+        mel_mask = (self.grid < mel_len).astype(mstype.float32)
+        clear_mel = self.transpose(mel, (0, 2, 1)) * mel_mask
+
+        audio = self.vocoder.construct(clear_mel)
+
+        audio_len = mel_len * self.vocoder_stride
+
+        return audio, audio_len
diff --git a/research/audio/FastSpeech/src/modules.py b/research/audio/FastSpeech/src/modules.py
new file mode 100644
index 0000000000000000000000000000000000000000..3c1871553bf51391abf824d414a568ae14e945a5
--- /dev/null
+++ b/research/audio/FastSpeech/src/modules.py
@@ -0,0 +1,364 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""Model modules."""
+from collections import OrderedDict
+
+import numpy as np
+from mindspore import Tensor
+from mindspore import dtype as mstype
+from mindspore import nn
+from mindspore import numpy as msnp
+from mindspore import ops
+from mindspore.common.initializer import XavierUniform
+from mindspore.common.initializer import initializer
+
+from src.cfg.config import config as hp
+
+
+class LengthRegulator(nn.Cell):
+    """
+    Length Regulator.
+
+    Predicts the duration of each phoneme and lets
+    the speech speed be changed with alpha.
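+
+    For example, predicted durations [2, 1, 3] expand the three encoder frames
+    into 2 + 1 + 3 = 6 mel frames. At inference the predicted durations are
+    scaled by alpha, so alpha > 1.0 slows speech down and alpha < 1.0 speeds it up.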
+    """
+    def __init__(self):
+        super().__init__()
+        self.duration_predictor = DurationPredictor()
+
+        self.tile = ops.Tile()
+        self.round = ops.Round()
+        self.stack = ops.Stack()
+        self.zeros = ops.Zeros()
+        self.concat = ops.Concat()
+        self.matmul = ops.MatMul()
+        self.sum = ops.ReduceSum()
+        self.bmm = ops.BatchMatMul()
+        self.unsqueeze = ops.ExpandDims()
+        self.max = ops.ArgMaxWithValue(axis=-1)
+        self.mesh = ops.Meshgrid(indexing='xy')
+
+        self.alignment_zeros = self.zeros(
+            (hp.batch_size, hp.mel_max_length, hp.character_max_length),
+            mstype.float32,
+        )
+
+        # For alignment
+        self.h = hp.mel_max_length
+        self.w = hp.character_max_length
+        self.base_mat_ones = msnp.ones((self.h, self.w))
+        self.meshgrid = self.mesh((msnp.arange(self.w), msnp.arange(self.h)))[1]
+        self.zero_tensor = Tensor([0.])
+        self.mel_pos_linspace = self.unsqueeze(msnp.arange(hp.mel_max_length) + 1, 0)
+
+    def LR(self, enc_out, duration_predictor_output, mel_max_length=None):
+        """Length regulator module."""
+        expand_max_len = self.sum(duration_predictor_output.astype(mstype.float32))
+
+        # mel_max_length is None during eval
+        if mel_max_length is not None:
+            alignment = self.alignment_zeros
+        else:
+            alignment = self.unsqueeze(self.alignment_zeros[0], 0)
+
+        for i in range(duration_predictor_output.shape[0]):
+            thresh_2 = duration_predictor_output[i].cumsum().astype(mstype.float32)
+            thresh_1 = self.concat(
+                (
+                    self.zero_tensor.astype(mstype.float64),
+                    thresh_2[:-1].astype(mstype.float64)
+                )
+            )
+            thresh_1 = self.tile(thresh_1, (self.h, 1))
+            thresh_2 = self.tile(thresh_2, (self.h, 1))
+
+            low_thresh = (self.meshgrid < thresh_2).astype(mstype.float32)
+            up_thresh = (self.meshgrid >= thresh_1).astype(mstype.float32)
+            intersection = low_thresh * up_thresh
+            res = intersection.astype(mstype.bool_)
+            alignment[i] = msnp.where(res, self.base_mat_ones, alignment[i])
+
+        output = self.bmm(alignment, enc_out)
+
+        return output, expand_max_len
+
+    def construct(self, encoder_output, alpha=1.0, target=None, mel_max_length=None):
+        """
+        Predict the duration of each phoneme.
+        """
+        duration_predictor_output = self.duration_predictor(encoder_output)
+
+        # target is not None during training
+        if target is not None:
+            output, _ = self.LR(encoder_output, target, mel_max_length=mel_max_length)
+
+            return output, duration_predictor_output
+
+        duration_predictor_output = (duration_predictor_output + 0.5) * alpha
+        duration_predictor_output = self.round(duration_predictor_output.copy())
+
+        output, mel_len = self.LR(encoder_output, duration_predictor_output)
+
+        mel_pos_mask = (self.mel_pos_linspace <= mel_len).astype(mstype.float32)
+        mel_pos = self.mel_pos_linspace * mel_pos_mask
+
+        return output, mel_pos, mel_len
+
+
+class DurationPredictor(nn.Cell):
+    """
+    Duration Predictor.
+
+    Predicts the duration of each phoneme.
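+
+    Takes encoder output of shape (batch, sequence_length, hp.encoder_dim) and
+    returns one predicted duration value per input position.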
+ """ + def __init__(self): + super().__init__() + + self.input_size = hp.encoder_dim + self.filter_size = hp.duration_predictor_filter_size + self.kernel = hp.duration_predictor_kernel_size + self.conv_output_size = hp.duration_predictor_filter_size + self.dropout = 1 - hp.dropout + + self.conv_layer = nn.SequentialCell(OrderedDict([ + ("conv1d_1", Conv( + self.input_size, + self.filter_size, + kernel_size=self.kernel, + padding=1)), + ("layer_norm_1", nn.LayerNorm([self.filter_size])), + ("relu_1", nn.ReLU()), + ("dropout_1", nn.Dropout(keep_prob=self.dropout)), + ("conv1d_2", Conv( + self.filter_size, + self.filter_size, + kernel_size=self.kernel, + padding=1)), + ("layer_norm_2", nn.LayerNorm([self.filter_size])), + ("relu_2", nn.ReLU()), + ("dropout_2", nn.Dropout(keep_prob=self.dropout)) + ])) + + self.linear_layer = nn.Dense( + in_channels=self.conv_output_size, + out_channels=1, + weight_init=initializer( + XavierUniform(), + [1, self.conv_output_size], + mstype.float32 + ) + ) + + self.relu = nn.ReLU() + self.expand_dims = ops.ExpandDims() + self.squeeze = ops.Squeeze() + + def construct(self, encoder_output): + out = self.conv_layer(encoder_output) + out = self.linear_layer(out) + out = self.relu(out) + out = self.squeeze(out) + + if not self.training: + out = self.expand_dims(out, 0) + + return out + + +class BatchNormConv1d(nn.Cell): + """ + Custom BN, Conv1d layer with weight init. + """ + def __init__( + self, + in_dim, + out_dim, + kernel_size, + stride, + padding, + activation=None, + ): + super().__init__() + + self.conv1d = nn.Conv1d( + in_dim, + out_dim, + kernel_size=kernel_size, + stride=stride, + pad_mode='pad', + padding=padding, + has_bias=False, + weight_init=initializer( + XavierUniform(), + [out_dim, in_dim, kernel_size], + mstype.float32, + ) + ) + + self.bn = nn.BatchNorm2d(out_dim, use_batch_statistics=True) + + self.activation = activation + self.expand_dims = ops.ExpandDims() + + def construct(self, input_tensor): + out = self.conv1d(input_tensor) + + if self.activation is not None: + out = self.activation(out) + + out = self.bn(self.expand_dims(out, -1)) + out = out.squeeze(-1) + + return out + + +class Conv(nn.Cell): + """ + Conv1d with weight init. 
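+
+    Operates on (batch, time, channels) inputs: the tensor is transposed to
+    (batch, channels, time) for nn.Conv1d and transposed back afterwards.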
+ """ + def __init__( + self, + in_channels, + out_channels, + kernel_size=1, + stride=1, + padding=0, + dilation=1, + bias=True, + ): + super().__init__() + + self.conv = nn.Conv1d( + in_channels, + out_channels, + kernel_size=kernel_size, + stride=stride, + pad_mode='pad', + padding=padding, + dilation=dilation, + has_bias=bias, + weight_init=initializer( + XavierUniform(), + [in_channels, out_channels, kernel_size], + mstype.float32, + ) + ) + + self.transpose = ops.Transpose() + + def construct(self, x): + x = self.transpose(x, (0, 2, 1)) + x = self.conv(x) + x = self.transpose(x, (0, 2, 1)) + + return x + + +class Highway(nn.Cell): + """Highway network.""" + def __init__(self, in_size, out_size): + super().__init__() + self.h = nn.Dense(in_size, out_size, bias_init='zeros') + self.t = nn.Dense(in_size, out_size, bias_init=Tensor(np.full(in_size, -1.), mstype.float32)) + self.relu = nn.ReLU() + self.sigmoid = nn.Sigmoid() + + def construct(self, inputs): + out_1 = self.relu(self.h(inputs)) + out_2 = self.sigmoid(self.t(inputs)) + output = out_1 * out_2 + inputs * (1.0 - out_2) + + return output + + +class CBHG(nn.Cell): + """ + CBHG a recurrent neural network composed of: + - 1-d convolution banks + - Highway networks + residual connections + - Bidirectional gated recurrent units + """ + def __init__(self, in_dim, num_banks, projections): + super().__init__() + self.in_dim = in_dim + + self.relu = nn.ReLU() + self.conv1d_banks = nn.CellList( + [ + BatchNormConv1d( + in_dim, + in_dim, + kernel_size=k, + stride=1, + padding=k // 2, + activation=self.relu, + ) + for k in range(1, num_banks + 1) + ] + ) + + self.max_pool1d = nn.MaxPool1d(kernel_size=2, stride=1, pad_mode='same') + + in_sizes = [num_banks * in_dim] + projections[:-1] + activations = [self.relu] * (len(projections) - 1) + [None] + + self.conv1d_projections = nn.CellList( + [ + BatchNormConv1d( + in_size, + out_size, + kernel_size=3, + stride=1, + padding=1, + activation=activation, + ) + for (in_size, out_size, activation) in zip(in_sizes, projections, activations) + ] + ) + + self.highways = nn.CellList([Highway(in_dim, in_dim) for _ in range(4)]) + + self.gru = nn.GRU(in_dim, in_dim, 1, batch_first=True, bidirectional=True) + + self.transpose = ops.Transpose() + self.concat = ops.Concat(axis=1) + + def construct(self, inputs): + """ + Forward mels to recurrent network. + """ + out = self.transpose(inputs, (0, 2, 1)) + + last_dim = out.shape[-1] + + output_list = [] + for conv in self.conv1d_banks: + output_list.append(conv(out)[:, :, :last_dim]) + + output = self.concat(output_list) + output = self.max_pool1d(output)[:, :, :last_dim] + + for conv1d in self.conv1d_projections: + output = conv1d(output) + + output = self.transpose(output, (0, 2, 1)) + output += inputs + + for highway in self.highways: + output = highway(output) + + outputs, _ = self.gru(output) + + return outputs diff --git a/research/audio/FastSpeech/src/text/__init__.py b/research/audio/FastSpeech/src/text/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6e8d5010ee2d59bd8b74a7a1b25f594cafc0fac2 --- /dev/null +++ b/research/audio/FastSpeech/src/text/__init__.py @@ -0,0 +1,77 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the License); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# httpwww.apache.orglicensesLICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Adapted from https://github.com/keithito/tacotron""" +import re +from src.text import cleaners +from src.text.symbols import all_symbols +from src.text.cleaners import english_cleaners + + +# Mappings from symbol to numeric ID and vice versa +_symbol_to_id = {s: i for i, s in enumerate(all_symbols)} + +# Regular expression matching text enclosed in curly braces +_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') + + +def text_to_sequence(text, cleaner_names): + """ + Converts a string of text to a sequence of IDs corresponding to the symbols in the text. + + The text can optionally have ARPAbet sequences enclosed in curly braces embedded + in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street." + + Args: + text (str): String to convert to a sequence. + cleaner_names: names of the cleaner functions to run the text through. + + Returns: + List of integers corresponding to the symbols in the text. + """ + sequence = [] + + # Check for curly braces and treat their contents as ARPAbet + while text: + m = _curly_re.match(text) + if not m: + sequence += _symbols_to_sequence(_clean_text(text, cleaner_names)) + break + sequence += _symbols_to_sequence( + _clean_text(m.group(1), cleaner_names)) + sequence += _arpabet_to_sequence(m.group(2)) + text = m.group(3) + + return sequence + + +def _clean_text(text, cleaner_names): + for name in cleaner_names: + cleaner = getattr(cleaners, name) + if not cleaner: + raise Exception('Unknown cleaner: %s' % name) + text = cleaner(text) + return text + + +def _symbols_to_sequence(symbols): + return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)] + + +def _arpabet_to_sequence(text): + return _symbols_to_sequence(['@' + s for s in text.split()]) + + +def _should_keep_symbol(s): + return s in _symbol_to_id and s != '_' and s != '~' diff --git a/research/audio/FastSpeech/src/text/cleaners.py b/research/audio/FastSpeech/src/text/cleaners.py new file mode 100644 index 0000000000000000000000000000000000000000..9594191e954a6e0be72ac41ceed9bd1660c94657 --- /dev/null +++ b/research/audio/FastSpeech/src/text/cleaners.py @@ -0,0 +1,92 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""from https://github.com/keithito/tacotron """ +import re + +from unidecode import unidecode + +from src.text.numbers import normalize_numbers + +_whitespace_re = re.compile(r'\s+') + +# List of (regular expression, replacement) pairs for abbreviations: +_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ + ('mrs', 'misess'), + ('mr', 'mister'), + ('dr', 'doctor'), + ('st', 'saint'), + ('co', 'company'), + ('jr', 'junior'), + ('maj', 'major'), + ('gen', 'general'), + ('drs', 'doctors'), + ('rev', 'reverend'), + ('lt', 'lieutenant'), + ('hon', 'honorable'), + ('sgt', 'sergeant'), + ('capt', 'captain'), + ('esq', 'esquire'), + ('ltd', 'limited'), + ('col', 'colonel'), + ('ft', 'fort'), +]] + + +def expand_abbreviations(text): + for regex, replacement in _abbreviations: + text = re.sub(regex, replacement, text) + return text + + +def expand_numbers(text): + return normalize_numbers(text) + + +def lowercase(text): + return text.lower() + + +def collapse_whitespace(text): + return re.sub(_whitespace_re, ' ', text) + + +def convert_to_ascii(text): + """Convert to ascii.""" + return unidecode(text) + + +def basic_cleaners(text): + """Basic pipeline that lowercases and collapses whitespace without transliteration.""" + text = lowercase(text) + text = collapse_whitespace(text) + return text + + +def transliteration_cleaners(text): + """Pipeline for non-English text that transliterates to ASCII.""" + text = convert_to_ascii(text) + text = lowercase(text) + text = collapse_whitespace(text) + return text + + +def english_cleaners(text): + """Pipeline for English text, including number and abbreviation expansion.""" + text = convert_to_ascii(text) + text = lowercase(text) + text = expand_numbers(text) + text = expand_abbreviations(text) + text = collapse_whitespace(text) + return text diff --git a/research/audio/FastSpeech/src/text/numbers.py b/research/audio/FastSpeech/src/text/numbers.py new file mode 100644 index 0000000000000000000000000000000000000000..1792fd38aac4159916981062a9f21d6c7ecdbf77 --- /dev/null +++ b/research/audio/FastSpeech/src/text/numbers.py @@ -0,0 +1,86 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""from https://github.com/keithito/tacotron""" +import re + +import inflect + +_inflect = inflect.engine() +_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') +_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') +_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') +_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') +_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') +_number_re = re.compile(r'[0-9]+') + + +def _remove_commas(m): + return m.group(1).replace(',', '') + + +def _expand_decimal_point(m): + return m.group(1).replace('.', ' point ') + + +def _expand_dollars(m): + """Expand english money names values.""" + match = m.group(1) + parts = match.split('.') + if len(parts) > 2: + return match + ' dollars' # Unexpected format + dollars = int(parts[0]) if parts[0] else 0 + cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 + if dollars and cents: + dollar_unit = 'dollar' if dollars == 1 else 'dollars' + cent_unit = 'cent' if cents == 1 else 'cents' + return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) + if dollars: + dollar_unit = 'dollar' if dollars == 1 else 'dollars' + return '%s %s' % (dollars, dollar_unit) + if cents: + cent_unit = 'cent' if cents == 1 else 'cents' + return '%s %s' % (cents, cent_unit) + + return 'zero dollars' + + +def _expand_ordinal(m): + return _inflect.number_to_words(m.group(0)) + + +def _expand_number(m): + """Expand numbers into text.""" + num = int(m.group(0)) + if 1000 < num < 3000: + if num == 2000: + return 'two thousand' + if 2000 < num < 2010: + return 'two thousand ' + _inflect.number_to_words(num % 100) + if num % 100 == 0: + return _inflect.number_to_words(num // 100) + ' hundred' + + return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') + + return _inflect.number_to_words(num, andword='') + + +def normalize_numbers(text): + text = re.sub(_comma_number_re, _remove_commas, text) + text = re.sub(_pounds_re, r'\1 pounds', text) + text = re.sub(_dollars_re, _expand_dollars, text) + text = re.sub(_decimal_number_re, _expand_decimal_point, text) + text = re.sub(_ordinal_re, _expand_ordinal, text) + text = re.sub(_number_re, _expand_number, text) + return text diff --git a/research/audio/FastSpeech/src/text/symbols.py b/research/audio/FastSpeech/src/text/symbols.py new file mode 100644 index 0000000000000000000000000000000000000000..57d0cd841b2a3c8d8d9078a6f43c3af80542d0b6 --- /dev/null +++ b/research/audio/FastSpeech/src/text/symbols.py @@ -0,0 +1,36 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Symbols preprocessing.""" + +valid_symbols = [ + 'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2', + 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2', + 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY', + 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1', + 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0', + 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', + 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH' +] + +_pad = '_' +_punctuation = '!\'(),.:;? ' +_special = '-' +_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' + +# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters): +_arpabet = ['@' + s for s in valid_symbols] + +# Export all symbols: +all_symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet diff --git a/research/audio/FastSpeech/src/transformer/__init__.py b/research/audio/FastSpeech/src/transformer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/transformer/constants.py b/research/audio/FastSpeech/src/transformer/constants.py new file mode 100644 index 0000000000000000000000000000000000000000..7d7eb3a6b69aa270e8b6d40880adfe530b68fac9 --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/constants.py @@ -0,0 +1,24 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Constants and tokens.""" +PAD = 0 +UNK = 1 +BOS = 2 +EOS = 3 + +PAD_WORD = '<blank>' +UNK_WORD = '<unk>' +BOS_WORD = '<s>' +EOS_WORD = '</s>' diff --git a/research/audio/FastSpeech/src/transformer/layers.py b/research/audio/FastSpeech/src/transformer/layers.py new file mode 100644 index 0000000000000000000000000000000000000000..7ce9629fc086c0d910d1ddc9f8fef7b49090269b --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/layers.py @@ -0,0 +1,133 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Custom layers.""" +from mindspore import dtype as mstype +from mindspore import nn +from mindspore.common.initializer import Normal +from mindspore.common.initializer import XavierUniform +from mindspore.common.initializer import initializer + +from src.transformer.sublayers import MultiHeadAttention +from src.transformer.sublayers import PositionwiseFeedForward + + +class Linear(nn.Cell): + """ + Create linear layer and init weights. + """ + def __init__( + self, + in_dim, + out_dim, + bias=True, + w_init='linear' + ): + super().__init__() + + if w_init == 'xavier': + linear_weights = initializer(XavierUniform(), [in_dim, out_dim], mstype.float32) + else: + linear_weights = initializer(Normal(), [in_dim, out_dim], mstype.float32) + + self.linear_layer = nn.Dense( + in_dim, + out_dim, + bias=bias, + weight_init=linear_weights, + ) + + def construct(self, x): + """Forward.""" + out = self.linear_layer(x) + + return out + + +class FFTBlock(nn.Cell): + """ + Feed-forward transformer (FFT) block. + Similar for 'encoder' and 'decoder' at this model. + """ + def __init__( + self, + d_model, + d_inner, + n_head, + d_k, + d_v, + dropout=0.1, + ): + super().__init__() + + self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout) + self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout) + + def construct(self, enc_input, non_pad_mask=None, slf_attn_mask=None): + """Forward""" + enc_output = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask) + enc_output *= non_pad_mask + + enc_output = self.pos_ffn(enc_output) + enc_output *= non_pad_mask + + return enc_output + + +class ConvNorm(nn.Cell): + """ + Create convolution layer and init weights. + """ + def __init__( + self, + in_channels, + out_channels, + kernel_size=1, + stride=1, + padding=None, + dilation=1, + bias=True, + w_init_gain='linear', + ): + super().__init__() + + if padding is None: + assert kernel_size % 2 == 1 + padding = int(dilation * (kernel_size - 1) / 2) + + if w_init_gain == 'tanh': + gain = 5.0 / 3 + else: + gain = 1 + + self.conv = nn.Conv1d( + in_channels, + out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + dilation=dilation, + bias=bias, + weight_init=initializer( + XavierUniform(gain=gain), + [in_channels, out_channels], + mstype.float32 + ) + ) + + def construct(self, x): + """Forward.""" + output = self.conv(x) + + return output diff --git a/research/audio/FastSpeech/src/transformer/models.py b/research/audio/FastSpeech/src/transformer/models.py new file mode 100644 index 0000000000000000000000000000000000000000..1e4d634447cd8c4bbd389209d9e2598f1ea24885 --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/models.py @@ -0,0 +1,187 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Model script.""" +import numpy as np +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops + +from src.cfg.config import config as hp +from src.transformer import constants +from src.transformer.layers import FFTBlock + + +def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None): + """ + Sinusoid position encoding table. + """ + def cal_angle(position, hid_idx): + return position / np.power(10000, 2 * (hid_idx // 2) / d_hid) + + def get_posi_angle_vec(position): + return [cal_angle(position, hid_j) for hid_j in range(d_hid)] + + sinusoid_table = np.array([get_posi_angle_vec(pos_i) + for pos_i in range(n_position)]) + + sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) + sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) + + if padding_idx is not None: + # zero vector for padding dimension + sinusoid_table[padding_idx] = 0. + + return Tensor(sinusoid_table, dtype=mstype.float32) + + +class Encoder(nn.Cell): + """Encoder.""" + def __init__( + self, + n_src_vocab, + len_max_seq, + d_word_vec, + n_layers, + n_head, + d_k, + d_v, + d_model, + d_inner, + dropout, + ): + super().__init__() + + n_position = len_max_seq + 1 + pretrained_embs = get_sinusoid_encoding_table(n_position, d_word_vec, padding_idx=0) + + self.src_word_emb = nn.Embedding( + n_src_vocab, + d_word_vec, + padding_idx=constants.PAD, + ) + + self.position_enc = nn.Embedding( + n_position, + d_word_vec, + embedding_table=pretrained_embs, + padding_idx=0, + ) + + self.layer_stack = nn.CellList( + [ + FFTBlock(d_model, d_inner, n_head, d_k, d_v, dropout=dropout) for _ in range(n_layers) + ] + ) + + self.equal = ops.Equal() + self.not_equal = ops.NotEqual() + self.expand_dims = ops.ExpandDims() + self.pad = constants.PAD + self.broadcast = ops.BroadcastTo((-1, hp.character_max_length, -1)) + + def construct(self, src_seq, src_pos): + """ + Create mask and forward to FFT blocks. + + Args: + src_seq (Tensor): Tokenized text sequence. Shape (hp.batch_size, hp.character_max_length). + src_pos (Tensor): Positions of the sequences. Shape (hp.batch_size, hp.character_max_length). + + Returns: + enc_output (Tensor): Encoder output. 
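+                Shape (hp.batch_size, hp.character_max_length, hp.encoder_dim).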
+ """ + # Prepare masks + padding_mask = self.equal(src_seq, self.pad) + slf_attn_mask = self.broadcast(self.expand_dims(padding_mask.astype(mstype.float32), 1)) + slf_attn_mask_bool = slf_attn_mask.astype(mstype.bool_) + + non_pad_mask_bool = self.expand_dims(self.not_equal(src_seq, self.pad), -1) + non_pad_mask = non_pad_mask_bool.astype(mstype.float32) + + # Forward + enc_output = self.src_word_emb(src_seq.astype('int32')) + self.position_enc(src_pos.astype('int32')) + + for enc_layer in self.layer_stack: + enc_output = enc_layer( + enc_output, + non_pad_mask=non_pad_mask, + slf_attn_mask=slf_attn_mask_bool, + ) + + return enc_output + + +class Decoder(nn.Cell): + """Decoder.""" + def __init__( + self, + len_max_seq, + n_layers, + n_head, + d_k, + d_v, + d_model, + d_inner, + dropout + ): + + super().__init__() + + n_position = len_max_seq + 1 + + pretrained_embs = get_sinusoid_encoding_table(n_position, d_model, padding_idx=0) + + self.position_enc = nn.Embedding( + n_position, + d_model, + embedding_table=pretrained_embs, + padding_idx=0, + ) + + self.layer_stack = nn.CellList( + [ + FFTBlock(d_model, d_inner, n_head, d_k, d_v, dropout=dropout) for _ in range(n_layers) + ] + ) + + self.pad = constants.PAD + self.equal = ops.Equal() + self.not_equal = ops.NotEqual() + self.expand_dims = ops.ExpandDims() + self.broadcast = ops.BroadcastTo((-1, hp.mel_max_length, -1)) + + def construct(self, enc_seq, enc_pos): + """ + Create mask and forward to FFT blocks. + """ + # Prepare masks + padding_mask = self.equal(enc_pos, self.pad) + slf_attn_mask = self.broadcast(self.expand_dims(padding_mask.astype(mstype.float32), 1)) + slf_attn_mask_bool = slf_attn_mask.astype(mstype.bool_) + + non_pad_mask_bool = self.expand_dims(self.not_equal(enc_pos, self.pad), -1) + non_pad_mask = non_pad_mask_bool.astype(mstype.float32) + + # Forward + dec_output = enc_seq + self.position_enc(enc_pos.astype(mstype.int32)) + + for dec_layer in self.layer_stack: + dec_output = dec_layer( + dec_output, + non_pad_mask=non_pad_mask, + slf_attn_mask=slf_attn_mask_bool) + + return dec_output diff --git a/research/audio/FastSpeech/src/transformer/modules.py b/research/audio/FastSpeech/src/transformer/modules.py new file mode 100644 index 0000000000000000000000000000000000000000..7e72e56c0d9851ca3ebd53394622f3ab92b4c31e --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/modules.py @@ -0,0 +1,59 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model modules.""" +import mindspore.numpy as msnp +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops +from mindspore.ops import constexpr + + +class ScaledDotProductAttention(nn.Cell): + """ + Scaled Dot-Product Attention. 
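+
+    Computes softmax(q @ k.T / temperature) @ v over batched inputs,
+    where temperature is sqrt(d_k). Masked positions are filled with -inf
+    before the softmax, so they receive zero attention weight.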
+ """ + def __init__(self, temperature, attn_dropout=0.1): + super().__init__() + self.temperature = temperature + + self.softmax = nn.Softmax(axis=2) + self.dropout = nn.Dropout(keep_prob=1-attn_dropout) + + self.bmm = ops.BatchMatMul() + self.transpose = ops.Transpose() + + def construct(self, q, k, v, mask=None): + """Forward.""" + attn = self.bmm(q, self.transpose(k, (0, 2, 1))) + attn = attn / self.temperature + + inf_mask = infinity_mask(attn.shape, -msnp.inf) + + if mask is not None: + attn = msnp.where(mask, inf_mask, attn) + + attn = self.softmax(attn) + attn = self.dropout(attn) + + output = self.bmm(attn, v) + + return output + + +@constexpr +def infinity_mask(mask_shape, inf): + """Make infinity mask.""" + inf_mask = ops.Fill()(mstype.float32, mask_shape, inf) + return inf_mask diff --git a/research/audio/FastSpeech/src/transformer/sublayers.py b/research/audio/FastSpeech/src/transformer/sublayers.py new file mode 100644 index 0000000000000000000000000000000000000000..a4bb7b1aaac472fdc0c62d7c0322fe963c824187 --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/sublayers.py @@ -0,0 +1,154 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model sublayers.""" +import numpy as np +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops +from mindspore.common.initializer import Normal +from mindspore.common.initializer import initializer + +from src.cfg.config import config as hp +from src.transformer.modules import ScaledDotProductAttention + + +class MultiHeadAttention(nn.Cell): + """ + Multi-Head Attention module. 
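+
+    Projects the inputs into n_head heads of width d_k (queries, keys) and
+    d_v (values), folds the heads into the batch dimension, runs
+    ScaledDotProductAttention, concatenates the heads back and finishes
+    with a linear projection, dropout, a residual connection and LayerNorm.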
+ """ + def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1): + super().__init__() + + self.n_head = n_head + self.d_k = d_k + self.d_v = d_v + + self.w_qs = nn.Dense( + d_model, + n_head * d_k, + weight_init=initializer( + Normal(sigma=np.sqrt(2.0 / (d_model + d_k)), mean=0), + [d_model, n_head * d_k], + mstype.float32, + ) + ) + + self.w_ks = nn.Dense( + d_model, + n_head * d_k, + weight_init=initializer( + Normal(sigma=np.sqrt(2.0 / (d_model + d_k)), mean=0), + [d_model, n_head * d_k], + mstype.float32, + ) + ) + + self.w_vs = nn.Dense( + d_model, + n_head * d_v, + weight_init=initializer( + Normal(sigma=np.sqrt(2.0 / (d_model + d_v)), mean=0), + [d_model, n_head * d_v], + mstype.float32, + ) + ) + + self.fc = nn.Dense( + n_head * d_v, + d_model, + weight_init=initializer(Normal(), [n_head * d_v, d_model], mstype.float32) + ) + + self.attention = ScaledDotProductAttention(temperature=np.power(d_k, 0.5)) + self.layer_norm = nn.LayerNorm([d_model]) + self.dropout = nn.Dropout(keep_prob=1-dropout) + + self.transpose = ops.Transpose() + self.reshape = ops.Reshape() + self.tile = ops.Tile() + + def construct(self, q, k, v, mask=None): + """Forward.""" + d_k, d_v, n_head = self.d_k, self.d_v, self.n_head + + sz_b, len_q, _ = q.shape + sz_b, len_k, _ = k.shape + sz_b, len_v, _ = v.shape + + residual = q + + q = self.reshape(self.w_qs(q), (sz_b, len_q, n_head, d_k)) + k = self.reshape(self.w_ks(k), (sz_b, len_k, n_head, d_k)) + v = self.reshape(self.w_vs(v), (sz_b, len_v, n_head, d_v)) + + q = self.reshape(self.transpose(q, (2, 0, 1, 3)), (-1, len_q, d_k)) # (n*b) x lq x dk + k = self.reshape(self.transpose(k, (2, 0, 1, 3)), (-1, len_q, d_k)) # (n*b) x lk x dk + v = self.reshape(self.transpose(v, (2, 0, 1, 3)), (-1, len_v, d_v)) # (n*b) x lv x dv + + mask = self.tile(mask.astype(mstype.float32), (n_head, 1, 1)) + output = self.attention(q, k, v, mask=mask.astype(mstype.bool_)) + + output = self.reshape(output, (n_head, sz_b, len_q, d_v)) + output = self.reshape(self.transpose(output, (1, 2, 0, 3)), (sz_b, len_q, -1)) # b x lq x (n*dv) + + output = self.dropout(self.fc(output)) + output = self.layer_norm(output + residual) + + return output + + +class PositionwiseFeedForward(nn.Cell): + """A two-feed-forward-layer module.""" + def __init__(self, d_in, d_hid, dropout=0.1): + super().__init__() + + self.w_1 = nn.Conv1d( + d_in, + d_hid, + kernel_size=hp.fft_conv1d_kernel[0], + pad_mode='pad', + padding=hp.fft_conv1d_padding[0], + has_bias=True, + ) + + self.w_2 = nn.Conv1d( + d_hid, + d_in, + kernel_size=hp.fft_conv1d_kernel[1], + pad_mode='pad', + padding=hp.fft_conv1d_padding[1], + has_bias=True, + ) + + self.dropout = nn.Dropout(keep_prob=1-dropout) + self.layer_norm = nn.LayerNorm([d_in]) + self.relu = nn.ReLU() + + self.transpose = ops.Transpose() + + def construct(self, x): + """Forward.""" + residual = x + + output = self.transpose(x, (0, 2, 1)) + output = self.w_1(output) + output = self.relu(output) + output = self.w_2(output) + output = self.transpose(output, (0, 2, 1)) + output = self.dropout(output) + + output = self.layer_norm(output + residual) + + return output diff --git a/research/audio/FastSpeech/src/utils.py b/research/audio/FastSpeech/src/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3a7542184efd138d33ca4729cbe2737910b516de --- /dev/null +++ b/research/audio/FastSpeech/src/utils.py @@ -0,0 +1,54 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this 
file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Utilities.""" +from pathlib import Path + +import numpy as np + +from src.cfg.config import config as hp + + +def process_text(train_text_path): + """ + Read .txt data. + """ + metadata_path = Path(train_text_path) + with metadata_path.open("r", encoding="utf-8") as file: + txt = [] + for line in file.readlines(): + txt.append(line) + + return txt + + +def pad_1d_tensor(inputs): + """ + Pad 1d tensor to fixed size. + """ + max_len = hp.character_max_length + padded = np.pad(inputs, (0, max_len - inputs.shape[0])) + + return padded + + +def pad_2d_tensor(inputs): + """ + Pad 2d tensor to fixed size. + """ + max_len = hp.mel_max_length + s = inputs.shape[1] + padded = np.pad(inputs, (0, max_len - inputs.shape[0]))[:, :s] + + return padded diff --git a/research/audio/FastSpeech/src/waveglow/__init__.py b/research/audio/FastSpeech/src/waveglow/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/waveglow/layers.py b/research/audio/FastSpeech/src/waveglow/layers.py new file mode 100644 index 0000000000000000000000000000000000000000..d33704b54c307165e8ba156ebec82ba4e9927fa4 --- /dev/null +++ b/research/audio/FastSpeech/src/waveglow/layers.py @@ -0,0 +1,38 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model layers.""" +from mindspore import nn + + +class Invertible1x1Conv(nn.Cell): + """ + The layer outputs both the convolution, + and the log determinant of its weight matrix. + """ + def __init__(self, c): + super().__init__() + self.conv = nn.Conv1d( + in_channels=c, + out_channels=c, + kernel_size=1, + stride=1, + padding=0, + has_bias=False, + ) + + def construct(self, z): + z = self.conv(z) + + return z diff --git a/research/audio/FastSpeech/src/waveglow/model.py b/research/audio/FastSpeech/src/waveglow/model.py new file mode 100644 index 0000000000000000000000000000000000000000..10a4eb4ae9dff5c74a2b6ca3c39a81c6928311bc --- /dev/null +++ b/research/audio/FastSpeech/src/waveglow/model.py @@ -0,0 +1,270 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model script.""" +import numpy as np +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops + +from src.waveglow.layers import Invertible1x1Conv +from src.waveglow.utils import fused_add_tanh_sigmoid_multiply + + +class WN(nn.Cell): + """ + This is the WaveNet like layer for the affine coupling. + The primary difference from WaveNet is the convolutions need not be causal. + There is also no dilation size reset. The dilation only doubles on each layer. + """ + def __init__( + self, + n_in_channels, + n_mel_channels, + n_layers, + n_channels, + kernel_size, + ): + super().__init__() + + self.n_layers = n_layers + self.n_channels = n_channels + self.in_layers = nn.CellList() + self.res_skip_layers = nn.CellList() + + self.start = nn.Conv1d( + in_channels=n_in_channels, + out_channels=n_channels, + kernel_size=1, + has_bias=True + ) + + self.end = nn.Conv1d( + in_channels=n_channels, + out_channels=2 * n_in_channels, + kernel_size=1, + has_bias=True + ) + + self.cond_layer = nn.Conv1d( + in_channels=n_mel_channels, + out_channels=2 * n_channels * n_layers, + kernel_size=1, + has_bias=True + ) + + for i in range(n_layers): + dilation = 2 ** i + padding = int((kernel_size * dilation - dilation) / 2) + + if i < n_layers - 1: + res_skip_channels = 2 * n_channels + else: + res_skip_channels = n_channels + + in_layer = nn.Conv1d( + in_channels=n_channels, + out_channels=2 * n_channels, + kernel_size=kernel_size, + dilation=dilation, + pad_mode='pad', + padding=padding, + has_bias=True + ) + + res_skip_layer = nn.Conv1d( + in_channels=n_channels, + out_channels=res_skip_channels, + kernel_size=1, + has_bias=True + ) + + self.in_layers.append(in_layer) + self.res_skip_layers.append(res_skip_layer) + + self.audio_zeros = Tensor(np.zeros((1, self.n_channels, 28800)), mstype.float32) + + def construct(self, audio, spect): + """ + Forward. 
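+
+        Runs the gated dilated-convolution stack: every layer adds its own
+        slice of the conditioning mel-spectrogram, applies the fused
+        tanh/sigmoid gate, and accumulates the skip output, which `self.end`
+        finally projects to 2 * n_in_channels channels (the log-scale and
+        bias used to invert the affine coupling in AudioCell).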
+ """ + audio = self.start(audio) + output = self.audio_zeros + + spect = self.cond_layer(spect) + + for i in range(self.n_layers): + spect_offset = i * 2 * self.n_channels + + acts = fused_add_tanh_sigmoid_multiply( + self.in_layers[i](audio), + spect[:, spect_offset: spect_offset + 2 * self.n_channels, :], + self.n_channels + ) + + res_skip_acts = self.res_skip_layers[i](acts) + if i < self.n_layers - 1: + audio = audio + res_skip_acts[:, :self.n_channels, :] + output = output + res_skip_acts[:, self.n_channels:, :] + else: + output = output + res_skip_acts + + output = self.end(output) + + return output + + +class WaveGlow(nn.Cell): + """WaveGlow vocoder inference model.""" + def __init__( + self, + n_mel_channels, + n_flows, + n_group, + n_early_every, + n_early_size, + wn_config, + sigma=1.0 + ): + super().__init__() + + self.upsample = nn.Conv1dTranspose( + in_channels=n_mel_channels, + out_channels=n_mel_channels, + pad_mode='valid', + kernel_size=1024, + stride=256, + has_bias=True, + ) + + self.n_flows = n_flows + self.n_group = n_group + self.n_early_every = n_early_every + self.n_early_size = n_early_size + self.wavenet = nn.CellList() + self.convinv = nn.CellList() + + n_half = int(n_group / 2) + n_remaining_channels = n_group + audio_cells_list = [] + + for k in range(n_flows): + use_data_append = False + if k % self.n_early_every == 0 and k > 0: + n_half = n_half - int(self.n_early_size / 2) + n_remaining_channels = n_remaining_channels - self.n_early_size + use_data_append = True + + audio_cells_list.insert( + 0, + AudioCell( + n_half=n_half, + n_mel_channels=n_mel_channels * n_group, + wn_config=wn_config, + use_data_append=use_data_append, + n_early_size=self.n_early_size, + sigma=sigma, + n_remaining_channels=n_remaining_channels, + ) + ) + + self.wavenet_blocks = nn.CellList(audio_cells_list) + + self.n_remaining_channels = n_remaining_channels + + self.concat = ops.Concat(axis=1) + self.transpose = ops.Transpose() + self.reshape = ops.Reshape() + + self.noise_shape = (1, self.n_remaining_channels, 28800) + self.audio = Tensor(np.random.standard_normal(self.noise_shape), mstype.float32) + + self.time_cutoff = self.upsample.kernel_size[1] - self.upsample.stride[1] + self.sigma = Tensor(sigma, mstype.float32) + + def construct(self, spect): + """ + Forward to mel-spectrogram. + + Args: + spect (Tensor): Mel-spectrogram. Shape (1, n_mel_channels, max_mel_len) + + Returns: + audio (Tensor): Raw audio. 
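+                Shape (1, n_group * 28800); the time dimension is fixed by
+                the pre-allocated noise tensors. The flow cells are applied
+                in inference (reversed-flow) order on sigma-scaled Gaussian
+                noise, conditioned on the upsampled mel-spectrogram.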
+ """ + spect = self.upsample(spect) + spect = spect[:, :, : - self.time_cutoff] + bs, mel_size, channels = spect.shape + + spect = self.reshape(spect, (bs, mel_size, channels // self.n_group, self.n_group)) + spect = self.transpose(spect, (0, 2, 1, 3)) + spect = self.transpose(spect.view(spect.shape[0], spect.shape[1], -1), (0, 2, 1)) + + audio = self.sigma * self.audio + + for audio_cell in self.wavenet_blocks: + audio = audio_cell(audio, spect) + + audio = self.transpose(audio, (0, 2, 1)).view(audio.shape[0], -1) + + return audio + + +class AudioCell(nn.Cell): + """Audio generator cell.""" + def __init__( + self, + n_half, + n_mel_channels, + wn_config, + use_data_append, + n_early_size, + sigma, + n_remaining_channels, + ): + super().__init__() + self.n_half = n_half + + self.wn_cell = WN(n_half, n_mel_channels, **wn_config) + self.convinv = Invertible1x1Conv(n_remaining_channels) + self.sigma = Tensor(sigma, mstype.float32) + + self.use_data_append = bool(use_data_append) + self.noise_shape = (1, n_early_size, 28800) + + self.z = Tensor(np.random.standard_normal(self.noise_shape), mstype.float32) + self.concat = ops.Concat(axis=1) + self.exp = ops.Exp() + + def construct(self, audio, spect): + """Iterationaly restore audio from spectrogram.""" + audio_0 = audio[:, :self.n_half, :] + audio_1 = audio[:, self.n_half:, :] + + output = self.wn_cell(audio_0, spect) + + s = output[:, self.n_half:, :] + b = output[:, :self.n_half, :] + + audio_1 = (audio_1 - b) / self.exp(s) + + audio = self.concat((audio_0, audio_1)) + audio = self.convinv(audio) + + if self.use_data_append: + z = self.z + audio = self.concat((self.sigma * z, audio)) + + return audio diff --git a/research/audio/FastSpeech/src/waveglow/utils.py b/research/audio/FastSpeech/src/waveglow/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..ba23c56ad9f48e95fc52d48c7b572dc807c30165 --- /dev/null +++ b/research/audio/FastSpeech/src/waveglow/utils.py @@ -0,0 +1,43 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Utils scripts.""" +from mindspore import ops + + +def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): + """ + Fusion method. + """ + n_channels_int = n_channels + in_act = input_a + input_b + + t_act = ops.Tanh()(in_act[:, :n_channels_int, :]) + s_act = ops.Sigmoid()(in_act[:, n_channels_int:, :]) + + acts = t_act * s_act + + return acts + + +def files_to_list(filename): + """ + Takes a text file of filenames and makes a list of filenames. 
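+    Lines are right-stripped, so trailing newlines and spaces are removed.
+
+    Example (illustrative file names only):
+        files_to_list('filelist.txt') -> ['wavs/LJ001-0001.wav', ...]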
+ """ + with open(filename, encoding='utf-8') as f: + files = f.readlines() + + files = [f.rstrip() for f in files] + + return files diff --git a/research/audio/FastSpeech/train.py b/research/audio/FastSpeech/train.py new file mode 100644 index 0000000000000000000000000000000000000000..80dda8331aaf6fb8a102f64a41d7b6eb8d020dd0 --- /dev/null +++ b/research/audio/FastSpeech/train.py @@ -0,0 +1,169 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Training script.""" +import os + +import numpy as np +from mindspore import Model +from mindspore import context +from mindspore.common import set_seed +from mindspore.communication.management import get_group_size +from mindspore.communication.management import get_rank +from mindspore.communication.management import init +from mindspore.context import ParallelMode +from mindspore.dataset import GeneratorDataset +from mindspore.nn import Adam +from mindspore.train.callback import CheckpointConfig +from mindspore.train.callback import LossMonitor +from mindspore.train.callback import ModelCheckpoint +from mindspore.train.callback import TimeMonitor + +from src.cfg.config import config as default_config +from src.dataset import BufferDataset +from src.dataset import get_data_to_buffer +from src.model import FastSpeech +from src.model import LossWrapper + +set_seed(1) + + +def _get_rank_info(target): + """ + Get rank size and rank id. + """ + if target == 'GPU': + num_devices = get_group_size() + device = get_rank() + else: + raise ValueError("Unsupported platform.") + + return num_devices, device + + +def lr_scheduler(cfg, steps_per_epoch, p_num): + """ + Init lr steps. + """ + d_model = cfg.decoder_dim + lr_init = np.power(d_model, -0.5) * cfg.lr_scale + warmup_steps = cfg.n_warm_up_step + total_steps = cfg.epochs * steps_per_epoch + + learning_rate = [] + for step in range(1, total_steps + 1): + lr_at_step = np.min([ + np.power(step * p_num, -0.5), + np.power(warmup_steps, -1.5) * step + ]) + learning_rate.append(lr_at_step * lr_init) + + return learning_rate + + +def set_trainable_params(params): + """ + Freeze positional encoding layers + and exclude it from trainable params for optimizer. 
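+
+    Parameters whose names end with 'position_enc.embedding_table' get
+    requires_grad set to False and are dropped from the returned list, so
+    the pre-computed sinusoid tables stay fixed during training.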
+ """ + trainable_params = [] + for param in params: + if param.name.endswith('position_enc.embedding_table'): + param.requires_grad = False + else: + trainable_params.append(param) + + return trainable_params + + +def main(): + """Trainloop.""" + config = default_config + device_target = config.device_target + + context.set_context(mode=context.GRAPH_MODE, device_target=device_target) + device_num = int(os.getenv('RANK_SIZE', '1')) + + if device_target == 'GPU': + if device_num > 1: + init(backend_name='nccl') + device_num = get_group_size() + device_id = get_rank() + context.reset_auto_parallel_context() + context.set_auto_parallel_context( + device_num=device_num, + parallel_mode=ParallelMode.DATA_PARALLEL, + gradients_mean=True, + ) + else: + device_num = 1 + device_id = config.device_id + context.set_context(device_id=device_id) + else: + raise ValueError("Unsupported platform.") + + if device_num > 1: + rank_size, rank_id = _get_rank_info(target=device_target) + else: + rank_size, rank_id = None, None + + net = FastSpeech() + network = LossWrapper(net) + network.set_train(True) + + buffer = get_data_to_buffer() + data = BufferDataset(buffer) + + dataloader = GeneratorDataset( + data, + column_names=['text', 'mel_pos', 'src_pos', 'mel_max_len', 'duration', 'mel_target'], + shuffle=True, + num_shards=rank_size, + shard_id=rank_id, + num_parallel_workers=1, + python_multiprocessing=False, + ) + + dataloader = dataloader.batch(config.batch_size, True) + batch_num = dataloader.get_dataset_size() + + lr = lr_scheduler(config, batch_num, device_num) + + trainable_params = set_trainable_params(network.trainable_params()) + opt = Adam(trainable_params, beta1=0.9, beta2=0.98, eps=1e-9, learning_rate=lr) + + model = Model(network, optimizer=opt) + + config_ck = CheckpointConfig( + save_checkpoint_steps=batch_num, + keep_checkpoint_max=config.keep_checkpoint_max, + ) + + loss_cb = LossMonitor(per_print_times=10) + time_cb = TimeMonitor(data_size=batch_num) + ckpt_cb = ModelCheckpoint( + prefix="FastSpeech", + directory=config.logs_dir, + config=config_ck, + ) + + cbs = [loss_cb, time_cb, ckpt_cb] + if device_num > 1 and device_id != config.device_start: + cbs = [loss_cb, time_cb] + + model.train(epoch=config.epochs, train_dataset=dataloader, callbacks=cbs, dataset_sink_mode=False) + + +if __name__ == "__main__": + main()
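+
+# Note on lr_scheduler(): it builds a Noam-style schedule,
+#     lr(step) = decoder_dim^(-0.5) * lr_scale
+#                * min((step * device_num)^(-0.5), step * n_warm_up_step^(-1.5)),
+# i.e. roughly linear warmup followed by inverse-square-root decay, with the
+# step count scaled by the number of devices. The exact constants come from
+# src/cfg/config.py.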