diff --git a/research/audio/FastSpeech/README.md b/research/audio/FastSpeech/README.md new file mode 100644 index 0000000000000000000000000000000000000000..b9e41e589ff152be77005cdb5d19a1944284bcc5 --- /dev/null +++ b/research/audio/FastSpeech/README.md @@ -0,0 +1,350 @@ +# Contents + +- [Contents](#contents) + - [FastSpeech Description](#fastspeech-description) + - [Model Architecture](#model-architecture) + - [Dataset](#dataset) + - [Environment Requirements](#environment-requirements) + - [Quick Start](#quick-start) + - [Script Description](#script-description) + - [Script and Sample Code](#script-and-sample-code) + - [Script Parameters](#script-parameters) + - [Training Process](#training-process) + - [Standalone Training](#standalone-training) + - [Distribute Training](#distribute-training) + - [Evaluation Process](#evaluation-process) + - [Checkpoints preparation](#checkpoints-preparation) + - [Evaluation](#evaluation) + - [Model Export](#model-export) + - [Model Description](#model-description) + - [Performance](#performance) + - [Training Performance](#training-performance) + - [Evaluation Performance](#evaluation-performance) + - [ModelZoo Homepage](#modelzoo-homepage) + +## [FastSpeech Description](#contents) + +Neural network based end-to-end text to speech (TTS) has significantly improved +the quality of synthesized speech. TTS methods usually first generate mel-spectrogram from text, +and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet (WaveGlow in that work). +Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is +usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). +In this work, we use feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we use previously extracted attention alignments from an encoder-decoder +based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target +mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and +repeating in particularly hard cases, and can adjust voice speed smoothly. + +[Paper](https://arxiv.org/pdf/1905.09263v5.pdf): FastSpeech: Fast, Robust and Controllable Text to Speech. + +## [Model Architecture](#contents) + +The architecture for FastSpeech is a feed-forward structure based on self-attention in Transformer +and 1D convolution. This structure is called Feed-Forward Transformer (FFT). Feed-Forward Transformer stacks multiple FFT blocks for phoneme to mel-spectrogram +transformation, with N blocks on the phoneme side, and N blocks on the mel-spectrogram side, with +a length regulator in between to bridge the length gap between the phoneme and mel-spectrogram sequence. +Each FFT block consists of a self-attention and 1D convolutional network. +The self-attention network consists of a multi-head attention to extract the cross-position information. +Different from the 2-layer dense network in Transformer, FastSpeech uses a 2-layer 1D convolutional network with ReLU activation. 
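+
+As an illustration, a minimal sketch of this position-wise convolutional feed-forward could look as follows (a simplified example, not the exact code from `src/transformer/sublayers.py`); the default dimensions mirror `encoder_dim`, `encoder_conv1d_filter_size` and `fft_conv1d_kernel` from `default_config.yaml`, with `pad_mode='same'` standing in for the explicit `fft_conv1d_padding`:
+
+```python
+import mindspore.nn as nn
+import mindspore.ops as ops
+
+
+class PositionWiseConvFFN(nn.Cell):
+    """2-layer 1D convolutional feed-forward used inside each FFT block."""
+    def __init__(self, d_model=256, d_inner=1024, kernels=(9, 1), dropout=0.1):
+        super().__init__()
+        self.conv1 = nn.Conv1d(d_model, d_inner, kernel_size=kernels[0], pad_mode='same')
+        self.conv2 = nn.Conv1d(d_inner, d_model, kernel_size=kernels[1], pad_mode='same')
+        self.relu = nn.ReLU()
+        self.layer_norm = nn.LayerNorm((d_model,))
+        self.dropout = nn.Dropout(keep_prob=1.0 - dropout)  # MindSpore 1.x Dropout uses keep_prob
+        self.transpose = ops.Transpose()
+
+    def construct(self, x):
+        # x: (batch, time, d_model)
+        residual = x
+        out = self.transpose(x, (0, 2, 1))                # (batch, d_model, time) for Conv1d
+        out = self.relu(self.conv1(out))
+        out = self.transpose(self.conv2(out), (0, 2, 1))  # back to (batch, time, d_model)
+        out = self.dropout(out)
+        return self.layer_norm(out + residual)
+```
+
+In the complete FFT block, a multi-head self-attention sub-layer runs before this feed-forward, and both sub-layers are wrapped with residual connections and layer normalization.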
+
+The motivation is that adjacent hidden states are more closely related in the character/phoneme and mel-spectrogram sequences of speech tasks.
+
+## [Dataset](#contents)
+
+We use the LJSpeech-1.1 dataset and alignments previously extracted by the teacher model.
+
+Dataset description: 3.8 GB of .wav files with annotated text (English speech only).
+
+- [Download](https://keithito.com/LJ-Speech-Dataset/) LJSpeech and extract it into your `datasets` folder.
+- [Download](https://github.com/xcmyz/FastSpeech/blob/master/alignments.zip) alignments and unzip them into the extracted LJSpeech dataset folder.
+
+> The original LJSpeech-1.1 dataset is not split into train/test parts.
+> We manually split it into 13000/100 (train/test) by selecting 100 test indices, which are stored in preprocess.py.
+> The indices are fixed, so you can reproduce our results.
+> You can also select your own indices and put them into _INDICES_FOR_TEST in preprocess.py.
+
+The original dataset structure is as follows:
+
+```text
+.
+└── LJSpeech-1.1
+  ├─ alignments/
+  ├─ wavs/
+  └─ metadata.csv
+```
+
+Note: Before pre-processing the dataset, you need to prepare the environment and install the requirements.
+The preprocessing script uses ~3.5 GB of GPU memory, so you can restrict the visible GPU devices if necessary.
+
+From the project folder, run the `preprocess.py` script located in the `data` folder with the following command:
+
+```bash
+python -m data.preprocess --dataset_path [PATH_TO_DATASET_FOLDER]
+```
+
+- PATH_TO_DATASET_FOLDER - path to the dataset root.
+
+Processed data will also be saved into the PATH_TO_DATASET_FOLDER folder.
+
+After pre-processing the data, the dataset structure should be as follows:
+
+```text
+.
+└── LJSpeech-1.1
+  ├─ alignments/
+  ├─ mels/
+  ├─ metadata.csv
+  ├─ metadata.txt
+  ├─ train_indices.txt
+  ├─ validation.txt
+  └─ wavs/
+```
+
+## [Environment Requirements](#contents)
+
+- Hardware (GPU).
+- Prepare the hardware environment with a GPU processor.
+- Framework
+    - [MindSpore](https://www.mindspore.cn/install/en)
+- For more information, please check the resources below:
+    - [MindSpore Tutorials](https://www.mindspore.cn/tutorials/en/master/index.html)
+    - [MindSpore Python API](https://www.mindspore.cn/docs/api/en/master/index.html)
+
+Note: We use MindSpore 1.6.0 for GPU, so make sure that you install version 1.6.0 or newer.
+
+## [Quick Start](#contents)
+
+After installing MindSpore through the official website, you can follow the steps below for training and evaluation.
+In particular, before training, install the requirements with the command `pip install -r requirements.txt`.
+
+Then run the training scripts as shown below.
+
+```example
+# Run standalone training example
+bash scripts/run_standalone_train_gpu.sh [DEVICE_ID] [LOGS_CKPT_DIR] [DATASET_ROOT]
+
+# Run distributed training example
+bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [LOGS_CKPT_DIR] [DATASET_ROOT]
+```
+
+## [Script Description](#contents)
+
+### [Script and Sample Code](#contents)
+
+```contents
+.
+└─FastSpeech + ├─README.md + ├─requirements.txt + ├─data + │ └─preprocess.py # data preprocessing script + ├─scripts + │ ├─run_distribute_train_gpu.sh # launch distribute train on GPU + │ ├─run_eval_gpu.sh # launch evaluation on GPU + │ └─run_standalone_train_gpu.sh # launch standalone train on GPU + ├─src + │ ├─audio + │ │ ├─__init__.py + │ │ ├─stft.py # audio processing scripts + │ │ └─tools.py # audio processing tools + │ ├─cfg + │ │ ├─__init__.py + │ │ └─config.py # config parser + │ ├─deepspeech2 + │ │ ├─__init__.py + │ │ ├─dataset.py # audio parser script for DeepSpeech2 + │ │ └─model.py # model scripts + │ ├─import_ckpt + │ │ ├─__init__.py + │ │ ├─import_deepspeech2.py # importer for DeepSpeech2 from < 1.5 MS versions + │ │ └─import_waveglow.py # importer for WaveGlow from .pickle + │ ├─text + │ │ ├─__init__.py + │ │ ├─cleaners.py # text cleaners scripts + │ │ ├─numbers.py # numbers to text preprocessing scripts + │ │ └─symbols.py # symbols dictionary + │ ├─transformer + │ │ ├─__init__.py + │ │ ├─constants.py # constants for transformer + │ │ ├─layers.py # layers initialization + │ │ ├─models.py # model blocks + │ │ ├─modules.py # model modules + │ │ └─sublayers.py # model sublayers + │ ├─waveglow + │ │ ├─__init__.py + │ │ ├─layers.py # model layers + │ │ ├─model.py # model scripts + │ │ └─utils.py # utils tools + │ ├─__init__.py + │ ├─dataset.py # create dataset + │ ├─metrics.py # metrics scripts + │ ├─model.py # model scripts + │ ├─modules.py # model modules + │ └─utils.py # utilities used in other scripts + ├─default_config.yaml # default configs + ├─eval.py # evaluation script + ├─export.py # export to MINDIR script + └─train.py # training script +``` + +### [Script Parameters](#contents) + +```parameters +all parameters and descriptions, except --config_path, stored into default_config.yaml + +usage: train.py [--config_path CONFIG_PATH] + [--device_target DEVICE_TARGET] + [--device_id DEVICE_ID] + [--logs_dir LOGS_DIR] + [--dataset_path DATASET_PATH] + [--epochs EPOCHS] + [--lr_scale LR_SCALE] +``` + +### [Training Process](#contents) + +#### Standalone Training + +```bash +bash scripts/run_standalone_train_gpu.sh [DEVICE_ID] [LOGS_CKPT_DIR] [DATASET_PATH] +``` + +The above command will run in the background, you can view the result through the generated standalone_train.log file. +After training, you can get the training loss and time logs in chosen logs dir: + +```log +epoch: 200 step: 406, loss is 0.8701540231704712 +epoch time: 168215.485 ms, per step time: 413.072 ms +``` + +The model checkpoints will be saved in logs outputs directory. + +#### Distribute Training + +```bash +bash scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [LOGS_CKPT_DIR] [DATASET_PATH] +``` + +The above shell script will run distributed training in the background. 
+After training, you can get the training results:
+
+```log
+epoch: 200 step: 50, loss is 0.9151536226272583
+epoch: 200 step: 50, loss is 0.9770485162734985
+epoch: 200 step: 50, loss is 0.9304656982421875
+epoch: 200 step: 50, loss is 0.8000383377075195
+epoch: 200 step: 50, loss is 0.8380972146987915
+epoch: 200 step: 50, loss is 0.854132890701294
+epoch: 200 step: 50, loss is 0.8262668251991272
+epoch: 200 step: 50, loss is 0.8031083345413208
+epoch time: 25208.625 ms, per step time: 504.173 ms
+epoch time: 25207.587 ms, per step time: 504.152 ms
+epoch time: 25206.404 ms, per step time: 504.128 ms
+epoch time: 25210.164 ms, per step time: 504.203 ms
+epoch time: 25210.281 ms, per step time: 504.206 ms
+epoch time: 25210.364 ms, per step time: 504.207 ms
+epoch time: 25210.161 ms, per step time: 504.203 ms
+epoch time: 25059.312 ms, per step time: 501.186 ms
+```
+
+Note: These are just example logs; the actual values may vary.
+
+### [Evaluation Process](#contents)
+
+#### Checkpoints preparation
+
+Before starting the evaluation process, you need to import the WaveGlow vocoder checkpoint (used to generate audio from the FastSpeech output mel-spectrograms) and the DeepSpeech2 checkpoint (used to compute the metrics).
+
+- [Download](https://download.mindspore.cn/model_zoo/r1.3/deepspeech2_gpu_v130_librispeech_research_audio_bs20_avgwer11.34_avgcer3.79/) the DeepSpeech2 checkpoint (saved with MindSpore < 1.5, so it cannot be loaded directly by newer MindSpore versions).
+
+To import the checkpoints, follow the steps below:
+
+- Run `import_deepspeech2.py`. The converted checkpoint will be saved in the same directory as the original and named `DeepSpeech2.ckpt`.
+
+```bash
+# from the project root folder
+python -m src.import_ckpt.import_deepspeech2 --ds_ckpt_url [CKPT_URL]  # weights in .ckpt format
+```
+
+- To get WaveGlow, take the following steps. We convert the original [checkpoint](https://drive.google.com/file/d/1rpK8CzAAirq9sWZhe9nlfvxMF1dRgFbF/view) to the .pickle format with numpy weights using the code below (the `glow.py` model definition must first be downloaded from the original WaveGlow [implementation](https://github.com/NVIDIA/waveglow)).
+
+```python
+# run this script in the same directory as glow.py
+import pickle
+
+import torch
+
+waveglow = torch.load(checkpoint_url)['model']  # checkpoint_url points to the original .pt object
+waveglow = waveglow.remove_weightnorm(waveglow)
+numpy_weights = {key: value.detach().numpy() for key, value in waveglow.named_parameters()}
+
+# save numpy_weights in .pickle format (any output path works, pass it to import_waveglow.py later)
+with open('waveglow_weights.pickle', 'wb') as file:
+    pickle.dump(numpy_weights, file)
+```
+
+Note: The original checkpoint is stored in the PyTorch format (.pth), so you need to install PyTorch before running the code above.
+
+- To import the .pickle WaveGlow checkpoint, run `import_waveglow.py`. The converted checkpoint will be saved in the same directory as the original and named `WaveGlow.ckpt`.
+
+```bash
+# from the project root folder
+python -m src.import_ckpt.import_waveglow --wg_ckpt_url [CKPT_URL]  # weights in .pickle format
+```
+
+#### Evaluation
+
+Before evaluation, make sure that you have the trained FastSpeech.ckpt, the converted WaveGlow.ckpt, and the converted DeepSpeech2.ckpt.
+To start evaluation, run the command below.
+
+```bash
+bash scripts/run_eval_gpu.sh [DEVICE_ID] [DATASET_PATH] [FS_CKPT_URL] [WG_CKPT_URL] [DS_CKPT_URL]
+```
+
+The above shell script will run in the background. You can view the results through the file "eval.log".
+ +```text +==========Evaluation results========== +Mean Frechet distance 201.42256 +Mean Kernel distance 0.02357 +Generated audios stored into results +``` + +### [Model Export](#contents) + +You can export the model to mindir format by running the following python script: + +```bash +python export.py --fs_ckpt_url [FS_CKPT_URL] +``` + +## [Model Description](#contents) + +### [Performance](#contents) + +#### Training Performance + +| Parameters | GPU (1p) | GPU (8p) | +| -------------------------- |----------------------------------------------------------- |---------------------------------------------------------------------- | +| Model | FastSpeech | FastSpeech | +| Hardware | 1 Nvidia Tesla V100-PCIE, CPU @ 3.40GHz | 8 Nvidia RTX 3090, Intel Xeon Gold 6226R CPU @ 2.90GHz | +| Upload Date | 14/03/2022 (day/month/year) | 14/03/2022 (day/month/year) | +| MindSpore Version | 1.6.0 | 1.6.0 | +| Dataset | LJSpeech-1.1 | LJSpeech-1.1 | +| Training Parameters | epochs=200, batch_size=32, warmup_steps=5000, lr_scale=1 | epochs=300, batch_size=32 (per device), warmup_steps=5000, lr_scale=2 | +| Optimizer | Adam (beta1=0.9, beta2=0.98, eps=1e-9) | Adam (beta1=0.9, beta2=0.98, eps=1e-9) | +| Loss Function | MSE, L1 | MSE, L1 | +| Speed | ~412 ms/step | ~504 ms/step | +| Total time | ~9.3 hours | ~2.1 hours | + +Note: lr scheduler was taken from [Attention Is All You Need](https://arxiv.org/pdf/1706.03762.pdf) paper. + +#### Evaluation Performance + +| Parameters | Trained on GPU (1p) | Trained on GPU (8p) | +| ------------------- |-------------------------------------------------------- |----------------------------------------------------------- | +| Model | FastSpeech | FastSpeech | +| Resource | 1 Nvidia Tesla V100-PCIE, CPU @ 3.40GHz | 1 Nvidia Tesla V100-PCIE, CPU @ 3.40GHz | +| Upload Date | 14/03/2022 (day/month/year) | 14/03/2022 (day/month/year) | +| MindSpore Version | 1.6.0 | 1.6.0 | +| Dataset | LJSpeech-1.1 | LJSpeech-1.1 | +| Batch_size | 1 | 1 | +| Outputs | Mel-spectrogram, mel duration | Mel-spectrogram, mel duration | +| Metric | (classifier distances) Frechet 201.42256, Kernel 0.02357 | (classifier distances) Frechet 203.89236, Kernel 0.02386 | + +## [ModelZoo Homepage](#contents) + + Please check the official [homepage](https://gitee.com/mindspore/models). diff --git a/research/audio/FastSpeech/data/preprocess.py b/research/audio/FastSpeech/data/preprocess.py new file mode 100644 index 0000000000000000000000000000000000000000..95c3d758c56f6c1b3d45b6d24871ca8a67e43012 --- /dev/null +++ b/research/audio/FastSpeech/data/preprocess.py @@ -0,0 +1,128 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Dataset preprocess script.""" +import os +from pathlib import Path + +import numpy as np + +from src.audio.tools import get_mel +from src.cfg.config import config as hp + +# Original dataset contains 13100 samples and not splited into parts. 
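+# (13100 = 13000 train + 100 test, as described in the Dataset section of the README.)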
+# We manually selected 100 test indices and fixed it to be able to reproduce results. +_INDICES_FOR_TEST = ( + 3667, 8887, 10353, 7657, 1498, 2758, 4913, 1697, 5653, 1911, + 12032, 8925, 11517, 5881, 6575, 120, 6232, 11680, 8433, 1728, + 12771, 11738, 6574, 12918, 9836, 7556, 2231, 7916, 5985, 3148, + 2596, 1709, 5841, 5383, 6248, 9831, 7667, 10944, 2833, 614, + 11990, 6894, 12645, 5422, 12015, 447, 7108, 2973, 9937, 11938, + 3626, 11406, 2853, 6379, 1621, 3981, 5486, 3902, 10925, 4249, + 6518, 3376, 1998, 10250, 10145, 7325, 2665, 61, 2709, 11683, + 8776, 10979, 8834, 4805, 4565, 2577, 9369, 4422, 8212, 5871, + 10721, 6046, 5129, 9610, 821, 4378, 693, 10500, 5027, 1663, + 6946, 2460, 6068, 4329, 11001, 10122, 9154, 6990, 8908, 2530, +) + + +def preprocess_ljspeech(root_dir): + """Preprocess LJSpeech dataset.""" + in_dir = root_dir + out_dir = os.path.join(in_dir, 'mels') + + if not os.path.exists(out_dir): + os.makedirs(out_dir, exist_ok=True) + + metadata = build_from_path(in_dir, out_dir) + write_metadata(metadata, in_dir) + train_test_split(in_dir) + + +def write_metadata(metadata, out_dir): + """Write clear metadata.""" + with Path(out_dir, 'metadata.txt').open('w', encoding='utf-8') as file: + for m in metadata: + file.write(m + '\n') + + +def build_from_path(in_dir, out_dir): + """Get text and preprocess .wavs to mels.""" + index = 1 + texts = [] + + with Path(in_dir, 'metadata.csv').open('r', encoding='utf-8') as file: + for line in file.readlines(): + if index % 100 == 0: + print("{:d} Done".format(index)) + + parts = line.strip().split('|') + wav_path = os.path.join(in_dir, 'wavs', '%s.wav' % parts[0]) + text = parts[2] + texts.append(_process_utterance(out_dir, index, wav_path, text)) + + index = index + 1 + + return texts + + +def _process_utterance(out_dir, index, wav_path, text): + """Preprocess .wav to mel and save.""" + # Compute a mel-scale spectrogram from the wav: + mel_spectrogram = get_mel(wav_path) + + # Write the spectrograms to disk: + mel_filename = 'ljspeech-mel-%05d.npy' % index + np.save( + os.path.join(out_dir, mel_filename), + mel_spectrogram.T, + allow_pickle=False + ) + + return text + + +def train_test_split(folder_path): + """Prepare data for training and validation format.""" + test_indices = np.array(_INDICES_FOR_TEST) + + with Path(folder_path, 'metadata.csv').open('r') as file: + metadata = file.readlines() + dataset_size = len(metadata) + + test_metadata = [] + all_indices = np.arange(dataset_size) + train_indices = np.delete(all_indices, test_indices) + + with Path(folder_path, 'train_indices.txt').open('w') as file: + for i in train_indices: + file.write(f'{i}\n') + + for i, line in enumerate(metadata): + if i in test_indices: + wav_name, _, text = line.strip().split('|') + test_data = f'{wav_name}|{text}\n' + test_metadata.append(test_data) + + with Path(folder_path, 'validation.txt').open('w') as file: + for line in test_metadata: + file.write(line) + + +def main(): + preprocess_ljspeech(hp.dataset_path) + + +if __name__ == "__main__": + main() diff --git a/research/audio/FastSpeech/default_config.yaml b/research/audio/FastSpeech/default_config.yaml new file mode 100644 index 0000000000000000000000000000000000000000..c1ade25217a7e1d706556b08c25f3e188c30e91e --- /dev/null +++ b/research/audio/FastSpeech/default_config.yaml @@ -0,0 +1,144 @@ +# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing) + +# Mel +num_mels: 80 +text_cleaners: ['english_cleaners'] + +# FastSpeech +vocab_size: 300 
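+# (vocab_size above is expected to cover all symbol ids produced by src/text/symbols.py)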
+max_seq_len: 3000 +encoder_dim: 256 +encoder_n_layer: 4 +encoder_head: 2 +encoder_conv1d_filter_size: 1024 +decoder_dim: 256 +decoder_n_layer: 4 +decoder_head: 2 +decoder_conv1d_filter_size: 1024 +fft_conv1d_kernel: [9, 1] +fft_conv1d_padding: [4, 0] +duration_predictor_filter_size: 256 +duration_predictor_kernel_size: 3 +dropout: 0.1 + +# Train +batch_size: 32 # per one device +epochs: 200 +n_warm_up_step: 5000 +lr_scale: 1 +mel_max_length: 900 +character_max_length: 200 +keep_checkpoint_max: 10 + +# Eval +mel_val_len: 3500 + +# Other +alpha: 1 +device_target: 'GPU' +device_id: 0 +device_start: 0 +logs_dir: 'logs' +output_dir: 'results' +dataset_path: '/path/to/LJSpeech-1.1' +fs_ckpt_url: '/path/to/fastspeech/ckpt' +wg_ckpt_url: '/path/to/waveglow/ckpt' +ds_ckpt_url: '/path/to/deepspeech/ckpt' + +# WaveGlow +wg_n_mel_channels: 80 +wg_n_flows: 12 +wg_n_group: 8 +wg_n_early_every: 4 +wg_n_early_size: 2 +wg_n_layers: 8 +wg_n_channels: 256 +wg_kernel_size: 3 +wg_wav_value: 32768 +wg_sampling_rate: 22050 + +# DeepSpeech2 +ds_sampling_rate: 16000 +ds_window_size: 0.02 +ds_window_stride: 0.01 +ds_window: 'hanning' +ds_rnn_type: 'LSTM' +ds_hidden_size: 1024 +ds_hidden_layers: 5 +ds_lookahead_context: 20 +labels: "'ABCDEFGHIJKLMNOPQRSTUVWXYZ _" + +# Audio +au_max_wav_value: 32768 +au_sampling_rate: 22050 +au_filter_length: 1024 +au_hop_length: 256 +au_win_length: 1024 +au_n_mel_channels: 80 +au_mel_fmin: 0 +au_mel_fmax: 8000 + +--- +# Config description for each option +num_mels: "Number of channels of mel-spectrogram." +text_cleaners: "Chosen language pipeline." +vocab_size: "Vocabulary size" +max_seq_len: "Max sequence length." +encoder_dim: "Encoder dimension." +encoder_n_layer: "Number of encoder layers." +encoder_head: "Number of encoders head." +encoder_conv1d_filter_size: "Conv out filters of encoder." +decoder_dim: "Decoder dimension." +decoder_n_layer: "Number of decoder layers." +decoder_head: "Number of decoder head." +decoder_conv1d_filter_size: "Conv out filters of decoder." +fft_conv1d_kernel: "Conv kernel size of FFT block." +fft_conv1d_padding: "Conv padding of FFT block." +duration_predictor_filter_size: "Conv out filters of duration predictor." +duration_predictor_kernel_size: "Conv kernel size of duration predictor." +dropout: "Dropout ratio." +batch_size: "Batch size for training." +epochs: "Num of training epochs." +n_warm_up_step: "Num of warmup steps." +lr_scale: "Learning rate multiplier." +mel_max_length: "Pad all samples of mels to max len during training." +character_max_length: "Pad all samples of character sequences to max len during training." +keep_checkpoint_max: "Save last N checkpoints during train." +mel_val_len: "Max mel length at validation." +alpha: "Speech speed regulator." +device_target: "Target device platform." +device_id: "Device id of the target platform." +device_start: "Main device for distribute training." +logs_dir: "Output logs dir." +output_dir: "Output dir for synthesized audio." +dataset_path: "Path to dataset folder." +fs_ckpt_url: "Path to FastSpeech checkpoint." +wg_ckpt_url: "Path to WaveGlow checkpoint." +ds_ckpt_url: "Path to DeepSpeech2 checkpoint." +wg_n_mel_channels: "WaveGlow num of mel-spectrogram channels." +wg_n_flows: "WaveGlow num cells." +wg_n_group: "WaveGlow num layers in cell." +wg_n_early_every: "WaveGlow add noise every." +wg_n_early_size: "WaveGlow param." +wg_n_layers: "WaveGlow num layers." +wg_n_channels: "WaveGlow num channels." +wg_kernel_size: "WaveGlow kernel size." +wg_wav_value: "WaveGlow audio wav value." 
+wg_sampling_rate: "WaveGlow audio sampling rate." +ds_sampling_rate: "DeepSpeech2 audio param." +ds_window_size: "DeepSpeech2 window size." +ds_window_stride: "DeepSpeech2 window stride." +ds_window: "DeepSpeech2 window type." +ds_rnn_type: "DeepSpeech2 rnn type." +ds_hidden_size: "DeepSpeech2 size of hidden layer." +ds_hidden_layers: "DeepSpeech2 num hidden layers." +ds_lookahead_context: "DeepSpeech2 param." +labels: "Symbols for the DeepSpeech2 model." +au_max_wav_value: "DeepSpeech2 audio max wav value." +au_sampling_rate: "DeepSpeech2 audio sampling rate." +au_filter_length: "DeepSpeech2 audio filter length." +au_hop_length: "DeepSpeech2 audio hop length." +au_win_length: "DeepSpeech2 audio window length." +au_n_mel_channels: "DeepSpeech2 audio num mel channels." +au_mel_fmin: "DeepSpeech2 audio mel fmin." +au_mel_fmax: "DeepSpeech2 audio mel fmax." \ No newline at end of file diff --git a/research/audio/FastSpeech/eval.py b/research/audio/FastSpeech/eval.py new file mode 100644 index 0000000000000000000000000000000000000000..1613f08363104b381a125d86216ac2dedae52cba --- /dev/null +++ b/research/audio/FastSpeech/eval.py @@ -0,0 +1,206 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Evaluation script.""" +import os +from pathlib import Path + +import numpy as np +from mindspore import Tensor +from mindspore import context +from mindspore import dtype as mstype +from mindspore import load_checkpoint +from mindspore.common import set_seed +from scipy.io.wavfile import write + +from src.cfg.config import config as hp +from src.dataset import get_val_data +from src.deepspeech2.dataset import LoadAudioAndTranscript +from src.deepspeech2.model import DeepSpeechModel +from src.metrics import frechet_classifier_distance_from_activations +from src.metrics import kernel_classifier_distance_and_std_from_activations +from src.model import FastSpeech +from src.model import FastSpeechEval +from src.waveglow.model import WaveGlow + +set_seed(1) + + +def save_audio(audio, audio_length, save_root_dir, name, audio_cfg): + """Process raw audio and save as .wav audio file.""" + audio_length = int(audio_length.asnumpy()) + audio = audio[:, :audio_length] * audio_cfg['wav_value'] + audio = (audio.asnumpy().squeeze()).astype('int16') + + audio_path = os.path.join(save_root_dir, name + '_synth.wav') + write(audio_path, audio_cfg['sampling_rate'], audio) + + return audio_path + + +def get_waveglow(ckpt_url): + """ + Init WaveGlow vocoder model with weights. + Used to generate realistic audio from mel-spectrogram. 
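+    Returns the model together with an audio config dict that is later consumed by save_audio().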
+ """ + wn_config = { + 'n_layers': hp.wg_n_layers, + 'n_channels': hp.wg_n_channels, + 'kernel_size': hp.wg_kernel_size + } + + audio_config = { + 'wav_value': hp.wg_wav_value, + 'sampling_rate': hp.wg_sampling_rate + } + + model = WaveGlow( + n_mel_channels=hp.wg_n_mel_channels, + n_flows=hp.wg_n_flows, + n_group=hp.wg_n_group, + n_early_every=hp.wg_n_early_every, + n_early_size=hp.wg_n_early_size, + wn_config=wn_config + ) + + load_checkpoint(ckpt_url, model) + model.set_train(False) + + return model, audio_config + + +def get_deepspeech(ckpt_url): + """ + Init DeepSpeech2 model with weights. + Used to get activations from lstm layers to compute metrics. + """ + spect_config = { + 'sampling_rate': hp.ds_sampling_rate, + 'window_size': hp.ds_window_size, + 'window_stride': hp.ds_window_stride, + 'window': hp.ds_window + } + + model = DeepSpeechModel( + batch_size=1, + rnn_hidden_size=hp.ds_hidden_size, + nb_layers=hp.ds_hidden_layers, + labels=hp.labels, + rnn_type=hp.ds_rnn_type, + audio_conf=spect_config, + bidirectional=True + ) + + load_checkpoint(ckpt_url, model) + model.set_train(False) + + return model, spect_config + + +def get_fastspeech(ckpt_url): + """ + Init FastSpeech model with weights. + Used to generate mel-spectrogram from sequence (text). + """ + model = FastSpeech() + + load_checkpoint(ckpt_url, model) + model.set_train(False) + + return model + + +def activation_from_audio(loader, model, path): + """ + Compute activations of audio to get metric. + + Args: + loader (class): Audio loader. + model (nn.Cell): DeepSpeech2 model. + path (str): Path to the audio. + + Returns: + activation (np.array): Activations from last lstm layer. + """ + metric_mel = loader.parse_audio(audio_path=path) + metric_mel_len = Tensor([metric_mel.shape[1]], mstype.float32) + metric_mel_padded = np.pad(metric_mel, (0, hp.mel_val_len - metric_mel.shape[1]))[:metric_mel.shape[0], :] + metric_mel_padded = Tensor(np.expand_dims(np.expand_dims(metric_mel_padded, 0), 0), mstype.float32) + + _, output_length, activation = model(metric_mel_padded, metric_mel_len) + output_length = int(output_length.asnumpy()) + + activation = activation.asnumpy().transpose((1, 0, 2)).squeeze() + clear_activation = activation[:output_length, :] + + return clear_activation + + +def main(args): + fastspeech = get_fastspeech(args.fs_ckpt_url) + waveglow, audio_config = get_waveglow(args.wg_ckpt_url) + deepspeech, spect_config = get_deepspeech(args.ds_ckpt_url) + + audio_loader = LoadAudioAndTranscript(spect_config) + + model = FastSpeechEval( + mel_generator=fastspeech, + vocoder=waveglow, + config=args + ) + + data_list = get_val_data(hp.dataset_path) + + if not os.path.exists(hp.output_dir): + os.makedirs(hp.output_dir, exist_ok=True) + + frechet, kernel = [], [] + + for sequence, src_pos, target_audio_path in data_list: + raw_audio, audio_len = model.get_audio(sequence, src_pos) + + audio_path = save_audio( + audio=raw_audio, + audio_length=audio_len, + save_root_dir=args.output_dir, + audio_cfg=audio_config, + name=Path(target_audio_path).stem + ) + + activation = activation_from_audio(audio_loader, deepspeech, audio_path) + activation_target = activation_from_audio(audio_loader, deepspeech, target_audio_path) + + frechet_distance = frechet_classifier_distance_from_activations( + activations1=activation, + activations2=activation_target, + ) + + kernel_distance, _ = kernel_classifier_distance_and_std_from_activations( + activations1=activation, + activations2=activation_target, + ) + + frechet.append(frechet_distance) + 
kernel.append(kernel_distance) + + print('=' * 10 + 'Evaluation results' + '=' * 10) + print(f'Mean Frechet distance {round(float(np.mean(np.array(frechet))), 5)}') + print(f'Mean Kernel distance {round(float(np.mean(np.array(kernel))), 5)}') + print(f'Generated audios stored into {args.output_dir}') + + +if __name__ == "__main__": + context.set_context(mode=context.GRAPH_MODE, device_target=hp.device_target) + context.set_context(device_id=hp.device_id) + main(hp) diff --git a/research/audio/FastSpeech/export.py b/research/audio/FastSpeech/export.py new file mode 100644 index 0000000000000000000000000000000000000000..f3fb0b5dc5851471feee0120d82f7ba838a9a052 --- /dev/null +++ b/research/audio/FastSpeech/export.py @@ -0,0 +1,56 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Run export""" +from pathlib import Path + +import numpy as np +from mindspore import Tensor +from mindspore import context +from mindspore import dtype as mstype +from mindspore import load_checkpoint +from mindspore.train.serialization import export + +from src.cfg.config import config as default_config +from src.model import FastSpeech + + +def run_export(config): + """ + Export model to MINDIR. 
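+    The exported file is written next to the checkpoint, reusing the checkpoint file stem as its name.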
+ """ + model = FastSpeech() + + load_checkpoint(config.fs_ckpt_url, model) + model.set_train(False) + + input_1 = Tensor(np.ones([1, config.character_max_length]), dtype=mstype.float32) + input_2 = Tensor(np.ones([1, config.character_max_length]), dtype=mstype.float32) + name = Path(config.fs_ckpt_url).stem + path = Path(config.fs_ckpt_url).resolve().parent + save_path = str(Path(path, name)) + + export(model, input_1, input_2, file_name=save_path, file_format='MINDIR') + print('Model exported successfully!') + print(f'Path to exported model {save_path}.mindir') + + +if __name__ == "__main__": + context.set_context( + mode=context.GRAPH_MODE, + device_target=default_config.device_target, + device_id=default_config.device_id, + ) + + run_export(default_config) diff --git a/research/audio/FastSpeech/requirements.txt b/research/audio/FastSpeech/requirements.txt new file mode 100644 index 0000000000000000000000000000000000000000..abb72cbec91fd6c166d067a844aa1cbc77f9984e --- /dev/null +++ b/research/audio/FastSpeech/requirements.txt @@ -0,0 +1,6 @@ +PyYAML +scipy>=1.5.3 +inflect>=5.4.0 +Unidecode>=1.3.3 +librosa>=0.9.1 +SoundFile>=0.10.3.post1 \ No newline at end of file diff --git a/research/audio/FastSpeech/scripts/run_distribute_train_gpu.sh b/research/audio/FastSpeech/scripts/run_distribute_train_gpu.sh new file mode 100644 index 0000000000000000000000000000000000000000..53a8804c2c83a45e741117314b05e928523a6a7f --- /dev/null +++ b/research/audio/FastSpeech/scripts/run_distribute_train_gpu.sh @@ -0,0 +1,53 @@ +#!/bin/bash +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +if [[ $# -ne 3 ]]; then + echo "Usage: bash ./scripts/run_distribute_train_gpu.sh [DEVICE_NUM] [LOGS_CKPT_DIR] [DATASET_ROOT]" +exit 1; +fi + +export RANK_SIZE=$1 + +get_real_path(){ + if [ "${1:0:1}" == "/" ]; then + echo "$1" + else + realpath -m "$PWD/$1" + fi +} + +CONFIG_FILE_BASE="./default_config.yaml" +CONFIG_FILE=$(get_real_path "$CONFIG_FILE_BASE") +DATASET_ROOT=$(get_real_path "$3") +LOGS_ROOT=$(get_real_path "$2") + +if [ ! 
-d "$LOGS_ROOT" ]; then + mkdir "$LOGS_ROOT" + mkdir "$LOGS_ROOT/training_configs" +fi + +cp ./*.py "$LOGS_ROOT"/training_configs +cp ./*.yaml "$LOGS_ROOT"/training_configs +cp -r ./src "$LOGS_ROOT"/training_configs + +mpirun -n $1 --allow-run-as-root \ + python train.py \ + --device_target="GPU" \ + --logs_dir="$LOGS_ROOT" \ + --dataset_path="$DATASET_ROOT" \ + --config_path="$CONFIG_FILE" \ + --epochs=300 \ + --lr_scale=2 \ + > "$LOGS_ROOT"/distribute_train.log 2>&1 & diff --git a/research/audio/FastSpeech/scripts/run_eval_gpu.sh b/research/audio/FastSpeech/scripts/run_eval_gpu.sh new file mode 100644 index 0000000000000000000000000000000000000000..01ceb3872234765780a580ef532e2e5bc7c41619 --- /dev/null +++ b/research/audio/FastSpeech/scripts/run_eval_gpu.sh @@ -0,0 +1,49 @@ +#!/bin/bash +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +if [[ $# -ne 5 ]]; then + echo "Usage: bash ./scripts/run_eval_gpu.sh [DEVICE_ID] [DATASET_PATH] [FS_CKPT_URL] [WG_CKPT_URL] [DS_CKPT_URL]" +exit 1; +fi + +export CUDA_VISIBLE_DEVICES=$1 + +get_real_path(){ + if [ "${1:0:1}" == "/" ]; then + echo "$1" + else + realpath -m "$PWD/$1" + fi +} + +CONFIG_FILE_BASE="./default_config.yaml" +OUTPUT_DIR_BASE="./results" +OUTPUT_ROOT=$(get_real_path "$OUTPUT_DIR_BASE") +CONFIG_FILE=$(get_real_path "$CONFIG_FILE_BASE") +DATASET_ROOT=$(get_real_path "$2") +FS_CKPT=$(get_real_path "$3") +WG_CKPT=$(get_real_path "$4") +DS_CKPT=$(get_real_path "$5") + +python eval.py \ + --device_target="GPU" \ + --device_id=0 \ + --output_dir="$OUTPUT_ROOT" \ + --dataset_path="$DATASET_ROOT" \ + --config_path="$CONFIG_FILE" \ + --fs_ckpt_url="$FS_CKPT" \ + --wg_ckpt_url="$WG_CKPT" \ + --ds_ckpt_url="$DS_CKPT" \ + > eval.log 2>&1 & diff --git a/research/audio/FastSpeech/scripts/run_standalone_train_gpu.sh b/research/audio/FastSpeech/scripts/run_standalone_train_gpu.sh new file mode 100644 index 0000000000000000000000000000000000000000..fd1a5bc41c8d97c4cd992a557d15a76cfecf9299 --- /dev/null +++ b/research/audio/FastSpeech/scripts/run_standalone_train_gpu.sh @@ -0,0 +1,51 @@ +#!/bin/bash +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +if [[ $# -ne 3 ]]; then + echo "Usage: bash ./scripts/run_standalone_train_gpu.sh [DEVICE_ID] [LOGS_CKPT_DIR] [DATASET_ROOT]" +exit 1 +fi + +export CUDA_VISIBLE_DEVICES=$1 + +get_real_path(){ + if [ "${1:0:1}" == "/" ]; then + echo "$1" + else + realpath -m "$PWD/$1" + fi +} + +CONFIG_FILE_BASE="./default_config.yaml" +CONFIG_FILE=$(get_real_path "$CONFIG_FILE_BASE") +DATASET_ROOT=$(get_real_path "$3") +LOGS_ROOT=$(get_real_path "$2") + +if [ ! -d "$LOGS_ROOT" ]; then + mkdir "$LOGS_ROOT" + mkdir "$LOGS_ROOT/training_configs" +fi + +cp ./*.py "$LOGS_ROOT"/training_configs +cp ./*.yaml "$LOGS_ROOT"/training_configs +cp -r ./src "$LOGS_ROOT"/training_configs + +python train.py \ + --device_target="GPU" \ + --device_id=0 \ + --logs_dir="$LOGS_ROOT" \ + --dataset_path="$DATASET_ROOT" \ + --config_path="$CONFIG_FILE" \ + > "$LOGS_ROOT"/standalone_train.log 2>&1 & diff --git a/research/audio/FastSpeech/src/__init__.py b/research/audio/FastSpeech/src/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/audio/__init__.py b/research/audio/FastSpeech/src/audio/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/audio/stft.py b/research/audio/FastSpeech/src/audio/stft.py new file mode 100644 index 0000000000000000000000000000000000000000..a4242e4457b4f907b27ddb6f2c05efc0fff3dca4 --- /dev/null +++ b/research/audio/FastSpeech/src/audio/stft.py @@ -0,0 +1,148 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Tacotron module.""" +import numpy as np +from librosa.filters import mel as librosa_mel_fn +from librosa.util import pad_center +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore.ops import Conv2D +from scipy.signal import get_window + + +class STFT: + """Mel-spectrogram transformer.""" + def __init__( + self, + filter_length=800, + hop_length=200, + win_length=800, + window='hann' + ): + super().__init__() + self.filter_length = filter_length + self.hop_length = hop_length + self.win_length = win_length + self.window = window + self.forward_transform = None + + scale = self.filter_length / self.hop_length + fourier_basis = np.fft.fft(np.eye(self.filter_length)) + + cutoff = int((self.filter_length / 2 + 1)) + fourier_basis = np.vstack( + [ + np.real(fourier_basis[:cutoff, :]), + np.imag(fourier_basis[:cutoff, :]) + ] + ) + + forward_basis = fourier_basis[:, None, :].astype(np.float32) + inverse_basis = np.linalg.pinv(scale * fourier_basis).T[:, None, :].astype(np.float32) + + if window is not None: + assert filter_length >= win_length + # get window and zero center pad it to filter_length + fft_window = get_window(window, win_length, fftbins=True) + fft_window = pad_center(fft_window, size=filter_length) + fft_window = np.array(fft_window, np.float32) + + # window the bases + forward_basis *= fft_window + inverse_basis *= fft_window + + self.forward_basis = forward_basis.astype(np.float32) + self.inverse_basis = inverse_basis.astype(np.float32) + + self.conv = Conv2D( + out_channel=self.forward_basis.shape[0], + kernel_size=self.forward_basis.shape[1:], + stride=self.hop_length, + pad_mode='pad', + pad=0 + ) + + def transform(self, input_data): + """Transforms input wav to raw mel-spect data.""" + num_batches = input_data.shape[0] + num_samples = input_data.shape[1] + + input_data = input_data.reshape(num_batches, 1, num_samples) + input_data = np.pad(np.squeeze(input_data), int(self.filter_length / 2), mode='reflect') + + input_data = np.expand_dims(np.expand_dims(np.expand_dims(input_data, 0), 0), 0) + + forward_transform = self.conv( + Tensor(input_data, mstype.float32), + Tensor(np.expand_dims(self.forward_basis, 1), mstype.float32), + ) + + forward_transform = forward_transform.asnumpy().squeeze(2) + + cutoff = int((self.filter_length / 2) + 1) + real_part = forward_transform[:, :cutoff, :] + imag_part = forward_transform[:, cutoff:, :] + + magnitude = np.sqrt(real_part ** 2 + imag_part ** 2) + phase = np.arctan2(imag_part, real_part) + + return magnitude, phase + + +class TacotronSTFT: + """Tacotron.""" + def __init__( + self, + filter_length=1024, + hop_length=256, + win_length=1024, + n_mel_channels=80, + sampling_rate=22050, + mel_fmin=0.0, + mel_fmax=8000.0 + ): + super().__init__() + self.n_mel_channels = n_mel_channels + self.sampling_rate = sampling_rate + self.stft_fn = STFT(filter_length, hop_length, win_length) + + self.mel_basis = librosa_mel_fn( + sr=sampling_rate, + n_fft=filter_length, + n_mels=n_mel_channels, + fmin=mel_fmin, + fmax=mel_fmax + ) + + def spectral_normalize(self, x): + """Normalize magnitudes.""" + output = np.log(np.clip(x, a_min=1e-5, a_max=np.max(x))) + return output + + def mel_spectrogram(self, y): + """ + Computes mel-spectrogram from wav. + + Args: + y (np.array): Raw mel-spectrogram with shape (B, T) in range [-1, 1]. + + Returns: + mel_output (np.array): Mel-spectrogram with shape (B, n_mel_channels, T). 
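+                Values are log-compressed by spectral_normalize (natural log, clipped below at 1e-5).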
+ """ + magnitudes, _ = self.stft_fn.transform(y) + mel_output = np.matmul(self.mel_basis, magnitudes) + mel_output = self.spectral_normalize(mel_output) + + return mel_output diff --git a/research/audio/FastSpeech/src/audio/tools.py b/research/audio/FastSpeech/src/audio/tools.py new file mode 100644 index 0000000000000000000000000000000000000000..59cdc06cfc37cf38149eaa8ef0a783847f2899d2 --- /dev/null +++ b/research/audio/FastSpeech/src/audio/tools.py @@ -0,0 +1,47 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Preprocessing tools.""" +import numpy as np +from scipy.io.wavfile import read + +from src.audio import stft +from src.cfg.config import config + +_stft = stft.TacotronSTFT( + config.au_filter_length, + config.au_hop_length, + config.au_win_length, + config.au_n_mel_channels, + config.au_sampling_rate, + config.au_mel_fmin, + config.au_mel_fmax, +) + + +def load_wav_to_array(full_path): + """Load wav file as numpy array.""" + sampling_rate, data = read(full_path) + return data.astype(np.float32), sampling_rate + + +def get_mel(filename): + """Process loaded audio to mel-spectrogram.""" + audio, _ = load_wav_to_array(filename) + audio_norm = audio / config.au_max_wav_value + audio_norm = np.expand_dims(audio_norm, 0) + melspec = _stft.mel_spectrogram(audio_norm) + melspec = np.squeeze(melspec, 0) + + return melspec diff --git a/research/audio/FastSpeech/src/cfg/__init__.py b/research/audio/FastSpeech/src/cfg/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/cfg/config.py b/research/audio/FastSpeech/src/cfg/config.py new file mode 100644 index 0000000000000000000000000000000000000000..e2cb9ec894cdbb2b9febc07b472954ba2096b817 --- /dev/null +++ b/research/audio/FastSpeech/src/cfg/config.py @@ -0,0 +1,129 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Parse arguments""" +import argparse +import ast +from pprint import pformat + +import yaml + + +class Config: + """ + Configuration namespace, convert dictionary to members. 
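+    Nested dictionaries are converted recursively; lists and tuples of dictionaries become lists of Config objects.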
+ """ + def __init__(self, cfg_dict): + for k, v in cfg_dict.items(): + if isinstance(v, (list, tuple)): + setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v]) + else: + setattr(self, k, Config(v) if isinstance(v, dict) else v) + + def __str__(self): + return pformat(self.__dict__) + + def __repr__(self): + return self.__str__() + + +def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="default_config.yaml"): + """ + Parse command line arguments to the configuration according to the default yaml. + + Args: + parser (argparse.ArgumentParser): Parent parser. + cfg (dict): Base configuration. + helper (dict): Helper description. + choices (dict): Choices. + cfg_path (str): Path to the default yaml config. + """ + helper = {} if helper is None else helper + choices = {} if choices is None else choices + for item in cfg: + if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict): + help_description = helper[item] if item in helper else f"Please reference to {cfg_path}" + choice = choices[item] if item in choices else None + if isinstance(cfg[item], bool): + parser.add_argument("--" + item, type=ast.literal_eval, default=cfg[item], choices=choice, + help=help_description) + else: + parser.add_argument("--" + item, type=type(cfg[item]), default=cfg[item], choices=choice, + help=help_description) + args = parser.parse_args() + return args + + +def parse_yaml(yaml_path): + """ + Parse the yaml config file. + + Args: + yaml_path (str): Path to the yaml config. + """ + with open(yaml_path, 'r') as fin: + try: + cfgs_raw = yaml.load_all(fin.read(), Loader=yaml.FullLoader) + cfgs = [] + for cf in cfgs_raw: + cfgs.append(cf) + + if len(cfgs) == 1: + cfg_helper = {} + cfg = cfgs[0] + cfg_choices = {} + elif len(cfgs) == 2: + cfg, cfg_helper = cfgs + cfg_choices = {} + elif len(cfgs) == 3: + cfg, cfg_helper, cfg_choices = cfgs + else: + raise ValueError("At most 3 docs (config, description for help, choices) are supported in config yaml") + except ValueError("Failed to parse yaml") as err: + raise err + + return cfg, cfg_helper, cfg_choices + + +def merge(args, cfg): + """ + Merge the base config from yaml file and command line arguments. + + Args: + args (argparse.Namespace): Command line arguments. + cfg (dict): Base configuration. + """ + args_var = vars(args) + for item in args_var: + cfg[item] = args_var[item] + + return cfg + + +def get_config(): + """ + Get Config according to the yaml file and cli arguments. + """ + parser = argparse.ArgumentParser(description="default name", add_help=False) + parser.add_argument("--config_path", type=str, default="default_config.yaml", help="Config file path.") + + path_args, _ = parser.parse_known_args() + default, helper, choices = parse_yaml(path_args.config_path) + args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=path_args.config_path) + final_config = merge(args, default) + + return Config(final_config) + + +config = get_config() diff --git a/research/audio/FastSpeech/src/dataset.py b/research/audio/FastSpeech/src/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..e72e66213e867feab300003b59edf5352fb51bbb --- /dev/null +++ b/research/audio/FastSpeech/src/dataset.py @@ -0,0 +1,165 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Data preprocessing.""" +import os +from pathlib import Path + +import numpy as np +from mindspore import Tensor +from mindspore import dtype as mstype + +from src.cfg.config import config as hp +from src.text import text_to_sequence +from src.utils import pad_1d_tensor +from src.utils import pad_2d_tensor +from src.utils import process_text + + +def get_data_to_buffer(): + """ + Put data to memory, for faster training. + """ + with Path(hp.dataset_path, 'train_indices.txt').open('r') as file: + train_part = np.array([i[:-1] for i in file.readlines()], np.int32) + train_part.sort() + + buffer = list() + raw_text = process_text(os.path.join(hp.dataset_path, "metadata.txt")) + + for i in train_part: + mel_gt_name = os.path.join(hp.dataset_path, 'mels', "ljspeech-mel-%05d.npy" % (i+1)) + mel_gt_target = np.load(mel_gt_name) + + duration = np.load(os.path.join(hp.dataset_path, 'alignments', str(i)+".npy")) + + character = raw_text[i][: len(raw_text[i])-1] + character = np.array(text_to_sequence(character, hp.text_cleaners)) + + buffer.append( + { + "text": character, + "duration": duration, + "mel_target": mel_gt_target + } + ) + + return buffer + + +def reprocess_tensor(data_dict): + """ + Prepare data for training. + Apply padding for all samples, in reason of static graph. + + Args: + data_dict (dict): Dictionary of np.array type data. + + Returns: + out (dict): Dictionary with prepared data for training, np.array type. + """ + text = data_dict["text"] + mel_target = data_dict["mel_target"] + duration = data_dict["duration"] + + max_len = hp.character_max_length + length_text = text.shape[0] + src_pos = np.pad([i+1 for i in range(int(length_text))], (0, max_len-int(length_text)), 'constant') + + max_mel_len = hp.mel_max_length + length_mel = mel_target.shape[0] + mel_pos = np.pad([i+1 for i in range(int(length_mel))], (0, max_mel_len-int(length_mel)), 'constant') + + text = pad_1d_tensor(text) + duration = pad_1d_tensor(duration) + mel_target = pad_2d_tensor(mel_target) + + out = { + "text": text, # shape (hp.character_max_length) + "src_pos": src_pos, # shape (hp.character_max_length) + "mel_pos": mel_pos, # shape (hp.mel_max_length) + "duration": duration, # shape (hp.character_max_length) + "mel_target": mel_target, # shape (hp.mel_max_length, hp.num_mels) + "mel_max_len": max_mel_len, + } + + return out + + +def preprocess_data(buffer): + """ + Prepare data for training. + + Args: + buffer (list): Raw data inputs. + + Returns: + preprocessed_data (list): Padded and converted data, ready for training. + """ + preprocessed_data = [] + for squeeze_data in buffer: + db = reprocess_tensor(squeeze_data) + + preprocessed_data.append( + ( + db["text"].astype(np.float32), + db["src_pos"].astype(np.float32), + db["mel_pos"].astype(np.float32), + db["duration"].astype(np.int32), + db["mel_target"].astype(np.float32), + db["mel_max_len"], + ) + ) + + return preprocessed_data + + +class BufferDataset: + """ + Dataloader. 
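+    All samples are padded and converted once in preprocess_data, so __getitem__ is a simple list lookup.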
+ """ + def __init__(self, buffer): + self.length_dataset = len(buffer) + self.preprocessed_data = preprocess_data(buffer) + + def __len__(self): + return self.length_dataset + + def __getitem__(self, idx): + return self.preprocessed_data[idx] + + +def get_val_data(data_url): + """Get validation data.""" + data_list = list() + with Path(data_url, 'validation.txt').open('r') as file: + data_paths = file.readlines() + + root_wav_path = os.path.join(data_url, 'wavs') + wav_paths = [root_wav_path + '/' + raw_path.split('|')[0] + '.wav' for raw_path in data_paths] + val_txts = [raw_path.split('|')[1][:-1] for raw_path in data_paths] + + for orig_text, wav_path in zip(val_txts, wav_paths): + sequence = text_to_sequence(orig_text, hp.text_cleaners) + sequence = np.expand_dims(sequence, 0) + + src_pos = np.array([i + 1 for i in range(sequence.shape[1])]) + src_pos = np.expand_dims(src_pos, 0) + + sequence = Tensor([np.pad(sequence[0], (0, hp.character_max_length - sequence.shape[1]))], mstype.float32) + src_pos = Tensor([np.pad(src_pos[0], (0, hp.character_max_length - src_pos.shape[1]))], mstype.float32) + + data_list.append([sequence, src_pos, wav_path]) + + return data_list diff --git a/research/audio/FastSpeech/src/deepspeech2/__init__.py b/research/audio/FastSpeech/src/deepspeech2/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/deepspeech2/dataset.py b/research/audio/FastSpeech/src/deepspeech2/dataset.py new file mode 100644 index 0000000000000000000000000000000000000000..04fd56169c82951b3714ea465ebf19fcca6f5893 --- /dev/null +++ b/research/audio/FastSpeech/src/deepspeech2/dataset.py @@ -0,0 +1,69 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Audio parser script.""" +import librosa +import numpy as np +import soundfile as sf + + +class LoadAudioAndTranscript: + """ + Parse audio and transcript. + """ + def __init__( + self, + audio_conf=None, + normalize=False, + labels=None + ): + super().__init__() + self.window_stride = audio_conf['window_stride'] + self.window_size = audio_conf['window_size'] + self.sample_rate = audio_conf['sampling_rate'] + self.window = audio_conf['window'] + self.is_normalization = normalize + self.labels = labels + + def load_audio(self, path): + """ + Load audio. + """ + sound, _ = sf.read(path, dtype='int16') + sound = sound.astype('float32') / 32767 + if len(sound.shape) > 1: + if sound.shape[1] == 1: + sound = sound.squeeze() + else: + sound = sound.mean(axis=1) + + return sound + + def parse_audio(self, audio_path): + """ + Parse audio. 
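+        Returns a log(1 + magnitude) STFT spectrogram computed with librosa, optionally mean/std normalized.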
+ """ + audio = self.load_audio(audio_path) + n_fft = int(self.sample_rate * self.window_size) + win_length = n_fft + hop_length = int(self.sample_rate * self.window_stride) + d = librosa.stft(y=audio, n_fft=n_fft, hop_length=hop_length, win_length=win_length, window=self.window) + mag, _ = librosa.magphase(d) + mag = np.log1p(mag) + if self.is_normalization: + mean = mag.mean() + std = mag.std() + mag = (mag - mean) / std + + return mag diff --git a/research/audio/FastSpeech/src/deepspeech2/model.py b/research/audio/FastSpeech/src/deepspeech2/model.py new file mode 100644 index 0000000000000000000000000000000000000000..73e87f34a689be343d2eb4f205dcd092c605da87 --- /dev/null +++ b/research/audio/FastSpeech/src/deepspeech2/model.py @@ -0,0 +1,315 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""DeepSpeech2 model.""" +import math + +import numpy as np +from mindspore import Tensor +from mindspore import nn +from mindspore.ops import operations as P + + +class SequenceWise(nn.Cell): + """SequenceWise FC Layers.""" + def __init__(self, module): + super().__init__() + self.module = module + self.reshape_op = P.Reshape() + self.shape_op = P.Shape() + self._initialize_weights() + + def construct(self, x): + sizes = self.shape_op(x) + t, n = sizes[0], sizes[1] + x = self.reshape_op(x, (t * n, -1)) + x = self.module(x) + x = self.reshape_op(x, (t, n, -1)) + + return x + + def _initialize_weights(self): + """Init weights.""" + self.init_parameters_data() + for _, m in self.cells_and_names(): + if isinstance(m, nn.Dense): + m.weight.set_data( + Tensor( + np.random.uniform( + -1. / m.in_channels, + 1. / m.in_channels, + m.weight.data.shape + ).astype("float32") + ) + ) + + if m.bias is not None: + m.bias.set_data( + Tensor( + np.random.uniform( + -1. / m.in_channels, + 1. / m.in_channels, + m.bias.data.shape).astype("float32") + ) + ) + + +class MaskConv(nn.Cell): + """ + MaskConv architecture. + MaskConv is actually not implemented in this part + because some operation in MindSpore is not supported. + Lengths is kept for future use. 
+    """
+
+    def __init__(self):
+        super().__init__()
+        self.zeros = P.ZerosLike()
+        self.conv1 = nn.Conv2d(
+            in_channels=1,
+            out_channels=32,
+            kernel_size=(41, 11),
+            stride=(2, 2),
+            pad_mode='pad',
+            padding=(20, 20, 5, 5)
+        )
+
+        self.bn1 = nn.BatchNorm2d(num_features=32)
+        self.conv2 = nn.Conv2d(
+            in_channels=32,
+            out_channels=32,
+            kernel_size=(21, 11),
+            stride=(2, 1),
+            pad_mode='pad',
+            padding=(10, 10, 5, 5)
+        )
+
+        self.bn2 = nn.BatchNorm2d(num_features=32)
+        self.tanh = nn.Tanh()
+        self._initialize_weights()
+        self.module_list = nn.CellList(
+            [
+                self.conv1,
+                self.bn1,
+                self.tanh,
+                self.conv2,
+                self.bn2,
+                self.tanh
+            ]
+        )
+
+    def construct(self, x):
+        for module in self.module_list:
+            x = module(x)
+
+        return x
+
+    def _initialize_weights(self):
+        """
+        Parameter initialization.
+        """
+        self.init_parameters_data()
+        for _, m in self.cells_and_names():
+            if isinstance(m, nn.Conv2d):
+                n = m.kernel_size[0] * m.kernel_size[1] * m.out_channels
+                m.weight.set_data(Tensor(np.random.normal(0, np.sqrt(2. / n), m.weight.data.shape).astype("float32")))
+                if m.bias is not None:
+                    m.bias.set_data(
+                        Tensor(np.zeros(m.bias.data.shape, dtype="float32")))
+            elif isinstance(m, nn.BatchNorm2d):
+                m.gamma.set_data(
+                    Tensor(np.ones(m.gamma.data.shape, dtype="float32")))
+                m.beta.set_data(
+                    Tensor(np.zeros(m.beta.data.shape, dtype="float32")))
+
+
+class BatchRNN(nn.Cell):
+    """
+    BatchRNN architecture.
+
+    Args:
+        batch_size (int): Number of samples per step in training.
+        input_size (int): Dimension of the input tensor.
+        hidden_size (int): RNN hidden size.
+        num_layers (int): Number of RNN layers.
+        bidirectional (bool): Whether to use a bidirectional RNN. Currently, only the bidirectional RNN is implemented.
+        batch_norm (bool): Whether to use BN in the RNN.
+        rnn_type (str): RNN type to use. Currently, only LSTM is supported.
+    """
+
+    def __init__(
+        self,
+        batch_size,
+        input_size,
+        hidden_size,
+        num_layers,
+        bidirectional=False,
+        batch_norm=False,
+        rnn_type='LSTM',
+    ):
+        super().__init__()
+        self.batch_size = batch_size
+        self.input_size = input_size
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+        self.rnn_type = rnn_type
+        self.bidirectional = bidirectional
+        self.has_bias = True
+        self.is_batch_norm = batch_norm
+        self.num_directions = 2 if bidirectional else 1
+        self.reshape_op = P.Reshape()
+        self.shape_op = P.Shape()
+        self.sum_op = P.ReduceSum()
+
+        input_size_list = [input_size]
+        for i in range(num_layers - 1):
+            input_size_list.append(hidden_size)
+        layers = []
+
+        for i in range(num_layers):
+            layers.append(
+                nn.LSTM(
+                    input_size=input_size_list[i],
+                    hidden_size=hidden_size,
+                    bidirectional=bidirectional,
+                    has_bias=self.has_bias
+                )
+            )
+
+        self.lstms = nn.CellList(layers)
+
+        if batch_norm:
+            batch_norm_layer = []
+            for i in range(num_layers - 1):
+                batch_norm_layer.append(nn.BatchNorm1d(hidden_size))
+            self.batch_norm_list = batch_norm_layer
+
+    def construct(self, x):
+        for i in range(self.num_layers):
+            if self.is_batch_norm and i > 0:
+                x = self.batch_norm_list[i - 1](x)
+            x, _ = self.lstms[i](x)
+            if self.bidirectional:
+                size = self.shape_op(x)
+                x = self.reshape_op(x, (size[0], size[1], 2, -1))
+                x = self.sum_op(x, 2)
+        return x
+
+
+class DeepSpeechModel(nn.Cell):
+    """
+    DeepSpeech2 architecture.
+
+    Args:
+        batch_size (int): Number of samples per step in training.
+        rnn_type (str): RNN type to use.
+        labels (str): String containing all the possible symbols to map to.
+        rnn_hidden_size (int): RNN hidden size.
+        nb_layers (int): Number of RNN layers.
+ audio_conf: Config containing the sample rate, window and the window length/stride in seconds. + bidirectional (bool): Use bidirectional rnn. + """ + + def __init__( + self, + batch_size, + labels, + rnn_hidden_size, + nb_layers, + audio_conf, + rnn_type='LSTM', + bidirectional=True, + ): + super().__init__() + self.batch_size = batch_size + self.hidden_size = rnn_hidden_size + self.hidden_layers = nb_layers + self.rnn_type = rnn_type + self.audio_conf = audio_conf + self.labels = list(labels) + self.bidirectional = bidirectional + self.reshape_op = P.Reshape() + self.shape_op = P.Shape() + self.transpose_op = P.Transpose() + self.add = P.Add() + self.div = P.Div() + + sample_rate = self.audio_conf['sampling_rate'] + window_size = self.audio_conf['window_size'] + num_classes = len(self.labels) + + self.conv = MaskConv() + # This is to calculate + self.pre, self.stride = self.get_conv_num() + self.num_iters = list(range(len(self.stride))) + + # Based on above convolutions and spectrogram size using conv formula (W - F + 2P)/ S+1 + rnn_input_size = int(math.floor((sample_rate * window_size) / 2) + 1) + rnn_input_size = int(math.floor(rnn_input_size + 2 * 20 - 41) / 2 + 1) + rnn_input_size = int(math.floor(rnn_input_size + 2 * 10 - 21) / 2 + 1) + rnn_input_size *= 32 + + self.rnn = BatchRNN( + batch_size=self.batch_size, + input_size=rnn_input_size, + num_layers=nb_layers, + hidden_size=rnn_hidden_size, + bidirectional=bidirectional, + batch_norm=False, + rnn_type=self.rnn_type, + ) + + fully_connected = nn.Dense(rnn_hidden_size, num_classes, has_bias=False) + self.fc = SequenceWise(fully_connected) + + def construct(self, x, lengths): + """ + Forward. + """ + output_lengths = self.get_seq_lens(lengths) + x = self.conv(x) + sizes = self.shape_op(x) + x = self.reshape_op(x, (sizes[0], sizes[1] * sizes[2], sizes[3])) + x = self.transpose_op(x, (2, 0, 1)) + x = self.rnn(x) + + activations = x.copy() + + x = self.fc(x) + + return x, output_lengths, activations + + def get_seq_lens(self, seq_len): + """ + Given a 1D Tensor or Variable containing integer sequence lengths, + return a 1D tensor or variable containing the size sequences + that will be output by the network. + """ + for i in self.num_iters: + seq_len = self.add(self.div(self.add(seq_len, self.pre[i]), self.stride[i]), 1) + + return seq_len + + def get_conv_num(self): + """Get number of convs.""" + p, s = [], [] + for _, cell in self.conv.cells_and_names(): + if isinstance(cell, nn.Conv2d): + kernel_size = cell.kernel_size + padding_1 = int((kernel_size[1] - 1) / 2) + temp = 2 * padding_1 - cell.dilation[1] * (cell.kernel_size[1] - 1) - 1 + p.append(temp) + s.append(cell.stride[1]) + + return p, s diff --git a/research/audio/FastSpeech/src/import_ckpt/__init__.py b/research/audio/FastSpeech/src/import_ckpt/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/import_ckpt/import_deepspeech2.py b/research/audio/FastSpeech/src/import_ckpt/import_deepspeech2.py new file mode 100644 index 0000000000000000000000000000000000000000..0402cbba3f7283d79bb1d1ca5586798ea33e3f21 --- /dev/null +++ b/research/audio/FastSpeech/src/import_ckpt/import_deepspeech2.py @@ -0,0 +1,84 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""DeepSpeech2 checkpoint converter.""" +from pathlib import Path + +import numpy as np +from mindspore import Parameter +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import load_checkpoint +from mindspore import save_checkpoint + +from src.cfg.config import config +from src.deepspeech2.model import DeepSpeechModel + + +def main(ckpt_url): + spect_config = { + 'sampling_rate': config.ds_sampling_rate, + 'window_size': config.ds_window_size, + 'window_stride': config.ds_window_stride, + 'window': config.ds_window + } + # Initialize model to get new lstm params names + model = DeepSpeechModel( + batch_size=1, + rnn_hidden_size=config.ds_hidden_size, + nb_layers=config.ds_hidden_layers, + labels=config.labels, + rnn_type=config.ds_rnn_type, + audio_conf=spect_config, + bidirectional=True + ) + + filter_prefix = ['moment1', 'moment2', 'step', 'learning_rate', 'beta1_power', 'beta2_power'] + lstm_old_names = ['RNN.weight0', 'RNN.weight1', 'RNN.weight2', 'RNN.weight3', 'RNN.weight4'] + new_params = model.trainable_params() + old_params = load_checkpoint(ckpt_url, filter_prefix=filter_prefix) + names_and_shapes = {param.name: param.shape for param in new_params} + + lstm_weights = {} + # Reprocess flatten weights of LSTM from < 1.5 mindspore versions to new. + for layer, old_layer in zip(range(0, 5), lstm_old_names): + previous = 0 + for i in np.array(list(names_and_shapes.keys())[layer * 8 + 6: layer * 8 + 14])[[0, 2, 1, 3, 4, 6, 5, 7]]: + weights = old_params[old_layer][int(previous): int(previous + np.prod(names_and_shapes[i]))].asnumpy() + weights_shaped = weights.reshape(names_and_shapes[i]) + lstm_weights[i] = weights_shaped + + previous += np.prod(names_and_shapes[i]) + + # Remove lstm layers to the load remaining layers + old_params.pop(old_layer) + + # Put remaining weights into dictionary + for remaining_key, remaining_param in old_params.items(): + lstm_weights[remaining_key] = remaining_param.asnumpy() + + # Process to checkpoint save format + save_params = [] + for key, value in lstm_weights.items(): + save_params.append({'name': key, 'data': Parameter(Tensor(value, mstype.float32), name=key)}) + + save_name = Path(Path(ckpt_url).parent, 'DeepSpeech2.ckpt') + save_checkpoint(save_params, str(save_name)) + + print('Successfully converted checkpoint') + print(f'New checkpoint path {save_name}') + + +if __name__ == "__main__": + main(config.ds_ckpt_url) diff --git a/research/audio/FastSpeech/src/import_ckpt/import_waveglow.py b/research/audio/FastSpeech/src/import_ckpt/import_waveglow.py new file mode 100644 index 0000000000000000000000000000000000000000..15093c243961562919ddf02966c6e3d2d63c182a --- /dev/null +++ b/research/audio/FastSpeech/src/import_ckpt/import_waveglow.py @@ -0,0 +1,87 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""WaveGlow checkpoint converter.""" +import pickle +from pathlib import Path + +import numpy as np +from mindspore import Parameter +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import save_checkpoint + +from src.cfg.config import config +from src.waveglow.model import WaveGlow + + +def main(ckpt_url): + with Path(ckpt_url).open('rb') as file: + waveglow_np_params = pickle.load(file) + + wn_config = { + 'n_layers': config.wg_n_layers, + 'n_channels': config.wg_n_channels, + 'kernel_size': config.wg_kernel_size + } + + # Initialize model to get true names + model = WaveGlow( + n_mel_channels=config.wg_n_mel_channels, + n_flows=config.wg_n_flows, + n_group=config.wg_n_group, + n_early_every=config.wg_n_early_every, + n_early_size=config.wg_n_early_size, + wn_config=wn_config + ) + names_and_shapes = {key: param.shape for key, param in model.parameters_and_names()} + + # Put similar names into blocks + wn_names = list(waveglow_np_params.keys())[2: 2 + 38 * 12] + convinv_names = list(waveglow_np_params.keys())[-12:] + ordered_names = list(waveglow_np_params.keys())[:2] + + # Mindspore order of weights into same block + indexes_weighs = np.concatenate((np.arange(1, 34, 2), np.array([34, 37]))) + indexes_biases = np.concatenate((np.arange(0, 34, 2), np.array([35, 36]))) + + for block_num in reversed(range(12)): + block_layers = wn_names[block_num * 38: 38 * (block_num + 1)] + for layer_index_weight, layer_index_bias in zip(indexes_weighs, indexes_biases): + ordered_names.append(block_layers[layer_index_weight]) + ordered_names.append(block_layers[layer_index_bias]) + ordered_names.append(convinv_names[block_num]) + + # Reshape weights and process inverted convolutions + processed_weights = {} + for torch_name, mindspore_name in zip(ordered_names, list(names_and_shapes.keys())): + weights = waveglow_np_params[torch_name] + if torch_name.startswith('convinv'): + weights = np.linalg.inv((np.squeeze(weights))) + weights = np.expand_dims(weights, -1) + processed_weights[mindspore_name] = weights.reshape(names_and_shapes[mindspore_name]) + + save_params = [] + for key, value in processed_weights.items(): + save_params.append({'name': key, 'data': Parameter(Tensor(value, mstype.float32), name=key)}) + + save_name = Path(Path(ckpt_url).parent, 'WaveGlow.ckpt') + save_checkpoint(save_params, str(save_name)) + + print('Successfully converted checkpoint') + print(f'New checkpoint path {save_name}') + + +if __name__ == "__main__": + main(config.wg_ckpt_url) diff --git a/research/audio/FastSpeech/src/metrics.py b/research/audio/FastSpeech/src/metrics.py new file mode 100644 index 0000000000000000000000000000000000000000..981ae91d14919e5460cafa399eb28c9410eb913b --- /dev/null +++ b/research/audio/FastSpeech/src/metrics.py @@ -0,0 +1,155 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Metrics scripts.""" +import numpy as np + + +def kernel_classifier_distance_and_std_from_activations( + activations1, + activations2, + max_block_size=1024, + dtype=np.float32, +): + """Compute kernel distance between two activations.""" + n_r = activations1.shape[0] + n_g = activations2.shape[0] + + n_bigger = np.maximum(n_r, n_g) + n_blocks = np.ceil(n_bigger / max_block_size).astype(np.int32) + + v_r = n_r // n_blocks + v_g = n_g // n_blocks + + n_plusone_r = n_r - v_r * n_blocks + n_plusone_g = n_g - v_g * n_blocks + + sizes_r = np.concatenate([np.full([n_blocks - n_plusone_r], v_r), np.full([n_plusone_r], v_r + 1)], axis=0) + + sizes_g = np.concatenate([ + np.full([n_blocks - n_plusone_g], v_g), + np.full([n_plusone_g], v_g + 1)], axis=0) + + zero = np.zeros([1], dtype=np.int32) + inds_r = np.concatenate([zero, np.cumsum(sizes_r)], axis=0) + inds_g = np.concatenate([zero, np.cumsum(sizes_g)], axis=0) + + dim = activations1.shape[1] + + def compute_kid_block(i): + """Computes the ith block of the KID estimate.""" + r_s = inds_r[i] + r_e = inds_r[i + 1] + r = activations1[r_s:r_e] + m = (r_e - r_s).astype(dtype) + + g_s = inds_g[i] + g_e = inds_g[i + 1] + g = activations2[g_s:g_e] + n = (g_e - g_s).astype(dtype) + + k_rr = (np.matmul(r, r.T) / dim + 1) ** 3 + k_rg = (np.matmul(r, g.T) / dim + 1) ** 3 + k_gg = (np.matmul(g, g.T) / dim + 1) ** 3 + + out = (-2 * np.mean(k_rg) + (np.sum(k_rr) - np.trace(k_rr)) / + (m * (m - 1)) + (np.sum(k_gg) - np.trace(k_gg)) / (n * (n - 1))) + + return out.astype(dtype) + + ests = np.array([compute_kid_block(i) for i in range(n_blocks)]) + + mn = np.mean(ests) + + n_blocks_ = n_blocks.astype(dtype) + + if np.less_equal(n_blocks, 1): + var = np.array(float('nan'), dtype=dtype) + else: + var = np.sum(np.square(ests - mn)) / (n_blocks_ - 1) + + return mn, np.sqrt(var / n_blocks_) + + +def frechet_classifier_distance_from_activations( + activations1, + activations2, +): + """Compute frechet distance between two activations.""" + activations1 = activations1.astype(np.float64) + activations2 = activations2.astype(np.float64) + + m = np.mean(activations1, axis=0) + m_w = np.mean(activations2, axis=0) + + # Calculate the unbiased covariance matrix of first activations. + num_examples_real = activations1.shape[0] + sigma = num_examples_real / (num_examples_real - 1) * np.cov(activations1.T) + # Calculate the unbiased covariance matrix of second activations. + num_examples_generated = activations2.shape[0] + sigma_w = num_examples_generated / (num_examples_generated - 1) * np.cov(activations2.T) + + def _calculate_fid(m, m_w, sigma, sigma_w): + """Returns the Frechet distance given the sample mean and covariance.""" + # Find the Tr(sqrt(sigma sigma_w)) component of FID + sqrt_trace_component = trace_sqrt_product(sigma, sigma_w) + + # Compute the two components of FID. + + # First the covariance component. + # Here, note that trace(A + B) = trace(A) + trace(B) + trace = np.trace(sigma + sigma_w) - 2.0 * sqrt_trace_component + + # Next the distance between means. 
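+        # Together with the covariance term above this assembles the closed-form FID
+        # between two Gaussians: ||m - m_w||^2 + Tr(sigma + sigma_w - 2 * sqrt(sigma sigma_w)).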
+ mean = np.sum(squared_difference(m, m_w)) + + # Equivalent to L2 but more stable. + fid = trace + mean + + return fid.astype(np.float64) + + result = tuple( + _calculate_fid(m_val, m_w_val, sigma_val, sigma_w_val) for + m_val, m_w_val, sigma_val, sigma_w_val in + zip([m], [m_w], [sigma], [sigma_w]) + ) + + return result[0] + + +def squared_difference(m, w): + arr = [] + for i, j in zip(m, w): + arr.append((i - j) ** 2) + arr = np.array(arr) + + return arr + + +def trace_sqrt_product(sigma, sigma_v): + # Note sqrt_sigma is called "A" in the proof above + sqrt_sigma = _symmetric_matrix_square_root(sigma) + + # This is sqrt(A sigma_v A) above + sqrt_a_sigmav_a = np.matmul(sqrt_sigma, np.matmul(sigma_v, sqrt_sigma)) + + return np.trace(_symmetric_matrix_square_root(sqrt_a_sigmav_a)) + + +def _symmetric_matrix_square_root(mat, eps=1e-10): + u, s, v = np.linalg.svd(mat) + # sqrt is unstable around 0, just use 0 in such case + si = np.where(np.less(s, eps), s, np.sqrt(s)) + + return np.matmul(np.matmul(u, np.diag(si)), v) diff --git a/research/audio/FastSpeech/src/model.py b/research/audio/FastSpeech/src/model.py new file mode 100644 index 0000000000000000000000000000000000000000..6d2f92fcc998122041a7ec2cd580c9a1a576c665 --- /dev/null +++ b/research/audio/FastSpeech/src/model.py @@ -0,0 +1,256 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""FastSpeech model.""" +import mindspore.numpy as msnp +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops +from mindspore.common.initializer import XavierUniform +from mindspore.common.initializer import initializer + +from src.cfg.config import config as hp +from src.modules import CBHG +from src.modules import LengthRegulator +from src.transformer.models import Decoder +from src.transformer.models import Encoder + + +class FastSpeech(nn.Cell): + """FastSpeech model.""" + def __init__(self): + super().__init__() + self.encoder = Encoder( + n_src_vocab=hp.vocab_size, + len_max_seq=hp.vocab_size, + d_word_vec=hp.encoder_dim, + n_layers=hp.encoder_n_layer, + n_head=hp.encoder_head, + d_k=hp.encoder_dim // hp.encoder_head, + d_v=hp.encoder_dim // hp.encoder_head, + d_model=hp.encoder_dim, + d_inner=hp.encoder_conv1d_filter_size, + dropout=hp.dropout, + ) + + self.length_regulator = LengthRegulator() + + self.decoder = Decoder( + len_max_seq=hp.max_seq_len, + n_layers=hp.decoder_n_layer, + n_head=hp.decoder_head, + d_k=hp.decoder_dim // hp.decoder_head, + d_v=hp.decoder_dim // hp.decoder_head, + d_model=hp.decoder_dim, + d_inner=hp.decoder_conv1d_filter_size, + dropout=hp.dropout + ) + + num_mels = hp.num_mels + decoder_dim = hp.decoder_dim + + self.mel_linear = nn.Dense( + decoder_dim, + num_mels, + weight_init=initializer( + XavierUniform(), + [num_mels, decoder_dim], + mstype.float32 + ) + ) + + self.last_linear = nn.Dense( + num_mels * 2, + num_mels, + weight_init=initializer( + XavierUniform(), + [num_mels, num_mels * 2], + mstype.float32 + ) + ) + + self.postnet = CBHG( + in_dim=num_mels, + num_banks=8, + projections=[256, hp.num_mels], + ) + + self.expand_dims = ops.ExpandDims() + self.argmax = ops.ArgMaxWithValue(axis=-1) + self.broadcast = ops.BroadcastTo((-1, -1, num_mels)) + + self.ids_linspace = msnp.arange(hp.mel_max_length) + self.zeros_mask = msnp.zeros((hp.batch_size, hp.mel_max_length, hp.num_mels)) + + def mask_tensor(self, mel_output, position): + """ + Make mask for tensor, to ignore padded cells. + """ + lengths = self.argmax(position)[1] + + ids = self.ids_linspace + + mask = (ids < self.expand_dims(lengths, 1)).astype(mstype.float32) + mask_bool = self.broadcast(self.expand_dims(mask, -1)).astype(mstype.bool_) + + mel_output = msnp.where(mask_bool, mel_output, self.zeros_mask) + + return mel_output + + def construct( + self, + src_seq, + src_pos, + mel_pos=None, + mel_max_length=None, + length_target=None, + alpha=1.0 + ): + """ + Predict mel-spectrogram from sequence. + + Args: + src_seq (Tensor): Tokenized text sequence. Shape (hp.batch_size, hp.character_max_length) + src_pos (Tensor): Positions of the sequences. Shape (hp.batch_size, hp.character_max_length) + mel_pos (Tensor): Positions of the mels. Shape (hp.batch_size, hp.mel_max_length) + mel_max_length (int): Max mel length. + length_target (Tensor): Duration of the each phonema. Shape (hp.batch_size, hp.character_max_length) + alpha (int): Regulator of the speech speed. 
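+
+        Returns:
+            In training mode: (mel_output, mel_postnet_output, duration_predictor_output).
+            In inference mode: (mel_output, mel_postnet_output, mel_len).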
+ """ + encoder_output = self.encoder(src_seq, src_pos) + + if self.training: + length_regulator_output, duration_predictor_output = self.length_regulator( + encoder_output, + target=length_target, + alpha=alpha, + mel_max_length=mel_max_length, + ) + + decoder_output = self.decoder(length_regulator_output, mel_pos) + + mel_output = self.mel_linear(decoder_output) + mel_output = self.mask_tensor(mel_output, mel_pos) + + residual = self.postnet(mel_output) + residual = self.last_linear(residual) + + mel_postnet_output = mel_output + residual + mel_postnet_output = self.mask_tensor(mel_postnet_output, mel_pos) + + return mel_output, mel_postnet_output, duration_predictor_output + + length_regulator_output, decoder_pos, mel_len = self.length_regulator(encoder_output, alpha=alpha) + + decoder_output = self.decoder(length_regulator_output, decoder_pos) + + mel_output = self.mel_linear(decoder_output) + + residual = self.postnet(mel_output) + residual = self.last_linear(residual) + + mel_postnet_output = mel_output + residual + + return mel_output, mel_postnet_output, mel_len + + +class LossWrapper(nn.Cell): + """ + Training wrapper for model. + """ + def __init__(self, model): + super().__init__() + self.model = model + + self.mse_loss = nn.MSELoss() + self.l1_loss = nn.L1Loss() + + def construct( + self, + character, + src_pos, + mel_pos, + duration, + mel_target, + max_mel_len, + ): + """ + FastSpeech with loss. + + Args: + character (Tensor): Tokenized text sequence. Shape (hp.batch_size, hp.character_max_length) + src_pos (Tensor): Positions of the sequences. Shape (hp.batch_size, hp.character_max_length) + mel_pos (Tensor): Positions of the mels. Shape (hp.batch_size, hp.mel_max_length) + duration (Tensor): Target duration. Shape (hp.batch_size, hp.character_max_length) + mel_target (Tensor): Target mel-spectrogram. Shape (hp.batch_size, hp.mel_max_length, hp.num_mels) + max_mel_len (list): Max mel length. + + Returns: + total_loss (Tensor): Sum of 3 losses. + """ + max_mel_len = max_mel_len[0] + mel_output, mel_postnet_output, duration_predictor_output = self.model( + character, + src_pos, + mel_pos=mel_pos, + mel_max_length=max_mel_len, + length_target=duration, + ) + + mel_loss = self.mse_loss(mel_output, mel_target) + mel_postnet_loss = self.mse_loss(mel_postnet_output, mel_target) + duration_predictor_loss = self.l1_loss(duration_predictor_output, duration) + + total_loss = mel_loss + mel_postnet_loss + duration_predictor_loss + + return total_loss + + +class FastSpeechEval: + """FastSpeech with vocoder for evaluation.""" + def __init__( + self, + mel_generator, + vocoder, + config, + ): + super().__init__() + self.mel_generator = mel_generator + self.vocoder = vocoder + + self.alpha = config.alpha + self.vocoder_stride = vocoder.upsample.stride[1] + self.zeros_mask = msnp.zeros((1, config.num_mels, config.mel_max_length)) + + x_grid = msnp.arange(0, config.mel_max_length) + y_grid = msnp.arange(0, config.num_mels) + + self.transpose = ops.Transpose() + self.grid = ops.ExpandDims()(msnp.meshgrid(x_grid, y_grid)[0], 0) + + def get_audio(self, src_seq, src_pos): + """ + Generate mel-spectrogram from sequence, + generate raw audio from mel-spectrogram by vocoder. 
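+
+        Illustrative call pattern (eval_model and data_url are placeholder names;
+        sequence and src_pos come from get_val_data in the dataset module):
+
+            sequence, src_pos, _ = get_val_data(data_url)[0]
+            audio, audio_len = eval_model.get_audio(sequence, src_pos)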
+        """
+        _, mel, mel_len = self.mel_generator(src_seq, src_pos, alpha=self.alpha)
+
+        mel_mask = (self.grid < mel_len).astype(mstype.float32)
+        clear_mel = self.transpose(mel, (0, 2, 1)) * mel_mask
+
+        audio = self.vocoder.construct(clear_mel)
+
+        audio_len = mel_len * self.vocoder_stride
+
+        return audio, audio_len
diff --git a/research/audio/FastSpeech/src/modules.py b/research/audio/FastSpeech/src/modules.py
new file mode 100644
index 0000000000000000000000000000000000000000..3c1871553bf51391abf824d414a568ae14e945a5
--- /dev/null
+++ b/research/audio/FastSpeech/src/modules.py
@@ -0,0 +1,364 @@
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+"""Model modules."""
+from collections import OrderedDict
+
+import numpy as np
+from mindspore import Tensor
+from mindspore import dtype as mstype
+from mindspore import nn
+from mindspore import numpy as msnp
+from mindspore import ops
+from mindspore.common.initializer import XavierUniform
+from mindspore.common.initializer import initializer
+
+from src.cfg.config import config as hp
+
+
+class LengthRegulator(nn.Cell):
+    """
+    Length Regulator.
+
+    Predicts the duration of each phoneme and lets
+    the speech speed be changed with alpha.
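+
+    For example, predicted durations [2, 1, 3] expand the three encoder frames
+    into 2 + 1 + 3 = 6 mel frames. At inference the predicted durations are
+    scaled by alpha, so alpha > 1.0 slows speech down and alpha < 1.0 speeds it up.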
+    """
+    def __init__(self):
+        super().__init__()
+        self.duration_predictor = DurationPredictor()
+
+        self.tile = ops.Tile()
+        self.round = ops.Round()
+        self.stack = ops.Stack()
+        self.zeros = ops.Zeros()
+        self.concat = ops.Concat()
+        self.matmul = ops.MatMul()
+        self.sum = ops.ReduceSum()
+        self.bmm = ops.BatchMatMul()
+        self.unsqueeze = ops.ExpandDims()
+        self.max = ops.ArgMaxWithValue(axis=-1)
+        self.mesh = ops.Meshgrid(indexing='xy')
+
+        self.alignment_zeros = self.zeros(
+            (hp.batch_size, hp.mel_max_length, hp.character_max_length),
+            mstype.float32,
+        )
+
+        # For alignment
+        self.h = hp.mel_max_length
+        self.w = hp.character_max_length
+        self.base_mat_ones = msnp.ones((self.h, self.w))
+        self.meshgrid = self.mesh((msnp.arange(self.w), msnp.arange(self.h)))[1]
+        self.zero_tensor = Tensor([0.])
+        self.mel_pos_linspace = self.unsqueeze(msnp.arange(hp.mel_max_length) + 1, 0)
+
+    def LR(self, enc_out, duration_predictor_output, mel_max_length=None):
+        """Length regulator module."""
+        expand_max_len = self.sum(duration_predictor_output.astype(mstype.float32))
+
+        # mel_max_length is None during eval
+        if mel_max_length is not None:
+            alignment = self.alignment_zeros
+        else:
+            alignment = self.unsqueeze(self.alignment_zeros[0], 0)
+
+        for i in range(duration_predictor_output.shape[0]):
+            thresh_2 = duration_predictor_output[i].cumsum().astype(mstype.float32)
+            thresh_1 = self.concat(
+                (
+                    self.zero_tensor.astype(mstype.float64),
+                    thresh_2[:-1].astype(mstype.float64)
+                )
+            )
+            thresh_1 = self.tile(thresh_1, (self.h, 1))
+            thresh_2 = self.tile(thresh_2, (self.h, 1))
+
+            low_thresh = (self.meshgrid < thresh_2).astype(mstype.float32)
+            up_thresh = (self.meshgrid >= thresh_1).astype(mstype.float32)
+            intersection = low_thresh * up_thresh
+            res = intersection.astype(mstype.bool_)
+            alignment[i] = msnp.where(res, self.base_mat_ones, alignment[i])
+
+        output = self.bmm(alignment, enc_out)
+
+        return output, expand_max_len
+
+    def construct(self, encoder_output, alpha=1.0, target=None, mel_max_length=None):
+        """
+        Predict the duration of each phoneme.
+        """
+        duration_predictor_output = self.duration_predictor(encoder_output)
+
+        # target is not None during training
+        if target is not None:
+            output, _ = self.LR(encoder_output, target, mel_max_length=mel_max_length)
+
+            return output, duration_predictor_output
+
+        duration_predictor_output = (duration_predictor_output + 0.5) * alpha
+        duration_predictor_output = self.round(duration_predictor_output.copy())
+
+        output, mel_len = self.LR(encoder_output, duration_predictor_output)
+
+        mel_pos_mask = (self.mel_pos_linspace <= mel_len).astype(mstype.float32)
+        mel_pos = self.mel_pos_linspace * mel_pos_mask
+
+        return output, mel_pos, mel_len
+
+
+class DurationPredictor(nn.Cell):
+    """
+    Duration Predictor.
+
+    Predicts the duration of each phoneme.
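+
+    Takes encoder output of shape (batch, sequence_length, hp.encoder_dim) and
+    returns one predicted duration value per input position.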
+ """ + def __init__(self): + super().__init__() + + self.input_size = hp.encoder_dim + self.filter_size = hp.duration_predictor_filter_size + self.kernel = hp.duration_predictor_kernel_size + self.conv_output_size = hp.duration_predictor_filter_size + self.dropout = 1 - hp.dropout + + self.conv_layer = nn.SequentialCell(OrderedDict([ + ("conv1d_1", Conv( + self.input_size, + self.filter_size, + kernel_size=self.kernel, + padding=1)), + ("layer_norm_1", nn.LayerNorm([self.filter_size])), + ("relu_1", nn.ReLU()), + ("dropout_1", nn.Dropout(keep_prob=self.dropout)), + ("conv1d_2", Conv( + self.filter_size, + self.filter_size, + kernel_size=self.kernel, + padding=1)), + ("layer_norm_2", nn.LayerNorm([self.filter_size])), + ("relu_2", nn.ReLU()), + ("dropout_2", nn.Dropout(keep_prob=self.dropout)) + ])) + + self.linear_layer = nn.Dense( + in_channels=self.conv_output_size, + out_channels=1, + weight_init=initializer( + XavierUniform(), + [1, self.conv_output_size], + mstype.float32 + ) + ) + + self.relu = nn.ReLU() + self.expand_dims = ops.ExpandDims() + self.squeeze = ops.Squeeze() + + def construct(self, encoder_output): + out = self.conv_layer(encoder_output) + out = self.linear_layer(out) + out = self.relu(out) + out = self.squeeze(out) + + if not self.training: + out = self.expand_dims(out, 0) + + return out + + +class BatchNormConv1d(nn.Cell): + """ + Custom BN, Conv1d layer with weight init. + """ + def __init__( + self, + in_dim, + out_dim, + kernel_size, + stride, + padding, + activation=None, + ): + super().__init__() + + self.conv1d = nn.Conv1d( + in_dim, + out_dim, + kernel_size=kernel_size, + stride=stride, + pad_mode='pad', + padding=padding, + has_bias=False, + weight_init=initializer( + XavierUniform(), + [out_dim, in_dim, kernel_size], + mstype.float32, + ) + ) + + self.bn = nn.BatchNorm2d(out_dim, use_batch_statistics=True) + + self.activation = activation + self.expand_dims = ops.ExpandDims() + + def construct(self, input_tensor): + out = self.conv1d(input_tensor) + + if self.activation is not None: + out = self.activation(out) + + out = self.bn(self.expand_dims(out, -1)) + out = out.squeeze(-1) + + return out + + +class Conv(nn.Cell): + """ + Conv1d with weight init. 
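+
+    Operates on (batch, time, channels) inputs: the tensor is transposed to
+    (batch, channels, time) for nn.Conv1d and transposed back afterwards.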
+ """ + def __init__( + self, + in_channels, + out_channels, + kernel_size=1, + stride=1, + padding=0, + dilation=1, + bias=True, + ): + super().__init__() + + self.conv = nn.Conv1d( + in_channels, + out_channels, + kernel_size=kernel_size, + stride=stride, + pad_mode='pad', + padding=padding, + dilation=dilation, + has_bias=bias, + weight_init=initializer( + XavierUniform(), + [in_channels, out_channels, kernel_size], + mstype.float32, + ) + ) + + self.transpose = ops.Transpose() + + def construct(self, x): + x = self.transpose(x, (0, 2, 1)) + x = self.conv(x) + x = self.transpose(x, (0, 2, 1)) + + return x + + +class Highway(nn.Cell): + """Highway network.""" + def __init__(self, in_size, out_size): + super().__init__() + self.h = nn.Dense(in_size, out_size, bias_init='zeros') + self.t = nn.Dense(in_size, out_size, bias_init=Tensor(np.full(in_size, -1.), mstype.float32)) + self.relu = nn.ReLU() + self.sigmoid = nn.Sigmoid() + + def construct(self, inputs): + out_1 = self.relu(self.h(inputs)) + out_2 = self.sigmoid(self.t(inputs)) + output = out_1 * out_2 + inputs * (1.0 - out_2) + + return output + + +class CBHG(nn.Cell): + """ + CBHG a recurrent neural network composed of: + - 1-d convolution banks + - Highway networks + residual connections + - Bidirectional gated recurrent units + """ + def __init__(self, in_dim, num_banks, projections): + super().__init__() + self.in_dim = in_dim + + self.relu = nn.ReLU() + self.conv1d_banks = nn.CellList( + [ + BatchNormConv1d( + in_dim, + in_dim, + kernel_size=k, + stride=1, + padding=k // 2, + activation=self.relu, + ) + for k in range(1, num_banks + 1) + ] + ) + + self.max_pool1d = nn.MaxPool1d(kernel_size=2, stride=1, pad_mode='same') + + in_sizes = [num_banks * in_dim] + projections[:-1] + activations = [self.relu] * (len(projections) - 1) + [None] + + self.conv1d_projections = nn.CellList( + [ + BatchNormConv1d( + in_size, + out_size, + kernel_size=3, + stride=1, + padding=1, + activation=activation, + ) + for (in_size, out_size, activation) in zip(in_sizes, projections, activations) + ] + ) + + self.highways = nn.CellList([Highway(in_dim, in_dim) for _ in range(4)]) + + self.gru = nn.GRU(in_dim, in_dim, 1, batch_first=True, bidirectional=True) + + self.transpose = ops.Transpose() + self.concat = ops.Concat(axis=1) + + def construct(self, inputs): + """ + Forward mels to recurrent network. + """ + out = self.transpose(inputs, (0, 2, 1)) + + last_dim = out.shape[-1] + + output_list = [] + for conv in self.conv1d_banks: + output_list.append(conv(out)[:, :, :last_dim]) + + output = self.concat(output_list) + output = self.max_pool1d(output)[:, :, :last_dim] + + for conv1d in self.conv1d_projections: + output = conv1d(output) + + output = self.transpose(output, (0, 2, 1)) + output += inputs + + for highway in self.highways: + output = highway(output) + + outputs, _ = self.gru(output) + + return outputs diff --git a/research/audio/FastSpeech/src/text/__init__.py b/research/audio/FastSpeech/src/text/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..6e8d5010ee2d59bd8b74a7a1b25f594cafc0fac2 --- /dev/null +++ b/research/audio/FastSpeech/src/text/__init__.py @@ -0,0 +1,77 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the License); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# httpwww.apache.orglicensesLICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an AS IS BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Adapted from https://github.com/keithito/tacotron""" +import re +from src.text import cleaners +from src.text.symbols import all_symbols +from src.text.cleaners import english_cleaners + + +# Mappings from symbol to numeric ID and vice versa +_symbol_to_id = {s: i for i, s in enumerate(all_symbols)} + +# Regular expression matching text enclosed in curly braces +_curly_re = re.compile(r'(.*?)\{(.+?)\}(.*)') + + +def text_to_sequence(text, cleaner_names): + """ + Converts a string of text to a sequence of IDs corresponding to the symbols in the text. + + The text can optionally have ARPAbet sequences enclosed in curly braces embedded + in it. For example, "Turn left on {HH AW1 S S T AH0 N} Street." + + Args: + text (str): String to convert to a sequence. + cleaner_names: names of the cleaner functions to run the text through. + + Returns: + List of integers corresponding to the symbols in the text. + """ + sequence = [] + + # Check for curly braces and treat their contents as ARPAbet + while text: + m = _curly_re.match(text) + if not m: + sequence += _symbols_to_sequence(_clean_text(text, cleaner_names)) + break + sequence += _symbols_to_sequence( + _clean_text(m.group(1), cleaner_names)) + sequence += _arpabet_to_sequence(m.group(2)) + text = m.group(3) + + return sequence + + +def _clean_text(text, cleaner_names): + for name in cleaner_names: + cleaner = getattr(cleaners, name) + if not cleaner: + raise Exception('Unknown cleaner: %s' % name) + text = cleaner(text) + return text + + +def _symbols_to_sequence(symbols): + return [_symbol_to_id[s] for s in symbols if _should_keep_symbol(s)] + + +def _arpabet_to_sequence(text): + return _symbols_to_sequence(['@' + s for s in text.split()]) + + +def _should_keep_symbol(s): + return s in _symbol_to_id and s != '_' and s != '~' diff --git a/research/audio/FastSpeech/src/text/cleaners.py b/research/audio/FastSpeech/src/text/cleaners.py new file mode 100644 index 0000000000000000000000000000000000000000..9594191e954a6e0be72ac41ceed9bd1660c94657 --- /dev/null +++ b/research/audio/FastSpeech/src/text/cleaners.py @@ -0,0 +1,92 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""from https://github.com/keithito/tacotron """ +import re + +from unidecode import unidecode + +from src.text.numbers import normalize_numbers + +_whitespace_re = re.compile(r'\s+') + +# List of (regular expression, replacement) pairs for abbreviations: +_abbreviations = [(re.compile('\\b%s\\.' % x[0], re.IGNORECASE), x[1]) for x in [ + ('mrs', 'misess'), + ('mr', 'mister'), + ('dr', 'doctor'), + ('st', 'saint'), + ('co', 'company'), + ('jr', 'junior'), + ('maj', 'major'), + ('gen', 'general'), + ('drs', 'doctors'), + ('rev', 'reverend'), + ('lt', 'lieutenant'), + ('hon', 'honorable'), + ('sgt', 'sergeant'), + ('capt', 'captain'), + ('esq', 'esquire'), + ('ltd', 'limited'), + ('col', 'colonel'), + ('ft', 'fort'), +]] + + +def expand_abbreviations(text): + for regex, replacement in _abbreviations: + text = re.sub(regex, replacement, text) + return text + + +def expand_numbers(text): + return normalize_numbers(text) + + +def lowercase(text): + return text.lower() + + +def collapse_whitespace(text): + return re.sub(_whitespace_re, ' ', text) + + +def convert_to_ascii(text): + """Convert to ascii.""" + return unidecode(text) + + +def basic_cleaners(text): + """Basic pipeline that lowercases and collapses whitespace without transliteration.""" + text = lowercase(text) + text = collapse_whitespace(text) + return text + + +def transliteration_cleaners(text): + """Pipeline for non-English text that transliterates to ASCII.""" + text = convert_to_ascii(text) + text = lowercase(text) + text = collapse_whitespace(text) + return text + + +def english_cleaners(text): + """Pipeline for English text, including number and abbreviation expansion.""" + text = convert_to_ascii(text) + text = lowercase(text) + text = expand_numbers(text) + text = expand_abbreviations(text) + text = collapse_whitespace(text) + return text diff --git a/research/audio/FastSpeech/src/text/numbers.py b/research/audio/FastSpeech/src/text/numbers.py new file mode 100644 index 0000000000000000000000000000000000000000..1792fd38aac4159916981062a9f21d6c7ecdbf77 --- /dev/null +++ b/research/audio/FastSpeech/src/text/numbers.py @@ -0,0 +1,86 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""from https://github.com/keithito/tacotron""" +import re + +import inflect + +_inflect = inflect.engine() +_comma_number_re = re.compile(r'([0-9][0-9\,]+[0-9])') +_decimal_number_re = re.compile(r'([0-9]+\.[0-9]+)') +_pounds_re = re.compile(r'£([0-9\,]*[0-9]+)') +_dollars_re = re.compile(r'\$([0-9\.\,]*[0-9]+)') +_ordinal_re = re.compile(r'[0-9]+(st|nd|rd|th)') +_number_re = re.compile(r'[0-9]+') + + +def _remove_commas(m): + return m.group(1).replace(',', '') + + +def _expand_decimal_point(m): + return m.group(1).replace('.', ' point ') + + +def _expand_dollars(m): + """Expand english money names values.""" + match = m.group(1) + parts = match.split('.') + if len(parts) > 2: + return match + ' dollars' # Unexpected format + dollars = int(parts[0]) if parts[0] else 0 + cents = int(parts[1]) if len(parts) > 1 and parts[1] else 0 + if dollars and cents: + dollar_unit = 'dollar' if dollars == 1 else 'dollars' + cent_unit = 'cent' if cents == 1 else 'cents' + return '%s %s, %s %s' % (dollars, dollar_unit, cents, cent_unit) + if dollars: + dollar_unit = 'dollar' if dollars == 1 else 'dollars' + return '%s %s' % (dollars, dollar_unit) + if cents: + cent_unit = 'cent' if cents == 1 else 'cents' + return '%s %s' % (cents, cent_unit) + + return 'zero dollars' + + +def _expand_ordinal(m): + return _inflect.number_to_words(m.group(0)) + + +def _expand_number(m): + """Expand numbers into text.""" + num = int(m.group(0)) + if 1000 < num < 3000: + if num == 2000: + return 'two thousand' + if 2000 < num < 2010: + return 'two thousand ' + _inflect.number_to_words(num % 100) + if num % 100 == 0: + return _inflect.number_to_words(num // 100) + ' hundred' + + return _inflect.number_to_words(num, andword='', zero='oh', group=2).replace(', ', ' ') + + return _inflect.number_to_words(num, andword='') + + +def normalize_numbers(text): + text = re.sub(_comma_number_re, _remove_commas, text) + text = re.sub(_pounds_re, r'\1 pounds', text) + text = re.sub(_dollars_re, _expand_dollars, text) + text = re.sub(_decimal_number_re, _expand_decimal_point, text) + text = re.sub(_ordinal_re, _expand_ordinal, text) + text = re.sub(_number_re, _expand_number, text) + return text diff --git a/research/audio/FastSpeech/src/text/symbols.py b/research/audio/FastSpeech/src/text/symbols.py new file mode 100644 index 0000000000000000000000000000000000000000..57d0cd841b2a3c8d8d9078a6f43c3af80542d0b6 --- /dev/null +++ b/research/audio/FastSpeech/src/text/symbols.py @@ -0,0 +1,36 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Symbols preprocessing.""" + +valid_symbols = [ + 'AA', 'AA0', 'AA1', 'AA2', 'AE', 'AE0', 'AE1', 'AE2', 'AH', 'AH0', 'AH1', 'AH2', + 'AO', 'AO0', 'AO1', 'AO2', 'AW', 'AW0', 'AW1', 'AW2', 'AY', 'AY0', 'AY1', 'AY2', + 'B', 'CH', 'D', 'DH', 'EH', 'EH0', 'EH1', 'EH2', 'ER', 'ER0', 'ER1', 'ER2', 'EY', + 'EY0', 'EY1', 'EY2', 'F', 'G', 'HH', 'IH', 'IH0', 'IH1', 'IH2', 'IY', 'IY0', 'IY1', + 'IY2', 'JH', 'K', 'L', 'M', 'N', 'NG', 'OW', 'OW0', 'OW1', 'OW2', 'OY', 'OY0', + 'OY1', 'OY2', 'P', 'R', 'S', 'SH', 'T', 'TH', 'UH', 'UH0', 'UH1', 'UH2', 'UW', + 'UW0', 'UW1', 'UW2', 'V', 'W', 'Y', 'Z', 'ZH' +] + +_pad = '_' +_punctuation = '!\'(),.:;? ' +_special = '-' +_letters = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz' + +# Prepend "@" to ARPAbet symbols to ensure uniqueness (some are the same as uppercase letters): +_arpabet = ['@' + s for s in valid_symbols] + +# Export all symbols: +all_symbols = [_pad] + list(_special) + list(_punctuation) + list(_letters) + _arpabet diff --git a/research/audio/FastSpeech/src/transformer/__init__.py b/research/audio/FastSpeech/src/transformer/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/transformer/constants.py b/research/audio/FastSpeech/src/transformer/constants.py new file mode 100644 index 0000000000000000000000000000000000000000..7d7eb3a6b69aa270e8b6d40880adfe530b68fac9 --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/constants.py @@ -0,0 +1,24 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Constants and tokens.""" +PAD = 0 +UNK = 1 +BOS = 2 +EOS = 3 + +PAD_WORD = '<blank>' +UNK_WORD = '<unk>' +BOS_WORD = '<s>' +EOS_WORD = '</s>' diff --git a/research/audio/FastSpeech/src/transformer/layers.py b/research/audio/FastSpeech/src/transformer/layers.py new file mode 100644 index 0000000000000000000000000000000000000000..7ce9629fc086c0d910d1ddc9f8fef7b49090269b --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/layers.py @@ -0,0 +1,133 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Custom layers.""" +from mindspore import dtype as mstype +from mindspore import nn +from mindspore.common.initializer import Normal +from mindspore.common.initializer import XavierUniform +from mindspore.common.initializer import initializer + +from src.transformer.sublayers import MultiHeadAttention +from src.transformer.sublayers import PositionwiseFeedForward + + +class Linear(nn.Cell): + """ + Create linear layer and init weights. + """ + def __init__( + self, + in_dim, + out_dim, + bias=True, + w_init='linear' + ): + super().__init__() + + if w_init == 'xavier': + linear_weights = initializer(XavierUniform(), [in_dim, out_dim], mstype.float32) + else: + linear_weights = initializer(Normal(), [in_dim, out_dim], mstype.float32) + + self.linear_layer = nn.Dense( + in_dim, + out_dim, + bias=bias, + weight_init=linear_weights, + ) + + def construct(self, x): + """Forward.""" + out = self.linear_layer(x) + + return out + + +class FFTBlock(nn.Cell): + """ + Feed-forward transformer (FFT) block. + Similar for 'encoder' and 'decoder' at this model. + """ + def __init__( + self, + d_model, + d_inner, + n_head, + d_k, + d_v, + dropout=0.1, + ): + super().__init__() + + self.slf_attn = MultiHeadAttention(n_head, d_model, d_k, d_v, dropout=dropout) + self.pos_ffn = PositionwiseFeedForward(d_model, d_inner, dropout=dropout) + + def construct(self, enc_input, non_pad_mask=None, slf_attn_mask=None): + """Forward""" + enc_output = self.slf_attn(enc_input, enc_input, enc_input, mask=slf_attn_mask) + enc_output *= non_pad_mask + + enc_output = self.pos_ffn(enc_output) + enc_output *= non_pad_mask + + return enc_output + + +class ConvNorm(nn.Cell): + """ + Create convolution layer and init weights. + """ + def __init__( + self, + in_channels, + out_channels, + kernel_size=1, + stride=1, + padding=None, + dilation=1, + bias=True, + w_init_gain='linear', + ): + super().__init__() + + if padding is None: + assert kernel_size % 2 == 1 + padding = int(dilation * (kernel_size - 1) / 2) + + if w_init_gain == 'tanh': + gain = 5.0 / 3 + else: + gain = 1 + + self.conv = nn.Conv1d( + in_channels, + out_channels, + kernel_size=kernel_size, + stride=stride, + padding=padding, + dilation=dilation, + bias=bias, + weight_init=initializer( + XavierUniform(gain=gain), + [in_channels, out_channels], + mstype.float32 + ) + ) + + def construct(self, x): + """Forward.""" + output = self.conv(x) + + return output diff --git a/research/audio/FastSpeech/src/transformer/models.py b/research/audio/FastSpeech/src/transformer/models.py new file mode 100644 index 0000000000000000000000000000000000000000..1e4d634447cd8c4bbd389209d9e2598f1ea24885 --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/models.py @@ -0,0 +1,187 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+# ============================================================================ +"""Model script.""" +import numpy as np +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops + +from src.cfg.config import config as hp +from src.transformer import constants +from src.transformer.layers import FFTBlock + + +def get_sinusoid_encoding_table(n_position, d_hid, padding_idx=None): + """ + Sinusoid position encoding table. + """ + def cal_angle(position, hid_idx): + return position / np.power(10000, 2 * (hid_idx // 2) / d_hid) + + def get_posi_angle_vec(position): + return [cal_angle(position, hid_j) for hid_j in range(d_hid)] + + sinusoid_table = np.array([get_posi_angle_vec(pos_i) + for pos_i in range(n_position)]) + + sinusoid_table[:, 0::2] = np.sin(sinusoid_table[:, 0::2]) + sinusoid_table[:, 1::2] = np.cos(sinusoid_table[:, 1::2]) + + if padding_idx is not None: + # zero vector for padding dimension + sinusoid_table[padding_idx] = 0. + + return Tensor(sinusoid_table, dtype=mstype.float32) + + +class Encoder(nn.Cell): + """Encoder.""" + def __init__( + self, + n_src_vocab, + len_max_seq, + d_word_vec, + n_layers, + n_head, + d_k, + d_v, + d_model, + d_inner, + dropout, + ): + super().__init__() + + n_position = len_max_seq + 1 + pretrained_embs = get_sinusoid_encoding_table(n_position, d_word_vec, padding_idx=0) + + self.src_word_emb = nn.Embedding( + n_src_vocab, + d_word_vec, + padding_idx=constants.PAD, + ) + + self.position_enc = nn.Embedding( + n_position, + d_word_vec, + embedding_table=pretrained_embs, + padding_idx=0, + ) + + self.layer_stack = nn.CellList( + [ + FFTBlock(d_model, d_inner, n_head, d_k, d_v, dropout=dropout) for _ in range(n_layers) + ] + ) + + self.equal = ops.Equal() + self.not_equal = ops.NotEqual() + self.expand_dims = ops.ExpandDims() + self.pad = constants.PAD + self.broadcast = ops.BroadcastTo((-1, hp.character_max_length, -1)) + + def construct(self, src_seq, src_pos): + """ + Create mask and forward to FFT blocks. + + Args: + src_seq (Tensor): Tokenized text sequence. Shape (hp.batch_size, hp.character_max_length). + src_pos (Tensor): Positions of the sequences. Shape (hp.batch_size, hp.character_max_length). + + Returns: + enc_output (Tensor): Encoder output. 
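+                Shape (hp.batch_size, hp.character_max_length, hp.encoder_dim).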
+ """ + # Prepare masks + padding_mask = self.equal(src_seq, self.pad) + slf_attn_mask = self.broadcast(self.expand_dims(padding_mask.astype(mstype.float32), 1)) + slf_attn_mask_bool = slf_attn_mask.astype(mstype.bool_) + + non_pad_mask_bool = self.expand_dims(self.not_equal(src_seq, self.pad), -1) + non_pad_mask = non_pad_mask_bool.astype(mstype.float32) + + # Forward + enc_output = self.src_word_emb(src_seq.astype('int32')) + self.position_enc(src_pos.astype('int32')) + + for enc_layer in self.layer_stack: + enc_output = enc_layer( + enc_output, + non_pad_mask=non_pad_mask, + slf_attn_mask=slf_attn_mask_bool, + ) + + return enc_output + + +class Decoder(nn.Cell): + """Decoder.""" + def __init__( + self, + len_max_seq, + n_layers, + n_head, + d_k, + d_v, + d_model, + d_inner, + dropout + ): + + super().__init__() + + n_position = len_max_seq + 1 + + pretrained_embs = get_sinusoid_encoding_table(n_position, d_model, padding_idx=0) + + self.position_enc = nn.Embedding( + n_position, + d_model, + embedding_table=pretrained_embs, + padding_idx=0, + ) + + self.layer_stack = nn.CellList( + [ + FFTBlock(d_model, d_inner, n_head, d_k, d_v, dropout=dropout) for _ in range(n_layers) + ] + ) + + self.pad = constants.PAD + self.equal = ops.Equal() + self.not_equal = ops.NotEqual() + self.expand_dims = ops.ExpandDims() + self.broadcast = ops.BroadcastTo((-1, hp.mel_max_length, -1)) + + def construct(self, enc_seq, enc_pos): + """ + Create mask and forward to FFT blocks. + """ + # Prepare masks + padding_mask = self.equal(enc_pos, self.pad) + slf_attn_mask = self.broadcast(self.expand_dims(padding_mask.astype(mstype.float32), 1)) + slf_attn_mask_bool = slf_attn_mask.astype(mstype.bool_) + + non_pad_mask_bool = self.expand_dims(self.not_equal(enc_pos, self.pad), -1) + non_pad_mask = non_pad_mask_bool.astype(mstype.float32) + + # Forward + dec_output = enc_seq + self.position_enc(enc_pos.astype(mstype.int32)) + + for dec_layer in self.layer_stack: + dec_output = dec_layer( + dec_output, + non_pad_mask=non_pad_mask, + slf_attn_mask=slf_attn_mask_bool) + + return dec_output diff --git a/research/audio/FastSpeech/src/transformer/modules.py b/research/audio/FastSpeech/src/transformer/modules.py new file mode 100644 index 0000000000000000000000000000000000000000..7e72e56c0d9851ca3ebd53394622f3ab92b4c31e --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/modules.py @@ -0,0 +1,59 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model modules.""" +import mindspore.numpy as msnp +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops +from mindspore.ops import constexpr + + +class ScaledDotProductAttention(nn.Cell): + """ + Scaled Dot-Product Attention. 
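+
+    Computes softmax(q @ k.T / temperature) @ v over batched inputs,
+    where temperature is sqrt(d_k). Masked positions are filled with -inf
+    before the softmax, so they receive zero attention weight.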
+ """ + def __init__(self, temperature, attn_dropout=0.1): + super().__init__() + self.temperature = temperature + + self.softmax = nn.Softmax(axis=2) + self.dropout = nn.Dropout(keep_prob=1-attn_dropout) + + self.bmm = ops.BatchMatMul() + self.transpose = ops.Transpose() + + def construct(self, q, k, v, mask=None): + """Forward.""" + attn = self.bmm(q, self.transpose(k, (0, 2, 1))) + attn = attn / self.temperature + + inf_mask = infinity_mask(attn.shape, -msnp.inf) + + if mask is not None: + attn = msnp.where(mask, inf_mask, attn) + + attn = self.softmax(attn) + attn = self.dropout(attn) + + output = self.bmm(attn, v) + + return output + + +@constexpr +def infinity_mask(mask_shape, inf): + """Make infinity mask.""" + inf_mask = ops.Fill()(mstype.float32, mask_shape, inf) + return inf_mask diff --git a/research/audio/FastSpeech/src/transformer/sublayers.py b/research/audio/FastSpeech/src/transformer/sublayers.py new file mode 100644 index 0000000000000000000000000000000000000000..a4bb7b1aaac472fdc0c62d7c0322fe963c824187 --- /dev/null +++ b/research/audio/FastSpeech/src/transformer/sublayers.py @@ -0,0 +1,154 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model sublayers.""" +import numpy as np +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops +from mindspore.common.initializer import Normal +from mindspore.common.initializer import initializer + +from src.cfg.config import config as hp +from src.transformer.modules import ScaledDotProductAttention + + +class MultiHeadAttention(nn.Cell): + """ + Multi-Head Attention module. 
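+
+    Projects the inputs into n_head heads of width d_k (queries, keys) and
+    d_v (values), folds the heads into the batch dimension, runs
+    ScaledDotProductAttention, concatenates the heads back and finishes
+    with a linear projection, dropout, a residual connection and LayerNorm.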
+ """ + def __init__(self, n_head, d_model, d_k, d_v, dropout=0.1): + super().__init__() + + self.n_head = n_head + self.d_k = d_k + self.d_v = d_v + + self.w_qs = nn.Dense( + d_model, + n_head * d_k, + weight_init=initializer( + Normal(sigma=np.sqrt(2.0 / (d_model + d_k)), mean=0), + [d_model, n_head * d_k], + mstype.float32, + ) + ) + + self.w_ks = nn.Dense( + d_model, + n_head * d_k, + weight_init=initializer( + Normal(sigma=np.sqrt(2.0 / (d_model + d_k)), mean=0), + [d_model, n_head * d_k], + mstype.float32, + ) + ) + + self.w_vs = nn.Dense( + d_model, + n_head * d_v, + weight_init=initializer( + Normal(sigma=np.sqrt(2.0 / (d_model + d_v)), mean=0), + [d_model, n_head * d_v], + mstype.float32, + ) + ) + + self.fc = nn.Dense( + n_head * d_v, + d_model, + weight_init=initializer(Normal(), [n_head * d_v, d_model], mstype.float32) + ) + + self.attention = ScaledDotProductAttention(temperature=np.power(d_k, 0.5)) + self.layer_norm = nn.LayerNorm([d_model]) + self.dropout = nn.Dropout(keep_prob=1-dropout) + + self.transpose = ops.Transpose() + self.reshape = ops.Reshape() + self.tile = ops.Tile() + + def construct(self, q, k, v, mask=None): + """Forward.""" + d_k, d_v, n_head = self.d_k, self.d_v, self.n_head + + sz_b, len_q, _ = q.shape + sz_b, len_k, _ = k.shape + sz_b, len_v, _ = v.shape + + residual = q + + q = self.reshape(self.w_qs(q), (sz_b, len_q, n_head, d_k)) + k = self.reshape(self.w_ks(k), (sz_b, len_k, n_head, d_k)) + v = self.reshape(self.w_vs(v), (sz_b, len_v, n_head, d_v)) + + q = self.reshape(self.transpose(q, (2, 0, 1, 3)), (-1, len_q, d_k)) # (n*b) x lq x dk + k = self.reshape(self.transpose(k, (2, 0, 1, 3)), (-1, len_q, d_k)) # (n*b) x lk x dk + v = self.reshape(self.transpose(v, (2, 0, 1, 3)), (-1, len_v, d_v)) # (n*b) x lv x dv + + mask = self.tile(mask.astype(mstype.float32), (n_head, 1, 1)) + output = self.attention(q, k, v, mask=mask.astype(mstype.bool_)) + + output = self.reshape(output, (n_head, sz_b, len_q, d_v)) + output = self.reshape(self.transpose(output, (1, 2, 0, 3)), (sz_b, len_q, -1)) # b x lq x (n*dv) + + output = self.dropout(self.fc(output)) + output = self.layer_norm(output + residual) + + return output + + +class PositionwiseFeedForward(nn.Cell): + """A two-feed-forward-layer module.""" + def __init__(self, d_in, d_hid, dropout=0.1): + super().__init__() + + self.w_1 = nn.Conv1d( + d_in, + d_hid, + kernel_size=hp.fft_conv1d_kernel[0], + pad_mode='pad', + padding=hp.fft_conv1d_padding[0], + has_bias=True, + ) + + self.w_2 = nn.Conv1d( + d_hid, + d_in, + kernel_size=hp.fft_conv1d_kernel[1], + pad_mode='pad', + padding=hp.fft_conv1d_padding[1], + has_bias=True, + ) + + self.dropout = nn.Dropout(keep_prob=1-dropout) + self.layer_norm = nn.LayerNorm([d_in]) + self.relu = nn.ReLU() + + self.transpose = ops.Transpose() + + def construct(self, x): + """Forward.""" + residual = x + + output = self.transpose(x, (0, 2, 1)) + output = self.w_1(output) + output = self.relu(output) + output = self.w_2(output) + output = self.transpose(output, (0, 2, 1)) + output = self.dropout(output) + + output = self.layer_norm(output + residual) + + return output diff --git a/research/audio/FastSpeech/src/utils.py b/research/audio/FastSpeech/src/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..3a7542184efd138d33ca4729cbe2737910b516de --- /dev/null +++ b/research/audio/FastSpeech/src/utils.py @@ -0,0 +1,54 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this 
file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Utilities.""" +from pathlib import Path + +import numpy as np + +from src.cfg.config import config as hp + + +def process_text(train_text_path): + """ + Read .txt data. + """ + metadata_path = Path(train_text_path) + with metadata_path.open("r", encoding="utf-8") as file: + txt = [] + for line in file.readlines(): + txt.append(line) + + return txt + + +def pad_1d_tensor(inputs): + """ + Pad 1d tensor to fixed size. + """ + max_len = hp.character_max_length + padded = np.pad(inputs, (0, max_len - inputs.shape[0])) + + return padded + + +def pad_2d_tensor(inputs): + """ + Pad 2d tensor to fixed size. + """ + max_len = hp.mel_max_length + s = inputs.shape[1] + padded = np.pad(inputs, (0, max_len - inputs.shape[0]))[:, :s] + + return padded diff --git a/research/audio/FastSpeech/src/waveglow/__init__.py b/research/audio/FastSpeech/src/waveglow/__init__.py new file mode 100644 index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 diff --git a/research/audio/FastSpeech/src/waveglow/layers.py b/research/audio/FastSpeech/src/waveglow/layers.py new file mode 100644 index 0000000000000000000000000000000000000000..d33704b54c307165e8ba156ebec82ba4e9927fa4 --- /dev/null +++ b/research/audio/FastSpeech/src/waveglow/layers.py @@ -0,0 +1,38 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model layers.""" +from mindspore import nn + + +class Invertible1x1Conv(nn.Cell): + """ + The layer outputs both the convolution, + and the log determinant of its weight matrix. + """ + def __init__(self, c): + super().__init__() + self.conv = nn.Conv1d( + in_channels=c, + out_channels=c, + kernel_size=1, + stride=1, + padding=0, + has_bias=False, + ) + + def construct(self, z): + z = self.conv(z) + + return z diff --git a/research/audio/FastSpeech/src/waveglow/model.py b/research/audio/FastSpeech/src/waveglow/model.py new file mode 100644 index 0000000000000000000000000000000000000000..10a4eb4ae9dff5c74a2b6ca3c39a81c6928311bc --- /dev/null +++ b/research/audio/FastSpeech/src/waveglow/model.py @@ -0,0 +1,270 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Model script.""" +import numpy as np +from mindspore import Tensor +from mindspore import dtype as mstype +from mindspore import nn +from mindspore import ops + +from src.waveglow.layers import Invertible1x1Conv +from src.waveglow.utils import fused_add_tanh_sigmoid_multiply + + +class WN(nn.Cell): + """ + This is the WaveNet like layer for the affine coupling. + The primary difference from WaveNet is the convolutions need not be causal. + There is also no dilation size reset. The dilation only doubles on each layer. + """ + def __init__( + self, + n_in_channels, + n_mel_channels, + n_layers, + n_channels, + kernel_size, + ): + super().__init__() + + self.n_layers = n_layers + self.n_channels = n_channels + self.in_layers = nn.CellList() + self.res_skip_layers = nn.CellList() + + self.start = nn.Conv1d( + in_channels=n_in_channels, + out_channels=n_channels, + kernel_size=1, + has_bias=True + ) + + self.end = nn.Conv1d( + in_channels=n_channels, + out_channels=2 * n_in_channels, + kernel_size=1, + has_bias=True + ) + + self.cond_layer = nn.Conv1d( + in_channels=n_mel_channels, + out_channels=2 * n_channels * n_layers, + kernel_size=1, + has_bias=True + ) + + for i in range(n_layers): + dilation = 2 ** i + padding = int((kernel_size * dilation - dilation) / 2) + + if i < n_layers - 1: + res_skip_channels = 2 * n_channels + else: + res_skip_channels = n_channels + + in_layer = nn.Conv1d( + in_channels=n_channels, + out_channels=2 * n_channels, + kernel_size=kernel_size, + dilation=dilation, + pad_mode='pad', + padding=padding, + has_bias=True + ) + + res_skip_layer = nn.Conv1d( + in_channels=n_channels, + out_channels=res_skip_channels, + kernel_size=1, + has_bias=True + ) + + self.in_layers.append(in_layer) + self.res_skip_layers.append(res_skip_layer) + + self.audio_zeros = Tensor(np.zeros((1, self.n_channels, 28800)), mstype.float32) + + def construct(self, audio, spect): + """ + Forward. 
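+
+        Runs the gated dilated-convolution stack: every layer adds its own
+        slice of the conditioning mel-spectrogram, applies the fused
+        tanh/sigmoid gate, and accumulates the skip output, which `self.end`
+        finally projects to 2 * n_in_channels channels (the log-scale and
+        bias used to invert the affine coupling in AudioCell).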
+ """ + audio = self.start(audio) + output = self.audio_zeros + + spect = self.cond_layer(spect) + + for i in range(self.n_layers): + spect_offset = i * 2 * self.n_channels + + acts = fused_add_tanh_sigmoid_multiply( + self.in_layers[i](audio), + spect[:, spect_offset: spect_offset + 2 * self.n_channels, :], + self.n_channels + ) + + res_skip_acts = self.res_skip_layers[i](acts) + if i < self.n_layers - 1: + audio = audio + res_skip_acts[:, :self.n_channels, :] + output = output + res_skip_acts[:, self.n_channels:, :] + else: + output = output + res_skip_acts + + output = self.end(output) + + return output + + +class WaveGlow(nn.Cell): + """WaveGlow vocoder inference model.""" + def __init__( + self, + n_mel_channels, + n_flows, + n_group, + n_early_every, + n_early_size, + wn_config, + sigma=1.0 + ): + super().__init__() + + self.upsample = nn.Conv1dTranspose( + in_channels=n_mel_channels, + out_channels=n_mel_channels, + pad_mode='valid', + kernel_size=1024, + stride=256, + has_bias=True, + ) + + self.n_flows = n_flows + self.n_group = n_group + self.n_early_every = n_early_every + self.n_early_size = n_early_size + self.wavenet = nn.CellList() + self.convinv = nn.CellList() + + n_half = int(n_group / 2) + n_remaining_channels = n_group + audio_cells_list = [] + + for k in range(n_flows): + use_data_append = False + if k % self.n_early_every == 0 and k > 0: + n_half = n_half - int(self.n_early_size / 2) + n_remaining_channels = n_remaining_channels - self.n_early_size + use_data_append = True + + audio_cells_list.insert( + 0, + AudioCell( + n_half=n_half, + n_mel_channels=n_mel_channels * n_group, + wn_config=wn_config, + use_data_append=use_data_append, + n_early_size=self.n_early_size, + sigma=sigma, + n_remaining_channels=n_remaining_channels, + ) + ) + + self.wavenet_blocks = nn.CellList(audio_cells_list) + + self.n_remaining_channels = n_remaining_channels + + self.concat = ops.Concat(axis=1) + self.transpose = ops.Transpose() + self.reshape = ops.Reshape() + + self.noise_shape = (1, self.n_remaining_channels, 28800) + self.audio = Tensor(np.random.standard_normal(self.noise_shape), mstype.float32) + + self.time_cutoff = self.upsample.kernel_size[1] - self.upsample.stride[1] + self.sigma = Tensor(sigma, mstype.float32) + + def construct(self, spect): + """ + Forward to mel-spectrogram. + + Args: + spect (Tensor): Mel-spectrogram. Shape (1, n_mel_channels, max_mel_len) + + Returns: + audio (Tensor): Raw audio. 
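+                Shape (1, n_group * 28800); the time dimension is fixed by
+                the pre-allocated noise tensors. The flow cells are applied
+                in inference (reversed-flow) order on sigma-scaled Gaussian
+                noise, conditioned on the upsampled mel-spectrogram.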
+ """ + spect = self.upsample(spect) + spect = spect[:, :, : - self.time_cutoff] + bs, mel_size, channels = spect.shape + + spect = self.reshape(spect, (bs, mel_size, channels // self.n_group, self.n_group)) + spect = self.transpose(spect, (0, 2, 1, 3)) + spect = self.transpose(spect.view(spect.shape[0], spect.shape[1], -1), (0, 2, 1)) + + audio = self.sigma * self.audio + + for audio_cell in self.wavenet_blocks: + audio = audio_cell(audio, spect) + + audio = self.transpose(audio, (0, 2, 1)).view(audio.shape[0], -1) + + return audio + + +class AudioCell(nn.Cell): + """Audio generator cell.""" + def __init__( + self, + n_half, + n_mel_channels, + wn_config, + use_data_append, + n_early_size, + sigma, + n_remaining_channels, + ): + super().__init__() + self.n_half = n_half + + self.wn_cell = WN(n_half, n_mel_channels, **wn_config) + self.convinv = Invertible1x1Conv(n_remaining_channels) + self.sigma = Tensor(sigma, mstype.float32) + + self.use_data_append = bool(use_data_append) + self.noise_shape = (1, n_early_size, 28800) + + self.z = Tensor(np.random.standard_normal(self.noise_shape), mstype.float32) + self.concat = ops.Concat(axis=1) + self.exp = ops.Exp() + + def construct(self, audio, spect): + """Iterationaly restore audio from spectrogram.""" + audio_0 = audio[:, :self.n_half, :] + audio_1 = audio[:, self.n_half:, :] + + output = self.wn_cell(audio_0, spect) + + s = output[:, self.n_half:, :] + b = output[:, :self.n_half, :] + + audio_1 = (audio_1 - b) / self.exp(s) + + audio = self.concat((audio_0, audio_1)) + audio = self.convinv(audio) + + if self.use_data_append: + z = self.z + audio = self.concat((self.sigma * z, audio)) + + return audio diff --git a/research/audio/FastSpeech/src/waveglow/utils.py b/research/audio/FastSpeech/src/waveglow/utils.py new file mode 100644 index 0000000000000000000000000000000000000000..ba23c56ad9f48e95fc52d48c7b572dc807c30165 --- /dev/null +++ b/research/audio/FastSpeech/src/waveglow/utils.py @@ -0,0 +1,43 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Utils scripts.""" +from mindspore import ops + + +def fused_add_tanh_sigmoid_multiply(input_a, input_b, n_channels): + """ + Fusion method. + """ + n_channels_int = n_channels + in_act = input_a + input_b + + t_act = ops.Tanh()(in_act[:, :n_channels_int, :]) + s_act = ops.Sigmoid()(in_act[:, n_channels_int:, :]) + + acts = t_act * s_act + + return acts + + +def files_to_list(filename): + """ + Takes a text file of filenames and makes a list of filenames. 
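+    Lines are right-stripped, so trailing newlines and spaces are removed.
+
+    Example (illustrative file names only):
+        files_to_list('filelist.txt') -> ['wavs/LJ001-0001.wav', ...]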
+ """ + with open(filename, encoding='utf-8') as f: + files = f.readlines() + + files = [f.rstrip() for f in files] + + return files diff --git a/research/audio/FastSpeech/train.py b/research/audio/FastSpeech/train.py new file mode 100644 index 0000000000000000000000000000000000000000..80dda8331aaf6fb8a102f64a41d7b6eb8d020dd0 --- /dev/null +++ b/research/audio/FastSpeech/train.py @@ -0,0 +1,169 @@ +# Copyright 2022 Huawei Technologies Co., Ltd +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# ============================================================================ +"""Training script.""" +import os + +import numpy as np +from mindspore import Model +from mindspore import context +from mindspore.common import set_seed +from mindspore.communication.management import get_group_size +from mindspore.communication.management import get_rank +from mindspore.communication.management import init +from mindspore.context import ParallelMode +from mindspore.dataset import GeneratorDataset +from mindspore.nn import Adam +from mindspore.train.callback import CheckpointConfig +from mindspore.train.callback import LossMonitor +from mindspore.train.callback import ModelCheckpoint +from mindspore.train.callback import TimeMonitor + +from src.cfg.config import config as default_config +from src.dataset import BufferDataset +from src.dataset import get_data_to_buffer +from src.model import FastSpeech +from src.model import LossWrapper + +set_seed(1) + + +def _get_rank_info(target): + """ + Get rank size and rank id. + """ + if target == 'GPU': + num_devices = get_group_size() + device = get_rank() + else: + raise ValueError("Unsupported platform.") + + return num_devices, device + + +def lr_scheduler(cfg, steps_per_epoch, p_num): + """ + Init lr steps. + """ + d_model = cfg.decoder_dim + lr_init = np.power(d_model, -0.5) * cfg.lr_scale + warmup_steps = cfg.n_warm_up_step + total_steps = cfg.epochs * steps_per_epoch + + learning_rate = [] + for step in range(1, total_steps + 1): + lr_at_step = np.min([ + np.power(step * p_num, -0.5), + np.power(warmup_steps, -1.5) * step + ]) + learning_rate.append(lr_at_step * lr_init) + + return learning_rate + + +def set_trainable_params(params): + """ + Freeze positional encoding layers + and exclude it from trainable params for optimizer. 
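+
+    Parameters whose names end with 'position_enc.embedding_table' get
+    requires_grad set to False and are dropped from the returned list, so
+    the pre-computed sinusoid tables stay fixed during training.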
+ """ + trainable_params = [] + for param in params: + if param.name.endswith('position_enc.embedding_table'): + param.requires_grad = False + else: + trainable_params.append(param) + + return trainable_params + + +def main(): + """Trainloop.""" + config = default_config + device_target = config.device_target + + context.set_context(mode=context.GRAPH_MODE, device_target=device_target) + device_num = int(os.getenv('RANK_SIZE', '1')) + + if device_target == 'GPU': + if device_num > 1: + init(backend_name='nccl') + device_num = get_group_size() + device_id = get_rank() + context.reset_auto_parallel_context() + context.set_auto_parallel_context( + device_num=device_num, + parallel_mode=ParallelMode.DATA_PARALLEL, + gradients_mean=True, + ) + else: + device_num = 1 + device_id = config.device_id + context.set_context(device_id=device_id) + else: + raise ValueError("Unsupported platform.") + + if device_num > 1: + rank_size, rank_id = _get_rank_info(target=device_target) + else: + rank_size, rank_id = None, None + + net = FastSpeech() + network = LossWrapper(net) + network.set_train(True) + + buffer = get_data_to_buffer() + data = BufferDataset(buffer) + + dataloader = GeneratorDataset( + data, + column_names=['text', 'mel_pos', 'src_pos', 'mel_max_len', 'duration', 'mel_target'], + shuffle=True, + num_shards=rank_size, + shard_id=rank_id, + num_parallel_workers=1, + python_multiprocessing=False, + ) + + dataloader = dataloader.batch(config.batch_size, True) + batch_num = dataloader.get_dataset_size() + + lr = lr_scheduler(config, batch_num, device_num) + + trainable_params = set_trainable_params(network.trainable_params()) + opt = Adam(trainable_params, beta1=0.9, beta2=0.98, eps=1e-9, learning_rate=lr) + + model = Model(network, optimizer=opt) + + config_ck = CheckpointConfig( + save_checkpoint_steps=batch_num, + keep_checkpoint_max=config.keep_checkpoint_max, + ) + + loss_cb = LossMonitor(per_print_times=10) + time_cb = TimeMonitor(data_size=batch_num) + ckpt_cb = ModelCheckpoint( + prefix="FastSpeech", + directory=config.logs_dir, + config=config_ck, + ) + + cbs = [loss_cb, time_cb, ckpt_cb] + if device_num > 1 and device_id != config.device_start: + cbs = [loss_cb, time_cb] + + model.train(epoch=config.epochs, train_dataset=dataloader, callbacks=cbs, dataset_sink_mode=False) + + +if __name__ == "__main__": + main()
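+
+# Note on lr_scheduler(): it builds a Noam-style schedule,
+#     lr(step) = decoder_dim^(-0.5) * lr_scale
+#                * min((step * device_num)^(-0.5), step * n_warm_up_step^(-1.5)),
+# i.e. roughly linear warmup followed by inverse-square-root decay, with the
+# step count scaled by the number of devices. The exact constants come from
+# src/cfg/config.py.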