Skip to content
Snippets Groups Projects
Commit b5842d64 authored by latrawy's avatar latrawy
Browse files

update STGAN GPU code

parent 9c321da6
No related branches found
No related tags found
No related merge requests found
......@@ -53,8 +53,8 @@ Dataset used: [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)
## [Environment Requirements](#contents)
- Hardware(Ascend)
- Prepare hardware environment with Ascend processor.
- Hardware(Ascend/GPU
- Prepare hardware environment with Ascend processor.It also supports the use of GPU processor to prepare the hardware environment.
- Framework
- [MindSpore](https://www.mindspore.cn/install/en)
- For more information, please check the resources below:
......@@ -65,14 +65,27 @@ Dataset used: [CelebA](http://mmlab.ie.cuhk.edu.hk/projects/CelebA.html)
After installing MindSpore via the official website, you can start training and evaluation as follows:
```python
# enter script dir, train STGAN
sh scripts/run_standalone_train.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID]
# distributed training
sh scripts/run_distribute_train.sh [RANK_TABLE_FILE] [EXPERIMENT_NAME] [DATA_PATH]
# enter script dir, evaluate STGAN
sh scripts/run_eval.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID] [CHECKPOINT_PATH]
```
- running on Ascend
```python
# train STGAN
sh scripts/run_standalone_train.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID]
# distributed training
sh scripts/run_distribute_train.sh [RANK_TABLE_FILE] [EXPERIMENT_NAME] [DATA_PATH]
# evaluate STGAN
sh scripts/run_eval.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID] [CHECKPOINT_PATH]
```
- running on GPU
```python
# train STGAN
sh scripts/run_standalone_train_gpu.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID]
# distributed training
sh scripts/run_distribute_train_gpu.sh [EXPERIMENT_NAME] [DATA_PATH]
# evaluate STGAN, if you want to evaluate distributed training result, you should enter ./train_parallel
sh scripts/run_eval_gpu.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID] [CHECKPOINT_PATH]
```
## [Script Description](#contents)
......@@ -84,9 +97,13 @@ sh scripts/run_eval.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID] [CHECKPOINT_PAT
├── README.md // descriptions about STGAN
├── requirements.txt // package needed
├── scripts
│ ├──run_standalone_train.sh // train in ascend
│ ├──run_eval.sh // evaluate in ascend
│ ├──run_distribute_train.sh // distributed train in ascend
│ ├──docker_start.sh // start docker container
│ ├──run_standalone_train.sh // train in ascend
│ ├──run_eval.sh // evaluate in ascend
│ ├──run_distribute_train.sh // distributed train in ascend
│ ├──run_standalone_train_gpu.sh // train in GPU
│ ├──run_eval_gpu.sh // evaluate in GPU
│ ├──run_distribute_train_gpu.sh // distributed train in GPU
├── src
├── dataset
├── datasets.py // creating dataset
......@@ -114,7 +131,7 @@ Major parameters in train.py and utils/args.py as follows:
--n_epochs: Total training epochs.
--batch_size: Training batch size.
--image_size: Image size used as input to the model.
--device_target: Device where the code will be implemented. Optional value is "Ascend".
--device_target: Device where the code will be implemented. Optional value is "Ascend" or "GPU".
```
### [Training Process](#contents)
......@@ -125,12 +142,22 @@ Major parameters in train.py and utils/args.py as follows:
```bash
python train.py --dataroot ./dataset --experiment_name 128 > log 2>&1 &
# or enter script dir, and run the script
# or run the script
sh scripts/run_standalone_train.sh ./dataset 128 0
# distributed training
sh scripts/run_distribute_train.sh ./config/rank_table_8pcs.json 128 /data/dataset
```
- running on GPU
```bash
python train.py --dataroot ./dataset --experiment_name 128 --platform="GPU" > log 2>&1 &
# or run the script
sh scripts/run_standalone_train_gpu.sh ./dataset 128 0
# distributed training
sh scripts/run_distribute_train_gpu.sh 128 /data/dataset
```
After training, the loss value will be achieved as follows:
```bash
......@@ -155,10 +182,18 @@ Before running the command below, please check the checkpoint path used for eval
```bash
python eval.py --dataroot ./dataset --experiment_name 128 > eval_log.txt 2>&1 &
# or enter script dir, and run the script
# or run the script
sh scripts/run_eval.sh ./dataset 128 0 ./ckpt/generator.ckpt
```
- running on GPU
```bash
python eval.py --dataroot ./dataset --experiment_name 128 --platform="GPU" > eval_log.txt 2>&1 &
# or run the script (if you want to evaluate distributed training result, you should enter ./train_parallel, then run the script)
sh scripts/run_eval_gpu.sh ./dataset 128 0 ./ckpt/generator.ckpt
```
You can view the results in the output directory, which contains a batch of result sample images.
### Model Export
......@@ -175,22 +210,22 @@ python export.py --ckpt_path [CHECKPOINT_PATH] --platform [PLATFORM] --file_form
#### Evaluation Performance
| Parameters | Ascend |
| -------------------------- | ----------------------------------------------------------- |
| Model Version | V1 |
| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory, 755G |
| uploaded Date | 05/07/2021 (month/day/year) |
| MindSpore Version | 1.2.0 |
| Dataset | CelebA |
| Training Parameters | epoch=100, batch_size = 128 |
| Optimizer | Adam |
| Loss Function | Loss |
| Output | predict class |
| Loss | 6.5523 |
| Speed | 1pc: 400 ms/step; 8pcs: 143 ms/step |
| Total time | 1pc: 41:36:07 |
| Checkpoint for Fine tuning | 170.55M(.ckpt file) |
| Scripts | [STGAN script](https://gitee.com/mindspore/models/tree/master/research/cv/STGAN) |
| Parameters | Ascend | GPU |
| -------------------------- | ----------------------------------------------------------- | --- |
| Model Version | V1 | V1 |
| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory, 755G | RTX-3090 |
| uploaded Date | 05/07/2021 (month/day/year) | 11/23/2021 (month/day/year) |
| MindSpore Version | 1.2.0 | 1.5.0rc1 |
| Dataset | CelebA | CelebA |
| Training Parameters | epoch=100, batch_size = 128 | epoch=100, batch_size=64 |
| Optimizer | Adam | Adam |
| Loss Function | Loss | Loss |
| Output | predict class | image |
| Loss | 6.5523 | 31.23 |
| Speed | 1pc: 400 ms/step; 8pcs: 143 ms/step | 1pc: 369 ms/step; 8pcs: 68 ms/step |
| Total time | 1pc: 41:36:07 | 1pc: 29:15:09 |
| Checkpoint for Fine tuning | 170.55M(.ckpt file) | 283.76M(.ckpt file) |
| Scripts | [STGAN script](https://gitee.com/mindspore/models/tree/master/research/cv/STGAN) | [STGAN script](https://gitee.com/mindspore/models/tree/master/research/cv/STGAN) |
## [Model Description](#contents)
......
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 2 ]
then
echo "Usage: sh run_distribute_train_gpu.sh [EXPERIMENT_NAME] [DATA_PATH]"
exit 1
fi
export DEVICE_NUM=8
export RANK_SIZE=8
rm -rf ./train_parallel
mkdir ./train_parallel
cp ./*.py ./train_parallel
cp -r ./src ./train_parallel
cp -r ./scripts ./train_parallel
cd ./train_parallel || exit
export EXPERIMENT_NAME=$1
export DATA_PATH=$2
mpirun --allow-run-as-root -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
nohup python train.py \
--dataroot=$DATA_PATH \
--experiment_name=$EXPERIMENT_NAME \
--device_num ${DEVICE_NUM} \
--platform="GPU" > log 2>&1 &
cd ..
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 4 ]
then
echo "Usage: sh run_eval_gpu.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID] [CHECKPOINT_PATH]"
exit 1
fi
export DATA_PATH=$1
export EXPERIMENT_NAME=$2
export DEVICE_ID=$3
export CHECKPOINT_PATH=$4
python eval.py --dataroot=$DATA_PATH --experiment_name=$EXPERIMENT_NAME \
--device_id=$DEVICE_ID --ckpt_path=$CHECKPOINT_PATH \
--platform="GPU" > eval_log 2>&1 &
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ]
then
echo "Usage: sh run_standalone_train_gpu.sh [DATA_PATH] [EXPERIMENT_NAME] [DEVICE_ID]"
exit 1
fi
export DATA_PATH=$1
export EXPERIMENT_NAME=$2
export DEVICE_ID=$3
python train.py --dataroot=$DATA_PATH --experiment_name=$EXPERIMENT_NAME \
--device_id=$DEVICE_ID --platform="GPU" > log 2>&1 &
......@@ -48,14 +48,15 @@ class BaseModel(ABC):
), 'Checkpoint path not found at %s' % self.save_dir
self.current_iteration = self.args.continue_iter
else:
if not os.path.exists(self.save_dir):
if not os.path.exists(self.save_dir) and self.args.rank == 0:
mkdirs(self.save_dir)
# save config
self.config_save_path = os.path.join(self.save_dir, 'config')
if not os.path.exists(self.config_save_path):
if not os.path.exists(self.config_save_path) and self.args.rank == 0:
mkdirs(self.config_save_path)
if self.isTrain:
if self.isTrain and self.args.rank == 0:
with open(os.path.join(self.config_save_path, 'train.conf'),
'w') as f:
f.write(json.dumps(vars(self.args)))
......@@ -67,7 +68,7 @@ class BaseModel(ABC):
# sample save path
if self.isTrain:
self.sample_save_path = os.path.join(self.save_dir, 'sample')
if not os.path.exists(self.sample_save_path):
if not os.path.exists(self.sample_save_path) and self.args.rank == 0:
mkdirs(self.sample_save_path)
# test result save path
......@@ -79,7 +80,7 @@ class BaseModel(ABC):
# train log save path
if self.isTrain:
self.train_log_path = os.path.join(self.save_dir, 'logs')
if not os.path.exists(self.train_log_path):
if not os.path.exists(self.train_log_path) and self.args.rank == 0:
mkdirs(self.train_log_path)
@abstractmethod
......@@ -109,7 +110,7 @@ class BaseModel(ABC):
def save_networks(self):
""" saving networks """
for name in self.model_names:
if isinstance(name, str):
if isinstance(name, str) and self.args.rank == 0:
save_filename = '%s_%s.ckpt' % (self.current_iteration, name)
save_filename_latest = 'latest_%s.ckpt' % name
save_path = os.path.join(self.save_dir, 'ckpt')
......
......@@ -19,7 +19,7 @@ import ast
import datetime
from mindspore.context import ParallelMode
from mindspore import context
from mindspore.communication.management import init
from mindspore.communication.management import init, get_rank
def add_basic_parameters(parser):
""" add basic parameters """
......@@ -266,7 +266,7 @@ def get_args(phase):
assert args.experiment_name != default_experiment_name, "--experiment_name should be assigned in test mode"
if args.continue_train:
assert args.experiment_name != default_experiment_name, "--experiment_name should be assigned in continue"
if args.device_num > 1 and args.platform != "CPU":
if args.device_num > 1 and args.platform == "Ascend":
context.set_context(mode=context.GRAPH_MODE,
device_target=args.platform,
save_graphs=args.save_graphs,
......@@ -278,6 +278,12 @@ def get_args(phase):
device_num=args.device_num)
init()
args.rank = int(os.environ["DEVICE_ID"])
elif args.device_num > 1 and args.platform == "GPU":
init()
context.reset_auto_parallel_context()
args.rank = get_rank()
context.set_auto_parallel_context(device_num=args.device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
else:
context.set_context(mode=context.GRAPH_MODE,
device_target=args.platform,
......
......@@ -13,6 +13,7 @@
# limitations under the License.
# ============================================================================
""" STGAN TRAIN"""
import time
import tqdm
from mindspore.common import set_seed
......@@ -44,8 +45,9 @@ def train():
model = STGANModel(args)
it_count = 0
for _ in tqdm.trange(args.n_epochs, desc='Epoch Loop'):
for _ in tqdm.trange(iter_per_epoch, desc='Inner Epoch Loop'):
for _ in tqdm.trange(args.n_epochs, desc='Epoch Loop', unit='epoch'):
start_epoch_time = time.time()
for _ in tqdm.trange(iter_per_epoch, desc='Step Loop', unit='step'):
if model.current_iteration > it_count:
it_count += 1
continue
......@@ -56,11 +58,11 @@ def train():
model.optimize_parameters()
# saving model
if (it_count + 1) % args.save_freq == 0:
if (it_count + 1) % args.save_freq == 0 and args.rank == 0:
model.save_networks()
# sampling
if (it_count + 1) % args.sample_freq == 0:
if (it_count + 1) % args.sample_freq == 0 and args.rank == 0:
model.eval(data_loader)
except KeyboardInterrupt:
......@@ -69,7 +71,9 @@ def train():
it_count += 1
model.current_iteration = it_count
if args.rank == 0:
with open('performance.log', "a") as f:
f.write('average speed: {}ms/step\n'.format((time.time() - start_epoch_time)*1000/iter_per_epoch))
model.save_networks()
print('\n\n=============== finish training ===============\n\n')
......
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment