Unverified commit 99450158 authored by i-robot, committed by Gitee

!2869 Implemented training and evaluation of simple baselines on a GPU

Merge pull request !2869 from Marina Molchanova/simple_baseline_gpu
parents 248d58d7 2d984191
Showing with 701 additions and 233 deletions
# Contents
<!-- TOC -->
[View Chinese](./README_CN.md)
- [Simple Baselines Description](#simple_baselines-description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
- [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
- [Script and Sample Code](#script-and-sample-code)
- [Script Parameters](#script-parameters)
- [Training Process](#training-process)
- [Usage](#usage1)
- [Result](#result1)
- [Evaluation Process](#evaluation-process)
- [Usage](#usage2)
- [Result](#result2)
- [Inference Process](#inference-process)
- [Model Export](#model-export)
- [Infer on Ascend310](#infer-ascend310)
- [Result](#result)
- [Model Description](#model-description)
- [Performance](#performance)
- [Description of Random State](#description-of-random-state)
- [ModelZoo Homepage](#ModelZoo-homepage)
<!-- /TOC -->
# Simple Baselines Description
## Overview
Simple Baselines was proposed by Bin Xiao, Haiping Wu, and Yichen Wei from Microsoft Research Asia. The authors argue that
the popular human pose estimation and tracking methods of the time were unnecessarily complicated: the existing models
look quite different in structure but perform rather similarly. The paper proposes a simple and effective baseline that
adds a few deconvolution layers on top of a ResNet backbone, which is arguably the simplest way to estimate heatmaps
from deep, low-resolution feature maps, and thereby helps to inspire and evaluate new ideas in the field.
For more details refer to [paper](https://arxiv.org/pdf/1804.06208.pdf).
The MindSpore implementation is based on the [original PyTorch version](https://github.com/microsoft/human-pose-estimation.pytorch) released by Microsoft Research Asia.
## Paper
[Paper](https://arxiv.org/pdf/1804.06208.pdf): Bin Xiao, Haiping Wu, Yichen Wei. "Simple Baselines for Human Pose Estimation and Tracking"
# Model Architecture
The overall network architecture of simple baselines is described in the [paper](https://arxiv.org/pdf/1804.06208.pdf).
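As a rough illustration (not the repository's exact code, which lives in src/pose_resnet.py), the head described in the paper can be sketched with MindSpore's `nn` API using the network parameters listed below under [Script Parameters](#script-parameters):

```python
import mindspore.nn as nn

# Minimal sketch of the simple-baselines head, assuming MindSpore's nn API:
# NUM_DECONV_LAYERS = 3, NUM_DECONV_FILTERS = [256, 256, 256],
# NUM_DECONV_KERNELS = [4, 4, 4], FINAL_CONV_KERNEL = 1, NUM_JOINTS = 17.
def make_head(in_channels=2048, num_joints=17):
    layers = []
    for _ in range(3):
        # Each 4x4 deconvolution doubles the spatial resolution.
        layers.append(nn.Conv2dTranspose(in_channels, 256, kernel_size=4,
                                         stride=2, pad_mode='same'))
        layers.append(nn.BatchNorm2d(256))
        layers.append(nn.ReLU())
        in_channels = 256
    # A 1x1 convolution maps the features to one heatmap per joint.
    layers.append(nn.Conv2d(256, num_joints, kernel_size=1))
    return nn.SequentialCell(layers)
```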
# Dataset
Dataset used: [COCO2017](https://cocodataset.org/#download)
- Dataset size:
    - Train: 19.56 GB, 57k images, 149813 person instances
    - Test: 825 MB, 5k images, 6352 person instances
- Data format: JPG
- Note: data is processed in src/dataset.py
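Based on the paths configured in src/config.py (`DATASET.ROOT`, `TRAIN_SET`/`TRAIN_JSON`, `TEST_SET`/`TEST_JSON`, and `TEST.COCO_BBOX_FILE`), the dataset directory is expected to look roughly as follows:

```text
coco/
├── annotations/
│   ├── person_keypoints_train2017.json
│   ├── person_keypoints_val2017.json
│   └── COCO_val2017_detections_AP_H_56_person.json  # detector boxes, used when USE_GT_BBOX = False
├── train2017/
└── val2017/
```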
# Features
## Mixed Precision
The [mixed precision](https://www.mindspore.cn/tutorials/experts/en/master/others/mixed_precision.html) training
method accelerates deep neural network training by using both single-precision and half-precision data types while
preserving the accuracy achieved by single-precision training. Mixed precision training speeds up computation, reduces
memory usage, and makes it possible to train larger models or use larger batch sizes on specific hardware. For FP16
operators, if the input data type is FP32, the MindSpore backend automatically handles it with reduced precision.
Users can check the reduced-precision operators by enabling INFO logging and searching for "reduce precision".
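In this repository, mixed precision is enabled through the `Model` wrapper in train.py (`amp_level="O2"`). A minimal, self-contained sketch of the same mechanism (the tiny network here is only a stand-in):

```python
import mindspore.nn as nn
from mindspore.train import Model

# Sketch only: amp_level="O2" runs most operators in float16 while keeping
# BatchNorm and the loss in float32, as train.py does for the pose ResNet.
net = nn.Dense(4, 2)
loss = nn.MSELoss()
opt = nn.Adam(net.trainable_params(), learning_rate=1e-3)
model = Model(net, loss_fn=loss, optimizer=opt, amp_level="O2")
```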
# Environment Requirements
- Hardware (Ascend/GPU)
- Prepare hardware environment with Ascend or GPU.
- Framework
- [MindSpore](https://www.mindspore.cn/install/en)
- For more information about MindSpore, please check the resources below:
- [MindSpore Tutorials](https://www.mindspore.cn/tutorials/zh-CN/master/index.html)
- [MindSpore Python API](https://www.mindspore.cn/docs/api/zh-CN/master/index.html)
# Quick Start
After installing MindSpore through the official website, you can follow the steps below for training and evaluation.
- Dataset preparation
Simple baselines uses the COCO2017 dataset for training and evaluation. Download the dataset from the [official website](https://cocodataset.org/).
- Running on Ascend
```shell
# Distributed training
bash scripts/run_distribute_train.sh RANK_TABLE
# Standalone training
bash scripts/run_standalone_train.sh DEVICE_ID
# Evaluation
bash scripts/run_eval.sh
```
- Running on GPU
```shell
# Distributed training
bash scripts/run_distribute_train_gpu.sh DEVICE_NUM
# Standalone training
bash scripts/run_standalone_train_gpu.sh DEVICE_ID
# Evaluation
bash scripts/run_eval_gpu.sh DEVICE_ID
```
# Script Description
## Script and Sample Code
```text
.
└── simple_baselines
    ├── README.md
    ├── scripts
    │   ├── run_distribute_train.sh       # train on Ascend
    │   ├── run_distribute_train_gpu.sh   # train on GPU
    │   ├── run_eval.sh                   # eval on Ascend
    │   ├── run_eval_gpu.sh               # eval on GPU
    │   ├── run_standalone_train.sh       # train on Ascend
    │   ├── run_standalone_train_gpu.sh   # train on GPU
    │   └── run_infer_310.sh              # 310 inference
    ├── src
    │   ├── utils
    │   │   ├── coco.py                   # COCO dataset evaluation results
    │   │   ├── nms.py
    │   │   └── transforms.py             # Image processing transforms
    │   ├── config.py                     # Parameter configuration
    │   ├── dataset.py                    # Data preprocessing
    │   ├── network_with_loss.py          # Loss function
    │   ├── pose_resnet.py                # Backbone network
    │   └── predict.py                    # Heatmap keypoint prediction
    ├── export.py
    ├── postprocess.py
    ├── preprocess.py
    ├── eval.py
    └── train.py
```
## Script Parameters
Before training, configure parameters and paths in src/config.py.
- Model parameters:
```text
config.MODEL.INIT_WEIGHTS = True # Initialize model weights
config.MODEL.PRETRAINED = 'resnet50.ckpt' # Pre-trained model
config.MODEL.NUM_JOINTS = 17 # Number of key points
config.MODEL.IMAGE_SIZE = [192, 256] # Image size
```
- Network parameters:
```text
config.NETWORK.NUM_LAYERS = 50 # ResNet backbone depth
config.NETWORK.DECONV_WITH_BIAS = False # Deconvolution bias
config.NETWORK.NUM_DECONV_LAYERS = 3 # Number of deconvolution layers
config.NETWORK.NUM_DECONV_FILTERS = [256, 256, 256] # Number of filters per deconvolution layer
config.NETWORK.NUM_DECONV_KERNELS = [4, 4, 4] # Kernel size per deconvolution layer
config.NETWORK.FINAL_CONV_KERNEL = 1 # Final convolution kernel size
config.NETWORK.HEATMAP_SIZE = [48, 64] # Heatmap size
```
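The heatmap training targets matching `HEATMAP_SIZE` are built in src/dataset.py (`generate_heatmap`, with `config.NETWORK.SIGMA = 2`). A simplified numpy sketch of a single target channel, for illustration only:

```python
import numpy as np

# Simplified version of KeypointDatasetGenerator.generate_heatmap: each
# visible joint becomes a 2D Gaussian peak on a 48x64 heatmap (SIGMA = 2).
def gaussian_heatmap(joint_xy, width=48, height=64, sigma=2):
    xs = np.arange(width, dtype=np.float32)
    ys = np.arange(height, dtype=np.float32)[:, None]
    x0, y0 = joint_xy
    return np.exp(-((xs - x0) ** 2 + (ys - y0) ** 2) / (2 * sigma ** 2))

target = gaussian_heatmap((24, 32))  # shape (64, 48), peak value 1.0 at the joint
```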
- Training parameters:
```text
config.TRAIN.SHUFFLE = True
config.TRAIN.BATCH_SIZE = 64
config.TRAIN.BEGIN_EPOCH = 0
config.TRAIN.END_EPOCH = 140
config.TRAIN.LR = 0.001 # Initial learning rate
config.TRAIN.LR_FACTOR = 0.1 # Learning rate decay factor
config.TRAIN.LR_STEP = [90, 120] # Epochs at which the learning rate decays
config.TRAIN.NUM_PARALLEL_WORKERS = 8
config.TRAIN.SAVE_CKPT = True
config.TRAIN.CKPT_PATH = "./model" # directory of pretrained resnet50 and to save ckpt
config.TRAIN.SAVE_CKPT_EPOCH = 3
config.TRAIN.KEEP_CKPT_MAX = 10
```
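For reference, the step schedule implied by `LR`, `LR_FACTOR`, and `LR_STEP` mirrors `get_lr()` in train.py, which precomputes one learning rate per step (2340 steps per epoch in single-device COCO2017 training); a minimal sketch:

```python
import numpy as np

# The learning rate starts at 0.001 and is multiplied by 0.1 after
# epochs 90 and 120, as in get_lr() in train.py.
def step_lr(total_epochs=140, steps_per_epoch=2340, lr=0.001,
            factor=0.1, drop_epochs=(90, 120)):
    drop_steps = {steps_per_epoch * e for e in drop_epochs}
    lr_each_step = []
    for step in range(total_epochs * steps_per_epoch):
        if step in drop_steps:
            lr *= factor
        lr_each_step.append(lr)
    return np.array(lr_each_step, dtype=np.float32)
```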
- Evaluation parameters:
```text
config.TEST.BATCH_SIZE = 32
config.TEST.FLIP_TEST = True
config.TEST.USE_GT_BBOX = False
```
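When `FLIP_TEST` is enabled, eval.py averages predictions from the original and the horizontally flipped image using `flip_back` from src/utils/transforms.py. A schematic numpy version, assuming heatmaps of shape (batch, joints, height, width):

```python
import numpy as np

# Schematic flip test: flip the mirrored heatmaps back and swap the
# left/right joint channels listed in flip_pairs, then average.
def flip_average(heatmaps, heatmaps_flipped, flip_pairs):
    back = heatmaps_flipped[..., ::-1].copy()   # undo the horizontal flip
    for left, right in flip_pairs:
        back[:, [left, right]] = back[:, [right, left]]
    return (heatmaps + back) / 2.0
```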
- nms parameters:
```text
config.TEST.OKS_THRE = 0.9 # OKS threshold
config.TEST.IN_VIS_THRE = 0.2 # Keypoint visibility threshold
config.TEST.BBOX_THRE = 1.0 # Candidate box threshold
config.TEST.IMAGE_THRE = 0.0 # Box score threshold
config.TEST.NMS_THRE = 1.0 # NMS threshold
```
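These thresholds drive the OKS-based NMS in src/utils/nms.py, which measures pose overlap by object-keypoint similarity instead of box IoU. A condensed sketch of the similarity computed by `oks_iou` (the repository version additionally filters joints by `IN_VIS_THRE`):

```python
import numpy as np

# Per-joint falloff constants from src/utils/nms.py.
SIGMAS = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72,
                   .62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0

def oks(g_xy, d_xy, area):
    """Object-keypoint similarity between two (17, 2) keypoint arrays."""
    variances = (SIGMAS * 2) ** 2
    squared_dist = np.sum((g_xy - d_xy) ** 2, axis=1)
    e = squared_dist / variances / (area + np.spacing(1)) / 2
    return float(np.mean(np.exp(-e)))
```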
## Training Process
### Usage
- Ascend
```shell
# Distributed training 8p
bash scripts/run_distribute_train.sh RANK_TABLE
# Standalone training
bash scripts/run_standalone_train.sh DEVICE_ID
# Evaluation
bash scripts/run_eval.sh
```
- GPU
```shell
# Distributed training
bash scripts/run_distribute_train_gpu.sh DEVICE_NUM
# Standalone training
bash scripts/run_standalone_train_gpu.sh DEVICE_ID
# Evaluation
bash scripts/run_eval_gpu.sh DEVICE_ID
```
### Result
- Use the COCO2017 dataset to train simple_baselines
```text
# Standalone training results (1P)
epoch:1 step:2340, loss is 0.0008106
epoch:2 step:2340, loss is 0.0006160
epoch:3 step:2340, loss is 0.0006480
epoch:4 step:2340, loss is 0.0005620
epoch:5 step:2340, loss is 0.0005207
...
epoch:138 step:2340, loss is 0.0003183
epoch:139 step:2340, loss is 0.0002866
epoch:140 step:2340, loss is 0.0003393
```
## Evaluation Process
### Usage
To run inference with a specific checkpoint, set "config.TEST.MODEL_FILE" in config.py accordingly.
Evaluation uses val2017 from the COCO2017 dataset folder.
- Ascend
```shell
# Evaluation
bash scripts/run_eval.sh
```
- GPU
```shell
# Evaluation
bash scripts/run_eval_gpu.sh DEVICE_ID
```
### Result
Results will be saved in keypoints_results.pkl.
```text
AP: 0.704
```
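For context, evaluation decodes the network's heatmaps into keypoints in src/predict.py (`get_max_preds`/`get_final_preds`): each joint's coordinate is essentially the argmax of its heatmap and the value there is its confidence. A simplified sketch:

```python
import numpy as np

# Simplified get_max_preds: the argmax of each (height, width) heatmap gives
# the keypoint location; the maximum value serves as its confidence score.
def decode_heatmaps(batch_heatmaps):
    n, k, h, w = batch_heatmaps.shape
    flat = batch_heatmaps.reshape(n, k, -1)
    idx = np.argmax(flat, axis=2)
    maxvals = np.max(flat, axis=2)
    preds = np.stack((idx % w, idx // w), axis=2).astype(np.float32)  # (x, y)
    return preds, maxvals
```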
## Inference Process
### Model Export
- Export locally
```shell
python export.py
```
- Export in ModelArts (if you want to run in ModelArts, please check the [ModelArts official documentation](https://support.huaweicloud.com/modelarts/)).
```text
# (1) Upload the code folder to S3 bucket.
# (2) Click to "create training task" on the website UI interface.
# (3) Set the code directory to "/{path}/simple_pose" on the website UI interface.
# (4) Set the startup file to "/{path}/simple_pose/export.py" on the website UI interface.
# (5) Set parameters in /{path}/simple_pose/default_config.yaml:
#     1. Set "enable_modelarts: True"
#     2. Set "TEST.MODEL_FILE: ./{path}/*.ckpt" ('TEST.MODEL_FILE' is the path of the weight file to be exported, relative to `export.py`; the weight file must be included in the code directory.)
#     3. Set "EXPORT.FILE_NAME: simple_pose"
#     4. Set "EXPORT.FILE_FORMAT: MINDIR"
# (6) Check the "data storage location" on the website UI interface and set the "Dataset path" (this step has no effect, but is required).
# (7) Set the "Output file path" and "Job log path" to your path on the website UI interface.
# (8) Under "resource pool selection", select a single-card specification.
# (9) Create your job.
# You will see simple_pose.mindir under {Output file path}.
```
`FILE_FORMAT` should be in ["AIR", "MINDIR"]
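For reference, the local export boils down to the following (a hedged sketch: the exact flags live in export.py and src/config.py; the input shape follows `MODEL.IMAGE_SIZE = [192, 256]` in NCHW order):

```python
import numpy as np
from mindspore import Tensor, export
from mindspore.train.serialization import load_checkpoint, load_param_into_net

from src.config import config
from src.pose_resnet import GetPoseResNet

# Sketch only: load the checkpoint named in config.TEST.MODEL_FILE and
# export a MINDIR graph for a fixed 1x3x256x192 input.
net = GetPoseResNet(config)
load_param_into_net(net, load_checkpoint(config.TEST.MODEL_FILE))
dummy_input = Tensor(np.zeros((1, 3, 256, 192), dtype=np.float32))
export(net, dummy_input, file_name="simple_pose", file_format="MINDIR")
```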
### Infer on Ascend310
Before running inference, the MINDIR file must be exported by the export.py script. We only provide an example of inference using a MINDIR model.
When the network processes the dataset, the last incomplete batch is not padded automatically, so it is better to set batch_size to 1.
```shell
# Ascend310 inference
bash run_infer_310.sh [MINDIR_PATH] [NEED_PREPROCESS] [DEVICE_ID]
```
- `NEED_PREPROCESS` indicates whether the dataset needs to be preprocessed into binary format; valid values are "y" and "n".
- `DEVICE_ID` is optional; the default value is 0.
### Result
The inference results are saved in the current path in `acc.log` file.
```text
AP: 0.7139169694686592
```
# Model Description
## Performance
| Parameters | Ascend 910 | GPU 1p | GPU 8p |
| ------------------- | --------------------------- | ------------ | ------------ |
| Model | simple_baselines | simple_baselines | simple_baselines |
| Environment | Ascend 910; CPU 2.60GHz, 192 cores; RAM 755 GB | Ubuntu 18.04.6, 1p RTX3090, CPU 2.90GHz, 64 cores, RAM 252 GB; MindSpore 1.5.0 | Ubuntu 18.04.6, 8pcs RTX3090, CPU 2.90GHz, 64 cores, RAM 252 GB; MindSpore 1.5.0 |
| Upload date (Y-M-D) | 2021-03-29 | 2021-12-29 | 2021-12-29 |
| MindSpore Version | 1.1.0 | 1.5.0 | 1.5.0 |
| Dataset | COCO2017 | COCO2017 | COCO2017 |
| Training params | epoch=140, batch_size=64 | epoch=140, batch_size=64 | epoch=140, batch_size=64 |
| Optimizer | Adam | Adam | Adam |
| Loss function | Mean Squared Error | Mean Squared Error | Mean Squared Error |
| Output | heatmap | heatmap | heatmap |
| Final Loss | | 0.27 | 0.27 |
| Training speed | 1pc: 251.4 ms/step | 184 ms/step | 285 ms/step |
| Total training time | | 17h | 3.5h |
| Accuracy | AP: 0.704 | AP: 0.7143 | AP: 0.7143 |
# Description of Random State
The random seed is set inside the "create_dataset" function in dataset.py.
Initial network weights are used in model.py.
# ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/models).
......@@ -2,6 +2,8 @@
<!-- TOC -->
[View English](./README.md)
- [simple_baselines Description](#simple_baselines描述)
- [Model Architecture](#模型架构)
- [Dataset](#数据集)
......@@ -17,7 +19,6 @@
- [ONNX Inference](#onnx推理)
- [Model Description](#模型描述)
- [Performance](#性能)
- [Evaluation Performance](#评估性能)
- [Description of Random State](#随机情况说明)
- [ModelZoo Homepage](#ModelZoo主页)
......@@ -59,7 +60,7 @@ The overall network architecture of simple_baselines is as follows:
# Environment Requirements
- Hardware (Ascend)
- Hardware (Ascend/GPU)
- Prepare the hardware environment with Ascend processors.
- Framework
- [MindSpore](https://www.mindspore.cn/install/en)
......@@ -81,7 +82,7 @@ The overall network architecture of simple_baselines is as follows:
- Running on Ascend
```text
```shell
# Distributed training
Usage: bash run_distribute_train.sh RANK_TABLE
......@@ -92,12 +93,25 @@ The overall network architecture of simple_baselines is as follows:
Usage: bash run_eval.sh
```
- Running on GPU
```shell
# Distributed training
Usage: bash scripts/run_distribute_train_gpu.sh DEVICE_NUM
# Standalone training
Usage: bash scripts/run_standalone_train_gpu.sh DEVICE_ID
# Evaluation
Usage: bash scripts/run_eval_gpu.sh DEVICE_ID
```
# Script Description
## Script and Sample Code
```shell
```text
.
└──simple_baselines
├── README.md
├── scripts
......@@ -108,13 +122,13 @@ The overall network architecture of simple_baselines is as follows:
├── src
├── utils
├── coco.py # COCO dataset evaluation results
├── inference.py # Heatmap keypoint prediction
├── nms.py # nms
├── transforms.py # Image processing transforms
├── config.py # Parameter configuration
├── dataset.py # Data preprocessing
├── network_with_loss.py # Loss function definition
└── pose_resnet.py # Backbone network definition
├── pose_resnet.py # Backbone network definition
└── predict.py # Heatmap keypoint prediction
├── eval.py # Evaluation network
├── eval_onnx.py # ONNX inference
└── train.py # Training network
......@@ -126,7 +140,7 @@ The overall network architecture of simple_baselines is as follows:
- Model parameters:
```python
```text
config.MODEL.INIT_WEIGHTS = True # Initialize model weights
config.MODEL.PRETRAINED = 'resnet50.ckpt' # Pre-trained model
config.MODEL.NUM_JOINTS = 17 # Number of keypoints
......@@ -135,7 +149,7 @@ config.MODEL.IMAGE_SIZE = [192, 256] # Image size
- Network parameters:
```python
```text
config.NETWORK.NUM_LAYERS = 50 # ResNet backbone depth
config.NETWORK.DECONV_WITH_BIAS = False # Deconvolution bias
config.NETWORK.NUM_DECONV_LAYERS = 3 # Number of deconvolution layers
......@@ -147,7 +161,7 @@ config.NETWORK.HEATMAP_SIZE = [48, 64] # Heatmap size
- Training parameters:
```python
```text
config.TRAIN.SHUFFLE = True # Shuffle training data
config.TRAIN.BATCH_SIZE = 64 # Training batch size
config.TRAIN.BEGIN_EPOCH = 0 # Starting epoch
......@@ -162,7 +176,7 @@ config.TRAIN.LR_FACTOR = 0.1 # Learning rate decay factor
- Evaluation parameters:
```python
```text
config.TEST.BATCH_SIZE = 32 # Evaluation batch size
config.TEST.FLIP_TEST = True # Flip test
config.TEST.USE_GT_BBOX = False # Use ground-truth boxes
......@@ -170,7 +184,7 @@ config.TEST.USE_GT_BBOX = False # Use ground-truth boxes
- NMS parameters:
```python
```text
config.TEST.OKS_THRE = 0.9 # OKS threshold
config.TEST.IN_VIS_THRE = 0.2 # Keypoint visibility threshold
config.TEST.BBOX_THRE = 1.0 # Candidate box threshold
......@@ -182,9 +196,9 @@ config.TEST.NMS_THRE = 1.0 # NMS threshold
### Usage
#### Running on Ascend
- Running on Ascend
```text
```shell
# Distributed training
Usage: bash run_distribute_train.sh RANK_TABLE
......@@ -195,6 +209,19 @@ config.TEST.NMS_THRE = 1.0 # NMS threshold
Usage: bash run_eval.sh
```
- Running on GPU
```shell
# Distributed training
bash scripts/run_distribute_train_gpu.sh DEVICE_NUM
# Standalone training
bash scripts/run_standalone_train_gpu.sh DEVICE_ID
# Evaluation
bash scripts/run_eval_gpu.sh DEVICE_ID
```
### Result
- Use the COCO2017 dataset to train simple_baselines
......@@ -216,21 +243,27 @@ epoch:140 step:2340, loss is 0.0003393
### Usage
#### Running on Ascend
Model inference can be performed by setting "config.TEST.MODEL_FILE" in config.py.
```bash
- Running on Ascend
```shell
# Evaluation
bash eval.sh
```
- Running on GPU
```shell
# Evaluation
bash scripts/run_eval_gpu.sh DEVICE_ID
```
### Result
Use val2017 from the COCO2017 dataset folder to evaluate simple_baselines, as shown below:
```text
coco eval results saved to /cache/train_output/multi_train_poseresnet_v5_2-140_2340/keypoints_results.pkl
AP: 0.704
```
......@@ -270,7 +303,7 @@ AP: 0.72296
## Inference Process
### [Export MINDIR]
### Export MINDIR
- Export locally
......@@ -280,7 +313,7 @@ python export.py
- Export on ModelArts (if you want to run in ModelArts, please check the [ModelArts official documentation](https://support.huaweicloud.com/modelarts/) and launch as follows)
```python
```text
# (1) Upload the code folder to S3 bucket.
# (2) Click to "create training task" on the website UI interface.
# (3) Set the code directory to "/{path}/simple_pose" on the website UI interface.
......@@ -317,7 +350,7 @@ bash run_infer_310.sh [MINDIR_PATH] [NEED_PREPROCESS] [DEVICE_ID]
The inference results are saved in the current path; you can find results like the following in the acc.log file.
```bash
```text
AP: 0.7139169694686592
```
......@@ -325,24 +358,21 @@ AP: 0.7139169694686592
## Performance
### Evaluation Performance
#### Performance parameters on COCO2017
| Parameters | Ascend 910 |
| ------------------- | --------------------------- |
| Model version | simple_baselines |
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; RAM 755 GB |
| Upload date | 2021-03-29 |
| MindSpore version | 1.1.0 |
| Dataset | COCO2017 |
| Training params | epoch=140, batch_size=64 |
| Optimizer | Adam |
| Loss function | Mean Squared Error |
| Output | heatmap |
| Output | heatmap |
| Speed | 1pc: 251.4 ms/step |
| Training performance | AP: 0.704 |
| Parameters | Ascend 910 | GPU 1p | GPU 8p |
| -------------- | --------------------------- | ---------------- | ------------ |
| Model version | simple_baselines | simple_baselines | simple_baselines |
| Resource | Ascend 910; CPU 2.60GHz, 192 cores; RAM 755 GB | Ubuntu 18.04.6, 1p RTX3090, CPU 2.90GHz, 64 cores, RAM 252 GB; MindSpore 1.5.0 | Ubuntu 18.04.6, 8pcs RTX3090, CPU 2.90GHz, 64 cores, RAM 252 GB; MindSpore 1.5.0 |
| Upload date | 2021-03-29 | 2021-12-29 | 2021-12-29 |
| MindSpore version | 1.1.0 | 1.5.0 | 1.5.0 |
| Dataset | COCO2017 | COCO2017 | COCO2017 |
| Training params | epoch=140, batch_size=64 | epoch=140, batch_size=64 | epoch=140, batch_size=64 |
| Optimizer | Adam | Adam | Adam |
| Loss function | Mean Squared Error | Mean Squared Error | Mean Squared Error |
| Output | heatmap | heatmap | heatmap |
| Final loss | | 0.27 | 0.27 |
| Speed | 1pc: 251.4 ms/step | 184 ms/step | 285 ms/step |
| Total training time | | 17h | 3.5h |
| Accuracy | AP: 0.704 | AP: 0.7143 | AP: 0.7143 |
# Description of Random State
......@@ -350,4 +380,4 @@ The seed of the "create_dataset" function is set in dataset.py, and initial weights are used in model.py
# ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/models)
Please check the official [homepage](https://gitee.com/mindspore/models).
......@@ -12,9 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
This file evaluates the model used.
'''
""" This file evaluates the model """
from __future__ import division
import argparse
......@@ -32,25 +30,28 @@ from src.dataset import flip_pairs
from src.dataset import keypoint_dataset
from src.utils.coco import evaluate
from src.utils.transforms import flip_back
from src.utils.inference import get_final_preds
from src.predict import get_final_preds
if config.MODELARTS.IS_MODEL_ARTS:
import moxing as mox
set_seed(config.GENERAL.EVAL_SEED)
device_id = int(os.getenv('DEVICE_ID'))
def parse_args():
"""command line arguments parsing"""
parser = argparse.ArgumentParser(description='Evaluate')
parser.add_argument('--data_url', required=False, default=None, help='Location of data.')
parser.add_argument('--train_url', required=False, default=None, help='Location of evaluate outputs.')
parser.add_argument("--device_target", type=str, choices=["Ascend", "GPU", "CPU"], default="Ascend",
help="device target")
parser.add_argument("--device_id", type=int, default=0, help="Device id, default is 0.")
args = parser.parse_args()
return args
def validate(cfg, val_dataset, model, output_dir, ann_path):
'''
validate
'''
"""evaluate model"""
model.set_train(False)
num_samples = val_dataset.get_dataset_size() * cfg.TEST.BATCH_SIZE
all_preds = np.zeros((num_samples, cfg.MODEL.NUM_JOINTS, 3),
......@@ -98,22 +99,21 @@ def validate(cfg, val_dataset, model, output_dir, ann_path):
def main():
context.set_context(mode=context.GRAPH_MODE,
device_target="Ascend", save_graphs=False, device_id=device_id)
"""main"""
args = parse_args()
context.set_context(mode=context.GRAPH_MODE,
device_target=args.device_target, save_graphs=False, device_id=args.device_id)
if config.MODELARTS.IS_MODEL_ARTS:
mox.file.copy_parallel(src_url=args.data_url, dst_url=config.MODELARTS.CACHE_INPUT)
model = GetPoseResNet(config)
ckpt_name = ''
if config.MODELARTS.IS_MODEL_ARTS:
ckpt_name = config.MODELARTS.CACHE_INPUT
ckpt_name = os.path.join(ckpt_name, config.TEST.MODEL_FILE)
else:
ckpt_name = config.DATASET.ROOT
ckpt_name = ckpt_name + config.TEST.MODEL_FILE
ckpt_name = config.TEST.MODEL_FILE
print('loading model ckpt from {}'.format(ckpt_name))
load_param_into_net(model, load_checkpoint(ckpt_name))
......@@ -126,20 +126,19 @@ def main():
ckpt_name = ckpt_name.split('/')
ckpt_name = ckpt_name[len(ckpt_name) - 1]
ckpt_name = ckpt_name.split('.')[0]
output_dir = ''
ann_path = ''
if config.MODELARTS.IS_MODEL_ARTS:
output_dir = config.MODELARTS.CACHE_OUTPUT
ann_path = config.MODELARTS.CACHE_INPUT
else:
output_dir = config.TEST.OUTPUT_DIR
ann_path = config.DATASET.ROOT
output_dir = output_dir + ckpt_name
ann_path = ann_path + config.DATASET.TEST_JSON
output_dir = os.path.join(output_dir, ckpt_name)
ann_path = os.path.join(ann_path, config.DATASET.TEST_JSON)
validate(config, valid_dataset, model, output_dir, ann_path)
if config.MODELARTS.IS_MODEL_ARTS:
mox.file.copy_parallel(src_url=config.MODELARTS.CACHE_OUTPUT, dst_url=args.train_url)
if __name__ == '__main__':
main()
......@@ -12,9 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
export simple_baseline to mindir or air
"""
""" Export simple_baseline to mindir or air """
import argparse
import numpy as np
from mindspore import context, Tensor, export
......
......@@ -12,9 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
postprocess.
"""
""" postprocess script """
import os
import numpy as np
......@@ -24,8 +22,9 @@ from src.predict import get_final_preds
from src.dataset import flip_pairs
from src.config import config
def get_acc():
'''calculate accuracy'''
""" calculate accuracy """
ckpt_file = config.TEST.MODEL_FILE
output_dir = ckpt_file.split('.')[0]
if config.enable_modelarts:
......@@ -86,5 +85,6 @@ def get_acc():
cfg, all_preds[:idx], output_dir, all_boxes[:idx], image_id, None)
print("AP:", perf_indicator)
if __name__ == '__main__':
get_acc()
......@@ -12,17 +12,16 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
preprocess.
"""
""" preprocess script """
import os
import numpy as np
from src.dataset import keypoint_dataset
from src.config import config
def get_bin():
''' get bin files'''
""" get bin files"""
valid_dataset, _ = keypoint_dataset(
config,
bbox_file=config.TEST.COCO_BBOX_FILE,
......@@ -62,5 +61,6 @@ def get_bin():
np.save(os.path.join(id_path, file_name), item['id'])
print("=" * 20, "export bin files finished", "=" * 20)
if __name__ == '__main__':
get_bin()
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 1 ]; then
echo "Please run the script as: "
echo "bash scripts/run_distribute_train_gpu.sh [RANK_SIZE]"
echo "For example: bash scripts/run_distribute_train_gpu.sh 8"
echo "It is better to use the absolute path."
echo "========================================================================"
exit 1
fi
export RANK_SIZE=$1
rm -rf ./train_parallel
mkdir ./train_parallel
cp ./*.py ./train_parallel
cp -r ./src ./train_parallel
cd ./train_parallel
echo "start training on GPU $RANK_SIZE devices"
env > env.log
mpirun -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout \
python train.py \
--device_target="GPU" \
--is_model_arts=False \
--run_distribute=True > train.log 2>&1 &
cd ..
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 1 ]; then
echo "Please run the script as: "
echo "bash scripts/run_eval_gpu.sh [DEVICE_ID]"
echo "For example: bash scripts/run_eval_gpu.sh 0"
echo "It is better to use the absolute path."
echo "========================================================================"
exit 1
fi
export DEVICE_NUM=1
export DEVICE_ID=$1
rm -rf ./eval
mkdir ./eval
cp ./*.py ./eval
cp -r ./src ./eval
cd ./eval || exit
echo "start evaluation on GPU device $DEVICE_ID"
env > env.log
python eval.py --device_target="GPU" --device_id=$DEVICE_ID > eval.log 2>&1 &
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 1 ]; then
echo "Please run the script as: "
echo "bash scripts/run_standalone_train_gpu.sh [DEVICE_ID]"
echo "For example: bash scripts/run_standalone_train_gpu.sh 0"
echo "It is better to use the absolute path."
echo "========================================================================"
exit 1
fi
export RANK_SIZE=1
export DEVICE_ID=$1
rm -rf train$1
mkdir ./train$1
cp ./*.py ./train$1
cp -r ./src ./train$1
cd ./train$1 || exit
echo "start training on GPU device $DEVICE_ID"
env > env.log
python train.py \
--device_target="GPU" \
--device_id=$DEVICE_ID \
--is_model_arts=False \
--run_distribute=False > train.log 2>&1 &
cd ..
......@@ -12,14 +12,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
config
'''
"""Config parameters for simple baselines."""
from easydict import EasyDict as edict
config = edict()
#general
# general
config.GENERAL = edict()
config.GENERAL.VERSION = 'commit'
config.GENERAL.TRAIN_SEED = 1
......@@ -27,7 +25,7 @@ config.GENERAL.EVAL_SEED = 1
config.GENERAL.DATASET_SEED = 1
config.GENERAL.RUN_DISTRIBUTE = True
#model arts
# model arts
config.MODELARTS = edict()
config.MODELARTS.IS_MODEL_ARTS = False
config.MODELARTS.CACHE_INPUT = '/cache/data_tzh/'
......@@ -50,7 +48,6 @@ config.NETWORK.NUM_DECONV_FILTERS = [256, 256, 256]
config.NETWORK.NUM_DECONV_KERNELS = [4, 4, 4]
config.NETWORK.FINAL_CONV_KERNEL = 1
config.NETWORK.REVERSE = True
config.NETWORK.TARGET_TYPE = 'gaussian'
config.NETWORK.HEATMAP_SIZE = [48, 64]
config.NETWORK.SIGMA = 2
......@@ -66,7 +63,7 @@ config.DATASET.ROOT = '/home/dataset/coco/'
config.DATASET.TRAIN_SET = 'train2017'
config.DATASET.TRAIN_JSON = 'annotations/person_keypoints_train2017.json'
config.DATASET.TEST_SET = 'val2017'
config.DATASET.TEST_JSON = 'annotations/COCO_val2017_detections_AP_H_56_person.json'
config.DATASET.TEST_JSON = 'annotations/person_keypoints_val2017.json'
# training data augmentation
config.DATASET.FLIP = True
......@@ -76,7 +73,7 @@ config.DATASET.ROT_FACTOR = 40
# train
config.TRAIN = edict()
config.TRAIN.SHUFFLE = True
config.TRAIN.BATCH_SIZE = 64
config.TRAIN.BATCH_SIZE = 64 # 32 in paper
config.TRAIN.BEGIN_EPOCH = 0
config.TRAIN.END_EPOCH = 140
config.TRAIN.LR = 0.001
......@@ -84,7 +81,9 @@ config.TRAIN.LR_FACTOR = 0.1
config.TRAIN.LR_STEP = [90, 120]
config.TRAIN.NUM_PARALLEL_WORKERS = 8
config.TRAIN.SAVE_CKPT = True
config.TRAIN.CKPT_PATH = "/home/dataset/coco/"
config.TRAIN.CKPT_PATH = "/home/model/"
config.TRAIN.SAVE_CKPT_EPOCH = 3
config.TRAIN.KEEP_CKPT_MAX = 10
# valid
config.TEST = edict()
......@@ -93,10 +92,10 @@ config.TEST.FLIP_TEST = True
config.TEST.POST_PROCESS = True
config.TEST.SHIFT_HEATMAP = True
config.TEST.USE_GT_BBOX = False
config.TEST.NUM_PARALLEL_WORKERS = 2
config.TEST.MODEL_FILE = '/home/dataset/coco/multi_train_poseresnet_commit_0-140_292.ckpt'
config.TEST.NUM_PARALLEL_WORKERS = 8
config.TEST.MODEL_FILE = '/home/model/multi_train_poseresnet_commit_5-140_292.ckpt'
config.TEST.COCO_BBOX_FILE = '/home/dataset/coco/annotations/COCO_val2017_detections_AP_H_56_person.json'
config.TEST.OUTPUT_DIR = 'results/'
config.TEST.OUTPUT_DIR = '/home/results/'
# nms
config.TEST.OKS_THRE = 0.9
......@@ -105,7 +104,7 @@ config.TEST.BBOX_THRE = 1.0
config.TEST.IMAGE_THRE = 0.0
config.TEST.NMS_THRE = 1.0
#310 infer-related
# 310 infer-related
config.INFER = edict()
config.INFER.PRE_RESULT_PATH = './preprocess_Result'
config.INFER.POST_RESULT_PATH = './result_Files'
......
......@@ -12,16 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
dataset processing
'''
""" dataset processing """
from __future__ import division
import json
import os
from copy import deepcopy
import random
import multiprocessing as mp
import numpy as np
import cv2
......@@ -29,14 +28,15 @@ import mindspore.dataset as ds
import mindspore.dataset.vision as C
from src.utils.transforms import fliplr_joints, get_affine_transform, affine_transform
ds.config.set_seed(1) # Set Random Seed
ds.config.set_seed(1) # Set Random Seed
flip_pairs = [[1, 2], [3, 4], [5, 6], [7, 8],
[9, 10], [11, 12], [13, 14], [15, 16]]
class KeypointDatasetGenerator:
'''
About the specific operations of coco2017 data set processing
'''
"""
About the specific operations of coco2017 dataset processing
"""
def __init__(self, cfg, is_train=False):
self.image_thre = cfg.TEST.IMAGE_THRE
self.image_size = np.array(cfg.MODEL.IMAGE_SIZE, dtype=np.int32)
......@@ -56,9 +56,7 @@ class KeypointDatasetGenerator:
self.num_joints = 17
def load_gt_dataset(self, image_path, ann_file):
'''
load_gt_dataset
'''
""" load_gt_dataset """
self.db = []
with open(ann_file, "rb") as f:
......@@ -134,11 +132,8 @@ class KeypointDatasetGenerator:
})
def load_detect_dataset(self, image_path, ann_file, bbox_file):
'''
load_detect_dataset
'''
""" load detection dataset """
self.db = []
all_boxes = None
with open(bbox_file, 'r') as f:
all_boxes = json.load(f)
......@@ -245,9 +240,7 @@ class KeypointDatasetGenerator:
return image, target, target_weight, s, c, score, db_rec['id']
def generate_heatmap(self, joints, joints_vis):
'''
generate_heatmap
'''
""" generate heatmap"""
target_weight = np.ones((self.num_joints, 1), dtype=np.float32)
target_weight[:, 0] = joints_vis[:, 0]
......@@ -300,6 +293,7 @@ class KeypointDatasetGenerator:
def __len__(self):
return len(self.db)
def keypoint_dataset(config,
ann_file=None,
image_path=None,
......@@ -307,7 +301,7 @@ def keypoint_dataset(config,
rank=0,
group_size=1,
train_mode=True,
num_parallel_workers=8,
num_parallel_workers=mp.cpu_count(),
transform=None,
shuffle=None):
"""
......@@ -315,9 +309,8 @@ def keypoint_dataset(config,
Args:
rank (int): The shard ID within num_shards (default=None).
group_size (int): Number of shards that the dataset should be divided
into (default=None).
mode (str): "train" or others. Default: " train".
group_size (int): Number of shards that the dataset should be divided into (default=None).
mode (str): "train" or others. Default: "train".
num_parallel_workers (int): Number of workers to read the data. Default: None.
"""
# config
......
......@@ -12,9 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
network_with_loss
'''
""" network_with_loss """
from __future__ import division
import mindspore.nn as nn
......@@ -23,10 +21,9 @@ from mindspore.ops import functional as F
from mindspore.nn.loss.loss import LossBase
from mindspore.common import dtype as mstype
class JointsMSELoss(LossBase):
'''
JointsMSELoss
'''
"""JointsMSELoss"""
def __init__(self, use_target_weight):
super(JointsMSELoss, self).__init__()
self.criterion = nn.MSELoss(reduction='mean')
......@@ -37,9 +34,7 @@ class JointsMSELoss(LossBase):
self.mul = P.Mul()
def construct(self, output, target, target_weight):
'''
construct
'''
""" construct """
total_shape = self.shape(output)
batch_size = total_shape[0]
num_joints = total_shape[1]
......@@ -64,6 +59,7 @@ class JointsMSELoss(LossBase):
return loss / num_joints
class PoseResNetWithLoss(nn.Cell):
"""
Pack the model network and loss function together to calculate the loss value.
......
......@@ -12,9 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
simple_baselines network
'''
""" simple_baselines network """
from __future__ import division
import os
import mindspore.nn as nn
......@@ -24,10 +22,9 @@ from mindspore.train.serialization import load_checkpoint, load_param_into_net
BN_MOMENTUM = 0.1
class MPReverse(nn.Cell):
'''
MPReverse
'''
"""MPReverse"""
def __init__(self, kernel_size=1, stride=1, pad_mode="valid"):
super(MPReverse, self).__init__()
self.maxpool = nn.MaxPool2d(kernel_size=kernel_size, stride=stride, pad_mode=pad_mode)
......@@ -39,10 +36,9 @@ class MPReverse(nn.Cell):
x = self.reverse(x)
return x
class Bottleneck(nn.Cell):
'''
model part of network
'''
"""model part of network"""
expansion = 4
def __init__(self, inplanes, planes, stride=1, downsample=None):
......@@ -59,9 +55,7 @@ class Bottleneck(nn.Cell):
self.stride = stride
def construct(self, x):
'''
construct
'''
"""construct"""
residual = x
out = self.conv1(x)
......@@ -85,10 +79,7 @@ class Bottleneck(nn.Cell):
class PoseResNet(nn.Cell):
'''
PoseResNet
'''
"""pose-resnet"""
def __init__(self, block, layers, cfg):
self.inplanes = 64
self.deconv_with_bias = cfg.NETWORK.DECONV_WITH_BIAS
......@@ -122,9 +113,7 @@ class PoseResNet(nn.Cell):
)
def _make_layer(self, block, planes, blocks, stride=1):
'''
_make_layer
'''
"""make layer"""
downsample = None
if stride != 1 or self.inplanes != planes * block.expansion:
downsample = nn.SequentialCell([nn.Conv2d(self.inplanes, planes * block.expansion,
......@@ -134,16 +123,13 @@ class PoseResNet(nn.Cell):
layers = []
layers.append(block(self.inplanes, planes, stride, downsample))
self.inplanes = planes * block.expansion
for i in range(1, blocks):
for _ in range(1, blocks):
layers.append(block(self.inplanes, planes))
print(i)
return nn.SequentialCell(layers)
def _make_deconv_layer(self, num_layers, num_filters, num_kernels):
'''
_make_deconv_layer
'''
"""make deconvolutional layer"""
assert num_layers == len(num_filters), \
'ERROR: num_deconv_layers is different len(num_deconv_filters)'
assert num_layers == len(num_kernels), \
......@@ -171,9 +157,6 @@ class PoseResNet(nn.Cell):
return nn.SequentialCell(layers)
def construct(self, x):
'''
construct
'''
x = self.conv1(x)
x = self.bn1(x)
x = self.relu(x)
......@@ -186,6 +169,7 @@ class PoseResNet(nn.Cell):
x = self.deconv_layers(x)
x = self.final_layer(x)
return x
def init_weights(self, pretrained=''):
......@@ -205,18 +189,16 @@ resnet_spec = {50: (Bottleneck, [3, 4, 6, 3]),
def GetPoseResNet(cfg):
'''
GetPoseResNet
'''
"""get pose-resnet"""
num_layers = cfg.NETWORK.NUM_LAYERS
block_class, layers = resnet_spec[num_layers]
network = PoseResNet(block_class, layers, cfg)
if cfg.MODEL.IS_TRAINED and cfg.MODEL.INIT_WEIGHTS:
pretrained = ''
if cfg.MODELARTS.IS_MODEL_ARTS:
pretrained = cfg.MODELARTS.CACHE_INPUT + cfg.MODEL.PRETRAINED
pretrained = os.path.join(cfg.MODELARTS.CACHE_INPUT, cfg.MODEL.PRETRAINED)
else:
pretrained = cfg.TRAIN.CKPT_PATH + cfg.MODEL.PRETRAINED
pretrained = os.path.join(cfg.TRAIN.CKPT_PATH, cfg.MODEL.PRETRAINED)
network.init_weights(pretrained)
return network
......@@ -12,21 +12,19 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
prediction picture
'''
""" prediction picture """
import math
import numpy as np
from src.utils.transforms import transform_preds
def get_max_preds(batch_heatmaps):
'''
"""
get predictions from score maps
heatmaps: numpy.ndarray([batch_size, num_joints, height, width])
'''
assert isinstance(batch_heatmaps, np.ndarray), \
'batch_heatmaps should be numpy.ndarray'
"""
assert isinstance(batch_heatmaps, np.ndarray), 'batch_heatmaps should be numpy.ndarray'
assert batch_heatmaps.ndim == 4, 'batch_images should be 4-ndim'
batch_size = batch_heatmaps.shape[0]
......@@ -50,10 +48,11 @@ def get_max_preds(batch_heatmaps):
preds *= pred_mask
return preds, maxvals
def get_final_preds(config, batch_heatmaps, center, scale):
'''
"""
get final predictions from score maps
'''
"""
coords, maxvals = get_max_preds(batch_heatmaps)
heatmap_height = batch_heatmaps.shape[2]
heatmap_width = batch_heatmaps.shape[3]
......
......@@ -12,9 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
coco
'''
"""coco"""
from __future__ import division
import json
......@@ -33,10 +31,8 @@ except ImportError:
from src.utils.nms import oks_nms
def _write_coco_keypoint_results(img_kpts, num_joints, res_file):
'''
_write_coco_keypoint_results
'''
results = []
for img, items in img_kpts.items():
......@@ -62,9 +58,6 @@ def _write_coco_keypoint_results(img_kpts, num_joints, res_file):
def _do_python_keypoint_eval(res_file, res_folder, ann_path):
'''
_do_python_keypoint_eval
'''
coco = COCO(ann_path)
coco_dt = coco.loadRes(res_file)
coco_eval = COCOeval(coco, coco_dt, 'keypoints')
......@@ -87,10 +80,8 @@ def _do_python_keypoint_eval(res_file, res_folder, ann_path):
return info_str
def evaluate(cfg, preds, output_dir, all_boxes, img_id, ann_path):
'''
evaluate
'''
if not os.path.exists(output_dir):
os.makedirs(output_dir)
res_file = os.path.join(output_dir, 'keypoints_results.json')
......
......@@ -13,16 +13,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
nms operation
'''
""" nms operation """
from __future__ import division
import numpy as np
def oks_iou(g, d, a_g, a_d, sigmas=None, in_vis_thre=None):
'''
oks_iou
'''
if not isinstance(sigmas, np.ndarray):
sigmas = np.array([.26, .25, .25, .35, .35, .79, .79, .72, .72,
.62, .62, 1.07, 1.07, .87, .87, .89, .89]) / 10.0
......@@ -44,6 +40,7 @@ def oks_iou(g, d, a_g, a_d, sigmas=None, in_vis_thre=None):
ious[n_d] = np.sum(np.exp(-e)) / e.shape[0] if e.shape[0] != 0 else 0.0
return ious
def oks_nms(kpts_db, thresh, sigmas=None, in_vis_thre=None):
"""
greedily select boxes with high confidence and overlap with current maximum <= thresh
......
......@@ -13,13 +13,12 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
transforms
'''
""" transforms """
from __future__ import division
import numpy as np
import cv2
def flip_back(output_flipped, matched_parts):
'''
output_flipped: numpy.ndarray(batch_size, num_joints, height, width)
......@@ -55,9 +54,7 @@ def fliplr_joints(joints, joints_vis, width, matched_parts):
def transform_preds(coords, center, scale, output_size):
'''
transform_preds
'''
"""transform_preds"""
target_coords = np.zeros(coords.shape)
trans = get_affine_transform(center, scale, 0, output_size, inv=1)
for p in range(coords.shape[0]):
......@@ -65,15 +62,8 @@ def transform_preds(coords, center, scale, output_size):
return target_coords
def get_affine_transform(center,
scale,
rot,
output_size,
shift=np.array([0, 0], dtype=np.float32),
inv=0):
'''
get_affine_transform
'''
def get_affine_transform(center, scale, rot, output_size, shift=np.array([0, 0], dtype=np.float32), inv=0):
"""get_affine_transform"""
if not isinstance(scale, np.ndarray) and not isinstance(scale, list):
print(scale)
scale = np.array([scale, scale])
......
# Copyright 2021 Huawei Technologies Co., Ltd
# Copyright 2021-2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -12,18 +12,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
'''
train
'''
"""Train simple baselines."""
from __future__ import division
import os
import ast
import argparse
import numpy as np
from mindspore import context, Tensor
from mindspore.context import ParallelMode
from mindspore.communication.management import init
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore.train import Model
from mindspore.train.callback import TimeMonitor, LossMonitor, ModelCheckpoint, CheckpointConfig
from mindspore.nn.optim import Adam
......@@ -38,16 +37,10 @@ if config.MODELARTS.IS_MODEL_ARTS:
import moxing as mox
set_seed(config.GENERAL.TRAIN_SEED)
def get_lr(begin_epoch,
total_epochs,
steps_per_epoch,
lr_init=0.1,
factor=0.1,
epoch_number_to_drop=(90, 120)
):
'''
get_lr
'''
def get_lr(begin_epoch, total_epochs, steps_per_epoch, lr_init=0.1, factor=0.1,
epoch_number_to_drop=(90, 120)):
lr_each_step = []
total_steps = steps_per_epoch * total_epochs
step_number_to_drop = [steps_per_epoch * x for x in epoch_number_to_drop]
......@@ -60,11 +53,10 @@ def get_lr(begin_epoch,
learning_rate = lr_each_step[current_step:]
return learning_rate
def parse_args():
'''
args
'''
parser = argparse.ArgumentParser(description="Simplebaseline training")
"""command line arguments parsing"""
parser = argparse.ArgumentParser(description="Simple Baselines training")
parser.add_argument('--data_url', required=False, default=None, help='Location of data.')
parser.add_argument('--train_url', required=False, default=None, help='Location of training outputs.')
parser.add_argument('--device_id', required=False, default=None, type=int, help='Device id.')
......@@ -75,30 +67,38 @@ def parse_args():
args = parser.parse_args()
return args
def main():
print("loading parse...")
args = parse_args()
device_id = args.device_id
device_target = args.device_target
config.GENERAL.RUN_DISTRIBUTE = args.run_distribute
config.MODELARTS.IS_MODEL_ARTS = args.is_model_arts
if config.GENERAL.RUN_DISTRIBUTE or config.MODELARTS.IS_MODEL_ARTS:
device_id = int(os.getenv('DEVICE_ID'))
context.set_context(mode=context.GRAPH_MODE,
device_target=device_target,
save_graphs=False,
device_id=device_id)
if config.GENERAL.RUN_DISTRIBUTE:
init()
rank = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))
context.set_auto_parallel_context(device_num=device_num,
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
else:
context.set_context(mode=context.GRAPH_MODE, device_target=device_target, save_graphs=False)
if device_target == "Ascend":
rank = 0
device_num = 1
context.set_context(device_id=rank)
if config.GENERAL.RUN_DISTRIBUTE:
init()
rank = int(os.getenv('DEVICE_ID'))
device_num = int(os.getenv('RANK_SIZE'))
context.set_auto_parallel_context(device_num=device_num,
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
elif device_target == "GPU":
rank = int(os.getenv('DEVICE_ID', "0"))
device_num = int(os.getenv('RANK_SIZE', "0"))
if device_num > 1:
init()
rank = get_rank()
device_num = get_group_size()
context.set_auto_parallel_context(device_num=device_num,
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
else:
raise ValueError("Unsupported device, only GPU or Ascend is supported.")
if config.MODELARTS.IS_MODEL_ARTS:
mox.file.copy_parallel(src_url=args.data_url, dst_url=config.MODELARTS.CACHE_INPUT)
......@@ -106,9 +106,7 @@ def main():
dataset, _ = keypoint_dataset(config,
rank=rank,
group_size=device_num,
train_mode=True,
num_parallel_workers=config.TRAIN.NUM_PARALLEL_WORKERS,
)
train_mode=True)
net = GetPoseResNet(config)
loss = JointsMSELoss(config.LOSS.USE_TARGET_WEIGHT)
net_with_loss = PoseResNetWithLoss(net, loss)
......@@ -123,24 +121,25 @@ def main():
time_cb = TimeMonitor(data_size=dataset_size)
loss_cb = LossMonitor()
cb = [time_cb, loss_cb]
if config.TRAIN.SAVE_CKPT:
config_ck = CheckpointConfig(save_checkpoint_steps=dataset_size, keep_checkpoint_max=20)
prefix = ''
config_ck = CheckpointConfig(save_checkpoint_steps=dataset_size*config.TRAIN.SAVE_CKPT_EPOCH,
keep_checkpoint_max=config.TRAIN.KEEP_CKPT_MAX)
if config.GENERAL.RUN_DISTRIBUTE:
prefix = 'multi_' + 'train_poseresnet_' + config.GENERAL.VERSION + '_' + os.getenv('DEVICE_ID')
prefix = 'multi_' + 'train_poseresnet_' + config.GENERAL.VERSION + '_' + str(rank)
else:
prefix = 'single_' + 'train_poseresnet_' + config.GENERAL.VERSION
directory = ''
if config.MODELARTS.IS_MODEL_ARTS:
directory = config.MODELARTS.CACHE_OUTPUT + 'device_'+ os.getenv('DEVICE_ID')
directory = os.path.join(config.MODELARTS.CACHE_OUTPUT, 'device_' + str(rank))
elif config.GENERAL.RUN_DISTRIBUTE:
directory = config.TRAIN.CKPT_PATH + 'device_'+ os.getenv('DEVICE_ID')
directory = os.path.join(config.TRAIN.CKPT_PATH, 'device_' + str(rank))
else:
directory = config.TRAIN.CKPT_PATH + 'device'
directory = os.path.join(config.TRAIN.CKPT_PATH, 'device')
ckpoint_cb = ModelCheckpoint(prefix=prefix, directory=directory, config=config_ck)
cb.append(ckpoint_cb)
model = Model(net_with_loss, loss_fn=None, optimizer=opt, amp_level="O2")
epoch_size = config.TRAIN.END_EPOCH - config.TRAIN.BEGIN_EPOCH
print("************ Start training now ************")
......@@ -150,5 +149,6 @@ def main():
if config.MODELARTS.IS_MODEL_ARTS:
mox.file.copy_parallel(src_url=config.MODELARTS.CACHE_OUTPUT, dst_url=args.train_url)
if __name__ == '__main__':
main()