Commit 34e2e420 authored by gaozeyang's avatar gaozeyang

add resnet101 as backbone & improve ease of use

parent 1d8fe960
Showing 118 additions and 104 deletions
@@ -30,9 +30,11 @@
- [Performance](#performance)
    - [Training Performance](#training-performance)
        - [Glore_resnet50 on ImageNet2012](#glore_resnet50-on-imagenet2012)
+        - [Glore_resnet101 on ImageNet2012](#glore_resnet101-on-imagenet2012)
        - [Glore_resnet200 on ImageNet2012](#glore_resnet200-on-imagenet2012)
    - [Inference Performance](#inference-performance)
        - [Glore_resnet50 on ImageNet2012](#glore_resnet50-on-imagenet2012)
+        - [Glore_resnet101 on ImageNet2012](#glore_resnet101-on-imagenet2012)
        - [Glore_resnet200 on ImageNet2012](#glore_resnet200-on-imagenet2012)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
@@ -100,26 +102,26 @@ The backbone of the glore_res200 network is ResNet200; Stage2 and Stage3 each
```bash
# distributed training
-Usage: bash run_distribute_train.sh [DATASET_PATH] [RANK_TABLE] [CONFIG_PATH]
+Usage: bash run_distribute_train.sh [TRAIN_DATA_PATH] [RANK_TABLE] [CONFIG_PATH] [EVAL_DATA_PATH]

# standalone training
-Usage: bash run_standalone_train.sh [DATASET_PATH] [DEVICE_ID] [CONFIG_PATH]
+Usage: bash run_standalone_train.sh [TRAIN_DATA_PATH] [DEVICE_ID] [CONFIG_PATH] [EVAL_DATA_PATH]

# run the evaluation example
-Usage: bash run_eval.sh [DATASET_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
+Usage: bash run_eval.sh [EVAL_DATA_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```
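
For instance, assuming an ImageNet layout such as `~/Imagenet_Original/` with `train/` and `val/` subdirectories and a rank table at `~/hccl_8p.json` (all paths here are illustrative, not prescribed by the repo), the calls might look like:

```bash
# distributed training on 8 Ascend devices (illustrative paths)
bash run_distribute_train.sh ~/Imagenet_Original/train/ ~/hccl_8p.json ../config/config_resnet200_ascend.yaml ~/Imagenet_Original/val/

# standalone training on device 0 (illustrative paths)
bash run_standalone_train.sh ~/Imagenet_Original/train/ 0 ../config/config_resnet200_ascend.yaml ~/Imagenet_Original/val/
```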
- Running on GPU
```bash
# distributed training
-Usage: bash run_distribute_train_gpu.sh [DATASET_PATH] [RANK_SIZE] [CONFIG_PATH]
+Usage: bash run_distribute_train_gpu.sh [TRAIN_DATA_PATH] [EVAL_DATA_PATH] [RANK_SIZE] [CONFIG_PATH]

# standalone training
-Usage: bash run_standalone_train_gpu.sh [DATASET_PATH] [CONFIG_PATH]
+Usage: bash run_standalone_train.sh [TRAIN_DATA_PATH] [DEVICE_ID] [CONFIG_PATH] [EVAL_DATA_PATH]

# run the evaluation example
-Usage: bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
+Usage: bash run_eval.sh [EVAL_DATA_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```
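
For instance, mirroring the distributed example given later in this README (paths illustrative):

```bash
# distributed training on 8 GPUs with the resnet50 GPU config
bash run_distribute_train_gpu.sh ~/Imagenet_Original/train/ ~/Imagenet_Original/val/ 8 ../config/config_resnet50_gpu.yaml
```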
For distributed training, an hccl configuration file in JSON format needs to be created in advance.
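
For reference, the MindSpore models repository ships a helper for this (`utils/hccl_tools/hccl_tools.py` at the time of writing); a typical invocation, assuming that layout, is:

```bash
# generate a rank table covering local Ascend devices 0-7; the JSON file is written
# to the current directory (file name pattern per the hccl_tools documentation)
python3 hccl_tools.py --device_num "[0,8)"
```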
@@ -136,6 +138,12 @@ The backbone of the glore_res200 network is ResNet200; Stage2 and Stage3 each
.
└──Glore_resnet
  ├── README.md
+ ├── config
+   ├── config_resnet50_ascend.yaml     # glore_resnet50 config for Ascend
+   ├── config_resnet50_gpu.yaml        # glore_resnet50 config for GPU
+   ├── config_resnet101_gpu.yaml       # glore_resnet101 config for GPU
+   ├── config_resnet200_ascend.yaml    # glore_resnet200 config for Ascend
+   └── config_resnet200_gpu.yaml       # glore_resnet200 config for GPU
  ├── script
    ├── run_distribute_train.sh         # launch Ascend distributed training (8 devices)
    ├── run_distribute_train_gpu.sh     # launch GPU distributed training (8 devices)
@@ -212,6 +220,27 @@ The backbone of the glore_res200 network is ResNet200; Stage2 and Stage3 each
"lr_end":0.0,                    # minimum learning rate
```
+- Configure Glore_resnet101 parameters for the ImageNet2012 dataset (GPU).
+
+```text
+"class_num":1000,                # number of dataset classes
+"batch_size":64,                 # input tensor batch size
+"loss_scale":1024,               # loss scale
+"momentum":0.08,                 # momentum for the optimizer
+"weight_decay":0.0002,           # weight decay
+"epoch_size":150,                # applies to training only; fixed to 1 when applied to inference
+"pretrain_epoch_size":0,         # epochs already trained before loading the pretrained checkpoint; the actual number of training epochs equals epoch_size minus pretrain_epoch_size
+"save_checkpoint":True,          # whether to save checkpoints
+"save_checkpoint_epochs":5,      # epoch interval between two checkpoints; by default the last checkpoint is saved after the final epoch
+"keep_checkpoint_max":10,        # keep only the last keep_checkpoint_max checkpoints
+"save_checkpoint_path":"./",     # checkpoint save path, relative to the execution path
+"warmup_epochs":0,               # number of warm-up epochs
+"lr_decay_mode":"poly",          # decay mode used to generate the learning rate
+"lr_init":0.1,                   # initial learning rate
+"lr_max":0.4,                    # maximum learning rate
+"lr_end":0.0,                    # minimum learning rate
+```
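
Once training with this configuration produces a checkpoint, evaluation follows the `run_eval.sh` signature shown earlier; a sketch with an illustrative, hypothetical checkpoint name:

```bash
# evaluate a trained glore_resnet101 checkpoint on GPU device 0 (all paths illustrative)
bash run_eval.sh ~/Imagenet_Original/val/ 0 ~/glore_resnet101-150_5004.ckpt ../config/config_resnet101_gpu.yaml
```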

- Configure Glore_resnet200 parameters for the ImageNet2012 dataset (Ascend).

```text
@@ -264,13 +293,13 @@ The backbone of the glore_res200 network is ResNet200; Stage2 and Stage3 each
```text
# distributed training
-Usage: bash run_distribute_train.sh [DATASET_PATH] [RANK_TABLE] [CONFIG_PATH]
+Usage: bash run_distribute_train.sh [TRAIN_DATA_PATH] [RANK_TABLE] [CONFIG_PATH] [EVAL_DATA_PATH]

# standalone training
-Usage: bash run_standalone_train.sh [DATASET_PATH] [DEVICE_ID] [CONFIG_PATH]
+Usage: bash run_standalone_train.sh [TRAIN_DATA_PATH] [DEVICE_ID] [CONFIG_PATH] [EVAL_DATA_PATH]

# run the inference example
-Usage: bash run_eval.sh [DATASET_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
+Usage: bash run_eval.sh [EVAL_DATA_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```
Distributed training requires an HCCL configuration file in JSON format to be created in advance.
@@ -283,13 +312,14 @@ The backbone of the glore_res200 network is ResNet200; Stage2 and Stage3 each
```text
# distributed training
-Usage: bash run_distribute_train_gpu.sh [DATASET_PATH] [RANK_SIZE] [CONFIG_PATH]
+Usage: bash run_distribute_train_gpu.sh [TRAIN_DATA_PATH] [EVAL_DATA_PATH] [RANK_SIZE] [CONFIG_PATH]
+Example: bash run_distribute_train_gpu.sh ~/Imagenet_Original/train/ ~/Imagenet_Original/val/ 8 ../config/config_resnet50_gpu.yaml

# standalone training
-Usage: bash run_standalone_train_gpu.sh [DATASET_PATH] [CONFIG_PATH]
+Usage: bash run_standalone_train.sh [TRAIN_DATA_PATH] [CONFIG_PATH] [EVAL_DATA_PATH]

# run the inference example
-Usage: bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
+Usage: bash run_eval.sh [EVAL_DATA_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```
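
A hedged standalone example matching the usage just shown (paths illustrative; this launcher variant takes no DEVICE_ID argument as written above):

```bash
bash run_standalone_train.sh ~/Imagenet_Original/train/ ../config/config_resnet50_gpu.yaml ~/Imagenet_Original/val/
```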

## Training Results

@@ -306,6 +336,18 @@ epoch:5 step:1251, loss is 3.3024906
...
```

+- Training Glore_resnet101 on the ImageNet2012 dataset (8 devices)
+
+```text
+# distributed training result (8 devices)
+epoch:1 step:5004, loss is 4.7398486
+epoch:2 step:5004, loss is 4.129058
+epoch:3 step:5004, loss is 3.5034246
+epoch:4 step:5004, loss is 3.4452052
+epoch:5 step:5004, loss is 3.148675
+...
+```

- Training Glore_resnet200 on the ImageNet2012 dataset (8 devices)

```text
@@ -326,24 +368,24 @@ epoch:5 step:1251, loss is 4.080069

```bash
# inference
-Usage: bash run_eval.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
+Usage: bash run_eval.sh [EVAL_DATA_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```

```bash
# inference example
-bash run_eval.sh ~/Imagenet_Original/ 0 ~/glore_resnet200-150_1251.ckpt ../config/config_resnet50_gpu.yaml
+bash run_eval.sh ~/Imagenet_Original/val/ 0 ~/glore_resnet200-150_1251.ckpt ../config/config_resnet50_gpu.yaml
```

#### Running on GPU

```bash
# inference
-Usage: bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
+Usage: bash run_eval_gpu.sh [EVAL_DATA_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```

```bash
# inference example
-bash run_eval.sh ~/Imagenet ~/glore_resnet200-150_2502.ckpt ../config/config_resnet50_gpu.yaml
+bash run_eval.sh ~/Imagenet/val/ 0 ~/glore_resnet200-150_2502.ckpt ../config/config_resnet50_gpu.yaml
```

## Inference Results

@@ -378,6 +420,26 @@ result:{'top_1 acc':0.802303685897436}
| Fine-tuned checkpoint | 233.46M (.ckpt file) | 233.46M (.ckpt file) |
| Script | [Link](https://gitee.com/mindspore/models/tree/master/research/cv/glore_res) |

+#### Glore_resnet101 on ImageNet2012
+
+| Parameters | GPU |
+| -------------------------- | ------------------------------------ |
+| Model version | Glore_resnet101 |
+| Resource | GPU V100 PCIe 32G |
+| Upload date | 2021-10-22 |
+| MindSpore version | 1.5.0 |
+| Dataset | ImageNet2012 |
+| Training parameters | epoch=150, steps per epoch=5004, batch_size=32 |
+| Optimizer | NAG |
+| Loss function | SoftmaxCrossEntropyExpand |
+| Output | Probability |
+| Loss | 1.7463021 |
+| Speed | 33 ms/step (8 devices) |
+| Total time | 30 hours |
+| Parameters (M) | 57 |
+| Fine-tuned checkpoint | 579.06M (.ckpt file) |
+| Script | [Link](https://gitee.com/mindspore/models/tree/master/research/cv/glore_res) |

#### Glore_resnet200 on ImageNet2012

| Parameters | Ascend 910 | GPU |
@@ -413,6 +475,19 @@ result:{'top_1 acc':0.802303685897436}
| Output | Probability | Probability |
| Accuracy | 8 devices: 78.44% | 8 devices: 78.50% |

+#### Glore_resnet101 on ImageNet2012
+
+| Parameters | GPU |
+| ------------------- | ---------------------- |
+| Model version | Glore_resnet101 |
+| Resource | GPU V100 (SXM2) |
+| Upload date | 2021-10-22 |
+| MindSpore version | 1.5.0 |
+| Dataset | ImageNet2012 test set (6.4GB) |
+| batch_size | 32 |
+| Output | Probability |
+| Accuracy | 8 devices: 79.663% |

#### Glore_resnet200 on ImageNet2012

| Parameters | Ascend | GPU |
@@ -432,4 +507,4 @@ transform_utils.py applies a random-selection strategy during data augmentation, and train.py
# ModelZoo Homepage

-Please visit the official [homepage](https://gitee.com/mindspore/models).
+Please visit the official [homepage](https://gitee.com/mindspore/models/).

# Builtin Configurations (DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
isModelArts: false
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
run_distribute: true
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path/"
device_target: "Ascend"
checkpoint_path: "./checkpoint/"
# ==============================================================================
# Training options
batch_size: 80
class_num: 1000
epoch_size: 150
keep_checkpoint_max: 10
loss_scale: 1024
lr_decay_mode: poly
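# note (hedged): with lr_decay_mode "poly", the learning rate is typically warmed up from
# lr_init toward lr_max over warmup_epochs and then decayed polynomially down to lr_end
# by the final epoch; see the lr generation code in src for the exact curve used here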
lr_end: 0
lr_init: 0.1
lr_max: 0.4
momentum: 0.08
pretrain_epoch_size: 0
use_glore: true
save_checkpoint: true
save_checkpoint_epochs: 5
save_checkpoint_path: ./
use_label_smooth: false
warmup_epochs: 0
weight_decay: 0.0002
net: "resnet101"
cast_fp16: true
device_target: "Ascend"
device_id: 0
device_num: 8
data_url: ""
pretrained_ckpt: ""
parameter_server: ""
# Export options
device_id: 0
file_name: "resnet200"
file_format: "MINDIR"
ckpt_url: ""
# Image options
image_size: 224
---
# Help description for each configuration
enable_modelarts: "Whether training on modelarts, default: False"
data_url: "Dataset url for obs"
checkpoint_url: "The location of checkpoint for obs"
data_path: "Dataset path for local"
output_path: "Training output path for local"
load_path: "The location of checkpoint for obs"
device_target: "Target device type, available: [Ascend, GPU, CPU]"
enable_profiling: "Whether enable profiling while training, default: False"
num_classes: "Class for dataset"
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
\ No newline at end of file
@@ -40,6 +40,7 @@ device_target: "GPU"
device_id: 0
device_num: 8
data_url: ""
+eval_data_url: ""
pretrained_ckpt: ""
parameter_server: ""
@@ -41,6 +41,7 @@ device_target: "Ascend"
device_id: 0
device_num: 8
data_url: ""
+eval_data_url: ""
pretrained_ckpt: ""
parameter_server: ""
@@ -40,6 +40,7 @@ device_target: "GPU"
device_id: 0
device_num: 8
data_url: ""
+eval_data_url: ""
pretrained_ckpt: ""
parameter_server: ""
@@ -44,6 +44,7 @@ device_target: "Ascend"
device_id: 0
device_num: 8
data_url: ""
+eval_data_url: ""
pretrained_ckpt: ""
parameter_server: ""
@@ -44,6 +44,7 @@ device_target: "GPU"
device_id: 0
device_num: 8
data_url: ""
+eval_data_url: ""
pretrained_ckpt: ""
parameter_server: ""
@@ -25,7 +25,7 @@ from mindspore import context
from mindspore import dataset as de
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
-from src.glore_resnet import glore_resnet200, glore_resnet50
+from src.glore_resnet import glore_resnet200, glore_resnet50, glore_resnet101
from src.dataset import create_eval_dataset
from src.dataset import create_dataset_ImageNet as ImageNet
from src.loss import CrossEntropySmooth, SoftmaxCrossEntropyExpand
@@ -50,13 +50,13 @@ if __name__ == '__main__':
device_id=device_id)
    # dataset
-    eval_dataset_path = os.path.join(config.data_url, 'val')
+    eval_dataset_path = os.path.abspath(config.eval_data_url)
    if config.isModelArts:
-        mox.file.copy_parallel(src_url=config.data_url, dst_url='/cache/dataset')
+        mox.file.copy_parallel(src_url=config.eval_data_url, dst_url='/cache/dataset')
        eval_dataset_path = '/cache/dataset/'
    if config.net == 'resnet50':
        predict_data = create_eval_dataset(dataset_path=eval_dataset_path, repeat_num=1, batch_size=config.batch_size)
-    elif config.net == 'resnet200':
+    else:
        predict_data = ImageNet(dataset_path=eval_dataset_path,
                                do_train=False,
                                repeat_num=1,
@@ -71,6 +71,8 @@ if __name__ == '__main__':
        net = glore_resnet50(class_num=config.class_num, use_glore=config.use_glore)
    elif config.net == 'resnet200':
        net = glore_resnet200(class_num=config.class_num, use_glore=config.use_glore)
+    elif config.net == 'resnet101':
+        net = glore_resnet101(class_num=config.class_num, use_glore=config.use_glore)
    # load checkpoint
    param_dict = load_checkpoint(config.ckpt_url)
@@ -21,9 +21,9 @@ echo "For example: bash run_distribute_train.sh /path/dataset /path/rank_table .
echo "It is better to use the absolute path."
echo "=============================================================================================================="
set -e
-if [ $# != 3 ]
+if [ $# != 4 ]
then
-    echo "Usage: bash run_distribute_train.sh [DATASET_PATH] [RANK_TABLE] [CONFIG_PATH]"
+    echo "Usage: bash run_distribute_train.sh [TRAIN_DATA_PATH] [RANK_TABLE] [CONFIG_PATH] [EVAL_DATA_PATH]"
exit 1
fi
get_real_path(){
@@ -37,6 +37,7 @@ DATA_PATH=$(get_real_path $1)
export DATA_PATH=${DATA_PATH}
RANK_TABLE=$(get_real_path $2)
CONFIG_PATH=$(get_real_path $3)
+EVAL_DATA_PATH=$(get_real_path $4)
export RANK_TABLE_FILE=${RANK_TABLE}
export RANK_SIZE=8
@@ -58,7 +59,7 @@
export RANK_ID=$i
echo "start training for device $i"
env > env$i.log
-    python3 train.py --data_url $1 --isModelArts False --run_distribute True --config_path=$CONFIG_PATH > train$i.log 2>&1 &
+    python3 train.py --data_url $DATA_PATH --isModelArts False --run_distribute True --config_path=$CONFIG_PATH --eval_data_url $EVAL_DATA_PATH > train$i.log 2>&1 &
if [ $? -eq 0 ];then
echo "start training for device$i"
else
@@ -20,9 +20,9 @@ echo "bash run_distribute_train_gpu.sh DATA_PATH RANK_SIZE CONFIG_PATH"
echo "For example: bash run_distribute_train.sh /path/dataset 8 ../config/config_resnet50_gpu.yaml"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
-if [ $# != 3 ]
+if [ $# != 4 ]
then
-    echo "Usage: bash run_distribute_train_gpu.sh [DATASET_PATH] [RANK_SIZE] [CONFIG_PATH]"
+    echo "Usage: bash run_distribute_train_gpu.sh [TRAIN_DATA_PATH] [EVAL_DATA_PATH] [RANK_SIZE] [CONFIG_PATH]"
exit 1
fi
@@ -35,12 +35,13 @@ get_real_path(){
}
set -e
-DEVICE_NUM=$2
+DEVICE_NUM=$3
DATA_PATH=$(get_real_path $1)
-CONFIG_PATH=$(get_real_path $3)
+EVAL_DATA_PATH=$(get_real_path $2)
+CONFIG_PATH=$(get_real_path $4)
export DATA_PATH=${DATA_PATH}
-export DEVICE_NUM=$2
-export RANK_SIZE=$2
+export DEVICE_NUM=$3
+export RANK_SIZE=$3
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
cd ../
@@ -56,5 +57,5 @@ env > env.log
echo "start training"
mpirun -n $3 --allow-run-as-root \
python3 train.py --data_url=$DATA_PATH --isModelArts=False --run_distribute=True \
-    --device_target="GPU" --config_path=$CONFIG_PATH --device_num $2 > train.log 2>&1 &
+    --device_target="GPU" --config_path=$CONFIG_PATH --eval_data_url=$EVAL_DATA_PATH --device_num $3 > train.log 2>&1 &
@@ -51,7 +51,7 @@ cd ../
export DEVICE_ID=$2
export RANK_ID=0
env > env0.log
-python3 eval.py --data_url $1 --isModelArts False --device_id $2 --ckpt_url $CKPT_PATH --config_path=$CONFIG_PATH > eval.log 2>&1
+python3 eval.py --eval_data_url $1 --isModelArts False --device_id $2 --ckpt_url $CKPT_PATH --config_path=$CONFIG_PATH > eval.log 2>&1
if [ $? -eq 0 ];then
echo "testing success"
@@ -86,8 +86,8 @@ if __name__ == '__main__':
    # get device_num, device_id after device init
    device_num, device_id = _get_rank_info()
    # create dataset
-    train_dataset_path = os.path.join(config.data_url, 'train')
-    eval_dataset_path = os.path.join(config.data_url, 'val')
+    train_dataset_path = os.path.abspath(config.data_url)
+    eval_dataset_path = os.path.abspath(config.eval_data_url)
    # download dataset from obs to cache if train on ModelArts
    if config.net == 'resnet50':