Unverified commit 0923e976, authored by i-robot, committed via Gitee

!2520 RefineNet gpu pr

Merge pull request !2520 from FuYanfeng/master
parents ff641e06 d4c423d9
<!-- TOC -->

- [Contents](#contents)
- [RefineNet Description](#refinenet-description)
- [Description](#description)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Training Process](#training-process)
- [Usage](#usage)
- [Running on Ascend](#running-on-ascend)
- [Running on GPU](#running-on-gpu)
- [Results](#results)
- [Evaluation Process](#evaluation-process)
- [Usage](#usage-1)
- [Running on Ascend](#running-on-ascend-1)
- [Running on GPU](#running-on-gpu-1)
- [Results](#results-1)
- [Training Accuracy](#training-accuracy)
- [MINDIR Inference](#mindir-inference)
- [Exporting the Model](#exporting-the-model)
- [Running Inference on Ascend 310](#running-inference-on-ascend-310)
- [Results](#results-2)
- [Model Description](#model-description)
- [Performance](#performance)
# RefineNet Description
## Description
RefineNet is a generic multi-path refinement network that explicitly exploits all the information available along the down-sampling path and uses long-range residual connections to produce high-resolution predictions. In this way, the deeper layers that capture high-level semantic features can be refined directly with fine-grained features from earlier convolutions. All components of RefineNet use residual connections following the identity-mapping idea, which enables effective end-to-end training.
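For illustration only (this is not the code in `src`), a single refinement step of this idea — upsample a deep, low-resolution feature map and fuse it with a shallow, high-resolution one through convolutions and an element-wise residual sum — could be sketched with the MindSpore 1.x API as follows; the channel arguments and the fixed 2x upsampling are assumptions:

```python
import mindspore.nn as nn

class FusionBlock(nn.Cell):
    """Toy multi-path fusion block: refine a shallow feature map with an upsampled deep one."""
    def __init__(self, deep_ch, shallow_ch, out_ch):
        super().__init__()
        self.conv_deep = nn.Conv2d(deep_ch, out_ch, 3, pad_mode='same')
        self.conv_shallow = nn.Conv2d(shallow_ch, out_ch, 3, pad_mode='same')
        self.up = nn.ResizeBilinear()   # MindSpore 1.x resize layer
        self.relu = nn.ReLU()

    def construct(self, deep, shallow):
        # long-range residual connection: coarse semantics plus fine-grained detail
        deep = self.up(self.conv_deep(deep), scale_factor=2)
        return self.relu(self.conv_shallow(shallow) + deep)
```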
# Dataset
Pascal VOC dataset and the Semantic Boundaries Dataset (SBD)

- Download the segmentation datasets.
- Prepare the backbone model.

Pascal VOC dataset website: [link](https://host.robots.ox.ac.uk/pascal/VOC/)
SBD dataset download: [link](https://www2.eecs.berkeley.edu/Research/Projects/CS/vision/grouping/semantic_contours/benchmark.tgz)
ResNet-101 pretrained model download: [link](https://download.mindspore.cn/model_zoo/r1.2/resnet101_ascend_v120_imagenet2012_official_cv_bs32_acc78/resnet101_ascend_v120_imagenet2012_official_cv_bs32_acc78.ckpt)
After downloading, extract the datasets to obtain the following directories:

```text
~/data/                     dataset root directory
~/data/VOCdevkit/           Pascal VOC dataset directory
~/data/benchmark_RELEASE/   SBD (Semantic Boundaries) dataset directory
```
- Prepare the training data list files. A list file stores the relative paths of image/annotation pairs, one pair per line:

```text
VOCdevkit/VOC2012/JPEGImages/2007_000032.jpg VOCdevkit/VOC2012/SegmentationClassGray/2007_000032.png
VOCdevkit/VOC2012/JPEGImages/2007_000039.jpg VOCdevkit/VOC2012/SegmentationClassGray/2007_000039.png
VOCdevkit/VOC2012/JPEGImages/2007_000063.jpg VOCdevkit/VOC2012/SegmentationClassGray/2007_000063.png
VOCdevkit/VOC2012/JPEGImages/2007_000068.jpg VOCdevkit/VOC2012/SegmentationClassGray/2007_000068.png
......
```

The list files can be generated automatically by running `python ~/src/tool/get_dataset_lst.py --data_dir=~/data/`, which produces:
```text
voc_train_lst.txt    data list for the finetune set (Pascal VOC train)
voc_val_lst.txt      data list for the validation set (Pascal VOC val)
sbd_train_lst.txt    data list for the pretraining set (SBD)
```

- Configure and run get_dataset_MRcd.sh to convert the dataset to MindRecord. Parameters of scripts/get_dataset_MRcd.sh:

```text
--data_root    root path of the training data
--data_lst     training data list (prepared as above)
--dst_path     where the MindRecord files are written
--num_shards   number of MindRecord shards
--shuffle      whether to shuffle
```
- Configure and run build_MRcd.py to convert the dataset to MindRecord (a simplified sketch of this conversion is shown after this list).

Run the following command with the generated 'sbd_train_lst.txt' to build the pretraining dataset:

```bash
# build_MRcd.py
Usage: python ~/src/tool/build_MRcd.py --data_root=~/data/ --data_lst=~/sbd_train_lst.txt --dst_path=~/data/sbdonly
```

Run the following command with the generated 'voc_train_lst.txt' to build the finetune dataset:

```bash
# build_MRcd.py
Usage: python ~/src/tool/build_MRcd.py --data_root=~/data/ --data_lst=~/voc_train_lst.txt --dst_path=~/data/voconly
```

```text
--data_root    root path of the training data (~/data)
--data_lst     training data list (~/data/sbd_train_lst.txt)
--dst_path     output path plus MindRecord file name (e.g. ~/data/ is the path, sbdonly the file name)
--num_shards   number of MindRecord shards (default 8)
```
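For reference, the list-then-convert pipeline implemented by get_dataset_lst.py and build_MRcd.py boils down to the pattern below. This is a simplified sketch: the real scripts also handle SBD .mat annotations and gray-label conversion, and the two-field bytes schema and flush batch size used here are assumptions.

```python
import os
from mindspore.mindrecord import FileWriter

def write_mindrecord(data_root, data_lst, dst_path, num_shards=8):
    """Pack the (image, label) file pairs listed in data_lst into MindRecord shards."""
    writer = FileWriter(file_name=dst_path, shard_num=num_shards)
    writer.add_schema({"data": {"type": "bytes"}, "label": {"type": "bytes"}}, "refinenet_seg")
    samples = []
    with open(data_lst) as f:
        for line in f:
            img_rel, label_rel = line.split()
            with open(os.path.join(data_root, img_rel), 'rb') as img_f, \
                 open(os.path.join(data_root, label_rel), 'rb') as lbl_f:
                samples.append({"data": img_f.read(), "label": lbl_f.read()})
            if len(samples) >= 100:      # flush periodically to bound memory use
                writer.write_raw_data(samples)
                samples = []
    if samples:
        writer.write_raw_data(samples)
    writer.commit()

# e.g. write_mindrecord("/home/user/data", "/home/user/data/sbd_train_lst.txt", "/home/user/data/sbdonly")
```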
# Features

## Mixed Precision

[Mixed precision](https://www.mindspore.cn/tutorials/experts/zh-CN/master/others/mixed_precision.html) training uses both single-precision and half-precision data to speed up deep neural network training while preserving the accuracy achievable with pure single-precision training. It increases computing throughput and reduces memory usage, making it possible to train larger models or larger batch sizes on specific hardware.
Taking FP16 operators as an example, if the input data type is FP32, the MindSpore backend automatically lowers the precision for processing. Users can enable INFO logging and search for "reduce precision" to see which operators were demoted.
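Concretely, the precision level is chosen per backend when the Model is built (O3 on Ascend, O2 on GPU, as the train.py changes further down in this diff show). A minimal sketch with a placeholder network:

```python
import mindspore.nn as nn
from mindspore.train.model import Model
from mindspore.train.loss_scale_manager import FixedLossScaleManager

net = nn.Dense(16, 21)                                      # placeholder network for illustration
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True)
opt = nn.Momentum(net.trainable_params(), learning_rate=0.0015, momentum=0.9)

# amp_level "O3"/"O2" casts most of the network to float16; a fixed loss scale guards small gradients
loss_scale = FixedLossScaleManager(1024, drop_overflow_update=False)
model = Model(net, loss, optimizer=opt, amp_level="O3", loss_scale_manager=loss_scale)
```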
# Environment Requirements
1. Evaluate with the voc val dataset. The evaluation script is as follows:

```bash
run_eval.sh
```
# Script Description

.
└──refinenet
  ├── script
    ├── get_dataset_mindrecord.sh          # convert raw data to a MindRecord dataset
    ├── run_standalone_train_ascend_r1.sh  # Ascend standalone pretraining (1 device)
    ├── run_standalone_train_ascend_r2.sh  # Ascend standalone finetuning (1 device)
    ├── run_distribute_train_ascend_r1.sh  # Ascend distributed pretraining (8 devices)
    ├── run_distribute_train_ascend_r2.sh  # Ascend distributed finetuning (8 devices)
    ├── run_eval.sh                        # launch evaluation
    ├── run_standalone_train_gpu_r1.sh     # GPU standalone pretraining (1 device)
    ├── run_standalone_train_gpu_r2.sh     # GPU standalone finetuning (1 device)
    ├── run_distribute_train_gpu_r1.sh     # GPU distributed pretraining (8 devices)
    ├── run_distribute_train_gpu_r2.sh     # GPU distributed finetuning (8 devices)
  ├── src
    ├── tools
      ├── get_dataset_lst.py               # generate the data list files
## Script Parameters

Default configuration for the Ascend environment:

```bash
"data_file":"~/data/"                            # dataset path
"device_target":Ascend                           # training backend
"train_epochs":200                               # total number of epochs
"batch_size":32                                  # batch size of the input tensor
"crop_size":513                                  # crop size
"base_lr":0.0015                                 # base learning rate
"lr_type":cos                                    # decay mode used to generate the learning rate
"min_scale":0.5                                  # minimum scale for data augmentation
"max_scale":2.0                                  # maximum scale for data augmentation
"ignore_label":255                               # ignore label
"num_classes":21                                 # number of classes
"ckpt_pre_trained":"/PATH/TO/PRETRAIN_MODEL"     # path of the pretrained checkpoint to load
"is_distributed":                                # set to True for distributed training
"save_epochs":5                                  # interval (in epochs) between checkpoint saves
"freeze_bn":                                     # set freeze_bn to True to freeze batch norm
"keep_checkpoint_max":200                        # maximum number of checkpoints to keep
```
Default configuration for the GPU environment:

```bash
"data_file":"~/data/"                            # dataset path
"device_target":GPU                              # training backend
"train_epochs":200                               # total number of epochs
"batch_size":16                                  # batch size of the input tensor
"crop_size":513                                  # crop size
"base_lr":0.001                                  # base learning rate
"lr_type":cos                                    # decay mode used to generate the learning rate
"min_scale":0.5                                  # minimum scale for data augmentation
"max_scale":2.0                                  # maximum scale for data augmentation
"ignore_label":255                               # ignore label
"num_classes":21                                 # number of classes
"ckpt_pre_trained":"/PATH/TO/PRETRAIN_MODEL"     # path of the pretrained checkpoint to load
"is_distributed":                                # set to True for distributed training
"save_epochs":5                                  # interval (in epochs) between checkpoint saves
"freeze_bn":                                     # set freeze_bn to True to freeze batch norm
"keep_checkpoint_max":200                        # maximum number of checkpoints to keep
```
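`lr_type` set to `cos` means the per-step learning rate decays from `base_lr` along a cosine curve over the whole run. The repository's exact schedule lives in `src`; a generic cosine schedule consistent with the parameters above might look like this (the warmup option is an assumption):

```python
import math

def cosine_lr(base_lr, total_steps, warmup_steps=0):
    """Return a per-step learning-rate list that decays from base_lr to 0 along a cosine curve."""
    lrs = []
    for step in range(total_steps):
        if step < warmup_steps:
            lrs.append(base_lr * (step + 1) / max(1, warmup_steps))
        else:
            progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
            lrs.append(0.5 * base_lr * (1.0 + math.cos(math.pi * progress)))
    return lrs

# e.g. GPU defaults above: base_lr=0.001 over 200 epochs of 569 steps each
schedule = cosine_lr(0.001, 200 * 569)
```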
## Training Process

### Usage

#### Running on Ascend

First prepare the ResNet-101 pretrained model resnet-101.ckpt. Following the original RefineNet paper, we first train on the SBD dataset, then finetune on the voc_train split of Pascal VOC, and finally evaluate on voc_val.
Run the following script to configure single-device training and finetune the ResNet-101 model:

```bash
# run_standalone_train_ascend_r1.sh
Usage: bash scripts/run_standalone_train_ascend_r1.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]
# example: bash scripts/run_standalone_train_ascend_r1.sh ~/data/sbdonly0 /disk3/fyf/resnet-101.ckpt 0
```
Run the following script to configure single-device training and finetune the model from the previous step:

```bash
# run_standalone_train_ascend_r2.sh
Usage: bash scripts/run_standalone_train_ascend_r2.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]
# example: bash scripts/run_standalone_train_ascend_r2.sh ~/data/voconly0 /disk3/fyf/RefineNet/scripts/refinenet-115_284.ckpt 4
```
Run the following script to configure 8-device training and finetune the ResNet-101 model:

```bash
# run_distribute_train_ascend_r1.sh
Usage: bash scripts/run_distribute_train_ascend_r1.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_PATH]
# example: bash scripts/run_distribute_train_ascend_r1.sh hccl_8p_01234567_127.0.0.1.json ~/data/sbdonly0 /disk3/fyf/resnet-101.ckpt
```
Run the following script to configure 8-device training and finetune the model from the previous step:

```bash
# run_distribute_train_ascend_r2.sh
Usage: bash scripts/run_distribute_train_ascend_r2.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_PATH]
# example: bash scripts/run_distribute_train_ascend_r2.sh hccl_8p_01234567_127.0.0.1.json ~/data/voconly0 /disk3/fyf/RefineNet/scripts/refinenet-115_284.ckpt
```
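Each round starts from an existing checkpoint (resnet-101.ckpt, then the previous round's refinenet-*.ckpt). Loading it in MindSpore typically looks like the sketch below; the `src.refinenet` import path is an assumption, while the constructor call mirrors the one visible in eval.py later in this diff:

```python
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from src.refinenet import RefineNet, Bottleneck    # hypothetical import path within this repository

net = RefineNet(Bottleneck, [3, 4, 23, 3], 21)     # 21 Pascal VOC classes
param_dict = load_checkpoint("resnet-101.ckpt")    # or a refinenet-*.ckpt from the previous round
load_param_into_net(net, param_dict)               # parameters with matching names are loaded into the network
```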
#### Running on GPU

The workflow mirrors the Ascend one, using the GPU scripts. As before, first prepare the ResNet-101 pretrained model resnet-101.ckpt.
Run the following script to configure single-device training and finetune the ResNet-101 model:

```bash
# run_standalone_train_gpu_r1.sh
Usage: bash scripts/run_standalone_train_gpu_r1.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]
# example: bash scripts/run_standalone_train_gpu_r1.sh ~/data/sbdonly0 /data1/fyf/resnet-101.ckpt 0
```
Run the following script to configure single-device training and finetune the model from the previous step:

```bash
# run_standalone_train_gpu_r2.sh
Usage: bash scripts/run_standalone_train_gpu_r2.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]
# example: bash scripts/run_standalone_train_gpu_r2.sh ~/data/voconly0 /data1/fyf/RefineNet/scripts/train2/ckpt_0/refinenet-130_569.ckpt 0
```
Run the following script to configure 8-device training and finetune the ResNet-101 model (note that multi-GPU training does not need a rank table file):

```bash
# run_distribute_train_gpu_r1.sh
Usage: bash scripts/run_distribute_train_gpu_r1.sh [DATASET_PATH] [PRETRAINED_PATH] [VISIABLE_DEVICES(0,1,2,3,4,5,6,7)]
# example: bash scripts/run_distribute_train_gpu_r1.sh ~/data/sbdonly0 /data1/fyf/resnet-101.ckpt 0,1,2,3,4,5,6,7
```
Run the following script to configure 8-device training and finetune the model from the previous step:

```bash
# run_distribute_train_gpu_r2.sh
Usage: bash scripts/run_distribute_train_gpu_r2.sh [DATASET_PATH] [PRETRAINED_PATH] [VISIABLE_DEVICES(0,1,2,3,4,5,6,7)]
# example: bash scripts/run_distribute_train_gpu_r2.sh ~/data/voconly0 /data1/fyf/refinenet-115_1140.ckpt 0,1,2,3,4,5,6,7
```
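Under mpirun, every process runs the same train.py and sets up NCCL-based data parallelism roughly as follows (a condensed sketch of the logic shown in the train.py hunk near the end of this diff):

```python
from mindspore import context
from mindspore.context import ParallelMode
from mindspore.communication.management import init, get_rank, get_group_size

context.set_context(mode=context.GRAPH_MODE, device_target="GPU")
init()                                   # NCCL initialization; ranks come from the mpirun launch
rank = get_rank()
group_size = get_group_size()
context.set_auto_parallel_context(parallel_mode=ParallelMode.DATA_PARALLEL,
                                  gradients_mean=True,
                                  device_num=group_size)
```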
### Results

Results on Ascend:

- Trained on the SBD dataset (with the VOC2012 overlap removed), finetuning the ResNet-101 model:

```bash
epoch time: 12969.236 ms, per step time: 589.511 ms
...
```
Results on GPU:
```bash
# training result (1P)
epoch: 195 step: 569, loss is 0.05817811
epoch time: 410643.678 ms, per step time: 721.694 ms
epoch: 196 step: 569, loss is 0.07650596
epoch time: 409365.036 ms, per step time: 719.446 ms
epoch: 197 step: 569, loss is 0.07034514
epoch time: 409448.961 ms, per step time: 719.594 ms
epoch: 198 step: 569, loss is 0.07419827
epoch time: 409355.774 ms, per step time: 719.430 ms
epoch: 199 step: 569, loss is 0.07571901
epoch time: 409360.690 ms, per step time: 719.439 ms
epoch: 200 step: 569, loss is 0.08345377
epoch time: 410627.769 ms, per step time: 721.666 ms
...
```
## Evaluation Process

### Usage

#### Running on Ascend

Configure the checkpoint with --ckpt_path and run the script; the mIoU is printed in eval_path/log.
```bash
# run_eval.sh  # test the training result
Usage: bash scripts/run_eval.sh [DATA_LST] [PRETRAINED_PATH] [DEVICE_TARGET] [DEVICE_ID]
# example: bash scripts/run_eval.sh ~/data/voc_val_lst.txt /data1/fyf/refinenet-115_1140.ckpt Ascend 0
per-class IoU [0.92730402 0.89903323 0.42117934 0.82678775 0.69056955 0.72132475
0.8930829 0.81315161 0.80125108 0.32330532 0.74447242 0.58100735
mean IoU 0.8038030230633278
```
#### Running on GPU

Configure the checkpoint with --ckpt_path and run the script; the mIoU is printed in eval_path/log.

```bash
# run_eval.sh  # test the training result
Usage: bash scripts/run_eval.sh [DATA_LST] [PRETRAINED_PATH] [DEVICE_TARGET] [DEVICE_ID]
# example: bash scripts/run_eval.sh ~/data/voc_val_lst.txt /data1/fyf/refinenet-115_1140.ckpt GPU 0
per-class IoU [0.95088336 0.90526754 0.62389328 0.90752526 0.77911041 0.79076594
0.94210807 0.88425516 0.93747317 0.41626388 0.84932021 0.63371361
0.89109052 0.85608585 0.8491058 0.86728246 0.6983279 0.88386951
0.47583356 0.8800718 0.78794471]
mean IoU 0.8004853336726656
```
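eval.py reports per-class IoU and mean IoU. One common way to compute them, by accumulating a confusion matrix over all validation images, is sketched below (not necessarily the repository's exact implementation):

```python
import numpy as np

def update_conf_mat(conf_mat, pred, label, num_classes=21, ignore_label=255):
    """Accumulate one prediction/label pair into the confusion matrix, skipping ignored pixels."""
    valid = label != ignore_label
    idx = num_classes * label[valid].astype(int) + pred[valid].astype(int)
    conf_mat += np.bincount(idx, minlength=num_classes ** 2).reshape(num_classes, num_classes)
    return conf_mat

def iou_from_conf_mat(conf_mat):
    """Per-class IoU and mean IoU from a num_classes x num_classes confusion matrix."""
    inter = np.diag(conf_mat)
    union = conf_mat.sum(axis=0) + conf_mat.sum(axis=1) - inter
    per_class_iou = inter / np.maximum(union, 1)
    return per_class_iou, per_class_iou.mean()
```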
### Results

## MINDIR Inference

### Exporting the Model

```shell
python export.py --checkpoint [CKPT_PATH] --file_name [FILE_NAME] --file_format MINDIR
```

### Running Inference on Ascend 310
```shell
# Ascend310 inference
bash scripts/run_infer_310.sh [MINDIR_PATH] [DATA_ROOT] [DATA_LIST] [DEVICE_ID]
```
- `DATA_ROOT` is the root directory of the inference dataset.
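For reference, the export step above generally follows the pattern below in MindSpore 1.x; the `src.refinenet` import path, checkpoint name, and 1x3x513x513 input shape are assumptions (513 matches crop_size in the configuration above):

```python
import numpy as np
from mindspore import Tensor, context
from mindspore.train.serialization import export, load_checkpoint, load_param_into_net
from src.refinenet import RefineNet, Bottleneck    # hypothetical import path within this repository

context.set_context(mode=context.GRAPH_MODE, device_target="Ascend")
net = RefineNet(Bottleneck, [3, 4, 23, 3], 21)                 # 21 Pascal VOC classes
load_param_into_net(net, load_checkpoint("refinenet.ckpt"))    # trained checkpoint
dummy_input = Tensor(np.zeros([1, 3, 513, 513], np.float32))   # NCHW input matching the training crop size
export(net, dummy_input, file_name="refinenet", file_format="MINDIR")
```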
### Evaluation Performance

| Parameters | Ascend 910 | GPU |
| -------------------------- | ---------------------------------------------------- | --------------------- |
| Model version | RefineNet | RefineNet |
| Resource | Ascend 910 | GeForce RTX 3090 |
| Upload date | 2021-09-17 | 2022-02-16 |
| MindSpore version | 1.2 | 1.2 |
| Dataset | PASCAL VOC2012 + SBD | PASCAL VOC2012 + SBD |
| Training parameters | epoch = 200, batch_size = 32 | epoch = 200, batch_size = 16 |
| Optimizer | Momentum | Momentum |
| Loss function | Softmax cross-entropy | Softmax cross-entropy |
| Output | probability | probability |
| Loss | 0.027490407 | 0.08345377 |
| Performance | 54294.528 ms (Ascend, 8 devices); 298406.836 ms (Ascend, 1 device) | 723.160 ms (GPU, 1 device) |
| Finetuned checkpoint | 901M (.ckpt file) | 900M (.ckpt file) |
| Script | [link](https://gitee.com/mindspore/models/tree/master/research/cv/RefineNet) | [link](https://gitee.com/mindspore/models/tree/master/research/cv/RefineNet) |
# Description of Random Situation
......
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -41,6 +41,8 @@ def parse_args():
    parser.add_argument('--device_id', type=str, default='0', choices=['0', '1', '2', '3', '4', '5', '6', '7'],
                        help='which device will be implemented')
    parser.add_argument('--ckpt_path', type=str, default='', help='model to evaluate')
    parser.add_argument('--device_target', type=str, default='Ascend', choices=['Ascend', 'GPU'],
                        help='device where the code will be implemented. (Default: Ascend)')
    args, _ = parser.parse_known_args()
    return args
......@@ -140,7 +142,7 @@ def net_eval():
    with open(args.data_lst) as f:
        img_lst = f.readlines()
    context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target, save_graphs=False,
                        device_id=int(args.device_id))
    network = RefineNet(Bottleneck, [3, 4, 23, 3], args.num_classes)
......
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -15,7 +15,7 @@
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_PATH]"
echo "Usage: bash scripts/run_distribute_train_ascend.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_PATH]"
exit 1
fi
......
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -15,7 +15,7 @@
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: sh run_distribute_train_ascend.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_PATH]"
echo "Usage: bash scripts/run_distribute_train_ascend.sh [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_PATH]"
exit 1
fi
......
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: bash scripts/run_distribute_train_gpu_r1.sh [DATASET_PATH] [PRETRAINED_PATH] [VISIABLE_DEVICES(0,1,2,3,4,5,6,7)]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
DATASET_PATH=$1
if [ ! -f $DATASET_PATH ]
then
echo "error: DATASET_PATH=$DATASET_PATH is not a file"
exit 1
fi
PRETRAINED_PATH=$(get_real_path $2)
echo $PRETRAINED_PATH
if [ ! -f $PRETRAINED_PATH ]
then
echo "error: PRETRAINED_PATH=$PRETRAINED_PATH is not a file"
exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8
export CUDA_VISIBLE_DEVICES="$3"
rm -rf ./train_parallel_r1
mkdir ./train_parallel_r1
cp ../*.py ./train_parallel_r1
cp *.sh ./train_parallel_r1
cp -r ../src ./train_parallel_r1
cd ./train_parallel_r1 || exit
echo "start training "
env > env.log
if [ $# == 3 ]
then
mpirun --allow-run-as-root -n 8 \
python train.py \
--is_distribute \
--data_file=$DATASET_PATH \
--ckpt_pre_trained=$PRETRAINED_PATH \
--base_lr=0.004 \
--batch_size=8 \
--device_target='GPU' &> log &
fi
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: bash scripts/run_distribute_train_gpu_r1.sh [DATASET_PATH] [PRETRAINED_PATH] [VISIABLE_DEVICES(0,1,2,3,4,5,6,7)]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
DATASET_PATH=$1
if [ ! -f $DATASET_PATH ]
then
echo "error: DATASET_PATH=$DATASET_PATH is not a file"
exit 1
fi
PRETRAINED_PATH=$(get_real_path $2)
echo $PRETRAINED_PATH
if [ ! -f $PRETRAINED_PATH ]
then
echo "error: PRETRAINED_PATH=$PRETRAINED_PATH is not a file"
exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8
export CUDA_VISIBLE_DEVICES="$3"
rm -rf ./train_parallel_r2
mkdir ./train_parallel_r2
cp ../*.py ./train_parallel_r2
cp *.sh ./train_parallel_r2
cp -r ../src ./train_parallel_r2
cd ./train_parallel_r2 || exit
echo "start training "
env > env.log
if [ $# == 3 ]
then
mpirun --allow-run-as-root -n 8 \
python train.py \
--is_distribute \
--data_file=$DATASET_PATH \
--ckpt_pre_trained=$PRETRAINED_PATH \
--base_lr=0.00004 \
--batch_size=8 \
--device_target='GPU' &> log &
fi
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -15,7 +15,7 @@
# ============================================================================
if [ $# -ne 4 ]
then
echo "Usage: sh run_eval_ascend.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]"
echo "Usage: bash scripts/run_eval_ascend.sh [DATA_LST] [PRETRAINED_PATH] [DEVICE_TARGET] [DEVICE_ID]"
exit 1
fi
......@@ -46,7 +46,8 @@ fi
ulimit -u unlimited
export DEVICE_NUM=1
export DEVICE_TARGET=$3
export DEVICE_ID=$4
export RANK_ID=0
export RANK_SIZE=1
LOCAL_DIR=eval$DEVICE_ID
......@@ -58,6 +59,6 @@ cp -r ../src $LOCAL_DIR
cd $LOCAL_DIR || exit
echo "start training for device $DEVICE_ID"
env > env.log
python eval.py --data_lst=$DATASET_PATH --ckpt_path=$PRETRAINED_PATH --device_target=$DEVICE_TARGET --device_id=$DEVICE_ID --flip &> log &
cd ..
......@@ -15,7 +15,7 @@
# ============================================================================
if [[ $# -lt 3 || $# -gt 4 ]]; then
echo "Usage: bash run_infer_310.sh [MINDIR_PATH] [DATA_ROOT] [DATA_LIST] [DEVICE_ID]
echo "Usage: bash scripts/run_infer_310.sh [MINDIR_PATH] [DATA_ROOT] [DATA_LIST] [DEVICE_ID]
DEVICE_ID is optional, it can be set by environment variable device_id, otherwise the value is zero"
exit 1
fi
......
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -15,7 +15,7 @@
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: sh run_standalone_train_ascend.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]"
echo "Usage: bash scripts/run_standalone_train_ascend_r1.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]"
exit 1
fi
......
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -15,7 +15,7 @@
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: sh run_standalone_train_ascend.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]"
echo "Usage: bash scripts/run_standalone_train_ascend_r2.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]"
exit 1
fi
......
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: bash scripts/run_standalone_train_gpu_r1.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
DATASET_PATH=$(get_real_path $1)
PRETRAINED_PATH=$(get_real_path $2)
echo $DATASET_PATH
echo $PRETRAINED_PATH
if [ ! -f $DATASET_PATH ]
then
echo "error: DATASET_PATH=$DATASET_PATH is not a file"
exit 1
fi
if [ ! -f $PRETRAINED_PATH ]
then
echo "error: PRETRAINED_PATH=$PRETRAINED_PATH is not a file"
exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=1
export DEVICE_ID=$3
export RANK_ID=0
export RANK_SIZE=1
LOCAL_DIR=train$DEVICE_ID
rm -rf $LOCAL_DIR
mkdir $LOCAL_DIR
cp ../*.py $LOCAL_DIR
cp *.sh $LOCAL_DIR
cp -r ../src $LOCAL_DIR
cd $LOCAL_DIR || exit
echo "start training for device $DEVICE_ID"
env > env.log
python train.py --data_file=$DATASET_PATH --ckpt_pre_trained=$PRETRAINED_PATH --device_id=$DEVICE_ID --base_lr=0.001 --batch_size=16 --device_target='GPU' &> log &
cd ..
#! /bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# -ne 3 ]
then
echo "Usage: bash scripts/run_standalone_train_gpu_r2.sh [DATASET_PATH] [PRETRAINED_PATH] [DEVICE_ID]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
DATASET_PATH=$(get_real_path $1)
PRETRAINED_PATH=$(get_real_path $2)
echo $DATASET_PATH
echo $PRETRAINED_PATH
if [ ! -f $DATASET_PATH ]
then
echo "error: DATASET_PATH=$DATASET_PATH is not a file"
exit 1
fi
if [ ! -f $PRETRAINED_PATH ]
then
echo "error: PRETRAINED_PATH=$PRETRAINED_PATH is not a file"
exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=1
export DEVICE_ID=$3
export RANK_ID=0
export RANK_SIZE=1
LOCAL_DIR=train$DEVICE_ID
rm -rf $LOCAL_DIR
mkdir $LOCAL_DIR
cp ../*.py $LOCAL_DIR
cp *.sh $LOCAL_DIR
cp -r ../src $LOCAL_DIR
cd $LOCAL_DIR || exit
echo "start training for device $DEVICE_ID"
env > env.log
python train.py --data_file=$DATASET_PATH --ckpt_pre_trained=$PRETRAINED_PATH --device_id=$DEVICE_ID --base_lr=0.0001 --batch_size=16 --device_target='GPU' &> log &
cd ..
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -117,7 +117,7 @@ class SegDataset:
    def get_dataset1(self):
        """get dataset"""
        ds.config.set_seed(1000)
        data_set = ds.MindDataset(self.data_file, columns_list=["data", "label"],
                                  shuffle=True, num_parallel_workers=self.num_readers,
                                  num_shards=self.shard_num, shard_id=self.shard_id)
        decode_op = C.Decode()
......
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -26,7 +26,7 @@ def parse_args():
    parser = argparse.ArgumentParser('mindrecord')
    parser.add_argument('--data_root', type=str, default='', help='root path of data')
    parser.add_argument('--data_lst', type=str, default='', help='list of data')
    parser.add_argument('--dst_path', type=str, default='', help='save path and the file name')
    parser.add_argument('--num_shards', type=int, default=8, help='number of shards')
    parser_args, _ = parser.parse_known_args()
    return parser_args
......
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -20,7 +20,7 @@ import scipy.io
from PIL import Image
parser = argparse.ArgumentParser('dataset list generator')
parser.add_argument("--data_dir", type=str, default='D:/datasets/', help='where dataset stored.')
parser.add_argument("--data_dir", type=str, default='~/data/', help='where dataset stored.')
args, _ = parser.parse_known_args()
......
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
......@@ -15,6 +15,7 @@
""" train Refinenet """
import argparse
import math
import os
from mindspore import Parameter, context
from mindspore.train.model import Model
import mindspore.nn as nn
......@@ -63,7 +64,7 @@ def parse_args():
    parser.add_argument('--ckpt_pre_trained', type=str, default='', help='PreTrained model')
    # train
    parser.add_argument('--device_target', type=str, default='Ascend', choices=['Ascend', 'GPU'],
                        help='device where the code will be implemented. (Default: Ascend)')
    parser.add_argument('--device_id', type=str, default='0', choices=['0', '1', '2', '3', '4', '5', '6', '7'],
                        help='which device will be implemented')
......@@ -83,19 +84,37 @@ def weights_init(net):
            cell.weight = Parameter(initializer(HeUniform(negative_slope=math.sqrt(5)), cell.weight.shape,
                                                cell.weight.dtype), name=cell.weight.name)


def get_device_id():
    device_id = os.getenv('DEVICE_ID', '0')
    return int(device_id)


def train():
    """train"""
    args = parse_args()
    if args.device_target == 'Ascend':
        context.set_context(mode=context.GRAPH_MODE, enable_auto_mixed_precision=True, save_graphs=False,
                            device_target=args.device_target, device_id=int(args.device_id))
        if args.is_distributed:
            init()
            args.rank = get_rank()
            args.group_size = get_group_size()
            parallel_mode = ParallelMode.DATA_PARALLEL
            context.set_auto_parallel_context(parallel_mode=parallel_mode, gradients_mean=True,
                                              device_num=args.group_size)
    elif args.device_target == 'GPU':
        if args.is_distributed:
            context.set_context(mode=context.GRAPH_MODE, enable_auto_mixed_precision=True, save_graphs=False,
                                device_target=args.device_target, device_id=get_device_id())
            init()
            args.rank = get_rank()
            args.group_size = get_group_size()
            parallel_mode = ParallelMode.DATA_PARALLEL
            context.set_auto_parallel_context(parallel_mode=parallel_mode, gradients_mean=True,
                                              device_num=args.group_size)
        else:
            context.set_context(mode=context.GRAPH_MODE, enable_auto_mixed_precision=True, save_graphs=False,
                                device_target=args.device_target, device_id=int(args.device_id))

    # dataset
    dataset = data_generator.SegDataset(image_mean=args.image_mean,
                                        image_std=args.image_std,
......@@ -141,7 +160,11 @@ def train():
    # loss scale
    manager_loss_scale = FixedLossScaleManager(args.loss_scale, drop_overflow_update=False)
    if args.device_target == "GPU":
        amp_level = "O2"
    elif args.device_target == "Ascend":
        amp_level = "O3"
    model = Model(network, loss_, optimizer=opt, amp_level=amp_level, loss_scale_manager=manager_loss_scale)

    # callback for saving ckpts
......