Commit 12ecbcc7 authored by gaozeyang
add gpu scripts to glore_res50 and merge res50 with res200

# Contents
<!-- TOC -->
- [Contents](#contents)
- [Glore_resnet Description](#glore_resnet-description)
    - [Overview](#overview)
    - [Paper](#paper)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Features](#features)
    - [Mixed Precision](#mixed-precision)
- [Environment Requirements](#environment-requirements)
- [Quick Start](#quick-start)
- [Script Description](#script-description)
    - [Script and Sample Code](#script-and-sample-code)
    - [Script Parameters](#script-parameters)
    - [Training Process](#training-process)
        - [Usage](#usage)
            - [Running on Ascend](#running-on-ascend)
            - [Running on GPU](#running-on-gpu)
    - [Training Results](#training-results)
    - [Inference Process](#inference-process)
        - [Usage](#usage-1)
            - [Running on Ascend](#running-on-ascend-1)
            - [Running on GPU](#running-on-gpu-1)
    - [Inference Results](#inference-results)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Training Performance](#training-performance)
            - [Glore_resnet50 on ImageNet2012](#glore_resnet50-on-imagenet2012)
            - [Glore_resnet200 on ImageNet2012](#glore_resnet200-on-imagenet2012)
        - [Inference Performance](#inference-performance)
            - [Glore_resnet50 on ImageNet2012](#glore_resnet50-on-imagenet2012-1)
            - [Glore_resnet200 on ImageNet2012](#glore_resnet200-on-imagenet2012-1)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
<!-- /TOC -->
# Glore_resnet Description
## Overview
Convolutional neural networks excel at extracting local relations, but they are inefficient at modeling relations between distant regions of an image and usually need many stacked layers to do so, even though global modeling and reasoning over regions benefits many computer vision tasks. To enable such global reasoning, Facebook AI Research, the National University of Singapore and the 360 AI Institute proposed a graph-based global reasoning module, the Global Reasoning Unit, which can be plugged into the network models of many tasks. glore_res200 is an image-classification network built by evenly inserting 2 and 3 Global Reasoning units into Stage 2 and Stage 3 of ResNet200, respectively.
The following is an example of training glore_res50 on the ImageNet2012 dataset with MindSpore. For glore_res50, refer to [the paper](https://arxiv.org/pdf/1811.12814v1.pdf).
## Paper
1. [Paper](https://arxiv.org/abs/1811.12814): Yunpeng Chen, Marcus Rohrbach, Zhicheng Yan, Shuicheng Yan, Jiashi Feng, Yannis Kalantidis. "Graph-Based Global Reasoning Networks"
# Model Architecture
The overall network architecture of glore_res is described in the paper:
[Link](https://arxiv.org/pdf/1811.12814v1.pdf)
The backbone of glore_res200 is ResNet200, with 2 and 3 Global Reasoning units evenly inserted into Stage 2 and Stage 3, respectively. The units are inserted in the same way in both stages.
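For intuition, the following is a minimal NumPy sketch of the computation a Global Reasoning unit performs: channel reduction, projection of spatial locations onto a small set of graph nodes, reasoning over those nodes, reverse projection, and a residual add. It is an illustration based on the paper, not the MindSpore implementation in `src/glore_resnet50.py` / `src/glore_resnet200.py`; all names and shapes here are assumptions.
```python
# Minimal NumPy sketch of a Global Reasoning (GloRe) unit, for illustration only.
# The actual MindSpore code lives in src/glore_resnet50.py / src/glore_resnet200.py.
import numpy as np

def glore_unit(x, w_reduce, w_proj, w_graph, w_node, w_expand):
    """x: feature map of shape (C, H, W); returns a tensor of the same shape.
    Steps (per the paper):
      1. reduce channels C -> C'            (1x1 conv, here a matrix multiply)
      2. project H*W locations onto N nodes (coordinate space -> interaction space)
      3. graph reasoning over the N nodes   (node-wise and channel-wise mixing)
      4. project nodes back to H*W, expand channels C' -> C, add residual
    """
    C, H, W = x.shape
    L = H * W
    x_flat = x.reshape(C, L)               # (C, L)
    reduced = w_reduce @ x_flat            # (C', L)  channel reduction
    b = w_proj @ x_flat                    # (N, L)   projection weights
    nodes = reduced @ b.T                  # (C', N)  features per graph node
    nodes = nodes @ w_graph                # (C', N)  mix information across nodes
    nodes = w_node @ nodes                 # (C', N)  mix information across channels
    back = nodes @ b                       # (C', L)  reverse projection
    out = w_expand @ back                  # (C, L)   channel expansion
    return x + out.reshape(C, H, W)        # residual connection

# toy usage with random weights
C, Cr, N, H, W = 16, 8, 4, 7, 7
rng = np.random.default_rng(0)
x = rng.standard_normal((C, H, W))
y = glore_unit(x,
               rng.standard_normal((Cr, C)),
               rng.standard_normal((N, C)),
               rng.standard_normal((N, N)),
               rng.standard_normal((Cr, Cr)),
               rng.standard_normal((C, Cr)))
print(y.shape)  # (16, 7, 7)
```
In the paper these weight matrices are learned 1x1 convolutions, and the unit is attached after selected residual blocks of Stage 2 and Stage 3.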
# Dataset
Dataset used: [ImageNet2012](http://www.image-net.org/)
- Dataset size: 1000 classes of 224*224 color images
    - Training set: 1,281,167 images
    - Test set: 50,000 images
- Data format: JPEG
    - Note: data is processed in dataset.py.
- Download the dataset. The directory structure is as follows:
```text
└─dataset
    ├─train                 # training dataset
    └─val                   # evaluation dataset
```
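With the directory laid out as above, the evaluation split can be loaded through the helper in `src/dataset.py`. A hedged usage sketch follows: the path is a placeholder, and it assumes MindSpore is installed and the repository's default config file is present, since `src/config.py` is parsed on import.
```python
# Sketch: load the evaluation split with this repository's dataset helper.
# "/path/to/dataset/val" is a placeholder; batch_size mirrors the resnet50 configs.
from src.dataset import create_eval_dataset

eval_ds = create_eval_dataset(dataset_path="/path/to/dataset/val",
                              repeat_num=1,
                              batch_size=128)
print("evaluation batches:", eval_ds.get_dataset_size())
```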
# Features
## Mixed Precision
The [mixed precision](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/enable_mixed_precision.html) training method uses both single-precision and half-precision data to speed up the training of deep neural networks while preserving the accuracy achievable with single-precision training. Mixed precision increases computing speed and reduces memory usage, and it also makes it possible to train larger models or use larger batch sizes on specific hardware.
Taking the FP16 operator as an example, if the input data type is FP32, the MindSpore backend automatically lowers the precision to process the data. Users can open the INFO log and search for "reduce precision" to view operators whose precision was reduced.
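The `loss_scale` and `cast_fp16` entries in the configuration files relate to this mechanism. Below is a minimal, self-contained sketch of how automatic mixed precision and a fixed loss scale are typically enabled with the MindSpore 1.x `Model` API. It uses a toy network rather than glore_res, and the flag values mirror the configs in this README rather than this repository's `train.py`.
```python
# Minimal mixed-precision sketch (MindSpore 1.x); the network is a toy stand-in.
import mindspore.nn as nn
from mindspore import context
from mindspore.train.model import Model
from mindspore.train.loss_scale_manager import FixedLossScaleManager

# choose the device you actually have: "Ascend", "GPU" or "CPU"
context.set_context(mode=context.GRAPH_MODE, device_target="GPU")

net = nn.Dense(32, 10)                       # stand-in for glore_resnet50/200
loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction="mean")
opt = nn.Momentum(net.trainable_params(), learning_rate=0.1, momentum=0.9,
                  weight_decay=1e-4, loss_scale=1024)

# amp_level="O2" runs most operators in FP16 while keeping BatchNorm in FP32;
# FixedLossScaleManager(1024) matches the loss_scale value used in the configs.
model = Model(net, loss_fn=loss, optimizer=opt,
              amp_level="O2",
              loss_scale_manager=FixedLossScaleManager(1024, drop_overflow_update=False),
              metrics={"acc"})
```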
# Environment Requirements
- Hardware (Ascend/GPU)
    - Prepare an Ascend or GPU processor to set up the hardware environment.
- Framework
    - [MindSpore](https://www.mindspore.cn/install)
- For more information, please check the resources below:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorials/zh-CN/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/docs/api/zh-CN/master/index.html)
# Quick Start
After installing MindSpore via the official website, you can start training and evaluation as follows:
- Running on Ascend
```bash
# distributed training
Usage: bash run_distribute_train.sh [DATASET_PATH] [RANK_TABLE] [CONFIG_PATH]
# standalone training
Usage: bash run_standalone_train.sh [DATASET_PATH] [DEVICE_ID] [CONFIG_PATH]
# run the evaluation example
Usage: bash run_eval.sh [DATASET_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```
- Running on GPU
```bash
# distributed training
Usage: bash run_distribute_train_gpu.sh [DATASET_PATH] [RANK_SIZE] [CONFIG_PATH]
# standalone training
Usage: bash run_standalone_train_gpu.sh [DATASET_PATH] [CONFIG_PATH]
# run the evaluation example
Usage: bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
```
For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
Please follow the instructions in the link below:
<https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools>
# Script Description
## Script and Sample Code
```shell
.
└──Glore_resnet
  ├── README.md
  ├── script
    ├── run_distribute_train.sh        # launch Ascend distributed training (8 devices)
    ├── run_distribute_train_gpu.sh    # launch GPU distributed training (8 devices)
    ├── run_eval.sh                    # launch Ascend/GPU evaluation (single device)
    └── run_standalone_train_gpu.sh    # launch Ascend/GPU standalone training (single device)
  ├── src
    ├── __init__.py
    ├── config.py                      # parameter configuration
    ├── dataset.py                     # dataset loading
    ├── autoaugment.py                 # AutoAugment operators and policy
    ├── lr_generator.py                # learning-rate schedule
    ├── loss.py                        # loss definition for the ImageNet2012 dataset
    ├── save_callback.py               # evaluate during training and save the checkpoint with the best accuracy
    ├── glore_resnet200.py             # glore_resnet200 network
    ├── glore_resnet50.py              # glore_resnet50 network
    ├── transform.py                   # data augmentation
    └── transform_utils.py             # data augmentation utilities
  ├── eval.py                          # evaluation script
  ├── export.py                        # checkpoint export script
  └── train.py                         # training script
```
## Script Parameters
- Parameters of Glore_resnet50 on the ImageNet2012 dataset (Ascend).
```text
"class_num":1000,                # number of dataset classes
"batch_size":128,                # batch size of the input tensor
"loss_scale":1024,               # loss scale
"momentum":0.9,                  # momentum of the optimizer
"weight_decay":1e-4,             # weight decay
"epoch_size":120,                # applies to training only; fixed to 1 for inference
"pretrained": False,             # whether to load pretrained weights
"pretrain_epoch_size": 0,        # number of epochs already trained before loading the pretrained checkpoint; the actual number of training epochs equals epoch_size minus pretrain_epoch_size
"save_checkpoint":True,          # whether to save checkpoints
"save_checkpoint_epochs":5,      # epoch interval between two checkpoints; by default the last checkpoint is saved after the final epoch
"keep_checkpoint_max":10,        # keep only the last keep_checkpoint_max checkpoints
"save_checkpoint_path":"./",     # checkpoint save path, relative to the execution path
"warmup_epochs":0,               # number of warm-up epochs
"lr_decay_mode":"Linear",        # decay mode used to generate the learning rate
"use_label_smooth":True,         # label smoothing
"label_smooth_factor":0.05,      # label smoothing factor
"weight_init": "xavier_uniform", # weight initialization, one of "he_normal", "he_uniform", "xavier_uniform"
"use_autoaugment": True,         # whether to apply AutoAugment
"lr_init":0,                     # initial learning rate
"lr_max":0.8,                    # maximum learning rate
"lr_end":0.0,                    # final (minimum) learning rate
```
- Parameters of Glore_resnet50 on the ImageNet2012 dataset (GPU).
```text
"class_num":1000,                # number of dataset classes
"batch_size":128,                # batch size of the input tensor
"loss_scale":1024,               # loss scale
"momentum":0.9,                  # momentum of the optimizer
"weight_decay":1e-4,             # weight decay
"epoch_size":130,                # applies to training only; fixed to 1 for inference
"pretrained": False,             # whether to load pretrained weights
"pretrain_epoch_size": 0,        # number of epochs already trained before loading the pretrained checkpoint; the actual number of training epochs equals epoch_size minus pretrain_epoch_size
"save_checkpoint":True,          # whether to save checkpoints
"save_checkpoint_epochs":5,      # epoch interval between two checkpoints; by default the last checkpoint is saved after the final epoch
"keep_checkpoint_max":10,        # keep only the last keep_checkpoint_max checkpoints
"save_checkpoint_path":"./",     # checkpoint save path, relative to the execution path
"warmup_epochs":0,               # number of warm-up epochs
"lr_decay_mode":"Linear",        # decay mode used to generate the learning rate
"use_label_smooth":True,         # label smoothing
"label_smooth_factor":0.05,      # label smoothing factor
"weight_init": "xavier_uniform", # weight initialization, one of "he_normal", "he_uniform", "xavier_uniform"
"use_autoaugment": True,         # whether to apply AutoAugment
"lr_init":0,                     # initial learning rate
"lr_max":0.8,                    # maximum learning rate
"lr_end":0.0,                    # final (minimum) learning rate
```
- Parameters of Glore_resnet200 on the ImageNet2012 dataset (Ascend).
```text
"class_num":1000,                # number of dataset classes
"batch_size":80,                 # batch size of the input tensor
"loss_scale":1024,               # loss scale
"momentum":0.08,                 # momentum of the optimizer
"weight_decay":0.0002,           # weight decay
"epoch_size":150,                # applies to training only; fixed to 1 for inference
"pretrain_epoch_size":0,         # number of epochs already trained before loading the pretrained checkpoint; the actual number of training epochs equals epoch_size minus pretrain_epoch_size
"save_checkpoint":True,          # whether to save checkpoints
"save_checkpoint_epochs":5,      # epoch interval between two checkpoints; by default the last checkpoint is saved after the final epoch
"keep_checkpoint_max":10,        # keep only the last keep_checkpoint_max checkpoints
"save_checkpoint_path":"./",     # checkpoint save path, relative to the execution path
"warmup_epochs":0,               # number of warm-up epochs
"lr_decay_mode":"poly",          # decay mode used to generate the learning rate
"lr_init":0.1,                   # initial learning rate
"lr_max":0.4,                    # maximum learning rate
"lr_end":0.0,                    # final (minimum) learning rate
```
- Parameters of Glore_resnet200 on the ImageNet2012 dataset (GPU).
```text
"class_num":1000,                # number of dataset classes
"batch_size":64,                 # batch size of the input tensor
"loss_scale":1024,               # loss scale
"momentum":0.08,                 # momentum of the optimizer
"weight_decay":0.0002,           # weight decay
"epoch_size":150,                # applies to training only; fixed to 1 for inference
"pretrain_epoch_size":0,         # number of epochs already trained before loading the pretrained checkpoint; the actual number of training epochs equals epoch_size minus pretrain_epoch_size
"save_checkpoint":True,          # whether to save checkpoints
"save_checkpoint_epochs":5,      # epoch interval between two checkpoints; by default the last checkpoint is saved after the final epoch
"keep_checkpoint_max":10,        # keep only the last keep_checkpoint_max checkpoints
"save_checkpoint_path":"./",     # checkpoint save path, relative to the execution path
"warmup_epochs":0,               # number of warm-up epochs
"lr_decay_mode":"poly",          # decay mode used to generate the learning rate
"lr_init":0.1,                   # initial learning rate
"lr_max":0.4,                    # maximum learning rate
"lr_end":0.0,                    # final (minimum) learning rate
```
For more configuration details, see the script `config.py`.
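The `warmup_epochs`, `lr_init`, `lr_max`, `lr_end` and `lr_decay_mode` fields above together describe a per-step learning-rate schedule. The following framework-free sketch shows what a linear-warmup plus poly/linear-decay generator of this kind computes; it is an approximation for illustration, not the code in `src/lr_generator.py` (for example, the `power` exponent is an assumed value).
```python
# Approximate per-step LR schedule implied by the config parameters above.
# Illustrative re-derivation, not the code in src/lr_generator.py.
import numpy as np

def generate_lr(lr_init, lr_max, lr_end, warmup_epochs, total_epochs,
                steps_per_epoch, decay_mode="poly", power=2.0):
    total_steps = total_epochs * steps_per_epoch
    warmup_steps = warmup_epochs * steps_per_epoch
    lr = np.zeros(total_steps, dtype=np.float32)
    for step in range(total_steps):
        if warmup_steps and step < warmup_steps:
            # linear warmup from lr_init up to lr_max
            lr[step] = lr_init + (lr_max - lr_init) * step / warmup_steps
        else:
            progress = (step - warmup_steps) / max(total_steps - warmup_steps, 1)
            if decay_mode == "poly":
                lr[step] = (lr_max - lr_end) * (1.0 - progress) ** power + lr_end
            else:  # "Linear"
                lr[step] = lr_max - (lr_max - lr_end) * progress
    return lr

# e.g. the glore_resnet50 Ascend settings: 120 epochs, 1251 steps per epoch
schedule = generate_lr(lr_init=0, lr_max=0.8, lr_end=0.0,
                       warmup_epochs=0, total_epochs=120,
                       steps_per_epoch=1251, decay_mode="Linear")
print(schedule[0], schedule[-1])
```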
## Training Process
### Usage
#### Running on Ascend
```text
# distributed training
Usage: bash run_distribute_train.sh [DATASET_PATH] [RANK_TABLE] [CONFIG_PATH]
# standalone training
Usage: bash run_standalone_train.sh [DATASET_PATH] [DEVICE_ID] [CONFIG_PATH]
# run the evaluation example
Usage: bash run_eval.sh [DATASET_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```
For distributed training, an HCCL configuration file in JSON format needs to be created in advance.
Please follow the instructions in [hccl_tools](https://gitee.com/mindspore/mindspore/tree/master/model_zoo/utils/hccl_tools).
Training results are saved in the example path, in folders whose names start with "train" or "train_parallel". You can find the checkpoint files and results in the logs under this path, as shown below.
#### Running on GPU
```text
# distributed training
Usage: bash run_distribute_train_gpu.sh [DATASET_PATH] [RANK_SIZE] [CONFIG_PATH]
# standalone training
Usage: bash run_standalone_train_gpu.sh [DATASET_PATH] [CONFIG_PATH]
# run the evaluation example
Usage: bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
```
## Training Results
- Training Glore_resnet50 with the ImageNet2012 dataset (8 devices)
```text
# distributed training results (8P)
epoch:1 step:1251, loss is 5.074506
epoch:2 step:1251, loss is 4.339285
epoch:3 step:1251, loss is 3.9819345
epoch:4 step:1251, loss is 3.5608528
epoch:5 step:1251, loss is 3.3024906
...
```
- Training Glore_resnet200 with the ImageNet2012 dataset (8 devices)
```text
# distributed training results (8P)
epoch:1 step:1251, loss is 6.0563216
epoch:2 step:1251, loss is 5.3812423
epoch:3 step:1251, loss is 4.782114
epoch:4 step:1251, loss is 4.4079633
epoch:5 step:1251, loss is 4.080069
...
```
## Inference Process
### Usage
#### Running on Ascend
```bash
# inference
Usage: bash run_eval.sh [DATASET_PATH] [DEVICE_ID] [CHECKPOINT_PATH] [CONFIG_PATH]
```
```bash
# inference example
bash run_eval.sh ~/Imagenet_Original/ 0 ~/glore_resnet200-150_1251.ckpt ../config/config_resnet50_gpu.yaml
```
#### Running on GPU
```bash
# inference
Usage: bash run_eval_gpu.sh [DATASET_PATH] [CHECKPOINT_PATH] [CONFIG_PATH]
```
```bash
# inference example
bash run_eval_gpu.sh ~/Imagenet ~/glore_resnet200-150_2502.ckpt ../config/config_resnet50_gpu.yaml
```
## Inference Results
```text
result:{'top_1 acc':0.802303685897436}
```
# Model Description
## Performance
### Training Performance
#### Glore_resnet50 on ImageNet2012
| Parameters | Ascend 910 | GPU |
| -------------------------- | -------------------------------------- |------------------------------------|
| Model Version | Glore_resnet50 | Glore_resnet50 |
| Resource | Ascend 910; CPU 2.60 GHz, 192 cores; memory 2048 GB | GPU V100-PCIE 32 GB |
| Uploaded Date | 2021-03-21 | 2021-09-22 |
| MindSpore Version | r1.1 | 1.3.0 |
| Dataset | ImageNet2012 | ImageNet2012 |
| Training Parameters | epoch=120, steps per epoch=1251, batch_size=128 | epoch=130, steps per epoch=1251, batch_size=128 |
| Optimizer | Momentum | Momentum |
| Loss Function | SoftmaxCrossEntropyExpand | SoftmaxCrossEntropyExpand |
| Outputs | probability | probability |
| Loss | 1.8464266 | 1.7463021 |
| Speed | 263.483 ms/step (8 devices) | 655 ms/step (8 devices) |
| Total Time | 10.98 hours | 58.5 hours |
| Parameters (M) | 30.5 | 30.5 |
| Checkpoint for fine-tuning | 233.46M (.ckpt file) | 233.46M (.ckpt file) |
| Scripts | [Link](https://gitee.com/mindspore/models/tree/master/research/cv/glore_res) |
#### Glore_resnet200 on ImageNet2012
| Parameters | Ascend 910 | GPU |
| -------------------------- | -------------------------------------- |------------------------------------|
| Model Version | Glore_resnet200 | Glore_resnet200 |
| Resource | Ascend 910; CPU 2.60 GHz, 192 cores; memory 2048 GB | GPU V100-SXM2 |
| Uploaded Date | 2021-03-24 | 2021-05-25 |
| MindSpore Version | 1.3.0 | 1.2.0 |
| Dataset | ImageNet2012 | ImageNet2012 |
| Training Parameters | epoch=150, steps per epoch=2001, batch_size=80 | epoch=150, steps per epoch=2502, batch_size=64 |
| Optimizer | NAG | NAG |
| Loss Function | SoftmaxCrossEntropyExpand | SoftmaxCrossEntropyExpand |
| Outputs | probability | probability |
| Loss | 0.8068262 | 0.55614954 |
| Speed | 400.343 ms/step (8 devices) | 912.211 ms/step (8 devices) |
| Total Time | 33 h 35 min | 94 h 08 min |
| Parameters (M) | 70.6 | 70.6 |
| Checkpoint for fine-tuning | 807.57M (.ckpt file) | 808.28M (.ckpt file) |
| Scripts | [Link](https://gitee.com/mindspore/models/tree/master/research/cv/glore_res) |
### Inference Performance
#### Glore_resnet50 on ImageNet2012
| Parameters | Ascend | GPU |
| ------------------- | ----------------------|------------------------------|
| Model Version | Glore_resnet50 | Glore_resnet50 |
| Resource | Ascend 910 | GPU V100-PCIE 32 GB |
| Uploaded Date | 2021-03-21 | 2021-09-22 |
| MindSpore Version | r1.1 | 1.3.0 |
| Dataset | ImageNet2012 test set (6.4 GB) | ImageNet2012 test set (6.4 GB) |
| batch_size | 128 | 128 |
| Outputs | probability | probability |
| Accuracy | 8 devices: 78.44% | 8 devices: 78.50% |
#### Glore_resnet200 on ImageNet2012
| Parameters | Ascend | GPU |
| ------------------- | ----------------------|------------------------------|
| Model Version | Glore_resnet200 | Glore_resnet200 |
| Resource | Ascend 910 | GPU V100-SXM2 |
| Uploaded Date | 2021-03-24 | 2021-05-25 |
| MindSpore Version | 1.3.0 | 1.2.0 |
| Dataset | ImageNet2012 test set (6.4 GB) | ImageNet2012 test set (6.4 GB) |
| batch_size | 80 | 64 |
| Outputs | probability | probability |
| Accuracy | 8 devices: 80.23% | 8 devices: 80.603% |
# Description of Random Situation
transform_utils.py uses a random selection policy for data augmentation, and train.py sets a random seed.
# ModelZoo Homepage
Please check the official [homepage](https://gitee.com/mindspore/mindspore/tree/master/model_zoo).
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
isModelArts: false
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
run_distribute: true
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path/"
device_target: "Ascend"
checkpoint_path: "./checkpoint/"
# ==============================================================================
# Training options
batch_size: 80
class_num: 1000
epoch_size: 150
keep_checkpoint_max: 10
loss_scale: 1024
lr_decay_mode: poly
lr_end: 0
lr_init: 0.1
lr_max: 0.4
momentum: 0.08
pretrain_epoch_size: 0
use_glore: true
save_checkpoint: true
save_checkpoint_epochs: 5
save_checkpoint_path: ./
use_label_smooth: false
warmup_epochs: 0
weight_decay: 0.0002
net: "resnet101"
cast_fp16: true
device_target: "Ascend"
device_id: 0
device_num: 8
data_url: ""
pretrained_ckpt: ""
parameter_server: ""
# Export options
device_id: 0
file_name: "resnet200"
file_format: "MINDIR"
ckpt_url: ""
# Image options
image_size: 224
---
# Help description for each configuration
enable_modelarts: "Whether training on modelarts, default: False"
data_url: "Dataset url for obs"
checkpoint_url: "The location of checkpoint for obs"
data_path: "Dataset path for local"
output_path: "Training output path for local"
load_path: "The location of checkpoint for obs"
device_target: "Target device type, available: [Ascend, GPU, CPU]"
enable_profiling: "Whether enable profiling while training, default: False"
num_classes: "Class for dataset"
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
isModelArts: false
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
run_distribute: true
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path/"
device_target: "GPU"
checkpoint_path: "./checkpoint/"
# ==============================================================================
# Training options
batch_size: 32
class_num: 1000
epoch_size: 150
keep_checkpoint_max: 10
loss_scale: 1024
lr_decay_mode: poly
lr_end: 0
lr_init: 0.1
lr_max: 0.4
momentum: 0.08
pretrain_epoch_size: 0
use_glore: true
use_label_smooth: false
save_checkpoint: true
save_checkpoint_epochs: 5
save_checkpoint_path: ./
warmup_epochs: 0
weight_decay: 0.0002
net: "resnet101"
cast_fp16: true
device_target: "GPU"
device_id: 0
device_num: 8
data_url: ""
pretrained_ckpt: ""
parameter_server: ""
# Export options
device_id: 0
file_name: "resnet200"
file_format: "MINDIR"
ckpt_url: ""
# Image options
image_size: 224
---
# Help description for each configuration
isModelArts: "Whether training on modelarts, default: False"
data_url: "Dataset url for obs"
checkpoint_url: "The location of checkpoint for obs"
data_path: "Dataset path for local"
output_path: "Training output path for local"
load_path: "The location of checkpoint for obs"
device_target: "Target device type, available: [Ascend, GPU, CPU]"
enable_profiling: "Whether enable profiling while training, default: False"
num_classes: "Class for dataset"
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
isModelArts: false
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
run_distribute: true
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path/"
device_target: "Ascend"
checkpoint_path: "./checkpoint/"
# ==============================================================================
# Training options
batch_size: 80
class_num: 1000
epoch_size: 150
keep_checkpoint_max: 10
loss_scale: 1024
lr_decay_mode: poly
lr_end: 0
lr_init: 0.1
lr_max: 0.4
momentum: 0.08
pretrain_epoch_size: 0
use_glore: true
save_checkpoint: true
save_checkpoint_epochs: 5
save_checkpoint_path: ./
use_label_smooth: false
warmup_epochs: 0
weight_decay: 0.0002
net: "resnet200"
cast_fp16: true
device_target: "Ascend"
device_id: 0
device_num: 8
data_url: ""
pretrained_ckpt: ""
parameter_server: ""
# Export options
device_id: 0
file_name: "resnet200"
file_format: "MINDIR"
ckpt_url: ""
# Image options
image_size: 224
---
# Help description for each configuration
enable_modelarts: "Whether training on modelarts, default: False"
data_url: "Dataset url for obs"
checkpoint_url: "The location of checkpoint for obs"
data_path: "Dataset path for local"
output_path: "Training output path for local"
load_path: "The location of checkpoint for obs"
device_target: "Target device type, available: [Ascend, GPU, CPU]"
enable_profiling: "Whether enable profiling while training, default: False"
num_classes: "Class for dataset"
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
isModelArts: false
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
run_distribute: true
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path/"
device_target: "GPU"
checkpoint_path: "./checkpoint/"
# ==============================================================================
# Training options
batch_size: 32
class_num: 1000
epoch_size: 150
keep_checkpoint_max: 10
loss_scale: 1024
lr_decay_mode: poly
lr_end: 0
lr_init: 0.1
lr_max: 0.4
momentum: 0.08
pretrain_epoch_size: 0
use_glore: true
use_label_smooth: false
save_checkpoint: true
save_checkpoint_epochs: 5
save_checkpoint_path: ./
warmup_epochs: 0
weight_decay: 0.0002
net: "resnet200"
cast_fp16: true
device_target: "GPU"
device_id: 0
device_num: 8
data_url: ""
pretrained_ckpt: ""
parameter_server: ""
# Export options
device_id: 0
file_name: "resnet200"
file_format: "MINDIR"
ckpt_url: ""
# Image options
image_size: 224
---
# Help description for each configuration
isModelArts: "Whether training on modelarts, default: False"
data_url: "Dataset url for obs"
checkpoint_url: "The location of checkpoint for obs"
data_path: "Dataset path for local"
output_path: "Training output path for local"
load_path: "The location of checkpoint for obs"
device_target: "Target device type, available: [Ascend, GPU, CPU]"
enable_profiling: "Whether enable profiling while training, default: False"
num_classes: "Class for dataset"
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
isModelArts: false
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
run_distribute: true
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path/"
device_target: "Ascend"
checkpoint_path: "./checkpoint/"
# ==============================================================================
# Training options
batch_size: 128
class_num: 1000
epoch_size: 120
keep_checkpoint_max: 5
label_smooth_factor: 0.1
loss_scale: 1024
lr_decay_mode: poly
lr_end: 0.0
lr_init: 0
lr_max: 0.6
momentum: 0.9
pretrain_epoch_size: 0
pretrained: false
save_checkpoint: true
save_checkpoint_epochs: 5
save_checkpoint_path: ./
use_glore: true
use_autoaugment: true
use_label_smooth: true
warmup_epochs: 5
weight_decay: 0.0001
weight_init: xavier_uniform
net: "resnet50"
cast_fp16: false
device_target: "Ascend"
device_id: 0
device_num: 8
data_url: ""
pretrained_ckpt: ""
parameter_server: ""
# Export options
device_id: 0
file_name: "resnet50"
file_format: "MINDIR"
ckpt_url: ""
# Image options
image_size: 224
---
# Help description for each configuration
enable_modelarts: "Whether training on modelarts, default: False"
data_url: "Dataset url for obs"
checkpoint_url: "The location of checkpoint for obs"
data_path: "Dataset path for local"
output_path: "Training output path for local"
load_path: "The location of checkpoint for obs"
device_target: "Target device type, available: [Ascend, GPU, CPU]"
enable_profiling: "Whether enable profiling while training, default: False"
num_classes: "Class for dataset"
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
# Builtin Configurations(DO NOT CHANGE THESE CONFIGURATIONS unless you know exactly what you are doing)
isModelArts: false
# Url for modelarts
data_url: ""
train_url: ""
checkpoint_url: ""
# Path for local
run_distribute: true
enable_profiling: False
data_path: "/cache/data"
output_path: "/cache/train"
load_path: "/cache/checkpoint_path/"
device_target: "GPU"
checkpoint_path: "./checkpoint/"
# ==============================================================================
# Training options
batch_size: 128
class_num: 1000
epoch_size: 130
keep_checkpoint_max: 5
label_smooth_factor: 0.1
loss_scale: 1024
lr_decay_mode: poly
lr_end: 0.0
lr_init: 0
lr_max: 0.6
momentum: 0.9
pretrain_epoch_size: 0
pretrained: false
save_checkpoint: true
save_checkpoint_epochs: 5
save_checkpoint_path: ./
use_glore: true
use_autoaugment: true
use_label_smooth: true
warmup_epochs: 5
weight_decay: 0.0001
weight_init: xavier_uniform
net: "resnet50"
cast_fp16: false
device_target: "GPU"
device_id: 0
device_num: 8
data_url: ""
pretrained_ckpt: ""
parameter_server: ""
# Export options
device_id: 0
file_name: "resnet50"
file_format: "MINDIR"
ckpt_url: ""
# Image options
image_size: 224
---
# Help description for each configuration
enable_modelarts: "Whether training on modelarts, default: False"
data_url: "Dataset url for obs"
checkpoint_url: "The location of checkpoint for obs"
data_path: "Dataset path for local"
output_path: "Training output path for local"
load_path: "The location of checkpoint for obs"
device_target: "Target device type, available: [Ascend, GPU, CPU]"
enable_profiling: "Whether enable profiling while training, default: False"
num_classes: "Class for dataset"
batch_size: "Batch size for training and evaluation"
epoch_size: "Total training epochs."
checkpoint_path: "The location of the checkpoint file."
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
################################eval glore_resnet series################################
python eval.py
"""
import os
import random
import numpy as np
from mindspore import context
from mindspore import dataset as de
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from src.glore_resnet import glore_resnet200, glore_resnet50
from src.dataset import create_eval_dataset
from src.dataset import create_dataset_ImageNet as ImageNet
from src.loss import CrossEntropySmooth, SoftmaxCrossEntropyExpand
from src.config import config
if config.isModelArts:
import moxing as mox
if config.net == 'resnet200':
if config.device_target == "GPU":
config.cast_fp16 = False
random.seed(1)
np.random.seed(1)
de.config.set_seed(1)
if __name__ == '__main__':
target = config.device_target
# init context
device_id = config.device_id
context.set_context(mode=context.GRAPH_MODE, device_target=target, save_graphs=False,
device_id=device_id)
# dataset
eval_dataset_path = os.path.join(config.data_url, 'val')
if config.isModelArts:
mox.file.copy_parallel(src_url=config.data_url, dst_url='/cache/dataset')
eval_dataset_path = '/cache/dataset/'
if config.net == 'resnet50':
predict_data = create_eval_dataset(dataset_path=eval_dataset_path, repeat_num=1, batch_size=config.batch_size)
elif config.net == 'resnet200':
predict_data = ImageNet(dataset_path=eval_dataset_path,
do_train=False,
repeat_num=1,
batch_size=config.batch_size,
target=target)
step_size = predict_data.get_dataset_size()
if step_size == 0:
raise ValueError("Please check dataset size > 0 and batch_size <= dataset size")
# define net
if config.net == 'resnet50':
net = glore_resnet50(class_num=config.class_num, use_glore=config.use_glore)
elif config.net == 'resnet200':
net = glore_resnet200(class_num=config.class_num, use_glore=config.use_glore)
# load checkpoint
param_dict = load_checkpoint(config.ckpt_url)
load_param_into_net(net, param_dict)
# define loss, model
if config.net == 'resnet50':
if config.use_label_smooth:
loss = CrossEntropySmooth(sparse=True, reduction="mean", smooth_factor=config.label_smooth_factor,
num_classes=config.class_num)
else:
loss = SoftmaxCrossEntropyExpand(sparse=True)
model = Model(net, loss_fn=loss, metrics={'top_1_accuracy', 'top_5_accuracy'})
print("============== Starting Testing ==============")
print("ckpt path : {}".format(config.ckpt_url))
print("data path : {}".format(eval_dataset_path))
acc = model.eval(predict_data)
print("==============Acc: {} ==============".format(acc))
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
##############export checkpoint file into air, onnx, mindir models#################
python export.py
"""
import numpy as np
import mindspore.common.dtype as mstype
from mindspore import Tensor, load_checkpoint, load_param_into_net, export, context
from src.config import config
context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target)
if config.device_target == "Ascend":
context.set_context(device_id=config.device_id)
if __name__ == '__main__':
if config.net == 'resnet50':
from src.glore_resnet import glore_resnet50
net = glore_resnet50(class_num=config.class_num)
elif config.net == 'resnet200':
from src.glore_resnet import glore_resnet200
net = glore_resnet200(class_num=config.class_num)
assert config.ckpt_url is not None, "config.ckpt_url is None."
param_dict = load_checkpoint(config.ckpt_url)
load_param_into_net(net, param_dict)
input_arr = Tensor(np.ones([config.batch_size, 3, 224, 224]), mstype.float32)
export(net, input_arr, file_name=config.file_name, file_format=config.file_format)
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run_distribute_train.sh DATA_PATH RANK_TABLE CONFIG_PATH"
echo "For example: bash run_distribute_train.sh /path/dataset /path/rank_table ../config/config_resnet50_gpu.yaml"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
set -e
if [ $# != 3 ]
then
echo "Usage: bash run_distribute_train.sh [DATASET_PATH] [RANK_TABLE] [CONFIG_PATH]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
DATA_PATH=$(get_real_path $1)
export DATA_PATH=${DATA_PATH}
RANK_TABLE=$(get_real_path $2)
CONFIG_PATH=$(get_real_path $3)
export RANK_TABLE_FILE=${RANK_TABLE}
export RANK_SIZE=8
echo "$EXEC_PATH"
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
for((i=0;i<8;i++))
do
rm -rf device$i
mkdir device$i
cd ./device$i
mkdir src
cd ../
cp ../*.py ./device$i
cp ../src/*.py ./device$i/src
cd ./device$i
export DEVICE_ID=$i
export RANK_ID=$i
echo "start training for device $i"
env > env$i.log
python3 train.py --data_url $DATA_PATH --isModelArts False --run_distribute True --config_path=$CONFIG_PATH > train$i.log 2>&1 &
if [ $? -eq 0 ];then
echo "start training for device$i"
else
echo "training device$i failed"
exit 2
fi
echo "$i finish"
cd ../
done
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run_distribute_train_gpu.sh DATA_PATH RANK_SIZE CONFIG_PATH"
echo "For example: bash run_distribute_train.sh /path/dataset 8 ../config/config_resnet50_gpu.yaml"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
if [ $# != 3 ]
then
echo "Usage: bash run_distribute_train_gpu.sh [DATASET_PATH] [RANK_SIZE] [CONFIG_PATH]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
set -e
DEVICE_NUM=$2
DATA_PATH=$(get_real_path $1)
CONFIG_PATH=$(get_real_path $3)
export DATA_PATH=${DATA_PATH}
export DEVICE_NUM=$2
export RANK_SIZE=$2
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
cd ../
rm -rf ./train_parallel
mkdir ./train_parallel
cd ./train_parallel
mkdir src
cd ../
cp *.py ./train_parallel
cp src/*.py ./train_parallel/src
cd ./train_parallel
env > env.log
echo "start training"
mpirun -n $2 --allow-run-as-root \
python3 train.py --data_url=$DATA_PATH --isModelArts=False --run_distribute=True \
--device_target="GPU" --config_path=$CONFIG_PATH --device_num $2 > train.log 2>&1 &
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run_eval.sh DATA_PATH DEVICE_ID CKPT_PATH CONFIG_PATH"
echo "For example: bash run_eval.sh /path/dataset 0 /path/ckpt ../config/config_resnet50_ascend.yaml"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
if [ $# != 4 ]
then
echo "Usage: bash run_eval.sh [DATASET_PATH] [DEVICE_ID] [CKPT_PATH] [CONFIG_PATH]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
set -e
DATA_PATH=$(get_real_path $1)
DEVICE_ID=$2
export DATA_PATH=${DATA_PATH}
CKPT_PATH=$(get_real_path $3)
CONFIG_PATH=$(get_real_path $4)
EXEC_PATH=$(pwd)
echo "$EXEC_PATH"
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
cd ../
export DEVICE_ID=$2
export RANK_ID=0
env > env0.log
python3 eval.py --data_url $DATA_PATH --isModelArts False --device_id $2 --ckpt_url $CKPT_PATH --config_path=$CONFIG_PATH > eval.log 2>&1
if [ $? -eq 0 ];then
echo "testing success"
else
echo "testing failed"
exit 2
fi
echo "finish"
cd ../
#!/bin/bash
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash run_standalone_train.sh DATA_PATH DEVICE_ID CONFIG_PATH"
echo "For example: bash run_standalone_train.sh /path/dataset 0 ../config/config_resnet50_ascend.yaml"
echo "It is better to use the absolute path."
echo "=============================================================================================================="
if [ $# != 3 ]
then
echo "Usage: bash run_standalone_train.sh [DATASET_PATH] [DEVICE_ID] [CONFIG_PATH]"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
set -e
DATA_PATH=$(get_real_path $1)
DEVICE_ID=$2
export DATA_PATH=${DATA_PATH}
CONFIG_PATH=$(get_real_path $3)
EXEC_PATH=$(pwd)
echo "$EXEC_PATH"
export PROTOCOL_BUFFERS_PYTHON_IMPLEMENTATION=python
cd ../
export DEVICE_ID=$2
export RANK_ID=$2
env > env0.log
python3 train.py --data_url $DATA_PATH --isModelArts False --run_distribute False --device_id $2 --config_path $CONFIG_PATH > train.log 2>&1
if [ $? -eq 0 ];then
echo "training success"
else
echo "training failed"
exit 2
fi
echo "finish"
cd ../
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""define autoaugment"""
import os
import mindspore.dataset.engine as de
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.dataset.vision.c_transforms as c_vision
from mindspore import dtype as mstype
from mindspore.communication.management import init, get_rank, get_group_size
# define Auto Augmentation operators
PARAMETER_MAX = 10
def float_parameter(level, maxval):
return float(level) * maxval / PARAMETER_MAX
def int_parameter(level, maxval):
return int(level * maxval / PARAMETER_MAX)
def shear_x(level):
v = float_parameter(level, 0.3)
return c_transforms.RandomChoice(
[c_vision.RandomAffine(degrees=0, shear=(-v, -v)), c_vision.RandomAffine(degrees=0, shear=(v, v))])
def shear_y(level):
v = float_parameter(level, 0.3)
return c_transforms.RandomChoice(
[c_vision.RandomAffine(degrees=0, shear=(0, 0, -v, -v)), c_vision.RandomAffine(degrees=0, shear=(0, 0, v, v))])
def translate_x(level):
v = float_parameter(level, 150 / 331)
return c_transforms.RandomChoice(
[c_vision.RandomAffine(degrees=0, translate=(-v, -v)), c_vision.RandomAffine(degrees=0, translate=(v, v))])
def translate_y(level):
v = float_parameter(level, 150 / 331)
return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, translate=(0, 0, -v, -v)),
c_vision.RandomAffine(degrees=0, translate=(0, 0, v, v))])
def color_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomColor(degrees=(v, v))
def rotate_impl(level):
v = int_parameter(level, 30)
return c_transforms.RandomChoice(
[c_vision.RandomRotation(degrees=(-v, -v)), c_vision.RandomRotation(degrees=(v, v))])
def solarize_impl(level):
level = int_parameter(level, 256)
v = 256 - level
return c_vision.RandomSolarize(threshold=(0, v))
def posterize_impl(level):
level = int_parameter(level, 4)
v = 4 - level
return c_vision.RandomPosterize(bits=(v, v))
def contrast_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomColorAdjust(contrast=(v, v))
def autocontrast_impl(level):
return c_vision.AutoContrast()
def sharpness_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomSharpness(degrees=(v, v))
def brightness_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomColorAdjust(brightness=(v, v))
# define the Auto Augmentation policy
imagenet_policy = [
[(posterize_impl(8), 0.4), (rotate_impl(9), 0.6)],
[(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)],
[(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)],
[(posterize_impl(7), 0.6), (posterize_impl(6), 0.6)],
[(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)],
[(c_vision.Equalize(), 0.4), (rotate_impl(8), 0.8)],
[(solarize_impl(3), 0.6), (c_vision.Equalize(), 0.6)],
[(posterize_impl(5), 0.8), (c_vision.Equalize(), 1.0)],
[(rotate_impl(3), 0.2), (solarize_impl(8), 0.6)],
[(c_vision.Equalize(), 0.6), (posterize_impl(6), 0.4)],
[(rotate_impl(8), 0.8), (color_impl(0), 0.4)],
[(rotate_impl(9), 0.4), (c_vision.Equalize(), 0.6)],
[(c_vision.Equalize(), 0.0), (c_vision.Equalize(), 0.8)],
[(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(4), 0.6), (contrast_impl(8), 1.0)],
[(rotate_impl(8), 0.8), (color_impl(2), 1.0)],
[(color_impl(8), 0.8), (solarize_impl(7), 0.8)],
[(sharpness_impl(7), 0.4), (c_vision.Invert(), 0.6)],
[(shear_x(5), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(0), 0.4), (c_vision.Equalize(), 0.6)],
[(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)],
[(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)],
[(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(4), 0.6), (contrast_impl(8), 1.0)],
[(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)],
]
def autoaugment(dataset_path, repeat_num=1, batch_size=32, target="Ascend"):
"""
define dataset with autoaugment
"""
if target == "Ascend":
device_num, rank_id = _get_rank_info()
else:
init("nccl")
rank_id = get_rank()
device_num = get_group_size()
if device_num == 1:
ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True)
else:
ds = de.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True,
num_shards=device_num, shard_id=rank_id)
image_size = 224
mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
trans = [
c_vision.RandomCropDecodeResize(image_size, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
]
post_trans = [
c_vision.RandomHorizontalFlip(prob=0.5),
c_vision.Normalize(mean=mean, std=std),
c_vision.HWC2CHW()
]
dataset = ds.map(operations=trans, input_columns="image")
dataset = dataset.map(operations=c_vision.RandomSelectSubpolicy(imagenet_policy), input_columns=["image"])
dataset = dataset.map(operations=post_trans, input_columns="image")
type_cast_op = c_transforms.TypeCast(mstype.int32)
dataset = dataset.map(operations=type_cast_op, input_columns="label")
# apply the batch operation
dataset = dataset.batch(batch_size, drop_remainder=True)
# apply the repeat operation
dataset = dataset.repeat(repeat_num)
return dataset
def _get_rank_info():
"""
get rank size and rank id
"""
rank_size = int(os.environ.get("RANK_SIZE", "1"))
rank_id = int(os.environ.get("RANK_ID", "0"))
return rank_size, rank_id
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Parse arguments"""
import os
import ast
import argparse
from pprint import pprint, pformat
import yaml
_config_path = "../config/config_resnet50_gpu.yaml"
class Config:
"""
Configuration namespace. Convert dictionary to members.
"""
def __init__(self, cfg_dict):
for k, v in cfg_dict.items():
if isinstance(v, (list, tuple)):
setattr(self, k, [Config(x) if isinstance(x, dict) else x for x in v])
else:
setattr(self, k, Config(v) if isinstance(v, dict) else v)
def __str__(self):
return pformat(self.__dict__)
def __repr__(self):
return self.__str__()
def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="resnet50_cifar10_config.yaml"):
"""
Parse command line arguments to the configuration according to the default yaml.
Args:
parser: Parent parser.
cfg: Base configuration.
helper: Helper description.
cfg_path: Path to the default yaml config.
"""
parser = argparse.ArgumentParser(description="[REPLACE THIS at config.py]",
parents=[parser])
helper = {} if helper is None else helper
choices = {} if choices is None else choices
for item in cfg:
if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict):
help_description = helper[item] if item in helper else "Please reference to {}".format(cfg_path)
choice = choices[item] if item in choices else None
if isinstance(cfg[item], bool):
parser.add_argument("--" + item, type=ast.literal_eval, default=cfg[item], choices=choice,
help=help_description)
else:
parser.add_argument("--" + item, type=type(cfg[item]), default=cfg[item], choices=choice,
help=help_description)
args = parser.parse_args()
return args
def parse_yaml(yaml_path):
"""
Parse the yaml config file.
Args:
yaml_path: Path to the yaml config.
"""
with open(yaml_path, 'r') as fin:
try:
cfgs = yaml.load_all(fin.read(), Loader=yaml.FullLoader)
cfgs = [x for x in cfgs]
if len(cfgs) == 1:
cfg_helper = {}
cfg = cfgs[0]
cfg_choices = {}
elif len(cfgs) == 2:
cfg, cfg_helper = cfgs
cfg_choices = {}
elif len(cfgs) == 3:
cfg, cfg_helper, cfg_choices = cfgs
else:
raise ValueError("At most 3 docs (config description for help, choices) are supported in config yaml")
print(cfg_helper)
except:
raise ValueError("Failed to parse yaml")
return cfg, cfg_helper, cfg_choices
def merge(args, cfg):
"""
Merge the base config from yaml file and command line arguments.
Args:
args: Command line arguments.
cfg: Base configuration.
"""
args_var = vars(args)
for item in args_var:
cfg[item] = args_var[item]
return cfg
def get_config():
"""
Get Config according to the yaml file and cli arguments.
"""
parser = argparse.ArgumentParser(description="default name", add_help=False)
current_dir = os.path.dirname(os.path.abspath(__file__))
parser.add_argument("--config_path", type=str, default=os.path.join(current_dir, \
"../config/config_resnet50_gpu.yaml"), help="Config file path")
path_args, _ = parser.parse_known_args()
default, helper, choices = parse_yaml(path_args.config_path)
pprint(default)
args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=path_args.config_path)
final_config = merge(args, default)
return Config(final_config)
config = get_config()
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""
create train or eval dataset.
"""
import os
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
import mindspore.dataset.vision.c_transforms as C
from mindspore.dataset.vision import Inter
import mindspore.dataset.transforms.c_transforms as C2
from mindspore.communication.management import init, get_rank, get_group_size
from src.transform import RandAugment
from src.config import config
def cifar10(dataset_path, do_train, repeat_num=1, batch_size=32, target="Ascend", distribute=False):
"""
create a train or evaluate cifar10 dataset for resnet50
Args:
dataset_path(string): the path of dataset.
do_train(bool): whether dataset is used for train or eval.
repeat_num(int): the repeat times of dataset. Default: 1
batch_size(int): the batch size of dataset. Default: 32
target(str): the device target. Default: Ascend
distribute(bool): data for distribute or not. Default: False
Returns:
dataset
"""
if target == "Ascend":
device_num, rank_id = _get_rank_info()
else:
if distribute:
init()
rank_id = get_rank()
device_num = get_group_size()
else:
device_num = 1
if device_num == 1:
data_set = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True)
else:
data_set = ds.Cifar10Dataset(dataset_path, num_parallel_workers=8, shuffle=True,
num_shards=device_num, shard_id=rank_id)
# define map operations
trans = []
if do_train:
trans += [
C.RandomCrop((32, 32), (4, 4, 4, 4)),
C.RandomHorizontalFlip(prob=0.5)
]
trans += [
C.Resize((config.image_size, config.image_size)),
C.Rescale(1.0 / 255.0, 0.0),
C.Normalize([0.4914, 0.4822, 0.4465], [0.2023, 0.1994, 0.2010]),
C.HWC2CHW()
]
type_cast_op = C2.TypeCast(mstype.int32)
data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
# apply batch operations
data_set = data_set.batch(batch_size, drop_remainder=True)
# apply dataset repeat operation
data_set = data_set.repeat(repeat_num)
return data_set
def create_train_dataset(dataset_path, repeat_num=1, batch_size=32, target="Ascend"):
"""
create a train or eval imagenet2012 dataset for resnet50
Args:
dataset_path(string): the path of dataset.
repeat_num(int): the repeat times of dataset. Default: 1
batch_size(int): the batch size of dataset. Default: 32
target(str): the device target. Default: Ascend
distribute(bool): data for distribute or not. Default: False
Returns:
dataset
"""
if target == "Ascend":
device_num, rank_id = _get_rank_info()
if device_num == 1:
data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True)
else:
data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True,
num_shards=device_num, shard_id=rank_id)
image_size = config.image_size
mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
# define map operations
trans = [
C.RandomCropDecodeResize(image_size, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
C.RandomHorizontalFlip(prob=0.5),
C.Normalize(mean=mean, std=std),
C.HWC2CHW()
]
type_cast_op = C2.TypeCast(mstype.int32)
data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
# apply batch operations
data_set = data_set.batch(batch_size, drop_remainder=True)
# apply dataset repeat operation
data_set = data_set.repeat(repeat_num)
return data_set
def create_eval_dataset(dataset_path, repeat_num=1, batch_size=32, target="Ascend"):
"""
create a train or eval imagenet2012 dataset for resnet50
Args:
dataset_path(string): the path of dataset.
repeat_num(int): the repeat times of dataset. Default: 1
batch_size(int): the batch size of dataset. Default: 32
target(str): the device target. Default: Ascend
Returns:
dataset
"""
data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True)
image_size = config.image_size
mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
# define map operations
trans = [
C.Decode(),
C.Resize(256),
C.CenterCrop(image_size),
C.Normalize(mean=mean, std=std),
C.HWC2CHW()
]
type_cast_op = C2.TypeCast(mstype.int32)
data_set = data_set.map(operations=trans, input_columns="image", num_parallel_workers=8)
data_set = data_set.map(operations=type_cast_op, input_columns="label", num_parallel_workers=8)
# apply batch operations
data_set = data_set.batch(batch_size, drop_remainder=True)
# apply dataset repeat operation
data_set = data_set.repeat(repeat_num)
return data_set
def create_dataset_ImageNet(dataset_path, do_train, use_randaugment=False, repeat_num=1, batch_size=32,
target="Ascend"):
"""
create a train or eval imagenet2012 dataset for resnet50
Args:
dataset_path(string): the path of dataset.
do_train(bool): whether dataset is used for train or eval.
use_randaugment(bool): enable randAugment.
repeat_num(int): the repeat times of dataset. Default: 1
batch_size(int): the batch size of dataset. Default: 32
target(str): the device target. Default: Ascend
Returns:
dataset
"""
if target == "Ascend":
device_num, rank_id = _get_rank_info()
elif target == "GPU":
init("nccl")
rank_id = get_rank()
device_num = get_group_size()
if device_num == 1:
da = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True)
else:
da = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8, shuffle=True,
num_shards=device_num, shard_id=rank_id)
image_size = 224
mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
# define map operations
if do_train:
if use_randaugment:
trans = [
C.Decode(),
C.RandomResizedCrop(size=(image_size, image_size),
scale=(0.08, 1.0),
ratio=(3. / 4., 4. / 3.),
interpolation=Inter.BICUBIC),
C.RandomHorizontalFlip(prob=0.5),
]
else:
trans = [
C.RandomCropDecodeResize(image_size, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
C.RandomHorizontalFlip(prob=0.5),
C.Normalize(mean=mean, std=std),
C.HWC2CHW()
]
else:
use_randaugment = False
trans = [
C.Decode(),
C.Resize(256),
C.CenterCrop(image_size),
C.Normalize(mean=mean, std=std),
C.HWC2CHW()
]
type_cast_op = C2.TypeCast(mstype.int32)
da = da.map(input_columns="image", num_parallel_workers=8, operations=trans)
da = da.map(input_columns="label", num_parallel_workers=8, operations=type_cast_op)
# apply batch operations
if use_randaugment:
efficient_rand_augment = RandAugment()
da = da.batch(batch_size,
per_batch_map=efficient_rand_augment,
input_columns=['image', 'label'],
num_parallel_workers=2,
drop_remainder=True)
else:
da = da.batch(batch_size, drop_remainder=True)
# apply dataset repeat operation
da = da.repeat(repeat_num)
return da
def _get_rank_info():
"""
get rank size and rank id
"""
rank_size = int(os.environ.get("RANK_SIZE", 1))
if rank_size > 1:
rank_size = get_group_size()
rank_id = get_rank()
else:
rank_size = 1
rank_id = 0
return rank_size, rank_id
# Copyright 2021 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""define loss for glore_resnet50"""
import mindspore.nn as nn
from mindspore import Tensor
from mindspore.common import dtype as mstype
from mindspore.nn.loss.loss import LossBase
from mindspore.ops import functional as F
from mindspore.ops import operations as P
import mindspore.ops as ops
class SoftmaxCrossEntropyExpand(nn.Cell):
'''SoftmaxCrossEntropy'''
def __init__(self, sparse=False):
super(SoftmaxCrossEntropyExpand, self).__init__()
self.exp = ops.Exp()
self.sum = ops.ReduceSum(keep_dims=True)
self.onehot = ops.OneHot()
self.on_value = Tensor(1.0, mstype.float32)
self.off_value = Tensor(0.0, mstype.float32)
self.div = ops.RealDiv()
self.log = ops.Log()
self.sum_cross_entropy = ops.ReduceSum(keep_dims=False)
self.mul = ops.Mul()
self.mul2 = ops.Mul()
self.mean = ops.ReduceMean(keep_dims=False)
self.sparse = sparse
self.max = ops.ReduceMax(keep_dims=True)
self.sub = ops.Sub()
self.eps = Tensor(1e-24, mstype.float32)
def construct(self, logit, label):
'''construct SoftmaxCrossEntropy'''
logit_max = self.max(logit, -1)
exp = self.exp(self.sub(logit, logit_max))
exp_sum = self.sum(exp, -1)
softmax_result = self.div(exp, exp_sum)
if self.sparse:
label = self.onehot(label, ops.shape(logit)[1], self.on_value, self.off_value)
softmax_result_log = self.log(softmax_result + self.eps)
loss = self.sum_cross_entropy((self.mul(softmax_result_log, label)), -1)
loss = self.mul2(ops.scalar_to_array(-1.0), loss)
loss = self.mean(loss, -1)
return loss
class CrossEntropySmooth(LossBase):
"""CrossEntropy"""
def __init__(self, sparse=True, reduction='mean', smooth_factor=0., num_classes=1000):
super(CrossEntropySmooth, self).__init__()
self.onehot = P.OneHot()
self.sparse = sparse
self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
self.off_value = Tensor(1.0 * smooth_factor / (num_classes - 1), mstype.float32)
self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction=reduction)
def construct(self, logit, label):
if self.sparse:
label = self.onehot(label, F.shape(logit)[1], self.on_value, self.off_value)
loss = self.ce(logit, label)
return loss