Commit 5061b090, authored by i-robot, committed by Gitee

!2634 S-Ghostnet of Noah's Ark Lab, Huawei

Merge pull request !2634 from liu09114/master
parents 75b0eeb6 6aa4a2a1
Showing
with 2899 additions and 0 deletions
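# Contents
<!-- TOC -->
- [Contents](#contents)
- [Overview](#overview)
- [Paper](#paper)
- [Model Architecture](#model-architecture)
- [Dataset](#dataset)
- [Environment Requirements](#environment-requirements)
- [Script Description](#script-description)
    - [Script Structure and Description](#script-structure-and-description)
- [Training Process](#training-process)
    - [Usage](#usage)
        - [Running on Ascend Processors](#running-on-ascend-processors)
- [Evaluation Process](#evaluation-process)
    - [Usage](#usage-1)
        - [Running on Ascend Processors](#running-on-ascend-processors-1)
    - [Results](#results)
- [Model Description](#model-description)
    - [Performance](#performance)
        - [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
<!-- /TOC -->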
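# GhostNet Description
## Overview
S-GhostNet was proposed by Huawei Noah's Ark Lab in 2021. Building on GhostNet, it uses network architecture search to explore how to construct larger models, aiming for better performance at a lower computational cost. At the same computational budget, the architecture achieves higher accuracy than SOTA algorithms.
The following is an example of training GhostNet with MindSpore on the ImageNet2012 dataset.
## Paper
1. [Paper](https://arxiv.org/pdf/2108.00177.pdf): Chuanjian Liu, Kai Han, An Xiao, Yiping Deng, Wei Zhang, Chunjing Xu, Yunhe Wang. "Greedy Network Enlarging"
# Model Architecture
For the overall network architecture of GhostNet, see this [link](https://arxiv.org/pdf/1911.11907.pdf).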
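# Dataset
Dataset used: [ImageNet2012](http://www.image-net.org/)
- Dataset size: 1,000 classes of 224*224 color images
    - Training set: 1,281,167 images
    - Test set: 50,000 images
- Data format: JPEG
    - Note: data is processed in dataset.py.
- Download the dataset. The directory structure is as follows:
```text
└─dataset
    ├─imagenet
        ├─train # training dataset
        └─val # evaluation dataset
```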
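Before launching a long job it can help to confirm that the dataset directory matches the layout above. The following is a minimal sketch, not part of the repository: the helper name `check_imagenet_layout` is hypothetical, and it only checks that the `train` and `val` sub-directories exist.

```python
from pathlib import Path


def check_imagenet_layout(root):
    """Return the expected sub-directories that are missing under `root`.

    `root` is assumed to be the `imagenet` directory from the tree above.
    """
    expected = ["train", "val"]  # directory names taken from the tree above
    root = Path(root)
    return [name for name in expected if not (root / name).is_dir()]
```

An empty list means both sub-directories were found; any returned name points at a directory that still has to be created or downloaded.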
# Environment Requirements
- Hardware
    - Prepare a hardware environment with Ascend processors.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For more information, see the following resources:
    - [MindSpore Tutorials](https://www.mindspore.cn/tutorials/zh-CN/master/index.html)
    - [MindSpore Python API](https://www.mindspore.cn/docs/api/zh-CN/master/index.html)
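# Script Description
## Script Structure and Description
```text
└──S-GhostNet
    ├── README.md
    ├── script
        ├── ma-pre-start.sh # saves ModelArts training logs
        ├── train_distributed_ascend.sh # single-server distributed training script
    ├── src
        ├── autoaug.py # automatic data augmentation
        ├── dataset.py # data preprocessing
        ├── bignet.py # S-GhostNet network definition
        ├── callback.py # moving average of model parameters
        ├── eval_callback.py # evaluates the model during training
        ├── loss.py # training loss definition
        ├── utils.py
        └── ghostnet.py # GhostNet network
    ├── eval_b1.py # evaluates S-GhostNet-b1
    ├── eval_b4.py # evaluates S-GhostNet-b4
    ├── train.py # trains the network
    └── compute_acc.py # aggregates accuracy
```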
# Training Process
## Usage
### Running on Ascend Processors
```Shell
# Distributed training
Usage: bash train_distributed_ascend.sh [bignet] [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)
# Standalone training
Usage: python train.py --model big_net --data_path path-to-imagenet --drop 0.2 --drop-path 0.2 --large --layers 2,4,5,12,6,14 --channels 36,68,108,164,216,336 --batch-size 4
# ModelArts training
python train.py --model big_net --amp_level=O0 --autoaugment --batch-size=32 --channels=36,60,108,168,232,336 --ckpt_save_epoch=20 --cloud= --data_url=obs://path-to-imagenet --decay-epochs=2.4 --decay-rate=0.97 --device_num=8 --distributed --drop=0.3 --drop-path=0.1 --ema-decay=0.97 --epochs=450 --input_size=384 --large --layers=3,5,5,12,6,14 --loss_scale=1024 --lr=0.0001 --lr_decay_style=cosine --lr_end=1e-6 --lr_max=0.6 --model=big_net --opt=momentum --opt-eps=0.0001 --sync_bn --warmup-epochs=20 --warmup-lr=1e-6 --weight-decay=2e-5 --workers=8
```
Distributed training requires an HCCL configuration file in JSON format to be created in advance.
For details, see the instructions in [hccl_tools](https://gitee.com/mindspore/models/tree/master/utils/hccl_tools).
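The ModelArts command passes `--lr_decay_style=cosine`, `--warmup-epochs=20`, `--warmup-lr=1e-6`, `--lr_max=0.6`, `--lr_end=1e-6` and `--epochs=450`. How train.py turns these into a schedule is not shown here, so the following is only a sketch of one common reading (linear warm-up followed by cosine decay). The function name `lr_at_epoch` is hypothetical and the actual implementation may differ.

```python
import math


def lr_at_epoch(epoch, total_epochs=450, warmup_epochs=20,
                warmup_lr=1e-6, lr_max=0.6, lr_end=1e-6):
    """Learning rate at a (possibly fractional) epoch index.

    Assumed schedule: linear warm-up, then cosine decay. This is an
    illustration of the flags above, not the code used by train.py.
    """
    if epoch < warmup_epochs:
        # Linear ramp from warmup_lr up to lr_max.
        return warmup_lr + (lr_max - warmup_lr) * epoch / warmup_epochs
    # Cosine decay from lr_max down to lr_end over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return lr_end + 0.5 * (lr_max - lr_end) * (1.0 + math.cos(math.pi * progress))
```

Under this reading the rate peaks at `lr_max` once warm-up ends and falls back to `lr_end` at the final epoch.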
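# Evaluation Process
## Usage
### Running on Ascend Processors
```Shell
# ModelArts evaluation
Usage: python eval_b4.py --data_url=obs://path-to-imagenet --large= --model=big_net --test_mode=ema_best --train_url=obs://path-to-pretrained-model --trained_model_dir=s3://path-to-output
```
Checkpoints can be generated during training.
## Results
Evaluation results are saved in the log file. You can use compute_acc.py to aggregate the results of the different checkpoints produced by distributed training.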
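As a rough illustration of that aggregation, the sketch below averages the Top-1 accuracy per checkpoint index. It assumes log lines in the form printed by the evaluation scripts (`ckpt {}, Rank {}, Accuracy {...}` with a `'Top1-Acc'` entry). The function name `mean_top1_per_checkpoint` and the regular expression are illustrative and are not taken from compute_acc.py.

```python
import re
from collections import defaultdict

# Assumed line format, e.g.:
# ckpt 3, Rank 0, Accuracy {'Validation-Loss': 1.9, 'Top1-Acc': 0.83, 'Top5-Acc': 0.96}
LINE_RE = re.compile(r"ckpt (\d+),.*'Top1-Acc': ([0-9.eE+-]+)")


def mean_top1_per_checkpoint(lines):
    """Average the Top-1 accuracy over all log lines that share a checkpoint index."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        match = LINE_RE.search(line)
        if match is None:
            continue  # skip lines that are not evaluation results
        index = int(match.group(1))
        sums[index] += float(match.group(2))
        counts[index] += 1
    return {index: sums[index] / counts[index] for index in sums}
```

Taking `max()` over the returned values gives the best averaged checkpoint, which mirrors the final `print(max(mean_acc))` step in compute_acc.py.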
# Model Description
- S-GhostNet_b1: `--channels=28,44,72,140,196,280 --layers=1,2,3,4,3,6 --input_size=240 --large`, Top-1=79.844
- S-GhostNet_b4: `--channels=36,60,108,168,232,336 --layers=3,5,5,12,6,14 --input_size=384 --large`, Top-1=83.024
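The `--channels` and `--layers` values are comma-separated strings that the evaluation scripts split into integer lists before building the network. A small sketch of that parsing, using the two configurations listed above; the helper `parse_int_list` and the `VARIANTS` dictionary are hypothetical names introduced only for this example.

```python
def parse_int_list(text):
    """Turn a comma-separated string such as '3,5,5,12,6,14' into a list of ints."""
    return [int(item.strip()) for item in text.split(",")]


# Settings copied from the model description above.
VARIANTS = {
    "S-GhostNet_b1": {"channels": "28,44,72,140,196,280",
                      "layers": "1,2,3,4,3,6", "input_size": 240},
    "S-GhostNet_b4": {"channels": "36,60,108,168,232,336",
                      "layers": "3,5,5,12,6,14", "input_size": 384},
}

for name, cfg in VARIANTS.items():
    channels = parse_int_list(cfg["channels"])
    layers = parse_int_list(cfg["layers"])
    # Both lists should have one entry per stage.
    assert len(channels) == len(layers)
    print(name, "stages:", len(layers), "sum of layers:", sum(layers))
```

Checking that both lists have the same length catches a mistyped flag before a long training or evaluation job starts.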
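## Performance
### Evaluation Performance
| Parameter | Ascend 910 |
|---|---|
| Model version | S-GhostNet |
| Resources | Ascend 910; CPU: 2.60GHz, 192 cores; memory: 755G |
| Upload date | 2022-04-29 |
| MindSpore version | 1.5.1 |
| Dataset | ImageNet2012 |
| Training parameters | epoch=450 |
| Optimizer | Momentum |
| Loss function | Softmax cross-entropy |
| Total time | S-GhostNet_b4: 112 hours on 32 devices |
# Description of Random Situation
The seed inside the "create_dataset" function is set in dataset.py, and the random seed in train.py is also used.
# ModelZoo Homepage
Please visit the official [homepage](https://gitee.com/mindspore/mindspore/tree/r1.3/model_zoo).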
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
# Sum of Top-1 accuracy and number of samples per checkpoint index.
accuracy = {}
acc_nums = {}
with open('modelarts-job-c0a748c1-47a8-45d0-9603-2097883482c0-worker-0.log') as f:
    lines = f.readlines()
    for line in lines:
        if 'Validation-Loss' in line:
            contents = line.strip().split(' ')
            # contents[1] is the checkpoint index, contents[8] the accuracy value.
            ckpt_index = int(contents[1].strip(','))
            if str(ckpt_index) not in accuracy.keys():
                acc_nums[str(ckpt_index)] = 1
                accuracy[str(ckpt_index)] = float(contents[8].strip(','))
            else:
                acc_nums[str(ckpt_index)] += 1
                accuracy[str(ckpt_index)] += float(contents[8].strip(','))
print(accuracy)
print(acc_nums)
# Average the accumulated accuracy for each checkpoint.
mean_acc = []
acc_go = acc_nums.keys()
acc_lo = accuracy.keys()
for key in acc_lo:
    if key not in acc_go:
        print('Wrong key!!!!!!!')
    else:
        mean_acc.append(accuracy[key]/acc_nums[key])
print(mean_acc)
print(max(mean_acc))
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Inference Interface"""
import sys
import os
import argparse
import zipfile
import time
import moxing as mox
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.nn import Loss, Top1CategoricalAccuracy, Top5CategoricalAccuracy
from mindspore import context
from src.dataset import create_dataset_val
from src.utils import count_params
from src.loss import LabelSmoothingCrossEntropy
from src.tinynet import tinynet
from src.ghostnet import ghostnet_1x
from src.big_net import GhostNet
os.environ["GLOG_v"] = '3'
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = '0'
os.environ["ASCEND_GLOBAL_LOG_LEVEL"] = '2'
os.environ["ASCEND_GLOBAL_EVENT_ENABLE"] = '0'
parser = argparse.ArgumentParser(description='Evaluation')
parser.add_argument('--data_path', type=str, default='/autotest/liuchuanjian/data/imagenet/',
metavar='DIR', help='path to dataset')
parser.add_argument('--model', default='tinynet_c', type=str, metavar='MODEL',
help='Name of model to train (default: "tinynet_c") ghostnet, big_net')
parser.add_argument('--num-classes', type=int, default=1000, metavar='N',
help='number of label classes (default: 1000)')
parser.add_argument('--channels', type=str, default='24,32,64,112,160,280',
help='channel config of model architecure')
parser.add_argument('--layers', type=str, default='2,2,5,10,2,10',
help='layer config of model architecure')
parser.add_argument('--large', action='store_true', default=False,
help='ghostnet1x or ghostnet larger')
parser.add_argument('--input_size', type=int, default=248,
help='input size of model.')
parser.add_argument('--smoothing', type=float, default=0.1,
help='label smoothing (default: 0.1)')
parser.add_argument('-b', '--batch-size', type=int, default=125, metavar='N',
                    help='input batch size for evaluation (default: 125)')
parser.add_argument('-j', '--workers', type=int, default=8, metavar='N',
                    help='how many worker processes to use (default: 8)')
parser.add_argument('--GPU', action='store_true', default=False,
help='Use GPU for training (default: False)')
parser.add_argument('--dataset_sink', action='store_false', default=True)
parser.add_argument('--drop', type=float, default=0.2, metavar='DROP',
help='Dropout rate (default: 0.) for big_net, use "1-drop", for others, use "drop"')
parser.add_argument('--drop-path', type=float, default=0.0, metavar='DROP',
help='Drop connect rate (default: 0.)')
parser.add_argument('--sync_bn', action='store_true', default=False,
help='Use sync bn in distributed mode. (default: False)')
parser.add_argument('--test_mode', default=None,
help='Use ema saved model to test, "ema_best", "ema_last", ')
# eval on cloud
parser.add_argument('--cloud', action='store_true', default=False, help='Whether train on cloud.')
parser.add_argument('--data_url', type=str, default="/home/ma-user/work/data/imagenet", help='path to dataset.')
parser.add_argument('--zip_url', type=str, default="s3://bucket-800/liuchuanjian/data/imagenet_zip/imagenet.zip")
parser.add_argument('--train_url', type=str, default=" ", help='train_dir.')
parser.add_argument('--tmp_data_dir', default='/cache/data/', help='temp data dir')
parser.add_argument('--trained_model_dir', default='s3://bucket-800/liuchuanjian/results/bignet/1291/',
help='temp save dir')
parser.add_argument('--tmp_save_dir', default='/cache/liuchuanjian/', help='temp save dir')
_global_sync_count = 0
def get_device_id():
    device_id = os.getenv('DEVICE_ID', '0')
    return int(device_id)


def get_device_num():
    device_num = os.getenv('RANK_SIZE', '1')
    return int(device_num)


def unzip_file(zip_src, dst_dir):
    r = zipfile.is_zipfile(zip_src)
    if r:
        fz = zipfile.ZipFile(zip_src, 'r')
        for file_item in fz.namelist():
            fz.extract(file_item, dst_dir)
    else:
        raise Exception('This is not zip')
def sync_data(opts):
    """
    Download data from remote obs to local directory if the first url is remote url and the second one is local path
    Upload data from local directory to remote obs in contrast.
    """
    global _global_sync_count
    sync_lock = "/tmp/copy_sync.lock" + str(_global_sync_count)
    _global_sync_count += 1
    if not mox.file.exists(opts.tmp_data_dir):
        mox.file.make_dirs(opts.tmp_data_dir)
    target_file = os.path.join(opts.tmp_data_dir, 'imagenet.zip')
    # Each server contains 8 devices at most.
    if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
        print("from path: ", opts.zip_url)
        print("to path: ", target_file)
        mox.file.copy_parallel(opts.zip_url, target_file)
        print('Zip file copy success.')
        print('Starting unzip file.')
        unzip_file(target_file, opts.tmp_data_dir)
        print('Unzip file success.')
        print("===finish data synchronization===")
        ## ckpt copy
        print('Moving ckpt file')
        if not mox.file.exists(opts.tmp_save_dir):
            mox.file.make_dirs(opts.tmp_save_dir)
        for i in range(8):
            print('copying ckpt_ ', str(i))
            if opts.test_mode == 'ema_best':
                source_ckpt = os.path.join(opts.trained_model_dir, 'ckpt_'+str(i), 'ema_best.ckpt')
                target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_best.ckpt')
            elif opts.test_mode == 'ema_last':
                source_ckpt = os.path.join(opts.trained_model_dir, 'ckpt_'+str(i), 'ema_last.ckpt')
                target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_last.ckpt')
            else:
                source_ckpt = os.path.join(opts.trained_model_dir, 'ckpt_'+str(i), 'big_net-500_1251.ckpt')
                target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'big_net-500_1251.ckpt')
            if mox.file.exists(source_ckpt):
                mox.file.copy(source_ckpt, target_ckpt)
            else:
                print(source_ckpt, 'does not exist.')
        try:
            os.mknod(sync_lock)
        except IOError:
            pass
        print("===save flag===")
    # Wait until the copying device has created the lock file.
    while True:
        if os.path.exists(sync_lock):
            break
        time.sleep(1)
    opts.data_url = os.path.join(opts.tmp_data_dir, 'imagenet')
    opts.data_path = opts.data_url
    print("Finish sync data from {} to {}.".format(opts.zip_url, target_file))
def main(opts):
    """Main entrance for evaluation"""
    print(sys.argv)
    if opts.channels:
        channel_config = []
        for item in opts.channels.split(','):
            channel_config.append(int(item.strip()))
    if opts.layers:
        layer_config = []
        for item in opts.layers.split(','):
            layer_config.append(int(item.strip()))
    print(opts)
    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend')
    devid = int(os.getenv('DEVICE_ID'))
    context.set_context(device_id=devid,
                        reserve_class_name_in_scope=True)
    context.set_auto_parallel_context(device_num=8)
    init()
    opts.rank = get_rank()
    opts.group_size = get_group_size()
    print('Rank {}, group_size {}'.format(opts.rank, opts.group_size))
    val_data_url = os.path.join(opts.data_path, 'val')
    val_dataset = create_dataset_val(opts.batch_size,
                                     val_data_url,
                                     workers=opts.workers,
                                     target='Ascend',
                                     distributed=False,
                                     input_size=opts.input_size)
    # parse model argument
    if opts.model == 'tinynet_c':
        _, sub_name = opts.model.split("_")
        net = tinynet(sub_model=sub_name,
                      num_classes=opts.num_classes,
                      drop_rate=0.0,
                      drop_connect_rate=0.0,
                      global_pool="avg",
                      bn_tf=False,
                      bn_momentum=None,
                      bn_eps=None)
    elif opts.model == 'ghostnet':
        net = ghostnet_1x(num_classes=opts.num_classes)
    else:
        net = GhostNet(layers=layer_config,
                       channels=channel_config,
                       num_classes=opts.num_classes,
                       final_drop=opts.drop,
                       drop_path_rate=opts.drop_path,
                       large=opts.large,
                       zero_init_residual=False,
                       sync_bn=opts.sync_bn)
    print("Total number of parameters:", count_params(net))
    if opts.model == 'tinynet_c':
        opts.input_size = net.default_cfg['input_size'][1]
    loss = LabelSmoothingCrossEntropy(smooth_factor=opts.smoothing,
                                      num_classes=opts.num_classes)
    loss.add_flags_recursive(fp32=True, fp16=False)
    eval_metrics = {'Validation-Loss': Loss(),
                    'Top1-Acc': Top1CategoricalAccuracy(),
                    'Top5-Acc': Top5CategoricalAccuracy()}
    # Evaluate the checkpoint saved by each of the 8 training devices.
    for i in range(8):
        if opts.test_mode == 'ema_best':
            target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_best.ckpt')
        elif opts.test_mode == 'ema_last':
            target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_last.ckpt')
        else:
            target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'big_net-500_1251.ckpt')
        if mox.file.exists(target_ckpt):
            print('Loading checkpoint: ', target_ckpt)
            ckpt = load_checkpoint(target_ckpt)
            load_param_into_net(net, ckpt)
            net.set_train(False)
            model = Model(net, loss, metrics=eval_metrics)
            metrics = model.eval(val_dataset, dataset_sink_mode=False)
            print('ckpt {}, Rank {}, Accuracy {}'.format(i, opts.rank, metrics))
if __name__ == '__main__':
    # argparse provides parse_known_args(); parse_known_opts() does not exist.
    args, unparsed = parser.parse_known_args()
    # copy data
    sync_data(args)
    main(args)
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Inference Interface"""
import sys
import os
import argparse
import zipfile
import time
import moxing as mox
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore.train.model import Model
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.nn import Loss, Top1CategoricalAccuracy, Top5CategoricalAccuracy
from mindspore import context
from src.dataset import create_dataset_val
from src.utils import count_params
from src.loss import LabelSmoothingCrossEntropy
from src.tinynet import tinynet
from src.ghostnet import ghostnet_1x
from src.big_net import GhostNet
os.environ["GLOG_v"] = '3'
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = '0'
os.environ["ASCEND_GLOBAL_LOG_LEVEL"] = '2'
os.environ["ASCEND_GLOBAL_EVENT_ENABLE"] = '0'
parser = argparse.ArgumentParser(description='Evaluation')
parser.add_argument('--data_path', type=str, default='/autotest/liuchuanjian/data/imagenet/',
metavar='DIR', help='path to dataset')
parser.add_argument('--model', default='tinynet_c', type=str, metavar='MODEL',
help='Name of model to train (default: "tinynet_c") ghostnet, big_net')
parser.add_argument('--num-classes', type=int, default=1000, metavar='N',
help='number of label classes (default: 1000)')
parser.add_argument('--channels', type=str, default='36,60,108,168,232,336',
help='channel config of model architecure')
parser.add_argument('--layers', type=str, default='3,5,5,12,6,14',
help='layer config of model architecure')
parser.add_argument('--large', action='store_true', default=False,
help='ghostnet1x or ghostnet larger')
parser.add_argument('--input_size', type=int, default=384,
help='input size of model.')
parser.add_argument('--smoothing', type=float, default=0.1,
help='label smoothing (default: 0.1)')
parser.add_argument('-b', '--batch-size', type=int, default=25, metavar='N',
                    help='input batch size for evaluation (default: 25)')
parser.add_argument('-j', '--workers', type=int, default=4, metavar='N',
                    help='how many worker processes to use (default: 4)')
parser.add_argument('--GPU', action='store_true', default=False,
help='Use GPU for training (default: False)')
parser.add_argument('--dataset_sink', action='store_false', default=True)
parser.add_argument('--drop', type=float, default=0.2, metavar='DROP',
help='Dropout rate (default: 0.) for big_net, use "1-drop", for others, use "drop"')
parser.add_argument('--drop-path', type=float, default=0.0, metavar='DROP',
help='Drop connect rate (default: 0.)')
parser.add_argument('--sync_bn', action='store_true', default=False,
help='Use sync bn in distributed mode. (default: False)')
parser.add_argument('--test_mode', default=None,
help='Use ema saved model to test, "ema_best", "ema_last", ')
# eval on cloud
parser.add_argument('--cloud', action='store_true', default=False, help='Whether train on cloud.')
parser.add_argument('--data_url', type=str, default="/home/ma-user/work/data/imagenet", help='path to dataset.')
parser.add_argument('--zip_url', type=str, default="s3://bucket-800/liuchuanjian/data/imagenet_zip/imagenet.zip")
parser.add_argument('--train_url', type=str, default=" ", help='train_dir.')
parser.add_argument('--tmp_data_dir', default='/cache/data/', help='temp data dir')
parser.add_argument('--trained_model_dir', default='s3://bucket-800/liuchuanjian/results/bignet/1311/',
help='temp save dir')
parser.add_argument('--tmp_save_dir', default='/cache/liuchuanjian/', help='temp save dir')
_global_sync_count = 0
def get_device_id():
    device_id = os.getenv('DEVICE_ID', '0')
    return int(device_id)


def get_device_num():
    device_num = os.getenv('RANK_SIZE', '1')
    return int(device_num)


def unzip_file(zip_src, dst_dir):
    r = zipfile.is_zipfile(zip_src)
    if r:
        fz = zipfile.ZipFile(zip_src, 'r')
        for file_item in fz.namelist():
            fz.extract(file_item, dst_dir)
    else:
        raise Exception('This is not zip')
def sync_data(opts):
    """
    Download data from remote obs to local directory if the first url is remote url and the second one is local path
    Upload data from local directory to remote obs in contrast.
    """
    global _global_sync_count
    sync_lock = "/tmp/copy_sync.lock" + str(_global_sync_count)
    _global_sync_count += 1
    if not mox.file.exists(opts.tmp_data_dir):
        mox.file.make_dirs(opts.tmp_data_dir)
    target_file = os.path.join(opts.tmp_data_dir, 'imagenet.zip')
    # Each server contains 8 devices at most.
    if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
        print("from path: ", opts.zip_url)
        print("to path: ", target_file)
        mox.file.copy_parallel(opts.zip_url, target_file)
        print('Zip file copy success.')
        print('Starting unzip file.')
        unzip_file(target_file, opts.tmp_data_dir)
        print('Unzip file success.')
        print("===finish data synchronization===")
        ## ckpt copy
        print('Moving ckpt file')
        if not mox.file.exists(opts.tmp_save_dir):
            mox.file.make_dirs(opts.tmp_save_dir)
        for i in range(32):
            print('copying ckpt_ ', str(i))
            if opts.test_mode == 'ema_best':
                source_ckpt = os.path.join(opts.trained_model_dir, 'ckpt_'+str(i), 'ema_best.ckpt')
                target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_best.ckpt')
            elif opts.test_mode == 'ema_last':
                source_ckpt = os.path.join(opts.trained_model_dir, 'ckpt_'+str(i), 'ema_last.ckpt')
                target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_last.ckpt')
            else:
                source_ckpt = os.path.join(opts.trained_model_dir, 'ckpt_'+str(i), 'big_net-450_1251.ckpt')
                target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'big_net-450_1251.ckpt')
            if mox.file.exists(source_ckpt):
                mox.file.copy(source_ckpt, target_ckpt)
            else:
                print(source_ckpt, 'does not exist.')
        try:
            os.mknod(sync_lock)
        except IOError:
            pass
        print("===save flag===")
    # Wait until the copying device has created the lock file.
    while True:
        if os.path.exists(sync_lock):
            break
        time.sleep(1)
    opts.data_url = os.path.join(opts.tmp_data_dir, 'imagenet')
    opts.data_path = opts.data_url
    print("Finish sync data from {} to {}.".format(opts.zip_url, target_file))
def main(opts):
    """Main entrance for evaluation"""
    print(sys.argv)
    if opts.channels:
        channel_config = []
        for item in opts.channels.split(','):
            channel_config.append(int(item.strip()))
    if opts.layers:
        layer_config = []
        for item in opts.layers.split(','):
            layer_config.append(int(item.strip()))
    print(opts)
    context.set_context(mode=context.GRAPH_MODE, device_target='Ascend')
    devid = int(os.getenv('DEVICE_ID'))
    context.set_context(device_id=devid,
                        reserve_class_name_in_scope=True)
    context.set_auto_parallel_context(device_num=8)
    init()
    opts.rank = get_rank()
    opts.group_size = get_group_size()
    print('Rank {}, group_size {}'.format(opts.rank, opts.group_size))
    val_data_url = os.path.join(opts.data_path, 'val')
    val_dataset = create_dataset_val(opts.batch_size,
                                     val_data_url,
                                     workers=opts.workers,
                                     target='Ascend',
                                     distributed=False,
                                     input_size=opts.input_size)
    # parse model argument
    if opts.model == 'tinynet_c':
        _, sub_name = opts.model.split("_")
        net = tinynet(sub_model=sub_name,
                      num_classes=opts.num_classes,
                      drop_rate=0.0,
                      drop_connect_rate=0.0,
                      global_pool="avg",
                      bn_tf=False,
                      bn_momentum=None,
                      bn_eps=None)
    elif opts.model == 'ghostnet':
        net = ghostnet_1x(num_classes=opts.num_classes)
    else:
        net = GhostNet(layers=layer_config,
                       channels=channel_config,
                       num_classes=opts.num_classes,
                       final_drop=opts.drop,
                       drop_path_rate=opts.drop_path,
                       large=opts.large,
                       zero_init_residual=False,
                       sync_bn=opts.sync_bn)
    print("Total number of parameters:", count_params(net))
    if opts.model == 'tinynet_c':
        opts.input_size = net.default_cfg['input_size'][1]
    loss = LabelSmoothingCrossEntropy(smooth_factor=opts.smoothing,
                                      num_classes=opts.num_classes)
    loss.add_flags_recursive(fp32=True, fp16=False)
    eval_metrics = {'Validation-Loss': Loss(),
                    'Top1-Acc': Top1CategoricalAccuracy(),
                    'Top5-Acc': Top5CategoricalAccuracy()}
    # Evaluate the checkpoint saved by each of the 32 training devices.
    for i in range(32):
        if opts.test_mode == 'ema_best':
            target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_best.ckpt')
        elif opts.test_mode == 'ema_last':
            target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'ema_last.ckpt')
        else:
            target_ckpt = os.path.join(opts.tmp_save_dir, 'ckpt_'+str(i), 'big_net-450_1251.ckpt')
        if mox.file.exists(target_ckpt):
            print('Loading checkpoint: ', target_ckpt)
            ckpt = load_checkpoint(target_ckpt)
            load_param_into_net(net, ckpt)
            net.set_train(False)
            model = Model(net, loss, metrics=eval_metrics)
            metrics = model.eval(val_dataset, dataset_sink_mode=False)
            print('ckpt {}, Rank {}, Accuracy {}'.format(i, opts.rank, metrics))
if __name__ == '__main__':
    # argparse provides parse_known_args(); parse_known_opts() does not exist.
    args, unparsed = parser.parse_known_args()
    # copy data
    sync_data(args)
    main(args)
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
if [ $# != 3 ] && [ $# != 4 ]
then
    echo "Usage: bash train_distributed_ascend.sh [bignet] [RANK_TABLE_FILE] [DATASET_PATH] [PRETRAINED_CKPT_PATH](optional)"
    exit 1
fi

# Resolve a possibly relative path to an absolute one.
get_real_path(){
    if [ "${1:0:1}" == "/" ]; then
        echo "$1"
    else
        echo "$(realpath -m $PWD/$1)"
    fi
}

PATH1=$(get_real_path $2)
PATH2=$(get_real_path $3)
if [ $# == 4 ]
then
    PATH3=$(get_real_path $4)
fi

if [ ! -f $PATH1 ]
then
    echo "error: RANK_TABLE_FILE=$PATH1 is not a file"
    exit 1
fi

if [ ! -d $PATH2 ]
then
    echo "error: DATASET_PATH=$PATH2 is not a directory"
    exit 1
fi

if [ $# == 4 ] && [ ! -f $PATH3 ]
then
    echo "error: PRETRAINED_CKPT_PATH=$PATH3 is not a file"
    exit 1
fi
ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8
export RANK_TABLE_FILE=$PATH1
export SERVER_ID=0
rank_start=$((DEVICE_NUM * SERVER_ID))

for((i=0; i<${DEVICE_NUM}; i++))
do
    export DEVICE_ID=${i}
    export RANK_ID=$((rank_start + i))
    rm -rf ./train_parallel$i
    mkdir ./train_parallel$i
    cp ../*.py ./train_parallel$i
    cp *.sh ./train_parallel$i
    cp -r ../src ./train_parallel$i
    cd ./train_parallel$i || exit
    echo "start training for rank $RANK_ID, device $DEVICE_ID, device_num $DEVICE_NUM"
    env > env.log
    if [ $# == 3 ]
    then
        python train.py \
            --distributed \
            --device_num=$DEVICE_NUM \
            --model $1 \
            --data_path $PATH2 \
            --num-classes 1000 \
            --channels 16,24,40,80,112,160 \
            --layers 1,2,2,4,2,5 \
            --batch-size 256 \
            --drop 0.2 \
            --drop-path 0 \
            --opt rmsprop \
            --opt-eps 0.001 \
            --lr 0.048 \
            --decay-epochs 2.4 \
            --warmup-lr 1e-6 \
            --warmup-epochs 3 \
            --decay-rate 0.97 \
            --ema-decay 0.9999 \
            --weight-decay 1e-5 \
            --per_print_times 100 \
            --epochs 300 \
            --ckpt_save_epoch 5 \
            --workers 8 \
            --amp_level O2 > $1.log 2>&1 &
    fi

    if [ $# == 4 ]
    then
        python train.py --model=$1 --distributed --device_num=$DEVICE_NUM --data_path=$PATH2 --pre_trained=$PATH3 &> log &
    fi

    cd ..
done
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
import mindspore.dataset as ds
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.dataset.vision.c_transforms as c_vision
from mindspore import dtype as mstype
# define Auto Augmentation operators
PARAMETER_MAX = 10
def float_parameter(level, maxval):
    return float(level) * maxval / PARAMETER_MAX


def int_parameter(level, maxval):
    return int(level * maxval / PARAMETER_MAX)


def shear_x(level):
    v = float_parameter(level, 0.3)
    return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, shear=(-v, -v)),
                                      c_vision.RandomAffine(degrees=0, shear=(v, v))])


def shear_y(level):
    v = float_parameter(level, 0.3)
    return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, shear=(0, 0, -v, -v)),
                                      c_vision.RandomAffine(degrees=0, shear=(0, 0, v, v))])


def translate_x(level):
    v = float_parameter(level, 150 / 331)
    return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, translate=(-v, -v)),
                                      c_vision.RandomAffine(degrees=0, translate=(v, v))])


def translate_y(level):
    v = float_parameter(level, 150 / 331)
    return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, translate=(0, 0, -v, -v)),
                                      c_vision.RandomAffine(degrees=0, translate=(0, 0, v, v))])


def color_impl(level):
    v = float_parameter(level, 1.8) + 0.1
    return c_vision.RandomColor(degrees=(v, v))


def rotate_impl(level):
    v = int_parameter(level, 30)
    return c_transforms.RandomChoice([c_vision.RandomRotation(degrees=(-v, -v)),
                                      c_vision.RandomRotation(degrees=(v, v))])


def solarize_impl(level):
    level = int_parameter(level, 256)
    v = 256 - level
    return c_vision.RandomSolarize(threshold=(0, v))


def posterize_impl(level):
    level = int_parameter(level, 4)
    v = 4 - level
    return c_vision.RandomPosterize(bits=(v, v))


def contrast_impl(level):
    v = float_parameter(level, 1.8) + 0.1
    return c_vision.RandomColorAdjust(contrast=(v, v))


def autocontrast_impl(level):
    v = level
    assert v == level
    return c_vision.AutoContrast()


def sharpness_impl(level):
    v = float_parameter(level, 1.8) + 0.1
    return c_vision.RandomSharpness(degrees=(v, v))


def brightness_impl(level):
    v = float_parameter(level, 1.8) + 0.1
    return c_vision.RandomColorAdjust(brightness=(v, v))
# define the Auto Augmentation policy
imagenet_policy = [
[(posterize_impl(8), 0.4), (rotate_impl(9), 0.6)],
[(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)],
[(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)],
[(posterize_impl(7), 0.6), (posterize_impl(6), 0.6)],
[(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)],
[(c_vision.Equalize(), 0.4), (rotate_impl(8), 0.8)],
[(solarize_impl(3), 0.6), (c_vision.Equalize(), 0.6)],
[(posterize_impl(5), 0.8), (c_vision.Equalize(), 1.0)],
[(rotate_impl(3), 0.2), (solarize_impl(8), 0.6)],
[(c_vision.Equalize(), 0.6), (posterize_impl(6), 0.4)],
[(rotate_impl(8), 0.8), (color_impl(0), 0.4)],
[(rotate_impl(9), 0.4), (c_vision.Equalize(), 0.6)],
[(c_vision.Equalize(), 0.0), (c_vision.Equalize(), 0.8)],
[(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(4), 0.6), (contrast_impl(8), 1.0)],
[(rotate_impl(8), 0.8), (color_impl(2), 1.0)],
[(color_impl(8), 0.8), (solarize_impl(7), 0.8)],
[(sharpness_impl(7), 0.4), (c_vision.Invert(), 0.6)],
[(shear_x(5), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(0), 0.4), (c_vision.Equalize(), 0.6)],
[(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)],
[(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)],
[(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(4), 0.6), (contrast_impl(8), 1.0)],
[(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)],
]
def create_dataset(dataset_path, do_train, repeat_num=1, batch_size=32, shuffle=True, num_samples=5):
    # create a train or eval imagenet2012 dataset for ResNet-50
    dataset = ds.ImageFolderDataset(dataset_path, num_parallel_workers=8,
                                    shuffle=shuffle, num_samples=num_samples)
    image_size = 224
    mean = [0.485 * 255, 0.456 * 255, 0.406 * 255]
    std = [0.229 * 255, 0.224 * 255, 0.225 * 255]
    # define map operations
    if do_train:
        trans = [
            c_vision.RandomCropDecodeResize(image_size, scale=(0.08, 1.0), ratio=(0.75, 1.333)),
        ]
        post_trans = [
            c_vision.RandomHorizontalFlip(prob=0.5),
        ]
    else:
        trans = [
            c_vision.Decode(),
            c_vision.Resize(256),
            c_vision.CenterCrop(image_size),
            c_vision.Normalize(mean=mean, std=std),
            c_vision.HWC2CHW()
        ]
    dataset = dataset.map(operations=trans, input_columns="image")
    if do_train:
        dataset = dataset.map(operations=c_vision.RandomSelectSubpolicy(imagenet_policy), input_columns=["image"])
        dataset = dataset.map(operations=post_trans, input_columns="image")
    type_cast_op = c_transforms.TypeCast(mstype.int32)
    dataset = dataset.map(operations=type_cast_op, input_columns="label")
    # apply the batch operation
    dataset = dataset.batch(batch_size, drop_remainder=True)
    # apply the repeat operation
    dataset = dataset.repeat(repeat_num)
    return dataset
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
import math
import numpy as np
import mindspore.nn as nn
from mindspore.ops import operations as P
from mindspore import Tensor
import mindspore.common.initializer as weight_init
groups_npu_1 = [[0, 1, 2, 3, 4, 5, 6, 7]]
groups_npu_4 = [[0, 1, 2, 3, 4, 5, 6, 7],
[8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31]]
groups_npu_16 = [[0, 1, 2, 3, 4, 5, 6, 7],
[8, 9, 10, 11, 12, 13, 14, 15],
[16, 17, 18, 19, 20, 21, 22, 23],
[24, 25, 26, 27, 28, 29, 30, 31],
[32, 33, 34, 35, 36, 37, 38, 39],
[40, 41, 42, 43, 44, 45, 46, 47],
[48, 49, 50, 51, 52, 53, 54, 55],
[56, 57, 58, 59, 60, 61, 62, 63],
[64, 65, 66, 67, 68, 69, 70, 71],
[72, 73, 74, 75, 76, 77, 78, 79],
[80, 81, 82, 83, 84, 85, 86, 87],
[88, 89, 90, 91, 92, 93, 94, 95],
[96, 97, 98, 99, 100, 101, 102, 103],
[104, 105, 106, 107, 108, 109, 110, 111],
[112, 113, 114, 115, 116, 117, 118, 119],
[120, 121, 122, 123, 124, 125, 126, 127]]
def _make_divisible(x, divisor=4, min_value=None):
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(x + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * x:
        new_v += divisor
    return new_v
class HardSwish(nn.Cell):
def __init__(self):
super(HardSwish, self).__init__()
self.relu6 = nn.ReLU6()
self.mul = P.Mul()
def construct(self, x):
return self.mul(x, self.relu6(x + 3.)/6)
class MyHSigmoid(nn.Cell):
def __init__(self):
super(MyHSigmoid, self).__init__()
self.relu6 = nn.ReLU6()
def construct(self, x):
return self.relu6(x + 3.) * 0.16666667
class Activation(nn.Cell):
def __init__(self, act_func):
super(Activation, self).__init__()
if act_func == 'relu':
self.act = nn.ReLU()
elif act_func == 'relu6':
self.act = nn.ReLU6()
elif act_func in ('hsigmoid', 'hard_sigmoid'):
self.act = MyHSigmoid()
elif act_func in ('hswish', 'hard_swish'):
self.act = nn.HSwish()
else:
raise NotImplementedError
def construct(self, x):
return self.act(x)
class DropConnect(nn.Cell):
    """Drop the whole input of randomly chosen samples in a batch (drop path)."""
    def __init__(self, drop_connect_rate=0.):
        super(DropConnect, self).__init__()
        self.shape = P.Shape()
        self.dtype = P.DType()
        self.keep_prob = 1 - drop_connect_rate
        self.dropout = P.Dropout(keep_prob=self.keep_prob)

    def construct(self, x):
        shape = self.shape(x)
        dtype = self.dtype(x)
        # one mask value per sample, broadcast over channels and spatial dims
        ones_tensor = P.Fill()(dtype, (shape[0], 1, 1, 1), 1)
        mask, _ = self.dropout(ones_tensor)
        x = x * mask
        return x


def drop_connect(inputs, training=False, drop_connect_rate=0.):
    """Apply DropConnect only in training and only when the rate is non-zero."""
    if not training or drop_connect_rate == 0:
        return inputs
    return DropConnect(drop_connect_rate)(inputs)
class GlobalAvgPooling(nn.Cell):
def __init__(self, keep_dims=False):
super(GlobalAvgPooling, self).__init__()
self.mean = P.ReduceMean(keep_dims=keep_dims)
def construct(self, x):
x = self.mean(x, (2, 3))
return x
class SeModule(nn.Cell):
def __init__(self, num_out, se_ratio=0.25, divisor=8):
super(SeModule, self).__init__()
num_mid = _make_divisible(num_out*se_ratio, divisor)
self.pool = GlobalAvgPooling(keep_dims=True)
self.conv_reduce = nn.Conv2d(in_channels=num_out, out_channels=num_mid,
kernel_size=1, has_bias=True, pad_mode='pad')
self.act1 = Activation('relu')
self.conv_expand = nn.Conv2d(in_channels=num_mid, out_channels=num_out,
kernel_size=1, has_bias=True, pad_mode='pad')
self.act2 = Activation('hsigmoid')
self.mul = P.Mul()
def construct(self, x):
out = self.pool(x)
out = self.conv_reduce(out)
out = self.act1(out)
out = self.conv_expand(out)
out = self.act2(out)
out = self.mul(x, out)
return out
class ConvBnAct(nn.Cell):
def __init__(self, num_in, num_out, kernel_size, stride=1, padding=0, num_groups=1,
use_act=True, act_type='relu', sync_bn=False):
super(ConvBnAct, self).__init__()
self.conv = nn.Conv2d(in_channels=num_in,
out_channels=num_out,
kernel_size=kernel_size,
stride=stride,
padding=padding,
group=num_groups,
has_bias=False,
pad_mode='pad')
if sync_bn:
self.bn = nn.SyncBatchNorm(num_out)
else:
self.bn = nn.BatchNorm2d(num_out)
self.use_act = use_act
self.act = Activation(act_type) if use_act else None
def construct(self, x):
out = self.conv(x)
out = self.bn(out)
if self.use_act:
out = self.act(out)
return out
class GhostModule(nn.Cell):
def __init__(self, num_in, num_out, kernel_size=1, stride=1, padding=0, ratio=2, dw_size=3,
use_act=True, act_type='relu', sync_bn=False):
super(GhostModule, self).__init__()
init_channels = math.ceil(num_out / ratio)
new_channels = init_channels * (ratio - 1)
self.primary_conv = ConvBnAct(num_in, init_channels, kernel_size=kernel_size, stride=stride, padding=padding,
num_groups=1, use_act=use_act, act_type=act_type, sync_bn=sync_bn)
self.cheap_operation = ConvBnAct(init_channels, new_channels, kernel_size=dw_size, stride=1, padding=dw_size//2,
num_groups=init_channels, use_act=use_act, act_type=act_type, sync_bn=sync_bn)
self.concat = P.Concat(axis=1)
def construct(self, x):
x1 = self.primary_conv(x)
x2 = self.cheap_operation(x1)
return self.concat((x1, x2))
class GhostBottleneck(nn.Cell):
def __init__(self, num_in, num_mid, num_out, dw_kernel_size, stride=1,
act_type='relu', se_ratio=0., divisor=8, drop_path_rate=0., sync_bn=False):
super(GhostBottleneck, self).__init__()
use_se = se_ratio is not None and se_ratio > 0.
self.drop_path_rate = drop_path_rate
self.ghost1 = GhostModule(num_in, num_mid, kernel_size=1,
stride=1, padding=0, act_type=act_type, sync_bn=sync_bn)
self.use_dw = stride > 1
if self.use_dw:
self.dw = ConvBnAct(num_mid, num_mid, kernel_size=dw_kernel_size, stride=stride,
padding=(dw_kernel_size-1)//2, act_type=act_type,
num_groups=num_mid, use_act=False, sync_bn=sync_bn)
self.use_se = use_se
if use_se:
self.se = SeModule(num_mid, se_ratio=se_ratio, divisor=divisor)
self.ghost2 = GhostModule(num_mid, num_out, kernel_size=1, stride=1,
padding=0, act_type=act_type, use_act=False, sync_bn=sync_bn)
self.down_sample = False
if num_in != num_out or stride != 1:
self.down_sample = True
if self.down_sample:
self.shortcut = nn.SequentialCell([
ConvBnAct(num_in, num_in, kernel_size=dw_kernel_size, stride=stride,
padding=(dw_kernel_size-1)//2, num_groups=num_in, use_act=False, sync_bn=sync_bn),
ConvBnAct(num_in, num_out, kernel_size=1, stride=1,
padding=0, num_groups=1, use_act=False, sync_bn=sync_bn),
])
self.add = P.Add()
def construct(self, x):
shortcut = x
out = self.ghost1(x)
if self.use_dw:
out = self.dw(out)
if self.use_se:
out = self.se(out)
out = self.ghost2(out)
if self.down_sample:
shortcut = self.shortcut(shortcut)
## drop path
if self.drop_path_rate > 0.:
out = drop_connect(out, self.training, self.drop_path_rate)
out = self.add(shortcut, out)
return out
def gen_cfgs_1x(layers, channels):
# generate configs of ghostnet
cfgs = []
for i in range(len(layers)):
cfgs.append([])
for j in range(layers[i]):
if i == 0:
cfgs[i].append([3, channels[i], channels[i], 0, 1])
elif i == 1:
if j == 0:
cfgs[i].append([3, channels[i-1]*3, channels[i], 0, 2])
else:
cfgs[i].append([3, channels[i]*3, channels[i], 0, 1])
elif i == 2:
if j == 0:
cfgs[i].append([5, channels[i-1]*3, channels[i], 0.25, 2])
else:
cfgs[i].append([5, channels[i]*3, channels[i], 0.25, 1])
elif i == 3:
if j == 0:
cfgs[i].append([3, channels[i-1]*6, channels[i], 0, 2])
elif j == 1:
cfgs[i].append([3, 200, channels[i], 0, 1])
else:
cfgs[i].append([3, 184, channels[i], 0, 1])
elif i == 4:
if j == 0:
cfgs[i].append([3, channels[i-1]*6, channels[i], 0.25, 1])
else:
cfgs[i].append([3, channels[i]*6, channels[i], 0.25, 1])
elif i == 5:
if j == 0:
cfgs[i].append([5, channels[i-1]*6, channels[i], 0.25, 2])
elif j%2 == 0:
cfgs[i].append([5, channels[i]*6, channels[i], 0.25, 1])
else:
cfgs[i].append([5, channels[i]*6, channels[i], 0, 1])
return cfgs
def gen_cfgs_large(layers, channels):
# generate configs of ghostnet
cfgs = []
for i in range(len(layers)):
cfgs.append([])
if i == 0:
for j in range(layers[i]):
cfgs[i].append([3, channels[i], channels[i], 0.1, 1])
elif i == 1:
for j in range(layers[i]):
if j == 0:
cfgs[i].append([3, channels[i-1]*3, channels[i], 0.1, 2])
else:
cfgs[i].append([3, channels[i]*3, channels[i], 0.1, 1])
elif i == 2:
for j in range(layers[i]):
if j == 0:
cfgs[i].append([5, channels[i-1]*3, channels[i], 0.1, 2])
else:
cfgs[i].append([5, channels[i]*3, channels[i], 0.1, 1])
elif i == 3:
for j in range(layers[i]):
if j == 0:
cfgs[i].append([3, channels[i-1]*6, channels[i], 0.1, 2])
else:
cfgs[i].append([3, channels[i]*2.5, channels[i], 0.1, 1])
elif i == 4:
for j in range(layers[i]):
if j == 0:
cfgs[i].append([3, channels[i-1]*6, channels[i], 0.1, 1])
else:
cfgs[i].append([3, channels[i]*6, channels[i], 0.1, 1])
elif i == 5:
for j in range(layers[i]):
if j == 0:
cfgs[i].append([5, channels[i-1]*6, channels[i], 0.1, 2])
else:
cfgs[i].append([5, channels[i]*6, channels[i], 0.1, 1])
return cfgs
class GhostNet(nn.Cell):
def __init__(self, layers, channels, num_classes=1000, multiplier=1.,
final_drop=0., drop_path_rate=0., large=False, zero_init_residual=False, sync_bn=False):
super(GhostNet, self).__init__()
if layers is None:
layers = [1, 2, 2, 4, 2, 5]
if channels is None:
channels = [16, 24, 40, 80, 112, 160]
self.large = large
if self.large:
self.cfgs = gen_cfgs_large(layers, channels)
else:
self.cfgs = gen_cfgs_1x(layers, channels)
self.drop_path_rate = drop_path_rate
self.inplanes = 16
first_conv_in_channel = 3
first_conv_out_channel = _make_divisible(multiplier * channels[0], 4)
self.conv_stem = nn.Conv2d(in_channels=first_conv_in_channel,
out_channels=first_conv_out_channel,
kernel_size=3, padding=1, stride=2,
has_bias=False, pad_mode='pad')
if sync_bn:
self.bn1 = nn.SyncBatchNorm(first_conv_out_channel)
else:
self.bn1 = nn.BatchNorm2d(first_conv_out_channel)
if self.large:
self.act1 = HardSwish()
else:
self.act1 = Activation('relu')
input_channel = first_conv_out_channel
stages = []
block = GhostBottleneck
block_idx = 0
block_count = sum(layers)
for cfg in self.cfgs:
layers = []
for k, exp_size, c, se_ratio, s in cfg:
output_channel = _make_divisible(c * multiplier, 4)
hidden_channel = _make_divisible(exp_size * multiplier, 4)
drop_path_rate = self.drop_path_rate * block_idx / block_count
if self.large:
layers.append(block(input_channel, hidden_channel, output_channel, k, s, act_type='relu',
se_ratio=se_ratio, divisor=8, drop_path_rate=drop_path_rate, sync_bn=sync_bn))
else:
layers.append(block(input_channel, hidden_channel, output_channel, k, s, act_type='relu',
se_ratio=se_ratio, divisor=4, drop_path_rate=drop_path_rate, sync_bn=sync_bn))
input_channel = output_channel
block_idx += 1
output_channel = _make_divisible(multiplier * exp_size, 4)
stages.append(nn.SequentialCell(layers))
if self.large:
stages.append(ConvBnAct(input_channel, output_channel, 1, act_type='relu', sync_bn=sync_bn)) ###HardSuishMe
else:
stages.append(ConvBnAct(input_channel, output_channel, 1, act_type='relu', sync_bn=sync_bn))
head_output_channel = max(1280, int(input_channel*5))
input_channel = output_channel
self.blocks = nn.SequentialCell(stages)
self.global_pool = GlobalAvgPooling(keep_dims=True)
self.conv_head = nn.Conv2d(input_channel,
head_output_channel,
kernel_size=1, padding=0, stride=1,
has_bias=True, pad_mode='pad')
if self.large:
self.act2 = HardSwish()
else:
self.act2 = Activation('relu')
self.squeeze = P.Flatten()
self.final_drop = 1-final_drop
if self.final_drop > 0:
self.dropout = nn.Dropout(self.final_drop)
self.classifier = nn.Dense(head_output_channel, num_classes, has_bias=True)
self._initialize_weights()
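        if zero_init_residual:
            # Zero the BatchNorm scale (gamma) at the end of each residual branch,
            # so the branch initially contributes zero to the block output.
            # ConvBnAct is not indexable, so the BN layer is reached through `.bn`.
            for _, m in self.cells_and_names():
                if isinstance(m, GhostBottleneck):
                    primary_gamma = m.ghost2.primary_conv.bn.gamma
                    primary_gamma.set_data(Tensor(np.zeros(primary_gamma.data.shape, dtype="float32")))
                    cheap_gamma = m.ghost2.cheap_operation.bn.gamma
                    cheap_gamma.set_data(Tensor(np.zeros(cheap_gamma.data.shape, dtype="float32")))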
def construct(self, x):
r"""construct of GhostNet"""
x = self.conv_stem(x)
x = self.bn1(x)
x = self.act1(x)
x = self.blocks(x)
x = self.global_pool(x)
x = self.conv_head(x)
x = self.act2(x)
x = self.squeeze(x)
if self.final_drop > 0:
x = self.dropout(x)
x = self.classifier(x)
return x
def _initialize_weights(self):
self.init_parameters_data()
for _, m in self.cells_and_names():
if isinstance(m, (nn.Conv2d)):
m.weight.set_data(weight_init.initializer(weight_init.HeUniform(),
m.weight.shape,
m.weight.dtype))
if m.bias is not None:
m.bias.set_data(
Tensor(np.zeros(m.bias.data.shape, dtype="float32")))
elif isinstance(m, (nn.BatchNorm2d, nn.SyncBatchNorm)):
m.gamma.set_data(
Tensor(np.ones(m.gamma.data.shape, dtype="float32")))
m.beta.set_data(
Tensor(np.zeros(m.beta.data.shape, dtype="float32")))
elif isinstance(m, nn.Dense):
m.weight.set_data(weight_init.initializer(weight_init.HeNormal(),
m.weight.shape,
m.weight.dtype))
if m.bias is not None:
m.bias.set_data(Tensor(np.zeros(m.bias.data.shape, dtype="float32")))
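The channel arithmetic used in the network definition above can be checked without MindSpore. The sketch below is a standalone, pure-Python restatement of `_make_divisible`, of the primary/cheap channel split in `GhostModule`, and of the linear drop-path schedule computed in `GhostNet.__init__`. The helper names `ghost_split` and `drop_path_schedule` are illustrative assumptions and do not exist in the repository.

```python
import math


def make_divisible(x, divisor=4, min_value=None):
    """Round x to a multiple of divisor without dropping below 90% of x."""
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(x + divisor / 2) // divisor * divisor)
    if new_v < 0.9 * x:
        new_v += divisor
    return new_v


def ghost_split(num_out, ratio=2):
    """Return (primary, cheap) channel counts of a Ghost module (hypothetical helper)."""
    init_channels = math.ceil(num_out / ratio)
    new_channels = init_channels * (ratio - 1)
    return init_channels, new_channels


def drop_path_schedule(layers, drop_path_rate):
    """Per-block drop-path rates, growing linearly with block index (hypothetical helper)."""
    block_count = sum(layers)
    return [drop_path_rate * idx / block_count for idx in range(block_count)]


if __name__ == "__main__":
    print(make_divisible(10, 8))      # rounding down to 8 would lose more than 10%
    print(ghost_split(40))            # even split between primary and cheap features
    print(len(drop_path_schedule([1, 2, 2, 4, 2, 5], 0.2)))
```

With the default `ratio=2`, an odd `num_out` would give one channel more than requested, because the two halves are concatenated without slicing. The channel counts that `GhostNet` passes in are rounded to multiples of 4, so this case should not arise there.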
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""custom callbacks for ema and loss"""
import numpy as np
from mindspore.train.callback import Callback
from mindspore import Tensor
class LossMonitor(Callback):
    """
    Monitor the loss in training.

    If the loss is NAN or INF, it will terminate training.

    Note:
        If per_print_times is 0, do not print loss.

    Args:
        lr_array (numpy.array): Scheduled learning rate, indexed by global step.
        total_epochs (int): Total number of epochs for training.
        per_print_times (int): Print the loss every `per_print_times` steps. Default: 1.
        start_epoch (int): Epoch to start from, used when resuming from a
            certain epoch. Default: 0.

    Raises:
        ValueError: If per_print_times is not an integer or is less than zero.
    """
    def __init__(self, lr_array, total_epochs, per_print_times=1, start_epoch=0):
        super(LossMonitor, self).__init__()
        if not isinstance(per_print_times, int) or per_print_times < 0:
            raise ValueError("per_print_times must be int and >= 0.")
        self._per_print_times = per_print_times
        self._lr_array = lr_array
        self._total_epochs = total_epochs
        self._start_epoch = start_epoch

    def step_end(self, run_context):
        """log epoch, step, loss and learning rate"""
        cb_params = run_context.original_args()
        loss = cb_params.net_outputs
        cur_epoch_num = cb_params.cur_epoch_num + self._start_epoch - 1
        if isinstance(loss, (tuple, list)):
            if isinstance(loss[0], Tensor) and isinstance(loss[0].asnumpy(), np.ndarray):
                loss = loss[0]
        if isinstance(loss, Tensor) and isinstance(loss.asnumpy(), np.ndarray):
            loss = np.mean(loss.asnumpy())
        global_step = cb_params.cur_step_num - 1
        cur_step_in_epoch = global_step % cb_params.batch_num + 1
        # np.mean returns a NumPy scalar (for example np.float32), which is not
        # necessarily a Python float, so np.floating is accepted as well.
        if isinstance(loss, (float, np.floating)) and (np.isnan(loss) or np.isinf(loss)):
            raise ValueError("epoch: {} step: {}. Invalid loss, terminating training.".format(
                cur_epoch_num, cur_step_in_epoch))
        if self._per_print_times != 0 and cur_step_in_epoch % self._per_print_times == 0:
            print("epoch: {}/{}, step: {}/{}, loss is {}, learning rate: {}".format(
                cur_epoch_num, self._total_epochs, cur_step_in_epoch,
                cb_params.batch_num, loss, self._lr_array[global_step]))
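The step bookkeeping in `LossMonitor.step_end` is easy to get off by one, so here is a small standalone sketch of the same arithmetic. It assumes, as the callback does, that the step counter is 1-based and that `batch_num` is the number of steps per epoch. The function name `step_position` is hypothetical.

```python
def step_position(cur_step_num, batch_num):
    """Map a 1-based global step count to (0-based global index, 1-based step in epoch)."""
    global_step = cur_step_num - 1
    cur_step_in_epoch = global_step % batch_num + 1
    return global_step, cur_step_in_epoch


if __name__ == "__main__":
    # The last step of epoch 1 and the first step of epoch 2, with 100 steps per epoch.
    print(step_position(100, 100))
    print(step_position(101, 100))
```

The 0-based `global_step` is also the index used to look up the learning rate, so the learning-rate array is expected to hold at least one entry per training step.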
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Data operations, will be used in train.py and eval.py"""
import math
import os
import numpy as np
import mindspore.dataset.vision.c_transforms as c_vision
import mindspore.dataset.transforms.c_transforms as c_transforms
import mindspore.common.dtype as mstype
import mindspore.dataset as ds
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore.dataset.vision import Inter
# values that should remain constant
DEFAULT_CROP_PCT = 0.875
IMAGENET_DEFAULT_MEAN = (0.485*255, 0.456*255, 0.406*255)
IMAGENET_DEFAULT_STD = (0.229*255, 0.224*255, 0.225*255)
# data preprocess configs
SCALE = (0.08, 1.0)
RATIO = (3./4., 4./3.)
ds.config.set_seed(1)
# define Auto Augmentation operators
PARAMETER_MAX = 10
def float_parameter(level, maxval):
return float(level) * maxval / PARAMETER_MAX
def int_parameter(level, maxval):
return int(level * maxval / PARAMETER_MAX)
def shear_x(level):
v = float_parameter(level, 0.3)
return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, shear=(-v, -v)),
c_vision.RandomAffine(degrees=0, shear=(v, v))])
def shear_y(level):
v = float_parameter(level, 0.3)
return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, shear=(0, 0, -v, -v)),
c_vision.RandomAffine(degrees=0, shear=(0, 0, v, v))])
def translate_x(level):
v = float_parameter(level, 150 / 331)
return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, translate=(-v, -v)),
c_vision.RandomAffine(degrees=0, translate=(v, v))])
def translate_y(level):
v = float_parameter(level, 150 / 331)
return c_transforms.RandomChoice([c_vision.RandomAffine(degrees=0, translate=(0, 0, -v, -v)),
c_vision.RandomAffine(degrees=0, translate=(0, 0, v, v))])
def color_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomColor(degrees=(v, v))
def rotate_impl(level):
v = int_parameter(level, 30)
return c_transforms.RandomChoice([c_vision.RandomRotation(degrees=(-v, -v)),
c_vision.RandomRotation(degrees=(v, v))])
def solarize_impl(level):
level = int_parameter(level, 256)
v = 256 - level
return c_vision.RandomSolarize(threshold=(0, v))
def posterize_impl(level):
level = int_parameter(level, 4)
v = 4 - level
return c_vision.RandomPosterize(bits=(v, v))
def contrast_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomColorAdjust(contrast=(v, v))
def autocontrast_impl(level):
v = level
assert v == level
return c_vision.AutoContrast()
def sharpness_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomSharpness(degrees=(v, v))
def brightness_impl(level):
v = float_parameter(level, 1.8) + 0.1
return c_vision.RandomColorAdjust(brightness=(v, v))
# define the Auto Augmentation policy
imagenet_policy = [
[(posterize_impl(8), 0.4), (rotate_impl(9), 0.6)],
[(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)],
[(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)],
[(posterize_impl(7), 0.6), (posterize_impl(6), 0.6)],
[(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)],
[(c_vision.Equalize(), 0.4), (rotate_impl(8), 0.8)],
[(solarize_impl(3), 0.6), (c_vision.Equalize(), 0.6)],
[(posterize_impl(5), 0.8), (c_vision.Equalize(), 1.0)],
[(rotate_impl(3), 0.2), (solarize_impl(8), 0.6)],
[(c_vision.Equalize(), 0.6), (posterize_impl(6), 0.4)],
[(rotate_impl(8), 0.8), (color_impl(0), 0.4)],
[(rotate_impl(9), 0.4), (c_vision.Equalize(), 0.6)],
[(c_vision.Equalize(), 0.0), (c_vision.Equalize(), 0.8)],
[(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(4), 0.6), (contrast_impl(8), 1.0)],
[(rotate_impl(8), 0.8), (color_impl(2), 1.0)],
[(color_impl(8), 0.8), (solarize_impl(7), 0.8)],
[(sharpness_impl(7), 0.4), (c_vision.Invert(), 0.6)],
[(shear_x(5), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(0), 0.4), (c_vision.Equalize(), 0.6)],
[(c_vision.Equalize(), 0.4), (solarize_impl(4), 0.2)],
[(solarize_impl(5), 0.6), (autocontrast_impl(5), 0.6)],
[(c_vision.Invert(), 0.6), (c_vision.Equalize(), 1.0)],
[(color_impl(4), 0.6), (contrast_impl(8), 1.0)],
[(c_vision.Equalize(), 0.8), (c_vision.Equalize(), 0.6)],
]
def split_imgs_and_labels(imgs, labels):
"""split data into labels and images"""
ret_imgs = []
ret_labels = []
for i, image in enumerate(imgs):
ret_imgs.append(image)
ret_labels.append(labels[i])
return np.array(ret_imgs), np.array(ret_labels)
def create_dataset(batch_size, train_data_url='', workers=8, target='Ascend', distributed=False,
                   input_size=224, color_jitter=0.5, autoaugment=False):
    """Create ImageNet training dataset"""
    if not os.path.exists(train_data_url):
        raise ValueError('Path does not exist: {}'.format(train_data_url))
    if target == "Ascend":
        device_num, rank_id = _get_rank_info()
    else:
        if distributed:
            init()
            rank_id = get_rank()
            device_num = get_group_size()
        else:
            device_num = 1
            rank_id = 0
    if device_num == 1:
        dataset_train = ds.ImageFolderDataset(train_data_url, num_parallel_workers=workers, shuffle=True)
    else:
        # each device reads its own shard of the dataset
        dataset_train = ds.ImageFolderDataset(train_data_url, num_parallel_workers=workers, shuffle=True,
                                              num_shards=device_num, shard_id=rank_id)
    type_cast_op = c_transforms.TypeCast(mstype.int32)
    random_resize_crop_bicubic = c_vision.RandomCropDecodeResize(input_size, scale=SCALE,
                                                                 ratio=RATIO, interpolation=Inter.BICUBIC)
    random_horizontal_flip_op = c_vision.RandomHorizontalFlip(prob=0.5)
    adjust_range = (max(0, 1 - color_jitter), 1 + color_jitter)
    random_color_jitter_op = c_vision.RandomColorAdjust(brightness=adjust_range,
                                                        contrast=adjust_range,
                                                        saturation=adjust_range)
    normalize_op = c_vision.Normalize(
        IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)
    channel_op = c_vision.HWC2CHW()
    # assemble all the transforms
    if autoaugment:
        # crop and decode, apply one AutoAugment sub-policy, then flip and normalize
        dataset_train = dataset_train.map(input_columns=["image"],
                                          operations=[random_resize_crop_bicubic],
                                          num_parallel_workers=workers)
        dataset_train = dataset_train.map(input_columns=["image"],
                                          operations=c_vision.RandomSelectSubpolicy(imagenet_policy),
                                          num_parallel_workers=workers)
        dataset_train = dataset_train.map(input_columns=["image"],
                                          operations=[random_horizontal_flip_op, normalize_op, channel_op],
                                          num_parallel_workers=workers)
    else:
        image_ops = [random_resize_crop_bicubic, random_horizontal_flip_op,
                     random_color_jitter_op, normalize_op, channel_op]
        dataset_train = dataset_train.map(input_columns=["image"],
                                          operations=image_ops,
                                          num_parallel_workers=workers)
    dataset_train = dataset_train.map(input_columns=["label"],
                                      operations=type_cast_op,
                                      num_parallel_workers=workers)
    # batch the samples, dropping the last incomplete batch
    ds_train = dataset_train.batch(batch_size, drop_remainder=True)
    ds_train = ds_train.repeat(1)
    return ds_train
def create_dataset_val(batch_size=128, val_data_url='', workers=8, target='Ascend', distributed=False,
                       input_size=224):
    """Create ImageNet validation dataset"""
    if not os.path.exists(val_data_url):
        raise ValueError('Path does not exist: {}'.format(val_data_url))
    if target == "Ascend":
        device_num, rank_id = _get_rank_info()
    else:
        if distributed:
            init()
            rank_id = get_rank()
            device_num = get_group_size()
        else:
            device_num = 1
            rank_id = 0
    if device_num == 1:
        dataset = ds.ImageFolderDataset(val_data_url, num_parallel_workers=workers)
    else:
        dataset = ds.ImageFolderDataset(val_data_url, num_parallel_workers=workers,
                                        num_shards=device_num, shard_id=rank_id)
    # resize so that a center crop of input_size covers DEFAULT_CROP_PCT of the image
    scale_size = None
    if isinstance(input_size, tuple):
        assert len(input_size) == 2
        if input_size[-1] == input_size[-2]:
            scale_size = int(math.floor(input_size[0] / DEFAULT_CROP_PCT))
        else:
            scale_size = tuple([int(x / DEFAULT_CROP_PCT) for x in input_size])
    else:
        scale_size = int(math.floor(input_size / DEFAULT_CROP_PCT))
    type_cast_op = c_transforms.TypeCast(mstype.int32)
    decode_op = c_vision.Decode()
    resize_op = c_vision.Resize(size=scale_size, interpolation=Inter.BICUBIC)
    center_crop = c_vision.CenterCrop(size=input_size)
    normalize_op = c_vision.Normalize(
        IMAGENET_DEFAULT_MEAN, IMAGENET_DEFAULT_STD)
    channel_op = c_vision.HWC2CHW()
    image_ops = [decode_op, resize_op, center_crop, normalize_op, channel_op]
    dataset = dataset.map(input_columns=["label"], operations=type_cast_op,
                          num_parallel_workers=workers)
    dataset = dataset.map(input_columns=["image"], operations=image_ops,
                          num_parallel_workers=workers)
    dataset = dataset.batch(batch_size, drop_remainder=True)
    dataset = dataset.repeat(1)
    return dataset
def _get_rank_info():
    """
    get rank size and rank id
    """
    rank_size = int(os.environ.get("RANK_SIZE", 1))
    if rank_size > 1:
        rank_size = get_group_size()
        rank_id = get_rank()
    else:
        rank_size = 1
        rank_id = 0
    return rank_size, rank_id
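The AutoAugment helpers above turn an integer level between 0 and `PARAMETER_MAX` into an operator magnitude, and the validation pipeline derives its resize target from the crop ratio. The standalone sketch below repeats that arithmetic in plain Python so the resulting numbers can be inspected without MindSpore. The function name `eval_resize_size` is an illustrative assumption; the constants are copied from the code above.

```python
import math

PARAMETER_MAX = 10
DEFAULT_CROP_PCT = 0.875


def float_parameter(level, maxval):
    """Scale a level in [0, PARAMETER_MAX] to a float magnitude in [0, maxval]."""
    return float(level) * maxval / PARAMETER_MAX


def int_parameter(level, maxval):
    """Scale a level in [0, PARAMETER_MAX] to an integer magnitude in [0, maxval]."""
    return int(level * maxval / PARAMETER_MAX)


def eval_resize_size(input_size):
    """Resize target for a square input before the center crop (hypothetical helper)."""
    return int(math.floor(input_size / DEFAULT_CROP_PCT))


if __name__ == "__main__":
    print(int_parameter(9, 30))          # rotation angle in degrees for rotate_impl(9)
    print(256 - int_parameter(5, 256))   # upper solarize threshold for solarize_impl(5)
    print(eval_resize_size(224))         # resize edge used before a 224 center crop
```

For a 224 input this gives the familiar resize-to-256, center-crop-to-224 evaluation setup.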
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
import os
import stat
import time
from mindspore import save_checkpoint
from mindspore import log as logger
from mindspore.train.callback import Callback
class EvalCallBack(Callback):
    """
    Evaluate the model during training and keep the best checkpoint.

    Args:
        eval_function (function): Evaluation function, called with eval_param_dict.
        eval_param_dict (dict): Arguments passed to eval_function.
        interval (int): Number of epochs between two evaluations. Default: 1.
        eval_start_epoch (int): First epoch at which to evaluate. Default: 1.
        save_best_ckpt (bool): Whether to save the best checkpoint. Default: True.
        ckpt_directory (str): Directory in which the best checkpoint is saved. Default: "./".
        besk_ckpt_name (str): File name of the best checkpoint. The spelling of this
            keyword is kept so that existing callers keep working. Default: "best.ckpt".
        metrics_name (str): Name of the metric, used in log messages. Default: "acc".
    """
    def __init__(self, eval_function, eval_param_dict, interval=1, eval_start_epoch=1, save_best_ckpt=True,
                 ckpt_directory="./", besk_ckpt_name="best.ckpt", metrics_name="acc"):
        super(EvalCallBack, self).__init__()
        self.eval_param_dict = eval_param_dict
        self.eval_function = eval_function
        self.eval_start_epoch = eval_start_epoch
        if interval < 1:
            raise ValueError("interval should be >= 1.")
        self.interval = interval
        self.save_best_ckpt = save_best_ckpt
        self.best_res = 0
        self.best_epoch = 0
        if not os.path.isdir(ckpt_directory):
            os.makedirs(ckpt_directory)
        self.best_ckpt_path = os.path.join(ckpt_directory, besk_ckpt_name)
        self.metrics_name = metrics_name

    def remove_ckpoint_file(self, file_name):
        """Make the file writable and remove it; log a warning on failure."""
        try:
            os.chmod(file_name, stat.S_IWRITE)
            os.remove(file_name)
        except OSError:
            logger.warning("OSError, failed to remove the older ckpt file %s.", file_name)
        except ValueError:
            logger.warning("ValueError, failed to remove the older ckpt file %s.", file_name)

    def epoch_end(self, run_context):
        """Run the evaluation on schedule and update the best result and checkpoint."""
        cb_params = run_context.original_args()
        cur_epoch = cb_params.cur_epoch_num
        if cur_epoch >= self.eval_start_epoch and (cur_epoch - self.eval_start_epoch) % self.interval == 0:
            eval_start = time.time()
            res = self.eval_function(self.eval_param_dict)
            eval_cost = time.time() - eval_start
            print("epoch: {}, {}: {}, eval_cost:{:.2f}".format(cur_epoch, self.metrics_name, res, eval_cost))
            if res >= self.best_res:
                self.best_res = res
                self.best_epoch = cur_epoch
                print("update best result: {}".format(res))
                if self.save_best_ckpt:
                    if os.path.exists(self.best_ckpt_path):
                        self.remove_ckpoint_file(self.best_ckpt_path)
                    save_checkpoint(cb_params.train_network, self.best_ckpt_path)
                    print("update best checkpoint at: {}".format(self.best_ckpt_path))

    def end(self, run_context):
        """Print the best result once training has finished."""
        del run_context  # unused
        print("End training, the best {0} is: {1}, the best {0} epoch is {2}".format(self.metrics_name,
                                                                                     self.best_res,
                                                                                     self.best_epoch))
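The condition in `EvalCallBack.epoch_end` decides which epochs trigger an evaluation. As a minimal standalone sketch of that schedule, assuming 1-based epoch numbers as supplied by the training loop, the function below reproduces the test. The name `should_evaluate` is hypothetical and does not appear in the repository.

```python
def should_evaluate(cur_epoch, eval_start_epoch=1, interval=1):
    """Return True when an evaluation is due at the end of cur_epoch."""
    return cur_epoch >= eval_start_epoch and (cur_epoch - eval_start_epoch) % interval == 0


if __name__ == "__main__":
    # Epochs out of the first 12 that are evaluated when starting at epoch 3
    # and evaluating every 4 epochs.
    print([e for e in range(1, 13) if should_evaluate(e, eval_start_epoch=3, interval=4)])
```

The interval is counted from `eval_start_epoch`, not from epoch 1, so changing the start epoch shifts every later evaluation.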
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""GhostNet model define"""
from functools import partial
import math
import numpy as np
import mindspore.nn as nn
from mindspore.ops import operations as P
from mindspore import Tensor
import mindspore.common.initializer as weight_init
__all__ = ['ghostnet']
def _make_divisible(x, divisor=4):
return int(np.ceil(x * 1. / divisor) * divisor)
class MyHSigmoid(nn.Cell):
"""
Hard Sigmoid definition.
Args:
Returns:
Tensor, output tensor.
Examples:
>>> MyHSigmoid()
"""
def __init__(self):
super(MyHSigmoid, self).__init__()
self.relu6 = nn.ReLU6()
def construct(self, x):
return self.relu6(x + 3.) * 0.16666667
class Activation(nn.Cell):
"""
Activation definition.
Args:
act_func(string): activation name.
Returns:
Tensor, output tensor.
"""
def __init__(self, act_func):
super(Activation, self).__init__()
if act_func == 'relu':
self.act = nn.ReLU()
elif act_func == 'relu6':
self.act = nn.ReLU6()
elif act_func in ('hsigmoid', 'hard_sigmoid'):
self.act = MyHSigmoid()
elif act_func in ('hswish', 'hard_swish'):
self.act = nn.HSwish()
else:
raise NotImplementedError
def construct(self, x):
return self.act(x)
class GlobalAvgPooling(nn.Cell):
"""
Global avg pooling definition.
Args:
Returns:
Tensor, output tensor.
Examples:
>>> GlobalAvgPooling()
"""
def __init__(self, keep_dims=False):
super(GlobalAvgPooling, self).__init__()
self.mean = P.ReduceMean(keep_dims=keep_dims)
def construct(self, x):
x = self.mean(x, (2, 3))
return x
class SE(nn.Cell):
"""
SE wrapper definition.
Args:
num_out (int): Output channel.
ratio (int): middle output ratio.
Returns:
Tensor, output tensor.
Examples:
>>> SE(4)
"""
def __init__(self, num_out, ratio=4):
super(SE, self).__init__()
num_mid = _make_divisible(num_out // ratio)
self.pool = GlobalAvgPooling(keep_dims=True)
self.conv_reduce = nn.Conv2d(in_channels=num_out, out_channels=num_mid,
kernel_size=1, has_bias=True, pad_mode='pad')
self.act1 = Activation('relu')
self.conv_expand = nn.Conv2d(in_channels=num_mid, out_channels=num_out,
kernel_size=1, has_bias=True, pad_mode='pad')
self.act2 = Activation('hsigmoid')
self.mul = P.Mul()
def construct(self, x):
out = self.pool(x)
out = self.conv_reduce(out)
out = self.act1(out)
out = self.conv_expand(out)
out = self.act2(out)
out = self.mul(x, out)
return out
class ConvUnit(nn.Cell):
"""
ConvUnit wrapper definition.
Args:
num_in (int): Input channel.
num_out (int): Output channel.
kernel_size (int): Input kernel size.
stride (int): Stride size.
padding (int): Padding number.
num_groups (int): Output num group.
use_act (bool): Used activation or not.
act_type (string): Activation type.
Returns:
Tensor, output tensor.
Examples:
>>> ConvUnit(3, 3)
"""
def __init__(self, num_in, num_out, kernel_size=1, stride=1, padding=0, num_groups=1,
use_act=True, act_type='relu'):
super(ConvUnit, self).__init__()
self.conv = nn.Conv2d(in_channels=num_in,
out_channels=num_out,
kernel_size=kernel_size,
stride=stride,
padding=padding,
group=num_groups,
has_bias=False,
pad_mode='pad')
self.bn = nn.BatchNorm2d(num_out)
self.use_act = use_act
self.act = Activation(act_type) if use_act else None
def construct(self, x):
out = self.conv(x)
out = self.bn(out)
if self.use_act:
out = self.act(out)
return out
class GhostModule(nn.Cell):
"""
GhostModule wrapper definition.
Args:
num_in (int): Input channel.
num_out (int): Output channel.
kernel_size (int): Input kernel size.
stride (int): Stride size.
padding (int): Padding number.
ratio (int): Reduction ratio.
dw_size (int): kernel size of cheap operation.
use_act (bool): Used activation or not.
act_type (string): Activation type.
Returns:
Tensor, output tensor.
Examples:
>>> GhostModule(3, 3)
"""
def __init__(self, num_in, num_out, kernel_size=1, stride=1, padding=0, ratio=2, dw_size=3,
use_act=True, act_type='relu'):
super(GhostModule, self).__init__()
init_channels = math.ceil(num_out / ratio)
new_channels = init_channels * (ratio - 1)
self.primary_conv = ConvUnit(num_in, init_channels, kernel_size=kernel_size, stride=stride, padding=padding,
num_groups=1, use_act=use_act, act_type=act_type)
self.cheap_operation = ConvUnit(init_channels, new_channels, kernel_size=dw_size, stride=1, padding=dw_size//2,
num_groups=init_channels, use_act=use_act, act_type=act_type)
self.concat = P.Concat(axis=1)
def construct(self, x):
x1 = self.primary_conv(x)
x2 = self.cheap_operation(x1)
return self.concat((x1, x2))
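The channel bookkeeping in `GhostModule.__init__` can be checked without MindSpore. This sketch reproduces only the two arithmetic lines (`init_channels` and `new_channels`) and reports how wide the concatenated output is; `ghost_channels` is a hypothetical helper name. Note that the module as shown concatenates `x1` and `x2` without slicing, so the output is wider than `num_out` whenever `num_out` is not divisible by `ratio`.

```python
import math


def ghost_channels(num_out, ratio=2):
    """Return (primary, cheap, total) channel counts for a Ghost module."""
    init_channels = math.ceil(num_out / ratio)
    new_channels = init_channels * (ratio - 1)
    return init_channels, new_channels, init_channels + new_channels


even = ghost_channels(48)   # 48 divides evenly by ratio=2
odd = ghost_channels(15)    # 15 does not, so the concat is wider than num_out
```

All expansion and output widths in the configs further down are even, so with the default `ratio=2` the concatenated width matches `num_out` there.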
class GhostBottleneck(nn.Cell):
"""
GhostBottleneck wrapper definition.
Args:
num_in (int): Input channel.
num_mid (int): Middle channel.
num_out (int): Output channel.
kernel_size (int): Input kernel size.
stride (int): Stride size.
act_type (str): Activation type.
use_se (bool): Use SE wrapper or not.
Returns:
Tensor, output tensor.
Examples:
>>> GhostBottleneck(16, 48, 24, 3)
"""
def __init__(self, num_in, num_mid, num_out, kernel_size, stride=1, act_type='relu', use_se=False):
super(GhostBottleneck, self).__init__()
self.ghost1 = GhostModule(num_in, num_mid, kernel_size=1,
stride=1, padding=0, act_type=act_type)
self.use_dw = stride > 1
self.dw = None
if self.use_dw:
self.dw = ConvUnit(num_mid, num_mid, kernel_size=kernel_size, stride=stride,
padding=self._get_pad(kernel_size), act_type=act_type, num_groups=num_mid, use_act=False)
self.use_se = use_se
if use_se:
self.se = SE(num_mid)
self.ghost2 = GhostModule(num_mid, num_out, kernel_size=1, stride=1,
padding=0, act_type=act_type, use_act=False)
self.down_sample = False
if num_in != num_out or stride != 1:
self.down_sample = True
self.shortcut = None
if self.down_sample:
self.shortcut = nn.SequentialCell([
ConvUnit(num_in, num_in, kernel_size=kernel_size, stride=stride,
padding=self._get_pad(kernel_size), num_groups=num_in, use_act=False),
ConvUnit(num_in, num_out, kernel_size=1, stride=1,
padding=0, num_groups=1, use_act=False),
])
self.add = P.Add()
def construct(self, x):
r"""construct of ghostnet"""
shortcut = x
out = self.ghost1(x)
if self.use_dw:
out = self.dw(out)
if self.use_se:
out = self.se(out)
out = self.ghost2(out)
if self.down_sample:
shortcut = self.shortcut(shortcut)
out = self.add(shortcut, out)
return out
def _get_pad(self, kernel_size):
"""set the padding number"""
pad = 0
if kernel_size == 1:
pad = 0
elif kernel_size == 3:
pad = 1
elif kernel_size == 5:
pad = 2
elif kernel_size == 7:
pad = 3
else:
raise NotImplementedError
return pad
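The lookup in `_get_pad` maps each supported odd kernel size to `kernel_size // 2`, which keeps the spatial size unchanged at stride 1. The sketch below shows that relationship with the standard convolution output-size formula (no dilation). `get_pad` and `conv_out_size` are hypothetical names, and unlike the original this version accepts any odd kernel size instead of raising `NotImplementedError` beyond 7.

```python
def get_pad(kernel_size):
    """'Same'-style padding for an odd kernel size."""
    if kernel_size % 2 == 0:
        raise ValueError("only odd kernel sizes are handled in this sketch")
    return kernel_size // 2


def conv_out_size(size, kernel_size, stride, padding):
    """Spatial output size of a convolution without dilation."""
    return (size + 2 * padding - kernel_size) // stride + 1


pads = {k: get_pad(k) for k in (1, 3, 5, 7)}
same = [conv_out_size(56, k, 1, get_pad(k)) for k in (1, 3, 5, 7)]
halved = conv_out_size(56, 5, 2, get_pad(5))
```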
class GhostNet(nn.Cell):
"""
GhostNet architecture.
Args:
model_cfgs (dict): Model configuration with the keys 'cfg', 'cls_ch_squeeze' and 'cls_ch_expand'.
num_classes (int): Number of output classes. Default is 1000.
multiplier (float): Channel multiplier; scaled channel counts are rounded with _make_divisible. Default is 1.
final_drop (float): Value passed to nn.Dropout before the classifier; dropout is skipped when it is 0. Default is 0.
Returns:
Tensor, output tensor.
Examples:
>>> ghostnet_1x(num_classes=1000)
"""
def __init__(self, model_cfgs, num_classes=1000, multiplier=1., final_drop=0.):
super(GhostNet, self).__init__()
self.cfgs = model_cfgs['cfg']
self.inplanes = 16
first_conv_in_channel = 3
first_conv_out_channel = _make_divisible(multiplier * self.inplanes)
self.conv_stem = nn.Conv2d(in_channels=first_conv_in_channel,
out_channels=first_conv_out_channel,
kernel_size=3, padding=1, stride=2,
has_bias=False, pad_mode='pad')
self.bn1 = nn.BatchNorm2d(first_conv_out_channel)
self.act1 = Activation('relu')
self.blocks = []
for layer_cfg in self.cfgs:
self.blocks.append(self._make_layer(kernel_size=layer_cfg[0],
exp_ch=_make_divisible(
multiplier * layer_cfg[1]),
out_channel=_make_divisible(
multiplier * layer_cfg[2]),
use_se=layer_cfg[3],
act_func=layer_cfg[4],
stride=layer_cfg[5]))
output_channel = _make_divisible(
multiplier * model_cfgs["cls_ch_squeeze"])
self.blocks.append(ConvUnit(_make_divisible(multiplier * self.cfgs[-1][2]), output_channel,
kernel_size=1, stride=1, padding=0, num_groups=1, use_act=True))
self.blocks = nn.SequentialCell(self.blocks)
self.global_pool = GlobalAvgPooling(keep_dims=True)
self.conv_head = nn.Conv2d(in_channels=output_channel,
out_channels=model_cfgs['cls_ch_expand'],
kernel_size=1, padding=0, stride=1,
has_bias=True, pad_mode='pad')
self.act2 = Activation('relu')
self.squeeze = P.Flatten()
self.final_drop = final_drop
if self.final_drop > 0:
self.dropout = nn.Dropout(self.final_drop)
self.classifier = nn.Dense(
model_cfgs['cls_ch_expand'], num_classes, has_bias=True)
self._initialize_weights()
def construct(self, x):
r"""construct of GhostNet"""
x = self.conv_stem(x)
x = self.bn1(x)
x = self.act1(x)
x = self.blocks(x)
x = self.global_pool(x)
x = self.conv_head(x)
x = self.act2(x)
x = self.squeeze(x)
if self.final_drop > 0:
x = self.dropout(x)
x = self.classifier(x)
return x
def _make_layer(self, kernel_size, exp_ch, out_channel, use_se, act_func, stride=1):
mid_planes = exp_ch
out_planes = out_channel
layer = GhostBottleneck(self.inplanes, mid_planes, out_planes,
kernel_size, stride=stride, act_type=act_func, use_se=use_se)
self.inplanes = out_planes
return layer
def _initialize_weights(self):
"""
Initialize weights.
Args:
Returns:
None.
Examples:
>>> _initialize_weights()
"""
self.init_parameters_data()
for _, m in self.cells_and_names():
if isinstance(m, (nn.Conv2d)):
m.weight.set_data(weight_init.initializer(weight_init.HeUniform(),
m.weight.shape,
m.weight.dtype))
if m.bias is not None:
m.bias.set_data(
Tensor(np.zeros(m.bias.data.shape, dtype="float32")))
elif isinstance(m, nn.BatchNorm2d):
m.gamma.set_data(
Tensor(np.ones(m.gamma.data.shape, dtype="float32")))
m.beta.set_data(
Tensor(np.zeros(m.beta.data.shape, dtype="float32")))
elif isinstance(m, nn.Dense):
m.weight.set_data(weight_init.initializer(weight_init.HeNormal(),
m.weight.shape,
m.weight.dtype))
if m.bias is not None:
m.bias.set_data(
Tensor(np.zeros(m.bias.data.shape, dtype="float32")))
def ghostnet(model_name, **kwargs):
"""
Constructs a GhostNet model
"""
model_cfgs = {
"1x": {
"cfg": [
# k, exp, c, se, nl, s,
# stage1
[3, 16, 16, False, 'relu', 1],
# stage2
[3, 48, 24, False, 'relu', 2],
[3, 72, 24, False, 'relu', 1],
# stage3
[5, 72, 40, True, 'relu', 2],
[5, 120, 40, True, 'relu', 1],
# stage4
[3, 240, 80, False, 'relu', 2],
[3, 200, 80, False, 'relu', 1],
[3, 184, 80, False, 'relu', 1],
[3, 184, 80, False, 'relu', 1],
[3, 480, 112, True, 'relu', 1],
[3, 672, 112, True, 'relu', 1],
# stage5
[5, 672, 160, True, 'relu', 2],
[5, 960, 160, False, 'relu', 1],
[5, 960, 160, True, 'relu', 1],
[5, 960, 160, False, 'relu', 1],
[5, 960, 160, True, 'relu', 1]],
"cls_ch_squeeze": 960,
"cls_ch_expand": 1280,
},
"nose_1x": {
"cfg": [
# k, exp, c, se, nl, s,
# stage1
[3, 16, 16, False, 'relu', 1],
# stage2
[3, 48, 24, False, 'relu', 2],
[3, 72, 24, False, 'relu', 1],
# stage3
[5, 72, 40, False, 'relu', 2],
[5, 120, 40, False, 'relu', 1],
# stage4
[3, 240, 80, False, 'relu', 2],
[3, 200, 80, False, 'relu', 1],
[3, 184, 80, False, 'relu', 1],
[3, 184, 80, False, 'relu', 1],
[3, 480, 112, False, 'relu', 1],
[3, 672, 112, False, 'relu', 1],
# stage5
[5, 672, 160, False, 'relu', 2],
[5, 960, 160, False, 'relu', 1],
[5, 960, 160, False, 'relu', 1],
[5, 960, 160, False, 'relu', 1],
[5, 960, 160, False, 'relu', 1]],
"cls_ch_squeeze": 960,
"cls_ch_expand": 1280,
}
}
return GhostNet(model_cfgs[model_name], **kwargs)
ghostnet_1x = partial(ghostnet, model_name="1x", final_drop=0.8)
ghostnet_nose_1x = partial(ghostnet, model_name="nose_1x", final_drop=0.8)
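Each config row above reads `[k, exp, c, se, nl, s]`: kernel size, expansion width, output width, SE flag, activation, and stride. As a rough sanity check, the sketch below copies the `"1x"` rows and derives the overall downsampling factor and the number of SE blocks. The helper names `total_stride` and `count_se_blocks` are hypothetical; the stem stride of 2 is taken from `conv_stem`.

```python
# Rows copied from the "1x" config: [k, exp, c, se, nl, s]
CFG_1X = [
    [3, 16, 16, False, 'relu', 1],
    [3, 48, 24, False, 'relu', 2],
    [3, 72, 24, False, 'relu', 1],
    [5, 72, 40, True, 'relu', 2],
    [5, 120, 40, True, 'relu', 1],
    [3, 240, 80, False, 'relu', 2],
    [3, 200, 80, False, 'relu', 1],
    [3, 184, 80, False, 'relu', 1],
    [3, 184, 80, False, 'relu', 1],
    [3, 480, 112, True, 'relu', 1],
    [3, 672, 112, True, 'relu', 1],
    [5, 672, 160, True, 'relu', 2],
    [5, 960, 160, False, 'relu', 1],
    [5, 960, 160, True, 'relu', 1],
    [5, 960, 160, False, 'relu', 1],
    [5, 960, 160, True, 'relu', 1],
]
STEM_STRIDE = 2  # conv_stem uses stride=2


def total_stride(cfg, stem_stride=STEM_STRIDE):
    """Multiply the stem stride by the stride of every bottleneck row."""
    stride = stem_stride
    for row in cfg:
        stride *= row[5]
    return stride


def count_se_blocks(cfg):
    """Count the rows whose SE flag is set."""
    return sum(1 for row in cfg if row[3])


stride_1x = total_stride(CFG_1X)
final_size = 224 // stride_1x   # spatial size before global pooling for a 224x224 input
se_blocks = count_se_blocks(CFG_1X)
```

The `"nose_1x"` config has the same strides and widths with every SE flag set to `False`, so only `se_blocks` would differ.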
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""define loss function for network."""
from mindspore.nn.loss.loss import _Loss
from mindspore.ops import operations as P
from mindspore.ops import functional as F
from mindspore import Tensor
from mindspore.common import dtype as mstype
import mindspore.nn as nn
class LabelSmoothingCrossEntropy(_Loss):
    """cross-entropy with label smoothing"""

    def __init__(self, smooth_factor=0.1, num_classes=1000):
        super(LabelSmoothingCrossEntropy, self).__init__()
        self.onehot = P.OneHot()
        self.on_value = Tensor(1.0 - smooth_factor, mstype.float32)
        self.off_value = Tensor(1.0 * smooth_factor /
                                (num_classes - 1), mstype.float32)
        self.ce = nn.SoftmaxCrossEntropyWithLogits(reduction='mean')
        self.cast = P.Cast()

    def construct(self, logits, label):
        label = self.cast(label, mstype.int32)
        one_hot_label = self.onehot(label, F.shape(
            logits)[1], self.on_value, self.off_value)
        loss_logit = self.ce(logits, one_hot_label)
        return loss_logit
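The smoothed targets used by `LabelSmoothingCrossEntropy` can be illustrated in plain Python. This sketch builds the same `on_value` / `off_value` target vector and evaluates a softmax cross-entropy against it for a single sample; it is not the MindSpore implementation, and `smoothed_targets` and `cross_entropy` are hypothetical helper names.

```python
import math


def smoothed_targets(label, num_classes, smooth_factor=0.1):
    """Target distribution: 1 - smooth_factor on the label, the rest spread evenly."""
    on_value = 1.0 - smooth_factor
    off_value = smooth_factor / (num_classes - 1)
    return [on_value if i == label else off_value for i in range(num_classes)]


def cross_entropy(logits, targets):
    """Softmax cross-entropy for one sample, using a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return -sum(t * (v - log_z) for v, t in zip(logits, targets))


targets = smoothed_targets(label=2, num_classes=4, smooth_factor=0.1)
loss = cross_entropy([0.0, 0.0, 0.0, 0.0], targets)
```

Because the off-label mass is divided by `num_classes - 1`, the targets still sum to 1, and for uniform logits the loss equals `log(num_classes)` regardless of the smoothing factor.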
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""model utils"""
import math
import argparse
import numpy as np
def str2bool(value):
    """Convert string arguments to bool type"""
    if value.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    if value.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    raise argparse.ArgumentTypeError('Boolean value expected.')
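A short usage sketch for `str2bool` as an `argparse` type. The function body is repeated so the snippet runs on its own; the `--train` flag mirrors the one in the training script, but the parser here is a stand-alone illustration.

```python
import argparse


def str2bool(value):
    """Convert string arguments to bool type"""
    if value.lower() in ('yes', 'true', 't', 'y', '1'):
        return True
    if value.lower() in ('no', 'false', 'f', 'n', '0'):
        return False
    raise argparse.ArgumentTypeError('Boolean value expected.')


parser = argparse.ArgumentParser()
parser.add_argument('--train', type=str2bool, default=True)

disabled = parser.parse_args(['--train', 'no']).train
default = parser.parse_args([]).train
```

`argparse` applies `type` only to string values, so a non-string default (such as `True` here, or `1` in the training script) is passed through unchanged.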
def get_lr_tinynet_c(base_lr, total_epochs, steps_per_epoch, decay_epochs=1, decay_rate=0.9,
                     warmup_epochs=0., warmup_lr_init=0., global_epoch=0):
    """Get scheduled learning rate"""
    lr_each_step = []
    total_steps = steps_per_epoch * total_epochs
    global_steps = steps_per_epoch * global_epoch
    # Per-epoch increment of the linear warmup.
    self_warmup_delta = ((base_lr - warmup_lr_init) /
                         warmup_epochs) if warmup_epochs > 0 else 0
    # Accept the decay rate either as a factor below 1 or as its reciprocal.
    self_decay_rate = decay_rate if decay_rate < 1 else 1 / decay_rate
    for i in range(total_steps):
        epochs = math.floor(i / steps_per_epoch)
        cond = 1 if (epochs < warmup_epochs) else 0
        warmup_lr = warmup_lr_init + epochs * self_warmup_delta
        # The decay exponent is counted from epoch 0, not from the end of warmup.
        decay_nums = math.floor(epochs / decay_epochs)
        decay_factor = math.pow(self_decay_rate, decay_nums)
        decay_lr = base_lr * decay_factor
        lr = cond * warmup_lr + (1 - cond) * decay_lr
        lr_each_step.append(lr)
    # Drop the steps that belong to epochs already trained.
    lr_each_step = lr_each_step[global_steps:]
    lr_each_step = np.array(lr_each_step).astype(np.float32)
    return lr_each_step
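To make the step schedule of `get_lr_tinynet_c` easier to follow, here is a per-epoch sketch in plain Python. It follows the same rule (linear warmup, then `base_lr * decay_rate ** floor(epoch / decay_epochs)`), but works on epochs rather than steps and assumes `decay_rate < 1`; `step_lr` is a hypothetical name.

```python
import math


def step_lr(epoch, base_lr, decay_epochs, decay_rate,
            warmup_epochs=0, warmup_lr_init=0.0):
    """Learning rate for one epoch: linear warmup, then exponential step decay."""
    if epoch < warmup_epochs:
        delta = (base_lr - warmup_lr_init) / warmup_epochs
        return warmup_lr_init + epoch * delta
    # The exponent counts epochs from 0, so decay has already accumulated
    # by the time warmup ends.
    return base_lr * math.pow(decay_rate, math.floor(epoch / decay_epochs))


schedule = [step_lr(e, base_lr=0.1, decay_epochs=2, decay_rate=0.5, warmup_epochs=2)
            for e in range(6)]
```

With these illustrative values the schedule never reaches `base_lr`: the first post-warmup epoch is already decayed once.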
def get_lr(lr_init, lr_end, lr_max, warmup_epochs, total_epochs, steps_per_epoch):
    """
    generate learning rate array

    Args:
       lr_init(float): init learning rate
       lr_end(float): end learning rate
       lr_max(float): max learning rate
       warmup_epochs(int): number of warmup epochs
       total_epochs(int): total epoch of training
       steps_per_epoch(int): steps of one epoch

    Returns:
       np.array, learning rate array
    """
    lr_each_step = []
    total_steps = steps_per_epoch * total_epochs
    warmup_steps = steps_per_epoch * warmup_epochs
    for i in range(total_steps):
        if i < warmup_steps:
            # Linear warmup from lr_init to lr_max.
            lr = lr_init + (lr_max - lr_init) * i / warmup_steps
        else:
            # Cosine decay from lr_max towards lr_end.
            lr = lr_end + \
                (lr_max - lr_end) * \
                (1. + math.cos(math.pi * (i - warmup_steps) /
                               (total_steps - warmup_steps))) / 2.
        if lr < 0.0:
            lr = 0.0
        lr_each_step.append(lr)
    lr_each_step = np.array(lr_each_step).astype(np.float32)
    return lr_each_step
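The warmup-plus-cosine rule in `get_lr` can be written as a function of a single step. The sketch below uses the same formula in plain Python with small illustrative numbers (2 warmup steps, 6 total steps); `cosine_lr` is a hypothetical name.

```python
import math


def cosine_lr(step, lr_init, lr_end, lr_max, warmup_steps, total_steps):
    """Linear warmup to lr_max, then half-cosine decay towards lr_end."""
    if step < warmup_steps:
        return lr_init + (lr_max - lr_init) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    lr = lr_end + (lr_max - lr_end) * (1.0 + math.cos(math.pi * progress)) / 2.0
    return max(lr, 0.0)


lrs = [cosine_lr(s, lr_init=0.0, lr_end=0.0, lr_max=0.4,
                 warmup_steps=2, total_steps=6) for s in range(6)]
```

The peak `lr_max` is reached on the first step after warmup, and because `progress` stays below 1 the last step ends slightly above `lr_end`.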
def add_weight_decay(net, weight_decay=1e-5, skip_list=None):
    """Apply weight decay only to conv and dense weights (len(shape) >= 2)

    Args:
        net (mindspore.nn.Cell): MindSpore network instance
        weight_decay (float): weight decay to be used.
        skip_list (tuple): names of parameters that get no weight decay

    Returns:
        A list of parameter groups, separated by weight decay value.
    """
    decay = []
    no_decay = []
    if not skip_list:
        skip_list = ()
    for param in net.trainable_params():
        # 1-D parameters (BatchNorm gamma/beta) and biases are not decayed.
        if len(param.shape) == 1 or \
                param.name.endswith(".bias") or \
                param.name in skip_list:
            no_decay.append(param)
        else:
            decay.append(param)
    return [
        {'params': no_decay, 'weight_decay': 0.},
        {'params': decay, 'weight_decay': weight_decay}]
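The grouping rule in `add_weight_decay` depends only on each parameter's name and shape, so it can be tried without a network. In this sketch `Param` is a hypothetical stand-in for a MindSpore parameter, and the parameter names are illustrative rather than read from a real model.

```python
from collections import namedtuple

# Hypothetical stand-in exposing only the two attributes the rule looks at.
Param = namedtuple("Param", ["name", "shape"])


def split_params(params, weight_decay=1e-5, skip_list=()):
    """Put 1-D parameters, biases and skipped names in the no-decay group."""
    decay, no_decay = [], []
    for p in params:
        if len(p.shape) == 1 or p.name.endswith(".bias") or p.name in skip_list:
            no_decay.append(p)
        else:
            decay.append(p)
    return [{"params": no_decay, "weight_decay": 0.0},
            {"params": decay, "weight_decay": weight_decay}]


params = [Param("conv_stem.weight", (16, 3, 3, 3)),
          Param("bn1.gamma", (16,)),
          Param("classifier.weight", (1000, 1280)),
          Param("classifier.bias", (1000,))]
groups = split_params(params, weight_decay=1e-4)
no_decay_names = [p.name for p in groups[0]["params"]]
decay_names = [p.name for p in groups[1]["params"]]
```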
def count_params(net):
    """Count number of parameters in the network

    Args:
        net (mindspore.nn.Cell): Mindspore network instance

    Returns:
        total_params (int): Total number of trainable params
    """
    total_params = 0
    for param in net.trainable_params():
        total_params += np.prod(param.shape)
    return total_params
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
"""Training Interface"""
import sys
import os
import zipfile
import argparse
import hashlib
from pathlib import Path
from mindspore.communication.management import init, get_rank, get_group_size
from mindspore import Model
from mindspore.context import ParallelMode
from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, TimeMonitor
from mindspore.train.serialization import load_checkpoint, load_param_into_net
from mindspore.train.loss_scale_manager import FixedLossScaleManager
from mindspore.nn.loss import SoftmaxCrossEntropyWithLogits
from mindspore.nn import SGD, RMSProp, Momentum, Loss, Top1CategoricalAccuracy, \
Top5CategoricalAccuracy
from mindspore import context, Tensor
from mindspore.common import set_seed
from src.dataset import create_dataset, create_dataset_val
from src.utils import add_weight_decay, str2bool, get_lr_tinynet_c, get_lr
from src.callback import LossMonitor
from src.loss import LabelSmoothingCrossEntropy
from src.eval_callback import EvalCallBack
from src.big_net import GhostNet
local_plog_path = os.path.join(Path.home(), 'ascend/log/')
os.environ["GLOG_v"] = '3'
os.environ["ASCEND_SLOG_PRINT_TO_STDOUT"] = '0'
os.environ["ASCEND_GLOBAL_LOG_LEVEL"] = '2'
os.environ["ASCEND_GLOBAL_EVENT_ENABLE"] = '0'
parser = argparse.ArgumentParser(description='Training')
# model architecture parameters
parser.add_argument('--data_path', type=str, default="", metavar="DIR",
help='path to dataset')
parser.add_argument('--model', default='tinynet_c', type=str, metavar='MODEL',
                    help='Name of model to train, e.g. ghostnet or big_net (default: "tinynet_c")')
parser.add_argument('--num-classes', type=int, default=1000, metavar='N',
help='number of label classes (default: 1000)')
parser.add_argument('--channels', type=str, default='16,24,40,80,112,160',
                    help='channel config of model architecture')
parser.add_argument('--layers', type=str, default='1,2,2,4,2,5',
                    help='layer config of model architecture')
parser.add_argument('--large', action='store_true', default=False,
help='ghostnet1x or ghostnet larger')
parser.add_argument('--input_size', type=int, default=224,
help='input size of model.')
# preprocess parameters
parser.add_argument('--autoaugment', action='store_true', default=False,
help='whether use autoaugment for training images')
# training parameters
parser.add_argument('--epochs', type=int, default=200, metavar='N',
                    help='number of epochs to train (default: 200)')
parser.add_argument('-b', '--batch-size', type=int, default=256, metavar='N',
                    help='input batch size for training (default: 256)')
parser.add_argument('--drop', type=float, default=0.0, metavar='DROP',
help='Dropout rate (default: 0.) for big_net, use "1-drop", for others, use "drop"')
parser.add_argument('--drop-path', type=float, default=0.0, metavar='DROP',
help='Drop connect rate (default: 0.)')
parser.add_argument('--opt', default='sgd', type=str, metavar='OPTIMIZER',
                    help='Optimizer (default: "sgd")')
parser.add_argument('--opt-eps', default=1e-8, type=float, metavar='EPSILON',
help='Optimizer Epsilon (default: 1e-8)')
parser.add_argument('--momentum', type=float, default=0.9, metavar='M',
help='SGD momentum (default: 0.9)')
parser.add_argument('--weight-decay', type=float, default=0.0001,
help='weight decay (default: 0.0001)')
parser.add_argument('--lr_decay_style', type=str, default='cosine',
help='learning rate decay method(default: cosine), cosine_step')
parser.add_argument('--lr', type=float, default=0.01, metavar='LR',
help='learning rate (default: 0.01)')
parser.add_argument('--lr_end', type=float, default=1e-6,
help='The end of training learning rate (default: 1e-6)')
parser.add_argument('--lr_max', type=float, default=0.4,
help='the max of training learning rate (default: 0.4)')
parser.add_argument('--warmup-lr', type=float, default=0.0001, metavar='LR',
help='warmup learning rate (default: 0.0001)')
parser.add_argument('--decay-epochs', type=float, default=30, metavar='N',
help='epoch interval to decay LR')
parser.add_argument('--warmup-epochs', type=int, default=3, metavar='N',
help='epochs to warmup LR, if scheduler supports')
parser.add_argument('--decay-rate', '--dr', type=float, default=0.1, metavar='RATE',
help='LR decay rate (default: 0.1)')
parser.add_argument('--smoothing', type=float, default=0.1,
help='label smoothing (default: 0.1)')
parser.add_argument('--ema-decay', type=float, default=0.9999,
                    help='decay factor for model weights moving average (default: 0.9999)')
# training information parameters
parser.add_argument('--amp_level', type=str, default='O0')
parser.add_argument('--per_print_times', type=int, default=1)
# batch norm parameters
parser.add_argument('--sync_bn', action='store_true', default=False,
help='Use sync bn in distributed mode. (default: False)')
parser.add_argument('--bn-tf', action='store_true', default=False,
help='Use Tensorflow BatchNorm defaults for models that \
support it (default: False)')
parser.add_argument('--bn-momentum', type=float, default=None,
help='BatchNorm momentum override (if not None)')
parser.add_argument('--bn-eps', type=float, default=None,
help='BatchNorm epsilon override (if not None)')
# parallel parameters
parser.add_argument('-j', '--workers', type=int, default=4, metavar='N',
                    help='how many training processes to use (default: 4)')
parser.add_argument('--distributed', action='store_true', default=False)
parser.add_argument('--dataset_sink', action='store_false', default=True)
parser.add_argument('--device_num', type=int, default=8, help='Device num.')
# checkpoint config
parser.add_argument('--save_checkpoint', action='store_false', default=True)
parser.add_argument('--ckpt', type=str, default=None)
parser.add_argument('--ckpt_save_path', type=str, default='./')
parser.add_argument('--ckpt_save_epoch', type=int, default=5)
parser.add_argument('--loss_scale', type=int,
default=128, help='static loss scale')
parser.add_argument('--train', type=str2bool, default=1, help='train or eval')
parser.add_argument('--device_target', type=str, default='Ascend', choices=("Ascend", "GPU", "CPU"),
help="Device target, support Ascend, GPU and CPU.")
# train on cloud
parser.add_argument('--cloud', action='store_true', default=False, help='Whether train on cloud.')
parser.add_argument('--data_url', type=str, default="/home/ma-user/work/data/imagenet", help='path to dataset.')
parser.add_argument('--zip_url', type=str, default="s3://bucket-800/liuchuanjian/data/imagenet_zip/imagenet.zip")
parser.add_argument('--train_url', type=str, default=" ", help='train_dir.')
parser.add_argument('--tmp_data_dir', default='/cache/data/', help='temp data dir')
parser.add_argument('--tmp_save_dir', default='/cache/liuchuanjian/', help='temp save dir')
parser.add_argument('--save_dir', default='/autotest/liuchuanjian/result/big_model',
                    help='directory for checkpoints and results')
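Several flags above use `action='store_false'` with `default=True` (for example `--dataset_sink` and `--save_checkpoint`), which means that passing the flag turns the feature off. A minimal stand-alone sketch of that behaviour, using only two of the flags:

```python
import argparse

parser = argparse.ArgumentParser()
# Present on the command line -> False; absent -> True.
parser.add_argument('--dataset_sink', action='store_false', default=True)
# Present on the command line -> True; absent -> False.
parser.add_argument('--large', action='store_true', default=False)

defaults = parser.parse_args([])
flipped = parser.parse_args(['--dataset_sink', '--large'])
```

So `python train.py --dataset_sink` disables dataset sink mode rather than enabling it.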
_global_sync_count = 0
def get_file_md5(file_name):
    m = hashlib.md5()
    with open(file_name, 'rb') as fobj:
        while True:
            data = fobj.read(4096)
            if not data:
                break
            m.update(data)
    return m.hexdigest()
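Reading in fixed-size chunks, as `get_file_md5` does, yields the same digest as hashing the whole payload at once while keeping memory use constant. The sketch below checks this on a temporary file; `file_md5` is a hypothetical name and the payload is arbitrary.

```python
import hashlib
import os
import tempfile


def file_md5(path, chunk_size=4096):
    """MD5 hex digest of a file, read in chunks of chunk_size bytes."""
    digest = hashlib.md5()
    with open(path, 'rb') as fobj:
        for chunk in iter(lambda: fobj.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


payload = b'ghostnet' * 1000
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'blob.bin')
    with open(path, 'wb') as fobj:
        fobj.write(payload)
    chunked = file_md5(path)
whole = hashlib.md5(payload).hexdigest()
```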
def get_device_id():
    device_id = os.getenv('DEVICE_ID', '0')
    return int(device_id)
def get_device_num():
    device_num = os.getenv('RANK_SIZE', '1')
    return int(device_num)
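def unzip_file(zip_src, dst_dir):
    """Extract every member of zip_src into dst_dir."""
    if not zipfile.is_zipfile(zip_src):
        raise Exception('This is not zip')
    # The context manager closes the archive even if extraction fails.
    with zipfile.ZipFile(zip_src, 'r') as fz:
        for file_item in fz.namelist():
            fz.extract(file_item, dst_dir)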
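A round-trip sketch for the unzip step: build a tiny archive in a temporary directory, extract it, and read a file back. It uses `ZipFile.extractall` and raises `ValueError`, which are substitutions for the member-by-member loop and the generic `Exception` in the original; the archive contents are made up for illustration.

```python
import os
import tempfile
import zipfile


def unzip_file(zip_src, dst_dir):
    """Extract a zip archive into dst_dir, rejecting non-zip input."""
    if not zipfile.is_zipfile(zip_src):
        raise ValueError('{} is not a zip archive'.format(zip_src))
    with zipfile.ZipFile(zip_src, 'r') as fz:
        fz.extractall(dst_dir)


with tempfile.TemporaryDirectory() as tmp:
    archive = os.path.join(tmp, 'data.zip')
    with zipfile.ZipFile(archive, 'w') as fz:
        fz.writestr('imagenet/train/readme.txt', 'hello')
    out_dir = os.path.join(tmp, 'out')
    unzip_file(archive, out_dir)
    with open(os.path.join(out_dir, 'imagenet', 'train', 'readme.txt')) as fobj:
        content = fobj.read()
```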
def sync_data(args):
"""
Copy the ImageNet zip from args.zip_url to args.tmp_data_dir, verify its MD5 and unzip it.
Only one process per server performs the copy; the other processes wait until the lock file exists.
"""
import time
global _global_sync_count
sync_lock = "/tmp/copy_sync.lock" + str(_global_sync_count)
_global_sync_count += 1
if not mox.file.exists(args.tmp_data_dir):
mox.file.make_dirs(args.tmp_data_dir)
target_file = os.path.join(args.tmp_data_dir, 'imagenet.zip')
# Each server contains at most 8 devices.
if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
print("from path: ", args.zip_url)
print("to path: ", target_file)
mox.file.copy_parallel(args.zip_url, target_file)
print('Zip file copy success.')
os.system('ls /cache/data/')
print('Computing MD5 of copied file.')
file_md5 = get_file_md5(target_file)
print('MD5 is: ', file_md5)
re_upload_num = 20
while file_md5 != '674b2e3a185c2c82c8d211a0465c386e' and re_upload_num >= 0:
mox.file.copy_parallel(args.zip_url, target_file)
print('Zip file copy success.')
print('Computing MD5 of copied file.')
file_md5 = get_file_md5(target_file)
print('MD5 is: ', file_md5)
re_upload_num -= 1
print('reupload num is: ', re_upload_num)
print('Starting unzip file.')
unzip_file(target_file, args.tmp_data_dir)
print('Unzip file success.')
print("===finish data synchronization===")
try:
os.mknod(sync_lock)
except IOError:
pass
print("===save flag===")
while True:
if os.path.exists(sync_lock):
break
time.sleep(1)
args.data_url = os.path.join(args.tmp_data_dir, 'imagenet')
args.data_path = args.data_url
print("Finish sync data from {} to {}.".format(args.zip_url, target_file))
def apply_eval(eval_param):
eval_model = eval_param["model"]
eval_ds = eval_param["dataset"]
metrics_name = eval_param["metrics_name"]
res = eval_model.eval(eval_ds)
return res[metrics_name]
def main(args):
"""Main entrance for training"""
if args.channels:
channel_config = []
for item in args.channels.split(','):
channel_config.append(int(item.strip()))
if args.layers:
layer_config = []
for item in args.layers.split(','):
layer_config.append(int(item.strip()))
print(sys.argv)
set_seed(1)
target = args.device_target
ckpt_save_dir = args.save_dir
context.set_context(mode=context.GRAPH_MODE, device_target=target, save_graphs=False)
device_num = get_device_num()
if args.distributed:
devid = int(os.getenv('DEVICE_ID'))
context.set_context(device_id=devid,
reserve_class_name_in_scope=True)
init()
context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True, parameter_broadcast=True)
print('device_num: ', context.get_auto_parallel_context("device_num"))
args.rank = get_rank()
args.group_size = get_group_size()
print('Rank {}, group_size {}'.format(args.rank, args.group_size))
# create save dir
if args.cloud:
if not mox.file.exists(os.path.join(args.tmp_save_dir, str(args.rank))):
mox.file.make_dirs(os.path.join(args.tmp_save_dir, str(args.rank)))
args.save_dir = os.path.join(args.tmp_save_dir, str(args.rank))
else:
# Check the save_dir exists or not
if not os.path.exists(args.save_dir):
os.makedirs(args.save_dir)
print('Imagenet dir: ', args.data_path)
ckpt_save_dir = os.path.join(args.save_dir, "ckpt_" + str(get_rank()))
net = GhostNet(layers=layer_config, channels=channel_config, num_classes=args.num_classes,
final_drop=args.drop, drop_path_rate=args.drop_path,
large=args.large, zero_init_residual=False, sync_bn=args.sync_bn)
print(net)
time_cb = TimeMonitor(data_size=batches_per_epoch)
if args.lr_decay_style == 'cosine_step':
# original tinynet_c lr_decay method
lr_array = get_lr_tinynet_c(base_lr=args.lr, total_epochs=args.epochs,
steps_per_epoch=batches_per_epoch, decay_epochs=args.decay_epochs,
decay_rate=args.decay_rate, warmup_epochs=args.warmup_epochs,
warmup_lr_init=args.warmup_lr, global_epoch=0)
elif args.lr_decay_style == 'cosine':
# standard cosine lr_decay method, used in official mindspore ghostnet
lr_array = get_lr(lr_init=args.lr, lr_end=args.lr_end,
lr_max=args.lr_max, warmup_epochs=args.warmup_epochs,
total_epochs=args.epochs, steps_per_epoch=batches_per_epoch)
else:
raise Exception('Unknown lr decay method!!!!!')
lr = Tensor(lr_array)
loss_cb = LossMonitor(lr_array, args.epochs, per_print_times=args.per_print_times,
start_epoch=0)
if args.opt == 'sgd':
param_group = add_weight_decay(net, weight_decay=args.weight_decay)
optimizer = SGD(param_group, learning_rate=lr,
momentum=args.momentum, weight_decay=args.weight_decay,
loss_scale=args.loss_scale)
elif args.opt == 'rmsprop':
param_group = add_weight_decay(net, weight_decay=args.weight_decay)
optimizer = RMSProp(param_group, learning_rate=lr,
decay=0.9, weight_decay=args.weight_decay,
momentum=args.momentum, epsilon=args.opt_eps,
loss_scale=args.loss_scale)
elif args.opt == 'momentum':
optimizer = Momentum(net.trainable_params(), learning_rate=lr,
momentum=args.momentum, loss_scale=args.loss_scale,
weight_decay=args.weight_decay)
if args.smoothing == 0.0:
loss = SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
else:
loss = LabelSmoothingCrossEntropy(smooth_factor=args.smoothing, num_classes=args.num_classes)
loss_scale_manager = FixedLossScaleManager(args.loss_scale, drop_overflow_update=False)
loss.add_flags_recursive(fp32=True, fp16=False)
eval_metrics = {'Validation-Loss': Loss(),
'Top1-Acc': Top1CategoricalAccuracy(),
'Top5-Acc': Top5CategoricalAccuracy()}
if args.ckpt:
ckpt = load_checkpoint(args.ckpt)
load_param_into_net(net, ckpt)
net.set_train(False)
model = Model(net, loss, optimizer, metrics=eval_metrics,
loss_scale_manager=loss_scale_manager,
amp_level=args.amp_level)
eval_param_dict = {"model": model, "dataset": val_dataset, "metrics_name": "Top1-Acc"}
eval_cb = EvalCallBack(apply_eval, eval_param_dict, interval=1,
eval_start_epoch=0, save_best_ckpt=False,
ckpt_directory=ckpt_save_dir, besk_ckpt_name="best_acc.ckpt",
metrics_name="Top1-Acc")
callbacks = [loss_cb, eval_cb, time_cb]
if args.save_checkpoint:
config_ck = CheckpointConfig(save_checkpoint_steps=args.ckpt_save_epoch* batches_per_epoch,
keep_checkpoint_max=5)
ckpt_cb = ModelCheckpoint(prefix=args.model, directory=ckpt_save_dir, config=config_ck)
callbacks += [ckpt_cb]
print('dataset_sink_mode: ', args.dataset_sink)
model.train(args.epochs, train_dataset, callbacks=callbacks,
dataset_sink_mode=args.dataset_sink)
if __name__ == '__main__':
opts, unparsed = parser.parse_known_args()
# copy data
if opts.cloud:
import moxing as mox
sync_data(opts)
# input image size of the network
train_dataset = val_dataset = None
train_data_url = os.path.join(opts.data_path, 'train')
val_data_url = os.path.join(opts.data_path, 'val')
print('train data path: ', train_data_url)
print('val data path: ', val_data_url)
val_dataset = create_dataset_val(opts.batch_size,
val_data_url,
workers=opts.workers,
target=opts.device_target,
distributed=False,
input_size=opts.input_size)
if opts.train:
train_dataset = create_dataset(opts.batch_size,
train_data_url,
workers=opts.workers,
target=opts.device_target,
distributed=opts.distributed,
input_size=opts.input_size,
autoaugment=opts.autoaugment)
batches_per_epoch = train_dataset.get_dataset_size()
print('Number of batches:', batches_per_epoch)
main(opts)
if opts.cloud:
mox.file.copy_parallel(opts.save_dir, opts.train_url)
if os.path.exists(local_plog_path):
mox.file.copy_parallel(local_plog_path, opts.train_url)
else:
print('{} not exist....'.format(local_plog_path))