Commit cf6c5563 authored by zhaoting, committed by Gitee

!3065 [Xi'an Jiaotong University][University Contribution][MindSpore][LEO]

Merge pull request !3065 from jialing/master
parents ce32d64b 748eba33
Showing changes with 475 additions and 87 deletions
......@@ -105,8 +105,9 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
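# Environment Requirements
- Hardware (GPU)
- Hardware (GPU or Ascend)
- Set up the hardware environment with GPU processors.
- Set up the hardware environment with Ascend processors.
- Framework
- [MindSpore](https://www.mindspore.cn/install/en)
- For more details, see the following resources: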
......@@ -130,6 +131,24 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
bash scripts/run_eval_gpu.sh [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE]
# Run the evaluation example
bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt
```
- Running on Ascend processors
```bash
# Training usage
bash scripts/run_train_ascend.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH]
# For example:
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/1P_mini_5
# Distributed training usage
bash scripts/run_distribution_ascend.sh [RANK_TABLE_FILE] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH]
# For example:
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/8P_mini_5
# Evaluation usage
bash scripts/run_eval_ascend.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE]
# For example:
bash scripts/run_eval_ascend.sh 4 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt
```
The commands above cover the first experiment; for the other three, see the Training section.
......@@ -144,8 +163,11 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
├─ train.py # training script
├─ eval.py # evaluation script
├─ scripts
│ ├─ run_eval_gpu.sh # launch evaluation
│ └─ run_train_gpu.sh # launch training
│ ├─ run_distribution_ascend.sh # launch 8-device Ascend training
│ ├─ run_eval_ascend.sh # launch evaluation on Ascend
│ ├─ run_eval_gpu.sh # launch evaluation on GPU
│ ├─ run_train_ascend.sh # launch training on Ascend
│ └─ run_train_gpu.sh # launch training on GPU
├─ src
│ ├─ data.py # data processing
│ ├─ model.py # LEO model
......@@ -211,7 +233,7 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
outer_lr: 0.004 # hyperparameter
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
```
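For more configuration details, see the config folder. **Before starting training, set the hyperparameters above for the experiment you are running.**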
......@@ -221,13 +243,13 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
- The four experiments use different hyperparameters
| Hyperparameter | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot |
| ------------------------------ | ------------------- | ------------------- | --------------------- | --------------------- |
| ------------------------------ |---------------------|---------------------|-----------------------| --------------------- |
| `dropout` | 0.3 | 0.3 | 0.2 | 0.3 |
| `kl_weight` | 0.001 | 0.001 | 0 | 0.001 |
| `encoder_penalty_weight` | 1E-9 | 2.66E-7 | 5.7E-1 | 5.7E-6 |
| `l2_penalty_weight` | 0.0001 | 8.5E-6 | 5.10E-6 | 3.6E-10 |
| `orthogonality_penalty_weight` | 303.0 | 0.00152 | 4.88E-1 | 0.188 |
| `outer_lr` | 0.004 | 0.004 | 0.004 | 0.0025 |
| `orthogonality_penalty_weight` | 303.0 | 0.00152 | 4.88E-1 | 0.188 |
| `outer_lr` | 0.005 | 0.005 | 0.005 | 0.0025 |
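
The per-experiment settings in the table above can be kept in one lookup keyed by dataset and shot count. The following is a minimal sketch, not code from the repository: `EXPERIMENTS` and `get_hparams` are hypothetical names, only three of the table's rows are included, and the `outer_lr` values are taken from the updated rows.

```python
# Values copied from the hyperparameter table (updated outer_lr rows).
# EXPERIMENTS and get_hparams are illustrative names, not part of the repository.
EXPERIMENTS = {
    ("miniImageNet", 1): {"dropout": 0.3, "kl_weight": 0.001, "outer_lr": 0.005},
    ("miniImageNet", 5): {"dropout": 0.3, "kl_weight": 0.001, "outer_lr": 0.005},
    ("tieredImageNet", 1): {"dropout": 0.2, "kl_weight": 0.0, "outer_lr": 0.005},
    ("tieredImageNet", 5): {"dropout": 0.3, "kl_weight": 0.001, "outer_lr": 0.0025},
}


def get_hparams(dataset_name, shots):
    """Return the settings for one experiment, or fail with a readable message."""
    try:
        return EXPERIMENTS[(dataset_name, shots)]
    except KeyError:
        raise ValueError(
            "no experiment defined for %s %d-shot" % (dataset_name, shots)
        ) from None


if __name__ == "__main__":
    print(get_hparams("tieredImageNet", 5))
```

A lookup like this makes it harder to launch one experiment with another experiment's values, which the README warns about.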
### Training
......@@ -240,6 +262,15 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5
```
- After configuring the parameters above, run on Ascend
```bash
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpts/1P_mini_1
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/1P_mini_5
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5
```
Training runs in the background; you can follow its progress in log files such as `1P_miniImageNet_1_train.log`.
After training finishes, the checkpoint files are in checkpoint folders such as `./ckpt/1P_mini_1`.
......@@ -256,6 +287,15 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/8P_tiered_5
```
- After configuring the parameters above, run on Ascend
```bash
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpts/8P_mini_1
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/8P_mini_5
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpts/8P_tiered_1
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpts/8P_tiered_5
```
As with single-device training, you can follow progress in `8P_miniImageNet_1_train.log` and find the checkpoint files in checkpoint folders such as the default `./ckpt/8P_mini_1`.
## Evaluation Process
......@@ -273,6 +313,15 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt
```
- Run on Ascend
```bash
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1/xxx.ckpt
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt
```
Evaluation runs in the background; you can follow it in log files such as `1P_miniImageNet_1_eval.log`.
# Model Description
......@@ -283,19 +332,19 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
- Training parameters
| Parameter | LEO |
| -------------| ----------------------------------------------------------- |
| Resource | NVIDIA GeForce RTX 3090; 10496 CUDA cores; 24 GB memory |
| Upload date | 2022-03-27 |
| MindSpore version | 1.7.0 |
| Dataset | miniImageNet |
| Optimizer | Adam |
| Loss function | Cross Entropy Loss |
| Output | Accuracy |
| Loss | GANLoss, L1Loss, localLoss, DTLoss |
| Fine-tuned checkpoint | 672KB (.ckpt file) |
| Parameter | LEO | Ascend |
| -------------| ----------------------------------------------------------- |-----------------------------------------------|
| Resource | NVIDIA GeForce RTX 3090; 10496 CUDA cores; 24 GB memory | Ascend 910; CPU 24 cores; memory 256G; OS Euler2.8 |
| Upload date | 2022-03-27 | 2022-06-12 |
| MindSpore version | 1.7.0 | 1.5.0 |
| Dataset | miniImageNet | miniImageNet |
| Optimizer | Adam | Adam |
| Loss function | Cross Entropy Loss | Cross Entropy Loss |
| Output | Accuracy | Accuracy |
| Loss | GANLoss, L1Loss, localLoss, DTLoss | GANLoss, L1Loss, localLoss, DTLoss |
| Fine-tuned checkpoint | 672KB (.ckpt file) | 672KB (.ckpt file) |
- Evaluation performance
- GPU evaluation performance
| Experiment | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot |
| ----- | ------------------- | ------------------- | --------------------- | --------------------- |
......@@ -306,13 +355,13 @@ LEO由以下几个模块组成,分类器,编码器,关系网络和编码
- Evaluation parameters
| Parameter | LEO |
| ------------ | ----------------------------------------------------------- |
| Resource | NVIDIA GeForce RTX 3090; 10496 CUDA cores; 24 GB memory |
| Upload date | 2022-03-27 |
| MindSpore version | 1.7.0 |
| Dataset | miniImageNet |
| Output | Accuracy |
| Parameter | LEO | Ascend |
| ------------ | ----------------------------------------------------------- |-----------------------------------------------|
| Resource | NVIDIA GeForce RTX 3090; 10496 CUDA cores; 24 GB memory | Ascend 910; CPU 24 cores; memory 256G; OS Euler2.8 |
| Upload date | 2022-03-27 | 2022-06-12 |
| MindSpore version | 1.7.0 | 1.5.0 |
| Dataset | miniImageNet | miniImageNet |
| Output | Accuracy | Accuracy |
- Evaluation accuracy
......
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -36,10 +38,10 @@ metatrain_batch_size: 12
metavalid_batch_size: 200
metatest_batch_size: 200
num_steps_limit: int(1e5)
outer_lr: 0.004 # parameters
outer_lr: 0.005 # parameters
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
......
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -36,10 +38,10 @@ metatrain_batch_size: 12
metavalid_batch_size: 200
metatest_batch_size: 200
num_steps_limit: int(1e5)
outer_lr: 0.004 # parameters
outer_lr: 0.005 # parameters
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
......
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -20,10 +22,11 @@ inner_unroll_length: 5
finetuning_unroll_length: 5
num_latents: 64
inner_lr_init: 1.0
finetuning_lr_init: 0.001
finetuning_lr_init: 0.0005
dropout_rate: 0.3 # parameters
kl_weight: 0.001 # parameters
encoder_penalty_weight: 2.66E-7 # parameters
encoder_penalty_weight: 2.66E-7 # parameters origin
l2_penalty_weight: 8.5E-6 # parameters
orthogonality_penalty_weight: 0.00152 # parameters
# ==============================================================================
......@@ -36,15 +39,15 @@ metatrain_batch_size: 12
metavalid_batch_size: 200
metatest_batch_size: 200
num_steps_limit: int(1e5)
outer_lr: 0.004 # parameters
outer_lr: 0.005 # parameters origin
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
file_name: 'leo'
file_format: 'MINDIR' # ['AIR', 'MINDIR']
file_format: 'AIR'
---
......@@ -54,6 +57,8 @@ data_url: 'Dataset url for obs'
train_url: 'Training output url for obs'
data_path: 'Dataset path for local'
output_path: 'Training output path for local'
ckpt_url: 'ckpt files'
result_url: 'infer result files'
device_target: 'Target device type'
enable_profiling: 'Whether enable profiling while training, default: False'
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -39,7 +41,7 @@ num_steps_limit: int(1e5)
outer_lr: 0.0025 # parameters
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
......
......@@ -35,7 +35,7 @@ def eval_leo(init_config, inner_model_config, outer_model_config):
total_test_steps = 100
data_utils = data.Data_Utils(
train=False, seed=100, way=outer_model_config['num_classes'],
train=False, seed=1, way=outer_model_config['num_classes'],
shot=outer_model_config['num_tr_examples_per_class'],
data_path=init_config['data_path'], dataset_name=init_config['dataset_name'],
embedding_crop=init_config['embedding_crop'],
......@@ -75,6 +75,7 @@ if __name__ == '__main__':
initConfig = config.get_init_config()
inner_model_Config = config.get_inner_model_config()
outer_model_Config = config.get_outer_model_config()
args = config.get_config(get_args=True)
print("===============inner_model_config=================")
for key, value in inner_model_Config.items():
......@@ -84,5 +85,16 @@ if __name__ == '__main__':
print(key+": "+str(value))
context.set_context(mode=context.GRAPH_MODE, device_target=initConfig['device_target'])
if args.enable_modelarts:
import moxing as mox
mox.file.copy_parallel(
src_url=args.data_url, dst_url='/cache/dataset/device_' + os.getenv('DEVICE_ID'))
train_dataset_path = os.path.join('/cache/dataset/device_' + os.getenv('DEVICE_ID'), "embeddings")
ckpt_path = '/home/work/user-job-dir/checkpoint.ckpt'
mox.file.copy(args.ckpt_url, ckpt_path)
initConfig['data_path'] = train_dataset_path
initConfig['ckpt_file'] = ckpt_path
eval_leo(initConfig, inner_model_Config, outer_model_Config)
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
import os
import src.data as data
import src.outerloop as outerloop
import model_utils.config as config
import mindspore as ms
from mindspore import context, Tensor, export
from mindspore import load_checkpoint, load_param_into_net
import numpy as np

os.environ['GLOG_v'] = "3"
os.environ['GLOG_log_dir'] = '/var/log'


def export_leo(init_config, inner_model_config, outer_model_config):
    inner_lr_init = inner_model_config['inner_lr_init']
    finetuning_lr_init = inner_model_config['finetuning_lr_init']

    data_utils = data.Data_Utils(
        train=False, seed=100, way=outer_model_config['num_classes'],
        shot=outer_model_config['num_tr_examples_per_class'],
        data_path=init_config['data_path'], dataset_name=init_config['dataset_name'],
        embedding_crop=init_config['embedding_crop'],
        batchsize=outer_model_config['metatrain_batch_size'],
        val_batch_size=outer_model_config['metavalid_batch_size'],
        test_batch_size=outer_model_config['metatest_batch_size'],
        meta_val_steps=outer_model_config['num_val_examples_per_class'], embedding_size=640, verbose=True)

    test_outer_loop = outerloop.OuterLoop(
        batchsize=outer_model_config['metavalid_batch_size'], input_size=640,
        latent_size=inner_model_config['num_latents'],
        way=outer_model_config['num_classes'], shot=outer_model_config['num_tr_examples_per_class'],
        dropout=inner_model_config['dropout_rate'], kl_weight=inner_model_config['kl_weight'],
        encoder_penalty_weight=inner_model_config['encoder_penalty_weight'],
        orthogonality_weight=inner_model_config['orthogonality_penalty_weight'],
        inner_lr_init=inner_lr_init, finetuning_lr_init=finetuning_lr_init,
        inner_step=inner_model_config['inner_unroll_length'],
        finetune_inner_step=inner_model_config['finetuning_unroll_length'], is_meta_training=False)

    parm_dict = load_checkpoint(init_config['ckpt_file'])
    load_param_into_net(test_outer_loop, parm_dict)

    batch = data_utils.get_batch('test')
    print(batch['train']['input'].shape)  # [200,5,5,640]
    print(batch['train']['input'].dtype)  # Float32
    print(batch['train']['target'].shape)  # [200,5,5,1]
    print(batch['train']['target'].dtype)  # Int64
    print(batch['val']['input'].shape)  # [200,5,15,640]
    print(batch['val']['input'].dtype)  # Float32
    print(batch['val']['target'].shape)  # [200,5,15,1]
    print(batch['val']['target'].dtype)  # Int64

    train_input = Tensor(np.zeros(batch['train']['input'].shape), ms.float32)
    train_target = Tensor(np.zeros(batch['train']['target'].shape), ms.int64)
    val_input = Tensor(np.zeros(batch['val']['input'].shape), ms.float32)
    val_target = Tensor(np.zeros(batch['val']['target'].shape), ms.int64)

    result_name = "LEO-" + init_config['dataset_name'] + str(outer_model_config['num_classes']) +\
                  "N" + str(outer_model_config['num_tr_examples_per_class']) + "K"
    export(test_outer_loop, train_input, train_target, val_input, val_target,
           file_name=result_name, file_format="MINDIR")


if __name__ == '__main__':
    initConfig = config.get_init_config()
    inner_model_Config = config.get_inner_model_config()
    outer_model_Config = config.get_outer_model_config()

    print("===============inner_model_config=================")
    for key, value in inner_model_Config.items():
        print(key + ": " + str(value))
    print("===============outer_model_config=================")
    for key, value in outer_model_Config.items():
        print(key + ": " + str(value))

    context.set_context(mode=context.GRAPH_MODE, device_target=initConfig['device_target'])
    export_leo(initConfig, inner_model_Config, outer_model_Config)
    print("successfully export LEO model!")
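
The export script above names its output from the dataset, the number of classes, and the shot count, and it feeds zero-filled placeholder tensors whose shapes match one test batch. The sketch below reproduces only that bookkeeping in plain Python so it can be checked without MindSpore. `export_name` and `episode_shapes` are hypothetical helpers, and the reading of the four dimensions as (batch, classes, examples per class, embedding size) is an assumption inferred from the shape comments in the script.

```python
def export_name(dataset_name, num_classes, shots):
    """Build the file name the same way export_leo concatenates it."""
    return "LEO-" + dataset_name + str(num_classes) + "N" + str(shots) + "K"


def episode_shapes(batch_size, num_classes, shots, val_per_class, embedding_size=640):
    """Return placeholder shapes for one batch of episodes.

    Assumption: dimensions are (batch, classes, examples per class, features),
    inferred from the shape comments in export.py.
    """
    return {
        "train_input": (batch_size, num_classes, shots, embedding_size),
        "train_target": (batch_size, num_classes, shots, 1),
        "val_input": (batch_size, num_classes, val_per_class, embedding_size),
        "val_target": (batch_size, num_classes, val_per_class, 1),
    }


if __name__ == "__main__":
    print(export_name("miniImageNet", 5, 5))
    print(episode_shapes(200, 5, 5, 15))
```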
......@@ -56,6 +56,8 @@ def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="LEO-N5-K
helper = {} if helper is None else helper
choices = {} if choices is None else choices
for item in cfg:
if item in ("dataset_name", "num_tr_examples_per_class"):
continue
if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict):
help_description = helper[item] if item in helper else "Please reference to {}".format(cfg_path)
choice = choices[item] if item in choices else None
......@@ -110,20 +112,28 @@ def merge(args, cfg):
return cfg
def get_config():
def get_config(get_args=False):
"""
Get Config according to the yaml file and cli arguments.
"""
parser = argparse.ArgumentParser(description="default name", add_help=False)
config_dir = os.path.join(os.path.abspath(os.getcwd()), "config")
config_name = "LEO-N5-K" + str(os.getenv("NUM_TR_EXAMPLES_PER_CLASS")) \
+ "_" + os.getenv("DATA_NAME") + "_config.yaml"
parser.add_argument("--config_path", type=str,
default=os.path.join(config_dir, config_name),
help="Config file path")
parser.add_argument("--num_tr_examples_per_class", type=int,
default=5,
help="num_tr_examples_per_class")
parser.add_argument("--dataset_name", type=str,
default="miniImageNet",
help="dataset_name")
path_args, _ = parser.parse_known_args()
default, helper, choices = parse_yaml(path_args.config_path)
args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=path_args.config_path)
config_name = "LEO-N5-K" + str(path_args.num_tr_examples_per_class) \
+ "_" + path_args.dataset_name + "_config.yaml"
config_path = os.path.join(os.path.abspath(os.path.join(__file__, "../..")), "config", config_name)
default, helper, choices = parse_yaml(config_path)
args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=config_path)
if get_args:
return args
final_config = merge(args, default)
return Config(final_config)
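
The updated `get_config` picks the YAML file from two command-line arguments instead of environment variables. A minimal, self-contained sketch of that lookup is shown below; `resolve_config_name` is a hypothetical helper, while the argument names, defaults, and file-name pattern are the ones visible in the diff.

```python
import argparse


def resolve_config_name(argv=None):
    """Derive the config file name from CLI arguments, ignoring unrelated flags."""
    parser = argparse.ArgumentParser(add_help=False)
    parser.add_argument("--num_tr_examples_per_class", type=int, default=5)
    parser.add_argument("--dataset_name", type=str, default="miniImageNet")
    # parse_known_args leaves flags such as --save_path for a later parser.
    known, _ = parser.parse_known_args(argv)
    return ("LEO-N5-K" + str(known.num_tr_examples_per_class)
            + "_" + known.dataset_name + "_config.yaml")


if __name__ == "__main__":
    print(resolve_config_name(["--dataset_name", "tieredImageNet",
                               "--num_tr_examples_per_class", "1"]))
```

Parsing only the known arguments first is what lets the second pass register every YAML key as its own option without conflicting with these two.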
......
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $# -ne 6 ]]; then
    echo "=============================================================================================================="
    echo "Please run the script as: "
    echo "bash scripts/export.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
    echo "For example: bash scripts/export.sh 1 Ascend ../leo/leo-mindspore/embeddings miniImageNet 5 ./ckpts/xxx.ckpt "
    echo "=============================================================================================================="
    exit 1;
fi
export GLOG_v=3
export DEVICE_ID=$1
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export CKPT_FILE=$6
nohup python export.py --device_target $DEVICE_TARGET \
    --data_path $DATA_PATH \
    --dataset_name $DATA_NAME \
    --num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS \
    --ckpt_file $CKPT_FILE > export.log 2>&1 &
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
# A simple tutorial follows; more parameters can be set.
if [ $# != 6 ]
then
    echo "Usage: bash scripts/run_distribution_ascend.sh [RANK_TABLE_FILE] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH]"
    echo "For example: bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/jialing/leo/leo-mindspore/embeddings miniImageNet 5 ./ckpts/8P_mini_5"
    exit 1
fi
if [ ! -f "$1" ]
then
    echo "error: RANK_TABLE_FILE=$1 is not a file"
    exit 1
fi
get_real_path(){
    if [ "${1:0:1}" == "/" ]; then
        echo "$1"
    else
        echo "$(realpath -m $PWD/$1)"
    fi
}
ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8
RANK_TABLE_FILE=$(realpath "$1")
export RANK_TABLE_FILE
echo "RANK_TABLE_FILE=${RANK_TABLE_FILE}"
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export SAVE_PATH=$6
export SERVER_ID=0
rank_start=$((DEVICE_NUM * SERVER_ID))
for((i=0; i<${DEVICE_NUM}; i++))
do
    export DEVICE_ID=$i
    export RANK_ID=$((rank_start + i))
    rm -rf ./train_parallel$i
    mkdir ./train_parallel$i
    cp -r ./src ./train_parallel$i
    cp -r ./config ./train_parallel$i
    cp -r ./model_utils ./train_parallel$i
    cp ./train.py ./train_parallel$i
    echo "start training for rank $RANK_ID, device $DEVICE_ID"
    cd ./train_parallel$i || exit
    env > env.log
    nohup python -u train.py --device_target $DEVICE_TARGET --data_path $DATA_PATH --dataset_name $DATA_NAME --num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS --save_path $SAVE_PATH >log_distribution_ascend 2>&1 &
    cd ..
done
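
The loop in the script above gives each process a device ID, a global rank, and its own working directory. The same arithmetic is sketched in Python below for clarity. `rank_layout` is a hypothetical helper, and the defaults mirror the values the script exports (`DEVICE_NUM=8`, `SERVER_ID=0`).

```python
def rank_layout(device_num=8, server_id=0):
    """List the per-process settings the launch loop would export."""
    rank_start = device_num * server_id  # first global rank on this server
    return [
        {
            "DEVICE_ID": i,
            "RANK_ID": rank_start + i,
            "workdir": "./train_parallel%d" % i,
        }
        for i in range(device_num)
    ]


if __name__ == "__main__":
    for entry in rank_layout():
        print(entry)
```

With a single server the rank equals the device ID; the `rank_start` offset only matters if the script were extended to a second server.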
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $# -ne 6 ]]; then
    echo "=============================================================================================================="
    echo "Please run the script as: "
    echo "bash scripts/run_eval_ascend.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
    echo "For example: bash scripts/run_eval_ascend.sh 4 Ascend ../leo/leo-mindspore/embeddings miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt "
    echo "=============================================================================================================="
    exit 1;
fi
export GLOG_v=3
export DEVICE_ID=$1
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export CKPT_FILE=$6
nohup python eval.py --device_target $DEVICE_TARGET \
    --data_path $DATA_PATH \
    --dataset_name $DATA_NAME \
    --num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS \
    --ckpt_file $CKPT_FILE > ${DATA_NAME}_${NUM_TR_EXAMPLES_PER_CLASS}_eval.log 2>&1 &
......@@ -12,14 +12,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "============================================================================================================================"
echo "Please run the script as: "
echo "bash scripts/run_eval_gpu.sh [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
echo "For example: bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt "
echo "============================================================================================================================"
if [[ $# -ne 4 ]]; then
echo "============================================================================================================================"
echo "Please run the script as: "
echo "bash scripts/run_eval_gpu.sh [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
echo "For example: bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt "
echo "============================================================================================================================"
exit 1;
fi
export GLOG_v=3
export DEVICE_TARGET=GPU
export DATA_PATH=$1
......
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $# -ne 6 ]]; then
    echo "=============================================================================================================="
    echo "Please run the script as: "
    echo "bash scripts/run_train_ascend.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH] "
    echo "For example: bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/1P_mini_5"
    echo "=============================================================================================================="
    exit 1;
fi
export DEVICE_ID=$1
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export SAVE_PATH=$6
export GLOG_v=3
# Single-device run; DEVICE_NUM is only used in the log file name below.
export DEVICE_NUM=1
nohup python -u train.py --device_target $DEVICE_TARGET --data_path $DATA_PATH --dataset_name $DATA_NAME --num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS --save_path $SAVE_PATH > ${DEVICE_NUM}P_${DATA_NAME}_${NUM_TR_EXAMPLES_PER_CLASS}_train.log 2>&1 &
......@@ -12,20 +12,24 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "===================================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_train_gpu.sh [DEVICE_NUM] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH] "
echo "For example: bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5"
echo "===================================================================================================================="
echo "Please run distributed training script as: "
echo "For example: bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/8P_mini_1 "
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/8P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/8P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/8P_tiered_5"
echo "===================================================================================================================="
if [[ $# -ne 5 ]]; then
echo "===================================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_train_gpu.sh [DEVICE_NUM] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH] "
echo "For example: bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5"
echo "===================================================================================================================="
echo "Please run distributed training script as: "
echo "For example: bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/8P_mini_1 "
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/8P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/8P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/8P_tiered_5"
echo "===================================================================================================================="
exit 1;
fi
export DEVICE_NUM=$1
export DEVICE_TARGET=GPU
export DATA_PATH=$2
......
......@@ -19,8 +19,10 @@ import model_utils.config as config
import src.data as data
import src.outerloop as outerloop
from src.trainonestepcell import TrainOneStepCell
import mindspore
import mindspore.nn as nn
from mindspore import context
from mindspore import save_checkpoint, load_param_into_net, load_checkpoint
from mindspore import save_checkpoint, load_param_into_net
from mindspore.communication.management import init
from mindspore.context import ParallelMode
......@@ -29,12 +31,35 @@ os.environ['GLOG_v'] = "3"
os.environ['GLOG_log_dir'] = '/var/log'
def save_checkpoint_to_file(if_save_checkpoint, val_accs, best_acc, step, val_losses, init_config, train_outer_loop):
    """Save a checkpoint when the mean validation accuracy is not below best_acc.

    Returns the updated best accuracy. A caller has to store the return value
    for the threshold to rise between validation rounds.
    """
    if if_save_checkpoint:
        if not sum(val_accs) / len(val_accs) < best_acc:
            best_acc = sum(val_accs) / len(val_accs)
            model_name = '%dk_%4.4f_%4.4f_model.ckpt' % (
                (step // 1000 + 1),
                sum(val_losses) / len(val_losses),
                sum(val_accs) / len(val_accs))

            check_dir(init_config['save_path'])

            # `args` is the module-level namespace created in the __main__ block.
            if args.enable_modelarts:
                save_checkpoint_path = '/cache/train_output/'
                if not os.path.exists(save_checkpoint_path):
                    os.makedirs(save_checkpoint_path)
                save_checkpoint(train_outer_loop, os.path.join(save_checkpoint_path, model_name))
            else:
                save_checkpoint(train_outer_loop, os.path.join(init_config['save_path'], model_name))
            print('Saved checkpoint %s...' % model_name)
    return best_acc
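
The checkpoint helper above combines two pieces of bookkeeping: a file name built from the step count and the mean validation metrics, and a best-accuracy threshold. The sketch below isolates both in plain Python so they can be checked without MindSpore. `checkpoint_name` and `update_best` are hypothetical names; the format string and the "not lower than the best" comparison are taken from the diff.

```python
def checkpoint_name(step, val_losses, val_accs):
    """Format the checkpoint file name from the step and mean validation metrics."""
    mean_loss = sum(val_losses) / len(val_losses)
    mean_acc = sum(val_accs) / len(val_accs)
    return '%dk_%4.4f_%4.4f_model.ckpt' % (step // 1000 + 1, mean_loss, mean_acc)


def update_best(val_accs, best_acc):
    """Return (new_best, should_save); ties count as an improvement, as in the diff."""
    mean_acc = sum(val_accs) / len(val_accs)
    should_save = not mean_acc < best_acc
    return (mean_acc if should_save else best_acc), should_save


if __name__ == "__main__":
    best = 0.0
    for step, accs in [(2999, [0.5, 1.0]), (5999, [0.5, 0.5])]:
        best, save = update_best(accs, best)
        if save:
            print(checkpoint_name(step, [1.0, 2.0], accs))
```

Returning the new best value and reassigning it in the loop is what keeps later, weaker validation rounds from overwriting the record.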
def train_leo(init_config, inner_model_config, outer_model_config):
inner_lr_init = inner_model_config['inner_lr_init']
finetuning_lr_init = inner_model_config['finetuning_lr_init']
total_train_steps = outer_model_config['total_steps']
val_every_step = 5000
val_every_step = 3000
total_val_steps = 100
if_save_checkpoint = True
best_acc = 0
......@@ -72,8 +97,11 @@ def train_leo(init_config, inner_model_config, outer_model_config):
inner_step=inner_model_config['inner_unroll_length'],
finetune_inner_step=inner_model_config['finetuning_unroll_length'], is_meta_training=True)
parm_dict = load_checkpoint('./resource/leo_ms_init.ckpt')
load_param_into_net(train_outer_loop, parm_dict)
if context.get_context("device_target") == "Ascend":
train_outer_loop.to_float(mindspore.float32)
for _, cell in train_outer_loop.cells_and_names():
if isinstance(cell, nn.Dense):
cell.to_float(mindspore.float16)
train_net = TrainOneStepCell(train_outer_loop,
outer_model_config['outer_lr'],
......@@ -105,8 +133,8 @@ def train_leo(init_config, inner_model_config, outer_model_config):
train_net.group_params[0]['params'][1].T.asnumpy(),
val_acc.asnumpy(), now_t-old_t))
if step % val_every_step == 4999:
print('5000 step average time: %4.4f second...'%(sum_steptime/5000))
if step % val_every_step == 2999:
print('3000 step average time: %4.4f second...'%(sum_steptime/3000))
sum_steptime = 0
val_losses = []
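
The validation check in this hunk compares `step % val_every_step` against a hard-coded `2999`, which has to be edited by hand whenever `val_every_step` changes. One way to express the same schedule without the literal is sketched below; `validation_steps` is a hypothetical helper, not code from the repository.

```python
def validation_steps(total_steps, val_every_step):
    """Return the zero-based steps at which validation would run."""
    # (step + 1) % n == 0 is the same condition as step % n == n - 1.
    return [s for s in range(total_steps) if (s + 1) % val_every_step == 0]


if __name__ == "__main__":
    print(validation_steps(9000, 3000))
```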
......@@ -128,18 +156,8 @@ def train_leo(init_config, inner_model_config, outer_model_config):
(sum(val_losses)/len(val_losses), sum(val_accs)/len(val_accs)))
print('=' * 50)
if if_save_checkpoint:
if not sum(val_accs)/len(val_accs) < best_acc:
best_acc = sum(val_accs)/len(val_accs)
model_name = '%dk_%4.4f_%4.4f_model.ckpt' % (
(step//1000+1),
sum(val_losses)/len(val_losses),
sum(val_accs)/len(val_accs))
check_dir(init_config['save_path'])
save_checkpoint(train_outer_loop, os.path.join(init_config['save_path'], model_name))
print('Saved checkpoint %s...'%model_name)
save_checkpoint_to_file(if_save_checkpoint, val_accs, best_acc, step, val_losses,
init_config, train_outer_loop)
if step == (total_train_steps-1):
train_end = time.time()
......@@ -166,6 +184,8 @@ if __name__ == '__main__':
initConfig = config.get_init_config()
inner_model_Config = config.get_inner_model_config()
outer_model_Config = config.get_outer_model_config()
args = config.get_config(get_args=True)
print("===============inner_model_config=================")
for key, value in inner_model_Config.items():
......@@ -175,10 +195,21 @@ if __name__ == '__main__':
print(key+": "+str(value))
context.set_context(mode=context.GRAPH_MODE, device_target=initConfig['device_target'])
if initConfig['device_num'] > 1:
if args.enable_modelarts:
import moxing as mox
mox.file.copy_parallel(
src_url=args.data_url, dst_url='/cache/dataset/device_' + os.getenv('DEVICE_ID'))
train_dataset_path = os.path.join('/cache/dataset/device_' + os.getenv('DEVICE_ID'), "embeddings")
initConfig['data_path'] = train_dataset_path
elif initConfig['device_num'] > 1:
init('nccl')
context.set_auto_parallel_context(device_num=initConfig['device_num'],
parallel_mode=ParallelMode.DATA_PARALLEL,
gradients_mean=True)
train_leo(initConfig, inner_model_Config, outer_model_Config)
if args.enable_modelarts:
mox.file.copy_parallel(
src_url='/cache/train_output', dst_url=args.train_url)