Commit 748eba33 authored by jialingqu, committed by jialing

update LEO

parent a13e11ac
Showing 475 additions and 87 deletions
......@@ -105,8 +105,9 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
# Environment Requirements
- Hardware (GPU)
- Hardware (GPU or Ascend)
    - Use a GPU processor to set up the hardware environment.
    - Use an Ascend processor to set up the hardware environment.
- Framework
    - [MindSpore](https://www.mindspore.cn/install/en)
- For details, see the following resources:
......@@ -130,6 +131,24 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
bash scripts/run_eval_gpu.sh [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE]
# Run the evaluation example
bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt
```
- Running with an Ascend processor
```bash
# Run the training example
bash scripts/run_train_ascend.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH]
# For example:
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/1P_mini_5
# Run the distributed training example
bash scripts/run_distribution_ascend.sh [RANK_TABLE_FILE] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH]
# For example:
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/8P_mini_5
# Run the evaluation example
bash scripts/run_eval_ascend.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE]
# For example:
bash scripts/run_eval_ascend.sh 4 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt
```
The above shows the first experiment; for the remaining three experiments, refer to the Training section.
......@@ -144,8 +163,11 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
├─ train.py                       # training script
├─ eval.py                        # evaluation script
├─ scripts
│ ├─ run_eval_gpu.sh              # launch evaluation
│ └─ run_train_gpu.sh             # launch training
│ ├─ run_distribution_ascend.sh   # launch 8-device Ascend distributed training
│ ├─ run_eval_ascend.sh           # launch evaluation on Ascend
│ ├─ run_eval_gpu.sh              # launch evaluation on GPU
│ ├─ run_train_ascend.sh          # launch training on Ascend
│ └─ run_train_gpu.sh             # launch training on GPU
├─ src
│ ├─ data.py                      # data processing
│ ├─ model.py                     # LEO model
......@@ -211,7 +233,7 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
outer_lr: 0.004 # hyperparameter
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
```
For more configuration details, see the config folder. **Before starting training, set the hyperparameters above according to the experiment you are running.**
......@@ -221,13 +243,13 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
- The four experiments use different hyperparameters (a command-line override sketch follows the table):
| Hyperparameter | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot |
| ------------------------------ | ------------------- | ------------------- | --------------------- | --------------------- |
| ------------------------------ |---------------------|---------------------|-----------------------| --------------------- |
| `dropout` | 0.3 | 0.3 | 0.2 | 0.3 |
| `kl_weight` | 0.001 | 0.001 | 0 | 0.001 |
| `encoder_penalty_weight` | 1E-9 | 2.66E-7 | 5.7E-1 | 5.7E-6 |
| `l2_penalty_weight` | 0.0001 | 8.5E-6 | 5.10E-6 | 3.6E-10 |
| `orthogonality_penalty_weight` | 303.0 | 0.00152 | 4.88E-1 | 0.188 |
| `outer_lr` | 0.004 | 0.004 | 0.004 | 0.0025 |
| `orthogonality_penalty_weight` | 303.0 | 0.00152 | 4.88E-1 | 0.188 |
| `outer_lr` | 0.005 | 0.005 | 0.005 | 0.0025 |
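
Besides editing the YAML files, the config loader (`model_utils/config.py`) exposes each config key as a command-line flag, so individual hyperparameters can also be overridden at launch. A hedged sketch (flag names are assumed to mirror the YAML keys; confirm them with `python train.py --help` first):

```bash
# Hypothetical override sketch: each flag mirrors a key in the LEO-N5-K*_config.yaml files.
# Verify the exact flag names before relying on them.
python train.py --device_target GPU \
                --data_path /home/mindspore/dataset/embeddings/ \
                --dataset_name miniImageNet \
                --num_tr_examples_per_class 1 \
                --dropout_rate 0.3 \
                --kl_weight 0.001 \
                --outer_lr 0.005 \
                --save_path ./ckpt/1P_mini_1
```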
### Training
......@@ -240,6 +262,15 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5
```
- After setting the parameters above, run in an Ascend environment:
```bash
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpts/1P_mini_1
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/1P_mini_5
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1
bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5
```
Training runs in the background; you can follow its progress in log files such as `1P_miniImageNet_1_train.log`.
After training finishes, the checkpoint files can be found in checkpoint folders such as `./ckpt/1P_mini_1`.
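
Because the run scripts launch training with `nohup ... &`, a simple way to follow a run (assuming the default log and checkpoint locations from the examples above) is:

```bash
# Follow the single-device training log in real time
tail -f ./1P_miniImageNet_1_train.log

# List the checkpoints saved so far; file names encode the step (in thousands),
# the validation loss and the validation accuracy (see train.py)
ls -lt ./ckpt/1P_mini_1/
```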
......@@ -256,6 +287,15 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/8P_tiered_5
```
- After setting the parameters above, run in an Ascend environment:
```bash
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpts/8P_mini_1
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/8P_mini_5
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpts/8P_tiered_1
bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpts/8P_tiered_5
```
As with single-device training, the progress can be followed in `8P_miniImageNet_1_train.log`, and the checkpoint files can be found in the default checkpoint folders such as `./ckpt/8P_mini_1`.
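
For the 8-device Ascend run, `run_distribution_ascend.sh` creates one working directory per rank and redirects each rank's output to `log_distribution_ascend` inside it, so the individual ranks can be checked like this:

```bash
# One train_parallel<i> directory is created per rank by run_distribution_ascend.sh
ls -d ./train_parallel*

# Follow rank 0's log (file name taken from the redirect inside the script)
tail -f ./train_parallel0/log_distribution_ascend
```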
## Evaluation Process
......@@ -273,6 +313,15 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt
```
- Running in an Ascend environment
```bash
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1/xxx.ckpt
bash scripts/run_eval_ascend.sh 0 Ascend /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt
```
Evaluation runs in the background; you can follow its progress in log files such as `1P_miniImageNet_1_eval.log`.
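
The `xxx.ckpt` in the commands above is a placeholder. Before evaluating, pick a concrete checkpoint from the training output directory, for example (GPU, miniImageNet 1-shot; the file-name pattern comes from train.py):

```bash
# Checkpoints are named <step//1000+1>k_<val_loss>_<val_acc>_model.ckpt by train.py
ls ./ckpt/1P_mini_1/

# Evaluate the most recently saved (best-so-far) checkpoint
CKPT=$(ls -t ./ckpt/1P_mini_1/*.ckpt | head -n 1)
bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 1 "$CKPT"
```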
# Model Description
......@@ -283,19 +332,19 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
- Training parameters
| Parameter | LEO |
| ------------- | ----------------------------------------------------------- |
| Resources | NVIDIA GeForce RTX 3090; 10,496 CUDA cores; 24 GB memory |
| Upload date | 2022-03-27 |
| MindSpore version | 1.7.0 |
| Dataset | miniImageNet |
| Optimizer | Adam |
| Loss function | Cross Entropy Loss |
| Output | Accuracy |
| Loss | GANLoss, L1Loss, localLoss, DTLoss |
| Fine-tuned checkpoint | 672 KB (.ckpt file) |
| Parameter | LEO | Ascend |
| ------------- | ----------------------------------------------------------- |-----------------------------------------------|
| Resources | NVIDIA GeForce RTX 3090; 10,496 CUDA cores; 24 GB memory | Ascend 910; 24 CPU cores; 256 GB memory; OS Euler2.8 |
| Upload date | 2022-03-27 | 2022-06-12 |
| MindSpore version | 1.7.0 | 1.5.0 |
| Dataset | miniImageNet | miniImageNet |
| Optimizer | Adam | Adam |
| Loss function | Cross Entropy Loss | Cross Entropy Loss |
| Output | Accuracy | Accuracy |
| Loss | GANLoss, L1Loss, localLoss, DTLoss | GANLoss, L1Loss, localLoss, DTLoss |
| Fine-tuned checkpoint | 672 KB (.ckpt file) | 672 KB (.ckpt file) |
- Evaluation performance
- GPU evaluation performance
| Experiment | miniImageNet 1-shot | miniImageNet 5-shot | tieredImageNet 1-shot | tieredImageNet 5-shot |
| ----- | ------------------- | ------------------- | --------------------- | --------------------- |
......@@ -306,13 +355,13 @@ LEO consists of the following modules: classifier, encoder, relation network, and …
- Evaluation parameters
| Parameter | LEO |
| ------------ | ----------------------------------------------------------- |
| Resources | NVIDIA GeForce RTX 3090; 10,496 CUDA cores; 24 GB memory |
| Upload date | 2022-03-27 |
| MindSpore version | 1.7.0 |
| Dataset | miniImageNet |
| Output | Accuracy |
| Parameter | LEO | Ascend |
| ------------ | ----------------------------------------------------------- |-----------------------------------------------|
| Resources | NVIDIA GeForce RTX 3090; 10,496 CUDA cores; 24 GB memory | Ascend 910; 24 CPU cores; 256 GB memory; OS Euler2.8 |
| Upload date | 2022-03-27 | 2022-06-12 |
| MindSpore version | 1.7.0 | 1.5.0 |
| Dataset | miniImageNet | miniImageNet |
| Output | Accuracy | Accuracy |
- Evaluation accuracy
......
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -36,10 +38,10 @@ metatrain_batch_size: 12
metavalid_batch_size: 200
metatest_batch_size: 200
num_steps_limit: int(1e5)
outer_lr: 0.004 # parameters
outer_lr: 0.005 # parameters
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
......
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -36,10 +38,10 @@ metatrain_batch_size: 12
metavalid_batch_size: 200
metatest_batch_size: 200
num_steps_limit: int(1e5)
outer_lr: 0.004 # parameters
outer_lr: 0.005 # parameters
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
......
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -20,10 +22,11 @@ inner_unroll_length: 5
finetuning_unroll_length: 5
num_latents: 64
inner_lr_init: 1.0
finetuning_lr_init: 0.001
finetuning_lr_init: 0.0005
dropout_rate: 0.3 # parameters
kl_weight: 0.001 # parameters
encoder_penalty_weight: 2.66E-7 # parameters
encoder_penalty_weight: 2.66E-7 # parameters origin
l2_penalty_weight: 8.5E-6 # parameters
orthogonality_penalty_weight: 0.00152 # parameters
# ==============================================================================
......@@ -36,15 +39,15 @@ metatrain_batch_size: 12
metavalid_batch_size: 200
metatest_batch_size: 200
num_steps_limit: int(1e5)
outer_lr: 0.004 # parameters
outer_lr: 0.005 # parameters origin
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
file_name: 'leo'
file_format: 'MINDIR' # ['AIR', 'MINDIR']
file_format: 'AIR'
---
......@@ -54,6 +57,8 @@ data_url: 'Dataset url for obs'
train_url: 'Training output url for obs'
data_path: 'Dataset path for local'
output_path: 'Training output path for local'
ckpt_url: 'ckpt files'
result_url: 'infer result files'
device_target: 'Target device type'
enable_profiling: 'Whether enable profiling while training, default: False'
......@@ -2,6 +2,8 @@
enable_modelarts: False
data_url: ""
train_url: ""
ckpt_url: 'ckpt files'
result_url: 'infer result files'
checkpoint_url: ""
device_target: "GPU"
device_num: 1
......@@ -39,7 +41,7 @@ num_steps_limit: int(1e5)
outer_lr: 0.0025 # parameters
gradient_threshold: 0.1
gradient_norm_threshold: 0.1
total_steps: 100000
total_steps: 200000
# Model Description
model_name: LEO
......
......@@ -35,7 +35,7 @@ def eval_leo(init_config, inner_model_config, outer_model_config):
    total_test_steps = 100
    data_utils = data.Data_Utils(
        train=False, seed=100, way=outer_model_config['num_classes'],
        train=False, seed=1, way=outer_model_config['num_classes'],
        shot=outer_model_config['num_tr_examples_per_class'],
        data_path=init_config['data_path'], dataset_name=init_config['dataset_name'],
        embedding_crop=init_config['embedding_crop'],
......@@ -75,6 +75,7 @@ if __name__ == '__main__':
    initConfig = config.get_init_config()
    inner_model_Config = config.get_inner_model_config()
    outer_model_Config = config.get_outer_model_config()
    args = config.get_config(get_args=True)
    print("===============inner_model_config=================")
    for key, value in inner_model_Config.items():
......@@ -84,5 +85,16 @@ if __name__ == '__main__':
        print(key+": "+str(value))
    context.set_context(mode=context.GRAPH_MODE, device_target=initConfig['device_target'])
    if args.enable_modelarts:
        # On ModelArts, pull the dataset and the checkpoint from OBS into the local cache
        # before evaluation, and point the config at the local copies.
        import moxing as mox
        mox.file.copy_parallel(
            src_url=args.data_url, dst_url='/cache/dataset/device_' + os.getenv('DEVICE_ID'))
        train_dataset_path = os.path.join('/cache/dataset/device_' + os.getenv('DEVICE_ID'), "embeddings")
        ckpt_path = '/home/work/user-job-dir/checkpoint.ckpt'
        mox.file.copy(args.ckpt_url, ckpt_path)
        initConfig['data_path'] = train_dataset_path
        initConfig['ckpt_file'] = ckpt_path
    eval_leo(initConfig, inner_model_Config, outer_model_Config)
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
import os
import src.data as data
import src.outerloop as outerloop
import model_utils.config as config
import mindspore as ms
from mindspore import context, Tensor, export
from mindspore import load_checkpoint, load_param_into_net
import numpy as np
os.environ['GLOG_v'] = "3"
os.environ['GLOG_log_dir'] = '/var/log'
def export_leo(init_config, inner_model_config, outer_model_config):
    inner_lr_init = inner_model_config['inner_lr_init']
    finetuning_lr_init = inner_model_config['finetuning_lr_init']
    data_utils = data.Data_Utils(
        train=False, seed=100, way=outer_model_config['num_classes'],
        shot=outer_model_config['num_tr_examples_per_class'],
        data_path=init_config['data_path'], dataset_name=init_config['dataset_name'],
        embedding_crop=init_config['embedding_crop'],
        batchsize=outer_model_config['metatrain_batch_size'],
        val_batch_size=outer_model_config['metavalid_batch_size'],
        test_batch_size=outer_model_config['metatest_batch_size'],
        meta_val_steps=outer_model_config['num_val_examples_per_class'], embedding_size=640, verbose=True)
    test_outer_loop = outerloop.OuterLoop(
        batchsize=outer_model_config['metavalid_batch_size'], input_size=640,
        latent_size=inner_model_config['num_latents'],
        way=outer_model_config['num_classes'], shot=outer_model_config['num_tr_examples_per_class'],
        dropout=inner_model_config['dropout_rate'], kl_weight=inner_model_config['kl_weight'],
        encoder_penalty_weight=inner_model_config['encoder_penalty_weight'],
        orthogonality_weight=inner_model_config['orthogonality_penalty_weight'],
        inner_lr_init=inner_lr_init, finetuning_lr_init=finetuning_lr_init,
        inner_step=inner_model_config['inner_unroll_length'],
        finetune_inner_step=inner_model_config['finetuning_unroll_length'], is_meta_training=False)
    parm_dict = load_checkpoint(init_config['ckpt_file'])
    load_param_into_net(test_outer_loop, parm_dict)
    # A test batch is fetched only to obtain the input/target shapes and dtypes expected by the network.
    batch = data_utils.get_batch('test')
    print(batch['train']['input'].shape)   # [200,5,5,640]
    print(batch['train']['input'].dtype)   # Float32
    print(batch['train']['target'].shape)  # [200,5,5,1]
    print(batch['train']['target'].dtype)  # Int64
    print(batch['val']['input'].shape)     # [200,5,15,640]
    print(batch['val']['input'].dtype)     # Float32
    print(batch['val']['target'].shape)    # [200,5,15,1]
    print(batch['val']['target'].dtype)    # Int64
    # Dummy zero tensors with those shapes/dtypes are used to trace and export the graph.
    train_input = Tensor(np.zeros(batch['train']['input'].shape), ms.float32)
    train_target = Tensor(np.zeros(batch['train']['target'].shape), ms.int64)
    val_input = Tensor(np.zeros(batch['val']['input'].shape), ms.float32)
    val_target = Tensor(np.zeros(batch['val']['target'].shape), ms.int64)
    result_name = "LEO-" + init_config['dataset_name'] + str(outer_model_config['num_classes']) +\
        "N" + str(outer_model_config['num_tr_examples_per_class']) + "K"
    export(test_outer_loop, train_input, train_target, val_input, val_target,
           file_name=result_name, file_format="MINDIR")
if __name__ == '__main__':
    initConfig = config.get_init_config()
    inner_model_Config = config.get_inner_model_config()
    outer_model_Config = config.get_outer_model_config()
    print("===============inner_model_config=================")
    for key, value in inner_model_Config.items():
        print(key + ": " + str(value))
    print("===============outer_model_config=================")
    for key, value in outer_model_Config.items():
        print(key + ": " + str(value))
    context.set_context(mode=context.GRAPH_MODE, device_target=initConfig['device_target'])
    export_leo(initConfig, inner_model_Config, outer_model_Config)
    print("successfully export LEO model!")
......@@ -56,6 +56,8 @@ def parse_cli_to_yaml(parser, cfg, helper=None, choices=None, cfg_path="LEO-N5-K
    helper = {} if helper is None else helper
    choices = {} if choices is None else choices
    for item in cfg:
        # These two keys get explicit arguments in get_config(), so skip them here
        # to avoid duplicate argparse options.
        if item in ("dataset_name", "num_tr_examples_per_class"):
            continue
        if not isinstance(cfg[item], list) and not isinstance(cfg[item], dict):
            help_description = helper[item] if item in helper else "Please reference to {}".format(cfg_path)
            choice = choices[item] if item in choices else None
......@@ -110,20 +112,28 @@ def merge(args, cfg):
    return cfg


def get_config():
def get_config(get_args=False):
    """
    Get Config according to the yaml file and cli arguments.
    """
    parser = argparse.ArgumentParser(description="default name", add_help=False)
    config_dir = os.path.join(os.path.abspath(os.getcwd()), "config")
    config_name = "LEO-N5-K" + str(os.getenv("NUM_TR_EXAMPLES_PER_CLASS")) \
                  + "_" + os.getenv("DATA_NAME") + "_config.yaml"
    parser.add_argument("--config_path", type=str,
                        default=os.path.join(config_dir, config_name),
                        help="Config file path")
    parser.add_argument("--num_tr_examples_per_class", type=int,
                        default=5,
                        help="num_tr_examples_per_class")
    parser.add_argument("--dataset_name", type=str,
                        default="miniImageNet",
                        help="dataset_name")
    path_args, _ = parser.parse_known_args()
    default, helper, choices = parse_yaml(path_args.config_path)
    args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=path_args.config_path)
    # The config file is now resolved from the CLI arguments instead of environment variables.
    config_name = "LEO-N5-K" + str(path_args.num_tr_examples_per_class) \
                  + "_" + path_args.dataset_name + "_config.yaml"
    config_path = os.path.join(os.path.abspath(os.path.join(__file__, "../..")), "config", config_name)
    default, helper, choices = parse_yaml(config_path)
    args = parse_cli_to_yaml(parser=parser, cfg=default, helper=helper, choices=choices, cfg_path=config_path)
    if get_args:
        return args
    final_config = merge(args, default)
    return Config(final_config)
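
With this change the config file `LEO-N5-K<shot>_<dataset>_config.yaml` is resolved from the `--num_tr_examples_per_class` and `--dataset_name` command-line arguments (relative to the repository root) rather than from the `NUM_TR_EXAMPLES_PER_CLASS` / `DATA_NAME` environment variables. A hedged invocation sketch, mirroring what the run scripts pass:

```bash
# eval.py selects config/LEO-N5-K1_tieredImageNet_config.yaml from these two flags;
# the remaining flags override the corresponding YAML values.
python eval.py --device_target GPU \
               --data_path /home/mindspore/dataset/embeddings/ \
               --dataset_name tieredImageNet \
               --num_tr_examples_per_class 1 \
               --ckpt_file ./ckpt/1P_tiered_1/xxx.ckpt
```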
......
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $# -ne 6 ]]; then
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_eval_gpu.sh [DEVICE_NUM] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
echo "For example: bash scripts/export.sh 1 Ascend ../leo/leo-mindspore/embeddings miniImageNet 5 ./ckpts/xxx.ckpt "
echo "=============================================================================================================="
exit 1;
fi
export GLOG_v=3
export DEVICE_ID=$1
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export CKPT_FILE=$6
nohup python export.py --device_target $DEVICE_TARGET \
--data_path $DATA_PATH \
--dataset_name $DATA_NAME \
--num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS \
--ckpt_file $CKPT_FILE > export.log 2>&1 &
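
A hedged usage sketch for the export script above (the path `scripts/export.sh` is assumed from the script's own example line and may differ in the repository; the output name follows export.py):

```bash
# Export the 5-shot miniImageNet model on device 0 and check the result.
bash scripts/export.sh 0 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/1P_mini_5/xxx.ckpt
tail -f export.log

# export.py names the file LEO-<dataset><N>N<K>K and writes it in MINDIR format
ls LEO-*N*K.mindir
```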
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ============================================================================
# A simple tutorial follows; more parameters can be set
if [ $# != 6 ]
then
echo "Usage: bash scripts/run_distribution_ascend.sh [RANK_TABLE_FILE] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH]"
echo "For example: bash scripts/run_distribution_ascend.sh ./hccl_8p_01234567_127.0.0.1.json Ascend /home/jialing/leo/leo-mindspore/embeddings miniImageNet 5 ./ckpts/8P_mini_5
"
exit 1
fi
if [ ! -f $1 ]
then
echo "error: RANK_TABLE_FILE=$1 is not a file"
exit 1
fi
get_real_path(){
if [ "${1:0:1}" == "/" ]; then
echo "$1"
else
echo "$(realpath -m $PWD/$1)"
fi
}
ulimit -u unlimited
export DEVICE_NUM=8
export RANK_SIZE=8
RANK_TABLE_FILE=$(realpath $1)
export RANK_TABLE_FILE
echo "RANK_TABLE_FILE=${RANK_TABLE_FILE}"
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export SAVE_PATH=$6
export SERVER_ID=0
rank_start=$((DEVICE_NUM * SERVER_ID))
for((i=0; i<${DEVICE_NUM}; i++))
do
export DEVICE_ID=$i
export RANK_ID=$((rank_start + i))
rm -rf ./train_parallel$i
mkdir ./train_parallel$i
cp -r ./src ./train_parallel$i
cp -r ./config ./train_parallel$i
cp -r ./model_utils ./train_parallel$i
cp ./train.py ./train_parallel$i
echo "start training for rank $RANK_ID, device $DEVICE_ID"
cd ./train_parallel$i ||exit
env > env.log
nohup python -u train.py --device_target $DEVICE_TARGET --data_path $DATA_PATH --dataset_name $DATA_NAME --num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS --save_path $SAVE_PATH >log_distribution_ascend 2>&1 &
cd ..
done
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $# -ne 6 ]]; then
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_eval_gpu.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
echo "For example: bash scripts/run_eval_ascend.sh 4 Ascend ../leo/leo-mindspore/embeddings miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt "
echo "=============================================================================================================="
exit 1;
fi
export GLOG_v=3
export DEVICE_ID=$1
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export CKPT_FILE=$6
nohup python eval.py --device_target $DEVICE_TARGET \
--data_path $DATA_PATH \
--dataset_name $DATA_NAME \
--num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS \
--ckpt_file $CKPT_FILE > ${DATA_NAME}_${NUM_TR_EXAMPLES_PER_CLASS}_eval.log 2>&1 &
......@@ -12,14 +12,18 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "============================================================================================================================"
echo "Please run the script as: "
echo "bash scripts/run_eval_gpu.sh [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
echo "For example: bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt "
echo "============================================================================================================================"
if [[ $# -ne 4 ]]; then
echo "============================================================================================================================"
echo "Please run the script as: "
echo "bash scripts/run_eval_gpu.sh [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [CKPT_FILE] "
echo "For example: bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1/xxx.ckpt "
echo "============ bash scripts/run_eval_gpu.sh /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5/xxx.ckpt "
echo "============================================================================================================================"
exit 1;
fi
export GLOG_v=3
export DEVICE_TARGET=GPU
export DATA_PATH=$1
......
#!/bin/bash
# Copyright 2022 Huawei Technologies Co., Ltd
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [[ $# -ne 6 ]]; then
echo "=============================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_train_gpu.sh [DEVICE_ID] [DEVICE_TARGET] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH] "
echo "For example: bash scripts/run_train_ascend.sh 6 Ascend /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpts/1P_mini_5"
echo "=============================================================================================================="
exit 1;
fi
export DEVICE_ID=$1
export DEVICE_TARGET=$2
export DATA_PATH=$3
export DATA_NAME=$4
export NUM_TR_EXAMPLES_PER_CLASS=$5
export SAVE_PATH=$6
export GLOG_v=3
export DEVICE_NUM=1
nohup python -u train.py --device_target $DEVICE_TARGET --data_path $DATA_PATH --dataset_name $DATA_NAME --num_tr_examples_per_class $NUM_TR_EXAMPLES_PER_CLASS --save_path $SAVE_PATH > ${DEVICE_NUM}P_${DATA_NAME}_${NUM_TR_EXAMPLES_PER_CLASS}_train.log 2>&1 &
......@@ -12,20 +12,24 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
echo "===================================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_train_gpu.sh [DEVICE_NUM] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH] "
echo "For example: bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5"
echo "===================================================================================================================="
echo "Please run distributed training script as: "
echo "For example: bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/8P_mini_1 "
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/8P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/8P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/8P_tiered_5"
echo "===================================================================================================================="
if [[ $# -ne 5 ]]; then
echo "===================================================================================================================="
echo "Please run the script as: "
echo "bash scripts/run_train_gpu.sh [DEVICE_NUM] [DATA_PATH] [DATA_NAME] [NUM_TR_EXAMPLES_PER_CLASS] [SAVE_PATH] "
echo "For example: bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/1P_mini_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/1P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/1P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 1 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/1P_tiered_5"
echo "===================================================================================================================="
echo "Please run distributed training script as: "
echo "For example: bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 1 ./ckpt/8P_mini_1 "
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ miniImageNet 5 ./ckpt/8P_mini_5"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 1 ./ckpt/8P_tiered_1"
echo " ============bash scripts/run_train_gpu.sh 8 /home/mindspore/dataset/embeddings/ tieredImageNet 5 ./ckpt/8P_tiered_5"
echo "===================================================================================================================="
exit 1;
fi
export DEVICE_NUM=$1
export DEVICE_TARGET=GPU
export DATA_PATH=$2
......
......@@ -19,8 +19,10 @@ import model_utils.config as config
import src.data as data
import src.outerloop as outerloop
from src.trainonestepcell import TrainOneStepCell
import mindspore
import mindspore.nn as nn
from mindspore import context
from mindspore import save_checkpoint, load_param_into_net, load_checkpoint
from mindspore import save_checkpoint, load_param_into_net
from mindspore.communication.management import init
from mindspore.context import ParallelMode
......@@ -29,12 +31,35 @@ os.environ['GLOG_v'] = "3"
os.environ['GLOG_log_dir'] = '/var/log'
def save_checkpoint_to_file(if_save_checkpoint, val_accs, best_acc, step, val_losses, init_config, train_outer_loop):
    # Save a checkpoint whenever the mean validation accuracy matches or beats the best so far.
    if if_save_checkpoint:
        if not sum(val_accs) / len(val_accs) < best_acc:
            best_acc = sum(val_accs) / len(val_accs)
            model_name = '%dk_%4.4f_%4.4f_model.ckpt' % (
                (step // 1000 + 1),
                sum(val_losses) / len(val_losses),
                sum(val_accs) / len(val_accs))
            check_dir(init_config['save_path'])
            if args.enable_modelarts:
                # On ModelArts the checkpoint is written to the local cache; it is copied back to OBS after training.
                save_checkpoint_path = '/cache/train_output/device_' + \
                    os.getenv('DEVICE_ID') + '/'
                save_checkpoint_path = '/cache/train_output/'
                if not os.path.exists(save_checkpoint_path):
                    os.makedirs(save_checkpoint_path)
                save_checkpoint(train_outer_loop, os.path.join(save_checkpoint_path, model_name))
            else:
                save_checkpoint(train_outer_loop, os.path.join(init_config['save_path'], model_name))
            print('Saved checkpoint %s...' % model_name)


def train_leo(init_config, inner_model_config, outer_model_config):
    inner_lr_init = inner_model_config['inner_lr_init']
    finetuning_lr_init = inner_model_config['finetuning_lr_init']
    total_train_steps = outer_model_config['total_steps']
    val_every_step = 5000
    val_every_step = 3000
    total_val_steps = 100
    if_save_checkpoint = True
    best_acc = 0
......@@ -72,8 +97,11 @@ def train_leo(init_config, inner_model_config, outer_model_config):
        inner_step=inner_model_config['inner_unroll_length'],
        finetune_inner_step=inner_model_config['finetuning_unroll_length'], is_meta_training=True)
    parm_dict = load_checkpoint('./resource/leo_ms_init.ckpt')
    load_param_into_net(train_outer_loop, parm_dict)
    if context.get_context("device_target") == "Ascend":
        # Mixed precision on Ascend: keep the network in float32 and cast only the Dense layers to float16.
        train_outer_loop.to_float(mindspore.float32)
        for _, cell in train_outer_loop.cells_and_names():
            if isinstance(cell, nn.Dense):
                cell.to_float(mindspore.float16)
    train_net = TrainOneStepCell(train_outer_loop,
                                 outer_model_config['outer_lr'],
......@@ -105,8 +133,8 @@ def train_leo(init_config, inner_model_config, outer_model_config):
                  train_net.group_params[0]['params'][1].T.asnumpy(),
                  val_acc.asnumpy(), now_t-old_t))
        if step % val_every_step == 4999:
            print('5000 step average time: %4.4f second...'%(sum_steptime/5000))
        if step % val_every_step == 2999:
            print('3000 step average time: %4.4f second...'%(sum_steptime/3000))
            sum_steptime = 0
            val_losses = []
......@@ -128,18 +156,8 @@ def train_leo(init_config, inner_model_config, outer_model_config):
                  (sum(val_losses)/len(val_losses), sum(val_accs)/len(val_accs)))
            print('=' * 50)
            if if_save_checkpoint:
                if not sum(val_accs)/len(val_accs) < best_acc:
                    best_acc = sum(val_accs)/len(val_accs)
                    model_name = '%dk_%4.4f_%4.4f_model.ckpt' % (
                        (step//1000+1),
                        sum(val_losses)/len(val_losses),
                        sum(val_accs)/len(val_accs))
                    check_dir(init_config['save_path'])
                    save_checkpoint(train_outer_loop, os.path.join(init_config['save_path'], model_name))
                    print('Saved checkpoint %s...'%model_name)
            save_checkpoint_to_file(if_save_checkpoint, val_accs, best_acc, step, val_losses,
                                    init_config, train_outer_loop)
        if step == (total_train_steps-1):
            train_end = time.time()
......@@ -166,6 +184,8 @@ if __name__ == '__main__':
    initConfig = config.get_init_config()
    inner_model_Config = config.get_inner_model_config()
    outer_model_Config = config.get_outer_model_config()
    args = config.get_config(get_args=True)
    print("===============inner_model_config=================")
    for key, value in inner_model_Config.items():
......@@ -175,10 +195,21 @@
        print(key+": "+str(value))
    context.set_context(mode=context.GRAPH_MODE, device_target=initConfig['device_target'])
    if initConfig['device_num'] > 1:
    if args.enable_modelarts:
        # On ModelArts, copy the dataset from OBS into the local cache and retarget the data path.
        import moxing as mox
        mox.file.copy_parallel(
            src_url=args.data_url, dst_url='/cache/dataset/device_' + os.getenv('DEVICE_ID'))
        train_dataset_path = os.path.join('/cache/dataset/device_' + os.getenv('DEVICE_ID'), "embeddings")
        initConfig['data_path'] = train_dataset_path
    elif initConfig['device_num'] > 1:
        init('nccl')
        context.set_auto_parallel_context(device_num=initConfig['device_num'],
                                          parallel_mode=ParallelMode.DATA_PARALLEL,
                                          gradients_mean=True)
    train_leo(initConfig, inner_model_Config, outer_model_Config)
    if args.enable_modelarts:
        # Copy the training outputs from the local cache back to OBS.
        mox.file.copy_parallel(
            src_url='/cache/train_output', dst_url=args.train_url)