diff --git a/research/cv/single_path_nas/README.md b/research/cv/single_path_nas/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..f4899b7ae70006667cc6c6ccbc8d89eb3e8b0e0f
--- /dev/null
+++ b/research/cv/single_path_nas/README.md
@@ -0,0 +1,338 @@
+# Contents
+
+<!-- TOC -->
+
+- [Contents](#contents)
+- [Single-path-nas description](#single-path-nas-description)
+- [Dataset](#dataset)
+- [Features](#features)
+    - [Mixed Precision](#mixed-precision)
+- [Environment Requirements](#environment-requirements)
+- [Quick Start](#quick-start)
+- [Scripts Description](#scripts-description)
+    - [Scripts and sample code](#scripts-and-sample-code)
+    - [Script parameters](#script-parameters)
+    - [Training process](#training-process)
+        - [Standalone training](#standalone-training)
+        - [Distributed training](#distributed-training)
+    - [Evaluation process](#evaluation-process)
+        - [Evaluate](#evaluate)
+    - [Export process](#export-process)
+        - [Export](#export)
+    - [Inference process](#inference-process)
+        - [Inference](#inference)
+- [Model description](#model-description)
+    - [Performance](#performance)
+        - [Training performance](#training-performance)
+            - [Single-Path-NAS on ImageNet-1k](#single-path-nas-on-imagenet-1k)
+        - [Inference performance](#inference-performance)
+            - [Single-Path-NAS on ImageNet-1k](#single-path-nas-on-imagenet-1k-1)
+- [ModelZoo Homepage](#modelzoo-homepage)
+
+<!-- /TOC -->
+
+# Single-path-nas description
+
+The authors of Single-Path-NAS use a single large 7x7 convolution to represent the three candidate convolutions
+of 3x3, 5x5, and 7x7. The weights of the smaller kernels are shared with the larger ones, so the largest kernel
+becomes a "superkernel". This way, training the model does not require choosing between different paths:
+the data is passed through one layer whose sub-kernels share weights. The search space is a block-based,
+straight structure. As in ProxylessNAS and FBNet, the Inverted Bottleneck block is used as the cell,
+and the number of layers is 22, as in MobileNetV2. Each layer has only two searchable hyper-parameters:
+the expansion ratio and the kernel size. The other hyper-parameters are fixed; for example, the number of
+filters in each of the 22 layers is fixed and, as in FBNet, slightly changed from MobileNetV2. The kernel sizes
+used in the paper are only 3x3 and 5x5, as in FBNet and ProxylessNAS; 7x7 kernels are not used. The expansion
+ratio has only the two choices of 3 and 6, so both searchable hyper-parameters are binary decisions.
+Single-Path-NAS uses the techniques described in the LightNN paper: in particular, a continuous smooth function
+represents the discrete choice, with the threshold given by a group-Lasso term. The paper uses the same
+technique as ProxylessNAS to express skip connections, which are represented by a zero layer.
+
+Paper: [Single-Path NAS: Designing Hardware-Efficient ConvNets in less than 4 Hours](https://arxiv.org/abs/1904.02877)
+
+Overview in Chinese: https://zhuanlan.zhihu.com/p/63605721
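+Below is a minimal NumPy sketch of the shared-weight "superkernel" idea; it is an illustration only,
+not the actual network code from `src/spnasnet.py`:
+
+```python
+import numpy as np
+
+
+def effective_kernel(w5x5, use_5x5):
+    """Build the effective kernel from a single shared 5x5 weight tensor.
+
+    The central 3x3 patch doubles as the 3x3 candidate kernel, so the 3x3
+    and 5x5 choices share weights instead of being separate paths.
+    """
+    if use_5x5:
+        return w5x5
+    mask = np.zeros_like(w5x5)
+    mask[..., 1:4, 1:4] = 1.0  # keep only the central 3x3 window
+    return w5x5 * mask
+
+
+w = np.random.randn(32, 16, 5, 5)        # (out_ch, in_ch, kH, kW)
+k3 = effective_kernel(w, use_5x5=False)  # behaves like a 3x3 kernel
+k5 = effective_kernel(w, use_5x5=True)   # full 5x5 superkernel
+```
+
+In the paper, the hard `use_5x5` decision is relaxed into a smooth indicator that compares the norm of the
+outer weight ring to a group-Lasso threshold, which keeps the architecture search differentiable.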
+# Dataset
+
+Dataset used: [ImageNet2012](http://www.image-net.org/)
+
+- Dataset size: 1000 categories of 224\*224 color images in total
+    - Training set: 1,281,167 images in total
+    - Test set: 50,000 images in total
+- Data format: JPEG
+    - Note: The data is processed in dataset.py.
+- Download the dataset and prepare the directory structure as follows:
+
+```text
+└─dataset
+    ├─train                 # Training dataset
+    └─val                   # Evaluation dataset
+```
+
+# Features
+
+## Mixed Precision
+
+The [mixed-precision](https://www.mindspore.cn/docs/programming_guide/zh-CN/master/enable_mixed_precision.html)
+training method uses single-precision and half-precision data to improve the training speed of
+deep learning neural networks while maintaining the accuracy achievable with single-precision training.
+Mixed-precision training increases computing speed and reduces memory usage, which makes it possible to train
+larger models or use larger batches on specific hardware.
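+The snippet below is a hedged sketch of how mixed precision is enabled in this repository's `train.py`
+(`amp_level="O3"` with FP32 batch norm and a loss scale manager); the tiny network, loss, and optimizer
+here are stand-ins for illustration only:
+
+```python
+import mindspore.nn as nn
+from mindspore.train.loss_scale_manager import DynamicLossScaleManager
+from mindspore.train.model import Model
+
+net = nn.Dense(10, 2)  # stand-in for the Single-Path-NAS network
+loss = nn.SoftmaxCrossEntropyWithLogits(sparse=True, reduction='mean')
+opt = nn.Momentum(net.trainable_params(), learning_rate=0.26, momentum=0.9)
+
+# O3 casts the network to float16 while keeping batch norm in float32;
+# dynamic loss scaling compensates for the narrower float16 range.
+model = Model(net, loss_fn=loss, optimizer=opt,
+              amp_level="O3", keep_batchnorm_fp32=True,
+              loss_scale_manager=DynamicLossScaleManager(init_loss_scale=1024))
+```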
+# Environment Requirements
+
+- Hardware (Ascend, GPU)
+    - Prepare a hardware environment with an Ascend processor or a CUDA-based GPU.
+- Framework
+    - [MindSpore](https://www.mindspore.cn/install/en)
+- For more information, please check the links below:
+    - [MindSpore tutorials](https://www.mindspore.cn/tutorials/zh-CN/r1.3/index.html)
+    - [MindSpore Python API](https://www.mindspore.cn/docs/api/zh-CN/r1.3/index.html)
+
+# Quick Start
+
+After installing MindSpore through the official website, you can follow the steps below for training and evaluation:
+
+- For the Ascend hardware
+
+    ```bash
+    # Run the training example
+    python train.py --device_id=0 > train.log 2>&1 &
+
+    # Run a distributed training example
+    bash ./scripts/run_distribute_train.sh [RANK_TABLE_FILE]
+
+    # Run evaluation example
+    python eval.py --checkpoint_path ./ckpt_0 > ./eval.log 2>&1 &
+
+    # Run the inference example
+    bash run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [DEVICE_ID]
+    ```
+
+    For distributed training, you need to create an **hccl** configuration file in JSON format in advance.
+    Please follow the instructions in the link below:
+    <https://gitee.com/mindspore/models/tree/master/utils/hccl_tools>
+
+- For the GPU hardware
+
+    ```bash
+    # Run the training example
+    python train.py --device_target="GPU" --data_path="/path/to/imagenet/train/" --lr_init=0.26 > train.log 2>&1 &
+
+    # Run a distributed training example
+    bash ./scripts/run_distribute_train_gpu.sh "/path/to/imagenet/train/"
+
+    # Run evaluation example
+    python eval.py --device_target="GPU" --val_data_path="/path/to/imagenet/val/" --checkpoint_path ./ckpt_0 > ./eval.log 2>&1 &
+    ```
+
+# Scripts Description
+
+## Scripts and sample code
+
+```bash
+├── model_zoo
+    ├── scripts
+    │   ├──run_distribute_train.sh        // Shell script for running the Ascend distributed training
+    │   ├──run_distribute_train_gpu.sh    // Shell script for running the GPU distributed training
+    │   ├──run_standalone_train.sh        // Shell script for running the Ascend standalone training
+    │   ├──run_standalone_train_gpu.sh    // Shell script for running the GPU standalone training
+    │   ├──run_eval.sh                    // Shell script for running the Ascend evaluation
+    │   ├──run_eval_gpu.sh                // Shell script for running the GPU evaluation
+    │   ├──run_infer_310.sh               // Shell script for running the Ascend 310 inference
+    ├── src
+    │   ├──lr_scheduler
+    │   │   ├──__init__.py
+    │   │   ├──linear_warmup.py                // Definitions for the warm-up functionality
+    │   │   ├──warmup_cosine_annealing_lr.py   // Definitions for the cosine annealing learning rate schedule
+    │   │   ├──warmup_step_lr.py               // Definitions for the exponential learning rate schedule
+    │   ├──__init__.py
+    │   ├──config.py                  // Parameters configuration
+    │   ├──CrossEntropySmooth.py      // Definitions for the cross entropy loss function
+    │   ├──dataset.py                 // Functions for creating a dataset
+    │   ├──spnasnet.py                // Single-Path-NAS architecture
+    │   ├──utils.py                   // Auxiliary functions
+    ├── create_imagenet2012_label.py  // Creating ImageNet labels
+    ├── eval.py                       // Evaluate the trained model
+    ├── export.py                     // Export the model to other formats
+    ├── postprocess.py                // Postprocessing for the Ascend 310 inference
+    ├── README.md                     // Single-Path-NAS description in English
+    ├── README_CN.md                  // Single-Path-NAS description in Chinese
+    ├── train.py                      // Train the model
+```
+
+## Script parameters
+
+Training parameters and evaluation parameters can be configured in the `config.py` file.
+
+- Parameters of a Single-Path-NAS model for the ImageNet-1k dataset.
+
+    ```python
+    'name': 'imagenet'          # dataset name
+    'pre_trained': 'False'      # whether to start from a pre-trained model
+    'num_classes': 1000         # number of classes in the dataset
+    'lr_init': 0.26             # initial learning rate: 0.26 for single-card training, 1.5 for eight-card parallel training
+    'batch_size': 128           # training batch size
+    'epoch_size': 180           # number of epochs
+    'momentum': 0.9             # momentum
+    'weight_decay': 1e-5        # weight decay value
+    'image_height': 224         # height of the model input image
+    'image_width': 224          # width of the model input image
+    'data_path': '/data/ILSVRC2012_train/'      # absolute path to the training dataset
+    'val_data_path': '/data/ILSVRC2012_val/'    # absolute path to the validation dataset
+    'device_target': 'Ascend'   # device
+    'device_id': 0              # ID of the device used for training/evaluation
+    'keep_checkpoint_max': 40   # number of checkpoints to keep
+    'checkpoint_path': None     # absolute path to a checkpoint file or to a directory where checkpoints are saved
+
+    'lr_scheduler': 'cosine_annealing'  # learning rate scheduler ['cosine_annealing', 'exponential']
+    'lr_epochs': [30, 60, 90]   # milestones for the exponential scheduler
+    'lr_gamma': 0.3             # learning rate decay for the exponential scheduler
+    'eta_min': 0.0              # minimal learning rate
+    'T_max': 180                # number of epochs for the cosine annealing scheduler
+    'warmup_epochs': 0          # number of warm-up epochs
+    'is_dynamic_loss_scale': 1  # use a dynamic loss scale manager (the scale manager is not used for GPU)
+    'loss_scale': 1024          # loss scale value
+    'label_smooth_factor': 0.1  # label smoothing factor
+    'use_label_smooth': True    # use label smoothing
+    ```
+
+For more configuration details, please refer to the script `config.py`.
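+As a hedged illustration of how these parameters are consumed (abridged values, helper name assumed),
+`config.py` stores them in an `EasyDict`, and `train.py` overrides selected entries with CLI flags:
+
+```python
+from easydict import EasyDict as edict
+
+# abridged excerpt in the style of src/config.py
+imagenet_cfg = edict({
+    'lr_init': 0.26,
+    'batch_size': 128,
+    'epoch_size': 180,
+    'device_target': 'Ascend',
+})
+
+
+def apply_cli_overrides(cfg, lr_init=None, device_target=None):
+    """Mirror the pattern in train.py: a flag overrides the config value
+    only when it was explicitly passed on the command line."""
+    if lr_init is not None:
+        cfg.lr_init = lr_init
+    if device_target is not None:
+        cfg.device_target = device_target
+    return cfg
+
+
+# e.g. eight-card GPU training bumps the learning rate to 1.5
+apply_cli_overrides(imagenet_cfg, lr_init=1.5, device_target='GPU')
+```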
+## Training process
+
+### Standalone training
+
+- Using an Ascend processor environment
+
+    ```bash
+    python train.py --device_id=0 > train.log 2>&1 &
+    ```
+
+    The above Python command runs in the background; its output can be inspected in the generated train.log file.
+
+- Using a GPU environment
+
+    ```bash
+    python train.py --device_target='GPU' --data_path="/path/to/imagenet/train/" --lr_init=0.26 > train.log 2>&1 &
+    ```
+
+    The above Python command runs in the background; its output can be inspected in the generated train.log file.
+
+### Distributed training
+
+- Using an Ascend processor environment
+
+    ```bash
+    bash ./scripts/run_distribute_train.sh [RANK_TABLE_FILE]
+    ```
+
+    The above shell script runs distributed training in the background.
+
+- Using a GPU environment
+
+    ```bash
+    bash ./scripts/run_distribute_train_gpu.sh [TRAIN_PATH](optional)
+    ```
+
+> TRAIN_PATH - Path to the directory with the training subset of the dataset.
+
+The above shell scripts run the distributed training in the background.
+A `train_parallel` folder is also created, where a copy of the code, the training logs,
+and the checkpoints are stored.
+
+## Evaluation process
+
+### Evaluate
+
+- Evaluate the model on the ImageNet-1k dataset using the Ascend environment
+
+    `./ckpt_0` is the directory where the trained models are saved in the .ckpt format.
+
+    ```bash
+    python eval.py --checkpoint_path ./ckpt_0 > ./eval.log 2>&1 &
+    OR
+    bash ./scripts/run_eval.sh [CKPT_FILE_OR_DIR]
+    ```
+
+- Evaluate the model on the ImageNet-1k dataset using the GPU environment
+
+    `./ckpt_0` is the directory where the trained models are saved in the .ckpt format.
+
+    ```bash
+    python eval.py --device_target="GPU" --checkpoint_path ./ckpt_0 > ./eval.log 2>&1 &
+    OR
+    bash ./scripts/run_eval_gpu.sh [CKPT_FILE_OR_DIR] [VALIDATION_DATASET](optional)
+    ```
+
+> CKPT_FILE_OR_DIR - Path to the trained model checkpoint or to a directory containing checkpoints.
+>
+> VALIDATION_DATASET - (optional) Path to the validation subset of the dataset.
+
+## Export process
+
+### Export
+
+    ```shell
+    python export.py --ckpt_file [CKPT_FILE] --device_target [DEVICE_TARGET]
+    ```
+
+> DEVICE_TARGET: Ascend or GPU
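+For reference, here is a hedged sketch of what the export step amounts to internally
+(the checkpoint file name is assumed for illustration):
+
+```python
+import numpy as np
+from mindspore import Tensor, context
+from mindspore.train.serialization import export, load_checkpoint, load_param_into_net
+
+import src.spnasnet as spnasnet
+
+context.set_context(mode=context.GRAPH_MODE)
+net = spnasnet.spnasnet(num_classes=1000)
+load_param_into_net(net, load_checkpoint("single-path-nas.ckpt"))
+
+# Trace the graph with a dummy NCHW input and serialize it as MINDIR.
+dummy_input = Tensor(np.zeros((1, 3, 224, 224), dtype=np.float32))
+export(net, dummy_input, file_name="single-path-nas", file_format="MINDIR")
+```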
+## Inference process
+
+### Inference
+
+Before running inference, we need to export the model first.
+A MINDIR model can be exported in any environment, while an AIR model can only be exported in an Ascend 910 environment.
+The following example uses the MINDIR model for inference.
+
+- Use the ImageNet-1k dataset for inference on the Ascend 310
+
+    The inference results are stored in the scripts directory,
+    and results similar to the following can be found in the acc.log file.
+
+    ```shell
+    # Ascend 310 inference
+    bash run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [DEVICE_ID]
+    Total data: 50000, top1 accuracy: 0.74214, top5 accuracy: 0.91652.
+    ```
+
+# Model description
+
+## Performance
+
+### Training performance
+
+#### Single-Path-NAS on ImageNet-1k
+
+| Parameter               | Ascend                                                                                 | GPU                                                                                    |
+| ----------------------- | -------------------------------------------------------------------------------------- | -------------------------------------------------------------------------------------- |
+| Model                   | single-path-nas                                                                        | single-path-nas                                                                        |
+| Resource                | Ascend 910                                                                             | V100 GPU, Intel Xeon Gold 6226R CPU @ 2.90GHz                                          |
+| Upload date             | 2021-06-27                                                                             | -                                                                                      |
+| MindSpore version       | 1.2.0                                                                                  | 1.5.0                                                                                  |
+| Dataset                 | ImageNet-1k Train, 1,281,167 images in total                                           | ImageNet-1k Train, 1,281,167 images in total                                           |
+| Training parameters     | epoch=180, batch_size=128, lr_init=0.26 (0.26 for a single card, 1.5 for eight cards)  | epoch=180, batch_size=128, lr_init=0.26 (0.26 for a single card, 1.5 for eight cards)  |
+| Optimizer               | Momentum                                                                               | Momentum                                                                               |
+| Loss function           | Softmax cross entropy                                                                  | Softmax cross entropy                                                                  |
+| Output                  | Probability                                                                            | Probability                                                                            |
+| Classification accuracy | Eight cards: top1=74.21%, top5=91.712%                                                 | Single card: top1=73.9%, top5=91.62%; eight cards: top1=74.01%, top5=91.66%            |
+| Speed                   | Single card: -; eight cards: 87.173 ms/step                                            | Single card: 221 ms/step; eight cards: 263 ms/step                                     |
+
+### Inference performance
+
+#### Single-Path-NAS on ImageNet-1k
+
+| Parameter               | Ascend                                          | GPU (8 cards)                              | GPU (1 card)                               |
+| ----------------------- | ----------------------------------------------- | ------------------------------------------ | ------------------------------------------ |
+| Model                   | single-path-nas                                 | single-path-nas                            | single-path-nas                            |
+| Resource                | Ascend 310                                      | V100 GPU                                   | V100 GPU                                   |
+| Upload date             | 2021-06-27                                      | -                                          | -                                          |
+| MindSpore version       | 1.2.0                                           | 1.5.0                                      | 1.5.0                                      |
+| Dataset                 | ImageNet-1k Val, 50,000 images in total         | ImageNet-1k Val, 50,000 images in total    | ImageNet-1k Val, 50,000 images in total    |
+| Classification accuracy | top1: 74.214%, top5: 91.652%                    | top1: 74.01%, top5: 91.66%                 | top1: 73.9%, top5: 91.62%                  |
+| Speed                   | Average 7.67324 ms over 50,000 inferences       | 1285 images/second                         | 1285 images/second                         |
+
+# ModelZoo homepage
+
+Please visit the official [homepage](https://gitee.com/mindspore/models).
diff --git a/research/cv/single_path_nas/README_CN.md b/research/cv/single_path_nas/README_CN.md
index 53920d81c1901ed262a27caffdeda9432aef4f1f..3c71cfe5396fdacd71cbfa3d2e9e0f09b7b94ebb 100644
--- a/research/cv/single_path_nas/README_CN.md
+++ b/research/cv/single_path_nas/README_CN.md
@@ -101,19 +101,33 @@ The authors of single-path-nas use one large 7x7 convolution to represent the 3x3, 5x5, and 7x7
 
 ```bash
 ├── model_zoo
-    ├── README_CN.md                      // Single-Path-NAS description (in Chinese)
     ├── scripts
-    │   ├──run_train.sh                   // Shell script for distributed training on Ascend
-    │   ├──run_eval.sh                    // Evaluation script
-    │   ├──run_infer_310.sh               // Ascend 310 inference script
+    │   ├──run_distribute_train.sh        // Shell script for distributed training on Ascend
+    │   ├──run_distribute_train_gpu.sh    // Shell script for running the GPU distributed training
+    │   ├──run_standalone_train.sh        // Shell script for running the Ascend standalone training
+    │   ├──run_standalone_train_gpu.sh    // Shell script for running the GPU standalone training
+    │   ├──run_eval.sh                    // Evaluation script
+    │   ├──run_eval_gpu.sh                // Shell script for running the GPU evaluation
+    │   ├──run_infer_310.sh               // Ascend 310 inference script
     ├── src
-    │   ├──lr_scheduler                   // Learning-rate folder with the .py files of the LR schedules
-    │   ├──dataset.py                     // Dataset creation
-    │   ├──CrossEntropySmooth.py          // Loss function
-    │   ├──spnasnet.py                    // Single-Path-NAS network architecture
-    │   ├──config.py                      // Parameter configuration
-    │   ├──utils.py                       // Custom network modules for spnasnet.py
-    ├── train.py                          // Training and testing file
+    │   ├──lr_scheduler                   // Learning-rate folder with the .py files of the LR schedules
+    │   │   ├──__init__.py
+    │   │   ├──linear_warmup.py                // Definitions for the warm-up functionality
+    │   │   ├──warmup_cosine_annealing_lr.py   // Definitions for the cosine annealing learning rate schedule
+    │   │   ├──warmup_step_lr.py               // Definitions for the exponential learning rate schedule
+    │   ├──__init__.py
+    │   ├──dataset.py                     // Dataset creation
+    │   ├──CrossEntropySmooth.py          // Loss function
+    │   ├──spnasnet.py                    // Single-Path-NAS network architecture
+    │   ├──config.py                      // Parameter configuration
+    │   ├──utils.py                       // Custom network modules for spnasnet.py
+    ├── create_imagenet2012_label.py      // Creating ImageNet labels
+    ├── eval.py                           // Evaluate the trained model
+    ├── export.py                         // Export the model to other formats
+    ├── postprocess.py                    // Postprocessing for the Ascend 310 inference
+    ├── README.md                         // Single-Path-NAS description in English
+    ├── README_CN.md                      // Single-Path-NAS description in Chinese
+    ├── train.py                          // Training and testing file
 ```
 
 ## Script parameters
 
diff --git a/research/cv/single_path_nas/eval.py b/research/cv/single_path_nas/eval.py
index a671c7a34afb9c87f8aa31e89b713ddfc71eb440..cbba5a025c30569a7e5a4635550b8378be370a24 100644
--- a/research/cv/single_path_nas/eval.py
+++ b/research/cv/single_path_nas/eval.py
@@ -27,7 +27,8 @@ from mindspore.nn.loss.loss import _Loss
 from mindspore.ops import functional as F
 from mindspore.ops import operations as P
 from mindspore.train.model import Model
-from mindspore.train.serialization import load_checkpoint, load_param_into_net
+from mindspore.train.serialization import load_checkpoint
+from mindspore.train.serialization import load_param_into_net
 
 import src.spnasnet as spnasnet
 from src.config import imagenet_cfg
@@ -38,6 +39,10 @@ set_seed(1)
 parser = argparse.ArgumentParser(description='single-path-nas')
 parser.add_argument('--dataset_name', type=str, default='imagenet', choices=['imagenet',],
                     help='dataset name.')
+parser.add_argument('--val_data_path', type=str, default=None,
+                    help='Path to the validation dataset (e.g. "/datasets/imagenet/val/")')
+parser.add_argument('--device_target', type=str, choices=['Ascend', 'GPU', 'CPU'],
+                    default=None, help='Target device: Ascend, GPU or CPU')
 parser.add_argument('--checkpoint_path', type=str, default='./ckpt_0', help='Checkpoint file path or dir path')
 parser.add_argument('--device_id', type=int, default=None, help='device id of Ascend. (Default: None)')
 args_opt = parser.parse_args()
 
@@ -65,18 +70,27 @@ if __name__ == '__main__':
 
     if args_opt.dataset_name == "imagenet":
         cfg = imagenet_cfg
-        dataset = create_dataset_imagenet(cfg.val_data_path, 1, False)
-        if not cfg.use_label_smooth:
-            cfg.label_smooth_factor = 0.0
-        loss = CrossEntropySmooth(sparse=True, reduction="mean",
-                                  smooth_factor=cfg.label_smooth_factor, num_classes=cfg.num_classes)
-        net = spnasnet.spnasnet(num_classes=cfg.num_classes)
-        model = Model(net, loss_fn=loss, metrics={'top_1_accuracy', 'top_5_accuracy'})
+
+        if args_opt.val_data_path is not None:
+            cfg.val_data_path = args_opt.val_data_path
+
+        if args_opt.device_target is not None:
+            cfg.device_target = args_opt.device_target
+
+        device_target = cfg.device_target
+        # Drop incomplete batches on the GPU to keep input shapes static.
+        dataset_drop_remainder = (device_target == 'GPU')
+
+        dataset = create_dataset_imagenet(cfg.val_data_path, 1, False, drop_remainder=dataset_drop_remainder)
     else:
         raise ValueError("dataset is not support.")
 
-    device_target = cfg.device_target
+    if not cfg.use_label_smooth:
+        cfg.label_smooth_factor = 0.0
+    loss = CrossEntropySmooth(sparse=True, reduction="mean",
+                              smooth_factor=cfg.label_smooth_factor, num_classes=cfg.num_classes)
+    net = spnasnet.spnasnet(num_classes=cfg.num_classes)
+    model = Model(net, loss_fn=loss, metrics={'top_1_accuracy', 'top_5_accuracy'})
+
     context.set_context(mode=context.GRAPH_MODE, device_target=cfg.device_target)
 
     if device_target == "Ascend":
         if args_opt.device_id is not None:
@@ -84,14 +98,16 @@ if __name__ == '__main__':
         else:
             context.set_context(device_id=cfg.device_id)
 
+    print(f'Checkpoint path: {args_opt.checkpoint_path}')
+
     if os.path.isfile(args_opt.checkpoint_path) and args_opt.checkpoint_path.endswith('.ckpt'):
         param_dict = load_checkpoint(args_opt.checkpoint_path)
         load_param_into_net(net, param_dict)
         net.set_train(False)
         acc = model.eval(dataset)
-        print(f"model {args_opt.checkpoint_path}'s accuracy is {acc}")
+        print(f"model {args_opt.checkpoint_path}'s accuracy is {acc}", flush=True)
     elif os.path.isdir(args_opt.checkpoint_path):
-        file_list = os.listdir(args_opt.checkpoint_path)
+        file_list = sorted(os.listdir(args_opt.checkpoint_path))
         for filename in file_list:
             de_path = os.path.join(args_opt.checkpoint_path, filename)
             if de_path.endswith('.ckpt'):
@@ -100,6 +116,6 @@ if __name__ == '__main__':
 
                 net.set_train(False)
                 acc = model.eval(dataset)
-                print(f"model {de_path}'s accuracy is {acc}")
+                print(f"model {de_path}'s accuracy is {acc}", flush=True)
     else:
         raise ValueError("args_opt.checkpoint_path must be a checkpoint file or dir contains checkpoint(s)")
diff --git a/research/cv/single_path_nas/export.py b/research/cv/single_path_nas/export.py
index 1c2b73de6e210c7cc184c6ee35f19a2fc5bc0c04..e4caf65fab02dafea8ea388178c6fc11886edd24 100644
--- a/research/cv/single_path_nas/export.py
+++ b/research/cv/single_path_nas/export.py
@@ -33,11 +33,11 @@ parser.add_argument('--width', type=int, default=224, help='input width')
 parser.add_argument('--height', type=int, default=224, help='input height')
 parser.add_argument("--file_format", type=str, choices=["AIR", "ONNX", "MINDIR"], default="MINDIR", help="file format")
 parser.add_argument("--device_target", type=str, default="Ascend",
-                    choices=["Ascend",], help="device target(default: Ascend)")
+                    choices=["Ascend", "GPU"], help="device target (default: Ascend)")
 args = parser.parse_args()
 
 context.set_context(mode=context.GRAPH_MODE, device_target=args.device_target)
-if args.device_target == "Ascend":
+if args.device_target in ["Ascend", "GPU"]:
     context.set_context(device_id=args.device_id)
 else:
     raise ValueError("Unsupported platform.")
diff --git a/research/cv/single_path_nas/scripts/run_distribute_train.sh b/research/cv/single_path_nas/scripts/run_distribute_train.sh
index f4b030f79165e42af0db99491f2dab8978a351fb..fab9920c8223c639bf50adbc5e9284cdb9cf1165 100644
--- a/research/cv/single_path_nas/scripts/run_distribute_train.sh
+++ b/research/cv/single_path_nas/scripts/run_distribute_train.sh
@@ -16,7 +16,7 @@
 
 if [ $# != 1 ]
 then
-    echo "Usage: sh run_train.sh [RANK_TABLE_FILE]"
+    echo "Usage: bash run_train.sh [RANK_TABLE_FILE]"
     exit 1
 fi
diff --git a/research/cv/single_path_nas/scripts/run_distribute_train_gpu.sh b/research/cv/single_path_nas/scripts/run_distribute_train_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..0a64c51536dbe8a692542178305f803053b40f87
--- /dev/null
+++ b/research/cv/single_path_nas/scripts/run_distribute_train_gpu.sh
@@ -0,0 +1,62 @@
+#!/bin/bash
+# Copyright 2021 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+
+if [ $# != 0 ] && [ $# != 1 ]
+then
+    echo "Usage: bash run_distribute_train_gpu.sh [TRAIN_DATASET](optional)"
+    exit 1
+fi
+
+if [ $# == 1 ] && [ ! -d "$1" ]
+then
+    echo "error: TRAIN_DATASET=$1 is not a directory"
+    exit 1
+fi
+
+ulimit -u unlimited
+
+# Re-create a clean working copy of the code for this run.
+rm -rf ./train_parallel
+mkdir ./train_parallel
+cp ./*.py ./train_parallel
+cp -r ./src ./train_parallel
+cd ./train_parallel || exit
+env > env.log
+
+if [ $# == 0 ]
+then
+    mpirun -n 8 \
+           --allow-run-as-root \
+           --output-filename 'log_output' \
+           --merge-stderr-to-stdout \
+        python ./train.py \
+           --use_gpu_distributed=1 \
+           --device_target='GPU' \
+           --lr_init=1.5 > log.txt 2>&1 &
+fi
+
+if [ $# == 1 ]
+then
+    mpirun -n 8 \
+           --allow-run-as-root \
+           --output-filename 'log_output' \
+           --merge-stderr-to-stdout \
+        python ./train.py \
+           --use_gpu_distributed=1 \
+           --device_target='GPU' \
+           --data_path="$1" \
+           --lr_init=1.5 > log.txt 2>&1 &
+fi
diff --git a/research/cv/single_path_nas/scripts/run_eval.sh b/research/cv/single_path_nas/scripts/run_eval.sh
index 5d30b6166f349f903da8609d096fcc4c1d428436..6cef937954c9c3281d56cf2fca4a56ee75fe9b54 100644
--- a/research/cv/single_path_nas/scripts/run_eval.sh
+++ b/research/cv/single_path_nas/scripts/run_eval.sh
@@ -16,7 +16,7 @@
 
 if [ $# != 1 ]
 then
-    echo "Usage: sh run_eval.sh checkpoint_path_dir/checkpoint_path_file"
+    echo "Usage: bash run_eval.sh checkpoint_path_dir/checkpoint_path_file"
     exit 1
 fi
diff --git a/research/cv/single_path_nas/scripts/run_eval_gpu.sh b/research/cv/single_path_nas/scripts/run_eval_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..f30886a8269eb6728a7528e6a9191dd2d0cb2f20
--- /dev/null
+++ b/research/cv/single_path_nas/scripts/run_eval_gpu.sh
@@ -0,0 +1,51 @@
+#!/bin/bash
+# Copyright 2021 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+if [ $# != 1 ] && [ $# != 2 ]
+then
+    echo "Usage: bash run_eval_gpu.sh [CKPT_FILE_OR_DIR] [VALIDATION_DATASET](optional)"
+    exit 1
+fi
+
+
+if [ ! -d "$1" ] && [ ! -f "$1" ]
+then
+    echo "error: CKPT_FILE_OR_DIR=$1 is neither a directory nor a file"
+    exit 1
+fi
+
+if [ $# == 2 ] && [ ! -d "$2" ]
+then
+    echo "error: VALIDATION_DATASET=$2 is not a directory"
+    exit 1
+fi
+
+ulimit -u unlimited
+
+if [ $# == 1 ]
+then
+    GLOG_v=3 python eval.py \
+        --checkpoint_path="$1" \
+        --device_target="GPU" > "./eval.log" 2>&1 &
+fi
+
+if [ $# == 2 ]
+then
+    GLOG_v=3 python eval.py \
+        --checkpoint_path="$1" \
+        --val_data_path="$2" \
+        --device_target="GPU" > "./eval.log" 2>&1 &
+fi
diff --git a/research/cv/single_path_nas/scripts/run_infer_310.sh b/research/cv/single_path_nas/scripts/run_infer_310.sh
index ab9d81199e91a9de1db5077add1c5d4d8e17abae..7e7cbfee7dd7f1fea612843c0ec1671e202186d9 100644
--- a/research/cv/single_path_nas/scripts/run_infer_310.sh
+++ b/research/cv/single_path_nas/scripts/run_infer_310.sh
@@ -15,7 +15,7 @@
 # ============================================================================
 
 if [[ $# -lt 2 || $# -gt 3 ]]; then
-    echo "Usage: sh run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [DEVICE_ID]
+    echo "Usage: bash run_infer_310.sh [MINDIR_PATH] [DATA_PATH] [DEVICE_ID]
     DEVICE_ID is optional, it can be set by environment variable device_id, otherwise the value is zero"
     exit 1
 fi
@@ -59,7 +59,7 @@ function compile_app()
     if [ -f "Makefile" ]; then
         make clean
     fi
-    sh build.sh &> build.log
+    bash build.sh &> build.log
 }
 
 function infer()
diff --git a/research/cv/single_path_nas/scripts/run_standalone_train.sh b/research/cv/single_path_nas/scripts/run_standalone_train.sh
index af523897117b1bbd63cd9d7e5b25cff5e1a93f8d..884c99f8b3fa8d0f5d664725c78f9e7466e9d069 100644
--- a/research/cv/single_path_nas/scripts/run_standalone_train.sh
+++ b/research/cv/single_path_nas/scripts/run_standalone_train.sh
@@ -16,7 +16,7 @@
 
 if [ $# != 0 ]
 then
-    echo "Usage: sh run_train.sh"
+    echo "Usage: bash run_train.sh"
     exit 1
 fi
diff --git a/research/cv/single_path_nas/scripts/run_standalone_train_gpu.sh b/research/cv/single_path_nas/scripts/run_standalone_train_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..16b3e7fbd4c0d5cf3e7e42a649c9c0b0d9ac22d4
--- /dev/null
+++ b/research/cv/single_path_nas/scripts/run_standalone_train_gpu.sh
@@ -0,0 +1,46 @@
+#!/bin/bash
+# Copyright 2021 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ============================================================================
+
+if [ $# != 0 ] && [ $# != 1 ]
+then
+    echo "Usage: bash run_standalone_train_gpu.sh [TRAIN_DATASET](optional)"
+    exit 1
+fi
+
+if [ $# == 1 ] && [ ! -d "$1" ]
+then
+    echo "error: TRAIN_DATASET=$1 is not a directory"
+    exit 1
+fi
+
+ulimit -u unlimited
+
+# Re-create a clean working copy of the code for this run.
+rm -rf ./train_standalone
+mkdir ./train_standalone
+cp ./*.py ./train_standalone
+cp -r ./src ./train_standalone
+cd ./train_standalone || exit
+env > env.log
+
+if [ $# == 0 ]
+then
+    python train.py --device_target='GPU' --lr_init=0.26 > log.txt 2>&1 &
+fi
+
+if [ $# == 1 ]
+then
+    python train.py --device_target='GPU' --data_path="$1" --lr_init=0.26 > log.txt 2>&1 &
+fi
diff --git a/research/cv/single_path_nas/src/config.py b/research/cv/single_path_nas/src/config.py
index fba997b6f099de1d098f958e1bacf175f3413a8d..54e9594cfae81bcbae84b0519dc1fd0739ab2fe9 100644
--- a/research/cv/single_path_nas/src/config.py
+++ b/research/cv/single_path_nas/src/config.py
@@ -42,7 +42,7 @@ imagenet_cfg = edict({
     'lr_epochs': [30, 60, 90],
     'lr_gamma': 0.3,
     'eta_min': 0.0,
-    'T_max': 150,
+    'T_max': 180,
     'warmup_epochs': 0,
 
     # loss related
diff --git a/research/cv/single_path_nas/src/dataset.py b/research/cv/single_path_nas/src/dataset.py
index 97b64529478e7e81af657edf1a6abcb64e370dd0..ac51ad69e82410668aef968c4396d816be056806 100644
--- a/research/cv/single_path_nas/src/dataset.py
+++ b/research/cv/single_path_nas/src/dataset.py
@@ -15,7 +15,6 @@
 """
 Data operations, will be used in train.py and eval.py
 """
-import os
 
 import mindspore.common.dtype as mstype
 import mindspore.dataset as ds
@@ -26,28 +25,27 @@ from src.config import imagenet_cfg
 
 def create_dataset_imagenet(dataset_path, repeat_num=1, training=True,
-                            num_parallel_workers=None, shuffle=True):
+                            num_parallel_workers=None, shuffle=True,
+                            device_num=None, rank_id=None, drop_remainder=False):
     """
     create a train or eval imagenet2012 dataset for resnet50
 
     Args:
         dataset_path(string): the path of dataset.
-        do_train(bool): whether dataset is used for train or eval.
         repeat_num(int): the repeat times of dataset. Default: 1
-        batch_size(int): the batch size of dataset. Default: 32
-        target(str): the device target. Default: Ascend
-
+        training(bool): whether the dataset is used for training or evaluation. Default: True.
+        num_parallel_workers(int): number of parallel workers. Default: None.
+        shuffle(bool): whether to shuffle the dataset. Default: True.
+        device_num(int): number of devices for the distributed training. Default: None
+        rank_id(int): rank of the process for the distributed training. Default: None
+        drop_remainder(bool): drop the last incomplete batch if the dataset size
+            is not divisible by the batch size. Default: False
     Returns:
         dataset
     """
-    device_num, rank_id = _get_rank_info()
-
-    if device_num == 1:
-        data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=num_parallel_workers, shuffle=shuffle)
-    else:
-        data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=num_parallel_workers, shuffle=shuffle,
-                                         num_shards=device_num, shard_id=rank_id)
+    data_set = ds.ImageFolderDataset(dataset_path, num_parallel_workers=num_parallel_workers,
+                                     shuffle=shuffle, num_shards=device_num, shard_id=rank_id)
 
     assert imagenet_cfg.image_height == imagenet_cfg.image_width, "image_height not equal image_width"
     image_size = imagenet_cfg.image_height
@@ -73,32 +71,14 @@ def create_dataset_imagenet(dataset_path, repeat_num=1, training=True,
     ]
 
     transform_label = [C.TypeCast(mstype.int32)]
 
-    if training:
-        data_set = data_set.map(input_columns="image", num_parallel_workers=16, operations=transform_img)
-        data_set = data_set.map(input_columns="label", num_parallel_workers=4, operations=transform_label)
-    else:
-        data_set = data_set.map(input_columns="image", num_parallel_workers=16, operations=transform_img)
-        data_set = data_set.map(input_columns="label", num_parallel_workers=4, operations=transform_label)
+    data_set = data_set.map(input_columns="image", num_parallel_workers=16,
+                            operations=transform_img, python_multiprocessing=True)
+    data_set = data_set.map(input_columns="label", num_parallel_workers=4,
+                            operations=transform_label)
 
     # apply batch operations
-    data_set = data_set.batch(imagenet_cfg.batch_size, drop_remainder=False)
+    data_set = data_set.batch(imagenet_cfg.batch_size, drop_remainder=drop_remainder)
 
     # apply dataset repeat operation
     data_set = data_set.repeat(repeat_num)
 
     return data_set
-
-
-def _get_rank_info():
-    """
-    get rank size and rank id
-    """
-    rank_size = int(os.environ.get("RANK_SIZE", 1))
-
-    if rank_size > 1:
-        from mindspore.communication.management import get_rank, get_group_size
-        rank_size = get_group_size()
-        rank_id = get_rank()
-    else:
-        rank_size = rank_id = None
-
-    return rank_size, rank_id
diff --git a/research/cv/single_path_nas/train.py b/research/cv/single_path_nas/train.py
index 1bc22b118e8b1f71ad1b604943a53c49d27eb789..3a54e28f3a794b31935b0e345e90aaff1207955f 100644
--- a/research/cv/single_path_nas/train.py
+++ b/research/cv/single_path_nas/train.py
@@ -22,11 +22,17 @@ import os
 
 from mindspore import Tensor
 from mindspore import context
 from mindspore.common import set_seed
-from mindspore.communication.management import init, get_rank
+from mindspore.communication.management import get_group_size
+from mindspore.communication.management import get_rank
+from mindspore.communication.management import init
 from mindspore.context import ParallelMode
 from mindspore.nn.optim.momentum import Momentum
-from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor, TimeMonitor
-from mindspore.train.loss_scale_manager import DynamicLossScaleManager, FixedLossScaleManager
+from mindspore.train.callback import CheckpointConfig
+from mindspore.train.callback import LossMonitor
+from mindspore.train.callback import ModelCheckpoint
+from mindspore.train.callback import TimeMonitor
+from mindspore.train.loss_scale_manager import DynamicLossScaleManager
+from mindspore.train.loss_scale_manager import FixedLossScaleManager
 from mindspore.train.model import Model
 
 from src import spnasnet
@@ -64,10 +70,21 @@ def lr_steps_imagenet(_cfg, steps_per_epoch):
 
 if __name__ == '__main__':
     parser = argparse.ArgumentParser(description='Single-Path-NAS Training')
-    parser.add_argument('--dataset_name', type=str, default='imagenet', choices=['imagenet',],
+    parser.add_argument('--dataset_name', type=str, default='imagenet', choices=['imagenet'],
                         help='dataset name.')
-    parser.add_argument('--filter_prefix', type=str, default='huawei', help='filter_prefix name.')
-    parser.add_argument('--device_id', type=int, default=None, help='device id of Ascend. (Default: None)')
+    parser.add_argument('--filter_prefix', type=str, default='huawei',
+                        help='filter_prefix name.')
+    parser.add_argument('--lr_init', type=float, default=None,
+                        help='Override the learning rate value in the configuration file')
+    parser.add_argument('--device_id', type=int, default=None,
+                        help='device id of Ascend. (Default: None)')
+    parser.add_argument('--device_target', type=str, choices=['Ascend', 'GPU'],
+                        default=None, help='Target device: Ascend or GPU')
+    parser.add_argument('--use_gpu_distributed', type=int, default=0,
+                        help='Enable distributed GPU training.')
+    parser.add_argument('--data_path', type=str, default=None,
+                        help='Path to the training dataset (e.g. "/datasets/imagenet/train/")')
+
     args_opt = parser.parse_args()
 
     if args_opt.dataset_name == "imagenet":
@@ -75,9 +92,23 @@ if __name__ == '__main__':
     else:
         raise ValueError("Unsupported dataset.")
 
+    if args_opt.data_path is not None:
+        cfg.data_path = args_opt.data_path
+
     # set context
+    if args_opt.device_target is not None:
+        cfg.device_target = args_opt.device_target
+
+    if args_opt.lr_init is not None:
+        cfg.lr_init = args_opt.lr_init
+
     device_target = cfg.device_target
-    context.set_context(mode=context.GRAPH_MODE, device_target=cfg.device_target, enable_graph_kernel=True)
+
+    # Enable the graph kernel fusion only for the Ascend device.
+    enable_graph_kernel = (device_target == 'Ascend')
+
+    context.set_context(mode=context.GRAPH_MODE, device_target=cfg.device_target,
+                        enable_graph_kernel=enable_graph_kernel)
 
     device_num = int(os.environ.get("DEVICE_NUM", 1))
 
@@ -90,15 +121,37 @@ if __name__ == '__main__':
 
         if device_num > 1:
             context.reset_auto_parallel_context()
-            context.set_auto_parallel_context(device_num=device_num, parallel_mode=ParallelMode.DATA_PARALLEL,
+            context.set_auto_parallel_context(device_num=device_num,
+                                              parallel_mode=ParallelMode.DATA_PARALLEL,
                                               gradients_mean=True)
             init()
             rank = get_rank()
+    elif device_target == "GPU":
+        # Use the rank and device count provided by the communication module.
+        if args_opt.use_gpu_distributed == 1:
+            init('nccl')
+            device_num = get_group_size()
+            rank = get_rank()
+            context.reset_auto_parallel_context()
+            context.set_auto_parallel_context(device_num=device_num,
+                                              parallel_mode=ParallelMode.DATA_PARALLEL,
+                                              gradients_mean=True)
+        else:
+            device_num = 1
+
     else:
         raise ValueError("Unsupported platform.")
 
+    # Drop incomplete batches on the GPU to keep input shapes static.
+    dataset_drop_remainder = (device_target == 'GPU')
+
     if args_opt.dataset_name == "imagenet":
-        dataset = create_dataset_imagenet(cfg.data_path, 1)
+        if device_num > 1:
+            dataset = create_dataset_imagenet(cfg.data_path, 1, num_parallel_workers=8,
+                                              device_num=device_num, rank_id=rank,
+                                              drop_remainder=dataset_drop_remainder)
+        else:
+            dataset = create_dataset_imagenet(cfg.data_path, 1, num_parallel_workers=8,
+                                              drop_remainder=dataset_drop_remainder)
     else:
         raise ValueError("Unsupported dataset.")
 
@@ -111,7 +164,6 @@ if __name__ == '__main__':
 
     if args_opt.dataset_name == 'imagenet':
         lr = lr_steps_imagenet(cfg, batch_num)
-
     def get_param_groups(network):
         """ get param groups """
         decay_params = []
@@ -152,6 +204,9 @@ if __name__ == '__main__':
         else:
             loss_scale_manager = FixedLossScaleManager(cfg.loss_scale, drop_overflow_update=False)
 
+    else:
+        raise ValueError("Unsupported dataset.")
+
     model = Model(net, loss_fn=loss, optimizer=opt, metrics={'top_1_accuracy', 'top_5_accuracy', 'loss'},
                   amp_level="O3", keep_batchnorm_fp32=True, loss_scale_manager=loss_scale_manager)