diff --git a/research/cv/osnet/README.md b/research/cv/osnet/README.md
index f449b8b59d3e2272bacffa36a5fdf093a9742ba1..4cafb342617c77b3aee80705f88497c513a1de50 100644
--- a/research/cv/osnet/README.md
+++ b/research/cv/osnet/README.md
@@ -80,8 +80,8 @@ In this project, the file organization is recommended as below:
 ├──cuhk03
 ├──cuhk03_release
 ├──cuhk-03.mat
-  ├──cuhk03_new_protocol_config_labeled.mat
-  ├──cuhk03_new_protocol_config_detected.mat
+ ├──cuhk03_new_protocol_config_labeled.mat
+ ├──cuhk03_new_protocol_config_detected.mat
 ├──msmt17
 ├──MSMT17_V1
 ├──train
@@ -92,11 +92,13 @@ In this project, the file organization is recommended as below:
 ├──list_query.txt
 ```
+When using the cuhk03 dataset for the first time, please delete the splits file generated with the default settings.
+
 # [Features](#contents)

 # [Environment Requirements](#contents)

-- Hardware（Ascend）
-  - Prepare hardware environment with Ascend processor.
+- Hardware（Ascend/GPU）
+  - Prepare hardware environment with Ascend or GPU processor.
 - Framework
   - [MindSpore](https://www.mindspore.cn/install/en)
@@ -115,7 +117,10 @@ In this project, the file organization is recommended as below:
 ├─scripts
   ├─run_train_standalone_ascend.sh      # launch standalone training with ascend platform(1p)
   ├─run_train_distribute_ascend.sh      # launch distributed training with ascend platform(8p)
-  └─run_eval_ascend.sh                  # launch evaluating with ascend platform
+  ├─run_eval_ascend.sh                  # launch evaluating with ascend platform
+  ├─run_train_standalone_gpu.sh         # launch standalone training with gpu platform(1p)
+  ├─run_train_distribute_gpu.sh         # launch distributed training with gpu platform(8p)
+  └─run_eval_gpu.sh                     # launch evaluating with gpu platform
 ├─src
   ├─cross_entropy_loss.py               # cross entropy loss
   ├─dataset.py                          # data preprocessing
@@ -143,7 +148,7 @@ data_path:/home/osnet/datasets
 ```bash
 # distribute training example(8p)
 bash run_train_distribute_ascend.sh [RANK_TABLE_FILE] [DATASET] [PRETRAINED_CKPT_PATH](optional)
-# example: bash run_distribute_train_ascend.sh ./hccl_8p.json market1501 /home/osnet/checkpoint/market1501/osnet-240_101.ckpt
+# example: bash run_train_distribute_ascend.sh ./hccl_8p.json market1501 /home/osnet/checkpoint/market1501/osnet-240_101.ckpt

 # standalone training
 bash run_train_standalone_ascend.sh [DATASET] [DEVICE_ID] [PRETRAINED_CKPT_PATH](optional)
@@ -162,7 +167,7 @@ bash run_eval_ascend.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
 > The `PRETRAINED_CKPT_PATH` should be a checkpoint saved in the training process on Ascend; training will resume from that checkpoint and continue.

 ### Launch

-- Training needs to load the parameters pre-trained on ImageNet. You can download the checkpoint file on [link](https://pan.baidu.com/s/1BLf2BwFYRXwgD44zkzExRQ), the extraction code is `1961`. After downloading, put it in the `./model_utils` folder. You can also download the `.pth` file pre-trained under Pytorch here, and convert it to `.ckpt` file through `./model_utils/pth_to_ckpt.py`.
+- Training needs to load parameters pre-trained on ImageNet. You can download the checkpoint file [init_osnet.ckpt](https://pan.baidu.com/s/1BLf2BwFYRXwgD44zkzExRQ) (extraction code `1961`) and, after downloading, put it in the `./model_utils` folder. Alternatively, download the [osnet_x1_0_imagenet.pth](https://drive.google.com/file/d/1LaG1EJpHrxdAxKnSCJ_i0u-nbxSAeiFY/view) file pre-trained on ImageNet from [osnet (pytorch)](https://github.com/KaiyangZhou/deep-person-reid), and convert it to a `.ckpt` file with `./model_utils/pth_to_ckpt.py`, as sketched below.
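+
+  A minimal sketch of what such a conversion can look like (illustrative only; the exact parameter-name mapping is an assumption, see the repository's `./model_utils/pth_to_ckpt.py` for the real script):
+
+  ```python
+  # Illustrative .pth -> .ckpt conversion sketch; the BatchNorm renames are assumed.
+  import torch
+  from mindspore import Tensor, save_checkpoint
+
+  def pth_to_ckpt(pth_path, ckpt_path):
+      state_dict = torch.load(pth_path, map_location='cpu')
+      params = []
+      for name, value in state_dict.items():
+          # PyTorch and MindSpore name BatchNorm statistics differently.
+          ms_name = (name.replace('running_mean', 'moving_mean')
+                         .replace('running_var', 'moving_variance'))
+          params.append({'name': ms_name, 'data': Tensor(value.numpy())})
+      save_checkpoint(params, ckpt_path)
+
+  pth_to_ckpt('osnet_x1_0_imagenet.pth', 'init_osnet.ckpt')
+  ```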
 - Running on local server

@@ -174,13 +179,26 @@ bash run_eval_ascend.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
   Ascend:
   # distribute training example(8p)
   bash run_train_distribute_ascend.sh [RANK_TABLE_FILE] [DATASET] [PRETRAINED_CKPT_PATH](optional)
-  # example: bash run_distribute_train_ascend.sh ./hccl_8p.json market1501 /home/osnet/checkpoint/market1501/osnet-240_101.ckpt
+  # example: bash run_train_distribute_ascend.sh ./hccl_8p.json market1501 /home/osnet/checkpoint/market1501/osnet-240_101.ckpt

   # standalone training
   bash run_train_standalone_ascend.sh [DATASET] [DEVICE_ID] [PRETRAINED_CKPT_PATH](optional)
   # example: bash run_train_standalone_ascend.sh market1501 0 /home/osnet/checkpoint/market1501/osnet-240_101.ckpt
   ```

+  ```bash
+  # training example
+  shell:
+      GPU:
+  # distribute training example(8p)
+  bash run_train_distribute_gpu.sh [DEVICE_NUM] [DATASET] [PRETRAINED_CKPT_PATH](optional)
+  # example: bash run_train_distribute_gpu.sh 8 market1501 /home/osnet/checkpoint/market1501/osnet-240_101.ckpt
+
+  # standalone training
+  bash run_train_standalone_gpu.sh [DATASET] [DEVICE_ID] [PRETRAINED_CKPT_PATH](optional)
+  # example: bash run_train_standalone_gpu.sh market1501 0 /home/osnet/checkpoint/market1501/osnet-240_101.ckpt
+  ```
+
 - Running on ModelArts
 - ModelArts (If you want to run in modelarts, please check the official documentation of [modelarts](https://support.huaweicloud.com/modelarts/), and you can start training as follows)

@@ -295,6 +313,14 @@ You can start evaluating using python or shell scripts. The usage of shell scrip
   # example: bash run_eval_ascend.sh market1501 /home/osnet/scripts/output/checkpoint/market1501/osnet-240_101.ckpt 0
   ```

+  ```bash
+  # eval example
+  shell:
+      GPU:
+  bash run_eval_gpu.sh [DATASET] [CHECKPOINT_PATH] [DEVICE_ID]
+  # example: bash run_eval_gpu.sh market1501 /home/osnet/scripts/output/checkpoint/market1501/osnet-240_101.ckpt 0
+  ```
+
 - Running on ModelArts.
 ```bash

@@ -381,89 +407,117 @@ Rank-20 : 97.4%

 #### OSNet train on Market1501

-| Parameters | OSNet |
-| -------------------------- | ----------------------------------------------------------- |
-| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory 755G; OS Euler2.8 |
-| uploaded Date | 18/12/2021 (month/day/year) |
-| MindSpore Version | 1.5.0 |
-| Dataset | Market1501 |
-| Training Parameters | epoch=250, batch_size = 128, lr=0.001 |
-| Optimizer | Adam |
-| Loss Function | Label Smoothing Cross Entropy Loss |
-| outputs | probability |
-| Speed | 1pc: 175.741 ms/step; 8pcs: 181.027 ms/step | | |
-| Checkpoint for Fine tuning | 29.4M (.ckpt file) |
+| Parameters | Ascend | GPU |
+| -------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
+| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory 755G; OS Euler2.8 | RTX 3090; CPU 2.90GHz, 16cores; Memory 24G |
+| uploaded Date | 18/12/2021 (day/month/year) | 27/1/2022 (day/month/year) |
+| MindSpore Version | 1.5.0 | 1.5.0 |
+| Dataset | Market1501 | Market1501 |
+| Training Parameters | epoch=250, batch_size = 128, lr=0.001 | epoch=250, batch_size = 128, lr=0.001 |
+| Optimizer | Adam | Adam |
+| Loss Function | Label Smoothing Cross Entropy Loss | Label Smoothing Cross Entropy Loss |
+| outputs | probability | probability |
+| Speed | 1pc: 175.741 ms/step; 8pcs: 181.027 ms/step | 1pc: 198.320 ms/step; 8pcs: 185.140 ms/step |
+| Checkpoint for Fine tuning | 29.4M (.ckpt file) | 30.91M (.ckpt file) |

 #### OSNet train on DukeMTMC-reID

-| Parameters | OSNet |
-| -------------------------- | ----------------------------------------------------------- |
-| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory 755G; OS Euler2.8 |
-| uploaded Date | 18/12/2021 (month/day/year) |
-| MindSpore Version | 1.5.0 |
-| Dataset | DukeMTMC-reID |
-| Training Parameters | epoch=250, batch_size = 128, lr=0.001 |
-| Optimizer | Adam |
-| Loss Function | Label Smoothing Cross Entropy Loss |
-| outputs | probability |
-| Speed | 1pc: 175.904 ms/step; 8pcs: 180.340 ms/step | | |
-| Checkpoint for Fine tuning | 29.11M (.ckpt file) |
+| Parameters | Ascend | GPU |
+| -------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
+| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory 755G; OS Euler2.8 | RTX 3090; CPU 2.90GHz, 16cores; Memory 24G |
+| uploaded Date | 18/12/2021 (day/month/year) | 27/1/2022 (day/month/year) |
+| MindSpore Version | 1.5.0 | 1.5.0 |
+| Dataset | DukeMTMC-reID | DukeMTMC-reID |
+| Training Parameters | epoch=250, batch_size = 128, lr=0.001 | epoch=250, batch_size = 128, lr=0.001 |
+| Optimizer | Adam | Adam |
+| Loss Function | Label Smoothing Cross Entropy Loss | Label Smoothing Cross Entropy Loss |
+| outputs | probability | probability |
+| Speed | 1pc: 175.904 ms/step; 8pcs: 180.340 ms/step | 1pc: 228.166 ms/step; 8pcs: 178.790 ms/step |
+| Checkpoint for Fine tuning | 29.11M (.ckpt file) | 30.63M (.ckpt file) |

 #### OSNet train on MSMT17

-| Parameters | OSNet |
+| Parameters | Ascend | GPU |
+| -------------------------- | ----------------------------------------------------------- | ----------------------------------------------------------- |
+| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory 755G; OS Euler2.8 | RTX 3090; CPU 2.90GHz, 16cores; Memory 24G |
+| uploaded Date | 18/12/2021 (day/month/year) | 27/1/2022 (day/month/year) |
+| MindSpore Version | 1.5.0 | 1.5.0 |
+| Dataset | MSMT17 | MSMT17 |
+| Training Parameters | epoch=250, batch_size = 128, lr=0.001 | epoch=250, batch_size = 128, lr=0.001 |
+| Optimizer | Adam | Adam |
+| Loss Function | Label Smoothing Cross Entropy Loss | Label Smoothing Cross Entropy Loss |
+| outputs | probability | probability |
+| Speed | 1pc: 183.783 ms/step; 8pcs: 180.458 ms/step | 1pc: 309.691 ms/step; 8pcs: 218.579 ms/step |
+| Checkpoint for Fine tuning | 31.12M (.ckpt file) | 32.82M (.ckpt file) |
+
+#### OSNet train on CUHK03
+
+| Parameters | GPU |
 | -------------------------- | ----------------------------------------------------------- |
-| Resource | Ascend 910; CPU 2.60GHz, 192cores; Memory 755G; OS Euler2.8 |
-| uploaded Date | 18/12/2021 (month/day/year) |
+| Resource | RTX 3090; CPU 2.90GHz, 16cores; Memory 24G |
+| uploaded Date | 27/1/2022 (day/month/year) |
 | MindSpore Version | 1.5.0 |
-| Dataset | MSMT17 |
+| Dataset | CUHK03 |
 | Training Parameters | epoch=250, batch_size = 128, lr=0.001 |
 | Optimizer | Adam |
 | Loss Function | Label Smoothing Cross Entropy Loss |
 | outputs | probability |
-| Speed | 1pc: 183.783 ms/step; 8pcs: 180.458 ms/step | | |
-| Checkpoint for Fine tuning | 31.12M (.ckpt file) |
+| Speed | 1pc: 144.631 ms/step; 8pcs: 237.241 ms/step |
+| Checkpoint for Fine tuning | 30.96M (.ckpt file) |

 ### Inference Performance

 #### OSNet on Market1501

-| Parameters | Ascend |
-| ------------------- | --------------------------- |
-| Resource | Ascend 910; OS Euler2.8 |
-| Uploaded Date | 18/12/2021 (month/day/year) |
-| MindSpore Version | 1.5.0 |
-| Dataset | Market1501 |
-| batch_size | 300 |
-| outputs | probability |
-| mAP | 1pc: 82.4%; 8pcs:83.7% |
-| Rank-1 | 1pc: 93.3%; 8pcs:93.9% |
+| Parameters | Ascend | GPU |
+| ------------------- | --------------------------- | --------------------------- |
+| Resource | Ascend 910; OS Euler2.8 | RTX 3090 |
+| Uploaded Date | 18/12/2021 (day/month/year) | 27/1/2022 (day/month/year) |
+| MindSpore Version | 1.5.0 | 1.5.0 |
+| Dataset | Market1501 | Market1501 |
+| batch_size | 300 | 300 |
+| outputs | probability | probability |
+| mAP | 1pc: 82.4%; 8pcs: 83.7% | 1pc: 81.9%; 8pcs: 83.4% |
+| Rank-1 | 1pc: 93.3%; 8pcs: 93.9% | 1pc: 93.3%; 8pcs: 93.8% |

 #### OSNet on DukeMTMC-reID

-| Parameters | Ascend |
-| ------------------- | --------------------------- |
-| Resource | Ascend 910; OS Euler2.8 |
-| Uploaded Date | 18/12/2021 (month/day/year) |
-| MindSpore Version | 1.5.0 |
-| Dataset | DukeMTMC-reID |
-| batch_size | 300 |
-| outputs | probability |
-| mAP | 1pc: 69.8%; 8pcs:74.6% |
-| Rank-1 | 1pc: 86.2%; 8pcs:89.2% |
+| Parameters | Ascend | GPU |
+| ------------------- | --------------------------- | --------------------------- |
+| Resource | Ascend 910; OS Euler2.8 | RTX 3090 |
+| Uploaded Date | 18/12/2021 (day/month/year) | 27/1/2022 (day/month/year) |
+| MindSpore Version | 1.5.0 | 1.5.0 |
+| Dataset | DukeMTMC-reID | DukeMTMC-reID |
+| batch_size | 300 | 300 |
+| outputs | probability | probability |
+| mAP | 1pc: 69.8%; 8pcs: 74.6% | 1pc: 69.8%; 8pcs: 74.5% |
+| Rank-1 | 1pc: 86.2%; 8pcs: 89.2% | 1pc: 86.9%; 8pcs: 88.5% |

 #### OSNet on MSMT17

-| Parameters | Ascend |
+| Parameters | Ascend | GPU |
+| ------------------- | --------------------------- | --------------------------- |
+| Resource | Ascend 910; OS Euler2.8 | RTX 3090 |
+| Uploaded Date | 18/12/2021 (day/month/year) | 27/1/2022 (day/month/year) |
+| MindSpore Version | 1.5.0 | 1.5.0 |
+| Dataset | MSMT17 | MSMT17 |
+| batch_size | 300 | 300 |
+| outputs | probability | probability |
+| mAP | 1pc: 43.1%; 8pcs: 50.0% | 1pc: 46.5%; 8pcs: 53.5% |
+| Rank-1 | 1pc: 71.5%; 8pcs: 77.7% | 1pc: 71.7%; 8pcs: 77.5% |
+
+#### OSNet on CUHK03
+
+| Parameters | GPU |
 | ------------------- | --------------------------- |
-| Resource | Ascend 910; OS Euler2.8 |
-| Uploaded Date | 18/12/2021 (month/day/year) |
+| Resource | RTX 3090 |
+| Uploaded Date | 27/1/2022 (day/month/year) |
 | MindSpore Version | 1.5.0 |
-| Dataset | MSMT17 |
+| Dataset | CUHK03 |
 | batch_size | 300 |
 | outputs | probability |
-| mAP | 1pc: 43.1%; 8pcs:50.0% |
-| Rank-1 | 1pc: 71.5%; 8pcs:77.7% |
+| mAP | 1pc: 60.8%; 8pcs: 54.9% |
+| Rank-1 | 1pc: 65.4%; 8pcs: 60.1% |

 # [Description of Random Situation](#contents)
diff --git a/research/cv/osnet/eval.py b/research/cv/osnet/eval.py
index 7336971b797327278f53105c2e66ef60bf255b9c..69c0bd9e9ef3e3c2f842cf72f2be7bb4c4737ace 100644
--- a/research/cv/osnet/eval.py
+++ b/research/cv/osnet/eval.py
@@ -179,4 +179,9 @@ def eval_net(net=None):
             i += 1

 if __name__ == '__main__':
+    if config.target in ('msmt17', 'cuhk03'):  # these benchmarks are evaluated with cosine distance
+        config.dist_metric = 'cosine'
+    else:
+        config.dist_metric = 'euclidean'
+    print("dist_metric:", config.dist_metric)
     eval_net()
diff --git a/research/cv/osnet/export.py b/research/cv/osnet/export.py
index a69caf0ec71ab9a4e64eafda4daaa04187d0140c..49b46560c5f1a34add718e90b6d3ef547261de10 100644
--- a/research/cv/osnet/export.py
+++ b/research/cv/osnet/export.py
@@ -24,7 +24,7 @@ from mindspore import Tensor, load_checkpoint, load_param_into_net, export, cont

 from src.osnet import create_osnet
-from src.datasets_define import (Market1501, DukeMTMCreID, MSMT17, CUHK03)
+from src.datasets_define import Market1501, DukeMTMCreID, MSMT17, CUHK03
 from model_utils.config import config
 from model_utils.moxing_adapter import moxing_wrapper
 from model_utils.device_adapter import get_device_id
diff --git a/research/cv/osnet/model_utils/transforms.py b/research/cv/osnet/model_utils/transforms.py
index 2f6a849bec59a5f612060c936028f6c5c275eaaf..779d5e89f21ff81f237d3a2a9f9c6d3c872a5271 100644
--- a/research/cv/osnet/model_utils/transforms.py
+++ b/research/cv/osnet/model_utils/transforms.py
@@ -17,7 +17,7 @@
 import math
 import random

-from mindspore.dataset.vision.c_transforms import(Resize, Rescale, Normalize, HWC2CHW, RandomHorizontalFlip)
+from mindspore.dataset.vision.c_transforms import Resize, Rescale, Normalize, HWC2CHW, RandomHorizontalFlip
 from mindspore.dataset.transforms.c_transforms import Compose
diff --git a/research/cv/osnet/osnet_config.yaml b/research/cv/osnet/osnet_config.yaml
index 5fca76f9d0d45c8e484f1cc450e683191d352564..5479085e88a42750b5692d01bffad687eecd695e 100644
--- a/research/cv/osnet/osnet_config.yaml
+++ b/research/cv/osnet/osnet_config.yaml
@@ -35,6 +35,7 @@ save_checkpoint: True
 save_checkpoint_epochs: 10
 pretrained_dir: 'model_utils'
 run_distribute: False
+LogSummary: False  # attach a SummaryCollector during training when True

 # optimizer
 adam_beta1: 0.9
diff --git a/research/cv/osnet/scripts/run_eval_gpu.sh b/research/cv/osnet/scripts/run_eval_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..03fd17828001f1575de72dfe0e8cd9701609afb1
--- /dev/null
+++ b/research/cv/osnet/scripts/run_eval_gpu.sh
@@ -0,0 +1,58 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ====================================================================================
+
+if [ $# != 3 ]
+then
+    echo "Usage: bash run_eval_gpu.sh [market1501|dukemtmcreid|cuhk03|msmt17] [CHECKPOINT_PATH] [DEVICE_ID]"
+exit 1
+fi
+
+if [ $1 != "market1501" ] && [ $1 != "dukemtmcreid" ] && [ $1 != "cuhk03" ] && [ $1 != "msmt17" ]
+then
+    echo "error: the selected dataset is not market1501, dukemtmcreid, cuhk03 or msmt17"
+exit 1
+fi
+dataset_name=$1
+
+
+if [ ! -f $2 ]
+then
+    echo "error: CHECKPOINT_PATH=$2 is not a file"
+exit 1
+fi
+PATH1=$(realpath $2)
+
+BASEPATH=$(cd "`dirname $0`" || exit; pwd)
+export PYTHONPATH=${BASEPATH}:$PYTHONPATH
+
+ulimit -u unlimited
+export DEVICE_NUM=1
+export DEVICE_ID=$3
+
+export RANK_SIZE=1
+export RANK_ID=$3
+
+config_path="${BASEPATH}/../osnet_config.yaml"
+echo "config path is: ${config_path}"
+
+if [ -d "./eval" ];
+then
+    rm -rf ./eval
+fi
+mkdir ./eval
+cd ./eval || exit
+echo "start evaluating for device $DEVICE_ID"
+python ${BASEPATH}/../eval.py --config_path=$config_path --checkpoint_file_path=$PATH1 --device_target="GPU" --target=$dataset_name > ./eval.log 2>&1 &
diff --git a/research/cv/osnet/scripts/run_train_distribute_gpu.sh b/research/cv/osnet/scripts/run_train_distribute_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..7bc0d56fa92438515897b57831ae6b341feeec36
--- /dev/null
+++ b/research/cv/osnet/scripts/run_train_distribute_gpu.sh
@@ -0,0 +1,74 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ====================================================================================
+
+if [ $# != 2 ] && [ $# != 3 ]
+then
+    echo "Usage: bash run_train_distribute_gpu.sh [DEVICE_NUM] [market1501|dukemtmcreid|cuhk03|msmt17] [PRETRAINED_CKPT_PATH](optional)"
+exit 1
+fi
+
+if [ $2 != "market1501" ] && [ $2 != "dukemtmcreid" ] && [ $2 != "cuhk03" ] && [ $2 != "msmt17" ]
+then
+    echo "error: the selected dataset is not market1501, dukemtmcreid, cuhk03 or msmt17"
+exit 1
+fi
+dataset_name=$2
+
+if [ $# == 3 ]
+then
+    if [ ! -f $3 ]
+    then
+        echo "error: PRETRAINED_CKPT_PATH=$3 is not a file"
+        exit 1
+    fi
+    PATH1=$(realpath $3)
+fi
+
+ulimit -u unlimited
+export RANK_SIZE=$1
+
+
+BASEPATH=$(cd "`dirname $0`" || exit; pwd)
+config_path="${BASEPATH}/../osnet_config.yaml"
+echo "config path is: ${config_path}"
+if [ -d "./train" ];
+then
+    rm -rf ./train
+fi
+mkdir ./train
+cd ./train || exit
+echo "start training on $RANK_SIZE devices"
+
+if [ $# == 2 ];
+then
+    mpirun -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout --allow-run-as-root \
+    python ${BASEPATH}/../train.py \
+        --config_path=$config_path \
+        --device_target="GPU" \
+        --run_distribute=True \
+        --source=$dataset_name \
+        --output_path='./osnet_ckpt_output' > train.log 2>&1 &
+else
+    mpirun -n $RANK_SIZE --output-filename log_output --merge-stderr-to-stdout --allow-run-as-root \
+    python ${BASEPATH}/../train.py \
+        --config_path=$config_path \
+        --device_target="GPU" \
+        --run_distribute=True \
+        --source=$dataset_name \
+        --output_path='./osnet_ckpt_output' \
+        --checkpoint_file_path=$PATH1 > train.log 2>&1 &
+fi
+
diff --git a/research/cv/osnet/scripts/run_train_standalone_gpu.sh b/research/cv/osnet/scripts/run_train_standalone_gpu.sh
new file mode 100644
index 0000000000000000000000000000000000000000..5838576f58003b6dfc281c07469fe9ba69f88a76
--- /dev/null
+++ b/research/cv/osnet/scripts/run_train_standalone_gpu.sh
@@ -0,0 +1,65 @@
+#!/bin/bash
+# Copyright 2022 Huawei Technologies Co., Ltd
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# ====================================================================================
+
+if [ $# != 2 ] && [ $# != 3 ]
+then
+    echo "Usage: bash run_train_standalone_gpu.sh [market1501|dukemtmcreid|cuhk03|msmt17] [DEVICE_ID] [PRETRAINED_CKPT_PATH](optional)"
+exit 1
+fi
+
+if [ $1 != "market1501" ] && [ $1 != "dukemtmcreid" ] && [ $1 != "cuhk03" ] && [ $1 != "msmt17" ]
+then
+    echo "error: the selected dataset is not market1501, dukemtmcreid, cuhk03 or msmt17"
+exit 1
+fi
+dataset_name=$1
+
+if [ $# == 3 ]
+then
+    if [ ! -f $3 ]
+    then
+        echo "error: PRETRAINED_CKPT_PATH=$3 is not a file"
+        exit 1
+    fi
+    PATH1=$(realpath $3)
+fi
+
+ulimit -u unlimited
+export DEVICE_NUM=1
+export DEVICE_ID=$2
+
+export RANK_SIZE=1
+export RANK_ID=$2
+
+BASEPATH=$(cd "`dirname $0`" || exit; pwd)
+config_path="${BASEPATH}/../osnet_config.yaml"
+echo "config path is: ${config_path}"
+
+if [ -d "./train" ];
+then
+    rm -rf ./train
+fi
+mkdir ./train
+cd ./train || exit
+echo "start training for device $DEVICE_ID"
+if [ $# == 2 ];
+then
+    python ${BASEPATH}/../train.py --config_path=$config_path --source=$dataset_name \
+        --device_target="GPU" --output_path='./output' > train.log 2>&1 &
+else
+    python ${BASEPATH}/../train.py --config_path=$config_path --source=$dataset_name \
+        --device_target="GPU" --output_path='./output' --checkpoint_file_path=$PATH1 > train.log 2>&1 &
+fi
diff --git a/research/cv/osnet/src/dataset.py b/research/cv/osnet/src/dataset.py
index 3dabf1eaf5bb5a64f20ada7cc878a8062830341a..4b121487421b6efbcd743aff4d6de5ab9bb71297 100644
--- a/research/cv/osnet/src/dataset.py
+++ b/research/cv/osnet/src/dataset.py
@@ -18,7 +18,7 @@ import os

 import mindspore.dataset as ds

 from model_utils.transforms import build_train_transforms, build_test_transforms
-from .datasets_define import (Market1501, DukeMTMCreID, MSMT17, CUHK03)
+from .datasets_define import Market1501, DukeMTMCreID, MSMT17, CUHK03


 def init_dataset(name, **kwargs):
diff --git a/research/cv/osnet/train.py b/research/cv/osnet/train.py
index 3312aed65e2af36a420be73cb7d8fdcd0bb959f6..8980226358f6a2b67b8c4704babdfb9932d8c4d0 100644
--- a/research/cv/osnet/train.py
+++ b/research/cv/osnet/train.py
@@ -1,4 +1,4 @@
-# Copyright 2021 Huawei Technologies Co., Ltd
+# Copyright 2022 Huawei Technologies Co., Ltd
 #
 # Licensed under the Apache License Version 2.0(the "License");
 # you may not use this file except in compliance with the License.
@@ -27,8 +27,8 @@ import mindspore.ops as ops
 from mindspore.common import set_seed
 from mindspore import Tensor, Model, context
 from mindspore import load_checkpoint, load_param_into_net
-from mindspore.train.callback import (ModelCheckpoint, CheckpointConfig, LossMonitor,
-                                      TimeMonitor, SummaryCollector)
+from mindspore.train.callback import ModelCheckpoint, CheckpointConfig, LossMonitor,\
+    TimeMonitor, SummaryCollector
 from mindspore.train.loss_scale_manager import FixedLossScaleManager
 from mindspore.communication.management import init, get_rank
@@ -39,7 +39,7 @@ from src.lr_generator import step_lr, multi_step_lr
 from src.cross_entropy_loss import CrossEntropyLoss
 from model_utils.config import config
 from model_utils.moxing_adapter import moxing_wrapper
-from model_utils.device_adapter import get_device_id, get_device_num, get_rank_id
+from model_utils.device_adapter import get_device_id, get_device_num

 set_seed(1)

@@ -53,6 +53,10 @@ class LossCallBack(LossMonitor):
         super(LossCallBack, self).__init__()
         self.has_trained_epoch = has_trained_epoch

+    def begin(self, run_context):
+        cb_params = run_context.original_args()
+        cb_params.init_time = time.time()  # record the wall-clock start of training
+
     def step_end(self, run_context):
         '''check loss at the end of each step.'''
         cb_params = run_context.original_args()
@@ -74,6 +78,11 @@ class LossCallBack(LossMonitor):
             print("epoch: %s step: %s, loss is %s" % (cb_params.cur_epoch_num + int(self.has_trained_epoch),
                                                       cur_step_in_epoch, loss), flush=True)

+    def end(self, run_context):
+        cb_params = run_context.original_args()
+        end_time = time.time()
+        print("total_time:", (end_time - cb_params.init_time) * 1000, "ms")  # overall training time
+
 def init_lr(num_batches):
     '''initialize learning rate.'''
@@ -96,7 +105,7 @@ def check_isfile(fpath):


 def load_from_checkpoint(net):
-    '''load parameters when resuming from a checkpoint for training.'''
+    '''load parameters when resuming training from a checkpoint.'''
     param_dict = load_checkpoint(config.checkpoint_file_path)
     if param_dict:
         if param_dict.get("epoch_num") and param_dict.get("step_num"):
@@ -114,7 +123,7 @@ def set_save_ckpt_dir():
     """set save ckpt dir"""
     ckpt_save_dir = os.path.join(config.output_path, config.checkpoint_path, config.source)
     if config.enable_modelarts and config.run_distribute:
-        ckpt_save_dir = ckpt_save_dir + "ckpt_" + str(get_rank_id()) + "/"
+        ckpt_save_dir = ckpt_save_dir + "ckpt_" + str(get_rank()) + "/"
     else:
         if config.run_distribute:
             ckpt_save_dir = ckpt_save_dir + "ckpt_" + str(get_rank()) + "/"
@@ -155,7 +164,7 @@ def modelarts_pre_process():
     sync_lock = "/tmp/unzip_sync.lock"

     # Each server contains 8 devices as most.
-    if get_device_id() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
+    if get_rank() % min(get_device_num(), 8) == 0 and not os.path.exists(sync_lock):
         print("Zip file path: ", zip_file_1)
         print("Unzip file save dir: ", save_dir_1)
         unzip(zip_file_1, save_dir_1)
@@ -170,7 +179,7 @@ def modelarts_pre_process():
                 break
             time.sleep(1)

-    print("Device: {}, Finish sync unzip data from {} to {}.".format(get_device_id(), zip_file_1, save_dir_1))
+    print("Device: {}, Finish sync unzip data from {} to {}.".format(get_rank(), zip_file_1, save_dir_1))


 @moxing_wrapper(pre_process=modelarts_pre_process)
@@ -178,6 +187,16 @@ def train_net():
     """train net"""
     context.set_context(mode=context.GRAPH_MODE, device_target=config.device_target)
     device_num = get_device_num()
+    if config.device_target == "GPU":
+        context.set_context(enable_graph_kernel=True)
+        if device_num > 1:
+            context.reset_auto_parallel_context()
+            context.set_auto_parallel_context(device_num=device_num, parallel_mode='data_parallel',
+                                              gradients_mean=True)
+            init()
+            rank_id = get_rank()
+        else:
+            rank_id = 0
     if config.device_target == "Ascend":
         device_id = get_device_id()
         context.set_context(device_id=device_id)
@@ -186,6 +205,9 @@ def train_net():
             context.set_auto_parallel_context(device_num=device_num, parallel_mode='data_parallel',
                                               gradients_mean=True)
             init()
+            rank_id = get_rank()
+        else:
+            rank_id = 0

     num_classes, dataset1 = dataset_creator(root=config.data_path, height=config.height,
                                             width=config.width,
@@ -202,6 +224,7 @@ def train_net():
                                             cuhk03_classic_split=config.cuhk03_classic_split,
                                             mode='train')
     num_batches = dataset1.get_dataset_size()
+
     if config.checkpoint_file_path and check_isfile(config.checkpoint_file_path):
         fpath = osp.abspath(osp.expanduser(config.checkpoint_file_path))
         if not osp.exists(fpath):
@@ -216,7 +239,8 @@ def train_net():
     crit = CrossEntropyLoss(num_classes=num_classes, label_smooth=config.label_smooth)
     lr = init_lr(num_batches=num_batches)
     loss_scale = FixedLossScaleManager(config.loss_scale, drop_overflow_update=False)
-    summary_collector = SummaryCollector(summary_dir='./summary_dir', collect_freq=1)
+    if config.LogSummary:
+        summary_collector = SummaryCollector(summary_dir='./summary_dir', collect_freq=1)

     time_cb = TimeMonitor(data_size=num_batches)

@@ -228,7 +252,10 @@ def train_net():
     model1 = Model(network=net, optimizer=opt1, loss_fn=crit, amp_level="O3",
                    loss_scale_manager=loss_scale)
     loss_cb1 = LossCallBack(has_trained_epoch=0)
-    cb1 = [time_cb, loss_cb1, summary_collector]
+    if config.LogSummary:
+        cb1 = [time_cb, loss_cb1, summary_collector]
+    else:
+        cb1 = [time_cb, loss_cb1]
     model1.train(config.fixbase_epoch, dataset1, cb1, dataset_sink_mode=True)

     loss_cb2 = LossCallBack(config.start_epoch)
@@ -238,9 +265,13 @@ def train_net():
                     weight_decay=config.weight_decay, loss_scale=config.loss_scale)
     model2 = Model(network=net, optimizer=opt2, loss_fn=crit, amp_level="O3",
                    loss_scale_manager=loss_scale)
-    cb2 = [time_cb, loss_cb2, summary_collector]
+    if config.LogSummary:
+        cb2 = [time_cb, loss_cb2, summary_collector]
+    else:
+        cb2 = [time_cb, loss_cb2]
+
     if config.save_checkpoint:
-        if not config.run_distribute or (config.run_distribute and get_device_id() % 8 == 0):
+        if not config.run_distribute or (config.run_distribute and rank_id % 8 == 0):
             ckpt_append_info = [{"epoch_num": config.start_epoch, "step_num": config.start_epoch}]
             config_ck = CheckpointConfig(save_checkpoint_steps=config.save_checkpoint_epochs * num_batches,
                                          keep_checkpoint_max=10, append_info=ckpt_append_info)
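Note: the `begin`/`end` methods added to `LossCallBack` above rely on MindSpore's standard `Callback` hooks. A self-contained sketch of the same timing pattern (the class name `TotalTimeCallback` is illustrative, not part of this patch):

```python
# Standalone sketch of the total-time measurement used by LossCallBack above.
import time

from mindspore.train.callback import Callback


class TotalTimeCallback(Callback):
    """Print the wall-clock duration of a whole model.train() run."""

    def begin(self, run_context):
        # called once, before the first training step
        self.init_time = time.time()

    def end(self, run_context):
        # called once, after training finishes
        print("total_time:", (time.time() - self.init_time) * 1000, "ms")
```

Such a callback would be passed alongside the existing ones, e.g. `model.train(epochs, dataset, callbacks=[TimeMonitor(), TotalTimeCallback()])`.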