Unverified Commit 0eeb2473 authored by i-robot, committed by Gitee

!2587 Add faq about multi-servers

Merge pull request !2587 from chenhaozhe/add-multi-server-faq
parents 9175b761 92270d10
@@ -383,3 +383,7 @@ For more information about `MindSpore` framework, please refer to [FAQ](https://
- **Q: How do I resolve the error *Invalid file, failed to open files for reading mindrecord files.* when loading a dataset in mindrecord format on a Mac system?**
**A**: Check the system limits with *ulimit -a*; if the number of *file descriptors* is 256 (the default), use *ulimit -n 1024* to raise it to 1024 (or larger), then check whether the file is damaged or has been modified (see the first sketch after this list).
- **Q: What should I do if I can't reach the expected accuracy when training on several servers instead of a single server?**
**A**: Most of the models have only been trained on a single server with at most 8 devices. Because the `batch_size` configured in MindSpore is the batch size of a single GPU/NPU, the `global_batch_size` increases when training on multiple servers, and a different `global_batch_size` requires different hyperparameters, such as the learning rate. You therefore have to re-tune these hyperparameters when training with multiple servers (see the second sketch after this list).
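The same file-descriptor check can also be done from inside a Python process using only the standard library; this is a minimal sketch, assuming the 1024 target from the *ulimit -n 1024* advice above:

```python
import resource

# Query the current soft/hard limits on open file descriptors
# (the same number that `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"file descriptors: soft={soft}, hard={hard}")

if soft < 1024:
    # Raise the soft limit, but never beyond the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(1024, hard), hard))
```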
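Re-tuning usually starts from the linear learning-rate scaling rule. The helper below is a hedged sketch of that rule, not the recipe any particular model in this repository uses; `base_lr` and `base_global_batch` are assumed to come from the validated single-server config:

```python
def scale_lr(base_lr, base_global_batch, per_device_batch, num_devices):
    """Linear scaling rule: keep lr / global_batch_size roughly constant."""
    global_batch = per_device_batch * num_devices
    return base_lr * global_batch / base_global_batch

# Example: a config tuned for 8 NPUs with batch 32 each, moved to 4 servers x 8 NPUs.
lr = scale_lr(base_lr=0.1, base_global_batch=32 * 8, per_device_batch=32, num_devices=32)
print(lr)  # 0.4 -- a starting point; warmup schedules often need adjusting as well.
```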
@@ -385,3 +385,7 @@ MindSpore is licensed under Apache 2.0; please see the LICENSE file.
- **Q: On a Mac system, loading a dataset in mindrecord format fails with an error such as *Invalid file, failed to open files for reading mindrecord files.*. What should I do?**
**A**: First check the system limits with *ulimit -a*. If the number of *file descriptors* is 256 (the default), use *ulimit -n 1024* to raise it to 1024 (or a larger value), then check whether the file is damaged or has been modified.
- **Q: I am training on a large cluster of multiple servers, but the accuracy is lower than expected. What should I do?**
**A**: Most models in the current model zoo have only been validated within a single server, using at most 8 devices for training. Because the `batch_size` specified for MindSpore training is per device, the `global_batch_size` grows when scaling from a single 8-device server to multiple servers, so the hyperparameters need to be re-tuned for the `global_batch_size` of the multi-server scenario.
@@ -17,7 +17,6 @@
- [Dataset Preparation](#dataset-preparation)
- [Training Process](#training-process)
- [Evaluation Process](#evaluation-process)
- [Evaluation](#evaluation)
- [ONNX Evaluation](#onnx-evaluation)
- [Inference Process](#inference-process)
- [Export MindIR](#export-mindir)
@@ -29,6 +28,7 @@
- [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
- [FAQ](#faq)
## [Transformer Description](#contents)
@@ -487,3 +487,12 @@ Some seeds have already been set in train.py to avoid the randomness of dataset
## [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/models).
## FAQ
First refer to the [ModelZoo FAQ](https://gitee.com/mindspore/models#FAQ) for answers to some common questions.
- **Q: Why can't the last checkpoint I got reach the expected accuracy?**
**A**: In the final stage of training, the model accuracy usually drifts irregularly, and because evaluation relies on a third-party Perl script, we can't identify the best checkpoint as soon as the training process finishes.
You can evaluate the last several checkpoints to find the best one (see the sketch after this list).
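A minimal sketch of that advice is below. `evaluate_checkpoint` is a hypothetical stand-in for the model's actual eval entry point (for this Transformer it would wrap eval.py plus the Perl scoring script), and the `*.ckpt` glob pattern is an assumption about how the checkpoints are named:

```python
import glob
import os

def pick_best_checkpoint(ckpt_dir, evaluate_checkpoint, last_n=5):
    """Evaluate the most recent checkpoints and return the best-scoring path."""
    # Assumption: MindSpore .ckpt files, ordered by modification time.
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "*.ckpt")), key=os.path.getmtime)
    scores = {ckpt: evaluate_checkpoint(ckpt) for ckpt in ckpts[-last_n:]}
    return max(scores, key=scores.get)
```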
@@ -29,6 +29,7 @@
- [Evaluation Performance](#评估性能)
- [Description of Random Situation](#随机情况说明)
- [ModelZoo Homepage](#modelzoo主页)
- [FAQ](#faq)
<!-- /TOC -->
@@ -458,3 +459,11 @@ Some seeds have already been set in train.py to avoid the randomness of dataset shuffling and weight initialization
## ModelZoo Homepage
Please visit the official [homepage](https://gitee.com/mindspore/models).
## FAQ
First refer to the [ModelZoo FAQ](https://gitee.com/mindspore/models#FAQ) for answers to some common questions.
- **Q: Why is the accuracy of my last checkpoint poor?**
**A**: Because evaluation requires a third-party Perl script, there is no way to obtain the best checkpoint during training. You can evaluate the last several checkpoints and pick the best one.