Unverified Commit 0eeb2473 authored by i-robot, committed by Gitee

!2587 Add faq about multi-servers

Merge pull request !2587 from chenhaozhe/add-multi-server-faq
parents 9175b761 92270d10
@@ -383,3 +383,7 @@ For more information about `MindSpore` framework, please refer to [FAQ](https://
- **Q: How do I resolve the error *Invalid file, failed to open files for reading mindrecord files.* when loading a dataset in mindrecord format on a Mac system?**
**A**: Check the system limits with *ulimit -a*; if the number of *file descriptors* is 256 (the default), use *ulimit -n 1024* to raise it to 1024 (or larger), then check whether the file is damaged or has been modified (see the first sketch after this list).
- **Q: What should I do if I can't reach the expected accuracy when training on several servers instead of a single server?**
**A**: Most of the models have only been trained on a single server with at most 8 devices. Because the `batch_size` configured in MindSpore is the batch size of a single GPU/NPU, the `global_batch_size` increases when training on multiple servers, and a different `global_batch_size` requires different hyperparameters, such as the learning rate. You therefore have to re-tune these hyperparameters when training with multiple servers (see the second sketch after this list).
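The same file-descriptor check can also be done from inside a Python process using only the standard library; this is a minimal sketch, assuming the 1024 target from the *ulimit -n 1024* advice above:

```python
import resource

# Query the current soft/hard limits on open file descriptors
# (the same number that `ulimit -n` reports).
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"file descriptors: soft={soft}, hard={hard}")

if soft < 1024:
    # Raise the soft limit, but never beyond the hard limit.
    resource.setrlimit(resource.RLIMIT_NOFILE, (min(1024, hard), hard))
```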
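Re-tuning usually starts from the linear learning-rate scaling rule. The helper below is a hedged sketch of that rule, not the recipe any particular model in this repository uses; `base_lr` and `base_global_batch` are assumed to come from the validated single-server config:

```python
def scale_lr(base_lr, base_global_batch, per_device_batch, num_devices):
    """Linear scaling rule: keep lr / global_batch_size roughly constant."""
    global_batch = per_device_batch * num_devices
    return base_lr * global_batch / base_global_batch

# Example: a config tuned for 8 NPUs with batch 32 each, moved to 4 servers x 8 NPUs.
lr = scale_lr(base_lr=0.1, base_global_batch=32 * 8, per_device_batch=32, num_devices=32)
print(lr)  # 0.4 -- a starting point; warmup schedules often need adjusting as well.
```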
@@ -385,3 +385,7 @@ MindSpore is licensed under Apache 2.0; please see the LICENSE file.
- **Q: On a Mac system, loading a dataset in mindrecord format fails with an error such as *Invalid file, failed to open files for reading mindrecord files.*. What should I do?**
**A**: First check the system limits with *ulimit -a*. If the number of *file descriptors* is 256 (the default), use *ulimit -n 1024* to raise it to 1024 (or a larger value), then check whether the file is damaged or has been modified.
- **Q: I am training on a large cluster of multiple servers, but the accuracy is lower than expected. What should I do?**
**A**: Most models in the current model zoo have only been validated within a single server, using at most 8 devices for training. Because the `batch_size` specified for MindSpore training is per device, the `global_batch_size` grows when scaling from a single 8-device server to multiple servers, so the hyperparameters need to be re-tuned for the `global_batch_size` of the multi-server scenario.
@@ -17,7 +17,6 @@
- [Dataset Preparation](#dataset-preparation)
- [Training Process](#training-process)
- [Evaluation Process](#evaluation-process)
- [Evaluation](#evaluation)
- [ONNX Evaluation](#onnx-evaluation)
- [Inference Process](#inference-process)
- [Export MindIR](#export-mindir)
@@ -29,6 +28,7 @@
- [Evaluation Performance](#evaluation-performance)
- [Description of Random Situation](#description-of-random-situation)
- [ModelZoo Homepage](#modelzoo-homepage)
- [FAQ](#faq)
## [Transformer Description](#contents)
@@ -487,3 +487,12 @@ Some seeds have already been set in train.py to avoid the randomness of dataset
## [ModelZoo Homepage](#contents)
Please check the official [homepage](https://gitee.com/mindspore/models).
## FAQ
First refer to the [ModelZoo FAQ](https://gitee.com/mindspore/models#FAQ) for answers to some common questions.
- **Q: Why can't the last checkpoint I got reach the expected accuracy?**
**A**: In the final stage of training, the model accuracy usually drifts irregularly, and because evaluation relies on a third-party Perl script, we can't identify the best checkpoint as soon as the training process finishes.
You can evaluate the last several checkpoints to find the best one (see the sketch after this list).
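A minimal sketch of that advice is below. `evaluate_checkpoint` is a hypothetical stand-in for the model's actual eval entry point (for this Transformer it would wrap eval.py plus the Perl scoring script), and the `*.ckpt` glob pattern is an assumption about how the checkpoints are named:

```python
import glob
import os

def pick_best_checkpoint(ckpt_dir, evaluate_checkpoint, last_n=5):
    """Evaluate the most recent checkpoints and return the best-scoring path."""
    # Assumption: MindSpore .ckpt files, ordered by modification time.
    ckpts = sorted(glob.glob(os.path.join(ckpt_dir, "*.ckpt")), key=os.path.getmtime)
    scores = {ckpt: evaluate_checkpoint(ckpt) for ckpt in ckpts[-last_n:]}
    return max(scores, key=scores.get)
```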
@@ -29,6 +29,7 @@
- [Evaluation Performance](#评估性能)
- [Description of Random Situation](#随机情况说明)
- [ModelZoo Homepage](#modelzoo主页)
- [FAQ](#faq)
<!-- /TOC -->
@@ -458,3 +459,11 @@ Some seeds have already been set in train.py to avoid the randomness of dataset shuffling and weight initialization
## ModelZoo Homepage
Please visit the official [homepage](https://gitee.com/mindspore/models).
## FAQ
First refer to the [ModelZoo FAQ](https://gitee.com/mindspore/models#FAQ) for answers to some common questions.
- **Q: Why is the accuracy of my last checkpoint poor?**
**A**: Because evaluation requires a third-party Perl script, there is no way to obtain the best checkpoint during training. You can evaluate the last several checkpoints and pick the best one.