- Oct 01, 2018
-
Niu Chong authored
fix: add AsyncSednRegstMsgToConsumer() for sending a single produced regst, e.g. forward_model_regst (#1274)
* fix(normal_model_update_compute_actor): fix sending forward_model_regst_ to the consumer
* fix: add AsyncSednRegstMsgToConsumer() for sending a single produced regst, e.g. forward_model_regst
-
Shiyuan Shang-Guan authored
* refine cudnn_limit_buf
* rename default_cudnn_buf_limit_mbyte -> cudnn_buf_limit_mbyte
-
Niu Chong authored
* fix(normal_forward_compute_actor): fix SendMsgToForwardModelSaveActor()
* refine(normal_forward_compute_actor)
-
Jinhui Yuan authored
-
- Sep 30, 2018
-
Niu Chong authored
* feat(register_slot): add the RegstSlot
* feat(register_slot): update RegstSlot if
* feat(actor): update member of Actor to use RegstSlot
* fix(register_slot): fix the available_regst_desc_cnt init val
* refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
* feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
* feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
* fix(register_slot): fix the CHECK empty
* feat: remove actual_writeable_regst_desc_id_ from Actor, add Naive/CustomizedProducedRegst
* fix(normal_model_update_actor): bug: not send customized regst to consumer when SendIntialModel
* fix(normal_forward_compute_actor): bug: not add kLoss/kAccuracy produced regst to NaiveProducedRegst
* fix(actor): UNIMPLEMENTED() for AsyncSendCustomizedProducedRegstMsgToConsumer
* fix(normal_forward_compute_actor): set const_buf_regst to nullptr when recv from consumers
* fix(actor): total_reading_data_regst_cnt, not total_reading_ctrl_regst_cnt
* refactor: update GetNaiveConsumedRegstDescName to GetNaiveOrCustomizedConsumedRegstDescName (same for Produced)
* feat: combine data_regst and ctrl_regst in Actor
* fix: fix bugs
* fix: fix bugs
* fix: remove .swp files and unused LOG
* feat: split Act and SendMsg (#1255)
* feat: split Act and SendMsg
* refine: rename HandleProduced/ConsumedDataRegst.. to HandleProduced/ConsumedNaiveDatRegst..
* fix(input_wise_comp_actor): bug: not set piece id
* fix(actor): potential bug: produced msg with no allowed actor still pop from queue
* refactor: mv some protected member function to private
* fix(actor): fix the condition about sending EORD msg
* refactor(input_wise_actor): use RegstSlot in InputWiseActor
* fix(copy_comm_net_actor): rename piece_id2regst_ctx to piece_id2regst_ctx_
* refactor: rename Name2RegstDescId to Name2RegstDescIds
* refactor(naive_actor): "override final" instead of only "final"
* refine(actor): little refine
* feat: update the return type of GetNaiveOrCustomizedNamesRegstDescName to enum class RegstNameType
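Taken together, these commits replace the actor's ad-hoc regst bookkeeping with a RegstSlot: a map from regst_desc_id to a deque of queued regsts, with non-throwing Try* accessors and an available_regst_desc_cnt counter. Below is a minimal sketch of such a container; the method names come from the messages above, everything else is assumed and the real oneflow::RegstSlot differs in detail.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <unordered_map>

// Sketch of a RegstSlot-like container (not OneFlow's actual class).
class RegstSlotSketch {
 public:
  // Returns false instead of failing when the regst_desc_id is unknown.
  bool TryPushBack(int64_t regst_desc_id, void* regst) {
    auto it = queues_.find(regst_desc_id);
    if (it == queues_.end()) { return false; }
    if (it->second.empty()) { available_regst_desc_cnt_ += 1; }
    it->second.push_back(regst);
    return true;
  }
  bool TryPopFront(int64_t regst_desc_id) {
    auto it = queues_.find(regst_desc_id);
    if (it == queues_.end() || it->second.empty()) { return false; }
    it->second.pop_front();
    if (it->second.empty()) { available_regst_desc_cnt_ -= 1; }
    return true;
  }
  bool HasRegstDescId(int64_t regst_desc_id) const {
    return queues_.count(regst_desc_id) > 0;
  }
  // Visits the front regst of every non-empty queue.
  void ForEachFrontRegst(const std::function<void(void*)>& handler) const {
    for (const auto& pair : queues_) {
      if (!pair.second.empty()) { handler(pair.second.front()); }
    }
  }
  // Number of regst_desc_ids that currently have at least one regst queued.
  size_t available_regst_desc_cnt() const { return available_regst_desc_cnt_; }

 private:
  std::unordered_map<int64_t, std::deque<void*>> queues_;
  size_t available_regst_desc_cnt_ = 0;
};
```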
-
- Sep 26, 2018
-
Shiyuan Shang-Guan authored
* add lars set
* add lars
* override ibn&obn to lbi
* make model update consistent
* check cuda stream sync
* add LARSUpdateModelGpu
* checkout naive & momentum model update
* use cublas::dot compute SumOfSquare
* update lars for master
* refine lars for master
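For context, LARS (layer-wise adaptive rate scaling) derives a per-layer local learning rate from the ratio of the weight norm to the gradient norm; the two SumOfSquare terms are computed with cublas::dot on the GPU per the entry above. Here is a hedged host-side sketch of the update; the hyper-parameter names and exact formula are assumptions, not OneFlow's conf fields.

```cpp
#include <cmath>
#include <vector>

// Sketch of a LARS update for one model blob (illustrative only).
void LarsUpdateSketch(std::vector<float>& model, std::vector<float>& momentum,
                      const std::vector<float>& model_diff, float learning_rate,
                      float l2, float momentum_beta, float lars_coefficient,
                      float epsilon) {
  // SumOfSquare terms; the commits compute these with cublas::dot on the GPU.
  double w_norm2 = 0.0, g_norm2 = 0.0;
  for (size_t i = 0; i < model.size(); ++i) {
    w_norm2 += model[i] * model[i];
    g_norm2 += model_diff[i] * model_diff[i];
  }
  const double w_norm = std::sqrt(w_norm2);
  const double g_norm = std::sqrt(g_norm2);
  // Layer-wise ("local") learning rate: trust ratio of weight norm to gradient norm.
  const double local_lr =
      learning_rate * lars_coefficient * w_norm / (g_norm + l2 * w_norm + epsilon);
  for (size_t i = 0; i < model.size(); ++i) {
    const double reg_diff = model_diff[i] + l2 * model[i];
    momentum[i] = momentum_beta * momentum[i] + local_lr * reg_diff;
    model[i] -= momentum[i];
  }
}
```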
-
binbinHan authored
* hinge_loss_kernel_test
* fix opkernel_test
* fix test file
* optimize test file
* optimize opkernel test
* complete opkernel test interface
-
- Sep 25, 2018
-
Juncheng authored
-
Jinhui Yuan authored
* remove useless Copy in device_context
* fix cyclic and copy_to_local bug in binary_in_stream_with_local_copy
-
- Sep 24, 2018
-
Jinhui Yuan authored
* add nccl dependency
* add nccl comm handle
* nccl allreduce works
* NcclAllreduce -> NcclAllReduce
* fix header guard
* add NcclReduceScatter, NcclAllGather
* complete ReduceScatter and AllGather, (with cuda error)
* change variable name
* reduce-scatter, all-gather works
* add NcclScatter and NcclGather work type
* Dev use nccl add nccl comm manager (#1206)
* add parallel_set_id
* add nccl_comm_manager
* log nccl comm create
* use NcclCommMgr
* bugfix
* OF_DISALLOW_COPY_AND_MOVE
* remove nccl_scatter_handle and nccl_gather_handle from DeviceCtx
* remove nccl handles from cuda_stream_handle
* nccl_util and GetNcclDataType
* fix rank_num
* fix rank_id
* CudaCheck->NcclCheck
* only GPU
* PoorCompTaskNode SoleIn, SoleOut, SoleOp, SoleIbn, SoleObn
* PoorCompTaskNode
* reformat
* format change
* Dev use nccl merge reduce share mem (#1216)
* add parallel_set_id
* add nccl_comm_manager
* log nccl comm create
* use NcclCommMgr
* bugfix
* OF_DISALLOW_COPY_AND_MOVE
* remove nccl_scatter_handle and nccl_gather_handle from DeviceCtx
* remove nccl handles from cuda_stream_handle
* nccl_util and GetNcclDataType
* fix rank_num
* fix rank_id
* CudaCheck->NcclCheck
* only GPU
* PoorCompTaskNode SoleIn, SoleOut, SoleOp, SoleIbn, SoleObn
* PoorCompTaskNode
* reformat
* ReduceGather
* GlobalAdd
* ReduceScatter
* EnableIfNeed
* ConcatSplit
* EnableMemSharing for pred if need
* CtrlEdge for Gather
* CtrlEdge for GlobalAdd
* LocalAdd CtrlEdge
* CollectReduceTaskNode
* reverse nodes
* local_add_mem_sharing
* global add mem sharing
* reduce_mem_sharing
* bugfix
* refine
* format change (remove empty lines)
* format change
* fix local_add and gather issues
* Dev refactor reduce add (#1218)
* change ReduceGlobalAdd to ReduceAdd
* rm ReduceLocalAdd
* no mem sharing case works
* let ReduceAddCompActor decide whether it is local or global
* multi machine multi gpus Nccl and Oneflow allreduce works
* refine
* extract SortEdges
* make EdgeInfo protected
* Dev use nccl refine (#1220)
* const qualifier
* PoorCompTaskNode=>PipeCompTaskNode
* int=>int32_t
* refine ReduceMemSharingCtx
* NcclDeviceCtx and NcclActor
* empty line
* CudaDeviceCtx<-NcclDeviceCtx
* fix wrong rank_id in reduce_add_actor (#1229)
* fix wrong rank_id in reduce_add_actor
* rm device_num_of_each_machine from parallel_ctx
* fix reduce gather control edge (#1235)
* fix reduce gather control edge
* extract FindNearestReduceAddCompTaskNode
* extract method ReduceCompTaskNodeIf::FindPredRduceTaskNodeIf
* CHECK nearest_add_copy_d2h
* Dev use nccl cross machine nccl all reduce (#1246)
* support ncclAllReduce cross machine
* fix rank_id and rank_num for mix
* reformat
* reformat
* simplify nccl_kernel (#1256)
* simplify REGISTER_BLD_SUB_TSK_GPH_MTHD (#1260)
* simplify REGISTER_BLD_SUB_TSK_GPH_MTHD
* note
* Dev use nccl reduce ranking ctx (#1252)
* reformat
* compute rank_id and rank_num with FixCompTaskNode
* reformat
* fix rank_id for reduceadd
* ReduceRankingCtx
* New Ranking and MemSharing for Reduce
* DECLARE_REDUCE_LOGICAL_NODE
* Ranking4NcclAllReduce
* fix ranking
* remove AsTaskNode
* reformat
* runtime rank ctx
* rank_set
* bugfix
* bugfix
* unittest
* change use_nccl_all_reduce_cross_machine to use_nccl_inter_node_communication
* refine
* move BuildCtrlRegstBetweenReduceCopyNodes to ReduceAddCompTaskNode
* CHECK mem_size_
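These commits wrap NCCL collectives (ncclAllReduce, ncclReduceScatter, ncclAllGather) in dedicated kernels/actors and manage per-device communicators through an NcclCommMgr, with the cross-machine case added later. Below is a minimal standalone, single-process sketch of the ncclAllReduce call those kernels delegate to; it is not OneFlow code, just the bare NCCL usage under assumed buffer sizes.

```cpp
#include <cuda_runtime.h>
#include <nccl.h>
#include <vector>

int main() {
  int dev_cnt = 0;
  cudaGetDeviceCount(&dev_cnt);

  // One communicator per visible GPU (single process, in-node case).
  std::vector<ncclComm_t> comms(dev_cnt);
  ncclCommInitAll(comms.data(), dev_cnt, nullptr);

  const size_t count = 1 << 20;  // elements per device, assumed for the sketch
  std::vector<float*> send(dev_cnt), recv(dev_cnt);
  std::vector<cudaStream_t> streams(dev_cnt);
  for (int i = 0; i < dev_cnt; ++i) {
    cudaSetDevice(i);
    cudaMalloc(reinterpret_cast<void**>(&send[i]), count * sizeof(float));
    cudaMalloc(reinterpret_cast<void**>(&recv[i]), count * sizeof(float));
    cudaStreamCreate(&streams[i]);
  }

  // Group the per-device calls so NCCL launches them as one collective.
  ncclGroupStart();
  for (int i = 0; i < dev_cnt; ++i) {
    ncclAllReduce(send[i], recv[i], count, ncclFloat, ncclSum, comms[i], streams[i]);
  }
  ncclGroupEnd();

  for (int i = 0; i < dev_cnt; ++i) {
    cudaSetDevice(i);
    cudaStreamSynchronize(streams[i]);
    ncclCommDestroy(comms[i]);
  }
  return 0;
}
```

The multi-machine variant mentioned above ("support ncclAllReduce cross machine") instead creates one communicator per rank with ncclCommInitRank and a shared ncclUniqueId, which is what the rank_id/rank_num fixes refer to.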
-
- Sep 23, 2018
-
Niu Chong authored
-
- Sep 19, 2018
-
Li Xinqi authored
-
Shiyuan Shang-Guan authored
-
- Sep 18, 2018
-
Li Xinqi authored
* define_test_blob
* decode random compute task node
* rename define_test_blob_conf.name => define_test_blob_conf.out
* decode random task node color
-
- Sep 17, 2018
-
Li Xinqi authored
* moving model
* moving_model => forward_model
* add todo commit
* two model save node
* let md_updt actor handle forward_model
* remove useless code
* rename local variable
-
Shiyuan Shang-Guan authored
* refine model update conf
* make todo
* add primary_lr and secondary_lr
-
scxfjiang authored
-
Juncheng authored
* add enum ChannelStatus
* merge CloseSendEnd and CloseReceiveEnd
* update channel_test
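The ChannelStatus change collapses the separate send-end/receive-end close calls into a single status-reporting Close. A sketch of what such a channel can look like; the shape is assumed and the real oneflow::Channel may differ.

```cpp
#include <condition_variable>
#include <mutex>
#include <queue>

// Sketch of a Channel whose operations report a ChannelStatus.
enum class ChannelStatus { kSuccess, kClosed };

template <typename T>
class ChannelSketch {
 public:
  ChannelStatus Send(const T& item) {
    std::unique_lock<std::mutex> lock(mutex_);
    if (closed_) { return ChannelStatus::kClosed; }
    queue_.push(item);
    cond_.notify_one();
    return ChannelStatus::kSuccess;
  }
  ChannelStatus Receive(T* item) {
    std::unique_lock<std::mutex> lock(mutex_);
    cond_.wait(lock, [this] { return closed_ || !queue_.empty(); });
    if (queue_.empty()) { return ChannelStatus::kClosed; }  // closed and drained
    *item = queue_.front();
    queue_.pop();
    return ChannelStatus::kSuccess;
  }
  // A single Close replaces the former CloseSendEnd / CloseReceiveEnd pair.
  void Close() {
    std::unique_lock<std::mutex> lock(mutex_);
    closed_ = true;
    cond_.notify_all();
  }

 private:
  std::queue<T> queue_;
  std::mutex mutex_;
  std::condition_variable cond_;
  bool closed_ = false;
};
```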
-
Jinhui Yuan authored
* only master machine saves plan and has event logger
* separate Data, Persistence, Cache, Log FileSystem config
* refine
* only specify data and snapshot path conf
* forbid multiple machines from using localfs as snapshot fs
* networkfs as localfs
* refine
* Store log to snapshot (#1109)
* use machine id, drop machine name
* ensure setting machine id
* allow save snapshot to localfs for distributed training (#1113)
* Snapshot to master (#1116)
* allow save snapshot to localfs for distributed training
* fix mdSave to master for model parallel
* fix review comment issues
* add sanity check for machine id
* rm useless comments
* update example
* Dev refine runtime add log stream mgr (#1142)
* add LogStreamMgr
* refine and refactor OutStream=>LogStream
* bugfix
* use LogStreamMgr to write graph, dot, plan, profile and proto
* refine
* simplify, remove LogStreamMgr (#1243)
* simplify, remove LogStreamMgr
* TeePersistentLogStream add static factory (#1244)
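The later commits in this series funnel the plan/graph/profile dumps through a log-stream abstraction, finishing with a static factory on TeePersistentLogStream. Here is a sketch of that factory pattern under assumed names; the real class writes through the configured FileSystem abstractions rather than std::ofstream.

```cpp
#include <fstream>
#include <memory>
#include <string>

// Sketch of a TeePersistentLogStream-style static factory (illustrative only).
class TeePersistentLogStreamSketch {
 public:
  static std::unique_ptr<TeePersistentLogStreamSketch> Create(const std::string& path) {
    // The constructor stays private so every stream is created in one place
    // and flushed exactly once, in the destructor.
    return std::unique_ptr<TeePersistentLogStreamSketch>(
        new TeePersistentLogStreamSketch(path));
  }
  TeePersistentLogStreamSketch& operator<<(const std::string& text) {
    out_ << text;
    return *this;
  }
  ~TeePersistentLogStreamSketch() { out_.flush(); }

 private:
  explicit TeePersistentLogStreamSketch(const std::string& path) : out_(path) {}
  std::ofstream out_;
};

// Usage sketch: dump a serialized plan next to the other log artifacts.
// auto stream = TeePersistentLogStreamSketch::Create("plan.dump");
// (*stream) << serialized_plan;
```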
-
cheng cheng authored
* fix bug of forward model -> copyD2H conflict with out regst
* use 1 line
-
- Sep 16, 2018
- Sep 15, 2018
-
Li Xinqi authored
-
Shiyuan Shang-Guan authored
* make each blob of the packed blob be updated separately in the ModelUpdate
* make blob descs in regst be consistent in bw->md_diff_acc->shared_md_diff_add->md_update->fw
* copy lbi2blob_descs from model
* add shared_model_diff_add kernel
* refine model_update actor and kernel
* rm useless TODO
* add shared_model_diff_add kernel
* refine code
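In other words, the shared_model_diff_add kernel sums the diffs produced by several actors into one shared buffer, and the packed model blob is walked blob by blob so each BlobDesc is updated on its own. An illustrative host-side sketch; the segment layout and names are assumptions, not OneFlow's data structures.

```cpp
#include <cstddef>
#include <vector>

// One per-blob segment inside the packed buffer (assumed layout).
struct BlobSegment {
  size_t offset;  // element offset of this blob inside the packed buffer
  size_t count;   // number of elements in this blob
};

// Accumulate per-actor model diffs into a shared sum, blob by blob.
void SharedModelDiffAddSketch(std::vector<float>& packed_sum,
                              const std::vector<std::vector<float>>& per_actor_diffs,
                              const std::vector<BlobSegment>& segments) {
  for (const BlobSegment& seg : segments) {
    for (size_t i = seg.offset; i < seg.offset + seg.count; ++i) {
      float acc = 0.f;
      for (const auto& diff : per_actor_diffs) { acc += diff[i]; }
      packed_sum[i] = acc;  // consumed by the subsequent md_update step
    }
  }
}
```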
-
- Sep 14, 2018
-
Shiyuan Shang-Guan authored
-
Li Xinqi authored
* enable dptr<T>(...) if T is not void
* simplify dptr(...) by parameter packing
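The dptr<T>(...) change is a C++ template technique: SFINAE disables the typed accessor when T is void, and a parameter pack forwards an arbitrary number of dimension offsets. A small sketch of the idiom; the Blob stand-in and the offset arithmetic are simplified assumptions.

```cpp
#include <cstdint>
#include <type_traits>

// Sketch of "enable dptr<T>(...) if T is not void" plus parameter packing.
class BlobSketch {
 public:
  explicit BlobSketch(void* mem) : mem_(mem) {}

  // Untyped accessor, always available.
  void* dptr() { return mem_; }

  // Typed accessor, SFINAE-disabled for T = void; the parameter pack lets
  // callers write dptr<float>(n, c, h) with any number of leading offsets.
  template <typename T,
            typename std::enable_if<!std::is_same<T, void>::value>::type* = nullptr,
            typename... Int64s>
  T* dptr(Int64s... dim_offsets) {
    return static_cast<T*>(mem_) + ElemOffset(dim_offsets...);
  }

 private:
  static int64_t ElemOffset() { return 0; }
  template <typename... Rest>
  static int64_t ElemOffset(int64_t first, Rest... rest) {
    return first + ElemOffset(rest...);  // real code multiplies by dim strides
  }
  void* mem_;
};
```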
-
- Sep 13, 2018
-
Li Xinqi authored
-
- Sep 10, 2018
-
Niu Chong authored
-
Jinhui Yuan authored
-
- Sep 09, 2018
-
Jinhui Yuan authored
-
- Sep 07, 2018
-
Niu Chong authored
* feat(register_slot): add the RegstSlot
* feat(register_slot): update RegstSlot if
* feat(actor): update member of Actor to use RegstSlot
* fix(register_slot): fix the available_regst_desc_cnt init val
* refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
* feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
* feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
* fix(register_slot): fix the CHECK empty
-
Jinhui Yuan authored
* add ReduceScatter2, ReduceAdd2, ReduceGather2 op and kernel
* add ReduceScatter2, ReduceAdd2, ReduceGather2 task node and actor
* complete Reduce2 op
* TODO: complete ReduceAdd2 kernel
* add ReduceScatter2 task to accept model_diff
* sketch of connecting ReduceScatter2/Add2/Gather2
* build allreduce2 logical graph
* connect allreduce2 task graph
* ReduceScatter2 task node
* complete ReduceAdd2, ReduceGather2 task node
* simplify ReduceAdd2 actor
* refactor ReduceAdd2 task node
* let global add -> gather share path
* separate ReduceLocalAdd2 and ReduceGlobalAdd2
* connect AllReduce2 task graph
* complete ReduceGlobalAdd2 op
* refine ReduceLocalAdd2 task node
* complete ReduceGlobalAdd2 task node
* global AllReduce2 works
* add device_num_of_each_machine to parallel_context
* simplify ReduceGlobalAdd2 runtime
* multi machine multi gpus AllReduce2 works
* add mem sharing and ctrl edge for AllReduce2
* single machine multiple gpu mem sharing works
* refine
* remove the previous allreduce
* change AllReduce2 to AllReduce variable convention
* change filename
* complete transfer to allreduce2
* remove unnecessary format change
* remove unnecessary format change
* simplify
* simplify mem sharing rule for reduce add and gather
* check for local add
* fix reduce_global_add actor bug
* refine reduce task node
* refine variable name
* refine
* refine
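The AllReduce2 path splits all-reduce into reduce-scatter, add, and gather stages so the stages can share memory and be ordered with ctrl edges. The host-side sketch below only illustrates the data movement that decomposition implements; the task graph, actors, and local/global add split described above are not modeled.

```cpp
#include <cstddef>
#include <vector>

// ReduceScatter2 -> ReduceAdd2 -> ReduceGather2 over N participants' buffers.
// After the add stage, participant r owns the fully reduced 1/N chunk r; the
// gather stage then copies each reduced chunk to every participant.
void AllReduce2Sketch(std::vector<std::vector<float>>& bufs) {
  const size_t n = bufs.size();
  const size_t len = bufs[0].size();
  const size_t chunk = len / n;  // assume len is divisible by n for the sketch
  // Reduce-scatter + add: participant r reduces chunk r from everyone.
  for (size_t r = 0; r < n; ++r) {
    for (size_t i = r * chunk; i < (r + 1) * chunk; ++i) {
      float acc = 0.f;
      for (size_t p = 0; p < n; ++p) { acc += bufs[p][i]; }
      bufs[r][i] = acc;
    }
  }
  // All-gather: every participant copies the reduced chunk owned by r.
  for (size_t r = 0; r < n; ++r) {
    for (size_t p = 0; p < n; ++p) {
      if (p == r) { continue; }
      for (size_t i = r * chunk; i < (r + 1) * chunk; ++i) { bufs[p][i] = bufs[r][i]; }
    }
  }
}
```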
-
Jinhui Yuan authored
-
- Sep 06, 2018
-
guo ran authored
-
- Sep 04, 2018
-
binbinHan authored
* add hinge loss
* add hinge loss test
* hack hinge loss
* optimize hinge loss
* optimize hinge loss
* optimize hinge loss
* optimize hinge loss
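For reference, the binary hinge loss being added is loss_i = max(0, 1 - label_i * pred_i) for labels in {-1, +1}, with gradient -label_i wherever that margin is positive. A hedged scalar sketch follows; the actual HingeLossKernel is GPU code and may also cover an L2 variant and multi-class labels.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Forward loss and gradient w.r.t. the prediction for the L1 hinge loss.
void HingeLossSketch(const std::vector<float>& pred, const std::vector<float>& label,
                     std::vector<float>& loss, std::vector<float>& pred_diff) {
  for (size_t i = 0; i < pred.size(); ++i) {
    const float margin = 1.f - label[i] * pred[i];
    loss[i] = std::max(0.f, margin);                 // max(0, 1 - y * y_hat)
    pred_diff[i] = margin > 0.f ? -label[i] : 0.f;   // d loss / d pred
  }
}
```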
-
binbinHan authored
* add matmul & dot & multiply
* optimize dot kernel
* fix multiply kernel code style
* optimize matmul kernel
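The dot kernel here (and the SumOfSquare computation mentioned in the LARS entry above) can delegate to cuBLAS. Below is a minimal standalone cublasSdot sketch, not OneFlow's kernel wrapper; buffer sizes and values are arbitrary.

```cpp
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

int main() {
  const int n = 1024;
  std::vector<float> host_x(n, 1.f), host_y(n, 2.f);

  float *x = nullptr, *y = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&x), n * sizeof(float));
  cudaMalloc(reinterpret_cast<void**>(&y), n * sizeof(float));
  cudaMemcpy(x, host_x.data(), n * sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(y, host_y.data(), n * sizeof(float), cudaMemcpyHostToDevice);

  cublasHandle_t handle;
  cublasCreate(&handle);
  float result = 0.f;  // with the default pointer mode the result lands on the host
  cublasSdot(handle, n, x, 1, y, 1, &result);  // result = sum_i x[i] * y[i]

  cublasDestroy(handle);
  cudaFree(x);
  cudaFree(y);
  return 0;
}
```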
-
Li Xinqi authored
-
binbinHan authored
* add embedding look up infer blob desc
* optimize infer blob desc
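"Infer blob desc" for embedding lookup amounts to shape inference: the output keeps the shape of the ids blob and appends the embedding width as the trailing dimension. A simplified sketch under that assumption; OneFlow's BlobDesc carries more than the shape (data type, etc.).

```cpp
#include <cstdint>
#include <vector>

// Output shape = ids shape + [embedding_size], e.g. (batch,) -> (batch, embedding_size).
std::vector<int64_t> InferEmbeddingLookUpOutShape(const std::vector<int64_t>& ids_shape,
                                                  int64_t embedding_size) {
  std::vector<int64_t> out_shape(ids_shape);
  out_shape.push_back(embedding_size);
  return out_shape;
}
```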
-
- Sep 03, 2018
-
Li Xinqi authored
-