  1. Aug 06, 2021
  2. Aug 05, 2021
  3. Jul 30, 2021
  4. Feb 20, 2021
  5. Dec 28, 2020
  6. Nov 03, 2020
  7. Oct 30, 2020
  8. Oct 10, 2020
  9. Jul 23, 2020
• Dev apache2 license (#3266) · d0bdbd5d
  Shenghang Tsai authored
      
      * add license at root dir
      
      * check in empty files
      
      * rm space
      
      * check in script
      
      * update script
      
      * fix bug
      
      * add print
      
      * fix
      
      * add exit
      
      * add to of_format
      
      * add CI task
      
      * fix license
      
      * Revert "fix license"
      
      This reverts commit 818b6d7691d3a8b4a25dd41a47ff2c5922b8ec57.
      
      * only add once
      
      * quick fix
      
      * fix script
      
* don't fmt empty file
      
      * fix
      
      * quick fix
      
      * fix py
      
      * add license
      
      * fix exit
      
      * add license for hpp
      
      * add license
      
      * license new vm files
      
Co-authored-by: tsai <caishenghang@oneflow.org>
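
A minimal sketch of the kind of license-header check this commit describes: scan source files, report (or prepend) the Apache-2.0 header where it is missing, and exit non-zero so a CI task can fail the build. The header text, file set, and flags here are illustrative assumptions, not OneFlow's actual script.

```python
import sys
from pathlib import Path

# Illustrative header line; the real script inserts the full Apache-2.0 notice
# and covers .cpp/.h/.hpp/.py files.
HEADER = "Copyright 2020 The OneFlow Authors. All rights reserved."

def check_tree(root: str, fix: bool = False) -> int:
    missing = 0
    for path in Path(root).rglob("*.py"):
        text = path.read_text(encoding="utf-8")
        if not text.strip():
            continue  # skip empty files (cf. "don't fmt empty file" above)
        if HEADER in text[:512]:
            continue  # header already present near the top
        missing += 1
        if fix:
            path.write_text(f'"""\n{HEADER}\n"""\n{text}', encoding="utf-8")
        else:
            print(f"missing license header: {path}")
    return missing

if __name__ == "__main__":
    # Exit non-zero when headers are missing so the CI task can fail.
    sys.exit(1 if check_tree(sys.argv[1], fix="--fix" in sys.argv) else 0)
```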
  10. Dec 09, 2019
  11. Dec 06, 2019
  12. Nov 29, 2019
  13. Sep 24, 2019
• merge with dev_python (#2249) · 3960d2cb
  Niu Chong authored
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
* run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
* delete unnecessary multiply_grad class
      
* speed up generation time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel to high performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
* separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
* mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing -> enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
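
Several items in this merge revolve around segment sums. A minimal NumPy sketch of unsorted_segment_sum as the commits use the term: scatter-add rows of `data` into `num_segments` output rows, with segment ids in any order. This shows the op's semantics only, not OneFlow's kernel.

```python
import numpy as np

def unsorted_segment_sum(data, segment_ids, num_segments):
    """Scatter-add data[i] into out[segment_ids[i]]; ids need not be sorted."""
    data = np.asarray(data)
    out = np.zeros((num_segments,) + data.shape[1:], dtype=data.dtype)
    np.add.at(out, np.asarray(segment_ids), data)  # unbuffered scatter-add
    return out

# Rows 0 and 2 land in segment 0, row 1 in segment 1:
print(unsorted_segment_sum([[1., 2.], [3., 4.], [5., 6.]], [0, 1, 0], 2))
# [[6. 8.]
#  [3. 4.]]
```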
  14. Sep 19, 2019
  15. Sep 04, 2019
• rm job_conf.num_of_batches_in_snapshot · 9bcdf707
  lixinqi authored
• Dev model init op (#2117) · 26717534
  Juncheng authored
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
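
The LearningRateScheduleOp/Kernel and AutoGlobalStep items above compute the learning rate as a pure function of a global train step, emitted as a scalar blob each iteration. A hedged sketch of that idea; the exponential-decay form and parameter names here are illustrative assumptions, not the op's actual conf.

```python
def learning_rate_at(train_step: int,
                     base_lr: float = 0.1,
                     decay_rate: float = 0.96,
                     decay_steps: int = 1000,
                     staircase: bool = False) -> float:
    """lr = base_lr * decay_rate ** (train_step / decay_steps)."""
    exponent = train_step / decay_steps
    if staircase:
        exponent = train_step // decay_steps  # step-wise rather than smooth decay
    return base_lr * decay_rate ** exponent

# Kernel-side view: one scalar out, recomputed from train_step every iteration.
for step in (0, 1000, 5000):
    print(step, learning_rate_at(step))
```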
  16. Aug 11, 2019
• Bugfix actor case (#1995) · 9b479e89
  Li Xinqi authored
* unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
  17. Jun 20, 2019
  18. Jun 19, 2019
  19. Jun 05, 2019
  20. May 31, 2019
  21. May 26, 2019
• Dev inplace (#1879) · 36749a85
  Li Xinqi authored
      * OpGraph::MakePredicatorIsAllLbiConsumersReachableToOpName
      
      * refactor TaskGraph::EnableInplaceMemSharing
      
      * TaskGraph::GetSafeInplaceOpBlobArgList
      
      * InplaceLbiGraph::ForEachSafeInplaceEdgesInSourceOpSubTree
      
      * fix a typo
      
      * TaskGraph::SetTaskRegstInplaceInfo
      
      * InplaceRegstGraph
      
      * refine
      
      * refine IsLbiOnTaskEdge
      
      * fix a bug in TaskGraph::ForEachDeviceNodes
      
      * ForEachGpuDeviceNodes
      
      * remove wrong CHECK
      
      * fix wrong use of std::unordered_set::erase
      
      * fix a bug in TaskGraph::GetInplaceOpBlobArgList
      
      * fix inplace bugs
      
      * fix error CHECK between inplace in dptr and inplace out dptr
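
A toy sketch of the reachability rule behind MakePredicatorIsAllLbiConsumersReachableToOpName: mutating a blob in place at op X is only safe if every other consumer of that blob is reachable to X in the op graph, so each read is ordered before the in-place write. The graph representation and names below are illustrative, not OneFlow's data structures.

```python
from functools import lru_cache

def make_is_reachable(edges):
    """edges: dict node -> list of successor nodes (a DAG)."""
    @lru_cache(maxsize=None)
    def is_reachable(src, dst):
        return src == dst or any(is_reachable(nxt, dst) for nxt in edges.get(src, ()))
    return is_reachable

def safe_to_mutate_inplace(mutator, consumers, is_reachable):
    # Every other consumer must be ordered before the in-place write.
    return all(is_reachable(c, mutator) for c in consumers if c != mutator)

# a -> b -> d and a -> c -> d: d may mutate a's output; b may not, since c
# is not guaranteed to read before b writes.
edges = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
r = make_is_reachable(edges)
print(safe_to_mutate_inplace("d", ["b", "c", "d"], r))  # True
print(safe_to_mutate_inplace("b", ["b", "c"], r))       # False
```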
  22. May 16, 2019
• Dev inplace obn graph (#1868) · 3563ba46
  Li Xinqi authored
      * InplaceObnGraph
      
      * more checks in InplaceObnGraph::InitNodes
      
      * framework of InplaceObnGraph::ComputeSafeInplaceObns
      
      * refine InplaceObnGraph::ComputeSafeInplaceObns
      
      * replace InplaceObnGraph with InplaceLbiGraph
      
      * fix three types of mut_ref conflicts
      
      * InplaceLbiGraph::FindFirstConstRefConflictMutRefEdge
      
      * fix bugs in InplaceLbiGraph::ComputeSafeInplaceObns
      
      * InplaceLbiGraph::DisconnectUnReachabeAndDataMutableEdge
      
      * InplaceLbiGraph::FixMutRefConflictsFromSourceOpNode
      
      * InplaceLbiGraph::FixMutRefConflictsFromSourceOpNode
      
      * Graph::FindFirstBackEdgeDstNode
      
      * more CHECK_ISNULL
      
* fix a bug in Graph::FindFirstBackEdgeDstNode()
      
      * fix bugs in Graph<NodeType, EdgeType>::ForEachConnectedComponent
      
      * rename GetIsMutableIbnConsumer => FindSoleMutableIbnConsumer
      
      * refine InplaceLbiGraph::IsConstRefConflictMutRefNode
      
      * there could be no mut_ref node found in InplaceLbiGraph::FindFirstInterOpRefConflictMutRefEdge
      
      * refine InplaceLbiGraph::FindFirstInterOpRefConflictMutRefEdge
      
      * remove unnecessary CHECK in InplaceLbiGraph::GetSafeInplaceObnEdges
      
      * fix a line of comment in InplaceLbiGraph::GetSafeInplaceObnEdges
      
      * shouldn't delete the edge to updt_node
      
      * refine InplaceLbiGraph::FixMutRefConflictsFromSourceOpNode
      
      * refine FindFirstIntraOpRefConflictMutRefEdge
      
      * fix a bug in InplaceLbiGraph::FindFirstIntraOpRefConflictMutRefEdge
      
      * CheckSubGraph
      
      * change some lambdas to functions
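
Graph::FindFirstBackEdgeDstNode above hints at how cycles are handled during the inplace analysis. A standard DFS three-color sketch that returns the destination node of the first back edge encountered (grey means "on the current DFS stack"); node and edge types here are illustrative.

```python
WHITE, GREY, BLACK = 0, 1, 2

def find_first_back_edge_dst(edges, start_nodes):
    """Return the dst of the first back edge found, or None if acyclic."""
    color = {}

    def dfs(node):
        color[node] = GREY
        for nxt in edges.get(node, ()):
            if color.get(nxt, WHITE) == GREY:
                return nxt              # back edge: nxt is on the DFS stack
            if color.get(nxt, WHITE) == WHITE:
                found = dfs(nxt)
                if found is not None:
                    return found
        color[node] = BLACK
        return None

    for node in start_nodes:
        if color.get(node, WHITE) == WHITE:
            found = dfs(node)
            if found is not None:
                return found
    return None

edges = {"a": ["b"], "b": ["c"], "c": ["a"]}   # a -> b -> c -> a
print(find_first_back_edge_dst(edges, ["a"]))  # 'a'
```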
  23. May 06, 2019
• feat: support inplace in Actor (#1781) · 606d8446
  Niu Chong authored
      * style(actor.h): move customized virtual function declaration together
      
      * feat: update Actor to support inplace regst
      
      * feat: check inplace in/out regst dptr equal
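
The last item, "check inplace in/out regst dptr equal", is a cheap runtime invariant: when a consumed register is declared inplace with a produced one, both must alias the same memory. A Python stand-in for what would be a CHECK in the C++ actor; the Regst class here is a hypothetical stand-in.

```python
class Regst:
    def __init__(self, dptr):
        self.dptr = dptr  # stand-in for the raw device pointer

def check_inplace_pair(in_regst: Regst, out_regst: Regst) -> None:
    # Mirrors a CHECK_EQ on the two pointers: an inplace out regst must
    # reuse the in regst's buffer, not a copy of it.
    if in_regst.dptr != out_regst.dptr:
        raise AssertionError("inplace regst pair does not share memory")

buf = bytearray(16)
check_inplace_pair(Regst(id(buf)), Regst(id(buf)))  # ok: same buffer
```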
  24. Feb 21, 2019
• Merge master to develop (part 3) (#1657) · 07ab09a5
  Shiyuan Shang-Guan authored
      * gpu (#1310)
      
      * Fix snapshot (#1320)
      
      * fix bug of snapshot
      
      * refine distribute.sh
      
      * use more accurate function calls
      
      * rename function
      
      * update for model parallel
      
      * refine code
      
      * feat: enhance cmake download & options (#1281)
      
      * feat: enhance cmake download & options
      
      * feat(tools/): add share libs build scripts
      
      * fix: add cmake options
      
      * feat: add 3rd party download
      
* chore: update README
      
      * fix: fix protobuf & cmake repo
      
      * fix: fix options name
      
      * chore: merge 3rd_party.cmake & third_party.cmake
      
      * chore: revert pre cmake URL fix
      
      * chore: update ExternalProject check
      
      * fix: fix typo & missing download
      
      * fix: fix download url
      
      * chore: update readme
      
      * chore: fix typo
      
      * fix: fix bugs
      
      * fix: fix bugs
      
      * fix: fix pre
      
      * print all third party libs
      
      * refine readme
      
      * DOWNLOAD_THIRD_PARTY -> PRECOMPILED_THIRD_PARTY
      
      * refine readme
      
      * minor typo fix
      
      * Fix bug in model parallel (#1345)
      
      * fix conv in model parallel
      
      * add TODO
      
      * Fix bug in gcc54 (#1352)
      
      * fix bug in gcc 5.4
      
      * update
      
      * refine ibverbs lib (#1391)
      
      * refine link ibverbs lib
      
      * modify minor
      
      * fix a little bug in accuracy print (#1403)
      
      * batch num for prediction (#1405)
      
      * batch num for prediction
      
      * !Train() => Predict()
      
* fix normalization epsilon check (#1433)

* Fix normalization epsilon check (#1441)

* fix normalization epsilon check

* remove check, fix epsilon value in op_conf
      
      * align with tensorflow (#1461)
      
      * Dev crop with random size (#1468)
      
      * random size crop proto
      
      * ImagePreprocessImpl::<kCropWithRandomSize>
      
      * clang format
      
      * MaxVal
      
      * Dev jinyi offline build (#1476)
      
      * chore: remove pre compiler funcs
      
* chore: add submodules
      
      * fix: fix project build URL from git_url -> submodule_dir_url
      
      * fix: fix submodule commit id
      
      * fix: fix .gitmodules
      
      * chore: mv third_party dir
      
      * chore: remove test-driver(glog#188) link in glog submodule
      
      * fix: update glog from: da816ea70645e463aa04f9564544939fa327d5a7 ==> to: 4f3e18bf26cdb794fc66cec348f57b5838a0c929
      
      * chore: update README.md
      
      * Dev prelu (#1474)
      
      * GPU impl of prelu
      
      * better naming
      
      * address review
      
      * renaming and use ? :
      
      * address reviews on op
      
      * change op conf
      
      * rename weight
      
      * allow 2d+
      
      * not elementwise op
      
      * naive impl
      
      * minor fix
      
      * rename
      
      * remove dupl
      
* refactor to remove duplicate
      
      * remove dup
      
      * remove dup
      
      * reverse condition
      
      * move code
      
      * remove useless arg
      
      * refactoring
      
      * add empty line
      
      * fix jpeg encoder quality (#1450)
      
      * fix(actor.cpp): never access regst after sending it to producer (#1531)
      
      * fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg() (#1520)
      
      * Dev center crop (#1542)
      
      * center crop
      
      * update
      
      * add scalar_mul (#1553)
      
      *  refactor(actor/*): update the {RegstNameType, {}} to std::make_pair (#1605)
      
      * fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()
      
      * refactor(actor/*): update the {RegstNameType, {}} to std::make_pair
      
      * fix record_num in blob (#1619)
      
      * fix record_num in blob
      
      * add comment
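
Among the merged items, Dev prelu (#1474) implements PReLU on GPU: x where positive, a learned slope times x where negative, with the slope broadcast per channel (the "not elementwise op" item refers to that broadcast). A NumPy sketch of the math, assuming a channel-first layout; this is the op's definition, not the CUDA kernel.

```python
import numpy as np

def prelu(x, alpha):
    """PReLU: x if x > 0 else alpha * x, alpha broadcast per channel.

    x: (N, C, ...) array, channel-first layout (an assumption here);
    alpha: (C,) learned negative-part slopes.
    """
    alpha = alpha.reshape((1, -1) + (1,) * (x.ndim - 2))  # broadcast over N and spatial dims
    return np.where(x > 0, x, alpha * x)

x = np.array([[[-2.0, 3.0], [1.0, -4.0]]])    # shape (1, 2, 2): N=1, C=2
print(prelu(x, np.array([0.1, 0.2])))          # [[[-0.2  3. ] [ 1.  -0.8]]]
```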
  25. Feb 19, 2019
• Dev bert merge develop (#1650) · 59eb55c1
  Li Xinqi authored
      * Implement gelu op (#1478)
      
      * gelu op
      
      * call different funcs for float and double
      
      * Dev bert gather op (#1483)
      
      * embedding_dense_op
      
      * refine
      
      * gather op
      
      * revert
      
      * Fix gelu bug (#1484)
      
      * fix inherit bug
      
      * fix backward formula
      
      * fix bug
      
      * Dev variable op (#1485)
      
      * DefineTestBlobConf => DefineTestBlobOpConf (#1480)
      
      * variable op
      
      * Dev variable op disable memsharing (#1487)
      
      * disable mem sharing for VariableOp
      
      * variable disable tick diff
      
      * fix
      
      * refine
      
      * options transpose_a and transpose_b for Matmul
      
      * matmul operator conf
      
      * Dev bert const scalar op (#1488)
      
      * const scalar  op
      
      * refine
      
      * fix
      
      * data parallel only
      
      * const range op (#1489)
      
      * square and sqrt
      
      * broadcast_binary_op
      
      * feat: add mean op (#1490)
      
      * feat: add mean op
      
      * feat: add mean_kernel
      
      * feat: add implementation
      
      * feat: fix mean kernel
      
      * Dev bert slice op (#1491)
      
      * add op_conf
      
      * add slice op impl
      
      * add space kernel impl
      
      * fix
      
      * same semantic as python
      
      * optional start and end
      
      * fix
      
      * add has_dim0_in_shape in reshape op (#1486)
      
      * refine CHECK in broadcast_binary_op
      
      * feat: add kernel implement for broadcast_mul/div
      
      * Impl square && sqrt (#1495)
      
      * impl square && sqrt
      
      * fix typo
      
      * Dev bert slice op (#1496)
      
      * add op_conf
      
      * add slice op impl
      
      * add space kernel impl
      
      * fix
      
      * same semantic as python
      
      * optional start and end
      
      * fix
      
      * slice kernel cpu impl
      
      * modify coding style
      
      * BiasAddOpConf
      
      * refactor(broadcast_div_kernel): update kernel util api
      
      * Dev bert cosnt range use device piece size (#1498)
      
      * use device_piece_size
      
* const size => size
      
      * fix
      
      * no check in BroadcastBinaryOp::InitFromProto
      
      * override GetCustomizedConfs for broadcast_binary_op
      
      * fix: fix bugs in broadcast_div/mul kernel (#1502)
      
      * fix: fix bugs in broadcast_div/mul kernel
      
      * fix
      
      * fix: fix the infer bw_buf blobdesc bug in broadcast_binary op
      
      * Bias Add Op && Kernel (#1503)
      
      * pass compile
      
      * fix typo
      
      * Matmul kernel implementation (#1494)
      
      * pass compile
      
      * add comment
      
      * fix bug
      
      * Dev bert const scalar kernel (#1492)
      
      * const scalar kernel
      
      * fix
      
      * fix
      
      * init
      
      * empty const range kernel
      
      * sketch of gather kernel
      
      * gather kernel
      
      * refine
      
      * refine
      
      * const range kernel
      
      * refine
      
      * backward
      
      * const range size
      
      * gather kernel
      
      * assert index
      
      * add truncated_normal initializer (#1499)
      
      * add truncated_normal initializer
      
      * rename RngTruncatedNormal
      
      * fix: add const override for InferBwBufBlobDescs in BroadcastBinaryOp
      
* fix: update the supported data type from floating to arithmetic
      
      * enforce 2d on bias add
      
      * Dev bert slice op (#1500)
      
      * add op_conf
      
      * add slice op impl
      
      * add space kernel impl
      
      * fix
      
      * same semantic as python
      
      * optional start and end
      
      * fix
      
      * slice kernel cpu impl
      
      * modify coding style
      
      * slice gpu impl const buf infer
      
      * add slice gpu impl
      
      * simplify slice cpu impl
      
      * fix gpu impl bug
      
      * fix typo
      
      * add forward function from broadcast_add,broadcast_sub
      
      * feat: add gpu impl of cast kernel (#1504)
      
      * Dev nc cast (#1507)
      
      * feat: add gpu impl of cast kernel
      
      * register gpu cast op
      
      * Fix broadcast binary all dim size 1 (#1505)
      
      * remove check NumAxes
      
      * check scalar
      
      * IsScalarBlob
      
      * b_diff=>b (#1509)
      
* feat: add LayerNormOp/Kernel without kernel implementation (#1510)
      
      * fix: fix missing registing layer_normalization kernel
      
      * fix: fix missing registing layer_normalization op
      
      * fix: temply remove activation from layer_norm_kernel
      
      * ExecShapeUtil
      
      * broadcast_binary_xpu_util.h
      
      * add bw kernel of broadcast_add
      
      * Dev constant (#1513)
      
      * constant_op
      
      * init_op_conf
      
      * sequence=>range
      
      * Dev broadcast add (#1514)
      
      * ExecShapeUtil
      
      * broadcast_binary_xpu_util.h
      
      * add bw kernel of broadcast_add
      
      * WITH_CUDA_PARAM
      
      * left extended shape
      
      * xpu_ndarray_builder
      
      * add bw kernel of broadcast_sub
      
      * updt to 1d (#1512)
      
* fix small bug in xpu_reduce_ndarray
      
      * fix(broadcast_binary_op): fix the wrong data_type of bw_buf regst (#1515)
      
      * feat(mean): update mean_op/kernel for calc only last dim of blob (#1516)
      
      * fix(mean_kernel): fix typo
      
      * ndarray reduce
      
      * new reduce
      
      * fix shape of tmp_storage
      
      * reduce
      
      * more check for NdArrayReduce
      
      * ImplaceApplyUnary<UnaryFuncMinus>
      
      * ndarray_apply_broadcast_binary
      
* delete useless files
      
      * complete backward kernel of broadcast_mul
      
      * add backward kernel of broadcast_div
      
      * broadcast binary op check data type equal (#1508)
      
      * fix bug in broadcast_binary
      
      * debug op
      
      * EncodeBlob
      
      * const_out_blob_feature_load_file
      
      * DefineTestBlobOpConf.has_diff
      
      * indices has_diff = false (#1519)
      
      * adam model update (#1518)
      
      * adam model update
      
      * add comment
      
      * update
      
      * add correct_deviation flag
      
      * rename
      
      * remove GetCustomizedConf
      
      * fix bug in mean_op fw kernel
      
      * add sigmoid loss op
      
      * ndarray_apply_broadcast_unary
      
* remove multiplier of mean kernel
      
      * fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()
      
      * fix raw (#1522)
      
      * rsqrt
      
      * XpuReducedNdarray supports expression template
      
      * faster_reduce
      
      * inlined cuda device function
      
      * profiling reduce_sum
      
      * refactor(kernel_util.cu): calc x_strides on cpu instead of on TransposeGpu() (#1525)
      
      * BroadcastBinaryOp
      
      * ExecShape => XpuShape
      
      * fix shape bug in mean bw kernel
      
      * refine XpuNdarrayAssign
      
      * use ndarray broadcast mul (#1529)
      
      * Dev softmax reduce ndarray (#1527)
      
      * softmax use ndarray reduce
      
      * fix shape
      
      * refine reduce
      
      * fix
      
      * remove xpu_ndarray_builder
      
      * fix(actor.cpp): never access regst after sending it to producer
      
      * ndarray_util.h => xpu_util.h
      
      * xpu_ndarray_util.h => ndarray_util.h
      
      * XpuNdArrayUtil => NdarrayUtil
      
      * SwitchReduce(SwitchCase(num_axes), ...) => Reduce(...)
      
      * refactor: rename NormalProcessNaiveReadableRegstMsg() to NormalProcessNaiveReadableDataRegstMsg() (#1532)
      
      * SwitchBroadcastApply(SwitchCase(num_axes), ...) => BroadcastApply(...)
      
      * softmax kernel use ndarray reduce  (#1530)
      
      * softmax use ndarray reduce
      
      * fix shape
      
      * refine reduce
      
      * fix
      
      * RowMax=>NdarrayReduce
      
      * SwitchReduce=>Reduce
      
      * move template parameter NDIMS from class NdarrayReduce to methods of class NdarrayReduce
      
      * rename file: ndarray/xpu_ndarray_reduce_test.cpp -> ndarray/ndarray_reduce_test.cpp
      
      * move NdarrayUtil::SwitchReduce(...) to NdarrayReduce::SwitchReduce(...)
      
      * Dev one hot encoder (#1533)
      
      * one_hot op
      
      * ohe
      
      * one hot kernel
      
      * refine
      
      * refine
      
      * remove old
      
      * refine
      
      * refine
      
      * refine
      
      * format
      
      * save m and v in adam_model_update (#1534)
      
      * Dev profile reduce (#1535)
      
      * ndarray_reduce_impl
      
      * NdarrayMatrixRowReduce
      
      * 1) MatrixColReduce; 2) WITH_CUDA_PARAM => RUN_CUDA_KERNEL
      
      * NdarrayScalarReduce
      
      * NdarrayDefaultReduce
      
      * refactor NdarrayReduce<DeviceType device_type, typename T> to NdarrayReduce<DeviceType device_type, typename T, const T(*binary_func)(const T, const T)>
      
      * 1) MaxVal<T>() => GetMaxVal<T>(); MaxValue<T>::value => MaxVal<T>::value
      
      * replace KernelUtil::RowMax with NdarrayUtil::ReduceMax
      
      * NdarrayNoReduce
      
      * eliminate redundant code by macros
      
      * Fix matmul gpu bugs (#1528)
      
      * call different api for batchedgemm
      
      * updt api
      
      * use naive loop
      
      * save work
      
      * save work
      
      * updt impl
      
      * remove useless code
      
      * replace naive loop with cublasgemmbatched
      
      * feat: add ScalarAddOp and ScalarMulOp (#1541)
      
      * Dev nc scalar (#1543)
      
      * feat: add ScalarAddOp and ScalarMulOp
      
      * feat: add ScalarAddKernel and ScalarMulKernel
      
      * fix: ScalarAddOp/ScalarMulOp not inheri from CWiseOp
      
      * fix: fix code style
      
      * fix: fix typo of include file in scalar_add_op/scalar_mul_op
      
* fix(scalar_mul_kernel): register ScalarMulKernel
      
      * fix: add MulbyScalarPara(), replace cublas_scal by this on ScalarMulKernel
      
      * fix(scalar_mul_kernel): fix typo
      
      * Dev nc testtrans (#1540)
      
      * feat: update trans kernel
      
      * InitGlobalCudaDeviceProp
      
* in_blob and out_blob are unnecessary for bw kernel of variable_op and constant_op
      
      * Transpose: the shape elem_cnt of x must not exceed 2^32
      
      * remove LabelType (#1545)
      
      * rm ndarray_reduce_core.*
      
      * Dev identity loss (#1547)
      
      * identity_loss
      
      * loss op
      
      * CalcLossInstanceNum
      
      * mem shared for mdupdt first in regst and md diff add regst (#1546)
      
      * remove useless code (#1548)
      
      * Dev sparse cross entropy (#1550)
      
      * op for sparse cross _entropy
      
      * modify op_conf for sparse cross entropy
      
      * saprse cross entropy kernel
      
      * op
      
      * SparseCrossEntropyKernelUtil
      
      * refine
      
      * refine shape check (#1552)
      
      * refactoring reduce sum (#1554)
      
      * refactoring reduce sum
      
      * also use shape and dptr when bw
      
      * add resize when keepdims
      
      * address reviews
      
      * move functions to Anonymous namespace
      
      * address reviews
      
      * remove auto
      
      * replace find
      
      * rename keepdims
      
      * only enable nccl on gpu
      
      * fix diff add regst size in MdUpdt task node as same as in regst (#1556)
      
      * mem shared for mdupdt first in regst and md diff add regst
      
      * fix diff add regst size in MdUpdt task node as same as in regst
      
      * minor fix
      
* special case when it is a loss op
      
      * Dev loss instance num (#1544)
      
      * loss instance number
      
      * set_has_loss_instance_num_field
      
      * loss
      
      * in_diff
      
      * LossOpFixInDiffHasLossInstanceNum
      
      * remove need_do_loss_instance_num
      
      * move to FixInDiffBlobDescs
      
      * remove
      
      * loss_instance_num use float
      
      * refine
      
      * Boxing ForwardLossInstance
      
      * fix
      
      * fix loss
      
      * fix
      
      * refine
      
      * fix
      
      * refine
      
      * refine
      
      * impl reduce mean
      
      * Dev all reduce ctrl edge (#1558)
      
      * mem shared for mdupdt first in regst and md diff add regst
      
      * feat: add ReduceInplaceIdentity LogicalNode/TaskNode/Op/Kernel
      
      * nccl reduce ctrl edge
      
      * MayConsumeModelDiff
      
      * fix diff add regst size in MdUpdt task node as same as in regst
      
      * eager_reduce_ratio
      
      * mem sharing for ReduceIdentity
      
      * ReduceInplaceIdentity => ReduceIdentity
      
      * reduce ctrl edge supports for arbitrary placement
      
      * refine ChainLogicalGraph::IsLogicalNodeMergeable
      
      * model name (#1561)
      
      * Dev gather refine (#1517)
      
      * gather op index support all int type and axis
      
      * out=in
      
      * reformat
      
      * negative axis
      
      * LookupKernel=>GatherKernel
      
      * reformat
      
      * refine
      
      * axis
      
      * refine & bugfix
      
      * remove ConstScalar and ConstRange (#1526)
      
      * Refine range initializer (#1523)
      
      * support axis
      
      * refine naming
      
      * fix before_dim_size
      
      * doc
      
      * refine
      
      * refine naming
      
      * refine naming
      
      * VariableLogicalNode
      
      * identity (#1563)
      
      * total_instance_num use naive mdupdt (#1564)
      
      * patch by hand from faster_rcnn
      
      * revert LogicalVariableOp
      
      * Dev clone boxing (#1566)
      
      * identity
      
      * reduce clone boxing
      
      * Dev clone boxing (#1568)
      
      * identity
      
      * reduce clone boxing
      
      * tuple identity
      
      * Dev tick (#1571)
      
      * feat: add Tick LogicalNode/TaskNode/Op/Kernel
      
      * feat: remove Tick LogicalNode/TaskNode
      
      * feat: add BldSubTskGphByTickToSource for TickOp
      
      * refine: refine due to comment
      
      * feat: add BldSubTskGphByRecordLoadToTick
      
      * pr tick op/kernel alone
      
      * feat: add TickOp and BldSubTskGphByTickToSource  (#1565)
      
      * feat: add Tick LogicalNode/TaskNode/Op/Kernel
      
      * feat: remove Tick LogicalNode/TaskNode
      
      * feat: add BldSubTskGphByTickToSource for TickOp
      
      * refine: refine due to comment
      
      * feat: add BldSubTskGphByRecordLoadToTick
      
      * refine: refine due to comment
      
      * refine: due to comment
      
      * refine: remove BldSubTskGphByRecordLoadToTick
      
      * fix tick op in dlnet (#1572)
      
      * Dev clip by global norm (#1521)
      
      * clip_by_global_norm
      
      * update
      
      * refine model_update op
      
      * remove useless code
      
      * fix name
      
      * rename clip_norm
      
      * remove useless code
      
      * force init memory and add CHECK()
      
      * remove useless code and add comment
      
      * fixbug
      
      * refine code
      
      * Dev bert profile (#1573)
      
      * 1) refactor reduce_group; 2) add new stream kReduceCtrl
      
      * 1) allreduce and model_update overlapping; 2) allreduce and fw overlapping
      
      * add mdupdt ctrl edges within reduce group (#1575)
      
      * Dev group all reduce by model bytes (#1577)
      
      * group all reduce by model byte size
      
* mv OpGraph into a separate file op_graph.h
      
      * gelu (#1578)
      
      * Dev bert layer norm (#1574)
      
      * layer norm
      
      * layer_norm
      
      * fix trainable
      
      * fix
      
      * fix trainable
      
      * refine
      
      * Dev bert cuda event sync (#1581)
      
      * cudaSetDevice in actor poller threads
      
      * ReduceConcatCompActor ; NaiveActor
      
      * set dev id (#1583)
      
      * Dev bert profiling (#1586)
      
      * profiling
      
      * all_reduce_* option for performance optimization
      
      * fix a mem sharing bug (#1590)
      
      * Fix mem sharing bug (#1593)
      
      * fix a mem sharing bug
      
      * refine by review
      
      * remove previous if condition
      
      * refine
      
      * Dev profiling adam (#1592)
      
      * profiling
      
      * all_reduce_* option for performance optimization
      
      * faster adam kernel
      
      * Dev refine transpose (#1594)
      
      * profiling
      
      * all_reduce_* option for performance optimization
      
      * faster adam kernel
      
      * refine dropout and transpose
      
      * loss print duration (#1598)
      
      * pseudo chains of OpGraph
      
      * ConvertPseudoChainToChain
      
      * refine pseudo_chain
      
      * refine register coloring algorithm
      
      * rename op_graph log file name
      
      * remove unused code
      
      * Dev bigger chain (#1601)
      
      * pseudo chains of OpGraph
      
      * ConvertPseudoChainToChain
      
      * refine pseudo_chain
      
      * refine register coloring algorithm
      
      * rename op_graph log file name
      
      * remove unused code
      
      * chore: add -gencode in CMakeLists.txt (#1603)
      
      * EnableMemSharingInVariableOp
      
      * no mem_sharing for out_diff & model_diff in variable_op
      
      * Dev mem sharing for variable op (#1604)
      
      * pseudo chains of OpGraph
      
      * ConvertPseudoChainToChain
      
      * refine pseudo_chain
      
      * refine register coloring algorithm
      
      * rename op_graph log file name
      
      * remove unused code
      
      * EnableMemSharingInVariableOp
      
      * no mem_sharing for out_diff & model_diff in variable_op
      
      * refine code
      
      * Fix jxf reduce concat bug (#1606)
      
      * refine logic to infer reduce_concat_op's elem_cnt of out blob, still have bugs...
      
      * add RoundUp in reduce_concat
      
      * CHECK_LE -> CHECK_EQ
      
      * add CHECK
      
      * Dev random shuffle (#1607)
      
      * random shuffle
      
      * fix
      
      * refine
      
      * refine
      
      * single thread
      
      * refine
      
      * cmake add half (#1609)
      
      * Bugfix no tick diff (#1614)
      
      * group by has_diff
      
      * rm unnecessary identity
      
      * share model_diff and out_diff in variable op (#1616)
      
      * share model_diff and out_diff in variable op
      
      * bugfix: model_diff is a produced register
      
      * register_num of model_diff is 1
      
      * add VariableKernelConf
      
      * no mutable
      
      * bugfix
      
      * bugfix: set ctrl_regst's return_regst_num (#1617)
      
* Register coloring with strategy (#1613)
      
      * mem_shared_hint_id
      
      * sharable memory block
      
      * rm useless code
      
      * remove useless code
      
      * bugfix: no redundant edges
      
      * rename: MemBlockGroup => MemBlock
      
* put constructor of SharableMemBlockNode into header file
      
      * bugfix
      
      * rename field: MemBlock.block_id => MemBlock.mem_block_id
      
      * refine CHECK in AllReduce (#1618)
      
      * refine CHECK in AllReduce
      
      * move ReduceConcatOpCtx definition to .cpp file
      
      * fix fw_consumer nullptr (#1622)
      
      * faster improver (#1628)
      
      * multithreads register coloring (#1630)
      
      * multithreads register coloring
      
      * refine code
      
      * Dev bert accuracy with weight (#1632)
      
      * accuracy
      
      * accuracy_task_node add fw_buf
      
      * fw_buf=>data_tmp
      
      * Dev logical blob dim0 (#1625)
      
      * mem_shared_hint_id
      
      * sharable memory block
      
      * rm useless code
      
      * remove useless code
      
      * bugfix: no redundant edges
      
      * rename: MemBlockGroup => MemBlock
      
* put constructor of SharableMemBlockNode into header file
      
      * bugfix
      
      * rename field: MemBlock.block_id => MemBlock.mem_block_id
      
      * replace piece_size with logical_blob_dim0
      
      * BlobParallelConf
      
      * BlobParallelDesc
      
      * infer out blob model_split_axis
      
      * int64_t => int32_t
      
      * InferOutBlobParallelDesc
      
      * gather out blob model split (#1624)
      
      * InferBlobParallelDesc
      
      * let variable op support kModelParallel
      
      * rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
      
      * Global<OpGraph>
      
      * SplitLogicalInputBlobDesc
      
      * ConcatOutputBlobDescs
      
      * rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
      
      * OpGraph::CheckBlobDescs(...)
      
      * exact division is unnecessary
      
      * fix bugs
      
      * rename InferOutBlob* => InferOutputBlob
      
      * exact division in variable_op is unnecessary
      
      * bug fix
      
      * fix bugs
      
      * fix bugs
      
      * IsInputBlobAllowedModelSplit
      
      * use Global<OpGraph> to InferModelSize
      
      * add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
      
      * fix IdentityOp::IsInputBlobAllowedModelSplit
      
      * no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
      
      * refine BlobParallelDesc: replace CopyParallelConf with operator=
      
      * refine ParallelDesc: remove unused functions
      
      * more checks on ParallelDesc
      
      * Dev logical blob dim0 (#1635)
      
      * mem_shared_hint_id
      
      * sharable memory block
      
      * rm useless code
      
      * remove useless code
      
      * bugfix: no redundant edges
      
      * rename: MemBlockGroup => MemBlock
      
* put constructor of SharableMemBlockNode into header file
      
      * bugfix
      
      * rename field: MemBlock.block_id => MemBlock.mem_block_id
      
      * replace piece_size with logical_blob_dim0
      
      * BlobParallelConf
      
      * BlobParallelDesc
      
      * infer out blob model_split_axis
      
      * int64_t => int32_t
      
      * InferOutBlobParallelDesc
      
      * gather out blob model split (#1624)
      
      * InferBlobParallelDesc
      
      * let variable op support kModelParallel
      
      * rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
      
      * Global<OpGraph>
      
      * SplitLogicalInputBlobDesc
      
      * ConcatOutputBlobDescs
      
      * rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
      
      * OpGraph::CheckBlobDescs(...)
      
      * exact division is unnecessary
      
      * fix bugs
      
      * rename InferOutBlob* => InferOutputBlob
      
      * exact division in variable_op is unnecessary
      
      * bug fix
      
      * fix bugs
      
      * fix bugs
      
      * IsInputBlobAllowedModelSplit
      
      * use Global<OpGraph> to InferModelSize
      
      * add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
      
      * fix IdentityOp::IsInputBlobAllowedModelSplit
      
      * no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
      
      * refine BlobParallelDesc: replace CopyParallelConf with operator=
      
      * refine ParallelDesc: remove unused functions
      
      * more checks on ParallelDesc
      
      * remove unused function Operator::MaxModelSplitNum
      
      * bugfix: SoleOp() => op_vec().at(0)
      
      * Dev global op graph (#1636)
      
* Global<OpGraph> is only available during compilation
      
      * small record_piece_size for InferNoParallelBlobDesc
      
      * Dev op graph piece size (#1637)
      
      * fix a bug in OpGraph::InferNoParallelBlobDesc
      
      * fix a bug in OpGraph::InferNoParallelBlobDesc
      
      * DfsTopoForEachNodeSortByDistanceToSink (#1638)
      
      * Dev jxf bert top k (#1633)
      
      * top_k
      
      * dev top_k op
      
      * refine
      
      * fix bug
      
      * refactor top_k op, cooperate with gather op to get values now
      
      * customized TOPK_KERNEL_ENTRY in auto factory
      
      * batch gather op
      
      * refine
      
      * Backup: batch_gather op, pass compile
      
      * fix bugs, pass the test
      
      * fix no new line at the end of file
      
      * const
      
      * refine by review
      
      * fix bugs
      
      * rename: instance_dim -> instance_size
      
      * remove a blank line
      
      * refine coding style by Juncheng's suggestions, Bravo
      
      * refine top_k
      
      * more refine
      
      * compatible with new model parallel
      
      * refine
      
      * rename
      
      * cpu only in top_k
      
      * Dev model boxing (#1639)
      
      * mem_shared_hint_id
      
      * sharable memory block
      
      * rm useless code
      
      * remove useless code
      
      * bugfix: no redundant edges
      
      * rename: MemBlockGroup => MemBlock
      
* put constructor of SharableMemBlockNode into header file
      
      * bugfix
      
      * rename field: MemBlock.block_id => MemBlock.mem_block_id
      
      * replace piece_size with logical_blob_dim0
      
      * BlobParallelConf
      
      * BlobParallelDesc
      
      * infer out blob model_split_axis
      
      * int64_t => int32_t
      
      * InferOutBlobParallelDesc
      
      * gather out blob model split (#1624)
      
      * InferBlobParallelDesc
      
      * let variable op support kModelParallel
      
      * rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
      
      * Global<OpGraph>
      
      * SplitLogicalInputBlobDesc
      
      * ConcatOutputBlobDescs
      
      * rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
      
      * OpGraph::CheckBlobDescs(...)
      
      * exact division is unnecessary
      
      * fix bugs
      
      * rename InferOutBlob* => InferOutputBlob
      
      * exact division in variable_op is unnecessary
      
      * bug fix
      
      * fix bugs
      
      * fix bugs
      
      * IsInputBlobAllowedModelSplit
      
      * use Global<OpGraph> to InferModelSize
      
      * add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
      
      * fix IdentityOp::IsInputBlobAllowedModelSplit
      
      * no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
      
      * refine BlobParallelDesc: replace CopyParallelConf with operator=
      
      * refine ParallelDesc: remove unused functions
      
      * more checks on ParallelDesc
      
      * remove unused function Operator::MaxModelSplitNum
      
      * BlobParallelDesc::EquivalentTo
      
* LogicalNode::main_model_parallel_ is out of date
      
      * refine Operator: replace IsElemWiseOp with IsSoleInputBlobAllowedModelSplit
      
      * refine transpose conf
      
      * fix a bug in Operator::FixParallelDesc
      
      * InferInputBlobModelSplitAxis
      
      * BlobParallelType
      
      * more default behaviors for Operator::InferInputOutputBlobParallelType
      
      * op_parallel_signature
      
      * rename: BlobParallelType => LogicalBlobParallelDesc
      
      * OpGraph::InferLogicalBlobParallelDesc
      
      * refactor SplitLogicalInputBlobDesc by LogicalBlobParallelDesc
      
      * refine OpNode::ConcatBlobDesc By LogicalBlobParallelDesc
      
      * OpNode::lbi2model_split_axis_
      
      * OpGraph::GetBalancedSplitter
      
      * replace OpGraph::GetBlobParallelDesc4Lbi with OpGraph::GetLbpd4Lbi
      
      * rm BlobParallelDesc in OpGraph
      
      * VariableOp::InitOpParallelSignatures
      
      * rm BlobParallelDesc
      
      * rename Make*ParalelSignature functions
      
      * MakeOpParallelSignature_DS_MC_2_DS
      
      * MakeOpParallelSignature_DC_MS_2_MS
      
      * BiasAddOp::InitOpParallelSignatures
      
      * refine MakeOpParallelSignature_DC_MS_2_MS
      
      * MatmulOp::InitOpParallelSignatures
      
      * GatherOp::InitOpParallelSignatures
      
* bugfix: model_split_axis cannot equal -1 when parallel_policy is kModelParallel
      
      * refactor: bn2parallel_id2blob_desc => lbi2parallel_id2blob_desc
      
      * refine OpNode
      
      * LogicalBlobParallelConf
      
      * LogicalBlobParallelDesc::DualLbpd
      
      * 1) merge dev_bert;
      2) placement.proto not used in logical_blob_parallel_conf.proto
      
      * bugfix: 1) remove CHECK(has_model) in Operator::NaiveInitOpParallelSignatures; 2) lbpd->set_parallel_num(val)
      
      * fix bugs in GatherOp::InitOpParallelSignatures and BroadcastBinaryOp::InitOpParallelSignatures
      
      * refactor: InitOpParallelSignatures => GetOpParallelSignatures
      
      * refactor: const OpParallelSignature => std::unique_ptr<const OpParallelSignature>
      
      * rm LogicalBlobParallelConf
      
      * refactor: ModelSplitAxis4BnInOp => LbpdHint4BnInOp
      
      * fix bugs about LbpdHint
      
      * simplify the interface of InferInputOutputBlobLogicalBlobParallelDescIf
      
      * rename Class CloneParallel => BroadcastParallel
      
      * rename field: clone_parallel => broadcast_parallel
      
      * refactor LbpdHint by SbpParallel
      
      * InferIsModelBlob4OutputBlobsIf
      
      * remove field LogicalBlobParallelDesc::parallel_num
      
      * rename: LogicalBlobParallelDesc => SbpParallel
      
      * rename: LbpdHint =>SbpInferHint
      
      * simplify interface Operator::InferOutputBlobSbpInferHint
      
      * rename api: Operator::InferBlobSbpInferHintIf => Operator::InferOuputBlobsSbpInferHintIf
      
      * OpGraph::InferIsModelBlob
      
      * rename file: logical_blob_parallel_desc.* => sbp_parallel.*
      
      * rename filename: lbpd_hint* => sbp_infer_hint*
      
      * rename field: SbpInferHint::has_data_split => SbpInferHint::is_data_split
      
      * rename fields: SbpInferHint::is_data_split, is_model_split, is_data_partial_sum, is_model_broadcast
      
      * refactor SbpInferHint::split_axis
      
      * LambdaOpParallelSignature
      
      * replace function MakeVariableOpDataSplitOpParallelSignature with class VariableOpDataSplitOpParallelSignature
      
      * replace function MakeVariableOpModelSplitOpParallelSignature with class VariableOpModelSplitOpParallelSignature
      
      * BroadcastBinaryOpParallelSignature
      
      * Matmul_DMS_MS_2_P_OpParallelSignature
      
      * Gather_DC_MS_2_P_OpParallelSignature
      
      * class DataSplitOpParallelSignature
      
      * class ModelBroadcastOpParallelSignature
      
      * class DS_MC_2_DS_OpParallelSignature
      
      * add field OpParallelSignature::op_
      
      * refactor: ModelSplitAxis => OutputBlobModelSplitAxis
      
      * remove Operator::InferOuputBlobsSbpInferHintIf
      
      * implement MatmulOp::OutputBlobModelSplitAxis
      
      * implement GatherOp::OutputBlobModelSplitAxis
      
      * implement TransposeOp::OutputBlobModelSplitAxis and BiasAddOp::OutputBlobModelSplitAxis
      
      * add method OpGraph::IsDataBlob
      
      * refactor OpGraph::InferSbpParallel
      
      * refactor class SbpInferHint
      
      * rename local variable: SbpInferHint4BnInOp => SbpInferHint4Ibn
      
      * refactor MakeModelSplitOpParallelSignature
      
      * refactor Make_DC_MS_2_MS_OpParallelSignature
      
      * remove unused class LambdaOpParallelSignature; refactor class name '*Clone*' => '*Broadcast*'
      
      * bugfix: Operator::OutputBlobModelSplitAxis for sole-ibn op
      
      * fix bugs in SbpInferHint::has_split_axis(), SbpInferHint::split_axis and OpNode::IsModelBlob4Lbi
      
      * refactor class SbpInferHint: replace split_axis_ with sbp_parallel_
      
      * refactor by SbpInferHint::sbp_parallel
      
      * 1) rename OpNode data member; 2) rm unused proto
      
      * fix clone (#1641)
      
      * OpGraph::GetBlobDataType (#1643)
      
      * OpGraph::GetBlobDataType
      
      * refine OpGraph::GetBlobDataType
      
      * IdentityOp => TupleIdentityOp (#1644)
      
      * Dev sbp parallel cast (#1646)
      
      * add SbpParallelCastOp
      
      * only SplitParallel and BroadcastParallel can be user customized
      
      * rename: SbpParallelCastOp => ParallelCastOp
      
      * build boxing_conf by sbp_parallel
      
      * fix a bug in BroadcastBinaryOpParallelSignature
      
      * support broadcast_parallel for sole-ibn op
      
      * 1) build boxing_op_conf by sbp_parallel for tuple_identity_op;
      2) no op parallel desc fix for kModelParallel;
3) fix a bug in TaskGraph::EnableMemSharingInVariableOp
      4) add TupleIdentityOpParallelSignature
      
      * fix bug in IsModelParallel121 (#1648)
      
      * merge develop
      
      * merge develop (#1649)
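
The tail of this merge renames LogicalBlobParallelDesc to SbpParallel, the split/broadcast/partial-sum trichotomy that later became OneFlow's SBP. A toy NumPy sketch of what each annotation means for a logical tensor placed on n devices; the 2-device examples and function names are illustrative only.

```python
import numpy as np

def split(tensor, axis, n):
    """S(axis): each device holds one slice along `axis`."""
    return np.array_split(tensor, n, axis=axis)

def broadcast(tensor, n):
    """B: every device holds the full tensor."""
    return [tensor.copy() for _ in range(n)]

def partial_sum_reconstruct(parts):
    """P: each device holds an addend; the logical tensor is their sum."""
    return sum(parts)

t = np.arange(12).reshape(3, 4)
print([p.shape for p in split(t, 0, 2)])        # [(2, 4), (1, 4)]
print([p.shape for p in broadcast(t, 2)])       # [(3, 4), (3, 4)]
print(partial_sum_reconstruct([t, t])[0, 0])    # 0: sum of device addends
```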
  26. Oct 01, 2018
  27. Sep 30, 2018
• Refactor Actor (#1259) · e042befc
  Niu Chong authored
      * feat(register_slot): add the RegstSlot
      
      * feat(register_slot): update RegstSlot if
      
      * feat(actor): update member of Actor to use RegstSlot
      
      * fix(register_slot): fix the available_regst_desc_cnt init val
      
      * refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
      
      * feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
      
      * feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
      
      * fix(register_slot): fix the CHECK empty
      
      * feat: remove actual_writeable_regst_desc_id_ from Actor, add Naive/CustomizedProducedRegst
      
      * fix(normal_model_update_actor): bug: not send customized regst to consumer when SendIntialModel
      
      * fix(normal_forward_compute_actor): bug: not add kLoss/kAccuracy produced regst to NaiveProducedRegst
      
      * fix(actor): UNIMPLEMENTED() for AsyncSendCustomizedProducedRegstMsgToConsumer
      
      * fix(normal_forward_compute_actor): set const_buf_regst to nullptr when recv from consumers
      
      * fix(actor): total_reading_data_regst_cnt, not total_reading_ctrl_regst_cnt
      
      * refactor: update GetNaiveConsumedRegstDescName to GetNaiveOrCustomizedConsumedRegstDescName(same for Produced)
      
      * feat: combine data_regst and ctrl_regst in Actor
      
      * fix: fix bugs
      
      * fix: fix bugs
      
      * fix: remove .swp files and unused LOG
      
      * feat: split Act and SendMsg (#1255)
      
      * feat: split Act and SendMsg
      
      * refine: rename HandleProduced/ConsumedDataRegst.. to HandleProduced/ConsumedNaiveDatRegst..
      
      * fix(input_wise_comp_actor): bug: not set piece id
      
      * fix(actor): potential bug: produced msg with no allowed actor still pop from queue
      
      * refactor: mv some protected member function to private
      
      * fix(actor): fix the condition about sending EORD msg
      
      * refactor(input_wise_actor): use RegstSlot in InputWiseActor
      
      * fix(copy_comm_net_actor): rename piece_id2regst_ctx to piece_id2regst_ctx_
      
      * refactor: rename Name2RegstDescId to Name2RegstDescIds
      
      * refactor(naive_actor): "override final" instead of only "final"
      
      * refine(actor): little refine
      
      * feat: update the return type of GetNaiveOrCustomizedNamesRegstDescName to enum class RegstNameType
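
A schematic sketch of the "split Act and SendMsg" refactor described above: when an actor is ready, it first runs its kernels (Act), then, as a separate step, pushes produced registers to consumers and returns consumed ones to their producer. The message and queue types below are illustrative stand-ins for the C++ actor runtime, not its API.

```python
from collections import deque

class ToyActor:
    def __init__(self, kernels):
        self.kernels = kernels
        self.outbox = deque()  # stand-in for async message sending

    def process_ready(self, consumed_regsts, produced_regsts, consumers, producer):
        self.act(consumed_regsts, produced_regsts)        # step 1: compute only
        self.send_msgs(consumed_regsts, produced_regsts,  # step 2: messaging only
                       consumers, producer)

    def act(self, consumed, produced):
        for kernel in self.kernels:
            kernel(consumed, produced)

    def send_msgs(self, consumed, produced, consumers, producer):
        for regst in produced:
            for consumer in consumers:
                self.outbox.append(("regst_to_consumer", consumer, regst))
        for regst in consumed:
            self.outbox.append(("regst_back_to_producer", producer, regst))
```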
  28. Sep 10, 2018
  29. Sep 07, 2018
• feat: update the data members to use RegstSlot in Actor (#1208) · 38a50de4
  Niu Chong authored
      * feat(register_slot): add the RegstSlot
      
      * feat(register_slot): update RegstSlot if
      
      * feat(actor): update member of Actor to use RegstSlot
      
      * fix(register_slot): fix the available_regst_desc_cnt init val
      
      * refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
      
      * feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
      
      * feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
      
      * fix(register_slot): fix the CHECK empty
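
A minimal Python model of the RegstSlot this PR introduces: a deque of ready registers per regst_desc_id, a counter of how many descs currently have at least one register available, and the Try*/Has* interface named in the commits. The real class lives in OneFlow's C++ actor code; this sketch only mirrors the described behavior.

```python
from collections import deque

class RegstSlot:
    def __init__(self, regst_desc_ids):
        self.deqs = {rid: deque() for rid in regst_desc_ids}
        self.available_regst_desc_cnt = 0  # descs with a non-empty deque

    def has_regst_desc_id(self, rid):
        return rid in self.deqs

    def try_push_back(self, rid, regst):
        if rid not in self.deqs:
            return False
        if not self.deqs[rid]:
            self.available_regst_desc_cnt += 1  # desc becomes available
        self.deqs[rid].append(regst)
        return True

    def try_pop_front(self, rid):
        if rid not in self.deqs or not self.deqs[rid]:
            return None
        regst = self.deqs[rid].popleft()
        if not self.deqs[rid]:
            self.available_regst_desc_cnt -= 1  # desc exhausted
        return regst

    def is_ready(self):
        # Every regst desc has at least one register available.
        return self.available_regst_desc_cnt == len(self.deqs)
```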
  30. Aug 19, 2018
  31. Aug 04, 2018
  32. Aug 01, 2018
  33. Jul 16, 2018
• feat: Add InputWiseActor for ReduceGlobalAdd and ReduceGather (#1012) · 0ffc781c
  Niu Chong authored
      * feat: avoid net contention by adding ctrl edge in ReduceStruct
      
      * refine(task_graph.h/cpp): refine AddCtrlEdgeInReduceStruct()
      
      * fix(graph/task_graph.cpp): fix the bug of machine order
      
      * fix(graph/task_graph.cpp): do not add ctrl edge with reduce scatter
      
      * feat: add ReduceGlobalAddCompActor
      
      * fix: fix the bug of reduce_global_actor/kernel
      
      * chore: remove used vim .swp file
      
* fix(graph/task_graph.cpp): fix the bug of sorting copycomment when building reduce ctrl edge
      
      * fix(graph/task_graph.h/cpp): add CtrlEdge for ReduceGather
      
      * feat: revert add ctrl edge in reduce struct from this PR
      
      * refactor: rename ReduceGlobalAddCompActor to InputWiseCompActor for scalability
      
* fix(kernel/reduce_global_add_kernel.cpp): use Memcpy rather than Memset for first blob to be added
      
      * refactor(actor/input_wise_compute_actor.*): use HashMap and counter instead of HashSet for processed regst_desc
      
      * refactor: let ReduceGlobalAddCompActor inherit InputWiseCompActor
      
      * feature: add ReduceGatherCompActor that inherits InputWiseCompActor
      
      * fix(reduce_gather_kernel.cpp): add missing break
      
      * refactor: replace regst_desc_id2bn_in_op_ with regst_desc_id2in_bn_id_ in InputWiseCompActor
      
      * fix(reduce_global_add_kernel): remove useless class member parallel_id_
      
* refactor: make ReduceLocalAdd kernel support inputwise, rename ReduceGlobalAddActor to ReduceAddActor for scalability
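
The kernel fix above ("use Memcpy rather than Memset for first blob to be added") suggests the input-wise pattern: as each input register arrives, the first one initializes the output buffer by copy, and later ones accumulate into it, so no zero-initialization pass is needed. A NumPy sketch of that accumulation order; the class name is a hypothetical stand-in.

```python
import numpy as np

class InputWiseAdd:
    """Accumulate inputs one at a time, in arrival order."""

    def __init__(self):
        self.out = None
        self.seen = 0

    def consume(self, blob):
        if self.seen == 0:
            self.out = blob.copy()   # first blob: a Memcpy, no zeroing needed
        else:
            self.out += blob         # later blobs: elementwise add
        self.seen += 1

acc = InputWiseAdd()
for part in (np.ones(4), 2 * np.ones(4), 3 * np.ones(4)):  # any arrival order
    acc.consume(part)
print(acc.out)  # [6. 6. 6. 6.]
```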
  34. Jul 03, 2018
• Refine loader (#982) · 9674be86
  Jinhui Yuan authored
      * add parallel record decoder
      
      * null data loader
      
      * use real loader
      
      * use libjpeg-turbo
      
* add streams support, TODO: 1) disable internal buffer_; 2) use template
      
      * make persistent_in_stream support multiple files
      
      * make compiler support new loader
      
      * minor refine
      
      * make new loader work on mnist
      
      * update proto in benchmark and example
      
      * refactor stream buffer filler
      
      * refine persistent_in_stream
      
      * workable
      
      * add record_loader_op
      
      * finish record loader op
      
      * AddRecordLoaderOps
      
      * make compiler work
      
      * infer shape works
      
      * add record_load_kernel
      
      * let decode actor pass in_regst in normal way
      
      * add kOFRecordPtr type
      
      * remove record regst type
      
      * change ALL_DATA_TYPE_SEQ to ALL_POD_DATA_TYPE_SEQ
      
      * support OFRecordPtr blob
      
      * complete decode_ofrecord_kernel
      
      * allocate OFRecord in Blob<OFRecordPtr>
      
      * fix of record ptr blob
      
      * let actor manage the OFRecord blob
      
      * let regst mgr own ofrecord memory
      
      * workable
      
      * remove useless code
      
      * refine
      
      * NormalRegst -> DataRegst
      
      * OFRecord data type (#984)
      
      * OFRecord data type
      
      * placement new (#987)
      
      * placement new
      
      * fix
      
      * remove useless code
      
      * placement new OFRecord
      
      * remove useless code
      
      * Refactor stream (#985)
      
      * refactor stream scanner
      
      * let persistence_in_stream create the binary stream
      
      * refine persistence_in_stream
      
      * refine
      
      * POD_DATA_TYPE_SEQ
      
      * update placement proto in benchmark and example
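
A sketch of the "make persistent_in_stream support multiple files" idea: a reader that presents a list of files as one continuous byte stream, advancing to the next file when the current one is exhausted. The buffering details and class name are assumptions; the real stream is part of OneFlow's C++ persistence layer.

```python
class MultiFilePersistentInStream:
    """Read a list of files as one continuous stream of bytes."""

    def __init__(self, paths):
        self.paths = list(paths)
        self.idx = 0
        self.f = open(self.paths[0], "rb") if self.paths else None

    def read(self, n):
        out = b""
        while self.f is not None and len(out) < n:
            chunk = self.f.read(n - len(out))
            if chunk:
                out += chunk
            else:                      # current file exhausted: advance
                self.f.close()
                self.idx += 1
                self.f = (open(self.paths[self.idx], "rb")
                          if self.idx < len(self.paths) else None)
        return out  # shorter than n only at the end of the last file
```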