Dev bert merge develop (#1650)
* Implement gelu op (#1478) (see the GELU sketch after this list)
* gelu op
* call different funcs for float and double
* Dev bert gather op (#1483)
* embedding_dense_op
* refine
* gather op
* revert
* Fix gelu bug (#1484)
* fix inherit bug
* fix backward formula
* fix bug
* Dev variable op (#1485)
* DefineTestBlobConf => DefineTestBlobOpConf (#1480)
* variable op
* Dev variable op disable memsharing (#1487)
* disable mem sharing for VariableOp
* variable disable tick diff
* fix
* refine
* options transpose_a and transpose_b for Matmul
* matmul operator conf
* Dev bert const scalar op (#1488)
* const scalar op
* refine
* fix
* data parallel only
* const range op (#1489)
* square and sqrt
* broadcast_binary_op
* feat: add mean op (#1490)
* feat: add mean op
* feat: add mean_kernel
* feat: add implementation
* feat: fix mean kernel
* Dev bert slice op (#1491)
* add op_conf
* add slice op impl
* add space kernel impl
* fix
* same semantic as python
* optional start and end
* fix
* add has_dim0_in_shape in reshape op (#1486)
* refine CHECK in broadcast_binary_op
* feat: add kernel implementation for broadcast_mul/div
* Impl square && sqrt (#1495)
* impl square && sqrt
* fix typo
* Dev bert slice op (#1496)
* add op_conf
* add slice op impl
* add space kernel impl
* fix
* same semantic as python
* optional start and end
* fix
* slice kernel cpu impl
* modify coding style
* BiasAddOpConf
* refactor(broadcast_div_kernel): update kernel util api
* Dev bert const range use device piece size (#1498)
* use device_piece_size
* const size => size
* fix
* no check in BroadcastBinaryOp::InitFromProto
* override GetCustomizedConfs for broadcast_binary_op
* fix: fix bugs in broadcast_div/mul kernel (#1502)
* fix: fix bugs in broadcast_div/mul kernel
* fix
* fix: fix the infer bw_buf blobdesc bug in broadcast_binary op
* Bias Add Op && Kernel (#1503)
* pass compile
* fix typo
* Matmul kernel implementation (#1494)
* pass compile
* add comment
* fix bug
* Dev bert const scalar kernel (#1492)
* const scalar kernel
* fix
* fix
* init
* empty const range kernel
* sketch of gather kernel
* gather kernel
* refine
* refine
* const range kernel
* refine
* backward
* const range size
* gather kernel
* assert index
* add truncated_normal initializer (#1499) (see the truncated-normal sketch after this list)
* add truncated_normal initializer
* rename RngTruncatedNormal
* fix: add const override for InferBwBufBlobDescs in BroadcastBinaryOp
* fix: update the supported data type from floating to arithmetic
* enforce 2d on bias add
* Dev bert slice op (#1500)
* add op_conf
* add slice op impl
* add space kernel impl
* fix
* same semantic as python
* optional start and end
* fix
* slice kernel cpu impl
* modify coding style
* slice gpu impl const buf infer
* add slice gpu impl
* simplify slice cpu impl
* fix gpu impl bug
* fix typo
* add forward function from broadcast_add, broadcast_sub
* feat: add gpu impl of cast kernel (#1504)
* Dev nc cast (#1507)
* feat: add gpu impl of cast kernel
* register gpu cast op
* Fix broadcast binary all dim size 1 (#1505)
* remove check NumAxes
* check scalar
* IsScalarBlob
* b_diff=>b (#1509)
* feat: add LayerNormOp/Kernel without kernel implementation (#1510)
* fix: fix missing registering layer_normalization kernel
* fix: fix missing registering layer_normalization op
* fix: temporarily remove activation from layer_norm_kernel
* ExecShapeUtil
* broadcast_binary_xpu_util.h
* add bw kernel of broadcast_add (see the broadcast-backward sketch after this list)
* Dev constant (#1513)
* constant_op
* init_op_conf
* sequence=>range
* Dev broadcast add (#1514)
* ExecShapeUtil
* broadcast_binary_xpu_util.h
* add bw kernel of broadcast_add
* WITH_CUDA_PARAM
* left extended shape
* xpu_ndarray_builder
* add bw kernel of broadcast_sub
* updt to 1d (#1512)
* fix small bug in xpu_reduce_ndarray
* fix(broadcast_binary_op): fix the wrong data_type of bw_buf regst (#1515)
* feat(mean): update mean_op/kernel for calc only last dim of blob (#1516)
* fix(mean_kernel): fix typo
* ndarray reduce
* new reduce
* fix shape of tmp_storage
* reduce
* more checks for NdArrayReduce
* ImplaceApplyUnary<UnaryFuncMinus>
* ndarray_apply_broadcast_binary
* delete useless files
* complete backward kernel of broadcast_mul
* add backward kernel of broadcast_div
* broadcast binary op check data type equal (#1508)
* fix bug in broadcast_binary
* debug op
* EncodeBlob
* const_out_blob_feature_load_file
* DefineTestBlobOpConf.has_diff
* indices has_diff = false (#1519)
* adam model update (#1518) (see the Adam sketch after this list)
* adam model update
* add comment
* update
* add correct_deviation flag
* rename
* remove GetCustomizedConf
* fix bug in mean_op fw kernel
* add sigmoid loss op
* ndarray_apply_broadcast_unary
* remove multiplier of mean kernel
* fix(boxing_actor): do not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()
* fix raw (#1522)
* rsqrt
* XpuReducedNdarray supports expression templates
* faster_reduce
* inlined cuda device function
* profiling reduce_sum
* refactor(kernel_util.cu): calc x_strides on cpu instead of on TransposeGpu() (#1525)
* BroadcastBinaryOp
* ExecShape => XpuShape
* fix shape bug in mean bw kernel
* refine XpuNdarrayAssign
* use ndarray broadcast mul (#1529)
* Dev softmax reduce ndarray (#1527)
* softmax use ndarray reduce
* fix shape
* refine reduce
* fix
* remove xpu_ndarray_builder
* fix(actor.cpp): never access regst after sending it to producer
* ndarray_util.h => xpu_util.h
* xpu_ndarray_util.h => ndarray_util.h
* XpuNdArrayUtil => NdarrayUtil
* SwitchReduce(SwitchCase(num_axes), ...) => Reduce(...)
* refactor: rename NormalProcessNaiveReadableRegstMsg() to NormalProcessNaiveReadableDataRegstMsg() (#1532)
* SwitchBroadcastApply(SwitchCase(num_axes), ...) => BroadcastApply(...)
* softmax kernel use ndarray reduce (#1530)
* softmax use ndarray reduce
* fix shape
* refine reduce
* fix
* RowMax=>NdarrayReduce
* SwitchReduce=>Reduce
* move template parameter NDIMS from class NdarrayReduce to methods of class NdarrayReduce
* rename file: ndarray/xpu_ndarray_reduce_test.cpp -> ndarray/ndarray_reduce_test.cpp
* move NdarrayUtil::SwitchReduce(...) to NdarrayReduce::SwitchReduce(...)
* Dev one hot encoder (#1533)
* one_hot op
* ohe
* one hot kernel
* refine
* refine
* remove old
* refine
* refine
* refine
* format
* save m and v in adam_model_update (#1534)
* Dev profile reduce (#1535)
* ndarray_reduce_impl
* NdarrayMatrixRowReduce
* 1) MatrixColReduce; 2) WITH_CUDA_PARAM => RUN_CUDA_KERNEL
* NdarrayScalarReduce
* NdarrayDefaultReduce
* refactor NdarrayReduce<DeviceType device_type, typename T> to NdarrayReduce<DeviceType device_type, typename T, const T(*binary_func)(const T, const T)>
* 1) MaxVal<T>() => GetMaxVal<T>(); MaxValue<T>::value => MaxVal<T>::value
* replace KernelUtil::RowMax with NdarrayUtil::ReduceMax
* NdarrayNoReduce
* eliminate redundant code by macros
* Fix matmul gpu bugs (#1528)
* call different api for batched gemm
* updt api
* use naive loop
* save work
* save work
* updt impl
* remove useless code
* replace naive loop with cublasgemmbatched
* feat: add ScalarAddOp and ScalarMulOp (#1541)
* Dev nc scalar (#1543)
* feat: add ScalarAddOp and ScalarMulOp
* feat: add ScalarAddKernel and ScalarMulKernel
* fix: ScalarAddOp/ScalarMulOp not inherit from CWiseOp
* fix: fix code style
* fix: fix typo of include file in scalar_add_op/scalar_mul_op
* fix(scalar_mul_kernel): register ScalarMulKernel
* fix: add MulbyScalarPara(), replace cublas_scal with it in ScalarMulKernel
* fix(scalar_mul_kernel): fix typo
* Dev nc testtrans (#1540)
* feat: update trans kernel
* InitGlobalCudaDeviceProp
* in_blob and out_blob are unnecessary for bw kernel of variable_op and constant_op
* Transpose: the shape elem_cnt of x must not exceed 2^32
* remove LabelType (#1545)
* rm ndarray_reduce_core.*
* Dev identity loss (#1547)
* identity_loss
* loss op
* CalcLossInstanceNum
* mem shared for mdupdt first in regst and md diff add regst (#1546)
* remove useless code (#1548)
* Dev sparse cross entropy (#1550)
* op for sparse cross entropy
* modify op_conf for sparse cross entropy
* sparse cross entropy kernel
* op
* SparseCrossEntropyKernelUtil
* refine
* refine shape check (#1552)
* refactoring reduce sum (#1554)
* refactoring reduce sum
* also use shape and dptr when bw
* add resize when keepdims
* address reviews
* move functions to anonymous namespace
* address reviews
* remove auto
* replace find
* rename keepdims
* only enable nccl on gpu
* fix diff add regst size in MdUpdt task node to be the same as in regst (#1556)
* mem shared for mdupdt first in regst and md diff add regst
* fix diff add regst size in MdUpdt task node to be the same as in regst
* minor fix
* special case when it is a loss op
* Dev loss instance num (#1544)
* loss instance number
* set_has_loss_instance_num_field
* loss
* in_diff
* LossOpFixInDiffHasLossInstanceNum
* remove need_do_loss_instance_num
* move to FixInDiffBlobDescs
* remove
* loss_instance_num use float
* refine
* Boxing ForwardLossInstance
* fix
* fix loss
* fix
* refine
* fix
* refine
* refine
* impl reduce mean
* Dev all reduce ctrl edge (#1558)
* mem shared for mdupdt first in regst and md diff add regst
* feat: add ReduceInplaceIdentity LogicalNode/TaskNode/Op/Kernel
* nccl reduce ctrl edge
* MayConsumeModelDiff
* fix diff add regst size in MdUpdt task node to be the same as in regst
* eager_reduce_ratio
* mem sharing for ReduceIdentity
* ReduceInplaceIdentity => ReduceIdentity
* reduce ctrl edge supports arbitrary placement
* refine ChainLogicalGraph::IsLogicalNodeMergeable
* model name (#1561)
* Dev gather refine (#1517)
* gather op index supports all int types and axis
* out=in
* reformat
* negative axis
* LookupKernel=>GatherKernel
* reformat
* refine
* axis
* refine & bugfix
* remove ConstScalar and ConstRange (#1526)
* Refine range initializer (#1523)
* support axis
* refine naming
* fix before_dim_size
* doc
* refine
* refine naming
* refine naming
* VariableLogicalNode
* identity (#1563)
* total_instance_num use naive mdupdt (#1564)
* patch by hand from faster_rcnn
* revert LogicalVariableOp
* Dev clone boxing (#1566)
* identity
* reduce clone boxing
* Dev clone boxing (#1568)
* identity
* reduce clone boxing
* tuple identity
* Dev tick (#1571)
* feat: add Tick LogicalNode/TaskNode/Op/Kernel
* feat: remove Tick LogicalNode/TaskNode
* feat: add BldSubTskGphByTickToSource for TickOp
* refine: refine due to comment
* feat: add BldSubTskGphByRecordLoadToTick
* PR tick op/kernel alone
* feat: add TickOp and BldSubTskGphByTickToSource (#1565)
* feat: add Tick LogicalNode/TaskNode/Op/Kernel
* feat: remove Tick LogicalNode/TaskNode
* feat: add BldSubTskGphByTickToSource for TickOp
* refine: refine due to comment
* feat: add BldSubTskGphByRecordLoadToTick
* refine: refine due to comment
* refine: due to comment
* refine: remove BldSubTskGphByRecordLoadToTick
* fix tick op in dlnet (#1572)
* Dev clip by global norm (#1521) (see the clip-by-global-norm sketch after this list)
* clip_by_global_norm
* update
* refine model_update op
* remove useless code
* fix name
* rename clip_norm
* remove useless code
* force init memory and add CHECK()
* remove useless code and add comment
* fix bug
* refine code
* Dev bert profile (#1573)
* 1) refactor reduce_group; 2) add new stream kReduceCtrl
* 1) allreduce and model_update overlapping; 2) allreduce and fw overlapping
* add mdupdt ctrl edges within reduce group (#1575)
* Dev group all reduce by model bytes (#1577)
* group all reduce by model byte size
* mv OpGraph into a separate file op_graph.h
* gelu (#1578)
* Dev bert layer norm (#1574)
* layer norm
* layer_norm
* fix trainable
* fix
* fix trainable
* refine
* Dev bert cuda event sync (#1581)
* cudaSetDevice in actor poller threads
* ReduceConcatCompActor; NaiveActor
* set dev id (#1583)
* Dev bert profiling (#1586)
* profiling
* all_reduce_* option for performance optimization
* fix a mem sharing bug (#1590)
* Fix mem sharing bug (#1593)
* fix a mem sharing bug
* refine by review
* remove previous if condition
* refine
* Dev profiling adam (#1592)
* profiling
* all_reduce_* option for performance optimization
* faster adam kernel
* Dev refine transpose (#1594)
* profiling
* all_reduce_* option for performance optimization
* faster adam kernel
* refine dropout and transpose
* loss print duration (#1598)
* pseudo chains of OpGraph
* ConvertPseudoChainToChain
* refine pseudo_chain
* refine register coloring algorithm
* rename op_graph log file name
* remove unused code
* Dev bigger chain (#1601)
* pseudo chains of OpGraph
* ConvertPseudoChainToChain
* refine pseudo_chain
* refine register coloring algorithm
* rename op_graph log file name
* remove unused code
* chore: add -gencode in CMakeLists.txt (#1603)
* EnableMemSharingInVariableOp
* no mem_sharing for out_diff & model_diff in variable_op
* Dev mem sharing for variable op (#1604)
* pseudo chains of OpGraph
* ConvertPseudoChainToChain
* refine pseudo_chain
* refine register coloring algorithm
* rename op_graph log file name
* remove unused code
* EnableMemSharingInVariableOp
* no mem_sharing for out_diff & model_diff in variable_op
* refine code
* Fix jxf reduce concat bug (#1606)
* refine logic to infer reduce_concat_op's elem_cnt of out blob, still have bugs...
* add RoundUp in reduce_concat
* CHECK_LE -> CHECK_EQ
* add CHECK
* Dev random shuffle (#1607)
* random shuffle
* fix
* refine
* refine
* single thread
* refine
* cmake add half (#1609)
* Bugfix no tick diff (#1614)
* group by has_diff
* rm unnecessary identity
* share model_diff and out_diff in variable op (#1616)
* share model_diff and out_diff in variable op
* bugfix: model_diff is a produced register
* register_num of model_diff is 1
* add VariableKernelConf
* no mutable
* bugfix
* bugfix: set ctrl_regst's return_regst_num (#1617)
* Register coloring with strategy (#1613)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* refine CHECK in AllReduce (#1618)
* refine CHECK in AllReduce
* move ReduceConcatOpCtx definition to .cpp file
* fix fw_consumer nullptr (#1622)
* faster improver (#1628)
* multithreaded register coloring (#1630)
* multithreaded register coloring
* refine code
* Dev bert accuracy with weight (#1632)
* accuracy
* accuracy_task_node add fw_buf
* fw_buf=>data_tmp
* Dev logical blob dim0 (#1625)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* replace piece_size with logical_blob_dim0
* BlobParallelConf
* BlobParallelDesc
* infer out blob model_split_axis
* int64_t => int32_t
* InferOutBlobParallelDesc
* gather out blob model split (#1624)
* InferBlobParallelDesc
* let variable op support kModelParallel
* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
* Global<OpGraph>
* SplitLogicalInputBlobDesc
* ConcatOutputBlobDescs
* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
* OpGraph::CheckBlobDescs(...)
* exact division is unnecessary
* fix bugs
* rename InferOutBlob* => InferOutputBlob
* exact division in variable_op is unnecessary
* bug fix
* fix bugs
* fix bugs
* IsInputBlobAllowedModelSplit
* use Global<OpGraph> to InferModelSize
* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
* fix IdentityOp::IsInputBlobAllowedModelSplit
* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
* refine BlobParallelDesc: replace CopyParallelConf with operator=
* refine ParallelDesc: remove unused functions
* more checks on ParallelDesc
* Dev logical blob dim0 (#1635)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* replace piece_size with logical_blob_dim0
* BlobParallelConf
* BlobParallelDesc
* infer out blob model_split_axis
* int64_t => int32_t
* InferOutBlobParallelDesc
* gather out blob model split (#1624)
* InferBlobParallelDesc
* let variable op support kModelParallel
* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
* Global<OpGraph>
* SplitLogicalInputBlobDesc
* ConcatOutputBlobDescs
* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
* OpGraph::CheckBlobDescs(...)
* exact division is unnecessary
* fix bugs
* rename InferOutBlob* => InferOutputBlob
* exact division in variable_op is unnecessary
* bug fix
* fix bugs
* fix bugs
* IsInputBlobAllowedModelSplit
* use Global<OpGraph> to InferModelSize
* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
* fix IdentityOp::IsInputBlobAllowedModelSplit
* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
* refine BlobParallelDesc: replace CopyParallelConf with operator=
* refine ParallelDesc: remove unused functions
* more checks on ParallelDesc
* remove unused function Operator::MaxModelSplitNum
* bugfix: SoleOp() => op_vec().at(0)
* Dev global op graph (#1636)
* Global<OpGraph> is only available during compilation
* small record_piece_size for InferNoParallelBlobDesc
* Dev op graph piece size (#1637)
* fix a bug in OpGraph::InferNoParallelBlobDesc
* fix a bug in OpGraph::InferNoParallelBlobDesc
* DfsTopoForEachNodeSortByDistanceToSink (#1638)
* Dev jxf bert top k (#1633)
* top_k
* dev top_k op
* refine
* fix bug
* refactor top_k op, cooperate with gather op to get values now
* customized TOPK_KERNEL_ENTRY in auto factory
* batch gather op
* refine
* Backup: batch_gather op, pass compile
* fix bugs, pass the test
* fix missing newline at the end of file
* const
* refine by review
* fix bugs
* rename: instance_dim -> instance_size
* remove a blank line
* refine coding style by Juncheng's suggestions, Bravo
* refine top_k
* more refine
* compatible with new model parallel
* refine
* rename
* cpu only in top_k
* Dev model boxing (#1639)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* replace piece_size with logical_blob_dim0
* BlobParallelConf
* BlobParallelDesc
* infer out blob model_split_axis
* int64_t => int32_t
* InferOutBlobParallelDesc
* gather out blob model split (#1624)
* InferBlobParallelDesc
* let variable op support kModelParallel
* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
* Global<OpGraph>
* SplitLogicalInputBlobDesc
* ConcatOutputBlobDescs
* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
* OpGraph::CheckBlobDescs(...)
* exact division is unnecessary
* fix bugs
* rename InferOutBlob* => InferOutputBlob
* exact division in variable_op is unnecessary
* bug fix
* fix bugs
* fix bugs
* IsInputBlobAllowedModelSplit
* use Global<OpGraph> to InferModelSize
* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
* fix IdentityOp::IsInputBlobAllowedModelSplit
* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
* refine BlobParallelDesc: replace CopyParallelConf with operator=
* refine ParallelDesc: remove unused functions
* more checks on ParallelDesc
* remove unused function Operator::MaxModelSplitNum
* BlobParallelDesc::EquivalentTo
* LogicalNode::main_model_parallel_ is out of date
* refine Operator: replace IsElemWiseOp with IsSoleInputBlobAllowedModelSplit
* refine transpose conf
* fix a bug in Operator::FixParallelDesc
* InferInputBlobModelSplitAxis
* BlobParallelType
* more default behaviors for Operator::InferInputOutputBlobParallelType
* op_parallel_signature
* rename: BlobParallelType => LogicalBlobParallelDesc
* OpGraph::InferLogicalBlobParallelDesc
* refactor SplitLogicalInputBlobDesc by LogicalBlobParallelDesc
* refine OpNode::ConcatBlobDesc by LogicalBlobParallelDesc
* OpNode::lbi2model_split_axis_
* OpGraph::GetBalancedSplitter
* replace OpGraph::GetBlobParallelDesc4Lbi with OpGraph::GetLbpd4Lbi
* rm BlobParallelDesc in OpGraph
* VariableOp::InitOpParallelSignatures
* rm BlobParallelDesc
* rename Make*ParallelSignature functions
* MakeOpParallelSignature_DS_MC_2_DS
* MakeOpParallelSignature_DC_MS_2_MS
* BiasAddOp::InitOpParallelSignatures
* refine MakeOpParallelSignature_DC_MS_2_MS
* MatmulOp::InitOpParallelSignatures
* GatherOp::InitOpParallelSignatures
* bugfix: model_split_axis cannot equal -1 when parallel_policy is kModelParallel
* refactor: bn2parallel_id2blob_desc => lbi2parallel_id2blob_desc
* refine OpNode
* LogicalBlobParallelConf
* LogicalBlobParallelDesc::DualLbpd
* 1) merge dev_bert; 2) placement.proto not used in logical_blob_parallel_conf.proto
* bugfix: 1) remove CHECK(has_model) in Operator::NaiveInitOpParallelSignatures; 2) lbpd->set_parallel_num(val)
* fix bugs in GatherOp::InitOpParallelSignatures and BroadcastBinaryOp::InitOpParallelSignatures
* refactor: InitOpParallelSignatures => GetOpParallelSignatures
* refactor: const OpParallelSignature => std::unique_ptr<const OpParallelSignature>
* rm LogicalBlobParallelConf
* refactor: ModelSplitAxis4BnInOp => LbpdHint4BnInOp
* fix bugs about LbpdHint
* simplify the interface of InferInputOutputBlobLogicalBlobParallelDescIf
* rename class: CloneParallel => BroadcastParallel
* rename field: clone_parallel => broadcast_parallel
* refactor LbpdHint by SbpParallel
* InferIsModelBlob4OutputBlobsIf
* remove field LogicalBlobParallelDesc::parallel_num
* rename: LogicalBlobParallelDesc => SbpParallel
* rename: LbpdHint => SbpInferHint
* simplify interface Operator::InferOutputBlobSbpInferHint
* rename api: Operator::InferBlobSbpInferHintIf => Operator::InferOuputBlobsSbpInferHintIf
* OpGraph::InferIsModelBlob
* rename file: logical_blob_parallel_desc.* => sbp_parallel.*
* rename filename: lbpd_hint* => sbp_infer_hint*
* rename field: SbpInferHint::has_data_split => SbpInferHint::is_data_split
* rename fields: SbpInferHint::is_data_split, is_model_split, is_data_partial_sum, is_model_broadcast
* refactor SbpInferHint::split_axis
* LambdaOpParallelSignature
* replace function MakeVariableOpDataSplitOpParallelSignature with class VariableOpDataSplitOpParallelSignature
* replace function MakeVariableOpModelSplitOpParallelSignature with class VariableOpModelSplitOpParallelSignature
* BroadcastBinaryOpParallelSignature
* Matmul_DMS_MS_2_P_OpParallelSignature
* Gather_DC_MS_2_P_OpParallelSignature
* class DataSplitOpParallelSignature
* class ModelBroadcastOpParallelSignature
* class DS_MC_2_DS_OpParallelSignature
* add field OpParallelSignature::op_
* refactor: ModelSplitAxis => OutputBlobModelSplitAxis
* remove Operator::InferOuputBlobsSbpInferHintIf
* implement MatmulOp::OutputBlobModelSplitAxis
* implement GatherOp::OutputBlobModelSplitAxis
* implement TransposeOp::OutputBlobModelSplitAxis and BiasAddOp::OutputBlobModelSplitAxis
* add method OpGraph::IsDataBlob
* refactor OpGraph::InferSbpParallel
* refactor class SbpInferHint
* rename local variable: SbpInferHint4BnInOp => SbpInferHint4Ibn
* refactor MakeModelSplitOpParallelSignature
* refactor Make_DC_MS_2_MS_OpParallelSignature
* remove unused class LambdaOpParallelSignature; refactor class name '*Clone*' => '*Broadcast*'
* bugfix: Operator::OutputBlobModelSplitAxis for sole-ibn op
* fix bugs in SbpInferHint::has_split_axis(), SbpInferHint::split_axis and OpNode::IsModelBlob4Lbi
* refactor class SbpInferHint: replace split_axis_ with sbp_parallel_
* refactor by SbpInferHint::sbp_parallel
* 1) rename OpNode data member; 2) rm unused proto
* fix clone (#1641)
* OpGraph::GetBlobDataType (#1643)
* OpGraph::GetBlobDataType
* refine OpGraph::GetBlobDataType
* IdentityOp => TupleIdentityOp (#1644)
* Dev sbp parallel cast (#1646)
* add SbpParallelCastOp
* only SplitParallel and BroadcastParallel can be user customized
* rename: SbpParallelCastOp => ParallelCastOp
* build boxing_conf by sbp_parallel
* fix a bug in BroadcastBinaryOpParallelSignature
* support broadcast_parallel for sole-ibn op
* 1) build boxing_op_conf by sbp_parallel for tuple_identity_op; 2) no op parallel desc fix for kModelParallel; 3) fix a bug in TaskGraph::EnableMemSharingInVariableOp; 4) add TupleIdentityOpParallelSignature
* fix bug in IsModelParallel121 (#1648)
* merge develop
* merge develop (#1649)
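For context on a few of the terser entries above, here are hedged sketches. The gelu op from #1478 computes the Gaussian Error Linear Unit; a minimal scalar sketch of the exact erf formulation, with the float/double split that the "call different funcs for float and double" commit refers to (illustrative code, not OneFlow's actual kernel):

```cpp
#include <cmath>

// GELU(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
// std::erff for float and std::erf for double, mirroring the
// "call different funcs for float and double" commit.
inline float Gelu(float x) { return 0.5f * x * (1.0f + std::erff(x * 0.70710678f)); }
inline double Gelu(double x) { return 0.5 * x * (1.0 + std::erf(x * 0.7071067811865476)); }
```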
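The truncated_normal initializer from #1499 presumably follows the common convention (as in TensorFlow): redraw any sample that falls more than two standard deviations from the mean. A sketch under that assumption; TruncatedNormal is a hypothetical stand-in for RngTruncatedNormal:

```cpp
#include <cmath>
#include <random>

// Sample from N(mean, std_dev^2), rejecting draws outside mean +/- 2*std_dev.
template<typename T>
T TruncatedNormal(std::mt19937& gen, T mean, T std_dev) {
  std::normal_distribution<T> dist(mean, std_dev);
  T val;
  do { val = dist(gen); } while (std::abs(val - mean) > 2 * std_dev);
  return val;
}
```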
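Many entries concern backward kernels for the broadcast binary ops (broadcast_add/sub/mul/div) together with the ndarray reduce work. The rule they implement: the diff of a broadcast input is the out-diff summed over the broadcast axes back down to the input's shape. A toy CPU sketch for the bias input of broadcast_add, assuming y[m,n] = x[m,n] + b[n] (illustrative names, not the real ndarray-based kernels):

```cpp
// db[j] = sum_i dy[i][j]: reduce the out-diff over the broadcast axis,
// which is what the ndarray ReduceSum path does in the real kernels.
void BroadcastAddBackwardBias(const float* dy, float* db, int m, int n) {
  for (int j = 0; j < n; ++j) { db[j] = 0.0f; }
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) { db[j] += dy[i * n + j]; }
  }
}
```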
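The adam model update from #1518 adds a correct_deviation flag; in standard Adam this corresponds to the bias correction of the moment estimates from the original paper, which is the reading assumed here. A single-parameter sketch (illustrative only; the actual kernel also persists m and v, per #1534):

```cpp
#include <cmath>

// One Adam step. With correct_deviation set, m and v are bias-corrected
// by 1/(1 - beta^t) before the parameter update.
void AdamUpdate(float g, float* w, float* m, float* v, int t, float lr,
                float beta1, float beta2, float eps, bool correct_deviation) {
  *m = beta1 * (*m) + (1.0f - beta1) * g;
  *v = beta2 * (*v) + (1.0f - beta2) * g * g;
  float m_hat = *m;
  float v_hat = *v;
  if (correct_deviation) {
    m_hat /= 1.0f - std::pow(beta1, t);
    v_hat /= 1.0f - std::pow(beta2, t);
  }
  *w -= lr * m_hat / (std::sqrt(v_hat) + eps);
}
```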
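clip_by_global_norm from #1521 is assumed here to follow the usual semantics: compute a single L2 norm across all gradients and rescale them only when it exceeds clip_norm. A minimal sketch under that assumption:

```cpp
#include <cmath>
#include <vector>

// If global_norm = sqrt(sum of g^2 over every element of every gradient)
// exceeds clip_norm, scale all gradients by clip_norm / global_norm.
void ClipByGlobalNorm(std::vector<std::vector<float>>& grads, float clip_norm) {
  double sum_sq = 0.0;
  for (const auto& g : grads) {
    for (float x : g) { sum_sq += static_cast<double>(x) * x; }
  }
  const double global_norm = std::sqrt(sum_sq);
  if (global_norm <= clip_norm) { return; }  // within bound: leave unchanged
  const float scale = static_cast<float>(clip_norm / global_norm);
  for (auto& g : grads) {
    for (float& x : g) { x *= scale; }
  }
}
```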
Showing 20 changed files:
- CMakeLists.txt: 5 additions, 0 deletions
- cmake/oneflow.cmake: 1 addition, 0 deletions
- cmake/third_party.cmake: 5 additions, 0 deletions
- cmake/third_party/half.cmake: 35 additions, 0 deletions
- oneflow/core/actor/actor.cpp: 10 additions, 4 deletions
- oneflow/core/actor/actor.h: 1 addition, 1 deletion
- oneflow/core/actor/boxing_actor.cpp: 1 addition, 1 deletion
- oneflow/core/actor/boxing_actor.h: 1 addition, 1 deletion
- oneflow/core/actor/loss_print_compute_actor.h: 5 additions, 1 deletion
- oneflow/core/actor/naive_actor.cpp: 2 additions, 0 deletions
- oneflow/core/actor/normal_backward_compute_actor.cpp: 1 addition, 1 deletion
- oneflow/core/actor/normal_backward_compute_actor.h: 1 addition, 1 deletion
- oneflow/core/actor/reduce_concat_compute_actor.cpp: 0 additions, 10 deletions
- oneflow/core/actor/reduce_concat_compute_actor.h: 2 additions, 8 deletions
- oneflow/core/actor/reduce_split_compute_actor.cpp: 0 additions, 9 deletions
- oneflow/core/actor/reduce_split_compute_actor.h: 2 additions, 8 deletions
- oneflow/core/comm_network/epoll/epoll_comm_network.cpp: 2 additions, 2 deletions
- oneflow/core/comm_network/ibverbs/ibverbs_qp.cpp: 1 addition, 1 deletion
- oneflow/core/common/blas.h: 10 additions, 8 deletions
- oneflow/core/common/data_type.h: 31 additions, 0 deletions