Unverified commit 59eb55c1 authored by Li Xinqi, committed by GitHub

Dev bert merge develop (#1650)

* Implement gelu op (#1478)

* gelu op

* call different funcs for float and double

* Dev bert gather op (#1483)

* embedding_dense_op

* refine

* gather op

* revert

* Fix gelu bug (#1484)

* fix inherit bug

* fix backward formula

* fix bug

* Dev variable op (#1485)

* DefineTestBlobConf => DefineTestBlobOpConf (#1480)

* variable op

* Dev variable op disable memsharing (#1487)

* disable mem sharing for VariableOp

* variable disable tick diff

* fix

* refine

* options transpose_a and transpose_b for Matmul

* matmul operator conf

* Dev bert const scalar op (#1488)

* const scalar  op

* refine

* fix

* data parallel only

* const range op (#1489)

* square and sqrt

* broadcast_binary_op

* feat: add mean op (#1490)

* feat: add mean op

* feat: add mean_kernel

* feat: add implementation

* feat: fix mean kernel

* Dev bert slice op (#1491)

* add op_conf

* add slice op impl

* add space kernel impl

* fix

* same semantics as Python

* optional start and end

* fix

* add has_dim0_in_shape in reshape op (#1486)

* refine CHECK in broadcast_binary_op

* feat: add kernel implement for broadcast_mul/div

* Impl square && sqrt (#1495)

* impl square && sqrt

* fix typo

* Dev bert slice op (#1496)

* add op_conf

* add slice op impl

* add space kernel impl

* fix

* same semantics as Python

* optional start and end

* fix

* slice kernel cpu impl

* modify coding style

* BiasAddOpConf

* refactor(broadcast_div_kernel): update kernel util api

* Dev bert const range use device piece size (#1498)

* use device_piece_size

* const size => size

* fix

* no check in BroadcastBinaryOp::InitFromProto

* override GetCustomizedConfs for broadcast_binary_op

* fix: fix bugs in broadcast_div/mul kernel (#1502)

* fix: fix bugs in broadcast_div/mul kernel

* fix

* fix: fix the infer bw_buf blobdesc bug in broadcast_binary op

* Bias Add Op && Kernel (#1503)

* pass compile

* fix typo

* Matmul kernel implementation (#1494)

* pass compile

* add comment

* fix bug

* Dev bert const scalar kernel (#1492)

* const scalar kernel

* fix

* fix

* init

* empty const range kernel

* sketch of gather kernel

* gather kernel

* refine

* refine

* const range kernel

* refine

* backward

* const range size

* gather kernel

* assert index

* add truncated_normal initializer (#1499)

* add truncated_normal initializer

* rename RngTruncatedNormal

* fix: add const override for InferBwBufBlobDescs in BroadcastBinaryOp

* fix: update the supported data type from floating to arithmetic

* enforce 2d on bias add

* Dev bert slice op (#1500)

* add op_conf

* add slice op impl

* add space kernel impl

* fix

* same semantics as Python

* optional start and end

* fix

* slice kernel cpu impl

* modify coding style

* slice gpu impl const buf infer

* add slice gpu impl

* simplify slice cpu impl

* fix gpu impl bug

* fix typo

* add forward function for broadcast_add, broadcast_sub

* feat: add gpu impl of cast kernel (#1504)

* Dev nc cast (#1507)

* feat: add gpu impl of cast kernel

* register gpu cast op

* Fix broadcast binary all dim size 1 (#1505)

* remove check NumAxes

* check scalar

* IsScalarBlob

* b_diff=>b (#1509)

* feat: add LayerNormOp/Kernel without kernel implement (#1510)

* fix: fix missing registering layer_normalization kernel

* fix: fix missing registering layer_normalization op

* fix: temporarily remove activation from layer_norm_kernel

* ExecShapeUtil

* broadcast_binary_xpu_util.h

* add bw kernel of broadcast_add

* Dev constant (#1513)

* constant_op

* init_op_conf

* sequence=>range

* Dev broadcast add (#1514)

* ExecShapeUtil

* broadcast_binary_xpu_util.h

* add bw kernel of broadcast_add

* WITH_CUDA_PARAM

* left extended shape

* xpu_ndarray_builder

* add bw kernel of broadcast_sub

* updt to 1d (#1512)

* fix small bug in xpu_reduce_ndarray

* fix(broadcast_binary_op): fix the wrong data_type of bw_buf regst (#1515)

* feat(mean): update mean_op/kernel for calc only last dim of blob (#1516)

* fix(mean_kernel): fix typo

* ndarray reduce

* new reduce

* fix shape of tmp_storage

* reduce

* more check for NdArrayReduce

* ImplaceApplyUnary<UnaryFuncMinus>

* ndarray_apply_broadcast_binary

* delete useless files

* complete backward kernel of broadcast_mul

* add backward kernel of broadcast_div

* broadcast binary op check data type equal (#1508)

* fix bug in broadcast_binary

* debug op

* EncodeBlob

* const_out_blob_feature_load_file

* DefineTestBlobOpConf.has_diff

* indices has_diff = false (#1519)

* adam model update (#1518)

* adam model update

* add comment

* update

* add correct_deviation flag

* rename

* remove GetCustomizedConf

* fix bug in mean_op fw kernel

* add sigmoid loss op

* ndarray_apply_broadcast_unary

* remove multiplier of mean kernel

* fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()

* fix raw (#1522)

* rsqrt

* XpuReducedNdarray supports expression template

* faster_reduce

* inlined cuda device function

* profiling reduce_sum

* refactor(kernel_util.cu): calc x_strides on cpu instead of on TransposeGpu() (#1525)

* BroadcastBinaryOp

* ExecShape => XpuShape

* fix shape bug in mean bw kernel

* refine XpuNdarrayAssign

* use ndarray broadcast mul (#1529)

* Dev softmax reduce ndarray (#1527)

* softmax use ndarray reduce

* fix shape

* refine reduce

* fix

* remove xpu_ndarray_builder

* fix(actor.cpp): never access regst after sending it to producer

* ndarray_util.h => xpu_util.h

* xpu_ndarray_util.h => ndarray_util.h

* XpuNdArrayUtil => NdarrayUtil

* SwitchReduce(SwitchCase(num_axes), ...) => Reduce(...)

* refactor: rename NormalProcessNaiveReadableRegstMsg() to NormalProcessNaiveReadableDataRegstMsg() (#1532)

* SwitchBroadcastApply(SwitchCase(num_axes), ...) => BroadcastApply(...)

* softmax kernel use ndarray reduce  (#1530)

* softmax use ndarray reduce

* fix shape

* refine reduce

* fix

* RowMax=>NdarrayReduce

* SwitchReduce=>Reduce

* move template parameter NDIMS from class NdarrayReduce to methods of class NdarrayReduce

* rename file: ndarray/xpu_ndarray_reduce_test.cpp -> ndarray/ndarray_reduce_test.cpp

* move NdarrayUtil::SwitchReduce(...) to NdarrayReduce::SwitchReduce(...)

* Dev one hot encoder (#1533)

* one_hot op

* ohe

* one hot kernel

* refine

* refine

* remove old

* refine

* refine

* refine

* format

* save m and v in adam_model_update (#1534)

* Dev profile reduce (#1535)

* ndarray_reduce_impl

* NdarrayMatrixRowReduce

* 1) MatrixColReduce; 2) WITH_CUDA_PARAM => RUN_CUDA_KERNEL

* NdarrayScalarReduce

* NdarrayDefaultReduce

* refactor NdarrayReduce<DeviceType device_type, typename T> to NdarrayReduce<DeviceType device_type, typename T, const T(*binary_func)(const T, const T)>

* 1) MaxVal<T>() => GetMaxVal<T>(); MaxValue<T>::value => MaxVal<T>::value

* replace KernelUtil::RowMax with NdarrayUtil::ReduceMax

* NdarrayNoReduce

* eliminate redundant code by macros

* Fix matmul gpu bugs (#1528)

* call different api for batchedgemm

* updt api

* use naive loop

* save work

* save work

* updt impl

* remove useless code

* replace naive loop with cublasgemmbatched

* feat: add ScalarAddOp and ScalarMulOp (#1541)

* Dev nc scalar (#1543)

* feat: add ScalarAddOp and ScalarMulOp

* feat: add ScalarAddKernel and ScalarMulKernel

* fix: ScalarAddOp/ScalarMulOp do not inherit from CWiseOp

* fix: fix code style

* fix: fix typo of include file in scalar_add_op/scalar_mul_op

* fix(scalar_mul_kernel): register ScalarMulKernel

* fix: add MulbyScalarPara() and use it instead of cublas_scal in ScalarMulKernel

* fix(scalar_mul_kernel): fix typo

* Dev nc testtrans (#1540)

* feat: update trans kernel

* InitGlobalCudaDeviceProp

* in_blob and out_blob are unnecessary for bw kernel of variable_op and constant_op

* Transpose: the shape elem_cnt of x must not exceed 2^32

* remove LabelType (#1545)

* rm ndarray_reduce_core.*

* Dev identity loss (#1547)

* identity_loss

* loss op

* CalcLossInstanceNum

* mem shared for mdupdt first in regst and md diff add regst (#1546)

* remove useless code (#1548)

* Dev sparse cross entropy (#1550)

* op for sparse cross entropy

* modify op_conf for sparse cross entropy

* sparse cross entropy kernel

* op

* SparseCrossEntropyKernelUtil

* refine

* refine shape check (#1552)

* refactoring reduce sum (#1554)

* refactoring reduce sum

* also use shape and dptr when bw

* add resize when keepdims

* address reviews

* move functions to Anonymous namespace

* address reviews

* remove auto

* replace find

* rename keepdims

* only enable nccl on gpu

* fix diff add regst size in MdUpdt task node to be the same as in regst (#1556)

* mem shared for mdupdt first in regst and md diff add regst

* fix diff add regst size in MdUpdt task node to be the same as in regst

* minor fix

* special case when it is a loss op

* Dev loss instance num (#1544)

* loss instance number

* set_has_loss_instance_num_field

* loss

* in_diff

* LossOpFixInDiffHasLossInstanceNum

* remove need_do_loss_instance_num

* move to FixInDiffBlobDescs

* remove

* loss_instance_num use float

* refine

* Boxing ForwardLossInstance

* fix

* fix loss

* fix

* refine

* fix

* refine

* refine

* impl reduce mean

* Dev all reduce ctrl edge (#1558)

* mem shared for mdupdt first in regst and md diff add regst

* feat: add ReduceInplaceIdentity LogicalNode/TaskNode/Op/Kernel

* nccl reduce ctrl edge

* MayConsumeModelDiff

* fix diff add regst size in MdUpdt task node as same as in regst

* eager_reduce_ratio

* mem sharing for ReduceIdentity

* ReduceInplaceIdentity => ReduceIdentity

* reduce ctrl edge supports for arbitrary placement

* refine ChainLogicalGraph::IsLogicalNodeMergeable

* model name (#1561)

* Dev gather refine (#1517)

* gather op index support all int type and axis

* out=in

* reformat

* negative axis

* LookupKernel=>GatherKernel

* reformat

* refine

* axis

* refine & bugfix

* remove ConstScalar and ConstRange (#1526)

* Refine range initializer (#1523)

* support axis

* refine naming

* fix before_dim_size

* doc

* refine

* refine naming

* refine naming

* VariableLogicalNode

* identity (#1563)

* total_instance_num use naive mdupdt (#1564)

* patch by hand from faster_rcnn

* revert LogicalVariableOp

* Dev clone boxing (#1566)

* identity

* reduce clone boxing

* Dev clone boxing (#1568)

* identity

* reduce clone boxing

* tuple identity

* Dev tick (#1571)

* feat: add Tick LogicalNode/TaskNode/Op/Kernel

* feat: remove Tick LogicalNode/TaskNode

* feat: add BldSubTskGphByTickToSource for TickOp

* refine: refine due to comment

* feat: add BldSubTskGphByRecordLoadToTick

* pr tick op/kernel alone

* feat: add TickOp and BldSubTskGphByTickToSource  (#1565)

* feat: add Tick LogicalNode/TaskNode/Op/Kernel

* feat: remove Tick LogicalNode/TaskNode

* feat: add BldSubTskGphByTickToSource for TickOp

* refine: refine due to comment

* feat: add BldSubTskGphByRecordLoadToTick

* refine: refine due to comment

* refine: due to comment

* refine: remove BldSubTskGphByRecordLoadToTick

* fix tick op in dlnet (#1572)

* Dev clip by global norm (#1521)

* clip_by_global_norm

* update

* refine model_update op

* remove useless code

* fix name

* rename clip_norm

* remove useless code

* force init memory and add CHECK()

* remove useless code and add comment

* fixbug

* refine code

* Dev bert profile (#1573)

* 1) refactor reduce_group; 2) add new stream kReduceCtrl

* 1) allreduce and model_update overlapping; 2) allreduce and fw overlapping

* add mdupdt ctrl edges within reduce group (#1575)

* Dev group all reduce by model bytes (#1577)

* group all reduce by model byte size

* mv OpGraph into a separate file op_graph.h

* gelu (#1578)

* Dev bert layer norm (#1574)

* layer norm

* layer_norm

* fix trainable

* fix

* fix trainable

* refine

* Dev bert cuda event sync (#1581)

* cudaSetDevice in actor poller threads

* ReduceConcatCompActor ; NaiveActor

* set dev id (#1583)

* Dev bert profiling (#1586)

* profiling

* all_reduce_* option for performance optimization

* fix a mem sharing bug (#1590)

* Fix mem sharing bug (#1593)

* fix a mem sharing bug

* refine by review

* remove previous if condition

* refine

* Dev profiling adam (#1592)

* profiling

* all_reduce_* option for performance optimization

* faster adam kernel

* Dev refine transpose (#1594)

* profiling

* all_reduce_* option for performance optimization

* faster adam kernel

* refine dropout and transpose

* loss print duration (#1598)

* pseudo chains of OpGraph

* ConvertPseudoChainToChain

* refine pseudo_chain

* refine register coloring algorithm

* rename op_graph log file name

* remove unused code

* Dev bigger chain (#1601)

* pseudo chains of OpGraph

* ConvertPseudoChainToChain

* refine pseudo_chain

* refine register coloring algorithm

* rename op_graph log file name

* remove unused code

* chore: add -gencode in CMakeLists.txt (#1603)

* EnableMemSharingInVariableOp

* no mem_sharing for out_diff & model_diff in variable_op

* Dev mem sharing for variable op (#1604)

* pseudo chains of OpGraph

* ConvertPseudoChainToChain

* refine pseudo_chain

* refine register coloring algorithm

* rename op_graph log file name

* remove unused code

* EnableMemSharingInVariableOp

* no mem_sharing for out_diff & model_diff in variable_op

* refine code

* Fix jxf reduce concat bug (#1606)

* refine logic to infer reduce_concat_op's out blob elem_cnt; still has bugs...

* add RoundUp in reduce_concat

* CHECK_LE -> CHECK_EQ

* add CHECK

* Dev random shuffle (#1607)

* random shuffle

* fix

* refine

* refine

* single thread

* refine

* cmake add half (#1609)

* Bugfix no tick diff (#1614)

* group by has_diff

* rm unnecessary identity

* share model_diff and out_diff in variable op (#1616)

* share model_diff and out_diff in variable op

* bugfix: model_diff is a produced register

* register_num of model_diff is 1

* add VariableKernelConf

* no mutable

* bugfix

* bugfix: set ctrl_regst's return_regst_num (#1617)

* Register coloring with strategies (#1613)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* refine CHECK in AllReduce (#1618)

* refine CHECK in AllReduce

* move ReduceConcatOpCtx definition to .cpp file

* fix fw_consumer nullptr (#1622)

* faster improver (#1628)

* multithreads register coloring (#1630)

* multithreads register coloring

* refine code

* Dev bert accuracy with weight (#1632)

* accuracy

* accuracy_task_node add fw_buf

* fw_buf=>data_tmp

* Dev logical blob dim0 (#1625)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* replace piece_size with logical_blob_dim0

* BlobParallelConf

* BlobParallelDesc

* infer out blob model_split_axis

* int64_t => int32_t

* InferOutBlobParallelDesc

* gather out blob model split (#1624)

* InferBlobParallelDesc

* let variable op support kModelParallel

* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_

* Global<OpGraph>

* SplitLogicalInputBlobDesc

* ConcatOutputBlobDescs

* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel

* OpGraph::CheckBlobDescs(...)

* exact division is unnecessary

* fix bugs

* rename InferOutBlob* => InferOutputBlob

* exact division in variable_op is unnecessary

* bug fix

* fix bugs

* fix bugs

* IsInputBlobAllowedModelSplit

* use Global<OpGraph> to InferModelSize

* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter

* fix IdentityOp::IsInputBlobAllowedModelSplit

* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit

* refine BlobParallelDesc: replace CopyParallelConf with operator=

* refine ParallelDesc: remove unused functions

* more checks on ParallelDesc

* Dev logical blob dim0 (#1635)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* replace piece_size with logical_blob_dim0

* BlobParallelConf

* BlobParallelDesc

* infer out blob model_split_axis

* int64_t => int32_t

* InferOutBlobParallelDesc

* gather out blob model split (#1624)

* InferBlobParallelDesc

* let variable op support kModelParallel

* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_

* Global<OpGraph>

* SplitLogicalInputBlobDesc

* ConcatOutputBlobDescs

* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel

* OpGraph::CheckBlobDescs(...)

* exact division is unnecessary

* fix bugs

* rename InferOutBlob* => InferOutputBlob

* exact division in variable_op is unnecessary

* bug fix

* fix bugs

* fix bugs

* IsInputBlobAllowedModelSplit

* use Global<OpGraph> to InferModelSize

* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter

* fix IdentityOp::IsInputBlobAllowedModelSplit

* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit

* refine BlobParallelDesc: replace CopyParallelConf with operator=

* refine ParallelDesc: remove unused functions

* more checks on ParallelDesc

* remove unused function Operator::MaxModelSplitNum

* bugfix: SoleOp() => op_vec().at(0)

* Dev global op graph (#1636)

* Global<OpGraph> is only available during compilation

* small record_piece_size for InferNoParallelBlobDesc

* Dev op graph piece size (#1637)

* fix a bug in OpGraph::InferNoParallelBlobDesc

* fix a bug in OpGraph::InferNoParallelBlobDesc

* DfsTopoForEachNodeSortByDistanceToSink (#1638)

* Dev jxf bert top k (#1633)

* top_k

* dev top_k op

* refine

* fix bug

* refactor top_k op, cooperate with gather op to get values now

* customized TOPK_KERNEL_ENTRY in auto factory

* batch gather op

* refine

* Backup: batch_gather op, pass compile

* fix bugs, pass the test

* fix missing newline at the end of file

* const

* refine by review

* fix bugs

* rename: instance_dim -> instance_size

* remove a blank line

* refine coding style by Juncheng's suggestions, Bravo

* refine top_k

* more refine

* compatible with new model parallel

* refine

* rename

* cpu only in top_k

* Dev model boxing (#1639)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* replace piece_size with logical_blob_dim0

* BlobParallelConf

* BlobParallelDesc

* infer out blob model_split_axis

* int64_t => int32_t

* InferOutBlobParallelDesc

* gather out blob model split (#1624)

* InferBlobParallelDesc

* let variable op support kModelParallel

* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_

* Global<OpGraph>

* SplitLogicalInputBlobDesc

* ConcatOutputBlobDescs

* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel

* OpGraph::CheckBlobDescs(...)

* exact division is unnecessary

* fix bugs

* rename InferOutBlob* => InferOutputBlob

* exact division in variable_op is unnecessary

* bug fix

* fix bugs

* fix bugs

* IsInputBlobAllowedModelSplit

* use Global<OpGraph> to InferModelSize

* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter

* fix IdentityOp::IsInputBlobAllowedModelSplit

* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit

* refine BlobParallelDesc: replace CopyParallelConf with operator=

* refine ParallelDesc: remove unused functions

* more checks on ParallelDesc

* remove unused function Operator::MaxModelSplitNum

* BlobParallelDesc::EquivalentTo

* LogicalNode::main_model_parallel_ is out of date

* refine Operator: replace IsElemWiseOp with IsSoleInputBlobAllowedModelSplit

* refine transpose conf

* fix a bug in Operator::FixParallelDesc

* InferInputBlobModelSplitAxis

* BlobParallelType

* more default behaviors for Operator::InferInputOutputBlobParallelType

* op_parallel_signature

* rename: BlobParallelType => LogicalBlobParallelDesc

* OpGraph::InferLogicalBlobParallelDesc

* refactor SplitLogicalInputBlobDesc by LogicalBlobParallelDesc

* refine OpNode::ConcatBlobDesc By LogicalBlobParallelDesc

* OpNode::lbi2model_split_axis_

* OpGraph::GetBalancedSplitter

* replace OpGraph::GetBlobParallelDesc4Lbi with OpGraph::GetLbpd4Lbi

* rm BlobParallelDesc in OpGraph

* VariableOp::InitOpParallelSignatures

* rm BlobParallelDesc

* rename Make*ParallelSignature functions

* MakeOpParallelSignature_DS_MC_2_DS

* MakeOpParallelSignature_DC_MS_2_MS

* BiasAddOp::InitOpParallelSignatures

* refine MakeOpParallelSignature_DC_MS_2_MS

* MatmulOp::InitOpParallelSignatures

* GatherOp::InitOpParallelSignatures

* bugfix: model_split_axis cannot equal -1 when parallel_policy is kModelParallel

* refactor: bn2parallel_id2blob_desc => lbi2parallel_id2blob_desc

* refine OpNode

* LogicalBlobParallelConf

* LogicalBlobParallelDesc::DualLbpd

* 1) merge dev_bert;
2) placement.proto not used in logical_blob_parallel_conf.proto

* bugfix: 1) remove CHECK(has_model) in Operator::NaiveInitOpParallelSignatures; 2) lbpd->set_parallel_num(val)

* fix bugs in GatherOp::InitOpParallelSignatures and BroadcastBinaryOp::InitOpParallelSignatures

* refactor: InitOpParallelSignatures => GetOpParallelSignatures

* refactor: const OpParallelSignature => std::unique_ptr<const OpParallelSignature>

* rm LogicalBlobParallelConf

* refactor: ModelSplitAxis4BnInOp => LbpdHint4BnInOp

* fix bugs about LbpdHint

* simplify the interface of InferInputOutputBlobLogicalBlobParallelDescIf

* rename Class CloneParallel => BroadcastParallel

* rename field: clone_parallel => broadcast_parallel

* refactor LbpdHint by SbpParallel

* InferIsModelBlob4OutputBlobsIf

* remove field LogicalBlobParallelDesc::parallel_num

* rename: LogicalBlobParallelDesc => SbpParallel

* rename: LbpdHint => SbpInferHint

* simplify interface Operator::InferOutputBlobSbpInferHint

* rename api: Operator::InferBlobSbpInferHintIf => Operator::InferOuputBlobsSbpInferHintIf

* OpGraph::InferIsModelBlob

* rename file: logical_blob_parallel_desc.* => sbp_parallel.*

* rename filename: lbpd_hint* => sbp_infer_hint*

* rename field: SbpInferHint::has_data_split => SbpInferHint::is_data_split

* rename fields: SbpInferHint::is_data_split, is_model_split, is_data_partial_sum, is_model_broadcast

* refactor SbpInferHint::split_axis

* LambdaOpParallelSignature

* replace function MakeVariableOpDataSplitOpParallelSignature with class VariableOpDataSplitOpParallelSignature

* replace function MakeVariableOpModelSplitOpParallelSignature with class VariableOpModelSplitOpParallelSignature

* BroadcastBinaryOpParallelSignature

* Matmul_DMS_MS_2_P_OpParallelSignature

* Gather_DC_MS_2_P_OpParallelSignature

* class DataSplitOpParallelSignature

* class ModelBroadcastOpParallelSignature

* class DS_MC_2_DS_OpParallelSignature

* add field OpParallelSignature::op_

* refactor: ModelSplitAxis => OutputBlobModelSplitAxis

* remove Operator::InferOuputBlobsSbpInferHintIf

* implement MatmulOp::OutputBlobModelSplitAxis

* implement GatherOp::OutputBlobModelSplitAxis

* implement TransposeOp::OutputBlobModelSplitAxis and BiasAddOp::OutputBlobModelSplitAxis

* add method OpGraph::IsDataBlob

* refactor OpGraph::InferSbpParallel

* refactor class SbpInferHint

* rename local variable: SbpInferHint4BnInOp => SbpInferHint4Ibn

* refactor MakeModelSplitOpParallelSignature

* refactor Make_DC_MS_2_MS_OpParallelSignature

* remove unused class LambdaOpParallelSignature; refactor class name '*Clone*' => '*Broadcast*'

* bugfix: Operator::OutputBlobModelSplitAxis for sole-ibn op

* fix bugs in SbpInferHint::has_split_axis(), SbpInferHint::split_axis and OpNode::IsModelBlob4Lbi

* refactor class SbpInferHint: replace split_axis_ with sbp_parallel_

* refactor by SbpInferHint::sbp_parallel

* 1) rename OpNode data member; 2) rm unused proto

* fix clone (#1641)

* OpGraph::GetBlobDataType (#1643)

* OpGraph::GetBlobDataType

* refine OpGraph::GetBlobDataType

* IdentityOp => TupleIdentityOp (#1644)

* Dev sbp parallel cast (#1646)

* add SbpParallelCastOp

* only SplitParallel and BroadcastParallel can be user customized

* rename: SbpParallelCastOp => ParallelCastOp

* build boxing_conf by sbp_parallel

* fix a bug in BroadcastBinaryOpParallelSignature

* support broadcast_parallel for sole-ibn op

* 1) build boxing_op_conf by sbp_parallel for tuple_identity_op;
2) no op parallel desc fix for kModelParallel;
3) fix a bug in TaskGraph::EnableMemSharingInVariableOp
4) add TupleIdentityOpParallelSignature

* fix bug in IsModelParallel121 (#1648)

* merge develop

* merge develop (#1649)
parent 8009a404
Showing 116 additions and 56 deletions
@@ -40,6 +40,11 @@ if (WIN32)
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /D_ITERATOR_DEBUG_LEVEL=0")
else()
list(APPEND CUDA_NVCC_FLAGS -std=c++11 -w -Wno-deprecated-gpu-targets)
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_30,code=\"sm_30,compute_30\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_52,code=\"sm_52,compute_52\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_60,code=\"sm_60,compute_60\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_61,code=\"sm_61,compute_61\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_70,code=\"sm_70,compute_70\")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -Wall -Wno-sign-compare -Wno-unused-function")
if (RELEASE_VERSION)
list(APPEND CUDA_NVCC_FLAGS -O3)
......
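
Each -gencode line above embeds both native SASS for that sm_XX target and matching PTX, so the resulting binary runs natively on compute capability 3.0/5.2/6.0/6.1/7.0 GPUs and can JIT from PTX elsewhere. As a quick sanity check, a standalone sketch (not part of this diff) that reports which capability each visible device actually has:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device_count = 0;
  if (cudaGetDeviceCount(&device_count) != cudaSuccess) {
    std::printf("no usable CUDA device\n");
    return 0;
  }
  for (int i = 0; i < device_count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    // e.g. "compute capability 6.1" should be covered by the -gencode list above
    std::printf("device %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
  }
  return 0;
}
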
# main cpp
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/oneflow/core/job/oneflow.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/oneflow/core/ndarray/ndarray_reduce_test.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/tools/gen_resnet.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/tools/gen_alexnet.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/tools/gen_googlenet.cpp)
......
@@ -12,6 +12,7 @@ include(libjpeg-turbo)
include(opencv)
include(eigen)
include(cocoapi)
include(half)
if (BUILD_CUDA)
set(CUDA_SEPARABLE_COMPILATION ON)
@@ -90,6 +91,7 @@ set(oneflow_third_party_dependencies
eigen
cocoapi_copy_headers_to_destination
cocoapi_copy_libs_to_destination
half_copy_headers_to_destination
)
include_directories(
@@ -104,6 +106,7 @@ include_directories(
${OPENCV_INCLUDE_DIR}
${EIGEN_INCLUDE_DIR}
${COCOAPI_INCLUDE_DIR}
${HALF_INCLUDE_DIR}
)
if (BUILD_CUDA)
@@ -124,3 +127,5 @@ if (BUILD_CUDA)
${NCCL_INCLUDE_DIR}
)
endif()
add_definitions(-DHALF_ENABLE_CPP11_USER_LITERALS=0)
include (ExternalProject)
set(HALF_INCLUDE_DIR ${THIRD_PARTY_DIR}/half/include)
set(HALF_URL https://cfhcable.dl.sourceforge.net/project/half/half/1.12.0/half-1.12.0.zip)
set(HALF_BASE_DIR ${CMAKE_CURRENT_BINARY_DIR}/half/src/half)
set(HALF_HEADERS
"${HALF_BASE_DIR}/include/half.hpp"
)
if(BUILD_THIRD_PARTY)
ExternalProject_Add(half
PREFIX half
URL ${HALF_URL}
UPDATE_COMMAND ""
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
BUILD_IN_SOURCE 1
INSTALL_COMMAND ""
)
add_custom_target(half_create_header_dir
COMMAND ${CMAKE_COMMAND} -E make_directory ${HALF_INCLUDE_DIR}
DEPENDS half)
add_custom_target(half_copy_headers_to_destination
DEPENDS half_create_header_dir)
foreach(header_file ${HALF_HEADERS})
add_custom_command(TARGET half_copy_headers_to_destination PRE_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_if_different ${header_file} ${HALF_INCLUDE_DIR})
endforeach()
endif(BUILD_THIRD_PARTY)
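
The half library fetched above is header-only and provides half_float::half. A minimal usage sketch (not part of this commit); note that the add_definitions(-DHALF_ENABLE_CPP11_USER_LITERALS=0) line earlier disables the library's _h literal, so values are constructed from float explicitly:

#include <iostream>
#include "half.hpp"

int main() {
  half_float::half a(1.5f);
  half_float::half b(2.25f);
  half_float::half c = a + b;  // 16-bit storage; arithmetic goes through float
  std::cout << static_cast<float>(c) << std::endl;  // prints 3.75
  return 0;
}
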
@@ -193,8 +193,11 @@ int Actor::HandlerNormal(const ActorMsg& msg) {
Regst* regst = msg.regst();
if (naive_consumed_rs_.HasRegstDescId(regst->regst_desc_id())) {
CHECK_EQ(0, naive_consumed_rs_.TryPushBackRegst(regst));
NormalProcessNaiveReadableRegstMsg(
naive_consumed_rs_.RegstDeq4RegstDescId(regst->regst_desc_id()));
const auto& rdeq = naive_consumed_rs_.RegstDeq4RegstDescId(regst->regst_desc_id());
CHECK(rdeq.empty() == false);
if (rdeq.front()->regst_desc()->regst_desc_type().has_data_regst_desc()) {
NormalProcessNaiveReadableDataRegstMsg(rdeq);
}
} else if (TryUpdtStateAsProducedRegst(regst) == 0) {
// do nothing
} else {
@@ -325,8 +328,9 @@ void Actor::AsyncSendConsumedCtrlRegstMsgToProducer() {
CHECK_GE(reg_deq.size(), returned_regst_num);
for (size_t i = 0; i < returned_regst_num; ++i) {
Regst* regst = reg_deq.at(i);
AsyncSendMsg(ActorMsg::BuildRegstMsgToProducer(actor_id_, regst->producer_actor_id(), regst));
// must access regst before sending it to producer
regst_desc_ids.push_back(regst->regst_desc_id());
AsyncSendMsg(ActorMsg::BuildRegstMsgToProducer(actor_id_, regst->producer_actor_id(), regst));
}
});
naive_consumed_rs_.PopFrontRegsts(regst_desc_ids);
@@ -452,8 +456,10 @@ void Actor::AsyncSendRegstMsgToProducer(Regst* regst) {
}
void Actor::AsyncSendRegstMsgToProducer(Regst* regst, int64_t producer) {
// must access regst before sending it to producer
int64_t regst_desc_id = regst->regst_desc_id();
AsyncSendMsg(ActorMsg::BuildRegstMsgToProducer(actor_id_, producer, regst));
naive_consumed_rs_.TryPopFrontRegst(regst->regst_desc_id());
naive_consumed_rs_.TryPopFrontRegst(regst_desc_id);
}
Regst* Actor::GetSoleProducedRegst4RegstDescId(int64_t regst_desc_id) {
......
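
The two "must access regst before sending" hunks above enforce one rule: once a regst message is handed back to its producer, the producer may recycle the register concurrently, so any field still needed (here the regst_desc_id) must be read before AsyncSendMsg. A minimal sketch of the pattern, with hypothetical stand-in types:

#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for Regst and message sending, only to show the ordering.
struct Regst { int64_t regst_desc_id; };

// After this returns, the producer side may rewrite the regst at any time.
void AsyncSendToProducer(Regst* regst) { regst->regst_desc_id = -1; }

int64_t ReturnRegst(Regst* regst) {
  int64_t regst_desc_id = regst->regst_desc_id;  // copy out what is still needed
  AsyncSendToProducer(regst);                    // ownership is gone from here on
  return regst_desc_id;                          // use the saved copy, never *regst
}

int main() {
  Regst r{42};
  std::printf("%lld\n", static_cast<long long>(ReturnRegst(&r)));  // prints 42, not -1
  return 0;
}
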
@@ -128,7 +128,7 @@ class Actor {
// Process Msg
virtual void NormalProcessCustomizedEordMsg(const ActorMsg&) {}
virtual void NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>&) {}
virtual void NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>&) {}
virtual void NormalProcessCustomizedReadableRegstMsg(const ActorMsg&) { UNIMPLEMENTED(); }
virtual bool NormalTryProcessReadableMsgFromOtherMachine(const ActorMsg&) { return false; }
int TryUpdtStateAsProducedRegst(Regst* regst);
......
@@ -9,7 +9,7 @@ void BoxingActor::VirtualActorInit(const TaskProto& task_proto) {
OF_SET_MSG_HANDLER(&BoxingActor::HandlerNormal);
}
void BoxingActor::NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>& rq) {
void BoxingActor::NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>& rq) {
if (rq.back()->packed_blob()->max_col_num() > 1 && col_id_order_ == ColIdOrder::kUnCertain) {
TrySetColIdOrder(rq.back());
}
......
@@ -14,7 +14,7 @@ class BoxingActor final : public Actor {
void VirtualActorInit(const TaskProto&) override;
private:
void NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>&) override;
void NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>&) override;
void Act() override;
void VirtualAsyncSendNaiveProducedRegstMsgToConsumer();
void VirtualAsyncSendNaiveConsumedRegstMsgToProducer();
......
@@ -9,9 +9,13 @@ class LossPrintCompActor final : public SinkCompActor {
public:
OF_DISALLOW_COPY_AND_MOVE(LossPrintCompActor);
LossPrintCompActor() = default;
~LossPrintCompActor() = default;
~LossPrintCompActor() override = default;
private:
void VirtualSinkCompActorInit(const TaskProto&) override { timestamp_ = 0; }
void* NewOther() override { return &timestamp_; }
double timestamp_ = 0;
};
} // namespace oneflow
......
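
The new timestamp_ member backs the "loss print duration" change from the log above; NewOther() hands its address to the kernel. The real Act()/kernel code is not in this diff, so the following is only a rough sketch of the idea, reporting the wall-clock gap between consecutive loss prints:

#include <chrono>
#include <cstdio>

struct LossPrintSketch {
  double timestamp_ = 0;

  void Print(double loss) {
    using namespace std::chrono;
    double now = duration<double>(steady_clock::now().time_since_epoch()).count();
    if (timestamp_ > 0) {
      std::printf("loss: %f (%.3f s since last print)\n", loss, now - timestamp_);
    } else {
      std::printf("loss: %f\n", loss);
    }
    timestamp_ = now;
  }
};

int main() {
  LossPrintSketch printer;
  printer.Print(0.9);
  printer.Print(0.7);
  return 0;
}
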
@@ -12,4 +12,6 @@ void NaiveActor::VirtualAsyncSendNaiveProducedRegstMsgToConsumer() {
});
}
REGISTER_ACTOR(TaskType::kReduceIdentity, NaiveActor);
} // namespace oneflow
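
The REGISTER_ACTOR line lets kReduceIdentity tasks reuse NaiveActor directly. Registration macros of this kind usually expand to a static registrar object that maps a task type to an actor factory; a self-contained sketch (all names hypothetical, not OneFlow's actual implementation):

#include <functional>
#include <memory>
#include <unordered_map>

struct Actor { virtual ~Actor() = default; };  // hypothetical base class

using ActorCreator = std::function<std::unique_ptr<Actor>()>;

inline std::unordered_map<int, ActorCreator>& ActorRegistry() {
  static std::unordered_map<int, ActorCreator> registry;
  return registry;
}

struct ActorRegistrar {
  ActorRegistrar(int task_type, ActorCreator creator) {
    ActorRegistry()[task_type] = std::move(creator);
  }
};

#define ACTOR_CAT_IMPL(a, b) a##b
#define ACTOR_CAT(a, b) ACTOR_CAT_IMPL(a, b)
// One actor class may serve several task types, so the registrar object's
// name is keyed on __COUNTER__ rather than on the class name.
#define REGISTER_ACTOR_SKETCH(task_type, ActorType)                 \
  static ActorRegistrar ACTOR_CAT(g_actor_registrar_, __COUNTER__)( \
      (task_type), [] { return std::make_unique<ActorType>(); })
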
@@ -33,7 +33,7 @@ void NormalBackwardCompActor::ForEachCurCustomizedReadableRegst(
}
}
void NormalBackwardCompActor::NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>& rq) {
void NormalBackwardCompActor::NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>& rq) {
if (rq.size() == 1 && rq.front()->regst_desc_id() == any_out_diff_regst_desc_id_) {
AsyncReturnModelRegstUntilModelVersionIdEqual(
GetModelVersionIdFromPieceId(rq.front()->piece_id(), actual_num_of_piece_in_batch_));
......
@@ -15,7 +15,7 @@ class NormalBackwardCompActor final : public CompActor {
private:
void ForEachCurCustomizedReadableRegst(std::function<void(const Regst*)>) const override;
void NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>&) override;
void NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>&) override;
void NormalProcessCustomizedReadableRegstMsg(const ActorMsg&) override;
void Act() override;
bool IsCustomizedReadReady() override;
......
@@ -2,16 +2,6 @@
namespace oneflow {
void ReduceConcatCompActor::VirtualCompActorInit(const TaskProto& proto) {
InputWiseCompActor::Init(proto);
}
void ReduceConcatCompActor::SetKernelCtxOther(void** other) {
int64_t in_bn_id = InBnId4RegstDescId(cur_processed_regst_desc_id());
other_val_ = std::make_pair(in_bn_id, EnableInplace());
*other = static_cast<void*>(&other_val_);
}
REGISTER_ACTOR(TaskType::kReduceConcat, ReduceConcatCompActor);
} // namespace oneflow
#ifndef ONEFLOW_CORE_ACTOR_REDUCE_CONCAT_COMPUTE_ACTOR_H_
#define ONEFLOW_CORE_ACTOR_REDUCE_CONCAT_COMPUTE_ACTOR_H_
#include "oneflow/core/actor/input_wise_compute_actor.h"
#include "oneflow/core/actor/naive_actor.h"
namespace oneflow {
class ReduceConcatCompActor final : public InputWiseCompActor {
class ReduceConcatCompActor final : public NaiveActor {
public:
OF_DISALLOW_COPY_AND_MOVE(ReduceConcatCompActor);
ReduceConcatCompActor() = default;
~ReduceConcatCompActor() = default;
private:
void VirtualCompActorInit(const TaskProto& proto) override;
void SetKernelCtxOther(void** other) override;
std::pair<int64_t, bool> other_val_;
};
} // namespace oneflow
......
@@ -2,15 +2,6 @@
namespace oneflow {
void ReduceSplitCompActor::VirtualCompActorInit(const TaskProto& proto) {
InputWiseCompActor::Init(proto);
}
void ReduceSplitCompActor::SetKernelCtxOther(void** other) {
other_val_ = EnableInplace();
*other = static_cast<void*>(&other_val_);
}
REGISTER_ACTOR(TaskType::kReduceSplit, ReduceSplitCompActor);
} // namespace oneflow
#ifndef ONEFLOW_CORE_ACTOR_REDUCE_SPLIT_COMPUTE_ACTOR_H_
#define ONEFLOW_CORE_ACTOR_REDUCE_SPLIT_COMPUTE_ACTOR_H_
#include "oneflow/core/actor/input_wise_compute_actor.h"
#include "oneflow/core/actor/naive_actor.h"
namespace oneflow {
class ReduceSplitCompActor final : public InputWiseCompActor {
class ReduceSplitCompActor final : public NaiveActor {
public:
OF_DISALLOW_COPY_AND_MOVE(ReduceSplitCompActor);
ReduceSplitCompActor() = default;
~ReduceSplitCompActor() = default;
private:
void VirtualCompActorInit(const TaskProto& proto) override;
void SetKernelCtxOther(void** other) override;
bool other_val_;
};
} // namespace oneflow
......
@@ -114,13 +114,13 @@ void EpollCommNet::InitSockets() {
((this_machine.data_port_agent() != -1) ? (this_machine.data_port_agent())
: (this_listen_port)));
} else {
for (this_listen_port = 1024; this_listen_port < MaxVal<uint16_t>(); ++this_listen_port) {
for (this_listen_port = 1024; this_listen_port < GetMaxVal<uint16_t>(); ++this_listen_port) {
if (SockListen(listen_sockfd, this_listen_port, total_machine_num) == 0) {
PushPort(this_machine_id, this_listen_port);
break;
}
}
CHECK_LT(this_listen_port, MaxVal<uint16_t>());
CHECK_LT(this_listen_port, GetMaxVal<uint16_t>());
}
int32_t src_machine_count = 0;
......
@@ -65,7 +65,7 @@ void IBVerbsQP::Connect(const IBVerbsConnectionInfo& peer_info) {
qp_attr.ah_attr.grh.dgid.global.interface_id = peer_info.interface_id();
qp_attr.ah_attr.grh.flow_label = 0;
qp_attr.ah_attr.grh.sgid_index = 0;
qp_attr.ah_attr.grh.hop_limit = MaxVal<uint8_t>();
qp_attr.ah_attr.grh.hop_limit = GetMaxVal<uint8_t>();
qp_attr.ah_attr.dlid = peer_info.lid();
qp_attr.ah_attr.sl = 0;
qp_attr.ah_attr.src_path_bits = 0;
......
@@ -8,14 +8,16 @@
namespace oneflow {
#define BLAS_NAME_SEQ \
OF_PP_MAKE_TUPLE_SEQ(dot) \
OF_PP_MAKE_TUPLE_SEQ(swap) \
OF_PP_MAKE_TUPLE_SEQ(copy) \
OF_PP_MAKE_TUPLE_SEQ(axpy) \
OF_PP_MAKE_TUPLE_SEQ(scal) \
OF_PP_MAKE_TUPLE_SEQ(gemv) \
OF_PP_MAKE_TUPLE_SEQ(gemm)
#define BLAS_NAME_SEQ \
OF_PP_MAKE_TUPLE_SEQ(dot) \
OF_PP_MAKE_TUPLE_SEQ(swap) \
OF_PP_MAKE_TUPLE_SEQ(copy) \
OF_PP_MAKE_TUPLE_SEQ(axpy) \
OF_PP_MAKE_TUPLE_SEQ(scal) \
OF_PP_MAKE_TUPLE_SEQ(gemv) \
OF_PP_MAKE_TUPLE_SEQ(gemm) \
OF_PP_MAKE_TUPLE_SEQ(gemmBatched) \
OF_PP_MAKE_TUPLE_SEQ(gemmStridedBatched)
#define CBLAS_TEMPLATE(name) \
template<typename T, typename... Args> \
......
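
CBLAS_TEMPLATE (its body is truncated above) stamps out one type-dispatching wrapper per name in BLAS_NAME_SEQ; gemmBatched and gemmStridedBatched are appended here for the batched matmul kernels mentioned in the log. A hand-expanded sketch of the dispatch pattern for a single name, assuming standard CBLAS rather than the exact generated shape:

#include <cblas.h>

// What a per-name dispatch macro effectively produces for "axpy": the element
// type selects between the s/d CBLAS entry points.
template<typename T>
struct CblasAxpy;

template<>
struct CblasAxpy<float> {
  static void Call(int n, float alpha, const float* x, int incx, float* y, int incy) {
    cblas_saxpy(n, alpha, x, incx, y, incy);
  }
};

template<>
struct CblasAxpy<double> {
  static void Call(int n, double alpha, const double* x, int incx, double* y, int incy) {
    cblas_daxpy(n, alpha, x, incx, y, incy);
  }
};

int main() {
  float x[3] = {1.f, 2.f, 3.f};
  float y[3] = {0.f, 0.f, 0.f};
  CblasAxpy<float>::Call(3, 2.0f, x, 1, y, 1);  // y = 2*x -> {2, 4, 6}
  return 0;
}
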
@@ -95,6 +95,37 @@ TRAIT_CONST_VAR(One, 1);
#undef TRAIT_CONST_VAR
template<typename T>
struct MaxVal;
template<typename T>
struct MinVal;
#define TRAIT_LIMIT_VAL(max_or_min, T, limit_value) \
template<> \
struct max_or_min##Val<T> final { \
static_assert(alignof(int) == alignof(int32_t), "int32_t should be exactly int"); \
static_assert(alignof(long long) == alignof(int64_t), "int64_t should be exactly long long"); \
constexpr static T value = limit_value; \
}
TRAIT_LIMIT_VAL(Max, int8_t, CHAR_MAX);
TRAIT_LIMIT_VAL(Max, int32_t, INT_MAX);
TRAIT_LIMIT_VAL(Max, uint32_t, UINT_MAX);
TRAIT_LIMIT_VAL(Max, int64_t, LLONG_MAX);
TRAIT_LIMIT_VAL(Max, uint64_t, ULLONG_MAX);
TRAIT_LIMIT_VAL(Max, float, FLT_MAX);
TRAIT_LIMIT_VAL(Max, double, DBL_MAX);
TRAIT_LIMIT_VAL(Min, int8_t, CHAR_MIN);
TRAIT_LIMIT_VAL(Min, int32_t, INT_MIN);
TRAIT_LIMIT_VAL(Min, uint32_t, 0);
TRAIT_LIMIT_VAL(Min, int64_t, LLONG_MIN);
TRAIT_LIMIT_VAL(Min, uint64_t, 0);
TRAIT_LIMIT_VAL(Min, float, -FLT_MAX);
TRAIT_LIMIT_VAL(Min, double, -DBL_MAX);
#undef TRAIT_LIMIT_VAL
// Func
bool IsIntegralDataType(DataType data_type);
......
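
Per the log, this commit renames the old MaxVal<T>() function to GetMaxVal<T>() and the MaxValue<T>::value trait to MaxVal<T>::value (see the EpollCommNet and IBVerbsQP hunks above). A minimal sketch of how the two fit together; GetMaxVal's real definition is outside this diff, so the helper below is an assumption:

#include <climits>
#include <cstdint>
#include <cstdio>

template<typename T>
struct MaxVal;  // specialized per type, as in the TRAIT_LIMIT_VAL block above

template<>
struct MaxVal<int32_t> {
  constexpr static int32_t value = INT_MAX;
};

// Assumed shape of the renamed helper: a constexpr function over the trait.
template<typename T>
constexpr T GetMaxVal() {
  return MaxVal<T>::value;
}

int main() {
  std::printf("%d\n", GetMaxVal<int32_t>());  // 2147483647
  return 0;
}
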