    Dev bert merge develop (#1650) · 59eb55c1
    Li Xinqi authored
    * Implement gelu op (#1478)
    
    * gelu op
    
    * call different funcs for float and double
    
    * Dev bert gather op (#1483)
    
    * embedding_dense_op
    
    * refine
    
    * gather op
    
    * revert
    
    * Fix gelu bug (#1484)
    
    * fix inherit bug
    
    * fix backward formula
    
    * fix bug
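
    For reference, the gelu commits above implement the erf-based GELU and later fix its backward formula. A minimal standalone C++ sketch of those formulas, with the float/double split mirroring "call different funcs for float and double" (illustrative names and assumptions, not the actual OneFlow kernels):

```cpp
#include <cmath>
#include <cstdio>

// Hypothetical erf-based GELU: y = 0.5 * x * (1 + erf(x / sqrt(2))).
// The float overload calls erff and the double overload calls erf.
inline float Gelu(float x) { return 0.5f * x * (1.0f + std::erff(x * 0.70710678f)); }
inline double Gelu(double x) { return 0.5 * x * (1.0 + std::erf(x * 0.7071067811865476)); }

// Backward: dy/dx = 0.5 * (1 + erf(x / sqrt(2))) + x * exp(-x^2 / 2) / sqrt(2 * pi).
inline double GeluGrad(double x, double dy) {
  const double cdf = 0.5 * (1.0 + std::erf(x * 0.7071067811865476));
  const double pdf = std::exp(-0.5 * x * x) * 0.3989422804014327;  // 1 / sqrt(2 * pi)
  return dy * (cdf + x * pdf);
}

int main() {
  std::printf("gelu(1.0) = %f, grad = %f\n", Gelu(1.0), GeluGrad(1.0, 1.0));
  return 0;
}
```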
    
    * Dev variable op (#1485)
    
    * DefineTestBlobConf => DefineTestBlobOpConf (#1480)
    
    * variable op
    
    * Dev variable op disable memsharing (#1487)
    
    * disable mem sharing for VariableOp
    
    * variable disable tick diff
    
    * fix
    
    * refine
    
    * options transpose_a and transpose_b for Matmul
    
    * matmul operator conf
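
    A minimal reference for what the transpose_a / transpose_b options of the matmul conf above mean, sketched as a naive CPU loop; the real kernel dispatches to (cu)BLAS, and the names here are illustrative only:

```cpp
#include <cassert>
#include <vector>

// a holds an m x k matrix (k x m if transpose_a); b holds k x n (n x k if transpose_b).
std::vector<float> MatMul(const std::vector<float>& a, const std::vector<float>& b,
                          int m, int n, int k, bool transpose_a, bool transpose_b) {
  std::vector<float> c(m * n, 0.0f);
  for (int i = 0; i < m; ++i) {
    for (int j = 0; j < n; ++j) {
      float sum = 0.0f;
      for (int p = 0; p < k; ++p) {
        const float av = transpose_a ? a[p * m + i] : a[i * k + p];
        const float bv = transpose_b ? b[j * k + p] : b[p * n + j];
        sum += av * bv;
      }
      c[i * n + j] = sum;
    }
  }
  return c;
}

int main() {
  // (2x3) * (3x2), with b supplied as its (2x3) transpose and transpose_b = true.
  const std::vector<float> a = {1, 2, 3, 4, 5, 6};
  const std::vector<float> b_t = {1, 0, 1, 0, 1, 0};
  const auto c = MatMul(a, b_t, 2, 2, 3, /*transpose_a=*/false, /*transpose_b=*/true);
  assert(c[0] == 4.0f && c[3] == 5.0f);
  return 0;
}
```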
    
    * Dev bert const scalar op (#1488)
    
    * const scalar  op
    
    * refine
    
    * fix
    
    * data parallel only
    
    * const range op (#1489)
    
    * square and sqrt
    
    * broadcast_binary_op
    
    * feat: add mean op (#1490)
    
    * feat: add mean op
    
    * feat: add mean_kernel
    
    * feat: add implementation
    
    * feat: fix mean kernel
    
    * Dev bert slice op (#1491)
    
    * add op_conf
    
    * add slice op impl
    
    * add space kernel impl
    
    * fix
    
    * same semantic as python
    
    * optional start and end
    
    * fix
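
    The slice commits above aim for the "same semantic as python" with an "optional start and end". A hedged sketch of that bound handling on a 1-D blob; the helper names are hypothetical, not the actual SliceOp/SliceKernel interface:

```cpp
#include <algorithm>
#include <cstdio>
#include <vector>

// Resolve an optional, possibly negative slice bound the way Python does.
// has_value models the "optional start and end" fields of the op conf.
long ResolveBound(bool has_value, long value, long dim_size, long fallback) {
  if (!has_value) return fallback;
  if (value < 0) value += dim_size;                 // negative indices count from the end
  return std::max(0L, std::min(value, dim_size));   // clamp into [0, dim_size]
}

std::vector<int> Slice1D(const std::vector<int>& in,
                         bool has_start, long start, bool has_end, long end) {
  const long n = static_cast<long>(in.size());
  const long s = ResolveBound(has_start, start, n, 0);
  const long e = ResolveBound(has_end, end, n, n);
  if (e <= s) return {};
  return std::vector<int>(in.begin() + s, in.begin() + e);
}

int main() {
  const std::vector<int> v = {0, 1, 2, 3, 4};
  for (int x : Slice1D(v, true, 1, true, -1)) std::printf("%d ", x);  // v[1:-1] -> 1 2 3
  std::printf("\n");
  return 0;
}
```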
    
    * add has_dim0_in_shape in reshape op (#1486)
    
    * refine CHECK in broadcast_binary_op
    
    * feat: add kernel implement for broadcast_mul/div
    
    * Impl square && sqrt (#1495)
    
    * impl square && sqrt
    
    * fix typo
    
    * Dev bert slice op (#1496)
    
    * add op_conf
    
    * add slice op impl
    
    * add space kernel impl
    
    * fix
    
    * same semantic as python
    
    * optional start and end
    
    * fix
    
    * slice kernel cpu impl
    
    * modify coding style
    
    * BiasAddOpConf
    
    * refactor(broadcast_div_kernel): update kernel util api
    
    * Dev bert const range use device piece size (#1498)
    
    * use device_piece_size
    
    * const size => size
    
    * fix
    
    * no check in BroadcastBinaryOp::InitFromProto
    
    * override GetCustomizedConfs for broadcast_binary_op
    
    * fix: fix bugs in broadcast_div/mul kernel (#1502)
    
    * fix: fix bugs in broadcast_div/mul kernel
    
    * fix
    
    * fix: fix the infer bw_buf blobdesc bug in broadcast_binary op
    
    * Bias Add Op && Kernel (#1503)
    
    * pass compile
    
    * fix typo
    
    * Matmul kernel implementation (#1494)
    
    * pass compile
    
    * add comment
    
    * fix bug
    
    * Dev bert const scalar kernel (#1492)
    
    * const scalar kernel
    
    * fix
    
    * fix
    
    * init
    
    * empty const range kernel
    
    * sketch of gather kernel
    
    * gather kernel
    
    * refine
    
    * refine
    
    * const range kernel
    
    * refine
    
    * backward
    
    * const range size
    
    * gather kernel
    
    * assert index
    
    * add truncated_normal initializer (#1499)
    
    * add truncated_normal initializer
    
    * rename RngTruncatedNormal
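
    A sketch of what a truncated_normal initializer conventionally does: draw from the normal distribution and resample anything farther than two standard deviations from the mean. This is an assumption about RngTruncatedNormal's behavior, not its actual implementation:

```cpp
#include <cmath>
#include <cstdio>
#include <random>
#include <vector>

// Fill a blob with truncated-normal samples (re-sample beyond 2 std devs).
void TruncatedNormalFill(std::vector<float>* blob, float mean, float std,
                         unsigned int seed) {
  std::mt19937 gen(seed);
  std::normal_distribution<float> dist(mean, std);
  for (float& v : *blob) {
    float x;
    do { x = dist(gen); } while (std::fabs(x - mean) > 2.0f * std);
    v = x;
  }
}

int main() {
  std::vector<float> w(8);
  TruncatedNormalFill(&w, 0.0f, 0.02f, 42);
  for (float v : w) std::printf("%.4f ", v);
  std::printf("\n");
  return 0;
}
```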
    
    * fix: add const override for InferBwBufBlobDescs in BroadcastBinaryOp
    
    * fix: update the supported data type from floating to arithmetic
    
    * enforce 2d on bias add
    
    * Dev bert slice op (#1500)
    
    * add op_conf
    
    * add slice op impl
    
    * add space kernel impl
    
    * fix
    
    * same semantic as python
    
    * optional start and end
    
    * fix
    
    * slice kernel cpu impl
    
    * modify coding style
    
    * slice gpu impl const buf infer
    
    * add slice gpu impl
    
    * simplify slice cpu impl
    
    * fix gpu impl bug
    
    * fix typo
    
    * add forward function from broadcast_add, broadcast_sub
    
    * feat: add gpu impl of cast kernel (#1504)
    
    * Dev nc cast (#1507)
    
    * feat: add gpu impl of cast kernel
    
    * register gpu cast op
    
    * Fix broadcast binary all dim size 1 (#1505)
    
    * remove check NumAxes
    
    * check scalar
    
    * IsScalarBlob
    
    * b_diff=>b (#1509)
    
    * feat: add LayerNormOp/Kernel without kernel implement (#1510)
    
    * fix: fix missing registering layer_normalization kernel
    
    * fix: fix missing registering layer_normalization op
    
    * fix: temporarily remove activation from layer_norm_kernel
    
    * ExecShapeUtil
    
    * broadcast_binary_xpu_util.h
    
    * add bw kernel of broadcast_add
    
    * Dev constant (#1513)
    
    * constant_op
    
    * init_op_conf
    
    * sequence=>range
    
    * Dev broadcast add (#1514)
    
    * ExecShapeUtil
    
    * broadcast_binary_xpu_util.h
    
    * add bw kernel of broadcast_add
    
    * WITH_CUDA_PARAM
    
    * left extended shape
    
    * xpu_ndarray_builder
    
    * add bw kernel of broadcast_sub
    
    * updt to 1d (#1512)
    
    * fix a small bug in xpu_reduce_ndarray
    
    * fix(broadcast_binary_op): fix the wrong data_type of bw_buf regst (#1515)
    
    * feat(mean): update mean_op/kernel for calc only last dim of blob (#1516)
    
    * fix(mean_kernel): fix typo
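
    The mean commit above restricts the reduction to the last dimension of the blob. A minimal CPU sketch of that behavior (illustrative names, not the actual MeanKernel interface):

```cpp
#include <cstdio>

// in has shape (outer_num, last_dim); out has shape (outer_num).
void MeanLastDim(const float* in, float* out, int outer_num, int last_dim) {
  for (int i = 0; i < outer_num; ++i) {
    float sum = 0.0f;
    for (int j = 0; j < last_dim; ++j) { sum += in[i * last_dim + j]; }
    out[i] = sum / static_cast<float>(last_dim);
  }
}

int main() {
  const float in[] = {1, 2, 3, 4, 5, 6};  // shape (2, 3)
  float out[2];
  MeanLastDim(in, out, 2, 3);
  std::printf("%f %f\n", out[0], out[1]);  // 2.0 and 5.0
  return 0;
}
```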
    
    * ndarray reduce
    
    * new reduce
    
    * fix shape of tmp_storage
    
    * reduce
    
    * more check for NdArrayReduce
    
    * ImplaceApplyUnary<UnaryFuncMinus>
    
    * ndarray_apply_broadcast_binary
    
    * delete useless files
    
    * complete backward kernel of broadcast_mul
    
    * add backward kernel of broadcast_div
    
    * broadcast binary op check data type equal (#1508)
    
    * fix bug in broadcast_binary
    
    * debug op
    
    * EncodeBlob
    
    * const_out_blob_feature_load_file
    
    * DefineTestBlobOpConf.has_diff
    
    * indices has_diff = false (#1519)
    
    * adam model update (#1518)
    
    * adam model update
    
    * add comment
    
    * update
    
    * add correct_deviation flag
    
    * rename
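
    For the adam model update above, a textbook Adam step with a flag playing the role of correct_deviation, assumed here to toggle bias correction of m and v; parameter names are illustrative and this is not the actual OneFlow kernel:

```cpp
#include <cmath>
#include <cstdio>

// One Adam step for a single parameter w with first/second moments m and v.
void AdamUpdate(float* w, float* m, float* v, float grad, int t,
                float lr, float beta1, float beta2, float epsilon,
                bool correct_deviation) {
  *m = beta1 * *m + (1.0f - beta1) * grad;
  *v = beta2 * *v + (1.0f - beta2) * grad * grad;
  float m_hat = *m;
  float v_hat = *v;
  if (correct_deviation) {           // bias-correct the running moments
    m_hat /= (1.0f - std::pow(beta1, t));
    v_hat /= (1.0f - std::pow(beta2, t));
  }
  *w -= lr * m_hat / (std::sqrt(v_hat) + epsilon);
}

int main() {
  float w = 1.0f, m = 0.0f, v = 0.0f;
  AdamUpdate(&w, &m, &v, /*grad=*/0.5f, /*t=*/1, 1e-3f, 0.9f, 0.999f, 1e-8f, true);
  std::printf("w = %f\n", w);
  return 0;
}
```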
    
    * remove GetCustomizedConf
    
    * fix bug in mean_op fw kernel
    
    * add sigmoid loss op
    
    * ndarray_apply_broadcast_unary
    
    * remove multiplier of mean kernel
    
    * fix(boxing_actor): do not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()
    
    * fix raw (#1522)
    
    * rsqrt
    
    * XpuReducedNdarray supports expression template
    
    * faster_reduce
    
    * inlined cuda device function
    
    * profiling reduce_sum
    
    * refactor(kernel_util.cu): calc x_strides on cpu instead of on TransposeGpu() (#1525)
    
    * BroadcastBinaryOp
    
    * ExecShape => XpuShape
    
    * fix shape bug in mean bw kernel
    
    * refine XpuNdarrayAssign
    
    * use ndarray broadcast mul (#1529)
    
    * Dev softmax reduce ndarray (#1527)
    
    * softmax use ndarray reduce
    
    * fix shape
    
    * refine reduce
    
    * fix
    
    * remove xpu_ndarray_builder
    
    * fix(actor.cpp): never access regst after sending it to producer
    
    * ndarray_util.h => xpu_util.h
    
    * xpu_ndarray_util.h => ndarray_util.h
    
    * XpuNdArrayUtil => NdarrayUtil
    
    * SwitchReduce(SwitchCase(num_axes), ...) => Reduce(...)
    
    * refactor: rename NormalProcessNaiveReadableRegstMsg() to NormalProcessNaiveReadableDataRegstMsg() (#1532)
    
    * SwitchBroadcastApply(SwitchCase(num_axes), ...) => BroadcastApply(...)
    
    * softmax kernel use ndarray reduce  (#1530)
    
    * softmax use ndarray reduce
    
    * fix shape
    
    * refine reduce
    
    * fix
    
    * RowMax=>NdarrayReduce
    
    * SwitchReduce=>Reduce
    
    * move template parameter NDIMS from class NdarrayReduce to methods of class NdarrayReduce
    
    * rename file: ndarray/xpu_ndarray_reduce_test.cpp -> ndarray/ndarray_reduce_test.cpp
    
    * move NdarrayUtil::SwitchReduce(...) to NdarrayReduce::SwitchReduce(...)
    
    * Dev one hot encoder (#1533)
    
    * one_hot op
    
    * ohe
    
    * one hot kernel
    
    * refine
    
    * refine
    
    * remove old
    
    * refine
    
    * refine
    
    * refine
    
    * format
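
    A minimal sketch of the one-hot encoding performed by the op above: an index blob of shape (n) becomes a dense (n, depth) blob with a single 1 per row. Illustrative only, not the actual OneHotKernel:

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

std::vector<float> OneHot(const std::vector<int32_t>& indices, int32_t depth) {
  std::vector<float> out(indices.size() * depth, 0.0f);
  for (size_t i = 0; i < indices.size(); ++i) {
    const int32_t idx = indices[i];
    if (idx >= 0 && idx < depth) { out[i * depth + idx] = 1.0f; }  // skip out-of-range indices
  }
  return out;
}

int main() {
  const auto out = OneHot({2, 0}, 3);  // rows: [0,0,1] and [1,0,0]
  for (float v : out) { std::printf("%.0f ", v); }
  std::printf("\n");
  return 0;
}
```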
    
    * save m and v in adam_model_update (#1534)
    
    * Dev profile reduce (#1535)
    
    * ndarray_reduce_impl
    
    * NdarrayMatrixRowReduce
    
    * 1) MatrixColReduce; 2) WITH_CUDA_PARAM => RUN_CUDA_KERNEL
    
    * NdarrayScalarReduce
    
    * NdarrayDefaultReduce
    
    * refactor NdarrayReduce<DeviceType device_type, typename T> to NdarrayReduce<DeviceType device_type, typename T, const T(*binary_func)(const T, const T)>
    
    * 1) MaxVal<T>() => GetMaxVal<T>(); MaxValue<T>::value => MaxVal<T>::value
    
    * replace KernelUtil::RowMax with NdarrayUtil::ReduceMax
    
    * NdarrayNoReduce
    
    * eliminate redundant code by macros
    
    * Fix matmul gpu bugs (#1528)
    
    * call different api for batchedgemm
    
    * updt api
    
    * use naive loop
    
    * save work
    
    * save work
    
    * updt impl
    
    * remove useless code
    
    * replace naive loop with cublasgemmbatched
    
    * feat: add ScalarAddOp and ScalarMulOp (#1541)
    
    * Dev nc scalar (#1543)
    
    * feat: add ScalarAddOp and ScalarMulOp
    
    * feat: add ScalarAddKernel and ScalarMulKernel
    
    * fix: ScalarAddOp/ScalarMulOp do not inherit from CWiseOp
    
    * fix: fix code style
    
    * fix: fix typo of include file in scalar_add_op/scalar_mul_op
    
    * fix(scalar_mul_kernel): register ScalarMulKernel
    
    * fix: add MulbyScalarPara(), replace cublas_scal with it in ScalarMulKernel
    
    * fix(scalar_mul_kernel): fix typo
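
    The scalar ops above are plain elementwise transforms; a CPU-only sketch follows (the GPU path and the MulbyScalarPara() helper mentioned in the commits are not reproduced; names are stand-ins):

```cpp
#include <cstdio>

// Elementwise stand-ins for ScalarAddKernel / ScalarMulKernel.
void ScalarAdd(const float* in, float* out, int n, float scalar) {
  for (int i = 0; i < n; ++i) { out[i] = in[i] + scalar; }
}
void ScalarMul(const float* in, float* out, int n, float scalar) {
  for (int i = 0; i < n; ++i) { out[i] = in[i] * scalar; }
}

int main() {
  const float in[] = {1.0f, 2.0f, 3.0f};
  float out[3];
  ScalarMul(in, out, 3, 2.0f);
  ScalarAdd(out, out, 3, 0.5f);  // out = in * 2 + 0.5
  std::printf("%.1f %.1f %.1f\n", out[0], out[1], out[2]);
  return 0;
}
```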
    
    * Dev nc testtrans (#1540)
    
    * feat: update trans kernel
    
    * InitGlobalCudaDeviceProp
    
    * in_blob and out_blob are unnecessary for bw kernel of variable_op and constant_op
    
    * Transpose: the shape elem_cnt of x must not exceed 2^32
    
    * remove LabelType (#1545)
    
    * rm ndarray_reduce_core.*
    
    * Dev identity loss (#1547)
    
    * identity_loss
    
    * loss op
    
    * CalcLossInstanceNum
    
    * mem shared for mdupdt first in regst and md diff add regst (#1546)
    
    * remove useless code (#1548)
    
    * Dev sparse cross entropy (#1550)
    
    * op for sparse cross entropy
    
    * modify op_conf for sparse cross entropy
    
    * sparse cross entropy kernel
    
    * op
    
    * SparseCrossEntropyKernelUtil
    
    * refine
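
    A hedged sketch of the sparse (label-index based) cross entropy computed by the op above, assuming the prediction blob already holds per-row probabilities; this is not the actual SparseCrossEntropyKernelUtil:

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// loss[i] = -log(prediction[i][label[i]]), prediction shaped (n, num_classes).
void SparseCrossEntropy(const float* prediction, const int32_t* label,
                        float* loss, int n, int num_classes) {
  for (int i = 0; i < n; ++i) {
    loss[i] = -std::log(prediction[i * num_classes + label[i]]);
  }
}

int main() {
  const float pred[] = {0.7f, 0.2f, 0.1f, 0.1f, 0.8f, 0.1f};  // shape (2, 3)
  const int32_t label[] = {0, 1};
  float loss[2];
  SparseCrossEntropy(pred, label, loss, 2, 3);
  std::printf("%f %f\n", loss[0], loss[1]);  // -log(0.7), -log(0.8)
  return 0;
}
```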
    
    * refine shape check (#1552)
    
    * refactoring reduce sum (#1554)
    
    * refactoring reduce sum
    
    * also use shape and dptr when bw
    
    * add resize when keepdims
    
    * address reviews
    
    * move functions to Anonymous namespace
    
    * address reviews
    
    * remove auto
    
    * replace find
    
    * rename keepdims
    
    * only enable nccl on gpu
    
    * fix diff add regst size in MdUpdt task node to be the same as in regst (#1556)
    
    * mem shared for mdupdt first in regst and md diff add regst
    
    * fix diff add regst size in MdUpdt task node to be the same as in regst
    
    * minor fix
    
    * special case when it is a loss op
    
    * Dev loss instance num (#1544)
    
    * loss instance number
    
    * set_has_loss_instance_num_field
    
    * loss
    
    * in_diff
    
    * LossOpFixInDiffHasLossInstanceNum
    
    * remove need_do_loss_instance_num
    
    * move to FixInDiffBlobDescs
    
    * remove
    
    * loss_instance_num use float
    
    * refine
    
    * Boxing ForwardLossInstance
    
    * fix
    
    * fix loss
    
    * fix
    
    * refine
    
    * fix
    
    * refine
    
    * refine
    
    * impl reduce mean
    
    * Dev all reduce ctrl edge (#1558)
    
    * mem shared for mdupdt first in regst and md diff add regst
    
    * feat: add ReduceInplaceIdentity LogicalNode/TaskNode/Op/Kernel
    
    * nccl reduce ctrl edge
    
    * MayConsumeModelDiff
    
    * fix diff add regst size in MdUpdt task node to be the same as in regst
    
    * eager_reduce_ratio
    
    * mem sharing for ReduceIdentity
    
    * ReduceInplaceIdentity => ReduceIdentity
    
    * reduce ctrl edge supports arbitrary placement
    
    * refine ChainLogicalGraph::IsLogicalNodeMergeable
    
    * model name (#1561)
    
    * Dev gather refine (#1517)
    
    * gather op index supports all int types and axis
    
    * out=in
    
    * reformat
    
    * negative axis
    
    * LookupKernel=>GatherKernel
    
    * reformat
    
    * refine
    
    * axis
    
    * refine & bugfix
    
    * remove ConstScalar and ConstRange (#1526)
    
    * Refine range initializer (#1523)
    
    * support axis
    
    * refine naming
    
    * fix before_dim_size
    
    * doc
    
    * refine
    
    * refine naming
    
    * refine naming
    
    * VariableLogicalNode
    
    * identity (#1563)
    
    * total_instance_num use naive mdupdt (#1564)
    
    * patch by hand from faster_rcnn
    
    * revert LogicalVariableOp
    
    * Dev clone boxing (#1566)
    
    * identity
    
    * reduce clone boxing
    
    * Dev clone boxing (#1568)
    
    * identity
    
    * reduce clone boxing
    
    * tuple identity
    
    * Dev tick (#1571)
    
    * feat: add Tick LogicalNode/TaskNode/Op/Kernel
    
    * feat: remove Tick LogicalNode/TaskNode
    
    * feat: add BldSubTskGphByTickToSource for TickOp
    
    * refine: refine due to comment
    
    * feat: add BldSubTskGphByRecordLoadToTick
    
    * pr tick op/kernel alone
    
    * feat: add TickOp and BldSubTskGphByTickToSource  (#1565)
    
    * feat: add Tick LogicalNode/TaskNode/Op/Kernel
    
    * feat: remove Tick LogicalNode/TaskNode
    
    * feat: add BldSubTskGphByTickToSource for TickOp
    
    * refine: refine due to comment
    
    * feat: add BldSubTskGphByRecordLoadToTick
    
    * refine: refine due to comment
    
    * refine: due to comment
    
    * refine: remove BldSubTskGphByRecordLoadToTick
    
    * fix tick op in dlnet (#1572)
    
    * Dev clip by global norm (#1521)
    
    * clip_by_global_norm
    
    * update
    
    * refine model_update op
    
    * remove useless code
    
    * fix name
    
    * rename clip_norm
    
    * remove useless code
    
    * force init memory and add CHECK()
    
    * remove useless code and add comment
    
    * fixbug
    
    * refine code
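
    The clip_by_global_norm commits above presumably follow the usual rule: compute the global norm over all gradients and scale every gradient by clip_norm / max(global_norm, clip_norm). A small CPU sketch under that assumption (names illustrative, not the actual model-update op):

```cpp
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

// global_norm = sqrt(sum_i ||g_i||^2); gradients are untouched when already small enough.
void ClipByGlobalNorm(std::vector<std::vector<float>>* grads, float clip_norm) {
  double sq_sum = 0.0;
  for (const auto& g : *grads) {
    for (float v : g) { sq_sum += static_cast<double>(v) * v; }
  }
  const double global_norm = std::sqrt(sq_sum);
  const double scale = clip_norm / std::max(global_norm, static_cast<double>(clip_norm));
  for (auto& g : *grads) {
    for (float& v : g) { v = static_cast<float>(v * scale); }
  }
}

int main() {
  std::vector<std::vector<float>> grads = {{3.0f, 4.0f}};  // global norm 5
  ClipByGlobalNorm(&grads, 1.0f);
  std::printf("%f %f\n", grads[0][0], grads[0][1]);  // 0.6 0.8
  return 0;
}
```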
    
    * Dev bert profile (#1573)
    
    * 1) refactor reduce_group; 2) add new stream kReduceCtrl
    
    * 1) allreduce and model_update overlapping; 2) allreduce and fw overlapping
    
    * add mdupdt ctrl edges within reduce group (#1575)
    
    * Dev group all reduce by model bytes (#1577)
    
    * group all reduce by model byte size
    
    * move OpGraph into a separate file op_graph.h
    
    * gelu (#1578)
    
    * Dev bert layer norm (#1574)
    
    * layer norm
    
    * layer_norm
    
    * fix trainable
    
    * fix
    
    * fix trainable
    
    * refine
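
    A minimal CPU sketch of the layer_norm forward pass over the last dimension, with gamma/beta scale and shift. This is the standard formulation only; the actual LayerNormKernel (whose activation handling and trainable flags the commits above adjust) differs:

```cpp
#include <cmath>
#include <cstdio>

// For each row: y = (x - mean) / sqrt(var + eps) * gamma + beta.
void LayerNorm(const float* in, const float* gamma, const float* beta,
               float* out, int rows, int norm_size, float eps) {
  for (int i = 0; i < rows; ++i) {
    const float* x = in + i * norm_size;
    float* y = out + i * norm_size;
    float mean = 0.0f, var = 0.0f;
    for (int j = 0; j < norm_size; ++j) { mean += x[j]; }
    mean /= norm_size;
    for (int j = 0; j < norm_size; ++j) { var += (x[j] - mean) * (x[j] - mean); }
    var /= norm_size;
    const float inv_std = 1.0f / std::sqrt(var + eps);
    for (int j = 0; j < norm_size; ++j) {
      y[j] = (x[j] - mean) * inv_std * gamma[j] + beta[j];
    }
  }
}

int main() {
  const float x[] = {1, 2, 3, 4};  // shape (2, 2)
  const float gamma[] = {1, 1}, beta[] = {0, 0};
  float y[4];
  LayerNorm(x, gamma, beta, y, 2, 2, 1e-5f);
  std::printf("%f %f %f %f\n", y[0], y[1], y[2], y[3]);
  return 0;
}
```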
    
    * Dev bert cuda event sync (#1581)
    
    * cudaSetDevice in actor poller threads
    
    * ReduceConcatCompActor ; NaiveActor
    
    * set dev id (#1583)
    
    * Dev bert profiling (#1586)
    
    * profiling
    
    * all_reduce_* option for performance optimization
    
    * fix a mem sharing bug (#1590)
    
    * Fix mem sharing bug (#1593)
    
    * fix a mem sharing bug
    
    * refine by review
    
    * remove previous if condition
    
    * refine
    
    * Dev profiling adam (#1592)
    
    * profiling
    
    * all_reduce_* option for performance optimization
    
    * faster adam kernel
    
    * Dev refine transpose (#1594)
    
    * profiling
    
    * all_reduce_* option for performance optimization
    
    * faster adam kernel
    
    * refine dropout and transpose
    
    * loss print duration (#1598)
    
    * pseudo chains of OpGraph
    
    * ConvertPseudoChainToChain
    
    * refine pseudo_chain
    
    * refine register coloring algorithm
    
    * rename op_graph log file name
    
    * remove unused code
    
    * Dev bigger chain (#1601)
    
    * pseudo chains of OpGraph
    
    * ConvertPseudoChainToChain
    
    * refine pseudo_chain
    
    * refine register coloring algorithm
    
    * rename op_graph log file name
    
    * remove unused code
    
    * chore: add -gencode in CMakeLists.txt (#1603)
    
    * EnableMemSharingInVariableOp
    
    * no mem_sharing for out_diff & model_diff in variable_op
    
    * Dev mem sharing for variable op (#1604)
    
    * pseudo chains of OpGraph
    
    * ConvertPseudoChainToChain
    
    * refine pseudo_chain
    
    * refine register coloring algorithm
    
    * rename op_graph log file name
    
    * remove unused code
    
    * EnableMemSharingInVariableOp
    
    * no mem_sharing for out_diff & model_diff in variable_op
    
    * refine code
    
    * Fix jxf reduce concat bug (#1606)
    
    * refine logic to infer reduce_concat_op's elem_cnt of out blob, still has bugs...
    
    * add RoundUp in reduce_concat
    
    * CHECK_LE -> CHECK_EQ
    
    * add CHECK
    
    * Dev random shuffle (#1607)
    
    * random shuffle
    
    * fix
    
    * refine
    
    * refine
    
    * single thread
    
    * refine
    
    * cmake add half (#1609)
    
    * Bugfix no tick diff (#1614)
    
    * group by has_diff
    
    * rm unnecessary identity
    
    * share model_diff and out_diff in variable op (#1616)
    
    * share model_diff and out_diff in variable op
    
    * bugfix: model_diff is a produced register
    
    * register_num of model_diff is 1
    
    * add VariableKernelConf
    
    * no mutable
    
    * bugfix
    
    * bugfix: set ctrl_regst's return_regst_num (#1617)
    
    * Register coloring with strategy (#1613)
    
    * mem_shared_hint_id
    
    * sharable memory block
    
    * rm useless code
    
    * remove useless code
    
    * bugfix: no redundant edges
    
    * rename: MemBlockGroup => MemBlock
    
    * put constructor of SharableMemBlockNode into header file
    
    * bugfix
    
    * rename field: MemBlock.block_id => MemBlock.mem_block_id
    
    * refine CHECK in AllReduce (#1618)
    
    * refine CHECK in AllReduce
    
    * move ReduceConcatOpCtx definition to .cpp file
    
    * fix fw_consumer nullptr (#1622)
    
    * faster improver (#1628)
    
    * multithreads register coloring (#1630)
    
    * multithreads register coloring
    
    * refine code
    
    * Dev bert accuracy with weight (#1632)
    
    * accuracy
    
    * accuracy_task_node add fw_buf
    
    * fw_buf=>data_tmp
    
    * Dev logical blob dim0 (#1625)
    
    * mem_shared_hint_id
    
    * sharable memory block
    
    * rm useless code
    
    * remove useless code
    
    * bugfix: no redundant edges
    
    * rename: MemBlockGroup => MemBlock
    
    * put constructor of SharableMemBlockNode into header file
    
    * bugfix
    
    * rename field: MemBlock.block_id => MemBlock.mem_block_id
    
    * replace piece_size with logical_blob_dim0
    
    * BlobParallelConf
    
    * BlobParallelDesc
    
    * infer out blob model_split_axis
    
    * int64_t => int32_t
    
    * InferOutBlobParallelDesc
    
    * gather out blob model split (#1624)
    
    * InferBlobParallelDesc
    
    * let variable op support kModelParallel
    
    * rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
    
    * Global<OpGraph>
    
    * SplitLogicalInputBlobDesc
    
    * ConcatOutputBlobDescs
    
    * rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
    
    * OpGraph::CheckBlobDescs(...)
    
    * exact division is unnecessary
    
    * fix bugs
    
    * rename InferOutBlob* => InferOutputBlob
    
    * exact division in variable_op is unnecessary
    
    * bug fix
    
    * fix bugs
    
    * fix bugs
    
    * IsInputBlobAllowedModelSplit
    
    * use Global<OpGraph> to InferModelSize
    
    * add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
    
    * fix IdentityOp::IsInputBlobAllowedModelSplit
    
    * no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
    
    * refine BlobParallelDesc: replace CopyParallelConf with operator=
    
    * refine ParallelDesc: remove unused functions
    
    * more checks on ParallelDesc
    
    * Dev logical blob dim0 (#1635)
    
    * mem_shared_hint_id
    
    * sharable memory block
    
    * rm useless code
    
    * remove useless code
    
    * bugfix: no redundant edges
    
    * rename: MemBlockGroup => MemBlock
    
    * put constructor of SharableMemBlockNode into header file
    
    * bugfix
    
    * rename field: MemBlock.block_id => MemBlock.mem_block_id
    
    * replace piece_size with logical_blob_dim0
    
    * BlobParallelConf
    
    * BlobParallelDesc
    
    * infer out blob model_split_axis
    
    * int64_t => int32_t
    
    * InferOutBlobParallelDesc
    
    * gather out blob model split (#1624)
    
    * InferBlobParallelDesc
    
    * let variable op support kModelParallel
    
    * rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
    
    * Global<OpGraph>
    
    * SplitLogicalInputBlobDesc
    
    * ConcatOutputBlobDescs
    
    * rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
    
    * OpGraph::CheckBlobDescs(...)
    
    * exact division is unnecessary
    
    * fix bugs
    
    * rename InferOutBlob* => InferOutputBlob
    
    * exact division in variable_op is unnecessary
    
    * bug fix
    
    * fix bugs
    
    * fix bugs
    
    * IsInputBlobAllowedModelSplit
    
    * use Global<OpGraph> to InferModelSize
    
    * add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
    
    * fix IdentityOp::IsInputBlobAllowedModelSplit
    
    * no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
    
    * refine BlobParallelDesc: replace CopyParallelConf with operator=
    
    * refine ParallelDesc: remove unused functions
    
    * more checks on ParallelDesc
    
    * remove unused function Operator::MaxModelSplitNum
    
    * bugfix: SoleOp() => op_vec().at(0)
    
    * Dev global op graph (#1636)
    
    * Global<OpGraph> is only available during compilation
    
    * small record_piece_size for InferNoParallelBlobDesc
    
    * Dev op graph piece size (#1637)
    
    * fix a bug in OpGraph::InferNoParallelBlobDesc
    
    * fix a bug in OpGraph::InferNoParallelBlobDesc
    
    * DfsTopoForEachNodeSortByDistanceToSink (#1638)
    
    * Dev jxf bert top k (#1633)
    
    * top_k
    
    * dev top_k op
    
    * refine
    
    * fix bug
    
    * refactor top_k op, cooperate with gather op to get values now
    
    * customized TOPK_KERNEL_ENTRY in auto factory
    
    * batch gather op
    
    * refine
    
    * Backup: batch_gather op, pass compile
    
    * fix bugs, pass the test
    
    * fix no new line at the end of file
    
    * const
    
    * refine by review
    
    * fix bugs
    
    * rename: instance_dim -> instance_size
    
    * remove a blank line
    
    * refine coding style by Juncheng's suggestions, Bravo
    
    * refine top_k
    
    * more refine
    
    * compatible with new model parallel
    
    * refine
    
    * rename
    
    * cpu only in top_k
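
    The top_k commits above produce indices only, CPU-side, and "cooperate with gather op to get values". A small sketch of that division of labor (illustrative functions, not the actual TopKKernel / GatherKernel):

```cpp
#include <algorithm>
#include <cstdio>
#include <numeric>
#include <vector>

// top_k returns the indices of the k largest entries; a separate gather picks the values.
std::vector<int> TopKIndices(const std::vector<float>& in, int k) {
  std::vector<int> idx(in.size());
  std::iota(idx.begin(), idx.end(), 0);
  std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                    [&](int a, int b) { return in[a] > in[b]; });
  idx.resize(k);
  return idx;
}

std::vector<float> Gather(const std::vector<float>& in, const std::vector<int>& indices) {
  std::vector<float> out;
  out.reserve(indices.size());
  for (int i : indices) { out.push_back(in[i]); }
  return out;
}

int main() {
  const std::vector<float> scores = {0.1f, 0.9f, 0.4f, 0.7f};
  const auto idx = TopKIndices(scores, 2);  // {1, 3}
  const auto vals = Gather(scores, idx);    // {0.9, 0.7}
  std::printf("top-2: (%d, %.1f) (%d, %.1f)\n", idx[0], vals[0], idx[1], vals[1]);
  return 0;
}
```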
    
    * Dev model boxing (#1639)
    
    * mem_shared_hint_id
    
    * sharable memory block
    
    * rm useless code
    
    * remove useless code
    
    * bugfix: no redundant edges
    
    * rename: MemBlockGroup => MemBlock
    
    * put constructor of SharableMemBlockNode into header file
    
    * bugfix
    
    * rename field: MemBlock.block_id => MemBlock.mem_block_id
    
    * replace piece_size with logical_blob_dim0
    
    * BlobParallelConf
    
    * BlobParallelDesc
    
    * infer out blob model_split_axis
    
    * int64_t => int32_t
    
    * InferOutBlobParallelDesc
    
    * gather out blob model split (#1624)
    
    * InferBlobParallelDesc
    
    * let variable op support kModelParallel
    
    * rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
    
    * Global<OpGraph>
    
    * SplitLogicalInputBlobDesc
    
    * ConcatOutputBlobDescs
    
    * rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
    
    * OpGraph::CheckBlobDescs(...)
    
    * exact division is unnecessary
    
    * fix bugs
    
    * rename InferOutBlob* => InferOutputBlob
    
    * exact division in variable_op is unnecessary
    
    * bug fix
    
    * fix bugs
    
    * fix bugs
    
    * IsInputBlobAllowedModelSplit
    
    * use Global<OpGraph> to InferModelSize
    
    * add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
    
    * fix IdentityOp::IsInputBlobAllowedModelSplit
    
    * no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
    
    * refine BlobParallelDesc: replace CopyParallelConf with operator=
    
    * refine ParallelDesc: remove unused functions
    
    * more checks on ParallelDesc
    
    * remove unused function Operator::MaxModelSplitNum
    
    * BlobParallelDesc::EquivalentTo
    
    * LogicalNode::main_model_parallel_ is out of date
    
    * refine Operator: replace IsElemWiseOp with IsSoleInputBlobAllowedModelSplit
    
    * refine transpose conf
    
    * fix a bug in Operator::FixParallelDesc
    
    * InferInputBlobModelSplitAxis
    
    * BlobParallelType
    
    * more default behaviors for Operator::InferInputOutputBlobParallelType
    
    * op_parallel_signature
    
    * rename: BlobParallelType => LogicalBlobParallelDesc
    
    * OpGraph::InferLogicalBlobParallelDesc
    
    * refactor SplitLogicalInputBlobDesc by LogicalBlobParallelDesc
    
    * refine OpNode::ConcatBlobDesc By LogicalBlobParallelDesc
    
    * OpNode::lbi2model_split_axis_
    
    * OpGraph::GetBalancedSplitter
    
    * replace OpGraph::GetBlobParallelDesc4Lbi with OpGraph::GetLbpd4Lbi
    
    * rm BlobParallelDesc in OpGraph
    
    * VariableOp::InitOpParallelSignatures
    
    * rm BlobParallelDesc
    
    * rename Make*ParallelSignature functions
    
    * MakeOpParallelSignature_DS_MC_2_DS
    
    * MakeOpParallelSignature_DC_MS_2_MS
    
    * BiasAddOp::InitOpParallelSignatures
    
    * refine MakeOpParallelSignature_DC_MS_2_MS
    
    * MatmulOp::InitOpParallelSignatures
    
    * GatherOp::InitOpParallelSignatures
    
    * bugfix: model_split_axis cannot equal -1 when parallel_policy is kModelParallel
    
    * refactor: bn2parallel_id2blob_desc => lbi2parallel_id2blob_desc
    
    * refine OpNode
    
    * LogicalBlobParallelConf
    
    * LogicalBlobParallelDesc::DualLbpd
    
    * 1) merge dev_bert;
    2) placement.proto not used in logical_blob_parallel_conf.proto
    
    * bugfix: 1) remove CHECK(has_model) in Operator::NaiveInitOpParallelSignatures; 2) lbpd->set_parallel_num(val)
    
    * fix bugs in GatherOp::InitOpParallelSignatures and BroadcastBinaryOp::InitOpParallelSignatures
    
    * refactor: InitOpParallelSignatures => GetOpParallelSignatures
    
    * refactor: const OpParallelSignature => std::unique_ptr<const OpParallelSignature>
    
    * rm LogicalBlobParallelConf
    
    * refactor: ModelSplitAxis4BnInOp => LbpdHint4BnInOp
    
    * fix bugs about LbpdHint
    
    * simplify the interface of InferInputOutputBlobLogicalBlobParallelDescIf
    
    * rename Class CloneParallel => BroadcastParallel
    
    * rename field: clone_parallel => broadcast_parallel
    
    * refactor LbpdHint by SbpParallel
    
    * InferIsModelBlob4OutputBlobsIf
    
    * remove field LogicalBlobParallelDesc::parallel_num
    
    * rename: LogicalBlobParallelDesc => SbpParallel
    
    * rename: LbpdHint =>SbpInferHint
    
    * simplify interface Operator::InferOutputBlobSbpInferHint
    
    * rename api: Operator::InferBlobSbpInferHintIf => Operator::InferOuputBlobsSbpInferHintIf
    
    * OpGraph::InferIsModelBlob
    
    * rename file: logical_blob_parallel_desc.* => sbp_parallel.*
    
    * rename filename: lbpd_hint* => sbp_infer_hint*
    
    * rename field: SbpInferHint::has_data_split => SbpInferHint::is_data_split
    
    * rename fields: SbpInferHint::is_data_split, is_model_split, is_data_partial_sum, is_model_broadcast
    
    * refactor SbpInferHint::split_axis
    
    * LambdaOpParallelSignature
    
    * replace function MakeVariableOpDataSplitOpParallelSignature with class VariableOpDataSplitOpParallelSignature
    
    * replace function MakeVariableOpModelSplitOpParallelSignature with class VariableOpModelSplitOpParallelSignature
    
    * BroadcastBinaryOpParallelSignature
    
    * Matmul_DMS_MS_2_P_OpParallelSignature
    
    * Gather_DC_MS_2_P_OpParallelSignature
    
    * class DataSplitOpParallelSignature
    
    * class ModelBroadcastOpParallelSignature
    
    * class DS_MC_2_DS_OpParallelSignature
    
    * add field OpParallelSignature::op_
    
    * refactor: ModelSplitAxis => OutputBlobModelSplitAxis
    
    * remove Operator::InferOuputBlobsSbpInferHintIf
    
    * implement MatmulOp::OutputBlobModelSplitAxis
    
    * implement GatherOp::OutputBlobModelSplitAxis
    
    * implement TransposeOp::OutputBlobModelSplitAxis and BiasAddOp::OutputBlobModelSplitAxis
    
    * add method OpGraph::IsDataBlob
    
    * refactor OpGraph::InferSbpParallel
    
    * refactor class SbpInferHint
    
    * rename local variable: SbpInferHint4BnInOp => SbpInferHint4Ibn
    
    * refactor MakeModelSplitOpParallelSignature
    
    * refactor Make_DC_MS_2_MS_OpParallelSignature
    
    * remove unused class LambdaOpParallelSignature; refactor class name '*Clone*' => '*Broadcast*'
    
    * bugfix: Operator::OutputBlobModelSplitAxis for sole-ibn op
    
    * fix bugs in SbpInferHint::has_split_axis(), SbpInferHint::split_axis and OpNode::IsModelBlob4Lbi
    
    * refactor class SbpInferHint: replace split_axis_ with sbp_parallel_
    
    * refactor by SbpInferHint::sbp_parallel
    
    * 1) rename OpNode data member; 2) rm unused proto
    
    * fix clone (#1641)
    
    * OpGraph::GetBlobDataType (#1643)
    
    * OpGraph::GetBlobDataType
    
    * refine OpGraph::GetBlobDataType
    
    * IdentityOp => TupleIdentityOp (#1644)
    
    * Dev sbp parallel cast (#1646)
    
    * add SbpParallelCastOp
    
    * only SplitParallel and BroadcastParallel can be user customized
    
    * rename: SbpParallelCastOp => ParallelCastOp
    
    * build boxing_conf by sbp_parallel
    
    * fix a bug in BroadcastBinaryOpParallelSignature
    
    * support broadcast_parallel for sole-ibn op
    
    * 1) build boxing_op_conf by sbp_parallel for tuple_identity_op;
    2) no op parallel desc fix for kModelParallel;
    3) fix a bug in TaskGraph::EnableMemSharingInVariableOp
    4) add TupleIdentityOpParallelSignature
    
    * fix bug in IsModelParallel121 (#1648)
    
    * merge develop
    
    * merge develop (#1649)