Unverified commit 59eb55c1 authored by Li Xinqi, committed by GitHub

Dev bert merge develop (#1650)

* Implement gelu op (#1478)

* gelu op

* call different funcs for float and double

* Dev bert gather op (#1483)

* embedding_dense_op

* refine

* gather op

* revert

* Fix gelu bug (#1484)

* fix inherit bug

* fix backward formula

* fix bug

* Dev variable op (#1485)

* DefineTestBlobConf => DefineTestBlobOpConf (#1480)

* variable op

* Dev variable op disable memsharing (#1487)

* disable mem sharing for VariableOp

* variable disable tick diff

* fix

* refine

* options transpose_a and transpose_b for Matmul

* matmul operator conf

* Dev bert const scalar op (#1488)

* const scalar  op

* refine

* fix

* data parallel only

* const range op (#1489)

* square and sqrt

* broadcast_binary_op

* feat: add mean op (#1490)

* feat: add mean op

* feat: add mean_kernel

* feat: add implementation

* feat: fix mean kernel

* Dev bert slice op (#1491)

* add op_conf

* add slice op impl

* add space kernel impl

* fix

* same semantics as Python

* optional start and end

* fix

* add has_dim0_in_shape in reshape op (#1486)

* refine CHECK in broadcast_binary_op

* feat: add kernel implement for broadcast_mul/div

* Impl square && sqrt (#1495)

* impl square && sqrt

* fix typo

* Dev bert slice op (#1496)

* add op_conf

* add slice op impl

* add space kernel impl

* fix

* same semantics as Python

* optional start and end

* fix

* slice kernel cpu impl

* modify coding style

* BiasAddOpConf

* refactor(broadcast_div_kernel): update kernel util api

* Dev bert const range use device piece size (#1498)

* use device_piece_size

* const size => size

* fix

* no check in BroadcastBinaryOp::InitFromProto

* override GetCustomizedConfs for broadcast_binary_op

* fix: fix bugs in broadcast_div/mul kernel (#1502)

* fix: fix bugs in broadcast_div/mul kernel

* fix

* fix: fix the infer bw_buf blobdesc bug in broadcast_binary op

* Bias Add Op && Kernel (#1503)

* pass compile

* fix typo

* Matmul kernel implementation (#1494)

* pass compile

* add comment

* fix bug

* Dev bert const scalar kernel (#1492)

* const scalar kernel

* fix

* fix

* init

* empty const range kernel

* sketch of gather kernel

* gather kernel

* refine

* refine

* const range kernel

* refine

* backward

* const range size

* gather kernel

* assert index

* add truncated_normal initializer (#1499)

* add truncated_normal initializer

* rename RngTruncatedNormal

* fix: add const override for InferBwBufBlobDescs in BroadcastBinaryOp

* fix: update the supported data type from floating to arithmetic

* enforce 2d on bias add

* Dev bert slice op (#1500)

* add op_conf

* add slice op impl

* add space kernel impl

* fix

* same semantics as Python

* optional start and end

* fix

* slice kernel cpu impl

* modify coding style

* slice gpu impl const buf infer

* add slice gpu impl

* simplify slice cpu impl

* fix gpu impl bug

* fix typo

* add forward function for broadcast_add, broadcast_sub

* feat: add gpu impl of cast kernel (#1504)

* Dev nc cast (#1507)

* feat: add gpu impl of cast kernel

* register gpu cast op

* Fix broadcast binary all dim size 1 (#1505)

* remove check NumAxes

* check scalar

* IsScalarBlob

* b_diff=>b (#1509)

* feat: add LayerNormOp/Kernel without kernel implement (#1510)

* fix: fix missing registering layer_normalization kernel

* fix: fix missing registering layer_normalization op

* fix: temporarily remove activation from layer_norm_kernel

* ExecShapeUtil

* broadcast_binary_xpu_util.h

* add bw kernel of broadcast_add

* Dev constant (#1513)

* constant_op

* init_op_conf

* sequence=>range

* Dev broadcast add (#1514)

* ExecShapeUtil

* broadcast_binary_xpu_util.h

* add bw kernel of broadcast_add

* WITH_CUDA_PARAM

* left extended shape

* xpu_ndarray_builder

* add bw kernel of broadcast_sub

* updt to 1d (#1512)

* fix small bug in xpu_reduce_ndarray

* fix(broadcast_binary_op): fix the wrong data_type of bw_buf regst (#1515)

* feat(mean): update mean_op/kernel for calc only last dim of blob (#1516)

* fix(mean_kernel): fix typo

* ndarray reduce

* new reduce

* fix shape of tmp_storage

* reduce

* more check for NdArrayReduce

* ImplaceApplyUnary<UnaryFuncMinus>

* ndarray_apply_broadcast_binary

* delete useless files

* complete backward kernel of broadcast_mul

* add backward kernel of broadcast_div

* broadcast binary op check data type equal (#1508)

* fix bug in broadcast_binary

* debug op

* EncodeBlob

* const_out_blob_feature_load_file

* DefineTestBlobOpConf.has_diff

* indices has_diff = false (#1519)

* adam model update (#1518)

* adam model update

* add comment

* update

* add correct_deviation flag

* rename

* remove GetCustomizedConf

* fix bug in mean_op fw kernel

* add sigmoid loss op

* ndarray_apply_broadcast_unary

* remove multiplier of mean kernel

* fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()

* fix raw (#1522)

* rsqrt

* XpuReducedNdarray supports expression template

* faster_reduce

* inlined cuda device function

* profiling reduce_sum

* refactor(kernel_util.cu): calc x_strides on cpu instead of on TransposeGpu() (#1525)

* BroadcastBinaryOp

* ExecShape => XpuShape

* fix shape bug in mean bw kernel

* refine XpuNdarrayAssign

* use ndarray broadcast mul (#1529)

* Dev softmax reduce ndarray (#1527)

* softmax use ndarray reduce

* fix shape

* refine reduce

* fix

* remove xpu_ndarray_builder

* fix(actor.cpp): never access regst after sending it to producer

* ndarray_util.h => xpu_util.h

* xpu_ndarray_util.h => ndarray_util.h

* XpuNdArrayUtil => NdarrayUtil

* SwitchReduce(SwitchCase(num_axes), ...) => Reduce(...)

* refactor: rename NormalProcessNaiveReadableRegstMsg() to NormalProcessNaiveReadableDataRegstMsg() (#1532)

* SwitchBroadcastApply(SwitchCase(num_axes), ...) => BroadcastApply(...)

* softmax kernel use ndarray reduce  (#1530)

* softmax use ndarray reduce

* fix shape

* refine reduce

* fix

* RowMax=>NdarrayReduce

* SwitchReduce=>Reduce

* move template parameter NDIMS from class NdarrayReduce to methods of class NdarrayReduce

* rename file: ndarray/xpu_ndarray_reduce_test.cpp -> ndarray/ndarray_reduce_test.cpp

* move NdarrayUtil::SwitchReduce(...) to NdarrayReduce::SwitchReduce(...)

* Dev one hot encoder (#1533)

* one_hot op

* ohe

* one hot kernel

* refine

* refine

* remove old

* refine

* refine

* refine

* format

* save m and v in adam_model_update (#1534)

* Dev profile reduce (#1535)

* ndarray_reduce_impl

* NdarrayMatrixRowReduce

* 1) MatrixColReduce; 2) WITH_CUDA_PARAM => RUN_CUDA_KERNEL

* NdarrayScalarReduce

* NdarrayDefaultReduce

* refactor NdarrayReduce<DeviceType device_type, typename T> to NdarrayReduce<DeviceType device_type, typename T, const T(*binary_func)(const T, const T)>

* 1) MaxVal<T>() => GetMaxVal<T>(); MaxValue<T>::value => MaxVal<T>::value

* replace KernelUtil::RowMax with NdarrayUtil::ReduceMax

* NdarrayNoReduce

* eliminate redundant code by macros

* Fix matmul gpu bugs (#1528)

* call different api for batchedgemm

* updt api

* use naive loop

* save work

* save work

* updt impl

* remove useless code

* replace naive loop with cublasgemmbatched

* feat: add ScalarAddOp and ScalarMulOp (#1541)

* Dev nc scalar (#1543)

* feat: add ScalarAddOp and ScalarMulOp

* feat: add ScalarAddKernel and ScalarMulKernel

* fix: ScalarAddOp/ScalarMulOp do not inherit from CWiseOp

* fix: fix code style

* fix: fix typo of include file in scalar_add_op/scalar_mul_op

* fix(scalar_mul_kernel): register ScalarMulKernel

* fix: add MulbyScalarPara() and use it instead of cublas_scal in ScalarMulKernel

* fix(scalar_mul_kernel): fix typo

* Dev nc testtrans (#1540)

* feat: update trans kernel

* InitGlobalCudaDeviceProp

* in_blob and out_blob are unnecessary for bw kernel of variable_op and constant_op

* Transpose: the shape elem_cnt of x must not exceed 2^32

* remove LabelType (#1545)

* rm ndarray_reduce_core.*

* Dev identity loss (#1547)

* identity_loss

* loss op

* CalcLossInstanceNum

* mem shared for mdupdt first in regst and md diff add regst (#1546)

* remove useless code (#1548)

* Dev sparse cross entropy (#1550)

* op for sparse cross entropy

* modify op_conf for sparse cross entropy

* sparse cross entropy kernel

* op

* SparseCrossEntropyKernelUtil

* refine

* refine shape check (#1552)

* refactoring reduce sum (#1554)

* refactoring reduce sum

* also use shape and dptr when bw

* add resize when keepdims

* address reviews

* move functions to Anonymous namespace

* address reviews

* remove auto

* replace find

* rename keepdims

* only enable nccl on gpu

* fix diff add regst size in MdUpdt task node to be the same as in regst (#1556)

* mem shared for mdupdt first in regst and md diff add regst

* fix diff add regst size in MdUpdt task node to be the same as in regst

* minor fix

* special case when it is a loss op

* Dev loss instance num (#1544)

* loss instance number

* set_has_loss_instance_num_field

* loss

* in_diff

* LossOpFixInDiffHasLossInstanceNum

* remove need_do_loss_instance_num

* move to FixInDiffBlobDescs

* remove

* loss_instance_num use float

* refine

* Boxing ForwardLossInstance

* fix

* fix loss

* fix

* refine

* fix

* refine

* refine

* impl reduce mean

* Dev all reduce ctrl edge (#1558)

* mem shared for mdupdt first in regst and md diff add regst

* feat: add ReduceInplaceIdentity LogicalNode/TaskNode/Op/Kernel

* nccl reduce ctrl edge

* MayConsumeModelDiff

* fix diff add regst size in MdUpdt task node as same as in regst

* eager_reduce_ratio

* mem sharing for ReduceIdentity

* ReduceInplaceIdentity => ReduceIdentity

* reduce ctrl edge supports for arbitrary placement

* refine ChainLogicalGraph::IsLogicalNodeMergeable

* model name (#1561)

* Dev gather refine (#1517)

* gather op index support all int type and axis

* out=in

* reformat

* negative axis

* LookupKernel=>GatherKernel

* reformat

* refine

* axis

* refine & bugfix

* remove ConstScalar and ConstRange (#1526)

* Refine range initializer (#1523)

* support axis

* refine naming

* fix before_dim_size

* doc

* refine

* refine naming

* refine naming

* VariableLogicalNode

* identity (#1563)

* total_instance_num use naive mdupdt (#1564)

* patch by hand from faster_rcnn

* revert LogicalVariableOp

* Dev clone boxing (#1566)

* identity

* reduce clone boxing

* Dev clone boxing (#1568)

* identity

* reduce clone boxing

* tuple identity

* Dev tick (#1571)

* feat: add Tick LogicalNode/TaskNode/Op/Kernel

* feat: remove Tick LogicalNode/TaskNode

* feat: add BldSubTskGphByTickToSource for TickOp

* refine: refine due to comment

* feat: add BldSubTskGphByRecordLoadToTick

* pr tick op/kernel alone

* feat: add TickOp and BldSubTskGphByTickToSource  (#1565)

* feat: add Tick LogicalNode/TaskNode/Op/Kernel

* feat: remove Tick LogicalNode/TaskNode

* feat: add BldSubTskGphByTickToSource for TickOp

* refine: refine due to comment

* feat: add BldSubTskGphByRecordLoadToTick

* refine: refine due to comment

* refine: due to comment

* refine: remove BldSubTskGphByRecordLoadToTick

* fix tick op in dlnet (#1572)

* Dev clip by global norm (#1521)

* clip_by_global_norm

* update

* refine model_update op

* remove useless code

* fix name

* rename clip_norm

* remove useless code

* force init memory and add CHECK()

* remove useless code and add comment

* fixbug

* refine code

* Dev bert profile (#1573)

* 1) refactor reduce_group; 2) add new stream kReduceCtrl

* 1) allreduce and model_update overlapping; 2) allreduce and fw overlapping

* add mdupdt ctrl edges within reduce group (#1575)

* Dev group all reduce by model bytes (#1577)

* group all reduce by model byte size

* mv OpGraph into a separate file op_graph.h

* gelu (#1578)

* Dev bert layer norm (#1574)

* layer norm

* layer_norm

* fix trainable

* fix

* fix trainable

* refine

* Dev bert cuda event sync (#1581)

* cudaSetDevice in actor poller threads

* ReduceConcatCompActor ; NaiveActor

* set dev id (#1583)

* Dev bert profiling (#1586)

* profiling

* all_reduce_* option for performance optimization

* fix a mem sharing bug (#1590)

* Fix mem sharing bug (#1593)

* fix a mem sharing bug

* refine by review

* remove previous if condition

* refine

* Dev profiling adam (#1592)

* profiling

* all_reduce_* option for performance optimization

* faster adam kernel

* Dev refine transpose (#1594)

* profiling

* all_reduce_* option for performance optimization

* faster adam kernel

* refine dropout and transpose

* loss print duration (#1598)

* pseudo chains of OpGraph

* ConvertPseudoChainToChain

* refine pseudo_chain

* refine register coloring algorithm

* rename op_graph log file name

* remove unused code

* Dev bigger chain (#1601)

* pseudo chains of OpGraph

* ConvertPseudoChainToChain

* refine pseudo_chain

* refine register coloring algorithm

* rename op_graph log file name

* remove unused code

* chore: add -gencode in CMakeLists.txt (#1603)

* EnableMemSharingInVariableOp

* no mem_sharing for out_diff & model_diff in variable_op

* Dev mem sharing for variable op (#1604)

* pseudo chains of OpGraph

* ConvertPseudoChainToChain

* refine pseudo_chain

* refine register coloring algorithm

* rename op_graph log file name

* remove unused code

* EnableMemSharingInVariableOp

* no mem_sharing for out_diff & model_diff in variable_op

* refine code

* Fix jxf reduce concat bug (#1606)

* refine logic to infer reduce_concat_op's out blob elem_cnt; still has bugs...

* add RoundUp in reduce_concat

* CHECK_LE -> CHECK_EQ

* add CHECK

* Dev random shuffle (#1607)

* random shuffle

* fix

* refine

* refine

* single thread

* refine

* cmake add half (#1609)

* Bugfix no tick diff (#1614)

* group by has_diff

* rm unnecessary identity

* share model_diff and out_diff in variable op (#1616)

* share model_diff and out_diff in variable op

* bugfix: model_diff is a produced register

* register_num of model_diff is 1

* add VariableKernelConf

* no mutable

* bugfix

* bugfix: set ctrl_regst's return_regst_num (#1617)

* Register coloring with strategies (#1613)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* refine CHECK in AllReduce (#1618)

* refine CHECK in AllReduce

* move ReduceConcatOpCtx definition to .cpp file

* fix fw_consumer nullptr (#1622)

* faster improver (#1628)

* multithreads register coloring (#1630)

* multithreads register coloring

* refine code

* Dev bert accuracy with weight (#1632)

* accuracy

* accuracy_task_node add fw_buf

* fw_buf=>data_tmp

* Dev logical blob dim0 (#1625)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* replace piece_size with logical_blob_dim0

* BlobParallelConf

* BlobParallelDesc

* infer out blob model_split_axis

* int64_t => int32_t

* InferOutBlobParallelDesc

* gather out blob model split (#1624)

* InferBlobParallelDesc

* let variable op support kModelParallel

* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_

* Global<OpGraph>

* SplitLogicalInputBlobDesc

* ConcatOutputBlobDescs

* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel

* OpGraph::CheckBlobDescs(...)

* exact division is unnecessary

* fix bugs

* rename InferOutBlob* => InferOutputBlob

* exact division in variable_op is unnecessary

* bug fix

* fix bugs

* fix bugs

* IsInputBlobAllowedModelSplit

* use Global<OpGraph> to InferModelSize

* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter

* fix IdentityOp::IsInputBlobAllowedModelSplit

* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit

* refine BlobParallelDesc: replace CopyParallelConf with operator=

* refine ParallelDesc: remove unused functions

* more checks on ParallelDesc

* Dev logical blob dim0 (#1635)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* replace piece_size with logical_blob_dim0

* BlobParallelConf

* BlobParallelDesc

* infer out blob model_split_axis

* int64_t => int32_t

* InferOutBlobParallelDesc

* gather out blob model split (#1624)

* InferBlobParallelDesc

* let variable op support kModelParallel

* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_

* Global<OpGraph>

* SplitLogicalInputBlobDesc

* ConcatOutputBlobDescs

* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel

* OpGraph::CheckBlobDescs(...)

* exact division is unnecessary

* fix bugs

* rename InferOutBlob* => InferOutputBlob

* exact division in variable_op is unnecessary

* bug fix

* fix bugs

* fix bugs

* IsInputBlobAllowedModelSplit

* use Global<OpGraph> to InferModelSize

* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter

* fix IdentityOp::IsInputBlobAllowedModelSplit

* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit

* refine BlobParallelDesc: replace CopyParallelConf with operator=

* refine ParallelDesc: remove unused functions

* more checks on ParallelDesc

* remove unused function Operator::MaxModelSplitNum

* bugfix: SoleOp() => op_vec().at(0)

* Dev global op graph (#1636)

* Global<OpGraph> is only available during compilation

* small record_piece_size for InferNoParallelBlobDesc

* Dev op graph piece size (#1637)

* fix a bug in OpGraph::InferNoParallelBlobDesc

* fix a bug in OpGraph::InferNoParallelBlobDesc

* DfsTopoForEachNodeSortByDistanceToSink (#1638)

* Dev jxf bert top k (#1633)

* top_k

* dev top_k op

* refine

* fix bug

* refactor top_k op, cooperate with gather op to get values now

* customized TOPK_KERNEL_ENTRY in auto factory

* batch gather op

* refine

* Backup: batch_gather op, pass compile

* fix bugs, pass the test

* fix missing newline at the end of file

* const

* refine by review

* fix bugs

* rename: instance_dim -> instance_size

* remove a blank line

* refine coding style by Juncheng's suggestions, Bravo

* refine top_k

* more refine

* compatible with new model parallel

* refine

* rename

* cpu only in top_k

* Dev model boxing (#1639)

* mem_shared_hint_id

* sharable memory block

* rm useless code

* remove useless code

* bugfix: no redundant edges

* rename: MemBlockGroup => MemBlock

* put constructor of SharableMemBlockNode into header file

* bugfix

* rename field: MemBlock.block_id => MemBlock.mem_block_id

* replace piece_size with logical_blob_dim0

* BlobParallelConf

* BlobParallelDesc

* infer out blob model_split_axis

* int64_t => int32_t

* InferOutBlobParallelDesc

* gather out blob model split (#1624)

* InferBlobParallelDesc

* let variable op support kModelParallel

* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_

* Global<OpGraph>

* SplitLogicalInputBlobDesc

* ConcatOutputBlobDescs

* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel

* OpGraph::CheckBlobDescs(...)

* exact division is unnecessary

* fix bugs

* rename InferOutBlob* => InferOutputBlob

* exact division in variable_op is unnecessary

* bug fix

* fix bugs

* fix bugs

* IsInputBlobAllowedModelSplit

* use Global<OpGraph> to InferModelSize

* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter

* fix IdentityOp::IsInputBlobAllowedModelSplit

* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit

* refine BlobParallelDesc: replace CopyParallelConf with operator=

* refine ParallelDesc: remove unused functions

* more checks on ParallelDesc

* remove unused function Operator::MaxModelSplitNum

* BlobParallelDesc::EquivalentTo

* LogicalNode::main_model_parallel_ is out of date

* refine Operator: replace IsElemWiseOp with IsSoleInputBlobAllowedModelSplit

* refine transpose conf

* fix a bug in Operator::FixParallelDesc

* InferInputBlobModelSplitAxis

* BlobParallelType

* more default behaviors for Operator::InferInputOutputBlobParallelType

* op_parallel_signature

* rename: BlobParallelType => LogicalBlobParallelDesc

* OpGraph::InferLogicalBlobParallelDesc

* refactor SplitLogicalInputBlobDesc by LogicalBlobParallelDesc

* refine OpNode::ConcatBlobDesc By LogicalBlobParallelDesc

* OpNode::lbi2model_split_axis_

* OpGraph::GetBalancedSplitter

* replace OpGraph::GetBlobParallelDesc4Lbi with OpGraph::GetLbpd4Lbi

* rm BlobParallelDesc in OpGraph

* VariableOp::InitOpParallelSignatures

* rm BlobParallelDesc

* rename Make*ParallelSignature functions

* MakeOpParallelSignature_DS_MC_2_DS

* MakeOpParallelSignature_DC_MS_2_MS

* BiasAddOp::InitOpParallelSignatures

* refine MakeOpParallelSignature_DC_MS_2_MS

* MatmulOp::InitOpParallelSignatures

* GatherOp::InitOpParallelSignatures

* bugfix: model_split_axis cannot equal -1 when parallel_policy is kModelParallel

* refactor: bn2parallel_id2blob_desc => lbi2parallel_id2blob_desc

* refine OpNode

* LogicalBlobParallelConf

* LogicalBlobParallelDesc::DualLbpd

* 1) merge dev_bert;
2) placement.proto not used in logical_blob_parallel_conf.proto

* bugfix: 1) remove CHECK(has_model) in Operator::NaiveInitOpParallelSignatures; 2) lbpd->set_parallel_num(val)

* fix bugs in GatherOp::InitOpParallelSignatures and BroadcastBinaryOp::InitOpParallelSignatures

* refactor: InitOpParallelSignatures => GetOpParallelSignatures

* refactor: const OpParallelSignature => std::unique_ptr<const OpParallelSignature>

* rm LogicalBlobParallelConf

* refactor: ModelSplitAxis4BnInOp => LbpdHint4BnInOp

* fix bugs about LbpdHint

* simplify the interface of InferInputOutputBlobLogicalBlobParallelDescIf

* rename Class CloneParallel => BroadcastParallel

* rename field: clone_parallel => broadcast_parallel

* refactor LbpdHint by SbpParallel

* InferIsModelBlob4OutputBlobsIf

* remove field LogicalBlobParallelDesc::parallel_num

* rename: LogicalBlobParallelDesc => SbpParallel

* rename: LbpdHint => SbpInferHint

* simplify interface Operator::InferOutputBlobSbpInferHint

* rename api: Operator::InferBlobSbpInferHintIf => Operator::InferOuputBlobsSbpInferHintIf

* OpGraph::InferIsModelBlob

* rename file: logical_blob_parallel_desc.* => sbp_parallel.*

* rename filename: lbpd_hint* => sbp_infer_hint*

* rename field: SbpInferHint::has_data_split => SbpInferHint::is_data_split

* rename fields: SbpInferHint::is_data_split, is_model_split, is_data_partial_sum, is_model_broadcast

* refactor SbpInferHint::split_axis

* LambdaOpParallelSignature

* replace function MakeVariableOpDataSplitOpParallelSignature with class VariableOpDataSplitOpParallelSignature

* replace function MakeVariableOpModelSplitOpParallelSignature with class VariableOpModelSplitOpParallelSignature

* BroadcastBinaryOpParallelSignature

* Matmul_DMS_MS_2_P_OpParallelSignature

* Gather_DC_MS_2_P_OpParallelSignature

* class DataSplitOpParallelSignature

* class ModelBroadcastOpParallelSignature

* class DS_MC_2_DS_OpParallelSignature

* add field OpParallelSignature::op_

* refactor: ModelSplitAxis => OutputBlobModelSplitAxis

* remove Operator::InferOuputBlobsSbpInferHintIf

* implement MatmulOp::OutputBlobModelSplitAxis

* implement GatherOp::OutputBlobModelSplitAxis

* implement TransposeOp::OutputBlobModelSplitAxis and BiasAddOp::OutputBlobModelSplitAxis

* add method OpGraph::IsDataBlob

* refactor OpGraph::InferSbpParallel

* refactor class SbpInferHint

* rename local variable: SbpInferHint4BnInOp => SbpInferHint4Ibn

* refactor MakeModelSplitOpParallelSignature

* refactor Make_DC_MS_2_MS_OpParallelSignature

* remove unused class LambdaOpParallelSignature; refactor class name '*Clone*' => '*Broadcast*'

* bugfix: Operator::OutputBlobModelSplitAxis for sole-ibn op

* fix bugs in SbpInferHint::has_split_axis(), SbpInferHint::split_axis and OpNode::IsModelBlob4Lbi

* refactor class SbpInferHint: replace split_axis_ with sbp_parallel_

* refactor by SbpInferHint::sbp_parallel

* 1) rename OpNode data member; 2) rm unused proto

* fix clone (#1641)

* OpGraph::GetBlobDataType (#1643)

* OpGraph::GetBlobDataType

* refine OpGraph::GetBlobDataType

* IdentityOp => TupleIdentityOp (#1644)

* Dev sbp parallel cast (#1646)

* add SbpParallelCastOp

* only SplitParallel and BroadcastParallel can be user customized

* rename: SbpParallelCastOp => ParallelCastOp

* build boxing_conf by sbp_parallel

* fix a bug in BroadcastBinaryOpParallelSignature

* support broadcast_parallel for sole-ibn op

* 1) build boxing_op_conf by sbp_parallel for tuple_identity_op;
2) no op parallel desc fix for kModelParallel;
3) fix a bug in TaskGraph::EnableMemSharingInVariableOp
4) add TupleIdentityOpParallelSignature

* fix bug in IsModelParallel121 (#1648)

* merge develop

* merge develop (#1649)
parent 8009a404
Showing 116 additions and 56 deletions
@@ -40,6 +40,11 @@ if (WIN32)
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /D_ITERATOR_DEBUG_LEVEL=0")
else()
list(APPEND CUDA_NVCC_FLAGS -std=c++11 -w -Wno-deprecated-gpu-targets)
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_30,code=\"sm_30,compute_30\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_52,code=\"sm_52,compute_52\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_60,code=\"sm_60,compute_60\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_61,code=\"sm_61,compute_61\")
list(APPEND CUDA_NVCC_FLAGS -gencode arch=compute_70,code=\"sm_70,compute_70\")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -Wall -Wno-sign-compare -Wno-unused-function")
if (RELEASE_VERSION)
list(APPEND CUDA_NVCC_FLAGS -O3)
......
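
Each -gencode line above embeds both native SASS for that sm_XX target and matching PTX, so the resulting binary runs natively on compute capability 3.0/5.2/6.0/6.1/7.0 GPUs and can JIT from PTX elsewhere. As a quick sanity check, a standalone sketch (not part of this diff) that reports which capability each visible device actually has:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  int device_count = 0;
  if (cudaGetDeviceCount(&device_count) != cudaSuccess) {
    std::printf("no usable CUDA device\n");
    return 0;
  }
  for (int i = 0; i < device_count; ++i) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, i);
    // e.g. "compute capability 6.1" should be covered by the -gencode list above
    std::printf("device %d: %s, compute capability %d.%d\n", i, prop.name, prop.major, prop.minor);
  }
  return 0;
}
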
# main cpp
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/oneflow/core/job/oneflow.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/oneflow/core/ndarray/ndarray_reduce_test.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/tools/gen_resnet.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/tools/gen_alexnet.cpp)
list(APPEND of_main_cc ${PROJECT_SOURCE_DIR}/tools/gen_googlenet.cpp)
......
@@ -12,6 +12,7 @@ include(libjpeg-turbo)
include(opencv)
include(eigen)
include(cocoapi)
include(half)
if (BUILD_CUDA)
set(CUDA_SEPARABLE_COMPILATION ON)
@@ -90,6 +91,7 @@ set(oneflow_third_party_dependencies
eigen
cocoapi_copy_headers_to_destination
cocoapi_copy_libs_to_destination
half_copy_headers_to_destination
)
include_directories(
@@ -104,6 +106,7 @@ include_directories(
${OPENCV_INCLUDE_DIR}
${EIGEN_INCLUDE_DIR}
${COCOAPI_INCLUDE_DIR}
${HALF_INCLUDE_DIR}
)
if (BUILD_CUDA)
@@ -124,3 +127,5 @@ if (BUILD_CUDA)
${NCCL_INCLUDE_DIR}
)
endif()
add_definitions(-DHALF_ENABLE_CPP11_USER_LITERALS=0)
include (ExternalProject)
set(HALF_INCLUDE_DIR ${THIRD_PARTY_DIR}/half/include)
set(HALF_URL https://cfhcable.dl.sourceforge.net/project/half/half/1.12.0/half-1.12.0.zip)
set(HALF_BASE_DIR ${CMAKE_CURRENT_BINARY_DIR}/half/src/half)
set(HALF_HEADERS
"${HALF_BASE_DIR}/include/half.hpp"
)
if(BUILD_THIRD_PARTY)
ExternalProject_Add(half
PREFIX half
URL ${HALF_URL}
UPDATE_COMMAND ""
CONFIGURE_COMMAND ""
BUILD_COMMAND ""
BUILD_IN_SOURCE 1
INSTALL_COMMAND ""
)
add_custom_target(half_create_header_dir
COMMAND ${CMAKE_COMMAND} -E make_directory ${HALF_INCLUDE_DIR}
DEPENDS half)
add_custom_target(half_copy_headers_to_destination
DEPENDS half_create_header_dir)
foreach(header_file ${HALF_HEADERS})
add_custom_command(TARGET half_copy_headers_to_destination PRE_BUILD
COMMAND ${CMAKE_COMMAND} -E copy_if_different ${header_file} ${HALF_INCLUDE_DIR})
endforeach()
endif(BUILD_THIRD_PARTY)
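
The half library fetched above is header-only and provides half_float::half. A minimal usage sketch (not part of this commit); note that the add_definitions(-DHALF_ENABLE_CPP11_USER_LITERALS=0) line earlier disables the library's _h literal, so values are constructed from float explicitly:

#include <iostream>
#include "half.hpp"

int main() {
  half_float::half a(1.5f);
  half_float::half b(2.25f);
  half_float::half c = a + b;  // 16-bit storage; arithmetic goes through float
  std::cout << static_cast<float>(c) << std::endl;  // prints 3.75
  return 0;
}
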
@@ -193,8 +193,11 @@ int Actor::HandlerNormal(const ActorMsg& msg) {
Regst* regst = msg.regst();
if (naive_consumed_rs_.HasRegstDescId(regst->regst_desc_id())) {
CHECK_EQ(0, naive_consumed_rs_.TryPushBackRegst(regst));
NormalProcessNaiveReadableRegstMsg(
naive_consumed_rs_.RegstDeq4RegstDescId(regst->regst_desc_id()));
const auto& rdeq = naive_consumed_rs_.RegstDeq4RegstDescId(regst->regst_desc_id());
CHECK(rdeq.empty() == false);
if (rdeq.front()->regst_desc()->regst_desc_type().has_data_regst_desc()) {
NormalProcessNaiveReadableDataRegstMsg(rdeq);
}
} else if (TryUpdtStateAsProducedRegst(regst) == 0) {
// do nothing
} else {
@@ -325,8 +328,9 @@ void Actor::AsyncSendConsumedCtrlRegstMsgToProducer() {
CHECK_GE(reg_deq.size(), returned_regst_num);
for (size_t i = 0; i < returned_regst_num; ++i) {
Regst* regst = reg_deq.at(i);
AsyncSendMsg(ActorMsg::BuildRegstMsgToProducer(actor_id_, regst->producer_actor_id(), regst));
// must access regst before sending it to producer
regst_desc_ids.push_back(regst->regst_desc_id());
AsyncSendMsg(ActorMsg::BuildRegstMsgToProducer(actor_id_, regst->producer_actor_id(), regst));
}
});
naive_consumed_rs_.PopFrontRegsts(regst_desc_ids);
@@ -452,8 +456,10 @@ void Actor::AsyncSendRegstMsgToProducer(Regst* regst) {
}
void Actor::AsyncSendRegstMsgToProducer(Regst* regst, int64_t producer) {
// must access regst before sending it to producer
int64_t regst_desc_id = regst->regst_desc_id();
AsyncSendMsg(ActorMsg::BuildRegstMsgToProducer(actor_id_, producer, regst));
naive_consumed_rs_.TryPopFrontRegst(regst->regst_desc_id());
naive_consumed_rs_.TryPopFrontRegst(regst_desc_id);
}
Regst* Actor::GetSoleProducedRegst4RegstDescId(int64_t regst_desc_id) {
......
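
The two "must access regst before sending" hunks above enforce one rule: once a regst message is handed back to its producer, the producer may recycle the register concurrently, so any field still needed (here the regst_desc_id) must be read before AsyncSendMsg. A minimal sketch of the pattern, with hypothetical stand-in types:

#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for Regst and message sending, only to show the ordering.
struct Regst { int64_t regst_desc_id; };

// After this returns, the producer side may rewrite the regst at any time.
void AsyncSendToProducer(Regst* regst) { regst->regst_desc_id = -1; }

int64_t ReturnRegst(Regst* regst) {
  int64_t regst_desc_id = regst->regst_desc_id;  // copy out what is still needed
  AsyncSendToProducer(regst);                    // ownership is gone from here on
  return regst_desc_id;                          // use the saved copy, never *regst
}

int main() {
  Regst r{42};
  std::printf("%lld\n", static_cast<long long>(ReturnRegst(&r)));  // prints 42, not -1
  return 0;
}
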
@@ -128,7 +128,7 @@ class Actor {
// Process Msg
virtual void NormalProcessCustomizedEordMsg(const ActorMsg&) {}
virtual void NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>&) {}
virtual void NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>&) {}
virtual void NormalProcessCustomizedReadableRegstMsg(const ActorMsg&) { UNIMPLEMENTED(); }
virtual bool NormalTryProcessReadableMsgFromOtherMachine(const ActorMsg&) { return false; }
int TryUpdtStateAsProducedRegst(Regst* regst);
......
@@ -9,7 +9,7 @@ void BoxingActor::VirtualActorInit(const TaskProto& task_proto) {
OF_SET_MSG_HANDLER(&BoxingActor::HandlerNormal);
}
void BoxingActor::NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>& rq) {
void BoxingActor::NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>& rq) {
if (rq.back()->packed_blob()->max_col_num() > 1 && col_id_order_ == ColIdOrder::kUnCertain) {
TrySetColIdOrder(rq.back());
}
......
@@ -14,7 +14,7 @@ class BoxingActor final : public Actor {
void VirtualActorInit(const TaskProto&) override;
private:
void NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>&) override;
void NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>&) override;
void Act() override;
void VirtualAsyncSendNaiveProducedRegstMsgToConsumer();
void VirtualAsyncSendNaiveConsumedRegstMsgToProducer();
......
@@ -9,9 +9,13 @@ class LossPrintCompActor final : public SinkCompActor {
public:
OF_DISALLOW_COPY_AND_MOVE(LossPrintCompActor);
LossPrintCompActor() = default;
~LossPrintCompActor() = default;
~LossPrintCompActor() override = default;
private:
void VirtualSinkCompActorInit(const TaskProto&) override { timestamp_ = 0; }
void* NewOther() override { return &timestamp_; }
double timestamp_ = 0;
};
} // namespace oneflow
......
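
The new timestamp_ member backs the "loss print duration" change from the log above; NewOther() hands its address to the kernel. The real Act()/kernel code is not in this diff, so the following is only a rough sketch of the idea, reporting the wall-clock gap between consecutive loss prints:

#include <chrono>
#include <cstdio>

struct LossPrintSketch {
  double timestamp_ = 0;

  void Print(double loss) {
    using namespace std::chrono;
    double now = duration<double>(steady_clock::now().time_since_epoch()).count();
    if (timestamp_ > 0) {
      std::printf("loss: %f (%.3f s since last print)\n", loss, now - timestamp_);
    } else {
      std::printf("loss: %f\n", loss);
    }
    timestamp_ = now;
  }
};

int main() {
  LossPrintSketch printer;
  printer.Print(0.9);
  printer.Print(0.7);
  return 0;
}
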
@@ -12,4 +12,6 @@ void NaiveActor::VirtualAsyncSendNaiveProducedRegstMsgToConsumer() {
});
}
REGISTER_ACTOR(TaskType::kReduceIdentity, NaiveActor);
} // namespace oneflow
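
The REGISTER_ACTOR line lets kReduceIdentity tasks reuse NaiveActor directly. Registration macros of this kind usually expand to a static registrar object that maps a task type to an actor factory; a self-contained sketch (all names hypothetical, not OneFlow's actual implementation):

#include <functional>
#include <memory>
#include <unordered_map>

struct Actor { virtual ~Actor() = default; };  // hypothetical base class

using ActorCreator = std::function<std::unique_ptr<Actor>()>;

inline std::unordered_map<int, ActorCreator>& ActorRegistry() {
  static std::unordered_map<int, ActorCreator> registry;
  return registry;
}

struct ActorRegistrar {
  ActorRegistrar(int task_type, ActorCreator creator) {
    ActorRegistry()[task_type] = std::move(creator);
  }
};

#define ACTOR_CAT_IMPL(a, b) a##b
#define ACTOR_CAT(a, b) ACTOR_CAT_IMPL(a, b)
// One actor class may serve several task types, so the registrar object's
// name is keyed on __COUNTER__ rather than on the class name.
#define REGISTER_ACTOR_SKETCH(task_type, ActorType)                 \
  static ActorRegistrar ACTOR_CAT(g_actor_registrar_, __COUNTER__)( \
      (task_type), [] { return std::make_unique<ActorType>(); })
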
@@ -33,7 +33,7 @@ void NormalBackwardCompActor::ForEachCurCustomizedReadableRegst(
}
}
void NormalBackwardCompActor::NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>& rq) {
void NormalBackwardCompActor::NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>& rq) {
if (rq.size() == 1 && rq.front()->regst_desc_id() == any_out_diff_regst_desc_id_) {
AsyncReturnModelRegstUntilModelVersionIdEqual(
GetModelVersionIdFromPieceId(rq.front()->piece_id(), actual_num_of_piece_in_batch_));
......
@@ -15,7 +15,7 @@ class NormalBackwardCompActor final : public CompActor {
private:
void ForEachCurCustomizedReadableRegst(std::function<void(const Regst*)>) const override;
void NormalProcessNaiveReadableRegstMsg(const std::deque<Regst*>&) override;
void NormalProcessNaiveReadableDataRegstMsg(const std::deque<Regst*>&) override;
void NormalProcessCustomizedReadableRegstMsg(const ActorMsg&) override;
void Act() override;
bool IsCustomizedReadReady() override;
......
@@ -2,16 +2,6 @@
namespace oneflow {
void ReduceConcatCompActor::VirtualCompActorInit(const TaskProto& proto) {
InputWiseCompActor::Init(proto);
}
void ReduceConcatCompActor::SetKernelCtxOther(void** other) {
int64_t in_bn_id = InBnId4RegstDescId(cur_processed_regst_desc_id());
other_val_ = std::make_pair(in_bn_id, EnableInplace());
*other = static_cast<void*>(&other_val_);
}
REGISTER_ACTOR(TaskType::kReduceConcat, ReduceConcatCompActor);
} // namespace oneflow
#ifndef ONEFLOW_CORE_ACTOR_REDUCE_CONCAT_COMPUTE_ACTOR_H_
#define ONEFLOW_CORE_ACTOR_REDUCE_CONCAT_COMPUTE_ACTOR_H_
#include "oneflow/core/actor/input_wise_compute_actor.h"
#include "oneflow/core/actor/naive_actor.h"
namespace oneflow {
class ReduceConcatCompActor final : public InputWiseCompActor {
class ReduceConcatCompActor final : public NaiveActor {
public:
OF_DISALLOW_COPY_AND_MOVE(ReduceConcatCompActor);
ReduceConcatCompActor() = default;
~ReduceConcatCompActor() = default;
private:
void VirtualCompActorInit(const TaskProto& proto) override;
void SetKernelCtxOther(void** other) override;
std::pair<int64_t, bool> other_val_;
};
} // namespace oneflow
......
@@ -2,15 +2,6 @@
namespace oneflow {
void ReduceSplitCompActor::VirtualCompActorInit(const TaskProto& proto) {
InputWiseCompActor::Init(proto);
}
void ReduceSplitCompActor::SetKernelCtxOther(void** other) {
other_val_ = EnableInplace();
*other = static_cast<void*>(&other_val_);
}
REGISTER_ACTOR(TaskType::kReduceSplit, ReduceSplitCompActor);
} // namespace oneflow
#ifndef ONEFLOW_CORE_ACTOR_REDUCE_SPLIT_COMPUTE_ACTOR_H_
#define ONEFLOW_CORE_ACTOR_REDUCE_SPLIT_COMPUTE_ACTOR_H_
#include "oneflow/core/actor/input_wise_compute_actor.h"
#include "oneflow/core/actor/naive_actor.h"
namespace oneflow {
class ReduceSplitCompActor final : public InputWiseCompActor {
class ReduceSplitCompActor final : public NaiveActor {
public:
OF_DISALLOW_COPY_AND_MOVE(ReduceSplitCompActor);
ReduceSplitCompActor() = default;
~ReduceSplitCompActor() = default;
private:
void VirtualCompActorInit(const TaskProto& proto) override;
void SetKernelCtxOther(void** other) override;
bool other_val_;
};
} // namespace oneflow
......
@@ -114,13 +114,13 @@ void EpollCommNet::InitSockets() {
((this_machine.data_port_agent() != -1) ? (this_machine.data_port_agent())
: (this_listen_port)));
} else {
for (this_listen_port = 1024; this_listen_port < MaxVal<uint16_t>(); ++this_listen_port) {
for (this_listen_port = 1024; this_listen_port < GetMaxVal<uint16_t>(); ++this_listen_port) {
if (SockListen(listen_sockfd, this_listen_port, total_machine_num) == 0) {
PushPort(this_machine_id, this_listen_port);
break;
}
}
CHECK_LT(this_listen_port, MaxVal<uint16_t>());
CHECK_LT(this_listen_port, GetMaxVal<uint16_t>());
}
int32_t src_machine_count = 0;
......
@@ -65,7 +65,7 @@ void IBVerbsQP::Connect(const IBVerbsConnectionInfo& peer_info) {
qp_attr.ah_attr.grh.dgid.global.interface_id = peer_info.interface_id();
qp_attr.ah_attr.grh.flow_label = 0;
qp_attr.ah_attr.grh.sgid_index = 0;
qp_attr.ah_attr.grh.hop_limit = MaxVal<uint8_t>();
qp_attr.ah_attr.grh.hop_limit = GetMaxVal<uint8_t>();
qp_attr.ah_attr.dlid = peer_info.lid();
qp_attr.ah_attr.sl = 0;
qp_attr.ah_attr.src_path_bits = 0;
......
@@ -8,14 +8,16 @@
namespace oneflow {
#define BLAS_NAME_SEQ \
OF_PP_MAKE_TUPLE_SEQ(dot) \
OF_PP_MAKE_TUPLE_SEQ(swap) \
OF_PP_MAKE_TUPLE_SEQ(copy) \
OF_PP_MAKE_TUPLE_SEQ(axpy) \
OF_PP_MAKE_TUPLE_SEQ(scal) \
OF_PP_MAKE_TUPLE_SEQ(gemv) \
OF_PP_MAKE_TUPLE_SEQ(gemm)
#define BLAS_NAME_SEQ \
OF_PP_MAKE_TUPLE_SEQ(dot) \
OF_PP_MAKE_TUPLE_SEQ(swap) \
OF_PP_MAKE_TUPLE_SEQ(copy) \
OF_PP_MAKE_TUPLE_SEQ(axpy) \
OF_PP_MAKE_TUPLE_SEQ(scal) \
OF_PP_MAKE_TUPLE_SEQ(gemv) \
OF_PP_MAKE_TUPLE_SEQ(gemm) \
OF_PP_MAKE_TUPLE_SEQ(gemmBatched) \
OF_PP_MAKE_TUPLE_SEQ(gemmStridedBatched)
#define CBLAS_TEMPLATE(name) \
template<typename T, typename... Args> \
......
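
CBLAS_TEMPLATE (its body is truncated above) stamps out one type-dispatching wrapper per name in BLAS_NAME_SEQ; gemmBatched and gemmStridedBatched are appended here for the batched matmul kernels mentioned in the log. A hand-expanded sketch of the dispatch pattern for a single name, assuming standard CBLAS rather than the exact generated shape:

#include <cblas.h>

// What a per-name dispatch macro effectively produces for "axpy": the element
// type selects between the s/d CBLAS entry points.
template<typename T>
struct CblasAxpy;

template<>
struct CblasAxpy<float> {
  static void Call(int n, float alpha, const float* x, int incx, float* y, int incy) {
    cblas_saxpy(n, alpha, x, incx, y, incy);
  }
};

template<>
struct CblasAxpy<double> {
  static void Call(int n, double alpha, const double* x, int incx, double* y, int incy) {
    cblas_daxpy(n, alpha, x, incx, y, incy);
  }
};

int main() {
  float x[3] = {1.f, 2.f, 3.f};
  float y[3] = {0.f, 0.f, 0.f};
  CblasAxpy<float>::Call(3, 2.0f, x, 1, y, 1);  // y = 2*x -> {2, 4, 6}
  return 0;
}
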
@@ -95,6 +95,37 @@ TRAIT_CONST_VAR(One, 1);
#undef TRAIT_CONST_VAR
template<typename T>
struct MaxVal;
template<typename T>
struct MinVal;
#define TRAIT_LIMIT_VAL(max_or_min, T, limit_value) \
template<> \
struct max_or_min##Val<T> final { \
static_assert(alignof(int) == alignof(int32_t), "int32_t should be exactly int"); \
static_assert(alignof(long long) == alignof(int64_t), "int64_t should be exactly long long"); \
constexpr static T value = limit_value; \
}
TRAIT_LIMIT_VAL(Max, int8_t, CHAR_MAX);
TRAIT_LIMIT_VAL(Max, int32_t, INT_MAX);
TRAIT_LIMIT_VAL(Max, uint32_t, UINT_MAX);
TRAIT_LIMIT_VAL(Max, int64_t, LLONG_MAX);
TRAIT_LIMIT_VAL(Max, uint64_t, ULLONG_MAX);
TRAIT_LIMIT_VAL(Max, float, FLT_MAX);
TRAIT_LIMIT_VAL(Max, double, DBL_MAX);
TRAIT_LIMIT_VAL(Min, int8_t, CHAR_MIN);
TRAIT_LIMIT_VAL(Min, int32_t, INT_MIN);
TRAIT_LIMIT_VAL(Min, uint32_t, 0);
TRAIT_LIMIT_VAL(Min, int64_t, LLONG_MIN);
TRAIT_LIMIT_VAL(Min, uint64_t, 0);
TRAIT_LIMIT_VAL(Min, float, -FLT_MAX);
TRAIT_LIMIT_VAL(Min, double, -DBL_MAX);
#undef TRAIT_LIMIT_VAL
// Func
bool IsIntegralDataType(DataType data_type);
......
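
Per the log, this commit renames the old MaxVal<T>() function to GetMaxVal<T>() and the MaxValue<T>::value trait to MaxVal<T>::value (see the EpollCommNet and IBVerbsQP hunks above). A minimal sketch of how the two fit together; GetMaxVal's real definition is outside this diff, so the helper below is an assumption:

#include <climits>
#include <cstdint>
#include <cstdio>

template<typename T>
struct MaxVal;  // specialized per type, as in the TRAIT_LIMIT_VAL block above

template<>
struct MaxVal<int32_t> {
  constexpr static int32_t value = INT_MAX;
};

// Assumed shape of the renamed helper: a constexpr function over the trait.
template<typename T>
constexpr T GetMaxVal() {
  return MaxVal<T>::value;
}

int main() {
  std::printf("%d\n", GetMaxVal<int32_t>());  // 2147483647
  return 0;
}
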