- Aug 06, 2021
Juncheng authored
- Aug 05, 2021
Juncheng authored
* comm_net_sequence_number
* remove piece_id
* Remove IsAllowedActor
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
Juncheng authored
* Add env ONEFLOW_THREAD_LOCAL_MESSAGE_QUEUE_ENABLE
* refine GetGlobalWorkStreamId
* refine name
* refine
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
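The env flag above gates a thread-local message queue: a worker thread batches actor messages locally and takes the shared queue's lock once per flush instead of once per message. A minimal sketch of that pattern (types and names here are illustrative, not OneFlow's actual classes):

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct ActorMsg { int64_t dst_actor_id; /* payload elided */ };

class SharedMsgQueue {
 public:
  void EnqueueBatch(std::vector<ActorMsg>&& batch) {
    std::lock_guard<std::mutex> lock(mutex_);
    for (const ActorMsg& msg : batch) { queue_.push_back(msg); }
  }

 private:
  std::mutex mutex_;
  std::deque<ActorMsg> queue_;
};

// Each worker thread buffers messages locally and grabs the shared lock once
// per flush rather than once per message.
class ThreadLocalMsgBuffer {
 public:
  explicit ThreadLocalMsgBuffer(SharedMsgQueue* shared) : shared_(shared) {}

  void Enqueue(const ActorMsg& msg) {
    local_.push_back(msg);
    if (local_.size() >= kFlushThreshold) { Flush(); }
  }
  void Flush() {
    if (local_.empty()) { return; }
    shared_->EnqueueBatch(std::move(local_));
    local_.clear();  // moved-from vector is reset for reuse
  }

 private:
  static constexpr std::size_t kFlushThreshold = 64;
  SharedMsgQueue* shared_;
  std::vector<ActorMsg> local_;
};
```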
- Jul 30, 2021
cheng cheng authored
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
- Feb 20, 2021
leaves-zwx authored
* rm LocalWorkStreamId
* rm AllocateLocalWorkStreamId in TaskNode
* rm local work stream id in task node and commnet task node
* rm local_work_stream_id param in NewTaskId
* fix test
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
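With local_work_stream_id gone, NewTaskId presumably packs fewer fields into the task id. A hypothetical bit-layout sketch (the field widths and order are illustrative, not the real encoding):

```cpp
#include <cstdint>

// Hypothetical 64-bit task id layout: | node (16) | thread (16) | seq (32) |
inline uint64_t NewTaskId(uint64_t node_id, uint64_t thrd_id, uint64_t seq) {
  return (node_id << 48) | (thrd_id << 32) | seq;
}
inline uint64_t NodeId(uint64_t task_id) { return task_id >> 48; }
inline uint64_t ThrdId(uint64_t task_id) { return (task_id >> 32) & 0xFFFFu; }
inline uint64_t SeqId(uint64_t task_id) { return task_id & 0xFFFFFFFFu; }
```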
- Dec 28, 2020
Juncheng authored
- Nov 03, 2020
Juncheng authored
Co-authored-by: oneflow-bot <69100618+oneflow-bot@users.noreply.github.com>
- Oct 30, 2020
Li Xinqi authored
* Add Repeat/Acc user op
* ssp_variable_proxy op/kernel/task_node
* SspVariableProxyActor
* bind inplace relation between var and ref
* fix ssp_variable_proxy bugs
* address comments
* Update test_ssp_variable_proxy.py
Co-authored-by: liujuncheng <liujuncheng1022@gmail.com>
Co-authored-by: Shenghang Tsai <jackalcooper@gmail.com>
Co-authored-by: oneflow-bot <69100618+oneflow-bot@users.noreply.github.com>
- Oct 10, 2020
Juncheng authored
* AutoRegistrationFactory add key type
* AutoRegistrationFactory add creators accessors (#3662)
* AutoRegistrationFactory add creators accessors
* refine
* const
Co-authored-by: oneflow-bot <69100618+oneflow-bot@users.noreply.github.com>
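A key-typed AutoRegistrationFactory is the classic self-registering factory: creators register themselves under a key during static initialization, and the added accessors expose the creator map read-only. A minimal sketch of the idea, assuming a simplified interface rather than OneFlow's actual one:

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <string>
#include <unordered_map>

template<typename Key, typename Base>
class AutoRegistrationFactory {
 public:
  using Creator = std::function<std::unique_ptr<Base>()>;

  static AutoRegistrationFactory& Get() {
    static AutoRegistrationFactory instance;  // one registry per (Key, Base)
    return instance;
  }
  void Register(const Key& key, Creator creator) { creators_[key] = std::move(creator); }
  std::unique_ptr<Base> New(const Key& key) const { return creators_.at(key)(); }
  // Read-only accessor over registered creators, in the spirit of the
  // commit's "creators accessors".
  const std::unordered_map<Key, Creator>& creators() const { return creators_; }

 private:
  std::unordered_map<Key, Creator> creators_;
};

struct Kernel { virtual ~Kernel() = default; virtual const char* name() const = 0; };
struct GeluKernel : Kernel { const char* name() const override { return "gelu"; } };

// A static registrar runs before main(), so lookup works at program start.
static const bool kGeluRegistered = [] {
  AutoRegistrationFactory<std::string, Kernel>::Get().Register(
      "gelu", [] { return std::make_unique<GeluKernel>(); });
  return true;
}();

int main() {
  auto k = AutoRegistrationFactory<std::string, Kernel>::Get().New("gelu");
  std::cout << k->name() << "\n";
}
```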
- Jul 23, 2020
Shenghang Tsai authored
* add license at root dir
* check in empty files
* rm space
* check in script
* update script
* fix bug
* add print
* fix
* add exit
* add to of_format
* add CI task
* fix license
* Revert "fix license" (reverts commit 818b6d7691d3a8b4a25dd41a47ff2c5922b8ec57)
* only add once
* quick fix
* fix script
* don't fmt empty file
* fix
* quick fix
* fix py
* add license
* fix exit
* add license for hpp
* add license
* license new vm files
Co-authored-by: tsai <caishenghang@oneflow.org>
- Dec 09, 2019
Juncheng authored
- Dec 06, 2019
Juncheng authored
* rm model_version_id
* rm model_version_id
* remove model vid
- Nov 29, 2019
Niu Chong authored
- Sep 24, 2019
Niu Chong authored
* Dev actor msg queue (#2225)
* async msg queue
* EnqueueAsyncMsg
* Merge wnd python (#2226)
* not ready yet
* segment fix
* fix segment_sum bugs
* 1st wide_n_deep push
* Fix tick in multi node parallel (#2042)
* check in fixes
* fix by adding boxing method
* register tick op
* move code and add more check
* fix typo
* fix bug when filtering op nodes before adding tick
* fix wheel build not adding .so (#2052)
* color plan dot VERSION-2 (#2045)
* run successfully on single GPU
* fix 121 for tick (#2069)
* delete unnecessary multiply_grad class
* speed up generate time for dot2svg (#2083)
* Add axis conf to bias_add for any axis channel (#2087)
* bias_add completion
* follow comment
* make conf axis required
* Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091); this reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47
* updated
* fix segment_sum_grad
* fix sbp
* fix segment_sum impl for data parallel
* fix
* remove useless code in segment_kernel_util.h
* add python interface
* fix sigmoid conf
* fix naming error
* fix typo
* temp mod loss sbp
* add LazyAdam
* Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
* rm useless code
* unsorted_segment_sum
* refactor sigmoid_cross_entropy_loss_kernel to high performance
* Improve sigmoid cross entropy loss grad (#2207)
* remove for loop called cuda kernel
* minor fix
* ../oneflow/python/ops/data_ops.py (#2209)
* fix lazy_adam
* Merge wnd and python (#2214)
* rm ActivationType from op/kernel (#2205)
* refactor sigmoid_cross_entropy_loss
* fix SigmoidGrad::InferBatchAxis
* support part_name_prefix and part_name_suffix_length (#2208)
* rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
* oneflow.watch for debug
* Dev decode batch size (#2206)
* rm batch_size and piece_size
* merge dev_python
* Update reshape_like_op.cpp (#2213)
* oneflow.parallel (#2211)
* oneflow.parallel
* refactor split_axis => parallel
* rename parallel => distribute
* fix typo: *Parallel => *Distribute
* add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
* merge dev_python
* fix boxing: P->S(0)
* check in docker build scripts (#2216)
* Dev python widedeep docker (#2218)
* check in docker build scripts
* check in .dockerignore
* rm oneflow.segment_sum
* remove segment_sum
* rm unused file
* rm debug code
* rm debug code
* rm double empty lines
* remove useless comments
* fix send msg (#2227)
* fix reduction_coefficient (#2228)
* refactor ndarray for eq/ne/...
* Dev kernel launch synchronized (#2230)
* IsKernelLaunchSynchronized
* virtual
* refine
* refine
* separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
* more static_assert
* remove unused task related dot function (#2236)
* remove unused task related dot function
* do not output dot rank info
* Dev non distributed optimizer js (#2234)
* op&kernel&actor
* job
* job_completer
* graph
* format
* fix pd
* fix
* ignore DelPlacementByOpName
* fix auto tick
* JobBuilder
* fix
* config util
* fix
* fix opgrade
* broadcast tick
* fix allreduce
* balance by model size
* GetSoleOutBlobSize
* async_actor_msg_deque
* group
* AddOrMutOpsOnlyOnce
* fix NcclTupleBroadcastGrad
* order
* set nccl order hint
* op_conf
* grad hint
* NcclTupleBroadcastReduceSequencePass
* add missed mutops
* order fix
* try kMdUpdtArea
* fix nccl_order_hint
* fix
* add ti
* tuple_identity_op
* remove useless
* group
* fix dead lock
* force ctrl in
* sc broadcast
* sort obn
* group nccl
* config group_size_mbyte
* non_distributed_optimizer_group_size_mbyte
* format
* stop check
* rm message sending optimization
* refine lazy adam (#2244)
* refine lazy adam
* update
* memory version 2 step 1: replace original concept about mem sharing (#2242)
* mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing -> enable_reuse_mem
* memory version 2 step 1: replace original concept about mem sharing
* record reader multi thread (#2246)
* multi thread
* ComputeThreadPoolSize
* python api
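Of the ops named above, unsorted_segment_sum is a compact, self-contained algorithm: each input row is accumulated into the output row selected by its segment id, with no ordering requirement on the ids. A CPU sketch (illustrative, not the OneFlow kernel):

```cpp
#include <cassert>
#include <cstdint>
#include <iostream>
#include <vector>

// out[segment_ids[i]][:] += data[i][:], with rows of width `inner`.
void UnsortedSegmentSum(const std::vector<float>& data,
                        const std::vector<int64_t>& segment_ids,
                        int64_t inner, int64_t num_segments,
                        std::vector<float>* out) {
  out->assign(num_segments * inner, 0.f);
  const int64_t rows = static_cast<int64_t>(segment_ids.size());
  assert(static_cast<int64_t>(data.size()) == rows * inner);
  for (int64_t i = 0; i < rows; ++i) {
    const int64_t seg = segment_ids[i];
    assert(seg >= 0 && seg < num_segments);
    for (int64_t j = 0; j < inner; ++j) {
      (*out)[seg * inner + j] += data[i * inner + j];
    }
  }
}

int main() {
  std::vector<float> data = {1, 2, 3, 4, 5, 6};  // 3 rows of width 2
  std::vector<int64_t> ids = {1, 0, 1};          // rows 0 and 2 share segment 1
  std::vector<float> out;
  UnsortedSegmentSum(data, ids, 2, 2, &out);
  for (float v : out) std::cout << v << " ";     // prints: 3 4 6 8
}
```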
- Sep 19, 2019
- Sep 04, 2019
lixinqi authored

Juncheng authored
* assign op; AddGlobalStepOpConf; fix ARITHMETIC_DATA_TYPE_SEQ; identity_op_conf; add ops; GenNewSnapshotName; SnapshotOp; cleanup blob name; LearningRateScheduleOp; LearningRateScheduleKernel; LearningRateScheduleKernel; AddLearningRateScheduleOpConf; learning rate; cleanup; fix; fix
* remove total_mbn_num
* date time format
* save
* refine
* refine
* revert
* refine snapshot
* fix
* refine
* AutoGlobalStep
* refine
* GenLogicalBlobName
* AutoLearningRate
* remove JobDesc lr
* fix snapshot path
* Maybe<void>
* learning_rate blob
* remove next_model_vid; fix; fix; fix learning_rate
* train_conf
* fix for global step on multi nodes
* SnapshotReader; snapshot writer; model init op; fix; refine init; InitializeFromSnapshotConf; model io job; ModelLoadOp; ModelLoadKernel; MakeModelLoadJob; ModelSaveOp; fix; InterUserJobInfo; _MakeModelLoadJobFunc; MutModelLoadOpConTickInputHelper; fix; refine init/load/save; set_default_variable
* remove SnapshotMgr
* snapshot.h
* delete model_init_job.cpp; foreign_input_op_conf; fix snapshot path; set path; op_conf; fix; fix; CopyFromNdarray; to bytes; c use uint8; char2uint8
* model init
* model io
* fix
* ModelSaveKernel
* mutable_batch_axis()->Clear()
* InferBatchAxis
* fix
* refine
* job set
* MakeModelIoJobs
* fix
* jobs
* fix
* model io job
* GenOutputOpConf
* refine snapshot
* refine
* fix
* refine CheckPoint
* remove session
* refine
* refine
* refine
* remove keyword.h/cpp
* refine
* global_step => train_step
* GetSbpSignatures
* ModelInitOp
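The LearningRateScheduleKernel's job is to derive the current learning rate from the train_step introduced here. A sketch of one common schedule, linear warmup followed by exponential decay (the formula is illustrative, not taken from this commit):

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

// Linear warmup for the first `warmup_steps`, then exponential decay.
double LearningRate(int64_t train_step, double base_lr,
                    int64_t warmup_steps, double decay_rate,
                    int64_t decay_steps) {
  if (train_step < warmup_steps) {
    return base_lr * (train_step + 1) / static_cast<double>(warmup_steps);
  }
  const double p = (train_step - warmup_steps) / static_cast<double>(decay_steps);
  return base_lr * std::pow(decay_rate, p);
}

int main() {
  for (int64_t step : {0, 50, 100, 1100}) {
    std::cout << step << ": " << LearningRate(step, 1e-3, 100, 0.9, 1000) << "\n";
  }
}
```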
- Aug 11, 2019
Li Xinqi authored
* unlock critical sections as much as possible
* consumed and produced regsts of actor 'case' are customized
* refine code
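"Unlock critical sections as much as possible" is the standard lock-scope-narrowing move: hold the mutex only while touching shared state and release it before the expensive work. A minimal sketch (Process() is a hypothetical placeholder, not an OneFlow function):

```cpp
#include <mutex>
#include <queue>

void Process(int /*item*/) { /* hypothetical expensive work, done outside the lock */ }

std::mutex mu;
std::queue<int> pending;

void Consume() {
  int item = 0;
  {
    std::lock_guard<std::mutex> lock(mu);  // critical section starts...
    if (pending.empty()) { return; }
    item = pending.front();
    pending.pop();
  }               // ...and ends here, before the slow call below
  Process(item);  // runs with the lock already released
}
```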
- Jun 20, 2019
- Jun 19, 2019
Xinqi authored
- Jun 05, 2019
Juncheng authored
- May 31, 2019
Xinqi authored
- May 26, 2019
Li Xinqi authored
* OpGraph::MakePredicatorIsAllLbiConsumersReachableToOpName
* refactor TaskGraph::EnableInplaceMemSharing
* TaskGraph::GetSafeInplaceOpBlobArgList
* InplaceLbiGraph::ForEachSafeInplaceEdgesInSourceOpSubTree
* fix a typo
* TaskGraph::SetTaskRegstInplaceInfo
* InplaceRegstGraph
* refine
* refine IsLbiOnTaskEdge
* fix a bug in TaskGraph::ForEachDeviceNodes
* ForEachGpuDeviceNodes
* remove wrong CHECK
* fix wrong use of std::unordered_set::erase
* fix a bug in TaskGraph::GetInplaceOpBlobArgList
* fix inplace bugs
* fix error CHECK between inplace in dptr and inplace out dptr
- May 16, 2019
Li Xinqi authored
* InplaceObnGraph
* more checks in InplaceObnGraph::InitNodes
* framework of InplaceObnGraph::ComputeSafeInplaceObns
* refine InplaceObnGraph::ComputeSafeInplaceObns
* replace InplaceObnGraph with InplaceLbiGraph
* fix three types of mut_ref conflicts
* InplaceLbiGraph::FindFirstConstRefConflictMutRefEdge
* fix bugs in InplaceLbiGraph::ComputeSafeInplaceObns
* InplaceLbiGraph::DisconnectUnReachabeAndDataMutableEdge
* InplaceLbiGraph::FixMutRefConflictsFromSourceOpNode
* InplaceLbiGraph::FixMutRefConflictsFromSourceOpNode
* Graph::FindFirstBackEdgeDstNode
* more CHECK_ISNULL
* fix a bug in Graph::FindFirstBackEdgeDstNode()
* fix bugs in Graph<NodeType, EdgeType>::ForEachConnectedComponent
* rename GetIsMutableIbnConsumer => FindSoleMutableIbnConsumer
* refine InplaceLbiGraph::IsConstRefConflictMutRefNode
* there could be no mut_ref node found in InplaceLbiGraph::FindFirstInterOpRefConflictMutRefEdge
* refine InplaceLbiGraph::FindFirstInterOpRefConflictMutRefEdge
* remove unnecessary CHECK in InplaceLbiGraph::GetSafeInplaceObnEdges
* fix a line of comment in InplaceLbiGraph::GetSafeInplaceObnEdges
* shouldn't delete the edge to updt_node
* refine InplaceLbiGraph::FixMutRefConflictsFromSourceOpNode
* refine FindFirstIntraOpRefConflictMutRefEdge
* fix a bug in InplaceLbiGraph::FindFirstIntraOpRefConflictMutRefEdge
* CheckSubGraph
* change some lambdas to functions
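Graph::FindFirstBackEdgeDstNode implies a DFS that classifies edges: a back edge targets a node still on the current DFS stack, i.e. it closes a cycle. A generic sketch over an adjacency list (not OneFlow's Graph template):

```cpp
#include <functional>
#include <iostream>
#include <vector>

// Returns the destination of the first back edge found, or -1 if the graph
// is acyclic. color: 0 = unvisited, 1 = on DFS stack, 2 = finished.
int FindFirstBackEdgeDst(const std::vector<std::vector<int>>& adj) {
  std::vector<int> color(adj.size(), 0);
  std::function<int(int)> dfs = [&](int u) -> int {
    color[u] = 1;
    for (int v : adj[u]) {
      if (color[v] == 1) { return v; }  // back edge u -> v closes a cycle
      if (color[v] == 0) {
        int r = dfs(v);
        if (r != -1) { return r; }
      }
    }
    color[u] = 2;
    return -1;
  };
  for (std::size_t s = 0; s < adj.size(); ++s) {
    if (color[s] == 0) {
      int r = dfs(static_cast<int>(s));
      if (r != -1) { return r; }
    }
  }
  return -1;
}

int main() {
  // 0 -> 1 -> 2 -> 0 forms a cycle; the back edge 2 -> 0 is found first.
  std::vector<std::vector<int>> adj = {{1}, {2}, {0}};
  std::cout << FindFirstBackEdgeDst(adj) << "\n";  // prints: 0
}
```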
- May 06, 2019
Niu Chong authored
* style(actor.h): move customized virtual function declarations together
* feat: update Actor to support inplace regst
* feat: check inplace in/out regst dptr equal
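"check inplace in/out regst dptr equal" is a cheap invariant: an inplace produced regst must reuse the consumed regst's buffer, so their data pointers must alias. A minimal sketch of that check (Regst here is a stand-in struct, not the real type):

```cpp
#include <cassert>

struct Regst { const void* dptr; };  // stand-in for the real register type

// An inplace out regst shares the in regst's memory, so the two data
// pointers have to point at the same buffer.
inline void CheckInplaceRegst(const Regst& in, const Regst& out) {
  assert(in.dptr == out.dptr);
}
```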
- Feb 21, 2019
Shiyuan Shang-Guan authored
* gpu (#1310)
* Fix snapshot (#1320)
* fix bug of snapshot
* refine distribute.sh
* use more accurate function calls
* rename function
* update for model parallel
* refine code
* feat: enhance cmake download & options (#1281)
* feat: enhance cmake download & options
* feat(tools/): add share libs build scripts
* fix: add cmake options
* feat: add 3rd party download
* chore: update README
* fix: fix protobuf & cmake repo
* fix: fix options name
* chore: merge 3rd_party.cmake & third_party.cmake
* chore: revert pre cmake URL fix
* chore: update ExternalProject check
* fix: fix typo & missing download
* fix: fix download url
* chore: update readme
* chore: fix typo
* fix: fix bugs
* fix: fix bugs
* fix: fix pre
* print all third party libs
* refine readme
* DOWNLOAD_THIRD_PARTY -> PRECOMPILED_THIRD_PARTY
* refine readme
* minor typo fix
* Fix bug in model parallel (#1345)
* fix conv in model parallel
* add TODO
* Fix bug in gcc54 (#1352)
* fix bug in gcc 5.4
* update
* refine ibverbs lib (#1391)
* refine link ibverbs lib
* modify minor
* fix a little bug in accuracy print (#1403)
* batch num for prediction (#1405)
* batch num for prediction
* !Train() => Predict()
* fix normalization epsilon check (#1433)
* Fix normalization epsilon check (#1441)
* fix normalization epsilon check
* remove check, fix epsilon value in op_conf
* align with tensorflow (#1461)
* Dev crop with random size (#1468)
* random size crop proto
* ImagePreprocessImpl::<kCropWithRandomSize>
* clang format
* MaxVal
* Dev jinyi offline build (#1476)
* chore: remove pre compiler funcs
* chore: add submodules
* fix: fix project build URL from git_url -> submodule_dir_url
* fix: fix submodule commit id
* fix: fix .gitmodules
* chore: mv third_party dir
* chore: remove test-driver(glog#188) link in glog submodule
* fix: update glog from: da816ea70645e463aa04f9564544939fa327d5a7 ==> to: 4f3e18bf26cdb794fc66cec348f57b5838a0c929
* chore: update README.md
* Dev prelu (#1474)
* GPU impl of prelu
* better naming
* address review
* renaming and use ? :
* address reviews on op
* change op conf
* rename weight
* allow 2d+
* not elementwise op
* naive impl
* minor fix
* rename
* remove dupl
* refactor to remove duplicate
* remove dup
* remove dup
* reverse condition
* move code
* remove useless arg
* refactoring
* add empty line
* fix jpeg encoder quality (#1450)
* fix(actor.cpp): never access regst after sending it to producer (#1531)
* fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg() (#1520)
* Dev center crop (#1542)
* center crop
* update
* add scalar_mul (#1553)
* refactor(actor/*): update the {RegstNameType, {}} to std::make_pair (#1605)
* fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()
* refactor(actor/*): update the {RegstNameType, {}} to std::make_pair
* fix record_num in blob (#1619)
* fix record_num in blob
* add comment
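Among these, the PReLU forward is a one-line elementwise rule: y = x for positive x, alpha * x otherwise. A CPU sketch with a single shared slope (the commit's kernel is a GPU implementation, and the real op learns alpha as a weight):

```cpp
#include <iostream>
#include <vector>

// Elementwise PReLU with one shared slope `alpha`.
void PReluForward(const std::vector<float>& x, float alpha, std::vector<float>* y) {
  y->resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    (*y)[i] = x[i] > 0.f ? x[i] : alpha * x[i];
  }
}

int main() {
  std::vector<float> x = {-2.f, -0.5f, 0.f, 1.f};
  std::vector<float> y;
  PReluForward(x, 0.1f, &y);
  for (float v : y) std::cout << v << " ";  // prints: -0.2 -0.05 0 1
}
```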
- Feb 19, 2019
Li Xinqi authored
* Implement gelu op (#1478)
* gelu op
* call different funcs for float and double
* Dev bert gather op (#1483)
* embedding_dense_op
* refine
* gather op
* revert
* Fix gelu bug (#1484)
* fix inherit bug
* fix backward formula
* fix bug
* Dev variable op (#1485)
* DefineTestBlobConf => DefineTestBlobOpConf (#1480)
* variable op
* Dev variable op disable memsharing (#1487)
* disable mem sharing for VariableOp
* variable disable tick diff
* fix
* refine
* options transpose_a and transpose_b for Matmul
* matmul operator conf
* Dev bert const scalar op (#1488)
* const scalar op
* refine
* fix
* data parallel only
* const range op (#1489)
* square and sqrt
* broadcast_binary_op
* feat: add mean op (#1490)
* feat: add mean op
* feat: add mean_kernel
* feat: add implementation
* feat: fix mean kernel
* Dev bert slice op (#1491)
* add op_conf
* add slice op impl
* add space kernel impl
* fix
* same semantic as python
* optional start and end
* fix
* add has_dim0_in_shape in reshape op (#1486)
* refine CHECK in broadcast_binary_op
* feat: add kernel implement for broadcast_mul/div
* Impl square && sqrt (#1495)
* impl square && sqrt
* fix typo
* Dev bert slice op (#1496)
* add op_conf
* add slice op impl
* add space kernel impl
* fix
* same semantic as python
* optional start and end
* fix
* slice kernel cpu impl
* modify coding style
* BiasAddOpConf
* refactor(broadcast_div_kernel): update kernel util api
* Dev bert const range use device piece size (#1498)
* use device_piece_size
* const size => size
* fix
* no check in BroadcastBinaryOp::InitFromProto
* override GetCustomizedConfs for broadcast_binary_op
* fix: fix bugs in broadcast_div/mul kernel (#1502)
* fix: fix bugs in broadcast_div/mul kernel
* fix
* fix: fix the infer bw_buf blobdesc bug in broadcast_binary op
* Bias Add Op && Kernel (#1503)
* pass compile
* fix typo
* Matmul kernel implementation (#1494)
* pass compile
* add comment
* fix bug
* Dev bert const scalar kernel (#1492)
* const scalar kernel
* fix
* fix
* init
* empty const range kernel
* sketch of gather kernel
* gather kernel
* refine
* refine
* const range kernel
* refine
* backward
* const range size
* gather kernel
* assert index
* add truncated_normal initializer (#1499)
* add truncated_normal initializer
* rename RngTruncatedNormal
* fix: add const override for InferBwBufBlobDescs in BroadcastBinaryOp
* fix: update the supported data type from floating to arithmetic
* enforce 2d on bias add
* Dev bert slice op (#1500)
* add op_conf
* add slice op impl
* add space kernel impl
* fix
* same semantic as python
* optional start and end
* fix
* slice kernel cpu impl
* modify coding style
* slice gpu impl const buf infer
* add slice gpu impl
* simplify slice cpu impl
* fix gpu impl bug
* fix typo
* add forward function from broadcast_add,broadcast_sub
* feat: add gpu impl of cast kernel (#1504)
* Dev nc cast (#1507)
* feat: add gpu impl of cast kernel
* register gpu cast op
* Fix broadcast binary all dim size 1 (#1505)
* remove check NumAxes
* check scalar
* IsScalarBlob
* b_diff => b (#1509)
* feat: add LayerNormOp/Kernel without kernel implement (#1510)
* fix: fix missing registering layer_normalization kernel
* fix: fix missing registering layer_normalization op
* fix: temporarily remove activation from layer_norm_kernel
* ExecShapeUtil
* broadcast_binary_xpu_util.h
* add bw kernel of broadcast_add
* Dev constant (#1513)
* constant_op
* init_op_conf
* sequence => range
* Dev broadcast add (#1514)
* ExecShapeUtil
* broadcast_binary_xpu_util.h
* add bw kernel of broadcast_add
* WITH_CUDA_PARAM
* left extended shape
* xpu_ndarray_builder
* add bw kernel of broadcast_sub
* updt to 1d (#1512)
* fix small in xpu_reduce_ndarray
* fix(broadcast_binary_op): fix the wrong data_type of bw_buf regst (#1515)
* feat(mean): update mean_op/kernel for calc only last dim of blob (#1516)
* fix(mean_kernel): fix typo
* ndarray reduce
* new reduce
* fix shape of tmp_storage
* reduce
* more check for NdArrayReduce
* ImplaceApplyUnary<UnaryFuncMinus>
* ndarray_apply_broadcast_binary
* delete useless files
* complete backward kernel of broadcast_mul
* add backward kernel of broadcast_div
* broadcast binary op check data type equal (#1508)
* fix bug in broadcast_binary
* debug op
* EncodeBlob
* const_out_blob_feature_load_file
* DefineTestBlobOpConf.has_diff
* indices has_diff = false (#1519)
* adam model update (#1518)
* adam model update
* add comment
* update
* add correct_deviation flag
* rename
* remove GetCustomizedConf
* fix bug in mean_op fw kernel
* add sigmoid loss op
* ndarray_apply_broadcast_unary
* remove multiplier of mean kernel
* fix(boxing_actor): not handle ctrl regst in NormalProcessNaiveReadableRegstMsg()
* fix raw (#1522)
* rsqrt
* XpuReducedNdarray supports expression template
* faster_reduce
* inlined cuda device function
* profiling reduce_sum
* refactor(kernel_util.cu): calc x_strides on cpu instead of on TransposeGpu() (#1525)
* BroadcastBinaryOp
* ExecShape => XpuShape
* fix shape bug in mean bw kernel
* refine XpuNdarrayAssign
* use ndarray broadcast mul (#1529)
* Dev softmax reduce ndarray (#1527)
* softmax use ndarray reduce
* fix shape
* refine reduce
* fix
* remove xpu_ndarray_builder
* fix(actor.cpp): never access regst after sending it to producer
* ndarray_util.h => xpu_util.h
* xpu_ndarray_util.h => ndarray_util.h
* XpuNdArrayUtil => NdarrayUtil
* SwitchReduce(SwitchCase(num_axes), ...) => Reduce(...)
* refactor: rename NormalProcessNaiveReadableRegstMsg() to NormalProcessNaiveReadableDataRegstMsg() (#1532)
* SwitchBroadcastApply(SwitchCase(num_axes), ...) => BroadcastApply(...)
* softmax kernel use ndarray reduce (#1530)
* softmax use ndarray reduce
* fix shape
* refine reduce
* fix
* RowMax => NdarrayReduce
* SwitchReduce => Reduce
* move template parameter NDIMS from class NdarrayReduce to methods of class NdarrayReduce
* rename file: ndarray/xpu_ndarray_reduce_test.cpp -> ndarray/ndarray_reduce_test.cpp
* move NdarrayUtil::SwitchReduce(...) to NdarrayReduce::SwitchReduce(...)
* Dev one hot encoder (#1533)
* one_hot op
* ohe
* one hot kernel
* refine
* refine
* remove old
* refine
* refine
* refine
* format
* save m and v in adam_model_update (#1534)
* Dev profile reduce (#1535)
* ndarray_reduce_impl
* NdarrayMatrixRowReduce
* 1) MatrixColReduce; 2) WITH_CUDA_PARAM => RUN_CUDA_KERNEL
* NdarrayScalarReduce
* NdarrayDefaultReduce
* refactor NdarrayReduce<DeviceType device_type, typename T> to NdarrayReduce<DeviceType device_type, typename T, const T(*binary_func)(const T, const T)>
* 1) MaxVal<T>() => GetMaxVal<T>(); MaxValue<T>::value => MaxVal<T>::value
* replace KernelUtil::RowMax with NdarrayUtil::ReduceMax
* NdarrayNoReduce
* eliminate redundant code by macros
* Fix matmul gpu bugs (#1528)
* call different api for batchedgemm
* updt api
* use naive loop
* save work
* save work
* updt impl
* remove useless code
* replace naive loop with cublasgemmbatched
* feat: add ScalarAddOp and ScalarMulOp (#1541)
* Dev nc scalar (#1543)
* feat: add ScalarAddOp and ScalarMulOp
* feat: add ScalarAddKernel and ScalarMulKernel
* fix: ScalarAddOp/ScalarMulOp not inherit from CWiseOp
* fix: fix code style
* fix: fix typo of include file in scalar_add_op/scalar_mul_op
* fix(scalar_mul_kernel): register ScalarMulKernel
* fix: add MulbyScalarPara(), replace cublas_scal by this on ScalarMulKernel
* fix(scalar_mul_kernel): fix typo
* Dev nc testtrans (#1540)
* feat: update trans kernel
* InitGlobalCudaDeviceProp
* in_blob and out_blob is unnecessary for bw kernel of variable_op and constant_op
* Transpose: the shape elem_cnt of x must not exceed 2^32
* remove LabelType (#1545)
* rm ndarray_reduce_core.*
* Dev identity loss (#1547)
* identity_loss
* loss op
* CalcLossInstanceNum
* mem shared for mdupdt first in regst and md diff add regst (#1546)
* remove useless code (#1548)
* Dev sparse cross entropy (#1550)
* op for sparse cross entropy
* modify op_conf for sparse cross entropy
* sparse cross entropy kernel
* op
* SparseCrossEntropyKernelUtil
* refine
* refine shape check (#1552)
* refactoring reduce sum (#1554)
* refactoring reduce sum
* also use shape and dptr when bw
* add resize when keepdims
* address reviews
* move functions to Anonymous namespace
* address reviews
* remove auto
* replace find
* rename keepdims
* only enable nccl on gpu
* fix diff add regst size in MdUpdt task node as same as in regst (#1556)
* mem shared for mdupdt first in regst and md diff add regst
* fix diff add regst size in MdUpdt task node as same as in regst
* minor fix
* special occasion when it is a loss op
* Dev loss instance num (#1544)
* loss instance number
* set_has_loss_instance_num_field
* loss
* in_diff
* LossOpFixInDiffHasLossInstanceNum
* remove need_do_loss_instance_num
* move to FixInDiffBlobDescs
* remove
* loss_instance_num use float
* refine
* Boxing ForwardLossInstance
* fix
* fix loss
* fix
* refine
* fix
* refine
* refine
* impl reduce mean
* Dev all reduce ctrl edge (#1558)
* mem shared for mdupdt first in regst and md diff add regst
* feat: add ReduceInplaceIdentity LogicalNode/TaskNode/Op/Kernel
* nccl reduce ctrl edge
* MayConsumeModelDiff
* fix diff add regst size in MdUpdt task node as same as in regst
* eager_reduce_ratio
* mem sharing for ReduceIdentity
* ReduceInplaceIdentity => ReduceIdentity
* reduce ctrl edge supports for arbitrary placement
* refine ChainLogicalGraph::IsLogicalNodeMergeable
* model name (#1561)
* Dev gather refine (#1517)
* gather op index support all int type and axis
* out=in
* reformat
* negative axis
* LookupKernel => GatherKernel
* reformat
* refine
* axis
* refine & bugfix
* remove ConstScalar and ConstRange (#1526)
* Refine range initializer (#1523)
* support axis
* refine naming
* fix before_dim_size
* doc
* refine
* refine naming
* refine naming
* VariableLogicalNode
* identity (#1563)
* total_instance_num use naive mdupdt (#1564)
* patch by hand from faster_rcnn
* revert LogicalVariableOp
* Dev clone boxing (#1566)
* identity
* reduce clone boxing
* Dev clone boxing (#1568)
* identity
* reduce clone boxing
* tuple identity
* Dev tick (#1571)
* feat: add Tick LogicalNode/TaskNode/Op/Kernel
* feat: remove Tick LogicalNode/TaskNode
* feat: add BldSubTskGphByTickToSource for TickOp
* refine: refine due to comment
* feat: add BldSubTskGphByRecordLoadToTick
* pr tick op/kernel alone
* feat: add TickOp and BldSubTskGphByTickToSource (#1565)
* feat: add Tick LogicalNode/TaskNode/Op/Kernel
* feat: remove Tick LogicalNode/TaskNode
* feat: add BldSubTskGphByTickToSource for TickOp
* refine: refine due to comment
* feat: add BldSubTskGphByRecordLoadToTick
* refine: refine due to comment
* refine: due to comment
* refine: remove BldSubTskGphByRecordLoadToTick
* fix tick op in dlnet (#1572)
* Dev clip by global norm (#1521)
* clip_by_global_norm
* update
* refine model_update op
* remove useless code
* fix name
* rename clip_norm
* remove useless code
* force init memory and add CHECK()
* remove useless code and add comment
* fix bug
* refine code
* Dev bert profile (#1573)
* 1) refactor reduce_group; 2) add new stream kReduceCtrl
* 1) allreduce and model_update overlapping; 2) allreduce and fw overlapping
* add mdupdt ctrl edges within reduce group (#1575)
* Dev group all reduce by model bytes (#1577)
* group all reduce by model byte size
* mv OpGraph into a separate file op_graph.h
* gelu (#1578)
* Dev bert layer norm (#1574)
* layer norm
* layer_norm
* fix trainable
* fix
* fix trainable
* refine
* Dev bert cuda event sync (#1581)
* cudaSetDevice in actor poller threads
* ReduceConcatCompActor; NaiveActor
* set dev id (#1583)
* Dev bert profiling (#1586)
* profiling
* all_reduce_* option for performance optimization
* fix a mem sharing bug (#1590)
* Fix mem sharing bug (#1593)
* fix a mem sharing bug
* refine by review
* remove previous if condition
* refine
* Dev profiling adam (#1592)
* profiling
* all_reduce_* option for performance optimization
* faster adam kernel
* Dev refine transpose (#1594)
* profiling
* all_reduce_* option for performance optimization
* faster adam kernel
* refine dropout and transpose
* loss print duration (#1598)
* pseudo chains of OpGraph
* ConvertPseudoChainToChain
* refine pseudo_chain
* refine register coloring algorithm
* rename op_graph log file name
* remove unused code
* Dev bigger chain (#1601)
* pseudo chains of OpGraph
* ConvertPseudoChainToChain
* refine pseudo_chain
* refine register coloring algorithm
* rename op_graph log file name
* remove unused code
* chore: add -gencode in CMakeLists.txt (#1603)
* EnableMemSharingInVariableOp
* no mem_sharing for out_diff & model_diff in variable_op
* Dev mem sharing for variable op (#1604)
* pseudo chains of OpGraph
* ConvertPseudoChainToChain
* refine pseudo_chain
* refine register coloring algorithm
* rename op_graph log file name
* remove unused code
* EnableMemSharingInVariableOp
* no mem_sharing for out_diff & model_diff in variable_op
* refine code
* Fix jxf reduce concat bug (#1606)
* refine logic to infer reduce_concat_op's elem_cnt of out blob, still have bugs...
* add RoundUp in reduce_concat
* CHECK_LE -> CHECK_EQ
* add CHECK
* Dev random shuffle (#1607)
* random shuffle
* fix
* refine
* refine
* single thread
* refine
* cmake add half (#1609)
* Bugfix no tick diff (#1614)
* group by has_diff
* rm unnecessary identity
* share model_diff and out_diff in variable op (#1616)
* share model_diff and out_diff in variable op
* bugfix: model_diff is a produced register
* register_num of model_diff is 1
* add VariableKernelConf
* no mutable
* bugfix
* bugfix: set ctrl_regst's return_regst_num (#1617)
* Register coloring with strategies (#1613)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* refine CHECK in AllReduce (#1618)
* refine CHECK in AllReduce
* move ReduceConcatOpCtx definition to .cpp file
* fix fw_consumer nullptr (#1622)
* faster improver (#1628)
* multithreads register coloring (#1630)
* multithreads register coloring
* refine code
* Dev bert accuracy with weight (#1632)
* accuracy
* accuracy_task_node add fw_buf
* fw_buf => data_tmp
* Dev logical blob dim0 (#1625)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* replace piece_size with logical_blob_dim0
* BlobParallelConf
* BlobParallelDesc
* infer out blob model_split_axis
* int64_t => int32_t
* InferOutBlobParallelDesc
* gather out blob model split (#1624)
* InferBlobParallelDesc
* let variable op support kModelParallel
* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
* Global<OpGraph>
* SplitLogicalInputBlobDesc
* ConcatOutputBlobDescs
* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
* OpGraph::CheckBlobDescs(...)
* exact division is unnecessary
* fix bugs
* rename InferOutBlob* => InferOutputBlob
* exact division in variable_op is unnecessary
* bug fix
* fix bugs
* fix bugs
* IsInputBlobAllowedModelSplit
* use Global<OpGraph> to InferModelSize
* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
* fix IdentityOp::IsInputBlobAllowedModelSplit
* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
* refine BlobParallelDesc: replace CopyParallelConf with operator=
* refine ParallelDesc: remove unused functions
* more checks on ParallelDesc
* Dev logical blob dim0 (#1635)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* replace piece_size with logical_blob_dim0
* BlobParallelConf
* BlobParallelDesc
* infer out blob model_split_axis
* int64_t => int32_t
* InferOutBlobParallelDesc
* gather out blob model split (#1624)
* InferBlobParallelDesc
* let variable op support kModelParallel
* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
* Global<OpGraph>
* SplitLogicalInputBlobDesc
* ConcatOutputBlobDescs
* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
* OpGraph::CheckBlobDescs(...)
* exact division is unnecessary
* fix bugs
* rename InferOutBlob* => InferOutputBlob
* exact division in variable_op is unnecessary
* bug fix
* fix bugs
* fix bugs
* IsInputBlobAllowedModelSplit
* use Global<OpGraph> to InferModelSize
* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
* fix IdentityOp::IsInputBlobAllowedModelSplit
* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
* refine BlobParallelDesc: replace CopyParallelConf with operator=
* refine ParallelDesc: remove unused functions
* more checks on ParallelDesc
* remove unused function Operator::MaxModelSplitNum
* bugfix: SoleOp() => op_vec().at(0)
* Dev global op graph (#1636)
* Global<OpGraph> is only available during compilation
* small record_piece_size for InferNoParallelBlobDesc
* Dev op graph piece size (#1637)
* fix a bug in OpGraph::InferNoParallelBlobDesc
* fix a bug in OpGraph::InferNoParallelBlobDesc
* DfsTopoForEachNodeSortByDistanceToSink (#1638)
* Dev jxf bert top k (#1633)
* top_k
* dev top_k op
* refine
* fix bug
* refactor top_k op, cooperate with gather op to get values now
* customized TOPK_KERNEL_ENTRY in auto factory
* batch gather op
* refine
* Backup: batch_gather op, pass compile
* fix bugs, pass the test
* fix no new line at the end of file
* const
* refine by review
* fix bugs
* rename: instance_dim -> instance_size
* remove a blank line
* refine coding style by Juncheng's suggestions, Bravo
* refine top_k
* more refine
* compatible with new model parallel
* refine
* rename
* cpu only in top_k
* Dev model boxing (#1639)
* mem_shared_hint_id
* sharable memory block
* rm useless code
* remove useless code
* bugfix: no redundant edges
* rename: MemBlockGroup => MemBlock
* put constructor of SharableMemBlockNode into header file
* bugfix
* rename field: MemBlock.block_id => MemBlock.mem_block_id
* replace piece_size with logical_blob_dim0
* BlobParallelConf
* BlobParallelDesc
* infer out blob model_split_axis
* int64_t => int32_t
* InferOutBlobParallelDesc
* gather out blob model split (#1624)
* InferBlobParallelDesc
* let variable op support kModelParallel
* rename lbi2blob_desc_ => lbi2no_parallel_blob_desc_
* Global<OpGraph>
* SplitLogicalInputBlobDesc
* ConcatOutputBlobDescs
* rename: BlobDataParallel => DataBlobParallel; BlobModelParallel => ModelBlobParallel; BlobGridParallel => GridBlobParallel
* OpGraph::CheckBlobDescs(...)
* exact division is unnecessary
* fix bugs
* rename InferOutBlob* => InferOutputBlob
* exact division in variable_op is unnecessary
* bug fix
* fix bugs
* fix bugs
* IsInputBlobAllowedModelSplit
* use Global<OpGraph> to InferModelSize
* add OpGraph::GetDataBalancedSplitter and OpGraph::GetModelBalancedSplitter
* fix IdentityOp::IsInputBlobAllowedModelSplit
* no implementation for pure virtual function Operator::IsInputBlobAllowedModelSplit
* refine BlobParallelDesc: replace CopyParallelConf with operator=
* refine ParallelDesc: remove unused functions
* more checks on ParallelDesc
* remove unused function Operator::MaxModelSplitNum
* BlobParallelDesc::EquivalentTo
* LogicalNode::main_model_parallel_ is out of date
* refine Operator: replace IsElemWiseOp with IsSoleInputBlobAllowedModelSplit
* refine transpose conf
* fix a bug in Operator::FixParallelDesc
* InferInputBlobModelSplitAxis
* BlobParallelType
* more default behaviors for Operator::InferInputOutputBlobParallelType
* op_parallel_signature
* rename: BlobParallelType => LogicalBlobParallelDesc
* OpGraph::InferLogicalBlobParallelDesc
* refactor SplitLogicalInputBlobDesc by LogicalBlobParallelDesc
* refine OpNode::ConcatBlobDesc by LogicalBlobParallelDesc
* OpNode::lbi2model_split_axis_
* OpGraph::GetBalancedSplitter
* replace OpGraph::GetBlobParallelDesc4Lbi with OpGraph::GetLbpd4Lbi
* rm BlobParallelDesc in OpGraph
* VariableOp::InitOpParallelSignatures
* rm BlobParallelDesc
* rename Make*ParalelSignature functions
* MakeOpParallelSignature_DS_MC_2_DS
* MakeOpParallelSignature_DC_MS_2_MS
* BiasAddOp::InitOpParallelSignatures
* refine MakeOpParallelSignature_DC_MS_2_MS
* MatmulOp::InitOpParallelSignatures
* GatherOp::InitOpParallelSignatures
* bugfix: model_split_axis cannot equal -1 when parallel_policy is kModelParallel
* refactor: bn2parallel_id2blob_desc => lbi2parallel_id2blob_desc
* refine OpNode
* LogicalBlobParallelConf
* LogicalBlobParallelDesc::DualLbpd
* 1) merge dev_bert; 2) placement.proto not used in logical_blob_parallel_conf.proto
* bugfix: 1) remove CHECK(has_model) in Operator::NaiveInitOpParallelSignatures; 2) lbpd->set_parallel_num(val)
* fix bugs in GatherOp::InitOpParallelSignatures and BroadcastBinaryOp::InitOpParallelSignatures
* refactor: InitOpParallelSignatures => GetOpParallelSignatures
* refactor: const OpParallelSignature => std::unique_ptr<const OpParallelSignature>
* rm LogicalBlobParallelConf
* refactor: ModelSplitAxis4BnInOp => LbpdHint4BnInOp
* fix bugs about LbpdHint
* simplify the interface of InferInputOutputBlobLogicalBlobParallelDescIf
* rename class CloneParallel => BroadcastParallel
* rename field: clone_parallel => broadcast_parallel
* refactor LbpdHint by SbpParallel
* InferIsModelBlob4OutputBlobsIf
* remove field LogicalBlobParallelDesc::parallel_num
* rename: LogicalBlobParallelDesc => SbpParallel
* rename: LbpdHint => SbpInferHint
* simplify interface Operator::InferOutputBlobSbpInferHint
* rename api: Operator::InferBlobSbpInferHintIf => Operator::InferOuputBlobsSbpInferHintIf
* OpGraph::InferIsModelBlob
* rename file: logical_blob_parallel_desc.* => sbp_parallel.*
* rename filename: lbpd_hint* => sbp_infer_hint*
* rename field: SbpInferHint::has_data_split => SbpInferHint::is_data_split
* rename fields: SbpInferHint::is_data_split, is_model_split, is_data_partial_sum, is_model_broadcast
* refactor SbpInferHint::split_axis
* LambdaOpParallelSignature
* replace function MakeVariableOpDataSplitOpParallelSignature with class VariableOpDataSplitOpParallelSignature
* replace function MakeVariableOpModelSplitOpParallelSignature with class VariableOpModelSplitOpParallelSignature
* BroadcastBinaryOpParallelSignature
* Matmul_DMS_MS_2_P_OpParallelSignature
* Gather_DC_MS_2_P_OpParallelSignature
* class DataSplitOpParallelSignature
* class ModelBroadcastOpParallelSignature
* class DS_MC_2_DS_OpParallelSignature
* add field OpParallelSignature::op_
* refactor: ModelSplitAxis => OutputBlobModelSplitAxis
* remove Operator::InferOuputBlobsSbpInferHintIf
* implement MatmulOp::OutputBlobModelSplitAxis
* implement GatherOp::OutputBlobModelSplitAxis
* implement TransposeOp::OutputBlobModelSplitAxis and BiasAddOp::OutputBlobModelSplitAxis
* add method OpGraph::IsDataBlob
* refactor OpGraph::InferSbpParallel
* refactor class SbpInferHint
* rename local variable: SbpInferHint4BnInOp => SbpInferHint4Ibn
* refactor MakeModelSplitOpParallelSignature
* refactor Make_DC_MS_2_MS_OpParallelSignature
* remove unused class LambdaOpParallelSignature; refactor class name '*Clone*' => '*Broadcast*'
* bugfix: Operator::OutputBlobModelSplitAxis for sole-ibn op
* fix bugs in SbpInferHint::has_split_axis(), SbpInferHint::split_axis and OpNode::IsModelBlob4Lbi
* refactor class SbpInferHint: replace split_axis_ with sbp_parallel_
* refactor by SbpInferHint::sbp_parallel
* 1) rename OpNode data member; 2) rm unused proto
* fix clone (#1641)
* OpGraph::GetBlobDataType (#1643)
* OpGraph::GetBlobDataType
* refine OpGraph::GetBlobDataType
* IdentityOp => TupleIdentityOp (#1644)
* Dev sbp parallel cast (#1646)
* add SbpParallelCastOp
* only SplitParallel and BroadcastParallel can be user customized
* rename: SbpParallelCastOp => ParallelCastOp
* build boxing_conf by sbp_parallel
* fix a bug in BroadcastBinaryOpParallelSignature
* support broadcast_parallel for sole-ibn op
* 1) build boxing_op_conf by sbp_parallel for tuple_identity_op; 2) no op parallel desc fix for kModelParallel; 3) fix a bug in TaskGraph::EnableMemSharingInVariableOp; 4) add TupleIdentityOpParallelSignature
* fix bug in IsModelParallel121 (#1648)
* merge develop
* merge develop (#1649)
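The broadcast binary ops above hinge on "left extended shape": the smaller operand's shape is padded with leading 1s to the output rank, and a size-1 dim contributes index 0 when mapping an output coordinate back to an input offset. A sketch of that index math (illustrative; the real kernels go through XpuShape and expression templates):

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// Broadcast-add y = a + b where both shapes are already left-extended to the
// same rank; a dim of size 1 broadcasts (it contributes index 0).
void BroadcastAdd(const std::vector<int64_t>& out_shape,
                  const std::vector<int64_t>& a_shape, const std::vector<float>& a,
                  const std::vector<int64_t>& b_shape, const std::vector<float>& b,
                  std::vector<float>* y) {
  const int ndims = static_cast<int>(out_shape.size());
  int64_t n = 1;
  for (int64_t d : out_shape) n *= d;
  y->resize(n);
  for (int64_t i = 0; i < n; ++i) {
    int64_t rem = i, a_off = 0, b_off = 0, a_stride = 1, b_stride = 1;
    // Walk dims from last to first, accumulating row-major offsets; size-1
    // dims contribute index 0 -- that is the broadcast.
    for (int d = ndims - 1; d >= 0; --d) {
      const int64_t idx = rem % out_shape[d];
      rem /= out_shape[d];
      a_off += (a_shape[d] == 1 ? 0 : idx) * a_stride;
      b_off += (b_shape[d] == 1 ? 0 : idx) * b_stride;
      a_stride *= a_shape[d];
      b_stride *= b_shape[d];
    }
    (*y)[i] = a[a_off] + b[b_off];
  }
}

int main() {
  // a: shape {2,1} = [[1],[2]], b: shape {1,3} = [[10,20,30]] -> y: shape {2,3}
  std::vector<float> y;
  BroadcastAdd({2, 3}, {2, 1}, {1, 2}, {1, 3}, {10, 20, 30}, &y);
  for (float v : y) std::cout << v << " ";  // prints: 11 21 31 12 22 32
}
```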
- Oct 01, 2018
Niu Chong authored
fix: add AsyncSendRegstMsgToConsumer() for sending a single produced regst, e.g. forward_model_regst (#1274)
* fix(normal_model_update_compute_actor): fix send forward_model_regst_ to consumer
* fix: add AsyncSendRegstMsgToConsumer() for sending a single produced regst, e.g. forward_model_regst

Niu Chong authored
* fix(normal_forward_compute_actor): fix SendMsgToForwardModelSaveActor()
* refine(normal_forward_compute_actor)
- Sep 30, 2018
Niu Chong authored
* feat(register_slot): add the RegstSlot
* feat(register_slot): update RegstSlot if
* feat(actor): update member of Actor to use RegstSlot
* fix(register_slot): fix the available_regst_desc_cnt init val
* refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
* feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
* feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
* fix(register_slot): fix the CHECK empty
* feat: remove actual_writeable_regst_desc_id_ from Actor, add Naive/CustomizedProducedRegst
* fix(normal_model_update_actor): bug: not send customized regst to consumer when SendIntialModel
* fix(normal_forward_compute_actor): bug: not add kLoss/kAccuracy produced regst to NaiveProducedRegst
* fix(actor): UNIMPLEMENTED() for AsyncSendCustomizedProducedRegstMsgToConsumer
* fix(normal_forward_compute_actor): set const_buf_regst to nullptr when recv from consumers
* fix(actor): total_reading_data_regst_cnt, not total_reading_ctrl_regst_cnt
* refactor: update GetNaiveConsumedRegstDescName to GetNaiveOrCustomizedConsumedRegstDescName (same for Produced)
* feat: combine data_regst and ctrl_regst in Actor
* fix: fix bugs
* fix: fix bugs
* fix: remove .swp files and unused LOG
* feat: split Act and SendMsg (#1255)
* feat: split Act and SendMsg
* refine: rename HandleProduced/ConsumedDataRegst.. to HandleProduced/ConsumedNaiveDatRegst..
* fix(input_wise_comp_actor): bug: not set piece id
* fix(actor): potential bug: produced msg with no allowed actor still pop from queue
* refactor: mv some protected member functions to private
* fix(actor): fix the condition about sending EORD msg
* refactor(input_wise_actor): use RegstSlot in InputWiseActor
* fix(copy_comm_net_actor): rename piece_id2regst_ctx to piece_id2regst_ctx_
* refactor: rename Name2RegstDescId to Name2RegstDescIds
* refactor(naive_actor): "override final" instead of only "final"
* refine(actor): little refine
* feat: update the return type of GetNaiveOrCustomizedNamesRegstDescName to enum class RegstNameType
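The RegstSlot groups an actor's registers into one deque per regst_desc_id and keeps a count of non-empty descs, so "are all inputs available" is an O(1) counter comparison rather than a scan. A minimal sketch of the idea, assuming a bare-bones Regst type rather than OneFlow's actual interface:

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <unordered_map>

struct Regst { int64_t regst_desc_id; };

// One deque of in-flight registers per regst_desc_id, plus a counter of how
// many descs are currently non-empty.
class RegstSlot {
 public:
  void InitDescId(int64_t desc_id) { desc_id2regsts_[desc_id]; }  // register an empty deque
  bool HasRegstDescId(int64_t desc_id) const { return desc_id2regsts_.count(desc_id) > 0; }

  bool TryPushBack(Regst* regst) {
    auto it = desc_id2regsts_.find(regst->regst_desc_id);
    if (it == desc_id2regsts_.end()) { return false; }
    if (it->second.empty()) { ++available_regst_desc_cnt_; }
    it->second.push_back(regst);
    return true;
  }
  Regst* TryPopFront(int64_t desc_id) {
    auto it = desc_id2regsts_.find(desc_id);
    if (it == desc_id2regsts_.end() || it->second.empty()) { return nullptr; }
    Regst* r = it->second.front();
    it->second.pop_front();
    if (it->second.empty()) { --available_regst_desc_cnt_; }
    return r;
  }
  // Ready when every registered desc has at least one regst queued.
  bool IsReady() const { return available_regst_desc_cnt_ == desc_id2regsts_.size(); }

 private:
  std::unordered_map<int64_t, std::deque<Regst*>> desc_id2regsts_;
  std::size_t available_regst_desc_cnt_ = 0;
};
```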
- Sep 10, 2018
Jinhui Yuan authored
- Sep 07, 2018
Niu Chong authored
* feat(register_slot): add the RegstSlot
* feat(register_slot): update RegstSlot if
* feat(actor): update member of Actor to use RegstSlot
* fix(register_slot): fix the available_regst_desc_cnt init val
* refine(register_slot): rename PushBack/PopFront, FindTheRegstDescId to TryPushBack/TryPopFront, HasRegstDescId
* feat(regst_slot): rename ForEachCurRegstDeq/ForEachCurFrontRegst to ForEachRegstDeq/ForEachFrontRegst
* feat(regst_slot): add ForChosenRegstDeq/ForChosenFrontRegst, add CHECK empty in ForEachFrontRegst
* fix(register_slot): fix the CHECK empty
- Aug 19, 2018
Jinhui Yuan authored

Jinhui Yuan authored
* refine act_id order condition
* strict act id check (excluding model regst)
* add TODO: figure out the ActNumForEachOutput of model regsts to MdSave area
- Aug 04, 2018
Jinhui Yuan authored
- Aug 01, 2018
Jinhui Yuan authored
- Jul 16, 2018
Niu Chong authored
* feat: avoid net contention by adding ctrl edge in ReduceStruct
* refine(task_graph.h/cpp): refine AddCtrlEdgeInReduceStruct()
* fix(graph/task_graph.cpp): fix the bug of machine order
* fix(graph/task_graph.cpp): do not add ctrl edge with reduce scatter
* feat: add ReduceGlobalAddCompActor
* fix: fix the bug of reduce_global_actor/kernel
* chore: remove used vim .swp file
* fix(graph/task_graph.cpp): fix the bug of sorting copycomment when building reduce ctrl edge
* fix(graph/task_graph.h/cpp): add CtrlEdge for ReduceGather
* feat: revert add ctrl edge in reduce struct from this PR
* refactor: rename ReduceGlobalAddCompActor to InputWiseCompActor for scalability
* fix(kernel/reduce_global_add_kernel.cpp): use Memcpy instead of Memset for the first blob to be added
* refactor(actor/input_wise_compute_actor.*): use HashMap and counter instead of HashSet for processed regst_desc
* refactor: let ReduceGlobalAddCompActor inherit InputWiseCompActor
* feature: add ReduceGatherCompActor that inherits InputWiseCompActor
* fix(reduce_gather_kernel.cpp): add missing break
* refactor: replace regst_desc_id2bn_in_op_ with regst_desc_id2in_bn_id_ in InputWiseCompActor
* fix(reduce_global_add_kernel): remove useless class member parallel_id_
* refactor: make ReduceLocalAdd kernel support inputwise, rename ReduceGlobalAddActor to ReduceAddActor for scalability
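Swapping a HashSet of processed regst_descs for a HashMap plus a counter makes the per-round "all inputs arrived" test O(1). A sketch of that bookkeeping, detached from the actor machinery (names are illustrative):

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// Tracks which input regst_descs have been consumed this round; the actor
// may Act() once every input has arrived exactly once.
class InputWiseCounter {
 public:
  explicit InputWiseCounter(const std::vector<int64_t>& desc_ids) {
    for (int64_t id : desc_ids) { processed_[id] = false; }
  }
  // Returns true when this arrival completes the round.
  bool OnRegstArrived(int64_t desc_id) {
    bool& seen = processed_.at(desc_id);
    if (!seen) { seen = true; ++processed_cnt_; }
    if (processed_cnt_ == processed_.size()) {
      for (auto& kv : processed_) { kv.second = false; }  // reset for next round
      processed_cnt_ = 0;
      return true;
    }
    return false;
  }

 private:
  std::unordered_map<int64_t, bool> processed_;
  std::size_t processed_cnt_ = 0;
};
```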
- Jul 03, 2018
Jinhui Yuan authored
* add parallel record decoder
* null data loader
* use real loader
* use libjpeg-turbo
* add streams support, TODO: 1, disable internal buffer_; 2, use template
* make persistent_in_stream support multiple files
* make compiler support new loader
* minor refine
* make new loader work on mnist
* update proto in benchmark and example
* refactor stream buffer filler
* refine persistent_in_stream
* workable
* add record_loader_op
* finish record loader op
* AddRecordLoaderOps
* make compiler work
* infer shape works
* add record_load_kernel
* let decode actor pass in_regst in normal way
* add kOFRecordPtr type
* remove record regst type
* change ALL_DATA_TYPE_SEQ to ALL_POD_DATA_TYPE_SEQ
* support OFRecordPtr blob
* complete decode_ofrecord_kernel
* allocate OFRecord in Blob<OFRecordPtr>
* fix of record ptr blob
* let actor manage the OFRecord blob
* let regst mgr own ofrecord memory
* workable
* remove useless code
* refine
* NormalRegst -> DataRegst
* OFRecord data type (#984)
* OFRecord data type
* placement new (#987)
* placement new
* fix
* remove useless code
* placement new OFRecord
* remove useless code
* Refactor stream (#985)
* refactor stream scanner
* let persistence_in_stream create the binary stream
* refine persistence_in_stream
* refine
* POD_DATA_TYPE_SEQ
* update placement proto in benchmark and example
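Making persistent_in_stream span multiple files means a read that exhausts one file transparently continues in the next, so callers see one contiguous byte stream. A sketch of that chaining, using plain std::ifstream rather than OneFlow's stream classes:

```cpp
#include <fstream>
#include <string>
#include <vector>

// Reads up to `n` bytes into `buf`, crossing file boundaries; the result is
// short only at the end of the last file.
class MultiFileInStream {
 public:
  explicit MultiFileInStream(std::vector<std::string> paths)
      : paths_(std::move(paths)) {}

  std::size_t Read(char* buf, std::size_t n) {
    std::size_t total = 0;
    while (total < n && cur_ < paths_.size()) {
      if (!in_.is_open()) {
        in_.open(paths_[cur_], std::ios::binary);
        if (!in_) { in_.close(); in_.clear(); ++cur_; continue; }  // skip unreadable file
      }
      in_.read(buf + total, static_cast<std::streamsize>(n - total));
      total += static_cast<std::size_t>(in_.gcount());
      if (in_.eof()) { in_.close(); in_.clear(); ++cur_; }  // chain into the next file
    }
    return total;
  }

 private:
  std::vector<std::string> paths_;
  std::size_t cur_ = 0;
  std::ifstream in_;
};
```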