- Jul 16, 2021
Li Xinqi authored
* refactor job_pass by maybe_system
* remove useless files
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Jul 01, 2021
daquexian authored
* add missing JUST
  Signed-off-by: daquexian <daquexian566@gmail.com>
* remove redundant header
  Signed-off-by: daquexian <daquexian566@gmail.com>
* add missing JUST in master
  Signed-off-by: daquexian <daquexian566@gmail.com>
* fix compile error on gcc5
  Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

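For context on the "add missing JUST" fixes: in this codebase, fallible functions return a Maybe<T>, and call sites are expected to wrap them in JUST(...) so an error short-circuits out of the caller instead of being silently dropped; a "missing JUST" means the error path was being ignored. Below is a minimal self-contained sketch of that pattern. It is not OneFlow's actual Maybe/JUST implementation; the type and macro here are simplified stand-ins, and the macro relies on the GCC/Clang statement-expression extension.

```cpp
#include <iostream>
#include <optional>
#include <string>

// Simplified stand-in for a Maybe<T>-style result: either a value or an error message.
template<typename T>
struct Maybe {
  std::optional<T> value;
  std::string error;
  bool ok() const { return value.has_value(); }
};

// Simplified stand-in for JUST(): unwrap on success, or return the error to the caller.
// Uses the GCC/Clang statement-expression extension.
#define JUST(expr)                                               \
  ({                                                             \
    auto&& maybe__ = (expr);                                     \
    if (!maybe__.ok()) { return {std::nullopt, maybe__.error}; } \
    *maybe__.value;                                              \
  })

Maybe<int> ParsePositive(int x) {
  if (x <= 0) { return {std::nullopt, "expected a positive value"}; }
  return {x, ""};
}

Maybe<int> Twice(int x) {
  // Without JUST, the error from ParsePositive(x) would be silently ignored.
  int v = JUST(ParsePositive(x));
  return {v * 2, ""};
}

int main() {
  std::cout << (Twice(3).ok() ? "ok" : "error") << "\n";   // prints "ok"
  std::cout << (Twice(-1).ok() ? "ok" : "error") << "\n";  // prints "error" (propagated by JUST)
}
```

Built with g++ or clang++ in C++17 mode, the second call reports the propagated error instead of computing with an invalid value, which is the failure mode a "missing JUST" would reintroduce.
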
- May 05, 2021
cheng cheng authored
* Fw/Bw support double compute stream
* NCCL comm create by stream id
* 2D NCCL logical kernel support BW independent stream
* StreamIndex: NcclComputeStream for each subgraph insert nccl logical.
* refactor code
* refine code for review
* Add WITH_CUDA in DoJobPass(InsertNcclLogicalOpPass)

- Mar 09, 2021
Juncheng authored
* Refine InferTimeShape
* fix xla
* fix xla
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Feb 24, 2021
cheng cheng authored
* disable_group_boxing and change nccl logical order to dst
* remove note
* both support insert nccl logical ops as close as possible to Src/Dst node
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Feb 19, 2021
cheng cheng authored
* Remove keep_header_only and BlobDesc::is_body_disabled
* Remove InputBlobModifier::use_header_only and UserOps set_use_header_only

- Feb 18, 2021
cheng cheng authored
* Enable insert nccl logical op pass
* FindMaxConnectedSubgraphForGpuExecOrder
* through order and interface
* implement of insert nccl logical op in pass
* add nccl logical op using UserOp Implement and EagerNcclCommMgr
* add NCCL ReduceScatter op/kernel; refine pass impl of topo order
* add NCCL logical op/kernel AllGather
* fix bug of reduce scatter / all gather infer shape
* refine log and note
* fix compiler error when building with CPU ONLY
* support NCCL ALL2ALL and test pass of alexnet model parallel
* rollback of diff in checkpointing_pass.cpp
* rename to nccl_use_compute_stream; ResourceDesc::nccl_use_compute_stream; refine name for review; create nccl_comm_ in KernelCompute
* refine code for review
* add unittest for nccl use compute stream
* format test scripts
* refine align

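The entry above adds NCCL ReduceScatter/AllGather "logical" kernels and the nccl_use_compute_stream option, whose point is that collectives launched on the kernel's own compute stream are ordered with the surrounding compute work without extra synchronization. A minimal sketch of that call pattern using the raw NCCL API follows; it is an illustration only, not OneFlow's kernel code, and the communicator, stream, and buffers are assumed to be created elsewhere (error checking omitted).

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative only: issue a ReduceScatter followed by an AllGather on the
// *compute* stream, so the collectives serialize with the surrounding kernels
// enqueued on that same stream. `comm` and `stream` are assumed to come from a
// communicator manager and the kernel context; `elem_cnt` is assumed divisible
// by the number of ranks.
void ReduceScatterThenAllGather(const float* in, float* scattered, float* gathered,
                                size_t elem_cnt, int num_ranks,
                                ncclComm_t comm, cudaStream_t stream) {
  size_t per_rank = elem_cnt / num_ranks;
  // Each rank receives the reduced (summed) values of its own shard of the input.
  ncclReduceScatter(in, scattered, per_rank, ncclFloat, ncclSum, comm, stream);
  // Each rank then gathers every shard back into a full-size buffer.
  ncclAllGather(scattered, gathered, per_rank, ncclFloat, comm, stream);
  // No cudaStreamSynchronize() here: later kernels enqueued on `stream`
  // naturally wait for the collectives to complete.
}
```
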
- Feb 14, 2021
Li Xinqi authored
* source subset tick
* remove useless header files
* insert DstSubsetTickOp
* remove incorrect CHECK
* add tick op for each machine
* TryBindBnWithOneofRegst
* add sink tick op in main_job
* refactor LinkMainJob
* fix typo in task_graph
* refactor AddGlobalCriticalSection
* rename and refactor DstSubsetTick::InferBlobDescs and SrcSubsetTick::InferBlobDescs
* add src_subset_tick for input-output critical section
* refactor AutoSourceTick and AutoSinkTick
* SrcSubsetTickCompTaskNode: bind bns and in_regst if bns is valid in current device
* refactor optional input to repeated inputs for SrcSubsetTickOpConf
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Nov 02, 2020
Li Xinqi authored
* refactor OpGraphPass to JobPass
* refactor methods of JobPassCtx

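This refactor (together with the later "refactor job_pass by maybe_system" commit at the top of this history) settles on a pass interface where each rewrite step takes the job plus a pass context and reports success or failure. The sketch below illustrates that shape with a toy registry; it is an assumption-laden stand-in, not OneFlow's actual JobPass/JobPassCtx classes or registration macro.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct Job {};          // stand-in for the job proto being rewritten
struct JobPassCtx {};   // stand-in for per-pass state and configuration

// Minimal pass interface: rewrite the job in place and signal failure via a bool.
// The real code would return a Maybe<void> so that JUST() can propagate errors.
class JobPass {
 public:
  virtual ~JobPass() = default;
  virtual bool Apply(Job* job, JobPassCtx* ctx) const = 0;
};

// Tiny registry so passes can be looked up by name and run in sequence.
std::map<std::string, std::function<std::unique_ptr<JobPass>()>>& PassRegistry() {
  static std::map<std::string, std::function<std::unique_ptr<JobPass>()>> registry;
  return registry;
}

#define REGISTER_JOB_PASS(name, T) \
  static bool name##_registered =  \
      (PassRegistry()[#name] = [] { return std::make_unique<T>(); }, true)

class InsertNcclLogicalOpPass final : public JobPass {
 public:
  bool Apply(Job* job, JobPassCtx* ctx) const override {
    // ... find connected GPU subgraphs and insert nccl logical ops ...
    return true;
  }
};
REGISTER_JOB_PASS(InsertNcclLogicalOpPass, InsertNcclLogicalOpPass);
```

The design point of the refactor is visible even in the toy version: every rewrite step shares one signature, so passes can be registered by name and chained by a driver instead of being hard-wired into the compiler.
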
- Sep 05, 2020
Juncheng authored

- Jul 28, 2020
OuYang Yu authored
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>

- Jul 23, 2020
Shenghang Tsai authored
* add license at root dir
* check in empty files
* rm space
* check in script
* update script
* fix bug
* add print
* fix
* add exit
* add to of_format
* add CI task
* fix license
* Revert "fix license"
  This reverts commit 818b6d7691d3a8b4a25dd41a47ff2c5922b8ec57.
* only add once
* quick fix
* fix script
* dont fmt empty file
* fix
* quick fix
* fix py
* add license
* fix exit
* add license for hpp
* add license
* license new vm files
Co-authored-by: tsai <caishenghang@oneflow.org>

- Mar 19, 2020
cheng cheng authored
* disable add keep header only op pass by config
* rename enable_keep_header_only

- Jan 08, 2020
- Jan 07, 2020
- Jan 02, 2020
- Dec 27, 2019
Niu Chong authored
* Add user op related proto * Add OpRegistration (#2424) * Add uncompleted op registration for cooperation review * Fix the compile bugs * Refactor the implementation macro of op_reg arg member funcs * Move impl of Attr() with default to cpp and specialize it * Add LookUpInOpRegistry() * Add UserOp as placeholder * Rename in->input and out->output * Fix the missing ctor of OpRegistrationBuilder * Add GetAllRegisteredUserOp() for debug * Add Log for every user_op registration * Add const qualifier for Builder::Build() and user_op namespace for REGISTER_USER_OP macro * Remove the LOG() from ctor of op registrar due to segment fault (maybe a glog bug) * add customized dir (#2425) * add customized dir * customized/.keep * Add map<string, ListString> output; to UserOpConf (#2426) * Substitute std::function<...> with alias name and Set default val for those function (#2428) * Add Kernel Registry (#2431) * Add Kernel Registration * Make REGISTER_USER_OP/KERNEL macro available when not in namespace of oneflow * Add missing TODO of CreateFn parameter * Add OpKernel for user op * Fix a little code style * Add GradRegistry (#2433) * implement of user_op, instead of get sbp sign (#2429) * Add UserKernel and UserKernelConf (#2438) * Add VirtualGenKernelConf() for UserOp and fill UserKernelConf * Fill KernelRegCtx * Fix typos and bugs * Add UserKernel * Add KernelRegContext(const KernelConf&) as the ctor * Dev cc python user op conf builder (#2435) * user op builder in python * user op wrapper * Implement of add default vale and check valid between c++ and python * remove notes * fixbug and runnable for UserOp complie and add sample as ccrelu * fix func and class name * check attr type in op def and op conf; refine code for review * Dev cc infer tmp size (#2445) * Refine some code and interface (#2447) * Add InferContext and Infer Util functions * Add framework.h as the only included header for user code * Fix the Dtype infer bug * Fix duplicated ret * Fix the KernelRegistration of UserKernel * Update cc_relu_op.cpp to use InferContext * Refine and Add test relu kernel * Add user_op::Blob * Update InferContext to ptr from const ref * Add user_op_conf into InferContext and Attr() * Move cc_relu_op.cpp to customized/ops/ * Add Shape and Dtype into Blob * Fill the real ReluKernel with Gpu and Float * Remove unused files * Add unique_names_ for op registration * Refactor AttrSeq for re-used of attr function specialization (#2452) * Refactor AttrSeq for re-used of attr function specialization * Remove Serialize interface * Dev cc gen bw user conf (#2449) * refine python print err * interface of UserOpWrapper UserOpConfWrapper UserOpConfWrapperBuilder * implement of user op conf builder in c++ * generate backward op conf for user op * define grad registration value func * refine code for review * check input valid when query need grad * refine name * implement of demo ccrelu_grad and test pass of alexnet * refine ccrelu python * Add UserOpDefWrapper; Fix .py bug; Add TestReshape op/kernel (#2454) * Add UserOpDefWrapper * Update paras of CheckAttrs() from UserOpDef to UserOpDefWrapper * Fix bug in user_op_builder.py * Add TestReshape Op&Kernel * fix ccrelu op grad register and shape infer; Add ccrelu alexnet test python script (#2457) * Move UserOpConf from op_conf.proto to user_op_conf.proto * Add test_reshape.py * Refine the imple of access to attrs in user_op_conf.cpp * Refactor the way to access AttrVal with AttrValAccessor * Add GetAttr() in KernelContext * Rename op_infer_util.h to infer_util.h and Fill 
paras of InferTmpSize with InferContxt * Refactor InferContext to simpley user_op * Refine customized test kernels to get along with interface update * Dev merge from dev python (#2465) * Clear session (#2416) * oneflow.clear_default_session * fix bugs in oneflow.config.machine * refactor function return type (#2417) * fix for py2 (#2418) * blob parallel conf * Pr watch scope (#2419) * pr oneflow.watch* * merge more code to pass watch_scope.py * TODO: input_blob_def.parallel_conf * oneflow.cluster (#2423) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * Free port (#2427) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * Dev single processor test (#2430) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * single process test * Cluster::WorkerLoop * delete unnecessary OF_BARRIER_ALL * no longer fork children processes to run tests * robust contextmanager for CurJobConf (#2434) * fix of_pure_proto_dir (#2439) * Ctrl between optimizer (#2443) * add ctrl edges between optimizors * update docker file * sequentialize all optimizors * Revert "fix of_pure_proto_dir (#2439)" (#2446) This reverts commit 5031cc86. * Oneflow unittest (#2448) * oneflow.unittest.* * oneflow.unittest.register_testcases * rename: oneflow.unittest.register_testcases -> oneflow.unittest.register_test_cases * Test bert inplace with xinqi (#2450) * update bert script * update watch_scope test script * update for debug * update for debug * update debug script * test_inplace.py * no reshape * debug IsLbiAllConsumersReachable * fix inplace * rm useless code * update config * fix critical_section (#2453) * Patch distribute (#2456) * backup * fix bugs * test_inplace * Fix InplaceActor when no one consume inplace out regst (#2458) * disable mutable inplace edge to variable (#2459) * Fix unittest import conflict for py2 (#2460) * Create __init__.py (#2464) * update ccrelu_alexnet.py * Update code that use UserOpConf to UserOpConfWrapper * Add paras of GetSbp * Refactor KernelContext with pure virtual member function * Refactor KernelRegContext with pure virtual member function * Refactor UserKernelContext ctor to simplify code * Refactor InferContext with pure virtual member function * Rename Blob to Tensor; BlobDef to TensorDesc * Remove unused log and file * Dev cc user op sbp (#2480) * ccrelu multi-gpu runnable * example of ccrelu op get sbp sign * sbp context * add example for get sbp using LogicalTensor... * GetSbpFnUtil::MirrorSplitAtDim0 * fix bug of useless * Fix the bug of init UserKernelContext (#2487) * Fix the bug of init UserKernelContext * Fix due to comment * Refine code * Add missing SetCheckAttrFn implementation * refine python user_op_builder.SetAttr(); refine reshape test an… (#2505) * Make UserOp runnable; refine user op python test for new change in dev_python Co-authored-by:
cheng cheng <472491134@qq.com>
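
Much of the entry above builds up a builder-style user-op registry: an op registered under REGISTER_USER_OP declares its inputs, outputs, and attributes, plus shape/dtype and SBP inference functions, with kernels registered separately. The following is a self-contained miniature of that builder pattern, for illustration only; it is not OneFlow's user_op API, and every class and method name in it is a made-up stand-in.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Miniature op definition: names of inputs/outputs, attrs with textual defaults,
// and a shape-inference callback.
struct OpDef {
  std::string name;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
  std::map<std::string, std::string> attrs;
  std::function<std::vector<int64_t>(const std::vector<int64_t>&)> infer_shape;
};

// Builder in the spirit of the registration flow described above.
class OpRegistrationBuilder {
 public:
  explicit OpRegistrationBuilder(std::string name) { def_.name = std::move(name); }
  OpRegistrationBuilder& Input(const std::string& n) { def_.inputs.push_back(n); return *this; }
  OpRegistrationBuilder& Output(const std::string& n) { def_.outputs.push_back(n); return *this; }
  OpRegistrationBuilder& Attr(const std::string& n, const std::string& dflt) {
    def_.attrs[n] = dflt;
    return *this;
  }
  OpRegistrationBuilder& SetShapeInferFn(decltype(OpDef::infer_shape) fn) {
    def_.infer_shape = std::move(fn);
    return *this;
  }
  OpDef Build() const { return def_; }

 private:
  OpDef def_;
};

int main() {
  // A "ccrelu"-style elementwise op: one input, one output, shape passes through.
  OpDef relu = OpRegistrationBuilder("ccrelu")
                   .Input("in")
                   .Output("out")
                   .SetShapeInferFn([](const std::vector<int64_t>& in) { return in; })
                   .Build();
  std::cout << relu.name << " infers rank " << relu.infer_shape({8, 1024}).size()
            << " output\n";  // prints: ccrelu infers rank 2 output
}
```
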

- Dec 26, 2019
Houjiang Chen authored
* Enable multiply definition for xla compilation in oneflow * Realize running an executable * Abstract and gather resources (such as client, builder etc.) needed to compile as CompilationResourceStore * Implement a seperate xla allocator to avoid introducing much objects of tensorflow * Define CompilationContext separately * Running XLA by CPU mode is OK now * Make the result shape after running the executable is a tuple, and refine comments * Add compilation cache to solve recompiling every time * Resolve InferSbpSignature in XlaLaunchOp * Resove executing on specified cuda stream * Refine XlaLaunch parallel conf, add batch matmul op * Refactor job rebuilding and fixup time shape * Update batch_dim_lbis field if XlaLaunch has any output which has batch dim * Resolve cluster-ring after clustered, take sbp policy and time shape into consideration * Add reshape op * Fix bugs * Rename CompilationContext by XlaLaunchContext, add XlaRuntimeScope to swap stream handle * Fix bugs * Update cmake to compile with xla optionally * Support more ops * Add more ops, and fix bugs * Implement XLA allocator and internal memory pool * Adaptively resize allocator memory size * Refine memory allocator * Block host if running cpu executable * Fix bug for getting scalar value * Fix result layout bug. This bug causes wrong result for transpose * Refine gelu backward * Of xla sx (#1990) * add identity xla op * Add batch gather op * Refine batch gather * fix batch gather bug aand add gather op, mv identity op to unary_op * Add softmax and gather/batch_gather * Add xla softmax_grad op * Add xla layer normalization op * Add xla layer norm backward op * Alias inputs and outputs to compute in-place * Reuse output buffers when running xla executable. It brings about 10% speedup for bert on single gpu by zero copy results * Reuse output buffers when running xla executable. 
It brings about 10% speedup for bert on single gpu by zero copy results * Refine xla allocator * Refine code style * Add xla reduce_sum op * Rewrite model update op to optimizer graph * Fix hang bugs * Fix input which body is disabled in xla launch kernel * Fix self control in * Fix self control in * Add fake consume op * Fix HasAttr bug for optional field * Refine AdamOptimizer * Fix xla AdamOptimizer bugs * Add meta data in HLO instruction, and refine * Fix bugs * add reduce sum and split normal model update (#2040) * remove append_func_to_list * Rm deprecated model update and save code (#1958) * remove code * mv random gen to kernel * mk seed required * address reviews * fix unused warning * address reviews * check in more deprecation * remove ModelSaveOpConf * move out ops and modify item (#1962) * ModelInit.__oneflow_input_remote_blobs__ * fix cpu only query & add error info (#1964) * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * modify check_point and add test check_point (#1963) * fix misuse of Scope/raii * op_name2variable_blob * add sigmoid test and tanh test (#1966) * add op matmul and matmul test (#1967) * rename oneflow.val to oneflow.input_blob_def * support auto var for convolution (#1972) * add op add and test add (#1973) * mv deprecated.pb_util to lib.core.pb_util * add op get_variable and get_variable test (#1975) * add op get_variable and get_variable test * modify shape extend * AllReduceSequencePass (#1976) * python2 compatibility for check_point * fix "return (blob_a, blob_b)" bug * rename: arg_passing => arg_pass * shared regst blob header between jobs (#1919) * half impl * register manager handle memory shared for separated memory * set separated memory shared id for shared regst between jobs * half impl of python for blob * fix BUG of pod ToProto() when proto has inited * fix BUG of infer dim0_inner_shape() in foreign_input_op * 1. 
PushJob copy from python can infer dim0_valid_num * add test for dynamic relu * refine test file * refine code * refine note * update test file for new interface * rename separated_header* (#1979) * some bugs fixes for a train&eval job (#1978) * debugging alex net * check in test pull_multiple_blob.py * strcter check * fix bias in conv * fix various bugs * rm file * op_name in different jobs can be overloaded * fix compile bug in job_set_compile_ctx * rm cmake code for building oneflow binary * check in script (#1980) * check in script * rm used import * CudaCurrentDeviceGuard (#1977) * fix val (#1981) * Merge job set and split fw bw (#1982) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half 
* feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complte InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnaray * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * udpate * update ndarray * update * update * update * 
udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support the half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when use auto_mixed_presion * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: udpate registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_gragh in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * Merge job set and split fw bw (#1983) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * 
refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half * feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complte InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * 
ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnaray * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support the half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when use auto_mixed_presion * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. 
refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: udpate registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_gragh in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * CudaCurrentDeviceGuard (#1977) * delete tmp_split_fw_bw_train_conf (#1985) * delete tmp_split_fw_bw_train_conf * delete useless comments * fix refactor bug in layer_norm_op * minor fixes * update py script * remove code could be misleading * Fix all reduce mem sharing (#1986) * fix all reduce mem sharing * ByteSizeOfDataContentField=>ByteSizeOfBlobBody * remove obsolete task_graph optimization * no arg_pass_job for variable_op * merge memory block id between jobs (#1910) * refine MemBlock and CriticalSection * job memory sharing strategy * revert diff in CriticalSectionDesc * Merge memory block between sub plans * Get mutual exclusion job groups * forget to consider memory merge only in same machine * memory zone unique id * Merge Done; merge memory block id from right to left; get memory block ids info * revert MemBlock * generate mutual exclusion job groups Done. * update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implement of Merge memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as more as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set(#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/swith_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor 
add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdate code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * 
refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more infomation of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistant * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by return Maybe<void> type (#2051) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initilizer (#2082) * zero constant initilzer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * 
fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implement (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc ; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * check in softmax loss * nn.conv2d and 
nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data laod * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix sveral ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alextnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge general ops (#2131) * merge some general ops to 
dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * trancated normal for bert (#2150) * Update bert for dev python (#2151) * trancated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/saveargs * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in 
pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge placement group when add op; remove mergeplacementgroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check blob split_axis can be divide by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3.python blob desc .etc * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fixbug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placment scope exit * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placment group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * remove useless lines * fix bert and init_cluster_env for multi nodes * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) * InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda life time; gen job build error add location info * refine error generation and return * refine check lbi vaild and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. 
* Remove fake consume op * Support enable/disable XLA by set env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Of xla (#2237) * mv deprecated.pb_util to lib.core.pb_util * add op get_variable and get_variable test (#1975) * add op get_variable and get_variable test * modify shape extend * AllReduceSequencePass (#1976) * python2 compatibility for check_point * fix "return (blob_a, blob_b)" bug * rename: arg_passing => arg_pass * shared regst blob header between jobs (#1919) * half impl * register manager handle memory shared for separated memory * set separated memory shared id for shared regst between jobs * half impl of python for blob * fix BUG of pod ToProto() when proto has inited * fix BUG of infer dim0_inner_shape() in foreign_input_op * 1. PushJob copy from python can infer dim0_valid_num * add test for dynamic relu * refine test file * refine code * refine note * update test file for new interface * rename separated_header* (#1979) * some bugs fixes for a train&eval job (#1978) * debugging alex net * check in test pull_multiple_blob.py * strcter check * fix bias in conv * fix various bugs * rm file * op_name in different jobs can be overloaded * fix compile bug in job_set_compile_ctx * rm cmake code for building oneflow binary * check in script (#1980) * check in script * rm used import * CudaCurrentDeviceGuard (#1977) * fix val (#1981) * Merge job set and split fw bw (#1982) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in 
NewKerneUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half * feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complte InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnaray * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor 
INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support the half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when use auto_mixed_presion * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. 
refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: udpate registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_gragh in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * Merge job set and split fw bw (#1983) * delete tmp_split_fw_bw_train_conf (#1985) * delete tmp_split_fw_bw_train_conf * delete useless comments * fix refactor bug in layer_norm_op * minor fixes * update py script * remove code could be misleading * Fix all reduce mem sharing (#1986) * fix all reduce mem sharing * ByteSizeOfDataContentField=>ByteSizeOfBlobBody * remove obsolete task_graph optimization * no arg_pass_job for variable_op * merge memory block id between jobs (#1910) * refine MemBlock and CriticalSection * job memory sharing strategy * revert diff in CriticalSectionDesc * Merge memory block between sub plans * Get mutual exclusion job groups * forget to consider memory merge only in same machine * memory zone unique id * Merge Done; merge memory block id from right to left; get memory block ids info * revert MemBlock * generate mutual exclusion job groups Done. * update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implement of Merge memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as more as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set(#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/swith_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor 
add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdate code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * 
refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more infomation of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistant * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by return Maybe<void> type (#2051) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initilizer (#2082) * zero constant initilzer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * 
fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implement (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc ; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * check in softmax loss * nn.conv2d and 
nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data laod * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix sveral ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alextnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge general ops (#2131) * merge some general ops to 
dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * trancated normal for bert (#2150) * Update bert for dev python (#2151) * trancated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/saveargs * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in 
pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge placement group when add op; remove mergeplacementgroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check blob split_axis can be divide by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3.python blob desc .etc * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fixbug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placment scope exit * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placment group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * remove useless lines * fix bert and init_cluster_env for multi nodes * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) * InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda life time; gen job build error add location info * refine error generation and return * refine check lbi vaild and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. 
* Remove fake consume op * Support enable/disable XLA by set env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Fix static cublas library and xla link conflict * Fix cublas link conflict with tensorflow * Fix different connection kinds for multiple gpu cards (#2282) * Refine xla cluster algo (#2289) * Fix different connection kinds for multiple gpu cards * Fix bug for mutiple outputs consumed by one node * Refine cluster algo * Refine MarkClusterId pass and ReduceSplit task node (#2314) * Fix different connection kinds for multiple gpu cards * Fix bug for mutiple outputs consumed by one node * Refine cluster algo * Determine fusion disabled edges * update * Produce multiple registers on edges for ReduceSplit task node. Fix new allocator by stream id. * Refine MarkClusterId pass * Clustering subgraph with reverse ordering is better * Support strict clustering by taking dependencies into consideration * Translate rebuild job and rewrite optimizer into passes, and refine code style * Fix spell error * Update cmake * Merge branch dev_python (#2321) * Dev res50 new api (#2173) * check in script * runable * fix multinode * fix and real train * fix param data_format * fix truncated normal * quick fix multi node launch (#2193) * Dev reshape sbp (#2192) * reshape sbp * more check for reshape conf * fix error CHECK * refactor reshape * fix reshape like op * support naive case of s0 * refine * rm redundant code * more generous check for equal element cnt * restore empty line * add GatherMs0Grad op (#2191) * support for gather with s(0) `in' * add gather_ms0_op * fix bugs in message GatherMs0OpConf and GatherMs0Kernel * only (B, S(0)) -> P supported for gather_ms0 op * add GatherMs0Grad op * minor fix * refine code * bugfix and update gather test case * add concat op and pass the test (#2067) * add concat op and pass the test * add vgg job_conf * model compared to be same as the old one * rm unnecessary file * Update array_ops.py * mv file * get rid of ternary operator (#2195) * Dev reshape util struct (#2194) * check in changes * rm file * minor fix * Merge network files of 2 cnns (#2196) * add inceptionV3 * check in vgg16 * add cnns test scripts for dev_python (#2170) * add cnns test scripts for dev_python * add alexnet test scripts * add resnet50 * add inceptionv3 * add resnet50 * add vgg16 * first version of run_cnns_test.py * remove old files * unsorted_segment_sum (#2198) * oneflow.unsorted_segment_sum (#2199) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * Dev batch unsorted segment sum (#2200) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * rename UnsortedSegmentSum to BatchUnsortedSegmentSum * rename: batch_unsorted_* => unsorted_batch_* * unsorted_segment_sum (#2201) * unsorted_segment_sum * fix job_completer/unsorted_segment_sum_grad.cpp * more check for unsorted_segment_sum batch_axis * remove FixParallelDesc (#2202) * rm KernelIfWithModel KernelIfWithActivation (#2203) * remove KernelIfWithActivation * remove KernelIfWithModel * rm blob header kLossInstanceNum (#2204) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch 
for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * fix warning: return string reference to temporary (#2212) * docker build support (#2002) * update cmake files * check in files * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * shrink ctx size * fix script * fix wheel build * fix wheel build not adding .so (#2052) * lower cmake version bar * rm more files * keep build dir * check in test bash script * fix * Dev docker sx (#2124) * add python2 docker env * rm old docker files * update repository * add ARG CUDA and USE_PYTHON_3_OR_2 * reform files * update * rm log doesn't print when there is cache * use default arg in dockerfile * better py 2 or 3 condition * add default * use if * update alexnet * update for bert * 15->16 * add resnet50 in model (#2217) * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215) * remove parallel policy * rm FC/rnn/embedding_look_up op/kernel * add check data parallel for conv/layer_norm op * bugfix: bias add + use math_add when batch size = 1 * fix InferBatchAxis (#2220) * sync with bert_benchamrk (#2221) * sync with bert_benchamrk * rename run.sh * Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * run sucessfully on single GPU * fix 121 for tick (#2069) * delete unncessary multiply_grad class * speed up generate time for dot2svg (#2083) * Add axis conf to bias_add for any axis channel (#2087) * bias_add completion * follow comment * make conf axis required * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091) This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47. 
* updated * fix segment_sum_grad * fix sbp * fix segment_sum impl for data parallel * fix * remove useless code in segment_kernel_util.h * add python interface * fix sigmoid conf * fix naming error * fix typo * temp mod loss sbp * add LazyAdam * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep * rm useless code * unsorted_segment_sum * refactor sigmoid_cross_entropy_loss_kernel to high performance * Improve sigmoid cross entropy loss grad (#2207) * remove for loop called cuda kernel * minor fix * ../oneflow/python/ops/data_ops.py (#2209) * fix lazy_adam * Merge wnd and python (#2214) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * merge dev_python * fix boxing: P->S(0) * check in docker build scripts (#2216) * Dev python widedeep docker (#2218) * check in docker build scripts * check in .dockerignore * rm oneflow.segment_sum * remove segment_sum * rm unused file * rm debug code * rm debug code * rm double empty lines * remove useless comments * fix send msg (#2227) * fix reduction_coefficient (#2228) * refactor ndarray for eq/ne/... * Dev kernel launch synchronized (#2230) * IsKernelLaunchSynchronized * virtual * refine * refine * seperate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC * more static_assert * remove unused task related dot function (#2236) * remove unused task related dot function * do not output dot rank info * Dev non distributed optimizer js (#2234) * op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix dead lock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization * refine lazy adam (#2244) * refine lazy adam * update * memory version 2 step 1: replace original concept about mem sharing (#2242) * mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem * memory version 2 step 1: replace original concept about mem sharing * record reader multi thread (#2246) * multi thread * ComputeThreadPoolSize * python api * Fix random decode (#2252) * add decode random * fix decode random actor * Dev pr boxing v2 (#2248) * NcclDeviceCtx * include naive_actor * refine * use_boxing_v2 * config.use_boxing_v2 * SubTskGphBuilder * fix * hash<oneflow::MemoryCase> * Maybe<void> * ChainSubTskGphBuilder * SliceBoxingOp * return ok * SliceBoxingKernel * SliceBoxingActor * kSliceBoxing * nccl boxing op * nccl actor * 
REGISTER_OP * GetMsgFromCustomizedConf * NcclBoxingTaskNode * BldSubTskGphByBoxingV2 * NcclBoxingSubTskGphBuilder * fix * fix * NcclKernel * ParallelContext * REGISTER_ACTOR * fix rank set * IsNcclTaskType * limit * 1024 * multi thread reader * thread_num * IsKernelLaunchSynchronized * refine * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx * MakeHostMemCase * NcclBldSubTskGph * remove use less code * use_boxing_v2 * refine * refine * refine * refine * refine * cmake find python note when version less 3.14 (#2286) * fix bug: reduce split kernel inplace (#2297) * Dev bias add (#2299) * use bias add * fix * bias_add * bias add half * fix * reinterpret_cast * fix half * HALF * fix * ADD_DEFAULT_KERNEL_CREATOR * fix * format * Fix dev python test (#2294) * add decode random * fix decode random actor * fix dev_python test scripts * fix batch_size test scripts * fix * Memory Version 2.0 Step 2: MemSharedAndReused between jobs (#2267) * MemBlockProto and ChunkProto * create mem block and chunk after improver * interface merge mem block and chunk between sub plans * merge chunk between jobs for memory reuse * using memory zone unique id replace memory case hash * merge interface op mem block between jobs for mem shared * gen GlobalCriticalSection by mem block id and chunk id * check mem block and chunk valid before runtime * Refactor: RegstMgr ; allocate memory by mem block and chunk instead of regst * fix bug; and pass test * fig bug: init chunk_id_count in id_manager * reuse copyHd out mem between jobs * PushPlan and PullPlan for memblock and chunk * refine merge mem block / chunk in oneflow.cpp * at(i); * GetOpName2JobId2TaskProtos functional * using output ptr; pass test AlexNet and Resnet * Fix xla reshape op * Merge upstream of_xla (#2322) * Dev cuda 9 arch 70 (#2318) * kCudaAlignSize = 256 * always compute_70 * __CUDA_API_VERSION >= 10000 * __CUDA_API_VERSION >= 10000 * disable_all_reduce_sequence * Fix xla reshape op * Fix compilation without xla * Remove useless code and fix data type mismatch in field desc (#2326) * Remove useless code * Refine code style * Fix data type mismatch in field desc * Update README.md (#2335) * Refine code style (#2336) * Update XLA usage document (#2337) * Update XLA usage document * Fix mistakes * Add xla clang-format and format codestyle (#2340) * Revert "Add xla clang-format and format codestyle (#2340)" (#2341) This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724. 
* Add xla clang-format and format codestyle (#2342) * Add xla clang-format and format codestyle * Fix header file missing * Of xla sx (#2334) * add gather grad op and pass testing * rm check * done batch gather grad * pass test * modify according to the review * add unsorted_segment_sum and refine unsorted_batch_segment_sum * reform according to review * refromate according to the clang-format and rm reference to the temp object * Pick step0 and step1 new commits (#2346) * Add xla clang-format and format codestyle * Fix header file missing * Modify codes to support XLA Conflicts: oneflow/core/job/job_builder.cpp oneflow/core/job/job_builder.h oneflow/core/operator/op_conf.proto * Fix a bug for building subgraph although it won't lead to wrong results (#2347) * Fix setting is_mutable in xla launch op (#2349) * Change directory xla to xrt, apply patch if building with xla * Refactor * Add infer shape pass, and Refactor launch kernel, graph compiler * Refine code style, add xla executable and graph compiler * Rename platform.proto as types.proto * change OpCompiler to OpKernel, complete xla graph compiler * Fix compilation bugs and add allocator, now xla compilation is ok * Add xla executable runtime * Add executable run scope to support launch kernel on specific stream. * Fix infer shape pass, and revert cuda event pool * Refactor graph building with attaching argument metadata. * Set mutability if rebuilding job * Set device ordinal correctly * Refine DelOps * Refine Argument definition and abstract function as subgraph * Fix infer shape in xrt launch op and launch kernel. * Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt. * Refine code style * Rename xla Operand as XlaValue. * Complete TensorRT compiler and builder, Refine OpKernel * Pick public code changes from the new tensorrt branch. * Fix tensorrt compilation * Fake implementation of trt executable * Support selecting engine in launch kernel, refine trt executable * Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfix. * Support train phase setting for registered op kernel * Remove RewriteOptimizer pass, update xla optimizer op. * Format job builder .h and .cpp files. * Remove RewriteOptimizer pass, update xla optimizer op. * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job. * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job. * Refine code style and comment. * Refine model update inference for launch op. * Refine * Refine code style and comment. * Refine model update inference for launch op. Conflicts: oneflow/xrt/kernel/op_kernel.h oneflow/xrt/node_util.cpp oneflow/xrt/node_util.h oneflow/xrt/passes/cluster.h oneflow/xrt/passes/mark_cluster_id_pass.cpp oneflow/xrt/passes/rebuild_job_pass.cpp oneflow/xrt/types.h * Add xrt README.md * Add use_xla_jit and use_tensorrt options in job proto * Refine code style * Fix BlobDesc getter and xla LayerNorm op for FP16 * Make use_xla_jit and use_tensorrt configurable from python config and env variables. 
* Update benchmark * Refine xrt README and rename compile_with_xrt.h file * Update README * Revert tensorrt * Fix absl missing if building with TensorRT but without XLA * Update xrt benchmark * Disable WITH_XLA by default * Update xrt benchmark * Format xrt as core * add activation op * add softmax op * Refine code style, remove unused code * Remove duplication of XLA usage * test pass * pooling test pass * add concat op, not tested * add activation ops, test not psassed * Add xla gelu unittest * add activation op, and test passed * add pooling op, and test passed * Fix int64 env variable * Export float16 for python * Add xla relu unittest * try to solve conv bug * add elementwise add op, test passed * add concat op, test passed * Bugfix: transfer weights from gpu to host since tensorrt requires host weights. * add op unit tests * resolve conflicts and fix softmax bug * add identity op and topk op, to test * Add xla bias add and reshape unittests * Add xla identity unittest * Add xla cast and scalar op unittests * Add xla broadcast op and transpose unittests * Add xla add, sigmoid and tanh unittests * add reduce mean op, test passed * formate ops, add CHECKs, and optimize function structure * Add xla gather and batch_gather unittests * Add xla softmax unittest and fix softmax bug if axis is not the last dim. * add trt gather op and unit test * Add xla reduce_sum unittest, and support keep_dims for xla reduce * Add xla layer_norm unittest, and refine xla layer norm op * Add reshape_like unittest, and export reshape_like api * Refine xrt unittest code style * Export softmax_grad op, add softmax_grad unittest * Export tanh_grad op and add xla unittest * Export gelu_grad op, and add xla unittest * add conv unit test * reformate * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests * Commit to merge upstream of_xrt * check files * modify files according to review advice. * Add xrt unittests (#2483) * Revert tensorrt * Fix absl missing if building with TensorRT but without XLA * Update xrt benchmark * Add xla gelu unittest * Fix int64 env variable * Export float16 for python * Add xla relu unittest * Add xla bias add and reshape unittests * Add xla identity unittest * Add xla cast and scalar op unittests * Add xla broadcast op and transpose unittests * Add xla add, sigmoid and tanh unittests * Add xla gather and batch_gather unittests * Add xla softmax unittest and fix softmax bug if axis is not the last dim. * Add xla reduce_sum unittest, and support keep_dims for xla reduce * Add xla layer_norm unittest, and refine xla layer norm op * Add reshape_like unittest, and export reshape_like api * Refine xrt unittest code style * Export softmax_grad op, add softmax_grad unittest * Export tanh_grad op and add xla unittest * Export gelu_grad op, and add xla unittest * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests * Commit to merge upstream of_xrt * Fix reduce_mean facade bug if keep_dims if true. * Refine tensorrt unittests * Check failed if full reduce without keep dimension. * madd pooling unit test * Add tensorrt bias_add and reshape op, and their unittests. * Support fp16 for tensorrt. * Add tensorrt transpose op and unittest. * add unit test conv_2d * add unit test concat * Fix concat if axis is -1. * Refine tensorrt conv2d unittest * Fix padding mode for conv2d and pooling, refine unittests. * Refine tensorrt concat unittest * Add convert api from string engine to XrtEngine. * Revert tensorrt, and merge of_xrt branch * Remove some comments. 
* Refine tensorrt unittests * Add XrtConfig to deal with xla and tensorrt configurations. Conflicts: oneflow/xrt/api.cpp * Update tensorflow.cmake to avoid applying the patch repeatedly. * Remove XrtConfig Option, and fix xrt unittests * Add tensorrt batch norm (#2516) * Refine xrt signatrue hash, and fix python configuration (#2520) * Fix XrtCompilationEnabled returns (#2524) * Fix compilation after merge dev_python * Update xrt unittests * Revert protobuf version * Remove comment FOR_RANGE * Remove unused code * Reformart * Refine job builder * Disable dump job if not debug mode Co-authored-by:
Snow <snow3s@qq.com> Co-authored-by:
Juncheng <liujuncheng1022@gmail.com>
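The entry above opens with fixes to unsorted_segment_sum and its grad (batch_axis checks, sbp, data-parallel impl). As a reading aid only, here is a minimal host-side C++ sketch of what an unsorted segment sum computes, assuming a flat row-major [n, inner] data layout; the function name, layout, and out-of-range handling are illustrative assumptions, not OneFlow's actual kernel interface.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// data: [n, inner], segment_ids: [n], out: [num_segments, inner]
// Each row of `data` is scatter-added into the output row named by its segment id.
void UnsortedSegmentSum(const std::vector<float>& data,
                        const std::vector<int64_t>& segment_ids,
                        int64_t inner, int64_t num_segments,
                        std::vector<float>* out) {
  out->assign(num_segments * inner, 0.0f);
  const int64_t n = static_cast<int64_t>(segment_ids.size());
  for (int64_t i = 0; i < n; ++i) {
    const int64_t seg = segment_ids[i];
    if (seg < 0 || seg >= num_segments) { continue; }  // ignore out-of-range ids
    for (int64_t j = 0; j < inner; ++j) {
      (*out)[seg * inner + j] += data[i * inner + j];
    }
  }
}

int main() {
  const std::vector<float> data = {1, 1, 2, 2, 3, 3};  // 3 rows, inner = 2
  const std::vector<int64_t> ids = {0, 2, 0};
  std::vector<float> out;
  UnsortedSegmentSum(data, ids, /*inner=*/2, /*num_segments=*/3, &out);
  for (float v : out) { std::cout << v << " "; }       // prints: 4 4 0 0 2 2
  std::cout << "\n";
  return 0;
}
```

The grad of this op is a gather over the same segment ids, which is why the segment_sum and gather changes travel together in the entries above.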
-
- Dec 21, 2019
-
-
lixinqi authored
-
- Dec 20, 2019
- Dec 12, 2019
-
-
Houjiang Chen authored
-
- Nov 26, 2019
-
-
Li Xinqi authored
* cmake find python note when version less 3.14 (#2286) * fix bug: reduce split kernel inplace (#2297) * Dev bias add (#2299) * use bias add * fix * bias_add * bias add half * fix * reinterpret_cast * fix half * HALF * fix * ADD_DEFAULT_KERNEL_CREATOR * fix * format * Fix dev python test (#2294) * add decode random * fix decode random actor * fix dev_python test scripts * fix batch_size test scripts * fix * Memory Version 2.0 Step 2: MemSharedAndReused between jobs (#2267) * MemBlockProto and ChunkProto * create mem block and chunk after improver * interface merge mem block and chunk between sub plans * merge chunk between jobs for memory reuse * using memory zone unique id replace memory case hash * merge interface op mem block between jobs for mem shared * gen GlobalCriticalSection by mem block id and chunk id * check mem block and chunk valid before runtime * Refactor: RegstMgr ; allocate memory by mem block and chunk instead of regst * fix bug; and pass test * fig bug: init chunk_id_count in id_manager * reuse copyHd out mem between jobs * PushPlan and PullPlan for memblock and chunk * refine merge mem block / chunk in oneflow.cpp * at(i); * GetOpName2JobId2TaskProtos functional * using output ptr; pass test AlexNet and Resnet * Dev cuda 9 arch 70 (#2318) * kCudaAlignSize = 256 * always compute_70 * __CUDA_API_VERSION >= 10000 * __CUDA_API_VERSION >= 10000 * disable_all_reduce_sequence * Fix cuda9 cudnn turing issue (#2329) * fix cuda 9 issus on turing device * CUDA_VERSION * no cuda check * bias add kernel gpu half (#2330) * mem_block=>header_mem_block (#2338) * speedup oneflow compilation * identity_sbp_conf * DropOut Version2 (#2355) * random mask like op conf; refine dropout op in python * remove useless dropout kernel conf * implement of random mask like op * refine dropout op * refine dropout grad op * refine generate dropout backward * random mask like kernel * refine dropout (grad) kernel * fix link problem for template separated compile * fix bug and pass test * dropout kernel for half * add check for dropout mask input data type * bugfixs * Remove IsOpFloat32() in auto_mixed_precision.cpp (#2358) * fuse op/kernl to 1 cpp * refine for review * fix bug * Refactor Kernel Registry for more flexible registration (#2363) * feat: update KernelRegistration and add KernelRegValProto * Refactor Kernel Registry for more flexible registration * Remove unused kernel_reg_value.proto * Memory Version 2.0 Step 3: MemReused in job (#2319) * use_memory_allocation_algorithm_v2 for switch improver mem block id * reuse plan task graph and ctrl edge for inferred mem block * refine interface; InJobMemSharingUtil * navie merge memory big chain; gen regst apply/release queue; handle for inplace hint regst * generate regst 2 mutual exclusion regsts * bugfix: apply should before release * interface for multi-thread run algorithm get mem block offset result * selet best algorithm to set mem block id and mem block offset * set mem block for inplace consumer regst * 3 algorithm interface * half implement of algo 1 * implement of algorithm0_OfColorImproved * runnable in 1 machine 1 device * Memory Chain * merge MemoryChain and pass Correctness test of alexnet and resnet50 * bugfixs: continues inplace consume relationship in bert-base fp16 * erase useless info in MemoryChain * implement of BfcAllocator and Tf_Bfc algorithm * use bfc algo and fix bug * only use default algo * renme in_job_* => intra_job_* * rename: InJob* => IntraJob* * rename: 1) apply_regsts_queue => alloc_regsts_queue; 2) 
release_regsts_queue => free_regsts_queue * rename function name in job/intra_job_mem_sharing_util.cpp * rename variable names in job/intra_job_mem_sharing_util.cpp: 1) *apply* => *alloc*; 2) *release* => *free* * refactor FindFreeOffset => FindFreeOffsetAndNewBufferSize * rename method: DeallocateRaw => FreeRaw * rename varable for review * use enum for mem reused algorithm and add python interface * fix sbp infer (#2373) * mv addr calculation out of decoder (#2374) * use tmp blob for temp storage (#2375) * INDEX_DATA_TYPE_SEQ (#2381) * refine include (#2382) * refine include * format format * element_wise_mul (#2383) * gather refine (#2384) * Dev fix sbp (#2388) * fix sbp * fix sbp * remove VirtualGenKernelConf * rename Read to ReadFully (#2389) * Dev parallel cast (#2391) * parallel cast * op_conf * refine * Dev auto zero padding (#2393) * auto_zero_padding * auto_zero_padding * fix * fix input_mask and token_type_id (#2398) * fix job launch (#2401) * fix sbp bug (#2402) * fix sbp * fix * add missing header files (#2410) * refactor cnn model tests (#2411) * refactor cnn model tests * reformat README.md * reformat README.md * refactor ndarray_reduce (#2412) * fix inplace reachability bug (#2413) * refactor gpu relu (#2414) * refactor gpu relu * CHECK_KERNEL_SAFE_INT32 * there may be a subtle cuda bug in ((float) x < 0) * refactor ndarray_reduce (#2405) * refactor ndarray_reduce * refactor relu/bias_add * refactor relu * refactor relu * refactor bias_add * refactor relu/bias_add * fix inplace_lbi bug * refactor addition * IsKernelSafeInt32 * CUDA_1D_KERNEL_LOOP_T * CUDA_1D_KERNEL_LOOP_T * If add (#2415) * refactor ndarray_reduce * refactor relu/bias_add * refactor relu * refactor relu * refactor bias_add * refactor relu/bias_add * fix inplace_lbi bug * refactor addition * IsKernelSafeInt32 * CUDA_1D_KERNEL_LOOP_T * CUDA_1D_KERNEL_LOOP_T * add unless oprand is nonzero * Clear session (#2416) * oneflow.clear_default_session * fix bugs in oneflow.config.machine * refactor function return type (#2417) * fix for py2 (#2418) * blob parallel conf * Pr watch scope (#2419) * pr oneflow.watch* * merge more code to pass watch_scope.py * TODO: input_blob_def.parallel_conf * fix reexport of identity op * merge dev_quick_dirty_object_detection * oneflow.cluster (#2423) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * Free port (#2427) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * Dev single processor test (#2430) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * single process test * Cluster::WorkerLoop * delete unnecessary OF_BARRIER_ALL * no longer fork children processes to run tests * format * fix align byte size bug (#2436) * fix align bugs (#2440) * fix: GetNumOfLoDLevels lack return * minor script fix and update * update script * remove redundant function
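Among other things, the entry above introduces "DropOut Version2", which splits dropout into a random-mask-like op plus a purely elementwise dropout (and dropout-grad) op over that mask. Below is a minimal sketch of that split, assuming a Bernoulli keep-mask and 1/(1-rate) scaling; the function names and int8 mask type are placeholders, not the actual OneFlow kernels.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// "random mask like": draw a keep/drop mask with the same element count as the input.
std::vector<int8_t> RandomMaskLike(std::size_t n, float rate, uint64_t seed) {
  std::mt19937_64 gen(seed);
  std::bernoulli_distribution keep(1.0 - rate);
  std::vector<int8_t> mask(n);
  for (std::size_t i = 0; i < n; ++i) { mask[i] = keep(gen) ? 1 : 0; }
  return mask;
}

// Forward and backward share this body: y = x * mask * (1 / (1 - rate)),
// so the expectation of the kept, rescaled values matches the input.
void DropoutApply(const std::vector<float>& x, const std::vector<int8_t>& mask,
                  float rate, std::vector<float>* y) {
  const float scale = 1.0f / (1.0f - rate);
  y->resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) { (*y)[i] = x[i] * mask[i] * scale; }
}

int main() {
  const std::vector<float> x = {1.f, 2.f, 3.f, 4.f};
  const std::vector<int8_t> mask = RandomMaskLike(x.size(), /*rate=*/0.5f, /*seed=*/1);
  std::vector<float> y;
  DropoutApply(x, mask, 0.5f, &y);  // roughly half the entries zeroed, the rest doubled
  return 0;
}
```

Reusing the same mask blob in the backward pass is what lets the generated grad op be the same elementwise multiply applied to dy instead of a dedicated dropout-backward kernel.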
-
- Nov 18, 2019
-
-
Li Xinqi authored
* pr oneflow.watch* * merge more code to pass watch_scope.py * TODO: input_blob_def.parallel_conf
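This entry tracks work on oneflow.watch*, which the log elsewhere describes as a debug hook ("oneflow.watch for debug"). Purely as an illustration of the registry pattern such a hook implies, here is a C++ sketch where callers register a callback keyed by blob name and the runtime invokes it once the blob's value is produced; every name below is hypothetical, not OneFlow's actual interface.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

class WatchRegistry {
 public:
  using Callback = std::function<void(const std::vector<float>&)>;

  // User side: attach a debug callback to a named blob.
  void Watch(const std::string& blob_name, Callback cb) {
    callbacks_[blob_name].push_back(std::move(cb));
  }

  // Runtime side: fire the callbacks when the blob's value becomes available.
  void Notify(const std::string& blob_name, const std::vector<float>& value) const {
    auto it = callbacks_.find(blob_name);
    if (it == callbacks_.end()) { return; }
    for (const auto& cb : it->second) { cb(value); }
  }

 private:
  std::unordered_map<std::string, std::vector<Callback>> callbacks_;
};

int main() {
  WatchRegistry registry;
  registry.Watch("fc1/out", [](const std::vector<float>& v) {
    std::cout << "fc1/out has " << v.size() << " elements\n";
  });
  registry.Notify("fc1/out", {0.1f, 0.2f, 0.3f});
  return 0;
}
```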
-
- Oct 15, 2019
-
-
lixinqi authored
-
- Oct 11, 2019
-
-
Juncheng authored
* kCudaAlignSize = 256 * always compute_70 * __CUDA_API_VERSION >= 10000 * __CUDA_API_VERSION >= 10000 * disable_all_reduce_sequence
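The first bullet above pins kCudaAlignSize to 256. A hedged sketch of how such a constant is typically used: every blob's byte size is rounded up to a 256-byte boundary so device pointers carved out of a shared allocation stay well aligned. The helper name is illustrative, not the actual OneFlow utility.

```cpp
#include <cstddef>

constexpr std::size_t kCudaAlignSize = 256;

constexpr std::size_t RoundUpToAlign(std::size_t byte_size) {
  return (byte_size + kCudaAlignSize - 1) / kCudaAlignSize * kCudaAlignSize;
}

static_assert(RoundUpToAlign(1) == 256, "small sizes round up to one full unit");
static_assert(RoundUpToAlign(256) == 256, "already-aligned sizes are unchanged");
static_assert(RoundUpToAlign(257) == 512, "crossing the boundary takes the next unit");

int main() { return 0; }
```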
-
- Oct 09, 2019
-
-
lixinqi authored
-
- Sep 28, 2019
-
-
lixinqi authored
-
- Sep 27, 2019
-
-
Li Xinqi authored
-
- Sep 24, 2019
-
-
Niu Chong authored
* Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * run sucessfully on single GPU * fix 121 for tick (#2069) * delete unncessary multiply_grad class * speed up generate time for dot2svg (#2083) * Add axis conf to bias_add for any axis channel (#2087) * bias_add completion * follow comment * make conf axis required * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091) This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47. * updated * fix segment_sum_grad * fix sbp * fix segment_sum impl for data parallel * fix * remove useless code in segment_kernel_util.h * add python interface * fix sigmoid conf * fix naming error * fix typo * temp mod loss sbp * add LazyAdam * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep * rm useless code * unsorted_segment_sum * refactor sigmoid_cross_entropy_loss_kernel to high performance * Improve sigmoid cross entropy loss grad (#2207) * remove for loop called cuda kernel * minor fix * ../oneflow/python/ops/data_ops.py (#2209) * fix lazy_adam * Merge wnd and python (#2214) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * merge dev_python * fix boxing: P->S(0) * check in docker build scripts (#2216) * Dev python widedeep docker (#2218) * check in docker build scripts * check in .dockerignore * rm oneflow.segment_sum * remove segment_sum * rm unused file * rm debug code * rm debug code * rm double empty lines * remove useless comments * fix send msg (#2227) * fix reduction_coefficient (#2228) * refactor ndarray for eq/ne/... 
* Dev kernel launch synchronized (#2230) * IsKernelLaunchSynchronized * virtual * refine * refine * seperate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC * more static_assert * remove unused task related dot function (#2236) * remove unused task related dot function * do not output dot rank info * Dev non distributed optimizer js (#2234) * op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix dead lock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization * refine lazy adam (#2244) * refine lazy adam * update * memory version 2 step 1: replace original concept about mem sharing (#2242) * mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem * memory version 2 step 1: replace original concept about mem sharing * record reader multi thread (#2246) * multi thread * ComputeThreadPoolSize * python api
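The first bullets above ("async msg queue", "EnqueueAsyncMsg") describe batching actor messages instead of dispatching each one eagerly. Below is a minimal sketch of that idea under the assumption of a mutex-guarded deque drained in one pass by the owning thread; the struct fields and class name are illustrative, not the actual actor runtime.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>

struct ActorMsg {
  int64_t dst_actor_id;
  int64_t regst_desc_id;
};

class AsyncMsgQueue {
 public:
  // Producers append messages instead of sending them one by one.
  void EnqueueAsyncMsg(const ActorMsg& msg) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(msg);
  }

  // The owning thread drains the queue once and hands every pending
  // message to the dispatcher as a single batch.
  void Dispatch(const std::function<void(const ActorMsg&)>& send) {
    std::deque<ActorMsg> pending;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending.swap(queue_);
    }
    for (const ActorMsg& msg : pending) { send(msg); }
  }

 private:
  std::mutex mutex_;
  std::deque<ActorMsg> queue_;
};

int main() {
  AsyncMsgQueue queue;
  queue.EnqueueAsyncMsg(ActorMsg{/*dst_actor_id=*/7, /*regst_desc_id=*/42});
  queue.Dispatch([](const ActorMsg& /*msg*/) { /* forward to the destination actor */ });
  return 0;
}
```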
-
- Sep 20, 2019
-
-
Juncheng authored
* op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix dead lock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization
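The bullets "balance by model size", "GetSoleOutBlobSize", and "config group_size_mbyte" above suggest the non-distributed-optimizer pass packs model blobs into groups capped by a configured byte budget, so each group is broadcast and updated together. A sketch of that grouping rule, assuming a fixed traversal order; the function below is illustrative, not the pass's real interface.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<std::vector<int>> GroupBySize(const std::vector<int64_t>& blob_bytes,
                                          int64_t group_size_bytes) {
  std::vector<std::vector<int>> groups;
  int64_t acc = 0;
  for (int i = 0; i < static_cast<int>(blob_bytes.size()); ++i) {
    // Cut a new group when adding this blob would exceed the budget
    // (an oversized blob still gets a group of its own).
    if (groups.empty() || (acc > 0 && acc + blob_bytes[i] > group_size_bytes)) {
      groups.emplace_back();
      acc = 0;
    }
    groups.back().push_back(i);
    acc += blob_bytes[i];
  }
  return groups;
}

int main() {
  const int64_t mib = 1024 * 1024;
  // Blobs of 3, 3, 3, 9, 1 MiB with an 8 MiB budget -> groups {0,1}, {2}, {3}, {4}.
  const std::vector<int64_t> blob_bytes = {3 * mib, 3 * mib, 3 * mib, 9 * mib, 1 * mib};
  const auto groups = GroupBySize(blob_bytes, 8 * mib);
  std::cout << groups.size() << " groups\n";  // prints: 4 groups
  return 0;
}
```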
-
- Sep 07, 2019
-
-
Niu Chong authored
* feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc
-
cheng cheng authored
-
- Sep 04, 2019
-
-
Juncheng authored
* assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp
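The entry above adds LearningRateScheduleOp/Kernel driven by an AutoGlobalStep train_step blob, replacing the per-model next_model_vid bookkeeping. As an illustration only, here is a common exponential-decay schedule computed as a pure function of that step; the exact conf and formula used by the commit may differ, and the function name is a placeholder.

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

// lr(step) = base_lr * decay_rate ^ (step / decay_steps), optionally staircased.
float ExponentialDecayLearningRate(float base_lr, int64_t train_step,
                                   int64_t decay_steps, float decay_rate,
                                   bool staircase) {
  double p = static_cast<double>(train_step) / static_cast<double>(decay_steps);
  if (staircase) { p = std::floor(p); }
  return static_cast<float>(base_lr * std::pow(decay_rate, p));
}

int main() {
  // base_lr 0.1, decay by 0.9 every 1000 steps: at step 2000 -> 0.1 * 0.9^2 = 0.081
  std::cout << ExponentialDecayLearningRate(0.1f, 2000, 1000, 0.9f, /*staircase=*/true)
            << "\n";
  return 0;
}
```

Making the learning rate an ordinary blob produced by a kernel is what lets the optimizer ops consume it like any other input instead of reading it from JobDesc.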
-