  1. Jul 16, 2021
  2. Jul 01, 2021
  3. May 05, 2021
  4. Mar 09, 2021
  5. Feb 24, 2021
  6. Feb 19, 2021
  7. Feb 18, 2021
    • NCCL use compute stream to memory cost & speed up (#4221) · 45697b0c
      cheng cheng authored
      * Enable insert nccl logical op pass
      
      * FindMaxConnectedSubgraphForGpuExecOrder~
      
      * through order and interface
      
      * implement of insert nccl logical op in pass
      
      * add nccl logical op using UserOp Implement and EagerNcclCommMgr
      
      * add NCCL ReduceScatter op/kernel; refine pass impl of topo order
      
      * add NCCL logical op/kernel AllGather
      
      * fix bug of reduce scatter/ all gather infer shape
      
      * refine log and note
      
       * fix compiler err when building with CPU ONLY
      
      * support NCCL ALL2ALL and test pass of alexnet model parallel
      
      * rollback of diff in checkpointing_pass.cpp
      
      * rename to nccl_use_compute_stream; ResourceDesc::nccl_use_compute_stream; refine name for review; create nccl_comm_ in KernelCompute;
      
      * refine code for review
      
      * add unittest for nccl use compute stream
      
      * format test scripts
      
      * refine align
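
      For reference, a minimal usage sketch of the switch this commit introduces. The Python
      accessor name is an assumption inferred from the ResourceDesc::nccl_use_compute_stream
      field named above, not confirmed by this log:

          import oneflow as flow

          # Hypothetical sketch: route NCCL logical collectives onto the compute
          # stream instead of a dedicated NCCL stream, as this PR describes.
          flow.config.gpu_device_num(4)
          flow.config.nccl_use_compute_stream(True)  # accessor name assumed from the ResourceDesc field
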
  8. Feb 14, 2021
    • Sink tick in main job (#4207) · 25d9c26c
      Li Xinqi authored
      
      * source subset tick
      
      * remove useless header files
      
      * insert DstSubsetTickOp
      
      * remove incorrect CHECK
      
      * add tick op for each machine
      
      * TryBindBnWithOneofRegst
      
      * add sink tick op in main_job
      
      * refactor LinkMainJob
      
      * fix typo in task_graph
      
      * refactor AddGlobalCriticalSection
      
      * rename and refactor DstSubsetTick::InferBlobDescs and SrcSubsetTick::InferBlobDescs
      
      * add src_subset_tick for input-output critical section
      
      * refactor AutoSourceTick and AutoSinkTick
      
      * SrcSubsetTickCompTaskNode: bind bns and in_regst if bns is valid in current device
      
      * refactor optional input to repeated inputs for SrcSubsetTickOpConf
      
      Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
  9. Nov 02, 2020
  10. Sep 05, 2020
  11. Jul 28, 2020
  12. Jul 23, 2020
    • Dev apache2 license (#3266) · d0bdbd5d
      Shenghang Tsai authored
      
      * add license at root dir
      
      * check in empty files
      
      * rm space
      
      * check in script
      
      * update script
      
      * fix bug
      
      * add print
      
      * fix
      
      * add exit
      
      * add to of_format
      
      * add CI task
      
      * fix license
      
      * Revert "fix license"
      
      This reverts commit 818b6d7691d3a8b4a25dd41a47ff2c5922b8ec57.
      
      * only add once
      
      * quick fix
      
      * fix script
      
      * dont fmt empty file
      
      * fix
      
      * quick fix
      
      * fix py
      
      * add license
      
      * fix exit
      
      * add license for hpp
      
      * add license
      
      * license new vm files
      
      Co-authored-by: tsai <caishenghang@oneflow.org>
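
      The header that the scripts and CI task above prepend is presumably the standard
      Apache-2.0 boilerplate; roughly, for a Python source file (the exact year and owner
      line are assumptions, not copied from the repository):

          # Copyright 2020 The OneFlow Authors.
          #
          # Licensed under the Apache License, Version 2.0 (the "License");
          # you may not use this file except in compliance with the License.
          # You may obtain a copy of the License at
          #
          #     http://www.apache.org/licenses/LICENSE-2.0
          #
          # Unless required by applicable law or agreed to in writing, software
          # distributed under the License is distributed on an "AS IS" BASIS,
          # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
          # See the License for the specific language governing permissions and
          # limitations under the License.
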
  13. Mar 19, 2020
  14. Jan 08, 2020
  15. Jan 07, 2020
  16. Jan 02, 2020
  17. Dec 27, 2019
    • Merge UserOp (#2471) · 76fe960a
      Niu Chong authored
      
      * Add user op related proto
      
      * Add OpRegistration (#2424)
      
      * Add uncompleted op registration for cooperation review
      
      * Fix the compile bugs
      
      * Refactor the implementation macro of op_reg arg member funcs
      
      * Move impl of Attr() with default to cpp and specialize it
      
      * Add LookUpInOpRegistry()
      
      * Add UserOp as placeholder
      
      * Rename in->input and out->output
      
      * Fix the missing ctor of OpRegistrationBuilder
      
      * Add GetAllRegisteredUserOp() for debug
      
      * Add Log for every user_op registration
      
      * Add const qualifier for Builder::Build() and user_op namespace for REGISTER_USER_OP macro
      
      * Remove the LOG() from ctor of op registrar due to segment fault (maybe a glog bug)
      
      * add customized dir (#2425)
      
      * add customized dir
      
      * customized/.keep
      
      * Add map<string, ListString> output; to UserOpConf (#2426)
      
      * Substitute std::function<...> with alias name and Set default val for those function (#2428)
      
      * Add Kernel Registry (#2431)
      
      * Add Kernel Registration
      
      * Make REGISTER_USER_OP/KERNEL macro available when not in namespace of oneflow
      
      * Add missing TODO of CreateFn parameter
      
      * Add OpKernel for user op
      
      * Fix a little code style
      
      * Add GradRegistry (#2433)
      
      * implement of user_op, instead of get sbp sign (#2429)
      
      * Add UserKernel and UserKernelConf (#2438)
      
      * Add VirtualGenKernelConf() for UserOp and fill UserKernelConf
      
      * Fill KernelRegCtx
      
      * Fix typos and bugs
      
      * Add UserKernel
      
      * Add KernelRegContext(const KernelConf&) as the ctor
      
      * Dev cc python user op conf builder (#2435)
      
      * user op builder in python
      
      * user op wrapper
      
       * Implement adding default value and validity check between c++ and python
      
      * remove notes
      
       * fix bugs and make UserOp compile and run; add ccrelu as a sample
      
      * fix func and class name
      
      * check attr type in op def and op conf; refine code for review
      
      * Dev cc infer tmp size (#2445)
      
      * Refine some code and interface (#2447)
      
      * Add InferContext and Infer Util functions
      
      * Add framework.h as the only included header for user code
      
      * Fix the Dtype infer bug
      
      * Fix duplicated ret
      
      * Fix the KernelRegistration of UserKernel
      
      * Update cc_relu_op.cpp to use InferContext
      
      * Refine and Add test relu kernel
      
      * Add user_op::Blob
      
      * Update InferContext to ptr from const ref
      
      * Add user_op_conf into InferContext and Attr()
      
      * Move cc_relu_op.cpp to customized/ops/
      
      * Add Shape and Dtype into Blob
      
      * Fill the real ReluKernel with Gpu and Float
      
      * Remove unused files
      
      * Add unique_names_ for op registration
      
      * Refactor AttrSeq for re-used of attr function specialization (#2452)
      
      * Refactor AttrSeq for re-used of attr function specialization
      
      * Remove Serialize interface
      
      * Dev cc gen bw user conf (#2449)
      
      * refine python print err
      
      * interface of UserOpWrapper UserOpConfWrapper UserOpConfWrapperBuilder
      
      * implement of user op conf builder in c++
      
      * generate backward op conf for user op
      
      * define grad registration value func
      
      * refine code for review
      
      * check input valid when query need grad
      
      * refine name
      
      * implement of demo ccrelu_grad and test pass of alexnet
      
      * refine ccrelu python
      
      * Add UserOpDefWrapper; Fix .py bug; Add TestReshape op/kernel (#2454)
      
      * Add UserOpDefWrapper
      
      * Update paras of CheckAttrs() from UserOpDef to UserOpDefWrapper
      
      * Fix bug in user_op_builder.py
      
      * Add TestReshape Op&Kernel
      
      * fix ccrelu op grad register and shape infer; Add ccrelu alexnet test python script (#2457)
      
      * Move UserOpConf from op_conf.proto to user_op_conf.proto
      
      * Add test_reshape.py
      
      * Refine the imple of access to attrs in user_op_conf.cpp
      
      * Refactor the way to access AttrVal with AttrValAccessor
      
      * Add GetAttr() in KernelContext
      
      * Rename op_infer_util.h to infer_util.h and Fill paras of InferTmpSize with InferContxt
      
       * Refactor InferContext to simplify user_op
      
      * Refine customized test kernels to get along with interface update
      
      * Dev merge from dev python (#2465)
      
      * Clear session (#2416)
      
      * oneflow.clear_default_session
      
      * fix bugs in oneflow.config.machine
      
      * refactor function return type (#2417)
      
      * fix for py2 (#2418)
      
      * blob parallel conf
      
      * Pr watch scope (#2419)
      
      * pr oneflow.watch*
      
      * merge more code to pass watch_scope.py
      
      * TODO: input_blob_def.parallel_conf
      
      * oneflow.cluster (#2423)
      
      * oneflow.cluster
      
      * no alias for oneflow.cluster.*
      
      * mv cpp_logging_conf from config_proto to cluster_proto
      
      * rename: cluster => env
      
      * rename: Environment => Session
      
      * Free port (#2427)
      
      * oneflow.cluster
      
      * no alias for oneflow.cluster.*
      
      * mv cpp_logging_conf from config_proto to cluster_proto
      
      * rename: cluster => env
      
      * rename: Environment => Session
      
      * auto find a free port for single node environment
      
      * localhost only
      
      * Dev single processor test (#2430)
      
      * oneflow.cluster
      
      * no alias for oneflow.cluster.*
      
      * mv cpp_logging_conf from config_proto to cluster_proto
      
      * rename: cluster => env
      
      * rename: Environment => Session
      
      * auto find a free port for single node environment
      
      * localhost only
      
      * single process test
      
      * Cluster::WorkerLoop
      
      * delete unnecessary OF_BARRIER_ALL
      
      * no longer fork children processes to run tests
      
      * robust contextmanager for CurJobConf (#2434)
      
      * fix of_pure_proto_dir (#2439)
      
      * Ctrl between optimizer (#2443)
      
      * add ctrl edges between optimizors
      
      * update docker file
      
      * sequentialize all optimizors
      
      * Revert "fix of_pure_proto_dir (#2439)" (#2446)
      
      This reverts commit 5031cc86.
      
      * Oneflow unittest (#2448)
      
      * oneflow.unittest.*
      
      * oneflow.unittest.register_testcases
      
      * rename: oneflow.unittest.register_testcases -> oneflow.unittest.register_test_cases
      
      * Test bert inplace with xinqi (#2450)
      
      * update bert script
      
      * update watch_scope test script
      
      * update for debug
      
      * update for debug
      
      * update debug script
      
      * test_inplace.py
      
      * no reshape
      
      * debug IsLbiAllConsumersReachable
      
      * fix inplace
      
      * rm useless code
      
      * update config
      
      * fix critical_section (#2453)
      
      * Patch distribute (#2456)
      
      * backup
      
      * fix bugs
      
      * test_inplace
      
      * Fix InplaceActor when no one consume inplace out regst (#2458)
      
      * disable mutable inplace edge to variable (#2459)
      
      * Fix unittest import conflict for py2 (#2460)
      
      * Create __init__.py (#2464)
      
      * update ccrelu_alexnet.py
      
      * Update code that use UserOpConf to UserOpConfWrapper
      
      * Add paras of GetSbp
      
      * Refactor KernelContext with pure virtual member function
      
      * Refactor KernelRegContext with pure virtual member function
      
      * Refactor UserKernelContext ctor to simplify code
      
      * Refactor InferContext with pure virtual member function
      
      * Rename Blob to Tensor; BlobDef to TensorDesc
      
      * Remove unused log and file
      
      * Dev cc user op sbp (#2480)
      
      * ccrelu multi-gpu runnable
      
      * example of ccrelu op get sbp sign
      
      * sbp context
      
      * add example for get sbp using LogicalTensor...
      
      * GetSbpFnUtil::MirrorSplitAtDim0
      
      * fix bug of useless
      
      * Fix the bug of init UserKernelContext (#2487)
      
      * Fix the bug of init UserKernelContext
      
      * Fix due to comment
      
      * Refine code
      
      * Add missing SetCheckAttrFn implementation
      
      * refine python user_op_builder.SetAttr(); refine reshape test an… (#2505)
      
      * Make UserOp runnable; refine user op python test for new change in dev_python
      
      Co-authored-by: cheng cheng <472491134@qq.com>
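
      Roughly, the Python user op builder wired up in this PR lets a C++-registered op such as
      the ccrelu sample be invoked from a job function; a minimal sketch, with the builder
      method names assumed from the old user_op_builder style (they changed in later versions):

          import oneflow as flow

          def ccrelu(x, name="ccrelu"):
              # Build an op conf for the registered "ccrelu" user op, run it, and
              # return its single output blob. Method names are assumptions.
              return (
                  flow.user_op_builder(name)
                  .Op("ccrelu")
                  .Input("in", [x])
                  .Output("out")
                  .Build()
                  .InferAndTryRun()
                  .RemoteBlobList()[0]
              )
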
  18. Dec 26, 2019
    • XRT: XLA + TensorRT (#2525) · 8f3dcf94
      Houjiang Chen authored
       * Enable multiple definition for xla compilation in oneflow
      
      * Realize running an executable
      
      * Abstract and gather resources (such as client, builder etc.) needed to compile as CompilationResourceStore
      
       * Implement a separate xla allocator to avoid introducing too many tensorflow objects
      
      * Define CompilationContext separately
      
      * Running XLA by CPU mode is OK now
      
       * Make the result shape after running the executable be a tuple, and refine comments
      
      * Add compilation cache to solve recompiling every time
      
      * Resolve InferSbpSignature in XlaLaunchOp
      
       * Resolve executing on a specified cuda stream
      
      * Refine XlaLaunch parallel conf, add batch matmul op
      
      * Refactor job rebuilding and fixup time shape
      
      * Update batch_dim_lbis field if XlaLaunch has any output which has batch dim
      
      * Resolve cluster-ring after clustered, take sbp policy and time shape into consideration
      
      * Add reshape op
      
      * Fix bugs
      
      * Rename CompilationContext by XlaLaunchContext, add XlaRuntimeScope to swap stream handle
      
      * Fix bugs
      
      * Update cmake to compile with xla optionally
      
      * Support more ops
      
      * Add more ops, and fix bugs
      
      * Implement XLA allocator and internal memory pool
      
      * Adaptively resize allocator memory size
      
      * Refine memory allocator
      
      * Block host if running cpu executable
      
      * Fix bug for getting scalar value
      
      * Fix result layout bug. This bug causes wrong result for transpose
      
      * Refine gelu backward
      
      * Of xla sx (#1990)
      
      * add identity xla op
      
      * Add batch gather op
      
      * Refine batch gather
      
       * fix batch gather bug and add gather op, mv identity op to unary_op
      
      * Add softmax and gather/batch_gather
      
      * Add xla softmax_grad op
      
      * Add xla layer normalization op
      
      * Add xla layer norm backward op
      
      * Alias inputs and outputs to compute in-place
      
      * Reuse output buffers when running xla executable. It brings about 10%
      speedup for bert on single gpu by zero copy results
      
      * Reuse output buffers when running xla executable. It brings about 10%
      speedup for bert on single gpu by zero copy results
      
      * Refine xla allocator
      
      * Refine code style
      
      * Add xla reduce_sum op
      
      * Rewrite model update op to optimizer graph
      
      * Fix hang bugs
      
      * Fix input which body is disabled in xla launch kernel
      
      * Fix self control in
      
      * Fix self control in
      
      * Add fake consume op
      
      * Fix HasAttr bug for optional field
      
      * Refine AdamOptimizer
      
      * Fix xla AdamOptimizer bugs
      
      * Add meta data in HLO instruction, and refine
      
      * Fix bugs
      
      * add reduce sum and split normal model update (#2040)
      
      * remove append_func_to_list
      
      * Rm deprecated model update and save code (#1958)
      
      * remove code
      
      * mv random gen to kernel
      
      * mk seed required
      
      * address reviews
      
      * fix unused warning
      
      * address reviews
      
      * check in more deprecation
      
      * remove ModelSaveOpConf
      
      * move out ops and modify item (#1962)
      
      * ModelInit.__oneflow_input_remote_blobs__
      
      * fix cpu only query & add error info (#1964)
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * modify check_point and add test check_point (#1963)
      
      * fix misuse of Scope/raii
      
      * op_name2variable_blob
      
      * add sigmoid test and tanh test (#1966)
      
      * add op matmul and matmul test (#1967)
      
      * rename oneflow.val to oneflow.input_blob_def
      
      * support auto var for convolution (#1972)
      
      * add op add and test add (#1973)
      
      * mv deprecated.pb_util to lib.core.pb_util
      
      * add op get_variable and get_variable test (#1975)
      
      * add op get_variable and get_variable test
      
      * modify shape extend
      
      * AllReduceSequencePass (#1976)
      
      * python2 compatibility for check_point
      
      * fix "return (blob_a, blob_b)" bug
      
      * rename: arg_passing => arg_pass
      
      * shared regst blob header between jobs (#1919)
      
      * half impl
      
      * register manager handle memory shared for separated memory
      
      * set separated memory shared id for shared regst between jobs
      
      * half impl of python for blob
      
      * fix BUG of pod ToProto() when proto has inited
      
      * fix BUG of infer dim0_inner_shape() in foreign_input_op
      
      * 1. PushJob copy from python can infer dim0_valid_num
      
      * add test for dynamic relu
      
      * refine test file
      
      * refine code
      
      * refine note
      
      * update test file for new interface
      
      * rename separated_header* (#1979)
      
      * some bugs fixes for a train&eval job (#1978)
      
      * debugging alex net
      
      * check in test pull_multiple_blob.py
      
       * stricter check
      
      * fix bias in conv
      
      * fix various bugs
      
      * rm file
      
      * op_name in different jobs can be overloaded
      
      * fix compile bug in job_set_compile_ctx
      
      * rm cmake code for building oneflow binary
      
      * check in script (#1980)
      
      * check in script
      
      * rm used import
      
      * CudaCurrentDeviceGuard (#1977)
      
      * fix val (#1981)
      
      * Merge job set and split fw bw (#1982)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
       * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
       * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
       * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
       * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
       * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
       * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
       * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
       * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
       * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
       * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot delete file Unwanted (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
       * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
       * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * Merge job set and split fw bw (#1983)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
       * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
       * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
       * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
       * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()
      
       * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
       * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnaray
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
       * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
       * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
       * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm about): fix it to run when amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
       * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot delete file Unwanted (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
       * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
       * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * CudaCurrentDeviceGuard (#1977)
      
      * delete tmp_split_fw_bw_train_conf (#1985)
      
      * delete tmp_split_fw_bw_train_conf
      
      * delete useless comments
      
      * fix refactor bug in layer_norm_op
      
      * minor fixes
      
      * update py script
      
       * remove code that could be misleading
      
      * Fix all reduce mem sharing (#1986)
      
      * fix all reduce mem sharing
      
      * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
      
      * remove obsolete task_graph optimization
      
      * no arg_pass_job for variable_op
      
      * merge memory block id between jobs (#1910)
      
      * refine MemBlock and CriticalSection
      
      * job memory sharing strategy
      
      * revert diff in CriticalSectionDesc
      
      * Merge memory block between sub plans
      
      * Get mutual exclusion job groups
      
      * forget to consider memory merge only in same machine
      
      * memory zone unique id
      
      * Merge Done;  merge memory block id from right to left; get memory block ids info
      
      * revert MemBlock
      
      * generate mutual exclusion job groups Done.
      
      * update for proto
      
      * add JobMemSharingStrategy in python interface
      
      * remove memorycase hash
      
      * move JobMemSharingStrategy to JobSetProto
      
      * using default strategy = parallel priority strategy
      
      * update interface of flow.job_mem_sharing_strategy
      
      * InterJobMemSharingUtil and PlanUtil
      
      * revert oneflow.h
      
      * fix bug
      
      * New implement of Merge memory block id between jobs
      
      * refine code
      
      * fix a fatal bug in std::hash<oneflow::Shape>
      
      * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
      
      * unlock critical sections as more as possible (#1994)
      
      * Bugfix actor case (#1995)
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * Bugfix actor case (#1996)
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * small regst_num for reentrant_lock (#1997)
      
      * fmt dev_job_set(#1999)
      
      * double buffer for tick_op
      
      * tick is cpu op
      
      * speedup compile time (#2000)
      
      * only merge mem_block_id between user job (#1993)
      
      * Fix keep header only (#2001)
      
      * speedup compile time
      
      * fix keep header only
      
      * remove shared model (#2003)
      
      * remove blob_mem_sharing (#2005)
      
      * No copyhd for output (#2006)
      
      * no cpu tick
      
      * no copyhd for output_op/swith_output_op
      
      * remove temp comments
      
      * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
      
      * remove clone_id (#2007)
      
      * layer norm auto var (#2004)
      
      * layer norm auto var
      
      * make of_format
      
      * bn sbp (#2008)
      
      * Refactor job completer (#1998)
      
      * fmt
      
      * refactor GenerateOpConf4Trainning
      
      * more refactor
      
      * refactor SetCtrlInOpName4VariableOp
      
      * use uniq ptr
      
      * refactor RewriteBoxingWithAllReduce
      
      * refactor MakeAllReduceSequence
      
      * refactor auto_mixed_precision
      
      * refactor DumpLogicalBlobDescAndSbpSignature
      
      * refactor group_boxing_by_dst_parallel
      
      * refactor add_keep_header_only_op_conf
      
      * refactor AutoSourceTick
      
      * refactor AddTickForTimeShape
      
      * refactor AutoSinkTick
      
      * refactor AddGlobalOutputCriticalSections
      
      * refactor SetOpTimeShape7BatchDimLbis
      
      * fix a bug in IsInterfaceTask (#2009)
      
      * Bugfix is interface task (#2010)
      
      * fix a bug in IsInterfaceTask
      
      * IsOutputInterfaceTask
      
      * copyhd-free output_op task_node
      
      * Dev job set config util (#2011)
      
      * add more if in JobConfigProtoBuilder
      
      * unlock critical sections as more as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * remove total batch num in config util
      
      * remove clone_id
      
      * assert has train_conf
      
      * rm debug info
      
      * Dev job set bert (#2013)
      
      * support bert
      
      * mv into bert
      
      * manual format
      
      * fix adam (#2015)
      
      * fix adam
      
      * div batch instance num before update model
      
      * remove outdate code in oneflow.cpp (#2017)
      
      * Dev split like (#2016)
      
      * no total_instance_num
      
      * add auto grad for concat
      
      * check in impl
      
      * check in bug fixes
      
      * fix bugs for split_like
      
      * split_like_op.cpp format
      
      * add normalization_autovar
      
      * Update op_conf.proto
      
      * address reviews
      
      * fix typo
      
      * constant ref
      
      * rm forward_loss_instance_num (#2018)
      
      * Bugfix job set multi device (#2019)
      
      * sbp for tick input bn
      
      * interface_blob_conf for output_op/switch_output_op
      
      * set sbp conf for tuple identity op
      
      * fix bugs when merge main plan
      
      * delete useless code
      
      * address review
      
      * fix error use of GenRepeatedBn()
      
      * ForEachConnectedComponent is easily misused
      
      * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
      
      * only for return output_op
      
      * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
      
      * return op instead of output op acts as part of user job
      
      * enable_all_reduce_group
      
      * bugfix: init RuntimeBuffersScope before Runtime
      
      * demo python scripts for enable_all_reduce_group
      
      * remove wrong optimization code
      
      * constant_conf for enable_all_reduce_group.py test
      
      * fix interface op parallel conf
      
      * fix reduce concat kernel (#2020)
      
      * binary program oneflow_worker
      
      * user_job_completer
      
      * remove unused code loss_print
      
      * rm unused code loss_acc
      
      * remove unused accuracy_acc and accuracy_print
      
      * remove input_diff/output_diff/model_diff bns
      
      * remove unused bns in gdb util
      
      * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
      
      * support mpi using style
      
      * Bugfix put job conf into plan (#2023)
      
      * put job_conf into plan
      
      * using job_name judge isPullJob/isPushJob
      
      * fix wrong job_id error
      
      * model_init is a push job; model_save is a pull job
      
      * make cmake more reasonable (#2024)
      
      * Restructure python module and minimum setup.py (#2026)
      
      * check in updated paths
      
      * check in minimum setup tool
      
      * Dev python init multi unit (#2022)
      
      * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine
      
      * refine var name
      
      * refine code
      
      * compile user/main job only on master
      
      * bert multi machine test code
      
      * fix bugs
      
      * JobConfs
      
      * fix bugs under WITH_RDMA
      
      * fix multi-machine bugs
      
      * delete useless code
      
      * Add xla reduce_sum op
      
      * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
      
      * feat: init_worker can without scp binary and no use uuid (#2029)
      
      * half impl of without scp bin
      
      * feat: init_worker can without scp binary and no use uuid
      
      * check in fixes (#2030)
      
      * fixbug of delete worker (#2033)
      
      * Dev dot plan (#2035)
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * Check in bug fix and multi node script (#2032)
      
      * check in fixes
      
      * check in script
      
      * fix boxing bug when setting conf with sbp
      
      * flag for iter
      
      * fixbug of delete worker
      
      * fix delete worker in script
      
      * address review, add exclusive or check
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * fix and add flags
      
      * fmt
      
      * rm debug output
      
      * more flags
      
      * check Activation
      
      * fix fc bug when num axes > 2
      
      * reverse change
      
      * fix next_batch_num (#2036)
      
      * upgrade nccl to 2.4.8 (#2037)
      
      * fix shape of fc in_diff (#2038)
      
      * Rewrite model update op to optimizer graph
      
      * Update oneflow.cmake (#2041)
      
      * better looking merged_plan to dot v1 (#2039)
      
      * better looking and more infomation of merged_plan.dot
      
      * refine color
      
      * Fix tick in multi node parallel (#2042) (#2047)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * Dev train conf builder (#2046)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * check in impl
      
      * fix data dir (#2054)
      
      * fix data dir
      
      * rm model load path
      
      * AssignOp (#2058)
      
      * AssignOp
      
      * remove useless code
      
      * Python ops gather and unit test (#2053)
      
      * python_ops gather and unit test
      
      * format
      
      * minor mod
      
      * SnapshotOp (#2060)
      
      * magical add and fix bug (#2061)
      
      * check in impl
      
      * add todo
      
      * Dev jxf python pooling (#2056)
      
      * run max_pool_2d without bug
      
      * correct max_pool_2d
      
      * correct average_pool_2d
      
      * minor refine
      
      * final version
      
      * rename to nn.py
      
      * add name arg to pool1d ops
      
      * refine by review
      
      * rename to _GetSequence and move it to the end of file (#2063)
      
      * fix BindInterfaceMemBlockId (#2065)
      
      * mark py file generated (#2066)
      
      * Dev gracious exit (#2057)
      
      * add more checks
      
       * make language more consistent
      
      * better error info for worker init
      
      * better error
      
      * Update setup.py (#2068)
      
      * Refine Infer APIs by return Maybe<void> type (#2051)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * fix bug for split like op (#2070)
      
      * fix snapshot path (#2071)
      
      * Dev job set fix infer apis (#2072)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * update
      
      * add AutoGlobalStep (#2073)
      
      * rm default_initializer_conf in train conf (#2075)
      
      * Fix sigmoid op (#2076)
      
      * fix sigmoid op bug
      
      * fix bug for split like op
      
      * add sigmoid grad op
      
      * Fix bn (#2077)
      
      * fix bn
      
      * return Maybe<void> OK in lambda
      
      * fix typo
      
      * fix SigmoidGradOp (#2078)
      
      * Dev python merge job set (#2081)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix gcc warning in release (#2080)
      
      * fix gcc version in release
      
      * fix empty line
      
      * Fix adam mv initilizer (#2082)
      
      * zero constant initilzer for adam m and v
      
      * make of_format
      
      * init adam m v beta1_t and beta2_t
      
      * use value instead of initializer
      
      * const float& -> const float
      
      * update
      
      * LearningRateScheduleOp (#2079)
      
      * matmul (#2084)
      
      * matmul
      
      * np.allclose
      
      * Fix hang bugs
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
      
      * refine code for read
      
      * check py if and test
      
      * prelu (#2086)
      
      * prelu
      
      * fix
      
      * fix
      
      * template for either ptr cast (#2088)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * add template for cast
      
      * rename
      
      * Dev build and infer ctx (#2089)
      
      * add job_build_and_infer_ctx interface
      
      * lbn_with_split_hint
      
      * fix maybe macro
      
      * fix signature of Maybe<T>::Error()
      
      * job_build_and_infer_if
      
      * add c_api_util wrapper for job_build_and_infer_ctx
      
      * implement python/job_build_and_infer interface
      
      * CurJobBuildAndInferCtx_AddPlacementGroup
      
      * BuildJobAndInferCtx  and  Mgr  c++ implement (#2074)
      
      * job_build_and_infer_ctx_mgr
      
      * refine interface of infer_ctx_mgr
      
      * JobBuildInferCtx set job conf; add and refine error type
      
      * revert job.proto
      
      * half impl of add op in build_infer_ctx
      
      * generate op produced empty logical blob desc ; infer out blob desc interface
      
      * job_build_and_infer_ctx VERSION 1
      
      * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
      
      * maybe return
      
      * job_set hold by job_build_and_infer_ctx_mgr
      
      * check placement when infer ctx mgr leave cur job
      
      * Global New/Delete JobBuildAndInferCtxMgr
      
      * add JUST when ctx add op
      
      * remove unused job_conf.arg_op_name
      
      * fix bugs caused by python new api
      
      * fix bugs caused by lack of Global<JobDesc>
      
      * fix bugs caused by new api
      
      * refactor compiler.Compile
      
      * merge dev_python
      
      * remove unused message proto
      
      * rename api
      
      * Fix input whose body is disabled in xla launch kernel
      
      * add RemoteBlob.shape and RemoteBlob.dtype
      
      * Fix data type set default variable (#2092)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix default data type
      
      * Add conf axis for bias_add for any axis channel (#2093)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Dev jxf python initializer (#2090)
      
      * oneflow initializer
      
      * update
      
      * Fix self control in
      
      * Bugfix python alexnet (#2096)
      
      * bugfix_python_alexnet
      
      * fix
      
      * Add fake consume op
      
      * Dev global step (#2100)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
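
      As a rough illustration of the AutoGlobalStep / LearningRateScheduleOp entries above, the scheduled learning rate is recomputed each iteration from a train-step counter. The exponential-decay formula and names below are assumptions for illustration, not the actual op/kernel:

      #include <cmath>
      #include <cstdint>
      #include <cstdio>

      // Hypothetical schedule: exponential decay of the base learning rate,
      // recomputed each step from the global (train) step counter.
      double ScheduledLearningRate(double base_lr, int64_t train_step,
                                   int64_t decay_steps, double decay_rate) {
        double exponent = static_cast<double>(train_step) / static_cast<double>(decay_steps);
        return base_lr * std::pow(decay_rate, exponent);
      }

      int main() {
        int64_t train_step = 0;  // would be maintained by an AutoGlobalStep-like counter
        for (int i = 0; i < 3; ++i) {
          double lr = ScheduledLearningRate(/*base_lr=*/0.1, train_step,
                                            /*decay_steps=*/1000, /*decay_rate=*/0.96);
          std::printf("step %lld lr %f\n", static_cast<long long>(train_step), lr);
          ++train_step;  // incremented once per iteration, like the train step blob
        }
        return 0;
      }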
      
      * Fix optimizer initializer (#2095)
      
      * fix optimizer initializer
      
      * rename lars data temp bn
      
      * fix job_type (#2102)
      
      * Dev alexnet new api (#2094)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * check in softmax loss
      
      * nn.conv2d and nn.bias_add
      
      * fix opname
      
      * fix merge conflict
      
      * fix name
      
      * dense (#2097)
      
      * Fix jxf dense v2 (#2098)
      
      * dense
      
      * minor fix
      
      * alexnet
      
      * fix conf
      
      * quick fix
      
      * transpose
      
      * fix layers
      
      * add transpose
      
      * fix fc
      
      * fix
      
      * fix
      
      * fix data load
      
      * params check and format
      
      * rm activation in op conf
      
      * save workaround
      
      * fix avg pool 2d
      
      * fix max pool 2d
      
      * remove fc3 relu
      
      * alexnet eval
      
      * minor
      
      * replace has_batch_dim with batch_axis (#2104)
      
      * replace has_batch_dim with batch_axis
      
      * refactor OrderValue4HasBatchAxis
      
      * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
      
      * no CHECK in MatmulOp::InferBatchAxis
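
      A hedged sketch of what replacing has_batch_dim with batch_axis implies: each output carries an optional axis index instead of a yes/no flag. The OptAxis type below is a simplified stand-in for an OptInt64-style proto field, not the real message:

      #include <iostream>
      #include <string>

      // Simplified optional int64, standing in for an OptInt64-style proto field.
      struct OptAxis {
        bool has_value = false;
        long long value = 0;
        void Clear() { has_value = false; }
        void Set(long long v) { has_value = true; value = v; }
      };

      // Example: a matmul-like op. The output keeps the batch axis of input 'a'
      // when 'a' has one; otherwise the output has no batch axis at all.
      void InferBatchAxis(const OptAxis& a_batch_axis, OptAxis* out_batch_axis) {
        if (a_batch_axis.has_value) {
          out_batch_axis->Set(a_batch_axis.value);
        } else {
          out_batch_axis->Clear();
        }
      }

      int main() {
        OptAxis a;
        a.Set(0);
        OptAxis out;
        InferBatchAxis(a, &out);
        std::cout << "out batch axis: " << (out.has_value ? std::to_string(out.value) : "none") << "\n";
        return 0;
      }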
      
      * infer op by op_conf and  parallel_conf
      
      * wrapper Error for ErrorProto
      
      * replace ErrorUtil with Error
      
      * add OF_CHECK (#2110)
      
      * optional split_axis (#2113)
      
      * Fix HasAttr bug for optional field
      
      * undefined (#2116)
      
      * merge reduce xxx (#2119)
      
      * Update GetSbpSig() with Maybe (#2118)
      
      * fix several ops
      
      * modify all ops
      
      * format
      
      * update complete
      
      * Refine AdamOptimizer
      
      * fix (#2120)
      
      * Fix xla AdamOptimizer bugs
      
      * support scalar for reduce_xxx axis args (#2122)
      
      * Dev opt split axis (#2121)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * fix autovar split_axis (#2125)
      
      * Dev model init op (#2117)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
      
      * fix (#2127)
      
      * rm stale alexnet script (#2129)
      
      * Dev plain maybe (#2126)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * Dev simple checkpoint manager (#2128)
      
      * SimpleCheckPointManager
      
      * makedirs
      
      * fix path
      
      * save
      
      * refine
      
      * refine
      
      * fix path to numpy (#2130)
      
      * Dev plain maybe (#2132)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
      
      * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
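
      A minimal sketch of why a Maybe<T*> specialization (the "plain maybe" above) is useful: the generic case can own its payload through a shared_ptr, while the pointer case only carries a raw, non-owning pointer to an object owned elsewhere. The types here are illustrative only, not the real OneFlow Maybe:

      #include <iostream>
      #include <memory>
      #include <string>

      // Generic case: the success payload is owned through a shared_ptr.
      template<typename T>
      struct Maybe {
        std::shared_ptr<T> data;  // null means error
        std::string error;
      };

      // Specialization for pointer payloads: the context object is owned elsewhere
      // (e.g. by a global manager), so only a raw pointer is stored.
      template<typename T>
      struct Maybe<T*> {
        T* data = nullptr;  // null means error
        std::string error;
      };

      struct JobCtx { int job_id = 7; };

      Maybe<JobCtx*> FindCtx(bool ok) {
        static JobCtx ctx;  // owned by someone else, not by Maybe
        if (!ok) { return {nullptr, "ctx not found"}; }
        return {&ctx, ""};
      }

      int main() {
        auto maybe_ctx = FindCtx(true);
        if (maybe_ctx.data) { std::cout << "job_id: " << maybe_ctx.data->job_id << "\n"; }
        return 0;
      }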
      
      * Dev jxf merge general ops (#2131)
      
      * merge some general ops to dev_python
      
      * dense demo
      
      * rm print in test
      
      * new line at the end of file
      
      * format
      
      * fix check point
      
      * update alexnet
      
      * broadcast_xxx (#2134)
      
      * broadcast_xxx
      
      * typo
      
      * typo
      
      * rm job_conf.num_of_batches_in_snapshot
      
      * fix args (#2136)
      
      * fix proto if (#2138)
      
      * pass name to inner function (#2139)
      
      * check dropout if (#2140)
      
      * check dropout if
      
      * fix typo
      
      * Dev merge math ops (#2143)
      
      * merge math ops
      
      * new line at the end of file
      
      * merge layer norm (#2144)
      
      * variable_scope (#2141)
      
      * variable_scope
      
      * revert format
      
      * add check
      
      * Merge dropout if (#2145)
      
      * check dropout if
      
      * fix typo
      
      * fix typo
      
      * slice (#2142)
      
      * slice
      
      * add check and docstring
      
      * minor
      
      * minor
      
      * add const (#2146)
      
      * add const
      
      * fix indentation
      
      * address review
      
      * fmt
      
      * rm redundant
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * add more activations to math_ops (#2147)
      
      * fix bug (#2149)
      
      * truncated normal for bert (#2150)
      
      * Update bert for dev python (#2151)
      
      * truncated normal for bert
      
      * bert support
      
      * math.dropout to nn.dropout (#2153)
      
      * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
      
      * allow export multiple interfaces in oneflow_export decorator (#2154)
      
      * refactor job_build_and_infer_if.h
      
      * update oneflow_internal.h to use Maybe (#2135)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
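
      For the OF_ERROR_STR_CHECK / OFStrCat() entry above, a hypothetical OFStrCat-style helper could concatenate heterogeneous arguments into one error string roughly like this (an illustrative C++17 sketch, not the real implementation):

      #include <iostream>
      #include <sstream>
      #include <string>

      // Concatenate heterogeneous arguments into one std::string for error messages.
      template<typename... Args>
      std::string StrCat(const Args&... args) {
        std::ostringstream oss;
        (oss << ... << args);  // C++17 fold expression streams each argument in order
        return oss.str();
      }

      int main() {
        std::cout << StrCat("axis ", 5, " out of range [0, ", 4, ")") << "\n";
        return 0;
      }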
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp
      
      * Fix python scripts
      
      * Dev nc of internal (#2155)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      * fix: fix ctor bug
      
      * fix config_proto
      
      * rename c_api_util.Init => c_api_util.InitEnvironment
      
      * refactor compile_context.cur_job => compile_context.cur_job_conf
      
      * remove FixPackedBlobDescOfProducedRegst (#2156)
      
      * Fix snapshot root path empty log (#2158)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * Fix snapshot root path empty log
      
      * fix channel last (#2157)
      
      * fix channel last
      
      * minor
      
      * merge pb_message
      
      * add cudnn conv force algo (#2159)
      
      * Update bert for dev python (#2160)
      
      * remove old bert
      
      * set data_part_num in decoder
      
      * support model load/save args
      
      * Dev flow function (#2152)
      
      * add of.function, refactor init, refine session, and refine runtime
      
      * rm useless code
      
      * rename
      
      * update
      
      * add test
      
      * @oneflow_export JobConfigProto and Trainconf (#2162)
      
      * @oneflow_export JobConfigProto and Trainconf
      
      * remove unused config in config_util.py
      
      * remove oneflow.get_cur_job_conf_builder
      
      * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161)
      
      * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
      
      * fix config.train.model_update_conf
      
      * _GetJobConfAttr
      
      * update alexnet (#2166)
      
      * Update alexnet (#2167)
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * more reasonable conf
      
      * get variable in py layer norm
      
      * replace val in pb msg;  decode lbn string with split hint (#2165)
      
      * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
      
      * Add meta data in HLO instruction, and refine
      
      * python model parallel (#2103)
      
      * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op
      
      * merge placement group
      
      * refine code in AddAndInferOp
      
      * auto merge placement group when add op; remove mergeplacementgroup interface
      
      * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx
      
      * python blob add interface for model parallel
      
      * refine code of python blob split
      
      * remove interface of has/get_split_axis in python blob
      
      * remove interface of has_batch_dim in python blob
      
      * add check blob split_axis can be divided by parallel num
      
      * refine code for maybe get/infer sbp
      
      * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3.python blob desc .etc
      
      * fix for plain point maybe
      
      * fix bug: add repeated placement group, remove add placement interface in hand
      
      * fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel
      
      * dev_python model parallel runnable and check correct
      
      * remove add placement group when placement scope exit
      
      * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
      
      * bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done
      
      * refine python blob_desc.split implement
      
      * refine interface decode lbn to split hint
      
      * refine auto add placement group
      
      * refine lbn with split hint decode
      
      * refine code for review
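
      A hedged sketch of the "lbn with split hint" decoding used in the model-parallel entries above: a logical blob name may carry a trailing hint such as ":S(0)" (split on axis 0) or ":B" (broadcast). The exact grammar below is an assumption for illustration; a divisibility check (split dimension divisible by the parallel num) would naturally run right after parsing:

      #include <iostream>
      #include <string>

      struct ParsedLbn {
        std::string lbn;        // plain logical blob name, e.g. "dense1/out"
        bool has_split_axis = false;
        int split_axis = -1;
        bool broadcast = false;
      };

      // Parse "op/blob", "op/blob:S(1)" or "op/blob:B" style strings.
      ParsedLbn DecodeLbnWithHint(const std::string& s) {
        ParsedLbn r;
        std::size_t pos = s.rfind(':');
        if (pos == std::string::npos) { r.lbn = s; return r; }
        r.lbn = s.substr(0, pos);
        std::string hint = s.substr(pos + 1);
        if (hint == "B") {
          r.broadcast = true;
        } else if (hint.size() >= 4 && hint[0] == 'S' && hint[1] == '(' && hint.back() == ')') {
          r.has_split_axis = true;
          r.split_axis = std::stoi(hint.substr(2, hint.size() - 3));
        }
        return r;
      }

      int main() {
        ParsedLbn p = DecodeLbnWithHint("dense1/out:S(0)");
        std::cout << p.lbn << " split_axis=" << (p.has_split_axis ? p.split_axis : -1) << "\n";
        return 0;
      }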
      
      * remove AutoVar related code (#2168)
      
      * feat: remove all autovar
      
      * fix and format
      
      * fix: fix op::InferBlobDesc
      
      * add prototype (#2172)
      
      * add prototype
      
      * infer blob desc with sbp_signature
      
      * `str_a is not str_b' is buggy, use `str_a != str_b' instead
      
      * Update snapshot.cpp (#2174)
      
      * remove useless lines (#2176)
      
      * Fix bert multi nodes (#2177)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * CHECK_JUST for InferBlobDescsIf (#2178)
      
      * Fix bert multi nodes (#2180)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * config_proto -> default_config_proto
      
      * delete worker
      
      * update alexnet
      
      * remove unused op (#2182)
      
      * remove parallel_ctx when kernel init (#2185)
      
      * InferOpSbpSignature in op_graph and infer_ctx (#2175)
      
      * InferOpSbpSignature in op_graph and infer_ctx
      
      * bugfix: lambda lifetime; gen job build error add location info
      
      * refine error generation and return
      
      * refine check lbi valid and exists
      
      * remove parallel num in decode_of_record op/kernel (#2186)
      
      * Fix bugs
      
      * delete GlobalJobDesc() in operator/ (#2188)
      
      * rm unused test file
      
      * Refine
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Remove fake consume op
      
      * Support enable/disable XLA by setting env
      
      * Merge callback, limit max operator count for each XLA subgraph
      
      * CudaEventPool
      
      * fix vector
      
      * refine
      
      * Support in-place update for optimizer
      
      * Add alias input and output to prevent reusing input with other temp buffers
      
      * Refine code style
      
      * Remove unused code
      
      * Of xla (#2237)
      
      * mv deprecated.pb_util to lib.core.pb_util
      
      * add op get_variable and get_variable test (#1975)
      
      * add op get_variable and get_variable test
      
      * modify shape extend
      
      * AllReduceSequencePass (#1976)
      
      * python2 compatibility for check_point
      
      * fix "return (blob_a, blob_b)" bug
      
      * rename: arg_passing => arg_pass
      
      * shared regst blob header between jobs (#1919)
      
      * half impl
      
      * register manager handle memory shared for separated memory
      
      * set separated memory shared id for shared regst between jobs
      
      * half impl of python for blob
      
      * fix BUG of pod ToProto() when proto has been inited
      
      * fix BUG of infer dim0_inner_shape() in foreign_input_op
      
      * 1. PushJob copy from python can infer dim0_valid_num
      
      * add test for dynamic relu
      
      * refine test file
      
      * refine code
      
      * refine note
      
      * update test file for new interface
      
      * rename separated_header* (#1979)
      
      * some bug fixes for a train&eval job (#1978)
      
      * debugging alex net
      
      * check in test pull_multiple_blob.py
      
      * stricter check
      
      * fix bias in conv
      
      * fix various bugs
      
      * rm file
      
      * op_name in different jobs can be overloaded
      
      * fix compile bug in job_set_compile_ctx
      
      * rm cmake code for building oneflow binary
      
      * check in script (#1980)
      
      * check in script
      
      * rm used import
      
      * CudaCurrentDeviceGuard (#1977)
      
      * fix val (#1981)
      
      * Merge job set and split fw bw (#1982)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
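
      As a rough illustration of the std::enable_if dispatch mentioned in the dropout refactor above (separating the half path from the float/double path without partially specializing the whole kernel class), here is a hedged, CPU-only sketch; the real CUDA kernels are more involved and would convert through __half2float / __float2half:

      #include <cstdint>
      #include <iostream>
      #include <type_traits>

      // A stand-in for CUDA's half type, just to show the dispatch pattern.
      struct half_stub { uint16_t bits; };

      // Float/double path: plain arithmetic.
      template<typename T>
      typename std::enable_if<std::is_floating_point<T>::value>::type
      MaskAndScale(const T* in, const int8_t* mask, T scale, T* out, int n) {
        for (int i = 0; i < n; ++i) { out[i] = in[i] * static_cast<T>(mask[i]) * scale; }
      }

      // Half path: selected only when T is the half-like type; in real code the
      // scale would be applied after converting to float and back.
      template<typename T>
      typename std::enable_if<std::is_same<T, half_stub>::value>::type
      MaskAndScale(const T* in, const int8_t* mask, T /*scale*/, T* out, int n) {
        for (int i = 0; i < n; ++i) { out[i] = mask[i] ? in[i] : half_stub{0}; }
      }

      int main() {
        float in[3] = {1.f, 2.f, 3.f};
        int8_t mask[3] = {1, 0, 1};
        float out[3];
        MaskAndScale(in, mask, 2.f, out, 3);
        std::cout << out[0] << " " << out[1] << " " << out[2] << "\n";  // 2 0 6
        return 0;
      }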
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnary
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modifying ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm related): fix it to run with amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot delete file Unwanted (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * Merge job set and split fw bw (#1983)
      
      * add MemoryCopier and TensorSliceCopier (#1901)
      
      * add MemoryCopier and TensorSliceCopier
      
      * Index=>NdIndex
      
      * refine
      
      * refine
      
      * fix addition error checking (#1911)
      
      * Merge dev_mixed_precision into dev_split_fw_bw (#1904)
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * Merge dev_mixed_precision: Part-2 (#1907)
      
      * feat: add NewKernelUtil
      
      * fix typos
      
      * feat: add cublas_tensor_op_math_handle()
      
      * add gemm (#1860)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * feat: NewKernelUtil -> NewKernelUtil<DeviceType>
      
      * feat: update FullyConnectedKernel to use NewKernelUtil
      
      * Dev sx mixed precision (#1861)
      
      * add gemm
      
      * save
      
      * add blobgemm
      
      * update
      
      * update
      
      * fix cu
      
      * update cpp
      
      * save cpp
      
      * save
      
      * add relu and relu_backward
      
      * remove spared space
      
      * add explicit declaration
      
      * rename
      
      * feat: update ConvKernel to support half
      
      * add sigmoid and tanh (#1867)
      
      * add axpy (#1866)
      
      * style: formatting
      
      * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>
      
      * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle
      
      * refine(new_kernel_util.h)
      
      * refine(new_kernel_util.cu)
      
      * feat(new_kernel_util): add OFBatchedGemm()
      
      * feat: update MatMulKernel to support half
      
      * feat: update ConvData/Bias/FilterGradKernel to support half
      
      * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out
      
      * feat: support loss scale
      
      * fix(operator): :bug:add InferHasBatchDim()
      
      * feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()
      
      * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float
      
      * style(kernel/cast_kernel.cpp): formatting
      
      * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()
      
      * style(cast_kernel.cpp): formatting
      
      * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil
      
      * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil
      
      * feat(dropout_kernel): :sparkles:update DropoutKernel to support half
      
      * refactor(dropout_kernel): remove backward funcs
      
      * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
      
      * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)
      
      * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: fix little bugs
      
      * fix(conv_data/filter_grad_op): min byte size of buf blob is 1
      
      * feat: support half for bias_add_kernel
      
      * fix(bias_add_op): remove data type check
      
      * feat(relu_kernel): support half
      
      * refactor: add ADD_GPU_HALF_KERNEL_CREATOR
      
      * fix: typos
      
      * feat(pooling_kernel): support half
      
      * fix: remove CHECK_EQ of default data type
      
      * feat(pooling_grad_kernel): support half
      
      * feat: support half in ofrecord_encoder (TODO)
      
      * fix
      
      * feat: support half in sparse_cross_entropy_kernel
      
      * debug grad op (#1883)
      
      * Dev debug op mixed precision (#1884)
      
      * debug grad op
      
      * do nothing instead of UNIMPLEMENTED
      
      * fix(dropout_kernel): add tmp_split_fw_bw condition
      
      * build(half.cmake): https->http
      
      * fix(record_load_kernel): support total_batch_num
      
      * fix pooling (#1885)
      
      * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()
      
      * fix: add GetCudnnScalingParameters() to fix scaling params
      
      * fix: add enable_true_half_config_when_conf() into config and update related code
      
      * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization
      
      * refactor(matmul_kernel): remove Backward()
      
      * feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx()
      
      * feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()
      
      * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr
      
      * refactor(new_kernel_util.cu): remove static of func in anonymous namespace
      
      * feat(job_conf.proto): add enable_auto_mixed_precision field
      
      * feat(auto_mixed_precision_lists): add amp_lists
      
      * feat(auto_mixed_precision): build the skeleton
      
      * feat(auto_mixed_precision): almost finish amp graph pass
      
      * feat(auto_mixed_precision.cpp): complete InsertCastOp()
      
      * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG
      
      * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)
      
      * refine(auto_mixed_precision.cpp): refine LOG
      
      * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
      
      * Dev half ndarray (#1886)
      
      * debug grad op
      
      * ZeroVal => GetZeroVal; OneVal => GetOneVal
      
      * MaxVal => GetMaxVal; MinVal => GetMinVal
      
      * check data type
      
      * DevDType
      
      * move function template to struct template for BinaryFunc* and UnaryFunc*
      
      * support half for reduce_sum_kernel
      
      * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr
      
      * half for NdarrayUtil
      
      * OF_DEVICE_FUNC is always inline
      
      * half for NdarrayApplyUnary
      
      * simplify usage of NdarrayUtil
      
      * UnaryFuncExp
      
      * add VarNdarrayBuilder and ValNdarrayBuilder
      
      * simplify NdarrayUtil in layer_norm_param_grad_kernel
      
      * InplaceBroadcast
      
      * remove SoftmaxKernelUtil
      
      * half for softmax_kernel
      
      * fix improper use of __CUDA_ARCH__
      
      * disable sm_30,sm_52
      
      * refine(conv_kernel.cu): fix typo
      
      * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix: fix typos of GetOneVal
      
      * fix(auto_mixed_precision.cpp): allocate for shared_ptr
      
      * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding
      
      * fix(auto_mixed_precision.cpp): fix typo
      
      * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge
      
      * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()
      
      * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>
      
      * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modifying ibn of PrintOp
      
      * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs
      
      * feat(auto_mixed_precision.cpp): more logs
      
      * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal
      
      * fix(bias_add_op.cpp): fix bias_multiplier shape
      
      * feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half
      
      * feat: update MatmulKernel and new_kernel_util to support half
      
      * refactor(auto_mixed_precision): add ClearList and refine code
      
      * feat(tanh_*_kernel): support half
      
      * feat(add_kernel): support half
      
      * update binary_func.h
      
      * update
      
      * update ndarray
      
      * update
      
      * update
      
      * update
      
      * update
      
      * refactor(data_type.h): better representation
      
      * fix(unary_func.h): fix typo
      
      * style(data_type.h): format
      
      * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF
      
      * style(CMakeLists.txt): fix typo
      
      * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr
      
      * fix(auto_mixed_precision.cpp): group inserted cast op by lbn
      
      * fix get one ptr (#1913)
      
      * fix(layer_norm): add LayerNormOp to grey_list and support the half
      
      * fix(layer_norm related): fix it to run with amp
      
      * fix: move fix sbp signature from OpNode to OpGraph
      
      * Dev new kernel util (#1925)
      
      * refactor(kernel/util): refactor NewKernelUtil and add DnnIf
      
      * refactor(kernel/util): add BlasIf
      
      * refactor(kernel/util): add ArithemeticIf
      
      * refactor(kernel/util): add cuda_kernel_util.*
      
      * refactor: refactor NewKernelUtil
      
      * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including
      
      * refactor(new_kernel_util.h): remove unused header files
      
      * refactor: refactor loop include
      
      * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)
      
      * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA
      
      * CHECK cuda version > 10.0 when using auto_mixed_precision
      
      * Fix bug of Snapshot delete file Unwanted (#1937)
      
      * fix link BUG of release version (#1938)
      
      * delete redundant code in OpGraph JobCompleter and Operator (#1927)
      
      * 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe
      
      * revert README change
      
      * split 2 pull request
      
      * Refactor Kernel Registry V2: The clear & easy Way (#1941)
      
      * refactor(resource.proto): move DeviceType to common/device_type.proto
      
      * feat(kernel_registration): add kernel_registration.h/cpp
      
      * feat(kernel_registration): update matmul_kernel to support new registration
      
      * feat: add CreateKernel for new registry
      
      * feat: update registry of cast conf
      
      * refactor(kernel_registration): remove KernelRegMap
      
      * fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)
      
      * grpc SetMaxMessageSize(INT_MAX) (#1950)
      
      * fix bug of Graph::ForEachConnectedComponent (#1952)
      
      * Grpc set max size (#1953)
      
      * grpc SetMaxMessageSize(INT_MAX)
      
      * set max msg len for ctrl service
      
      * code for test grpc max msg size
      
      * remove test code
      
      * NumaAwareCudaMallocHost (#1959)
      
      * NumaAwareCudaMallocHost
      
      * add conf
      
      * AllReduceSequencePass (#1976)
      
      * CudaCurrentDeviceGuard (#1977)
      
      * delete tmp_split_fw_bw_train_conf (#1985)
      
      * delete tmp_split_fw_bw_train_conf
      
      * delete useless comments
      
      * fix refactor bug in layer_norm_op
      
      * minor fixes
      
      * update py script
      
      * remove code that could be misleading
      
      * Fix all reduce mem sharing (#1986)
      
      * fix all reduce mem sharing
      
      * ByteSizeOfDataContentField=>ByteSizeOfBlobBody
      
      * remove obsolete task_graph optimization
      
      * no arg_pass_job for variable_op
      
      * merge memory block id between jobs (#1910)
      
      * refine MemBlock and CriticalSection
      
      * job memory sharing strategy
      
      * revert diff in CriticalSectionDesc
      
      * Merge memory block between sub plans
      
      * Get mutual exclusion job groups
      
      * forgot to consider memory merge only in same machine
      
      * memory zone unique id
      
      * Merge Done;  merge memory block id from right to left; get memory block ids info
      
      * revert MemBlock
      
      * generate mutual exclusion job groups Done.
      
      * update for proto
      
      * add JobMemSharingStrategy in python interface
      
      * remove memorycase hash
      
      * move JobMemSharingStrategy to JobSetProto
      
      * using default strategy = parallel priority strategy
      
      * update interface of flow.job_mem_sharing_strategy
      
      * InterJobMemSharingUtil and PlanUtil
      
      * revert oneflow.h
      
      * fix bug
      
      * New implement of Merge memory block id between jobs
      
      * refine code
      
      * fix a fatal bug in std::hash<oneflow::Shape>
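
      The std::hash<oneflow::Shape> bug mentioned above is a reminder that hashing a multi-dimensional shape needs a proper combine step; a minimal sketch of one common hash-combine pattern (illustrative only, not the actual OneFlow fix) looks like this:

      #include <cstdint>
      #include <functional>
      #include <iostream>
      #include <vector>

      struct ShapeStub { std::vector<int64_t> dims; };

      // boost-style hash_combine: mix each dimension into the running seed so that
      // shapes like (2, 3) and (3, 2) do not collide the way a plain sum/xor would.
      std::size_t HashShape(const ShapeStub& shape) {
        std::size_t seed = shape.dims.size();
        for (int64_t d : shape.dims) {
          seed ^= std::hash<int64_t>()(d) + 0x9e3779b97f4a7c15ULL + (seed << 6) + (seed >> 2);
        }
        return seed;
      }

      int main() {
        std::cout << HashShape({{2, 3}}) << " vs " << HashShape({{3, 2}}) << "\n";
        return 0;
      }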
      
      * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node
      
      * unlock critical sections as much as possible (#1994)
      
      * Bugfix actor case (#1995)
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * Bugfix actor case (#1996)
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * refine code
      
      * small regst_num for reentrant_lock (#1997)
      
      * fmt dev_job_set (#1999)
      
      * double buffer for tick_op
      
      * tick is cpu op
      
      * speedup compile time (#2000)
      
      * only merge mem_block_id between user job (#1993)
      
      * Fix keep header only (#2001)
      
      * speedup compile time
      
      * fix keep header only
      
      * remove shared model (#2003)
      
      * remove blob_mem_sharing (#2005)
      
      * No copyhd for output (#2006)
      
      * no cpu tick
      
      * no copyhd for output_op/swith_output_op
      
      * remove temp comments
      
      * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo
      
      * remove clone_id (#2007)
      
      * layer norm auto var (#2004)
      
      * layer norm auto var
      
      * make of_format
      
      * bn sbp (#2008)
      
      * Refactor job completer (#1998)
      
      * fmt
      
      * refactor GenerateOpConf4Trainning
      
      * more refactor
      
      * refactor SetCtrlInOpName4VariableOp
      
      * use uniq ptr
      
      * refactor RewriteBoxingWithAllReduce
      
      * refactor MakeAllReduceSequence
      
      * refactor auto_mixed_precision
      
      * refactor DumpLogicalBlobDescAndSbpSignature
      
      * refactor group_boxing_by_dst_parallel
      
      * refactor add_keep_header_only_op_conf
      
      * refactor AutoSourceTick
      
      * refactor AddTickForTimeShape
      
      * refactor AutoSinkTick
      
      * refactor AddGlobalOutputCriticalSections
      
      * refactor SetOpTimeShape7BatchDimLbis
      
      * fix a bug in IsInterfaceTask (#2009)
      
      * Bugfix is interface task (#2010)
      
      * fix a bug in IsInterfaceTask
      
      * IsOutputInterfaceTask
      
      * copyhd-free output_op task_node
      
      * Dev job set config util (#2011)
      
      * add more if in JobConfigProtoBuilder
      
      * unlock critical sections as much as possible
      
      * consumed and produced regst of actor 'case' are customized
      
      * remove total batch num in config util
      
      * remove clone_id
      
      * assert has train_conf
      
      * rm debug info
      
      * Dev job set bert (#2013)
      
      * support bert
      
      * mv into bert
      
      * manual format
      
      * fix adam (#2015)
      
      * fix adam
      
      * div batch instance num before updating model
      
      * remove outdate code in oneflow.cpp (#2017)
      
      * Dev split like (#2016)
      
      * no total_instance_num
      
      * add auto grad for concat
      
      * check in impl
      
      * check in bug fixes
      
      * fix bugs for split_like
      
      * split_like_op.cpp format
      
      * add normalization_autovar
      
      * Update op_conf.proto
      
      * address reviews
      
      * fix typo
      
      * constant ref
      
      * rm forward_loss_instance_num (#2018)
      
      * Bugfix job set multi device (#2019)
      
      * sbp for tick input bn
      
      * interface_blob_conf for output_op/switch_output_op
      
      * set sbp conf for tuple identity op
      
      * fix bugs when merge main plan
      
      * delete useless code
      
      * address review
      
      * fix error use of GenRepeatedBn()
      
      * ForEachConnectedComponent is easily misused
      
      * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil
      
      * only for return output_op
      
      * refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name
      
      * return op instead of output op acts as part of user job
      
      * enable_all_reduce_group
      
      * bugfix: init RuntimeBuffersScope before Runtime
      
      * demo python scripts for enable_all_reduce_group
      
      * remove wrong optimization code
      
      * constant_conf for enable_all_reduce_group.py test
      
      * fix interface op parallel conf
      
      * fix reduce concat kernel (#2020)
      
      * binary program oneflow_worker
      
      * user_job_completer
      
      * remove unused code loss_print
      
      * rm unused code loss_acc
      
      * remove unused accuracy_acc and accuracy_print
      
      * remove input_diff/output_diff/model_diff bns
      
      * remove unused bns in gdb util
      
      * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns
      
      * support mpi using style
      
      * Bugfix put job conf into plan (#2023)
      
      * put job_conf into plan
      
      * use job_name to judge isPullJob/isPushJob
      
      * fix wrong job_id error
      
      * model_init is a push job; model_save is a pull job
      
      * make cmake more reasonable (#2024)
      
      * Restructure python module and minimum setup.py (#2026)
      
      * check in updated paths
      
      * check in minimum setup tool
      
      * Dev python init multi unit (#2022)
      
      * init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine
      
      * refine var name
      
      * refine code
      
      * compile user/main job only on master
      
      * bert multi machine test code
      
      * fix bugs
      
      * JobConfs
      
      * fix bugs under WITH_RDMA
      
      * fix multi-machine bugs
      
      * delete useless code
      
      * Add xla reduce_sum op
      
      * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)
      
      * feat: init_worker can without scp binary and no use uuid (#2029)
      
      * half impl of without scp bin
      
      * feat: init_worker can without scp binary and no use uuid
      
      * check in fixes (#2030)
      
      * fixbug of delete worker (#2033)
      
      * Dev dot plan (#2035)
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * Check in bug fix and multi node script (#2032)
      
      * check in fixes
      
      * check in script
      
      * fix boxing bug when setting conf with sbp
      
      * flag for iter
      
      * fixbug of delete worker
      
      * fix delete worker in script
      
      * address review, add exclusive or check
      
      * reuse plan to dot file
      
      * refine plan dot
      
      * fix and add flags
      
      * fmt
      
      * rm debug output
      
      * more flags
      
      * check Activation
      
      * fix fc bug when num axes > 2
      
      * reverse change
      
      * fix next_batch_num (#2036)
      
      * upgrade nccl to 2.4.8 (#2037)
      
      * fix shape of fc in_diff (#2038)
      
      * Rewrite model update op to optimizer graph
      
      * Update oneflow.cmake (#2041)
      
      * better looking merged_plan to dot v1 (#2039)
      
      * better looking and more information of merged_plan.dot
      
      * refine color
      
      * Fix tick in multi node parallel (#2042) (#2047)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * Dev train conf builder (#2046)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * check in impl
      
      * fix data dir (#2054)
      
      * fix data dir
      
      * rm model load path
      
      * AssignOp (#2058)
      
      * AssignOp
      
      * remove useless code
      
      * Python ops gather and unit test (#2053)
      
      * python_ops gather and unit test
      
      * format
      
      * minor mod
      
      * SnapshotOp (#2060)
      
      * magical add and fix bug (#2061)
      
      * check in impl
      
      * add todo
      
      * Dev jxf python pooling (#2056)
      
      * run max_pool_2d without bug
      
      * correct max_pool_2d
      
      * correct average_pool_2d
      
      * minor refine
      
      * final version
      
      * rename to nn.py
      
      * add name arg to pool1d ops
      
      * refine by review
      
      * rename to _GetSequence and move it to the end of file (#2063)
      
      * fix BindInterfaceMemBlockId (#2065)
      
      * mark py file generated (#2066)
      
      * Dev gracious exit (#2057)
      
      * add more checks
      
      * make language more consistent
      
      * better error info for worker init
      
      * better error
      
      * Update setup.py (#2068)
      
      * Refine Infer APIs by return Maybe<void> type (#2051)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * fix bug for split like op (#2070)
      
      * fix snapshot path (#2071)
      
      * Dev job set fix infer apis (#2072)
      
      * Refine Infer APIs by return Maybe<void> type
      
      * Fix return type
      
      * Fix code style
      
      * Replace CHECK macros in the implementation of infer APIs
      
      * Revert IsOk
      
      * update
      
      * add AutoGlobalStep (#2073)
      
      * rm default_initializer_conf in train conf (#2075)
      
      * Fix sigmoid op (#2076)
      
      * fix sigmoid op bug
      
      * fix bug for split like op
      
      * add sigmoid grad op
      
      * Fix bn (#2077)
      
      * fix bn
      
      * return Maybe<void> OK in lambda
      
      * fix typo
      
      * fix SigmoidGradOp (#2078)
      
      * Dev python merge job set (#2081)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix gcc warning in release (#2080)
      
      * fix gcc version in release
      
      * fix empty line
      
      * Fix adam mv initializer (#2082)
      
      * zero constant initializer for adam m and v
      
      * make of_format
      
      * init adam m v beta1_t and beta2_t
      
      * use value instead of initializer
      
      * const float& -> const float
      
      * update
      
      * LearningRateScheduleOp (#2079)
      
      * matmul (#2084)
      
      * matmul
      
      * np.allclose
      
      * Fix hang bugs
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)
      
      * bugfix: reshape op infer dim0 size; and look up tensorflow reshape
      
      * refine code for read
      
      * check py if and test
      
      * prelu (#2086)
      
      * prelu
      
      * fix
      
      * fix
      
      * template for either ptr cast (#2088)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * add template for cast
      
      * rename
      
      * Dev build and infer ctx (#2089)
      
      * add job_build_and_infer_ctx interface
      
      * lbn_with_split_hint
      
      * fix maybe macro
      
      * fix signature of Maybe<T>::Error()
      
      * job_build_and_infer_if
      
      * add c_api_util wrapper for job_build_and_infer_ctx
      
      * implement python/job_build_and_infer interface
      
      * CurJobBuildAndInferCtx_AddPlacementGroup
      
      * BuildJobAndInferCtx and Mgr c++ implementation (#2074)
      
      * job_build_and_infer_ctx_mgr
      
      * refine interface of infer_ctx_mgr
      
      * JobBuildInferCtx set job conf; add and refine error type
      
      * revert job.proto
      
      * half impl of add op in build_infer_ctx
      
      * generate op produced empty logical blob desc; infer out blob desc interface
      
      * job_build_and_infer_ctx VERSION 1
      
      * add InferOutBlobDesc for conv op; remove record_piece_size in interface op
      
      * maybe return
      
      * job_set hold by job_build_and_infer_ctx_mgr
      
      * check placement when infer ctx mgr leaves cur job
      
      * Global New/Delete JobBuildAndInferCtxMgr
      
      * add JUST when ctx add op
      
      * remove unused job_conf.arg_op_name
      
      * fix bugs caused by python new api
      
      * fix bugs caused by lack of Global<JobDesc>
      
      * fix bugs caused by new api
      
      * refactor compiler.Compile
      
      * merge dev_python
      
      * remove unused message proto
      
      * rename api
      
      * Fix input whose body is disabled in xla launch kernel
      
      * add RemoteBlob.shape and RemoteBlob.dtype
      
      * Fix data type set default variable (#2092)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * fix default data type
      
      * Add conf axis for bias_add for any axis channel (#2093)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Dev jxf python initializer (#2090)
      
      * oneflow initializer
      
      * update
      
      * Fix self control in
      
      * Bugfix python alexnet (#2096)
      
      * bugfix_python_alexnet
      
      * fix
      
      * Add fake consume op
      
      * Dev global step (#2100)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * Fix optimizer initializer (#2095)
      
      * fix optimizer initializer
      
      * rename lars data temp bn
      
      * fix job_type (#2102)
      
      * Dev alexnet new api (#2094)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * check in softmax loss
      
      * nn.conv2d and nn.bias_add
      
      * fix opname
      
      * fix merge conflict
      
      * fix name
      
      * dense (#2097)
      
      * Fix jxf dense v2 (#2098)
      
      * dense
      
      * minor fix
      
      * alexnet
      
      * fix conf
      
      * quick fix
      
      * transpose
      
      * fix layers
      
      * add transpose
      
      * fix fc
      
      * fix
      
      * fix
      
      * fix data laod
      
      * params check and format
      
      * rm activation in op conf
      
      * save workaround
      
      * fix avg pool 2d
      
      * fix max pool 2d
      
      * remove fc3 relu
      
      * alexnet eval
      
      * minor
      
      * replace has_batch_dim with batch_axis (#2104)
      
      * replace has_batch_dim with batch_axis
      
      * refactor OrderValue4HasBatchAxis
      
      * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp
      
      * no CHECK in MatmulOp::InferBatchAxis
      
      * infer op by op_conf and  parallel_conf
      
      * wrapper Error for ErrorProto
      
      * replace ErrorUtil with Error
      
      * add OF_CHECK (#2110)
      
      * optional split_axis (#2113)
      
      * Fix HasAttr bug for optional field
      
      * undefined (#2116)
      
      * merge reduce xxx (#2119)
      
      * Update GetSbpSig() with Maybe (#2118)
      
      * fix several ops
      
      * modify all ops
      
      * format
      
      * update complete
      
      * Refine AdamOptimizer
      
      * fix (#2120)
      
      * Fix xla AdamOptimizer bugs
      
      * support scalar for reduce_xxx axis args (#2122)
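
      A note on the change above: supporting a scalar for the reduce_xxx axis args
      means callers may pass a bare integer as well as a list or tuple. A minimal
      sketch of that kind of argument normalization (hypothetical _normalize_axis
      helper, not the actual OneFlow code) could look like this:

      ```python
      # Hypothetical helper: accept a scalar or a sequence for the reduce axis,
      # and resolve negative axes against the number of dimensions.
      def _normalize_axis(axis, num_axes):
          if axis is None:
              return list(range(num_axes))      # reduce over all axes
          if isinstance(axis, int):
              axis = [axis]                     # promote the scalar to a list
          return [a % num_axes for a in axis]   # support negative axis values

      assert _normalize_axis(1, 4) == [1]
      assert _normalize_axis([-1, 0], 4) == [3, 0]
      ```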
      
      * Dev opt split axis (#2121)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * fix autovar split_axis (#2125)
      
      * Dev model init op (#2117)
      
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
      
      * fix (#2127)
      
      * rm stale alextnet script (#2129)
      
      * Dev plain maybe (#2126)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * Dev simple checkpoint manager (#2128)
      
      * SimpleCheckPointManager
      
      * makedirs
      
      * fix path
      
      * save
      
      * refine
      
      * refine
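
      The SimpleCheckPointManager commits above (makedirs, numbered saves, path fixes)
      describe a very small snapshot manager. A hedged sketch of the same idea, with
      assumed names and directory layout rather than the real OneFlow class:

      ```python
      # Hypothetical sketch of a minimal checkpoint manager; the class name,
      # snapshot naming scheme and save callback are assumptions for illustration.
      import os

      class SimpleCheckPointManagerSketch:
          def __init__(self, root_path):
              self.root_path = root_path
              os.makedirs(root_path, exist_ok=True)   # the "makedirs" commit above

          def list_checkpoints(self):
              return sorted(os.listdir(self.root_path))

          def save(self, save_fn):
              # each snapshot gets its own numbered sub-directory under the root
              name = "snapshot_%d" % len(self.list_checkpoints())
              path = os.path.join(self.root_path, name)
              os.makedirs(path, exist_ok=True)
              save_fn(path)                            # caller writes model files here
              return path
      ```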
      
      * fix path to numpy (#2130)
      
      * Dev plain maybe (#2132)
      
      * optional split_axis
      
      * backup
      
      * VariableConf::(OptInt64 split_axis)
      
      * backup
      
      * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp
      
      * SharedOrPlain
      
      * const std::shared_ptr<T>& => std::shared_ptr<T>
      
      * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()
      
      * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>
      
      * Dev jxf merge general ops (#2131)
      
      * merge some general ops to dev_python
      
      * dense demo
      
      * rm print in test
      
      * new line at the end of file
      
      * format
      
      * fix check point
      
      * update alexnet
      
      * broadcast_xxx (#2134)
      
      * broadcast_xxx
      
      * typo
      
      * typo
      
      * rm job_conf.num_of_batches_in_snapshot
      
      * fix args (#2136)
      
      * fix proto if (#2138)
      
      * pass name to inner function (#2139)
      
      * check dropout if (#2140)
      
      * check dropout if
      
      * fix typo
      
      * Dev merge math ops (#2143)
      
      * merge math ops
      
      * new line at the end of file
      
      * merge layer norm (#2144)
      
      * variable_scope (#2141)
      
      * variable_scope
      
      * revert format
      
      * add check
      
      * Merge dropout if (#2145)
      
      * check dropout if
      
      * fix typo
      
      * fix typo
      
      * slice (#2142)
      
      * slice
      
      * add check and docstring
      
      * minor
      
      * minor
      
      * add const (#2146)
      
      * add const
      
      * fix indentation
      
      * address review
      
      * fmt
      
      * rm redundant
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * Update array_ops.py
      
      * add more activations to math_ops (#2147)
      
      * fix bug (#2149)
      
      * truncated normal for bert (#2150)
      
      * Update bert for dev python (#2151)
      
      * truncated normal for bert
      
      * bert support
      
      * math.dropout to nn.dropout (#2153)
      
      * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto
      
      * allow export multiple interfaces in oneflow_export decorator (#2154)
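
      To make the decorator change above concrete: exporting multiple interfaces means
      a single function can be registered under several public names at once. A hedged
      illustration (hypothetical registry and decorator name, not the real
      oneflow_export internals):

      ```python
      # Hypothetical sketch: register one function under several exported names.
      _EXPORTED_APIS = {}

      def oneflow_export_sketch(*api_names):
          def decorator(func):
              for name in api_names:        # the change: accept many names at once
                  _EXPORTED_APIS[name] = func
              return func
          return decorator

      @oneflow_export_sketch("math.relu", "nn.relu")
      def relu(x):
          return x if x > 0 else 0

      assert _EXPORTED_APIS["math.relu"] is _EXPORTED_APIS["nn.relu"]
      ```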
      
      * refactor job_build_and_infer_if.h
      
      * update oneflow_internal.h to use Maybe (#2135)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)
      
      *  Transfer data_part_num to DecodeOp and RecordLoadOp
      
      * Fix python scripts
      
      * Dev nc of internal (#2155)
      
      * Fix python internal (#2133)
      
      * Return error message in oneflow_internal
      
      * Refine environment_objects_scope
      
      * add OF_ERROR_STR_CHECK and OFStrCat()
      
      * format
      
      * fix based on review
      
      * fix(oneflow_internal.h): add undef
      
      * fix: expr -> (expr)
      
      * feat: update oneflow_internal_helper to use func
      
      * fix: fix ctor bug
      
      * fix config_proto
      
      * rename c_api_util.Init => c_api_util.InitEnvironment
      
      * refactor compile_context.cur_job => compile_context.cur_job_conf
      
      * remove FixPackedBlobDescOfProducedRegst (#2156)
      
      * Fix snapshot root path empty log (#2158)
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * fix 121 for tick (#2069)
      
      * Fix snapshot root path empty log
      
      * fix channel last (#2157)
      
      * fix channel last
      
      * minor
      
      * merge pb_message
      
      * add cudnn conv force algo (#2159)
      
      * Update bert for dev python (#2160)
      
      * remove old bert
      
      * set data_part_num in decoder
      
      * support model load/save args
      
      * Dev flow function (#2152)
      
      * add of.function, refactor init, refine session, and refine runtime
      
      * rm useless code
      
      * rename
      
      * update
      
      * add test
      
      * @oneflow_export JobConfigProto and Trainconf (#2162)
      
      * @oneflow_export JobConfigProto and Trainconf
      
      * remove unused config in config_util.py
      
      * remove oneflow.get_cur_job_conf_builder
      
      * bugfix: bias_add op and reduce_sum op infer sbp and implementation of bias_add kernel (#2161)
      
      * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf
      
      * fix config.train.model_update_conf
      
      * _GetJobConfAttr
      
      * update alexnet (#2166)
      
      * Update alexnet (#2167)
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * more reasonable conf
      
      * get variable in py layer norm
      
      * replace val in pb msg;  decode lbn string with split hint (#2165)
      
      * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)
      
      * Add meta data in HLO instruction, and refine
      
      * python model parallel (#2103)
      
      * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when adding op
      
      * merge placement group
      
      * refine code in AddAndInferOp
      
      * auto merge placement group when adding op; remove mergeplacementgroup interface
      
      * infer sbp parallel when adding op; impl Get/Has split axis in infer_ctx
      
      * python blob add interface for model parallel
      
      * refine code of python blob split
      
      * remove interface of has/get_split_axis in python blob
      
      * remove interface of has_batch_dim in python blob
      
      * add check that blob split_axis can be divided by parallel num
      
      * refine code for maybe get/infer sbp
      
      * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc etc.
      
      * fix for plain point maybe
      
      * fix bug: add repeated placement group, remove add placement interface in hand
      
      * fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel
      
      * dev_python model parallel runnable and check correct
      
      * remove add placement group when placement scope exits
      
      * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel
      
      * bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done
      
      * refine python blob_desc.split implement
      
      * refine interface decode lbn to split hint
      
      * refine auto add placement group
      
      * refine lbn with split hint decode
      
      * refine code for review
      
      * remove AutoVar related code (#2168)
      
      * feat: remove all autovar
      
      * fix and format
      
      * fix: fix op::InferBlobDesc
      
      * add prototype (#2172)
      
      * add prototype
      
      * infer blob desc with sbp_signature
      
      * `str_a is not str_b' is buggy, use `str_a != str_b' instead
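
      The fix above deserves a word of explanation: in Python, `is` / `is not` compare
      object identity while `==` / `!=` compare values, and two equal strings are not
      guaranteed to be the same object. A small self-contained illustration:

      ```python
      # Identity vs. value comparison for strings (CPython behavior shown).
      a = "hello world"
      b = "".join(["hello", " ", "world"])   # built at runtime, a distinct object
      print(a == b)   # True  -> value comparison, what the code intended
      print(a is b)   # False -> identity comparison, hence the bug
      if a != b:      # the correct guard for "contents differ"
          print("different contents")
      ```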
      
      * Update snapshot.cpp (#2174)
      
      * remove useless lines (#2176)
      
      * Fix bert multi nodes (#2177)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * CHECK_JUST for InferBlobDescsIf (#2178)
      
      * Fix bert multi nodes (#2180)
      
      * remove useless lines
      
      * fix bert and init_cluster_env for multi nodes
      
      * config_proto -> default_config_proto
      
      * delete worker
      
      * update alexnet
      
      * remove unused op (#2182)
      
      * remove parallel_ctx when kernel init (#2185)
      
      * InferOpSbpSignature in op_graph and infer_ctx (#2175)
      
      * InferOpSbpSignature in op_graph and infer_ctx
      
      * bugfix: lambda lifetime; gen job build error add location info
      
      * refine error generation and return
      
      * refine check lbi valid and exists
      
      * remove parallel num in decode_of_record op/kernel (#2186)
      
      * Fix bugs
      
      * delete GlobalJobDesc() in operator/ (#2188)
      
      * rm unused test file
      
      * Refine
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Add assign ops behind adam optimizer to update model and momentum etc.
      
      * Remove fake consume op
      
      * Support enable/disable XLA by set env
      
      * Merge callback, limit max operator count for each XLA subgraph
      
      * CudaEventPool
      
      * fix vector
      
      * refine
      
      * Support in-place update for optimizer
      
      * Add alias input and output to prevent reusing input with other temp buffers
      
      * Refine code style
      
      * Remove unused code
      
      * Fix static cublas library and xla link conflict
      
      * Fix cublas link conflict with tensorflow
      
      * Fix different connection kinds for multiple gpu cards (#2282)
      
      * Refine xla cluster algo (#2289)
      
      * Fix different connection kinds for multiple gpu cards
      
      * Fix bug for mutiple outputs consumed by one node
      
      * Refine cluster algo
      
      * Refine MarkClusterId pass and ReduceSplit task node (#2314)
      
      * Fix different connection kinds for multiple gpu cards
      
      * Fix bug for mutiple outputs consumed by one node
      
      * Refine cluster algo
      
      * Determine fusion disabled edges
      
      * update
      
      * Produce multiple registers on edges for ReduceSplit task node.
      Fix new allocator by stream id.
      
      * Refine MarkClusterId pass
      
      * Clustering subgraph with reverse ordering is better
      
      * Support strict clustering by taking dependencies into consideration
      
      * Translate rebuild job and rewrite optimizer into passes, and refine code style
      
      * Fix spell error
      
      * Update cmake
      
      * Merge branch dev_python (#2321)
      
      * Dev res50 new api (#2173)
      
      * check in script
      
      * runnable
      
      * fix multinode
      
      * fix and real train
      
      * fix param data_format
      
      * fix truncated normal
      
      * quick fix multi node launch (#2193)
      
      * Dev reshape sbp (#2192)
      
      * reshape sbp
      
      * more check for reshape conf
      
      * fix error CHECK
      
      * refactor reshape
      
      * fix reshape like op
      
      * support naive case of s0
      
      * refine
      
      * rm redundant code
      
      * more generous check for equal element cnt
      
      * restore empty line
      
      * add GatherMs0Grad op (#2191)
      
      * support for gather with s(0) `in'
      
      * add gather_ms0_op
      
      * fix bugs in message GatherMs0OpConf and GatherMs0Kernel
      
      * only (B, S(0)) -> P supported for gather_ms0 op
      
      * add GatherMs0Grad op
      
      * minor fix
      
      * refine code
      
      * bugfix and update gather test case
      
      * add concat op and pass the test (#2067)
      
      * add concat op and pass the test
      
      * add vgg job_conf
      
      * model compared to be same as the old one
      
      * rm unnecessary file
      
      * Update array_ops.py
      
      * mv file
      
      * get rid of ternary operator (#2195)
      
      * Dev reshape util struct (#2194)
      
      * check in changes
      
      * rm file
      
      * minor fix
      
      * Merge network files of 2 cnns (#2196)
      
      * add inceptionV3
      
      * check in vgg16
      
      * add cnns test scripts for dev_python (#2170)
      
      * add cnns test scripts for dev_python
      
      * add alexnet test scripts
      
      * add resnet50
      
      * add inceptionv3
      
      * add resnet50
      
      * add vgg16
      
      * first version of run_cnns_test.py
      
      * remove old files
      
      * unsorted_segment_sum (#2198)
      
      * oneflow.unsorted_segment_sum (#2199)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * Dev batch unsorted segment sum (#2200)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * rename UnsortedSegmentSum to BatchUnsortedSegmentSum
      
      * rename: batch_unsorted_* => unsorted_batch_*
      
      * unsorted_segment_sum (#2201)
      
      * unsorted_segment_sum
      
      * fix job_completer/unsorted_segment_sum_grad.cpp
      
      * more check for unsorted_segment_sum batch_axis
      
      * remove FixParallelDesc (#2202)
      
      * rm KernelIfWithModel KernelIfWithActivation (#2203)
      
      * remove KernelIfWithActivation
      
      * remove KernelIfWithModel
      
      * rm blob header kLossInstanceNum (#2204)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * fix warning: return string reference to temporary (#2212)
      
      * docker build support (#2002)
      
      * update cmake files
      
      * check in files
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * shrink ctx size
      
      * fix script
      
      * fix wheel build
      
      * fix wheel build not adding .so (#2052)
      
      * lower cmake version bar
      
      * rm more files
      
      * keep build dir
      
      * check in test bash script
      
      * fix
      
      * Dev docker sx (#2124)
      
      * add python2 docker env
      
      * rm old docker files
      
      * update repository
      
      * add ARG CUDA and USE_PYTHON_3_OR_2
      
      * reform files
      
      * update
      
      * rm log doesn't print when there is cache
      
      * use default arg in dockerfile
      
      * better py 2 or 3 condition
      
      * add default
      
      * use if
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * add resnet50 in model (#2217)
      
      * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)
      
      * remove parallel policy
      
      * rm FC/rnn/embedding_look_up op/kernel
      
      * add check data parallel for conv/layer_norm op
      
      * bugfix: bias add + use math_add when batch size = 1
      
      * fix InferBatchAxis (#2220)
      
      * sync with bert_benchmark (#2221)
      
      * sync with bert_benchmark
      
      * rename run.sh
      
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
      * delete unnecessary multiply_grad class
      
      * speed up generate time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel to high performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
      * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
      * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
      
      * Fix random decode (#2252)
      
      * add decode random
      
      * fix decode random actor
      
      * Dev pr boxing v2 (#2248)
      
      * NcclDeviceCtx
      
      * include naive_actor
      
      * refine
      
      * use_boxing_v2
      
      * config.use_boxing_v2
      
      * SubTskGphBuilder
      
      * fix
      
      * hash<oneflow::MemoryCase>
      
      * Maybe<void>
      
      * ChainSubTskGphBuilder
      
      * SliceBoxingOp
      
      * return ok
      
      * SliceBoxingKernel
      
      * SliceBoxingActor
      
      * kSliceBoxing
      
      * nccl boxing op
      
      * nccl actor
      
      * REGISTER_OP
      
      * GetMsgFromCustomizedConf
      
      * NcclBoxingTaskNode
      
      * BldSubTskGphByBoxingV2
      
      * NcclBoxingSubTskGphBuilder
      
      * fix
      
      * fix
      
      * NcclKernel
      
      * ParallelContext
      
      * REGISTER_ACTOR
      
      * fix rank set
      
      * IsNcclTaskType
      
      * limit
      
      * 1024
      
      * multi thread reader
      
      * thread_num
      
      * IsKernelLaunchSynchronized
      
      * refine
      
      * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx
      
      * MakeHostMemCase
      
      * NcclBldSubTskGph
      
      * remove useless code
      
      * use_boxing_v2
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * cmake find python note when version less than 3.14 (#2286)
      
      * fix bug: reduce split kernel inplace (#2297)
      
      * Dev bias add (#2299)
      
      * use bias add
      
      * fix
      
      * bias_add
      
      * bias add half
      
      * fix
      
      * reinterpret_cast
      
      * fix half
      
      * HALF
      
      * fix
      
      * ADD_DEFAULT_KERNEL_CREATOR
      
      * fix
      
      * format
      
      * Fix dev python test (#2294)
      
      * add decode random
      
      * fix decode random actor
      
      * fix dev_python test scripts
      
      * fix batch_size test scripts
      
      * fix
      
      * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
      
      * MemBlockProto and ChunkProto
      
      * create mem block and chunk after improver
      
      * interface merge mem block and chunk between sub plans
      
      * merge chunk between jobs for memory reuse
      
      * use memory zone unique id to replace memory case hash
      
      * merge interface op mem block between jobs for mem shared
      
      * gen GlobalCriticalSection by mem block id and chunk id
      
      * check mem block and chunk valid before runtime
      
      * Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst
      
      * fix bug; and pass test
      
      * fix bug: init chunk_id_count in id_manager
      
      * reuse copyHd out mem between jobs
      
      * PushPlan and PullPlan for memblock and chunk
      
      * refine merge mem block / chunk in oneflow.cpp
      
      * at(i);
      
      * GetOpName2JobId2TaskProtos functional
      
      * using output ptr; pass test AlexNet and Resnet
      
      * Fix xla reshape op
      
      * Merge upstream of_xla (#2322)
      
      * Dev res50 new api (#2173)
      
      * check in script
      
      * runnable
      
      * fix multinode
      
      * fix and real train
      
      * fix param data_format
      
      * fix truncated normal
      
      * quick fix multi node launch (#2193)
      
      * Dev reshape sbp (#2192)
      
      * reshape sbp
      
      * more check for reshape conf
      
      * fix error CHECK
      
      * refactor reshape
      
      * fix reshape like op
      
      * support naive case of s0
      
      * refine
      
      * rm redundant code
      
      * more generous check for equal element cnt
      
      * restore empty line
      
      * add GatherMs0Grad op (#2191)
      
      * support for gather with s(0) `in'
      
      * add gather_ms0_op
      
      * fix bugs in message GatherMs0OpConf and GatherMs0Kernel
      
      * only (B, S(0)) -> P supported for gather_ms0 op
      
      * add GatherMs0Grad op
      
      * minor fix
      
      * refine code
      
      * bugfix and update gather test case
      
      * add concat op and pass the test (#2067)
      
      * add concat op and pass the test
      
      * add vgg job_conf
      
      * model compared to be same as the old one
      
      * rm unnecessary file
      
      * Update array_ops.py
      
      * mv file
      
      * get rid of ternary operator (#2195)
      
      * Dev reshape util struct (#2194)
      
      * check in changes
      
      * rm file
      
      * minor fix
      
      * Merge network files of 2 cnns (#2196)
      
      * add inceptionV3
      
      * check in vgg16
      
      * add cnns test scripts for dev_python (#2170)
      
      * add cnns test scripts for dev_python
      
      * add alexnet test scripts
      
      * add resnet50
      
      * add inceptionv3
      
      * add resnet50
      
      * add vgg16
      
      * first version of run_cnns_test.py
      
      * remove old files
      
      * unsorted_segment_sum (#2198)
      
      * oneflow.unsorted_segment_sum (#2199)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * Dev batch unsorted segment sum (#2200)
      
      * oneflow.unsorted_segment_sum
      
      * remove unused import
      
      * remove unused import
      
      * rename UnsortedSegmentSum to BatchUnsortedSegmentSum
      
      * rename: batch_unsorted_* => unsorted_batch_*
      
      * unsorted_segment_sum (#2201)
      
      * unsorted_segment_sum
      
      * fix job_completer/unsorted_segment_sum_grad.cpp
      
      * more check for unsorted_segment_sum batch_axis
      
      * remove FixParallelDesc (#2202)
      
      * rm KernelIfWithModel KernelIfWithActivation (#2203)
      
      * remove KernelIfWithActivation
      
      * remove KernelIfWithModel
      
      * rm blob header kLossInstanceNum (#2204)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * fix warning: return string reference to temporary (#2212)
      
      * docker build support (#2002)
      
      * update cmake files
      
      * check in files
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * shrink ctx size
      
      * fix script
      
      * fix wheel build
      
      * fix wheel build not adding .so (#2052)
      
      * lower cmake version bar
      
      * rm more files
      
      * keep build dir
      
      * check in test bash script
      
      * fix
      
      * Dev docker sx (#2124)
      
      * add python2 docker env
      
      * rm old docker files
      
      * update repository
      
      * add ARG CUDA and USE_PYTHON_3_OR_2
      
      * reform files
      
      * update
      
      * rm log doesn't print when there is cache
      
      * use default arg in dockerfile
      
      * better py 2 or 3 condition
      
      * add default
      
      * use if
      
      * update alexnet
      
      * update for bert
      
      * 15->16
      
      * add resnet50 in model (#2217)
      
      * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)
      
      * remove parallel policy
      
      * rm FC/rnn/embedding_look_up op/kernel
      
      * add check data parallel for conv/layer_norm op
      
      * bugfix: bias add + use math_add when batch size = 1
      
      * fix InferBatchAxis (#2220)
      
      * sync with bert_benchmark (#2221)
      
      * sync with bert_benchmark
      
      * rename run.sh
      
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
      * delete unnecessary multiply_grad class
      
      * speed up generate time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel to high performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
      * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
      * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
      
      * Fix random decode (#2252)
      
      * add decode random
      
      * fix decode random actor
      
      * Dev pr boxing v2 (#2248)
      
      * NcclDeviceCtx
      
      * include naive_actor
      
      * refine
      
      * use_boxing_v2
      
      * config.use_boxing_v2
      
      * SubTskGphBuilder
      
      * fix
      
      * hash<oneflow::MemoryCase>
      
      * Maybe<void>
      
      * ChainSubTskGphBuilder
      
      * SliceBoxingOp
      
      * return ok
      
      * SliceBoxingKernel
      
      * SliceBoxingActor
      
      * kSliceBoxing
      
      * nccl boxing op
      
      * nccl actor
      
      * REGISTER_OP
      
      * GetMsgFromCustomizedConf
      
      * NcclBoxingTaskNode
      
      * BldSubTskGphByBoxingV2
      
      * NcclBoxingSubTskGphBuilder
      
      * fix
      
      * fix
      
      * NcclKernel
      
      * ParallelContext
      
      * REGISTER_ACTOR
      
      * fix rank set
      
      * IsNcclTaskType
      
      * limit
      
      * 1024
      
      * multi thread reader
      
      * thread_num
      
      * IsKernelLaunchSynchronized
      
      * refine
      
      * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx
      
      * MakeHostMemCase
      
      * NcclBldSubTskGph
      
      * remove useless code
      
      * use_boxing_v2
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * refine
      
      * cmake find python note when version less than 3.14 (#2286)
      
      * fix bug: reduce split kernel inplace (#2297)
      
      * Dev bias add (#2299)
      
      * use bias add
      
      * fix
      
      * bias_add
      
      * bias add half
      
      * fix
      
      * reinterpret_cast
      
      * fix half
      
      * HALF
      
      * fix
      
      * ADD_DEFAULT_KERNEL_CREATOR
      
      * fix
      
      * format
      
      * Fix dev python test (#2294)
      
      * add decode random
      
      * fix decode random actor
      
      * fix dev_python test scripts
      
      * fix batch_size test scripts
      
      * fix
      
      * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
      
      * MemBlockProto and ChunkProto
      
      * create mem block and chunk after improver
      
      * interface merge mem block and chunk between sub plans
      
      * merge chunk between jobs for memory reuse
      
      * use memory zone unique id to replace memory case hash
      
      * merge interface op mem block between jobs for mem shared
      
      * gen GlobalCriticalSection by mem block id and chunk id
      
      * check mem block and chunk valid before runtime
      
      * Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst
      
      * fix bug; and pass test
      
      * fix bug: init chunk_id_count in id_manager
      
      * reuse copyHd out mem between jobs
      
      * PushPlan and PullPlan for memblock and chunk
      
      * refine merge mem block / chunk in oneflow.cpp
      
      * at(i);
      
      * GetOpName2JobId2TaskProtos functional
      
      * using output ptr; pass test AlexNet and Resnet
      
      * Dev cuda 9 arch 70 (#2318)
      
      * kCudaAlignSize = 256
      
      * always compute_70
      
      * __CUDA_API_VERSION >= 10000
      
      * __CUDA_API_VERSION >= 10000
      
      * disable_all_reduce_sequence
      
      * Fix xla reshape op
      
      * Fix compilation without xla
      
      * Remove useless code and fix data type mismatch in field desc (#2326)
      
      * Remove useless code
      
      * Refine code style
      
      * Fix data type mismatch in field desc
      
      * Update README.md (#2335)
      
      * Refine code style (#2336)
      
      * Update XLA usage document (#2337)
      
      * Update XLA usage document
      
      * Fix mistakes
      
      * Add xla clang-format and format codestyle (#2340)
      
      * Revert "Add xla clang-format and format codestyle (#2340)" (#2341)
      
      This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.
      
      * Add xla clang-format and format codestyle (#2342)
      
      * Add xla clang-format and format codestyle
      
      * Fix header file missing
      
      * Of xla sx (#2334)
      
      * add gather grad op and pass testing
      
      * rm check
      
      * done batch gather grad
      
      * pass test
      
      * modify according to the review
      
      * add unsorted_segment_sum and refine unsorted_batch_segment_sum
      
      * reform according to review
      
      * reformat according to clang-format and rm reference to the temp object
      
      * Pick step0 and step1 new commits (#2346)
      
      * Add xla clang-format and format codestyle
      
      * Fix header file missing
      
      * Modify codes to support XLA
      
      Conflicts:
      	oneflow/core/job/job_builder.cpp
      	oneflow/core/job/job_builder.h
      	oneflow/core/operator/op_conf.proto
      
      * Fix a bug for building subgraph although it won't lead to wrong results (#2347)
      
      * Fix setting is_mutable in xla launch op (#2349)
      
      * Change directory xla to xrt, apply patch if building with xla
      
      * Refactor
      
      * Add infer shape pass, and Refactor launch kernel, graph compiler
      
      * Refine code style, add xla executable and graph compiler
      
      * Rename platform.proto as types.proto
      
      * change OpCompiler to OpKernel, complete xla graph compiler
      
      * Fix compilation bugs and add allocator, now xla compilation is ok
      
      * Add xla executable runtime
      
      * Add executable run scope to support launching kernels on a specific stream.
      
      * Fix infer shape pass, and revert cuda event pool
      
      * Refactor graph building with attaching argument metadata.
      
      * Set mutability if rebuilding job
      
      * Set device ordinal correctly
      
      * Refine DelOps
      
      * Refine Argument definition and abstract function as subgraph
      
      * Fix infer shape in xrt launch op and launch kernel.
      
      * Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.
      
      * Refine code style
      
      * Rename xla Operand as XlaValue.
      
      * Complete TensorRT compiler and builder, Refine OpKernel
      
      * Pick public code changes from the new tensorrt branch.
      
      * Fix tensorrt compilation
      
      * Fake implementation of trt executable
      
      * Support selecting engine in launch kernel, refine trt executable
      
      * Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfixes.
      
      * Support train phase setting for registered op kernel
      
      * Remove RewriteOptimizer pass, update xla optimizer op.
      
      * Format job builder .h and .cpp files.
      
      * Remove RewriteOptimizer pass, update xla optimizer op.
      
      * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
      
      * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.
      
      * Refine code style and comment.
      
      * Refine model update inference for launch op.
      
      * Refine
      
      * Refine code style and comment.
      
      * Refine model update inference for launch op.
      
      Conflicts:
      	oneflow/xrt/kernel/op_kernel.h
      	oneflow/xrt/node_util.cpp
      	oneflow/xrt/node_util.h
      	oneflow/xrt/passes/cluster.h
      	oneflow/xrt/passes/mark_cluster_id_pass.cpp
      	oneflow/xrt/passes/rebuild_job_pass.cpp
      	oneflow/xrt/types.h
      
      * Add xrt README.md
      
      * Add use_xla_jit and use_tensorrt options in job proto
      
      * Refine code style
      
      * Fix BlobDesc getter and xla LayerNorm op for FP16
      
      * Make use_xla_jit and use_tensorrt configurable from python config and env variables.
      
      * Update benchmark
      
      * Refine xrt README and rename compile_with_xrt.h file
      
      * Update README
      
      * Revert tensorrt
      
      * Fix absl missing if building with TensorRT but without XLA
      
      * Update xrt benchmark
      
      * Disable WITH_XLA by default
      
      * Update xrt benchmark
      
      * Format xrt as core
      
      * add activation op
      
      * add softmax op
      
      * Refine code style, remove unused code
      
      * Remove duplication of XLA usage
      
      * test pass
      
      * pooling test pass
      
      * add concat op, not tested
      
      * add activation ops, test not passed
      
      * Add xla gelu unittest
      
      * add  activation op, and test  passed
      
      * add pooling op, and test passed
      
      * Fix int64 env variable
      
      * Export float16 for python
      
      * Add xla relu unittest
      
      * try to solve conv bug
      
      * add elementwise add op, test passed
      
      * add concat op, test passed
      
      * Bugfix: transfer weights from gpu to host since tensorrt requires host weights.
      
      * add op unit tests
      
      * resolve conflicts and fix softmax bug
      
      * add identity op and topk op, to test
      
      * Add xla bias add and reshape unittests
      
      * Add xla identity unittest
      
      * Add xla cast and scalar op unittests
      
      * Add xla broadcast op and transpose unittests
      
      * Add xla add, sigmoid and tanh unittests
      
      * add reduce mean op, test passed
      
      * format ops, add CHECKs, and optimize function structure
      
      * Add xla gather and batch_gather unittests
      
      * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
      
      * add trt gather op and unit test
      
      * Add xla reduce_sum unittest, and support keep_dims for xla reduce
      
      * Add xla layer_norm unittest, and refine xla layer norm op
      
      * Add reshape_like unittest, and export reshape_like api
      
      * Refine xrt unittest code style
      
      * Export softmax_grad op, add softmax_grad unittest
      
      * Export tanh_grad op and add xla unittest
      
      * Export gelu_grad op, and add xla unittest
      
      * add conv unit test
      
      * reformat
      
      * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
      
      * Commit to merge upstream of_xrt
      
      * check files
      
      * modify files according to review advice.
      
      * Add xrt unittests (#2483)
      
      * Revert tensorrt
      
      * Fix absl missing if building with TensorRT but without XLA
      
      * Update xrt benchmark
      
      * Add xla gelu unittest
      
      * Fix int64 env variable
      
      * Export float16 for python
      
      * Add xla relu unittest
      
      * Add xla bias add and reshape unittests
      
      * Add xla identity unittest
      
      * Add xla cast and scalar op unittests
      
      * Add xla broadcast op and transpose unittests
      
      * Add xla add, sigmoid and tanh unittests
      
      * Add xla gather and batch_gather unittests
      
      * Add xla softmax unittest and fix softmax bug if axis is not the last dim.
      
      * Add xla reduce_sum unittest, and support keep_dims for xla reduce
      
      * Add xla layer_norm unittest, and refine xla layer norm op
      
      * Add reshape_like unittest, and export reshape_like api
      
      * Refine xrt unittest code style
      
      * Export softmax_grad op, add softmax_grad unittest
      
      * Export tanh_grad op and add xla unittest
      
      * Export gelu_grad op, and add xla unittest
      
      * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests
      
      * Commit to merge upstream of_xrt
      
      * Fix reduce_mean facade bug if keep_dims is true.
      
      * Refine tensorrt unittests
      
      * Check failed if full reduce without keep dimension.
      
      * add pooling unit test
      
      * Add tensorrt bias_add and reshape op, and their unittests.
      
      * Support fp16 for tensorrt.
      
      * Add tensorrt transpose op and unittest.
      
      * add unit test conv_2d
      
      * add unit test concat
      
      * Fix concat if axis is -1.
      
      * Refine tensorrt conv2d unittest
      
      * Fix padding mode for conv2d and pooling, refine unittests.
      
      * Refine tensorrt concat unittest
      
      * Add convert api from string engine to XrtEngine.
      
      * Revert tensorrt, and merge of_xrt branch
      
      * Remove some comments.
      
      * Refine tensorrt unittests
      
      * Add XrtConfig to deal with xla and tensorrt configurations.
      
      Conflicts:
      	oneflow/xrt/api.cpp
      
      * Update tensorflow.cmake to avoid applying the patch repeatedly.
      
      * Remove XrtConfig Option, and fix xrt unittests
      
      * Add tensorrt batch norm (#2516)
      
      * Refine xrt signatrue hash, and fix python configuration (#2520)
      
      * Fix XrtCompilationEnabled returns (#2524)
      
      * Fix compilation after merge dev_python
      
      * Update xrt unittests
      
      * Revert protobuf version
      
      * Remove comment FOR_RANGE
      
      * Remove unused code
      
      * Reformat
      
      * Refine job builder
      
      * Disable dump job if not debug mode
      
      Co-authored-by: Snow <snow3s@qq.com>
      Co-authored-by: Juncheng <liujuncheng1022@gmail.com>
      8f3dcf94
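
      A note on the TensorRT executable behavior mentioned above ("rebuild engine if batch size is larger than default max batch size"): the sketch below only illustrates that caching policy in Python with hypothetical names (TrtExecutable, _build_engine); it is not the oneflow/xrt implementation, which lives in C++.

      class TrtExecutable:
          """Illustrative cache: keep one engine, rebuild it when a larger batch arrives."""

          def __init__(self, default_max_batch_size=1):
              self.max_batch_size = default_max_batch_size
              self.engine = self._build_engine(self.max_batch_size)

          def _build_engine(self, max_batch_size):
              # Stand-in for the real TensorRT builder call.
              return {"max_batch_size": max_batch_size}

          def run(self, batch_size):
              if batch_size > self.max_batch_size:
                  # Oversized batch: rebuild with the larger bound and keep the new engine.
                  self.max_batch_size = batch_size
                  self.engine = self._build_engine(batch_size)
              return self.engine

      # Usage: the 32-batch call triggers one rebuild; the later 16-batch call reuses that engine.
      exe = TrtExecutable(default_max_batch_size=8)
      exe.run(4)
      exe.run(32)
      exe.run(16)
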
  19. Dec 21, 2019
  20. Dec 20, 2019
  21. Dec 12, 2019
  22. Nov 26, 2019
    • Li Xinqi's avatar
      Merge quick dirty from obj detect (#2444) · f5937569
      Li Xinqi authored
      * cmake find python note when version less than 3.14 (#2286)
      
      * fix bug: reduce split kernel inplace (#2297)
      
      * Dev bias add (#2299)
      
      * use bias add
      
      * fix
      
      * bias_add
      
      * bias add half
      
      * fix
      
      * reinterpret_cast
      
      * fix half
      
      * HALF
      
      * fix
      
      * ADD_DEFAULT_KERNEL_CREATOR
      
      * fix
      
      * format
      
      * Fix dev python test (#2294)
      
      * add decode random
      
      * fix decode random actor
      
      * fix dev_python test scripts
      
      * fix batch_size test scripts
      
      * fix
      
      * Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)
      
      * MemBlockProto and ChunkProto
      
      * create mem block and chunk after improver
      
      * interface merge mem block and chunk between sub plans
      
      * merge chunk between jobs for memory reuse
      
      * using memory zone unique id replace memory case hash
      
      * merge interface op mem block between jobs for mem sharing
      
      * gen GlobalCriticalSection by mem block id and chunk id
      
      * check mem block and chunk valid before runtime
      
      * Refactor RegstMgr; allocate memory by mem block and chunk instead of regst
      
      * fix bug; and pass test
      
      * fix bug: init chunk_id_count in id_manager
      
      * reuse copyHd out mem between jobs
      
      * PushPlan and PullPlan for memblock and chunk
      
      * refine merge mem block / chunk in oneflow.cpp
      
      * at(i);
      
      * GetOpName2JobId2TaskProtos functional
      
      * using output ptr; pass test AlexNet and Resnet
      
      * Dev cuda 9 arch 70 (#2318)
      
      * kCudaAlignSize = 256
      
      * always compute_70
      
      * __CUDA_API_VERSION >= 10000
      
      * __CUDA_API_VERSION >= 10000
      
      * disable_all_reduce_sequence
      
      * Fix cuda9 cudnn turing issue (#2329)
      
      * fix cuda 9 issus on turing device
      
      * CUDA_VERSION
      
      * no cuda check
      
      * bias add kernel gpu half (#2330)
      
      * mem_block=>header_mem_block (#2338)
      
      * speed up oneflow compilation
      
      * identity_sbp_conf
      
      * DropOut Version2 (#2355)
      
      * random mask like op conf; refine dropout op in python
      
      * remove useless dropout kernel conf
      
      * implement of random mask like op
      
      * refine dropout op
      
      * refine dropout grad op
      
      * refine generate dropout backward
      
      * random mask like kernel
      
      * refine dropout (grad) kernel
      
      * fix link problem for template separated compile
      
      * fix bug and pass test
      
      * dropout kernel for half
      
      * add check for dropout mask input data type
      
      * bugfixes
      
      * Remove IsOpFloat32() in auto_mixed_precision.cpp (#2358)
      
      * fuse op/kernel to 1 cpp
      
      * refine for review
      
      * fix bug
      
      * Refactor Kernel Registry for more flexible registration (#2363)
      
      * feat: update KernelRegistration and add KernelRegValProto
      
      * Refactor Kernel Registry for more flexible registration
      
      * Remove unused kernel_reg_value.proto
      
      * Memory Version 2.0 Step 3: MemReused in job (#2319)
      
      * use_memory_allocation_algorithm_v2 for switch improver mem block id
      
      * reuse plan task graph and ctrl edge for inferred mem block
      
      * refine interface; InJobMemSharingUtil
      
      * naive merge memory big chain; gen regst apply/release queue; handle for inplace hint regst
      
      * generate regst 2 mutual exclusion regsts
      
      * bugfix: apply should before release
      
      * interface for multi-thread run algorithm get mem block offset result
      
      * select best algorithm to set mem block id and mem block offset
      
      * set mem block for inplace consumer regst
      
      * 3 algorithm interface
      
      * half implement of algo 1
      
      * implement of algorithm0_OfColorImproved
      
      * runnable in 1 machine 1 device
      
      * Memory Chain
      
      * merge MemoryChain and pass Correctness test of alexnet and resnet50
      
      * bugfixes: continuous inplace consume relationship in bert-base fp16
      
      * erase useless info in MemoryChain
      
      * implement of BfcAllocator and Tf_Bfc algorithm
      
      * use bfc algo and fix bug
      
      * only use default algo
      
      * rename in_job_* => intra_job_*
      
      * rename: InJob* => IntraJob*
      
      * rename: 1) apply_regsts_queue => alloc_regsts_queue; 2) release_regsts_queue => free_regsts_queue
      
      * rename function name in job/intra_job_mem_sharing_util.cpp
      
      * rename variable names in job/intra_job_mem_sharing_util.cpp: 1) *apply* => *alloc*; 2) *release* => *free*
      
      * refactor FindFreeOffset => FindFreeOffsetAndNewBufferSize
      
      * rename method: DeallocateRaw => FreeRaw
      
      * rename variable for review
      
      * use enum for mem reused algorithm and add python interface
      
      * fix sbp infer (#2373)
      
      * mv addr calculation out of decoder (#2374)
      
      * use tmp blob for temp storage (#2375)
      
      * INDEX_DATA_TYPE_SEQ (#2381)
      
      * refine include (#2382)
      
      * refine include
      
      * format
      
      * element_wise_mul (#2383)
      
      * gather refine (#2384)
      
      * Dev fix sbp (#2388)
      
      * fix sbp
      
      * fix sbp
      
      * remove VirtualGenKernelConf
      
      * rename Read to ReadFully (#2389)
      
      * Dev parallel cast (#2391)
      
      * parallel cast
      
      * op_conf
      
      * refine
      
      * Dev auto zero padding (#2393)
      
      * auto_zero_padding
      
      * auto_zero_padding
      
      * fix
      
      * fix input_mask and token_type_id (#2398)
      
      * fix job launch (#2401)
      
      * fix sbp bug (#2402)
      
      * fix sbp
      
      * fix
      
      * add missing header files (#2410)
      
      * refactor cnn model tests (#2411)
      
      * refactor cnn model tests
      
      * reformat README.md
      
      * reformat README.md
      
      * refactor ndarray_reduce (#2412)
      
      * fix inplace reachability bug (#2413)
      
      * refactor gpu relu (#2414)
      
      * refactor gpu relu
      
      * CHECK_KERNEL_SAFE_INT32
      
      * there may be a subtle cuda bug in ((float) x < 0)
      
      * refactor ndarray_reduce (#2405)
      
      * refactor ndarray_reduce
      
      * refactor relu/bias_add
      
      * refactor relu
      
      * refactor relu
      
      * refactor bias_add
      
      * refactor relu/bias_add
      
      * fix inplace_lbi bug
      
      * refactor addition
      
      * IsKernelSafeInt32
      
      * CUDA_1D_KERNEL_LOOP_T
      
      * CUDA_1D_KERNEL_LOOP_T
      
      * If add (#2415)
      
      * refactor ndarray_reduce
      
      * refactor relu/bias_add
      
      * refactor relu
      
      * refactor relu
      
      * refactor bias_add
      
      * refactor relu/bias_add
      
      * fix inplace_lbi bug
      
      * refactor addition
      
      * IsKernelSafeInt32
      
      * CUDA_1D_KERNEL_LOOP_T
      
      * CUDA_1D_KERNEL_LOOP_T
      
      * add unless operand is nonzero
      
      * Clear session (#2416)
      
      * oneflow.clear_default_session
      
      * fix bugs in oneflow.config.machine
      
      * refactor function return type (#2417)
      
      * fix for py2 (#2418)
      
      * blob parallel conf
      
      * Pr watch scope (#2419)
      
      * pr oneflow.watch*
      
      * merge more code to pass watch_scope.py
      
      * TODO: input_blob_def.parallel_conf
      
      * fix reexport of identity op
      
      * merge dev_quick_dirty_object_detection
      
      * oneflow.cluster (#2423)
      
      * oneflow.cluster
      
      * no alias for oneflow.cluster.*
      
      * mv cpp_logging_conf from config_proto to cluster_proto
      
      * rename: cluster => env
      
      * rename: Environment => Session
      
      * Free port (#2427)
      
      * oneflow.cluster
      
      * no alias for oneflow.cluster.*
      
      * mv cpp_logging_conf from config_proto to cluster_proto
      
      * rename: cluster => env
      
      * rename: Environment => Session
      
      * auto find a free port for single node environment
      
      * localhost only
      
      * Dev single processor test (#2430)
      
      * oneflow.cluster
      
      * no alias for oneflow.cluster.*
      
      * mv cpp_logging_conf from config_proto to cluster_proto
      
      * rename: cluster => env
      
      * rename: Environment => Session
      
      * auto find a free port for single node environment
      
      * localhost only
      
      * single process test
      
      * Cluster::WorkerLoop
      
      * delete unnecessary OF_BARRIER_ALL
      
      * no longer fork children processes to run tests
      
      * format
      
      * fix align byte size bug (#2436)
      
      * fix align bugs (#2440)
      
      * fix: GetNumOfLoDLevels lacks return
      
      * minor script fix and update
      
      * update script
      
      * remove redundant function
      f5937569
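
      The "Memory Version 2.0 Step 3" commits above describe replaying per-chain regst alloc/free queues through several allocation algorithms and keeping the result with the smallest total buffer. Below is a rough first-fit sketch of that idea in Python with hypothetical names; the real IntraJobMemSharingUtil / BfcAllocator code is C++ and more involved.

      # Replay (event, regst, size) records; "alloc" takes the lowest free hole that fits,
      # "free" opens a hole. The total buffer size is the high-water mark reached.
      def assign_offsets(events):
          free_ranges = []          # sorted list of (offset, size) holes below the high-water mark
          offsets, end = {}, 0      # end == current high-water mark
          for kind, regst, size in events:
              if kind == "alloc":
                  for i, (off, sz) in enumerate(free_ranges):
                      if sz >= size:                      # first hole that fits
                          offsets[regst] = off
                          if sz > size:
                              free_ranges[i] = (off + size, sz - size)
                          else:
                              free_ranges.pop(i)
                          break
                  else:                                   # no hole fits: extend the buffer
                      offsets[regst] = end
                      end += size
              else:                                       # free: open a hole (no coalescing, for brevity)
                  free_ranges.append((offsets[regst], size))
                  free_ranges.sort()
          return offsets, end

      events = [("alloc", "a", 64), ("alloc", "b", 32), ("free", "a", 64), ("alloc", "c", 48)]
      print(assign_offsets(events))   # "c" reuses the 64-byte hole left by "a"
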
  23. Nov 18, 2019
  24. Oct 15, 2019
  25. Oct 11, 2019
  26. Oct 09, 2019
  27. Sep 28, 2019
  28. Sep 27, 2019
  29. Sep 24, 2019
    • Niu Chong's avatar
      merge with dev_python (#2249) · 3960d2cb
      Niu Chong authored
      * Dev actor msg queue (#2225)
      
      * async msg queue
      
      * EnqueueAsyncMsg
      
      * Merge wnd python (#2226)
      
      * not ready yet
      
      * segment fix
      
      * fix segment_sum bugs
      
      * 1st wide_n_deep push
      
      * Fix tick in multi node parallel (#2042)
      
      * check in fixes
      
      * fix by adding boxing method
      
      * register tick op
      
      * move code and add more check
      
      * fix typo
      
      * fix bug when filtering op nodes before adding tick
      
      * fix wheel build not adding .so (#2052)
      
      * color plan dot VERSION-2 (#2045)
      
      * run successfully on single GPU
      
      * fix 121 for tick (#2069)
      
      * delete unnecessary multiply_grad class
      
      * speed up generate time for dot2svg (#2083)
      
      * Add axis conf to bias_add for any axis channel (#2087)
      
      * bias_add completion
      
      * follow comment
      
      * make conf axis required
      
      * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)
      
      This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.
      
      * updated
      
      * fix segment_sum_grad
      
      * fix sbp
      
      * fix segment_sum impl for data parallel
      
      * fix
      
      * remove useless code in segment_kernel_util.h
      
      * add python interface
      
      * fix sigmoid conf
      
      * fix naming error
      
      * fix typo
      
      * temp mod loss sbp
      
      * add LazyAdam
      
      * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep
      
      * rm useless code
      
      * unsorted_segment_sum
      
      * refactor sigmoid_cross_entropy_loss_kernel for higher performance
      
      * Improve sigmoid cross entropy loss grad (#2207)
      
      * remove for loop called cuda kernel
      
      * minor fix
      
      * ../oneflow/python/ops/data_ops.py (#2209)
      
      * fix lazy_adam
      
      * Merge wnd and python (#2214)
      
      * rm ActivationType from op/kernel (#2205)
      
      * refactor sigmoid_cross_entropy_loss
      
      * fix SigmoidGrad::InferBatchAxis
      
      * support part_name_prefix and part_name_suffix_length (#2208)
      
      * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus
      
      * oneflow.watch for debug
      
      * Dev decode batch size (#2206)
      
      * rm batch_size and piece_size
      
      * merge dev_python
      
      * Update reshape_like_op.cpp (#2213)
      
      * oneflow.parallel (#2211)
      
      * oneflow.parallel
      
      * refactor split_axis => parallel
      
      * rename parallel => distribute
      
      * fix typo: *Parallel => *Distribute
      
      * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
      
      * merge dev_python
      
      * fix boxing: P->S(0)
      
      * check in docker build scripts (#2216)
      
      * Dev python widedeep docker (#2218)
      
      * check in docker build scripts
      
      * check in .dockerignore
      
      * rm oneflow.segment_sum
      
      * remove segment_sum
      
      * rm unused file
      
      * rm debug code
      
      * rm debug code
      
      * rm double empty lines
      
      * remove useless comments
      
      * fix send msg (#2227)
      
      * fix reduction_coefficient (#2228)
      
      * refactor ndarray for eq/ne/...
      
      * Dev kernel launch synchronized (#2230)
      
      * IsKernelLaunchSynchronized
      
      * virtual
      
      * refine
      
      * refine
      
      * separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC
      
      * more static_assert
      
      * remove unused task related dot function (#2236)
      
      * remove unused task related dot function
      
      * do not output dot rank info
      
      * Dev non distributed optimizer js (#2234)
      
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      
      * refine lazy adam (#2244)
      
      * refine lazy adam
      
      * update
      
      * memory version 2 step 1: replace original concept about mem sharing (#2242)
      
      * mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem
      
      * memory version 2 step 1: replace original concept about mem sharing
      
      * record reader multi thread (#2246)
      
      * multi thread
      
      * ComputeThreadPoolSize
      
      * python api
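
      The distribute-related commits above (split_axis => parallel => distribute, with_split_distribute / with_broadcast_distribute, boxing P->S(0)) are about how one logical blob maps onto several devices. The numpy snippet below is only a small illustration of the three placements named there, not the oneflow API.

      import numpy as np

      logical = np.arange(12.0).reshape(4, 3)
      n_devices = 2

      # S(0): each device holds one slice of the logical blob along axis 0.
      split_0 = np.array_split(logical, n_devices, axis=0)

      # B: every device holds a full copy of the logical blob.
      broadcast = [logical.copy() for _ in range(n_devices)]

      # P: every device holds a partial sum; adding the pieces recovers the logical blob.
      partial = [logical * 0.5, logical * 0.5]

      # Boxing P -> S(0): reduce the partial copies, then re-split along axis 0.
      reduced = sum(partial)
      p_to_s0 = np.array_split(reduced, n_devices, axis=0)
      assert all(np.allclose(a, b) for a, b in zip(split_0, p_to_s0))
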
  30. Sep 20, 2019
    • Juncheng's avatar
      Dev non distributed optimizer js (#2234) · 2b7c50b0
      Juncheng authored
      * op&kernel&actor
      
      * job
      
      * job_completer
      
      * graph
      
      * format
      
      * fix pd
      
      * fix
      
      * ignore DelPlacementByOpName
      
      * fix auto tick
      
      * JobBuilder
      
      * fix
      
      * config util
      
      * fix
      
      * fix opgrade
      
      * broadcast tick
      
      * fix allreduce
      
      * balance by model size
      
      * GetSoleOutBlobSize
      
      * async_actor_msg_deque
      
      * group
      
      * AddOrMutOpsOnlyOnce
      
      * fix NcclTupleBroadcastGrad
      
      * order
      
      * set nccl order hint
      
      * op_conf
      
      * grad hint
      
      * NcclTupleBroadcastReduceSequencePass
      
      * add missed mutops
      
      * order fix
      
      * try kMdUpdtArea
      
      * fix nccl_order_hint
      
      * fix
      
      * add ti
      
      * tuple_identity_op
      
      * remove useless
      
      * group
      
      * fix dead lock
      
      * force ctrl in
      
      * sc broadcast
      
      * sort obn
      
      * group nccl
      
      * config group_size_mbyte
      
      * non_distributed_optimizer_group_size_mbyte
      
      * format
      
      * stop check
      
      * rm message sending optimization
      2b7c50b0
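
      The non-distributed-optimizer commits above balance model variables across devices by size ("balance by model size", GetSoleOutBlobSize) and then pack each device's variables into broadcast groups bounded by group_size_mbyte. The greedy sketch below is a hypothetical Python helper illustrating both steps; it is not the C++ job-completer pass itself.

      def balance_and_group(var_sizes, n_devices, group_size_bytes):
          # Greedy balancing: place the largest variable first, always on the lightest device.
          per_device = [[] for _ in range(n_devices)]
          load = [0] * n_devices
          for name, size in sorted(var_sizes.items(), key=lambda kv: -kv[1]):
              d = load.index(min(load))
              per_device[d].append((name, size))
              load[d] += size
          # Grouping: split each device's variable list into chunks of at most
          # group_size_bytes so each broadcast handles a bounded amount of data.
          groups = []
          for d, vars_ in enumerate(per_device):
              cur, cur_bytes = [], 0
              for name, size in vars_:
                  if cur and cur_bytes + size > group_size_bytes:
                      groups.append((d, cur))
                      cur, cur_bytes = [], 0
                  cur.append(name)
                  cur_bytes += size
              if cur:
                  groups.append((d, cur))
          return groups

      sizes = {"fc.w": 64 << 20, "fc.b": 1 << 20, "conv.w": 16 << 20, "embed.w": 128 << 20}
      print(balance_and_group(sizes, n_devices=2, group_size_bytes=32 << 20))
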
  31. Sep 07, 2019
  32. Sep 04, 2019
    • Juncheng's avatar
      Dev model init op (#2117) · 26717534
      Juncheng authored
      * assign op
      
      
      AddGlobalStepOpConf
      
      
      fix
      
      
      ARITHMETIC_DATA_TYPE_SEQ
      
      
      identity_op_conf
      
      
      add ops
      
      
      GenNewSnapshotName
      
      
      SnapshotOp
      
      
      cleanup
      
      
      blob name
      
      
      LearningRateScheduleOp
      
      
      LearningRateScheduleKernel
      
      
      LearningRateScheduleKernel
      
      
      AddLearningRateScheduleOpConf
      
      
      learning rate
      
      
      cleanup
      
      
      fix
      
      
      fix
      
      * remove total_mbn_num
      
      * date time format
      
      * save
      
      * refine
      
      * refine
      
      * revert
      
      * refine snapshot
      
      * fix
      
      * refine
      
      * AutoGlobalStep
      
      * refine
      
      * GenLogicalBlobName
      
      * AutoLearningRate
      
      * remove JobDesc lr
      
      * fix snapshot path
      
      * Maybe<void>
      
      * learning_rate blob
      
      * remove next_model_vid
      
      
      fix
      
      
      fix 
      
      
      fix
      
      
      learning_rate
      
      * train_conf
      
      * fix for global step on multi nodes
      
      * SnapshotReader
      
      
      snapshot writer
      
      
      model init op
      
      
      fix
      
      
      refine
      
      
      init
      
      
      InitializeFromSnapshotConf
      
      
      model io job
      
      
      ModelLoadOp
      
      
      ModelLoadKernel
      
      
      MakeModelLoadJob
      
      
      ModelSaveOp
      
      
      fix
      
      
      InterUserJobInfo
      
      
      _MakeModelLoadJobFunc
      
      
      MutModelLoadOpConTickInputHelper
      
      
      fix
      
      
      refine
      
      
      init/load/save
      
      
      set_default_variable
      
      * remove SnapshotMgr
      
      * snapshot.h
      
      * delete model_init_job.cpp
      
      
      foreign_input_op_conf
      
      
      fix
      
      
      snapshot path
      
      
      set path
      
      
      op_conf
      
      
      fix
      
      
      fix CopyFromNdarray
      
      
      to bytes c
      
      
      use uint8
      
      
      char2uint8
      
      * model init
      
      * model io
      
      * fix
      
      * ModelSaveKernel
      
      * mutable_batch_axis()->Clear()
      
      * InferBatchAxis
      
      * fix
      
      * refine
      
      * job set
      
      * MakeModelIoJobs
      
      * fix
      
      * jobs
      
      * fix
      
      * model io job
      
      * GenOutputOpConf
      
      * refine snapshot
      
      * refine
      
      * fix
      
      * refine CheckPoint
      
      * remove session
      
      * refine
      
      * refine
      
      * refine
      
      * remove keyword.h/cpp
      
      * refine
      
      * global_step=>train_step
      
      * GetSbpSignatures
      
      * ModelInitOp
      26717534
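
      The commits above move the learning rate out of JobDesc into a LearningRateScheduleOp/Kernel driven by the train_step counter (AutoGlobalStep / AutoLearningRate). The exact schedule policies are not spelled out in this log; the Python sketch below only illustrates the general shape of such a kernel as a pure function of the step, assuming linear warmup followed by exponential decay.

      def learning_rate_schedule(train_step, base_lr=1e-3, warmup_steps=1000,
                                 decay_rate=0.9, decay_steps=10000):
          # Warmup: ramp linearly from 0 to base_lr over the first warmup_steps.
          if train_step < warmup_steps:
              return base_lr * (train_step + 1) / warmup_steps
          # Decay: scale by decay_rate every decay_steps after warmup.
          steps_after = train_step - warmup_steps
          return base_lr * decay_rate ** (steps_after / decay_steps)

      # A schedule kernel would recompute this every iteration from the train_step blob.
      for step in (0, 500, 1000, 20000):
          print(step, learning_rate_schedule(step))
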