Commit 8f3dcf94 authored by Houjiang Chen, committed by cheng cheng

XRT: XLA + TensorRT (#2525)

* Enable multiple definition for xla compilation in oneflow

* Realize running an executable

* Abstract and gather the resources needed for compilation (such as the client, builder, etc.) into a CompilationResourceStore

* Implement a separate xla allocator to avoid introducing too many tensorflow objects

* Define CompilationContext separately

* Running XLA in CPU mode is OK now

* Make the result shape after running the executable a tuple, and refine comments

* Add a compilation cache to avoid recompiling every time (see the sketch below)
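A compilation cache like this is typically a hash map keyed by a signature of the launch (op name, input shapes, data types, device), so an executable is built only on the first call. Below is a minimal sketch of that idea; `Signature`, `Executable`, and the builder callback are hypothetical stand-ins, not the actual XRT types.

```cpp
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Hypothetical stand-ins for the real XRT types.
struct Executable { /* compiled XLA executable */ };
using Signature = std::string;  // e.g. op name + input shapes + dtypes

class CompilationCache {
 public:
  // Return a cached executable, or compile one via `build` and cache it.
  std::shared_ptr<Executable> GetOrCompile(
      const Signature& sig,
      const std::function<std::shared_ptr<Executable>()>& build) {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = cache_.find(sig);
    if (it != cache_.end()) { return it->second; }
    auto exec = build();  // compile only on a cache miss
    cache_.emplace(sig, exec);
    return exec;
  }

 private:
  std::mutex mutex_;
  std::unordered_map<Signature, std::shared_ptr<Executable>> cache_;
};
```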

* Resolve InferSbpSignature in XlaLaunchOp

* Resolve executing on a specified cuda stream

* Refine XlaLaunch parallel conf, add batch matmul op

* Refactor job rebuilding and fixup time shape

* Update batch_dim_lbis field if XlaLaunch has any output which has batch dim

* Resolve cluster rings after clustering, taking sbp policy and time shape into consideration

* Add reshape op

* Fix bugs

* Rename CompilationContext to XlaLaunchContext, add XlaRuntimeScope to swap stream handles

* Fix bugs

* Update cmake to compile with xla optionally

* Support more ops

* Add more ops, and fix bugs

* Implement XLA allocator and internal memory pool

* Adaptively resize the allocator memory pool (sketched below)
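The allocator commits above suggest a memory pool that grows whenever a request exceeds its current capacity. The sketch below shows that growth policy only, with `std::malloc`/`std::free` standing in for the real device allocation (e.g. cudaMalloc/cudaFree); it is not the actual XLA allocator.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdlib>

// Minimal sketch: keep one backing buffer and grow it geometrically so a
// sequence of slightly larger requests does not reallocate on every step.
class GrowableAllocator {
 public:
  void* Reserve(std::size_t size) {
    if (size > capacity_) {
      std::free(base_);
      capacity_ = std::max(size, capacity_ * 2);
      base_ = std::malloc(capacity_);
    }
    return base_;
  }
  ~GrowableAllocator() { std::free(base_); }

 private:
  void* base_ = nullptr;
  std::size_t capacity_ = 0;
};
```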

* Refine memory allocator

* Block host if running cpu executable

* Fix bug for getting scalar value

* Fix result layout bug, which caused wrong results for transpose

* Refine gelu backward

* Of xla sx (#1990)

* add identity xla op

* Add batch gather op

* Refine batch gather

* fix batch gather bug and add gather op, mv identity op to unary_op

* Add softmax and gather/batch_gather

* Add xla softmax_grad op

* Add xla layer normalization op

* Add xla layer norm backward op

* Alias inputs and outputs to compute in-place

* Reuse output buffers when running the xla executable. It brings about a 10% speedup for bert on a single gpu by zero-copying results

* Refine xla allocator

* Refine code style

* Add xla reduce_sum op

* Rewrite model update op to optimizer graph

* Fix hang bugs

* Fix input whose body is disabled in xla launch kernel

* Fix self control in

* Fix self control in

* Add fake consume op

* Fix HasAttr bug for optional field

* Refine AdamOptimizer

* Fix xla AdamOptimizer bugs

* Add meta data in HLO instruction, and refine

* Fix bugs

* add reduce sum and split normal model update (#2040)

* remove append_func_to_list

* Rm deprecated model update and save code (#1958)

* remove code

* mv random gen to kernel

* mk seed required

* address reviews

* fix unused warning

* address reviews

* check in more deprecation

* remove ModelSaveOpConf

* move out ops and modify item (#1962)

* ModelInit.__oneflow_input_remote_blobs__

* fix cpu only query & add error info (#1964)

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* modify check_point and add test check_point (#1963)

* fix misuse of Scope/raii

* op_name2variable_blob

* add sigmoid test and tanh test (#1966)

* add op matmul and matmul test (#1967)

* rename oneflow.val to oneflow.input_blob_def

* support auto var for convolution (#1972)

* add op add and test add (#1973)

* mv deprecated.pb_util to lib.core.pb_util

* add op get_variable and get_variable test (#1975)

* add op get_variable and get_variable test

* modify shape extend

* AllReduceSequencePass (#1976)

* python2 compatibility for check_point

* fix "return (blob_a, blob_b)" bug

* rename: arg_passing => arg_pass

* shared regst blob header between jobs (#1919)

* half impl

* register manager handle memory shared for separated memory

* set separated memory shared id for shared regst between jobs

* half impl of python for blob

* fix BUG of pod ToProto() when proto has already been initialized

* fix BUG of infer dim0_inner_shape() in foreign_input_op

* 1. PushJob copy from python can infer dim0_valid_num

* add test for dynamic relu

* refine test file

* refine code

* refine note

* update test file for new interface

* rename separated_header* (#1979)

* some bugs fixes for a train&eval job (#1978)

* debugging alex net

* check in test pull_multiple_blob.py

* stricter check

* fix bias in conv

* fix various bugs

* rm file

* op_name in different jobs can be overloaded

* fix compile bug in job_set_compile_ctx

* rm cmake code for building oneflow binary

* check in script (#1980)

* check in script

* rm used import

* CudaCurrentDeviceGuard (#1977)
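Judging by the name, CudaCurrentDeviceGuard is an RAII guard that saves the current CUDA device, switches to the requested one, and restores the old device when it goes out of scope. A generic sketch of that pattern (not necessarily the actual implementation):

```cpp
#include <cuda_runtime.h>

class CudaCurrentDeviceGuard {
 public:
  explicit CudaCurrentDeviceGuard(int dev_id) {
    cudaGetDevice(&saved_dev_id_);  // remember the current device
    cudaSetDevice(dev_id);          // switch to the requested device
  }
  ~CudaCurrentDeviceGuard() { cudaSetDevice(saved_dev_id_); }  // restore
  CudaCurrentDeviceGuard(const CudaCurrentDeviceGuard&) = delete;
  CudaCurrentDeviceGuard& operator=(const CudaCurrentDeviceGuard&) = delete;

 private:
  int saved_dev_id_ = 0;
};
```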

* fix val (#1981)

* Merge job set and split fw bw (#1982)

* add MemoryCopier and TensorSliceCopier (#1901)

* add MemoryCopier and TensorSliceCopier

* Index=>NdIndex

* refine

* refine

* fix addition error checking (#1911)

* Merge dev_mixed_precision into dev_split_fw_bw (#1904)

* update binary_func.h

* update

* update ndarray

* update

* update

* update

* update

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* Merge dev_mixed_precision: Part-2 (#1907)

* feat: add NewKernelUtil

* fix typos

* feat: add cublas_tensor_op_math_handle()

* add gemm (#1860)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* feat: NewKernelUtil -> NewKernelUtil<DeviceType>

* feat: update FullyConnectedKernel to use NewKernelUtil

* Dev sx mixed precision (#1861)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* save cpp

* save

* add relu and relu_backward

* remove spared space

* add explicit declaration

* rename

* feat: update ConvKernel to support half

* add sigmoid and tanh (#1867)

* add axpy (#1866)

* style: formatting

* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>

* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle

* refine(new_kernel_util.h)

* refine(new_kernel_util.cu)

* feat(new_kernel_util): add OFBatchedGemm()

* feat: update MatMulKernel to support half

* feat: update ConvData/Bias/FilterGradKernel to support half

* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out

* feat: support loss scale

* fix(operator): :bug:add InferHasBatchDim()

* feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()

* refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float

* style(kernel/cast_kernel.cpp): formatting

* fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()

* style(cast_kernel.cpp): formatting

* feat(new_kernel_util): :sparkles:support Transpose in NewKernelUtil

* refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil

* feat(dropout_kernel): :sparkles:update DropoutKernel to support half

* refactor(dropout_kernel): remove backward funcs

* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support
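The idea in the dropout_kernel commit is to select the half-precision path with SFINAE rather than partially specializing the whole kernel struct. A minimal sketch, with hypothetical function names:

```cpp
#include <cuda_fp16.h>
#include <type_traits>

// One overload is enabled only for half, the other for every other type.
template <typename T>
typename std::enable_if<std::is_same<T, half>::value>::type
DropoutImpl(const T* in, T* out, int n) {
  // half-specific path (e.g. vectorized half2 math) would go here
}

template <typename T>
typename std::enable_if<!std::is_same<T, half>::value>::type
DropoutImpl(const T* in, T* out, int n) {
  // generic float/double path
}
```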

* fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)

* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()

* fix: fix little bugs

* fix(conv_data/filter_grad_op): min byte size of buf blob is 1

* feat: support half for bias_add_kernel

* fix(bias_add_op): remove data type check

* feat(relu_kernel): support half

* refactor: add ADD_GPU_HALF_KERNEL_CREATOR

* fix: typos

* feat(pooling_kernel): support half

* fix: remove CHECK_EQ of default data type

* feat(pooling_grad_kernel): support half

* feat: support half in ofrecord_encoder (TODO)

* fix

* feat: support half in sparse_cross_entropy_kernel

* debug grad op (#1883)

* Dev debug op mixed precision (#1884)

* debug grad op

* do nothing instead of UNIMPLEMENTED

* fix(dropout_kernel): add tmp_split_fw_bw condition

* build(half.cmake): https->http

* fix(record_load_kernel): support total_batch_num

* fix pooling (#1885)

* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()

* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()

* fix: add GetCudnnScalingParameters() to fix scaling params

* fix: add enable_true_half_config_when_conf() into config and update related code

* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization

* refactor(matmul_kernel): remove Backward()

* feat(new_kernel_util): support HGemmWithFloat() which uses cublasSgemmEx()

* feat: add enable_cublashgemm_when_matmul in Config; update MatMulKernel to support HGemmWithFloat()

* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr

* refactor(new_kernel_util.cu): remove static of func in anonymous namespace

* feat(job_conf.proto): add enable_auto_mixed_precision field

* feat(auto_mixed_precision_lists): add amp_lists

* feat(auto_mixed_precision): build the skeleton

* feat(auto_mixed_precision): almost finish amp graph pass

* feat(auto_mixed_precision.cpp): complete InsertCastOp()

* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG

* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)

* refine(auto_mixed_precision.cpp): refine LOG

* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()
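The auto_mixed_precision commits describe an op-list based graph pass: ops on a white list run in fp16, the white property is propagated through ops that appear on no list, and cast ops are inserted at the resulting fp32/fp16 boundaries. A heavily simplified sketch of the propagation step; the node type and function below are illustrative, not the actual oneflow pass:

```cpp
#include <queue>
#include <string>
#include <unordered_set>
#include <vector>

struct Node {
  std::string op_type;
  std::vector<Node*> consumers;  // downstream ops
};

// Mark every white-list op, then spread the mark through ops on no list.
std::unordered_set<Node*> MarkFp16Nodes(
    const std::vector<Node*>& nodes,
    const std::unordered_set<std::string>& white_list,
    const std::unordered_set<std::string>& any_list) {
  std::unordered_set<Node*> fp16_nodes;
  std::queue<Node*> worklist;
  for (Node* n : nodes) {
    if (white_list.count(n->op_type)) {
      fp16_nodes.insert(n);
      worklist.push(n);
    }
  }
  while (!worklist.empty()) {
    Node* n = worklist.front();
    worklist.pop();
    for (Node* c : n->consumers) {
      if (!any_list.count(c->op_type) && !fp16_nodes.count(c)) {
        fp16_nodes.insert(c);
        worklist.push(c);
      }
    }
  }
  return fp16_nodes;  // cast ops would be inserted on edges crossing this set
}
```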

* Dev half ndarray (#1886)

* debug grad op

* ZeroVal => GetZeroVal; OneVal => GetOneVal

* MaxVal => GetMaxVal; MinVal => GetMinVal

* check data type

* DevDType

* move function template to struct template for BinaryFunc* and UnaryFunc*

* support half for reduce_sum_kernel

* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr

* half for NdarrayUtil

* OF_DEVICE_FUNC is always inline

* half for NdarrayApplyUnaray

* simplify usage of NdarrayUtil

* UnaryFuncExp

* add VarNdarrayBuilder and ValNdarrayBuilder

* simplify NdarrayUtil in layer_norm_param_grad_kernel

* InplaceBroadcast

* remove SoftmaxKernelUtil

* half for softmax_kernel

* fix improper use of __CUDA_ARCH__

* disable sm_30,sm_52

* refine(conv_kernel.cu): fix typo

* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix: fix typos of GetOneVal

* fix(auto_mixed_precision.cpp): allocate for shared_ptr

* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding

* fix(auto_mixed_precision.cpp): fix typo

* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge

* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()

* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>

* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp

* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs

* feat(auto_mixed_precision.cpp): more logs

* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal

* fix(bias_add_op.cpp): fix bias_multiplier shape

* feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half

* feat: update MatmulKernel and new_kernel_util to support half

* refactor(auto_mixed_precision): add ClearList and refine code

* feat(tanh_*_kernel): support half

* feat(add_kernel): support half

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF

* style(CMakeLists.txt): fix typo

* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix(auto_mixed_precision.cpp): group inserted cast op by lbn

* fix get one ptr (#1913)

* fix(layer_norm): add LayerNormOp to grey_list and support the half

* fix(layer_norm about): fix it to run when amp

* fix: move fix sbp signature from OpNode to OpGraph

* Dev new kernel util (#1925)

* refactor(kernel/util): refactor NewKernelUtil and add DnnIf

* refactor(kernel/util): add BlasIf

* refactor(kernel/util): add ArithemeticIf

* refactor(kernel/util): add cuda_kernel_util.*

* refactor: refactor NewKernelUtil

* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including

* refactor(new_kernel_util.h): remove unused header files

* refactor: refactor loop include

* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA

* CHECK cuda version > 10.0 when using auto_mixed_precision

* Fix bug of Snapshot delete file Unwanted (#1937)

* fix link BUG of release version (#1938)

* delete redundant code in OpGraph JobCompleter and Operator (#1927)

* 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe

* revert README change

* split 2 pull request

* Refactor Kernel Registry V2: The clear & easy Way (#1941)

* refactor(resource.proto): move DeviceType to common/device_type.proto

* feat(kernel_registration): add kernel_registration.h/cpp

* feat(kernel_registration): update matmul_kernel to support new registration

* feat: add CreateKernel for new registry

* feat: update registry of cast conf

* refactor(kernel_registration): remove KernelRegMap

* fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)

* grpc SetMaxMessageSize(INT_MAX) (#1950)

* fix bug of Graph::ForEachConnectedComponent (#1952)

* Grpc set max size (#1953)

* grpc SetMaxMessageSize(INT_MAX)

* set max msg len for ctrl service
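gRPC caps messages at 4 MB by default, which is too small for large control messages such as the compiled plan; the usual fix is to raise the limit on both the server builder and the client channel arguments, along these lines:

```cpp
#include <climits>
#include <grpcpp/grpcpp.h>

// Raise the 4 MB default so large ctrl-service messages can pass through.
void ConfigureMaxMessageSize(grpc::ServerBuilder* builder,
                             grpc::ChannelArguments* channel_args) {
  builder->SetMaxMessageSize(INT_MAX);
  channel_args->SetMaxReceiveMessageSize(INT_MAX);
  channel_args->SetMaxSendMessageSize(INT_MAX);
}
```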

* code for test grpc max msg size

* remove test code

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* AllReduceSequencePass (#1976)

* Merge job set and split fw bw (#1983)

* add MemoryCopier and TensorSliceCopier (#1901)

* add MemoryCopier and TensorSliceCopier

* Index=>NdIndex

* refine

* refine

* fix addition error checking (#1911)

* Merge dev_mixed_precision into dev_split_fw_bw (#1904)

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* Merge dev_mixed_precision: Part-2 (#1907)

* feat: add NewKernelUtil

* fix typos

* feat: add cublas_tensor_op_math_handle()

* add gemm (#1860)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* feat: NewKernelUtil -> NewKernelUtil<DeviceType>

* feat: update FullyConnectedKernel to use NewKernelUtil

* Dev sx mixed precision (#1861)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* save cpp

* save

* add relu and relu_backward

* remove spared space

* add explicit declaration

* rename

* feat: update ConvKernel to support half

* add sigmoid and tanh (#1867)

* add axpy (#1866)

* style: formatting

* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>

* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle

* refine(new_kernel_util.h)

* refine(new_kernel_util.cu)

* feat(new_kernel_util): add OFBatchedGemm()

* feat: update MatMulKernel to support half

* feat: update ConvData/Bias/FilterGradKernel to support half

* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out

* feat: support loss scale

* fix(operator): :bug:add InferHasBatchDim()

* feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu()

* refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float

* style(kernel/cast_kernel.cpp): formatting

* fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()

* style(cast_kernel.cpp): formatting

* feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil

* refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil

* feat(dropout_kernel): :sparkles:update DropoutKernel to support half

* refactor(dropout_kernel): remove backward funcs

* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support

* fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)

* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()

* fix: fix little bugs

* fix(conv_data/filter_grad_op): min byte size of buf blob is 1

* feat: support half for bias_add_kernel

* fix(bias_add_op): remove data type check

* feat(relu_kernel): support half

* refactor: add ADD_GPU_HALF_KERNEL_CREATOR

* fix: typos

* feat(pooling_kernel): support half

* fix: remove CHECK_EQ of default data type

* feat(pooling_grad_kernel): support half

* feat: support half in ofrecord_encoder (TODO)

* fix

* feat: support half in sparse_cross_entropy_kernel

* debug grad op (#1883)

* Dev debug op mixed precision (#1884)

* debug grad op

* do nothing instead of UNIMPLEMENTED

* fix(dropout_kernel): add tmp_split_fw_bw condition

* build(half.cmake): https->http

* fix(record_load_kernel): support total_batch_num

* fix pooling (#1885)

* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()

* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()

* fix: add GetCudnnScalingParameters() to fix scaling params

* fix: add enable_true_half_config_when_conf() into config and update related code

* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization

* refactor(matmul_kernel): remove Backward()

* feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()

* feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat()

* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr

* refactor(new_kernel_util.cu): remove static of func in anonymous namespace

* feat(job_conf.proto): add enable_auto_mixed_precision field

* feat(auto_mixed_precision_lists): add amp_lists

* feat(auto_mixed_precision): build the skeleton

* feat(auto_mixed_precision): almost finish amp graph pass

* feat(auto_mixed_precision.cpp): complte InsertCastOp()

* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG

* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)

* refine(auto_mixed_precision.cpp): refine LOG

* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()

* Dev half ndarray (#1886)

* debug grad op

* ZeroVal => GetZeroVal; OneVal => GetOneVal

* MaxVal => GetMaxVal; MinVal => GetMinVal

* check data type

* DevDType

* move function template to struct template for BinaryFunc* and UnaryFunc*

* support half for reduce_sum_kernel

* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr

* half for NdarrayUtil

* OF_DEVICE_FUNC is always inline

* half for NdarrayApplyUnaray

* simplify usage of NdarrayUtil

* UnaryFuncExp

* add VarNdarrayBuilder and ValNdarrayBuilder

* simplify NdarrayUtil in layer_norm_param_grad_kernel

* InplaceBroadcast

* remove SoftmaxKernelUtil

* half for softmax_kernel

* fix improper use of __CUDA_ARCH__

* disable sm_30,sm_52

* refine(conv_kernel.cu): fix typo

* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix: fix typos of GetOneVal

* fix(auto_mixed_precision.cpp): allocate for shared_ptr

* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding

* fix(auto_mixed_precision.cpp): fix typo

* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge

* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()

* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>

* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp

* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs

* feat(auto_mixed_precision.cpp): more logs

* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal

* fix(bias_add_op.cpp): fix bias_multiplier shape

* feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half

* feat: update MatmulKernel and new_kernel_util to support half

* refactor(auto_mixed_precision): add ClearList and refine code

* feat(tanh_*_kernel): support half

* feat(add_kernel): support half

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF

* style(CMakeLists.txt): fix typo

* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix(auto_mixed_precision.cpp): group inserted cast op by lbn

* fix get one ptr (#1913)

* fix(layer_norm): add LayerNormOp to grey_list and support the half

* fix(layer_norm about): fix it to run when amp

* fix: move fix sbp signature from OpNode to OpGraph

* Dev new kernel util (#1925)

* refactor(kernel/util): refactor NewKernelUtil and add DnnIf

* refactor(kernel/util): add BlasIf

* refactor(kernel/util): add ArithemeticIf

* refactor(kernel/util): add cuda_kernel_util.*

* refactor: refactor NewKernelUtil

* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including

* refactor(new_kernel_util.h): remove unused header files

* refactor: refactor loop include

* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA

* CHECK cuda version > 10.0 when use auto_mixed_presion

* Fix bug of Snapshot delete file Unwanted (#1937)

* fix link BUG of release version (#1938)

* delete redundant code in OpGraph JobCompleter and Operator (#1927)

* 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot delete file Unwanted  3. refine ReadMe

* revert README change

* split 2 pull request

* Refactor Kernel Registry V2: The clear & easy Way (#1941)

* refactor(resource.proto): move DeviceType to common/device_type.proto

* feat(kernel_registration): add kernel_registration.h/cpp

* feat(kernel_registration): update matmul_kernel to support new registration

* feat: add CreateKernel for new registry

* feat: udpate registry of cast conf

* refactor(kernel_registration): remove KernelRegMap

* fix bug of op_gragh in SplitLogicalInputBlobDesc (#1949)

* grpc SetMaxMessageSize(INT_MAX) (#1950)

* fix bug of Graph::ForEachConnectedComponent (#1952)

* Grpc set max size (#1953)

* grpc SetMaxMessageSize(INT_MAX)

* set max msg len for ctrl service

* code for test grpc max msg size

* remove test code

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* AllReduceSequencePass (#1976)

* CudaCurrentDeviceGuard (#1977)

* delete tmp_split_fw_bw_train_conf (#1985)

* delete tmp_split_fw_bw_train_conf

* delete useless comments

* fix refactor bug in layer_norm_op

* minor fixes

* update py script

* remove code could be misleading

* Fix all reduce mem sharing (#1986)

* fix all reduce mem sharing

* ByteSizeOfDataContentField=>ByteSizeOfBlobBody

* remove obsolete task_graph optimization

* no arg_pass_job for variable_op

* merge memory block id between jobs (#1910)

* refine MemBlock and CriticalSection

* job memory sharing strategy

* revert diff in CriticalSectionDesc

* Merge memory block between sub plans

* Get mutual exclusion job groups

* forgot to consider that memory merge only happens on the same machine

* memory zone unique id

* Merge Done;  merge memory block id from right to left; get memory block ids info

* revert MemBlock

* generate mutual exclusion job groups Done.

* update for proto

* add JobMemSharingStrategy in python interface

* remove memorycase hash

* move JobMemSharingStrategy to JobSetProto

* using default strategy = parallel priority strategy

* update interface of flow.job_mem_sharing_strategy

* InterJobMemSharingUtil and PlanUtil

* revert oneflow.h

* fix bug

* New implement of Merge memory block id between jobs

* refine code

* fix a fatal bug in std::hash<oneflow::Shape>
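A specialization of std::hash for a shape type usually folds all dimensions with a hash-combine; a bug there (e.g. ignoring dimensions so that shapes collide) is the kind of fatal issue this commit refers to. A sketch of a correct combine over a simplified stand-in type (not the actual oneflow::Shape):

```cpp
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

struct Shape {  // simplified stand-in: just a list of dimensions
  std::vector<int64_t> dims;
};

namespace std {
template <>
struct hash<Shape> {
  size_t operator()(const Shape& shape) const {
    size_t seed = shape.dims.size();
    for (int64_t d : shape.dims) {
      // boost-style hash_combine: every dimension contributes to the result
      seed ^= std::hash<int64_t>()(d) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    }
    return seed;
  }
};
}  // namespace std
```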

* +REGISTER_INDEPENDENT_THREAD_NUM for print task_node

* unlock critical sections as much as possible (#1994)

* Bugfix actor case (#1995)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* Bugfix actor case (#1996)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* small regst_num for reentrant_lock (#1997)

* fmt dev_job_set(#1999)

* double buffer for tick_op

* tick is cpu op

* speedup compile time (#2000)

* only merge mem_block_id between user job (#1993)

* Fix keep header only (#2001)

* speedup compile time

* fix keep header only

* remove shared model (#2003)

* remove blob_mem_sharing (#2005)

* No copyhd for output (#2006)

* no cpu tick

* no copyhd for output_op/switch_output_op

* remove temp comments

* rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo

* remove clone_id (#2007)

* layer norm auto var (#2004)

* layer norm auto var

* make of_format

* bn sbp (#2008)

* Refactor job completer (#1998)

* fmt

* refactor GenerateOpConf4Trainning

* more refactor

* refactor SetCtrlInOpName4VariableOp

* use uniq ptr

* refactor RewriteBoxingWithAllReduce

* refactor MakeAllReduceSequence

* refactor auto_mixed_precision

* refactor DumpLogicalBlobDescAndSbpSignature

* refactor group_boxing_by_dst_parallel

* refactor add_keep_header_only_op_conf

* refactor AutoSourceTick

* refactor AddTickForTimeShape

* refactor AutoSinkTick

* refactor AddGlobalOutputCriticalSections

* refactor SetOpTimeShape7BatchDimLbis

* fix a bug in IsInterfaceTask (#2009)

* Bugfix is interface task (#2010)

* fix a bug in IsInterfaceTask

* IsOutputInterfaceTask

* copyhd-free output_op task_node

* Dev job set config util (#2011)

* add more if in JobConfigProtoBuilder

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* remove total batch num in config util

* remove clone_id

* assert has train_conf

* rm debug info

* Dev job set bert (#2013)

* support bert

* mv into bert

* manual format

* fix adam (#2015)

* fix adam

* div batch instance num before update model

* remove outdate code in oneflow.cpp (#2017)

* Dev split like (#2016)

* no total_instance_num

* add auto grad for concat

* check in impl

* check in bug fixes

* fix bugs for split_like

* split_like_op.cpp format

* add normalization_autovar

* Update op_conf.proto

* address reviews

* fix typo

* constant ref

* rm forward_loss_instance_num (#2018)

* Bugfix job set multi device (#2019)

* sbp for tick input bn

* interface_blob_conf for output_op/switch_output_op

* set sbp conf for tuple identity op

* fix bugs when merge main plan

* delete useless code

* address review

* fix error use of GenRepeatedBn()

* ForEachConnectedComponent is easily misused

* 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil

* only for return output_op

* refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name

* return op instead of output op acts as part of user job

* enable_all_reduce_group

* bugfix: init RuntimeBuffersScope before Runtime

* demo python scripts for enable_all_reduce_group

* remove wrong optimization code

* constant_conf for enable_all_reduce_group.py test

* fix interface op parallel conf

* fix reduce concat kernel (#2020)

* binary program oneflow_worker

* user_job_completer

* remove unused code loss_print

* rm unused code loss_acc

* remove unused accuracy_acc and accuracy_print

* remove input_diff/output_diff/model_diff bns

* remove unused bns in gdb util

* replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns

* support mpi using style

* Bugfix put job conf into plan (#2023)

* put job_conf into plan

* using job_name judge isPullJob/isPushJob

* fix wrong job_id error

* model_init is a push job; model_save is a pull job

* make cmake more reasonable (#2024)

* Restructure python module and minimum setup.py (#2026)

* check in updated paths

* check in minimum setup tool

* Dev python init multi unit (#2022)

* init multi-unit by send oneflow_worker binary and ConfigProto to worker machine

* refine var name

* refine code

* compile user/main job only on master

* bert multi machine test code

* fix bugs

* JobConfs

* fix bugs under WITH_RDMA

* fix multi-machine bugs

* delete useless code

* Add xla reduce_sum op

* fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)

* feat: init_worker can run without scp'ing the binary and without using uuid (#2029)

* half impl of without scp bin

* feat: init_worker can run without scp'ing the binary and without using uuid

* check in fixes (#2030)

* fixbug of delete worker (#2033)

* Dev dot plan (#2035)

* reuse plan to dot file

* refine plan dot

* Check in bug fix and multi node script (#2032)

* check in fixes

* check in script

* fix boxing bug when setting conf with sbp

* flag for iter

* fixbug of delete worker

* fix delete worker in script

* address review, add exclusive or check

* reuse plan to dot file

* refine plan dot

* fix and add flags

* fmt

* rm debug output

* more flags

* check Activation

* fix fc bug when num axes > 2

* reverse change

* fix next_batch_num (#2036)

* upgrade nccl to 2.4.8 (#2037)

* fix shape of fc in_diff (#2038)

* Rewrite model update op to optimizer graph

* Update oneflow.cmake (#2041)

* better looking merged_plan to dot v1 (#2039)

* better looking and more information in merged_plan.dot

* refine color

* Fix tick in multi node parallel (#2042) (#2047)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* Dev train conf builder (#2046)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* check in impl

* fix data dir (#2054)

* fix data dir

* rm model load path

* AssignOp (#2058)

* AssignOp

* remove useless code

* Python ops gather and unit test (#2053)

* python_ops gather and unit test

* format

* minor mod

* SnapshotOp (#2060)

* magical add and fix bug (#2061)

* check in impl

* add todo

* Dev jxf python pooling (#2056)

* run max_pool_2d without bug

* correct max_pool_2d

* correct average_pool_2d

* minor refine

* final version

* rename to nn.py

* add name arg to pool1d ops

* refine by review

* rename to _GetSequence and move it to the end of file (#2063)

* fix BindInterfaceMemBlockId (#2065)

* mark py file generated (#2066)

* Dev gracious exit (#2057)

* add more checks

* make language more consistant

* better error info for worker init

* better error

* Update setup.py (#2068)

* Refine Infer APIs by return Maybe<void> type (#2051)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk
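The point of returning Maybe<void> from the Infer APIs is that a failed check becomes a returned error rather than a CHECK-abort, and callers unwrap it with a JUST/CHECK_JUST-style macro (both appear in later commits). A greatly simplified sketch of the pattern; the real oneflow Maybe carries a structured ErrorProto and richer macros:

```cpp
#include <string>
#include <utility>

class MaybeVoid {  // simplified: either OK or an error message
 public:
  static MaybeVoid Ok() { return MaybeVoid(""); }
  static MaybeVoid Error(std::string msg) { return MaybeVoid(std::move(msg)); }
  bool ok() const { return msg_.empty(); }
  const std::string& error() const { return msg_; }

 private:
  explicit MaybeVoid(std::string msg) : msg_(std::move(msg)) {}
  std::string msg_;
};

// JUST-style unwrapping: propagate the error to the caller instead of aborting.
#define JUST(expr)                        \
  do {                                    \
    MaybeVoid maybe = (expr);             \
    if (!maybe.ok()) { return maybe; }    \
  } while (0)

MaybeVoid InferBlobDescs(int num_axes) {
  if (num_axes <= 0) { return MaybeVoid::Error("num_axes must be positive"); }
  return MaybeVoid::Ok();
}

MaybeVoid InferAll() {
  JUST(InferBlobDescs(4));
  return MaybeVoid::Ok();
}
```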

* fix bug for split like op (#2070)

* fix snapshot path (#2071)

* Dev job set fix infer apis (#2072)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk

* update

* add AutoGlobalStep (#2073)

* rm default_initializer_conf in train conf (#2075)

* Fix sigmoid op (#2076)

* fix sigmoid op bug

* fix bug for split like op

* add sigmoid grad op

* Fix bn (#2077)

* fix bn

* return Maybe<void> OK in lambda

* fix typo

* fix SigmoidGradOp (#2078)

* Dev python merge job set (#2081)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix gcc warning in release (#2080)

* fix gcc version in release

* fix empty line

* Fix adam mv initilizer (#2082)

* zero constant initializer for adam m and v

* make of_format

* init adam m v beta1_t and beta2_t

* use value instead of initializer

* const float& -> const float

* update

* LearningRateScheduleOp (#2079)

* matmul (#2084)

* matmul

* np.allclose

* Fix hang bugs

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape

* refine code for read

* check py if and test

* prelu (#2086)

* prelu

* fix

* fix

* template for either ptr cast (#2088)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* add template for cast

* rename

* Dev build and infer ctx (#2089)

* add job_build_and_infer_ctx interface

* lbn_with_split_hint

* fix maybe macro

* fix signature of Maybe<T>::Error()

* job_build_and_infer_if

* add c_api_util wrapper for job_build_and_infer_ctx

* implement python/job_build_and_infer interface

* CurJobBuildAndInferCtx_AddPlacementGroup

* BuildJobAndInferCtx and Mgr c++ implementation (#2074)

* job_build_and_infer_ctx_mgr

* refine interface of infer_ctx_mgr

* JobBuildInferCtx set job conf; add and refine error type

* revert job.proto

* half impl of add op in build_infer_ctx

* generate op produced empty logical blob desc ; infer out blob desc interface

* job_build_and_infer_ctx VERSION 1

* add InferOutBlobDesc for conv op; remove record_piece_size in interface op

* maybe return

* job_set hold by job_build_and_infer_ctx_mgr

* check placement when infer ctx mgr leave cur job

* Global New/Delete JobBuildAndInferCtxMgr

* add JUST when ctx add op

* remove unused job_conf.arg_op_name

* fix bugs caused by python new api

* fix bugs caused by lack of Global<JobDesc>

* fix bugs caused by new api

* refactor compiler.Compile

* merge dev_python

* remove unused message proto

* rename api

* Fix input whose body is disabled in xla launch kernel

* add RemoteBlob.shape and RemoteBlob.dtype

* Fix data type set default variable (#2092)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix default data type

* Add conf axis for bias_add for any axis channel (#2093)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* bias_add completion

* follow comment

* make conf axis required

* Dev jxf python initializer (#2090)

* oneflow initializer

* update

* Fix self control in

* Bugfix python alexnet (#2096)

* bugfix_python_alexnet

* fix

* Add fake consume op

* Dev global step (#2100)

* assign op


AddGlobalStepOpConf


fix


ARITHMETIC_DATA_TYPE_SEQ


identity_op_conf


add ops


GenNewSnapshotName


SnapshotOp


cleanup


blob name


LearningRateScheduleOp


LearningRateScheduleKernel


LearningRateScheduleKernel


AddLearningRateScheduleOpConf


learning rate


cleanup


fix


fix

* remove total_mbn_num

* date time format

* save

* refine

* refine

* revert

* refine snapshot

* fix

* refine

* AutoGlobalStep

* refine

* GenLogicalBlobName

* AutoLearningRate

* remove JobDesc lr

* fix snapshot path

* Maybe<void>

* learning_rate blob

* remove next_model_vid


fix


fix 


fix


learning_rate

* train_conf

* fix for global step on multi nodes

* Fix optimizer initializer (#2095)

* fix optimizer initializer

* rename lars data temp bn

* fix job_type (#2102)

* Dev alexnet new api (#2094)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* check in softmax loss

* nn.conv2d and nn.bias_add

* fix opname

* fix merge conflict

* fix name

* dense (#2097)

* Fix jxf dense v2 (#2098)

* dense

* minor fix

* alexnet

* fix conf

* quick fix

* transpose

* fix layers

* add transpose

* fix fc

* fix

* fix

* fix data laod

* params check and format

* rm activation in op conf

* save workaround

* fix avg pool 2d

* fix max pool 2d

* remove fc3 relu

* alexnet eval

* minor

* replace has_batch_dim with batch_axis (#2104)

* replace has_batch_dim with batch_axis

* refactor OrderValue4HasBatchAxis

* fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp

* no CHECK in MatmulOp::InferBatchAxis

* infer op by op_conf and  parallel_conf

* wrapper Error for ErrorProto

* replace ErrorUtil with Error

* add OF_CHECK (#2110)

* optional split_axis (#2113)

* Fix HasAttr bug for optional field

* undefined (#2116)

* merge reduce xxx (#2119)

* Update GetSbpSig() with Maybe (#2118)

* fix several ops

* modify all ops

* format

* update complete

* Refine AdamOptimizer

* fix (#2120)

* Fix xla AdamOptimizer bugs

* support scalar for reduce_xxx axis args (#2122)

* Dev opt split axis (#2121)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* fix autovar split_axis (#2125)

* Dev model init op (#2117)

* assign op


AddGlobalStepOpConf


fix


ARITHMETIC_DATA_TYPE_SEQ


identity_op_conf


add ops


GenNewSnapshotName


SnapshotOp


cleanup


blob name


LearningRateScheduleOp


LearningRateScheduleKernel


LearningRateScheduleKernel


AddLearningRateScheduleOpConf


learning rate


cleanup


fix


fix

* remove total_mbn_num

* date time format

* save

* refine

* refine

* revert

* refine snapshot

* fix

* refine

* AutoGlobalStep

* refine

* GenLogicalBlobName

* AutoLearningRate

* remove JobDesc lr

* fix snapshot path

* Maybe<void>

* learning_rate blob

* remove next_model_vid


fix


fix 


fix


learning_rate

* train_conf

* fix for global step on multi nodes

* SnapshotReader


snapshot writer


model init op


fix


refine


init


InitializeFromSnapshotConf


model io job


ModelLoadOp


ModelLoadKernel


MakeModelLoadJob


ModelSaveOp


fix


InterUserJobInfo


_MakeModelLoadJobFunc


MutModelLoadOpConTickInputHelper


fix


refine


init/load/save


set_default_variable

* remove SnapshotMgr

* snapshot.h

* delete model_init_job.cpp


foreign_input_op_conf


fix


snapshot path


set path


op_conf


fix


fix CopyFromNdarray


to bytes c


use uint8


char2uint8

* model init

* model io

* fix

* ModelSaveKernel

* mutable_batch_axis()->Clear()

* InferBatchAxis

* fix

* refine

* job set

* MakeModelIoJobs

* fix

* jobs

* fix

* model io job

* GenOutputOpConf

* refine snapshot

* refine

* fix

* refine CheckPoint

* remove session

* refine

* refine

* refine

* remove keyword.h/cpp

* refine

* global_step=>train_step

* GetSbpSignatures

* ModelInitOp

* fix (#2127)

* rm stale alexnet script (#2129)

* Dev plain maybe (#2126)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* Dev simple checkpoint manager (#2128)

* SimpleCheckPointManager

* makedirs

* fix path

* save

* refine

* refine

* fix path to numpy (#2130)

* Dev plain maybe (#2132)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()

* refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>

* Dev jxf merge general ops (#2131)

* merge some general ops to dev_python

* dense demo

* rm print in test

* new line at the end of file

* format

* fix check point

* update alexnet

* broadcast_xxx (#2134)

* broadcast_xxx

* typo

* typo

* rm job_conf.num_of_batches_in_snapshot

* fix args (#2136)

* fix proto if (#2138)

* pass name to inner function (#2139)

* check dropout if (#2140)

* check dropout if

* fix typo

* Dev merge math ops (#2143)

* merge math ops

* new line at the end of file

* merge layer norm (#2144)

* variable_scope (#2141)

* variable_scope

* revert format

* add check

* Merge dropout if (#2145)

* check dropout if

* fix typo

* fix typo

* slice (#2142)

* slice

* add check and docstring

* minor

* minor

* add const (#2146)

* add const

* fix indentation

* address review

* fmt

* rm redundant

* Update array_ops.py

* Update array_ops.py

* Update array_ops.py

* add more activations to math_ops (#2147)

* fix bug (#2149)

* truncated normal for bert (#2150)

* Update bert for dev python (#2151)

* truncated normal for bert

* bert support

* math.dropout to nn.dropout (#2153)

* refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto

* allow export multiple interfaces in oneflow_export decorator (#2154)

* refactor job_build_and_infer_if.h

* update oneflow_internal.h to use Maybe (#2135)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

*  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)

*  Transfer data_part_num to DecodeOp and RecordLoadOp

* Fix python scripts

* Dev nc of internal (#2155)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

* fix: fix ctor bug

* fix config_proto

* rename c_api_util.Init => c_api_util.InitEnvironment

* refactor compile_context.cur_job => compile_context.cur_job_conf

* remove FixPackedBlobDescOfProducedRegst (#2156)

* Fix snapshot root path empty log (#2158)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* Fix snapshot root path empty log

* fix channel last (#2157)

* fix channel last

* minor

* merge pb_message

* add cudnn conv force algo (#2159)

* Update bert for dev python (#2160)

* remove old bert

* set data_part_num in decoder

* support model load/save args

* Dev flow function (#2152)

* add of.function, refactor init, refine session, and refine runtime

* rm useless code

* rename

* update

* add test

* @oneflow_export JobConfigProto and Trainconf (#2162)

* @oneflow_export JobConfigProto and Trainconf

* remove unused config in config_util.py

* remove oneflow.get_cur_job_conf_builder

* bugfix: bias_add op and reduce_sum op infer sbp, and implementation of bias_add kernel (#2161)

* 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf

* fix config.train.model_update_conf

* _GetJobConfAttr

* update alexnet (#2166)

* Update alexnet (#2167)

* update alexnet

* update for bert

* 15->16

* more reasonable conf

* get variable in py layer norm

* replace val in pb msg;  decode lbn string with split hint (#2165)

* bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)

* Add meta data in HLO instruction, and refine

* python model parallel (#2103)

* decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op

* merge placement group

* refine code in AddAndInferOp

* auto merge placement group when add op; remove mergeplacementgroup interface

* infer sbp parallel when add op; impl Get/Has split axis in infer_ctx

* python blob add interface for model parallel

* refine code of python blob split

* remove interface of has/get_split_axis in python blob

* remove interface of has_batch_dim in python blob

* add check blob split_axis can be divide by parallel num

* refine code for maybe get/infer sbp

* fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc etc.

* fix for plain point maybe

* fix bug: add repeated placement group, remove add placement interface in hand

* fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel

* dev_python model parallel runnable and check correct

* remove add placement group when placment scope exit

* 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel

* bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done

* refine python blob_desc.split implement

* refine interface decode lbn to split hint

* refine auto add placement group

* refine lbn with split hint decode

* refine code for review

* remove AutoVar related code (#2168)

* feat: remove all autovar

* fix and format

* fix: fix op::InferBlobDesc

* add prototype (#2172)

* add prototype

* infer blob desc with sbp_signature

* `str_a is not str_b' is buggy, use `str_a != str_b' instead

* Update snapshot.cpp (#2174)

* remove useless lines (#2176)

* Fix bert multi nodes (#2177)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* CHECK_JUST for InferBlobDescsIf (#2178)

* Fix bert multi nodes (#2180)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* config_proto -> default_config_proto

* delete worker

* update alexnet

* remove unused op (#2182)

* remove parallel_ctx when kernel init (#2185)

* InferOpSbpSignature in op_graph and infer_ctx (#2175)

* InferOpSbpSignature in op_graph and infer_ctx

* bugfix: lambda lifetime; generated job build errors add location info

* refine error generation and return

* refine check that lbi is valid and exists

* remove parallel num in decode_of_record op/kernel (#2186)

* Fix bugs

* delete GlobalJobDesc() in operator/ (#2188)

* rm unused test file

* Refine

* Add assign ops behind adam optimizer to update model and momentum etc.

* Remove fake consume op

* Support enabling/disabling XLA by setting an env variable

* Merge callback, limit max operator count for each XLA subgraph

* CudaEventPool
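The CudaEventPool presumably reuses cudaEvent_t objects so the runtime does not pay cudaEventCreate/cudaEventDestroy on every use. A generic sketch of such a pool using the standard CUDA runtime API (not necessarily the actual implementation):

```cpp
#include <cuda_runtime.h>
#include <mutex>
#include <vector>

class CudaEventPool {
 public:
  cudaEvent_t Get() {
    std::lock_guard<std::mutex> lock(mutex_);
    if (!free_events_.empty()) {
      cudaEvent_t e = free_events_.back();
      free_events_.pop_back();
      return e;
    }
    cudaEvent_t e;
    cudaEventCreateWithFlags(&e, cudaEventDisableTiming);  // timing not needed
    return e;
  }

  void Put(cudaEvent_t e) {  // return an event for later reuse
    std::lock_guard<std::mutex> lock(mutex_);
    free_events_.push_back(e);
  }

  ~CudaEventPool() {
    for (cudaEvent_t e : free_events_) { cudaEventDestroy(e); }
  }

 private:
  std::mutex mutex_;
  std::vector<cudaEvent_t> free_events_;
};
```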

* fix vector

* refine

* Support in-place update for optimizer

* Add alias input and output to prevent reusing input with other temp buffers

* Refine code style

* Remove unused code

* Of xla (#2237)

* mv deprecated.pb_util to lib.core.pb_util

* add op get_variable and get_variable test (#1975)

* add op get_variable and get_variable test

* modify shape extend

* AllReduceSequencePass (#1976)

* python2 compatibility for check_point

* fix "return (blob_a, blob_b)" bug

* rename: arg_passing => arg_pass

* shared regst blob header between jobs (#1919)

* half impl

* register manager handle memory shared for separated memory

* set separated memory shared id for shared regst between jobs

* half impl of python for blob

* fix BUG of pod ToProto() when proto has inited

* fix BUG of infer dim0_inner_shape() in foreign_input_op

* 1. PushJob copy from python can infer dim0_valid_num

* add test for dynamic relu

* refine test file

* refine code

* refine note

* update test file for new interface

* rename separated_header* (#1979)

* some bugs fixes for a train&eval job (#1978)

* debugging alex net

* check in test pull_multiple_blob.py

* strcter check

* fix bias in conv

* fix various bugs

* rm file

* op_name in different jobs can be overloaded

* fix compile bug in job_set_compile_ctx

* rm cmake code for building oneflow binary

* check in script (#1980)

* check in script

* rm used import

* CudaCurrentDeviceGuard (#1977)

* fix val (#1981)

* Merge job set and split fw bw (#1982)

* add MemoryCopier and TensorSliceCopier (#1901)

* add MemoryCopier and TensorSliceCopier

* Index=>NdIndex

* refine

* refine

* fix addition error checking (#1911)

* Merge dev_mixed_precision into dev_split_fw_bw (#1904)

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* Merge dev_mixed_precision: Part-2 (#1907)

* feat: add NewKernelUtil

* fix typos

* feat: add cublas_tensor_op_math_handle()

* add gemm (#1860)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* feat: NewKernelUtil -> NewKernelUtil<DeviceType>

* feat: update FullyConnectedKernel to use NewKernelUtil

* Dev sx mixed precision (#1861)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* save cpp

* save

* add relu and relu_backward

* remove spared space

* add explicit declaration

* rename

* feat: update ConvKernel to support half

* add sigmoid and tanh (#1867)

* add axpy (#1866)

* style: formatting

* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>

* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle

* refine(new_kernel_util.h)

* refine(new_kernel_util.cu)

* feat(new_kernel_util): add OFBatchedGemm()

* feat: update MatMulKernel to support half

* feat: update ConvData/Bias/FilterGradKernel to support half

* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out

* feat: support loss scale

* fix(operator): :bug:add InferHasBatchDim()

* feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu()

* refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float

* style(kernel/cast_kernel.cpp): formatting

* fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()

* style(cast_kernel.cpp): formatting

* feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil

* refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil

* feat(dropout_kernel): :sparkles:update DropoutKernel to support half

* refactor(dropout_kernel): remove backward funcs

* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support

* fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)

* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()

* fix: fix little bugs

* fix(conv_data/filter_grad_op): min byte size of buf blob is 1

* feat: support half for bias_add_kernel

* fix(bias_add_op): remove data type check

* feat(relu_kernel): support half

* refactor: add ADD_GPU_HALF_KERNEL_CREATOR

* fix: typos

* feat(pooling_kernel): support half

* fix: remove CHECK_EQ of default data type

* feat(pooling_grad_kernel): support half

* feat: support half in ofrecord_encoder (TODO)

* fix

* feat: support half in sparse_cross_entropy_kernel

* debug grad op (#1883)

* Dev debug op mixed precision (#1884)

* debug grad op

* do nothing instead of UNIMPLEMENTED

* fix(dropout_kernel): add tmp_split_fw_bw condition

* build(half.cmake): https->http

* fix(record_load_kernel): support total_batch_num

* fix pooling (#1885)

* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()

* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()

* fix: add GetCudnnScalingParameters() to fix scaling params

* fix: add enable_true_half_config_when_conf() into config and update related code

* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization

* refactor(matmul_kernel): remove Backward()

* feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()

* feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat()

* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr

* refactor(new_kernel_util.cu): remove static of func in anonymous namespace

* feat(job_conf.proto): add enable_auto_mixed_precision field

* feat(auto_mixed_precision_lists): add amp_lists

* feat(auto_mixed_precision): build the skeleton

* feat(auto_mixed_precision): almost finish amp graph pass

* feat(auto_mixed_precision.cpp): complte InsertCastOp()

* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG

* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)

* refine(auto_mixed_precision.cpp): refine LOG

* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()

* Dev half ndarray (#1886)

* debug grad op

* ZeroVal => GetZeroVal; OneVal => GetOneVal

* MaxVal => GetMaxVal; MinVal => GetMinVal

* check data type

* DevDType

* move function template to struct template for BinaryFunc* and UnaryFunc*

* support half for reduce_sum_kernel

* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr

* half for NdarrayUtil

* OF_DEVICE_FUNC is always inline

* half for NdarrayApplyUnaray

* simplify usage of NdarrayUtil

* UnaryFuncExp

* add VarNdarrayBuilder and ValNdarrayBuilder

* simplify NdarrayUtil in layer_norm_param_grad_kernel

* InplaceBroadcast

* remove SoftmaxKernelUtil

* half for softmax_kernel

* fix improper use of __CUDA_ARCH__

* disable sm_30,sm_52

* refine(conv_kernel.cu): fix typo

* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix: fix typos of GetOneVal

* fix(auto_mixed_precision.cpp): allocate for shared_ptr

* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding

* fix(auto_mixed_precision.cpp): fix typo

* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge

* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()

* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>

* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp

* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs

* feat(auto_mixed_precision.cpp): more logs

* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal

* fix(bias_add_op.cpp): fix bias_multiplier shape

* feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half

* feat: update MatmulKernel and new_kernel_util to support half

* refactor(auto_mixed_precision): add ClearList and refine code

* feat(tanh_*_kernel): support half

* feat(add_kernel): support half

* update binary_func.h

* udpate

* update ndarray

* update

* update

* update

* udpate

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF

* style(CMakeLists.txt): fix typo

* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix(auto_mixed_precision.cpp): group inserted cast op by lbn

* fix get one ptr (#1913)

* fix(layer_norm): add LayerNormOp to grey_list and support the half

* fix(layer_norm about): fix it to run when amp

* fix: move fix sbp signature from OpNode to OpGraph

* Dev new kernel util (#1925)

* refactor(kernel/util): refactor NewKernelUtil and add DnnIf

* refactor(kernel/util): add BlasIf

* refactor(kernel/util): add ArithemeticIf

* refactor(kernel/util): add cuda_kernel_util.*

* refactor: refactor NewKernelUtil

* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including

* refactor(new_kernel_util.h): remove unused header files

* refactor: refactor loop include

* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA

* CHECK cuda version > 10.0 when use auto_mixed_presion

* Fix bug of Snapshot delete file Unwanted (#1937)

* fix link BUG of release version (#1938)

* delete redundant code in OpGraph JobCompleter and Operator (#1927)

* 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot deleting files unwantedly  3. refine README

* revert README change

* split 2 pull request

* Refactor Kernel Registry V2: The clear & easy Way (#1941)

* refactor(resource.proto): move DeviceType to common/device_type.proto

* feat(kernel_registration): add kernel_registration.h/cpp

* feat(kernel_registration): update matmul_kernel to support new registration

* feat: add CreateKernel for new registry

* feat: update registry of cast conf

* refactor(kernel_registration): remove KernelRegMap

* fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)

* grpc SetMaxMessageSize(INT_MAX) (#1950)

* fix bug of Graph::ForEachConnectedComponent (#1952)

* Grpc set max size (#1953)

* grpc SetMaxMessageSize(INT_MAX)

* set max msg len for ctrl service

* code for test grpc max msg size

* remove test code

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* AllReduceSequencePass (#1976)

* Merge job set and split fw bw (#1983)

* add MemoryCopier and TensorSliceCopier (#1901)

* add MemoryCopier and TensorSliceCopier

* Index=>NdIndex

* refine

* refine

* fix addition error checking (#1911)

* Merge dev_mixed_precision into dev_split_fw_bw (#1904)

* update binary_func.h

* update

* update ndarray

* update

* update

* update

* update

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* Merge dev_mixed_precision: Part-2 (#1907)

* feat: add NewKernelUtil

* fix typos

* feat: add cublas_tensor_op_math_handle()

* add gemm (#1860)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* feat: NewKernelUtil -> NewKernelUtil<DeviceType>

* feat: update FullyConnectedKernel to use NewKernelUtil

* Dev sx mixed precision (#1861)

* add gemm

* save

* add blobgemm

* update

* update

* fix cu

* update cpp

* save cpp

* save

* add relu and relu_backward

* remove spared space

* add explicit declaration

* rename

* feat: update ConvKernel to support half

* add sigmoid and tanh (#1867)

* add axpy (#1866)

* style: formatting

* refactor(new_kernel_util): unify Hgemm with cublas_gemm<T>

* fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle

* refine(new_kernel_util.h)

* refine(new_kernel_util.cu)

* feat(new_kernel_util): add OFBatchedGemm()

* feat: update MatMulKernel to support half

* feat: update ConvData/Bias/FilterGradKernel to support half

* refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out

* feat: support loss scale

* fix(operator): :bug:add InferHasBatchDim()

* feat(kernel_util.cu): support f2h and h2f in CopyElemOnGpu()

* refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float

* style(kernel/cast_kernel.cpp): formatting

* fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle()

* style(cast_kernel.cpp): formatting

* feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil

* refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil

* feat(dropout_kernel): :sparkles:update DropoutKernel to support half

* refactor(dropout_kernel): remove backward funcs

* refactor(dropout_kernel.cu): use std::enable_if instead of template partial specialization to support

* fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple)

* fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_op): add InferHasBatchDim() and GetSbpSigs()

* fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs()

* fix: fix little bugs

* fix(conv_data/filter_grad_op): min byte size of buf blob is 1

* feat: support half for bias_add_kernel

* fix(bias_add_op): remove data type check

* feat(relu_kernel): support half

* refactor: add ADD_GPU_HALF_KERNEL_CREATOR

* fix: typos

* feat(pooling_kernel): support half

* fix: remove CHECK_EQ of default data type

* feat(pooling_grad_kernel): support half

* feat: support half in ofrecord_encoder (TODO)

* fix

* feat: support half in sparse_cross_entropy_kernel

* debug grad op (#1883)

* Dev debug op mixed precision (#1884)

* debug grad op

* do nothing instead of UNIMPLEMENTED

* fix(dropout_kernel): add tmp_split_fw_bw condition

* build(half.cmake): https->http

* fix(record_load_kernel): support total_batch_num

* fix pooling (#1885)

* fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs()

* fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs()

* fix: add GetCudnnScalingParameters() to fix scaling params

* fix: add enable_true_half_config_when_conf() into config and update related code

* feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization

* refactor(matmul_kernel): remove Backward()

* feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx()

* feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat()

* refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr

* refactor(new_kernel_util.cu): remove static of func in anonymous namespace

* feat(job_conf.proto): add enable_auto_mixed_precision field

* feat(auto_mixed_precision_lists): add amp_lists

* feat(auto_mixed_precision): build the skeleton

* feat(auto_mixed_precision): almost finish amp graph pass

* feat(auto_mixed_precision.cpp): complete InsertCastOp()

* refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG

* perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO)

* refine(auto_mixed_precision.cpp): refine LOG

* feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes()

* Dev half ndarray (#1886)

* debug grad op

* ZeroVal => GetZeroVal; OneVal => GetOneVal

* MaxVal => GetMaxVal; MinVal => GetMinVal

* check data type

* DevDType

* move function template to struct template for BinaryFunc* and UnaryFunc*

* support half for reduce_sum_kernel

* ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr

* half for NdarrayUtil

* OF_DEVICE_FUNC is always inline

* half for NdarrayApplyUnaray

* simplify usage of NdarrayUtil

* UnaryFuncExp

* add VarNdarrayBuilder and ValNdarrayBuilder

* simplify NdarrayUtil in layer_norm_param_grad_kernel

* InplaceBroadcast

* remove SoftmaxKernelUtil

* half for softmax_kernel

* fix improper use of __CUDA_ARCH__

* disable sm_30,sm_52

* refine(conv_kernel.cu): fix typo

* fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix: fix typos of GetOneVal

* fix(auto_mixed_precision.cpp): allocate for shared_ptr

* refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding

* fix(auto_mixed_precision.cpp): fix typo

* fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge

* style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet()

* style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...>

* feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp

* feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs

* feat(auto_mixed_precision.cpp): more logs

* refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal

* fix(bias_add_op.cpp): fix bias_multiplier shape

* feat(gather_xxx): update gather, gather_grad, gather_kernel_util to support half

* feat: update MatmulKernel and new_kernel_util to support half

* refactor(auto_mixed_precision): add ClearList and refine code

* feat(tanh_*_kernel): support half

* feat(add_kernel): support half

* update binary_func.h

* update

* update ndarray

* update

* update

* update

* update

* refactor(data_type.h): better representation

* fix(unary_func.h): fix typo

* style(data_type.h): format

* refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF

* style(CMakeLists.txt): fix typo

* fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr

* fix(auto_mixed_precision.cpp): group inserted cast op by lbn

* fix get one ptr (#1913)

* fix(layer_norm): add LayerNormOp to grey_list and support the half

* fix(layer_norm about): fix it to run when amp

* fix: move fix sbp signature from OpNode to OpGraph

* Dev new kernel util (#1925)

* refactor(kernel/util): refactor NewKernelUtil and add DnnIf

* refactor(kernel/util): add BlasIf

* refactor(kernel/util): add ArithemeticIf

* refactor(kernel/util): add cuda_kernel_util.*

* refactor: refactor NewKernelUtil

* refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid circular includes

* refactor(new_kernel_util.h): remove unused header files

* refactor: refactor loop include

* feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936)

* not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA

* CHECK cuda version > 10.0 when using auto_mixed_precision

* Fix bug of Snapshot deleting files unwantedly (#1937)

* fix link BUG of release version (#1938)

* delete redundant code in OpGraph JobCompleter and Operator (#1927)

* 1. delete redundant code in OpGraph JobCompleter and Operator  2. fix bug of Snapshot deleting files unwantedly  3. refine README

* revert README change

* split 2 pull request

* Refactor Kernel Registry V2: The clear & easy Way (#1941)

* refactor(resource.proto): move DeviceType to common/device_type.proto

* feat(kernel_registration): add kernel_registration.h/cpp

* feat(kernel_registration): update matmul_kernel to support new registration

* feat: add CreateKernel for new registry

* feat: update registry of cast conf

* refactor(kernel_registration): remove KernelRegMap

* fix bug of op_graph in SplitLogicalInputBlobDesc (#1949)

* grpc SetMaxMessageSize(INT_MAX) (#1950)

* fix bug of Graph::ForEachConnectedComponent (#1952)

* Grpc set max size (#1953)

* grpc SetMaxMessageSize(INT_MAX)

* set max msg len for ctrl service

* code for test grpc max msg size

* remove test code

* NumaAwareCudaMallocHost (#1959)

* NumaAwareCudaMallocHost

* add conf

* AllReduceSequencePass (#1976)

* CudaCurrentDeviceGuard (#1977)

* delete tmp_split_fw_bw_train_conf (#1985)

* delete tmp_split_fw_bw_train_conf

* delete useless comments

* fix refactor bug in layer_norm_op

* minor fixes

* update py script

* remove code could be misleading

* Fix all reduce mem sharing (#1986)

* fix all reduce mem sharing

* ByteSizeOfDataContentField=>ByteSizeOfBlobBody

* remove obsolete task_graph optimization

* no arg_pass_job for variable_op

* merge memory block id between jobs (#1910)

* refine MemBlock and CriticalSection

* job memory sharing strategy

* revert diff in CriticalSectionDesc

* Merge memory block between sub plans

* Get mutual exclusion job groups

* forget to consider memory merge only in same machine

* memory zone unique id

* Merge Done;  merge memory block id from right to left; get memory block ids info

* revert MemBlock

* generate mutual exclusion job groups Done.

* update for proto

* add JobMemSharingStrategy in python interface

* remove memorycase hash

* move JobMemSharingStrategy to JobSetProto

* using default strategy = parallel priority strategy

* update interface of flow.job_mem_sharing_strategy

* InterJobMemSharingUtil and PlanUtil

* revert oneflow.h

* fix bug

* New implement of Merge memory block id between jobs

* refine code

* fix a fatal bug in std::hash<oneflow::Shape>

* +REGISTER_INDEPENDENT_THREAD_NUM for print task_node

* unlock critical sections as much as possible (#1994)

* Bugfix actor case (#1995)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* Bugfix actor case (#1996)

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* refine code

* small regst_num for reentrant_lock (#1997)

* fmt dev_job_set(#1999)

* double buffer for tick_op

* tick is cpu op

* speedup compile time (#2000)

* only merge mem_block_id between user job (#1993)

* Fix keep header only (#2001)

* speedup compile time

* fix keep header only

* remove shared model (#2003)

* remove blob_mem_sharing (#2005)

* No copyhd for output (#2006)

* no cpu tick

* no copyhd for output_op/switch_output_op

* remove temp comments

* rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo

* remove clone_id (#2007)

* layer norm auto var (#2004)

* layer norm auto var

* make of_format

* bn sbp (#2008)

* Refactor job completer (#1998)

* fmt

* refactor GenerateOpConf4Trainning

* more refactor

* refactor SetCtrlInOpName4VariableOp

* use uniq ptr

* refactor RewriteBoxingWithAllReduce

* refactor MakeAllReduceSequence

* refactor auto_mixed_precision

* refactor DumpLogicalBlobDescAndSbpSignature

* refactor group_boxing_by_dst_parallel

* refactor add_keep_header_only_op_conf

* refactor AutoSourceTick

* refactor AddTickForTimeShape

* refactor AutoSinkTick

* refactor AddGlobalOutputCriticalSections

* refactor SetOpTimeShape7BatchDimLbis

* fix a bug in IsInterfaceTask (#2009)

* Bugfix is interface task (#2010)

* fix a bug in IsInterfaceTask

* IsOutputInterfaceTask

* copyhd-free output_op task_node

* Dev job set config util (#2011)

* add more if in JobConfigProtoBuilder

* unlock critical sections as much as possible

* consumed and produced regst of actor 'case' are customized

* remove total batch num in config util

* remove clone_id

* assert has train_conf

* rm debug info

* Dev job set bert (#2013)

* support bert

* mv into bert

* manual format

* fix adam (#2015)

* fix adam

* div batch instance num before update model

* remove outdate code in oneflow.cpp (#2017)

* Dev split like (#2016)

* no total_instance_num

* add auto grad for concat

* check in impl

* check in bug fixes

* fix bugs for split_like

* split_like_op.cpp format

* add normalization_autovar

* Update op_conf.proto

* address reviews

* fix typo

* constant ref

* rm forward_loss_instance_num (#2018)

* Bugfix job set multi device (#2019)

* sbp for tick input bn

* interface_blob_conf for output_op/switch_output_op

* set sbp conf for tuple identity op

* fix bugs when merge main plan

* delete useless code

* address review

* fix error use of GenRepeatedBn()

* ForEachConnectedComponent is easily misused

* 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil

* only for return output_op

* refactor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name

* return op instead of output op acts as part of user job

* enable_all_reduce_group

* bugfix: init RuntimeBuffersScope before Runtime

* demo python scripts for enable_all_reduce_group

* remove wrong optimization code

* constant_conf for enable_all_reduce_group.py test

* fix interface op parallel conf

* fix reduce concat kernel (#2020)

* binary program oneflow_worker

* user_job_completer

* remove unused code loss_print

* rm unused code loss_acc

* remove unused accuracy_acc and accuracy_print

* remove input_diff/output_diff/model_diff bns

* remove unused bns in gdb util

* replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns

* support mpi using style

* Bugfix put job conf into plan (#2023)

* put job_conf into plan

* use job_name to judge isPullJob/isPushJob

* fix wrong job_id error

* model_init is a push job; model_save is a pull job

* make cmake more reasonable (#2024)

* Restructure python module and minimum setup.py (#2026)

* check in updated paths

* check in minimum setup tool

* Dev python init multi unit (#2022)

* init multi-unit by sending oneflow_worker binary and ConfigProto to worker machine

* refine var name

* refine code

* compile user/main job only on master

* bert multi machine test code

* fix bugs

* JobConfs

* fix bugs under WITH_RDMA

* fix multi-machine bugs

* delete useless code

* Add xla reduce_sum op

* fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028)

* feat: init_worker can without scp binary and no use uuid (#2029)

* half impl of without scp bin

* feat: init_worker can without scp binary and no use uuid

* check in fixes (#2030)

* fixbug of delete worker (#2033)

* Dev dot plan (#2035)

* reuse plan to dot file

* refine plan dot

* Check in bug fix and multi node script (#2032)

* check in fixes

* check in script

* fix boxing bug when setting conf with sbp

* flag for iter

* fixbug of delete worker

* fix delete worker in script

* address review, add exclusive or check

* reuse plan to dot file

* refine plan dot

* fix and add flags

* fmt

* rm debug output

* more flags

* check Activation

* fix fc bug when num axes > 2

* reverse change

* fix next_batch_num (#2036)

* upgrade nccl to 2.4.8 (#2037)

* fix shape of fc in_diff (#2038)

* Rewrite model update op to optimizer graph

* Update oneflow.cmake (#2041)

* better looking merged_plan to dot v1 (#2039)

* better looking and more information in merged_plan.dot

* refine color

* Fix tick in multi node parallel (#2042) (#2047)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* Dev train conf builder (#2046)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* check in impl

* fix data dir (#2054)

* fix data dir

* rm model load path

* AssignOp (#2058)

* AssignOp

* remove useless code

* Python ops gather and unit test (#2053)

* python_ops gather and unit test

* format

* minor mod

* SnapshotOp (#2060)

* magical add and fix bug (#2061)

* check in impl

* add todo

* Dev jxf python pooling (#2056)

* run max_pool_2d without bug

* correct max_pool_2d

* correct average_pool_2d

* minor refine

* final version

* rename to nn.py

* add name arg to pool1d ops

* refine by review

* rename to _GetSequence and move it to the end of file (#2063)

* fix BindInterfaceMemBlockId (#2065)

* mark py file generated (#2066)

* Dev gracious exit (#2057)

* add more checks

* make language more consistant

* better error info for worker init

* better error

* Update setup.py (#2068)

* Refine Infer APIs by return Maybe<void> type (#2051)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk

* fix bug for split like op (#2070)

* fix snapshot path (#2071)

* Dev job set fix infer apis (#2072)

* Refine Infer APIs by return Maybe<void> type

* Fix return type

* Fix code style

* Replace CHECK macros in the implementation of infer APIs

* Revert IsOk

* update

* add AutoGlobalStep (#2073)

* rm default_initializer_conf in train conf (#2075)

* Fix sigmoid op (#2076)

* fix sigmoid op bug

* fix bug for split like op

* add sigmoid grad op

* Fix bn (#2077)

* fix bn

* return Maybe<void> OK in lambda

* fix typo

* fix SigmoidGradOp (#2078)

* Dev python merge job set (#2081)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix gcc warning in release (#2080)

* fix gcc version in release

* fix empty line

* Fix adam mv initilizer (#2082)

* zero constant initilzer for adam m and v

* make of_format

* init adam m v beta1_t and beta2_t

* use value instead of initializer

* const float& -> const float

* update

* LearningRateScheduleOp (#2079)

* matmul (#2084)

* matmul

* np.allclose
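
  Why np.allclose here: floating-point matmul results rarely match a reference bit-for-bit, so the test compares with a tolerance. A minimal, self-contained numpy sketch of the idea -- not the actual test code:

```python
import numpy as np

# Compare a float32 matmul against a higher-precision reference with a tolerance,
# since rounding makes exact equality unreliable.
a = np.random.rand(64, 128).astype(np.float32)
b = np.random.rand(128, 32).astype(np.float32)

ref = np.matmul(a.astype(np.float64), b.astype(np.float64))  # reference result
out = np.matmul(a, b)                                         # result under test

print(np.array_equal(ref, out))                     # usually False: tiny rounding differences
print(np.allclose(ref, out, rtol=1e-4, atol=1e-5))  # True within tolerance
```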

* Fix hang bugs

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085)

* bugfix: reshape op infer dim0 size; and look up tensorflow reshape

* refine code for read

* check py if and test

* prelu (#2086)

* prelu

* fix

* fix

* template for either ptr cast (#2088)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* add template for cast

* rename

* Dev build and infer ctx (#2089)

* add job_build_and_infer_ctx interface

* lbn_with_split_hint

* fix maybe macro

* fix signature of Maybe<T>::Error()

* job_build_and_infer_if

* add c_api_util wrapper for job_build_and_infer_ctx

* implement python/job_build_and_infer interface

* CurJobBuildAndInferCtx_AddPlacementGroup

* BuildJobAndInferCtx  and  Mgr  c++ implement (#2074)

* job_build_and_infer_ctx_mgr

* refine interface of infer_ctx_mgr

* JobBuildInferCtx set job conf; add and refine error type

* revert job.proto

* half impl of add op in build_infer_ctx

* generate op produced empty logical blob desc ; infer out blob desc interface

* job_build_and_infer_ctx VERSION 1

* add InferOutBlobDesc for conv op; remove record_piece_size in interface op

* maybe return

* job_set hold by job_build_and_infer_ctx_mgr

* check placement when infer ctx mgr leave cur job

* Global New/Delete JobBuildAndInferCtxMgr

* add JUST when ctx add op

* remove unused job_conf.arg_op_name

* fix bugs caused by python new api

* fix bugs caused by lack of Global<JobDesc>

* fix bugs caused by new api

* refactor compiler.Compile

* merge dev_python

* remove unused message proto

* rename api

* Fix input which body is disabled in xla launch kernel

* add RemoteBlob.shape and RemoteBlob.dtype

* Fix data type set default variable (#2092)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* fix default data type

* Add conf axis for bias_add for any axis channel (#2093)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* bias_add completion

* follow comment

* make conf axis required

* Dev jxf python initializer (#2090)

* oneflow initializer

* update

* Fix self control in

* Bugfix python alexnet (#2096)

* bugfix_python_alexnet

* fix

* Add fake consume op

* Dev global step (#2100)

* assign op


AddGlobalStepOpConf


fix


ARITHMETIC_DATA_TYPE_SEQ


identity_op_conf


add ops


GenNewSnapshotName


SnapshotOp


cleanup


blob name


LearningRateScheduleOp


LearningRateScheduleKernel


LearningRateScheduleKernel


AddLearningRateScheduleOpConf


learning rate


cleanup


fix


fix

* remove total_mbn_num

* date time format

* save

* refine

* refine

* revert

* refine snapshot

* fix

* refine

* AutoGlobalStep

* refine

* GenLogicalBlobName

* AutoLearningRate

* remove JobDesc lr

* fix snapshot path

* Maybe<void>

* learning_rate blob

* remove next_model_vid


fix


fix 


fix


learning_rate

* train_conf

* fix for global step on multi nodes

* Fix optimizer initializer (#2095)

* fix optimizer initializer

* rename lars data temp bn

* fix job_type (#2102)

* Dev alexnet new api (#2094)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* check in softmax loss

* nn.conv2d and nn.bias_add

* fix opname

* fix merge conflict

* fix name

* dense (#2097)

* Fix jxf dense v2 (#2098)

* dense

* minor fix

* alexnet

* fix conf

* quick fix

* transpose

* fix layers

* add transpose

* fix fc

* fix

* fix

* fix data laod

* params check and format

* rm activation in op conf

* save workaround

* fix avg pool 2d

* fix max pool 2d

* remove fc3 relu

* alexnet eval

* minor

* replace has_batch_dim with batch_axis (#2104)

* replace has_batch_dim with batch_axis

* refactor OrderValue4HasBatchAxis

* fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp

* no CHECK in MatmulOp::InferBatchAxis

* infer op by op_conf and  parallel_conf

* wrapper Error for ErrorProto

* replace ErrorUtil with Error

* add OF_CHECK (#2110)

* optional split_axis (#2113)

* Fix HasAttr bug for optional field

* undefined (#2116)

* merge reduce xxx (#2119)

* Update GetSbpSig() with Maybe (#2118)

* fix sveral ops

* modify all ops

* format

* update complete

* Refine AdamOptimizer

* fix (#2120)

* Fix xla AdamOptimizer bugs

* support scalar for reduce_xxx axis args (#2122)

* Dev opt split axis (#2121)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* fix autovar split_axis (#2125)

* Dev model init op (#2117)

* assign op


AddGlobalStepOpConf


fix


ARITHMETIC_DATA_TYPE_SEQ


identity_op_conf


add ops


GenNewSnapshotName


SnapshotOp


cleanup


blob name


LearningRateScheduleOp


LearningRateScheduleKernel


LearningRateScheduleKernel


AddLearningRateScheduleOpConf


learning rate


cleanup


fix


fix

* remove total_mbn_num

* date time format

* save

* refine

* refine

* revert

* refine snapshot

* fix

* refine

* AutoGlobalStep

* refine

* GenLogicalBlobName

* AutoLearningRate

* remove JobDesc lr

* fix snapshot path

* Maybe<void>

* learning_rate blob

* remove next_model_vid


fix


fix 


fix


learning_rate

* train_conf

* fix for global step on multi nodes

* SnapshotReader


snapshot writer


model init op


fix


refine


init


InitializeFromSnapshotConf


model io job


ModelLoadOp


ModelLoadKernel


MakeModelLoadJob


ModelSaveOp


fix


InterUserJobInfo


_MakeModelLoadJobFunc


MutModelLoadOpConTickInputHelper


fix


refine


init/load/save


set_default_variable

* remove SnapshotMgr

* snapshot.h

* delete model_init_job.cpp


foreign_input_op_conf


fix


snapshot path


set path


op_conf


fix


fix CopyFromNdarray


to bytes c


use uint8


char2uint8

* model init

* model io

* fix

* ModelSaveKernel

* mutable_batch_axis()->Clear()

* InferBatchAxis

* fix

* refine

* job set

* MakeModelIoJobs

* fix

* jobs

* fix

* model io job

* GenOutputOpConf

* refine snapshot

* refine

* fix

* refine CheckPoint

* remove session

* refine

* refine

* refine

* remove keyword.h/cpp

* refine

* global_step=>train_step

* GetSbpSignatures

* ModelInitOp

* fix (#2127)

* rm stale alexnet script (#2129)

* Dev plain maybe (#2126)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* Dev simple checkpoint manager (#2128)

* SimpleCheckPointManager

* makedirs

* fix path

* save

* refine

* refine

* fix path to numpy (#2130)

* Dev plain maybe (#2132)

* optional split_axis

* backup

* VariableConf::(OptInt64 split_axis)

* backup

* 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp

* SharedOrPlain

* const std::shared_ptr<T>& => std::shared_ptr<T>

* rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust()

* refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*>

* Dev jxf merge general ops (#2131)

* merge some general ops to dev_python

* dense demo

* rm print in test

* new line at the end of file

* format

* fix check point

* update alexnet

* broadcast_xxx (#2134)

* broadcast_xxx

* typo

* typo

* rm job_conf.num_of_batches_in_snapshot

* fix args (#2136)

* fix proto if (#2138)

* pass name to inner function (#2139)

* check dropout if (#2140)

* check dropout if

* fix typo

* Dev merge math ops (#2143)

* merge math ops

* new line at the end of file

* merge layer norm (#2144)

* variable_scope (#2141)

* variable_scope

* revert format

* add check

* Merge dropout if (#2145)

* check dropout if

* fix typo

* fix typo

* slice (#2142)

* slice

* add check and docstring

* minor

* minor

* add const (#2146)

* add const

* fix indentation

* address review

* fmt

* rm redundant

* Update array_ops.py

* Update array_ops.py

* Update array_ops.py

* add more activations to math_ops (#2147)

* fix bug (#2149)

* truncated normal for bert (#2150)

* Update bert for dev python (#2151)

* truncated normal for bert

* bert support

* math.dropout to nn.dropout (#2153)

* refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto

* allow export multiple interfaces in oneflow_export decorator (#2154)
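
  A rough sketch of what exporting one function under several API names can look like; this is an illustration only, not the actual oneflow_export implementation, and the export names used are made up:

```python
# Illustrative only: a decorator that registers one function under several
# public (dotted) API names at once. Not the real oneflow_export code.
_exported_api = {}

def export_as(*api_names):
    def decorator(func):
        for name in api_names:
            _exported_api[name] = func   # same function reachable under each name
        return func
    return decorator

@export_as("math.relu", "nn.relu")       # hypothetical export names
def relu(x):
    return max(x, 0)

print(_exported_api["math.relu"] is _exported_api["nn.relu"])  # True
```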

* refactor job_build_and_infer_if.h

* update oneflow_internal.h to use Maybe (#2135)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

*  Transfer data_part_num to DecodeOp and RecordLoadOp (#2148)

*  Transfer data_part_num to DecodeOp and RecordLoadOp

* Fix python scripts

* Dev nc of internal (#2155)

* Fix python internal (#2133)

* Return error message in oneflow_internal

* Refine environment_objects_scope

* add OF_ERROR_STR_CHECK and OFStrCat()

* format

* fix based on review

* fix(oneflow_internal.h): add undef

* fix: expr -> (expr)

* feat: update oneflow_internal_helper to use func

* fix: fix ctor bug

* fix config_proto

* rename c_api_util.Init => c_api_util.InitEnvironment

* refactor compile_context.cur_job => compile_context.cur_job_conf

* remove FixPackedBlobDescOfProducedRegst (#2156)

* Fix snapshot root path empty log (#2158)

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* fix 121 for tick (#2069)

* Fix snapshot root path empty log

* fix channel last (#2157)

* fix channel last

* minor

* merge pb_message

* add cudnn conv force algo (#2159)

* Update bert for dev python (#2160)

* remove old bert

* set data_part_num in decoder

* support model load/saveargs

* Dev flow function (#2152)

* add of.function, refactor init, refine session, and refine runtime

* rm useless code

* rename

* update

* add test

* @oneflow_export JobConfigProto and Trainconf (#2162)

* @oneflow_export JobConfigProto and Trainconf

* remove unused config in config_util.py

* remove oneflow.get_cur_job_conf_builder

* bugfix: bias_add op and reduce_sum op infer sbp, and implementation of bias_add kernel (#2161)

* 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf

* fix config.train.model_update_conf

* _GetJobConfAttr

* update alexnet (#2166)

* Update alexnet (#2167)

* update alexnet

* update for bert

* 15->16

* more reasonable conf

* get variable in py layer norm

* replace val in pb msg;  decode lbn string with split hint (#2165)

* bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163)

* Add meta data in HLO instruction, and refine

* python model parallel (#2103)

* decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op

* merge placement group

* refine code in AddAndInferOp

* auto merge placement group when add op; remove mergeplacementgroup interface

* infer sbp parallel when add op; impl Get/Has split axis in infer_ctx

* python blob add interface for model parallel

* refine code of python blob split

* remove interface of has/get_split_axis in python blob

* remove interface of has_batch_dim in python blob

* add check that blob split_axis can be divided by parallel num

* refine code for maybe get/infer sbp

* fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3. python blob desc, etc.

* fix for plain point maybe

* fix bug: add repeated placement group, remove add placement interface in hand

* fixbug: python/blob_desc, temp impl of not deepcopy;  feat: dense layer support model parallel

* dev_python model parallel runnable and check correct

* remove add placement group when placement scope exits

* 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel

* bugfix: bias_add backward infer sbp wrong;  model parallel bias add debug done

* refine python blob_desc.split implement

* refine interface decode lbn to split hint

* refine auto add placement group

* refine lbn with split hint decode

* refine code for review

* remove AutoVar related code (#2168)

* feat: remove all autovar

* fix and format

* fix: fix op::InferBlobDesc

* add prototype (#2172)

* add prototype

* infer blob desc with sbp_signature

* `str_a is not str_b' is buggy, use `str_a != str_b' instead
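
  The reason this matters: `is`/`is not` compare object identity, while `==`/`!=` compare values, and two equal strings are not guaranteed to be the same object. A small self-contained illustration:

```python
# `is not` checks identity, `!=` checks value; equal strings can be distinct objects.
a = "split"
b = "".join(["sp", "lit"])   # equal value, built as a separate object

print(a == b)      # True  -> value equality, which is what the check needs
print(a is b)      # False -> different objects, so `a is not b` is True
                   #          even though the strings are equal
```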

* Update snapshot.cpp (#2174)

* remove useless lines (#2176)

* Fix bert multi nodes (#2177)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* CHECK_JUST for InferBlobDescsIf (#2178)

* Fix bert multi nodes (#2180)

* remove useless lines

* fix bert and init_cluster_env for multi nodes

* config_proto -> default_config_proto

* delete worker

* update alexnet

* remove unused op (#2182)

* remove parallel_ctx when kernel init (#2185)

* InferOpSbpSignature in op_graph and infer_ctx (#2175)

* InferOpSbpSignature in op_graph and infer_ctx

* bugfix: lambda lifetime; add location info to generated job build errors

* refine error generation and return

* refine check that lbi is valid and exists

* remove parallel num in decode_of_record op/kernel (#2186)

* Fix bugs

* delete GlobalJobDesc() in operator/ (#2188)

* rm unused test file

* Refine

* Add assign ops behind adam optimizer to update model and momentum etc.

* Add assign ops behind adam optimizer to update model and momentum etc.

* Remove fake consume op

* Support enable/disable XLA by set env

* Merge callback, limit max operator count for each XLA subgraph

* CudaEventPool

* fix vector

* refine

* Support in-place update for optimizer

* Add alias input and output to prevent reusing input with other temp buffers

* Refine code style

* Remove unused code

* Fix static cublas library and xla link conflict

* Fix cublas link conflict with tensorflow

* Fix different connection kinds for multiple gpu cards (#2282)

* Refine xla cluster algo (#2289)

* Fix different connection kinds for multiple gpu cards

* Fix bug for multiple outputs consumed by one node

* Refine cluster algo

* Refine MarkClusterId pass and ReduceSplit task node (#2314)

* Fix different connection kinds for multiple gpu cards

* Fix bug for multiple outputs consumed by one node

* Refine cluster algo

* Determine fusion disabled edges

* update

* Produce multiple registers on edges for ReduceSplit task node.
Fix new allocator by stream id.

* Refine MarkClusterId pass

* Clustering subgraph with reverse ordering is better

* Support strict clustering by taking dependencies into consideration

* Translate rebuild job and rewrite optimizer into passes, and refine code style

* Fix spell error

* Update cmake

* Merge branch dev_python (#2321)

* Dev res50 new api (#2173)

* check in script

* runable

* fix multinode

* fix and real train

* fix param data_format

* fix truncated normal

* quick fix multi node launch (#2193)

* Dev reshape sbp (#2192)

* reshape sbp

* more check for reshape conf

* fix error CHECK

* refactor reshape

* fix reshape like op

* support naive case of s0

* refine

* rm redundant code

* more generous check for equal element cnt

* restore empty line

* add GatherMs0Grad op (#2191)

* support for gather with s(0) `in'

* add gather_ms0_op

* fix bugs in message GatherMs0OpConf and GatherMs0Kernel

* only (B, S(0)) -> P supported for gather_ms0 op

* add GatherMs0Grad op

* minor fix

* refine code

* bugfix and update gather test case

* add concat op and pass the test (#2067)

* add concat op and pass the test

* add vgg job_conf

* model compared to be same as the old one

* rm unnecessary file

* Update array_ops.py

* mv file

* get rid of ternary operator (#2195)

* Dev reshape util struct (#2194)

* check in changes

* rm file

* minor fix

* Merge network files of 2 cnns (#2196)

* add inceptionV3

* check in vgg16

* add cnns test scripts for dev_python (#2170)

* add cnns test scripts for dev_python

* add alexnet test scripts

* add resnet50

* add inceptionv3

* add resnet50

* add vgg16

* first version of run_cnns_test.py

* remove old files

* unsorted_segment_sum (#2198)

* oneflow.unsorted_segment_sum (#2199)

* oneflow.unsorted_segment_sum

* remove unused import

* remove unused import

* Dev batch unsorted segment sum (#2200)

* oneflow.unsorted_segment_sum

* remove unused import

* remove unused import

* rename UnsortedSegmentSum to BatchUnsortedSegmentSum

* rename: batch_unsorted_* => unsorted_batch_*

* unsorted_segment_sum (#2201)

* unsorted_segment_sum

* fix job_completer/unsorted_segment_sum_grad.cpp

* more check for unsorted_segment_sum batch_axis

* remove FixParallelDesc (#2202)

* rm KernelIfWithModel KernelIfWithActivation (#2203)

* remove KernelIfWithActivation

* remove KernelIfWithModel

* rm blob header kLossInstanceNum (#2204)

* rm ActivationType from op/kernel (#2205)

* refactor sigmoid_cross_entropy_loss

* fix SigmoidGrad::InferBatchAxis

* support part_name_prefix and part_name_suffix_length (#2208)

* rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus

* oneflow.watch for debug

* Dev decode batch size (#2206)

* rm batch_size and piece_size

* merge dev_python

* Update reshape_like_op.cpp (#2213)

* oneflow.parallel (#2211)

* oneflow.parallel

* refactor split_axis => parallel

* rename parallel => distribute

* fix typo: *Parallel => *Distribute

* add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()
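
  A stand-in sketch of the placement semantics behind the two methods named above (split along an axis vs. broadcast); the numpy mock below only illustrates the intent and is not the OneFlow blob_desc API:

```python
import numpy as np

# Mock of the two distribute modes: S(axis) shards one axis across devices,
# B replicates the full array on every device. Illustration only.
def split_distribute(arr, axis, parallel_num):
    return np.split(arr, parallel_num, axis=axis)     # one shard per device

def broadcast_distribute(arr, parallel_num):
    return [arr for _ in range(parallel_num)]         # full copy per device

x = np.arange(8).reshape(4, 2)
print([p.shape for p in split_distribute(x, 0, 2)])   # [(2, 2), (2, 2)]
print([p.shape for p in broadcast_distribute(x, 2)])  # [(4, 2), (4, 2)]
```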

* fix warning: return string reference to temporary (#2212)

* docker build support (#2002)

* update cmake files

* check in files

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* shrink ctx size

* fix script

* fix wheel build

* fix wheel build not adding .so (#2052)

* lower cmake version bar

* rm more files

* keep build dir

* check in test bash script

* fix

* Dev docker sx (#2124)

* add python2 docker env

* rm old docker files

* update repository

* add ARG CUDA and USE_PYTHON_3_OR_2

* reform files

* update

* rm log that doesn't print when there is cache

* use default arg in dockerfile

* better py 2 or 3 condition

* add default

* use if

* update alexnet

* update for bert

* 15->16

* add resnet50 in model (#2217)

* remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)

* remove parallel policy

* rm FC/rnn/embedding_look_up op/kernel

* add check data parallel for conv/layer_norm op

* bugfix: bias add + use math_add when batch size = 1

* fix InferBatchAxis (#2220)

* sync with bert_benchmark (#2221)

* sync with bert_benchmark

* rename run.sh

* Dev actor msg queue (#2225)

* async msg queue

* EnqueueAsyncMsg

* Merge wnd python (#2226)

* not ready yet

* segment fix

* fix segment_sum bugs

* 1st wide_n_deep push

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* run successfully on single GPU

* fix 121 for tick (#2069)

* delete unnecessary multiply_grad class

* speed up generate time for dot2svg (#2083)

* Add axis conf to bias_add for any axis channel (#2087)

* bias_add completion

* follow comment

* make conf axis required

* Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)

This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.

* updated

* fix segment_sum_grad

* fix sbp

* fix segment_sum impl for data parallel

* fix

* remove useless code in segment_kernel_util.h

* add python interface

* fix sigmoid conf

* fix naming error

* fix typo

* temp mod loss sbp

* add LazyAdam

* Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep

* rm useless code

* unsorted_segment_sum

* refactor sigmoid_cross_entropy_loss_kernel for higher performance

* Improve sigmoid cross entropy loss grad (#2207)

* remove for loop called cuda kernel

* minor fix

* ../oneflow/python/ops/data_ops.py (#2209)

* fix lazy_adam

* Merge wnd and python (#2214)

* rm ActivationType from op/kernel (#2205)

* refactor sigmoid_cross_entropy_loss

* fix SigmoidGrad::InferBatchAxis

* support part_name_prefix and part_name_suffix_length (#2208)

* rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus

* oneflow.watch for debug

* Dev decode batch size (#2206)

* rm batch_size and piece_size

* merge dev_python

* Update reshape_like_op.cpp (#2213)

* oneflow.parallel (#2211)

* oneflow.parallel

* refactor split_axis => parallel

* rename parallel => distribute

* fix typo: *Parallel => *Distribute

* add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()

* merge dev_python

* fix boxing: P->S(0)

* check in docker build scripts (#2216)

* Dev python widedeep docker (#2218)

* check in docker build scripts

* check in .dockerignore

* rm oneflow.segment_sum

* remove segment_sum

* rm unused file

* rm debug code

* rm debug code

* rm double empty lines

* remove useless comments

* fix send msg (#2227)

* fix reduction_coefficient (#2228)

* refactor ndarray for eq/ne/...

* Dev kernel launch synchronized (#2230)

* IsKernelLaunchSynchronized

* virtual

* refine

* refine

* separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC

* more static_assert

* remove unused task related dot function (#2236)

* remove unused task related dot function

* do not output dot rank info

* Dev non distributed optimizer js (#2234)

* op&kernel&actor

* job

* job_completer

* graph

* format

* fix pd

* fix

* ignore DelPlacementByOpName

* fix auto tick

* JobBuilder

* fix

* config util

* fix

* fix opgrade

* broadcast tick

* fix allreduce

* balance by model size

* GetSoleOutBlobSize

* async_actor_msg_deque

* group

* AddOrMutOpsOnlyOnce

* fix NcclTupleBroadcastGrad

* order

* set nccl order hint

* op_conf

* grad hint

* NcclTupleBroadcastReduceSequencePass

* add missed mutops

* order fix

* try kMdUpdtArea

* fix nccl_order_hint

* fix

* add ti

* tuple_identity_op

* remove useless

* group

* fix dead lock

* force ctrl in

* sc broadcast

* sort obn

* group nccl

* config group_size_mbyte

* non_distributed_optimizer_group_size_mbyte

* format

* stop check

* rm message sending optimization

* refine lazy adam (#2244)

* refine lazy adam

* update

* memory version 2 step 1: replace original concept about mem sharing (#2242)

* mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem

* memory version 2 step 1: replace original concept about mem sharing

* record reader multi thread (#2246)

* multi thread

* ComputeThreadPoolSize

* python api

* Fix random decode (#2252)

* add decode random

* fix decode random actor

* Dev pr boxing v2 (#2248)

* NcclDeviceCtx

* include naive_actor

* refine

* use_boxing_v2

* config.use_boxing_v2

* SubTskGphBuilder

* fix

* hash<oneflow::MemoryCase>

* Maybe<void>

* ChainSubTskGphBuilder

* SliceBoxingOp

* return ok

* SliceBoxingKernel

* SliceBoxingActor

* kSliceBoxing

* nccl boxing op

* nccl actor

* REGISTER_OP

* GetMsgFromCustomizedConf

* NcclBoxingTaskNode

* BldSubTskGphByBoxingV2

* NcclBoxingSubTskGphBuilder

* fix

* fix

* NcclKernel

* ParallelContext

* REGISTER_ACTOR

* fix rank set

* IsNcclTaskType

* limit

* 1024

* multi thread reader

* thread_num

* IsKernelLaunchSynchronized

* refine

* NcclTupleReduce/BroadcastKernel use NcclDeviceCtx

* MakeHostMemCase

* NcclBldSubTskGph

* remove use less code

* use_boxing_v2

* refine

* refine

* refine

* refine

* refine

* cmake find python note when version less than 3.14 (#2286)

* fix bug: reduce split kernel inplace (#2297)

* Dev bias add (#2299)

* use bias add

* fix

* bias_add

* bias add half

* fix

* reinterpret_cast

* fix half

* HALF

* fix

* ADD_DEFAULT_KERNEL_CREATOR

* fix

* format

* Fix dev python test (#2294)

* add decode random

* fix decode random actor

* fix dev_python test scripts

* fix batch_size test scripts

* fix

* Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)

* MemBlockProto and ChunkProto

* create mem block and chunk after improver

* interface merge mem block and chunk between sub plans

* merge chunk between jobs for memory reuse

* using memory zone unique id replace memory case hash

* merge interface op mem block between jobs for mem shared

* gen GlobalCriticalSection by mem block id and chunk id

* check mem block and chunk valid before runtime

* Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst

* fix bug; and pass test

* fix bug: init chunk_id_count in id_manager

* reuse copyHd out mem between jobs

* PushPlan and PullPlan for memblock and chunk

* refine merge mem block / chunk in oneflow.cpp

* at(i);

* GetOpName2JobId2TaskProtos functional

* using output ptr; pass test AlexNet and Resnet

* Fix xla reshape op

* Merge upstream of_xla (#2322)

* Dev res50 new api (#2173)

* check in script

* runable

* fix multinode

* fix and real train

* fix param data_format

* fix truncated normal

* quick fix multi node launch (#2193)

* Dev reshape sbp (#2192)

* reshape sbp

* more check for reshape conf

* fix error CHECK

* refactor reshape

* fix reshape like op

* support naive case of s0

* refine

* rm redundant code

* more generous check for equal element cnt

* restore empty line

* add GatherMs0Grad op (#2191)

* support for gather with s(0) `in'

* add gather_ms0_op

* fix bugs in message GatherMs0OpConf and GatherMs0Kernel

* only (B, S(0)) -> P supported for gather_ms0 op

* add GatherMs0Grad op

* minor fix

* refine code

* bugfix and update gather test case

* add concat op and pass the test (#2067)

* add concat op and pass the test

* add vgg job_conf

* model compared to be same as the old one

* rm unnecessary file

* Update array_ops.py

* mv file

* get rid of ternary operator (#2195)

* Dev reshape util struct (#2194)

* check in changes

* rm file

* minor fix

* Merge network files of 2 cnns (#2196)

* add inceptionV3

* check in vgg16

* add cnns test scripts for dev_python (#2170)

* add cnns test scripts for dev_python

* add alexnet test scripts

* add resnet50

* add inceptionv3

* add resnet50

* add vgg16

* first version of run_cnns_test.py

* remove old files

* unsorted_segment_sum (#2198)

* oneflow.unsorted_segment_sum (#2199)

* oneflow.unsorted_segment_sum

* remove unused import

* remove unused import

* Dev batch unsorted segment sum (#2200)

* oneflow.unsorted_segment_sum

* remove unused import

* remove unused import

* rename UnsortedSegmentSum to BatchUnsortedSegmentSum

* rename: batch_unsorted_* => unsorted_batch_*

* unsorted_segment_sum (#2201)

* unsorted_segment_sum

* fix job_completer/unsorted_segment_sum_grad.cpp

* more check for unsorted_segment_sum batch_axis

* remove FixParallelDesc (#2202)

* rm KernelIfWithModel KernelIfWithActivation (#2203)

* remove KernelIfWithActivation

* remove KernelIfWithModel

* rm blob header kLossInstanceNum (#2204)

* rm ActivationType from op/kernel (#2205)

* refactor sigmoid_cross_entropy_loss

* fix SigmoidGrad::InferBatchAxis

* support part_name_prefix and part_name_suffix_length (#2208)

* rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus

* oneflow.watch for debug

* Dev decode batch size (#2206)

* rm batch_size and piece_size

* merge dev_python

* Update reshape_like_op.cpp (#2213)

* oneflow.parallel (#2211)

* oneflow.parallel

* refactor split_axis => parallel

* rename parallel => distribute

* fix typo: *Parallel => *Distribute

* add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()

* fix warning: return string reference to temporary (#2212)

* docker build support (#2002)

* update cmake files

* check in files

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* shrink ctx size

* fix script

* fix wheel build

* fix wheel build not adding .so (#2052)

* lower cmake version bar

* rm more files

* keep build dir

* check in test bash script

* fix

* Dev docker sx (#2124)

* add python2 docker env

* rm old docker files

* update repository

* add ARG CUDA and USE_PYTHON_3_OR_2

* reform files

* update

* rm log that doesn't print when there is cache

* use default arg in dockerfile

* better py 2 or 3 condition

* add default

* use if

* update alexnet

* update for bert

* 15->16

* add resnet50 in model (#2217)

* remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215)

* remove parallel policy

* rm FC/rnn/embedding_look_up op/kernel

* add check data parallel for conv/layer_norm op

* bugfix: bias add + use math_add when batch size = 1

* fix InferBatchAxis (#2220)

* sync with bert_benchmark (#2221)

* sync with bert_benchmark

* rename run.sh

* Dev actor msg queue (#2225)

* async msg queue

* EnqueueAsyncMsg

* Merge wnd python (#2226)

* not ready yet

* segment fix

* fix segment_sum bugs

* 1st wide_n_deep push

* Fix tick in multi node parallel (#2042)

* check in fixes

* fix by adding boxing method

* register tick op

* move code and add more check

* fix typo

* fix bug when filtering op nodes before adding tick

* fix wheel build not adding .so (#2052)

* color plan dot VERSION-2 (#2045)

* run successfully on single GPU

* fix 121 for tick (#2069)

* delete unnecessary multiply_grad class

* speed up generate time for dot2svg (#2083)

* Add axis conf to bias_add for any axis channel (#2087)

* bias_add completion

* follow comment

* make conf axis required

* Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091)

This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47.

* updated

* fix segment_sum_grad

* fix sbp

* fix segment_sum impl for data parallel

* fix

* remove useless code in segment_kernel_util.h

* add python interface

* fix sigmoid conf

* fix naming error

* fix typo

* temp mod loss sbp

* add LazyAdam

* Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep

* rm useless code

* unsorted_segment_sum

* refactor sigmoid_cross_entropy_loss_kernel for higher performance

* Improve sigmoid cross entropy loss grad (#2207)

* remove for loop called cuda kernel

* minor fix

* ../oneflow/python/ops/data_ops.py (#2209)

* fix lazy_adam

* Merge wnd and python (#2214)

* rm ActivationType from op/kernel (#2205)

* refactor sigmoid_cross_entropy_loss

* fix SigmoidGrad::InferBatchAxis

* support part_name_prefix and part_name_suffix_length (#2208)

* rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus

* oneflow.watch for debug

* Dev decode batch size (#2206)

* rm batch_size and piece_size

* merge dev_python

* Update reshape_like_op.cpp (#2213)

* oneflow.parallel (#2211)

* oneflow.parallel

* refactor split_axis => parallel

* rename parallel => distribute

* fix typo: *Parallel => *Distribute

* add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute()

* merge dev_python

* fix boxing: P->S(0)

* check in docker build scripts (#2216)

* Dev python widedeep docker (#2218)

* check in docker build scripts

* check in .dockerignore

* rm oneflow.segment_sum

* remove segment_sum

* rm unused file

* rm debug code

* rm debug code

* rm double empty lines

* remove useless comments

* fix send msg (#2227)

* fix reduction_coefficient (#2228)

* refactor ndarray for eq/ne/...

* Dev kernel launch synchronized (#2230)

* IsKernelLaunchSynchronized

* virtual

* refine

* refine

* separate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC

* more static_assert

* remove unused task related dot function (#2236)

* remove unused task related dot function

* do not output dot rank info

* Dev non distributed optimizer js (#2234)

* op&kernel&actor

* job

* job_completer

* graph

* format

* fix pd

* fix

* ignore DelPlacementByOpName

* fix auto tick

* JobBuilder

* fix

* config util

* fix

* fix opgrade

* broadcast tick

* fix allreduce

* balance by model size

* GetSoleOutBlobSize

* async_actor_msg_deque

* group

* AddOrMutOpsOnlyOnce

* fix NcclTupleBroadcastGrad

* order

* set nccl order hint

* op_conf

* grad hint

* NcclTupleBroadcastReduceSequencePass

* add missed mutops

* order fix

* try kMdUpdtArea

* fix nccl_order_hint

* fix

* add ti

* tuple_identity_op

* remove useless

* group

* fix dead lock

* force ctrl in

* sc broadcast

* sort obn

* group nccl

* config group_size_mbyte

* non_distributed_optimizer_group_size_mbyte

* format

* stop check

* rm message sending optimization

* refine lazy adam (#2244)

* refine lazy adam

* update

* memory version 2 step 1: replace original concept about mem sharing (#2242)

* mem_shared_id -> mem_block_id;  mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem

* memory version 2 step 1: replace original concept about mem sharing

* record reader multi thread (#2246)

* multi thread

* ComputeThreadPoolSize

* python api

* Fix random decode (#2252)

* add decode random

* fix decode random actor

* Dev pr boxing v2 (#2248)

* NcclDeviceCtx

* include naive_actor

* refine

* use_boxing_v2

* config.use_boxing_v2

* SubTskGphBuilder

* fix

* hash<oneflow::MemoryCase>

* Maybe<void>

* ChainSubTskGphBuilder

* SliceBoxingOp

* return ok

* SliceBoxingKernel

* SliceBoxingActor

* kSliceBoxing

* nccl boxing op

* nccl actor

* REGISTER_OP

* GetMsgFromCustomizedConf

* NcclBoxingTaskNode

* BldSubTskGphByBoxingV2

* NcclBoxingSubTskGphBuilder

* fix

* fix

* NcclKernel

* ParallelContext

* REGISTER_ACTOR

* fix rank set

* IsNcclTaskType

* limit

* 1024

* multi thread reader

* thread_num

* IsKernelLaunchSynchronized

* refine

* NcclTupleReduce/BroadcastKernel use NcclDeviceCtx

* MakeHostMemCase

* NcclBldSubTskGph

* remove use less code

* use_boxing_v2

* refine

* refine

* refine

* refine

* refine

* cmake find python note when version less than 3.14 (#2286)

* fix bug: reduce split kernel inplace (#2297)

* Dev bias add (#2299)

* use bias add

* fix

* bias_add

* bias add half

* fix

* reinterpret_cast

* fix half

* HALF

* fix

* ADD_DEFAULT_KERNEL_CREATOR

* fix

* format

* Fix dev python test (#2294)

* add decode random

* fix decode random actor

* fix dev_python test scripts

* fix batch_size test scripts

* fix

* Memory Version 2.0 Step 2:  MemSharedAndReused between jobs (#2267)

* MemBlockProto and ChunkProto

* create mem block and chunk after improver

* interface merge mem block and chunk between sub plans

* merge chunk between jobs for memory reuse

* using memory zone unique id replace memory case hash

* merge interface op mem block between jobs for mem shared

* gen GlobalCriticalSection by mem block id and chunk id

* check mem block and chunk valid before runtime

* Refactor: RegstMgr ;  allocate memory by mem block and chunk instead of regst

* fix bug; and pass test

* fix bug: init chunk_id_count in id_manager

* reuse copyHd out mem between jobs

* PushPlan and PullPlan for memblock and chunk

* refine merge mem block / chunk in oneflow.cpp

* at(i);

* GetOpName2JobId2TaskProtos functional

* using output ptr; pass test AlexNet and Resnet

* Dev cuda 9 arch 70 (#2318)

* kCudaAlignSize = 256

* always compute_70

* __CUDA_API_VERSION >= 10000

* __CUDA_API_VERSION >= 10000

* disable_all_reduce_sequence

* Fix xla reshape op

* Fix compilation without xla

* Remove useless code and fix data type mismatch in field desc (#2326)

* Remove useless code

* Refine code style

* Fix data type mismatch in field desc

* Update README.md (#2335)

* Refine code style (#2336)

* Update XLA usage document (#2337)

* Update XLA usage document

* Fix mistakes

* Add xla clang-format and format codestyle (#2340)

* Revert "Add xla clang-format and format codestyle (#2340)" (#2341)

This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724.

* Add xla clang-format and format codestyle (#2342)

* Add xla clang-format and format codestyle

* Fix header file missing

* Of xla sx (#2334)

* add gather grad op and pass testing

* rm check

* done batch gather grad

* pass test

* modify according to the review

* add unsorted_segment_sum and refine unsorted_batch_segment_sum

* reform according to review

* reformat according to clang-format and remove the reference to the temp object

* Pick step0 and step1 new commits (#2346)

* Add xla clang-format and format codestyle

* Fix header file missing

* Modify codes to support XLA

Conflicts:
	oneflow/core/job/job_builder.cpp
	oneflow/core/job/job_builder.h
	oneflow/core/operator/op_conf.proto

* Fix a bug in building the subgraph, although it does not lead to wrong results (#2347)

* Fix setting is_mutable in xla launch op (#2349)

* Change directory xla to xrt, apply patch if building with xla

* Refactor

* Add infer shape pass, and Refactor launch kernel, graph compiler

* Refine code style, add xla executable and graph compiler

* Rename platform.proto as types.proto

* change OpCompiler to OpKernel, complete xla graph compiler

* Fix compilation bugs and add allocator, now xla compilation is ok

* Add xla executable runtime

* Add executable run scope to support launch kernel on specific stream.

* Fix infer shape pass, and revert cuda event pool

* Refactor graph building with attaching argument metadata.

* Set mutability if rebuilding job

* Set device ordinal correctly

* Refine DelOps

* Refine Argument definition and abstract function as subgraph

* Fix infer shape in xrt launch op and launch kernel.

* Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt.

* Refine code style

* Rename xla Operand as XlaValue.

* Complete TensorRT compiler and builder, Refine OpKernel

* Pick public code changes from the new tensorrt branch.

* Fix tensorrt compilation

* Fake implementation of trt executable

* Support selecting engine in launch kernel, refine trt executable

* Use the global logger required by tensorrt, rebuild the engine if the batch size is larger than the default max batch size, and other bugfixes.

* Support train phase setting for registered op kernel

* Remove RewriteOptimizer pass, update xla optimizer op.

* Format job builder .h and .cpp files.

* Remove RewriteOptimizer pass, update xla optimizer op.

* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.

* Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job.

* Refine code style and comment.

* Refine model update inference for launch op.

* Refine

* Refine code style and comment.

* Refine model update inference for launch op.

Conflicts:
	oneflow/xrt/kernel/op_kernel.h
	oneflow/xrt/node_util.cpp
	oneflow/xrt/node_util.h
	oneflow/xrt/passes/cluster.h
	oneflow/xrt/passes/mark_cluster_id_pass.cpp
	oneflow/xrt/passes/rebuild_job_pass.cpp
	oneflow/xrt/types.h

* Add xrt README.md

* Add use_xla_jit and use_tensorrt options in job proto

* Refine code style

* Fix BlobDesc getter and xla LayerNorm op for FP16

* Make use_xla_jit and use_tensorrt configurable from python config and env variables.

* Update benchmark

* Refine xrt README and rename compile_with_xrt.h file

* Update README

* Revert tensorrt

* Fix absl missing if building with TensorRT but without XLA

* Update xrt benchmark

* Disable WITH_XLA by default

* Update xrt benchmark

* Format xrt as core

* add activation op

* add softmax op

* Refine code style, remove unused code

* Remove duplication of XLA usage

* test pass

* pooling test pass

* add concat op, not tested

* add activation ops, test not passed

* Add xla gelu unittest

* add  activation op, and test  passed

* add pooling op, and test passed

* Fix int64 env variable

* Export float16 for python

* Add xla relu unittest

* try to solve conv bug

* add elementwise add op, test passed

* add concat op, test passed

* Bugfix: transfer weights from gpu to host since tensorrt requires host weights.

* add op unit tests

* resolve conflicts and fix softmax bug

* add identity op and topk op, to test

* Add xla bias add and reshape unittests

* Add xla identity unittest

* Add xla cast and scalar op unittests

* Add xla broadcast op and transpose unittests

* Add xla add, sigmoid and tanh unittests

* add reduce mean op, test passed

* format ops, add CHECKs, and optimize function structure

* Add xla gather and batch_gather unittests

* Add xla softmax unittest and fix softmax bug if axis is not the last dim.

* add trt gather op and unit test

* Add xla reduce_sum unittest, and support keep_dims for xla reduce

* Add xla layer_norm unittest, and refine xla layer norm op

* Add reshape_like unittest, and export reshape_like api

* Refine xrt unittest code style

* Export softmax_grad op, add softmax_grad unittest

* Export tanh_grad op and add xla unittest

* Export gelu_grad op, and add xla unittest

* add conv unit test

* reformat

* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests

* Commit to merge upstream of_xrt

* check files

* modify files according to review advice.

* Add xrt unittests (#2483)

* Revert tensorrt

* Fix absl missing if building with TensorRT but without XLA

* Update xrt benchmark

* Add xla gelu unittest

* Fix int64 env variable

* Export float16 for python

* Add xla relu unittest

* Add xla bias add and reshape unittests

* Add xla identity unittest

* Add xla cast and scalar op unittests

* Add xla broadcast op and transpose unittests

* Add xla add, sigmoid and tanh unittests

* Add xla gather and batch_gather unittests

* Add xla softmax unittest and fix softmax bug if axis is not the last dim.

* Add xla reduce_sum unittest, and support keep_dims for xla reduce

* Add xla layer_norm unittest, and refine xla layer norm op

* Add reshape_like unittest, and export reshape_like api

* Refine xrt unittest code style

* Export softmax_grad op, add softmax_grad unittest

* Export tanh_grad op and add xla unittest

* Export gelu_grad op, and add xla unittest

* Export layer_norm_grad and layer_norm_param_grad api, add xla unittests

* Commit to merge upstream of_xrt

* Fix reduce_mean facade bug if keep_dims is true.

* Refine tensorrt unittests

* Check fails if a full reduce is performed without keeping dimensions.

* add pooling unit test

* Add tensorrt bias_add and reshape op, and their unittests.

* Support fp16 for tensorrt.

* Add tensorrt transpose op and unittest.

* add unit test conv_2d

* add unit test concat

* Fix concat if axis is -1.

* Refine tensorrt conv2d unittest

* Fix padding mode for conv2d and pooling, refine unittests.

* Refine tensorrt concat unittest

* Add convert api from string engine to XrtEngine.

* Revert tensorrt, and merge of_xrt branch

* Remove some comments.

* Refine tensorrt unittests

* Add XrtConfig to deal with xla and tensorrt configurations.

Conflicts:
	oneflow/xrt/api.cpp

* Update tensorflow.cmake to avoid applying the patch repeatedly.

* Remove XrtConfig Option, and fix xrt unittests

* Add tensorrt batch norm (#2516)

* Refine xrt signature hash, and fix python configuration (#2520)

* Fix XrtCompilationEnabled returns (#2524)

* Fix compilation after merge dev_python

* Update xrt unittests

* Revert protobuf version

* Remove comment FOR_RANGE

* Remove unused code

* Reformat

* Refine job builder

* Disable dump job if not debug mode

Co-authored-by: default avatarSnow <snow3s@qq.com>
Co-authored-by: default avatarJuncheng <liujuncheng1022@gmail.com>
parent 465ee822
Showing with 594 additions and 62 deletions
......@@ -8,6 +8,8 @@ option(BUILD_RDMA "" OFF)
option(BUILD_CUDA "" ON)
option(RELEASE_VERSION "" ON)
option(PY3 "" OFF)
option(WITH_XLA "Option to build with XLA" OFF)
option(WITH_TENSORRT "Option to build with TensorRT" OFF)
if(NOT RELEASE_VERSION)
set(CUDNN_STATIC OFF CACHE BOOL "")
......@@ -20,6 +22,13 @@ else()
project(oneflow C CXX)
endif()
if (WITH_XLA)
add_definitions(-DWITH_XLA)
endif()
if (WITH_TENSORRT)
add_definitions(-DWITH_TENSORRT)
endif()
enable_testing()
set(CMAKE_CXX_STANDARD 11)
set(CMAKE_POSITION_INDEPENDENT_CODE ON)
......@@ -65,7 +74,7 @@ if(WIN32)
#set(CMAKE_EXE_LINKER_FLAGS_DEBUG "${CMAKE_EXE_LINKER_FLAGS} /DEBUG:FASTLINK")
set(CMAKE_CXX_FLAGS_DEBUG "${CMAKE_CXX_FLAGS_DEBUG} /D_ITERATOR_DEBUG_LEVEL=0")
else()
list(APPEND CUDA_NVCC_FLAGS -std=c++11 -w -Wno-deprecated-gpu-targets)
list(APPEND CUDA_NVCC_FLAGS -w -Wno-deprecated-gpu-targets)
# half is not fully supported when __CUDA_ARCH__ < 530
# list(APPEND __cuda_nvcc_gencodes "arch=compute_30,code=sm_30")
# list(APPEND __cuda_nvcc_gencodes "arch=compute_30,code=compute_30")
......@@ -85,10 +94,12 @@ else()
foreach(CUDA_NVCC_GENCODE ${CUDA_NVCC_GENCODES})
list(APPEND CUDA_NVCC_FLAGS -gencode ${CUDA_NVCC_GENCODE})
endforeach()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -Wall -Wno-sign-compare -Wno-unused-function -fPIC")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -g -std=c++11 -Wall -Wno-sign-compare -Wno-unused-function -fPIC")
if (RELEASE_VERSION)
list(APPEND CUDA_NVCC_FLAGS -O3)
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3")
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O3 -DNDEBUG")
else()
set(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -O0")
endif()
endif()
......@@ -97,4 +108,5 @@ if (THIRD_PARTY)
set(THIRD_PARTY OFF CACHE BOOL "" FORCE)
else()
include(oneflow)
configure_file(${PROJECT_SOURCE_DIR}/setup.py.in ${PROJECT_BINARY_DIR}/setup.py)
endif()
......@@ -42,3 +42,76 @@ or you can just clone source code and submodules step by step
```
cmake -DTHIRD_PARTY=OFF .. && make -j
```
### Build with XLA
- Install bazel
Download and install bazel from [here](https://docs.bazel.build/versions/1.0.0/bazel-overview.html); version 0.24.1 is recommended. You can confirm that bazel is installed successfully by running the following command:
```shell
bazel version
```
- Update cmake
This step is needed only if your installed CMake does not support downloading .tgz files over https. You can skip it for now and come back to reinstall CMake if you hit a download error while building the third-parties.
Download CMake (>=3.7) from [here](https://cmake.org/download/), then configure and install it with the following commands:
```shell
# Install curl develop toolkit
sudo yum install libcurl-devel
# install cmake
cd cmake && ./bootstrap --system-curl --prefix=$your_path && make install
```
- Build third-parties
Run the following commands to build the third-parties.
```shell
cd build && cmake -DWITH_XLA=ON -DTHIRD_PARTY=ON ..
make -j$(nproc)
```
If a download error occurs, go back to the previous step to reinstall CMake, then delete CMakeCache.txt and build the third-parties again.
- Build OneFlow
```shell
cmake .. \
-DWITH_XLA=ON \
-DPYTHON_LIBRARY=your_python_lib_path \
-DPYTHON_INCLUDE_DIR=your_python_include_dir \
-DPython_NumPy_INCLUDE_DIRS=your_numpy_include_dir
make -j$(nproc)
```
- XLA documents
You can check this [doc](./oneflow/xrt/README.md) to obtain more details about how to use XLA.
### Build with TensorRT
- Build third-parties
Run the following commands to build the third-parties.
```shell
cd build && cmake -DWITH_TENSORRT=ON -DTHIRD_PARTY=ON ..
make -j$(nproc)
```
- Build OneFlow
```shell
cmake .. \
-DWITH_TENSORRT=ON \
-DPYTHON_LIBRARY=your_python_lib_path \
-DPYTHON_INCLUDE_DIR=your_python_include_dir \
-DPython_NumPy_INCLUDE_DIRS=your_numpy_include_dir
make -j$(nproc)
```
......@@ -45,6 +45,24 @@ foreach(oneflow_hdr_to_be_expanded ${oneflow_all_hdr_to_be_expanded})
endforeach()
file(GLOB_RECURSE oneflow_all_src "${PROJECT_SOURCE_DIR}/oneflow/core/*.*" "${PROJECT_SOURCE_DIR}/oneflow/python/*.*")
if (WITH_XLA OR WITH_TENSORRT)
file(GLOB_RECURSE oneflow_xrt_src "${PROJECT_SOURCE_DIR}/oneflow/xrt/*.*")
if (NOT WITH_XLA)
file(GLOB_RECURSE xla_removing_src "${PROJECT_SOURCE_DIR}/oneflow/xrt/xla/*.*")
endif ()
if (NOT WITH_TENSORRT)
file(GLOB_RECURSE trt_removing_src "${PROJECT_SOURCE_DIR}/oneflow/xrt/tensorrt/*.*")
endif ()
list(APPEND xrt_removing_srcs ${xla_removing_src})
list(APPEND xrt_removing_srcs ${trt_removing_src})
# message(STATUS "removing_srcs: ${xrt_removing_srcs}")
foreach (removing_file ${xrt_removing_srcs})
list(REMOVE_ITEM oneflow_xrt_src ${removing_file})
endforeach ()
list(APPEND oneflow_all_src ${oneflow_xrt_src})
endif()
foreach(oneflow_single_file ${oneflow_all_src})
# Verify whether this file is for other platforms
set(exclude_this OFF)
......@@ -70,33 +88,33 @@ foreach(oneflow_single_file ${oneflow_all_src})
set(group_this ON)
endif()
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.h$")
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.h$")
list(APPEND of_all_obj_cc ${oneflow_single_file})
set(group_this ON)
endif()
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.cuh$")
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.cuh$")
if(BUILD_CUDA)
list(APPEND of_all_obj_cc ${oneflow_single_file})
endif()
set(group_this ON)
endif()
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.cu$")
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.cu$")
if(BUILD_CUDA)
list(APPEND of_all_obj_cc ${oneflow_single_file})
endif()
set(group_this ON)
endif()
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.proto$")
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.proto$")
list(APPEND of_all_proto ${oneflow_single_file})
#list(APPEND of_all_obj_cc ${oneflow_single_file}) # include the proto file in the project
set(group_this ON)
endif()
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*\\.cpp$")
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/core/.*_test\\.cpp$")
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*\\.cpp$")
if("${oneflow_single_file}" MATCHES "^${PROJECT_SOURCE_DIR}/oneflow/(core|xrt)/.*_test\\.cpp$")
# test file
# list(APPEND of_all_test_cc ${oneflow_single_file})
else()
......
......@@ -15,6 +15,17 @@ include(cocoapi)
include(half)
include(json)
if (WITH_XLA)
include(tensorflow)
endif()
if (WITH_TENSORRT)
if (NOT WITH_XLA)
include(absl)
endif()
include(tensorrt)
endif()
if (BUILD_CUDA)
set(CUDA_SEPARABLE_COMPILATION ON)
find_package(CUDA REQUIRED)
......@@ -114,6 +125,11 @@ if (BUILD_CUDA)
include(cub)
include(nccl)
if (WITH_XLA)
# Fix conflicts between tensorflow cublas dso and oneflow static cublas.
# TODO(hjchen2) Should file an issue about this fix.
list(APPEND oneflow_third_party_libs -Wl,--whole-archive ${cuda_lib_dir}/libcublas_static.a -Wl,--no-whole-archive)
endif()
list(APPEND oneflow_third_party_libs ${CUDA_LIBRARIES})
list(APPEND oneflow_third_party_libs ${CUDNN_LIBRARIES})
list(APPEND oneflow_third_party_libs ${NCCL_STATIC_LIBRARIES})
......@@ -150,6 +166,17 @@ if(BUILD_RDMA)
endif()
endif()
if(WITH_XLA)
list(APPEND oneflow_third_party_libs ${TENSORFLOW_XLA_LIBRARIES})
endif()
if(WITH_TENSORRT)
if (NOT WITH_XLA)
list(APPEND oneflow_third_party_libs ${ABSL_LIBRARIES})
endif()
list(APPEND oneflow_third_party_libs ${TENSORRT_LIBRARIES})
endif()
message(STATUS "oneflow_third_party_libs: " ${oneflow_third_party_libs})
add_definitions(-DHALF_ENABLE_CPP11_USER_LITERALS=0)
include (ExternalProject)
SET(ABSL_PROJECT absl)
SET(ABSL_GIT_URL https://github.com/abseil/abseil-cpp.git)
SET(ABSL_GIT_TAG 43ef2148c0936ebf7cb4be6b19927a9d9d145b8f)
SET(ABSL_SOURCE_DIR ${CMAKE_CURRENT_BINARY_DIR}/third_party/absl)
SET(ABSL_INSTALL_DIR ${THIRD_PARTY_DIR}/absl)
SET(ABSL_INCLUDE_DIR ${ABSL_INSTALL_DIR}/include CACHE PATH "" FORCE)
SET(ABSL_LIBRARY_DIR ${ABSL_INSTALL_DIR}/lib CACHE PATH "" FORCE)
INCLUDE_DIRECTORIES(${ABSL_INCLUDE_DIR})
LINK_DIRECTORIES(${ABSL_LIBRARY_DIR})
SET(ABSL_LIBRARIES
${ABSL_LIBRARY_DIR}/libabsl_base.a
${ABSL_LIBRARY_DIR}/libabsl_spinlock_wait.a
${ABSL_LIBRARY_DIR}/libabsl_dynamic_annotations.a
${ABSL_LIBRARY_DIR}/libabsl_malloc_internal.a
${ABSL_LIBRARY_DIR}/libabsl_throw_delegate.a
${ABSL_LIBRARY_DIR}/libabsl_int128.a
${ABSL_LIBRARY_DIR}/libabsl_strings.a
${ABSL_LIBRARY_DIR}/libabsl_str_format_internal.a
${ABSL_LIBRARY_DIR}/libabsl_time.a
${ABSL_LIBRARY_DIR}/libabsl_bad_optional_access.a)
if (THIRD_PARTY)
ExternalProject_Add(${ABSL_PROJECT}
PREFIX ${ABSL_SOURCE_DIR}
GIT_REPOSITORY ${ABSL_GIT_URL}
GIT_TAG ${ABSL_GIT_TAG}
UPDATE_COMMAND ""
CMAKE_ARGS
-DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
-DBUILD_SHARED_LIBS:BOOL=OFF
-DCMAKE_CXX_FLAGS:STRING=${CMAKE_CXX_FLAGS}
-DCMAKE_CXX_FLAGS_DEBUG:STRING=${CMAKE_CXX_FLAGS_DEBUG}
-DCMAKE_CXX_FLAGS_RELEASE:STRING=${CMAKE_CXX_FLAGS_RELEASE}
CMAKE_CACHE_ARGS
-DCMAKE_INSTALL_PREFIX:PATH=${ABSL_INSTALL_DIR}
-DCMAKE_INSTALL_LIBDIR:PATH=${ABSL_LIBRARY_DIR}
-DCMAKE_POSITION_INDEPENDENT_CODE:BOOL=ON
-DCMAKE_BUILD_TYPE:STRING=${CMAKE_BUILD_TYPE}
)
endif(THIRD_PARTY)
......@@ -3,9 +3,18 @@ include (ExternalProject)
set(EIGEN_INCLUDE_DIR ${THIRD_PARTY_DIR}/eigen/include/eigen3)
set(EIGEN_INSTALL_DIR ${THIRD_PARTY_DIR}/eigen)
set(EIGEN_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/eigen/src/eigen)
if(WITH_XLA)
#set(EIGEN_URL "https://storage.googleapis.com/mirror.tensorflow.org/bitbucket.org/eigen/eigen/get/8071cda5714d.tar.gz")
set(EIGEN_URL "https://bitbucket.org/eigen/eigen/get/8071cda5714d.tar.gz")
else()
set(EIGEN_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/eigen/src/eigen)
endif()
add_definitions(-DEIGEN_NO_AUTOMATIC_RESIZING -DEIGEN_NO_MALLOC -DEIGEN_USE_GPU)
add_definitions(-DEIGEN_NO_AUTOMATIC_RESIZING -DEIGEN_USE_GPU)
if (NOT WITH_XLA)
add_definitions(-DEIGEN_NO_MALLOC)
endif()
#add_definitions(-DEIGEN_NO_AUTOMATIC_RESIZING -DEIGEN_NO_MALLOC -DEIGEN_USE_GPU)
if (THIRD_PARTY)
......
......@@ -5,7 +5,11 @@ set(PROTOBUF_LIBRARY_DIR ${THIRD_PARTY_DIR}/protobuf/lib)
set(PROTOBUF_BINARY_DIR ${THIRD_PARTY_DIR}/protobuf/bin)
set(PROTOBUF_SRC_DIR ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/src)
set(PROTOBUF_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/protobuf/src/protobuf)
if(WITH_XLA)
set(PROTOBUF_URL "https://storage.googleapis.com/mirror.tensorflow.org/github.com/protocolbuffers/protobuf/archive/310ba5ee72661c081129eb878c1bbcec936b20f0.tar.gz")
else()
set(PROTOBUF_URL ${CMAKE_CURRENT_BINARY_DIR}/third_party/protobuf/src/protobuf)
endif()
if(WIN32)
set(PROTOBUF_BUILD_LIBRARY_DIR ${CMAKE_CURRENT_BINARY_DIR}/protobuf/src/protobuf/${CMAKE_BUILD_TYPE})
......
include (ExternalProject)
if (WITH_XLA)
list(APPEND TENSORFLOW_BUILD_CMD --define with_xla_support=true)
if (RELEASE_VERSION)
list(APPEND TENSORFLOW_BUILD_CMD -c opt)
set(TENSORFLOW_GENFILE_DIR k8-opt)
else()
list(APPEND TENSORFLOW_BUILD_CMD --copt=-g -c dbg)
set(TENSORFLOW_GENFILE_DIR k8-dbg)
endif()
set(TF_WITH_CUDA ON)
if (TF_WITH_CUDA)
set(CUDA_COMPUTE_CAPABILITIES "6.0,6.1")
if (NOT CUDA_VERSION VERSION_LESS "10.0")
set(CUDA_COMPUTE_CAPABILITIES "${CUDA_COMPUTE_CAPABILITIES},7.0")
endif()
list(APPEND TENSORFLOW_BUILD_CMD --config=cuda)
list(APPEND TENSORFLOW_BUILD_CMD --action_env TF_NEED_CUDA=1)
list(APPEND TENSORFLOW_BUILD_CMD --action_env TF_CUDA_COMPUTE_CAPABILITIES=${CUDA_COMPUTE_CAPABILITIES})
endif()
message(STATUS ${TENSORFLOW_BUILD_CMD})
set(TENSORFLOW_PROJECT tensorflow)
set(TENSORFLOW_GIT_URL https://github.com/tensorflow/tensorflow.git)
#set(TENSORFLOW_GIT_TAG master)
set(TENSORFLOW_GIT_TAG 80c04b80ad66bf95aa3f41d72a6bba5e84a99622)
set(TENSORFLOW_SOURCES_DIR ${THIRD_PARTY_DIR}/tensorflow)
set(TENSORFLOW_SRCS_DIR ${TENSORFLOW_SOURCES_DIR}/src/tensorflow)
set(TENSORFLOW_INC_DIR ${TENSORFLOW_SOURCES_DIR}/src/tensorflow)
set(PATCHES_DIR ${PROJECT_SOURCE_DIR}/oneflow/xrt/patches)
set(TENSORFLOW_JIT_DIR ${TENSORFLOW_SRCS_DIR}/tensorflow/compiler/jit)
set(TENSORFLOW_GEN_DIR ${TENSORFLOW_SRCS_DIR}/bazel-out/${TENSORFLOW_GENFILE_DIR}/genfiles)
set(TENSORFLOW_EXTERNAL_DIR ${TENSORFLOW_SRCS_DIR}/bazel-tensorflow/external)
set(THIRD_ABSL_DIR ${TENSORFLOW_EXTERNAL_DIR}/com_google_absl)
set(THIRD_PROTOBUF_DIR ${TENSORFLOW_EXTERNAL_DIR}/com_google_protobuf/src)
set(THIRD_BORINGSSL_DIR ${TENSORFLOW_EXTERNAL_DIR}/boringssl/src)
set(THIRD_SNAPPY_DIR ${TENSORFLOW_EXTERNAL_DIR}/snappy)
list(APPEND TENSORFLOW_XLA_INCLUDE_DIR
${TENSORFLOW_INC_DIR}
${TENSORFLOW_GEN_DIR}
${THIRD_ABSL_DIR}
${THIRD_PROTOBUF_DIR}
${THIRD_BORINGSSL_DIR}
${THIRD_SNAPPY_DIR}
)
include_directories(${TENSORFLOW_XLA_INCLUDE_DIR})
list(APPEND TENSORFLOW_XLA_LIBRARIES libtensorflow_framework.so.1)
list(APPEND TENSORFLOW_XLA_LIBRARIES libxla_core.so)
link_directories(
${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow
${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow/compiler/jit/xla_lib
)
if (THIRD_PARTY)
ExternalProject_Add(${TENSORFLOW_PROJECT}
PREFIX ${TENSORFLOW_SOURCES_DIR}
GIT_REPOSITORY ${TENSORFLOW_GIT_URL}
GIT_TAG ${TENSORFLOW_GIT_TAG}
PATCH_COMMAND patch -Np1 < ${PATCHES_DIR}/xla.patch
CONFIGURE_COMMAND ""
BUILD_COMMAND cd ${TENSORFLOW_SRCS_DIR} &&
bazel build ${TENSORFLOW_BUILD_CMD} -j 20 //tensorflow/compiler/jit/xla_lib:libxla_core.so
INSTALL_COMMAND ""
)
endif(THIRD_PARTY)
set(TENSORFLOW_XLA_FRAMEWORK_LIB ${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow/libtensorflow_framework.so.1)
set(TENSORFLOW_XLA_CORE_LIB ${TENSORFLOW_SRCS_DIR}/bazel-bin/tensorflow/compiler/jit/xla_lib/libxla_core.so)
endif(WITH_XLA)
include (ExternalProject)
if (WITH_TENSORRT)
find_path(TENSORRT_INCLUDE_DIR NvInfer.h
PATHS ${TENSORRT_ROOT} ${TENSORRT_ROOT}/include
$ENV{TENSORRT_ROOT} $ENV{TENSORRT_ROOT}/include
${THIRD_PARTY_DIR}/tensorrt/include)
find_library(TENSORRT_LIBRARIES NAMES libnvinfer.so libnvinfer.a
PATHS ${TENSORRT_ROOT} ${TENSORRT_ROOT}/lib
$ENV{TENSORRT_ROOT} $ENV{TENSORRT_ROOT}/lib
${THIRD_PARTY_DIR}/tensorrt/lib)
if (TENSORRT_INCLUDE_DIR AND TENSORRT_LIBRARIES)
else()
message(FATAL_ERROR "TensorRT was not found. You can set TENSORRT_ROOT to specify the search path.")
endif()
message(STATUS "TensorRT Include: ${TENSORRT_INCLUDE_DIR}")
message(STATUS "TensorRT Lib: ${TENSORRT_LIBRARIES}")
include_directories(${TENSORRT_INCLUDE_DIR})
endif(WITH_TENSORRT)
#include "oneflow/core/common/protobuf.h"
#include "oneflow/core/common/shape.pb.h"
#include "oneflow/core/common/str_util.h"
#include "oneflow/core/register/blob_desc.pb.h"
#include <google/protobuf/io/coded_stream.h>
......@@ -88,6 +89,11 @@ int32_t GetEnumFromPbMessage(const PbMessage& msg, const std::string& field_name
OF_PP_FOR_EACH_TUPLE(DEFINE_SET_VAL_IN_PBMESSAGE, PROTOBUF_BASIC_DATA_TYPE_SEQ)
const PbMessage& GetMessageInPbMessage(const PbMessage& msg, const std::string& field_name) {
PROTOBUF_REFLECTION(msg, field_name);
return r->GetMessage(msg, fd);
}
PbMessage* MutableMessageInPbMessage(PbMessage* msg, const std::string& field_name) {
PROTOBUF_REFLECTION((*msg), field_name);
return r->MutableMessage(msg, fd);
......@@ -115,6 +121,67 @@ PbMessage* MutableMessageInPbMessage(PbMessage* msg, int field_index) {
return r->MutableMessage(msg, fd);
}
#define DECLARE_GETTER_FUNC_HEADER(type) \
template<> \
type GetValFromPbMessage<type>(const PbMessage& msg, const std::string& field_name)
#define DECLARE_SETTER_FUNC_HEADER(type) \
template<> \
void SetValInPbMessage<type>(PbMessage * msg, const std::string& field_name, const type& val)
#define DEFINE_MESSAGE_VAL_GETTER_AND_SETTER(message_type) \
DECLARE_GETTER_FUNC_HEADER(message_type) { \
PROTOBUF_REFLECTION(msg, field_name); \
return *dynamic_cast<const message_type*>(&r->GetMessage(msg, fd)); \
} \
DECLARE_SETTER_FUNC_HEADER(message_type) { \
PROTOBUF_REFLECTION((*msg), field_name); \
r->MutableMessage(msg, fd)->CopyFrom(val); \
}
DEFINE_MESSAGE_VAL_GETTER_AND_SETTER(ShapeProto);
#define DEFINE_ENUM_VAL_GETTER_AND_SETTER(enum_type) \
DECLARE_GETTER_FUNC_HEADER(enum_type) { \
PROTOBUF_REFLECTION(msg, field_name); \
return static_cast<enum_type>(r->GetEnumValue(msg, fd)); \
} \
DECLARE_SETTER_FUNC_HEADER(enum_type) { \
PROTOBUF_REFLECTION((*msg), field_name); \
r->SetEnumValue(msg, fd, val); \
}
DEFINE_ENUM_VAL_GETTER_AND_SETTER(DataType);
#define DEFINE_VECTOR_VAL_GETTER_AND_SETTER(vec_type, vec_type_name) \
DECLARE_GETTER_FUNC_HEADER(vec_type) { \
PROTOBUF_REFLECTION(msg, field_name); \
int32_t field_size = r->FieldSize(msg, fd); \
vec_type retval(field_size); \
for (int i = 0; i < field_size; ++i) { retval[i] = r->Get##vec_type_name(msg, fd, i); } \
return std::move(retval); \
} \
DECLARE_SETTER_FUNC_HEADER(vec_type) { \
PROTOBUF_REFLECTION((*msg), field_name); \
for (int i = 0; i < val.size(); ++i) { r->Set##vec_type_name(msg, fd, i, val[i]); } \
}
#define MAKE_REPEATED_TUPLE_SEQ(type, type_name) \
OF_PP_MAKE_TUPLE_SEQ(std::vector<type>, Repeated##type_name)
#define PROTOBUF_BASIC_REPEATED_DATA_TYPE_SEQ \
MAKE_REPEATED_TUPLE_SEQ(std::string, String) \
MAKE_REPEATED_TUPLE_SEQ(int32_t, Int32) \
MAKE_REPEATED_TUPLE_SEQ(uint32_t, UInt32) \
MAKE_REPEATED_TUPLE_SEQ(int64_t, Int64) \
MAKE_REPEATED_TUPLE_SEQ(uint64_t, UInt64) \
MAKE_REPEATED_TUPLE_SEQ(float, Float) \
MAKE_REPEATED_TUPLE_SEQ(double, Double) \
MAKE_REPEATED_TUPLE_SEQ(int16_t, EnumValue) \
MAKE_REPEATED_TUPLE_SEQ(bool, Bool)
OF_PP_FOR_EACH_TUPLE(DEFINE_VECTOR_VAL_GETTER_AND_SETTER, PROTOBUF_BASIC_REPEATED_DATA_TYPE_SEQ);
#define DEFINE_ADD_VAL_IN_PBRF(cpp_type, pb_type_name) \
template<> \
void AddValInPbRf(PbMessage* msg, const std::string& field_name, const cpp_type& val) { \
......
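The hunk above adds GetValFromPbMessage/SetValInPbMessage specializations for message-typed, enum, and repeated fields. Below is a minimal usage sketch, not part of this commit: the caller and the field names "shape" and "dim" are hypothetical, and only the specializations defined above are exercised.

```cpp
#include <vector>

#include "oneflow/core/common/protobuf.h"
#include "oneflow/core/common/shape.pb.h"

namespace oneflow {

// Hypothetical helper: `conf` is any PbMessage that has a ShapeProto field
// named "shape" and a repeated int64 field named "dim" (names are assumptions).
std::vector<int64_t> MirrorShape(const PbMessage& conf, PbMessage* mut_conf) {
  // Message-typed field: returned by value through the new ShapeProto specialization.
  ShapeProto shape = GetValFromPbMessage<ShapeProto>(conf, "shape");
  // Writing a message-typed field copies the value into the target message.
  SetValInPbMessage<ShapeProto>(mut_conf, "shape", shape);
  // Repeated scalar fields come back as a std::vector via the repeated-field getter.
  return GetValFromPbMessage<std::vector<int64_t>>(conf, "dim");
}

}  // namespace oneflow
```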
......@@ -36,6 +36,7 @@ using PbMd = google::protobuf::util::MessageDifferencer;
OF_PP_MAKE_TUPLE_SEQ(int64_t, Int64) \
OF_PP_MAKE_TUPLE_SEQ(uint64_t, UInt64) \
OF_PP_MAKE_TUPLE_SEQ(float, Float) \
OF_PP_MAKE_TUPLE_SEQ(double, Double) \
OF_PP_MAKE_TUPLE_SEQ(int16_t, EnumValue) \
OF_PP_MAKE_TUPLE_SEQ(bool, Bool)
......@@ -92,6 +93,7 @@ template<typename T>
void SetValInPbMessage(PbMessage* msg, const std::string& field_name, const T& val);
const PbMessage& GetMessageInPbMessage(const PbMessage& msg, int field_index);
const PbMessage& GetMessageInPbMessage(const PbMessage& msg, const std::string& field_name);
PbMessage* MutableMessageInPbMessage(PbMessage*, const std::string& field_name);
PbMessage* MutableMessageInPbMessage(PbMessage*, int field_index);
......
#ifndef ONEFLOW_CORE_REGISTER_SHAPE_VIEW_H_
#define ONEFLOW_CORE_REGISTER_SHAPE_VIEW_H_
#include "oneflow/core/common/util.h"
#include "oneflow/core/common/shape_vec.h"
namespace oneflow {
......
......@@ -35,7 +35,9 @@ void NormalForwardCompTaskNode::ProduceAllRegstsAndBindEdges() {
}
void NormalForwardCompTaskNode::ConsumeAllRegsts() {
ForEachInDataEdge([&](TaskEdge* edge) { ConsumeRegst("in", edge->GetSoleRegst()); });
ForEachInDataEdge([&](TaskEdge* edge) {
for (const auto& regst : edge->GetRegsts()) { ConsumeRegst("in", regst); }
});
}
bool NormalForwardCompTaskNode::IsReadyForBuild() {
......
......@@ -4,7 +4,9 @@
namespace oneflow {
void OptimizerCompTaskNode::ConsumeAllRegsts() {
ForEachInDataEdge([&](TaskEdge* edge) { ConsumeRegst("in", edge->GetSoleRegst()); });
ForEachInDataEdge([&](TaskEdge* edge) {
for (const auto& regst : edge->GetRegsts()) { ConsumeRegst("in", regst); }
});
}
void OptimizerCompTaskNode::ProduceAllRegstsAndBindEdges() { ProduceRegst("tmp", false, 1, 1); }
......
......@@ -5,48 +5,28 @@
namespace oneflow {
namespace {
int32_t GetDataRegstDescCnt(
const HashMap<std::string, std::shared_ptr<RegstDesc>> name2regst_desc) {
size_t cnt = 0;
for (const auto& pair : name2regst_desc) {
cnt += pair.second->regst_desc_type().has_data_regst_desc();
}
return cnt;
}
} // namespace
void ReduceSplitCompTaskNode::ProduceAllRegstsAndBindEdges() {
std::vector<EdgeInfo> edge_infos;
std::shared_ptr<Operator> reduce_split_op = this->logical_node()->SoleOp();
HashMap<LogicalBlobId, int32_t> lbi2order;
std::shared_ptr<Operator> reduce_split_op = this->logical_node()->SoleOp();
FOR_RANGE(int32_t, idx, 0, reduce_split_op->output_bns().size()) {
ProduceRegst("out_" + std::to_string(idx), false, 1, 1);
const auto& lbi = reduce_split_op->BnInOp2Lbi(reduce_split_op->output_bns().Get(idx));
CHECK(lbi2order.emplace(lbi, idx).second);
}
ForEachOutDataEdge([&](TaskEdge* edge) {
TaskNode* dst_node = edge->dst_node();
CHECK(edge->dst_node()->GetTaskType() == TaskType::kOptimizer
|| edge->dst_node()->GetTaskType() == TaskType::kNormalForward);
CompTaskNode* mdupdt_node = dynamic_cast<CompTaskNode*>(dst_node);
std::shared_ptr<Operator> mdupdt_op = mdupdt_node->logical_node()->SoleOp();
int32_t order = -1;
for (const std::string& ibn : mdupdt_op->input_bns()) {
const auto& order_it = lbi2order.find(mdupdt_op->BnInOp2Lbi(ibn));
if (order_it != lbi2order.end()) { order = order_it->second; }
if (order_it != lbi2order.end()) {
BindEdgeWithProducedRegst(edge, "out_" + std::to_string(order_it->second));
}
}
CHECK_NE(order, -1);
EdgeInfo edge_info{edge, order};
edge_infos.emplace_back(edge_info);
});
SortEdges(&edge_infos);
FOR_RANGE(size_t, idx, 0, edge_infos.size()) {
std::string out_regst_name = "out_" + std::to_string(idx);
std::shared_ptr<RegstDesc> out_regst = ProduceRegst(out_regst_name, false, 1, 1);
edge_infos[idx].edge->AddRegst(out_regst_name, out_regst);
}
}
void ReduceSplitCompTaskNode::ConsumeAllRegsts() {
......@@ -68,22 +48,23 @@ void ReduceSplitCompTaskNode::BuildExecGphAndRegst() {
node->BindBnWithRegst(reduce_split_op->SoleIbn(), GetSoleConsumedRegst("in"));
FOR_RANGE(size_t, i, 0, reduce_split_op->output_bns().size()) {
std::shared_ptr<RegstDesc> out_regst = GetProducedRegst("out_" + std::to_string(i));
std::string blob_name = "out_" + std::to_string(i);
std::shared_ptr<RegstDesc> out_regst = GetProducedRegst(blob_name);
CHECK(out_regst.get() != nullptr);
out_regst->AddLbi(reduce_split_op->BnInOp2Lbi(reduce_split_op->output_bns().Get(i)));
node->BindBnWithRegst(reduce_split_op->output_bns().Get(i), out_regst);
out_regst->AddLbi(reduce_split_op->BnInOp2Lbi(blob_name));
node->BindBnWithRegst(blob_name, out_regst);
}
node->InferBlobDescs(parallel_ctx());
}
void ReduceSplitCompTaskNode::EnableMemSharingInReduce(const ReduceMemSharingCtx& ctx) {
CHECK_EQ(GetRankCtx().TotalSegmentCount(), 1);
size_t split_num = GetDataRegstDescCnt(produced_regsts());
std::shared_ptr<Operator> reduce_split_op = this->logical_node()->SoleOp();
int64_t offset = 0;
FOR_RANGE(int32_t, idx, 0, split_num) {
RegstDesc* split_out_regst = GetProducedRegst("out_" + std::to_string(idx)).get();
ctx.EnableMemSharing4Regst(split_out_regst, offset);
offset += InferRegstSize(*split_out_regst);
for (int i = 0; i < reduce_split_op->output_bns().size(); ++i) {
RegstDesc* out_regst = GetProducedRegst("out_" + std::to_string(i)).get();
ctx.EnableMemSharing4Regst(out_regst, offset);
offset += InferRegstSize(*out_regst);
}
}
......
......@@ -46,6 +46,20 @@ message MemoryAllocationAlgorithmConf {
optional bool use_time_line_algo = 3 [default = false];
}
message XrtConfig {
message XlaConfig {
// TODO
}
message TensorRTConfig {
optional bool use_fp16 = 1 [default = false];
optional bool use_int8 = 2 [default = false];
}
optional bool use_xla_jit = 1 [default = false];
optional bool use_tensorrt = 2 [default = false];
optional XlaConfig xla_config = 3;
optional TensorRTConfig tensorrt_config = 4;
}
message JobConfigProto {
required string job_name = 1;
......@@ -65,6 +79,8 @@ message JobConfigProto {
optional bool use_memory_allocation_algorithm_v2 = 101 [default = true];
optional MemoryAllocationAlgorithmConf memory_allocation_algorithm_conf = 102;
optional XrtConfig xrt_config = 103;
optional bool enable_cudnn = 200 [default = true];
optional int64 cudnn_buf_limit_mbyte = 201 [default = 1024]; // 1GByte
optional int32 cudnn_conv_force_fwd_algo = 202;
......
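For orientation, the sketch below shows what the XrtConfig fields added above look like through the protobuf-generated C++ accessors. It is not part of this commit; the generated header path is an assumption, and the accessor names simply follow standard protobuf codegen for the fields shown in the hunk.

```cpp
#include "oneflow/core/job/job_conf.pb.h"  // assumed path of the generated header

// Enable the TensorRT engine with FP16 kernels for a job (illustration only).
void EnableTensorRtFp16(oneflow::JobConfigProto* job_conf) {
  oneflow::XrtConfig* xrt = job_conf->mutable_xrt_config();
  xrt->set_use_tensorrt(true);
  xrt->mutable_tensorrt_config()->set_use_fp16(true);
}
```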
......@@ -30,6 +30,17 @@ JobBuilder::JobBuilder(Job* job) : job_(job) {
op_name2parallel_conf_.emplace(op_name, placemnt_group->mutable_parallel_conf()).second);
}
}
auto* sbp_conf = job->mutable_sbp_conf();
for (auto& pair : *(sbp_conf->mutable_op_name2sbp_signature_conf())) {
op_name2sbp_signature_conf_.emplace(pair.first, &pair.second);
}
for (auto& pair : *(job->mutable_helper()->mutable_lbn2batch_axis())) {
lbn2batch_axis_.emplace(pair.first, &pair.second);
}
auto* helper_conf = job->mutable_helper();
for (auto& pair : *(helper_conf->mutable_op_name2op_time_shape())) {
op_name2time_shapes_.emplace(pair.first, &pair.second);
}
FOR_RANGE(int32_t, i, 0, job->placement().blob_placement_group_size()) {
auto* blob_pg = job->mutable_placement()->mutable_blob_placement_group(i);
for (const auto& lbi : blob_pg->lbi()) {
......@@ -38,12 +49,14 @@ JobBuilder::JobBuilder(Job* job) : job_(job) {
}
}
const OperatorConf& JobBuilder::OpConf4OpName(const std::string& op_name) const {
return *op_name2op_conf_.at(op_name);
OperatorConf* JobBuilder::MutableOpConf4OpName(const std::string& op_name) {
const auto& it = op_name2op_conf_.find(op_name);
CHECK(it != op_name2op_conf_.end());
return it->second;
}
const ParallelConf& JobBuilder::ParallelConf4OpName(const std::string& op_name) const {
return *op_name2parallel_conf_.at(op_name);
const OperatorConf& JobBuilder::OpConf4OpName(const std::string& op_name) const {
return *op_name2op_conf_.at(op_name);
}
const ParallelConf& JobBuilder::ParallelConf4Lbi(const LogicalBlobId& lbi) const {
......@@ -89,15 +102,69 @@ void JobBuilder::MutParallelConfOnlyOnce(const std::string& op_name,
*placement_group->mutable_parallel_conf() = parallel_conf;
}
void JobBuilder::DelOps(const std::vector<OperatorConf>& op_confs) {
for (const auto& op_conf : op_confs) {
const std::string& op_name = op_conf.name();
op_name2op_conf_.erase(op_name);
auto* op_list = job_->mutable_net()->mutable_op();
auto it = std::remove_if(op_list->begin(), op_list->end(),
[&](const OperatorConf& conf) { return conf.name() == op_name; });
if (it != op_list->end()) { op_list->erase(it); }
void JobBuilder::RemoveOpByName(const std::string& op_name) {
RemoveOpByName(std::unordered_set<std::string>{op_name});
}
void JobBuilder::RemoveOpByName(const std::unordered_set<std::string>& removing_names) {
// Update net
DLNetConf net = job_->net();
job_->mutable_net()->clear_op();
for (const OperatorConf& op_conf : net.op()) {
if (removing_names.count(op_conf.name()) == 0) { *(job_->mutable_net()->add_op()) = op_conf; }
}
// Update placement
auto placement_group = job_->placement().placement_group();
job_->mutable_placement()->clear_placement_group();
for (const PlacementGroup& place : placement_group) {
PlacementGroup p;
OpNameSet* op_set = p.mutable_op_set();
for (const std::string& name : place.op_set().op_name()) {
if (removing_names.count(name) == 0) { op_set->add_op_name(name); }
}
*(p.mutable_parallel_conf()) = place.parallel_conf();
if (op_set->op_name().size() > 0) { *(job_->mutable_placement()->add_placement_group()) = p; }
}
auto* sbp_conf = job_->mutable_sbp_conf()->mutable_op_name2sbp_signature_conf();
auto* time_shape_conf = job_->mutable_helper()->mutable_op_name2op_time_shape();
for (const std::string& op_name : removing_names) {
// Update Sbp
if (sbp_conf->count(op_name) > 0) { sbp_conf->erase(op_name); }
// Update time shape
if (time_shape_conf->count(op_name) > 0) { time_shape_conf->erase(op_name); }
}
// Update batch dim lbis
// Update identical sbp oba pairs
if (job_->helper().has_identical_sbp_oba_pairs()) {
auto identical_sbp_oba_pairs = job_->helper().identical_sbp_oba_pairs().pair();
job_->mutable_helper()->mutable_identical_sbp_oba_pairs()->clear_pair();
for (const auto& pair : identical_sbp_oba_pairs) {
if (removing_names.count(pair.first().op_name()) == 0
&& removing_names.count(pair.second().op_name()) == 0) {
*(job_->mutable_helper()->mutable_identical_sbp_oba_pairs()->mutable_pair()->Add()) = pair;
}
}
}
// Update builder
JobBuilder builder(job_);
op_name2op_conf_.swap(builder.op_name2op_conf_);
op_name2parallel_conf_.swap(builder.op_name2parallel_conf_);
op_name2sbp_signature_conf_.swap(builder.op_name2sbp_signature_conf_);
lbn2batch_axis_.swap(builder.lbn2batch_axis_);
}
void JobBuilder::DelOps(const std::vector<std::string>& op_names) {
std::unordered_set<std::string> removing_names;
for (const auto& op_name : op_names) { removing_names.insert(op_name); }
RemoveOpByName(removing_names);
}
void JobBuilder::DelOps(const std::vector<OperatorConf>& op_confs) {
std::unordered_set<std::string> removing_names;
for (const auto& op_conf : op_confs) { removing_names.insert(op_conf.name()); }
RemoveOpByName(removing_names);
}
void JobBuilder::MutOpsOnlyOnce(const std::vector<OperatorConf>& op_confs) {
......@@ -130,6 +197,22 @@ void JobBuilder::ForEachOperator(const std::function<void(const Operator&)>& Han
}
}
const ParallelConf& JobBuilder::ParallelConf4OpName(const std::string& op_name) const {
return *op_name2parallel_conf_.at(op_name);
}
void JobBuilder::AddParallelConf4OpName(const std::string& op_name,
const ParallelConf& parallel_conf) {
bool update = (op_name2parallel_conf_.count(op_name) == 0);
if (update) {
// update `op_name2parallel_conf_`
PlacementGroup* group = job_->mutable_placement()->add_placement_group();
group->mutable_op_set()->add_op_name(op_name);
*(group->mutable_parallel_conf()) = parallel_conf;
op_name2parallel_conf_[op_name] = group->mutable_parallel_conf();
}
}
SbpParallel* JobBuilder::MutSbpParallel4Oba(const OpBlobArg& oba) const {
auto* sbp_sig = &(*job_->mutable_sbp_conf()->mutable_op_name2sbp_signature_conf())[oba.op_name()];
return &(*sbp_sig->mutable_bn_in_op2sbp_parallel())[oba.bn_in_op()];
......@@ -141,4 +224,54 @@ void JobBuilder::BindIdenticalSbpOpBlobArgPair(const OpBlobArg& first, const OpB
*pair->mutable_second() = second;
}
const SbpSignature& JobBuilder::SbpSignature4OpName(const std::string& op_name) const {
const auto& it = op_name2sbp_signature_conf_.find(op_name);
CHECK(it != op_name2sbp_signature_conf_.end());
return *(it->second);
}
void JobBuilder::AddSbpSignature4OpName(const std::string& op_name,
const SbpSignature& sbp_signature) {
const auto& it = op_name2sbp_signature_conf_.find(op_name);
if (it != op_name2sbp_signature_conf_.end()) {
*(it->second) = sbp_signature;
return;
}
auto* op_name2sbp_signature_conf = job_->mutable_sbp_conf()->mutable_op_name2sbp_signature_conf();
(*op_name2sbp_signature_conf)[op_name] = sbp_signature;
op_name2sbp_signature_conf_.emplace(op_name, &(*op_name2sbp_signature_conf)[op_name]);
}
const OpTimeShape& JobBuilder::TimeShape4OpName(const std::string& op_name) const {
const auto& it = op_name2time_shapes_.find(op_name);
CHECK(it != op_name2time_shapes_.end());
return *(it->second);
}
void JobBuilder::AddTimeShape4OpName(const std::string& op_name, const OpTimeShape& time_shape) {
bool update = (op_name2time_shapes_.count(op_name) == 0);
if (update) {
auto* time_shape_conf = job_->mutable_helper()->mutable_op_name2op_time_shape();
(*time_shape_conf)[op_name] = time_shape;
op_name2time_shapes_[op_name] = &((*time_shape_conf)[op_name]);
}
}
const OptInt64& JobBuilder::BatchAxis4Lbn(const std::string& lbn) const {
const auto& it = lbn2batch_axis_.find(lbn);
CHECK(it != lbn2batch_axis_.end());
return *(it->second);
}
void JobBuilder::AddBatchAxis4Lbn(const std::string& lbn, const OptInt64& axis) {
bool update =
(lbn2batch_axis_.count(lbn) == 0) || (lbn2batch_axis_[lbn]->value() != axis.value());
if (update) {
auto* batch_axis = job_->mutable_helper()->mutable_lbn2batch_axis();
(*batch_axis)[lbn] = axis;
lbn2batch_axis_[lbn] = &((*batch_axis)[lbn]);
}
}
} // namespace oneflow
......@@ -26,19 +26,37 @@ class JobBuilder final {
SbpConf* mutable_sbp_conf() { return job_->mutable_sbp_conf(); }
const OperatorConf& OpConf4OpName(const std::string& op_name) const;
const ParallelConf& ParallelConf4OpName(const std::string& op_name) const;
const ParallelConf& ParallelConf4Lbi(const LogicalBlobId& lbi) const;
OperatorConf* MutableOpConf4OpName(const std::string& op_name);
void AddOps(const ParallelConf& parallel_conf, const std::vector<OperatorConf>& op_confs);
void MutOpsOnlyOnce(const std::vector<OperatorConf>& op_confs);
void MutParallelConfOnlyOnce(const std::string& op_name, const ParallelConf& parallel_conf);
void AddOrMutOpsOnlyOnce(const ParallelConf& parallel_conf,
const std::vector<OperatorConf>& op_confs);
void RemoveOpByName(const std::string& op_name);
void RemoveOpByName(const std::unordered_set<std::string>& removing_names);
void DelOps(const std::vector<std::string>& op_names);
void DelOps(const std::vector<OperatorConf>& op_confs);
SbpParallel* MutSbpParallel4Oba(const OpBlobArg& oba) const;
void BindIdenticalSbpOpBlobArgPair(const OpBlobArg& first, const OpBlobArg& second);
void ForEachOperator(const std::function<void(const Operator&)>& Handler) const;
const ParallelConf& ParallelConf4Lbi(const LogicalBlobId& lbi) const;
const ParallelConf& ParallelConf4OpName(const std::string& op_name) const;
void AddParallelConf4OpName(const std::string& op_name, const ParallelConf& parallel_conf);
const SbpSignature& SbpSignature4OpName(const std::string& op_name) const;
void AddSbpSignature4OpName(const std::string& op_name, const SbpSignature& sbp_signature);
const OpTimeShape& TimeShape4OpName(const std::string& op_name) const;
void AddTimeShape4OpName(const std::string& op_name, const OpTimeShape& time_shape);
const OptInt64& BatchAxis4Lbn(const std::string& lbn) const;
void AddBatchAxis4Lbn(const std::string& lbn, const OptInt64& axis);
private:
PlacementGroup* FindPlacementGroup(const std::string& op_name) const;
......@@ -48,6 +66,10 @@ class JobBuilder final {
HashMap<LogicalBlobId, ParallelConf*> lbi2blob_parallel_conf_;
HashSet<std::string> modified_op_conf_op_names_;
HashSet<std::string> modified_parallel_conf_op_names_;
HashMap<std::string, SbpSignature*> op_name2sbp_signature_conf_;
HashMap<std::string, OpTimeShape*> op_name2time_shapes_;
HashMap<std::string, OptInt64*> lbn2batch_axis_;
};
} // namespace oneflow
......
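Taken together, the new JobBuilder interfaces let a rewrite pass copy sbp signatures, time shapes, and parallel confs onto a freshly added op and then drop the ops it replaced. The sketch below illustrates that flow under assumed op names ("op_a", "op_b", and "xrt_launch_0" are placeholders); it is not code from this commit.

```cpp
#include <string>
#include <unordered_set>

#include "oneflow/core/job/job_builder.h"

// Sketch of one step of a rewrite pass: transplant metadata from an absorbed
// op onto a new launch op, then remove the absorbed ops.
void ReplaceWithLaunchOp(oneflow::Job* job) {
  oneflow::JobBuilder builder(job);
  builder.AddSbpSignature4OpName("xrt_launch_0", builder.SbpSignature4OpName("op_a"));
  builder.AddTimeShape4OpName("xrt_launch_0", builder.TimeShape4OpName("op_a"));
  builder.AddParallelConf4OpName("xrt_launch_0", builder.ParallelConf4OpName("op_a"));
  // RemoveOpByName also erases the sbp, time-shape, and identical-sbp-oba-pair
  // entries that refer to the removed ops.
  builder.RemoveOpByName(std::unordered_set<std::string>{"op_a", "op_b"});
}
```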
......@@ -64,6 +64,9 @@ class JobDesc final {
bool all_reduce_fp16() const;
int64_t cudnn_buf_limit_mbyte() const { return job_conf_.cudnn_buf_limit_mbyte(); }
bool has_xrt_config() const { return job_conf_.has_xrt_config(); }
const XrtConfig& xrt_config() const { return job_conf_.xrt_config(); }
#define DEFINE_FUNCTION_CONFIG_GETTER(T, func_name, field_name) \
T func_name(const std::string& field_name) const { \
const UserOpAttrVal& attr_val = GetFunctionFlagVal(field_name); \
......
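The two JobDesc getters above are what downstream passes consult to decide whether XRT was requested for a job. A small hedged sketch follows (the helper name is made up; the real gating logic is XrtCompilationEnabled, used in the next hunk):

```cpp
#include "oneflow/core/job/job_desc.h"

// Hypothetical predicate built only from the accessors added above.
bool JobWantsTensorRt(const oneflow::JobDesc& job_desc) {
  if (!job_desc.has_xrt_config()) { return false; }
  return job_desc.xrt_config().use_tensorrt();
}
```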
......@@ -16,6 +16,8 @@
#include "oneflow/core/job_completer/add_lbi_diff_watcher.h"
#include "oneflow/core/framework/config_def.h"
#include "oneflow/core/job_completer/xrt_compilation.h"
namespace oneflow {
namespace {
......@@ -356,6 +358,15 @@ void JobCompleter::Complete(Job* job) const {
WithOpGraphAndMutJobBuilder(job, &AddGlobalOutputCriticalSections);
WithOpGraphAndMutJobBuilder(job, &DumpLogicalBlobDescAndSbpSignature);
WithOpGraphAndMutJobBuilder(job, &SetOpTimeShape7BatchAxisLbis);
if (XrtCompilationEnabled(GlobalJobDesc())) {
#ifdef OF_WITH_XRT
WithOpGraphAndMutJob(job, &RebuildXrtCompiledJob);
#else
LOG(WARNING) << "It will not use XLA or TensorRT since WITH_XLA or "
"WITH_TENSORRT was not enabled when compiling the project.";
#endif // OF_WITH_XRT
}
CheckOpGraph(OpGraph(*job));
}
......