- Jul 16, 2021
Li Xinqi authored
* refactor job_pass by maybe_system
* remove useless files
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Jul 01, 2021
daquexian authored
* add missing JUST
  Signed-off-by: daquexian <daquexian566@gmail.com>
* remove redundant header
  Signed-off-by: daquexian <daquexian566@gmail.com>
* add missing JUST in master
  Signed-off-by: daquexian <daquexian566@gmail.com>
* fix compile error on gcc5
  Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

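For context on the "add missing JUST" fixes: in this codebase, fallible functions return a Maybe<T>, and call sites are expected to wrap them in JUST(...) so an error short-circuits out of the caller instead of being silently dropped; a "missing JUST" means the error path was being ignored. Below is a minimal self-contained sketch of that pattern. It is not OneFlow's actual Maybe/JUST implementation; the type and macro here are simplified stand-ins, and the macro relies on the GCC/Clang statement-expression extension.

```cpp
#include <iostream>
#include <optional>
#include <string>

// Simplified stand-in for a Maybe<T>-style result: either a value or an error message.
template<typename T>
struct Maybe {
  std::optional<T> value;
  std::string error;
  bool ok() const { return value.has_value(); }
};

// Simplified stand-in for JUST(): unwrap on success, or return the error to the caller.
// Uses the GCC/Clang statement-expression extension.
#define JUST(expr)                                               \
  ({                                                             \
    auto&& maybe__ = (expr);                                     \
    if (!maybe__.ok()) { return {std::nullopt, maybe__.error}; } \
    *maybe__.value;                                              \
  })

Maybe<int> ParsePositive(int x) {
  if (x <= 0) { return {std::nullopt, "expected a positive value"}; }
  return {x, ""};
}

Maybe<int> Twice(int x) {
  // Without JUST, the error from ParsePositive(x) would be silently ignored.
  int v = JUST(ParsePositive(x));
  return {v * 2, ""};
}

int main() {
  std::cout << (Twice(3).ok() ? "ok" : "error") << "\n";   // prints "ok"
  std::cout << (Twice(-1).ok() ? "ok" : "error") << "\n";  // prints "error" (propagated by JUST)
}
```

Built with g++ or clang++ in C++17 mode, the second call reports the propagated error instead of computing with an invalid value, which is the failure mode a "missing JUST" would reintroduce.
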
- May 05, 2021
cheng cheng authored
* Fw/Bw support double compute stream
* NCCL comm create by stream id
* 2D NCCL logical kernel support BW independent stream
* StreamIndex: NcclComputeStream for each subgraph insert nccl logical.
* refactor code
* refine code for review
* Add WITH_CUDA in DoJobPass(InsertNcclLogicalOpPass)

- Mar 09, 2021
Juncheng authored
* Refine InferTimeShape
* fix xla
* fix xla
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Feb 24, 2021
cheng cheng authored
* disable_group_boxing and change nccl logical order to dst
* remove note
* both support insert nccl logical ops as close as possible to Src/Dst node
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Feb 19, 2021
cheng cheng authored
* Remove keep_header_only and BlobDesc::is_body_disabled
* Remove InputBlobModifier::use_header_only and UserOps set_use_header_only

- Feb 18, 2021
cheng cheng authored
* Enable insert nccl logical op pass
* FindMaxConnectedSubgraphForGpuExecOrder
* through order and interface
* implement of insert nccl logical op in pass
* add nccl logical op using UserOp Implement and EagerNcclCommMgr
* add NCCL ReduceScatter op/kernel; refine pass impl of topo order
* add NCCL logical op/kernel AllGather
* fix bug of reduce scatter / all gather infer shape
* refine log and note
* fix compiler error when building with CPU ONLY
* support NCCL ALL2ALL and test pass of alexnet model parallel
* rollback of diff in checkpointing_pass.cpp
* rename to nccl_use_compute_stream; ResourceDesc::nccl_use_compute_stream; refine name for review; create nccl_comm_ in KernelCompute
* refine code for review
* add unittest for nccl use compute stream
* format test scripts
* refine align

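The entry above adds NCCL ReduceScatter/AllGather "logical" kernels and the nccl_use_compute_stream option, whose point is that collectives launched on the kernel's own compute stream are ordered with the surrounding compute work without extra synchronization. A minimal sketch of that call pattern using the raw NCCL API follows; it is an illustration only, not OneFlow's kernel code, and the communicator, stream, and buffers are assumed to be created elsewhere (error checking omitted).

```cpp
#include <cuda_runtime.h>
#include <nccl.h>

// Illustrative only: issue a ReduceScatter followed by an AllGather on the
// *compute* stream, so the collectives serialize with the surrounding kernels
// enqueued on that same stream. `comm` and `stream` are assumed to come from a
// communicator manager and the kernel context; `elem_cnt` is assumed divisible
// by the number of ranks.
void ReduceScatterThenAllGather(const float* in, float* scattered, float* gathered,
                                size_t elem_cnt, int num_ranks,
                                ncclComm_t comm, cudaStream_t stream) {
  size_t per_rank = elem_cnt / num_ranks;
  // Each rank receives the reduced (summed) values of its own shard of the input.
  ncclReduceScatter(in, scattered, per_rank, ncclFloat, ncclSum, comm, stream);
  // Each rank then gathers every shard back into a full-size buffer.
  ncclAllGather(scattered, gathered, per_rank, ncclFloat, comm, stream);
  // No cudaStreamSynchronize() here: later kernels enqueued on `stream`
  // naturally wait for the collectives to complete.
}
```
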
- Feb 14, 2021
Li Xinqi authored
* source subset tick
* remove useless header files
* insert DstSubsetTickOp
* remove incorrect CHECK
* add tick op for each machine
* TryBindBnWithOneofRegst
* add sink tick op in main_job
* refactor LinkMainJob
* fix typo in task_graph
* refactor AddGlobalCriticalSection
* rename and refactor DstSubsetTick::InferBlobDescs and SrcSubsetTick::InferBlobDescs
* add src_subset_tick for input-output critical section
* refactor AutoSourceTick and AutoSinkTick
* SrcSubsetTickCompTaskNode: bind bns and in_regst if bns is valid in current device
* refactor optional input to repeated inputs for SrcSubsetTickOpConf
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>

- Nov 02, 2020
Li Xinqi authored
* refactor OpGraphPass to JobPass
* refactor methods of JobPassCtx

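This refactor (together with the later "refactor job_pass by maybe_system" commit at the top of this history) settles on a pass interface where each rewrite step takes the job plus a pass context and reports success or failure. The sketch below illustrates that shape with a toy registry; it is an assumption-laden stand-in, not OneFlow's actual JobPass/JobPassCtx classes or registration macro.

```cpp
#include <functional>
#include <map>
#include <memory>
#include <string>

struct Job {};          // stand-in for the job proto being rewritten
struct JobPassCtx {};   // stand-in for per-pass state and configuration

// Minimal pass interface: rewrite the job in place and signal failure via a bool.
// The real code would return a Maybe<void> so that JUST() can propagate errors.
class JobPass {
 public:
  virtual ~JobPass() = default;
  virtual bool Apply(Job* job, JobPassCtx* ctx) const = 0;
};

// Tiny registry so passes can be looked up by name and run in sequence.
std::map<std::string, std::function<std::unique_ptr<JobPass>()>>& PassRegistry() {
  static std::map<std::string, std::function<std::unique_ptr<JobPass>()>> registry;
  return registry;
}

#define REGISTER_JOB_PASS(name, T) \
  static bool name##_registered =  \
      (PassRegistry()[#name] = [] { return std::make_unique<T>(); }, true)

class InsertNcclLogicalOpPass final : public JobPass {
 public:
  bool Apply(Job* job, JobPassCtx* ctx) const override {
    // ... find connected GPU subgraphs and insert nccl logical ops ...
    return true;
  }
};
REGISTER_JOB_PASS(InsertNcclLogicalOpPass, InsertNcclLogicalOpPass);
```

The design point of the refactor is visible even in the toy version: every rewrite step shares one signature, so passes can be registered by name and chained by a driver instead of being hard-wired into the compiler.
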
- Sep 05, 2020
Juncheng authored

- Jul 28, 2020
OuYang Yu authored
Co-authored-by: Li Xinqi <lixinqi2010@gmail.com>

- Jul 23, 2020
Shenghang Tsai authored
* add license at root dir
* check in empty files
* rm space
* check in script
* update script
* fix bug
* add print
* fix
* add exit
* add to of_format
* add CI task
* fix license
* Revert "fix license"
  This reverts commit 818b6d7691d3a8b4a25dd41a47ff2c5922b8ec57.
* only add once
* quick fix
* fix script
* dont fmt empty file
* fix
* quick fix
* fix py
* add license
* fix exit
* add license for hpp
* add license
* license new vm files
Co-authored-by: tsai <caishenghang@oneflow.org>

- Mar 19, 2020
cheng cheng authored
* disable add keep header only op pass by config
* rename enable_keep_header_only

- Jan 08, 2020
- Jan 07, 2020
- Jan 02, 2020
- Dec 27, 2019
Niu Chong authored
* Add user op related proto * Add OpRegistration (#2424) * Add uncompleted op registration for cooperation review * Fix the compile bugs * Refactor the implementation macro of op_reg arg member funcs * Move impl of Attr() with default to cpp and specialize it * Add LookUpInOpRegistry() * Add UserOp as placeholder * Rename in->input and out->output * Fix the missing ctor of OpRegistrationBuilder * Add GetAllRegisteredUserOp() for debug * Add Log for every user_op registration * Add const qualifier for Builder::Build() and user_op namespace for REGISTER_USER_OP macro * Remove the LOG() from ctor of op registrar due to segment fault (maybe a glog bug) * add customized dir (#2425) * add customized dir * customized/.keep * Add map<string, ListString> output; to UserOpConf (#2426) * Substitute std::function<...> with alias name and Set default val for those function (#2428) * Add Kernel Registry (#2431) * Add Kernel Registration * Make REGISTER_USER_OP/KERNEL macro available when not in namespace of oneflow * Add missing TODO of CreateFn parameter * Add OpKernel for user op * Fix a little code style * Add GradRegistry (#2433) * implement of user_op, instead of get sbp sign (#2429) * Add UserKernel and UserKernelConf (#2438) * Add VirtualGenKernelConf() for UserOp and fill UserKernelConf * Fill KernelRegCtx * Fix typos and bugs * Add UserKernel * Add KernelRegContext(const KernelConf&) as the ctor * Dev cc python user op conf builder (#2435) * user op builder in python * user op wrapper * Implement of add default vale and check valid between c++ and python * remove notes * fixbug and runnable for UserOp complie and add sample as ccrelu * fix func and class name * check attr type in op def and op conf; refine code for review * Dev cc infer tmp size (#2445) * Refine some code and interface (#2447) * Add InferContext and Infer Util functions * Add framework.h as the only included header for user code * Fix the Dtype infer bug * Fix duplicated ret * Fix the KernelRegistration of UserKernel * Update cc_relu_op.cpp to use InferContext * Refine and Add test relu kernel * Add user_op::Blob * Update InferContext to ptr from const ref * Add user_op_conf into InferContext and Attr() * Move cc_relu_op.cpp to customized/ops/ * Add Shape and Dtype into Blob * Fill the real ReluKernel with Gpu and Float * Remove unused files * Add unique_names_ for op registration * Refactor AttrSeq for re-used of attr function specialization (#2452) * Refactor AttrSeq for re-used of attr function specialization * Remove Serialize interface * Dev cc gen bw user conf (#2449) * refine python print err * interface of UserOpWrapper UserOpConfWrapper UserOpConfWrapperBuilder * implement of user op conf builder in c++ * generate backward op conf for user op * define grad registration value func * refine code for review * check input valid when query need grad * refine name * implement of demo ccrelu_grad and test pass of alexnet * refine ccrelu python * Add UserOpDefWrapper; Fix .py bug; Add TestReshape op/kernel (#2454) * Add UserOpDefWrapper * Update paras of CheckAttrs() from UserOpDef to UserOpDefWrapper * Fix bug in user_op_builder.py * Add TestReshape Op&Kernel * fix ccrelu op grad register and shape infer; Add ccrelu alexnet test python script (#2457) * Move UserOpConf from op_conf.proto to user_op_conf.proto * Add test_reshape.py * Refine the imple of access to attrs in user_op_conf.cpp * Refactor the way to access AttrVal with AttrValAccessor * Add GetAttr() in KernelContext * Rename op_infer_util.h to infer_util.h and Fill 
paras of InferTmpSize with InferContxt * Refactor InferContext to simpley user_op * Refine customized test kernels to get along with interface update * Dev merge from dev python (#2465) * Clear session (#2416) * oneflow.clear_default_session * fix bugs in oneflow.config.machine * refactor function return type (#2417) * fix for py2 (#2418) * blob parallel conf * Pr watch scope (#2419) * pr oneflow.watch* * merge more code to pass watch_scope.py * TODO: input_blob_def.parallel_conf * oneflow.cluster (#2423) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * Free port (#2427) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * Dev single processor test (#2430) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * single process test * Cluster::WorkerLoop * delete unnecessary OF_BARRIER_ALL * no longer fork children processes to run tests * robust contextmanager for CurJobConf (#2434) * fix of_pure_proto_dir (#2439) * Ctrl between optimizer (#2443) * add ctrl edges between optimizors * update docker file * sequentialize all optimizors * Revert "fix of_pure_proto_dir (#2439)" (#2446) This reverts commit 5031cc86. * Oneflow unittest (#2448) * oneflow.unittest.* * oneflow.unittest.register_testcases * rename: oneflow.unittest.register_testcases -> oneflow.unittest.register_test_cases * Test bert inplace with xinqi (#2450) * update bert script * update watch_scope test script * update for debug * update for debug * update debug script * test_inplace.py * no reshape * debug IsLbiAllConsumersReachable * fix inplace * rm useless code * update config * fix critical_section (#2453) * Patch distribute (#2456) * backup * fix bugs * test_inplace * Fix InplaceActor when no one consume inplace out regst (#2458) * disable mutable inplace edge to variable (#2459) * Fix unittest import conflict for py2 (#2460) * Create __init__.py (#2464) * update ccrelu_alexnet.py * Update code that use UserOpConf to UserOpConfWrapper * Add paras of GetSbp * Refactor KernelContext with pure virtual member function * Refactor KernelRegContext with pure virtual member function * Refactor UserKernelContext ctor to simplify code * Refactor InferContext with pure virtual member function * Rename Blob to Tensor; BlobDef to TensorDesc * Remove unused log and file * Dev cc user op sbp (#2480) * ccrelu multi-gpu runnable * example of ccrelu op get sbp sign * sbp context * add example for get sbp using LogicalTensor... * GetSbpFnUtil::MirrorSplitAtDim0 * fix bug of useless * Fix the bug of init UserKernelContext (#2487) * Fix the bug of init UserKernelContext * Fix due to comment * Refine code * Add missing SetCheckAttrFn implementation * refine python user_op_builder.SetAttr(); refine reshape test an… (#2505) * Make UserOp runnable; refine user op python test for new change in dev_python Co-authored-by:
cheng cheng <472491134@qq.com>
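
Much of the entry above builds up a builder-style user-op registry: an op registered under REGISTER_USER_OP declares its inputs, outputs, and attributes, plus shape/dtype and SBP inference functions, with kernels registered separately. The following is a self-contained miniature of that builder pattern, for illustration only; it is not OneFlow's user_op API, and every class and method name in it is a made-up stand-in.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <map>
#include <string>
#include <vector>

// Miniature op definition: names of inputs/outputs, attrs with textual defaults,
// and a shape-inference callback.
struct OpDef {
  std::string name;
  std::vector<std::string> inputs;
  std::vector<std::string> outputs;
  std::map<std::string, std::string> attrs;
  std::function<std::vector<int64_t>(const std::vector<int64_t>&)> infer_shape;
};

// Builder in the spirit of the registration flow described above.
class OpRegistrationBuilder {
 public:
  explicit OpRegistrationBuilder(std::string name) { def_.name = std::move(name); }
  OpRegistrationBuilder& Input(const std::string& n) { def_.inputs.push_back(n); return *this; }
  OpRegistrationBuilder& Output(const std::string& n) { def_.outputs.push_back(n); return *this; }
  OpRegistrationBuilder& Attr(const std::string& n, const std::string& dflt) {
    def_.attrs[n] = dflt;
    return *this;
  }
  OpRegistrationBuilder& SetShapeInferFn(decltype(OpDef::infer_shape) fn) {
    def_.infer_shape = std::move(fn);
    return *this;
  }
  OpDef Build() const { return def_; }

 private:
  OpDef def_;
};

int main() {
  // A "ccrelu"-style elementwise op: one input, one output, shape passes through.
  OpDef relu = OpRegistrationBuilder("ccrelu")
                   .Input("in")
                   .Output("out")
                   .SetShapeInferFn([](const std::vector<int64_t>& in) { return in; })
                   .Build();
  std::cout << relu.name << " infers rank " << relu.infer_shape({8, 1024}).size()
            << " output\n";  // prints: ccrelu infers rank 2 output
}
```
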

- Dec 26, 2019
Houjiang Chen authored
* Enable multiply definition for xla compilation in oneflow * Realize running an executable * Abstract and gather resources (such as client, builder etc.) needed to compile as CompilationResourceStore * Implement a seperate xla allocator to avoid introducing much objects of tensorflow * Define CompilationContext separately * Running XLA by CPU mode is OK now * Make the result shape after running the executable is a tuple, and refine comments * Add compilation cache to solve recompiling every time * Resolve InferSbpSignature in XlaLaunchOp * Resove executing on specified cuda stream * Refine XlaLaunch parallel conf, add batch matmul op * Refactor job rebuilding and fixup time shape * Update batch_dim_lbis field if XlaLaunch has any output which has batch dim * Resolve cluster-ring after clustered, take sbp policy and time shape into consideration * Add reshape op * Fix bugs * Rename CompilationContext by XlaLaunchContext, add XlaRuntimeScope to swap stream handle * Fix bugs * Update cmake to compile with xla optionally * Support more ops * Add more ops, and fix bugs * Implement XLA allocator and internal memory pool * Adaptively resize allocator memory size * Refine memory allocator * Block host if running cpu executable * Fix bug for getting scalar value * Fix result layout bug. This bug causes wrong result for transpose * Refine gelu backward * Of xla sx (#1990) * add identity xla op * Add batch gather op * Refine batch gather * fix batch gather bug aand add gather op, mv identity op to unary_op * Add softmax and gather/batch_gather * Add xla softmax_grad op * Add xla layer normalization op * Add xla layer norm backward op * Alias inputs and outputs to compute in-place * Reuse output buffers when running xla executable. It brings about 10% speedup for bert on single gpu by zero copy results * Reuse output buffers when running xla executable. 
It brings about 10% speedup for bert on single gpu by zero copy results * Refine xla allocator * Refine code style * Add xla reduce_sum op * Rewrite model update op to optimizer graph * Fix hang bugs * Fix input which body is disabled in xla launch kernel * Fix self control in * Fix self control in * Add fake consume op * Fix HasAttr bug for optional field * Refine AdamOptimizer * Fix xla AdamOptimizer bugs * Add meta data in HLO instruction, and refine * Fix bugs * add reduce sum and split normal model update (#2040) * remove append_func_to_list * Rm deprecated model update and save code (#1958) * remove code * mv random gen to kernel * mk seed required * address reviews * fix unused warning * address reviews * check in more deprecation * remove ModelSaveOpConf * move out ops and modify item (#1962) * ModelInit.__oneflow_input_remote_blobs__ * fix cpu only query & add error info (#1964) * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * modify check_point and add test check_point (#1963) * fix misuse of Scope/raii * op_name2variable_blob * add sigmoid test and tanh test (#1966) * add op matmul and matmul test (#1967) * rename oneflow.val to oneflow.input_blob_def * support auto var for convolution (#1972) * add op add and test add (#1973) * mv deprecated.pb_util to lib.core.pb_util * add op get_variable and get_variable test (#1975) * add op get_variable and get_variable test * modify shape extend * AllReduceSequencePass (#1976) * python2 compatibility for check_point * fix "return (blob_a, blob_b)" bug * rename: arg_passing => arg_pass * shared regst blob header between jobs (#1919) * half impl * register manager handle memory shared for separated memory * set separated memory shared id for shared regst between jobs * half impl of python for blob * fix BUG of pod ToProto() when proto has inited * fix BUG of infer dim0_inner_shape() in foreign_input_op * 1. 
PushJob copy from python can infer dim0_valid_num * add test for dynamic relu * refine test file * refine code * refine note * update test file for new interface * rename separated_header* (#1979) * some bugs fixes for a train&eval job (#1978) * debugging alex net * check in test pull_multiple_blob.py * strcter check * fix bias in conv * fix various bugs * rm file * op_name in different jobs can be overloaded * fix compile bug in job_set_compile_ctx * rm cmake code for building oneflow binary * check in script (#1980) * check in script * rm used import * CudaCurrentDeviceGuard (#1977) * fix val (#1981) * Merge job set and split fw bw (#1982) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half 
* feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complte InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnaray * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * udpate * update ndarray * update * update * update * 
udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support the half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when use auto_mixed_presion * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: udpate registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_gragh in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * Merge job set and split fw bw (#1983) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * 
refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in NewKerneUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half * feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complte InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * 
ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnaray * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support the half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when use auto_mixed_presion * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. 
refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: udpate registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_gragh in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * CudaCurrentDeviceGuard (#1977) * delete tmp_split_fw_bw_train_conf (#1985) * delete tmp_split_fw_bw_train_conf * delete useless comments * fix refactor bug in layer_norm_op * minor fixes * update py script * remove code could be misleading * Fix all reduce mem sharing (#1986) * fix all reduce mem sharing * ByteSizeOfDataContentField=>ByteSizeOfBlobBody * remove obsolete task_graph optimization * no arg_pass_job for variable_op * merge memory block id between jobs (#1910) * refine MemBlock and CriticalSection * job memory sharing strategy * revert diff in CriticalSectionDesc * Merge memory block between sub plans * Get mutual exclusion job groups * forget to consider memory merge only in same machine * memory zone unique id * Merge Done; merge memory block id from right to left; get memory block ids info * revert MemBlock * generate mutual exclusion job groups Done. * update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implement of Merge memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as more as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set(#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/swith_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor 
add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdate code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * 
refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more infomation of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistant * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by return Maybe<void> type (#2051) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initilizer (#2082) * zero constant initilzer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * 
fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implement (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc ; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * check in softmax loss * nn.conv2d and 
nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data laod * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix sveral ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alextnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge general ops (#2131) * merge some general ops to 
dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * trancated normal for bert (#2150) * Update bert for dev python (#2151) * trancated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/saveargs * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in 
pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge placement group when add op; remove mergeplacementgroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check blob split_axis can be divide by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3.python blob desc .etc * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fixbug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placment scope exit * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placment group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * remove useless lines * fix bert and init_cluster_env for multi nodes * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) * InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda life time; gen job build error add location info * refine error generation and return * refine check lbi vaild and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. 
* Remove fake consume op * Support enable/disable XLA by set env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Of xla (#2237) * mv deprecated.pb_util to lib.core.pb_util * add op get_variable and get_variable test (#1975) * add op get_variable and get_variable test * modify shape extend * AllReduceSequencePass (#1976) * python2 compatibility for check_point * fix "return (blob_a, blob_b)" bug * rename: arg_passing => arg_pass * shared regst blob header between jobs (#1919) * half impl * register manager handle memory shared for separated memory * set separated memory shared id for shared regst between jobs * half impl of python for blob * fix BUG of pod ToProto() when proto has inited * fix BUG of infer dim0_inner_shape() in foreign_input_op * 1. PushJob copy from python can infer dim0_valid_num * add test for dynamic relu * refine test file * refine code * refine note * update test file for new interface * rename separated_header* (#1979) * some bugs fixes for a train&eval job (#1978) * debugging alex net * check in test pull_multiple_blob.py * strcter check * fix bias in conv * fix various bugs * rm file * op_name in different jobs can be overloaded * fix compile bug in job_set_compile_ctx * rm cmake code for building oneflow binary * check in script (#1980) * check in script * rm used import * CudaCurrentDeviceGuard (#1977) * fix val (#1981) * Merge job set and split fw bw (#1982) * add MemoryCopier and TensorSliceCopier (#1901) * add MemoryCopier and TensorSliceCopier * Index=>NdIndex * refine * refine * fix addition error checking (#1911) * Merge dev_mixed_precision into dev_split_fw_bw (#1904) * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * Merge dev_mixed_precision: Part-2 (#1907) * feat: add NewKernelUtil * fix typos * feat: add cublas_tensor_op_math_handle() * add gemm (#1860) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * feat: NewKernelUtil -> NewKernelUtil<DeviceType> * feat: update FullyConnectedKernel to use NewKernelUtil * Dev sx mixed precision (#1861) * add gemm * save * add blobgemm * update * update * fix cu * update cpp * save cpp * save * add relu and relu_backward * remove spared space * add explicit declaration * rename * feat: update ConvKernel to support half * add sigmoid and tanh (#1867) * add axpy (#1866) * style: formatting * refactor(new_kernel_util): unify Hgemm with cublas_gemm<T> * fix(new_kernel_util): use type traits to let cublasHgemm use tensor_op_math_handle * refine(new_kernel_util.h) * refine(new_kernel_util.cu) * feat(new_kernel_util): add OFBatchedGemm() * feat: update MatMulKernel to support half * feat: update ConvData/Bias/FilterGradKernel to support half * refactor(optimizer): replace DiffLbi4BnInOp() with diff_lbi_of_var_out * feat: support loss scale * fix(operator): :bug:add InferHasBatchDim() * feat(kernel_util.cu): suppport f2h and h2f in CopyElemOnGpu() * refactor(cast_kernel.h/cpp): :recycle:update cast_kernel.cpp to support float2half and half2float * style(kernel/cast_kernel.cpp): formatting * fix(cuda_device_context.h): :bug:add cublas_tensor_op_math_handle() * style(cast_kernel.cpp): formatting * feat(new_kernel_util): :sparkles:support Transpose in 
NewKerneUtil * refactor(transpose_kernel): :recycle:use NewKernelUtil instead of KernelUtil * feat(dropout_kernel): :sparkles:update DropoutKernel to support half * refactor(dropout_kernel): remove backward funcs * refactor(dropout_kernel.cu): use std::enable_if instead of template partial specilization to support * fix(conv_op.cpp): :bug:add InferHasBatchDim() and GetSbpSignatures() (only simple) * fix(conv_data/bias/filter_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_op): add InferHasBatchDim() and GetSbpSigs() * fix(relu_grad_op): add InferHasBatchDim() and GetSbpSigs() * fix: fix little bugs * fix(conv_data/filter_grad_op): min byte size of buf blob is 1 * feat: support half for bias_add_kernel * fix(bias_add_op): remove data type check * feat(relu_kernel): support half * refactor: add ADD_GPU_HALF_KERNEL_CREATOR * fix: typos * feat(pooling_kernel): support half * fix: remove CHECK_EQ of default data type * feat(pooling_grad_kernel): support half * feat: support half in ofrecord_encoder (TODO) * fix * feat: support half in sparse_cross_entropy_kernel * debug grad op (#1883) * Dev debug op mixed precision (#1884) * debug grad op * do nothing instead of UNIMPLEMENTED * fix(dropout_kernel): add tmp_split_fw_bw condition * build(half.cmake): https->http * fix(record_load_kernel): support total_batch_num * fix pooling (#1885) * fix(every_nth_op): add InferHasBatchDim() and GetSbpSigs() * fix(model_save_op/model_save_v2_op): add InferHasBatchDim() and GetSbpSigs() * fix: add GetCudnnScalingParameters() to fix scaling params * fix: add enable_true_half_config_when_conf() into config and update related code * feat: update GetCudnnScalingParameters to SPOnePtr/SPZeroPtr, and fix pool/lrn/normalization * refactor(matmul_kernel): remove Backward() * feat(new_kernel_util): support HGemmWithFloat() which use cublasSgemmEx() * feat: add enable_cublashgemm_when_matmul in Config; udpate MatMulKernel to support HGemmWithFloat() * refactor: rename SPOne/ZeroPtr to CudnnSPOne/ZeroPtr * refactor(new_kernel_util.cu): remove static of func in anonymous namespace * feat(job_conf.proto): add enable_auto_mixed_precision field * feat(auto_mixed_precision_lists): add amp_lists * feat(auto_mixed_precision): build the skeleton * feat(auto_mixed_precision): almost finish amp graph pass * feat(auto_mixed_precision.cpp): complte InsertCastOp() * refine(auto_mixed_precision.cpp): use cur_lbn and add some LOG * perf(auto_mixed_precision.cpp): use VLOG(2) instead of LOG(INFO) * refine(auto_mixed_precision.cpp): refine LOG * feat(auto_mixed_precision): add INSERT_CHECK and PropagateWhiteThroughNonListNodes() * Dev half ndarray (#1886) * debug grad op * ZeroVal => GetZeroVal; OneVal => GetOneVal * MaxVal => GetMaxVal; MinVal => GetMinVal * check data type * DevDType * move function template to struct template for BinaryFunc* and UnaryFunc* * support half for reduce_sum_kernel * ZeroPtr => GetZeroPtr; OnePtr => GetOnePtr * half for NdarrayUtil * OF_DEVICE_FUNC is always inline * half for NdarrayApplyUnaray * simplify usage of NdarrayUtil * UnaryFuncExp * add VarNdarrayBuilder and ValNdarrayBuilder * simplify NdarrayUtil in layer_norm_param_grad_kernel * InplaceBroadcast * remove SoftmaxKernelUtil * half for softmax_kernel * fix improper use of __CUDA_ARCH__ * disable sm_30,sm_52 * refine(conv_kernel.cu): fix typo * fix(conv_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix: fix typos of GetOneVal * fix(auto_mixed_precision.cpp): allocate for shared_ptr * refactor(auto_mixed_precision.cpp): refactor 
INSERT_CHECK for better understanding * fix(auto_mixed_precision.cpp): fix typo * fix(auto_mixed_precision.cpp): fix typo of OnOutEdge and OnInEdge * style(auto_mixed_precision): rename SetXXXSet() to FillXXXSet() * style(auto_mixed_precision_lists.cpp): use AMPList instead of a long HashSet<...> * feat(protobuf): add MutableRepeatedMessageInPbMessage() for modify ibn of PrintOp * feat(auto_mixed_precision.cpp): add Container2Str() and more delicate logs * feat(auto_mixed_precision.cpp): more logs * refactor(auto_mixed_precision.cpp): refactor the algo of graph traversal * fix(bias_add_op.cpp): fix bias_multiplier shape * feat(gather_xxx): udpate gather,gather_grad,gather_kernel_util to support half * feat: update MatmulKernel and new_kernel_util to support half * refactor(auto_mixed_precision): add ClearList and refine code * feat(tanh_*_kernel): support half * feat(add_kernel): support half * update binary_func.h * udpate * update ndarray * update * update * update * udpate * refactor(data_type.h): better representation * fix(unary_func.h): fix typo * style(data_type.h): format * refactor(kernel): rename ADD_GPU_HALF_KERNEL_CREATOR to ADD_DEFAULT_KERNEL_CREATOR_WITH_GPU_HALF * style(CMakeLists.txt): fix typo * fix(layer_norm_kernel.cu): GetOne/ZeroPtr -> CudnnSPOne/ZeroPtr * fix(auto_mixed_precision.cpp): group inserted cast op by lbn * fix get one ptr (#1913) * fix(layer_norm): add LayerNormOp to grey_list and support the half * fix(layer_norm about): fix it to run when amp * fix: move fix sbp signature from OpNode to OpGraph * Dev new kernel util (#1925) * refactor(kernel/util): refactor NewKernelUtil and add DnnIf * refactor(kernel/util): add BlasIf * refactor(kernel/util): add ArithemeticIf * refactor(kernel/util): add cuda_kernel_util.* * refactor: refactor NewKernelUtil * refactor(kernel/util/xxx_interface): mv XxxIf to interface_base.h to avoid loop including * refactor(new_kernel_util.h): remove unused header files * refactor: refactor loop include * feat(about kernel_util): add InitializeWithConstConf into ArithemeticIf and fix bias_add_kernel * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA_VERSION… (#1936) * not compile CUDA_NVCC_FLAGS arch = compute_70 when CUDA * CHECK cuda version > 10.0 when use auto_mixed_presion * Fix bug of Snapshot delete file Unwanted (#1937) * fix link BUG of release version (#1938) * delete redundant code in OpGraph JobCompleter and Operator (#1927) * 1. delete redundant code in OpGraph JobCompleter and Operator 2. fix bug of Snapshot delete file Unwanted 3. 
refine ReadMe * revert README change * split 2 pull request * Refactor Kernel Registry V2: The clear & easy Way (#1941) * refactor(resource.proto): move DeviceType to common/device_type.proto * feat(kernel_registration): add kernel_registration.h/cpp * feat(kernel_registration): update matmul_kernel to support new registration * feat: add CreateKernel for new registry * feat: udpate registry of cast conf * refactor(kernel_registration): remove KernelRegMap * fix bug of op_gragh in SplitLogicalInputBlobDesc (#1949) * grpc SetMaxMessageSize(INT_MAX) (#1950) * fix bug of Graph::ForEachConnectedComponent (#1952) * Grpc set max size (#1953) * grpc SetMaxMessageSize(INT_MAX) * set max msg len for ctrl service * code for test grpc max msg size * remove test code * NumaAwareCudaMallocHost (#1959) * NumaAwareCudaMallocHost * add conf * AllReduceSequencePass (#1976) * Merge job set and split fw bw (#1983) * delete tmp_split_fw_bw_train_conf (#1985) * delete tmp_split_fw_bw_train_conf * delete useless comments * fix refactor bug in layer_norm_op * minor fixes * update py script * remove code could be misleading * Fix all reduce mem sharing (#1986) * fix all reduce mem sharing * ByteSizeOfDataContentField=>ByteSizeOfBlobBody * remove obsolete task_graph optimization * no arg_pass_job for variable_op * merge memory block id between jobs (#1910) * refine MemBlock and CriticalSection * job memory sharing strategy * revert diff in CriticalSectionDesc * Merge memory block between sub plans * Get mutual exclusion job groups * forget to consider memory merge only in same machine * memory zone unique id * Merge Done; merge memory block id from right to left; get memory block ids info * revert MemBlock * generate mutual exclusion job groups Done. * update for proto * add JobMemSharingStrategy in python interface * remove memorycase hash * move JobMemSharingStrategy to JobSetProto * using default strategy = parallel priority strategy * update interface of flow.job_mem_sharing_strategy * InterJobMemSharingUtil and PlanUtil * revert oneflow.h * fix bug * New implement of Merge memory block id between jobs * refine code * fix a fatal bug in std::hash<oneflow::Shape> * +REGISTER_INDEPENDENT_THREAD_NUM for print task_node * unlock critical sections as more as possible (#1994) * Bugfix actor case (#1995) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * Bugfix actor case (#1996) * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * refine code * small regst_num for reentrant_lock (#1997) * fmt dev_job_set(#1999) * double buffer for tick_op * tick is cpu op * speedup compile time (#2000) * only merge mem_block_id between user job (#1993) * Fix keep header only (#2001) * speedup compile time * fix keep header only * remove shared model (#2003) * remove blob_mem_sharing (#2005) * No copyhd for output (#2006) * no cpu tick * no copyhd for output_op/swith_output_op * remove temp comments * rename AddCopyH2DTaskTo to TryAddCopyH2DTaskTo * remove clone_id (#2007) * layer norm auto var (#2004) * layer norm auto var * make of_format * bn sbp (#2008) * Refactor job completer (#1998) * fmt * refactor GenerateOpConf4Trainning * more refactor * refactor SetCtrlInOpName4VariableOp * use uniq ptr * refactor RewriteBoxingWithAllReduce * refactor MakeAllReduceSequence * refactor auto_mixed_precision * refactor DumpLogicalBlobDescAndSbpSignature * refactor group_boxing_by_dst_parallel * refactor 
add_keep_header_only_op_conf * refactor AutoSourceTick * refactor AddTickForTimeShape * refactor AutoSinkTick * refactor AddGlobalOutputCriticalSections * refactor SetOpTimeShape7BatchDimLbis * fix a bug in IsInterfaceTask (#2009) * Bugfix is interface task (#2010) * fix a bug in IsInterfaceTask * IsOutputInterfaceTask * copyhd-free output_op task_node * Dev job set config util (#2011) * add more if in JobConfigProtoBuilder * unlock critical sections as more as possible * consumed and produced regst of actor 'case' are customized * remove total batch num in config util * remove clone_id * assert has train_conf * rm debug info * Dev job set bert (#2013) * support bert * mv into bert * manual format * fix adam (#2015) * fix adam * div batch instance num before update model * remove outdate code in oneflow.cpp (#2017) * Dev split like (#2016) * no total_instance_num * add auto grad for concat * check in impl * check in bug fixes * fix bugs for split_like * split_like_op.cpp format * add normalization_autovar * Update op_conf.proto * address reviews * fix typo * constant ref * rm forward_loss_instance_num (#2018) * Bugfix job set multi device (#2019) * sbp for tick input bn * interface_blob_conf for output_op/switch_output_op * set sbp conf for tuple identity op * fix bugs when merge main plan * delete useless code * address review * fix error use of GenRepeatedBn() * ForEachConnectedComponent is easily misused * 1) fix output op parallel_conf; 2) refactor InterfaceOpUtil * only for return output_op * factor: lbn => logical_blob_name; logical_blob_name() => logical_blob_name * return op instead of output op acts as part of user job * enable_all_reduce_group * bugfix: init RuntimeBuffersScope before Runtime * demo python scripts for enable_all_reduce_group * remove wrong optimization code * constant_conf for enable_all_reduce_group.py test * fix interface op parallel conf * fix reduce concat kernel (#2020) * binary program oneflow_worker * user_job_completer * remove unused code loss_print * rm unused code loss_acc * remove unused accuracy_acc and accuracy_print * remove input_diff/output_diff/model_diff bns * remove unused bns in gdb util * replace data_tmp_bns/model_bns/fw_buf_bns with tmp_bns * support mpi using style * Bugfix put job conf into plan (#2023) * put job_conf into plan * using job_name judge isPullJob/isPushJob * fix wrong job_id error * model_init is a push job; model_save is a pull job * make cmake more reasonable (#2024) * Restructure python module and minimum setup.py (#2026) * check in updated paths * check in minimum setup tool * Dev python init multi unit (#2022) * init multi-unit by send oneflow_worker binary and ConfigProto to worker machine * refine var name * refine code * compile user/main job only on master * bert multi machine test code * fix bugs * JobConfs * fix bugs under WITH_RDMA * fix multi-machine bugs * delete useless code * Add xla reduce_sum op * fix overflow bug of mem_zone_unique_id, set job_id to int64_t (#2028) * feat: init_worker can without scp binary and no use uuid (#2029) * half impl of without scp bin * feat: init_worker can without scp binary and no use uuid * check in fixes (#2030) * fixbug of delete worker (#2033) * Dev dot plan (#2035) * reuse plan to dot file * refine plan dot * Check in bug fix and multi node script (#2032) * check in fixes * check in script * fix boxing bug when setting conf with sbp * flag for iter * fixbug of delete worker * fix delete worker in script * address review, add exclusive or check * reuse plan to dot file * 
refine plan dot * fix and add flags * fmt * rm debug output * more flags * check Activation * fix fc bug when num axes > 2 * reverse change * fix next_batch_num (#2036) * upgrade nccl to 2.4.8 (#2037) * fix shape of fc in_diff (#2038) * Rewrite model update op to optimizer graph * Update oneflow.cmake (#2041) * better looking merged_plan to dot v1 (#2039) * better looking and more infomation of merged_plan.dot * refine color * Fix tick in multi node parallel (#2042) (#2047) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * Dev train conf builder (#2046) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * check in impl * fix data dir (#2054) * fix data dir * rm model load path * AssignOp (#2058) * AssignOp * remove useless code * Python ops gather and unit test (#2053) * python_ops gather and unit test * format * minor mod * SnapshotOp (#2060) * magical add and fix bug (#2061) * check in impl * add todo * Dev jxf python pooling (#2056) * run max_pool_2d without bug * correct max_pool_2d * correct average_pool_2d * minor refine * final version * rename to nn.py * add name arg to pool1d ops * refine by review * rename to _GetSequence and move it to the end of file (#2063) * fix BindInterfaceMemBlockId (#2065) * mark py file generated (#2066) * Dev gracious exit (#2057) * add more checks * make language more consistant * better error info for worker init * better error * Update setup.py (#2068) * Refine Infer APIs by return Maybe<void> type (#2051) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * fix bug for split like op (#2070) * fix snapshot path (#2071) * Dev job set fix infer apis (#2072) * Refine Infer APIs by return Maybe<void> type * Fix return type * Fix code style * Replace CHECK macros in the implementation of infer APIs * Revert IsOk * update * add AutoGlobalStep (#2073) * rm default_initializer_conf in train conf (#2075) * Fix sigmoid op (#2076) * fix sigmoid op bug * fix bug for split like op * add sigmoid grad op * Fix bn (#2077) * fix bn * return Maybe<void> OK in lambda * fix typo * fix SigmoidGradOp (#2078) * Dev python merge job set (#2081) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix gcc warning in release (#2080) * fix gcc version in release * fix empty line * Fix adam mv initilizer (#2082) * zero constant initilzer for adam m and v * make of_format * init adam m v beta1_t and beta2_t * use value instead of initializer * const float& -> const float * update * LearningRateScheduleOp (#2079) * matmul (#2084) * matmul * np.allclose * Fix hang bugs * bugfix: reshape op infer dim0 size; and look up tensorflow reshape (#2085) * bugfix: reshape op infer dim0 size; and look up tensorflow reshape * refine code for read * check py if and test * prelu (#2086) * prelu * fix * fix * template for either ptr cast (#2088) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * 
fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * add template for cast * rename * Dev build and infer ctx (#2089) * add job_build_and_infer_ctx interface * lbn_with_split_hint * fix maybe macro * fix signature of Maybe<T>::Error() * job_build_and_infer_if * add c_api_util wrapper for job_build_and_infer_ctx * implement python/job_build_and_infer interface * CurJobBuildAndInferCtx_AddPlacementGroup * BuildJobAndInferCtx and Mgr c++ implement (#2074) * job_build_and_infer_ctx_mgr * refine interface of infer_ctx_mgr * JobBuildInferCtx set job conf; add and refine error type * revert job.proto * half impl of add op in build_infer_ctx * generate op produced empty logical blob desc ; infer out blob desc interface * job_build_and_infer_ctx VERSION 1 * add InferOutBlobDesc for conv op; remove record_piece_size in interface op * maybe return * job_set hold by job_build_and_infer_ctx_mgr * check placement when infer ctx mgr leave cur job * Global New/Delete JobBuildAndInferCtxMgr * add JUST when ctx add op * remove unused job_conf.arg_op_name * fix bugs caused by python new api * fix bugs caused by lack of Global<JobDesc> * fix bugs caused by new api * refactor compiler.Compile * merge dev_python * remove unused message proto * rename api * Fix input which body is disabled in xla launch kernel * add RemoteBlob.shape and RemoteBlob.dtype * Fix data type set default variable (#2092) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * fix default data type * Add conf axis for bias_add for any axis channel (#2093) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * bias_add completion * follow comment * make conf axis required * Dev jxf python initializer (#2090) * oneflow initializer * update * Fix self control in * Bugfix python alexnet (#2096) * bugfix_python_alexnet * fix * Add fake consume op * Dev global step (#2100) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * Fix optimizer initializer (#2095) * fix optimizer initializer * rename lars data temp bn * fix job_type (#2102) * Dev alexnet new api (#2094) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * check in softmax loss * nn.conv2d and 
nn.bias_add * fix opname * fix merge conflict * fix name * dense (#2097) * Fix jxf dense v2 (#2098) * dense * minor fix * alexnet * fix conf * quick fix * transpose * fix layers * add transpose * fix fc * fix * fix * fix data laod * params check and format * rm activation in op conf * save workaround * fix avg pool 2d * fix max pool 2d * remove fc3 relu * alexnet eval * minor * replace has_batch_dim with batch_axis (#2104) * replace has_batch_dim with batch_axis * refactor OrderValue4HasBatchAxis * fix batch_axis bugs in ConvFilterGradOp and DecodeOFRecordOp * no CHECK in MatmulOp::InferBatchAxis * infer op by op_conf and parallel_conf * wrapper Error for ErrorProto * replace ErrorUtil with Error * add OF_CHECK (#2110) * optional split_axis (#2113) * Fix HasAttr bug for optional field * undefined (#2116) * merge reduce xxx (#2119) * Update GetSbpSig() with Maybe (#2118) * fix sveral ops * modify all ops * format * update complete * Refine AdamOptimizer * fix (#2120) * Fix xla AdamOptimizer bugs * support scalar for reduce_xxx axis args (#2122) * Dev opt split axis (#2121) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * fix autovar split_axis (#2125) * Dev model init op (#2117) * assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp * fix (#2127) * rm stale alextnet script (#2129) * Dev plain maybe (#2126) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * Dev simple checkpoint manager (#2128) * SimpleCheckPointManager * makedirs * fix path * save * refine * refine * fix path to numpy (#2130) * Dev plain maybe (#2132) * optional split_axis * backup * VariableConf::(OptInt64 split_axis) * backup * 1) specialize Maybe<T*>; 2) fix bugs in adam_optm.cpp * SharedOrPlain * const std::shared_ptr<T>& => std::shared_ptr<T> * rename: Maybe.data() => Maybe.Data_YouAreNotAllowedToCallThisOutsideJustOrCheckJust() * refactor Maybe<JobBuildAndInferCtx> => Maybe<JobBuildAndInferCtx*> * Dev jxf merge general ops (#2131) * merge some general ops to 
dev_python * dense demo * rm print in test * new line at the end of file * format * fix check point * update alexnet * broadcast_xxx (#2134) * broadcast_xxx * typo * typo * rm job_conf.num_of_batches_in_snapshot * fix args (#2136) * fix proto if (#2138) * pass name to inner function (#2139) * check dropout if (#2140) * check dropout if * fix typo * Dev merge math ops (#2143) * merge math ops * new line at the end of file * merge layer norm (#2144) * variable_scope (#2141) * variable_scope * revert format * add check * Merge dropout if (#2145) * check dropout if * fix typo * fix typo * slice (#2142) * slice * add check and docstring * minor * minor * add const (#2146) * add const * fix indentation * address review * fmt * rm redundant * Update array_ops.py * Update array_ops.py * Update array_ops.py * add more activations to math_ops (#2147) * fix bug (#2149) * trancated normal for bert (#2150) * Update bert for dev python (#2151) * trancated normal for bert * bert support * math.dropout to nn.dropout (#2153) * refactor oneflow_internal with Maybe::GetDataAndSerializedErrorProto * allow export multiple interfaces in oneflow_export decorator (#2154) * refactor job_build_and_infer_if.h * update oneflow_internal.h to use Maybe (#2135) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * Transfer data_part_num to DecodeOp and RecordLoadOp (#2148) * Transfer data_part_num to DecodeOp and RecordLoadOp * Fix python scripts * Dev nc of internal (#2155) * Fix python internal (#2133) * Return error meassage in oneflow_internal * Refine environment_objects_scope * add OF_ERROR_STR_CHECK and OFStrCat() * format * fix based on review * fix(oneflow_internal.h): add undef * fix: expr -> (expr) * feat: update oneflow_internal_helper to use func * fix: fix ctor bug * fix config_proto * rename c_api_util.Init => c_api_util.InitEnvironment * refactor compile_context.cur_job => compile_context.cur_job_conf * remove FixPackedBlobDescOfProducedRegst (#2156) * Fix snapshot root path empty log (#2158) * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * fix 121 for tick (#2069) * Fix snapshot root path empty log * fix channel last (#2157) * fix channel last * minor * merge pb_message * add cudnn conv force algo (#2159) * Update bert for dev python (#2160) * remove old bert * set data_part_num in decoder * support model load/saveargs * Dev flow function (#2152) * add of.function, refactor init, refine session, and refine runtime * rm useless code * rename * update * add test * @oneflow_export JobConfigProto and Trainconf (#2162) * @oneflow_export JobConfigProto and Trainconf * remove unused config in config_util.py * remove oneflow.get_cur_job_conf_builder * bugfix: bias_add op and reduce_sum op infer sbp and implement of bias_add kernel (#2161) * 1) refactor compiler._CompileJob; 2) fix test scripts; 3) fix oneflow.config.model_update_conf * fix config.train.model_update_conf * _GetJobConfAttr * update alexnet (#2166) * Update alexnet (#2167) * update alexnet * update for bert * 15->16 * more reasonable conf * get variable in py layer norm * replace val in 
pb msg; decode lbn string with split hint (#2165) * bugfix: boxing task node generate boxing conf; remove wrong check in op_graph (#2163) * Add meta data in HLO instruction, and refine * python model parallel (#2103) * decode split hint in lbn; refine interface of get/set_lbn_in_op_conf; fill sbp_conf in job when add op * merge placement group * refine code in AddAndInferOp * auto merge placement group when add op; remove mergeplacementgroup interface * infer sbp parallel when add op; impl Get/Has split axis in infer_ctx * python blob add interface for model parallel * refine code of python blob split * remove interface of has/get_split_axis in python blob * remove interface of has_batch_dim in python blob * add check blob split_axis can be divide by parallel num * refine code for maybe get/infer sbp * fix bugs: 1. python generate parallel conf; 2. gen logical blob id from lbn; 3.python blob desc .etc * fix for plain point maybe * fix bug: add repeated placement group, remove add placement interface in hand * fixbug: python/blob_desc, temp impl of not deepcopy; feat: dense layer support model parallel * dev_python model parallel runnable and check correct * remove add placement group when placment scope exit * 1. fixbug of bias_add op infer blob_desc/sbp; bias_add kernel impl; 2. dense layer set model_split_axis=0 for model parallel * bugfix: bias_add backward infer sbp wrong; model parallel bias add debug done * refine python blob_desc.split implement * refine interface decode lbn to split hint * refine auto add placment group * refine lbn with split hint decode * refine code for review * remove AutoVar related code (#2168) * feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc * add prototype (#2172) * add prototype * infer blob desc with sbp_signature * `str_a is not str_b' is buggy, use `str_a != str_b' instead * Update snapshot.cpp (#2174) * remove useless lines (#2176) * Fix bert multi nodes (#2177) * remove useless lines * fix bert and init_cluster_env for multi nodes * CHECK_JUST for InferBlobDescsIf (#2178) * Fix bert multi nodes (#2180) * remove useless lines * fix bert and init_cluster_env for multi nodes * config_proto -> default_config_proto * delete worker * update alexnet * remove unused op (#2182) * remove parallel_ctx when kernel init (#2185) * InferOpSbpSignature in op_graph and infer_ctx (#2175) * InferOpSbpSignature in op_graph and infer_ctx * bugfix: lambda life time; gen job build error add location info * refine error generation and return * refine check lbi vaild and exists * remove parallel num in decode_of_record op/kernel (#2186) * Fix bugs * delete GlobalJobDesc() in operator/ (#2188) * rm unused test file * Refine * Add assign ops behind adam optimizer to update model and momentum etc. * Add assign ops behind adam optimizer to update model and momentum etc. 
* Remove fake consume op * Support enable/disable XLA by set env * Merge callback, limit max operator count for each XLA subgraph * CudaEventPool * fix vector * refine * Support in-place update for optimizer * Add alias input and output to prevent reusing input with other temp buffers * Refine code style * Remove unused code * Fix static cublas library and xla link conflict * Fix cublas link conflict with tensorflow * Fix different connection kinds for multiple gpu cards (#2282) * Refine xla cluster algo (#2289) * Fix different connection kinds for multiple gpu cards * Fix bug for mutiple outputs consumed by one node * Refine cluster algo * Refine MarkClusterId pass and ReduceSplit task node (#2314) * Fix different connection kinds for multiple gpu cards * Fix bug for mutiple outputs consumed by one node * Refine cluster algo * Determine fusion disabled edges * update * Produce multiple registers on edges for ReduceSplit task node. Fix new allocator by stream id. * Refine MarkClusterId pass * Clustering subgraph with reverse ordering is better * Support strict clustering by taking dependencies into consideration * Translate rebuild job and rewrite optimizer into passes, and refine code style * Fix spell error * Update cmake * Merge branch dev_python (#2321) * Dev res50 new api (#2173) * check in script * runable * fix multinode * fix and real train * fix param data_format * fix truncated normal * quick fix multi node launch (#2193) * Dev reshape sbp (#2192) * reshape sbp * more check for reshape conf * fix error CHECK * refactor reshape * fix reshape like op * support naive case of s0 * refine * rm redundant code * more generous check for equal element cnt * restore empty line * add GatherMs0Grad op (#2191) * support for gather with s(0) `in' * add gather_ms0_op * fix bugs in message GatherMs0OpConf and GatherMs0Kernel * only (B, S(0)) -> P supported for gather_ms0 op * add GatherMs0Grad op * minor fix * refine code * bugfix and update gather test case * add concat op and pass the test (#2067) * add concat op and pass the test * add vgg job_conf * model compared to be same as the old one * rm unnecessary file * Update array_ops.py * mv file * get rid of ternary operator (#2195) * Dev reshape util struct (#2194) * check in changes * rm file * minor fix * Merge network files of 2 cnns (#2196) * add inceptionV3 * check in vgg16 * add cnns test scripts for dev_python (#2170) * add cnns test scripts for dev_python * add alexnet test scripts * add resnet50 * add inceptionv3 * add resnet50 * add vgg16 * first version of run_cnns_test.py * remove old files * unsorted_segment_sum (#2198) * oneflow.unsorted_segment_sum (#2199) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * Dev batch unsorted segment sum (#2200) * oneflow.unsorted_segment_sum * remote unused import * remove unused import * rename UnsortedSegmentSum to BatchUnsortedSegmentSum * rename: batch_unsorted_* => unsorted_batch_* * unsorted_segment_sum (#2201) * unsorted_segment_sum * fix job_completer/unsorted_segment_sum_grad.cpp * more check for unsorted_segment_sum batch_axis * remove FixParallelDesc (#2202) * rm KernelIfWithModel KernelIfWithActivation (#2203) * remove KernelIfWithActivation * remove KernelIfWithModel * rm blob header kLossInstanceNum (#2204) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch 
for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * fix warning: return string reference to temporary (#2212) * docker build support (#2002) * update cmake files * check in files * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * shrink ctx size * fix script * fix wheel build * fix wheel build not adding .so (#2052) * lower cmake version bar * rm more files * keep build dir * check in test bash script * fix * Dev docker sx (#2124) * add python2 docker env * rm old docker files * update repository * add ARG CUDA and USE_PYTHON_3_OR_2 * reform files * update * rm log doesn't print when there is cache * use default arg in dockerfile * better py 2 or 3 condition * add default * use if * update alexnet * update for bert * 15->16 * add resnet50 in model (#2217) * remove parallel policy; rm FC/rnn/embedding look up op/kernel (#2215) * remove parallel policy * rm FC/rnn/embedding_look_up op/kernel * add check data parallel for conv/layer_norm op * bugfix: bias add + use math_add when batch size = 1 * fix InferBatchAxis (#2220) * sync with bert_benchamrk (#2221) * sync with bert_benchamrk * rename run.sh * Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * run sucessfully on single GPU * fix 121 for tick (#2069) * delete unncessary multiply_grad class * speed up generate time for dot2svg (#2083) * Add axis conf to bias_add for any axis channel (#2087) * bias_add completion * follow comment * make conf axis required * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091) This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47. 
* updated * fix segment_sum_grad * fix sbp * fix segment_sum impl for data parallel * fix * remove useless code in segment_kernel_util.h * add python interface * fix sigmoid conf * fix naming error * fix typo * temp mod loss sbp * add LazyAdam * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep * rm useless code * unsorted_segment_sum * refactor sigmoid_cross_entropy_loss_kernel to high performance * Improve sigmoid cross entropy loss grad (#2207) * remove for loop called cuda kernel * minor fix * ../oneflow/python/ops/data_ops.py (#2209) * fix lazy_adam * Merge wnd and python (#2214) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * merge dev_python * fix boxing: P->S(0) * check in docker build scripts (#2216) * Dev python widedeep docker (#2218) * check in docker build scripts * check in .dockerignore * rm oneflow.segment_sum * remove segment_sum * rm unused file * rm debug code * rm debug code * rm double empty lines * remove useless comments * fix send msg (#2227) * fix reduction_coefficient (#2228) * refactor ndarray for eq/ne/... * Dev kernel launch synchronized (#2230) * IsKernelLaunchSynchronized * virtual * refine * refine * seperate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC * more static_assert * remove unused task related dot function (#2236) * remove unused task related dot function * do not output dot rank info * Dev non distributed optimizer js (#2234) * op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix dead lock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization * refine lazy adam (#2244) * refine lazy adam * update * memory version 2 step 1: replace original concept about mem sharing (#2242) * mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem * memory version 2 step 1: replace original concept about mem sharing * record reader multi thread (#2246) * multi thread * ComputeThreadPoolSize * python api * Fix random decode (#2252) * add decode random * fix decode random actor * Dev pr boxing v2 (#2248) * NcclDeviceCtx * include naive_actor * refine * use_boxing_v2 * config.use_boxing_v2 * SubTskGphBuilder * fix * hash<oneflow::MemoryCase> * Maybe<void> * ChainSubTskGphBuilder * SliceBoxingOp * return ok * SliceBoxingKernel * SliceBoxingActor * kSliceBoxing * nccl boxing op * nccl actor * 
REGISTER_OP * GetMsgFromCustomizedConf * NcclBoxingTaskNode * BldSubTskGphByBoxingV2 * NcclBoxingSubTskGphBuilder * fix * fix * NcclKernel * ParallelContext * REGISTER_ACTOR * fix rank set * IsNcclTaskType * limit * 1024 * multi thread reader * thread_num * IsKernelLaunchSynchronized * refine * NcclTupleReduce/BroadcastKernel use NcclDeviceCtx * MakeHostMemCase * NcclBldSubTskGph * remove use less code * use_boxing_v2 * refine * refine * refine * refine * refine * cmake find python note when version less 3.14 (#2286) * fix bug: reduce split kernel inplace (#2297) * Dev bias add (#2299) * use bias add * fix * bias_add * bias add half * fix * reinterpret_cast * fix half * HALF * fix * ADD_DEFAULT_KERNEL_CREATOR * fix * format * Fix dev python test (#2294) * add decode random * fix decode random actor * fix dev_python test scripts * fix batch_size test scripts * fix * Memory Version 2.0 Step 2: MemSharedAndReused between jobs (#2267) * MemBlockProto and ChunkProto * create mem block and chunk after improver * interface merge mem block and chunk between sub plans * merge chunk between jobs for memory reuse * using memory zone unique id replace memory case hash * merge interface op mem block between jobs for mem shared * gen GlobalCriticalSection by mem block id and chunk id * check mem block and chunk valid before runtime * Refactor: RegstMgr ; allocate memory by mem block and chunk instead of regst * fix bug; and pass test * fig bug: init chunk_id_count in id_manager * reuse copyHd out mem between jobs * PushPlan and PullPlan for memblock and chunk * refine merge mem block / chunk in oneflow.cpp * at(i); * GetOpName2JobId2TaskProtos functional * using output ptr; pass test AlexNet and Resnet * Fix xla reshape op * Merge upstream of_xla (#2322) * Dev cuda 9 arch 70 (#2318) * kCudaAlignSize = 256 * always compute_70 * __CUDA_API_VERSION >= 10000 * __CUDA_API_VERSION >= 10000 * disable_all_reduce_sequence * Fix xla reshape op * Fix compilation without xla * Remove useless code and fix data type mismatch in field desc (#2326) * Remove useless code * Refine code style * Fix data type mismatch in field desc * Update README.md (#2335) * Refine code style (#2336) * Update XLA usage document (#2337) * Update XLA usage document * Fix mistakes * Add xla clang-format and format codestyle (#2340) * Revert "Add xla clang-format and format codestyle (#2340)" (#2341) This reverts commit e3cd432be2880ca55d3fc305dbb87e5416d5e724. 
* Add xla clang-format and format codestyle (#2342) * Add xla clang-format and format codestyle * Fix header file missing * Of xla sx (#2334) * add gather grad op and pass testing * rm check * done batch gather grad * pass test * modify according to the review * add unsorted_segment_sum and refine unsorted_batch_segment_sum * reform according to review * refromate according to the clang-format and rm reference to the temp object * Pick step0 and step1 new commits (#2346) * Add xla clang-format and format codestyle * Fix header file missing * Modify codes to support XLA Conflicts: oneflow/core/job/job_builder.cpp oneflow/core/job/job_builder.h oneflow/core/operator/op_conf.proto * Fix a bug for building subgraph although it won't lead to wrong results (#2347) * Fix setting is_mutable in xla launch op (#2349) * Change directory xla to xrt, apply patch if building with xla * Refactor * Add infer shape pass, and Refactor launch kernel, graph compiler * Refine code style, add xla executable and graph compiler * Rename platform.proto as types.proto * change OpCompiler to OpKernel, complete xla graph compiler * Fix compilation bugs and add allocator, now xla compilation is ok * Add xla executable runtime * Add executable run scope to support launch kernel on specific stream. * Fix infer shape pass, and revert cuda event pool * Refactor graph building with attaching argument metadata. * Set mutability if rebuilding job * Set device ordinal correctly * Refine DelOps * Refine Argument definition and abstract function as subgraph * Fix infer shape in xrt launch op and launch kernel. * Add builder, executable, graph compiler, logger, value and a fc demo converter for tensorrt. * Refine code style * Rename xla Operand as XlaValue. * Complete TensorRT compiler and builder, Refine OpKernel * Pick public code changes from the new tensorrt branch. * Fix tensorrt compilation * Fake implementation of trt executable * Support selecting engine in launch kernel, refine trt executable * Use global logger required by tensorrt, rebuild engine if batch size is larger than default max batch size, and other bugfix. * Support train phase setting for registered op kernel * Remove RewriteOptimizer pass, update xla optimizer op. * Format job builder .h and .cpp files. * Remove RewriteOptimizer pass, update xla optimizer op. * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job. * Fix project compilation and remove pairs from identical_sbp_oba_pairs if the related operators are about to be deleted from the job. * Refine code style and comment. * Refine model update inference for launch op. * Refine * Refine code style and comment. * Refine model update inference for launch op. Conflicts: oneflow/xrt/kernel/op_kernel.h oneflow/xrt/node_util.cpp oneflow/xrt/node_util.h oneflow/xrt/passes/cluster.h oneflow/xrt/passes/mark_cluster_id_pass.cpp oneflow/xrt/passes/rebuild_job_pass.cpp oneflow/xrt/types.h * Add xrt README.md * Add use_xla_jit and use_tensorrt options in job proto * Refine code style * Fix BlobDesc getter and xla LayerNorm op for FP16 * Make use_xla_jit and use_tensorrt configurable from python config and env variables. 
* Update benchmark * Refine xrt README and rename compile_with_xrt.h file * Update README * Revert tensorrt * Fix absl missing if building with TensorRT but without XLA * Update xrt benchmark * Disable WITH_XLA by default * Update xrt benchmark * Format xrt as core * add activation op * add softmax op * Refine code style, remove unused code * Remove duplication of XLA usage * test pass * pooling test pass * add concat op, not tested * add activation ops, test not psassed * Add xla gelu unittest * add activation op, and test passed * add pooling op, and test passed * Fix int64 env variable * Export float16 for python * Add xla relu unittest * try to solve conv bug * add elementwise add op, test passed * add concat op, test passed * Bugfix: transfer weights from gpu to host since tensorrt requires host weights. * add op unit tests * resolve conflicts and fix softmax bug * add identity op and topk op, to test * Add xla bias add and reshape unittests * Add xla identity unittest * Add xla cast and scalar op unittests * Add xla broadcast op and transpose unittests * Add xla add, sigmoid and tanh unittests * add reduce mean op, test passed * formate ops, add CHECKs, and optimize function structure * Add xla gather and batch_gather unittests * Add xla softmax unittest and fix softmax bug if axis is not the last dim. * add trt gather op and unit test * Add xla reduce_sum unittest, and support keep_dims for xla reduce * Add xla layer_norm unittest, and refine xla layer norm op * Add reshape_like unittest, and export reshape_like api * Refine xrt unittest code style * Export softmax_grad op, add softmax_grad unittest * Export tanh_grad op and add xla unittest * Export gelu_grad op, and add xla unittest * add conv unit test * reformate * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests * Commit to merge upstream of_xrt * check files * modify files according to review advice. * Add xrt unittests (#2483) * Revert tensorrt * Fix absl missing if building with TensorRT but without XLA * Update xrt benchmark * Add xla gelu unittest * Fix int64 env variable * Export float16 for python * Add xla relu unittest * Add xla bias add and reshape unittests * Add xla identity unittest * Add xla cast and scalar op unittests * Add xla broadcast op and transpose unittests * Add xla add, sigmoid and tanh unittests * Add xla gather and batch_gather unittests * Add xla softmax unittest and fix softmax bug if axis is not the last dim. * Add xla reduce_sum unittest, and support keep_dims for xla reduce * Add xla layer_norm unittest, and refine xla layer norm op * Add reshape_like unittest, and export reshape_like api * Refine xrt unittest code style * Export softmax_grad op, add softmax_grad unittest * Export tanh_grad op and add xla unittest * Export gelu_grad op, and add xla unittest * Export layer_norm_grad and layer_norm_param_grad api, add xla unittests * Commit to merge upstream of_xrt * Fix reduce_mean facade bug if keep_dims if true. * Refine tensorrt unittests * Check failed if full reduce without keep dimension. * madd pooling unit test * Add tensorrt bias_add and reshape op, and their unittests. * Support fp16 for tensorrt. * Add tensorrt transpose op and unittest. * add unit test conv_2d * add unit test concat * Fix concat if axis is -1. * Refine tensorrt conv2d unittest * Fix padding mode for conv2d and pooling, refine unittests. * Refine tensorrt concat unittest * Add convert api from string engine to XrtEngine. * Revert tensorrt, and merge of_xrt branch * Remove some comments. 
* Refine tensorrt unittests * Add XrtConfig to deal with xla and tensorrt configurations. Conflicts: oneflow/xrt/api.cpp * Update tensorflow.cmake to avoid applying the patch repeatedly. * Remove XrtConfig Option, and fix xrt unittests * Add tensorrt batch norm (#2516) * Refine xrt signatrue hash, and fix python configuration (#2520) * Fix XrtCompilationEnabled returns (#2524) * Fix compilation after merge dev_python * Update xrt unittests * Revert protobuf version * Remove comment FOR_RANGE * Remove unused code * Reformart * Refine job builder * Disable dump job if not debug mode Co-authored-by:
Snow <snow3s@qq.com> Co-authored-by:
Juncheng <liujuncheng1022@gmail.com>
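The entry above opens with fixes to unsorted_segment_sum and its grad (batch_axis checks, sbp, data-parallel impl). As a reading aid only, here is a minimal host-side C++ sketch of what an unsorted segment sum computes, assuming a flat row-major [n, inner] data layout; the function name, layout, and out-of-range handling are illustrative assumptions, not OneFlow's actual kernel interface.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

// data: [n, inner], segment_ids: [n], out: [num_segments, inner]
// Each row of `data` is scatter-added into the output row named by its segment id.
void UnsortedSegmentSum(const std::vector<float>& data,
                        const std::vector<int64_t>& segment_ids,
                        int64_t inner, int64_t num_segments,
                        std::vector<float>* out) {
  out->assign(num_segments * inner, 0.0f);
  const int64_t n = static_cast<int64_t>(segment_ids.size());
  for (int64_t i = 0; i < n; ++i) {
    const int64_t seg = segment_ids[i];
    if (seg < 0 || seg >= num_segments) { continue; }  // ignore out-of-range ids
    for (int64_t j = 0; j < inner; ++j) {
      (*out)[seg * inner + j] += data[i * inner + j];
    }
  }
}

int main() {
  const std::vector<float> data = {1, 1, 2, 2, 3, 3};  // 3 rows, inner = 2
  const std::vector<int64_t> ids = {0, 2, 0};
  std::vector<float> out;
  UnsortedSegmentSum(data, ids, /*inner=*/2, /*num_segments=*/3, &out);
  for (float v : out) { std::cout << v << " "; }       // prints: 4 4 0 0 2 2
  std::cout << "\n";
  return 0;
}
```

The grad of this op is a gather over the same segment ids, which is why the segment_sum and gather changes travel together in the entries above.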
-
- Dec 21, 2019
-
-
lixinqi authored
-
- Dec 20, 2019
- Dec 12, 2019
-
-
Houjiang Chen authored
-
- Nov 26, 2019
-
-
Li Xinqi authored
* cmake find python note when version less 3.14 (#2286) * fix bug: reduce split kernel inplace (#2297) * Dev bias add (#2299) * use bias add * fix * bias_add * bias add half * fix * reinterpret_cast * fix half * HALF * fix * ADD_DEFAULT_KERNEL_CREATOR * fix * format * Fix dev python test (#2294) * add decode random * fix decode random actor * fix dev_python test scripts * fix batch_size test scripts * fix * Memory Version 2.0 Step 2: MemSharedAndReused between jobs (#2267) * MemBlockProto and ChunkProto * create mem block and chunk after improver * interface merge mem block and chunk between sub plans * merge chunk between jobs for memory reuse * using memory zone unique id replace memory case hash * merge interface op mem block between jobs for mem shared * gen GlobalCriticalSection by mem block id and chunk id * check mem block and chunk valid before runtime * Refactor: RegstMgr ; allocate memory by mem block and chunk instead of regst * fix bug; and pass test * fig bug: init chunk_id_count in id_manager * reuse copyHd out mem between jobs * PushPlan and PullPlan for memblock and chunk * refine merge mem block / chunk in oneflow.cpp * at(i); * GetOpName2JobId2TaskProtos functional * using output ptr; pass test AlexNet and Resnet * Dev cuda 9 arch 70 (#2318) * kCudaAlignSize = 256 * always compute_70 * __CUDA_API_VERSION >= 10000 * __CUDA_API_VERSION >= 10000 * disable_all_reduce_sequence * Fix cuda9 cudnn turing issue (#2329) * fix cuda 9 issus on turing device * CUDA_VERSION * no cuda check * bias add kernel gpu half (#2330) * mem_block=>header_mem_block (#2338) * speedup oneflow compilation * identity_sbp_conf * DropOut Version2 (#2355) * random mask like op conf; refine dropout op in python * remove useless dropout kernel conf * implement of random mask like op * refine dropout op * refine dropout grad op * refine generate dropout backward * random mask like kernel * refine dropout (grad) kernel * fix link problem for template separated compile * fix bug and pass test * dropout kernel for half * add check for dropout mask input data type * bugfixs * Remove IsOpFloat32() in auto_mixed_precision.cpp (#2358) * fuse op/kernl to 1 cpp * refine for review * fix bug * Refactor Kernel Registry for more flexible registration (#2363) * feat: update KernelRegistration and add KernelRegValProto * Refactor Kernel Registry for more flexible registration * Remove unused kernel_reg_value.proto * Memory Version 2.0 Step 3: MemReused in job (#2319) * use_memory_allocation_algorithm_v2 for switch improver mem block id * reuse plan task graph and ctrl edge for inferred mem block * refine interface; InJobMemSharingUtil * navie merge memory big chain; gen regst apply/release queue; handle for inplace hint regst * generate regst 2 mutual exclusion regsts * bugfix: apply should before release * interface for multi-thread run algorithm get mem block offset result * selet best algorithm to set mem block id and mem block offset * set mem block for inplace consumer regst * 3 algorithm interface * half implement of algo 1 * implement of algorithm0_OfColorImproved * runnable in 1 machine 1 device * Memory Chain * merge MemoryChain and pass Correctness test of alexnet and resnet50 * bugfixs: continues inplace consume relationship in bert-base fp16 * erase useless info in MemoryChain * implement of BfcAllocator and Tf_Bfc algorithm * use bfc algo and fix bug * only use default algo * renme in_job_* => intra_job_* * rename: InJob* => IntraJob* * rename: 1) apply_regsts_queue => alloc_regsts_queue; 2) 
release_regsts_queue => free_regsts_queue * rename function name in job/intra_job_mem_sharing_util.cpp * rename variable names in job/intra_job_mem_sharing_util.cpp: 1) *apply* => *alloc*; 2) *release* => *free* * refactor FindFreeOffset => FindFreeOffsetAndNewBufferSize * rename method: DeallocateRaw => FreeRaw * rename varable for review * use enum for mem reused algorithm and add python interface * fix sbp infer (#2373) * mv addr calculation out of decoder (#2374) * use tmp blob for temp storage (#2375) * INDEX_DATA_TYPE_SEQ (#2381) * refine include (#2382) * refine include * format format * element_wise_mul (#2383) * gather refine (#2384) * Dev fix sbp (#2388) * fix sbp * fix sbp * remove VirtualGenKernelConf * rename Read to ReadFully (#2389) * Dev parallel cast (#2391) * parallel cast * op_conf * refine * Dev auto zero padding (#2393) * auto_zero_padding * auto_zero_padding * fix * fix input_mask and token_type_id (#2398) * fix job launch (#2401) * fix sbp bug (#2402) * fix sbp * fix * add missing header files (#2410) * refactor cnn model tests (#2411) * refactor cnn model tests * reformat README.md * reformat README.md * refactor ndarray_reduce (#2412) * fix inplace reachability bug (#2413) * refactor gpu relu (#2414) * refactor gpu relu * CHECK_KERNEL_SAFE_INT32 * there may be a subtle cuda bug in ((float) x < 0) * refactor ndarray_reduce (#2405) * refactor ndarray_reduce * refactor relu/bias_add * refactor relu * refactor relu * refactor bias_add * refactor relu/bias_add * fix inplace_lbi bug * refactor addition * IsKernelSafeInt32 * CUDA_1D_KERNEL_LOOP_T * CUDA_1D_KERNEL_LOOP_T * If add (#2415) * refactor ndarray_reduce * refactor relu/bias_add * refactor relu * refactor relu * refactor bias_add * refactor relu/bias_add * fix inplace_lbi bug * refactor addition * IsKernelSafeInt32 * CUDA_1D_KERNEL_LOOP_T * CUDA_1D_KERNEL_LOOP_T * add unless oprand is nonzero * Clear session (#2416) * oneflow.clear_default_session * fix bugs in oneflow.config.machine * refactor function return type (#2417) * fix for py2 (#2418) * blob parallel conf * Pr watch scope (#2419) * pr oneflow.watch* * merge more code to pass watch_scope.py * TODO: input_blob_def.parallel_conf * fix reexport of identity op * merge dev_quick_dirty_object_detection * oneflow.cluster (#2423) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * Free port (#2427) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * Dev single processor test (#2430) * oneflow.cluster * no alias for oneflow.cluster.* * mv cpp_logging_conf from config_proto to cluster_proto * rename: cluster => env * rename: Environment => Session * auto find a free port for single node environment * localhost only * single process test * Cluster::WorkerLoop * delete unnecessary OF_BARRIER_ALL * no longer fork children processes to run tests * format * fix align byte size bug (#2436) * fix align bugs (#2440) * fix: GetNumOfLoDLevels lack return * minor script fix and update * update script * remove redundant function
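Among other things, the entry above introduces "DropOut Version2", which splits dropout into a random-mask-like op plus a purely elementwise dropout (and dropout-grad) op over that mask. Below is a minimal sketch of that split, assuming a Bernoulli keep-mask and 1/(1-rate) scaling; the function names and int8 mask type are placeholders, not the actual OneFlow kernels.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// "random mask like": draw a keep/drop mask with the same element count as the input.
std::vector<int8_t> RandomMaskLike(std::size_t n, float rate, uint64_t seed) {
  std::mt19937_64 gen(seed);
  std::bernoulli_distribution keep(1.0 - rate);
  std::vector<int8_t> mask(n);
  for (std::size_t i = 0; i < n; ++i) { mask[i] = keep(gen) ? 1 : 0; }
  return mask;
}

// Forward and backward share this body: y = x * mask * (1 / (1 - rate)),
// so the expectation of the kept, rescaled values matches the input.
void DropoutApply(const std::vector<float>& x, const std::vector<int8_t>& mask,
                  float rate, std::vector<float>* y) {
  const float scale = 1.0f / (1.0f - rate);
  y->resize(x.size());
  for (std::size_t i = 0; i < x.size(); ++i) { (*y)[i] = x[i] * mask[i] * scale; }
}

int main() {
  const std::vector<float> x = {1.f, 2.f, 3.f, 4.f};
  const std::vector<int8_t> mask = RandomMaskLike(x.size(), /*rate=*/0.5f, /*seed=*/1);
  std::vector<float> y;
  DropoutApply(x, mask, 0.5f, &y);  // roughly half the entries zeroed, the rest doubled
  return 0;
}
```

Reusing the same mask blob in the backward pass is what lets the generated grad op be the same elementwise multiply applied to dy instead of a dedicated dropout-backward kernel.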
-
- Nov 18, 2019
-
-
Li Xinqi authored
* pr oneflow.watch* * merge more code to pass watch_scope.py * TODO: input_blob_def.parallel_conf
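This entry tracks work on oneflow.watch*, which the log elsewhere describes as a debug hook ("oneflow.watch for debug"). Purely as an illustration of the registry pattern such a hook implies, here is a C++ sketch where callers register a callback keyed by blob name and the runtime invokes it once the blob's value is produced; every name below is hypothetical, not OneFlow's actual interface.

```cpp
#include <functional>
#include <iostream>
#include <string>
#include <unordered_map>
#include <vector>

class WatchRegistry {
 public:
  using Callback = std::function<void(const std::vector<float>&)>;

  // User side: attach a debug callback to a named blob.
  void Watch(const std::string& blob_name, Callback cb) {
    callbacks_[blob_name].push_back(std::move(cb));
  }

  // Runtime side: fire the callbacks when the blob's value becomes available.
  void Notify(const std::string& blob_name, const std::vector<float>& value) const {
    auto it = callbacks_.find(blob_name);
    if (it == callbacks_.end()) { return; }
    for (const auto& cb : it->second) { cb(value); }
  }

 private:
  std::unordered_map<std::string, std::vector<Callback>> callbacks_;
};

int main() {
  WatchRegistry registry;
  registry.Watch("fc1/out", [](const std::vector<float>& v) {
    std::cout << "fc1/out has " << v.size() << " elements\n";
  });
  registry.Notify("fc1/out", {0.1f, 0.2f, 0.3f});
  return 0;
}
```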
-
- Oct 15, 2019
-
-
lixinqi authored
-
- Oct 11, 2019
-
-
Juncheng authored
* kCudaAlignSize = 256 * always compute_70 * __CUDA_API_VERSION >= 10000 * __CUDA_API_VERSION >= 10000 * disable_all_reduce_sequence
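The first bullet above pins kCudaAlignSize to 256. A hedged sketch of how such a constant is typically used: every blob's byte size is rounded up to a 256-byte boundary so device pointers carved out of a shared allocation stay well aligned. The helper name is illustrative, not the actual OneFlow utility.

```cpp
#include <cstddef>

constexpr std::size_t kCudaAlignSize = 256;

constexpr std::size_t RoundUpToAlign(std::size_t byte_size) {
  return (byte_size + kCudaAlignSize - 1) / kCudaAlignSize * kCudaAlignSize;
}

static_assert(RoundUpToAlign(1) == 256, "small sizes round up to one full unit");
static_assert(RoundUpToAlign(256) == 256, "already-aligned sizes are unchanged");
static_assert(RoundUpToAlign(257) == 512, "crossing the boundary takes the next unit");

int main() { return 0; }
```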
-
- Oct 09, 2019
-
-
lixinqi authored
-
- Sep 28, 2019
-
-
lixinqi authored
-
- Sep 27, 2019
-
-
Li Xinqi authored
-
- Sep 24, 2019
-
-
Niu Chong authored
* Dev actor msg queue (#2225) * async msg queue * EnqueueAsyncMsg * Merge wnd python (#2226) * not ready yet * segment fix * fix segment_sum bugs * 1st wide_n_deep push * Fix tick in multi node parallel (#2042) * check in fixes * fix by adding boxing method * register tick op * move code and add more check * fix typo * fix bug when filtering op nodes before adding tick * fix wheel build not adding .so (#2052) * color plan dot VERSION-2 (#2045) * run sucessfully on single GPU * fix 121 for tick (#2069) * delete unncessary multiply_grad class * speed up generate time for dot2svg (#2083) * Add axis conf to bias_add for any axis channel (#2087) * bias_add completion * follow comment * make conf axis required * Revert "Add axis conf to bias_add for any axis channel (#2087)" (#2091) This reverts commit 8679ce980ce8570bf927baeab8616ee7b93fac47. * updated * fix segment_sum_grad * fix sbp * fix segment_sum impl for data parallel * fix * remove useless code in segment_kernel_util.h * add python interface * fix sigmoid conf * fix naming error * fix typo * temp mod loss sbp * add LazyAdam * Merge branch 'dev_python' of https://github.com/Oneflow-Inc/oneflow into dev_python_widedeep * rm useless code * unsorted_segment_sum * refactor sigmoid_cross_entropy_loss_kernel to high performance * Improve sigmoid cross entropy loss grad (#2207) * remove for loop called cuda kernel * minor fix * ../oneflow/python/ops/data_ops.py (#2209) * fix lazy_adam * Merge wnd and python (#2214) * rm ActivationType from op/kernel (#2205) * refactor sigmoid_cross_entropy_loss * fix SigmoidGrad::InferBatchAxis * support part_name_prefix and part_name_suffix_length (#2208) * rename: OutRemoteBlobsResultBox => OutRemoteBlobsStatus * oneflow.watch for debug * Dev decode batch size (#2206) * rm batch_size and piece_size * merge dev_python * Update reshape_like_op.cpp (#2213) * oneflow.parallel (#2211) * oneflow.parallel * refactor split_axis => parallel * rename parallel => distribute * fix typo: *Parallel => *Distribute * add blob_desc.with_split_distribute(axis) and blob_desc.with_broadcast_distribute() * merge dev_python * fix boxing: P->S(0) * check in docker build scripts (#2216) * Dev python widedeep docker (#2218) * check in docker build scripts * check in .dockerignore * rm oneflow.segment_sum * remove segment_sum * rm unused file * rm debug code * rm debug code * rm double empty lines * remove useless comments * fix send msg (#2227) * fix reduction_coefficient (#2228) * refactor ndarray for eq/ne/... 
* Dev kernel launch synchronized (#2230) * IsKernelLaunchSynchronized * virtual * refine * refine * seperate LOGICAL_BINARY_FUNC from ARITHMETIC_BINARY_FUNC * more static_assert * remove unused task related dot function (#2236) * remove unused task related dot function * do not output dot rank info * Dev non distributed optimizer js (#2234) * op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix dead lock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization * refine lazy adam (#2244) * refine lazy adam * update * memory version 2 step 1: replace original concept about mem sharing (#2242) * mem_shared_id -> mem_block_id; mem_shared_off_set -> mem_block_offset; enable_mem_sharing->enable_reuse_mem * memory version 2 step 1: replace original concept about mem sharing * record reader multi thread (#2246) * multi thread * ComputeThreadPoolSize * python api
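The first bullets above ("async msg queue", "EnqueueAsyncMsg") describe batching actor messages instead of dispatching each one eagerly. Below is a minimal sketch of that idea under the assumption of a mutex-guarded deque drained in one pass by the owning thread; the struct fields and class name are illustrative, not the actual actor runtime.

```cpp
#include <cstdint>
#include <deque>
#include <functional>
#include <mutex>

struct ActorMsg {
  int64_t dst_actor_id;
  int64_t regst_desc_id;
};

class AsyncMsgQueue {
 public:
  // Producers append messages instead of sending them one by one.
  void EnqueueAsyncMsg(const ActorMsg& msg) {
    std::lock_guard<std::mutex> lock(mutex_);
    queue_.push_back(msg);
  }

  // The owning thread drains the queue once and hands every pending
  // message to the dispatcher as a single batch.
  void Dispatch(const std::function<void(const ActorMsg&)>& send) {
    std::deque<ActorMsg> pending;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      pending.swap(queue_);
    }
    for (const ActorMsg& msg : pending) { send(msg); }
  }

 private:
  std::mutex mutex_;
  std::deque<ActorMsg> queue_;
};

int main() {
  AsyncMsgQueue queue;
  queue.EnqueueAsyncMsg(ActorMsg{/*dst_actor_id=*/7, /*regst_desc_id=*/42});
  queue.Dispatch([](const ActorMsg& /*msg*/) { /* forward to the destination actor */ });
  return 0;
}
```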
-
- Sep 20, 2019
-
-
Juncheng authored
* op&kernel&actor * job * job_completer * graph * format * fix pd * fix * ignore DelPlacementByOpName * fix auto tick * JobBuilder * fix * config util * fix * fix opgrade * broadcast tick * fix allreduce * balance by model size * GetSoleOutBlobSize * async_actor_msg_deque * group * AddOrMutOpsOnlyOnce * fix NcclTupleBroadcastGrad * order * set nccl order hint * op_conf * grad hint * NcclTupleBroadcastReduceSequencePass * add missed mutops * order fix * try kMdUpdtArea * fix nccl_order_hint * fix * add ti * tuple_identity_op * remove useless * group * fix dead lock * force ctrl in * sc broadcast * sort obn * group nccl * config group_size_mbyte * non_distributed_optimizer_group_size_mbyte * format * stop check * rm message sending optimization
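The bullets "balance by model size", "GetSoleOutBlobSize", and "config group_size_mbyte" above suggest the non-distributed-optimizer pass packs model blobs into groups capped by a configured byte budget, so each group is broadcast and updated together. A sketch of that grouping rule, assuming a fixed traversal order; the function below is illustrative, not the pass's real interface.

```cpp
#include <cstdint>
#include <iostream>
#include <vector>

std::vector<std::vector<int>> GroupBySize(const std::vector<int64_t>& blob_bytes,
                                          int64_t group_size_bytes) {
  std::vector<std::vector<int>> groups;
  int64_t acc = 0;
  for (int i = 0; i < static_cast<int>(blob_bytes.size()); ++i) {
    // Cut a new group when adding this blob would exceed the budget
    // (an oversized blob still gets a group of its own).
    if (groups.empty() || (acc > 0 && acc + blob_bytes[i] > group_size_bytes)) {
      groups.emplace_back();
      acc = 0;
    }
    groups.back().push_back(i);
    acc += blob_bytes[i];
  }
  return groups;
}

int main() {
  const int64_t mib = 1024 * 1024;
  // Blobs of 3, 3, 3, 9, 1 MiB with an 8 MiB budget -> groups {0,1}, {2}, {3}, {4}.
  const std::vector<int64_t> blob_bytes = {3 * mib, 3 * mib, 3 * mib, 9 * mib, 1 * mib};
  const auto groups = GroupBySize(blob_bytes, 8 * mib);
  std::cout << groups.size() << " groups\n";  // prints: 4 groups
  return 0;
}
```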
-
- Sep 07, 2019
-
-
Niu Chong authored
* feat: remove all autovar * fix and format * fix: fix op::InferBlobDesc
-
cheng cheng authored
-
- Sep 04, 2019
-
-
Juncheng authored
* assign op AddGlobalStepOpConf fix ARITHMETIC_DATA_TYPE_SEQ identity_op_conf add ops GenNewSnapshotName SnapshotOp cleanup blob name LearningRateScheduleOp LearningRateScheduleKernel LearningRateScheduleKernel AddLearningRateScheduleOpConf learning rate cleanup fix fix * remove total_mbn_num * date time format * save * refine * refine * revert * refine snapshot * fix * refine * AutoGlobalStep * refine * GenLogicalBlobName * AutoLearningRate * remove JobDesc lr * fix snapshot path * Maybe<void> * learning_rate blob * remove next_model_vid fix fix fix learning_rate * train_conf * fix for global step on multi nodes * SnapshotReader snapshot writer model init op fix refine init InitializeFromSnapshotConf model io job ModelLoadOp ModelLoadKernel MakeModelLoadJob ModelSaveOp fix InterUserJobInfo _MakeModelLoadJobFunc MutModelLoadOpConTickInputHelper fix refine init/load/save set_default_variable * remove SnapshotMgr * snapshot.h * delete model_init_job.cpp foreign_input_op_conf fix snapshot path set path op_conf fix fix CopyFromNdarray to bytes c use uint8 char2uint8 * model init * model io * fix * ModelSaveKernel * mutable_batch_axis()->Clear() * InferBatchAxis * fix * refine * job set * MakeModelIoJobs * fix * jobs * fix * model io job * GenOutputOpConf * refine snapshot * refine * fix * refine CheckPoint * remove session * refine * refine * refine * remove keyword.h/cpp * refine * global_step=>train_step * GetSbpSignatures * ModelInitOp
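The entry above adds LearningRateScheduleOp/Kernel driven by an AutoGlobalStep train_step blob, replacing the per-model next_model_vid bookkeeping. As an illustration only, here is a common exponential-decay schedule computed as a pure function of that step; the exact conf and formula used by the commit may differ, and the function name is a placeholder.

```cpp
#include <cmath>
#include <cstdint>
#include <iostream>

// lr(step) = base_lr * decay_rate ^ (step / decay_steps), optionally staircased.
float ExponentialDecayLearningRate(float base_lr, int64_t train_step,
                                   int64_t decay_steps, float decay_rate,
                                   bool staircase) {
  double p = static_cast<double>(train_step) / static_cast<double>(decay_steps);
  if (staircase) { p = std::floor(p); }
  return static_cast<float>(base_lr * std::pow(decay_rate, p));
}

int main() {
  // base_lr 0.1, decay by 0.9 every 1000 steps: at step 2000 -> 0.1 * 0.9^2 = 0.081
  std::cout << ExponentialDecayLearningRate(0.1f, 2000, 1000, 0.9f, /*staircase=*/true)
            << "\n";
  return 0;
}
```

Making the learning rate an ordinary blob produced by a kernel is what lets the optimizer ops consume it like any other input instead of reading it from JobDesc.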
-