daquexian authored and GitHub committed
* add changes for multi dev demo
* add part of backward hook
* update
* add naive init_with_env
* update
* update
* support_multi_client
* update
* Remove unused code
* Fix multi client launch
* fix __main__ bug
* update abcd op
* fix multi client sync, make nccl instr ordered
* temp changes
* Use functional api instead of op_expr_helper::XXXOp.
* align with latest master, remove unused code
* local rank returns 0 when no env var, save is_multi_client in EnvDesc
* move is_multi_client to ProcessCtx, rename cuda_d2d device to nccl, remove unused code
* abcd -> return_first_input op
* remove launch.py for now
* refine
* update IsMultiClient in env_util.py
* rm multi_dev_demo.py
* remove exported functions in env_util.py
* remove unused op expr helper func
* fix bug
* add DevVmDepObjectConsumeMode and set it as NONE in backward
* move return_first_input op from math_ops.py to tensor_ops.py
* fix compile error
* refine
* add comments
* fix exit bug in init.py
* align with master
* update device ctor
* default dev id = local rank % gpu num
* assert single machine
* reformat
* fix consume mode, implement eager_nccl_allreduce by process ranks
* fill sorted_ranks field in old code, reformat
* set default val for op conf, align with master
* impl return_first_input as functional api, impl allreduce as module
* add more tests
* reformat
* align with master
* rename ddp to flow.nn.parallel.DistributedDataParallel (usage sketch after this list)
* refine eager nccl comm
* refine eager nccl comm, divide grad by group size
* rename reversed_param_list -> ddp_state_for_reversed_params
* make return_first_input inplace
* restore eager allreduce
* add static all zero tensor and select first
* refine
* add functional allreduce op and use current rank group (see the allreduce sketch after this list)
* materialize StaticAllZeroTensor in allreduce, support it in scalar mul
* materialize static zeros tensor in set_acc_grad
* rename
* auto format by CI

Signed-off-by: daquexian <daquexian566@gmail.com>
Co-authored-by: clackhan <han_binbin@163.com>
Co-authored-by: hjchen2 <chenhoujiangcug@gmail.com>
Co-authored-by: oneflow-ci-bot <69100618+oneflow-ci-bot@users.noreply.github.com>
Co-authored-by: oneflow-ci-bot <ci-bot@oneflow.org>
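
A minimal usage sketch of the `flow.nn.parallel.DistributedDataParallel` wrapper this PR introduces. The module path comes from the log above; the toy `Linear` model, the tensor shapes, and the spelled-out device placement are illustrative assumptions, written against the current OneFlow API rather than taken from this PR's diff.

```python
# Minimal DDP sketch; run one process per GPU. Toy model and shapes are assumptions.
import oneflow as flow
from oneflow.nn.parallel import DistributedDataParallel as ddp

# Per the log, the default device id falls back to local_rank % gpu count;
# spelled out explicitly here for clarity.
device = flow.device(f"cuda:{flow.env.get_local_rank() % flow.cuda.device_count()}")

model = flow.nn.Linear(8, 4).to(device)  # hypothetical toy model
model = ddp(model)                       # hooks gradient allreduce into backward

x = flow.randn(2, 8, device=device)
loss = model(x).sum()
loss.backward()  # per the log, grads are allreduced and divided by group size
```

In current OneFlow this would be launched with `python3 -m oneflow.distributed.launch`; note the log says `launch.py` was removed from this PR "for now", so the launcher landed separately.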
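
The "functional allreduce op" entry corresponds in behavior to what modern OneFlow exposes as `flow.comm.all_reduce` (a sum across the current rank group); whether this PR used that exact spelling is not shown in the log, so treat this as a sketch of the behavior, not of the PR's API.

```python
# Hedged sketch: flow.comm.all_reduce exists in current OneFlow; that this PR's
# functional op shared the name and in-place semantics is an assumption.
import oneflow as flow

# Each rank contributes its rank id; afterwards every rank holds the sum.
t = flow.ones(2, 2, device="cuda") * flow.env.get_rank()
flow.comm.all_reduce(t)  # in-place sum over all ranks in the current group
print(t)
```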