OneFlow CHANGELOG V0.3.2

Changelog

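OneFlow has released version 0.3.2. Both this release and the earlier 0.3.1 are minor releases of the 0.3.0 series, so they are covered together here.
This release brings a large number of performance optimizations, adds many new features, and is among the first to support CUDA 11.1.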

Overview of major new features

  • Sublinear memory optimization
    Enable it with oneflow.experimental.scope.config(checkpointing=self.checkpoint_activations) to cut memory use substantially. For example:

    def transformer_layer(self, name, x, *, past):
        # ...
        with flow.scope.namespace(name):
            x = flow.identity(x)
            with flow.experimental.scope.config(
                checkpointing=self.checkpoint_activations
            ):
                norm1 = norm(x, name="layernorm_1")
                # ...
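
To give a feel for where the "sublinear" saving comes from, here is a minimal counting sketch in plain Python. It is not OneFlow code: it assumes a toy model in which every layer's activation costs one unit, and the function name and segment scheme are hypothetical. With checkpointing, only one activation per segment is kept, and the activations inside a segment are recomputed during the backward pass.

```python
import math
from typing import Optional


def peak_activations(num_layers: int, segment_len: Optional[int] = None) -> int:
    """Count activations held at peak in a toy model (1 unit per layer)."""
    if segment_len is None:
        # No checkpointing: every layer's activation is kept for backward.
        return num_layers
    # Checkpointing: keep one activation per segment, plus the activations
    # of the single segment that is being recomputed during backward.
    num_checkpoints = math.ceil(num_layers / segment_len)
    return num_checkpoints + segment_len


num_layers = 64
baseline = peak_activations(num_layers)
# Try every segment length and keep the one with the smallest peak.
best_segment = min(
    range(1, num_layers + 1),
    key=lambda s: peak_activations(num_layers, s),
)
checkpointed = peak_activations(num_layers, best_segment)
print(baseline, best_segment, checkpointed)
```

In this toy count the best segment length lands near the square root of the layer count, so the peak grows roughly with sqrt(n) rather than n, at the price of one extra forward pass per segment.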
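  • New checkpoint
    The new checkpoint is far more flexible. It supports partial loading and saving, reading the value of a weight (for printing and similar uses), and assigning values to weights from numpy arrays.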

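    import tempfile

    import numpy as np
    import oneflow as flow

    # refresh_session, get_checkpoint_ready_model, model_getter and dtype
    # are helpers and values that are not shown here.

    # Save every variable, then load them back in a fresh session.
    with tempfile.TemporaryDirectory() as save_dir:
        refresh_session()
        large1 = get_checkpoint_ready_model(model_getter, dtype)
        flow.checkpoint.save(save_dir)
        res1 = large1()
        refresh_session()
        large2 = get_checkpoint_ready_model(model_getter, dtype)
        # Read the saved variables from disk and load them into the new model.
        vars_in_file = flow.checkpoint.get(save_dir)
        flow.load_variables(vars_in_file)
        res2 = large2()

    # Read a weight's value and assign a new value from a numpy array.
    refresh_session()
    model = get_checkpoint_ready_model(get_add_and_reduce_mean_model, dtype)
    var_x = flow.get_all_variables()["x"]
    var_y_value_before_loading = flow.get_all_variables()["y"].numpy()
    new_val_np = np.random.random(var_x.shape).astype(np.float32)
    # Partial load: only "x" is passed, so only "x" is assigned.
    flow.load_variables({"x": new_val_np})
    var_y_value_after_loading = flow.get_all_variables()["y"].numpy()
    flow_res = model()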
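
The partial save/load behaviour described for the new checkpoint can be pictured with a small stand-in written in plain Python. This is only a sketch of the semantics, not OneFlow's implementation: `ToyVariableStore`, its methods, and the JSON file format are all hypothetical, and plain lists stand in for numpy arrays.

```python
import json
import os
import tempfile


class ToyVariableStore:
    """Hypothetical stand-in for a model's named weights (not OneFlow code)."""

    def __init__(self, **variables):
        self.variables = dict(variables)

    def save(self, path, names=None):
        # Partial save: write only the requested names (all if names is None).
        if names is None:
            selected = self.variables
        else:
            selected = {name: self.variables[name] for name in names}
        with open(path, "w") as f:
            json.dump(selected, f)

    def load(self, values):
        # Partial load: overwrite only the keys present in `values`.
        unknown = set(values) - set(self.variables)
        if unknown:
            raise KeyError(f"unknown variables: {sorted(unknown)}")
        self.variables.update(values)


store = ToyVariableStore(x=[1.0, 2.0], y=[3.0, 4.0])
with tempfile.TemporaryDirectory() as save_dir:
    path = os.path.join(save_dir, "x_only.json")
    store.save(path, names=["x"])       # save only "x"
    store.load({"x": [0.0, 0.0]})       # assign new values to "x" directly
    with open(path) as f:
        saved = json.load(f)
    store.load(saved)                   # restore "x" from the partial file
print(saved, store.variables)
```

The point of the sketch is that a load touches only the names it is given, so "y" keeps its value throughout.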
    
  • Dynamic loss scale schedule
    To enable it:

    loss_scale_policy = flow.optimizer.loss_scale.dynamic_loss_scale(increment_period=2000)
    optimizer = flow.optimizer.AdamW(..., loss_scale_policy=loss_scale_policy)
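
For readers unfamiliar with dynamic loss scaling, the commonly used scheme can be sketched in a few lines of plain Python. This is an illustration of the general technique, not OneFlow's implementation: only `increment_period` appears in the snippet above, while the class name, `initial_scale`, `multiplier`, and the floor of 1.0 are assumptions made for the example.

```python
import math


class DynamicLossScaler:
    """Toy dynamic loss scaling (hypothetical class, not OneFlow code)."""

    def __init__(self, initial_scale=2.0 ** 15, increment_period=2000, multiplier=2.0):
        self.scale = initial_scale
        self.increment_period = increment_period
        self.multiplier = multiplier
        self.good_steps = 0  # consecutive steps without inf/nan gradients

    def update(self, grads):
        """Return True if the optimizer step should be applied."""
        if all(math.isfinite(g) for g in grads):
            self.good_steps += 1
            if self.good_steps >= self.increment_period:
                # A full period without overflow: try a larger scale.
                self.scale *= self.multiplier
                self.good_steps = 0
            return True
        # Overflow: skip this step and shrink the scale.
        self.scale = max(self.scale / self.multiplier, 1.0)
        self.good_steps = 0
        return False


scaler = DynamicLossScaler(initial_scale=8.0, increment_period=2)
history = []
for grads in ([0.1], [0.2], [float("inf")], [0.3]):
    applied = scaler.update(grads)
    history.append((applied, scaler.scale))
print(history)
```

A larger `increment_period` makes the scale grow more cautiously; the value 2000 in the snippet above means the scale is only raised after 2000 consecutive overflow-free steps under this scheme.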
    
  • Support for the latest CUDA 11.1

    Install with the following command:

    python3 -m pip install --find-links https://release.oneflow.info oneflow_cu111 --user
    
  • Prebuilt packages with the XLA tensor compiler (for CUDA 10, 10.1, 10.2, and 11.0)

    Install with the following command:

    python3 -m pip install --find-links https://release.oneflow.info oneflow_cu101_xla --user
    

Major improvements and bug fixes

Changelog v0.3.0 ~ v0.3.2 (16/12/2020)

Op fixes and optimizations

Optimized ops and op combinations such as scalar mul by tensor, cast scale, prelu, and fused_scale_tril.

  • [enhancement][op] Dev sx xla clip #3656
  • [enhancement][op] Add UserOp::InferSbpSignature #3699
  • [bug][op] Fix fuse scalar mul by tensor sbp #3692
  • [bug][op] fix softmax condition #3675
  • [enhancement][op] slice_update op #3544
  • [enhancement][op] optimize rmsprop and lars optimizers #3809
  • [enhancement][op] add oneflow_range #3725
  • [enhancement][op] torch.gather #3602
  • [bug][op] skip conv2d padding dynamic test case #3813
  • [bug][op] Fix __hne in BinaryFuncFloorMod #3788
  • [bug][op] Fix bn[_add]_relu test case #3767
  • [enhancement][op][system] Make class Tensor abstract #3757
  • [enhancement][op] Add user_op::KernelCreateContext #3739
  • [bug][op] fix warning #3732
  • [api][enhancement][op] User op registry attr #3716
  • [enhancement][op][refactor] Dev refactor user op registry attr #3714
  • [bug][op] fix argwhere format #4010
  • [enhancement][op] Argwhere support empty blob #4009
  • [enhancement][op] Fuse cast scale #3999
  • [enhancement][op] layer_norm_grad_add_to_output #3998
  • [enhancement][op] Dev optimize prelu #3987
  • [api][enhancement][op] Switch identity to user op and add it to auto mixed precision clear list #3992
  • [enhancement][op] Optimize slice kernel #3989
  • [bug][op] Hotfix: add parallel cast to amp clear list #3988
  • [enhancement][op] fused_scale_tril / hot fix matmul / softmax broadcast_sub broadcast_div #3980
  • [bug][op] add combined margin cpu and fix bug #3961
  • [bug][op] fix pad op #3971
  • [bug][op] Fix constant init value #3947
  • [bug][op] indexed_slices_model_update handle empty tensor #3933
  • [bug][op] fix distribute_clone sbp #3803
  • [bug][op] Reshape backward issue with distribute split #3915
  • [enhancement][op] Remove NormalModelUpdateOpConf #3917
  • [enhancement][op] Dev unsorted segment sum #3731
  • [bug][op] Dev split like add backward #3901
  • [bug][op] distribute concat out dynamic false #3899
  • [enhancement][op] UserOpWrapper add HasGradTensor4OpOutput #3904
  • [enhancement][op] Unpack/Pack user op #3727
  • [enhancement][op] adam_bias_correction_learning_rate #3763
  • [enhancement][op][serving] add flatten op implementation #3789
  • [enhancement][op] Dev enhance sort ops #3828
  • [enhancement][op] Optimize softmax cuda kernel block size #3853
  • [enhancement][op] SplitLikeOp prefix support #3866
  • [bug][op] fix gather set_is_dynamic #3900
  • [bug][op] fix unsorted segment sum like #3898

New ops and new features for existing ops

Added ops including polyval, swish, mish, multi_square_sum, mseloss, lamb, and triplet loss.

  • [enhancement][op] Add polyval op #3541
  • [feature][op] Add broadcast like backward #3665
  • [feature][op] Add cuda_pseudo_half.h #3669
  • [feature][op][python] add swish activation #3970
  • [feature][op][python] add mish activation #3972
  • [feature][op] Add multi_square_sum op #3977
  • [feature][op] TrilOp add fill value #3960
  • [feature][op] add combined margin loss #3819
  • [feature][op] dynamic loss scale schedule op #3885
  • [feature][op][python] add mseloss #3893
  • [feature][op] LAMB support #3620
  • [feature][op] logical slice_assign and slice op #3647
  • [feature][op][system] Add Repeat/Acc user op #3707
  • [feature][op][ssp] Ssp variable proxy #3715
  • [feature][op] multi_count_not_finite op #3879
  • [feature][op] model update op add skip if #3883
  • [feature][python] Add triplet loss #3864
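
Several of the ops added in this list have short closed-form definitions. The sketch below writes out the standard formulas for swish, mish, polyval, and mean squared error on plain Python floats. It is a reference for the math only and does not call OneFlow: the function signatures, the `beta` parameter, the highest-degree-first coefficient order (as in NumPy's `polyval`), and the mean reduction are assumptions, and the code is not hardened against overflow for very large inputs.

```python
import math


def swish(x, beta=1.0):
    # swish(x) = x * sigmoid(beta * x)
    return x / (1.0 + math.exp(-beta * x))


def mish(x):
    # mish(x) = x * tanh(softplus(x)), where softplus(x) = ln(1 + e^x)
    return x * math.tanh(math.log1p(math.exp(x)))


def polyval(coeffs, x):
    # Horner's rule; coefficients run from the highest degree to the constant.
    result = 0.0
    for c in coeffs:
        result = result * x + c
    return result


def mse_loss(pred, target):
    # Mean of squared differences over all elements.
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


print(swish(1.0), mish(1.0))
print(polyval([2.0, 3.0, 1.0], 2.0))      # 2*x^2 + 3*x + 1 at x = 2
print(mse_loss([1.0, 2.0], [1.0, 4.0]))
```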

System components

OneFlow Collective Boxing now supports NCCL All2All, and OneFlow can be built with CUDA 11.1.

  • [feature][system] Add Nccl All2All #3538
  • [WIP][bug][system] Add attribute “batch_axis_non_change” to oneflow.transpose #3685
  • [bug][system] fix memcopy #3687
  • [documentation][enhancement][system] change url link of api docs #3677
  • [enhancement][system] Op collection #3833
  • [bug][system] fix pybind11 include #3876
  • [enhancement][system] Dev replace str to cfg obj in python callback #3832
  • [enhancement][system] Dev cpp instructions builder #3829
  • [enhancement][system] Dev forward declare cfg #3808
  • [bug][system] Fix CUDA 11.1 compiler crashes #3795
  • [bug][system] Backport bug fixes for distributed run from multi node ci #3765
  • [bug][system] Fix handle remote regst #3761
  • [enhancement][system] Refactor ExecKernel::bn_in_op2regst_desc_id to bn_in_op2blob_info #3744
  • [enhancement][system] Dev scope attr value #3756
  • [enhancement][system] rename UserOpAttrVal to AttrValue #3752
  • [enhancement][system] refactor OpGraphPass to JobPass #3745
  • [enhancement][system] RtRegst/Regst GetBlobDesc/BlobByOrdinal #3737
  • [enhancement][system] Log WARNING to stderr #3713
  • [enhancement][system] Use cudaMemcpyDefault #3700
  • [enhancement][system] Migrate foreigns to pybind11 #3939
  • [enhancement][system] Optimize NcclCollectiveBoxingExecutorBackend::ExecuteGroup latency #3997
  • [feature][system] OptimizerPlacementOptimization #3944
  • [feature][system] New checkpoint #3540
  • [enhancement][system] Sublinear memory cost by checkpointing #3976
  • [enhancement][system] Add gradients stats aggregation #3979
  • [feature][system] nccl enable mixed fusion #3981
  • [enhancement][system] remove serialized in python callback #3891
  • [bug][system] Fix CollectiveBoxingGenericTaskNode::ProduceAllRegstsAndBindEdges #3946
  • [feature][system] Add NaiveB2PSubTskGphBuilder #3942
  • [bug][system] disable new checkpoint by default temporarily #3943
  • [bug][system] Explicitly specify the SBP in NonDistributedOptimizerPass #3937
  • [enhancement][system] Add ssp variable proxy #3859
  • [cfg][enhancement][system] Dev switch error proto with cfg error proto #3858
  • [enhancement][refactor][system] New Chain #3874
  • [feature][system] DynamicLossScale #3886
  • [bug][system] Remove CheckNoCycle in chain graph #3693
  • [feature][ssp][system] Memory Reuse support time shape > meta shape #3796
  • [feature][system] OneFlow support tensor shape max dim size up to 6 #3802
  • [bug][enhancement][system] Support Ampere devices #3806
  • [enhancement][system] Simple kernel memory bandwidth profiler #3855
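
The All2All collective mentioned at the top of this section has simple semantics: every rank sends a distinct chunk to every other rank, so the result is a transpose of the send buffers. The single-process simulation below shows only that data movement. It makes no NCCL or OneFlow calls, and the function name and buffer layout are hypothetical.

```python
def all_to_all(send_buffers):
    """Simulate all-to-all: send_buffers[i][j] is the chunk rank i sends to rank j."""
    world_size = len(send_buffers)
    if any(len(row) != world_size for row in send_buffers):
        raise ValueError("each rank must provide one chunk per destination rank")
    # recv[j][i] is the chunk rank j received from rank i.
    return [
        [send_buffers[src][dst] for src in range(world_size)]
        for dst in range(world_size)
    ]


send = [
    ["a0", "a1", "a2"],  # rank 0 sends a0 to rank 0, a1 to rank 1, a2 to rank 2
    ["b0", "b1", "b2"],
    ["c0", "c1", "c2"],
]
recv = all_to_all(send)
print(recv)
```

Applying the operation twice returns the original layout, which is a handy sanity check when reasoning about it.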

Eager mode

Fixed a series of bugs.

  • [bug][eager] Use universal start global device id for all streams #3701
  • [bug][eager] Ci add eager #3672
  • [bug][eager] Fix eager mode bug #3681
  • [eager][feature] Eager transport #3598
  • [eager][enhancement][python][refactor] rm scope_proto symbol_id #3865
  • [cfg][eager][enhancement] Replace py instruction to CFG Instruction #3773
  • [eager][enhancement][refactor] refactor ParallelDescSymbol #3774
  • [eager][feature] use proxy blob_object for boxing, add some inter-node boxing #3711
  • [bug][eager] fix unpacked mirrored blob object shape #3703
  • [bug][eager] Fix eager memory leak and re-enable new checkpoint #4008
  • [bug][eager] barrier for multi node eager #3748

Python frontend

  • [api][documentation][python] Dev add api rst #3695
  • [feature][python][refactor] add check in deconv #3835
  • [bug][enhancement][python] fix string format in py35 #3878
  • [bug][python] fix exception in BlobObject del #3742
  • [bug][python] make float/double as aliases of float32/float64 #3740
  • [api][bug][documentation][python] Fix placement api doc #3638
  • [cfg][enhancement][python] Dev replace py job conf proto to cfg #3856
  • [feature][python] add bceloss #3804
  • [enhancement][feature][python] add l1 loss op in python #3793

Toolchain

More SWIG interfaces have been replaced with Pybind11.

  • [documentation][tooling] Add api docs zzk #3680
  • [documentation][tooling] Add api docs zzk #3587
  • [cfg][enhancement][tooling] Cfg template operator reform #3861
  • [cfg][enhancement][tooling] Dev use union instead of struct for oneof #3870
  • [cfg][enhancement][tooling] Sort cfg obj forward declare #3844
  • [enhancement][tooling] Dev move run instruction to pybind #3775
  • [bug][cfg][tooling] fix cfg module load error bug #3815
  • [bug][tooling] Fix oneflow worker launch in py35 #3778
  • [bug][cfg][tooling] Fix cfg sub proto module process bug #3729
  • [enhancement][tooling] Dev data onerec #3104
  • [cfg][enhancement][tooling] Dev compare cfg file #3717
  • [bug][tooling] remove proto not related to Instruction #3708
  • [bug][cfg][tooling] Dev switch instruction to cfg instruction #3702
  • [cfg][enhancement][refactor][tooling] replace ScopeProto to cfg #3816
  • [api][enhancement][refactor][tooling] Refine custom op build #3925
  • [enhancement][tooling] default show cpp error stack frame #3948
  • [cfg][enhancement][tooling] Dev replace py parallel conf proto to cfg #3810
  • [cfg][enhancement][tooling] optimize cfg generator to save time #3906
  • [enhancement][feature][tooling] Py kernel2 #3686

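Build

Fixed the NVCC arguments; fixed CMake setting the wrong environment variable for the C++11 ABI under RedHat GCC; fixed make -j problems that could occur during the build; fixed the include directory going missing in manual builds.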

  • [build][documentation] fix readme #3694
  • [bug][build] fix missing symbol when load so #3676
  • [bug][build] Fix CUDA_NVCC_GENCODES #3869
  • [build][documentation] Add info in readme about how to build oneflow in docker #3781
  • [build][ci][enhancement] Add bazel_cache dir for XLA build #3766
  • [bug][build] fix ubuntu build relocation R_X86_64_PC32 against symbol error #3754
  • [build][ci][enhancement] Refactor build script #3698
  • [bug][build] fix make -j in grpc and openssl #3724
  • [bug][build] detect cxx11 abi availability in cmake #3709
  • [bug][build] fix include files not copied #3907

CI

Improved run speed and stability, and added support for distributed environments.

  • [bug][ci] test use uuid log dir #3689
  • [ci][enhancement] Run check_license_and_format in every branch #3683
  • [ci][feature][test] Parallel run op cases #3670
  • [ci][enhancement] Run xla and pure cpu only when cuda test succeeds #3679
  • [ci][documentation][enhancement] add requirements.txt for api-docs #3671
  • [ci][enhancement] ci add label check workflow #3664
  • [ci][enhancement] CI merge all jobs into one #3868
  • [ci][enhancement] Check label every push #3863
  • [ci][enhancement] Update hard coded host affiliations #3847
  • [ci][enhancement] External PR skip oss steps #3843
  • [ci][enhancement] ci use pull_request ev #3842
  • [ci][enhancement] ci only use pull_request_target #3840
  • [ci][enhancement] Add pull_request_target to allow forks access secrets when CI triggered #3837
  • [ci][enhancement] CI run when bot is requested review #3831
  • [ci][enhancement] Prevent CI failure #3830
  • [ci][enhancement] ci dont test 2n8c #3786
  • [ci][enhancement] upload bin to oss #4000
  • [ci][enhancement][test] larger tol for bn #3965
  • [bug][ci] fix oss list file 100 limit #3935
  • [ci][enhancement] Refine release oss url #3924
  • [ci][enhancement] Build master whl once a day #3894
  • [ci][feature] Multi node support in CI #3735

Test

Fixed the image resize test cases.

  • [bug][test] Fix image_test_util #3690
  • [bug][test] Fix image resize test #3666
  • [enhancement][test] import tensorflow in RunTensorFlowOp #3682
