At a certain 某心 bootcamp, the most common ad is the so-called "reproduce XLNet" pitch. The premise: if you can't hand-write XLNet in a one-hour interview, you don't even have the fundamentals, which is to say, no company would take you even if you paid them.

This ad has done real damage. Never mind that the course it funnels you into is of no help at all; it's really just elementary optimization theory.

Let's start with why reproducing XLNet like this is impossible. Look at the XLNet source code, and at this file in particular. See for yourself how long it is; as I remember, it runs to forty pages printed out. I don't believe even the instructor could type forty pages in an hour, let alone reproduce them from scratch.

And a more critical point: most top-conference papers don't release every detail. Take this block of code:
flags.DEFINE_string("master", default=None,
help="master")
flags.DEFINE_string("tpu", default=None,
help="The Cloud TPU to use for training. This should be either the name "
"used when creating the Cloud TPU, or a grpc://ip.address.of.tpu:8470 url.")
flags.DEFINE_string("gcp_project", default=None,
help="Project name for the Cloud TPU-enabled project. If not specified, "
"we will attempt to automatically detect the GCE project from metadata.")
flags.DEFINE_string("tpu_zone",default=None,
help="GCE zone where the Cloud TPU is located in. If not specified, we "
"will attempt to automatically detect the GCE project from metadata.")
flags.DEFINE_bool("use_tpu", default=True,
help="Use TPUs rather than plain CPUs.")
flags.DEFINE_integer("num_hosts", default=1,
help="number of TPU hosts")
flags.DEFINE_integer("num_core_per_host", default=8,
help="number of cores per host")
flags.DEFINE_bool("track_mean", default=False,
help="Whether to track mean loss.")
# Experiment (data/checkpoint/directory) config
flags.DEFINE_integer("num_passes", default=1,
help="Number of passed used for training.")
flags.DEFINE_string("record_info_dir", default=None,
help="Path to local directory containing `record_info-lm.json`.")
flags.DEFINE_string("model_dir", default=None,
help="Estimator model_dir.")
flags.DEFINE_string("init_checkpoint", default=None,
help="Checkpoint path for initializing the model.")
# Optimization config
flags.DEFINE_float("learning_rate", default=1e-4,
help="Maximum learning rate.")
flags.DEFINE_float("clip", default=1.0,
help="Gradient clipping value.")
# lr decay
flags.DEFINE_float("min_lr_ratio", default=0.001,
help="Minimum ratio learning rate.")
flags.DEFINE_integer("warmup_steps", default=0,
help="Number of steps for linear lr warmup.")
flags.DEFINE_float("adam_epsilon", default=1e-8,
help="Adam epsilon.")
flags.DEFINE_string("decay_method", default="poly",
help="Poly or cos.")
flags.DEFINE_float("weight_decay", default=0.0,
help="Weight decay rate.")
# Training config
flags.DEFINE_integer("train_batch_size", default=16,
help="Size of the train batch across all hosts.")
flags.DEFINE_integer("train_steps", default=100000,
help="Total number of training steps.")
flags.DEFINE_integer("iterations", default=1000,
help="Number of iterations per repeat loop.")
flags.DEFINE_integer("save_steps", default=None,
help="Number of steps for model checkpointing. "
"None for not saving checkpoints")
flags.DEFINE_integer("max_save", default=100000,
help="Maximum number of checkpoints to save.")
# Data config
flags.DEFINE_integer("seq_len", default=0,
help="Sequence length for pretraining.")
flags.DEFINE_integer("reuse_len", default=0,
help="How many tokens to be reused in the next batch. "
"Could be half of `seq_len`.")
flags.DEFINE_bool("uncased", False,
help="Use uncased inputs or not.")
flags.DEFINE_integer("perm_size", 0,
help="Window size of permutation.")
flags.DEFINE_bool("bi_data", default=True,
help="Use bidirectional data streams, i.e., forward & backward.")
flags.DEFINE_integer("mask_alpha", default=6,
help="How many tokens to form a group.")
flags.DEFINE_integer("mask_beta", default=1,
help="How many tokens to mask within each group.")
flags.DEFINE_integer("num_predict", default=None,
help="Number of tokens to predict in partial prediction.")
flags.DEFINE_integer("n_token", 32000, help="Vocab size")
# Model config
flags.DEFINE_integer("mem_len", default=0,
help="Number of steps to cache")
flags.DEFINE_bool("same_length", default=False,
help="Same length attention")
flags.DEFINE_integer("clamp_len", default=-1,
help="Clamp length")
flags.DEFINE_integer("n_layer", default=6,
help="Number of layers.")
flags.DEFINE_integer("d_model", default=32,
help="Dimension of the model.")
flags.DEFINE_integer("d_embed", default=32,
help="Dimension of the embeddings.")
flags.DEFINE_integer("n_head", default=4,
help="Number of attention heads.")
flags.DEFINE_integer("d_head", default=8,
help="Dimension of each attention head.")
flags.DEFINE_integer("d_inner", default=32,
help="Dimension of inner hidden size in positionwise feed-forward.")
flags.DEFINE_float("dropout", default=0.0,
help="Dropout rate.")
flags.DEFINE_float("dropatt", default=0.0,
help="Attention dropout rate.")
flags.DEFINE_bool("untie_r", default=False,
help="Untie r_w_bias and r_r_bias")
flags.DEFINE_string("summary_type", default="last",
help="Method used to summarize a sequence into a compact vector.")
flags.DEFINE_string("ff_activation", default="relu",
help="Activation type used in position-wise feed-forward.")
flags.DEFINE_bool("use_bfloat16", False,
help="Whether to use bfloat16.")
These flags are the core of pretraining. Some are easy to understand; some are barely mentioned in the paper. But I have used them myself while training XLNet (yes, I have actually trained it on a TPU Pod), and they matter enormously. If you don't know how to tune these parameters, whatever you manage to train will be useless.
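
To make "knowing how to tune them" concrete, here is a small consistency check I wrote for illustration; it is not code from the XLNet repo. It encodes only what the help strings above say: reuse_len can be half of seq_len, perm_size is a permutation window (treated here as fitting inside the reused segment, an assumption of mine), and mask_beta tokens are masked within each mask_alpha-token group, which bounds a sensible num_predict. The example values are plausible placeholders in the spirit of the published large-model setup, not a verified recipe.

def check_pretrain_flags(seq_len, reuse_len, perm_size,
                         mask_alpha, mask_beta, num_predict=None):
    # reuse_len: tokens reused in the next batch; "could be half of seq_len".
    assert reuse_len <= seq_len, "cannot reuse more tokens than the sequence holds"
    # Assumption for this sketch: the permutation window fits in the reused segment.
    assert perm_size <= reuse_len, "permutation window exceeds the reused segment"

    # One group = mask_alpha tokens, of which mask_beta are masked, so at most
    # roughly seq_len * mask_beta / mask_alpha positions can become targets.
    budget = seq_len * mask_beta // mask_alpha
    if num_predict is not None and num_predict > budget:
        raise ValueError(f"num_predict={num_predict} exceeds the masking budget (~{budget})")
    return budget

# Placeholder values: seq_len=512 with reuse_len=seq_len//2, mask_alpha=6, mask_beta=1.
print(check_pretrain_flags(512, 256, 256, 6, 1, num_predict=85))  # -> 85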
One last question: hands up, which companies in China can actually afford to train XLNet?
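
To put numbers behind that question, here is a back-of-the-envelope parameter count built from the model flags above, using the standard transformer formulas. This is my own rough sketch; it ignores biases, layer norms, and XLNet's extra relative-attention parameters (r_w_bias, r_r_bias, and the relative projection), so it undercounts slightly. Plugging in the published XLNet-Large shape lands in the same ballpark as BERT-Large's 340M parameters, which is exactly why pretraining is a TPU-pod exercise rather than a single-GPU one.

def approx_param_count(n_layer, d_model, d_inner, n_token, d_embed=None):
    d_embed = d_embed or d_model
    embedding = n_token * d_embed        # token embeddings (tied with the softmax)
    attention = 4 * d_model * d_model    # Q, K, V and output projections per layer
    ffn = 2 * d_model * d_inner          # position-wise feed-forward per layer
    return embedding + n_layer * (attention + ffn)

# XLNet-Large shape as published: 24 layers, d_model=1024, d_inner=4096, vocab 32k.
print(f"~{approx_param_count(24, 1024, 4096, 32000) / 1e6:.0f}M parameters")  # ~335M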
The reason I call this out is that I don't want this kind of manufactured anxiety to spread. I have never taken a 贪心学院 course, so I won't pass judgment on the courses themselves. But this ad copy is disgusting and cheap. Please stop running it.