Text Classification: BERT in Practice

Following the previous post, the environment for the model should already be set up.

Next, depending on the task at hand, we use BERT to improve accuracy.

This post covers text classification:

1. Data format

Prepare three files in the same format: a training set, a validation set, and a test set. Each line is one example, consisting of a label and a text separated by a tab ("\t"). (If you choose a different separator, modify the _read_tsv method in run_classifier.py.) Note that the MyProcessor example below reads .csv files directly with pandas, bypassing _read_tsv; use whichever approach fits your data.
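For reference, a tab-separated training file might look like the following (the sentences are made-up placeholders; each label must be one of the values returned by get_labels()):

toxic	first example sentence goes here
insult	second example sentence goes here
threat	third example sentence goes here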

2. Modify run_classifier.py

(1) Add a class for processing your data, class MyProcessor, as follows. (Note: DataProcessor is the base class for reading datasets; the official script already ships with four subclasses, one per dataset: Xnli, Mnli, Mrpc, and Cola.)

import pandas as pd  # needed for pd.read_csv; add near the top of run_classifier.py

class MyProcessor(DataProcessor):
    """Processor for my data set."""
    def get_train_examples(self, data_dir):
        examples = []
        file_path = os.path.join(data_dir, 'train.csv')
        df = pd.read_csv(file_path, encoding='utf-8')
        for i, data in enumerate(df.values):
            guid = 'train-%d' % (i)
            # column 1: text, column 2: label (must be one of get_labels())
            text_a = tokenization.convert_to_unicode(str(data[1]))
            label = str(data[2])
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

    def get_test_examples(self, data_dir):
        examples = []
        file_path = os.path.join(data_dir, 'test.csv')
        df = pd.read_csv(file_path, encoding='utf-8')
        for i, data in enumerate(df.values):
            guid = 'test-%d' % (i)
            text_a = tokenization.convert_to_unicode(str(data[1]))
            label = str(data[2])
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

    def get_dev_examples(self, data_dir):
        examples = []
        file_path = os.path.join(data_dir, 'dev.csv')
        df = pd.read_csv(file_path, encoding='utf-8')
        for i, data in enumerate(df.values):
            guid = 'dev-%d' % (i)
            text_a = tokenization.convert_to_unicode(str(data[1]))
            label = str(data[2])
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples

    def get_labels(self):
        """See base class."""
        return ['toxic', 'severe_toxic', 'obscene', 'threat', 'insult', 'identity_hate']
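
If you keep the tab-separated format from section 1, you can also let the base class read the files and avoid pandas entirely. A minimal sketch, assuming files named train.tsv/dev.tsv/test.tsv with the label in the first column and the text in the second (the _create_examples helper mirrors how the official processors are written):

    def get_train_examples(self, data_dir):
        return self._create_examples(
            self._read_tsv(os.path.join(data_dir, 'train.tsv')), 'train')

    def _create_examples(self, lines, set_type):
        """Shared parsing logic for train/dev/test."""
        examples = []
        for (i, line) in enumerate(lines):
            guid = '%s-%d' % (set_type, i)
            label = tokenization.convert_to_unicode(line[0])   # first column: label
            text_a = tokenization.convert_to_unicode(line[1])  # second column: text
            examples.append(InputExample(guid=guid, text_a=text_a, text_b=None, label=label))
        return examples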

 

(2) Register MyProcessor in the processors dict inside the main function, as follows:

def main(_):
  tf.logging.set_verbosity(tf.logging.INFO)

  processors = {
      "cola": ColaProcessor,
      "mnli": MnliProcessor,
      "mrpc": MrpcProcessor,
      "xnli": XnliProcessor,
      "my": MyProcessor,
  }
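
Further down in main, the script lowercases --task_name and looks the processor up in this dict, roughly as follows (paraphrasing the official script):

  task_name = FLAGS.task_name.lower()

  if task_name not in processors:
    raise ValueError("Task not found: %s" % (task_name))

  processor = processors[task_name]()

  label_list = processor.get_labels()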

(3) Converting text to features; a walk-through of the source:

The file_based_convert_examples_to_features function turns each example into features (internally it calls convert_single_example), shown below:
# Convert a single example into features
def convert_single_example(ex_index, example, label_list, max_seq_length,
                           tokenizer):
  """Converts a single `InputExample` into a single `InputFeatures`."""
  # When there are not enough examples to fill a batch (e.g. when running eval/predict on a TPU), pad with fake examples
  if isinstance(example, PaddingInputExample):
    return InputFeatures(
        input_ids=[0] * max_seq_length,
        input_mask=[0] * max_seq_length,
        segment_ids=[0] * max_seq_length,
        label_id=0,
        is_real_example=False)

  # Build a dict mapping each label string to an integer id
  label_map = {}
  for (i, label) in enumerate(label_list):
    label_map[label] = i

  tokens_a = tokenizer.tokenize(example.text_a)
  tokens_b = None
  if example.text_b:
    tokens_b = tokenizer.tokenize(example.text_b)

  if tokens_b:
    # Modifies `tokens_a` and `tokens_b` in place so that the total
    # length is less than the specified length.
    # Account for [CLS], [SEP], [SEP] with "- 3"
    _truncate_seq_pair(tokens_a, tokens_b, max_seq_length - 3)
  else:
    # Account for [CLS] and [SEP] with "- 2"
    if len(tokens_a) > max_seq_length - 2:
      tokens_a = tokens_a[0:(max_seq_length - 2)]

  # The convention in BERT is:
  # (a) For sequence pairs:
  #  tokens:   [CLS] is this jack ##son ##ville ? [SEP] no it is not . [SEP]
  #  type_ids: 0     0  0    0    0     0       0 0     1  1  1  1   1 1
  # (b) For single sequences:
  #  tokens:   [CLS] the dog is hairy . [SEP]
  #  type_ids: 0     0   0   0  0     0 0
  tokens = []
  segment_ids = []
  tokens.append("[CLS]")
  segment_ids.append(0)
  for token in tokens_a:
    tokens.append(token)
    segment_ids.append(0)
  tokens.append("[SEP]")
  segment_ids.append(0)

  if tokens_b:
    for token in tokens_b:
      tokens.append(token)
      segment_ids.append(1)
    tokens.append("[SEP]")
    segment_ids.append(1)

  input_ids = tokenizer.convert_tokens_to_ids(tokens)

  # The mask has 1 for real tokens and 0 for padding tokens. Only real
  # tokens are attended to.
  input_mask = [1] * len(input_ids)

  # Zero-pad up to the sequence length.
  # input_ids: token ids from the vocabulary
  # input_mask: 1 for real tokens, 0 for padding tokens
  # segment_ids: 0 for text_a tokens, 1 for text_b tokens
  while len(input_ids) < max_seq_length:
    input_ids.append(0)
    input_mask.append(0)
    segment_ids.append(0)

  assert len(input_ids) == max_seq_length
  assert len(input_mask) == max_seq_length
  assert len(segment_ids) == max_seq_length

  label_id = label_map[example.label]
  if ex_index < 5:
    tf.logging.info("*** Example ***")
    tf.logging.info("guid: %s" % (example.guid))
    tf.logging.info("tokens: %s" % " ".join(
        [tokenization.printable_text(x) for x in tokens]))
    tf.logging.info("input_ids: %s" % " ".join([str(x) for x in input_ids]))
    tf.logging.info("input_mask: %s" % " ".join([str(x) for x in input_mask]))
    tf.logging.info("segment_ids: %s" % " ".join([str(x) for x in segment_ids]))
    tf.logging.info("label: %s (id = %d)" % (example.label, label_id))
  feature = InputFeatures(
      input_ids=input_ids,
      input_mask=input_mask,
      segment_ids=segment_ids,
      label_id=label_id,
      is_real_example=True)
  return feature
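
As a concrete illustration, for the single sentence "the dog is hairy" with max_seq_length=10, the resulting feature would look roughly like this (token ids depend on vocab.txt, so they are shown symbolically):

  tokens:      [CLS] the dog is hairy . [SEP]
  input_ids:   [id([CLS]), id(the), id(dog), id(is), id(hairy), id(.), id([SEP]), 0, 0, 0]
  input_mask:  [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
  segment_ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
  label_id:    label_map[example.label]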

(4) Adjust the run parameters:

data_dir: folder containing the dataset files
bert_config_file: the bert_config.json file from the pre-trained BERT model
task_name: the task name registered in processors, here "my"
vocab_file: the vocab.txt file from the pre-trained BERT model
output_dir: folder where the fine-tuned classifier checkpoints will be written
init_checkpoint: the bert_model.ckpt checkpoint prefix from the pre-trained BERT model (the prefix, not the .index file itself)
do_train: whether to train; set to "True"
do_eval: whether to evaluate; set to "True"
do_predict: whether to predict on the test set; set to "False" here

max_seq_length: maximum length of the input token sequence, i.e. the maximum length processed per example; longer inputs are truncated and shorter ones are padded. The upper limit is 512.
train_batch_size: batch size used when computing gradients during training. Larger values train faster but use more memory.
eval_batch_size: batch size used during evaluation. Same trade-off as above.
predict_batch_size: batch size used during prediction. Same trade-off as above.
learning_rate: step size used when updating weights during back-propagation. Too large a value can make training unstable; too small a value slows convergence. For fine-tuning (transfer learning), a small learning rate is usually used (below 2e-4; the BERT defaults are around 2e-5 to 5e-5).
num_train_epochs: number of full passes over the training set.
warmup_proportion: fraction of the training steps used for learning-rate warmup (see the sketch after this list).
save_checkpoints_steps: how often (in steps) checkpoints are saved.
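
For reference, run_classifier.py derives the number of training and warmup steps from these flags roughly as follows:

  num_train_steps = int(
      len(train_examples) / FLAGS.train_batch_size * FLAGS.num_train_epochs)
  num_warmup_steps = int(num_train_steps * FLAGS.warmup_proportion)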

(5) Run the script:

export BERT_BASE_DIR=./uncased_L-12_H-768_A-12
export MY_DATA=./dataset
python run_classifier.py \
--task_name=my \
--do_train=true \
--do_eval=true \
--data_dir=$MY_DATA \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$BERT_BASE_DIR/bert_model.ckpt \
--output_dir=./MY_OUTPUT
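
When training and evaluation finish, the metrics are written to eval_results.txt under output_dir. To predict on the test set afterwards, a second run along these lines should work (the fine-tuned checkpoint in MY_OUTPUT is used as init_checkpoint; predictions are written to test_results.tsv, one row of per-class probabilities per example):

export TRAINED_CLASSIFIER=./MY_OUTPUT
python run_classifier.py \
--task_name=my \
--do_predict=true \
--data_dir=$MY_DATA \
--vocab_file=$BERT_BASE_DIR/vocab.txt \
--bert_config_file=$BERT_BASE_DIR/bert_config.json \
--init_checkpoint=$TRAINED_CLASSIFIER \
--output_dir=./MY_OUTPUT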

 
