Processing Natural Language Models with CNTK

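Preface

CNTK is Microsoft's open-source deep learning toolkit. It implements the computation for a wide range of neural network architectures used in machine learning. Deep learning is currently a mainstream research direction in natural language processing, so in this article I explain a few simple language-model concepts and then work through some basic hands-on practice with CNTK.

Language Models

To explain what a language model is, we should start with what a model is. There are many ways to understand the idea; one of them is that a model is a simple way of grasping something complex. Put simply, when you do not know where to start with a problem, a model turns it into something that is easier to analyze, compute, or use.

A language model is an abstraction that converts natural language into a mathematical model a machine can process.

More specifically, a language model describes natural language in terms of probability.

For example: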

CNTK is a general solution for training and testing many kinds of neural networks.

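Natural language here means the sentence itself: a sentence is a number of words arranged in a particular order. Probability means how likely it is for these words to appear in that particular order.

If the words in the sentence above are wrong, it reads badly. Change it to the following, for instance, and it makes no sense. In a language model, this shows up as a very low probability for the whole sentence.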

Cat is a general Dog for eating and drinking many kinds of neural friends.

Likewise, if the words appear in the wrong order, the sentence is just as hard to follow:

A is CNTK general training solution for and testing kinds many of networks neural.

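In short, the meaning is most likely to come across only when the words are arranged in the right order. "Likely" here means probability. A model that uses probability to capture which word should appear at which position can be understood as a natural language model. (This is my personal understanding; please point out any mistakes.)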

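Describing Language with Probability

Let $w$ denote a word (from here on, read $w$ as some specific word). The probability that word $w_1$ appears after $w_0$ is $P(w_1 \mid w_0)$, and the probability that $w_2$ appears after $w_1$ is $P(w_2 \mid w_1)$.

From this, if we assume each word depends only on the word immediately before it, we get $P(w_1, w_2 \mid w_0) = P(w_1 \mid w_0)\,P(w_2 \mid w_1)$. Read it this way: given a word $w_0$, the next word is $w_1$ with probability $P(w_1 \mid w_0)$; the probability that $w_1$ is then followed by $w_2$ is that value multiplied by $P(w_2 \mid w_1)$.

Continuing the recursion, the probability of a whole sentence, given its first word $w_0$, is (written with an ellipsis so readers do not have to parse the product symbol ∏):

$P(w_1, \dots, w_n \mid w_0) = P(w_n \mid w_{n-1}) \cdots P(w_1 \mid w_0)$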
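To make the product concrete, here is a minimal Python sketch that estimates each conditional probability by counting word pairs in a tiny corpus and then multiplies them along a sentence. The corpus, the `<s>` and `</s>` markers, and the function names are all invented for illustration; `<s>` plays the role of the given first word $w_0$.

```python
from collections import Counter

def train_bigram(sentences):
    """Count how often each word occurs as a predecessor, and each adjacent word pair."""
    prev_counts, pair_counts = Counter(), Counter()
    for tokens in sentences:
        for prev, cur in zip(tokens, tokens[1:]):
            prev_counts[prev] += 1
            pair_counts[(prev, cur)] += 1
    return prev_counts, pair_counts

def sentence_probability(tokens, prev_counts, pair_counts):
    """Multiply the estimates P(cur | prev) = count(prev, cur) / count(prev)."""
    prob = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        if prev_counts[prev] == 0:
            return 0.0  # predecessor never seen, so no estimate is available
        prob *= pair_counts[(prev, cur)] / prev_counts[prev]
    return prob

# Toy corpus (hypothetical data); <s> is the start marker, </s> the end marker.
corpus = [
    "<s> the cat sat </s>".split(),
    "<s> the dog sat </s>".split(),
    "<s> the cat ran </s>".split(),
]
prev_counts, pair_counts = train_bigram(corpus)

good = sentence_probability("<s> the cat sat </s>".split(), prev_counts, pair_counts)
bad = sentence_probability("<s> sat the cat </s>".split(), prev_counts, pair_counts)
print(good, bad)
```

The well-ordered sentence gets a higher probability than the scrambled one, which is exactly the behavior described for the two example sentences earlier. A real model would also need smoothing so that unseen word pairs do not force the whole product to zero.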

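In many papers you will see a sum instead. Why a sum? Multiplying many numbers smaller than 1 gives a product so tiny that it underflows to zero on a computer. So, to make the computation practical (and faster), the logarithm is taken on both sides of the equation. The relevant property of the logarithm is:

$\log(AB) = \log(A) + \log(B)$

Taking the log of both sides avoids the problem caused by multiplication and turns the product into a sum that is easy to compute:

$\log P(w_1, \dots, w_n \mid w_0) = \log P(w_n \mid w_{n-1}) + \dots + \log P(w_1 \mid w_0)$

This probabilistic formulation is the language model. (This is my personal understanding; please point out any mistakes.)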
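The underflow problem is easy to reproduce. The short Python sketch below multiplies 200 probabilities of 0.01 (arbitrary numbers chosen only for illustration) and compares the result with the sum of their logarithms.

```python
import math

# 200 conditional probabilities, each 0.01 (made-up values for illustration).
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p  # the true value 1e-400 is too small for a 64-bit float

log_sum = sum(math.log(p) for p in probs)

print(product)  # the product has underflowed to zero
print(log_sum)  # the log-domain sum is still an ordinary number
```

The product collapses to zero, so two different sentences could no longer be compared, while the sum of logs keeps all the information.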

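Representing Words

Above, we used w to denote a word, but a word is still just a word: inside the computer it is a string of bytes in memory with no mathematical meaning. (Perhaps it has some: the character 'A' is less than the character 'C' in ASCII, but that says nothing about the semantic relationships within a sentence.) How do we represent a word mathematically so that it can take part in language processing?

Identifying Different Words

First we need a mathematical way to tell words apart. There are many methods; the simplest (and also an effective and very common one) is to give each word an integer ID. For example, 0 stands for Hello and 1 stands for World, and so on for every word. This moves words into the realm of numbers, and because each word gets its own number, no two different words share a code.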
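As a small illustration of this numbering scheme, the Python sketch below assigns IDs in order of first appearance. The function name and the sample text are my own; real toolkits often number words by frequency instead.

```python
def build_vocab(tokens):
    """Assign each distinct word the next unused integer, in order of first appearance."""
    word_to_id = {}
    for token in tokens:
        if token not in word_to_id:
            word_to_id[token] = len(word_to_id)
    return word_to_id

tokens = "Hello World Hello CNTK".split()
word_to_id = build_vocab(tokens)
ids = [word_to_id[token] for token in tokens]

print(word_to_id)
print(ids)
```

Repeated words map to the same ID, and different words never collide.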

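Relationships Between Words

Another problem: two words may be different yet close in meaning. How do we express the relationship between different words? The idea may seem puzzling, but the following explanation makes it clear.

What is the relationship between debug and smile? The simple answer is that they normally have nothing to do with each other. What about laugh and smile? They are related: both describe the same expression of amusement, differing only in degree.

That is what we mean by the relationship between words. Such relationships can be strong or weak, and words can even be opposed to each other. Mathematically, we usually express this with distance. (Distance has a rigorous mathematical definition; I use the term loosely here, since this is a popular introduction rather than a rigorous one.)

How do we express distance? The one we meet most in daily life is Euclidean distance, so let us use that: in a plane (think of the keyboard as a plane, and the distance from the Z key to the P key) or in space (the distance from the mouse to the top-left corner of the screen). Words have this kind of distance too: the distance between their meanings.

We introduce a vector representation to express the meaning of each word in natural language, for example:

$w \rightarrow [0.12, 0.98, 0.32, 0.95, 0.53, \dots, 0.12]$

This is the word vector you see in papers. Where does the vector come from? It is either summarized from experience or learned by an algorithm.

You can think of this vector as a radar chart. We define the various aspects of language as one or more dimensions each: whether the word is a verb, whether it is cheerful, whether it is sarcastic, and so on. There is no fixed rule for how many dimensions to use; common choices are 50, 100, and 200.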
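Once words are vectors, the "distance between meanings" can be computed directly. The Python sketch below uses three-dimensional vectors that I made up by hand so that smile and laugh land close together and debug lands far away; real word vectors are learned and have many more dimensions.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hand-made toy vectors (hypothetical values, not learned from data).
vectors = {
    "smile": [0.9, 0.8, 0.1],
    "laugh": [0.8, 0.9, 0.2],
    "debug": [0.1, 0.0, 0.9],
}

d_close = euclidean(vectors["smile"], vectors["laugh"])
d_far = euclidean(vectors["smile"], vectors["debug"])
print(d_close, d_far)
```

The smaller distance between smile and laugh reflects their related meanings. Cosine similarity is another common choice in practice.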

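Computing with Words

Before word vectors, words were independent of one another. Word vectors turn words into vectors, and vectors can be used in calculations, so words become connected to each other.

$\vec{w}_0 - \vec{w}_1 = \vec{w}_2 - \vec{w}_3$

Again, an example. In everyday life we meet pairs of related words, such as King and Queen. Suppose we are asked: following the pattern of King to Queen, what does Man correspond to? We readily answer Woman, but how did we work that out?

We need to solve this by calculation. From the formula above, we easily obtain:

$\vec{w}_3 = \vec{w}_2 + \vec{w}_1 - \vec{w}_0$

Substituting the word vectors of King, Queen, and Man for $\vec{w}_0$, $\vec{w}_1$, and $\vec{w}_2$ gives $\vec{w}_3$. Looking up the word whose vector is closest to $\vec{w}_3$ in the word-vector table then yields an answer like Woman.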
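The calculation can be sketched in a few lines of Python. The two-dimensional vectors below are invented so that the arithmetic works out exactly (think of the axes as "royalty" and "maleness"); with real learned vectors the result is only approximately equal to the target, which is why the last step is a nearest-neighbor lookup.

```python
import math

# Hypothetical 2-dimensional word vectors: (royalty, maleness).
vectors = {
    "king":  [1.0, 1.0],
    "queen": [1.0, 0.0],
    "man":   [0.0, 1.0],
    "woman": [0.0, 0.0],
    "apple": [0.5, 0.5],
}

def euclidean(a, b):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def analogy(w0, w1, w2, table):
    """Solve w0 - w1 = w2 - w3 for w3, then return the nearest word in the table."""
    target = [c + b - a for a, b, c in zip(table[w0], table[w1], table[w2])]
    candidates = [word for word in table if word not in (w0, w1, w2)]
    return min(candidates, key=lambda word: euclidean(target, table[word]))

answer = analogy("king", "queen", "man", vectors)
print(answer)
```

The three input words are excluded from the candidates, a common convention so that the lookup does not simply return one of the question words.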

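Handling Language Models with Networks

Applying Neural Networks to Language Models

An earlier article introducing CNTK briefly covered neural networks, but how does a neural network map onto a language model? Let us start with a figure that has been widely circulated online, from "A Neural Probabilistic Language Model". The full paper can be found and downloaded through a Bing search.

[Figure 1: neural network architecture from "A Neural Probabilistic Language Model"]

In the figure above, the green blocks at the bottom are the integer word IDs described in "Identifying Different Words". Each word is an index into the whole vocabulary.
C is a matrix that maps a word's index in the vocabulary to its word vector.
This mapping from word index to word vector corresponds to the input layer of the neural network.

The word vectors of several consecutive words are fed directly into the hidden layer.

Finally, the output layer produces the result. Its values are the likelihoods in the language model, that is, the probability P introduced earlier.

With this mapping, the language model described earlier is carried over directly onto a neural network.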
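The data flow just described (index lookup in C, concatenation, hidden layer, output probabilities) can be sketched in plain Python. This is a simplified, untrained toy with random weights: the sizes, the names `W_h` and `W_o`, and the omission of bias terms and of the paper's optional direct connections are all my own simplifications.

```python
import math
import random

random.seed(0)
V, D, H, N = 5, 3, 4, 2  # vocabulary size, vector size, hidden units, context words

def rand_matrix(rows, cols):
    """Small random weights; a trained model would learn these values."""
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(V, D)        # row i is the word vector of word ID i
W_h = rand_matrix(H, N * D)  # weights from the concatenated vectors to the hidden layer
W_o = rand_matrix(V, H)      # weights from the hidden layer to the output layer

def matvec(matrix, vector):
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Turn arbitrary scores into probabilities that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def next_word_distribution(context_ids):
    """Return P(next word | context) for every word ID in the vocabulary."""
    x = [value for word_id in context_ids for value in C[word_id]]  # lookup, concatenate
    hidden = [math.tanh(a) for a in matvec(W_h, x)]
    return softmax(matvec(W_o, hidden))

p = next_word_distribution([1, 3])
print(p)
```

Because the weights are random, the probabilities are close to uniform; training adjusts `C`, `W_h`, and `W_o` so that words that really follow the context receive higher probability.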

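Most importantly, once training finishes, the trained network itself contains the set of word vectors (held in the matrix C). This is one of the advantages of the neural network approach.

As training methods have improved, a variety of more complex network structures have also been introduced to improve the results of the trained network.

Using CNTK for Language Models

At last we reach the CNTK part. In this article CNTK is simply the tool used to train the network. CNTK can build all kinds of complex networks, and different networks may give different training results. This article works through CNTK's PennTreebank example.

Preparing the Samples

First, the samples. PennTreebank is a corpus of natural language samples. The PennTreebank example that ships with CNTK contains some sentences, but these are only a subset of PennTreebank. For a larger set of samples, see the official site: https://www.cis.upenn.edu/~treebank/

Writing the CNTK Configuration File

Using CNTK always involves a configuration file, so language modeling with CNTK also starts from the configuration file.

Preprocessing wordclass

First we decide how to feed in the data. CNTK already provides readers, and we can use LMSequenceReader directly. The first problem is that LMSequenceReader requires a wordclass parameter, a dictionary file of word classes. Where do we get it?

This is a preprocessing problem. CNTK provides a way to generate this file: a special action called writeWordAndClass performs the operation. A simple usage example follows: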

#######################################
#  PREPARATION CONFIG                 #
#######################################

writeWordAndClassInfo = [
    action = "writeWordAndClass"
    inputFile = "$DataDir$/$trainFile$"
    beginSequence = "</s>"
    endSequence   = "</s>"
    outputVocabFile = "$ModelDir$/vocab.txt"
    outputWord2Cls  = "$ModelDir$/word2cls.txt"
    outputCls2Index = "$ModelDir$/cls2idx.txt"
    vocabSize = "$confVocabSize$"
    nbrClass = "$confClassSize$"
    cutoff = 0
    printValues = true
]

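writeWordAndClass takes several parameters:
1. inputFile: the input file used to generate the word classes, that is, the sample text
2. outputVocabFile: where to write the vocab file. This file has four columns and matches the file format required by the wordclass parameter of LMSequenceReader: column 1 is the word ID, column 2 is the number of times the word occurs, column 3 is the word itself, and column 4 is the word's class
3. outputWord2Cls: where to write the word-to-class mapping file
4. outputCls2Index: where to write the class-to-word-ID mapping file
5. vocabSize: the desired vocabulary size
6. nbrClass: the desired number of classes
7. cutoff: a frequency threshold; words that occur too rarely are treated as <unk>. The default is 2.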
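To inspect the generated vocab file, a small Python parser is handy. The sketch below follows the four-column order listed above (ID, count, word, class); that the columns are separated by whitespace is my assumption, and the sample lines and counts are invented rather than taken from a real run.

```python
def parse_vocab(lines):
    """Parse 'id count word class' lines into a dictionary keyed by word."""
    entries = {}
    for line in lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # skip blank or malformed lines
        word_id, count, word, word_class = parts
        entries[word] = {
            "id": int(word_id),
            "count": int(count),
            "class": int(word_class),
        }
    return entries

# Hypothetical sample content; a real file would be read with open(path).
sample_lines = [
    "0\t30\t</s>\t0",
    "1\t20\tthe\t0",
    "2\t10\t<unk>\t1",
    "",
]
vocab = parse_vocab(sample_lines)
print(vocab["the"])
```

A quick check like this makes it easy to confirm that the number of entries matches vocabSize and that the class numbers stay below nbrClass.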

Configuring the Reader

We have already settled on LMSequenceReader as the input reader. Here is a quoted usage example:

reader=[
    readerType="LMSequenceReader"
    randomize=false
    nbruttsineachrecurrentiter=10
    unk="<unk>"
    wordclass="$DataDir$\wordclass.txt"
    file="$DataDir$\penntreebank.train.txt"
    labelIn=[
        labelDim=10000
        beginSequence="</s>"
        endSequence="</s>"
    ]
]

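The main LMSequenceReader parameters are:
1. randomize: whether to read the samples in random order
2. nbruttsineachrecurrentiter: limits the number of sentences processed in parallel in each minibatch (the comments in the full configuration file note that 0 means filling the minibatch with as many parallel sequences as fit)
3. unk: the symbol for unknown words; words that are ignored are mapped to this symbol
4. wordclass: the word class information file (the outputVocabFile generated in the preprocessing step)
5. file: the input sample file
6. labelIn: a configuration block with three parameters: beginSequence, the symbol that marks the start of a sentence; endSequence, the symbol that marks the end of a sentence; and labelDim, the label dimension, which is the vocabulary size (10000 in this example).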

Configuring the Network Model

The PennTreebank example uses CNTK's predefined class-based long short-term memory (LSTM) network model, selected by setting rnnType in SimpleNetworkBuilder:

    SimpleNetworkBuilder = [
        rnnType = "CLASSLSTM"   # TODO: camelCase
        recurrentLayer = 1      # number of recurrent layers

        trainingCriterion = "classCrossEntropyWithSoftmax"
        evalCriterion     = "classCrossEntropyWithSoftmax"

        initValueScale = 6.0
        uniformInit = true
        layerSizes = "$confVocabSize$:150:200:10000"
        defaultHiddenActivity = 0.1 # default value for hidden states
        addPrior = false
        addDropoutNodes = false
        applyMeanVarNorm = false
        lookupTableOrder = 1        # TODO: document what this means

        # these are for the class information for class-based language modeling
        vocabSize = "$confVocabSize$"
        nbrClass  = "$confClassSize$"
    ]
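For readers wondering why the network needs word classes at all: class-based language models usually split the next-word prediction into two smaller predictions. The formula below is the standard textbook factorization, not something I have checked against CNTK's source code, and $c(w)$ (the class of word $w$) and $h_t$ (the network's hidden state at step $t$) are my own notation.

```latex
P(w_t \mid h_t) = P\bigl(c(w_t) \mid h_t\bigr)\; P\bigl(w_t \mid c(w_t),\, h_t\bigr)
```

With a vocabulary of 10000 words and 50 classes, and assuming for illustration that every class holds 200 words, each prediction scores 50 classes plus 200 words rather than all 10000 words, which is where the speedup comes from.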

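Configuring the Training Method

CNTK currently supports only one training method, stochastic gradient descent (SGD). The configuration used in the example is: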

    SGD = [
        minibatchSize = 128:256:512
        learningRatesPerSample = 0.1
        momentumPerMB = 0
        gradientClippingWithTruncation = true
        clippingThresholdPerSample = 15.0
        maxEpochs = 16
        numMBsToShowResult = 100
        gradUpdateType = "none"
        loadBestModel = true

        dropoutRate = 0.0

        #traceNodeNamesReal = AutoName37 # this allows to track a node's value

        # settings for Auto Adjust Learning Rate
        AutoAdjust = [
            autoAdjustLR = "adjustAfterEpoch"
            reduceLearnRateIfImproveLessThan = 0.001
            continueReduce = false
            increaseLearnRateIfImproveMoreThan = 1000000000
            learnRateDecreaseFactor = 0.5
            learnRateIncreaseFactor = 1.382
            numMiniBatch4LRSearch = 100
            numPrevLearnRates = 5
            numBestSearchEpoch = 1
        ]
    ]
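The AutoAdjust block is the least obvious part of this configuration. The Python sketch below shows my simplified reading of what the parameter names reduceLearnRateIfImproveLessThan and learnRateDecreaseFactor suggest: after each epoch, shrink the learning rate when the loss has barely improved. It is an illustration only, not CNTK's actual implementation, and it ignores the increase rule and the other AutoAdjust settings.

```python
def adjust_learning_rate(learning_rate, previous_loss, current_loss,
                         reduce_if_improve_less_than=0.001,
                         decrease_factor=0.5):
    """Halve the learning rate when the relative improvement is below the threshold."""
    improvement = (previous_loss - current_loss) / abs(previous_loss)
    if improvement < reduce_if_improve_less_than:
        return learning_rate * decrease_factor
    return learning_rate

# Hypothetical losses after two consecutive epochs.
kept = adjust_learning_rate(0.1, previous_loss=5.0, current_loss=4.0)
reduced = adjust_learning_rate(0.1, previous_loss=5.0, current_loss=4.999)
print(kept, reduced)
```

In the first call the loss improved by 20 percent, so the rate is kept; in the second the improvement is below the 0.001 threshold, so the rate is halved.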

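Configuring the Output

In the PennTreebank example, the configuration file also writes out the values of an output node of the trained network. Each record is the log probability of the corresponding sentence.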

    action = "write"

    outputPath = "$OutputDir$/Write"
    #outputPath = "-"                    # "-" will write to stdout; useful for debugging
    outputNodeNames = TrainNodeClassBasedCrossEntropy # when processing one sentence per minibatch, this is the sentence posterior
    format = [
        sequencePrologue = "log P(W)="    # (using this to demonstrate some formatting strings)
        type = "real"
    ]

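The Final Configuration File

With the settings above we arrive at a configuration file similar to the PennTreebank example that ships with CNTK. I include it here as a reference in case you run into problems while debugging. The example may differ slightly from what was described above, but the overall approach is the same; it just adjusts a few details.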

# Parameters can be overwritten on the command line
# for example: cntk configFile=myConfigFile RootDir=../.. 
# For running from Visual Studio add
# currentDirectory=$(SolutionDir)/<path to corresponding data folder> 
RootDir = ".."

ConfigDir = "$RootDir$/Config"
DataDir   = "$RootDir$/Data"
OutputDir = "$RootDir$/Output"
ModelDir  = "$OutputDir$/Models"

# deviceId=-1 for CPU, >=0 for GPU devices, "auto" chooses the best GPU, or CPU if no usable GPU is available
deviceId = "auto"

command = writeWordAndClassInfo:train:test:write

precision  = "float"
traceLevel = 1
modelPath  = "$ModelDir$/rnn.dnn"

# uncomment the following line to write logs to a file
#stderr=$OutputDir$/rnnOutput

numCPUThreads = 1

confVocabSize = 10000
confClassSize = 50

trainFile = "ptb.train.txt"
validFile = "ptb.valid.txt"
testFile  = "ptb.test.txt"

#######################################
#  PREPARATION CONFIG                 #
#######################################

writeWordAndClassInfo = [
    action = "writeWordAndClass"
    inputFile = "$DataDir$/$trainFile$"
    beginSequence = "</s>"
    endSequence   = "</s>"
    outputVocabFile = "$ModelDir$/vocab.txt"
    outputWord2Cls  = "$ModelDir$/word2cls.txt"
    outputCls2Index = "$ModelDir$/cls2idx.txt"
    vocabSize = "$confVocabSize$"
    nbrClass = "$confClassSize$"
    cutoff = 0
    printValues = true
]

#######################################
#  TRAINING CONFIG                    #
#######################################

train = [
    action = "train"
    traceLevel = 1
    epochSize = 0               # (for quick tests, this can be overridden with something small)

    SimpleNetworkBuilder = [
        rnnType = "CLASSLSTM"   # TODO: camelCase
        recurrentLayer = 1      # number of recurrent layers

        trainingCriterion = "classCrossEntropyWithSoftmax"
        evalCriterion     = "classCrossEntropyWithSoftmax"

        initValueScale = 6.0
        uniformInit = true
        layerSizes = "$confVocabSize$:150:200:10000"
        defaultHiddenActivity = 0.1 # default value for hidden states
        addPrior = false
        addDropoutNodes = false
        applyMeanVarNorm = false
        lookupTableOrder = 1        # TODO: document what this means

        # these are for the class information for class-based language modeling
        vocabSize = "$confVocabSize$"
        nbrClass  = "$confClassSize$"
    ]

    SGD = [
        minibatchSize = 128:256:512
        learningRatesPerSample = 0.1
        momentumPerMB = 0
        gradientClippingWithTruncation = true
        clippingThresholdPerSample = 15.0
        maxEpochs = 16
        numMBsToShowResult = 100
        gradUpdateType = "none"
        loadBestModel = true

        dropoutRate = 0.0

        #traceNodeNamesReal = AutoName37 # this allows to track a node's value

        # settings for Auto Adjust Learning Rate
        AutoAdjust = [
            autoAdjustLR = "adjustAfterEpoch"
            reduceLearnRateIfImproveLessThan = 0.001
            continueReduce = false
            increaseLearnRateIfImproveMoreThan = 1000000000
            learnRateDecreaseFactor = 0.5
            learnRateIncreaseFactor = 1.382
            numMiniBatch4LRSearch = 100
            numPrevLearnRates = 5
            numBestSearchEpoch = 1
        ]
    ]

    reader = [
        readerType = "LMSequenceReader"
        randomize = "none"              # BUGBUG: This is currently ignored
        nbruttsineachrecurrentiter = 0  # means fill up the minibatch with as many parallel sequences as fit
        cacheBlockSize = 2000000        # load it all

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType=BinaryReader

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"

        # wsize - initial size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000

        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$trainFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]

        #labels sections
        labelIn = [
            dim = 1
            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11                
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]

        # labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition 
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"
            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = categoryLabels
            ]
        ]
    ]

    # if a cvReader section is specified, SGD will use this to compute the CV criterion
    cvReader = [
        # reader to use
        readerType = "LMSequenceReader"
        randomize = "none"
        nbruttsineachrecurrentiter = 0  # 0 means fill up the minibatch with as many parallel sequences as fit
        cacheBlockSize = 2000000        # just load it all

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType = "BinaryReader"

        # write definition
        wfile = "$OutputDir$/sequenceSentence.valid.bin"

        # wsize - initial size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000

        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$validFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]

        # labels sections
        # it should be the same as that in the training set
        labelIn = [
            dim = 1

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]

        #labels sections
        labels = [
            dim = 1

            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ]
]

#######################################
#  TEST CONFIG                        #
#######################################

test = [
    action = "eval"

    # corresponds to the number of words/characters to train in a minibatch
    minibatchSize = 8192                # choose as large as memory allows for maximum GPU concurrency
    # need to be small since models are updated for each minibatch
    traceLevel = 1
    epochSize = 0

    reader = [
        # reader to use
        readerType = "LMSequenceReader"
        randomize = "none"
        nbruttsineachrecurrentiter = 0  # 0 means fill up the minibatch with as many parallel sequences as fit
        cacheBlockSize = 2000000        # just load it all

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType = "BinaryReader"

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        # wsize - initial size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000

        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$testFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]

        #labels sections
        labelIn = [
            dim = 1

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]

        #labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"
            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ]
]

#######################################
#  WRITE CONFIG                       #
#######################################

# This will write out the log sentence probabilities
#   log P(W) = sum_n log P(w_n | w_1..w_n-1)
# of all test sentences in the form log P(W)=<value>, one line per test
# sentence.
#
# This is accomplished by writing out the value of the CE criterion, which
# is an aggregate over all words in a minibatch. By presenting each sentence
# as a separate minibatch, the CE criterion is equal to the log sentence prob.
#
# This can be used for N-best rescoring if you prepare your N-best hypotheses
# as an input file with one line of text per hypothesis, where the output is
# the corresponding log probabilities, one value per line, in the same order.

write = [
    action = "write"

    outputPath = "$OutputDir$/Write"
    #outputPath = "-"                    # "-" will write to stdout; useful for debugging
    outputNodeNames = TrainNodeClassBasedCrossEntropy # when processing one sentence per minibatch, this is the sentence posterior
    format = [
        sequencePrologue = "log P(W)="    # (using this to demonstrate some formatting strings)
        type = "real"
    ]

    minibatchSize = 8192                # choose this to be big enough for the longest sentence
    # need to be small since models are updated for each minibatch
    traceLevel = 1
    epochSize = 0

    reader = [
        # reader to use
        readerType = "LMSequenceReader"
        randomize = "none"              # BUGBUG: This is ignored.
        nbruttsineachrecurrentiter = 1  # one sentence per minibatch
        cacheBlockSize = 1              # workaround to disable randomization

        # word class info
        wordclass = "$ModelDir$/vocab.txt"

        # if writerType is set, we will cache to a binary file
        # if the binary file exists, we will use it instead of parsing this file
        # writerType = "BinaryReader"

        # write definition
        wfile = "$OutputDir$/sequenceSentence.bin"
        # wsize - initial size of the file in MB
        # if calculated size would be bigger, that is used instead
        wsize = 256

        # wrecords - number of records we should allocate space for in the file
        # files cannot be expanded, so this should be large enough. If known modify this element in config before creating file
        wrecords = 1000

        # windowSize - number of records we should include in BinaryWriter window
        windowSize = "$confVocabSize$"

        file = "$DataDir$/$testFile$"

        # additional features sections
        # for now store as expanded category data (including label in)
        features = [
            # sentence has no features, so need to set dimension to zero
            dim = 0
            # write definition
            sectionType = "data"
        ]

        #labels sections
        labelIn = [
            dim = 1

            # vocabulary size
            labelDim = "$confVocabSize$"
            labelMappingFile = "$OutputDir$/sentenceLabels.txt"

            labelType = "Category"
            beginSequence = "</s>"
            endSequence = "</s>"

            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 11
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 11
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]

        #labels sections
        labels = [
            dim = 1
            labelType = "NextWord"
            beginSequence = "O"
            endSequence = "O"

            # vocabulary size
            labelDim = "$confVocabSize$"

            labelMappingFile = "$OutputDir$/sentenceLabels.out.txt"
            # Write definition
            # sizeof(unsigned) which is the label index type
            elementSize = 4
            sectionType = "labels"

            mapping = [
                # redefine number of records for this section, since we don't need to save it for each data record
                wrecords = 3
                # variable size so use an average string size
                elementSize = 10
                sectionType = "labelMapping"
            ]

            category = [
                dim = 3
                # elementSize = sizeof(ElemType) is default
                sectionType = "categoryLabels"
            ]
        ]
    ]
]

Running CNTK to Train the Network

Finally, the run itself. This is actually the simplest part; just execute the following command.

cntk.exe configFile=rnn.cntk

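If you see the following screen, you have successfully trained the network and obtained the results.
[Figure 2: screenshot of the output shown after a successful training run]

You will also find the file Write.TrainNodeClassBasedCrossEntropy in the Output folder. It records the likelihood of each test sentence, that is, the probability P discussed throughout this article.
[Figure 3: screenshot of the Write.TrainNodeClassBasedCrossEntropy output file]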
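To work with these values in a script, the file can be parsed line by line. The Python sketch below assumes each line has the form `log P(W)=<value>`, which is what the sequencePrologue setting and the comments in the write section suggest; the exact layout of the real file may differ, and the sample lines here are invented.

```python
def parse_log_probs(lines, prefix="log P(W)="):
    """Extract one floating-point value per line that starts with the prefix."""
    values = []
    for line in lines:
        text = line.strip()
        if not text.startswith(prefix):
            continue  # skip blank lines or anything unexpected
        fields = text[len(prefix):].split()
        if fields:
            values.append(float(fields[0]))
    return values

# Hypothetical file content; a real run would pass the lines of the output file.
sample_output = ["log P(W)=-41.5", "log P(W)=-17.25", ""]
scores = parse_log_probs(sample_output)
print(scores)
```

Because the values come out in the same order as the input sentences, they can be paired back with the test sentences by position.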

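Summary

This article started from basic concepts such as what a language model is, outlined how to identify a word mathematically, and then introduced word vectors and related ideas. It then used a simple network structure to explain how language models combine with neural networks. Finally, it used CNTK and its PennTreebank example to walk through training a language model network in practice.

I hope this article is a useful reference for readers new to deep networks for natural language models. I am not a specialist in this field, so mistakes are inevitable; please point out any you find.

Several sources helped me greatly while writing this article:

[1] Stanford University, Natural Language Processing
https://class.coursera.org/nlp/

[2] Deep Learning in NLP (1): Word Vectors and Language Models
http://licstar.net/archives/328

[3] "A Neural Probabilistic Language Model"
http://jmlr.org/papers/volume3/bengio03a/bengio03a.pdf

Much of this article draws on Dr. Lai's dissertation, and I would like to express my thanks here. The aim of this article is to make natural language models more widely understood by explaining domain knowledge in plain language and showing how to handle them with CNTK.

In closing, I welcome discussion with readers.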
