Tesseract 4.0 Training Scripts (Part 2)

combine_lang_model

COMBINE_LANG_MODEL(1)                                    COMBINE_LANG_MODEL(1)

NAME
       combine_lang_model - generate starter traineddata # produces the initial traineddata file that training starts from

SYNOPSIS
       combine_lang_model --input_unicharset filename --script_dir dirname
       --output_dir rootdir --lang lang [--lang_is_rtl] [--pass_through_recoder]
       [--words file --puncs file --numbers file]

DESCRIPTION
       combine_lang_model(1) generates a starter traineddata file that can be
       used to train an LSTM-based neural network model. It takes as input a
       unicharset and an optional set of wordlists. It eliminates the need to
       run set_unicharset_properties(1), wordlist2dawg(1), some non-existent
       binary to generate the recoder (unicode compressor), and finally
       combine_tessdata(1).
       # Summary: the inputs are a unicharset file plus an optional set of wordlists; the output is a starter traineddata file for training the LSTM-based neural network model.
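
The SYNOPSIS above can be turned into a small helper that assembles the argument list. This is a minimal sketch, not part of Tesseract: the function name `build_combine_lang_model_cmd` and all file paths are hypothetical, the boolean switches are passed bare as the SYNOPSIS shows them, and the check on `puncs` is a simplification of the rule stated under --puncs (it only checks that a file was given, not that the file has content). Nothing is executed; the helper only returns the argv list.

```python
from typing import List, Optional


def build_combine_lang_model_cmd(
    input_unicharset: str,
    script_dir: str,
    output_dir: str,
    lang: str,
    lang_is_rtl: bool = False,
    pass_through_recoder: bool = False,
    words: Optional[str] = None,
    puncs: Optional[str] = None,
    numbers: Optional[str] = None,
) -> List[str]:
    """Return an argv list for combine_lang_model (hypothetical helper)."""
    # Simplified form of the man page rule: if any word list is supplied,
    # a punctuation pattern file must be supplied as well.
    if (words or numbers) and not puncs:
        raise ValueError("--puncs is required when --words or --numbers is set")

    # Required flags, in the order the SYNOPSIS lists them.
    cmd = [
        "combine_lang_model",
        "--input_unicharset", input_unicharset,
        "--script_dir", script_dir,
        "--output_dir", output_dir,
        "--lang", lang,
    ]
    # Optional boolean switches, written bare as in the SYNOPSIS.
    if lang_is_rtl:
        cmd.append("--lang_is_rtl")
    if pass_through_recoder:
        cmd.append("--pass_through_recoder")
    # Optional word list files.
    for flag, path in (("--words", words), ("--puncs", puncs), ("--numbers", numbers)):
        if path:
            cmd += [flag, path]
    return cmd


# Hypothetical paths, for illustration only.
cmd = build_combine_lang_model_cmd(
    input_unicharset="eng.unicharset",
    script_dir="langdata",
    output_dir="output",
    lang="eng",
)
print(" ".join(cmd))
```

To actually run the tool, the returned list could be handed to `subprocess.run(cmd, check=True)`, assuming `combine_lang_model` is on the PATH.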

OPTIONS
       --lang lang
           The language to use. Tesseract uses 3-character ISO 639-2 language
           codes. (See LANGUAGES)
           # The language to process, given as a three-letter ISO 639-2 code (for example, eng for English)

       --script_dir PATH
           Directory name for input script unicharsets. It should point to the
           location of langdata (github repo) directory. (type:string
           default:)
           # Directory holding the per-script input unicharsets; it should point to your local copy of the langdata repository

       --input_unicharset FILE
           Unicharset to complete and use in encoding. It can be a
           hand-created file with incomplete fields. Its basic and script
           properties will be set before it is used. (type:string default:)
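           # The input unicharset file, which is completed and then used for encoding.
           # It may be a hand-written file with incomplete fields. (I have not tried this yet; if it works, a hand-written unicharset might serve as a character whitelist.)
           # Its basic and script properties are filled in before it is used. (I do not fully understand this part; presumably it is the step that set_unicharset_properties(1) would otherwise perform.)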

       --lang_is_rtl BOOL
           True if language being processed is written right-to-left (eg
           Arabic/Hebrew). (type:bool default:false)
           # Set to true when the language is written right-to-left, e.g. Arabic or Hebrew; the default is false

       --pass_through_recoder BOOL
           If true, the recoder is a simple pass-through of the unicharset.
           Otherwise, potentially a compression of it by encoding Hangul in
           Jamos, decomposing multi-unicode symbols into sequences of
           unicodes, and encoding Han using the data in the
           radical_table_data, which must be the content of the file:
           langdata/radical-stroke.txt. (type:bool default:false)
           # (I am not yet clear on this flag, so the official description above is left as-is.)

       --version_str STRING
           An arbitrary version label to add to traineddata file (type:string
           default:)
           # A version label of your own choosing for the trained language data

       --words FILE
           (Optional) File listing words to use for the system dictionary
           (type:string default:)
           # Optional: word list file for the system dictionary

       --numbers FILE
           (Optional) File listing number patterns (type:string default:)
           # Optional: file of number patterns

       --puncs FILE
           (Optional) File listing punctuation patterns. The
           words/puncs/numbers lists may be all empty. If any are non-empty
           then puncs must be non-empty. (type:string default:)
           # Optional: file of punctuation patterns

       --output_dir PATH
           Root directory for output files. Output files will be written to
           <output_dir>/<lang>/<lang>.* (type:string default:)
           # Root directory for the output files
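
The --pass_through_recoder description above mentions "encoding Hangul in Jamos". The idea can be illustrated with standard Unicode normalization: each precomposed Hangul syllable decomposes into two or three conjoining Jamo, so the 11,172 possible syllables can be written with a few dozen symbols. This is a minimal sketch using Python's `unicodedata` module, not Tesseract's recoder; the helper name `to_jamos` is hypothetical, and NFD also decomposes other composed characters (such as accented Latin letters), which Tesseract may handle differently.

```python
import unicodedata


def to_jamos(text: str) -> str:
    """Decompose precomposed Hangul syllables into conjoining Jamo via NFD."""
    return unicodedata.normalize("NFD", text)


# Show the code points each syllable decomposes into.
for syllable in "한글":
    jamos = to_jamos(syllable)
    codes = " ".join(f"U+{ord(j):04X}" for j in jamos)
    print(f"U+{ord(syllable):04X} -> {codes}")
```

With the flag left at its default (false), the description says the recoder may apply this kind of compression; setting it to true keeps the unicharset entries as they are.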

HISTORY
       combine_lang_model(1) was first made available for
       Tesseract 4.00.00alpha.

RESOURCES
       Main web site: https://github.com/tesseract-ocr
       Information on training tesseract LSTM:
       https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

SEE ALSO
       tesseract(1)

COPYING
       Copyright (C) 2012 Google, Inc. Licensed under the Apache License,
       Version 2.0

AUTHOR
       The Tesseract OCR engine was written by Ray Smith and his research
       groups at Hewlett Packard (1985-1995) and Google (2006-present).

                                  04/07/2018             COMBINE_LANG_MODEL(1)
