Tesseract 4.0 Training Scripts (Part 1)

lstmeval

NAME
       lstmeval - Evaluation program for LSTM-based networks.

SYNOPSIS
       lstmeval --model lang.lstm|langtrain_checkpoint|pluscharsN.NNN_NN.checkpoint [--traineddata
       lang/lang.traineddata] --eval_listfile lang.eval_files.txt [--verbosity N] [--max_image_MB NNNN]

DESCRIPTION
       lstmeval(1) evaluates LSTM-based networks. Either a recognition model or a training checkpoint can be given as
       input for evaluation, along with a list of lstmf files. If evaluating a training checkpoint, --traineddata
       should also be specified; it must be the traineddata file that was used when the checkpoint was produced.
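
       The rule in the description (a checkpoint needs --traineddata, a recognition model does not) can be sketched
       as a small shell helper. This is only an illustration: the helper name lstmeval_cmd and all file names are
       hypothetical, and treating any model name that ends in "checkpoint" as a training checkpoint is an assumption
       based on the names shown in the SYNOPSIS. The helper prints the command instead of running it.

```shell
#!/bin/sh
# Sketch: print an lstmeval command, adding --traineddata for checkpoints.
# lstmeval_cmd and every file name below are hypothetical.
# Assumption: a model whose name ends in "checkpoint" is a training checkpoint.

# Usage: lstmeval_cmd MODEL EVAL_LISTFILE [TRAINEDDATA]
lstmeval_cmd() (
  model=$1
  eval_list=$2
  traineddata=${3:-}
  cmd="lstmeval --model $model --eval_listfile $eval_list"
  case $model in
    *checkpoint)
      # Checkpoints need the traineddata that was given to the trainer.
      if [ -z "$traineddata" ]; then
        echo "error: $model looks like a checkpoint; pass a traineddata file" >&2
        exit 1
      fi
      cmd="$cmd --traineddata $traineddata"
      ;;
  esac
  printf '%s\n' "$cmd"
)

# Recognition model: no --traineddata needed.
lstmeval_cmd eng.lstm eng.eval_files.txt
# Training checkpoint: --traineddata is appended.
lstmeval_cmd eng_checkpoint eng.eval_files.txt eng/eng.traineddata
```

       Printing the command first makes it easy to check the arguments before anything is evaluated; the printed
       line can then be pasted into a shell once it looks right.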

OPTIONS
       --model FILE
           Name of model file (training or recognition) (type:string default:)

       --traineddata FILE
           If model is a training checkpoint, then traineddata must be the traineddata file that was given to the
           trainer (type:string default:)

       --eval_listfile FILE
           Text file listing sample files in lstmf training format. (type:string default:)

       --max_image_MB INT
           Max memory to use for images, in MB. (type:int default:2000)

       --verbosity INT
           Amount of diagnostic information to output (0-2). (type:int default:1)
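
       As a rough sketch of how the options fit together, the snippet below writes an eval list file from a
       directory of .lstmf files and then prints an lstmeval command that uses it. It assumes the list file holds
       one .lstmf path per line; the helper name make_eval_list, the eval_data directory and the file names are
       hypothetical.

```shell
#!/bin/sh
# Sketch: write an eval list file, then print an lstmeval command using it.
# Assumption: the list file holds one .lstmf path per line.
# make_eval_list and all paths below are hypothetical.

# Usage: make_eval_list LSTMF_DIR LIST_FILE
make_eval_list() (
  dir=$1
  list=$2
  : > "$list"                     # start with an empty list file
  for f in "$dir"/*.lstmf; do
    [ -e "$f" ] || continue       # skip the unexpanded glob when nothing matches
    printf '%s\n' "$f" >> "$list"
  done
)

# Only runs when the hypothetical eval_data directory exists.
if [ -d eval_data ]; then
  make_eval_list eval_data eng.eval_files.txt
  # --verbosity accepts 0-2; 2000 is the documented default for --max_image_MB.
  echo lstmeval --model eng.lstm --eval_listfile eng.eval_files.txt \
    --verbosity 2 --max_image_MB 2000
fi
```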

HISTORY
       lstmeval(1) was first made available for tesseract 4.00.00alpha.

RESOURCES
       Main web site: https://github.com/tesseract-ocr
       Information on training tesseract LSTM:
       https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract-4.00

SEE ALSO
       tesseract(1)

COPYING
       Copyright (C) 2012 Google, Inc. Licensed under the Apache License, Version 2.0

AUTHOR
       The Tesseract OCR engine was written by Ray Smith and his research groups at Hewlett Packard (1985-1995) and
       Google (2006-present).

tesstrain.sh

# This script provides an easy way to execute various phases of training
# Tesseract.  For a detailed description of the phases, see
# https://github.com/tesseract-ocr/tesseract/wiki/TrainingTesseract
#
# USAGE:
#
# tesstrain.sh
#    --fontlist FONTS           # A list of fontnames to train on.
#    --fonts_dir FONTS_PATH     # Path to font files.
#    --lang LANG_CODE           # ISO 639 code (three-letter).
#    --langdata_dir DATADIR     # Path to tesseract/training/langdata directory.
#    --output_dir OUTPUTDIR     # Location of output traineddata file.
#    --overwrite                # Safe to overwrite files in output_dir. # Note: I have not used this.
#    --linedata_only            # Only generate training data for lstmtraining.
#    --run_shape_clustering     # Run shape clustering (use for Indic langs). # Note: I have not used this.
#    --exposures EXPOSURES      # A list of exposure levels to use (e.g. "-1 0 1"). # Note: effect not yet clear to me.
#
# OPTIONAL flags for input data. If unspecified we will look for them in
# the langdata_dir directory.
#    --training_text TEXTFILE   # Text to render and use for training.
#    --wordlist WORDFILE        # Word list for the language ordered by
#                               # decreasing frequency.
#
# OPTIONAL flag to specify location of existing traineddata files, required
# during feature extraction. If unspecified will use TESSDATA_PREFIX defined in
# the current environment.
#    --tessdata_dir TESSDATADIR     # Path to tesseract/tessdata directory.
#
# NOTE:
# The font names specified in --fontlist need to be recognizable by Pango using
# fontconfig. An easy way to list the canonical names of all fonts available on
# your system is to run text2image with --list_available_fonts and the
# appropriate --fonts_dir path.
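
The usage text and the NOTE above suggest a two-step workflow: list the font names Pango can see, then pass one of them to tesstrain.sh. A minimal sketch follows, using only flags that appear in the usage text. The helper name tesstrain_cmd, the language, the font name and every path are hypothetical placeholders, and both commands are printed rather than executed.

```shell
#!/bin/sh
# Sketch: print the font-listing and training-data commands described above.
# tesstrain_cmd, the language, the font name and all paths are hypothetical.

# Usage: tesstrain_cmd LANG FONTS_DIR LANGDATA_DIR OUTPUT_DIR FONT
tesstrain_cmd() (
  printf 'tesstrain.sh --lang %s --fonts_dir %s --langdata_dir %s --output_dir %s --fontlist "%s" --linedata_only\n' \
    "$1" "$2" "$3" "$4" "$5"
)

# Step 1: list the canonical font names available under the fonts directory.
echo text2image --list_available_fonts --fonts_dir /usr/share/fonts

# Step 2: generate line data for lstmtraining with one of those fonts.
tesstrain_cmd eng /usr/share/fonts ./langdata ./engtrain "Arial"
```

Because --training_text, --wordlist and --tessdata_dir are optional, the sketch leaves them out and relies on the fallbacks described in the usage text (the langdata_dir directory and TESSDATA_PREFIX).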
