Hybrid page layout analysis via tab-stop detection

Abstract

  A new hybrid page layout analysis algorithm is proposed. It uses bottom-up methods to form an initial data-type hypothesis and to locate the tab-stops that were used when the page was formatted. The detected tab-stops are used to deduce the column layout of the page. The column layout is then applied in a top-down manner to impose structure and reading order on the detected regions. The complete C++ source code implementation is available as part of the Tesseract open source OCR engine.

Past Methods 1: Bottom-up

  • Analyze groups of pixels or connected components to classify them as text, image, graphic, blank, or line.
  • Spread/smear/anneal groups of pixels by some neighborhood voting scheme, morphology, or Voronoi/graph algorithms.
  • Find connected components of labels to group pixels into typed regions.
  • Box-up regions into rectangles where possible.
  • Morphological approach is very similar.
  • Hard to include knowledge like "Columns should usually be the same size."
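
The smear-then-group steps above can be illustrated with a minimal sketch. It assumes a binary page stored as a list of rows (1 = ink, 0 = background) and uses a simple horizontal run-length smear followed by 4-connected component labeling. The function names `smear_rows` and `component_boxes` and the `max_gap` parameter are hypothetical and chosen for illustration only; this is not how any particular system implements the bottom-up stage.

```python
from collections import deque

def smear_rows(grid, max_gap):
    """Fill background runs of at most max_gap pixels that lie between two ink pixels in a row."""
    out = [row[:] for row in grid]
    for row in out:
        last_ink = None
        for x, value in enumerate(row):
            if value:
                if last_ink is not None and 0 < x - last_ink - 1 <= max_gap:
                    for k in range(last_ink + 1, x):
                        row[k] = 1
                last_ink = x
    return out

def component_boxes(grid):
    """Return inclusive bounding boxes (left, top, right, bottom) of 4-connected ink components."""
    height = len(grid)
    width = len(grid[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if not grid[y][x] or seen[y][x]:
                continue
            seen[y][x] = True
            queue = deque([(x, y)])
            left, top, right, bottom = x, y, x, y
            while queue:
                cx, cy = queue.popleft()
                left, right = min(left, cx), max(right, cx)
                top, bottom = min(top, cy), max(bottom, cy)
                for nx, ny in ((cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)):
                    if 0 <= nx < width and 0 <= ny < height and grid[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return boxes

# Toy page: two narrow strokes close together, then a separate blob further right.
page = [
    [1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 1, 0],
]
regions = component_boxes(smear_rows(page, max_gap=1))
```

On the toy page, smearing merges the two nearby strokes into one region while the wide gap keeps the right-hand blob separate. Nothing in this pipeline knows about columns, which is the limitation the last bullet points out.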

Past Methods 2: Top Down

  • Often starts with a (possibly pre-trained) model of layout, e.g., a 2-column journal page
  • Attempts to cut the image into the required parts, either with recursive vertical/horizontal cuts, or finding rectangles of whitespace.
  • Methods usually fail on non-rectangular regions.
  • Methods can often only deal with pages that fit the model.

New Method: Hybrid

[Figure: Hybrid layout]
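
To make the tab-stop idea from the abstract concrete, here is a deliberately simplified sketch. It assumes text-line bounding boxes `(left, top, right, bottom)` are already available from a bottom-up pass, and treats an x position as a left tab-stop candidate when enough boxes share roughly the same left edge. The function `left_tab_stops` and its `tolerance` and `min_support` parameters are hypothetical; the Tesseract implementation is considerably more involved and is not reproduced here.

```python
def left_tab_stops(boxes, tolerance=2, min_support=3):
    """Cluster the left edges of boxes; return the median x of each well-supported cluster."""
    lefts = sorted(box[0] for box in boxes)
    clusters = []
    for x in lefts:
        # Extend the current cluster while consecutive edges stay within tolerance.
        if clusters and x - clusters[-1][-1] <= tolerance:
            clusters[-1].append(x)
        else:
            clusters.append([x])
    # Edges arrive sorted, so each cluster is sorted and its middle element is a median.
    return [c[len(c) // 2] for c in clusters if len(c) >= min_support]

# Toy input: two columns of text lines plus one indented line at x = 30.
line_boxes = [
    (10, 0, 40, 5), (11, 8, 40, 13), (10, 16, 38, 21),
    (50, 0, 80, 5), (51, 8, 80, 13), (49, 16, 79, 21),
    (30, 24, 40, 29),
]
stops = left_tab_stops(line_boxes)
```

In this toy input the isolated edge at x = 30 lacks support and is discarded, leaving two candidate stops that suggest a two-column layout. A fuller version would also consider right edges and check that the supporting boxes are vertically adjacent.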
