pytesseract psm option parameters

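I was recently writing a crawler for *车之家 and ran into dynamic, distorted custom characters. My old approach of directly comparing the parts of each character that never change no longer worked. I thought about it for a long time, but I don't know much about manipulating glyphs, so I decided to just recognize them directly with OCR.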

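Then I hit a problem: when calling pytesseract with the Chinese language option but without a psm parameter, nothing was recognized. After some searching I found the fix: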

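Add --psm 8, so that the whole image is treated as a single unit holding one Chinese character. (Strictly speaking, mode 8 means "single word" and mode 10 means "single character"; see the list of modes below.)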
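As a minimal sketch of that fix, the snippet below passes the Chinese language option together with `--psm 8`. The language code `chi_sim` (Tesseract's Simplified Chinese data), the helper names `psm_config` and `recognize_glyph`, and the image path are my own assumptions for illustration, not something from the original crawler.

```python
def psm_config(psm: int) -> str:
    """Build the config string that selects a page segmentation mode (0 to 13)."""
    if not 0 <= psm <= 13:
        raise ValueError(f"psm must be between 0 and 13, got {psm}")
    return f"--psm {psm}"


def recognize_glyph(image_path: str, psm: int = 8, lang: str = "chi_sim") -> str:
    """OCR an image assumed to contain one glyph and return the stripped text."""
    # Imported here so psm_config() stays usable without the OCR stack installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        text = pytesseract.image_to_string(image, lang=lang, config=psm_config(psm))
    return text.strip()
```

Calling `recognize_glyph("glyph.png")` (a hypothetical file) requires both the Tesseract binary and the Chinese language data to be installed.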

 

 

 
    Page segmentation modes:
      0    Orientation and script detection (OSD) only.
      1    Automatic page segmentation with OSD.
      2    Automatic page segmentation, but no OSD, or OCR.
      3    Fully automatic page segmentation, but no OSD. (Default)
      4    Assume a single column of text of variable sizes.
      5    Assume a single uniform block of vertically aligned text.
      6    Assume a single uniform block of text.
      7    Treat the image as a single text line.
      8    Treat the image as a single word.
      9    Treat the image as a single word in a circle.
     10    Treat the image as a single character.
     11    Sparse text. Find as much text as possible in no particular order.
     12    Sparse text with OSD.
     13    Raw line. Treat the image as a single text line,
           bypassing hacks that are Tesseract-specific.
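Since several of these modes (7, 8 and 10) could plausibly fit a small image of one glyph, one way to choose is simply to run each and compare the output. The sketch below does that; the function name `compare_psm_modes`, the default `chi_sim` language code, and the injectable `ocr` argument are my own assumptions, the last one added only so the loop can be exercised without Tesseract installed.

```python
from typing import Callable, Dict, Iterable, Optional


def compare_psm_modes(
    image,
    modes: Iterable[int] = (7, 8, 10),
    lang: str = "chi_sim",
    ocr: Optional[Callable[..., str]] = None,
) -> Dict[int, str]:
    """Run OCR once per page segmentation mode and map each mode to its text."""
    if ocr is None:
        # Default to the real OCR call; imported lazily on purpose.
        import pytesseract
        ocr = pytesseract.image_to_string

    results: Dict[int, str] = {}
    for mode in modes:
        text = ocr(image, lang=lang, config=f"--psm {mode}")
        results[mode] = text.strip()
    return results
```

With a PIL image loaded, `compare_psm_modes(image)` returns a dict keyed by mode number, which makes it easy to see which mode actually yields the expected character.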

Here is a sample usage of image_to_string with multiple parameters.

 
    import pytesseract
    # boxes=False is omitted: recent pytesseract releases no longer accept it.
    target = pytesseract.image_to_string(image, lang='eng',
        config='--psm 10 --oem 3 -c tessedit_char_whitelist=0123456789')
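The config string in that call is just space-separated options, so it can be assembled from its parts. The sketch below does so and also filters the OCR result against the same whitelist afterwards, as a safety net, because as far as I know whitelist handling has varied across Tesseract versions and engines. The helper names `make_config` and `keep_allowed` are hypothetical.

```python
def make_config(psm: int, oem: int, whitelist: str = "") -> str:
    """Join psm, oem and an optional character whitelist into one config string."""
    parts = [f"--psm {psm}", f"--oem {oem}"]
    if whitelist:
        parts.append(f"-c tessedit_char_whitelist={whitelist}")
    return " ".join(parts)


def keep_allowed(text: str, allowed: str) -> str:
    """Drop every character of the OCR output that is not in the allowed set."""
    return "".join(ch for ch in text if ch in allowed)


DIGITS = "0123456789"
config = make_config(psm=10, oem=3, whitelist=DIGITS)
```

The resulting `config` can be passed as `config=config` to `image_to_string`, and `keep_allowed(target, DIGITS)` then strips newlines or stray symbols from the result.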

 

 
