What Is the Google Vision API and How to Use It

Introduction

This post has its roots in an interesting knowledge-extraction project. The first step was to extract the text from PDF documents. The company I work for is built on the Google platform, so naturally I wanted to use the OCR of the Vision API, but I couldn’t find an easy way to use the API to extract text. Hence this post.

The notebook for this post is available on GitHub.

Google API Vision

Google released the API to help people, industry, and researchers use its functionality.

Google Cloud's Vision API has powerful pre-trained machine learning models available through REST and RPC APIs. It lets you tag images and quickly organize them into millions of predefined categories. You can detect objects and faces, read printed or handwritten text, and integrate useful metadata into your image catalog. (source: Vision API)

The part of the API that interests us for this post is the OCR part.

Optical Character Recognition (OCR)

Optical Character Recognition, or OCR, is a technology in which characters are detected and recognized inside an image. Most of the time, Convolutional Neural Networks (CNNs) are trained on a very large dataset of characters and numbers in different fonts and colors. You can imagine a small window sliding over each pixel or group of pixels to detect characters or partial characters, spaces, shapes, lines, etc.
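To make the sliding-window intuition concrete, here is a toy sketch (my own illustration, not how the Vision API works internally); the window size and stride are arbitrary values:

import numpy as np

def sliding_windows(image, size=28, stride=14):
    '''
    Yield square patches scanned across a grayscale image (2D array).
    A character classifier such as a small CNN could then score each patch.
    '''
    h, w = image.shape
    for top in range(0, h - size + 1, stride):
        for left in range(0, w - size + 1, stride):
            yield image[top:top + size, left:left + size]

# a toy 100x200 "page"
page = np.random.rand(100, 200)
print(len(list(sliding_windows(page))))  # number of windows a classifier would score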

Service Account

A service account is a special type of Google account intended to represent a non-human user that needs to authenticate and be authorized to access data in Google APIs. (source: IAM google cloud)

Basically, you can imagine it as an RSA key (an encrypted key used for secure machine-to-machine communication over the internet) with which you can connect to Google services (APIs, GCS, IAM…). Its basic form is a JSON file.
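As an illustration (the file name here is a placeholder), you can open the key with plain Python and look at the fields it contains; a real key typically has entries such as type, project_id, private_key_id, private_key, client_email, and token_uri:

import json

# the path is a placeholder: point it at your own key file
with open("my-service-account.json") as f:
    key = json.load(f)

print(sorted(key.keys()))  # e.g. ['client_email', 'private_key', 'project_id', ...]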

Notebook

Here, I will show you the different functions needed to use the API and extract the text from an image automatically.

Libraries that need to be installed:

!pip install google-cloud
!pip install google-cloud-storage
!pip install google-cloud-pubsub
!pip install google-cloud-vision
!pip install pdf2image
!pip install google-api-python-client
!pip install google-auth

The libraries used:

from pdf2image import convert_from_bytes
import glob
from tqdm import tqdm
import base64
import json
import os
from io import BytesIO
import numpy as np
import io
from PIL import Image
from google.cloud import pubsub_v1
from google.cloud import vision
from google.oauth2 import service_account
import googleapiclient.discovery

# to see a progress bar on pandas operations
tqdm.pandas()

The OCR accepts PDF, TIFF, and JPEG formats as input to the API. In this post we will convert the PDF into JPEG in order to concatenate many pages into one picture. There are two ways of working with JPEG:

First, you could convert your PDF into JPEG files and save them into another folder:

from pdf2image import convert_from_path

# folder where the pdfs are and folder where to save the results
NAME_INPUT_FOLDER = "PDF FOLDER NAME"
NAME_OUTPUT_FOLDER = "RESULT TEXTS FOLDER"
list_pdf = glob.glob(NAME_INPUT_FOLDER + "/*.pdf")  # paths of the pdf files

# loop over all the files
for i in list_pdf:
    pages = convert_from_path(i, 500)  # convert the pdf into jpeg pages (500 dpi)
    for num, page in tqdm(enumerate(pages)):
        # save each page as jpeg, keeping the document name plus an increment
        page.save(NAME_OUTPUT_FOLDER + "/" + i.split('/')[-1].split('.')[0]
                  + '_' + str(num) + '.jpg', 'JPEG')

Here, you can use your JPEG documents with the API. But you can do better: instead of saving the JPEG files, keep them in memory and call the API directly.
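A minimal sketch of the in-memory idea (the file name is a placeholder; the complete version follows in the next sections): convert the PDF with convert_from_bytes, then serialize each PIL page to JPEG bytes with BytesIO instead of writing it to disk:

from io import BytesIO
from pdf2image import convert_from_bytes

with open("document.pdf", "rb") as f:  # placeholder path
    pages = convert_from_bytes(f.read(), fmt="jpeg")

temp = BytesIO()
pages[0].save(temp, format='jpeg')  # the jpeg stays in memory
jpeg_bytes = temp.getvalue()        # ready to be sent to the API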

Setup Credentials

Before going deeper, we need to configure the credentials of the Vision API. You’ll see, it’s very simple:

SCOPES = ['https://www.googleapis.com/auth/cloud-vision']
SERVICE_ACCOUNT_FILE = "PUT the PATH of YOUR SERVICE ACCOUNT JSON FILE HERE"

# configure the Google credentials
credentials = service_account.Credentials.from_service_account_file(
    SERVICE_ACCOUNT_FILE, scopes=SCOPES)

Picture manipulations

This part needs more code because we also concatenate 10 pages of a document into one “big picture” that we feed to the API. One call instead of ten is better for the price, because you pay each time you request the API.
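A quick back-of-the-envelope (the page count is an arbitrary example) shows how stacking reduces the number of billed requests; keep in mind that the API enforces image size limits, so you cannot stack arbitrarily many pages:

import math

nb_pages = 53         # example document length
pages_per_image = 10  # stacking factor used in this post

print(math.ceil(nb_pages / pages_per_image))  # 6 requests when stacking
print(nb_pages)                               # 53 requests page by page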

Let’s go:

def pil_grid(images, max_horiz=np.iinfo(int).max):
    '''
    Stack a list of PIL images into one grid image, in memory
    '''
    n_images = len(images)
    n_horiz = min(n_images, max_horiz)
    h_sizes, v_sizes = [0] * n_horiz, [0] * (n_images // n_horiz)
    for i, im in enumerate(images):
        h, v = i % n_horiz, i // n_horiz
        h_sizes[h] = max(h_sizes[h], im.size[0])
        v_sizes[v] = max(v_sizes[v], im.size[1])
    h_sizes, v_sizes = np.cumsum([0] + h_sizes), np.cumsum([0] + v_sizes)
    im_grid = Image.new('RGB', (h_sizes[-1], v_sizes[-1]), color='white')
    for i, im in enumerate(images):
        im_grid.paste(im, (h_sizes[i % n_horiz], v_sizes[i // n_horiz]))
    return im_grid


def concat_file_ocr(path, cred=credentials):
    '''
    Concatenate the pages of the document 10 by 10 and feed them to the OCR
    @param path: (str) path of the pdf
    @param cred: google credentials (service account)
    '''
    imgs = convert_from_bytes(open(path, 'rb').read(), fmt="jpeg")
    nb_pages = len(imgs)
    nb_remaining_pages = nb_pages
    ocr_step = 10
    current_ocr_page_nb = 0
    text = []
    while nb_remaining_pages > 0:
        if nb_remaining_pages > ocr_step:
            ocr_range = range(current_ocr_page_nb, ocr_step + current_ocr_page_nb)
            nb_remaining_pages -= ocr_step
            current_ocr_page_nb += ocr_step
        else:
            ocr_range = range(current_ocr_page_nb, current_ocr_page_nb + nb_remaining_pages)
            nb_remaining_pages = 0
        # stack the pages of this range vertically into one image
        im_grid = pil_grid(imgs[ocr_range.start:ocr_range.stop], 1)
        # serialize the image to jpeg bytes in memory and call the OCR
        temp = BytesIO()
        im_grid.save(temp, format='jpeg')
        text.append(detect_text_document(temp.getvalue(), cred))
    # save the detections into a txt file named after the document
    np.savetxt(NAME_OUTPUT_FOLDER + "/" + path.split('/')[-1].split('.')[0] + '.txt', text, fmt="%s")

With these two functions, you’ll be able to load a PDF file, convert it into bytes, create a “big picture”, and feed it to the function detect_text_document() (detailed below).

The function detect_text_document takes as input the content of the picture and the credentials (the information of your service account).

def detect_text_document(content, credentials):
    """
    Call the Vision API and return the text detected inside the image
    @param content: (bytes) image in bytes
    @param credentials: credentials of the service account to call the API
    @return: the text detected inside the picture
    """
    client = vision.ImageAnnotatorClient(credentials=credentials)

    # load the image in bytes
    image = vision.types.Image(content=content)
    # call the OCR and keep the text annotations
    response = client.text_detection(image=image)

    # rebuild lines and paragraphs from the detected symbols and breaks
    breaks = vision.enums.TextAnnotation.DetectedBreak.BreakType
    paragraphs = []
    lines = []
    # extract text block by block
    for page in response.full_text_annotation.pages:
        for block in page.blocks:
            for paragraph in block.paragraphs:
                para = ""
                line = ""
                for word in paragraph.words:
                    for symbol in word.symbols:
                        line += symbol.text
                        if symbol.property.detected_break.type == breaks.SPACE:
                            line += ' '
                        if symbol.property.detected_break.type == breaks.EOL_SURE_SPACE:
                            line += ' '
                            lines.append(line)
                            para += line
                            line = ''
                        if symbol.property.detected_break.type == breaks.LINE_BREAK:
                            lines.append(line)
                            para += line
                            line = ''
                paragraphs.append(para)

    return "\n".join(paragraphs)

The output is the text extracted from the images. The goal of this function is to concatenate the detected words back into paragraphs and documents.
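One thing the function above does not do is check whether the API call failed. As a hedged addition, the response object exposes an error field whose message is empty on success, so a small helper like this one could be called right after client.text_detection():

def check_response(response):
    '''
    Raise if the Vision API reported an error for this image
    '''
    # response.error is a status object; its message is empty on success
    if response.error.message:
        raise RuntimeError("Vision API error: " + response.error.message)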

How to use it?

You can use this block of functions like this:

for doc_pdf in tqdm(list_pdf):
    # convert into jpeg, stack the pages 10 by 10,
    # call the API and save the output into a txt file
    concat_file_ocr(doc_pdf)

The input is just the path obtained with the glob function. The credentials were generated in the Setup Credentials section. This loop takes each PDF in the input folder, calls the API with the JPEG images obtained by converting the PDF, and saves text files containing the detections.

Conclusion

Here you reach the end of this tutorial on how to use the Vision API and automatically generate text files containing the detections. You now know how to configure credentials with your service account and convert a PDF into JPEG files (one JPEG per page). Is that all? No, I have some bonuses for you (see below).

Bonus 1: Use the API per page

The previous functions let you use the API on concatenated pages. But we can also use the API on each page of the PDF document. The function below requests the API for every page of the PDF converted into JPEG format.

def call_ocr_save_txt(path, cred=credentials):
    '''
    Feed the OCR with each page of the pdf converted into jpeg
    @param path: (str) path of the pdf
    @param cred: google credentials
    '''
    pages = convert_from_bytes(open(path, 'rb').read(), fmt="jpeg")
    text = []
    # run on each page of the pdf
    for page in pages:
        # cast the jpeg page into bytes, in memory
        temp = BytesIO()
        page.save(temp, format='jpeg')
        # save the result of the OCR inside the variable text
        text.append(detect_text_document(temp.getvalue(), cred))
    # save the result into a txt file named after the document
    np.savetxt(NAME_OUTPUT_FOLDER + "/" + path.split('/')[-1].split('.')[0] + '.txt', text, fmt="%s")

It’s very easy to use: just call this function with the path of the PDF and the credentials. Like this:

if per_page:  # option: True if you want to call the API per page
    for i in tqdm(list_pdf):
        # open the pdf, convert it into jpeg and call the OCR on each page
        call_ocr_save_txt(i, cred=credentials)

Bonus 2: Use the multiprocessing library

Just for fun, you can use this API with the multiprocessing library (which, unlike Python threads, sidesteps the Global Interpreter Lock (GIL) by using separate processes). Here is the code:

import multiprocessing as mp

if multi_proc:
    nb_threads = mp.cpu_count()  # the number of available CPUs
    print(f"The number of available CPU is {nb_threads}")
    # if you want to use the API without stacking the pages
    if per_page:
        # create as many processes as CPUs
        pool = mp.Pool(processes=nb_threads)
        # map the function over part of the list in each process
        result = pool.map(call_ocr_save_txt, list_pdf)
    if per_document:
        pool = mp.Pool(processes=nb_threads)
        result = pool.map(concat_file_ocr, list_pdf)
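If you ever need to pass non-default credentials through pool.map, which only forwards a single argument, functools.partial is the usual trick. A sketch, reusing the names defined above:

import multiprocessing as mp
from functools import partial

if __name__ == "__main__":
    # bind the credentials so the mapped function takes a single argument
    ocr_with_creds = partial(call_ocr_save_txt, cred=credentials)
    # the context manager closes the pool cleanly when done
    with mp.Pool(processes=mp.cpu_count()) as pool:
        pool.map(ocr_with_creds, list_pdf)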

Translated from: https://towardsdatascience.com/what-is-google-api-vision-and-how-to-use-it-372a83e6d02c
