superchao1982

基于Python的离线OCR图片文字识别（五）——终极版本

至此，终于迎来了离线ocr的终极大结局，命令行后面参数既支持图像文件、图像文件夹，还支持PDF图像类型的文件，既支持通过json文件进行参数配置，又支持帮助文档，easyOCR包既支持允许字符集（也即仅支持字符集中的识别，例如在验证码识别场合），也支持排除字符集，还支持批处理尺寸大小、线程数目、分段结构保留（支持paragraph时，ocr结果就没有原来单句时的识别概率值了）等。

#!/home/super/miniconda3/bin/python
#encoding=utf-8
#author: superchao1982, [email protected]

#帮助信息
strhelp='''
img2txt is one program to get ocr texts from image or pdf files!

batchsize is the batch size, larger more faster but more merrory;
workernum is the number of threads, larger more faster;
maximgsize is the max height or width of the images to be passed to the ocr processing when extract from the pdf files;
paragraph is whether to keep the paragraph when ocring
langpath is the directory of the language data stored, '/home/langdata' for linux and 'C:\ocr\langdata' for win;
allowlist is chars allow to be recognized only, '' means allow all charactors;
removechar is char to be removed when ocr processing, for example '| _^~`&';
txtdir is the path to store the txt files, could be any legal absolute or relative path,'' means the same directory of the image files;

=== settings above can be changed in the file 'config.json' which stored in langpath ===
contents in config.json like:
{
    "batchsize": 2,
    "workernum": 4,
    "maximgsize": 1000,
    "paragraph": True
    "langpath": "/home/langdata",
    "allowlist": "",
    "removechar": " _^~`&"
    "txtdir": ""
}
------------------------------------
e.g.
./img2txt.py img1.jpg jmg2.jpg 001.pdf 002.pdf #follow by one or more image or pdf files
./img2txt.py ./pdfs home/usr/Document/imgs #follow by one or more directory contain image or pdf files
./img2txt.py --help #output the help info
./img2txt.py --config #generate the default config.json file in the langpath
------------------------------------
'''
import sys
import json
import os
import pdf2image
import numpy as np

#------------------默认参数设置----------------------
batchsize=2         # (default = 1) - Batch_size>1 will make EasyOCR faster but use more memory
workernum=4         # (default = 0) - Number thread used in of dataloader
maximgsize=1000     # (default = 1000) - Max image width & height when using pdf
paraend='\n'        # (default = '\n') - The paragraph ending char
allowlist=''        # (string) - Force EasyOCR to recognize only subset of characters
removechar='| _^~`&'#待删除无效字符
txtdir=''            #ocr识别后同名txt文件存放的位置:空表示同一目录，点表示相对目录，其他表示绝对目录
#根据系统设置默认的语言包路径
if sys.platform.lower().startswith('linux'):
    langpath='/home/langdata'
elif sys.platform.lower().startswith('win'):
    langpath='C:\ocr\langdata'
else:
    print('\tError: Unknow System!')
    sys.exit()

#根据默认参数生成配置字典
config={
    "batchsize": batchsize,
    "workernum": workernum,
    "maximgsize": maximgsize,
    "paraend": paraend,
    "allowlist": allowlist,
    "langpath": langpath,
    "removechar": removechar,
    "txtdir": txtdir
}

#------------------命令行参数处理----------------------
#首先对输入的命令行参数进行处理，在加载ocr包之前排查的好处是避免临处理时出错白白浪费时间
for i in range(1,len(sys.argv)):#获取命令行参数：argv[0]表示可执行文件本身
    if sys.argv[i] in ['-h', '--help']:
        print(strhelp)
        sys.exit()
    elif sys.argv[i] in ['-c', '--config']:
        #保存字典到文件
        try:
            with open(os.path.join(langpath,'config.json'), 'w') as jsonfile:
                json.dump(config, jsonfile, ensure_ascii=False,indent=4)
            print('Genrerating config.json success! ---> ', os.path.join(langpath,'config.json'))
        except(Exception) as e:
            print('\tSaving config file config.json Error: ', e)#输出异常错误
        sys.exit()
    else:
        #check the image file or directory is valid-提前校验，免得浪费时间加载easyocr模型
        if not os.path.exists(sys.argv[i]):
            print(sys.argv[i], ' is invalid, please input the correct file or directory path!')
            sys.exit()

#判断指定目录下是否存在配置文件config.json,存在就使用（不存在就使用上面的默认值）：
configfile=os.path.join(langpath,'config.json')
if os.path.exists(configfile):
    try:
        with open(configfile, 'r') as jsonfile:
            config=json.load(jsonfile)
        batchsize=config['batchsize']
        workernum=config['workernum']
        maximgsize=config['maximgsize']
        paraend=config['paraend']
        langpath=config['langpath']
        allowlist=config['allowlist']
        removechar=config['removechar']
        txtdir=config['txtdir']
        print('Using the config in ', configfile)
    except(Exception) as e:
        print('\tReading config file ', configfile ,' Error: ', e)#输出异常错误
        print('\tCheck the json file, or remove the config.json file to use defaulting configs!')
        sys.exit()
else:
    print('Using the default config! You can make your own config.json in ', langpath, ' by using the "--config" option')
print(config)

#------------------OCR前准备工作----------------------
#检查语言包路径是否正确，语言包是必须的
if not os.path.exists(langpath):
    print('\tError: Invalid langpath! Checking the path again!')
    sys.exit()
    
#检查txt文件保存路径，不存在就生成一个
if len(txtdir)>0 and not os.path.exists(txtdir):
    print('txtdir in config.json is not exists, generating ', txtdir)
    try:
        os.system('mkdir '+txtdir)
        print('Making directory: ',txtdir)
    except(Exception) as e:
        print('\tMaking txt directory Error: ', e)#输出异常错误
        print('\tPlease input a legal txtdir in the config.json file and try again!')
        sys.exit()

#根据段落结尾符ocr时判断是否分段落
if len(paraend)>0:
    paragraph=True
else:
    paragraph=False

#导入ocr包及语言包——之所以不在前面导入，是因为导入包花费时间较多，如果前面由于配置出错就浪费了时间
import easyocr
ocrreader=easyocr.Reader(['ch_sim', 'en'], model_storage_directory=langpath)#Linux: r'/home/langdata', Windows: r'C:\ocr\langdata'

#------------------开始OCR识别----------------------
for ind in range(1,len(sys.argv)):#依次获取命令行参数：由于argv[0]表示可执行文件本身，所以忽略该参数
    argvalue=sys.argv[ind]
    
    #如果命令行参数是文件类型，就对该文件进行处理...
    if os.path.isfile(argvalue):
        paper=''
        #获取文件后缀名
        filext=os.path.splitext(argvalue)[-1]
        if filext.upper() not in ['.JPG','.JPEG','.PNG','.BMP','.PDF']:#转换为大写后再比对
            print('\t', argvalue, ' 不是有效的文件格式(jpg/jpeg/png/bmp/pdf)!')
            continue#下一个命令行参数
        #如果是pdf文档    
        if filext.upper() in['.PDF']:
            images=pdf2image.convert_from_path(argvalue)#将pdf文档转换为图像序列
            for i in range(len(images)):#如果pdf转换后的图片尺寸过大，为了避免内存崩溃，缩小到特定尺寸
                ratio=max(images[i].width, images[i].height)/maximgsize#需要缩小的倍数
                if ratio>1:
                    images[i]=images[i].resize((round(images[i].width/ratio),round(images[i].height/ratio)))
                #至此，需要进行ocr的图片数据准备完毕！
                if len(allowlist)>0:#如果设置了识别字符集
                    result = ocrreader.readtext(np.asarray(images[i]),batch_size=batchsize,workers=workernum,detail=0,paragraph=paragraph,allowlist=allowlist)
                else:
                    result = ocrreader.readtext(np.asarray(images[i]),batch_size=batchsize,workers=workernum,detail=0,paragraph=paragraph)
                for w in result:#识别结果是一个列表，对识别结果进行拼接
                    paper = paper+w+paraend
        else:#否则，本身就是图片数据
            if len(allowlist)>0:#如果设置了识别字符集
                result = ocrreader.readtext(argvalue,batch_size=batchsize,workers=workernumt,detail=0,paragraph=paragraph,allowlist=allowlis)
            else:
                result = ocrreader.readtext(argvalue,batch_size=batchsize,workers=workernum,detail=0,paragraph=paragraph)
            for w in result:#识别结果是一个列表，对识别结果进行拼接
                paper = paper+w+paraend #如果设置了段落结尾符，在拼接时加上
        #如果设置了删除字符集
        for item in removechar:#依次删除
            paper=paper.replace(item, '')
        #print(paper)#至此，文本识别全部完成！-------------------
        #下面开始存储识别结果txt文件
        #记录当前文件的识别结果，保存为同名的txt文件
        if(len(txtdir)>0):#如果设置了txt文件目录
            txtname=os.path.basename(argvalue)+'.txt'#与原文件同名的txt文件（不含目录仅文件名）
            txtpath=os.path.join(txtdir, txtname)
        else:
            txtpath=os.path.splitext(argvalue)[0]+'.txt'#与原文件同名的txt文件（包括目录）
        print('saving file ---> ', txtpath)#保存的文件名字
        try:
            with open(txtpath, 'w') as txtfile:
                txtfile.write(paper)
        except(Exception) as e:
            print('\t', txtpath, ' Saving txt File Error: ', e)#输出异常错误
            continue
            
    #如果是文件夹...
    if os.path.isdir(argvalue):
        for root, _, filenames in os.walk(argvalue):#依次遍历文件夹，由于不关心其中的文件夹，所以将文件夹设置为隐变量
            for imgname in filenames:#遍历的每个文件（不含路径，路径在root里）
                paper=''
                filext=os.path.splitext(imgname)[-1]#得到文件后缀名
                if filext.upper() not in ['.JPG','.JPEG','.PNG','.BMP','.PDF']:
                    print('\t', imgname, '的后缀名不是有效的文件格式，跳过该文件！')
                    continue
                #与root进行拼接，得到图像文件的绝对路径（含文件名和后缀名）
                imgpath=os.path.join(root, imgname)#文件绝对路径
                #如果是pdf文档
                if filext.upper() in['.PDF']:
                    images=pdf2image.convert_from_path(imgpath)#将pdf文档转换为图像序列
                    for i in range(len(images)):#如果pdf转换后的图片尺寸过大，为了避免内存崩溃，缩小到特定尺寸
                        ratio=max(images[i].width, images[i].height)/maximgsize#需要缩小的倍数
                        if ratio>1:
                            images[i]=images[i].resize((round(images[i].width/ratio),round(images[i].height/ratio)))
                        #至此，需要进行ocr的图片数据准备完毕！
                        if len(allowlist)>0:#如果设置了识别字符集
                            result = ocrreader.readtext(np.asarray(images[i]),batch_size=batchsize,workers=workernum,detail=0,paragraph=paragraph,allowlist=allowlist)
                        else:
                            result = ocrreader.readtext(np.asarray(images[i]),batch_size=batchsize,workers=workernum,detail=0,paragraph=paragraph)
                        for w in result:#识别结果是一个列表，对识别结果进行拼接
                            paper = paper+w+paraend #如果设置了段落结尾符，在拼接时加上
                else:#否则，本身就是图片数据
                    if len(allowlist)>0:#如果设置了识别字符集
                        result = ocrreader.readtext(imgpath,batch_size=batchsize,workers=workernum,detail=0,paragraph=paragraph,allowlist=allowlist)
                    else:
                        result = ocrreader.readtext(imgpath,batch_size=batchsize,workers=workernum,detail=0,paragraph=paragraph)
                    for w in result:#识别结果是一个列表，对识别结果进行拼接
                        paper = paper+w+paraend #如果设置了段落结尾符，在拼接时加上
                #如果设置了删除字符集
                for item in removechar:#依次删除
                    paper=paper.replace(item, '')
                #print(paper)
                #至此，文本识别全部完成！--------------------
                #下面开始存储识别结果txt文件
                #记录当前文件的识别结果，保存为同名的txt文件
                txtname=os.path.splitext(imgname)[0]+'.txt'#与原文件同名的txt文件（不包括目录）
                if(len(txtdir)>0):#如果设置了非空的txt文件目录
                    #原来的方式是直接把所有的txt全部放在指定的一个文件夹中，当不同文件夹中存在同名的图像文件时，会存在txt文件覆盖的情况
                    #txtpath=os.path.join(txtdir, txtname)#拼接得到txt文件的绝对路径
                    #下面的方式是在指定的文件夹下面按照原图像文件的目录结构新建相同的文件夹结构并存放txt文件
                    relativeimgpath=imgpath[len(argvalue)+1:]#图片绝对路径左减去命令行指定的路径argpath得到图像文件的内部相对路径,+1是去除\
                    imgtxtdir=os.path.join(txtdir,relativeimgpath)#指定txt文件路径+图像内部相对路径（还带有图像文件名和后缀名）
                    txtfiledir=os.path.dirname(imgtxtdir)#去掉图像文件名和后缀名
                    if not os.path.exists(txtfiledir):#上面的新文件路径不一定存在
                        try:
                            os.system('mkdir '+txtfiledir)#新建文件夹
                            print('Making directory: ',txtfiledir)
                        except(Exception) as e:
                            print('\tMaking txt directory Error: ', e)#输出异常错误
                            print('\tTxt file will be storded in the image file directory!')
                            txtpath=os.path.join(root, txtname)#路径+txt文件名
                    txtpath=os.path.join(txtfiledir, txtname)#新路径+txt文件名
                else:#否则就是默认的空的txt文件目录，表示txt文件就存储在图像对应的文件夹里
                    txtpath=os.path.join(root, txtname)#路径+txt文件名
                print('saving file ---> ', txtpath)#保存的文件名字
                try:
                    with open(txtpath, 'w') as txtfile:
                        txtfile.write(paper)
                except(Exception) as e:
                    print('\t', txtpath, ' Saving txt File Error: ', e)#输出异常错误
                    continue

最后，由于easyOCR自身的原因，总会给出一些很烦人的关于pytorch包的warnings，为了避免这些警告信息干扰，可以按照warnings中的信息对其中的_utils.py文件进行修改，路径如下：

将其中的warnings对应的语句删除即可，删除后的_utils.py文件内容如下（请谨慎删除，或者先备份源文件后再删除）：

import functools
import inspect
import warnings
from collections import OrderedDict
from typing import Any, Dict, Optional, TypeVar, Callable, Tuple, Union

from torch import nn

from .._utils import sequence_to_str
from ._api import WeightsEnum


class IntermediateLayerGetter(nn.ModuleDict):
    """
    Module wrapper that returns intermediate layers from a model

    It has a strong assumption that the modules have been registered
    into the model in the same order as they are used.
    This means that one should **not** reuse the same nn.Module
    twice in the forward if you want this to work.

    Additionally, it is only able to query submodules that are directly
    assigned to the model. So if `model` is passed, `model.feature1` can
    be returned, but not `model.feature1.layer2`.

    Args:
        model (nn.Module): model on which we will extract the features
        return_layers (Dict[name, new_name]): a dict containing the names
            of the modules for which the activations will be returned as
            the key of the dict, and the value of the dict is the name
            of the returned activation (which the user can specify).

    Examples::

        >>> m = torchvision.models.resnet18(weights=ResNet18_Weights.DEFAULT)
        >>> # extract layer1 and layer3, giving as names `feat1` and feat2`
        >>> new_m = torchvision.models._utils.IntermediateLayerGetter(m,
        >>>     {'layer1': 'feat1', 'layer3': 'feat2'})
        >>> out = new_m(torch.rand(1, 3, 224, 224))
        >>> print([(k, v.shape) for k, v in out.items()])
        >>>     [('feat1', torch.Size([1, 64, 56, 56])),
        >>>      ('feat2', torch.Size([1, 256, 14, 14]))]
    """

    _version = 2
    __annotations__ = {
        "return_layers": Dict[str, str],
    }

    def __init__(self, model: nn.Module, return_layers: Dict[str, str]) -> None:
        if not set(return_layers).issubset([name for name, _ in model.named_children()]):
            raise ValueError("return_layers are not present in model")
        orig_return_layers = return_layers
        return_layers = {str(k): str(v) for k, v in return_layers.items()}
        layers = OrderedDict()
        for name, module in model.named_children():
            layers[name] = module
            if name in return_layers:
                del return_layers[name]
            if not return_layers:
                break

        super().__init__(layers)
        self.return_layers = orig_return_layers

    def forward(self, x):
        out = OrderedDict()
        for name, module in self.items():
            x = module(x)
            if name in self.return_layers:
                out_name = self.return_layers[name]
                out[out_name] = x
        return out


def _make_divisible(v: float, divisor: int, min_value: Optional[int] = None) -> int:
    """
    This function is taken from the original tf repo.
    It ensures that all layers have a channel number that is divisible by 8
    It can be seen here:
    https://github.com/tensorflow/models/blob/master/research/slim/nets/mobilenet/mobilenet.py
    """
    if min_value is None:
        min_value = divisor
    new_v = max(min_value, int(v + divisor / 2) // divisor * divisor)
    # Make sure that round down does not go down by more than 10%.
    if new_v < 0.9 * v:
        new_v += divisor
    return new_v


D = TypeVar("D")


def kwonly_to_pos_or_kw(fn: Callable[..., D]) -> Callable[..., D]:
    """Decorates a function that uses keyword only parameters to also allow them being passed as positionals.

    For example, consider the use case of changing the signature of ``old_fn`` into the one from ``new_fn``:

    .. code::

        def old_fn(foo, bar, baz=None):
            ...

        def new_fn(foo, *, bar, baz=None):
            ...

    Calling ``old_fn("foo", "bar, "baz")`` was valid, but the same call is no longer valid with ``new_fn``. To keep BC
    and at the same time warn the user of the deprecation, this decorator can be used:

    .. code::

        @kwonly_to_pos_or_kw
        def new_fn(foo, *, bar, baz=None):
            ...

        new_fn("foo", "bar, "baz")
    """
    params = inspect.signature(fn).parameters

    try:
        keyword_only_start_idx = next(
            idx for idx, param in enumerate(params.values()) if param.kind == param.KEYWORD_ONLY
        )
    except StopIteration:
        raise TypeError(f"Found no keyword-only parameter on function '{fn.__name__}'") from None

    keyword_only_params = tuple(inspect.signature(fn).parameters)[keyword_only_start_idx:]

    @functools.wraps(fn)
    def wrapper(*args: Any, **kwargs: Any) -> D:
        args, keyword_only_args = args[:keyword_only_start_idx], args[keyword_only_start_idx:]
        if keyword_only_args:
            keyword_only_kwargs = dict(zip(keyword_only_params, keyword_only_args))
            warnings.warn(
                f"Using {sequence_to_str(tuple(keyword_only_kwargs.keys()), separate_last='and ')} as positional "
                f"parameter(s) is deprecated since 0.13 and will be removed in 0.15. Please use keyword parameter(s) "
                f"instead."
            )
            kwargs.update(keyword_only_kwargs)

        return fn(*args, **kwargs)

    return wrapper


W = TypeVar("W", bound=WeightsEnum)
M = TypeVar("M", bound=nn.Module)
V = TypeVar("V")


def handle_legacy_interface(**weights: Tuple[str, Union[Optional[W], Callable[[Dict[str, Any]], Optional[W]]]]):
    """Decorates a model builder with the new interface to make it compatible with the old.

    In particular this handles two things:

    1. Allows positional parameters again, but emits a deprecation warning in case they are used. See
        :func:`torchvision.prototype.utils._internal.kwonly_to_pos_or_kw` for details.
    2. Handles the default value change from ``pretrained=False`` to ``weights=None`` and ``pretrained=True`` to
        ``weights=Weights`` and emits a deprecation warning with instructions for the new interface.

    Args:
        **weights (Tuple[str, Union[Optional[W], Callable[[Dict[str, Any]], Optional[W]]]]): Deprecated parameter
            name and default value for the legacy ``pretrained=True``. The default value can be a callable in which
            case it will be called with a dictionary of the keyword arguments. The only key that is guaranteed to be in
            the dictionary is the deprecated parameter name passed as first element in the tuple. All other parameters
            should be accessed with :meth:`~dict.get`.
    """

    def outer_wrapper(builder: Callable[..., M]) -> Callable[..., M]:
        @kwonly_to_pos_or_kw
        @functools.wraps(builder)
        def inner_wrapper(*args: Any, **kwargs: Any) -> M:
            for weights_param, (pretrained_param, default) in weights.items():  # type: ignore[union-attr]
                # If neither the weights nor the pretrained parameter as passed, or the weights argument already use
                # the new style arguments, there is nothing to do. Note that we cannot use `None` as sentinel for the
                # weight argument, since it is a valid value.
                sentinel = object()
                weights_arg = kwargs.get(weights_param, sentinel)
                if (
                    (weights_param not in kwargs and pretrained_param not in kwargs)
                    or isinstance(weights_arg, WeightsEnum)
                    or (isinstance(weights_arg, str) and weights_arg != "legacy")
                    or weights_arg is None
                ):
                    continue

                # If the pretrained parameter was passed as positional argument, it is now mapped to
                # `kwargs[weights_param]`. This happens because the @kwonly_to_pos_or_kw decorator uses the current
                # signature to infer the names of positionally passed arguments and thus has no knowledge that there
                # used to be a pretrained parameter.
                pretrained_positional = weights_arg is not sentinel
                if pretrained_positional:
                    # We put the pretrained argument under its legacy name in the keyword argument dictionary to have a
                    # unified access to the value if the default value is a callable.
                    kwargs[pretrained_param] = pretrained_arg = kwargs.pop(weights_param)
                else:
                    pretrained_arg = kwargs[pretrained_param]

                if pretrained_arg:
                    default_weights_arg = default(kwargs) if callable(default) else default
                    if not isinstance(default_weights_arg, WeightsEnum):
                        raise ValueError(f"No weights available for model {builder.__name__}")
                else:
                    default_weights_arg = None
                del kwargs[pretrained_param]
                kwargs[weights_param] = default_weights_arg

            return builder(*args, **kwargs)

        return inner_wrapper

    return outer_wrapper


def _ovewrite_named_param(kwargs: Dict[str, Any], param: str, new_value: V) -> None:
    if param in kwargs:
        if kwargs[param] != new_value:
            raise ValueError(f"The parameter '{param}' expected value {new_value} but got {kwargs[param]} instead.")
    else:
        kwargs[param] = new_value


def _ovewrite_value_param(param: Optional[V], new_value: V) -> V:
    if param is not None:
        if param != new_value:
            raise ValueError(f"The parameter '{param}' expected value {new_value} but got {param} instead.")
    return new_value


class _ModelURLs(dict):
    def __getitem__(self, item):

        return super().__getitem__(item)

系统学习Python——并发模型和异步编程：进程、线程和GIL
分类目录：《系统学习Python》总目录在文章《并发模型和异步编程：基础知识》我们简单介绍了Python中的进程、线程和协程。本文就着重介绍Python中的进程、线程和GIL的关系。Python解释器的每个实例都是一个进程。使用multiprocessing或concurrent.futures库可以启动额外的Python进程。Python的subprocess库用于启动运行外部程序（不管使用何种
Flask框架入门：快速搭建轻量级Python网页应用「已注销」 python-AI python基础网站网络 python flask 后端
转载：Flask框架入门：快速搭建轻量级Python网页应用1.Flask基础Flask是一个使用Python编写的轻量级Web应用框架。它的设计目标是让Web开发变得快速简单，同时保持应用的灵活性。Flask依赖于两个外部库：Werkzeug和Jinja2，Werkzeug作为WSGI工具包处理Web服务的底层细节，Jinja2作为模板引擎渲染模板。安装Flask非常简单，可以使用pip安装命令
Python Flask 框架入门：快速搭建 Web 应用的秘诀 Python编程之道 Python人工智能与大数据 Python编程之道 python flask 前端 ai
PythonFlask框架入门：快速搭建Web应用的秘诀关键词Flask、微框架、路由系统、Jinja2模板、请求处理、WSGI、Web开发摘要想快速用Python搭建一个灵活的Web应用？Flask作为“微框架”代表，凭借轻量、可扩展的特性，成为初学者和小型项目的首选。本文将从Flask的核心概念出发，结合生活化比喻、代码示例和实战案例，带你一步步掌握：如何用Flask搭建第一个Web应用？路由
python_虚拟环境阿_焦 python
第一、配置虚拟环境：virtualenv（1）pipvirtualenv>安装虚拟环境包（2）pipinstallvirtualenvwrapper-win>安装虚拟环境依赖包（3）c盘创建虚拟目录>C:\virtualenv>配置环境变量【了解一下】：（1）如何使用virtualenv创建虚拟环境a、cd到C:\virtualenv目录下：b、mkvirtualenvname>创建虚拟环境nam
高效批量单词翻译工具的设计与应用
本文还有配套的精品资源，点击获取简介：在信息技术飞速发展的今天，批量单词翻译工具通过计算机的数据处理能力，大大提高了语言学习和文字处理的效率。用户通过简单输入单词列表到一个文本文件，并运行翻译程序，即可获得翻译结果并保存至指定文件。该工具集成了内置或外部翻译引擎，利用自然语言处理技术实现快速准确的翻译，并可能提供词性识别等附加功能。尽管机器翻译无法完全取代人工校对，但它为用户提供了一种高效的翻译解
Python爱心光波
系列文章序号直达链接Tkinter1Python李峋同款可写字版跳动的爱心2Python跳动的双爱心3Python蓝色跳动的爱心4Python动漫烟花5Python粒子烟花Turtle1Python满屏飘字2Python蓝色流星雨3Python金色流星雨4Python漂浮爱心5Python爱心光波①6Python爱心光波②7Python满天繁星8Python五彩气球9Python白色飘雪10Pyt
Python流星雨 Want595 python 开发语言
文章目录系列文章写在前面技术需求完整代码代码分析1.模块导入2.画布设置3.画笔设置4.颜色列表5.流星类(Star)6.流星对象创建7.主循环8.流星运动逻辑9.视觉效果10.总结写在后面系列文章序号直达链接表白系列1Python制作一个无法拒绝的表白界面2Python满屏飘字表白代码3Python无限弹窗满屏表白代码4Python李峋同款可写字版跳动的爱心5Python流星雨代码6Python
Python之七彩花朵代码实现 PlutoZuo Python python 开发语言
Python之七彩花朵代码实现文章目录Python之七彩花朵代码实现下面是一个简单的使用Python的七彩花朵。这个示例只是一个简单的版本，没有很多高级功能，但它可以作为一个起点，你可以在此基础上添加更多功能。importturtleastuimportrandomasraimportmathtu.setup(1.0,1.0)t=tu.Pen()t.ht()colors=['red','skybl
Python 脚本最佳实践2025版
前文可以直接把这篇文章喂给AI,可以放到AI角色设定里,也可以直接作为提示词.这样,你只管提需求,写脚本就让AI来.概述追求简洁和清晰：脚本应简单明了。使用函数(functions)、常量(constants)和适当的导入(import)实践来有逻辑地组织你的Python脚本。使用枚举(enumerations)和数据类(dataclasses)等数据结构高效管理脚本状态。通过命令行参数增强交互性
（Python基础篇）了解和使用分支结构 EternityArt 基础篇 python
目录一、引言二、Python分支结构的类型与语法（一）if语句（单分支）（二）if-else语句（双分支）（三）if-elif-else语句（多分支）三、分支结构的应用场景（一）提示用户输入用户名，然后再提示输入密码，如果用户名是“admin”并且密码是“88888”则提示正确，否则，如果用户名不是admin还提示用户用户名不存在,（二）提示用户输入用户名，然后再提示输入密码，如果用户名是“adm
（Python基础篇）循环结构 EternityArt 基础篇 python
一、什么是Python循环结构？循环结构是编程中重复执行代码块的机制。在Python中，循环允许你：1.迭代处理数据：遍历列表、字典、文件内容等。2.自动化重复任务：如批量处理数据、生成序列等。3.控制执行流程：根据条件决定是否继续或终止循环。二、为什么需要循环结构？假设你需要打印1到100的所有偶数：没有循环：需手动编写100行print()语句。print(0)print(2)print(4)
（Python基础篇）字典的操作 EternityArt 基础篇 python 开发语言
一、引言在Python编程中，字典（Dictionary）是一种极具灵活性的数据结构，它通过“键-值对”（key-valuepair）的形式存储数据，如同现实生活中的字典——通过“词语（键）”快速查找“释义（值）”。相较于列表和元组的有序索引访问，字典的优势在于基于键的快速查找，这使得它在处理需要频繁通过唯一标识获取数据的场景中极为高效。掌握字典的操作，能让我们更高效地组织和管理复杂数据，是Pyt
Python七彩花朵 Want595 python 开发语言
系列文章序号直达链接Tkinter1Python李峋同款可写字版跳动的爱心2Python跳动的双爱心3Python蓝色跳动的爱心4Python动漫烟花5Python粒子烟花Turtle1Python满屏飘字2Python蓝色流星雨3Python金色流星雨4Python漂浮爱心5Python爱心光波①6Python爱心光波②7Python满天繁星8Python五彩气球9Python白色飘雪10Pyt
用OpenCV标定相机内参应用示例（C++和Python）
下面是一个完整的使用OpenCV进行相机内参标定（CameraCalibration）的示例，包括C++和Python两个版本，基于棋盘格图案标定。一、目标：相机标定通过拍摄多张带有棋盘格图案的图像，估计相机的内参：相机矩阵（内参）K畸变系数distCoeffs可选外参（R,T）标定精度指标（如重投影误差）二、棋盘格参数设置（根据自己的棋盘格设置）：棋盘格角点数：9x6（内角点，9列×6行）；每个
Anaconda 详细下载与安装教程
Anaconda详细下载与安装教程1.简介Anaconda是一个用于科学计算的开源发行版，包含了Python和R的众多常用库。它还包括了conda包管理器，可以方便地安装、更新和管理各种软件包。2.下载Anaconda2.1访问官方网站首先，打开浏览器，访问Anaconda官方网站。2.2选择适合的版本在页面中，你会看到两个主要的下载选项：AnacondaIndividualEdition：适用于
python中 @注解及内置注解的使用方法总结以及完整示例慧一居士 Python python
在Python中，装饰器（Decorator）使用@符号实现，是一种修改函数/类行为的语法糖。它本质上是一个高阶函数，接受目标函数作为参数并返回包装后的函数。Python也提供了多个内置装饰器，如@property、@staticmethod、@classmethod等。一、核心概念装饰器本质：@decorator等价于func=decorator(func)执行时机：在函数/类定义时立即执行装饰
Python中的静态方法和类方法详解
在Python中，`@staticmethod`和`@classmethod`是两种装饰器，它们用于定义类中的方法，但是它们的行为和用途有所不同。###@staticmethod`@staticmethod`装饰器用于定义一个静态方法。静态方法不接收类或实例的引用作为第一个参数，因此它不能访问类的状态或实例的状态。静态方法可以看作是与类关联的普通函数，但它们可以通过类名直接调用。classMath
Python中类静态方法：@classmethod/@staticmethod详解和实战示例
在Python中，类方法(@classmethod)和静态方法(@staticmethod)是类作用域下的两种特殊方法。它们使用装饰器定义，并且与实例方法(deffunc(self))的行为有所不同。1.三种方法的对比概览方法类型是否访问实例(self)是否访问类(cls)典型用途实例方法✅是❌否访问对象属性类方法@classmethod❌否✅是创建类的替代构造器，访问类变量等静态方法@stati
Python多版本管理与pip升级全攻略：解决冲突与高效实践码界奇点 Python python pip 开发语言 python3.11 源代码管理虚拟现实依赖倒置原则
引言Python作为最流行的编程语言之一，其版本迭代速度与生态碎片化给开发者带来了巨大挑战。据统计，超过60%的Python开发者需要同时维护基于Python3.6+和Python2.7的项目。本文将系统解决以下核心痛点：如何安全地在同一台机器上管理多个Python版本pip依赖冲突的根治方案符合PEP标准的生产环境最佳实践第一部分：Python多版本管理核心方案1.1系统级多版本共存方案Wind
基于Python的健身数据分析工具的搭建流程day1 weixin_45677320 python 开发语言数据挖掘爬虫
基于Python的健身数据分析工具的搭建流程分数据挖掘、数据存储和数据分析三个步骤。本文主要介绍利用Python实现健身数据分析工具的数据挖掘部分。第一步：加载库加载本文需要的库，如下代码所示。若库未安装，请按照python如何安装各种库（保姆级教程）_python安装库-CSDN博客https://blog.csdn.net/aobulaien001/article/details/133298
数字孪生技术为UI前端注入新活力：实现产品设计的沉浸式体验 ui设计前端开发老司机 ui
hello宝子们...我们是艾斯视觉擅长ui设计、前端开发、数字孪生、大数据、三维建模、三维动画10年+经验!希望我的分享能帮助到您!如需帮助可以评论关注私信我们一起探讨!致敬感谢感恩!一、引言：从“平面交互”到“沉浸体验”的UI革命当用户在电商APP中翻看3D家具模型却无法感知其与自家客厅的匹配度，当设计师在2D屏幕上绘制汽车内饰却难以预判实际乘坐体验——传统UI设计的“平面化、静态化、割裂感”
AIGC工具与软件开发流程的深度集成方案 Irene-HQ 软件开发测试 AIGC 测试工具 github AIGC 程序人生面试
一、代码开发环节集成路径‌环境配置标准化‌安装AIGC工具包并配置环境变量（如设置AIGC_TOOL_PATH），确保团队开发环境一致‌。在IDE插件市场安装Copilot等工具，实现编码时实时建议调用‌。‌人机协作新模式‌‌需求解析‌：上传PRD文档，AI自动提取业务规则生成类结构（如支付模块的PaymentService雏形）‌。‌代码补全‌：输入注释//JWT验证中间件，生成OAuth2.0
seaborn又一个扩展heatmapz qq_21478261 #Python可视化 matplotlib
推荐阅读：Pythonmatplotlib保姆级教程嫌Matplotlib繁琐？试试Seaborn！
NGS测序基础梳理01-文库构建（Library Preparation） qq_21478261 #生物信息生物学
本文介绍Illumina测序平台文库构建（LibraryPreparation）步骤，文库结构。写作时间：2020.05。推荐阅读：10W字《Python可视化教程1.0》来了！一份由公众号「pythonic生物人」精心制作的PythonMatplotlib可视化系统教程，105页PDFhttps://mp.weixin.qq.com/s/QaSmucuVsS_DR-klfpE3-Q10W字《Rg
Python 常用内置函数详解（七）：dir()函数——获取当前本地作用域中的名称列表或对象的有效属性列表
目录一、功能二、语法和示例一、功能dir()函数获取当前本地作用域中的名称列表或对象的有效属性列表。二、语法和示例dir()函数有两种形式，如果没有实参，则返回当前本地作用域中的名称列表。如果有实参，它会尝试返回该对象的有效属性列表。如果对象有一个名为__dir__()的方法，那么该方法将被调用，并且必须返回一个属性列表。dir()函数的语法格式如下：C:\Users\amoxiang>ipyth
pythonjson中list操作_Python json.dumps 特殊数据类型的自定义序列化操作
场景描述：Python标准库中的json模块，集成了将数据序列化处理的功能；在使用json.dumps()方法序列化数据时候，如果目标数据中存在datetime数据类型，执行操作时，会抛出异常：TypeError:datetime.datetime(2016,12,10,11,04,21)isnotJSONserializable那么遇到json.dumps序列化不支持的数据类型，该怎么办！首先，
Python 日期格式转json.dumps的解决方法 douyaoxin python json 开发语言
classDateEncoder(json.JSONEncoder):defdefault(self,obj):ifisinstance(obj,datetime.datetime):returnobj.strftime('%Y-%m-%d%H:%M:%S')elifisinstance(obj,datetime.date):returnobj.strftime("%Y-%m-%d")json.d
Python 爬虫实战：视频平台播放量实时监控（含反爬对抗与数据趋势预测）西攻城狮北 python 爬虫音视频
一、引言在数字内容蓬勃发展的当下，视频平台的播放量数据已成为内容创作者、营销人员以及行业分析师手中极为关键的情报资源。它不仅能够实时反映内容的受欢迎程度，更能在竞争分析、营销策略制定以及内容优化等方面发挥不可估量的作用。然而，视频平台为了保护自身数据和用户隐私，往往会设置一系列反爬虫机制，对数据爬取行为进行限制。这就向我们发起了挑战：如何巧妙地突破这些限制，同时精准地捕捉并预测播放量的动态变化趋势
Python技能手册 - 模块module 金色牛神 Python python windows 开发语言
系列Python常用技能手册-基础语法Python常用技能手册-模块modulePython常用技能手册-包package目录module模块指什么typing数据类型int整数float浮点数str字符串bool布尔值TypeVar类型变量functools高阶函数工具functools.partial()函数偏置functools.lru_cache()函数缓存sorted排序列表排序元组排序
深度学习模型表征提取全解析 ZhangJiQun&MXP 教学 2024大模型以及算力 2021 AI python 深度学习人工智能 python embedding 语言模型
模型内部进行表征提取的方法在自然语言处理（NLP）中，“表征（Representation）”指将文本（词、短语、句子、文档等）转化为计算机可理解的数值形式（如向量、矩阵），核心目标是捕捉语言的语义、语法、上下文依赖等信息。自然语言表征技术可按“静态/动态”“有无上下文”“是否融入知识”等维度划分一、传统静态表征（无上下文，词级为主）这类方法为每个词分配固定向量，不考虑其在具体语境中的含义（无法解
apache ftpserver-CentOS config gengzg apache
<server xmlns="http://mina.apache.org/ftpserver/spring/v1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation=" http://mina.apache.o
优化MySQL数据库性能的八种方法 AILIKES sql mysql
1、选取最适用的字段属性　　MySQL可以很好的支持大数据量的存取，但是一般说来，数据库中的表越小，在它上面执行的查询也就会越快。因此，在创建表的时候，为了获得更好的性能，我们可以将表中字段的宽度设得尽可能小。例如，在定义邮政编码这个字段时，如果将其设置为CHAR(255),显然给数据库增加了不必要的空间，甚至使用VARCHAR这种类型也是多余的，因为CHAR(6)就可以很
JeeSite 企业信息化快速开发平台 Kai_Ge JeeSite
JeeSite 企业信息化快速开发平台平台简介 JeeSite是基于多个优秀的开源项目，高度整合封装而成的高效，高性能，强安全性的开源Java EE快速开发平台。 JeeSite本身是以Spring Framework为核心容器，Spring MVC为模型视图控制器，MyBatis为数据访问层， Apache Shiro为权限授权层，Ehcahe对常用数据进行缓存，Activit为工作流
通过Spring Mail Api发送邮件 120153216 邮件 main
原文地址：http://www.open-open.com/lib/view/open1346857871615.html 使用Java Mail API来发送邮件也很容易实现，但是最近公司一个同事封装的邮件API实在让我无法接受，于是便打算改用Spring Mail API来发送邮件，顺便记录下这篇文章。【Spring Mail API】 Spring Mail API都在org.spri
Pysvn 程序员使用指南 2002wmj SVN
源文件:http://ju.outofmemory.cn/entry/35762 这是一篇关于pysvn模块的指南. 完整和详细的API请参考 http://pysvn.tigris.org/docs/pysvn_prog_ref.html. pysvn是操作Subversion版本控制的Python接口模块. 这个API接口可以管理一个工作副本, 查询档案库, 和同步两个. 该
在SQLSERVER中查找被阻塞和正在被阻塞的SQL 357029540 SQL Server
SELECT R.session_id AS BlockedSessionID , S.session_id AS BlockingSessionID , Q1.text AS Block
Intent 常用的用法备忘 7454103 .net android Google Blog F#
Intent 应该算是Android中特有的东西。你可以在Intent中指定程序要执行的动作（比如：view,edit,dial），以及程序执行到该动作时所需要的资料。都指定好后，只要调用startActivity()，Android系统会自动寻找最符合你指定要求的应用程序，并执行该程序。下面列出几种Intent 的用法显示网页:
Spring定时器时间配置 adminjun spring 时间配置定时器
红圈中的值由6个数字组成，中间用空格分隔。第一个数字表示定时任务执行时间的秒，第二个数字表示分钟，第三个数字表示小时，后面三个数字表示日，月，年，< xmlnamespace prefix ="o" ns ="urn:schemas-microsoft-com:office:office" /> 测试的时候，由于是每天定时执行，所以后面三个数
POJ 2421 Constructing Roads 最小生成树 aijuans 最小生成树
来源：http://poj.org/problem?id=2421 题意：还是给你n个点，然后求最小生成树。特殊之处在于有一些点之间已经连上了边。思路：对于已经有边的点，特殊标记一下，加边的时候把这些边的权值赋值为0即可。这样就可以既保证这些边一定存在，又保证了所求的结果正确。代码： #include <iostream> #include <cstdio>
重构笔记——提取方法（Extract Method） ayaoxinchao java 重构提炼函数局部变量提取方法
提取方法（Extract Method）是最常用的重构手法之一。当看到一个方法过长或者方法很难让人理解其意图的时候，这时候就可以用提取方法这种重构手法。下面是我学习这个重构手法的笔记：提取方法看起来好像仅仅是将被提取方法中的一段代码，放到目标方法中。其实，当方法足够复杂的时候，提取方法也会变得复杂。当然，如果提取方法这种重构手法无法进行时，就可能需要选择其他
为UILabel添加点击事件 bewithme UILabel
默认情况下UILabel是不支持点击事件的，网上查了查居然没有一个是完整的答案，现在我提供一个完整的代码。 UILabel *l = [[UILabel alloc] initWithFrame:CGRectMake(60, 0, listV.frame.size.width - 60, listV.frame.size.height)]
NoSQL数据库之Redis数据库管理(PHP-REDIS实例) bijian1013 redis 数据库 NoSQL
一.redis.php <?php //实例化 $redis = new Redis(); //连接服务器 $redis->connect("localhost"); //授权 $redis->auth("lamplijie"); //相关操
SecureCRT使用备注 bingyingao secureCRT 每页行数
SecureCRT日志和卷屏行数设置一、使用securecrt时，设置自动日志记录功能。 1、在C:\Program Files\SecureCRT\下新建一个文件夹(也就是你的CRT可执行文件的路径），命名为Logs； 2、点击Options -> Global Options -> Default Session -> Edite Default Sett
【Scala九】Scala核心三：泛型 bit1129 scala
泛型类 package spark.examples.scala.generics class GenericClass[K, V](val k: K, val v: V) { def print() { println(k + "," + v) } } object GenericClass { def main(args: Arr
素数与音乐 bookjovi 素数数学 haskell
由于一直在看haskell，不可避免的接触到了很多数学知识，其中数论最多，如素数，斐波那契数列等，很多在学生时代无法理解的数学现在似乎也能领悟到那么一点。闲暇之余，从图书馆找了<<The music of primes>>和<<世界数学通史>>读了几遍。其中素数的音乐这本书与软件界熟知的&l
Java-Collections Framework学习与总结-IdentityHashMap BrokenDreams Collections
这篇总结一下java.util.IdentityHashMap。从类名上可以猜到，这个类本质应该还是一个散列表，只是前面有Identity修饰，是一种特殊的HashMap。简单的说，IdentityHashMap和HashM
读《研磨设计模式》-代码笔记-享元模式-Flyweight bylijinnan java 设计模式
声明：本文只为方便我个人查阅和理解，详细的分析以及源代码请移步原作者的博客http://chjavach.iteye.com/ import java.util.ArrayList; import java.util.Collection; import java.util.HashMap; import java.util.List; import java
PS人像润饰&调色教程集锦 cherishLC PS
1、仿制图章沿轮廓润饰——柔化图像，凸显轮廓 http://www.howzhi.com/course/retouching/ 新建一个透明图层，使用仿制图章不断Alt+鼠标左键选点，设置透明度为21%，大小为修饰区域的1/3左右（比如胳膊宽度的1/3），再沿纹理方向（比如胳膊方向）进行修饰。所有修饰完成后，对该润饰图层添加噪声，噪声大小应该和
更新多个字段的UPDATE语句 crabdave update
更新多个字段的UPDATE语句 update tableA a set (a.v1, a.v2, a.v3, a.v4) = --使用括号确定更新的字段范围
hive实例讲解实现in和not in子句 daizj hive not in in
本文转自：http://www.cnblogs.com/ggjucheng/archive/2013/01/03/2842855.html 当前hive不支持 in或not in 中包含查询子句的语法，所以只能通过left join实现。假设有一个登陆表login(当天登陆记录,只有一个uid),和一个用户注册表regusers(当天注册用户，字段只有一个uid)，这两个表都包含
一道24点的10+种非人类解法（2,3,10,10） dsjt 算法
这是人类算24点的方法？！！！事件缘由：今天晚上突然看到一条24点状态，当时惊为天人，这NM叫人啊？以下是那条状态朱明西 : 24点，算2 3 10 10，我LX炮狗等面对四张牌痛不欲生，结果跑跑同学扫了一眼说，算出来了，2的10次方减10的3次方。。我草这是人类的算24点啊。。然后么。。。我就在深夜很得瑟的问室友求室友算刚出完题，文哥的暴走之旅开始了 5秒后
关于YII的菜单插件 CMenu和面包末breadcrumbs路径管理插件的一些使用问题 dcj3sjt126com yii framework
在使用 YIi的路径管理工具时，发现了一个问题。 <?php
对象与关系之间的矛盾：“阻抗失配”效应[转] come_for_dream 对象
概述 “阻抗失配”这一词组通常用来描述面向对象应用向传统的关系数据库（RDBMS）存放数据时所遇到的数据表述不一致问题。C++程序员已经被这个问题困扰了好多年，而现在的Java程序员和其它面向对象开发人员也对这个问题深感头痛。 “阻抗失配”产生的原因是因为对象模型与关系模型之间缺乏固有的亲合力。“阻抗失配”所带来的问题包括：类的层次关系必须绑定为关系模式（将对象
学习编程那点事 gcq511120594 编程互联网
一年前的夏天，我还在纠结要不要改行，要不要去学php？能学到真本事吗？改行能成功吗？太多的问题，我终于不顾一切，下定决心，辞去了工作，来到传说中的帝都。老师给的乘车方式还算有效，很顺利的就到了学校，赶巧了，正好学校搬到了新校区。先安顿了下来，过了个轻松的周末，第一次到帝都，逛逛吧！接下来的周一，是我噩梦的开始，学习内容对我这个零基础的人来说，除了勉强完成老师布置的作业外，我已经没有时间和精力去
Reverse Linked List II hcx2013 list
Reverse a linked list from position m to n. Do it in-place and in one-pass. For example:Given 1->2->3->4->5->NULL, m = 2 and n = 4, return
Spring4.1新特性——页面自动化测试框架Spring MVC Test HtmlUnit简介 jinnianshilongnian spring 4.1
目录 Spring4.1新特性——综述 Spring4.1新特性——Spring核心部分及其他 Spring4.1新特性——Spring缓存框架增强 Spring4.1新特性——异步调用和事件机制的异常处理 Spring4.1新特性——数据库集成测试脚本初始化 Spring4.1新特性——Spring MVC增强 Spring4.1新特性——页面自动化测试框架Spring MVC T
Hadoop集群工具distcp liyonghui160com
1. 环境描述两个集群：rock 和 stone rock无kerberos权限认证，stone有要求认证。 1. 从rock复制到stone，采用hdfs Hadoop distcp -i hdfs://rock-nn:8020/user/cxz/input hdfs://stone-nn:8020/user/cxz/运行在rock端，即源端问题：报版本
一个备份MySQL数据库的简单Shell脚本 pda158 mysql 脚本
　　主脚本（用于备份mysql数据库）：　　该Shell脚本可以自动备份数据库。只要复制粘贴本脚本到文本编辑器中，输入数据库用户名、密码以及数据库名即可。我备份数据库使用的是mysqlump 命令。后面会对每行脚本命令进行说明。　　 1. 分别建立目录“backup”和“oldbackup” 　　#mkdir /backup 　　#mkdir /oldbackup 　
300个涵盖IT各方面的免费资源（中）——设计与编码篇 shoothao IT资源图标库图片库色彩板字体
A. 免费的设计资源 Freebbble:来自于Dribbble的免费的高质量作品。 Dribbble:Dribbble上“免费”的搜索结果——这是巨大的宝藏。 Graphic Burger:每个像素点都做得很细的绝佳的设计资源。 Pixel Buddha:免费和优质资源的专业社区。 Premium Pixels:为那些有创意的人提供免费的素材。
thrift总结 - 跨语言服务开发 uule thrift
官网官网JAVA例子 thrift入门介绍 IBM-Apache Thrift - 可伸缩的跨语言服务开发框架 Thrift入门及Java实例演示 thrift的使用介绍 RPC POM： <dependency> <groupId>org.apache.thrift</groupId>

基于Python的离线OCR图片文字识别（五）——终极版本

你可能感兴趣的:(Python数据处理,数据清洗,python环境配置,python,大数据,自然语言处理)