Using Python to remove the duplicate pages from a PDF presentation created with the LaTeX Beamer class with its animation effects preserved


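This deck really has only 64 slides, but the PDF exported with its animation steps has 225 pages. To make it easier to browse, we only need to keep the most complete page of each slide. The following code finds the page numbers to extract: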

# -*- coding: utf-8 -*-
# @File: getRealPageNumbers.py
# @Author: 和谐号
# @Software: PyCharm
# @CreationTime: 2023-09-18 14:22
# @OverviewDescription:

# pip install PyPDF2==3.0.0

from PyPDF2 import PdfReader
import re

# Path to the PDF document
pdf_path = r"C:\Users\和谐号\Desktop\ch1集合与点集.pdf"

# Create a PyPDF2 PDF reader object
pdf_reader = PdfReader(pdf_path)

# Get the total number of pages
total_pages = len(pdf_reader.pages)
real_page_number_list = []
# real_total_page_number = int(re.findall(r"\d* / (\d*)", pdf_reader.pages[0].extract_text())[0])
# print(real_total_page_number)

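# Extract the slide number printed on each page (footer text such as "12 / 64")
for page_number in range(total_pages):
    page = pdf_reader.pages[page_number]
    page_text = page.extract_text()
    # if page_number == 0:
    #     print(page_text)

    # Look for the "current / total" footer; \d+ avoids matching an empty string
    m = re.findall(r"(\d+) / \d+", page_text or "")
    if m:
        real_page_number_list.append(int(m[0]))
    else:
        # No footer found: reuse the previous slide number so that list
        # indices stay aligned with PDF page indices
        real_page_number_list.append(real_page_number_list[-1] if real_page_number_list else 1)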

print(real_page_number_list)

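# Work out which pages to extract: the last page of each slide (1-based page numbers)
now_real_page = 1
must_page_list = []
for i, item in enumerate(real_page_number_list):
    if item > now_real_page:
        # Page i + 1 starts a new slide, so page i (1-based) ends the previous one
        must_page_list.append(i)
        # Jump straight to the new slide number so a skipped number is not counted twice
        now_real_page = item
must_page_list.append(total_pages)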

result = ""
for item in must_page_list:
    result += str(item) + ","

print(result[:-1], end="")

Output:

D:\Anaconda3\python.exe C:\Users\和谐号\PycharmProjects\pythonProject\2023-09-18-pdf页面提取\getRealPageNumbers.py 
[1, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 7, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 12, 12, 13, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 17, 17, 17, 18, 19, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 31, 31, 31, 32, 32, 32, 33, 33, 34, 34, 34, 34, 35, 36, 36, 37, 37, 37, 37, 38, 39, 40, 40, 40, 40, 40, 40, 40, 40, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 43, 43, 43, 43, 44, 45, 45, 45, 45, 46, 46, 46, 47, 47, 47, 47, 47, 47, 48, 48, 49, 49, 49, 49, 49, 49, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 52, 53, 54, 54, 54, 54, 54, 54, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 60, 61, 62, 62, 62, 63, 63, 64, 64, 64, 64, 64, 64, 64, 64]
1,7,8,13,14,21,22,24,32,34,38,40,45,49,53,54,57,58,59,67,73,75,81,84,86,88,93,97,103,104,107,110,112,116,117,119,123,124,125,133,139,145,149,150,154,157,163,165,171,176,181,182,183,190,191,195,196,199,210,211,212,215,217,225
Process finished with exit code 0

So the pages to extract are:

1,7,8,13,14,21,22,24,32,34,38,40,45,49,53,54,57,58,59,67,73,75,81,84,86,88,93,97,103,104,107,110,112,116,117,119,123,124,125,133,139,145,149,150,154,157,163,165,171,176,181,182,183,190,191,195,196,199,210,211,212,215,217,225

Pass this list to the PDF page-extraction feature in WPS to obtain the de-duplicated PDF, which is much easier to read.
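
The selection step of version 1 can also be sketched as a small standalone function, which makes the rule easier to check on a short list. This is a minimal sketch, not the script above: the function name `last_pages_per_slide` and the sample list are illustrative assumptions, and it treats any change in the printed slide number as the start of a new slide.

```python
def last_pages_per_slide(slide_numbers):
    """Return the 1-based PDF page numbers of the last page of each slide.

    slide_numbers[i] is the slide number printed on PDF page i + 1.
    """
    keep = []
    for index, number in enumerate(slide_numbers[1:], start=1):
        # A change in slide number means the previous page
        # (1-based page number: index) was the final page of its slide.
        if number != slide_numbers[index - 1]:
            keep.append(index)
    if slide_numbers:
        # The very last page always ends the last slide.
        keep.append(len(slide_numbers))
    return keep


# Hypothetical sample: slide 2 spans pages 2-4, slide 4 spans pages 6-7.
sample = [1, 2, 2, 2, 3, 4, 4]
print(",".join(str(p) for p in last_pages_per_slide(sample)))
```

Running it on the first few values of the real list printed above (`[1, 2, 2, 2, 2, 2, 2, 3, 4]`) gives pages beginning 1, 7, 8, which agrees with the start of the script's output.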

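Version 2.0:

This version does not rely on the printed slide number. It decides directly from the page content, so it applies more generally: a page is reported when at least one of its text lines no longer appears on the following page.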

# -*- coding: utf-8 -*-
# @File: getRealPage2.py
# @Author: 和谐号
# @Software: PyCharm
# @CreationTime: 2023-09-24 17:47
# @OverviewDescription:

# Install the pdfplumber package in a terminal before use
# pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# pip install pdfplumber

import pdfplumber

pdf_path = r"C:\Users\和谐号\Desktop\Stochastic-processes-I.pdf"

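previous_page = []
current_page = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Text lines of the current page; "or ''" guards against a page
        # whose extract_text() result is empty or None
        current_page = (page.extract_text() or "").split('\n')
        # If any line of the previous page is missing here, the previous page
        # was the last (most complete) page of its slide
        for item in previous_page:
            if item not in current_page:
                print(page.page_number - 1, end=",")
                break
        previous_page = current_page
        # print()
        # print(page)
        # print(page_text)
    # The final page always closes the last slide
    print(len(pdf.pages))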

Output:

D:\Anaconda3\python.exe C:\Users\和谐号\PycharmProjects\pythonProject\2023-09-18-pdf页面提取\getRealPage2.py 
1,8,17,26,31,40,45,51,58,62,68,70,77,82,87,88,89,92,107,111,121,126,130,135,141,142,144,152,155,158,162,166,168,174,176,179,184,188,192,193,197,199,203,205,206,207,208,210,214,218,220,225,226,227,233,236,239,242,246,248,251,252,253,254,256,258,262,266,271,273,276,284,292,299,304,305,306,309,312,317,325,330,332,334,342,345,349,353,359,364,366,367,372,374,379,382,385,388,389,392,393,396,399,400,404,407,410,414,417,422,426,429,434,439,443,449,453,459,463,467,471,472,475,477,483,487,491,495,496,498,514,521,524

Process finished with exit code 0
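
To see the content-based rule without needing a PDF, it can be sketched on plain lists of text lines. This is a hedged illustration only: the function name `last_overlay_pages` and the sample pages are assumptions, and in the real script the lines come from `page.extract_text().split('\n')`.

```python
def last_overlay_pages(pages):
    """Return 1-based numbers of the pages to keep.

    pages is a list with one inner list of text lines per PDF page.
    A page is kept when some line of it is missing from the next page;
    the final page is always kept.
    """
    keep = []
    for number, (previous, current) in enumerate(zip(pages, pages[1:]), start=1):
        current_lines = set(current)
        # If any line of the earlier page is gone, a new slide has started.
        if any(line not in current_lines for line in previous):
            keep.append(number)
    if pages:
        keep.append(len(pages))
    return keep


# Hypothetical deck: page 3 adds a line to page 2, page 4 is a new slide.
sample_pages = [
    ["Title"],
    ["Sets", "A"],
    ["Sets", "A", "B"],
    ["Measure", "C"],
]
print(last_overlay_pages(sample_pages))
```

Note that this rule only detects overlays that add content. A step that replaces a line with a different one would be treated as a new slide.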
 

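Version 3.0:

Version 3.0 of the page de-duplication program. What's new: 1. PDF generation has been added, so de-duplication is now essentially fully automatic: after running the program, the de-duplicated PDF is written to the same folder as the input file. 2. Usage instructions have been added at the top of the code.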
 

# -*- coding: utf-8 -*-
# @File: getRealPageNumbers(v3.0).py
# @Author: 和谐号
# @Software: PyCharm
# @CreationTime: 2023-10-17 15:31
# @OverviewDescription:

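# Usage
# 1. Install the pdfplumber and PyPDF2 packages in a terminal (or cmd):
# pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# pip install pdfplumber
# pip install PyPDF2
# 2. Change pdf_path in this file to the path of the file you want to process (just edit the text inside the quotes)
# 3. Run the program; the PDF with the extracted pages will be created in the same folder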

import os

import pdfplumber
import PyPDF2

pdf_path = r"C:\Users\和谐号\Desktop\ch2Lebesgue测度 w7.pdf"

pdf_file = open(pdf_path, 'rb')
pdf_reader = PyPDF2.PdfReader(pdf_file)
pdf_writer = PyPDF2.PdfWriter()
pdf_name = os.path.basename(pdf_path)
pdf_name_without_extension = os.path.splitext(pdf_name)[0]

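previous_page = []
current_page = []

with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Text lines of the current page; "or ''" guards against a page
        # whose extract_text() result is empty or None
        current_page = (page.extract_text() or "").split('\n')
        for item in previous_page:
            if item not in current_page:
                # page.page_number is 1-based, so the previous page has
                # 0-based index page.page_number - 2
                pdf_writer.add_page(pdf_reader.pages[page.page_number - 2])
                print(page.page_number - 1, end=",")
                break
        previous_page = current_page
        # print()
        # print(page)
        # print(page_text)
    # The final page always closes the last slide
    pdf_writer.add_page(pdf_reader.pages[len(pdf.pages) - 1])
    print(len(pdf.pages))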

file_directory = os.path.dirname(pdf_path)
output_path = os.path.join(file_directory, pdf_name_without_extension + '(去重版).pdf')
with open(output_path, 'wb') as output_pdf:
    pdf_writer.write(output_pdf)

# output_pdf was already closed by the with block above
pdf_file.close()
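
The output file name in version 3.0 is assembled from several `os.path` calls. As a possible alternative, the same name can be derived with `pathlib`. This is only a sketch: the function name `deduplicated_path` and the sample path are illustrative assumptions, and the suffix `(去重版)` ("de-duplicated version") is the one used by the script above.

```python
from pathlib import Path


def deduplicated_path(pdf_path, suffix="(去重版)"):
    """Return a path next to pdf_path with suffix appended to the file stem."""
    source = Path(pdf_path)
    # Keep the folder and the extension, change only the file name.
    return source.with_name(source.stem + suffix + source.suffix)


# Hypothetical input path, used only to show the resulting name.
print(deduplicated_path("slides/ch2.pdf").name)
```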
