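This PDF really contains only 64 slides, but because it was exported with animations it has 225 pages. To make it easier to browse, we only need to keep the most complete page of each slide, that is, the last animation step. The following code finds the page numbers to extract: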
# -*- coding: utf-8 -*-
# @File: getRealPageNumbers.py
# @Author: 和谐号
# @Software: PyCharm
# @CreationTime: 2023-09-18 14:22
# @OverviewDescription:
# pip install PyPDF2==3.0.0
from PyPDF2 import PdfReader
import re
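# Path to the PDF document
pdf_path = r"C:\Users\和谐号\Desktop\ch1集合与点集.pdf"
# Create a PyPDF2 PDF reader object
pdf_reader = PdfReader(pdf_path)
# Get the total number of pages
total_pages = len(pdf_reader.pages)
real_page_number_list = []
# real_total_page_number = int(re.findall(r"\d* / (\d*)", pdf_reader.pages[0].extract_text())[0])
# print(real_total_page_number)
# Extract the slide number printed on each page
for page_number in range(total_pages):
    page = pdf_reader.pages[page_number]
    page_text = page.extract_text()
    # if page_number == 0:
    #     print(page_text)
    # Look for a "current / total" label such as "3 / 64"
    m = re.findall(r"(\d+) / \d+", page_text or "")
    if m:
        real_page_number_list.append(int(m[0]))
    else:
        # No label found: reuse the previous slide number so that list
        # indices stay aligned with PDF page indices
        real_page_number_list.append(real_page_number_list[-1] if real_page_number_list else 1)
print(real_page_number_list)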
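# Work out which pages to extract:
now_real_page = 1
must_page_list = []
for i, item in enumerate(real_page_number_list):
    if item > now_real_page:
        # The 0-based index i of the first page of a new slide equals the
        # 1-based number of the last page of the previous slide
        must_page_list.append(i)
        # Jump to the new slide number so a skipped number cannot desynchronise the counter
        now_real_page = item
# The last page of the PDF is always kept
must_page_list.append(total_pages)
print(",".join(str(item) for item in must_page_list), end="")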
Output:
D:\Anaconda3\python.exe C:\Users\和谐号\PycharmProjects\pythonProject\2023-09-18-pdf页面提取\getRealPageNumbers.py
[1, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5, 6, 6, 6, 6, 6, 6, 6, 7, 8, 8, 9, 9, 9, 9, 9, 9, 9, 9, 10, 10, 11, 11, 11, 11, 12, 12, 13, 13, 13, 13, 13, 14, 14, 14, 14, 15, 15, 15, 15, 16, 17, 17, 17, 18, 19, 20, 20, 20, 20, 20, 20, 20, 20, 21, 21, 21, 21, 21, 21, 22, 22, 23, 23, 23, 23, 23, 23, 24, 24, 24, 25, 25, 26, 26, 27, 27, 27, 27, 27, 28, 28, 28, 28, 29, 29, 29, 29, 29, 29, 30, 31, 31, 31, 32, 32, 32, 33, 33, 34, 34, 34, 34, 35, 36, 36, 37, 37, 37, 37, 38, 39, 40, 40, 40, 40, 40, 40, 40, 40, 41, 41, 41, 41, 41, 41, 42, 42, 42, 42, 42, 42, 43, 43, 43, 43, 44, 45, 45, 45, 45, 46, 46, 46, 47, 47, 47, 47, 47, 47, 48, 48, 49, 49, 49, 49, 49, 49, 50, 50, 50, 50, 50, 51, 51, 51, 51, 51, 52, 53, 54, 54, 54, 54, 54, 54, 54, 55, 56, 56, 56, 56, 57, 58, 58, 58, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 59, 60, 61, 62, 62, 62, 63, 63, 64, 64, 64, 64, 64, 64, 64, 64]
1,7,8,13,14,21,22,24,32,34,38,40,45,49,53,54,57,58,59,67,73,75,81,84,86,88,93,97,103,104,107,110,112,116,117,119,123,124,125,133,139,145,149,150,154,157,163,165,171,176,181,182,183,190,191,195,196,199,210,211,212,215,217,225
Process finished with exit code 0
So the pages to extract are:
1,7,8,13,14,21,22,24,32,34,38,40,45,49,53,54,57,58,59,67,73,75,81,84,86,88,93,97,103,104,107,110,112,116,117,119,123,124,125,133,139,145,149,150,154,157,163,165,171,176,181,182,183,190,191,195,196,199,210,211,212,215,217,225
Use WPS to extract these pages from the PDF:
This gives the deduplicated PDF:
This is much easier to read.
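The selection rule used by the script above can be isolated as a small pure function, which makes it easy to check on a short list without opening a PDF. This is only a minimal sketch: the function name `last_page_per_slide` is hypothetical and not part of the original script, and it assumes every page already has a slide number.

```python
def last_page_per_slide(slide_numbers):
    """Return the 1-based PDF page numbers of the last page of each slide.

    slide_numbers[i] is the slide number printed on PDF page i + 1.
    (Hypothetical helper, written for illustration only.)
    """
    keep = []
    for index in range(1, len(slide_numbers)):
        # A change in the printed slide number means the previous PDF page
        # was the final animation step of its slide.
        if slide_numbers[index] != slide_numbers[index - 1]:
            # 0-based index of this page == 1-based number of the previous page
            keep.append(index)
    if slide_numbers:
        # The very last page of the PDF is always kept.
        keep.append(len(slide_numbers))
    return keep


# First 14 entries of the list printed by the script above
print(last_page_per_slide([1, 2, 2, 2, 2, 2, 2, 3, 4, 4, 4, 4, 4, 5]))
# [1, 7, 8, 13, 14]
```

The first four values match the start of the page list printed above (1, 7, 8, 13); the final 14 appears only because this shortened sample ends there.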
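Version 2.0:
Instead of relying on the printed slide number, this version compares the content of consecutive pages directly, so it works on a wider range of PDFs.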
# -*- coding: utf-8 -*-
# @File: getRealPage2.py
# @Author: 和谐号
# @Software: PyCharm
# @CreationTime: 2023-09-24 17:47
# @OverviewDescription:
# Before use, install the pdfplumber package from a terminal:
# pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# pip install pdfplumber
import pdfplumber
pdf_path = r"C:\Users\和谐号\Desktop\Stochastic-processes-I.pdf"
previous_page = []
current_page = []
with pdfplumber.open(pdf_path) as pdf:
    for page in pdf.pages:
        # Text lines of this page; fall back to "" if no text is extracted
        current_page = (page.extract_text() or "").split('\n')
        for item in previous_page:
            # A line of the previous page is missing here, so a new slide has
            # started and the previous page was the complete one
            if item not in current_page:
                print(page.page_number - 1, end=",")
                break
        previous_page = current_page
        # print()
        # print(page)
        # print(page_text)
    # The last page is always kept
    print(len(pdf.pages))
Output:
D:\Anaconda3\python.exe C:\Users\和谐号\PycharmProjects\pythonProject\2023-09-18-pdf页面提取\getRealPage2.py
1,8,17,26,31,40,45,51,58,62,68,70,77,82,87,88,89,92,107,111,121,126,130,135,141,142,144,152,155,158,162,166,168,174,176,179,184,188,192,193,197,199,203,205,206,207,208,210,214,218,220,225,226,227,233,236,239,242,246,248,251,252,253,254,256,258,262,266,271,273,276,284,292,299,304,305,306,309,312,317,325,330,332,334,342,345,349,353,359,364,366,367,372,374,379,382,385,388,389,392,393,396,399,400,404,407,410,414,417,422,426,429,434,439,443,449,453,459,463,467,471,472,475,477,483,487,491,495,496,498,514,521,524
Process finished with exit code 0
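The content-based rule of version 2.0 can likewise be tried on plain lists of text lines. The sketch below is an illustration under stated assumptions, not the author's code: the name `last_pages_by_content` and the sample pages are made up, and each page is assumed to be already split into lines.

```python
def last_pages_by_content(pages):
    """Return 1-based numbers of pages judged to be complete slides.

    pages is a list of pages, each given as a list of text lines.
    (Hypothetical helper, written for illustration only.)
    """
    keep = []
    for index in range(1, len(pages)):
        current = set(pages[index])
        # If any line of the previous page has disappeared, a new slide has
        # started, so the previous page was the complete one.
        if any(line not in current for line in pages[index - 1]):
            keep.append(index)  # 1-based number of the previous page
    if pages:
        keep.append(len(pages))  # the last page is always kept
    return keep


# Made-up sample: two slides, each revealed in two animation steps
sample_pages = [
    ["Title", "point A"],
    ["Title", "point A", "point B"],
    ["Next", "x"],
    ["Next", "x", "y"],
]
print(last_pages_by_content(sample_pages))
# [2, 4]
```

One consequence of this rule is that an animation step which removes or rewrites a line, rather than only adding lines, is treated as the start of a new slide, so a few extra pages may be kept in such decks.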
Version 3.0:
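Version 3.0 of the page-deduplication Python program. What's new: 1. PDF generation has been added, so deduplication is now essentially fully automatic: running the program writes the deduplicated PDF to the same folder as the input file. 2. Usage instructions have been added at the top of the code.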
# -*- coding: utf-8 -*-
# @File: getRealPageNumbers(v3.0).py
# @Author: 和谐号
# @Software: PyCharm
# @CreationTime: 2023-10-17 15:31
# @OverviewDescription:
# Usage
# 1. Install the pdfplumber and PyPDF2 packages from a terminal (or cmd) with:
# pip config set global.index-url https://pypi.tuna.tsinghua.edu.cn/simple
# pip install pdfplumber
# pip install PyPDF2
# 2. Change pdf_path in this file to the path of the file you want to process (only the text inside the quotes)
# 3. Run the program; the PDF with the extracted pages is written to the same folder
import os
import pdfplumber
import PyPDF2
pdf_path = r"C:\Users\和谐号\Desktop\ch2Lebesgue测度 w7.pdf"
pdf_name = os.path.basename(pdf_path)
pdf_name_without_extension = os.path.splitext(pdf_name)[0]
file_directory = os.path.dirname(pdf_path)
# The suffix "去重版" means "deduplicated edition"
output_path = os.path.join(file_directory, pdf_name_without_extension + '(去重版).pdf')
pdf_writer = PyPDF2.PdfWriter()
previous_page = []
current_page = []
# Keep the source file open until the writer has finished writing the copied pages
with open(pdf_path, 'rb') as pdf_file, pdfplumber.open(pdf_path) as pdf:
    pdf_reader = PyPDF2.PdfReader(pdf_file)
    for page in pdf.pages:
        # Text lines of this page; fall back to "" if no text is extracted
        current_page = (page.extract_text() or "").split('\n')
        for item in previous_page:
            if item not in current_page:
                # page.page_number is 1-based, so the previous page has 0-based index page_number - 2
                pdf_writer.add_page(pdf_reader.pages[page.page_number - 2])
                print(page.page_number - 1, end=",")
                break
        previous_page = current_page
        # print()
        # print(page)
        # print(page_text)
    # The last page is always kept
    pdf_writer.add_page(pdf_reader.pages[len(pdf.pages) - 1])
    print(len(pdf.pages))
    # The with statement closes the output file, so no explicit close() is needed
    with open(output_path, 'wb') as output_pdf:
        pdf_writer.write(output_pdf)