Calling Python Scripts from a Java Project in IDEA to Convert docx and pdf to Markdown

I. Configuring Python in IDEA

1. Install the Python plugin

2. Configure the Python interpreter

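3. Add the required packages (the scripts below import mammoth, markdownify, tqdm, and fitz, which is provided by the PyMuPDF package)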

II. Writing the Python Scripts

1. Create a Python file

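2. Write the Python code

        1. docx2markdown

                The sys module passes in the input arguments (sys.argv), and print() writes the script's final result to standard output, where Java reads it.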

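# -*- coding: utf-8 -*-
import sys
import uuid

import mammoth
from markdownify import markdownify


def convert_img(image):
    # Save each embedded image to disk and return the attributes for its <img> tag.
    # The D://digital/img directory must already exist.
    with image.open() as image_bytes:
        file_suffix = image.content_type.split("/")[1]
        # A UUID avoids the name collisions that a timestamp-based name can produce
        path_file = "D://digital/img/{}.{}".format(uuid.uuid4().hex, file_suffix)
        with open(path_file, "wb") as f:
            f.write(image_bytes.read())
    return {"src": path_file}


def docx2md(docx_path):
    with open(docx_path, "rb") as docx_file:
        # Convert the Word document to HTML, saving images through convert_img
        result = mammoth.convert_to_html(docx_file, convert_image=mammoth.images.img_element(convert_img))
    # Get the HTML content
    html = result.value
    # Convert the HTML to Markdown
    md = markdownify(html, heading_style="ATX")
    # Standard output is what the Java caller reads back
    print(md)
    # Also keep the intermediate HTML and the final Markdown on disk
    with open("D://digital/test.html", "w", encoding="utf-8") as html_file, \
            open("D://digital/test.md", "w", encoding="utf-8") as md_file:
        html_file.write(html)
        md_file.write(md)


if __name__ == "__main__":
    docx2md(sys.argv[1])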
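
The argument-passing convention described above (sys.argv in, print() out) can be tried in isolation before mammoth is involved. The following is only a minimal sketch, not the author's script: `handle`, `emit`, and the demo command line are hypothetical names chosen for illustration, and it assumes the caller passes exactly one file path.

```python
import sys


def handle(argv):
    """Return (exit_code, message) for a command line shaped like [script, input_path]."""
    # argv[0] is the script name, so the first real argument is argv[1]
    if len(argv) != 2:
        return 2, "usage: <script> <input-file>"
    return 0, "received: {}".format(argv[1])


def emit(code, message):
    # Results go to stdout, which the calling process reads;
    # diagnostics go to stderr so they stay out of that result.
    print(message, file=sys.stdout if code == 0 else sys.stderr)
    return code


# Demonstration with a hypothetical command line; a real script would end with
# sys.exit(emit(*handle(sys.argv))) instead.
emit(*handle(["demo.py", "D:/digital/in.docx"]))
```

Keeping diagnostics on stderr matters here because the Java side treats everything on stdout as the conversion result.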

        2. pdf2markdown

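# -*- coding: utf-8 -*-
import sys

import fitz  # provided by the PyMuPDF package
from markdownify import markdownify
from tqdm import tqdm


def pdf2md(pdf_path):
    doc = fitz.open(pdf_path)
    # Wrap the per-page HTML fragments in a single HTML document
    html_content = "<html><head><title>Title</title></head><body>"
    # tqdm draws its progress bar on stderr, so it does not mix into stdout
    for page in tqdm(doc):
        html_content += page.get_text('html')
    html_content += "</body></html>"
    # Convert the HTML to Markdown
    md = markdownify(html_content, heading_style="ATX")

    # Also keep a copy of the Markdown on disk
    with open("D://digital/SOLAS2014(中)救生设备节选1.md", "w", encoding='utf-8') as md_file:
        md_file.write(md)
    return md


if __name__ == "__main__":
    # Only the Markdown goes to stdout, because that is what Java reads back
    print(pdf2md(sys.argv[1]))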
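
The pdf2md script above always writes to one fixed file name, so converting a second PDF overwrites the first result. One possible variation, offered as an assumption rather than as the author's approach, derives the .md path from the input path. `output_path_for` and the example directory are hypothetical; `PureWindowsPath` is used because the paths in this article are Windows paths.

```python
from pathlib import PureWindowsPath


def output_path_for(input_path, out_dir):
    """Build '<out_dir>/<input file name without its extension>.md'."""
    # stem is the file name with its last suffix removed
    stem = PureWindowsPath(input_path).stem
    # as_posix() renders the result with forward slashes
    return (PureWindowsPath(out_dir) / (stem + ".md")).as_posix()


print(output_path_for("C:/tmp/temp123.pdf", "D:/digital"))
```

Because the Java caller passes a temporary file path, the derived name would follow the temporary file's name rather than the uploaded file's original name.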

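III. Calling the Script from Java

        Decide which parameters to pass: the interpreter, the script path, and the arguments the script needs.

        Run the script with Runtime.getRuntime().exec(params).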

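public static String pdf2md(MultipartFile file) throws IOException {
    // Save the upload to a temporary file so the script can read it from disk
    File tempFile = File.createTempFile("temp", ".pdf");
    String pdfPath = tempFile.getAbsolutePath();
    System.out.println("Temporary file path: " + pdfPath);
    file.transferTo(tempFile);
    StringBuilder result = new StringBuilder();
    try {
        // Command line: interpreter, script path, then the script's argument
        String[] params = new String[] { "python", "src/main/java/util/python/pdf2md.py", pdfPath };
        Process proc = Runtime.getRuntime().exec(params); // run the .py file
        // Only stdout is read; a script that writes a lot to stderr can fill that pipe and block.
        // The charset must match the encoding Python uses for stdout.
        try (BufferedReader in = new BufferedReader(new InputStreamReader(proc.getInputStream(), "gb2312"))) {
            String line;
            while ((line = in.readLine()) != null) {
                // readLine() strips the line terminator; restore it so the Markdown keeps its line breaks
                result.append(line).append('\n');
            }
        }
        proc.waitFor();
    } catch (IOException | InterruptedException e) {
        e.printStackTrace();
    } finally {
        // Remove the temporary file as soon as the conversion is done
        tempFile.delete();
    }
    return result.toString();
}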
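
The Java reader decodes the script's output as gb2312, which works only while Python happens to encode stdout the same way. A possible alternative, sketched on the Python side, is to have the script write UTF-8 explicitly so that Java can decode with "UTF-8" instead. This is an illustration under that assumption, not part of the original scripts; `utf8_text_stream` is a hypothetical helper, and the in-memory buffer stands in for `sys.stdout.buffer`.

```python
import io


def utf8_text_stream(binary_stream):
    """Wrap a binary stream so that text written to it is encoded as UTF-8."""
    # newline="\n" keeps line endings as written instead of translating them
    return io.TextIOWrapper(binary_stream, encoding="utf-8", newline="\n")


# Demonstration against an in-memory buffer; in a script the argument
# would be sys.stdout.buffer.
buf = io.BytesIO()
out = utf8_text_stream(buf)
print("救生设备", file=out)
out.flush()
print(buf.getvalue())
```

On Python 3.7 or later, `sys.stdout.reconfigure(encoding="utf-8")` at the top of the script is a shorter way to get the same effect.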
