Moving Word content into Excel with pandas + VBA: one experience

  • Project goal
  • The process

Project goal

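The goal of this project was to extract the list content from a Word document and move it into Excel so that it could be imported into a system. I had not used any Python package for processing Word and was not familiar with that area, so I used VBA to format the content inside Word first and then brought it into Excel.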

The process

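  1. Format the data in Word with VBA.
  2. The import expects HTML. I could not find a way to convert the content directly, so I used VBA to wrap each paragraph in <p> and </p> tags, and to wrap bold text in bold tags: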

    Dim pg As Paragraph, r As Range
    Dim startPositon, endPosition
    Dim reg As Object
    Set reg = CreateObject("VBScript.Regexp")
    ' Third-level heading numbers such as "1.2.3." (dots escaped so they match literal periods)
    reg.Pattern = "^[1-9]\d*\.[1-9]\d*\.[1-9]\d*\."
    For Each pg In ActiveDocument.Paragraphs
        Dim prev As Boolean: prev = False
        Dim title, changetitle As String
        Dim is_exist As Boolean
        Set r = pg.Range
        ' Skip empty paragraphs (only a paragraph mark)
        If Len(r.Text) > 1 Then
            If r.ListFormat.ListString <> "" Then
                title = r.ListFormat.ListString
                is_exist = reg.Test(title)
                If is_exist Then
                    ' Numbered headings are left untouched
                Else
                    ' Other list paragraphs become list items
                    With r
                        .SetRange r.Start, r.End - 1
                        .InsertAfter ("</li>")
                        .InsertBefore ("<li>")
                    End With
                End If
            Else
                ' Plain paragraphs are wrapped in paragraph tags
                With r
                    .SetRange r.Start, r.End - 1
                    .InsertAfter ("</p>")
                    .InsertBefore ("<p>")
                End With
            End If
        End If
    Next

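  3. The paragraph marks are removed in a later step, so first add a special marker around the paragraphs whose paragraph marks must be kept:
    Dim pg As Paragraph, r As Range
    Dim reg As Object
    Set reg = CreateObject("VBScript.Regexp")
    ' Third-level heading numbers such as "1.2.3." (dots escaped so they match literal periods)
    reg.Pattern = "^[1-9]\d*\.[1-9]\d*\.[1-9]\d*\."
    For Each pg In ActiveDocument.Paragraphs
        Dim title, changetitle As String
        Dim is_exist As Boolean
        Set r = pg.Range
        title = r.ListFormat.ListString
        is_exist = reg.Test(title)
        If is_exist Then
            ' Surround the heading text (without its paragraph mark) with &&&&
            With r
                .SetRange r.Start, r.End - 1
                .InsertAfter ("&&&&")
                .InsertBefore ("&&&&")
            End With
        End If
    Next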
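
To see why the markers are needed, the marker step above and the find-and-replace pass that follows it can be simulated on plain strings. This is only a sketch in Python, not what Word does internally: the function name `flatten_with_markers` and the sample paragraphs are hypothetical, and each Word paragraph is modeled as a (list number, text) pair.

```python
import re

MARKER = "&&&&"
# Three dot-terminated levels, e.g. "1.2.3." (the heading numbers the VBA targets).
HEADING_NUMBER = re.compile(r"^[1-9]\d*\.[1-9]\d*\.[1-9]\d*\.")


def flatten_with_markers(paragraphs):
    """paragraphs: list of (list_string, text) pairs, one per paragraph."""
    marked = []
    for list_string, text in paragraphs:
        if HEADING_NUMBER.match(list_string):
            # Headings are wrapped in markers so their boundaries survive.
            marked.append(MARKER + text + MARKER)
        else:
            marked.append(text)
    # Removing every paragraph mark joins all paragraphs into one run of text.
    flat = "".join(marked)
    # Turning the markers back into line breaks isolates each heading.
    # The marker before the first heading leaves an empty first line, dropped here.
    return flat.replace(MARKER, "\n").lstrip("\n")


sample = [
    ("1.1.1.", "Title A"),
    ("", "Body one. "),
    ("", "Body two."),
    ("1.1.2.", "Title B"),
    ("", "Body three."),
]
print(flatten_with_markers(sample))
```

The result alternates between a heading line and a single line holding everything that followed it, which is the layout the later pandas step relies on.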
    
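  4. In Word's Find and Replace, first replace ^p with nothing to remove every paragraph mark, then replace &&&& with ^p to turn the markers back into paragraph marks. Each marked heading then sits on its own line, followed by its content on the next line.
  5. The Word document contains figures and tables, so copy everything into a .txt file, remove the special content, and save the .txt file.
  6. Use pandas in Python to write the data to Excel: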
    import re

    import pandas as pd

    # Leading heading numbers such as "1.2.3."
    pattern = re.compile(r"^\d+\.\d+\.\d+\.")

    # Read the import template only for its column layout, then drop its rows.
    template = pd.read_excel('./gnxz_import.xlsx', sheet_name='sheet1')
    template = template.iloc[0:0]

    rows = []
    title = ""
    with open("./import_all.txt", 'r', encoding='utf-8') as import_file:
        for count, line in enumerate(import_file, start=1):
            line = line.strip('\n')
            m = pattern.match(line)
            if m is not None:
                line = line[m.end():]  # strip the leading heading number
            if count % 2 == 0:
                # Even lines hold the body; pair it with the title from the line before.
                # '文本标题' = text title, '正文内容' = body content
                rows.append({'文本标题': title, '正文内容': line})
            else:
                title = line

    xls_file = pd.concat([template, pd.DataFrame(rows)], ignore_index=True)
    print(xls_file)
    xls_file.to_excel('./gnxz_i.xlsx', index=False)
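
The pairing logic in the script above can also be pulled out into a small function that needs no files, which makes it easier to check on a few sample lines. This is a minimal sketch using only the standard library; the name `pair_lines` and the sample lines are hypothetical, and it assumes the text file strictly alternates between a title line and a content line.

```python
import re

# Leading heading numbers such as "1.2.3."
HEADING_NUMBER = re.compile(r"^\d+\.\d+\.\d+\.")


def pair_lines(lines):
    """Turn alternating title/content lines into (title, content) tuples."""
    rows = []
    title = ""
    for count, line in enumerate(lines, start=1):
        line = line.rstrip("\n")
        match = HEADING_NUMBER.match(line)
        if match:
            # Drop the heading number, keep the rest of the line.
            line = line[match.end():]
        if count % 2 == 0:
            # Even lines are content; pair with the title read just before.
            rows.append((title, line))
        else:
            title = line
    return rows


sample_lines = [
    "1.2.3.Login\n",
    "<p>Enter the password.</p>\n",
    "1.2.4.Logout\n",
    "<p>Click exit.</p>\n",
]
for title, content in pair_lines(sample_lines):
    print(title, "|", content)
```

The returned tuples can be passed to `pd.DataFrame(rows, columns=['文本标题', '正文内容'])` to build the table in one step. A trailing title with no content line is silently dropped, so it is worth checking that the file has an even number of lines.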
    
