Processing Emails with Regular Expressions in Python

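Regular expressions extract text that matches a defined pattern, which makes them useful in web scraping and other text-extraction tasks. As an example, this post uses regular expressions to process a single txt file made up of several thousand merged emails.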

Corpus: https://www.kaggle.com/rtatman/fraudulent-email-corpus

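import re
import pandas as pd
import email

# Read the whole corpus into one string; bytes that cannot be decoded are skipped.
# The with block closes the file automatically.
with open(r'C:\Users\Yao\Desktop\kaggle\test_emails.txt', 'r', encoding="gb18030", errors='ignore') as f:
    fh = f.read()  # fh holds the file's text, not a file handle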

The code above opens the file and reads its entire contents into a string.

emails = []

contents = re.split(r'From r', fh)

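Inspecting the txt file shows that each email begins with the fixed pattern From r, so splitting on that pattern yields 3977 individual emails. One of them is shown below: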

contents[0]
Out[11]: '  Wed Oct 30 21:41:56 2002\nReturn-Path: \nX-Sieve: cmu-sieve 2.0\nReturn-Path: \nMessage-Id: <[email protected]>\nFrom: "MR. JAMES NGOLA." \nReply-To: [email protected]\nTo: [email protected]\nDate: Thu, 31 Oct 2002 02:38:20 +0000\nSubject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP\nX-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM\nMIME-Version: 1.0\nContent-Type: text/plain; charset="us-ascii"\nContent-Transfer-Encoding: 8bit\nX-MIME-Autoconverted: from quoted-printable to 8bit by sideshowmel.si.UM id g9V2foW24311\nStatus: O\n\nFROM:MR. JAMES NGOLA.\nCONFIDENTIAL TEL: 233-27-587908.\nE-MAIL: ([email protected]).\n\nURGENT BUSINESS ASSISTANCE AND PARTNERSHIP.\n\n\nDEAR FRIEND,\n\nI AM ( DR.) JAMES NGOLA, THE PERSONAL ASSISTANCE TO THE LATE CONGOLESE (PRESIDENT LAURENT KABILA) WHO WAS ASSASSINATED BY HIS BODY GUARD ON 16TH JAN. 2001.\n\n\nTHE INCIDENT OCCURRED IN OUR PRESENCE WHILE WE WERE HOLDING MEETING WITH HIS EXCELLENCY OVER THE FINANCIAL RETURNS FROM THE DIAMOND SALES IN THE AREAS CONTROLLED BY (D.R.C.) DEMOCRATIC REPUBLIC OF CONGO FORCES AND THEIR FOREIGN ALLIES ANGOLA AND ZIMBABWE, HAVING RECEIVED THE PREVIOUS DAY (USD$100M) ONE HUNDRED MILLION UNITED STATES DOLLARS, CASH IN THREE DIPLOMATIC BOXES ROUTED THROUGH ZIMBABWE.\n\nMY PURPOSE OF WRITING YOU THIS LETTER IS TO SOLICIT FOR YOUR ASSISTANCE AS TO BE A COVER TO THE FUND AND ALSO COLLABORATION IN MOVING THE SAID FUND INTO YOUR BANK ACCOUNT THE SUM OF (USD$25M) TWENTY FIVE MILLION UNITED STATES DOLLARS ONLY, WHICH I DEPOSITED WITH A SECURITY COMPANY IN GHANA, IN A DIPLOMATIC BOX AS GOLDS WORTH (USD$25M) TWENTY FIVE MILLION UNITED STATES DOLLARS ONLY FOR SAFE KEEPING IN A SECURITY VAULT FOR ANY FURTHER INVESTMENT PERHAPS IN YOUR COUNTRY. 
\n\nYOU WERE INTRODUCED TO ME BY A RELIABLE FRIEND OF MINE WHO IS A TRAVELLER,AND ALSO A MEMBER OF CHAMBER OF COMMERCE AS A RELIABLE AND TRUSTWORTHY PERSON WHOM I CAN RELY ON AS FOREIGN PARTNER, EVEN THOUGH THE NATURE OF THE TRANSACTION WAS NOT REVEALED TO HIM FOR SECURITY REASONS.\n\n\nTHE (USD$25M) WAS PART OF A PROCEEDS FROM DIAMOND TRADE MEANT FOR THE LATE PRESIDENT LAURENT KABILA WHICH WAS DELIVERED THROUGH ZIMBABWE IN DIPLOMATIC BOXES. THE BOXES WERE KEPT UNDER MY CUSTODY BEFORE THE SAD EVENT THAT TOOK THE LIFE OF (MR. PRESIDENT).THE CONFUSION THAT ENSUED AFTER THE ASSASSINATION AND THE SPORADIC SHOOTING AMONG THE FACTIONS, I HAVE TO RUN AWAY FROM THE COUNTRY FOR MY DEAR LIFE AS I AM NOT A SOLDIER BUT A CIVIL SERVANT I CROSSED RIVER CONGO TO OTHER SIDE OF CONGO LIBREVILLE FROM THERE I MOVED TO THE THIRD COUNTRY GHANA WHERE I AM PRESENTLY TAKING REFUGE. \n\nAS A MATTER OF FACT, WHAT I URGENTLY NEEDED FROM YOU IS YOUR ASSISTANCE IN MOVING THIS MONEY INTO YOUR ACCOUNT IN YOUR COUNTRY FOR INVESTMENT WITHOUT RAISING EYEBROW. FOR YOUR ASSISTANCE I WILL GIVE YOU 20% OF THE TOTAL SUM AS YOUR OWN SHARE WHEN THE MONEY GETS TO YOUR ACCOUNT, WHILE 75% WILL BE FOR ME, OF WHICH WITH YOUR KIND ADVICE I HOPE TO INVEST IN PROFITABLE VENTURE IN YOUR COUNTRY IN OTHER TO SETTLE DOWN FOR MEANINGFUL LIFE, AS I AM TIRED OF LIVING IN A WAR ENVIRONMENT. \n\nTHE REMAINING 5% WILL BE USED TO OFFSET ANY COST INCURRED IN THE CAUSE OF MOVING THE MONEY TO YOUR ACCOUNT. IF THE PROPOSAL IS ACCEPTABLE TO YOU PLEASE CONTACT ME IMMEDIATELY THROUGH THE ABOVE TELEPHONE AND E-MAIL, TO ENABLE ME ARRANGE FACE TO FACE MEETING WITH YOU IN GHANA FOR THE CLEARANCE OF THE FUNDS BEFORE TRANSFRING IT TO YOUR BANK ACCOUNT AS SEEING IS BELIEVING. \n\nFINALLY, IT IS IMPORTANT ALSO THAT I LET YOU UNDERSTAND THAT THERE IS NO RISK INVOLVED WHATSOEVER AS THE MONEY HAD NO RECORD IN KINSHASA FOR IT WAS MEANT FOR THE PERSONAL USE OF (MR. 
PRESIDEND ) BEFORE THE NEFARIOUS INCIDENT OCCURRED, AND ALSO I HAVE ALL THE NECESSARY DOCUMENTS AS REGARDS TO THE FUNDS INCLUDING THE (CERTIFICATE OF DEPOSIT), AS I AM THE DEPOSITOR OF THE CONSIGNMENT.\n\n\nLOOKING FORWARD TO YOUR URGENT RESPONSE.\n\nYOUR SINCERELY,\n\nMR. JAMES NGOLA. \n\n\n\n\n\n\n\n\n\n\n'
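
The splitting step can be tried on a tiny string first. The sketch below uses a made-up two-message sample (the sample variable is hypothetical, not taken from the corpus) to show what re.split returns. When the text begins with the separator, the first element is an empty string, which is likely why the first element is discarded with contents.pop(0).

```python
import re

# Hypothetical miniature corpus: two messages, each starting with "From r".
sample = (
    "From r  Wed Oct 30 21:41:56 2002\nSubject: first\n\nbody one\n"
    "From r  Thu Oct 31 08:11:39 2002\nSubject: second\n\nbody two\n"
)

parts = re.split(r'From r', sample)

# The text before the first separator is empty, so parts[0] is ''.
print(len(parts))
print(repr(parts[0]))
print(repr(parts[1][:26]))
```

The separator itself is not kept in the pieces, so each email starts with the two spaces and the timestamp that followed From r.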


contents.pop(0)
for item in contents:
    emails_dict = {}
    sender = re.search(r'From:.*', item)

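This pattern matches text that starts with From: and runs to the end of the line, because . does not match a newline. In the email above the match is From: "MR. JAMES NGOLA." ; a newline \n follows, so the match ends there.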
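
A minimal sketch of why the match stops at the end of the line: by default . matches any character except a newline. The header text here is a shortened, made-up sample, and the re.DOTALL variant is included only for contrast.

```python
import re

# Shortened, made-up header block for illustration.
header = 'From: "MR. JAMES NGOLA." \nReply-To: someone@example.com\n'

match = re.search(r'From:.*', header)
line = match.group()  # the matched text, up to but not including '\n'

# With re.DOTALL, '.' also matches '\n', so the match runs to the end of the string.
greedy = re.search(r'From:.*', header, flags=re.DOTALL).group()

print(repr(line))
print(repr(greedy))
```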


    if sender is not None:
        s_email = re.search(r'\w\S*@.*\w', sender.group())  # group() returns the entire matched string
        s_name = re.search(r':.*<',sender.group())
    else:
        s_email = None
        s_name = None
    if s_email is not None:
        sender_email = s_email.group()
    else:
        sender_email = None
    emails_dict['sender_email'] = sender_email
    if s_name is not None:
        sender_name = re.sub(r'\s*<', '', re.sub(r':\s*', '', s_name.group()))  # the match still includes the : and < characters, so strip them
    else:
        sender_name = None
    emails_dict['sender_name'] = sender_name
    recipient = re.search(r'To:.*',item)
    if recipient is not None:
        r_email = re.search(r'\w\S*@.*\w', recipient.group())
        r_name = re.search(r':.*<', recipient.group())
    else:
        r_email = None
        r_name = None
    if r_email is not None:
        recipient_email = r_email.group()
    else:
        recipient_email = None
    emails_dict['recipient_email'] = recipient_email
    if r_name is not None:
        recipient_name = re.sub(r'\s*<', '', re.sub(r':\s*', '', r_name.group()))  # raw strings avoid invalid-escape warnings for \s
    else:
        recipient_name = None
    emails_dict['recipient_name'] = recipient_name
    date_field = re.search(r'Date:.*',item)
    if date_field is not None:
        date = re.search(r'\d+\s\w+\s\d+', date_field.group())  # + matches one or more of the preceding pattern, so this extracts a date such as '31 Oct 2002'
    else:
        date = None
    if date is not None:
        date_sent = date.group()
    else:
        date_sent = None
    
    emails_dict['date_sent'] = date_sent
    subject_field = re.search(r'Subject:.*',item)
    if subject_field is not None:
        subject = re.sub(r'Subject:\s*', '', subject_field.group())  # drop the field name and the whitespace after it
    else:
        subject = None
    emails_dict['subject'] = subject    
    full_email =  email.message_from_string(item)
    body = full_email.get_payload()
    emails_dict['email_body'] = body
    emails.append(emails_dict)
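
The date and body steps of the loop can be checked on a small message. This is a sketch on a made-up string (raw is hypothetical); it assumes a non-multipart message, for which get_payload() returns the text after the first blank line as a str. The date pattern is the one used in the loop.

```python
import email
import re

# Made-up message for illustration; the headers end at the first blank line.
raw = (
    "From: \"Jane Doe\" <jane.doe@example.com>\n"
    "Date: Thu, 31 Oct 2002 02:38:20 +0000\n"
    "Subject: Hello\n"
    "\n"
    "First line of the body.\n"
    "Second line.\n"
)

msg = email.message_from_string(raw)
body = msg.get_payload()  # a str here, because the message is not multipart

# \d+\s\w+\s\d+ : digits, one whitespace character, a word, one whitespace character, digits.
date_match = re.search(r'\d+\s\w+\s\d+', msg['Date'])
date_sent = date_match.group() if date_match else None

print(date_sent)
print(repr(body))
```

For multipart messages get_payload() returns a list of sub-messages instead of a string, so real-world data may need an extra check with is_multipart().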

  
# Finally, convert the list of dicts into a DataFrame
emails_df = pd.DataFrame(emails)
    
emails_df.head(n=3)
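
The sender and recipient branches in the loop apply the same two patterns, so they could share one helper. The following is a minimal sketch, not the original author's code: split_address_line is a hypothetical function name and the header lines are made up, while the regular expressions are the ones used in the loop.

```python
import re

def split_address_line(line):
    """Return (name, address) from a header line such as 'From: ...'.

    Either value is None when its pattern does not match.
    split_address_line is a hypothetical helper name, not from the original code.
    """
    if line is None:
        return None, None
    email_match = re.search(r'\w\S*@.*\w', line)
    name_match = re.search(r':.*<', line)
    address = email_match.group() if email_match else None
    if name_match:
        # Strip the leading ':' and the trailing '<' that the pattern captured.
        name = re.sub(r'\s*<', '', re.sub(r':\s*', '', name_match.group()))
    else:
        name = None
    return name, address

# Made-up header line for illustration.
name, address = split_address_line('From: "Jane Doe" <jane.doe@example.com>')
print(name, address)
```

Inside the loop, the helper could be called with sender.group() or recipient.group() when the field was found, and with None otherwise. Note that the name keeps its surrounding quotes, just as in the original code.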

