Some notes on parsing email content with the JavaMail API; a well-written article.
Original article: http://techforum4u.com/content.php/177-HOW-TO-PARSE-AN-EMAIL-USING-THE-JAVAMAIL-API
======================Main Content=======================
By Pankaj
AIM
The aim of this document is to elaborate on how to retrieve an email from the mail server and parse it using the JavaMail API.
PREFACE
The first section of the article explains the architecture used for mail parsing. It describes how a request to parse a mail is initiated after the mail is sent from the external world to the SMTP server used by the application. This is followed by an explanation of how to parse the mail file using the JavaMail API. The practical complications that were encountered, the known issues and their solutions are also detailed below.
ARCHITECTURE
The architecture that was used in our project is briefly described below.
The architecture explained here can be used to receive mails from the external world, parse them, archive them, and deliver them to the recipients. Note that this approach applies when the database design is such that only one copy of the content is archived along with the metadata, irrespective of the number of recipients.
DOMAIN NAME CONFIGURATION FOR SMTP SERVER
Mails sent from the external world are relayed and delivered to the SMTP server based on the domain name that is configured. Hence, when the SMTP server is configured, it must be given the list of domain names that the end users will use in their email addresses. Once this is done, mails sent to those configured domain names will be delivered to the SMTP server.
CREATE REQUEST FOR MAIL PARSING
The mails received by the mail server will be available in the email pool.
When the mail server receives a mail, it sends an HTTP request (based on the domain) to the listener with the name of the mail file and the resent flag as parameters.
The listener is a servlet that picks up the file and sends it for further processing. Also, if a mail lies in the email pool beyond a certain time limit, a failover script should be used to send an HTTP request to the listener with the name of the mail file and the resent flag as parameters.
The resent flag indicates whether the mail is being sent for the first time or not. If the mail is sent for the first time from the email server to the HTTP listener, the resent flag is N; otherwise it is Y.
On receiving the request, the listener sends a request to the mail parser with the location, name of the mail file and the resent flag as parameters.
The file is then parsed by retrieving the MimeMessage object which contains information about the recipients, content, and attachments. This information is then stored to the database and delivered to the recipients.
USE OF JAVAMAIL API TO PARSE EMAIL FILES
This section will describe in detail the processes involved in parsing a mail file.
Almost every mail file that is received by the SMTP server has a header called the Message-Id. There may still be a strange case where a mail file will not have the Message-Id. Such mail files need to be ignored when this approach is used, as the message id plays a significant role in deciding the kind of processing required.
The message-id together with the timestamp of the email can be used as a unique identifier for a mail, which can have one or more recipients. Both values should therefore be stored in the database, as they come in handy when duplicates have to be avoided.
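The duplicate check described above can be sketched as follows. This is only an illustration under stated assumptions: the class name MailDeduplicator, the "|" separator in the key and the in-memory set (standing in for the database table) are hypothetical and not part of the original article. In a real parser the two values would come from the Message-Id header and the sent date of the MimeMessage.

```java
import java.time.Instant;
import java.util.HashSet;
import java.util.Set;

final class MailDeduplicator {
    // In-memory stand-in (assumption) for the database table that stores
    // the message-id / timestamp pairs of mails already processed.
    private final Set<String> seenKeys = new HashSet<>();

    // Builds the combined key from the message-id and the mail timestamp.
    static String keyFor(String messageId, Instant sentDate) {
        return messageId.trim() + "|" + sentDate.toEpochMilli();
    }

    // Returns true only the first time a (message-id, timestamp) pair is seen.
    // Mails without a message-id (or without a date, which is an assumption
    // of this sketch) are reported as "not new" so that the caller ignores them.
    boolean registerIfNew(String messageId, Instant sentDate) {
        if (messageId == null || messageId.trim().isEmpty() || sentDate == null) {
            return false;
        }
        return seenKeys.add(keyFor(messageId, sentDate));
    }
}
```

A caller would invoke registerIfNew once per incoming mail file and continue with full parsing only when it returns true.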
RECIPIENTS
To retrieve the information about the recipients, use the getRecipients(Message.RecipientType type) method of MimeMessage.
The mapping between the type and the corresponding RFC 822 header is as follows:
Message.RecipientType.TO - "To"
Message.RecipientType.CC - "Cc"
Message.RecipientType.BCC - "Bcc"
MimeMessage.RecipientType.NEWSGROUPS - "Newsgroups"
An external server may send out one mail file containing all the recipients of the mail, or an individual mail file for each recipient. If the server sends a single mail file for all recipients and there are multiple BCC recipients, the file alone does not identify them, so accurate delivery to all intended recipients cannot be guaranteed.
X-ENVELOPE-TO HEADER
• A header called X-Envelope-To was used to store the information about the recipient. This header can store only one value at a time.
• For each recipient of the mail, an individual mail file will be sent to the listener with one of the recipients in the X-Envelope-To header.
• All mail files will anyway have information about To and Cc recipients and it will not be required to process the files repeatedly for these two types of recipients.
• While processing the mail file, the program should check if the recipient in the X-Envelope-To header is also present in the To or Cc fields.
o If yes, the mail file can be ignored if it has been processed at least once before.
o If not, the recipient in the X-Envelope-To header will be considered as a BCC recipient.
• The unique combination of message id and email timestamp can be used to detect if the mail was processed earlier as these two values will be stored in the database.
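The decision rules in the list above can be sketched in plain Java. The names EnvelopeRecipientCheck, isBccRecipient and canSkip are hypothetical, and the sketch assumes that addresses are passed as bare strings (for example "user@example.com") and that comparing them case-insensitively is acceptable for the application.

```java
import java.util.List;
import java.util.Locale;

final class EnvelopeRecipientCheck {
    // A recipient named in X-Envelope-To but absent from To and Cc
    // is treated as a BCC recipient.
    static boolean isBccRecipient(String envelopeTo, List<String> toAndCc) {
        String target = normalize(envelopeTo);
        for (String address : toAndCc) {
            if (normalize(address).equals(target)) {
                return false;
            }
        }
        return true;
    }

    // A mail file can be ignored when its envelope recipient is already
    // visible in To/Cc and the mail was processed at least once before.
    static boolean canSkip(String envelopeTo, List<String> toAndCc, boolean alreadyProcessed) {
        return alreadyProcessed && !isBccRecipient(envelopeTo, toAndCc);
    }

    // Assumption of this sketch: addresses are compared ignoring case and
    // surrounding whitespace.
    private static String normalize(String address) {
        return address.trim().toLowerCase(Locale.ROOT);
    }
}
```

The alreadyProcessed flag would come from the lookup of message id and email timestamp in the database.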
AUTHOR
The information about the author can be obtained using the getFrom() method in the MimeMessage class.
The name of the author, if specified, can be retrieved using the getPersonal() method available in the InternetAddress class.
The diagram below is a pictorial representation of the process involved in the parsing of a mail file.
PARSE CONTENT
The content of a mail file can belong to any of a large number of MIME types. There is no fixed list of MIME types in use on the internet, and senders may use non-standard types. Hence parsing the content is largely a matter of trial and error.
This section will explain how parsing of certain known mime types can be handled:
The content of the mail can be retrieved using the getContent() method of the MimeMessage class. The content of a mail can be one of the following two instances:
• String
• MimeMultipart
(a) When the content is an instance of String, the following are the known possibilities of content.
(i) UUENCODED MAIL
Definition: A mail file is said to be uuencoded when it has an attachment that is preceded by the characters “begin 666” and ends with “`end”. The mail can also have content.
Parsing: To parse such a mail file, you have to take the substring of the content from the beginning till you encounter “begin 666”.
• Content: Depending on the content type, the content has to be treated as plain text or html content. This information has to be stored in order to preserve the formatting of the content, so that the target displays it exactly as it was at the source.
• Attachment: The attachment name is normally present after the header “begin 666” and is followed by a line break. If the attachment string does not end with the footer “`end”, append the footer to the string. The UUDecoder class has to be used to decode the string into a byte array. An input stream is then created from the byte array and used to create the attachment.
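The splitting step described above can be sketched without any mail library. The class UuencodedBody and its field names are hypothetical, and the sketch only separates the text from the uuencoded block and repairs a missing footer; decoding the block into bytes is left to a uudecoder and is not shown.

```java
final class UuencodedBody {
    static final String HEADER = "begin 666";

    final String text;      // content that precedes the uuencoded block
    final String fileName;  // attachment name taken from the header line, or null
    final String encoded;   // header, data lines and footer, or null

    private UuencodedBody(String text, String fileName, String encoded) {
        this.text = text;
        this.fileName = fileName;
        this.encoded = encoded;
    }

    static UuencodedBody split(String content) {
        int start = content.indexOf(HEADER);
        if (start < 0) {
            // No uuencoded block: the whole string is ordinary content.
            return new UuencodedBody(content, null, null);
        }
        String text = content.substring(0, start);

        // The attachment name follows "begin 666" on the same line.
        int lineEnd = content.indexOf('\n', start);
        if (lineEnd < 0) {
            lineEnd = content.length();
        }
        String fileName = content.substring(start + HEADER.length(), lineEnd).trim();

        // Strip trailing line breaks only; trailing spaces can be data.
        String encoded = content.substring(start).replaceAll("[\\r\\n]+$", "");
        if (!encoded.endsWith("\nend")) {
            // Footer missing: append the backtick line and "end".
            encoded = encoded + "\n`\nend";
        }
        return new UuencodedBody(text, fileName, encoded);
    }
}
```

The text field would then be stored as the content of the communication, and the encoded field passed on to the decoder.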
(ii) SIGNED MAIL
Definition: The content type of the mail file not only indicates whether the content is plain text or html and which charset it uses. It also indicates whether the mail is a signed mail. A signed mail carries a digital signature that lets the recipient verify who sent the mail and whether its content was altered on the way. It is used as a means of securing the email.
Parsing: When the content type contains the string “signed-data”, the mail is treated as a signed email. An email attachment can be created from the input stream of the mail file and delivered to the recipients.
(iii) UNKNOWN MIME TYPE
Parsing: When the mime type of a communication is not known, the content is treated as either plain text or html text based on the content type and stored in the database.
(b) When the content is an instance of MimeMultipart, the following are the known possibilities of content.
(i) multipart/mixed or multipart/related
When the mime type is “multipart/mixed” or “multipart/related”, the mail file contains a combination of two or more of the remaining mime types. Hence, it’s necessary to get the count of parts in the multipart content and parse them individually based on the mime type of each part.
(ii) RFC 822 headers (multipart/report)
Normally, delivery receipt mails have the mime type as multipart/report. These mails normally have an input stream with the mime type as “text/rfc822-headers”. The input stream should be converted into a string and stored with the content type as either text/plain or text/html.
(iii) text/*
The mime type text/* implies that the content can be either text/html or text/plain. A part with this mime type can also be an attachment. Whether it is one can be determined from whether a file name is defined for the part. If there is a file name, create an attachment from the input stream. If there is no file name, add the string as the content of the communication and set the content type to either text/plain or text/html.
(iv) multipart/alternative
• The mime type “multipart/alternative” implies that the communication carries the same content in two or more alternative mime types.
• Normally the content will be of both text/plain and text/html types.
• The html version is given more preference keeping the visualization of the mail in mind.
• Apart from these two types, the content can also belong to any one of the remaining mime types.
(v) multipart/*
This means that the content of the mail file has multiple parts but they are neither multipart/mixed nor multipart/related. Hence, we must retrieve each part of the multipart content and parse it individually based on the mime type of each part.
(vi) message/rfc822
The mails with this mime type are added as an attachment to the mail with the extension as “eml”.
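The handling of the mime types in (i) to (vi) can be summarized as a dispatch on the content type string of each part. The sketch below is an illustration only: the class PartDispatcher and the Handling values are hypothetical names, and the inputs are assumed to be the content type and file name strings that a parser would read from each part.

```java
import java.util.Locale;

final class PartDispatcher {
    enum Handling {
        RECURSE_INTO_PARTS,     // multipart/mixed, multipart/related, multipart/report, other multipart/*
        PICK_ALTERNATIVE,       // multipart/alternative: prefer html over plain text
        STORE_HEADERS_AS_TEXT,  // text/rfc822-headers inside delivery receipts
        EML_ATTACHMENT,         // message/rfc822: attach with the extension "eml"
        ATTACHMENT,             // any other part that carries a file name
        TEXT_BODY,              // text/plain or text/html without a file name
        UNKNOWN                 // everything else: fall back to trial and error
    }

    static Handling classify(String contentType, String fileName) {
        String type = baseType(contentType);
        if (type.equals("multipart/alternative")) return Handling.PICK_ALTERNATIVE;
        if (type.startsWith("multipart/")) return Handling.RECURSE_INTO_PARTS;
        if (type.equals("text/rfc822-headers")) return Handling.STORE_HEADERS_AS_TEXT;
        if (type.equals("message/rfc822")) return Handling.EML_ATTACHMENT;
        if (fileName != null && !fileName.isEmpty()) return Handling.ATTACHMENT;
        if (type.startsWith("text/")) return Handling.TEXT_BODY;
        return Handling.UNKNOWN;
    }

    // Drops parameters such as charset or boundary and lower-cases the type.
    static String baseType(String contentType) {
        if (contentType == null) return "";
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }
}
```

For RECURSE_INTO_PARTS the caller would loop over the parts of the multipart content and call classify again for each of them.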
(vii) Handle attachments & inline images
• When the mail has attachments, the part will have a file name or the content’s encoding will be “base64”, and the part will be an instance of MimeBodyPart.
• The input stream is used to create an attachment for the mail.
• When the mail has inline images, store the content id of the image in the database too.
• When the mail is displayed to the recipient, the content has to be parsed again to search for the match between the content id that was stored and the one available in the content.
• When a match is found, place the inline image in that section of the content. In order to ensure that no information is lost, the inline image can also be added as an attachment for the mail. When it is not possible to find a match for the stored content id, the user can at least view it as an attachment.
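The content id matching at display time can be sketched with a regular expression over the html content. The class InlineImageResolver is a hypothetical name, and the sketch assumes that the stored content ids have had their surrounding angle brackets removed and that each one maps to a URL under which the application serves the stored image.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class InlineImageResolver {
    // Matches references such as src="cid:logo@example" inside the html.
    private static final Pattern CID_REF =
            Pattern.compile("cid:([^\"'\\s>]+)", Pattern.CASE_INSENSITIVE);

    // Replaces every cid: reference whose content id is known with the URL
    // of the stored image. Unknown references are left untouched, so the
    // image remains reachable only through the attachment list.
    static String resolve(String html, Map<String, String> urlByContentId) {
        Matcher matcher = CID_REF.matcher(html);
        StringBuffer out = new StringBuffer();
        while (matcher.find()) {
            String url = urlByContentId.get(matcher.group(1));
            String replacement = (url != null) ? url : matcher.group();
            matcher.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(out);
        return out.toString();
    }
}
```

The map would be filled from the content ids stored in the database when the mail was parsed.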
ATTENTION POINTS
1. Charset Issues:
i. When mail is exchanged with the external world, the content viewed by the receiver can easily end up distorted compared to what was sent.
ii. One of the main reasons can be the charset in which the content is stored, especially in the case of blobs for attachments. Make it a point to store it in the charset used by the database in order to ensure better visualization.
2. Encoded characters in subject, attachment names:
i. When the subject or attachment’s name contains accented characters, they will most probably appear distorted to the end user. To overcome this scenario, use the decodeText and decodeWord methods in MimeUtility.
ii. Also try setting the following system property in your server: System.setProperty("mail.mime.decodetext.strict", "false");
3. Encoding problems in mails sent by MAC users:
When the encoding of the keywords or attachment names contains the words “CSMACINTOSH”, “MAC” or “MACINTOSH”, replace the strings “=\\?” and “\\?Q\\?” with the string “=?macroman/Q?” and then use the decodeText method in MimeUtility.
4. It is always possible for a mail to not have an author or have an author with only the name and no email address. Ensure that you create a dummy address whenever you figure out that there is no information about the author. The encoding problems mentioned above can also occur in the name of the author.
5. Some charsets are not supported by the mail.jar. Please include jcharset.jar to handle some usual charsets that are unsupported by the version provided by Sun.
6. When a mail contains only BCC recipients, most mail servers put “undisclosed-recipients:;” (or “undisclosed-recipients;:”) in the To header. This value may be in an illegal format, and the resulting exception needs to be handled.
7. There is a possibility of encountering the following exception while parsing the content of the mail:
sun.io.MalformedInputException at sun.io.ByteToCharUTF8.convert(ByteToCharUTF8.java(Compiled Code))
This is an inherent bug in the jar provided by Sun itself.
(Ref.: http://bugs.sun.com/bugdatabase/view...bug_id=6274255)
8. When the SMTP server and the application’s database are not up to date with the existing domains of the users it is possible that some mails may be received, but may not match with the defined domain list. Hence such mails will be ignored. Therefore, it is a must to ensure that the domain names and email addresses of users are updated on a daily basis in both the database as well as the SMTP server. Every time the SMTP server is updated, it must be restarted for the changes to take effect.
9. Error handling: Whenever there is an error, if the recipients are known, try to create an email attachment with the original mail file and send the mail to the recipients. This will ensure that the content is not lost. To analyze as to why parsing failed, program it in such a way that whenever an exception occurs, the mail file is moved to an error folder.
The files in this folder can be analyzed to improve the parsing functionality by trial and error.
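Moving a failed mail file to an error folder can be sketched with java.nio.file. The class FailedMailArchiver and the folder layout are hypothetical. Overwriting an existing file of the same name is a choice made for this sketch only.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

final class FailedMailArchiver {
    // Moves the mail file into the error folder, creating the folder when
    // needed, and returns the new location of the file.
    static Path moveToErrorFolder(Path mailFile, Path errorDir) throws IOException {
        Files.createDirectories(errorDir);
        Path target = errorDir.resolve(mailFile.getFileName());
        // Assumption: a file with the same name left over from an earlier
        // failure may be overwritten.
        return Files.move(mailFile, target, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

The parser would call this method from the catch block that handles a parsing exception, after the original mail has been forwarded to the known recipients.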
10. Use of log tables:
i. Three or more log tables would be needed for the mail parser.
ii. One log table must store the unique combination of message-id and email date.
iii. The other log tables can be used to store the log of mail files that were processed successfully and those that failed while processing.
iv. If some mail files were ignored because they were duplicates i.e. same mail for existing recipient, we need to store that information too.
v. Keep the mail files in separate folders such as parsed mails, ignored mails and failed mails so that you can use them for analyzing and validating records at any time.
vi. You can retain the data for a pre-defined period of time and then delete the files.
vii. It is a good practice to store the date of parsing the mail file too in the log tables. A batch cycle can be written to clean the table after a pre-determined number of days.
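The periodic clean-up of retained mail files can be sketched as follows. The class MailFileRetention is a hypothetical name, and using the last-modified time of the file as its age is an assumption of this sketch. An application that stores the date of parsing in a log table would use that date instead.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;
import java.time.Instant;

final class MailFileRetention {
    // Deletes regular files in the folder whose last-modified time is older
    // than maxAge, measured from "now", and returns how many were deleted.
    static int deleteOlderThan(Path folder, Duration maxAge, Instant now) throws IOException {
        Instant cutoff = now.minus(maxAge);
        int deleted = 0;
        try (DirectoryStream<Path> files = Files.newDirectoryStream(folder)) {
            for (Path file : files) {
                if (Files.isRegularFile(file)
                        && Files.getLastModifiedTime(file).toInstant().isBefore(cutoff)) {
                    Files.delete(file);
                    deleted++;
                }
            }
        }
        return deleted;
    }
}
```

A batch job could call this once a day for each of the parsed, ignored and failed folders.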
11. Testing:
• Testing has to be performed with as many external mail servers as possible to ensure completeness of coverage of all possibilities of content and attachment formats and charsets.
• Also, please focus on the mail providers used by the majority of people at your client’s place.
• Make it a practice to always use a different combination of recipients so that you can test if the delivery to BCC is successful.
• Also test with accented characters and for other error scenarios.
GLOSSARY
SMTP:
• The Simple Mail Transfer Protocol (SMTP) is the mechanism for delivery of email. In the context of the JavaMail API, a JavaMail-based program will communicate with a particular company's or Internet Service Provider's (ISP's) SMTP server.
• That SMTP server will relay the message on to the SMTP server of the recipient(s) to eventually be acquired by the user(s) through POP or IMAP.
• This does not require your SMTP server to be an open relay, as authentication is supported, but it is your responsibility to ensure the SMTP server is configured properly.
• There is nothing in the JavaMail API for tasks like configuring a server to relay messages or to add and remove email accounts.
MIME:
• MIME stands for Multipurpose Internet Mail Extensions. It is not a mail transfer protocol.
• Instead, it defines the content of what is transferred: the format of the messages, attachments, and so on. There are many different documents that take effect here: RFC 822, RFC 2045, RFC 2046, and RFC 2047.
• As a user of the JavaMail API, you usually do not need to concern yourself with these formats. However, these formats do exist and are used by your programs.
SESSION:
The Session class defines a basic mail session. It is through this session that everything else works. The Session object takes advantage of a java.util.Properties object to get information like mail server, username, password, and other information that can be shared across your entire application.
MESSAGE:
• Message is an abstract class, so you must work with a subclass, in most cases javax.mail.internet.MimeMessage.
• A MimeMessage is an email message that understands MIME types and headers, as defined in the different RFCs.
• Message headers are restricted to US-ASCII characters only, though non-ASCII characters can be encoded in certain header fields.
CONCLUSION
The parsing process explained above will never be a perfect replacement for existing thick client applications like Lotus Notes, Outlook Express, etc. It only provides an alternative approach when you create a web-mail application. The Internet is a source of immense information when it comes to dealing with issues in JavaMail parsing.