C#: Converting RTF to Plain Text

How to: Convert RTF to Plain Text (C# Programming Guide)

Rich Text Format (RTF) is a document format developed by Microsoft in the late 1980s to enable the exchange of documents across operating systems. Both Microsoft Word and WordPad can read and write RTF documents. In the .NET Framework, you can use the RichTextBox control to create a word processor that supports RTF and lets a user apply formatting to text in a WYSIWYG manner.

You can also use the RichTextBox control to programmatically remove the RTF formatting codes from a document and convert it to plain text. You do not need to embed the control in a Windows Form to perform this kind of operation.

To use the RichTextBox control in a project

  1. Add a reference to System.Windows.Forms.dll.

  2. Add a using directive for the System.Windows.Forms namespace (optional).

Example


The following example provides a sample RTF file to be converted. The file contains RTF formatting, such as font information, and it also contains four Unicode characters and four extended ASCII characters. The file is opened, passed to the RichTextBox as RTF, retrieved as text, displayed in a MessageBox, and output to a file in UTF-8 format.

C#
    // Save the following RTF file to the same folder as your .exe file, and call it "test.rtf".
    /*
    {\rtf1\ansi\ansicpg1252\deff0\deflang1033{\fonttbl{\f0\fswiss\fcharset0 Arial;}{\f1\fnil\fprq1\fcharset0 Courier New;}{\f2\fswiss\fprq2\fcharset0 Arial;}}
{\colortbl ;\red0\green128\blue0;\red0\green0\blue0;}
{\*\generator Msftedit 5.41.21.2508;}\viewkind4\uc1\pard\f0\fs20 This is the \i Greek \i0 word "psyche": \cf1\f1\u968?\u965?\u967?\u942?\cf2\f2 . It is encoded in Unicode.\par
Here are four extended \b ASCII \b0 characters (Windows code page 1252):  \'e2\'e4\u1233?\'e5\cf0\par
}
     */
    class ConvertFromRTF
    {
        static void Main()
        {

            string path = @"test.rtf";

            // Create the RichTextBox. (Requires a reference to System.Windows.Forms.dll.)
            System.Windows.Forms.RichTextBox rtBox = new System.Windows.Forms.RichTextBox();

            // Get the contents of the RTF file. Note that when it is
            // stored in the string, it is encoded as UTF-16.
            string s = System.IO.File.ReadAllText(path);

            // Display the RTF text.
            System.Windows.Forms.MessageBox.Show(s);

            // Convert the RTF to plain text.
            rtBox.Rtf = s;
            string plainText = rtBox.Text;

            // Display plain text output in MessageBox because console
            // cannot display Greek letters.
            System.Windows.Forms.MessageBox.Show(plainText);

            // Output plain text to file, encoded as UTF-8.
            System.IO.File.WriteAllText(@"output.txt", plainText);
        }
    }


RTF characters are encoded in eight bits. However, the format does let users specify Unicode characters in addition to extended ASCII characters from specified code pages. Because the RichTextBox.Text property is of type string, the characters are encoded as Unicode UTF-16. Any extended ASCII characters and Unicode characters from the source RTF document are correctly encoded in the text output.
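
To make the two escape forms in the sample file concrete, the following is a minimal sketch of how a single \uN or \'hh escape maps to a UTF-16 character. It is not how RichTextBox is implemented and it is not an RTF parser. The names RtfEscapeSketch, FromUnicodeEscape, and FromHexEscape are hypothetical. The sketch also assumes Latin-1 (code page 28591) as a stand-in for Windows code page 1252: the two agree for bytes 0xA0 through 0xFF, which covers the \'e2, \'e4, and \'e5 escapes in the sample, and Latin-1 is available without registering an extra encoding provider.

```csharp
using System;
using System.Text;

static class RtfEscapeSketch
{
    // \uN carries a signed 16-bit decimal value; reinterpret it as a UTF-16 code unit.
    public static char FromUnicodeEscape(short value)
    {
        return unchecked((char)value);
    }

    // \'hh carries one byte that is interpreted in the document's code page.
    public static char FromHexEscape(byte value, Encoding codePage)
    {
        return codePage.GetChars(new[] { value })[0];
    }

    static void Main()
    {
        // The four \uN values from the sample file: \u968 \u965 \u967 \u942.
        var word = new StringBuilder();
        foreach (short n in new short[] { 968, 965, 967, 942 })
        {
            word.Append(FromUnicodeEscape(n));
        }
        string psyche = word.ToString();

        Console.WriteLine(psyche.Length);                      // 4 (UTF-16 code units)
        Console.WriteLine(Encoding.UTF8.GetByteCount(psyche)); // 8 (bytes when written as UTF-8)

        // \'e2 from the sample, decoded with Latin-1 as an assumed stand-in for 1252.
        char c = FromHexEscape(0xE2, Encoding.GetEncoding(28591));
        Console.WriteLine(((int)c).ToString("X4"));            // 00E2
    }
}
```

The byte counts show why the encoding chosen at write time matters: each Greek letter is one UTF-16 code unit in the string but two bytes in the UTF-8 output file.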

If you use the File.WriteAllText(String, String) overload to write the text to disk, as the example does, the text is encoded as UTF-8 without a byte order mark (BOM).
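
If the program that will read the output expects a byte order mark, the File.WriteAllText overload that takes an Encoding can be used instead. The following is a small sketch that writes the same text both ways and checks the first three bytes of each file. The class name Utf8OutputSketch, the helper StartsWithUtf8Bom, and the temporary file names are hypothetical and chosen only for illustration.

```csharp
using System;
using System.IO;
using System.Text;

static class Utf8OutputSketch
{
    // Returns true when the file starts with the UTF-8 byte order mark EF BB BF.
    public static bool StartsWithUtf8Bom(string path)
    {
        byte[] bytes = File.ReadAllBytes(path);
        return bytes.Length >= 3
            && bytes[0] == 0xEF && bytes[1] == 0xBB && bytes[2] == 0xBF;
    }

    static void Main()
    {
        // The Greek word from the sample, written with escapes to keep this source file ASCII.
        string plainText = "\u03C8\u03C5\u03C7\u03AE";

        string dir = Path.GetTempPath();
        string noBomPath = Path.Combine(dir, "output-no-bom.txt");
        string bomPath = Path.Combine(dir, "output-bom.txt");

        // Two-argument overload: UTF-8 without a BOM.
        File.WriteAllText(noBomPath, plainText);

        // Three-argument overload: UTF8Encoding(true) asks for a BOM to be emitted.
        File.WriteAllText(bomPath, plainText, new UTF8Encoding(true));

        Console.WriteLine(StartsWithUtf8Bom(noBomPath)); // False
        Console.WriteLine(StartsWithUtf8Bom(bomPath));   // True
    }
}
```

The same overload accepts any other Encoding instance, for example Encoding.Unicode for UTF-16 output.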

http://msdn.microsoft.com/en-us/library/cc488002.aspx

Rich Text Format (RTF) Version 1.5 Specification

http://www.biblioscape.com/rtf15_spec.htm

Word 2007: Rich Text Format (RTF) Specification, version 1.9.1

http://www.microsoft.com/en-us/download/details.aspx?id=10725
