Apache Tika

The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.

The supported file formats are:

Supported Document Formats

This page lists all the document formats supported by Apache Tika 0.6. Follow the links to the various parser class javadocs for more detailed information about each document format and how it is parsed by Tika.

  • Supported Document Formats
    • HyperText Markup Language
    • XML and derived formats
    • Microsoft Office document formats
    • OpenDocument Format
    • Portable Document Format
    • Electronic Publication Format
    • Rich Text Format
    • Compression and packaging formats
    • Text formats
    • Feed and Syndication formats
    • Audio formats
    • Image formats
    • Video formats
    • Java class files and archives
    • The mbox format
    • CAD formats
    • Font formats
    • Executable programs and libraries
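Handling this many formats starts with working out which format a given file is. As a rough illustration of that detection step, here is a minimal sketch that checks the leading "magic" bytes of a file against a few well-known signatures. It is not Tika's implementation or API: the class `MagicSniffer` and its methods are hypothetical names made up for this example, and it covers only four signatures.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only: not part of Apache Tika.
public class MagicSniffer {

    // Returns true if data begins with the given prefix bytes.
    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Maps a handful of well-known file signatures to media types.
    static String detect(byte[] data) {
        if (startsWith(data, "%PDF-".getBytes(StandardCharsets.US_ASCII))) {
            return "application/pdf";
        }
        if (startsWith(data, "{\\rtf".getBytes(StandardCharsets.US_ASCII))) {
            return "application/rtf";
        }
        if (startsWith(data, new byte[] {'P', 'K', 3, 4})) {
            return "application/zip";
        }
        if (startsWith(data, new byte[] {(byte) 0x89, 'P', 'N', 'G'})) {
            return "image/png";
        }
        // Unknown signature: fall back to the generic binary type.
        return "application/octet-stream";
    }

    public static void main(String[] args) {
        byte[] sample = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(sample)); // prints "application/pdf"
    }
}
```

Signature checks alone are coarse. OpenDocument and EPUB files are ZIP containers, so this sketch would report them all as `application/zip`. Telling them apart requires looking inside the archive.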
