Preparing to Use LZO-Compressed Text Files

Before using LZO-compressed tables in Impala, do the following one-time setup for each machine in the cluster. Install the necessary packages using either the Cloudera public repository, a private repository you establish, or individually downloaded package files.

  1. Prepare your systems to work with LZO using Cloudera repositories:

    Download and install the appropriate repository file on each machine on which you intend to use LZO with Impala. Install the:

    • Red Hat 5 repo file to /etc/yum.repos.d/.
    • Red Hat 6 repo file to /etc/yum.repos.d/.
    • SUSE repo file to /etc/zypp/repos.d/.
    • Ubuntu 10.04 list file to /etc/apt/sources.list.d/.
    • Ubuntu 12.04 list file to /etc/apt/sources.list.d/.
    • Debian list file to /etc/apt/sources.list.d/.
  2. Configure Impala to use LZO:

    Use one of the following sets of commands to refresh your package management system's repository information and install the hadoop-lzo-cdh4 and impala-lzo packages.

    For RHEL/CentOS systems:

    $ sudo yum update
    $ sudo yum install hadoop-lzo-cdh4
    $ sudo yum install impala-lzo

    For SUSE systems:

    $ sudo zypper refresh
    $ sudo zypper install hadoop-lzo-cdh4
    $ sudo zypper install impala-lzo

    For Debian/Ubuntu systems:

    $ sudo apt-get update
    $ sudo apt-get install hadoop-lzo-cdh4
    $ sudo apt-get install impala-lzo
      Note:

    The version of the impala-lzo package is closely tied to the version of Impala you use. Any time you upgrade Impala, re-run the installation command for impala-lzo on each applicable machine to make sure you have the appropriate version of that package.

  3. For core-site.xml on the client and server (that is, in the configuration directories for both Impala and Hadoop), append com.hadoop.compression.lzo.LzopCodec to the comma-separated list of codecs. For example:
    <property>
      <name>io.compression.codecs</name>
      <value>org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.BZip2Codec,org.apache.hadoop.io.compress.DeflateCodec,org.apache.hadoop.io.compress.SnappyCodec,com.hadoop.compression.lzo.LzopCodec</value>
    </property>
  4. Restart the MapReduce services.
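
The core-site.xml edit in step 3 amounts to appending one entry to a comma-separated list. A minimal shell sketch of that idempotent append (the codec list here is shortened for readability; it is not the full value shown in step 3):

```shell
# Append LzopCodec to a comma-separated codec list unless it is already present.
# The starting list is shortened for readability.
codecs="org.apache.hadoop.io.compress.DefaultCodec,org.apache.hadoop.io.compress.GzipCodec"
new="com.hadoop.compression.lzo.LzopCodec"
case ",$codecs," in
  *",$new,"*) ;;                      # already listed; leave unchanged
  *) codecs="$codecs,$new" ;;
esac
echo "$codecs"
```

Guarding against a duplicate entry makes the edit safe to re-run, for example from a configuration-management script.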

Creating LZO Compressed Text Tables

A table containing LZO-compressed text files must be created in Hive with the following storage clause:

STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
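
For example, a complete Hive statement using this clause might look like the following sketch (the table name, columns, and row format are illustrative; only the storage clause is required):

```sql
-- Illustrative table and column names; the STORED AS clause is the required part.
CREATE TABLE lzo_logs (ts STRING, msg STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';
```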

Files in an LZO-compressed table must use the .lzo extension. After loading data into such a table, index the files so that they can be split. Indexing is done by running the LZO indexer, com.hadoop.compression.lzo.DistributedLzoIndexer, which is included in the hadoop-lzo package.

Run the indexer using:

$ hadoop jar /usr/lib/hadoop/lib/hadoop-lzo-cdh4-0.4.15-gplextras.jar \
  com.hadoop.compression.lzo.DistributedLzoIndexer /hdfs_location_of_table/

Indexed files have the same name as the file they index, with the .index extension appended (for example, data.lzo is indexed by data.lzo.index). Impala can read non-indexed files, but because such files cannot be split, they are typically read from remote DataNodes, which is very inefficient.
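
Because only indexed files can be split, it can be worth checking for .lzo files that lack a companion index before querying. A minimal sketch of that check for a local directory (for files already in HDFS you would list them with hadoop fs -ls instead):

```shell
# Report .lzo files that have no companion .lzo.index file.
# Checks a local directory; for HDFS, list files with `hadoop fs -ls` instead.
dir="${1:-.}"
for f in "$dir"/*.lzo; do
  [ -e "$f" ] || continue            # glob matched nothing; no .lzo files
  [ -e "$f.index" ] || echo "unindexed: $f"
done
```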

In Hive, when writing LZO compressed text tables, you must include the following specification:

hive> SET hive.exec.compress.output=true;
hive> SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;

Impala does not currently support writing LZO-compressed Text files.

Once you have created tables, you can also use Hive to convert existing data from one format to another: create a new table that uses the desired format, then select all rows from the existing table and insert them into the new table.
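
Assuming an existing text table named text_t, such a conversion might be sketched in Hive as follows (table and column names are illustrative; the two SET statements are the ones required for LZO output, as shown earlier):

```sql
-- Illustrative names: text_t is the existing table, lzo_t the new LZO table.
CREATE TABLE lzo_t (id INT, msg STRING)
STORED AS
    INPUTFORMAT 'com.hadoop.mapred.DeprecatedLzoTextInputFormat'
    OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat';

SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
INSERT OVERWRITE TABLE lzo_t SELECT * FROM text_t;
```

After the insert completes, run the distributed indexer on the new table's HDFS directory so that the resulting .lzo files can be split.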
