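Several third-party SerDes are available for handling CSV-formatted data in Hive. This article uses CSVSerde:
1. Add the third-party SerDe
First, add the third-party SerDe JAR to the Hive classpath with the following command: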
hive> add jar /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/csv-serde-1.1.2.jar;
Added /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/csv-serde-1.1.2.jar to class path
Added resource: /home/hadoopUser/cloud/hive/apache-hive-0.13.1-bin/lib/csv-serde-1.1.2.jar
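The JAR can be downloaded from this link: csv-serde-1.1.2.jar. The following sections walk through the process using a sample CSV file.
2. The sample CSV log file has the following format: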
1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!air, moon roof, loaded",4799.00
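Fields are comma-separated and represent, in order: year, manufacturer, model, description, and value. A field that contains a comma is wrapped in double quotes, and a literal double quote inside a quoted field is written as two double quotes.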
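To make the quoting rules concrete, here is a minimal sketch that parses the same four sample lines outside Hive, using Python's standard `csv` module. It is only an illustration of how quoted commas and doubled quotes resolve into five fields. It is not how CSVSerde is implemented, and the helper name `parse_sample` is hypothetical. The field names are borrowed from the Hive table defined in the next section.

```python
import csv
import io

# The four sample lines from the log file above, copied verbatim.
SAMPLE = '''1997,Ford,E350,"ac, abs, moon",3000.00
1999,Chevy,"Venture ""Extended Edition""","",4900.00
1999,Chevy,"Venture ""Extended Edition, Very Large""","",5000.00
1996,Jeep,Grand Cherokee,"MUST SELL!air, moon roof, loaded",4799.00
'''

# Column names matching the Hive table created in the next section.
FIELDS = ["year", "company", "type", "description", "value"]


def parse_sample(text):
    """Hypothetical helper: split CSV text into one dict per line."""
    # csv.reader's default dialect uses ',' as the delimiter and '"' as the
    # quote character, and reads a doubled quote ("") as one literal quote.
    reader = csv.reader(io.StringIO(text))
    return [dict(zip(FIELDS, row)) for row in reader]


rows = parse_sample(SAMPLE)
for row in rows:
    print(row)
```

Each line yields exactly five fields: the commas inside the quoted description stay within one field, and the doubled quotes around "Extended Edition" collapse to single quotes. The Hive query result at the end of this article shows the same field values.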
3. Create the Hive table
hive> CREATE TABLE serde_csv(year STRING,company STRING,type STRING,description STRING,value STRING)
> ROW FORMAT SERDE 'com.bizo.hive.serde.csv.CSVSerde'
> STORED AS TEXTFILE ;
OK
Time taken: 0.072 seconds
4. Load the data
hive> LOAD DATA LOCAL INPATH "/home/hadoopUser/data/csv_serde.txt" INTO TABLE serde_csv;
Copying data from file:/home/hadoopUser/data/csv_serde.txt
Copying file: file:/home/hadoopUser/data/csv_serde.txt
Loading data to table hive.serde_csv
Table hive.serde_csv stats: [numFiles=1, numRows=0, totalSize=259, rawDataSize=0]
OK
Time taken: 0.389 seconds
hive> select * from serde_csv;
OK
1997 Ford E350 ac, abs, moon 3000.00
1999 Chevy Venture "Extended Edition" 4900.00
1999 Chevy Venture "Extended Edition, Very Large" 5000.00
1996 Jeep Grand Cherokee MUST SELL!air, moon roof, loaded 4799.00