Reading ORC files and LZO files with Hadoop streaming

Reading ORC files

Hadoop streaming can read ORC files through either org.apache.hadoop.hive.ql.io.orc.OrcInputFormat or org.apache.orc.mapred.OrcInputFormat. The big-data cluster does not ship the corresponding jars by default, so the job must be submitted with a -libjars option pointing at them. The required jars are hive-exec-2.3.5.jar, aircompressor-0.8.jar, hive-storage-api-2.4.0.jar, orc-mapreduce-1.5.5.jar, orc-shims-1.5.5.jar.
Test result: org.apache.hadoop.hive.ql.io.orc.OrcInputFormat performs better than org.apache.orc.mapred.OrcInputFormat.

A sample job is shown below.
Jar paths

ORC_DIR=/orcjars/ # directory on the cluster that holds the third-party jars
ORC_CORE=${ORC_DIR}/hive-exec-2.3.5.jar,${ORC_DIR}/aircompressor-0.8.jar,${ORC_DIR}/hive-storage-api-2.4.0.jar,${ORC_DIR}/orc-mapreduce-1.5.5.jar,${ORC_DIR}/orc-shims-1.5.5.jar

hadoop jar submission arguments (the -libjars option below points the job at those third-party jars)

INPUTFORMAT="org.apache.hadoop.hive.ql.io.orc.OrcInputFormat" # InputFormat class used to parse the ORC input
hadoop jar ${JAR_PACKAGE} \
-libjars $ORC_CORE \
-D mapred.job.queue.name=badm \
-D mapred.job.name=ads_hbpt_down_history_locus_h \
-D stream.map.input.ignoreKey=true \
-D stream.map.output.field.separator=$'\t' \
-D map.output.key.field.separator=_ \
-D num.key.fields.for.partition=1 \
-D mapreduce.output.fileoutputformat.compress=true \
-D mapreduce.output.fileoutputformat.compress.codec=com.hadoop.compression.lzo.LzopCodec \
-numReduceTasks 1 \
-input ${IN_PATH} \
-output ${OUT_PATH} \
-mapper "${MAP_FILE} ${jobID_plate}" \
-reducer ${RED_FILE} \
-file ${MAP_FILE} \
-file ${RED_FILE} \
-inputformat $INPUTFORMAT \
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner

Mapper code

#!/usr/bin/env python3
# coding: utf-8
import sys

for line in sys.stdin:
    line = line.strip()
    line = line.replace("{", "").replace("}", "")  # drop the surrounding braces
    # When Hadoop streaming parses an ORC file, each input line starts with a
    # util.SortAssist$UnionKey@a32820cf object, separated from the value by a tab;
    # the value fields themselves are comma-separated.
    ftSp = []
    tpSp = line.split(",")
    fField = tpSp[0]
    orcField = fField.split("\t")
    if len(orcField) == 2:  # strip the SortAssist$UnionKey\t prefix prepended to the input data
        sim = orcField[1]
        tpSp[0] = sim
    for item in tpSp:
        ftSp.append(str(item).strip())
    line = ','.join(ftSp)
    print(line)  # emit the cleaned, comma-separated record
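
To make the input format concrete, here is a minimal, self-contained sketch of the same stripping logic applied to one made-up record; the UnionKey hash and the field values are invented purely for illustration:

# Hypothetical example of one ORC record as it arrives on stdin and what the
# mapper emits for it; the field values are made up for illustration.
raw = "util.SortAssist$UnionKey@a32820cf\t{20191214, plate_001, 116.39, 39.90}"

line = raw.strip().replace("{", "").replace("}", "")
tpSp = line.split(",")
orcField = tpSp[0].split("\t")
if len(orcField) == 2:
    tpSp[0] = orcField[1]                     # drop the UnionKey prefix, keep the first field
print(','.join(s.strip() for s in tpSp))      # -> 20191214,plate_001,116.39,39.90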

Note:
With org.apache.hadoop.hive.ql.io.orc.OrcInputFormat, the -input path must be the directory itself, e.g. /bigdata/product/mapred/warehouse/common_component/road_match3.1/20191214.
With org.apache.orc.mapred.OrcInputFormat, the -input path can be either a file glob such as /bigdata/product/mapred/warehouse/common_component/road_match3.1/20191214/*.orc or the directory /bigdata/product/mapred/warehouse/common_component/road_match3.1/20191214.

Reading LZO files

LZO-compressed text needs no special InputFormat or extra jars here (assuming the cluster already has the LZO codec configured); the job below reads the input with the default settings and only ships its Python helpers via -files and -file.

hadoop jar $JAR_PACKAGE \
        -D mapred.job.queue.name=default \
        -D mapred.job.name=ads_huoyun_gakk_filter \
        -D stream.map.input.ignoreKey=true \
        -D map.output.key.field.separator=, \
        -files ${EXEC_PATH}/utils \
        -D num.key.fields.for.partition=1 \
        -partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner \
        -numReduceTasks 200 \
        -input $IN_PATH \
        -output $OUT_PATH \
        -mapper $MAP_FILE \
        -file $MAP_FILE \
        -file ${EXEC_PATH}/points.txt \
        -file ${EXEC_PATH}/jiugongge.py

Python code

#!/usr/bin/env python3
# coding: utf-8

import sys

# Make the utils/ package and jiugongge.py (shipped with -files / -file)
# importable from the task's working directory.
sys.path.append('.')
from utils.base_util import Point
from jiugongge import getGridList


def get_jiugong(fname):
    """Build the set of grid cells covered by the points listed in fname."""
    grids = {}
    with open(fname, 'r') as f:
        for line in f:
            line = line.strip()
            sp = line.split('_')          # each line: <id>_<lon>_<lat>_<length>
            lon = float(sp[1])
            lat = float(sp[2])
            length = float(sp[3])
            # grid = LatLonGrid.getGridIDBySource(lon, lat, length=length*2)
            gris = getGridList(lon, lat)
            for gri in gris:
                if gri not in grids:
                    grids[gri] = 1
    return grids


all_grids = get_jiugong('points.txt')

for line in sys.stdin:
    line = line.strip()
    sp = line.split(",")
    if sp[2] == '0':                      # keep only records whose third field is '0'
        p = Point(sp)
        lon = float(p.lon) / 600000.0     # raw coordinates are in 1/600000-degree units
        lat = float(p.lat) / 600000.0
        g_id = str(int(lon * 100)) + "_" + str(int(lat * 100))  # 0.01-degree cell id
        if g_id in all_grids:
            print(line)
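
The Point class and getGridList function come from utils/base_util.py and jiugongge.py, which are shipped with the job but not shown in the post; Point wraps the split record and exposes lon and lat attributes. As a rough sketch only, getGridList might look something like the following, assuming grid IDs use the same int(lon*100)_int(lat*100) scheme as the mapper and that "jiugongge" (nine-square grid) means the containing cell plus its eight neighbours:

# Hypothetical sketch of getGridList; the real jiugongge.py may differ.
def getGridList(lon, lat):
    """Return the 0.01-degree cell containing (lon, lat) plus its 8 neighbours."""
    cx, cy = int(lon * 100), int(lat * 100)
    return [str(cx + dx) + "_" + str(cy + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)]

Under that scheme, a record whose coordinates fall anywhere inside the nine cells around a reference point passes the "if g_id in all_grids" check and is printed.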
