flume简单测试hdfssink && hivesink

转自:https://blog.csdn.net/woloqun/article/details/77651006

quich start

vi example.conf

# example.conf: A single-node Flume configuration

# Name the components on this agent
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Describe/configure the source
a1.sources.r1.type = netcat
a1.sources.r1.bind = localhost
a1.sources.r1.port = 44444

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23

启动agent

bin/flume-ng agent --conf conf --conf-file conf/example.conf --name a1 -Dflume.root.logger=INFO,console
  • 1

telnet port 44444 to send message

telnet localhost 44444
Trying ::1...
telnet: connect to address ::1: Connection refused
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.
aaaaaaaaaaaaaaaaaaaa
OK
dddddddddddddddddddd
OK
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10

agent 收到数据并打印

2017-08-23 22:17:27,110 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 61 61 61 61 61 61 61
61 61 61 61 61 61 61 61 61 aaaaaaaaaaaaaaaa }
2017-08-23 22:17:31,113 (SinkRunner-PollingRunner-DefaultSinkProcessor) [INFO - org.apache.flume.sink.LoggerSink.process(LoggerSink.java:95)] Event: { headers:{} body: 64 64 64 64 64 64 64
64 64 64 64 64 64 64 64 64 dddddddddddddddd }
  • 1
  • 2
  • 3
  • 4

kafka source && log sink

vi kafkalog.conf

# Name the components on this agent
a1.sources = source1
a1.sinks = k1
a1.channels = c1

a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.channels = c1
a1.sources.source1.batchSize = 10
a1.sources.source1.batchDurationMillis = 2000
a1.sources.source1.kafka.bootstrap.servers = localhost:9092
a1.sources.source1.kafka.topics = test
a1.sources.source1.kafka.consumer.group.id = custom.g.id

# Describe the sink
a1.sinks.k1.type = logger

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24

注意:a1.channels.c1.transactionCapacity>=a1.sources.source1.batchSize

启动

bin/flume-ng agent –conf conf –conf-file conf/kafkalog.conf –name a1 -Dflume.root.logger=INFO,console

kafka source && hdfs sink

vi kafkahdfs.conf

# Name the components on this agent
a1.sources = source1
a1.sinks = k1
a1.channels = c1

a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.channels = c1
a1.sources.source1.batchSize = 10
a1.sources.source1.batchDurationMillis = 2000
a1.sources.source1.kafka.bootstrap.servers = localhost:9092
a1.sources.source1.kafka.topics = test
a1.sources.source1.kafka.consumer.group.id = custom.g.id



# Describe the sink
a1.sinks.k1.type = hdfs

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1

a1.sinks.k1.channel = c1
a1.sinks.k1.hdfs.path = hdfs://work:9000/flume/test/%y-%m-%d/%H%M/%S
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 10
a1.sinks.k1.hdfs.roundUnit = minute
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32

启动agent

bin/flume-ng agent --conf conf --conf-file conf/kafkahdfs.conf --name a1 -Dflume.root.logger=INFO,console
  • 1

生成的文件

hadoop fs -ls /flume/test/17-08-23/2330/00

Found 13 items
-rw-r--r--   1 work supergroup        325 2017-08-23 23:31 /flume/test/17-08-23/2330/00/events-.1503502266033
-rw-r--r--   1 work supergroup        325 2017-08-23 23:31 /flume/test/17-08-23/2330/00/events-.1503502266034
-rw-r--r--   1 work supergroup        262 2017-08-23 23:31 /flume/test/17-08-23/2330/00/events-.1503502266035
-rw-r--r--   1 work supergroup        337 2017-08-23 23:33 /flume/test/17-08-23/2330/00/events-.1503502426378
-rw-r--r--   1 work supergroup        325 2017-08-23 23:33 /flume/test/17-08-23/2330/00/events-.1503502426379
-rw-r--r--   1 work supergroup        325 2017-08-23 23:33 /flume/test/17-08-23/2330/00/events-.1503502426380
-rw-r--r--   1 work supergroup        345 2017-08-23 23:34 /flume/test/17-08-23/2330/00/events-.1503502426381
-rw-r--r--   1 work supergroup        322 2017-08-23 23:34 /flume/test/17-08-23/2330/00/events-.1503502426382
-rw-r--r--   1 work supergroup        324 2017-08-23 23:34 /flume/test/17-08-23/2330/00/events-.1503502426383
-rw-r--r--   1 work supergroup        324 2017-08-23 23:34 /flume/test/17-08-23/2330/00/events-.1503502426384
-rw-r--r--   1 work supergroup        324 2017-08-23 23:34 /flume/test/17-08-23/2330/00/events-.1503502426385
-rw-r--r--   1 work supergroup        324 2017-08-23 23:34 /flume/test/17-08-23/2330/00/events-.1503502426386
-rw-r--r--   1 work supergroup        261 2017-08-23 23:34 /flume/test/17-08-23/2330/00/events-.1503502426387
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16

kafka source hive sink

# Name the components on this agent
a1.sources = source1
a1.sinks = k1
a1.channels = c1

a1.sources.source1.type = org.apache.flume.source.kafka.KafkaSource
a1.sources.source1.channels = c1
a1.sources.source1.batchSize = 10
a1.sources.source1.batchDurationMillis = 2000
a1.sources.source1.kafka.bootstrap.servers = localhost:9092
a1.sources.source1.kafka.topics = test
a1.sources.source1.kafka.consumer.group.id = custom.g.id



# Describe the sink
a1.sinks.k1.type = hdfs

# Use a channel which buffers events in memory
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100

# Bind the source and sink to the channel
a1.sources.r1.channels = c1

a1.sinks = k1
a1.sinks.k1.type = hive
a1.sinks.k1.hive.metastore = thrift://192.168.1.115:9083
a1.sinks.k1.hive.database = default
a1.sinks.k1.hive.table = kafkauser
a1.sinks.k1.hive.partition = %y%m%d    #若是已知数据,直接写字段名就可以
a1.sinks.k1.useLocalTimeStamp = false
a1.sinks.k1.round = true
a1.sinks.k1.roundValue = 10
a1.sinks.k1.roundUnit = minute
a1.sinks.k1.serializer = DELIMITED
a1.sinks.k1.serializer.delimiter = "\t"
a1.sinks.k1.serializer.serdeSeparator = '\t'
a1.sinks.k1.serializer.fieldnames =name,age    #分区字段就不要在这再写了
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27
  • 28
  • 29
  • 30
  • 31
  • 32
  • 33
  • 34
  • 35
  • 36
  • 37
  • 38
  • 39
  • 40

建表

create table kafkauser(name string,age int)
partitioned by(day string)
clustered by (name) into 5 buckets    #使用Hive做为Sink,Hive表必须cluster by bucket
stored as orc;    #Hive表需要stored as orc。
  • 1
  • 2
  • 3
  • 4

注意:hive表必须是orc格式,分桶

配置hive

vi hive-site.xml: 

  <property>
    <name>hive.support.concurrencyname>
    <value>truevalue>
  property>

  <property>
    <name>hive.exec.dynamic.partition.modename>
    <value>nonstrictvalue>
  property>


    <property>
    <name>hive.txn.managername>
    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManagervalue>
  property>

  <property>
    <name>hive.compactor.initiator.onname>
    <value>truevalue>
  property>

  <property>
    <name>hive.compactor.worker.threadsname>
    <value>1value>
  property>    	
	
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15
  • 16
  • 17
  • 18
  • 19
  • 20
  • 21
  • 22
  • 23
  • 24
  • 25
  • 26
  • 27

启动hivemetastore

hive --service metastore &
  • 1

启动agent

bin/flume-ng agent --conf conf --conf-file conf/kafkahive.conf --name a1 -Dflume.root.logger=INFO,console --classpath "/home/work/soft/apache-hive-0.14.0-bin/hcatalog/share/hcatalog/*":"/home/work/soft/apache-hive-0.14.0-bin/lib/*"    -Dflume.monitoring.type=http -Dflume.monitoring.port=34545
  • 1

常见问题1

Failed to start agent because dependencies were not found in classpath. Error follows.
java.lang.NoClassDefFoundError: org/apache/hive/hcatalog/streaming/RecordWriter
  • 1
  • 2

解决办法一:

将hive中相关的jar复制到$FLUME_HOME/lib下

cp /home/xiaobin/soft/apache-hive-0.14.0-bin/hcatalog/share/hcatalog/*.jar  $FLUME_HOME/lib/

cp /home/xiaobin/soft/apache-hive-0.14.0-bin/hive-*.jar $FLUME_HOME/lib/

cp antlr-2.7.7.jar ~/soft/apache-flume-1.7.0-bin/lib/

cp antlr-runtime-3.4.jar ~/soft/apache-flume-1.7.0-bin/lib/
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7

解决办法二:

指定classpath

 --classpath "/home/work/soft/apache-hive-0.14.0-bin/hcatalog/share/hcatalog/*":"/home/work/soft/apache-hive-0.14.0-bin/lib/*"
  • 1

常见问题二:内存溢出

Exception in thread "PollableSourceRunner-KafkaSource-source1" java.lang.OutOfMemoryError: Java heap space

vi flume-ng
JAVA_OPTS="-Xmx500m"
  • 1
  • 2
  • 3
  • 4

常见问题三

Caused by: org.apache.flume.sink.hive.HiveWriter$ConnectException: Failed connecting to EndPoint {metaStoreUri='thrift://192.168.1.115:9083', database='default', table='kafkauser', partitionVals=[20170826] }
    at org.apache.flume.sink.hive.HiveWriter.(HiveWriter.java:99)
    at org.apache.flume.sink.hive.HiveSink.getOrCreateWriter(HiveSink.java:343)
    at org.apache.flume.sink.hive.HiveSink.drainOneBatch(HiveSink.java:295)
    at org.apache.flume.sink.hive.HiveSink.process(HiveSink.java:253)
    ... 3 more
Caused by: org.apache.flume.sink.hive.HiveWriter$ConnectException: Failed connecting to EndPoint {metaStoreUri='thrift://192.168.1.115:9083', database='default', table='kafkauser', partitionVals=[20170826] }
    at org.apache.flume.sink.hive.HiveWriter.newConnection(HiveWriter.java:383)
    at org.apache.flume.sink.hive.HiveWriter.(HiveWriter.java:86)
    ... 6 more
Caused by: java.util.concurrent.TimeoutException
    at java.util.concurrent.FutureTask.get(FutureTask.java:205)
    at org.apache.flume.sink.hive.HiveWriter.timedCall(HiveWriter.java:434)
    at org.apache.flume.sink.hive.HiveWriter.newConnection(HiveWriter.java:376)
    ... 7 more
  • 1
  • 2
  • 3
  • 4
  • 5
  • 6
  • 7
  • 8
  • 9
  • 10
  • 11
  • 12
  • 13
  • 14
  • 15

这个问题是目录权限问题,解决办法如下

hadoop dfs -chmod 777 /tmp/hive
chmod 777 /tmp/hive
  • 1
  • 2

常见问题四:

NullPointerException Non-local session path expected to be non-null
  • 1

配置文件错误(仔细检查)

相关连接

http://henning.kropponline.de/2015/05/19/hivesink-for-flume/ 
https://my.oschina.net/wangjiankui/blog/711942 
https://cwiki.apache.org/confluence/display/Hive/Streaming+Data+Ingest#StreamingDataIngest-StreamingRequirements

你可能感兴趣的:(flume)