Flume: ingesting image files and writing them to HDFS

Flume is a log-processing tool that excels at handling text data. However, it can also be useful in other scenarios, such as collecting large numbers of small image files from a server.
Without further ado, here is the flume-conf configuration:

# ==== start ====
agent.sources = spooldirsource
agent.channels = memoryChannel
agent.sinks = hdfssink

# For each one of the sources, the type is defined
# For each one of the sources, the type is defined
agent.sources.spooldirsource.type = spooldir

# The channel can be defined as follows.
agent.sources.spooldirsource.channels = memoryChannel

# Directory to watch for new image files
agent.sources.spooldirsource.spoolDir = /data/mcmin/imgfiles

# BlobDeserializer turns each whole file into a single event,
# instead of splitting input line by line as the default deserializer does
agent.sources.spooldirsource.deserializer = org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder

# Maximum event size in bytes (here ~100 MB)
agent.sources.spooldirsource.deserializer.maxBlobLength = 100000000

# Each sink's type must be defined
agent.sinks.hdfssink.type = hdfs

#Specify the channel the sink should use
agent.sinks.hdfssink.channel = memoryChannel

# ns1 is the HDFS high-availability nameservice
# /%Y/%m/%d writes into date-based directories
agent.sinks.hdfssink.hdfs.path = hdfs://ns1/mcmin/%Y/%m/%d

# Use the agent's local time to resolve the %Y/%m/%d escapes
agent.sinks.hdfssink.hdfs.useLocalTimeStamp = true

# Suffix appended to files written to HDFS
agent.sinks.hdfssink.hdfs.fileSuffix = .jpg

# DataStream writes the raw event bytes, without SequenceFile wrapping
agent.sinks.hdfssink.hdfs.fileType = DataStream

# Each channel's type is defined.
agent.channels.memoryChannel.type = memory

# Other config values specific to each type of channel (sink or source)
# can be defined as well.
# In this case, it specifies the capacity of the memory channel.
agent.channels.memoryChannel.capacity = 100000

# ==== end ====
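With the configuration in place, the agent can be started with the standard flume-ng command. A minimal example, assuming the config above is saved as flume-conf.properties (the --name value must match the "agent" prefix used in the file):

flume-ng agent --conf conf --conf-file flume-conf.properties --name agent -Dflume.root.logger=INFO,console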

There are two things to note here:
1: The spooldirsource's deserializer is set to org.apache.flume.sink.solr.morphline.BlobDeserializer$Builder, so each file is read as one binary blob rather than line by line.
2: The HDFS sink's fileType must be declared as DataStream; the default, SequenceFile, would wrap the image bytes in SequenceFile records and corrupt them.
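A related point worth checking: by default the HDFS sink rolls output files by size, event count, and time (hdfs.rollSize, hdfs.rollCount, hdfs.rollInterval), so several image events may end up concatenated into one HDFS file. A sketch of the extra settings you could add if you want each image written as its own file (setting a roll parameter to 0 disables that trigger):

# One event (one image) per output file; 0 disables size- and time-based rolling
agent.sinks.hdfssink.hdfs.rollCount = 1
agent.sinks.hdfssink.hdfs.rollSize = 0
agent.sinks.hdfssink.hdfs.rollInterval = 0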

Here, Flume's spooldirsource wraps each image file in its own single event, unlike text processing, where every line of text becomes an event. Moreover, these events are buffered in memory, so ingesting a large batch of images at once, or images that are individually large, can easily exhaust the JVM heap. This is worth keeping in mind.
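If memory pressure is a concern, two options (sketched below using standard Flume channel properties) are to cap the memory channel by bytes as well as by event count, or to switch to a disk-backed file channel:

# Option 1: bound the memory channel by bytes (in addition to event count)
agent.channels.memoryChannel.byteCapacity = 536870912
agent.channels.memoryChannel.byteCapacityBufferPercentage = 20

# Option 2: a file channel buffers events on disk instead of in the heap
# (the directories below are placeholders; pick paths that suit your server)
agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /data/flume/checkpoint
agent.channels.fileChannel.dataDirs = /data/flume/data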
