Flume sink to HDFS: how to remove the timestamp in the first column

When Flume sinks to HDFS, the first column of every line is a timestamp. How can it be removed? The output looks like this:



1492665578789  111
1492665580789  222
1492666625916  qqqq
1492664454650 
1492664455642   q


[Problem Description] Flume is used to collect file changes in a local directory; the source type is spooldir.


The configuration file is as follows:


LogAgent.sources = mysource
LogAgent.channels = mychannel
LogAgent.sinks = mysink


LogAgent.sources.mysource.type= spooldir
LogAgent.sources.mysource.fileHeader = true
LogAgent.sources.mysource.deserializer.outputCharset=UTF-8
LogAgent.sources.mysource.channels=mychannel
LogAgent.sources.mysource.spoolDir=/tmp/logs
LogAgent.sources.mysource.basenameHeader=true
LogAgent.sources.mysource.basenameHeaderKey=fileName


LogAgent.sinks.mysink.channel= mychannel
LogAgent.sinks.mysink.type=hdfs
LogAgent.sinks.mysink.hdfs.path=hdfs://master:9000/data/logs/%Y/%m/%d/%H/
LogAgent.sinks.mysink.hdfs.filePrefix=%{fileName}
LogAgent.sinks.mysink.hdfs.batchSize=1000
LogAgent.sinks.mysink.hdfs.rollSize=0
LogAgent.sinks.mysink.hdfs.rollCount=10000
LogAgent.sinks.mysink.hdfs.useLocalTimeStamp=true


LogAgent.channels.mychannel.type=memory
LogAgent.channels.mychannel.capacity=1000000
LogAgent.channels.mychannel.transactionCapacity=300000



[Solution]


Case 1: The following lines should be added to the configuration file so the HDFS sink knows the file format:



LogAgent.sinks.mysink.hdfs.fileType=DataStream
LogAgent.sinks.mysink.hdfs.writeFormat=Text
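

For context, here is the full sink section with those two lines folded in (same names, host, and path as the config above; just a sketch of the corrected block, nothing new added):

LogAgent.sinks.mysink.channel=mychannel
LogAgent.sinks.mysink.type=hdfs
LogAgent.sinks.mysink.hdfs.path=hdfs://master:9000/data/logs/%Y/%m/%d/%H/
LogAgent.sinks.mysink.hdfs.filePrefix=%{fileName}
LogAgent.sinks.mysink.hdfs.batchSize=1000
LogAgent.sinks.mysink.hdfs.rollSize=0
LogAgent.sinks.mysink.hdfs.rollCount=10000
LogAgent.sinks.mysink.hdfs.useLocalTimeStamp=true
LogAgent.sinks.mysink.hdfs.fileType=DataStream
LogAgent.sinks.mysink.hdfs.writeFormat=Text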


The explanation from the official documentation:

hdfs.fileType (default: SequenceFile): File format: currently SequenceFile, DataStream or CompressedStream. (1) DataStream will not compress the output file; please don't set codeC. (2) CompressedStream requires setting hdfs.codeC to an available codec.
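

Following that same table, if compressed output were wanted instead of plain text, the two properties would be paired like this (a sketch only; gzip is an example value, and the codec must actually be available in your Hadoop installation):

LogAgent.sinks.mysink.hdfs.fileType=CompressedStream
LogAgent.sinks.mysink.hdfs.codeC=gzip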



Case 2: The property prefix was written incorrectly. The correct key is

mysink.hdfs.fileType=DataStream

not

mysink.fileType=DataStream
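

Side by side with the full agent prefix (the # comments are annotations added here, not part of the original config):

# correct: the hdfs. segment is required for every HDFS sink property
LogAgent.sinks.mysink.hdfs.fileType=DataStream
# wrong: a key like this is never read by the sink, so the SequenceFile default stays in effect
# LogAgent.sinks.mysink.fileType=DataStream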




=================================================================================


A user abroad ran into the same problem, as follows:




hdfs-sink: how to get rid of the timestamp added in every event by flume in the HDFS files



I have a few files which contain JSON in each line

[root@ip-172-29-1-12 vp_flume]# more vp_170801.txt.finished | awk '{printf("%s\n", substr($0,0,20))}'
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp
{"status":"OK","resp

My flume config is

[root@ip-172-29-1-12 flume]# cat flume_test.conf 
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = loggerSink

agent.sources.seqGenSrc.type = spooldir
agent.sources.seqGenSrc.spoolDir = /moveitdata/dong/vp_flume
agent.sources.seqGenSrc.deserializer.maxLineLength = 10000000
agent.sources.seqGenSrc.fileSuffix = .finished
agent.sources.seqGenSrc.deletePolicy = never

agent.sources.seqGenSrc.channels = memoryChannel
agent.sinks.loggerSink.channel = memoryChannel
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 100

agent.sinks.loggerSink.type = hdfs
agent.sinks.loggerSink.hdfs.path = /home/dong/vp_flume

agent.sinks.loggerSink.hdfs.writeFormat = Text
agent.sinks.loggerSink.hdfs.rollInterval = 0
agent.sinks.loggerSink.hdfs.rollSize = 1000000000
agent.sinks.loggerSink.hdfs.rollCount = 0

The files in HDFS are:

[root@ip-172-29-1-12 flume]# hadoop fs -text /home/dong/vp_flume/* | awk '{printf("%s\n", substr($0,0,20))}' | more
1505276698665   {"stat
1505276698665   {"stat
1505276698666   {"stat
1505276698666   {"stat
1505276698666   {"stat
1505276698667   {"stat
1505276698667   {"stat
1505276698667   {"stat
1505276698668   {"stat
1505276698668   {"stat
1505276698668   {"stat
1505276698668   {"stat
1505276698669   {"stat
1505276698669   {"stat
1505276698669   {"stat
1505276698669   {"stat
1505276698670   {"stat
1505276698670   {"stat
1505276698670   {"stat
1505276698670   {"stat

Question: I don't want the timestamp that Flume adds to each event. How can I get rid of it by configuring Flume properly?



The accepted answer:




You have not explicitly set an hdfs.fileType property in your agent config file, so Flume will use the default, SequenceFile. SequenceFile supports two write formats: Text and Writable. You have set hdfs.writeFormat = Text, which means Flume will use HDFSTextSerializer to serialize your events. If you take a look at its source (Line 53), you will see that it adds a timestamp as the default key.

Using hdfs.writeFormat = Writable won't help either, because it does the same thing. You can check its source here (Line 52).

A key is always required for a SequenceFile, so unless you have a good reason to use SequenceFile, I'd suggest using hdfs.fileType = DataStream in your agent config.
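

Applied to the agent config quoted above, the minimal fix is a single extra line; with DataStream, the event body is written out as-is (followed by a newline), with no SequenceFile key, so the timestamp column disappears:

agent.sinks.loggerSink.hdfs.fileType = DataStream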


