This section walks through two commonly used Flume cases.
Goal: use Flume to collect the test.log log file on a Linux host in real time and store it on HDFS.
The concrete steps are as follows. First, prepare a test.log file with the following sample content:
hello,java
hello,python
hello,scala
hello,spark
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Describe the source
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /usr/logs/test.log
a1.sources.r1.fileHeader = true
# Describe the channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# Describe the sink
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%y-%m-%d/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.rollInterval = 0
a1.sinks.k1.hdfs.rollSize = 2000
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.batchSize = 1000
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.fileType = DataStream
# Bind the source and sink to the channel
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
bin/flume-ng agent --conf conf --conf-file conf/test.conf --name a1 -Dflume.root.logger=DEBUG,console -Dorg.apache.flume.log.printconfig=true -Dorg.apache.flume.log.rawdata=true
Note: make sure the command-line arguments are given in the order shown; otherwise Flume may fail to load the log4j configuration file and no log output will appear on the console. Normally, while this shell command is running, the console keeps printing log output such as the following:
2019-01-16 11:27:45,914 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:127)] Checking file:conf/test.conf for changes
2019-01-16 11:28:15,914 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:127)] Checking file:conf/test.conf for changes
2019-01-16 11:28:45,915 (conf-file-poller-0) [DEBUG - org.apache.flume.node.PollingPropertiesFileConfigurationProvider$FileWatcherRunnable.run(PollingPropertiesFileConfigurationProvider.java:127)] Checking file:conf/test.conf for changes
While the agent is running, check the file being written on HDFS:
[root@nd1 ~]# hadoop fs -cat /flume/events/19-01-16/events-.1547609029956.tmp
hello,java
hello,python
hello,scala
hello,spark
Note: the target file here ends in .tmp. Since no time-based roll was configured (hdfs.rollInterval = 0), the file is only rolled into its final name once the agent process is stopped (with hdfs.rollSize = 2000 it would also roll after roughly 2000 bytes, far more than our sample data).
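If files should be finalized while the agent keeps running, a time-based roll can be configured instead. A minimal sketch, where the 60-second interval is only an illustrative value and not part of the original setup:
# Roll the current file every 60 seconds instead of only on shutdown
a1.sinks.k1.hdfs.rollInterval = 60
# Disable size- and count-based rolling so only the time-based roll applies
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0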
echo "hi,123" >> test.log
[root@nd1 flume]# hadoop fs -cat /flume/events/19-01-16/events-.1547609029956.tmp
hello,java
hello,python
hello,scala
hello,spark
hi,123
We can see that the newly appended content has been collected into HDFS, as expected. After stopping the running shell command and checking the HDFS path again, the file has been rolled into the final target file, without the .tmp suffix:
[root@nd1 flume]# hadoop fs -ls /flume/events/19-01-16/
Found 1 items
-rw-r--r-- 3 root supergroup 55 2019-01-16 11:42 /flume/events/19-01-16/events-.1547609029956
Goal: in real log collection we often want to group the output files by the date contained in the log data itself, rather than by the current system date. This can be implemented with a custom interceptor.
The concrete steps are as follows. First, prepare a source log file (here a Spark application log, a.log) whose lines span several dates:
2019-01-11 02:11:59,819 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - Stopped
2019-01-11 02:11:59,823 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - MapOutputTrackerMasterEndpoint stopped!
2019-01-11 02:11:59,841 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - MemoryStore cleared
2019-02-11 02:11:59,841 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - BlockManager stopped
2019-02-11 02:11:59,843 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - BlockManagerMaster stopped
2019-02-11 02:11:59,846 WARN [org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66)] - Message RemoteProcessDisconnected dropped. RpcEnv already stopped.
2019-02-11 02:11:59,846 WARN [org.apache.spark.internal.Logging$class.logWarning(Logging.scala:66)] - Message RemoteProcessDisconnected dropped. RpcEnv already stopped.
2019-03-11 02:11:59,847 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - OutputCommitCoordinator stopped!
2019-03-11 02:11:59,849 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - Successfully stopped SparkContext
2019-03-11 02:11:59,852 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - Shutdown hook called
2019-03-11 02:11:59,853 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - Deleting directory /opt/spark_local_data/spark-400ab519-6921-4bf9-ad76-d9542a33c0b8
2019-03-11 02:12:03,425 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - Replaying log path: hdfs://mycluster/SparkLogs/application_1545699762417_0130.lz4
package com.whty.flume;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.interceptor.Interceptor;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
/**
 * Extracts the date carried in each log record so that files can be grouped by that date.
 * @author Administrator
 */
public class LogETLInterceptor implements Interceptor {
    // Logger used to trace the order in which the interceptor methods are executed
    private static final Logger logger = LoggerFactory.getLogger(LogETLInterceptor.class);
    /**
     * Interceptor constructor. It is called from the build() method of the static inner
     * Builder class to create the custom interceptor instance.
     */
    public LogETLInterceptor() {
        logger.info("---------- custom interceptor constructor executed");
    }
    /**
     * Initializes the interceptor. It runs after the constructor, i.e. once the
     * interceptor instance has been created.
     */
    @Override
    public void initialize() {
        logger.info("---------- custom interceptor initialize() executed");
    }
    /**
     * Processes a single event. This method is not called by Flume automatically;
     * it is normally invoked from the batch method intercept(List events).
     * @param event the event to process
     * @return the processed event
     */
    @Override
    public Event intercept(Event event) {
        logger.info("---------- intercept(Event event) starting, event before processing: " + event);
        /*
         * Event-processing logic goes here.
         */
        Map<String, String> headers = event.getHeaders();
        String bodyContent = new String(event.getBody());
        logger.info("----- processing bodyContent: " + bodyContent);
        // Extract the date from the log line, e.g. bodyContent = "2019-01-11 02:12:03,425 ..."
        String[] split = bodyContent.split(",");
        String dataStr = split[0];
        System.out.println(dataStr);
        String resStr = dataStr.substring(0, 10);
        System.out.println(resStr);
        headers.put("resStr", resStr);
        event.setHeaders(headers);
        logger.info("---------- intercept(Event event) ending, event after processing: " + event);
        return event;
    }
    /**
     * Processes a batch of events: iterate over the list, call intercept(Event event)
     * for each event, and add the result to a new list when it is not null.
     */
    @Override
    public List<Event> intercept(List<Event> events) {
        List<Event> results = new ArrayList<>();
        Event event;
        for (Event e : events) {
            event = intercept(e);
            if (event != null) {
                results.add(event);
            }
        }
        return results;
    }
    /**
     * Executed when the interceptor is destroyed; usually used to release resources.
     */
    @Override
    public void close() {
        logger.info("---------- custom interceptor close() executed");
    }
    /**
     * Static inner class that Flume uses to create the custom interceptor. It implements
     * the Interceptor.Builder interface and its abstract methods.
     */
    public static class Builder implements Interceptor.Builder {
        /**
         * Returns a new instance of the custom interceptor.
         * @return the interceptor instance
         */
        @Override
        public Interceptor build() {
            logger.info("---------- build() executed");
            return new LogETLInterceptor();
        }
        /**
         * Receives the parameters configured for this interceptor in the Flume configuration.
         * @param context gives access to those parameters
         */
        @Override
        public void configure(Context context) {
            logger.info("---------- configure() executed");
            logger.info("------ interceptor parameter desc = " + context.getString("desc"));
        }
    }
}
# Name the components of agent a1
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# Register the custom interceptor
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = com.whty.flume.LogETLInterceptor$Builder
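Only the newly added interceptor settings are shown above; the source, channel, and sink are configured as in the first case. For the date-named directories below to be created, the HDFS sink path must reference the resStr header set by the interceptor. A minimal sketch of the assumed remainder of the configuration (the desc value is only a hypothetical example of the parameter read in Builder.configure()):
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /usr/logs/a.log
a1.sources.r1.interceptors.i1.desc = extract-log-date
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = /flume/events/%{resStr}/
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileType = DataStream
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
With the interceptor compiled and its jar placed on Flume's classpath (for example under Flume's lib directory), start the agent: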
bin/flume-ng agent --conf conf --conf-file conf/test.conf --name a1 -Dflume.root.logger=INFO,console
After starting the agent, the HDFS target path already contains one directory per log date found in the source file:
[root@nd1 flume]# hadoop fs -ls /flume/events/
Found 3 items
drwxr-xr-x - root supergroup 0 2019-01-16 13:42 /flume/events/2019-01-11
drwxr-xr-x - root supergroup 0 2019-01-16 13:42 /flume/events/2019-02-11
drwxr-xr-x - root supergroup 0 2019-01-16 13:42 /flume/events/2019-03-11
echo "2019-04-11 02:11:59,852 INFO [org.apache.spark.internal.Logging$class.logInfo(Logging.scala:54)] - Shutdown hook called" >> a.log
[root@nd1 flume]# hadoop fs -ls /flume/events/
Found 4 items
drwxr-xr-x - root supergroup 0 2019-01-16 13:42 /flume/events/2019-01-11
drwxr-xr-x - root supergroup 0 2019-01-16 13:42 /flume/events/2019-02-11
drwxr-xr-x - root supergroup 0 2019-01-16 13:42 /flume/events/2019-03-11
drwxr-xr-x - root supergroup 0 2019-01-16 13:45 /flume/events/2019-04-11
Checking the contents of these directories, for example 2019-04-11, confirms that each date directory holds exactly the lines of the source log file that carry that date, which matches our expectation:
[root@nd1 flume]# hadoop fs -cat /flume/events/2019-04-11/events-.1547617509543.tmp
2019-04-11 02:11:59,852 INFO [org.apache.spark.internal.Logging.logInfo(Logging.scala:54)] - Shutdown hook called