Andy [email protected]
2013/09/28-2013/09/30
Markdown syntax highlighting renders incorrectly on the oschina blog, but works fine on git.oschina.net: http://git.oschina.net/wuerping/notes/blob/master/2013/2013-09-30/AnalyzingTwitterDatawithApacheHadoop.md
This is the first post in the series. It covers how to use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline that can analyze Twitter data.
The Twitter Streaming API outputs tweets in a JSON format which can be arbitrarily complex. CDH (Cloudera's Distribution Including Apache Hadoop) components can be pieced together to build the data pipeline we need to answer the questions we have.

Flume's core abstractions are sources and sinks, connected by events: sources produce events, events travel through a channel from the source to the sink, and the sink is responsible for writing the data to a predefined destination. Flume supports many kinds of sources; a sketch of how the pieces wire together follows.
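A minimal sketch of this wiring in a Flume agent's properties file (the TwitterAgent / Twitter / MemChannel / HDFS names match the config lines quoted later in these notes; the source class is the custom one from Cloudera's cdh-twitter-example repo):

```properties
# name the agent's components
TwitterAgent.sources = Twitter
TwitterAgent.channels = MemChannel
TwitterAgent.sinks = HDFS

# source: produces events (custom Twitter source from the example repo)
TwitterAgent.sources.Twitter.type = com.cloudera.flume.source.TwitterSource
TwitterAgent.sources.Twitter.channels = MemChannel

# channel: carries events from the source to the sink
TwitterAgent.channels.MemChannel.type = memory

# sink: writes events to their final destination, here HDFS
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.channel = MemChannel
```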
Apache Oozie is a workflow coordination system that can be used to solve this problem. It lets you define job workflows, which can be scheduled to run based on a set of criteria. Here, Apache Oozie is used to add a new partition to the Hive table every hour.
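The statement such an hourly workflow runs would look roughly like this (the datehour partition column and the /user/flume/tweets path come from the table definition later in these notes; the ${...} placeholders are illustrative and would be filled in by the Oozie coordinator):

```sql
ALTER TABLE tweets
  ADD IF NOT EXISTS PARTITION (datehour = ${DATEHOUR})
  LOCATION '/user/flume/tweets/${YEAR}/${MONTH}/${DAY}/${HOUR}';
```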
Hive's default is a delimited row format, but our Twitter data is in a JSON format, which will not work with the defaults. We can use the Hive SerDe interface to specify how to interpret what we've loaded. For example:

SELECT created_at, entities, text, user
FROM tweets
WHERE user.screen_name='ParvezJugon'
AND retweeted_status.user.screen_name='ScottOstby';
*This is the second post in the series. Part one covered how to integrate the CDH components into one application; this part goes deeper into each component.*
Sources come in two different flavors; the difference between them is really push versus pull. Event-driven sources typically receive events through mechanisms like callbacks or RPC calls. Pollable sources, in contrast, operate by polling for events every so often in a loop; a sketch follows below.
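To make the pull side concrete, here is a minimal sketch of a pollable source against the Flume NG 1.x API of the time (the class and the pollUpstream helper are made up; only process() and the Status return values follow the real PollableSource interface). TwitterSource itself is the event-driven kind: it registers a callback with the Twitter client rather than polling.

```java
import org.apache.flume.Context;
import org.apache.flume.Event;
import org.apache.flume.EventDeliveryException;
import org.apache.flume.PollableSource;
import org.apache.flume.conf.Configurable;
import org.apache.flume.event.EventBuilder;
import org.apache.flume.source.AbstractSource;

// Pull style: the Flume framework calls process() in a loop and backs off
// when the source reports that nothing is available.
public class ExamplePollableSource extends AbstractSource
    implements Configurable, PollableSource {

  @Override
  public void configure(Context context) {
    // read source-specific properties from the agent config here
  }

  @Override
  public Status process() throws EventDeliveryException {
    String data = pollUpstream(); // hypothetical helper: ask the external system for data
    if (data == null) {
      return Status.BACKOFF;      // nothing to do; the framework sleeps before retrying
    }
    Event event = EventBuilder.withBody(data.getBytes());
    getChannelProcessor().processEvent(event); // hand the event to the channel
    return Status.READY;
  }

  private String pollUpstream() {
    return null; // placeholder; a real source would fetch from its upstream system
  }
}
```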
This example uses the Memory Channel:
TwitterAgent.channels.MemChannel.type = memory
A nice configuration feature: the HDFS sink path can contain timestamp escape sequences:
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
The timestamp information comes from a header that TwitterSource adds to each event:
headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));
/etc/default/flume-ng-agent contains an environment variable named FLUME_AGENT_NAME.
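It has to match the agent name used in the configuration file; assuming the TwitterAgent name from the config lines above, the relevant line would be:

```
FLUME_AGENT_NAME=TwitterAgent
```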
$ /etc/init.d/flume-ng-agent start
The collected tweets land under /user/flume/tweets:
natty@hadoop1:~/source/cdh-twitter-example$ hadoop fs -ls /user/flume/tweets/2012/09/20/05
Found 2 items
-rw-r--r-- 3 flume hadoop 255070 2012-09-20 05:30 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893253
-rw-r--r-- 3 flume hadoop 538616 2012-09-20 05:39 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893254.tmp
Events are first written to a .tmp file; when the event-count or time condition is met, the file is rolled and the .tmp suffix is dropped. The parameters controlling this are rollCount and rollInterval; see the sketch below.
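A sketch of the corresponding HDFS-sink settings (the property names are Flume's; the values here are illustrative, and 0 disables a trigger):

```properties
# roll the file after this many events
TwitterAgent.sinks.HDFS.hdfs.rollCount = 10000
# roll the file after this many seconds
TwitterAgent.sinks.HDFS.hdfs.rollInterval = 600
# do not roll based on file size
TwitterAgent.sinks.HDFS.hdfs.rollSize = 0
```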
This is the third post in the series. It discusses Hive's pros and cons, and argues that Hive is the right choice for this tweet-analysis application.
A key distinction: well-structured data versus unstructured, semi-structured, and poly-structured data. Hive can query the latter kinds through its complex types:
SELECT array_column[0] FROM foo;
SELECT map_column['map_key'] FROM foo;
SELECT struct_column.struct_field FROM foo;
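For reference, a sketch of how columns with those three complex types would be declared (the table and column names match the queries above; the value types are just examples):

```sql
CREATE TABLE foo (
  array_column  ARRAY<STRING>,
  map_column    MAP<STRING, STRING>,
  struct_column STRUCT<struct_field:STRING>
);
```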
Table design:
CREATE EXTERNAL TABLE tweets (
...
retweeted_status STRUCT<
  text:STRING,
  user:STRUCT<screen_name:STRING, name:STRING>>,
entities STRUCT<
  urls:ARRAY<STRUCT<expanded_url:STRING>>,
  user_mentions:ARRAY<STRUCT<screen_name:STRING, name:STRING>>,
  hashtags:ARRAY<STRUCT<text:STRING>>>,
text STRING,
...
)
PARTITIONED BY (datehour INT)
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';
SELECT entities.user_mentions[0].screen_name FROM tweets;
The SerDe is what maps JSON objects to Hive columns. In Hive, SerDe is an abbreviation of Serializer and Deserializer.
If it looks like a duck and sounds like a duck, it must be a duck, right? Not quite: new Hive users should not make the mistake of treating Hive as a relational database.