#Note# Analyzing Twitter Data with Apache Hadoop series, Parts 1, 2, and 3

Analyzing Twitter Data with Apache Hadoop

  • by Jon Natkins September 19, 2012
  • http://blog.cloudera.com/blog/2012/09/analyzing-twitter-data-with-hadoop/

This is the first post in the series. It covers how to use Apache Flume, Apache HDFS, Apache Oozie, and Apache Hive to design an end-to-end data pipeline for analyzing Twitter data.

  • The related code is on Cloudera's GitHub.

Who is Influential?

  • Now we know the question we want to ask: Which Twitter users get the most retweets? Who is influential within our industry?
  • In knockoff terms: find out who the "Big V" accounts (high-profile verified users) are.

How Do We Answer These Questions?

  • However, querying Twitter data in a traditional RDBMS is inconvenient, since the Twitter Streaming API outputs tweets in a JSON format which can be arbitrarily complex.
  • In short: a traditional database could be used, but the tweets come out of the Twitter Streaming API as complex JSON, which makes it awkward.

  • The diagram above shows a high-level view of how some of the CDH (Cloudera’s Distribution Including Apache Hadoop) components can be pieced together to build the data pipeline we need to answer the questions we have.

Gathering Data with Apache Flume

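  • The two ends of a data flow are sources and sinks.
  • Each individual piece of data (here, a tweet) is called an event.
  • Sources produce events; events travel through a channel from the source to the sink.
  • The sink is responsible for writing the data to a predefined location.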
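
To make the source, channel, sink flow above concrete, here is a minimal sketch in plain Python. It does not use Flume's API; `run_pipeline` and the event dictionary layout are hypothetical names chosen only for illustration.

```python
from queue import Queue


def run_pipeline(raw_tweets):
    """Move events from a source, through a channel, to a sink."""
    channel = Queue()  # the channel buffers events between source and sink
    written = []       # stands in for the sink's predefined destination

    # Source: wrap each tweet in an event and put it on the channel.
    for tweet in raw_tweets:
        channel.put({"headers": {}, "body": tweet})

    # Sink: drain the channel and write each event body out.
    while not channel.empty():
        event = channel.get()
        written.append(event["body"])
    return written


print(run_pipeline(["tweet one", "tweet two"]))
```

The point of the channel is decoupling: the source only knows how to put events in, and the sink only knows how to take them out.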

Sources supported by Flume

  • http://flume.apache.org/FlumeUserGuide.html#flume-sources
    • Flume Sources
    • Avro Source
    • Thrift Source
    • NetCat Source
    • Syslog Sources
    • HTTP Source
    • Scribe Source

Partition Management with Oozie

  • Apache Oozie is a workflow coordination system that can be used to solve this problem.
  • Oozie is an extremely flexible system for designing job workflows, which can be scheduled to run based on a set of criteria.
  • We can configure the workflow to run an ALTER TABLE command that adds a partition containing the last hour’s worth of data into Hive, and we can instruct the workflow to occur every hour.

Apache Oozie is used to add a partition every hour.
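
As a rough illustration of what the hourly workflow has to produce, the sketch below builds the ALTER TABLE statement for the last complete hour. The partition column name `datehour`, the function names, and the base path are assumptions made for this example; the notes above only say that a partition holding the last hour's data is added.

```python
from datetime import datetime, timedelta


def last_full_hour(now):
    """Return the start of the most recent complete hour before `now`."""
    return now.replace(minute=0, second=0, microsecond=0) - timedelta(hours=1)


def add_partition_statement(table, hour_start, base_path):
    """Build the HiveQL that registers one hour of data as a partition."""
    datehour = hour_start.strftime("%Y%m%d%H")  # assumed partition key format
    location = f"{base_path}/{hour_start:%Y/%m/%d/%H}"
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (datehour = {datehour}) LOCATION '{location}'"
    )


hour = last_full_hour(datetime(2012, 9, 20, 6, 15))
print(add_partition_statement("tweets", hour, "/user/flume/tweets"))
```

In the real pipeline Oozie supplies the time values and triggers the statement; this sketch only shows the shape of the command.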

Querying Complex Data with Hive

  • Hive expects that input files use a delimited row format, but our Twitter data is in a JSON format, which will not work with the defaults.
  • The schema is only really enforced when we read the data, and we can use the Hive SerDe interface to specify how to interpret what we’ve loaded.
  • Hive defaults to a delimited row format, so how is JSON handled? With a Hive SerDe. The sample JSON is too long to include here; see the original post.
  • An example query:
SELECT created_at, entities, text, user
FROM tweets
WHERE user.screen_name='ParvezJugon'
  AND retweeted_status.user.screen_name='ScottOstby';

Some Results

  • A more complex query


Analyzing Twitter Data with Apache Hadoop, Part 2: Gathering Data with Flume

  • by Jon Natkins October 21, 2012
  • http://blog.cloudera.com/blog/2012/10/analyzing-twitter-data-with-hadoop-part-2-gathering-data-with-flume/

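*This is the second post in the series. Part 1 covered how to combine the CDH components into one application; this part starts going through the individual components in depth.*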



  • event-driven
  • pollable


  • Event-driven sources typically receive events through mechanisms like callbacks or RPC calls.
  • Pollable sources, in contrast, operate by polling for events every so often in a loop.
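
The difference between the two source styles can be sketched as follows. These classes are hypothetical stand-ins written in plain Python, not Flume's own interfaces; a simple list plays the role of the channel.

```python
class EventDrivenSource:
    """Receives events when an outside party calls back into it."""

    def __init__(self, channel):
        self.channel = channel

    def on_event(self, body):
        # Invoked by a callback or RPC handler; the source itself never loops.
        self.channel.append(body)


class PollableSource:
    """Is asked repeatedly, in a loop, whether it has new events."""

    def __init__(self, channel, backlog):
        self.channel = channel
        self.backlog = list(backlog)

    def process(self):
        # Returns True if an event was delivered, False if there was nothing to do.
        if not self.backlog:
            return False
        self.channel.append(self.backlog.pop(0))
        return True


channel = []
EventDrivenSource(channel).on_event("pushed tweet")
poller = PollableSource(channel, ["polled tweet"])
while poller.process():  # the polling loop lives outside the source
    pass
print(channel)
```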

Examining the TwitterSource

Configuring the Flume Agent


This example uses a Memory Channel:

TwitterAgent.channels.MemChannel.type = memory
  • http://flume.apache.org/FlumeUserGuide.html#flume-channels
    • Flume Channels
    • Memory Channel
    • JDBC Channel
    • File Channel
    • Pseudo Transaction Channel
    • Custom Channel



TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/

The timestamp information used in the path comes from a header that TwitterSource adds to each event:

headers.put("timestamp", String.valueOf(status.getCreatedAt().getTime()));
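
To see how a millisecond timestamp header turns into the %Y/%m/%d/%H directory layout, here is a small sketch. `expand_path` is a hypothetical helper, and it uses UTC so the result is deterministic; on a real agent the time zone in effect decides which hour directory an event lands in.

```python
from datetime import datetime, timezone


def expand_path(template, headers):
    """Fill the date escapes in `template` from the event's timestamp header."""
    ts_ms = int(headers["timestamp"])  # milliseconds since the epoch, as a string
    when = datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)
    return when.strftime(template)


headers = {"timestamp": "1348099200000"}
print(expand_path("/user/flume/tweets/%Y/%m/%d/%H/", headers))
```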

Starting the Agent


$ /etc/init.d/flume-ng-agent start


natty@hadoop1:~/source/cdh-twitter-example$ hadoop fs -ls /user/flume/tweets/2012/09/20/05
  Found 2 items
  -rw-r--r--   3 flume hadoop   255070 2012-09-20 05:30 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893253
  -rw-r--r--   3 flume hadoop   538616 2012-09-20 05:39 /user/flume/tweets/2012/09/20/05/FlumeData.1348143893254.tmp

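Data is first written to a .tmp file. Once the event-count or time condition is met, the file is rolled, that is, moved to its final name. The relevant parameters are rollCount and rollInterval.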
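
The roll decision can be sketched like this. The function names are hypothetical and the logic is a simplification for illustration, not Flume's implementation; it assumes that a value of 0 disables the corresponding condition.

```python
def should_roll(events_written, seconds_open, roll_count, roll_interval):
    """Decide whether the current .tmp file should be closed and renamed."""
    by_count = roll_count > 0 and events_written >= roll_count
    by_time = roll_interval > 0 and seconds_open >= roll_interval
    return by_count or by_time


def final_name(tmp_name):
    """Drop the .tmp suffix, as happens when a file is rolled."""
    return tmp_name[: -len(".tmp")] if tmp_name.endswith(".tmp") else tmp_name


if should_roll(events_written=10, seconds_open=5, roll_count=10, roll_interval=30):
    print(final_name("FlumeData.1348143893254.tmp"))
```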


Analyzing Twitter Data with Apache Hadoop, Part 3: Querying Semi-structured Data with Apache Hive

  • by Jon Natkins November 13, 2012
  • http://blog.cloudera.com/blog/2012/11/analyzing-twitter-data-with-hadoop-part-3-querying-semi-structured-data-with-hive/


Characterizing Data

  • well-structured
  • unstructured, semi-structured, and poly-structured

Complex Data Structures

SELECT array_column[0] FROM foo;
SELECT map_column['map_key'] FROM foo;

SELECT struct_column.struct_field FROM foo;

A Table for Tweets


 retweeted_status STRUCT<
 entities STRUCT<
 text STRING,
ROW FORMAT SERDE 'com.cloudera.hive.serde.JSONSerDe'
LOCATION '/user/flume/tweets';

SELECT entities.user_mentions[0].screen_name FROM tweets;

JSON objects are mapped to Hive columns.

Serializers and Deserializers

In Hive, SerDe is short for Serializer and Deserializer.
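
The two halves of a SerDe can be illustrated with a toy version in Python. This is not the JSONSerDe from the posts, and `COLUMNS`, `deserialize`, and `serialize` are hypothetical names; the sketch only shows the idea of turning a stored JSON line into a row of column values and back, assuming that fields missing from the JSON become nulls.

```python
import json

# Column names the table declares; the schema is applied when data is read.
COLUMNS = ["created_at", "text", "user"]


def deserialize(line):
    """Turn one stored JSON line into a row of column values."""
    record = json.loads(line)
    # Missing fields become None; fields the table does not declare are ignored.
    return [record.get(name) for name in COLUMNS]


def serialize(row):
    """Turn a row of column values back into a JSON line."""
    return json.dumps(dict(zip(COLUMNS, row)))


line = '{"created_at": "Thu Sep 20 05:24:53 +0000 2012", "text": "hello", "extra": 1}'
row = deserialize(line)
print(row)
print(serialize(row))
```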

Putting It All Together

One Thing to Watch Out For…

If it looks like a duck and sounds like a duck, then it must be a duck, right? Not here: users new to Hive should not mistake it for a relational database.

