Projects_Tweets Analysis

This is a course project on real-time big data analysis. The idea is to use tweets to extract trending topics and to analyze political popularity with Hadoop MapReduce. About 120,000 tweets were collected through the Twitter Streaming API in JSON format. The data was loaded into HDFS on the NYU HPC clusters to take advantage of their computing power.

In the map task, each tweet is parsed into a key/value structure, with displayURL, hashtagEntities, and text as keys and their corresponding contents as values. The mapper marks each occurrence of an entity and sends it to the reducer, which produces the total count for each displayURL or hashtag entity. A second map/reduce job then orders the entities by count. The result is the most-viewed links during that time window, along with the top hashtags. The whole map/reduce computation over the 4 GB of data takes about 18 seconds, including JVM launching for each map task.
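A minimal sketch of the first counting job is given below. It assumes each HDFS input line holds one tweet as JSON with displayURL and hashtagEntities fields and that org.json is used for parsing; the field layout and class names are illustrative, not the project's actual code.

// Sketch of the first map/reduce job: count each displayURL and hashtag.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.json.JSONArray;
import org.json.JSONObject;

public class EntityCount {

    // Mapper: parse one tweet, emit (entity, 1) for its URL and each hashtag.
    public static class EntityMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text entity = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            JSONObject tweet = new JSONObject(value.toString());

            if (tweet.has("displayURL")) {
                entity.set("url:" + tweet.getString("displayURL"));
                context.write(entity, ONE);
            }
            // Assumed here to be an array of hashtag strings.
            if (tweet.has("hashtagEntities")) {
                JSONArray tags = tweet.getJSONArray("hashtagEntities");
                for (int i = 0; i < tags.length(); i++) {
                    entity.set("tag:" + tags.getString(i));
                    context.write(entity, ONE);
                }
            }
        }
    }

    // Reducer: sum the occurrences of each displayURL / hashtag.
    public static class SumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }
}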

For the political popularity analysis of the presidential election, I use text matching to count how many people mentioned each politician. This gives a general sense of how popular each candidate is on social media.
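A minimal sketch of that text-matching mapper follows; the hard-coded candidate list and the text field access are hypothetical placeholders, and the per-candidate counts can be totaled by the same kind of summing reducer as in the first job.

// Sketch of the mention-counting mapper for the popularity analysis.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.json.JSONObject;

public class PoliticianMentionMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    // Illustrative candidate list; replace with the politicians actually tracked.
    private static final String[] CANDIDATES = {"trump", "clinton", "sanders"};
    private static final IntWritable ONE = new IntWritable(1);
    private final Text name = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        JSONObject tweet = new JSONObject(value.toString());
        if (!tweet.has("text")) {
            return;
        }
        String text = tweet.getString("text").toLowerCase();

        // Emit (candidate, 1) whenever the tweet text mentions the candidate's name.
        for (String candidate : CANDIDATES) {
            if (text.contains(candidate)) {
                name.set(candidate);
                context.write(name, ONE);
            }
        }
    }
}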

 

Obstacle: advertisements.

HashMap = { "displayURL"      : "some_url_here",
            "hashtagEntities" : "#tag1, #tag2, ...",
            "text"            : "something_typed_here" }
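The second map/reduce mentioned above can be sketched as follows, assuming the first job writes tab-separated "entity<TAB>count" lines: the mapper swaps the pair so the shuffle sorts by count, and the reducer writes the entities back out in that order. Hadoop sorts IntWritable keys ascending by default, so a descending comparator or a final reversal (not shown) would be needed to list the top links and hashtags first.

// Sketch of the second map/reduce job: rank entities by their counts.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class EntityRank {

    // Mapper: swap (entity, count) into (count, entity) so the shuffle sorts by count.
    public static class SwapMapper
            extends Mapper<LongWritable, Text, IntWritable, Text> {
        private final IntWritable count = new IntWritable();
        private final Text entity = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String[] parts = value.toString().split("\t");
            if (parts.length != 2) {
                return;
            }
            entity.set(parts[0]);
            count.set(Integer.parseInt(parts[1]));
            context.write(count, entity);
        }
    }

    // Reducer: write entities back out in sorted order of their counts.
    public static class RankReducer
            extends Reducer<IntWritable, Text, Text, IntWritable> {
        @Override
        protected void reduce(IntWritable count, Iterable<Text> entities, Context context)
                throws IOException, InterruptedException {
            for (Text entity : entities) {
                context.write(entity, count);
            }
        }
    }
}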
