
STAT 7008 - Assignment 2
Due Date: 31 Oct 2018

Question 1 (hashtag analysis)

1. tweets1.json corresponds to tweets received before a Presidential debate in 2016, and all other data files correspond to tweets received immediately after the same debate. Please download the files from the following link: https://transfernow.net/912g42y1vs78
Write code to read the data files tweets1.json to tweets5.json, and combine tweets2.json to tweets5.json into a single file named tweets2.json. Determine the number of tweets in tweets1.json and tweets2.json.
2. To clean the tweets in each file with a focus on extracting hashtags, note that retweeted_status is another tweet nested within a tweet. Select tweets using the following criteria:
- A non-empty set of hashtags, either in entities or in the retweeted_status.
- A timestamp is present.
- A legitimate location is present.
- Extract hashtags written in English, or convert hashtags written partially in English (ignore non-English characters).
Write a function to return a dictionary of acceptable tweets, locations and hashtags for tweets1.json and tweets2.json respectively.
3. Write a function to extract the top n tweeted hashtags from a given hashtag list. Use it to find the top n tweeted hashtags of tweets1.json and tweets2.json respectively.
4. Write a function to return a data frame containing the top n tweeted hashtags of a given hashtag list. The columns of the returned data frame are hashtag and freq.
5. Use the function to produce a horizontal bar chart of the top n tweeted hashtags of tweets1.json and tweets2.json respectively.
6. Find the max time and min time of tweets1.json and tweets2.json respectively.
7. For each interval defined by (min time, max time), divide it into 10 equally spaced periods.
8. For a given collection of tweets, write a function to return a data frame with two columns: hashtags and their time of creation.
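A minimal sketch of the reading and combining in part 1, assuming each file stores one JSON-encoded tweet per line (the usual streaming-API dump format; adjust the parsing if the files hold a single JSON array instead):

```python
import json

def parse_tweets(lines):
    """Parse an iterable of JSON-encoded tweet strings,
    skipping blank or malformed entries."""
    tweets = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            tweets.append(json.loads(line))
        except json.JSONDecodeError:
            continue
    return tweets

def read_tweet_file(path):
    """Read one tweets*.json file from disk."""
    with open(path, encoding="utf-8") as f:
        return parse_tweets(f)

def combine_files(paths, out_path):
    """Concatenate several tweet files into a single
    line-delimited file and return the combined tweet list."""
    combined = []
    for path in paths:
        combined.extend(read_tweet_file(path))
    with open(out_path, "w", encoding="utf-8") as f:
        for tweet in combined:
            f.write(json.dumps(tweet) + "\n")
    return combined
```

With the real files, `combine_files([f"tweets{i}.json" for i in range(2, 6)], "tweets2.json")` builds the merged file, and `len()` of each returned list gives the tweet counts asked for.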
Use the function to produce data frames for tweets1.json and tweets2.json. Using pandas.cut or otherwise, create a third column, level, in each data frame which cuts the time of creation by the corresponding interval obtained in part 7.
9. Using pandas.pivot or otherwise, create a numpy array or a pandas data frame whose rows are the time periods defined in part 7 and whose columns are hashtags. The entry for the ith time period and jth hashtag is the number of occurrences of the jth hashtag in the ith time period. Fill entries without data with zero. Do this for tweets1.json and tweets2.json respectively.
10. Following part 9, what is the number of occurrences of the hashtag trump in the sixth period in tweets1.json? What is the number of occurrences of the hashtag trump in the eighth period in tweets2.json?
11. Using the tables obtained in part 9, we can also find the total number of occurrences of each hashtag. Rank these hashtags in decreasing order and produce a time plot for the top 20 hashtags in a single graph. Rescale the graph so that it is neither too small nor too large. Do this for both tweets1.json and tweets2.json respectively.
12. zip_codes_states.csv contains the city, state, county, latitude and longitude of US locations. Read the file.
13. Select tweets in tweets1.json and tweets2.json whose locations appear in zip_codes_states.csv. Remove also the location london.
14. Find the top 20 tweeted locations in both tweets1.json and tweets2.json respectively.
15. Since there are multiple (lon, lat) pairs for each location, write a function to return the average lon and average lat of a given location. Use the function to generate the average lon and average lat for every location in tweets1.json and tweets2.json.
16. Combine tweets1.json and tweets2.json. Then create data frames which contain locations, counts, longitude and latitude for tweets1.json and tweets2.json.
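Parts 7-9 can be sketched with pandas.cut and pivot_table. The tiny inline data frame below is a made-up stand-in for the real hashtag/creation-time frame of part 8:

```python
import pandas as pd

# Hypothetical stand-in for the part-8 data frame (hashtag, time of creation).
df = pd.DataFrame({
    "hashtag": ["trump", "debate", "trump", "vote", "trump"],
    "created": pd.to_datetime([
        "2016-09-26 20:00", "2016-09-26 20:30", "2016-09-26 21:10",
        "2016-09-26 21:40", "2016-09-26 22:50",
    ]),
})

# Part 7: ten equally spaced periods between min and max time (11 edges).
edges = pd.date_range(df["created"].min(), df["created"].max(), periods=11)

# Part 8: a third column, level, holding the period each tweet falls into.
df["level"] = pd.cut(df["created"], bins=edges, include_lowest=True)

# Part 9: rows = periods, columns = hashtags, entries = counts, gaps = 0.
table = df.pivot_table(index="level", columns="hashtag",
                       aggfunc="size", fill_value=0, observed=False)
```

`observed=False` keeps all ten periods as rows even when a period contains no tweets, and `fill_value=0` supplies the required zeros for empty entries.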
17. Using the shapefile of US states, st99_d00, and with the help of the website https://stackoverflow.com/questions/39742305/how-to-use-basemap-python-to-plot-us-with-50-states, produce the following graphs.
18. (Optional) Using polygon patches and the help of the same website, produce the following graph.

Question 3 (extract hurricane paths)

The website http://weather.unisys.com provides hurricane path data from 1850. We want to extract the hurricane paths for a given year.

1. Since the link containing the hurricane information varies with the year, and the information is spread over multiple pages, we need to know the starting page and the total number of pages for a given year. What is the appropriate starting page for year = 2017?
2. To solve this part, we try inputting a large number as the number of pages for a given year. Using an appropriate number, write a function to extract all links, each of which holds information on a hurricane in 2017.
3. Some of the collected links provide summaries of hurricanes which do not lead to correct tables. Remove those links.
4. Each valid hurricane link contains four pieces of information:
- Date
- Hurricane classification
- Hurricane name
- A table of hurricane positions over dates
Since the entire information is contained in a text file provided on the corresponding webpage, write a function to download and read the text file (without saving it to a local directory); at this point, you do not need to convert the data to another format.
5. With the downloaded contents, write a function to convert the contents to a list of dictionaries. Each dictionary in the list contains the following keys: Date, Category of the hurricane, Name of the hurricane, and a table of information for the hurricane path. Convert the Date in each dictionary to a datetime object.
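For part 5, a parsing sketch. The exact layout of the downloaded text files is an assumption here (a Date: line, a classification-and-name line, a column header, then one row per recorded position); adapt the line offsets to whatever the real files contain:

```python
def parse_track(text):
    """Convert one hurricane text file into a dictionary with the keys
    required in part 5.  The assumed layout is:
        Date: 30 AUG-12 SEP 2017
        Hurricane-5 IRMA
        ADV  LAT  LON  TIME  WIND  PR  STAT
        ... one row per position ...
    """
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    date = lines[0].split(":", 1)[1].strip()
    category, name = lines[1].split(None, 1)
    header = lines[2].split()
    table = []
    for row in lines[3:]:
        # STAT may contain spaces ("TROPICAL STORM"), so cap the splits.
        parts = row.split(None, len(header) - 1)
        table.append(dict(zip(header, parts)))
    return {"Date": date, "Category": category, "Name": name, "Table": table}
```

Mapping each file through this function yields the required list of dictionaries, one per hurricane.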
Since the recorded times for the hurricane paths use Z-time, convert them to datetime objects with the help of http://www.theweatherprediction.com/basic/ztime/.
6. Some tables have missing data in the Wind column. Since the classification of a hurricane at a given moment can be found in the Status column of the same table, and the classification relates to the wind speed at that moment, use the classification to impute the missing wind data. You may want to read https://en.wikipedia.org/wiki/Tropical_cyclone_scales.
7. Plot the hurricane paths of the year 2017, sized by wind speed and colored by classification status. If you produce your graph in a creative way, bonus marks will be given.
8. (Optional) Convert the above functions into functions of the year, so that when the year changes you can easily generate a plot of the hurricane paths for that year.
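The Z-time conversion in part 5 and the imputation in part 6 can be sketched as follows. The 'MM/DD/HHZ' time format and the per-status wind values are assumptions: the speeds are rough Saffir-Simpson mid-range figures in knots, not official values.

```python
from datetime import datetime, timezone

def ztime_to_datetime(ztime, year):
    """Convert an 'MM/DD/HHZ' Z-time string to a timezone-aware datetime.
    Z-time is simply UTC, so only the timezone tag needs attaching."""
    month, day, hour = ztime.rstrip("Zz").split("/")
    return datetime(year, int(month), int(day), int(hour), tzinfo=timezone.utc)

# Representative wind speeds (knots) per status -- illustrative values
# based roughly on the Saffir-Simpson ranges, not official figures.
STATUS_WIND = {
    "TROPICAL DEPRESSION": 30,
    "TROPICAL STORM": 50,
    "HURRICANE-1": 73,
    "HURRICANE-2": 89,
    "HURRICANE-3": 104,
    "HURRICANE-4": 124,
    "HURRICANE-5": 145,
}

def impute_wind(wind, status):
    """Return wind unchanged when present, else a typical value for the status."""
    if wind is not None:
        return wind
    return STATUS_WIND.get(status.upper())
```

The year has to be passed in separately because the Z-time strings in the tables carry only month, day and hour.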
