
CSE 482: Big Data Analysis (Spring 2020)
Homework 2
Due date: Monday, February 19, 2020

Please make sure you submit a PDF version of your homework via D2L.

1. Write the corresponding HDFS commands to perform the tasks described for each question below. Type hadoop fs -help for the list of available HDFS commands. You can also refer to the documentation available at https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html. To double-check your answers, you should test the commands to make sure they work correctly.

(a) Suppose you are connected to a master node on AWS running Hadoop with the Linux operating system. Assume you have created a data directory named logs (on the Linux filesystem of the master node), which currently contains 1000 Web log files to be processed. Write the Hadoop DFS commands needed to upload all the Web log files from the logs directory to the directory named /user/hadoop/data on HDFS. Assume the /user/hadoop/data directory does not yet exist on HDFS; therefore, you need to create the directory first before transferring the files.

(b) Write the HDFS command to move the Web log files from the /user/hadoop/data directory on HDFS to a shared directory named /user/share/ on HDFS. After the move, all the files should be located in the /user/share/data/ directory. Write the HDFS command to list all the files and subdirectories located in the /user/share directory. To make sure the files have been moved, write the corresponding HDFS command to list all the files and subdirectories located in the directory named /user/hadoop to verify that the data subdirectory no longer exists.

(c) Suppose one of the files located in the /user/share/data/ directory, named 2020-01-01.txt, is corrupted. You need to replace the corrupted file with a new file named 2020-01-01-new.txt, which is currently located in the logs/new directory on the local (Linux) filesystem of the AWS master node.
Write the HDFS commands to (1) delete the corrupted file from the /user/share/data/ directory on HDFS, (2) upload the new file from the logs/new directory to the /user/share/data/ directory on HDFS, and (3) rename the new file on HDFS from 2020-01-01-new.txt to 2020-01-01.txt.

(d) Write the HDFS command to display the content of the file 2020-01-01.txt, which is currently stored in the /user/share/data/ directory on HDFS. As the file is huge, write another HDFS command to display only the last kilobyte of the file to standard output.

2. Consider a Hadoop program written to solve each computational problem and dataset described below. State how you would set up the (key, value) pairs as inputs and outputs of its mapper and reducer classes. Assume your Hadoop program uses TextInputFormat as its input format (where each record corresponds to a line of the input file). Since the inputs for the mappers are the same (byte offset, content of the line) for all the problems below, you only have to specify the mappers' outputs as well as the reducers' inputs and outputs. You must also explain the operations performed by the map and reduce functions of the Hadoop program. If the problem requires more than one MapReduce job, you should explain what each job is trying to do along with its input and output key-value pairs. You should solve each computational problem with the minimum number of MapReduce jobs.

Example:
Data set: Collections of text documents.
Problem: Count the frequency of nouns that appear at least 100 times in the documents.
Answer:
(i) Mapper function: Tokenize each line into a set of terms (words), and filter out terms that are not nouns.
(ii) Mapper output: key is a noun, value is 1.
(iii) Reducer input: key is a noun, value is a list of 1's.
(iv) Reduce function: sums up the 1's for each key (noun).
(v) Reducer output: key is a noun, value is the frequency of the noun (filter out the nouns whose frequencies are below 100).

(a) Data set: Car for sale data.
Each line in the data file has 5 columns (seller id, car make, car model, car year, price). For example:
1234,honda,accord,2010,10500
2331,ford,taurus,2005,2400
Problem: Find the median price (over all years) for each make and model of vehicle. For example, the median price for ford taurus could be 8000.

(b) Data set: Netflix movie rental data. Each record in the data file contains the following 4 columns: userID, rental date, movie title, movie genre. For example:
user111 12-20-2019 star_wars scifi
user111 12-21-2019 aladdin animation
user111 12-25-2019 lion_king animation
Problem: Find the favorite movie genre of each user. In the above example, the favorite genre for user111 is animation.

(c) Data set: Youtube subscriber data. Each line in the data file is a 2-tuple (user, subscriber). For example, the following lines in the data file:
john mary
john bob
mary john
show that mary and bob are subscribers of john's Youtube videos.
Problem: Find all pairs of users who subscribe to each other's videos. In the example above, john and mary are such a pair of subscribers, but john and bob are not (since john does not subscribe to bob's videos).

(d) Data set: Loan applicant data. Each line in the data file contains the following attributes: marital status, age group, employment status, home ownership, credit rating, and class (approve/reject). For example:
single, 18-25, employed, none, poor, reject
single, 25-45, employed, yes, good, approve
Problem: Compute the entropy of each attribute (marital status, age group, etc.) with respect to the class variable.

(e) Data set: Document data. Each record in the dataset corresponds to a document with its ID and the set of words that appear in the document. For example, the following records contain the sets of words that appear in documents 12345, 12346, and 12347, respectively:
12345 team won goal result
12346 political party won election result
12347 lunch party restaurant
Problem: Compute the cosine similarity between every pair of documents in the dataset.
Given a pair of documents, say, u and v, their cosine similarity is computed as follows:

cosine(u, v) = n_uv / sqrt(n_u × n_v),

where n_uv is the number of words that appear in both u and v, n_u is the number of words that appear in document u, and n_v is the number of words that appear in document v. For the above example, cosine(12345, 12346) = 2/sqrt(20), whereas cosine(12346, 12347) = 1/sqrt(15).
Hint: You will need two MapReduce (Hadoop) jobs for this problem.

3. Download the data file Titanic.csv from the class Web site. Each line in the data file has the following comma-separated attribute values:
PassengerGroup,Age,Gender,Outcome
For this question, you need to write a Hadoop program that computes the mutual information between every pair of attributes. The reducer output will contain the following key-value pairs:
• key is the name of the attribute pair, e.g., (Age, Outcome).
• value is their mutual information.
Deliverables: Your Hadoop source code (*.java), the archived (jar) files, and the reducer output file, which must have 2 tab-separated columns: the attribute pair and its mutual information value.
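The worked cosine-similarity numbers in Question 2(e) can be reproduced with a short local sketch (plain Python, no Hadoop; the set-based counting mirrors the definitions of n_uv, n_u, and n_v, and the document IDs are the ones from the example):

```python
import math

def cosine(words_u, words_v):
    """Cosine similarity over word sets, as defined in Question 2(e)."""
    u, v = set(words_u), set(words_v)
    n_uv = len(u & v)  # number of words that appear in both documents
    return n_uv / math.sqrt(len(u) * len(v))

docs = {
    "12345": "team won goal result".split(),
    "12346": "political party won election result".split(),
    "12347": "lunch party restaurant".split(),
}

print(cosine(docs["12345"], docs["12346"]))  # 2/sqrt(20): shared words are {won, result}
print(cosine(docs["12346"], docs["12347"]))  # 1/sqrt(15): shared word is {party}
```

In the two-job Hadoop solution hinted at above, the first job would typically produce the per-document word counts (n_u) while the second produces the pairwise overlap counts (n_uv); this snippet only checks the arithmetic of the formula itself.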
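For Question 3, the mutual information between two attributes X and Y can be computed from their joint and marginal counts. A minimal local sketch of that computation follows (plain Python, not the required Hadoop/Java implementation; the sample rows are hypothetical and are not taken from the real Titanic.csv):

```python
import math
from collections import Counter

def mutual_information(pairs):
    """I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) * p(y)) )."""
    n = len(pairs)
    joint = Counter(pairs)             # counts of each (x, y) combination
    px = Counter(x for x, _ in pairs)  # marginal counts of x
    py = Counter(y for _, y in pairs)  # marginal counts of y
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in joint.items()
    )

# Hypothetical (Gender, Outcome) pairs for illustration only:
rows = [("female", "survived"), ("female", "survived"),
        ("male", "died"), ("male", "died")]
print(mutual_information(rows))  # attributes fully dependent -> 1.0 bit
```

In the Hadoop version, the mapper would emit one record per attribute pair per input line, so that each reducer call receives all (value_x, value_y) observations for one attribute pair and can accumulate these same counts before applying the formula.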
