Walkthrough: Analytics 512, R, Canvas, C/C++

Analytics 512: Take Home Final Exam 2019

200 points in ten problems.

This is the take-home portion of the exam. You may use your notes, your books, all material on the course website, and your computer or any computer in the departmental computer lab. You may also use official documentation for R, built-in or on https://cran.r-project.org/, but no other material on the Internet. Provide proper attribution for all such sources. You may not use any human help, except whatever help is provided by me.

Your solution should consist of two files: an .Rmd file that loads all data and all packages, makes all plots, and contains all comments and explanation, and an .html or .pdf file that is produced by the .Rmd file.

Return your solutions by Friday, 5/10/19, 11:59 PM, in Canvas, or hand in printed copies of both files, or fax both files to 202.687.6067.

Part I: Bikeshare Ridership

The first part of the exam uses data on hourly ridership counts for the Capital Bikeshare system in Washington, DC for the years 2011 and 2012. Use the data frame cabi. The data frame contains time related variables and weather related variables, plus two numerical target variables. Each observation contains data for one hour during these two years, with a few gaps.

The data have been adapted from a set at the UCI repository. Link to the original data set: https://archive.ics.uci.edu/ml/datasets/Bike+Sharing+Dataset
System data of the Capital Bikeshare system are here: https://www.capitalbikeshare.com/system-data

Time related variables

season: a categorical variable with values 1 (for January - March), 2 (April - June), 3 (July - September), 4 (October - December)
year: with values 2011 and 2012
month: a categorical variable with values 1, 2, ..., 12
wday: which is 0 for weekends and holidays and 1 otherwise
hr: a numerical variable with values 0, 1, ..., 23

Weather related variables

temp: scaled temperature
atemp: scaled perceived temperature
hum: scaled humidity
windspeed: scaled wind speed
weather: a categorical variable with values 1 (e.g. clear or few clouds or partly cloudy), 2 (e.g. cloudy or broken clouds or foggy), 3 (e.g. snow or rain or thunderstorm)

Target variables

The bikeshare system has casual riders who rent bicycles on the spot (e.g., tourists) and registered riders who have a subscription (e.g., commuters).

casual: total number of casual riders during this hour
registered: total number of registered riders during this hour

Problem 1 (20)

Use numerical summaries, graphs, etc. to answer the following questions. No model fitting or other statistical procedures are required for this. Each graph should help answer one or more of these questions and should be accompanied by explanations.

(a) How do ridership counts depend on the year? The month? The hour of the day? How do casual and registered riders differ in this respect?
(b) How are casual and registered ridership counts related? Does this depend on the year? Does it depend on the type of day (working day or not)?
(c) Is there an association between the weather situation and ridership counts? For casual riders? For registered riders?
(d) There are relations between time related predictors and weather related predictors. Demonstrate this with a few suitable graphs.

For problems 2-4, split the data into a training set (70%) and a test set (30%).

Problem 2 (25)

(a) Fit a multiple regression to predict registered ridership from the other variables (excluding casual ridership), using the training data. Identify the significant variables and comment on their coefficients.
(b) Estimate the RMS prediction error of this model using the test set.
(c) Does the RMS prediction error depend on the month? Answer this question using the test data and suitable tables or graphs.
(d) Make copies of the training and test data in which hr is a categorical variable. Fit a multiple regression model. Compare the summary of this model to the one from part (a). Also estimate the RMS prediction error from the test set.

Problem 3 (30)

Use the original cabi data for this problem.

(a) Train artificial neural networks with various numbers of nodes in the hidden layer to predict registered ridership. Use the training data and only weather related variables. Recommend a suitable number of nodes, with explanation.
(b) Repeat part (a), using only time related variables.
(c) Repeat part (a), using two time related and two weather related variables. Explain your choice of variables.

Problem 4 (10)

What do you think are six useful predictors? Use any method you want to answer this question.

Part II: Vegetation Cover

Problems 5 - 8 use data on vegetation cover. Use the data frames covtype.train and covtype.test. The original data are at https://archive.ics.uci.edu/ml/datasets/Covertype

Each data set contains 10,000 observations of 55 variables. These have been collected on 30m × 30m patches of hilly forest land by the US Forest Service.

elev = elevation in meters, slope = slope of the terrain in degrees, aspect = direction of the slope in degrees
h_dist_hydro, v_dist_hydro = horizontal and vertical distance to nearest water feature in meters
h_dist_road = horizontal distance to nearest roadway in meters
hillshade_9, hillshade_12, hillshade_3 = index for hill shade at 9 AM, 12 noon, 3 PM, at summer solstice
h_dist_fire = horizontal distance to nearest wildfire ignition point in meters
wild1, ..., wild4 = binary indicator variables for wilderness designation
soil1, ..., soil40 = binary indicator variables for soil type
cover = target variable (type of forest cover), with values 2 and 3

Problem 5 (20)

Fit a logistic model to the training data in order to separate the classes. Choose a classification threshold so that sensitivity and specificity are approximately the same on the training data. Then report sensitivity, specificity, and overall error rate for the test data.

Problem 6 (25)

Fit a support vector machine with radial kernels in order to separate the classes. Tune the cost and gamma parameters so that cross validation gives the best performance on the training data. Then assess the resulting model on the test data. Report sensitivity, specificity, and overall error rate for training and test data.

Problem 7 (10)

Fit a decision tree to the training data in order to separate the two classes. Prune the tree using cross validation and make sure that there are no redundant splits (i.e. splits that lead to leaves with the same classification). Then estimate the classification error rate for the pruned tree from the test data.

Problem 8 (20)

Fit a random forest model to the training data in order to separate the classes. Identify the ten most important variables and fit another random forest model, using only these variables. Use the test data to decide which model has better performance.

Part III: MNIST Digit Data

Problems 9 and 10 use the MNIST image classification data, available as mnist_all.RData in Canvas. We use only the test data (10,000 images).

Problem 9 (20)

(a) Select a random subset of 1000 digits. Use hierarchical clustering with complete linkage on these images and visualize the dendrogram.
(b) Does the dendrogram provide compelling evidence about the "correct" number of clusters? Explain your answer.
(c) Cut the dendrogram to generate a set of clusters that appears to be reasonable. There should be between 5 and 15 clusters. Then find a way to create a visual representation (i.e. a typical image) of each cluster. Explain and describe your approach.

Problem 10 (20)

Use Principal Component Analysis on the MNIST images.

(a) Make a plot of the proportion of variance explained vs. number of principal components. Which fraction of the variance is explained by the first two principal components? Which fraction is explained by the first ten principal components?
(b) Plot the scores of the first two principal components of all digits against each other, color coded by the digit that is represented. Comment on the plot. Does it appear that digits may be separated by these scores?
(c) Find three digits which are reasonably well separated by the plot that you made in part (b). Illustrate this with a color coded plot like the one in (b) for just these three digits. Don't expect perfect separation.
(d) Find three other digits which are not well separated by the plot that you made in part (b). Illustrate this with another color coded plot like the one in (b) for just these three digits.

Reposted from: http://www.7daixie.com/2019050956562748.html
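The 70%/30% split and the test-set RMS prediction error asked for in Problems 2-4 can be sketched as follows. This is only an illustration under stated assumptions: the cabi data frame is not loaded here, so the code builds a small synthetic stand-in called toy (a hypothetical name, with made-up values for hr, temp, and registered), and the model formula is chosen for the sketch, not taken from the exam.

```r
# Sketch only: "toy" is a synthetic stand-in for cabi (hypothetical values).
set.seed(512)
n <- 1000
toy <- data.frame(
  hr   = sample(0:23, n, replace = TRUE),  # hour of day
  temp = runif(n)                          # scaled temperature in [0, 1]
)
toy$registered <- 50 + 4 * toy$hr + 100 * toy$temp + rnorm(n, sd = 20)

# 70% / 30% split into training and test sets.
train_idx <- sample(seq_len(n), size = round(0.7 * n))
train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Multiple regression fitted on the training data only.
fit <- lm(registered ~ hr + temp, data = train)

# RMS prediction error, estimated on the held-out test set.
pred <- predict(fit, newdata = test)
rmse <- sqrt(mean((test$registered - pred)^2))
rmse
```

With the real data, the same pattern would apply with cabi in place of toy and the full set of predictors in the formula. A per-month breakdown of the error, as in Problem 2(c), could group the squared test residuals by month before averaging and taking the square root.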
