Explanation: FS19 STT481, Java, Python, C/C++, Data Analysis

FS19 STT481: Homework 3
(Due: Wednesday, November 6th, beginning of the class.)
100 points total

1. (20 pts) Finish the swirl course "Exploratory Data Analysis". Finish Sections 1-10 (no need to do 11-15). You can install and go to the course by using the following command lines.

    library(swirl)
    install_course("Exploratory_Data_Analysis")
    swirl()

2. (20 pts) In this question, we are going to perform cross-validation methods in order to choose a better logistic regression model.

Consider the Weekly data set, where we want to predict Direction using the Lag1 and Lag2 predictors. To load the Weekly data set, use the following command lines.

    library(ISLR)
    data(Weekly)

Suppose now I have two candidate models:

(i) $\log \frac{\Pr(\text{Direction} = \text{"Up"})}{1 - \Pr(\text{Direction} = \text{"Up"})} = \beta_0 + \beta_1 \text{Lag1} + \beta_2 \text{Lag2}$;

(ii) $\log \frac{\Pr(\text{Direction} = \text{"Up"})}{1 - \Pr(\text{Direction} = \text{"Up"})} = \beta_0 + \beta_1 \text{Lag1} + \beta_2 \text{Lag2} + \beta_3 \text{Lag1}^2 + \beta_4 \text{Lag2}^2$.

(a) For each model, compute the LOOCV estimate for the test error by following the steps:

Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

i. Fit a logistic regression model using all but the ith observation to predict Direction using Lag1 and Lag2 for model (i) and using Lag1, Lag2, I(Lag1^2), I(Lag2^2) for model (ii).
ii. Compute the posterior probability of the market moving up for the ith observation.
iii. Use the posterior probability for the ith observation and use the threshold 0.5 in order to predict whether or not the market moves up.
iv. Determine whether or not an error was made in predicting the direction for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

Take the average of the n numbers obtained in iv in order to obtain the LOOCV estimate for the test error.

(b) Comment on the results. Which of the two models appears to provide the better result on this data based on the LOOCV estimates?

(c) The cv.glm function can be used to compute the LOOCV test error estimate. Run the following command lines and see whether the results are the same as the ones you obtained in (a).

    library(boot)
    # Since the response is a binary variable an
    # appropriate cost function for cv.glm is
    cost 0.5)
    glm.fit
    cv.error.1
    glm.fit
    cv.error.2

(d) For each model, compute the 10-fold CV estimate for the test error by following the steps:

Run the following command lines.

    set.seed(1) ## the seed can be arbitrary but we use 1 for the sake of consistency
    fold.index

Write a for loop from i = 1 to i = 10 and in each loop, perform each of the following steps:

i. Fit a logistic regression model using all but the observations that satisfy fold.index==i to predict Direction using Lag1 and Lag2 for model (i) and using Lag1, Lag2, I(Lag1^2), I(Lag2^2) for model (ii).
ii. Compute the posterior probability of the market moving up for the observations that satisfy fold.index==i.
iii. Use the posterior probabilities for the observations that satisfy fold.index==i and use the threshold 0.5 in order to predict whether or not the market moves up.
iv. Compute the error rate in predicting Direction for those observations that satisfy fold.index==i.

Take the average of the 10 numbers obtained in iv in order to obtain the 10-fold CV estimate for the test error.

(e) Comment on the results. Which of the two models appears to provide the better result on this data based on the 10-fold CV estimates?

(f) The cv.glm function can be used to compute the 10-fold CV test error estimate. Run the following command lines and see whether the results are the same as the ones you obtained in (d). If they are not the same, what is the reason?

    library(boot)
    # Since the response is a binary variable an
    # appropriate cost function for cv.glm is
    cost 0.5)
    glm.fit
    cv.error.1
    glm.fit
    cv.error.2

(g) Comment on the computation costs for LOOCV and 10-fold CV. Which one is faster in your implementation in (a) and (d)?

3. (20 pts) In this question, we are going to perform cross-validation methods to determine the tuning parameter K for KNN.

Consider the Default data set, where we want to predict default using the student, balance, and income predictors. Since student is a qualitative predictor, we want to use a dummy variable for it and standardize the data using the scale function. To load the Default data set and standardize the data, use the following command lines.

    library(ISLR)
    data(Default)
    X
    X[,student]
    X
    y

Suppose now the candidate tuning parameter K's for KNN are K = 1, 5, 10, 15, 20, 25, 30.

(a) For each K, compute the LOOCV estimate for the test error by following the steps:

Write a for loop from i = 1 to i = n, where n is the number of observations in the data set, that performs each of the following steps:

i. Perform KNN using all but the ith observation and predict default for the ith observation. (Hint: use the knn function and return the class. No need to compute posterior probabilities. That is, use prob = FALSE in the knn function and then use the returned class of knn.)
ii. Determine whether or not an error was made in predicting default for the ith observation. If an error was made, then indicate this as a 1, and otherwise indicate it as a 0.

Take the average of the n numbers obtained in ii in order to obtain the LOOCV estimate for the test error.

(b) Comment on the results. Which of the tuning parameter K's appears to provide the best results on this data based on the LOOCV estimates?

(c) The knn.cv function can be used to perform LOOCV. Run the following command lines and see whether the results are the same as the ones you obtained in (a).

    library(class)
    for(k in c(1,5,10,15,20,25,30)){
      cvknn
      print(mean(cvknn != y))
    }

(d) For each K, compute the 10-fold CV estimate for the test error by following the steps:

Run the following command lines.

    set.seed(10) ## the seed can be arbitrary but we use 10 for the sake of consistency
    fold.index

Write a for loop from i = 1 to i = 10 and in the loop, perform each of the following steps:

i. Perform KNN using all but the observations that satisfy fold.index==i and predict default for the observations that satisfy fold.index==i. (Hint: use the knn function and return the class. No need to compute posterior probabilities. That is, use prob = FALSE in the knn function and then use the returned class of knn.)
ii. Compute the error rate in predicting default for those observations that satisfy fold.index==i.

Take the average of the 10 numbers obtained in ii in order to obtain the 10-fold CV estimate for the test error.

(e) Comment on the results. Which of the tuning parameter K's appears to provide the best results on this data based on the 10-fold CV estimates?

4. (10 pts) In this question, we are going to use the zipcode data from HW2 Q10.

(a) Using the zipcode_train.csv data, perform a 10-fold cross-validation using KNNs with K = 1, 2, . . . , 30 and choose the best tuning parameter K.

(b) Using zipcode_test.csv and comparing the KNN you obtained in (a) with logistic regression and LDA, which of these methods appears to provide the best results on the test data? Is this the same conclusion that you made in HW2?

(c) Using the KNN you obtained above in (a), show two of the handwritten digits that this KNN cannot identify correctly.

5. (10 pts) Question 8 in Section 5.4.

6. (10 pts) Question 8 in Section 6.8.

7. (10 pts) In this question, we will predict the number of applications received using the other variables in the College data set.

First, we split the data set into a training set and a test set by using the following command lines.

    library(ISLR)
    data(College)
    set.seed(20)
    train
    College.train
    College.test

(a) Fit a linear model using least squares on the training set, and report the test error obtained.

(b) Fit a ridge regression model on the training set, with λ chosen by cross-validation. Report the test error obtained.

(c) Fit a lasso model on the training set, with λ chosen by cross-validation. Report the test error obtained, along with the number of non-zero coefficient estimates.

Reposted from: http://www.3daixie.com/contents/11/3444.html
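The LOOCV steps described in question 2(a) can be illustrated with a minimal sketch. It is not the assignment's solution: it runs on simulated data rather than Weekly, and the names loocv_error, sim, x1, x2, and y are hypothetical, introduced only for illustration. It assumes the response is a two-level factor, in which case glm with family = binomial models the probability of the second level.

```r
# Hypothetical helper: LOOCV misclassification rate for a logistic regression.
loocv_error <- function(formula, data) {
  response <- all.vars(formula)[1]
  lev <- levels(data[[response]])  # glm models P(response == lev[2])
  n <- nrow(data)
  errors <- numeric(n)
  for (i in seq_len(n)) {
    # Step i: fit on all observations except the i-th.
    fit <- glm(formula, data = data[-i, ], family = binomial)
    # Step ii: predicted probability of lev[2] for the held-out row.
    p <- predict(fit, newdata = data[i, ], type = "response")
    # Step iii: classify with the 0.5 threshold.
    pred <- ifelse(p > 0.5, lev[2], lev[1])
    # Step iv: record 1 for an error, 0 otherwise.
    errors[i] <- as.integer(pred != as.character(data[[response]][i]))
  }
  mean(errors)
}

# Simulated stand-in data (an assumption, not the Weekly data set).
set.seed(1)
n <- 60
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
sim$y <- factor(ifelse(sim$x1 + rnorm(n) > 0, "Up", "Down"),
                levels = c("Down", "Up"))

err1 <- loocv_error(y ~ x1 + x2, sim)
err2 <- loocv_error(y ~ x1 + x2 + I(x1^2) + I(x2^2), sim)
print(c(linear = err1, quadratic = err2))
```

Because each of the n iterations refits the model, the loop costs n model fits, which is the point of comparison raised in question 2(g).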
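The 10-fold procedure for KNN in question 3(d) follows the same pattern, with one model evaluation per fold instead of per observation. The sketch below is an illustration under assumptions: it uses simulated predictors rather than Default, the helper name kfold_knn_error and the variables a and b are hypothetical, and the way fold.index is built here (a random permutation of repeated fold labels) is one common choice, not necessarily the one used in the assignment. The knn function comes from the class package.

```r
library(class)  # provides knn()

# Hypothetical helper: K-fold CV misclassification rate for KNN.
kfold_knn_error <- function(X, y, k, fold.index) {
  folds <- sort(unique(fold.index))
  fold_err <- numeric(length(folds))
  for (j in seq_along(folds)) {
    held_out <- fold.index == folds[j]
    # Train on every fold except the j-th and predict the j-th fold.
    pred <- knn(train = X[!held_out, , drop = FALSE],
                test = X[held_out, , drop = FALSE],
                cl = y[!held_out], k = k, prob = FALSE)
    # Error rate on the held-out fold.
    fold_err[j] <- mean(pred != y[held_out])
  }
  mean(fold_err)
}

# Simulated, standardized stand-in data (an assumption, not Default).
set.seed(10)
n <- 200
X <- scale(cbind(a = rnorm(n), b = rnorm(n)))
y <- factor(ifelse(X[, "a"] + X[, "b"] + rnorm(n, sd = 0.5) > 0, "Yes", "No"))
fold.index <- sample(rep(1:10, length.out = n))  # assumed fold assignment

ks <- c(1, 5, 10, 15)
cv_err <- sapply(ks, function(k) kfold_knn_error(X, y, k, fold.index))
print(setNames(cv_err, paste0("K=", ks)))
```

Reusing the same fold.index for every K keeps the comparison between tuning parameters on identical splits.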
