Predicting the Price of Round Cut Diamonds
STP 494/STP 598: Machine Learning

Introduction
Data
    Data Source
    Data Preparation
Methods
Results
    Variable Selection Using Regsubsets
    Multiple Linear Regression
    Random Forest Regression
Discussion and Conclusion
Future Work
References
Appendix
    Code

Introduction

One of the biggest purchases a couple makes is the engagement ring, which has been a staple since the 19th century. One of the main components of the engagement ring is the diamond. Because this is a very expensive purchase, we as consumers should be well informed about how diamonds are priced so that we do not overpay. Our data set records the price of round cut diamonds together with the attributes listed in the Data Source section.
Our goal is to fit a model that predicts the price of round cut diamonds using the best methods and the best set of variables available.

Data

Data Source

This dataset contains 53,940 diamond observations on 10 variables, as follows:

Variable   Description
price      price in US dollars ($326-$18,823)
carat      weight of the diamond (0.2-5.01)
cut        quality of the cut (Fair, Good, Very Good, Premium, Ideal)
color      diamond color (J (worst), I, H, G, F, E, D (best))
clarity    grade of how clear the diamond is (I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best))
x          length in mm (0-10.74)
y          width in mm (0-58.9)
z          depth in mm (0-31.8)
depth      total depth percentage (43-79)
table      width of top of diamond relative to widest point (43-95)

NOTE: I = Included, SI = Slightly Included, VS = Very Slightly Included, VVS = Very, Very Slightly Included, IF = Internally Flawless

Data Preparation

First, the data was loaded in Excel for an overview. There were no missing values. The first column, which held only the index of each observation, added no information and was deleted. The data was then loaded into R and examined further by looking at its structure. The levels of each categorical variable (cut, color, and clarity) are sorted alphabetically by default. Because the levels are ordinal, we releveled each variable according to the order presented in the Data Source section above. Our data was then split into 75 percent training and 25 percent testing sets.

Methods

Multiple linear regression and random forest regression were used to fit models that predict the price of diamonds. We started with a variable selection method (regsubsets) to select the "best" subset of predictors. After selecting the predictors, we fitted a multiple linear regression on them and evaluated it using the root mean square error (RMSE) criterion. This was used as our base model. From there, we tried to improve on it using random forest regression.
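
For reference, the RMSE criterion mentioned above is usually defined as follows. The notation is ours rather than the report's: n is the number of test observations, y_i is the observed price of diamond i, and \hat{y}_i is the price predicted by the model.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}
```

Because the errors are squared before averaging, the RMSE is expressed in the same unit as the price (US dollars) and penalizes large misses more heavily than small ones.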

In class, random forest was presented as one of the most powerful tools available. With random forest, we wanted to see how the RMSE changes as mtry (the number of predictors tried at each split) and ntree (the number of trees) change, in an attempt to minimize the RMSE. Since we have a large dataset and random forest is computationally expensive, we opted to resample our dataset down to 2000 observations.

Results

Variable Selection Using Regsubsets

Regsubsets was used to select the best variables to use.

[Figure: RMSE vs. Num Vars. x-axis: Num Vars; y-axis: RMSE]

From the plot above, we can see that the RMSE reaches its lowest level at around 16 variables and plateaus beyond that point. Note that there were 3 categorical variables, which were dummy coded using 0 and 1, and each level is shown here as a separate variable. For example, colorD is coded as 1 if the color grade is D and 0 otherwise, and it counts as one variable here. Since we want the model with the fewest variables, and the RMSE of the full model and of the 16-variable model are nearly the same, the 16-variable model is the best choice.

Multiple Linear Regression

The variables selected by regsubsets were used here on the full dataset to obtain our multiple linear regression model, which resulted in an RMSE of $1065.38. Our model is as follows:

price = 147.61 + 11242.57 carat
        + 916.79 colorI + 1399.00 colorH + 1900.69 colorG + 2098.07 colorF + 2164.11 colorE + 2380.17 colorD
        + 2870.17 claritySI2 + 3849.15 claritySI1 + 4473.61 clarityVS2 + 4786.27 clarityVS1
        + 5174.62 clarityVVS2 + 5242.24 clarityVVS1 + 5599.45 clarityIF
        - 83.86 depth - 1041.00 x + ε

The reference levels for the categorical variables are:

Categorical Variable   Reference Level
color                  colorJ
clarity                clarityI1

These are the worst color grade and the worst clarity grade. They were made the reference levels intentionally for easy comparison. We can see from our model that the coefficients for color and clarity increase as the grade gets better.
Intuitively, this makes sense since, on average, better quality means a more expensive diamond.

Random Forest Regression

For random forest regression, we tried mtry values of 2, 3, 4, and 5, and ntree values of 100, 200, 300, and 400. The RMSE for each mtry and ntree combination is tabulated below.

mtry   ntree = 100   ntree = 200   ntree = 300   ntree = 400
2      $968.51       $955.21       $956.18       $960.16
3      $987.38       $994.53       $969.73       $989.85
4      $1025.51      $997.77       $1005.13      $1003.57
5      $1037.21      $1031.66      $1048.65      $1034.73

For visualization, we plotted the RMSE against the mtry values, with one line for each value of ntree:

[Figure: RMSE vs mtry, one line per ntree value (100, 200, 300, 400). x-axis: mtry; y-axis: RMSE]

The lowest RMSE, $955.21, was obtained with mtry = 2 and ntree = 200, so this is our "best" random forest regression model. For mtry = 2, going from 200 to 300 to 400 trees had only a marginal effect, since the RMSEs were within about $5 of each other. The RMSE did not keep falling as trees were added (for mtry = 2 it rose slightly at 300 and 400 trees). Because adding trees to a random forest does not by itself cause overfitting, differences this small are more plausibly random variation between fits than a real effect of ntree.

Discussion and Conclusion

Multiple linear regression performed worse than random forest regression, with an RMSE of $1065.38 compared to $955.21 (using mtry = 2 and ntree = 200), a difference of about $110. Both RMSEs are tolerable considering that diamond prices ranged from $326 to $18,823. However, with mtry of 4 or 5 and ntree of 100, 200, 300, or 400, the random forest RMSEs were roughly $1,000 to $1,050, which is close to the RMSE of the multiple linear regression model. Overall, our results support the expectation that random forest would come out ahead of multiple linear regression.

Future Work

We can further extend this work by looking into gradient boosted trees as well as neural networks and deep learning.
However, we chose to focus on random forest in our project since it is widely regarded as one of the strongest general-purpose methods.

References

https://bluenile.v360.in/49/imaged/gia-1162408531/2/still.jpg
https://vincentarelbundock.github.io/Rdatasets/datasets.html
James, G., Witten, D., Hastie, T., and Tibshirani, R. An Introduction to Statistical Learning with Applications in R.
Lantz, B. Machine Learning with R.

Appendix

Code

# looking at data
# stringsAsFactors = TRUE keeps cut, color, and clarity as factors (needed for levels() below in R >= 4.0)
diamond = read.csv("diamonds.csv", stringsAsFactors = TRUE)
str(diamond)

# reorganizing levels
diamond$cut = factor(diamond$cut, levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
diamond$color = factor(diamond$color, levels = rev(levels(diamond$color)))
diamond$clarity = factor(diamond$clarity, levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))
str(diamond)
summary(diamond$price)

# resampling and splitting data into train and test sets
# (the right-hand sides of train_ind, train, test_ind, and test are missing from the source)
set.seed(99)
smp_train = 2000
train_ind
train
smp_test = 0.25 * smp_train
test_ind
test

# using regsubsets to choose best model
library(leaps)

##--------------------------------------------------
## function to do rmse for k in 1:p
dovalbest = function(object, newdata, ynm) {
  form = as.formula(object$call[[2]])
  p = 23  # categorical variables split up, denoted 0 or 1 for each level
  rmsev = rep(0, p)
  test.mat = model.matrix(form, newdata)
  for (k in 1:p) {
    coefk = coef(object, id = k)
    xvars = names(coefk)
    pred = test.mat[, xvars] %*% coefk
    rmsev[k] = sqrt(mean((newdata[[ynm]] - pred)^2))
  }
  return(rmsev)
}

##--------------------------------------------------
## do validation approach several times
ntry = 100
p = 23
resmat = matrix(0, p, ntry)  # each row for num vars, each col for new train/test draw
# note: train and test are not redrawn inside this loop, so every column of resmat is identical
for (i in 1:ntry) {
  regfit.best = regsubsets(price ~ ., data = train, nvmax = 23, nbest = 1,
                           method = "exhaustive")
  resmat[, i] = dovalbest(regfit.best, test, "price")
}
mresmat = apply(resmat, 1, mean)  # average across columns

##--------------------------------------------------
## plot results of repeated train/val
plot(mresmat, xlab = "Num Vars", ylab = "RMSE", type = "b", col = "blue", pch = 19,
     main = "RMSE vs. Num Vars")

##--------------------------------------------------
## Fit using number of vars chosen by train/validation and all the data.
kopt = 16  # optimal k = number of vars: chosen by eye-balling plot
regfit.best = regsubsets(price ~ ., data = diamond, nvmax = kopt, nbest = 1,
                         method = "exhaustive")
xmat = model.matrix(price ~ ., diamond)
ddf = data.frame(xmat[, -1], price = diamond$price)  # don't use intercept (-1) & y = price
nms = c(names(coef(regfit.best, kopt))[-1], "price")
ddfsub = ddf[, nms]  # drop all vars except those named by the coef at kopt
thereg = lm(price ~ ., ddfsub)
print(summary(thereg))

# multiple linear regression using variables from best subset
fit0 = lm(price ~ carat + color + clarity + depth + x, data = train)
pred = predict(fit0, test)
rmse = sqrt(mean((test$price - pred)^2))
cat("The root mean square error is: ", rmse)
fit00 = lm(price ~ carat + color + clarity + depth + x, data = diamond)
summary(fit00)

# random forest
library(randomForest)
fit1 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
                    mtry = 2, ntree = 100)
pred1 = predict(fit1, test)
rmse1 = sqrt(mean((test$price - pred1)^2))
cat("The root mean square error is: ", rmse1)
fit2 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
                    mtry = 2, ntree = 200)
pred2 = predict(fit2, test)
rmse2 = sqrt(mean((test$price - pred2)^2))
cat("The root mean square error is: ", rmse2)
fit3 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
                    mtry = 2, ntree = 300)
pred3 = predict(fit3, test)
rmse3 = sqrt(mean((test$price - pred3)^2))
cat("The root mean square error is: ", rmse3)
fit4 = randomForest(price ~ carat + color + clarity + depth + x, data = train,
                    mtry = 2, ntree = 400)
pred4 = predict(fit4, test)
rmse4 = sqrt(mean((test$price - pred4)^2))
cat("The root mean square error is: ", rmse4)

# grid over mtry and ntree; the loop body is garbled in the source and is
# completed here on the assumption that it follows the fit1..fit4 pattern above
rmeasure = c()
for (i in 2:5) {
  for (j in c(100, 200, 300, 400)) {
    fit = randomForest(price ~ carat + color + clarity + depth + x, data = train,
                       mtry = i, ntree = j)
    pred = predict(fit, test)
    rmse = sqrt(mean((test$price - pred)^2))
    rmeasure = c(rmeasure, rmse)
    cat("mtry = ", i, "ntree = ", j, "RMSE = ", rmse, "\n")
  }
}

# plotting random forest RMSEs against mtry by ntree
x = c(2, 3, 4, 5)
y = c(968.506, 955.2064, 956.1764, 960.1649, 987.3767, 994.533, 969.7294, 989.8527,
      1025.512, 997.7698, 1005.133, 1003.568, 1037.209, 1031.658, 1048.647, 1034.725)
ntree100 = c(968.506, 987.3767, 1025.512, 1037.209)
ntree200 = c(955.2064, 994.533, 997.7698, 1031.658)
ntree300 = c(956.1764, 969.7294, 1005.133, 1048.647)
ntree400 = c(960.1649, 989.8527, 1003.568, 1034.725)
plot(x, ntree100, type = "l", col = 2, xlab = "mtry", ylab = "RMSE", main = "RMSE vs mtry",
     ylim = c(930, 1070), axes = FALSE)
lines(x, ntree200, type = "l", col = 3)
lines(x, ntree300, type = "l", col = 4)
lines(x, ntree400, type = "l", col = 5)
legend("topleft", c("ntree = 100", "ntree = 200", "ntree = 300", "ntree = 400"),
       col = c(2, 3, 4, 5), lty = 1, cex = 0.7)
axis(side = 1, at = c(2:5))
axis(side = 2, at = c(930, 950, 970, 990, 1010, 1030, 1050, 1070))
box()
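
The assignments that create train and test in the appendix are incomplete in the source. The block below is a minimal sketch of one way such a split could be drawn with base R. It is an assumption, not the authors' code. The data frame toy is a simulated stand-in for the diamond data, and the helper rmse is a hypothetical name introduced here. Only the sizes (2000 training rows and 0.25 * 2000 test rows) and the seed are taken from the appendix.

```r
# Hypothetical sketch: disjoint train/test samples and an RMSE helper (base R only).
set.seed(99)

# Simulated stand-in for the diamond data frame; values are made up for illustration.
n = 10000
toy = data.frame(carat = runif(n, 0.2, 3))
toy$price = 300 + 4000 * toy$carat + rnorm(n, sd = 500)

smp_train = 2000
smp_test = 0.25 * smp_train  # 500 rows, as in the appendix

# Draw all needed row indices in one call so train and test cannot overlap.
ind = sample(seq_len(nrow(toy)), smp_train + smp_test)
train = toy[ind[seq_len(smp_train)], ]
test = toy[ind[smp_train + seq_len(smp_test)], ]

# Root mean square error of predictions against observed values.
rmse = function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit on the training rows only, then score on the held-out test rows.
fit = lm(price ~ carat, data = train)
test_rmse = rmse(test$price, predict(fit, test))
cat("Test RMSE:", round(test_rmse, 2), "\n")
```

Sampling both index sets in one call is a design choice that guarantees no diamond appears in both sets. Drawing train and test with two separate calls to sample could let rows overlap and make the test RMSE look better than it is.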