Walkthrough: Data Science, Statistical Modelling, R, SPSS

Applications of Data Science and Statistical Modelling
Assignment 4
29/11/2019

The dataset SubstationRPD.RData contains real power delivered (kW) for each 10-minute period of every day during June and July, for 410 substations in the southwest of Wales, UK. The aim of this assignment is to understand how the power demand changes throughout the day, identify any weekly/monthly patterns if present, and use this information to fit a GAM which allows us to predict future demands. Note that in order to fit a GAM you'll need to have the mgcv package installed.

1. [3 marks] Produce summaries of the dataset SubstationRPD.RData and produce histograms showing the distributions of real power delivered for the 410 substations. Comment on the distributions of real power delivered, and on any variation in those distributions between substations. (E.g. you could choose specific 10-minute intervals, say the 10-minute window after midnight, and plot the distribution of the power demand across the substations, or look at average daily demands, maximum daily demands...)

2. [3 marks] For each substation, calculate the average demand for each 10-minute period (that is, you should average over the days) and then plot these on the same plot, using a different colour for each substation. Add a thick, black line showing the overall mean for the demand of all of the substations. Comment on the variability in patterns between substations. Does the overall mean seem a reasonable summary of all the data? (Hint: Since we are plotting 410 separate curves, you might want to suppress the legend, which can be done using the ggplot option ‘theme(legend.position = "none")‘.)

[Figure: average daily demand (axis from 0 to 400) against time of day (00:00 to 23:50); panel "All days".]

3. [3 marks] Split your plot in Question 2 into four separate plots representing: 1) all days, 2) weekdays, 3) Saturdays and 4) Sundays. Are there any differences in patterns between days?
(Hint: You might find the ‘weekdays‘ function useful.)

Now that we understand how the demand changes throughout the day, and have identified some seasonal patterns, the next step is to fit a GAM to our data:

4. [2 marks] First, reformat the SubstationRPD.RData dataset so that each row is the average of all demand data for each substation. That is, each row corresponds to one day, and in each column you should have the average demand (across all substations) for the corresponding 10-minute period.

5. [10 marks] Add a column with the day of the month, and another one with the month of the year. Note that you can access these using the following R code:

    as.numeric(substr(Date,9,10)) # day
    as.numeric(substr(Date,6,7)) # month

Next collapse the data, so that the previously calculated mean power demands are in a single column, instead of separate columns. By this point you should have a dataset similar to the following:

    # A tibble: 6 x 6
    # Groups:   Date, weekdays [1]
      Date       weekdays minute.int  mean   day month
    1 2012-06-01 Friday            1  56.7     1     6
    2 2012-06-01 Friday            2  57.0     1     6
    3 2012-06-01 Friday            3  56.6     1     6
    4 2012-06-01 Friday            4  55.7     1     6
    5 2012-06-01 Friday            5  55.5     1     6
    6 2012-06-01 Friday            6  54.9     1     6

Fit and plot a GAM which accounts for the underlying seasonal pattern in demands. You should decide which seasonal patterns are appropriate to include: daily (use the minute.int column in the above dataset), weekly (use the day column in the above dataset), monthly (use the month column in the above dataset). Comment on the fit of the model. What are the (effective) degrees of freedom, and what does this tell us about the complexity of the model that has been fit?

6. [4 marks] Choose an appropriate model with which to predict the demand for the 21st to the 28th of July. Take the daily average demand, and produce a plot showing these mean predictions against time. You can use the following code to create a new dataset for the prediction.
Note that depending on how you named the columns of your dataset you might have to modify the column names in the following code:

    new.data rep(7,1152)),nrow=1152,ncol=3,byrow=FALSE))
    new.data$Date as.Date(2012-07-28),days),144)
    names(new.data)

All the exercises should be solved using R. A PDF document with your answers, (commented) R code and its outputs/plots should be submitted via ELE by noon (12pm), 18th December. Note that late submissions will be penalised.

Reposted from: http://www.daixie0.com/contents/18/4502.html
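The workflow in Questions 5 and 6 (fit a GAM to the long-format data, then predict 21 to 28 July and average by day) could be sketched roughly as below. This is only an illustration under stated assumptions, not the assignment's intended solution. The data are simulated stand-ins, because the structure of SubstationRPD.RData is not shown here. The `day_type` helper, the `day.type` and `pred` columns, the cyclic smooth with `k = 20`, and the way `new.data` is built are all hypothetical choices. Only the column names `Date`, `minute.int`, `mean`, `day` and `month` and the 1152-row size (8 days of 144 intervals) come from the text above.

```r
library(mgcv)

set.seed(1)

# Hypothetical helper: classify dates as Weekday / Saturday / Sunday.
# format(x, "%u") gives 1 (Monday) to 7 (Sunday) regardless of locale.
day_type <- function(x) {
  dow <- as.numeric(format(x, "%u"))
  factor(ifelse(dow == 6, "Saturday", ifelse(dow == 7, "Sunday", "Weekday")),
         levels = c("Weekday", "Saturday", "Sunday"))
}

# Simulated stand-in for the long-format dataset of Question 5
# (made-up values; the real ones come from SubstationRPD.RData).
dates <- seq(as.Date("2012-06-01"), as.Date("2012-07-20"), by = "day")
dat <- data.frame(Date       = rep(dates, each = 144),
                  minute.int = rep(1:144, times = length(dates)))
dat$day      <- as.numeric(substr(dat$Date, 9, 10))
dat$month    <- as.numeric(substr(dat$Date, 6, 7))
dat$day.type <- day_type(dat$Date)
dat$mean <- 60 + 25 * sin(2 * pi * (dat$minute.int - 36) / 144) -
  5 * (dat$day.type != "Weekday") + rnorm(nrow(dat), sd = 2)

# One possible model: a cyclic smooth for the daily pattern (so that 23:50
# joins up with 00:00) plus a parametric day-type effect.
fit <- gam(mean ~ s(minute.int, bs = "cc", k = 20) + day.type,
           data = dat, method = "REML",
           knots = list(minute.int = c(0.5, 144.5)))

summary(fit)   # reports the effective degrees of freedom of the smooth
sum(fit$edf)   # total effective degrees of freedom of the model

# Prediction grid for 21 to 28 July: 8 days x 144 intervals = 1152 rows.
new.dates <- seq(as.Date("2012-07-21"), as.Date("2012-07-28"), by = "day")
new.data <- data.frame(Date       = rep(new.dates, each = 144),
                       minute.int = rep(1:144, times = length(new.dates)))
new.data$day      <- as.numeric(substr(new.data$Date, 9, 10))
new.data$month    <- as.numeric(substr(new.data$Date, 6, 7))
new.data$day.type <- day_type(new.data$Date)

# Predict each 10-minute interval, then average within each day.
new.data$pred <- as.numeric(predict(fit, newdata = new.data))
daily.pred <- aggregate(pred ~ Date, data = new.data, FUN = mean)

plot(daily.pred$Date, daily.pred$pred, type = "b",
     xlab = "Date", ylab = "Mean predicted demand (kW)")
```

Comparing `sum(fit$edf)` with the basis dimension `k` gives a rough sense of how much of the available flexibility the penalised fit actually uses. Other terms, such as a smooth of `day` or an effect of `month`, could be added and compared in the same way.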
