Statistical Programming Assignment 2
Gordon Ross
November 19, 2018

Submission: Only one submission attempt is allowed. The deadline is 23:59 on Monday December 3rd. Late submissions will incur a 15% penalty. To submit, create an R script called matriculationnumberA2.R, where matriculationnumber refers to your matriculation number, and upload it to the Assessment 1 section of Learn. Failure to give your script file the correct name will incur a 5% penalty. You must submit a single script file, i.e. matriculationnumberA2.R; failure to comply with this will incur a 5% penalty.

Your answer for each question must be included in a corresponding section of your R script file. For example, your answer/code for question 1.1 must be included in a section which looks like:

## ;;
## ---------------------------------------------
## Q1: -- add your code below
## ---------------------------------------------
## ;;
## 1.1

code goes here

## ---------------------------------------------

I will deduct 5% of marks for script files which are disorganised (e.g. questions are not answered in numerical order, or it is not clear which question a code fragment is answering), so please make sure your file has a sensible structure.

Guidance - Assessment criteria. A marking scheme is given. In addition to the marking scheme, your code will be assessed according to the following criteria:

- Style: follow https://google.github.io/styleguide/Rguide.xml with care.
- Writing of functions: avoid common pitfalls of local vs global assignments; wrap your code in a coherent set of instructions and try to make it as generic as possible. Also, functions that are meant to be optimised with optim must be written accordingly; see ?optim.
- Executability: your code must be executable and should not require additional code in order to run. A common pitfall is failure to load the R packages required by your code.
- Deadline: Monday December 3rd, 23:59.

Individual feedback will be given. Please answer all three questions.
The first question is a fairly straightforward test of Monte Carlo integration. The last two questions are more conceptually challenging, and will apply computational techniques to less artificial problems than we have seen in lectures.

Question 1

Use Monte Carlo integration with 100,000 random numbers to evaluate the following integrals. In your script file, report your code, the estimated integral value, and the Monte Carlo error.

1. ∫_1^3 x^2 exp(-(x - 2)^2) dx, using a N(2, 1) proposal distribution [5].

2. ∫_1^5 y^5 log(y) dy, using a Uniform(1, 5) proposal [5].

3. It is well known that π is the solution to the integral

   ∫_0^1 4 / (1 + x^2) dx.

   Use Monte Carlo to approximate this integral for sample sizes (i.e. the number of random numbers) N ∈ (10, 100, 1000). For each value, also compute the Monte Carlo approximation error. Write the values and the errors in your script file [5].

Question 2

The plot below shows the monthly percentage returns to a financial asset over a 20 month period. The numbers generating the plot are shown beneath it.

[Figure: percentage change (vertical axis) plotted against month (horizontal axis, 1 to 20).]

y = c(..., 1.74, -0.29, -1.31, -0.07, -1.22, 3.24, -1.97, 1.81, 4.00, 1.87, 1.50, 6.81, -4.14)

In finance, it is often useful to know whether there has been a change in the variance over time, i.e. some point k such that the variance of observations y_1, ..., y_k is equal to σ_1^2 and the variance of observations y_{k+1}, ..., y_n is equal to σ_2^2, where σ_1^2 ≠ σ_2^2 (note that by convention, y_k belongs to the pre-change segment).

1. Assuming that the observations are Gaussian, describe how an F-test could be used to test whether a change has occurred at location k = 10. Clearly state the null and alternative hypotheses. (Write your words as a comment in your R script.) [2]

2. Implement this test in R and make a conclusion based on your p-value (remember that the var.test() function carries out the F-test). [3]

3. In practice, we do not know which specific k to test for (i.e. we do not know in advance where the change occurred).
Instead, we wish to estimate the value of k at which the change occurred. One approach is to perform the F-test at every possible value of k ∈ {2, 3, ..., n - 2}, so 17 tests in total given 20 observations (note that we need at least 2 observations in each segment to compute the variance, which is why we do not consider k ∈ {1, n - 1, n}).

In other words, for each value of k ∈ {2, 3, ..., n - 2}, split the observations into the sets y_1, ..., y_k and y_{k+1}, ..., y_n, then perform an F-test and record the p-value.

After carrying out these 17 tests, the best estimate of k is the value which gives the lowest p-value, since this provides the most evidence for a difference in variance. Perform this procedure in R and hence determine which value of k is most likely to be the change point in the above data. [10]

4. Next, we need to determine whether the change point we found is statistically significant: is there really evidence to suggest that there is a change in variance at the value of k you found above? Unfortunately, we cannot simply check whether the p-value of the F-test at this point is less than 0.05, because we did not perform just one test: we performed 17 tests and chose the lowest p-value. This multiple-testing issue means that a more sophisticated procedure is necessary.

Instead, we can use a variant of permutation testing. Let the null hypothesis be that there is no change point anywhere, i.e. that all observations have the same distribution. Let the alternative be that there is a change point at some unknown value of k. If the null hypothesis is true, we can rearrange the 20 observations in any order we like. For each rearrangement, compute the minimum p-value over all 17 F-tests, and hence approximate the distribution of the minimum p-value under the null hypothesis. Plot this distribution, and hence conclude whether there is evidence to suggest that a change has occurred in the given sequence (i.e. check if your minimum p-value from part 3
above is in the lower 5th percentile of p-values from this null distribution). [10]

Question 3

Statistical methods can be used to determine the (unknown) author of an unidentified piece of writing. This is known as stylometry. Research has shown that people differ in how frequently they use basic grammatical English words such as 'a' and 'the'. These differences are quite small (perhaps one person uses the word 'the' only 1% more often than another person does), but they do exist, and can be identified given a large enough sample of a person's writing. As such, if we are given a large sample of a person's writing and a new text that has an unknown author, then it is possible to statistically test whether the person wrote it, simply by counting up the number of times these basic grammatical words appear in the new text and comparing it to their writing sample.

a, all, also, an, and, any, are, as, at, be, been, but, by, can, do, down, even, every, for, from, had, has, have, her, his, if, in, into, is, it, its, may, more, must, my, no, not, now, of, on, one, only, or, our, shall, should, so, some, such, than, that, the, their, then, there, things, this, to, up, upon, was, were, what, when, which, who, will, with, would, your

Figure 1: The 70 grammatical words that characterise writing style

This question will explore an example of this technique. You will download a text file which contains counts of how often each of 3 different authors used the 70 grammatical words from Figure 1. You will then use this to determine which of the 3 was the most likely author of a new piece of writing that has an unknown author. The three authors in question are Agatha Christie (British crime novelist), Charles Dickens, and George R R Martin (author of A Song of Ice and Fire/Game of Thrones).

First download the 'authorship.csv' file from the course webpage and load it into R.
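One way to load a counts file like this is read.csv() followed by as.matrix(). The exact arguments depend on whether authorship.csv has a header row or row names, which the question does not specify, so the sketch below writes a small synthetic stand-in file to stay self-contained; with the real file you would point read.csv() at "authorship.csv" directly.

```r
# Write a hypothetical 3 x 4 stand-in for authorship.csv (the real file
# is 3 x 71); with the real file, replace 'path' with "authorship.csv"
# and adjust the header/row-name arguments to match its actual layout.
path <- tempfile(fileext = ".csv")
write.csv(matrix(c(10, 20, 30, 40,
                   15, 25, 35, 45,
                   12, 22, 32, 42), nrow = 3, byrow = TRUE),
          path, row.names = FALSE)

counts <- as.matrix(read.csv(path))  # read, then coerce to a matrix
mode(counts) <- "numeric"            # ensure entries are numeric counts
rowSums(counts)                      # row sums = total words per author
```

With the real file, checking the row sums against the totals quoted above (3,808,305 / 3,825,393 / 1,753,671) is a quick sanity check that the matrix loaded correctly.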
This contains a 3x71 matrix of counts, where the rows correspond to each of the above three authors, in order. The columns are counts of how many times each author used each of the words from Figure 1, summed over all of their published books. So, for example, the first column of the first row counts how many times Agatha Christie used the word 'a' in her published novels, while the 2nd column of the third row counts how many times George R R Martin used the word 'all'. The 71st column counts the number of non-grammatical words each author used (i.e. every word which wasn't one of the 70 in this list). The total number of words used by each author is equal to the corresponding row sum, i.e. 3,808,305 for Christie, 3,825,393 for Dickens, and 1,753,671 for George R R Martin.

1. Download the authorship.csv file and then load it into R, as a numerical matrix of counts [3].

2. For each author i, we have a 71-element vector corresponding to the number of times they used each word. We will model this as a Multinomial(θ_i) distribution. The Multinomial distribution is a generalisation of the Binomial distribution, where a random variable can take one of K different outcomes (in this case, K = 71).

For a particular author i, the unknown parameter θ_i is a 71-element vector, θ_i = (θ_{i,1}, θ_{i,2}, ..., θ_{i,71}). Suppose that y_i = (y_{i,1}, y_{i,2}, ..., y_{i,71}) is the vector of counts (i.e. y_{1,1} is how many times author 1 used the word 'a', and so on). Then the maximum likelihood estimate of θ_{i,k} is:

θ_{i,k} = y_{i,k} / Σ_{j=1}^{71} y_{i,j}

(i.e. the MLE for the proportion of times each word is used by author i is simply the empirical proportion of times they used that word). Write R code to compute the maximum likelihood estimate of the 71-element θ_i vector for each of the three authors, and report each one as a separate commented line in your script file [5].

3. Download the 'unknowntext.csv' file from the course website and load it into R [2].

4.
The unknowntext.csv file contains an extract of 10,000 words taken from a novel written by one of these three authors. As above, this is a 71-element vector which counts how many times each of the above grammatical words was used. We will try to determine which author wrote it by testing which of the estimated θ̂_i parameters it is consistent with. This can be done using hypothesis testing, i.e. for each θ_i we will test:

H0: p(z) = Multinomial(θ_i)
H1: p(z) ≠ Multinomial(θ_i)

where z = (z_1, ..., z_71) is the vector of word counts for the unknown text. First, normalise this vector so that it sums to 1 by dividing each element by 10,000. Next, we define the test statistic:

T_i = Σ_{k=1}^{71} (z_k - θ_{i,k})^2

where z_k is the normalised count for the kth word. Compute this test statistic for all 3 authors and write down the values in your script file. [5]

5. T_i essentially measures the distance between the new text and the parameter for each author. As such, for each author i, we will reject the null hypothesis if T_i > γ_i and conclude that this author did not write the text. We need to choose γ_i in order to make the Type 1 error equal to the usual 0.05 (so that we only mistakenly reject the null 5% of the time, if the author really did write the text).

For each author, use Monte Carlo simulation to find the appropriate value of γ_i. You can do this by simulating sample data under the assumption that the null hypothesis is true, computing the test statistic for each simulated piece of data, and then defining γ_i to be the 95th quantile of these simulated test statistics. This means that if the null hypothesis is true, only 5% of observations simulated from the Multinomial(θ_i) distribution will give a test statistic greater than this value of γ_i.

In other words: for each i, simulate a large number (e.g. S = 100,000) of observations from the Multinomial(θ_i) distribution with 71 categories and 10,000 words (i.e. equal to the number of words in the unknown text). Compute the test statistic T_i above for each simulated observation.
Then, define γ_i to be the 0.95S-th smallest of these S values, i.e. their 95th quantile (similar to bootstrapping). [10]

6. Based on the above, compute which of the three null hypotheses are rejected, and hence determine the most likely author of the unknown text. [5]
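The simulation in part 5 can be sketched as follows for a single author. The theta vector here is a made-up uniform placeholder, not an author's actual MLE, and S is reduced from 100,000 for speed; substituting the estimated θ̂_i and S = 100,000 gives the procedure the question asks for.

```r
set.seed(1)
K       <- 71             # number of word categories
n_words <- 10000          # length of the unknown text
S       <- 10000          # simulated texts (the question suggests 100,000)
theta   <- rep(1 / K, K)  # placeholder parameter; use the MLE in practice

# Simulate S count vectors under H0, normalised to proportions (K x S).
sims <- rmultinom(S, size = n_words, prob = theta) / n_words

# Test statistic T = sum_k (z_k - theta_k)^2 for each simulated text.
T_sim <- colSums((sims - theta)^2)

# gamma is the 95th quantile of the simulated null distribution of T.
gamma <- quantile(T_sim, 0.95)

# With the real data: reject H0 for author i if the observed T_i > gamma.
```

Note that rmultinom() returns one simulated count vector per column, so subtracting theta (length K) recycles it correctly down each column before squaring and summing.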