
FINM 331: MULTIVARIATE DATA ANALYSIS
FALL 2018
PROBLEM SET 3

Date: November 5, 2018 (Version 1.0); due: December 3, 2018.

The required files for all problems can be found in:
http://www.stat.uchicago.edu/~lekheng/courses/331/hw3/
The file name indicates which problem the file is for (p1*.txt for Problem 1, etc.). You are welcome to use any programming language or software packages you like.

1. (Factor Analysis) This is the same air quality data set we saw in the previous problem set, but this time we will take only four variables, $X_1$, $X_2$, $X_5$ and $X_6$, leaving out the CO, NO, and HC variables.

(a) Obtain the principal component solution to the factor model $X = \mu + LF + \varepsilon$ with number of factors $m = 1$ and $m = 2$ using:
(i) the sample covariance matrix;
(ii) the sample correlation matrix.
In other words, you should find the matrix of factor loadings $L \in \mathbb{R}^{p \times m}$, the specific variances $\psi_1, \dots, \psi_p \in \mathbb{R}$, and write down the proportions of variability (in percentages) due to the factors.
(b) Find the angle between the first factor loading in (i) and the first factor loading in (ii).
(c) For the $m = 2$ case, compare the factor loadings obtained in (i) and in (ii) using orthogonal Procrustes analysis.
(d) Comment on your results.

2. (Population Canonical Correlation Analysis) The $2 \times 1$ random vectors $X$ and $Y$ have joint mean vector and joint covariance matrix [values missing].

(a) Calculate the canonical correlations $\rho_1$ (the largest) and $\rho_2$ (the second largest).
(b) Find the canonical correlation variables $(U_1, V_1)$ and $(U_2, V_2)$ corresponding to $\rho_1$ and $\rho_2$.
(c) Let $U = [U_1, U_2]^T$ and $V = [V_1, V_2]^T$. Evaluate
$$\mathrm{E}\left(\begin{bmatrix} U \\ V \end{bmatrix}\right) \quad\text{and}\quad \operatorname{Cov}\left(\begin{bmatrix} U \\ V \end{bmatrix}\right) = \begin{bmatrix} \Sigma_U & \Sigma_{UV} \\ \Sigma_{VU} & \Sigma_V \end{bmatrix}.$$
(d) Comment on the correlation structure between and within $U$ and $V$.

3. (Sample canonical correlation analysis) The data set for this problem is obtained by taking four different measures of stiffness, shock, vibrate, static1, static2, for each of $n = 30$ boards. The first measurement involves sending a shock wave down the board, the second measurement is determined while vibrating the board, and the last two measurements are obtained from static tests. The squared distances $d_j^2 = (x_j - \bar{x})^T S^{-1} (x_j - \bar{x})$ are also included as the last column in the data matrix.

Let $X = [X_1, X_2]^T$ be the random vector representing the dynamic measures of stiffness, and let $Y = [Y_1, Y_2]^T$ be the random vector representing the static measures of stiffness. Load the data matrix p3.txt (R command: stiff = read.table("p3.txt")).

(a) Perform a canonical correlation analysis of these data by computing the singular value decomposition of an appropriate matrix formed from the sample covariance matrices. You may compare your result with that obtained by your software (if you use R, it is cancor(X1,X2)).
(b) Write the first canonical correlation variables $U_1$ and $V_1$ as linear combinations of shock, vibrate, static1, static2.
(c) Produce two scatterplots of the data: one in the coordinate plane of the first canonical correlation vectors, one in the plane of the second canonical correlation vectors.
(d) Based on the two plots and the values of the canonical correlations $\{\rho_1, \rho_2\}$, comment on the correlation structure captured by each canonical pair.
(e) Repeat (a) with sample correlation matrices in place of sample covariance matrices and verify that the pairs of canonical vectors obtained are related via scaling by the sample standard deviation matrix.

4. (Canonical correlation analysis for angular measurements) Some observations are in the form of angles.
Here we will see how to deal with such data.

(a) Consider a bivariate random vector $X = [X_1, X_2]^T$ with a uniform distribution inside a circle of radius 1 centered at some unknown point
$$\mu = \begin{bmatrix} \mu_1 \\ \mu_2 \end{bmatrix} \in \mathbb{R}^2.$$
Then $\mathrm{E}(X) = \mu$. A sample of $n = 4$ is taken. The observed values are [values missing]. Compute the sample mean $\bar{x}$ and the sample covariance matrix. Is $\bar{x}$ a good estimator of $\mu$? Why or why not?
(b) We consider an angular valued random variable $\theta$. Note that this can always be represented as a random vector $Y = [\cos\theta, \sin\theta]^T$ that takes values on the circle. Show that
$$b^T Y = \sqrt{b_1^2 + b_2^2}\,\cos(\theta - \beta),$$
where $b_1/\sqrt{b_1^2 + b_2^2} = \cos\beta$ and $b_2/\sqrt{b_1^2 + b_2^2} = \sin\beta$. Here $b = [b_1, b_2]^T \in \mathbb{R}^2$ is a constant vector.
(c) Let $X$ be a random vector with a single component, i.e., just a random variable. Here $X$ is not angular valued. Show that the population canonical correlation is
$$\rho_1 = \max_{\beta} \operatorname{Corr}\bigl(X, \cos(\theta - \beta)\bigr)$$
and that selecting the population canonical correlation variable $V_1$ amounts to selecting a new 'origin' or 'baseline' $\beta$ for the angle $\theta$.
(d) Let $X$ be a random variable representing ozone ($\mathrm{O_3}$) levels and $\theta$ an angular random variable representing wind direction measured from the north. We make 19 observations to obtain the sample correlation matrix
$$R = \begin{bmatrix} R_X & R_{X\theta} \\ R_{\theta X} & R_\theta \end{bmatrix} = \begin{array}{c|ccc} & \mathrm{O_3} & \cos\theta & \sin\theta \\ \hline \mathrm{O_3} & 1.000 & 0.166 & 0.694 \\ \cos\theta & 0.166 & 1.000 & -0.051 \\ \sin\theta & 0.694 & -0.051 & 1.000 \end{array}$$
Find the sample canonical correlation $\hat\rho_1$ and the sample canonical correlation variable $\hat V_1$ representing the new origin $\beta$.
(e) Let $\varphi$ be another angular valued random variable and let $X = [\cos\varphi, \sin\varphi]^T$. Then, similar to (b), we get
$$a^T X = \sqrt{a_1^2 + a_2^2}\,\cos(\varphi - \alpha).$$
Now show that
$$\rho_1 = \max_{\alpha, \beta} \operatorname{Corr}\bigl(\cos(\varphi - \alpha), \cos(\theta - \beta)\bigr).$$
(f) Let $\varphi$ and $\theta$ be two angular random variables representing wind directions at 6:00 a.m. and at 12:00 p.m. We make 21 measurements of $X$ and $Y$ (related to $\varphi$ and $\theta$ as in (b) and (e)).
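We obtain the sample correlation matrix
$$R = \begin{bmatrix} R_X & R_{XY} \\ R_{YX} & R_Y \end{bmatrix} = \begin{array}{c|cccc} & \cos\varphi & \sin\varphi & \cos\theta & \sin\theta \\ \hline \cos\varphi & 1.000 & 0.291 & 0.440 & 0.372 \\ \sin\varphi & 0.291 & 1.000 & 0.205 & 0.243 \\ \cos\theta & 0.440 & 0.205 & 1.000 & 0.181 \\ \sin\theta & 0.372 & 0.243 & 0.181 & 1.000 \end{array}$$
Find the sample canonical correlation $\hat\rho_1$ and the sample canonical correlation variables $\hat U_1$ and $\hat V_1$.

5. (Proofs behind CCA) Let $A \in \mathbb{R}^{p \times p}$ and $B \in \mathbb{R}^{q \times q}$ be symmetric positive definite matrices and $C \in \mathbb{R}^{p \times q}$. Let
$$G = A^{-1/2} C B^{-1/2} \in \mathbb{R}^{p \times q}.$$
We shall write $\lambda_{\max}(M)$ for the largest eigenvalue of a matrix $M$.

(a) Suppose $p = q$. Show that the eigenvalues of $B^{-1}A$, $B^{-1/2} A B^{-1/2}$, and $A B^{-1}$ are all equal. What are the relations between the eigenvectors?
(b) Suppose $p = q$. Show that
$$\max_{x \in \mathbb{R}^p} \{x^T A x : x^T B x = 1\} = \max_{y \in \mathbb{R}^p} \{y^T B^{-1/2} A B^{-1/2} y : y^T y = 1\}.$$
By using Problem 7 in Homework 2, deduce that
$$\max_{x \in \mathbb{R}^p} \{x^T A x : x^T B x = 1\} = \lambda_{\max}(B^{-1/2} A B^{-1/2}),$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p} \{x^T A x : x^T B x = 1\} = q_{\max},$$
where $q_{\max} \in \mathbb{R}^p$ is the eigenvector of $B^{-1}A$ corresponding to the largest eigenvalue.
(c) Show that if we fix $x \in \mathbb{R}^p$ and just maximize over all $y \in \mathbb{R}^q$, then
$$\max_{y \in \mathbb{R}^q} \{(x^T C y)^2 : y^T B y = 1\} = \max_{y \in \mathbb{R}^q} \{y^T [C^T x x^T C] y : y^T B y = 1\}$$
and deduce from (a) and (b) that
$$\max_{y \in \mathbb{R}^q} \{(x^T C y)^2 : y^T B y = 1\} = \lambda_{\max}(B^{-1} C^T x x^T C).$$
Show that the largest eigenvalue of a rank-1 matrix $ab^T$ is $b^T a$ and deduce that
$$\max_{y \in \mathbb{R}^q} \{(x^T C y)^2 : y^T B y = 1\} = x^T C B^{-1} C^T x.$$
(d) Using (a), (c), and Problem 7 in Homework 2, show that
$$\max_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{(x^T C y)^2 : x^T A x = 1,\ y^T B y = 1\} = \lambda_{\max}(G G^T).$$
(e) Let $\sigma_1, \dots, \sigma_p \in \mathbb{R}$, $u_1, \dots, u_p \in \mathbb{R}^p$, $v_1, \dots, v_p \in \mathbb{R}^q$ be the singular values and left/right singular vectors of $G$. By Problem 7 in Homework 2, show that
$$\max_{x \in \mathbb{R}^p} \{x^T G G^T x : x^T x = 1,\ u_i^T x = 0,\ i = 1, \dots, k-1\} = \sigma_k^2,$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p} \{x^T G G^T x : x^T x = 1,\ u_i^T x = 0,\ i = 1, \dots, k-1\} = u_k,$$
for $k = 1, \dots, p$. Hence deduce that
$$\max_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ i = 1, \dots, k-1\} = \sigma_k,$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ i = 1, \dots, k-1\} = (A^{-1/2} u_k,\ B^{-1/2} v_k),$$
for $k = 1, \dots, p$.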
Finally show that
$$\max_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ v_i^T B^{1/2} y = 0,\ i = 1, \dots, k-1\} = \sigma_k,$$
$$\operatorname*{argmax}_{x \in \mathbb{R}^p,\, y \in \mathbb{R}^q} \{x^T C y : x^T A x = 1,\ y^T B y = 1,\ u_i^T A^{1/2} x = 0,\ v_i^T B^{1/2} y = 0,\ i = 1, \dots, k-1\} = (A^{-1/2} u_k,\ B^{-1/2} v_k),$$
for $k = 1, \dots, p$.

6. (Linear discriminant analysis) The admissions committee of a business school used GPA and GMAT scores to make admission decisions. The values for the variable admit = 1, 2, 3 correspond to admission decisions of yes, no, waitlist. Label the data set p6.txt. Helpful R commands:
gsbdata = read.table("p6.txt"); colnames(gsbdata) = c("GPA", "GMAT", "admit");

(a) Calculate $\bar{x}_i$, $\bar{x}$ and $S_{\text{pool}}$.
(b) Calculate the sample within groups matrix $W$, its inverse $W^{-1}$, and the sample between groups matrix $B$. Find the eigenvalues and eigenvectors of $W^{-1}B$. (The R command for $A^{-1}$ is solve(A).)
(c) Use the linear discriminants derived from these eigenvectors to classify the two new observations $x = [3.21, 497]^T$ and $x = [3.22, 497]^T$.
(d) Scatterplot the original data set on the plane of the first two discriminants, labeled by admission decisions. Comment on the results in (c). Is this a good admission policy?

7. (Correspondence Analysis) A client of a law firm would like to visualize the number of large class-action lawsuits each year across different industries from 2011 to the first half of 2017. Correspondence analysis provides a means of displaying or summarizing a set of categorical data in two-dimensional graphical form. The data on class-action lawsuits are from annual reports of Stanford Law School's Securities Class Action Clearinghouse. To load the data in R, you can use the following command:
CALaw = read.csv("/classaction lawsuit.csv", header=TRUE)

Notation: Denote by $X$ the data matrix of the number of class-action lawsuits for industry $\times$ year. $x_{i\bullet}$ denotes the row total (summing across all years for each industry). $x_{\bullet j}$ denotes the column total (summing across all industries for each year). $x_{\bullet\bullet}$ denotes the grand total. Define $D_r = \operatorname{diag}(x_{1\bullet}, \dots, x_{n\bullet})$ and $D_c = \operatorname{diag}(x_{\bullet 1}, \dots, x_{\bullet p})$.

(a) What are the dimensions, $n$ and $p$, in this dataset?
(b) Show that 1 is an eigenvalue of the matrices $D_r^{-1} X D_c^{-1} X^T$ and $D_c^{-1} X^T D_r^{-1} X$ and that the corresponding eigenvectors are proportional to $\mathbf{1} = [1, \dots, 1]^T$.
(c) Transform the data as follows:
$$Y = \sqrt{x_{\bullet\bullet}}\; D_r^{-1/2} \left( X - \frac{a b^T}{x_{\bullet\bullet}} \right) D_c^{-1/2} \in \mathbb{R}^{n \times p},$$
where $a = D_r \mathbf{1}_n$ and $b = D_c \mathbf{1}_p$. Report the SVD of $Y$ (both singular values and left/right singular vectors). Is there another formula to compute the entries of the matrix $Y$?
(d) Write down the formula to compute row weight vectors and column weight vectors. How many different row weight vectors and column weight vectors are there? Report all row weight vectors and column weight vectors.
(e) Similar to PCA, make the following two plots:
Scatterplot of the first two row weight vectors: Does this scatterplot inform us about year or industry? What do you learn from this scatterplot?
2D biplot: What do you learn from the biplot?
(f) Write down the formula to calculate the Frobenius norm of $Y$. Compute the Frobenius norm of $Y$. What is the relationship between the sum of squares of the singular values and the Frobenius norm of $Y$?
(g) Report the percentage of original variance that each dimension in the row/column weight vectors explains. How many singular values are needed to effectively summarize at least 90% of the variability in the data?

8. (Multidimensional Scaling) An investor looking to allocate his funds to different industries seeks to visually understand the relationship between returns across different US industries. This investor has deep pockets but does not know statistics, so he comes to seek your advice. As a financial mathematician, multidimensional scaling first comes to your mind to answer this investor's question.
To collect the data, the US industry returns can be downloaded from the Industry Return sections of Kenneth French's website at:
http://mba.tuck.dartmouth.edu/pages/faculty/ken.french/data_library.html#Research
For the purpose of this problem set, the dataset of monthly returns of 30 US industries has been downloaded and formatted. To read in the dataset in R, you may use
FF = read.csv("./FamaFrench30.csv", header=TRUE)
Each row of the data records, for a given date, how much the industry in each column went up or down. A value of 1 means that the industry in that column went up 1% in that month compared to the previous month.

(a) Report the mean returns and standard deviations of five industries of your choice. Out of all 30 industries, which industry performs the best on average, and which industry is the most volatile?
(b) Let $R_{it}$ be the return of industry $i$ at time $t$. Write a formula to compute the distance between two industries. State what each subscript/superscript means and specify its range (i.e., state explicitly what you sum over). Write code to compute the distance and report the distance for the following pairs of industries: Autos and ElecEq; Autos and Trans; Autos and Oil.
(c) Do you need to demean the data to compute the distance matrix? Why?
(d) Report the distance matrix of all industries. To conveniently compute distances, R has a built-in distance matrix command, dist:
dist(data_matrix, method = "euclidean", diag = FALSE, upper = FALSE, p = 2)
The first input is the data matrix. The command computes the Euclidean distances between the rows of the data. (Hint: You may need to convert the result into a matrix using the as.matrix command.)
(e) Multidimensional scaling: With the distance matrix in hand, you are now ready to perform multidimensional scaling to visualize this data. The end goal is to plot the first two dimensions after multidimensional scaling.
To perform MDS, you first need the Euclidean distance matrix (EDM) from the previous part. Then perform the following steps:
Step 1: Form the Gram matrix $G$ from the EDM. [Handout 9, equation 7.6]
Step 2: Perform EVD on $G$ and recover $X$ using $X = Q_p \Lambda_p^{1/2}$.
Report the result by plotting the first two dimensions after multidimensional scaling, with the corresponding industry label for each data point. Does this plot have to be unique? Why?
(f) Interpret the results. What does closer/further in distance mean in this setting? Which industry tends to co-move with the Games industry the most? List three industries whose returns tend to move on their own.
(g) (Optional): What is your advice for an investor who has put most of his money in Telecom stocks? [Think about diversification]
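
The two steps in part (e) of Problem 8 can be sketched in R as follows. This is a minimal illustration on simulated data, not the assignment's data set: the matrix `X` and the labels `Ind1`, ..., `Ind5` are made up for the example. The Gram matrix is formed with the standard double-centering formula $G = -\tfrac{1}{2} J D^{(2)} J$, where $J = I - \tfrac{1}{n}\mathbf{1}\mathbf{1}^T$ and $D^{(2)}$ holds the squared distances; that this is the formula meant by Handout 9, equation 7.6 is an assumption.

```r
# Classical MDS sketch on toy data (hypothetical stand-in for the industry returns).
set.seed(1)
X <- matrix(rnorm(5 * 12), nrow = 5)        # 5 made-up "industries" x 12 months
rownames(X) <- paste0("Ind", 1:5)

# Euclidean distance matrix between the rows of X
D <- as.matrix(dist(X, method = "euclidean"))
n <- nrow(D)

# Step 1: Gram matrix by double centering (assumed form of the handout's formula)
J <- diag(n) - matrix(1 / n, n, n)          # centering matrix
G <- -0.5 * J %*% (D^2) %*% J

# Step 2: eigenvalue decomposition, keep the top k dimensions
e <- eigen(G, symmetric = TRUE)             # eigenvalues come back in decreasing order
k <- 2
coords <- e$vectors[, 1:k] %*% diag(sqrt(e$values[1:k]))   # Q_p Lambda_p^{1/2}
rownames(coords) <- rownames(X)

# Plot the first two MDS dimensions with a label per point
plot(coords, type = "n", xlab = "Dimension 1", ylab = "Dimension 2")
text(coords, labels = rownames(coords))

# Sanity check against R's built-in classical MDS; columns may differ in sign
ref <- cmdscale(dist(X), k = 2)
max_diff <- max(abs(abs(coords) - abs(ref)))
print(max_diff)
```

With the real data, `dist` would need industries in the rows, so a data frame with dates in rows and industries in columns would have to be transposed (and any date column dropped) first. The comparison with `cmdscale` uses absolute values because each coordinate axis is determined only up to sign.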
