
431 Quiz 2: Fall 2019
Thomas E. Love
due 2019-11-18 at Noon, version 2019-11-12

Instructions

All of the links for the Quiz 2 Materials are collected at a single URL.

The Materials

To complete the Quiz you'll need three things, all of which are linked at the URL above.

1. The 2019-431-quiz02-questions.PDF file. This contains all of the instructions, questions and potential responses. Be sure that you see all 30 questions, and all 27 pages.
2. Five data files, named quiz_data_states.csv, quiz_hosp.csv, quiz_ra.csv, quiz_sim_nejm.csv and quiz_statin.csv, which may be useful to you.
3. The Quiz 2 Answer Sheet, which is a Google Form.

Use the PDF file to read the quiz and craft your responses (occasionally making use of the provided datasets), and then place those responses into the Answer Sheet Google Form. When using the Answer Sheet, please select or type in your best response (or responses, as indicated) for each question. All of your responses must be in the Answer Sheet by the deadline.

Key Things To Remember

The deadline for completing the Answer Sheet is Noon on Monday 2019-11-18, and this is a firm deadline, without the grace period we allow for in turning in Homework.

The questions are not arranged in any particular order, and your score is based on the number of correct responses, so you should answer all questions. There are 30 questions, and each is worth either 3 or 4 points. The maximum possible score on the quiz is 100 points. Questions 01, 02, 05, 06, 08, 14, 17, 22, 27 and 30 are worth 4 points each. They are marked to indicate this.

If you wish to work on some of the quiz on the Answer Sheet and then return later, you can do this by [1] completing the final question which asks you to type in your full name, and then [2] submitting the Answer Sheet. You will then receive a link which allows you to return to the Answer Sheet without losing your progress.

Occasionally, I ask you to provide a single line of code. In all cases, a single line of code can include at most one pipe for these purposes, although you may or may not need the pipe in any particular setting. Moreover, you need not include the library command at any time for any of your code. Assume in all questions that all relevant packages have been loaded in R. Any reference to a logarithm refers to a natural logarithm. If you need to set a seed, use set.seed(2019) throughout this Quiz.

You are welcome to consult the materials provided on the course website, but you are not allowed to discuss the questions on this quiz with anyone other than Professor Love and the teaching assistants at 431-help at case dot edu. Please submit any questions you have about the Quiz to 431-help through email. Thank you, and good luck.

Question 01 (4 points)

Consider the starwars tibble that is part of the dplyr package in the tidyverse. Filter the data file to focus on individuals who are of the Human species, who also have complete data on both their height and mass. Then use a t-based approach to estimate an appropriate 90% confidence interval for the difference between the mean body-mass index of Human males minus the mean body-mass index of Human females. Don't assume that the population variances of males and females are the same. The data provides height in centimeters and mass in kilograms. You'll need to calculate the body-mass index (BMI) values; the appropriate formula to obtain BMI in our usual units of kg/m^2 is:

    BMI = 10,000 * (mass in kg) / (height in cm)^2

Specify your point estimate, and then the lower and upper bound, each rounded to a single decimal place, and be sure to specify the units of measurement.
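Here is a minimal sketch of one way such a calculation could be set up in R, assuming the tidyverse is loaded. Note that, depending on your version of dplyr, the male/female variable in starwars is stored as sex or as gender; the sketch below assumes sex.

starwars_humans <- starwars %>%
    filter(species == "Human") %>%
    filter(!is.na(height), !is.na(mass)) %>%
    mutate(bmi = 10000 * mass / height^2)

# Welch two-sample t interval (variances not assumed equal);
# check the order of the groups, since the question asks for male minus female
t.test(bmi ~ sex, data = starwars_humans, conf.level = 0.90)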
Question 02 (4 points)

On 2019-09-25, Maggie Koerth-Baker at FiveThirtyEight published "We've Been Fighting the Vaping Crisis Since 1937." In that article, she quotes a 2019-09-06 article at the New England Journal of Medicine by Jennifer E. Layden et al. entitled "Pulmonary Illness Related to E-Cigarette Use in Illinois and Wisconsin — Preliminary Report." Quoting that report:

    E-cigarettes are battery-operated devices that heat a liquid and deliver an aerosolized product to the user. ... In July 2019, the Wisconsin Department of Health Services and the Illinois Department of Public Health received reports of pulmonary disease associated with the use of e-cigarettes (also called vaping) and launched a coordinated public health investigation. ... We defined case patients as persons who reported use of e-cigarette devices and related products in the 90 days before symptom onset and had pulmonary infiltrates on imaging and whose illnesses were not attributed to other causes.

The entire report is available at https://www.nejm.org/doi/full/10.1056/NEJMoa1911614. In the study, 53 case patients were identified, but some patients gave no response to the question of whether or not "they had used THC (tetrahydrocannabinol) products in e-cigarette devices in the past 90 days." Of the 41 who did respond, 33 reported THC use. Assume those 41 subjects are a random sample of all case patients that will appear in Wisconsin and Illinois in 2019.

Use a SAIFS procedure to estimate an appropriate 90% confidence interval for the PERCENTAGE of case patients in Illinois and Wisconsin in 2019 that used THC in the 90 days prior to symptom onset. Note that I've emphasized the word PERCENTAGE here, so as to stop you from instead presenting a proportion. Specify your point estimate of this PERCENTAGE, and then the lower and upper bound for your confidence interval, in each case rounded to a single decimal place.

Question 03

Alex, Beth, Cara and Dave independently select random samples from the same population. The sample sizes are 200 for Alex, 400 for Beth, 125 for Cara, and 300 for Dave. Each researcher constructs a 95% confidence interval from their data using the same statistical method. The half-widths (margins of error) for those confidence intervals are 1.45, 1.74, 1.96 and 2.43. Match each interval's margin of error with its researcher.

Rows:
a. Alex, who took a sample of n = 200 people.
b. Beth, who took a sample of n = 400 people.
c. Cara, who took a sample of n = 125 people.
d. Dave, who took a sample of n = 300 people.

Columns:
1. 1.45
2. 1.74
3. 1.96
4. 2.43

Question 04

Suppose you have a tibble with two variables. One is a factor called Exposure with levels High, Low and Medium, arranged in that order, and the other is a quantitative outcome. You want to rearrange the order of the Exposure variable so that you can then use it to identify for ggplot2 a way to split histograms of outcomes up into a series of smaller plots, each containing the histogram for subjects with a particular level of exposure (Low, then Medium, then High).

Which of the pairs of tidyverse functions identified below has Dr. Love used to accomplish such a plot?

a. fct_reorder and facet_wrap
b. fct_relevel and facet_wrap
c. fct_collapse and facet_wrap
d. fct_reorder and group_by
e. fct_collapse and group_by
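For readers who want to see what such a plot looks like in code, here is a minimal sketch of re-ordering a factor by hand and then faceting histograms on it. The names dat and outcome are placeholders for the tibble and its quantitative variable, not objects supplied with the quiz.

# dat is a placeholder tibble with a factor Exposure and a quantitative outcome
dat %>%
    mutate(Exposure = fct_relevel(Exposure, "Low", "Medium", "High")) %>%
    ggplot(aes(x = outcome)) +
    geom_histogram(bins = 20) +
    facet_wrap(~ Exposure)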
Question 05 (4 points)

In a double-blind trial, 350 patients with active rheumatoid arthritis were randomly assigned to receive one of two therapy types: a cheaper one, or a pricier one, and went on to participate in the trial.

The primary outcome was the change in DAS28 at 48 weeks as compared to study entry. The DAS28 is a composite index of the number of swollen and tender joints, the erythrocyte sedimentation rate, and a visual-analogue scale of patient-reported disease activity. A decrease in the DAS28 of 1.2 or more (so a change of -1.2 or below) was considered to be a clinically meaningful improvement. Data are in the quiz_ra.csv file.

A student completed four analyses, shown below. Which of the following 90% confidence intervals for the change in DAS28 at 48 weeks most appropriately compares the pricier therapy to the cheaper one?

d. Analysis D
e. Analysis E
f. Analysis F
g. Analysis G

ra <- read_csv("quiz_ra.csv") %>% tbl_df()

mosaic::favstats(das28_chg ~ therapy, data = ra)

  therapy   min     Q1 median     Q3  max      mean       sd   n missing
1 Cheaper -6.12 -2.955  -2.22 -1.415 0.56 -2.250857 1.208183 175       0
2 Pricier -5.56 -2.630  -2.06 -1.250 1.53 -2.027486 1.260694 175       0

ggplot(data = ra, aes(x = therapy, y = das28_chg, fill = therapy)) +
    geom_violin(alpha = 0.3) + geom_boxplot(width = 0.3, notch = TRUE) +
    theme_bw() + guides(fill = FALSE) + scale_fill_viridis_d()

Analysis D

ra %$% t.test(das28_chg ~ therapy, var.equal = TRUE) %>%
    tidy(conf.int = TRUE, conf.level = 0.90) %>%
    mutate(estimate = estimate1 - estimate2) %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.223   -0.483    0.0362 Two Sample t-test

Analysis E

ra %$% t.test(das28_chg ~ therapy, paired = TRUE) %>%
    tidy(conf.int = TRUE, conf.level = 0.90) %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.223   -0.250    -0.197 Paired t-test

Analysis F

ra %$% wilcox.test(das28_chg ~ therapy, paired = TRUE,
                   conf.int = TRUE, conf.level = 0.90) %>%
    tidy() %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.230   -0.245    -0.215 Wilcoxon signed rank test with continuity co~

Analysis G

ra %$% wilcox.test(das28_chg ~ therapy, conf.int = TRUE, conf.level = 0.90) %>%
    tidy() %>%
    select(estimate, conf.low, conf.high, method)

# A tibble: 1 x 4
  estimate conf.low conf.high method
1   -0.240   -0.450   -0.0300 Wilcoxon rank sum test with continuity corre~
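As an aside (this is not one of the four analyses offered as choices above), a Welch-type 90% interval that does not pool the two variances could be sketched as follows, using the same ra tibble:

ra %$% t.test(das28_chg ~ therapy, conf.level = 0.90) %>%
    tidy() %>%
    select(estimate, conf.low, conf.high, method)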
Question 06 (4 points)

Referring again to the study initially described in Question 05, which of the following analyses provides an appropriate 90% confidence interval for the difference (cheaper - pricier) in the proportion of participants who had a clinically meaningful improvement (DAS28 change of -1.2 or below) at 48 weeks?

j. Analysis J
k. Analysis K
l. Analysis L
m. Analysis M
n. None of the above.

Analysis J

ra <- read_csv("quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg <= -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "FALSE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
    FALSE      31      41
     TRUE     144     134

twobytwo(31, 41, 144, 134, "improved", "didnt improve", "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didnt improve

              cheaper pricier P(cheaper) 95% conf. interval
improved           31      41     0.4306   0.3217   0.5466
didnt improve     144     134     0.5180   0.4593   0.5762

                                      95% conf. interval
             Relative Risk:  0.8312    0.6227    1.1096
         Sample Odds Ratio:  0.7036    0.4173    1.1864
Conditional MLE Odds Ratio:  0.7043    0.4019    1.2246
    Probability difference: -0.0874   -0.2100    0.0416

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

Analysis K

ra <- read_csv("quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg <= -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "TRUE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
     TRUE     144     134
    FALSE      31      41

twobytwo(144, 134, 31, 41, "improved", "didnt improve", "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didnt improve

              cheaper pricier P(cheaper) 95% conf. interval
improved          144     134     0.5180   0.4593   0.5762
didnt improve      31      41     0.4306   0.3217   0.5466

                                      95% conf. interval
             Relative Risk:  1.2031    0.9013    1.6059
         Sample Odds Ratio:  1.4213    0.8429    2.3965
Conditional MLE Odds Ratio:  1.4198    0.8166    2.4880
    Probability difference:  0.0874   -0.0416    0.2100

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

Analysis L

ra <- read_csv("quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg <= -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "FALSE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
    FALSE      31      41
     TRUE     144     134

twobytwo(31, 41, 144, 134, conf.level = 0.90,
         "improved", "didnt improve", "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didnt improve

              cheaper pricier P(cheaper) 90% conf. interval
improved           31      41     0.4306   0.3383   0.5279
didnt improve     144     134     0.5180   0.4687   0.5669

                                      90% conf. interval
             Relative Risk:  0.8312    0.6523    1.0592
         Sample Odds Ratio:  0.7036    0.4538    1.0908
Conditional MLE Odds Ratio:  0.7043    0.4379    1.1271
    Probability difference: -0.0874   -0.1914    0.0212

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

Analysis M

ra <- read_csv("quiz_ra.csv") %>% tbl_df()

ra <- ra %>%
    mutate(improved = das28_chg <= -1.2) %>%
    mutate(improved = fct_relevel(factor(improved), "TRUE"))

ra %>% tabyl(improved, therapy)

 improved Cheaper Pricier
     TRUE     144     134
    FALSE      31      41

twobytwo(144, 134, 31, 41, conf.level = 0.90,
         "improved", "didnt improve", "cheaper", "pricier")

2 by 2 table analysis:
------------------------------------------------------
Outcome   : cheaper
Comparing : improved vs. didnt improve

              cheaper pricier P(cheaper) 90% conf. interval
improved          144     134     0.5180   0.4687   0.5669
didnt improve      31      41     0.4306   0.3383   0.5279

                                      90% conf. interval
             Relative Risk:  1.2031    0.9441    1.5331
         Sample Odds Ratio:  1.4213    0.9168    2.2034
Conditional MLE Odds Ratio:  1.4198    0.8872    2.2838
    Probability difference:  0.0874   -0.0212    0.1914

             Exact P-value: 0.2339
        Asymptotic P-value: 0.1872
------------------------------------------------------

Question 07

In response to unexpectedly low enrollment, the protocol was amended part-way through the trial described in Question 05 to change the primary outcome from a binary outcome to a continuous outcome in order to increase the power of the study.

Originally, the proposed primary outcome was the difference in the proportion of participants who had a DAS28 of 3.2 or less at week 48. The original power analysis established a sample size target of 225 completed enrollments in each therapy group, based on a two-sided 10% significance level, and a desire for 90% power. In that initial power analysis, the proportion of participants with a DAS28 of 3.2 or less at week 48 was assumed to be 0.27 under the less effective of the two therapies.

What value was used in the power calculation for the proportion of participants with DAS28 of 3.2 or less at week 48 for the more effective therapy? State your answer rounded to two decimal places.
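For readers thinking about how such a back-calculation might be done in R: assuming the planners used the usual two-sample comparison-of-proportions machinery, power.prop.test from the stats package can solve for the second proportion when the per-group sample size, significance level, power and first proportion are supplied. This is a sketch of that idea, not necessarily the exact calculation the protocol used.

# Solve for p2 given n per group, p1, alpha and power (p2 is left unspecified)
power.prop.test(n = 225, p1 = 0.27, sig.level = 0.10, power = 0.90,
                alternative = "two.sided")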
Question 08 (4 points)

In the trial described in Question 05, 21 of the 222 subjects originally assigned to receive the cheaper therapy and 35 of the 219 subjects originally assigned to receive the pricier therapy experienced a serious adverse event (which included infections, gastrointestinal, renal, urinary, cardiac or vascular disorders, as well as surgical or medical procedures.)

Suppose you want to determine whether or not there is a statistically detectable difference in the rates of serious adverse events between the two therapy groups at the 5% significance level. Specify a single line of R code that would do this, appropriately.
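One reasonable form such a line could take (several formulations are acceptable, and this sketch is not presented as the official answer) is a two-sample test of proportions:

# Compares the SAE rates 21/222 (cheaper) and 35/219 (pricier) at alpha = 0.05
prop.test(x = c(21, 35), n = c(222, 219))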
Question 09

The Pottery data are part of the carData package in R. Included are data describing the chemical composition of ancient pottery found at four sites in Great Britain. This data set will also be used in Question 10. In this question, we will focus on the Na (Sodium) levels, and our goal is to compare the mean Na levels across the four sites.

anova(lm(Na ~ Site, data = carData::Pottery))

Analysis of Variance Table

Response: Na
          Df  Sum Sq  Mean Sq F value    Pr(>F)
Site       3 0.25825 0.086082  9.5026 0.0003209 ***
Residuals 22 0.19929 0.009059
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Which of the following conclusions is most appropriate, based on the output above?

a. The F test allows us to conclude that the population mean Na level in at least one of the four sites is detectably different than the others, at a 1% significance level.
b. The F test allows us to conclude that the population mean Na level in each of the four sites is detectably different than each of the others, at a 1% significance level.
c. The F test allows us to conclude that the population mean Na level is the same in all four sites, at a 1% significance level.
d. The F test allows us to conclude that the population mean Na level may not be the same in all sites, but is not detectably different at the 1% level.
e. None of these conclusions are appropriate.

Question 10

Consider these two sets of plots, generated to describe variables from the Pottery data set within the carData package.

(Plot 1 and Plot 2 for Question 10 are not reproduced here.)

And now, here are summary statistics from the mosaic::inspect function describing the variables contained in the Pottery data set.

mosaic::inspect(carData::Pottery)

categorical variables:
  name  class levels  n missing                              distribution
1 Site factor      4 26       0 Llanedyrn (53.8%), AshleyRails (19.2%) ...

quantitative variables:
  name   class   min    Q1 median      Q3   max       mean        sd  n
1   Al numeric 10.10 11.95 13.800 17.4500 20.80 14.4923077 2.9926474 26
2   Fe numeric  0.92  1.70  5.465  6.5900  7.09  4.4676923 2.4097507 26
3   Mg numeric  0.53  0.67  3.825  4.5025  7.23  3.1415385 2.1797260 26
4   Ca numeric  0.01  0.06  0.155  0.2150  0.31  0.1465385 0.1012301 26
5   Na numeric  0.03  0.05  0.150  0.2150  0.54  0.1584615 0.1352832 26

Based on this output, and whatever other work you need to do, which of the statements below is true, about Variable 1 (as shown in Plot 1) and Variable 2 (shown in Plot 2)?

a. Var1 is . . .
b. Var2 is . . .

Choices are:

Question 11

Suppose you have a data frame named mydata containing a variable called sbp, which shows the participant's systolic blood pressure in millimeters of mercury. Which of the following lines of code will create a new variable badbp within the mydata data frame which takes the value TRUE when a subject has a systolic blood pressure that is at least 120 mm Hg, and FALSE when a subject's systolic is less than 120 mm Hg?

a. mydata %>% badbp <- sbp >= 120
b. mydata$badbp <- ifelse(mydata$sbp >= 120, YES, NO)
c. badbp <- mydata %>% filter(sbp >= 120)
d. mydata %>% mutate(badbp = sbp >= 120)
e. None of these will do the job.

Question 12

According to Jeff Leek in The Elements of Data Analytic Style, which of the following is NOT a good reason to create graphs for data exploration?

a. To understand properties of the data.
b. To inspect qualitative features of the data more effectively than a huge table of raw data would allow.
c. To discover new patterns or associations.
d. To consider whether transformations may be of use.
e. To look for statistical significance without first exploring the data.

Question 13

If the characteristics of a sample approximate the characteristics of its population in every respect, then which of the statements below is true? (CHECK ALL THAT APPLY.)

a. The sample is random
b. The sample is accidental
c. The sample is stratified
d. The sample is systematic
e. The sample is representative
f. None of the above

Setup for Questions 14-15

For Questions 14 and 15, consider the data I have provided in the quiz_hosp.csv file. The data describe 700 simulated patients at a metropolitan hospital. Available are:

- subject.id = Subject Identification Number (not a meaningful code)
- sex = the patient's sex (FEMALE or MALE)
- statin = does the patient have a prescription for a statin medication (YES or NO)
- insurance = the patient's insurance type (MEDICARE, COMMERCIAL, MEDICAID, UNINSURED)
- hsgrads = the percentage of adults in the patient's home neighborhood who have at least a high school diploma (this measure of educational attainment is used as an indicator of the socio-economic place in which the patient lives)

Question 14 (4 points)

Using the quiz_hosp data, what is the 95% confidence interval for the odds ratio which compares the odds of receiving a statin if you are MALE divided by the odds of receiving a statin if you are FEMALE? Show the point and interval estimates, rounded to two decimal places. Do NOT use a Bayesian augmentation here.

Question 15

Perform an appropriate analysis to determine whether insurance type is associated with the education (hsgrads) variable, ignoring all other information in the quiz_hosp data. Which of the following conclusions is most appropriate based on your analyses, using a 5% significance level?

a. The ANOVA F test shows no detectable effect of insurance on hsgrads, so it doesn't make sense to compare pairs of insurance types.
b. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison reveals that Medicare shows detectably higher education levels than Uninsured.
c. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison reveals that Medicaid's education level is detectably lower than either Medicare or Commercial.
d. The ANOVA F test shows a detectable effect of insurance on hsgrads, and a Tukey HSD comparison reveals that Uninsured's education level is detectably lower than Commercial or Medicare.
e. None of these conclusions is appropriate.
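For Question 15, a minimal sketch of the kind of analysis being described (an ANOVA F test followed by Tukey HSD pairwise comparisons) might look like the following, assuming quiz_hosp.csv is read into a tibble called hosp from the working directory:

hosp <- read_csv("quiz_hosp.csv")

model_15 <- aov(hsgrads ~ insurance, data = hosp)
summary(model_15)                       # ANOVA F test at the 5% level
TukeyHSD(model_15, conf.level = 0.95)   # pairwise comparisons of insurance types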
Question 16

Once a confidence interval is calculated, several design changes may be used by a researcher to make a confidence interval wider or narrower. For each of the changes listed below, indicate the impact on the width of the confidence interval.

Rows are:
a. Increase the level of confidence.
b. Increase the sample size.
c. Increase the standard error of the estimate.
d. Use a bootstrap approach to estimate the CI.

Columns are:
1. CI will become wider
2. CI will become narrower
3. CI width will not change
4. It is impossible to tell

Question 17 (4 points)

The data in the quiz_statin.csv file provided to you describe the results of a study of 180 patients who have a history of high cholesterol. Patients in the study were randomly assigned to the use of a new statin medication, or to retain their current regimen. The columns in the data set show a patient identification code, whether or not the patient was assigned to the new statin (Yes or No) and their LDL cholesterol value (in mg/dl) at the end of the study. You have been asked to produce a 95% confidence interval comparing the mean LDL levels across the two statin groups (including both a point estimate and appropriate confidence interval rounded to two decimal places), and then describe your result in context in a single English sentence.

Which of the following approaches and conclusions are reasonable in this setting? (CHECK ALL THAT APPLY)

a. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.65, 9.24) mg/dl, based on an indicator variable regression model, which replicates a two-sample t test assuming equal variances.
b. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.65, 9.24) mg/dl, based on an indicator variable regression model, which replicates a two-sample t test assuming equal variances.
c. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.56, 9.33) mg/dl, based on a Welch two-sample t test not assuming equal variances.
d. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.56, 9.33) mg/dl, based on a Welch two-sample t test not assuming equal variances.
e. LDL levels using the new statin were 4.95 mg/dl higher with 95% CI (0.94, 9.21) mg/dl, based on a bootstrap comparison of the population means and using the seed 2019.
f. LDL levels using the new statin were 4.95 mg/dl lower with 95% CI (0.94, 9.21) mg/dl, based on a bootstrap comparison of the population means and using the seed 2019.
g. None of the above are appropriate, since we should be using a paired samples analysis with these data.
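Two of the approaches mentioned in Question 17 can be sketched as below. The column names statin_group and ldl are placeholders (the question describes the columns but the actual names in quiz_statin.csv may differ), so substitute the names you find in the file.

statin <- read_csv("quiz_statin.csv")

# Indicator-variable regression (equivalent to a pooled two-sample t test);
# statin_group and ldl are placeholder column names
lm(ldl ~ statin_group, data = statin) %>%
    tidy(conf.int = TRUE, conf.level = 0.95)

# Welch two-sample t test, not assuming equal variances
t.test(ldl ~ statin_group, data = statin, conf.level = 0.95) %>% tidy()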
Question 18

A hospital system has about 1 million records in its electronic health record database who meet our study's qualifying requirements for inclusion and exclusion. We believe that about 20% of the subjects who qualify by these criteria will need a particular blood test.

Rows are:
a. Which will provide a confidence interval with smaller width for the proportion needing the blood test, using a Wald approach?
b. Which will provide a better confidence interval estimate for the sample proportion of eligible subjects who need the blood test?

Columns are:
1. A random sample of 85 subjects who meet the qualifying requirements.
2. A non-random sample of 850,000 of the subjects who met the qualifying requirements in the past year.

Question 19

A series of 88 models were built by a team of researchers interested in systems biology. 36 of the models showed promising results in an attempt to validate them out of sample. Define the hit rate as the percentage of models built that show these promising results. Which of the following intervals appropriately describes the uncertainty we have around a hit rate estimate in this setting, using a Wald confidence interval approach with a Bayesian augmentation and permitting a 10% rate of Type I error?

a. (31.8%, 50.3%)
b. (32.2%, 50.2%)
c. 0.411 plus or minus 9 percentage points
d. (32.4%, 50.3%)
e. None of these intervals.

Question 20

The lab component of a core course in biology is taught at the Watchmaker's Technical Institute by a set of five teaching assistants, whose names, conveniently, are Amy, Beth, Carmen, Donna and Elena. On the second quiz of the semester (each section takes the same set of quizzes) an administrator at WTI wants to compare the mean scores across lab sections. She produces the following output in R.

Analysis of Variance Table

Response: exam2
           Df  Sum Sq Mean Sq F value  Pr(>F)
ta          4   971.5 242.868  2.7716 0.02898
Residuals 165 14458.4  87.627

Emboldened by this result, the administrator decides to compare mean exam2 scores for each possible pair of TAs, using a Bonferroni correction. Suppose she's not heard of pairwise.t.test() and therefore plans to make each comparison separately with two-sample t tests. If she wants to maintain an overall α level of 0.10 for the resulting suite of pairwise comparisons using the Bonferroni correction, then what significance level should she use for each of the individual two-sample t tests?

a. She should use a significance level of 0.10 on each test.
b. She should use 0.05 on each test.
c. She should use 0.025 on each test.
d. She should use 0.01 on each test.
e. She should use 0.001 on each test.
f. None of these answers are correct.

Question 21

If the administrator at the Watchmaker's Technical Institute that we mentioned in Question 20 instead used a Tukey HSD approach to make her comparisons, she might have obtained the following output.

Tukey multiple comparisons of exam2 means, 90% family-wise confidence level

              diff    lwr   upr  ||                diff    lwr    upr
-----------  ----- ------ -----  ||  -----------  ----- ------ ------
Beth-Amy      1.21  -4.43  6.83  ||  Donna-Beth   -6.53 -12.16  -0.90
Carmen-Amy   -1.41  -7.04  4.22  ||  Elena-Beth   -0.24  -5.87   5.40
Donna-Amy    -5.32 -10.96  0.31  ||  Donna-Carmen -3.91  -9.54   1.72
Elena-Amy     0.97  -4.66  6.60  ||  Elena-Carmen  2.38  -3.25   8.01
Carmen-Beth  -2.62  -8.25  3.01  ||  Elena-Donna   6.29   0.66  11.93

Note that when we refer in the responses below to Beth's scores, we mean the scores of students who were in Beth's lab section. Which conclusion of those presented below would be most appropriate?

a. Amy's scores are significantly higher than Carmen's or Elena's.
b. Beth's scores were significantly higher than Amy's.
c. Donna's scores are significantly lower than Beth's or Elena's.
d. Elena's scores are significantly lower than Donna's.
e. None of these answers are correct.
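For readers who want to see how these two multiple-comparison strategies are typically coded, here is a sketch. The data frame quiz2, with columns exam2 and ta, is a placeholder, since the administrator's data are not provided with the quiz.

# Bonferroni-adjusted pairwise comparisons of the five TA sections
pairwise.t.test(quiz2$exam2, quiz2$ta, p.adjust.method = "bonferroni")

# Tukey HSD comparisons with a 90% family-wise confidence level
TukeyHSD(aov(exam2 ~ ta, data = quiz2), conf.level = 0.90)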
Question 22 (4 points)

The quiz_data_states.csv file contains information on several variables related to the 50 United States plus the District of Columbia. The available data include 102 rows of information on six columns, and those columns are:

- code: the two-letter abbreviation for the "state" (DC = Washington DC, etc.)
- state: the "state" name
- year: 2019 or 2010, the year for which the remaining variables were obtained
- population: number of people living in the "state"
- poverty_people: number of people in the "state" living below the poverty line
- poverty_rate: % of people living in the "state" who are below the poverty line

Our eventual goal is to use the quiz_data_states data to produce an appropriate 90% confidence interval for the change from 2010 to 2019 in poverty rate, based on an analysis of the data at the level of the 51 "states".

Which of the following statements is most true?

a. This should be done using a paired samples analysis, and the quiz_data_states data require us to calculate the paired differences, but are otherwise ready to plot now.
b. This should be done using a paired samples analysis, and the quiz_data_states data require us to pivot the data to make them wider, and then calculate the paired differences and plot them.
c. This should be done using a paired samples analysis, and the quiz_data_states data require us to pivot the data to make them longer, and then calculate the paired differences and plot them.
d. This should be done using an independent samples analysis, and the quiz_data_states data are ready to be plotted appropriately now.
e. This should be done using an independent samples analysis, and the quiz_data_states data require us to pivot the data to make them wider, and then plot the distributions of the two samples.
f. This should be done using an independent samples analysis, and the quiz_data_states data require us to pivot the data to make them longer, and then plot the distributions of the two samples.

Question 23

Which of the following is the most appropriate way to complete the development of the confidence interval proposed in Question 22?

a. Tukey HSD comparisons following an Analysis of Variance
b. Applying tidy() to an Indicator Variable Regression
c. Applying tidy() to an Intercept-only Regression
d. A Wilcoxon-Mann-Whitney Rank Sum Confidence Interval
e. A bootstrap on the poverty_people values across the states

Question 24

Use the data you have been provided in the quiz_data_states.csv file to provide a point estimate of the change from 2010 to 2019 in the poverty rate in the United States as a whole. Provide your response as a proportion with four decimal places. Note carefully what I am asking for (and not asking for) here.
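If, after working through Questions 22-24, you want to see one way the reshaping and a whole-country figure could be coded, here is a sketch. It assumes the file is read with read_csv from the working directory, and it is offered as one possible route, not as the official solution.

states <- read_csv("quiz_data_states.csv")

# Reshape to one row per state, with the 2010 and 2019 rates side by side,
# then compute the state-level change in poverty rate
states_wide <- states %>%
    select(state, year, poverty_rate) %>%
    pivot_wider(names_from = year, values_from = poverty_rate,
                names_prefix = "rate_") %>%
    mutate(rate_change = rate_2019 - rate_2010)

# Poverty rate for the United States as a whole, by year (as a proportion,
# weighting each state by its population rather than averaging state rates)
states %>%
    group_by(year) %>%
    summarize(us_rate = sum(poverty_people) / sum(population))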
Question 25

In The Signal and The Noise, Nate Silver writes repeatedly about a Bayesian way of thinking about uncertainty, for instance in Chapters 8 and 13. Which of the following statistical methods is NOT consistent with a Bayesian approach to thinking about variation and uncertainty? (CHECK ALL THAT APPLY)

a. Updating our forecasts as new information appears.
b. Establishing a researchable hypothesis prior to data collection.
c. Significance testing of a null hypothesis, using, say, Fisher's exact test.
d. Combining information from multiple sources to build a model.
e. Gambling using a strategy derived from
