

and Psychological Measurement



•Peking University


•Course No. 06716070


CTT model and reliability



& Reliability


•The definition of the CTT model, assumptions, results.

•The definition of reliability

•Reliability of composite scores

•Standard Error of Measurement




•Should CONSEQUENCES be considered “part” of validity?

–Using student test scores to evaluate teacher effectiveness

•Can test scores be sufficiently valid without being reliable?

•A driver’s test includes a reliable written portion and an unreliable performance task. Why?



•Test scores are random variables “sampled” from a hypothetical population

•X = T + E

•Definition of E(X)

•True score for an examinee:

Tj = E(Xj) = μXj




•The true score is the mean, or expected value, of an examinee’s observed scores obtained from a large (theoretically infinite) number of repeated test administrations.

•Theoretically, every examinee has a distribution of possible observed scores…even though we usually only test once.



•What would make the observed scores change from one trial to the next? (Hint: True scores don’t change)

•Errors are random and fluctuate

•An examinee’s distribution of observed scores would be centered around his/her true score.



•The observed scores have an SD, which reflects the amount of error variability present.

•A really reliable test would have examinees’ observed scores closely clustered around their true scores, with very little random fluctuation.


of Error


•For examinee j: Ej = Xj - Tj

•Note that Ej is a new random variable and that E(Ej) = 0.

•E(Xj) = Tj

•The average of errors for one examinee = 0
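A minimal simulation can make the idea of repeated administrations concrete. The sketch below assumes Normally distributed errors and uses hypothetical values for the true score (50) and the error SD (4); neither number comes from the slides. It draws many observed scores X = T + E for one examinee and checks that they center on T while the errors average out to roughly zero.

```python
import random
import statistics


def simulate_examinee(true_score, error_sd, n_admin, seed=0):
    """Draw observed scores X = T + E for one examinee over repeated administrations.

    Assumption: errors are Normal with mean 0 (an illustration choice, not a CTT requirement).
    """
    rng = random.Random(seed)
    return [true_score + rng.gauss(0.0, error_sd) for _ in range(n_admin)]


true_score = 50.0  # hypothetical Tj
error_sd = 4.0     # hypothetical SD of this examinee's random errors

observed = simulate_examinee(true_score, error_sd, n_admin=100_000)
errors = [x - true_score for x in observed]  # Ej = Xj - Tj

mean_observed = statistics.fmean(observed)   # should be close to Tj
mean_error = statistics.fmean(errors)        # should be close to 0
sd_observed = statistics.pstdev(observed)    # reflects the error variability

print(round(mean_observed, 2), round(mean_error, 2), round(sd_observed, 2))
```

With a large number of simulated administrations, the mean of the observed scores approaches the true score, and the SD of the observed scores approaches the error SD.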




•Reliability refers to the consistency (or reproducibility) of scores over administrations.


over time, across parallel forms, between raters, or over tasks within an



Reliability = Repeatability



•Another way to think about this:

If z-scores for examinees stay consistent over administrations, the test scores are reliable.



•The extent to which scores are free of randomness or error makes them reliable.



•As with validity evidence, remember that tests are not reliable, per se, but rather test scores are reliable.

•A test may be administered to a very different population of examinees and produce very different results…


How to Quantify Reliability?


•We know it is desirable for scores to be relatively free from random error, and we know X = T + E.

•If T and X are highly related, it implies that E and X are weakly related. If X and T are perfectly related, then all Observed variability is due to True score variability.



Reliability Index = Correlation between Observed scores and True scores: ρXT



•The reliability index is an important result, but it isn’t practical without further assumptions being made.

•We can’t observe True scores, only Observed scores, so how could we ever estimate the correlation between the two?




•CTT estimation of reliability depends on the concept of parallel forms. Two forms are parallel if:

–Each examinee has the same true score on both forms of the test: Tj1 = Tj2

–Error variances for both forms are equal: σ²(E1) = σ²(E2)

–Errors are uncorrelated across forms

–Assumes the same construct!



•It is difficult (at best) to construct strictly parallel forms, but the concept is important because it makes reliability



•What’s important is that it’s theoretically possible to construct strictly parallel forms…




•Definitions of forms that are not strictly parallel will be especially helpful when we discuss the task of equating or linking different forms.

–Tau equivalence

–Essential tau equivalence

–Congenericity (or “congeneric forms”)

•Tau (τ) equivalence relaxes the assumption of equal error variances (i.e., error variances may be unequal), but keeps the assumption that true scores are equal: Tj1 = Tj2

–Errors are still uncorrelated


Essential Tau equivalence


•Essential Tau (τ) equivalence further relaxes assumptions

–Error variances are not necessarily equal, and

–True scores across forms only differ by an additive constant:

Tj1 = Tj2 + c

–Errors are still uncorrelated



•Congenericity further relaxes assumptions to allow for different scales across forms

–Error variances are not necessarily equal, and

–True scores across forms differ by a positive linear function:

Tj1 = d*Tj2 + c, where d > 0

–Errors are still uncorrelated
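The progressively weaker definitions can be summarized as rules about which constants are allowed in Tj1 = d*Tj2 + c. The sketch below is only an illustration: the names `FORM_RULES` and `form1_true_score` are hypothetical, and the key `essentially_tau_equivalent` is the label assumed here for the additive-constant case.

```python
# Which relaxations each kind of form permits, following the definitions above.
# Tuple: (equal error variances required, additive constant c allowed, scale factor d allowed)
FORM_RULES = {
    "parallel": (True, False, False),
    "tau_equivalent": (False, False, False),
    "essentially_tau_equivalent": (False, True, False),
    "congeneric": (False, True, True),
}


def form1_true_score(t2, kind, c=0.0, d=1.0):
    """Return Tj1 = d*Tj2 + c, rejecting values of c or d the form type does not allow."""
    _, c_allowed, d_allowed = FORM_RULES[kind]
    if d <= 0:
        raise ValueError("d must be positive")
    if c != 0.0 and not c_allowed:
        raise ValueError(f"{kind} forms do not allow an additive constant")
    if d != 1.0 and not d_allowed:
        raise ValueError(f"{kind} forms do not allow a change of scale")
    return d * t2 + c


# Hypothetical true score of 20 on form 2:
print(form1_true_score(20.0, "tau_equivalent"))                     # 20.0
print(form1_true_score(20.0, "essentially_tau_equivalent", c=3.0))  # 23.0
print(form1_true_score(20.0, "congeneric", c=3.0, d=1.5))           # 33.0
```

Only the first flag differs between parallel and tau-equivalent forms: both require identical true scores, but only parallel forms also require equal error variances.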




•All that is required for CTT to work is that the concept of parallel forms is theoretically possible.

•In practice, we will only need to rely on the assumption of congenericity to deal with estimating reliability and equating multiple forms.




Reliability Coefficient = Correlation between observed scores across two parallel forms: ρXiXj




An elegant, enduring concept:


Coefficient vs. Index


Reliability Coefficient = (Rel. Index)²


Reliability Index = SQRT(Rel. Coefficient)
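The link between the coefficient and the index can be written out in a few lines. This derivation is a sketch that relies on two standard CTT results taken here as assumptions: true scores and errors are uncorrelated, and errors are uncorrelated across parallel forms. The subscripts 1 and 2 denote two parallel forms, which share T and have equal observed-score variances.

```latex
\begin{align*}
\operatorname{Cov}(X, T) &= \operatorname{Cov}(T + E,\, T) = \sigma_T^2
  && \text{(assuming } \operatorname{Cov}(T, E) = 0\text{)} \\
\rho_{XT} &= \frac{\sigma_T^2}{\sigma_X \sigma_T} = \frac{\sigma_T}{\sigma_X} \\
\operatorname{Cov}(X_1, X_2) &= \operatorname{Cov}(T + E_1,\, T + E_2) = \sigma_T^2
  && \text{(assuming } \operatorname{Cov}(E_1, E_2) = 0\text{)} \\
\rho_{X_1 X_2} &= \frac{\sigma_T^2}{\sigma_{X_1}\sigma_{X_2}}
  = \frac{\sigma_T^2}{\sigma_X^2} = \rho_{XT}^2
\end{align*}
```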




•As Error variance decreases…

–Ratio of True/Observed variance increases

–Reliability coefficient increases



•Reliability Coefficient = proportion of Observed score variance due to True score variability.

•Reliability Index = correlation between Observed and True scores.


Reliability can now be estimated with observable data!
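A small simulation can illustrate why parallel forms make reliability estimable. The sketch below assumes Normal true scores and errors, with hypothetical values chosen so that the observed-score variance is 16 and the true-score variance is 12.96 (a ratio of 0.81). True scores are visible only because this is a simulation; with real data only the form-to-form correlation could be computed.

```python
import math
import random
import statistics


def pearson(a, b):
    """Pearson correlation of two equal-length lists."""
    mean_a, mean_b = statistics.fmean(a), statistics.fmean(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    ss_a = sum((x - mean_a) ** 2 for x in a)
    ss_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(ss_a * ss_b)


rng = random.Random(1)
n_examinees = 50_000
sd_true = 3.6               # hypothetical: var(T) = 12.96
sd_error = math.sqrt(3.04)  # hypothetical: var(E) = 3.04, so var(X) = 16

true = [rng.gauss(50.0, sd_true) for _ in range(n_examinees)]
# Parallel forms: same true score, equal error variances, independent errors.
form1 = [t + rng.gauss(0.0, sd_error) for t in true]
form2 = [t + rng.gauss(0.0, sd_error) for t in true]

reliability_coefficient = pearson(form1, form2)  # observable in practice
reliability_index = pearson(form1, true)         # needs T, so simulation only
variance_ratio = statistics.pvariance(true) / statistics.pvariance(form1)

print(round(reliability_coefficient, 3),
      round(reliability_index, 3),
      round(variance_ratio, 3))
```

The correlation between the two forms should land near the ratio of True to Observed variance, and the squared correlation with the true scores should land near the same value.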



•With the reliability coefficient, we can determine how much of the variability in observed scores is due to differences among TRUE scores (the thing we’re trying to measure!).

•The higher the value (bounded by zero and one), the less influenced by random errors the scores are.



•Let’s say ρXiXj = 0.81. 81% of the Observed score variance is due to True score variance, and σ²(T) = 0.81σ²(X).

•If σ(X) = 4, we can predict:

σ(T) = SQRT(0.81*16) = 3.6

•And, the correlation between X and T:

ρXT = SQRT(0.81) = 0.9
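The arithmetic in this example can be checked in a few lines. The function name `decompose` is hypothetical, and the error variance is obtained by assuming Observed variance = True variance + Error variance, which holds when true scores and errors are uncorrelated.

```python
import math


def decompose(sd_observed, reliability):
    """Split observed-score variance into true and error parts for a given reliability coefficient."""
    var_observed = sd_observed ** 2
    var_true = reliability * var_observed  # proportion of variance due to true scores
    var_error = var_observed - var_true    # remainder, assuming var(X) = var(T) + var(E)
    return {
        "sd_true": math.sqrt(var_true),
        "var_error": var_error,
        "reliability_index": math.sqrt(reliability),
    }


result = decompose(sd_observed=4.0, reliability=0.81)
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

With the slide's numbers, this reproduces a True score SD of 3.6 and a reliability index of 0.9, and leaves an Error variance of 3.04.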


Standard Error of Measurement


•So, if we have measurements across parallel forms, we can estimate the proportion of True score to Observed score variance…so what?

•If we know the proportion of True score variance, we also know the proportion of Error variance.


Standard Error of Measurement


•By knowing the Error variance, we can use this information to state our confidence that an examinee’s test score accurately reflects his/her true ability (i.e., the True score).


of Error


•We can’t know how much of any one examinee’s score is due to error, but we can estimate the expected amount of variability for observed scores around each examinee’s true score…THINK: “confidence interval”



•Remember, True score is defined as the

mean, or expected value, of an examinee’s Observed scores from a large number

of repeated test administrations.

•Theoretically, every examinee has a

distribution of possible observed scores, even though we only observe one (or



Standard Error of Measurement


•We can’t actually compute the standard deviation of possible observed scores for each examinee, but we can estimate the average error standard deviation…

•This is what we call the Standard Error of Measurement (SEM).

–In a couple of weeks we will talk about conditional SEMs.
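The usual expression for the SEM follows from the variance decomposition. The derivation below is a sketch that assumes true scores and errors are uncorrelated, so that the variances add, and writes the reliability coefficient as ρ_XX′. The numerical line reuses the earlier example values (an Observed SD of 4 and a reliability of 0.81).

```latex
\begin{align*}
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2, \qquad
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} \\
\sigma_E^2 &= \sigma_X^2 - \sigma_T^2 = \sigma_X^2 \left(1 - \rho_{XX'}\right) \\
\mathrm{SEM} &= \sigma_E = \sigma_X \sqrt{1 - \rho_{XX'}} \\
\mathrm{SEM} &= 4\sqrt{1 - 0.81} = 4\sqrt{0.19} \approx 1.74
\end{align*}
```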


Standard Error of Measurement


•Assuming Normally distributed errors (common in Regression):

•X ± 1σE → 68%

–With repeated testing, 68% of the time X would be in this interval

•X ± 1.96σE → 95%

–With repeated testing, 95% of the time X would be in this interval
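These bands are easy to compute once the SEM is known. The sketch below assumes the common formula SEM = σ(X) * SQRT(1 - reliability) and Normally distributed errors. The function names are hypothetical, and the observed score of 50 is an invented value used with the earlier example numbers (σ(X) = 4, reliability = 0.81).

```python
import math


def sem(sd_observed, reliability):
    """Standard error of measurement: the average error SD across examinees."""
    return sd_observed * math.sqrt(1.0 - reliability)


def score_band(observed, sd_observed, reliability, z=1.96):
    """Return (low, high) for observed ± z * SEM; z = 1 gives the 68% band, z = 1.96 the 95% band."""
    half_width = z * sem(sd_observed, reliability)
    return observed - half_width, observed + half_width


low68, high68 = score_band(50.0, sd_observed=4.0, reliability=0.81, z=1.0)
low95, high95 = score_band(50.0, sd_observed=4.0, reliability=0.81)

print(f"SEM = {sem(4.0, 0.81):.2f}")
print(f"68% band: {low68:.2f} to {high68:.2f}")
print(f"95% band: {low95:.2f} to {high95:.2f}")
```

A higher reliability coefficient shrinks the SEM and therefore narrows both bands.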




•Reliability Coefficient: ρXiXj is just like R² from Regression

•The standard error of measurement is just like the standard error of estimate.

•We’ll generalize this to predict T from X.


Reliability Data


•Correlation between scores from the same form administered to the same group of examinees on two separate occasions (coefficient of stability).




•Correlation between two different forms administered to the same examinees on one occasion (coefficient of equivalence).




Reliability Data

•Correlation among test scores when examinees respond to parallel components repeatedly is estimated by the coefficient of internal consistency.


•Next week’s topic is Internal Consistency: the reliability of composite scores

