EXERCISE 7
PSYCHOLOGICAL MEASUREMENT (806)
OCTOBER 27, 1998
DUE NOVEMBER 17, 1998
DIRECTION -- Based on identified groups, do one of the following:
1. Run Example 1 from BILOG (see pages 6-1 to 6-12 and the handout for class). Prepare a short guide for the class for running BILOG based on the sample.
2. Based on the sample output (you can use the copy provided to the class), interpret the results and findings for the IRT materials. Focus on the "item parameters after cycle 6" (see pages 6-7 and 6-8).
3. Demonstrate the relationship between the IRT (6-7 & 6-8) and classical (6-4 & 6-5) item parameters. Demonstrate the relationship between the thetas and classical test scores (see pages 6-10 & 6-11). Explain the results.
INTRODUCTION TO ITEM RESPONSE THEORY
1-1
Item response theory (IRT) is gaining in acceptance in psychological and educational testing because it provides more adaptable and effective methods of test construction, analysis, and scoring than those derived from classical test theory. The source of its greater power is in the relationships it establishes between properties of the items and the operating characteristics of the test made up of the items. These relationships can be valid for actual tests of any length, whereas any comparable results in classical theory hold only for hypothetical tests consisting of indefinitely many items.
The provision in IRT for treating the items, or small sets of similar items, as the exchangeable units of test construction and scoring has led to numerous innovations in testing practice, especially item banking and adaptive testing. The former can appreciably reduce the time and cost of producing a high quality operational test. The latter, either in the form of computerized adaptive testing or two-stage testing using paper-and-pencil instruments, enables testing time to be reduced to half or a third of that required for a conventional test of the same precision.
Equally important for long-term testing and assessment programs is the ability to retire and replace items in an operational test without altering the interpretation of the test scale. Because IRT scale scores are functions of estimated item parameters, the scoring absorbs possible differences in the characteristics (difficulty, discriminating power, etc.) of the replacement items.
Another property unique to IRT is the location of items and the respondents on the same scale. The response models on which IRT methods are based enable the analyst to state the probability that a respondent at a particular score level will answer a given item correctly. This permits the "content referencing" of the scale scores. Typical items that respondents can answer correctly with an assigned probability (e.g., 50 or 80 percent) illustrate the meaning of various points on the scale in terms of task content.
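As an illustration of how a response model yields such a probability, the following is a minimal sketch of the three-parameter logistic (3-PL) function used in Example 1. The function name, the item values, and the scaling constant D = 1.7 are assumptions made for illustration; whether a given BILOG run uses D = 1.7 or the natural logistic metric (D = 1.0) depends on program options not shown in this handout.

```python
import math

def p_correct_3pl(theta, slope, threshold, asymptote, D=1.7):
    """Probability of a correct response under the 3-PL model.

    theta: respondent ability; slope, threshold, asymptote: item parameters.
    D is an assumed scaling constant (1.7 approximates the normal ogive).
    """
    z = D * slope * (theta - threshold)
    return asymptote + (1.0 - asymptote) / (1.0 + math.exp(-z))

# Hypothetical item with five alternatives (asymptote near 1/5).
for theta in (-2.0, 0.0, 2.0):
    print(theta, round(p_correct_3pl(theta, 1.0, 0.0, 0.2), 3))
```

At theta equal to the threshold, the probability is halfway between the lower asymptote and 1, which is why items and respondents can be placed on the same scale.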
In this chapter, we discuss the IRT procedures implemented in the BILOG program. Only results are presented. For derivations and proofs, the reader should consult the readings listed at the end of the chapter.
6 ---- Sample Problems
The examples in this chapter illustrate typical applications of the BILOG program. The command files and data for these examples appear in files of the program distribution diskettes. Because of the versatility of the program, not all its features can be demonstrated here. The user will find descriptions of other options in the command summaries of Chapter 4 and can easily modify the command files of the present examples to explore their effects.
Example 1: SIMULATED 3-PL DATA: EAP ESTIMATIONS OF SCORES
This example is based on randomly generated data that simulate some of the most frequently encountered conditions of cognitive testing. The test consists of 20 items, which is about the smallest number of binary items that can be reasonably used in practical testing. The items are assumed to be in multiple-choice format with five alternatives. It is assumed that some of the examinees will omit items that are too difficult for them.
To hold the problem to a reasonable size, we have set the sample size at 100. This is on the small side in an IRT application (a sample size in excess of 200 is usually considered desirable), but with the default priors assumed for the item parameters, the results of the calibration are reasonable.
Because the sample size is not large enough to attempt estimating the population latent distribution, we have accepted the default normal distribution with mean zero and standard deviation one.
Twenty items is the smallest number for which the chi-square test of item fit can be recommended. Due to the small sample size, the test is not very sensitive, and, as might be expected, none of the items shows any indication of poor fit.
For scoring the examinees, we have chosen the most robust method, EAP (Bayes) estimation, on the assumption that the population distribution is normal. We follow the usual testing convention of standardizing the scores in the sample (with mean 0 and standard deviation 1, in this case). Because the scale of measurement for a psychological or educational test almost always has an arbitrary origin and unit, no generality is lost by rescaling in the calibration sample. In Example 2, these EAP score estimates are compared with ML (maximum likelihood) estimates similarly scaled.
To illustrate (in Example 2) how response data can be scored from the file of item-parameter estimates from an earlier BILOG calibration, we make use of the SAVE option in the GLOBAL command, followed by the SAVE command, to save the item parameter file as EXAMPL01.PAR. The item parameters are saved after being adjusted to account for the rescaling of the scores in the sample. If they are used to score other data, the resulting scores will be expressed in the scale metric of the original sample. They can, however, be saved after rescaling in other data, in which case they will be readjusted to the scale of that sample.
COMMAND FILE: EXAMPL01.BLG
EXAMPLE 01: TEST WITH OMITTED RESPONSES
THREE - PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION
>COMMENTS
This example illustrates the use of the GLOBAL "OMITS" option with the 3-PL model. Omitted responses are scored fractionally correct in an amount equal to the reciprocal of the number of multiple-choice alternatives (NALT = 5).
The data for this example have been randomly generated to fit the 3-PL model, but some of the responses with probability of being correct less than 0.3 have been randomly omitted. So that the true and estimated scores can be compared, the generating values of the respondents' scores are used as the case ID's.
The data are in the file EXAMPL01.DAT of the BLGDAT directory; the answer key (KFNAME) and the omit key (OFNAME) are, respectively, the first and second records of the data file.
The respondents' scores are estimated by the EAP method (default) and rescaled to mean 0 and standard deviation 1 in the sample (RSC=3). The item parameter estimates are saved AFTER rescaling.
>GLOBAL DFNAME='BLGDAT\EXAMPL01.DAT', NPARM=3, OMITS,SAVE;
>SAVE PARM='EXAMPL01.PAR';
>LENGTH NITEMS=20;
>INPUT NTOT=20, NALT=5, NIDC=10, KFNAME='BLGDAT\EXAMPL01.DAT',
OFNAME='BLGDAT\EXAMPL01.DAT';
(4X, 10A1, T17, 20A1)
>TEST TNAME=RANDOM;
>CALIB FLOAT;
>SCORE RSC=3;
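The fractional scoring of omits described in the >COMMENTS block can be sketched as follows. This is only an illustration of the arithmetic, not BILOG's internal code; the function name and the response string (with "W" standing for a wrong answer) are hypothetical, while the "R" answer key and "8" omit key follow the keys echoed in the Phase 1 output.

```python
def score_with_omits(responses, answer_key, omit_key, n_alternatives):
    """Return (number right, number omitted, fractional score) for one case.

    Each omitted item contributes 1 / n_alternatives to the fractional score.
    """
    right = omitted = 0
    for resp, correct, omit in zip(responses, answer_key, omit_key):
        if resp == omit:
            omitted += 1
        elif resp == correct:
            right += 1
    return right, omitted, right + omitted / n_alternatives

answer_key = "R" * 20   # answer key record
omit_key = "8" * 20     # omit key record
# Hypothetical response string: "R" right, "W" wrong, "8" omitted.
responses = "RRWW8RW8RRRRWWRRRWRR"
print(score_with_omits(responses, answer_key, omit_key, n_alternatives=5))
```

With NALT = 5, each omit is worth 0.2 of a correct response, the expected score from blind guessing among five alternatives.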
OUTPUT FILES
The following is annotated output of the three phases of the computations for Example 1. These results, plus other system messages, appear transiently on the screen as the computations proceed. The SAVE file, EXAMPL01.PAR, contains essentially the same results as in the rescaled parameter listing of Phase 3, and is not shown here. Annotation numbers appear in brackets.
SAMPLE PROBLEMS 6-2
Phase 1 output: EXAMPL01.PH1
********** PHASE 1************
EXAMPLE 01: TEST WITH OMITTED RESPONSES
THREE-PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION
>COMMENTS
[Same as above]
>GLOBAL DFNAME='BLGDAT\EXAMPL01.DAT' , NPARM=3, OMITS,save;
GLOBAL PARAMETERS
  NUMBER OF SUBTESTS       1
  CASE WEIGHTING           NONE EMPLOYED
  ITEM RESPONSE MODEL      3 PARAMETER LOGISTIC
OMITS WILL BE REPLACED BY THE RECIPROCAL OF THE NUMBER OF RESPONSE ALTERNATIVES
OFNAME='BLGDAT\EXAMPL01.DAT';
DATA INPUT SPECIFICATIONS
  NUMBER OF FORMAT CARDS                      1
  NUMBER OF ITEMS IN INPUT STREAM             20
  NUMBER OF RESPONSE ALTERNATIVES             5
  NUMBER OF SUBJECT ID CHARACTERS             10
  SUBJECT DATA INPUT OPTION                   SINGLE-SUBJECT DATA, NO CASE WEIGHTS
  MAXIMUM SAMPLE SIZE FOR ITEM CALIBRATION    1000
  ALL SUBJECTS INCLUDED IN RUN
FORMAT CARD FOR INPUT IS (4X,10A1,T17,20A1)
FILE ASSIGNMENT AND DISPOSITION
[INPUT FILES]
  SUBJECT DATA INPUT FILE          BLGDAT\EXAMPL01.DAT  SINGLE-SUBJECT DATA, NO CASE WEIGHTS
  CORRECT-RESPONSE KEY FILE        BLGDAT\EXAMPL01.DAT
  OMITTED RESPONSE KEY FILE        BLGDAT\EXAMPL01.DAT
[OUTPUT FILES]
  ITEM PARAMETERS FILE             EXAMPL01.PAR
[SCRATCH FILES]
  BILOG SYSTEM BINARY DATA FILE    EXAMPL01.MFL
  CALIBRATION BINARY DATA FILE     EXAMPL01.CFL
  ESTIMATED COVARIANCE FILE        EXAMPL01.VFL
  TEMPORARY FILE                   EXAMPL01.T02
  TEMPORARY FILE                   EXAMPL01.T03
  TEMPORARY FILE                   EXAMPL01.T14
  TEMPORARY FILE                   EXAMPL01.T99
>TEST TNAME=RANDOM;
ANSWER KEY:
1 RANDOM RRRRRRRRRRRRRRRRRRRR
OMIT KEY:
1 RANDOM 88888888888888888888
OBSERVATION 1 WEIGHT: 1.0000 ID: -0.324
SUBTEST 1 RANDOM
TRIED RIGHT
20.000 14.000
ITEM  TRIED  RIGHT      ITEM  TRIED  RIGHT
   1    1.0     .0        11    1.0   -1.0
   2    1.0     .0        12    1.0    1.0
   3    1.0     .0        13    1.0    1.0
   4    1.0     .0        14    1.0    1.0
   5    1.0   -1.0        15    1.0     .0
   6    1.0     .0        16    1.0    1.0
   7    1.0    1.0        17    1.0    1.0
   8    1.0     .0        18    1.0    1.0
   9    1.0     .0        19    1.0    1.0
  10    1.0    1.0        20    1.0    1.0
OBSERVATION 2 WEIGHT: 1.000 ID: -0.673
ITEM  TRIED  RIGHT      ITEM  TRIED  RIGHT
   1    1.0    1.0        11    1.0    1.0
   2    1.0    1.0        12    1.0    1.0
   3    1.0    1.0        13    1.0   -1.0
   4    1.0   -1.0        14    1.0    1.0
   5    1.0    1.0        15    1.0     .0
   6    1.0    1.0        16    1.0     .0
   7    1.0     .0        17    1.0    1.0
   8    1.0    1.0        18    1.0     .0
   9    1.0    1.0        19    1.0    1.0
  10    1.0    1.0        20    1.0    1.0
100 OBSERVATIONS READ FROM FILE: BLGDAT\EXAMPL01.DAT
100 OBSERVATIONS WRITTEN TO FILE: EXAMPL01.MFL
CLASSICAL ITEM STATISTICS FOR SUBTEST RANDOM
                NUMBER   NUMBER                         ITEM*TEST CORRELATION
ITEM   NAME     TRIED    RIGHT   PERCENT   LOGIT/1.7    PEARSON    BISERIAL
   1   0001     100.0     67.0     .670       .42        .473        .614
   2   0002     100.0     48.0     .480      -.05        .366        .459
   3   0003     100.0     49.0     .490      -.02        .374        .468
   4   0004     100.0     39.0     .390      -.26        .322        .409
   5   0005     100.0     44.0     .440      -.14        .370        .466
   6   0006     100.0     40.0     .400      -.24        .311        .394
   7   0007     100.0     57.0     .570       .17        .256        .322
   8   0008     100.0     54.0     .540       .09        .499        .626
   9   0009     100.0     76.0     .760       .68        .330        .454
  10   0010     100.0     78.0     .780       .74        .255        .357
  11   0011     100.0     45.0     .450      -.12        .498        .626
  12   0012     100.0     79.0     .790       .78        .354        .500
  13   0013     100.0     52.0     .520       .05        .216        .271
  14   0014     100.0     86.0     .860      1.07        .265        .413
  15   0015     100.0     44.0     .440      -.14        .331        .416
  16   0016     100.0     82.0     .820       .89        .187        .274
  17   0017     100.0     70.0     .700       .50        .274        .362
  18   0018     100.0     74.0     .740       .62        .297        .402
  19   0019     100.0     86.0     .860      1.07        .366        .571
  20   0020     100.0     82.0     .820       .89        .291        .426
[1] The scratch files are normally deleted at the end of the problem run, but the first three can be saved by assigning them non-default names in the SAVE command. If the data set is large and is to be analyzed more than once, the master binary file, EXAMPL01.MFL, contains all the data and is worth saving. The calibration binary file contains those cases used for item parameter estimation; it may be equal to or smaller than the master file. In routine applications, there is very little justification for requesting a calibration sample larger than the default size of 1000; the improvement in precision of item parameter estimation in larger sample sizes is relatively small.
The EXAMPL01.VFL file contains the estimated sampling variances and covariances of the item parameter estimators. While the square roots of the variances appear as standard errors in the output, the covariances between parameters (obtained from the inverse of the estimated information matrix) are available only in this file. Covariances between items are not saved, but those between the two or three parameters of the separate items appear in this file. (See Chapter 5.)
[2] The presented/not-presented codes (TRIED: 1.0 presented, -1.0 not-presented) and item scores of the first two cases (RIGHT: 1.0 if correct, 0 if incorrect, -1 if omitted) are output as a check on the data format and scoring keys.
[3] The classical item statistics provide the starting values for the iterative estimation of the item parameters (see Chapter 1, pp. 3-5, for the relationships between item statistics and item parameters). If the number of items and the sample size are large and the classical normal assumptions apply, these starting values are sufficiently accurate that a few EM cycles yield estimates of the item parameters that are quite satisfactory for most practical purposes. For preliminary analyses under these conditions there is little reason to spend the additional computing time required to obtain fully converged MML or MMAP estimates.
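As a check on the classical statistics, the LOGIT/1.7 column can be reproduced from the PERCENT column. The sketch below assumes the column is the natural-log odds of the proportion correct divided by 1.7, which matches the listed values for the items shown; the function name is illustrative only.

```python
import math

def logit_over_1_7(p):
    """Log odds of the proportion correct, divided by 1.7."""
    return math.log(p / (1.0 - p)) / 1.7

# Proportions correct taken from the classical item statistics listing.
for name, p in (("0001", 0.670), ("0002", 0.480), ("0014", 0.860)):
    print(name, round(logit_over_1_7(p), 2))
```

The results agree with the listed values .42, -.05 and 1.07 for items 0001, 0002 and 0014.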
Phase 2 output: EXAMPL01.PH2
*********** PHASE 2 ***********
EXAMPLE 01: TEST WITH OMITTED RESPONSES THREE-PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION
>CALIB FLOAT;
CALIBRATION PARAMETERS
  MAXIMUM NUMBER OF EM CYCLES:          10
  MAXIMUM NUMBER OF NEWTON CYCLES:      2
  CONVERGENCE CRITERION:                .0100
  SUBJECT DISTRIBUTION:                 NORMAL PRIOR
  PLOT EMPIRICAL VS. FITTED ICC'S:      NO
  DATA HANDLING:                        DATA ON SCRATCH FILE
  PRIOR DISTRIBUTION ON ASYMPTOTES:     YES
  PRIOR DISTRIBUTION ON SLOPES:         YES
  PRIOR DISTRIBUTION ON THRESHOLDS:     NO
  SOURCE OF ITEM HYPERPARAMETERS:       PROGRAM DEFAULTS, HYPERPARAMETERS WILL BE UPDATED EACH CYCLE
*******************************************
CALIBRATION OF SUBTEST RANDOM
*******************************************
METHOD OF SOLUTION
EM CYCLES (MAXIMUM OF 10)
FOLLOWED BY NEWTON-RAPHSON STEPS (MAXIMUM OF 2)
QUADRATURE POINTS AND PRIOR WEIGHTS:
               1            2            3            4            5
POINT     -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT     .1190E-03    .2805E-02    .3002E-01    .1458E+00    .3213E+00

               6            7            8            9           10
POINT      .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT     .3213E+00    .1458E+00    .3002E-01    .2805E-02    .1190E-03
PRIOR DISTRIBUTIONS ON ITEM PARAMETERS
(THRESHOLDS, NORMAL; SLOPES, LOG-NORMAL; GUESSING, BETA)
          THRESHOLDS         SLOPES          ASYMPTOTES
ITEM      MU    SIGMA      MU    SIGMA     ALPHA    BETA
0001       -      -       .000    .500      5.00   17.00
0002       -      -       .000    .500      5.00   17.00
[18 similar lines omitted]
[EM STEP]
-2 LOG LIKELIHOOD = 2260.2174
SAMPLE PROBLEMS 6-6
CYCLE 1: LARGEST CHANGE = 2.3714
-2 LOG LIKELIHOOD = 2258.8731
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.29157 16.70843
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.24391 .50000
CYCLE 2: LARGEST CHANGE = .07377
-2 LOG LIKELIHOOD = 2258.4835
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.36010 16.63990
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.26662 .50000
CYCLE 3: LARGEST CHANGE = .02392
-2 LOG LIKELIHOOD = 2258.3934
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.40416 16.59584
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.27645 .50000
CYCLE 4: LARGEST CHANGE = .01786
-2 LOG LIKELIHOOD = 2258.3526
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.43811 16.56189
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.28489 .50000
CYCLE 5: LARGEST CHANGE = .00809
[NEWTON STEP]
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.46420 16.53580
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.28872 .50000
-2 LOG LIKELIHOOD = 2258.3003
CYCLE 6: LARGEST CHANGE = .00809
SUBTEST RANDOM: ITEM PARAMETERS AFTER CYCLE 6
ITEM   INTERCEPT  S.E.     SLOPE   S.E.    THRESHOLD  S.E.    DISPERSN  S.E.    ASYMPTOTE  S.E.    CHISQ  (PROB)     DF
0001      .34    .254*     1.042  .373*      -.327   .273*      .960   .344*      .218    .088*      .4  (.5395)    1.0
0002     -.448   .324*      .754  .270*       .594   .352*     1.327   .475*      .225    .083*     2.5  (.2916)    2.0
0003     -.456   .331*      .830  .312*       .549   .325*     1.204   .452*      .222    .083*      .9  (.6353)    2.0
0004     -.795   .378*      .727  .284*      1.094   .415*     1.376   .538*      .204    .076*     1.2  (.5432)    2.0
0005     -.609   .364*      .840  .335*       .724   .334*     1.190   .475*      .216    .079*      .8  (.6870)    2.0
0006     -.839   .437*      .795  .321*      1.056   .408*     1.258   .509*      .234    .079*      .3  (.8595)    2.0
0007     -.277   .306*      .659  .238*       .420   .419*     1.518   .548*      .271    .094*      .6  (.7601)    2.0
0008     -.113   .250*      .920  .313*       .122   .256*     1.087   .370*      .175    .074*    11.8  (.0030)    2.0
0009      .593   .224*      .729  .234*      -.814   .369*     1.372   .440*      .230    .094*      .3  (.8780)    2.0
0010      .624   .209*      .582  .179*     -1.072   .458*     1.718   .528*      .232    .095*      .0  (.8618)    1.0
0011     -.579   .405*     1.160  .571*       .499   .247*      .862   .424*      .198    .074*     1.3  (.5272)    2.0
0012      .803   .246*      .850  .279*      -.944   .342*     1.176   .386*      .221    .092*      .1  (.9451)    2.0
0013     -.347   .290*      .546  .190*       .635   .485*     1.832   .636*      .246    .092*      .3  (.8477)    2.0
0014     1.063   .244*      .650  .227*     -1.636   .533*     1.538   .536*      .229    .094*     3.8  (.0496)    1.0
0015     -.599   .342*      .683  .241*       .877   .403*     1.463   .516*      .220    .082*     1.1  (.5827)    2.0
0016      .799   .208*      .539  .165*     -1.483   .549*     1.855   .569*      .229    .094*      .7  (.7086)    2.0
0017      .331   .213*      .585  .183*      -.565   .417*     1.709   .535*      .232    .094*      .6  (.4389)    1.0
0018      .514   .212*      .641  .204*      -.801   .406*     1.559   .495*      .227    .093*      .9  (.6521)    2.0
0019     1.359   .485*     1.129  .520*     -1.204   .319*      .886   .408*      .229    .094*      .1  (.7876)    1.0
0020      .847   .227*      .665  .205*     -1.274   .440*     1.503   .463*      .230    .095*      .3  (.8640)    2.0
* STANDARD ERROR
TOTAL                                                                                              27.8  (.8016)   35.0
LARGEST CHANGE = .008
PARAMETER      MEAN    STN DEV
ASYMPTOTE      .224     .019
SLOPE          .766     .182
LOG(SLOPE)    -.291     .226
THRESHOLD     -.177     .923
QUADRATURE POINTS AND POSTERIOR WEIGHTS:
               1            2            3            4            5
POINT     -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT     .3892E-04    .1558E-02    .2523E-01    .1564E+00    .3045E+00

               6            7            8            9           10
POINT      .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT     .3360E+00    .1407E+00    .3159E-01    .3822E-02    .1842E-03
[4] The quadrature points and weights are, respectively, deviates and normalized probability densities of the assumed prior distribution of ability. The default is a normal prior, but the user can supply other values. (The item-parameter estimation is not very sensitive to moderate departures from a normal prior.)
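The prior weights in the listing can be reproduced with a short sketch. It assumes 10 equally spaced points from -4 to +4 and weights equal to standard-normal densities divided by their sum; this matches the printed values but is not taken from the program's source, and the function name is illustrative.

```python
import math

def normal_quadrature(n_points=10, lower=-4.0, upper=4.0):
    """Equally spaced points with normalized standard-normal weights."""
    step = (upper - lower) / (n_points - 1)
    points = [lower + i * step for i in range(n_points)]
    densities = [math.exp(-0.5 * x * x) for x in points]
    total = sum(densities)
    weights = [d / total for d in densities]
    return points, weights

points, weights = normal_quadrature()
for x, w in zip(points, weights):
    print(f"{x: .4E}  {w:.4E}")
```

To four significant digits, the first five weights agree with the printed .1190E-03, .2805E-02, .3002E-01, .1458E+00 and .3213E+00.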
[5] See the summary for the PRIORS command, p. 4:31, for the relationship between the ALPHA and BETA parameters of the beta distribution and the mean and variance of the asymptote-parameter distribution.
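For readers without the PRIORS summary at hand, the standard identities for a beta distribution can be sketched as below. These are textbook formulas, not a quotation of the BILOG manual, and the function names are illustrative. Applied to the default prior listed above (ALPHA = 5, BETA = 17), the mode works out to 0.2, the reciprocal of NALT = 5; reading the default as "mode at chance level" is an inference from the numbers, not something this handout states.

```python
import math

def beta_mean_sd(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    return mean, math.sqrt(variance)

def beta_mode(alpha, beta):
    """Mode of a Beta(alpha, beta) distribution (requires alpha, beta > 1)."""
    return (alpha - 1.0) / (alpha + beta - 2.0)

mean, sd = beta_mean_sd(5.0, 17.0)
print(round(mean, 3), round(sd, 3), round(beta_mode(5.0, 17.0), 3))
```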
[6] When priors are assumed on the item parameters, the likelihood may not increase during the EM cycles. This occurs because the starting values are closer to the point of maximum marginal likelihood than the MAP estimates. But if the number of items and the sample size are large enough to justify the use of the FLOAT option, the means of the priors will be estimated simultaneously with the item parameters and the marginal likelihood will generally increase. The resulting estimates are then intermediate between the MML and the MMAP estimates. The compromise is necessary because there are seldom enough items to estimate the variances and other moments of their prior distributions accurately. Note that during the EM cycles the means of the priors change but not the standard deviations.
[7] If the number of items is 20 or greater, approximate chi-square statistics for the goodness-of-fit of each item are calculated and output. For this purpose, the cases in the calibration sample are sorted into successive intervals of the latent continuum according to the EAP estimates of their ability rescaled to mean 0 and standard deviation 1. Then, neighboring intervals are collapsed until the expected proportion of correct or incorrect responses exceeds 0.05. This gives a reasonable test of fit if the number of items is large enough to make the assignment of the cases accurate, and if the sample size is large enough to retain three or more intervals. In the present example, the sample size is too small for a useful test; for more suitable data, see Example 8. Note that because, in the MML estimation procedure, the residual frequencies (OBSERVED - EXPECTED) are not under linear constraints due to estimation of item parameters, the degrees of freedom are equal to the number of intervals retained.
When the number of items is less than 20, these chi-square statistics are replaced by root-mean-square standardized posterior residuals at the quadrature points. (See Examples 2, 4, 5 and 6)
[8] At the end of the calibration, normalized probability densities of the latent distribution are estimated at the quadrature points by the a posteriori weights shown in the output. In the present example, their agreement with the prior normal weights is very good, but this is to be expected because the simulated data were generated from a normal distribution.
Phase 3 output: EXAMPL01.PH3
*** PHASE 3 ***
EXAMPLE 01: TEST WITH OMITTED RESPONSES
THREE PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION
METHOD OF SCORING SUBJECTS:     EXPECTATION OF A POSTERIORI (EAP; BAYES ESTIMATES)
TYPE OF PRIOR:                  NORMAL
SCORES WRITTEN TO FILE          EXAMPL01.PH3
TYPE OF RESCALING:              CENTER WITH RESPECT TO SCORE ESTIMATES
ITEM AND TEST INFORMATION:      NONE REQUIRED

                   QUAD      PRIOR     PRIOR        RESCALING CONSTANTS
TEST   NAME        POINTS    MEAN      STN DEV      SCALING    LOCATION
1      RANDOM      10        .000      1.000        1.000      .000
***************************************
SCORING
***************************************
[9]
EAP SUBJECT ESTIMATION, SUBTEST: RANDOM
QUADRATURE POINTS AND PRIOR WEIGHTS:
               1            2            3            4            5
POINT     -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT     .1190E-03    .2805E-02    .3002E-01    .1458E+00    .3213E+00

               6            7            8            9           10
POINT      .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT     .3213E+00    .1458E+00    .3002E-01    .2805E-02    .1190E-03
CORRELATIONS AMONG SUBTEST SCORE ESTIMATES:
RANDOM = 1.000
MEANS AND STANDARD DEVIATIONS OF SCORE ESTIMATES: [10]
TEST: Random
MEAN: .015
S.D.: .868
MARGINAL LATENT DISTRIBUTION FOR SUBTEST RANDOM
MEAN = .015 S.D. = .968
[11]
               1            2            3            4            5
POINT     -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT     .5130E-04    .1583E-02    .2525E-01    .1564E+00    .3044E+00

               6            7            8            9           10
POINT      .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT     .3360E+00    .1407E+00    .3159E-01    .3824E-02    .1847E-03
RESCALING DONE WITH RESPECT TO SUBJECT DISTRIBUTION [12]
SUBTEST    SCALING CONSTANT    LOCATION CONSTANT
RANDOM          1.153               -.017
[13]
[14]
SUBJECT                                                                        MARGINAL
IDENTIFICATION   WEIGHT   SUBTEST   TRIED   RIGHT   PERCENT   ABILITY    S.E.      PROB
-0.324            1.00    RANDOM      20      10     .5000     -.6105    .4612   .000027
-0.673            1.00    RANDOM      20      14     .7000      .5785    .5354   .000004
-2.722            1.00    RANDOM      20       7     .3500    -1.4665    .6717   .000000
0.714             1.00    RANDOM      20      16     .8000      .9046    .5743   .000027
0.351             1.00    RANDOM      20      14     .7000      .3676    .4916   .000024
-0.042            1.00    RANDOM      20      13     .6500     -.0031    .5672   .000010
-0.747            1.00    RANDOM      20       9     .4500     -.5874    .4811   .000012
0.154             1.00    RANDOM      20      11     .5500     -.4572    .4856   .000008
[OUTPUT FOR 80 CASES OMITTED HERE]
-0.021            1.00    RANDOM      20      14     .7000      .2931    .5126   .000017
-3.063            1.00    RANDOM      20      20    1.0000     2.2583    .7016   .017796
-0.050            1.00    RANDOM      20      15     .7500      .4971    .4861   .000047
0.434             1.00    RANDOM      20      19     .9500     1.8448    .6295   .004412
0.995             1.00    RANDOM      20      15     .7500      .6837    .5417   .000009
-1.554            1.00    RANDOM      20       6     .3000    -1.8910    .6342   .000000
-1.742            1.00    RANDOM      20       3     .1500    -1.9552    .6429   .000003
0.146             1.00    RANDOM      20      17     .8500     1.1645    .5786   .000103
0.459             1.00    RANDOM      20      15     .7500      .5191    .4895   .000038
0.030             1.00    RANDOM      20      14     .7000      .2452    .5068   .000031
SUBTEST RANDOM ; RESCALED ITEM PARAMETERS [15]
ITEM   INTERCEPT  S.E.     SLOPE   S.E.    THRESHOLD  S.E.    DISPERSN  S.E.    ASYMPTOTE  S.E.
0001      .356   .254*      .904  .324*      -.393   .322*     1.106   .396*      .218    .088*
0002     -.437   .324*      .654  .234*       .668   .447*     1.529   .547*      .225    .083*
0003     -.444   .331*      .721  .271*       .616   .449*     1.388   .521*      .222    .083*
0004     -.784   .378*      .631  .246*      1.244   .658*     1.586   .620*      .204    .076*
0005     -.597   .364*      .729  .291*       .818   .535*     1.371   .547*      .216    .079*
[OUTPUT FOR THE REMAINING ITEMS OMITTED HERE]
PARAMETER      MEAN    STN DEV
ASYMPTOTE      .224     .019
SLOPE          .665     .158
LOG(SLOPE)    -.433     .226
THRESHOLD     -.221    1.064
MEAN & SD OF SCORE ESTIMATES AFTER RESCALING: .000 1.000
SAMPLE PROBLEMS 6-11
[9] All cases in the data file are scored, not just those used in the item calibration. The SCORE keyword of the SAVE command writes the scores to a separate file. Printing of the scores in the Phase 3 output can be suppressed by the NOPRINT option of the SCORE command.
[10] These are the sample means and standard deviations before rescaling. Note the sample standard deviation is less than the assumed S.D. of 1 in the population due to the regression of EAP estimates to the mean.
[11] The S.D. of the latent distribution estimated from the posterior weights at the quadrature points is, however, near its true value. This S.D. is computed using the formula for the variance of grouped data, with the quadrature points treated as class marks and the posterior weights as class frequencies. Sheppard's correction for coarse grouping is applied before the square root is taken to obtain the standard deviation. The posterior weights appear only when EAP score estimation is selected and are based on all cases, not just those used in the calibration.
[12] If the rescaling option is selected, the scores are rescaled linearly with these constants before the scores are output. In the present example, the rescaling will set the sample mean and standard deviation equal to their population values, which have been arbitrarily assigned the values 0 and 1.
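The rescaling constants can be checked against the printed sample mean (.015) and standard deviation (.868) of the score estimates. The sketch assumes a plain linear standardization, rescaled = scaling * score + location; the function name is illustrative.

```python
def rescaling_constants(sample_mean, sample_sd, target_mean=0.0, target_sd=1.0):
    """Scaling and location constants of a linear rescaling to the target moments."""
    scaling = target_sd / sample_sd
    location = target_mean - scaling * sample_mean
    return scaling, location

scaling, location = rescaling_constants(0.015, 0.868)
print(round(scaling, 3), round(location, 3))
```

The results agree with the listed constants 1.153 and -.017 to within the rounding of the printed mean and S.D.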
[13] The score output includes the classical number right and percent right in addition to the estimated latent ability and its standard error (or in the case of EAP estimation, the posterior standard deviation). In the latter case, the estimated marginal probability of the response pattern of each case is also output. Scores for cases with extremely deviant marginal probabilities should be viewed with caution. They may indicate faulty use of the answer sheet or random responding. A quantile plot of the log marginal probabilities can help detect such cases.
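How the ability estimate, its posterior standard deviation, and the marginal probability fit together can be sketched as below. This is a minimal illustration of EAP scoring over quadrature points, not BILOG's implementation: the three-item test is hypothetical, D = 1.7 is an assumed scaling constant, and omitted responses are not handled.

```python
import math

def normal_quadrature(n_points=10, lower=-4.0, upper=4.0):
    """Equally spaced points with normalized standard-normal weights."""
    step = (upper - lower) / (n_points - 1)
    points = [lower + i * step for i in range(n_points)]
    dens = [math.exp(-0.5 * x * x) for x in points]
    total = sum(dens)
    return points, [d / total for d in dens]

def p_correct(theta, slope, threshold, asymptote, D=1.7):
    """3-PL response probability; D is an assumed scaling constant."""
    z = D * slope * (theta - threshold)
    return asymptote + (1.0 - asymptote) / (1.0 + math.exp(-z))

def eap_score(item_scores, items, points, weights):
    """Return (EAP estimate, posterior S.D., marginal probability)."""
    joint = []
    for x, w in zip(points, weights):
        likelihood = 1.0
        for u, (a, b, c) in zip(item_scores, items):
            p = p_correct(x, a, b, c)
            likelihood *= p if u == 1 else 1.0 - p
        joint.append(w * likelihood)      # prior weight times likelihood
    marginal = sum(joint)                 # marginal probability of the pattern
    posterior = [j / marginal for j in joint]
    eap = sum(x * q for x, q in zip(points, posterior))
    psd = math.sqrt(sum(q * (x - eap) ** 2 for x, q in zip(points, posterior)))
    return eap, psd, marginal

# Hypothetical three-item test: (slope, threshold, asymptote) per item.
items = [(1.0, -1.0, 0.2), (0.8, 0.0, 0.2), (0.7, 1.0, 0.2)]
points, weights = normal_quadrature()
print(eap_score([1, 1, 0], items, points, weights))
```

A very small marginal probability means the response pattern is unlikely under the fitted model for any ability level, which is the basis for the caution about deviant cases.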
[14] Given the relatively small number of items and their modest discriminating power (slopes), it is not surprising that a few of the deviations between the generating values shown in the ID field at the left and the estimated abilities are rather large. Notice, however, that the sample root-mean-square error (RMSE) (estimated ability - generating ability) of 0.5485 agrees well with the posterior standard deviations ("standard errors") of the scores based on Bayesian theory. This RMSE corresponds to a classical reliability coefficient of 0.70.
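The arithmetic linking the RMSE to the reliability coefficient can be sketched as follows, assuming reliability is taken as 1 minus the ratio of error variance to score variance, with the score variance equal to 1 on the rescaled metric. The function name is illustrative and the formula is inferred from the two reported figures.

```python
def reliability_from_rmse(rmse, score_variance=1.0):
    """Reliability as 1 - (error variance / score variance)."""
    return 1.0 - rmse ** 2 / score_variance

print(round(reliability_from_rmse(0.5485), 2))
```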
[15] As discussed in Chapter 1, the item slopes and thresholds can be adjusted so that they will estimate directly the rescaled scores. The item lower asymptotes are not affected. In the EXAMPL01.PAR file, the rescaled parameters replace those from the calibration phase.
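The adjustment can be checked numerically against the two listings. The sketch assumes the linear rescaling theta' = scaling * theta + location, so that the slope is divided by the scaling constant and the threshold is transformed like a score, with intercept = -slope * threshold and dispersion = 1 / slope. The function name is illustrative; the constants 1.153 and -.017 are those printed in the Phase 3 output.

```python
def rescale_item(slope, threshold, asymptote, scaling, location):
    """Item parameters re-expressed on the rescaled ability metric."""
    new_slope = slope / scaling
    new_threshold = scaling * threshold + location
    return {
        "intercept": -new_slope * new_threshold,
        "slope": new_slope,
        "threshold": new_threshold,
        "dispersion": 1.0 / new_slope,
        "asymptote": asymptote,  # lower asymptote is unchanged
    }

# Item 0001 after cycle 6: slope 1.042, threshold -.327, asymptote .218.
item1 = rescale_item(1.042, -0.327, 0.218, scaling=1.153, location=-0.017)
print({k: round(v, 3) for k, v in item1.items()})
```

The results agree with the rescaled listing for item 0001 (intercept .356, slope .904, threshold -.393, dispersion 1.106) to within rounding of the printed constants.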
SAMPLE PROBLEMS 6-12