EXERCISE 7

PSYCHOLOGICAL MEASUREMENT (806)

OCTOBER 27, 1998

DUE NOVEMBER 17, 1998

 

DIRECTIONS -- Based on identified groups, do one of the following:

 1. Run Example 1 from BILOG (see pages 6-1 to 6-12 and the handout for class). Prepare a short guide for the class for running BILOG based on the sample.

 

2. Based on the sample output (you can use the copy provided to the class), interpret the results and findings for the IRT materials. Focus on the "item parameters after cycle 6" (see pages 6-7 and 6-8).

 

3. Demonstrate the relationship between the IRT (6-7 & 6-8) and classical (6-4 & 6-5) item parameters. Demonstrate the relationship between the thetas and classical test scores (see pages 6-10 & 6-11). Explain the results.

 

INTRODUCTION TO ITEM RESPONSE THEORY

1-1

Item response theory (IRT) is gaining in acceptance in psychological and educational testing because it provides more adaptable and effective methods of test construction, analysis, and scoring than those derived from classical test theory. The source of its greater power is in the relationships it establishes between properties of the items and the operating characteristics of the test made up of the items. These relationships can be valid for actual tests of any length, whereas any comparable results in classical theory hold only for hypothetical tests consisting of indefinitely many items.

The provision in IRT for treating the items, or small sets of similar items, as the exchangeable units of test construction and scoring has led to numerous innovations in testing practice, especially item banking and adaptive testing. The former can appreciably reduce the time and cost of producing a high quality operational test. The latter, either in the form of computerized adaptive testing or two-stage testing using paper-and-pencil instruments, enables testing time to be reduced to half or a third of that required for a conventional test of the same precision.

Equally important for long-term testing and assessment programs is the ability to retire and replace items in an operational test without altering the interpretation of the test scale. Because IRT scale scores are functions of estimated item parameters, the scoring absorbs possible differences in the characteristics (difficulty, discriminating power, etc.) of the replacement items.

Another property unique to IRT is the location of items and the respondents on the same scale. The response models on which IRT methods are based enable the analyst to state the probability that a respondent at a particular score level will answer a given item correctly. This permits the "content referencing" of the scale scores. Typical items that respondents can answer correctly with an assigned probability (e.g., 50 or 80 percent) illustrate the meaning of various points on the scale in terms of task content.
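The idea of content referencing can be sketched in a few lines of Python. The sketch assumes the three-parameter logistic model in the normal metric (D = 1.7) that is used in the sample problem; the function names and the item values below are hypothetical and chosen only for illustration.

```python
import math

D = 1.7  # scaling constant of the "normal metric" logistic model


def p_correct(theta, slope, threshold, asymptote):
    """Three-parameter logistic probability of a correct response."""
    logistic = 1.0 / (1.0 + math.exp(-D * slope * (theta - threshold)))
    return asymptote + (1.0 - asymptote) * logistic


def theta_for_probability(target, slope, threshold, asymptote):
    """Scale point at which the item is answered correctly with probability target."""
    if not asymptote < target < 1.0:
        raise ValueError("target must lie between the asymptote and 1")
    odds = (target - asymptote) / (1.0 - target)
    return threshold + math.log(odds) / (D * slope)


# Hypothetical item: slope 1.0, threshold 0.0, lower asymptote 0.2.
theta_50 = theta_for_probability(0.50, 1.0, 0.0, 0.2)
theta_80 = theta_for_probability(0.80, 1.0, 0.0, 0.2)
```

An item can then be attached to the scale point where respondents have, say, a 50 or 80 percent chance of answering it correctly, which is what gives that point its meaning in terms of task content.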

In this chapter, we discuss the IRT procedures implemented in the BILOG program. Only results are presented. For derivations and proofs, the reader should consult the readings listed at the end of the chapter.

6 ---- Sample Problems

 

The examples in this chapter illustrate typical applications of the BILOG program. The command files and data for these examples appear in files of the program distribution diskettes. Because of the versatility of the program, not all its features can be demonstrated here. The user will find descriptions of other options in the command summaries of Chapter 4 and can easily modify the command files of the present examples to explore their effects.

Example 1: SIMULATED 3-PL DATA: EAP ESTIMATIONS OF SCORES

This example is based on randomly generated data that simulate some of the most frequently encountered conditions of cognitive testing. The test consists of 20 items, which is about the smallest number of binary items that can be reasonably used in practical testing. The items are assumed to be in multiple-choice format with five alternatives. It is assumed that some of the examinees will omit items that are too difficult for them.

To hold the problem to a reasonable size, we have set the sample size at 100. This is on the small side in an IRT application (a sample size in excess of 200 is usually considered desirable), but with the default priors assumed for the item parameters, the results of the calibration are reasonable.

Because the sample size is not large enough to attempt estimating the population latent distribution, we have accepted the default normal distribution with mean zero and standard deviation one.

Twenty items is the smallest number for which the chi-square test of item fit can be recommended. Due to the small sample size, the test is not very sensitive, and, as might be expected, none of the items shows any indication of poor fit.

For scoring the examinees, we have chosen the most robust method, EAP (Bayes) estimation, on the assumption that the population distribution is normal. We follow the usual testing convention of standardizing the scores in the sample (with mean 0 and standard deviation 1, in this case). Because the scale of measurement for a psychological or educational test almost always has an arbitrary origin and unit, no generality is lost by rescaling in the calibration sample. In Example 2, these EAP score estimates are compared with ML (maximum likelihood) estimates similarly scaled.

To illustrate (in Example 2) how response data can be scored from the file of item-parameter estimates from an earlier BILOG calibration, we make use of the SAVE option in the GLOBAL command, followed by the SAVE command, to save the item parameter file as EXAMPL01.PAR. The item parameters are saved after being adjusted to account for the rescaling of the scores in the sample. If they are used to score other data, the resulting scores will be expressed in the scale metric of the original sample. They can, however, be saved after rescaling in other data, in which case they will be readjusted to the scale of that sample.

COMMAND FILE: EXAMPL01.BLG

EXAMPLE 01: TEST WITH OMITTED RESPONSES

THREE - PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION

>COMMENTS

This example illustrates the use of the GLOBAL "OMITS" option with the 3-PL model. Omitted responses are scored fractionally correct in an amount equal to the reciprocal of the number of multiple-choice alternatives (NALT = 5).

The data for this example have been randomly generated to fit the 3-PL model, but some of the responses with probability of being correct less than 0.3 have been randomly omitted. So that the true and estimated scores can be compared, the generating values of the respondents' scores are used as the case ID's.

The data are in the file EXAMPL01.DAT of the BLGDAT directory; the answer key (KFNAME) and the omit key (OFNAME) are, respectively, the first and second records of the data file.

The respondents' scores are estimated by the EAP method (default) and rescaled to mean 0 and standard deviation 1 in the sample (RSC=3). The item parameter estimates are saved AFTER rescaling.

>GLOBAL DFNAME='BLGDAT\EXAMPL01.DAT', NPARM=3, OMITS,SAVE;

>SAVE PARM='EXAMPL01.PAR';

>LENGTH NITEMS=20;

>INPUT NTOT=20, NALT=5, NIDC=10, KFNAME='BLGDAT\EXAMPL01.DAT',

OFNAME='BLGDAT\EXAMPL01.DAT';

(4X, 10A1, T17, 20A1)

>TEST TNAME=RANDOM;

>CALIB FLOAT;

>SCORE RSC=3;

 

OUTPUT FILES

The following is annotated output of the three phases of the computations for Example 1. These results, plus other system messages, appear transiently on the screen as the computations proceed. The SAVE file, EXAMPL01.PAR, contains essentially the same results as in the rescaled parameter listing of Phase 3, and is not shown here. Annotation numbers appear in brackets.

SAMPLE PROBLEMS 6-2

Phase 1 output: EXAMPL01.PH1

********** PHASE 1************

 EXAMPLE 01: TEST WITH OMITTED RESPONSES

THREE-PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION

>COMMENTS

 

[Same as above]

>GLOBAL DFNAME='BLGDAT\EXAMPL01.DAT' , NPARM=3, OMITS,save;

GLOBAL PARAMETERS

NUMBER OF SUBTESTS     1
CASE WEIGHTING         NONE EMPLOYED
ITEM RESPONSE MODEL    3 PARAMETER LOGISTIC
                       NORMAL METRIC (I.E., D=1.7)

OMITS WILL BE REPLACED BY THE RECIPROCAL OF THE NUMBER OF RESPONSE ALTERNATIVES

OFNAME='BLGDAT\EXAMPL01.DAT';

DATA INPUT SPECIFICATIONS

NUMBER OF FORMAT CARDS                      1
NUMBER OF ITEMS IN INPUT STREAM             20
NUMBER OF RESPONSE ALTERNATIVES             5
NUMBER OF SUBJECT ID CHARACTERS             10
SUBJECT DATA INPUT OPTION                   SINGLE-SUBJECT DATA, NO CASE WEIGHTS
MAXIMUM SAMPLE SIZE FOR ITEM CALIBRATION    1000
ALL SUBJECTS INCLUDED IN RUN

FORMAT CARD FOR INPUT IS (4X,10A1,T17,20A1)

FILE ASSIGNMENT AND DISPOSITION

[INPUT FILES]

SUBJECT DATA INPUT FILE          BLGDAT\EXAMPL01.DAT
                                 SINGLE-SUBJECT DATA, NO CASE WEIGHTS
CORRECT-RESPONSE KEY FILE        BLGDAT\EXAMPL01.DAT
OMITTED RESPONSE KEY FILE        BLGDAT\EXAMPL01.DAT

[OUTPUT FILES]

ITEM PARAMETERS FILE             EXAMPL01.PAR

[SCRATCH FILES]

BILOG SYSTEM BINARY DATA FILE    EXAMPL01.MFL
CALIBRATION BINARY DATA FILE     EXAMPL01.CFL
ESTIMATED COVARIANCE FILE        EXAMPL01.VFL
TEMPORARY FILE                   EXAMPL01.T02
TEMPORARY FILE                   EXAMPL01.T03
TEMPORARY FILE                   EXAMPL01.T14
TEMPORARY FILE                   EXAMPL01.T99

>TEST TNAME=RANDOM;

ANSWER KEY:

1 RANDOM RRRRRRRRRRRRRRRRRRRR

OMIT KEY:

1 RANDOM 88888888888888888888

OBSERVATION 1 WEIGHT: 1.0000 ID: -0.324

SUBTEST 1 RANDOM

TRIED RIGHT

20.000 14.000

ITEM   TRIED   RIGHT
   1     1.0      .0
   2     1.0      .0
   3     1.0      .0
   4     1.0      .0
   5     1.0    -1.0
   6     1.0      .0
   7     1.0     1.0
   8     1.0      .0
   9     1.0      .0
  10     1.0     1.0
  11     1.0    -1.0
  12     1.0     1.0
  13     1.0     1.0
  14     1.0     1.0
  15     1.0      .0
  16     1.0     1.0
  17     1.0     1.0
  18     1.0     1.0
  19     1.0     1.0
  20     1.0     1.0

OBSERVATION 2   WEIGHT: 1.000    ID: -0.673

ITEM   TRIED   RIGHT
   1     1.0     1.0
   2     1.0     1.0
   3     1.0     1.0
   4     1.0    -1.0
   5     1.0     1.0
   6     1.0     1.0
   7     1.0      .0
   8     1.0     1.0
   9     1.0     1.0
  10     1.0     1.0
  11     1.0     1.0
  12     1.0     1.0
  13     1.0    -1.0
  14     1.0     1.0
  15     1.0      .0
  16     1.0      .0
  17     1.0     1.0
  18     1.0      .0
  19     1.0     1.0
  20     1.0     1.0

 

100 OBSERVATIONS READ FROM FILE: BLGDAT\EXAMPL01.DAT

100 OBSERVATIONS WRITTEN TO FILE: EXAMPL01.MFL

CLASSICAL ITEM STATISTICS FOR SUBTEST RANDOM

 

 

                 NUMBER   NUMBER                         ITEM*TEST CORRELATION
ITEM   NAME       TRIED    RIGHT   PERCENT   LOGIT/1.7    PEARSON    BISERIAL
   1   0001       100.0     67.0      .670         .42       .473        .614
   2   0002       100.0     48.0      .480        -.05       .366        .459
   3   0003       100.0     49.0      .490        -.02       .374        .468
   4   0004       100.0     39.0      .390        -.26       .322        .409
   5   0005       100.0     44.0      .440        -.14       .370        .466
   6   0006       100.0     40.0      .400        -.24       .311        .394
   7   0007       100.0     57.0      .570         .17       .256        .322
   8   0008       100.0     54.0      .540         .09       .499        .626
   9   0009       100.0     76.0      .760         .68       .330        .454
  10   0010       100.0     78.0      .780         .74       .255        .357
  11   0011       100.0     45.0      .450        -.12       .498        .626
  12   0012       100.0     79.0      .790         .78       .354        .500
  13   0013       100.0     52.0      .520         .05       .216        .271
  14   0014       100.0     86.0      .860        1.07       .265        .413
  15   0015       100.0     44.0      .440        -.14       .331        .416
  16   0016       100.0     82.0      .820         .89       .187        .274
  17   0017       100.0     70.0      .700         .50       .274        .362
  18   0018       100.0     74.0      .740         .62       .297        .402
  19   0019       100.0     86.0      .860        1.07       .366        .571
  20   0020       100.0     82.0      .820         .89       .291        .426

 

 

[1] The scratch files are normally deleted at the end of the problem run, but the first three can be saved by assigning them non-default names in the SAVE command. If the data set is large and is to be analyzed more than once, the master binary file, EXAMPL01.MFL, contains all the data and is worth saving. The calibration binary file contains those cases used for item parameter estimation; it may be equal to or smaller than the master file. In routine applications, there is very little justification for requesting a calibration sample larger than the default size of 1000; the improvement in precision of item parameter estimation in larger samples is relatively small.

The EXAMPL01.VFL file contains the estimated sampling variances and covariances of the item parameter estimators. While the square roots of the variances appear as standard errors in the output, the covariances between parameters (obtained from the inverse of the estimated information matrix) are available only in this file. Covariances between items are not saved, but those between the two or three parameters of the separate items appear in this file. (See Chapter 5.)

[2] The presented/not-presented codes (TRIED: 1.0 presented, -1.0 not-presented) and item scores of the first two cases (RIGHT: 1.0 if correct, 0 if incorrect, -1 if omitted) are output as a check on the data format and scoring keys.
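The RIGHT codes and the fractional scoring of omits described in the command-file comments can be sketched as follows. This is a minimal illustration and not the program's own routine: the function name is hypothetical, and the wrong-answer code "W" is an assumption (the listing shows only the "R" answer key and the "8" omit key).

```python
def score_case(responses, answer_key, omit_key, nalt):
    """Return per-item RIGHT codes and a number-right score with fractional omits."""
    right_codes = []
    number_right = 0.0
    for response, key, omit in zip(responses, answer_key, omit_key):
        if response == omit:
            right_codes.append(-1.0)      # omitted
            number_right += 1.0 / nalt    # fractional credit, 1/NALT
        elif response == key:
            right_codes.append(1.0)       # correct
            number_right += 1.0
        else:
            right_codes.append(0.0)       # incorrect
    return right_codes, number_right


# Hypothetical five-item record: two right, one wrong, one omit, one right.
codes, total = score_case("RW8RR", "RRRRR", "88888", nalt=5)
```

With NALT = 5 each omit contributes one fifth of a point, so this record scores 3.2 rather than 3.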

[3] The classical item statistics provide the starting values for the iterative estimation of the item parameters (see Chapter 1, pp. 3-5, for the relationships between item statistics and item parameters). If the number of items and the sample size are large and the classical normal assumptions apply, these starting values are sufficiently accurate that a few EM cycles yield estimates of the item parameters that are quite satisfactory for most practical purposes. For preliminary analyses under these conditions there is little reason to spend the additional computing time required to obtain fully converged MML or MMAP estimates.
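Two columns of the classical item statistics can be recomputed by hand, which is a useful check when reading the listing. The sketch below uses the standard logit and the standard conversion from a point-biserial (Pearson) to a biserial correlation; that the program computes the columns in exactly this way is an assumption, although the formulas agree with the listed values for item 1 to rounding. The function names are hypothetical.

```python
import math
from statistics import NormalDist


def logit_over_1_7(p):
    """Logit of the proportion correct, divided by 1.7 (the LOGIT/1.7 column)."""
    return math.log(p / (1.0 - p)) / 1.7


def biserial_from_pearson(r_pearson, p):
    """Convert a point-biserial correlation to a biserial correlation."""
    std_normal = NormalDist()
    z = std_normal.inv_cdf(p)        # normal deviate cutting off proportion p
    ordinate = std_normal.pdf(z)     # normal density at that deviate
    return r_pearson * math.sqrt(p * (1.0 - p)) / ordinate


# Item 0001 in the listing: PERCENT = .670, PEARSON = .473.
logit_item1 = logit_over_1_7(0.670)
biserial_item1 = biserial_from_pearson(0.473, 0.670)
```

For item 0001 these give approximately .42 and .614, the values printed in the LOGIT/1.7 and BISERIAL columns.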

Phase 2 output: EXAMPL01.PH2

*********** PHASE 2 ***********

EXAMPLE 01: TEST WITH OMITTED RESPONSES THREE-PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION

>CALIB FLOAT;

CALIBRATION PARAMETERS

MAXIMUM NUMBER OF EM CYCLES:         10
MAXIMUM NUMBER OF NEWTON CYCLES:     2
CONVERGENCE CRITERION:               .0100
SUBJECT DISTRIBUTION:                NORMAL PRIOR
PLOT EMPIRICAL VS. FITTED ICC'S      NO
DATA HANDLING:                       DATA ON SCRATCH FILE
PRIOR DISTRIBUTION ON ASYMPTOTES:    YES
PRIOR DISTRIBUTION ON SLOPES:        YES
PRIOR DISTRIBUTION ON THRESHOLDS:    NO
SOURCE OF ITEM HYPER PARAMETERS:     PROGRAM DEFAULTS,
                                     HYPERPARAMETERS WILL BE UPDATED EACH CYCLE

 

*******************************************

CALIBRATION OF SUBTEST RANDOM

*******************************************

 

METHOD OF SOLUTION

EM CYCLES (MAXIMUM OF 10)

FOLLOWED BY NEWTON-RAPHSON STEPS (MAXIMUM OF 2)

QUADRATURE POINTS AND PRIOR WEIGHTS:

                  1            2            3            4            5
POINT    -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT    .1190E-03    .2805E-02    .3002E-01    .1458E+00    .3213E+00

                  6            7            8            9           10
POINT     .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT    .3213E+00    .1458E+00    .3002E-01    .2805E-02    .1190E-03

 

PRIOR DISTRIBUTIONS ON ITEM PARAMETERS
(THRESHOLDS, NORMAL; SLOPES, LOG-NORMAL; GUESSING, BETA)

          THRESHOLDS          SLOPES          ASYMPTOTES
ITEM      MU    SIGMA       MU    SIGMA       MU    SIGMA
0001       -      -       .000     .500     5.00    17.00
0002       -      -       .000     .500     5.00    17.00
[18 similar lines omitted]

[EM STEP]

-2 LOG LIKELIHOOD = 2260.2174

SAMPLE PROBLEMS 6-6

CYCLE 1: LARGEST CHANGE = 2.3714
-2 LOG LIKELIHOOD = 2258.8731
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.29157 16.70843
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.24391 .50000

CYCLE 2: LARGEST CHANGE = .07377
-2 LOG LIKELIHOOD = 2258.4835
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.36010 16.63990
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.26662 .50000

CYCLE 3: LARGEST CHANGE = .02392
-2 LOG LIKELIHOOD = 2258.3934
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.40416 16.59584
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.27645 .50000

CYCLE 4: LARGEST CHANGE = .01786
-2 LOG LIKELIHOOD = 2258.3526
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.43811 16.56189
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.28489 .50000

CYCLE 5: LARGEST CHANGE = .00809

[NEWTON STEP]
UPDATED PRIOR ON ASYMPTOTES; ALPHA & BETA = 5.46420 16.53580
UPDATED PRIOR ON LOG SLOPES; MEAN & SD = -.28872 .50000
-2 LOG LIKELIHOOD = 2258.3003

CYCLE 6: LARGEST CHANGE = .00809

SUBTEST RANDOM: ITEM PARAMETERS AFTER CYCLE 6

 

ITEM  INTERCEPT  SLOPE    THRESHOLD  DISPERSN  ASYMPTOTE   CHISQ    DF
        S.E.      S.E.      S.E.      S.E.      S.E.      (PROB)

0001     .34     1.042     -.327      .960      .218        .4     1.0
        .254*    .373*     .273*     .344*     .088*    (.5395)
0002    -.448     .754      .594     1.327      .225       2.5     2.0
        .324*    .270*     .352*     .475*     .083*    (.2916)
0003    -.456     .830      .549     1.204      .222        .9     2.0
        .331*    .312*     .325*     .452*     .083*    (.6353)
0004    -.795     .727     1.094     1.376      .204       1.2     2.0
        .378*    .284*     .415*     .538*     .076*    (.5432)
0005    -.609     .840      .724     1.190      .216        .8     2.0
        .364*    .335*     .334*     .475*     .079*    (.6870)
0006    -.839     .795     1.056     1.258      .234        .3     2.0
        .437*    .321*     .408*     .509*     .079*    (.8595)
0007    -.277     .659      .420     1.518      .271        .6     2.0
        .306*    .238*     .419*     .548*     .094*    (.7601)
0008    -.113     .920      .122     1.087      .175      11.8     2.0
        .250*    .313*     .256*     .370*     .074*    (.0030)
0009     .593     .729     -.814     1.372      .230        .3     2.0
        .224*    .234*     .369*     .440*     .094*    (.8780)
0010     .624     .582    -1.07      1.718      .232        .0     1.0
        .209*    .179*     .458*     .528*     .095*    (.8618)
0011    -.579    1.160      .499      .862      .198       1.3     2.0
        .405*    .571*     .247*     .424*     .074*    (.5272)
0012     .803     .850     -.944     1.176      .221        .1     2.0
        .246*    .279*     .342*     .386*     .092*    (.9451)
0013    -.347     .546      .635     1.832      .246        .3     2.0
        .290*    .190*     .485*     .636*     .092*    (.8477)
0014    1.063     .650    -1.636     1.538      .229       3.8     1.0
        .244*    .227*     .533*     .536*     .094*    (.0496)
0015    -.599     .683      .877     1.463      .220       1.1     2.0
        .342*    .241*     .403*     .516*     .082*    (.5827)
0016     .799     .539    -1.483     1.855      .229        .7     2.0
        .208*    .165*     .549*     .569*     .094*    (.7086)
0017     .331     .585     -.565     1.709      .232        .6     1.0
        .213*    .183*     .417*     .535*     .094*    (.4389)
0018     .514     .641     -.801     1.559      .227        .9     2.0
        .212*    .204*     .406*     .495*     .093*    (.6521)
0019    1.359    1.129    -1.204      .886      .229        .1     1.0
        .485*    .520*     .319*     .408*     .094*    (.7876)
0020     .847     .665    -1.274     1.503      .230        .3     2.0
        .227*    .205*     .440*     .463*     .095*    (.8640)

* STANDARD ERROR                                          27.8    35.0
                                                        (.8016)

LARGEST CHANGE = .008

PARAMETER       MEAN   STN DEV
ASYMPTOTE       .224      .019
SLOPE           .766      .182
LOG (SLOPE)    -.291      .226
THRESHOLD      -.177      .923

 

QUADRATURE POINTS AND POSTERIOR WEIGHTS:

                  1            2            3            4            5
POINT    -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT    .3892E-04    .1558E-02    .2523E-01    .1564E+00    .3045E+00

                  6            7            8            9           10
POINT     .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT    .3360E+00    .1407E+00    .3159E-01    .3822E-02    .1842E-03

 

[4] The quadrature points and weights are, respectively, deviates and normalized probability densities of the assumed prior distribution of ability. The default is a normal prior, but the user can supply other values. (The item-parameter estimation is not very sensitive to moderate departures from a normal prior.)
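The default points and weights can be reproduced with a short sketch: ten equally spaced points from -4 to +4, with standard normal densities rescaled to sum to one. The function name is hypothetical, and the equal spacing is inferred from the printed points rather than stated in the text.

```python
import math


def normal_quadrature(n_points=10, lower=-4.0, upper=4.0):
    """Equally spaced points with normalized standard-normal densities as weights."""
    step = (upper - lower) / (n_points - 1)
    points = [lower + k * step for k in range(n_points)]
    densities = [math.exp(-0.5 * x * x) for x in points]
    total = sum(densities)
    weights = [d / total for d in densities]   # weights now sum to 1
    return points, weights


points, weights = normal_quadrature()
```

The first weight comes out near .1190E-03 and the fifth near .3213E+00, in agreement with the prior weights printed in the listing.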

[5] See the summary for the PRIORS command, p. 4:31, for the relationship between the ALPHA and BETA parameters of the beta distribution and the mean and variance of the asymptote-parameter distribution.
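As a reading aid, the standard moments of a beta distribution can be computed from ALPHA and BETA directly. The formulas below are the textbook ones for a beta distribution; whether the program reports its prior in terms of the mean or the mode is not stated here, so both are shown. The function name is hypothetical.

```python
def beta_summary(alpha, beta):
    """Mean, variance and mode of a beta(alpha, beta) distribution (alpha, beta > 1)."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1.0))
    mode = (alpha - 1.0) / (total - 2.0)
    return mean, variance, mode


# Starting values and final updated values taken from the cycle log.
start = beta_summary(5.0, 17.0)
final = beta_summary(5.46420, 16.53580)
```

The starting prior has its mode at .200, the reciprocal of the five response alternatives, and the mode of the final prior (about .223) is close to the mean asymptote of .224 in the parameter summary.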

[6] When priors are assumed on the item parameters, the likelihood may not increase during the EM cycles. This occurs because the starting values are closer to the point of maximum marginal likelihood than the MAP estimates. But if the number of items and the sample size are large enough to justify the use of the FLOAT option, the means of the priors will be estimated simultaneously with the item parameters and the marginal likelihood will generally increase. The resulting estimates are then intermediate between the MML and the MMAP estimates. The compromise is necessary because there are seldom enough items to estimate the variances and other moments of their prior distributions accurately. Note that during the EM cycles the means of the priors change but not the standard deviations.

[7] If the number of items is 20 or greater, approximate chi-square statistics for the goodness-of-fit of each item are calculated and output. For this purpose, the cases in the calibration sample are sorted into successive intervals of the latent continuum according to the EAP estimates of their ability rescaled to mean 0 and standard deviation 1. Then, neighboring intervals are collapsed until the expected proportion of correct or incorrect responses exceeds 0.05. This gives a reasonable test of fit if the number of items is large enough to make the assignment of the cases accurate, and if the sample size is large enough to retain three or more intervals. In the present example, the sample size is too small for a useful test; for more suitable data, see Example 8. Note that because, in the MML estimation procedure, the residual frequencies (OBSERVED - EXPECTED) are not under linear constraints due to estimation of item parameters, the degrees of freedom are equal to the number of intervals retained.

When the number of items is less than 20, these chi-square statistics are replaced by root-mean-square standardized posterior residuals at the quadrature points. (See Examples 2, 4, 5 and 6.)
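The probabilities printed beside the item chi-squares can be checked roughly by hand. The sketch below uses the closed-form upper-tail probabilities of the chi-square distribution for 1 and 2 degrees of freedom, which are the only values that occur for individual items in this listing. Because the printed chi-squares are rounded to one decimal, the results agree with the printed probabilities only approximately. The function name is hypothetical.

```python
import math


def chi_square_upper_tail(chisq, df):
    """Upper-tail probability of a chi-square statistic for df = 1 or df = 2."""
    if df == 1:
        return math.erfc(math.sqrt(chisq / 2.0))
    if df == 2:
        return math.exp(-chisq / 2.0)
    raise ValueError("closed form implemented only for df = 1 or 2")


p_item8 = chi_square_upper_tail(11.8, 2)    # listing prints (.0030)
p_item14 = chi_square_upper_tail(3.8, 1)    # listing prints (.0496)
```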

[8] At the end of the calibration, normalized probability densities of the latent distribution are estimated at the quadrature points by the a posteriori weights shown in the output. In the present example, their agreement with the prior normal weights is very good, but this is to be expected because the simulated data were generated from a normal distribution.

 

Phase 3 output: EXAMPL01.PH3

*** PHASE 3 ***

EXAMPLE 01: TEST WITH OMITTED RESPONSES

THREE PARAMETER MODEL RANDOM DATA EAP SCORE ESTIMATION

METHOD OF SCORING SUBJECTS:   EXPECTATION OF A POSTERIORI (EAP; BAYES ESTIMATES)
TYPE OF PRIOR:                NORMAL
SCORES WRITTEN TO FILE        EXAMPL01.PH3
TYPE OF RESCALING:            CENTER WITH RESPECT TO SCORE ESTIMATES
ITEM AND TEST INFORMATION:    NONE REQUIRED

                  QUAD     PRIOR     PRIOR    RESCALING CONSTANTS
TEST   NAME     POINTS      MEAN   STN DEV    SCALING    LOCATION
   1   RANDOM       10      .000     1.000      1.000        .000

***************************************

SCORING

***************************************

[9]

EAP SUBJECT ESTIMATION, SUBTEST: RANDOM

QUADRATURE POINTS AND PRIOR WEIGHTS:

                  1            2            3            4            5
POINT    -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT    .1190E-03    .2805E-02    .3002E-01    .1458E+00    .3213E+00

                  6            7            8            9           10
POINT     .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT    .3213E+00    .1458E+00    .3002E-01    .2805E-02    .1190E-03

 

CORRELATIONS AMONG SUBTEST SCORE ESTIMATES:

RANDOM = 1.000

MEANS AND STANDARD DEVIATIONS OF SCORE ESTIMATES: [10]

TEST:   RANDOM
MEAN:   .015
S.D.:   .868

MARGINAL LATENT DISTRIBUTION FOR SUBTEST RANDOM

MEAN = .015 S.D. = .968

[11]

 

                  1            2            3            4            5
POINT    -.4000E+01   -.3111E+01   -.2222E+01   -.1333E+01   -.4444E+00
WEIGHT    .5130E-04    .1583E-02    .2525E-01    .1564E+00    .3044E+00

                  6            7            8            9           10
POINT     .4444E+00    .1333E+01    .2222E+01    .3111E+01    .4000E+01
WEIGHT    .3360E+00    .1407E+00    .3159E-01    .3824E-02    .1847E-03

RESCALING DONE WITH RESPECT TO SUBJECT DISTRIBUTION [12]

SUBTEST    SCALING CONSTANT    LOCATION
RANDOM                1.153       -.017

[13]

[14]

SUBJECT                                                                    MARGINAL
IDENTIFICATION   WEIGHT   SUBTEST   TRIED   RIGHT   PERCENT   ABILITY    S.E.      PROB
  -0.324           1.00   RANDOM       20      10     .5000    -.6105   .4612   .000027
  -0.673           1.00   RANDOM       20      14     .7000     .5785   .5354   .000004
  -2.722           1.00   RANDOM       20       7     .3500   -1.4665   .6717   .000000
   0.714           1.00   RANDOM       20      16     .8000     .9046   .5743   .000027
   0.351           1.00   RANDOM       20      14     .7000     .3676   .4916   .000024
  -0.042           1.00   RANDOM       20      13     .6500    -.0031   .5672   .000010
  -0.747           1.00   RANDOM       20       9     .4500    -.5874   .4811   .000012
   0.154           1.00   RANDOM       20      11     .5500    -.4572   .4856   .000008

[OUTPUT FOR 80 CASES OMITTED HERE]

  -0.021           1.00   RANDOM       20      14     .7000     .2931   .5126   .000017
  -3.063           1.00   RANDOM       20      20    1.0000    2.2583   .7016   .017796
  -0.050           1.00   RANDOM       20      15     .7500     .4971   .4861   .000047
   0.434           1.00   RANDOM       20      19     .9500    1.8448   .6295   .004412
   0.995           1.00   RANDOM       20      15     .7500     .6837   .5417   .000009
  -1.554           1.00   RANDOM       20       6     .3000   -1.8910   .6342   .000000
  -1.742           1.00   RANDOM       20       3     .1500   -1.9552   .6429   .000003
   0.146           1.00   RANDOM       20      17     .8500    1.1645   .5786   .000103
   0.459           1.00   RANDOM       20      15     .7500     .5191   .4895   .000038
   0.030           1.00   RANDOM       20      14     .7000     .2452   .5068   .000031

SUBTEST RANDOM; RESCALED ITEM PARAMETERS [15]

ITEM  INTERCEPT  SLOPE    THRESHOLD  DISPERSN  ASYMPTOTE
        S.E.      S.E.      S.E.      S.E.      S.E.

0001     .356     .904     -.393     1.106      .218
        .254*    .324*     .322*     .396*     .088*
0002    -.437     .654      .668     1.529      .225
        .324*    .234*     .447*     .547*     .083*
0003    -.444     .721      .616     1.388      .222
        .331*    .271*     .449*     .521*     .083*
0004    -.784     .631     1.244     1.586      .204
        .378*    .246*     .658*     .620*     .076*
0005    -.597     .729      .818     1.371      .216
        .364*    .291*     .535*     .547*     .079*

[OUTPUT FOR THE REMAINING ITEMS OMITTED HERE]

PARAMETER       MEAN   STN DEV
ASYMPTOTE       .224      .019
SLOPE           .665      .158
LOG(SLOPE)     -.433      .226
THRESHOLD      -.221     1.064

 

MEAN & SD OF SCORE ESTIMATES AFTER RESCALING: .000 1.000

SAMPLE PROBLEMS 6-11

[9] All cases in the data file are scored, not just those used in the item calibration. The SCORE keyword of the SAVE command writes the scores to a separate file. Printing of the scores in the Phase 3 output can be suppressed by the NOPRINT option of the SCORE command.

[10] These are the sample means and standard deviations before rescaling. Note the sample standard deviation is less than the assumed S.D. of 1 in the population due to the regression of EAP estimates to the mean.

[11] The S.D. of the latent distribution estimated from the posterior weights at the quadrature points is, however, near its true value. This S.D. is computed using the formula for the variance of grouped data, with the quadrature points treated as class marks and the posterior weights as class frequencies. Sheppard's correction for coarse grouping is applied before the square root is taken to obtain the standard deviation. The posterior weights appear only when EAP score estimation is selected and are based on all cases, not just those used in the calibration.
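The grouped-data computation can be sketched as follows, using the ten quadrature points as class marks and the Phase 3 posterior weights as class frequencies, with the ninth weight entered as .3824E-02, the value that makes the weights sum to one. The function name is hypothetical. The sketch reproduces the printed mean of .015 and gives an S.D. a little below the printed value; the small difference may come from rounding of the printed weights or from details of the program's computation that are not described here.

```python
import math

# Ten equally spaced quadrature points from -4 to +4 (class marks).
points = [-4.0 + k * (8.0 / 9.0) for k in range(10)]

# Posterior weights from the Phase 3 listing (class frequencies).
posterior = [0.5130e-4, 0.1583e-2, 0.2525e-1, 0.1564, 0.3044,
             0.3360, 0.1407, 0.3159e-1, 0.3824e-2, 0.1847e-3]


def grouped_mean_sd(marks, freqs, sheppard=True):
    """Mean and S.D. of grouped data, optionally with Sheppard's correction."""
    total = sum(freqs)
    mean = sum(f * x for x, f in zip(marks, freqs)) / total
    variance = sum(f * (x - mean) ** 2 for x, f in zip(marks, freqs)) / total
    if sheppard:
        width = marks[1] - marks[0]        # class width
        variance -= width ** 2 / 12.0      # Sheppard's correction
    return mean, math.sqrt(variance)


mean, sd = grouped_mean_sd(points, posterior)
```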

[12] If the rescaling option is selected, the scores are rescaled linearly with these constants before the scores are output. In the present example, the rescaling will set the sample mean and standard deviation equal to their population values, which have been arbitrarily assigned the values 0 and 1.
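The rescaling constants follow from the sample mean and standard deviation of the score estimates. A minimal sketch, with hypothetical function names, using the rounded values .015 and .868 from the listing:

```python
def rescaling_constants(sample_mean, sample_sd, target_mean=0.0, target_sd=1.0):
    """Scale and location that map the sample mean and S.D. onto the target values."""
    scale = target_sd / sample_sd
    location = target_mean - scale * sample_mean
    return scale, location


def rescale(score, scale, location):
    """Linear rescaling of a score estimate."""
    return scale * score + location


scale, location = rescaling_constants(0.015, 0.868)
```

This gives a scale of about 1.152 and a location of about -.017. The listing prints 1.153 and -.017; the difference in the third decimal of the scale is consistent with the program working from the unrounded standard deviation.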

[13] The score output includes the classical number right and percent right in addition to the estimated latent ability and its standard error (or in the case of EAP estimation, the posterior standard deviation). In the latter case, the estimated marginal probability of the response pattern of each case is also output. Scores for cases with extremely deviant marginal probabilities should be viewed with caution. They may indicate faulty use of the answer sheet or random responding. A quantile plot of the log marginal probabilities can help detect such cases.
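The three quantities reported for each case (the EAP ability, its posterior standard deviation, and the marginal probability of the response pattern) can be sketched with a small quadrature routine. This is an illustration under stated assumptions and not the program's code: it assumes the three-parameter logistic model with D = 1.7 and the ten-point normal prior, it ignores omitted responses, and for brevity it uses only the first three items of the cycle-6 listing. All function names are hypothetical.

```python
import itertools
import math

D = 1.7  # normal-metric scaling constant


def p_correct(theta, slope, threshold, asymptote):
    """Three-parameter logistic probability of a correct response."""
    return asymptote + (1.0 - asymptote) / (1.0 + math.exp(-D * slope * (theta - threshold)))


def normal_quadrature(n_points=10, lower=-4.0, upper=4.0):
    """Equally spaced points with normalized standard-normal weights."""
    step = (upper - lower) / (n_points - 1)
    points = [lower + k * step for k in range(n_points)]
    dens = [math.exp(-0.5 * x * x) for x in points]
    total = sum(dens)
    return points, [d / total for d in dens]


def eap_score(item_scores, items, points, weights):
    """Return (EAP estimate, posterior S.D., marginal probability of the pattern)."""
    likelihoods = []
    for theta in points:
        like = 1.0
        for u, (a, b, c) in zip(item_scores, items):
            p = p_correct(theta, a, b, c)
            like *= p if u == 1 else (1.0 - p)
        likelihoods.append(like)
    marginal = sum(l * w for l, w in zip(likelihoods, weights))
    posterior = [l * w / marginal for l, w in zip(likelihoods, weights)]
    mean = sum(t * p for t, p in zip(points, posterior))
    var = sum((t - mean) ** 2 * p for t, p in zip(points, posterior))
    return mean, math.sqrt(var), marginal


# (slope, threshold, asymptote) for items 0001-0003 from the cycle-6 listing.
ITEMS = [(1.042, -0.327, 0.218), (0.754, 0.594, 0.225), (0.830, 0.549, 0.222)]
POINTS, WEIGHTS = normal_quadrature()

high = eap_score([1, 1, 1], ITEMS, POINTS, WEIGHTS)   # all three correct
low = eap_score([0, 0, 0], ITEMS, POINTS, WEIGHTS)    # all three incorrect
total_probability = sum(
    eap_score(list(pattern), ITEMS, POINTS, WEIGHTS)[2]
    for pattern in itertools.product([0, 1], repeat=3)
)
```

The marginal probabilities of all possible response patterns sum to one, which is why a pattern with a very small marginal probability relative to the others deserves a second look.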

[14] Given the relatively small number of items and their modest discriminating power (slopes), it is not surprising that a few of the deviations between the generating values shown in the ID field at the left and the estimated abilities are rather large. Notice, however, that the sample root-mean-square error (RMSE) (estimated ability - generating ability) of 0.5485 agrees well with the posterior standard deviations ("standard errors") of the scores based on Bayesian theory. This RMSE corresponds to a classical reliability coefficient of 0.70.
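The link between the RMSE and the reliability coefficient can be sketched as one minus the ratio of error variance to score variance, with the score variance taken as 1 on the rescaled metric (an assumption made here to match the figure quoted above). The RMSE function is shown on the first four listed cases only, so its result is illustrative and is not the full-sample value of 0.5485. The function names are hypothetical.

```python
import math


def rmse(generating, estimated):
    """Root-mean-square difference between estimated and generating abilities."""
    squared = [(e - g) ** 2 for g, e in zip(generating, estimated)]
    return math.sqrt(sum(squared) / len(squared))


def reliability_from_rmse(error_sd, score_variance=1.0):
    """Reliability as 1 minus error variance over score variance."""
    return 1.0 - error_sd ** 2 / score_variance


reliability = reliability_from_rmse(0.5485)

# First four cases of the Phase 3 listing: ID (generating value) and ABILITY.
generating = [-0.324, -0.673, -2.722, 0.714]
estimated = [-0.6105, 0.5785, -1.4665, 0.9046]
subset_rmse = rmse(generating, estimated)
```

With an RMSE of 0.5485 the coefficient is about 0.70, the value cited in the note.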

[15] As discussed in Chapter 1, the item slopes and thresholds can be adjusted so that they will estimate directly the rescaled scores. The item lower asymptotes are not affected. In the EXAMPL01.PAR file, the rescaled parameters replace those from the calibration phase.
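The adjustment of the item parameters can be sketched from the linear rescaling of the scores: if a score is rescaled as scale x score + location, the slope is divided by the scale, the threshold is rescaled like a score, and the asymptote is unchanged. The intercept is then recomputed as minus slope times threshold. The function name is hypothetical; the constants 1.153 and -.017 are those printed in the Phase 3 listing.

```python
def rescale_item(slope, threshold, asymptote, scale, location):
    """Item parameters expressed on the rescaled score metric."""
    new_slope = slope / scale
    new_threshold = scale * threshold + location
    new_intercept = -new_slope * new_threshold
    return new_intercept, new_slope, new_threshold, asymptote


# Item 0002 after cycle 6: slope .754, threshold .594, asymptote .225.
item2 = rescale_item(0.754, 0.594, 0.225, scale=1.153, location=-0.017)
```

For item 0002 this gives an intercept of -.437, a slope of .654 and a threshold of .668, matching the rescaled listing.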

SAMPLE PROBLEMS 6-12