DETERMINING DIFFERENTIAL ITEM FUNCTIONING AND ITS EFFECT ON THE TEST SCORES OF SELECTED PIB INDEXES, USING ITEM RESPONSE THEORY

The objective of this article is to present the results of an investigation into the item and test characteristics of two tests of the Potential Index Batteries (PIB) in terms of differential item functioning (DIF) and the effect thereof on test scores of different race groups. The English Vocabulary (Index 12) and Spelling Tests (Index 22) of the PIB were analysed for white, black and coloured South Africans. Item response theory (IRT) methods were used to identify items which function differentially for white, black and coloured race groups. The effects of the differences between the item characteristic curves (ICCs) of the three race groups on the test characteristic curves (TCCs) were studied. The items identified as biased (DIF) appeared to have a negligible effect on the test scores of Index 12 and Index 22 at the different ability levels for the groups considered. It can be concluded that the tests do not appear to discriminate unfairly, due to DIF, against race groups.

The new Employment Equity Act (1998) places all test developers and users under an obligation to consider the impact of psychometric assessments on different groups as carefully as they consider other technical psychometric issues. The importance of the incorporation of this requirement in the design of psychometric instruments cannot be overemphasised. The fact that some tests may discriminate unfairly against certain groups has become a matter of primary concern in South Africa.
What complicates the issue of unfair discrimination is that differences in the experiential backgrounds of groups or individuals inevitably manifest themselves in test performance. Insofar as culture affects behaviour, cultural influences will and should be detected by such a measure. Sometimes, inexplicable differences in personality, cognition and factors involved in the test situation itself have an effect on the different performances on test items of different cultural groups (Scheuneman, 1985). However, if all cultural differentials which cause unfair discrimination related to differential item functioning (DIF) are ruled out from a test, the content validity of the test may be compromised. In an effort to include in tests only items common to different cultures or sub-cultures, content may be chosen that has limited scope in terms of the construct measured.
Insofar as the assumptions of the latent ability theory in item response models for unidimensional measures allow for differential item functioning, only one dominant component or primary trait influences test performance. This assumption, however, does not exclude the influence of secondary factors or traits on test performance. The secondary factors or traits that can have an impact on test performance in addition to the dominant component include cognitive, personality and test-taking factors. In terms of DIF, the apparent differences in the primary ability (when, in fact, there are no such differences) may be the consequence of secondary latent traits.
Accordingly, DIF or item bias can be defined as the difference between two groups in the probability of an individual providing the correct response to an item, given the same primary or underlying ability. This means that, if an item is unbiased, the probabilities of correct responses at each ability level must be identical, apart from sampling error, across different populations of interest (Hambleton & Swaminathan, 1990).
However, the interpretation and evaluation of test scores are not based on the individuals' responses to items, but rather on scale scores and configurations of scores. Tests are most valuable if the test level rather than the individual item level is the basis of comparison. Some items might have lower content validity than other items in the test, and focusing on specific items might detract from the overall value of the test in assessing ability. Focusing on item level only without considering the cumulative effect on the total score would mean moving away from total information of the scale (Pope, Butcher & Seelen, 1994). Thus, the cumulative effect of DIF on the test score for the groups in question should be investigated before any conclusions can be reached on the level of unfair discrimination present in the test.
The objective of this article is to present the results of an investigation into the item characteristics of two tests of the Potential Index Batteries (PIB) in terms of DIF and the effect thereof on the test scores of different race groups.

Strategy for identifying DIF
The three-parameter logistic item response theory model was used to identify DIF. The a, b and c parameters were obtained for each group by means of the marginal maximum likelihood (MML) procedure and the EM algorithm, using the Xcalibre Item Parameter Estimation Programme. The programme's calculations include standardised residuals that indicate how well the response data fit the selected item response theory (IRT) model for the item parameters estimated. The statistical properties of the MML technique appear to be sound, with high levels of consistency. With the MML technique, reasonable estimates of IRT item parameters can be derived from short tests (e.g. 25 items) and small samples of examinees (e.g. fewer than 1 000) (Mislevy & Bock, 1982).
Linn, Levine and Wardrop (1981) proposed a strategy that uses the area between the item characteristic curves of the comparison and focal groups to determine bias. The strategy can be explained as follows. According to the three-parameter logistic model, the conditional probability P_i(θ) that a person randomly chosen from all those with ability θ will answer item i correctly is a function of θ and three item parameters: the item discrimination, a; the location or difficulty of the item, b; and the lower asymptote, c, which is the probability that persons of extremely low ability will respond correctly to the item. The graph of P_i(θ) as a function of θ is called the item characteristic curve (ICC) for item i. According to the model, the probability of getting the item right is completely determined by θ and the three item parameters. More specifically, members of different groups with equal ability should have the same probability of answering an item correctly. In other words, the conditional probabilities P_i(θ) and their graphs should be invariant from one group to another if the item is not biased.
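The ICC described above can be sketched in a few lines of Python. This is a minimal illustration, not the estimation programme's own code: the scaling constant D = 1,7 is an assumption (the text does not say whether it was applied), and the function name and item parameter values are hypothetical.

```python
import math

D = 1.7  # assumed scaling constant; not stated in the text


def icc_3pl(theta, a, b, c):
    """Probability of a correct response under the three-parameter logistic model.

    a = discrimination, b = difficulty (location), c = lower asymptote.
    """
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


# Hypothetical item: discrimination 1.2, difficulty 0.5, lower asymptote 0.2
for theta in (-3.0, 0.5, 3.0):
    print(theta, round(icc_3pl(theta, a=1.2, b=0.5, c=0.2), 3))
```

At θ = b the curve sits halfway between c and 1, and it approaches c as θ decreases, which is why c is read as the success probability of very low-ability examinees.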
Since the item parameters have to be calibrated separately for each group, the item parameters were standardised on b_i for each group before comparing the ICCs (Hambleton & Swaminathan, 1990). The c parameter was determined for the combined group and assumed to be fixed and equal for the different sub-groups which were compared.
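One simple way to place separately calibrated difficulty estimates on a common scale is to shift the focal group's b values so that their mean matches that of the comparison group. The sketch below assumes such a pure mean shift; the exact transformation used in the study is not spelled out in the text, and the function name and parameter values are hypothetical.

```python
from statistics import mean


def shift_b_to_reference(b_focal, b_reference):
    """Shift focal-group difficulties so that their mean equals the reference mean.

    Assumption: only the origin of the scale is adjusted; the unit is left as is.
    """
    offset = mean(b_reference) - mean(b_focal)
    return [b + offset for b in b_focal]


# Hypothetical difficulty estimates for the same four items in two groups
b_white = [-1.0, -0.2, 0.4, 1.2]
b_black = [-0.6, 0.3, 0.7, 1.6]
b_black_linked = shift_b_to_reference(b_black, b_white)
print([round(b, 2) for b in b_black_linked])
```

After the shift, any remaining difference in an item's b value between groups reflects the item's relative position within each group rather than an overall difference in scale origin.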
Factor analysis by means of the SPSS statistical package was used to check whether the assumption of unidimensionality of the test items was reasonable. The phi correlation was used as a measure of the relationship between two dichotomous variables. It is commonly believed that using phi correlation leads to a factor solution with too many factors, some of them difficulty factors due to the range of item difficulties among the items in the test. A second order factor analysis was then done to determine the actual number of factors underlying the intercorrelation of the first order factors (Schepers, 1992).
The Mantel-Haenszel (MH) chi-square statistic, as calculated by means of the Biasx Programme of the HSRC, was used to cross-validate the results obtained with the IRT model (Holland & Thayer, 1988). Using multiple methods in studying DIF is generally recommended. The chi-square statistic can be considered a close approximation of the item response theory approach. The Mantel-Haenszel statistics, in addition to the chi-square statistic, include the delta statistic as an indicator of the level and the direction of DIF. The calculations of the Mantel-Haenszel chi-square statistics were performed by the University of Pretoria's Network and Support Services.
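The Mantel-Haenszel statistics can be illustrated with a short sketch based on the standard formulas (continuity-corrected chi-square, common odds ratio, and delta = -2,35 ln(alpha)). This is not the Biasx Programme's code; the function name, the table layout and the counts are hypothetical, and examinees are assumed to be matched on total score.

```python
import math


def mantel_haenszel(tables):
    """tables: one (A, B, C, D) tuple per matched score level.

    A, B = comparison (reference) group correct / incorrect
    C, D = focal group correct / incorrect
    Returns (chi_square, alpha, delta).
    Assumes at least one level with B * C > 0 and non-zero variance.
    """
    sum_a = sum_e = sum_var = num = den = 0.0
    for a, b, c, d in tables:
        t = a + b + c + d
        if t < 2:
            continue  # a level with fewer than two people carries no information
        n_ref, n_foc = a + b, c + d
        m_right, m_wrong = a + c, b + d
        sum_a += a
        sum_e += n_ref * m_right / t  # expected A under no DIF
        sum_var += n_ref * n_foc * m_right * m_wrong / (t * t * (t - 1))
        num += a * d / t
        den += b * c / t
    chi_square = (abs(sum_a - sum_e) - 0.5) ** 2 / sum_var
    alpha = num / den  # common odds ratio
    delta = -2.35 * math.log(alpha)
    return chi_square, alpha, delta


# Hypothetical counts at three score levels
tables = [(30, 20, 20, 30), (40, 10, 30, 20), (45, 5, 40, 10)]
chi_square, alpha, delta = mantel_haenszel(tables)
print(round(chi_square, 2), round(alpha, 2), round(delta, 2))
```

With this layout, an alpha above 1 (a negative delta) indicates an item that is relatively easier for the comparison group, and a positive delta indicates the opposite.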

Determining the effect of DIF on test scores
The test characteristic curve (TCC) is the sum of P_i(θ) over the items included in the test. The TCC estimates the number of items answered correctly at each θ level or, when divided by the number of items, the proportion of items answered correctly (Hambleton & Swaminathan, 1990). Therefore, the difference in the TCCs at each ability level for different groups provides an indication of the accumulated difference in the standardised ICCs of the groups. The cumulative effect of DIF on the test score can be estimated by calculating the net differences in the ICCs of biased items for the groups in question.
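The relationship between ICCs and the TCC can be sketched as follows. The code sums three-parameter logistic ICCs to obtain the expected number-correct score and then takes the difference between two groups at selected ability levels. The scaling constant D = 1,7, the function names and all parameter values are assumptions made for illustration only.

```python
import math

D = 1.7  # assumed scaling constant


def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def tcc(theta, items):
    """Expected number-correct score at theta: the sum of the item ICCs."""
    return sum(icc_3pl(theta, a, b, c) for a, b, c in items)


# Hypothetical (a, b, c) estimates for the same two items in each group,
# already on a common scale, with c held equal across groups
comparison_items = [(1.0, -0.5, 0.2), (1.3, 0.4, 0.2)]
focal_items = [(1.0, -0.2, 0.2), (1.3, 0.1, 0.2)]

for theta in (-1.0, 0.0, 1.0):
    gap = tcc(theta, focal_items) - tcc(theta, comparison_items)
    print(theta, round(gap, 3))
```

Because one hypothetical item is harder and the other easier for the focal group, the item-level differences partly cancel in the sum, which is the kind of offsetting that the TCC comparison is meant to reveal.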

Data analysed
Data for the analyses reported below are based on the English Vocabulary (Index 12) and English Spelling Ability (Index 22) tests of the PIB, consisting of 20 and 25 items respectively (Erasmus & Minnaar, 1997). Index 12 requires the testee to indicate which one of five alternative words has more or less the same meaning as a specified word. Index 22 requires the testee to indicate the correct spelling of a specific word, given five alternatives. Both measures were developed for post-matriculants. No time limits were imposed for the completion of the tests. The test and item response data were obtained from job applicants. A convenience sample consisting of 609 white and 694 coloured candidates was used. The sample represents the total number of available records for the particular groups. A sample of 677 black candidates, randomly drawn from the 5 000 available data records for black candidates, was used. Frequency distributions indicating the first language and educational characteristics of each of the samples are set out in Table 1. For the purposes of this study, the black and coloured race groups were considered to be the previously disadvantaged groups and were accordingly treated as the focal groups.
The white group was consequently treated as the comparison group. In this respect, the white-black comparisons and the white-coloured comparisons were considered the main areas of interest relating to test fairness.

Indices of DIF
Three indices of DIF involving areas between ICCs were computed (Linn et al., 1981). The three bias indices used for the results reported below are the following:
1. Base high area (BHA): the area, if any, between the ICCs for the groups compared where the ICC for the focal group is above the ICC for the comparison group.
2. Base low area (BLA): the area, if any, between the ICCs for the groups compared where the ICC for the focal group is below the ICC for the comparison group.
3. Square root of the sum of squares: the square root of the sum of the squared differences between ICCs in the region of θ = -3 to θ = +3.
An item with a large BHA but small or zero BLA would be considered to show DIF against the comparison group. The direction of DIF would be just the opposite for an item with a large BLA but zero or small BHA. The bias in an item with large BHA and large BLA would depend upon the distribution of ability in the contrasted groups of examinees. The square root of the sum of squares provides an index of total DIF in the region of θ = -3 to θ = +3. Linn et al. (1981) used a cut-off of 0,20 as an indication of possible DIF. Although it cannot be claimed that the value of 0,20 corresponds to a significance level of 0,10 or 0,05, it should be a good approximation thereof (Hulin, Drasgow & Parsons, 1983) and indicates a high possibility of DIF.
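The three area indices can be approximated numerically. The sketch below uses midpoint sums over θ = -3 to θ = +3; the grid step, the step-weighting of the squared differences, the scaling constant D = 1,7, the function names and the item parameters are all assumptions, so the resulting values are not necessarily on the same scale as those of Linn et al. (1981).

```python
import math

D = 1.7  # assumed scaling constant


def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def area_indices(focal, comparison, lo=-3.0, hi=3.0, step=0.01):
    """Return (BHA, BLA, RSS) for one item, using midpoint sums over [lo, hi].

    focal and comparison are (a, b, c) tuples on a common scale.
    """
    n = round((hi - lo) / step)
    bha = bla = ss = 0.0
    for k in range(n):
        theta = lo + (k + 0.5) * step
        diff = icc_3pl(theta, *focal) - icc_3pl(theta, *comparison)
        if diff > 0:
            bha += diff * step  # focal ICC above comparison ICC
        else:
            bla += -diff * step  # focal ICC below comparison ICC
        ss += diff * diff * step
    return bha, bla, math.sqrt(ss)


# Hypothetical item that is harder for the focal group (b = 0.5 versus b = 0.0)
bha, bla, rss = area_indices(focal=(1.0, 0.5, 0.2), comparison=(1.0, 0.0, 0.2))
print(round(bha, 3), round(bla, 3), round(rss, 3))
```

For this hypothetical item the focal ICC lies below the comparison ICC throughout, so the whole area falls into BLA and BHA is zero.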
Because the MH chi-square statistic is sensitive to sample size, a cut-off of 10,83 (at a significance level of 0,001) was applied as an indication of DIF (Raju, Drasgow & Slinde, 1993; Raju, 1990).
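The tail probability attached to a chi-square cut-off can be checked directly, assuming the MH statistic is referred to a chi-square distribution with one degree of freedom. The helper name below is hypothetical; it relies on the fact that a chi-square variate with one degree of freedom is the square of a standard normal variate.

```python
import math


def chi2_sf_1df(x):
    """Upper-tail probability of a chi-square variate with one degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


for cutoff in (3.84, 6.63, 10.83):
    print(cutoff, round(chi2_sf_1df(cutoff), 4))
```

A stricter cut-off of this kind reduces the number of items flagged purely because large samples make small differences statistically detectable.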

RESULTS
The test score summary statistics for black, white and coloured candidates are set out in Table 2.
The raw score means for the black and coloured groups on Index 12 are approximately the same, whereas the mean for the white group is approximately 3 points or 0,70 standard deviation greater than the mean for the black and coloured groups. The mean for the black group on Index 22 is approximately 2 points or 0,46 standard deviation greater than that of the white group and 1 point or 0,34 standard deviation greater than that of the coloured group. Each of the above standard deviation comparisons was based on the standard deviation of the group with the greatest mean value for the test. The differences between the mean values for each of the comparisons made are statistically significant (p = 0,01), except for the comparison between the black and coloured groups on Index 12.

Unidimensionality
The principal factor analyses using phi coefficients based on the total group of 1 980 respondents provided evidence of the unidimensionality of Index 12 and Index 22. Index 12 yielded four eigenvalues greater than unity, with the highest eigenvalue of 3,4 accounting for 18% of the total variance. The second and third eigenvalues accounted for substantially smaller percentages of the total variance (7,1% and 3,1% respectively). According to the scree test, there appeared to be a single dominant factor. A second order factor analysis was done which yielded one eigenvalue greater than unity. This indicates one dominant factor, with an eigenvalue of 2,13 accounting for 53% of the total variance. The same procedure as above was followed with Index 22. The first order factor analysis yielded eight eigenvalues greater than unity, with the highest eigenvalue of 3,24 accounting for 13% of the total variance. The second and third eigenvalues accounted for substantially smaller percentages of the total variance (5,1% and 4,7% respectively). The scree test indicated a single dominant factor. The second order factor analysis yielded one eigenvalue greater than unity, which indicates the presence of one dominant factor, with a value of 2,10 accounting for 26% of the total variance. According to Hulin et al. (1983), IRT models can be applied to moderately heterogeneous item sets.
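The percentages of variance quoted above can be related to the eigenvalues with a small calculation. The sketch assumes that total variance equals the number of variables analysed (as it does for a correlation matrix with unities on the diagonal) and that the second order analyses were run on four and eight first order factors respectively; both are assumptions, and the function name is hypothetical.

```python
def variance_share(eigenvalue, n_variables):
    """Share of total variance for one factor, taking total variance as n_variables."""
    return eigenvalue / n_variables


# Eigenvalues as reported in the text; the variable counts are assumptions
cases = {
    "Index 12, first order": (3.4, 20),
    "Index 12, second order": (2.13, 4),
    "Index 22, first order": (3.24, 25),
    "Index 22, second order": (2.10, 8),
}
for label, (eigenvalue, n) in cases.items():
    print(label, f"{100 * variance_share(eigenvalue, n):.1f}%")
```

Under these assumptions three of the four reported percentages are reproduced after rounding. The first comes out nearer 17% than 18%, which may reflect rounding of the reported eigenvalue or a different variance base in the original analysis.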

Xcalibre calibrations
The item parameter estimates for the white, black and coloured groups were determined. The calculated standardised residuals indicate that all the response data fit the three-parameter IRT model for the item parameters estimated by means of the Xcalibre programme. The item parameters were standardised on parameter b for the black and coloured groups (focal groups), using the mean b parameter of the white group (comparison group). The c parameter was kept constant for each of the groups, based on the c parameter for the total group (Raju, Drasgow & Slinde, 1993).

DIF indices: Index 12 and 22
The DIF statistics for the white-black and white-coloured comparisons are set out in Tables 3 and 4.
<Place Table 3 here>
In the case of the white-black group comparison on Index 12, nine biased items were identified based on the IRT method and ten items based on the MH technique. Of the ten items identified by means of the MH technique, nine items were also identified as biased using the IRT method. Only Item 16 was not identified by the IRT method. The results indicate a very high similarity between the IRT and MH techniques in terms of the identification of DIF. Raju et al. (1993) and Raju (1990) reported a similarly large percentage of overlap between significant signed and unsigned areas between two item response functions and the MH technique, using more or less the same cut-off on the MH chi-square statistic (i.e. 9 and 10,88 respectively). In both studies, the MH technique identified slightly more biased items than the IRT technique (i.e. overlap of 0,80). These findings provide evidence of the accuracy of the square root of the sum of squares as a DIF index using a cut-off of 0,20 in the current study.
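The agreement between the two techniques can be expressed with simple set operations. The sketch below uses the Index 12 item numbers reported for the white-black comparison in the paragraph that follows; defining overlap as the number of shared items divided by the number of items flagged by either technique is an assumption, since the text does not define the overlap figure.

```python
# Index 12 items flagged in the white-black comparison, as reported in the text
irt_flagged = {4, 6, 7, 9, 10, 11, 12, 13, 14}
mh_flagged = irt_flagged | {16}

both = irt_flagged & mh_flagged
only_mh = mh_flagged - irt_flagged
# Assumed definition of overlap: shared items / items flagged by either technique
overlap = len(both) / len(irt_flagged | mh_flagged)

print(sorted(both), sorted(only_mh), overlap)
```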
It is interesting to note that five items, i.e. items 4, 7, 9, 13 and 14, seemed to favour the black group, while four items, i.e. items 6, 10, 11 and 12 (using the IRT technique), and five items, including item 16 (using the MH technique), favoured the white group. The ICCs for Items 6 and 13 are illustrated as examples in Figures 1 and 2 respectively. Although the ICCs were substantially different for the black and white groups for approximately half of the items, the direction of DIF did not consistently favour any of the groups. The estimated net effect of the nine items identified as biased on the difference between groups in the TCC is illustrated in Figure 3. The calculation of the estimated difference between groups on the TCC is based on the assumption that the items which were not identified as significantly biased had a difference in P_i(θ) values of zero. As Figure 3 illustrates, the effect of the items identified as biased on the estimated difference in test scores between the white and black groups seems to favour the black group, but the effect can be considered to be minor. The largest difference occurs between ability levels -1,50 and -0,50, but the estimated difference in test scores does not exceed 0,38 at any point. An average difference in the estimated test score of 0,19 occurred over the whole spectrum of abilities between -3 and +3, due to DIF. Thus, eliminating all DIF would have only a negligible effect on the group differences in the test as a whole, at the risk of reducing the validity of the test.
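The largest and average net differences described above can be obtained by scanning a grid of ability levels. The sketch below sums the ICC differences of flagged items only, treating all other items as contributing zero, in line with the stated assumption. The scaling constant, the grid step, the use of absolute differences for the average, the function names and the item parameters are all hypothetical.

```python
import math

D = 1.7  # assumed scaling constant


def icc_3pl(theta, a, b, c):
    """Three-parameter logistic item characteristic curve."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * a * (theta - b)))


def net_effect(flagged, lo=-3.0, hi=3.0, step=0.1):
    """flagged: list of (focal_params, comparison_params) for items showing DIF.

    Items not listed are assumed to contribute a difference of zero.
    Returns (largest absolute difference, theta where it occurs,
    mean absolute difference over the grid).
    """
    n = round((hi - lo) / step)
    grid = [lo + k * step for k in range(n + 1)]
    gaps = [
        sum(icc_3pl(t, *foc) - icc_3pl(t, *comp) for foc, comp in flagged)
        for t in grid
    ]
    worst = max(range(len(grid)), key=lambda k: abs(gaps[k]))
    mean_abs = sum(abs(g) for g in gaps) / len(gaps)
    return abs(gaps[worst]), grid[worst], mean_abs


# Hypothetical flagged items: one favours the focal group, one the comparison group
flagged = [
    ((1.0, -0.4, 0.2), (1.0, 0.0, 0.2)),
    ((1.2, 0.8, 0.2), (1.2, 0.5, 0.2)),
]
largest, at_theta, average = net_effect(flagged)
print(round(largest, 3), round(at_theta, 1), round(average, 3))
```

When flagged items favour different groups, their differences offset one another in the sum, so the net effect on the expected test score can be small even where individual ICCs differ noticeably.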
<Place Figures 1, 2, 3 and 4 here>
Only two biased items were identified in the comparison of the white and coloured groups (using both the IRT and MH techniques) on Index 12. These are items 6 and 12, and both favour the white group. The estimated effect thereof on the total test score can be considered negligible (Figure 3), assuming that the difference between the groups on all the other non-biased items in the test is zero. The largest difference occurs between ability levels -0,50 and 0,00, but the estimated difference in the test scores does not exceed 0,54 at any point. An average difference in the estimated test score of 0,23 occurred over the whole spectrum of abilities between -3 and +3, due to DIF.
Index 22 was analysed using the same procedures as above. As with Index 12, there was a very large overlap between the results obtained using the IRT and MH techniques to identify DIF. In the case of the white-black group comparison, nine biased items were identified based on the IRT method and eight items based on the MH technique. Of the nine items identified by means of the IRT technique, eight were also identified as biased using the MH technique. Only item 24 was not identified by the MH technique. As with Index 12, the direction of DIF was mixed: six items, i.e. items 5, 8, 11, 13, 17 and 22, seemed to favour the black group. Three items, i.e. items 6, 10 and 24 (using the IRT technique), and two items, excluding item 24 (using the MH technique), favoured the white group. The estimated net effect on the test scores is illustrated in Figure 4 and can be considered negligible. The largest difference in favour of the black group occurred between ability levels 0,50 and 1,00, but the estimated difference in the test scores did not exceed 0,58 at any point. An average difference in the estimated test score of 0,21 occurred over the whole spectrum of abilities between -3 and +3, due to DIF.
In terms of the white-coloured comparison, only one item was identified as biased, favouring the coloured group. As with the previous findings, the estimated net effect of DIF on the total score is negligible. The largest difference occurred between ability levels -1,00 and 0,00, but the estimated difference in the test scores did not exceed 0,16 at any point. An average difference in the estimated test score of 0,07 occurred over the whole spectrum of abilities between -3 and +3, due to DIF.
<Place Table 4 here>
It is interesting to note that, for all the items identified as biased in each of the tests, the ICCs of the groups compared differed at each ability level without crossing at any point. This signifies consistent DIF in favour of a particular group at all levels.