The construct equivalence and item bias of the PIB/SpEEx conceptualisation-ability test for members of five language groups in South Africa

This study’s objective was to determine whether the Potential Index Batteries/Situation Specific Evaluation Expert (PIB/SpEEx) conceptualisation (100) ability test displays construct equivalence and item bias for members of five selected language groups in South Africa. The sample consisted of a non-probability convenience sample (N = 6 261) of members of five language groups (speakers of Afrikaans, English, North Sotho, Setswana and isiZulu) working in the medical and beverage industries or studying at higher-educational institutions. Exploratory factor analysis with target rotations confirmed the PIB/SpEEx 100’s construct equivalence for the respondents from these five language groups. No evidence of either uniform or non-uniform item bias of practical significance was found for the sample.

In South Africa, as elsewhere in the world, psychological instruments are often used for selection and development purposes (Van de Vijver & Rothmann, 2004). Psychological tests are commonly used as aids to determine whether employees have the necessary skills for a specific job (Van der Merwe, 1999).
Psychological instruments can be divided into different groups or types, such as cognitive, personality and interest tests. This study focuses on a cognitive test known as the Potential Index Batteries/Situation Specific Evaluation Expert (PIB/SpEEx) conceptualisation (100) ability test. The PIB/SpEEx 100 test was developed by Potential Index Associates specifically to assess job-relevant conceptual-reasoning skills within a cross-cultural context (Erasmus, 2001). The history and development of cognitive tests in general and the use thereof in a cross-cultural context are particularly relevant to the issue at hand.
Cognitive tests have been developed over more than a century and a variety of perspectives about what constitutes intelligence has emerged. Initially, the specific constructs that were measured by cognitive tests were disputed and theories were therefore developed to explain what really constitutes intelligence or cognitive ability as well as how best to measure these concepts and how to measure these constructs across different cultures in a fair and unbiased way (Gregory, 2004).
The possibilities of unfairness and bias in the use of cognitive tests have resulted in extensive research on factors that might affect the fairness of psychometric instruments. Culture and language can be included among these factors (Gregory, 2004).
Initially, there was little or no attempt to assess cognitive competence in a culturally relevant framework (Kendell, Verster & Von Mollendorf, 1988) and early pioneers in the assessment movement largely ignored the impact of cultural background on test results (Gregory, 2004). Early psychometric testing in South Africa mainly followed international trends and, at the beginning of the 1900s, when psychology began to emerge as an independent field of study, tests were imported from Europe and North America and applied in all sectors of the community without any distinction being made (Foxcroft, 1997).
Gradually, however, an increasing need for change in psychometric testing throughout the world in general and in South Africa in particular began to emerge. Cross-cultural issues began to emerge as being problematic in the 1920s (Meiring, Van de Vijver, Rothmann & Barrick, 2005), when studies of diverse-culture assessments became somewhat more systematic and empirically orientated. It began to dawn on practitioners that not all instruments were equally appropriate to all peoples and cultures (Bedell, Van Eeden & Van Staden, 2000).
In the 1940s and 1950s, work in the psychometric domain in South Africa focused rather pragmatically on the educability and trainability of black South Africans. There was some realisation that cultural differences can influence testing outcomes and attempts to create 'culture-free' tests soon became the vogue (Bedell et al., 2000). Biesheuvel (1949; 1952) can be considered as one of the pioneers in the development of tests to solve problems associated with the testing of preliterate black populations. The General Adaptability Battery was one of the better measures developed during this period for the testing of blacks with little educational background for occupational suitability.
From 1960 onwards, there was a growing recognition that culture exerts subtle and pervasive effects in the testing domain and that it is not possible to remove culture from the equation. At that time, it was increasingly understood that culture affects behaviour and consequently the psychological constructs that were measured and culture began to be seen as an important moderator of test performance (Kendell et al., 1988). From 1960 to 1984, the National Institute for Personnel Research and the Institute for Psychological and Edumetric Research played important roles in the development of measures along cultural and racial lines in the South African context. Both these institutions were later incorporated into the Human Sciences Research Council, which took over the role of test development. The emphasis at the time was on the development of separate measures for each cultural group and/or the use of group-specific norms (Foxcroft & Roodt, 2005). Examples of measures that were developed during the 1990s are the General Scholastic Aptitude Test and the Paper and Pencil Games (PPG) test (Claassen, 1990; 1996). The PPG is, to date, the only measure available in all 11 official languages in South Africa. Due to problems experienced in the comparison of the scores of tests developed for different groups, the focus since the late 1990s has been on the development of tests that are fair in terms of both language and culture (Foxcroft & Roodt, 2005). Examples of more recent developments along these lines are the Learning Potential Computerised Adaptive Tests and the Ability, Processing of Information and Learning Battery tests (Taylor, 1997; De Beer, 2000).
SA Tydskrif vir Bedryfsielkunde / SA Journal of Industrial Psychology, Vol. 34 No. 3, pp. 29-38, http://www.sajip.co.za
A shift towards a closer consideration of any cultural bias inherent in tests also strengthened the notion that culture may constitute a source of systematic error in test results. Kendell et al. (1988) point out that test scores often correlate with non-test variables, such as test-taking behaviour, cultural and/or environmental factors and dispositional factors. Test-taking behaviour is influenced by factors such as the level of education, home language, practice or familiarity with tests of the person(s) taking the tests. Among these factors, the issue of language received much attention in psychological assessment, as it is an overriding consideration that linguistic barriers may inhibit the test performance of minority groups (Gregory, 2004). Language is closely linked to the culture in which a test is developed, as language is almost always used to express the cultural concepts and constructs that need to be measured (McCrae, 2000). However, this is a complex area of study, as not only are there inter-cultural differences in language usage but language itself also evolves and changes over time, even within cultural groupings (Wallis & Birt, 2003).
To compensate for the problems associated with the link between culture and language, some test developers sought to resolve such problems by developing tests in different languages (Bedell et al., 2000). However, the translation of tests into different languages (which was expected to be the answer to the culture dilemma) posed problems of its own. Various practical problems were found with the translation procedure itself. Although there is support for the adaptation of existing tests and the development of culturally appropriate tests and norms, it must be recognised that there are several difficulties in developing and norming tests in a culturally and linguistically diverse society (Foxcroft, 1997).
In the South African context, very specific problems arose in the translation of tests due to three main reasons. Firstly, South Africa has 11 official languages and tests therefore have to be translated into all 11 languages (presenting problems with regard to cost, to the lack of available translators with both language and specialist psychological/human-resource expertise and to a lack of equivalent specialist vocabulary in all the languages). Secondly, among the limited pool of available test administrators, there are not enough administrators speaking the preferred language of test takers who cannot understand English. Thirdly, practitioners have reported problems with regard to the different dialects (of one language) spoken in different areas and to a difference in performance between urban and rural individuals tested in their mother tongue (Bedell et al., 2000).
Apart from the link between culture and language, language is also linked to cognitive processes. Galotti (2004) points out that the use of language in a variety of cognitive tasks raises the following important question: what influence(s) does language have on other cognitive processes? Two extreme positions exist: on the one hand, Chomsky (in Sharrat, 1987) argues that language and other cognitive processes operate completely independently of each other and, on the other, Sharrat (1987) posits that language and other cognitive processes relate completely with one determining the other. Between these two extremes there is considerable middle ground, where language and other cognitive processes are seen as related in some ways but as independent in others (Galotti, 2004). For example: Chomsky (in Sharrat, 1987) states that children manage to acquire language rapidly and efficiently at a stage when cognitive functioning still seems to be relatively undeveloped, thereby implying that language acquisition and cognitive processes are independent processes. Sharrat (1987) argues in favour of the dependency of language and cognitive processes in that the structure of language causes people to think of the world in certain ways. Both arguments appear to be plausible depending on the context within which the arguments are presented. Owen (1992) studied the content and format of items in tests that function differentially and suggests reasons for bias: language (especially in the case of the black subjects in his study who were tested in English) and cognitive style (subject-related).
Language training and problem-solving strategies were recommended, as the differences in mean test performance preclude the use of common norms, while the use of separate norms for the different population groups defeats the purpose of a common test (Owen, 1992). Therefore, the language in which a test is developed has important consequences because of its relation to both culture and cognitive processes. Culture and language have an effect on cognitive processes and may consequently affect an individual's performance in cognitive tests.

Non-verbal tests: A means to reduce the effect of culture and language proficiency on test performance
In reaction to criticism arguing that intelligence tests are culturally biased, a number of non-verbal tests of intelligence have been published (Owen, 1998). Non-verbal tests were developed to measure fluid intelligence, which is a relatively culture-reduced form of mental efficiency (Gregory, 2004). Fluid intelligence is related to a person's inherent capacity to learn and solve problems and is thus used when a task requires a person to adapt to a new situation (Gregory, 2004).
Historically, test developers have tried to construct non-verbal tests of intelligence to meet the needs of a linguistic minority (in other words, individuals who have limited proficiency in the language of the dominant culture). Typically, in Europe and North America, these individuals are either foreign-born or have hearing problems. The situation in South Africa is somewhat different, in that the 'linguistic minority' may, in fact, be a numerical majority of people who do not belong to a 'foreign' group at all. Increasingly, there is a greater realisation among psychologists that many measuring devices are not entirely appropriate for subjects whose mother tongue or first language is not English, for illiterates and for those with speech and hearing impairments (Gregory, 2004).
According to Kline (1993), non-verbal items include pictorial odd-man-out, pictures with errors that have to be recognised, figure classification in which two figures of a series that belong together have to be selected, embedded figures where a shape embedded in other shapes has to be discovered, the identification of the sequence of shapes in matrix format and other variations of pictorial stimuli. Examples of specific non-verbal tests include the Test of Non-verbal Intelligence (TONI), Cattell's Culture Fair Intelligence Test (CFIT) and Raven's Progressive Matrices (RPM).
The TONI items require the examinee to solve problems by identifying relationships among abstract figures. Many of the items are similar in format to those found in the RPM (Gregory, 2004).
The CFIT is a non-verbal measure of fluid intelligence or the ability to engage in analytic and reasoning activities with abstract and novel materials. This is a widely used test, particularly for examinees with language or cultural deficits. Originally designed by Cattell (1940), this test is a culture-free measure of cognitive aptitude. It consists of items without any verbal content. However, questions have been raised about the extent to which the test is completely free of cultural content (for example, even pictures can be culturally loaded) and the name was later changed from the Culture Free Intelligence Test to the CFIT (Hoge, 1999).
The RPM has a non-verbal construction and does not rely on an examinee's fluency in English or any other language, since it consists only of universal symbols. It is often used when testers require a measure of aptitude and ability that is not biased by a test candidate's educational background, ethnic or racial differences, linguistic ability or cultural deficiencies (Samuda et al., 1998). However, reviewers of this test have raised some questions about the construct validity of the instrument, as it is not entirely clear what aspects of cognitive aptitude are assessed. Hoge (1999) states, moreover, that it is clear that RPM scores are not equivalent to the abstract-reasoning scores yielded by an instrument such as the Wechsler Intelligence Scale.
Like the above-mentioned tests, the PIB/SpEEx 100 test is also a non-verbal cognitive-assessment measure.
The development of non-verbal tests is seen as a possible solution to minimising the effect of language proficiency on the comparability of the test scores of different groups.

Challenges associated with cross-cultural testing in South Africa
At present, psychological assessment in South Africa faces many challenges (Foxcroft, Paterson, Le Roux & Herbst, 2004), including the following:
• The creation of tests that can be used without bias across diverse linguistic and cultural backgrounds is a complex process (Huysamen, 1996).
• According to recent legislation, notably the stipulations of the Employment Equity Act, Act 55 of 1998 (Section 8), qualified professionals may use only psychological tests and similar instruments that can be proved to be scientifically valid and reliable and that are not biased against any particular employee or group (Republic of South Africa, 2006).
These challenges encourage industrial psychologists to conduct applied research on the psychometric properties of tests and to explore the fairness of tests that are used (Foxcroft et al., 2004).
Although it is reassuring to see the vast interest in cross-cultural studies, it is regrettable that practitioners and academics do not have a well-established and widely adopted practice in cross-cultural research to deal with issues such as instrument feasibility and multiple interpretations (Van de Vijver, 1998).
According to Van de Vijver (1998), bias and equivalence are concepts that form the core of a framework attempting to incorporate aspects specific to cross-cultural research.
Previous studies in South Africa report race, the level of education, language and the understanding of English to be the main factors that affect the construct and item comparability of cognitive tests (Meiring et al., 2005). Therefore, there is a need to continue to research the issue of bias and equivalence in the culturally diverse South Africa (Meiring et al., 2005). Bias and equivalence research would assist in establishing whether assessment instruments are fair to all language or cultural groups.
The objective of this study was to determine the construct equivalence and item bias of the non-verbal PIB/SpEEx 100 test for diverse language groups in South Africa. The key terms 'construct equivalence' and 'item bias' are briefly explained below.
In theory, the concepts of equivalence and bias are the opposite of each other. Thus, scores are equivalent when they are not biased. Nevertheless, in cross-cultural research conducted to date, the two concepts are treated separately and become associated with different aspects of cross-cultural comparisons.
Equivalence is associated with the measurement level at which scores obtained in different cultures can be compared and bias is a generic term for all measurement artefacts that threaten the validity of cross-cultural comparisons (Van de Vijver & Leung, 1997).
Construct equivalence (also known as structural equivalence) is at the first measurement level and indicates the extent to which the same construct is measured across different cultural groups under study. Construct equivalence is a precondition to subsequent measurement levels known as measurement-unit equivalence (ratio level) and scalar equivalence (interval level). Measurement-unit equivalence requires the offset of scales to be similar for groups and scalar equivalence requires scores on the instrument to have the same interval scales across cultural groups (Van de Vijver, 1998). The problem with dichotomous items is that they do not have an origin or a unit of measurement and the concepts of unit and scalar equivalence consequently cannot be applied to dichotomous variables (Eid, Langeheine, & Diener, 2003).

Research approach
In this study, a quasi-experimental design was used. Quasi-experimental designs help researchers test for causal relationships in a variety of situations (Neuman, 1997).
According to Van de Vijver and Leung (1997) the interpretability of observed differences in the focal variable and on a reduction in the number of alternative explanations. A substantive step in the process of enhancing the interpretability of observed differences and reducing alternative explanations is the choice of appropriate context variables either to verify or to falsify a particular interpretation (Van de Vijver & Leung, 1997). It is evident from the preceding literature discussion that language group as a context variable can be considered a plausible explanation of the observed differences in the focal variable (test score). Consequently, the research method followed in this study is designed to evaluate the (lack of) success of the context variable 'language group' as an alternative explanation for observed score differences in the non-verbal PIB/SpEEx 100 test.
The method followed in this study is discussed below with regard to the respondents, the measuring instrument and the statistical procedures used.

Respondents
A non-probability convenience sample was drawn from three industries and sectors within South Africa, namely the beverage-manufacturing industry, the medical sector and two tertiary institutions from the higher-education sector. The sample included 6 261 participants from five different language groups (Afrikaans, English, North Sotho, Setswana and isiZulu). The participants numbered as follows: 1 643 Afrikaans speakers, 912 English speakers, 1 304 North Sotho speakers, 1 139 Setswana speakers and 1 263 isiZulu speakers. The biographical information of the sample is presented in Table 1.
The sample consisted of 43.2% females and 56.8% males. A further 4.3% of the respondents did not indicate their gender and are therefore indicated as unknown in Table 1. Most of the respondents (89.6%) had completed secondary school up to Grade 8 or Grade 12, while the rest of the sample had obtained a diploma or degree at university or technikon as their highest qualification (8.9%). Only 1.87% of the respondents did not indicate their qualification(s). Most of the respondents were enrolled as full-time students at tertiary institutions (74%), while the rest were employed by the beverage (21.7%) and medical (4.3%) industries.
The mean age of the sample was 20.26 years. The youngest respondent was 17 years old and the oldest was 49 years old.

Measuring instrument
The aim of the PIB/SpEEx is to provide a comprehensive assessment package suitable for the assessment and development of human potential in the workplace. The various indices assess human potential relating to specific dimensions or basic competencies. These are identified in the PIB/SpEEx battery manual (Erasmus, 2001) as set out below.
The PIB/SpEEx battery consists of two types of scales, namely the cognitive and the behavioural scales (Erasmus, 2001). PIB/SpEEx (conceptualisation) 100 is one of the cognitive scales, which means that it assesses an element of intellectual potential and, more specifically, conceptual reasoning. PIB/SpEEx 100 is a visual or non-verbal scale and, because it consists of visual or non-verbal items that explore reasoning processes using shapes and figures, it could arguably be administered in any language whatsoever (Erasmus, 2001). It is therefore particularly useful for assessing people with poor skills in English or, for that matter, in any other language (Erasmus & Schaap, 2003).
The PIB/SpEEx 100 test is a normative scale consisting of 30 items. The respondent must complete a pattern through the identification of one or more rules that determine the relationships of parts. The test assesses potential to reason in spatial terms, to see the relationships of parts, to complete a picture, to envisage a whole or an end result and to anticipate an outcome. It is a performance test and a time limit of 15 minutes to complete the test is therefore imposed.
In a previous study, the average metrical properties of the PIB/SpEEx 100 were investigated. The sample included different industry sectors and academic institutions. It was reported that the PIB/SpEEx 100 scale obtained a mean Cronbach alpha coefficient of 0.90 (Schaap, 2001).

Research procedure
Data were collected from the existing PIB/SpEEx database, which is used for selection and development purposes in industry and tertiary institutions. The pencil-and-paper version of the PIB/SpEEx 100 test was applied. All the data were acquired under the supervision of registered psychologists and were dealt with in a manner that protects the confidentiality of test results.

Statistical analysis
Construct-equivalence and item-bias analyses were used in this study to evaluate the PIB/SpEEx 100 test's comparability across language groups. The gathered data were analysed by means of scale-level analyses to examine the similarity of the factors underlying the PIB/SpEEx 100 test as well as bias at item level. The SPSS (SPSS Inc, 2006) and MicroFACT 2.0 (Waller, 1995) computer programs were used to perform the required statistical analyses.
Descriptive statistics and reliability analysis. Descriptive statistics were calculated in respect of the test scores of the total sample and in respect of the language groups to provide an understanding of the distribution of scores within and between the groups. A reliability analysis was done for each group. Reliability coefficients can provide valuable clues about the measurement accuracy and the appropriateness of an instrument for cross-cultural comparison (Van de Vijver & Leung, 1997).
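As an illustration of the reliability analysis, the following is a minimal sketch of how a Cronbach alpha coefficient could be computed for one group from a respondents-by-items matrix of dichotomous scores. The function name `cronbach_alpha` and the toy data are hypothetical; the study itself used SPSS, not this code.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a respondents-by-items score matrix."""
    scores = np.asarray(scores, dtype=float)
    n_items = scores.shape[1]
    # Sum of the individual item variances.
    item_variances = scores.var(axis=0, ddof=1)
    # Variance of each respondent's total score.
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1.0 - item_variances.sum() / total_variance)

# Hypothetical toy data: 6 respondents, 4 dichotomous items (1 = correct).
toy_scores = np.array([
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 1],
])
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 3))
```

In a comparison across language groups, the same function would simply be applied to each group's score matrix in turn.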
Factor analysis. Construct equivalence can be investigated through several techniques, such as factor analysis, cluster analysis and multidimensional scaling or other dimensionality-reducing techniques (Van de Vijver & Leung, 1997). In this study, construct equivalence was examined by means of exploratory factor analysis. Rogers (1995) explains that the reason for using exploratory factor analysis is to identify a latent subset of psychological characteristics or factors that underlie a specific domain. The basic idea behind the application of this technique is to obtain the structure of each group, which can then be compared across all the groups involved (language groupings, in the case of this specific study). Factor analysis is the most frequently employed technique in the study of construct equivalence (Naudé & Rothmann, 2004). Exploratory factor analysis derives factors that provide the best statistical fit to the data (Murphy & Davidshofer, 2001). According to Rogers (1995), the aim of exploratory factor analysis is to express observed scores as scores on a limited set of unobserved, underlying factors.
Factor analysis is relevant for the establishment of construct equivalence because it decomposes observed scores into unobserved components (Van de Vijver & Leung, 1997). In this study, the factor analysis consisted of a two-step procedure proposed by Van de Vijver and Leung (1997). Firstly, a Principal Axis Factor (PAF) analysis was conducted on the total sample group, which yielded a common matrix of factor loadings. This served as the target matrix for comparison purposes. Secondly, the factor loadings of each language group were compared with the target matrix by means of targeted rotation. The factor loadings were rotated to one target group (total group) to determine the construct equivalence of the factor for the other language groups. Factorial agreement was then estimated with Tucker's coefficient of congruence (Tucker's phi) (Van de Vijver & Leung, 1997).
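The second step can be sketched as follows, assuming an orthogonal Procrustes rotation as the targeted rotation (the source does not specify which rotation algorithm was used). The function names and the loading values are hypothetical and serve only to show how a group's loadings are rotated towards the target and how Tucker's phi is then computed.

```python
import numpy as np

def rotate_to_target(loadings, target):
    """Orthogonally rotate a loading matrix towards a target matrix (Procrustes)."""
    u, _, vt = np.linalg.svd(loadings.T @ target)
    return loadings @ (u @ vt)

def tucker_phi(loadings_a, loadings_b):
    """Tucker's coefficient of congruence between two columns of factor loadings."""
    a = np.asarray(loadings_a, dtype=float).ravel()
    b = np.asarray(loadings_b, dtype=float).ravel()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))

# Hypothetical single-factor loadings for four items (not the study's data).
target = np.array([[0.70], [0.60], [0.50], [0.40]])      # total-group loadings
group = np.array([[-0.65], [-0.62], [-0.45], [-0.41]])   # one group, sign reflected

rotated = rotate_to_target(group, target)
phi = tucker_phi(rotated[:, 0], target[:, 0])
print(round(phi, 3))
```

With a single factor, the rotation reduces to correcting the sign of the loadings; with several factors, the same function aligns the whole loading matrix before phi is computed column by column.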
Item-level analysis. In theory, a test and test items measuring a specific construct are perfectly unidimensional, that is, all items of a specific test measure one and the same construct. In practice, however, this absolute is never attained (Rudner, Getson & Knight, 1980). The one-dimensionality assumption is theoretically a prerequisite for item-bias analysis. Item bias refers to the extent to which an item measures a construct differently across different populations.
In this study, item-bias analysis was performed through logistic regression. More specifically, the study made use of binomial (or binary) logistic regression, which is applicable when one or more variables consist of dichotomous scores, in this case correct and incorrect responses to the various items of the test.
In the case of item bias, a distinction can be made between uniform and non-uniform bias. According to Van de Vijver and Leung (1997), uniform bias refers to an influence of bias on scores that is more or less the same for all score levels. Individuals from one cultural group may have higher scores on an item than individuals from another cultural group, even when they have the same total score. Non-uniform bias refers to a situation where influences are not identical for all score levels and an item discriminates better in one group than in another. An item is taken to show non-uniform bias if the interaction between the score level and culture is significant (Meiring et al., 2005).
Logistic regression is a technique to fit a regression model to data where the dependent variable is dichotomous (Howell, 1997). It is unique in its ability to predict dichotomous variables and, like correlation, provides information about the strength and direction of the relationships across the variables (Marczyk, DeMatteo & Festinger, 2005). In this study, the total test score and language served as the independent variables, while the item score was the dependent variable.
Logistic regression can be used to predict a dependent variable (in this study, the item score) on the basis of independent variables (the test score and language) and to determine the percentage of variance of the dependent variable explained by the independent variables, to determine the relative importance of independent variables, to assess interaction effects and to understand the impact of covariate control variables (Kerlinger & Lee, 2000). In this study, the Chi-square statistic was used to evaluate the statistical significance of the uniform and the non-uniform item bias.
In this study, the Nagelkerke R² statistic was used to calculate the effect size (the strength of the relationship) between the dependent variable and the independent variables. The effect size of language (the uniform bias) was determined through the calculation of the difference between the Nagelkerke R² of the first step (in which score level was the sole predictor) and that of the second step (in which language, dummy coded, was added as a predictor). In the third step, the interaction of culture (here, language group) and score level was added. The difference between the second and the third step estimates the effect size of the interaction (the non-uniform bias) (Meiring et al., 2005).
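The three-step procedure can be illustrated with a minimal sketch. It assumes a plain Newton-Raphson fit written in NumPy (the study used SPSS), a standardised score level, two simulated groups rather than five language groups and an item simulated with uniform bias only. All function names, parameter values and data are hypothetical.

```python
import numpy as np

def logit_loglik(predictors, y, n_iter=50):
    """Fit a logistic regression by Newton-Raphson and return its log-likelihood."""
    X = np.column_stack([np.ones(len(y)), predictors])
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        gradient = X.T @ (y - p)
        hessian = X.T @ (X * (p * (1.0 - p))[:, None])
        beta = beta + np.linalg.solve(hessian, gradient)
    p = np.clip(1.0 / (1.0 + np.exp(-X @ beta)), 1e-12, 1.0 - 1e-12)
    return float(np.sum(y * np.log(p) + (1.0 - y) * np.log(1.0 - p)))

def nagelkerke_r2(loglik_model, loglik_null, n):
    """Cox-Snell R-squared rescaled to a maximum of 1 (Nagelkerke)."""
    cox_snell = 1.0 - np.exp(2.0 * (loglik_null - loglik_model) / n)
    return cox_snell / (1.0 - np.exp(2.0 * loglik_null / n))

def bias_effect_sizes(item, score, group_dummies):
    """Return (uniform, non-uniform) effect sizes as Nagelkerke R-squared differences."""
    n = len(item)
    p_bar = item.mean()
    ll_null = n * (p_bar * np.log(p_bar) + (1.0 - p_bar) * np.log(1.0 - p_bar))
    step1 = score[:, None]                                            # score level only
    step2 = np.column_stack([step1, group_dummies])                   # + group dummies
    step3 = np.column_stack([step2, group_dummies * score[:, None]])  # + interaction
    r2 = [nagelkerke_r2(logit_loglik(s, item), ll_null, n)
          for s in (step1, step2, step3)]
    return r2[1] - r2[0], r2[2] - r2[1]

# Simulated data (an assumption for illustration): the item favours group 1
# at every score level, which corresponds to uniform bias.
rng = np.random.default_rng(0)
n = 2000
group = rng.integers(0, 2, size=n)
score = rng.normal(size=n)
true_logit = 1.2 * score + 0.8 * group
item = (rng.random(n) < 1.0 / (1.0 + np.exp(-true_logit))).astype(float)

delta_uniform, delta_nonuniform = bias_effect_sizes(
    item, score, group[:, None].astype(float))
print(delta_uniform, delta_nonuniform)
```

For five language groups, `group_dummies` would hold four dummy-coded columns; the interaction term in the third step is then built column by column in the same way.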

Descriptive statistics
Table 2 shows the descriptive statistics and Cronbach alpha coefficients of the measuring instrument for the different groups.
Observable differences exist in the mean values, standard deviations (SDs), coefficients of skewness and kurtosis as well as in the alpha (α) coefficients of the five different language groups that were compared.For example, a noticeable difference in the mean, SD, skewness, kurtosis and reliability of test scores exists in respect of the Afrikaans and Northern Sotho groups.The observed score differences between the groups naturally raise questions concerning the construct equivalence and bias An item-level PAF analysis based on a tetrachoric correlation matrix was performed in respect of each group.Tetrachoric correlation is used in factor analysis when both variables are dichotomous and are assumed to represent underlying bivariate normal distributions, as is the case when a dichotomous test item is used to measure some dimension of achievement (Van de Vijver & Leung, 1997).
According to the scree plot, the factor analysis yields more factors than expected (see Figure 1).According to Hambleton and Swaminathan (1985), factor analysis based on tetrachoric correlations is inclined to yield too many factors.The difference between the eigen-values of the first two roots and the rest suggests that there might be two significant constructs.A clear break can be observed on the scree plot between roots two and three for all the groups.The eigen-values of the random data set (the broken line) intersect the eigen-values for the true data set (the solid line) at root three, indicating two significant factors (Horn, 1965).
More detailed results of the item-level factor analysis for the total group are depicted in Table 3.The results show that the first factor accounts for up to 60% of the variance of the unrotated factor matrix.This is significantly more than the criterion of Shillaw (1996) of at least 20% variance on the first factor before unidimensionality can be assumed.In addition, the eigen-value of the first factor also needs to be significantly higher than that of the next largest factor.The first factor has a variance of more than three times that of the second factor, which provides strong evidence in favour of assuming unidimensionality.
Due to the fact that the variance for the second factor was relatively high compared to the third factor and the eigen-value significant, the possibility of a second meaningful construct of the instrument and add to the importance of conducting appropriate analyses.
Cronbach alpha coefficients varying from 0.83 to 0.88 were obtained for the different groups. These coefficients are acceptable according to the guideline of α > 0.80 suggested by Anastasi and Urbina (1997).
In order to compare a test across cultures meaningfully, its equivalence must be demonstrated in those cultures, in this case different language groups. In this study, this was done at the item level and at the test level as suggested by Kline (1993).
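For readers who wish to reproduce a reliability coefficient of this kind, the following is a minimal sketch of Cronbach's alpha for dichotomous (0/1) item scores. The data are hypothetical and the function name is illustrative. Population variances are used for both the items and the total score, which yields the same alpha as sample variances provided the choice is consistent.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """scores: one row per respondent, one 0/1 item score per column."""
    k = len(scores[0])
    # Variance of each item across respondents.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(k)]
    # Variance of the total test score across respondents.
    total_var = pvariance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical responses of four respondents to three items.
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(scores)
print(round(alpha, 2))
```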

Factor analysis
Exploratory factor analysis was used to determine the construct equivalence of 'conceptualisation' as measured by the PIB/SpEEx 100 test.
Eigen-values and percentage of variance explained (per group) for the unrotated factor matrix
Upon further inspection of the second factor, which could not be entirely ruled out as a meaningful construct, a plausible explanation was found: an artefact attributable to the differential skewness, or difficulty, of the items. All the items with salient loadings on factor two had a high difficulty value (p-value) in common.
A more precise statistical method was therefore needed to determine dimensionality and deal with the effect of differential item skewness.
Consequently, the procedure of Schepers (1992) was applied to determine the dimensionality of the PIB/SpEEx 100 statistically. Schepers developed the procedure to control for factor artefacts that form as a result of items being differentially skewed. The first phase of Schepers's procedure requires an iterative factor analysis and a varimax rotation of the significant factors extracted through the use of Kaiser's criterion (Kaiser, 1961).
The PAF analysis (based on tetrachoric correlations) on the total group yielded five factors, which were then subjected to varimax rotation. The items with the highest loading on the respective factors were aggregated to form a new set of variables. The new set of variables was subjected to PAF analysis based on Pearson's correlation matrix. A single factor (explaining 45% of the variance) emerged through the use of Kaiser's criterion (see Figure 2). Accordingly, it was confirmed that the PIB/SpEEx 100 test is unidimensional.
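The aggregation step described above can be sketched as follows. This is an illustration under stated assumptions rather than Schepers's own implementation: each item is assigned to the rotated factor on which its absolute loading is largest, and the item scores are summed per factor to form the new variables. The loadings, responses and function name are hypothetical.

```python
def build_parcels(loadings, responses):
    """Aggregate items into one summed variable per rotated factor.

    loadings: one row per item, one column per rotated factor.
    responses: one row per respondent, one 0/1 score per item.
    """
    n_factors = len(loadings[0])
    # Assumption: an item belongs to the factor with its largest absolute loading.
    assignment = [
        max(range(n_factors), key=lambda f: abs(row[f])) for row in loadings
    ]
    parcels = []
    for person in responses:
        sums = [0] * n_factors
        for item, factor in enumerate(assignment):
            sums[factor] += person[item]
        parcels.append(sums)
    return assignment, parcels


# Hypothetical rotated loadings for four items on two factors.
loadings = [[0.6, 0.1], [0.5, 0.2], [0.1, 0.7], [0.2, -0.4]]
responses = [[1, 1, 0, 0], [1, 0, 1, 1]]

assignment, parcels = build_parcels(loadings, responses)
print(assignment, parcels)
```

The resulting variables would then be intercorrelated (Pearson) and factor-analysed again in the second phase, which is not shown here.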

Structural (construct) equivalence
The results of the item-level PAF analysis (based on tetrachoric correlations) were used in the structural equivalence analysis.
The PAF analysis was repeated for each group and one factor was extracted for comparison purposes. The factor loadings as well as Tucker's congruence coefficients are presented in Table 4.
Factor loadings of 0.30 and higher can generally be considered acceptable (Tabachnik & Fiddell, 1989). Small deviations from the 0.30 criterion were allowed to account for differences in sample homogeneity (Schaap & Basson, 2003). Of the thirty items, only five items showed low factor loadings across the different language groups. Three of these five items (items 3, 4 and 6) consistently displayed low factor loadings across all language groups, while items 18 and 25 displayed a low factor loading for only one language group (Setswana). In total, 25 items had moderate to high factor loadings (> 0.30 with permitted deviations) for all the language groups.
Tucker's phi coefficients for the different language groups are given in Table 4 above. As a general rule, values higher than 0.95 are seen as evidence of factorial similarity, whereas values lower than 0.90 are taken as pointing to non-similarity (Van de Vijver & Leung, 1997). Inspection of Table 4 shows that Tucker's phi coefficients higher than 0.95 were obtained for all the language groups. This provides a strong indication of structural equivalence and it can therefore be deduced that the construct is equivalent for all five language groups.
Constructs that are equivalent for different cultural groups indicate an absence of construct bias in an instrument (Schaap & Basson, 2003).
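Tucker's phi, as used above, is the sum of the cross-products of two columns of factor loadings divided by the square root of the product of their sums of squares. A minimal sketch follows, with hypothetical loadings and illustrative function names. The interpretation bands follow the general rule cited from Van de Vijver and Leung (1997), and the label for values between 0.90 and 0.95 is an assumption made here for completeness.

```python
from math import sqrt


def tucker_phi(x, y):
    """Congruence coefficient between two vectors of factor loadings."""
    numerator = sum(a * b for a, b in zip(x, y))
    denominator = sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return numerator / denominator


def interpret_phi(phi):
    """Apply the 0.95 and 0.90 rule of thumb to a congruence coefficient."""
    if phi > 0.95:
        return "factorial similarity"
    if phi < 0.90:
        return "non-similarity"
    return "inconclusive"  # assumption: the source does not label this band


# Hypothetical loadings of three items in two language groups.
group_a = [0.50, 0.60, 0.40]
group_b = [0.45, 0.65, 0.38]

phi = tucker_phi(group_a, group_b)
print(round(phi, 3), interpret_phi(phi))
```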

Analysis of item bias
The aim of this analysis was not to test for cultural differences but to test whether the item scores were identical for respondents from different language groups with an equal total score level.
Table 5 presents a cross-tabulation of the different language groups and score categories. The cross-tabulation provides information about the cell sizes of the matrix that was used for item-bias analysis.
The respondents were divided into seven groups according to their ability level (test score levels). The various language groups in the seven different ability levels in the table all have more than 50 cases, which can be considered acceptable cell sizes for the purpose of item-bias analysis.
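A cell-size check of this kind can be sketched as follows. The cut-off scores that define the seven levels are hypothetical, since the text does not report how the score levels were formed. The records and function names are likewise illustrative.

```python
from bisect import bisect_right
from collections import Counter


def cell_counts(records, cutoffs):
    """Cross-tabulate (language group, score level) from (group, total score) pairs.

    A score below cutoffs[0] falls in level 0; a score at or above
    cutoffs[i] falls in level i + 1 or higher.
    """
    counts = Counter()
    for group, score in records:
        level = bisect_right(cutoffs, score)
        counts[(group, level)] += 1
    return counts


def small_cells(counts, groups, n_levels, minimum=50):
    """List the cells that hold fewer cases than the required minimum."""
    return [
        (g, lv) for g in groups for lv in range(n_levels) if counts[(g, lv)] < minimum
    ]


# Hypothetical cut-offs: six boundaries give seven score levels.
cutoffs = [8, 12, 16, 19, 22, 25]
records = [("Afrikaans", 5), ("Afrikaans", 8), ("isiZulu", 30)]

counts = cell_counts(records, cutoffs)
too_small = small_cells(counts, ["Afrikaans", "isiZulu"], n_levels=7)
print(len(too_small))
```

In this toy data set every cell is far below 50 cases; with the study's sample, an empty list would indicate that all cells are large enough.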

Figure 2: Scree plot for the factor-analytical procedure of Schepers (1992)
Because differences in scale reliabilities alone are preliminary rather than conclusive evidence, factor analysis was required to provide more conclusive evidence (Schaap & Basson, 2003).
Factor analysis of the PIB/SpEEx 100 test yielded a single dominant factor, as expected. The percentage of variance explained by the factors and eigen-values suggests that there is only one significant construct that can be identified as the conceptualisation-ability construct.
Results of the exploratory factor analysis indicate similar response patterns for the different groups on most of the items. The factor congruence coefficients obtained meet the criterion of high agreement and emphasise the structural equivalence of the construct for the different language groups (Meiring et al., 2005).
Although most items show statistically significant bias due to the large sample size, which increases the sensitivity of the Chi-square statistic, the bias is so small as to be negligible from a practical perspective (Cohen, 1988). With regard to the evaluation of item bias, it was found that none of the items show either uniform or non-uniform bias of practical significance.
Overall, it can be concluded that the PIB/SpEEx 100 test appears to be equivalent and that the test items are not biased for the different language groups included in this study. Thus, the non-verbal items of the PIB/SpEEx 100 scale do not appear to be language-sensitive for the language groups included in this study. The assumption made by Erasmus and Schaap (2003) that the non-verbal scales of the PIB/SpEEx are language-free and can be administered in any language can therefore be confirmed for the PIB/SpEEx 100 test for five South African language groups.
Table 6 indicates that, when bias is evaluated in terms of the statistical significance of Chi-square, most items reveal statistically significant bias. The criterion of Cohen (1988), according to which the lower threshold for a medium effect size is 0.06, was applied to examine the practical significance of the item bias (this size was chosen because it can be considered large enough to be practically important). Many items show statistical bias but the bias effect size (Nagelkerke R²) is so slight as to be negligible from a practical point of view.
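The effect-size reasoning above can be illustrated with a short sketch. It assumes, as the text does not spell out, that bias is evaluated with nested logistic regression models for each item: score level only, then score level plus language group (uniform bias), then the group by score-level interaction added (non-uniform bias). The log-likelihood values and function names are hypothetical. Nagelkerke R² is the Cox and Snell R² divided by its maximum attainable value.

```python
from math import exp, log


def nagelkerke_r2(ll_null, ll_model, n):
    """Nagelkerke R-squared from the log-likelihoods of the null and fitted models."""
    cox_snell = 1.0 - exp(2.0 * (ll_null - ll_model) / n)
    max_cox_snell = 1.0 - exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


def bias_effect_sizes(ll_null, ll_level, ll_group, ll_interaction, n):
    """Return the R-squared increments for uniform and non-uniform bias."""
    r2_level = nagelkerke_r2(ll_null, ll_level, n)
    r2_group = nagelkerke_r2(ll_null, ll_group, n)
    r2_interaction = nagelkerke_r2(ll_null, ll_interaction, n)
    return r2_group - r2_level, r2_interaction - r2_group


def practically_significant(delta_r2, threshold=0.06):
    """Apply the medium effect-size threshold used in the text."""
    return delta_r2 >= threshold


# Hypothetical log-likelihoods for one item answered by 1 000 respondents.
n = 1000
ll_null = n * log(0.5)
uniform, non_uniform = bias_effect_sizes(ll_null, -520.0, -518.5, -518.0, n)
print(practically_significant(uniform), practically_significant(non_uniform))
```

In a large sample, increments of this size can be statistically significant in terms of Chi-square while remaining far below the 0.06 threshold, which is the pattern the text reports.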

DISCUSSION
As is mentioned in the introduction to this article, many challenges are faced in the use of different assessment instruments in the South African context today. Obtaining equivalent measures that may be used across a diversity of linguistic and cultural backgrounds is perhaps the most central issue in cross-cultural and cross-language comparative research (Huysamen, 1996).
The purpose of this article is to report on the construct equivalence and item-bias research that was conducted on the PIB/SpEEx (conceptualisation) 100 test for five language groups in South Africa.
Overall, small observable differences in scale reliabilities in respect of the various language groups provide some indication that the construct may be equivalent for the language groups. The scale reliabilities are all well within the range of what is generally considered acceptable for different groups. It is recognised, however, that differences in scale reliabilities among groups can be considered only preliminary and not conclusive evidence.
Item-bias statistics of the conceptual-reasoning test for the different language groups
The introduction to this study states that a question that needs answering is whether a given psychometric instrument can stand the scrutiny of the Employment Equity Act and its subsections (Republic of South Africa, 2006), in this case a test used to measure conceptual-reasoning ability. This test did not show practically significant bias and the results are consequently encouraging for the equitable use of the PIB/SpEEx 100 test in a multicultural environment like that of South Africa.

Recommendations and suggestions for further research
As discussed, according to Van de Vijver and Leung (1997), there are three kinds of bias: item, construct and method bias. This study does not address all the aspects of test usage but focuses on item and construct bias. Method bias, which refers to problems deriving from instrument characteristics, is not taken into consideration and should be investigated in a separate study.
Multi-sample confirmatory factor analysis (MCFA) procedures suitable for dichotomous variables should be considered, as these provide more options to test for measurement invariance (Skrondal & Rabe-Hesketh, 2005). Compared to PAF procedures, MCFA is a more versatile tool when testing for the hierarchically linked hypotheses of cross-cultural measurement invariance. MCFA allows for the testing of specific pattern coefficients, error variances and factor covariances to determine the specific differences among groups and to understand the aspects of the test structure that differ across groups (Maller & French, 2004).
To ensure the equitable use of the PIB/SpEEx 100 test, the predictive validity and predictive bias of the test can also be considered. Even an unbiased instrument may not work equally well for different language groups. This study does not address the question of whether the cognitive scale can predict future training and job performance in a fair way for all language groups. A final verdict on the cross-cultural suitability of the current test can be given only once the predictive bias is also tested.
In the light of the importance of different language groups in this study, it is recommended that future research include biographical questions that elicit responses on current home language and mother (original) tongue. These questions are highly applicable to South Africa, since many people indicate English as their home language when their mother tongue is not, in fact, English.
Although further investigation is needed, the prospects for the use of the PIB/SpEEx 100 in a multicultural environment seem favourable. The development of new instruments or the modification of existing ones can benefit from insights gained from research on the nature and extent of cultural loadings on cognitive-ability tests.
For this study, one must take into account the history and development of cognitive tests as well as the many challenges that psychological assessment faces in South Africa. Much more research is needed on the equivalence and bias of assessment tools used in South Africa before psychology as a profession can live up to the demands implied in the Employment Equity Act (Van de Vijver & Rothmann, 2004). For these reasons, this study aims specifically to investigate the equivalence and bias of the PIB/SpEEx 100 test to ensure that this test is used appropriately in the South African context and measures one construct, namely conceptualisation or conceptual reasoning, in different language groups.
Item bias refers to measurement artefacts at item level. A few examples from a non-exhaustive list of nuisance factors at item level are poor item translation, inappropriate item content and inadequate item formulation (complex wording). Item bias is a measurement problem that, if not attended to, can jeopardise the validity of cross-cultural comparisons. When a test is biased towards a group, the scores for the group consistently underestimate or overestimate the true values. A test can be said to be biased towards a group when any given score obtained by an individual in that group does not have the same meaning as the very same score obtained by an individual in another group. The two groups in question might be from different racial groups, have different socioeconomic backgrounds, or be of different genders or any other biographical category of persons in the general population (Jensen, 1981).
Figure 1: Scree plot for the item-level factor analysis in respect of the total group