Construct equivalence of the OPQ32n for Black and White people in South Africa

Orientation: The construct equivalence of the Occupational Personality Questionnaire (OPQ32n) for black and white groups was investigated. Research purpose: The objective was to investigate the structural invariance of the OPQ32n for two South African population groups. Motivation for the study: The OPQ32n is often used for making a variety of personnel decisions in South Africa. Evidence regarding the suitability of personality questionnaires for use across South Africa’s various population groups is required by practitioners for selecting appropriate psychometric instruments. Research design, approach and method: Data were collected by means of a questionnaire and the results were analysed using quantitative statistical methods. The sample consisted of 248 Black and 476 White people from the SHL (South Africa) database. Structural equation modelling was used to examine the structural equivalence of the OPQ32n scale scores for these two groups. Main findings: A good fit regarding factor correlations and covariances on the 32 scales was obtained, partially supporting the structural equivalence of the questionnaire for the two groups. The analyses furthermore indicated that there was structural invariance, with the effect of the Social Desirability scale partialled out. Practical/managerial implications: The present study focused on aspects of structural equivalence only. The OPQ32n therefore passed the first hurdle in this particular context, but further investigation is necessary to provide evidence that the questionnaire is suitable for use in personnel decisions comparing the population groups. Contribution: Despite the positive findings with regard to structural equivalence and social desirability response style, it should be borne in mind that no assumptions regarding full scale equivalence can be made on the basis of the present findings.


INTRODUCTION
No practice in modern psychology has been assailed more than psychological testing, because test bias and fairness have become controversial topics internationally in the broader contexts of cultural and sexual bias (Gregory, 2007). As a result of the globalisation and migration of the workforce, the multicultural nature of populations has become more prominent in many countries worldwide, particularly during the past two decades. These phenomena pose challenges to the practice of psychological assessment (Van de Vijver & Rothmann, 2004). Anastasi and Urbina (1997) indicated that, internationally, the design of selection strategies for fair test use with cultural minorities has emerged as a new focal point. Decision models are being proposed that have the effect of selecting larger proportions of persons from lower-scoring groups (Cascio & Aguinis, 2005). Such decision models have as their goal that which is generally designated by terms such as 'affirmative action' or the reduction of 'adverse impact' in the selection process.
The cultural appropriateness of psychological tests and their usage were placed in the spotlight in South Africa with the promulgation of the Employment Equity Act No. 55 of 1998, specifically Section 8 (Republic of South Africa, 1998). Since the Act was promulgated, the issues of the culture fairness and test bias of psychological instruments have become points of continuous concern (Van de Vijver & Rothmann, 2004). Instead of resting with potential complainants, the onus of proof has shifted to psychologists using psychological instruments to prove that those instruments adhere to the regulations of the Employment Equity Act. The South African law requires psychologists to be proactively involved by providing evidence that tests are unbiased and can be used in a fair manner (Van de Vijver & Rothmann, 2004). Therefore, there is a need for measuring instruments that meet the specified requirements so that psychological tests can be used for all cultural and language groups in South Africa. One of the main goals of the assessment profession in South Africa is (and should be) to endeavour to align current practice with legal demands, through the development of new instruments and the validation of existing ones for use in the multicultural society (Foxcroft, 2004; Meiring, Van de Vijver, Rothmann & Barrick, 2005; Van de Vijver & Rothmann, 2004). Crocker and Algina (1986) referred back to the 1960s, when issues involved in using tests to select minority applicants for jobs began to receive attention. The possibility of bias in test scores was an issue for test developers and users only. Since then, these issues have begun to receive much more attention and the matter has become a burning issue within psychological testing.
Currently several documents exist that provide guidelines for assessing the psychometric soundness of psychological tests, such as the American Psychological Association (APA) Standards for educational and psychological tests (1999), the Society for Industrial and Organisational Psychology of South Africa (SIOPSA) Guidelines for the validation and use of assessment procedures for the workplace (2005), Psychological test use in South Africa (Mauer, 2002) and Applied psychology in human resource management (Cascio & Aguinis, 2005). From these sources it is clear that psychologists have to consider the indicators and guidelines and that every endeavour should be made to address scientifically the psychometric bias properties of tests and the fairness of the uses of tests. In a discussion of the APA standards, Huysamen (2002) pointed out the conceptualisation of construct validity as the primary objective in test validation. Mauer (2002) emphasised the possible juridical and professional consequences if psychometric requirements for tests are ignored. He also stressed that the procedures used in any form of adjudging, appraisal, assessment, evaluation, valuation, grading, ranking, classifying, categorising, placing, positioning or rating, insofar as it deals with employees, should be shown scientifically to be reliable, valid and unbiased. Again, the importance of establishing sound psychometric evidence is emphasised in this reminder.
Fairness, bias and equivalence
Gregory (2007) distinguished clearly between 'test fairness' and 'test bias', but pointed out that the two terms are often wrongly considered to be interchangeable. This is a common misconception, because test fairness is a broad concept that recognises the importance of social values in test usage (a values concept), whereas test bias refers to objective statistical indices that examine the patterning of test scores for relevant subpopulations (a statistical concept). Test developers can therefore control test bias, but they cannot control test fairness, because the fair use of tests and the decisions taken as a consequence of testing are in the hands of test users.
The various selection strategies for fair test use for addressing affirmative action or the reduction of adverse impact referred to above cannot be realised solely by producing unbiased tests. Although it is true that any form of bias, including lack of construct equivalence between groups, may, and probably will, result in discriminating personnel decision making, the converse unfortunately does not hold true. Sections 15 and 20(3) of the South African Employment Equity Act No. 55 of 1998 (Republic of South Africa, 1998) define affirmative action measures as the means employed to 'ensure the equitable representation of suitably qualified people from the designated groups'. Such measures call for selection decision models that are not dictated by the inherent psychometric properties of measuring instruments. By using meticulously constructed tests one therefore cannot ensure compliance with the goals of the Employment Equity Act. The judicious use of reliable, valid and unbiased tests is a necessary, but not sufficient, prerequisite for fairness in testing. Because the focus of the present paper is on a specific psychometric aspect of bias, namely construct equivalence, further reference to fairness or culture fairness in testing is avoided.
Cole and Moss (1989, p. 205) defined test bias as being present 'when a test score has meanings or implications for a relevant, definable subgroup of test takers that are different from the meanings or implications for the remainder of the test takers'. This definition of bias implies that test scores obtained for various subgroups of a given population cannot be interpreted in the same way across the groups. Cole and Moss (1989) proposed that bias is differential validity in the case of a given interpretation of test scores for specific subgroups of a population. Gregory (2007) agreed with this interpretation by equating test bias with differential validity. He distinguished between three different types of bias, namely bias in content validity, bias in predictive or criterion-related validity and bias in construct validity, when comparisons between specific subgroups of populations are being made.
To illustrate the existence of various theoretical viewpoints regarding test bias, the definition of Cascio and Aguinis (2005) for differential validity deserves mention. On considering test bias regarding employment decisions, they held a somewhat more restricted view on differential validity than that advocated by Cole and Moss (1989) and Gregory (2007). Cascio and Aguinis (2005) described differential validity as a form of test bias that is the result of differences in the magnitudes of the criterion-related validity coefficients for the various subgroups being compared. For a proper assessment of bias, they recommended that the possible presence of predictive bias (or differential prediction) should rather be investigated (Cascio & Aguinis, 2005). This entails an examination of possible differences in standard errors of estimate for the subgroups, and in the slopes and intercepts of the subgroups' regression lines, an approach also supported by Geisinger (1994).
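The regression-based check for predictive bias described above can be illustrated with a minimal sketch. The data, the group labels and the function name `fit_line` are hypothetical and are not taken from any of the cited studies. The sketch fits a separate least-squares line of criterion on test score for each subgroup and reports the slope, intercept and standard error of estimate so that they can be compared.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares fit of y on x.

    Returns (slope, intercept, standard_error_of_estimate).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    see = math.sqrt(sse / (n - 2))
    return slope, intercept, see


# Hypothetical test scores (x) and criterion ratings (y) for two subgroups.
group_a_x = [1.0, 2.0, 3.0, 4.0, 5.0]
group_a_y = [2.1, 3.9, 6.2, 7.8, 10.0]
group_b_x = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b_y = [3.0, 4.1, 4.9, 6.2, 6.8]

fit_a = fit_line(group_a_x, group_a_y)
fit_b = fit_line(group_b_x, group_b_y)
for label, (slope, intercept, see) in (("A", fit_a), ("B", fit_b)):
    print(f"group {label}: slope={slope:.2f} intercept={intercept:.2f} SEE={see:.2f}")
```

The sketch is descriptive only. A formal assessment of differential prediction would also test whether the observed differences between the subgroup lines are statistically significant.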
The present study deals specifically with bias in construct validity and it is acknowledged that construct validity is a broad concept. The definition offered by Reynolds appears to be logically acceptable, namely, bias with regard to construct validity exists when a test is shown to measure different hypothetical traits (psychological constructs) for one group than for another; that is, differing interpretations of a common performance are shown to be appropriate as a function of ethnicity, gender, or of another variable of interest (Reynolds, 1998, cited in Gregory, 2007, p. 274).
Essential criteria for the non-bias of a test that follow from this definition are that there should be an equal number of underlying factors for the various subgroups and that the item or subscale loadings should be similar for the population subgroups, that is, factorial invariance across the groups is required (Gregory, 2007).
Recent research by Poortinga, Van de Vijver and others (Poortinga, 1989; Van de Vijver & Leung, 1997; Van de Vijver & Leung, 2000; Van de Vijver & Poortinga, 1997; Van de Vijver & Tanzer, 1997) has suggested a taxonomy of bias and equivalence that provides a framework for examining bias that is more comprehensive and less simplistic than the approaches mentioned earlier. Van de Vijver and Tanzer (1997) and Van de Vijver and Leung (1997) noted that bias (or non-equivalence) is present when there are score differences between subgroups on the measurements of a particular construct (such as the items of a test) that do not correspond to differences between the subgroups in the underlying trait or ability. Bias is defined as the opposite of equivalence, although the term bias generally tends to refer to nuisance factors in cross-cultural comparisons between groups, whereas equivalence is generally associated with a hierarchy of measurement levels regarding cross-cultural score comparisons (Van de Vijver & Leung, 1997). Equivalence, therefore, indicates the measurement level at which the scores obtained for different groups can be compared.
Equivalence and bias are the fundamental concepts when comparisons between subgroups of populations or cross-cultural comparisons are made, because inferences based on biased (or non-equivalent) scores are invalid. Measuring instruments that are used for various cultural groups, such as those found in South Africa, should therefore be assessed in terms of bias and equivalence for score comparisons between the groups. It is important to note that the concepts bias and equivalence do not refer to properties inherent in any particular measuring instrument. These concepts deal with the characteristics of an instrument in a (specific) comparison between groups (such as groups from different cultures), rather than with the intrinsic properties of the measuring instrument (Van de Vijver & Tanzer, 1997).
Three kinds of bias are distinguished in the taxonomy, namely construct, method and item bias (differential item functioning) (Van de Vijver & Leung, 1997; Van de Vijver & Tanzer, 1997).
The definition for construct bias is similar to that proposed by Reynolds (1998, cited in Gregory, 2007) and occurs when the construct measured is not identical across the various subgroups being compared. A comprehensive evaluation of bias for a particular comparison requires an integrated and extensive examination of all aspects of bias. There are many procedures and statistical techniques that can be used for this purpose before claims can be made about a lack of all types of bias.
The hierarchy of three different levels of equivalence deals with the level of measurement implicit in any specific comparison between groups (Van de Vijver & Leung, 1997; Van de Vijver & Tanzer, 1997). Direct comparisons between the descriptive statistics of groups are in order only when the scores of the various groups are on the same measurement scale and when the same construct is measured in the groups. When using common psychometric tests in the employment domain across population groups, as is usually the case in South Africa, the overall goal is to use tests that yield directly comparable results.

Construct equivalence
At the bottom of the hierarchy we find the level of construct equivalence, also labelled structural invariance, structural equivalence or functional equivalence. Construct equivalence exists when the same construct is measured in the various groups being studied, whereas construct inequivalence occurs when an instrument measures different constructs in the groups, or when the measured construct overlaps only partially across the groups. Construct equivalence is often assessed by means of exploratory factor analysis with target rotation, by determining the similarity of exploratory factor analysis results by means of the coefficient of congruence, or by structural equation modelling. These equivalence concepts should be distinguished from the concept construct validity, which is the extent to which a measure shows a pattern of high correlations with measures that are expected to measure the same construct (convergent validity), as well as low correlations with measures of other constructs (discriminant validity). Construct validity may be assessed, inter alia, by means of examining patterns of correlations, as indicated before, by using the multitrait-multimethod approach, by experimental means, or by using exploratory and confirmatory factor analysis.
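To make the coefficient of congruence mentioned above concrete, the following minimal sketch computes Tucker's coefficient for one factor across two groups. The loadings and the function name `congruence` are hypothetical illustrations and are not drawn from the OPQ32n data.

```python
import math


def congruence(loadings_a, loadings_b):
    """Tucker's coefficient of congruence between two loading vectors."""
    if len(loadings_a) != len(loadings_b):
        raise ValueError("loading vectors must have the same length")
    cross = sum(a * b for a, b in zip(loadings_a, loadings_b))
    norm_a = math.sqrt(sum(a * a for a in loadings_a))
    norm_b = math.sqrt(sum(b * b for b in loadings_b))
    return cross / (norm_a * norm_b)


# Hypothetical loadings of five scales on the same factor in two groups.
factor_group_1 = [0.70, 0.65, 0.60, 0.10, 0.05]
factor_group_2 = [0.68, 0.60, 0.66, 0.15, 0.00]

phi = congruence(factor_group_1, factor_group_2)
print(f"congruence = {phi:.3f}")
```

Values close to 1 indicate that the pattern of loadings is nearly proportional in the two groups. The coefficient is insensitive to a uniform rescaling of the loadings in either group.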

Measurement unit equivalence
Measurement unit equivalence is the next level of equivalence and occurs when two or more measures have the same measurement unit, but might have different origins. An example cited most often is the measurement of temperature in which the Celsius or Kelvin scales are used, because the origins of these two scales differ by approximately 273 degrees, but one degree on the Celsius scale has the same meaning as one degree on the Kelvin scale. Direct score comparisons can be made only when the differences between the origins on the scales are known, a rare occurrence in psychological research (Van de Vijver & Leung, 1997).
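The temperature example can be sketched in a few lines. The readings below are hypothetical. The point illustrated is that within-group differences are comparable when the unit is shared, whereas raw scores become comparable across groups only once the known offset between the origins is applied.

```python
KELVIN_OFFSET = 273.15  # difference between the origins of the two scales


def celsius_to_kelvin(celsius):
    """Shift a Celsius reading to the Kelvin origin; the unit is unchanged."""
    return celsius + KELVIN_OFFSET


# Hypothetical readings: group 1 recorded in Celsius, group 2 in Kelvin.
group_1_celsius = [20.0, 25.0]
group_2_kelvin = [293.15, 303.15]

# Differences within each group are on the same unit and can be compared.
diff_1 = group_1_celsius[1] - group_1_celsius[0]
diff_2 = group_2_kelvin[1] - group_2_kelvin[0]

# Raw scores can be compared across groups only after the offset is applied.
group_1_kelvin = [celsius_to_kelvin(c) for c in group_1_celsius]
print(diff_1, diff_2, group_1_kelvin)
```

In psychological measurement the analogue of `KELVIN_OFFSET` is usually unknown, which is why this level of equivalence does not permit direct comparisons of raw scores.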

Full scale equivalence
Also referred to as scalar equivalence, full scale equivalence is found at the top of the hierarchy. It occurs when measures have the same measurement unit and the same origin. This level of equivalence allows direct comparisons across population subgroups or across cultures. It is important to note that full scale equivalence can be attained only when measurement is entirely bias free, that is, when there is no construct, method or item bias. For psychological variables, full scale equivalence cannot be proven directly. It has to be assessed indirectly by means of the available methods for studying bias. When the research question deals with the constructs measured in the comparison groups, construct equivalence is all that is required and this level of equivalence will not be affected by method or item bias (Van de Vijver & Leung, 1997). However, when the aim is to directly compare the means obtained in the groups or directly compare scores of individuals belonging to the various groups, such as for personnel decision making, full scale equivalence must be present. In the current study, the focus is on construct equivalence as a first step in the assessment of bias in applications of a particular measuring instrument.
In South Africa, with its multicultural society, it has long been recognised that testing poses special problems for test developers and users (Foxcroft, 2004; Van de Vijver & Rothmann, 2004; Wallis, 2004). There clearly is a dearth of evidence that indicates that tests being used across population groups are free from bias, because far too few studies have been published that investigated the possible presence of test bias in general or construct bias in particular. Of particular relevance here is that a test that does not measure what it proposes to measure across subgroups invalidates all inferences drawn from the test results (Wallis, 2004).

Personality testing and the OPQ32
There has been a substantial increase in the use of personality and related tests when hiring for a broad spectrum of jobs (Clevenger, Pereira, Weichmann, Schmitt & Harvey, 2001; Ones & Anderson, 2002; Saville & Willson, 1991). Recent surveys have indicated unequivocally that the use of personality tests is becoming increasingly popular among employers for personnel selection decisions (Ones & Anderson, 2002).
Personality tests are also used widely in South Africa, but establishing comparability across groups is vital in a country where people from a variety of cultural or demographic groups compete for job opportunities (Bedell, Van Eeden & Van Staden, 1999; Meiring et al., 2005). Yet few attempts have been made to test the comparability of results for different cultural groups (Van de Vijver & Rothmann, 2004). Van de Vijver and Rothmann (2004) concluded that much more research is needed on the equivalence and bias of assessment tools before psychology as a profession can live up to the demands implied in the Employment Equity Act.
A variety of factors can cause group differences in test scores, such as race, culture, socio-economic status, education, language and cognitive style (Meiring et al., 2005). Many tests that are used across South African population groups are, at present, administered in English only. Apart from other possible cultural nuisance variables, there is evidence that the level of proficiency in the English language affects performance in cognitive and personality tests (Abrahams & Mauer, 1999a; Claassen, 1993; Foxcroft & Aston, 2006; Koch, 2007; Owen, 1989; Van Eeden & Van Tonder, 1995; Van Eeden & Visser, 1992). Evidence of construct bias when tests have been administered in English only has been found by Meiring et al. (2005), Abrahams and Mauer (1999b) and Koch (2007). Construct bias also resulted when (mostly) Black students had to complete Schepers's Locus of Control Inventory in a second language, whereas (mostly) White students could complete the questionnaire in their mother tongue (Berg, Buys, Schaap & Olckers, 2004; Schaap, Buys & Olckers, 2003). The same data set was used for both studies.
Nevertheless, there also are examples of research in South Africa that reported construct equivalence across groups where tests were administered in English only. Schaap and Basson (2003) found evidence that the constructs measured by the PIB/SpEEx Motivation Index, namely internal locus of control and external locus of control, were equivalent for Black, Asian and White entry-level job applicants. Vorster, Olckers, Buys and Schaap (2005) investigated the equivalence of the structural model of the Job Diagnostic Survey and reported that the model held for Black and White groups. Only 22% of the respondents completed the questionnaire in their mother tongue (English) and it should be noted that approximately 69% of the White group did not complete the questionnaire in their mother tongue because a large percentage were Afrikaans speaking. It is not evident that the same results would have been obtained had the sample been split into first- and second-language groups.
In another study, Coetzer and Rothmann (2007) found evidence that supported the hypothesised dimensionality of the constructs burnout (as measured by the Maslach Burnout Inventory-General Survey) and work engagement (as measured by the Utrecht Work Engagement Scale) for two groups, one with English as home language and the other with Afrikaans, or an African language, as home language. They also found that construct equivalence existed for the two groups when certain items were deleted from the data. It should be noted that the majority (76%) of the Afrikaans/African group consisted of Afrikaans speakers, with the implication that the results did not provide convincing evidence for African-language speakers.
A number of studies have focused on the comparability of personality measures for different population groups in South Africa, with mixed results. For instance, Taylor and Boeyens (1990) found some support for construct equivalence for the South African Personality Questionnaire (SAPQ), but it was evident that the instrument suffered from item bias in their application of comparing the scores of White and Black respondents. Research by Van Eeden, Taylor and Du Toit (1996) and Abrahams (1997) indicated that two versions of the Sixteen Personality Factor Questionnaire (the 16PF5 and the 16PF, SA92) may not be suitable for individuals who do not have English as first language. Van Eeden and Prinsloo (1997) reported some degree of construct equivalence for the 16PF, SA92, but cautioned that there were differences between the factor loadings of the second-order factors for Black and White people. Heuchert, Parker, Stumpf and Myburgh (2000) investigated the construct equivalence of the NEO Personality Inventory-Revised (NEO PI-R) and found a clear five-factor structure for Black and White students that conformed to the five-factor model (FFM) of personality. In another context, Taylor (2000) found that the openness factor could not be extracted for Black employees, whereas the factor structure found for White employees was in line with expectations regarding the FFM.
The most extensive South African bias study to date was conducted by Meiring et al. (2005) using a sample of 13 681 applicants from 12 different cultural groups for entry-level jobs in the South African Police Service. One of the measuring instruments included in the study was the 15FQ+ Personality Questionnaire, which was developed for use in industrial and organisational settings. The alpha coefficients for some of the factors were exceptionally low, particularly for the Black language groups. Furthermore, exploratory factor analysis with target rotation to a pooled solution of 15FQ+ factors yielded poor agreement with the factors of the Ndebele, White, Indian and Coloured groups, thereby indicating structural or construct inequivalence. In addition, significant item bias was found for many items, although a medium effect size was obtained for one item only. Meiring et al. (2005) also found that neither removal of the biased items nor cognitive/English-language ability or social desirability affected the magnitude of the cross-cultural differences observed.
The present study is yet another attempt to investigate the structural equivalence across population groups of a personality questionnaire in a South African context. In this instance we focused on a personality questionnaire that is currently being used extensively in South African organisations, namely the Occupational Personality Questionnaire (version OPQ32n), because no research results on this issue have been published regarding the OPQ32n. The main aim with the development of the OPQ32 was to provide an instrument that would give a comprehensive, detailed description of personality likely to be relevant in occupational contexts for the selection, development and counselling of predominantly managerial-level staff.
The OPQ32 is based on an occupational model of personality that describes 32 dimensions or scales of individuals' preferred or typical styles of behaviour at work. In addition, it includes a Social Desirability scale. The model consists of three domains, (1) relationships with people, (2) thinking style and (3) feeling and emotions. The three domains are joined by a potential fourth, the dynamism domain, which relates to sources of energy (OPQ32 Technical Manual, 2006). There are two questionnaires for measuring personality using the above model, namely the OPQ32n (normative) and OPQ32i (ipsative).
With regard to the comparability of the OPQ for different population groups, it was found, in a study conducted in the United Kingdom (UK), that the questionnaire's internal consistency reliabilities for a combined sample of Black and Asian respondents were lower than those for a White sample (OPQ32 Technical Manual, 2006). Furthermore, it was found that, for a sample from the general population, an analysis of background information showed that there was a higher proportion of the ethnic minority sample with poor education than in the White group, possibly resulting in less accurate responses. The mean reliability for the Black and Asian sample was equal to 0.70.
When the OPQ32 mean scale scores of White and minority ethnic groups were compared in the UK, only nine of the mean differences reached statistical significance. The largest of these differences was on 'achieving', with a medium effect size (d = 0.43). These results were ascribed to the occupational relevance of the OPQ32 content, together with the straightforward way in which items are phrased. This means that people from different demographic backgrounds were able to relate to the questionnaire in a similar manner (OPQ32 Technical Manual, 2006). It was earlier argued that such results may also not necessarily be obtained in relation to the various South African population groups.
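For readers unfamiliar with the effect size d reported above, the following minimal sketch shows one common way of computing a standardised mean difference. The text does not state which formula the technical manual used, so the pooled standard deviation version is an assumption here, and the scores and the function name `cohens_d` are hypothetical.

```python
import math


def cohens_d(sample_a, sample_b):
    """Standardised mean difference using the pooled standard deviation."""
    n_a, n_b = len(sample_a), len(sample_b)
    mean_a = sum(sample_a) / n_a
    mean_b = sum(sample_b) / n_b
    var_a = sum((x - mean_a) ** 2 for x in sample_a) / (n_a - 1)
    var_b = sum((x - mean_b) ** 2 for x in sample_b) / (n_b - 1)
    pooled_sd = math.sqrt(
        ((n_a - 1) * var_a + (n_b - 1) * var_b) / (n_a + n_b - 2)
    )
    return (mean_a - mean_b) / pooled_sd


# Hypothetical scale scores for two small groups.
scores_group_1 = [2.0, 4.0, 6.0]
scores_group_2 = [1.0, 3.0, 5.0]

d = cohens_d(scores_group_1, scores_group_2)
print(f"d = {d:.2f}")
```

A mean difference expressed in this way says nothing about its source; it may reflect a true group difference or a lack of equivalence.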
It is clear that language of administration, race and culture may be among the main factors impacting on the construct comparability of personality tests and that these factors are particularly salient in contemporary South Africa.
The objective of the present study, therefore, was to investigate the structural invariance of the OPQ32n for two South African population groups. It was also decided to examine differences in OPQ32n scale scores between Black and White demographic groups and to establish whether these were likely to arise from a lack of construct equivalence between the two groups.

RESEARCH DESIGN
Research approach
Data were collected by means of a questionnaire and the results were analysed using quantitative statistical methods.

Research method
Research participants
The data were collected from various South African companies using the OPQ32 normative version (OPQ32n) for the selection and development of their personnel.

Measuring instrument
All the respondents had completed the normative version of the OPQ32 at the request of their respective organisations and the questionnaires were scored by SHL South Africa. This OPQ32 version was chosen because it is often used in developmental and counselling applications in the industry and in practice. Furthermore, the results for the ipsative version, where forced choices have to be made, are not suitable for factor analysis (Baron, 1996; Dunlap & Cornell, 1994; Johnson, Wood & Blinkhorn, 1988; Kerlinger & Lee, 2000; Visser & Du Toit, 2004).
As recommended by SHL, the ipsative version is the version of the OPQ32 used most frequently, particularly for selection, because of the hypothesis that socially desirable responding will bias individual responses.
On some of the 32 scales, a high score is indicative of a positive outcome for the scale, whereas a low score indicates a favourable description within parameters of the work context on other scales.A specific personality style is not, in itself, good or bad, but appropriate or inappropriate depending on the circumstances.
In South Africa, SHL makes norms available for a total population of South Africans, but not for separate population groups. The internal consistency reliabilities for the scales ranged from 0.65 to 0.87 (median = 0.79) for a general population sample (N = 2028) in the UK (OPQ32 Technical Manual, 2006). Test-retest reliability was established in the UK using a sample of 107 undergraduates at various higher education institutions (OPQ32 Technical Manual, 2006). After one month, the reliabilities ranged from 0.64 to 0.91, with a median of 0.79. No test-retest studies have been done in South Africa.
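Internal consistency reliabilities of the kind quoted above are commonly reported as Cronbach's alpha. The text does not name the coefficient used in the technical manual, so treating it as alpha is an assumption. The sketch below computes alpha for a hypothetical three-item scale answered by five respondents; the data and function names are illustrative only.

```python
def variance(values):
    """Unbiased sample variance."""
    n = len(values)
    mean = sum(values) / n
    return sum((v - mean) ** 2 for v in values) / (n - 1)


def cronbach_alpha(item_scores):
    """Cronbach's alpha.

    item_scores is a list of respondents, each a list of item scores.
    """
    k = len(item_scores[0])
    item_columns = list(zip(*item_scores))
    item_variance_sum = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in item_scores])
    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


# Hypothetical responses: five respondents, three items of one scale.
responses = [
    [3, 4, 3],
    [4, 4, 5],
    [2, 3, 2],
    [5, 4, 4],
    [1, 2, 2],
]

alpha = cronbach_alpha(responses)
print(f"alpha = {alpha:.2f}")
```

Alpha rises as the items covary more strongly relative to their individual variances, which is why it is read as an index of the internal consistency of a scale.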
In terms of construct validity, in the UK it was found that the scale intercorrelations for the OPQ32n ranged from -0.51 to 0.56, with two-thirds of the correlations falling between -0.2 and 0.2 (OPQ32 Technical Manual, 2006). This suggests a relatively high degree of independence for most of the scales, despite the large number of narrow scales included. Seventy-seven per cent of the OPQ32n scale pairs shared less than 10% common variance, but there were some pairs of scales that were highly related. The OPQ32n was also subjected to exploratory factor analysis, and principal components extraction followed by varimax rotation gave the clearest results. Six factors were extracted in four different data sets (two from the United Kingdom and one each from the United States and South Africa), explaining 51% to 53% of the total variance in the respective data sets. In interpreting these factors, comparisons were made with the 'Big Five' model of personality of McCrae and Costa (1987). Five of the derived clusters of dimensions, namely extraversion, agreeableness, conscientiousness, neuroticism and openness to experience, clearly represented typical Big Five descriptions (OPQ32 Technical Manual, 2006). The sixth dimension was not consistent across the samples, but in the South African sample it related to adaptability. In another South African study, Visser and Du Toit (2004) obtained a six-factor solution that included the Big Five factors plus a factor labelled as Interpersonal Relationship Harmony, which was likened to the concept of ubuntu.
The criterion validity of the OPQ32 has been verified in many studies in the UK and elsewhere (OPQ32 Technical Manual, 2006). In these studies, OPQ32 results were correlated with indicators of performance of various kinds, generally managers' ratings of competence. With total sample sizes exceeding 6000 for an earlier version of the OPQ and 2500 for the OPQ32, they provide a robust body of evidence to support the occupational use of the OPQ32 questionnaire because the patterns of relationships found in the studies provided strong support for the criterion validity of the OPQ32 (OPQ32 Technical Manual, 2006). The Social Desirability scale of the OPQ32n measures the extent to which a person is more/less self-critical in responses and more/less concerned with making a good impression. Socially desirable responding has been shown to be more prevalent among Black than White populations in UK and USA standardisation samples (OPQ32 Technical Manual, 2006). However, in a South African study, Visser (2002) found that Black and White groups did not differ statistically significantly with regard to their scores on the Social Desirability scale of the OPQ 5.2 Concept Model, an earlier version of the currently used instrument. On the basis of these conflicting results, it was decided to test for structural invariance with and without the effect of Social Desirability partialled out.
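What it means to partial out a third variable can be shown with a first-order partial correlation. This section does not describe how the partialling was implemented in the analysis, so the sketch below is only an illustration of the general idea. The data are hypothetical and constructed so that two scale scores (x and y) correlate only because both depend on a third score (z), which here stands in for Social Desirability.

```python
import math


def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_correlation(x, y, z):
    """Correlation between x and y with the linear effect of z removed."""
    r_xy = pearson(x, y)
    r_xz = pearson(x, z)
    r_yz = pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical scores: x and y each equal z plus unrelated noise.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x = [2.0, 1.0, 3.0, 4.0, 4.0, 7.0]
y = [2.0, 3.0, 1.0, 2.0, 6.0, 7.0]

r_raw = pearson(x, y)
r_partial = partial_correlation(x, y, z)
print(f"raw r = {r_raw:.2f}, partial r = {r_partial:.2f}")
```

Applying the same formula to every pair of scales yields a correlation matrix from which the linear effect of the third variable has been removed.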

Research procedure
The OPQ32n was administered in a number of South African companies, either as the paper-and-pencil version or online. Psychometrists or trained OPQ32 staff administered the questionnaires. All responses were captured on an SHL South Africa database.
The data required for the analyses for the current study were extracted from this database.

Statistical analysis
In this section, the rationale for, and procedure of, the structural equation modelling used in the current study are explained. The OPQ32n is focused on measuring multiple narrow traits that are important for certain domains of interest (such as job competencies), rather than broad personality factors. Factor analytic research in which the Big Five personality factors are extracted from the OPQ typically explains only approximately 50% of the variance (OPQ32 Technical Manual, 2006; Visser & Du Toit, 2004). Conceptually, only 25 out of 32 traits of the OPQ are related to the Big Five (OPQ32 Technical Manual, 2006). This implies that a higher-order factor model fits the data poorly. There is no merit in comparing a structure across cultures that does not fit in the reference group in the first place, because a good fit for one group has to be achieved first before doing multi-group analyses (Byrne, 2006). Therefore, it would be meaningless to follow the commonly used procedure of establishing factorial invariance by comparing the factor structures between the groups in this study. Instead, the question that we wished to answer was whether the 32 narrow constructs are equivalent in the two groups, which we examined by comparing their nomological networks (theoretical concepts and their relationships with the other constructs). Such networks are represented in the trait correlation matrix (Clark & Watson, 1995; Cronbach & Meehl, 1955). Comparing correlation matrices directly provides a suitable global test of equivalence in cases in which there is no predefined factor structure to be fitted to the measured constructs (Bentler, 2005). Furthermore, the trait correlation matrix provides all the necessary information for factor analysis and its equivalence therefore serves as a necessary condition for the equivalence of higher-order factor structures.
Since the statistical theory is based on covariance matrices, a special set-up procedure is required to model the correlation matrices correctly in EQS. Bentler described the global test of equivalence for correlation matrices as follows:

Let $Y_1$ be the vector of observed variables in the first group, and let the population covariance be $\Sigma_1$. The population correlation matrix is $P_1$ and $\Sigma_1 = D_1 P_1 D_1$, where $D_1$ is the diagonal matrix of standard deviations of the variables. Thus if we present $Y_1$ as $Y_1 = D_1 X_1$, then it is apparent that the covariance matrix of $X_1$ is $P_1$. (2005, p. 152)

The logic of the EQS model is, therefore, to present the observed variables $Y_1$ (unstandardised OPQ scale scores) through dummy factors $X_1$ ($Y_1 = D_1 X_1$), with factor loadings equal to the scales' observed standard deviations ($D_1$) and no measurement error. Dummy factors $X_1$ are standardised so that their variances are 1 and their covariance matrix is therefore a correlation matrix.

There are as many such equations as variables and the factor loadings are estimated freely. Each of the dummy factors' variances has to be fixed to 1 in the model for their covariance matrix to become the correlation matrix. The covariances of the factors are free parameters and, corresponding to the off-diagonal elements of $P_1$, they are correlations between the observed variables.

The same type of set-up applies to the second group, where $\Sigma_2 = D_2 P_2 D_2$. The comparison then consists of the following two steps:

• To evaluate the first hypothesis, that $P_1 = P_2$ (i.e. that the correlation matrices are equal), cross-group constraints have to be made on the covariances of the dummy factors, but since their variances are constrained to unity, these are in fact equality constraints on correlations. In this first step, the diagonal matrices $D_1$ and $D_2$ (i.e. the factor loadings representing the observed scale scores' standard deviations) are not constrained to be equal across groups. Only trait correlations are constrained.

• The second hypothesis is that the scales' standard deviations are also equal across samples and, thus, not just the correlation matrices, but also the covariance matrices, are equal ($\Sigma_1 = \Sigma_2$). This is a stronger hypothesis and requires constraining the factor loadings representing the scales' standard deviations ($D_1$ and $D_2$) to be equal across the groups, in addition to the constraints set in the previous step.
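The decomposition underlying both steps can be illustrated numerically. The following is a minimal sketch in Python, not the EQS set-up itself; the two-scale matrices and the function names are hypothetical and were chosen only for illustration. It shows two groups whose correlation matrices are equal (the first hypothesis holds) while their covariance matrices differ because one standard deviation differs (the second hypothesis fails).

```python
import math

def split_covariance(cov):
    """Return (sds, corr) such that cov[i][j] == sds[i] * corr[i][j] * sds[j]."""
    size = len(cov)
    sds = [math.sqrt(cov[i][i]) for i in range(size)]
    corr = [[cov[i][j] / (sds[i] * sds[j]) for j in range(size)]
            for i in range(size)]
    return sds, corr

def rebuild_covariance(sds, corr):
    """Compute Sigma = D P D, with D the diagonal matrix of standard deviations."""
    size = len(sds)
    return [[sds[i] * corr[i][j] * sds[j] for j in range(size)]
            for i in range(size)]

# Hypothetical two-scale example: same correlations, different spread.
corr_shared = [[1.0, 0.3], [0.3, 1.0]]
cov_group1 = rebuild_covariance([4.0, 5.0], corr_shared)
cov_group2 = rebuild_covariance([6.0, 5.0], corr_shared)

sds1, corr1 = split_covariance(cov_group1)
sds2, corr2 = split_covariance(cov_group2)

same_correlations = all(
    math.isclose(corr1[i][j], corr2[i][j]) for i in range(2) for j in range(2)
)
same_covariances = all(
    math.isclose(cov_group1[i][j], cov_group2[i][j])
    for i in range(2) for j in range(2)
)
```

In this constructed case `same_correlations` is true and `same_covariances` is false, which is the situation in which the first step would fit well and the second would not.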
These two steps were also repeated with the effect of the Social Desirability scale partialled out by computing partial correlations for every intercorrelation between the 32 OPQ32n scales. The comparisons were performed using the structural equation modelling software EQS Version 6.1 for Windows (Bentler, 1985-2005; Byrne, 2006).
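A first-order partial correlation of the kind described here can be computed from three zero-order correlations. The sketch below assumes the standard textbook formula; the function name and the input values are hypothetical and are not taken from the study's data.

```python
import math

def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation between x and y with z partialled out of both."""
    numerator = r_xy - r_xz * r_yz
    denominator = math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    return numerator / denominator

# Hypothetical values: two scales correlate 0.30 with each other and
# 0.10 and 0.20 respectively with the Social Desirability scale (z).
adjusted = partial_correlation(0.30, 0.10, 0.20)
```

When the partialled-out scale correlates only weakly with both scales, as in this made-up example, the adjusted value stays close to the zero-order correlation.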

RESULTS
The first step of the analyses entailed computing the means, standard deviations and internal consistency reliabilities of the various OPQ32n scales for the Black and White groups separately and for the total group. Internal consistency was assessed in two ways, namely by computing coefficient alphas and mean inter-item correlations (Clark & Watson, 1995). Subsequently, the magnitude of the differences between the means of the Black and White groups on the various OPQ32n scales was assessed. The d statistic, which is calculated by standardising the raw effect size as expressed in the measurement unit of the variables by dividing it by the pooled standard deviation of the two groups, was used for this purpose (Cohen, 1988). This statistic therefore expresses score distances in units of variability and is an estimation of the effect size index. The results of these calculations are presented in Table 2.
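The d statistic described above can be sketched as follows. This is a minimal illustration assuming the usual pooled standard deviation weighted by degrees of freedom; the means and standard deviations are hypothetical, and only the group sizes (248 and 476) come from the study. The sign of d depends on which group is entered first, which is an arbitrary choice in this sketch.

```python
import math

def cohens_d(mean_1, sd_1, n_1, mean_2, sd_2, n_2):
    """Mean difference divided by the pooled standard deviation."""
    pooled_variance = (
        (n_1 - 1) * sd_1 ** 2 + (n_2 - 1) * sd_2 ** 2
    ) / (n_1 + n_2 - 2)
    return (mean_1 - mean_2) / math.sqrt(pooled_variance)

# Hypothetical scale: the groups differ by one raw-score point and
# both have a standard deviation of 2, so d is half a standard deviation.
d_example = cohens_d(10.0, 2.0, 248, 9.0, 2.0, 476)  # 0.5
```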
For the Black sample, the Cronbach alpha values ranged from 0.57 for Conscientiousness and 0.59 for Variety Seeking to 0.85 for Rule Following and Worrying. It is evident that there were only two alpha values marginally lower than 0.60, which is regarded by some as a lower limit for acceptability for internal consistency reliabilities for personality scales in basic and applied research (Clark & Watson, 1995). However, Nunnally (1978) has advocated that 0.70 be regarded as the lower limit during the early stages of research. In total, only eight scales yielded alphas lower than 0.70 for the Black group.
For the White group the lowest alpha of 0.71 was obtained for Independent Minded, whereas the highest value of 0.91 was obtained for Tough Minded. For the total group, the alphas ranged from 0.72 for Independent Minded to 0.88 for Worrying and Rule Following. The mean alpha for the Black group on the 32 OPQ scales was 0.74, whereas the mean alpha for the White group was 0.84. For Social Desirability the alphas were 0.66 for the Black group, 0.66 for the White group and 0.68 for the total group. The mean inter-item correlations per scale for the Black group varied between 0.15 and 0.48, whereas those for the White group varied from 0.31 to 0.62.

The intercorrelations between the 32 OPQ32n scales were computed for each group separately. These two 32 × 32 tables are too large to reproduce here, but are available to interested readers upon request. In addition, the mean intercorrelation coefficient between the 32 OPQ32n scales was computed for each group separately, using absolute values and excluding the main diagonal from the averaging. For the Black group the mean intercorrelation was equal to 0.185 (SD = 0.12) and a very similar result was obtained for the White group, namely 0.184 (SD = 0.13).
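The averaging of absolute intercorrelations described above can be sketched as follows. The three-scale matrix and the function name are hypothetical, and the use of the sample (n - 1) standard deviation is an assumption, since the text does not state which form was used. With 32 scales the same loop would run over 32 × 31 / 2 = 496 distinct scale pairs.

```python
import statistics

def offdiagonal_abs_summary(corr):
    """Mean and SD of the absolute correlations above the main diagonal."""
    size = len(corr)
    values = [abs(corr[i][j]) for i in range(size) for j in range(i + 1, size)]
    return statistics.mean(values), statistics.stdev(values)

# Hypothetical correlation matrix for three scales.
corr_example = [
    [1.0, 0.1, -0.2],
    [0.1, 1.0, 0.3],
    [-0.2, 0.3, 1.0],
]
mean_abs_r, sd_abs_r = offdiagonal_abs_summary(corr_example)
```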
The d statistics for comparing the means of the Black and White groups on the OPQ32n scales varied from negligible (d = 0.00) to values representing moderate effect sizes (d = 0.54 for Data Rational, d = -0.61 for Decisive and d = 0.50 for Social Desirability). Apart from these three moderate effect sizes and three more scales approaching the value of 0.50 (d = -0.41 for Modest, d = 0.42 for Tough Minded, d = 0.45 for Outspoken), altogether 14 of the 32 scales yielded small effect sizes, with the remainder being smaller than 0.20.
The second step of the analyses dealt with conducting a global test of the equality of the covariance matrices of the Black (n = 248) and White (n = 476) groups to investigate the structural equivalence of the OPQ32n for the two groups. The procedure followed was explained in the statistical analysis section. The null hypothesis that $\Sigma_{\text{Black}} = \Sigma_{\text{White}}$, where $\Sigma_g$ is the population variance-covariance matrix of group $g$, was tested. Because the exact equality of covariance matrices is hard to verify in large samples (Bentler, 1985-2005), the null hypothesis that the correlation matrices of the Black and White groups are equal was also tested as an initial step. This hypothesis implies that the correlation matrices of the measured variables are the same, although the covariance matrices may differ between the groups due to variables not having equal variances. This was achieved by fixing the variances of the dummy factors in the model at one, so that the covariances of the dummy factors were then equal to the correlations between the variables.
Firstly, we fitted a model with 32 latent variables, each represented by a single indicator (the observed scale score). In this model, the error variances were fixed to zero, the factor variances were all fixed to unity, and the factor loadings that represented the standard deviations of the observed scores were free, as were the covariances between the factors. This model was then fitted in a multigroup analysis, with all covariances constrained to be equal, thereby testing the equality of the factor correlations.
Thereafter we followed the same procedure, but the 32 factor loadings, which had previously been left unconstrained between the samples, were now also constrained to be equal between the two samples in the multigroup analysis, thus producing a stronger hypothesis. This model tested the equality of the covariance matrices, because the equality of the variances of the observed variables was also tested.
The comparisons were performed using the structural equation modelling software EQS Version 6.1 for Windows (Bentler, 1985-2005; Byrne, 2006). Structural equivalence was therefore tested by establishing whether the patterns of scale intercorrelations (and/or covariances) were equivalent.
In summary, four separate analyses of covariance structures using maximum likelihood estimation were conducted and, in the hypothesised models, the latent variables were allowed to correlate with one another. Being the larger sample, the data of the White group were used to represent the hypothesised model. The steps undertaken were as follows: firstly, the groups were compared with regard to their correlation matrices only and, secondly, they were compared with regard to their covariance matrices on the 32 scales. These analyses were then repeated with the effect of the Social Desirability scale partialled out by computing partial correlations for every intercorrelation between the 32 OPQ32n scales.
Before carrying out these analyses, the data were inspected to establish whether the assumption of the multivariate normality of the data, on which the maximum likelihood method is based, held true for the two samples. Violation of this assumption may render the model chi-square test invalid, such that alternative estimation methods may have to be employed (Byrne, 2006).
The EQS structural equation modelling software was the first to introduce a correction for the chi-square statistic developed by Satorra and Bentler (1988) as the so-called 'robust' alternative to conventional maximum likelihood estimation. The robust option should be used whenever distributional assumptions are violated. It provides output statistics, such as the Satorra-Bentler scaled model chi-square, and robust versions of some other fit statistics.
We found clear evidence of deviation from multivariate normality in the data, because the sample statistics for both samples yielded several significant non-zero univariate kurtoses. In addition, the normalised estimate of Mardia's coefficient was equal to 27.89 for the White group and 26.63 for the Black group. Both values are substantially larger than 5, the cut-off beyond which data should be regarded as non-normal (Bentler, 2005). Consequently, the robust method, which requires raw data for its computation, was the desired option in the current study and was carried out on the initial data set. However, the input data for the steps in which social desirability was partialled out consisted of partial correlations only, meaning that robust statistics could not be computed in these instances. Where possible, we report robust results, but we also report the conventional maximum likelihood results for comparative purposes.
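For readers unfamiliar with the statistic, a normalised estimate of Mardia's multivariate kurtosis can be sketched as follows. This is one common textbook form, offered as an assumption; the exact computation in EQS may include adjustments not shown here. The helper names and the four-point data set are hypothetical, and the covariance matrix is assumed to be non-singular.

```python
import math

def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    size = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * size
    for i in range(size - 1, -1, -1):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, size))
        x[i] = (aug[i][size] - tail) / aug[i][i]
    return x

def mardia_kurtosis(data):
    """Return (b2p, normalised estimate) for a list of observation rows."""
    n, p = len(data), len(data[0])
    means = [sum(row[j] for row in data) / n for j in range(p)]
    centred = [[row[j] - means[j] for j in range(p)] for row in data]
    # Maximum likelihood covariance matrix (divisor n).
    cov = [[sum(r[i] * r[j] for r in centred) / n for j in range(p)]
           for i in range(p)]
    # Mean of the squared Mahalanobis distances, each squared again.
    b2p = sum(
        sum(d * z for d, z in zip(r, solve(cov, r))) ** 2 for r in centred
    ) / n
    normalised = (b2p - p * (p + 2)) / math.sqrt(8 * p * (p + 2) / n)
    return b2p, normalised

# Hypothetical data: the four corners of a square (a very flat distribution).
b2p_example, normalised_example = mardia_kurtosis(
    [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
)
```

Under multivariate normality the expected value of the unnormalised coefficient is p(p + 2), so the normalised estimate should lie near zero; large positive values such as those reported above point to heavier tails than a normal distribution would produce.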
The model fit indices that were used were the model or likelihood-ratio chi-square, normed chi-square (χ²/df), root mean square error of approximation (RMSEA), including its 90% confidence intervals, comparative fit index (CFI) and standardised root mean square residual (SRMR). The results of the structural equation modelling indicate that the null hypotheses of identical covariance matrices for the four separate analyses cannot be rejected, because all of the fit indices, with the exception of the significant model chi-squares, indicated good fit or closely approached well-fitting models.
The goodness-of-fit indices are reported in Table 3.
In every case, the statistically significant model chi-square values were the only goodness-of-fit values that consistently did not meet the accepted levels indicative of good model fit.
The implication of significant model chi-square values is that hypotheses of identical correlation matrices should be rejected for the four separate analyses. However, this result is obtained often in research and may usually be ascribed to the large size of the sample and/or the lack of model fit (Byrne, 2006; Tabachnik & Fidell, 2001). In the present study, the sample sizes for both groups were substantially larger than sample sizes of 100 to 200, which are regarded as the likely n for obtaining non-significant chi-square statistics (Hair, Anderson, Tatham & Black, 1998).
Because large samples are required for obtaining precise parameter estimates in the analysis of covariance structures, model chi-square values are regarded as unrealistic criteria on which to base decisions regarding model fit. Nevertheless, both a lack of model fit and the large samples remain viable explanations for the results.
The remainder of the goodness-of-fit indices represent a positive picture of very good fitting models. The chi-square/degrees of freedom ratios were smaller than 2 (the limit recommended by Hair et al., 1998) in every instance. Also, according to the limits recommended by Carmines and McIver (1981), these comparisons were indicative of good fit. The RMSEA values had magnitudes representing good fit, because they were smaller than 0.05 (Browne & Cudeck, 1993; Hu & Bentler, 1999).
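The two indices discussed above can be illustrated with a short sketch. It assumes the single-group RMSEA point estimate with N - 1 in the denominator; multigroup software may rescale this value, so the output should not be expected to reproduce EQS exactly. The chi-square value is hypothetical. The degrees of freedom are an inference from the set-up described in the statistical analysis section (one constraint per pair of scales, plus one per scale when the standard deviations are also constrained) and are not figures quoted from the text.

```python
import math

def expected_df(n_scales, constrain_sds):
    """Number of cross-group equality constraints in the global test."""
    pairs = n_scales * (n_scales - 1) // 2
    return pairs + n_scales if constrain_sds else pairs

def normed_chi_square(chi_square, df):
    """Chi-square divided by its degrees of freedom."""
    return chi_square / df

def rmsea(chi_square, df, n_total):
    """Single-group RMSEA point estimate (an assumed, simplified form)."""
    return math.sqrt(max(chi_square - df, 0.0) / (df * (n_total - 1)))

df_correlations = expected_df(32, constrain_sds=False)  # 496
df_covariances = expected_df(32, constrain_sds=True)    # 528

# Hypothetical chi-square for the correlation comparison, N = 248 + 476.
ratio_example = normed_chi_square(744.0, df_correlations)  # 1.5
rmsea_example = rmsea(744.0, df_correlations, 724)
```

A chi-square equal to its degrees of freedom gives a ratio of 1 and an RMSEA of 0, which is why both indices are read as distance from a perfectly fitting model.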
In every instance, the 90% confidence intervals were narrow, so that none of the upper values exceeded 0.06 (Hu & Bentler, 1999). In addition, the CFI values were higher than the value of 0.95 that was recommended by Bentler (1990) and Hu and Bentler (1999) as indicative of good model fit, with the exception of the CFIs for the comparison of covariances, where the values of 0.942, 0.941 and 0.943 fell just below the cut-off for well-fitting models. Surprisingly, the use of the robust option generally yielded only marginally better results than the conventional maximum likelihood method, with the largest improvement being for the correlation matrices, where the robust RMSEA value of 0.033 represented an improvement of 0.008 on 0.041.
Finally, the SRMR magnitudes were less than 0.08, which is suggested as the maximum level for acceptance of good fit, in the case of the comparisons of the correlations (Hu & Bentler, 1999). Once again, the comparison of the covariances marginally missed the criterion for acceptance. The SRMR summarises the standardised residuals as the square root of their mean square, and thus reflects the average discrepancy between the correlation matrices of the two groups. From these results, it appears that there is substantial support for the goal of the current study, namely to demonstrate a satisfactory level of structural invariance when Black and White groups are being compared, because support was found for the invariance of the correlation matrices. With regard to the comparison of the covariance matrices, less well-fitting results were obtained in the case of two of the model fit indices. Furthermore, the goodness-of-fit indices remained approximately constant when the correlation matrices were being compared with and without removing the effect of social desirability.

DISCUSSION
The results of this study indicate that the internal consistency reliabilities of the OPQ32n scales are acceptable for the two different groups for basic and applied research, although the mean alpha for the Black group was substantially lower than that for the White group. When compared to findings in the UK, it is evident that the lowest and highest alphas obtained by the Black group were somewhat lower than those reported in the UK, but that the White group obtained substantially higher alpha values than the sample from the general population in the UK (OPQ32 Technical Manual, 2006). The Social Desirability scale yielded an alpha value of 0.66 for both the Black and the White group, which is higher than the value of 0.63 reported in the UK study. Overall, the alpha coefficients for the various scales were acceptable for the total sample, because the lowest value (0.68) was obtained for Social Desirability and the highest for Rule Following and Worrying (0.88). The reliability results obtained here were similar to those of another South African study (SHL South Africa, 2002).

Clark and Watson (1995) cautioned against over-reliance on coefficient alpha to assess the extent of the internal consistency of a measuring instrument. They regarded indexes of internal consistency, such as alpha, as ambiguous because their magnitudes depend on the number of test items as well as the mean intercorrelation between the items. Because the number of items is irrelevant to internal consistency as such, coefficient alpha is of limited use as an index of it. As a solution, Clark and Watson (1995) recommended that the mean inter-item correlation per scale be used as the measure of internal consistency. They suggested that mean inter-item correlations should fall in the range 0.15-0.50, depending on the nature of the constructs. In the current study, the mean inter-item correlations for the Black group fell into the recommended range, whereas for the White group the corresponding correlations approximated 0.60 for scales with alphas of around 0.90. Clark and Watson (1995) pointed out that correlations that are too high indicate that an instrument measures overly narrow constructs, often at the expense of validity.
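The dependence of alpha on scale length that Clark and Watson (1995) describe can be made concrete with the standardised alpha (Spearman-Brown) formula, which expresses alpha in terms of the number of items and the mean inter-item correlation. The sketch below is illustrative only: the item counts of 8 and 20 are hypothetical and are not the actual lengths of the OPQ32n scales.

```python
def standardised_alpha(n_items, mean_inter_item_r):
    """Standardised alpha from scale length and mean inter-item correlation."""
    return (n_items * mean_inter_item_r) / (
        1 + (n_items - 1) * mean_inter_item_r
    )

# Same hypothetical scale length, different item homogeneity.
alpha_low_r = standardised_alpha(8, 0.15)   # about 0.59
alpha_high_r = standardised_alpha(8, 0.60)  # about 0.92

# Same homogeneity as alpha_low_r, but a longer hypothetical scale.
alpha_longer_scale = standardised_alpha(20, 0.15)  # about 0.78
```

The last value shows why alpha alone is ambiguous: lengthening the scale raises alpha although the items are no more strongly interrelated than before.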
The reliability findings reported in the context of the current study are markedly higher than those reported by Meiring et al. (2005). A major obstacle in their study of bias in the 15FQ+, a personality questionnaire that was developed specifically for use in the workplace, was that the alphas for the Black language groups in particular were very weak, in some cases as low as 0.20. In their research, the magnitudes of the obtained reliability coefficients were so low that they probably affected the obtained research findings.
With regard to comparisons between the means of the Black and White groups on the OPQ scales, several statistically significant differences were obtained, but most of these differences were small in magnitude. One would have expected larger differences between Black and White South African groups on the OPQ scales than between White and minority groups in the UK due to possible cultural distance, but this was generally not the case except for the reported medium effect sizes. In the case of Social Desirability, a smaller effect size (d = 0.32) was obtained in the UK than in the current study (d = 0.50), indicating that the minority group members were more inclined to provide socially desirable responses than members of the White group (OPQ Technical Manual, 2006). In interpreting the obtained differences between groups, readers are reminded that no assumption of full scale equivalence may be made in these studies. All too often, social science research is published without due acknowledgement of the limitations that the untested assumption of full scale equivalence poses.
The structural equation modelling indicated a highly satisfactory degree of structural invariance when the groups were compared with regard to their factor correlation matrices on the 32 scales. South African Black and White respondents therefore were comparable as far as their correlations between the 32 scales were concerned. For the present study, the score patterns obtained by the Black and White groups therefore can be considered structurally equivalent, in the sense that the OPQ32n questionnaire in this particular application of a comparison between Black and White groups was not biased in terms of yielding different correlation matrices for the two groups. Although the results obtained in the present study appear more favourable than those reported by Meiring et al. (2005) regarding the 15FQ+, it is important to remember that direct comparisons cannot be made unless comparable methodology and samples have been used. Somewhat less positive results regarding two of the fit indices were obtained when the covariances were compared, indicating that some of the variances differed between the groups. The latter result was expected, given the explanation by Bentler (2005) that exact equality of all $\Sigma_g$ is hard to verify in large samples.
Furthermore, the analyses indicated that there was structural invariance with the effect of the Social Desirability scale partialled out. Removing the effect of social desirability did not affect the structural equivalence of the two groups substantially, because when the correlations were computed on scores with the effect of social desirability controlled, the fit indices remained largely unchanged. This may indicate that the possible systematic effect of social desirability on the scale scores is similar in the two groups, despite the fact that the groups differed with regard to their means on this variable in the present study and in others (OPQ Technical Manual, 2006). This result is also plausible, because the Social Desirability scale did not correlate substantially with the other scales. The impact of the Social Desirability scale on the research findings can thus be regarded as negligible. Similar conclusions were reached by Meiring et al. (2005) when they investigated whether method bias existed as a result of differences in response styles across cultural groups. They found that social desirability scores did not affect the magnitude of differences between twelve South African language groups with regard to the 15FQ+ personality questionnaire. These results also support those found by Ones and Viswesvaran (1998), who reported that social desirability functions neither as a mediator nor as a suppressor variable in personality measurement.
It is important to note that, in the present study, the so-called global test of equal correlation/covariance matrices was conducted as originally advocated by Jöreskog (1971). Byrne (2006) indicated that this test may lead to contradictory or inconsistent results due to the fact that there is no baseline model that permits an orderly sequence of analytic steps for testing sets of parameters in a series of increasingly restrictive hypotheses. One has to bear in mind that it is not yet finally established what the preferred method for invariance testing should be, because Byrne (2006) admitted that several issues need to be resolved and backed up with sound analytic findings. The global test is regarded as an 'overly restrictive test of the data' and 'substantially more stringent than is the case for tests of invariance related to sets of parameters in the model' (Byrne, 2006, p. 175). We used the global test because it was not our goal to determine the number of underlying factors of the OPQ32n for each of the groups, nor whether the OPQ items reflected 32 personality factors. Construct validation was therefore not the goal of the study, because we assumed that the test measures 32 personality factors. The global test conducted here provided a test of the invariance of the factor correlation and covariance matrices of the OPQ32n. This test indicates whether relationships between multiple constructs (measuring a wide domain) are similar in the groups and, by implication, that the factor structures and convergent or discriminant validity for the groups will be similar.
A limitation of the study may be the relative homogeneity of the sample with regard to education, which implies that [...]

The original population on record at SHL South Africa consisted of 1579 respondents, of whom 248 were Black, 29 were Coloured, 37 were Indian and 476 were White. Sixteen respondents indicated another population group, whereas 773 candidates did not indicate which population group they belonged to. The latter group of respondents was excluded from the sample because they could not be included in comparisons between the population groups. Due to the small sizes of the Coloured and Indian groups, we decided to compare the Black (n = 248) and White (n = 476) groups only. These two groups constituted the majority of the original sample and it was considered prudent to omit the influence of smaller demographic groups. Biographical information for the sample of 724 respondents appears in Table 1. Their ages ranged from 19 to 65, with a mean age of 31.40 (SD = 8.44). There were 288 women (39.78%) and 436 men (60.22%) in the sample. All the respondents reported an educational level of matric or higher; in fact, 545 of them (76.87%) held a first or higher degree. There were 15 missing values with regard to educational level. The comparability of the Black and White groups is important for a study on measurement equivalence. Table 1 provides information regarding the biographical characteristics of the two groups. It appeared that the groups did not differ significantly on any of the variables.

[...] for the Social Desirability scale was equal to 0.63. In South Africa, a reliability study on a composite sample of 1181 employees and students resulted in alpha coefficients ranging from 0.69 to 0.88. The sample included 19.64% Black people, 2.71% Asian people, 2.29% Coloured people and 33.02% White people (42.34% of the respondents did not indicate their ethnic origin) (SHL South Africa, 2002).

Table 1
Biographical information for the White and Black groups (N = 724)
*Ages were obtained from the respondents' South African identity documents.

Table 2
Means, standard deviations, alpha coefficients and d statistics for the Black and White groups
*Statistically significant difference between means at the 0.01 level.

Table 3
Analysis of covariance structures goodness-of-fit statistics for the Black and White groups