Abstract
Orientation: Technology-based simulation exercises are popular assessment measures for the selection and development of human resources.
Research purpose: The primary goal of this study was to investigate the construct validity of an electronic in-basket exercise using computer-based simulation technology. The secondary goal of the study was to investigate how resampling techniques can be used to recover model parameters using small samples.
Motivation for the study: Although computer-based simulations are becoming more popular in the applied context, relatively little is known about the construct validity of these measures.
Research approach/design and method: A quantitative ex post facto correlational design was used in the current study with a convenience sample (N = 89). The internal structure of the simulation exercise was assessed using a confirmatory factor analytical approach. In addition, bias-corrected bootstrapping and Monte Carlo simulation strategies were used to assess the confidence intervals around model parameters.
Main findings: Support was not found for the entire model, but only for one of the dimensions, namely, the Interaction dimension. Multicollinearity was found between most of the dimensions, which was problematic for the factor analyses.
Practical/managerial implications: This study holds important implications for assessment practitioners who hope to develop unproctored simulation exercises.
Contribution/value-add: This study aims to contribute to the existing debate regarding the validity and utility of assessment centres (ACs), as well as to the literature concerning the use of technology-driven ACs. In addition, the study aims to make a methodological contribution by demonstrating how resampling techniques can be used in small AC samples.
Keywords: Assessment centres; electronic in-basket; Monte Carlo; bias-corrected bootstrapping; small sample analyses; computer-based simulations.
Introduction
Orientation
A number of selection instruments are available for the selection of personnel. They include personality questionnaires, targeted interviews, situational interviews, situational judgement tests, aptitude and ability tests, previous job roles and simulated exercises. Some of these instruments are more effective than others, while some have higher predictive validity than others (Sackett, Lievens, Van Iddekinge, & Kuncel, 2017; Schmidt & Oh, 2015). Assessment centres (ACs) have demonstrated superior criterion-related validity in comparison to other stand-alone measures (e.g. personality measures and interviews) (Arthur, Day, McNelly, & Edens, 2003; Meiring, Becker, Gericke, & Louw, 2015). In addition, there is also evidence that AC ratings can improve the prediction of job or training performance beyond other common predictors like cognitive ability and personality (Sackett, Shewach, & Keiser, 2017).
This may be the reason why organisations still employ ACs when selecting and developing their employees. Although literature supports the strong link between standardised tests of general mental ability and job performance, behaviour-based assessment provides a richer and more nuanced view of managerial potential (Lievens & Thornton, 2005). Notwithstanding the value of ACs, the relatively high cost of the method remains a deterrent for most organisations. In recent years, the development of workforce management software has enabled the large-scale automation of human resource functions such as benefit allocation, performance management, and skills development. Similarly, AC design and the delivery of exercises have been shaped by advances in information technology. Electronic in-baskets have probably been the most popular AC exercise to place within a technology-enabled platform or application because of the high degree of fidelity between traditional in-basket exercises and electronic in-basket applications (Meiring & Van der Westhuizen, 2011).
Despite the large-scale application of electronic in-baskets, applied research has not kept up with the prolific changes in industry. Relatively little is still known regarding the internal structure of electronic in-baskets compared to traditional in-baskets. More specifically, are the construct-related problems associated with traditional ACs still problematic within the technology-enabled simulations? The current study aims to answer these research questions through the investigation of a large-scale electronic in-basket used for development purposes. Finally, the study demonstrates how resampling techniques can be used to augment small samples that typically plague AC research. Bias-corrected bootstrapped confidence intervals and Monte Carlo resampling strategies were used to produce parameter estimates from empirically derived bootstrapped confidence intervals.
Research purpose and objectives
The popularity of ACs in practice stems largely from the large body of literature that supports the link between dimension ratings and on-the-job performance (Arthur et al., 2003; Hermelin, Lievens, & Robertson, 2007). That is, AC ratings accurately predict which candidates have the potential to succeed in meeting the job objectives for an intended role. Research on the construct-related validity of ACs has, however, been less successful. Building on the seminal work of Sackett and Dreher (1982), research has consistently found strong correlations between dimensions measured by the same exercises rather than between the same dimensions measured across different exercises. Historically, dimensions have often been the main currency of ACs (Thornton & Rupp, 2012). In organisations, dimension ratings play an essential role in informing human resource practices such as selection, placement and development, thereby serving as a conventional and appropriate way to report on AC performance (Thornton & Gibbons, 2009).
The consistent finding that exercise effects dominate AC ratings has prompted numerous researchers to put forth plausible explanations for the findings. Lievens (2009) argues that the aberrant results may be because of poor AC designs, such as too many dimensions included in ACs, the absence of behavioural checklists to classify behaviour, the use of psychologists as raters and frame-of-reference training. Despite these design changes, ACs still predominantly display exercise effects (Brits, Meiring, & Becker, 2013; Lance, 2008).
In light of these persistent mixed findings, three notable large-scale reviews dealing with construct-related validity have aimed to find solutions for the statistical challenges facing AC research using confirmatory factor analysis (CFA) techniques (Bowler & Woehr, 2006; Lance, Lambert, Gewin, Lievens, & Conway, 2004; Lievens & Conway, 2001). Bowler and Woehr (2006) conducted meta-analytical analyses on the exercise-by-dimension matrices of various CFA model configurations, and found that dimensions accounted for 22% of the variance while exercises accounted for 34% of the variance of post-exercise dimension ratings (PEDRs). Next, Lievens and Conway (2001) conducted separate CFAs for published exercise × dimension correlation matrices, and found that dimensions and exercises accounted for more or less equal proportions of variance. Finally, Lance et al. (2004) analysed the same exercise × dimension correlation matrices used in the meta-analyses by Lievens and Conway (2001) but avoided specifying correlated uniquenesses, which was a problematic statistical assumption in the Lievens and Conway (2001) study. They found stronger support for a general performance factor across exercises. Recent research suggests that increasing the number of behavioural indicators rated per dimension in an exercise leads to better model fit and stronger support for dimensions in AC data (Monahan, Hoffman, Lance, Jackson, & Foster, 2013).
Given the mixed findings, it is difficult to anticipate which source of variance will be dominant in the current investigation. This may also not be the most important question. Rather, judgements regarding the validity of ACs are largely dependent on the design intention of the simulation. If the AC was developed to tap into a distinct aspect of that dimension’s construct space across exercises, finding strong support for dimension effects will probably underscore the construct validity of the assessment. However, if the simulation was developed to tap into different elements of the dimensions across exercises, finding strong dimension factors at the expense of systematic dimension–exercise interaction effects would not be evidence of construct validity. Based on the fact that the current AC was designed to tap dimension-specific behavioural consistency across exercises, the construct validity of the assessment should rightly be investigated from the perspective of correlated dimensions.
Although the debate has recently moved beyond the exercise-versus-dimension debate (Lievens & Christiansen, 2010), we consider it important to investigate the internal structure of technology-delivered ACs because the medium of delivery can moderate the relationship between the stimuli and the behavioural response. Furthermore, in our experience very few practitioners structure interventions and provide individual feedback on exercise and dimensions. It probably remains true that most practitioners design their ACs based on a number of key competencies that are needed to be successful in a given position (Meiring & Buckett, 2016). If this is indeed the case, then it remains important to investigate the internal structure of ACs that are mainly designed to reflect dimensions.
An additional problem with AC research is that samples are typically small because of the cost of administration. Normally, ACs are administered to a small number of participants at the end of a multiple-hurdle assessment approach. The lack of construct validity, when defined as cross-exercise dimension-based behaviour congruence, may be explained in part by the lack of statistical power associated with the small sample sizes. Modern multivariate statistical techniques, especially confirmatory factor analytical approaches, require large samples and normally distributed data (Byrne, 2016). Thus, the secondary goal of the study was to present two resampling techniques that can be used by practitioners to improve the confidence in AC parameters.
The main research objective of this study was to design and implement computer-based simulation technology (CBST) as an electronic in-basket exercise (depicting the day-to-day activities of a supervisor) in an assessment development centre (ADC) for a major manufacturing enterprise in the United States. The goal of the ADC was to identify leadership potential – those individuals who may be ready to be promoted to higher levels in the organisation. This process incorporated group and individual online simulation exercises, which were used to measure behavioural and organisational competencies and performance areas on strategic and tactical levels. A secondary goal of the study was to investigate if resampling techniques can be used to assess model fit and to estimate confidence intervals around model parameters. Thus, the overarching research question can be described as follows: Can a CBST as an in-basket exercise in ADCs be used to accurately measure behavioural dimensions?
Based on the foregoing research question, the primary objectives of this study are to:
 examine the construct validity of the CBST in-basket exercise;
 demonstrate the use of resampling techniques in evaluating the quality of model parameters.
Review of the literature
The use of computer-based assessment centres in personnel development and selection
Recent research in employee selection has shifted the focus from traditional selection paradigms to more dynamic and flexible delivery methods. This is mainly driven by the higher fidelity of technology-enabled platforms and their associated cost savings. There is an increased interest in different selection methods such as situational judgement tests and the role of technology and the Internet in recruitment and selection. Social networking websites and video résumés have become part of selection procedures (Nikolaou, Anderson, & Salgado, 2012). A recent development in this regard relates to the use of technology in selection and assessment. The use of technology often takes the form of online simulations and web-based assessments. Simulations in selection and assessment are intended to closely replicate certain tasks, skills and abilities required for performance on the job (Schmitt, 2012).
This study made use of a computer-delivered in-basket exercise. By its very nature we can think of this assessment as a situational judgement test (SJT) rather than an AC because the assessment was made up of only a single exercise. Based on best practice guidelines for the use of the AC method in South Africa (Meiring & Buckett, 2016), ACs must consist of at least two simulated exercises. For this reason, we regard the electronic in-basket as a stand-alone competency-based simulation exercise rather than a full-fledged AC.
The electronic in-basket contained a computer-based in-basket exercise with multiple case studies, some of which had open-ended response formats while others made use of machine-driven scoring options. The scoring key was developed by an independent team of behavioural experts in collaboration with line managers and human resources in the given organisation. The response options reflect the desired behaviours of the supervisor in degrees of appropriateness (1 – least appropriate to 5 – most appropriate). Most of the responses were machine scored. There were some open-ended sections in the in-basket exercise that required direct input from the respondents. These responses were scored by a team of trained behavioural experts. The final overall assessment rating was a weighted combination of the scores achieved in the two sections of the same in-basket exercise. Thus, the in-basket exercise complies with the criteria for a traditional AC, at least as far as data integration is concerned, although only one simulation exercise was used.
Validity issues of assessment centres
As the acceptance and widespread use of competency-based assessments have increased in the last two decades, various interest groups have published practice and research guidelines (International Task Force on Assessment Centre Guidelines, 2015). According to Arthur, Day and Woehr (2008), validity must be established at the start of test construction and prior to operational use. Lievens, Dilchert and Ones (2009) defined validity as the process of collecting evidence (AC results) to determine the meaning of such assessment ratings and the inferences based on these ratings. Validity is defined by the use and purpose of the AC and is crucial to the permissibility of inferences derived from AC measures.
The popularity of ACs stems from their many strengths, including that they have little adverse impact and predict a variety of performance criteria (Thornton & Rupp, 2006) with predictive validity correlations ranging from 0.28 to 0.52 (Gaugler, Rosenthal, Thornton, & Bentson, 1987; Hermelin et al., 2007). In addition, the method has been shown to have high criterion-related validity, as well as content validity (Gaugler et al., 1987; Lievens et al., 2009).
However, research evidence concerning the internal structure of ACs shows much less support for the construct validity of AC dimensions (Kleinmann et al., 2011; Tett & Burnett, 2003). This leads to the conclusion that ACs are perhaps not very successful at measuring the constructs (dimensions) that they aim to measure (Haaland & Christiansen, 2002). Specifically, evidence in support of the discriminant validity of dimensions in ACs is relatively sparse (Thornton, Mueller-Hanson, & Rupp, 2017).
In the context of ACs, the International Task Force (2015) describes validity as the extent to which an AC yields valuable, useful results. Validity relates to whether the method actually measures what it is designed to measure (Arthur et al., 2008). While ACs have demonstrated impressive evidence related to content and criterion validity, the approach has been criticised for not being able to prove that it does, in fact, measure a set of predetermined dimensions.
Assessment centre research has largely used multitrait–multimethod (MTMM) analyses as a framework for the analysis of the internal structure of AC ratings (Campbell & Fiske, 1959). This framework is useful when multiple constructs are measured using multiple methods, as is the case with ACs. According to this framework, when the same construct is measured using different methods, the scores on the construct are expected to correlate strongly with one another or to converge. This is known as convergent validity. On the contrary, when different constructs are measured using the same method or when different constructs are measured using different methods, the scores should not overlap substantially. The discrimination between different constructs is known as discriminant validity.
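The MTMM logic described above can be illustrated with a minimal Python sketch. The correlation values, dimension labels (D1, D2) and exercise labels (E1, E2) are hypothetical and are not taken from the current study; the function simply averages correlations by type so that convergent evidence (same dimension, different exercise) can be compared with the correlations that should remain low for discriminant validity.

```python
from itertools import combinations
from statistics import mean

# Hypothetical post-exercise dimension ratings: each variable is a
# (dimension, exercise) pair. The correlations below are made up.
variables = [("D1", "E1"), ("D2", "E1"), ("D1", "E2"), ("D2", "E2")]
corr = {
    (("D1", "E1"), ("D2", "E1")): 0.60,  # different dimension, same exercise
    (("D1", "E1"), ("D1", "E2")): 0.30,  # same dimension, different exercise
    (("D1", "E1"), ("D2", "E2")): 0.15,  # different dimension, different exercise
    (("D2", "E1"), ("D1", "E2")): 0.10,  # different dimension, different exercise
    (("D2", "E1"), ("D2", "E2")): 0.25,  # same dimension, different exercise
    (("D1", "E2"), ("D2", "E2")): 0.55,  # different dimension, same exercise
}


def mtmm_summary(variables, corr):
    """Average the correlations within each MTMM category."""
    groups = {
        "same_dim_diff_ex": [],   # convergent validity evidence
        "diff_dim_same_ex": [],   # method (exercise) effects
        "diff_dim_diff_ex": [],   # baseline for discriminant validity
    }
    for a, b in combinations(variables, 2):
        same_dim = a[0] == b[0]
        same_ex = a[1] == b[1]
        if same_dim and not same_ex:
            groups["same_dim_diff_ex"].append(corr[(a, b)])
        elif same_ex and not same_dim:
            groups["diff_dim_same_ex"].append(corr[(a, b)])
        else:
            groups["diff_dim_diff_ex"].append(corr[(a, b)])
    return {name: mean(values) for name, values in groups.items()}


summary = mtmm_summary(variables, corr)
# An exercise effect is suggested when same-exercise correlations exceed
# same-dimension correlations, the pattern reported by Sackett and Dreher (1982).
exercise_effect = summary["diff_dim_same_ex"] > summary["same_dim_diff_ex"]
print(summary, exercise_effect)
```

In this made-up matrix the same-exercise correlations dominate, which is the pattern usually read as weak convergent and discriminant evidence for dimensions.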
Within the AC literature, Sackett and Dreher (1982) sent shockwaves through the AC community with their statement that differences in dimension scores could not be explained across different exercises. Sackett and Dreher (1982) studied the PEDRs in terms of MTMM, and argued that there is more constancy in the ratings of dimensions in a single exercise than in the ratings for a single dimension across multiple exercises. These findings and the conclusions they generated highlighted the shortcomings relating to the identification and use of dimensions within AC exercises (Jackson, Lance, & Hoffman, 2012).
Studies focusing on dimension-based ACs (DBACs) indicate that results relating to dimensions across exercises are most meaningful in decisions pertaining to candidates. The dimension-based focus is the most commonly used, most commonly researched and most commonly discussed AC perspective (Thornton & Rupp, 2012).
More recently, authors have argued that cross-exercise correlations of dimensions will remain elusive because behaviour is exercise-dependent (Hoffman, 2012). Although we agree with this thinking, our position remains that one would like to see some form of conceptual separation in dimension scores, especially if these dimensions are used as stand-alone criteria for decision-making. This will form the basis for any inferences regarding the discriminant validity of the AC. In contrast, one would hope to find relatively high correlations between the same dimensions in different exercises. However, because competencies are measured with a single method, this assumption cannot be tested in the current study. Nevertheless, one would at least expect similar competencies to be moderately correlated, which can be seen as a proxy of convergent validity.
Statistical methods for construct validity and resampling techniques
Because the construct validity of ACs is largely concerned with the internal structure of the AC and is related to either a task-based orientation or a dimension-based orientation, Campbell and Fiske’s (1959) MTMM framework is specifically useful when considering AC structures. This framework has been widely used to investigate the internal structure of AC ratings.
Despite the frequent use of MTMM matrices, Hoffman (2012) argues that MTMM approaches may not be well suited to analyse AC factor structures (Bowler & Woehr, 2006). These authors warn of a potential oversight of AC performance ratings, as theoretically the MTMM does not recognise variance in AC ratings stemming from the assessee, the assessor, the dimensions or exercises and the interactions between these sources of variance. Analytically, the MTMM matrices are plagued by non-admissible solutions and weak model termination (Lance, Woehr, & Meade, 2007). Lance et al. (2007) suggest using hybrid models when investigating ACs such as Correlated Dimensions Correlated Exercises (CDCE), One-Dimension Correlated Exercises (1DCE) and Unilateral Dimensions Correlated Exercises (UDCE + G), which may lead to more useful findings because these approaches overcome many of the shortcomings of MTMM.
More recently, approaches such as generalisability theory and other variance decomposition approaches have been used to investigate the relative contributions and interactions of assessor, dimension or exercise variance as sources of legitimate variance in AC ratings (Bowler & Woehr, 2008).
Nonetheless, the CFA approach has dominated investigations of the internal construct validity of ACs and is ideal for large samples of data (n > 200). This may be explained, in part, by the fact that most dimensions included in ACs are based on the initial job analyses. Thus, CFAs are often used because the representative selection of the (task and contextual) demands, constraints and opportunities that constitute the job in question (and their potential interactions) is known prior to the development of the AC. For this reason, most validation studies are concerned with finding confirmation for the proposed structure rather than finding the appropriate structure. The foregoing line of reasoning suggests that an AC attempts to assess the latent behavioural dimensions comprising performance by eliciting observable behavioural denotations of latent performance dimensions or competencies via specific stimuli set in a variety of specific micro-contexts that differ in terms of situational characteristics. A conclusion that the instrument has construct validity (i.e. that inferences on the construct as constitutively defined may permissibly be inferred from measures of the instrument) will be strengthened if it can, in addition, be shown that a structural model reflecting the manner in which the construct or constructs of interest are embedded in a larger nomological network of constructs according to the constitutive definition fits the data.
However, ACs have generally been plagued by small samples (Lievens & Christiansen, 2010). This is also true of the current study. Alternatives suggested by Bowler and Woehr (2008) include the use of resampling techniques that remedy some of the problems associated with MTMM.
This method was independently tested by Hoffman, Melchers, Blair, Kleinmann and Ladd (2011b), who found that the alternative resampling provided more insights than the traditional MTMM approach. Kuncel and Sackett (2014) used a different sampling technique based on the theory of composites to investigate the exercise–dimension variance. More recently, new approaches have been proposed to overcome the challenges related to CFA analyses using small samples.
The Monte Carlo family of resampling techniques may be fruitfully used to test the appropriateness of model parameters, standard errors, confidence intervals and even fit indices under various assumptions. Because of the low statistical power in small samples, standard errors may be overestimated, which may lead to significant effects being missed. In contrast, if standard errors are underestimated, significant effects may be overstated (Muthén & Muthén, 2002). The Monte Carlo Method for Assessing Mediation (MCMAM) was first described and evaluated by MacKinnon, Lockwood and Williams (2004) to assess small-sample performance. The Monte Carlo technique entails the extrapolation of data into thousands of simulated data frames, which model statistical characteristics similar to the original sample data (Muthén & Muthén, 2002). In essence, an estimator or test statistic has a true sampling distribution under a particular set of conditions, and Monte Carlo simulation assists in approximating that distribution (Lance, Woehr, & Meade, 2005).
Monte Carlo features include saving parameter estimates from the analysis of real data to be used as population and/or coverage values for data generation in a Monte Carlo simulation study. Monte Carlo simulations involve identifying a mathematical model of the activity or process to be researched and defining the parameters such as mean and standard deviation for each factor in the model (Lance et al., 2005). The simulation creates random data according to those parameters and then simulates and analyses the output of the process. A typical Monte Carlo simulation involves the generation of independent datasets of interest and computing the numerical value of the statistic of interest for each dataset. The larger the number of generated datasets, the more closely the results approximate the true sampling properties of the data (Davidian, 2005). When used in ACs, the Monte Carlo simulation addresses the extent to which the model fits the generated data (Lance et al., 2005). Lance et al. (2005) suggest that Monte Carlo simulation should be used to assess whether model fit is an appropriate basis for comparing competing CFA models.
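As an illustration of the general Monte Carlo logic described above (and not of the Mplus procedure used in this study), the following Python sketch generates many datasets of size N = 89 from a known population model and records how often a 95% confidence interval covers the population value. The one-factor, two-indicator model, the loadings of 0.7 and 0.6, the number of replications and the use of a Fisher z interval are all assumptions made purely for the example.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def simulate_sample(n, lam1, lam2, rng):
    """Draw two standardised indicators of one standard-normal latent factor."""
    x1, x2 = [], []
    for _ in range(n):
        xi = rng.gauss(0, 1)  # latent score
        x1.append(lam1 * xi + math.sqrt(1 - lam1 ** 2) * rng.gauss(0, 1))
        x2.append(lam2 * xi + math.sqrt(1 - lam2 ** 2) * rng.gauss(0, 1))
    return x1, x2


def coverage(n=89, lam1=0.7, lam2=0.6, reps=2000, level=0.95, seed=1):
    """Proportion of replications whose interval contains the population value."""
    rng = random.Random(seed)
    rho = lam1 * lam2  # population correlation implied by the loadings
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    hits = 0
    for _ in range(reps):
        r = pearson(*simulate_sample(n, lam1, lam2, rng))
        half_width = z_crit / math.sqrt(n - 3)  # Fisher z standard error
        lower = math.tanh(math.atanh(r) - half_width)
        upper = math.tanh(math.atanh(r) + half_width)
        hits += lower <= rho <= upper
    return hits / reps


cov = coverage()
print(f"Empirical coverage of the nominal 95% interval: {cov:.3f}")
```

Coverage close to the nominal level suggests that the interval behaves as intended at the given sample size; coverage well below it would indicate that standard errors are underestimated.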
An alternative resampling technique known as residual bootstrapping (Bollen & Stine, 1992) can also be used independently or in conjunction with Monte Carlo simulations. However, bootstrapping is aimed at investigating the standard errors of model parameters across a large number of bias-corrected draws. In an article by MacKinnon et al. (2004), bias-corrected bootstrap confidence intervals were found to be very accurate. Bootstrapping is also an option for smaller datasets and involves the resampling of an original dataset to a desired sample size (Efron, 1979).
Bootstrapping complements traditional confidence intervals by estimating standard errors of parameter estimates over a large number of hypothetical sample draws (Bollen & Stine, 1992; Hancock & Nevitt, 1999). One of the main reasons why Bollen–Stine bootstraps were included in the study is that the technique provides a way to impose the covariance structure model on the sample data. This way the researcher is able to examine the bootstrapping performance of the fit statistics under the assumptions of the ‘null hypothesis’ that the model fits (Kim & Millsap, 2014). Furthermore, Monte Carlo simulated data may result in biased standard errors and parameter estimates if the data are generated under the assumption of normality when the data in the sample are actually non-normal (Muthén & Muthén, 2002). The Bollen–Stine bootstrap can therefore correct for the standard error and fit statistic bias that occurs in structural equation modelling (SEM) applications because of non-normal data (Bollen & Stine, 1992). Bollen–Stine bootstraps (BSBS) are deployed whereby the original data are rotated to conform to the fitted structure. Bollen–Stine bootstraps take the empirical sample of size (N) and randomly draw repeated samples with replacement to the same size (N). The goal is to repeat the sampling size and form an integrated picture of the original sampling data (Efron, 1979). In this study, because of the relatively small sample size the BSBS standard errors and bias-corrected confidence intervals were used to get a sense of the width of confidence intervals around parameter estimates. Relatively narrow confidence intervals around parameter estimates suggest that standard errors are not abnormally high.
In the literature review, we focused on the historical debate regarding the construct validity and internal structure of ACs. Internet-delivered simulated exercises closely resemble the features of traditional sample-based assessment, yet the delivery and scoring platform differs significantly. However, relatively little is known about the internal structure of electronic simulations in general and in-baskets in particular. Thus, the overarching goal of the study is to assess the internal structure of an electronic in-basket. Furthermore, the potential benefits of resampling techniques were discussed in the context of ACs. The section concluded with a discussion of resampling techniques and the use of Monte Carlo and bootstrapping techniques.
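The bias-corrected bootstrap interval mentioned above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: it uses a naive case-resampling bootstrap for a single correlation, not the Bollen–Stine transformation or the Mplus implementation used in the study, and the data, the loadings and the function name `bc_bootstrap_ci` are hypothetical.

```python
import math
import random
from statistics import NormalDist


def pearson(x, y):
    """Pearson correlation between two equally long lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bc_bootstrap_ci(x, y, stat, draws=2000, level=0.95, seed=1):
    """Bias-corrected percentile bootstrap interval for stat(x, y)."""
    rng = random.Random(seed)
    n = len(x)
    estimate = stat(x, y)
    boots = []
    for _ in range(draws):
        # Resample cases with replacement to the original sample size N.
        idx = [rng.randrange(n) for _ in range(n)]
        boots.append(stat([x[i] for i in idx], [y[i] for i in idx]))
    boots.sort()
    nd = NormalDist()
    # Bias-correction constant: based on the share of bootstrap estimates
    # that fall below the sample estimate (clamped to avoid 0 and 1).
    prop = sum(b < estimate for b in boots) / draws
    prop = min(max(prop, 1 / (draws + 1)), draws / (draws + 1))
    z0 = nd.inv_cdf(prop)
    alpha = (1 - level) / 2
    lower_p = nd.cdf(2 * z0 + nd.inv_cdf(alpha))
    upper_p = nd.cdf(2 * z0 + nd.inv_cdf(1 - alpha))
    lower = boots[min(draws - 1, max(0, int(lower_p * draws)))]
    upper = boots[min(draws - 1, max(0, int(upper_p * draws)))]
    return estimate, lower, upper


# Hypothetical data for N = 89 participants on two indicators.
data_rng = random.Random(42)
latent = [data_rng.gauss(0, 1) for _ in range(89)]
x = [0.7 * f + data_rng.gauss(0, 0.7) for f in latent]
y = [0.6 * f + data_rng.gauss(0, 0.8) for f in latent]

est, lo, hi = bc_bootstrap_ci(x, y, pearson)
print(f"r = {est:.3f}, bias-corrected 95% interval = [{lo:.3f}, {hi:.3f}]")
```

A narrow interval relative to the size of the estimate would be read, as in the text, as a sign that the standard error is not abnormally high for the sample size.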
Research design
Research approach
A non-experimental, quantitative research design was used in the current study to empirically test the main research objectives. More specifically, an ex post facto correlational design was used and implemented in a confirmatory factor analytical framework. Post-exercise dimension ratings were used as the level of measurement and served as manifest variables in the factor analytical models that were specified.
Research strategy
Initially, the data were screened for multivariate outliers and out-of-range responses. Descriptive statistics were generated to investigate the distribution and central tendency of PEDR scores for each of the competency dimensions. Inferential statistics were generated by specifying a confirmatory factor analytical model. The internal structure of the electronic in-basket can be operationalised through the specification of fixed and freely estimated model parameters.
More specifically, the CBST measurement model can be defined in terms of a set of measurement equations, expressed in matrix algebra notation (see Equation 1):
X = Λ_{X}ξ + δ    (1)
Where:
 X is a 19 × 1 column vector of observable indicator variables (PEDR);
 Λ_{X} is a 19 × 5 matrix of factor loadings;
 ξ is a 5 × 1 column vector of latent competency dimensions;
 δ is a 19 × 1 column vector of measurement error.
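To make the measurement model more concrete, the sketch below computes the covariance matrix implied by a factor model of this form, Σ = ΛΦΛ' + Θ, where Φ holds the covariances between latent dimensions and Θ is a diagonal matrix of error variances. The toy model (four indicators, two latent dimensions) and all numerical values are hypothetical; the study's model has 19 indicators and 5 latent dimensions, and its estimates are not reproduced here.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def transpose(a):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def implied_covariance(lam, phi, theta_diag):
    """Sigma = Lambda * Phi * Lambda' + Theta, with uncorrelated errors."""
    sigma = matmul(matmul(lam, phi), transpose(lam))
    for i, error_variance in enumerate(theta_diag):
        sigma[i][i] += error_variance
    return sigma


# Hypothetical toy model: 4 indicators loading on 2 standardised dimensions.
lam = [
    [0.8, 0.0],
    [0.7, 0.0],
    [0.0, 0.6],
    [0.0, 0.9],
]
phi = [
    [1.0, 0.5],   # latent variances fixed to 1 (standardised factors)
    [0.5, 1.0],   # off-diagonal element: freely estimated latent covariance
]
# Error variances chosen so that each indicator has unit variance.
theta = [1 - 0.8 ** 2, 1 - 0.7 ** 2, 1 - 0.6 ** 2, 1 - 0.9 ** 2]

sigma = implied_covariance(lam, phi, theta)
for row in sigma:
    print([round(value, 3) for value in row])
```

Model fit is then a question of how closely this implied matrix reproduces the observed covariance matrix of the PEDRs; a very high value in Φ would correspond to the multicollinearity between dimensions noted in the findings.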
In addition, all the off-diagonal elements of the phi covariance matrix, denoting the covariance between the five latent competencies, were freed up to be estimated. Model parameters of the CFA model were estimated using maximum likelihood with robust standard errors and fit indices because of the non-normality of the sample data. For identification purposes, each of the five latent competencies was standardised and all error variances were specified to be uncorrelated. Fit indices and model parameters were estimated using Mplus 7.2 (Muthén & Muthén, 2017). Multiple fit indices were used to evaluate the tenability of the CBST. These indices included the Satorra–Bentler χ^{2}, the Comparative Fit Index (CFI; Bentler, 1990), the Root Mean Square Error of Approximation (RMSEA; Steiger & Lind, 1980) with accompanying confidence intervals, and the Standardised Root Mean Square Residual (SRMR; Joreskog & Sorbom, 1993). Comparative Fit Index values in excess of 0.90 (Bentler, 1990), RMSEA values lower than 0.08 (Browne & Cudeck, 1993) and SRMR values lower than 0.06 (Hu & Bentler, 1999) were regarded as satisfactory.
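The cut-off values above can be applied with a short Python sketch. The formulas are the commonly cited textbook definitions of the RMSEA and CFI; they are an assumption here, because software packages differ in detail (for example, some use N rather than N − 1 in the RMSEA, and robust versions based on the Satorra–Bentler χ^{2} apply additional scaling). The χ^{2} values in the example are hypothetical and are not results from this study; the degrees of freedom are those that would follow from 19 indicators and five correlated, standardised factors.

```python
import math


def rmsea(chi2, df, n):
    """Textbook RMSEA point estimate from a model chi-square."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


def cfi(chi2_model, df_model, chi2_base, df_base):
    """Comparative Fit Index relative to the baseline (independence) model."""
    d_model = max(chi2_model - df_model, 0.0)
    d_base = max(chi2_base - df_base, d_model)
    return 1.0 if d_base == 0 else 1.0 - d_model / d_base


def meets_cutoffs(cfi_value, rmsea_value, srmr_value):
    """Apply the cut-offs stated in the text (CFI > 0.90, RMSEA < 0.08, SRMR < 0.06)."""
    return {
        "CFI": cfi_value > 0.90,
        "RMSEA": rmsea_value < 0.08,
        "SRMR": srmr_value < 0.06,
    }


# Hypothetical values for illustration only.
# df_model = 190 unique moments - (19 loadings + 19 error variances + 10 covariances)
chi2_model, df_model = 200.0, 142
chi2_base, df_base = 900.0, 171   # independence model: 19 * 18 / 2 = 171 df
n = 89
srmr_value = 0.055                # hypothetical, normally reported by the software

rmsea_value = rmsea(chi2_model, df_model, n)
cfi_value = cfi(chi2_model, df_model, chi2_base, df_base)
verdict = meets_cutoffs(cfi_value, rmsea_value, srmr_value)
print(round(rmsea_value, 3), round(cfi_value, 3), verdict)
```

With a sample as small as N = 89, such point estimates are imprecise, which is one reason the RMSEA is reported with its confidence interval.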
Research method
Research participants
A convenience sample of 89 supervisors was selected in a non-random fashion from a large multinational manufacturing organisation operating in the petroleum and rubber industry in North America. The sample was selected from incumbent supervisors, who were earmarked to partake in a larger leadership developmental programme in the organisation. The first step in the development programme was to complete the CBST to gain more insight into the strengths and development areas of each supervisor.
Measuring instruments
Six broad competencies were identified by the client organisation for inclusion in the CBST based on their proposed link to job performance as identified through the job analysis process. An external consulting organisation was contracted to develop the behavioural indicators and scoring method for each competency. A summary of the six metacompetency clusters and subdimensions is presented in Table 1.
TABLE 1: Metacompetencies and subdimensions. 
Because all the metacompetencies were operationalised within the online inbasket, there was only one simulation format. For this reason, the current research cannot be regarded as an AC because competencies were not measured within multiple exercises (Meiring & Buckett, 2016).
The electronic inbasket was scored using a combination of multiplechoice machine scoring and manual scoring by trained raters of the openended video vignettes. The scores were integrated according to equal weighted averages for openended and multiplechoice response options. The openended responses were scored by a team of trained behavioural experts, who examined the responses in accordance with the conceptual definitions. The assessors attended frameofreference training to accurately observe, record, classify and assess the responses to the openended questions. During the training session, examples were provided with a range of responses, ranging from appropriate to less appropriate behavioural examples, and how to use the five point behaviourally anchored rating scale (BARS) to assess responses. As part of the training, all assessors had to complete the CBST.
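The score integration described above can be sketched as follows. This is a minimal illustration, assuming that ratings from several assessors on an open-ended item are first averaged per item and that the two sections are then combined with equal weights; the function name `integrate_scores` and the example scores are hypothetical, and the actual scoring key is not reproduced here.

```python
def integrate_scores(machine_scores, rater_scores):
    """Equal-weighted average of machine-scored and rater-scored sections (1-5 scale).

    machine_scores: one score per multiple-choice item.
    rater_scores: one list of assessor ratings per open-ended item.
    """
    machine_mean = sum(machine_scores) / len(machine_scores)
    # Assumption: assessor ratings are averaged within each open-ended item first.
    item_means = [sum(ratings) / len(ratings) for ratings in rater_scores]
    open_ended_mean = sum(item_means) / len(item_means)
    return 0.5 * machine_mean + 0.5 * open_ended_mean


# Hypothetical scores for one participant on one competency.
overall = integrate_scores(
    machine_scores=[4, 3, 5],
    rater_scores=[[3, 4], [5, 5]],
)
print(overall)
```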
In this regard, the simulated electronic in-basket complied with the criteria of traditional ACs insofar as each competency was observed and scored by multiple raters and integrated into an overall score. Because all the competencies were operationalised in a single simulation format, the in-basket cannot be regarded as a traditional AC. However, we believe that the results of the study hold important implications for sample-based assessment, and specifically for those simulations that are delivered on an electronic platform. Completion of the computer-based simulation in-basket exercise took around 40 minutes. All participants completed the task in the allocated time. For this reason, there were no missing values in the data.
Research procedure and ethical considerations
Managers who participated in the AC were identified for future promotion because of strong performance and competence in their incumbent positions. Because the purpose of the AC was for development, all the managers who participated in the study consented to partake in the AC. All participants were informed that their data may be used for research purposes. The identity of all participants was kept anonymous by converting the raw data into an encrypted file that was shared with the researchers. Thus, the final dataset contained no personal information other than the race, age and gender of the participants.
Statistical analysis
The internal structure of the CBST was assessed by specifying a confirmatory factor analytical model with Mplus 7.2 (Muthén & Muthén, 1998–2017). Because the researchers had a well-developed a priori conceptualisation of how the subdimension scores relate to the higher order latent dimensions, only CFA was conducted and not exploratory factor analysis (EFA). Thus, the goal of the CFA was not to explore the manner in which the PEDR scores relate to higher order latent dimensions, but rather to investigate the relative strength of the relationships between PEDRs and latent dimensions in the conceptual model.
In addition, it was important to evaluate the overall fit of the proposed model to the observed data. If strong support was found for the model parameters and overall fit of the model to the data, it would be possible to conclude that the CBST has construct validity and may be used for diagnostic and selection purposes (Lievens & Christiansen, 2010). Because of the small sample size, the authors decided to use resampling techniques to assess the confidence intervals and bootstrapped standard errors of the factor loadings and other parameter estimates. In small samples, there is limited statistical power to reject the null hypothesis even when, in reality (i.e. in the population), the linkages are different from zero (Lievens & Christiansen, 2010). To overcome this fundamental methodological problem, the researchers employed Monte Carlo and bootstrapping techniques to assess the stability of standard errors and to construct confidence intervals around point estimates.
Monte Carlo simulations were used to generate 1000 simulated datasets with statistical characteristics similar to those of the sample data. The approach used in the current study can be regarded as an external Monte Carlo study, insofar as parameters saved from the real data analyses are used as population values for the simulated data. Thus, a two-step approach is used: the model parameters are first estimated from the real data, and these values are then used as input to generate data in step 2. Using model parameters estimated from the real data as population values may not be sufficient to reproduce the nonnormality of the real data in the simulated data. However, when working with skewed data, robust maximum likelihood estimation (MLE) can be used.
This may be particularly important when examining the critical value of the chi-square fit statistic in the parent and simulated samples (Curran, West, & Finch, 1996; Wang, Fan, & Wilson, 1996). Kim and Millsap (2014) advocate the use of robust MLE for both the real and generated samples. Analyses by Kim and Millsap (2014) indicated that the original assumptions about nonnormality of data in simulated studies by Millsap (2012) may have been overly rigorous. Kim and Millsap (2014), however, caution that small samples may lead to higher levels of discrepancy between fit indices in the simulated and original data. Given that the sample in the current investigation is very small, it is important to investigate the differences in model fit indices and parameter estimates reported in the parent and simulated data sets.
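The two-step external Monte Carlo logic described above can be sketched as follows. This is a minimal illustration in Python with NumPy, not the Mplus routine used in the study. The function name `simulate_one_factor`, the loadings and the residual variances are hypothetical placeholders and are not estimates from the sample. The sketch draws multivariate normal data, so it does not reproduce any nonnormality present in the parent data.

```python
import numpy as np

def simulate_one_factor(loadings, residual_vars, n_obs, n_reps, seed=0):
    """Draw n_reps datasets of size n_obs from a one-factor model."""
    rng = np.random.default_rng(seed)
    lam = np.asarray(loadings, dtype=float)
    theta = np.asarray(residual_vars, dtype=float)
    # Model-implied covariance with the factor variance fixed at 1:
    # Sigma = lam lam' + diag(theta)
    sigma = np.outer(lam, lam) + np.diag(theta)
    data = rng.multivariate_normal(np.zeros(lam.size), sigma,
                                   size=(n_reps, n_obs))
    return sigma, data

# Step 1 (assumed done elsewhere): estimates saved from the real-data analysis.
loadings = [0.8, 0.7, 0.6, 0.75]              # hypothetical values
residual_vars = [0.36, 0.51, 0.64, 0.4375]    # hypothetical values

# Step 2: treat those estimates as population values and generate data.
sigma, datasets = simulate_one_factor(loadings, residual_vars,
                                      n_obs=89, n_reps=1000)
print(datasets.shape)   # (replications, observations, indicators)
```

Each simulated dataset would then be analysed with the same CFA model, and the resulting fit indices and parameter estimates compared with those from the parent sample.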
In addition, the Bollen–Stine bootstrap (residuals bootstrap) produces a correct bootstrapped sampling distribution for chi-square, and thus a correct bootstrapped p-value, without presuming a specific distribution of the data (Muthén & Muthén, 2002). Because the Bollen–Stine technique preserves the characteristics of the original data, the technique may be particularly useful when the source data are nonnormal.
The Bollen–Stine bootstrap can be used to correct for the bias in standard errors and fit statistics that occurs in SEM applications because of nonnormal data. Bollen and Stine (1992) perform bootstraps whereby the original data are rotated to conform to the fitted structure. By default, the Bollen–Stine technique reestimates the model with rotated data and uses the estimates as starting values for each bootstrap iteration. It also rejects samples where convergence was not achieved (implemented through the reject [econverged = 0] option supplied to bootstrap) (Millsap, 2012).
Confidence intervals and bootstrapping can also be used to gain greater confidence in findings. This involves investigating the ‘real’ sampling variability without assuming a specific distribution for the data. The resampling variability used to obtain the bootstrapped p-values also yields confidence intervals that can aid the interpretation of model parameters (Bollen & Stine, 1992; Millsap, 2012).
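The rotation step of the Bollen–Stine procedure can be illustrated with a short sketch. It assumes Python with NumPy and uses the transformation Z = Y S^(-1/2) Σ^(1/2) from Bollen and Stine (1992), where Y is the centred data, S the sample covariance matrix and Σ the model-implied covariance matrix. The function names, the toy data and the loadings are hypothetical, and the model-fitting step that would be repeated on every bootstrap sample is left out.

```python
import numpy as np

def sym_power(mat, power):
    """Power of a symmetric positive-definite matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals**power) @ vecs.T

def bollen_stine_transform(data, sigma_hat):
    """Rotate data so that its sample covariance equals sigma_hat."""
    centred = data - data.mean(axis=0)
    s = np.cov(centred, rowvar=False)
    return centred @ sym_power(s, -0.5) @ sym_power(sigma_hat, 0.5)

def bootstrap_p_value(t_observed, t_bootstrap):
    """Share of bootstrap fit statistics at least as large as the observed one."""
    return float(np.mean(np.asarray(t_bootstrap) >= t_observed))

rng = np.random.default_rng(1)
raw = rng.exponential(size=(89, 4))            # skewed toy data, not study data
lam = np.array([0.8, 0.7, 0.6, 0.75])          # hypothetical loadings
sigma_hat = np.outer(lam, lam) + np.diag(1 - lam**2)
rotated = bollen_stine_transform(raw, sigma_hat)

# The model holds exactly in the rotated data, so rows resampled from it
# mimic sampling under the null hypothesis of correct model fit.
print(np.allclose(np.cov(rotated, rowvar=False), sigma_hat))
```

In a full analysis, rows of `rotated` would be resampled with replacement, the model refitted to each resample, and the resulting chi-square values passed to `bootstrap_p_value` together with the chi-square from the original data.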
In each case, the results obtained from the original data were compared to the results generated with the Monte Carlo simulations and Bollen–Stine bias-corrected bootstrapping. The comparative results may contribute to the AC literature by demonstrating the utility of resampling techniques when working with relatively small sample sizes. It is important to emphasise that the resampling methods are not a ‘silver bullet’ for small sample sizes, as any sampling error contained in the sample from which the resamples are drawn will be included in the bootstrapped sample (Enders, 2005). Moreover, any errors are likely to be duplicated in resampled data sets when missing data analysis techniques are applied. However, we believe that the benefits of resampling techniques (improved statistical power, bias-corrected standard errors and confidence intervals around parameter estimates) outweigh the alternative, which is to do nothing.
Ethical consideration
This article followed all ethical standards for research without direct contact with human or animal subjects.
Results
The primary objective of this study was to examine the validity of a CBST in-basket exercise within an ADC. This objective involves demonstrating the behavioural validity of the workplace simulation. It further implies that if construct validity is intact, then the exercise complies with the principles of the ADC and may lead to valid development and selection decisions.
The results of the study are discussed according to the following structure:
 frequencies and descriptive statistics;
 screening the data;
 examining the appropriateness of the data for multivariate CFA;
 specification and estimation of CFA model;
 evaluating the model according to goodness-of-fit indices;
 evaluating the model according to model parameters;
 using Monte Carlo estimates;
 using Bollen–Stine (BS) bias-corrected bootstraps.
As with other multivariate linear statistical procedures, CFA requires that certain assumptions must be met with regard to the sample. Therefore, prior to formally fitting the CFA model to the data, the assumptions of multivariate normality, linearity and adequacy of variance were assessed. In general, no serious violations of these assumptions were detected in the data. However, the data did not follow a multivariate normal distribution and therefore robust maximum likelihood (RML) was specified as the estimation technique. Basic descriptive statistics were generated to assess the variability and central tendency of PEDRs. The means and standard deviations of PEDRs are presented in Table 2.
The results in Table 2 suggested that the range of scores was restricted as one would expect to find when assessing job incumbents. Next, we assessed the bivariate correlations between PEDRs prior to specifying the CFA model. In total there were 26 indicators, and there were thus 325 intercorrelations in the correlation matrix, of which 22 reported bivariate correlations greater than 0.90. This was problematic because it suggested that many of the PEDR scores lacked discriminant validity. Furthermore, variables that are perfect linear combinations of each other or are extremely highly correlated prevent the covariance matrices from being inverted (Tabachnick & Fidell, 2007). The 22 high correlations are presented in Appendix 1.
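The screening step described above can be sketched in a few lines. The example assumes Python with NumPy and uses randomly generated toy scores in place of the PEDR data. The function name `screen_correlations` and the 0.90 threshold argument are illustrative choices.

```python
import numpy as np

def screen_correlations(scores, threshold=0.90):
    """Count unique indicator pairs and list those correlating above threshold."""
    corr = np.corrcoef(scores, rowvar=False)
    rows, cols = np.triu_indices_from(corr, k=1)   # each pair counted once
    flagged = [(int(i), int(j), float(corr[i, j]))
               for i, j in zip(rows, cols) if corr[i, j] > threshold]
    return rows.size, flagged

rng = np.random.default_rng(2)
toy = rng.normal(size=(89, 26))                     # 89 people, 26 toy indicators
toy[:, 1] = toy[:, 0] + 0.05 * rng.normal(size=89)  # one near-duplicate pair
n_pairs, flagged = screen_correlations(toy)
print(n_pairs)    # 26 * 25 / 2 = 325
print(flagged)    # pairs that would be candidates for collapsing
```

With 26 indicators there are 26 × 25 / 2 = 325 unique pairs, which matches the number of intercorrelations reported above.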
When the total CBST model was specified as a CFA model, Mplus issued a warning that the sample covariance matrix may be singular, and the model could not converge. Based on the singular covariance matrix, it was impossible to specify and assess the total CBST. One possible remedy would be to collapse highly correlated subdimensions into broader competencies. In previous studies, Hoffman et al. (2011a) found support for broad competencies in PEDR scores rather than smaller idiosyncratic competencies.
Collapsing dimensions into broader competencies may make sense from a theoretical and methodological perspective. From a methodological perspective, treating dimension scores (PEDRs) as indicators of broader dimensions will increase the indicator-to-dimension ratio. Monahan et al. (2013) found that greater indicator-to-dimension ratios lead to improved termination and admissibility in CFA models. From a theoretical perspective, Hoffman et al. (2011b) argue that the dimensions-as-items approach has been used extensively to validate taxonomies of managerial performance, models of organisational citizenship behaviour, multisource performance ratings and measures of managerial skills. Howard (2008) advocates the use of broad facets where indicators form the basic input to factor analytic models. For this reason, it seems to make methodological and theoretical sense to group strongly correlated dimensions into broader dimensions as long as they clearly share conceptual overlap.
In Appendix 1, five of the correlations equal to 1.0 belonged to the Learning metacompetency. The same pattern held for the Executing (six correlations, r > 0.955) and Directing (six correlations, r > 0.994) metacompetencies. This left the authors with the Entrepreneurship (four dimension ratings) and Interaction (six dimension ratings) metacompetencies to assess. The Vision metacompetency was not considered because only three subdimensions could be used as indicators in the CFA model and two of them were highly correlated. Against this background it was decided to focus on the Interaction metacompetency because this dimension had the most dimension ratings that could be used as indicators and seemed to report moderate positive correlations between the subdimensions. However, even within the Interaction metacompetency, multicollinearity seemed to be evident, although it was considerably lower than in the other metacompetencies. The bivariate correlations of the Interaction metacompetency are presented in Table 3.
TABLE 3: Bivariate correlations of the Interaction metacompetency. 
Table 3 includes some problematic correlations. In particular, within the Interaction cluster, Promoting Diversity correlated highly with Intercultural Sensitivity, and Motivating Others correlated highly with Fostering Teamwork. From a measurement theory perspective, each of these two pairs should be combined into a single indicator because there seems to be very little distinction between the paired subdimensions. Based on the content, it was deemed theoretically permissible to combine these pairs. Thus, Motivating Others was combined with Fostering Teamwork to become INT_MOFT and Promoting Diversity was combined with Intercultural Sensitivity to become INT_PDIS. After combining these subdimensions, the CFA solution converged to an admissible solution with satisfactory model fit. In addition, the multicollinearity problem had at least been addressed: none of the remaining bivariate correlations within the Interaction cluster was greater than 0.90. The correlation matrix for the revised Interaction metacompetency is presented in Table 4.
TABLE 4: Revised Interaction metacompetency bivariate correlations. 
Because nonnormal data can lead to biased fit indices and standard errors in the simulated data when using Monte Carlo, the normality of the observed variables was assessed with SPSS (Version 25, IBM, 2017). Although visual inspection indicated that most of the observed variables were nonnormal, the simple test of dividing the skewness statistic by its associated standard error indicated that most variables followed a normal distribution. If the absolute value of this ratio is greater than 1.96, it suggests that the data are not normal with respect to this specific statistic (Rose, Spinks, & Canhoto, 2015). The results of the analyses are summarised in Table 5.
TABLE 5: Skewness of observed variables for the revised Interaction dimension. 
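The skewness check can be reproduced with a short sketch. It assumes Python with NumPy and uses the bias-adjusted sample skewness together with the standard error formula sqrt(6n(n - 1) / ((n - 2)(n + 1)(n + 3))), which is the form commonly reported by statistical packages. The function name and the toy data are hypothetical, and the values in Table 5 are not reproduced here.

```python
import math
import numpy as np

def skewness_z(x):
    """Return sample skewness, its standard error and their ratio."""
    x = np.asarray(x, dtype=float)
    n = x.size
    dev = x - x.mean()
    s = x.std(ddof=1)
    # Bias-adjusted sample skewness
    skew = n / ((n - 1) * (n - 2)) * np.sum((dev / s) ** 3)
    # Standard error depends only on the sample size
    se = math.sqrt(6.0 * n * (n - 1) / ((n - 2) * (n + 1) * (n + 3)))
    return float(skew), se, float(skew / se)

rng = np.random.default_rng(5)
toy_scores = rng.normal(loc=3.0, scale=0.5, size=89)   # toy data, n = 89
skew, se, z = skewness_z(toy_scores)
print(round(se, 3))        # standard error of skewness for n = 89
print(abs(z) > 1.96)       # True would flag the variable as skewed
```

For a sample of 89 the standard error is roughly 0.26, so under this rule a skewness statistic would need to exceed about 0.50 in absolute value before a variable is flagged.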
Results from Table 5 suggest that most of the variables are normally distributed with the exception of Interaction Motivating Others and Fostering Teamwork. For this reason, we decided to specify the RML estimator as suggested by Muthén and Muthén (1998–2017). According to Muthén and Muthén, the parameter estimates should be the same irrespective of whether RML or maximum likelihood is used; only the standard errors are adjusted when using RML. For this reason, Muthén and Muthén (1998–2017) recommend using RML as the default estimator in CFA irrespective of the data distribution.
The correlations in Table 4 suggest that none of the remaining correlations was problematic. After two rounds of revision, we specified a CFA model with four indicators. The CFA solution converged to an admissible solution. Goodness-of-fit indices are reported in Table 6.
TABLE 6: Goodness-of-fit indices for the revised Interaction dimension. 
The overall model fit can be regarded as satisfactory based on the criteria and cutoff rules reported in the methodology section. The CFI and Tucker–Lewis Index (TLI) were in excess of 0.95, and the RMSEA and SRMR were close to the normative cutoff value of 0.05.
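For readers who want to see how these indices relate to the chi-square statistic, the sketch below computes RMSEA, CFI and TLI from a model chi-square and a baseline (independence model) chi-square. It is written in Python and uses common textbook formulas. The chi-square values are hypothetical and are not those in Table 6. The RMSEA uses N - 1 in the denominator, whereas some programs use N, and with a robust estimator software typically substitutes scaled chi-square values, which this sketch does not attempt.

```python
import math

def fit_indices(chi2_model, df_model, chi2_baseline, df_baseline, n_obs):
    """RMSEA, CFI and TLI from model and baseline chi-square values."""
    excess_model = max(chi2_model - df_model, 0.0)
    excess_base = max(chi2_baseline - df_baseline, 0.0)
    # Textbook RMSEA with N - 1 in the denominator (an assumption)
    rmsea = math.sqrt(excess_model / (df_model * (n_obs - 1)))
    denom = max(excess_model, excess_base)
    cfi = 1.0 if denom == 0 else 1.0 - excess_model / denom
    ratio_base = chi2_baseline / df_baseline
    ratio_model = chi2_model / df_model
    tli = (ratio_base - ratio_model) / (ratio_base - 1.0)
    return rmsea, cfi, tli

# Four indicators, one factor: df = 2 for the model, df = 6 for the baseline.
rmsea, cfi, tli = fit_indices(4.0, 2, 200.0, 6, 89)   # hypothetical chi-squares
print(round(rmsea, 3), round(cfi, 3), round(tli, 3))
```

The example illustrates why RMSEA is sensitive in small samples: with only two degrees of freedom and N = 89, a modest excess of chi-square over its degrees of freedom already moves the index well above 0.05.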
The unstandardised and standardised results demonstrated that most of the model parameters were indicative of good model fit. This provides further support for the revised Interaction measurement model. A summary of the model parameters is presented in Table 7.
TABLE 7: Unstandardised and standardised parameter estimates of the revised Interaction dimension. 
The results in Table 7 suggest that the four broad dimension ratings are good indicators of the latent competency of Interaction. Monte Carlo simulations were once again conducted using the original input model configuration as a basis for the estimation. Tables 8 and 9 contain the mean and standard deviation of the chi-square and RMSEA fit indices over the 1000 replications of the Monte Carlo analyses.
TABLE 8: Mean, standard deviation, critical value of chi-square fit index across 1000 draws. 
TABLE 9: Mean, standard deviation, critical value of Root Mean Square Error of Approximation fit index across 1000 draws. 
Table 8 indicates the critical chi-square value given two degrees of freedom. Thus, the value in column 1 in Table 8 provides the probability that the chi-square value exceeds the critical percentile value of 5.991. Column 2 provides the proportion of replications for which the critical value is exceeded, 0.051, which is close to the expected value of 0.050. This suggests that the chi-square distribution is well approximated in the current investigation because the expected and observed percentiles and distributions are close to one another in absolute value.
Similar results are displayed in Table 9, which indicates the probability that the RMSEA value exceeds the critical value.
The critical RMSEA value of 0.052 is exceeded in approximately 9.5% of the 1000 replications. Although the mean RMSEA value in the simulated data is indicative of good fit (0.014), the relatively large deviation between the expected and observed proportions of replications exceeding the critical value raises concern regarding the approximate fit of the original CFA model, given the results of the Monte Carlo simulation.
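The comparison of expected and observed proportions can be made concrete with a small sketch. It assumes Python with NumPy and uses chi-square draws with two degrees of freedom as a stand-in for the replication chi-square values, which are not available here. The variable names are illustrative.

```python
import math
import numpy as np

df = 2
alpha = 0.05
# For df = 2 the chi-square survival function is exp(-x / 2), so the
# upper 5% critical value has a closed form.
critical = -2.0 * math.log(alpha)

rng = np.random.default_rng(3)
n_reps = 1000
simulated_chi2 = rng.chisquare(df, size=n_reps)   # stand-in for replication values
observed_prop = float(np.mean(simulated_chi2 > critical))

# Monte Carlo standard error of a proportion whose true value is alpha
mc_se = math.sqrt(alpha * (1 - alpha) / n_reps)
print(round(critical, 3))   # 5.991
print(round(mc_se, 4))      # 0.0069
print(observed_prop)        # close to 0.05 when the distribution is well approximated
```

With 1000 replications the sampling error of an exceedance proportion is about 0.007. On this yardstick, an observed proportion of 0.051 lies well within one standard error of 0.050, whereas 0.095 lies several standard errors away, assuming the expected proportion for the RMSEA is also 0.050.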
The standard error, Monte Carlo-derived standard error, average standard deviation and average coverage values are presented in Table 10. The column labelled Population provides the parameter values in the sample data. The column Average provides the average model parameters across the 1000 replications. The difference between these two values, relative to the population value, can be regarded as the proportion bias in parameter estimates presented in the last column of Table 10. The column labelled Standard deviation provides the standard deviation of parameter estimates across the Monte Carlo replications. This value can also be regarded as the population standard error (Muthén & Muthén, 2002). The column labelled MSE gives the mean square error of each parameter, while the column labelled 95% Cover gives the proportion of replications for which the 95% confidence interval contains the population parameter value. A value of 0.80 would imply that the interval contained the population value in 80% of the replications. Muthén and Muthén (2002) suggest that the coverage should remain between 0.91 and 0.98. In contrast to the original data, the 95% coverage was respectable, with most values exceeding 0.80. Finally, the column labelled percentage significant coefficients provides an estimate of the proportion of replications for which the null hypothesis that a parameter is equal to zero is rejected at the 0.05 level. For parameters with population values different from zero, this value is an estimate of power with respect to a single parameter, that is, the probability of rejecting the null hypothesis when it is false. For parameters with population values equal to zero, it is an estimate of the type I error rate, that is, the probability of rejecting the null hypothesis when it is true (Muthén & Muthén, 1998–2017).
TABLE 10: Average population estimations with Mean Square Error, 95% coverage, proportion of replications equal to zero under H0, and parameter bias. 
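The quantities summarised in Table 10 can be computed for a single parameter as sketched below. The example assumes Python with NumPy, a nonzero population value and normal-theory 95% intervals of the form estimate ± 1.96 × standard error. The function name, the dictionary keys and the toy replication results are hypothetical and do not reproduce the study's output.

```python
import numpy as np

def summarise_replications(estimates, std_errors, population_value, z_crit=1.96):
    """Summarise Monte Carlo replications for one parameter."""
    est = np.asarray(estimates, dtype=float)
    se = np.asarray(std_errors, dtype=float)
    average = est.mean()
    lower, upper = est - z_crit * se, est + z_crit * se
    covered = (lower <= population_value) & (population_value <= upper)
    return {
        "average": float(average),
        "std_dev": float(est.std(ddof=1)),        # empirical standard error
        "se_average": float(se.mean()),           # mean of the estimated SEs
        "mse": float(np.mean((est - population_value) ** 2)),
        "cover_95": float(np.mean(covered)),
        "pct_sig": float(np.mean(np.abs(est / se) > z_crit)),
        # Proportion bias; undefined when the population value is zero
        "bias": float((average - population_value) / population_value),
    }

rng = np.random.default_rng(6)
toy_estimates = rng.normal(0.8, 0.1, size=1000)   # toy replication estimates
toy_ses = np.full(1000, 0.1)                      # toy replication standard errors
summary = summarise_replications(toy_estimates, toy_ses, population_value=0.8)
print(summary)
```

Because the toy population value is nonzero, `pct_sig` is read here as power. For a parameter whose population value is zero, the same proportion would estimate the type I error rate.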
Against this background, the information in Table 10 provides a mixed picture of the Monte Carlo results with regard to the Interaction dimension. On the one hand, the 95% coverage indicates that a relatively large proportion of the 95% confidence intervals contain the population parameter value across the 1000 replications, and the power to reject the null hypothesis that a parameter is equal to zero is relatively large. On the other hand, the deviation between the average parameter estimates across replications and the sample estimates indicates that the parameters may be biased. This would indicate that the generated data did not recover the model parameters with a high degree of accuracy.
Next, we discuss the results from the Bollen–Stine residual bootstrapped standard errors and biascorrected confidence intervals generated with regard to the Interaction subscale with 1000 bootstrap draws. The intention of this analysis is to provide valid inferences from the sample data to some large universe of potential data; in other words, to provide information about the population from statistics generated with random smaller samples. Because it would be virtually impossible to obtain access to random samples from populations that have the same characteristics as the larger population, statistical methods have been developed to determine the confidence with which such inferences can be drawn, given the characteristics of the available sample (Cohen, Cohen, West, & Aiken, 2003). The variability of model parameters as a function of the unreliability of scores can be inferred by means of confidence intervals.
Bootstrapping and other resampling techniques complement traditional confidence intervals by estimating standard errors of parameter estimates over a large number of hypothetical sample draws (Hancock & Nevitt, 1999). Results from the bias-corrected bootstrap procedure are delineated in Appendix 2.
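A bias-corrected bootstrap interval can be sketched as follows. The example assumes Python with NumPy and the standard library `statistics.NormalDist`, and it implements the bias-corrected percentile interval without an acceleration term. For brevity it bootstraps a sample mean with a naive row resampling scheme rather than a CFA parameter under the Bollen–Stine scheme. The function name, the clipping guard and the toy data are illustrative assumptions.

```python
from statistics import NormalDist
import numpy as np

def bias_corrected_ci(theta_hat, boot_estimates, level=0.95):
    """Bias-corrected percentile interval from bootstrap estimates."""
    boot = np.asarray(boot_estimates, dtype=float)
    norm = NormalDist()
    # Share of bootstrap estimates below the original estimate; clipped so
    # that the inverse normal CDF stays finite (a guard added for this sketch).
    prop_below = float(np.mean(boot < theta_hat))
    prop_below = min(max(prop_below, 1 / (boot.size + 1)),
                     boot.size / (boot.size + 1))
    z0 = norm.inv_cdf(prop_below)                 # bias-correction constant
    alpha = (1 - level) / 2
    lo_p = norm.cdf(2 * z0 + norm.inv_cdf(alpha))
    hi_p = norm.cdf(2 * z0 + norm.inv_cdf(1 - alpha))
    return float(np.quantile(boot, lo_p)), float(np.quantile(boot, hi_p))

rng = np.random.default_rng(4)
sample = rng.exponential(size=89)                 # skewed toy sample, not study data
theta_hat = sample.mean()
boot = np.array([rng.choice(sample, size=sample.size, replace=True).mean()
                 for _ in range(1000)])
ci_low, ci_high = bias_corrected_ci(theta_hat, boot)
print(ci_low, theta_hat, ci_high)
```

When exactly half of the bootstrap estimates fall below the original estimate, the correction constant is zero and the interval reduces to the ordinary percentile interval. Otherwise both limits shift in the direction of the correction.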
The results of the bias-corrected bootstrapping indicate that the 95% confidence intervals around model parameters are quite broad, which erodes confidence in the replication of specific point estimates in the population. In addition, the difference between the population parameter estimates and the mean values recovered by the Monte Carlo draws indicates substantial bias in the parameter estimates. The same conclusion can be reached with regard to the standard errors.
Considering all this information collectively, one would have to conclude that the construct validity evidence for the original CBST is limited. For example, the overall measurement model did not converge and was eventually abandoned because of high multicollinearity between dimension ratings. Consequently, only a small subsection of the measure was further investigated with additional analysis. Even these models of the Interaction dimension required extensive modification and manipulation before they showed acceptable fit to the data.
More supportive evidence for construct validity was found with regard to the revised Interaction metacompetency after the subdimensions of Motivating Others (INT_MO) and Fostering Teamwork (INT_FT), as well as the subdimensions of Promoting Diversity (INT_PD) and Intercultural Sensitivity (INT_IS), were combined. The newly combined subdimensions were labelled INT_PDIS (Promoting Diversity and Intercultural Sensitivity) and INT_MOFT (Team Motivation). Theoretically, it makes sense to group the dimension ratings of Promoting Diversity and Intercultural Sensitivity, as well as Motivating Others and Fostering Teamwork.
Discussion
Outline of the results
The primary research objective was to examine the construct validity of an electronic inbasket using CBST technology. In the end, only a revised version of one of the six metacompetencies could be assessed. The results suggest that AC methodologies packaged in interactive software applications are not immune to the problems that face traditional samplebased assessments. Multicollinearity remains a particularly thorny issue, in part, because not enough consideration is awarded to the conceptual definition of competencies at the design stage. However, this problem does not seem to be unique to the current study. Hoffman et al. (2011a, 2011b) found that narrow dimensions grouped together in broader dimensions provide better model fit than models that contain many dimensions. Thus, when dimensions were modelled in a way that took the similarity between them into account, it was possible to find evidence for dimension factors in ACs. Similarly, Kuncel and Sackett (2014) found that greater levels of aggregation compound common variance and reduce error variance. This should improve convergence and fit in CFA models. An alternative explanation that cannot be ruled out is the impact of frameofreference training on assessor ratings. Often rigorous frameofreference training results in ratings that are very similar for multiple raters of the same participant in the AC. Although this practice may promote interrater reliability it may restrict the range of dimension ratings. In addition, the high correlations between dimensions further suggest that the scoring mechanism failed to indicate the relative differences between dimensions in a single simulation. However, this result is not unique. Previous AC research consistently found that behaviour is consistent across dimensions within exercises, rather than across exercises of the same dimensions.
The lack of correspondence among the same dimension observations across AC exercises has often been regarded as problematic, and several innovative interventions have been proposed to remedy the problem. Proponents of the exercise-centric ideology, however, will highlight the link between exercise effects and criterion scores. Recent studies nevertheless suggest that ACs have been misspecified and that, as a result, the contribution of dimensions to AC ratings has historically been underestimated. This holds important implications for practice because most AC applications are probably still expressed in dimension-centric discourse.
In the second round of data analysis, we investigated the model parameters by way of two resampling techniques, namely, Monte Carlo simulations and bias-corrected bootstrapping. Confidence intervals were derived from the bias-corrected draws to assess the variability of model parameters in light of the calculated population standard errors. In accordance with the original results, we found the bootstrap confidence intervals to be quite wide and coverage levels below the suggested level of 0.90. This provided further support that the results should be interpreted with caution because estimates may be biased. More generally, these techniques may provide AC scholars and practitioners with another set of tools to assess the validity of ratings, especially when samples are relatively small. We may have arrived at a different conclusion regarding the validity of the revised Interaction dimension, albeit not the whole in-basket, if these two resampling techniques had not been employed. In general, the CFA results of the revised Interaction model showed satisfactory fit, low residuals and robust factor loadings. However, the resampling techniques indicate that the results may not be trustworthy and may reflect type I errors. These two approaches provide valuable tools for AC practitioners and researchers who often have to conduct research with very small sample sizes.
Limitations of the study
Although the study provided a number of useful findings, there are some limitations that need to be reported. One of the biggest limitations is that only one exercise type, namely, an in-basket, was used in the current study. This made the specification and estimation of method effects impossible. Typically, the size of the exercise effects provides important information regarding the functioning and internal structure of simulations. Based on best practice guidelines for the use of the AC method in South Africa, Meiring and Buckett (2016) stipulate that a single electronic in-basket exercise does not constitute an AC. This stipulation originates from the classic definition of an AC, which posits that multiple exercises and multiple observers are key differentiators between ACs and other evaluation methods. For this reason, results reported in the current study cannot be generalised to other traditional ACs.
Another limitation was the structure and design of the CBST. Ratings of competencies were reflected as PEDRs and not behavioural indicators. This greatly limited the number of data points to specify each of the metacompetencies. If metacompetencies were specified with behavioural indicators, the researcher could delete behavioural indicators that demonstrated collinearity, yet still measure the six metacompetencies. However, in the current study the researchers could only specify and evaluate the Interaction metacompetency because the other competencies had too few PEDR scores to combine into broader competencies.
An additional limitation of this study is that the performance ratings from managers, as well as the success of a follow-up supervisory development programme, could not be investigated. As a result, the criterion-related validity of the six metacompetencies with respect to job performance could not be investigated. It would have been interesting to see if differences on the metacompetencies translated into significant criterion-related differences.
Practical implications
The research value and contribution of this study can be best described by discussing multiple perspectives. From a practical perspective, this application of CBST demonstrates that faster, more accurate solutions exist for conducting ACs for the purposes of selection and development. From a theoretical perspective, the research results and learning points from the CBST inbasket exercise depict the reallife events of the manager and may act as a workplace or business simulation, which adds to the incremental validity of selection or development strategy. From a corporate perspective, the accelerating rate of change and the increasing uncertainty in the outcomes of change are evident across the whole business arena. This is enhanced by the increased demand for experienced talent. From a research perspective, there is considerable bias in model parameters when using small samples. However, biascorrected bootstrapping techniques and Monte Carlo simulations may be used productively to evaluate the bias in model parameters.
Conclusion
This study set out to evaluate the construct validity of an electronic inbasket by investigating the internal structure of the exercise. The selection of competencies was based on job analyses and each of the six metacompetencies has a number of subdimensions. This design is similar to traditional AC exercises. The initial goal of the study was to assess the internal structure of the entire inbasket with a CFA methodology using MTMM matrices. However, initial statistical screening of the data suggested that a large number of dimensions were highly correlated and lacked discriminant validity. To remedy this problem, dimensions were collapsed whenever it made theoretical sense to do so. In the end, only one metacompetency, the Interaction dimension, could be evaluated with a CFA approach. The results showed that the proposed model fitted the sample data well.
However, results from the two resampling techniques suggested that the model parameters were contaminated by bias and may lead to invalid inferences. This study demonstrates how these two techniques can be used when applying CFA approaches in small samples. Finally, the study demonstrates that for all the potential benefits associated with electronic and Internet-delivered simulations, the a priori design and scoring mechanism should comply with best practice if one hopes to find support for construct validity in AC ratings.
Acknowledgements
Competing interests
The authors declare that they have no financial or personal relationship(s) that may have inappropriately influenced them in writing this article.
Authors’ contributions
J.B. wrote the article and conducted the statistical analyses. D.M. conceptualised the literature review and J.H.v.d.W. collected the data and wrote the original master’s thesis which forms the basis of the current article.
Funding information
This research received no specific grant from any funding agency in the public, commercial or not-for-profit sectors.
Data availability statement
Data sharing is not applicable to this article as no new data were created or analysed in this study.
Disclaimer
The views expressed in this article are the authors’ own and do not represent the position of any related institutions or funding agencies. No funding was received by the authors during the process of completing the research contained in the current article.
References
Arthur, W., Jr., Day, E. A., McNelly, T. L., & Edens, P. S. (2003). A meta-analysis of the criterion-related validity of assessment center dimensions. Personnel Psychology, 56(1), 125–153. https://doi.org/10.1111/j.1744-6570.2003.tb00146.x
Arthur, W., Jr., Day, E. A., & Woehr, D. J. (2008). Mend it, don’t end it: An alternative view of assessment center construct-related validity evidence. Industrial and Organizational Psychology, 1(1), 105–111. https://doi.org/10.1111/j.1754-9434.2007.00019.x
Bentler, P. M. (1990). Comparative fit indexes in structural models. Psychological Bulletin, 107(2), 238–246. https://doi.org/10.1037/0033-2909.107.2.238
Bollen, K. A., & Stine, R. A. (1992). Bootstrapping goodness-of-fit measures in structural equation models. Sociological Methods and Research, 21(2), 205–229. https://doi.org/10.1177/0049124192021002004
Bowler, M. C., & Woehr, D. J. (2006). A meta-analytic evaluation of the impact of dimensions and exercise factors on assessment center ratings. Journal of Applied Psychology, 91(5), 1114–1124. https://doi.org/10.1037/0021-9010.91.5.1114
Bowler, M. C., & Woehr, D. J. (2008). Evaluating assessment center construct-related validity via variance partitioning. In B. J. Hoffman (Ed.), Reexamining assessment centers: Alternate approaches. Paper presented at the 23rd annual meeting of the Society for Industrial and Organisational Psychology, San Francisco, CA.
Brits, N., Meiring, D., & Becker, J. R. (2013). Investigating the construct validity of a development assessment centre. SA Journal of Industrial Psychology, 39(1), 1–13. https://doi.org/10.4102/sajip.v39i1.1092
Browne, M. W., & Cudeck, R. (1993). Alternative ways of assessing model fit. In K. A. Bollen & J. S. Long (Eds.), Testing structural equation models (pp. 136–162). Newbury Park, CA: Sage.
Byrne, B. M. (2016). Structural equation modeling with AMOS: Basic concepts, applications, and programming. London: Routledge.
Campbell, D. T., & Fiske, D. W. (1959). Convergent and discriminant validation by the multitrait-multimethod matrix. Psychological Bulletin, 56(2), 81–105. Retrieved from psycnet.apa.org
Cohen, J., Cohen, P., West, S. G., & Aiken, L. S. (2003). Applied multiple regression/correlation analysis for the behavioral sciences. Hamilton, NJ: Hamilton Printing Company.
Curran, P. J., West, S. G., & Finch, J. F. (1996). The robustness of test statistics to nonnormality and specification error in confirmatory factor analysis. Psychological Methods, 1(1), 16–29. https://doi.org/10.1037/1082-989X.1.1.16
Davidian, S. (2005). Simulation studies in statistics. What is a Monte Carlo study? Retrieved from http://www4.stat.ncsu.edu/~davidian/st810a/simulation_handout.pdf.
Efron, B. (1979). Bootstrap methods: Another look at the jackknife. The Annals of Statistics, 7(1), 1–26. https://doi.org/10.1214/aos/1176344552
Enders, C. K. (2005). An SAS macro for implementing the modified Bollen-Stine bootstrap for missing data: Implementing the bootstrap using existing structural equation modeling software. Structural Equation Modeling, 12, 620–641. https://doi.org/10.1207/s15328007sem1204_6
Gaugler, B. B., Rosenthal, D. B., Thornton, G. C., & Bentson, C. (1987). Meta-analysis of assessment center validity. Journal of Applied Psychology Monograph, 72(3), 493–511. https://doi.org/10.1037/0021-9010.72.3.493
Haaland, S., & Christiansen, N. D. (2002). Implications of trait-activation theory for evaluating the construct validity of assessment center ratings. Personnel Psychology, 55(1), 137–163. https://doi.org/10.1111/j.1744-6570.2002.tb00106.x
Hancock, G. R., & Nevitt, J. (1999). Bootstrapping and the identification of exogenous latent variables within structural equation models. Structural Equation Modeling, 6(4), 394–399. https://doi.org/10.1080/10705519909540142
Hermelin, E., Lievens, F., & Robertson, I. (2007). The validity of assessment centre for the prediction of supervisory performance rankings: A meta-analysis. International Journal of Selection and Assessment, 15(4), 405–411. https://doi.org/10.1111/j.1468-2389.2007.00399.x
Hoffman, B. J. (2012). Exercises, dimensions, and the Battle of Lilliput: Evidence for a mixed-model interpretation of AC performance. In D. J. R. Jackson, C. E. Lance, & B. J. Hoffman (Eds.), The psychology of assessment centers (pp. 281–306). New York: Routledge.
Hoffman, B. J., Melchers, K. G., Blair, C. A., Kleinmann, M., & Ladd, R. T. (2011a). Center validity exercises and dimensions are the currency of assessment centers. Personnel Psychology, 64(2), 351–395. https://doi.org/10.1111/j.1744-6570.2011.01213.x
Hoffman, B. J., Melchers, K. G., Blair, C. A., Kleinmann, M., & Ladd, R. T. (2011b). Exercises and dimensions are the currency of assessment centres [Electronic version]. Personnel Psychology, 64(2), 351–395. https://doi.org/10.1111/j.1744-6570.2011.01213.x
Howard, A. (2008). Making assessment centres work the way they are supposed to. Industrial and Organizational Psychology, 1, 98–104. https://doi.org/10.1111/j.1754-9434.2007.00018.x
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55. https://doi.org/10.1080/10705519909540118
IBM Corporation. (2017). IBM SPSS Statistics for Windows. New York: IBM Corporation.
International Task Force on Assessment Center Guidelines. (2015). Guidelines and ethical considerations for assessment center operations. International Journal of Selection and Assessment, 17(3), 243–253. https://doi.org/10.1111/j.1468-2389.2009.00467.x
Jackson, D., Lance, C. E., & Hoffman, B. (2012). The psychology of assessment centres. London: Routledge.
Jöreskog, K. G., & Sörbom, D. (1993). LISREL VI: Analysis of linear structural relationships by the method of maximum likelihood. Chicago, IL: National Educational Resources.
Kim, H., & Millsap, R. (2014). Using the Bollen-Stine bootstrapping method for evaluating approximate fit indices. Multivariate Behavioral Research, 49(6), 581–596. https://doi.org/10.1080/00273171.2014.947352
Kleinmann, M., Ingold, P. V., Lievens, F., Jansen, A., Melchers, K. G., & König, C. J. (2011). A different look at why selection procedures work: The role of candidates’ ability to identify criteria. Organizational Psychology Review, 1(2), 128–146. https://doi.org/10.1177/2041386610387000
Kuncel, N. R., & Sackett, P. R. (2014). Resolving the assessment center construct validity problem (as we know it). Journal of Applied Psychology, 99(1), 38–47. https://doi.org/10.1037/a0034147
Lance, C. E. (2008). Where have we been, how did we get there and where shall we go? Industrial and Organisational Psychology: Perspectives on Science and Practice, 1(1), 140–146. https://doi.org/10.1111/j.1754-9434.2007.00028.x
Lance, C. E., Lambert, T. A., Gewin, A. G., Lievens, F., & Conway, J. M. (2004). Revised estimates of dimension and exercise variance components in assessment center post exercise dimension ratings. Journal of Applied Psychology, 89(2), 377–385. https://doi.org/10.1037/0021-9010.89.2.377
Lance, C. E., Woehr, D. J., & Meade, A. W. (2005, April). A Monte Carlo investigation of assessment centre construct validity models. Published paper delivered at the 20th Annual Conference of the Society for Industrial and Organizational Psychology, Los Angeles, CA.
Lance, C. E., Woehr, D. J., & Meade, A. W. (2007). Case study: A Monte Carlo investigation of assessment center exercise factors represent cross-situational specific, not method bias. Human Performance, 13(4), 323–353.
Lievens, F. (2009). Assessment centres: A tale about dimensions, exercises, and dancing bears. European Journal of Work and Organizational Psychology, 18(1), 102–121. https://doi.org/10.1080/13594320802058997
Lievens, F., & Christiansen, N. D. (2010). Core debates in assessment center research: Dimensions versus exercises. In D. Jackson, C. Lance, & B. Hoffman (Eds.), The psychology of assessment centers (pp. 68–94). New York: Routledge.
Lievens, F., & Conway, J. M. (2001). Dimensions and exercise variance in assessment center scores: A large-scale evaluation of multitrait–multimethod studies. Journal of Applied Psychology, 86(6), 1202–1222. https://doi.org/10.1037/0021-9010.86.6.1202
Lievens, F., Dilchert, S., & Ones, D. S. (2009). The importance of exercise and dimension factors in assessment centers: Simultaneous examinations of construct-related and criterion-related validity. Human Performance, 22(5), 375–390. https://doi.org/10.1080/08959280903248310
Lievens, F., & Thornton, G. C., III. (2005). Assessment centers: Recent developments in practice and research. In A. Evers, O. Smit-Voskuijl, & N. Anderson (Eds.), Handbook of selection (pp. 243–264). New Jersey: Blackwell Publishing.
MacKinnon, D. P., Lockwood, C. M., & Williams, J. (2004). Confidence limits for the indirect effect: Distribution of the product and resampling methods. Multivariate Behavioral Research, 39(1), 99–128. https://doi.org/10.1207/s15327906mbr3901_4
Meiring, D., Becker, R. J., Gericke, S., & Louw, N. (2015). Assessment centers: Latest developments on construct validity. In I. Nikolaou & J. K. Oostrom (Eds.), Employee recruitment, selection, and assessment. Contemporary issues for theory and practice (pp. 190–206). London: Psychological Press-Taylor & Francis.
Meiring, D., & Buckett, A. (2016). Best practice guidelines for the use of the assessment centre method in South Africa. SA Journal of Industrial Psychology, 42(1), 1–15. https://doi.org/10.4102/sajip.v42i1.1298
Meiring, D., & Van der Westhuizen, J. H. (2011). Computer-based simulation technology as part of the AC and DAC: A global South African review. In N. Povah & G. C. Thornton (Eds.), Assessment and development centres: Strategies for global talent management. Surrey, UK: Gower Publishing Ltd.
Millsap, R. E. (2012). A simulation paradigm for evaluating model fit. In M. Edwards & R. MacCallum (Eds.), Current issues in the theory and application of latent variable models (pp. 165–182). New York: Routledge.
Monahan, E. L., Hoffman, B. J., Lance, C. E., Jackson, D. J. R., & Foster, M. R. (2013). Now you see them, now you do not: The influence of indicator-factor ratio on support for assessment center dimensions. Personnel Psychology, 66(4), 1009–1047. https://doi.org/10.1111/peps.12049
Muthén, L. K., & Muthén, B. O. (2002). How to use a Monte Carlo study to decide on sample size and determine power. Structural Equation Modeling: A Multidisciplinary Journal, 9(4), 599–620. https://doi.org/10.1207/S15328007SEM0904_8
Muthén, L. K., & Muthén, B. O. (1998–2017). Mplus user’s guide (7th edn.). Los Angeles, CA: Muthén & Muthén.
Nikolaou, I., Anderson, N., & Salgado, J. (2012). Special issue on advances in selection and assessment in Europe. International Journal of Selection & Assessment, 20(4), 383–384. https://doi.org/10.1111/ijsa.12000
Rose, S., Spinks, N., & Canhoto, A. I. (2015). Management research: Applying principles. New York: Routledge Taylor and Francis.
Sackett, P. R., Lievens, F., Van Iddekinge, C., & Kuncel, N. (2017). Individual differences and their measurement: A review of 100 years of research. Journal of Applied Psychology, 102(3), 254–273. https://doi.org/10.1037/apl0000151
Sackett, P. R., & Dreher, G. F. (1982). Constructs and assessment center dimensions: Some troubling empirical findings. Journal of Applied Psychology, 67(4), 401–410. https://doi.org/10.1037/0021-9010.67.4.401
Sackett, P. R., Shewach, O. R., & Keiser, H. N. (2017). Assessment centers versus cognitive ability tests: Challenging the conventional wisdom on criterionrelated validity. Journal of Applied Psychology, 102(10), 1435. https://doi.org/10.1037/apl0000236
Schmidt, F. L., & Oh, I. (2015). The validity and utility of selection methods in personnel psychology: Practical and theoretical implications of 100 years of research findings. Working Paper. https://doi.org/10.13140/RG.2.2.18843.26400
Schmitt, N. (Ed.). (2012). The Oxford handbook of personnel selection and assessment. New York: Oxford University Press.
Steiger, J. H., & Lind, J. C. (1980). Statistically-based tests for the number of common factors. Paper presented at the Annual Meeting of the Psychometric Society, Iowa City, IA.
Tabachnick, B. G., & Fidell, L. S. (2007). Using multivariate statistics (7th edn.). New York: Pearson.
Tett, R. P., & Burnett, D. D. (2003). A personality trait-based interactionist model of job performance. Journal of Applied Psychology, 88(3), 500–517. https://doi.org/10.1037/0021-9010.88.3.500
Thornton, G. C., & Gibbons, A. M. (2009). Validity of assessment centres for personnel selection. Human Resource Management Review, 19(2), 169–187. https://doi.org/10.1016/j.hrmr.2009.02.002
Thornton, G. C., III, Mueller-Hanson, R. A., & Rupp, D. E. (2017). Developing organizational simulations: A guide for practitioners, students, and researchers. London: Routledge.
Thornton, G. C., III, & Rupp, D. E. (2012). Research into dimension-based assessment centers. In D. J. Jackson, B. J. Hoffman, & C. E. Lance (Eds.), The psychology of assessment centers (pp. 141–170). New York: Routledge.
Thornton, G. C., & Rupp, D. R. (2006). Assessment centers in human resource management: Strategies for prediction, diagnosis, and development. Mahwah, NJ: Lawrence Erlbaum Associates.
Wang, L., Fan, X., & Wilson, V. L. (1996). Effects of nonnormal data on parameter estimates and fit indices for a model with latent and manifest variables: An empirical study. Structural Equation Modeling, 3(3), 228–247. https://doi.org/10.1080/10705519609540042
Appendix 1
TABLE 1-A1: Bivariate correlations between assessment centre indicators.
Appendix 2
