Clinical validation of brief mental health scales for use in South African occupational healthcare

Orientation: South Africa carries a high burden of mental ill-health. Screening to identify individuals for further referral is emerging as one pathway to promote access to mental health interventions. Existing occupational health surveillance infrastructure may be a useful mechanism for clinical mental health screening. Research purpose: This study explored the clinical validity of a range of brief mental health measures in the context of occupational health surveillance. Motivation for the study: To meaningfully screen for mental health as part of occupational health surveillance, tools are required that are empirically validated, clinically useful, locally available and practical to administer. Research approach/design and method: Workers (n = 1816), recruited through workplace occupational health surveillance programmes, completed the Patient Health Questionnaire-9, Brief Symptom Inventory 18-somatisation subscale, Generalised Anxiety Disorder scale-7, Primary Care Post-Traumatic Stress Disorder Screen, Intense (panic-like) anxiety scale and CAGE scale and partook in a diagnostic interview with a clinical psychologist. Main findings: Basic psychometric characteristics were reported, including confirmatory factor analyses, measurement invariance, internal consistencies and socio-demographic effects. Clinical utility was explored through receiver operating/operator characteristics curve analyses, and calculations of positive and negative predictive values, as well as sensitivity and specificity. These indicators provided evidence of clinical validity in the study context.
Practical/managerial implications: The findings support the use of psychological screening as a brief, practicable and easily accessible mode of occupational mental health support. Contribution/value-add: This article presented evidence of structural and criterion validity for these scales and described their clinical application for practical use in occupational mental health surveillance.


Introduction

Orientation
The provision of mental health services in South Africa (SA) faces serious challenges, in part because of severe resource constraints (Docrat, Besada, Cleary, Daviaud, & Lund, 2019). It is in addressing the gap between the mental health needs and the availability of providers that screening, with the aim of identifying individuals for further referral or support, has emerged as one pathway to promote access to individuals who require mental health interventions. Existing occupational health surveillance mechanisms may be a useful vehicle for clinical mental health screening. Yet, to screen meaningfully and efficiently, tools are required that are empirically validated, clinically useful, locally available and practical to administer, and in doing so, meet the ethico-legal standards for the use of psychologically orientated screening measures in South African workplaces (Employment Equity Act, 1998). The combined set of brief mental health scales, as described under Methods, has not yet been validated for use in a South African occupational health surveillance context. Within SA, the provision of mental healthcare is associated with a number of challenges, predominantly around insufficient resources and disparities in access to care (Docrat et al., 2019). This includes a lack of care providers (e.g. psychiatric nurses, psychiatrists, psychologists), as well as infrastructure (e.g. psychiatric beds, capacity at primary healthcare clinics). For example, less than 5% of the SA health budget is spent on mental health and less than 8% of that at primary healthcare level. Furthermore, in the public sector, there are only 0.31 psychiatrists, 0.79 psychologists and 1.83 social workers per 100 000 population, and it is estimated that only 1 in 10 people in SA living with a mental health condition receive the care they need (Docrat et al., 2019). 
Budgetary and capacity limitations in the public health system often translate to an ongoing lack of mental healthcare infrastructure, including mechanisms to appropriately identify and respond to mental illness in the population at large.
In the workplace, poor mental health is associated with significant costs, both human and economic (Mall et al., 2015; Schoeman, 2017; Stander, Bergh, Miller-Janson, De Beer, & Korb, 2016; Zungu, 2013). For example, major depression and anxiety disorders are estimated to cause a loss of earnings of R54 000 per affected adult per year, with the total cost to the South African economy amounting to more than R40-billion annually (Schoeman, 2017). Other reports suggest that one in four employed workers or managers have been formally diagnosed with depression (Stander et al., 2016). Another local estimate implicated substance abuse in 50% of SA workplace accidents (McCann et al., 2011). In addition, international studies reported an increased risk for workplace accidents and injuries where common mental disorders (CMD) are present (Hilton & Whiteford, 2010; Kessler, Lane, Stang, & Van Brunt, 2009; Palmer, D'Angelo, Harris, Linaker, & Coggon, 2014; Soares, Gelmini, Brandão, & Silva, 2018). Apart from the economic costs, poor mental health has personal implications, from the demands on individuals to manage their conditions, to reduced personal accomplishment and sense of self-worth, as well as the challenges of dealing with perceived stigma at work.
Employers have developed greater awareness of the deleterious effects of poor mental health on human resource management and corporate success and have sought to establish mechanisms to actively address this through Employer Assistance or Employer Well-being Programmes. Parallel to these programmes, occupational health surveillance is already a statutory requirement in the workplace and continues to expand to include some form of mental health monitoring. Adapting existing occupational health surveillance infrastructure for mental health screening could become an efficient point of entry to enable the streaming of identified individuals towards appropriate mental health services, and thus substantially contribute to the identification of need and timeous referral for intervention. Mental health screening could include any process aimed at identifying, amongst groups of people, individuals at risk for poor mental health, in order to allow the streaming of those in need towards further assessment or intervention. Screening is typically brief and aimed at identifying need for referral, rather than at making a diagnosis.

Rationale and aim
To integrate the identification of mental health needs into existing occupational health surveillance infrastructure, appropriate screening tools are required. A large number of potential measures are available (cf. Mulvaney-Day et al., 2017, for review), but few have been adequately studied in local SA settings and particularly in the context of occupational healthcare provision. Neither their fair and unbiased use (Employment Equity Act, 1998) nor their clinical validity (i.e. accuracy in identifying risk) has been established in this context. There is agreement that validation is a constant process, involving a continuum of evidentiary support, including evidence of internal structures and effects of context and sample characteristics (AERA, 2014; EFPA, 2013; Schaap & Kekana, 2016). Before any screening scales can be used with confidence, evidence of validity in local settings is therefore required. This study investigated six measures, described in the Methods section, which speak to the mental health conditions most often encountered in the workplace, and aimed to provide evidence of validity for this specific set of tools in the specific context of occupational mental health surveillance.

Research approach
This study followed a cross-sectional survey design and quantitatively analysed data obtained through the completion of psychological scales.

Research method

Participants
Participants were recruited through workplace occupational health surveillance programmes and invited to complete the questionnaire booklet and partake in an interview during their annual occupational health assessments. All participants (n = 1816) gave written informed consent to the process.
Participants were included in the study if they had a minimum of 9 years of formal schooling. This was to ensure a level of English proficiency sufficient to complete the mental health measures described here. Their ages ranged from 20 to 60 (M = 33.8, ± 8.2) and 37.4% were women. All participants were in full-time salaried employment and comprised skilled and semi-skilled workers. Further composition of the sample (home language, occupational background) is provided in Table 1. The data were from workers across multiple sites in different provinces. The sample does not necessarily represent any specific community or industry in SA, as it was a convenience sample, recruited from a range of industries and geographical locations. Data collection took place from January to December 2019.

Instruments
The six measures, described here, purport to measure mental health conditions most commonly encountered in the workplace. Most of them have internationally reported evidence of validity and have previously been studied in local populations (often in primary healthcare) or are currently being used in SA industry. All were available in the public domain and could be readily reproduced. Table 2 provides an overview of the internal reliability and diagnostic accuracy of these scales, whilst Table 3 provides an overview of dimensionality and confirmatory factor analysis data. All the scales are screening tools and not intended to confirm clinical diagnoses.

Patient Health Questionnaire-9
Major depressive disorder is a syndrome characterised by severe and persistent low mood, profound sadness, sense of despair or anhedonia (APA, 2021). The Patient Health Questionnaire-9 (PHQ-9) is a screening, diagnostic and monitoring tool that measures the severity of depression in primary care settings (Kroenke, Spitzer, & Williams, 2001;Spitzer, Kroenke, & Williams, 1999). Each item is scored on a range from 0 (not at all) to 3 (nearly every day), with higher scores indicating higher levels of depression. The 9-item scale has high internal consistency (see Table 2) and good test-retest reliability in Western (r = 0.84, Kroenke et al., 2001) and African (r = 0.90, Adewuya, Ola, & Afolabi, 2006;r = 0.75, Weobong et al., 2009) samples. The scale rates the frequency of symptoms and Question 9 screens for the presence and duration of suicide ideation. It has a follow up, non-scored question that assigns weight to the degree to which depressive symptoms have affected a patient's level of functioning. This 10th item was not completed at this study's participating sites.

Generalised Anxiety Disorder scale-7
Generalised anxiety disorder is a syndrome characterised by excessive and uncontrollable anxiety and worry about a range of concerns (APA, 2021). The Generalised Anxiety Disorder scale-7 (GAD-7) is a screening, diagnostic and monitoring tool that measures the severity of generalised anxiety in primary care settings (Spitzer, Kroenke, Williams, & Lowe, 2006). Each item is scored on a range from 0 (not at all) to 3 (nearly every day), with higher scores indicating higher levels of anxiety. The 7-item scale has high internal consistency (see Table 2) and good test-retest reliability (r = 0.83, Spitzer et al., 2006). The scale rates the frequency of symptoms, and a follow up, non-scored question assigns weight to the degree to which anxiety symptoms have affected a patient's level of functioning. This item was not completed at this study's participating sites.
The United States of America validation study demonstrated good sensitivity and specificity for GAD (Spitzer et al., 2006). Subsequent studies also reported generally good specificity and a range of sensitivities across samples (Zhong et al., 2015). Further reports indicated high sensitivity and good specificity for also detecting panic disorder, social anxiety disorder and PTSD (Kroenke et al., 2007). An optimal cut-point for any anxiety disorder was established as ≥ 9 (Kroenke et al., 2007) and ≥ 10 for GAD in Western samples (García-Campayo et al., 2010; Kroenke et al., 2007; Spitzer et al., 2006). Optimal cut-points for practical use in SA have not yet been established.

Primary care post-traumatic stress disorder screen for DSM-5
Post-traumatic stress disorder is a syndrome that develops subsequent to exposure to traumatic events where an individual believed that there was a threat to life or physical integrity and safety and is characterised by a range of symptom clusters (APA, 2021). The primary care post-traumatic stress disorder screen for DSM-5 (PC-PTSD-5) was developed as a brief screen for PTSD in primary care settings using updated DSM-5 criteria. The 5-item screen enquires about the presence or absence of core PTSD symptoms, namely intrusive memory, avoidance, alterations in cognition and mood, and alterations in arousal and reactivity. The scale has high internal consistency (Table 2; Bovin et al., 2021; Jung et al., 2018), with good test-retest reliability (r = 0.89) and concurrent validity reported (Jung et al., 2018). The PC-PTSD-5 has demonstrated excellent diagnostic accuracy (see Table 2), with a cut-point of ≥ 3 offering optimal sensitivity and specificity (Bovin et al., 2021; Jung et al., 2018; Prins et al., 2016). The only report of previous use in SA that could be located (with emergency medical personnel; Van Wijk et al., 2020) did not report psychometric data.

Intense (panic-like) anxiety
Panic disorder is a syndrome characterised by repeated episodes of sudden onset intense apprehension and fearfulness in the absence of actual danger, accompanied by a range of discomforting physical symptoms (APA, 2021). The 2-item scale for panic-like anxiety came from the Guide for Aviation Medical Examiners (SACAA, 2017; the third CAA item, seeking urgent medical advice because of anxiety, was not included). The 2-item scale focuses on the sudden and intense experience of anxiety symptoms, as well as unexplained physical sensations associated with anxiety. A YES answer to either item would result in referral for further assessment. It was included based on its current use in industry, although no studies of its usefulness could be located.

CAGE scale
Alcohol use disorder is a catch-all diagnosis encompassing varying degrees of excessive use of alcohol (including abuse and dependence) (APA, 2021). Problematic alcohol use was determined using the 4-item CAGE (Ewing, 1984). The CAGE questionnaire has been extensively evaluated for use in identifying alcoholism and is considered a validated screening technique (cf. Dhalla & Kopec, 2007, for review). High sensitivity and specificity were reported for the identification of excessive, that is, problem, drinking, as well as for the identification of alcoholism (Table 2; Claassen, 1999; see also Williams, 2014, for review). High test-retest reliability has also been described (r > 0.80; Dhalla & Kopec, 2007). Historically, various cut-scores have been proposed based on different demographic factors (e.g. gender), and currently a score of ≥ 2 is generally considered indicative for concern (i.e. for alcohol dependence; Dhalla & Kopec, 2007;O'Brien, 2008;Vissoci et al., 2018;Williams, 2014). Studies from sub-Saharan Africa (Table 2 and Table 3) suggested good diagnostic utility (Claassen, 1999) and good internal reliability and unidimensional structure (Vissoci et al., 2018). In spite of its widespread use in SA (Labadarios, 2018), no reports on local validation of cut-points for the English version could be located.

Procedure
The measures were collated into booklet form, in the order presented here. Two measures, namely the BSI-18-S and the panic-like anxiety screen, were discontinued before the end of the study period and were removed from subsequent booklets. Each scale was presented in English, using the standard format and administration described in the respective source materials (e.g. manuals).
Each participant also partook in a clinical interview. This was conducted by clinical psychologists, who assessed, using DSM-5 criteria, the presence of disorders of mood (i.e. MDD), anxiety (i.e. GAD, panic disorder, PTSD), or substances (i.e. AUD). The assessment focussed on the specific syndromes listed here, and was therefore not inclusive of all presentations of poor mental health. Other conditions were identified and noted but not included in this study. Despite extensively reported criticisms (cf. Lynch, 2018, for overview), the DSM-5 (APA, 2013) remains the gold standard for clinical diagnostic purposes. In contrast to initial concerns (Chmielewski, Clark, Bagby, & Watson, 2015), excellent inter-rater interview-based diagnostic reliability (kappa > 0.70) has been reported for experienced psychiatrists and psychologists (Osório et al., 2019). The psychologists involved in the present study had at least 5 years' experience in the occupational health surveillance context. The purpose of the interview assessment was to act as the reference standard (i.e. criterion measure) against which to evaluate the clinical utility of the brief mental health scales. Interviews took place within 24 hours of completing the screening booklet, and participants were allowed time off work to attend the interview. This study was incorporated into an ongoing occupational health screening programme, and the scales and psychological interview were administered as part of an annual occupational mental health review. Responses were entered into a spreadsheet, coded where appropriate and then irreversibly anonymised.
To ensure consistency of data gathering, three study review points were planned, the first at 300 cases, the second at 750 cases and the third at 1500 cases. Participating psychologists were further encouraged to share their clinical impressions and other concerns during monthly group supervision meetings. One purpose of the reviews was to consider the clinical usefulness of the scales and to discontinue any scale that was deemed to contribute little to the process or that created an undue burden on the clinicians or participants. As a result of the health provision focus of the screening programme, interpretation of clinical utility was skewed towards practical impact rather than psychometric characteristics. As mentioned earlier, two scales were discontinued early, and the available data for them will be reported under the Results section. All 1816 participants completed the remaining four scales, whilst only some completed the additional two scales prior to their discontinuation.

Statistical analysis
All statistical analyses, with the exclusion of the confirmatory factor analysis (CFA), were conducted with SPSS (version 27). Internal consistencies of the scales were examined with Cronbach's alpha, inter-item correlations and corrected item-total correlations. Mplus 8.6 was used for the CFAs, which assessed unidimensionality and multigroup measurement invariance (Muthén & Muthén, 2017).
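The internal consistency indicators named above can be illustrated with a short sketch. It is written in Python rather than SPSS and is not the study's own procedure; the function names and the `responses` data (six hypothetical respondents answering three items scored 0 to 3) are assumptions made purely for illustration.

```python
from statistics import mean, variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of respondents, each a list of item scores."""
    k = len(rows[0])                              # number of items
    items = list(zip(*rows))                      # one tuple of scores per item
    sum_item_var = sum(variance(col) for col in items)
    total_var = variance([sum(r) for r in rows])  # variance of the scale totals
    return k / (k - 1) * (1 - sum_item_var / total_var)


def pearson_r(x, y):
    """Pearson correlation between two equally long sequences."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def corrected_item_total(rows):
    """Correlate each item with the sum of the remaining items."""
    result = []
    for i in range(len(rows[0])):
        item = [r[i] for r in rows]
        rest = [sum(r) - r[i] for r in rows]      # total score excluding item i
        result.append(pearson_r(item, rest))
    return result


# Hypothetical responses, not study data.
responses = [[0, 0, 1], [1, 1, 1], [2, 1, 2], [3, 3, 2], [1, 2, 1], [0, 1, 0]]
print(round(cronbach_alpha(responses), 3))
print([round(r, 3) for r in corrected_item_total(responses)])
```

Removing the item itself from the total before correlating is what makes the item-total correlation "corrected"; without that step each item would be correlated partly with itself.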
Dimensionality of the PHQ-9, GAD-7, PC-PTSD-5 and CAGE was examined with CFA. It was expected that the four scales would exhibit unidimensionality, that is, items or indicators loading highly on one latent factor each. All items were examined for distribution properties and deviation from normality. Skewness and non-normality influence the type of estimator used in the CFA. Usually, maximum likelihood (ML) is used, but for skewed and non-normal data the estimators need to be robust, and the choice depends, amongst others, on the nature of the indicators (Brown, 2015). Thus, for continuous variables (PHQ-9, GAD-7), the robust maximum likelihood (MLR) estimator was used, and for categorical responses (PC-PTSD-5, CAGE), the weighted least squares, mean and variance-adjusted (WLSMV) estimator was used (Muthén & Muthén, 2017). Ideally, the global fit χ² would be small and non-significant. Because this is rarely achieved, the following indices and cut-points were also taken into consideration. The standardised root mean square residual (SRMR) indicates good fit at < 0.08 (Schreiber, Nora, Stage, Barlow, & King, 2006). The root mean square error of approximation (RMSEA) should be < 0.06 to < 0.08 for continuous data and < 0.06 for categorical data. Both the comparative fit index (CFI) and the Tucker-Lewis index (TLI) should be > 0.95.
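The cut-points listed above can be collected in a small helper for checking a fitted model. This is a minimal sketch in Python, not part of the study's Mplus workflow; the function name `check_fit` and the example values are hypothetical, and the more lenient RMSEA bound (< 0.08) is assumed for continuous indicators.

```python
def check_fit(cfi, tli, rmsea, srmr, categorical=False):
    """Compare fit indices with the cut-points described in the text.

    Returns a dict mapping each index to True when the cut-point is met.
    """
    # Assumption: the upper end of the stated range is used for continuous data.
    rmsea_cut = 0.06 if categorical else 0.08
    return {
        "CFI": cfi > 0.95,
        "TLI": tli > 0.95,
        "RMSEA": rmsea < rmsea_cut,
        "SRMR": srmr < 0.08,
    }


# Hypothetical fit indices for a continuous-indicator model, not study results.
example = check_fit(cfi=0.96, tli=0.93, rmsea=0.05, srmr=0.04)
print(example)
```

A helper of this kind only flags indices against fixed thresholds; deciding whether a marginal miss is acceptable remains a matter of judgement.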
Local indications of misfit are the size of the standardised residuals and modification indices > 4 (Brown, 2015;Hair, Black, Babin, & Anderson, 2019). Usual indications of local problems on the models are standardised factor correlations out of range, negative error and factor variances, the significance of factor loadings, the size of parameter estimates and the reliability of indicators indicated by percentage of variance accounted for by the latent factors (indicated by the R-square of indicators). Modification indices should indicate no covariance between error variances, in this case referred to as within-construct error covariance (Hair et al., 2019).
Measurement invariance is a crucial aspect to assess for scales, especially if scores need to be compared across groups, whether they are language, gender or multicultural groups. Researchers often compare groups on test scores without considering measurement invariance (Brown, 2015). Scales need to be invariant with respect to the way the latent constructs are formed (configural invariance), the indicators or items should load similarly on latent factors across the groups (metric invariance), and lastly, the origin of each indicator should be the same across groups; that is, indicators should have similar slopes (metric invariance) and similar origins on the y-axis, or intercepts (Wang & Wang, 2020). Testing for intercept invariance is called scalar equivalence. Thus, the process of testing for measurement invariance is, firstly, to look at the performance of a model in each subgroup sample (single group solutions) (see Table 4). Modifications to models may be made at this stage, but if the groups' models differ in terms of specifications, one would be testing for partial measurement invariance (Byrne, 2012). Secondly, both groups are tested for factor structure (configural invariance), then for metric and scalar invariance. It is a hierarchical process; thus, one cannot proceed to the next nested model if model fit at the previous level fails (Kline, 2016). If modifications to the models can be substantiated, then the next level will be tested for partial measurement invariance given the restrictions placed on the model (Byrne, 2012).
The requirement for invariance is that the difference in global χ² between hierarchical models is not significant. For the estimators used in this study, namely MLR for continuous indicators and WLSMV for binary indicators, the Satorra-Bentler correction for the difference between successive models was calculated, because the uncorrected differences do not follow a χ² distribution (Kline, 2016; Muthén & Muthén, 2017).
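The scaled difference test can be sketched as follows. This is an illustrative Python version of the commonly documented Satorra-Bentler scaled χ² difference for robust maximum likelihood output, not the study's own code. It assumes that each model's scaled χ², scaling correction factor and degrees of freedom are available from the software output. Differences between WLSMV models are typically handled by a separate procedure in Mplus (the DIFFTEST option) and are not covered here. All function names and input values are hypothetical.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        # Even df: closed-form finite sum.
        term = total = math.exp(-h)
        for j in range(1, df // 2):
            term *= h / j
            total += term
        return total
    # Odd df: start from the df = 1 case and add the recurrence terms.
    total = math.erfc(math.sqrt(h))
    for j in range((df - 1) // 2):
        total += math.exp(-h) * h ** (j + 0.5) / math.gamma(j + 1.5)
    return total


def scaled_chi2_difference(t0, c0, d0, t1, c1, d1):
    """Scaled difference between a nested model (0) and a comparison model (1).

    t = scaled chi-square, c = scaling correction factor, d = degrees of freedom.
    Model 0 is the more constrained model, so d0 > d1.
    """
    cd = (d0 * c0 - d1 * c1) / (d0 - d1)   # difference-test scaling correction
    trd = (t0 * c0 - t1 * c1) / cd         # scaled difference statistic
    ddf = d0 - d1
    return trd, ddf, chi2_sf(trd, ddf)


# Hypothetical metric (model 1) versus scalar (model 0) comparison.
trd, ddf, p = scaled_chi2_difference(t0=100.0, c0=1.2, d0=60, t1=80.0, c1=1.1, d1=52)
print(round(trd, 2), ddf, round(p, 3))
```

A non-significant p-value would support moving to the next level of invariance; a significant one would prompt inspection of modification indices for partial invariance, as described above.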
The measurement invariance of the PHQ-9, GAD-7, PC-PTSD-5 and CAGE was evaluated first for gender (men and women; see Table 4) and then for language (English first language speakers and English second language speakers; Table 4). In each instance, the group model results were provided as single group solutions, and then in the order of configural, metric and scalar invariance. The measurement invariance of the PC-PTSD-5 and the CAGE included only configural and scalar invariance because of the binary or categorical nature of their responses (Brown, 2015).
Criterion validity (and for the purpose of this study, also clinical utility) was explored through receiver operating/operator characteristics (ROC) curve analyses, and positive and negative predictive values were calculated. Sensitivity and specificity data were calculated to address optimal cut-points for use in clinical practice. Receiver operating characteristics analysis is used to evaluate diagnostic tests and predictive models by plotting sensitivity against 1 − specificity across the possible cut-points of a classification test, with overall accuracy expressed as the area under the curve (AUC). An AUC ≥ 0.70 is considered fair, ≥ 0.80 considered good, and ≥ 0.90 excellent (Safari, Baratloo, Elfil, & Negida, 2016). Sensitivity refers to the ability of a test to correctly identify persons with a condition, whilst specificity refers to the ability of a test to correctly identify people without the condition. Positive predictive value is the probability that persons with a positive screening test truly have the condition, whilst negative predictive value refers to the probability that persons with a negative screening test truly do not have the condition.
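The definitions above translate directly into counts from a two-by-two table of screening result against interview diagnosis. The following Python sketch illustrates them; it is not the SPSS procedure used in the study, and the function names, scores and diagnoses are hypothetical. The AUC is computed here as the proportion of case and non-case pairs in which the case has the higher score, with ties counted as half, which is equivalent to the area under the empirical ROC curve.

```python
def screening_metrics(scores, has_condition, cut_point):
    """Sensitivity, specificity, PPV and NPV for a 'score >= cut_point' rule."""
    tp = fp = tn = fn = 0
    for score, diagnosed in zip(scores, has_condition):
        positive = score >= cut_point
        if positive and diagnosed:
            tp += 1
        elif positive and not diagnosed:
            fp += 1
        elif not positive and diagnosed:
            fn += 1
        else:
            tn += 1
    # Note: each ratio is undefined if its denominator is zero.
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def auc(scores, has_condition):
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    cases = [s for s, d in zip(scores, has_condition) if d]
    non_cases = [s for s, d in zip(scores, has_condition) if not d]
    wins = sum(
        1.0 if c > n else 0.5 if c == n else 0.0
        for c in cases
        for n in non_cases
    )
    return wins / (len(cases) * len(non_cases))


# Hypothetical scale scores and interview outcomes, not study data.
scores = [2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 14, 18]
diagnosed = [0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1]
metrics = screening_metrics(scores, diagnosed, cut_point=10)
print(metrics)
print(round(auc(scores, diagnosed), 3))
```

Repeating `screening_metrics` over a range of cut-points shows the trade-off between sensitivity and specificity that underlies the choice of a cut-point. Predictive values also depend on how common the condition is in the sample, so they do not transfer directly to settings with a different prevalence.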
After measurement invariance was examined, sociodemographic effects were further explored using Pearson's correlation coefficients (for age effects) and t-tests for independent samples (for gender and language effects). Age and gender effects were previously reported (as discussed here), and this analysis served to explore whether different interpretative values (e.g. cut-points) might be required for different groups. Psychological scales often contain abstract concepts and in this sample were administered in English to a multi-language population. To explore the fairness of the scales, particularly for screening purposes, across different home languages (but with at least Grade 9 English literacy), the sample was divided into two groups, namely English first language (18.9% of the sample) and Non-English first

Ethical considerations
The study has been approved by Stellenbosch University's Health Research Ethics Committee (#N20/07/078). All participants (n = 1816) gave written informed consent to the process and researchers only had access to de-identified data for analysis.

Results
Indicators of scale dimensionality are reported in Table 3, indicators of measurement invariance analysis are given in Table 4 and socio-demographic effects and criterion validity markers are presented in Table 5. Detailed sensitivity and specificity figures are presented in Table 6. Across all measures, age correlated significantly with scores. All the age correlations were negative, with very small effect sizes. Cronbach's alphas are reported in Table 5 and in no case did alpha improve through the deletion of items.
Although none of the four models tested for unidimensionality obtained a non-significant χ², the values were not excessively high. All other fit indices met the cut-points provided earlier. The TLI (0.93) for the PHQ-9 was an exception, but its CFI was close enough to 0.95. The RMSEA was sufficiently small (0.04-0.05) for all models except the PC-PTSD-5, which still reached the criterion of < 0.06. The SRMR was smaller than 0.06 for all models. It can be accepted that all models exhibited sufficient fit to be evaluated as unidimensional scales (see Table 3). The details per scale are presented here.
In terms of measurement invariance, Table 4 shows that the single group solutions for men and women did not obtain a non-significant global χ², although SRMR values were smaller than 0.08 and RMSEA was sufficiently low. Both CFI and TLI were in the region of 0.9, and the model appeared to fit the smaller sample of women less well than the sample of men. The details per scale are presented here. Details of the measurement invariance process for language for the four instruments are also presented in Table 4 and described here.

Patient Health Questionnaire-9
Acceptable Cronbach's alpha (Table 5) and corrected item-total correlations (Figure 1) were found, with inter-item correlations ranging from 0.22 to 0.60. During the CFA, the Patient Health Questionnaire-9 (PHQ-9) showed two modification indices higher than 20 for covariance between indicator error variances. Only substantive reasons would justify freeing these two within-construct error covariances for estimation (Hair et al., 2019). The content of the items, although somewhat related, would not warrant such a decision (Byrne, 2012). Standardised loadings were relatively uniform and high, ranging from 0.58 to 0.76, with Item 9 at 0.47. The scale demonstrated significant parameters (Table 3), low error and high communality.
The PHQ-9 showed configural and metric invariance for men and women (Δχ² = 7.7, Δdf = 8) but did not reach scalar invariance (Δχ² = 26.1, Δdf = 8, p < 0.001). However, examination of the modification indices showed that Item 4 influenced invariance. Its intercept was therefore freely estimated whilst the remaining intercepts were held equal (Δχ² = 8.3, Δdf = 7). Note that the amended scalar model was compared with the metric model. Thus, the PHQ-9 achieved partial measurement invariance on the scalar level for gender.
The models for English first language speakers and for English second language speakers showed adequate fit for the RMSEA and SRMR indices. The smaller group of English first language speakers showed a CFI = 0.884 and TLI = 0.845, whilst the larger group of second language speakers was above 0.9 for the same indices. The global χ² was significant for both groups (p < 0.001). Full measurement invariance for language was demonstrated at the configural, metric (Δχ² = 5.0, Δdf = 8) and scalar levels (Δχ² = 8.0, Δdf = 8).
The PHQ-9 correlated significantly with the GAD-7, PC-PTSD-5 and CAGE and there were also significant comorbidities between MDD and GAD, PTSD and AUD (Table 7).
Excellent AUC was found (Table 5) and optimal sensitivity and specificity were obtained around a cut-point of ≥ 10 (Table 6). No significant language effects were found, but there was a significant gender effect, where women reported more severe mood symptoms (Cohen's d = 0.18; mean difference = 0.6) and accounted for proportionally more cases. Given the partial scalar invariance, small effect size and small mean difference, it did not appear practically useful to develop separate cut-points for women and men.
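For readers less familiar with the effect size reported above, Cohen's d expresses a mean difference in units of standard deviation. The Python sketch below assumes the pooled standard deviation form of the statistic, which the text does not specify; the function name and the two small groups of scores are hypothetical and unrelated to the study data.

```python
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d using the pooled standard deviation of two independent groups."""
    na, nb = len(group_a), len(group_b)
    pooled_var = (
        (na - 1) * variance(group_a) + (nb - 1) * variance(group_b)
    ) / (na + nb - 2)
    return (mean(group_a) - mean(group_b)) / pooled_var ** 0.5


# Hypothetical scores for two groups.
d = cohens_d([2, 4, 6], [1, 3, 5])
print(d)
```

By common convention, values of d around 0.2 are regarded as small, which is consistent with the interpretation given above.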

Brief Symptom Inventory-18-S
A progress review after 350 cases found little usefulness of this scale. There was a poor association with clinical outcomes, identifying only 50% of interview-determined cases of psychological distress (i.e. defined for this purpose as any DSM-5 disorder) and poor internal consistency. There were moderate correlations with other scales (PHQ-9: r = 0.526, p < 0.001; GAD-7: r = 0.364, p < 0.001), which all displayed better sensitivity and specificity. As a result of its poor clinical utility, its use was discontinued after 352 cases.

Generalised Anxiety Disorder scale-7
Acceptable Cronbach's alpha (Table 5) and corrected item-total correlations (Figure 1) were found, with inter-item correlations ranging from 0.46 to 0.72. During the CFA, the GAD-7 showed a similar pattern to the PHQ-9, with two high within-construct error covariances, but the same argument against freeing these parameters applied. The standardised loadings were consistently uniform and high, ranging from 0.65 to 0.83. The scale demonstrated significant parameters (Table 3), low error and high communality, as indicated by R-square values for all indicators, and indicators loaded high on the latent factor, providing sufficient evidence for unidimensionality.
The GAD-7 single group solutions showed that the model for women exhibited good fit with a non-significant χ 2 (19.964, df = 14) and all other fit indices well over the recommended limits for good fitting models. Except for the global χ 2 (52.0, df = 14, p < 0.001) the remainder of the fit indices also indicated a good fitting model for men. Similar results to the previous test were found with respect to measurement invariance: the instrument exhibited both configural and metric invariance but partial scalar invariance when the intercept for Item 5 was freely estimated (Δχ 2 = 7.1, Δdf = 5).
The GAD-7 showed adequate fit for both language groups for the CFI, TLI, RMSEA and SRMR indices. The global χ² was significant for both groups, at the 0.05 level for the smaller English first language group and at p < 0.001 for the larger group. The GAD-7 achieved configural
Inter-scale correlations for PHQ-9, GAD-7, PC-PTSD-5, and CAGE.
The GAD-7 correlated significantly with the PHQ-9, PC-PTSD-5 and CAGE, and there were also significant comorbidities between GAD and MDD, PTSD and AUD (Table 7).
Excellent AUC was found (Table 5) and good sensitivity and specificity were obtained around a cut-point of ≥ 9. In this sample, specificity was marginally improved (whilst maintaining sensitivity) when a score of ≥ 10 was used as cut-point (see Table 6). No significant language effects were found, but there was a significant gender effect, where women reported more severe anxiety symptoms (Cohen's d = 0.12; mean difference = 0.4). Given the partial scalar invariance, small effect size and small mean difference, it did not appear practically useful to develop separate cut-points for women and men. The GAD-7 cases also included all cases of panic disorder and most cases of PTSD and were thus possibly more indicative of 'any' anxiety disorder than GAD only.
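The comparison of cut-points described above can be illustrated with a short Python sketch that tabulates sensitivity and specificity against a reference diagnosis. The function name, scores and interview outcomes below are hypothetical and were chosen only to show how raising a cut-point can improve specificity without changing sensitivity; they are not study data.

```python
def sensitivity_specificity(scores, is_case, cut_point):
    """Sensitivity and specificity when scores >= cut_point count as a positive screen."""
    tp = sum(1 for s, c in zip(scores, is_case) if c and s >= cut_point)
    fn = sum(1 for s, c in zip(scores, is_case) if c and s < cut_point)
    tn = sum(1 for s, c in zip(scores, is_case) if not c and s < cut_point)
    fp = sum(1 for s, c in zip(scores, is_case) if not c and s >= cut_point)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical GAD-7 totals and interview outcomes (illustrative only).
scores = [2, 4, 9, 9, 11, 14, 3, 10, 16, 5]
is_case = [False, False, False, False, True, True, False, True, True, False]

for cut in (9, 10):
    sens, spec = sensitivity_specificity(scores, is_case, cut)
    print(f"cut-point >= {cut}: sensitivity={sens:.2f}, specificity={spec:.2f}")
```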

Primary care post-traumatic stress disorder screen for DSM-5
Cronbach's alpha was acceptable for research, but only borderline sufficient for clinical use (Table 5). Acceptable corrected item-total correlations (Figure 1) were found and inter-item correlations ranged from 0.22 to 0.50. During the CFA, the PC-PTSD-5 had no modification indices above 4 and standardised loadings ranged from 0.74 to 0.89. The scale demonstrated significant parameters (Table 3), low error, high communality as indicated by R-Square values for all indicators, and loaded high on each latent factor, providing sufficient evidence for unidimensionality.
The PC-PTSD-5 models for both men and women showed good fit indices with CFI, TLI, RMSEA and SRMR well within the limits for good fitting models. The model for women achieved a non-significant global χ² (9.903, df = 5). The instrument achieved both configural and scalar invariance (Δχ² = 3.0, Δdf = 3) for gender.
The model fit for PC-PTSD-5 English first language speakers could not be determined because the residual covariance matrix (theta) was not positive definite, a problem that involved indicator Item 5. The global χ² for English second language speakers was significant (χ² = 29.378, df = 5, p < 0.001). The CFI, TLI, RMSEA and SRMR indices for the second language single-group model were within acceptable limits. As a result of the undefined R-Square for indicator Item 5 in the English first language speaking group, measurement invariance across language could not be evaluated for the PC-PTSD-5.
The PC-PTSD-5 correlated significantly with the PHQ-9, GAD-7 and CAGE, and there were also significant comorbidities between PTSD and MDD and GAD (Table 7). Excellent AUC was reported (Table 5), and optimal sensitivity and specificity were obtained around a cut-point of ≥ 3 (see Table 6). No significant gender or language effects on mean scores were observed.
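The AUC values reported for these scales can be interpreted as the probability that a randomly chosen interview-confirmed case scores higher than a randomly chosen non-case. A minimal Python sketch of that rank-based definition is given below; it is not the software procedure used in the study, and the function name and data are hypothetical.

```python
def roc_auc(scores, is_case):
    """AUC as the probability that a random case outscores a random non-case (ties count half)."""
    cases = [s for s, c in zip(scores, is_case) if c]
    non_cases = [s for s, c in zip(scores, is_case) if not c]
    wins = 0.0
    for case_score in cases:
        for other_score in non_cases:
            if case_score > other_score:
                wins += 1.0
            elif case_score == other_score:
                wins += 0.5
    return wins / (len(cases) * len(non_cases))


# Hypothetical PC-PTSD-5 totals (0 to 5) and interview outcomes (illustrative only).
scores = [0, 1, 3, 4, 2, 5, 0, 3, 1, 4]
is_case = [False, False, False, True, False, True, False, True, False, False]
print(round(roc_auc(scores, is_case), 2))
```

A value of 0.5 corresponds to chance-level discrimination and 1.0 to perfect separation of cases from non-cases.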

Panic-like anxiety
Early feedback from participating psychologists indicated scepticism regarding the usefulness of this scale, and after a progress review of the first 746 cases, it was discontinued because of poor specificity. Participants reported intense anxiety more often than could be clinically diagnosed, with fewer than 40% of YES responses (to either item) associated with any actual diagnosis. All interview-confirmed panic cases were also identified through the GAD-7. Feedback from participating psychologists suggested that the high rate of false positives was more an indicator of non-pathological general psychological distress, rather than reflective of actual panic-like experiences.

CAGE
Cronbach's alpha was acceptable for research but not sufficient for clinical use (Table 5). Acceptable corrected item-total correlations (Figure 1) were found and inter-item correlations ranged from 0.35 to 0.63. During the CFA, the CAGE had no modification indices above 4. The standardised loadings were above 0.7 and topped out at above 0.9. The scale demonstrated significant parameters (Table 3), low error, high communality as indicated by R-Square values for all indicators and loaded high on each latent factor, providing sufficient evidence for unidimensionality.
The single group model for men fit extremely well with a non-significant χ² (5.251, df = 2), but the model for women could not be determined because the residual covariance matrix (theta) was not positive definite, a problem that involved indicator Item 3. Thus, measurement invariance for the CAGE could not be determined as a result of the undefined R-Square for indicator Item 3 in the women's group.
Again, the model fit for the English first language speakers could not be determined because the residual covariance matrix (theta) was not positive definite, a problem that involved indicator Item 3. The single-group model for the larger group, namely the English second language speakers, yielded extremely good fit (χ² = 5.834, df = 2, p > 0.05), and the CFI, TLI, RMSEA and SRMR indices indicated very good fit. The measurement invariance for the CAGE could not be determined as a result of the undefined R-Square for indicator Item 3 in the English first language speaking group.
The CAGE correlated significantly with the PHQ-9, GAD-7 and PC-PTSD-5, and there were also significant comorbidities between AUD and MDD and GAD (Table 7).
Good AUC was reported and highest sensitivity and specificity were obtained around a cut-point of ≥ 2 (Table 5). There was a significant gender effect, where men reported more indicators of problematic alcohol use (Cohen's d = 0.39; mean difference = 0.2) and proportionally more cases. There was also a significant language effect, where non-English first language speakers reported more indicators of problematic alcohol use (Cohen's d = 0.32; mean difference = 0.2). Given that measurement invariance for gender and language could not be determined, combined with the small effect sizes and small mean differences, it did not appear practically useful to develop separate cut-points based on gender or language.

Measures for clinical consideration: evidence of validity and practical implications for occupational health screening
The first objective was to provide evidence of structural validity. In this regard, evidence of validity based on internal structure was found for all four scales. All four scales provided sufficient evidence for unidimensionality. Various degrees of measurement invariance were observed, with the two scales with binary responses not allowing for full measurement invariance to be determined. Inter-item correlations were generally acceptable and all item-total correlations exceeded 0.3. The PHQ-9 and GAD-7 had good internal reliability, the PC-PTSD-5 was acceptable and the CAGE alpha coefficient, whilst acceptable for research, was questionable for clinical use. Furthermore, evidence of validity based on relationships to other variables was demonstrated in the expected significant correlations between the four scales (Table 7 and Table 8).
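The internal reliability referred to above is indexed by Cronbach's alpha, which compares the sum of the individual item variances with the variance of the total score. The Python sketch below shows the standard formula; the function name and the response matrix are hypothetical and do not reproduce study data.

```python
from statistics import variance


def cronbach_alpha(item_scores):
    """Cronbach's alpha; item_scores is a list of respondents, each a list of item responses."""
    k = len(item_scores[0])
    # Variance of each item across respondents.
    item_vars = [variance([resp[i] for resp in item_scores]) for i in range(k)]
    # Variance of the respondents' total scores.
    total_var = variance([sum(resp) for resp in item_scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical yes/no (1/0) responses to a 4-item scale such as the CAGE.
responses = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(round(cronbach_alpha(responses), 2))
```

Because alpha depends on the number of items as well as their inter-correlations, very short scales tend to yield lower coefficients, which may be relevant when interpreting the values for the briefer scales.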
The second objective was to provide evidence of criterion validity and clinical utility. Evidence of validity based on test-criterion relationships was demonstrated through strong associations between scale scores and interview outcomes (as the reference standard; Table 5). Furthermore, the PHQ-9, GAD-7 and PC-PTSD-5 displayed excellent screening accuracy, and the CAGE good accuracy, in this setting. The high screening accuracy may in part be a sampling artefact, as the participants were drawn from organisations with more ingrained systems and cultures that educate, promote and screen for mental illness. Such organisations tend to have workforce populations with higher rates of mental health literacy, who are more adept at recognising and reporting mental ill-health (Lieberman, 2019). The four scales further reported high negative predictive value, suggesting low rates of missed identification of risk, a desirable characteristic for screening tools. The variable and poorer positive predictive value was likely because of the relatively low prevalence of CMD in this generally healthy sample (Ranganathan & Aggarwal, 2018).
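The dependence of predictive values on prevalence follows directly from Bayes' theorem and can be shown with a short Python sketch. The sensitivity, specificity and prevalence figures used below are hypothetical round numbers chosen for illustration, not estimates from this study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Positive and negative predictive value from Bayes' theorem."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    true_neg = specificity * (1 - prevalence)
    false_neg = (1 - sensitivity) * prevalence
    return true_pos / (true_pos + false_pos), true_neg / (true_neg + false_neg)


# Hypothetical screen with 90% sensitivity and 90% specificity,
# evaluated at two assumed prevalence levels.
for prevalence in (0.01, 0.10):
    ppv, npv = predictive_values(0.90, 0.90, prevalence)
    print(f"prevalence={prevalence:.2f}: PPV={ppv:.2f}, NPV={npv:.3f}")
```

With identical accuracy, the positive predictive value in this example falls from 0.50 at 10% prevalence to about 0.08 at 1% prevalence, whilst the negative predictive value remains above 0.98.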
On a practical level, the PHQ-9 results supported previous recommendations that scores ≥ 10 be considered as a positive screen for depression in low- and middle-income contexts, as well as in occupational health settings (Akena et al., 2012; Volker et al., 2016). For the GAD-7, marginally better specificity was obtained when the cut-point for any anxiety disorder was raised to ≥ 10 (without sacrificing sensitivity). Furthermore, the GAD-7 appeared useful (in retrospect) to identify not only GAD but also panic and possibly even PTSD, which supported earlier international experience (Kroenke et al., 2007). Additional work will be necessary to determine whether different cut-points would be required for different presentations of disordered anxiety (e.g. GAD, panic disorder, etc.). The results of both scales supported previous reports on the higher mean scores of women compared with men (e.g. Löwe et al., 2008), although in neither case did the data require development of separate thresholds for women and men, which would simplify future interpretation during screening. For both the PHQ-9 and GAD-7 at least partial scalar invariance was observed for language, making these scales potentially useful for administration in multilingual workgroups.
The PC-PTSD-5 results supported previous recommendations that scores ≥ 3 be considered as a positive screen for further referral (Bovin et al., 2021; Jung et al., 2018; Prins et al., 2016). The slightly higher specificity than in previous reports (Prins et al., 2016), together with the lack of significant gender and language effects (whilst acknowledging skewed subsample sizes), was encouraging for practical application.
In a country with highly reported community level traumatic exposures and associated prevalence of PTSD (Edwards, 2005; Kaminer & Eagle, 2010; Peltzer & Pengpid, 2019), a screener as brief as this one may be particularly valuable. However, given that the interview-determined PTSD prevalence was only 0.8%, follow-up studies will be required to confirm the diagnostic accuracy of the PC-PTSD-5 in samples with higher prevalence of traumatic exposure or diagnosed PTSD. Follow-up studies would also be required to further explore measurement invariance across language groups.
The CAGE cut-point of ≥ 2 appears consistent with findings in sub-Saharan African studies (Claassen, 1999; Vissoci et al., 2018). However, in spite of the good screening accuracy reported, the CAGE might be somewhat less useful in the current context. The low Cronbach's alpha may be questionable for clinical use, and the poorer specificity and PPV may lower screening efficiency by identifying too many false positive cases for referral. The gender effect found in this sample (i.e. higher mean scores of men compared with women), although small, is consistent with general reports, where the need for different cut-points for women and men has been exhaustively debated (cf. Dhalla & Kopec, 2007, for review).
The significant language effect observed in this sample poses a substantial challenge to the practical use of the CAGE, particularly as measurement invariance could not be determined. Non-English first language speakers reported more indicators of problematic alcohol use, although the mean score difference and effect size were small. It could be hypothesised that this sample of educated employees may have had better English proficiency than the general population (hence the small mean score difference), and that another sample may have greater difficulty with the language employed in the four items. All four items require some semantic interpretation and language background could therefore influence the reporting of the CAGE indicators of problematic alcohol use. Possible language effects may make the CAGE less suitable for use in the current SA multilingual context.
High levels of AUD have consistently been reported in SA society (Herman et al., 2009) and there remains a need for a

Discontinued measures
The BSI-18-S was discontinued early because of questionable psychometric properties and poor clinical utility. It differentiated poorly between diagnostic cases, supporting the finding of Recklitis et al. (2017, p. 1197) who concluded that the BSI-18 should not be used as a stand-alone screening measure for making clinical decisions (i.e. referral for mental health follow-up). It could be hypothesised that the poor clinical utility was because of this educated sample exhibiting the necessary vocabulary to be more specific in reporting their distress (e.g. as mood or anxiety). Furthermore, a 6-item scale may not be sufficient to tap into a construct as complex as somatisation, especially as the BSI was initially designed for use with medical patients (Derogatis, 2001) and possibly was not a good fit in a population of generally healthy adults.
The 2-item panic-like anxiety scale was discontinued early because of poor specificity. This scale was adapted from civil aviation guidelines (SACAA, 2017), which were originally intended for a very specific group (e.g. aircrew), and may not be equally useful in a general industry population. This leaves two possible avenues for further consideration: Firstly, two items make a very brief scale to screen for a common and complex condition such as panic disorder. Future research may be usefully directed towards identifying additional items to improve its utility to screen for panic. Secondly, there may be no need for an additional intense panic-like measure in this context, as the GAD-7 in this sample also identified all panic cases. Thus, simply using the GAD-7 as general screening for multiple anxiety disorders might be sufficient, especially as all positive screening outcomes are automatically referred for further assessment.

Practical implications
The findings of this study point to an opportunity to more fully realise the strategic role that brief, locally validated and clinically efficacious screening measures can play in facilitating more efficient access points and referral pathways for mental health support in South African occupational healthcare. This is especially pertinent given the role that imprecise and non-normed screening measures play not only in the over- or under-diagnosis of CMD but also the inappropriate allocation of, and expenditure on, intervention opportunities in what are often resource-limited occupational healthcare support systems. Alone, however, validated mental health screening measures do not adequately solve the burden of CMD in workplace settings and can in fact prove to be counterproductive should the 'point of screening' not be coupled to appropriate post-screening referral and treatment pathways (Joyce et al., 2016). Furthermore, whilst routine occupational mental health screening for CMD is evermore en vogue for its cost effectiveness (Dobson et al., 2018) and its role in establishing workplace cultures of 'continuous health promotion' (Magnavita, 2018), this does not eliminate longer standing critiques that such screenings heighten experiences of workplace stigma (Solomon, Mikulincer, & Flum, 1989). For this reason, the screening measures recommended here need to be embedded within workplace programmes of mental health education.

Limitations
The results from this convenience sample cannot be generalised to populations with lower formal education levels without further evidence of validity. Furthermore, the extent of comorbidity, likely a significant proportion (Nel, Augustyn, & Bartman, 2018), was not known for this data set. It is possible, and would require further investigation, that against the background of multiple comorbidities, scale scores could have reflected general mental distress, rather than specific diagnoses. The scales reported here measured the severity, or presence, of selected conditions, but not the extent of impact on daily life (PHQ-9 and GAD-7 have additional items to measure the degree to which symptoms have affected patients' level of functioning, but they were not available for this study). Future studies may be valuable in clarifying associations between severity scores and impact on level of functioning in local SA samples. Inter-rater reliability data, when using multiple psychologists for DSM-5 based assessments, were not available for this study and may need to be accommodated in future protocols. Lastly, further research may require larger samples to investigate the measurement invariance of these scales across different socio-demographic variables.

Conclusion
This study reported evidence of structural and criterion validity for the four scales when administered in local occupational health surveillance settings. A particular benefit of the PHQ-9, GAD-7 and PC-PTSD-5 is that the same reference norms appear useful, for now, across gender and language backgrounds, at least in workplace populations with a minimum of 9 years of formal schooling. However, there remains a need for larger scale 'general population' studies to establish their utility in a more diverse range of occupational environments and workplaces, where systems and cultures of mental health promotion and intervention are less ingrained and practised.
For practical application, the PHQ-9, GAD-7 and PC-PTSD-5 demonstrated good diagnostic accuracy and, where there is a relatively highly educated and psychologically literate occupational sample, confirmed that targeted mental health screening presents potential clinical utility for identification, referral and intervention within occupational health surveillance infrastructure.