INTRODUCTION
In 2000, the School of Medicine of the Pontificia Universidad Católica de Chile created a Faculty Development Center to provide training opportunities for its more than 700 clinical teachers and to foster a culture of continuous enhancement of teaching quality1,2. At that time there were few questionnaires in Spanish to evaluate the performance of medical teachers in clinical settings. The Center developed a questionnaire in Spanish named MEDUC30, based on a systematic review of the specialized literature and using as a template the seven-domain educational model of the Stanford Faculty Development Program (SFDP) for evaluating clinical teaching effectiveness3. The questionnaire was refined and validated by a Delphi panel4.
MEDUC30 is a 30-item questionnaire with a 4-point Likert-type frequency scale4. Its items contribute to one of the following domains: [i] Learning Climate, [ii] Evaluation, [iii] Feedback, [iv] Communication of Goals, [v] Control of the Session, [vi] Promotion of Understanding and Retention, [vii] Promotion of Self-directed Learning, and [viii] Patient-Based Teaching. The eighth domain was incorporated to ensure that the activity evaluated referred to actual clinical teaching rather than mini-lectures given in clinical settings.
In the initial validation study4 MEDUC30 displayed good reliability and a four-factor structure: Patient-Based Teaching and Learning Climate emerged as separate empirical factors, and the remaining five educational domains of the SFDP model gathered in two large factors: Evaluating skills (comprising Evaluation plus Feedback) and Teaching skills (comprising Communication of Goals, Control of the Session, Promotion of Understanding and Retention, and Promotion of Self-directed Learning)4.
MEDUC30 is, to our knowledge, one of the few validated instruments in Spanish to evaluate clinical teachers' effectiveness during the initial years of clinical experiential learning. It complements the questionnaires developed by the Faculty of Medicine of the UNAM of Mexico to evaluate teachers' performance during basic medical science teaching5-7 and medical specialty training8, and those developed by the Pontificia Universidad Católica de Chile for the evaluation of clinical teachers in different medical specialties9,10. MEDUC30 has been used at the Pontificia Universidad Católica de Chile medical school since 2004. However, no confirmatory factor analysis (CFA) or measurement invariance studies have been conducted so far.
The purpose of this study is to provide updated evidence as to the reproducibility, validity and usefulness of the MEDUC30 questionnaire to evaluate clinical teaching in Spanish-speaking contexts. To this end, we validated MEDUC30 on a larger and more recent database, using confirmatory factor analysis (CFA) to ascertain the model's goodness-of-fit and multi-group CFA to study measurement invariance.
METHOD
Study and participants
This is an analytical, longitudinal, retrospective study aiming to examine the psychometric properties of the data produced with MEDUC30 in its regular use for the evaluation of clinical teachers' performance at the Pontificia Universidad Católica de Chile School of Medicine. We analysed a total of 24,681 evaluation forms regarding 579 clinical tutors (63% men) collected from 2004 through 2015.
Instrument
MEDUC30 is a 30-item instrument that describes observable teacher behaviours4. It uses a four-level scale: 1. 'almost never', 2. 'sometimes', 3. 'often', and 4. 'almost always'4. Twenty-nine items contribute to eight dimensions as follows: [i] Patient-Based Teaching, items 1-5; [ii] Communication of Goals, items 6-8; [iii] Evaluation, items 9-12; [iv] Promoting Understanding and Retention, items 13-16; [v] Promoting Self-directed Learning, items 17-19; [vi] Control of the Session, items 20-22; [vii] Feedback, items 23-25; and [viii] Learning Climate, items 26-29. The last item corresponds to a global rating.
Application of MEDUC30
The evaluation process was as follows: at the end of each clinical rotation, medical students from the 3rd to the 7th (last) year of study were asked to evaluate their clinical tutors using MEDUC30 paper forms. Students filled in the forms anonymously (in the absence of the evaluated teacher), once for each rotation they completed during the year. This was an ongoing process; at the end of each academic year, tutors had accumulated between 5 and 30 evaluations. Individual reports were sent to the teachers to provide them with specific feedback regarding the eight domains of effective teaching. Copies of these reports were made available yearly to school authorities to be used as a source of information for academic promotion.
Statistical Analysis
Items as ordinal measures of continuous latent constructs and missing data handling
For analytical purposes, we treated the MEDUC30 items' scores as ordinal rather than continuous11. To deal with missing data, we applied multiple imputation12 using Multivariate Imputation by Chained Equations (MICE) with the proportional odds logistic regression (POLR) model. We used these imputed data to conduct both Exploratory (EFA) and Confirmatory Factor Analyses (CFA).
Factor Analysis
We conducted EFA to study the dimensionality and internal structure of the data, CFA to evaluate the model's goodness-of-fit, and multi-group CFA to study measurement invariance.
The data was randomly divided into three samples: sample 1 (n = 4,122) for EFA, sample 2 (n = 4,109) for CFA global fit assessment, and sample 3 (n = 16,450) for CFA measurement invariance evaluation.
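The three-way random split described above can be sketched as follows. This is a minimal Python illustration, not the authors' code (the analyses were run in R): the function name `split_forms`, the seed, and the use of integer form identifiers are assumptions made for the example. Only the three sample sizes come from the text.

```python
import random

# Sample sizes reported in the text; they add up to the 24,681 forms analysed.
SAMPLE_SIZES = {"efa": 4122, "cfa_fit": 4109, "cfa_invariance": 16450}


def split_forms(form_ids, sizes=SAMPLE_SIZES, seed=2015):
    """Randomly partition form identifiers into disjoint samples.

    `seed` is an arbitrary value chosen only for reproducibility (assumption).
    """
    shuffled = list(form_ids)
    if sum(sizes.values()) != len(shuffled):
        raise ValueError("sample sizes must add up to the number of forms")
    random.Random(seed).shuffle(shuffled)
    samples, start = {}, 0
    for name, n in sizes.items():
        samples[name] = shuffled[start:start + n]
        start += n
    return samples


# Hypothetical identifiers 0..24680 stand in for the real evaluation forms.
samples = split_forms(range(24681))
```

Because the samples are slices of a single shuffled list, no form can appear in more than one of them.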
For the EFA we used the unweighted least squares (ULS) estimator; for the CFA we used ULS with robust standard errors and a mean- and variance-adjusted test statistic based on a second-order approach13 (the ULSMVS estimator in the lavaan R package).
We calculated four fit indices to evaluate and compare descriptive goodness-of-fit: two comparative fit indices, the Comparative Fit Index (CFI) and the Tucker-Lewis Index (TLI); one parsimony correction index, the Root-Mean-Square Error of Approximation (RMSEA); and one absolute fit index, the Weighted Root-Mean-Square Residual (WRMR).
The following cutoff values were derived from simulation studies14-17: good fit when CFI ≥ .96, TLI ≥ .95 and RMSEA ≤ .05; acceptable fit when CFI and TLI ≥ .90 and RMSEA < .08; mediocre fit when .08 ≤ RMSEA ≤ .10 with CFI and TLI ≥ .90. When a model met at least two of these three criteria at one level of fit and the remaining criterion at an adjacent level (upper or lower), its fit was classified at the former level16,18. Finally, if CFI or TLI < .90, or RMSEA > .10, the model was rejected. WRMR (smaller is better) was used to corroborate model comparison19.
For reliability measures, we calculated Cronbach's α, and McDonald's ω and ωt20 as indices of internal consistency of the respective constructs. Additionally, for bifactor model constructs we reported McDonald's ωh21 with the respective ωs coefficients as indices of factor saturation.
We studied measurement invariance over tutor gender, date, semester, year of study, clinical teaching setting, and length of clinical rotation using a χ2-based likelihood-ratio test (LRT) with the Satorra22 adjusted test statistic. For every grouping variable, we used a random subsample of the larger group(s) to ensure equal n with the smaller ones.
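The decision rule above can be written as a short function. This is an illustrative Python sketch, not the authors' code. The names `rate_index` and `classify_fit` are hypothetical, and one reading of the rule is assumed: any index in the rejection range rejects the model, an RMSEA between .08 and .10 makes the fit mediocre, and otherwise the level reached by at least two of the three indices is reported.

```python
def rate_index(name, value):
    """Rate one fit index against the cutoffs quoted in the text."""
    if name == "CFI":
        return "good" if value >= 0.96 else "acceptable" if value >= 0.90 else "rejected"
    if name == "TLI":
        return "good" if value >= 0.95 else "acceptable" if value >= 0.90 else "rejected"
    if name == "RMSEA":
        if value <= 0.05:
            return "good"
        if value < 0.08:
            return "acceptable"
        return "mediocre" if value <= 0.10 else "rejected"
    raise ValueError(f"unknown index: {name}")


def classify_fit(cfi, tli, rmsea):
    """Combine the three ratings into one overall level (assumed reading of the rule)."""
    ratings = [rate_index("CFI", cfi), rate_index("TLI", tli), rate_index("RMSEA", rmsea)]
    if "rejected" in ratings:
        return "rejected"
    if "mediocre" in ratings:  # only RMSEA can be rated mediocre
        return "mediocre"
    # Every rating is now good or acceptable, so the two levels are adjacent
    # and the level held by at least two indices is reported.
    return "good" if ratings.count("good") >= 2 else "acceptable"


print(classify_fit(0.936, 0.994, 0.041))  # bifactor row of Table 1 -> good
print(classify_fit(0.771, 0.979, 0.078))  # single-factor row of Table 1 -> rejected
```

WRMR is left out of the function because the text uses it only to corroborate comparisons between models, not as a cutoff.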
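As a reminder of what the simplest of these coefficients measures, Cronbach's α for k items is k/(k − 1) × (1 − sum of item variances / variance of the total score). The Python sketch below implements that textbook formula on raw scores. It is an illustration only: the function name and the toy ratings are made up, and the study itself treated items as ordinal and computed its coefficients in R.

```python
from statistics import variance


def cronbach_alpha(items):
    """Cronbach's alpha from raw scores.

    `items` is a list of equal-length score lists, one list per questionnaire item.
    """
    k = len(items)
    if k < 2:
        raise ValueError("at least two items are required")
    totals = [sum(scores) for scores in zip(*items)]  # total score per respondent
    item_variance_sum = sum(variance(scores) for scores in items)
    return k / (k - 1) * (1 - item_variance_sum / variance(totals))


# Toy ratings on the 1-4 scale from four respondents (made-up numbers).
toy_items = [[4, 3, 4, 2], [4, 3, 3, 2], [4, 4, 4, 1]]
alpha = cronbach_alpha(toy_items)
```

When all items rank respondents identically, the variance of the total score is as large as it can be relative to the item variances and α reaches 1.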
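The group-balancing step can be illustrated with a short Python sketch. It is not the authors' procedure in detail: the function name `balance_groups`, the seed and the toy records are assumptions, and the sketch simply downsamples every group to the size of the smallest one.

```python
import random
from collections import defaultdict


def balance_groups(records, group_of, seed=0):
    """Randomly downsample every group to the size of the smallest group."""
    groups = defaultdict(list)
    for record in records:
        groups[group_of(record)].append(record)
    target = min(len(members) for members in groups.values())
    rng = random.Random(seed)  # arbitrary seed, for reproducibility only
    return {name: rng.sample(members, target) for name, members in groups.items()}


# Hypothetical forms tagged with the tutor's gender (made-up counts: 600 vs 300).
forms = [{"id": i, "tutor_gender": "man" if i % 3 else "woman"} for i in range(900)]
balanced = balance_groups(forms, lambda form: form["tutor_gender"])
```

Equal group sizes keep the larger group from dominating the multi-group likelihood-ratio test.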
Software
We conducted all statistical analyses using R software 3.3.0 23 with specific packages. Multiple Imputation by Chained Equations was performed using the mice package 2.25 24. Exploratory factor analysis, including tests for sampling adequacy and parallel analysis, was conducted with the psych package 1.6.4 25. Confirmatory Factor Analysis was conducted with the lavaan package 0.5.20 26. Proportional odds regressions were performed with the MASS package 7.3.45 27.
Ethical Considerations
This paper reports the results of the clinical teachers' evaluations conducted from 2004 to 2015 at the Pontificia Universidad Católica de Chile School of Medicine. This is a mandatory process overseen by the Center for Medical Education according to ethical considerations aimed at assuring the confidential handling of the information. The evaluation forms are filled in anonymously by students. Each clinical tutor receives a yearly report of his/her results; these results are also made known to the department head and the school authorities to be used for purposes of career promotion.
RESULTS
The proportion of missing responses for the whole questionnaire was low (1.79%) except for items 10 and 24 (8.72% and 5.13% missing values, respectively). To deal with this situation, we used multiple imputation.
On the other hand, the highest response category of the questionnaire ('almost always') concentrated 74.3% of the answers, indicative of a 'ceiling effect'. To deal with this situation, we selected polychoric correlation and a robust estimator19,28,29.
The three samples were comparable with regard to date (ANOVA p = .60), tutor gender, clinical teaching setting, and semester (binomial regression p = .29, .72, and .43, respectively), and year of study and global score (POLR p = .33 and .69, respectively).
Exploratory Factor Analysis (Sample 1) Assumptions' Evaluation
The Kaiser-Meyer-Olkin measure of sampling adequacy (.97) indicated marvellous (≥ .90) factorability according to the Kaiser and Rice30 criteria. Bartlett's sphericity test (χ2 = 129511.72, df = 406, p < .001) also indicated that the data were suitable for factor analysis.
Number of Retained Factors
We decided to retain seven factors according to two criteria: the maximum number of factors given by Horn's Parallel Analysis (eight factors) and the last big drop in the eigenvalues in the scree plot (between factors 7 and 8).
In addition, we tested the hypothesis that a bi-factor model could better explain the data given the large difference in eigenvalues between factors 1 and 2 (17.36 vs. 0.98), as suggested by Reise31. This implied the retention of one general factor, with six specific factors.
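The size of this gap can be made concrete with a little arithmetic. The Python sketch below assumes that 29 items entered the factor analysis (consistent with the df = 406 = 29 × 28 / 2 of Bartlett's test) and uses the fact that the eigenvalues of a correlation matrix sum to the number of variables. The variable names are illustrative.

```python
N_ITEMS = 29  # assumption: the 29 domain items were factored (global rating excluded)
EIGENVALUE_1, EIGENVALUE_2 = 17.36, 0.98  # values reported in the text

# How many times larger the first eigenvalue is than the second.
ratio = EIGENVALUE_1 / EIGENVALUE_2

# Eigenvalues of a correlation matrix sum to the number of variables,
# so this is the share of total variance carried by the first factor.
share = EIGENVALUE_1 / N_ITEMS

print(f"ratio = {ratio:.1f}, share of variance = {share:.1%}")
```

Under these assumptions the first eigenvalue is roughly 18 times the second and accounts for about 60% of the total variance, which is the kind of dominant first factor that motivates testing a general factor.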
Latent Structure
Exploratory analysis for a bifactor structure model with six specific dimensions resulted in the following domain-specific factors, alongside the general factor (Figure 1): Patient-Based Teaching (PBT; items 1 to 5), Communication of Goals (CG; items 6 to 8), Evaluation and Feedback (EVFB; items 9, 11, 12, 23, and 25), Promotion of Understanding, Retention and Self-directed Learning (PURSL; items 13 to 19), Control of the Session (CS; items 20 to 22), and Learning Climate (LC; items 10, 24, and 26 to 29). Only two items had considerable cross-loadings (items 10 and 24); both loaded more on Learning Climate than on their original theoretical factor. Communalities of items ranged from .50 to .90. Factor loadings ranged from .59 to .83 for the general factor, and from .23 to .67 for domain-specific factors.
Confirmatory Factor Analysis (Sample 2)
Goodness-of-fit and model comparison
We compared four models in CFA: [i] the four correlated traits model described by Bitran et al.4, [ii] the bifactor model suggested by the results of the EFA, [iii] a model with six correlated traits (corresponding to the six domain-specific factors of the bifactor model), and [iv] a single-factor model. According to the global fit indices evaluated (Table 1), the bifactor model was the only one with acceptable (CFI) to good (TLI and RMSEA) global fit indices. This supports the bifactor model with six domain-specific factors as the best latent structure for these MEDUC30 data.
Model | SB χ2 (df) | SB χ2 / df | CFI | TLI | RMSEA [90% CI] | WRMR |
---|---|---|---|---|---|---|
Sample 2 | | | | | | |
Single factor | 4029.6 (154) | 26.1 | .771 | .979 | .078 [.076, .080] | 3.49 |
Four Correlated Traits | 2531.5 (169) | 15.0 | .86 | .988 | .058 [.057, .060] | 2.57 |
Six Correlated Traits | 2029.2 (176) | 11.6 | .89 | .991 | .051 [.049, .053] | 2.17 |
Bifactor | 1234.2 (3) | 8.1 | .936 | .994 | .041 [.040, .043] | 1.82 |

Note. SB χ2 = Satorra-Bentler scaled chi-square, df = degrees of freedom, CFI = Comparative fit index, TLI = Tucker-Lewis index, RMSEA = Root-mean-square error of approximation, WRMR = Weighted root-mean-square residual. All p-values < .001.
Factor structure and reliability
All factors in the bifactor model (the general and the six domain-specific factors) displayed good reliability coefficients: .88 to .98 for Cronbach's α, and .79 to .97 for McDonald's ω (Table 2). Hierarchical reliability (ωh/s) was stronger for the general factor compared to the domain-specific factors, particularly the PURSL factor (ωs= .08) (Table 2).
Bifactor Model (general & six domain-specific factors)

Item | Theoretical factor | g | PBT | CG | EVFB | PURSL | CS | LC |
---|---|---|---|---|---|---|---|---|
Item 1 | PBT | .71 | .09 | | | | | |
Item 2 | PBT | .71 | .60 | | | | | |
Item 3 | PBT | .73 | .59 | | | | | |
Item 4 | PBT | .70 | .55 | | | | | |
Item 5 | PBT | .73 | .28 | | | | | |
Item 6 | CG | .76 | | .53 | | | | |
Item 7 | CG | .80 | | .51 | | | | |
Item 8 | CG | .83 | | .16 | | | | |
Item 9 | EV | .71 | | | .35 | | | |
Item 10 | EV | .83 | | | | | | .21 |
Item 11 | EV | .70 | | | .42 | | | |
Item 12 | EV | .80 | | | .45 | | | |
Item 13 | PUR | .80 | | | | .27 | | |
Item 14 | PUR | .82 | | | | .35 | | |
Item 15 | PUR | .81 | | | | .36 | | |
Item 16 | PUR | .82 | | | | .29 | | |
Item 17 | PSL | .82 | | | | .13 | | |
Item 18 | PSL | .72 | | | | .03 | | |
Item 19 | PSL | .82 | | | | .15 | | |
Item 20 | CS | .76 | | | | | .47 | |
Item 21 | CS | .73 | | | | | .49 | |
Item 22 | CS | .63 | | | | | .42 | |
Item 23 | FB | .73 | | | .41 | | | |
Item 24 | FB | .81 | | | | | | .29 |
Item 25 | FB | .74 | | | .40 | | | |
Item 26 | LC | .78 | | | | | | .49 |
Item 27 | LC | .89 | | | | | | .15 |
Item 28 | LC | .80 | | | | | | .48 |
Item 29 | LC | .85 .89 | | | | | | .40 |
Cronbach's α | | .98 | .91 | .91 | .92 | .94 | .88 | .96 |
McDonald's ωt | | .97 | .86 | .87 | .87 | .88 | .79 | .93 |
McDonald's ω(h/s) | | .92 | .31 | .21 | .23 | .08 | .21 | .25 |

Note. g = General factor, PBT = Patient-based teaching, CG = Communication of goals, EVFB = Evaluation and feedback, PURSL = Promotion of understanding, retention and self-directed learning, CS = Control of session, LC = Learning Climate.
Factor loadings for the bifactor model (Table 2) were all high on the general factor, ranging from .63 (item 22) to .87 (item 27). All specific factors except PURSL had at least two salient loadings (≥ .40). With the exception of items 1, 8, 17, 18, 19 and 27 (loadings < .20), domain-specific loadings were in general large enough (≥ .20) to reflect a multidimensional structure.
Measurement invariance (Sample 3)
Multigroup CFA (Table 3) indicated that configural (form), weak (loadings) and strong (intercepts) measurement invariance could be sustained (p > .05) across tutor gender (man/woman), clinical teaching setting (inpatient/outpatient), year of study (3rd to 7th), length of clinical rotation (1 to 7-or-more weeks), date (2004 to 2015), and semester (fall/spring).
Note. SB χ2 = Satorra-Bentler scaled chi-square, df = degrees of freedom, CFI = Comparative fit index, RMSEA = Root-mean-square error of approximation, configural invariance only. a n = 4000 per gender of the tutor (man/woman). b n = 4000 per semester (fall/spring). c n = 4000 per clinical teaching setting (inpatient/outpatient). d n = 933 per year of study (3rd, 4th, 5th, 6th, 7th). e n = 780 per group (one, two, three, four, five, six, or seven-and-more weeks). f n = 503 per year (2004 to 2015).
Also, all of these variables were sources of population heterogeneity (mean) with p < .001, except for the length of clinical rotation (p = .003), which was still significant at conventional significance levels.
DISCUSSION
We evaluated the validity of MEDUC30 to assess clinical teachers' effectiveness. According to EFA, our data was reasonably well explained by a bifactor structure with six domain-specific factors. Five of them were closely related to the seven theoretical domains of the SFDP framework3, and the sixth corresponded to an added dimension named Patient-Based Teaching. CFA showed that this model had a good fit to the data and accounted for MEDUC30 scores better than a single-factor model or a first-order multidimensional model with four or six factors.
Besides supporting the multidimensionality of the teaching effectiveness construct, the present results indicate that MEDUC30 behaves as a hierarchical construct, with a general factor that can be construed as 'being a good teacher', and six domain-specific factors. During the last decade, hundreds of clinical teachers at the Pontificia Universidad Católica de Chile (PUC) have completed a diploma in medical education32. Thus, it is conceivable that the 'being a good teacher' general factor found in this study is related to this professionalisation of teaching, which entails the acquisition of general good teaching practices in addition to domain-specific skills.
The 'good teacher' general factor found in our study is reminiscent of the 'teaching performance' latent construct proposed by Flores et al.7 to explain their results with OPINEST, an instrument used to evaluate medical teacher competences.
The potential contribution of MEDUC30 to medical education in Spanish-speaking contexts is related to its focus on the evaluation of the facilitating role of medical teachers in the clinical setting. MEDUC30 covers this sensitive period of transition from passive, teacher-centered, information-driven teaching to active, student-centered, patient-driven learning33. A recent study found that the attributes of an effective teacher differ between the classroom and the clinical setting34, thus supporting the importance of context specificity in teaching effectiveness ratings.
Similarities and differences with other implementations of the SFDP framework
Our results are partially consistent with the initial validations of the SFDP framework construct, which used EFA on data obtained with the SFDP26 questionnaire3,35. In these studies the authors deemed the data to be reasonably explained by the theoretical seven-dimension structure.
Compared to the four-factor structure proposed for MEDUC30 in the initial exploratory studies4, the bifactor structure with six domain-specific factors presented here corresponds more closely to the theoretical SFDP framework. Three of these factors corresponded exactly to the dimensions: Learning Climate, Control of the Session and Communication of Goals. The other two factors gathered the items of Evaluation and Feedback, on the one hand, and Promoting Comprehension and Retention and Promoting Self-directed Learning, on the other.
In a psychometric evaluation of the SFDP26, performed over a relatively small sample (N = 119), Mintz et al.36 proposed a new five-factor structure for a reduced 15-item instrument. Comparisons of our results with this report are difficult to draw since the authors did not evaluate hierarchical models. On the other hand, they eliminated entire dimensions rather than redefining the structure based on substantive and statistical criteria with the original set of items.
In a recent study conducted with Middle Eastern undergraduate medical students37, a modified version of the "System for Evaluation of Teaching Qualities" (SETQ), an instrument also based on the SFDP educational framework, displayed a six-factor structure consistent with the main SFDP domains and with MEDUC30.
Strengths and limitations of this study
MEDUC30 is a validated theory-based instrument in Spanish to assess clinical teachers' effectiveness by students during the training clinical years. It adds to the repertoire of instruments developed in Spanish to evaluate medical teacher performance in basic science years5,6, and those aimed at medical specialty training8-10.
This study has strengths related to the sample size and the analytical methods used. Compared to other validation studies (for a review see Fluit et al.2), the number of evaluations and teachers was several-fold larger, and we employed multiple and stringent criteria for the factor analyses. These features support the robustness and reliability of the results.
Unlike most validations of similar instruments, this study includes measurement invariance information. MEDUC30 can be used for comparisons across several variables (e.g. date, tutor gender, year of study, and length of rotation).
One limitation of this study is that it involves a single medical school; thus it would be necessary to confirm the questionnaire's generalizability to other medical schools or countries with different clinical teaching realities.
Regarding further improvements of the questionnaire, it seems advisable to increase the width of the scale to allow for a larger response range. Scores have improved systematically during the 12-year period of assessment and, as a consequence, the power of discrimination of the 4-point scale has diminished.
It should always be borne in mind that while the assessment of clinical teachers by students could reveal valid and relevant information, this should be "triangulated" with information derived from other sources, including peers and self-assessment2.
CONCLUSIONS
In this report we give evidence that MEDUC30 is a reliable and valid instrument suited to provide clinical teachers with feedback on their strengths and weaknesses about multiple dimensions of clinical teaching. It has evidence of content validity, internal structure validity and use validity. This instrument should be of interest to medical schools of Spanish-speaking countries for it adds to the repertoire of validated instruments in Spanish to evaluate medical teachers in the clinical teaching setting.
MEDUC30 internal structure validity was supported in this study by the multidimensionality of its scores and the consistency of this internal structure with the educational framework used in its development. The content validity evidence derives from the questionnaire construction, which was based on a previously developed instrument, and on the input of experts and students. Also, MEDUC30 items cover 5 out of 7 of the roles agreed as characteristic of good teaching38,39,40. Finally, its use validity is confirmed by the widespread use and acceptance of this instrument for the assessment of clinical teachers at PUC medical school for more than ten years.
In conclusion, MEDUC30 satisfactorily meets three of the five possible sources of validity evidence as defined in the published standards of the American Psychological Association and the American Educational Research Association41,42: internal structure, content and use. Future investigations will be needed to provide evidence for the remaining two validity sources: relation to other variables and consequences.
INDIVIDUAL CONTRIBUTIONS
MB: Study conception and design, data interpretation, discussion, drafting of the manuscript, and final approval of the manuscript for submission.
MT-S: Data analysis and interpretation, preparation of an initial draft, discussion and analysis of results, critical review of the final manuscript.
OP: Data analysis and interpretation, discussion and analysis of results, critical review of the final manuscript.