Multiple Rater Agreement

Note that Cohen's kappa only applies to two raters who rated the exact same items. For example, an inter-rater reliability of 75% may be acceptable for a test designed to judge how well a television program is received. In general, in most fields, agreement between raters of at least 75% is required for a test to be considered reliable. However, higher inter-rater reliability may be required in some areas. IRR was assessed using a two-way mixed, consistency, average-measures ICC (McGraw and Wong, 1996) to assess the degree to which coders provided consistency in their ratings of empathy across subjects. The resulting ICC was in the excellent range, ICC = 0.96 (Cicchetti, 1994), indicating that coders had a high degree of agreement and suggesting that empathy was rated similarly across coders. The high ICC indicates that minimal measurement error was introduced by the independent coders and that, therefore, statistical power for subsequent analyses is not substantially reduced. Empathy ratings were therefore deemed suitable for use in the hypothesis tests of the present study. To assess agreement between raters more fully, the proportion of absolute agreement must be considered in light of the magnitude and direction of the observed differences. These two aspects provide relevant information on how far apart ratings tend to be and on whether a subset of raters or rated individuals consistently receives higher or lower ratings than another.
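As a rough illustration, a two-way mixed, consistency, average-measures ICC of the kind described above (often labeled ICC(3,k)) can be computed from the mean squares of a two-way ANOVA without replication. The sketch below is a minimal NumPy implementation assuming a fully crossed design with no missing ratings; the function name and the example ratings matrix are illustrative, not taken from the study.

```python
import numpy as np

def icc_3k(ratings):
    """Two-way mixed, consistency, average-measures ICC, i.e. ICC(3,k).

    ratings: array of shape (n_subjects, k_raters), fully crossed, no missing values.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand_mean = x.mean()
    # Sums of squares for a two-way ANOVA without replication
    ss_total = ((x - grand_mean) ** 2).sum()
    ss_subjects = k * ((x.mean(axis=1) - grand_mean) ** 2).sum()
    ss_raters = n * ((x.mean(axis=0) - grand_mean) ** 2).sum()
    ss_error = ss_total - ss_subjects - ss_raters
    ms_subjects = ss_subjects / (n - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    # Consistency, average-measures form: (MS_subjects - MS_error) / MS_subjects
    return (ms_subjects - ms_error) / ms_subjects

# Hypothetical usage: 5 subjects rated by 3 coders; consistent rank ordering gives a high ICC.
example = [[4, 5, 4], [2, 3, 2], [5, 5, 5], [1, 2, 1], [3, 4, 3]]
print(round(icc_3k(example), 2))
```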

The magnitude of the differences is an important aspect of agreement scores, since proportions of exactly equal scores reflect only perfect agreement. Such perfect agreement, however, may not always be the relevant criterion, for example in clinical terms. To assess the magnitude of the differences between raters, we used a descriptive approach that took into account the distribution and size of the score differences. Since reliably different ratings were only observed when the calculations were based on the ELAN test-retest reliability, we used these results to assess the size and direction of the differences. Overall, the observed differences were small: most of them (60%) lay within 1 SD, and all lay within 1.96 SD of the mean of the differences. The differences that did occur were therefore within an acceptable range for a screening tool, as they did not exceed one standard deviation of the standard scale used. This finding puts into perspective the relatively small proportion of absolute agreement measured against the test-retest reliability bounds of the instrument (43.4%) and highlights the importance of considering not only the significance but also the magnitude of the differences. Interestingly, the same holds for the 100% absolute agreement obtained when the calculations use the reliability estimate from this study's sample instead of the instrument's standardized reliability. There are several formulas that can be used to calculate limits of agreement. The simple formula given in the previous paragraph, which works well for sample sizes greater than 60,[14] is the mean difference ± 1.96 times the standard deviation of the differences. The most demanding (and strict) way to measure inter-rater reliability is Cohen's kappa, which calculates the percentage of items on which the raters agree while taking into account that the raters would agree on some items purely by chance. Possible values for the kappa statistic range from −1 to 1, where 1 indicates perfect agreement, 0 indicates agreement no better than chance, and −1 indicates perfect disagreement.
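To make the two quantities discussed above concrete, the sketch below shows one way to compute the simple limits of agreement and Cohen's kappa for two raters, assuming numerical scores for the former and categorical codes for the latter; the function names and inputs are illustrative only.

```python
import numpy as np

def limits_of_agreement(scores_a, scores_b):
    """Simple limits of agreement: mean difference +/- 1.96 SD of the differences."""
    d = np.asarray(scores_a, dtype=float) - np.asarray(scores_b, dtype=float)
    md, sd = d.mean(), d.std(ddof=1)
    return md - 1.96 * sd, md + 1.96 * sd

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters assigning categorical codes to the same items."""
    a, b = np.asarray(codes_a), np.asarray(codes_b)
    p_o = np.mean(a == b)  # observed proportion of agreement
    # Expected chance agreement from each rater's marginal category proportions
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in np.union1d(a, b))
    return (p_o - p_e) / (1 - p_e)

# Hypothetical usage
print(limits_of_agreement([48, 52, 50, 47], [50, 51, 49, 49]))
print(round(cohens_kappa(["yes", "no", "yes", "yes"], ["yes", "no", "no", "yes"]), 2))
```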

Landis and Koch (1977) provide guidelines for interpreting kappa values, with values from 0.0 to 0.20 indicating slight agreement, 0.21 to 0.40 fair agreement, 0.41 to 0.60 moderate agreement, 0.61 to 0.80 substantial agreement, and 0.81 to 1.0 almost perfect or perfect agreement. However, the use of such qualitative thresholds is debated, and Krippendorff (1980) offers a more conservative interpretation, suggesting that conclusions should be discarded for variables with values below 0.67, drawn tentatively for values between 0.67 and 0.80, and drawn definitively for values above 0.80. In practice, however, kappa coefficients below Krippendorff's conservative cutoffs are often retained in research studies, and Krippendorff proposed these cutoffs based on his own work in content analysis, acknowledging that acceptable IRR estimates vary depending on the study methods and the research question. Second, it must be decided whether the subjects rated by multiple coders are rated by the same set of coders (a fully crossed design) or whether different subjects are rated by different subsets of coders. The contrast between these two options is shown in the upper and lower rows of Table 1. Although fully crossed designs may require a larger total number of ratings, they allow systematic bias between coders to be assessed and controlled for in an IRR estimate, which can improve overall IRR estimates. For example, ICCs may underestimate the true reliability for some designs that are not fully crossed, and researchers may need to use alternative statistics that are not widely implemented in statistical software packages to assess IRR in some studies that are not fully crossed (Putka, Le, McCloy, and Diaz, 2008). Only when the test-retest reliability reported in the ELAN manual was used did a substantial number of rating pairs (30 of 53, or 56.6%) differ reliably. The magnitude of these differences was assessed descriptively using a scatter plot (see Figure 3) and a Bland-Altman diagram (also known as a Tukey mean-difference plot, see Figure 4). First, we displayed each child's ratings in a scatter plot and illustrated the two different limits of agreement: 43.4% of the ratings diverged by fewer than three T points and can therefore be considered concordant within the limits of the more conservative RCI estimate, while 100% of the scores lie within 11 T points and therefore within the limits of agreement based on a reliability estimate obtained with the sample of this study.
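As a hedged sketch, a Bland-Altman (Tukey mean-difference) diagram of the kind referenced above can be drawn by plotting the difference of each rating pair against the pair's mean, together with the mean difference and the 1.96 SD limits of agreement. The example below uses matplotlib and illustrative variable names, not the study's data.

```python
import numpy as np
import matplotlib.pyplot as plt

def bland_altman_plot(scores_a, scores_b):
    """Plot rating differences against pair means with mean difference and 1.96 SD limits."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    mean = (a + b) / 2
    diff = a - b
    md, sd = diff.mean(), diff.std(ddof=1)
    plt.scatter(mean, diff)
    plt.axhline(md, color="black", label="mean difference")
    plt.axhline(md + 1.96 * sd, color="gray", linestyle="--", label="limits of agreement")
    plt.axhline(md - 1.96 * sd, color="gray", linestyle="--")
    plt.xlabel("Mean of the two ratings")
    plt.ylabel("Difference between ratings")
    plt.legend()
    plt.show()

# Hypothetical usage with paired scores from two raters
bland_altman_plot([48, 52, 50, 47, 55], [50, 51, 49, 49, 53])
```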

Conclusions drawn from ratings by different raters (e.g., parents and teachers) or at different times (e.g., before and after an intervention) are highly relevant for many disciplines in which skills, behaviours and symptoms are frequently assessed and compared. To capture the degree of agreement between raters as well as the relationship between ratings, three different aspects must be considered: (1) inter-rater reliability, that is, the extent to which the measure used is able to distinguish participants with different skill levels when ratings are provided by different raters; inter-rater reliability measures can also be used to determine the smallest difference between two scores needed to establish a reliable difference. (2) Inter-rater agreement, including the proportion of absolute agreement and, where appropriate, the magnitude and direction of differences. (3) The strength of association between ratings, measured by linear correlations. Kottner and colleagues provide detailed explanations of these approaches, for example in their "Guidelines for Reporting Reliability and Agreement Studies" (Kottner et al., 2011). Authors from the field of education (e.g., Brown et al., 2004; Stemler, 2004) and behavioural psychology (Mitchell, 1979) have likewise emphasized the need to clearly distinguish between the various aspects that contribute to assessing the consistency and reliability of ratings.
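A minimal sketch of the agreement and association aspects, points (2) and (3) above, for two raters scoring the same participants is shown below; reliability itself, point (1), would be computed with an ICC as sketched earlier. The function name and the tolerance used to define "absolute" agreement are assumptions for illustration, not values from the cited guidelines.

```python
import numpy as np

def agreement_summary(scores_a, scores_b, tolerance=0.0):
    """Descriptive agreement and association indices for two raters of the same participants."""
    a = np.asarray(scores_a, dtype=float)
    b = np.asarray(scores_b, dtype=float)
    d = a - b
    return {
        # (2) agreement: proportion of (near-)absolute agreement, plus direction and magnitude
        "absolute_agreement": np.mean(np.abs(d) <= tolerance),
        "mean_difference": d.mean(),          # sign shows which rater tends to score higher
        "sd_of_differences": d.std(ddof=1),   # spread of the differences
        # (3) association: linear correlation between the two sets of ratings
        "pearson_r": np.corrcoef(a, b)[0, 1],
    }

# Hypothetical usage with a 3-point tolerance, e.g. on a T-score scale
print(agreement_summary([48, 52, 50, 47, 55], [50, 51, 49, 49, 53], tolerance=3))
```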
