Agreement or Accuracy
Agreement is perhaps the most straightforward way to quantify the reliability of categorical measurements. It is calculated as the amount of observed agreement (i.e., the number of objects that pairs of raters assigned to the same or similar categories) divided by the amount of possible agreement (i.e., the number of objects that pairs of raters could have assigned to the same categories).
Observed agreement, especially in its simplified form, has been in use for a long time (see Benini, 1901) and has been given many different names: some fields call it "accuracy," while others call it "agreement." It has also been reinvented over the years, for example as Osgood's (1959) coefficient and Holsti's (1969) CR. Observed agreement is often criticized for not adjusting for chance agreement and as such has been called the "index of crude agreement" (Rogot & Goldberg, 1966), the "most primitive" index (Cohen, 1960), and "flawed" (Hayes & Krippendorff, 2007). Despite this criticism, and perhaps because adjusting for chance agreement is itself challenging, agreement has continued to enjoy widespread use. The idea of calculating observed agreement for multiple raters using the "mean pairs" approach was described by Armitage et al. (1966), and Gwet (2014) fully generalized the approach to accommodate multiple raters, multiple categories, and any weighting scheme.
- `mAGREE`: Calculates agreement using vectorized formulas
Use these formulas with two raters and two (dichotomous) categories:
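With two raters and a 2 × 2 contingency table, one minimal way to write observed agreement (the cell notation here is assumed, with n_11 and n_22 counting the objects both raters assigned to category 1 and category 2, respectively, and n = n_11 + n_12 + n_21 + n_22 the total number of objects) is:

```latex
p_o = \frac{n_{11} + n_{22}}{n}
```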
Use these formulas with multiple raters, multiple categories, and any weighting scheme:
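One way to write the generalized formula, following Gwet's (2014) treatment (the notation here is assumed rather than taken from the toolbox): let n' be the number of objects rated by two or more raters, q the number of categories, r_ik the number of raters who assigned object i to category k, r_i the total number of raters who rated object i, and w_kl the agreement weight between categories k and l (the identity matrix for unweighted, nominal categories):

```latex
p_o = \frac{1}{n'} \sum_{i=1}^{n'} \sum_{k=1}^{q}
      \frac{r_{ik}\left(r^{*}_{ik} - 1\right)}{r_i\left(r_i - 1\right)},
\qquad
r^{*}_{ik} = \sum_{l=1}^{q} w_{kl}\, r_{il}
```

As a rough sketch of how this formula can be computed in a vectorized way (a Python illustration under the assumptions above, not the mAGREE implementation; the function name and interface are hypothetical):

```python
import numpy as np

def observed_agreement(codes, categories, weights=None):
    """Weighted percent observed agreement for multiple raters.

    codes      : (n_objects, n_raters) array of category labels (NaN = missing rating)
    categories : sequence of the q possible category labels
    weights    : optional (q, q) agreement-weight matrix; identity (nominal) if omitted
    """
    codes = np.asarray(codes, dtype=float)
    q = len(categories)
    w = np.eye(q) if weights is None else np.asarray(weights, dtype=float)

    # r[i, k] = number of raters who assigned object i to category k
    r = np.stack([(codes == c).sum(axis=1) for c in categories], axis=1).astype(float)
    r_i = r.sum(axis=1)

    # keep only objects rated by two or more raters
    keep = r_i >= 2
    r, r_i = r[keep], r_i[keep]

    # weighted counts: r_star[i, k] = sum over l of w[k, l] * r[i, l]
    r_star = r @ w.T

    per_object = (r * (r_star - 1)).sum(axis=1) / (r_i * (r_i - 1))
    return per_object.mean()

# Example: three raters assign six objects to categories 1-3
codes = np.array([
    [1, 1, 2],
    [2, 2, 2],
    [3, 3, 3],
    [1, 2, 1],
    [1, 1, 1],
    [2, 2, 2],
])
print(observed_agreement(codes, [1, 2, 3]))  # approximately 0.778
```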
- Benini, R. (1901). Principii di demografia: Manuali Barbera di scienze giuridiche sociali e politiche. Firenze, Italy: G. Barbera.
- Osgood, C. E. (1959). The representational model and relevant research methods. In I. de Sola Pool (Ed.), Trends in content analysis (pp. 33–88). Urbana, IL: University of Illinois Press.
- Holsti, O. R. (1969). Content analysis for the social sciences and humanities. Reading, MA: Addison-Wesley.
- Rogot, E., & Goldberg, I. D. (1966). A proposed index for measuring agreement in test-retest studies. Journal of Chronic Diseases, 19(9), 991–1006.
- Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
- Hayes, A. F., & Krippendorff, K. (2007). Answering the call for a standard reliability measure for coding data. Communication Methods and Measures, 1(1), 77–89.
- Armitage, P., Blendis, L. M., & Smyllie, H. C. (1966). The measurement of observer disagreement in the recording of signs. Journal of the Royal Statistical Society, Series A (General), 129(1), 98–109.
- Gwet, K. L. (2014). Handbook of inter-rater reliability: The definitive guide to measuring the extent of agreement among raters (4th ed.). Gaithersburg, MD: Advanced Analytics.