Inter-annotator agreement for object detection analyses

Inter-Annotator Agreement (IAA) is widely used to evaluate annotator reliability. It is a fundamental measure of the quality of annotated data, especially for object detection tasks, where human-annotated datasets are vital for training Machine Learning models. Several metrics for measuring IAA have been devised, such as Krippendorff’s α or Fleiss’s κ; however, a single value does not provide extensive insight into annotation issues. Some researchers have thus devised other metrics, some more interpretable, some more specific to a given use case. The method presented in this paper aims at pinpointing annotation issues through new metrics specific to object detection tasks. Some of these metrics are adaptations of previously defined ones, while others are completely novel. The metrics are defined to address specific issues that may arise in annotation tasks for object detection: whether some annotators contribute more to agreement than others; whether annotators agree on the location and labelling of their bounding boxes; whether there is ambiguity among categories; whether annotators use different annotation schemes. The usefulness of the method is shown through the annotation of a dataset of aerial images of waste by three annotators. The metrics, supported by visualisations, helped to identify issues stemming from both the input data and the behaviour of the annotators: while the global agreement reported for the dataset reached a moderate level, the additional results provided by these metrics highlighted issues of localisation and categorisation ambiguity, and identified a difference in the annotation schemes used by the annotators.

L’Inter-Annotator Agreement (IAA) è ampiamente utilizzato nella valutazione dell’affidabilità degli annotatori. Esistono diverse metriche per misurare l’IAA, come l’α di Krippendorff o la κ di Fleiss, ma un singolo valore non può riportare informazioni esaustive sui problemi legati all’annotazione. Alcuni ricercatori hanno ideato altre metriche, alcune più interpretabili, altre più specifiche a un caso d’uso. Il metodo presentato in questo articolo mira a evidenziare i problemi legati all’annotazione con nuove metriche, specifiche per l’object detection. Alcune delle metriche presentate in questo articolo sono rielaborazioni di metriche definite precedentemente, mentre altre sono completamente nuove. Le metriche sono definite in modo da rispondere a problemi specifici che possono palesarsi nei compiti di annotazione legati all’object detection: alcuni annotatori potrebbero contribuire di più all’accordo rispetto agli altri; gli annotatori potrebbero non accordarsi sulla localizzazione o sulla categorizzazione delle loro bounding box; alcune categorie potrebbero essere difficili da distinguere dalle altre; gli annotatori potrebbero usare schemi di annotazione differenti. L’utilità del metodo viene illustrata tramite l’annotazione di un dataset di immagini aeree di rifiuti da parte di tre annotatori. Le metriche, coadiuvate da alcune visualizzazioni, hanno aiutato l’identificazione di alcuni problemi scaturiti sia dai dati iniziali, sia dal comportamento degli annotatori: nonostante il livello moderato di accordo globale, i risultati aggiuntivi prodotti dalle metriche menzionate sopra hanno evidenziato problemi riguardanti l’ambiguità di localizzazione e categorizzazione ed una differenza negli schemi di annotazione utilizzati dagli annotatori.