October 30, 2023

Assuring Quality in Data Annotation

Assuring Quality in Data Annotation

Acquiring high-quality labeled data has been a long-standing challenge for the machine learning (ML) industry. Challenges include: 

  1. Diverse Needs: Every ML project has its own specific requirements.
  2. Rising Standards: As ML models get better, they need better-quality data.
  3. Identifying Proficient Labelers: Lack of visibility in discerning the traits that make an effective labeler.

Given the above challenges, a key question we had was: How might we improve quality in data labeling, taking into account different needs of projects, rising standards and the need to identify proficient labelers?

It Starts With Measuring

The starting point for improving quality was figuring out how to measure it. We couldn’t move a needle that didn’t exist yet. Aligning all stakeholders, whether labeler, client or engineer on what quality meant proved difficult without a frame of reference.

To create said reference, we chose to start simple by defining quality in its simplest form - the presence of mistakes in labeling.

However, how do we define mistakes? On what basis? The answer was quite simple in hindsight. Every data labeling project came with a set of instructions from the client that’s the source of truth for labelers working on the projects. The rules for what makes a label wrong came from the instructions given by the client.

Example: Categorizing Mistakes for Image Annotation 

Image annotation involves assigning labels to pixels in an image in different use cases. Consider a simple project where a labeler would have to draw bounding boxes around cats and dogs.

Sample Image for our Project

For image annotation, mistakes could be divided into the following general categories based on prior research:

  • Misdrawn Annotations: Annotations with bad boundaries e.g. too tight or loose.
  • Mislabeled Annotations: Annotations with the wrong label e.g. cat labeled as a dog
  • Extra Annotations: Unnecessary or additional annotations that don’t fit project instructions
  • Missing Annotations: Annotations that should have been drawn but were not 
Illustration of potential mistakes in our example project


Introducing the Accuracy Scorecard

With that, this led to the conception of the Accuracy Scorecard. Think of this as a precise ledger where we recorded mistakes made in image annotation projects, based on the aforementioned mistake types. We used a simple formula to do this where:

Application of the scorecard gave us a clear view of performance at both the project and individual level.

Example of the Annotator Scores in a given project

The Result

Implementing the scorecard across multiple projects proved a resounding success. Projects started to improve slowly but steadily over time due to the following key factors:

  1. Feedback for Labelers: We were able to create individualized feedback for data labelers with the data provided by the scorecard. This extended to correcting gaps in their understanding of the project instructions, e.g. many extra/missing mistakes typically denoted that they did not fully understand the project requirements.
  2. Culture Shift: Labelers engaged with SUPA as a community to solicit feedback and advice on how to improve their scores. They even collaborated with each other in small teams to validate each other's work and understanding of the project
  3. Root Cause Identification: Visibility of scores at the project level helped us trace quality challenges to their root cause and fix them, e.g. lack of clarity in instructions, understanding of tooling

....

Learn more about our use cases here