November 30, 2023

How to Maximise Quality in Dataset Annotation

Quality training data is the fuel that powers modern AI systems. However, ensuring strong data quality has proven to be one of the most persistent challenges in the machine learning pipeline. Specifically, the process of manually collecting & annotating datasets continues to be a major pain point.

In our previous post, we discussed how problems with data annotation lead to downstream issues for ML models, hampering performance. To shed light on these inconsistencies, our team at SUPA has developed an accuracy scorecard that provides visibility into annotators’ work, making it possible to pinpoint areas needing improvement.

However, visibility alone is not enough. Identifying problem areas guides where to focus, but people need constructive guidance on how to actually enhance their skills. This is where implementing a targeted feedback loop with annotators becomes absolutely crucial to drive continuous improvement. If annotators are not given opportunities to learn from their mistakes and refine their skills, quality will plateau. To understand this, let’s first dig into how quality is impacted by the learning speed of annotators.

Accelerated Learning & its Impact on Data Quality

With machine learning now powering everything from medical imaging to autonomous vehicles, annotators must adapt to a wide array of domains. And with product cycles measured in months rather than years, they need to get up to speed quickly. But what does up to speed entail exactly?

Let’s use our accuracy scorecard example for this. Through extensive benchmarking, we’ve found that a 90% first-pass accuracy score represents a reasonable threshold before labeled datasets go through quality checks. In practice, this means an annotator can make at most one mistake per ten annotations on their first attempt at the work.
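As a concrete illustration, here’s a minimal sketch in Python (not SUPA’s actual tooling; the function and values are hypothetical) of how first-pass accuracy could be computed for a reviewed batch and compared against that 90% threshold.

```python
# Illustrative only: compute first-pass accuracy for a reviewed batch
# and check it against the 90% threshold described above.

ACCURACY_THRESHOLD = 0.90  # 90% first-pass accuracy target

def first_pass_accuracy(total_annotations: int, errors_found: int) -> float:
    """Share of annotations that were correct on the annotator's first attempt."""
    if total_annotations == 0:
        raise ValueError("Batch contains no annotations")
    return (total_annotations - errors_found) / total_annotations

# Example: 1 mistake in 10 annotations sits exactly at the threshold.
score = first_pass_accuracy(total_annotations=10, errors_found=1)
print(f"First-pass accuracy: {score:.0%}, ready for QC hand-off: {score >= ACCURACY_THRESHOLD}")
```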

Quality across the Project Lifecycle

Annotators follow a learning curve, represented in blue, as their skills improve. Accelerating their learning rate allows them to reach the target accuracy level (depicted in red) in less time. Given tight data annotation deadlines, it’s crucial that annotators reach the desired performance level quickly.

The impact of effective feedback

The visibility provided by the scorecard plays a key role here, as it is the foundation on which great feedback and training can be delivered. Research clearly shows effective feedback can halve the time it takes annotators to reach a 90% accuracy level. When feedback directly addresses the root causes of errors, annotators learn faster and reach rigorous benchmarks sooner than they would under standard training programs.
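To make that effect concrete, the toy model below assumes accuracy follows a simple saturating curve, an assumption made purely for illustration rather than a claim about real annotator data. In this model, doubling the learning-rate constant (a stand-in for better feedback) halves the time needed to reach the 90% target.

```python
import math

# Toy model (illustrative assumption, not measured annotator data):
# accuracy follows a(t) = a_max - (a_max - a0) * exp(-k * t),
# where k is the learning rate. Better feedback is modelled as a larger k.

def days_to_target(a0: float, a_max: float, k: float, target: float) -> float:
    """Days until the modelled accuracy curve first reaches `target`."""
    if not a0 < target < a_max:
        raise ValueError("Target must lie between starting accuracy and the ceiling")
    return math.log((a_max - a0) / (a_max - target)) / k

baseline = days_to_target(a0=0.70, a_max=0.97, k=0.10, target=0.90)  # generic feedback
improved = days_to_target(a0=0.70, a_max=0.97, k=0.20, target=0.90)  # targeted feedback

print(f"Generic feedback:  ~{baseline:.0f} days to reach 90%")
print(f"Targeted feedback: ~{improved:.0f} days to reach 90% (half the time)")
```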

On the other hand, vague or misaligned feedback slows progress. By focusing feedback on individual weaknesses and on each annotator’s understanding of the project, annotators build competence faster relative to the effort invested. Feedback infrastructure that speeds up the delivery of feedback should be non-negotiable for accelerating annotators’ development.

Effective feedback saves time while raising the bar for quality. This empowers teams to reach higher standards faster across projects.

What does Effective Feedback look like for Data Annotators?

Before going into more specifics about SUPA’s feedback model, it’s important to establish a universal framework for what “effective” feedback entails in data annotation workflows.

Many organizations still take a hands-off approach, offering little specific guidance to annotators. At best, high-level comments like “Improve accuracy” or “Be more consistent” are provided during sporadic performance reviews.

But this generic, sparse feedback often fails to move the needle, a painful lesson we’ve learned over the course of many projects. Without clear examples tied to observable problem areas, annotators struggle to interpret the feedback; they often tell us they have no clue how to analyze or correct their work patterns.

In contrast, effective feedback is frequent, specific, and actionable:

  • Frequent: Provided continuously, in context, at relevant teachable moments rather than during widely spaced reviews
  • Specific: Includes analysis of actual examples of errors made, not just abstract directives to “do better”
  • Actionable: Supplies precise, tactical recommendations to implement in situations where certain errors occur

Effective feedback gets results

For example, consider a simple project identifying the condition of packages in transit. If a quality checker notices an annotator consistently struggling to identify damaged packaging, they would assemble examples of the errors made. The feedback would pinpoint the exact issue that was missed, such as a dented box corner or torn tape, along with instructions for double-checking commonly overlooked areas and guidance on how to handle edge cases.

The goal of feedback here is not just to evaluate performance, but to provide the tools and know-how for individuals to self-correct. Frequent, specific, and actionable feedback unlocks rapid upskilling for annotator teams. They gain consistency and accuracy at scale while also perpetually expanding collective expertise.

SUPA’s Approach: Scaling Effective Feedback 

In the prior section, we explored how effective feedback fuels faster, better learning for data annotators. But how can this be accomplished without compromising speed or quality at scale, especially given aggressive timelines for ML projects?

At SUPA, our scaling infrastructure aims to ease this speed-versus-quality tradeoff by accelerating annotators’ learning curves, compressing proficiency development timelines from months to weeks via the following approaches:

Scaling Quality Checker Training

Our quality checkers complete training modules focused on providing constructive, targeted feedback. We equip them with frameworks for explaining common errors and crafting helpful feedback connected to the specific skills needing development.

For instance, if an annotator repeatedly mixes up plastic polymer types, the QC reviewer may assemble a photo sheet of examples mislabeled as PVC instead of polypropylene. They’d provide guidance on carefully inspecting for the “shine” quality of polypropylene and double-checking its flexibility.

We also have a “mentorship” program where quality checkers are assigned low-performing annotators for training. This is helpful when the project needs to scale, as quality checkers can identify skilled, fast-improving annotators to join their ranks. Quality checkers are also periodically rotated back into the annotator pool so they keep the perspective of the annotators receiving their feedback.

The impact is rapid growth in the number of skilled annotators during the scaling phase of a project. While we’re still refining this process, the approach works well and relies on the annotator community for efficient upskilling.

Concurrent Quality Checking

Unlike most data annotation companies (that we know of), we integrate quality checking into the annotation process from the start. While some approaches mandate quality checking only after a batch is complete, we begin the moment annotators start work on a batch.

It goes something like this: 

Annotator: Finishes classifying 3 images in a dataset, submits work

QC Reviewer: Checks first image, provides tip - "When the product lighting is uneven, try adjusting the brightness before assessing for defects so features stand out more clearly."

Annotator: Implements feedback, continues on next 2 images applying improved technique

QC Reviewer: Checks image 4 - "Great job accounting for tricky lighting changes here! For the small scratches, specify whether surface level vs deeper grooves."

Annotator: Applies depth of scratch callout feedback across remaining batch

By embedding back-and-forth collaboration within active workflows, we reinforce skills in real time while still shipping production work. Early course correction stops bad habits from setting in. On average, annotators who receive substantial feedback during the first half of their tasks make far fewer errors by the second half.
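The sketch below is a simplified illustration of this review-as-you-go pattern rather than our platform’s actual code; the data structures and the rule that triggers a tip are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class AnnotationItem:
    image_id: str
    label: str
    feedback: list = field(default_factory=list)  # tips attached by the QC reviewer

def review_submitted_items(batch: list, reviewed_count: int) -> list:
    """QC reviewer checks only the items submitted so far and returns tips."""
    tips = []
    for item in batch[:reviewed_count]:
        if item.label == "unsure":  # hypothetical trigger for a coaching tip
            tip = f"{item.image_id}: adjust brightness before assessing for defects."
            item.feedback.append(tip)
            tips.append(tip)
    return tips

# The annotator submits 3 images; QC reviews them while later images are still being labeled.
batch = [
    AnnotationItem("img_001", "damaged"),
    AnnotationItem("img_002", "unsure"),
    AnnotationItem("img_003", "intact"),
]
for tip in review_submitted_items(batch, reviewed_count=3):
    print("Tip for annotator:", tip)
```

The key point is that review happens on a partial batch, so tips arrive while the rest of the batch is still being annotated.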

Purpose-Built Tooling

Our platform equips reviewers with tailored functionality to standardize and simplify high-quality feedback at scale:

Inline Image Commenting

Our interface allows reviewers to add context directly onto the source assets, eliminating potential ambiguity over which parts are incorrect:

Reviewers can pinpoint mistakes and provide feedback accordingly

Some examples:

  • "This section with torn packaging should be marked as damaged."
  • "The scratches here appear surface level based on the depth."

By grounding feedback visually, comprehension and recall improve compared with text-only comments detached from the images.
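As an illustration only, an inline comment anchored to an image region might be represented with a structure like the one below; the field names are assumptions, not our platform’s actual schema.

```python
from dataclasses import dataclass

@dataclass
class InlineComment:
    image_id: str
    x: int       # top-left corner of the highlighted region, in pixels
    y: int
    width: int   # size of the highlighted region, in pixels
    height: int
    message: str

comment = InlineComment(
    image_id="pkg_0042",
    x=120, y=340, width=80, height=60,
    message="This section with torn packaging should be marked as damaged.",
)
print(f"[{comment.image_id}] region ({comment.x}, {comment.y}, {comment.width}x{comment.height}): {comment.message}")
```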

Automated Accuracy Reporting

As Quality Checkers insert feedback pointers on the images themselves, our systems auto-generate the performance scorecard based on the pointers.

Key stats like:

  • Aggregated areas needing improvement
  • Numerical accuracy rates
  • Examples of frequent error cases

are compiled dynamically from the comments applied to sample data batches. This saves reviewers time while giving data annotation leads an analytical dashboard for quality assurance.
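The sketch below shows, with hypothetical error categories and data structures rather than our actual reporting pipeline, how inline comments could be rolled up into the scorecard stats listed above.

```python
from collections import Counter

# Illustrative aggregation only: roll the QC reviewer's inline comments up
# into scorecard stats (accuracy rate plus areas needing improvement).

comments = [
    {"image_id": "img_001", "category": "missed_damage"},
    {"image_id": "img_004", "category": "wrong_severity"},
    {"image_id": "img_009", "category": "missed_damage"},
]
total_reviewed = 40  # images the QC reviewer sampled from the batch

error_counts = Counter(c["category"] for c in comments)
accuracy = (total_reviewed - len(comments)) / total_reviewed

print(f"Sample accuracy: {accuracy:.0%}")
print("Areas needing improvement:")
for category, count in error_counts.most_common():
    print(f"  - {category}: {count} occurrence(s)")
```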

Combined with the other feedback frameworks we've shared, these purpose-built tools create end-to-end efficiency, transparency, and actionability.

Closing Thoughts: Data Quality and AI

As machine learning capabilities rapidly advance, it’s easy to focus solely on the models. However, the true fuel driving AI progress is quality training data. Producing accurate, nuanced datasets depends heavily on the human annotators doing the work. Their skills make or break the systems they enable.

That's why implementing effective feedback practices for annotation teams is so key. Those human gains transfer directly through to better model performance. At SUPA, our frameworks ensure responsive, personalized feedback at scale. We equip reviewers, streamline feedback cycles, and provide analytics, with the goal of scaling a quality workforce in a very short time.

As appetite for ever more advanced AI models grows across industries, it’s easy to forget the human factors at the core. Invest in the feedback fueling your data creators & achieve returns in intelligence both artificial and human.

Here at SUPA, we’ve spent over 7 years helping AI companies build better models. Find out more about our use cases here.