December 5, 2022

Data Labeling’s Last Mile Problem

This is a two-part story. You’re reading part one, where we illustrate and explain data labeling’s last mile problem. Part two, which covers how we solve this problem, is up next.

What is Data Labeling’s Last Mile Problem?

Answers from more than 20 engineers at CVPR 2022 (the Conference on Computer Vision and Pattern Recognition) reveal that label noise is the main roadblock preventing them from confidently training their models.

Label noise is the deviation of a dataset’s labels from the true (ground-truth) labels, and it is a critical component of data quality that degrades the performance of machine learning (ML) models. Label noise typically happens for three reasons:

  1. Your dataset may include new corner cases
  2. Your images may suffer from quality loss due to hardware that’s gone out of calibration, or 
  3. One annotator (or a few) might have made a careless mistake
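
To make this definition concrete, here is a minimal sketch (in Python) of one way label noise can be estimated: compare the delivered labels against a small, trusted gold-standard subset and report the disagreement rate. The image IDs, class names, and the label_noise_rate helper are illustrative assumptions, not part of any particular tool.

```python
# Minimal sketch: estimate label noise as the fraction of delivered labels
# that disagree with a trusted, manually reviewed gold-standard subset.
# The `gold` and `delivered` dicts (image_id -> class label) are hypothetical.

def label_noise_rate(gold: dict, delivered: dict) -> float:
    """Share of gold-reviewed items whose delivered label deviates from the gold label."""
    ids = gold.keys() & delivered.keys()
    if not ids:
        return 0.0
    mismatches = sum(1 for i in ids if delivered[i] != gold[i])
    return mismatches / len(ids)

gold = {"img_001": "car", "img_002": "truck", "img_003": "pedestrian"}
delivered = {"img_001": "car", "img_002": "car", "img_003": "pedestrian"}

print(f"Estimated label noise: {label_noise_rate(gold, delivered):.1%}")  # 33.3%
```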

No Standard Tool to Measure Label Noise

“You can't improve what you don't measure.” - Peter Drucker

Faced with label noise, engineers have accepted the reality that they will need to run quality assurance (QA) checks to measure the labeled dataset’s quality and correct the mistakes by themselves.

However, there is no standard tool to efficiently measure label noise. Engineers have hacked together tools and built their own review process to verify the quality of their labeled datasets.
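In practice, these home-grown QA checks often amount to a short script. The sketch below is a rough approximation of that workflow: it breaks disagreements with a reviewed gold subset down by class and by annotator, which hints at whether the noise stems from corner cases or from individual annotators. The CSV layout and column names are assumptions for illustration only.

```python
import csv
from collections import Counter

# Assumed CSV layout: image_id, annotator, delivered_label, gold_label
# (only rows that received a gold review are included).
errors_by_class, totals_by_class = Counter(), Counter()
errors_by_annotator, totals_by_annotator = Counter(), Counter()

with open("qa_review.csv", newline="") as f:
    for row in csv.DictReader(f):
        cls, annotator = row["gold_label"], row["annotator"]
        totals_by_class[cls] += 1
        totals_by_annotator[annotator] += 1
        if row["delivered_label"] != row["gold_label"]:
            errors_by_class[cls] += 1
            errors_by_annotator[annotator] += 1

# Per-class noise rates hint at corner cases; per-annotator rates hint at careless mistakes.
for cls, total in totals_by_class.items():
    print(f"class {cls}: {errors_by_class[cls] / total:.1%} noisy")
for annotator, total in totals_by_annotator.items():
    print(f"annotator {annotator}: {errors_by_annotator[annotator] / total:.1%} noisy")
```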

One engineer at a reputable autonomous trucking company shared that their entire data pipeline is made up of isolated solutions and tools, with no communication between the companies handling curation, labeling, and QA. This creates a huge amount of friction for the engineer and their team when they try to measure label noise and improve the quality of the labeled data.

Correcting Label Noise Takes Months

Label noise is a prevalent problem in almost every labeled dataset, and engineers know that reducing it is not a one-and-done task.

It requires continuous iterations and communication with annotators to figure out what steps would effectively improve the quality of the labeled data. Sometimes, neither party knows which quality improvement strategy would work, which means trial and error of different strategies is required. This entire process typically takes months, as shared by a senior staff engineer currently working at an autonomous driving startup:

“Major pain point for annotations is the instructions take 1-2 months of back and forth with the external team to iterate and align. 4-6 months ramp up time until it is stable.”

Correcting label noise pushes ML model deployment back by months and could potentially impact your company’s profitability and survival. If a competitor gets to production sooner, they have a head start in capturing a larger market share. Companies therefore need to iterate on and test quality improvement strategies faster to stay ahead of the competition.

Summary

Reducing label noise is the prevalent last mile problem in every engineer’s data pipeline.

Without the right tools to measure and quickly iterate quality improvement strategies, it could take months before engineers can confidently deploy their ML models to production. The time it takes to improve and maintain label quality could mean life or death for your company.

SUPA addresses both of these challenges, the lack of a standard measurement tool and the months-long correction cycle, with the BOLT data labeling platform, where engineers can quickly and easily run iterations to reduce label noise. Try out BOLT today for free and get a POC report to evaluate against your current data labeling solution.

Stay tuned for part two, where we explain how BOLT’s features help alleviate engineers’ burden of reducing label noise and solve data labeling’s last mile problem.