October 28, 2022

Measuring Annotation Task Complexity [Part 1]

What is annotation task complexity?

Task complexity score is a numerical value assigned to an image annotation task. This value is normalized to the 0-1 range, with 0 being the easiest and 1 being the hardest.

Why measure it?

It's important to be able to quantify the difficulty or complexity of an annotation task so that you can plan your annotation strategy more effectively:

  • It gives us a more accurate estimate of the time taken to complete a task
  • It allows us to assign annotators to tasks more intelligently, matching more experienced annotators with higher-complexity projects
  • It facilitates the optimization of the QC effort and strategy

What is required to generate a complexity score?

The algorithm calculates the complexity score from image data. It works with both annotated and unannotated data, but computes a score that takes a wider range of metrics into account when annotations/labels are provided. At a bare minimum, the method needs a single unannotated image; the more images and annotations supplied, the more accurate the measure of complexity.

How is project complexity measured?

The complexity score is a weighted sum of metrics captured from the image and annotation data (a minimal sketch of this combination is shown after the category lists below).

These metrics can be broadly divided into the following three categories:

Image Data

  1. Image Entropy
  2. Image Blurriness
  3. Image Brightness

Annotation/Label Data

  1. No. of labels
  2. No. of annotations per image
  3. Type of annotation(s)

Object Data

  1. Object visibility
  2. Similarity between different objects
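
To make the weighted-sum combination concrete, here is a minimal Python sketch; the metric names, weights, and clipping step are illustrative assumptions, not the actual values or implementation used by the algorithm.

```python
def complexity_score(metrics, weights):
    """Combine normalized per-metric scores into a single value in [0, 1]."""
    total_weight = sum(weights.values())
    weighted = sum(weights[name] * metrics[name] for name in weights)
    return min(max(weighted / total_weight, 0.0), 1.0)

# Hypothetical example: each metric has already been normalized to [0, 1].
metrics = {"entropy": 0.7, "blurriness": 0.2, "brightness": 0.4}
weights = {"entropy": 1.0, "blurriness": 1.0, "brightness": 1.0}
print(round(complexity_score(metrics, weights), 2))  # 0.43
```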

Image Metrics

The metrics captured in this section examine the unlabelled image data. Whether or not annotations are provided, these three metrics require, at minimum, only a single image in order to be generated.

Image Entropy

The entropy of an image represents the amount of information/randomness in the image. The more unpredictable or noisy an image is from one pixel to the next, the higher the entropy of that image. 

The entropy of an image is calculated as follows. Let p be the normalized image histogram counts, with zeros filtered out. The entropy is then:

-sum(p * log2(p))

In order to obtain a normalized entropy that can be used to compare different image types across different tasks, the maximum entropy is calculated and the computed image entropy is normalized to the 0-max(entropy) scale. 
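
As a minimal sketch, the normalized entropy could be computed with NumPy as follows, assuming an 8-bit greyscale image and using log2 of the histogram bin count as the maximum possible entropy:

```python
import numpy as np

def normalized_entropy(gray_image, bins=256):
    """Shannon entropy of a greyscale image's histogram, scaled to [0, 1]."""
    counts, _ = np.histogram(gray_image, bins=bins, range=(0, 256))
    p = counts / counts.sum()          # normalized histogram counts
    p = p[p > 0]                       # filter out the zeros
    entropy = -np.sum(p * np.log2(p))  # -sum(p * log2(p))
    return float(entropy / np.log2(bins))  # scale by the maximum entropy
```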

Below are some examples of images from public datasets, along with their corresponding entropy scores.

The entropy thus captures the amount of noise in an image, and this matches the intuitive visual assessment in many but not all cases.

Mismatches between calculated entropy and visually judged noise can be caused by shadows, glare, slightly different shades of color that the eye may not pick up, uneven lighting, fuzziness/blurriness, or items with rich textures. Lighting and blur noise are picked up by the next two metrics, and color differences are examined further on, which balances out most false high entropies in the overall complexity score.

Image Blurriness

Our research found that a recurring cause of project complexity is a high number of blurry images. To this end, a blur detection algorithm is used to count the number of images in the sampled set that are ‘blurry’.

To detect blur, the variance of the Laplacian is used: the image is convolved, in greyscale, with the following kernel:

The Laplacian Kernel

The Laplacian highlights regions of the image containing rapid intensity changes, and as such is used to detect edges. When the variance of the result is low, there is little distinction between edge and non-edge regions across the image, indicating a higher possibility that the image is blurry.

After experimenting with different values, we set the blur threshold at 400. Below are examples of images classified as either blurry or non-blurry according to this algorithm.
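
A minimal sketch of this check using OpenCV's Laplacian operator is shown below; the image-loading details are assumptions, but the 400 threshold is the value quoted above.

```python
import cv2

BLUR_THRESHOLD = 400  # threshold chosen after experimentation (see above)

def is_blurry(image_path, threshold=BLUR_THRESHOLD):
    """Flag an image as blurry when the variance of its Laplacian is low."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    laplacian_variance = cv2.Laplacian(gray, cv2.CV_64F).var()
    return laplacian_variance < threshold
```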

Image Brightness

Another commonly reported problem is dark images, which make it difficult to spot objects and edges. The brightness of an image is computed by converting it to grayscale and calculating the average pixel brightness from the grayscale histogram. The resulting average is then normalized to the 0-1 range based on the [0, 255] limits of grayscale values.
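
A minimal sketch of this metric, taking the mean grayscale pixel value directly (which is equivalent to averaging over the grayscale histogram):

```python
import cv2
import numpy as np

def normalized_brightness(image_path):
    """Average grayscale pixel value, normalized from [0, 255] to [0, 1]."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    return float(np.mean(gray)) / 255.0
```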

Below are some examples of images and their corresponding normalized brightness scores.

Label Metrics

These metrics look at the label and annotation information. This calculation requires images with ground truth (GT), whereas the other two categories do not necessarily need GT. The extra information must, however, be supplied when calling the script, as explained in the readme.

The data used in this section comes from a combination of public and internal datasets, and the label and annotation information is used to inform metric definitions and thresholds for the overall complexity score. A total of 181 datasets and over 100,000 images were analyzed.

Number of Labels in the task

This metric looks at the total number of labels in an image. The image data were examined to understand the average number of labels in an image, and what counts as few or many labels. Out of the 181 projects, 97 contained image annotation data, and among these, 52 had completed leads. Only completed leads were examined, to avoid a bias towards a lower total number of labels.

Distribution of labels

The average number of labels in these datasets is 12.82, but as can be seen in the box plot, there are high outliers that skew this mean. The median, which sits at 5.5 labels per image, is more representative of the expected number of labels. 75% of images have fewer than 20 labels, and almost all the rest have fewer than 40, with only three images having over 40 labels. Based on these figures, the number of labels in an incoming project can be used to assign it a label score as follows:
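
The exact bracket boundaries are not reproduced here; the sketch below is purely illustrative, with hypothetical thresholds loosely derived from the quartile figures quoted above.

```python
def label_score(num_labels):
    """Map a task's label count to a score in [0, 1].

    The boundaries are illustrative only (median ~5.5, 75% under 20,
    almost all under 40); they are not the algorithm's actual brackets.
    """
    if num_labels <= 6:
        return 0.25
    if num_labels <= 20:
        return 0.5
    if num_labels <= 40:
        return 0.75
    return 1.0
```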

No. of Annotations Per image

The expected number of objects to be annotated per image affects a task's complexity, especially in terms of how long it takes for a lead to be completed.

Total number of annotations per image

50% of the images had 2 or fewer annotations, whereas 75% had no more than 8. There are a large number of outliers beyond the upper boxplot whisker, which lies at about 20 annotations. Using these figures, the following brackets are set:
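
As with the label score, the sketch below is illustrative only, using the quoted statistics (median 2, 75th percentile 8, whisker at about 20) as hypothetical bracket boundaries rather than the algorithm's actual values.

```python
def annotation_count_score(annotations_per_image):
    """Illustrative bracket mapping for annotations per image."""
    if annotations_per_image <= 2:
        return 0.25
    if annotations_per_image <= 8:
        return 0.5
    if annotations_per_image <= 20:
        return 0.75
    return 1.0
```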

Annotation Type

Here, researchers were asked to rank five types of annotation projects in increasing order of difficulty. The general consensus was as follows, with the corresponding difficulty level:

Bounding Box: 1

Circle: 2

Polygon: 3

Semantic Segmentation: 4

Instance Segmentation: 5

Projects that contain a mixture of annotation types will have their difficulty set to that of the highest-difficulty type; e.g. a mixed bounding box and polygon project will have the difficulty of a polygon project, 3.
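
This ranking and the take-the-maximum rule can be expressed directly; in the sketch below the type identifiers are hypothetical names, but the difficulty values are the ones listed above.

```python
# Difficulty levels as ranked above; a mixed project takes the maximum.
ANNOTATION_DIFFICULTY = {
    "bounding_box": 1,
    "circle": 2,
    "polygon": 3,
    "semantic_segmentation": 4,
    "instance_segmentation": 5,
}

def project_annotation_difficulty(annotation_types):
    """Return the difficulty of a project given its annotation types."""
    return max(ANNOTATION_DIFFICULTY[t] for t in annotation_types)

print(project_annotation_difficulty(["bounding_box", "polygon"]))  # 3
```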

Part 2

In the next post we will go through the rest of the method. Stay tuned for part 2!