Recent advances in computer vision are truly thrilling. Just last week, our team came across Grounding DINO, an object detection model that raises the bar in zero-shot object detection. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, that is, without training on any COCO data, and 63.0 AP after fine-tuning on COCO data.
As with many zero-shot models, detection in some domains appears to be easier than in others. I attempted to use the model to detect blood cells in microscopy images, but my first attempt was unsuccessful. However, by changing the prompt to "round object," I managed to identify around 70-80% of the visible blood cells.
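A figure like "70-80% of the visible cells" is easier to compare across prompts when it is computed the same way each time. The sketch below is one possible way to do that, not the method used for the estimate above: it matches predicted boxes to hand-labeled reference boxes by intersection over union (IoU) and reports recall. The function names, the 0.5 IoU threshold, and the example coordinates are all illustrative assumptions.

```python
from typing import List, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def recall(predicted: List[Box], reference: List[Box], iou_threshold: float = 0.5) -> float:
    """Fraction of reference boxes matched by a distinct prediction."""
    unused = list(predicted)
    found = 0
    for ref in reference:
        # Greedily take the best-overlapping prediction that is still unused.
        best = max(unused, key=lambda p: iou(p, ref), default=None)
        if best is not None and iou(best, ref) >= iou_threshold:
            unused.remove(best)
            found += 1
    return found / len(reference) if reference else 0.0


# Made-up example: four labeled cells, three of which the model found.
cells = [(10, 10, 30, 30), (50, 50, 70, 70), (90, 10, 110, 30), (10, 90, 30, 110)]
detections = [(11, 9, 31, 29), (52, 51, 71, 72), (88, 12, 109, 31)]
print(recall(detections, cells))  # 0.75
```

Running the same measurement for each candidate prompt on a small labeled sample gives a like-for-like comparison before committing to one prompt for a whole dataset.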
Our team at Supa believes that this can be extremely useful in automating data labeling processes or making them more efficient. We use Grounding DINO as the initial layer of detection, feeding its output into our platform to identify errors using our classification workflow. Only the instances containing mistakes are corrected, which involves either removing false positives or adding the objects the model missed (false negatives).
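The correction step described above can be pictured as a small merge between the model's output and the reviewers' findings. This is only a minimal sketch of the idea: `Detection`, `apply_review`, and the ID-based bookkeeping are hypothetical names made up for illustration and do not reflect the Supa platform's actual API.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Detection:
    """Hypothetical record for one predicted or manually added box."""
    box_id: str
    label: str
    bbox: Tuple[float, float, float, float]  # (x, y, width, height) in pixels


def apply_review(
    detections: List[Detection],
    rejected_ids: Set[str],
    missed: List[Detection],
) -> List[Detection]:
    """Drop boxes flagged as false positives, then append the missed objects."""
    kept = [d for d in detections if d.box_id not in rejected_ids]
    return kept + list(missed)


# Made-up example: box "b" was a false positive, and one cell was missed.
model_output = [
    Detection("a", "round object", (10, 10, 20, 20)),
    Detection("b", "round object", (50, 50, 20, 20)),
]
corrected = apply_review(
    model_output,
    rejected_ids={"b"},
    missed=[Detection("c", "round object", (90, 10, 20, 20))],
)
print([d.box_id for d in corrected])  # ['a', 'c']
```

Correct detections pass through untouched, so reviewer effort scales with the number of mistakes rather than with the number of objects.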
We've put together a Jupyter notebook so you can begin experimenting on your own. You can test the model and even export the predictions in COCO format, allowing you to upload them to our platform and explore the validation and quality improvement workflow. We'll soon be publishing a blog post on this topic, and we're also working on something interesting with Segment Anything.
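For readers unfamiliar with COCO format, here is a rough sketch of what such an export can look like for a single image. It is not the notebook's code. It assumes the detector returns boxes as normalized (center x, center y, width, height) values together with a confidence score and a text phrase per box; the function name `to_coco` and the example values are illustrative. The `score` field is an extra that COCO detection results carry but ground-truth annotations do not.

```python
import json
from typing import Dict, Sequence


def to_coco(
    file_name: str,
    width: int,
    height: int,
    boxes: Sequence[Sequence[float]],
    scores: Sequence[float],
    phrases: Sequence[str],
) -> Dict:
    """Build a COCO-style dict for one image from normalized (cx, cy, w, h) boxes."""
    # One category per distinct phrase, with ids starting at 1.
    names = sorted(set(phrases))
    category_ids = {name: i for i, name in enumerate(names, start=1)}

    annotations = []
    for ann_id, (box, score, phrase) in enumerate(zip(boxes, scores, phrases), start=1):
        cx, cy, w, h = box
        # COCO stores [x_min, y_min, width, height] in absolute pixels.
        bw, bh = w * width, h * height
        x_min, y_min = cx * width - bw / 2, cy * height - bh / 2
        annotations.append({
            "id": ann_id,
            "image_id": 1,
            "category_id": category_ids[phrase],
            "bbox": [x_min, y_min, bw, bh],
            "area": bw * bh,
            "iscrowd": 0,
            "score": score,  # assumption: keep the model confidence for later review
        })

    return {
        "images": [{"id": 1, "file_name": file_name, "width": width, "height": height}],
        "annotations": annotations,
        "categories": [{"id": i, "name": n} for n, i in category_ids.items()],
    }


# Made-up example: one detection centered in a 640x480 image.
coco = to_coco("cells.png", 640, 480, [[0.5, 0.5, 0.25, 0.5]], [0.8], ["round object"])
print(json.dumps(coco["annotations"][0]["bbox"]))  # [240.0, 120.0, 160.0, 240.0]
```

A multi-image export would follow the same pattern, with unique image and annotation ids across the whole file.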
If you'd like to be notified when we release this content, follow our LinkedIn page for updates.