A bit of background. The company I work for does veterinary diagnostics. Specifically, we diagnose internal parasites in agricultural animals.
Our work has two stages. First, we count the eggs (the worm burden) under a microscope. Then we hatch the eggs out (a 7-day process) to identify exactly which species of parasites are present. This second stage is an incredibly specialized skill that is, unfortunately, dying out. It matters because knowing which species are present in an animal or flock helps determine the correct drench (chemicals) to give to remove the parasites. Bar a few species, it is almost impossible to determine the species present just by looking at the eggs, because the differences in how the eggs appear under the microscope are minuscule.
What we are trying to do is create an AI image-recognition model that looks at an image of the eggs taken from an animal or flock and, with a high level of confidence, determines which species are present (based on similar images whose eggs have been hatched out and identified). The difficulty with our idea is that we cannot provide training images labelled with 100% certainty. For example, an image that might contain up to 1,000 eggs will actually be labelled as 10% species A, 35% species B, 20% species C, and 35% species D, because that is how the hatching result is reported. (Note that we deal with fewer than 10 species, as sub-species are not as important.) It is too difficult and time-consuming to isolate a species, make sure an animal is infected with only that species, and then observe and hatch its eggs to prove that those eggs correspond to that particular species.
So my question is this: how do I go about setting up an AI image-recognition model using images labelled with percentages? (I hope this makes sense.) I have only come across how to do it when the training images are labelled with certainty ("this is a dog" or "this is a truck", as opposed to "20% of the objects in this image are dogs and 80% are trucks").
I am not sure whether this would work, but I am thinking of doing some sort of resampling of the training images.
For neural networks that do object detection and classification, like YOLOv8 (which Andrey Germanov has written an article about on FCC), preparing the training data involves annotating the images. Say there is a dog and a cat in an image: we add the bounding-box parameters of the dog and the cat, with their labels, to the image's annotation file.
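For reference, a YOLO-format annotation is a plain text file, one per image, with one line per object: the class id followed by the box centre, width, and height, all normalized to the image size. With class 0 as the dog and class 1 as the cat (box values made up for illustration), the file would look like:

```
0 0.31 0.55 0.28 0.40
1 0.72 0.60 0.20 0.33
```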
In your example, let's say an image contains 10% species A, 35% species B, 20% species C, and 35% species D. We might set the bounding box to the whole image and then make 100 copies of the image: 10 labelled as species A, 35 as species B, 20 as species C, and 35 as species D. All 100 copies are then added to the training set (see the sketch below). Would that work?
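As a rough sketch of what I mean, assuming a YOLO-style dataset layout (an `images/` directory plus a `labels/` directory) — the file names and the species-to-id mapping here are made up for illustration:

```python
import shutil
from pathlib import Path

SPECIES_IDS = {"A": 0, "B": 1, "C": 2, "D": 3}  # hypothetical class ids

def expand_image(image_path: Path, proportions: dict[str, float],
                 out_images: Path, out_labels: Path, copies: int = 100) -> None:
    """Duplicate one image `copies` times, labelling each copy with a single
    species so the label counts match the hatching proportions."""
    i = 0
    for species, p in proportions.items():
        for _ in range(round(p * copies)):  # rounding may shift a count by 1
            stem = f"{image_path.stem}_{i:03d}"
            shutil.copy(image_path, out_images / f"{stem}{image_path.suffix}")
            # One whole-image box: class x_center y_center width height (normalized)
            (out_labels / f"{stem}.txt").write_text(
                f"{SPECIES_IDS[species]} 0.5 0.5 1.0 1.0\n")
            i += 1

# e.g. expand_image(Path("flock42.jpg"),
#                   {"A": 0.10, "B": 0.35, "C": 0.20, "D": 0.35},
#                   Path("dataset/images/train"), Path("dataset/labels/train"))
```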
Upon considering it, though, I'm not entirely sure it would work. I could easily get the model to function as an automatic egg counter to determine worm burden; placing bounding boxes around the eggs and annotating them as such is easy enough. Getting a prediction on each egg, though, is my stumbling point. Due to the way the test is conducted, the images hold more than just eggs: there are air bubbles and other cellular debris. The model would first have to distinguish the eggs from the debris, then identify the species. I can train for the first with no trouble; it's the annotation for the second step that stumps inexperienced me. The reason I think placing a bounding box around the whole image wouldn't work is the very large variation between images and the amount of debris per image (some images might have only 10 eggs, whereas others can have 1,000).
Perhaps a variant of the idea could work? Using the example numbers and 100 copies of the same image, annotate all the eggs in 10 of the copies as species A, in 35 as species B, in 20 as species C, and in 35 as species D. Or would doing this really screw up the confidence factor? (It does make the annotation job 100 times longer/harder, though.)
Yeah, I had thought of an alternative way to do it: get all the bounding boxes and randomly assign 10% of them to species A, 35% to species B, and so on (sketched below). It may save time on the annotation job, but it could also screw up the confidence factor.
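Something like this, say, assuming the egg boxes already exist from the counting step (the numbers and ids are illustrative):

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # fixed seed so the split is reproducible

def assign_species(boxes: list, proportions: dict[int, float]) -> list[int]:
    """Draw one species id per egg box, with probabilities taken from the
    hatching proportions for that image."""
    ids = list(proportions)
    probs = np.array([proportions[k] for k in ids], dtype=float)
    probs /= probs.sum()  # guard against rounding drift in the report
    picks = rng.choice(len(ids), size=len(boxes), p=probs)
    return [ids[i] for i in picks]

# boxes = [(0.21, 0.33, 0.04, 0.05), ...]  # normalized YOLO boxes for the eggs
# labels = assign_species(boxes, {0: 0.10, 1: 0.35, 2: 0.20, 3: 0.35})
```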
When I suggested placing the bounding box around the whole image, I was thinking that if the image were tightly filled with eggs, different species compositions might give different textures, but that's not going to be the case here.
I have come up with one more thing that might be tried: perform clustering on the eggs identified in the image. Hopefully, if the distribution of eggs across clusters matches the hatching result, we can attach the corresponding labels in the annotations.
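Roughly, the idea would be to crop out each detected egg, turn each crop into a feature vector (e.g. with a small pretrained CNN), and cluster the vectors, then compare the cluster sizes against the hatching report. A sketch with scikit-learn, where the embedding step is left as a placeholder:

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_eggs(embeddings: np.ndarray, n_species: int) -> np.ndarray:
    """embeddings: (n_eggs, d) array, one feature vector per egg crop."""
    km = KMeans(n_clusters=n_species, n_init=10, random_state=0)
    return km.fit_predict(embeddings)

def cluster_proportions(cluster_ids: np.ndarray) -> np.ndarray:
    counts = np.bincount(cluster_ids)
    return counts / counts.sum()

# If cluster_proportions(...) comes out close to the hatching report
# (0.10, 0.35, 0.20, 0.35), the clusters could be mapped onto species labels.
```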
In theory, clustering would work. Unfortunately, in practice, because these are field samples, there is almost always a mix of species in an animal/flock, so getting a cluster right would be very hit and miss.
What I had hoped was that someone somewhere has created a method by which, when annotating the images, the object in a bounding box can be labelled with a "chance out of 100", i.e. this object is 30% likely to be A and 60% likely to be B. Then, as more images with similar objects are added to the dataset, the model picks up on the differences and "refines" the likelihood towards a higher confidence in B than in A. Again, just my theory, and probably completely impractical… but I can keep hoping someone holds the key.
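For what it's worth, what's described here is close to what is usually called "soft labels": the target for each example is a probability vector rather than a single class, and standard classifiers can be trained against such targets directly. A minimal PyTorch sketch at the whole-image level (the model itself is a placeholder):

```python
import torch
import torch.nn.functional as F

def soft_label_loss(logits: torch.Tensor, target_probs: torch.Tensor) -> torch.Tensor:
    """Cross-entropy against a soft target, e.g. [0.10, 0.35, 0.20, 0.35]."""
    log_probs = F.log_softmax(logits, dim=-1)
    return -(target_probs * log_probs).sum(dim=-1).mean()

# logits = model(images)                               # (batch, n_species)
# targets = torch.tensor([[0.10, 0.35, 0.20, 0.35]])   # from the hatching report
# loss = soft_label_loss(logits, targets)
# loss.backward()
```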