Florence2Sam2Segmenter
A specialized engine to find objects in images and predict masks.
The Florence2Sam2Segmenter is a Glif-hosted service that uses Florence2 and SAM2 under the hood.
The model can be found in the dropdown under Model.
You can now add an image URL under Image. For example, if it's run with the following image:
We will get:
This format is specialized for use in scripts, which can come in handy when building artifacts!
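As a rough illustration of what "use in scripts" can look like, here is a minimal Python sketch that parses a segmenter result and draws each polygon onto the source image. The field names (`label`, `polygon`) are assumptions for illustration; inspect the actual output of your glif for the exact keys.

```python
# Sketch: draw the predicted polygons onto the source image.
# NOTE: the result structure (a list of objects with "label" and "polygon"
# keys) is an assumption for illustration; check the real segmenter output
# for the exact field names.
import json
from PIL import Image, ImageDraw

def draw_polygons(image_path: str, result_json: str, out_path: str) -> None:
    objects = json.loads(result_json)  # hypothetical: [{"label": ..., "polygon": [[x, y], ...]}, ...]
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for obj in objects:
        points = [tuple(p) for p in obj["polygon"]]
        draw.line(points + [points[0]], fill=(255, 0, 0), width=3)  # closed outline
        draw.text(points[0], obj.get("label", ""), fill=(255, 0, 0))
    image.save(out_path)
```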
Continue reading for the advanced settings and how it works under the hood.
Output mask: instead of polygons, the output now contains mask_url, which is a URL to a black-and-white bitmask of the segmentation. Since there is only one export layer, multiple object segmentations are flattened into a single mask.
Provide custom caption: instead of letting Florence2 predict a caption, you can also supply a caption yourself if you know what's in the image or if you want to specify an object to cut out.
Area threshold: after the Florence2 grounding step, you can filter out very large bounding boxes. Every bounding box with an area higher than this threshold is discarded. The area is expressed relative to the total image size.
Confidence threshold: disregard any SAM2 mask predictions with a confidence score lower than this threshold. Note: do not confuse this with the bounding box confidence.
NMS threshold: uses the SAM2 scores and filters out overlapping bounding boxes using Non-Max Suppression.
Polygon precision: lower values create a more detailed polygon shape.
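The Polygon precision setting suggests that masks are turned into polygons via contour simplification. As a rough sketch of that idea (not the service's actual code), here is how a bitmask could be converted to a polygon with OpenCV, where a smaller value keeps more vertices and therefore a more detailed shape:

```python
# Sketch of mask -> polygon conversion via contour simplification. This only
# illustrates the idea behind "Polygon precision"; it is not the service's
# actual implementation. A smaller `precision` keeps more points.
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray, precision: float = 0.01) -> np.ndarray:
    """mask: HxW uint8 bitmask (0/255). Returns an (N, 2) array of polygon points."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)            # largest connected region
    epsilon = precision * cv2.arcLength(contour, closed=True)
    approx = cv2.approxPolyDP(contour, epsilon, closed=True)
    return approx.reshape(-1, 2)
```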
Florence2 is a versatile model that can do multiple tasks, but here we have chained together two of them: MORE_DETAILED_CAPTIONING and CAPTION_TO_PHRASE_GROUNDING.
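As a rough sketch of how these two tasks can be chained with the publicly available Florence2 release on Hugging Face (model name, task tokens, and decoding settings below come from that release and may differ from the hosted service):

```python
# Sketch: chain the captioning and grounding tasks with the public Hugging Face
# Florence-2 release. Model id, task tokens, and generation parameters are
# assumptions based on that release, not the hosted service.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_task(image: Image.Image, task: str, text: str = "") -> dict:
    inputs = processor(text=task + text, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task, image_size=image.size)

image = Image.open("rock_on_table.jpg").convert("RGB")

# Step 1: predict a detailed caption.
caption = run_task(image, "<MORE_DETAILED_CAPTION>")["<MORE_DETAILED_CAPTION>"]

# Step 2: ground the caption phrases to bounding boxes.
grounding = run_task(image, "<CAPTION_TO_PHRASE_GROUNDING>", caption)["<CAPTION_TO_PHRASE_GROUNDING>"]
# grounding is expected to look like {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}
```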
MORE_DETAILED_CAPTIONING predicts a detailed caption. For the rock-on-the-table image we would get something like:
We then pass the caption to the CAPTION_TO_PHRASE_GROUNDING task. This will produce a list of found objects and corresponding bounding boxes. This step could output something like:
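The values below are purely illustrative (made-up coordinates for an assumed 1024×768 image), shaped like the grounding output of the public Florence2 release; with these numbers the table box covers roughly 51% of the image, matching the figure used in the filtering step below.

```python
# Purely illustrative grounding result (coordinates are made up, assuming a
# 1024x768 image); the real output depends on the image and predicted caption.
grounding = {
    "bboxes": [
        [380, 250, 640, 470],   # "a rock"    -> relative area ~0.07
        [40, 330, 1000, 750],   # "the table" -> relative area ~0.51
    ],
    "labels": ["a rock", "the table"],
}
```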
The first filter checks the sizes of all bounding boxes and removes those with a relative area higher than the Area threshold setting. In this example, the table bbox will be removed, as its area is 0.51 and thus higher than the 0.3 threshold.
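A minimal sketch of this area filter (function and variable names are ours, not the service's):

```python
# Sketch of the first filter: drop boxes whose relative area exceeds the threshold.
def filter_by_area(bboxes, labels, image_size, area_threshold=0.3):
    width, height = image_size
    total = width * height
    kept_boxes, kept_labels = [], []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        if ((x2 - x1) * (y2 - y1)) / total <= area_threshold:
            kept_boxes.append([x1, y1, x2, y2])
            kept_labels.append(label)
    return kept_boxes, kept_labels

# With the illustrative values above, "the table" (~0.51) is dropped and
# "a rock" (~0.07) is kept.
kept_boxes, kept_labels = filter_by_area(grounding["bboxes"], grounding["labels"], (1024, 768))
```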
Then, SAM2 is used to transform each bounding box into a mask. It will return a bitmask and a confidence score per object.
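As a rough sketch of this step using the open-source `sam2` package (checkpoint name and API details come from that package and may not match the hosted setup):

```python
# Sketch: one mask and one confidence score per surviving bounding box, using
# the open-source SAM2 image predictor.
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = Image.open("rock_on_table.jpg").convert("RGB")
predictor.set_image(np.array(image))

masks, scores = [], []
for box in kept_boxes:  # boxes that survived the area filter
    mask, score, _ = predictor.predict(box=np.array(box), multimask_output=False)
    masks.append(mask[0].astype(bool))  # (H, W) bitmask for this object
    scores.append(float(score[0]))      # SAM2 confidence for this mask
```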
The second filter is based purely on the confidence score. Everything lower than the set Confidence threshold will be removed.
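Continuing the sketch above (the 0.5 threshold is an illustrative value, not the service's default):

```python
# Sketch of the second filter: keep only masks whose SAM2 score clears the
# Confidence threshold.
confidence_threshold = 0.5
keep = [i for i, s in enumerate(scores) if s >= confidence_threshold]
masks = [masks[i] for i in keep]
kept_boxes = [kept_boxes[i] for i in keep]
kept_labels = [kept_labels[i] for i in keep]
scores = [scores[i] for i in keep]
```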
The last filter removes overlapping objects with Non-Max Suppression. Since the bounding boxes themselves don't have confidence scores, we use the corresponding SAM2 confidence scores.
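A small sketch of that last filter using standard NMS (here via torchvision), ranking the boxes by their SAM2 mask scores; the IoU threshold value is illustrative:

```python
# Sketch of the final filter: Non-Max Suppression ranked by the SAM2 scores.
import torch
from torchvision.ops import nms

boxes_t = torch.tensor(kept_boxes, dtype=torch.float32)  # (N, 4) as x1, y1, x2, y2
scores_t = torch.tensor(scores, dtype=torch.float32)     # SAM2 confidence per mask
keep = nms(boxes_t, scores_t, iou_threshold=0.5).tolist()

final_masks = [masks[i] for i in keep]
final_labels = [kept_labels[i] for i in keep]
```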
This results in the final prediction: