Florence2Sam2Segmenter
A specialized engine to find objects in images and predict masks.
The Florence2Sam2Segmenter is a Glif-hosted service that uses Florence2 and SAM2 under the hood.
The model can be found in the dropdown under Model.
You can now add an image URL under Image. For example, if it's run with the following image:
We will get:
This format is specialized for use in scripts, which can come in handy when building artifacts!
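As a rough illustration of what "use in scripts" can look like, here is a minimal Python sketch that parses a segmenter result and draws each polygon onto the source image. The field names (`label`, `polygon`) are assumptions for illustration; inspect the actual output of your glif for the exact keys.

```python
# Sketch: draw the predicted polygons onto the source image.
# NOTE: the result structure (a list of objects with "label" and "polygon"
# keys) is an assumption for illustration; check the real segmenter output
# for the exact field names.
import json
from PIL import Image, ImageDraw

def draw_polygons(image_path: str, result_json: str, out_path: str) -> None:
    objects = json.loads(result_json)  # hypothetical: [{"label": ..., "polygon": [[x, y], ...]}, ...]
    image = Image.open(image_path).convert("RGB")
    draw = ImageDraw.Draw(image)
    for obj in objects:
        points = [tuple(p) for p in obj["polygon"]]
        draw.line(points + [points[0]], fill=(255, 0, 0), width=3)  # closed outline
        draw.text(points[0], obj.get("label", ""), fill=(255, 0, 0))
    image.save(out_path)
```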
Continue reading for the advanced settings and how it works under the hood.
Output mask: instead of polygons, the output now contains mask_url, which is a URL to a black-and-white bitmask of the segmentation. Since there is only one export layer, multiple object segmentations are flattened into a single mask.
Provide custom caption: instead of letting Florence2 predict a caption, you can also supply a caption yourself if you know what's in the image or if you want to specify an object to cut out.
Area threshold: after the Florence2 grounding step, you can filter out very large bounding boxes. Every bounding box with an area higher than this threshold is discarded. The area is expressed relative to the total image size.
Confidence threshold: disregard any SAM2 mask predictions with a confidence score lower than this threshold. Note: do not confuse this with the bounding box confidence.
NMS threshold: uses the SAM2 scores and filters out overlapping bounding boxes using Non-Max Suppression.
Polygon precision: lower values create a more detailed polygon shape.
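The Polygon precision setting suggests that masks are turned into polygons via contour simplification. As a rough sketch of that idea (not the service's actual code), here is how a bitmask could be converted to a polygon with OpenCV, where a smaller value keeps more vertices and therefore a more detailed shape:

```python
# Sketch of mask -> polygon conversion via contour simplification. This only
# illustrates the idea behind "Polygon precision"; it is not the service's
# actual implementation. A smaller `precision` keeps more points.
import cv2
import numpy as np

def mask_to_polygon(mask: np.ndarray, precision: float = 0.01) -> np.ndarray:
    """mask: HxW uint8 bitmask (0/255). Returns an (N, 2) array of polygon points."""
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    contour = max(contours, key=cv2.contourArea)            # largest connected region
    epsilon = precision * cv2.arcLength(contour, closed=True)
    approx = cv2.approxPolyDP(contour, epsilon, closed=True)
    return approx.reshape(-1, 2)
```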
Florence2 is a versatile model that can do multiple tasks, but here we have chained together two of them: MORE_DETAILED_CAPTIONING and CAPTION_TO_PHRASE_GROUNDING.
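As a rough sketch of how these two tasks can be chained with the publicly available Florence2 release on Hugging Face (model name, task tokens, and decoding settings below come from that release and may differ from the hosted service):

```python
# Sketch: chain the captioning and grounding tasks with the public Hugging Face
# Florence-2 release. Model id, task tokens, and generation parameters are
# assumptions based on that release, not the hosted service.
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id = "microsoft/Florence-2-large"
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

def run_task(image: Image.Image, task: str, text: str = "") -> dict:
    inputs = processor(text=task + text, images=image, return_tensors="pt")
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=1024,
        num_beams=3,
    )
    raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
    return processor.post_process_generation(raw, task=task, image_size=image.size)

image = Image.open("rock_on_table.jpg").convert("RGB")

# Step 1: predict a detailed caption.
caption = run_task(image, "<MORE_DETAILED_CAPTION>")["<MORE_DETAILED_CAPTION>"]

# Step 2: ground the caption phrases to bounding boxes.
grounding = run_task(image, "<CAPTION_TO_PHRASE_GROUNDING>", caption)["<CAPTION_TO_PHRASE_GROUNDING>"]
# grounding is expected to look like {"bboxes": [[x1, y1, x2, y2], ...], "labels": [...]}
```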
MORE_DETAILED_CAPTIONING predicts a detailed caption. For the rock-on-the-table image we would get something like:
We then pass the caption to the CAPTION_TO_PHRASE_GROUNDING task. This will produce a list of found objects and corresponding bounding boxes. This step could output something like:
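The values below are purely illustrative (made-up coordinates for an assumed 1024×768 image), shaped like the grounding output of the public Florence2 release; with these numbers the table box covers roughly 51% of the image, matching the figure used in the filtering step below.

```python
# Purely illustrative grounding result (coordinates are made up, assuming a
# 1024x768 image); the real output depends on the image and predicted caption.
grounding = {
    "bboxes": [
        [380, 250, 640, 470],   # "a rock"    -> relative area ~0.07
        [40, 330, 1000, 750],   # "the table" -> relative area ~0.51
    ],
    "labels": ["a rock", "the table"],
}
```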
The first filter checks the sizes of all bounding boxes and removes those with a relative area higher than the Area threshold setting. In this example, the table bbox will be removed, as its area is 0.51 and thus higher than the 0.3 threshold.
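A minimal sketch of this area filter (function and variable names are ours, not the service's):

```python
# Sketch of the first filter: drop boxes whose relative area exceeds the threshold.
def filter_by_area(bboxes, labels, image_size, area_threshold=0.3):
    width, height = image_size
    total = width * height
    kept_boxes, kept_labels = [], []
    for (x1, y1, x2, y2), label in zip(bboxes, labels):
        if ((x2 - x1) * (y2 - y1)) / total <= area_threshold:
            kept_boxes.append([x1, y1, x2, y2])
            kept_labels.append(label)
    return kept_boxes, kept_labels

# With the illustrative values above, "the table" (~0.51) is dropped and
# "a rock" (~0.07) is kept.
kept_boxes, kept_labels = filter_by_area(grounding["bboxes"], grounding["labels"], (1024, 768))
```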
Then, SAM2 is used to transform each bounding box into a mask. It will return a bitmask and a confidence score per object.
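As a rough sketch of this step using the open-source `sam2` package (checkpoint name and API details come from that package and may not match the hosted setup):

```python
# Sketch: one mask and one confidence score per surviving bounding box, using
# the open-source SAM2 image predictor.
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")

image = Image.open("rock_on_table.jpg").convert("RGB")
predictor.set_image(np.array(image))

masks, scores = [], []
for box in kept_boxes:  # boxes that survived the area filter
    mask, score, _ = predictor.predict(box=np.array(box), multimask_output=False)
    masks.append(mask[0].astype(bool))  # (H, W) bitmask for this object
    scores.append(float(score[0]))      # SAM2 confidence for this mask
```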
The second filter is based purely on the confidence score. Everything lower than the set Confidence threshold will be removed.
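Continuing the sketch above (the 0.5 threshold is an illustrative value, not the service's default):

```python
# Sketch of the second filter: keep only masks whose SAM2 score clears the
# Confidence threshold.
confidence_threshold = 0.5
keep = [i for i, s in enumerate(scores) if s >= confidence_threshold]
masks = [masks[i] for i in keep]
kept_boxes = [kept_boxes[i] for i in keep]
kept_labels = [kept_labels[i] for i in keep]
scores = [scores[i] for i in keep]
```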
The last filter removes overlapping objects with Non-Max Suppression. Since the bounding boxes themselves don't have confidence scores, we use the corresponding SAM2 confidence scores.
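A small sketch of that last filter using standard NMS (here via torchvision), ranking the boxes by their SAM2 mask scores; the IoU threshold value is illustrative:

```python
# Sketch of the final filter: Non-Max Suppression ranked by the SAM2 scores.
import torch
from torchvision.ops import nms

boxes_t = torch.tensor(kept_boxes, dtype=torch.float32)  # (N, 4) as x1, y1, x2, y2
scores_t = torch.tensor(scores, dtype=torch.float32)     # SAM2 confidence per mask
keep = nms(boxes_t, scores_t, iou_threshold=0.5).tolist()

final_masks = [masks[i] for i in keep]
final_labels = [kept_labels[i] for i in keep]
```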
This results in the final prediction: