Florence-2 Large

Florence-2 is an advanced vision foundation model that uses a prompt-based approach and a massive training dataset to excel at a wide variety of vision and vision-language tasks.

Model details

| Category | Details |
| --- | --- |
| Model Name | Florence-2 |
| Version | Large |
| Model Category | VLM |
| Size | 0.77B parameters |
| HuggingFace Model | microsoft/Florence-2-large |
| OpenAI Compatible endpoint | Chat Completions |
| License | MIT |

Capabilities

| Feature | Details |
| --- | --- |
| Tool Calling | |
| Azion Long-term Support (LTS) | |
| Context Length | 4096 tokens |
| Supports LoRA | |
| Input data | Text + Image |

Usage

Florence-2 selects its task through a tag included in the text prompt. The tables below list every tag and the task it performs.

Tasks with no additional input:

  • Whole image to natural language:

| Tag | Description |
| --- | --- |
| `<CAPTION>` | Image-level brief caption |
| `<DETAILED_CAPTION>` | Image-level detailed caption |
| `<MORE_DETAILED_CAPTION>` | Image-level very detailed caption |

  • Whole image or region to text:

| Tag | Description |
| --- | --- |
| `<OCR>` | OCR for the entire image |
| `<OCR_WITH_REGION>` | OCR for the entire image, with bounding boxes for individual text items |

  • Whole image to regions and categories or natural language labels:

| Tag | Description |
| --- | --- |
| `<REGION_PROPOSAL>` | Proposes bounding boxes for salient objects (no labels) |
| `<OD>` | Identifies objects via bounding boxes and gives categorical labels |
| `<DENSE_REGION_CAPTION>` | Identifies objects via bounding boxes and gives natural language labels |

Tasks with region input:

  • Region to segment:

| Tag | Description |
| --- | --- |
| `<REGION_TO_SEGMENTATION>` | Segments the salient object in a given region |

  • Region to text:

| Tag | Description |
| --- | --- |
| `<REGION_TO_CATEGORY>` | Gets an object classification for a bounding box |
| `<REGION_TO_DESCRIPTION>` | Gets a natural language description of the contents of a bounding box |
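
Region inputs are passed in the text prompt itself, appended directly after the task tag. Following the Florence-2 model card's convention, a bounding box is encoded as four location tokens `<loc_x1><loc_y1><loc_x2><loc_y2>` with coordinates quantized to a 0-999 grid; the image URL and coordinates below are illustrative only.

const modelResponse = await Azion.AI.run("microsoft-florence-2-large", {
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": { "url": "https://example.com/photo.jpg" }
        },
        {
          // Task tag followed by a bounding box encoded as location tokens
          // (x1, y1, x2, y2 on a 0-999 grid); these values are hypothetical.
          "type": "text",
          "text": "<REGION_TO_CATEGORY><loc_52><loc_332><loc_932><loc_774>"
        }
      ]
    }
  ]
})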

Tasks with natural language input:

  • Natural language to regions (one to many):

| Tag | Description |
| --- | --- |
| `<CAPTION_TO_PHRASE_GROUNDING>` | Given a caption, provides bounding boxes to visually ground phrases in the caption |

  • Natural language to region (one to one):

| Tag | Description |
| --- | --- |
| `<OPEN_VOCABULARY_DETECTION>` | Detects bounding boxes for objects and OCR text |

  • Natural language to segment (one to one):

| Tag | Description |
| --- | --- |
| `<REFERRING_EXPRESSION_SEGMENTATION>` | Given a natural language descriptor, identifies the corresponding segmented region |
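
Natural language input works the same way: the phrase or caption is appended directly after the task tag. A minimal sketch against the OpenAI-compatible endpoint (the endpoint URL is the same placeholder used below, and the image and caption are illustrative):

const response = await fetch("http://endpoint-url/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "microsoft/Florence-2-large",
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: "https://example.com/photo.jpg" }
          },
          {
            // Task tag followed directly by the caption to ground.
            type: "text",
            text: "<CAPTION_TO_PHRASE_GROUNDING>A man riding a bicycle down a city street"
          }
        ]
      }
    ]
  })
})
const result = await response.json()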

This is what a request using Florence-2 task tags looks like:

curl -X POST http://endpoint-url/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Florence-2-large",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://images.unsplash.com/photo-1543373014-cfe4f4bc1cdf?q=80&w=3148&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
            }
          },
          { "type": "text", "text": "<DETAILED_CAPTION>" }
        ]
      }
    ]
  }'

Running with Edge Functions:

The same request can be made from an Edge Function:

const modelResponse = await Azion.AI.run("microsoft-florence-2-large", {
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://images.unsplash.com/photo-1543373014-cfe4f4bc1cdf?q=80&w=3148&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
          }
        },
        {
          "type": "text",
          "text": "<DETAILED_CAPTION>"
        }
      ]
    }
  ]
})
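
The model details table lists the endpoint as Chat Completions-compatible, so the output should arrive as the assistant message content; a sketch, assuming that response shape also applies to the Edge Functions result:

// Assuming the OpenAI Chat Completions response shape; for captioning
// tags the content is the generated caption text.
const caption = modelResponse.choices[0].message.content
console.log(caption)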

JSON schema

{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "type": "object",
  "required": ["messages"],
  "properties": {
    "messages": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["role", "content"],
        "properties": {
          "role": {
            "type": "string",
            "const": "user"
          },
          "content": {
            "type": "array",
            "maxItems": 2,
            "items": {
              "oneOf": [
                {
                  "type": "object",
                  "required": ["type", "image_url"],
                  "properties": {
                    "type": { "const": "image_url" },
                    "image_url": {
                      "type": "object",
                      "required": ["url"],
                      "properties": {
                        "url": {
                          "type": "string",
                          "format": "uri"
                        }
                      }
                    }
                  }
                },
                {
                  "type": "object",
                  "required": ["type", "text"],
                  "properties": {
                    "type": { "const": "text" },
                    "text": {
                      "type": "string",
                      "enum": [
                        "<OCR>",
                        "<CAPTION>",
                        "<DETAILED_CAPTION>",
                        "<MORE_DETAILED_CAPTION>",
                        "<OCR_WITH_REGION>",
                        "<REGION_PROPOSAL>",
                        "<OD>",
                        "<DENSE_REGION_CAPTION>"
                      ]
                    }
                  }
                },
                {
                  "type": "object",
                  "required": ["type", "text"],
                  "properties": {
                    "type": { "const": "text" },
                    "text": {
                      "type": "string",
                      "allOf": [
                        {
                          "not": {
                            "pattern": "<OCR>|<CAPTION>|<DETAILED_CAPTION>|<MORE_DETAILED_CAPTION>|<OCR_WITH_REGION>|<REGION_PROPOSAL>|<OD>|<DENSE_REGION_CAPTION>"
                          }
                        },
                        {
                          "pattern": "<REGION_TO_SEGMENTATION>|<REGION_TO_CATEGORY>|<REGION_TO_DESCRIPTION>|<CAPTION_TO_PHRASE_GROUNDING>|<OPEN_VOCABULARY_DETECTION>|<REFERRING_EXPRESSION_SEGMENTATION>"
                        }
                      ]
                    }
                  }
                }
              ]
            }
          }
        }
      }
    }
  }
}
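
This schema can be used to validate a request body before sending it. A minimal sketch using Ajv (the validator choice and the local file name are assumptions; any JSON Schema draft-07 validator works):

import Ajv from "ajv"
import addFormats from "ajv-formats" // registers the "uri" format used by the schema
import { readFileSync } from "node:fs"

// Load the schema shown above (the file name is hypothetical).
const schema = JSON.parse(readFileSync("./florence2-schema.json", "utf8"))

const ajv = new Ajv()
addFormats(ajv)
const validate = ajv.compile(schema)

const body = {
  messages: [
    {
      role: "user",
      content: [
        { type: "image_url", image_url: { url: "https://example.com/photo.jpg" } },
        { type: "text", text: "<OD>" }
      ]
    }
  ]
}

if (!validate(body)) {
  // Fails for e.g. more than two content items or an unrecognized task tag.
  console.error(validate.errors)
}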