# Florence-2 Large
Florence-2 is an advanced vision foundation model that uses a prompt-based approach and a massive training dataset to excel at a wide variety of vision and vision-language tasks.
## Model details
Category | Details |
---|---|
Model Name | Florence-2 |
Version | Large |
Model Category | VLM |
Size | 0.77B parameters |
HuggingFace Model | `microsoft/Florence-2-large` |
OpenAI-compatible endpoint | Chat Completions |
License | MIT |
## Capabilities
Feature | Details |
---|---|
Tool Calling | ❌ |
Azion Long-term Support (LTS) | ❌ |
Context Length | 4096 tokens |
Supports LoRA | ❌ |
Input data | Text + Image |
## Usage
Florence-2 selects its task from a tag placed in the text prompt. The tables below list all supported tags, grouped by the kind of additional input they require; an example request follows each group of tables.
### Tasks with no additional input
- Whole image to natural language:
Tag | Description |
---|---|
`<CAPTION>` | Image-level brief caption |
`<DETAILED_CAPTION>` | Image-level detailed caption |
`<MORE_DETAILED_CAPTION>` | Image-level very detailed caption |
- Whole image or region to text:
Tag | Description |
---|---|
`<OCR>` | OCR for the entire image |
`<OCR_WITH_REGION>` | OCR for the entire image, with bounding boxes for individual text items |
- Whole image to regions and categories or natural language labels:
Tag | Description |
---|---|
`<REGION_PROPOSAL>` | Proposes bounding boxes for salient objects (no labels) |
`<OD>` | Identifies objects via bounding boxes and gives categorical labels |
`<DENSE_REGION_CAPTION>` | Identifies objects via bounding boxes and gives natural language labels |
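Tasks in this group take the bare tag as the entire text prompt. Below is a minimal sketch of an object-detection (`<OD>`) request using `fetch`; `http://endpoint-url` is the same placeholder used in the full example further down, and the image URL is illustrative.

```javascript
// Minimal sketch: an object-detection request where the prompt is only the <OD> tag.
// "http://endpoint-url" is a placeholder; substitute your deployment's URL.
const response = await fetch("http://endpoint-url/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "microsoft/Florence-2-large",
    messages: [
      {
        role: "user",
        content: [
          // Illustrative image URL
          { type: "image_url", image_url: { url: "https://example.com/street.jpg" } },
          { type: "text", text: "<OD>" },
        ],
      },
    ],
  }),
});
const detection = await response.json();
```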
### Tasks with region input
- Region to segment:
Tag | Description |
---|---|
`<REGION_TO_SEGMENTATION>` | Segments the salient object in a given region |
- Region to text:
Tag | Description |
---|---|
`<REGION_TO_CATEGORY>` | Gets the object classification for a bounding box |
`<REGION_TO_DESCRIPTION>` | Gets a natural language description for the contents of a bounding box |
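In the upstream Hugging Face model, a region is passed by appending four `<loc_*>` tokens to the tag (the box's x1, y1, x2, y2 corners, each quantized to a 0-999 grid over the image). Assuming this endpoint forwards the prompt to the model verbatim, a region-task request body might look like the sketch below; the coordinates and image URL are illustrative.

```javascript
// Hedged sketch: passing a bounding box as <loc_*> tokens, following the upstream
// Hugging Face usage (x1, y1, x2, y2 quantized to 0-999). The coordinates are
// illustrative; whether this endpoint accepts loc tokens is an assumption.
const regionPrompt = "<REGION_TO_DESCRIPTION><loc_52><loc_332><loc_932><loc_774>";

const body = {
  model: "microsoft/Florence-2-large",
  messages: [
    {
      role: "user",
      content: [
        // Illustrative image URL
        { type: "image_url", image_url: { url: "https://example.com/photo.jpg" } },
        { type: "text", text: regionPrompt },
      ],
    },
  ],
};
```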
### Tasks with natural language input
- Natural language to regions (one to many):
Tag | Description |
---|---|
`<CAPTION_TO_PHRASE_GROUNDING>` | Given a caption, provides bounding boxes that visually ground the phrases in the caption |
- Natural language to region (one to one):
Tag | Description |
---|---|
`<OPEN_VOCABULARY_DETECTION>` | Given a phrase, detects bounding boxes for matching objects and OCR text |
- Natural language to segment (one to one):
Tag | Description |
---|---|
`<REFERRING_EXPRESSION_SEGMENTATION>` | Given a natural language description, returns the segmentation of the corresponding region |
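For these tasks, the natural language input is appended directly after the tag in the same `text` field; this mirrors the upstream Hugging Face usage, so treat it as an assumption for this endpoint. A sketch with an illustrative caption and image URL:

```javascript
// Hedged sketch: the natural language input follows the task tag in the
// same "text" field, as in the upstream Hugging Face examples.
const groundingPrompt =
  "<CAPTION_TO_PHRASE_GROUNDING>A green car parked in front of a yellow building.";

const body = {
  model: "microsoft/Florence-2-large",
  messages: [
    {
      role: "user",
      content: [
        // Illustrative image URL
        { type: "image_url", image_url: { url: "https://example.com/car.jpg" } },
        { type: "text", text: groundingPrompt },
      ],
    },
  ],
};
```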
This is what a request using Florence-2 task tags looks like:
```bash
curl -X POST http://endpoint-url/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "microsoft/Florence-2-large",
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": {
              "url": "https://images.unsplash.com/photo-1543373014-cfe4f4bc1cdf?q=80&w=3148&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
            }
          },
          { "type": "text", "text": "<DETAILED_CAPTION>" }
        ]
      }
    ]
  }'
```
### Running with Edge Functions
This is an example of how to run this model with Edge Functions:
```javascript
const modelResponse = await Azion.AI.run("microsoft-florence-2-large", {
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "image_url",
          "image_url": {
            "url": "https://images.unsplash.com/photo-1543373014-cfe4f4bc1cdf?q=80&w=3148&auto=format&fit=crop&ixlib=rb-4.0.3&ixid=M3wxMjA3fDB8MHxwaG90by1wYWdlfHx8fGVufDB8fHx8fA%3D%3D"
          }
        },
        { "type": "text", "text": "<DETAILED_CAPTION>" }
      ]
    }
  ]
})
```
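Since the endpoint follows the Chat Completions format (see Model details), the generated text should be readable as in the sketch below. Whether `Azion.AI.run` resolves to a parsed object or to a `Response` is an assumption to verify against the Edge Functions docs.

```javascript
// Hedged sketch: extracting the generated text, assuming a standard
// Chat Completions response shape. If Azion.AI.run returns a Response
// object instead, parse it first: const data = await modelResponse.json()
const caption = modelResponse.choices[0].message.content;
console.log(caption);
```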
## JSON schema
{ "$schema": "http://json-schema.org/draft-07/schema#", "type": "object", "required": [ "messages" ], "properties": { "messages": { "type": "array", "items": { "type": "object", "required": [ "role", "content" ], "properties": { "role": { "type": "string", "const": "user" }, "content": { "type": "array", "maxItems": 2, "items": { "oneOf": [ { "type": "object", "required": [ "type", "image_url" ], "properties": { "type": { "const": "image_url" }, "image_url": { "type": "object", "required": [ "url" ], "properties": { "url": { "type": "string", "format": "uri" } } } } }, { "type": "object", "required": [ "type", "text" ], "properties": { "type": { "const": "text" }, "text": { "type": "string", "enum": [ "<OCR>", "<CAPTION>", "<DETAILED_CAPTION>", "<MORE_DETAILED_CAPTION>", "<OCR_WITH_REGION>", "<REGION_PROPOSAL>", "<OD>", "<DENSE_REGION_CAPTION>" ] } } }, { "type": "object", "required": [ "type", "text" ], "properties": { "type": { "const": "text" }, "text": { "type": "string", "allOf": [ { "not": { "pattern": "<OCR>|<CAPTION>|<DETAILED_CAPTION>|<MORE_DETAILED_CAPTION>|<OCR_WITH_REGION>|<REGION_PROPOSAL>|<OD>|<DENSE_REGION_CAPTION>" } }, { "pattern": "<REGION_TO_SEGMENTATION>|<REGION_TO_CATEGORY>|<REGION_TO_DESCRIPTION>|<PHRASE_GROUNDING>|<OPEN_VOCABULARY_DETECTION>|<REFERRING_EXPRESSION_SEGMENTATION>" } ] } } } ] } } } } } }}