Implement AI Inference in your applications

AI Inference at the edge runs models close to data sources to reduce latency, preserve privacy, and optimize resource usage.

AI Inference architecture diagram

The following figure illustrates a reference architecture for AI inference at the edge.

AI Inference Architecture at the Edge

Data flow

The user sends a request to the application domain.
The nearest edge node receives the request and forwards it to the application which applies policies and calls the function.
The function orchestrates the inference: preprocesses input, and calls the model via Azion.AI.run through the Azion Cells API.
The agent processes the user’s input and sends it back to the user through the same path.

Components

Applications: application layer on Azion’s distributed network.
Functions: executes inference logic, orchestrates RAG, and integrates external services.
AI Inference (Edge Runtime): runs AI models directly at the edge.
Global Infrastructure: Azion’s distributed global network.
Orchestrator: manages requests through the Edge Node for Cells execution.

Note: Azion Cells is an advanced feature that is part of the Orchestrator. To activate it, you need to contact Azion’s sales team to verify requirements and enable the product in your account.

Implementation

Use the AI Inference Starter Kit template to accelerate your journey and have a ready-to-use application with OpenAI-compatible APIs running at the edge.

Access the Azion Console.
Click the + Create button to open the templates page.
Select the AI Inference Starter Kit template.
On the template page, enter a name for your application and click Deploy.
Wait for the application to be available at the domain your-url.map.azionedge.net (or your custom domain).
Interact with the model via your application’s OpenAI-compatible endpoint.

Example of interaction with the nanonets/Nanonets-OCR-s model, which is an instruction-oriented OCR model that extracts text from images/documents:

curl -X POST https://YOUR-DOMAIN.map.azionedge.net/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "nanonets/Nanonets-OCR-s",
    "max_tokens": 500,
    "messages": [
      {
        "role": "user",
        "content": [
          {
            "type": "image_url",
            "image_url": { "url": "https://www.any-url-example.com/example.jpg" }
          },
          {
            "type": "text",
            "text": "Extract the text from the document above in structured Markdown. Tables in HTML. Equations in LaTeX. Use <watermark>...</watermark> for watermarks and <page_number>...</page_number> for pagination."
          }
        ]
      }
    ]
  }'

Notes:

Access the Azion AI Inference Models page to see more available models.
If you added authentication to your API, include the appropriate header (for example, -H "Authorization: Bearer <TOKEN>").

Example of a function that executes Azion.AI.run inside a Function:

async function handleRequest(request) {
  const url = new URL(request.url);
  if (url.pathname !== "/ocr") return new Response("Not found", { status: 404 });

  const modelResponse = await Azion.AI.run("nanonets/Nanonets-OCR-s", {
    stream: false,
    max_tokens: 500,
    messages: [
      {
        role: "user",
        content: [
          {
            type: "image_url",
            image_url: { url: "https://www.any-url-example.com/example.jpg" }
          },
          {
            type: "text",
            text:
              "Extract the text from the document above in structured Markdown. " +
              "Tables in HTML. Equations in LaTeX. Use <watermark>...</watermark> " +
              "for watermarks and <page_number>...</page_number> for pagination."
          }
        ]
      }
    ]
  });

  return new Response(JSON.stringify(modelResponse), {
    headers: { "Content-Type": "application/json" }
  });
}

addEventListener("fetch", (event) => {
  event.respondWith(handleRequest(event.request));
});

Observability

Real-Time Metrics: analyze events and gain insights into your applications’ performance.
Real-Time Events: track requests, raw data, logs, and complex queries.

AI Inference architecture diagram

Data flow

Components

Implementation

Observability

Related documentation