//AI Inference

Global AI inference on managed infrastructure

Build and deploy intelligent applications that run AI models close to your users — at any scale.

Docs

Built-in reliability

Distributed architecture with automatic failover keeps your AI running.

OpenAI-compatible API

Keep your existing code and SDKs — just change the endpoint.

Pay for usage

You pay only when your models are running inference.

Deploy from your local environment

Build, debug, and ship AI workloads the way you already code, but with the scale of distributed infrastructure.

Low-latency · Instant Deploy · Zero Config · Pay Per Request

Learn more

Fine-tune with LoRA

Adapt model outputs to your domain using Low-Rank Adaptation (LoRA), improving accuracy for specialized tasks while reducing compute costs—without full model retraining.
Learn more

Explore our Featured Models

Mistral 3 Small (24B AWQ)

Compact language model with capabilities comparable to larger models. Ideal for conversational agents, function calling, fine-tuning, and local inference with sensitive data.

View model

Qwen3 30B A3B Instruct 2507 FP8

30B-parameter FP8 causal language model for long-context (256K) text generation and reasoning. Supports chat/QA, summarization, multilingual tasks, math/science, coding, and tool-augmented workflows.

View model

GPT-OSS 20B

OpenAI model with 20 billion parameters for text generation, conversation, and NLP tasks. Features tool calling capabilities and 131K token context length.

View model

InternVL3

Advanced multimodal LLM for tool use, GUI agents, industrial image analysis, 3D vision perception, and more.

View model

Qwen2.5 VL AWQ 3B

Vision Language Model with visual analysis, agentic reasoning, long video comprehension, visual localization, and structured output generation.

View model

Qwen2.5 VL AWQ 7B

Instruction-tuned VLM for visual analysis and advanced multimodal tasks.

View model

Qwen3 Embedding 4B

4B-parameter multilingual embedding model (36 layers, 32K context). Outputs 2560-dim vectors for text/code retrieval, classification, clustering, and bitext mining.

View model

BAAI/bge-reranker-v2-m3

Lightweight reranker model with strong multilingual capabilities, easy deployment, and fast inference.

View model

Nanonets-OCR-s

OCR model that converts document images to structured Markdown, preserving layout (headings, lists, tables). Output is easy to parse and feed into LLM pipelines.

View model

From hello world to production AI workloads

Build AI applications with the same workflow you use for modern web apps. AI Inference connects with SQL Database, Object Storage, and Functions so teams can run models, retrieve context, store assets, and execute distributed logic in one platform.
Docs

ai-inference.ts

export default {
  async fetch(request: Request) {
    // Read the user's prompt from the JSON request body.
    const { prompt } = await request.json();

    // Run the model with an OpenAI-style list of chat messages.
    const response = await Azion.AI.run(
      'Qwen/Qwen3-30B-A3B-Instruct-2507-FP8',
      { messages: [{ role: 'user', content: prompt }] }
    );

    // Return the model output to the caller as JSON.
    return Response.json(response);
  }
};

What you can build with AI Inference

Automation

AI agents for automated workflows

Build agents that plan, call tools, and complete tasks with context-aware responses.

See example

AI Apps

AI-powered applications (RAG + search)

Create RAG, semantic search, and personalized experiences using SQL Database vector search.

See example

Media

Image generation and visual AI

Generate, analyze, and transform images with multimodal AI models.

See example

Audio

Speech-to-text and audio processing

Transcribe, summarize, and process audio with low-latency AI inference.

See example

Support

Customer support copilot

Answer customer questions using your knowledge base and scalable AI inference.

See example

Security

Automated threat detection and takedown

Classify threats, detect abuse, and automate security actions with AI models.

See example

"With Azion, we scale proprietary AI models without managing infrastructure—inspecting millions of websites daily and automating the market’s fastest threat takedown."

Fabio Ramos

CEO

View success story

//Complete, not complex

Primitives that Scale with You

Compute

Functions: Run code globally, low latency

Rules: Control traffic routing

Load Balancer: High availability across origins

Image Processor: Optimize and modify images

AI Inference: Low-latency distributed inference

AI Gateway: Govern and route LLMs

Data

Object Storage: Store and deliver globally

SQL Database: Distributed SQL with low latency

KV Store: Keep state close, fast

Cache: Accelerate delivery, boost reliability

Security

Web Application Firewall (WAF): Smart way to block threats

API Gateway: Authenticate and protect APIs

Bot Management: Stop bots, prevent abuse

DNS: Resilient DNS with performance

Frequently Asked Questions

What is Azion AI Inference?

Azion AI Inference is a serverless platform for deploying and running AI models on a distributed architecture. It provides an OpenAI-compatible API for easy migration, supports LLMs, VLMs, embeddings, and reranking models, and offers LoRA fine-tuning for domain customization. Scale automatically without GPU management while maintaining low-latency responses globally.

Which models can I run?

Choose from a catalog of open-source models for text and code generation, vision-language tasks, embeddings, and reranking. The catalog evolves as new models become available, and you can fine-tune supported models with LoRA for your specific domain.

Is it compatible with the OpenAI API?

Yes. AI Inference uses an OpenAI-compatible API format, so you can migrate existing applications by updating the base URL and credentials. Keep your current SDKs and integration patterns—no code rewrite required.
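
As a rough illustration of what "updating the base URL and credentials" can look like, the sketch below builds a chat completion request in the OpenAI wire format. The base URL, the API key, the `/chat/completions` path, and the Bearer authorization header are assumptions carried over from the OpenAI convention, not confirmed Azion values; check your account's endpoint details before relying on them. `buildChatRequest` is a hypothetical helper, not part of any SDK.

```typescript
// One message in the OpenAI chat format.
interface ChatMessage {
  role: "system" | "user" | "assistant";
  content: string;
}

// A URL plus the options you would hand to fetch().
interface ChatRequest {
  url: string;
  init: { method: string; headers: Record<string, string>; body: string };
}

// Hypothetical helper: builds an OpenAI-style chat completion request.
// Only `baseUrl` and `apiKey` change when you switch providers.
function buildChatRequest(
  baseUrl: string,
  apiKey: string,
  model: string,
  messages: ChatMessage[]
): ChatRequest {
  return {
    // Trim trailing slashes so the path is joined exactly once.
    url: `${baseUrl.replace(/\/+$/, "")}/chat/completions`,
    init: {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
        // Assumption: Bearer token auth, as in the OpenAI API.
        Authorization: `Bearer ${apiKey}`,
      },
      body: JSON.stringify({ model, messages }),
    },
  };
}
```

Pass `url` and `init` to `fetch` to send the request. If you already use an OpenAI SDK, the equivalent change is usually its base URL and API key options.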

Can I fine-tune models?

Yes. AI Inference supports LoRA (Low-Rank Adaptation) fine-tuning, allowing you to specialize models for your domain without full retraining. This reduces compute costs while improving accuracy for specific tasks like customer support, code generation, or domain-specific Q&A.
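
To see where the compute savings come from, here is a small arithmetic sketch. LoRA freezes a weight matrix W of shape d_out x d_in and trains two low-rank factors, B (d_out x r) and A (r x d_in), so the number of trainable parameters drops from d_out * d_in to r * (d_out + d_in). The layer size and rank used in the usage note are illustrative and do not describe any model in the catalog; `loraParamCounts` is a hypothetical helper.

```typescript
interface LoraParamCounts {
  full: number;  // parameters updated by full fine-tuning of this matrix
  lora: number;  // parameters trained by LoRA for the same matrix
  ratio: number; // lora / full
}

// Compares trainable parameter counts for one weight matrix.
function loraParamCounts(dOut: number, dIn: number, rank: number): LoraParamCounts {
  if (rank <= 0 || dOut <= 0 || dIn <= 0) {
    throw new Error("Dimensions and rank must be positive");
  }
  const full = dOut * dIn;          // W: dOut x dIn
  const lora = rank * (dOut + dIn); // B: dOut x rank, plus A: rank x dIn
  return { full, lora, ratio: lora / full };
}
```

For a 4096 x 4096 matrix at rank 8, `loraParamCounts(4096, 4096, 8)` gives 16,777,216 full parameters against 65,536 LoRA parameters, a ratio of 1/256.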

How do I build RAG and semantic search?

Use AI Inference with SQL Database Vector Search to store embeddings and retrieve relevant context for RAG applications. This built-in vector search means no separate vector database to manage—SQL and vectors in one service.
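
The retrieval step behind RAG can be sketched in a few lines. The example below ranks documents by cosine similarity in memory and assembles a prompt from the best matches. It only illustrates the idea: in a real deployment the similarity search would run inside SQL Database Vector Search, whose query syntax is not shown here, and the embeddings would come from an embedding model rather than the hand-written vectors used for testing. All names (`Doc`, `topK`, `buildPrompt`) are hypothetical.

```typescript
interface Doc {
  id: string;
  text: string;
  embedding: number[];
}

// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Embedding dimensions differ");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0; // avoid dividing by zero
  return dot / Math.sqrt(normA * normB);
}

// Returns the k documents most similar to the query embedding.
function topK(query: number[], docs: Doc[], k: number): Doc[] {
  return docs
    .map((doc) => ({ doc, score: cosineSimilarity(query, doc.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((x) => x.doc);
}

// Puts the retrieved passages in front of the user's question.
function buildPrompt(question: string, context: Doc[]): string {
  const passages = context.map((doc) => `- ${doc.text}`).join("\n");
  return `Answer using only the context below.\n\nContext:\n${passages}\n\nQuestion: ${question}`;
}
```

The resulting prompt string is what you would send as the user message to a chat model.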

Can I build AI agents and tool-calling workflows?

Yes. AI Inference powers agent patterns like ReAct and tool-calling workflows when combined with Applications, Functions, and external APIs. Azion provides templates and guides for LangChain and LangGraph-based agent architectures.
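
One step of a tool-calling loop might look like the sketch below. It assumes the model returns `tool_calls` in the OpenAI chat format (an `id`, plus a function name and JSON-encoded arguments) and answers each call with a `tool` message. The handler registry and the `runToolCalls` helper are hypothetical, and whether a given catalog model emits tool calls depends on the model.

```typescript
// Shape of a tool call in an OpenAI-style assistant message.
interface ToolCall {
  id: string;
  type: "function";
  function: { name: string; arguments: string }; // arguments is a JSON string
}

// Message sent back to the model with a tool's result.
interface ToolMessage {
  role: "tool";
  tool_call_id: string;
  content: string;
}

type ToolHandler = (args: Record<string, unknown>) => unknown;

// Executes each requested tool and returns the messages to append to the chat.
function runToolCalls(
  toolCalls: ToolCall[],
  handlers: Record<string, ToolHandler>
): ToolMessage[] {
  return toolCalls.map((call): ToolMessage => {
    const handler = handlers[call.function.name];
    let content: string;
    if (!handler) {
      // Report unknown tools to the model instead of throwing.
      content = JSON.stringify({ error: `Unknown tool: ${call.function.name}` });
    } else {
      try {
        const args = JSON.parse(call.function.arguments) as Record<string, unknown>;
        content = JSON.stringify(handler(args));
      } catch (err) {
        // Malformed arguments or handler failures are also reported back.
        content = JSON.stringify({ error: String(err) });
      }
    }
    return { role: "tool", tool_call_id: call.id, content };
  });
}
```

An agent loop appends these messages to the conversation and calls the model again, repeating until the model replies without requesting further tools.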

How do I migrate from Cloudflare Workers AI?

Migration is straightforward due to OpenAI-compatible APIs on both platforms. Update your base URL to point to Azion AI Inference endpoints, migrate any LoRA adapters, and integrate with Azion Functions. If you use Cloudflare Vectorize, migrate to Azion SQL Database Vector Search for built-in vector storage.

How does pricing compare to other platforms?

AI Inference uses straightforward per-request pricing without abstract units like "neurons." You pay for inference requests based on model and token usage, with no idle costs and no capacity commitments. Because billing tracks tokens rather than an abstract unit, costs are easier to forecast than under neuron-based billing.
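
For forecasting, a token-based estimate can be sketched as below. The split into separate input and output rates per million tokens is an assumption about how usage-based pricing is commonly structured, not Azion's published price list, and the function takes the rates as parameters rather than hard-coding any. Substitute the figures from the pricing page for the model you use. `estimateCost` is a hypothetical helper.

```typescript
// Assumed rate structure: price per one million tokens, by direction.
interface TokenRates {
  inputPerMillion: number;
  outputPerMillion: number;
}

interface Workload {
  requests: number;        // number of inference requests in the period
  avgInputTokens: number;  // average prompt tokens per request
  avgOutputTokens: number; // average generated tokens per request
}

// Estimates spend for a workload under the assumed rate structure.
function estimateCost(workload: Workload, rates: TokenRates): number {
  const inputTokens = workload.requests * workload.avgInputTokens;
  const outputTokens = workload.requests * workload.avgOutputTokens;
  return (
    (inputTokens / 1_000_000) * rates.inputPerMillion +
    (outputTokens / 1_000_000) * rates.outputPerMillion
  );
}
```

Because there are no idle costs, zero requests means zero spend in this model.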

How do I deploy AI inference into my application?

Create an AI Inference endpoint, then integrate it into your request flow using Applications and Functions. This adds AI capabilities to existing APIs and user experiences with distributed execution, managed scaling, and automatic failover.

//Build

Build once.
Run everywhere.

Get a faster path to launch, lower latency, and less infrastructure overhead.

Join our community
