DeepInfra offers an OpenAI-compatible chat completions API for all LLM models at the best prices for open-source model inference. For other model types (embeddings, image generation, speech, reranking, and more), see More APIs. The endpoint is:
https://api.deepinfra.com/v1/openai
The only changes you need to make to your existing OpenAI code:
  1. Set base_url to https://api.deepinfra.com/v1/openai
  2. Set api_key to your DeepInfra token
  3. Set model to a model from our catalog

Install the SDK

pip install openai

Basic chat completion

from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

Multi-turn conversations

To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a michelin starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
The longer the conversation, the more tokens it uses. The maximum conversation length is determined by the model’s context size.

Supported parameters

| Parameter | Notes |
| --- | --- |
| model | Model name, or MODEL_NAME:VERSION, or deploy_id:DEPLOY_ID |
| messages | Roles: system, user, assistant |
| max_tokens | Max tokens to generate. See Max output tokens |
| stream | See Streaming |
| temperature | Sampling temperature between 0 and 2. Higher values produce more random output; lower values more deterministic. Default: 1.0 |
| top_p | Nucleus sampling threshold — only tokens comprising the top top_p probability mass are considered. Default: 1.0 |
| stop | Up to 4 sequences where the API will stop generating further tokens |
| n | Number of completion sequences to return. Default: 1 |
| presence_penalty | Penalizes tokens that have already appeared in the text, encouraging the model to discuss new topics. Range: -2.0 to 2.0. Default: 0 |
| frequency_penalty | Penalizes tokens based on how often they’ve appeared so far, reducing repetition. Range: -2.0 to 2.0. Default: 0 |
| response_format | See Structured Outputs |
| tools, tool_choice | See Tool Calling |
| service_tier | Priority inference for tagged models. See Service Tier below. |
| reasoning_effort | Controls reasoning depth for reasoning models. See Reasoning Models. |
We may not be 100% compatible with all OpenAI parameters. Let us know on Discord or by email if something you need is missing.
For the complete parameter reference, see the API reference.
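As an illustration, several of these parameters can be combined in one request. The sampling values below are examples, not recommendations:

```python
# An illustrative request combining several of the parameters above.
request = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Suggest a name for a bakery."}],
    "max_tokens": 64,    # cap the generated length
    "temperature": 0.7,  # below the 1.0 default for steadier output
    "top_p": 0.9,        # nucleus sampling threshold
    "n": 3,              # return three candidate completions
    "stop": ["\n\n"],    # stop at the first blank line
}

if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI(api_key="$DEEPINFRA_TOKEN",
                    base_url="https://api.deepinfra.com/v1/openai")
    completion = client.chat.completions.create(**request)
    for choice in completion.choices:
        print(choice.index, choice.message.content)
```

With n=3, each element of completion.choices holds one independent completion, and all generated tokens count toward usage.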

Service tier

Set service_tier to "priority" to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
Priority inference incurs a 20% surcharge on top of the model’s standard per-token price.
from openai import OpenAI

openai = OpenAI(api_key="$DEEPINFRA_TOKEN",
                base_url="https://api.deepinfra.com/v1/openai")

response = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"service_tier": "priority"},
)
The response includes a service_tier field confirming which tier was used. Not all models support priority tiers — check the model page for availability.

Max output tokens

The maximum number of tokens that can be generated in a single response is model-dependent, with a hard cap of 16384 tokens for most models. Set max_tokens to control the limit for a specific request.

Continuing responses beyond the limit

If you need a longer response, use response continuation: send a follow-up request with the previous response included as an assistant message, and the model will continue from where it left off.
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [
            {"role": "user", "content": "Write a very long essay about AI."},
            {"role": "assistant", "content": "<previous truncated response>"}
        ],
        "max_tokens": 4096
    }'
Note: response continuation cannot extend past the model’s total context window. A 400 error is returned when the total context size is exceeded.

What’s next

Streaming

Stream tokens as they’re generated.

Structured Outputs

Get responses in JSON format.

Tool Calling

Give models access to external functions.

Vision

Send images alongside text.

Reasoning Models

Control chain-of-thought reasoning behavior.