DeepInfra offers an OpenAI-compatible chat completions API for all LLM models at the best prices for open-source model inference. For other model types (embeddings, image generation, speech, reranking, and more), see More APIs. The endpoint is:
https://api.deepinfra.com/v1/openai
The only changes you need to make to your existing OpenAI code:
  1. Set base_url to https://api.deepinfra.com/v1/openai
  2. Set api_key to your DeepInfra token
  3. Set model to a model from our catalog

Install the SDK

pip install openai

Basic chat completion

from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello"}],
)

print(chat_completion.choices[0].message.content)
print(chat_completion.usage.prompt_tokens, chat_completion.usage.completion_tokens)

Multi-turn conversations

To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.
from openai import OpenAI

openai = OpenAI(
    api_key="$DEEPINFRA_TOKEN",
    base_url="https://api.deepinfra.com/v1/openai",
)

chat_completion = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[
        {"role": "system", "content": "Respond like a michelin starred chef."},
        {"role": "user", "content": "Can you name at least two different techniques to cook lamb?"},
        {"role": "assistant", "content": "Bonjour! Let me tell you, my friend, cooking lamb is an art form..."},
        {"role": "user", "content": "Tell me more about the second method."},
    ],
)

print(chat_completion.choices[0].message.content)
The longer the conversation, the more tokens it uses. The maximum conversation length is determined by the model’s context size.

Supported parameters

| Parameter | Notes |
| --- | --- |
| model | Model name, or MODEL_NAME:VERSION, or deploy_id:DEPLOY_ID |
| messages | Roles: system, user, assistant |
| max_tokens | Max tokens to generate. See Max output tokens |
| stream | See Streaming |
| temperature | Sampling temperature between 0 and 2. Higher values produce more random output; lower values more deterministic. Default: 1.0 |
| top_p | Nucleus sampling threshold — only tokens comprising the top top_p probability mass are considered. Default: 1.0 |
| stop | Up to 4 sequences where the API will stop generating further tokens |
| n | Number of completion sequences to return. Default: 1 |
| presence_penalty | Penalizes tokens that have already appeared in the text, encouraging the model to discuss new topics. Range: -2.0 to 2.0. Default: 0 |
| frequency_penalty | Penalizes tokens based on how often they’ve appeared so far, reducing repetition. Range: -2.0 to 2.0. Default: 0 |
| response_format | See Structured Outputs |
| tools, tool_choice | See Tool Calling |
| service_tier | Priority inference for tagged models. See Service Tier below. |
| reasoning_effort | Controls reasoning depth for reasoning models. See Reasoning Models. |
We may not be 100% compatible with all OpenAI parameters. Let us know on Discord or by email if something you need is missing.
For the complete parameter reference, see the API reference.
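As an illustration, several of these parameters can be combined in one request. The sampling values below are examples, not recommendations:

```python
# An illustrative request combining several of the parameters above.
request = {
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role": "user", "content": "Suggest a name for a bakery."}],
    "max_tokens": 64,    # cap the generated length
    "temperature": 0.7,  # below the 1.0 default for steadier output
    "top_p": 0.9,        # nucleus sampling threshold
    "n": 3,              # return three candidate completions
    "stop": ["\n\n"],    # stop at the first blank line
}

if __name__ == "__main__":
    from openai import OpenAI
    client = OpenAI(api_key="$DEEPINFRA_TOKEN",
                    base_url="https://api.deepinfra.com/v1/openai")
    completion = client.chat.completions.create(**request)
    for choice in completion.choices:
        print(choice.index, choice.message.content)
```

With n=3, each element of completion.choices holds one independent completion, and all generated tokens count toward usage.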

Service tier

Set service_tier to "priority" to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
Priority inference incurs a 20% surcharge on top of the model’s standard per-token price.
from openai import OpenAI

openai = OpenAI(api_key="$DEEPINFRA_TOKEN",
                base_url="https://api.deepinfra.com/v1/openai")

response = openai.chat.completions.create(
    model="deepseek-ai/DeepSeek-V3",
    messages=[{"role": "user", "content": "Hello!"}],
    extra_body={"service_tier": "priority"},
)
The response includes a service_tier field confirming which tier was used. Not all models support priority tiers — check the model page for availability.

Max output tokens

The maximum number of tokens that can be generated in a single response is model-dependent, with a hard cap of 16384 tokens for most models. Set max_tokens to control the limit for a specific request.

Continuing responses beyond the limit

If you need a longer response, use response continuation: send a follow-up request with the previous response included as an assistant message, and the model will continue from where it left off.
curl "https://api.deepinfra.com/v1/openai/chat/completions" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer $DEEPINFRA_TOKEN" \
    -d '{
        "model": "deepseek-ai/DeepSeek-V3",
        "messages": [
            {"role": "user", "content": "Write a very long essay about AI."},
            {"role": "assistant", "content": "<previous truncated response>"}
        ],
        "max_tokens": 4096
    }'
Note: response continuation cannot extend past the model’s total context window. A 400 error is returned when the total context size is exceeded.

What’s next

Streaming

Stream tokens as they’re generated.

Structured Outputs

Get responses in JSON format.

Tool Calling

Give models access to external functions.

Vision

Send images alongside text.

Reasoning Models

Control chain-of-thought reasoning behavior.