- Set `base_url` to `https://api.deepinfra.com/v1/openai`
- Set `api_key` to your DeepInfra token
- Set `model` to a model from our catalog
Install the SDK
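If you use the official OpenAI Python SDK (assumed in the examples below), install it with `pip install openai`. Any OpenAI-compatible client works the same way.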
Basic chat completion
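A minimal sketch using the official OpenAI Python SDK; the model name is just one example from the catalog:

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.deepinfra.com/v1/openai",
    api_key="<your DeepInfra token>",
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # any model from the catalog
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
)
print(response.choices[0].message.content)
```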
Multi-turn conversations
To create a longer conversation, include the full message history in every request. The model uses this context to provide better answers.
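For illustration, a sketch reusing the `client` configured above (the model name is again just an example):

```python
# Build the running conversation, including the model's earlier reply.
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is the capital of France?"},
    {"role": "assistant", "content": "The capital of France is Paris."},
    {"role": "user", "content": "What is its population?"},  # relies on prior turns
]

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    messages=messages,
)
print(response.choices[0].message.content)
```

Supported parameters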
| Parameter | Notes |
|---|---|
| `model` | Model name, or `MODEL_NAME:VERSION`, or `deploy_id:DEPLOY_ID` |
| `messages` | Roles: `system`, `user`, `assistant` |
| `max_tokens` | Maximum number of tokens to generate. See Max output tokens |
| `stream` | See Streaming |
| `temperature` | Sampling temperature between 0 and 2. Higher values produce more random output; lower values more deterministic. Default: 1.0 |
| `top_p` | Nucleus sampling threshold; only tokens comprising the top `top_p` probability mass are considered. Default: 1.0 |
| `stop` | Up to 4 sequences where the API will stop generating further tokens |
| `n` | Number of completion sequences to return. Default: 1 |
| `presence_penalty` | Penalizes tokens that have already appeared in the text, encouraging the model to discuss new topics. Range: -2.0 to 2.0. Default: 0 |
| `frequency_penalty` | Penalizes tokens based on how often they have appeared so far, reducing repetition. Range: -2.0 to 2.0. Default: 0 |
| `response_format` | See Structured Outputs |
| `tools`, `tool_choice` | See Tool Calling |
| `service_tier` | Priority inference for tagged models. See Service tier below |
| `reasoning_effort` | Controls reasoning depth for reasoning models. See Reasoning Models |
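For illustration, one request combining several of these parameters (the values are arbitrary, and the client is the one configured above):

```python
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
    messages=[{"role": "user", "content": "Name three uses for a brick."}],
    max_tokens=256,          # cap the response length
    temperature=0.7,         # lower = more deterministic
    top_p=0.9,               # nucleus sampling threshold
    n=2,                     # return two completions
    stop=["\n\n"],           # stop at the first blank line
    presence_penalty=0.5,    # nudge toward new topics
    frequency_penalty=0.5,   # discourage repetition
)
for choice in response.choices:
    print(choice.message.content)
```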
We may not be 100% compatible with all OpenAI parameters. Let us know on Discord or by email if something you need is missing.
Service tier
Set `service_tier` to `"priority"` to request priority inference on supported models. Priority requests get faster time-to-first-token and higher throughput during peak demand.
The response includes a `service_tier` field confirming which tier was used. Not all models support priority tiers; check the model page for availability.
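A sketch, assuming the SDK's `service_tier` argument is passed through to DeepInfra unchanged:

```python
response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model; must support priority
    messages=[{"role": "user", "content": "Hello!"}],
    service_tier="priority",
)
print(response.service_tier)  # confirms which tier actually served the request
```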
Max output tokens
The maximum number of tokens that can be generated in a single response is model-dependent, with a hard cap of 16384 tokens for most models. Set `max_tokens` to control the limit for a specific request.
Continuing responses beyond the limit
If you need a longer response, use response continuation: send a follow-up request with the previous response included as an assistant message, and the model will continue from where it left off.
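A sketch of this pattern using the client from the earlier examples; it also shows `max_tokens` in action. When the model stops because it hit the token limit, `finish_reason` is `"length"`:

```python
messages = [{"role": "user", "content": "Write a detailed essay about the ocean."}]
full_text = ""

while True:
    response = client.chat.completions.create(
        model="meta-llama/Meta-Llama-3-8B-Instruct",  # example model
        messages=messages,
        max_tokens=512,  # deliberately small to trigger continuation
    )
    choice = response.choices[0]
    full_text += choice.message.content or ""
    if choice.finish_reason != "length":
        break  # the model finished naturally (or hit a stop sequence)
    # Feed the partial answer back so the model continues where it left off.
    messages.append({"role": "assistant", "content": choice.message.content})

print(full_text)
```

What's next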
- Streaming: stream tokens as they're generated.
- Structured Outputs: get responses in JSON format.
- Tool Calling: give models access to external functions.
- Vision: send images alongside text.
- Reasoning Models: control chain-of-thought reasoning behavior.