OpenAI APIs - Completions#

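SGLang provides OpenAI-compatible APIs so that users can move from OpenAI services to self-hosted local models with minimal changes. A complete reference for the API is available in the OpenAI API Reference.

This tutorial covers the following popular APIs:

chat/completions
completions

See the other tutorials for the vision APIs for vision language models and the embedding APIs for embedding models.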

Launch A Server#

Launch the server in your terminal and wait for it to initialize.

from sglang.test.doc_patch import launch_server_cmd
from sglang.utils import wait_for_server, print_highlight, terminate_process

server_process, port = launch_server_cmd(
    "python3 -m sglang.launch_server --model-path qwen/qwen2.5-0.5b-instruct --host 0.0.0.0 --log-level warning"
)

wait_for_server(f"http://localhost:{port}")
print(f"Server started on http://localhost:{port}")

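Chat Completions#

Usage#

The server fully implements the OpenAI API. If the Hugging Face tokenizer ships with a chat template, the server applies it automatically. You can also specify a custom chat template with --chat-template when launching the server.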

import openai

client = openai.Client(base_url=f"http://127.0.0.1:{port}/v1", api_key="None")

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0,
    max_tokens=64,
)

print_highlight(f"Response: {response}")
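
To make the chat template step more concrete, here is a rough sketch of what a ChatML-style template does to a messages list before the model sees it. This is an illustration only: render_chatml is a hypothetical helper, not part of SGLang, and it assumes the ChatML layout used by Qwen-family tokenizers. The real template ships with the tokenizer and may add details this sketch omits, such as a default system prompt.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Flatten chat messages into a single ChatML-style prompt string.

    Hypothetical helper for illustration; the server applies the tokenizer's
    own template, which may differ in detail.
    """
    parts = []
    for message in messages:
        # Each turn is wrapped in <|im_start|>{role} ... <|im_end|> markers.
        parts.append(
            f"<|im_start|>{message['role']}\n{message['content']}<|im_end|>\n"
        )
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)


prompt = render_chatml(
    [{"role": "user", "content": "List 3 countries and their capitals."}]
)
print(prompt)
```

With the Hugging Face transformers library, the corresponding call on a real tokenizer is tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True).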

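Model Thinking/Reasoning Support#

Some models support internal reasoning or thinking processes that can be exposed in the API response. SGLang provides unified support for various reasoning models through the chat_template_kwargs parameter and compatible reasoning parsers.

Supported Models and Configuration#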

Model Family	Chat Template Parameter	Reasoning Parser	Notes
DeepSeek-R1 (R1, R1-0528, R1-Distill)	`enable_thinking`	`--reasoning-parser deepseek-r1`	Standard reasoning models
DeepSeek-V3.1	`thinking`	`--reasoning-parser deepseek-v3`	Hybrid model (thinking/non-thinking modes)
Qwen3 (standard)	`enable_thinking`	`--reasoning-parser qwen3`	Hybrid model (thinking/non-thinking modes)
Qwen3-Thinking	N/A (always enabled)	`--reasoning-parser qwen3-thinking`	Always generates reasoning content
Kimi	N/A (always enabled)	`--reasoning-parser kimi`	Kimi reasoning models
Gpt-Oss	N/A (always enabled)	`--reasoning-parser gpt-oss`	Gpt-Oss reasoning models

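Basic Usage#

To enable reasoning output, you need to:

Launch the server with the appropriate reasoning parser
Set the model-specific parameter in chat_template_kwargs
Optionally set separate_reasoning: False if you do not want the reasoning content returned separately (defaults to True)

Note for Qwen3-Thinking models: These models always generate reasoning content and do not support the enable_thinking parameter. Use --reasoning-parser qwen3-thinking or --reasoning-parser qwen3 to parse the reasoning content.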
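
The per-model differences above can be captured in a small lookup. The sketch below is one possible way to build the extra_body dictionary; build_extra_body and the family labels used as keys are hypothetical names chosen for this example, while the parameter names come from the table above.

```python
# Chat template parameter for each model family, copied from the table above.
# None means reasoning is always on and there is no switch to set.
# The dictionary keys are labels made up for this sketch.
THINKING_PARAM = {
    "deepseek-r1": "enable_thinking",
    "deepseek-v3.1": "thinking",
    "qwen3": "enable_thinking",
    "qwen3-thinking": None,
    "kimi": None,
    "gpt-oss": None,
}


def build_extra_body(family, thinking=True, separate_reasoning=True):
    """Build the extra_body dict for a chat completion request."""
    param = THINKING_PARAM[family]  # KeyError for families not in the table
    extra_body = {"separate_reasoning": separate_reasoning}
    if param is not None:
        extra_body["chat_template_kwargs"] = {param: thinking}
    return extra_body


print(build_extra_body("qwen3"))
print(build_extra_body("deepseek-v3.1", thinking=False))
print(build_extra_body("qwen3-thinking"))
```

The result can then be passed as extra_body=build_extra_body("qwen3") to client.chat.completions.create.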

Example: Qwen3 Models#

# Launch the server:
# python3 -m sglang.launch_server --model Qwen/Qwen3-4B --reasoning-parser qwen3

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:30000/v1",
)

model = "Qwen/Qwen3-4B"
messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "separate_reasoning": True
    }
)

print("Reasoning:", response.choices[0].message.reasoning_content)
print("-"*100)
print("Answer:", response.choices[0].message.content)

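Example Output:

Reasoning: Okay, so the user is asking how many 'r's are in the word 'strawberry'. Let me think. First, I need to make sure I have the word spelled correctly. Strawberry... S-T-R-A-W-B-E-R-R-Y. Wait, is that right? Let me break it down.

Starting with 'strawberry', let's write out the letters one by one. S, T, R, A, W, B, E, R, R, Y. Hmm, wait, that's 10 letters. Let me check again. S (1), T (2), R (3), A (4), W (5), B (6), E (7), R (8), R (9), Y (10). So the letters are S-T-R-A-W-B-E-R-R-Y.
...
Therefore, the answer should be three R's in 'strawberry'. But I need to make sure I'm not counting any other letters as R. Let me check again. S, T, R, A, W, B, E, R, R, Y. No other R's. So three in total. Yeah, that seems right.

----------------------------------------------------------------------------------------------------
Answer: The word "strawberry" contains **three** letters 'r'. Here's the breakdown:

1. **S-T-R-A-W-B-E-R-R-Y**
   - The **third letter** is 'R'.
   - The **eighth and ninth letters** are also 'R's.

Thus, the total count is **3**.

**Answer:** 3.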

Note: Setting "enable_thinking": False (or omitting the parameter) results in reasoning_content being None. Qwen3-Thinking models always generate reasoning content and do not support the enable_thinking parameter.
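
Because reasoning_content can be None, client code that prints or concatenates it may want a small guard. The following is a minimal sketch: split_reasoning is a hypothetical helper, and the SimpleNamespace objects merely stand in for response.choices[0].message so that the example runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat message, tolerating missing reasoning."""
    # reasoning_content may be None; getattr also covers the case where the
    # attribute is absent (handled defensively, as an assumption).
    reasoning = getattr(message, "reasoning_content", None) or ""
    answer = message.content or ""
    return reasoning, answer


# Stand-ins for response.choices[0].message.
with_thinking = SimpleNamespace(reasoning_content="Count the letters...", content="3")
without_thinking = SimpleNamespace(reasoning_content=None, content="3")

print(split_reasoning(with_thinking))
print(split_reasoning(without_thinking))
```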

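Logit Bias Support#

SGLang supports the logit_bias parameter for both the chat completions and completions APIs. This parameter lets you change the likelihood of specific tokens being generated by adding a bias value to their logits. Bias values range from -100 to 100, where:

Positive values (0 to 100) increase the likelihood of the token being selected
Negative values (-100 to 0) decrease the likelihood of the token being selected
-100 effectively prevents the token from being generated

The logit_bias parameter accepts a dictionary whose keys are token IDs (as strings) and whose values are bias amounts (as floats).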
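
The format rules above (string keys, float values, biases within -100 to 100) are easy to get wrong by hand. Here is a minimal sketch of a helper that enforces them on the client side; make_logit_bias is a hypothetical name, and the token IDs are placeholders rather than real vocabulary entries.

```python
def make_logit_bias(biases):
    """Convert {token_id: bias} into the string-keyed dict that logit_bias expects."""
    logit_bias = {}
    for token_id, bias in biases.items():
        if not -100 <= bias <= 100:
            raise ValueError(
                f"bias for token {token_id} must be in [-100, 100], got {bias}"
            )
        # Keys are token IDs as strings; values are floats.
        logit_bias[str(int(token_id))] = float(bias)
    return logit_bias


print(make_logit_bias({12345: 50, 67890: -100}))
```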

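Getting Token IDs#

To use logit_bias effectively, you need to know the token IDs of the words you want to bias. Here is how to look them up:

# Get a tokenizer to look up token IDs
import tiktoken

# For OpenAI models, use the appropriate encoding
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")  # or your model

# Get the token IDs for a specific word
word = "sunny"
token_ids = tokenizer.encode(word)
print(f"Token IDs for '{word}': {token_ids}")

# For SGLang models, you can access the tokenizer through the client
# and get the token IDs to bias

Important: The logit_bias parameter uses token IDs as string keys, not the actual words.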
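
Token IDs are specific to each tokenizer, so IDs obtained from tiktoken for an OpenAI model do not carry over to a model such as Qwen served by SGLang. The sketch below keeps the tokenizer pluggable: bias_for_words is a hypothetical helper that accepts any encode function, and toy_encode is a fake tokenizer included only so the example runs offline. The commented lines show how a real Hugging Face tokenizer might be plugged in, assuming the transformers package is installed.

```python
def bias_for_words(encode, words, bias):
    """Map every token ID produced by `encode` for the given words to one bias value."""
    logit_bias = {}
    for word in words:
        # A single word may split into several tokens; bias each of them.
        for token_id in encode(word):
            logit_bias[str(token_id)] = float(bias)
    return logit_bias


def toy_encode(text):
    """Stand-in tokenizer: one fake ID per character, only so the sketch runs offline."""
    return [ord(ch) for ch in text]


print(bias_for_words(toy_encode, ["hi"], 25))

# With a real tokenizer (assumes the transformers package and model access):
# from transformers import AutoTokenizer
# tok = AutoTokenizer.from_pretrained("qwen/qwen2.5-0.5b-instruct")
# bias_for_words(lambda w: tok.encode(w, add_special_tokens=False), [" sunny"], 25)
```

With many subword tokenizers, a word with and without a leading space maps to different tokens, so it can be worth checking both forms.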

Example: DeepSeek-V3 Models#

DeepSeek-V3 models support thinking mode through the thinking parameter:

# Launch the server:
# python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-V3.1 --tp 8 --reasoning-parser deepseek-v3

from openai import OpenAI

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:30000/v1",
)

model = "deepseek-ai/DeepSeek-V3.1"
messages = [{"role": "user", "content": "How many r's are in 'strawberry'?"}]

response = client.chat.completions.create(
    model=model,
    messages=messages,
    extra_body={
        "chat_template_kwargs": {"thinking": True},
        "separate_reasoning": True
    }
)

print("Reasoning:", response.choices[0].message.reasoning_content)
print("-"*100)
print("Answer:", response.choices[0].message.content)

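Example Output:

Reasoning: First, the question is: "How many r's are in 'strawberry'?"

I need to count the number of times the letter "r" appears in the word "strawberry".

Let me write out the word: S-T-R-A-W-B-E-R-R-Y.

Now, I'll go through each letter and count the 'r's.
...
So, I have three 'r's in "strawberry".

I should double-check. The word is spelled S-T-R-A-W-B-E-R-R-Y. The letters at positions 3, 8, and 9 are 'r's. Yes, that's correct.

Therefore, the answer should be 3.
----------------------------------------------------------------------------------------------------
Answer: The word "strawberry" contains **3** instances of the letter "r". Here's a clear breakdown:

- The word is spelled: S-T-R-A-W-B-E-R-R-Y
- The "r" appears at the 3rd, 8th, and 9th positions.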

Note: DeepSeek-V3 models use the thinking parameter (instead of enable_thinking) to control reasoning output.

# Example using the logit_bias parameter
# Note: You need to get the actual token IDs from your tokenizer
# For demonstration, we use some placeholder token IDs
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {"role": "user", "content": "Complete this sentence: The weather today is"}
    ],
    temperature=0.7,
    max_tokens=20,
    logit_bias={
        "12345": 50,  # Increase the likelihood of token ID 12345
        "67890": -50,  # Decrease the likelihood of token ID 67890
        "11111": 25,  # Slightly increase the likelihood of token ID 11111
    },
)

print_highlight(f"Response with logit bias: {response.choices[0].message.content}")

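Parameters#

The chat completions API accepts the parameters of the OpenAI Chat Completions API. Refer to the OpenAI Chat Completions API for more details.

SGLang extends the standard API with the extra_body parameter, which allows additional customization. One key option within extra_body is chat_template_kwargs, which passes arguments to the chat template processor.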

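response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[
        {
            "role": "system",
            "content": "You are a knowledgeable historian who provides concise responses.",
        },
        {"role": "user", "content": "Tell me about ancient Rome"},
        {
            "role": "assistant",
            "content": "Ancient Rome was a civilization centered in Italy.",
        },
        {"role": "user", "content": "What were their major achievements?"},
    ],
    temperature=0.3,  # Lower temperature for more focused responses
    max_tokens=128,  # Reasonable length for a concise response
    top_p=0.95,  # Slightly higher for better fluency
    presence_penalty=0.2,  # Mild penalty to avoid repetition
    frequency_penalty=0.2,  # Mild penalty for more natural language
    n=1,  # A single response is usually more stable
    seed=42,  # Fixed seed for reproducibility
)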

print_highlight(response.choices[0].message.content)

Streaming mode is also supported.

stream = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Say this is a test"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content is not None:
        print(chunk.choices[0].delta.content, end="")

Logit Bias Support#

The completions API also supports the logit_bias parameter, with the same behavior as described in the chat completions section above.

# Example using the logit_bias parameter with the completions API
# Note: You need to get the actual token IDs from your tokenizer
# For demonstration, we use some placeholder token IDs
response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="The best programming language for AI is",
    temperature=0.7,
    max_tokens=20,
    logit_bias={
        "12345": 75,  # Strongly favor token ID 12345
        "67890": -100,  # Completely avoid token ID 67890
        "11111": -25,  # Slightly discourage token ID 11111
    },
)

print_highlight(f"Response with logit bias: {response.choices[0].text}")

Completions#

Usage#

The completions API is similar to the chat completions API, but it has no messages parameter and does not apply a chat template.

response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="List 3 countries and their capitals.",
    temperature=0,
    max_tokens=64,
    n=1,
    stop=None,
)

print_highlight(f"Response: {response}")

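Parameters#

The completions API accepts the parameters of the OpenAI Completions API. Refer to the OpenAI Completions API for more details.

Here is an example of a detailed completions request: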

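response = client.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    prompt="Write a short story about a space explorer.",
    temperature=0.7,  # Moderate temperature for creative writing
    max_tokens=150,  # Longer response for a story
    top_p=0.9,  # Balance diversity in word choice
    stop=["\n\n", "THE END"],  # Multiple stop sequences
    presence_penalty=0.3,  # Encourage novel elements
    frequency_penalty=0.3,  # Reduce repeated phrases
    n=1,  # Generate one completion
    seed=123,  # For reproducible results
)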

print_highlight(f"Response: {response}")
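
The stop parameter in the request above deserves a closer look. As the OpenAI API documents it, generation halts at the first stop sequence encountered, and the returned text does not include that sequence. The sketch below imitates this on the client side purely for illustration; truncate_at_stop is a hypothetical helper, and the server performs the real check during decoding.

```python
def truncate_at_stop(text, stop):
    """Cut text at the earliest occurrence of any stop sequence (sequence excluded)."""
    positions = [text.find(s) for s in stop if s in text]
    return text[: min(positions)] if positions else text


story = "Captain Ada left orbit at dawn.\n\nChapter two begins. THE END"
print(repr(truncate_at_stop(story, ["\n\n", "THE END"])))
```

Here the blank line appears before "THE END", so the text is cut at the blank line and everything after it is dropped.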

Structured Outputs (JSON, Regex, EBNF)#

For the OpenAI-compatible structured outputs API, refer to Structured Outputs for more details.

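Using LoRA Adapters#

SGLang supports LoRA (Low-Rank Adaptation) adapters with the OpenAI-compatible APIs. You can specify which adapter to use directly in the model parameter with the base-model:adapter-name syntax.

Server Setup: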

python -m sglang.launch_server \
    --model-path qwen/qwen2.5-0.5b-instruct \
    --enable-lora \
    --lora-paths adapter_a=/path/to/adapter_a adapter_b=/path/to/adapter_b

For more details on LoRA serving configuration, see the LoRA documentation.

API Call:

(Recommended) Use the model:adapter syntax to specify which adapter to use:

response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct:adapter_a",  # ← base-model:adapter-name
    messages=[{"role": "user", "content": "Convert to SQL: show all users"}],
    max_tokens=50,
)

Backward Compatible: Using extra_body

The older extra_body method is still supported for backward compatibility:

# Backward compatible method
response = client.chat.completions.create(
    model="qwen/qwen2.5-0.5b-instruct",
    messages=[{"role": "user", "content": "Convert to SQL: show all users"}],
    extra_body={"lora_path": "adapter_a"},  # ← old method
    max_tokens=50,
)

Note: When both model:adapter and extra_body["lora_path"] are specified, the model:adapter syntax takes precedence.
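
The precedence rule can be pictured with a small client-side sketch. This is not SGLang's actual parsing code: resolve_adapter is a hypothetical helper, and it assumes that the base model name itself contains no colon.

```python
def resolve_adapter(model, extra_body=None):
    """Return (base_model, adapter_name) following the precedence rule above."""
    base_model, sep, adapter = model.partition(":")
    if sep and adapter:
        # The model:adapter syntax wins over extra_body["lora_path"].
        return base_model, adapter
    # Fall back to the older extra_body method; None means no adapter.
    return base_model, (extra_body or {}).get("lora_path")


print(resolve_adapter("qwen/qwen2.5-0.5b-instruct:adapter_a", {"lora_path": "adapter_b"}))
print(resolve_adapter("qwen/qwen2.5-0.5b-instruct", {"lora_path": "adapter_b"}))
print(resolve_adapter("qwen/qwen2.5-0.5b-instruct"))
```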


terminate_process(server_process)
