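Sampling Parameters#

This document describes the sampling parameters of the SGLang Runtime. The `/generate` endpoint is the runtime's low-level endpoint. If you want a high-level endpoint that handles chat templates automatically, consider using the OpenAI-compatible API.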

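`/generate` Endpoint#

The /generate endpoint accepts the following parameters in JSON format. For detailed usage, see the native API documentation. The object is defined in io_struct.py::GenerateReqInput. You can also read the source code to find more parameters and documentation.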

Parameter	Type/Default	Description
text	`Optional[Union[List[str], str]] = None`	The input prompt. Can be a single prompt or a batch of prompts.
input_ids	`Optional[Union[List[List[int]], List[int]]] = None`	The token IDs for the text; specify either text or input_ids, not both.
input_embeds	`Optional[Union[List[List[List[float]]], List[List[float]]]] = None`	The embeddings for input_ids; specify exactly one of text, input_ids, or input_embeds.
image_data	`Optional[Union[List[List[ImageDataItem]], List[ImageDataItem], ImageDataItem]] = None`	The image input. Can be an image instance, a file name, a URL, or a base64-encoded string. Can be a single image, a list of images, or a list of lists of images.
audio_data	`Optional[Union[List[AudioDataItem], AudioDataItem]] = None`	The audio input. Can be a file name, a URL, or a base64-encoded string.
sampling_params	`Optional[Union[List[Dict], Dict]] = None`	The sampling parameters, as described in the sections below.
rid	`Optional[Union[List[str], str]] = None`	The request ID.
return_logprob	`Optional[Union[List[bool], bool]] = None`	Whether to return log probabilities for tokens.
logprob_start_len	`Optional[Union[List[int], int]] = None`	If return_logprob is set, the position in the prompt from which to start returning logprobs. Defaults to -1, which returns logprobs for output tokens only.
top_logprobs_num	`Optional[Union[List[int], int]] = None`	If return_logprob is set, the number of top logprobs to return at each position.
token_ids_logprob	`Optional[Union[List[List[int]], List[int]]] = None`	If return_logprob is set, the token IDs to return logprobs for.
return_text_in_logprobs	`bool = False`	Whether to include the text form of each token in the returned logprobs.
stream	`bool = False`	Whether to stream the output.
lora_path	`Optional[Union[List[Optional[str]], Optional[str]]] = None`	The path to the LoRA.
custom_logit_processor	`Optional[Union[List[Optional[str]], str]] = None`	Custom logit processor for advanced sampling control. Must be a `CustomLogitProcessor` instance serialized with its `to_str()` method. See the usage example below.
return_hidden_states	`Union[List[bool], bool] = False`	Whether to return hidden states.
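
As a sketch of how these fields combine, the helper below assembles a `/generate` request body and enforces the rule that exactly one of `text`, `input_ids`, or `input_embeds` is given. The helper name `build_generate_payload` is hypothetical and not part of SGLang; only the JSON field names come from the table above.

```python
from typing import Any, Dict, Optional


def build_generate_payload(
    text=None,
    input_ids=None,
    input_embeds=None,
    sampling_params: Optional[Dict[str, Any]] = None,
    **options: Any,
) -> Dict[str, Any]:
    """Assemble a JSON body for /generate (hypothetical helper, not an SGLang API)."""
    inputs = {"text": text, "input_ids": input_ids, "input_embeds": input_embeds}
    provided = {key: value for key, value in inputs.items() if value is not None}
    if len(provided) != 1:
        raise ValueError("specify exactly one of text, input_ids, or input_embeds")

    payload = dict(provided)
    if sampling_params is not None:
        payload["sampling_params"] = sampling_params
    # Remaining top-level fields from the table, e.g. return_logprob or stream.
    payload.update(options)
    return payload


payload = build_generate_payload(
    text=["The capital of France is", "The capital of Japan is"],  # a batch of prompts
    sampling_params={"temperature": 0, "max_new_tokens": 16},
    return_logprob=True,
    top_logprobs_num=2,
)
```

The resulting dictionary can be passed as `json=payload` to `requests.post` against the `/generate` endpoint.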

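Sampling parameters#

The object is defined in sampling_params.py::SamplingParams. You can also read the source code to find more parameters and documentation.

Note on defaults#

By default, SGLang initializes several sampling parameters from the model's generation_config.json (this is the default behavior, equivalent to launching the server with --sampling-defaults model). To use the SGLang/OpenAI constant defaults instead, launch the server with --sampling-defaults openai. You can always override any parameter on a per-request basis via sampling_params.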

# Use the model-provided defaults from generation_config.json (default behavior)
python -m sglang.launch_server --model-path <MODEL> --sampling-defaults model

# Use the SGLang/OpenAI constant defaults instead
python -m sglang.launch_server --model-path <MODEL> --sampling-defaults openai
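
The precedence described above (per-request values over model-provided defaults over constant fallbacks) can be pictured with a small dictionary merge. This is only an illustration of the ordering, not SGLang's actual resolution code; the function name, the `use_model_defaults` switch, and the sample config are assumptions, and the fallback constants are taken from the core parameter table in this document.

```python
# Fallback constants as listed in the core parameter table of this document.
FALLBACK_DEFAULTS = {"temperature": 1.0, "top_p": 1.0, "top_k": -1, "min_p": 0.0}


def resolve_sampling_params(model_generation_config, request_params, use_model_defaults=True):
    """Illustrate the precedence: request > model defaults > constant fallbacks."""
    resolved = dict(FALLBACK_DEFAULTS)
    if use_model_defaults:
        # Only keys the model config actually sets replace the fallbacks.
        for key in resolved:
            if key in model_generation_config:
                resolved[key] = model_generation_config[key]
    # Per-request values always win.
    resolved.update(request_params)
    return resolved


# Hypothetical generation_config.json contents for some model.
model_cfg = {"temperature": 0.6, "top_p": 0.9}

print(resolve_sampling_params(model_cfg, {"temperature": 0.0}))
print(resolve_sampling_params(model_cfg, {}, use_model_defaults=False))
```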

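Core parameters#

Parameter	Type/Default	Description
max_new_tokens	`int = 128`	The maximum output length, measured in tokens.
stop	`Optional[Union[str, List[str]]] = None`	One or more stop words. Generation stops if one of these words is sampled.
stop_token_ids	`Optional[List[int]] = None`	Stop words given as token IDs. Generation stops if one of these token IDs is sampled.
stop_regex	`Optional[Union[str, List[str]]] = None`	Generation stops when any regular expression pattern in this list is matched.
temperature	`float (model default; fallback 1.0)`	The temperature used when sampling the next token. `temperature = 0` corresponds to greedy sampling; a higher temperature leads to more diversity.
top_p	`float (model default; fallback 1.0)`	Top-p selects tokens from the smallest sorted set whose cumulative probability exceeds `top_p`. When `top_p = 1`, this reduces to unrestricted sampling from all tokens.
top_k	`int (model default; fallback -1)`	Top-k randomly selects from the `k` highest-probability tokens.
min_p	`float (model default; fallback 0.0)`	Min-p samples from tokens with probability larger than `min_p * highest_token_probability`.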
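
To make the three filters concrete, here is a minimal pure-Python sketch that applies top-k, top-p, and min-p to a toy probability list. It is not SGLang's sampler: the function name, the order in which the filters are applied, and the use of un-renormalized probabilities for the top-p cutoff are all simplifying assumptions.

```python
def filter_probs(probs, top_k=-1, top_p=1.0, min_p=0.0):
    """Return {token_id: renormalized probability} for the tokens that survive."""
    # Token ids sorted from most to least likely.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)

    # Top-k: keep the k most likely tokens (-1 disables the filter).
    if top_k > 0:
        order = order[:top_k]

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for token_id in order:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break

    # Min-p: drop tokens below min_p * (highest token probability).
    # ">=" is used so the most likely token always survives.
    threshold = min_p * probs[order[0]]
    kept = [token_id for token_id in kept if probs[token_id] >= threshold]

    total = sum(probs[token_id] for token_id in kept)
    return {token_id: probs[token_id] / total for token_id in kept}


toy = [0.5, 0.3, 0.15, 0.05]
print(filter_probs(toy, top_k=2))    # keeps tokens 0 and 1
print(filter_probs(toy, top_p=0.7))  # keeps tokens 0 and 1
print(filter_probs(toy, min_p=0.2))  # keeps tokens 0, 1 and 2
```

With the fallback values (`top_k=-1`, `top_p=1.0`, `min_p=0.0`) no token is filtered out, which matches the "unrestricted sampling" case described in the table.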

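Penalizers#

Parameter	Type/Default	Description
frequency_penalty	`float = 0.0`	Penalizes tokens based on how often they have appeared in the generation so far. Must be between `-2` and `2`, where negative numbers encourage repetition of tokens and positive numbers encourage sampling of new tokens. The penalty grows linearly with each appearance of a token.
presence_penalty	`float = 0.0`	Penalizes tokens if they have appeared in the generation so far. Must be between `-2` and `2`, where negative numbers encourage repetition of tokens and positive numbers encourage sampling of new tokens. The penalty is constant once a token has occurred.
repetition_penalty	`float = 1.0`	Scales the logits of previously generated tokens to discourage (values > 1) or encourage (values < 1) repetition. The valid range is `[0, 2]`; `1.0` leaves probabilities unchanged.
min_new_tokens	`int = 0`	Forces the model to generate at least `min_new_tokens` tokens before a stop word or EOS token can end generation. Note that this may lead to unintended behavior, for example if the distribution is highly skewed towards these tokens.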
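
The difference between the three penalties is easiest to see on a toy logit vector. The sketch below uses the commonly used formulation: frequency and presence penalties are subtracted from the logit, and the repetition penalty divides positive logits and multiplies negative ones. That formulation and both function names are assumptions for illustration; this document does not spell out SGLang's exact formulas.

```python
from collections import Counter


def apply_additive_penalties(logits, generated_ids, frequency_penalty=0.0, presence_penalty=0.0):
    """Subtract frequency and presence penalties for tokens already generated."""
    adjusted = list(logits)
    for token_id, count in Counter(generated_ids).items():
        adjusted[token_id] -= frequency_penalty * count  # grows linearly with occurrences
        adjusted[token_id] -= presence_penalty           # constant once the token appeared
    return adjusted


def apply_repetition_penalty(logits, generated_ids, repetition_penalty=1.0):
    """Scale logits of already generated tokens (assumed divide/multiply convention)."""
    if repetition_penalty <= 0:
        raise ValueError("this sketch requires repetition_penalty > 0")
    adjusted = list(logits)
    for token_id in set(generated_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= repetition_penalty
        else:
            adjusted[token_id] *= repetition_penalty
    return adjusted


logits = [2.0, -1.0, 0.5]
generated = [0, 0, 1]  # token 0 appeared twice, token 1 once, token 2 never

print(apply_additive_penalties(logits, generated, frequency_penalty=0.5, presence_penalty=0.1))
print(apply_repetition_penalty(logits, generated, repetition_penalty=2.0))
```

In both cases the logit of token 2, which has not been generated yet, is left untouched.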

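Constrained decoding#

Please refer to our dedicated guide on constrained decoding for the following parameters.

Parameter	Type/Default	Description
json_schema	`Optional[str] = None`	JSON schema for structured outputs.
regex	`Optional[str] = None`	Regular expression for structured outputs.
ebnf	`Optional[str] = None`	EBNF for structured outputs.
structural_tag	`Optional[str] = None`	Structural tag for structured outputs.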

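Other options#

Parameter	Type/Default	Description
n	`int = 1`	Specifies the number of output sequences to generate per request. (Generating multiple outputs in one request (n > 1) is discouraged; repeating the same prompt several times offers better control and efficiency.)
ignore_eos	`bool = False`	Do not stop generation when the EOS token is sampled.
skip_special_tokens	`bool = True`	Remove special tokens during decoding.
spaces_between_special_tokens	`bool = True`	Whether to add spaces between special tokens during detokenization.
no_stop_trim	`bool = False`	Do not trim stop words or the EOS token from the generated text.
custom_params	`Optional[List[Optional[Dict[str, Any]]]] = None`	Used together with `CustomLogitProcessor`. See the usage example below.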
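
Following the recommendation for `n` above, one way to get several samples is to send the same prompt as a batch instead of setting `n > 1`. The helper below is a hypothetical sketch (its name is not an SGLang API); it relies only on the fact that `text` accepts a list of prompts.

```python
def repeat_prompt_payload(prompt, num_samples, sampling_params):
    """Build a /generate body that repeats one prompt num_samples times."""
    if num_samples < 1:
        raise ValueError("num_samples must be at least 1")
    return {
        "text": [prompt] * num_samples,  # one batch entry per desired sample
        "sampling_params": dict(sampling_params),
    }


# A non-zero temperature is used so that the samples can differ.
batch_payload = repeat_prompt_payload(
    "Write a one-line slogan for a bakery.",
    num_samples=3,
    sampling_params={"temperature": 0.8, "max_new_tokens": 32},
)
print(len(batch_payload["text"]))
```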

Examples#

Normal#

Launch a server:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3-8B-Instruct --port 30000

Send a request:

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())

For detailed examples, see the Send Request documentation.

Streaming#

Send a request and stream the output:

import json
import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
        "stream": True,
    },
    stream=True,
)

prev = 0
for chunk in response.iter_lines(decode_unicode=False):
    chunk = chunk.decode("utf-8")
    if chunk and chunk.startswith("data:"):
        if chunk == "data: [DONE]":
            break
        data = json.loads(chunk[5:].strip("\n"))
        output = data["text"].strip()
        print(output[prev:], end="", flush=True)
        prev = len(output)
print("")

For detailed examples, see the OpenAI-compatible API documentation.
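
The parsing loop from the streaming example can be factored into a generator and exercised without a running server. This sketch assumes, as the slicing in the example implies, that each `data:` chunk carries the full text generated so far; the function name and the fake stream are illustrative only.

```python
import json


def iter_stream_deltas(lines):
    """Yield newly generated text from an iterable of SSE lines (bytes or str)."""
    seen = 0
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        if not line.startswith("data:"):
            continue  # skip blank separator lines
        body = line[len("data:"):].strip()
        if body == "[DONE]":
            break
        text = json.loads(body)["text"]
        yield text[seen:]  # only the part not emitted yet
        seen = len(text)


# Hand-written stand-in for response.iter_lines().
fake_stream = [
    b'data: {"text": " Paris"}',
    b"",
    b'data: {"text": " Paris, a city"}',
    b"data: [DONE]",
]
print("".join(iter_stream_deltas(fake_stream)))
```

Against a live server, the same generator could be fed `response.iter_lines()` from a request made with `stream=True`.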

Multimodal#

Launch a server:

python3 -m sglang.launch_server --model-path lmms-lab/llava-onevision-qwen2-7b-ov

Download an image:

curl -o example_image.png -L "https://github.com/sgl-project/sglang/blob/main/examples/assets/example_image.png?raw=true"

Send a request:

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
                "<|im_start|>user\n<image>\nDescribe this image in a very short sentence.<|im_end|>\n"
                "<|im_start|>assistant\n",
        "image_data": "example_image.png",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 32,
        },
    },
)
print(response.json())

image_data can be a file name, a URL, or a base64-encoded string. See also python/sglang/srt/utils.py:load_image.
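
For the base64 option, a minimal sketch of producing the string with the standard library is shown below. Both helper names are hypothetical, and whether the server expects a bare base64 string (as produced here) or a prefixed form is not stated in this document, so treat the exact format as an assumption to verify against `load_image`.

```python
import base64
from pathlib import Path


def encode_image_bytes(data: bytes) -> str:
    """Return raw image bytes as a base64 string (no data-URI prefix)."""
    return base64.b64encode(data).decode("ascii")


def encode_image_file(path: str) -> str:
    """Read an image file from disk and base64-encode its contents."""
    return encode_image_bytes(Path(path).read_bytes())


# Usage sketch: replace the file name in the request body with the encoded string.
# request_body["image_data"] = encode_image_file("example_image.png")
```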

Streaming is supported in the same way as shown above.

For detailed examples, see OpenAI API Vision.

Structured Outputs (JSON, Regex, EBNF)#

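You can specify a JSON schema, a regular expression, or EBNF to constrain the model output. The model output is guaranteed to follow the given constraint. Only one constraint parameter (json_schema, regex, or ebnf) can be specified per request.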
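
The one-constraint-per-request rule can be checked on the client before a request is sent. The helper below is a hypothetical sketch, not an SGLang function; it also serializes a dictionary schema with `json.dumps`, since `json_schema` is typed as a string in the parameter table.

```python
import json


def with_constraint(sampling_params, json_schema=None, regex=None, ebnf=None):
    """Return a copy of sampling_params with at most one constraint attached."""
    constraints = {"json_schema": json_schema, "regex": regex, "ebnf": ebnf}
    chosen = {name: value for name, value in constraints.items() if value is not None}
    if len(chosen) > 1:
        raise ValueError("only one of json_schema, regex, or ebnf may be set per request")

    merged = dict(sampling_params)
    for name, value in chosen.items():
        # json_schema is sent as a string, so serialize dict schemas first.
        if name == "json_schema" and not isinstance(value, str):
            value = json.dumps(value)
        merged[name] = value
    return merged


params = with_constraint(
    {"temperature": 0, "max_new_tokens": 64},
    regex="(France|England)",
)
print(params)
```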

SGLang supports two grammar backends:

- XGrammar (default): supports JSON schema, regular expression, and EBNF constraints.
  - XGrammar currently uses the GGML BNF format.
- Outlines: supports JSON schema and regular expression constraints.

To use the Outlines backend, pass the --grammar-backend outlines flag when launching the server:

python -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct \
--port 30000 --host 0.0.0.0 --grammar-backend outlines  # xgrammar or outlines (default: xgrammar)

import json
import requests

json_schema = json.dumps({
    "type": "object",
    "properties": {
        "name": {"type": "string", "pattern": "^[\\w]+$"},
        "population": {"type": "integer"},
    },
    "required": ["name", "population"],
})

# JSON (works with both Outlines and XGrammar)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Here is the information of the capital of France in the JSON format.\n",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "json_schema": json_schema,
        },
    },
)
print(response.json())

# Regular expression (listed above as supported by both XGrammar and Outlines)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Paris is the capital of",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "regex": "(France|England)",
        },
    },
)
print(response.json())

# EBNF (XGrammar backend only)
response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "Write a greeting.",
        "sampling_params": {
            "temperature": 0,
            "max_new_tokens": 64,
            "ebnf": 'root ::= "Hello" | "Hi" | "Hey"',
        },
    },
)
print(response.json())

For detailed examples, see the Structured Outputs documentation.

Custom logit processor#

Launch a server with the --enable-custom-logit-processor flag.

python -m sglang.launch_server \
  --model-path meta-llama/Meta-Llama-3-8B-Instruct \
  --port 30000 \
  --enable-custom-logit-processor

Define a custom logit processor that always samples a specific token ID.

from sglang.srt.sampling.custom_logit_processor import CustomLogitProcessor

class DeterministicLogitProcessor(CustomLogitProcessor):
    """A dummy logit processor that changes the logits so that
    the given token ID is always sampled.
    """

    def __call__(self, logits, custom_param_list):
        # Check that the number of logit rows matches the number of custom parameters
        assert logits.shape[0] == len(custom_param_list)
        key = "token_id"

        for i, param_dict in enumerate(custom_param_list):
            # Mask all other tokens
            logits[i, :] = -float("inf")
            # Assign the highest probability to the specified token
            logits[i, param_dict[key]] = 0.0
        return logits
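
Why this masking forces a single token can be checked offline with plain Python lists: after every logit except one is set to negative infinity, softmax puts all probability mass on the remaining token. The two functions below are illustrative stand-ins written for this check, not part of SGLang, and they operate on a single row rather than a tensor.

```python
import math


def force_token(logits_row, token_id):
    """Mask every position except token_id, mirroring the processor above."""
    masked = [-math.inf] * len(logits_row)
    masked[token_id] = 0.0
    return masked


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [value / total for value in exps]


row = [1.2, -0.3, 0.8, 2.5, 0.0, -1.1, 0.4, 0.9]
probabilities = softmax(force_token(row, token_id=5))
print(probabilities)
```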

Send a request:

import requests

response = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "The capital of France is",
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "sampling_params": {
            "temperature": 0.0,
            "max_new_tokens": 32,
            "custom_params": {"token_id": 5},
        },
    },
)
print(response.json())

Send an OpenAI chat completion request:

import openai
from sglang.utils import print_highlight

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="None")

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    messages=[
        {"role": "user", "content": "List 3 countries and their capitals."},
    ],
    temperature=0.0,
    max_tokens=32,
    extra_body={
        "custom_logit_processor": DeterministicLogitProcessor().to_str(),
        "custom_params": {"token_id": 5},
    },
)

print_highlight(f"Response: {response}")
