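How to Support New Models#

This document explains how to add support for new language models and multimodal large language models (MLLMs) in SGLang. It also covers how to test new models and how to register external implementations.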

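How to Support a New Language Model#

To support a new model in SGLang, you only need to add a single file under the SGLang models directory. You can learn from existing model implementations and create a new file for your model. For most models, you should be able to find a similar model to start from (e.g., Llama). Also refer to the section on porting a model from vLLM to SGLang.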

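How to Support a New Multimodal Large Language Model#

Supporting a new multimodal large language model (MLLM) in SGLang requires several key components in addition to the standard LLM support: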

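Register your new model as multimodal: Extend is_multimodal_model in model_config.py so that it returns True for your model.
Register a new chat template: Only needed when your default chat template cannot accept images as input. Register a new chat template in conversation.py, along with the corresponding matching function.
Multimodal data processor: Define a new Processor class that inherits from BaseMultimodalProcessor and register it as your model's dedicated processor. See multimodal_processor.py for more details.
Handle multimodal tokens: Implement a pad_input_ids function for your new model. In this function, multimodal tokens in the prompt should be expanded (if necessary) and padded with multimodal-data hashes so that SGLang can recognize different multimodal data with RadixAttention.
Handle image feature extraction: Implement a get_image_feature function for your new model, which extracts image features from raw image data and converts them into the embeddings used by the language model.
Adapt to vision attention: Adapt the multi-headed Attention of the ViT to SGLang's VisionAttention.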

You can refer to Qwen2VL or other MLLM implementations. These models demonstrate how to handle both multimodal and text inputs correctly.
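The token-handling step can be hard to picture, so here is a minimal, self-contained sketch of the idea: each image placeholder in the prompt is expanded to a fixed number of positions and filled with a value derived from a hash of the image data, so that two prompts with different images produce different token sequences. The helper names (`pad_value_from_bytes`, `expand_and_pad`), the SHA-256 hash, and the sample token ids are assumptions for illustration; this is not SGLang's actual pad_input_ids implementation.

```python
import hashlib
from typing import List


def pad_value_from_bytes(data: bytes, vocab_size: int) -> int:
    """Derive a stable pad value from raw multimodal data (hypothetical helper)."""
    digest = hashlib.sha256(data).digest()
    # Offset past the vocabulary so pad values cannot collide with real token ids.
    return vocab_size + int.from_bytes(digest[:4], "big")


def expand_and_pad(
    input_ids: List[int],
    image_token_id: int,
    tokens_per_image: int,
    pad_values: List[int],
) -> List[int]:
    """Replace each image placeholder with `tokens_per_image` copies of that image's pad value."""
    out: List[int] = []
    image_idx = 0
    for tok in input_ids:
        if tok == image_token_id:
            out.extend([pad_values[image_idx]] * tokens_per_image)
            image_idx += 1
        else:
            out.append(tok)
    return out


# Sample values only: 99 stands in for an image placeholder token id.
vocab_size = 32000
pad_a = pad_value_from_bytes(b"image-a", vocab_size)
pad_b = pad_value_from_bytes(b"image-b", vocab_size)
padded = expand_and_pad(
    [1, 99, 2, 99], image_token_id=99, tokens_per_image=3, pad_values=[pad_a, pad_b]
)
print(len(padded))  # 8: two text tokens plus 3 positions per image
```

Because the pad values depend on the image content, a prefix cache keyed on token ids would treat the same text with a different image as a different prefix, which is the property the step above asks for.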

Testing and Debugging#

Please document all of your testing and benchmarking results in the PR description.

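Interactive Debugging#

For interactive debugging, compare the outputs of Hugging Face/Transformers and SGLang. The following two commands should give the same text output and very similar prefill logits:

Get the reference output:

python3 scripts/playground/reference_hf.py --model-path [new model] --model-type {text,mllm}

Get the SGLang output:

python3 -m sglang.bench_one_batch --correct --model [new model]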
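If you copy the logits printed by the two commands into a script, a comparison along the following lines may help decide whether they are "very similar". This is a sketch under stated assumptions: the function names, the tolerance, and the sample values are illustrative, and the guide itself does not prescribe a numeric threshold.

```python
from typing import Sequence


def max_abs_diff(reference: Sequence[float], candidate: Sequence[float]) -> float:
    """Largest element-wise absolute difference between two logit vectors."""
    if len(reference) != len(candidate):
        raise ValueError("logit vectors must have the same length")
    return max(abs(r - c) for r, c in zip(reference, candidate))


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value, i.e. the greedy next token."""
    return max(range(len(values)), key=lambda i: values[i])


def logits_match(
    reference: Sequence[float], candidate: Sequence[float], atol: float = 1e-2
) -> bool:
    """True if both vectors pick the same top token and differ by at most `atol`."""
    same_top = argmax(reference) == argmax(candidate)
    return same_top and max_abs_diff(reference, candidate) <= atol


# Made-up sample values standing in for the two printed outputs.
hf_logits = [1.20, -0.35, 4.05, 0.00]
sgl_logits = [1.21, -0.35, 4.04, 0.00]
print(logits_match(hf_logits, sgl_logits, atol=5e-2))  # True
```

Checking the top token in addition to the numeric gap catches the case where small differences flip the greedy choice, which would change the generated text.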

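Adding the Model to the Test Suite#

To keep the new model well maintained, add it to the test suite by including it in the ALL_OTHER_MODELS list in test_generation_models.py, test the new model on your local machine, and report the results on representative benchmarks (GSM8K, MMLU, MMMU, MMMU-Pro, etc.) in your PR. For VLMs, also add a test in test_vision_openai_server_{x}.py (e.g., test_vision_openai_server_a.py, test_vision_openai_server_b.py).

Here is an example command for testing a new model on your local machine: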

ONLY_RUN=Qwen/Qwen2-1.5B python3 -m unittest test_generation_models.TestGenerationModels.test_others
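The ONLY_RUN variable in the command above restricts the run to a single model. The sketch below shows one plausible way such a filter could work; the `select_models` helper and the second model name are hypothetical, and the actual logic in test_generation_models.py may differ.

```python
import os
from typing import List, Mapping, Optional


def select_models(
    all_models: List[str], env: Optional[Mapping[str, str]] = None
) -> List[str]:
    """Return every model, or only the one named by ONLY_RUN when it is set."""
    env = os.environ if env is None else env
    only_run = env.get("ONLY_RUN")
    if only_run is None:
        return list(all_models)
    return [model for model in all_models if model == only_run]


# "example-org/example-model" is a placeholder, not a real checkpoint.
models = ["Qwen/Qwen2-1.5B", "example-org/example-model"]
selected = select_models(models, {"ONLY_RUN": "Qwen/Qwen2-1.5B"})
print(selected)  # ['Qwen/Qwen2-1.5B']
```

Under this assumption, the value of ONLY_RUN has to match the entry in the model list exactly; otherwise no model is selected.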

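Benchmarking#

(Required) MMMU: Follow the MMMU benchmark README.md to compare accuracy between SGLang and HF Transformers. The accuracy score from the SGLang run should not be lower than that of the HF Transformers run. Similarly, follow https://docs.sglang.ai/developer_guide/benchmark_and_profiling.html to compare performance: TTFT and throughput must match or beat the baseline (e.g., HF Transformers), meaning TTFT is no higher and throughput is no lower.
(Optional) Other evals: If you ran other evaluations, please document the results in the PR description.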
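The acceptance criteria above mix metrics where higher is better (accuracy, throughput) with one where lower is better (TTFT). A small sketch like the following can make the direction of each comparison explicit when you summarize results for a PR. The `RunMetrics` type, the field names, and the numbers are placeholders for illustration, not real measurements or an SGLang API.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RunMetrics:
    accuracy: float          # benchmark accuracy, higher is better
    ttft_ms: float           # time to first token in milliseconds, lower is better
    throughput_tok_s: float  # generated tokens per second, higher is better


def failed_criteria(candidate: RunMetrics, baseline: RunMetrics) -> List[str]:
    """List the criteria on which the candidate run is worse than the baseline."""
    failures = []
    if candidate.accuracy < baseline.accuracy:
        failures.append("accuracy")
    if candidate.ttft_ms > baseline.ttft_ms:
        failures.append("ttft")
    if candidate.throughput_tok_s < baseline.throughput_tok_s:
        failures.append("throughput")
    return failures


# Placeholder numbers for illustration only, not real measurements.
baseline = RunMetrics(accuracy=0.50, ttft_ms=200.0, throughput_tok_s=100.0)
candidate = RunMetrics(accuracy=0.50, ttft_ms=150.0, throughput_tok_s=400.0)
print(failed_criteria(candidate, baseline))  # []
```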

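Porting a Model from vLLM to SGLang#

The vLLM models directory is a valuable resource, as vLLM covers many models. SGLang reuses vLLM's interface and some of its layers, which makes it easier to port models from vLLM to SGLang.

To port a model from vLLM to SGLang: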

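Compare these two files for guidance:
- SGLang Llama implementation
- vLLM Llama implementation
The major differences include:
- Replace vLLM's Attention with RadixAttention (make sure to pass layer_id to RadixAttention).
- Replace vLLM's LogitsProcessor with SGLang's LogitsProcessor.
- Replace the multi-headed Attention of the ViT with SGLang's VisionAttention.
- Replace other vLLM layers (such as RMSNorm and SiluAndMul) with SGLang layers.
- Remove Sample.
- Change the forward() functions and add a forward_batch() method.
- Add EntryClass at the end.
- Make sure the new implementation uses only SGLang components and does not rely on any vLLM components.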
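The "Add EntryClass at the end" step exists so that a loader can discover the model class from the module. The sketch below illustrates that idea with stand-in modules built from `types.ModuleType`; the `collect_entry_classes` function and the class and module names are hypothetical, and SGLang's real loader may behave differently.

```python
import types
from typing import Dict, Iterable


def collect_entry_classes(modules: Iterable[types.ModuleType]) -> Dict[str, type]:
    """Map class names to classes for every module that exposes an EntryClass."""
    registry: Dict[str, type] = {}
    for module in modules:
        entry = getattr(module, "EntryClass", None)
        if entry is None:
            continue  # helper modules without a model class are skipped
        # Assumption: EntryClass may be a single class or a list of classes.
        classes = entry if isinstance(entry, (list, tuple)) else [entry]
        for cls in classes:
            if cls.__name__ in registry:
                raise ValueError(f"duplicate model class name: {cls.__name__}")
            registry[cls.__name__] = cls
    return registry


class MyModelForCausalLM:  # stand-in for a ported model class
    pass


model_module = types.ModuleType("my_model")
model_module.EntryClass = MyModelForCausalLM
helper_module = types.ModuleType("utils")  # has no EntryClass

registry = collect_entry_classes([model_module, helper_module])
print(sorted(registry))  # ['MyModelForCausalLM']
```

In this sketch the registry key is the class name, which is why the class name has to line up with the architecture name that the checkpoint's config declares.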

Note: make sure to add your new model to the supported models list in the documentation.

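Registering an External Model Implementation#

In addition to the methods above, you can register your new model with the ModelRegistry before launching the server. This lets you integrate your model without modifying the source code.

For example: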

from functools import lru_cache

from sglang.srt.entrypoints.http_server import launch_server
from sglang.srt.models.registry import ModelRegistry

# For a single model, add it to the registry.
# `model_name` and `model_class` are placeholders for your own values.
ModelRegistry.models[model_name] = model_class

# For multiple models, you can imitate the import_model_classes() function:
@lru_cache()
def import_new_model_classes():
    model_arch_name_to_cls = {}
    # Populate model_arch_name_to_cls with your new model classes.
    ...
    return model_arch_name_to_cls

ModelRegistry.models.update(import_new_model_classes())

# Launch the server with your server arguments (`server_args` is a placeholder):
launch_server(server_args)

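Example: Implementing and Serving a Llama Wrapper Model#

Below is an introductory, step-by-step walkthrough of how to implement a new model end-to-end in SGLang and then run it via the offline engine.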

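Implementing Our Model#

To keep things simple, this new model will be a thin wrapper around Llama 3.1-8B-Instruct. Our goal is just to bias the output logits on each forward call by taking the square root of each positive logit; non-positive logits are left unchanged.

Let's start by defining our model in a file called llama_wrapper.py. The first step is to import the necessary libraries from SRT, which is SGLang's internal backend.

# In the file `llama_wrapper.py`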

import torch
from transformers import LlamaConfig
from typing import Optional
from sglang.srt.layers.logits_processor import LogitsProcessorOutput
from sglang.srt.layers.quantization.base_config import QuantizationConfig
from sglang.srt.model_executor.forward_batch_info import ForwardBatch, PPProxyTensors

from sglang.srt.models.llama import LlamaForCausalLM

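Next, we declare a new class for our model and have it inherit from LlamaForCausalLM, which gives our model access to LlamaForCausalLM's predefined modules and layers, such as LlamaAttention and LlamaMLP. Note that almost all model implementations take config and quant_config as arguments to their __init__ method; config and quant_config are passed in via model_loader/loader.py. Because we inherit from LlamaForCausalLM, we can pass our parameters directly to its constructor, which will set the member variables for us.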

class LlamaWrapper(LlamaForCausalLM):
    def __init__(
        self,
        config: LlamaConfig,
        quant_config: Optional[QuantizationConfig] = None,
        prefix: str = "",
    ) -> None:
        super().__init__(config=config, quant_config=quant_config, prefix=prefix)

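Now, we want to define the forward method, which is what will be called at inference time. Note that the signature of forward is essentially the same for any model; you can look at the other models defined in the models directory for reference. To see where exactly forward is called in the SGLang runtime's internals, take a look at forward_decode and forward_extend in the ModelRunner class.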

    @torch.no_grad()
    def forward(
        self,
        input_ids: torch.Tensor,
        positions: torch.Tensor,
        forward_batch: ForwardBatch,
        pp_proxy_tensors: Optional[PPProxyTensors] = None,
        input_embeds: Optional[torch.Tensor] = None,
        get_embedding: bool = False,
    ) -> LogitsProcessorOutput:

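We now call the __call__ method of self.model (a member variable that LlamaForCausalLM defines in its __init__ method), which runs the forward pass of the underlying Llama model and returns the hidden states. After that, we feed the hidden_states into our model's LogitsProcessor (again defined in LlamaForCausalLM).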

        hidden_states = self.model(
            input_ids,
            positions,
            forward_batch,
            input_embeds,
            pp_proxy_tensors=pp_proxy_tensors,
        )

        res: LogitsProcessorOutput = self.logits_processor(
            input_ids,
            hidden_states,
            self.lm_head,
            forward_batch,
        )

After receiving the logits for the next token, we can finally perform our biasing step.

        orig_logits = res.next_token_logits
        res.next_token_logits = torch.where(
            orig_logits > 0,
            orig_logits.sqrt(),
            orig_logits
        )

        return res
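To see what this biasing step does to individual values, here is a pure-Python equivalent of the torch.where expression above, applied to a few sample logits. The `bias_logits` name and the sample values are illustrative only.

```python
import math
from typing import List


def bias_logits(logits: List[float]) -> List[float]:
    """Take the square root of positive logits and leave the rest unchanged."""
    return [math.sqrt(x) if x > 0 else x for x in logits]


print(bias_logits([4.0, 0.25, 0.0, -9.0]))  # [2.0, 0.5, 0.0, -9.0]
```

Note that positive logits above 1 shrink while those between 0 and 1 grow, so the step compresses positive logits toward 1 and flattens the distribution over the highest-scoring tokens.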

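Now, our LlamaWrapper model is created and ready to be served!

Serving Our Model via SGLang's Offline Engine#

The next step of this guide involves hosting our new model offline, so that it can be served locally without an HTTP server.

First, create a new file called run.py. Now, we must ensure that SGLang's ModelRegistry can find our model. To do this, we first download the model's configuration and weights from Hugging Face.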

# In the file `run.py`

import asyncio
from functools import lru_cache
from huggingface_hub import snapshot_download
from llama_wrapper import LlamaWrapper  # Make sure to import our new model!
import sglang as sgl
from sglang.srt.models.registry import ModelRegistry

# Make sure to request access to this model on Hugging Face, then export your
# `HF_TOKEN` to download the model snapshot
llama_dir = snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="./llama_ckpt",
)

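Now that our model is stored on disk, we want to point it to LlamaWrapper by changing the architectures field in ./llama_ckpt/config.json to LlamaWrapper. That way, when we pass the path of our model checkpoint to SGLang, it will know that we want to use "LlamaWrapper" instead of "LlamaForCausalLM" as our model.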

{
  "architectures": [
   #  "LlamaForCausalLM"
    "LlamaWrapper"
  ],
  ...
}
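If you prefer to make this change from a script rather than by hand, a minimal sketch using only the standard library could look like the following. The `set_architecture` helper is hypothetical and is not part of SGLang or Hugging Face; it simply rewrites the architectures field and keeps every other field intact.

```python
import json
from pathlib import Path


def set_architecture(config_path: Path, arch_name: str) -> dict:
    """Overwrite the `architectures` field of a config.json and return the new config."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    config["architectures"] = [arch_name]
    config_path.write_text(json.dumps(config, indent=2) + "\n", encoding="utf-8")
    return config
```

For the checkpoint in this guide, the call would be set_architecture(Path("./llama_ckpt/config.json"), "LlamaWrapper"), run once after the download has finished.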

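SGLang will only find our model if the LlamaWrapper class is associated with the "LlamaWrapper" registry keyword. To register our LlamaWrapper, we follow the steps from the earlier section on registering external model implementations.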

@lru_cache()
def import_new_model_classes():
    model_arch_name_to_cls = {"LlamaWrapper": LlamaWrapper}
    return model_arch_name_to_cls

ModelRegistry.models.update(import_new_model_classes())

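Lastly, when we create our Engine, we just pass in the path to the local model directory. Our LlamaWrapper is then ready to be served; for this walkthrough, we will use the SGLang Engine's non-streaming asynchronous generation endpoint.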

def main():
    llm = sgl.Engine(model_path="./llama_ckpt")
    sampling_params = {"temperature": 0.2, "top_k": 5}
    prompts = [
        "Write a short, neutral self-introduction for a fictional character. Hello, my name is",
        "Provide a concise factual statement about France’s capital city. The capital of France is",
        "Explain possible future trends in artificial intelligence. The future of AI is",
    ]

    asyncio.run(run_llm(llm, sampling_params, prompts))

    llm.shutdown()

async def run_llm(
    llm,
    sampling_params,
    prompts,
) -> None:
    outputs = await llm.async_generate(prompts, sampling_params)

    for prompt, output in zip(prompts, outputs):
        print(f"\nPrompt: {prompt}")
        print(f"Generated text: {output['text']}")

if __name__ == "__main__":
    main()

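Now, when we run python run.py, we will get the outputs of our newly created model!

Documentation#

Add your model to the supported models table in generative_models.md or multimodal_language_models.md.

By following these guidelines, you can add support for new language models and multimodal large language models in SGLang and ensure that they are thoroughly tested and easily integrated into the system.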