性能测试#

本文档介绍如何使用Xinference的性能测试工具。

测试工具概述 #

使用 xorbitsai/inference 提供的测试集。

测试工具包括：

benchmark_serving：测试整体服务指标。
benchmark_latency：测试延迟指标。
benchmark_long：测试长上下文的推理效果。

测试数据集 #

对于 serving 和 latency 的测试，需要使用数据集。默认使用的数据集在 https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/tree/main ，下载 ShareGPT_V3_unfiltered_cleaned_split.json

运行测试集 #

基本命令 #

python benchmark_serving.py --dataset /path/to/ShareGPT_V3_unfiltered_cleaned_split.json \
                            --tokenizer /path/to/tokenizer \
                            --model-uid ${model_uid} \
                            --num-prompt 100 --concurrency 50

参数说明 #

--dataset 为刚刚下载的数据集路径。
--tokenizer 指定为模型的下载路径。
--model-uid 指定为模型的 ID，在运行的模型列表里可以找到。
--num-prompt 为测试的 prompt 个数，默认 100。
--concurrency 同时并发的请求，默认100。

结果说明 #

测试结果示例 #

================ Benchmark Result ================
Successful requests:                     100
Benchmark duration (s):                  44.55
Total input tokens:                      16604
Total generated tokens:                  18181
Request throughput (req/s):              2.24
Input token throughput (tok/s):          372.70
Output token throughput (tok/s):         408.10
----------------Latency Statistics----------------
Mean latency (s):                        22.7351
Mean latency per token (s):              0.0909
Mean latency per output token (s):       0.3358
==================================================

指标含义 #

Successful requests：成功的请求数。
Benchmark duration (s)：Benchmark 时长。
Total input tokens：总输入 token 数。
Total generated tokens：总生成 token 数。
Request throughput (req/s)：请求吞吐（每秒处理的请求）。
Input token throughput (tok/s)：输入 token 吞吐（每秒处理的输入 token）。
Output token throughput (tok/s)：输出的 token 吞吐（每秒生成的 token）。
Mean latency (s)：平均延迟。
Mean latency per token (s)：平均每 token 延迟。
Mean latency per output token (s)：平均每输出 token 延迟。