speculative-decoding

Found 2 articles with this tag

Speculative Decoding on AWS Trainium2 Cuts LLM Inference Latency by Up to 3x

AWS benchmarks show that using speculative decoding with vLLM on Trainium2 can reduce inter-token latency for decode-heavy workloads by up to 3x.

The open-source project DFlash achieves a 4.13x speedup for Qwen3.5-9B inference on an M5 Max using MLX, with a token acceptance rate as high as 89.4%.