speculative-decoding
2 articles tagged with this topic
AWS · Trainium2 · vLLM
Speculative Decoding on AWS Trainium2 Cuts LLM Latency Up to 3x
AWS benchmarks show speculative decoding with vLLM on Trainium2 reduces inter-token latency up to 3x for decode-heavy workloads.
Apr 15 · 4 min read
MLX · Qwen3.5
DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
Open-source DFlash achieves 4.13x speedup on Qwen3.5-9B using MLX on M5 Max with 89.4% token acceptance rate.
Apr 13 · 4 min read