speculative-decoding
2 articles tagged with this topic
AWS · Trainium2 · vLLM
Speculative Decoding on AWS Trainium2 Cuts LLM Latency Up to 3x
AWS benchmarks show speculative decoding with vLLM on Trainium2 reduces inter-token latency up to 3x for decode-heavy workloads.
Apr 15 · 4 min read
MLX · Qwen3.5
DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
Open-source DFlash achieves 4.13x speedup on Qwen3.5-9B using MLX on M5 Max with 89.4% token acceptance rate.
Apr 13 · 4 min read