AWS benchmarks show that speculative decoding with vLLM on Trainium2 reduces inter-token latency by up to 3x for decode-heavy workloads.
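
For context, speculative decoding pairs the large target model with a small draft model: the draft proposes several tokens per step, and the target verifies them in a single forward pass, which is what drives down inter-token latency on decode-heavy workloads. Below is a minimal sketch of how draft-model speculative decoding is typically enabled in vLLM; the model names, draft length, and the `speculative_config` schema are illustrative assumptions (exact parameter names vary across vLLM versions, and none of this is taken from the AWS benchmark setup itself):

```python
# Sketch: draft-model speculative decoding in vLLM.
# Model choices and config keys are assumptions for illustration only.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # target model (assumption)
    speculative_config={
        # Small draft model that proposes candidate tokens (assumption).
        "model": "meta-llama/Llama-3.2-1B-Instruct",
        # Number of draft tokens proposed per verification step.
        "num_speculative_tokens": 5,
    },
)

params = SamplingParams(temperature=0.0, max_tokens=256)
outputs = llm.generate(
    ["Explain speculative decoding in one paragraph."], params
)
print(outputs[0].outputs[0].text)
```

The draft length trades off proposal cost against acceptance rate: more draft tokens amortize the target's verification pass further, but only pay off when the draft's predictions are usually accepted, which is why the gains concentrate in decode-heavy (long-generation) workloads.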