A single RTX 3090 hit 72 tok/s (tokens per second, a measure of model output speed) running natively on Windows, meaning you finally no longer need to install Linux just to run local LLMs.

What this is

A Reddit community developer released a native Windows vLLM (LLM inference acceleration framework) patch and portable launcher. After downloading and extracting, users can double-click to run Qwen3.6-27B (a 27-billion-parameter open-source model) on Windows without configuring a Python environment or relying on WSL/Docker virtualization. Reported test numbers on a single 3090: roughly 72 tok/s on short prompts, around 64.5 tok/s on long prompts, and support for an ultra-long 127k context on one card. This is made possible by using the INT4 (4-bit integer quantization, a technique that compresses model weights to reduce VRAM usage) version of the model.
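
As a rough sketch of what the launcher presumably configures under the hood (the model tag, the AWQ quantization method, and the exact context length below are assumptions, not confirmed details of the patch), loading an INT4-quantized model through vLLM's Python API looks like this:

```python
# Minimal sketch: serving an INT4-quantized model with vLLM's Python API.
# The model tag, quantization method (AWQ), and context length are assumptions;
# the actual Windows launcher may configure these differently.
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-27B-AWQ",    # hypothetical INT4 (AWQ) checkpoint name
    quantization="awq",            # 4-bit weights cut VRAM roughly 4x vs FP16
    max_model_len=127_000,         # the reported 127k context window
    gpu_memory_utilization=0.95,   # leave a little headroom on the 24 GB 3090
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Summarize the benefits of local LLM inference."], params)
print(outputs[0].outputs[0].text)
```

The 4-bit weights are what make a 27B model fit: at INT4 the weights take roughly 14 GB rather than the ~54 GB an FP16 copy would need, leaving room for the long-context KV cache on a 24 GB card.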

Industry view

We note that local LLMs have long suffered from a "Linux compulsion": great performance, but a high barrier to entry. This work drastically narrows the usability gap between Windows and Linux, letting traditional enterprises accustomed to Windows trial local deployment with near-zero friction. There are dissenting voices in the community, though: in absolute performance Windows still trails Linux (the same card can hit 80+ tok/s on Linux), and the solution only supports Nvidia 30-series and newer GPUs, leaving older cards and AMD users out in the cold. In addition, the unofficial vLLM branch has not been validated for enterprise-grade long-term stability, and INT4 quantization's accuracy loss on complex reasoning tasks is a potential risk.
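
For reference, the tok/s figures in these comparisons are typically computed as completion tokens divided by wall-clock generation time. A minimal measurement sketch, assuming the portable build exposes vLLM's standard OpenAI-compatible server on localhost:8000 (the model name here is a placeholder):

```python
# Rough tok/s measurement against a local vLLM OpenAI-compatible endpoint.
# Assumes the portable build runs vLLM's standard server on localhost:8000.
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

start = time.perf_counter()
resp = client.chat.completions.create(
    model="local-model",  # placeholder; use the name the server actually reports
    messages=[{"role": "user", "content": "Explain INT4 quantization briefly."}],
    max_tokens=512,
)
elapsed = time.perf_counter() - start

tokens = resp.usage.completion_tokens
print(f"{tokens} tokens in {elapsed:.2f}s -> {tokens / elapsed:.1f} tok/s")
```

Note that this single-request number folds prompt-processing time into the total, which is why short-prompt and long-prompt throughput figures differ in the tests above.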

Impact on regular people

For enterprise IT: No need to rebuild underlying infrastructure; teams can pilot local AI deployments directly on existing Windows workstations and validate data-privacy setups at low cost.

For individual professionals: Tech enthusiasts can more easily run local models on their office PCs to handle sensitive document summarization and information extraction without uploading anything to the cloud, as sketched below.
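
As an illustration of that workflow (a sketch assuming the same local OpenAI-compatible endpoint as above; the file path and model name are hypothetical), summarizing a sensitive document without it ever leaving the machine looks like this:

```python
# Summarize a local document against a locally hosted model; nothing leaves
# the machine. Assumes vLLM's OpenAI-compatible server on localhost:8000.
from pathlib import Path
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

document = Path("quarterly_report.txt").read_text(encoding="utf-8")  # hypothetical file

resp = client.chat.completions.create(
    model="local-model",  # placeholder for whatever the launcher serves
    messages=[
        {"role": "system", "content": "You summarize documents concisely."},
        {"role": "user", "content": f"Summarize the key points:\n\n{document}"},
    ],
    max_tokens=400,
)
print(resp.choices[0].message.content)
```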

For the consumer market: The radical simplification of local deployment, combined with VRAM optimization, may further drive up demand for high-VRAM consumer GPUs in non-gaming, office scenarios.