Local Inference
4 articles tagged with this topic
Minimax · LocalLLaMA
Minimax 2.7 Update Anticipated by Local LLM Community
Reddit's LocalLLaMA community signals anticipation for Minimax 2.7, but details remain sparse.
Apr 6 · 1 min read
llama.cpp · GLM-4.7
Best Local LLM for Agentic Coding on a Single RTX 4090
A 4090 owner benchmarks GLM-4.7, Nemotron-30B, and Qwen3-Coder for local agentic coding via llama.cpp.
Apr 6 · 1 min read
HunyuanOCR · GGUF
HunyuanOCR 1B Runs at 90 t/s on GTX 1060 via GGUF
Tencent's HunyuanOCR 1B model runs at 90 tokens/sec on a GTX 1060 via GGUF, enabling local OCR on budget hardware.
Apr 6 · 1 min read
Gemma 4 · Per-Layer Embeddings
Per-Layer Embeddings: How Gemma 4's Small Models Work
Gemma 4's E2B and E4B models use per-layer embeddings, not MoE, enabling new inference performance tradeoffs.
Apr 5 · 2 min read