Local Inference
Found 4 articles with this tag
Minimax · LocalLLaMA
Minimax 2.7 Update Anticipated by Local LLM Community
Reddit's LocalLLaMA community is eagerly anticipating Minimax 2.7, though details remain sparse.
Apr 6 · 1 min read
llama.cpp · GLM-4.7
Best Local LLM for Agentic Coding on a Single RTX 4090
A 4090 owner benchmarks GLM-4.7, Nemotron-30B, and Qwen3-Coder for local agentic coding via llama.cpp.
Apr 6 · 1 min read
HunyuanOCR · GGUF
HunyuanOCR 1B Runs at 90 t/s on GTX 1060 via GGUF
Tencent's HunyuanOCR 1B model runs at 90 tokens/sec on a GTX 1060 via GGUF, enabling local OCR on budget hardware.
Apr 6 · 1 min read
Gemma 4 · Per-Layer Embeddings
Per-Layer Embeddings: How Gemma 4's Small Models Work
Gemma 4's E2B and E4B models use per-layer embeddings, not MoE, enabling new inference performance tradeoffs.
Apr 5 · 2 min read