LocalLLaMA
30 articles tagged with this topic
Alibaba's Qwen 3.6 Max Quietly Launches at the Top of the Chinese Model Leaderboard, but Open vs. Closed Source Is the Real Question
Alibaba's Qwen 3.6 Max quietly launched in preview, scoring highest among Chinese models — but its open-source status remains undecided.
Local AI Still Goes in Circles Calling Its Own Tools: the Open-Source Community's Real-World Experience Lags a Full Generation Behind the Marketing
A 103-upvote Reddit thread exposes how local open-source models consistently hallucinate completed tasks during tool calling.
Can Two GPUs Run Two AI Models at Once? A Real User's Case Reveals the Core Trade-offs of Local Deployment
An RTX 3090 + RTX 3060 user's Reddit question reveals the core hardware trade-offs in local LLM deployment.
Is "harness" a new buzzword?
Not AI news.
Qwen 3.6 is the first local model that actually feels worth the effort for me
Alibaba's Qwen3.6 35B-A3B runs Q8 at 170 tokens/sec with full 260K context on dual consumer GPUs.
Move to local models
Source article is a personal support question, not a reportable AI news event.
Is Qwen3.6-35B Worse at Tool Use and More Prone to Reasoning Loops Than 3.5?
Community testers report Qwen3.6-35B enters infinite reasoning loops more than Qwen3.5 on agentic coding tasks.
Alibaba Releases Qwen3.6-35B-A3B Mixture-of-Experts Model
Alibaba's Qwen team releases Qwen3.6-35B-A3B, a 35B-parameter MoE model activating 3B parameters per token.
Gemma 4 Jailbreak System Prompt
A system prompt designed to bypass Gemma 4's safety filters is circulating on Reddit with 112 upvotes.
Local AI is the best
A Reddit post praising local AI tools contains no verifiable news, data, or technical developments.
Qwen3.5-9B GGUF Quant Rankings: Q8_0 Dominates KLD Scores
KLD benchmarks across community GGUF quants show Q8_0 variants cluster near 0.001 KLD, with quality degrading sharply below Q5.
DFlash speculative decoding on Apple Silicon: 4.1x on Qwen3.5-9B, now open source (MLX, M5 Max)
Open-source DFlash achieves 4.13x speedup on Qwen3.5-9B using MLX on M5 Max with 89.4% token acceptance rate.
Why Do Some Small and Medium Models Fail at Grammar-Checking Tasks?
Gemma 4B, GPT-OSS-20B, and Qwen3-80B hallucinate spelling errors in grammatically correct sentences.
Unsloth Releases Full GGUF Quant Suite for MiniMax M2.7
Unsloth uploads 22 GGUF quantizations of MiniMax M2.7, ranging from 1-bit (60.7 GB) to BF16 (457 GB).
MiniMax M2.7 Blocks Commercial Use Despite 'Open' Release
MiniMax M2.7 prohibits commercial use, paid APIs, and profitable fine-tuning under its license terms.
Controlling Gemma 4 Thinking Tokens via System Prompts
Users struggle to reliably toggle Gemma 4's reasoning mode via system prompts, unlike Qwen-30B-A3B.
Gemma 4 31B Ranks Top-3 in Five European Languages on EuroEval
Gemma 4 31B scores 1st in Finnish, 2nd in Danish/French/Italian on EuroEval multilingual leaderboard.
Google Edge Gallery App: First Impressions from LocalLLaMA Community
A LocalLLaMA user shares early impressions of Google's Edge Gallery on-device AI app for Android.
Inside Google DeepMind's Gemma 4 Launch: What It Actually Took
A Reddit thread breaks down the engineering and logistics behind launching Gemma 4, Google DeepMind's open model.
Minimax 2.7 Update Anticipated by Local LLM Community
Reddit's LocalLLaMA community signals anticipation for Minimax 2.7, but details remain sparse.
Fine-Tuning on 4chan Data Boosts Llama 8B and 70B Benchmark Scores
A researcher fine-tuned Llama 8B and 70B on 4chan data and reports both models outperformed their base versions.
Claude Opus 4 Fails Elden Ring: A Reality Check on AGI Claims
A developer tested Claude Opus 4 on Elden Ring gameplay. It couldn't leave the first room, challenging Jensen Huang's AGI claims.
Gemma 4 31B Matches Gemini 2.5 Pro on Local Hardware Benchmarks
Community benchmarks show Gemma 4 31B achieving Gemini 2.5 Pro-level scores when run locally via llama.cpp harness.
Perplexity Releases MIT-Licensed Embedding Models for Local Use
Perplexity AI has published several embedding models under the MIT license, enabling free commercial use in local deployments.
Qwen 3.6 Spotted in Official App Alongside 3.5 Max Preview
A Reddit user spotted Qwen 3.6 inside the official Qwen app, suggesting an imminent public release beyond API access.
35% REAP Quantization Runs 397B Model on 96GB GPU
A community researcher achieved usable quality from a 397B parameter model using 35% REAP quantization on a 96GB GPU.
NYT Connections Benchmark: MiniMax-M1 Leads Local LLMs at 34.4
Community benchmark ranks MiniMax-M1 at 34.4, Gemma 4 31B at 30.1, Arcee Trinity Large Thinking at 29.5 on NYT Connections puzzles.
Gemma-4-31B Multi-Agent Swarm Matches Gemini Pro and GPT-5 Benchmarks
A LocalLLaMA user built a Gemma-4-31B agent swarm achieving performance comparable to frontier closed models.
RAG Demystified: Baseline vs. Advanced Retrieval Pipelines
Community clarifies RAG's true baseline: retrieve, rerank, inject chunks, generate — extras are enhancements.
RAG vs. Agentic Retrieval: What Actually Counts as RAG?
A LocalLLaMA thread debates whether RAG is a precise term or marketing hype for any retrieval-based LLM system.
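The RAG baseline discussed in the last two entries (retrieve, rerank, inject chunks, generate) can be sketched in a few lines. This is an illustrative toy, not code from either thread: scoring uses bag-of-words overlap instead of real embeddings, the reranker is a length tiebreak standing in for a cross-encoder, and the final LLM call is left as a prompt string.

```python
# Minimal sketch of the baseline RAG pipeline: retrieve -> rerank ->
# inject chunks -> generate. All names and scoring here are illustrative.

def retrieve(query, corpus, k=3):
    """Score each chunk by word overlap with the query; return top-k hits."""
    q = set(query.lower().split())
    scored = [(len(q & set(c.lower().split())), c) for c in corpus]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]

def rerank(query, chunks):
    """Second pass: prefer shorter chunks among equal-overlap hits
    (a stand-in for a cross-encoder reranker)."""
    q = set(query.lower().split())
    return sorted(chunks, key=lambda c: (-len(q & set(c.lower().split())), len(c)))

def build_prompt(query, chunks):
    """Inject the retrieved chunks into a prompt template for generation."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "Qwen3.6-35B-A3B is a mixture-of-experts model.",
    "Gemma 4 31B scores well on EuroEval.",
    "MoE models activate a subset of parameters per token.",
]
query = "What do MoE models activate?"
prompt = build_prompt(query, rerank(query, retrieve(query, corpus)))
print(prompt)  # the prompt you would hand to the generator model
```

Everything past this baseline (query rewriting, agentic retrieval, graph stores) is an enhancement layered on the same four steps, which is the distinction the threads draw.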