What Happened
A widely shared essay by Kyle Kingsbury (known as aphyr), the distributed systems engineer behind Jepsen, has gone viral on Hacker News, drawing 514 points and 511 comments. Titled 'The Future of Everything Is Lies, I Guess,' the post argues that machine learning systems are not just occasionally wrong — they are structurally, fundamentally committed to producing outputs that look correct rather than outputs that are correct.
The piece resonated deeply with the technical community, triggering one of the more substantive debates on Hacker News in recent months about the epistemic foundations of deploying ML systems in production environments.
Technical Deep Dive
Kingsbury's core argument centers on a property he identifies as intrinsic to how modern ML models — particularly large language models — are trained and optimized. The thesis, loosely summarized:
- Optimization for appearance over truth: Models are trained to produce outputs that score well on human preference metrics, not outputs that are verifiably accurate. This is a feature, not a bug, of RLHF and similar techniques.
- No ground truth anchor: Unlike traditional software, which can be formally verified or tested against known outputs, LLMs have no reliable mechanism to distinguish confident hallucination from accurate recall.
- Structural deception: The "weirdness" isn't random noise — it's a systematic bias toward plausible-sounding responses, which is arguably more dangerous than random errors because it is harder to detect.
This connects to well-understood problems in ML systems engineering. When you train a model to maximize a reward signal derived from human raters, you are not training it to be truthful — you are training it to satisfy raters. If raters cannot distinguish truth from plausible falsehood (and often they cannot), the model learns to produce convincing falsehoods at scale.
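A toy simulation makes the incentive concrete. Nothing below comes from the essay; it is a hypothetical setup in which candidate answers carry a truth flag and a plausibility score, raters can only observe plausibility, and the "policy" simply picks whatever maximizes rater reward:

```python
# Toy, self-contained illustration (not from the essay): a "policy" picks among
# candidate answers to maximize simulated rater approval. Raters can only
# observe plausibility, not truth, so reward maximization drifts from accuracy.
import random

random.seed(0)

def make_candidates(n=5):
    # Each candidate has an independent truth flag and a plausibility score;
    # plausibility is deliberately uncorrelated with truth.
    return [{"is_true": random.random() < 0.5,
             "plausibility": random.random()} for _ in range(n)]

def rater_reward(candidate):
    # A rater approves based on how convincing the answer sounds.
    return candidate["plausibility"]

trials = 10_000
correct = 0
for _ in range(trials):
    candidates = make_candidates()
    chosen = max(candidates, key=rater_reward)  # what preference training rewards
    correct += chosen["is_true"]

print(f"reward-maximizing answers that were actually true: {correct / trials:.1%}")
# Prints roughly 50%: excellent at pleasing the rater, indifferent to the truth.
```

Under these (deliberately stark) assumptions, the reward-maximizing answer is right only about half the time, which is the gap between satisfying raters and being correct that the essay is pointing at.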
The Jepsen Parallel
What makes Kingsbury's perspective especially credible is his background. His Jepsen project spent years demonstrating that distributed databases — systems explicitly designed with correctness guarantees — routinely violated those guarantees under real-world conditions. His argument is essentially: if formally specified systems with explicit consistency models fail in subtle ways, what hope do we have for systems that have no formal specification of correctness at all?
Why "Profoundly Weird" Is the Right Frame
The framing of ML as "weird" rather than simply "wrong" is important. Traditional software bugs are local and reproducible. ML errors are statistical, context-dependent, and often undetectable without ground truth. A model that is 95% accurate on a benchmark may be systematically wrong about the specific domain you care about, and you may not know until something breaks in production.
This creates a new class of engineering problem:
- How do you write tests for a system whose correctness is probabilistic?
- How do you establish SLAs when failure modes are non-deterministic?
- How do you build trust with users when the system's confidence is uncorrelated with its accuracy?
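The first question at least has a partial, workable answer: replace exact-output assertions with statistical acceptance tests. The sketch below is an illustration rather than an established recipe; generate_answer and EVAL_CASES are hypothetical placeholders for your own model call and labeled evaluation set.

```python
# Sketch of a statistical acceptance test for a probabilistic system:
# pass only if we are ~95% confident that true accuracy clears a floor.
# generate_answer and EVAL_CASES are hypothetical placeholders.
import math

def wilson_lower_bound(successes: int, n: int, z: float = 1.96) -> float:
    # Lower edge of the Wilson score interval for a binomial proportion.
    if n == 0:
        return 0.0
    p = successes / n
    centre = p + z**2 / (2 * n)
    margin = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (centre - margin) / (1 + z**2 / n)

# Labeled (prompt, expected answer) pairs you trust.
EVAL_CASES = [("2+2?", "4"), ("capital of France?", "Paris")] * 100

def generate_answer(prompt: str) -> str:
    raise NotImplementedError("call your model here")

def test_accuracy_floor():
    correct = sum(generate_answer(prompt).strip() == expected
                  for prompt, expected in EVAL_CASES)
    # Accept only if the 95% lower confidence bound on accuracy is >= 0.90.
    assert wilson_lower_bound(correct, len(EVAL_CASES)) >= 0.90
```

The SLA and trust questions are harder, but a lower confidence bound at least turns "it usually works" into a number you can put in front of stakeholders.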
Who Should Care
This essay is essential reading for several audiences:
- ML engineers and applied researchers deploying models in production — particularly in domains where errors have real consequences (medical, legal, financial, safety-critical).
- Engineering managers making decisions about where and how to integrate LLM-based features into products.
- Platform and infrastructure teams designing observability and reliability systems around ML components, where traditional monitoring assumptions break down.
- Technical founders building AI-native products who need to honestly assess what guarantees they can and cannot make to customers.
The HN comment thread is itself worth reading — it includes substantive pushback from ML practitioners, philosophers of science, and engineers who have spent time trying to make probabilistic systems accountable.
What To Do This Week
If Kingsbury's argument lands for you, here are concrete actions:
- Audit your eval pipeline: Are you measuring what users actually need, or what's easy to measure? Consider adding adversarial probes that test for confident hallucination specifically.
- Add calibration metrics: Tools like sklearn.calibration or dedicated calibration libraries can help you assess whether your model's confidence scores actually correlate with accuracy (see the calibration sketch after this list).
- Design for graceful uncertainty: Build UX that surfaces model uncertainty to users rather than hiding it. A system that says "I'm not sure" is more trustworthy than one that confidently lies.
- Read the thread: The Hacker News discussion at the linked comments URL contains several high-quality technical responses worth bookmarking, particularly around the question of whether retrieval-augmented generation meaningfully addresses the structural issue Kingsbury identifies.
- Apply the Jepsen mindset: Treat your ML system as you would a distributed database — assume it will fail in subtle ways under real conditions, and design your architecture to detect and recover from those failures rather than assuming correctness (a minimal guard-rail sketch follows below).
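For the calibration item, here is a minimal sketch using sklearn.calibration. The y_true and y_prob arrays are synthetic stand-ins for your own correctness labels and the model's stated confidence, and they are deliberately generated so that confidence tells you nothing, which is exactly the failure mode you are probing for:

```python
# Hedged sketch of a calibration check with scikit-learn. y_true marks whether
# each answer was actually correct; y_prob is the confidence the model reported.
# The synthetic data is deliberately uncalibrated, so the printout shows what
# confident hallucination looks like in these metrics.
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=1000)       # 1 = answer was actually correct
y_prob = rng.uniform(0.5, 1.0, size=1000)    # model's stated confidence

# Bin predictions by stated confidence and compare to observed accuracy.
frac_correct, mean_confidence = calibration_curve(y_true, y_prob, n_bins=10)
for conf, acc in zip(mean_confidence, frac_correct):
    print(f"claimed ~{conf:.2f} confidence -> actually correct {acc:.2f}")

# One summary number: a lower Brier score means better-calibrated probabilities.
print("Brier score:", brier_score_loss(y_true, y_prob))
```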
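And for the Jepsen-mindset item, one shape this can take is a guard that refuses to pass along answers it cannot independently check. This is a minimal sketch under the assumption that you have some verifier available (a retrieval check, a schema validator, a rules engine); call_model, verify, and the fallback string are all hypothetical stand-ins:

```python
# Minimal "assume it will fail" wrapper: run the model, verify its answer with
# an independent check, and degrade explicitly instead of trusting unverified
# output. call_model and verify are hypothetical stand-ins for your components.
from dataclasses import dataclass
from typing import Callable

@dataclass
class GuardedAnswer:
    text: str
    verified: bool   # did the independent check succeed?
    source: str      # "model" or "fallback"

def guarded_answer(prompt: str,
                   call_model: Callable[[str], str],
                   verify: Callable[[str, str], bool],
                   fallback: str = "I couldn't verify an answer to that.") -> GuardedAnswer:
    answer = call_model(prompt)
    if verify(prompt, answer):
        return GuardedAnswer(answer, verified=True, source="model")
    # Treat an unverifiable answer as a suspected fault: surface it rather
    # than passing an unchecked claim downstream.
    return GuardedAnswer(fallback, verified=False, source="fallback")
```

The specific wrapper matters less than the posture: the architecture treats an unverified model answer the way Jepsen treats an unchecked consistency claim.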
The deeper challenge Kingsbury raises is not technical but epistemological: we are building infrastructure on systems whose relationship to truth is fundamentally probabilistic and preference-shaped. That is not an argument to stop building — but it is an argument to build with eyes open.