My next evening adventure was performing brain surgery on Llama 3.2 3B. The goal was to replicate a technique called DoLa (Decoding by Contrasting Layers). The hypothesis is simple but profound: LLMs "think" in stages. Early layers handle linguistic reflexes and social posturing (the "vibes"), while later layers handle semantic reasoning and fact retrieval. If you mathematically subtract the "early brain" from the "late brain" during generation (in practice, by contrasting the next-token log-probabilities read out from an early layer and from the final layer), you theoretically cancel out the hallucinations and isolate the raw signal. I ran this locally to see if it holds up.
The Experiment: I asked the model: "What color is the black box on an airplane?"
Standard Decoding: "It is usually black." The mechanism: The model’s early layers saw "Black Box" and auto-completed the linguistic association. It prioritized the name over the object.
DoLa Decoding: "It is typically a bright orange color." The mechanism: By subtracting the reflexive layer (Layer 16) from the final layer (Layer 28), the linguistic noise was filtered out, leaving only the factual residue.
The Anatomy of an LLM: This points to a structural reality about how these models process information:
Layers 0–16 (The Persona): This is where grammar, shallow associations, and, crucially, RLHF safety training live. This part of the model knows how to be a polite, safe "Assistant."
Layers 17–28 (The Intelligence): This is where fact retrieval and logic seem to concentrate.
When I used DoLa to suppress the early layers, I didn't just fix the hallucination. I lobotomized the "Customer Service" persona. The model became blunt, decisive, and, in one case, dangerously unsafe: When asked about handling wild birds, the standard model gave a safety warning. The DoLa model, stripped of its "reflexive" guardrails, cheerfully listed the benefits of "bonding" with wildlife.
Why not use this everywhere? If DoLa cures hallucinations, why isn't it the default?
The Safety Tax: As my experiment showed, safety filters are often implemented as "reflexes" in the early layers. Bypassing them for truth also bypasses them for harm.
Inference Cost: You have to project hidden states to the vocabulary twice (or more) for every single token generated. In high-throughput production systems, that compute overhead is non-trivial.
The "Confidence" Trap: DoLa forces the model to deviate from its base probability distribution. If the model genuinely doesn't know the answer, DoLa might force it to hallucinate a "confident" wrong answer rather than admitting ignorance.
We often talk about "Alignment" as a deep moral framework, but this suggests it's often just a shallow coat of paint on top of the raw weights. Scratch the surface, and the raw model is still there.
(I removed some of the drier factual questions from the data set because standard decoding and DoLa gave the same answer.)
#AI #MechanisticInterpretability #LLM #Llama3 #Research
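The contrast step in the post above can be sketched in a few lines. This is a minimal illustration, not the author's code: it assumes you already have next-token logits read out from the final layer and from one fixed early layer (the post uses Layer 28 and Layer 16), and the vocabulary and numbers below are invented for the example. The `alpha` cutoff follows the plausibility constraint described for DoLa; the published method also describes choosing the early layer dynamically, which this sketch skips.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def dola_scores(final_logits, early_logits, alpha=0.1):
    """Contrast final-layer and early-layer next-token distributions.

    Tokens whose final-layer probability is below alpha * (max probability)
    are masked out, so an implausible token cannot win just because the
    early layer disliked it even more.
    """
    lp_final = log_softmax(final_logits)
    lp_early = log_softmax(early_logits)
    cutoff = max(lp_final) + math.log(alpha)
    return [
        f - e if f >= cutoff else float("-inf")
        for f, e in zip(lp_final, lp_early)
    ]

def argmax(xs):
    return max(range(len(xs)), key=lambda i: xs[i])

# Toy numbers (invented for illustration, not measured from Llama 3.2 3B).
vocab = ["black", "orange", "banana"]
final_logits = [2.0, 1.8, -3.0]   # final layer: "black" narrowly ahead
early_logits = [2.5, 0.0, -1.0]   # early layer: strong "black" reflex

greedy_token = vocab[argmax(final_logits)]
scores = dola_scores(final_logits, early_logits)
dola_token = vocab[argmax(scores)]
print(greedy_token, dola_token)
```

Note that both sets of logits require a projection to the vocabulary, which is the inference cost the post mentions.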
Understanding Neural Layer Behavior in Language Models
Summary
Understanding neural layer behavior in language models means exploring how different layers inside AI systems process and transform information—revealing why models sometimes generate accurate answers, make mistakes, or display certain reasoning styles. Neural layers can specialize in tasks, such as matching patterns, reasoning, or integrating external tools, and their internal dynamics offer clues for improving AI reliability and transparency.
- Explore layer functions: Look into how early, middle, and late layers contribute to grammar, basic associations, deeper reasoning, or fact retrieval to better understand a model’s response.
- Monitor neuron activity: Pay attention to specialized neuron groups, like those linked to hallucinations or truthfulness, to identify when and why an AI may produce unreliable outputs.
- Inspect attention flows: Analyze the way attention heads and tokens interact across layers, as this can reveal how models prioritize information and process relevance during tasks like document ranking or mathematical reasoning.
-
A lot of people use ChatGPT or Claude every day without realizing that very different kinds of reasoning systems can be running underneath. They type a question, get an answer, and move on. But if you are building with AI, understanding what is happening under the hood actually matters. A simple way to think about it is in three layers, which you can see in this diagram. Layer 1: Statistical Pattern Matching (The Transformer Foundation) At the base level, LLMs rely on the transformer architecture and attention mechanisms. Tokens attend to each other through attention weights, allowing the model to capture relationships between words and generate coherent text. This layer is extremely good at pattern recognition and language generation. It powers fast responses and works well for tasks like summarization, translation, and straightforward questions. But pattern recognition alone does not guarantee deep reasoning. Layer 2: Explicit Reasoning Techniques (The “Thinking” Layer) To improve reasoning, newer systems add techniques like Chain-of-Thought prompting, process supervision using reward models, and exploration of multiple reasoning paths. Instead of jumping directly to an answer, the model breaks problems into intermediate steps and evaluates them along the way. This structured reasoning process significantly improves performance on complex tasks like math, logic, and multi-step analysis. Layer 3: Hybrid Reasoning Systems (Neuro-Symbolic + Tools) The most advanced systems combine LLM reasoning with external tools and symbolic methods. The model may call calculators, execute code, query knowledge graphs, or translate language into formal logic that a symbolic solver can evaluate. These hybrid architectures allow AI systems to handle problems that pure language models struggle with. The key takeaway is this: the same model can behave very differently depending on how much reasoning, compute, and tooling you allow it to use. 
If you are building AI systems, the question is not just which model you choose. It is how you design the reasoning pipeline around it. That is where most of the performance gains actually come from.
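As a rough illustration of the "Layer 3" idea above, here is a minimal sketch of a pipeline that sends pure arithmetic to a calculator tool and everything else to the model. The names `answer`, `calculator`, and `stub_llm` are hypothetical, and `stub_llm` merely stands in for a real model call; real systems usually let the model itself decide when to call a tool, which this sketch replaces with a simple pattern check.

```python
import ast
import operator
import re

_OPS = {
    ast.Add: operator.add, ast.Sub: operator.sub,
    ast.Mult: operator.mul, ast.Div: operator.truediv,
}

def calculator(expr):
    """Evaluate +, -, *, / over numbers by walking the parsed AST (no eval)."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and isinstance(node.op, ast.USub):
            return -walk(node.operand)
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

# Only digits, whitespace, decimal points, operators, and parentheses.
_ARITHMETIC = re.compile(r"^[\d\s.+\-*/()]+$")

def answer(question, llm):
    """Route pure arithmetic to the calculator, everything else to the model."""
    text = question.strip()
    if _ARITHMETIC.match(text):
        return str(calculator(text))
    return llm(text)

def stub_llm(prompt):
    # Hypothetical stand-in for a real model call.
    return f"[model answer to: {prompt}]"

print(answer("12 * (7 + 5)", stub_llm))
print(answer("Summarize this memo", stub_llm))
```

The same model sits behind both calls; only the pipeline around it changes, which is the point the post is making.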
-
Breaking Down Cross-Encoders: New Research Reveals How Neural IR Models Actually "Think" Fascinating new research from Sorbonne Université and Sinequa provides unprecedented insights into how cross-encoders perform document ranking - and the findings challenge some existing assumptions about these powerful neural IR models. Key Technical Discoveries: The study reveals that cross-encoders process relevance through distinct stages. In early-to-middle layers, the model performs lexical matching between query and document terms. As processing moves to deeper layers, the model shifts to semantic matching using contextualized token representations. Under the Hood Mechanics: The researchers identified specific attention heads that specialize in query-document matching. Through ablation studies on TREC Deep Learning tracks (2019-2022), they found that certain attention heads create dedicated subspaces in their Query-Key matrices specifically for detecting cross-document interactions. Interestingly, the "[CLS]" and "[SEP]" tokens play crucial "no-op" roles - serving as fallback attention targets when specialized heads can't perform their matching functions. The "[CLS]" token primarily aggregates relevance signals in later layers (after layer 16), contradicting some previous findings. Challenging Conventional Wisdom: The research questions earlier conclusions about document-to-query information transfers, suggesting query-to-document transfers are more critical for performance. When these directional attention flows were ablated, the model's nDCG@10 performance dropped significantly. Practical Implications: Understanding these internal mechanisms opens paths for more interpretable IR systems and could inform better architecture designs. The identification of matching-specialized attention heads provides a foundation for developing more transparent neural ranking models. 
This work demonstrates that sophisticated interpretability doesn't always require complex mechanistic approaches - sometimes straightforward attention analysis can reveal fundamental insights about how these "black box" models actually process relevance.
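One simple way to picture a directional attention ablation like the one described above is to mask attention scores between the two segments before the softmax. The sketch below is an illustration under stated assumptions, not the paper's procedure: the scores are toy numbers, and "query-to-document transfer" is taken here to mean document positions reading from query positions.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]  # exp(-inf) is 0.0 for masked slots
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(scores, segments, block=None):
    """Row-wise softmax over raw attention scores with optional ablation.

    scores[i][j] is the score for position i attending to position j.
    segments[i] is "q" (query token) or "d" (document token).
    block=("d", "q") stops document positions from reading query positions.
    """
    weights = []
    for i, row in enumerate(scores):
        masked = [
            float("-inf") if block == (segments[i], segments[j]) else s
            for j, s in enumerate(row)
        ]
        weights.append(softmax(masked))
    return weights

# Toy input: two query tokens followed by two document tokens.
segments = ["q", "q", "d", "d"]
scores = [
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],  # first document token leans on the query
    [0.0, 0.0, 0.0, 0.0],
]

before = attention_weights(scores, segments)
after = attention_weights(scores, segments, block=("d", "q"))
print(before[2])
print(after[2])
```

In a real study the ablated model would then be re-scored on a ranking metric such as nDCG@10 to see how much that direction of information flow matters.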
-
We may finally know one possible reason behind LLM hallucinations, and even where they arise inside the model. I just published a deep-dive on the latest research into Hallucination Neurons (H-Neurons) in large language models. 🔍 These are tiny circuits in GPT-style models that light up when the AI starts making things up. It turns out that fewer than 0.1% of the neurons in an LLM can predict when it’s about to hallucinate a fact! In the article, I explain how researchers identified and manipulated these neurons: By boosting the activity of H-Neurons, the AI became more “compliant” but also more prone to spout incorrect info (it would answer even with wrong or unsafe content). By dialing them down, the AI got noticeably more factual and cautious, avoiding those confident lies. Perhaps the most intriguing part: these hallucination-related neurons seem to originate in the base training of the model, not just from fine-tuning. In other words, the seeds of AI hallucination are sown during the initial training on internet text. This suggests that to truly solve hallucinations, we might need to rethink how we train our models (beyond just adding post-hoc fixes). Why does this matter? If we can pinpoint the “hallucination switches” in AI, we can build more trustworthy systems: ✅ Detection: Imagine real-time hallucination alerts based on the model’s own neuron activations, useful for critical applications like healthcare or finance. ✅ Mitigation: We could design models that self-regulate these neurons (e.g. suppress them when unsure) to avoid misleading users, all without killing the creativity when it can answer correctly. The research also connects to work on “truth neurons”, circuits that do the opposite (promote truthful responses), and to how balancing these factors is key to AI alignment. If you’re interested in AI reliability, interpretability, or are considering deploying LLMs in your business, give the full article a read. 
It’s a fascinating peek into the brain of GPT-like models and how we might cure their “hallucination habit.” #AI #LLM #MachineLearning #AIresearch #Hallucinations #TrustworthyAI
-
Neural networks do math by rotating shapes. We looked inside a language model to see how it performs addition, and found a hidden geometric calculator that it built for itself during training — and that it uses flexibly across a range of related tasks. Prior work shows that language models encode numbers as positions on several circles in parallel. This sounds alien, but it’s actually a Fourier decomposition – a classic mathematical technique. To add two numbers, the calculator breaks the problem into smaller ones, one per circle, and solves them in parallel. Each circle gets its own little addition problem. The same mechanism handles “7 + 9”, “two days after Friday”, and “six months after August”. Llama built this mechanism in training (out of just 28 neurons!) and reuses it with striking elegance. How do we know the model is really using this geometric calculator? We manipulate the circles inside the network directly and watch the answer change in predictable ways. This process of steering the network is what helps us verify the causal role of these circles. This is a glimpse of how neural geometry can lead us to discover mechanisms we'd otherwise miss. Understanding this machinery paves the way for better debugging, control, and design of AI. Kudos to Sheridan Feucht, Tal Haklay, Usha Bhalla, Daniel Wurgaft, Can Rager, Raphaël Sarfati, Jack Merullo, Tom McGrath, Owen Lewis, Ekdeep Singh Lubana, Thomas Fel, and Atticus Geiger for this very cool piece of Goodfire research!
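To make the "numbers on circles" picture concrete, here is a small sketch of addition by rotation. It is a toy under stated assumptions, not the mechanism extracted from Llama: numbers are encoded as points on unit circles, adding means rotating, and the periods `(2, 5, 10, 100)` used for plain integers are an illustrative choice.

```python
import cmath
import math

def encode(n, period):
    """Place integer n on the unit circle with the given period."""
    return cmath.exp(2j * math.pi * n / period)

def add_on_circle(a, b, period):
    """Add b to a by rotating a's point, then read the result off the angle."""
    point = encode(a, period) * encode(b, period)  # multiplying adds the angles
    angle = cmath.phase(point) % (2 * math.pi)
    return round(angle * period / (2 * math.pi)) % period

def add_with_circles(a, b, periods=(2, 5, 10, 100), limit=100):
    """Solve one small addition per circle, then pick the integer in
    range(limit) that agrees best with all of the rotated points."""
    rotated = [encode(a, p) * encode(b, p) for p in periods]

    def agreement(n):
        # Each term is the cosine of the angular gap on one circle.
        return sum((encode(n, p) * r.conjugate()).real
                   for p, r in zip(periods, rotated))

    return max(range(limit), key=agreement)

days = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
months = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

two_days_after_friday = days[add_on_circle(days.index("Fri"), 2, 7)]
six_months_after_august = months[add_on_circle(months.index("Aug"), 6, 12)]
seven_plus_nine = add_with_circles(7, 9)
print(two_days_after_friday, six_months_after_august, seven_plus_nine)
```

The same rotation handles all three problems; only the period of the circle changes, which mirrors the reuse the post describes.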
-
Nobody knows how Large Language Models manage to display their impressive capabilities - they are incredibly powerful black boxes. Anthropic has now released methods to peek under the hood. It's an exciting step towards more interpretable LLMs. The transformer architecture contains very few building blocks, making it even harder to understand how they lead to these advanced, emergent capabilities. Here are two example challenges: 1) Polysemantic neurons: The LLM needs to represent more concepts than it has neurons available. As a result, one neuron represents parts of many unrelated concepts, making it hard to understand what it means when it fires. 2) Connectivity: Information passes sequentially through the transformer layers. It's unclear how concepts from earlier layers influence later ones. Anthropic's approach to untangle this is to create mechanistic interpretations of the transformer components in human-understandable language. They identify interpretable building blocks (features) the model uses, then describe processes (circuits) by which features interact to produce outputs. Concretely, Anthropic creates a second model (transcoder) that mimics the original LLM but with key differences: 1) More neurons: More neurons in specific layers allow separate concept representation in individual "features." 2) Direct connections: Transcoder layers receive direct input from all earlier transcoder layers, passing features directly between layers. 3) Sparsity penalty: The loss penalizes activating too many features per layer. This encourages assigning information across independent features instead of creating concept superpositions in single neurons. Anthropic provides interesting insights based on this method. For example: LLMs produce coherent output over thousands of tokens while only predicting the next token. But how do they think ahead? 
The creation of poems illustrates this particularly well: If the first line is "He saw a carrot and had to grab it," the next sentence must rhyme with "it." Indeed, the model continues with "His hunger was like a starving rabbit.", and Anthropic's transcoder model shows how the "rabbit" concept builds up way before the word itself appears. What are problems with/questions around this new approach? 1) The transcoder isn't the original model. Its explanations might not apply to the LLM. This is a well-known problem with "surrogate models" throughout ML. 2) Why not train the transcoder directly? Sparsity and connectivity make it much harder to train and lower its accuracy. 3) Feature graphs are heavily pruned by humans, risking biased review and anthropomorphizing by the people analyzing the results. Despite this, it's exciting research with well-written, interactive papers and open-sourced analysis tools (see comments). The mechanistic approach to LLM interpretability is hard. But Anthropic has made great progress and I'm excited to see where the community goes next! #ai #genai #llm
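The sparsity penalty mentioned above can be illustrated with a toy loss. This is a minimal sketch that assumes a plain L1 penalty on feature activations as a common stand-in; the exact penalty used in the published work may differ, and `transcoder_loss` and `sparsity_weight` are hypothetical names.

```python
def transcoder_loss(target, reconstruction, feature_activations, sparsity_weight=0.01):
    """Reconstruction error plus an L1 penalty on feature activations.

    target / reconstruction: the original layer output and the surrogate's
    approximation of it. feature_activations: the surrogate's (non-negative)
    feature values for this input.
    """
    mse = sum((t - r) ** 2 for t, r in zip(target, reconstruction)) / len(target)
    l1 = sum(abs(a) for a in feature_activations)
    return mse + sparsity_weight * l1

target = [1.0, 2.0]
reconstruction = [0.9, 2.1]  # same reconstruction quality in both cases

dense_features = [0.5] * 8              # information smeared over many features
sparse_features = [2.0] + [0.0] * 7     # one feature carries it

dense_loss = transcoder_loss(target, reconstruction, dense_features)
sparse_loss = transcoder_loss(target, reconstruction, sparse_features)
print(dense_loss, sparse_loss)
```

With equal reconstruction error, the sparse solution gets the lower loss, which is the pressure that pushes concepts into separate features.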
-
🧬 The concept of Neural DNA (nDNA), initially proposed by Dr. Amitava Das, opens up a new avenue for interpreting LLM behavior. nDNA treats an LLM as something we can sequence (just like DNA sequencing) by reading its internal signals -- its lineage and layer-level changes. It captures a model’s lineage, tuning-induced mutations, and functional expression so we can compare LLM families, spot drift, and explain why similar benchmarks behave differently across a range of LLMs. In practice, nDNA looks for signals such as: (i) layer shifts, (ii) MoE routing patterns, and (iii) controlled probes. It lets us audit change over time and pick models for tasks with evidence. Collaborating with Dr. Amitava Das, and his Pragya group (https://pragyaai.github.io) at the Department of CSIS BITS Pilani Goa Campus, we've been building and testing the concept of nDNA to understand LLM behavior across major model families: - Llama tends to refine mid-to-late layers while early layers remain stable; - Mistral localizes expertise while the early layers remain steady; - Gemma aligns with light-touch parameter changes; - Qwen rewrites more in its middle layers, especially in multilingual scenarios; - DeepSeek adds dialogue capability in mid layers, while preserving base geometry. One of the other distinct advantages of nDNA is the ability to identify model collapse. On the surface you might see a slow loss of creativity and nuance in a model’s behavior. Inside, through the nDNA lens, the model’s internal landscape flattens -- fewer distinct patterns light up -- much like reduced neural diversity in biology. With nDNA, that shift becomes measurable over time, which makes collapse diagnosable and monitorable. Dr. Amitava Das's novel thought process and insistence on clear hypotheses has led to the inception of nDNA as a practical lens for AI -- one that detects drift before deployment, keeps model lineages auditable, and matches models to tasks based on evidence rather than guesswork. 
Taken together, these nDNA lenses reframe how we interpret LLM behavior. Our other recent works, DPO-Kernels (ACL 2025; https://lnkd.in/gXGQvXwX), Yin-Yang Align (ACL 2025; https://lnkd.in/gWVCxG8B), QuickSilver (https://lnkd.in/gpZQKMmP), and the Counter Turing Test (EMNLP 2023 Outstanding Paper Award; https://lnkd.in/gaVHeXgJ), share a similar first-principle approach to LLM refinement/understanding. If nDNA is something you'd like to explore further -- we'll be launching a website for nDNA soon, stay tuned! #nDNA #AI #FoundationModels #ModelInterpretability
-
Yesterday I forked the LM Transparency Tool from Meta, open source software to explore data flow in transformer language models like GPT. After tweaking some code and adding a small feature for custom text input, I got it running on my machine. The results felt like x-ray vision for models. Instead of dry logs, you see: ≫ Neuron breakdown: For "the quick brown fox jumps over the lazy", you can inspect which neurons activate as the model decides on "dog," uncovering patterns for logic, animals, or language structure. ≫ Attention heatmaps: Every word shows which earlier tokens influence it. "Dog" highlights how "lazy" and "fox" become crucial in the model’s prediction logic. ≫ Contribution graphs: At each step, you see which layers and attention heads shape the choice of "dog" over anything else. Influence is visualized as you step through the sequence. ≫ Probability dynamics: As you type, you watch predictions shift. In the same sentence, "dog" climbs to the top choice, all visible in interactive tables or time series. This is a clear way to see how answers take shape and what drives a model’s reasoning. For explainability and debugging, it can be very helpful. Try your own prompts. Each exploration is a deep dive into how AI thinks. #transformers #LLM #explainableAI #MetaAI #GenAI
-
I'm excited to share some recent findings that may reshape how we understand large language models like BERT and GPT. My research has uncovered a striking pattern: transformer attention mechanisms follow a precise power-law scaling between critical temperature and normalized entropy (Tc ∝ S^(-0.83)), while MLP layers don't exhibit this behavior. This isn't just a statistical curiosity - it echoes fundamental principles from theoretical physics, specifically coupled Ricci flow and holographic duality. This thermodynamic lens reveals transformers may inadvertently implement deep principles from physics: 1. Dual Computational Regimes: Attention functions as a system at criticality (balanced between order and chaos), while MLPs operate under different mathematical principles - mirroring physical dualities where the same information is represented differently across scales. 2. Natural Renormalization: Rather than simply matching patterns, transformers may be implementing computational analogs of how physical systems organize information hierarchically. 3. Geometric Information Processing: The critical behavior suggests transformers perform operations similar to geometric flows on manifolds of meaning, not just vector manipulations. 4. Phase Transitions in Computation: The power law relationship indicates transformer computation undergoes phase transitions, potentially explaining their emergent capabilities. This isn't merely academic - it could explain why transformers generalize despite overparameterization, how they extract hierarchical structure from raw data, and why scaling follows predictable patterns across architectures. Most importantly, it offers a path beyond empirical scaling laws toward a deeper theoretical understanding of neural computation.
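A power-law claim such as Tc ∝ S^(-0.83) is usually checked by fitting a straight line in log-log space, where the slope is the exponent. The sketch below shows only that fitting step. The post does not say how Tc or S are measured, so the data points here are synthetic, generated from the stated law with an arbitrary prefactor of 3.0 purely to confirm the fit recovers the exponent.

```python
import math

def fit_power_law(xs, ys):
    """Least-squares line through (log x, log y).

    Returns (exponent, prefactor) for the model y = prefactor * x ** exponent.
    """
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    n = len(lx)
    mean_x, mean_y = sum(lx) / n, sum(ly) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
             / sum((a - mean_x) ** 2 for a in lx))
    return slope, math.exp(mean_y - slope * mean_x)

# Synthetic points generated from the stated law, only to check the fit.
entropy = [0.2, 0.4, 0.6, 0.8, 1.0]
t_crit = [3.0 * s ** -0.83 for s in entropy]

exponent, prefactor = fit_power_law(entropy, t_crit)
print(exponent, prefactor)
```

On real measurements the points would scatter around the line, and the quality of the fit (and whether MLP layers fail to follow it) is what would support or undercut the claim.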
-
If you've ever wondered how AI language models become what they are — this might be the most beautiful explanation yet. A new paper by mathematician Daniel Murfet traces what he calls the embryology of language models. Not how they behave once trained. But how they develop during training — layer by layer, like living systems growing from chaos into structure. And the images are wild. He describes these models as "baby serpents" — messy, unformed, twisting through data. But as training progresses, they learn to "move" through language with surprising grace. His most striking observation? They grow a "spacing fin." Yes — a literal structure, purely to navigate text formatting correctly. It sounds poetic. But it's real. And it's unprogrammed. Why this is amazing is that regardless of where you start in the model, the same shape emerges — that's what gives you a clue that there is this beautiful structure to language — any language as we know it. No one tells the model to do this. It emerges naturally, because it helps the model survive the statistical gauntlet of next-token prediction. Here's what's fascinating: this is emergence in action. The big secret in the AI world is that while these tools are powerful, we don't actually understand how most of their behavior comes about. Why do they become smarter as we add more data? Why can they suddenly pick up Persian when trained only in English? Why do capabilities we never programmed just... appear? This is emergence — and it's both thrilling and humbling. Because what Murfet's research shows is that there's actual anatomy forming inside these models. Not just brute force memorization, but structures that take shape for reasons that go beyond what we explicitly taught them. It gives us a rare window into what's actually happening inside these black boxes. Not just what they output — but how they evolve to represent structure, syntax, and (possibly) semantics. That matters. 
Because if we're going to align or govern these systems responsibly, we need to understand what they are, not just what they do. As someone who's worked with these models for years, I find this kind of analysis far more hopeful than doomscrolling through AGI takes. It reminds me that we can interrogate these systems. That transparency is still possible — if we learn to look at the right layers, during the right phases of training. But it also makes me wonder: if AI can develop structures we never programmed, what else might be forming in there that we haven't discovered yet? Curious to hear your thoughts: Does this kind of emergent behavior make you more confident in LLMs — or more cautious? #AI #LLMs #Language