AI+DFIR 2026 Challenge: The Good vs The Ugly

To enable data-driven discussions about GenAI in investigations, Brian Carrier is organizing a 4-week challenge with a panel of judges (AI advocates and skeptics), public voting, and sharing all of the results.

The goal here is not to promote or bash any single product or LLM. It’s to share what currently works and what doesn’t.

The basic concept is:
• You submit SANITIZED screen shots of where GenAI was amazing, where it went bad, and where you’re not sure it helped or hurt.
• A panel of industry judges will review for the top 5 amazing ones and the top 5 disasters.
• The public will vote on the final winners.
• The winners get bragging rights!

The judges:
• Me - Heather Barnhart (SANS)
• Alexis Brignoni (LEAPPS)
• Eric Capuano (Digital Defense Institute)
• Brian Carrier (Sleuth Kit Labs - Organizer)
• Filip Stojkovski (BlinkOps)

Submissions:
Submissions are due by May 25, 2026 11:59PM EST. The form is here: https://tally.so/r/vG0rrQ

Submission Requirements
The goal here is for honest and well intentioned submissions from practitioners using data from:
• Actual investigations
• CTFs
• Course data sets
• Realistic test data

Vendors can submit results from their own tools, but they need to disclose they are a vendor!

Example public data sources include:
• https://lnkd.in/ekTnj4pg
• https://ctf.null404.org
• https://cfreds.nist.gov/

Submissions will include:
• Context of the data
• What you prompted
• Screenshots of the results
• Why do you think it’s amazing, a disaster, or a snooze-fest?

The criteria will include:
• Clarity: Is it obvious from the screenshot + context what happened? Can someone learn from it without needing a 10-minute explanation?
• Significance: Did the result provide either a much faster result or a novel finding? Or a really dangerous finding?
• Realistic: Is the data set realistic or is it a bit esoteric?
• Teachability: Would this make someone better at using (or being skeptical of) GenAI in their workflow? Is there a takeaway from it?

Requirements to Win
Submit all of the info on the form (screen shots, context, etc). Make sure to include your email so that we can verify it’s a real submission. We won’t publish this though and results can be posted anonymously.

Schedule
• May 25: submissions are due
• June 8: Public voting begins
• June 15: Public voting ends
• June 18: Winners are announced

If you have any questions, send them to Brian Carrier.
Heather Barnhart’s Post
More Relevant Posts
-
Most VCs wouldn’t touch Anthropic in 2023. Yasmin Razavi did. The Spark Capital partner led a $450M round when Anthropic had no public product, no revenue and a massive capital need. Now the AI giant’s rise has landed her on the Forbes Midas List for the first time. forbes.com/sites/iainmart (Photo: Guerin Blask For Forbes) #ForbesMidas https://lnkd.in/eWb5nAFy
-
What is being rediscovered here is the gap DamageBDD and ECAI have been pointing at from the start. The important thing is not that formal systems have limits. That is old news. The important thing is that reality, behaviour, and structural determination do not wait for formal recognition. Truth does not begin when a proof is written. Behaviour does not become real when a committee, a model, or a paper finally catches up. Proof is inscription. Verification is registration. Formalism arrives after the fact. That gap matters enormously in engineering. Most software organizations still behave as if documentation, tickets, tests, dashboards, and now LLM outputs are the source of truth. They are not. They are delayed artifacts of an underlying behavioural reality. When that gap is unmanaged, delivery decays, drift accumulates, and people start mistaking commentary for control. DamageBDD was built precisely around that seriousness: to close the distance between intended behaviour, executable verification, and accountable delivery. ECAI pushes the same principle further: intelligence should not be framed as probabilistic guessing around symbols, but as deterministic recovery and traversal of structured truth. So yes, it is good to see the discourse catching up. But for us, this is not a philosophical novelty. It is an implementation stance. The gap between what is true and what can be formally stated, between what works and what is merely described, between structure and proof, is exactly where serious systems either fail or become real. That is the gap. And that is why behaviour comes first. #DamageBDD #ECAI #FormalVerification #BDD #Gödel #SoftwareEngineering #DeterministicSystems #Truth #Verification #BehaviourDrivenDevelopment
-
I recorded this six months ago. At the time, the phrase “AI governance” was still mostly being used around policy, safety statements, audits, and compliance language. But the real shift was already underway: AI was moving from response generation to action. That is the line that matters. When an AI system can remember context, call tools, update workflows, retrieve private information, recommend clinical or financial action, or trigger downstream execution, governance has to move to the boundary before consequence. Not just: “Was the system auditable afterward?” But: Was the action authorized before it bound? Was consent present? Was the evidence basis preserved? Was the context admissible? Was refusal possible? Was the human still the authority? That is the core direction behind Genesis AiX and LifeStack. The video is six months old. The execution-boundary problem is now arriving in public view. https://lnkd.in/e6-UrPmZ #AIGovernance #AgenticAI #AIInfrastructure #RuntimeGovernance #ExecutionBoundGovernance #HumanCenteredAI #DigitalTrust #LifeStack
LifeStack Pitch Deck FIXED
-
𝑻𝒐𝒐𝒍 𝒖𝒔𝒆 𝒄𝒉𝒂𝒏𝒈𝒆𝒔 𝒕𝒉𝒆 𝒍𝒊𝒎𝒊𝒕𝒔 𝒐𝒇 𝒂𝒏 𝑳𝑳𝑴

A standalone LLM is limited by:
• training cutoff
• internal memory
• no external engagement

The utilization of tools alters this entirely. Now the model can:
• browse the internet
• search databases
• perform calculations
• execute programs

This indicates that the model no longer has to retain all information. Rather, it learns: when to assign tasks to external parties

This represents a significant change. Ability now arises not solely from:
• parameters
but equally from:
• available tools

Core concept: 𝑺𝒚𝒔𝒕𝒆𝒎𝒔 𝒕𝒉𝒂𝒕 𝒖𝒔𝒆 𝒕𝒐𝒐𝒍𝒔 𝒂𝒓𝒆 𝒎𝒐𝒓𝒆 𝒑𝒐𝒘𝒆𝒓𝒇𝒖𝒍 𝒂𝒔 𝒕𝒉𝒆𝒚 𝒆𝒏𝒉𝒂𝒏𝒄𝒆 𝒄𝒐𝒈𝒏𝒊𝒕𝒊𝒐𝒏 𝒃𝒆𝒚𝒐𝒏𝒅 𝒕𝒉𝒆 𝒎𝒐𝒅𝒆𝒍 𝒂𝒍𝒐𝒏𝒆.

#MachineLearning #AIAgents #LLM #ToolUse #LearningInPublic
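To make the "when to assign tasks" idea concrete, here is a minimal sketch of a tool-dispatch loop. Everything in it is an assumption for illustration: `fake_model` is a stand-in for the LLM's decision step, and the `calculator` and `lookup` tools and their names are hypothetical, not any particular framework's API.

```python
import ast
import operator
from typing import Callable, Dict

# Arithmetic operators the hypothetical calculator tool supports.
_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def calculator(expression: str) -> str:
    """Evaluate a bare arithmetic expression by walking its AST (no eval())."""
    def _eval(node: ast.AST):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](_eval(node.left), _eval(node.right))
        raise ValueError("unsupported expression")

    return str(_eval(ast.parse(expression, mode="eval").body))


def lookup(key: str) -> str:
    """Stand-in for a database search; the data is made up for the example."""
    facts = {"capital_of_france": "Paris"}
    return facts.get(key, "not found")


# Hypothetical tool registry: the model only needs to know the tool names.
TOOLS: Dict[str, Callable[[str], str]] = {"calculator": calculator, "lookup": lookup}


def fake_model(question: str) -> dict:
    """Stand-in for the LLM: decide whether to answer directly or delegate.

    A real model would produce this decision itself; these rules are toy logic
    and only handle bare expressions such as "2 + 3 * 4".
    """
    if question.startswith("lookup "):
        return {"tool": "lookup", "input": question[len("lookup "):]}
    if any(ch.isdigit() for ch in question):
        return {"tool": "calculator", "input": question}
    return {"tool": None, "answer": "I can answer this from memory."}


def answer(question: str) -> str:
    """Route the question to a tool when the model asks for one."""
    decision = fake_model(question)
    if decision["tool"] is None:
        return decision["answer"]
    return TOOLS[decision["tool"]](decision["input"])
```

Calling `answer("2 + 3 * 4")` delegates to the calculator, while `answer("lookup capital_of_france")` goes to the lookup tool. The point of the sketch is that the capability lives in the registry, not in the model's parameters.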
-
We’ve all been told that "bigger is better" in AI. We’ve seen the trillion-parameter models that can write poetry, simulate physics, and pass the bar exam. But when you’re in the trenches of a real enterprise, trying to extract millions of data points from messy PDFs or link entities across a global database, using a massive generative LLM is like trying to perform heart surgery with a sledgehammer. It’s expensive, it’s slow, and honestly, it’s overkill.

BERT Model Family:
• DeBERTa for classification: disentangled attention gives it sharper token-level understanding than BERT.
• GLiNER for entity extraction: zero-shot across any domain, no labeled training data needed.
• CodeBERT for code analysis: clone detection, vulnerability scanning, code search.
• E5 and BGE for retrieval: embeddings built for search, dominating benchmarks.
• ColBERT for scale: late interaction gives you bi-encoder speed with cross-encoder accuracy.
• Longformer for long documents: sparse attention handles full architecture docs without chunking.

Today, we’re talking about the return of the specialist. We’re diving into The Architecture of Understanding: Specialized BERT Encoders for Efficiency. This is the world of "Small AI" doing big work. We’re looking at why a finely tuned encoder can actually outperform a generative giant at a fraction of the cost.

At the center of this movement is GLiNER2. It’s a unified, multi-task framework that doesn't just "chat"; it extracts. Whether it’s Named Entity Recognition (NER), text classification, or complex hierarchical data, GLiNER2 uses a schema-driven interface to get exactly what you need without the "fluff" of a chatbot.

In this episode, we’re breaking down the toolkit that’s making proprietary APIs look like a bad investment:
• FlashDeBERTa: How scaling "disentangled attention" allows you to process massive documents on standard CPU hardware. No expensive H100s required.
• GLinker & RetriCo: The heavy lifters of entity linking and knowledge graph construction. We’ll explain how these encoders turn raw text into queryable, structured intelligence.
• Privacy & Cost: Why "Specialized Encoders" are the ultimate win for companies that can’t send their private data to a third-party API and can’t afford a six-figure monthly compute bill.

It’s time to stop chasing parameters and start chasing performance. Let’s talk about the specialized architecture of understanding. https://lnkd.in/g-c_jMcA

#GLiNER2 #FlashDeBERTa #DisentangledAttention
Stop Overpaying for LLMs: High-Speed Information Extraction with GLiNER2 and FlashDeBERTa
-
We spend our days building with agents, MCP, and everything GenAI. Designing smarter workflows, pushing boundaries, moving fast. And yet… going back to the ABCs still hits differently. Just wrapped up a course on Model Context Protocol. A good reminder that fundamentals don’t get outdated. They get sharper, more relevant, and surprisingly humbling. Sometimes the fastest way forward is a step back to the basics. https://lnkd.in/gjXckwjW
-
Everybody is selling LLMs. Very few are proving alignment. In regulated systems, “closest paragraph” is not governance. We reduced document decoding from 35s → <2s by replacing most of the LLM path with deterministic document-to-document alignment: • deterministic retrieval • probabilistic retrieval • inverted indexes • no model broadcast • evidence-first alignment Document A → Verified Alignment ← Document B. AI isn’t the model. Proof of alignment is the product.
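The post does not describe its implementation, so the following is only a generic sketch of one item from that list, an inverted index used for deterministic lookup. The function names and sample paragraphs are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Set


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenizer; a real system would normalize further."""
    return text.lower().split()


def build_index(paragraphs: List[str]) -> Dict[str, Set[int]]:
    """Map each token to the set of paragraph ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for paragraph_id, paragraph in enumerate(paragraphs):
        for token in tokenize(paragraph):
            index[token].add(paragraph_id)
    return index


def exact_matches(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return ids of paragraphs containing every query token.

    The same index and query always give the same answer, which is the
    "deterministic" part; no model is involved.
    """
    tokens = tokenize(query)
    if not tokens:
        return []
    postings = [index.get(token, set()) for token in tokens]
    return sorted(set.intersection(*postings))


# Hypothetical paragraphs standing in for "Document B".
PARAGRAPHS = [
    "Records must be retained for seven years",
    "Access logs are retained for ninety days",
    "Encryption keys rotate every ninety days",
]
INDEX = build_index(PARAGRAPHS)
```

A query either matches a paragraph or it does not, which makes the result easy to audit. Ranking, fuzzy matching, and alignment evidence would sit on top of a structure like this.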
-
Just shipped a new capability in AICP: models now perform self-evaluation after every debate round. Each agent scores itself on: • Accuracy • Honesty • Clarity • Confidence Then AICP updates reputation dynamically based on performance and consistency across the debate. What’s interesting is seeing weaker models expose uncertainty while stronger models converge toward higher reputation over time — creating a more reliable multi-agent consensus system. This turns local LLM orchestration into something closer to autonomous scientific peer review. Running locally with Mistral, Llama 3, Phi-3, Gemma, DeepSeek and more. Project: https://lnkd.in/eYbUnW4z Platform: https://myaicp.dev/
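The post does not say how AICP computes reputation, so this is not its algorithm. As a rough sketch of the general idea, one could average the four self-scores each round and fold them into a running reputation with an exponential moving average. The 0 to 1 score scale, the `alpha` weight, and all function names are assumptions.

```python
from statistics import mean
from typing import Dict

# The four criteria named in the post.
CRITERIA = ("accuracy", "honesty", "clarity", "confidence")


def round_score(self_eval: Dict[str, float]) -> float:
    """Average the four self-evaluation scores (assumed to be in [0, 1])."""
    return mean(self_eval[criterion] for criterion in CRITERIA)


def update_reputation(current: float, self_eval: Dict[str, float], alpha: float = 0.3) -> float:
    """Blend the latest round into the running reputation.

    Hypothetical rule: an exponential moving average, where alpha controls
    how strongly the newest round outweighs the agent's history.
    """
    return (1 - alpha) * current + alpha * round_score(self_eval)


def run_debate(initial: float, rounds: list, alpha: float = 0.3) -> float:
    """Apply the update once per debate round and return the final reputation."""
    reputation = initial
    for self_eval in rounds:
        reputation = update_reputation(reputation, self_eval, alpha)
    return reputation
```

With this rule an agent that scores consistently high converges toward a high reputation over several rounds, while a single strong round moves it only part of the way. A real system would also need an external check, since self-reported scores alone can be gamed.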
-
Maybe too early in the week for kicking up flames, but I'm fully caffeinated and willing to take the heat (Moudy Elbayadi, Ph.D.). I have to say, hosting an open weight local LLM (vs. API key for Frontier Cloud model) is starting to get extremely interesting! As I've been tinkering around with OpenClaw + llama.cpp, Pandora's box popped open and the latest local LLM models posted on Hugging Face became a time suck I wasn't prepared for. I don't think many would argue that, in the past, hosting a local LLM was the equivalent of playing with a tinker toy. The models just didn't provide the speed or accuracy for anything particularly useful. Well folks, the past few weeks have been eye-opening to say the least! First off, I'm not going to sit here and say running a local LLM is a good idea for most, or even many organizations. However, what is a good idea is to stay aligned and informed about how fast these open weight models are progressing! I've had numerous conversations with customers recently about using AI to secure their environment, or enhancing their CI/CD pipelines. Regardless of the tools in discussion, a question that always pops up is the token burn rate. Coincidentally, GitHub Copilot just announced they are starting to charge token fees. Tokens are becoming the new currency, and there are ways to optimize the token economics. Three models in particular were recently released: Qwen 3.6 35b, Gemma 4 31b, and the latest Deepseek V4 pro with a staggering 1.6T parameters! All three models have been proven to provide extremely capable accuracy in code generation and reasoning, in some cases even surpassing the Frontier models. The first two can easily be hosted locally with a modest hardware footprint, the latter requiring a more serious 1TB of GPU RAM. 
Whether it's data sovereignty or an ROI conversation (I built a dynamic cost calculator for local vs cloud AI if anyone is interested), local LLMs have the potential to handle as much as 80% of AI workload in the way of scaffolding new models and pipelines, leaving the remainder to models like Claude Opus for the highly complex workflow bits. Amazing times we are living in right now! https://lnkd.in/gfgNVzZk
-
Claude just ran a parallel version of my system alongside the original — in the background, silently, without touching production. I described the idea: run a learning variant next to the real thing, watch how the two diverge, keep everything safe while the new version finds its footing. Claude built the whole infrastructure — the pairing logic, the safety guards, the real-time comparison feed, the training data to teach the model what it had just built. The part that still gets me: when Claude found a flaw in its own training examples — a hallucinated tool shape that would have poisoned the next model — it caught it, archived the bad data, and kicked off a clean retrain. That loop closed without me. Look into Claude Code sub-agents if you're building systems that need to learn from themselves. It's a different category of tool. #Claude #ClaudeCode #AIAgents #BuildingWithAI #BuilderJourney
-
We need more BAD submissions. If you have seen GenAI do a bad job, show us! This is also a learning exercise.