67

Update: We've deployed machine learning auto-flagging

Given the positive reception, we've activated machine learning anti-spam's auto-flagging feature, as detailed below. Staff will be monitoring its work very closely over the next few weeks to ensure accuracy, especially with regard to automatic binding flags that result in instant deletion.

We're excited to see how big of an impact this feature makes on the site's spam influx, and we'll be back to share the data on the ML model's effectiveness in a couple months.

On behalf of the moderation tooling team, thank you all for your feedback and analysis of our work; your voice is an essential part of the work that we do!


Original: Should Stack Overflow use machine learning to flag spam automatically?

In the spirit of reducing moderator workload, we’ve started using a machine learning model to automatically identify, flag, and delete spam. So far it’s been extremely effective on Super User, having flagged 80% of all spam on that site since it was activated on December 10th, 2025. After improving the model and running it in evaluation mode over our year-end break, we’d like you to review the data on its accuracy so that we can earn your trust and activate its flagging capability on Stack Overflow as well.

How do the anti-spam capabilities work?

When a post is created or a post author edits their post, our systems subject it to a spam evaluation. We currently subject posts to two checks: a similarity evaluation and a pass through our ML model. The ML model is trained on several cumulative years’ worth of deleted spam from across the network, and yields a confidence score between 0% and 100%. At a high level of “spam confidence”, we’ll raise a non-binding spam flag. This non-binding flag counts towards the four spam flags that are required for automatic deletion as spam. At a very high “spam confidence” level, we will automatically delete the post with a binding spam flag.
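For illustration, here’s a rough sketch of that two-tier decision in code. The thresholds and the toy “model” below are invented placeholders for the sake of the example, not our actual values or implementation:

```python
# Illustrative sketch of the two-tier flagging decision described above.
# The thresholds and the stand-in "model" are placeholders, not the real internals.

NON_BINDING_THRESHOLD = 0.90  # assumed "high" spam-confidence cutoff
BINDING_THRESHOLD = 0.99      # assumed "very high" spam-confidence cutoff


def ml_spam_confidence(post_text: str) -> float:
    """Stand-in for the ML model: returns a spam confidence between 0.0 and 1.0."""
    suspicious_terms = ("buy now", "whatsapp", "limited offer")
    hits = sum(term in post_text.lower() for term in suspicious_terms)
    return hits / len(suspicious_terms)


def decide_flag(post_text: str) -> str:
    """Map the model's confidence onto the flag actions described in this post."""
    confidence = ml_spam_confidence(post_text)
    if confidence >= BINDING_THRESHOLD:
        return "binding flag: delete the post and attach the explanatory post notice"
    if confidence >= NON_BINDING_THRESHOLD:
        return "non-binding flag: counts toward the four flags needed for deletion"
    return "no flag raised"


print(decide_flag("Buy now!! Limited offer - message us on WhatsApp"))  # binding flag
```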

Automatic non-binding spam flags do not affect the post’s score and are largely invisible to non-moderators. They can be dismissed by moderators like any other flag if they’re found to be unhelpful. Automatic binding spam flags attach a unique post notice, which links to an explanatory help center article. Our hope with this help center article is that legitimate posts have a visible and easy path to undeletion with the help of a handling moderator.

For more detailed information about how ML anti-spam operates, please review the network-wide announcement.

How effective would ML anti-spam be on Stack Overflow?

We ran our latest ML model in silent evaluation mode on Stack Overflow through the holidays, and the results are quite impressive. Between December 19th, 2025 and January 9th, 2026, there were 731 instances of spam on Stack Overflow. ML anti-spam would’ve identified and flagged 468 of them, with 24 false positive non-binding flags and 7 false positive binding flags. This represents a 94% accuracy rating, and if flagging were enabled, we would have flagged 64% of all spam that was posted, with 80% of caught spam being instantly deleted by a binding flag.

Here’s the ML model’s theoretical flagging summary data in a table:

# Total Spam | # Autoflagged (%) | Non-binding TP flags | Binding TP flags
731          | 468 (64%)         | 94                   | 374

It is worth noting that this system will work alongside Similarity anti-spam, which is already flagging on Stack Overflow, so reviewing that anti-spam detector’s effectiveness over the same time period feels appropriate. Since December 10th, 2025, Similarity anti-spam caught 100 spam posts, or 13.7% of spam on Stack Overflow, with 14 false positive non-binding flags and no false positive binding flags. This represents an 87% accuracy rating. While there is some detection overlap here, it’s clear that ML anti-spam detects significantly more spam and is more accurate when doing so.
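For clarity, the percentages above follow directly from the raw counts already given; here is the arithmetic spelled out (no new data, just the figures quoted above):

```python
# Arithmetic behind the percentages quoted above, using only the counts given.
total_spam = 731
ml_non_binding_tp, ml_binding_tp = 94, 374   # ML true positive flags
ml_non_binding_fp, ml_binding_fp = 24, 7     # ML false positive flags

ml_tp = ml_non_binding_tp + ml_binding_tp                # 468 spam posts caught
ml_flags = ml_tp + ml_non_binding_fp + ml_binding_fp     # 499 flags raised in total

print(f"ML detection rate: {ml_tp / total_spam:.0%}")    # ~64% of all spam flagged
print(f"ML accuracy:       {ml_tp / ml_flags:.0%}")      # ~94% of flags correct
print(f"Binding share:     {ml_binding_tp / ml_tp:.0%}") # ~80% of caught spam instantly deleted

sim_tp, sim_fp = 100, 14                                 # Similarity anti-spam figures
print(f"Similarity detection rate: {sim_tp / total_spam:.1%}")         # ~13.7%
print(f"Similarity accuracy:       {sim_tp / (sim_tp + sim_fp):.1%}")  # ~87%
```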

Takeaways

Our analysis of this data is that the ML anti-spam model would be a boon if deployed to Stack Overflow, and we’d like you to take a look and make sure it’s something you’d feel comfortable with guarding your site. Do you have any concerns with its effectiveness as we’ve laid it out? Are there any improvements we can make to the user experience in cases where we flag something incorrectly?

We’ll be monitoring this post for feedback until Wednesday, January 21st 2026. Our goal is to move forward with enabling ML anti-spam’s flagging capabilities after the end of the feedback window. If you see any serious issues that could hinder the rollout, please be sure to detail them in an answer so we can resolve them before we move forward.

11
  • 10
    nice! could you include a list of those 7 posts upon which false positive binding flags were cast? is there an option to try it out with binding flags disabled? Commented Jan 14 at 20:59
  • 9
    @starball Re: list - Sure! Theoretical false positive binding flag posts: stackoverflow.com/q/79850883 stackoverflow.com/q/79853510 stackoverflow.com/q/79854179 stackoverflow.com/q/79856691 stackoverflow.com/q/79856910 stackoverflow.com/q/79860571 stackoverflow.com/q/79862146 - An audit reveals one of the 8 posts was later deleted as spam, so we're actually at 7 false positives (updated). Some of these, upon review, are actually spam, but programmatically we've marked these as "misses" because they were, as an example, author-deleted Commented Jan 14 at 21:10
  • 4
    @starball We can turn this on with just non-binding flags, but given the accuracy I'm personally fine with letting it loose. I'd really like to see how it does with no guardrails given the data we're looking at currently. However, if this is a precaution the community wants to take, I will bring it to the team. Commented Jan 14 at 21:15
  • 16
    @Spevacus Out of the 7 you listed, I think only one is actually a false positive. Commented Jan 15 at 0:12
  • 6
    I'm not that super impressed to be honest, 64% detection is definitely not that impressive, 94% is OK-ish, but the 7 binding false positives is unacceptable, don't fire any binding Flags at all, 'Smoke Detector' (and flagging Users) will still nuke the Post in less than 1 minute if it's indeed really Spam (except in the 'Staging Ground' where it can take up to 1 hour and often requires manual reporting in the 'SD/Charcoal' Chat-Room - due to lack of API-access, I think I understood/remember)... Commented Jan 15 at 1:48
  • 6
    @chivracq Remember that the perfect world we want here is to remove these autonomously and never have to involve a human. Like Dharman said, while we programmatically considered those 7 posts as false positives, a human review suggests these are largely posts we'd want deleted anyway. By my review, 5 of these are actual spam, and indeed one of them was deleted with spam flags after I made this post ( stackoverflow.com/staging-ground/79860571 ). The accuracy is likely far better than this post suggests, I was simply handing you the data I queried for with our internal data explorer. Commented Jan 15 at 1:55
  • 13
    Next project using ML will be to create a Module detecting Duplicates, I hope...!?, to display the results just before an Asker will press the 'Submit' button, then again in the 'Staging Ground' and again if/once published on 'Main' when some Answerer starts typing in the 'Answer' field to post an Answer... (And I'll be absolutely ecstatically super happy with "only" 64% detection, just saying...!!!) Commented Jan 15 at 2:08
  • Can we get the number of non-spam posts posted during the monitored timeframe so we can calculate the false-positive rate? Commented Jan 15 at 9:06
  • @starball I was hoping for an official answer to avoid subtleties like are the dates inclusive and stuff. You also need to subtract the "# Total Spam". I get "# Non-Spam" 8262 and thus a fp rate of ~0.08% from sede. Commented Jan 15 at 9:25
  • One "stupid" question, => is your Module also able to scan and auto-flag Posts (Questions - and Comments) from the 'SO'-'SG' ('Staging Ground')...? Because afaik, 'SD' is not, I regularly report in 'Charcoal' Posts from the 'SG' where after 1 hour, I am still the first and only Spam-flagger... (in the night hours European time...) Commented Jan 21 at 18:17
  • 2
    @chivracq Yep! We scan all posts from Staging Ground as well as all opinion based posts. I will also mention that SmokeDetector does monitor SG posts, it just might miss the ones you've seen. Or, it scanned them but did not auto flag them or find them to be spammy. Commented Jan 21 at 18:22

2 Answers

30

I'm impressed! My compliments to everyone who put this together :)

Personally, I feel more comfortable with the option where it's turned on but binding flags are disabled. Actually, out of the 7 false positive binding flags you list, I only take issue with two of them: 79853510 and 79856910 (which, granted, have/had a lot of room for improvement, and I'm not necessarily sure that they're not spam; I'm admittedly not super good with the technicalities of spam classification).

But the messaging you set up (post notice and help center page!) telling the receiver what to do if they got hit with a false positive is quite good, so given the accuracy, I'm okay with it.

Or maybe the confidence thresholds could just be tweaked a bit so it's a little more lax with binding flags? (but again, this is just my view on it, and others may feel fine with these current results and tuning).

Thanks again for working on things like this.

8
  • Whether the answer 79853510/11107541 is spam or not, I find its deletion perfectly correct: an answer consisting solely of "I wrote a program" answers no question. So, if the binding auto-spam flag didn't cost the OP a -100 rep penalty, I wouldn't treat that case as a false positive. As for the answer 79856910/11107541, it looks LLM-generated to me. I am not good at such detection, but if my guess is correct, then I wouldn't treat that case as a false positive either. Commented Jan 14 at 23:35
  • As a mod, I would probably mark both of them as helpful. Commented Jan 15 at 0:13
  • 6
    I agree that letting it cast only non-binding flags is better. A moderator can then decide what to do with it, and there's transparency. Commented Jan 15 at 0:16
  • 4
    I'm glad you're impressed! I'm pushing back a bit on only letting it use non-binding flags at first because I do think it'll do a great job out of the box, however if a majority opinion feels like doing that for a given time period to gauge trust, I'll bring it to the team and we can propose a trial period. Commented Jan 15 at 0:20
  • @Spevacus thanks. ultimately, I recognize it's up to you (the company) what you want to do. but you sought my feedback, so you received it :D if you decide to enable the binding flags, I'm impressed enough with the accuracy that I wouldn't be too bothered by it. Commented Jan 15 at 1:42
  • 4
    @starball What would you think of a mod who deleted good posts as spam 1% of the time? Why should this system be treated differently? Commented Jan 15 at 2:37
  • 4
    @Spevacus I'm also a bit uncomfortable about the binding flags, even if the number of false positives in your sample is really only 1 or 2 rather than the 7 or 8 that were originally identified. I much prefer the Charcoal system, where at least 1 flag must be cast manually. Commented Jan 15 at 7:11
  • 8
    OTOH, the binding flags would be more acceptable if they were all put into a verification queue for mods (& maybe veteran flaggers) to review, so that any mistakes can be swiftly rectified. (And perhaps don't apply the -100 penalty to them until after they've been reviewed, although most spammers have virtually no rep, so that penalty is meaningless to them, anyway). Commented Jan 15 at 7:11
9

This is not accurate enough, and SmokeDetector could do this much better

I'm very glad you are all thinking about this; it's something I've advocated for years and I do believe most spam handling can and should be automated. That being said, SmokeDetector would be a much better candidate to use in such a system.

Based on a simple combination of minimum reason count/reason score (which I came up with in 5 minutes; it could easily be improved), 40% of spam on Super User and 50% on Server Fault could be nuked without a single false positive in any of those tens of thousands of posts throughout the past 9 years (running the numbers for SO is taking a while right now, but I expect this holds for it as well).
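As a rough sketch of the sort of combined rule I mean (the reason names, weights, and cutoffs below are made up for illustration; they are not SmokeDetector's actual reasons or values):

```python
# Sketch of a "minimum reason count / minimum total weight" auto-nuke rule.
# Reason names, weights, and cutoffs are invented for the example, not Smokey's real data.

REASON_WEIGHTS = {
    "blacklisted website": 60,
    "pattern-matching phone number": 45,
    "offensive keyword": 30,
    "few unique characters": 10,
}

MIN_REASONS = 2        # require at least this many independent detection reasons
MIN_TOTAL_WEIGHT = 80  # and at least this summed reason weight


def would_auto_nuke(reasons: list[str]) -> bool:
    """Apply the combined rule to the reasons a post was caught for."""
    total_weight = sum(REASON_WEIGHTS.get(reason, 0) for reason in reasons)
    return len(reasons) >= MIN_REASONS and total_weight >= MIN_TOTAL_WEIGHT


print(would_auto_nuke(["blacklisted website", "pattern-matching phone number"]))  # True
print(would_auto_nuke(["few unique characters"]))                                 # False
```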

This is roughly the same ratio as the percentage of spam that gets bindingly flagged by this system. Except this system has had a 1% false positive rate, while using SmokeDetector's systems would have produced no false positives in 9 years of spam; I think this is simply not nearly as good.

Furthermore, a 1% false positive rate is just too high, especially when it's a binding flag. Even just to cast non-binding flags, SmokeDetector needs at least 99.75% confidence. I'd be hard-pressed to find any of the community's auto-flagging bots with a false positive rate higher than 1%. Simply deleting a false positive is much, much worse: rather than just wasting a mod's time, you've actively deleted a post that will probably never be noticed or undeleted, given a user an unjust reputation penalty, and, unless they're intimately familiar with the inner workings of the site, probably driven them off too.

The 94% accuracy rate of the non-binding flags is still too low. That would be an unacceptably low level of accuracy for an experienced human on just about any sort of flag, let alone a bot, and especially when talking about spam/rude/abusive flags. I think it needs at least 98-99% accuracy to be reasonable, and I'm very much on the more lenient end of that spectrum. We could catch much more spam just by lowering Smokey's autoflag accuracy thresholds to this level than the 80% this system gets.

24
  • 8
    we already have smoke detector. and the community can decide whether it wants to change their individual autoflag rule thresholds, no? if your concern is false positives, then that should be okay if the flags are non-binding, though it's an interesting question of interaction between this and the charcoal stuff, which IIRC leaves one remaining flag to a human. Commented Jan 15 at 1:38
  • 1
    @starball There is an expected threshold of accuracy, even for humans. If I make a ton of terrible flags, I will be rightly flag banned, even if the majority of my flags are good. For a bot, which could utterly flood the system, the threshold needs to be higher. All the more so considering these flags are put at the top of the queue, and that this thing's non-binding flag + SD's not-meant-to-nuke flag = post nuked, and that's apparently been decided to be a feature. Commented Jan 15 at 2:36
  • 1
    @Starship it's pretty straightforward. A model that has 99% confidence in something should be correct 99% of the time — not less, and also not more. If it's more accurate, then it was underconfident, meaning it still did a bad job of assessing confidence. Commented Jan 15 at 3:43
  • 9
    I joined Charcoal in 2020, and I'm familiar with how their detections work and why they are so careful when flagging. Charcoal's main goal is to stay as close to 100% accuracy as possible for any autoflags. This makes perfect sense in a system where flags are attributed to human volunteers as the bot casts flags on behalf of them, and even a single FP can carry consequences. Autoflagging under these conditions means they optimize for precision above all else, at the cost of not autoflagging "obvious" (to humans) spam. Commented Jan 15 at 5:43
  • 5
    We have the benefit of attributing flags to robots, removing a lot of the "overhead". Further, while the "recall" amount is lower (64% of spam would have been caught), these are posts we're acting on, and we're acting on a larger number of posts than Smokey does. If we look to our theoretical unilateral removals, that's 50% less spam (80% of caught spam would've been instantly deleted) that a human (or Smokey) ever sees, let alone needs to flag. Where we raise an FP non-binding flag, a moderator can simply dismiss the flag, and no user is penalized for flagging incorrectly. Commented Jan 15 at 5:43
  • 3
    For binding flags, the threshold for deletion is quite conservative, and will always remain so. Of the 7 false positives, I posted some thoughts in this comment. In these posts, only one binding false positive was a genuinely incorrect flag, and even that was a low-impact case. It also turns out another one (stackoverflow.com/a/79856910 ) would not have been flagged due to some internal safeguards that my query didn't account for. I'll explain these safeguards elsewhere in the future. Commented Jan 15 at 5:43
  • 5
    For non-binding flags, I agree that non-binding accuracy should still be in the high 90% accuracy range, but remember that the intent of these flags is to triage, not instantly delete. Some declined flags are alright if the system reduces the volume of spam anyone has to see autonomously and we're not too noisy with them. To that end, we have been improving Similarity anti-spam to reduce noisy false positives. Further, I did not review the non-binding false positives manually, I took the programmatic classification of FP, which we've found is a bit problematic with binding flags anyway. Commented Jan 15 at 5:45
  • 5
    Smokey and this system are also not mutually exclusive. Much like all of our other anti-spam systems (Similarity anti-spam, SpamRam, Cloudflare, and some other fun ones I won't dive into), these systems can and will work in beautiful harmony with each other. The ML model can also continue to be improved as time goes on and we observe it flagging in the wild, and we have every intention of ensuring as few legitimate posts are deleted as possible via improvements to it. Commented Jan 15 at 5:45
  • 3
    @KarlKnechtel I don't know what you expect of Smoke Detector. It catches posts based on heuristics but doesn't assign confidence in percentage. Each different detection reason has a separate weight assigned to it that depends on past performance. If a reason has a lot of TPs and low to non-existent FPs, it will have a higher weight. If a post is caught for multiple reasons, it's assigned the sum of the weights. Commented Jan 15 at 6:48
  • 2
    Users can then sign up for autoflagging and can set up rules for how their flags are to be used - the minimum weight and the minimum rep of the poster of a detected post. For an autoflagging rule to be valid, its criteria are checked against historic reports, and the rule has to have at least a 99.75% true positive rate over that history. Also, if in the future the criteria for a rule fall under that 99.75% threshold, the rule is automatically disabled. The autoflagging stats can be seen at the bottom here and show 99.5% TP and 0.4% FP. Commented Jan 15 at 6:48
  • 5
    However, Smoke Detector will not attempt to nuke posts even if the reason weight is very high. It only casts up to 3 flags, thus requiring a regular human to cast the last one to nuke a post. There is an option to allow auto-nuking but it's per-site and requires explicit acknowledgement from a site mod to enable. It's there for huge spam waves when even the human flaggers are stretched. Commented Jan 15 at 6:48
  • 1
    I got 8262 for "#non spam posts" from sede so it's about 0.08% Commented Jan 15 at 9:22
  • 1
    For reference with that "#non spam posts" the total FP rate (binding & non-binding) is (7 + 24) / 8262 ≈ 0.3% Commented Jan 15 at 9:54
  • 1
    @Spevacus What I am saying is that this system is worse than Smokey in every conceivable way. Forget Smokey; no bot casts flags below 99% accuracy, minimum. Even bots like Natty and DharmanBot, whose flags just send posts to a review queue and which don't flag on anyone's behalf, are above this. Even bots flagging old NLN comments are above this. It's simply unreasonable to have a bot with this many false positives, even on non-binding flags. And anything short of 99.99% accuracy is also not enough for binding deletion. Routinely penalizing users and deleting posts for no reason is not a good thing. Commented Jan 15 at 12:22
  • 3
    @KarlKnechtel it's an analysis of historic data, rather than a projection of how likely something is to be spam. Huge difference, because for projections, you have the issue you asked about - does 99% confidence mean that 99% of the decisions are correct? However, Smokey doesn't do that - it examines decisions that were already made and checks how many of them were correct. You're only allowed to have an autoflagging rule that historically has a success rate of 99.75%, calculated simply as num_tp / num_total. Commented Jan 15 at 16:29
