Timeline for Starship's answer to "Stack Overflow now uses machine learning to flag spam automatically"
Current License: CC BY-SA 4.0
Post Revisions
27 events
| when | what | action | by | comment | license |
|---|---|---|---|---|---|
| Jan 15 at 21:56 | comment | added | Starship | @KarlKnechtel It assigns two confidence numbers: reason count (self-explanatory) and reason weight (the weights of the individual reasons are summed). Only combinations of the two which have more than 99.75% accuracy are allowed | |
| Jan 15 at 21:54 | comment | added | Starship | @starball Yes. But that last flag that's supposed to come from a human would be cast by this system, thus nuking the post. And while you can adjust your threshold to be higher than 99.75%, it can't be lower | |
| Jan 15 at 17:53 | comment | added | starball Mod | autoflag usually leaves one flag for a human, no? and my first comment about the community being able to set their own thresholds was meant to be about each user's own autoflag rules. | |
| Jan 15 at 16:29 | comment | added | VLAZ | @KarlKnechtel it's an analysis of historic data, rather than a projection of how likely something is to be spam. Huge difference, because for projections you have the issue you asked about - does 99% confidence mean that 99% of the decisions are correct? However, Smokey doesn't do that - it examines decisions that were already made and checks how many of them were correct. You're only allowed to have an autoflagging rule that historically has a success rate of 99.75%, calculated as simply num_tp / num_total. | |
| Jan 15 at 16:17 | comment | added | Karl Knechtel | @VLAZ I don't understand. You say "It catches posts based on heuristics but doesn't assign confidence in percentage.", but the post says: "Even just to cast non-binding flags, SmokeDetector needs at least 99.75% confidence." That 99.75% number comes from somewhere. | |
| Jan 15 at 12:58 | comment | added | Starship | @starball Why would having a massive number of good posts flagged as spam for no good reason be a good thing? Especially when this and a Smokey autoflag will nuke a post without a human, and especially when we have a system better in every possible way that they won't use | |
| Jan 15 at 12:22 | comment | added | Starship | @Spevacus What I am saying is that this system is worse than Smokey in every conceivable way. Forget Smokey, no bot casts flags below 99% accuracy, minimum. Even bots like Natty and DharmanBot, whose flags just send posts to a review queue and which don't flag on anyone's behalf, are above this. Even bots flagging old NLN comments are above this. It's simply unreasonable to have a bot with this many false positives, even on non-binding flags. And anything short of 99.99% accuracy is also not enough for binding deletion. Routinely penalizing users and deleting posts for no reason is not a good thing. | |
| Jan 15 at 9:54 | comment | added | cafce25 | For reference, with that "#non-spam posts" the total FP rate (binding & non-binding) is (7 + 24) / 8262 ≈ 0.38% | |
| Jan 15 at 9:22 | comment | added | cafce25 | I got 8262 for "#non-spam posts" from SEDE, so it's about 0.08% | |
| Jan 15 at 9:15 | comment | added | cafce25 | "Except this system has had a 1% false positive rate" – The false positive rate is about an order of magnitude better than that. It's "#non-spam flagged as spam"/"#non-spam", not "#non-spam flagged as spam"/"#spam". | |
| Jan 15 at 6:48 | comment | added | VLAZ | However, Smoke Detector will not attempt to nuke posts even if the reason weight is very high. It only casts up to 3 flags, thus requiring a regular human to cast the last one to nuke a post. There is an option to allow auto-nuking but it's per-site and requires explicit acknowledgement from a site mod to enable. It's there for huge spam waves when even the human flaggers are stretched. | |
| Jan 15 at 6:48 | comment | added | VLAZ | Users can then sign up for autoflagging and can set up rules for how their flags are to be used - the minimum weight and the minimum rep of the poster of a detected post. For an autoflagging rule to be valid, its criteria are checked against historic reports, and the rule has to have at least a 99.75% true positive rate over that history. Also, if in the future the criteria for a rule fall under that 99.75% threshold, the rule is automatically disabled. The autoflagging stats can be seen at the bottom here and show 99.5% TP and 0.4% FP. | |
| Jan 15 at 6:48 | comment | added | VLAZ | @KarlKnechtel I don't know what you expect of Smoke Detector. It catches posts based on heuristics but doesn't assign confidence in percentage. Each different detection reason has a separate weight assigned to it that depends on past performance. If a reason has a lot of TPs and low to non-existent FPs, it will have a higher weight. If a post is caught for multiple reasons, it's assigned the sum of the weights. | |
| Jan 15 at 5:45 | comment | added | Spevacus StaffMod | Smokey and this system are also not mutually exclusive. Much like all of our other anti-spam systems (Similarity anti-spam, SpamRam, Cloudflare, and some other fun ones I won't dive into), these systems can and will work in beautiful harmony with each other. The ML model can also continue to be improved as time goes on and we observe it flagging in the wild, and we have every intention of ensuring as few legitimate posts are deleted as possible via improvements to it. | |
| Jan 15 at 5:45 | comment | added | Spevacus StaffMod | For non-binding flags, I agree that non-binding accuracy should still be in the high 90% accuracy range, but remember that the intent of these flags is to triage, not instantly delete. Some declined flags are alright if the system reduces the volume of spam anyone has to see autonomously and we're not too noisy with them. To that end, we have been improving Similarity anti-spam to reduce noisy false positives. Further, I did not review the non-binding false positives manually, I took the programmatic classification of FP, which we've found is a bit problematic with binding flags anyway. | |
| Jan 15 at 5:43 | comment | added | Spevacus StaffMod | For binding flags, the threshold for deletion is quite conservative, and will always remain so. Of the 7 false positives, I posted some thoughts in this comment. In these posts, only one binding false positive was a genuinely incorrect flag, and even that was a low-impact case. It also turns out another one (stackoverflow.com/a/79856910 ) would not have been flagged due to some internal safeguards that my query didn't account for. I'll explain these safeguards elsewhere in the future. | |
| Jan 15 at 5:43 | comment | added | Spevacus StaffMod | We have the benefit of attributing flags to robots, removing a lot of the "overhead". Further, while the "recall" amount is lower (64% of spam would have been caught), these are posts we're acting on, and we're acting on a larger number of posts than Smokey does. If we look at our theoretical unilateral removals, that's 50% less spam (80% of total spam would've been instantly deleted) that a human (or Smokey) ever sees, let alone needs to flag. Where we raise an FP non-binding flag, a moderator can simply dismiss the flag, and no user is penalized for flagging incorrectly. | |
| Jan 15 at 5:43 | comment | added | Spevacus StaffMod | I joined Charcoal in 2020, and I'm familiar with how their detections work and why they are so careful when flagging. Charcoal's main goal is to stay as close to 100% accuracy as possible for any autoflags. This makes perfect sense in a system where flags are attributed to human volunteers as the bot casts flags on behalf of them, and even a single FP can carry consequences. Autoflagging under these conditions means they optimize for precision above all else, at the cost of not autoflagging "obvious" (to humans) spam. | |
| Jan 15 at 3:43 | comment | added | Karl Knechtel | @Starship it's pretty straightforward. A model that has 99% confidence in something should be correct 99% of the time — not less, and also not more. If it's more accurate, then it was underconfident, meaning it still did a bad job of assessing confidence. | |
| Jan 15 at 2:36 | comment | added | Starship | @starball There is an expected threshold of accuracy, even for humans. If I make a ton of terrible flags, I will be rightly flag banned, even if the majority of my flags are good. For a bot, which could utterly flood the system, the threshold needs to be higher. All the more so considering these flags are put at the top of the queue, that this thing's non-binding flag + SD's not-meant-to-nuke flag = post nuked, and that that's apparently been decided to be a feature. | |
| Jan 15 at 2:34 | comment | added | Starship | @starball Actually they can't. I've discussed this with Charcoal admins on a number of occasions and the general consensus is "we'd like to be able to do this at least occasionally in some limited version, but we'd need CM approval" | |
| Jan 15 at 2:34 | comment | added | Starship | @KarlKnechtel What does that mean? | |
| Jan 15 at 2:06 | comment | added | Karl Knechtel | Has there been a proper "calibration" study comparing SD's confidence to false positive rate at that confidence level? | |
| Jan 15 at 1:38 | comment | added | starball Mod | we already have smoke detector. and the community can decide whether it wants to change their individual autoflag rule thresholds, no? if your concern is false positives, then that should be okay if the flags are non-binding, though it's an interesting question of interaction between this and the charcoal stuff, which IIRC leaves one remaining flag to a human. | |
| Jan 15 at 1:22 | history | edited | Starship | deleted 1 character in body | CC BY-SA 4.0 |
| Jan 15 at 1:16 | history | edited | Starship | added 138 characters in body | CC BY-SA 4.0 |
| Jan 15 at 0:38 | history | answered | Starship | | CC BY-SA 4.0 |
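
The false-positive-rate arithmetic cafce25 walks through in the comments (Jan 15, 9:15 to 9:54) can be sketched as follows. The counts come from the discussion (7 binding FPs, 24 non-binding FPs, 8262 non-spam posts); the variable names are illustrative:

```python
# FP rate = (non-spam flagged as spam) / (all non-spam posts),
# not divided by the number of spam posts.
binding_fp = 7        # binding false-positive flags
non_binding_fp = 24   # non-binding false-positive flags
non_spam_total = 8262 # total non-spam posts (cafce25's SEDE figure)

binding_fp_rate = binding_fp / non_spam_total
total_fp_rate = (binding_fp + non_binding_fp) / non_spam_total

print(f"binding FP rate: {binding_fp_rate:.2%}")  # 0.08%
print(f"total FP rate:   {total_fp_rate:.2%}")    # 0.38%
```

Dividing by the non-spam denominator is what brings the rate an order of magnitude below the 1% figure quoted in the answer.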
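
VLAZ describes autoflagging rules as valid only if their criteria, checked against historic reports, yield at least a 99.75% true positive rate. A minimal sketch of that gate, with invented function and field names (and the direction of the reputation check assumed, since the comments don't pin it down):

```python
THRESHOLD = 0.9975  # minimum historic TP rate for a valid rule

def rule_is_valid(historic_reports, min_weight, poster_rep_limit):
    """historic_reports: dicts with 'weight', 'poster_rep', 'is_spam' keys."""
    matches = [r for r in historic_reports
               if r["weight"] >= min_weight
               and r["poster_rep"] <= poster_rep_limit]
    if not matches:
        return False
    true_positives = sum(1 for r in matches if r["is_spam"])
    # The rule is only allowed if, over everything it would have matched
    # historically, at least 99.75% were actual spam.
    return true_positives / len(matches) >= THRESHOLD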
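
The calibration question Karl Knechtel raises (does "99% confidence" mean "correct 99% of the time"?) is typically answered by bucketing predictions by stated confidence and comparing each bucket's observed accuracy. An illustrative sketch, with made-up data shapes:

```python
from collections import defaultdict

def calibration_table(predictions):
    """predictions: list of (confidence, was_correct) pairs."""
    bins = defaultdict(list)
    for confidence, correct in predictions:
        bins[round(confidence, 2)].append(correct)
    # A well-calibrated model's observed accuracy matches its stated
    # confidence: the 0.99 bin should be right ~99% of the time,
    # no less and (per Karl's point) also no more.
    return {c: sum(v) / len(v) for c, v in sorted(bins.items())}
```

Large gaps between a bin's key (stated confidence) and its value (observed accuracy) indicate over- or under-confidence.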