Timeline for Starship's answer to "Stack Overflow now uses machine learning to flag spam automatically"
Current License: CC BY-SA 4.0
Post Revisions
27 events
| when | what | action | by | comment | license |
|---|---|---|---|---|---|
| Jan 15 at 21:56 | comment | added | Starship | @KarlKnechtel It assigns two confidence numbers: reason count (self-explanatory) and reason weight (the weights of the individual reasons are summed). Only combinations of the two which have more than 99.75% accuracy are allowed | |
| Jan 15 at 21:54 | comment | added | Starship | @starball Yes. But that last flag that's supposed to come from a human would be cast by this system, thus nuking the post. And while you can adjust your threshold to be higher than 99.75%, it can't be lower | |
| Jan 15 at 17:53 | comment | added | starball Mod | autoflag usually leaves one flag for a human, no? and my first comment about the community being able to set their own thresholds was meant to be about each user's own autoflag rules. | |
| Jan 15 at 16:29 | comment | added | VLAZ | @KarlKnechtel it's an analysis of historic data, rather than a projection of how likely something is to be spam. Huge difference, because for projections you have the issue you asked about - does 99% confidence mean that 99% of the decisions are correct? However, Smokey doesn't do that - it examines decisions that were already made and checks how many of them were correct. You're only allowed to have an autoflagging rule that historically has a success rate of 99.75%, calculated as simply num_tp / num_total. | |
| Jan 15 at 16:17 | comment | added | Karl Knechtel | @VLAZ I don't understand. You say "It catches posts based on heuristics but doesn't assign confidence in percentage.", but the post says: "Even just to cast non-binding flags, SmokeDetector needs at least 99.75% confidence." That 99.75% number comes from somewhere. | |
| Jan 15 at 12:58 | comment | added | Starship | @starball Why would having a massive number of good posts flagged as spam for no good reason be a good thing? Especially when this and a Smokey autoflag will nuke a post without a human, and especially when we have a system better in every possible way that they won't use | |
| Jan 15 at 12:22 | comment | added | Starship | @Spevacus What I am saying is that this system is worse than Smokey in every conceivable way. Forget Smokey, no bot casts flags below 99% accuracy, minimum. Even bots like Natty and DharmanBot, whose flags just send posts to a review queue and which don't flag on anyone's behalf, are above this. Even bots flagging old NLN comments are above this. It's simply unreasonable to have a bot with this many false positives, even on non-binding flags. And anything short of 99.99% accuracy is also not enough for binding deletion. Routinely penalizing users and deleting posts for no reason is not a good thing. | |
| Jan 15 at 9:54 | comment | added | cafce25 | For reference, with that "#non-spam posts" the total FP rate (binding & non-binding) is (7 + 24) / 8262 ≈ 0.38% | |
| Jan 15 at 9:22 | comment | added | cafce25 | I got 8262 for "#non-spam posts" from SEDE, so it's about 0.08% | |
| Jan 15 at 9:15 | comment | added | cafce25 | "Except this system has had a 1% false positive rate" – The false positive rate is about an order of magnitude better than that. It's "#non-spam flagged as spam"/"#non-spam", not "#non-spam flagged as spam"/"#spam". | |
| Jan 15 at 6:48 | comment | added | VLAZ | However, Smoke Detector will not attempt to nuke posts even if the reason weight is very high. It only casts up to 3 flags, thus requiring a regular human to cast the last one to nuke a post. There is an option to allow auto-nuking but it's per-site and requires explicit acknowledgement from a site mod to enable. It's there for huge spam waves when even the human flaggers are stretched. | |
| Jan 15 at 6:48 | comment | added | VLAZ | Users can then sign up for autoflagging and can set up rules for how their flags are to be used - the minimum weight and the minimum rep of the poster of a detected post. For an autoflagging rule to be valid, its criteria are checked against historic reports, and the rule has to have at least a 99.75% true positive rate over that history. Also, if in the future the criteria for a rule fall under that 99.75% threshold, the rule is automatically disabled. The autoflagging stats can be seen at the bottom here and show 99.5% TP and 0.4% FP. | |
| Jan 15 at 6:48 | comment | added | VLAZ | @KarlKnechtel I don't know what you expect of Smoke Detector. It catches posts based on heuristics but doesn't assign confidence in percentage. Each different detection reason has a separate weight assigned to it that depends on past performance. If a reason has a lot of TPs and low to non-existent FPs, it will have a higher weight. If a post is caught for multiple reasons, it's assigned the sum of the weights. | |
| Jan 15 at 5:45 | comment | added | Spevacus StaffMod | Smokey and this system are also not mutually exclusive. Much like all of our other anti-spam systems (Similarity anti-spam, SpamRam, Cloudflare, and some other fun ones I won't dive into), these systems can and will work in beautiful harmony with each other. The ML model can also continue to be improved as time goes on and we observe it flagging in the wild, and we have every intention of ensuring as few legitimate posts are deleted as possible via improvements to it. | |
| Jan 15 at 5:45 | comment | added | Spevacus StaffMod | For non-binding flags, I agree that non-binding accuracy should still be in the high 90% accuracy range, but remember that the intent of these flags is to triage, not instantly delete. Some declined flags are alright if the system reduces the volume of spam anyone has to see autonomously and we're not too noisy with them. To that end, we have been improving Similarity anti-spam to reduce noisy false positives. Further, I did not review the non-binding false positives manually, I took the programmatic classification of FP, which we've found is a bit problematic with binding flags anyway. | |
| Jan 15 at 5:43 | comment | added | Spevacus StaffMod | For binding flags, the threshold for deletion is quite conservative, and will always remain so. Of the 7 false positives, I posted some thoughts in this comment. In these posts, only one binding false positive was a genuinely incorrect flag, and even that was a low-impact case. It also turns out another one (stackoverflow.com/a/79856910 ) would not have been flagged due to some internal safeguards that my query didn't account for. I'll explain these safeguards elsewhere in the future. | |
| Jan 15 at 5:43 | comment | added | Spevacus StaffMod | We have the benefit of attributing flags to robots, removing a lot of the "overhead". Further, while the "recall" amount is lower (64% of spam would have been caught), these are posts we're acting on, and we're acting on a larger number of posts than Smokey does. If we look at our theoretical unilateral removals, that's 50% less spam (80% of total spam would've been instantly deleted) that a human (or Smokey) ever sees, let alone needs to flag. Where we raise an FP non-binding flag, a moderator can simply dismiss the flag, and no user is penalized for flagging incorrectly. | |
| Jan 15 at 5:43 | comment | added | Spevacus StaffMod | I joined Charcoal in 2020, and I'm familiar with how their detections work and why they are so careful when flagging. Charcoal's main goal is to stay as close to 100% accuracy as possible for any autoflags. This makes perfect sense in a system where flags are attributed to human volunteers as the bot casts flags on behalf of them, and even a single FP can carry consequences. Autoflagging under these conditions means they optimize for precision above all else, at the cost of not autoflagging "obvious" (to humans) spam. | |
| Jan 15 at 3:43 | comment | added | Karl Knechtel | @Starship it's pretty straightforward. A model that has 99% confidence in something should be correct 99% of the time — not less, and also not more. If it's more accurate, then it was underconfident, meaning it still did a bad job of assessing confidence. | |
| Jan 15 at 2:36 | comment | added | Starship | @starball There is an expected threshold of accuracy, even for humans. If I make a ton of terrible flags, I will be rightly flag banned, even if the majority of my flags are good. For a bot, which could utterly flood the system, the threshold needs to be higher. All the more so considering these flags are put at the top of the queue, that this thing's non-binding flag + SD's not-meant-to-nuke flag = post nuked, and that that's apparently been decided to be a feature. | |
| Jan 15 at 2:34 | comment | added | Starship | @starball Actually they can't. I've discussed this with Charcoal admins on a number of occasions and the general consensus is "we'd like to be able to do this at least occasionally in some limited version, but we'd need CM approval" | |
| Jan 15 at 2:34 | comment | added | Starship | @KarlKnechtel What does that mean? | |
| Jan 15 at 2:06 | comment | added | Karl Knechtel | Has there been a proper "calibration" study comparing SD's confidence to false positive rate at that confidence level? | |
| Jan 15 at 1:38 | comment | added | starball Mod | we already have smoke detector. and the community can decide whether it wants to change their individual autoflag rule thresholds, no? if your concern is false positives, then that should be okay if the flags are non-binding, though it's an interesting question of interaction between this and the charcoal stuff, which IIRC leaves one remaining flag to a human. | |
| Jan 15 at 1:22 | history | edited | Starship | deleted 1 character in body | CC BY-SA 4.0 |
| Jan 15 at 1:16 | history | edited | Starship | added 138 characters in body | CC BY-SA 4.0 |
| Jan 15 at 0:38 | history | answered | Starship | | CC BY-SA 4.0 |
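
The false-positive-rate arithmetic cafce25 walks through in the comments (Jan 15, 9:15 to 9:54) can be sketched as follows. The counts come from the discussion (7 binding FPs, 24 non-binding FPs, 8262 non-spam posts); the variable names are illustrative:

```python
# FP rate = (non-spam flagged as spam) / (all non-spam posts),
# not divided by the number of spam posts.
binding_fp = 7        # binding false-positive flags
non_binding_fp = 24   # non-binding false-positive flags
non_spam_total = 8262 # total non-spam posts (cafce25's SEDE figure)

binding_fp_rate = binding_fp / non_spam_total
total_fp_rate = (binding_fp + non_binding_fp) / non_spam_total

print(f"binding FP rate: {binding_fp_rate:.2%}")  # 0.08%
print(f"total FP rate:   {total_fp_rate:.2%}")    # 0.38%
```

Dividing by the non-spam denominator is what brings the rate an order of magnitude below the 1% figure quoted in the answer.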
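
VLAZ describes autoflagging rules as valid only if their criteria, checked against historic reports, yield at least a 99.75% true positive rate. A minimal sketch of that gate, with invented function and field names (and the direction of the reputation check assumed, since the comments don't pin it down):

```python
THRESHOLD = 0.9975  # minimum historic TP rate for a valid rule

def rule_is_valid(historic_reports, min_weight, poster_rep_limit):
    """historic_reports: dicts with 'weight', 'poster_rep', 'is_spam' keys."""
    matches = [r for r in historic_reports
               if r["weight"] >= min_weight
               and r["poster_rep"] <= poster_rep_limit]
    if not matches:
        return False
    true_positives = sum(1 for r in matches if r["is_spam"])
    # The rule is only allowed if, over everything it would have matched
    # historically, at least 99.75% were actual spam.
    return true_positives / len(matches) >= THRESHOLD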
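
The calibration question Karl Knechtel raises (does "99% confidence" mean "correct 99% of the time"?) is typically answered by bucketing predictions by stated confidence and comparing each bucket's observed accuracy. An illustrative sketch, with made-up data shapes:

```python
from collections import defaultdict

def calibration_table(predictions):
    """predictions: list of (confidence, was_correct) pairs."""
    bins = defaultdict(list)
    for confidence, correct in predictions:
        bins[round(confidence, 2)].append(correct)
    # A well-calibrated model's observed accuracy matches its stated
    # confidence: the 0.99 bin should be right ~99% of the time,
    # no less and (per Karl's point) also no more.
    return {c: sum(v) / len(v) for c, v in sorted(bins.items())}
```

Large gaps between a bin's key (stated confidence) and its value (observed accuracy) indicate over- or under-confidence.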