Leniency is required to limit false positives.
Let me elaborate with a simplification:
Let's say the system that reviews chats outputs a single number from 0 to 3 indicating how toxic the chat was: 0 is not toxic, 1 mildly toxic, 2 moderately toxic, and 3 extremely toxic.
Our system isn't perfect, though, so whatever it outputs, let's say there's a 5% chance that the output is 1 too low and a 5% chance that it is 1 too high.
(For simplicity's sake I'll ignore Bayes' theorem and the actual likelihood of encountering toxicity of a certain level.)
Let's take a look now at a couple of scenarios:
A player's chat in 1 game is reviewed and the output is 1:
The chance that this player was moderately toxic is 5%
The chance that this player was at least mildly toxic is 95%
The chance that this player is entirely innocent is 5%.
A player's chat in 3 games is reviewed and the output is 1 for all:
The chance that this player was moderately toxic at least once is 14%
The chance that this player was at least mildly toxic in all 3 is 86%
The chance that this player was entirely innocent in all 3 is about 0.01% (0.05 × 0.05 × 0.05)
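For anyone who wants to check these figures, here is a minimal Python sketch of the arithmetic. It assumes the reviews of different games are independent of each other, which the numbers in this post rely on but which may not hold in practice. The function and constant names are made up for illustration.

```python
# Toy model from the post: given an output of 1, the true level is
# 0 with probability 0.05, 1 with probability 0.90, and 2 with probability 0.05.
P_TOO_HIGH = 0.05  # output overstated: the true level is 0 (innocent)
P_TOO_LOW = 0.05   # output understated: the true level is 2 (moderately toxic)


def all_output_one(n_games: int) -> dict[str, float]:
    """Probabilities for a player whose n_games reviews all returned 1.

    Assumes the errors in different games are independent.
    """
    p_at_least_mild = 1.0 - P_TOO_HIGH  # true level is 1 or 2 in a single game
    p_not_moderate = 1.0 - P_TOO_LOW    # true level is 0 or 1 in a single game
    return {
        "moderate_at_least_once": 1.0 - p_not_moderate ** n_games,
        "at_least_mild_in_all": p_at_least_mild ** n_games,
        "innocent_in_all": P_TOO_HIGH ** n_games,
    }


for n in (1, 3, 5, 10):
    result = all_output_one(n)
    print(n, {name: f"{value:.3g}" for name, value in result.items()})
```

The values are printed as fractions rather than percentages, so multiply by 100 to compare them with the scenarios.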
A player's chat in 5 games is reviewed and the output is 1 for all:
The chance that this player was moderately toxic at least once is 23%
The chance that this player was at least mildly toxic in all 5 is 77%
The chance that this player was at least mildly toxic in 3 games (or more) is 99.9%
The chance that this player was entirely innocent is 0.00003%
A player's chat in 10 games is reviewed and the output is 1 for all:
The chance that this player was moderately toxic at least once is 40%.
The chance that this player was at least mildly toxic in all 10 is 60%
The chance that this player was at least mildly toxic in 8 games (or more) is 99%
The chance that this player was at least mildly toxic in 6 games (or more) is 99.99%
The chance that this player was entirely innocent in all 10 is about 0.00000000001% (0.05 to the 10th power, roughly 1 in 10 trillion)
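The "in 3 games (or more)" style figures are binomial tail probabilities. Below is a rough Python sketch of how they can be computed, again assuming independent games and a 95% per-game chance of being at least mildly toxic when the output is 1. `prob_at_least` is a helper name invented for this example.

```python
from math import comb


def prob_at_least(k: int, n: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): at least k 'hits' in n independent games."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Per-game chance of being at least mildly toxic given an output of 1
# (an assumption taken from the toy model in this post).
P_AT_LEAST_MILD = 0.95

print(f"at least 3 of 5:  {prob_at_least(3, 5, P_AT_LEAST_MILD):.4f}")
print(f"at least 8 of 10: {prob_at_least(8, 10, P_AT_LEAST_MILD):.4f}")
print(f"at least 6 of 10: {prob_at_least(6, 10, P_AT_LEAST_MILD):.6f}")
```

These come out at roughly 0.9988, 0.9885 and 0.99994, which is where the rounded percentages for the 5-game and 10-game scenarios come from.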
I feel the need to point out that this last number is not exaggerated or made up. It's really this low.
As said, this is a simplification, and a very extreme one at that, but it serves to illustrate that looking at multiple games is an incredibly powerful method of reducing the risk of false positives, even when the toxicity detection has low specificity.
And that low specificity is probably a reasonable assumption. I'd say it's probably lower than 95% for mild toxicity, since that category includes things that can be tricky even for humans to detect in pure text form.
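To get a feel for how much a worse detector would change things, here is a small Python sketch that recomputes the "innocent in every game" chance for a few per-game false-alarm rates. The 10% and 20% rates are purely hypothetical values picked for illustration, and the games are still assumed to be independent.

```python
def innocent_in_all(n_games: int, p_too_high: float) -> float:
    """Chance that every one of n_games outputs of 1 was a false alarm.

    p_too_high is the per-game chance that an output of 1 overstated a
    true level of 0. Independence between games is assumed.
    """
    return p_too_high ** n_games


# 0.05 is the rate used in the post; 0.10 and 0.20 are hypothetical worse cases.
for rate in (0.05, 0.10, 0.20):
    chances = ", ".join(
        f"{n} games: {innocent_in_all(n, rate) * 100:.2g}%" for n in (1, 3, 5, 10)
    )
    print(f"false-alarm rate {rate:.0%} -> {chances}")
```

Even at a hypothetical 20% false-alarm rate, ten flagged games in a row would all be false alarms only about 0.00001% of the time under these assumptions.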
Edit: If you want the maths, I don't mind providing it. It's just that I'll only put in the effort of writing up multiplication-heavy formulas in a forum that uses asterisks for formatting if somebody actually requests it. It's difficult to spot mistakes in your own formulas when all asterisks are replaced with "\*", and switching between editor and preview constantly is simply annoying.