Worth pointing out that calculating p-values on a wide set of metrics and selecting for those under $threshold (called p-hacking) is not statistically sound - who cares, we are not an academic journal, but a pill of knowledge.
The idea is, since any single test has a ~1/20 chance of producing p < 0.05 by chance alone, testing many metrics means you are bound to get false positives. In academia it's definitely not something you'd do, but I think here it's fine.
@OP have you considered calculating Cohen's effect size? p only tells us that, given the magnitude of the differences and the number of samples, we are "pretty sure" the difference is real. Cohen's `d` tells us how big the difference is on a "standard" scale.
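For concreteness, here's a minimal sketch of Cohen's `d` (the pooled-standard-deviation version), assuming you have per-group means, SDs, and sample sizes; the numbers below are invented for illustration:

```python
import math

def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    # Pooled standard deviation across both groups.
    pooled = math.sqrt(((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2)
                       / (n_a + n_b - 2))
    return (mean_a - mean_b) / pooled

# Hypothetical per-comment em-dash rates for two account cohorts.
d = cohens_d(0.8, 0.2, 1.0, 1.0, 500, 500)  # d ≈ 0.6, a "medium" effect
```

Rule of thumb: d ≈ 0.2 is small, 0.5 medium, 0.8 large, which is what makes it a useful companion to a bare p-value.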
Yes, if OP had done a full vocabulary comparison and kept only the sub-threshold words, it would be p-hacking. I'm not sure that's the case here, though? Given that OP started with the em-dash (per the post), and probably didn't do repeated sampling, "em-dash usage is a marker" is a pretty fair standalone hypothesis.
Your comment about p < 0.05 feels out of place to me. The p-values here are << 0.05. Like waaaaay lower.
Perhaps Fisher's exact is more appropriate, on the per-word basis?
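In practice you'd reach for `scipy.stats.fisher_exact`, but for the curious, here's a stdlib-only sketch of the two-sided test on a hypothetical 2x2 word-frequency table (counts are made up):

```python
from math import comb

def fisher_exact_p(a, b, c, d):
    """Two-sided Fisher's exact p for the 2x2 table [[a, b], [c, d]]:
    sum the hypergeometric probabilities of all tables (with the same
    margins) that are no more likely than the observed one."""
    row1, row2, col1, n = a + b, c + d, a + c, a + b + c + d
    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)
    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs + 1e-12)

# Hypothetical counts: comments containing a marker word vs. not,
# split by account cohort.
p = fisher_exact_p(40, 960, 10, 990)  # strongly skewed table, tiny p
```

The per-word framing fits Fisher's exact well because word counts in a comment sample are exactly the kind of small-count contingency data the test was built for.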
A Bonferroni correction would be suitable. I usually see it used in genome-wide association studies (GWAS) that check to see if a trait or phenotype is influenced by any single nucleotide polymorphisms (SNPs) in a genome. So it's doing multiple testing on a scale of ~1 million.
> One of the simplest approaches to correct for multiple testing is the Bonferroni correction. The Bonferroni correction adjusts the alpha value from α = 0.05 to α = (0.05/k) where k is the number of statistical tests conducted. For a typical GWAS using 500,000 SNPs, statistical significance of a SNP association would be set at 1e-7. This correction is the most conservative, as it assumes that each association test of the 500,000 is independent of all other tests – an assumption that is generally untrue due to linkage disequilibrium among GWAS markers.
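The quoted rule is just α/k. A toy sketch at word-frequency scale rather than GWAS scale (the p-values are invented):

```python
def bonferroni_reject(p_values, alpha=0.05):
    # A hypothesis survives only if its p-value beats alpha / k,
    # where k is the total number of tests run.
    k = len(p_values)
    return [p < alpha / k for p in p_values]

# Three word-frequency tests instead of 500,000 SNPs;
# the adjusted threshold is 0.05 / 3 ≈ 0.0167.
survivors = bonferroni_reject([1e-4, 0.03, 0.2])
```

At 500,000 tests the threshold works out to 0.05 / 500,000 = 1e-7, matching the quote.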
I think these term-frequency comparisons are probably a pretty blunt tool, since some of the best-known AI indicators aren't single words but turns of phrase and sentence structure.
IMO a more interesting experiment would be to show comments to people who haven't seen these conclusions, have them assess whether they suspect the comments of being bot- or AI-authored, and then correlate that with account age.