
OK, finally there is something to actually discuss. This discussion should have happened before the policy was decided, and the policy should have been decided with community-elected moderators' input rather than imposed by fiat ─ and the real policy remains secret ─ so let's not celebrate too much.

But now at least we can have a discussion about the factual basis for the policy, while not forgetting that the strike is about the imposition of the policy against the community's wishes and without our feedback, the differing public and private versions of the policy, and the slander against moderators, not about disputes over what this data implies.


At the 0.50 detection threshold, around 1-in-5.5 posts are falsely detected. At the 0.90 detection threshold, around 1-in-13 posts are falsely detected.

OK. How many moderators are trusting this tool at thresholds of 50% or 90%? I think pretty much everyone knows that these tools are completely useless at such low thresholds.

While it is theoretically possible to achieve better baseline error rates than 1-in-20 by picking higher thresholds, the efficacy of the detector may fall off considerably. A detector that does not produce false positives is no good if it also produces no true positives.

"Theoretically" is a strange word to use here. It is possible to achieve better baseline error rates than 1-in-20 by picking higher thresholds, i.e. thresholds above 97%. The data you have presented shows that this is empirically true, not just theoretically.

Whether or not a threshold of 97+% means the tool misses a lot of true positives is irrelevant. It makes no sense to forbid the use of a tool just because it misses a lot of true positives. Lateral flow tests for COVID-19 can miss 20-80% of true positive cases; that just means we can't (and don't) rely solely on LFTs. It doesn't mean we should ban LFTs.

Also, moderators have said clearly that they do not rely exclusively on this tool, or any GPT detection tool. We should expect that the false positive rate for moderator decisions, which are based on multiple kinds of evidence, should be significantly lower than the false positive rate for just one kind of evidence.
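
As a toy illustration of why this matters (the individual rates and the independence assumption below are mine, chosen purely for the sketch): if a moderator only acts when the detector and at least one independent signal both point the same way, the combined false positive rate is far lower than either signal alone.

```python
# Toy sketch: a suspension requires BOTH a high-threshold detector hit AND an
# independent corroborating signal (timing, formatting tells, identical answers
# posted in quick succession, etc.). Rates and independence are assumptions.
detector_fpr = 1 / 20        # detector at a strict threshold wrongly flags ~5% of human answers
other_evidence_fpr = 1 / 10  # independent signal wrongly present in ~10% of human answers

combined_fpr = detector_fpr * other_evidence_fpr  # both must fire before suspending
print(f"{combined_fpr:.2%}")  # 0.50%, an order of magnitude below the detector alone
```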

At this point, we can’t endorse usage of this service either as a tool for discriminating AI-generated posts or as a tool for validating suspicions.

Nobody is asking you to endorse it. What we're asking is for you to let us choose how our own communities are moderated.


Over the last few months, folks within the company have been working to answer the question, “What has been taking place in the data coming out of Stack Overflow since GPT’s release?”

(Emphasis mine.) Stack Overflow is the largest Stack Exchange site, but that means it is not representative of other Stack Exchange sites. Even if this data really does show a need for an extremely permissive policy which de facto allows users to plagiarise AI answers, it at most shows that need for Stack Overflow, not for all Stack Exchange sites.

In total, the rate at which frequent answerers leave the site quadrupled since GPT’s release.

This graph looks pretty noisy, so I assume there are some wide error bars around that "quadrupled" figure. For the sake of argument let's say it's roughly correct, though. So why are the more active users leaving?

Are there just fewer questions? ─ Yes, there are. You attempt to rule this out as a factor, because the "number of questions per frequent answerer" has gone up, not down. But this is not surprising, and doesn't imply anything.

Imagine there are 100 questions per day, 50 nerds who answer one question per day, and 10 hypernerds who answer five each. Now imagine the number of questions goes down to 60 per day. The hypernerds are more affected by the question shortage, so suppose they're more likely to leave; say 30% of the nerds and 50% of the hypernerds leave. There are now 35 nerds and 5 hypernerds, and all 60 questions per day are still getting answered. The number of questions per hypernerd has gone up from 10 to 12, but there aren't more questions available to be answered. So this number rising simply doesn't imply that there are enough "available" questions to retain more of the hypernerds.
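
A quick check of that arithmetic, using exactly the toy numbers above:

```python
# Toy numbers from the example above: total questions fall, every question is
# still answered, and yet questions-per-hypernerd goes up.
questions_before, questions_after = 100, 60
nerds_before, hypernerds_before = 50, 10         # answering 1/day and 5/day respectively

nerds_after = round(nerds_before * (1 - 0.30))            # 35 stay
hypernerds_after = round(hypernerds_before * (1 - 0.50))  # 5 stay

answers_after = nerds_after * 1 + hypernerds_after * 5
print(answers_after == questions_after)      # True: all 60 questions still answered
print(questions_before / hypernerds_before)  # 10.0 questions per hypernerd before
print(questions_after / hypernerds_after)    # 12.0 after, despite 40% fewer questions
```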

The only assumption I've made is that hypernerds are more likely to become uninterested if there aren't enough questions for them, whereas regular nerds aren't there only to answer questions so a question shortage is less likely to motivate them to leave. Seems pretty plausible to me, and it's consistent with the trends in your data (on questions per hypernerd, and proportion of answers written by hypernerds), so we can't rule it out using this data.

Has the quality of the questions gone down? ─ Perhaps the hypernerds are more motivated by having interesting questions to answer, whereas the regular nerds just answer a question every now and then if they happen to notice it. Unfortunately your data says nothing about question quality, but anecdotally this is the reason I have stopped writing so many answers on Stack Overflow. I could speculate on a few reasons why question quality might have fallen since the advent of ChatGPT ─ perhaps people who know how to write a good question are more likely to get a satisfactory response from ChatGPT themselves, and therefore don't need to ask Stack Overflow ─ but in any case a drop in question quality can't be blamed on the AI moderation policy.


After we allowed GPT suspensions on first offense, 6.6% of users who posted >2 answers in a given week were suspended within three weeks, a 16-fold increase.

This claim is also a non sequitur, because you include AI plagiarists among the group of people "trying to actively participate in" Stack Overflow, but plagiarising answers from an AI is not what it means to participate in Stack Overflow. If all of those users who get suspended are indeed AI plagiarists, then the correct percentage of suspensions among people who are trying to actively participate in the community is 0%, not 7%. So your argument here is wholly uncompelling.

These suspensions only negatively affect the community if they are false positives. You have given some data on the false positive rate of AI tools, but not about the false positives for moderator decisions to suspend users (which, again, are made on the basis of multiple kinds of evidence).

Additionally, this metric ─ users who posted >2 answers in a given week, and were suspended within three weeks ─ seems suspiciously precise. Why is >2 answers the cutoff? Why is 1 week the period in which those answers were written, and why is 3 weeks the period in which they were suspended? It smells of cherry-picking to me. How robust is this finding to changes in the metric?

Instead suppose that no more than 1-in-50 of the people who were suspended for GPT usage were not actually using GPT. In order for this to be true, a large volume of users would have needed to immediately convert from being regular users to ChatGPT users;

No, that does not follow. It can be true if there is a steady flow of new users who use ChatGPT, or if users suspended for ChatGPT use return to the site after their 7-day suspension and then post more ChatGPT answers, or if not everyone who uses ChatGPT gets caught immediately. All three are very plausible.
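
To see how easily this happens, here is a deliberately crude weekly model (all numbers are mine and purely illustrative): a steady trickle of new accounts posting ChatGPT answers, plus some suspended users returning after their suspensions, keeps the weekly suspension count high indefinitely without a single existing regular user "converting".

```python
# Crude toy model: fresh GPT-posting accounts arrive each week, a fraction of
# suspended users come back and reoffend, and only some active GPT posters are
# caught in any given week. All parameters are illustrative assumptions.
new_gpt_users_per_week = 50   # brand-new accounts posting GPT answers
return_rate = 0.4             # fraction of suspended users who come back and reoffend
catch_rate = 0.7              # fraction of active GPT posters caught in a given week

active_gpt, suspended_last_week = 0, 0
for week in range(1, 9):
    active_gpt += new_gpt_users_per_week + round(return_rate * suspended_last_week)
    caught = round(catch_rate * active_gpt)
    active_gpt -= caught
    suspended_last_week = caught
    print(f"week {week}: {caught} suspensions")  # rises toward ~80/week with no "conversions"
```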

this value alone rings a deafening number of alarm bells for potential false positive detections and contributor loss alike.

This 7% figure implies nothing about the rate of false positives, because it uses the wrong denominator. Many users who don't use ChatGPT and don't get suspended for suspected ChatGPT use are not in the ">2 answers per week" category, but the number of those users obviously matters for the false positive rate.
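
To spell that out with toy numbers (all hypothetical): the false positive rate is the number of innocent users wrongly suspended divided by all innocent users under consideration, and the choice of that denominator dominates the result, so the 7% figure on its own pins down neither quantity.

```python
# Hypothetical numbers, purely to show that the denominator choice dominates.
wrongly_suspended = 10            # innocent users caught up in GPT suspensions (unknown in reality)
innocents_in_cohort = 900         # innocent users who posted >2 answers in a week
innocents_outside_cohort = 20000  # innocent users answering less often, excluded from the 7% figure

print(f"{wrongly_suspended / innocents_in_cohort:.2%}")                               # ~1.11%
print(f"{wrongly_suspended / (innocents_in_cohort + innocents_outside_cohort):.2%}")  # ~0.05%
```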

Likewise, this figure doesn't imply anything about loss of legitimate contributors unless we already accept the doubtful claim about false positives.

What follows is the internal ‘gold standard’ for how we measure GPT posts on the platform [...] In principle, if people are copying and pasting answers out of services like GPT, then they won’t save as many drafts as people who write answers within Stack Exchange.

This metric doesn't seem fit for the present purpose. Firstly, it's an absolute number, whereas we already know that the absolute numbers of questions and answers have been falling, so the absolute number of ChatGPT answers should be expected to fall alongside them. To support your argument about false positive rates, we need the proportion of answers written by ChatGPT, not the total number.
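
A trivial illustration with hypothetical numbers: if the overall answer volume drops, the absolute count of GPT-suspect answers drops with it even when the proportion of GPT answers is completely unchanged.

```python
# Hypothetical numbers: overall answer volume falls, the GPT share stays the
# same, and the absolute GPT count falls anyway.
total_answers_before, total_answers_after = 10_000, 6_000
gpt_share = 0.05  # unchanged proportion of GPT-suspect answers

print(total_answers_before * gpt_share)  # 500 GPT-suspect answers before
print(total_answers_after * gpt_share)   # 300 after, with no change in plagiarist behaviour
```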

Additionally, it assumes that the behaviour of ChatGPT plagiarists has not changed over time. But this is an unreasonable assumption, because as Stack Overflow's policy on ChatGPT answers became more widely known ─ and people's 7-day suspensions for posting ChatGPT answers expired ─ we should expect that the AI plagiarists' behaviour changed to try to avoid getting caught. More sophisticated plagiarists will change a few words, delete "fluff" sentences which don't contribute to answering the question, introduce intentional spelling or grammar mistakes, and so on; these are smaller edits, so they will tend to make answers look "less ChatGPT-like" in the ratio of small to big edits.

So this metric going down does not really indicate that ChatGPT use has gone down.

This metric is sensitive to noise, but was validated against other metrics early on at the peak of the GPT answer rate.

The fact that it was validated early does not mean it remains valid, since AI plagiarists' behaviour should be expected to change over time.

The following chart shows the expected % of answers posted in a given week that are GPT-suspect.

There is no "following chart" ─ perhaps you intended to include a chart here but it got lost while editing?

Based on the data, we would hazard a guess that Stack Overflow currently sees 10-15 GPT answers in the typical day, or 70-100 answers per week. [...] could it be the case that roughly 7% of frequent answerers on the site are still posting via ChatGPT? If this were the case, the site should be seeing at least 330 GPT answers per week, but the rate estimate is not close.

The estimate of 70-100 per week is probably a significant underestimate for the previously-stated reasons. It's quite plausible that 330 per week is the correct number.


It is exceptionally strange for us to look at a moderator’s action and find ourselves unable to verify it – yet this is the situation we are frequently in with respect to GPT.

You haven't said why you were unable to verify these moderator actions. Is it because moderators have not provided sufficient information about how they have made their judgements, or is it because you don't agree with those judgements?

If it's the former, that would be a basis for you to tell moderators that they need to provide more information when they suspend users for suspected ChatGPT plagiarism; if it's the latter, then that would again be something to discuss with moderators. Either way, it doesn't support the new policy which practically forbids almost all such suspensions.

Instead, the most we can do is state that we just can’t tell. We lack the tools to verify wrongdoing on the part of a user who has been removed, messaged, or had their content deleted, and this is a serious problem.

You may not be able to tell, but I remain entirely unconvinced that the moderators issuing these suspensions can't tell. The data you've presented here doesn't support that conclusion.

Is it still possible that the proportion of false positives is small? Maybe so – it can’t be completely eliminated at this time. [... But] it would require some very strange user behavior en masse around answering, by users who were otherwise answering questions normally. These are behaviors we do not have an organic explanation for after months of exploration

Perhaps if you had discussed this with the moderators, they might have been able to offer some explanations which you failed to consider.

It looks really bad for you to impose a policy by fiat, keep the justification for the policy secret for a week, and then admit that the basis for the policy is that you couldn't think of any other explanations for this data ─ when you never involved the people with direct experience of the issue in your attempt to understand that data. I'm sorry, but that is not behaviour I associate with a good-faith effort to reach the truth.

What we know, right now, is that the current situation is untenable. We have real, justified concerns for the survival of the network.

It's a shame that in this sentence, the "current situation" you're referring to is the rate of users leaving Stack Overflow, rather than the behaviour of Stack Exchange, Inc. which has caused this strike action.

It's our community, so we are at least as concerned as you about threats to the community's continuing viability. Unfortunately, the behaviour of Stack Exchange, Inc. is currently the greatest threat to our community's continued existence, and while publishing this data and your analysis is welcome, it neither demonstrates an understanding of why we are on strike nor addresses any of the strikers' demands.
