28

I noticed that in the latest data dump, the Comments.xml files do not provide a ContentLicense for any of the rows, even though it is available in SEDE.

Here is an example of looking at the first few records in the tezos.meta site (just as an example):

[Image: side-by-side of data dump and SEDE]

As you can see, the data dump on the right does not have the ContentLicense attribute anywhere.
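For anyone who wants to check their own copy of a dump rather than eyeball it, here is a minimal sketch that counts how many rows lack the attribute. The function name `count_missing_license` and the two-row sample are made up for illustration; the sketch assumes only that each record is a `row` element and that the license, when present, is an attribute named `ContentLicense`.

```python
import io
import xml.etree.ElementTree as ET


def count_missing_license(source):
    """Return (total_rows, rows_without_ContentLicense).

    `source` may be a file path or a binary file-like object.
    """
    total = 0
    missing = 0
    # iterparse streams the file, so a large dump need not fit in memory.
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "row":
            total += 1
            if "ContentLicense" not in elem.attrib:
                missing += 1
            elem.clear()  # release the row once it has been counted
    return total, missing


# Hypothetical sample; the attributes in a real Comments.xml may differ.
SAMPLE = b"""<?xml version="1.0" encoding="utf-8"?>
<comments>
  <row Id="1" PostId="1" Text="first" ContentLicense="CC BY-SA 4.0" />
  <row Id="2" PostId="1" Text="second" />
</comments>"""

print(count_missing_license(io.BytesIO(SAMPLE)))  # prints (2, 1)
```

Passing a path such as `"Comments.xml"` instead of the in-memory sample should work the same way, since `iterparse` accepts either.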

8
  • 8
    The ContentLicense column was also present in both the December 2023 and March 2024 data dumps, the latter prior to the changes to how the data dump was generated. The bug was likely introduced alongside whatever changes were made to the generation process Commented Jul 31, 2024 at 13:50
  • 2
    Which data dump are you talking about exactly? The March 2024 one? Commented Jul 31, 2024 at 17:17
  • 5
    @CesarM This is about the data dump files published in April. Commented Jul 31, 2024 at 17:20
  • 1
    @CesarM The April data dumps have the issue described in the question. All of the data dumps have the issue described in my answer. Commented Aug 2, 2024 at 11:50
  • @ThomasOwens if all the dumps have the issue you describe then even the "period to cure the violation" defined by ver 4.0 has probably been over 6-8 something ago. Commented Aug 2, 2024 at 14:13
  • @SPArcheon-onstrike 4.0 gives 30 days from the discovery. So the violation of the data dump was discovered on 31 July. The violation in the API was discovered today, 2 August. And there are also questions about commercial offerings that I can't see that were raised today, 2 August. Although 2.5 and 3.0 have no period to cure the violation - it's immediate. However, it can exist on express reinstatement. So if it does automatically terminate, I can choose to reinstate it. My understanding is that I can also waive that and opt not to do anything if I'm satisfied with the remediation. Commented Aug 2, 2024 at 16:40
  • oh, so it starts counting from the discovery date. Thanks for the explanation @ThomasOwens Commented Aug 2, 2024 at 16:41
  • 3
    Moving this to [status-completed] as the bug described in the post is resolved per answer below Commented Jan 31 at 21:04

2 Answers

15

There is an even bigger problem here, and one that extends beyond the comment data's ContentLicense field.

Content on Stack Exchange sites is licensed under one of three CC BY-SA licenses, depending on when it was posted. All three licenses require including either a copy of or a URI for the license. In 2.5 and 3.0, this requirement is stated in 4(a). In 4.0, this requirement is Section 3(a)(1)(A)(v). In 4.0, Section 3(a)(2) does say that the conditions may be satisfied "in any reasonable manner based on the medium, means, and context" in which the content is shared, but given the medium and means, providing a URI is not unreasonable.

When any of the ContentLicense fields are populated, the field contains only a short name for the license ("CC BY-SA 4.0", "CC BY-SA 3.0", "CC BY-SA 2.5"). To be compliant, the field should contain a URI. The URI should probably be the canonical URL, so one of https://creativecommons.org/licenses/by-sa/2.5/, https://creativecommons.org/licenses/by-sa/3.0/, or https://creativecommons.org/licenses/by-sa/4.0/.

There may be other ways to satisfy this requirement. Creative Commons provides all three licenses in plain text as well as RDF/XML files. There may be ways to distribute these licenses as part of the data dump. However, this does seem unnecessarily verbose. Putting the URI into the readme.txt that explains the fields may also satisfy the letter of the requirements. As long as there are clear and explicit URIs to all three versions and clarity on how to go from the ContentLicense field to the full text of the license, I believe the requirements would be met.
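To illustrate what "going from the ContentLicense field to the full text of the license" could look like for a downstream consumer, here is a minimal sketch. It assumes the field holds exactly one of the three short names quoted above; `LICENSE_URIS` and `license_uri` are hypothetical names, not anything shipped with the dump.

```python
# Short names as they appear in the ContentLicense field, mapped to the
# canonical Creative Commons URLs listed in this answer.
LICENSE_URIS = {
    "CC BY-SA 2.5": "https://creativecommons.org/licenses/by-sa/2.5/",
    "CC BY-SA 3.0": "https://creativecommons.org/licenses/by-sa/3.0/",
    "CC BY-SA 4.0": "https://creativecommons.org/licenses/by-sa/4.0/",
}


def license_uri(content_license):
    """Return the canonical URI for a short name, or None if unrecognized."""
    return LICENSE_URIS.get(content_license.strip())


print(license_uri("CC BY-SA 3.0"))
```

Returning `None` for an unknown value, rather than guessing, keeps a consumer from attaching the wrong license to a record if a new short name ever shows up.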

I should also note that this same issue applies to the API. The API's content_license field is also one of CC BY-SA 4.0, CC BY-SA 3.0, or CC BY-SA 2.5, depending on the date the content was posted or last edited. I don't see anything in the API or in the API documentation that provides the required URIs for content licensed using 2.5 or 3.0, and I would suspect that it would be reasonable to expect the URI for 4.0 given the medium and means of transmitting the content. It may be worth questioning whether the OverflowAPI product suffers from a similar problem, delivering improperly marked content to downstream users.

However, something else that I've noticed is that the license for the data dump is only at the top level. The individual 7z files for the sites do not contain any information about the license of the data dump as a whole. Although one could argue that, in the current scheme, the data dump is the set of 7z files, this argument may not hold after the changes to the data dump where users must request a download individually.

I'd also point out that this could be a good opportunity to update the license of the data dump. The data dump is currently licensed CC BY-SA 3.0. The collection could be licensed CC BY-SA 4.0 instead. Creative Commons typically recommends using later versions of the license where possible.

18
  • 5
    "To be compliant, the field should contain a URI." I don't think this is actually necessary, if all the URIs are provided in license.txt. If there were a License table containing all necessary information, and the comment field contained FOREIGN KEY references to it, I'd think that'd be fine too. The work as a whole would contain sufficient attribution. Commented Jul 31, 2024 at 14:16
  • I do agree that putting the URIs in a readme or license file may be sufficient. I don't like it as much, but I think it satisfies the letter of the requirements. As long as those files are part of the distribution. Changing to the per-site instead of the set of all sites would require putting the license and readme files in different places. Commented Jul 31, 2024 at 14:19
  • 1
    Even on archive, I can download any given site's data dump without even knowing there's a license file or a readme (they just happen to be in the same folder). So, I don't know that anything changes in the new scheme, except that they should probably link to at least the license file on the download page. The readme is just a description of the schema, that would arguably be better as a URL pointing here. Commented Jul 31, 2024 at 15:18
  • @testing-for-ya But what if Stack Exchange goes down? Then you need the dump to tell you how to read the dump. Commented Jul 31, 2024 at 16:36
  • @wizzwizz4 You would also have to assume that archive.org simultaneously disappeared? Or that nobody with the dump would be capable of re-publishing the material from that post in the event both sites disappeared? Commented Jul 31, 2024 at 16:57
  • 1
    @wizzwizz4 just FYI, the license.txt file only links to CC BY-SA 3.0, so the requirement is not satisfied even from that aspect. Commented Jul 31, 2024 at 18:07
  • Just wondering.... with all the ongoing discussion about the company granting itself a secondary license to the content in the site ToS, did they ever claim to be distributing the dumps under CC BY-SA? Could they be implicitly using their secondary license instead? Commented Aug 1, 2024 at 9:40
  • @SPArcheon-onstrike If there is a secondary license or not, it doesn't matter. The Terms of Service do state that "the Creative Commons Data Dump is licensed under the CC BY-SA license". The data dump may be a collection or compilation, which means that it can be licensed however the company wants. They could release the data dump CC BY-NC-SA or even all rights reserved. However, the individual works in the dump will remain under some version of CC BY-SA, unless or until the ToS changes the licenses we grant SE upon posting content and it would only affect work posted after that change. Commented Aug 1, 2024 at 9:52
  • @SPArcheon-onstrike The secondary license also doesn't give the SE the right to relicense or sublicense. So if they did distribute (or export or display) under this license and not CC BY-SA, the recipient wouldn't get any rights to copy, distribute, display, etc. For our content to be useful and usable by downstream recipients, it must be distributed under the CC BY-SA license. Commented Aug 1, 2024 at 9:55
  • @ThomasOwens If I get your post right, then the (current?) format of the dump actually doesn't fulfill the requirements of the CC BY-SA license. If there was a claim that the dump is indeed distributed under CC BY-SA, then is this an actual violation?? Commented Aug 1, 2024 at 10:11
  • @SPArcheon-onstrike There is the license of the data dump and the license of content. The license of the dump doesn't matter at all in this case. It could be all rights reserved and this would still be an issue. Every work distributed under any CC license must have a URI back to the license or the full text of the license. There is no URI back to the license associated with the questions, answers, and comments in the data dump. Commented Aug 1, 2024 at 10:14
  • @ThomasOwens isn't that my point? If the dump content is distributed under CC then it must follow the CC guidance and the (current?) format doesn't fulfill those requirement. But if the dump is distributed under the secondary license Stack seems to have then they are free to choose any format, and the ContentLicense column would be just a metadata they choose to provide for clarity sake that doesn't strictly have to follow any guidance. The "record" in the dump would be licensed under Stack own license, the fact that it is also available elsewhere as CC would not matter imho Commented Aug 1, 2024 at 10:44
  • @SPArcheon-onstrike If you're proposing SE to release our content under the secondary license from the ToS (assuming that one exists), that would also fix this issue. SE would need to remove the implication that the content is available CC BY-SA by removing the ContentLicense column and updating any other documentation with the dump. But I'm honestly not sure what it would mean to have the rights to share and adapt the compilation but have no rights to any of the works within the compilation. Unless the use was protected by fair use, I think that scheme would make the data dump useless. Commented Aug 1, 2024 at 11:45
  • 2
    @SPArcheon-onstrike Yes. This is a violation. If they are resharing under the CC BY-SA license that we, as contributors, give SE, then SE must pass along that license to downstream recipients via a URI to the license or the full text of the license. Based on the content of the dump, they are indeed trying to reshare it under the CC BY-SA license. 2.5 and 3.0 are terminated automatically upon breach. 4.0 has a period to cure the violations. Anyone would have to consult a lawyer to determine next steps. Commented Aug 1, 2024 at 12:11
  • 2
    As a community we seem to be collectively angrier about this change than we were about them pulling it last year. I worry that if we try to nit-pick, nail them on technicalities, and threaten legal action, an easier answer for them will be to just stop distribution altogether (and let's not tempt them; we know that's been on their mind). They don't have to provide a dump, but it's cheaper than making us scrape; turning the dump off and making us scrape is cheaper than going to court. What would legal action do other than kill the dump and/or bankrupt / kill the site? Who would that help? Commented Aug 2, 2024 at 14:53
15

We've added back the ContentLicense column to the Comments.xml file, so this shouldn't be an issue going forward.

As for Thomas' answer, we've looked into it. While the license.txt file has contained a link to CC-BY-SA 3.0, and we included a link to CC-BY-SA 4.0 on the Internet Archive page (you can download the license.txt file on that same page), we decided to make the links more explicitly accessible by adding them to the XML files, wrapped in a comment. Going forward you'll see this in the XML files:

    <?xml version="1.0" encoding="utf-8"?>
    <!--
    ContentLicense
    CC BY-SA 2.5 - Url: https://creativecommons.org/licenses/by-sa/2.5/
    CC BY-SA 3.0 - Url: https://creativecommons.org/licenses/by-sa/3.0/
    CC BY-SA 4.0 - Url: https://creativecommons.org/licenses/by-sa/4.0/
    -->
    <posthistory>
      <row Id="...>
    </posthistory>

This should fix both the issue in the question as well as the one raised on Thomas' original answer. As for the API, I'll also look into it and edit this answer when I have an update.
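Since most XML parsers discard comments by default, a consumer who wants the URLs from that header may need to read them separately. Here is a rough sketch that pulls them out with regular expressions, assuming the comment keeps the `<short name> - Url: <url>` layout shown in this answer. The function name and the sample row are hypothetical.

```python
import re

# First XML comment in the text, across line breaks.
COMMENT_RE = re.compile(r"<!--(.*?)-->", re.DOTALL)
# One "<short name> - Url: <url>" entry inside that comment.
ENTRY_RE = re.compile(r"(CC BY-SA \d\.\d) - Url: (\S+)")


def license_urls_from_header(xml_text):
    """Map license short names to the URLs found in the first XML comment."""
    match = COMMENT_RE.search(xml_text)
    if match is None:
        return {}
    return dict(ENTRY_RE.findall(match.group(1)))


# Hypothetical sample following the layout above; the row is made up.
SAMPLE_HEADER = """<?xml version="1.0" encoding="utf-8"?>
<!--
ContentLicense
CC BY-SA 2.5 - Url: https://creativecommons.org/licenses/by-sa/2.5/
CC BY-SA 3.0 - Url: https://creativecommons.org/licenses/by-sa/3.0/
CC BY-SA 4.0 - Url: https://creativecommons.org/licenses/by-sa/4.0/
-->
<posthistory>
  <row Id="1" ContentLicense="CC BY-SA 4.0" />
</posthistory>"""

for name, url in license_urls_from_header(SAMPLE_HEADER).items():
    print(name, "->", url)
```

For a real dump file it should be enough to read only the first few kilobytes before calling the function, since the comment sits ahead of the root element.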

1
  • 10
    Re: the code change from <ContentLicenses> to an XML comment (the edits in this answer). Folks pointed out that the proposed format was invalid XML and would break parsers and not be read properly. We changed it to be wrapped in an XML comment to avoid that issue. Commented Aug 2, 2024 at 19:15
