5

The Database schema documentation for the public data dump and SEDE lists ParentId among the things which are present in the PostsWithDeleted table. However, if you check the deleted answers, the ParentId is present only if the parent question is not deleted.

For example, this query returns no results. If we have no restriction that the question has to be deleted, the we get results - however, only in such cases where the question is not deleted.

  • Is it intentional that the ParentId is not stored in SEDE if the question is deleted?
  • Did SEDE always behave this way? Was it always the case that ParentId was present only if the question was not deleted?

In the past I thought that ParentId was stored in all cases and I used this in some queries.

1 Answer 1

7

It is intentional, but not in the way that you probably think.

We populate SEDE with the same set of views used to populate the data dump. There is a view called Posts_<date> (which is used to achieve midnight cutoff, and includes deleted posts) and then a view that derives from that called PostsButNotDeleted_<date> (which, not surprisingly, filters out all deleted posts). This is so that we don't have to repeat all the joining and filtering logic in multiple places, and allows us to easily create Posts and PostsWithDeleted in SEDE (based on Posts_<date>), and Posts in the data dump (based on PostsButNotDeleted_<date>).

There was an effort (that pre-dates the distribution change) to ensure that the output in the data dump is "complete," i.e., doesn't contain posts that reference posts or users that don't exist, comments that reference posts or users that don't exist, and so on. The logic to prevent that (for posts) is simple:

... AcceptedAnswerId = CASE WHEN EXISTS ( SELECT 1 FROM Posts WHERE Id = p.AcceptedAnswerId AND DeletionDate IS NULL ) THEN p.AcceptedAnswerId END, ParentId = CASE WHEN EXISTS ( SELECT 1 FROM Posts WHERE Id = p.ParentId AND DeletionDate IS NULL ) THEN p.ParentId END, ... 

This logic doesn't care if the child answer or parent question has been deleted independently or if it has been deleted because the parent question was deleted.

The priority at the time was to prevent these dangling posts (or references to them) from appearing in the data dump output. But this impacts SEDE by definition, which was not intentional.

This could potentially be revisited, but it would require discovery and prioritization.

1
  • Thanks for the response. Based on your answer, it seems that there is a chance that at some point in the past this used to be different. (So perhaps I did not remember this completely incorrectly.| And congratulations to the bronze data-explorer badge. Commented Jun 18 at 10:39

You must log in to answer this question.

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.