0

I am using DSpace version 6.3 here. I discovered that I have created duplicate entries when performing a batch import. Using the SQL query from this answer, I managed to list all the duplicates in a given field. For this example, I am using the dc.subject field (metadata_field_id=57) to list items (dspace_object_id) that have duplicate values in dc.subject field.

Below is the query that I used:

SELECT metadata_value_id, dspace_object_id, text_value FROM (SELECT *, COUNT(*) OVER (PARTITION BY dspace_object_id, text_value) AS cnt FROM metadatavalue where metadata_field_id=57) e WHERE cnt > 1 

Below is the sample list generated from that query:

metadata_value_id dspace_object_id text_value
503018 13f07109-7797-4d5b-a8bd-1f9e91a2433d pompanos
503021 13f07109-7797-4d5b-a8bd-1f9e91a2433d pompanos
503217 233d1e67-b698-4e90-8e70-a175776b2d80 pests
503219 233d1e67-b698-4e90-8e70-a175776b2d80 pests
83574 47753988-fc2a-4416-b20d-acbff6e256de Penaeus monodon
10800 47753988-fc2a-4416-b20d-acbff6e256de Penaeus monodon
531520 50965923-bc65-4fdf-af61-2c8debdfe057 Penaeus monodon
531521 50965923-bc65-4fdf-af61-2c8debdfe057 Penaeus monodon
483882 57d0544c-1825-431a-acf9-eb835c24920b development
483879 57d0544c-1825-431a-acf9-eb835c24920b development

Based on the table above, you can see that there are 2 occurrences (others have more than 2) of a text_value in the same item (dspace_object_id).

My question is how can I delete the duplicate but retain the first occurrence? In the table above, rows with the following metadata_value_id should be deleted:

503021 503219 83574 <-- This should be deleted instead of 10800 because 10800 was put in (inserted) first. 531521 483882 <-- Should be deleted instead of 483879 because of the same reason above. 

I'm hoping to do this via SQL query alone because what I have done before is import this list to a Google spreadsheet, remove the duplicates there using a Remove Duplicates add-on, download it as a CSV, and then use that CSV to update the metadatavalue table.

2 Answers 2

1

Assuming that the 'first occurrence' means the one with smaller metadata_value_id:

delete from metadatavalue as Not1stOcc where exists (select * from metadatavalue as FirstOcc where FirstOcc.text_value = Not1stOcc.text_value and FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id ) 

should work

Sign up to request clarification or add additional context in comments.

Comments

1

For my future reference, I modified tinazmu's answer that worked for my use case.

DELETE FROM metadatavalue as Not1stOcc WHERE EXISTS (SELECT metadata_value_id, dspace_object_id, text_value, metadata_field_id FROM metadatavalue AS FirstOcc WHERE FirstOcc.text_value = Not1stOcc.text_value AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id AND FirstOcc.dspace_object_id = Not1stOcc.dspace_object_id AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id AND metadata_field_id = 57 AND FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id ) 

Comments

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.