How to delete duplicate entries from a specific field by SQL query in DSpace?

Question

I am using DSpace version 6.3 here. I discovered that I have created duplicate entries when performing a batch import. Using the SQL query from this answer, I managed to list all the duplicates in a given field. For this example, I am using the dc.subject field (metadata_field_id=57) to list items (dspace_object_id) that have duplicate values in dc.subject field.

Below is the query that I used:

SELECT metadata_value_id, dspace_object_id, text_value FROM (SELECT *, COUNT(*) OVER (PARTITION BY dspace_object_id, text_value) AS cnt FROM metadatavalue where metadata_field_id=57) e WHERE cnt > 1

Below is the sample list generated from that query:

metadata_value_id	dspace_object_id	text_value
503018	13f07109-7797-4d5b-a8bd-1f9e91a2433d	pompanos
503021	13f07109-7797-4d5b-a8bd-1f9e91a2433d	pompanos
503217	233d1e67-b698-4e90-8e70-a175776b2d80	pests
503219	233d1e67-b698-4e90-8e70-a175776b2d80	pests
83574	47753988-fc2a-4416-b20d-acbff6e256de	Penaeus monodon
10800	47753988-fc2a-4416-b20d-acbff6e256de	Penaeus monodon
531520	50965923-bc65-4fdf-af61-2c8debdfe057	Penaeus monodon
531521	50965923-bc65-4fdf-af61-2c8debdfe057	Penaeus monodon
483882	57d0544c-1825-431a-acf9-eb835c24920b	development
483879	57d0544c-1825-431a-acf9-eb835c24920b	development

Based on the table above, you can see that there are 2 occurrences (others have more than 2) of a text_value in the same item (dspace_object_id).

My question is how can I delete the duplicate but retain the first occurrence? In the table above, rows with the following metadata_value_id should be deleted:

503021 503219 83574 <-- This should be deleted instead of 10800 because 10800 was put in (inserted) first. 531521 483882 <-- Should be deleted instead of 483879 because of the same reason above.

I'm hoping to do this via SQL query alone because what I have done before is import this list to a Google spreadsheet, remove the duplicates there using a Remove Duplicates add-on, download it as a CSV, and then use that CSV to update the metadatavalue table.

tinazmu · Accepted Answer · 2022-05-05 03:50:47Z

Assuming that the 'first occurrence' means the one with smaller metadata_value_id:

delete from metadatavalue as Not1stOcc where exists (select * from metadatavalue as FirstOcc where FirstOcc.text_value = Not1stOcc.text_value and FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id )

should work

euler · Accepted Answer · 2022-05-05 08:44:46Z

For my future reference, I modified tinazmu's answer that worked for my use case.

DELETE FROM metadatavalue as Not1stOcc WHERE EXISTS (SELECT metadata_value_id, dspace_object_id, text_value, metadata_field_id FROM metadatavalue AS FirstOcc WHERE FirstOcc.text_value = Not1stOcc.text_value AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id AND FirstOcc.dspace_object_id = Not1stOcc.dspace_object_id AND FirstOcc.metadata_field_id = Not1stOcc.metadata_field_id AND metadata_field_id = 57 AND FirstOcc.metadata_value_id < Not1stOcc.metadata_value_id )

Collectives™ on Stack Overflow

How to delete duplicate entries from a specific field by SQL query in DSpace?

2 Answers 2

Comments

Comments

Linked

Hot Network Questions

Collectives™ on Stack Overflow

2 Answers 2

Comments

Comments

Linked

Related