
I have two Spark DataFrames which I need to join. I only want to select values from df2 that are present in df1, and there shouldn't be repeated rows.

For example:

df1:

+-------------+---------------+----------+
|a            |b              |val       |
+-------------+---------------+----------+
| 202003101750|   202003101700|1712384842|
| 202003101740|   202003101700|1590554927|
| 202003101730|   202003101700|1930860788|
| 202003101730|   202003101600|    101713|
| 202003101720|   202003101700|1261542412|
| 202003101720|   202003101600|   1824155|
| 202003101710|   202003101700| 912601761|
+-------------+---------------+----------+

df2:

+-------------+---------------+
|a            |b              |
+-------------+---------------+
| 202003101800|   202003101700|
| 202003101800|   202003101700|
| 202003101750|   202003101700|
| 202003101750|   202003101700|
| 202003101750|   202003101700|
| 202003101750|   202003101700|
| 202003101740|   202003101700|
| 202003101740|   202003101700|
+-------------+---------------+

I am doing the following:

df1.join(df2, Seq("a", "b"), "leftouter").where(col("val").isNotNull)

But my output has several repeated rows.

+-------------+---------------+----------+
|a            |b              |val       |
+-------------+---------------+----------+
| 202003101750|   202003101700|1712384842|
| 202003101750|   202003101700|1712384842|
| 202003101750|   202003101700|1712384842|
| 202003101750|   202003101700|1712384842|
| 202003101740|   202003101700|1590554927|
| 202003101740|   202003101700|1590554927|
| 202003101740|   202003101700|1590554927|
| 202003101740|   202003101700|1590554927|
+-------------+---------------+----------+

I am trying to achieve an except-like operation on df1 with val dropped, but except doesn't seem to work. For example, the following is the desired operation:

df1.drop(col("val")).except(df2)

The schema of df1 is as follows:

root
 |-- a: string (nullable = true)
 |-- b: string (nullable = true)
 |-- val: long (nullable = true)

Also, what exactly is the difference between a left-outer join and except? Expected output:

+-------------+---------------+----------+
|a            |b              |val       |
+-------------+---------------+----------+
| 202003101750|   202003101700|1712384842|
| 202003101740|   202003101700|1590554927|
+-------------+---------------+----------+
  • Please add your expected output? Commented Apr 28, 2020 at 3:36
  • @Shu added comment. Can you please take a look? Commented Apr 28, 2020 at 3:46

2 Answers


A left-outer join returns all rows from the left table and the matching rows from the right table.

except returns the rows of the first DataFrame that do not exist in the second DataFrame (without duplicates).
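The difference is easy to see without a cluster. Below is a plain-Python sketch (ordinary lists of tuples, no Spark; the data mirrors a subset of the question's rows) of what each operation does:

```python
# Plain-Python sketch of the join semantics -- no Spark needed.
# df1 rows carry (a, b, val); df2 carries duplicated (a, b) key rows.
df1 = [
    ("202003101750", "202003101700", 1712384842),
    ("202003101740", "202003101700", 1590554927),
    ("202003101730", "202003101700", 1930860788),
]
df2 = [
    ("202003101750", "202003101700"),
    ("202003101750", "202003101700"),  # duplicate key row
    ("202003101740", "202003101700"),
]

# left-outer join + "val is not null" filter behaves like an inner join:
# each df1 row is emitted once PER matching df2 row -- hence the duplicates.
joined = [(a, b, v) for (a, b, v) in df1
          for (a2, b2) in df2 if (a, b) == (a2, b2)]

# except: distinct keys of df1 that do NOT appear in df2.
df2_keys = set(df2)
except_keys = [k for k in dict.fromkeys((a, b) for (a, b, _) in df1)
               if k not in df2_keys]

# left_semi: df1 rows whose key appears in df2, never duplicated.
semi = [(a, b, v) for (a, b, v) in df1 if (a, b) in df2_keys]
```

So a left_semi join (or an inner join followed by dropDuplicates) produces the expected output, while except answers the opposite question: which keys have no match.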

For your case you can use an inner (or outer) join with dropDuplicates.

df1.join(df2, Seq("a", "b"), "inner").dropDuplicates().show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+

df1.join(df2, Seq("a", "b"), "rightouter").where(col("val").isNotNull).dropDuplicates().show()
//+------------+------------+----------+
//|           a|           b|       val|
//+------------+------------+----------+
//|202003101740|202003101700|1590554927|
//|202003101750|202003101700|1712384842|
//+------------+------------+----------+

2 Comments

Are dropDuplicates and distinct the same operation? @Shu
@coderWorld, one difference exists: distinct applies to the whole DataFrame, whereas with dropDuplicates we can drop duplicates on specific columns (or on the whole DataFrame too)!
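In collection terms (a rough sketch, not the Spark implementation): distinct compares entire rows, while dropDuplicates on a subset of columns keeps one row per key.

```python
# Illustrative only: rows as plain tuples instead of a DataFrame.
rows = [("a1", "b1", 1), ("a1", "b1", 1), ("a1", "b1", 2)]

# distinct analogue: whole-row comparison, so ("a1", "b1", 2) survives.
whole_row = list(dict.fromkeys(rows))

# dropDuplicates("a", "b") analogue: one row kept per (a, b) key.
# (First row wins here; Spark keeps an arbitrary row per key.)
seen = set()
by_key = []
for a, b, v in rows:
    if (a, b) not in seen:
        seen.add((a, b))
        by_key.append((a, b, v))
```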

You can use the function dropDuplicates(), which removes all duplicated rows:

uniqueDF = df.dropDuplicates() 

Or you can specify the columns you want to match on:

uniqueDF = df.dropDuplicates(["a", "b"])

1 Comment

Use a "left_semi" or "left_anti" join type.
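For reference, left_semi and left_anti are complements: semi keeps the left rows whose key has a match on the right, anti keeps the ones without a match. A minimal collection-level sketch (illustrative data, not the Spark API):

```python
# Collection-level sketch of semi vs anti join semantics.
left = [("202003101750", 1712384842), ("202003101730", 101713)]
right_keys = {"202003101750"}

# left_semi analogue: matched left rows only (no right columns, no duplication)
semi = [row for row in left if row[0] in right_keys]
# left_anti analogue: left rows with no match on the right
anti = [row for row in left if row[0] not in right_keys]
```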
