Union in Spark SQL query removing duplicates from Dataset

Question

I am using the Java API for Apache Spark, and I have two Datasets, A and B. Both have the same schema: PhoneNumber, Name, Age, Address.

One record in each Dataset shares the same PhoneNumber, but the other columns in that record differ.

I run the following SQL query on these two Datasets (after registering them as temporary views):

    A.createOrReplaceTempView("A");
    B.createOrReplaceTempView("B");
    String query = "Select * from A UNION Select * from B";
    Dataset<Row> result = sparkSession.sql(query);
    result.show();

Surprisingly, the result has only one record with that PhoneNumber; the other is removed.

I know that UNION in a SQL query is intended to remove duplicates, but to do so it needs to know the primary key by which it decides what counts as a duplicate.

How does this query infer the "Primary key" of my Dataset? (There is no concept of Primary key in Spark)

zero323 · Accepted Answer · 2017-09-22 21:13:20Z

To keep duplicates, you can use either UNION ALL:

    Seq((1L, "foo")).toDF.createOrReplaceTempView("a")
    Seq((1L, "bar"), (1L, "foo")).toDF.createOrReplaceTempView("b")
    spark.sql("SELECT * FROM a UNION ALL SELECT * FROM b").explain

    == Physical Plan ==
    Union
    :- LocalTableScan [_1#152L, _2#153]
    +- LocalTableScan [_1#170L, _2#171]

or the Dataset.union method:

    spark.table("a").union(spark.table("b")).explain

    == Physical Plan ==
    Union
    :- LocalTableScan [_1#152L, _2#153]
    +- LocalTableScan [_1#170L, _2#171]

How does this query infer the "Primary key" of my Dataset?

It doesn't, or at least not in the current version. It simply applies a HashAggregate keyed on all available columns, so only rows that match in every column are collapsed:

    spark.sql("SELECT * FROM a UNION SELECT * FROM b").explain

    == Physical Plan ==
    *HashAggregate(keys=[_1#152L, _2#153], functions=[])
    +- Exchange hashpartitioning(_1#152L, _2#153, 200)
       +- *HashAggregate(keys=[_1#152L, _2#153], functions=[])
          +- Union
             :- LocalTableScan [_1#152L, _2#153]
             +- LocalTableScan [_1#170L, _2#171]
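To make the "all columns are the key" behaviour concrete, here is a minimal plain-Java sketch of the row semantics. It is not Spark code and not how Spark implements the plan; the class name `UnionSemantics`, its two helper methods, and the sample rows are all made up for illustration. Each row is modelled as a `List<Object>`, whose `equals` compares every element, much like an aggregate keyed on every column.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Plain-Java model of the row semantics only; this is not Spark code.
class UnionSemantics {

    // UNION ALL: concatenate both inputs and keep every row.
    static List<List<Object>> unionAll(List<List<Object>> a, List<List<Object>> b) {
        List<List<Object>> rows = new ArrayList<>(a);
        rows.addAll(b);
        return rows;
    }

    // UNION: concatenate, then keep one copy of each distinct row.
    // List.equals compares every element, so the whole row acts as the key,
    // mirroring HashAggregate(keys=[all columns]) in the plan.
    static List<List<Object>> union(List<List<Object>> a, List<List<Object>> b) {
        return new ArrayList<>(new LinkedHashSet<>(unionAll(a, b)));
    }

    public static void main(String[] args) {
        // Hypothetical sample rows: PhoneNumber, Name, Age, Address.
        List<List<Object>> a = List.of(
            List.<Object>of("555-0100", "Alice", 30, "1 Main St"));
        List<List<Object>> b = List.of(
            // Same phone number, other columns differ: not a duplicate.
            List.<Object>of("555-0100", "Bob", 41, "2 Oak Ave"),
            // Identical to the row in a in every column: a duplicate.
            List.<Object>of("555-0100", "Alice", 30, "1 Main St"));

        System.out.println(unionAll(a, b).size()); // 3
        System.out.println(union(a, b).size());    // 2
    }
}
```

Under this model, two rows that share only PhoneNumber both survive a UNION. In Spark itself, deduplicating on a subset of columns has to be requested explicitly, for example by calling Dataset.dropDuplicates with the chosen column names on the result of a UNION ALL.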
