I am using the Java API for Apache Spark, and I have two Datasets, A and B. Both have the same schema: PhoneNumber, Name, Age, Address.

One record in each Dataset has the same PhoneNumber, but the other columns in that record differ.

I run the following SQL query on these two Datasets (after registering them as temporary views):

    A.createOrReplaceTempView("A");
    B.createOrReplaceTempView("B");
    String query = "Select * from A UNION Select * from B";
    Dataset<Row> result = sparkSession.sql(query);
    result.show();

Surprisingly, the result contains only one record with that PhoneNumber; the other one is removed.

I know that UNION in SQL is intended to remove duplicates, but to do that it would need to know the primary key that determines what counts as a duplicate.

How does this query infer the "Primary key" of my Dataset? (There is no concept of Primary key in Spark)

1 Answer

You can use either UNION ALL:

    Seq((1L, "foo")).toDF.createOrReplaceTempView("a")
    Seq((1L, "bar"), (1L, "foo")).toDF.createOrReplaceTempView("b")
    spark.sql("SELECT * FROM a UNION ALL SELECT * FROM b").explain

    == Physical Plan ==
    Union
    :- LocalTableScan [_1#152L, _2#153]
    +- LocalTableScan [_1#170L, _2#171]

or the Dataset.union method:

    spark.table("a").union(spark.table("b")).explain

    == Physical Plan ==
    Union
    :- LocalTableScan [_1#152L, _2#153]
    +- LocalTableScan [_1#170L, _2#171]

How does this query infer the "Primary key" of my Dataset?

It doesn't, or at least not in the current version. It just applies a HashAggregate that uses all available columns as keys:

    spark.sql("SELECT * FROM a UNION SELECT * FROM b").explain

    == Physical Plan ==
    *HashAggregate(keys=[_1#152L, _2#153], functions=[])
    +- Exchange hashpartitioning(_1#152L, _2#153, 200)
       +- *HashAggregate(keys=[_1#152L, _2#153], functions=[])
          +- Union
             :- LocalTableScan [_1#152L, _2#153]
             +- LocalTableScan [_1#170L, _2#171]
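To make "all available columns" concrete, here is a minimal sketch in plain Java that mimics the two behaviours with standard collections. It is not Spark code: the class name UnionSemantics and its two helper methods are hypothetical, rows are modelled as List<String>, and the sample data mirrors the a and b views above. It only illustrates that deduplication compares entire rows, so two rows sharing the first column but differing in the second both survive.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper class, for illustration only; it does not use Spark.
public class UnionSemantics {

    // Mimics UNION ALL (and Dataset.union): concatenate, keep every row.
    public static List<List<String>> unionAll(List<List<String>> a, List<List<String>> b) {
        List<List<String>> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Mimics UNION: concatenate, then drop rows that are equal in ALL columns.
    // List.equals/hashCode compare every element, so the whole row acts as the key.
    public static List<List<String>> union(List<List<String>> a, List<List<String>> b) {
        return new ArrayList<>(new LinkedHashSet<>(unionAll(a, b)));
    }

    public static void main(String[] args) {
        List<List<String>> a = List.of(List.of("1", "foo"));
        List<List<String>> b = List.of(List.of("1", "bar"), List.of("1", "foo"));

        System.out.println(unionAll(a, b)); // [[1, foo], [1, bar], [1, foo]]
        System.out.println(union(a, b));    // [[1, foo], [1, bar]]
    }
}
```

Under these semantics, a row is dropped by UNION only when another row matches it in every column, which is consistent with the keys listed in the HashAggregate plan.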