I am using the Java API for Apache Spark, and I have two Datasets, A and B. Both have the same schema: PhoneNumber, Name, Age, Address.
Each Dataset contains one record with the same PhoneNumber, but the other columns of that record differ between the two Datasets.
I run the following SQL query on these two Datasets (after registering them as temporary views):
    A.createOrReplaceTempView("A");
    B.createOrReplaceTempView("B");
    String query = "Select * from A UNION Select * from B";
    Dataset<Row> result = sparkSession.sql(query);
    result.show();

Surprisingly, the result contains only one record with that PhoneNumber; the other one has been removed.
I know that UNION in SQL is intended to remove duplicates, but to do that it needs to know the primary key on the basis of which it decides what counts as a duplicate.
How does this query infer the "primary key" of my Dataset? (There is no concept of a primary key in Spark.)