OUTER JOIN on 2 DATA FRAMES : Spark Scala SqlContext

Question

I am getting error while doing outer joins on 2 data frames. I am trying to get the percentile.

 val sqlContext = new org.apache.spark.sql.SQLContext(sc) val df = sqlContext.jsonFile("temp.txt") val res = df.withColumn("visited", explode($"visited")) val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total")) val result2 = res .filter($"visited.rating" < 4) .groupBy($"customerId", $"visited.placeName") .agg(count("*").alias("top")) result1.show() result2.show() val temp = result1.join(result2, List("placeName","customerId"), "outer") temp.registerTempTable("percentile") sqlContext.sql("select top/total as Percentage from percentile groupBy placeName")

The Error I am getting :

<console>:43: error: type mismatch; found : List[String] required: org.apache.spark.sql.Column –

I am using spark 1.5

Can anyone tell me what I am doing wrong here ? I tried this as well : val temp = result1.join(result2, Seq("placeName","customerId"), "outer") But still getting :

found : Seq[String] required: org.apache.spark.sql.Column

With this :

val res = df.withColumn("visited", explode($"visited")) val file1 =res.groupBy( $"visited.placeName").agg(count("*").alias("total")) val file2 = res.filter($"visited.rating" > 3).groupBy($"visited.placeName").agg(count("*").alias("top")) file1.show() file2.show() file1.join(file2)

I am getting duplicate columns.

My Schema :

 { "country": "France", "customerId": "France001", "visited": [ { "placeName": "US", "rating": "2", "famousRest": "N/A", "placeId": "AVBS34" }, { "placeName": "US", "rating": "3", "famousRest": "SeriousPie", "placeId": "VBSs34" }, { "placeName": "Canada", "rating": "3", "famousRest": "TimHortons", "placeId": "AVBv4d" } ] } US top = 1 count = 3 Canada top = 1 count = 3 { "country": "Canada", "customerId": "Canada012", "visited": [ { "placeName": "UK", "rating": "3", "famousRest": "N/A", "placeId": "XSdce2" }, ] } UK top = 1 count = 1 { "country": "France", "customerId": "France001", "visited": [ { "placeName": "US", "rating": "4.3", "famousRest": "N/A", "placeId": "AVBS34" }, { "placeName": "US", "rating": "3.3", "famousRest": "SeriousPie", "placeId": "VBSs34" }, { "placeName": "Canada", "rating": "4.3", "famousRest": "TimHortons", "placeId": "AVBv4d" } ] } US top = 2 count = 3 Canada top = 1 count = 3

So at the end I need something like :

PlaceName Percentage US 57.14 (1+1+2)/(3+1+3) *100 Canada 33.33 (1+1)/(3+3) *100 UK 100 1*100

Schema:

root |-- country: string(nullable=true) |-- customerId:string(nullable=true) |-- visited: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- placeId: string (nullable = true) | | |-- placeName: string (nullable = true) | | |-- famousRest: string (nullable = true) | | |-- rating: string (nullable = true)

Equi-join with Seq[String] is supported since 1.6. In 1.5 you'll have to apply all === conditions one by one. — zero323
– zero323, Commented Jul 1, 2016 at 20:25

Collectives™ on Stack Overflow

OUTER JOIN on 2 DATA FRAMES : Spark Scala SqlContext

0

Linked

Hot Network Questions

Collectives™ on Stack Overflow

0

Know someone who can answer? Share a link to this question via email, Twitter, or Facebook.

Linked