I am getting error while doing outer joins on 2 data frames. I am trying to get the percentile.
val sqlContext = new org.apache.spark.sql.SQLContext(sc) val df = sqlContext.jsonFile("temp.txt") val res = df.withColumn("visited", explode($"visited")) val result1 =res.groupBy($"customerId", $"visited.placeName").agg(count("*").alias("total")) val result2 = res .filter($"visited.rating" < 4) .groupBy($"customerId", $"visited.placeName") .agg(count("*").alias("top")) result1.show() result2.show() val temp = result1.join(result2, List("placeName","customerId"), "outer") temp.registerTempTable("percentile") sqlContext.sql("select top/total as Percentage from percentile groupBy placeName") The Error I am getting :
<console>:43: error: type mismatch; found : List[String] required: org.apache.spark.sql.Column – I am using spark 1.5
Can anyone tell me what I am doing wrong here ? I tried this as well : val temp = result1.join(result2, Seq("placeName","customerId"), "outer") But still getting :
found : Seq[String] required: org.apache.spark.sql.Column With this :
val res = df.withColumn("visited", explode($"visited")) val file1 =res.groupBy( $"visited.placeName").agg(count("*").alias("total")) val file2 = res.filter($"visited.rating" > 3).groupBy($"visited.placeName").agg(count("*").alias("top")) file1.show() file2.show() file1.join(file2) I am getting duplicate columns.
My Schema :
{ "country": "France", "customerId": "France001", "visited": [ { "placeName": "US", "rating": "2", "famousRest": "N/A", "placeId": "AVBS34" }, { "placeName": "US", "rating": "3", "famousRest": "SeriousPie", "placeId": "VBSs34" }, { "placeName": "Canada", "rating": "3", "famousRest": "TimHortons", "placeId": "AVBv4d" } ] } US top = 1 count = 3 Canada top = 1 count = 3 { "country": "Canada", "customerId": "Canada012", "visited": [ { "placeName": "UK", "rating": "3", "famousRest": "N/A", "placeId": "XSdce2" }, ] } UK top = 1 count = 1 { "country": "France", "customerId": "France001", "visited": [ { "placeName": "US", "rating": "4.3", "famousRest": "N/A", "placeId": "AVBS34" }, { "placeName": "US", "rating": "3.3", "famousRest": "SeriousPie", "placeId": "VBSs34" }, { "placeName": "Canada", "rating": "4.3", "famousRest": "TimHortons", "placeId": "AVBv4d" } ] } US top = 2 count = 3 Canada top = 1 count = 3 So at the end I need something like :
PlaceName Percentage US 57.14 (1+1+2)/(3+1+3) *100 Canada 33.33 (1+1)/(3+3) *100 UK 100 1*100 Schema:
root |-- country: string(nullable=true) |-- customerId:string(nullable=true) |-- visited: array (nullable = true) | |-- element: struct (containsNull = true) | | |-- placeId: string (nullable = true) | | |-- placeName: string (nullable = true) | | |-- famousRest: string (nullable = true) | | |-- rating: string (nullable = true)
Seq[String]is supported since 1.6. In 1.5 you'll have to apply all===conditions one by one.