I have a DataFrame in the format shown below.

    movieId1 | genreList1         | movieId2 | genreList2
    ---------------------------------------------------------------
    1        | [Adventure,Comedy] | 2        | [Adventure,Comedy]
    1        | [Animation,Drama]  | 3        | [War,Drama]

The DataFrame schema is:

    StructType(
      StructField(movieId1,IntegerType,false),
      StructField(genreList1,ArrayType(StringType,true),true),
      StructField(movieId2,IntegerType,false),
      StructField(genreList2,ArrayType(StringType,true),true)
    )

Is there a way to create a new DataFrame with an extra column holding the Jaccard coefficient of the two genre lists in each row?

    jaccardCoefficient(Set1, Set2) = (Set1 intersect Set2).size / (Set1 union Set2).size

    movieId1 | movieId2 | jaccardcoeff
    ---------------------------------------------------------------
    1        | 2        | 1
    1        | 3        | 0.333

Any help would be much appreciated. Thanks.

1 Answer

Given this input DataFrame:

    +--------+-------------------+--------+-------------------+
    |movieId1|         genreList1|movieId2|         genreList2|
    +--------+-------------------+--------+-------------------+
    |       1|[Adventure, Comedy]|       2|[Adventure, Comedy]|
    |       1| [Animation, Drama]|       3|       [War, Drama]|
    +--------+-------------------+--------+-------------------+

with schema:

    StructType(
      StructField(movieId1,IntegerType,false),
      StructField(genreList1,ArrayType(StringType,true),true),
      StructField(movieId2,IntegerType,false),
      StructField(genreList2,ArrayType(StringType,true),true))

You can use a UDF to calculate the Jaccard coefficient:

    import org.apache.spark.sql.functions.udf

    // Array columns reach a Scala UDF as Seq[String]; converting to Set ignores duplicate genres.
    val jaccardCoefficient = udf { (genres1: Seq[String], genres2: Seq[String]) =>
      val (set1, set2) = (genres1.toSet, genres2.toSet)
      (set1 intersect set2).size.toDouble / (set1 union set2).size.toDouble
    }
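The set arithmetic inside the UDF can be checked without Spark. The following is a minimal sketch in plain Scala; the helper name `jaccard` and the choice to return 0.0 for two empty lists are assumptions made for illustration and are not part of the original answer.

```scala
// Hypothetical helper (not from the original answer): Jaccard coefficient on plain Scala collections.
def jaccard(a: Seq[String], b: Seq[String]): Double = {
  val (s1, s2) = (a.toSet, b.toSet)
  val unionSize = (s1 union s2).size
  // Assumed convention: two empty lists get 0.0 instead of NaN from 0.0 / 0.0.
  if (unionSize == 0) 0.0 else (s1 intersect s2).size.toDouble / unionSize
}

// Same rows as the example DataFrame.
println(jaccard(Seq("Adventure", "Comedy"), Seq("Adventure", "Comedy"))) // 1.0
println(jaccard(Seq("Animation", "Drama"), Seq("War", "Drama")))         // 0.3333333333333333
```

For the second row the intersection is {Drama} and the union is {Animation, Drama, War}, so the coefficient is 1/3.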

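Apply the jaccardCoefficient UDF with withColumn, then select only the columns you want to keep:

    input
      .withColumn("jaccardcoeff", jaccardCoefficient($"genreList1", $"genreList2"))
      .select("movieId1", "movieId2", "jaccardcoeff")

to obtain your desired output: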

    +--------+--------+------------------+
    |movieId1|movieId2|      jaccardcoeff|
    +--------+--------+------------------+
    |       1|       2|               1.0|
    |       1|       3|0.3333333333333333|
    +--------+--------+------------------+
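As an alternative to a UDF, the same coefficient can probably be computed with Spark's built-in array functions. This is a sketch that assumes Spark 2.4 or later (where `array_intersect` and `array_union` are available) and a local SparkSession; the sample data is rebuilt here only so the snippet is self-contained, for example when pasted into spark-shell.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array_intersect, array_union, col, size}

// Assumption: a local session for demonstration; in spark-shell this returns the existing one.
val spark = SparkSession.builder().master("local[*]").appName("jaccard-sketch").getOrCreate()
import spark.implicits._

// Same two rows as the example input.
val input = Seq(
  (1, Seq("Adventure", "Comedy"), 2, Seq("Adventure", "Comedy")),
  (1, Seq("Animation", "Drama"), 3, Seq("War", "Drama"))
).toDF("movieId1", "genreList1", "movieId2", "genreList2")

// array_intersect and array_union both drop duplicates, so their sizes are set sizes.
val result = input
  .withColumn(
    "jaccardcoeff",
    size(array_intersect(col("genreList1"), col("genreList2"))).cast("double") /
      size(array_union(col("genreList1"), col("genreList2")))
  )
  .select("movieId1", "movieId2", "jaccardcoeff")

result.show()
```

Built-in functions stay visible to the query optimizer, whereas a UDF is opaque to it, so this form may be preferable when the Spark version allows it.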

1 Comment

Thank you so much.
