I have a DataFrame in the format shown below.

    movieId1 | genreList1         | movieId2 | genreList2
    ------------------------------------------------------------
    1        | [Adventure,Comedy] | 2        | [Adventure,Comedy]
    1        | [Animation,Drama]  | 3        | [War,Drama]

The DataFrame schema is:

    StructType(
      StructField(movieId1,IntegerType,false),
      StructField(genres1,ArrayType(StringType,true),true),
      StructField(movieId2,IntegerType,false),
      StructField(genres2,ArrayType(StringType,true),true)
    )

Is there a way to create a new DataFrame with an added column containing the Jaccard coefficient of the two genre lists in each row?

    jaccardCoefficient(Set1, Set2) = (Set1 intersect Set2).size / (Set1 union Set2).size

The expected output would be:

    movieId1 | movieId2 | jaccardcoeff
    ------------------------------------
    1        | 2        | 1
    1        | 3        | 0.33

Any help would be much appreciated. Thanks.
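
To make the intended per-row calculation concrete, here is a minimal plain-Scala sketch of the formula above, applied to the two sample rows. It is not Spark code; the names `JaccardSketch` and `jaccard` are hypothetical, and returning 0.0 when both lists are empty is an assumption, since the formula does not define that case.

```scala
object JaccardSketch {

  // Hypothetical helper (not from the original post):
  // Jaccard coefficient of two genre lists, treated as sets.
  def jaccard(a: Seq[String], b: Seq[String]): Double = {
    val s1 = a.toSet
    val s2 = b.toSet
    val unionSize = (s1 union s2).size
    // Assumption: two empty lists get 0.0 to avoid dividing by zero.
    if (unionSize == 0) 0.0
    // toDouble is needed: Int / Int would truncate to 0 or 1.
    else (s1 intersect s2).size.toDouble / unionSize
  }

  def main(args: Array[String]): Unit = {
    // Row 1: identical genre lists, so intersection and union are the same set.
    println(jaccard(Seq("Adventure", "Comedy"), Seq("Adventure", "Comedy")))
    // Row 2: one shared genre (Drama) out of three distinct genres.
    println(jaccard(Seq("Animation", "Drama"), Seq("War", "Drama")))
  }
}
```

For the second sample row this gives 1/3, because only Drama is shared among the three distinct genres. Note that the formula as literally written, `.size / .size`, would perform integer division in Scala, hence the `toDouble`. One possible way to apply such a function per row in Spark would be to wrap it with `org.apache.spark.sql.functions.udf` and add the result via `withColumn`; that step is not shown here.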