From this data frame

    +-----+-----------------+
    |store|           values|
    +-----+-----------------+
    |    1|[1, 2, 3,4, 5, 6]|
    |    2|            [2,3]|
    +-----+-----------------+

I would like to apply the Counter function to get this:

    +-----+------------------------------+
    |store|values                        |
    +-----+------------------------------+
    |    1|{1:1, 2:1, 3:1, 4:1, 5:1, 6:1}|
    |    2|{2:1, 3:1}                    |
    +-----+------------------------------+

I got this data frame using the answer to another question, "GroupBy and concat array columns pyspark".

So I tried to modify the code from that answer like this:

Option 1:

    def flatten_counter(val):
        return Counter(reduce(lambda x, y: x + y, val))

    udf_flatten_counter = sf.udf(flatten_counter, ty.ArrayType(ty.IntegerType()))
    df3 = df2.select("store", flatten_counter("values2").alias("values3"))
    df3.show(truncate=False)

Option 2:

    (df.rdd
        .map(lambda r: (r.store, r.values))
        .reduceByKey(lambda x, y: x + y)
        .map(lambda row: Counter(row[1]))
        .toDF(['store', 'values'])
        .show())

but neither option works.

Does anybody know how I can do this?

Thank you

1 Answer

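You just have to provide the correct return type. A Counter is a mapping, so declare the UDF with MapType rather than ArrayType and convert the result to a plain dict: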

    from collections import Counter
    from pyspark.sql import functions as sf, types as ty

    # Declare the return type as a map of int -> int, not an array
    udf_flatten_counter = sf.udf(
        lambda x: dict(Counter(x)),
        ty.MapType(ty.IntegerType(), ty.IntegerType()))

    df = spark.createDataFrame(
        [(1, [1, 2, 3, 4, 5, 6]), (2, [2, 3])], ("store", "values"))

    df.withColumn("cnt", udf_flatten_counter("values")).show(2, False)
    # +-----+------------------+---------------------------------------------------+
    # |store|values            |cnt                                                |
    # +-----+------------------+---------------------------------------------------+
    # |1    |[1, 2, 3, 4, 5, 6]|Map(5 -> 1, 1 -> 1, 6 -> 1, 2 -> 1, 3 -> 1, 4 -> 1)|
    # |2    |[2, 3]            |Map(2 -> 1, 3 -> 1)                                |
    # +-----+------------------+---------------------------------------------------+

Similarly with the RDD API:

    df.rdd.mapValues(Counter).mapValues(dict).toDF(["store", "values"]).show(2, False)
    # +-----+---------------------------------------------------+
    # |store|values                                             |
    # +-----+---------------------------------------------------+
    # |1    |Map(5 -> 1, 1 -> 1, 6 -> 1, 2 -> 1, 3 -> 1, 4 -> 1)|
    # |2    |Map(2 -> 1, 3 -> 1)                                |
    # +-----+---------------------------------------------------+

Conversion to dict is necessary because apparently Pyrolite cannot handle Counter objects.
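To see what each row produces before Spark is involved, here is a minimal plain-Python sketch of the per-row transformation. The helper name `count_values` and the third sample row are hypothetical, added only for illustration. It shows that `Counter` is a `dict` subclass and that `dict(...)` turns it into the plain mapping that the conversion above relies on.

```python
from collections import Counter

def count_values(values):
    """Count occurrences of each element and return a plain dict."""
    # Counter subclasses dict; dict(...) strips the subclass so that
    # only a built-in mapping is handed back.
    return dict(Counter(values))

# The first two rows mirror the data frame above; the third is a
# hypothetical row with repeated values, to show counts greater than 1.
rows = [(1, [1, 2, 3, 4, 5, 6]), (2, [2, 3]), (3, [7, 7, 8])]

# Plain-Python analogue of mapValues(Counter).mapValues(dict)
counted = [(store, count_values(values)) for store, values in rows]

for store, counts in counted:
    print(store, counts)
```

The keys and values are ordinary Python ints, which is why MapType(IntegerType(), IntegerType()) is a suitable declared type for this data.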
