Count the occurrences of a specific key in PySpark

Question

Assume that I have a column A in which every row is a list like this:

[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]

How do I count the number of "a"s?

I would like a solution like F.map().

Many thanks

Aelarion · Accepted Answer · 2021-04-28 02:53:29Z

Edited Answer:

Adjusting based on the comment from OP. To count the occurrences of a particular key in a list of dictionaries, you can still use a list comprehension (with a few adjustments):

    A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]
    A_count = len([y for x in A for y in x if y == 'a'])
    print(A_count)

Output:

    2

We're essentially using the same logic as before, just with a nested list comprehension. x iterates through A (the dictionaries), and y iterates through x (specifically, the keys of each dictionary). Finally, the if condition keeps only the keys that match the specified value, so the length of the resulting list is the number of occurrences.
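The comprehension above walks over every key of every dictionary. As a minimal sketch of the same count, a membership test (`key in d`) can do the check per dictionary instead. The helper name `count_key` is made up for illustration and is not part of the original answer:

```python
from typing import Any, Iterable, Mapping


def count_key(rows: Iterable[Mapping[str, Any]], key: str) -> int:
    """Count how many dictionaries in `rows` contain `key`."""
    # `key in d` checks the dictionary's keys directly,
    # so there is no need to loop over every key by hand.
    return sum(1 for d in rows if key in d)


A = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]
a_count = count_key(A, "a")        # key present in both dictionaries
missing_count = count_key(A, "z")  # key present in none of them
print(a_count, missing_count)
```

Since a dictionary cannot hold the same key twice, counting the dictionaries that contain the key gives the same number as counting the matching keys.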

Old Answer: Not really sure this provides a solution like "map", but you can use list comprehension which is fairly straightforward:

    A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]
    A_sum = sum([int(x['a']) for x in A])
    print(A_sum)

Output:

    3

Explanation:

Essentially we are collecting the dictionary values for your given key of 'a', parsing each value to an integer with int(), and then using sum to add all the resulting values in that list. Some good reference material is on W3Schools.

Thanks for your help. But I think the output should be 2. This is because I want to count the occurrence of key "a", not sum the values of it.
Ah ok I misunderstood your question. See edited answer, should do it for you.

Vaebhav · Accepted Answer · 2021-04-28 06:05:36Z

You can use a UDF to achieve this, assuming each row is, as you mentioned, a list of dictionaries:

    import pyspark.sql.functions as F
    from functools import partial
    from pyspark.sql import SparkSession

    # Reuse the active session (already available as `spark` in the PySpark shell)
    spark = SparkSession.builder.getOrCreate()

    temp_df = spark.createDataFrame(
        [
            [[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]],
            [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
            [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
        ],
        ["A"]
    )

    def key_occurence(inp, key=None):
        # Count the dictionaries in the list that contain `key`
        res = 0
        for d in inp:
            if key in d:
                res += 1
        return res

    # Bind the key to look for, then wrap the function as a UDF returning an int
    partial_func = partial(key_occurence, key="a")
    key_occurence_udf = F.udf(partial_func, "int")

    temp_df = temp_df.withColumn("A_occurence", key_occurence_udf("A"))

    temp_df.show()

    +--------------------+-----------+
    |                   A|A_occurence|
    +--------------------+-----------+
    |[[a -> 1, b -> 2,...|          2|
    |[[a -> 10, b -> 2...|          2|
    |[[a -> 10, b -> 2...|          2|
    +--------------------+-----------+

The function behind the UDF takes an additional argument, key, which is the key to look for. It is bound with partial before the function is wrapped as a UDF.
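To show how the key gets bound, here is a minimal plain-Python sketch of that step, with no Spark involved. The name `count_key_in_row` is hypothetical, and the `None` check is an added assumption (that a null cell reaches the function as `None`), not something the answer above does:

```python
from functools import partial
from typing import Any, Mapping, Optional, Sequence


def count_key_in_row(row: Optional[Sequence[Mapping[str, Any]]], key: str) -> int:
    """Count the dictionaries in one row's list that contain `key`."""
    # Assumption: a null cell arrives as None; treat it as zero matches.
    if row is None:
        return 0
    return sum(1 for d in row if key in d)


# partial() fixes `key`, leaving a callable that takes only the row.
count_a = partial(count_key_in_row, key="a")
count_b = partial(count_key_in_row, key="b")

row = [{"a": "1", "b": "2", "c": "3"}, {"b": "5", "c": "7"}]
print(count_a(row), count_b(row), count_a(None))
```

Each bound callable can then be wrapped with F.udf in the same way as partial_func above, one per key you want to count.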
