Assume I have a column A in which every row is a list like this:

[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}] 

How do I count the number of "a"s?

I would like a solution like F.map().

Many thanks

2 Answers

Edited Answer:

Adjusting based on comment from OP. To get the occurrences of a particular key in a list of dictionaries, you can still use list comprehension (with a few adjustments):

A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]
A_count = len([y for x in A for y in x if y == 'a'])
print(A_count)

Output:

2 

We're essentially using the same logic, just in this case we're using nested list comprehension. x first iterates through A (the dictionaries), and y iterates through x (specifically, the keys in each dictionary). Finally, we use an if condition to make sure the key matches the specified value.
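As a possible alternative, here is a minimal sketch that tests key membership in each dictionary directly instead of iterating over every key. The helper name count_key is hypothetical and not part of the original answer; it assumes each row is a plain Python list of dictionaries.

```python
# Sample row taken from the question.
A = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]


def count_key(rows, key):
    """Count how many dictionaries in rows contain key (hypothetical helper)."""
    # `key in d` is True or False; sum() treats True as 1 and False as 0.
    return sum(key in d for d in rows)


a_count = count_key(A, "a")
print(a_count)  # 2
```

Because a dictionary cannot hold the same key twice, counting the dictionaries that contain the key gives the same result as the nested comprehension, without visiting the other keys.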


Old Answer: Not really sure this provides a solution like "map", but you can use list comprehension which is fairly straightforward:

A = [{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]
A_sum = sum([int(x['a']) for x in A])
print(A_sum)

Output:

3 

Explanation:

Essentially, we collect each dictionary's value for your given key 'a', parse that value to an integer with int, and then use sum to add up the resulting list. Some good reference material is on W3Schools.

2 Comments

Thanks for your help. But I think the output should be 2. This is because I want to count the occurrence of key "a", not sum the values of it.
Ah ok I misunderstood your question. See edited answer, should do it for you.

You can use a UDF to achieve this, assuming each row is, as you mentioned, a list of dictionaries:

from functools import partial

import pyspark.sql.functions as F
from pyspark.sql import SparkSession

# Reuse the active session, or create one (the PySpark shell already defines `spark`)
spark = SparkSession.builder.getOrCreate()

temp_df = spark.createDataFrame(
    [
        [[{"a":"1", "b":"2", "c":"3"}, {"a":"2", "b":"5", "c":"7"}]],
        [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
        [[{"a":"10", "b":"2", "c":"3"}, {"a":"20", "b":"5", "c":"7"}]],
    ],
    ["A"]
)

def key_occurence(inp, key=None):
    # Count the dictionaries in the list that contain the given key
    res = 0
    for d in inp:
        if key in d:
            res += 1
    return res

# Bind key="a" so the UDF only needs the column value
partial_func = partial(key_occurence, key="a")
key_occurence_udf = F.udf(partial_func, "int")

temp_df = temp_df.withColumn("A_occurence", key_occurence_udf("A"))
temp_df.show()

Output:

+--------------------+-----------+
|                   A|A_occurence|
+--------------------+-----------+
|[[a -> 1, b -> 2,...|          2|
|[[a -> 10, b -> 2...|          2|
|[[a -> 10, b -> 2...|          2|
+--------------------+-----------+

The function also takes a key argument, bound here with partial, so the UDF can check for whichever key you need.
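To see how the key binding behaves before involving Spark, here is a small plain-Python sketch. It reuses key_occurence from the answer; the names count_a, count_z and row are made up for illustration, and the sample row comes from the question.

```python
from functools import partial


def key_occurence(inp, key=None):
    # Same logic as the function wrapped by the UDF:
    # count the dictionaries in the list that contain `key`.
    res = 0
    for d in inp:
        if key in d:
            res += 1
    return res


# partial() fixes the `key` argument, leaving a one-argument callable,
# which is the shape a single-column UDF expects.
count_a = partial(key_occurence, key="a")
count_z = partial(key_occurence, key="z")

row = [{"a": "1", "b": "2", "c": "3"}, {"a": "2", "b": "5", "c": "7"}]
print(count_a(row))  # 2
print(count_z(row))  # 0
```

Checking a different key should then only require binding a different value, for example partial(key_occurence, key="b"), and wrapping the result with F.udf in the same way.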
