2

I have a list of lists as follws.

mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]] 

I also have a list of concepts as follows.

myconcepts = ["method", "standing"] 

I want to see how many times each concept in myconcepts is in mylist records. i.e.;

"method" = 2 times in records (i.e. in `52749` and `5274923`) "standing" = 2 times in records 

My current code is as follows.

mycounting = 0 for concept in myconcepts: for item in mylist: if concept in item[1]: mycounting = mycounting + 1 print(mycounting) 

However, my current mylist is very very large and have about 5 million records. myconcepts list have about 10000 concepts.

In my current code it takes nearly 1 minute for a concept to get the count, which is very slow.

I would like to know the most efficient way of doing this in python?

For testing purposes I have attached a small portion of my dataset in: https://drive.google.com/file/d/1z6FsBtLyDZClod9hK8nK4syivZToa7ps/view?usp=sharing

I am happy to provide more details if needed.

5
  • That depends on your usage. If you're going to do this many times, it pays to use a flattened list, and either use a Counter object (a type of dict), or use the count method on the flattened list. These techniques are already documented well on Stack Overflow and elsewhere on line. Commented Nov 12, 2019 at 22:12
  • can you change the lists to sets? Searching sets is O(1) instead of O(n) Commented Nov 12, 2019 at 22:13
  • With that many records you might reconsider your data structure. Commented Nov 12, 2019 at 22:13
  • @Barmar yes, I am happy to change the structure of mylist Commented Nov 12, 2019 at 22:13
  • EmJ, is converting you information into a database a possible answer? If so i'll show you how to add you data to mongodb for a quick search answer. Commented Nov 12, 2019 at 22:18

3 Answers 3

2

Adapting approach 3 from https://www.geeksforgeeks.org/python-count-the-sublists-containing-given-element-in-a-list/

from itertools import chain from collections import Counter mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [52749, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]]] myconcepts = ["method", "standing"] def countList(lst, x): " Counts number of times item x appears in sublists " return Counter(chain.from_iterable(set(i[1]) for i in lst))[x] # Use dictionary comprehension to apply countList to concept list result = {x:countList(mylist, x) for x in myconcepts} print(result) # {'method':2, 'standing':2} 

*Revised current method (compute counts only once) *

def count_occurences(lst): " Number of counts of each item in all sublists " return Counter(chain.from_iterable(set(i[1]) for i in lst)) cnts = count_occurences(mylist) result = {x:cnts[x] for x in myconcepts} print(result) # {'method':2, 'standing':2} 

Performance (comparing posted methods using Jupyter Notebook)

Results show this method and Barmar posted method are close (i.e. 36 vs 42 us)

The improvement to the current method reduced to time approximately in half (i.e. from 36 us to 19 us). This improvement should be even more substantial for a larger number of concepts (i.e. problem has > 1000 concepts).

However, the original method is faster at 2.55 us/loop.

Method current method

%timeit { x:countList(mylist, x) for x in myconcepts} #10000 loops, best of 3: 36.6 µs per loop Revised current method: %%timeit cnts = count_occurences(mylist) result = {x:cnts[x] for x in myconcepts} 10000 loops, best of 3: 19.4 µs per loop 

Method 2 (from Barmar post)

%%timeit r = collections.Counter(flatten(mylist)) {i:r.get(i, 0) for i in myconcepts} # 10000 loops, best of 3: 42.7 µs per loop 

Method 3 (Original Method)

%%timeit result = {} for concept in myconcepts: mycounting = 0 for item in mylist: if concept in item[1]: mycounting = mycounting + 1 result[concept] = mycounting # 100000 loops, best of 3: 2.55 µs per loop 
Sign up to request clarification or add additional context in comments.

15 Comments

@Emj--about to test performance. Will update the post in the next few minutes.
@emj--check my updated post where I added your original method. Surprisingly the original is fastest. Wonder if it depends upon the size of the dataset (i.e. original is faster for small datasets).
@EmJ--good feedback. Let me see if I can remove the overhead of computing sets for every concept.
@emj--check revised method that uses count_occurences. This should be a lot faster for your test case.
@emj--just saw your post. Looks like you have already found a good solution.
|
2

You can flatten the input and then use collections.Counter:

import collections myconcepts = ["method", "standing"] mylist = [[5274919, ["report", "porcelain", "firing", "technic"]], [5274920, ["implantology", "dentistry"]], [5274921, ["method", "recognition", "long", "standing", "root", "perforation", "molar"]], [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "standing"]]] def flatten(d): for i in d: yield from [i] if not isinstance(i, list) else flatten(i) r = collections.Counter(flatten(mylist)) result = {i:r.get(i, 0) for i in myconcepts} 

Output:

{'method': 2, 'standing': 2} 

Edit: record lookup:

result = {i:sum(i in b for _, b in mylist) for i in myconcepts} 

Output:

{'method': 2, 'standing': 2} 

9 Comments

Thank you for your answer. However, I am looking how many times each concept is in mylist records (not the total count). For example, method appears in two records 52749 and 5274923 (please see my edited question). Thank you :)
@EmJ Are you trying to find the record names that have a corresponding value in myconcepts? The current result {'method': 2, 'standing': 2} is indeed the individual counts for each element in myconcepts.
Thank you for the comment. I want to find the number of records that each concept is in. For example consider a record like this; [5274923, ["exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"]. It has method 2 times. However, the number of records is 1. :)
@EmJ Please see my recent edit, as I added a solution to find the unique records that store a target value.
Thank you for the edit. However, I only need a count of number of records. For instance, {'method': 2, 'standing': 2} :)
|
1

Change the concept lists to sets, so that searching will be O(1). You can then use intersection to count the number of matches in each set.

import set mylist = [ [5274919, {"report", "porcelain", "firing", "technic"}], [5274920, {"implantology", "dentistry"}], [52749, {"method", "recognition", "long", "standing", "root", "perforation", "molar"}], [5274923, {"exogenic", "endogenic", "cause", "tooth", "jaw", "anomaly", "method", "method", "standing"}] ] myconcepts = {"method", "standing"} mycounting = 0 for item in mylist: mycounting += len(set.intersection(myconcepts, item[1])) print(mycounting) 

If you want to get the counts for each concept separately, you'll need to loop over myconcept, then use the in operator. You can put the results in a dictionary.

mycounting = {concept: sum(1 for l in mylist if concept in l[1]) for concept in myconcepts} print(mycounting) // {'standing': 2, 'method': 2} 

This will still be more efficient than using a list, because concept in l[1] is O(1).

7 Comments

Thank you for the answer. I get an error saying NameError: name 'intersection' is not defined. Do you know how to fix this? Thank you :)
Should be set.intersection.
By the way, you can do intersections and unions of sets using the overloaded & and | operators; I think it is more readable, personally.
@Barmar I get the answer of your code as 4. But it should be 2 and 2. :)
@kaya3 I knew that, but I thought this would be more self-explanatory.
|

Start asking to get answers

Find the answer to your question by asking.

Ask question

Explore related questions

See similar questions with these tags.