
I would like to loop over a "slice" of an iterator. I'm not sure whether this is possible, as I understand that an iterator cannot be sliced. What I would like to do is this:

def f():
    for i in range(100):
        yield i

x = f()

for i in x[95:]:
    print(i)

This of course fails with:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-37-15f166d16ed2> in <module>()
      4 x = f()
      5 
----> 6 for i in x[95:]:
      7     print(i)

TypeError: 'generator' object is not subscriptable

Is there a pythonic way to loop through a "slice" of a generator?

Basically the generator I'm actually concerned with reads a very large file and performs some operations on it line by line. I would like to test slices of the file to make sure that things are performing as expected, but it is very time consuming to let it run over the entire file.

Edit:
As mentioned, I need to do this on a file. I was hoping there was a way of specifying this explicitly with the generator, for instance:

import skbio

f = 'seqs.fna'
seqs = skbio.io.read(f, format='fasta')

seqs is a generator object

for seq in itertools.islice(seqs, 30516420, 30516432):
    # do a bunch of stuff here
    pass

The above code does what I need, but it is still very slow, as the generator still loops through all of the lines. I was hoping to loop over only the specified slice.

  • I don't understand your question... If your generator takes a file as an input, then to test it, pass it slices of that file; why do you want to "slice the generator"? Commented Jan 11, 2016 at 22:32
  • Have you looked into itertools.islice? Commented Jan 11, 2016 at 22:33
  • Note that islice-ing the generator won't stop it from going through the lines before the ones you care about and processing them. It'd be better to provide it with an islice of the file. (You'll still need to read the file to look for newlines, but you'll skip whatever processing the generator does on the unwanted lines.) See the sketch below. Commented Jan 11, 2016 at 22:38
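A rough sketch of that third suggestion (the process generator and the slice bounds are placeholders of my own, not from the question):

import itertools

def process(lines):
    # Stand-in for whatever expensive per-line work the real generator does
    for line in lines:
        yield line.strip()

with open('seqs.fna') as f:
    # Slicing the file object still reads the skipped lines (to find their
    # newlines), but never runs the expensive processing on them
    for item in process(itertools.islice(f, 95, 105)):
        print(item)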

4 Answers


In general, the answer is itertools.islice, but you should note that islice doesn't, and can't, actually skip values. It just grabs and throws away start values before it starts yield-ing values. So it's usually best to avoid islice if possible when you need to skip a lot of values and/or the values being skipped are expensive to acquire/compute. If you can find a way to not generate the values in the first place, do so. In your (obviously contrived) example, you'd just adjust the start index for the range object.
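A minimal sketch of both approaches, using the question's toy generator, shows the difference:

import itertools

def f():
    for i in range(100):
        yield i

# islice still consumes and discards the first 95 values before yielding:
for i in itertools.islice(f(), 95, None):
    print(i)  # prints 95..99, but 0..94 were still generated

# If you control the source, don't generate the skipped values at all:
for i in range(95, 100):
    print(i)  # same output, no wasted work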

In the specific case of running on a file object, pulling a huge number of lines (particularly reading from a slow medium) may not be ideal. Assuming you don't need specific lines, one trick you can use to avoid actually reading huge blocks of the file, while still testing some distance into the file, is to seek to a guessed offset, read out to the end of the line (to discard the partial line you probably seeked into the middle of), then islice off however many lines you want from that point. For example:

import itertools

with open('myhugefile') as f:
    # Assuming roughly 80 characters per line, this seeks to somewhere roughly
    # around the 100,000th line without reading in the data preceding it
    f.seek(80 * 100000)
    next(f)  # Throw away the partial line you probably landed in the middle of
    for line in itertools.islice(f, 100):  # Process 100 lines
        # Do stuff with each line
        pass

For the specific case of files, you might also want to look at mmap which can be used in similar ways (and is unusually useful if you're processing blocks of data rather than lines of text, possibly randomly jumping around as you go).
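A minimal sketch of the mmap idea, assuming a local, seekable file whose guessed offset lands before the last newline (the offset guess and line count are arbitrary; mm.find is used to resynchronize on the next newline after the guess):

import itertools
import mmap

with open('myhugefile', 'rb') as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        pos = mm.find(b'\n', 80 * 100000)  # first newline at/after the guess
        mm.seek(pos + 1)                   # start of the next complete line
        for line in itertools.islice(iter(mm.readline, b''), 100):
            pass  # line is bytes here; decode it if you need str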

Update: From your updated question, you'll need to look at your API docs and/or data format to figure out exactly how to skip around properly. It looks like skbio offers some features for skipping using seq_num, but that's still going to read (if not process) most of the file. If the data was written out with equal sequence lengths, I'd look at the docs on Alignment; aligned data may be loadable without processing the preceding data at all, e.g. by using Alignment.subalignment to create new Alignments that skip the rest of the data for you.
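If the seq_num route is enough for your testing, it might look something like this; I haven't verified the exact signature, so treat the into/seq_num usage as an assumption to check against your skbio version's docs:

import skbio

# Assumption: the fasta reader accepts seq_num to return the Nth sequence
# (1-based) when reading into a single sequence object; it still has to scan
# past the preceding records to find it.
seq = skbio.io.read('seqs.fna', format='fasta', into=skbio.DNA, seq_num=30516421)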


3 Comments

In an unstructured, unindexed file, is there any way of getting the (exactly) 100,000th line without ripping through the entire thing?
@NickT: Nope. Modules like linecache will let you pretend like you have random access, but it's still "ripping through" the whole thing; there is no meaningful way to find where the line breaks are without reading through to find them. mmap-ing a file and using mmap.find or mmap.rfind repeatedly could find lines relative to the start or end of a file without storing any lines in memory, but it's still reading the file.
@NickT: I've previously posted an answer for using mmap to read the last X lines of a large file without slurping the whole thing; that's the closest you'll get. You need to read from one end of the file or the other, you can't leap to a given line without reading to figure out where that specific line is unless the lines are of fixed length.

islice is the pythonic way

from itertools import islice

g = (i for i in range(100))

for num in islice(g, 95, None):
    print(num)

2 Comments

I know it's not related to the question; but why not g = list(range(100)) instead of your 2nd line
@HosseinGholami your proposed change would result in g being a list, not an iterator.

You can't slice a generator object or iterator using normal slice operations. Instead you need to use itertools.islice, as @jonrsharpe already mentioned in his comment.

import itertools

for i in itertools.islice(x, 95):  # first 95 values; use islice(x, 95, None) for x[95:]
    print(i)

Also note that islice returns an iterator and consumes data from the underlying iterator or generator. So you will need to convert your data to a list, or create a new generator object, if you need to go back and do something with it. Or use the little-known itertools.tee to create a copy of your generator:

from itertools import tee

first, second = tee(f())
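A quick usage sketch of my own (not from the answer), peeking at a few values on one copy while keeping the other intact:

from itertools import islice, tee

def f():
    for i in range(100):
        yield i

first, second = tee(f())
print(list(islice(first, 5)))  # [0, 1, 2, 3, 4]
print(sum(second))             # 4950 -- the copy still sees every value

Mind the memory caveat discussed in the comments below: tee buffers values until both copies have consumed them.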

5 Comments

Note: itertools.tee is storing a copy of every output produced by the furthest advanced tee-ed copy, and can't discard any of those values until the least advanced iterator produces it. So a use of tee in which one tee-ed iterator is exhausted before you read the second one would usually be handled better by just list-ifying the original generator, then iterating it multiple times.
@ShadowRanger Do you mean that by iterating the original, the copy is also consumed? Can you please elaborate? list-ifying the original generator means loading all the data in memory.
I never said anything about iterating the original consuming the copy; not sure what you mean by that? Basically, if you do x, y = tee(some_generator_making_numbers), then do sum(x), then all the values of some_generator_making_numbers are stored internally in the tee shared data until you drain them from y as well; if you don't iterate all outputs of tee roughly in parallel, then you aren't likely to be reducing memory overhead over just list-ifying with somelist = list(some_generator_making_numbers) then iterating somelist as many times as you want.
Point is, tee isn't actually copying the generator. It's making new generators based on a single shared cache, where the first generator to request item X causes the shared cache to pull the value from the original generator, and the last generator to request item X releases that value from the cache. But if the first tee generator runs to exhaustion before the second even pulls a single value, then the shared cache contains every value from the original generator (memory required is roughly equivalent to having stored all the values in a list).
This is actually part of the tee documentation: "This itertool may require significant auxiliary storage (depending on how much temporary data needs to be stored). In general, if one iterator uses most or all of the data before another iterator starts, it is faster to use list() instead of tee()."

Let's clarify something first. Suppose you want to extract the first values from your generator, based on the number of variables you specified on the left of the expression. Starting from this moment, we have a problem, because in Python there are two alternatives for unpacking something.

Let's discuss these alternatives using the following example. Imagine you have the list l = [1, 2, 3].

1) The first alternative is to NOT use the "star" expression

a, b, c = l # a=1, b=2, c=3 

This works great if the number of arguments at the left of the expression (in this case, 3 arguments) is equal to the number of elements in the list. But, if you try something like this

a, b = l # ValueError: too many values to unpack (expected 2) 

This is because the list contains more values than the variables specified on the left of the expression.

2) The second alternative is to use the "star" expression; this solves the previous error

a, b, *c = l  # a=1, b=2, c=[3]

The "start" argument act like a buffer list. The buffer can have three possible values:

a, b, *c = [1, 2]           # a=1, b=2, c=[]
a, b, *c = [1, 2, 3]        # a=1, b=2, c=[3]
a, b, *c = [1, 2, 3, 4, 5]  # a=1, b=2, c=[3, 4, 5]

Note that the list must contain at least 2 values (in the above example). If not, an error will be raised

Now, let's jump to your problem. If you try something like this:

a, b, c = generator 

This will work only if the generator yields exactly three values (the number of values the generator produces must be the same as the number of variables on the left). Otherwise, an error will be raised.

If you try something like this:

a, b, *c = generator 
  • If the generator yields fewer than 2 values, an error will be raised, because the variables "a" and "b" must each receive a value
  • If the generator yields exactly 2 values, then a=<val_1>, b=<val_2>, c=[]
  • If the generator yields more than 2 values, then a=<val_1>, b=<val_2>, c=[<val_3>, ...]. In this case, if the generator is infinite, the program will block trying to consume it

What I propose for you is the following solution:

# Create a dummy generator for this example
def my_generator():
    i = 0
    while i < 2:
        yield i
        i += 1

# Our Generator Unpacker
class GeneratorUnpacker:
    def __init__(self, generator):
        self.generator = generator

    def __iter__(self):
        return self

    def __next__(self):
        try:
            return next(self.generator)
        except StopIteration:
            return None  # When the generator ends, we return None as the value

if __name__ == '__main__':
    dummy_generator = my_generator()
    g = GeneratorUnpacker(dummy_generator)
    a, b, c = next(g), next(g), next(g)

