
[Python API] Chunking Bug #43

@Aweptimum

Description


There is an issue in `events_filter.Chunks.__iter__` (copied below):

```python
def __iter__(self) -> collections.abc.Iterator[numpy.ndarray]:
    events_buffers: list[numpy.ndarray] = []
    current_length = 0
    for events in self.parent:
        events_length = len(events)
        while events_length > 0:
            if current_length + events_length < self.chunk_length:
                events_buffers.append(events)
                current_length += events_length
                break
            pivot = self.chunk_length - current_length
            if len(events_buffers) == 0:
                yield events[:pivot]
            else:
                events_buffers.append(events[:pivot])
                yield numpy.concatenate(
                    events_buffers, dtype=events_stream.EVENTS_DTYPE
                )
                events_buffers = []
            events = events[pivot:]
    if len(events_buffers) > 0:
        yield numpy.concatenate(events_buffers, dtype=events_stream.EVENTS_DTYPE)
        events_buffers = []
```

I was using it for its intended purpose of grabbing data in batches, but it gets stuck in an infinite loop after the first chunk and yields empty batches (basically, `yield events[:pivot]` runs forever).
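
For context, here is roughly how I was hitting it (a sketch from memory; the exact `.chunks` signature here is an assumption and may not match the real one):

```python
import faery

# Hypothetical repro sketch: the .chunks(...) call is an assumption about
# the signature, and path is any event recording file.
stream = faery.events_stream_from_file(path)
for chunk in stream.chunks(1000):
    # The first chunk comes out fine, then the iterator keeps yielding
    # empty batches instead of advancing or stopping.
    print(len(chunk))
```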

I think the issue is that, while `events` is reassigned (`events = events[pivot:]`), the `events_length` variable that the while loop checks is never updated to match.
Changing:

```python
while events_length > 0:
```

to:

```python
while (events_length := len(events)) > 0:
```
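
With that change, the whole loop would read as below (a sketch, not a tested patch). One extra thing I noticed while tracing it by hand: once the loop no longer spins, `current_length` also seems to need a reset after a concatenated chunk is yielded, otherwise the next `pivot` is computed from a stale offset. I've flagged that line as an assumption, since I've only verified the `events_length` fix:

```python
def __iter__(self) -> collections.abc.Iterator[numpy.ndarray]:
    events_buffers: list[numpy.ndarray] = []
    current_length = 0
    for events in self.parent:
        # Re-evaluate the length on every pass, since events is reassigned
        # to events[pivot:] at the bottom of the loop.
        while (events_length := len(events)) > 0:
            if current_length + events_length < self.chunk_length:
                events_buffers.append(events)
                current_length += events_length
                break
            pivot = self.chunk_length - current_length
            if len(events_buffers) == 0:
                yield events[:pivot]
            else:
                events_buffers.append(events[:pivot])
                yield numpy.concatenate(
                    events_buffers, dtype=events_stream.EVENTS_DTYPE
                )
                events_buffers = []
                current_length = 0  # assumption: keep pivot in sync after a yield
            events = events[pivot:]
    if len(events_buffers) > 0:
        yield numpy.concatenate(events_buffers, dtype=events_stream.EVENTS_DTYPE)
```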

This change appears to give the expected behavior. However, it seems like itertools might be faster: I'm getting a ~10 s improvement iterating over 800k events using the snippet below instead of calling `.chunks` (25 s vs. 36 s).

```python
evts = itertools.batched(
    itertools.chain.from_iterable(faery.events_stream_from_file(path)), 1000
)
```
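
One caveat with the itertools route: `itertools.batched` requires Python 3.12+, and `chain.from_iterable` flattens the stream into individual event records, so each batch is a tuple of record scalars rather than a `numpy.ndarray`. If downstream code expects arrays, they can be rebuilt along these lines (a sketch, assuming the packets are structured arrays of `events_stream.EVENTS_DTYPE` as in `__iter__` above):

```python
import itertools

import faery
import numpy

flat = itertools.chain.from_iterable(faery.events_stream_from_file(path))
for batch in itertools.batched(flat, 1000):
    # batch is a tuple of record scalars; wrapping it in list() keeps numpy
    # from interpreting the outer tuple as a single record.
    chunk = numpy.array(list(batch), dtype=events_stream.EVENTS_DTYPE)
```

The conversion back to arrays has a cost, so the timings above may not hold once the batches are rebuilt.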
