Status: Closed
Labels: question (Further information is requested)
Description
Thank you for your work!
In the documentation, the `attention_mask` is described as an optional mask of shape `(batch_size, 1, query_len, key_len)`. However, this still appears to require allocating an `[L, L]` matrix for the attention mask, where `L` is the sequence length.
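
For concreteness, here is a rough back-of-the-envelope sketch of the memory footprint I am worried about (the sizes are purely illustrative, and I am assuming a boolean mask at 1 byte per element):

```python
# Rough illustration of the concern: a dense attention mask of shape
# (batch_size, 1, query_len, key_len) grows quadratically with the sequence length L.
# The numbers below are hypothetical and only meant to show the scale.
batch_size = 1
for L in (4_096, 32_768, 131_072):
    n_elements = batch_size * 1 * L * L   # query_len == key_len == L
    gib = n_elements / 2**30              # assuming 1 byte per boolean element
    print(f"L={L:>7,}: {n_elements:>18,} elements ≈ {gib:8.2f} GiB")
```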
My question:
How is this problem addressed when the sequence length L is very large? Are there any optimizations or alternative approaches that prevent the memory and computational overhead of allocating such a large attention mask?
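
For comparison, outside of this project, PyTorch's built-in `scaled_dot_product_attention` can handle the common causal case without materializing a mask tensor at all, via `is_causal=True`. I am wondering whether something similar is possible here. The snippet below is only a sketch of that PyTorch path with made-up sizes, not this repo's API:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes, just for illustration.
batch, heads, L, d = 1, 8, 4_096, 64
q = torch.randn(batch, heads, L, d)
k = torch.randn(batch, heads, L, d)
v = torch.randn(batch, heads, L, d)

# The kernel applies causal masking internally; no (batch, 1, L, L) mask is allocated.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```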
Looking forward to your insights!