Status: Closed
Labels: question (Further information is requested)
Description
Thank you for your work!
In the documentation, the `attention_mask` is described as an optional mask of shape `(batch_size, 1, query_len, key_len)`. However, this still appears to require allocating an `[L, L]` matrix for the attention mask, where `L` is the sequence length.
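
For concreteness, here is a rough back-of-the-envelope sketch of the memory footprint I am worried about (the sizes are purely illustrative, and I am assuming a boolean mask at 1 byte per element):

```python
# Rough illustration of the concern: a dense attention mask of shape
# (batch_size, 1, query_len, key_len) grows quadratically with the sequence length L.
# The numbers below are hypothetical and only meant to show the scale.
batch_size = 1
for L in (4_096, 32_768, 131_072):
    n_elements = batch_size * 1 * L * L   # query_len == key_len == L
    gib = n_elements / 2**30              # assuming 1 byte per boolean element
    print(f"L={L:>7,}: {n_elements:>18,} elements ≈ {gib:8.2f} GiB")
```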
My question:
How is this problem addressed when the sequence length L is very large? Are there any optimizations or alternative approaches that prevent the memory and computational overhead of allocating such a large attention mask?
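
For comparison, outside of this project, PyTorch's built-in `scaled_dot_product_attention` can handle the common causal case without materializing a mask tensor at all, via `is_causal=True`. I am wondering whether something similar is possible here. The snippet below is only a sketch of that PyTorch path with made-up sizes, not this repo's API:

```python
import torch
import torch.nn.functional as F

# Hypothetical sizes, just for illustration.
batch, heads, L, d = 1, 8, 4_096, 64
q = torch.randn(batch, heads, L, d)
k = torch.randn(batch, heads, L, d)
v = torch.randn(batch, heads, L, d)

# The kernel applies causal masking internally; no (batch, 1, L, L) mask is allocated.
out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```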
Looking forward to your insights!