A mechanism in BERT’s architecture.
This is in software complexity (runtime and memory), where is the sequence length.
As a result of this, faster solutions and implementations have come, with the most popular one being Flash Attention.
Jul 28, 20241 min read
A mechanism in BERT’s architecture.
This is O(N2) in software complexity (runtime and memory), where N is the sequence length.
As a result of this, faster solutions and implementations have come, with the most popular one being Flash Attention.