A mechanism in BERT’s architecture.

This is in software complexity (runtime and memory), where is the sequence length.

As a result of this, faster solutions and implementations have come, with the most popular one being Flash Attention.