By Next Solution Lab on 2024-07-15 04:31:20
Llama-2 is a powerful language model developed by Meta (formerly Facebook) as part of its suite of advanced AI tools. It is designed to understand and generate human-like text based on the input it receives. Building on its predecessor, Llama-2 offers improved accuracy, efficiency, and versatility in natural language processing tasks, making it well suited for applications such as chatbots, content generation, language translation, and more. To keep inference time in check, Llama-2 adopts optimizations such as KV-Caching and Grouped Query Attention (introduced by Ainslie et al. at Google Research), which dramatically reduce the cost of decoding. It also uses Rotary Positional Embedding and RMS Norm, but in this blog we are going to discuss:
1. KV-Caching
2. Grouped Query Attention
In natural language processing, attention mechanisms are fundamental to the performance of transformer models. However, a common challenge with standard attention layers is the repeated calculation of key (K) and value (V) matrices for each input sequence. This repetitive computation is both time-consuming and resource-intensive.
To address this inefficiency, the concept of KV caching was introduced. KV caching involves storing the previously computed key and value matrices so that they can be reused in subsequent computations. By doing this, we eliminate the need to recompute these matrices every time, significantly speeding up the attention mechanism. This optimization not only enhances the model's performance but also reduces the computational load, making it particularly beneficial for large-scale and real-time applications.
Pre-requisite: To understand the KV-Caching mechanism, you should already be familiar with the attention mechanism.
The KV-Cache mechanism can be summarized in the following steps:
Step 1: For the newly generated token, compute the key (K) and value (V) vectors.
Step 2: Concatenate the newly computed K and V vectors with the cached K and V vectors to form the complete K and V matrices required for self-attention.
Step 3: Use the concatenated K and V matrices to perform self-attention, which determines the attention scores and context for the new token.
Step 4: Append the new K and V vectors to the cache, ensuring that the cache now includes the latest token's vectors along with those from all previous tokens.
Step 5: Generate the next token in the sequence based on the self-attention output.
Step 6: Repeat steps 1 to 5 for each subsequent token in the sequence.
The steps can be visualized in the figure below, and a minimal code sketch follows.
(Image courtesy: https://medium.com/@joaolages/kv-caching-explained-276520203249)
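As a rough illustration, the minimal PyTorch sketch below shows how one decoding step with KV caching might look for a single attention layer. The function name, tensor shapes, and cache layout are our own simplifications for the sake of explanation, not Llama-2's actual implementation.

```python
import torch
import torch.nn.functional as F

def decode_step(x_new, w_q, w_k, w_v, cached_k, cached_v):
    """One decoding step with KV caching for a single attention layer.

    x_new:    (batch, 1, d_model)  hidden state of the newly generated token
    w_q/k/v:  (d_model, d_model)   projection weights
    cached_k: (batch, t, d_model)  keys of all previous tokens (None at t = 0)
    cached_v: (batch, t, d_model)  values of all previous tokens (None at t = 0)
    """
    # Step 1: compute Q, K, V only for the new token
    q_new = x_new @ w_q
    k_new = x_new @ w_k
    v_new = x_new @ w_v

    # Step 2: concatenate with the cached K and V to form the full matrices;
    # the concatenated tensors also become the updated cache (Step 4)
    k_all = k_new if cached_k is None else torch.cat([cached_k, k_new], dim=1)
    v_all = v_new if cached_v is None else torch.cat([cached_v, v_new], dim=1)

    # Step 3: self-attention of the new token over all tokens seen so far
    scores = q_new @ k_all.transpose(-2, -1) / (k_all.shape[-1] ** 0.5)
    out = F.softmax(scores, dim=-1) @ v_all

    # Steps 5-6: the output is used to generate the next token, and the
    # updated cache is passed to the next call
    return out, k_all, v_all
```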
The memory consumption of the KV-Cache can be computed with the equation:
KV-cache memory = 2 × Precision × N_layers × D_model × SeqLen × Batch
Where:
Precision refers to the number of bytes per stored value, set by the data type (bf16, which takes 2 bytes, is very common nowadays)
N_layers refers to the number of attention layers in the network
D_model refers to the model dimension (the input and output size of the model)
SeqLen refers to the number of tokens (the sequence length) the model will process
Batch refers to the batch size
Per token, the Llama-2-7B model takes around 512 KB of KV-cache memory and Llama-2-13B takes around 800 KB.
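As a quick sanity check of the equation above, the snippet below plugs in the published Llama-2 configurations (32 layers with a 4096 model dimension for 7B, 40 layers with 5120 for 13B) at 2-byte bf16 precision and reproduces the per-token figures.

```python
def kv_cache_bytes_per_token(precision_bytes, n_layers, d_model):
    # The factor 2 accounts for storing both K and V; batch = 1, seq_len = 1 token
    return 2 * precision_bytes * n_layers * d_model

# bf16 = 2 bytes per value
print(kv_cache_bytes_per_token(2, 32, 4096) / 1024)  # Llama-2-7B  -> 512.0 KB
print(kv_cache_bytes_per_token(2, 40, 5120) / 1024)  # Llama-2-13B -> 800.0 KB
```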
Llama-2's KV cache mechanism significantly enhances decoding speed by eliminating redundant computation across decoding iterations. By storing and reusing the key-value pairs of previously processed tokens, the model trades a modest amount of GPU memory for a large reduction in compute: each new token only requires computing its own K and V vectors instead of recomputing them for the entire sequence. This optimization ensures efficient utilization of computational resources, leading to faster inference times and improved model throughput.
Grouped Query Attention represents an important optimization for transformer models. By sharing key and value heads across groups of query heads, it targets a fundamental bottleneck of autoregressive decoding: the memory and memory-bandwidth cost of loading the full set of key and value heads at every step. This approach improves both computational efficiency and memory utilization, offering a scalable solution for processing long sequences in natural language tasks.
Multi-head attention introduces a crucial mechanism in transformer models, enabling the model to focus on diverse aspects of the input data simultaneously. Each attention head operates independently, utilizing distinct sets of linear layers for processing queries, keys, and values. This modular design, represented by h heads, enhances the model's ability to capture various relationships within the data. In each head h, attention scores are computed based on the transformed query (Q·W_q^h), key (K·W_k^h), and value (V·W_v^h) vectors, offering a nuanced understanding of the input across multiple representation subspaces. The attention equation is:
head_h = Attention(Q·W_q^h, K·W_k^h, V·W_v^h)
Figure-1: Multi-Head Attention
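To make the per-head equation concrete, here is a naive PyTorch sketch that computes multi-head attention with an explicit loop over heads and separate W_q^h, W_k^h, W_v^h matrices. The tensor shapes are assumptions chosen for clarity, not an optimized implementation.

```python
import torch
import torch.nn.functional as F

def multi_head_attention(q, k, v, wq, wk, wv, wo):
    """Naive multi-head attention, computed one head at a time.

    q, k, v:    (batch, seq_len, d_model)
    wq, wk, wv: (n_heads, d_model, d_head)  one projection per head
    wo:         (n_heads * d_head, d_model) output projection
    """
    heads = []
    for h in range(wq.shape[0]):
        qh, kh, vh = q @ wq[h], k @ wk[h], v @ wv[h]      # per-head projections
        scores = qh @ kh.transpose(-2, -1) / (qh.shape[-1] ** 0.5)
        heads.append(F.softmax(scores, dim=-1) @ vh)      # head_h = Attention(Q·Wq^h, K·Wk^h, V·Wv^h)
    return torch.cat(heads, dim=-1) @ wo                  # concatenate heads, project back to d_model
```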
The main challenge revolves around memory overhead, particularly in autoregressive models such as Transformers. During each decoding step, the model needs to load decoder weights along with all attention keys and values. This operation is not only computationally demanding but also consumes significant memory bandwidth. As model sizes expand, this overhead becomes more pronounced, posing challenges for scaling up the model efficiently. Figure 1 visualizes multi-head attention, whose key and value memory grows in proportion to the number of attention heads.
Multi-Query Attention is proposed as a solution to the memory problem of multi-head attention: it removes the bottleneck by using a single key head and a single value head shared across all query heads. The single key head is formed by averaging all the key heads, which retain much of their information in common, and the same is done for the value heads, reducing a set of value heads to a single one.
Figure-2: Multi-Query Attention
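The head-averaging idea can be sketched as a simple mean-pooling of the per-head key and value projection weights into one shared head. The shapes below are illustrative assumptions, not Llama-2 code.

```python
import torch

def to_multi_query(wk, wv):
    """Collapse all key/value heads into one shared head by averaging.

    wk, wv: (n_heads, d_model, d_head) per-head key/value projection weights
    """
    wk_single = wk.mean(dim=0)   # (d_model, d_head) shared key head
    wv_single = wv.mean(dim=0)   # (d_model, d_head) shared value head
    return wk_single, wv_single
```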
Multi-Query Attention solves the memory issue, but it has its disadvantages. Because it uses a single key and value head, model quality tends to degrade, and training can even become unstable.
Grouped-query attention (GQA) combines aspects of both multi-head attention (MHA) and multi-query attention (MQA) to form a streamlined attention mechanism that offers improved efficiency while maintaining effectiveness. GQA groups the query heads into G groups, and each group is assigned a single key and value head; the configuration is written as GQA-G, where G denotes the number of groups.
If G = 1 (GQA-1), it reduces to Multi-Query Attention, and
if G = H (GQA-H), it becomes Multi-Head Attention (H refers to the number of heads in multi-head attention).
Figure-3: Grouped-Query Attention
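Here is a minimal sketch of grouped-query attention under assumed tensor shapes: the query heads are split into G groups, and every query head in a group attends with the same key/value head. Setting G = 1 recovers multi-query attention, while G = H recovers multi-head attention.

```python
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_groups):
    """q:    (batch, n_q_heads, seq_len, d_head)  query heads
       k, v: (batch, n_groups,  seq_len, d_head)  one K/V head per group
    """
    batch, n_q_heads, seq_len, d_head = q.shape
    assert n_q_heads % n_groups == 0
    heads_per_group = n_q_heads // n_groups

    # Broadcast each shared K/V head to every query head in its group
    k = k.repeat_interleave(heads_per_group, dim=1)   # (batch, n_q_heads, seq_len, d_head)
    v = v.repeat_interleave(heads_per_group, dim=1)

    scores = q @ k.transpose(-2, -1) / (d_head ** 0.5)
    return F.softmax(scores, dim=-1) @ v

# G = 1        -> multi-query attention
# G = n_q_heads -> multi-head attention
```

Note that the repeat_interleave call only broadcasts the shared heads for the computation; in practice the KV cache stores just the G key/value heads, which is where the memory saving comes from.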
GQA strikes a balance between multi-head attention and multi-query attention: it keeps quality close to multi-head attention while reducing the number of key/value heads, as multi-query attention does. This design makes GQA a go-to option for the attention mechanism, as it largely reduces both time and memory consumption while keeping the model's quality intact or only marginally degraded.
Multi-head attention is great for capturing different types of information and features, but it comes at a price; multi-query attention reduces the cost but hurts the model's performance; grouped-query attention balances the two by reducing the cost while keeping performance largely intact.
At Next Solution Lab, we are dedicated to transforming experiences through innovative solutions. If you are interested in learning more about how our projects can benefit your organization, please reach out.
Contact Us