Researchers have developed a simple yet effective solution to a puzzling problem that can degrade the performance of large language models like ChatGPT.

When a human-AI conversation involves many rounds of continuous dialogue, the powerful large language models that drive chatbots like ChatGPT sometimes start to fail, causing the bots' performance to rapidly deteriorate.


Chatbots – Artistic impression. Photo credit: Christine Daniloff, MIT

A team of researchers from MIT and elsewhere has pinpointed a surprising cause of this problem and developed a simple solution that enables a chatbot to maintain a nonstop conversation without crashing or slowing down.

Their approach involves a tweak to the key-value cache (which is like a conversation memory) at the core of many large language models. In some methods, when this cache needs to hold more information than it has capacity for, the first pieces of data are evicted, which can cause the model to fail.

By ensuring that these first few data points remain in memory, the researchers’ method allows a chatbot to continue a conversation no matter how long the conversation lasts.

The method, called StreamingLLM, enables a model to remain efficient even when a conversation spans more than 4 million words. When compared to another method that avoids crashing by constantly recalculating part of the past conversation, StreamingLLM performed 22 times faster.

This can allow a chatbot to have long conversations throughout the workday without having to be constantly rebooted, enabling efficient AI assistants for tasks like copywriting, editing, or code generation.

“Now, with this method, we can persistently deploy these large language models. By making a chatbot that we can always chat with, and that can always respond to us based on our recent conversations, we could use these chatbots in some new applications,” says Guangxuan Xiao, a graduate student in electrical engineering and computer science (EECS) and lead author of a paper on StreamingLLM.

Xiao’s co-authors include his advisor, Song Han, an associate professor in EECS, a member of the MIT-IBM Watson AI Lab, and a distinguished scientist at NVIDIA; as well as Yuandong Tian, a research scientist at Meta AI; Beidi Chen, an assistant professor at Carnegie Mellon University; and senior author Mike Lewis, a research scientist at Meta AI. The work will be presented at the International Conference on Learning Representations.

A puzzling phenomenon

Large language models encode data, such as the words in a user query, into representations called tokens. Many models employ what is known as an attention mechanism that uses these tokens to generate new text.

Typically, an AI chatbot writes new text based on the text it has just seen, so it stores recent tokens in memory, called a KV cache, to use later. The attention mechanism builds a grid that includes all tokens in the cache, an “attention map” that charts how strongly each token, or word, relates to every other token.

Understanding these relationships is a feature that enables large language models to generate human-like text.

But when the cache becomes very large, the attention map can become even larger, which slows down the computation.
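To make that scaling concrete, here is a toy sketch (illustrative only, not the researchers' code) of an attention map over cached tokens: for n tokens the score grid is n-by-n, so doubling the cache quadruples the grid.

```python
import numpy as np

def attention_map(queries, keys):
    """Toy attention map: for n cached tokens, the score grid is n x n,
    so compute and memory grow quadratically with cache size."""
    scores = queries @ keys.T                        # (n, n) grid of similarities
    scores = scores / np.sqrt(keys.shape[1])         # standard scaling by sqrt(dim)
    shifted = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return shifted / shifted.sum(axis=-1, keepdims=True)  # each row sums to 1

rng = np.random.default_rng(0)
n, d = 8, 4                          # 8 cached tokens, 4-dimensional embeddings
tokens = rng.normal(size=(n, d))
amap = attention_map(tokens, tokens)
print(amap.shape)                    # (8, 8)
```

Doubling n to 16 would make the grid 16-by-16, four times as many entries, which is why a very large cache slows generation down.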

Also, if encoding the content requires more tokens than the cache, the performance of the model degrades. For example, a popular model can store 4,096 tokens, yet an academic paper has about 10,000 tokens.

To address these problems, researchers employ a “sliding cache” that bumps out the oldest tokens to add new ones. However, the model’s performance often plummets as soon as that first token is evicted, rapidly reducing the quality of newly generated words.

In this new paper, the researchers found that if they keep the first token in the sliding cache, the model will maintain its performance even when the cache size is exceeded.
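That finding can be sketched in a few lines of hypothetical Python (the function name and cache layout are illustrative, not the paper's code): a fixed-capacity cache that slides over old tokens but always preserves the first few.

```python
def evict(cache, capacity, num_sinks=1):
    """Sliding KV cache that always keeps the first `num_sinks` tokens
    and drops the oldest of the remaining ones when over capacity."""
    if len(cache) <= capacity:
        return cache
    overflow = len(cache) - capacity
    # keep the earliest tokens, evict the oldest non-sink tokens
    return cache[:num_sinks] + cache[num_sinks + overflow:]

cache = list(range(10))               # token ids 0..9, capacity 8
print(evict(cache, capacity=8))       # [0, 3, 4, 5, 6, 7, 8, 9]
```

A plain sliding cache is the `num_sinks=0` case; the researchers' fix amounts to never letting those earliest entries slide out.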

But this didn’t make any sense. The first word in a novel likely has nothing to do with the last word, so why would the first word be so important for the model to generate the newest word?

In their new paper, the researchers also uncovered the reason for this phenomenon.

Attention sinks

Some models use a softmax operation in their attention mechanism, which assigns each token a score representing how much it relates to every other token. The softmax operation requires all attention scores to sum to 1. Since most tokens aren’t strongly related, the model dumps any remaining attention score into the first token.

The researchers call this first token an “attention sink.”
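A toy softmax (the standard formula, not code from the paper) shows the constraint at work: however weakly a query relates to the cached tokens, the normalized weights are forced to sum to 1, so the "excess" attention has to land somewhere.

```python
import math

def softmax(scores):
    """Normalize raw attention scores so they sum to exactly 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# A query that is barely related to any cached token: raw scores near zero.
# The weights still must sum to 1; attention cannot be "switched off".
weights = softmax([0.1, 0.0, 0.05, 0.02])
print(round(sum(weights), 6))    # 1.0
```

Because that leftover probability mass must go somewhere, the model learns to park it on a token every position can see, which is what makes the first token a natural sink.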

“We need an attention sink, and the model decides to use the first token as the attention sink because it is globally visible: every other token can see it. We found that we must always keep the attention sink in the cache to maintain the model dynamics,” Han says.

In building StreamingLLM, the researchers discovered that having four attention sink tokens at the beginning of the sliding cache leads to optimal performance.

They also found that the positional encoding of each token must stay the same, even as new tokens are added and others are evicted. If token 5 is bumped out, token 6 must stay encoded as 6, even though it is now the fifth token in the cache.
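Using the article's own example, that rule can be sketched with a cache of (position, token) pairs (a hypothetical illustration, not the actual implementation): eviction removes an entry but never renumbers the survivors.

```python
# Each cache entry keeps its original position: (position, token).
cache = [(1, "a"), (2, "b"), (3, "c"), (4, "d"), (5, "e"), (6, "f")]

# Evict the entry at position 5; position 6 is NOT renumbered to 5.
cache = [entry for entry in cache if entry[0] != 5]

print(cache[-1])   # (6, 'f'): token 6 stays encoded as 6,
                   # even though it is now the fifth entry in the cache
```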

By combining these two ideas, they enabled StreamingLLM to maintain a continuous conversation while outperforming a popular method that uses recalculation.

For example, when the cache contains 256 tokens, the recalculation method takes 63 milliseconds to decode the new token, while StreamingLLM takes 31 milliseconds. However, if the cache size increases to 4,096 tokens, the recalculation requires 1,411 milliseconds for a new token, while StreamingLLM requires only 65 milliseconds.

“The innovative approach of StreamingLLM, centered around the attention sink mechanism, ensures stable memory usage and performance, even when processing texts up to 4 million tokens in length,” says Yang You, a presidential young professor of computer science at the National University of Singapore, who was not involved with this work. “This capability isn’t just impressive; it’s transformative, enabling StreamingLLM to be applied across a wide array of AI applications. The performance and versatility of StreamingLLM mark it as a highly promising technology, poised to revolutionize how we approach AI-driven generation applications.”

Tianqi Chen, an assistant professor in the machine learning and computer science departments at Carnegie Mellon University who was not involved with this research, agreed, saying “StreamingLLM enables smooth extension of the conversation length of large language models. We have been using it to enable the deployment of Mistral models on iPhones with great success.”

The researchers also explored the use of attention sinks during model training by adding several placeholder tokens in all training samples.

They found that training with attention sinks allowed a model to maintain performance with only one attention sink in its cache, rather than the four that are usually required to stabilize a pretrained model’s performance.

But while StreamingLLM enables a model to communicate continuously, the model cannot remember words that are not stored in the cache. In the future, researchers plan to target this limitation by investigating ways to recover tokens that have been evicted or enable the model to remember previous interactions.

Written by Adam Zvi

Source: Massachusetts Institute of Technology