Leaner large language models could enable efficient local use on phones and laptops


Large language models (LLMs) are increasingly automating tasks like translation, text classification and customer service. But tapping into an LLM’s power typically requires users to send their requests to a centralized server — a process that’s expensive, energy-intensive and often slow.

Now, researchers have introduced a technique for compressing an LLM’s reams of data, which could increase privacy, save energy and lower costs.

The new algorithm, developed by engineers at Princeton and Stanford Engineering, works by trimming redundancies and reducing the precision of an LLM’s layers of information. This type of leaner LLM could be stored and accessed locally on a device like a phone or laptop and could provide performance nearly as accurate and nuanced as an uncompressed version.

“Any time you can reduce the computational complexity, storage and bandwidth requirements of using AI models, you can enable AI on devices and systems that otherwise couldn’t handle such compute- and memory-intensive tasks,” said study coauthor Andrea Goldsmith, dean of Princeton’s School of Engineering and Applied Science and Arthur LeGrand Doty Professor of Electrical and Computer Engineering.

“When you use ChatGPT, whatever request you give it goes to the back-end servers of OpenAI, which process all of that data, and that is very expensive,” said coauthor Rajarshi Saha, a Stanford Engineering Ph.D. student. “So, you want to be able to do this LLM inference using consumer GPUs [graphics processing units], and the way to do that is by compressing these LLMs.” Saha’s graduate work is coadvised by Goldsmith and coauthor Mert Pilanci, an assistant professor at Stanford Engineering.

The researchers will present their new algorithm, CALDERA, which stands for Calibration Aware Low precision DEcomposition with low Rank Adaptation, at the Conference on Neural Information Processing Systems (NeurIPS) in December. Saha and colleagues began this compression research not with LLMs themselves, but with the large collections of information that are used to train LLMs and other complex AI models, such as those used for image classification. That earlier technique, a forerunner to the new LLM compression approach, was published in 2023.

Training data sets and AI models are both composed of matrices, or grids of numbers that are used to store data. In the case of LLMs, these are called weight matrices, which are numerical representations of word patterns learned from large swaths of text.

“We proposed a generic algorithm for compressing large data sets or large matrices,” said Saha. “And then we realized that nowadays, it’s not just the data sets that are large, but the models being deployed are also getting large. So, we could also use our algorithm to compress these models.”

While the team’s algorithm is not the first to compress LLMs, its novelty lies in an innovative combination of two properties, one called “low-precision,” the other “low-rank.” As digital computers store and process information as bits (zeros and ones), “low-precision” representation reduces the number of bits, speeding up storage and processing while improving energy efficiency. On the other hand, “low-rank” refers to reducing redundancies in the LLM weight matrices.

“Using both of these properties together, we are able to get much more compression than either of these techniques can achieve individually,” said Saha.
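As a rough illustration of why the two properties can complement each other, the Python sketch below builds a toy matrix with some low-rank structure and compares three approximations: low precision alone, low rank alone, and a low-rank part plus a low-precision version of what is left over. This is not the CALDERA algorithm. The uniform quantizer, the truncated-SVD step, the toy matrix and every function name here are assumptions made for illustration, and the low-rank factors are kept at full precision for simplicity.

```python
import numpy as np

def quantize_uniform(x, bits):
    """Round x onto a uniform grid of 2**bits levels spanning its value range."""
    lo, hi = x.min(), x.max()
    if hi == lo:
        return x.copy()
    step = (hi - lo) / (2 ** bits - 1)
    return np.round((x - lo) / step) * step + lo

def low_rank(x, rank):
    """Best rank-`rank` approximation of x, via a truncated SVD."""
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    return (u[:, :rank] * s[:rank]) @ vt[:rank]

def rel_error(w, approx):
    """Relative Frobenius-norm error of an approximation."""
    return np.linalg.norm(w - approx) / np.linalg.norm(w)

rng = np.random.default_rng(0)
# Toy "weight matrix" (an assumption): rank-8 structure plus dense noise.
w = rng.standard_normal((64, 8)) @ rng.standard_normal((8, 64))
w = w + rng.standard_normal((64, 64))

q_only = quantize_uniform(w, bits=4)             # low precision alone
lr_only = low_rank(w, rank=8)                    # low rank alone
both = lr_only + quantize_uniform(w - lr_only, bits=4)  # low rank + quantized residual

err_q = rel_error(w, q_only)
err_lr = rel_error(w, lr_only)
err_both = rel_error(w, both)
print(f"low precision only: {err_q:.3f}")
print(f"low rank only:      {err_lr:.3f}")
print(f"combined:           {err_both:.3f}")
```

On this toy matrix the combined approximation should show the smallest error. It also stores more numbers than either alternative, so this is not a like-for-like comparison of bit budgets, only a picture of how the two ideas fit together.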

The team tested their technique on Llama 2 and Llama 3, open-source large language models released by Meta AI. They found that their method, which uses low-rank and low-precision components in tandem, can improve on other methods that use low precision alone. The improvement can be up to 5%, which is significant for metrics that measure uncertainty in predicting word sequences.
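The article does not name the metric, but perplexity is one widely used measure of a model's uncertainty in predicting word sequences, and it is assumed here purely for illustration. The short Python sketch below computes it from the probabilities a model assigned to each correct next token. The probability values are made up and the function name is hypothetical.

```python
import math

def perplexity(token_probs):
    """Exponential of the average negative log-probability per token."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)

# Made-up probabilities assigned to the correct next token at each step.
more_certain = [0.5, 0.5, 0.5, 0.5]
less_certain = [0.25, 0.25, 0.25, 0.25]

print(perplexity(more_certain))  # about 2: like choosing among 2 equally likely words
print(perplexity(less_certain))  # about 4: like choosing among 4 equally likely words
```

Lower values mean the model is less uncertain about the next word.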

They evaluated the performance of the compressed language models using several sets of benchmark tasks for LLMs. The tasks included determining the logical order of two statements and answering questions involving physical reasoning, such as how to separate an egg white from a yolk or how to make a cup of tea.

“I think it’s encouraging and a bit surprising that we were able to get such good performance in this compression scheme,” said Goldsmith, who moved to Princeton from Stanford Engineering in 2020. “By taking advantage of the weight matrix rather than just using a generic compression algorithm for the bits that are representing the weight matrix, we were able to do much better.”

An LLM compressed in this way could be suitable for situations that don’t require the highest possible precision. Moreover, the ability to fine-tune compressed LLMs on edge devices like a smartphone or laptop enhances privacy by allowing organizations and individuals to adapt models to their specific needs without sharing sensitive data with third-party providers. This reduces the risk of data breaches or unauthorized access to confidential information during the training process. To enable this, the LLMs must initially be compressed enough to fit on consumer-grade GPUs.
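To give a sense of scale, the back-of-the-envelope Python sketch below estimates how much memory a model's weights need at different precisions. The 7-billion-parameter count and the bit widths are assumptions chosen for illustration, not figures from the study, and the estimate ignores everything except the weights themselves.

```python
def model_size_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9

params = 7e9  # hypothetical model with 7 billion parameters
for bits in (16, 4, 2):
    print(f"{bits:>2} bits per parameter: {model_size_gb(params, bits):.2f} GB")
```

Under these assumptions, going from 16 bits to 4 bits per parameter cuts the weight storage from roughly 14 GB to 3.5 GB.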

Saha also cautioned that running LLMs on a smartphone or laptop could hog the device’s memory for a period of time. “You won’t be happy if you are running an LLM and your phone drains out of charge in an hour,” said Saha. Low-precision computation can help reduce power consumption, he added. “But I wouldn’t say that there’s one single technique that solves all the problems. What we propose in this paper is one technique that is used in combination with techniques proposed in prior works. And I think this combination will enable us to use LLMs on mobile devices more efficiently and get more accurate results.”

The paper, “Compressing Large Language Models using Low Rank and Low Precision Decomposition,” will be presented at the Conference on Neural Information Processing Systems (NeurIPS) in December 2024. In addition to Goldsmith, Saha and Pilanci, coauthors include Stanford Engineering researchers Naomi Sagan and Varun Srivastava. This work was supported in part by the U.S. National Science Foundation, the U.S. Army Research Office, and the Office of Naval Research.
