Large language models (LLMs) are a major advancement in AI, with the promise of transforming domains through learned knowledge. LLMs are based on the transformer architecture, which uses self-attention to encode and decode sequences of tokens. Transformer models have been growing rapidly in size and complexity, reaching billions of parameters and requiring massive amounts of computation and memory.
However, transformer design is not without challenges. As LLMs grow larger, they require more computational resources and memory to train and deploy, and they face open challenges in scalability, generalization, robustness and interpretability. Researchers are therefore constantly exploring new ways to improve transformer design for LLMs, including the following directions:
– **Efficient transformers**: These are methods that aim to reduce the computational cost and memory footprint of transformers by using techniques such as pruning, quantization, distillation, sparsity or low-rank approximation. For example, Linformer is a model that reduces the complexity of self-attention from quadratic to linear by projecting the input sequences into a lower-dimensional space (see the sketch after this list).
– **Sparse transformers**: These are methods that aim to increase the sparsity of transformers by using techniques such as attention masking or routing. For example, BigBird is a model that uses a sparse attention pattern whose cost grows linearly rather than quadratically with sequence length, allowing it to process sequences of thousands of tokens that would be impractical for a standard transformer.
– **Vision transformers**: These are methods that apply transformers to computer vision tasks by treating images as sequences of patches or pixels. For example, ViT is a model that splits an image into fixed-size patches and feeds them as a token sequence to a standard transformer encoder for image classification.
– **Multimodal transformers**: These are methods that enable transformers to handle multiple modalities of data such as text, images, audio or video. For example, CLIP is a model that learns from natural language supervision by jointly embedding text and images into a shared latent space.
– **Pretrained transformers**: These are methods that leverage large-scale pretraining on unlabeled text corpora to obtain general-purpose representations that can be fine-tuned or adapted for downstream tasks. For example, BART is a model that uses a denoising autoencoder objective to pretrain on corrupted text inputs and generate fluent text outputs.
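To make the Linformer idea in the first bullet concrete, here is a minimal PyTorch sketch of low-rank self-attention. It is an illustration of the idea, not the official Linformer implementation, and the module and parameter names (`LowRankSelfAttention`, `proj_len`, etc.) are ours: the key and value sequences are projected from length `seq_len` down to a fixed `proj_len` with learned linear maps, so the attention matrix is `seq_len × proj_len` instead of `seq_len × seq_len`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LowRankSelfAttention(nn.Module):
    """Linformer-style attention sketch: compress keys/values from seq_len to proj_len."""
    def __init__(self, d_model: int, seq_len: int, proj_len: int = 64):
        super().__init__()
        self.q = nn.Linear(d_model, d_model)
        self.k = nn.Linear(d_model, d_model)
        self.v = nn.Linear(d_model, d_model)
        # Learned projections that compress the sequence dimension (seq_len -> proj_len)
        self.e = nn.Linear(seq_len, proj_len, bias=False)
        self.f = nn.Linear(seq_len, proj_len, bias=False)
        self.scale = d_model ** -0.5

    def forward(self, x):                      # x: (batch, seq_len, d_model)
        q, k, v = self.q(x), self.k(x), self.v(x)
        # Compress the sequence axis of K and V to (batch, proj_len, d_model)
        k = self.e(k.transpose(1, 2)).transpose(1, 2)
        v = self.f(v.transpose(1, 2)).transpose(1, 2)
        # Attention matrix is now (seq_len x proj_len) instead of (seq_len x seq_len)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v

x = torch.randn(2, 512, 256)                   # batch of 2, 512 tokens, d_model=256
out = LowRankSelfAttention(d_model=256, seq_len=512, proj_len=64)(x)
print(out.shape)                               # torch.Size([2, 512, 256])
```

The cost of the attention step now grows linearly in sequence length, at the price of a fixed-rank approximation of the full attention matrix.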
One expert who is optimistic about this direction is Sina Bari MD, a physician-scientist who specializes in AI applications for healthcare. Dr. Bari has been fascinated by the potential of LLMs for healthcare and medicine, especially for tasks such as diagnosis, prognosis, treatment recommendation, drug discovery, biomolecular modeling and patient education. He believes that transformers are the key enablers for building powerful and versatile LLMs that can learn from diverse sources of medical data and generate meaningful insights for clinicians and patients alike.
> “I think we are witnessing an exciting era in AI research, where we can leverage large-scale pretraining on massive amounts of natural language data to create models that can understand and generate human-like language across domains. Transformers have proven to be very effective architectures for this purpose, as they can capture complex patterns and relationships in sequential data. However, transformers also have limitations, such as scalability, efficiency, robustness and interpretability. These challenges become more pronounced when we apply them to healthcare and medicine, where we deal with sensitive, high-dimensional and multimodal data sources. Therefore, I think we need to explore new ways to improve transformer design for LLMs, such as:
>
> – Developing more efficient transformers that can reduce computation time and memory usage without sacrificing performance or accuracy;
> – Designing more sparse transformers that can handle longer sequences and larger contexts without losing information or coherence;
> – Adapting vision transformers to medical imaging tasks by incorporating domain-specific knowledge.”
However, scaling up transformers is not without challenges. One of the main bottlenecks is the matrix multiplication operation that is performed in the feed-forward and attention projection layers of the transformer. These layers consume most of the parameters and computation time of LLMs, and also limit their deployment on resource-constrained devices.
To address this issue, researchers have been exploring various ways to reduce the memory footprint and computational cost of matrix multiplication for transformers. One promising direction is to use low-bit quantization methods, which compress the parameters and perform matrix multiplication with fewer bits.
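As a rough back-of-the-envelope illustration of what fewer bits buys (not a benchmark, and ignoring activations, KV caches and optimizer state), the following snippet estimates the weight-storage footprint of a 175B-parameter model at different precisions:

```python
# Back-of-the-envelope weight memory for a 175B-parameter model at different precisions.
params = 175e9
bytes_per_param = {"fp32": 4, "fp16": 2, "int8": 1}

for dtype, nbytes in bytes_per_param.items():
    gib = params * nbytes / 2**30
    print(f"{dtype}: ~{gib:,.0f} GiB of weights")

# Prints roughly: fp32 ~652 GiB, fp16 ~326 GiB, int8 ~163 GiB.
# Each halving of bits per parameter roughly halves the memory needed just to hold the model.
```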
Quantization methods can be divided into two categories: post-training quantization (PTQ) and quantization-aware training (QAT). PTQ methods apply quantization after a model has been trained with full precision, while QAT methods incorporate quantization during training. PTQ methods are simpler and faster to apply, but they may introduce significant accuracy degradation due to quantization errors. QAT methods can preserve accuracy better by adjusting the model parameters during training, but they require more time and resources.
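In its simplest form, a post-training quantization step can be as small as the sketch below: symmetric absmax quantization of a single weight matrix to int8, with one scale for the whole tensor. This toy example only shows the mechanics of PTQ; it is not the LLM.int8() method discussed next.

```python
import torch

def absmax_quantize(w: torch.Tensor):
    """Symmetric post-training quantization of a weight tensor to int8."""
    scale = 127.0 / w.abs().max()                       # one scale for the whole tensor
    w_q = (w * scale).round().clamp(-127, 127).to(torch.int8)
    return w_q, scale

def dequantize(w_q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return w_q.float() / scale

w = torch.randn(4096, 4096)                             # a full-precision weight matrix
w_q, scale = absmax_quantize(w)
err = (w - dequantize(w_q, scale)).abs().mean()
print(f"stored as int8, mean absolute reconstruction error: {err:.6f}")
```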
Recently, a new PTQ method called LLM.int8() was proposed by Dettmers et al. (2022), which can achieve essentially lossless quantization for LLMs of up to 175B parameters. LLM.int8() consists of two steps: vector-wise quantization with separate normalization constants for each inner product in the matrix multiplication, and a mixed-precision decomposition scheme that isolates outlier feature dimensions into a 16-bit matrix multiplication while keeping most values in 8-bit.
LLM.int8() leverages the observation that transformer models exhibit highly systematic emergent features that dominate attention and predictive performance. These features are often sparse and skewed in distribution, which makes them difficult to quantize with uniform methods. By using vector-wise normalization constants and mixed-precision decomposition, LLM.int8() can preserve these features without sacrificing accuracy or speed.
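Below is a minimal sketch of these two ideas, with simplified shapes, a hand-picked outlier threshold, and int8 arithmetic simulated in floating point; the real method uses dedicated int8 kernels with int32 accumulation, and the function name here is illustrative:

```python
import torch

def int8_mixed_matmul(x: torch.Tensor, w: torch.Tensor, threshold: float = 6.0):
    """Sketch of vector-wise int8 quantization plus 16-bit outlier decomposition.

    x: (tokens, d_in) activations, w: (d_in, d_out) weights.
    Feature dimensions of x with any |value| > threshold are treated as outliers.
    """
    outlier_cols = (x.abs() > threshold).any(dim=0)            # (d_in,) boolean mask

    # Regular dimensions: vector-wise scales, one per row of x and per column of w.
    x_r, w_r = x[:, ~outlier_cols], w[~outlier_cols, :]
    sx = 127.0 / x_r.abs().amax(dim=1, keepdim=True).clamp(min=1e-8)   # (tokens, 1)
    sw = 127.0 / w_r.abs().amax(dim=0, keepdim=True).clamp(min=1e-8)   # (1, d_out)
    x_q = (x_r * sx).round().clamp(-127, 127)   # int8 values, simulated in float here
    w_q = (w_r * sw).round().clamp(-127, 127)
    y_regular = (x_q @ w_q) / (sx * sw)         # undo both scales after the matmul

    # Outlier dimensions: LLM.int8() routes these through a 16-bit matmul;
    # here they are simply left unquantized to show the decomposition.
    y_outlier = x[:, outlier_cols] @ w[outlier_cols, :]
    return y_regular + y_outlier

x = torch.randn(8, 512); x[:, 3] *= 20          # inject an outlier feature dimension
w = torch.randn(512, 256)
print((int8_mixed_matmul(x, w) - x @ w).abs().max())   # small error vs. full precision
```

Because the outlier dimensions are handled separately at higher precision, the aggressive int8 quantization only has to cover the well-behaved features, which is what keeps the end-to-end accuracy intact.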
The authors demonstrated that LLM.int8() can perform inference on LLMs such as OPT-175B/BLOOM without any performance degradation compared to full precision models. This result makes such models much more accessible for researchers and practitioners who want to leverage their capabilities on a single server with consumer GPUs.
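In practice, the LLM.int8() kernels ship in the bitsandbytes library and are integrated into Hugging Face Transformers. Assuming a reasonably recent transformers, accelerate and bitsandbytes install, and a smaller OPT checkpoint than 175B so it fits on a single consumer GPU, loading a model for 8-bit inference looks roughly like this:

```python
# Requires: pip install transformers accelerate bitsandbytes (versions with 8-bit support)
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "facebook/opt-6.7b"   # a smaller OPT checkpoint chosen for a single GPU
tokenizer = AutoTokenizer.from_pretrained(model_name)

# load_in_8bit=True routes the linear layers through the 8-bit bitsandbytes matmul
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    load_in_8bit=True,
)

inputs = tokenizer("Large language models in healthcare can", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=30)[0]))
```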
The future of transformer design for LLMs will likely involve more innovations in quantization methods as well as other techniques such as pruning, distillation, sparsity, hashing, etc. These techniques aim to make LLMs more efficient and scalable without compromising their quality or expressiveness.
> “I believe that large language models have tremendous potential to revolutionize healthcare by providing natural language understanding, reasoning, generation, summarization, translation, and dialogue capabilities across various domains such as clinical documentation, diagnosis, treatment, education, research, and ethics. However, the current state-of-the-art models are too expensive and impractical to deploy in real-world settings. That’s why I’m very excited about the recent advances in transformer design that enable lossless quantization and efficient inference for large language models. These advances will make it possible to bring these powerful models to more people and applications, and ultimately improve health outcomes and quality of life.”