Understanding LoRA Part I: Exploring Intrinsic Dimensions

Efficient fine-tuning techniques for Language Models

Feature image by ChatGPT

LoRA (Low-Rank Adaptation) has quickly become the de facto method for efficient fine-tuning of large language models. It offers a lightweight approach to adapting pre-trained models, significantly reducing the computational cost and memory requirements of traditional fine-tuning methods.

LoRA was introduced in the paper Hu, Edward J., et al., “LoRA: Low-Rank Adaptation of Large Language Models,” which takes its inspiration primarily from two ideas:

1. Li et al., “Measuring the Intrinsic Dimension of Objective Landscapes”
2. Aghajanyan et al., “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”

In this series of three articles, I’ll cover each of these ideas in depth, and finally LoRA itself. This will help you understand not only what LoRA is, but also how the authors arrived at it.

In this article, we will be talking about the fundamental inspiration behind LoRA — intrinsic dimensions. We will try to understand what the intrinsic dimension is and how it applies to various deep learning tasks, as described in Li et al., “Measuring the Intrinsic Dimension of Objective Landscapes.”

Here’s how this article is structured:

Understanding Intrinsic Dimensions
Random Subspace Training
Experiments and Results
– Details and Conventions
– Results
– Intrinsic Dimensions and Compression of Networks
– Intrinsic Dimensions and Minimum Description Length (MDL)
Optimizing the Projection Matrix
– Dense Random Projections
– Sparse Random Projections
– Fastfood Transform
Conclusion
References

Understanding Intrinsic Dimensions

Let’s set aside the complexities of non-linear neural networks for simplicity. Instead, let’s take a visually tractable example: solving a system of linear equations.

Let’s say we want to find a solution for the following system of linear equations using optimization:

Solving a system of linear equations using optimization

These two equations represent the same plane, i.e., they completely overlap. It will be clear shortly why we chose such equations.

Geometrically, the solution space of a system of linear equations is the intersection of the lines or planes they represent. In this case, the solution forms a large plane, as both equations completely overlap. Solving this system only requires finding a single point on the solution space, as any point on it will satisfy the equations.

In mathematical optimization, we typically guess a point in space, evaluate how close it is to the solution, and adjust it accordingly. Instead, what if we pick a smaller, random subspace (a plane or a line) and try to find a solution point within that subspace? If the solution space is large enough (in this case, it is a large plane), there is a high probability that the subspace intersects it, so a solution point can still be found within the chosen subspace. This would significantly reduce the search complexity.

To demonstrate this, let’s randomly select a subspace for the above equations:

Searching for a solution for the system of equations on a random plane

Now, let’s bring this concept to life with a visualization:

Searching for a solution for the system of equations on a random plane

Notice how the random subspace (green) intersects the solution space in a line. All the points on this line satisfy the equations. Instead of searching for a solution point in the entire 3D space, we scaled the problem down to a search on a 2D plane. This represents a huge optimization, as it can significantly reduce the computational resources required.

Note: This example is a simplified analogy to help visualize the idea. In reality, a 2D solution plane might not be ‘large enough’ to find a solution. To make the idea more accurate and practical, consider the toy experiment mentioned in the paper: take a 1000D vector, and define the optimization cost so that the first 100 values must sum to 1, the next 100 to 2, and so on. This is like a system of 10 linear equations in 1000 variables, so the solution space is a 990D hyperplane. The solution space covers such a large volume that we can optimize in a mere 10D subspace.
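To make this concrete, here is a minimal PyTorch sketch of the toy experiment (my own illustration, not the paper’s code, with arbitrary hyperparameters): only 10 values are trained, and they are mapped into the 1000D space through a frozen random offset and projection.

```python
import torch

D, d, groups = 1000, 10, 10            # full dimension, subspace dimension, number of constraints

torch.manual_seed(0)
theta_0 = torch.randn(D)               # frozen random initialization in the full space
P = torch.randn(D, d)
P = P / P.norm(dim=0, keepdim=True)    # columns normalized to unit length
theta_d = torch.zeros(d, requires_grad=True)   # the only trainable values: 10 of them

def loss_fn(theta):
    # first 100 entries should sum to 1, the next 100 to 2, ..., the last 100 to 10
    sums = theta.view(groups, -1).sum(dim=1)
    targets = torch.arange(1, groups + 1, dtype=theta.dtype)
    return ((sums - targets) ** 2).sum()

opt = torch.optim.Adam([theta_d], lr=0.1)
for _ in range(2000):
    opt.zero_grad()
    loss = loss_fn(theta_0 + P @ theta_d)   # map the 10D point back into 1000D
    loss.backward()
    opt.step()

print(loss_fn(theta_0 + P @ theta_d).item())   # approaches ~0 despite training only 10 numbers
```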

In a linear system, the size of the solution space is determined by the number of independent equations. For example, with two equations and three variables, the solution is a line (unless the equations overlap, in which case it is a plane), whereas with three equations and three variables, the solution is a single point. Think of each equation as a constraint on the solution space: every new independent equation increases the difficulty of the problem, shrinking the solution space and making it harder to find.
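As a quick sanity check of the ‘equations as constraints’ view: for a consistent system Ax = b, the dimension of the solution space is the number of variables minus the rank of A. A short snippet with coefficients of my own choosing:

```python
import numpy as np

# two dependent equations in three variables -> the solutions form a plane (dimension 2)
A = np.array([[1.0, 2.0, 3.0],
              [2.0, 4.0, 6.0]])
print(A.shape[1] - np.linalg.matrix_rank(A))   # 2

# two independent equations in three variables -> the solutions form a line (dimension 1)
A = np.array([[1.0, 2.0, 3.0],
              [0.0, 1.0, 4.0]])
print(A.shape[1] - np.linalg.matrix_rank(A))   # 1
```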

Similarly, in neural networks, if the solution space is large enough, the parameter search space needed can be very small and still allow us to find a solution with high probability. This means the problem that the network is solving has fewer constraints and hence, is not as complex as it might seem. The smallest possible dimension d such that a solution can be found within a d-dimensional subspace is called the intrinsic dimension of that learning problem. We can thus infer the inherent complexity of a learning problem based on the size of the search subspace.

But how do we use this in practical deep learning models? That’s where things get even more interesting.

Random Subspace Training

Now that we have a solid intuition, let’s extend this idea to neural networks.

The standard optimization procedure for neural networks involves learning a set of weights that transforms the input representation to the desired output representation. For a single-layered neural network:

A single-layered neural network: y = g(Wx + b)

where g is the activation function, and W and b are the weights and bias, respectively.

Consider another space where all the weights in the network (in this case, just W and b) form the basis. If W is a (in_features × out_features) matrix, and b is a (out_features × 1) vector, this space will have (out_features + in_features × out_features) axes, one for each weight. Each point in this space defines a unique weight configuration for the model. This space is commonly referred to as the parameter space.

If we take another axis for plotting the loss function, we get a loss landscape, where we can directly correlate the weights with the loss value.

We want to search for a point with minimal loss in the parameter space. And if the minimal loss region is “large enough,” we can effectively search for it within a much smaller, random subspace.

So how do we train a model in a low-dimensional subspace? The authors propose the following:

For a parameter vector θ⁽ᴰ⁾ ∈ ℝᴰ in the D-dimensional parameter space, let θ₀⁽ᴰ⁾ be the randomly chosen initial value, and θ⁽ᵈ⁾ be a parameter vector from a much smaller subspace (d ≪ D). Now,

θ⁽ᴰ⁾ = θ₀⁽ᴰ⁾ + P θ⁽ᵈ⁾ (random subspace training)

where P is a randomly generated projection matrix that maps θ⁽ᵈ⁾ back to the original D-dimensional space.

Why is the projection matrix P needed?
While the search for a solution point occurs within a low-dimensional subspace, the solution point itself lives in the original high-dimensional space. We are only assuming that it can be found within the smaller subspace, which changes neither the nature nor the dimensionality of that point.

In standard optimization, gradient steps are typically taken directly in the space of θ⁽ᴰ⁾. Instead, we make only θ⁽ᵈ⁾ trainable and keep the rest frozen. This ensures that the optimization occurs within the smaller subspace. θ⁽ᵈ⁾ is initialized to a zero vector so that the initial value of θ⁽ᴰ⁾ is θ₀⁽ᴰ⁾. This allows the network to benefit from custom initialization schemes while constraining the search to the lower-dimensional subspace.

The authors further mention that they normalize the columns of P to unit length. Additionally, they rely on the approximate orthogonality of high-dimensional random vectors and do not explicitly orthogonalize P. This makes the columns of P an approximately orthonormal basis of the random subspace.

Well, why does this even matter?

– Columns of unit length ensure that steps of unit length in the space of θ⁽ᵈ⁾ scale to steps of unit length in θ⁽ᴰ⁾. This keeps gradient descent well calibrated: each step has a direct, measurable impact in the original parameter space, avoiding the distortions or skewed steps that a non-orthonormal basis could introduce.
– Orthogonality of the column vectors ensures that updates along one basis vector of the subspace do not interfere with updates along another, preserving the independence of directions and allowing clear, interpretable movement in the subspace.
– An orthonormal basis also makes calculations more convenient.
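Putting these pieces together, here is a rough PyTorch sketch (my own, not the authors’ TensorFlow implementation) of a subspace-trained fully connected network: θ₀⁽ᴰ⁾ and P are frozen buffers, θ⁽ᵈ⁾ is the only trainable parameter, and the columns of P are normalized to unit length as described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SubspaceFC(nn.Module):
    def __init__(self, sizes=(784, 200, 200, 10), d=750):
        super().__init__()
        # frozen random initialization theta_0^(D), flattened across all layers
        self.shapes = [(m, n) for m, n in zip(sizes[:-1], sizes[1:])]
        flat_init = []
        for m, n in self.shapes:
            w = torch.empty(n, m)
            nn.init.kaiming_uniform_(w)
            flat_init += [w.flatten(), torch.zeros(n)]        # weight, then bias
        self.register_buffer("theta_0", torch.cat(flat_init))
        D = self.theta_0.numel()
        # fixed random projection with unit-norm (approximately orthogonal) columns;
        # a dense P of shape (D, d) is memory-hungry (see the projection section later)
        P = torch.randn(D, d)
        self.register_buffer("P", P / P.norm(dim=0, keepdim=True))
        # the only trainable parameters: d values, initialized to zero
        self.theta_d = nn.Parameter(torch.zeros(d))

    def forward(self, x):
        theta = self.theta_0 + self.P @ self.theta_d          # map back to D dimensions
        offset = 0
        for i, (m, n) in enumerate(self.shapes):
            W = theta[offset: offset + m * n].view(n, m); offset += m * n
            b = theta[offset: offset + n]; offset += n
            x = x @ W.T + b
            if i < len(self.shapes) - 1:
                x = F.relu(x)
        return x

model = SubspaceFC(d=750)
print(sum(p.numel() for p in model.parameters() if p.requires_grad))   # 750 trainable values
```

Such a model can be trained with any standard optimizer; only θ⁽ᵈ⁾ receives gradients, so the search stays inside the random d-dimensional subspace.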

Now, we will try to train models by iteratively increasing d. This will allow us to estimate dᵢₙₜ, i.e., the intrinsic dimension of various objectives.

Experiments and Results

In this section, we will go through some of the experiments mentioned in the paper. We will see the intrinsic dimension for several neural network architectures for various objectives.

Details and Conventions

The problems that neural networks usually solve are complicated, and the losses never really reach exactly zero. Hence, to evaluate the correctness of a solution, the authors compare their model’s performance with the best directly trained (in the full parameter space) baseline model.

In a supervised classification setting, validation accuracy is taken as a performance metric. The authors define dᵢₙₜ₁₀₀ as the intrinsic dimension of the 100% solution, i.e., performance as good as the baseline model. However, they found dᵢₙₜ₁₀₀ to be very inconsistent across models and objectives with widely varying values. In some cases, dᵢₙₜ₁₀₀ can be as high as D. Hence, they benchmark dᵢₙₜ₉₀ (performance at least equal to 90% of the baseline) instead, as it provides a reasonable tradeoff between model performance and robustness of dᵢₙₜ to small changes in the performance.

Note: Accuracy is preferred to loss to ensure the results allow comparison across models with different scales of loss.
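As a small illustration with made-up accuracy numbers, the dᵢₙₜ₉₀ criterion simply picks the smallest d in a sweep whose validation accuracy reaches 90% of the baseline:

```python
def d_int90(sweep, baseline_acc):
    """sweep: list of (d, validation_accuracy) pairs from subspace-training runs."""
    threshold = 0.9 * baseline_acc
    hits = [d for d, acc in sorted(sweep) if acc >= threshold]
    return hits[0] if hits else None

# hypothetical sweep with a baseline accuracy of 0.98
print(d_int90([(100, 0.62), (500, 0.85), (750, 0.89), (1000, 0.92)], 0.98))   # 750
```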

Results

We will perform the MNIST (LeCun et al., CC BY-SA 3.0) classification experiment mentioned in the paper and try to reproduce the results.

– All of my implementation code (PyTorch) is available here.
– The authors’ implementation code (TensorFlow) is available here.
Note: In my code, reproducing the results for convolutional networks and a few other experiments is still a work in progress.

For MNIST, we first take a fully connected network with layer sizes 784–200–200–10. Counting weights and biases, D = (784 × 200 + 200) + (200 × 200 + 200) + (200 × 10 + 10) = 199210. After gradually increasing the subspace dimension d, we reach dᵢₙₜ₉₀ at about 750, which matches the paper.
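For reference, the parameter count above (weights plus biases) can be verified in a couple of lines:

```python
sizes = [784, 200, 200, 10]
D = sum(m * n + n for m, n in zip(sizes[:-1], sizes[1:]))   # weights + biases per layer
print(D)   # 199210
```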

Reproducing results for MNIST on a FC network

The authors have also experimented with MNIST with shuffled labels and shuffled pixels to understand the correlation between the increasing complexity of the task and the intrinsic dimension. They also perform a detailed analysis of convnets, asking whether they are always better on MNIST. I recommend reading the paper for these deeper insights and a more exhaustive analysis.

Here is a consolidated results table from the paper:

Intrinsic dimension for various objectives and models from the paper

Intrinsic Dimensions and Compression of Networks

Based on the results for various datasets and network architectures, it can be seen that there is a substantial reduction in the number of trainable parameters required for achieving a performance that is on par with the baseline model.

This clearly hints at a new way of compressing neural networks. For example, for the MNIST FC network, the subspace dimension (750) gives a reduction of more than 99% in the number of trainable parameters (originally 199210). Now, to store this network, one needs to store only three items:

– the random seed for generating the weight initialization, θ₀⁽ᴰ⁾
– the random seed for generating the projection matrix, P
– the low-dimensional parameter vector θ⁽ᵈ⁾ (which can be as small as 750 floating-point values)
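To make the storage idea concrete, here is a rough sketch of the reconstruction (my own illustration, with hypothetical seed values and a plain Gaussian initialization standing in for whatever scheme the model actually uses); note that materializing a dense P this way is memory-heavy, which foreshadows the scaling discussion below.

```python
import torch

def rebuild(init_seed, proj_seed, theta_d, D):
    g = torch.Generator().manual_seed(init_seed)
    theta_0 = torch.randn(D, generator=g)                # re-create theta_0^(D) from its seed
    g = torch.Generator().manual_seed(proj_seed)
    P = torch.randn(D, theta_d.numel(), generator=g)     # re-create P from its seed
    P = P / P.norm(dim=0, keepdim=True)
    return theta_0 + P @ theta_d                         # the full D-dimensional parameters

# stored artefacts: two seeds plus the 750-value vector theta^(d)
theta_D = rebuild(0, 1, torch.zeros(750), D=199210)
print(theta_D.shape)   # torch.Size([199210])
```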

The authors further argue that this way of compressing networks avoids the need for elaborate pruning or quantization methods, making it both conceptually and practically efficient.

Intrinsic Dimensions and Minimum Description Length (MDL)

The Minimum Description Length (MDL) principle essentially suggests that the best model for a given dataset is the one that compresses the data most efficiently. In other words, it is the model that can be described using the fewest bits while still maintaining accuracy. In practical terms, MDL is used as a measure of model complexity, where a lower MDL corresponds to a simpler, more efficient model that achieves the same level of performance as a more complex one.

Instead of the number of bits, the authors consider MDL in terms of degrees of freedom (dimensions) needed to represent the model. As discussed earlier, random subspace training naturally leads to a compressed representation of the network. This makes dᵢₙₜ an upper bound on the MDL of the problem solution, as it represents the dimensionality necessary to achieve performance comparable to full-dimensional optimization. In standard optimization, the number of parameters (D) is considered an upper bound on the MDL of the model; dᵢₙₜ provides a much tighter bound. This interpretation suggests that models with lower intrinsic dimensions could be better suited to the problem, as they would have a lower MDL.

For example, in the results table, it can be observed that the LeNet architecture has a lower dᵢₙₜ₉₀ for MNIST classification (290) compared to a fully connected (FC) network, which has dᵢₙₜ₉₀ of 750. This supports the intuition that LeNet is a better-suited model for the MNIST problem due to its lower MDL.

Optimizing the Projection Matrix

Before concluding this article, one last thing that deserves attention is the scalability of the projection matrix P.

For any given layer with an (in_features × out_features) weight matrix W, we take the flat size of W as W_size = in_features * out_features. Then P is a (W_size × d) matrix. It is clear that for larger models, storing and multiplying such a matrix quickly runs into scaling limits. Hence, for generating this random P matrix, the authors experiment with three methods:

Dense Random Projections

The simplest method is to construct a dense matrix where each entry is drawn from a standard normal distribution. While this method is effective for models with few parameters, its computational and memory costs scale as 𝒪(Dd). For example, the authors found that while this approach worked with d = 225 and LeNet parameters D = 44426, they hit a limit while using a LeNet with D = 62006, unable to scale beyond d = 1000.
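A quick back-of-the-envelope calculation shows why: a dense float32 P needs D × d × 4 bytes.

```python
def dense_projection_megabytes(D, d, bytes_per_entry=4):
    # a dense P has D * d entries, each stored as a float32
    return D * d * bytes_per_entry / 1e6

print(dense_projection_megabytes(199210, 750))    # ~598 MB for the MNIST FC example
print(dense_projection_megabytes(62006, 1000))    # ~248 MB for the larger LeNet mentioned above
```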

Sparse Random Projections

To address the scaling limitations of dense projections, the authors implemented ‘very sparse random projections’ based on Li et al. (2006). Here, the density of the matrix is set to √(1 / D), meaning that each entry has a probability of √(1 / D) of being non-zero, resulting in only 𝒪(d√D) non-zero entries. This reduces the time and space complexity significantly, making it possible to increase d up to 2500. However, the memory overhead for non-zero elements (24 bytes each) limited further scaling.
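Here is a sketch of how such a projection could be generated with scipy.sparse (my own illustration, not the authors’ implementation; scikit-learn’s SparseRandomProjection is a ready-made alternative):

```python
import numpy as np
import scipy.sparse as sp

def sparse_projection(D, d, seed=0):
    rng = np.random.default_rng(seed)
    density = 1.0 / np.sqrt(D)                  # probability that any given entry is non-zero
    P = sp.random(D, d, density=density, format="csr", random_state=seed,
                  data_rvs=lambda n: rng.choice([-1.0, 1.0], size=n))
    # scale columns to (approximately) unit length
    col_norms = np.sqrt(np.asarray(P.multiply(P).sum(axis=0)).ravel())
    return P @ sp.diags(1.0 / np.maximum(col_norms, 1e-12))

P = sparse_projection(D=199210, d=2500)
print(P.nnz)   # on the order of d * sqrt(D) non-zeros, instead of D * d
```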

Fastfood Transform

The Fastfood transform (Le et al., 2013) offers a highly efficient way to generate random projections with minimal memory usage. It allows for implicit generation of the projection matrix using only 𝒪(D) space, with a total time complexity of 𝒪(D log d). While the technical details of the Fastfood transform are beyond the scope of this discussion, it is based on factorizing the projection matrix into simpler structured components. This significantly reduces the space requirements, enabling scaling to larger models, even those with around 1 million parameters.

Conclusion

In this article, we took a deep dive into the primary idea that led to LoRA: intrinsic dimensions. We discussed what the intrinsic dimension is, its relevance and application to deep learning objectives, and a few results that quantify the effectiveness of the approach. Finally, we discussed the bottlenecks and efficiency concerns in the proposed approach.

Next, we will delve into how intrinsic dimensions inform the fine-tuning of large language models (LLMs), bringing us a step closer to LoRA.

Finally,

– All of my implementation code (PyTorch) is available here.
– The authors’ implementation code (TensorFlow) is available here.

Feel free to comment or reach out to me for any clarifications or feedback on this article.

References

– Li et al., “Measuring the Intrinsic Dimension of Objective Landscapes”
– Hu, Edward J., et al., “LoRA: Low-Rank Adaptation of Large Language Models”
– Aghajanyan et al., “Intrinsic Dimensionality Explains the Effectiveness of Language Model Fine-Tuning”
– Li et al., “Very Sparse Random Projections,” Proceedings of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2006)
– Le et al., “Fastfood: Approximate Kernel Expansions in Loglinear Time” (2013)

*all images without a citation are created by the author of this article

