Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning
LoRDBA compresses on-device LoRA adapters by replacing both low-rank matrices with binary sign factors and channel-wise scales, cutting adapter storage by over 10x relative to fp16 LoRA while matching its task performance in certain regimes. The design turns the adapter's computation into sign-accumulation operations and adds at most 8% prefill latency at rank r=16, which makes it well suited to resource-constrained deployments.
Deep Analysis
Background
Large language model (LLM) adaptation on edge devices typically employs a frozen, quantized base model paired with a small, trainable Low-Rank Adaptation (LoRA) adapter. In the "unmerged adapter-mode," the LoRA module is not folded into the base weights but operates as a separate, active component. This setup makes the adapter a critical unit for local updates, communication, and hot-swapping. However, a standard LoRA adapter still contains dense floating-point weights, presenting a bottleneck for storage and memory efficiency.
Key Points
The core innovation of LoRDBA is a fundamental redesign of the adapter's internal structure:
- Binary Sign Representation: Both matrices (A and B) of the low-rank factorization are replaced with binary sign entries in {-1, +1}. Magnitude information is preserved through lightweight, channel-wise scaling factors.
- Computational Shift: The forward pass transforms from standard dense matrix multiplication into two sign-accumulation matrix multiplications (which are computationally cheaper) interleaved with the channel-wise scaling operation.
- Performance and Efficiency Trade-off:
  - Footprint: The adapter's storage size is reduced by over 10x compared to standard fp16 LoRA at the same rank.
  - Quality: In certain regimes, LoRDBA matches the task performance of fp16 LoRA and outperforms other low-bit quantization baselines.
  - Latency: The unmerged adapter incurs a prefill latency overhead of at most 8% at rank r=16.
  - Training Memory: Training memory consumption is approximately 1.6x that of fp16 LoRA.
- Theoretical Insight: A finite-sample analysis shows that the reconstruction quality of the binarized factors depends on the residual-to-magnitude ratio of the original LoRA factors, providing a guide for the method's applicability.
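The sign-plus-scale forward pass described in the Key Points can be sketched in NumPy. This is a minimal illustration under stated assumptions, not the paper's implementation. The parameterization (one scale per row of each factor), the function names, and the mean-absolute-value scale rule (the familiar least-squares scale for fixed signs) are all hypothetical choices made for the example. Likewise, `relative_residual` is only a simple proxy for how well a factor survives binarization, not the exact quantity used in the finite-sample analysis.

```python
import numpy as np

def binarize_rows(w):
    """Approximate each row of w by scale * sign(row).

    Assumption: one scale per row, set to the mean absolute value,
    which is the least-squares optimal scale once the signs are fixed.
    """
    signs = np.where(w >= 0, 1.0, -1.0)   # entries in {-1, +1}
    scales = np.abs(w).mean(axis=1)       # one scale per row (channel)
    return signs, scales

def adapter_forward(x, sign_a, scale_a, sign_b, scale_b):
    """Hypothetical unmerged adapter branch for a single input vector x.

    Two sign matmuls (additions and subtractions only in a real kernel)
    interleaved with channel-wise scaling.
    """
    hidden = scale_a * (sign_a @ x)       # shape (rank,)
    return scale_b * (sign_b @ hidden)    # shape (d_out,)

def relative_residual(w, signs, scales):
    """Frobenius-norm residual of the binarized factor relative to |w|."""
    approx = scales[:, None] * signs
    return np.linalg.norm(w - approx) / np.linalg.norm(w)

# Toy shapes chosen for illustration only.
rng = np.random.default_rng(0)
d_in, d_out, rank = 64, 48, 16
A = rng.normal(size=(rank, d_in))         # stands in for a trained LoRA A
B = rng.normal(size=(d_out, rank))        # stands in for a trained LoRA B
sign_a, scale_a = binarize_rows(A)
sign_b, scale_b = binarize_rows(B)
x = rng.normal(size=d_in)
y = adapter_forward(x, sign_a, scale_a, sign_b, scale_b)
print(y.shape, relative_residual(A, sign_a, scale_a))
```

In this sketch the signs are stored as floats for readability. A deployed kernel would presumably pack them into bits and replace the multiplications with sign-conditioned accumulation.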
Significance
LoRDBA addresses a critical practical constraint: deploying and managing personalized, task-specific adapters on memory-limited devices. By drastically compressing the adapter while keeping it separate (unmerged), it enables:
- Efficient On-Device Storage: Users can store many specialized adapters locally for instant hot-swapping.
- Reduced Communication Overhead: Transmitting adapters (e.g., from cloud to device) becomes faster and cheaper.
- Maintained Adaptability: The small, trainable state for local updates is preserved in an efficient format.

The work demonstrates that significant size reduction in the adapter branch is feasible without a proportional sacrifice in model quality or introduction of prohibitive latency, advancing the practicality of on-device LLM personalization.
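The storage and transmission savings can be estimated with simple bit counting. The sketch below is an illustration under assumptions, not the paper's accounting: it supposes 1 bit per sign, fp16 scales, and a hypothetical scale layout with one scale per rank channel plus one per output channel. The layer shape is also made up for the example.

```python
def fp16_lora_bits(d_in, d_out, rank):
    """Bits for a dense fp16 LoRA pair: A is (rank, d_in), B is (d_out, rank)."""
    return 16 * rank * (d_in + d_out)

def binary_adapter_bits(d_in, d_out, rank, scale_bits=16):
    """Bits for a sign-plus-scale adapter under the assumed layout."""
    sign_bits = rank * (d_in + d_out)   # 1 bit per entry of both factors
    scale_count = rank + d_out          # assumed: per-rank and per-output scales
    return sign_bits + scale_bits * scale_count

def compression_ratio(d_in, d_out, rank):
    """How many times smaller the binary adapter is than fp16 LoRA."""
    return fp16_lora_bits(d_in, d_out, rank) / binary_adapter_bits(d_in, d_out, rank)

# Hypothetical 4096 x 4096 projection at rank 16.
ratio = compression_ratio(4096, 4096, 16)
print(f"fp16 LoRA: {fp16_lora_bits(4096, 4096, 16) // 8} bytes")
print(f"binary:    {binary_adapter_bits(4096, 4096, 16) // 8} bytes")
print(f"ratio:     {ratio:.2f}x")
```

Under these assumptions the ratio is bounded above by 16x (fp16 versus 1 bit) and is pulled down by the scale vectors, so for this hypothetical shape it comes out a little above 10x. The actual figure depends on the layer shapes and on how the scales are stored.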