Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

Deep Analysis

Background

Large language model (LLM) adaptation on edge devices typically employs a frozen, quantized base model paired with a small, trainable Low-Rank Adaptation (LoRA) adapter. In unmerged adapter mode, the LoRA module is not folded into the base weights but runs as a separate branch alongside them. This makes the adapter the unit for local updates, communication, and hot-swapping. However, a standard LoRA adapter still contains dense floating-point weights, which become a bottleneck for storage and memory efficiency.

Key Points

The core innovation of LoRDBA (Low-Rank Double-Binary Adaptation) is a redesign of the adapter's internal structure:

Binary Sign Representation: Both matrices (A and B) of the low-rank factorization are replaced with sign matrices whose entries lie in {-1, +1}. The magnitude information is preserved through lightweight, channel-wise scaling factors.
Computational Shift: The forward pass transforms from standard dense matrix multiplication into two sign-accumulation matrix multiplications (which are computationally cheaper) interleaved with the channel-wise scaling operation.
Performance and Efficiency Trade-off:
- Footprint: The adapter's storage size is reduced by over 10x compared to standard fp16 LoRA at the same rank.
- Quality: In certain regimes, LoRDBA matches the task performance of fp16 LoRA and outperforms other low-bit quantization baselines.
- Latency: The unmerged adapter incurs a prefill latency overhead of at most 8% at rank r=16.
- Training Memory: Training memory consumption is approximately 1.6x that of fp16 LoRA.
Theoretical Insight: A finite-sample analysis shows that the reconstruction quality of the binarized factors depends on the residual-to-magnitude ratio of the original LoRA factors, providing a guide for the method's applicability.
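
The forward pass and footprint points above can be illustrated with a minimal NumPy sketch. It is not the paper's implementation. It assumes one scale per row of each factor (one per rank channel for A, one per output channel for B), set to the mean absolute value of that row, and it assumes the scales are stored in 16 bits. The function names binarize_rowwise, double_binary_forward, and adapter_bits are hypothetical, and the signs are held as int8 here for readability rather than packed to one bit each.

```python
import numpy as np


def binarize_rowwise(W):
    """Split W into a {-1, +1} sign matrix and one scale per row.

    Assumed scheme (not taken from the source): the scale of a row is
    the mean absolute value of that row.
    """
    signs = np.where(W >= 0, 1, -1).astype(np.int8)
    scales = np.abs(W).mean(axis=1).astype(np.float32)
    return signs, scales


def double_binary_forward(x, sign_a, scale_a, sign_b, scale_b):
    """Adapter branch output for x of shape (batch, d_in).

    sign_a: (r, d_in), scale_a: (r,)      assumed layout for factor A
    sign_b: (d_out, r), scale_b: (d_out,) assumed layout for factor B
    """
    h = x @ sign_a.T.astype(x.dtype)   # first sign-accumulation matmul
    h = h * scale_a                    # channel-wise scaling over rank channels
    y = h @ sign_b.T.astype(x.dtype)   # second sign-accumulation matmul
    return y * scale_b                 # channel-wise scaling over output channels


def adapter_bits(d_in, d_out, r, scale_bits=16):
    """Storage in bits for an fp16 LoRA adapter vs. the assumed binary layout."""
    fp16_bits = 16 * r * (d_in + d_out)
    binary_bits = r * (d_in + d_out) + scale_bits * (r + d_out)
    return fp16_bits, binary_bits


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((16, 64)).astype(np.float32)   # (r, d_in)
    B = rng.standard_normal((32, 16)).astype(np.float32)   # (d_out, r)
    x = rng.standard_normal((4, 64)).astype(np.float32)    # (batch, d_in)

    sign_a, scale_a = binarize_rowwise(A)
    sign_b, scale_b = binarize_rowwise(B)
    y = double_binary_forward(x, sign_a, scale_a, sign_b, scale_b)
    print("adapter output shape:", y.shape)

    fp16_bits, binary_bits = adapter_bits(d_in=4096, d_out=4096, r=16)
    print("fp16 bits:", fp16_bits, "binary bits:", binary_bits)
```

The compression ratio this sketch reports depends on how many scale vectors are stored and at what precision, so it should be read as an illustration of the bookkeeping rather than a reproduction of the reported figure.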

Significance

LoRDBA addresses a critical practical constraint: deploying and managing personalized, task-specific adapters on memory-limited devices. By drastically compressing the adapter while keeping it separate (unmerged), it enables:

Efficient On-Device Storage: Users can store many specialized adapters locally for instant hot-swapping.
Reduced Communication Overhead: Transmitting adapters (e.g., from cloud to device) becomes faster and cheaper.
Maintained Adaptability: The small, trainable state for local updates is preserved in an efficient format.

The work demonstrates that significant size reduction in the adapter branch is feasible without a proportional sacrifice in model quality or introduction of prohibitive latency, advancing the practicality of on-device LLM personalization.

