Signs Beat Floats: Low-Rank Double-Binary Adaptation for On-Device Fine-Tuning

LoRDBA shrinks on-device LoRA adapters by replacing their low-rank matrices with binary sign factors and channel-wise scales. The adapter's computation becomes sign-accumulation operations, cutting storage by over 10x relative to fp16 LoRA at the same rank with at most 8% prefill latency overhead at rank r=16, while matching fp16 LoRA task performance in certain regimes.

Deep Analysis

Background

Large language model (LLM) adaptation on edge devices typically employs a frozen, quantized base model paired with a small, trainable Low-Rank Adaptation (LoRA) adapter. In the "unmerged adapter-mode," the LoRA module is not folded into the base weights but operates as a separate, active component. This setup makes the adapter a critical unit for local updates, communication, and hot-swapping. However, a standard LoRA adapter still contains dense floating-point weights, presenting a bottleneck for storage and memory efficiency.
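
To make the merged versus unmerged distinction concrete, here is a minimal NumPy sketch. It assumes a plain float matrix W as a stand-in for the frozen (in practice quantized) base weight, omits the usual LoRA scaling factor, and uses hypothetical function names and toy dimensions.

```python
import numpy as np

def merged_forward(x, W, A, B):
    """Fold the low-rank update into the base weight, then apply it once."""
    W_merged = W + B @ A          # adapter disappears into a dense (d_out, d_in) matrix
    return W_merged @ x

def unmerged_forward(x, W, A, B):
    """Keep the adapter as a separate branch added to the base output."""
    return W @ x + B @ (A @ x)    # A: (r, d_in), B: (d_out, r) stay swappable

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d_out, r = 32, 24, 4
W = rng.normal(size=(d_out, d_in))
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)

y_merged = merged_forward(x, W, A, B)
y_unmerged = unmerged_forward(x, W, A, B)
```

Both paths produce the same output up to floating-point rounding, but only the unmerged one keeps A and B as a separate unit that can be stored, transmitted, or swapped without touching W.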

Key Points

The core innovation of LoRDBA is a fundamental redesign of the adapter's internal structure:

  • Binary Sign Representation: Both matrices (A and B) of the low-rank factorization are replaced with binary sign factors whose entries take values in {-1, +1}. The magnitude information is preserved through lightweight, channel-wise scaling factors.
  • Computational Shift: The forward pass transforms from standard dense matrix multiplication into two sign-accumulation matrix multiplications (which are computationally cheaper) interleaved with the channel-wise scaling operation.
  • Performance and Efficiency Trade-off:
    • Footprint: The adapter's storage size is reduced by over 10x compared to standard fp16 LoRA at the same rank.
    • Quality: In certain regimes, LoRDBA matches the task performance of fp16 LoRA and outperforms other low-bit quantization baselines.
    • Latency: The unmerged adapter incurs a prefill latency overhead of at most 8% at rank r=16.
    • Training Memory: Training memory consumption is approximately 1.6x that of fp16 LoRA.
  • Theoretical Insight: A finite-sample analysis shows that the reconstruction quality of the binarized factors depends on the residual-to-magnitude ratio of the original LoRA factors, providing a guide for the method's applicability.
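
The binarized forward pass and the footprint arithmetic described above can be sketched as follows. This is a minimal NumPy illustration under assumptions the summary does not confirm: each factor is stored as a sign matrix plus one fp16 scale per output channel of that factor (r scales for A, d_out scales for B), signs are packed at 1 bit each, and the scales come from a post-hoc mean-absolute-value rule. That rule is a standard sign-binarization choice and not necessarily how LoRDBA obtains its scales. All function names are hypothetical.

```python
import numpy as np

def binarize_rows(M):
    """Approximate M by diag(scale) @ sign with one scale per row (channel).

    Assumption: post-hoc binarization. For the fixed sign pattern sign(M), the
    mean absolute value of a row is the least-squares optimal scale for it.
    """
    sign = np.where(M >= 0, 1, -1).astype(np.int8)  # entries in {-1, +1}
    scale = np.abs(M).mean(axis=1)
    return sign, scale

def binary_adapter_forward(x, sign_A, scale_A, sign_B, scale_B):
    """Adapter branch only: two sign matmuls interleaved with channel-wise scaling."""
    h = (sign_A @ x) * scale_A      # sign accumulation down to rank r, then r scales
    return (sign_B @ h) * scale_B   # sign accumulation up to d_out, then d_out scales

def adapter_bytes_fp16(d_in, d_out, r):
    """Dense fp16 LoRA: 2 bytes per entry of A (r x d_in) and B (d_out x r)."""
    return 2 * r * (d_in + d_out)

def adapter_bytes_binary(d_in, d_out, r):
    """Assumed layout: 1 bit per sign entry plus fp16 scales (r for A, d_out for B)."""
    sign_bytes = r * (d_in + d_out) // 8
    scale_bytes = 2 * (r + d_out)
    return sign_bytes + scale_bytes

# Toy dimensions, chosen only for illustration.
rng = np.random.default_rng(0)
d_in, d_out, r = 64, 48, 8
A = rng.normal(size=(r, d_in))
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)

sign_A, scale_A = binarize_rows(A)
sign_B, scale_B = binarize_rows(B)
y = binary_adapter_forward(x, sign_A, scale_A, sign_B, scale_B)
```

Under these assumptions, a 4096 x 4096 layer at r=16 takes 262,144 bytes in fp16 and 24,608 bytes in the binary layout, a bit over 10x smaller. The exact ratio depends on how the scales are stored, so this is an order-of-magnitude check and not a reproduction of the paper's measurement.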

Significance

LoRDBA addresses a critical practical constraint: deploying and managing personalized, task-specific adapters on memory-limited devices. By drastically compressing the adapter while keeping it separate (unmerged), it enables:

  • Efficient On-Device Storage: Users can store many specialized adapters locally for instant hot-swapping.
  • Reduced Communication Overhead: Transmitting adapters (e.g., from cloud to device) becomes faster and cheaper.
  • Maintained Adaptability: The small, trainable state for local updates is preserved in an efficient format.

The work demonstrates that significant size reduction in the adapter branch is feasible without a proportional sacrifice in model quality or introduction of prohibitive latency, advancing the practicality of on-device LLM personalization.

Disclaimer: The above content is generated by AI and is for reference only.
