ICG: Improving Cover Image Generation via MLLM-based Prompting and Personalized Preference Alignment

Deep Analysis

Bridging MLLMs and Diffusion Models via a Unified Adapter

ICG's main technical contribution is architectural: it replaces the disjointed pipelines of earlier approaches with a single trainable system. Prior approaches often relied on handcrafted prompts to guide image generation, which left a semantic gap between text understanding and visual synthesis. ICG closes this gap with a lightweight adapter that connects the MLLM directly to the diffusion model and enables end-to-end training. The adapter acts as a semantic bridge: it translates the contextual features extracted by the MLLM into meta tokens, a format the diffusion model can condition on during image synthesis. As a result, the MLLM's understanding of item semantics shapes the generative process directly and dynamically, rather than serving as a separate, static preprocessing step. This end-to-end integration is key to improving the generated image's semantic fidelity to the source content.
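The data flow through such an adapter can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the summary does not specify the adapter's internals, so the sketch assumes a common design in which a fixed set of learned query vectors attends over the MLLM's output features and the pooled result is projected to the diffusion model's conditioning width. The class name `MetaTokenAdapter`, the dimensions, and the random initialization are all hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows: (n x k) @ (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows, cols, rng, scale=0.1):
    """Stand-in for learned parameters or extracted features."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


class MetaTokenAdapter:
    """Assumed design: learned queries pool MLLM features into meta tokens."""

    def __init__(self, mllm_dim, cond_dim, num_meta_tokens, seed=0):
        rng = random.Random(seed)
        # One query vector per meta token (would be trained end to end).
        self.queries = random_matrix(num_meta_tokens, mllm_dim, rng)
        # Projection from the MLLM width to the diffusion conditioning width.
        self.proj = random_matrix(mllm_dim, cond_dim, rng)
        self.scale = 1.0 / math.sqrt(mllm_dim)

    def __call__(self, mllm_features):
        # mllm_features: seq_len x mllm_dim, for any seq_len.
        keys_t = [list(col) for col in zip(*mllm_features)]
        scores = matmul(self.queries, keys_t)  # num_meta_tokens x seq_len
        weights = [softmax([s * self.scale for s in row]) for row in scores]
        pooled = matmul(weights, mllm_features)  # num_meta_tokens x mllm_dim
        return matmul(pooled, self.proj)  # num_meta_tokens x cond_dim


# Usage: 5 MLLM feature vectors of width 8 become 3 meta tokens of width 4.
features = random_matrix(5, 8, random.Random(1), scale=1.0)
meta_tokens = MetaTokenAdapter(mllm_dim=8, cond_dim=4, num_meta_tokens=3)(features)
print(len(meta_tokens), len(meta_tokens[0]))  # 3 4
```

The point of the sketch is the shape contract: however long the MLLM's output sequence is, the diffusion model receives a fixed number of conditioning tokens, and every step in between is differentiable, which is what makes end-to-end training possible.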

Personalized Preference Learning Without Labeled Supervision

A core challenge in personalization is obtaining ground-truth labels for what an individual user finds appealing. ICG sidesteps this need with a reward-based optimization framework that uses no explicit human ratings for training. Instead, it combines two types of rewards:

Public Rewards: These are universal metrics for aesthetic quality and basic relevance between the generated image and the item text, ensuring a baseline standard of visual appeal and semantic alignment.
Personalized Reward Model: This component is trained offline from historical user behavior data, such as click-through rates. It learns to predict a user's implicit preference score for an image, effectively distilling individual tastes from interaction patterns.

The multi-reward strategy balances these signals, guiding the diffusion model toward images that are both broadly pleasing and tailored to a user's learned preferences. The lack of explicit labels thus becomes an advantage: the approach draws on abundant implicit behavioral data, which is more scalable and more reflective of real-world engagement.
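One simple way to balance the rewards described above is a normalized weighted sum. The sketch below is an assumption for illustration only: the summary does not say how ICG mixes its rewards or what weights it uses, and in the actual framework the combined signal guides training of the diffusion model rather than ranking finished images. The names `RewardWeights`, `combined_reward`, and `pick_best`, and all scores, are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewardWeights:
    """Hypothetical mixing weights; the source does not state actual values."""
    aesthetic: float = 1.0
    relevance: float = 1.0
    personalized: float = 1.0


def combined_reward(aesthetic, relevance, personalized, weights=RewardWeights()):
    """Normalized weighted sum of two public rewards and one personalized reward."""
    total = weights.aesthetic + weights.relevance + weights.personalized
    if total <= 0:
        raise ValueError("reward weights must sum to a positive value")
    weighted = (
        weights.aesthetic * aesthetic
        + weights.relevance * relevance
        + weights.personalized * personalized
    )
    return weighted / total


def pick_best(candidates, weights=RewardWeights()):
    """Return the name of the candidate with the highest combined reward.

    candidates maps a name to an (aesthetic, relevance, personalized) tuple.
    """
    return max(
        candidates,
        key=lambda name: combined_reward(*candidates[name], weights=weights),
    )


# Made-up scores: "polished" wins on public rewards, "tailored" on personal fit.
candidates = {
    "polished": (0.9, 0.8, 0.2),
    "tailored": (0.7, 0.7, 0.9),
}
print(pick_best(candidates))  # tailored
print(pick_best(candidates, RewardWeights(personalized=0.0)))  # polished
```

With equal weights the personalized score tips the choice toward the tailored image; setting the personalized weight to zero recovers a purely public ranking. This is the trade-off the multi-reward strategy has to manage.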

Practical Implications and Scope of Personalization

The research positions ICG not just as a technical model but as a practical tool for digital platforms. Its plug-and-play design, compatible with common MLLM and diffusion model checkpoints, suggests lower adoption barriers. Personalization operates at a fine-grained level: user embeddings refine the semantic features extracted from item titles and reference images, so two users viewing the same item could receive differently styled cover images optimized for their tastes.
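The idea that a user embedding refines item features can be illustrated with a deliberately simple mechanism. The summary does not describe how ICG performs this refinement, so the sketch assumes additive conditioning: the user embedding is linearly projected into the item feature space and added to every item token. The function name `personalize_features`, the projection matrix, and the example values are hypothetical.

```python
def personalize_features(item_features, user_embedding, projection):
    """Shift each item token feature by a projected user embedding.

    item_features:  seq_len x dim list of token vectors
    user_embedding: list of length user_dim
    projection:     user_dim x dim matrix (assumed learned; fixed here)
    """
    dim = len(projection[0])
    # Project the user embedding into the item feature space.
    user_offset = [
        sum(u * projection[i][j] for i, u in enumerate(user_embedding))
        for j in range(dim)
    ]
    # Apply the same user-specific offset to every item token.
    return [[f + o for f, o in zip(token, user_offset)] for token in item_features]


# The same item yields different conditioning features for different users.
item = [[1.0, 2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
for_user_a = personalize_features(item, [0.5, -0.5], identity)
for_user_b = personalize_features(item, [0.0, 0.0], identity)
print(for_user_a)  # [[1.5, 1.5], [3.5, 3.5]]
print(for_user_b)  # [[1.0, 2.0], [3.0, 4.0]]
```

Because the item features differ per user before they reach the generator, the downstream image can differ as well, even though the title and reference images are identical.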

Personalization of this kind has direct implications for recommendation systems. The paper notes improved offline recommendation accuracy, indicating that the personalized covers themselves become informative signals for downstream tasks. The generated cover image thus serves a dual function: it is both an engaging end product for the user and a learned representation of user-item compatibility for the platform's algorithms. This blurs the line between content generation and feature engineering, and it suggests a paradigm in which generative models actively contribute to and refine the recommendation feedback loop.

Disclaimer: The above content is generated by AI and is for reference only.
