Information-theoretic Multimodal Representation Learning for Electrocardiogram Signals

Deep Analysis

Article Type: Research paper (computer science / medical AI)

The Multimodal ECG Gap

Current multimodal approaches align ECG signals with clinical reports to incorporate diagnostic semantics. However, clinical reports do not preserve the rich physiological structure of ECG waveforms, which spans multiple levels of abstraction:

Coarse diagnostic categories (e.g., normal vs. abnormal)
Fine-grained morphology (e.g., specific waveform patterns)

This gap means existing methods lose critical signal-level information when relying solely on text supervision.

Information-Theoretic Foundation

MERIT formulates ECG representation learning from an information-theoretic perspective, deriving a tractable objective that:

Preserves signal structure
Integrates clinical semantics

This grounds the design in theory rather than in ad hoc choices, and it unifies two learning objectives, masked modeling and contrastive alignment, under a single framework.
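
As a rough illustration of how such a unification can work, the sketch below uses two standard variational bounds on mutual information. It is not the paper's derivation, which this summary does not reproduce. All symbols are assumptions introduced here: X is the ECG signal, T the clinical report, Z the learned representation, q a decoder, N the number of pairs in a batch, and \lambda a weighting term.

```latex
\begin{aligned}
I(X; Z) &\ge H(X) + \mathbb{E}_{p(x,z)}\left[\log q(x \mid z)\right]
  && \text{(reconstruction bound)} \\
I(Z; T) &\ge \log N - \mathcal{L}_{\mathrm{InfoNCE}}
  && \text{(contrastive bound)} \\
\mathcal{L} &= \mathcal{L}_{\mathrm{mask}} + \lambda \, \mathcal{L}_{\mathrm{InfoNCE}}
  && \text{(assumed combined objective)}
\end{aligned}
```

Under a Gaussian decoder q, maximizing the first bound reduces to minimizing a squared reconstruction error, which is how a masked-modeling loss can be read as preserving signal structure. Minimizing the InfoNCE loss tightens the second bound, which ties ECG representations to report semantics.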

Dual-Branch Architecture

MERIT combines two complementary pretraining strategies:

Masked ECG modeling — learns to reconstruct masked portions of the ECG signal, preserving fine-grained morphological features
ECG-text contrastive alignment — aligns ECG representations with clinical report embeddings, incorporating diagnostic semantics

The two branches operate jointly, allowing the model to capture both low-level waveform structure and high-level clinical meaning simultaneously.
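
The joint training signal described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not MERIT's implementation: the function names, the mean-squared-error reconstruction loss, the symmetric InfoNCE form, the temperature of 0.07, and the `weight` parameter are all hypothetical choices made for the example.

```python
import math


def masked_mse(signal, reconstruction, mask):
    """Mean squared error computed over masked positions only (mask value truthy)."""
    errs = [(s - r) ** 2 for s, r, m in zip(signal, reconstruction, mask) if m]
    return sum(errs) / len(errs) if errs else 0.0


def _normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Average cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def info_nce(ecg_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired ECG and text embeddings."""
    ecg = [_normalize(v) for v in ecg_emb]
    txt = [_normalize(v) for v in text_emb]
    # Cosine similarity matrix, scaled by the temperature.
    logits = [[sum(a * b for a, b in zip(e, t)) / temperature for t in txt]
              for e in ecg]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-ECG direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


def joint_loss(signal, reconstruction, mask, ecg_emb, text_emb, weight=1.0):
    """Assumed combined objective: reconstruction loss plus weighted alignment loss."""
    return masked_mse(signal, reconstruction, mask) + weight * info_nce(ecg_emb, text_emb)
```

In a real system both terms would be computed on encoder outputs and backpropagated together; the point here is only that the reconstruction term sees the raw waveform while the contrastive term sees the paired report.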

Benchmark Performance

On PTB-XL, MERIT demonstrates consistent improvements over prior methods:

Task	Improvement
PTB-XL All classification	>3% F1
SubClass classification	>5% F1

The larger gain on SubClass classification suggests MERIT particularly excels at distinguishing fine-grained cardiac conditions — precisely where preserving morphological detail matters most.

Zero-Shot and Distribution-Shift Robustness

MERIT shows strong generalization without task-specific fine-tuning:

Zero-shot evaluation on PTB-XL SubClass: gains of up to 2.66% AUC and 2.11% F1
Robustness under multiple distribution-shift settings

These results indicate the learned representations capture transferable ECG features rather than overfitting to specific data characteristics.
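
Zero-shot classification with an ECG-text aligned model is commonly done by comparing an ECG embedding against text embeddings of label prompts. The sketch below shows that general recipe with toy vectors. The function names, the label names, and the embeddings are hypothetical, and the summary does not state which prompts or scoring rule MERIT uses.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def zero_shot_predict(ecg_embedding, label_embeddings):
    """Score an ECG embedding against each label's text embedding.

    label_embeddings maps a label name to the embedding of its text prompt.
    Returns the best-scoring label and the full score dictionary.
    """
    scores = {label: cosine(ecg_embedding, emb)
              for label, emb in label_embeddings.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Toy example: two hypothetical label prompts embedded in a 2-D space.
label_embs = {"normal": [1.0, 0.0], "abnormal": [0.0, 1.0]}
label, scores = zero_shot_predict([0.9, 0.1], label_embs)
```

For threshold-free metrics such as AUC, the per-label similarity scores would be used directly rather than the single best label.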

Downstream Text Generation

MERIT representations serve as conditioning inputs for large language models to generate clinical text. Conditioning on these representations improves generated text quality on metrics including:

ROUGE
METEOR
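
To make the overlap-based metrics concrete, here is a simplified ROUGE-1 F1 in plain Python. It is a sketch under simplifying assumptions (lowercasing and whitespace tokenization, no stemming), so its scores may differ from those of standard evaluation packages. The function name is hypothetical, and METEOR, which also matches stems and synonyms, is not covered.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Unigram-overlap F1 between a generated text and a reference text."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand or not ref:
        return 0.0
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((Counter(cand) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(cand)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A generated report that reproduces only half of the reference words, with no extra words, gets perfect precision but a recall of 0.5, so the F1 falls between the two.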

My independent judgment: The strongest evidence for MERIT's representational quality comes not from the classification benchmarks alone, but from the fact that the same representations improve both discriminative tasks (classification, zero-shot) and generative tasks (clinical text generation). A representation that helps both task types likely captures richer, more generalizable ECG features than one optimized for a single objective. The larger SubClass gains are consistent with the claim that the information-theoretic objective preserves morphological detail that text-only contrastive alignment tends to lose.
