Motivation Illustration: A central challenge in visual encoding and decoding is cross-modal alignment, which aims to establish a precise mapping between neural and visual distributions. Early methods relied on simple linear regression to approximate a unidirectional relationship, which limited their ability to capture complex semantic correspondences. Recent approaches introduce nonlinear mappings with a Diffusion Transformer (DiT) or Diffusion Prior (DP) under generative objectives: they condition Gaussian noise on one modality (e.g., the neural or visual latent distribution) and iteratively guide it toward the target distribution. However, such conditional noise-to-data pipelines still treat encoding and decoding as separate processes. In contrast, our proposed XFM establishes continuous, reversible flows directly between the neural and visual distributions, yielding a unified framework for encoding and decoding.
Overview of NeuroFlow. Stage-1 (A): NeuroVAE introduces probabilistic learning to model neural variability and constrains the latent space with visual semantics for consistent image-to-fMRI synthesis. Stage-2 (B): XFM unifies encoding and decoding processes by learning a time-dependent, reversible flow between empirical visual and neural latent distributions. Stage-3 (C): Encoding and decoding are performed within a single model at inference, where reversing the temporal direction naturally transitions between the two processes.
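The time reversal described in Stage-3 can be illustrated with a minimal sketch. It assumes a toy, hand-written velocity field in place of the learned XFM network, random arrays in place of real visual or neural latents, and a midpoint ODE solver chosen only for illustration; the names `velocity`, `integrate`, `z_a`, and `z_b` are hypothetical and do not come from the paper. The point is only that one time-dependent field, integrated forward and then backward in time, maps samples to the other end and back again.

```python
import numpy as np


def velocity(x, t):
    """Toy stand-in for a learned time-dependent velocity field v(x, t).

    Hypothetical: blends a contraction (early in time) with a constant
    drift (late in time). It is NOT the XFM model from the paper.
    """
    return (1.0 - t) * (-0.5 * x) + t * 2.0


def integrate(x, t_start, t_end, steps=1000):
    """Integrate dx/dt = v(x, t) from t_start to t_end (midpoint method).

    Passing t_start > t_end gives a negative step size, i.e. the same
    field is followed backward in time.
    """
    dt = (t_end - t_start) / steps
    t = t_start
    for _ in range(steps):
        k1 = velocity(x, t)
        k2 = velocity(x + 0.5 * dt * k1, t + 0.5 * dt)
        x = x + dt * k2
        t += dt
    return x


rng = np.random.default_rng(0)
z_a = rng.normal(size=(4, 8))        # stand-in latents at t = 0
z_b = integrate(z_a, 0.0, 1.0)       # forward in time: t = 0 -> t = 1
z_a_rec = integrate(z_b, 1.0, 0.0)   # same field, reversed time: t = 1 -> t = 0

# Round-trip error is small but not exactly zero (numerical integration).
max_err = float(np.max(np.abs(z_a - z_a_rec)))
print(f"max round-trip error: {max_err:.2e}")
```

Which end of the flow corresponds to the visual latents and which to the neural latents is left unspecified here, since the caption does not state it; in this sketch the two directions differ only in the order of `t_start` and `t_end`.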
Qualitative visual encoding and decoding performance comparisons. Left: NeuroFlow achieves superior decoding quality in semantic fidelity and visual structure. Right: NeuroFlow suppresses irrelevant cortical activity while enhancing category-specific regions, capturing consistent activation patterns underlying neural variability to support image synthesis consistent with visual stimuli.
Empirical visualizations. (A) Ablation study: removing key objectives degrades visual fidelity and semantic coherence. (B) Flow trajectory: the encoding trajectory reveals a suppression of early visual responses and a transition toward functional regions (i.e., FFA and EBA). The decoding trajectory evolves from an initial structural sketch, rather than Gaussian noise, to a realistic, high-fidelity image. (C-D) Brain functional analysis: category-selective fMRI activations and voxel-wise evaluation derived from raw and synthetic fMRI, computed on the whole test set, show that NeuroFlow suppresses early visual activity and emphasizes higher-order functional regions.