Parallel Jacobi Decoding for Fast Autoregressive Image Generation

CVPR 2026

Westlake University

*Corresponding author: wanghuan [at] westlake [dot] edu [dot] cn

Abstract

Autoregressive (AR) models have demonstrated remarkable performance in generating high-fidelity images. However, their inherently sequential next-token prediction makes inference slow. Recent studies have introduced Jacobi-style decoding to accelerate autoregressive image generation. Extending the draft sequence initially improves efficiency, yet the acceleration quickly saturates because error propagation along the one-dimensional sequence hinders convergence. Observing that images exhibit strong local spatial correlations, we propose Parallel Jacobi Decoding (PJD), a training-free decoding approach that expands draft tokens in the two-dimensional spatial domain to enable efficient spatially parallel refinement. PJD also adjusts the attention mask to mitigate error accumulation and improve convergence stability. Extensive experiments on diverse datasets show that PJD achieves 4.8×–6.4× acceleration across multiple autoregressive image generation models while maintaining competitive generation quality.

Method Overview

An illustration of one PJD iteration. (Left) Three rows become simultaneously active, each initializing three draft tokens. (Middle) All active rows are processed in one forward pass of the autoregressive transformer, followed by row-wise validation: accepted tokens are committed, while rejected ones are reused as the initial drafts for the next iteration. (Right) Each row’s sliding window advances after validation.
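As a rough illustration of the iteration described above, the following toy sketch runs a row-parallel, Jacobi-style decoding loop on a small token grid. It is not the released PJD implementation. The deterministic `toy_next` "model", the greedy acceptance test, the rule that a row may only draft columns whose upper neighbour is already committed, and every function name are assumptions made for this example.

```python
def toy_next(left, up):
    """Hypothetical deterministic 'model': a token depends on its left and upper neighbours."""
    return (3 * left + 5 * up + 1) % 17


def neighbours(grid, r, c):
    left = grid[r][c - 1] if c > 0 else 0
    up = grid[r - 1][c] if r > 0 else 0
    return left, up


def vanilla_ar(height, width):
    """Raster-order decoding: one token per step."""
    grid = [[0] * width for _ in range(height)]
    steps = 0
    for r in range(height):
        for c in range(width):
            grid[r][c] = toy_next(*neighbours(grid, r, c))
            steps += 1
    return grid, steps


def row_parallel_jacobi(height, width, window=3):
    """Toy row-parallel Jacobi loop: every active row refines a small draft window per pass."""
    grid = [[0] * width for _ in range(height)]  # committed tokens plus current drafts
    done = [0] * height                          # number of committed tokens in each row
    steps = 0
    while min(done) < width:
        # Assumption: a row drafts only columns whose upper neighbour is committed.
        windows = {}
        for r in range(height):
            limit = width if r == 0 else done[r - 1]
            end = min(done[r] + window, limit)
            if end > done[r]:
                windows[r] = (done[r], end)

        # One simulated forward pass over all active rows at once.
        preds = {
            r: [toy_next(*neighbours(grid, r, c)) for c in range(s, e)]
            for r, (s, e) in windows.items()
        }

        # Row-wise validation: the first prediction saw only committed inputs, so it
        # is always accepted; later ones are accepted while the draft they were
        # conditioned on matches the prediction for that position.
        for r, (s, e) in windows.items():
            accepted = 1
            while accepted < e - s and grid[r][s + accepted - 1] == preds[r][accepted - 1]:
                accepted += 1
            grid[r][s:e] = preds[r]   # unaccepted predictions stay behind as next drafts
            done[r] = s + accepted    # slide this row's window forward
        steps += 1
    return grid, steps


ar_grid, ar_steps = vanilla_ar(8, 8)
pj_grid, pj_steps = row_parallel_jacobi(8, 8)
print(ar_grid == pj_grid, ar_steps, pj_steps)
```

In this toy, both loops produce the same grid, and the row-parallel loop needs at most `height + width - 1` passes instead of `height * width` steps. Because `toy_next` is hard to guess from a wrong draft, most of the saving here comes from rows advancing in parallel rather than from several drafts being accepted within a row.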

Main Results

Quantitative comparison of image generation methods on the MS-COCO dataset. Latency represents the time to generate a single image, and Step indicates the number of steps required. The acceleration factors for both latency and steps are relative to Vanilla AR.
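For reference, an acceleration factor relative to Vanilla AR is simply the baseline value divided by the accelerated value, for latency and for steps alike. The short sketch below shows the arithmetic; the helper name and the numbers are placeholders chosen for illustration and are not results from the tables.

```python
def acceleration_factor(baseline: float, accelerated: float) -> float:
    """Ratio of a baseline cost (latency or steps) to the accelerated cost."""
    if baseline <= 0 or accelerated <= 0:
        raise ValueError("costs must be positive")
    return baseline / accelerated


# Placeholder values, not measurements from the paper.
vanilla_steps = 2048
fast_steps = 512
print(f"{acceleration_factor(vanilla_steps, fast_steps):.2f}x")  # prints "4.00x"
```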

Quantitative comparison of image generation methods on the PartiPrompt dataset. Latency represents the time to generate a single image, and Step indicates the number of steps required. The acceleration factors are relative to Vanilla AR.

Quantitative comparison on MS-COCO with additional quality metrics.

Quantitative comparison of decoding methods on Janus-Pro across the MS-COCO and PartiPrompt datasets. We report the latency and number of sampling steps required to generate a single 384 × 384 image, together with acceleration factors relative to Vanilla AR.

Ablation study on the effectiveness of the Row-Causal Mask (RCM) on the MS-COCO dataset. We compare decoding performance on both Lumina-mGPT and LlamaGen-XL, evaluating the impact of removing RCM. Incorporating RCM consistently improves sampling efficiency while also providing better image quality.
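This page does not spell out how the Row-Causal Mask is constructed. One plausible reading, offered here only as an assumption, is that a query token may attend to keys that precede it in raster order, provided the key is either already committed or a draft in the query's own row, so that unverified drafts in other rows cannot leak errors across rows. The sketch below builds such a boolean mask; the function name, the token layout, and the rule itself are hypothetical and may differ from the actual RCM.

```python
def row_causal_mask(tokens, width):
    """Build a boolean attention mask under an ASSUMED row-causal rule.

    tokens: list of (row, col, is_committed) tuples, one per sequence position.
    Returns mask[q][k] == True when query q may attend to key k.
    """
    def raster(tok):
        return tok[0] * width + tok[1]

    mask = []
    for q in tokens:
        allowed = []
        for k in tokens:
            causal = raster(k) <= raster(q)   # never look ahead in raster order
            trusted = k[2] or k[0] == q[0]    # committed key, or draft in the same row
            allowed.append(causal and trusted)
        mask.append(allowed)
    return mask


# Two committed tokens and one draft in row 0, plus one draft in row 1.
tokens = [(0, 0, True), (0, 1, True), (0, 2, False), (1, 0, False)]
for line in row_causal_mask(tokens, width=3):
    print(["x" if a else "." for a in line])
```

In this example the draft in row 1 attends to the two committed tokens and to itself, but not to the unverified draft at the end of row 0.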

Ablation study. Left: Effect of Context Token Count c. A larger c improves image quality (lower FID) but reduces acceleration. Middle: Ablation on top-k sampling. Our method maintains stable acceleration under all tested top-k configurations. Right: Impact of CFG scale on Step Compression and FID; higher CFG scales yield lower FID while maintaining over 6× acceleration.

Comparison of step compression ratios across different image resolutions. Our method consistently outperforms SJD and GSD, with the improvement becoming more evident as the image resolution increases.

Qualitative Results

Qualitative comparisons of 768 × 768 image generation with Lumina-mGPT across different methods. Our approach accelerates image generation by significantly reducing the number of steps, while maintaining the same level of image quality.

Qualitative comparison of 512 × 512 image generation results on LlamaGen-XL using four decoding strategies: Vanilla AR, SJD, GSD, and our PJD method. Across all prompts, our approach achieves the fastest generation with the fewest sampling steps.

Qualitative comparisons of 384 × 384 image generation on Janus-Pro across multiple prompts. For each pair, the left image is generated by Vanilla AR and the right image is generated by our method. Our approach significantly reduces the number of sampling steps while preserving comparable image quality.

@inproceedings{liao2026parallel,
  title={Parallel Jacobi Decoding for Fast Autoregressive Image Generation},
  author={Liao, Boya and Li, Ying and Jian, Siyong and Wang, Huan},
  booktitle={CVPR},
  year={2026}
}