Multiverse: Your Language Models Secretly Decide How to Parallelize and Merge Generation

Carnegie Mellon University, NVIDIA
Multiverse Demo (3x speedup)

Abstract

Autoregressive Large Language Models (AR-LLMs) frequently exhibit implicit parallelism in their sequential generation. Inspired by this, we introduce Multiverse, a new generative model that enables natively parallel generation. Multiverse internalizes a MapReduce paradigm, generating through three stages: (i) a Map stage for adaptive task decomposition, (ii) a Process stage for parallel subtask execution, and (iii) a Reduce stage for lossless result synthesis. We then build a real-world Multiverse reasoning model through the co-design of data, algorithm, and system, enabling rapid and seamless transfer from frontier AR-LLMs. On the data side, we create Multiverse 1K by converting sequential reasoning chains into structured training data with an automated, LLM-assisted pipeline, avoiding costly human annotation. Algorithmically, we design Multiverse Attention to separate parallel reasoning steps while remaining compatible with causal attention for efficient training. On the system side, we implement Multiverse Engine to enable parallel inference; its dedicated scheduler dynamically switches between sequential and parallel generation, triggered directly by the model. After 3 hours of fine-tuning on 1K examples, our Multiverse-32B stands as the only non-AR model achieving performance on par with leading AR-LLMs of the same scale, evidenced by AIME24 & 25 scores of 52% and 43%, respectively. Moreover, our budget-control experiments show that Multiverse-32B exhibits superior scaling, outperforming AR-LLMs by 1.87% on average at the same context length. This parallel scaling further translates into practical efficiency gains, achieving up to 2× speedup across varying batch sizes.

Multiverse Evaluation

What we can achieve with Multiverse

Reasoning Performance

Model                | AIME24 pass@1 (# parallel) | AIME25 pass@1 (# parallel) | MATH500 pass@1 (# parallel) | GPQA-Diamond pass@1 (# parallel)
s1-32B               | 35.4 (1.00)                | 25.8 (1.00)                | 88.6 (1.00)                 | 48.0 (1.00)
s1.1-32B             | 52.9 (1.00)                | 41.7 (1.00)                | 93.4 (1.00)                 | 60.3 (1.00)
Qwen2.5-32B-Instruct | 15.8 (1.00)                | 10.4 (1.00)                | 80.4 (1.00)                 | 47.0 (1.00)
Autoregressive-32B   | 51.3 (1.00)                | 42.9 (1.00)                | 92.8 (1.00)                 | 61.6 (1.00)
Multiverse-32B-zero  | 52.1 (1.07)                | 44.2 (1.05)                | 91.8 (1.05)                 | 62.1 (1.06)
Multiverse-32B       | 52.9 (1.24)                | 44.2 (1.18)                | 92.4 (1.15)                 | 61.7 (1.17)


Scaling

GPQA-Diamond Scaling

Budget Control Figure 1

MATH500 Scaling

Budget Control Figure 2


Efficiency Performance

Speedup vs. Parallelism

Efficiency Figure 1

Throughput vs. Parallelism

Efficiency Figure 2


Multiverse 1K

Explore the parallelism in the training corpus

Multiverse 1K is the dataset of 1,000 MapReduce-structured reasoning trajectories used to train the Multiverse model.

How to Build a Multiverse Model

Understanding the co-design of data, algorithm, and system.

Data: Multiverse 1K

Multiverse 1K Data Curation Pipeline

Figure (a): Multiverse 1K is automatically generated using an LLM-assisted data curation pipeline.

To address the absence of explicit MapReduce structures in existing sequential reasoning data, we introduce Multiverse 1K. Although long CoT trajectories often contain such structures implicitly, generating them explicitly is difficult. We therefore develop an automated, LLM-assisted pipeline that transforms sequential reasoning chains into parallel MapReduce structures, guided by a five-stage prompting protocol powered by Gemini 2.5 Pro.

Generating a Summary Tree. First, we iteratively decompose and outline the original reasoning chain into a two-level tree structure. In the first round, the entire reasoning chain is broken down into multiple steps. In the second round, the LLM examines each step for further decomposition into substeps. Each resulting step or substep is labeled and annotated with a concise description.

Identifying Parallel Groups. Second, we instruct the LLM to analyze each reasoning step, identifying which steps or groups of steps can be executed in parallel without violating logical dependencies.

Reformatting into Parallel Structures. Third, the summary tree is converted into a parallel structure based on the preceding analysis. To explicitly signal parallel execution, parallelizable steps or step groups are enclosed within the control tags <Parallel> and </Parallel>, forming a parallel block.

Refilling Original Details. Fourth, we prompt the LLM to repopulate the detailed content of each step and substep by retrieving and copying the corresponding passages from the original reasoning trajectory.

Adding MapReduce Structures. Finally, we convert the parallel structures into MapReduce structures. For each parallel block, the LLM generates both the Map and Reduce stages by outlining the specific goal and result of each individual path. Moreover, all paths are rewritten to avoid words implying sequential relations (e.g., "similarly") and to avoid including or referencing content from other paths, ensuring each path's completeness and independence.
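To make the target format concrete, the snippet below shows a minimal, hypothetical example of a single parallel block after the fifth stage (stored here as a Python string). The tag names (<Parallel>, <Goal>, <Outline>, <Path>, <Conclusion>) follow the description above, but the problem content and the exact serialization are illustrative rather than taken from Multiverse 1K.

```python
# A minimal, hypothetical example of one converted parallel block.
# Tag names follow the pipeline description; the content is made up.
EXAMPLE_PARALLEL_BLOCK = """\
<Parallel>
<Goal>
<Outline>1. Find the roots contributed by the factor x^2 - 4.</Outline>
<Outline>2. Find the root contributed by the factor x - 1.</Outline>
</Goal>
<Path>1. Setting x^2 - 4 = 0 gives x = 2 or x = -2.</Path>
<Path>2. Setting x - 1 = 0 gives x = 1.</Path>
<Conclusion>Combining both paths, the roots of (x^2 - 4)(x - 1) are x = -2, 1, and 2.</Conclusion>
</Parallel>
"""
```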

To further refine the data, two supplementary validation stages are incorporated. After the fourth stage, a content check filters out samples whose edit-distance ratio exceeds 0.2. After the fifth stage, a grammar check confirms strict adherence to our MapReduce structures. Samples failing either check are iteratively regenerated through the pipeline until both standards are met. Applying this automated pipeline to the s1K-1.1 dataset yields Multiverse 1K, a dataset of 1,000 high-quality, structured reasoning trajectories spanning a range of math and science problems.
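The two validation stages can be sketched as follows. This is a rough illustration under assumptions: the edit-distance ratio is approximated with Python's difflib rather than the authors' exact metric, the grammar check only tests a few structural properties rather than the full MapReduce schema, and run_five_stage_pipeline is a hypothetical stand-in for the five LLM-prompting stages.

```python
import difflib
import re

def content_check(original: str, converted: str, threshold: float = 0.2) -> bool:
    """Content check after stage four: reject samples that alter too much text.

    Approximates the edit-distance ratio as 1 - difflib similarity; the exact
    metric used by the authors may differ.
    """
    similarity = difflib.SequenceMatcher(None, original, converted).ratio()
    return (1.0 - similarity) <= threshold

def grammar_check(converted: str) -> bool:
    """Grammar check after stage five: a rough structural test, not the full schema."""
    blocks = re.findall(r"<Parallel>(.*?)</Parallel>", converted, re.S)
    if not blocks:
        return False
    for block in blocks:
        n_outlines, n_paths = block.count("<Outline>"), block.count("<Path>")
        if "<Goal>" not in block or "</Goal>" not in block:
            return False
        if n_outlines < 2 or n_outlines != n_paths:
            return False
        if "<Conclusion>" not in block:
            return False
    return True

def convert(original: str) -> str:
    """Regenerate a sample through the pipeline until both checks pass."""
    while True:
        converted = run_five_stage_pipeline(original)  # hypothetical stages 1-5
        if content_check(original, converted) and grammar_check(converted):
            return converted
```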


Algorithm: Multiverse Attention

Multiverse Attention Mechanism

Figure (b): Multiverse Attention mechanism.

Next, we introduce Multiverse Attention to replace the causal attention in AR-LLMs. Causal attention computes the i-th token's output from its query q_i together with the keys k_j and values v_j at positions j ≤ i.

However, this formulation poses challenges for parallel generation, as later paths would depend on both (i) the key-value (KV) pairs and (ii) the positional indices produced by earlier paths. To address this, we modify both the attention masks and the position indices, following APE. In Multiverse Attention, each path within the same block starts from the same position and is generated independently, without attending to the others. During the Reduce stage, all paths converge to the same position, equal to the maximum position reached across paths, regardless of their individual lengths.
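The masking and position rule can be made concrete with a small sketch. The code below assumes a flattened layout [prefix | path_1 | ... | path_k | reduce] for a single parallel block and builds the corresponding boolean attention mask and position ids; it is a conceptual illustration, not the released training implementation.

```python
import torch

def multiverse_mask_and_positions(prefix_len, path_lens, reduce_len):
    """Build an attention mask and position ids for one parallel block.

    Rules sketched from the description above:
      * prefix tokens attend causally among themselves,
      * each path attends to the prefix and to its own earlier tokens,
        never to sibling paths, and restarts at position prefix_len,
      * reduce tokens attend to everything and continue from the maximum
        position reached across paths.
    """
    total = prefix_len + sum(path_lens) + reduce_len
    allow = torch.zeros(total, total, dtype=torch.bool)  # True = may attend
    pos = torch.zeros(total, dtype=torch.long)

    # Shared prefix: plain causal attention, positions 0 .. prefix_len - 1.
    for i in range(prefix_len):
        allow[i, : i + 1] = True
        pos[i] = i

    # Parallel paths: mutually invisible, all restarting at position prefix_len.
    offset = prefix_len
    for plen in path_lens:
        for i in range(plen):
            t = offset + i
            allow[t, :prefix_len] = True     # sees the shared prefix
            allow[t, offset : t + 1] = True  # sees its own earlier tokens
            pos[t] = prefix_len + i
        offset += plen

    # Reduce stage: sees prefix and all paths; positions resume after the longest path.
    converge = prefix_len + max(path_lens)
    for i in range(reduce_len):
        t = offset + i
        allow[t, : t + 1] = True
        pos[t] = converge + i

    return allow, pos

# Example: a 4-token prefix, two paths of lengths 3 and 5, and a 2-token reduce.
mask, positions = multiverse_mask_and_positions(4, [3, 5], 2)
```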

Because it stays close to causal attention, Multiverse Attention supports efficient training: (i) it preserves training parallelism, and (ii) it can be adopted seamlessly by fine-tuning on only a few samples.


System: Multiverse Engine

Multiverse Engine Architecture

Figure (c): Multiverse Engine architecture.

To enable truly parallel generation in practical deployments, we introduce Multiverse Engine, an extension of existing inference engines designed for AR models. Specifically, we build on SGLang for its support of continuous batching and radix attention. These features allow dynamic batch scheduling and flexible KV-cache reuse, two scenarios that occur frequently in the Map and Reduce stages.

The Map stage is triggered automatically when a <Parallel> token is generated. The scheduler then counts the <Outline> tags encountered before </Goal> to determine the degree of parallelism. Based on this count, the engine creates multiple paths that execute in parallel as distinct samples within the same batch. Leveraging radix attention, these paths share the prefix KV cache of the current context. Each path is identified and initiated with "<Path> i" according to its order i in the <Outline> list. After prefilling, all paths are added to the decoding queue for parallel generation. When a path finishes, either by emitting </Path> or by reaching the maximum length, it enters a "zombie" state, releasing its resources and waiting for the remaining paths to complete before generation continues.
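The counting-and-forking logic of the Map stage can be re-enacted with a toy function. The sketch below scans a token stream for the control tags named above and returns the "<Path> i" prefixes the engine would fork; it is purely illustrative and operates on strings, whereas the real scheduler works on KV-cache state inside SGLang.

```python
def plan_parallel_block(tokens):
    """Toy re-enactment of the Map-stage scheduling rule.

    Counts <Outline> tags between <Parallel> and </Goal> to decide the degree
    of parallelism, then returns the "<Path> i" markers that would initiate
    each forked branch (all branches share the prefix KV cache via radix
    attention in the real engine).
    """
    counting, num_outlines = False, 0
    for tok in tokens:
        if tok == "<Parallel>":
            counting, num_outlines = True, 0
        elif counting and tok == "<Outline>":
            num_outlines += 1
        elif counting and tok == "</Goal>":
            return [f"<Path> {i}" for i in range(1, num_outlines + 1)]
    return []

# Example: two outlines in the Map stage -> two parallel branches.
stream = ["<Parallel>", "<Goal>",
          "<Outline>", "...", "</Outline>",
          "<Outline>", "...", "</Outline>",
          "</Goal>"]
print(plan_parallel_block(stream))  # ['<Path> 1', '<Path> 2']
```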

The Reduce stage begins once all paths in the Process stage have completed. The engine then merges the KV states of all paths, together with the preceding context, into a new sequence. Thanks to the flexible memory layout of the radix cache, the KV-cache indices can be merged seamlessly without any padding, avoiding both physical data copying and subsequent redundant computation. The <Conclusion> token, prefixed by this combined KV cache, is then added to the prefilling queue; once prefilling finishes, the request moves to the decoding queue to resume generation along the new sequence.
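Because the radix cache addresses KV entries by index, the Reduce-stage merge amounts to concatenating index lists. The toy sketch below illustrates this; the function name and index values are made up, and the real engine performs the merge on SGLang's internal cache structures.

```python
def merge_kv_indices(prefix_indices, path_indices):
    """Toy illustration of the Reduce-stage merge.

    Concatenates the shared-prefix indices with each path's own indices, so no
    KV data is copied and no padding is inserted. The merged sequence is then
    prefilled with <Conclusion> before decoding resumes.
    """
    merged = list(prefix_indices)
    for path in path_indices:
        merged.extend(path)  # each path's indices exclude the shared prefix
    return merged

# Example: a 4-token shared prefix plus two finished paths of 3 and 5 tokens.
prefix = [0, 1, 2, 3]
paths = [[10, 11, 12], [20, 21, 22, 23, 24]]
print(merge_kv_indices(prefix, paths))
# [0, 1, 2, 3, 10, 11, 12, 20, 21, 22, 23, 24]
```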

Acknowledgments

We thank Zhuoming Chen, Haizhong Zheng, Ranajoy Sadhukhan, Songlin Yang, Liliang Ren, Wentao Guo, Ruijie Zhu, Yu Zhang, Yixin Dong, Tian Jin, and Xin Dong for their constructive feedback on this work. We are particularly grateful to NVIDIA and BitDeer AI Research for generously providing GPU resources, and to Google for supplying free Gemini API credits. This research was supported in part by a Google Research Award, an Amazon Research Award, Intel, Li Auto, Moffett AI, and the CMU CyLab Seed Fund.