We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on DeepSeek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our code, models, datasets, and demo are publicly available.
Illustration of the LeX-Art suite:
- LeX-10K: 10K high-quality, high-resolution text-image pairs curated with advanced filtering techniques.
- LeX-Enhancer: a prompt enhancement model based on DeepSeek-R1 that enriches simple captions with detailed visual attributes.
- LeX-FLUX and LeX-Lumina: two text-to-image models optimized for text rendering with different size-performance trade-offs.
- LeX-Bench: a comprehensive benchmark for evaluating visual text generation across multiple dimensions.
- PNED: a new metric for evaluating text accuracy that handles sequence variations and unmatched elements.
Framework of the data construction pipeline. The red words in the R1-enhanced prompt are not rendered in the generated image; this is fixed after the post-alignment step.
In text-to-image generation, enriching prompts with fine-grained visual details has been shown to significantly improve output quality. We leverage DeepSeek-R1's reasoning capabilities to transform simple captions into detailed descriptions with rich visual attributes, including font styles, color schemes, and spatial layouts.
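As a rough illustration of this enrichment step (not the exact prompts or parameters used to build LeX-10K), the sketch below sends a simple caption to DeepSeek-R1 through an OpenAI-compatible client and asks for a detailed description covering the rendered text, fonts, colors, and layout. The system prompt, endpoint, and model name are assumptions made for illustration.

```python
# Minimal sketch of caption enrichment with DeepSeek-R1.
# Assumes an OpenAI-compatible endpoint and the "deepseek-reasoner" model name;
# the instruction text below is illustrative, not the pipeline's actual prompt.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

SYSTEM = (
    "You expand short image captions into detailed text-to-image prompts. "
    "Describe the exact text to render, plus font style, color scheme, "
    "text position, background, and overall layout."
)

def enrich(simple_caption: str) -> str:
    """Return a detailed, text-rendering-oriented prompt for a simple caption."""
    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": simple_caption},
        ],
    )
    return resp.choices[0].message.content

print(enrich('A poster with the words "LeX-Art" in bold letters'))
```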
Images generated by FLUX.1 [dev] from different prompts. DeepSeek-R1 provides more detailed and structured descriptions than other LLMs, resulting in higher-quality text rendering.
Our LeX-10K dataset achieves significantly higher image quality and aesthetic scores than existing text-image datasets, highlighting its advantage for text rendering tasks.
Distributions of image quality and aesthetic scores for the AnyText dataset and LeX-10K.
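The exact curation criteria are not restated here; the snippet below is only a schematic of score-based filtering, where samples with precomputed quality and aesthetic scores are kept only if both clear a threshold. The field names and threshold values are placeholders, not the actual criteria used to curate LeX-10K.

```python
# Schematic score-based filtering; thresholds and record fields are placeholders.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    quality_score: float    # e.g. from an image-quality model
    aesthetic_score: float  # e.g. from an aesthetic predictor

def filter_samples(samples, min_quality=0.6, min_aesthetic=5.0):
    """Keep only samples whose quality and aesthetic scores clear both thresholds."""
    return [s for s in samples
            if s.quality_score >= min_quality and s.aesthetic_score >= min_aesthetic]

pool = [
    Sample("a.png", 0.82, 6.1),
    Sample("b.png", 0.41, 6.8),  # dropped: low quality
    Sample("c.png", 0.77, 4.2),  # dropped: low aesthetics
]
print([s.image_path for s in filter_samples(pool)])  # ['a.png']
```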
Overview of LeX-Bench. Prompts in LeX-Bench are split into three levels: 630 Easy-Level (2–4 words), 480 Medium-Level (5–9 words), and 200 Hard-Level (10–14 words). Easy-Level prompts additionally specify text attributes: color, font, and position.
We evaluate LeX-FLUX and LeX-Lumina on traditional benchmarks and our proposed LeX-Bench. The results demonstrate significant improvements in text rendering accuracy, aesthetics, and text attribute controllability.
Model | LeX-Enhancer | PNED ↓ | Recall ↑ | Aesthetic ↑ |
---|---|---|---|---|
FLUX.1 [dev] | ✗ | 2.23 | 0.64 | 3.38 |
FLUX.1 [dev] | ✓ | 1.50 | 0.77 | 3.43 |
LeX-FLUX | ✗ | 2.30 | 0.64 | 3.46 |
LeX-FLUX | ✓ | 1.52 | 0.79 | 3.47 |
Lumina-Image-2.0 | ✗ | 2.39 | 0.51 | 2.86 |
Lumina-Image-2.0 | ✓ | 1.85 | 0.69 | 3.29 |
LeX-Lumina | ✗ | 2.41 | 0.53 | 3.41 |
LeX-Lumina | ✓ | 1.44 | 0.73 | 3.50 |
We compare our models with state-of-the-art glyph-conditioned methods on the AnyText-Benchmark. Despite not using explicit glyph information, our models achieve competitive performance in text rendering accuracy and superior prompt-image alignment.
Methods | LeX-Enhancer | Acc ↑ | CLIPScore ↑ |
---|---|---|---|
ControlNet | - | 0.5837 | 0.8448 |
TextDiffuser | - | 0.5921 | 0.8685 |
GlyphControl | - | 0.3710 | 0.8847 |
AnyText | - | 0.7239 | 0.8841 |
LeX-Lumina | ✗ | 0.3840 | 0.8100 |
LeX-Lumina | ✓ | 0.6220 | 0.8832 |
LeX-FLUX | ✗ | 0.5220 | 0.8754 |
LeX-FLUX | ✓ | 0.7110 | 0.8918 |
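As a reference for how prompt-image alignment can be scored, the sketch below computes a CLIP-based cosine similarity between an image and its prompt using the Hugging Face `openai/clip-vit-base-patch32` checkpoint. The checkpoint choice and the absence of any rescaling are our assumptions; the AnyText-Benchmark's exact CLIPScore configuration may differ.

```python
# Minimal CLIP-based image-text alignment score (cosine similarity of embeddings).
# Checkpoint and scaling are assumptions; the benchmark's exact setup may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image_path: str, prompt: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

print(clip_score("sample.png", 'A neon sign reading "OPEN 24 HOURS"'))
```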
Human evaluators consistently preferred images generated by LeX-Lumina over Lumina-Image 2.0 in terms of text accuracy, text recall rate, and aesthetic quality.
Human preference results on text accuracy, text recall rate, and aesthetics for LeX-Lumina. For ease of illustration, we visualize the proportions of votes where LeX-Lumina wins, loses to, or ties with Lumina-Image 2.0.
Qualitative comparison between Lumina-Image 2.0 and LeX-Lumina. The first column shows Lumina-Image 2.0 without LeX-Enhancer using Simple Caption; the second column shows the trained LeX-Lumina without LeX-Enhancer using Simple Caption; the third column shows Lumina-Image 2.0 with LeX-Enhancer enabled; and the fourth column shows LeX-Lumina with LeX-Enhancer enabled. We observe that (1) LeX-Lumina exhibits better text rendering in terms of text fidelity and aesthetics, and (2) LeX-Enhancer exhibits a strong capability for enhancing simple prompts.
Qualitative comparison between FLUX.1 [dev] and LeX-FLUX. The first column shows FLUX.1 [dev] without LeX-Enhancer using Simple Caption; the second column shows the trained LeX-FLUX without LeX-Enhancer using Simple Caption; the third column shows FLUX.1 [dev] with LeX-Enhancer enabled; and the fourth column shows LeX-FLUX with LeX-Enhancer enabled. We observe that (1) LeX-FLUX exhibits better text rendering in terms of text fidelity and text attribute controllability, and (2) LeX-Enhancer exhibits a strong capability for enhancing simple prompts.
Qualitative comparison between LeX-Lumina, LeX-FLUX, and glyph-conditioned models. We compare our models with AnyText and TextDiffuser on five different prompts. We observe that our models generally achieve high text fidelity, better text attribute controllability, and higher aesthetics.
Showcase of text rendering results from LeX-Lumina (first two rows) and LeX-FLUX (last two rows) on text-to-image tasks. The examples demonstrate the models' ability to generate clear, well-aligned, and aesthetically pleasing text within images.
In this work, we enhance the text rendering capability of text-to-image foundation models from a data-centric perspective. Specifically, we leverage DeepSeek-R1 to synthesize a set of domain-specific, highly detailed image descriptions. Subsequently, we design a series of post-processing modules, resulting in LeX-10K, a synthetic dataset comprising 10,000 high-quality text-image samples. Using the prompt-enhancing data generated by DeepSeek-R1, we fine-tune a locally deployable prompt enhancer, LeX-Enhancer. Furthermore, leveraging LeX-10K, we fine-tune two existing models, FLUX.1 [dev] and Lumina-Image 2.0, to obtain LeX-FLUX and LeX-Lumina, respectively.
To rigorously evaluate text rendering performance, we construct a comprehensive benchmark, LeX-Bench, which assesses text rendering accuracy, aesthetics, and controllability over text attributes. In addition, we introduce a flexible and general metric, Pairwise Normalized Edit Distance (PNED), specifically designed for evaluating text rendering precision. Extensive experiments conducted on LeX-Bench demonstrate that LeX-FLUX and LeX-Lumina achieve significant improvements across all aspects of text rendering.
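To make the idea behind PNED concrete, the sketch below matches OCR-extracted words to ground-truth words so that the summed length-normalized edit distance is minimal, charging a fixed penalty for words left unmatched on either side. The matching strategy and the penalty value of 1.0 are our assumptions for illustration (and assume SciPy is available); the paper's exact formulation may differ.

```python
# Illustrative sketch of a pairwise normalized edit distance (PNED)-style metric.
# NOTE: not the reference implementation; matching and penalty are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution
            prev = cur
    return dp[-1]

def pned(pred_words, gt_words, unmatched_penalty=1.0):
    """Optimally pair predicted and ground-truth words to minimize the summed
    normalized edit distance; unmatched words incur a fixed penalty."""
    n, m = len(pred_words), len(gt_words)
    size = max(n, m)
    # Pad the cost matrix so unmatched words pair with a dummy slot at full penalty.
    cost = np.full((size, size), unmatched_penalty, dtype=float)
    for i, p in enumerate(pred_words):
        for j, g in enumerate(gt_words):
            cost[i, j] = edit_distance(p, g) / max(len(p), len(g), 1)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())

# "Wrold" nearly matches "World"; "2024" in the ground truth goes unmatched.
print(pned(["Hello", "Wrold"], ["Hello", "World", "2024"]))  # ≈ 0 + 0.4 + 1.0 = 1.4
```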