We introduce LeX-Art, a comprehensive suite for high-quality text-image synthesis that systematically bridges the gap between prompt expressiveness and text rendering fidelity. Our approach follows a data-centric paradigm, constructing a high-quality data synthesis pipeline based on DeepSeek-R1 to curate LeX-10K, a dataset of 10K high-resolution, aesthetically refined 1024$\times$1024 images. Beyond dataset construction, we develop LeX-Enhancer, a robust prompt enrichment model, and train two text-to-image models, LeX-FLUX and LeX-Lumina, achieving state-of-the-art text rendering performance. To systematically evaluate visual text generation, we introduce LeX-Bench, a benchmark that assesses fidelity, aesthetics, and alignment, complemented by Pairwise Normalized Edit Distance (PNED), a novel metric for robust text accuracy evaluation. Experiments demonstrate significant improvements, with LeX-Lumina achieving a 79.81% PNED gain on CreateBench, and LeX-FLUX outperforming baselines in color (+3.18%), positional (+4.45%), and font accuracy (+3.81%). Our code, models, datasets, and demo are publicly available.
Illustration of the LeX-Art suite:
- LeX-10K: 10K high-quality, high-resolution text-image pairs curated with advanced filtering techniques.
- LeX-Enhancer: a prompt enhancement model based on DeepSeek-R1 that enriches simple captions with detailed visual attributes.
- LeX-FLUX and LeX-Lumina: two text-to-image models optimized for text rendering with different size-performance trade-offs.
- LeX-Bench: a comprehensive benchmark for evaluating visual text generation across multiple dimensions.
- PNED: a new metric for evaluating text accuracy that handles sequence variations and unmatched elements.
Framework of the data construction pipeline. The red words in the R1-enhanced prompt are not rendered in the generated image; this is fixed after the post-alignment step.
In text-to-image generation, enriching prompts with fine-grained visual details has been shown to significantly improve output quality. We leverage DeepSeek-R1's reasoning capabilities to transform simple captions into detailed descriptions with rich visual attributes, including font styles, color schemes, and spatial layouts.
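As a rough illustration of this enrichment step (not the exact prompts or parameters used to build LeX-10K), the sketch below sends a simple caption to DeepSeek-R1 through an OpenAI-compatible client and asks for a detailed description covering the rendered text, fonts, colors, and layout. The system prompt, endpoint, and model name are assumptions made for illustration.

```python
# Minimal sketch of caption enrichment with DeepSeek-R1.
# Assumes an OpenAI-compatible endpoint and the "deepseek-reasoner" model name;
# the instruction text below is illustrative, not the pipeline's actual prompt.
from openai import OpenAI

client = OpenAI(api_key="YOUR_KEY", base_url="https://api.deepseek.com")

SYSTEM = (
    "You expand short image captions into detailed text-to-image prompts. "
    "Describe the exact text to render, plus font style, color scheme, "
    "text position, background, and overall layout."
)

def enrich(simple_caption: str) -> str:
    """Return a detailed, text-rendering-oriented prompt for a simple caption."""
    resp = client.chat.completions.create(
        model="deepseek-reasoner",
        messages=[
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": simple_caption},
        ],
    )
    return resp.choices[0].message.content

print(enrich('A poster with the words "LeX-Art" in bold letters'))
```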
Images generated by FLUX.1 [dev] from different prompts. DeepSeek-R1 provides more detailed and structured descriptions than other LLMs, resulting in higher-quality text rendering.
Our LeX-10K dataset achieves significantly higher image quality and aesthetic scores than existing text-image datasets, highlighting its advantage for text rendering tasks.
Distributions of image quality and aesthetic scores for the AnyText dataset and LeX-10K.
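The exact curation criteria are not restated here; the snippet below is only a schematic of score-based filtering, where samples with precomputed quality and aesthetic scores are kept only if both clear a threshold. The field names and threshold values are placeholders, not the actual criteria used to curate LeX-10K.

```python
# Schematic score-based filtering; thresholds and record fields are placeholders.
from dataclasses import dataclass

@dataclass
class Sample:
    image_path: str
    quality_score: float    # e.g. from an image-quality model
    aesthetic_score: float  # e.g. from an aesthetic predictor

def filter_samples(samples, min_quality=0.6, min_aesthetic=5.0):
    """Keep only samples whose quality and aesthetic scores clear both thresholds."""
    return [s for s in samples
            if s.quality_score >= min_quality and s.aesthetic_score >= min_aesthetic]

pool = [
    Sample("a.png", 0.82, 6.1),
    Sample("b.png", 0.41, 6.8),  # dropped: low quality
    Sample("c.png", 0.77, 4.2),  # dropped: low aesthetics
]
print([s.image_path for s in filter_samples(pool)])  # ['a.png']
```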
Overview of LeX-Bench. Prompts in LeX-Bench are split into three levels: 630 Easy-Level (2–4 words), 480 Medium-Level (5–9 words), and 200 Hard-Level (10–14 words). Easy-Level prompts additionally specify text attributes: color, font, and position.
We evaluate LeX-FLUX and LeX-Lumina on traditional benchmarks and our proposed LeX-Bench. The results demonstrate significant improvements in text rendering accuracy, aesthetics, and text attribute controllability.
Model | LeX-Enhancer | PNED ↓ | Recall ↑ | Aesthetic ↑ |
---|---|---|---|---|
FLUX.1 [dev] | ✗ | 2.23 | 0.64 | 3.38 |
FLUX.1 [dev] | ✓ | 1.50 | 0.77 | 3.43 |
LeX-FLUX | ✗ | 2.30 | 0.64 | 3.46 |
LeX-FLUX | ✓ | 1.52 | 0.79 | 3.47 |
Lumina-Image-2.0 | ✗ | 2.39 | 0.51 | 2.86 |
Lumina-Image-2.0 | ✓ | 1.85 | 0.69 | 3.29 |
LeX-Lumina | ✗ | 2.41 | 0.53 | 3.41 |
LeX-Lumina | ✓ | 1.44 | 0.73 | 3.50 |
We compare our models with state-of-the-art glyph-conditioned methods on the AnyText-Benchmark. Despite not using explicit glyph information, our models achieve competitive performance in text rendering accuracy and superior prompt-image alignment.
Methods | LeX-Enhancer | Acc ↑ | CLIPScore ↑ |
---|---|---|---|
ControlNet | - | 0.5837 | 0.8448 |
TextDiffuser | - | 0.5921 | 0.8685 |
GlyphControl | - | 0.3710 | 0.8847 |
AnyText | - | 0.7239 | 0.8841 |
LeX-Lumina | ✗ | 0.3840 | 0.8100 |
LeX-Lumina | ✓ | 0.6220 | 0.8832 |
LeX-FLUX | ✗ | 0.5220 | 0.8754 |
LeX-FLUX | ✓ | 0.7110 | 0.8918 |
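As a reference for how prompt-image alignment can be scored, the sketch below computes a CLIP-based cosine similarity between an image and its prompt using the Hugging Face `openai/clip-vit-base-patch32` checkpoint. The checkpoint choice and the absence of any rescaling are our assumptions; the AnyText-Benchmark's exact CLIPScore configuration may differ.

```python
# Minimal CLIP-based image-text alignment score (cosine similarity of embeddings).
# Checkpoint and scaling are assumptions; the benchmark's exact setup may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_score(image_path: str, prompt: str) -> float:
    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=[prompt], images=image, return_tensors="pt",
                       padding=True, truncation=True)
    img_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    txt_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                      attention_mask=inputs["attention_mask"])
    img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
    txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
    return float((img_emb * txt_emb).sum())

print(clip_score("sample.png", 'A neon sign reading "OPEN 24 HOURS"'))
```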
Human evaluators consistently preferred images generated by LeX-Lumina over Lumina-Image 2.0 in terms of text accuracy, text recall rate, and aesthetic quality.
Human preference results on text accuracy, text recall rate, and aesthetics for LeX-Lumina. For ease of illustration, we visualize the proportions of votes where LeX-Lumina wins, loses to, or ties with Lumina-Image 2.0.
Qualitative comparison between Lumina-Image 2.0 and LeX-Lumina. The first column shows Lumina-Image 2.0 without LeX-Enhancer using Simple Caption; the second column shows the trained LeX-Lumina without LeX-Enhancer using Simple Caption; the third column shows Lumina-Image 2.0 with LeX-Enhancer enabled; and the fourth column shows LeX-Lumina with LeX-Enhancer enabled. We observe that (1) LeX-Lumina exhibits better text rendering in terms of text fidelity and aesthetics, and (2) LeX-Enhancer exhibits a strong capability for enhancing simple prompts.
Qualitative comparison between FLUX.1 [dev] and LeX-FLUX. The first column shows FLUX.1 [dev] without LeX-Enhancer using Simple Caption; the second column shows the trained LeX-FLUX without LeX-Enhancer using Simple Caption; the third column shows FLUX.1 [dev] with LeX-Enhancer enabled; and the fourth column shows LeX-FLUX with LeX-Enhancer enabled. We observe that (1) LeX-FLUX exhibits better text rendering in terms of text fidelity and text attribute controllability, and (2) LeX-Enhancer exhibits a strong capability for enhancing simple prompts.
Qualitative comparison between LeX-Lumina, LeX-FLUX, and glyph-conditioned models. We compare our models with AnyText and TextDiffuser on five different prompts. We observe that our models generally achieve high text fidelity, better text attribute controllability, and higher aesthetics.
Showcase of text rendering results from LeX-Lumina (first two rows) and LeX-FLUX (last two rows) on text-to-image tasks. The examples demonstrate the models' ability to generate clear, well-aligned, and aesthetically pleasing text within images.
In this work, we enhance the text rendering capability of text-to-image foundation models from a data-centric perspective. Specifically, we leverage DeepSeek-R1 to synthesize a set of domain-specific, highly detailed image descriptions. Subsequently, we design a series of post-processing modules, resulting in LeX-10K, a synthetic dataset comprising 10,000 high-quality text-image samples. Using the prompt-enhancing data generated by DeepSeek-R1, we fine-tune a locally deployable prompt enhancer, LeX-Enhancer. Furthermore, leveraging LeX-10K, we fine-tune two existing models, FLUX.1 [dev] and Lumina-Image 2.0, to obtain LeX-FLUX and LeX-Lumina, respectively.
To rigorously evaluate text rendering performance, we construct a comprehensive benchmark, LeX-Bench, which assesses text rendering accuracy, aesthetics, and controllability over text attributes. In addition, we introduce a flexible and general metric, Pairwise Normalized Edit Distance (PNED), specifically designed for evaluating text rendering precision. Extensive experiments conducted on LeX-Bench demonstrate that LeX-FLUX and LeX-Lumina achieve significant improvements across all aspects of text rendering.
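To make the idea behind PNED concrete, the sketch below matches OCR-extracted words to ground-truth words so that the summed length-normalized edit distance is minimal, charging a fixed penalty for words left unmatched on either side. The matching strategy and the penalty value of 1.0 are our assumptions for illustration (and assume SciPy is available); the paper's exact formulation may differ.

```python
# Illustrative sketch of a pairwise normalized edit distance (PNED)-style metric.
# NOTE: not the reference implementation; matching and penalty are assumptions.
import numpy as np
from scipy.optimize import linear_sum_assignment

def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ca != cb))  # substitution
            prev = cur
    return dp[-1]

def pned(pred_words, gt_words, unmatched_penalty=1.0):
    """Optimally pair predicted and ground-truth words to minimize the summed
    normalized edit distance; unmatched words incur a fixed penalty."""
    n, m = len(pred_words), len(gt_words)
    size = max(n, m)
    # Pad the cost matrix so unmatched words pair with a dummy slot at full penalty.
    cost = np.full((size, size), unmatched_penalty, dtype=float)
    for i, p in enumerate(pred_words):
        for j, g in enumerate(gt_words):
            cost[i, j] = edit_distance(p, g) / max(len(p), len(g), 1)
    rows, cols = linear_sum_assignment(cost)
    return float(cost[rows, cols].sum())

# "Wrold" nearly matches "World"; "2024" in the ground truth goes unmatched.
print(pned(["Hello", "Wrold"], ["Hello", "World", "2024"]))  # ≈ 0 + 0.4 + 1.0 = 1.4
```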