Z3 Verified · Zero Label Noise · 3 Solving Strategies

Verified Reasoning Graph Dataset // DSATUR · Backtracking/MRV · Welsh-Powell

Train LLMs to reason like o1. 5,000 graph coloring tasks verified by Z3 — no hallucinations, no label noise. Pure algorithmic signal for System 2 fine-tuning.

5K
Baseline rows
0%
Label noise
3
Algorithms
5.4M
Total tokens
Dataset meticulously crafted by @nagygabor
Live simulation · Graph coloring · Z3 Library verified
Who is this for?

Designed for modern AI workflows

AI Researchers
Perfect for studying algorithmic reasoning, System 2 scaling laws, and internal representations of constraint satisfaction problems in LLMs.
Open-Source Fine-Tuners
Inject deterministic, verified reasoning capabilities into models like Llama-3 or Mistral without relying on hallucination-prone synthetic data.
Enterprise AI Teams
Build specialized, high-accuracy planning agents that require rigid adherence to logical constraints for production environments.
Architecture

Why this dataset
is different

By the numbers

Dataset analytics

5,000
Production-ready baseline rows
↑ 20k · 100k available
~5.4M
Total tokens (char/4 approx)
Avg 1,089 tokens / row
1,869
Max edges per graph
Up to 65 nodes
3
Solving strategies per task
Backtracking · DSATUR · Welsh-Powell
8k+
Context window needed
Max 4,161 tokens / row
Difficulty

Calibrated complexity split

Every row is tagged with one of three difficulty tiers — Easy, Medium, or Hard — derived from graph density, chromatic number, and search depth required.

17%
Easy
Sparse graphs with low chromatic numbers — good for warming up reasoning heads and quick curriculum starts.
38%
Medium
Moderate density across random, bipartite, and tree structures. The model must handle multi-step constraint propagation.
45%
Hard
Dense and near-clique graphs with up to 1,869 edges. Requires deep backtracking and precise chromatic bound reasoning.
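The tiers above are derived from graph density, chromatic number, and search depth. A minimal sketch of a density-based tagger, assuming hypothetical cut-offs (the dataset's exact thresholds, and its use of chromatic number and search depth, are not reproduced here):

```python
# Illustrative difficulty tagger. The real tiers also weigh chromatic
# number and required search depth; the thresholds below are assumptions.
def edge_density(nodes: int, edges: int) -> float:
    """Fraction of possible edges present in a simple undirected graph."""
    max_edges = nodes * (nodes - 1) // 2
    return edges / max_edges if max_edges else 0.0

def difficulty_tier(nodes: int, edges: int) -> str:
    """Map density to a tier (hypothetical cut-offs, not the dataset's exact ones)."""
    d = edge_density(nodes, edges)
    if d < 0.15:
        return "easy"
    if d < 0.45:
        return "medium"
    return "hard"
```

A 65-node graph with 1,869 edges has density ≈ 0.90, which lands it in the hard tier under these example thresholds.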
Graph Structures

5 Diverse Topologies

Random Graphs
Erdős-Rényi models providing unstructured, unpredictable constraint landscapes to test baseline generalization.
Bipartite
Two disjoint node sets with edges running only between the sets, never within one. Crucial for testing exact 2-colorability detection.
Tree Structures
Hierarchical, cycle-free graphs. Models must learn efficient linear-time coloring strategies without getting stuck.
Dense Graphs
High edge-density graphs where nearly every pair of nodes is connected — stress-tests chromatic number reasoning at scale.
Near-Clique
Near-complete subgraphs that push maximum chromatic number limits and require extreme conflict resolution under tight constraints.
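For intuition on two of these topologies, here is a minimal stdlib-only sketch of how Erdős-Rényi and bipartite edge lists can be generated (an illustration of the graph families, not the dataset's own generation pipeline):

```python
import random

def erdos_renyi(n: int, p: float, seed: int = 0) -> list:
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

def bipartite(left: int, right: int, p: float, seed: int = 0) -> list:
    """Edges only cross the two partitions, so the result is always 2-colorable."""
    rng = random.Random(seed)
    return [(i, left + j) for i in range(left) for j in range(right) if rng.random() < p]
```

Raising `p` toward 1.0 in the Erdős-Rényi generator moves the output toward the dense and near-clique regimes described above.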
Multi-task design

Four generalization objectives

~41%
Full Graph Coloring
Complete backtracking or greedy trace with structural preamble — the core algorithmic reasoning scaffold.
~31%
Validation
Real per-edge conflict detection — 50% valid, 50% corrupted colorings. Model must identify conflicts precisely.
~22%
Missing Color
Constraint propagation: forbidden colors → available colors. Deduce the valid color for a single uncolored node.
~5%
Exact Chromatic Number
Lower bound (clique/odd cycle) + upper bound (Brooks) + Z3 proof — determine the absolute minimum colors required.
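The validation objective reduces to per-edge conflict detection: a coloring is valid exactly when no edge is monochromatic. The dataset itself uses Z3 for ground truth; a minimal pure-Python checker for the same property might look like:

```python
def find_conflicts(edges, coloring):
    """Per-edge conflict scan: return every edge whose endpoints share a color."""
    return [(u, v) for u, v in edges if coloring[u] == coloring[v]]

def is_valid(edges, coloring):
    """A coloring is valid iff no edge is monochromatic."""
    return not find_conflicts(edges, coloring)
```

On a triangle, the coloring `{0: 0, 1: 1, 2: 0}` fails with the single conflicting edge `(0, 2)`, which is the kind of precise localization the validation task demands.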
Algorithm diversity

Three solving strategies

~44%
Backtracking / MRV
Minimum-remaining-values ordering: the most constrained variable is colored first, with up to 12 backtrack events per trace. Teaches the model to recognize dead ends and recover gracefully.
~30%
DSATUR
Dynamic saturation — always colors the most constrained node. Near-optimal and deterministic, producing clean constraint-propagation logs.
~26%
Welsh-Powell
Greedy degree-order with no backtracking. Full forbidden-color log per step — trains fast, fluent coloring without search overhead.
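Welsh-Powell is the simplest of the three to sketch: color nodes in descending degree order, always picking the lowest color not forbidden by already-colored neighbors. A self-contained illustration (the dataset's traces additionally log the forbidden set at each step):

```python
from collections import defaultdict

def welsh_powell(n, edges):
    """Greedy coloring in descending degree order; no backtracking."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Highest-degree nodes first (ties keep their original order).
    order = sorted(range(n), key=lambda node: len(adj[node]), reverse=True)
    coloring = {}
    for node in order:
        forbidden = {coloring[nb] for nb in adj[node] if nb in coloring}
        color = 0
        while color in forbidden:  # lowest color not used by a neighbor
            color += 1
        coloring[node] = color
    return coloring
```

On a triangle this necessarily spends three colors; on a 3-node path it gets by with two, with no search in either case.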
Data structure

Exact row schema

train_data_5k.jsonl — single row example
{
  "task_type":    "task_a_coloring",
  "graph_type":   "bipartite",
  "difficulty":   "medium",
  "strategy":     "dsatur",
  "nodes":        42,
  "edges":        260,
  "instruction":  "Graph with 42 nodes. Edges: [(12,0),(16,0)...] Color using ≤2 colors, no adjacent nodes same color.",
  "reasoning": {
    "strategy":   "dsatur",
    "preamble":  "42 nodes, 260 edges, bipartite → χ(G)=2",
    "steps": [
      { "node": 20, "saturation": 0, "assigned_color": 0, "forbidden": [] },
      { "node": 7,  "saturation": 1, "assigned_color": 1, "forbidden": [0] },
      { "action": "[backtrack]", "from_node": 41, "reason": "conflict" }
    ]
  },
  "solution": { "20": 0, "7": 1, "16": 0, "...": "..." }
}
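A minimal loader for rows in this shape, checking only the keys shown in the example above (real rows may carry additional fields):

```python
import json

# Keys taken from the example row above; rows may carry extra fields,
# so only the presence of these is checked.
REQUIRED = {"task_type", "graph_type", "difficulty", "strategy",
            "nodes", "edges", "instruction", "reasoning", "solution"}

def load_rows(path):
    """Yield parsed rows from a JSONL file, checking the expected schema keys."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            missing = REQUIRED - row.keys()
            if missing:
                raise ValueError(f"row missing keys: {sorted(missing)}")
            yield row
```

Because the file is JSONL, rows stream one per line, so the full 5k set never needs to sit in memory at once.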
Training economics

Token statistics & cost estimation

Token statistics — char/4 approximation · cl100k_base compatible · measured on the 5,000-row baseline and extrapolated to the 20k and 100k tiers
Metric | 5k Baseline (Free) | 20k Dataset | 100k Dataset
Total Rows | 5,000 | ~20,000 | ~100,000
Total Tokens | ~5.45M | ~21.8M | ~109M
Avg Tokens / Row | 1,089 | 1,089 | 1,089
Median Tokens / Row | 899 | ~900 | ~900
Max Tokens / Row | 4,161 | ~4,200 | ~4,200
p95 Tokens / Row | 2,658 | ~2,700 | ~2,700
Recommended Model Context | 8k+ | 8k+ | 8k+
Avg tokens / row — broken down by task type
Task Type | Avg Tokens | Max Tokens | Row Count
task_a_coloring | 1,378 | 4,161 | ~2,088
task_b_validation | 957 | 3,272 | ~1,541
task_c_missing_color | 887 | 3,116 | ~1,099
task_d_chromatic | 432 | 1,267 | ~271
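The char/4 approximation behind these figures is simple enough to sketch directly (a rough heuristic, not an exact cl100k_base tokenizer):

```python
import json

def estimate_tokens(text: str) -> int:
    """char/4 heuristic used in the tables above (rough cl100k_base-scale estimate)."""
    return len(text) // 4

def dataset_token_estimate(rows) -> int:
    """Sum per-row estimates over JSON-serialized rows."""
    return sum(estimate_tokens(json.dumps(row)) for row in rows)
```

For a precise count against a specific model, swap the heuristic for that model's actual tokenizer; the char/4 figure is meant for budgeting, not billing.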
FAQ

Frequently Asked Questions

Which models are compatible with this dataset?
Any model with a context window of at least 8k tokens is sufficient. The dataset is explicitly formatted and tested for Llama-3, Mistral, and Qwen architectures.
Is the dataset updated regularly?
Yes, the 20k Gumroad tier receives bi-monthly updates with harder distributions and new permutation techniques. The 5k baseline is stable.
How is "0% label noise" guaranteed?
Every single graph, edge conflict, and valid coloring is calculated and rigorously verified using the Microsoft Z3 SMT solver before being converted to text.
What is the "strategy" field and why does it matter?
Each row is tagged with the algorithm that produced its reasoning trace: Backtracking/MRV (~44%), DSATUR (~30%), or Welsh-Powell (~26%). Training across all three strategies teaches the model multiple valid solution paths — not just one greedy heuristic.
What is permutation augmentation?
Node indices and color assignments are randomly shuffled for every row. This forces the model to learn the underlying algorithm rather than memorizing patterns tied to specific node labels.
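The idea behind permutation augmentation can be sketched in a few lines: relabel nodes and colors with random bijections, which changes every surface token while preserving the graph's structure and the coloring's validity (an illustration, not the dataset's own augmentation code):

```python
import random

def permute_row(edges, coloring, n_nodes, n_colors, seed=0):
    """Apply random node and color relabelings; structure and coloring
    validity are preserved, only the labels change."""
    rng = random.Random(seed)
    node_map = list(range(n_nodes))
    rng.shuffle(node_map)
    color_map = list(range(n_colors))
    rng.shuffle(color_map)
    new_edges = [(node_map[u], node_map[v]) for u, v in edges]
    new_coloring = {node_map[u]: color_map[c] for u, c in coloring.items()}
    return new_edges, new_coloring
```

Because both maps are bijections, a valid coloring stays valid after permutation, which is what lets the model be graded on the algorithm rather than on memorized labels.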
Access tiers

Choose your scale

Free
5k Baseline
// via HuggingFace
Perfect for testing pipelines & small experiments.
  • 5,000 verified rows
  • All 4 task types
  • All 5 graph topologies
  • 3 solving strategies
  • ~5.45M tokens · JSONL format
Download Free
$249 / one-time
100k Dataset
// custom pipeline
Built for 70B+ enterprise models & robust evals.
  • 100,000 verified rows
  • ~109M tokens
  • Full difficulty stratification
  • Custom topology options
  • B2B support via email
Buy on Gumroad