Z3 Verified · Zero Label Noise · 3 Solving Strategies

Verified Reasoning Graph Dataset // DSATUR · Backtracking/MRV · Welsh-Powell

Train LLMs to reason like o1. 5,000 graph coloring tasks verified by Z3 — no hallucinations, no label noise. Pure algorithmic signal for System 2 fine-tuning.

5K
Baseline rows
0%
Label noise
3
Algorithms
5.4M
Total tokens
Dataset meticulously crafted by @nagygabor
Live simulation · Graph coloring · Z3 Library verified
Who is this for?

Designed for modern AI workflows

AI Researchers
Perfect for studying algorithmic reasoning, System 2 scaling laws, and internal representations of constraint satisfaction problems in LLMs.
Open-Source Fine-Tuners
Inject deterministic, verified reasoning capabilities into models like Llama-3 or Mistral without relying on hallucination-prone synthetic data.
Enterprise AI Teams
Build specialized, high-accuracy planning agents that require rigid adherence to logical constraints for production environments.
Architecture

Why this dataset
is different

By the numbers

Dataset analytics

5,000
Production-ready baseline rows
↑ 20k · 100k available
~5.4M
Total tokens (char/4 approx)
Avg 1,089 tokens / row
1,869
Max edges per graph
Up to 65 nodes
3
Solving strategies per task
Backtracking · DSATUR · Welsh-Powell
8k+
Context window needed
Max 4,161 tokens / row
Difficulty

Calibrated complexity split

Every row is tagged with one of three difficulty tiers — Easy, Medium, or Hard — derived from graph density, chromatic number, and search depth required.

17%
Easy
Sparse graphs with low chromatic numbers — good for warming up reasoning heads and quick curriculum starts.
38%
Medium
Moderate density across random, bipartite, and tree structures. The model must handle multi-step constraint propagation.
45%
Hard
Dense and near-clique graphs with up to 1,869 edges. Requires deep backtracking and precise chromatic bound reasoning.
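The tiers above are derived from graph density, chromatic number, and search depth. A minimal sketch of a density-based tagger, assuming hypothetical cut-offs (the dataset's exact thresholds, and its use of chromatic number and search depth, are not reproduced here):

```python
# Illustrative difficulty tagger. The real tiers also weigh chromatic
# number and required search depth; the thresholds below are assumptions.
def edge_density(nodes: int, edges: int) -> float:
    """Fraction of possible edges present in a simple undirected graph."""
    max_edges = nodes * (nodes - 1) // 2
    return edges / max_edges if max_edges else 0.0

def difficulty_tier(nodes: int, edges: int) -> str:
    """Map density to a tier (hypothetical cut-offs, not the dataset's exact ones)."""
    d = edge_density(nodes, edges)
    if d < 0.15:
        return "easy"
    if d < 0.45:
        return "medium"
    return "hard"
```

A 65-node graph with 1,869 edges has density ≈ 0.90, which lands it in the hard tier under these example thresholds.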
Graph Structures

5 Diverse Topologies

Random Graphs
Erdős-Rényi models providing unstructured, unpredictable constraint landscapes to test baseline generalization.
Bipartite
Two disjoint node sets with edges running only between the sets, never within one. Crucial for testing exact 2-colorability detection.
Tree Structures
Hierarchical, cycle-free graphs. Models must learn efficient linear-time coloring strategies without getting stuck.
Dense Graphs
High edge-density graphs where nearly every pair of nodes is connected — stress-tests chromatic number reasoning at scale.
Near-Clique
Near-complete subgraphs that push maximum chromatic number limits and require extreme conflict resolution under tight constraints.
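For intuition on two of these topologies, here is a minimal stdlib-only sketch of how Erdős-Rényi and bipartite edge lists can be generated (an illustration of the graph families, not the dataset's own generation pipeline):

```python
import random

def erdos_renyi(n: int, p: float, seed: int = 0) -> list:
    """G(n, p): include each of the n*(n-1)/2 possible edges with probability p."""
    rng = random.Random(seed)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < p]

def bipartite(left: int, right: int, p: float, seed: int = 0) -> list:
    """Edges only cross the two partitions, so the result is always 2-colorable."""
    rng = random.Random(seed)
    return [(i, left + j) for i in range(left) for j in range(right) if rng.random() < p]
```

Raising `p` toward 1.0 in the Erdős-Rényi generator moves the output toward the dense and near-clique regimes described above.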
Multi-task design

Four generalization objectives

~41%
Full Graph Coloring
Complete backtracking or greedy trace with structural preamble — the core algorithmic reasoning scaffold.
~31%
Validation
Real per-edge conflict detection — 50% valid, 50% corrupted colorings. Model must identify conflicts precisely.
~22%
Missing Color
Constraint propagation: forbidden colors → available colors. Deduce the valid color for a single uncolored node.
~5%
Exact Chromatic Number
Lower bound (clique/odd cycle) + upper bound (Brooks) + Z3 proof — determine the absolute minimum colors required.
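The validation objective reduces to per-edge conflict detection: a coloring is valid exactly when no edge is monochromatic. The dataset itself uses Z3 for ground truth; a minimal pure-Python checker for the same property might look like:

```python
def find_conflicts(edges, coloring):
    """Per-edge conflict scan: return every edge whose endpoints share a color."""
    return [(u, v) for u, v in edges if coloring[u] == coloring[v]]

def is_valid(edges, coloring):
    """A coloring is valid iff no edge is monochromatic."""
    return not find_conflicts(edges, coloring)
```

On a triangle, the coloring `{0: 0, 1: 1, 2: 0}` fails with the single conflicting edge `(0, 2)`, which is the kind of precise localization the validation task demands.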
Algorithm diversity

Three solving strategies

~44%
Backtracking / MRV
Minimum-remaining-values ordering: the most constrained variable is colored first, with up to 12 backtrack events per trace. Teaches the model to recognize dead ends and recover gracefully.
~30%
DSATUR
Dynamic saturation — always colors the most constrained node. Near-optimal and deterministic, producing clean constraint-propagation logs.
~26%
Welsh-Powell
Greedy degree-order with no backtracking. Full forbidden-color log per step — trains fast, fluent coloring without search overhead.
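Welsh-Powell is the simplest of the three to sketch: color nodes in descending degree order, always picking the lowest color not forbidden by already-colored neighbors. A self-contained illustration (the dataset's traces additionally log the forbidden set at each step):

```python
from collections import defaultdict

def welsh_powell(n, edges):
    """Greedy coloring in descending degree order; no backtracking."""
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    # Highest-degree nodes first (ties keep their original order).
    order = sorted(range(n), key=lambda node: len(adj[node]), reverse=True)
    coloring = {}
    for node in order:
        forbidden = {coloring[nb] for nb in adj[node] if nb in coloring}
        color = 0
        while color in forbidden:  # lowest color not used by a neighbor
            color += 1
        coloring[node] = color
    return coloring
```

On a triangle this necessarily spends three colors; on a 3-node path it gets by with two, with no search in either case.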
Data structure

Exact row schema

train_data_5k.jsonl — single row example
{
  "task_type":    "task_a_coloring",
  "graph_type":   "bipartite",
  "difficulty":   "medium",
  "strategy":     "dsatur",
  "nodes":        42,
  "edges":        260,
  "instruction":  "Graph with 42 nodes. Edges: [(12,0),(16,0)...] Color using ≤2 colors, no adjacent nodes same color.",
  "reasoning": {
    "strategy":   "dsatur",
    "preamble":  "42 nodes, 260 edges, bipartite → χ(G)=2",
    "steps": [
      { "node": 20, "saturation": 0, "assigned_color": 0, "forbidden": [] },
      { "node": 7,  "saturation": 1, "assigned_color": 1, "forbidden": [0] },
      { "action": "[backtrack]", "from_node": 41, "reason": "conflict" }
    ]
  },
  "solution": { "20": 0, "7": 1, "16": 0, "...": "..." }
}
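A minimal loader for rows in this shape, checking only the keys shown in the example above (real rows may carry additional fields):

```python
import json

# Keys taken from the example row above; rows may carry extra fields,
# so only the presence of these is checked.
REQUIRED = {"task_type", "graph_type", "difficulty", "strategy",
            "nodes", "edges", "instruction", "reasoning", "solution"}

def load_rows(path):
    """Yield parsed rows from a JSONL file, checking the expected schema keys."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            row = json.loads(line)
            missing = REQUIRED - row.keys()
            if missing:
                raise ValueError(f"row missing keys: {sorted(missing)}")
            yield row
```

Because the file is JSONL, rows stream one per line, so the full 5k set never needs to sit in memory at once.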
Training economics

Token statistics & cost estimation

Token statistics — char/4 approximation · cl100k_base compatible · measured on the 5,000-row baseline and extrapolated to the 20k and 100k tiers
Metric | 5k Baseline (Free) | 20k Dataset | 100k Dataset
Total Rows | 5,000 | ~20,000 | ~100,000
Total Tokens | ~5.45M | ~21.8M | ~109M
Avg Tokens / Row | 1,089 | 1,089 | 1,089
Median Tokens / Row | 899 | ~900 | ~900
Max Tokens / Row | 4,161 | ~4,200 | ~4,200
p95 Tokens / Row | 2,658 | ~2,700 | ~2,700
Recommended Model Context | 8k+ | 8k+ | 8k+
Avg tokens / row — broken down by task type
Task Type | Avg Tokens | Max Tokens | Row Count
task_a_coloring | 1,378 | 4,161 | ~2,088
task_b_validation | 957 | 3,272 | ~1,541
task_c_missing_color | 887 | 3,116 | ~1,099
task_d_chromatic | 432 | 1,267 | ~271
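The char/4 approximation behind these figures is simple enough to sketch directly (a rough heuristic, not an exact cl100k_base tokenizer):

```python
import json

def estimate_tokens(text: str) -> int:
    """char/4 heuristic used in the tables above (rough cl100k_base-scale estimate)."""
    return len(text) // 4

def dataset_token_estimate(rows) -> int:
    """Sum per-row estimates over JSON-serialized rows."""
    return sum(estimate_tokens(json.dumps(row)) for row in rows)
```

For a precise count against a specific model, swap the heuristic for that model's actual tokenizer; the char/4 figure is meant for budgeting, not billing.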
FAQ

Frequently Asked Questions

Which models are compatible with this dataset?
Any model with a context window of at least 8k tokens is sufficient. The dataset is explicitly formatted and tested for Llama-3, Mistral, and Qwen architectures.
Is the dataset updated regularly?
Yes, the 20k Gumroad tier receives bi-monthly updates with harder distributions and new permutation techniques. The 5k baseline is stable.
How is "0% label noise" guaranteed?
Every single graph, edge conflict, and valid coloring is calculated and rigorously verified using the Microsoft Z3 SMT solver before being converted to text.
What is the "strategy" field and why does it matter?
Each row is tagged with the algorithm that produced its reasoning trace: Backtracking/MRV (~44%), DSATUR (~30%), or Welsh-Powell (~26%). Training across all three strategies teaches the model multiple valid solution paths — not just one greedy heuristic.
What is permutation augmentation?
Node indices and color assignments are randomly shuffled for every row. This forces the model to learn the underlying algorithm rather than memorizing patterns tied to specific node labels.
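The idea behind permutation augmentation can be sketched in a few lines: relabel nodes and colors with random bijections, which changes every surface token while preserving the graph's structure and the coloring's validity (an illustration, not the dataset's own augmentation code):

```python
import random

def permute_row(edges, coloring, n_nodes, n_colors, seed=0):
    """Apply random node and color relabelings; structure and coloring
    validity are preserved, only the labels change."""
    rng = random.Random(seed)
    node_map = list(range(n_nodes))
    rng.shuffle(node_map)
    color_map = list(range(n_colors))
    rng.shuffle(color_map)
    new_edges = [(node_map[u], node_map[v]) for u, v in edges]
    new_coloring = {node_map[u]: color_map[c] for u, c in coloring.items()}
    return new_edges, new_coloring
```

Because both maps are bijections, a valid coloring stays valid after permutation, which is what lets the model be graded on the algorithm rather than on memorized labels.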
Access tiers

Choose your scale

Free
5k Baseline
// via HuggingFace
Perfect for testing pipelines & small experiments.
  • 5,000 verified rows
  • All 4 task types
  • All 5 graph topologies
  • 3 solving strategies
  • ~5.45M tokens · JSONL format
Download Free
$249 / one-time
100k Dataset
// custom pipeline
Built for 70B+ enterprise models & robust evals.
  • 100,000 verified rows
  • ~109M tokens
  • Full difficulty stratification
  • Custom topology options
  • B2B support via email
Buy on Gumroad