
Pricing

Text inference: $0.25 / 1M input tokens
Tool calls: $0.015 / call
Image generation: $0.08 / image
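Under those published rates, a per-job cost is simple arithmetic. A minimal sketch; the workload numbers in the example are hypothetical, not measured:

```python
# Published rates from the pricing list above.
TEXT_PER_M_INPUT = 0.25   # $ per 1M input tokens
TOOL_CALL = 0.015         # $ per tool call
IMAGE = 0.08              # $ per generated image

def job_cost(input_tokens: int, tool_calls: int, images: int) -> float:
    """Estimated cost in dollars for one generation job."""
    return (input_tokens / 1_000_000 * TEXT_PER_M_INPUT
            + tool_calls * TOOL_CALL
            + images * IMAGE)

# Hypothetical landing-page job: 40k input tokens, 3 tool calls, 2 images.
print(round(job_cost(40_000, 3, 2), 4))  # → 0.215
```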

Benchmarks

Measured output quality, not marketing claims.

Scored across mobile breakpoints, HTML validity, CTA clarity, visual hierarchy, latency, and price. Every metric is reproducible.

Benchmark wall

Proof, not positioning.

| Metric | tk_ | GPT-5 Mini | Claude 4 Haiku | Gemini 3 Flash | Winner |
| --- | --- | --- | --- | --- | --- |
| Mobile breakpoint pass | 96 | 84 | 81 | 86 | Toolkit |
| Tablet layout integrity | 94 | 79 | 80 | 83 | Toolkit |
| HTML validity | 100 | 92 | 94 | 91 | Toolkit |
| CTA clarity | 91 | 80 | 83 | 78 | Toolkit |
| Visual hierarchy | 89 | 82 | 80 | 81 | Toolkit |
| Template sameness | Low | Medium | Medium | Medium | Toolkit |
| Latency | 4.2s | 5.6s | 6.1s | 4.9s | Toolkit |
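The Winner column follows mechanically from the scores: highest value wins, except latency, where lower is faster. A minimal sketch of that rule over two rows of the benchmark table:

```python
# Two rows from the benchmark table; higher is better except latency (seconds).
ROWS = {
    "HTML validity": {"tk_": 100, "GPT-5 Mini": 92, "Claude 4 Haiku": 94, "Gemini 3 Flash": 91},
    "Latency (s)":   {"tk_": 4.2, "GPT-5 Mini": 5.6, "Claude 4 Haiku": 6.1, "Gemini 3 Flash": 4.9},
}
LOWER_IS_BETTER = {"Latency (s)"}

def winner(metric: str) -> str:
    """Pick the winning model for one metric, respecting its direction."""
    scores = ROWS[metric]
    pick = min if metric in LOWER_IS_BETTER else max
    return pick(scores, key=scores.get)

print(winner("HTML validity"))  # tk_
print(winner("Latency (s)"))    # tk_
```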

Methodology

Every metric is repeatable.

How prompts are selected, how device states are reviewed, and how cost and latency are normalized before comparison.

01

Prompt bank

Use commercial prompts across SaaS, local services, real estate, restaurants, ecommerce, and dashboards.
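One way to keep runs comparable is to draw one prompt per vertical with a fixed seed. A minimal sketch; the bank contents and function names here are hypothetical, not the actual prompt set:

```python
import random

# Hypothetical prompt bank keyed by commercial vertical.
PROMPT_BANK = {
    "saas":        ["Landing page for a billing automation SaaS."],
    "local":       ["One-page site for a mobile dog-grooming service."],
    "real_estate": ["Listing page for a downtown loft."],
    "restaurant":  ["Menu and reservations page for a ramen bar."],
    "ecommerce":   ["Product page for handmade ceramic mugs."],
    "dashboard":   ["Analytics dashboard shell with four KPI cards."],
}

def sample_run(seed: int = 0) -> list:
    """Draw one prompt per vertical so every run covers all six categories."""
    rng = random.Random(seed)  # fixed seed keeps runs reproducible
    return [rng.choice(prompts) for prompts in PROMPT_BANK.values()]

run = sample_run()
print(len(run))  # one prompt per vertical → 6
```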

02

Device review

Score the same output across narrow mobile, tablet, and desktop widths instead of reviewing only desktop screenshots.
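The per-width scoring can be reduced to a pass rate over a fixed viewport matrix. A minimal sketch, assuming three hypothetical CSS-pixel widths; how each layout is judged pass/fail is out of scope here:

```python
# Hypothetical viewport matrix: narrow mobile, tablet, desktop (CSS px widths).
VIEWPORTS = {"mobile": 360, "tablet": 768, "desktop": 1280}

def breakpoint_pass_rate(scores: dict) -> float:
    """Share of viewports one output passes, as a 0-100 rate.

    `scores` maps viewport name -> bool (did the layout hold at that width).
    Scoring every width, not just desktop, is what catches mobile breakage.
    """
    if set(scores) != set(VIEWPORTS):
        raise ValueError("score every viewport before publishing a number")
    return 100 * sum(scores.values()) / len(scores)

# Example: layout holds on mobile and desktop but collapses on tablet.
rate = breakpoint_pass_rate({"mobile": True, "tablet": False, "desktop": True})
print(round(rate, 1))  # → 66.7
```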

03

Validity and structure

Measure HTML integrity, CTA clarity, layout hierarchy, and recurring template patterns before publishing a win.
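One measurable slice of HTML integrity is tag balance. A crude sketch using the standard-library parser; a real validator checks far more, and this checker is an illustration, not the scoring harness itself:

```python
from html.parser import HTMLParser

# HTML void elements never take a closing tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}

class TagBalanceChecker(HTMLParser):
    """Crude integrity check: every opened non-void tag must close in order."""
    def __init__(self):
        super().__init__()
        self.stack, self.errors = [], []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if not self.stack or self.stack.pop() != tag:
            self.errors.append(f"mismatched </{tag}>")

def html_is_balanced(markup: str) -> bool:
    checker = TagBalanceChecker()
    checker.feed(markup)
    checker.close()
    return not checker.errors and not checker.stack

print(html_is_balanced("<div><p>ok</p></div>"))  # True
print(html_is_balanced("<div><p>bad</div>"))     # False
```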

04

Cost and latency

Normalize runtime, token volume, and cache behavior so published token pricing maps cleanly to the same workload envelope.
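Normalization can be sketched as mapping raw measurements onto one workload envelope. The field names and the 50% cache discount below are assumptions for illustration, not published billing terms:

```python
from dataclasses import dataclass

@dataclass
class Run:
    """One benchmark run's raw measurements (all fields hypothetical)."""
    wall_seconds: float
    input_tokens: int
    output_tokens: int
    cached_tokens: int  # tokens served from cache, billed differently

def normalized(run: Run, price_per_m_input: float, cache_discount: float = 0.5):
    """Map raw runtime and token counts onto a common workload envelope.

    Cached tokens are billed at a discounted rate (assumed 50% here), so two
    models with identical list prices can diverge once cache behavior counts.
    """
    billable = (run.input_tokens - run.cached_tokens
                + run.cached_tokens * cache_discount)
    cost = billable / 1_000_000 * price_per_m_input
    throughput = run.output_tokens / run.wall_seconds  # tokens per second
    return {"cost_usd": cost, "tokens_per_sec": throughput}

r = Run(wall_seconds=4.2, input_tokens=40_000, output_tokens=2_100,
        cached_tokens=10_000)
m = normalized(r, price_per_m_input=0.25)
print(m["tokens_per_sec"])  # → 500.0
```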


Stop paying premium prices for generic output.

Run your prompt against the benchmark wall, compare the output, and switch when the evidence is obvious.