Which model actually understands time?

The industry standard benchmark for temporal reasoning
and time grounding in LLM systems.

The Problem

Temporal mistakes are a major failure mode for agentic systems. Existing evals are inconsistent, non-reproducible, or tied to licensed corpora.

⚠️

Time Hallucinations

Models confidently return wrong dates, breaking scheduling and automation workflows.

🔍

Ambiguity Blindness

LLMs fail to detect when time expressions are ambiguous, leading to silent errors.

DST Edge Cases

Daylight saving time transitions cause systematic failures in time resolution.

📊

Inconsistent Evaluation

No standardized benchmark means you can't compare models or track regressions.

Time Torture Test

See how different models interpret the same time expression. Watch them disagree, then see how times.ai handles ambiguity.

The Complete Solution

Everything you need to measure, improve, and deploy reliable temporal reasoning in LLM systems.

📊

Synthetic Dataset

10,000+ items with deterministic ground truth. No licensing risks, 100% code-generated.

DST Torture Suite

500+ edge cases covering daylight saving transitions, leap years, and timezone boundaries.
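Because the suite is deterministic, each edge case can be generated and verified in code rather than hand-written. The exact generation code and item schema are not shown here; a minimal sketch of how a "spring forward" gap case can be checked with Python's standard `zoneinfo` module:

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

def exists_in_tz(naive: datetime, tz: ZoneInfo) -> bool:
    """A wall-clock time exists unless it falls in a DST 'spring forward'
    gap; round-tripping through UTC shifts a nonexistent local time."""
    local = naive.replace(tzinfo=tz)
    return local.astimezone(timezone.utc).astimezone(tz).replace(tzinfo=None) == naive

la = ZoneInfo("America/Los_Angeles")

# US DST begins 2026-03-08 at 02:00 local, so 02:30 never occurs that day.
assert not exists_in_tz(datetime(2026, 3, 8, 2, 30), la)
# One hour later the wall clock reads 03:30, which is a real instant.
assert exists_in_tz(datetime(2026, 3, 8, 3, 30), la)
```

The round-trip check is what makes ground truth deterministic: no human labeling, just the IANA timezone database.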

🔬

Reproducible Eval

Strict JSON output contract with enterprise readiness scoring. Hash-based reproducibility IDs.
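The exact ID scheme is not specified here; one plausible sketch is a SHA-256 digest over a canonical JSON encoding of everything that determines a run (the function name and field choices below are assumptions):

```python
import hashlib
import json

def repro_id(model: str, dataset_version: str, config: dict) -> str:
    """Hypothetical reproducibility ID: SHA-256 over canonical JSON of
    the inputs that determine an eval run."""
    payload = json.dumps(
        {"model": model, "dataset": dataset_version, "config": config},
        sort_keys=True, separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# Identical inputs always yield the same ID; any change yields a new one.
a = repro_id("gpt-4-turbo", "v1.2", {"temperature": 0})
b = repro_id("gpt-4-turbo", "v1.2", {"temperature": 0})
c = repro_id("gpt-4-turbo", "v1.3", {"temperature": 0})
assert a == b and a != c and len(a) == 64
```

Sorting keys and fixing separators makes the JSON encoding canonical, so two runs with the same configuration always hash to the same ID.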

🎯

Ambiguity Detection

First-class API for detecting ambiguous time expressions and returning candidate resolutions.

📈

Weekly Index

Temporal Robustness Index tracks model regressions and improvements over time.

🔓

Open Source

Evaluation harness available on GitHub. Transparent methodology, citable reports.

API Wedge

Normalize time expressions, detect ambiguity, and get candidate resolutions with a single API call.

  • Strict JSON output contract
  • Ambiguity detection as first-class signal
  • TypeScript and Python SDKs
  • Enterprise-ready with trace IDs
View API Docs →
POST /ground
{
  "query": "Meeting next Tuesday at 9am",
  "anchor_iso": "2026-03-05T12:00:00Z",
  "anchor_tz": "America/Los_Angeles",
  "locale": "en-US"
}

Response:
{
  "ambiguous": true,
  "confidence": 0.85,
  "ambiguity_types": ["ANCHOR_SCOPE"],
  "required_slots": ["slot.anchor_rule"],
  "candidates": [
    {
      "iso": "2026-03-10T09:00:00-07:00",
      "assumptions": {"slot.anchor_rule": "strict_week"}
    },
    {
      "iso": "2026-03-17T09:00:00-07:00",
      "assumptions": {"slot.anchor_rule": "sliding_window_7d"}
    }
  ],
  "trace_id": "abc123..."
}
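The request and response above can be wired up in a few lines of client code. A minimal Python sketch (the base URL is an assumption, and `pick_resolution` is an illustrative helper, not part of any SDK; the response fields match the example):

```python
import json
import urllib.request

def ground(query: str, anchor_iso: str, anchor_tz: str,
           base_url: str = "https://api.times.ai") -> dict:  # URL is hypothetical
    """POST a time expression to /ground and return the parsed JSON response."""
    body = json.dumps({
        "query": query,
        "anchor_iso": anchor_iso,
        "anchor_tz": anchor_tz,
        "locale": "en-US",
    }).encode("utf-8")
    req = urllib.request.Request(f"{base_url}/ground", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def pick_resolution(result: dict) -> str:
    """Treat ambiguity as a first-class signal: surface the candidates
    instead of silently guessing; otherwise take the single resolution."""
    if result["ambiguous"]:
        raise ValueError(
            f"Ambiguous ({result['ambiguity_types']}); "
            f"candidates: {[c['iso'] for c in result['candidates']]}"
        )
    return result["candidates"][0]["iso"]
```

Raising on `"ambiguous": true` is the point of the contract: the caller is forced to disambiguate (e.g. by asking the user) rather than schedule the wrong Tuesday.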

Leaderboard

See how models compare on temporal reasoning. Scores updated weekly with the Temporal Robustness Index.

Model           Overall   Ambiguity   DST Edge   Enterprise
GPT-4 Turbo     87.3      92.1        78.5       100%
Claude 3 Opus   85.7      89.4        82.3       100%
Llama 3 70B     72.1      68.9        65.2       95%
GPT-3.5 Turbo   68.4      71.2        58.7       98%