Which model actually
understands time?
The industry standard benchmark for temporal reasoning
and time grounding in LLM systems.
The Problem
Temporal mistakes are a major failure mode for agentic systems. Existing evals are inconsistent, non-reproducible, or tied to licensed corpora.
Time Hallucinations
Models confidently return wrong dates, breaking scheduling and automation workflows.
Ambiguity Blindness
LLMs fail to detect when time expressions are ambiguous, leading to silent errors.
DST Edge Cases
Daylight saving time transitions cause systematic failures in time resolution.
Inconsistent Evaluation
No standardized benchmark means you can't compare models or track regressions.
Time Torture Test
See how different models interpret the same time expression. Watch them disagree, then see how times.ai handles ambiguity.
The Complete Solution
Everything you need to measure, improve, and deploy reliable temporal reasoning in LLM systems.
Synthetic Dataset
10,000+ items with deterministic ground truth. No licensing risks, 100% code-generated.
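Because every item is generated by code, the ground truth comes for free. A minimal sketch of how such deterministic generation can work (this generator, its field names, and its seeding scheme are illustrative, not the benchmark's actual pipeline): the same seed always produces the same query and the same ground-truth resolution.

```python
import random
from datetime import datetime, timedelta

def make_item(seed: int) -> dict:
    """Hypothetical generator: seed -> (query, ground truth), no human labels."""
    rng = random.Random(seed)  # seeded RNG makes the item fully deterministic
    anchor = datetime(2026, 1, 1) + timedelta(days=rng.randrange(365))
    offset = rng.randrange(1, 30)
    return {
        "query": f"in {offset} days",
        "anchor_iso": anchor.isoformat(),
        "truth_iso": (anchor + timedelta(days=offset)).isoformat(),
    }
```

Regenerating an item from its seed reproduces it byte-for-byte, which is what makes the dataset license-free and auditable.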
DST Torture Suite
500+ edge cases covering daylight saving transitions, leap years, and timezone boundaries.
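The class of failure these cases target is easy to see with Python's standard `zoneinfo`: on the 2026 US spring-forward date, adding one wall-clock hour and adding one elapsed hour give different answers, because the 02:00-02:59 wall times do not exist that day.

```python
from datetime import datetime, timedelta, timezone
from zoneinfo import ZoneInfo  # Python 3.9+

LA = ZoneInfo("America/Los_Angeles")

# US DST begins 2026-03-08: at 02:00 local, clocks jump to 03:00.
before = datetime(2026, 3, 8, 1, 30, tzinfo=LA)  # 01:30 PST, UTC-8

# Naive wall-clock arithmetic lands in the gap: 02:30 never happens.
wall = before + timedelta(hours=1)

# Elapsed-time arithmetic: go through UTC, then convert back.
elapsed = (before.astimezone(timezone.utc)
           + timedelta(hours=1)).astimezone(LA)

print(elapsed.isoformat())  # 2026-03-08T03:30:00-07:00 (PDT)
```

A model (or resolver) that does wall-clock math here is off by an hour, which is exactly the kind of systematic error the suite is built to catch.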
Reproducible Eval
Strict JSON output contract with enterprise readiness scoring. Hash-based reproducibility IDs.
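One plausible way to derive such an ID (a sketch under assumptions, not the harness's documented scheme): hash a canonical JSON serialization of each eval item, so any two runs over byte-identical inputs share the same reproducibility ID.

```python
import hashlib
import json

def repro_id(item: dict) -> str:
    """Hypothetical reproducibility ID: SHA-256 of canonical JSON."""
    # sort_keys + compact separators make the serialization canonical,
    # so logically equal items hash identically regardless of key order.
    canonical = json.dumps(item, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]
```

Any change to an item's content changes its ID, so a report citing the same IDs is provably scoring the same inputs.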
Ambiguity Detection
First-class API for detecting ambiguous time expressions and returning candidate resolutions.
Weekly Index
Temporal Robustness Index tracks model regressions and improvements over time.
Open Source
Evaluation harness available on GitHub. Transparent methodology, citable reports.
API Wedge
Normalize time expressions, detect ambiguity, and get candidate resolutions with a single API call.
- ✓ Strict JSON output contract
- ✓ Ambiguity detection as first-class signal
- ✓ TypeScript and Python SDKs
- ✓ Enterprise-ready with trace IDs
POST /ground

```json
{
  "query": "Meeting next Tuesday at 9am",
  "anchor_iso": "2026-03-05T12:00:00Z",
  "anchor_tz": "America/Los_Angeles",
  "locale": "en-US"
}
```

Response:

```json
{
  "ambiguous": true,
  "confidence": 0.85,
  "ambiguity_types": ["ANCHOR_SCOPE"],
  "required_slots": ["slot.anchor_rule"],
  "candidates": [
    {
      "iso": "2026-03-10T09:00:00-07:00",
      "assumptions": {"slot.anchor_rule": "strict_week"}
    },
    {
      "iso": "2026-03-17T09:00:00-07:00",
      "assumptions": {"slot.anchor_rule": "sliding_window_7d"}
    }
  ],
  "trace_id": "abc123..."
}
```

Leaderboard
See how models compare on temporal reasoning. Scores updated weekly with the Temporal Robustness Index.
| Model | Overall | Ambiguity | DST Edge | Enterprise |
|---|---|---|---|---|
| GPT-4 Turbo | 87.3 | 92.1 | 78.5 | 100% |
| Claude 3 Opus | 85.7 | 89.4 | 82.3 | 100% |
| Llama 3 70B | 72.1 | 68.9 | 65.2 | 95% |
| GPT-3.5 Turbo | 68.4 | 71.2 | 58.7 | 98% |