Methodology
Data layers
Weather (Open-Meteo / ERA5)
Daily temperature, rainfall, sunshine, and frost-day counts at a representative point per DEFRA region, 1999–present. Source: ECMWF ERA5 reanalysis via Open-Meteo's archive API. Multi-point sampling (3–5 points per region) is being added incrementally.
The model's headline weather feature is the compound stress score: a multi-stage aggregation across the wheat growth cycle that captures how far each stage's weather deviated from the 1999–present baseline. Unlike single-event weather models that flag heatwaves or droughts in isolation, compound stress aggregates pressure across the full crop year, because a yield outcome usually has several contributing weather episodes.
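As a rough illustration of the idea, and not the production feature, the sketch below scores each growth stage by the absolute z-score of its weather against the baseline years and sums the stages. The stage names, the use of absolute z-scores, the equal default weights, and all the numbers are assumptions made for the example.

```python
from statistics import mean, pstdev


def stage_z(value, baseline):
    """Z-score of one stage's weather value against its baseline years."""
    mu = mean(baseline)
    sigma = pstdev(baseline)
    if sigma == 0:
        return 0.0  # no baseline variation: treat the stage as neutral
    return (value - mu) / sigma


def compound_stress(stage_values, stage_baselines, weights=None):
    """Sum weighted absolute z-scores across growth stages (hypothetical form)."""
    total = 0.0
    for stage, value in stage_values.items():
        weight = 1.0 if weights is None else weights.get(stage, 1.0)
        total += weight * abs(stage_z(value, stage_baselines[stage]))
    return total


# Made-up rainfall totals (mm) per stage; not real ERA5 values.
baselines = {
    "establishment": [60.0, 80.0, 70.0, 90.0, 50.0],
    "stem_extension": [40.0, 55.0, 45.0, 50.0, 60.0],
    "grain_fill": [30.0, 50.0, 40.0, 45.0, 35.0],
}
this_year = {"establishment": 110.0, "stem_extension": 20.0, "grain_fill": 42.0}

print(round(compound_stress(this_year, baselines), 2))
```

A year with two moderately unusual stages can score as high as a year with one extreme stage, which is the behaviour a single-event flag would miss.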
Soil moisture (COSMOS-UK)
Daily volumetric water content, 2013–present. Used as a contextual variable in z-score calculation only. The pre-2013 gap means it can't be a direct model feature without introducing a structural bias against earlier years.
Satellite NDVI (Sentinel-2)
Mean and median NDVI per region, 2015–present. Tested both unmasked and SCL-masked (cropland pixels only). Neither correlates meaningfully with yield at flowering or stem-extension stages: the SCL mask excludes forest and water but cannot distinguish wheat from grass or other crops. Crop-type classification (e.g. UKCEH Land Cover Plus: Crops) would be needed to isolate wheat specifically; this is pending licensing.
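For readers unfamiliar with the calculation, here is a minimal sketch of a regional NDVI summary with and without a scene-classification mask. NDVI is the standard (NIR - Red) / (NIR + Red) ratio. Keeping only SCL class 4 (vegetation) is an assumption about how the mask is applied, and the pixel tuples are invented.

```python
from statistics import mean, median

SCL_VEGETATION = 4  # Sentinel-2 L2A scene classification value for vegetation


def ndvi(nir, red):
    """Normalised difference vegetation index for one pixel."""
    denom = nir + red
    if denom == 0:
        return None  # undefined for a pixel with no signal in either band
    return (nir - red) / denom


def regional_ndvi(pixels, masked=True):
    """Return (mean, median) NDVI over (nir, red, scl) pixel tuples."""
    values = []
    for nir, red, scl in pixels:
        if masked and scl != SCL_VEGETATION:
            continue  # assumed mask: keep vegetation pixels only
        value = ndvi(nir, red)
        if value is not None:
            values.append(value)
    if not values:
        return None
    return mean(values), median(values)


# Invented reflectances: two vegetation pixels and one water pixel (class 6).
pixels = [(0.5, 0.1, 4), (0.4, 0.2, 4), (0.1, 0.1, 6)]
print(regional_ndvi(pixels, masked=True))
print(regional_ndvi(pixels, masked=False))
```

Note that a wheat field and a neighbouring pasture would both pass this mask, which is the limitation described above.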
Farmer sentiment (TFF)
Daily ingest of posts from active arable sub-forums on The Farming Forum. Compliant ingest with publisher-respecting rate limits and an identifying user-agent. The corpus is actively curated: practitioners are mapped to DEFRA regions, and off-topic and market-wire content is filtered out at scoring time.
Each post is scored twice. The UK-agri lexicon (a hand-curated dictionary spanning the major arable signal categories: disease, weather, drilling progress, market mood, etc.) produces a deterministic score in [-1, +1] with a small-sample-stable normalisation. A frontier LLM then scores the post independently with a one-line rationale, catching sarcasm and context the lexicon misses (e.g. "the best septoria fungicide is dry weather" reads as positive sentiment about dry conditions, not a disease complaint). The two scores are blended, with more weight on the LLM score. Posts with no agronomic signal at all are excluded from daily aggregates rather than averaged in as zero. This is the out-of-season filter, and it keeps the daily aggregate uncontaminated by off-topic chat.
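The scoring steps above can be sketched as follows. This is illustrative only: the five-word lexicon, the additive smoothing used as a stand-in for the small-sample-stable normalisation, and the 0.7 LLM weight are all hypothetical, and the LLM score is passed in as a plain number rather than fetched from any real API.

```python
import re

# Hypothetical mini-lexicon; the real term list and weights are not published.
LEXICON = {"septoria": -1.0, "drought": -1.0, "lodging": -1.0,
           "drilled": 1.0, "tillering": 1.0}
SMOOTHING = 2.0   # assumed: shrinks scores from posts with very few hits
LLM_WEIGHT = 0.7  # assumed: the source only says the blend favours the LLM


def lexicon_score(text):
    """Deterministic score in (-1, +1), or None if no lexicon term appears."""
    tokens = re.findall(r"[a-z]+", text.lower())
    hits = [LEXICON[t] for t in tokens if t in LEXICON]
    if not hits:
        return None
    return sum(hits) / (len(hits) + SMOOTHING)


def blend(lex, llm):
    """Weighted blend; fall back to whichever score exists."""
    if lex is None or llm is None:
        return llm if lex is None else lex
    return LLM_WEIGHT * llm + (1.0 - LLM_WEIGHT) * lex


def daily_aggregate(scores):
    """Mean of scored posts; no-signal posts (None) are dropped, not zeroed."""
    kept = [s for s in scores if s is not None]
    return sum(kept) / len(kept) if kept else None


posts = [
    ("Septoria creeping up the canopy after all this rain", -0.6),
    ("Anyone watching the rugby tonight?", None),  # no agronomic signal
    ("All drilled and tillering nicely", 0.8),
]
scores = [blend(lexicon_score(text), llm) for text, llm in posts]
print(scores, daily_aggregate(scores))
```

Dropping the off-topic post instead of counting it as zero stops a quiet, chatty day from pulling the aggregate toward neutral.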
The ensemble model
A multi-component model with consensus filtering. Several complementary modelling components, each with a different inductive bias, vote on the directional yield call. The ensemble emits a tradeable call only when the components agree, so the tradeable subset is smaller than the full population but cleaner.
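A minimal sketch of consensus filtering, assuming each component votes +1 (above-trend yield) or -1 (below-trend) and that "agree" means unanimity. The real agreement rule and the number of components are not published, so both are assumptions here.

```python
def consensus_call(votes):
    """Return the shared direction if every component agrees, else None.

    votes: list of +1 / -1 directional votes; a 0 (abstention) blocks the call.
    """
    if not votes:
        return None
    first = votes[0]
    if first != 0 and all(v == first for v in votes):
        return first
    return None  # disagreement: no tradeable call is emitted


# Hypothetical region-years mapped to component votes.
region_years = {
    "Region A 2019": [1, 1, 1],
    "Region B 2019": [1, -1, 1],
    "Region C 2019": [-1, -1, -1],
}
tradeable = {}
for key, votes in region_years.items():
    call = consensus_call(votes)
    if call is not None:
        tradeable[key] = call
print(tradeable)
```

Only two of the three region-years survive the filter, which is the trade-off described above: fewer calls, each backed by every component.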
Cross-validated walk-forward: 62.3% hit rate on tradeable calls (38 of 61), r=0.316 (p<0.0001) on predicted-vs-actual yield anomaly, and 92% on the bad-year calls the model made with conviction (24 of 26).
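The quoted percentages follow directly from the counts in the paragraph above. A short check, using only those published counts (the helper name is ours):

```python
def hit_rate(hits, calls):
    """Share of calls whose predicted direction matched the outcome."""
    return hits / calls


print(f"{hit_rate(38, 61):.1%}")  # tradeable calls
print(f"{hit_rate(24, 26):.1%}")  # bad-year calls made with conviction
```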
When it's sharpest. The call firms up as the crop develops and is at its most reliable from stem extension onward (late spring), still weeks ahead of DEFRA harvest figures and AHDB condition reports. This is the window in which the trade has not yet repriced. That is the edge: not a nine-months-out guess, but a confident regional read while the decision is still open.
The sentiment confidence overlay
The sentiment confidence overlay sits over the directional model as a bounded multiplier that is never wide enough to flip the directional call. It sits at 1.0 when sentiment is silent, edges up when sentiment direction matches the ensemble's call, and edges down when it contradicts. The overlay modifies the displayed confidence label only. As forward sentiment data accumulates, future architecture iterations will allow the overlay to graduate into a feature.
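One way to picture a bounded multiplier of this kind is sketched below. The 0.15 bound, the linear response, and the numeric confidence value are assumptions; the source specifies only the behaviour at silence, agreement, and contradiction.

```python
MAX_SHIFT = 0.15  # hypothetical bound: multiplier stays within [0.85, 1.15]


def overlay_multiplier(call, sentiment):
    """Confidence multiplier for a directional call.

    call: +1 or -1 from the ensemble.
    sentiment: aggregate score in [-1, +1], or None when sentiment is silent.
    """
    if sentiment is None:
        return 1.0
    agreement = max(-1.0, min(1.0, call * sentiment))  # > 0 means same direction
    return 1.0 + MAX_SHIFT * agreement


def displayed_confidence(base_confidence, call, sentiment):
    """Scale the displayed confidence; the call itself is never altered."""
    return min(1.0, base_confidence * overlay_multiplier(call, sentiment))


print(overlay_multiplier(+1, None))  # silent sentiment
print(overlay_multiplier(+1, 0.6))   # sentiment agrees with the call
print(overlay_multiplier(-1, 0.6))   # sentiment contradicts the call
```

Because the multiplier is always positive and only scales the confidence figure, it cannot change the sign of the call.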
Walk-forward validation
Every claim about historical accuracy uses walk-forward validation. Each region-year is predicted by a model trained only on data available before that year, with no in-sample fit and no look-ahead. The 62.3% hit rate is therefore a fair estimate of how the model would have performed if deployed in real time at any point in the 1999–2025 window. See the full Track Record for the year-by-year breakdown.
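The splitting rule can be sketched in a few lines. This shows the general walk-forward pattern only; the `min_train` parameter and the function name are hypothetical, and the actual training pipeline is not shown.

```python
def walk_forward_splits(years, min_train=1):
    """Yield (train_years, test_year) pairs with train strictly before test."""
    ordered = sorted(set(years))
    for i, test_year in enumerate(ordered):
        train_years = ordered[:i]  # only years earlier than the test year
        if len(train_years) >= min_train:
            yield train_years, test_year


for train_years, test_year in walk_forward_splits(range(1999, 2004), min_train=2):
    print(f"train {train_years[0]}-{train_years[-1]} -> predict {test_year}")
```

Each prediction sees only earlier years, so the earliest years in the window serve as training data and are never themselves scored.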
What we do not publish
For competitive reasons, the following are not on the public site:
- The lexicon's term list and weights.
- Hyperparameters and feature-importance weights inside the ensemble.
- The curated practitioner-region map.
- The historical sentiment corpus: an accumulating, scored record that compounds daily and can't be back-filled.
The methodology is explainable by design: an acquirer's analyst can audit the decisions in plain English. The defensible IP is the curation and the public track record, which accrue only in calendar time, so a replicator starting today is years behind by definition.
Related reading: Track Record · Glossary · 2019 case study · 2018 miss