Why an ensemble rather than a single model?

Different inductive biases catch different patterns. The consensus filter (emit a call only when complementary components agree) produces a smaller but cleaner tradeable subset with a 62.7% hit rate.

Why include farmer sentiment if it doesn't change the directional call?

Sentiment captures real-time disease and progress signals weeks before they appear in DEFRA or AHDB data. Using it as a bounded confidence overlay (rather than a feature) keeps the ensemble's track record intact while improving the displayed confidence calibration.

How long is the validation horizon?

The walk-forward backtest covers 23 crop years (2003–2025), 253 region-years in total. Each year's call uses a model trained only on prior years.

What's not in the model?

Macro grain prices, currency, geopolitical news, and US-corn-belt analogues. CropIntel forecasts UK yield, which then feeds price; the price-feedback layer is intentionally separate.

Methodology

How CropIntel turns weather, satellite, and farmer sentiment data into a daily UK wheat yield forecast.

In one paragraph

CropIntel runs a multi-component ensemble model over a deep historical baseline of UK regional weather, soil, and satellite data, augmented by hybrid lexicon + LLM scoring of UK farmer commentary. The model emits a directional call only when its components agree, producing a 62.7% tradeable hit rate on the walk-forward backtest. Sentiment from real-time farmer commentary serves as a bounded confidence overlay: it does not change the model's directional call, only the displayed confidence label.

Data layers

Weather (Open-Meteo / ERA5)

Daily temperature, rainfall, sunshine, and frost-day counts at a representative point per DEFRA region, 1999–present. Source: ECMWF ERA5 reanalysis via Open-Meteo's archive API. Multi-point sampling (3–5 points per region) is in progress.

The model's headline weather feature is the compound stress score: a multi-stage aggregation across the wheat growth cycle that captures how each stage's weather deviated from the 1999–present baseline. Unlike single-event weather models that flag heatwaves or droughts in isolation, compound stress aggregates pressure across the full crop year, reflecting the fact that a yield outcome usually has several contributing weather episodes.
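To make the idea concrete, here is a minimal sketch of one way a compound score of this kind could be computed. It is an illustration only, not CropIntel's formula: the stage names, the equal default weights, and the choice to average absolute z-scores are all assumptions made for the example.

```python
from statistics import mean, pstdev


def stage_z(value, baseline):
    """Z-score of one stage's observation against that stage's baseline years."""
    sd = pstdev(baseline)
    return 0.0 if sd == 0 else (value - mean(baseline)) / sd


def compound_stress(stage_values, stage_baselines, weights=None):
    """Weighted mean of absolute stage z-scores across the crop year.

    stage_values:    {stage_name: observed value this year}
    stage_baselines: {stage_name: [historical values for that stage]}
    weights:         optional {stage_name: weight}; equal weights if omitted
                     (an assumption for this sketch).
    """
    stages = list(stage_values)
    if weights is None:
        weights = {s: 1.0 for s in stages}
    total = sum(weights[s] for s in stages)
    return sum(
        weights[s] * abs(stage_z(stage_values[s], stage_baselines[s]))
        for s in stages
    ) / total


# Hypothetical stages and numbers, for illustration only.
score = compound_stress(
    {"tillering": 14.0, "flowering": 1.0},
    {"tillering": [8.0, 12.0], "flowering": [1.0, 3.0]},
)
```

A year with one moderately hot stage and one moderately dry stage scores above zero here even if neither stage alone would trip a single-event threshold, which is the behaviour the paragraph above describes.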

Soil moisture (COSMOS-UK)

Daily volumetric water content, 2013–present. Used as a contextual variable in z-score calculation only. The pre-2013 gap means it can't be a direct model feature without introducing a structural bias against earlier years.

Satellite NDVI (Sentinel-2)

Mean and median NDVI per region, 2015–present. Tested both unmasked and SCL-masked (cropland pixels only). Neither correlates meaningfully with yield at the flowering or stem-extension stages: the SCL mask excludes forest and water but cannot distinguish wheat from grass or other crops. Crop-type classification (e.g. UKCEH Land Cover Plus: Crops) would be needed to isolate wheat specifically; this is pending licensing.

Farmer sentiment (TFF)

Daily ingest of posts from active arable sub-forums on The Farming Forum. Ingest is compliant, with publisher-respecting rate limits and an identifying user-agent. The corpus is actively curated: practitioners are mapped to DEFRA regions, and off-topic and market-wire content is filtered out at scoring time.

Each post is scored twice. The UK-agri lexicon (a hand-curated dictionary spanning the major arable signal categories: disease, weather, drilling progress, market mood, etc.) produces a deterministic score in [-1, +1] with a small-sample-stable normalisation. A frontier LLM then scores the post independently with a one-line rationale, catching sarcasm and context the lexicon misses (e.g. "the best septoria fungicide is dry weather" reads as positive sentiment about dry conditions, not a disease complaint). The two scores are blended, with more weight on the LLM. Posts with no agronomic signal at all are excluded from daily aggregates rather than averaged in as zero. This is the out-of-season filter, and it keeps the daily aggregate uncontaminated by off-topic chat.
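The blend-and-exclude step can be sketched as follows. This is a simplified illustration under stated assumptions: the 0.7 LLM weight, the `has_signal` flag, and the dictionary layout of a post are hypothetical placeholders, since the actual weights and scoring internals are not published.

```python
LLM_WEIGHT = 0.7  # hypothetical; the text says only "more weight on the LLM"


def blend(lexicon_score, llm_score, llm_weight=LLM_WEIGHT):
    """Weighted blend of the two per-post scores, clipped to [-1, +1]."""
    blended = llm_weight * llm_score + (1.0 - llm_weight) * lexicon_score
    return max(-1.0, min(1.0, blended))


def daily_aggregate(posts):
    """Mean blended score over posts that carry an agronomic signal.

    Posts without signal are dropped, not counted as zero. Returns None
    when no post carries signal, so a silent day is distinguishable from
    a neutral one.
    """
    scored = [blend(p["lexicon"], p["llm"]) for p in posts if p["has_signal"]]
    return sum(scored) / len(scored) if scored else None


# Illustrative posts: one with signal, one off-topic.
posts = [
    {"lexicon": 0.0, "llm": 1.0, "has_signal": True},
    {"lexicon": 0.0, "llm": 0.0, "has_signal": False},
]
today = daily_aggregate(posts)
```

Had the off-topic post been averaged in as zero, the day's aggregate would have been halved; excluding it leaves the one informative post to speak for itself.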

The ensemble model

A multi-component model with consensus filtering. Several complementary modelling components, each with a different inductive bias, vote on the directional yield call. The ensemble emits a tradeable call only when the components agree, so the tradeable subset is smaller than the full population but cleaner.
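A consensus filter of this kind can be sketched in a few lines. The sketch assumes each component votes +1 (above average) or -1 (below average) and that "agree" means unanimity; the actual components and agreement rule are not published, so treat both as assumptions.

```python
def consensus_call(votes):
    """Return the shared direction (+1 or -1) if every component agrees.

    Returns None (no tradeable call) when the list is empty, when any
    component abstains with 0, or when the components disagree.
    """
    if not votes:
        return None
    first = votes[0]
    if first != 0 and all(v == first for v in votes):
        return first
    return None


# Illustrative votes from three hypothetical components.
agreed = consensus_call([-1, -1, -1])   # unanimous: below-average call
split = consensus_call([-1, +1, -1])    # disagreement: no call emitted
```

The trade-off is visible in the return type: every `None` is a region-year dropped from the tradeable subset in exchange for a cleaner set of emitted calls.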

Cross-validated walk-forward: 62.7% hit rate on tradeable calls (37 of 59), r=0.306 (p<0.0001) on predicted-vs-actual yield anomaly. Of the 25 below-average region-years where the model made a conviction call, it correctly called 23 below average (92%).
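The headline percentages follow directly from the counts quoted above. The short check below reproduces them; the helper name `hit_rate_pct` is just for this example.

```python
def hit_rate_pct(correct, total):
    """Share of correct calls as a percentage, rounded to one decimal place."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(100.0 * correct / total, 1)


tradeable = hit_rate_pct(37, 59)       # tradeable calls: 37 of 59
below_average = hit_rate_pct(23, 25)   # below-average conviction calls: 23 of 25
```

Note that the denominator is the number of emitted calls, not the 253 region-years in the backtest; region-years where the components disagreed do not count for or against the hit rate.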

When it's sharpest. The call firms up as the crop develops and is at its most reliable from stem extension onward (late spring). That is still weeks ahead of DEFRA harvest figures and AHDB condition reports, in the window where the trade has not yet repriced. That is the edge: not a nine-months-out guess, but a confident regional read while the decision is still open.

The sentiment confidence overlay

The sentiment confidence overlay is a bounded multiplier applied on top of the directional model, never wide enough to flip the directional call. It sits at 1.0 when sentiment is silent, edges up when the direction of sentiment matches the ensemble's call, and edges down when it contradicts it. The overlay does not change the directional call; it modifies the displayed confidence label only. Future iterations of the architecture will follow as forward sentiment data accumulates, allowing the overlay to graduate into a model feature.
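A bounded multiplier with these properties might look like the sketch below. The ±0.1 bound, the linear scaling, and the function names are assumptions for illustration; the source states only that the multiplier is bounded, neutral at 1.0, and never alters direction.

```python
def overlay_multiplier(call, sentiment, max_shift=0.1):
    """Confidence multiplier in [1 - max_shift, 1 + max_shift].

    call:      +1 or -1, the ensemble's directional call.
    sentiment: aggregate sentiment in [-1, +1], or None when silent.
    max_shift: hypothetical bound on how far the multiplier can move.
    """
    if sentiment is None or sentiment == 0:
        return 1.0  # silent sentiment leaves confidence untouched
    s = max(-1.0, min(1.0, sentiment))
    # Same sign as the call raises confidence; opposite sign lowers it.
    return 1.0 + max_shift * s * call


def apply_overlay(call, confidence, sentiment):
    """Return (call, adjusted confidence). The call passes through unchanged."""
    adjusted = confidence * overlay_multiplier(call, sentiment)
    return call, max(0.0, min(1.0, adjusted))


# Illustrative values: a below-average call with contradicting sentiment.
direction, shown_confidence = apply_overlay(-1, 0.80, 0.5)
```

Because the multiplier touches only the confidence value and `apply_overlay` returns the call as received, no sentiment reading can flip the direction, however strong it is.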

Walk-forward validation

Every claim about historical accuracy uses walk-forward validation. Each region-year is predicted by a model trained only on data available before that year: no in-sample fit, no look-ahead. The 62.7% hit rate is therefore a fair estimate of how the model would have performed if deployed in real time at any point in the 2003–2025 window. See the full Track Record for the year-by-year breakdown.
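The splitting logic behind walk-forward validation can be sketched as below. The generator name and the choice of 1999 as the first training year are assumptions for the example (1999 matches the start of the weather baseline described earlier); the model-fitting step itself is omitted.

```python
def walk_forward_splits(years, first_test_year):
    """Yield (train_years, test_year) pairs with no look-ahead.

    Each test year is paired only with strictly earlier years, so the
    model predicting year Y never sees data from Y or later.
    """
    ordered = sorted(set(years))
    for test_year in ordered:
        if test_year < first_test_year:
            continue
        train_years = [y for y in ordered if y < test_year]
        if train_years:
            yield train_years, test_year


# Assumed data span 1999-2025, with the first test year in 2003.
splits = list(walk_forward_splits(range(1999, 2026), first_test_year=2003))
```

With these inputs the generator produces 23 splits, one per crop year in the validation window, and the training set grows by one year at each step.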

What we do not publish

For competitive reasons, the following are not on the public site:

The lexicon's term list and weights.
Hyperparameters and feature-importance weights inside the ensemble.
The curated practitioner-region map.
The historical sentiment corpus, an accumulating, scored record that compounds daily and can't be back-filled.

The methodology is explainable by design: an acquirer's analyst can audit the decisions in plain English. The defensible IP is the curation and the public track record, which accrue only in calendar time, so a replicator starting today is years behind by definition.

Related reading: Track Record · Glossary · 2019 case study · 2018 miss