Methodology

Public documentation of data sources, models and validation.

1. Data sources

Event data. Match-by-match structured event streams (shots, passes, tackles, set pieces, etc.) provided by API-Football.
Market data. Bookmaker odds snapshots from The Odds API, used purely as a reference variable representing market expectations.
Environmental data. Match-day temperature, precipitation and wind from public weather providers.
Historical archive. ~100,000 publicly available historical matches across 5–8 seasons, used for model training.
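
As an illustration of how an odds snapshot can serve as a reference variable for market expectations, the sketch below converts bookmaker prices into probabilities. It assumes decimal odds and removes the bookmaker margin by simple proportional normalisation. Both are assumptions made for this example, and the function name and sample prices are hypothetical.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds into probabilities that sum to 1.

    Assumption: the bookmaker margin (overround) is removed by
    proportional normalisation. Other de-margining methods exist.
    """
    if any(o <= 1.0 for o in decimal_odds):
        raise ValueError("decimal odds must be greater than 1.0")
    raw = [1.0 / o for o in decimal_odds]   # raw inverse odds, sum > 1
    overround = sum(raw)                    # total including the margin
    return [p / overround for p in raw]


# Hypothetical home / draw / away prices, for illustration only.
probs = implied_probabilities([2.10, 3.40, 3.60])
```

The normalised values can then be compared with model probabilities or tracked over time as a drift feature.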

2. Feature engineering

We extract 40–60 features per match across five tiers:

Strength. Elo rating, win rate over the last 10 matches, xG (expected goals).
Form. W/D/L sequences, goals-scored / conceded trends, win/loss streak length.
Matchup. Head-to-head record, average goals, style compatibility indicators.
Market. Opening-to-current odds drift, line consistency across bookmakers.
Context. Match importance, schedule density, key player availability, weather.
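
Two of the Strength features can be sketched in a few lines. The example below uses the standard Elo expected-score formula and a rolling win rate. The K-factor of 20, the absence of a home-advantage term and the function names are assumptions chosen for illustration, not parameters taken from the production pipeline.

```python
def elo_expected(rating_a, rating_b):
    """Expected score of side A against side B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def elo_update(rating_home, rating_away, home_score, k=20.0):
    """Return updated (home, away) ratings after one match.

    home_score is 1.0 for a home win, 0.5 for a draw, 0.0 for a loss.
    k is an assumed K-factor.
    """
    delta = k * (home_score - elo_expected(rating_home, rating_away))
    return rating_home + delta, rating_away - delta


def recent_win_rate(results, window=10):
    """Share of wins among the last `window` results ('W', 'D' or 'L')."""
    recent = results[-window:]
    if not recent:
        return None  # no history yet
    return sum(1 for r in recent if r == "W") / len(recent)


new_home, new_away = elo_update(1500.0, 1500.0, home_score=1.0)
rate = recent_win_rate(["W", "D", "L", "W", "W"])
```

Because the update is zero-sum, the two ratings always move by the same amount in opposite directions.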

3. Models

Primary. XGBoost multi-class and binary classifiers producing discrete probability distributions for win / draw / loss, Asian handicap, and total goals.
Secondary. LightGBM, used for ensembling and cross-validation.
Validation. Strict time-series split: train on earlier matches, validate on later ones. No random splits.
Evaluation metrics. Log loss, calibration curve, backtested accuracy.
Retraining cadence. Weekly incremental retraining with the latest matches.
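
The validation scheme and the log-loss metric can be sketched without any modelling library. The example below splits matches at a cutoff date, so that no later match leaks into training, and scores win / draw / loss probabilities with multi-class log loss. The record layout, the cutoff date and the function names are hypothetical.

```python
import math
from datetime import date


def chronological_split(matches, cutoff):
    """Train on matches before the cutoff date, validate on the rest."""
    train = [m for m in matches if m["date"] < cutoff]
    valid = [m for m in matches if m["date"] >= cutoff]
    return train, valid


def multiclass_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the observed outcome.

    y_true holds class indices (0 = win, 1 = draw, 2 = loss here);
    y_prob holds one probability vector per match.
    """
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        p = min(max(probs[label], eps), 1.0 - eps)  # clip to avoid log(0)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical records, for illustration only.
matches = [
    {"date": date(2023, 8, 12), "outcome": 0},
    {"date": date(2023, 11, 4), "outcome": 1},
    {"date": date(2024, 2, 17), "outcome": 2},
    {"date": date(2024, 4, 6), "outcome": 0},
]
train, valid = chronological_split(matches, cutoff=date(2024, 1, 1))

# A model that always predicts a uniform distribution scores ln(3).
uniform = [[1 / 3, 1 / 3, 1 / 3]] * len(valid)
baseline = multiclass_log_loss([m["outcome"] for m in valid], uniform)
```

The uniform baseline of ln(3), roughly 1.0986, is a useful reference point: a three-way model is only informative if its validation log loss falls below it.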

4. Report generation

The probability distributions and structured indicators are fed into a large language model (Claude Sonnet, served through a regional API gateway) which composes the natural-language report on a fixed template: match background and head-to-head, recent form and lineup, tactical-style comparison, key variables and uncertainties, probability distribution and scenario analysis, and a data-source / methodology footer.

Every conclusion is expressed as a probability, distribution, confidence interval, or historical analogue. We do not produce directional recommendations.

5. Uncertainty and limits

Football has substantial irreducible variance. Expected model accuracy is 55–65%, in line with the academic literature.
Pre-match lineups, injuries, and last-minute changes may surface only a few hours before kickoff. The model is updated incrementally when such information is published.
Every probability output ships with sample size and confidence interval. Users should interpret outputs in the context of their own research questions.
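
One common way to attach an interval to an observed frequency is the Wilson score interval, sketched below. This page does not state which interval method is used, so the choice of Wilson, the 95% level (z = 1.96) and the function name are assumptions made for illustration.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    successes: number of matches in which the outcome occurred
    n:         sample size
    z:         normal quantile (1.96 gives roughly 95% coverage)
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    p_hat = successes / n
    denom = 1.0 + z * z / n
    centre = (p_hat + z * z / (2 * n)) / denom
    half = z * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Hypothetical counts: the outcome occurred in 60 of 100 comparable matches.
low, high = wilson_interval(60, 100)
low_big, high_big = wilson_interval(600, 1000)
```

With the same observed rate, the interval narrows as the sample grows, which is why the sample size is reported alongside each probability.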

All data and analytical outputs are research-grade content, intended solely for research, education and editorial use. Not investment advice, not betting advice.