Methodology
Public documentation of data sources, models and validation.
1. Data sources
- Event data. Match-by-match structured event streams (shots, passes, tackles, set pieces, etc.) provided by API-Football.
- Market data. Bookmaker odds snapshots from The Odds API, used purely as a reference variable representing market expectations.
- Environmental data. Match-day temperature, precipitation and wind from public weather providers.
- Historical archive. ~100,000 publicly available historical matches across 5–8 seasons, used for model training.
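To illustrate how an odds snapshot can serve as a reference for market expectations, the sketch below converts decimal odds into implied probabilities and removes the bookmaker margin by proportional normalisation. The function name, the example odds, and the choice of proportional normalisation are illustrative assumptions, not a description of the production pipeline.

```python
def implied_probabilities(decimal_odds):
    """Convert decimal odds to implied probabilities that sum to 1.

    Hypothetical helper: proportional normalisation is one common way to
    strip the bookmaker margin; the source does not say which method is used.
    """
    if any(o <= 1.0 for o in decimal_odds.values()):
        raise ValueError("decimal odds must be greater than 1.0")
    # Raw inverse odds sum to more than 1; the excess is the bookmaker margin.
    raw = {outcome: 1.0 / o for outcome, o in decimal_odds.items()}
    total = sum(raw.values())
    probs = {outcome: p / total for outcome, p in raw.items()}
    return probs, total - 1.0


# Illustrative odds only, not real market data.
odds = {"home": 2.10, "draw": 3.40, "away": 3.60}
probs, margin = implied_probabilities(odds)
```

Here `probs` holds the margin-free market view of the three outcomes, and `margin` is the overround that was removed.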
2. Feature engineering
We extract 40–60 features per match across five tiers:
- Strength. Elo rating, recent 10-match win rate, xG (expected goals).
- Form. W/D/L sequences, goals-scored / conceded trends, win/loss streak length.
- Matchup. Head-to-head record, average goals, style compatibility indicators.
- Market. Opening-to-current odds drift, line consistency across bookmakers.
- Context. Match importance, schedule density, key player availability, weather.
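As a rough illustration of the Strength tier, the sketch below shows a textbook Elo update and a rolling win rate. The K-factor of 20, the absence of a home-advantage term, and all function names are assumptions made for the example; the actual feature definitions may differ.

```python
def expected_score(rating_a, rating_b):
    """Standard Elo expected score of side A against side B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(home, away, result, k=20.0):
    """Return updated (home, away) ratings after one match.

    result is "H" (home win), "D" (draw) or "A" (away win).
    k=20 is an assumed K-factor, not a value taken from the source.
    """
    score = {"H": 1.0, "D": 0.5, "A": 0.0}[result]
    delta = k * (score - expected_score(home, away))
    # Zero-sum update: what the home side gains, the away side loses.
    return home + delta, away - delta


def recent_win_rate(results, window=10):
    """Share of wins in the last `window` results ("W", "D" or "L")."""
    recent = results[-window:]
    if not recent:
        return None
    return sum(r == "W" for r in recent) / len(recent)


new_home, new_away = update_elo(1500.0, 1500.0, "H")
form = recent_win_rate(["W", "L", "W", "D"])
```

Two equally rated sides each have an expected score of 0.5, so a home win moves the ratings by half the K-factor in opposite directions.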
3. Models
- Primary. XGBoost multi-class and binary classifiers producing discrete probability distributions for win / draw / loss, Asian handicap, and total goals.
- Secondary. LightGBM, used for ensembling and as a cross-check on the primary model.
- Validation. Strict time-series split: train on past matches, validate on later ones. No random splits.
- Evaluation metrics. Log loss, calibration curve, backtested accuracy.
- Retraining cadence. Weekly incremental retraining with the latest matches.
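The validation and evaluation bullets above can be sketched in a few lines of plain Python. The example splits matches at a cutoff date so that no validation match precedes a training match, then scores predicted probabilities with multi-class log loss. The record layout, field names, and cutoff are hypothetical, and the uniform prediction is only a baseline for illustration.

```python
import math
from datetime import date


def chronological_split(matches, cutoff):
    """Train on matches before `cutoff`, validate on the rest (no shuffling)."""
    train = [m for m in matches if m["date"] < cutoff]
    valid = [m for m in matches if m["date"] >= cutoff]
    return train, valid


def log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-probability assigned to the observed outcome."""
    total = 0.0
    for label, probs in zip(y_true, y_prob):
        # Clip to avoid log(0) when a model assigns zero probability.
        p = min(max(probs[label], eps), 1.0 - eps)
        total -= math.log(p)
    return total / len(y_true)


# Hypothetical records; a real pipeline would carry the full feature set.
matches = [
    {"date": date(2023, 8, 12), "outcome": "H"},
    {"date": date(2023, 9, 2), "outcome": "D"},
    {"date": date(2024, 1, 20), "outcome": "A"},
    {"date": date(2024, 2, 3), "outcome": "H"},
]
train, valid = chronological_split(matches, cutoff=date(2024, 1, 1))

# Uniform baseline: one third for each of home win, draw, away win.
uniform = {"H": 1 / 3, "D": 1 / 3, "A": 1 / 3}
baseline = log_loss([m["outcome"] for m in valid], [uniform] * len(valid))
```

A uniform three-way prediction scores ln 3 (about 1.10), which gives a floor that any useful win / draw / loss model should beat on the validation set.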
4. Report generation
The probability distributions and structured indicators are passed to a large language model (Claude Sonnet, served through a regional API gateway), which composes the natural-language report from a fixed template. The template covers match background and head-to-head, recent form and lineup, tactical-style comparison, key variables and uncertainties, probability distribution and scenario analysis, and a data-source / methodology footer.
Every conclusion is expressed as a probability, distribution, confidence interval, or historical analogue. We do not produce directional recommendations.
5. Uncertainty and limits
- Football has substantial irreducible variance. Expected model accuracy is 55–65%, in line with the academic literature.
- Lineups, injuries, and other last-minute changes may surface only a few hours before kickoff. When such information is published, the affected predictions are updated incrementally.
- Every probability output ships with sample size and confidence interval. Users should interpret outputs in the context of their own research questions.
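As one way to attach an interval to a rate estimated from a finite sample, the sketch below computes a Wilson score interval for a binomial proportion. The choice of the Wilson method, the 95% level (z = 1.96), and the example counts are assumptions for illustration; the source does not specify how its intervals are derived.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion.

    z=1.96 corresponds to roughly 95% coverage (an assumed level).
    """
    if n <= 0:
        raise ValueError("sample size must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Illustrative counts: the same 60% hit rate at two sample sizes.
small = wilson_interval(6, 10)
large = wilson_interval(60, 100)
```

The same observed rate yields a much wider interval at n = 10 than at n = 100, which is why the sample size is reported alongside each probability.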
All data and analytical outputs are research-grade content, intended solely for research, education and editorial use. They are not investment advice or betting advice.