England, 2018–2024
Data Analysis

Pupil poverty predicts which children miss school. Area deprivation is a much weaker signal.

A four-year machine learning analysis of persistent absence in English state schools finds that Free School Meals eligibility outperforms area deprivation as a predictor by a factor of nineteen to one.

Data: DfE Absence Statistics · GIAS · IMD 2019 · ONS Postcode Directory  ·  Method: Random Forest · SHAP · Spatial GroupKFold CV
The finding

The COVID attendance crisis hit every school at roughly the same magnitude, regardless of how deprived its neighbourhood was

Persistent absence, missing at least 10% of possible school sessions, has long been treated as a proxy for disadvantage. The assumption is that it clusters in deprived areas, and that area-level interventions can reach the pupils most at risk. This analysis, covering roughly 79,000 school-year observations across four academic years, tests that assumption directly.
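The 10% threshold can be made concrete with a small sketch. The function names and the pupil figures below are hypothetical, and the school-level rate is assumed to be the share of pupils at or above the threshold; the analysis itself works from published DfE rates rather than pupil-level records.

```python
def absence_rate(sessions_missed: int, sessions_possible: int) -> float:
    """Share of possible sessions a pupil missed."""
    if sessions_possible <= 0:
        raise ValueError("sessions_possible must be positive")
    return sessions_missed / sessions_possible


def is_persistently_absent(sessions_missed: int, sessions_possible: int,
                           threshold: float = 0.10) -> bool:
    """True when the pupil missed at least `threshold` of possible sessions."""
    return absence_rate(sessions_missed, sessions_possible) >= threshold


def persistent_absence_rate(pupils) -> float:
    """Share of pupils in a school who are persistently absent."""
    flags = [is_persistently_absent(missed, possible) for missed, possible in pupils]
    return sum(flags) / len(flags)


# Hypothetical pupils: (sessions missed, sessions possible)
example_pupils = [(10, 380), (38, 380), (60, 380), (0, 380)]
school_rate = persistent_absence_rate(example_pupils)
print(school_rate)
```

A pupil who misses exactly 10% of sessions counts as persistently absent, matching the "at least 10%" wording.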

Free School Meals eligibility is about nineteen times more predictive of absence rates than the Index of Multiple Deprivation (IMD) score of the area a school sits in: it accounts for 39.7% of the model's feature importance, against 2.1% for IMD. Area deprivation is real, but it is a blunt instrument, and the model makes that gap visible.

The academic year an observation falls in is the second strongest predictor in the model. This is not because school years inherently differ, but because COVID-19 caused a structural break in attendance norms that is still visible in the 2023-24 data. When the SHAP dependence plots are split by deprivation level, the shock is flat across the deprivation spectrum: the pandemic hit every school at roughly the same magnitude.
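One way to put a number on "flat across the spectrum" is to average the year feature's SHAP contribution within deprivation bands and compare the band means. The sketch below is an assumption about how such a check could look, not the analysis's own procedure; the band labels and SHAP values are made up for illustration.

```python
from statistics import mean

# Hypothetical rows for post-COVID school-years:
# (IMD band of the school's area, SHAP value of the year feature)
rows = [
    ("low", 3.0), ("low", 3.5),
    ("mid", 3.5), ("mid", 3.5),
    ("high", 3.0), ("high", 3.25),
]


def mean_by_band(rows):
    """Average the year feature's SHAP contribution within each IMD band."""
    bands = {}
    for band, value in rows:
        bands.setdefault(band, []).append(value)
    return {band: mean(values) for band, values in bands.items()}


band_means = mean_by_band(rows)
# A small spread relative to the band means suggests no deprivation gradient.
spread = max(band_means.values()) - min(band_means.values())
print(band_means, spread)
```

If the shock had fallen harder on deprived areas, the band means would rise steadily from "low" to "high" rather than sitting close together.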

39.7% of model explanatory power comes from Free School Meals eligibility alone
0.597 test R² for Random Forest, vs 0.486 for OLS baseline
2.1% of model explanatory power comes from area-level IMD score
The model

What drives absence

Random Forest feature importances, trained on roughly 79,000 school-year observations split by school URN to prevent leakage.
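Splitting by URN means every row for a given school lands on one side of the train/test divide, so the model is never tested on a school it has already seen in another year. A minimal sketch of the idea follows, using only the standard library; the function name, row layout and URNs are hypothetical. scikit-learn's GroupShuffleSplit and GroupKFold provide the same guarantee.

```python
import random


def split_by_group(rows, group_key, test_fraction=0.2, seed=0):
    """Assign whole groups (schools) to train or test, never individual rows."""
    groups = sorted({row[group_key] for row in rows})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_test = max(1, round(len(groups) * test_fraction))
    test_groups = set(groups[:n_test])
    train = [row for row in rows if row[group_key] not in test_groups]
    test = [row for row in rows if row[group_key] in test_groups]
    return train, test


# Hypothetical school-year rows: 5 schools observed in 4 years each
rows = [{"urn": 100000 + s, "year_index": y} for s in range(5) for y in range(4)]
train_rows, test_rows = split_by_group(rows, "urn")

train_urns = {row["urn"] for row in train_rows}
test_urns = {row["urn"] for row in test_rows}
print(len(train_rows), len(test_rows), train_urns & test_urns)
```

A plain row-level shuffle would scatter each school's years across both sets, letting the model recognise individual schools and overstating test performance.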

Random Forest vs OLS baseline

Test set performance. Random Forest tuned with GridSearchCV (n_estimators=300, max_depth=10). The OLS baseline uses the same feature set.
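The comparison can be sketched with scikit-learn as below. This is an illustration under stated assumptions, not the analysis's pipeline: the data is synthetic and deliberately non-linear, the parameter grid is a small stand-in, and the grouping by hypothetical school IDs mirrors the group-aware splitting described above.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, GroupKFold, GroupShuffleSplit

rng = np.random.default_rng(0)

# Synthetic stand-in data: 100 hypothetical schools x 4 years, 2 features.
n_schools, n_years = 100, 4
groups = np.repeat(np.arange(n_schools), n_years)
X = rng.uniform(-3, 3, size=(n_schools * n_years, 2))
y = X[:, 0] ** 2 + rng.normal(0, 0.1, size=len(X))  # non-linear on purpose

# Hold out whole schools for the test set.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups))

# Tune the forest with group-aware cross-validation on the training rows only.
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid={"n_estimators": [50], "max_depth": [3, 10]},  # illustrative grid
    cv=GroupKFold(n_splits=3),
    scoring="r2",
)
search.fit(X[train_idx], y[train_idx], groups=groups[train_idx])

# OLS baseline on the same features.
ols = LinearRegression().fit(X[train_idx], y[train_idx])

rf_r2 = r2_score(y[test_idx], search.predict(X[test_idx]))
ols_r2 = r2_score(y[test_idx], ols.predict(X[test_idx]))
print(f"RF R2={rf_r2:.3f}  OLS R2={ols_r2:.3f}  best={search.best_params_}")
```

On this toy data the gap between the two models is far larger than the 0.597 vs 0.486 reported here, because the synthetic target has no linear signal at all.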

Explainability: SHAP values
SHAP summary bar: mean absolute SHAP values across the test set. Higher = more influence on the model's prediction.
SHAP beeswarm: direction and spread of each feature's effect. Red = high feature value, blue = low.
FSM dependence: higher FSM rates consistently produce large positive SHAP contributions, meaning more predicted absence.
Year dependence, coloured by IMD score: the post-COVID shift is uniform across deprivation levels, with no gradient.
Phase dependence: secondary schools (phase_numeric=2) receive consistently higher SHAP contributions than primary schools.
IMD dependence: IMD has a weak, noisy relationship with absence once FSM, year and phase are accounted for.
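The summary bar ranks features by mean absolute SHAP value. The sketch below shows that calculation on a tiny made-up matrix; in practice the matrix would come from an explainer such as the shap library's TreeExplainer, with one row per test observation and one column per feature. The numbers are hypothetical and carry no relation to the reported results.

```python
# Hypothetical SHAP matrix: one row per test observation, one column per feature.
feature_names = ["percent_fsm", "year_numeric", "imd_score"]
shap_values = [
    [4.0, -2.0, 0.5],
    [-3.0, 2.5, -0.5],
    [5.0, -1.5, 0.0],
    [-4.0, 2.0, 1.0],
]


def mean_abs_shap(matrix, names):
    """Average the absolute SHAP contribution of each feature over all rows."""
    n_rows = len(matrix)
    return {
        name: sum(abs(row[j]) for row in matrix) / n_rows
        for j, name in enumerate(names)
    }


importance = mean_abs_shap(shap_values, feature_names)
ranking = sorted(importance, key=importance.get, reverse=True)
print(importance, ranking)
```

Taking absolute values first matters: a feature that pushes predictions up for some schools and down for others would otherwise average out to near zero despite having a large influence.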
Key findings

What the model tells us

Data & method

Feature importances in full

Feature                          Importance  Description
percent_fsm                      0.397       % pupils eligible for Free School Meals
year_numeric                     0.298       Academic year encoded 0–5 (captures COVID break)
phase_numeric                    0.209       1=Primary, 2=Secondary, 3=All-through
log_pupils                       0.044       Log-transformed school roll
imd_score                        0.021       IMD 2019 area deprivation score
region_London                    0.013       London region dummy
All other regions + urban flag   0.018
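The headline ratio can be checked directly from the importances listed above. The short sketch below reproduces the arithmetic; the dictionary key for the combined final row is a hypothetical label.

```python
# Feature importances as listed in the table.
importances = {
    "percent_fsm": 0.397,
    "year_numeric": 0.298,
    "phase_numeric": 0.209,
    "log_pupils": 0.044,
    "imd_score": 0.021,
    "region_London": 0.013,
    "other_regions_and_urban": 0.018,  # hypothetical label for the combined row
}

total = sum(importances.values())
fsm_to_imd = importances["percent_fsm"] / importances["imd_score"]

print(round(total, 3))       # importances should sum to 1
print(round(fsm_to_imd, 1))  # FSM importance relative to IMD
```

The ratio comes out at about 18.9, which is the basis for the "nineteen to one" figure.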