
🏘️ Property Valuation & Bias Detection – Cook County, Illinois


Abstract

This study examines the determinants of residential property sale prices in Cook County, Illinois from 2013 to 2019 using a three-stage linear modeling approach. Beginning with a simple univariate model based on bedroom count, I iteratively introduce building size and then a comprehensive set of engineered features (including log-transformed area measures, tax assessor estimates, fireplace counts, repair condition scores, central air conditioning status, and categorical encodings of room counts). My final model, trained on 204,792 property transactions, achieves a root-mean-square error (RMSE) of 0.3716 in predicting log sale price, a 57% reduction from the baseline RMSE of 0.8674. I also find that the model systematically overestimates lower-priced homes, a regressive bias that can exacerbate tax burdens on low-income residents. Drawing from local journalism and policy research, I discuss how tax appeal processes, neighborhood demographics, and historical segregation intersect with model performance to compound unfair outcomes.

1. Introduction

Accurate prediction of residential sale prices is crucial for market transparency, equitable property taxation, and urban planning. While hedonic price models have long incorporated structural and locational attributes, advances in data pipelines and feature engineering now allow for larger-scale, reproducible analyses. This paper details a structured approach to modeling Cook County sale prices, emphasizing clarity, reproducibility, and interpretability. I additionally interrogate how residual errors map onto price strata, revealing a regressive pattern that aligns with critiques of Cook County’s assessment practices.

2. Data Description

The primary dataset (cook_county_train.csv) contains 204,792 records and 62 features, including physical attributes (e.g., building square footage, number of bedrooms), material characteristics (e.g., roof and wall types), condition assessments, and tax assessor estimates for land and building values. A companion codebook (data/codebook.txt) describes each variable.

3. Data Cleaning & Exploratory Analysis

  1. Outlier Removal: Excluded properties with sale prices below $499 to eliminate data entry errors and symbolic transactions.
  2. Missing Values: Removed or imputed missing entries for critical predictors (e.g., square footage).
  3. Univariate Distributions: Sale year and month: sales span multiple years, with peak activity in summer months. Sale price: right-skewed, so a log transformation is applied.
  4. Correlation Analysis: Building square footage and assessor estimates exhibit strong positive correlations (r > 0.75) with sale price.
  5. Feature Relationships: Residual plots from early models revealed heteroscedasticity and nonlinear patterns, motivating log transforms and expanded predictors.
Figures: Distribution of Log Sale Price; Log Sale Price by Number of Bedrooms.
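
The outlier-removal and log-transform steps listed above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the column names "Sale Price" and "Building Square Feet", the helper `clean_sales`, and the toy rows are all assumptions made for the example, and missing values are simply dropped here rather than imputed.

```python
import numpy as np
import pandas as pd

def clean_sales(df, price_col="Sale Price", area_col="Building Square Feet", min_price=499):
    """Drop sales below min_price and rows missing area, then add a log-price column.

    Column names and this helper are hypothetical; adapt them to the real codebook.
    """
    out = df[df[price_col] >= min_price].copy()   # excludes prices below the threshold
    out = out.dropna(subset=[area_col])           # simplest option: drop rather than impute
    out["Log " + price_col] = np.log(out[price_col])
    return out

# Toy rows standing in for cook_county_train.csv (values are made up).
toy = pd.DataFrame({
    "Sale Price": [1, 250_000, 90_000, 1_200_000],
    "Building Square Feet": [900, 1500, np.nan, 3200],
})
cleaned = clean_sales(toy)
print(cleaned)
```

The $1 sale is dropped as a symbolic transaction and the row with missing square footage is dropped as incomplete, leaving two rows with a new "Log Sale Price" column.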

4. Modeling Framework

  1. Baseline Model (Model 1): Features: Total bedroom count; Pipeline: outlier removal, log-transform price; Estimator: OLS; Purpose: benchmark.
  2. Intermediate Model (Model 2): Features: Bedrooms + log(building sq ft); Pipeline: same as Baseline; Evaluation: residual analysis.
  3. Final Model (Model 3): Target: log-transformed sale price; Features (13 total): log-transformed building/land sq ft, assessor estimates, engineered fireplaces, repair condition, binary central air, one-hot room counts; Estimator: LinearRegression; Evaluation: RMSE.
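
To make the staged comparison concrete, here is a small sketch of fitting nested linear models to log price and scoring them by RMSE on held-out rows. It runs on synthetic data generated inside the block, so the numbers it prints are not the study's results. It also swaps in NumPy's least-squares solver as a stand-in for scikit-learn's LinearRegression (both compute an ordinary least-squares fit); the helper names `fit_ols`, `predict`, and `rmse` are assumptions for this example.

```python
import numpy as np

def rmse(actual, predicted):
    """Root-mean-square error between two equal-length sequences."""
    diff = np.asarray(actual, dtype=float) - np.asarray(predicted, dtype=float)
    return float(np.sqrt(np.mean(diff ** 2)))

def fit_ols(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    design = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    return coef

def predict(coef, X):
    return np.column_stack([np.ones(len(X)), X]) @ coef

# Synthetic stand-in data: log price depends on log area, bedrooms track area loosely.
rng = np.random.default_rng(0)
n = 2000
log_area = rng.normal(7.3, 0.4, n)
bedrooms = np.clip(np.round((log_area - 5.5) * 1.8 + rng.normal(0, 0.8, n)), 1, 8)
log_price = 4.0 + 1.1 * log_area + rng.normal(0, 0.4, n)

split = int(0.8 * n)  # first 80% for fitting, last 20% for validation
feature_sets = {
    "bedrooms only": bedrooms.reshape(-1, 1),
    "bedrooms + log area": np.column_stack([bedrooms, log_area]),
}
scores = {}
for name, X in feature_sets.items():
    coef = fit_ols(X[:split], log_price[:split])
    scores[name] = rmse(log_price[split:], predict(coef, X[split:]))
    print(f"{name}: validation RMSE = {scores[name]:.4f}")
```

Because the target is the log of sale price, every RMSE here is measured in log units, which is what allows the scores of the three models to be compared directly.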

5. Results & Evaluation

| Model | Engineered Feature(s) | RMSE (log sale price) | Multiplier error |
|---|---|---|---|
| Model 1 (Baseline) | Bedrooms | 0.8674 | ~×2.38 |
| Model 2 (Intermediate) | Bedrooms + Log Area | 0.8059 | ~×2.24 |
| Model 3 (Final) | 13 Predictors | 0.3716 | ~×1.45 |
Figures: Overestimation Rate by Price Strata; Residuals vs. True Log Sale Price; RMSE by Price Strata.
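
The multiplier figures reported with each RMSE appear to be the exponential of the log-scale RMSE: an error of r in log price corresponds to a prediction that is off by a factor of about exp(r). The short sketch below reproduces that arithmetic, along with the relative improvement of the final model over the baseline. The function names are illustrative; the RMSE values are the ones reported above.

```python
import math

# RMSE values on log sale price, as reported in the results.
rmse_by_model = {"Model 1": 0.8674, "Model 2": 0.8059, "Model 3": 0.3716}

def typical_multiplier(log_rmse):
    """Convert a log-scale RMSE into an approximate multiplicative error factor."""
    return math.exp(log_rmse)

def relative_improvement(baseline, new):
    """Fractional reduction in RMSE relative to the baseline."""
    return (baseline - new) / baseline

for name, value in rmse_by_model.items():
    print(f"{name}: RMSE {value:.4f} -> about x{typical_multiplier(value):.2f}")

gain = relative_improvement(rmse_by_model["Model 1"], rmse_by_model["Model 3"])
print(f"Model 3 vs. Model 1: {gain:.0%} lower RMSE")
```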

6. Discussion & Next Steps

  1. Interpretability & Equity: Model coefficients remain interpretable; regressive bias leads to higher tax burdens on low-income owners, echoing Chicago Tribune investigations.
  2. Systemic Context: Higher-value homeowners more likely to appeal assessments; historical segregation correlates with housing underinvestment; policy implications include fairness-aware modeling and appeal assistance.
  3. Residual Analysis by Price Strata: Lowest-quintile homes are systematically overestimated, indicating a regressive bias in any assessments derived from the model.
  4. Next Steps: Incorporate spatial/hierarchical models; experiment with fairness-aware and ensemble methods; build stakeholder dashboards.
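
A stratified residual check of the kind described above can be sketched as follows. This is a hedged illustration on synthetic data, not the notebook's implementation: the function `overestimation_by_stratum` and its output keys are assumptions, and the fake predictions are deliberately shrunk toward the mean so that the regressive pattern is visible.

```python
import numpy as np

def overestimation_by_stratum(log_actual, log_predicted, n_strata=5):
    """Split sales into equal-sized strata by true log price and summarize residuals.

    Returns one dict per stratum (1 = cheapest) with the share of homes whose
    predicted log price exceeds the true value, and the stratum RMSE.
    """
    log_actual = np.asarray(log_actual, dtype=float)
    log_predicted = np.asarray(log_predicted, dtype=float)
    # Interior quantile cut points of the true log price.
    edges = np.quantile(log_actual, np.linspace(0, 1, n_strata + 1)[1:-1])
    stratum = np.digitize(log_actual, edges)  # 0 = cheapest stratum
    rows = []
    for k in range(n_strata):
        mask = stratum == k
        resid = log_predicted[mask] - log_actual[mask]  # positive = overestimate
        rows.append({
            "stratum": k + 1,
            "n": int(mask.sum()),
            "share_overestimated": float(np.mean(resid > 0)),
            "rmse": float(np.sqrt(np.mean(resid ** 2))),
        })
    return rows

# Synthetic example: predictions regress toward the mean, as linear models often do.
rng = np.random.default_rng(1)
actual = rng.normal(12.0, 0.8, 5000)
predicted = actual.mean() + 0.6 * (actual - actual.mean()) + rng.normal(0, 0.3, 5000)

rows = overestimation_by_stratum(actual, predicted)
for row in rows:
    print(row)
```

In this synthetic setup the cheapest stratum shows the highest share of overestimates and the most expensive stratum the lowest, which is the signature of a regressive error pattern. On real data, the same table would be built from validation-set predictions.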

7. Conclusion

This analysis demonstrates that a reproducible hedonic modeling workflow cuts RMSE from 0.87 to 0.37 and uncovers a regressive bias in predicted values for Cook County, Illinois. Lower-priced homes are systematically overestimated, which intensifies tax burdens for their owners.

Even accurate models can reinforce systemic inequity if their errors fall disproportionately on disadvantaged groups. Predictive models must go beyond technical accuracy to incorporate fairness, transparency, and local context. Without thoughtful oversight, even the most statistically sound models risk amplifying racial and economic disparities. Future work should integrate fairness metrics and spatial controls to ensure more equitable assessment outcomes.

References

  1. Cook County Assessor’s Office. Cook County Property Codebook. 2025.
  2. Chicago Tribune investigations on property assessment disparities.
  3. Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning. Springer.
  4. Kuhn, M., & Johnson, K. (2019). Feature Engineering and Selection. CRC Press.