In previous posts, I discussed building a regression model to predict diabetes rates and techniques to improve its performance. With feature engineering and model tuning, I had significantly boosted the R-squared metric. But I wanted to try incorporating some new predictor variables to potentially capture additional factors related to diabetes prevalence.
For this iteration, I added columns for socioeconomic status, household disability rate, and housing/transportation types. The hypothesis was these features might explain some of the remaining variance in diabetes rates across communities.
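After standardizing the new features and fitting a linear regression model, I evaluated it on the test set. Test-set R-squared improved slightly over the previous model, indicating the new features were incrementally helpful in explaining variance in the target variable. The held-out score is the one that matters here, because training R-squared can only go up when features are added to a linear regression.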
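To make the comparison concrete, here is a minimal sketch of that step. It assumes scikit-learn and uses synthetic data in place of the real dataset, so the feature arrays, coefficients, and the helper name `holdout_r2` are all hypothetical stand-ins rather than what my actual pipeline looks like.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins (all values are made up for illustration).
X_base = rng.normal(size=(n, 3))  # features from earlier iterations
# New features on different scales, e.g. socioeconomic status,
# household disability rate, housing/transportation type.
X_new = rng.normal(size=(n, 3)) * np.array([10.0, 5.0, 2.0])
y = (
    X_base @ np.array([1.0, 0.5, -0.5])
    - 0.08 * X_new[:, 0]
    + rng.normal(scale=1.0, size=n)
)


def holdout_r2(X, y):
    """Standardize, fit linear regression, and return test-set R-squared."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=0
    )
    # The pipeline fits the scaler on the training split only,
    # so no test-set statistics leak into the model.
    model = make_pipeline(StandardScaler(), LinearRegression())
    model.fit(X_train, y_train)
    return r2_score(y_test, model.predict(X_test))


r2_base = holdout_r2(X_base, y)
r2_ext = holdout_r2(np.hstack([X_base, X_new]), y)
print(f"baseline R^2: {r2_base:.3f}")
print(f"extended R^2: {r2_ext:.3f}")
```

Using the same `random_state` for both calls keeps the train/test split identical, so any difference in R-squared comes from the added columns rather than from a different split.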
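Checking the model coefficients provided some insights. Because the features were standardized, the weights are on a comparable scale. For example, socioeconomic status had a large negative weight, meaning that, with the other features held fixed, higher socioeconomic status was associated with a lower predicted diabetes rate, as expected based on domain knowledge. Housing type also proved somewhat informative.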
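A rough sketch of how such a coefficient check might look, again assuming scikit-learn and synthetic data. The feature names and the simulated effect sizes are hypothetical and chosen only so the output has a clear ordering.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 400

# Hypothetical feature names and made-up data on very different scales.
feature_names = ["socioeconomic_status", "disability_rate", "housing_type_score"]
X = rng.normal(size=(n, 3)) * np.array([15.0, 4.0, 1.0])
y = (
    -0.1 * X[:, 0]
    + 0.1 * X[:, 1]
    + 0.3 * X[:, 2]
    + rng.normal(scale=0.5, size=n)
)

# Standardizing first puts every coefficient in units of
# "change in target per one standard deviation of the feature".
X_std = StandardScaler().fit_transform(X)
model = LinearRegression().fit(X_std, y)

# Rank features by absolute coefficient size, largest first.
ranked = sorted(
    zip(feature_names, model.coef_), key=lambda pair: abs(pair[1]), reverse=True
)
for name, coef in ranked:
    print(f"{name:>22s}: {coef:+.3f}")
```

Without the standardization step, a feature measured in large units would get a small raw coefficient even if its effect were strong, which makes the raw weights misleading to compare.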
There are likely complex nonlinear relationships between diabetes and factors like income, transit access, and living conditions. To fully capture these, I may need to transform features, add interactions, or use more flexible algorithms. But even simple linear regression got a slight boost from the new features.
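As one illustration of the "add interactions" idea, the sketch below compares a plain linear model with one that also gets pairwise interaction terms. It assumes scikit-learn, and the data is synthetic with an interaction deliberately built in, so it shows the mechanics rather than anything about the real diabetes data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

rng = np.random.default_rng(2)
n = 600

# Made-up data: the target depends on the product of the first two features.
X = rng.normal(size=(n, 3))
y = 0.5 * X[:, 0] + 1.0 * X[:, 0] * X[:, 1] + rng.normal(scale=0.5, size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Plain linear model on standardized features.
linear = make_pipeline(StandardScaler(), LinearRegression())

# Same model, plus all pairwise products of the standardized features.
with_interactions = make_pipeline(
    StandardScaler(),
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression(),
)

r2_linear = r2_score(y_test, linear.fit(X_train, y_train).predict(X_test))
r2_inter = r2_score(y_test, with_interactions.fit(X_train, y_train).predict(X_test))
print(f"linear only:       {r2_linear:.3f}")
print(f"with interactions: {r2_inter:.3f}")
```

On real data the gain would depend on whether such interactions actually exist, and the number of product terms grows quickly with the number of features, so this is something to test on held-out data rather than assume.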
Improving the model is a gradual process that requires testing many ideas. Each iteration provides more insight into what drives the predictions and where the model still falls short. My goal now is to identify additional datasets to combine with what I already have, since more data and features could further improve the model.