In my previous post, I described how including socioeconomic factors modestly improved my model for forecasting diabetes rates. To improve it further, I set out to add relevant new predictors from external datasets. After some research, I found two promising sources: a food access and nutrition database for low-income communities, and a county health survey with lifestyle information. Together they contributed useful new features, including obesity prevalence, food availability, and exercise frequency, among others. I then integrated the datasets, taking care to match geographic identifiers and handle missing data.
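The integration step can be sketched roughly as follows. This is a minimal illustration and not the exact pipeline I ran: it assumes each table is keyed by a county FIPS code in a column named `fips`, that codes arrive as integers or strings without a decimal part, and that numeric gaps are filled with the column median. The column names, the toy values, and the median imputation are all assumptions made for the example.

```python
import pandas as pd


def normalize_fips(series: pd.Series) -> pd.Series:
    """Return county FIPS codes as zero-padded 5-character strings.

    Assumes the codes are integers or digit strings (hypothetical input format).
    """
    return series.astype(str).str.strip().str.zfill(5)


def merge_county_data(base: pd.DataFrame, *extras: pd.DataFrame) -> pd.DataFrame:
    """Left-join extra county tables onto the base table, then fill numeric gaps."""
    merged = base.copy()
    merged["fips"] = normalize_fips(merged["fips"])
    for extra in extras:
        extra = extra.copy()
        extra["fips"] = normalize_fips(extra["fips"])
        # validate= raises an error if a county appears more than once on either side.
        merged = merged.merge(extra, on="fips", how="left", validate="one_to_one")
    # Median imputation is an assumption here; other strategies may suit the data better.
    numeric_cols = merged.select_dtypes("number").columns
    merged[numeric_cols] = merged[numeric_cols].fillna(merged[numeric_cols].median())
    return merged


# Toy, made-up values purely to exercise the functions.
base = pd.DataFrame({"fips": [1001, 1003, 1005], "diabetes_rate": [11.2, 9.8, 13.1]})
food = pd.DataFrame({"fips": ["01001", "01003"], "low_access_share": [0.2, 0.4]})
survey = pd.DataFrame({"fips": ["1001", "1003", "1005"], "obesity_rate": [30.0, None, 36.0]})

merged = merge_county_data(base, food, survey)
print(merged)
```

A left join keeps every county from the base table even when an external source has no row for it, which is why the imputation step follows the merge.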


Retraining the model on the enlarged dataset produced a measurable improvement: the R-squared rose, which suggests the new features carry additional signal. Examining the model coefficients showed strong associations between the new diet- and activity-related variables and diabetes prevalence. There is still room for improvement, for example by adding data on healthcare availability, housing density, and other plausible factors. I also want to test nonlinear machine learning algorithms. Even so, adding relevant external data has already significantly improved the model’s predictive ability. Each iteration brings me closer to accurately estimating the factors that influence diabetes prevalence, and further refinement of the data and algorithms should make the model more useful for designing effective public health interventions.
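To make the before-and-after comparison concrete, here is a small sketch of how a change in R-squared can be measured when predictors are added. It uses synthetic data and plain least squares with NumPy, so the numbers it prints say nothing about my actual results. The variable names, the coefficients used to generate the data, and the 80/20 holdout split are all assumptions chosen for illustration.

```python
import numpy as np


def holdout_r2(X: np.ndarray, y: np.ndarray, train_frac: float = 0.8) -> float:
    """Fit ordinary least squares on the first rows and return R-squared on the rest."""
    n_train = int(len(y) * train_frac)
    X1 = np.column_stack([np.ones(len(y)), X])  # prepend an intercept column
    coef, *_ = np.linalg.lstsq(X1[:n_train], y[:n_train], rcond=None)
    y_test = y[n_train:]
    residuals = y_test - X1[n_train:] @ coef
    ss_res = np.sum(residuals ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)


# Synthetic data: the generating coefficients are arbitrary, not estimates.
rng = np.random.default_rng(0)
n = 500
income = rng.normal(size=n)
obesity = rng.normal(size=n)
exercise = rng.normal(size=n)
diabetes = -0.5 * income + 0.8 * obesity - 0.6 * exercise + rng.normal(scale=0.3, size=n)

# Baseline uses the socioeconomic predictor only; the full model adds lifestyle features.
r2_base = holdout_r2(income.reshape(-1, 1), diabetes)
r2_full = holdout_r2(np.column_stack([income, obesity, exercise]), diabetes)
print(f"baseline R^2: {r2_base:.3f}, with added features: {r2_full:.3f}")
```

Scoring on held-out rows instead of the training rows matters here, because in-sample R-squared never decreases when predictors are added, even if they are pure noise.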
