With reference to my previous post, where K-fold cross-validation gave me an R-squared value of 0.36, suggesting the model captures a moderate portion of the variance, I wanted to explore a couple more cross-validation techniques. So I decided to learn about Monte Carlo cross-validation (MCCV). It randomly partitions the data into training and test sets many times and averages the results, which makes it useful for assessing how robust a model is to different data splits and gives a stable estimate of model performance.
I set up my predictor variables X (%INACTIVE and %OBESE) and target variable Y (%DIABETIC) from the dataset, and configured the procedure to run for 200 iterations.
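The setup can be sketched roughly as follows with scikit-learn's `ShuffleSplit`, which implements exactly this repeated random-split scheme. Since the actual dataset isn't reproduced here, the code below uses synthetic stand-in data with the post's column names; the file path, column names, and the 80/20 split ratio are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-in for the real dataset (replace with the actual CSV load).
rng = np.random.default_rng(0)
n = 300
inactive = rng.uniform(15, 35, n)
obese = rng.uniform(25, 45, n)
diabetic = 0.1 * inactive + 0.1 * obese + rng.normal(0, 1.5, n)
df = pd.DataFrame({"%INACTIVE": inactive, "%OBESE": obese, "%DIABETIC": diabetic})

X = df[["%INACTIVE", "%OBESE"]]
y = df["%DIABETIC"]

# Monte Carlo CV: 200 independent random 80/20 train/test splits,
# scoring each split's test R-squared and averaging at the end.
mccv = ShuffleSplit(n_splits=200, test_size=0.2, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=mccv, scoring="r2")
mean_r2 = scores.mean()
print(f"Mean R-squared over 200 splits: {mean_r2:.2f}")
```

Passing a `ShuffleSplit` object as `cv` is what distinguishes this from K-fold: each of the 200 splits is drawn independently, so a given row can land in the test set many times.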
I obtained an R-squared value of 0.31 from Monte Carlo cross-validation, which indicates a moderate association between the predictor variables and the target variable: about 31% of the variation in %DIABETIC can be explained by a linear relationship with %INACTIVE and %OBESE. While 5-fold cross-validation yielded an R-squared value of 0.36, MCCV produced 0.31. The difference reflects the variation in estimated performance introduced by different data-splitting approaches and randomization.
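To make the comparison concrete, here is a minimal sketch running both schemes side by side on the same model. It uses a toy regression dataset rather than the real one, so the printed scores are illustrative only; the point is that the two `cv` objects differ in how test sets are drawn, which is why the two R-squared estimates need not agree.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, ShuffleSplit, cross_val_score

# Toy data standing in for the real dataset: 2 predictors, 1 target.
X, y = make_regression(n_samples=300, n_features=2, noise=25.0, random_state=1)
model = LinearRegression()

# 5-fold CV: each observation appears in exactly one test fold.
kf = KFold(n_splits=5, shuffle=True, random_state=1)
kf_scores = cross_val_score(model, X, y, cv=kf, scoring="r2")

# MCCV: 200 independent random splits; observations can recur across test sets.
mc = ShuffleSplit(n_splits=200, test_size=0.2, random_state=1)
mc_scores = cross_val_score(model, X, y, cv=mc, scoring="r2")

print(f"5-fold mean R^2: {kf_scores.mean():.2f}")
print(f"MCCV mean R^2:   {mc_scores.mean():.2f}")
```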
The next step I considered:
Feature Engineering: Exploring additional predictor variables or engineered features that might better capture the underlying relationships in the data.
I explored additional predictor variables such as ‘Number of Primary Care Physicians’, ‘Overall Socioeconomic Status’, and ‘Limited Access To Healthy Foods’. The next goal is to use these new predictors to improve the model’s accuracy until it reaches a satisfactory level of performance.
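One way to check whether the extra predictors actually help is to score the baseline and the expanded feature sets under the same Monte Carlo CV scheme. The sketch below does this on synthetic data (the column names follow the post, but the values and the generating relationship are invented for illustration), so only the comparison pattern carries over to the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Hypothetical data: column names from the post, values made up here.
rng = np.random.default_rng(2)
n = 300
df = pd.DataFrame({
    "%INACTIVE": rng.uniform(15, 35, n),
    "%OBESE": rng.uniform(25, 45, n),
    "Number of Primary Care Physicians": rng.uniform(20, 120, n),
    "Overall Socioeconomic Status": rng.uniform(0, 1, n),
    "Limited Access To Healthy Foods": rng.uniform(0, 0.3, n),
})
# Invented relationship so the extra feature carries some signal.
df["%DIABETIC"] = (0.1 * df["%INACTIVE"] + 0.08 * df["%OBESE"]
                   - 0.01 * df["Number of Primary Care Physicians"]
                   + rng.normal(0, 1.0, n))

base_cols = ["%INACTIVE", "%OBESE"]
full_cols = base_cols + ["Number of Primary Care Physicians",
                         "Overall Socioeconomic Status",
                         "Limited Access To Healthy Foods"]

# Same 200-split MCCV scheme for both feature sets, for a fair comparison.
cv = ShuffleSplit(n_splits=200, test_size=0.2, random_state=0)
y = df["%DIABETIC"]
base_r2 = cross_val_score(LinearRegression(), df[base_cols], y, cv=cv, scoring="r2").mean()
full_r2 = cross_val_score(LinearRegression(), df[full_cols], y, cv=cv, scoring="r2").mean()
print(f"Baseline (2 predictors)  mean R^2: {base_r2:.2f}")
print(f"Expanded (5 predictors)  mean R^2: {full_r2:.2f}")
```

Scoring both feature sets under identical splits means any gain in mean R-squared is attributable to the new predictors rather than to a luckier partition of the data.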