I tried to understand bootstrap resampling, which can be used to estimate parameters, construct confidence intervals, and quantify uncertainty without making assumptions about the data’s underlying distribution, providing robust, distribution-free statistical inference.

It’s a statistical technique used to estimate the distribution of a statistic by resampling with replacement from the observed data. In simpler terms, it’s like simulating multiple hypothetical scenarios based on the existing data to understand the range of possibilities.

I wrote Python code for bootstrap resampling, outlined below (a sketch follows the list):

  • Data Extraction:
    • The code starts by extracting the data for % INACTIVE, % OBESE, and % DIABETIC from the merged_df DataFrame. The extracted data arrays are inactive_data, obese_data, and diabetic_data, respectively.
  • Bootstrap Resampling Function (calculate_bootstrap_ci):
    • The calculate_bootstrap_ci function performs the core bootstrap resampling operation. It takes a data array as input.
    • In a loop, it generates num_samples bootstrap samples by randomly selecting data points from the input array with replacement.
    • For each bootstrap sample, it calculates the mean and stores it in the bootstrap_means list.
  • Confidence Interval Calculation:
    • After generating the bootstrap means, the function calculates the 95% confidence interval using percentiles: the lower bound is the 2.5th percentile, and the upper bound is the 97.5th percentile of the bootstrap_means list.
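A minimal sketch of this, assuming merged_df holds the columns named above and defaulting to 1,000 resamples:

import numpy as np

def calculate_bootstrap_ci(data, num_samples=1000):
    """Bootstrap the sampling distribution of the mean and return a 95% CI."""
    bootstrap_means = []
    n = len(data)
    for _ in range(num_samples):
        # Draw n points from the data with replacement
        sample = np.random.choice(data, size=n, replace=True)
        bootstrap_means.append(sample.mean())
    # Percentile-based 95% confidence interval
    return np.percentile(bootstrap_means, 2.5), np.percentile(bootstrap_means, 97.5)

# Extract the data arrays from merged_df
inactive_data = merged_df['% INACTIVE'].values
obese_data = merged_df['% OBESE'].values
diabetic_data = merged_df['% DIABETIC'].values

for name, data in [('% INACTIVE', inactive_data),
                   ('% OBESE', obese_data),
                   ('% DIABETIC', diabetic_data)]:
    lower, upper = calculate_bootstrap_ci(data)
    print(f'{name}: 95% CI = ({lower:.2f}, {upper:.2f})')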

Finally, I calculated 95% confidence intervals for % INACTIVE, % OBESE, and % DIABETIC:

% INACTIVE: the mean percentage of the population that is inactive is estimated to lie between 14.61% and 14.94%. This suggests that a significant portion of the population is not engaging in physical activity.

% OBESE: the mean percentage of the population that is obese is estimated to lie between 18.14% and 18.35%. This indicates a relatively high prevalence of obesity in the population.

% DIABETIC: the mean percentage of the population with diabetes is estimated to lie between 7.04% and 7.19%. This gives an estimate of the proportion of people affected by diabetes in the dataset.

With reference to my previous post, where K-fold cross-validation gave me an R-squared value of 0.36 (suggesting that the model captures a significant portion of the variance), I wanted to explore a couple more cross-validation techniques. So I decided to learn about Monte Carlo cross-validation (MCCV). It randomly partitions the data into training and test sets multiple times and averages the results, which is useful for assessing a model’s robustness to different data splits and provides a robust estimate of model performance.

I set up the predictor variables X (% INACTIVE and % OBESE) and the target variable y (% DIABETIC) based on the dataset, and set the code to run for 200 iterations, as sketched below.
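A sketch of this using scikit-learn’s ShuffleSplit, which performs Monte Carlo-style random splits; the 80/20 split and the random seed here are my assumptions:

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Predictor and target variables as described above
X = merged_df[['% INACTIVE', '% OBESE']]
y = merged_df['% DIABETIC']

# Monte Carlo cross-validation: 200 random train/test splits
mccv = ShuffleSplit(n_splits=200, test_size=0.2, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=mccv, scoring='r2')
print(f'Average R-squared over 200 splits: {scores.mean():.2f}')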

I obtained an R-squared value of 0.31 from Monte Carlo cross-validation, which indicates a moderate level of association between the predictor variables and the target variable: about 31% of the variation in % DIABETIC can be explained by its linear relationship with % INACTIVE and % OBESE. While 5-fold cross-validation yielded an R-squared value of 0.36, MCCV produced 0.31. These differences reflect the variation in model performance introduced by different data-splitting approaches and randomization.

The next step I considered:

Feature Engineering: Exploring additional predictor variables or engineered features that might better capture the underlying relationships in the data.

I explored predictor variables like ‘Number of Primary Care Physicians’, ‘Overall Socioeconomic Status’, and ‘Limited Access To Healthy Foods’. The next goal is to improve the model’s accuracy with these new predictor variables until it reaches a satisfactory level of performance.

Cross-validation is a crucial technique for assessing and improving the performance of machine learning models and is the next natural step for further analysis of the model.

I worked on setting up Python code for K-fold cross-validation, which involved splitting the dataset into K subsets (folds) to perform training and testing multiple times. In each iteration of the cross-validation loop, a different subset of the data is used as the test set, while the remaining data is used for training.

The code calculates the R-squared value for each fold of the cross-validation. The R-squared value measures how well the model’s predictions match the actual log-transformed % DIABETIC values in the test set. It ranges up to 1, with higher values indicating a better fit of the model to the data (on held-out data it can even be negative if the model predicts worse than the mean).

Finally, the code calculates the average R-squared value across all the folds of cross-validation. This average R-squared value gives you an estimate of how well the quadratic regression model is performing on average across different subsets of the data.

So, in summary, the “model” in the code is a quadratic regression model that is trained and evaluated using K-fold cross-validation to assess its performance in predicting log-transformed % DIABETIC values based on the predictor variables % INACTIVE and % OBESE.
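A sketch of this procedure, assuming a scikit-learn pipeline for the quadratic model on the log-transformed target (the shuffling and seed are my choices):

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

X = merged_df[['% INACTIVE', '% OBESE']].values
y_log = np.log(merged_df['% DIABETIC'].values)  # log-transformed target

model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
kf = KFold(n_splits=5, shuffle=True, random_state=42)

r2_scores = []
for train_idx, test_idx in kf.split(X):
    # Train on K-1 folds, evaluate on the held-out fold
    model.fit(X[train_idx], y_log[train_idx])
    r2_scores.append(r2_score(y_log[test_idx], model.predict(X[test_idx])))

print(f'Average R-squared across folds: {np.mean(r2_scores):.2f}')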


I obtained an average R-squared value of 0.36 from the 5-fold cross-validation.

While an R-squared value of 0.36 suggests that the model is capturing a significant portion of the variance, it also indicates that there is a substantial amount of unexplained variance in the data. This implies that factors other than % INACTIVE and % OBESE may contribute to the variation in % DIABETIC.

Based on this result, I will consider further model refinement or explore other predictors to improve the model’s explanatory power. I will also assess the residuals and other model diagnostics to ensure that the model assumptions are met and to identify potential areas for improvement.

With reference to my previous findings, the substantial rise in the R-squared value from 0.35 (the simple linear model) to 0.43 (the quadratic model on log-transformed data) indicates a noteworthy improvement in how well the quadratic model fits the data compared to the straightforward linear model.

I plotted scatterplots to visually explore and understand the relationship between the original predictor variables (% INACTIVE and % OBESE) and the log-transformed target variable (% DIABETIC) in the dataset. 


The scatter points represent individual data points, with one axis showing the values of the predictor variable (% INACTIVE or % OBESE) and the other axis showing the log-transformed values of the target variable (% DIABETIC). Each point corresponds to a data observation.

The red and green lines overlaid on the scatter plots represent the predictions made by a quadratic regression model. These lines indicate how well the model fits the data. If the model fits well, the red and green lines should follow the data points’ general trend.

By having two separate graphs (one for each predictor variable), we can compare the relationships between the predictors and the log-transformed target variable and visually assess which predictor appears to have a stronger relationship with the target variable.
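A sketch of how such plots might be produced; the per-predictor quadratic fit line here is a single-variable simplification of the full model, with colors simply mirroring the lines described above:

import matplotlib.pyplot as plt
import numpy as np

y_log = np.log(merged_df['% DIABETIC'])

fig, axes = plt.subplots(1, 2, figsize=(12, 5))
for ax, col, color in [(axes[0], '% INACTIVE', 'red'),
                       (axes[1], '% OBESE', 'green')]:
    x = merged_df[col]
    ax.scatter(x, y_log, alpha=0.3)
    # Overlay a single-variable quadratic fit as the trend line
    coeffs = np.polyfit(x, y_log, deg=2)
    xs = np.linspace(x.min(), x.max(), 200)
    ax.plot(xs, np.polyval(coeffs, xs), color=color, label='quadratic fit')
    ax.set_xlabel(col)
    ax.set_ylabel('log(% DIABETIC)')
    ax.legend()
plt.show()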

While the R-squared value has improved, it’s essential to evaluate the model further. I will consider using techniques such as cross-validation to assess the model’s performance on unseen data. I will also try to compare different polynomial degrees (e.g., cubic or higher) to see if a higher-degree polynomial provides a better fit.

Today in class:

t-test – a statistical analysis technique that compares the means of two groups to ascertain whether the discrepancy between them reflects a real effect or is merely the result of chance. It’s a tool that aids in determining whether a difference between two sets of data is meaningful and statistically significant.

Analyzed data that lists pairs (post-molt, pre-molt), where “post-molt” is the size of a crab’s shell after molting and “pre-molt” is the size of the shell before molting.
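A sketch of the paired t-test on such data, assuming a hypothetical DataFrame crab_df with 'post-molt' and 'pre-molt' columns:

from scipy import stats

# Hypothetical column names for the crab molt data
post_molt = crab_df['post-molt']
pre_molt = crab_df['pre-molt']

# Paired t-test: do mean shell sizes differ before and after molting?
t_stat, p_value = stats.ttest_rel(post_molt, pre_molt)
print(f't = {t_stat:.2f}, p = {p_value:.4g}')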

Since there are situations where the assumptions of a t-test may not hold, particularly when dealing with highly non-normally distributed data or other violations of t-test assumptions (such as the equal-variance assumption), we can sometimes turn to Monte Carlo procedures to obtain a more robust estimate of the p-value. Monte Carlo procedures offer a powerful alternative for estimating p-values.
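For paired data like this, one common Monte Carlo procedure is a permutation test that randomly flips the sign of each within-pair difference; a sketch, reusing the hypothetical columns above:

import numpy as np

def monte_carlo_p_value(diffs, num_iterations=10000, seed=0):
    """Estimate a two-sided p-value by randomly flipping the sign of each
    within-pair difference and counting how often the permuted mean
    difference is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(diffs.mean())
    count = 0
    for _ in range(num_iterations):
        signs = rng.choice([-1, 1], size=len(diffs))
        if abs((signs * diffs).mean()) >= observed:
            count += 1
    return count / num_iterations

diffs = (post_molt - pre_molt).values
print(f'Monte Carlo p-value: {monte_carlo_p_value(diffs):.4f}')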

Following up on my previous post

Since the log transformation of the dependent variable (% DIABETIC) did not produce any significant rise in the R-squared value, I tried applying a quadratic (polynomial) regression model to the log-transformed data (a sketch follows the list):

  • Defined the feature matrix X with the predictor variables % INACTIVE and % OBESE. The log-transformed target variable is stored in y.
  • Created a ‘PolynomialFeatures’ object with a degree of 2, indicating that we want to generate quadratic (second-degree) polynomial features.
  • Used the fit_transform method to transform the original features in X to include quadratic terms, creating a new feature matrix X_poly with the original features and their quadratic combinations.
  • Created a linear regression model (LinearRegression) and fit it to the transformed feature matrix X_poly.
  • Calculated the R-squared value.
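A sketch of these steps:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

X = merged_df[['% INACTIVE', '% OBESE']]
y = np.log(merged_df['% DIABETIC'])  # log-transformed target

poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)  # adds squared and interaction terms

model = LinearRegression().fit(X_poly, y)
print(f'R-squared: {model.score(X_poly, y):.2f}')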

R-squared value = 0.43

This improvement in R-squared indicates that the quadratic model is explaining more of the variance in the data and is likely capturing the underlying relationships more accurately.

Today in class: Linear Regression: The analysis used linear regression with two predictor variables, “% OBESE” and “% INACTIVE,” to predict or explain changes in the dependent variable (% DIABETIC).

The model’s R-squared value was 0.34. This value indicates the strength of the linear relationship between the predictors and the dependent variable, suggesting that the predictors account for about 34% of the variation in the outcome.

Quadratic Model: We can expand the linear regression model by including quadratic terms for % INACTIVE and % OBESE. A quadratic model allows a nonlinear relationship between the predictors and the dependent variable:

%DIABETIC = β₀ + β₁·(%INACTIVE) + β₂·(%OBESE) + β₃·(%INACTIVE)² + β₄·(%OBESE)² + ϵ

Overfitting: In the context of statistical modelling and machine learning, the term “overfitting” describes a scenario in which a model is overly complex and fits the training data too closely.

Overfit models can lead to poor generalization, where the model’s performance degrades when it encounters data it hasn’t seen during training.

As discussed in class, I went back and tried the quadratic model using Python:

import pandas as pd
import statsmodels.api as sm

# Add squared terms for the quadratic model
merged_df['INACTIVE_squared'] = merged_df['% INACTIVE'] ** 2
merged_df['OBESE_squared'] = merged_df['% OBESE'] ** 2

X = merged_df[['% INACTIVE', '% OBESE', 'INACTIVE_squared', 'OBESE_squared']]
X = sm.add_constant(X)  # Add a constant term (intercept) to the model
y = merged_df['% DIABETIC']

model = sm.OLS(y, X).fit()
print(model.summary())

This gave me an R-squared value of 0.384.

Following up on my previous post: I tried a log transformation of the dependent variable (% DIABETIC), which gave me an R-squared value of 0.353. The increase in the R-squared value is very minimal, which means that the log transformation did not lead to any dramatic improvement. The primary reason for log-transforming the dependent variable was to address heteroscedasticity and meet the assumptions of linear regression.
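A sketch of that refit, assuming the log transform was applied to the plain linear model:

import numpy as np
import statsmodels.api as sm

# Refit the linear model with a log-transformed dependent variable
X_lin = sm.add_constant(merged_df[['% INACTIVE', '% OBESE']])
y_log = np.log(merged_df['% DIABETIC'])
log_model = sm.OLS(y_log, X_lin).fit()
print(f'R-squared: {log_model.rsquared:.3f}')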

Moving ahead, I will keep looking for better transformations and models.

According to the findings of my multiple linear regression, which I discussed in my previous post, the percentage of inactivity (% INACTIVE) and the percentage of obesity (% OBESE) are both statistically significant predictors of the proportion of people who have diabetes (% DIABETIC). The R-squared value of the model is 0.341, indicating that the model’s predictors can account for approximately 34.1% of the variance in the percentage of diabetics. The overall model is statistically significant, as indicated by the significant F-statistic.


I also ran the Breusch-Pagan test, which detects heteroscedasticity, and the p-value was extremely low (around zero), suggesting that the model contains evidence of heteroscedasticity. Heteroscedasticity, where the variance of the residuals changes with the values of the predictors, can affect the reliability of regression estimates.

It is essential to address this since heteroscedasticity has been revealed. I will attempt to stabilize the variance by transforming the dependent variable (% DIABETIC) with a suitable transformation (e.g., a log transformation), explore other predictor variables like socioeconomic status, household composition, and disability, and then re-evaluate the model to determine its explanatory and predictive capacity.

Topics covered in the lecture:

  1. p-value: p-values are significant in statistics. Here is the key concept: when we assume that there is no actual difference or relationship (the “null hypothesis”), a p-value helps us determine how likely we would be to observe outcomes similar to ours by chance. In conclusion, p-values are crucial for assessing whether the data includes strong evidence against the null hypothesis. A high p-value denotes inadequate evidence to reject the null hypothesis and suggests chance rather than a significant effect, while a low p-value (usually below 0.05) indicates strong evidence against the null hypothesis.
  2. Breusch-Pagan test: To find heteroscedasticity, the Breusch-Pagan test comes to our aid. It evaluates whether the variance of the residuals in our regression model is related to one or more independent variables. A significant outcome from this test raises the possibility that our data may be heteroscedastic, which calls for further investigation.

CDC Dataset:

1. Multiple Regression:

Conducted a multiple regression analysis where the target variable was the percentage of diabetes, with the predictor variables being the percentages of inactivity and obesity.

Created several plots to visualize the results using the seaborn library:

A residual plot to help check for the assumption of constant variance and identify any patterns or outliers in the residuals.

Regression plots to visualize the relationships between individual predictor variables and the target variable while holding the other variables constant.

Code snippet:

# Regression plots
import matplotlib.pyplot as plt
import seaborn as sns

# y is the target variable defined earlier (e.g., y = merged_df['% DIABETIC'])
sns.regplot(x=merged_df['% DIABETIC'], y=y, scatter_kws={'alpha': 0.3}, color='green', label='%Diabetic')
sns.regplot(x=merged_df['% INACTIVE'], y=y, scatter_kws={'alpha': 0.3}, color='blue', label='inactivity')
sns.regplot(x=merged_df['% OBESE'], y=y, scatter_kws={'alpha': 0.3}, color='red', label='obesity')
plt.xlabel('Predictor Variables')
plt.ylabel('Target Variable')
plt.title('Regression Plots')
plt.legend()
plt.show()

# Correlation matrix of the target and predictor variables
corr_matrix = merged_df[['% DIABETIC', '% INACTIVE', '% OBESE']].corr()

# Heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

Heatmap to visualize the correlation matrix between predictor variables

2. Breusch-Pagan test: I performed the Breusch-Pagan test on the multiple regression using Python.
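A sketch using statsmodels’ het_breuschpagan, refitting the multiple regression first:

import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Fit the multiple regression (% DIABETIC on % INACTIVE and % OBESE)
X = sm.add_constant(merged_df[['% INACTIVE', '% OBESE']])
y = merged_df['% DIABETIC']
model = sm.OLS(y, X).fit()

# Test the residuals for heteroscedasticity against the model's regressors
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f'LM statistic = {lm_stat:.2f}, p-value = {lm_pvalue:.4g}')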

To tackle the CDC 2018 diabetes dataset comprehensively, I plan to consider various factors and approaches, including additional variables like socioeconomic status, household composition, and disability. Here’s how I initially approached this dataset:

 I began by getting to know the dataset inside out. This involved reviewing its structure, understanding column names, and identifying data types. During this process, several key questions emerged:

What is the overall prevalence of diabetes in different U.S. counties, and is there geographic variation?

Are there any correlations between obesity rates and diabetes prevalence in different counties? Does a higher obesity rate correlate with a higher diabetes rate?

How does the level of physical inactivity relate to diabetes and obesity rates? Are areas with higher rates of inactivity more likely to have higher diabetes and obesity rates?

Are there regional patterns in diabetes, obesity, and inactivity rates? Are certain areas of the country more affected than others?

Can we build predictive models to forecast diabetes rates based on obesity and inactivity levels? How accurate are these models?

To gain deeper insights into the dataset, I conducted data exploration. This involved generating summary statistics and creating visualizations, including histograms, box plots, and scatterplots. These techniques helped me understand the distributions of variables such as %diabetes, %obesity, and %inactivity.

 Data visualization played a crucial role in understanding relationships and patterns. I used scatterplots to visualize how %diabetes relates to %obesity and %inactivity. Heatmaps were employed to identify correlations between variables.

Moving forward, I tried linear regression analysis. The goal here was to comprehend how %diabetes is influenced by %obesity and %inactivity.

Additionally, I plan on examining the residuals of the regression models to check for important statistical assumptions:

Normality: Are the residuals normally distributed?

Heteroscedasticity: Does the spread of residuals change as %inactivity levels change?

Outliers: Are there any outlier data points that significantly affect the model?

If the assumption of normality or constant variance is violated, or influential outliers are present, I will consider applying appropriate transformations to the data to improve the reliability of the analysis.
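A sketch of how these diagnostic checks might look, assuming a fitted statsmodels OLS results object named model:

import matplotlib.pyplot as plt
import scipy.stats as stats
import statsmodels.api as sm

residuals = model.resid

# Normality: Q-Q plot and Shapiro-Wilk test
sm.qqplot(residuals, line='45', fit=True)
plt.title('Q-Q Plot of Residuals')
plt.show()
print('Shapiro-Wilk:', stats.shapiro(residuals))

# Heteroscedasticity: residuals vs. fitted values
plt.scatter(model.fittedvalues, residuals, alpha=0.3)
plt.axhline(0, color='red')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Residuals vs. Fitted Values')
plt.show()

# Outliers: flag standardized residuals beyond |3|
std_resid = model.get_influence().resid_studentized_internal
print('Potential outliers:', (abs(std_resid) > 3).sum())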