I set out to understand bootstrap resampling, a way to estimate parameters, construct confidence intervals, and quantify uncertainty without assuming a particular underlying distribution for the data.
It's a statistical technique that estimates the sampling distribution of a statistic by resampling with replacement from the observed data. In simpler terms, it's like simulating many hypothetical datasets from the existing data to see how much the statistic could vary.
I wrote Python code for bootstrap resampling. It works as follows:
- The code starts by extracting the data for % INACTIVE, % OBESE, and % DIABETIC from the merged_df DataFrame. The extracted arrays are named inactive_data, obese_data, and diabetic_data, respectively.
- Bootstrap Resampling Function (calculate_bootstrap_ci):
  - The calculate_bootstrap_ci function performs the core bootstrap resampling operation. It takes a data array as input.
  - In a loop, it generates num_samples bootstrap samples by randomly selecting data points from the input data array with replacement.
  - For each bootstrap sample, it calculates the mean and stores it in the bootstrap_means list.
- Confidence Interval Calculation:
  - After generating the bootstrap means, the function calculates the 95% confidence interval using percentiles. The lower bound is the 2.5th percentile, and the upper bound is the 97.5th percentile of the bootstrap_means list.
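The steps above can be sketched roughly as follows. This is a minimal sketch, not the exact code from this analysis: it assumes NumPy, and the default num_samples=1000, the confidence and seed parameters, and the synthetic stand-in arrays are assumptions added for illustration. In the real analysis the arrays come from merged_df.

```python
import numpy as np


def calculate_bootstrap_ci(data, num_samples=1000, confidence=0.95, seed=None):
    """Percentile bootstrap confidence interval for the mean of `data`.

    num_samples, confidence, and seed are illustrative defaults
    (assumptions, not values taken from the original code).
    """
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)

    bootstrap_means = []
    for _ in range(num_samples):
        # Draw len(data) points from the data, with replacement.
        sample = rng.choice(data, size=len(data), replace=True)
        bootstrap_means.append(sample.mean())

    # For a 95% interval: the 2.5th and 97.5th percentiles of the means.
    alpha = (1.0 - confidence) / 2.0
    lower = np.percentile(bootstrap_means, 100 * alpha)
    upper = np.percentile(bootstrap_means, 100 * (1 - alpha))
    return lower, upper


# Synthetic stand-in arrays, NOT the real data. In the analysis they would
# be extracted from the DataFrame, e.g.:
#   inactive_data = merged_df["% INACTIVE"].to_numpy()
demo_rng = np.random.default_rng(0)
inactive_data = demo_rng.normal(15.0, 2.0, size=300)
obese_data = demo_rng.normal(18.0, 1.5, size=300)
diabetic_data = demo_rng.normal(7.0, 1.0, size=300)

for name, values in [
    ("% INACTIVE", inactive_data),
    ("% OBESE", obese_data),
    ("% DIABETIC", diabetic_data),
]:
    lower, upper = calculate_bootstrap_ci(values, seed=42)
    print(f"{name}: 95% CI for the mean = [{lower:.2f}, {upper:.2f}]")
```

Passing a seed makes the intervals reproducible from run to run; without one, the bounds shift slightly each time because the resamples differ.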
Finally, I calculated 95% confidence intervals for the mean of % INACTIVE, % OBESE, and % DIABETIC. Because the function bootstraps the mean, each interval is a range for the average value across the records in merged_df, not the range of individual values.
% INACTIVE: The mean lies between 14.61% and 14.94%. On average, roughly 15% of people are physically inactive.
% OBESE: The mean lies between 18.14% and 18.35%. On average, roughly 18% of people are obese.
% DIABETIC: The mean lies between 7.04% and 7.19%. On average, roughly 7% of people in the dataset have diabetes.