Boston Analysis – Crime Incident Reports, July 2012 to August 2015

Dataset Overview

The dataset comprises various columns, each providing specific details about crime incidents. The key columns we’ll focus on are:

  • NatureCode
  • Shooting
  • Year
  • Month
  • X
  • Y
  • Location

Let’s start our exploration!

Exploratory Data Analysis (EDA)

  1. Incident Types and Crime Codes:
    • What are the most common incident types?
    • Which crime codes are prevalent?
  2. Distribution of Crime by District:
    • How does the number of crimes vary across different districts?
    • Are there districts with higher or lower crime rates?
  3. Time-based Analysis:
    • How has the overall crime rate changed over the years?
    • Is there a monthly or weekly pattern in crime incidents?

Insights from EDA

  1. Incident Types and Crime Codes:
    • The dataset provides a diverse range of incident types, from thefts to assaults.
    • Certain crime codes might be more common, indicating specific types of criminal activity.
  2. Distribution of Crime by District:
    • Some districts may experience higher crime rates than others.
    • Understanding the variations can help allocate resources effectively.
  3. Time-based Analysis:
    • Over the years, there might be trends or fluctuations in crime rates.
    • Analyzing monthly and weekly patterns can reveal when certain types of crimes are more likely to occur.

Questions for Further Exploration

  1. Spatial Analysis:
    • Are there specific locations or streets with consistently high crime rates?
    • How do crime rates vary between main streets and cross streets?
  2. Weapon Involvement:
    • What types of weapons are most commonly involved in crimes?
    • Is there a correlation between weapon type and the severity of incidents?
  3. Domestic Incidents:
    • How prevalent are domestic incidents, and do they follow any patterns?
    • Are there specific shifts or days of the week when domestic incidents are more likely?

Having established the presence of significant age disparities, I proceeded with post-hoc analyses to pinpoint specific differences between racial groups. Employing Tukey’s Honestly Significant Difference (HSD) test, I identified the pairs of races with the most notable age differences. This step was vital for a nuanced understanding of the dynamics at play within the dataset.

The outcomes of this analysis offer crucial insights into the complexities of racial disparities in police shootings. Acknowledging the presence of age differences among various racial groups is pivotal for shaping informed discussions and policy interventions.

When I conducted the ANOVA test on the dataset, it helped me determine whether there were any significant differences among the means of the groups. ANOVA provided a broad understanding, indicating that there were differences in at least one pair of groups’ means.

To delve deeper and identify the specific groups with different means, I employed Tukey’s Honestly Significant Difference (HSD) test. Unlike ANOVA, Tukey’s HSD is a post hoc test tailored to be used after ANOVA. It enabled me to pinpoint the exact groups that were driving the significant difference revealed by ANOVA.

In simpler terms, ANOVA acted as a preliminary indicator, suggesting the presence of differences, while Tukey’s HSD stepped in to provide detailed insights. It answered the crucial question: which particular groups within the dataset exhibited distinct means?
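The pairwise comparison described above can be sketched with statsmodels' `pairwise_tukeyhsd`. This is a minimal illustration on synthetic stand-in data; the column names `age` and `race` are assumptions, not the dataset's actual schema, so the output will not match the real results.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.multicomp import pairwise_tukeyhsd

# Synthetic stand-in: three racial groups with different mean ages.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "age": np.concatenate([rng.normal(40, 12, 200),
                           rng.normal(33, 11, 200),
                           rng.normal(36, 12, 200)]),
    "race": ["White"] * 200 + ["Black"] * 200 + ["Hispanic"] * 200,
})

# Tukey's HSD compares every pair of group means while controlling
# the family-wise error rate across all comparisons.
tukey = pairwise_tukeyhsd(endog=df["age"], groups=df["race"], alpha=0.05)
print(tukey.summary())
```

The summary table lists each pair of groups, the mean difference, and whether the null hypothesis is rejected for that pair.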


Analysis of Variance (ANOVA): 

Moving beyond basic descriptions, I conducted an Analysis of Variance (ANOVA) test to investigate potential age differences among racial groups. This powerful statistical tool allowed me to compare means across multiple groups simultaneously.

The results of the ANOVA test indicated that there is a significant difference in ages among racial groups. The F-statistic of 103.50 and a very low p-value (approximately 6.13e-44) suggest that the differences in ages between these groups are unlikely to have occurred by random chance.

In other words, there is evidence to reject the null hypothesis, which means there are statistically significant differences in ages among the racial groups tested. This finding is important for further analyses and discussions regarding disparities or variations in ages across racial categories.
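A one-way ANOVA of this kind can be run with SciPy's `f_oneway`. The sketch below uses synthetic age samples, so its F-statistic and p-value will not match the 103.50 and 6.13e-44 reported above.

```python
import numpy as np
from scipy.stats import f_oneway

# Synthetic age samples for three racial groups.
rng = np.random.default_rng(1)
white = rng.normal(40, 12, 300)
black = rng.normal(33, 11, 300)
hispanic = rng.normal(34, 10, 300)

# One-way ANOVA: is at least one group mean different from the others?
f_stat, p_value = f_oneway(white, black, hispanic)
print(f"F = {f_stat:.2f}, p = {p_value:.2e}")
```

A small p-value here only says that *some* pair of means differs; the Tukey HSD step is what identifies which pairs.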


Having established the presence of significant age disparities, I will proceed with post-hoc analyses to pinpoint specific differences between racial groups.

The Mean Square Difference (MSD) I calculated from the probabilities of age and race in the dataset provided a quantitative measure of disparity in the experiences of Black and White individuals with respect to police shootings.

I performed a Monte Carlo simulation to test the mean square difference (MSD) calculated from our dataset against random samples. First, I defined the MSD value that we calculated from our data. Then, I conducted 10,000 simulations where I randomly shuffled the age and race data. For each simulation, I calculated the probabilities for Black and White races similar to how we did with our original data.

In each simulation, I computed the ratio of probabilities for Black and White races and then calculated the MSD using this ratio. This process allowed me to create a distribution of MSD values based on random chance.

After running the simulations, I compared the MSD value we calculated from our dataset with the distribution of MSD values from the random samples. The comparison was done by computing a p-value, which represents the probability of observing an MSD value as extreme as ours, or even more extreme, under the null hypothesis that there is no difference between races.
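The shuffling procedure above can be sketched as a permutation test. The MSD definition used below (mean squared difference between the binned age distributions of the two groups) is one plausible reading of the description, and the data are synthetic stand-ins, so the numbers are illustrative only.

```python
import numpy as np

# Synthetic stand-ins for the age and race columns.
rng = np.random.default_rng(2)
ages = np.concatenate([rng.normal(33, 11, 500), rng.normal(40, 12, 500)])
races = np.array(["Black"] * 500 + ["White"] * 500)
bins = np.arange(0, 101, 10)

def msd(ages, races):
    # Binned age probability densities for each race, then their
    # mean squared difference.
    p_black, _ = np.histogram(ages[races == "Black"], bins=bins, density=True)
    p_white, _ = np.histogram(ages[races == "White"], bins=bins, density=True)
    return np.mean((p_black - p_white) ** 2)

observed = msd(ages, races)

# Null distribution: shuffle the race labels 10,000 times and recompute.
null = np.empty(10_000)
for i in range(10_000):
    null[i] = msd(ages, rng.permutation(races))

# p-value: how often random shuffling produces an MSD at least as extreme.
p_value = np.mean(null >= observed)
print(observed, p_value)
```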

Descriptive Statistics and Visualizations
Histograms and Summary Statistics
I generated histograms for ages within each race category. These visualizations provided a clear overview of the age distribution for different races. Alongside the visualizations, I calculated essential summary statistics, including mean, median, standard deviation, variance, skewness, and kurtosis. These metrics offer insights into the central tendency and the spread of age within each racial group.
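The per-race summary statistics can be computed in one pass with pandas and SciPy. This sketch assumes a DataFrame with `age` and `race` columns and uses synthetic data; the histogram call is left as a comment.

```python
import numpy as np
import pandas as pd
from scipy.stats import kurtosis, skew

# Synthetic stand-in with assumed column names 'age' and 'race'.
rng = np.random.default_rng(3)
df = pd.DataFrame({
    "age": rng.normal(37, 13, 1000).clip(6, 91),
    "race": rng.choice(["White", "Black", "Hispanic"], size=1000),
})

# Central tendency and spread of age within each racial group.
summary = df.groupby("race")["age"].agg(
    mean="mean", median="median", std="std", var="var",
    skewness=skew,   # asymmetry of the distribution
    kurt=kurtosis,   # SciPy reports excess kurtosis (normal = 0)
)
print(summary)

# Histograms, one panel per race:
# df["age"].hist(by=df["race"], bins=20)
```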

Variance Analysis

Variance analysis helps us understand the extent of age variation within each race. By comparing variances, we identify which racial group has the most diverse age range among individuals involved in police shootings.

Confidence Intervals for the Mean Age

Confidence intervals provide a range in which the true mean age for each race is likely to fall. We compute 95%, 99%, and 99.9% confidence intervals, offering a more comprehensive understanding of the uncertainty associated with the mean age within each racial category.
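Under the usual t-based approach, the three intervals can be computed as below for a single group; the ages here are synthetic stand-ins.

```python
import numpy as np
from scipy import stats

# Synthetic ages for one racial group.
rng = np.random.default_rng(4)
ages = rng.normal(37, 13, 400)

mean = ages.mean()
sem = stats.sem(ages)  # standard error of the mean

# t-based confidence intervals at three confidence levels.
for level in (0.95, 0.99, 0.999):
    lo, hi = stats.t.interval(level, len(ages) - 1, loc=mean, scale=sem)
    print(f"{level:.1%} CI: ({lo:.2f}, {hi:.2f})")
```

Higher confidence levels produce wider intervals, which is why reporting 95%, 99%, and 99.9% together conveys the uncertainty more fully.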

Bayesian Probability Analysis

Using a Bayesian approach, we calculate the probability of an individual’s race given their age. This analysis provides a unique perspective on how race and age are correlated in the context of police shootings.
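A minimal empirical sketch of this: estimate the prior P(race), the likelihood P(age band | race), and the evidence P(age band) from the data, then apply Bayes' rule. The data and the 10-year age band below are illustrative assumptions.

```python
import numpy as np

# Synthetic ages and races; the age band is an illustrative choice.
rng = np.random.default_rng(5)
ages = np.concatenate([rng.normal(40, 12, 600), rng.normal(33, 11, 400)])
races = np.array(["White"] * 600 + ["Black"] * 400)

def p_race_given_age(target_race, age_lo, age_hi):
    """Bayes' rule: P(race | age band) = P(band | race) P(race) / P(band)."""
    in_band = (ages >= age_lo) & (ages < age_hi)
    prior = np.mean(races == target_race)                # P(race)
    likelihood = np.mean(in_band[races == target_race])  # P(band | race)
    evidence = np.mean(in_band)                          # P(band)
    return likelihood * prior / evidence

print(p_race_given_age("Black", 20, 30))
```

The posteriors over all races in a given age band sum to 1, which is a useful sanity check on the computation.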


Analyzing police shootings data through statistical techniques sheds light on the intricate relationship between race, age, and law enforcement encounters. By employing various statistical methods, I gained valuable insights into the demographics of individuals involved in these incidents. Understanding these patterns is a significant step toward fostering informed discussions and implementing policies aimed at addressing the complex issues surrounding police shootings.

I worked on two different clustering algorithms: K-means clustering and DBSCAN. The purpose of this analysis was to group similar data points together based on their geographical coordinates (latitude and longitude) using different clustering algorithms. The choice of clustering algorithms and the number of clusters (in this case, 3 clusters) might be determined based on the nature of the data and the specific goals of the analysis. The visualization helps in understanding how well the algorithms have grouped the data points into distinct clusters.

K-means Clustering:

  • K-means grouped the data into 3 clusters. Each data point was assigned to one of these clusters.
  • K-means assumes that clusters are spherical and equally sized, which might not be the case for geographical data. Therefore, the effectiveness of K-means in this context depends on the distribution of your data.

DBSCAN Clustering:

  • DBSCAN identified clusters based on dense regions of data points. Outliers or sparse points that don’t belong to any dense cluster were labeled as -1.
  • DBSCAN doesn’t require specifying the number of clusters beforehand, making it more flexible, especially when the clusters have different shapes and densities.
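Both algorithms are available in scikit-learn. A minimal sketch on synthetic latitude/longitude blobs (the city centers are arbitrary illustration points, not the dataset's actual coordinates):

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Synthetic (latitude, longitude) points in three loose blobs.
rng = np.random.default_rng(6)
coords = np.vstack([
    rng.normal([42.36, -71.06], 0.05, (100, 2)),   # around Boston
    rng.normal([40.71, -74.01], 0.05, (100, 2)),   # around New York
    rng.normal([34.05, -118.24], 0.05, (100, 2)),  # around Los Angeles
])

# K-means with a fixed number of clusters.
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

# DBSCAN finds dense regions; label -1 marks noise/outliers.
# Note: eps is in degrees here; a projected CRS or haversine metric
# is preferable for real geographic data.
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(coords)
print(set(kmeans_labels), set(dbscan_labels))
```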

Since the data is not normally distributed, I decided to perform a Monte Carlo test.

I started by preparing the data, separating the ages of black and white individuals killed by the police into two distinct arrays, ensuring any missing values were removed.

Next, I calculated the observed mean age difference between white and black individuals from the original data.

I then set up a Monte Carlo simulation, specifying 10000 simulations to be performed. For each simulation, I randomly sampled ages with replacement from the combined dataset of black and white individuals. In each sample, I calculated the mean age difference between the sampled white and black individuals. These calculated mean differences were stored in a list called mean_diffs.

After running the simulations, I measured the time taken for the entire process. I also filtered out any NaN values from the list of mean differences to ensure accurate analysis.

For a visual representation, I plotted a histogram of the mean differences obtained from the Monte Carlo simulations. To provide context, I included a vertical dashed line indicating the observed mean difference from the original data.

To draw meaningful conclusions, I calculated and printed the mean age difference for white and black individuals based on the original data. Additionally, I determined the number of simulated samples with a mean difference greater than the observed difference. This analysis helps in understanding whether the observed difference is statistically significant or if it could have occurred by chance.
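The simulation loop described above can be sketched as follows, with synthetic stand-ins for the two age arrays, so the observed difference here will not equal the 7.41 years obtained on the real data.

```python
import numpy as np

# Synthetic stand-ins for the two age arrays (NaNs already dropped).
rng = np.random.default_rng(7)
white_ages = rng.normal(40, 13, 2500)
black_ages = rng.normal(33, 11, 1300)

observed_diff = white_ages.mean() - black_ages.mean()

combined = np.concatenate([white_ages, black_ages])
n_white, n_sims = len(white_ages), 10_000
mean_diffs = np.empty(n_sims)
for i in range(n_sims):
    # Resample with replacement from the pooled ages, so that the
    # group labels carry no information under the null hypothesis.
    sample = rng.choice(combined, size=len(combined), replace=True)
    mean_diffs[i] = sample[:n_white].mean() - sample[n_white:].mean()

# p-value: fraction of simulated differences at least as large as observed.
p_value = np.mean(mean_diffs >= observed_diff)
print(observed_diff, p_value)
```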

The fact that none of the random samples had a mean age difference greater than 7.41 years (in 10,000 samples) strongly suggests that the observed difference is not due to random chance. This is reflected in the very low p-value (essentially zero), indicating a statistically significant result.

I also computed Cohen’s d, because it provides a standardized measure of effect size and helps interpret the practical significance of the differences between groups. It gave me a value of 0.589321.

Interpreting this effect size (a medium effect by conventional benchmarks), the average age of white people killed by police differs meaningfully from the average age of black people killed by police, with white individuals generally being older than black individuals in these incidents.
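Cohen's d is the mean difference divided by the pooled standard deviation. A sketch on synthetic ages (the value will differ from the 0.589321 obtained on the real data):

```python
import numpy as np

# Synthetic stand-ins for the two age arrays.
rng = np.random.default_rng(8)
white_ages = rng.normal(40, 13, 2500)
black_ages = rng.normal(33, 11, 1300)

def cohens_d(a, b):
    # Pooled standard deviation, weighted by degrees of freedom.
    n1, n2 = len(a), len(b)
    pooled_sd = np.sqrt(((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1))
                        / (n1 + n2 - 2))
    return (a.mean() - b.mean()) / pooled_sd

d = cohens_d(white_ages, black_ages)
print(round(d, 3))
```

By the usual rule of thumb, d around 0.2 is small, 0.5 medium, and 0.8 large.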



In my recent analysis, I delved into the data surrounding individuals tragically killed by the police. By examining age, gender, and race, I gained valuable insights into the demographics of these incidents.

I worked on a comparative analysis:

  • Compared the age distribution across different races (e.g., Black, White, Hispanic, Asian) to identify any significant disparities.
  • Explored the gender-based differences in the age of individuals affected by police incidents.

In this analysis, I compared the age distribution of individuals involved in police incidents across different racial categories. The main objective was to discern significant differences in the ages of individuals based on their race.

Visualization Method: Box Plot


  • X-Axis (Race): I represented each racial category, such as ‘White’, ‘Black’, and ‘Hispanic’, on the X-axis.
  • Y-Axis (Age): The age of individuals involved in police incidents was depicted on the Y-axis.
  • Box Plot: Utilizing the box plot visualization, I displayed essential statistical metrics such as the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum of the dataset. This graphical representation offered a clear summary of the distribution of ages within each racial category, aiding in the identification of potential outliers.
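A box plot of this kind can be produced directly from pandas. The DataFrame below is a synthetic stand-in with assumed column names `age` and `race`:

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

# Synthetic stand-in with assumed column names 'age' and 'race'.
rng = np.random.default_rng(9)
df = pd.DataFrame({
    "age": rng.normal(37, 13, 900).clip(6, 91),
    "race": rng.choice(["White", "Black", "Hispanic"], size=900),
})

# One box (min, Q1, median, Q3, max, plus outliers) per racial category.
ax = df.boxplot(column="age", by="race")
ax.set_xlabel("Race")
ax.set_ylabel("Age")
plt.suptitle("")  # drop the automatic "Boxplot grouped by race" title
plt.savefig("age_by_race_boxplot.png")
```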

Through this analysis, I aimed to gain insights into the disparities in age across various racial groups involved in police incidents, thereby contributing to a comprehensive understanding of the data.

Based on the analysis and visualizations performed on the age distribution of individuals killed by police, we can draw several inferences:

I computed descriptive statistics: minimum and maximum age, mean age, median age, standard deviation, skewness, and kurtosis.

1. Age Range and Central Tendency:

  • The age range of individuals killed by police in the dataset spans from 6 to 91 years.
  • The mean age, which is approximately 37.1 years, represents the average age of individuals in these incidents.

2. Distribution Shape:

  • The histogram of the age distribution shows a shape that is close to normal (bell-shaped), indicating that the ages are relatively evenly distributed around the mean.
  • This is further supported by the Q-Q plot, where the data points lie approximately along a straight line, suggesting that the age data follows a normal distribution.

3. Variability and Skewness:

  • The standard deviation of approximately 13.0 indicates that ages vary around the mean by this amount on average.
  • The skewness of around 0.73 indicates a slight positive skew, suggesting that the age distribution is slightly skewed to the right. This means there are a few incidents involving individuals older than the mean age, pulling the distribution in that direction.

4. Comparison with Normal Distribution:

  • The age distribution, while slightly skewed, does not have extensive fat tails or significant peakiness around the mean (kurtosis is close to 3). This suggests that there are no extreme outliers or highly concentrated age groups.
  • The analysis includes a comparison with a standard normal distribution, highlighting that the dataset’s age distribution deviates slightly from a perfect normal distribution.

In my exploration of police shootings in the US, I took a closer look at the geographical aspects of these incidents. Geospatial analysis helps recognize patterns and trends tied to geographic locations and can provide valuable insights.

To start, I imported the essential libraries: Pandas for data manipulation, Matplotlib for visualization, and Scikit-Learn for clustering. I read the police shooting data from a CSV file, extracting latitude and longitude information for analysis.

I visualized the police shooting locations using scatter plots and 2D histograms. These visualizations provide a clear overview of the geographical distribution of incidents across the continental US.

I then explored DBSCAN clustering. Unlike traditional clustering algorithms, DBSCAN does not require specifying the number of clusters beforehand. Instead, it identifies dense regions of data points and labels them as clusters; points isolated from these dense regions are marked as noise, making DBSCAN particularly adept at handling irregularly shaped clusters and noise within the data.

This adaptive nature allowed me to uncover spatial groupings, without any prior assumptions about the number of clusters, that would have been challenging to capture using fixed-cluster methods.

DBSCAN’s ability to classify points as noise or outliers provided insights into isolated incidents. By separating these outliers from the main clusters, I gained a clearer understanding of singular events versus recurrent patterns. This differentiation is crucial for understanding the context of police shootings, distinguishing between random incidents and recurring issues in specific areas.


No matter how persuasive, numbers often remain abstract. I used data visualization to bridge the gap between raw data and human comprehension. By turning numbers into visual stories with dynamic charts, graphs, and interactive maps, I was able to understand the data better.

Geospatial analysis: I learned that geography is essential to comprehending social phenomena. By mapping out the incidents, I was able to see the shootings’ geographic spread. The resulting scatter plot clearly showed clusters in particular areas, which raised significant questions about the underlying causes of these patterns.

Demographic Disparities:

Examining the data through the lens of demographics was both illuminating and disheartening. Disparities based on race and gender became glaringly evident.

Understanding Armed Status and Threat Types:

It is essential to comprehend the conditions surrounding these situations, so threat categories and armed status were examined. The weapons used in shootings were analyzed in detail to shed light on them, and a scatter plot comparing threat types to age provided a multidimensional viewpoint.

Body Cameras and Mental Health Factors:

Additionally, the use of body cameras and the significance of mental health in these situations were investigated. Pie charts depicted body-camera usage in an engaging way, underscoring the importance of transparency. Similarly, the analysis of incidents related to mental health spurred discussion of the difficulties people with mental health issues encounter.

Exploration of the data: Understanding the dataset itself was the first step in our investigation. I carefully studied each column, figuring out what the labels and figures meant. It was important to comprehend the context of the data, since each entry carried the weight of a tragic event; it wasn’t just a collection of facts.

The Heart of Exploration: Asking the Right Questions

Who is most affected by fatal police shootings?
By immersing myself in demographic data, I found inequalities across ethnicity, age, gender, and indicators of mental illness.

Analyzing demographic data:

Age distribution: What is the average age of the victims? Do the average ages of different racial groups differ significantly from one another?
Gender disparities: What is the victims’ gender composition? Are there any gender-specific trends in fatal police shootings?
Racial disparities: Which races make up the victims? Are some racial groups more severely impacted than others?

Characteristics of the Incident:

Armed vs. Unarmed: What proportion of victims were armed? Is there a discernible difference in the likelihood of being shot between armed and unarmed individuals?
In how many incidents were victims fleeing the scene? Is there a connection between fleeing and the severity of the incident?
In what percentage of incidents were police officers wearing body cameras? Does officers wearing body cameras affect the outcomes?
What proportion of victims showed indicators of mental illness? Are there any links between signs of mental illness and police shootings?

I carefully examined the statistics, and patterns started to appear. Will continue to delve deeper into the analysis, seeking a comprehensive understanding of the underlying factors.

In my previous piece, I talked about how my model for forecasting diabetes rates was somewhat improved by including socioeconomic factors. By adding relevant new predictors and external datasets, I hoped to further improve the model. After doing some research, I discovered two intriguing data sources to use: a food access/nutrition database for low-income communities and a county health survey with lifestyle information. Including them produced valuable new features, such as the prevalence of obesity, the availability of food, and the frequency of exercise. Taking care to handle missing data and match geographic identifiers, I integrated the datasets.


Retraining the model with the enlarged data produced observable improvements; the R-squared rose, demonstrating the value of the extra signal. Examining the model coefficients revealed strong correlations between the new diet- and activity-related variables and the prevalence of diabetes. There is still room for improvement by incorporating data on healthcare availability, housing density, and other plausible factors. Additionally, I want to test nonlinear machine learning algorithms. However, adding pertinent outside data has already significantly improved the model’s predictive ability. I am getting closer to accurately estimating the factors that influence the occurrence of diabetes with each advancement. Further upgrading the data and algorithms should continue to advance the model’s usefulness for designing effective public health interventions.

In previous posts, I discussed building a regression model to predict diabetes rates and techniques to improve its performance. With feature engineering and model tuning, I had significantly boosted the R-squared metric. But I wanted to try incorporating some new predictor variables to potentially capture additional factors related to diabetes prevalence.

For this iteration, I added columns for socioeconomic status, household disability rate, and housing/transportation types. The hypothesis was these features might explain some of the remaining variance in diabetes rates across communities.

After standardizing the new features and fitting a linear regression model, I evaluated it on the test set. The model showed a small improvement in R-squared, indicating the new features were incrementally helpful in explaining variance in the target variable.

Checking the model coefficients provided some insights. For example, socioeconomic status had a high negative weight, meaning it was negatively correlated with diabetes rate as expected based on domain knowledge. Housing type also proved somewhat informative.

There are likely complex nonlinear relationships between diabetes and factors like income, transit access, and living conditions. To fully capture these, I may need to transform features, add interactions, or use more flexible algorithms. But even simple linear regression got a slight boost from the new features.

The process of improving the model is gradual, requiring testing many ideas. Each iteration provides more insight for getting closer to maximum predictability. My goal now is identifying additional datasets to combine with what I already have – more data and features could further improve the model.

Boosting Model Accuracy: Techniques to Improve R-Squared in Regression Models

Regression modeling is a popular statistical approach for understanding relationships between variables. The R-squared metric indicates how well the model fits the observed data, with higher values meaning more explanatory power.

Recently, I developed a regression model aiming to predict diabetes prevalence rates based on factors like physical inactivity and obesity levels. Despite iteratively tweaking the initial model, the best R-squared value achieved was 0.43. While decent, there was clear room for improvement in predictive accuracy.

In this post, I’ll share techniques I tried and am currently working on that helped significantly boost the R-squared for this project:

  • Incorporating additional relevant predictors correlated with the outcome variable, such as socioeconomic status, food access, education, and healthcare availability. This allowed capturing more variance in the target variable.
  • Applying transformations like logarithmic and Box-Cox to linearize non-linear relationships and correct residual patterns, improving overall model fit. Polynomial terms were also added to account for nonlinear effects.
  • Using regularization techniques like ridge and lasso regression to reduce overfitting by penalizing model complexity. Lasso helped remove redundant variables.
  • Carefully selecting the most predictive subset of explanatory variables through feature selection, eliminating noise-contributing variables.
  • Employing ensemble modeling techniques like random forests to combine multiple models and reduce variance.

I explored bootstrap resampling to estimate parameters, construct confidence intervals, and quantify uncertainty without making assumptions about the data’s underlying distribution, providing robust, distribution-free statistical inference.

It’s a statistical technique used to estimate the distribution of a statistic by resampling with replacement from the observed data. In simpler terms, it’s like simulating multiple hypothetical scenarios based on the existing data to understand the range of possibilities.

I wrote Python code for bootstrap resampling:

  • Data Extraction:
    • The code starts by extracting the data for % INACTIVE, % OBESE, and % DIABETIC from the merged_df DataFrame. These extracted data arrays are inactive_data, obese_data, and diabetic_data respectively.
  • Bootstrap Resampling Function (calculate_bootstrap_ci):
    • The calculate_bootstrap_ci function performs the core bootstrap resampling operation. It takes a data array as input.
    • In a loop, it generates num_samples bootstrap samples by randomly selecting data points from the input data array with replacement.
    • For each bootstrap sample, it calculates the mean and stores it in the bootstrap_means list.
  • Confidence Interval Calculation:
    • After generating the bootstrap means, the function calculates the 95% confidence interval using percentiles. The lower bound is the 2.5th percentile, and the upper bound is the 97.5th percentile of the bootstrap_means list.
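A reconstruction of the described routine might look like the sketch below; the input array is a synthetic stand-in for one column of `merged_df`, not the real data.

```python
import numpy as np

# Synthetic stand-in for one column of merged_df (e.g. % INACTIVE).
rng = np.random.default_rng(10)
inactive_data = rng.normal(14.8, 2.0, 1000)

def calculate_bootstrap_ci(data, num_samples=10_000):
    """95% bootstrap CI for the mean: resample with replacement and
    take the 2.5th and 97.5th percentiles of the resampled means."""
    bootstrap_means = []
    for _ in range(num_samples):
        sample = rng.choice(data, size=len(data), replace=True)
        bootstrap_means.append(sample.mean())
    return (np.percentile(bootstrap_means, 2.5),
            np.percentile(bootstrap_means, 97.5))

lower, upper = calculate_bootstrap_ci(inactive_data)
print(f"95% CI for mean % INACTIVE: ({lower:.2f}, {upper:.2f})")
```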

Finally, I calculated confidence intervals for % INACTIVE, % OBESE, and % DIABETIC:

% INACTIVE: Between 14.61% and 14.94% of the population is inactive. This suggests that a significant portion of the population is not engaging in physical activity.

% OBESE: Between 18.14% and 18.35% of the population is obese. This indicates a relatively high prevalence of obesity in the population.

% DIABETIC: Between 7.04% and 7.19% of the population has diabetes. This gives you an estimate of the proportion of people affected by diabetes in your dataset.

With reference to my previous post, I explored K-fold cross-validation, which gave me an R-squared value of 0.36, suggesting that the model is capturing a significant portion of the variance. I wanted to explore a couple more cross-validation techniques, so I decided to learn about Monte Carlo cross-validation (MCCV). It randomly partitions the data into training and test sets multiple times and averages the results, which is useful for assessing model robustness to different data splits and provides a robust estimate of model performance.

I set up the predictor variables X (% INACTIVE and % OBESE) and target variable Y (% DIABETIC) based on the dataset, and configured the code for 200 iterations.

I obtained an R-squared value of 0.31 from Monte Carlo cross-validation, which indicates a moderate level of association between the predictor variables and the target variable: about 31% of the variation in % DIABETIC can be explained by the linear relationship with % INACTIVE and % OBESE. While 5-fold cross-validation yielded an R-squared value of 0.36, MCCV produced 0.31; these differences reflect the variations in model performance introduced by different data-splitting approaches and randomization.
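Monte Carlo cross-validation corresponds to scikit-learn's `ShuffleSplit`. A sketch with 200 random splits on synthetic stand-ins for the predictors and target (the resulting R-squared will not match the 0.31 above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_val_score

# Synthetic stand-ins for % INACTIVE, % OBESE (X) and % DIABETIC (y).
rng = np.random.default_rng(11)
X = rng.normal([15, 18], [2, 2], (354, 2))
y = 0.1 * X[:, 0] + 0.15 * X[:, 1] + rng.normal(0, 0.5, 354)

# Monte Carlo CV: 200 random 80/20 splits, scoring R^2 on each test set.
mccv = ShuffleSplit(n_splits=200, test_size=0.2, random_state=0)
scores = cross_val_score(LinearRegression(), X, y, cv=mccv, scoring="r2")
print(f"average R^2 over 200 random splits: {scores.mean():.3f}")
```

Unlike K-fold, the random splits may overlap, and each observation can appear in several test sets.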

The next step I considered:

Feature Engineering: Exploring additional predictor variables or engineered features that might better capture the underlying relationships in the data.

I explored predictor variables like ‘Number of Primary Care Physicians’, ‘Overall Socioeconomic Status’, and ‘Limited Access To Healthy Foods’. The next goal is to improve the model’s accuracy with the new predictor variables until it achieves a satisfactory level of performance.

Cross-validation is a crucial technique for assessing and improving the performance of machine learning models and is the next natural step for further analysis of the model.

I worked on setting up Python code for K-fold cross-validation, which involved splitting the dataset into K subsets (folds) to perform training and testing multiple times. In each iteration of the cross-validation loop, a different subset of the data is used as the test set, while the remaining data is used for training.

The code calculates the R-squared value for each fold of the cross-validation. The R-squared value measures how well the model’s predictions match the actual log-transformed % DIABETIC values in the test set. It ranges up to 1, with higher values indicating a better fit of the model to the data (on held-out data it can even be negative for a model that predicts worse than the mean).

Finally, the code calculates the average R-squared value across all the folds of cross-validation. This average R-squared value gives you an estimate of how well the quadratic regression model is performing on average across different subsets of the data.

So, in summary, the “model” in the code is a quadratic regression model that is trained and evaluated using K-fold cross-validation to assess its performance in predicting log-transformed % DIABETIC values based on the predictor variables % INACTIVE and % OBESE.
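The described loop might look like the following sketch: a degree-2 polynomial pipeline evaluated with 5-fold cross-validation, on synthetic stand-ins for the predictors and the log-transformed target.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import KFold
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins: predictors and a log-transformed target.
rng = np.random.default_rng(12)
X = rng.normal([15, 18], [2, 2], (354, 2))          # % INACTIVE, % OBESE
y = np.log(7 + 0.1 * X[:, 0] + 0.15 * X[:, 1] + rng.normal(0, 0.5, 354))

# Quadratic regression evaluated with 5-fold cross-validation.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
kf = KFold(n_splits=5, shuffle=True, random_state=0)

r2_per_fold = []
for train_idx, test_idx in kf.split(X):
    model.fit(X[train_idx], y[train_idx])
    r2_per_fold.append(r2_score(y[test_idx], model.predict(X[test_idx])))

print(f"average R^2 across folds: {np.mean(r2_per_fold):.3f}")
```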


I obtained an average R-squared value of 0.36 from the 5-fold cross-validation.

While an R-squared value of 0.36 suggests that the model is capturing a significant portion of the variance, it also indicates that there is a substantial amount of unexplained variance in the data. This implies that factors other than % INACTIVE and % OBESE may contribute to the variation in % DIABETIC.

Based on this result, I will consider further model refinement or explore other predictors to improve the model’s explanatory power. Additionally, assess the residuals and other model diagnostics to ensure that the model assumptions are met and to identify potential areas for improvement.

With reference to my previous findings the substantial rise in the R-squared value, transitioning from 0.35 (in the case of the simple linear model) to 0.43 (with the quadratic model on log-transformed data), indicates a noteworthy improvement in how well the quadratic model fits the data in comparison to the straightforward linear model.

I plotted scatterplots to visually explore and understand the relationship between the original predictor variables (% INACTIVE and % OBESE) and the log-transformed target variable (% DIABETIC) in the dataset. 


The scatter points represent individual data points, with one axis showing the values of the predictor variable (% INACTIVE or % OBESE) and the other axis showing the log-transformed values of the target variable (% DIABETIC). Each point corresponds to a data observation.

The red and green lines overlaid on the scatter plots represent the predictions made by a quadratic regression model. These lines indicate how well the model fits the data. If the model fits well, the red and green lines should follow the data points’ general trend.

By having two separate graphs (one for each predictor variable), we can compare the relationships between the predictors and the log-transformed target variable, and visually assess which predictor appears to have a stronger relationship with the target.

While the R-squared value has improved, it’s essential to evaluate the model further. I will consider using techniques such as cross-validation to assess the model’s performance on unseen data. I will also try to compare different polynomial degrees (e.g., cubic or higher) to see if a higher-degree polynomial provides a better fit.
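As a sketch of that comparison, the snippet below cross-validates several polynomial degrees side by side. The helper name `compare_degrees` is mine; in practice X and y would come from the merged dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

def compare_degrees(X, y, degrees=(1, 2, 3, 4), n_splits=5, seed=0):
    """Return {degree: mean cross-validated R-squared}; higher is better."""
    kf = KFold(n_splits=n_splits, shuffle=True, random_state=seed)
    results = {}
    for d in degrees:
        model = make_pipeline(PolynomialFeatures(degree=d), LinearRegression())
        results[d] = cross_val_score(model, X, y, scoring="r2", cv=kf).mean()
    return results
```

If the cubic or quartic scores are no better than the quadratic ones (or start to drop), that is a sign the extra degrees are only fitting noise, i.e. overfitting.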

Today in class :

 t-test – a statistical analysis technique that compares the means of two groups to determine whether the difference between them is statistically significant or likely the result of chance. It is a tool that helps establish whether a difference between two sets of data is meaningful.

Analyzed data that lists pairs (post-molt, pre-molt) where “post-molt” is the size of a crab’s shell after molting, and “pre-molt” is the size of a crab’s shell before molting. 
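A paired t-test is the natural choice here, since each crab contributes both a pre-molt and a post-molt measurement. The sketch below uses synthetic stand-in numbers, not the real molt dataset:

```python
import numpy as np
from scipy import stats

# Synthetic stand-in for the (post-molt, pre-molt) pairs discussed above
rng = np.random.default_rng(42)
pre_molt = rng.normal(130.0, 10.0, size=50)             # shell size before molting
post_molt = pre_molt + rng.normal(14.0, 2.0, size=50)   # shell size after molting

# Paired t-test: each crab is measured twice, so we test whether
# the mean (post - pre) difference is zero
t_stat, p_value = stats.ttest_rel(post_molt, pre_molt)
```

A small p-value would indicate that the growth between molts is statistically significant rather than sampling noise.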

Since there are situations where the assumptions of a t-test may not hold, particularly when dealing with highly non-normally distributed data or other violations of its assumptions (such as unequal variances), we can turn to Monte Carlo procedures to obtain a more robust estimate of the p-value. Monte Carlo procedures offer a powerful alternative for estimating p-values.
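One common Monte Carlo procedure is a permutation test: pool the two samples, repeatedly reshuffle the group labels, and count how often a shuffled difference in means is at least as extreme as the observed one. A minimal sketch (the function name is mine):

```python
import numpy as np

def permutation_p_value(a, b, n_perm=10_000, seed=0):
    """Monte Carlo estimate of the two-sided p-value for a difference in means.

    Makes no normality or equal-variance assumption: the null distribution
    is built empirically by shuffling the pooled observations.
    """
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        diff = abs(pooled[: len(a)].mean() - pooled[len(a):].mean())
        count += diff >= observed
    return (count + 1) / (n_perm + 1)  # add-one smoothing avoids a p-value of exactly 0
```

Note this version assumes two independent samples; for paired data like the molt measurements, the analogous procedure randomly flips the sign of each within-pair difference instead.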

Following up on my previous post

Since the log transformation of the dependent variable (% DIABETIC) alone did not produce a significant rise in the R-squared value, I applied a quadratic (polynomial) regression model to the log-transformed data:

  • Defined the feature matrix X, containing the predictor variables % INACTIVE and % OBESE. The log-transformed target variable is stored in y.
  • Created a PolynomialFeatures object with a degree of 2, indicating that we want to generate quadratic (second-degree) polynomial features.
  • Used the fit_transform method to transform the original features in X to include quadratic terms, creating a new feature matrix X_poly with the original features and their quadratic combinations.
  • Created a linear regression model (LinearRegression) and fit it to the transformed feature matrix X_poly.
  • Calculated the R-squared value.

R Squared value = 0.43 

This improvement in R-squared indicates that the quadratic model is explaining more of the variance in the data and is likely capturing the underlying relationships more accurately.
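The steps listed above can be sketched in a few lines. This assumes the predictors and the log-transformed target are available as arrays; the helper name `quadratic_r2` is mine:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

def quadratic_r2(X, y_log):
    """Fit a degree-2 polynomial regression and return the in-sample R-squared."""
    poly = PolynomialFeatures(degree=2)
    X_poly = poly.fit_transform(X)   # original terms, squares, and interactions
    model = LinearRegression().fit(X_poly, y_log)
    return model.score(X_poly, y_log)
```

On the post's data this would be called as `quadratic_r2(merged_df[['% INACTIVE', '% OBESE']].to_numpy(), np.log(merged_df['% DIABETIC']))`. Note this is an in-sample R-squared; the cross-validated score discussed later is the fairer estimate of predictive performance.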

Today in class : Linear Regression: The analysis used linear regression with two predictor variables, “% OBESE” and “% INACTIVE,” to predict or explain changes in the dependent variable (% DIABETIC).

The model’s R-squared value was 0.34, which indicates the strength of the linear relationship between the predictors and the dependent variable: the predictors account for about 34% of the variation in the outcome.

Quadratic Model: The linear regression model can be expanded by including quadratic terms for % INACTIVE and % OBESE, producing a quadratic model. A quadratic model allows a nonlinear relationship between the predictors and the dependent variable.


Overfitting: In the context of statistical modelling and machine learning, the term “overfitting” describes a scenario in which a model is overly complex and fits the training data too closely.

Overfit models can lead to poor generalization, where the model’s performance degrades when it encounters data it hasn’t seen during training.

As discussed in class I went back and tried the quadratic model using Python:

import pandas as pd
import statsmodels.api as sm

# Add squared terms for the quadratic model
merged_df['INACTIVE_squared'] = merged_df['% INACTIVE'] ** 2
merged_df['OBESE_squared'] = merged_df['% OBESE'] ** 2

X = merged_df[['% INACTIVE', '% OBESE', 'INACTIVE_squared', 'OBESE_squared']]
X = sm.add_constant(X)  # Add a constant term (intercept) to the model
y = merged_df['% DIABETIC']

model = sm.OLS(y, X).fit()
print(model.rsquared)


This gave me an R-squared value of 0.384.

 Following up on my previous post, I tried a log transformation of the dependent variable (% DIABETIC), which gave an R-squared value of 0.353. The increase in R-squared is very minimal, meaning the log transformation did not lead to any dramatic improvement in fit. The primary reason for log-transforming the dependent variable was to address heteroscedasticity and meet the assumptions of linear regression.

Moving ahead I will keep looking for better transformations and models .

According to the findings of my multiple linear regression, which I discussed in my previous post, the percentage of inactivity (% INACTIVE) and the percentage of obesity (% OBESE) are both statistically significant predictors of the proportion of people who have diabetes (% DIABETIC). The R-squared value of the model is 0.341, indicating that the model’s predictors can account for approximately 34.1% of the variance in the percentage of diabetics. The significant F-statistic indicates that the overall model is statistically significant.


I also ran the Breusch-Pagan test, which checks for heteroscedasticity; the p-value was extremely low (around zero), suggesting that the model contains evidence of heteroscedasticity. Heteroscedasticity, where the variance of the residuals depends on the values of the predictors, can affect the reliability of the regression estimates.

It is essential to address the situation because heteroscedasticity has been revealed. I will attempt to stabilize the variance by transforming the dependent variable (% DIABETIC) using a suitable transformation (e.g., a log transformation), explore other predictor variables such as socioeconomic status, household composition, and disability, and then reevaluate the model to determine its explanatory and predictive capacity.

Topics covered in the lecture :

  1. p-value: a key concept in statistics. Assuming there is no actual difference or relationship (the “null hypothesis”), a p-value tells us how likely we would be to observe outcomes at least as extreme as those in our data. A low p-value (usually below 0.05) indicates strong evidence against the null hypothesis; a high p-value denotes inadequate evidence to reject the null hypothesis and suggests the observed effect may be due to chance. In conclusion, p-values are crucial for assessing whether the data contains strong evidence against the null hypothesis.
  2. Breusch-Pagan test: used to detect heteroscedasticity. It evaluates whether the variance of the residuals in our regression model is related to one or more independent variables. A significant result from this test raises the possibility that our data may be heteroscedastic, which calls for further investigation.

CDC Dataset :

1. Multiple Regression:

Conducted a multiple regression analysis where the target variable was the percentage of diabetes, with the predictor variables being the percentages of inactivity and obesity.

Created several plots to visualize the results using the seaborn library:

A residual plot to help check for the assumption of constant variance and identify any patterns or outliers in the residuals.

Regression plots to visualize the relationships between individual predictor variables and the target variable while holding the other variables constant.

code snippet :

#For Regression plots

import matplotlib.pyplot as plt
import seaborn as sns

# Plot each predictor against the target (% DIABETIC) with a fitted regression line
sns.regplot(x=merged_df['% INACTIVE'], y=y, scatter_kws={'alpha': 0.3}, color='blue', label='inactivity')
sns.regplot(x=merged_df['% OBESE'], y=y, scatter_kws={'alpha': 0.3}, color='red', label='obesity')
plt.xlabel('Predictor Variables')
plt.ylabel('Target Variable')
plt.title('Regression Plots')
plt.legend()
plt.show()

corr_matrix = merged_df[['% DIABETIC', '% INACTIVE', '% OBESE']].corr()

# For heatmap of the correlation matrix
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Matrix Heatmap')
plt.show()

Heatmap to visualize the correlation matrix between predictor variables

2. Breusch-Pagan test: performed the Breusch-Pagan test on the multiple regression using Python.

To tackle the CDC 2018 diabetes dataset comprehensively, I plan to consider various factors and approaches, including additional variables like socioeconomic status, household composition, and disability. Here’s how I initially approached this dataset:

 I began by getting to know the dataset inside out. This involved reviewing its structure, understanding column names, and identifying data types. During this process, several key questions emerged:

What is the overall prevalence of diabetes in different U.S. counties, and is there geographic variation?

Are there any correlations between obesity rates and diabetes prevalence in different counties? Does a higher obesity rate correlate with a higher diabetes rate?

How does the level of physical inactivity relate to diabetes and obesity rates? Are areas with higher rates of inactivity more likely to have higher diabetes and obesity rates?

Are there regional patterns in diabetes, obesity, and inactivity rates? Are certain areas of the country more affected than others?

Can we build predictive models to forecast diabetes rates based on obesity and inactivity levels? How accurate are these models?

To gain deeper insights into the dataset, I conducted data exploration. This involved generating summary statistics and creating visualizations, including histograms, box plots, and scatterplots. These techniques helped me understand the distributions of variables such as %diabetes, %obesity, and %inactivity.

 Data visualization played a crucial role in understanding relationships and patterns. I used scatterplots to visualize how %diabetes relates to %obesity and %inactivity. Heatmaps were employed to identify correlations between variables.

Moving forward, I tried linear regression analysis. The goal here was to comprehend how %diabetes is influenced by %obesity and %inactivity.

Additionally, I plan on examining the residuals of the regression models to check for important statistical assumptions:

Normality: Are the residuals normally distributed?

Heteroscedasticity: Does the spread of residuals change as %inactivity levels change?

Outliers: Are there any outlier data points that significantly affect the model?

 If the normality or constant-variance (homoscedasticity) assumptions are violated, or influential outliers are present, I will consider applying appropriate transformations to the data to improve the reliability of the analysis.