Descriptive Statistics and Visualizations
Histograms and Summary Statistics
I generated histograms for ages within each race category. These visualizations provided a clear overview of the age distribution for different races. Alongside the visualizations, I calculated essential summary statistics, including mean, median, standard deviation, variance, skewness, and kurtosis. These metrics offer insights into the central tendency and the spread of age within each racial group.
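A minimal sketch of these per-race summary statistics, using a hypothetical sample in place of the real dataset (the `race` and `age` column names are assumptions about the CSV's schema):

```python
import numpy as np
import pandas as pd

# Hypothetical sample standing in for the shootings dataset; the real
# data would come from pd.read_csv with 'race' and 'age' columns.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "race": rng.choice(["W", "B", "H", "A"], size=500),
    "age": rng.normal(37, 13, size=500).clip(6, 91),
})

# Per-race summary statistics: central tendency, spread, and shape.
summary = df.groupby("race")["age"].agg(
    mean="mean",
    median="median",
    std="std",
    var="var",
    skew="skew",
    kurtosis=lambda s: s.kurtosis(),  # pandas reports excess kurtosis (normal ~ 0)
)
print(summary)

# The histograms themselves can be drawn with, e.g.:
# df["age"].hist(by=df["race"], bins=20)
```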

Variance Analysis

Variance analysis helps us understand the extent of age variation within each race. By comparing variances, we identify which racial group has the most diverse age range among individuals involved in police shootings.

Confidence Intervals for the Mean Age

Confidence intervals provide a range in which the true mean age for each race is likely to fall. We compute 95%, 99%, and 99.9% confidence intervals, offering a more comprehensive understanding of the uncertainty associated with the mean age within each racial category.
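The three t-based intervals can be computed as below; the `ages` array is an illustrative stand-in for one racial group's ages:

```python
import numpy as np
from scipy import stats

# Hypothetical age sample for one racial group; in the analysis this
# would be something like df.loc[df["race"] == "W", "age"].dropna().
ages = np.array([23, 31, 36, 29, 45, 52, 38, 41, 27, 33, 48, 35], float)

n = len(ages)
mean, sem = ages.mean(), stats.sem(ages)  # sample mean and its standard error

# t-based confidence intervals at the three levels used above.
for level in (0.95, 0.99, 0.999):
    lo, hi = stats.t.interval(level, df=n - 1, loc=mean, scale=sem)
    print(f"{level:.1%} CI: ({lo:.2f}, {hi:.2f})")
```

Each interval is centered on the sample mean and widens as the confidence level rises.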

Bayesian Probability Analysis

Using a Bayesian approach, we calculate the probability of an individual’s race given their age. This analysis provides a unique perspective on how race and age are correlated in the context of police shootings.
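One way to sketch this Bayes-rule calculation is to discretize age into bands and estimate the pieces by counting; the data and band edges below are illustrative, not the real dataset's:

```python
import numpy as np
import pandas as pd

# Hypothetical data standing in for the shootings DataFrame; the real
# 'race' and 'age' columns come from the CSV.
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "race": rng.choice(["W", "B", "H"], size=1000, p=[0.5, 0.3, 0.2]),
    "age": rng.normal(37, 13, size=1000).clip(6, 91),
})

# Discretize age into bands so the likelihood can be estimated by counting.
bands = ["0-20", "20-30", "30-40", "40-50", "50+"]
df["age_band"] = pd.cut(df["age"], bins=[0, 20, 30, 40, 50, 100], labels=bands)

prior = df["race"].value_counts(normalize=True)                    # P(race)
cond = pd.crosstab(df["race"], df["age_band"], normalize="index")  # P(age_band | race)

def posterior(age_band):
    """P(race | age_band) is proportional to P(age_band | race) * P(race)."""
    joint = prior * cond[age_band]
    return joint / joint.sum()

post = posterior("30-40")
print(post)
```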


Analyzing police shootings data through statistical techniques sheds light on the intricate relationship between race, age, and law enforcement encounters. By employing various statistical methods, I gained valuable insights into the demographics of individuals involved in these incidents. Understanding these patterns is a significant step toward fostering informed discussions and implementing policies aimed at addressing the complex issues surrounding police shootings.

I worked on two different clustering algorithms: K-means clustering and DBSCAN. The purpose of this analysis was to group similar data points together based on their geographical coordinates (latitude and longitude) using different clustering algorithms. The choice of clustering algorithms and the number of clusters (in this case, 3 clusters) might be determined based on the nature of the data and the specific goals of the analysis. The visualization helps in understanding how well the algorithms have grouped the data points into distinct clusters.

K-means Clustering:

  • K-means grouped the data into 3 clusters. Each data point was assigned to one of these clusters.
  • K-means assumes that clusters are spherical and equally sized, which might not hold for geographical data. The effectiveness of K-means in this context therefore depends on the distribution of the data.

DBSCAN Clustering:

  • DBSCAN identified clusters based on dense regions of data points. Outliers or sparse points that don’t belong to any dense cluster were labeled as -1.
  • DBSCAN doesn’t require specifying the number of clusters beforehand, making it more flexible, especially when the clusters have different shapes and densities.
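The two algorithms can be run side by side as below; the coordinates are synthetic stand-ins for the dataset's latitude/longitude columns, and the `eps`/`min_samples` values are illustrative:

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# Hypothetical (latitude, longitude) points standing in for the incident
# coordinates; the real ones would come from the shootings CSV.
rng = np.random.default_rng(2)
coords = np.vstack([
    rng.normal([34.0, -118.2], 0.5, size=(100, 2)),  # around Los Angeles
    rng.normal([40.7, -74.0], 0.5, size=(100, 2)),   # around New York
    rng.normal([29.8, -95.4], 0.5, size=(100, 2)),   # around Houston
])

# K-means: the number of clusters must be fixed in advance (3, as above).
km_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(coords)

# DBSCAN: no k needed; eps is in coordinate degrees, sparse points get -1.
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(coords)

print("K-means clusters:", np.unique(km_labels))
print("DBSCAN clusters (-1 = noise):", np.unique(db_labels))
```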

Since the data is not normally distributed, I decided to perform a Monte Carlo test.

I started by preparing the data, separating the ages of black and white individuals killed by the police into two distinct arrays, ensuring any missing values were removed.

Next, I calculated the observed mean age difference between white and black individuals from the original data.

I then set up a Monte Carlo simulation, specifying 10000 simulations to be performed. For each simulation, I randomly sampled ages with replacement from the combined dataset of black and white individuals. In each sample, I calculated the mean age difference between the sampled white and black individuals. These calculated mean differences were stored in a list called mean_diffs.

After running the simulations, I measured the time taken for the entire process. I also filtered out any NaN values from the list of mean differences to ensure accurate analysis.

For a visual representation, I plotted a histogram of the mean differences obtained from the Monte Carlo simulations. To provide context, I included a vertical dashed line indicating the observed mean difference from the original data.

To draw meaningful conclusions, I calculated and printed the mean age difference for white and black individuals based on the original data. Additionally, I determined the number of simulated samples with a mean difference greater than the observed difference. This analysis helps in understanding whether the observed difference is statistically significant or if it could have occurred by chance.
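The simulation described above can be sketched as follows; the age arrays are hypothetical stand-ins for the real filtered data (which would also need NaN values dropped first):

```python
import time

import numpy as np

# Hypothetical age arrays; the real ones come from filtering the dataset
# by race and dropping missing values.
rng = np.random.default_rng(3)
white_ages = rng.normal(40, 13, size=2500).clip(6, 91)
black_ages = rng.normal(33, 11, size=1300).clip(6, 91)

observed_diff = white_ages.mean() - black_ages.mean()

combined = np.concatenate([white_ages, black_ages])
n_sims = 10_000
start = time.time()

mean_diffs = []
for _ in range(n_sims):
    # Resample both groups with replacement from the pooled ages
    # (the null hypothesis: race labels carry no age information).
    w = rng.choice(combined, size=len(white_ages), replace=True)
    b = rng.choice(combined, size=len(black_ages), replace=True)
    mean_diffs.append(w.mean() - b.mean())

elapsed = time.time() - start
mean_diffs = np.asarray(mean_diffs)

# p-value: share of simulated differences at least as large as observed.
p_value = (mean_diffs >= observed_diff).mean()
print(f"observed diff: {observed_diff:.2f}, p = {p_value:.4f}, {elapsed:.1f}s")
```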

The fact that none of the 10,000 random samples had a mean age difference greater than 7.41 years strongly suggests that the observed difference is not due to random chance. This is reflected in the very low p-value (essentially zero), indicating a statistically significant result.

I also calculated Cohen’s d because it not only provides a standardized measure of the effect size but also helps interpret the practical significance of the differences between groups. It gave me a value of 0.589321, a medium effect by conventional benchmarks.

Interpreting this effect size, the average age of white people killed by police differs meaningfully from the average age of black people killed by police, with white individuals generally being older in these incidents.
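A sketch of the Cohen’s d computation with a pooled standard deviation, on hypothetical samples (the real arrays come from the dataset, where the value 0.589321 was obtained):

```python
import numpy as np

# Hypothetical age samples; the real arrays come from the dataset.
rng = np.random.default_rng(4)
white_ages = rng.normal(40, 13, size=2500)
black_ages = rng.normal(33, 11, size=1300)

def cohens_d(a, b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

d = cohens_d(white_ages, black_ages)
print(f"Cohen's d = {d:.3f}")
```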



In my recent analysis, I delved into the data surrounding individuals tragically killed by the police. By examining age, gender, and race, I gained valuable insights into the demographics of these incidents.

I worked on a comparative analysis:

  • Compared the age distribution across different races (e.g., Black, White, Hispanic, Asian) to identify any significant disparities.
  • Explored the gender-based differences in the age of individuals affected by police incidents.

In this analysis, I compared the age distribution of individuals involved in police incidents across different racial categories. The main objective was to discern significant differences in the ages of individuals based on their race.

Visualization Method: Box Plot


  • X-Axis (Race): I represented each racial category, such as ‘White’, ‘Black’, and ‘Hispanic’, on the X-axis.
  • Y-Axis (Age): The age of individuals involved in police incidents was depicted on the Y-axis.
  • Box Plot: Utilizing the box plot visualization, I displayed essential statistical metrics such as the minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum of the dataset. This graphical representation offered a clear summary of the distribution of ages within each racial category, aiding in the identification of potential outliers.
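The box plot can be produced as below; the data is a hypothetical stand-in, and the `race`/`age` column names are assumed from the dataset:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Hypothetical data with the assumed 'race' and 'age' columns.
rng = np.random.default_rng(5)
df = pd.DataFrame({
    "race": rng.choice(["White", "Black", "Hispanic"], size=600),
    "age": rng.normal(37, 13, size=600).clip(6, 91),
})

# One box (min, Q1, median, Q3, max, plus outliers) per racial category;
# groupby iterates in sorted key order, matching the sorted labels.
groups = [grp["age"].values for _, grp in df.groupby("race")]
labels = sorted(df["race"].unique())

fig, ax = plt.subplots()
ax.boxplot(groups)
ax.set_xticks(range(1, len(labels) + 1), labels)
ax.set_xlabel("Race")
ax.set_ylabel("Age")
ax.set_title("Age distribution by race")
fig.savefig("age_by_race_boxplot.png")
```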

Through this analysis, I aimed to gain insights into the disparities in age across various racial groups involved in police incidents, thereby contributing to a comprehensive understanding of the data.

Based on the analysis and visualizations performed on the age distribution of individuals killed by police, we can draw several inferences:

I computed descriptive statistics: minimum and maximum age, mean age, median age, standard deviation, skewness, and kurtosis.

1. Age Range and Central Tendency:

  • The age range of individuals killed by police in the dataset spans from 6 to 91 years.
  • The mean age, which is approximately 37.1 years, represents the average age of individuals in these incidents.

2. Distribution Shape:

  • The histogram of the age distribution shows a shape that is close to normal (bell-shaped), indicating that the ages are relatively evenly distributed around the mean.
  • This is further supported by the Q-Q plot, where the data points lie approximately along a straight line, suggesting that the age data follows a normal distribution.

3. Variability and Skewness:

  • The standard deviation of approximately 13.0 indicates that ages vary around the mean by this amount on average.
  • The skewness of around 0.73 indicates a slight positive skew, suggesting that the age distribution is slightly skewed to the right. This means there are a few incidents involving individuals older than the mean age, pulling the distribution in that direction.

4. Comparison with Normal Distribution:

  • The age distribution, while slightly skewed, does not have extensive fat tails or significant peakiness around the mean (kurtosis is close to 3). This suggests that there are no extreme outliers or highly concentrated age groups.
  • The analysis includes a comparison with a standard normal distribution, highlighting that the dataset’s age distribution deviates slightly from a perfect normal distribution.
Geospatial Analysis

In my exploration of police shootings in the US, I took a closer look at the geographical aspects of these incidents. Geospatial analysis helps recognize patterns and trends tied to geographic locations and can provide valuable insights.

To start, I imported the essential libraries: Pandas for data manipulation, Matplotlib for visualization, and Scikit-Learn for clustering. I read the police shooting data from a CSV file, extracting latitude and longitude information for analysis.

I visualized the police shooting locations using scatter plots and 2D histograms. These visualizations provide a clear overview of the geographical distribution of incidents across the continental US.

I then explored DBSCAN clustering. Unlike traditional clustering algorithms, DBSCAN does not require specifying the number of clusters beforehand: it identifies dense regions of data points and labels them as clusters, while points isolated from these dense regions are marked as noise. This makes DBSCAN particularly adept at handling irregularly shaped clusters and noise within the data, and its adaptive nature allowed me to uncover spatial groupings that would have been challenging to capture using fixed-cluster methods.

DBSCAN’s ability to classify points as noise or outliers also provided insights into isolated incidents. By separating these outliers from the main clusters, I gained a clearer understanding of singular events versus recurrent patterns. This differentiation is crucial in understanding the context of police shootings, distinguishing between random incidents and recurring issues in specific areas.
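The scatter plot and 2D histogram mentioned above can be sketched like this; the coordinates are uniformly generated stand-ins for the dataset’s longitude/latitude columns:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical incident coordinates; the real ones are read from the CSV's
# longitude/latitude columns and restricted to the continental US.
rng = np.random.default_rng(6)
lon = rng.uniform(-125, -67, size=2000)
lat = rng.uniform(25, 49, size=2000)

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 5))

# Scatter plot: each point is one incident.
ax1.scatter(lon, lat, s=4, alpha=0.4)
ax1.set_title("Incident locations")

# 2D histogram: bin counts reveal the density of incidents per region.
counts, _, _, im = ax2.hist2d(lon, lat, bins=40)
ax2.set_title("Incident density")
fig.colorbar(im, ax=ax2)

for ax in (ax1, ax2):
    ax.set_xlabel("Longitude")
    ax.set_ylabel("Latitude")

fig.savefig("shooting_locations.png")
```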


No matter how persuasive, numbers frequently remain abstract. I used data visualisation to bridge the gap between raw data and human comprehension, turning numbers into visual stories through dynamic charts, graphs, and interactive maps, which helped me understand the data better.

Geospatial analysis: I learned that geography is essential to understanding social phenomena. Mapping the incidents let me see the shootings’ geographic spread, and the resulting scatter plot clearly showed clusters in particular areas, which raised significant questions about the underlying causes of these patterns.

Demographic Disparities:

Examining the data through the lens of demographics was both illuminating and disheartening. Disparities based on race and gender became glaringly evident.

Understanding Armed Status and Threat Types :

It is essential to understand the conditions surrounding these situations, so threat categories and armed status were examined. The weapons used in shootings were analyzed in detail to shed light on them, while a scatter plot comparing threat types to age provided a multidimensional viewpoint.

Body Cameras and Mental Health Factors :

Additionally, the use of body cameras and the significance of mental health in these situations were investigated. Pie charts depicted body camera usage in an engaging way, underscoring the importance of transparency. Similarly, the analysis of incidents related to mental health spurred discussion of the difficulties people with mental health issues encounter.

Exploration of data: Understanding the dataset itself was the first step of the investigation. I carefully studied each column, working out what the labels and figures meant. Comprehending the context of the data was important: each entry carried the weight of a tragic occurrence; it wasn’t just a collection of facts.

The Heart of Exploration: Asking the Right Questions

Who is most affected by fatal police shootings?
By immersing myself in demographic data, I found disparities by ethnicity, age, gender, and signs of mental illness.

Analysing demographic data:

Age distribution: What is the average age of the victims? Do the average ages of different racial groups differ significantly from one another?
Gender inequalities: What is the gender composition of the victims? Are there any gender-specific trends in fatal police shootings?
Racial inequalities: Which races make up the victims? Are some racial groups more severely impacted than others?

Characteristics of the Incident:

Armed vs. unarmed: What proportion of victims were armed? Is there a discernible difference in the likelihood of being shot between armed and unarmed individuals?
Fleeing: In how many incidents were victims fleeing the scene? Is there a connection between fleeing and the severity of the encounter?
Body cameras: In what percentage of incidents were officers wearing body cameras? Does officers wearing body cameras affect the outcomes?
Mental illness: What proportion of victims showed signs of mental illness? Are there any links between signs of mental illness and police shootings?

I carefully examined the statistics, and patterns started to emerge. I will continue to delve deeper into the analysis, seeking a comprehensive understanding of the underlying factors.

In my previous piece, I talked about how my model for forecasting diabetes rates was somewhat improved by including socioeconomic factors. By adding relevant new predictors and external datasets, I hoped to improve the model further. After doing some research, I discovered two intriguing data sources to use: a food access/nutrition database for low-income communities and a county health survey containing lifestyle information. Including them produced valuable new features, such as the prevalence of obesity, the availability of food, the frequency of exercise, and more. Taking care to handle missing data and match geographic identifiers, I integrated the datasets.
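A minimal sketch of that integration step with toy tables (the `fips` county identifier and all column names are illustrative assumptions, not the real schemas):

```python
import pandas as pd

# Toy frames standing in for the diabetes data and the two external
# sources; 'fips' is an assumed county identifier used for the join.
diabetes = pd.DataFrame({"fips": [1001, 1003, 1005],
                         "diabetes_rate": [0.11, 0.09, 0.13]})
food_access = pd.DataFrame({"fips": [1001, 1003],
                            "low_food_access_pct": [0.22, 0.15]})
health_survey = pd.DataFrame({"fips": [1001, 1005],
                              "obesity_pct": [0.33, 0.38],
                              "exercise_pct": [0.48, 0.41]})

# Left joins keep every county in the base table; counties missing from a
# source get NaN, which is then imputed (median here, as a simple choice).
merged = (diabetes
          .merge(food_access, on="fips", how="left")
          .merge(health_survey, on="fips", how="left"))
merged = merged.fillna(merged.median(numeric_only=True))
print(merged)
```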


Retraining the model with the enlarged data produced observable improvements; the R-squared rose, demonstrating the importance of the extra signal. Examining the model coefficients revealed strong correlations between the new diet- and activity-related variables and the prevalence of diabetes. There is still room for improvement by incorporating data on healthcare availability, housing density, and other plausible factors, and I also want to test nonlinear machine learning algorithms. However, adding pertinent outside data has already significantly improved the model’s ability to anticipate outcomes. With each advancement I am getting closer to accurately estimating the factors that influence the occurrence of diabetes, and further upgrading the data and algorithms should continue to advance the model’s usefulness for designing effective public health interventions.

In previous posts, I discussed building a regression model to predict diabetes rates and techniques to improve its performance. With feature engineering and model tuning, I had significantly boosted the R-squared metric. But I wanted to try incorporating some new predictor variables to potentially capture additional factors related to diabetes prevalence.

For this iteration, I added columns for socioeconomic status, household disability rate, and housing/transportation types. The hypothesis was these features might explain some of the remaining variance in diabetes rates across communities.

After standardizing the new features and fitting a linear regression model, I evaluated it on the test set. The model showed a small improvement in R-squared, indicating the new features were incrementally helpful in explaining variance in the target variable.

Checking the model coefficients provided some insights. For example, socioeconomic status had a high negative weight, meaning it was negatively correlated with diabetes rate as expected based on domain knowledge. Housing type also proved somewhat informative.
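The standardize-fit-evaluate loop can be sketched on synthetic data as below; the feature names and effect sizes are illustrative, not the real dataset’s (here the first feature plays the role of socioeconomic status with a negative effect):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in: diabetes rate driven negatively by socioeconomic
# status and positively by disability rate, plus noise.
rng = np.random.default_rng(7)
X = rng.normal(size=(300, 3))            # ses, disability_rate, housing_type
y = 0.1 - 0.03 * X[:, 0] + 0.02 * X[:, 1] + rng.normal(0, 0.01, 300)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scaler = StandardScaler().fit(X_train)   # standardize using the training split
model = LinearRegression().fit(scaler.transform(X_train), y_train)

r2 = model.score(scaler.transform(X_test), y_test)
print("test R^2:", round(r2, 3))
print("coefficients:", model.coef_)      # the ses feature carries a negative weight
```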

There are likely complex nonlinear relationships between diabetes and factors like income, transit access, and living conditions. To fully capture these, I may need to transform features, add interactions, or use more flexible algorithms. But even simple linear regression got a slight boost from the new features.

The process of improving the model is gradual, requiring testing many ideas. Each iteration provides more insight for getting closer to maximum predictability. My goal now is identifying additional datasets to combine with what I already have – more data and features could further improve the model.

Boosting Model Accuracy: Techniques to Improve R-Squared in Regression Models

Regression modeling is a popular statistical approach for understanding relationships between variables. The R-squared metric indicates how well the model fits the observed data, with higher values meaning more explanatory power.

Recently, I developed a regression model aiming to predict diabetes prevalence rates based on factors like physical inactivity and obesity levels. Despite iteratively tweaking the initial model, the best R-squared value achieved was 0.43. While decent, there was clear room for improvement in predictive accuracy.

In this post, I’ll share techniques I tried and am currently working on that helped significantly boost the R-squared for this project:

  • Incorporating additional relevant predictors correlated with the outcome variable, such as socioeconomic status, food access, education, and healthcare availability. This allowed capturing more variance in the target variable.
  • Applying transformations like logarithmic and Box-Cox to linearize non-linear relationships and correct residual patterns, improving overall model fit. Polynomial terms were also added to account for nonlinear effects.
  • Using regularization techniques like ridge and lasso regression to reduce overfitting by penalizing model complexity. Lasso helped remove redundant variables.
  • Carefully selecting the most predictive subset of explanatory variables through feature selection, eliminating noise-contributing variables.
  • Employing ensemble modeling techniques like random forests to combine multiple models and reduce variance.
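Several of the techniques above can be combined in one pipeline; the sketch below uses synthetic data (a nonlinear effect plus noise predictors) to show polynomial terms for nonlinearity and lasso, with its penalty strength chosen by cross-validation, pruning redundant features:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic stand-in for the county data: one linear effect, one squared
# effect, and several pure-noise predictors lasso should shrink to zero.
rng = np.random.default_rng(8)
X = rng.normal(size=(400, 6))
y = 0.4 * X[:, 0] + 0.3 * X[:, 1] ** 2 + rng.normal(0, 0.2, 400)

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Polynomial terms capture nonlinear effects; the L1 penalty (strength
# chosen by 5-fold cross-validation) removes redundant expanded features.
pipe = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    StandardScaler(),
    LassoCV(cv=5),
)
pipe.fit(X_train, y_train)

r2 = pipe.score(X_test, y_test)
print("test R^2:", round(r2, 3))
```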