Today in class: Linear Regression. The analysis used linear regression with two predictor variables, % INACTIVE and % OBESE, to predict or explain changes in the dependent variable, % DIABETIC.

The model’s R-squared value was 0.34. This value indicates the strength of the linear relationship between the predictors and the dependent variable, suggesting that the predictors account for about 34% of the variation in the outcome.
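For context, here is a minimal sketch of that baseline linear fit (my own reconstruction, assuming the same merged_df dataframe and column names used in the code later in this post):

import pandas as pd
import statsmodels.api as sm

# Baseline model: % DIABETIC regressed on the two predictors alone.
X = sm.add_constant(merged_df[['% INACTIVE', '% OBESE']])
y = merged_df['% DIABETIC']
linear_model = sm.OLS(y, X).fit()
print(linear_model.rsquared)  # about 0.34 on this data, as noted above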

Quadratic Model: The linear regression model can be extended by adding quadratic terms for % INACTIVE and % OBESE, producing a quadratic model. A quadratic model allows for a nonlinear relationship between the predictors and the dependent variable.

%DIABETIC = β₀ + β₁·(%INACTIVE) + β₂·(%OBESE) + β₃·(%INACTIVE)² + β₄·(%OBESE)² + ε

Overfitting: In the context of statistical modelling and machine learning, “overfitting” describes a scenario in which a model is overly complex and fits the training data too closely, capturing noise in that data rather than the underlying pattern.

Overfit models can lead to poor generalization, where the model’s performance degrades when it encounters data it hasn’t seen during training.
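One quick way to check for this (my own sketch, not something from class) is to fit the model on a random training split and compare R-squared on the held-out rows; a large gap between the two suggests overfitting. This again assumes the merged_df dataframe used below:

import statsmodels.api as sm

# Split the data 80/20 into training and held-out rows.
train = merged_df.sample(frac=0.8, random_state=42)
test = merged_df.drop(train.index)

X_train = sm.add_constant(train[['% INACTIVE', '% OBESE']])
X_test = sm.add_constant(test[['% INACTIVE', '% OBESE']])
split_model = sm.OLS(train['% DIABETIC'], X_train).fit()

# Out-of-sample R-squared computed by hand from the held-out predictions.
pred = split_model.predict(X_test)
ss_res = ((test['% DIABETIC'] - pred) ** 2).sum()
ss_tot = ((test['% DIABETIC'] - test['% DIABETIC'].mean()) ** 2).sum()
print('Train R-squared:', split_model.rsquared)
print('Test R-squared:', 1 - ss_res / ss_tot)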

As discussed in class, I went back and tried the quadratic model in Python:

import pandas as pd
import statsmodels.api as sm

# merged_df is the merged dataset from my earlier posts, assumed already loaded.
# Add squared terms for the two predictors.
merged_df['INACTIVE_squared'] = merged_df['% INACTIVE'] ** 2
merged_df['OBESE_squared'] = merged_df['% OBESE'] ** 2

# Assemble the design matrix with both linear and quadratic terms.
X = merged_df[['% INACTIVE', '% OBESE', 'INACTIVE_squared', 'OBESE_squared']]
X = sm.add_constant(X)  # Add a constant term (intercept) to the model
y = merged_df['% DIABETIC']

# Fit ordinary least squares and print the full summary.
model = sm.OLS(y, X).fit()
print(model.summary())

This gave an R-squared value of 0.384.

Following up on my previous post, I tried a log transformation of the dependent variable (% DIABETIC), which gave an R-squared value of 0.353. The increase in R-squared is very small, so the log transformation did not lead to any dramatic improvement. The primary reason for log-transforming the dependent variable was to address heteroscedasticity and better satisfy the assumptions of linear regression.
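For reference, that transformation looks like this (a minimal sketch, pairing the log of % DIABETIC with the original linear predictors; it assumes all % DIABETIC values are positive):

import numpy as np
import statsmodels.api as sm

# Log-transform the dependent variable to stabilize its variance.
X_lin = sm.add_constant(merged_df[['% INACTIVE', '% OBESE']])
y_log = np.log(merged_df['% DIABETIC'])
log_model = sm.OLS(y_log, X_lin).fit()
print(log_model.rsquared)  # about 0.353 on this data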

Moving ahead, I will keep looking for better transformations and models.
