ANOVA and Regression Analysis
Instructions:
- Introduction, which includes a brief background discussion related to the topic and study
- Data collection
- Results of data analysis/testing
- Analysis and interpretation of the results using R
- Conclusions
- References
The report must demonstrate at least two approaches to data analysis. Examples include:
- Correlation, descriptive statistics, ANOVA and regression analysis
- Application of various tests as applicable
- Time series analysis as applicable
My proposal: Based on the provided dataset, an exploratory data analysis will be performed to summarize the data set’s main characteristics with the help of visual methods, understand the data properties, find patterns in data etc. (descriptive statistics, plotting, missing values, covariation, variation etc.). Next, a regression analysis will be performed as an explanation of the causation (to determine the appropriate ratio and to what extent it predicts the risk of Diabetes II).
Solution
Introduction
Diabetes is a disease that causes blood glucose levels to rise higher than normal; this is also termed hyperglycaemia. Type 2 diabetes is currently the most common form of diabetes in the USA.
The data consist of information about patients who were interviewed in a project researching the prevalence of obesity, diabetes, and other cardiovascular risk factors among African American residents of central Virginia.
According to medical science, Diabetes Type II is most strongly associated with obesity, which in turn is strongly associated with cholesterol levels. High cholesterol levels also lead to cardiac diseases/risks.
According to the hypothesis we are testing here, the waist/hip ratio may be a predictor of diabetes and heart disease.
Diabetes Type II is also associated with hypertension; they may both be part of a syndrome called “Syndrome X” (the metabolic syndrome).
Our aim is to classify the patients: a positive diagnosis if glyhb > 7.0 and a negative diagnosis otherwise. We denote these by 1 and 0 respectively.
Our objective/hypotheses would be to statistically test:
- The extent of association of the waist/hip ratio with glyhb levels. We’ll use regression analysis for this task.
- Using logistic regression, how the odds of having Diabetes Type II vary with the waist/hip ratio and other factors.
- Whether the variation is statistically significant.
- The correlation of other important factors with the waist/hip ratio.
- Whether there is any statistically significant difference in the mean waist/hip ratio (and glyhb levels) between male and female subjects.
We are also interested in an exploratory analysis of the features using histograms and summary statistics. We’ll use correlation matrices and scatterplots to understand the structure of the data, and we’ll look at the age-wise and gender-wise variations.
ANALYSIS OF RESULTS/INTERPRETATION
Our objective/hypotheses would be to statistically test:
- The extent of association of the waist/hip ratio with glyhb levels. We’ll use the $R^2$ value to detect what percentage of the variance in glyhb is explained by the waist/hip ratio.
- Using logistic regression, how the odds of having Diabetes Type II vary with the waist/hip ratio and other factors. We’ll check whether the corresponding coefficient $\beta$ is statistically significant, i.e. we’d try to test $H_0: \beta = 0$ against $H_1: \beta \neq 0$.
- Whether there is any statistically significant difference in the mean waist/hip ratio (and glyhb levels) between male and female subjects. We use the usual one-way ANOVA for this case (the usual IID and normal distribution assumptions are taken here).
Procedure :
First, depending on the glyhb values, we categorise the patients into 1 and 0. If it is a positive diagnosis, i.e. glyhb > 7.0, we use 1; otherwise we use 0. Since the response variable is now a 0-1 variable, we run a logistic regression to predict the odds of having Diabetes Type II based on the predictors (both categorical and numerical).
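The coding step can be sketched as follows. The glyhb readings below are made-up values for illustration only, not taken from the study data; the name `diagnosis` is also ours.

```r
# Hypothetical glyhb readings (illustrative values, not from the dataset)
glyhb <- c(4.3, 5.1, 7.0, 7.4, 10.2, NA, 6.9)

# 1 = positive diagnosis (glyhb strictly above 7.0), 0 = negative.
# A missing reading stays missing rather than being counted as negative.
diagnosis <- ifelse(glyhb > 7.0, 1, 0)

table(diagnosis, useNA = "ifany")
```

Note that a reading of exactly 7.0 is coded 0, because the rule uses a strict inequality.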
MODEL/PARAMETERS:
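Our main model to fit is the logistic regression model,
$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k ,$$
where $p = P(Y = 1 \mid x_1, \dots, x_k)$, the $x_j$’s are the predictor variables and $Y$ is the target variable (categorical). In case any of the predictor variables are categorical, we can assign numbers to them as needed.
MODEL BUILDING/PARAMETER CHOOSING :
The main idea is to choose the coefficients $\beta = (\beta_0, \dots, \beta_k)$ such that the likelihood $L(\beta)$ is maximised. R maximises $L(\beta)$ numerically using an iterative algorithm (Fisher scoring, as reported in the output), where
$$L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i} .$$
Here $p_i$ is the model probability for row $i$, $y_i$ is the observed 0/1 response and $n$ is the number of rows in the data.
The values of $\beta$ which maximise $L(\beta)$ are found and shown as REGRESSION COEFFICIENTS in the output.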
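As a rough illustration of what "maximising the likelihood" means, the sketch below fits a one-predictor logistic model twice on simulated data: once by minimising the negative log-likelihood directly with `optim`, and once with `glm`. The data, the true coefficients (-0.5 and 1.2) and the names `x1`, `resp` and `negloglik` are assumptions made for this example only.

```r
set.seed(1)
n    <- 200
x1   <- rnorm(n)                              # simulated predictor
p    <- 1 / (1 + exp(-(-0.5 + 1.2 * x1)))     # true success probabilities
resp <- rbinom(n, 1, p)                       # simulated 0/1 response

# Negative log-likelihood of the logistic model:
# log L = sum( y_i * eta_i - log(1 + exp(eta_i)) ), with eta_i = b0 + b1 * x_i
negloglik <- function(beta) {
  eta <- beta[1] + beta[2] * x1
  -sum(resp * eta - log1p(exp(eta)))
}

opt <- optim(c(0, 0), negloglik, method = "BFGS")            # generic optimiser
fit <- glm(resp ~ x1, family = binomial(link = "logit"))     # Fisher scoring

rbind(optim = opt$par, glm = unname(coef(fit)))
```

The two rows should agree to several decimal places, since both procedures look for the same maximiser.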
The logistic regression yields the following results:
Call:
glm(formula = x$glyhb ~ ., family = binomial(link = "logit"),
data = x)
Deviance Residuals:
Min 1Q Median 3Q Max
-3.6809 -0.3292 -0.2122 -0.1169 3.4382
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -4.0684387 24.6507290 -0.165 0.8689
chol 0.0109447 0.0089258 1.226 0.2201
stab.glu 0.0365823 0.0054762 6.680 2.39e-11 ***
hdl -0.0301874 0.0293913 -1.027 0.3044
ratio -0.1316713 0.2833887 -0.465 0.6422
locationLouisa 0.1374395 0.4789171 0.287 0.7741
age 0.0323905 0.0182758 1.772 0.0763 .
gendermale -0.4467240 0.7331429 -0.609 0.5423
height -0.0410434 0.0916873 -0.448 0.6544
weight 0.0028824 0.0136028 0.212 0.8322
framemedium 0.1762112 0.5426120 0.325 0.7454
framesmall 0.3154585 0.7892797 0.400 0.6894
bp.1s 0.0126584 0.0148998 0.850 0.3956
bp.1d 0.0035380 0.0245465 0.144 0.8854
bp.2s -0.0132029 0.0201231 -0.656 0.5118
bp.2d 0.0233647 0.0363129 0.643 0.5199
waist 0.2140389 0.6143844 0.348 0.7276
hip -0.1672643 0.5325529 -0.314 0.7535
time.ppn 0.0010857 0.0006804 1.596 0.1106
y -6.3039110 27.5933831 -0.228 0.8193
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 333.72 on 396 degrees of freedom
Residual deviance: 160.09 on 377 degrees of freedom
(6 observations deleted due to missingness)
AIC: 200.09
Number of Fisher Scoring iterations: 6
Analysis :
- From the p-values, it is evident that the strongest predictor of the odds of having DM type 2 is, as expected, stab.glu, with an extremely small p-value of 2.39e-11 (***).
- The regression analysis shows that y (i.e. the waist/hip ratio) is NOT a statistically significant factor, in spite of having a seemingly large coefficient of -6.3039110. Its p-value is very high (0.8193), because the standard error (27.59) is much larger than the estimate itself.
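The z values and p-values in the output can be cross-checked from the estimates and standard errors, and the estimates can be turned into odds ratios. The sketch below does this for three of the reported rows, using the usual Wald formulas; the numbers are copied from the output above and the variable names are ours.

```r
# Estimates and standard errors copied from the glm output above
est <- c(stab.glu = 0.0365823, age = 0.0323905, y = -6.3039110)
se  <- c(stab.glu = 0.0054762, age = 0.0182758, y = 27.5933831)

odds_ratio <- exp(est)                    # multiplicative change in odds per one-unit increase
z_value    <- est / se                    # Wald z statistic
p_value    <- 2 * pnorm(-abs(z_value))    # two-sided p-value

round(cbind(odds_ratio, z_value), 4)
signif(p_value, 3)
```

For stab.glu the odds ratio is about 1.037, i.e. each extra unit of stabilised glucose multiplies the odds of a positive diagnosis by roughly 1.04. For y, a one-unit increase is not a realistic change for a ratio of two body measurements, which is one reason its raw coefficient looks large.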
We now run an ordinary least squares regression of glyhb on y (i.e. the waist/hip ratio).
The results yield:
Multiple R-squared | 0.03692 |
Adjusted R-squared | 0.03442 |
So y explains only about 3 to 4% of the variance in glyhb, which again indicates that y is NOT an important factor in determining DM type 2.
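For a regression with a single predictor, $R^2$ is the square of the ordinary correlation coefficient, so the reported value can be compared with the correlation matrix. The sketch below does that arithmetic and then checks the identity on simulated data; `u`, `v` and `fit_uv` are hypothetical names and values used only for the check.

```r
# Reported R-squared from the OLS fit of glyhb on y
r_squared <- 0.03692
sqrt(r_squared)      # implied |correlation|, to compare with the 0.19 in the correlation matrix

# Check of the identity R^2 = r^2 on simulated data
set.seed(2)
u <- rnorm(100)
v <- 0.3 * u + rnorm(100)
fit_uv <- lm(v ~ u)
all.equal(summary(fit_uv)$r.squared, cor(u, v)^2)
```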
ANALYSING GENDER WISE DIFFERENCES IN glyhb LEVELS BY ONE-WAY ANOVA:
We are now interested in testing what the box plots we attached show (i.e. the gender-wise variation in y and glyhb levels):
Source | Degrees of freedom | Sum of squares | Mean Sq | F value | p-value |
Gender | 1 | 0.01 | 0.00717 | 0.056 | 0.813 |
Residuals | 401 | 51.6 | 0.12733 | | |
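The large p-value shows that there is no statistically significant difference between the mean glyhb values of the male and female patients interviewed (these results are obtained by an F-test). Note that, in the code listing, this ANOVA is run after glyhb has been recoded to the 0/1 diagnosis, so the means compared here are the proportions of positive diagnoses in each gender.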
ANALYSING GENDER WISE DIFFERENCES IN waist/hip ratio BY ONE-WAY ANOVA:
Source | Degrees of freedom | Sum of squares | Mean Sq | F value | p-value |
Gender | 1 | 0.2557 | 0.25569 | 55.48 | 5.89e-13 *** |
Residuals | 399 | 1.8388 | 0.00461 | | |
The extremely small p-value shows that there is a statistically significant difference between the mean waist/hip ratios of the male and female patients interviewed (these results are obtained by an F-test).
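The F values and p-values in the two ANOVA tables follow from the mean squares and degrees of freedom: F is the gender mean square divided by the residual mean square, and the p-value is the upper tail of the F distribution. A minimal sketch using the rounded figures from the tables (so the results match only up to rounding):

```r
# Mean squares and degrees of freedom copied from the two ANOVA tables above
f_glyhb <- 0.00717 / 0.12733     # gender effect on glyhb
f_ratio <- 0.25569 / 0.00461     # gender effect on waist/hip ratio

# Upper-tail probabilities of the F distribution
p_glyhb <- pf(f_glyhb, df1 = 1, df2 = 401, lower.tail = FALSE)
p_ratio <- pf(f_ratio, df1 = 1, df2 = 399, lower.tail = FALSE)

c(F_glyhb = f_glyhb, p_glyhb = p_glyhb, F_ratio = f_ratio, p_ratio = p_ratio)
```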
Conclusion :
- stab.glu and glyhb are strongly associated, as is evident from the correlation matrix. The same association was established much earlier by doctors. The extremely small p-value obtained in the logistic regression model confirms the same fact.
- The correlation of glyhb and waist/hip ratio is 0.19, which indicates a low association.
- Despite having a large coefficient, the extremely large p-value indicates that the waist/hip ratio is NOT a significant factor in determining the odds of having DM type 2.
- There is no statistically significant difference between the mean glyhb values of the male and female patients interviewed. This was tested by an ANOVA-based F-test.
- There is a statistically significant difference between the mean waist/hip ratios of the male and female patients interviewed (again by an ANOVA-based F-test). This is expected due to their biologically different features.
- The fitted logistic model has an AIC score of 200.09. Note the decrease in deviance from 333.72 on 396 degrees of freedom (in the null model) to 160.09 on 377 degrees of freedom (in the logistic model).
- Here we used logistic regression, since our response variable was a 0-1 variable, i.e. a categorical variable. The MLEs were found by an iterative algorithm (the default in the R functions).
- As indicated by an OLS (ordinary least squares) regression model, the waist/hip ratio explains only about 3 to 4% of the variance in glyhb levels. So, based on the data, we can conclude that the waist/hip ratio is NOT an important factor in predicting the glyhb level, which is the key to the diagnosis of DM type 2.
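The deviance and AIC figures quoted in the conclusions can be cross-checked against each other. The sketch below computes the likelihood-ratio statistic for the whole model and rebuilds the AIC from the residual deviance. It relies on the standard fact that, for a 0/1 response, the residual deviance equals minus twice the log-likelihood; the variable names are ours.

```r
# Figures copied from the glm output
null_dev  <- 333.72; null_df  <- 396
resid_dev <- 160.09; resid_df <- 377

lr_stat <- null_dev - resid_dev      # likelihood-ratio chi-square statistic
lr_df   <- null_df - resid_df        # number of slope coefficients in the model
lr_p    <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

n_coef <- lr_df + 1                  # slopes plus the intercept
aic    <- resid_dev + 2 * n_coef     # AIC = -2 logLik + 2 * (number of coefficients)

c(lr_stat = lr_stat, lr_df = lr_df, lr_p = lr_p, aic = aic)
```

The rebuilt AIC agrees with the 200.09 reported by R, and the drop in deviance of 173.63 on 19 degrees of freedom is far beyond what chance alone would produce.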
R CODE :
library(readr)
Diabetes_II <- read_csv("~/Desktop/Diabetes II.csv") #path as used by the author
x <- Diabetes_II
#summary statistics
summary(x)
#variability measurement
sd(x$chol, na.rm = TRUE)
sd(x$stab.glu, na.rm = TRUE)
sd(x$hdl, na.rm = TRUE)
#categorical variables
library(ggplot2) #loaded by the author; the plots below use base graphics only
with(x, table(location))
with(x, table(gender))
#histograms
hist(x$chol,prob=T)
hist(x$stab.glu,prob=T)
hist(x$ratio,prob=T)
hist(x$time.ppn,prob=T)
hist(x$bp.1s,prob=T)
hist(x$bp.1d,prob=T)
hist(x$bp.2d,prob=T)
hist(x$bp.2s,prob=T)
hist(x$glyhb,prob=TRUE)
y <- x$waist/x$hip #waist/hip ratio
z <- x$glyhb #stores the original glyhb values
#correlation of waist/hip ratio with glyhb
plot(y, z, xlab = "waist/hip", ylab = "Glycosylated haemoglobin")
abline(lm(z ~ y), col = "red", lwd = 3) #draws regression line
plot(x$chol, z, xlab = "cholesterol", ylab = "Glycosylated haemoglobin")
abline(lm(z ~ x$chol), col = "blue", lwd = 3) #draws regression line
plot(x$stab.glu, z, xlab = "stab.glu", ylab = "Glycosylated haemoglobin")
abline(lm(z ~ x$stab.glu), col = "green", lwd = 3) #draws regression line
df <- data.frame(x$chol, x$stab.glu, x$hdl, z, y)
round(cor(df, use = "pairwise.complete.obs"), 2) #gives correlation matrix
#missing values
x$chol[is.na(x$chol)]<-mean(x$chol,na.rm=T)
x$stab.glu[is.na(x$stab.glu)]<-mean(x$stab.glu,na.rm=T)
x$hdl[is.na(x$hdl)]<-mean(x$hdl,na.rm=T)
x$ratio[is.na(x$ratio)]<-mean(x$ratio,na.rm=T)
x$glyhb[is.na(x$glyhb)]<-mean(x$glyhb,na.rm=T)
x$age[is.na(x$age)]<-mean(x$age,na.rm=T)
x$height[is.na(x$height)]<-mean(x$height,na.rm=T)
x$bp.1s[is.na(x$bp.1s)]<-mean(x$bp.1s,na.rm=T)
x$bp.2s[is.na(x$bp.2s)]<-mean(x$bp.2s,na.rm=T)
x$bp.1d[is.na(x$bp.1d)]<-mean(x$bp.1d,na.rm=T)
x$bp.2d[is.na(x$bp.2d)]<-mean(x$bp.2d,na.rm=T)
#function for finding mode ( there isn’t any inbuilt one in R)
Mode <- function(x) {
ux<- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
x$frame[is.na(x$frame)]<- Mode(x$frame)
y <-x$waist/x$hip #waist/hip ratio
boxplot(x$chol~x$gender)#genderwise variations of chol levels
boxplot(y~x$gender)#genderwise variations of waist/hip ratio
boxplot(x$glyhb~x$gender)#genderwise variations of glyhb levels
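#logistic regression
x <- x[-1] #drops the first column (assumed here to be a subject id)
x <- data.frame(x, y) #adds the waist/hip ratio as column y
x$glyhb <- ifelse(x$glyhb > 7.0, 1, 0) #positive diagnosis = 1, negative = 0
#bare column name on the left, so that "." expands to the remaining columns only
fit <- glm(glyhb ~ ., family = binomial(link = "logit"), data = x)
summary(fit) #shows all the coefficients and their p-values (already included in the analysis)
#ordinary regression of glyhb on waist/hip
L <- lm(z ~ y) #z holds the original glyhb values stored before imputation and recoding
summary(L)
#ANOVA-based F tests for the boxplots
summary(aov(x$glyhb ~ x$gender)) #note: x$glyhb is the 0/1 diagnosis at this point
summary(aov(y ~ x$gender))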
RESULTS OF DATA ANALYSIS &TESTING :
EXPLORATORY ANALYSIS
The dataset contains information about 403 individuals, and each row has the 18 features listed below.
The features are
- chol,
- stab.glu,
- hdl,
- ratio,
- glyhb,
- location*,
- age,
- gender*,
- height,
- weight,
- frame*,
- bp.1s,
- bp.1d,
- bp.2s,
- bp.2d,
- waist,
- hip &
- time.ppn
(The * marked features are categorical in nature.)
We now explore each variable by methods like histograms, summary statistics etc.
chol : [histogram of chol]
hdl : [histogram of hdl]
Summary statistics of the main variables are as follows:
Variable | chol | stab.glu | hdl | time.ppn | ratio |
Min | 78.0 | 48.0 | 12.00 | 5.0 | 1.500 |
Q1 | 179.0 | 81.0 | 38.00 | 90.0 | 3.200 |
Median | 204.0 | 89.0 | 46.00 | 240.0 | 4.200 |
Mean | 207.8 | 106.7 | 50.45 | 341.2 | 4.522 |
Q3 | 230.0 | 106.0 | 59.00 | 517.5 | 5.400 |
Max | 443.0 | 385.0 | 120.00 | 1560.0 | 19.300 |
SD | 44.44 | 53.07 | 17.26 | 309.541 | 1.727 |
Analysing the table of location shows that, among the individuals, 200 were located in Buckingham and the remaining 203 were located in Louisa.
Analysis of gender shows that, among the individuals, 234 (58%) were female and the remaining 169 (42%) were male.
The mean height of the individuals is 66.02 and the SD of height is 3.918.
The mean weight of the individuals is 177.59 and the SD of weight is 40.34.
Now we give the scatterplot of waist/hip ratio and glycosylated haemoglobin.
From the scatterplot, we see a dense cluster of points towards the bottom left, while the rest of the points are sparsely spaced. Now we add the regression line of glycosylated haemoglobin on waist/hip ratio to the scatterplot. In the later sections, we’ll analyse how strong the correlation is and whether it is statistically significant.
From medical knowledge, the variables chol, stab.glu and hdl should also be strongly correlated with glyhb.
Let’s look at the scatterplots.
This scatterplot looks similar to the previous one.
Again, there is a dense cluster near the bottom left corner and the rest of the points are loosely spaced. Here, the regression line seems to be a decent fit. We’ll later analyse how well these regression lines predict, using $R^2$.
Let’s look at the correlation matrix now :
Variables | chol | stab.glu | hdl | glyhb | waist/hip |
chol | 1.00 | 0.15 | 0.19 | 0.25 | 0.10 |
stab.glu | 0.15 | 1.00 | -0.16 | 0.75 | 0.19 |
hdl | 0.19 | -0.16 | 1.00 | -0.15 | -0.17 |
glyhb | 0.25 | 0.75 | -0.15 | 1.00 | 0.19 |
waist/hip | 0.10 | 0.19 | -0.17 | 0.19 | 1.00 |
From the matrix, we see that the association between glyhb and stab.glu is high (0.75), but the association between glyhb and the waist/hip ratio is NOT high (0.19).
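A correlation matrix of this kind, and a significance test for a single correlation, can be sketched as follows on simulated data. The data frame `toy` and its columns `a`, `b` and `cc` are hypothetical; `cor.test` performs the usual t-test of zero correlation.

```r
set.seed(3)
toy <- data.frame(a = rnorm(50), b = rnorm(50))
toy$cc <- 0.8 * toy$a + rnorm(50, sd = 0.5)   # column built to correlate with a
toy$b[c(3, 7)] <- NA                          # introduce a few missing values

# Each pair of columns uses the rows where both values are present
cm <- round(cor(toy, use = "pairwise.complete.obs"), 2)
cm

# t-test of H0: correlation between a and cc is zero
ct <- cor.test(toy$a, toy$cc)
ct$estimate
ct$p.value
```

With `use = "pairwise.complete.obs"` the missing values in one column do not remove rows from the correlations between the other columns.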
We impute the missing values of the continuous variables with the column mean (the zoo package in R can be used for this; the code listing does it directly). Among the categorical variables, only frame has missing values. We replace these missing values by the mode of the column frame.
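The imputation rule can be written compactly as a loop over columns. The sketch below uses a small made-up data frame; the helper name `most_frequent` and all values are assumptions for illustration, not the study data.

```r
# Toy data frame (hypothetical values) with missing entries
toy <- data.frame(
  chol  = c(200, NA, 180, 220),
  hdl   = c(50, 40, NA, 60),
  frame = c("medium", NA, "medium", "small"),
  stringsAsFactors = FALSE
)

# Most frequent non-missing value of a vector
most_frequent <- function(v) {
  v_ok <- v[!is.na(v)]
  uv   <- unique(v_ok)
  uv[which.max(tabulate(match(v_ok, uv)))]
}

for (col in names(toy)) {
  miss <- is.na(toy[[col]])
  if (is.numeric(toy[[col]])) {
    toy[[col]][miss] <- mean(toy[[col]], na.rm = TRUE)   # mean imputation
  } else {
    toy[[col]][miss] <- most_frequent(toy[[col]])        # mode imputation
  }
}
toy
```

Mean imputation keeps the column mean unchanged but shrinks its variance, which is worth keeping in mind when many values are missing.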
The following boxplot represents the gender-wise variation of the waist/hip ratio. Later we’ll use an ANOVA table to assess the statistical significance of the difference it shows.
The following boxplot gives the gender-wise variation of glyhb levels:
REFERENCES
- https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/ : Logistic regression in R
- https://www.r-bloggers.com/create-a-correlation-matrix-in-r/ : Correlation matrices
- http://www.diabetes.org : Medical references and information