ANOVA and Regression Analysis

Instructions: 

  • Introduction, which includes a brief background discussion related to the topic and study
  • Data collection
  • Results of data analysis/testing
  • Analysis and interpretation of the results using R
  • Conclusions
  • References 

The report must demonstrate at least two approaches to data analysis. Examples include:

  1. Correlation, descriptive statistics, ANOVA and regression analysis
  2. Application of various tests as applicable
  3. Time series analysis as applicable 

My proposal: Based on the provided dataset, an exploratory data analysis will be performed to summarize the data set’s main characteristics with the help of visual methods, understand the data properties and find patterns in the data (descriptive statistics, plotting, missing values, covariation, variation etc.). Next, a regression analysis will be performed to examine the association (to determine the appropriate ratio and to what extent it predicts the risk of Diabetes II).

Solution 

Introduction 

Diabetes is a disease in which blood glucose levels rise above normal, a condition also termed hyperglycaemia. Type 2 diabetes is currently the most common form of diabetes in the USA.

The data describe patients who were interviewed in a project studying the prevalence of obesity, diabetes and other cardiovascular risk factors among African American residents of central Virginia.

According to medical science, Diabetes Type II is most strongly associated with obesity, which in turn is strongly associated with cholesterol levels. High cholesterol levels also raise the risk of cardiac disease.

The hypothesis we are testing here is that the waist/hip ratio may be a predictor of diabetes and heart disease.

Diabetes Type II is also associated with hypertension; both may be part of a syndrome known as “Syndrome X”.

Our aim is to classify the patients: a positive diagnosis if glyhb > 7.0 and a negative diagnosis otherwise. We denote these by 1 and 0 respectively.

Our objectives/hypotheses are to test statistically:

  • The extent of association between the waist/hip ratio and glyhb levels. We use regression analysis for this task.
  • How the odds of having Diabetes Type II vary with the waist/hip ratio and other factors, using logistic regression.
  • Whether that variation is statistically significant.
  • The correlation of other important factors with the waist/hip ratio.
  • Whether there is a statistically significant difference in mean waist/hip ratio (and glyhb levels) between male and female subjects.

We are also interested in exploratory analysis of the features using histograms and summary statistics. We use correlation matrices and scatterplots to understand the structure of the data, and we look at the age-wise and gender-wise variations.

ANALYSIS OF RESULTS/INTERPRETATION 

Our objectives/hypotheses are to test statistically:

  • The extent of association between the waist/hip ratio and glyhb levels. We use the $R^2$ value to find what percentage of the variance in glyhb is explained by the waist/hip ratio.
  • How the odds of having Diabetes Type II vary with the waist/hip ratio and other factors, using logistic regression. We check whether each coefficient $\beta_j$ is statistically significant, i.e. we test $H_0: \beta_j = 0$ against $H_1: \beta_j \neq 0$.
  • Whether there is a statistically significant difference in mean waist/hip ratio (and glyhb levels) between male and female subjects. We use the usual one-way ANOVA for this (with the usual IID and normal-distribution assumptions).

Procedure :

First, depending on the glyhb value, we code each patient as 1 or 0: 1 for a positive diagnosis (glyhb > 7.0) and 0 otherwise. Since the response is now a 0-1 variable, we run a logistic regression to predict the odds of having Diabetes Type II from the predictors (both categorical and numerical).
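The recoding step can be sketched as follows. The glyhb readings here are made up for illustration; only the 7.0 cut-off comes from the text.

```r
# Hypothetical glyhb readings (illustrative values, not taken from the dataset)
glyhb <- c(4.3, 5.6, 7.0, 7.4, 10.2, NA)

# Positive diagnosis (1) if glyhb > 7.0, negative (0) otherwise;
# ifelse() leaves missing readings as NA
diagnosis <- ifelse(glyhb > 7.0, 1, 0)

print(diagnosis)
print(table(diagnosis, useNA = "ifany"))
```

A reading of exactly 7.0 is coded 0, because the rule uses a strict inequality.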

MODEL/PARAMETERS:

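Our main model to fit is the logistic regression model,

$$\log\left(\frac{p}{1-p}\right) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k,$$

where $p = P(Y = 1 \mid x_1, \dots, x_k)$, the $x_j$'s are the predictor variables and $Y$ is the target variable (categorical). If any of the predictor variables are categorical, we can code them numerically as needed.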

MODEL BUILDING/PARAMETER CHOOSING :

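The main idea is to choose the coefficients $\beta = (\beta_0, \beta_1, \dots, \beta_k)$ such that the likelihood is maximised. R numerically maximises $L(\beta)$ using an iterative algorithm, where

$$L(\beta) = \prod_{i=1}^{n} p_i^{y_i} (1 - p_i)^{1 - y_i}$$

($n$ is the number of rows in the data, $y_i$ is the 0-1 response in row $i$ and $p_i$ is the model probability that $y_i = 1$).

The values of $\beta$ which maximise $L(\beta)$ are found and shown as REGRESSION COEFFICIENTS in the output.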
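To make the idea of numerical maximisation concrete, here is a minimal sketch on simulated data (an assumption for illustration, not the diabetes dataset). It minimises the negative log-likelihood with the general-purpose optimiser optim() and compares the result with glm(). Note that glm() itself uses Fisher scoring rather than optim(); the two should agree closely because they maximise the same likelihood.

```r
set.seed(1)

# Simulated data (illustrative only)
n  <- 500
x1 <- rnorm(n)
p  <- 1 / (1 + exp(-(-0.5 + 1.2 * x1)))   # true success probabilities
yy <- rbinom(n, size = 1, prob = p)       # 0/1 response

# Negative log-likelihood of the logistic model for b = (b0, b1)
neg_loglik <- function(b) {
  eta <- b[1] + b[2] * x1                 # linear predictor
  -sum(yy * eta - log(1 + exp(eta)))      # minus the Bernoulli log-likelihood
}

# Generic numerical optimiser started from (0, 0)
opt <- optim(c(0, 0), neg_loglik, method = "BFGS")

# Reference fit with glm(), which uses Fisher scoring
fit_glm <- glm(yy ~ x1, family = binomial(link = "logit"))

print(opt$par)
print(coef(fit_glm))
```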

The logistic regression yields the following results:

Call:
glm(formula = x$glyhb ~ ., family = binomial(link = "logit"), data = x)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6809  -0.3292  -0.2122  -0.1169   3.4382

Coefficients:
                  Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     -4.0684387  24.6507290   -0.165    0.8689
chol             0.0109447   0.0089258    1.226    0.2201
stab.glu         0.0365823   0.0054762    6.680  2.39e-11 ***
hdl             -0.0301874   0.0293913   -1.027    0.3044
ratio           -0.1316713   0.2833887   -0.465    0.6422
locationLouisa   0.1374395   0.4789171    0.287    0.7741
age              0.0323905   0.0182758    1.772    0.0763 .
gendermale      -0.4467240   0.7331429   -0.609    0.5423
height          -0.0410434   0.0916873   -0.448    0.6544
weight           0.0028824   0.0136028    0.212    0.8322
framemedium      0.1762112   0.5426120    0.325    0.7454
framesmall       0.3154585   0.7892797    0.400    0.6894
bp.1s            0.0126584   0.0148998    0.850    0.3956
bp.1d            0.0035380   0.0245465    0.144    0.8854
bp.2s           -0.0132029   0.0201231   -0.656    0.5118
bp.2d            0.0233647   0.0363129    0.643    0.5199
waist            0.2140389   0.6143844    0.348    0.7276
hip             -0.1672643   0.5325529   -0.314    0.7535
time.ppn         0.0010857   0.0006804    1.596    0.1106
y               -6.3039110  27.5933831   -0.228    0.8193

Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 333.72  on 396  degrees of freedom
Residual deviance: 160.09  on 377  degrees of freedom
(6 observations deleted due to missingness)

AIC: 200.09

Number of Fisher Scoring iterations: 6
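The two deviance lines in the output can be turned into an overall likelihood-ratio test of the fitted model against the null model. The sketch below only uses the figures printed above; the pseudo R-squared is an extra summary that is not part of the original output and is included as an illustration.

```r
# Deviance figures copied from the glm() summary
null_dev  <- 333.72; null_df  <- 396
resid_dev <- 160.09; resid_df <- 377

# Likelihood-ratio statistic and its degrees of freedom
lr_stat <- null_dev - resid_dev
lr_df   <- null_df - resid_df

# Upper-tail chi-square probability for the drop in deviance
p_value <- pchisq(lr_stat, df = lr_df, lower.tail = FALSE)

# Deviance-based (McFadden-type) pseudo R-squared; not in the original output
pseudo_r2 <- 1 - resid_dev / null_dev

print(c(lr_stat = lr_stat, lr_df = lr_df, p_value = p_value,
        pseudo_r2 = pseudo_r2))
```

With a fitted model object, anova(fit, test = "Chisq") reports the same kind of deviance comparison term by term.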

Analysis :

  1. The p-values make it evident that the strongest predictor of the odds of having DM type 2 is, as expected, stab.glu, with an extremely small p-value of 2.39e-11 (***).
  2. The regression analysis shows that y (i.e. waist/hip ratio) is NOT a statistically significant factor, in spite of its seemingly large coefficient of -6.3039110. Its standard error (27.5933831) is far larger than the estimate, which gives the very high p-value of 0.8193.
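Logistic regression coefficients are on the log-odds scale, so exponentiating them gives odds ratios. The sketch below does this for two rows of the coefficient table, with approximate 95% Wald intervals; the choice of rows and the interval method are illustrative assumptions, not part of the original analysis.

```r
# Estimates and standard errors copied from the coefficient table
est <- c(stab.glu = 0.0365823, age = 0.0323905)
se  <- c(stab.glu = 0.0054762, age = 0.0182758)

# Odds ratio per one-unit increase in each predictor
odds_ratio <- exp(est)

# Approximate 95% Wald intervals on the odds-ratio scale
lower <- exp(est - qnorm(0.975) * se)
upper <- exp(est + qnorm(0.975) * se)

print(round(cbind(odds_ratio, lower, upper), 4))
```

An interval that lies entirely above 1 corresponds to a predictor that is significant at the 5% level, while an interval that contains 1 corresponds to one that is not. With the fitted object, exp(coef(fit)) gives the odds ratios for all predictors at once.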

We now run an ordinary least squares regression of glyhb on y (i.e. waist/hip ratio).

The results yield :

Multiple R-squared 0.03692
Adjusted R-squared 0.03442

So y explains only about 3 to 4% of the variance in glyhb, which again indicates that y is at most a weak predictor of glyhb and hence of DM type 2.
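In a regression with a single predictor, R-squared equals the squared Pearson correlation, which is why the reported 0.03692 agrees with the correlation of about 0.19 between glyhb and waist/hip ratio (0.19 squared is 0.0361; the small gap comes from rounding). A minimal sketch on simulated data (illustrative, not the diabetes data) checks the identity:

```r
set.seed(2)

# Simulated predictor and response (illustrative only)
u <- rnorm(200)
v <- 0.2 * u + rnorm(200)

fit_lm <- lm(v ~ u)
r2     <- summary(fit_lm)$r.squared   # R-squared from the regression
r      <- cor(u, v)                   # Pearson correlation

print(c(r_squared = r2, cor_squared = r^2))

# Same arithmetic with the rounded correlation reported in the document
print(0.19^2)
```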

ANALYSING GENDER WISE DIFFERENCES IN glyhb LEVELS BY ONE-WAY ANOVA:

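We are now interested in testing the differences shown in the box plots (i.e. gender-wise variation in y and glyhb levels):

            Degrees of freedom   Sum of squares   Mean Sq   F value   p-value
Gender                       1             0.01   0.00717     0.056     0.813
Residuals                  401             51.6   0.12733

The large p-value shows that there is no statistically significant difference between the mean glyhb levels of the male and female patients interviewed (these results are obtained by an F-test). Note that in the script this ANOVA is run after glyhb has been recoded to 0/1, so the table compares the proportion of positive diagnoses between genders rather than the raw glyhb levels.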

ANALYSING GENDER WISE DIFFERENCES IN waist/hip ratio BY ONE-WAY ANOVA:

            Degrees of freedom   Sum of squares   Mean Sq   F value   p-value
Gender                       1           0.2557   0.25569     55.48   5.89e-13 ***
Residuals                  399           1.8388   0.00461

The extremely small p-value shows that there is a statistically significant difference between the mean waist/hip ratio of the male and female patients interviewed (these results are obtained by an F-test).
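The F value and p-value in an ANOVA table follow directly from the mean squares and degrees of freedom. The sketch below recomputes them from the rounded figures in the waist/hip ratio table, so the recomputed F differs slightly from the printed 55.48.

```r
# Values copied from the waist/hip ratio ANOVA table
ms_gender <- 0.25569; df_gender <- 1
ms_resid  <- 0.00461; df_resid  <- 399

# F statistic = between-group mean square / residual mean square
f_stat <- ms_gender / ms_resid

# Upper-tail probability of the F(1, 399) distribution
p_value <- pf(f_stat, df_gender, df_resid, lower.tail = FALSE)

print(c(F = f_stat, p = p_value))
```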

…….. 

Conclusion :

  • stab.glu and glyhb are strongly associated, as is evident from the correlation matrix. This association has long been recognised by doctors, and the extremely small p-value obtained in the logistic regression model confirms it.
  • The correlation between glyhb and waist/hip ratio is 0.19, which indicates low association.
  • Despite its large coefficient, the very large p-value indicates that waist/hip ratio is NOT a significant factor in determining the odds of having DM type 2.
  • There is no statistically significant difference between the mean glyhb levels of the male and female patients interviewed. This was tested by an ANOVA-based F-test.
  • There is a statistically significant difference between the mean waist/hip ratio of the male and female patients interviewed (also by an ANOVA-based F-test). This is expected given their biological differences.
  • The fitted logistic model has an AIC of 200.09. Note the decrease in deviance from 333.72 on 396 degrees of freedom (null model) to 160.09 on 377 degrees of freedom (logistic model).
  • We used logistic regression because our response variable was a 0-1 variable, i.e. a categorical variable. The MLEs were found by an iterative algorithm (the default in R functions).
  • As indicated by an OLS (ordinary least squares) regression model, waist/hip ratio explains only about 3 to 4% of the variance in glyhb levels. So, based on the data, we conclude that the waist/hip ratio is only a weak predictor of the glyhb level, which is the key to diagnosis of DM type 2.

R code :

library(readr)
library(ggplot2)

# Load the data (path as used by the author)
Diabetes_II <- read_csv("~/Desktop/Diabetes II.csv")
x <- Diabetes_II

# Summary statistics
summary(x)

# Variability measurement (standard deviations, ignoring missing values)
sd(x$chol, na.rm = TRUE)
sd(x$stab.glu, na.rm = TRUE)
sd(x$hdl, na.rm = TRUE)

# Categorical variables
with(x, table(location))
with(x, table(gender))

# Histograms (on the density scale)
hist(x$chol, prob = TRUE)
hist(x$stab.glu, prob = TRUE)
hist(x$ratio, prob = TRUE)
hist(x$time.ppn, prob = TRUE)
hist(x$bp.1s, prob = TRUE)
hist(x$bp.1d, prob = TRUE)
hist(x$bp.2d, prob = TRUE)
hist(x$bp.2s, prob = TRUE)
hist(x$glyhb, prob = TRUE)

y <- x$waist / x$hip  # waist/hip ratio
z <- x$glyhb          # stores glyhb values

# Scatterplots of glyhb against waist/hip ratio, chol and stab.glu,
# each with its fitted regression line
plot(y, z, xlab = "waist/hip", ylab = "Glycosylated haemoglobin")
abline(lm(z ~ y), col = "red", lwd = 3)

plot(x$chol, z, xlab = "cholesterol", ylab = "Glycosylated haemoglobin")
abline(lm(z ~ x$chol), col = "blue", lwd = 3)

plot(x$stab.glu, z, xlab = "stab.glu", ylab = "Glycosylated haemoglobin")
abline(lm(z ~ x$stab.glu), col = "green", lwd = 3)

# Correlation matrix of the main variables
df <- data.frame(x$chol, x$stab.glu, x$hdl, z, y)
round(cor(df, use = "pairwise.complete.obs"), 2)

# Missing values: replace NAs in the continuous variables by the column mean
num_vars <- c("chol", "stab.glu", "hdl", "ratio", "glyhb", "age", "height",
              "bp.1s", "bp.2s", "bp.1d", "bp.2d")
for (v in num_vars) {
  x[[v]][is.na(x[[v]])] <- mean(x[[v]], na.rm = TRUE)
}

# Function for finding the mode (base R has no built-in one for this)
Mode <- function(v) {
  uv <- unique(v[!is.na(v)])             # distinct non-missing values
  uv[which.max(tabulate(match(v, uv)))]  # most frequent value
}

# Replace missing frame values by the most frequent category
x$frame[is.na(x$frame)] <- Mode(x$frame)

y <- x$waist / x$hip  # waist/hip ratio

boxplot(x$chol ~ x$gender)   # gender-wise variation of chol levels
boxplot(y ~ x$gender)        # gender-wise variation of waist/hip ratio
boxplot(x$glyhb ~ x$gender)  # gender-wise variation of glyhb levels

# Logistic regression
x <- x[-1]             # drop the first column
x <- data.frame(x, y)  # add the waist/hip ratio as column y
x$glyhb <- ifelse(x$glyhb > 7.0, 1, 0)  # positive diagnosis = 1, negative = 0

fit <- glm(x$glyhb ~ ., family = binomial(link = "logit"), data = x)
summary(fit)  # coefficients and p-values, as already included in the analysis

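# Ordinary regression of glyhb on waist/hip ratio
# (z holds the glyhb values stored before imputation and recoding)
L <- lm(z ~ y)
summary(L)

# ANOVA-based F-tests for the boxplots
# (x$glyhb is the 0/1 diagnosis indicator at this point)
summary(aov(x$glyhb ~ x$gender))
summary(aov(y ~ x$gender))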

RESULTS OF DATA ANALYSIS & TESTING :

EXPLORATORY ANALYSIS 

The dataset contains information about 403 individuals, and each row has 18 features.

The features are

  1. chol,
  2. stab.glu,
  3. hdl,
  4. ratio,
  5. glyhb,
  6. location*,
  7. age,
  8. gender*,
  9. height,
  10. weight,
  11. frame*,
  12. bp.1s,
  13. bp.1d,
  14. bp.2s,
  15. bp.2d,
  16. waist,
  17. hip &
  18. time.ppn

(The * marked features are categorical in nature.)

We now explore each variable by methods like histograms, summary statistics etc.

[Figure: chol]

[Figure: hdl]

Summary statistics of the main variables are as follows :

Variable    chol   stab.glu     hdl   time.ppn    ratio
Min         78.0       48.0   12.00        5.0    1.500
Q1         179.0       81.0   38.00       90.0    3.200
Median     204.0       89.0   46.00      240.0    4.200
Mean       207.8      106.7   50.45      341.2    4.522
Q3         230.0      106.0   59.00      517.5    5.400
Max        443.0      385.0  120.00     1560.0   19.300
SD         44.44      53.07   17.26    309.541    1.727

Analysing the location table shows that, among the individuals, 200 were located in Buckingham and the remaining 203 in Louisa.

Analysis of gender shows that, among the individuals, 234 (58%) were female and the remaining 169 (42%) were male.

Mean height of the individuals is 66.02; SD of height is 3.918.

Mean weight of the individuals is 177.59; SD of weight is 40.34.

Now we give the scatterplot of waist/hip ratio and glycosylated haemoglobin.

From the scatterplot, we see a large cluster of points towards the bottom left, while the rest of the points are sparsely spaced. We now add the regression line of glycosylated haemoglobin on waist/hip ratio to the scatterplot. In the later sections, we analyse how strong the correlation is and whether it is statistically significant.

From medical knowledge, the variables chol, stab.glu and hdl should also be strongly correlated with glyhb.

Let’s look at the scatterplots.

This scatterplot looks similar to the previous one.

Again, there is a dense cluster near the bottom left corner and the rest of the points are loosely spaced. Here, the regression line seems to be a decent fit. We’ll later analyse how well these regression lines predict, using $R^2$.

Let’s look at the correlation matrix now :

Variables    chol   stab.glu     hdl   glyhb   waist/hip
chol         1.00       0.15    0.19    0.25        0.10
stab.glu     0.15       1.00   -0.16    0.75        0.19
hdl          0.19      -0.16    1.00   -0.15       -0.17
glyhb        0.25       0.75   -0.15    1.00        0.19
waist/hip    0.10       0.19   -0.17    0.19        1.00

From the matrix, we see that the association between glyhb and stab.glu is high (0.75), but the association between glyhb and waist/hip ratio is NOT high (0.19).
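A correlation matrix shows the strength of association but not whether it is statistically significant. As a hedged sketch on simulated data with a few missing values (illustrative only, not the diabetes data), cor() with pairwise-complete observations can be paired with cor.test(), which tests the null hypothesis of zero correlation:

```r
set.seed(3)

# Simulated measurements with a few missing values (illustrative only)
a <- rnorm(300)
b <- 0.2 * a + rnorm(300)
b[c(5, 50, 150)] <- NA

# Correlation using only the rows where both values are present
r <- cor(a, b, use = "pairwise.complete.obs")

# t-test of H0: correlation = 0; cor.test() drops incomplete pairs itself
ct <- cor.test(a, b)

print(r)
print(ct$p.value)
```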

We impute the missing values of the continuous variables by their column means (the zoo package in R provides na.aggregate for this; the script does it directly). Among the categorical variables, only frame has missing values; we replace them by the mode of the column frame.

The following boxplot represents the gender-wise variation of waist/hip ratio. Later we use the ANOVA table to assess the statistical significance of this difference.

The following boxplot gives the gender-wise variation of glyhb levels :

REFERENCES

  1. Logistic regression in R: https://www.analyticsvidhya.com/blog/2015/11/beginners-guide-on-logistic-regression-in-r/
  2. Correlation matrices: https://www.r-bloggers.com/create-a-correlation-matrix-in-r/
  3. Medical references and information: http://www.diabetes.org