Applied Statistics Project Report


Executive Summary
Type II diabetes is a condition in which the body becomes resistant to the usual effects of insulin. It has
several precursors. One of them is insulin resistance, a condition in which the person’s body fails to respond
to insulin; as this condition worsens (with obesity as well as inactivity), the likelihood of developing
type II diabetes increases.
We investigate the relationships between physical and genetic characteristics of obese patients and their status
as insulin resistant, insulin sensitive, or type-II diabetic. In particular, we address the following questions:
• Which clinical factors, e.g. abdominal fat, waist to hip ratio, body mass index and waist circumference,
are related to insulin sensitivity?
• Are certain genes (and sets of genes) expressed differently in the different metabolic groups?
• Are certain genes related to waist circumference?
• Given an obese patient’s physical measurements and genetic data, can we predict whether they are
insulin sensitive or resistant?

Background
Insulin is a hormone which plays a key role in the regulation of blood glucose levels. Insulin resistance is a
pathological condition in which cells fail to respond normally to the hormone insulin. In people with insulin
resistance, the muscles and the liver resist the action of insulin, so the body must produce higher amounts
to keep the blood glucose levels within a normal range. It is more common in people with a family history of
diabetes or people who are overweight, particularly around the stomach area. A person with insulin resistance
has a greater risk of developing Type II diabetes and heart disease. Nowadays, type II diabetes and obesity
are increasingly affecting human populations around the world.

Problem
We investigate a number of scientific questions.
1. The question of which clinical measurements (e.g. weight, BMI, abdomen fat) are most closely associated
with risk of being insulin resistant. It is believed that people who are ‘apple-shaped’ and have a lot
of visceral fat have an elevated risk of developing insulin resistance, which is a precursor to type II
diabetes. We were tasked with examining the data available and determining whether the evidence
at hand supports this idea. In statistical terms, the question is whether the typical values of these
measurements differed between the OIR and OIS groups, and whether these differences can be said to
be statistically significant.
2. The question of whether certain individual genes are expressed differently in the different metabolic
groups. We had reason to believe, based on literature, that certain genes were associated with insulin
sensitivity. The statistical problem here was similar to the last case: the question was whether the
expression levels differed between the groups, and whether the differences were statistically significant.
In this case, there was the added complication of a large-scale multiple testing problem.
3. The question of whether the expression levels of certain individual genes are related to waist circumference.
In statistical terms, this is a regression problem.
4. The question of whether two particular sets of genes, the q-arm and p-arm genes on the sixth chromosome,
are expressed differently in the OIR and OIS groups, as suggested by our NUTM collaborators based
on their literature review. In statistical terms, this involved performing a gene-set test on each of the
two sets of genes.
5. The question of whether we can predict the status of a nondiabetic obese patient as insulin resistant or
insulin sensitive given their genetic data and physical measurements. In statistical terms, we want to
see whether it is feasible to classify obese patients into insulin sensitive and insulin resistant groups
given the data we have available.
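
The per-gene comparison with a multiple-testing correction described in questions 1 and 2 can be sketched as follows. This is a minimal illustration on simulated data, not the analysis of the study data: the expression matrix, the group sizes and the choice of a Welch t-test with a Benjamini-Hochberg adjustment are all assumptions made for the example.

```r
# Simulated expression matrix: rows are genes, columns are patients.
# All values are invented for illustration; none come from the study data.
set.seed(42)
n_genes <- 200
group   <- factor(rep(c("OIR", "OIS"), each = 10))
expr    <- matrix(rnorm(n_genes * length(group)), nrow = n_genes)

# Shift the first 10 genes upwards in the OIR group so that some true
# differences exist.
expr[1:10, group == "OIR"] <- expr[1:10, group == "OIR"] + 2

# One Welch two-sample t-test per gene.
p_raw <- apply(expr, 1, function(x) t.test(x ~ group)$p.value)

# Benjamini-Hochberg adjustment (assumed choice) controls the false
# discovery rate across the 200 tests.
p_adj <- p.adjust(p_raw, method = "BH")

# Number of genes flagged at the 5% level before and after adjustment.
c(raw = sum(p_raw < 0.05), adjusted = sum(p_adj < 0.05))
```

Adjusted p-values are never smaller than the raw ones, so the adjusted count can only stay the same or shrink.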

Solution:

Introduction

Human health is affected by many factors. In our data set, we focus on how socio-economic conditions, income, and diet affect diabetes and the level of cholesterol.

This task shows how useful machine learning methods can be in assessing human health. Based on a person’s lifestyle, one can estimate with reasonable accuracy the potential risk of diabetes mellitus or of elevated cholesterol.

In this work, several techniques of machine learning were applied:

  • correlation analysis
  • linear regression
  • decision tree
  • random forests
  • association rules

I used several approaches in order to look at health problems from different angles. In building the analysis, I tried to balance the accuracy of the models against how easily they can be explained.

As a result, I obtained a fairly accurate classification model and identified relationships between the variables.

Correlation

Our data set contains categorical variables. Nevertheless, we compute pairwise correlations on their numeric codes to get a first picture of how the variables are related.
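
As a rough sketch of this step, the snippet below computes a correlation matrix on coded variables. The data are simulated and the category codes are hypothetical; only the column names echo the survey. The rank-based (Spearman) version is shown next to Pearson as one common option for ordered codes, not as the method used in this report.

```r
# Simulated stand-in for the survey extract; values are NOT from npa2011.csv.
set.seed(1)
n <- 200
toy <- data.frame(
  AGEC     = sample(18:80, n, replace = TRUE),
  BDYMSQ04 = sample(1:3, n, replace = TRUE),  # hypothetical 3-level code
  DIETRDI  = sample(1:2, n, replace = TRUE)   # hypothetical 2-level code
)

# Pearson treats the codes as ordinary numbers.
pearson <- cor(toy, method = "pearson")

# Spearman uses only the ordering of the codes.
spearman <- cor(toy, method = "spearman")

round(pearson, 2)
round(spearman, 2)
```

For codes without a natural order, neither coefficient has a clear meaning, so such entries of the matrix should be read with caution.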

Correlation between variables

We see a positive correlation between BDYMSQ04 and AGEC, DIETRDI and AGEC, DIETRDI and BDYMSQ04, HCHOLBC and DIABBC.

We see a negative correlation between DIETQ14 and AGEC, DIETQ14 and BDYMSQ04, DIABBC and AGEC, HCHOLBC and AGEC.

Analysis of the network correlation structure:

The correlations show that the variables are related, which suggests a relationship between patients’ lifestyle and illness. Below we examine how a patient’s lifestyle affects their health.

Linear Regression

Analyze DIABBC

Linear regression indicates a strong association between diabetes mellitus and the Index of Relative Socio-Economic Disadvantage.
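
To illustrate how such an association is read off a fitted model, here is a minimal sketch on simulated data. The columns `IRSD` and `DIAB` are hypothetical stand-ins (a disadvantage quintile and a 0/1 diabetes indicator), and the logistic model is shown only as a common alternative for a binary outcome, not as part of this analysis.

```r
# Simulated data; names are hypothetical stand-ins, values are invented.
set.seed(7)
n <- 300
toy <- data.frame(
  IRSD = sample(1:5, n, replace = TRUE),    # hypothetical disadvantage quintile
  AGEC = sample(18:80, n, replace = TRUE)
)

# Hypothetical 0/1 indicator: 1 = has diabetes.
p <- plogis(-3 + 0.03 * toy$AGEC - 0.3 * toy$IRSD)
toy$DIAB <- rbinom(n, 1, p)

# Linear model on the coded outcome, as in the text.
fit_lm <- lm(DIAB ~ IRSD + AGEC, data = toy)

# Logistic regression, a common alternative for a binary outcome.
fit_glm <- glm(DIAB ~ IRSD + AGEC, data = toy, family = binomial)

# Estimate, standard error, test statistic and p-value per predictor.
summary(fit_lm)$coefficients
summary(fit_glm)$coefficients
```

The sign of each estimate gives the direction of the association, and the last column gives the p-value for that predictor.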

Analyze HCHOLBC

The data show a correlation between blood pressure, diabetes, age and diet. From the data we can infer that the risk of diabetes increases with age, and that it increases further if a person does not adhere to a diet. In addition, there is a relationship between diabetes mellitus and blood pressure.

Decision tree

Analyze DIABBC

The first split of the tree is on the variable “Equivalised income of household: deciles”. If this decile is greater than 3, then with a probability of 73% the person has never been told they have diabetes mellitus. People with a low income who are on a diet, by contrast, are likely to suffer from diabetes mellitus.
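
How a first split like this is read from a fitted tree can be sketched with `rpart` on a small deterministic toy data set. The columns `ONDIET` and `DIAB` and all values are invented for the example; only `INCDEC` echoes a survey variable.

```r
library(rpart)

# Deterministic toy data; values are invented for illustration.
toy <- data.frame(
  INCDEC = rep(1:10, each = 20),               # income decile
  ONDIET = factor(rep(c("no", "yes"), 100))    # hypothetical diet flag
)

# Outcome driven by income decile, with every 10th label flipped as noise.
status <- ifelse(toy$INCDEC > 3, "never", "current")
flip <- seq_len(nrow(toy)) %% 10 == 0
status[flip] <- ifelse(status[flip] == "never", "current", "never")
toy$DIAB <- factor(status)

fit <- rpart(DIAB ~ INCDEC + ONDIET, data = toy, method = "class")

# The first row of fit$frame describes the root node, including the
# variable used for the first split.
root_var <- as.character(fit$frame$var[1])
root_var

# Printing the tree lists each node with its split rule and class shares.
print(fit)
```

The class shares printed for each node are what the percentages quoted for a branch refer to.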

Analyze HCHOLBC

If a person is not currently on a diet, or the diet question does not apply to them, then they have most likely never been told that they have high cholesterol.

Random forest

Analyze DIABBC

multiclass.aunu         acc        mmce
     0.78495927  0.94897959  0.05102041

On the held-out half of the data, the random forest predicts diabetes status (DIABBC) with about 95% accuracy (acc = 0.949).

Analyze HCHOLBC

multiclass.aunu         acc        mmce
      0.7292729   0.8841343   0.1158657

On the held-out half of the data, the random forest predicts high-cholesterol status (HCHOLBC) with about 88% accuracy (acc = 0.884).
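
The acc and mmce columns reported above are complements of each other (acc + mmce = 1). The sketch below computes both from a confusion table, together with the accuracy of always predicting the most frequent class. The labels are invented for the example and are not the survey results.

```r
# Toy labels, invented for illustration: 90 cases of class "5", 10 of class "1".
truth <- factor(c(rep("5", 90), rep("1", 10)))
pred  <- factor(c(rep("5", 88), rep("1", 2), rep("5", 6), rep("1", 4)),
                levels = levels(truth))

# Confusion table: rows are true classes, columns are predicted classes.
conf_mat <- table(truth = truth, predicted = pred)

# Accuracy is the share of cases on the diagonal; mmce is its complement.
acc  <- sum(diag(conf_mat)) / sum(conf_mat)
mmce <- 1 - acc

# Accuracy obtained by always predicting the most frequent class.
baseline <- max(table(truth)) / length(truth)

c(acc = acc, mmce = mmce, baseline = baseline)
```

In this toy case the accuracy is 0.92 against a baseline of 0.90, which shows why a high accuracy on an unbalanced outcome is best compared with the majority-class share.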

Analyze important variables

Analyze DIABBC

Analyze HCHOLBC

For classifying both diabetes and cholesterol status, the most important variables are the person’s age and the questions about the person’s diet.

Next, we look for rules by which problems with a person’s health can be identified.

If a person has problems with diabetes mellitus:

If a person has problems with high cholesterol:
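
Rules of this kind are scored by support and confidence, the two thresholds set in the `apriori` calls of the script. The sketch below computes these measures, plus lift, by hand for a single rule. The indicator columns and their values are invented for the example.

```r
# Toy 0/1 indicator data, invented for illustration.
toy <- data.frame(
  not_on_diet = c(1, 1, 1, 1, 0, 0, 1, 0, 1, 0),
  high_chol   = c(1, 1, 0, 1, 0, 0, 1, 0, 0, 1)
)

# Rule under study: {not_on_diet} => {high_chol}
lhs <- toy$not_on_diet == 1
rhs <- toy$high_chol == 1

# Support: share of all records in which both sides hold.
support <- mean(lhs & rhs)

# Confidence: share of left-hand-side records in which the right side holds.
confidence <- sum(lhs & rhs) / sum(lhs)

# Lift: confidence relative to the overall frequency of the right side.
lift <- confidence / mean(rhs)

c(support = support, confidence = confidence, lift = lift)
```

A lift above 1 means the right-hand side occurs more often among records matching the left-hand side than in the data as a whole.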

Conclusion:

A person’s health is affected by their lifestyle as well as their environment. On the basis of questionnaire data, it is possible to identify with a high degree of accuracy whether a person has problems with diabetes or cholesterol.

For example, it turned out that if a person has consulted a specialist about recommendations for eating vegetables and so on, but is not on a diet and does not adhere to the recommendations, then such a person tends to have problems with cholesterol.

Similarly, if a person does not follow the recommendations for a healthy diet and at the same time has a high level of cholesterol, then this person most likely has diabetes mellitus.

As we can see, these are simple logical rules that can be interpreted and applied to determine the risk of diabetes mellitus or high cholesterol. 

npa.R

setwd("P:/R/AS/6")

library(data.table)

library(dplyr)

library(ggplot2)

library(ClustOfVar)

library(qgraph)

library(corrplot)

df = fread("npa2011.csv", stringsAsFactors = TRUE)

# Keep only the variables used in the analysis
df = df[, .(AGEC, SF2SA1QN, INCDEC, BDYMSQ04, DIASTOL, DIETQ12,
            DIETQ14, DIETQ5, DIETQ8, DIETRDI, DIABBC, HCHOLBC)]

sum(is.na(df))

str(df)

summary(df)

# Check correlation between variables

df.correlation = cor(df)

corrplot(df.correlation)

# Build cluster on all data

model1 = hclustvar(df.correlation)

summary(model1)

plot(model1)

qgraph(df.correlation, layout = "spring")

# Linear model

DIABBC.linearmodel = lm(DIABBC ~., data = df[,-c("HCHOLBC")])

summary(DIABBC.linearmodel)

HCHOLBC.linearmodel = lm(HCHOLBC ~., data = df[,-c("DIABBC")])

summary(HCHOLBC.linearmodel)

# Create factor variables

df = df[, .(
  AGEC = as.numeric(AGEC),
  SF2SA1QN = as.factor(SF2SA1QN),
  INCDEC = as.numeric(INCDEC),
  BDYMSQ04 = as.factor(BDYMSQ04),
  DIASTOL = as.numeric(DIASTOL),
  DIETQ12 = as.factor(DIETQ12),
  DIETQ14 = as.factor(DIETQ14),
  DIETQ5 = as.factor(DIETQ5),
  DIETQ8 = as.factor(DIETQ8),
  DIETRDI = as.factor(DIETRDI),
  DIABBC = as.factor(DIABBC),
  HCHOLBC = as.factor(HCHOLBC)
)]

str(df)

model2 = hclustvar(scale(select_if(df, is.numeric)), select_if(df, is.factor))

plot(model2)

# Visualisations

ggplot(df, aes(SF2SA1QN)) + geom_bar()

ggplot(df, aes(INCDEC)) + geom_density()

ggplot(df, aes(DIASTOL)) + geom_density()

ggplot(df, aes(BDYMSQ04)) + geom_bar()

ggplot(df, aes(DIETQ12)) + geom_bar()

ggplot(df, aes(DIETQ14)) + geom_bar()

ggplot(df, aes(DIETQ5)) + geom_bar()

ggplot(df, aes(DIETQ8)) + geom_bar()

ggplot(df, aes(DIETRDI)) + geom_bar()

ggplot(df, aes(DIABBC)) + geom_bar()

ggplot(df, aes(AGEC)) + geom_density()

# Bivariate plots

ggplot(df, aes(x = SF2SA1QN, y = DIASTOL)) + geom_boxplot()

ggplot(df, aes(x = scale(INCDEC), y = scale(DIASTOL))) + geom_point()

#Make rpart model

library(rpart)

library(rpart.plot)

DIABBC.rpart = rpart(DIABBC ~., data = df[,-c("HCHOLBC", "AGEC")])

# predict() takes newdata; a data = argument would be silently ignored
predict.DIABBC.rpart = predict(DIABBC.rpart, newdata = df[,-c("HCHOLBC", "DIABBC")])

rpart.plot(DIABBC.rpart)

HCHOLBC.rpart = rpart(HCHOLBC ~., data = df[,-c("DIABBC", "AGEC")])

# predict() takes newdata; a data = argument would be silently ignored
predict.HCHOLBC.rpart = predict(HCHOLBC.rpart, newdata = df[,-c("HCHOLBC", "DIABBC")])

rpart.plot(HCHOLBC.rpart)

# Multi task

library(mlr)

task.DIABBC = makeClassifTask(data = df[,-c("HCHOLBC")], target = "DIABBC")

fv2 = generateFilterValuesData(task.DIABBC, method = c("information.gain", "chi.squared"))

fv2$data

plotFilterValues(fv2)

#Analyse DIABBC

lrn = makeLearner("classif.randomForest", predict.type = "prob")

n = getTaskSize(task.DIABBC)

mod = train(lrn, task.DIABBC, subset = seq(1, n, by = 2))

pred = predict(mod, task.DIABBC, subset = seq(2, n, by = 2))

performance(pred, measures = list(multiclass.aunu, acc, mmce))

pd = generatePartialDependenceData(mod, task.DIABBC, "DIASTOL")

plt = plotPartialDependence(pd)

head(plt$data)

plt

plot(Probability ~ Value, data = plt$data, type = "b", xlab = plt$data$Feature[1])

#Analyse HCHOLBC

task.HCHOLBC = makeClassifTask(data = df[,-c("DIABBC")], target = "HCHOLBC")

# Filter values for the HCHOLBC task (the original reused task.DIABBC here)
fv2 = generateFilterValuesData(task.HCHOLBC, method = c("information.gain", "chi.squared"))

fv2$data

plotFilterValues(fv2)

n = getTaskSize(task.HCHOLBC)

mod = train(lrn, task.HCHOLBC, subset = seq(1, n, by = 2))

pred = predict(mod, task.HCHOLBC, subset = seq(2, n, by = 2))

performance(pred, measures = list(multiclass.aunu, acc, mmce))

#Define rules

library(arules)

library(arulesViz)

# apriori needs categorical items, so every column is converted to a factor
df.rules = df[, .(
  AGEC = as.factor(AGEC),
  SF2SA1QN = as.factor(SF2SA1QN),
  INCDEC = as.factor(INCDEC),
  BDYMSQ04 = as.factor(BDYMSQ04),
  DIASTOL = as.factor(DIASTOL),
  DIETQ12 = as.factor(DIETQ12),
  DIETQ14 = as.factor(DIETQ14),
  DIETQ5 = as.factor(DIETQ5),
  DIETQ8 = as.factor(DIETQ8),
  DIETRDI = as.factor(DIETRDI),
  DIABBC = as.factor(DIABBC),
  HCHOLBC = as.factor(HCHOLBC)
)]

# rules DIABBC

rules <- apriori(df.rules,
                 parameter = list(minlen = 2, supp = 0.01, conf = 0.05),
                 appearance = list(rhs = c("DIABBC=1"), default = "lhs"),
                 control = list(verbose = FALSE))

plot(head(rules, n = 10), method = "graph", control = list(type = "items"))

inspect(rules)

# rules HCHOLBC

rules <- apriori(df.rules,
                 parameter = list(minlen = 2, supp = 0.01, conf = 0.1),
                 appearance = list(rhs = c("HCHOLBC=1"), default = "lhs"),
                 control = list(verbose = FALSE))

plot(head(rules, n = 10), method = "graph", control = list(type = "items"))

inspect(rules)