Inferential Statistics Computation


I have a project with three questions, and I want the work done in Word and RStudio.
Please find the RStudio file below; you have to write df <- dataset(994) to open it.
Then you will find the attachment with my project, which is on pages 35 to 45.
I want these three questions answered, with an example and explanation, in the Word file:

1-

Inferential statistics 1: Compute a confidence interval (invent one) for
the parameter p of one nominal variable with two outcomes and interpret
the result. I want an example and an explanation.

2-

Inferential statistics 2: Test the difference of two means of a ratio variable
across two populations (like weight of male vs. female, number of flowers of
smoker vs. non-smoker, …. invent something yourself). Interpret your
results. I want an example and an explanation.

3-

Inferential statistics 3: Test whether two nominal variables are related
(like Are smoking and swimming related?). Interpret your results. 

How to get your data frame for the group project
Follow the steps:
1. Download from Canvas the file df.RData to your computer
2. (Optional: Clear the RStudio workspace – this can be done in the
drop-down menu "Session" or by using the brush symbol in the upper right
window (Environment/History). If you did that correctly, the upper right
window (Environment/History) should be empty.)
3. Open the file df.RData in RStudio. This can be done in the bottom
right window by choosing the just-downloaded file df.RData and clicking
on it – the command load(…) should then appear in the Console window.
4. If you imported the file df.RData correctly, you should see in the
upper right window, under Functions, the line dataset
function(x)
5. Enter in the Console in the command line
df <- dataset("")
where "" is the number of your group (we discussed this in class).
6. If everything worked your data frame df of the group project appears
now in the right environment window.
7. You successfully stored your group project data set now in R as variable
df (of course you can choose a different name) – you can start working
with it now.

As an explanation – the variables of the data set:
Age – in years
Sex – (male/female)
Height – in cm
Weight – in kg
Education – (high school, bachelor, master, phd)
Smoker – (yes/no)
Housing – (flat/house)
Pet – does one have any (yes/no)
Hair – length (no/short/long)
Flowers – how many at home
Commute – transfer time to work in minutes
Siblings
Internettime – daily
Sports – is one interested in sports (yes/no)
Clothes – is one interested in clothes (yes/no)
Swimmer – how well can one swim (no/okay/good)
Readingtime – how many minutes daily does one read

What to do for the group project
In short: You should discuss/work on the project in small groups (small group
means a group of size from 1 (exceptionally) to 2 (preferred)). Each group will
turn in one copy of their discussions/ideas/solutions, which should represent
the work of everyone in the group. The group project should be a concise,
nicely written report of at least 10 pages containing methods from descriptive
as well as inferential statistics. The final report should contain
Graphs done in R
Results of your computation with R
Inferential statistics done with R
Explanation/interpretation of your results, especially those from inferential statistics.
The final report in .pdf is due at the latest at the beginning of the last
week of this course.
More detailed:
Discuss some variables alone with techniques of descriptive statistics. It
would be nice if you could motivate why you picked your variables. You
should pick at least one nominal and one interval/ratio variable (explain
which categories your variables belong to) and apply all the suitable
techniques we learned in class at least once (so frequency distribution,
pie chart, histogram, box plot, numerical descriptive techniques).
Find two pairs of ratio variables and discuss whether they are related.
One pair should be related, the other not. Use Covariance and Correlation
to discuss the relation. Compute the linear regression for the related pair,
plot the scatter plot together with the linear regression and explain the
findings.
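The covariance/correlation/regression workflow just described can be sketched on invented data (the variables x and y below are simulated for illustration, not taken from the project data set):

```r
# Simulated pair of related ratio variables (illustration only).
set.seed(1)
x <- rnorm(50, mean = 170, sd = 10)   # e.g. a height-like variable
y <- 0.9 * x + rnorm(50, sd = 5)      # related to x by construction

cov(x, y)                             # covariance
cor(x, y)                             # correlation, close to 1 here
fit <- lm(y ~ x)                      # linear regression of y on x
summary(fit)$coefficients

plot(x, y, main = "Scatter plot with regression line")
abline(fit)                           # draw the fitted line
```

For an unrelated pair, one would simulate y independently of x and observe a correlation near 0.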
Inferential statistics 1: Compute a confidence interval (invent one) for
the parameter p of one nominal variable with two outcomes and interpret
the result.
Here is an example of a similar kind which might help: assume you have
the following list, stored as the variable "answers", of answers to the
question "Do you like statistics?":

answers
## [1] "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
## [10] "yes" "yes" "yes" "no" "no" "no" "no" "yes" "yes"
## [19] "yes" "yes" "no" "yes" "yes" "no" "no" "yes" "yes"
## [28] "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
## [37] "yes" "yes" "yes" "no" "yes" "yes" "yes" "yes" "yes"
## [46] "yes" "no" "no" "yes" "yes"
In order to compute the confidence interval for the "yes" answers of this
nominal variable, we use the test statistic

z = (p̂ − p) / sqrt( p(1 − p)/n ),

which is standard normally distributed (the theory behind that is the
approximation of the binomial distribution by the normal distribution).
Here p̂ denotes the relative frequency of successes in the data set, p is
the unknown proportion which we want to estimate, and n is the size of our
data set. This gives, as explained earlier in these notes, the
following bounds for the confidence interval with significance level α:

p̂ ± z_(α/2) · sqrt( p̂(1 − p̂)/n )

For our data set "answers" we first compute the relative frequency of
"yes" answers: the command table() helps, and the command as.data.frame()
guarantees that we can use the output of table() like a data frame.
We store the table as the variable "a", and the desired value p̂ is stored as P.
a<-as.data.frame(table(answers))
a
## answers Freq
## 1 no 12
## 2 yes 38
P<-a[2,2]/50
P
## [1] 0.76
Assume we choose α = 0.05; then −qnorm(0.025, 0, 1) is the value z_(α/2),
stored as the variable "z":

z <- -qnorm(0.025, 0, 1)
z
## [1] 1.959964
This gives the lower bound
P – z*sqrt(P*(1-P)/50)
## [1] 0.6416208
and the upper bound
P + z*sqrt(P*(1-P)/50)
## [1] 0.8783792
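The same Wald interval can be packaged in a small helper function (wald_ci() is our own name, not a base R function; it reproduces the bounds above):

```r
# Wald confidence interval for a proportion: p_hat +/- z_(alpha/2) * sqrt(p_hat(1-p_hat)/n).
# wald_ci() is a custom helper, not part of base R.
wald_ci <- function(successes, n, alpha = 0.05) {
  p_hat <- successes / n
  z <- -qnorm(alpha / 2)                       # z_(alpha/2), about 1.96 for alpha = 0.05
  half <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - half, upper = p_hat + half)
}

wald_ci(38, 50)   # 38 "yes" answers out of 50, as in the example
```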
Inferential statistics 2: Test the difference of two means of a ratio variable
across two populations (like weight of male vs. female, number of flowers of
smoker vs. non-smoker, …. invent something yourself). Interpret your
results.
Again an example for this situation: Assume you measure the speed
of 100 yellow and black cars on a motorway. Here are the first of your
observations, stored as the variable “motor”:
head(motor)
## color speed
## 1 black 123
## 2 yellow 129
## 3 black 124
## 4 yellow 124
## 5 black 117
## 6 black 122

We want to test if the mean speed of black cars is significantly different
from the mean speed of yellow cars: we perform a test comparing the
means of two populations.

Which test statistic to choose depends on whether the variances of the two
populations are equal or not. This is checked by another test, which can
be found on page 441 in Keller's book:
The test statistic

F = s₁² / s₂²

is F-distributed with ν₁ = n₁ − 1 and ν₂ = n₂ − 1 degrees of freedom.
n₁ denotes the number of elements in population 1, n₂ the number of
elements in population 2. s₁² and s₂² are the sample variances of the two populations. In our example the two populations are the black and yellow cars.
As explained in section 1.6, we first extract from our original data set
“motor” two data sets, one for the black cars and one for the yellow ones:
motor1<- motor[motor$color=="black",]
motor2<- motor[motor$color=="yellow",]
head(motor1)
## color speed
## 1 black 123
## 3 black 124
## 5 black 117
## 6 black 122
## 7 black 105
## 8 black 94
dim(motor1)
## [1] 75 2
head(motor2)
## color speed
## 2 yellow 129
## 4 yellow 124
## 10 yellow 138
## 21 yellow 140
## 22 yellow 133
## 27 yellow 140
dim(motor2)
## [1] 25 2

Thus, in our example, n1 =
n1<-nrow(motor1)
n1
## [1] 75
and n2 =
n2<-nrow(motor2)
n2
## [1] 25
The command nrow() gives the number of rows.
To test the equality of the variances we test

H0: σ₁²/σ₂² = 1 versus H1: σ₁²/σ₂² ≠ 1

and compute the observed value of F:
var1<- var(motor1$speed)
var2<- var(motor2$speed)
var1/var2
## [1] 1.132162
and the rejection region, for α = 0.05, is determined by F_(α/2), n₁−1, n₂−1 and
F_(1−α/2), n₁−1, n₂−1. In our example
qf(1-0.025,n1-1,n2-1)
## [1] 2.053903
and
qf(0.025,n1-1,n2-1)
## [1] 0.5448977

Since the ratio 1.132162 lies between these two critical values, we do not
reject the hypothesis of equal variances. To test the equality of the means
we therefore use the pooled test statistic

t = (x̄₁ − x̄₂) / sqrt( s_p² · (1/n₁ + 1/n₂) ),   where s_p² = ((n₁ − 1)s₁² + (n₂ − 1)s₂²) / (n₁ + n₂ − 2).
Our two hypotheses are now
H0: µ₁ − µ₂ = 0, that is, the means coincide.
H1: µ₁ − µ₂ ≠ 0.
Thus we compute in R:
x1<-mean(motor1$speed)
x1
## [1] 120.5867
x2<-mean(motor2$speed)
x2
## [1] 136.72
var1<- var(motor1$speed)
var2<- var(motor2$speed)
t<- (x1-x2)/sqrt((((n1-1)*var1+(n2-1)*var2)/(n1+n2-2))
*(1/n1+1/n2))
t
## [1] -6.561956
The rejection region is determined by ±t_(α/2), n₁+n₂−2; that is, if we pick,
as an example, α = 0.05, plus/minus
-qt(0.025, n1+n2-2)
## [1] 1.984467
Since t falls in the rejection region we reject H0, i.e., we accept
H1: µ₁ ≠ µ₂.
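R's built-in t.test() with var.equal = TRUE performs this same pooled test. A self-contained sketch (the speeds below are simulated for illustration, not the "motor" data) compares the hand computation with the built-in:

```r
# Simulated two-group data; compare the manual pooled t with t.test(var.equal = TRUE).
set.seed(2)
speed_black  <- rnorm(75, mean = 120, sd = 12)
speed_yellow <- rnorm(25, mean = 137, sd = 12)

n1 <- length(speed_black); n2 <- length(speed_yellow)
var1 <- var(speed_black);  var2 <- var(speed_yellow)
x1 <- mean(speed_black);   x2 <- mean(speed_yellow)

# Pooled-variance t statistic, exactly as in the text
sp2 <- ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
t_manual <- (x1 - x2) / sqrt(sp2 * (1 / n1 + 1 / n2))

t_builtin <- t.test(speed_black, speed_yellow, var.equal = TRUE)$statistic
c(t_manual, t_builtin)   # the two values agree
```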
Inferential statistics 3: Test whether two nominal variables are related
(like Are smoking and swimming related?). Interpret your results.
Again an example: Assume you want to analyse if there is a relation
between people who play a certain instrument and their favorite fruits.
300 observations are given as the variable “music”, here are the first lines
of the data set:
head(music)
## instrument fruit
## 1 piano ananas
## 2 flute ananas
## 3 guitar ananas
## 4 piano orange
## 5 guitar apple
## 6 piano ananas
To test whether the played instrument and the favorite fruit are related
or not we can apply a chi-squared test of a contingency table. The hypotheses are:
H0: instrument and fruit are independent.
H1: instrument and fruit are dependent.
The command table() immediately gives a contingency table:
table(music)
## fruit
## instrument ananas apple banana orange
## flute 14 15 28 10
## guitar 63 14 23 20
## piano 26 45 22 20
The test statistic

χ² = Σᵢ Σⱼ (fᵢⱼ − eᵢⱼ)² / eᵢⱼ

is chi-squared distributed with (r − 1)(c − 1) degrees of freedom, where
fᵢⱼ denotes the observed and eᵢⱼ the expected frequency of cell (i, j).
To compute the test statistic in R we proceed as follows:
f11<-table(music)[1,1]
f12<-table(music)[1,2]
f13<-table(music)[1,3]
f14<-table(music)[1,4]
f21<-table(music)[2,1]
f22<-table(music)[2,2]
f23<-table(music)[2,3]
f24<-table(music)[2,4]
f31<-table(music)[3,1]
f32<-table(music)[3,2]
f33<-table(music)[3,3]
f34<-table(music)[3,4]
e11<-rowSums(table(music))[1]*colSums(table(music))[1]/300
e12<-rowSums(table(music))[1]*colSums(table(music))[2]/300
e13<-rowSums(table(music))[1]*colSums(table(music))[3]/300
e14<-rowSums(table(music))[1]*colSums(table(music))[4]/300
e21<-rowSums(table(music))[2]*colSums(table(music))[1]/300
e22<-rowSums(table(music))[2]*colSums(table(music))[2]/300
e23<-rowSums(table(music))[2]*colSums(table(music))[3]/300
e24<-rowSums(table(music))[2]*colSums(table(music))[4]/300
e31<-rowSums(table(music))[3]*colSums(table(music))[1]/300
e32<-rowSums(table(music))[3]*colSums(table(music))[2]/300
e33<-rowSums(table(music))[3]*colSums(table(music))[3]/300
e34<-rowSums(table(music))[3]*colSums(table(music))[4]/300
x <-(f11-e11)^2/e11 + (f12-e12)^2/e12 +
(f13-e13)^2/e13 + (f14-e14)^2/e14 +
(f21-e21)^2/e21 + (f22-e22)^2/e22 +
(f23-e23)^2/e23 + (f24-e24)^2/e24 +
(f31-e31)^2/e31 + (f32-e32)^2/e32 +
(f33-e33)^2/e33 + (f34-e34)^2/e34
x
## flute
## 49.16676

The rejection region, for α = 0.05 and (3 − 1)(4 − 1) = 6 degrees of freedom,
is determined by
qchisq(1-0.05,6)
## [1] 12.59159
Since
x
## flute
## 49.16676
is greater than
qchisq(1-0.05,6)
## [1] 12.59159
we reject H0, i.e. we conclude that the two variables instrument and fruit
are dependent.
Remark: R has a built-in function chisq.test() which directly computes
the value x; however, I ask you to compute it "by hand" as just explained.
Here, for the sake of completeness:
options(width = 60)
chisq.test(table(music),simulate.p.value=T)
##
## Pearson’s Chi-squared test with simulated p-value
## (based on 2000 replicates)
##
## data: table(music)
## X-squared = 49.167, df = NA, p-value = 0.0004998 
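The twelve fᵢⱼ/eᵢⱼ assignments can also be collapsed with rowSums(), colSums() and outer(); a sketch using the contingency table printed above:

```r
# Chi-squared statistic "by hand" via matrix operations, on the music table above.
obs <- matrix(c(14, 15, 28, 10,
                63, 14, 23, 20,
                26, 45, 22, 20),
              nrow = 3, byrow = TRUE,
              dimnames = list(instrument = c("flute", "guitar", "piano"),
                              fruit = c("ananas", "apple", "banana", "orange")))

# e_ij = (row total * column total) / n, all cells at once
exp_counts <- outer(rowSums(obs), colSums(obs)) / sum(obs)
x <- sum((obs - exp_counts)^2 / exp_counts)
x   # about 49.167, as in the text
```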

Solution 

Ans1.R 

load("C:/Users/user/Downloads/df.RData")

df = dataset(994)

a<-as.data.frame(table(df$clothes))

no.of.obs=dim(df)[1]

P<-a[2,2]/no.of.obs

z<- -qnorm(0.025, 0, 1)

low.bound=P-z*sqrt(P*(1-P)/no.of.obs)

up.bound=P+z*sqrt(P*(1-P)/no.of.obs)

Conf.interval= c(low.bound,up.bound) 

Ans2.R 

load("C:/Users/user/Downloads/df.RData")

df = dataset(994)

weight_nonsmoker=df[which(df$smoker=="no"),]$weight

weight_smoker=df[which(df$smoker=="yes"),]$weight

n1=length(weight_nonsmoker)

n2=length(weight_smoker)

var1=var(weight_nonsmoker)

var2=var(weight_smoker)

ratio=var1/var2

cutoff1=qf(1-0.025,n1-1,n2-1)

cutoff2=qf(0.025,n1-1,n2-1)

result1=ifelse(ratio<cutoff1 && ratio>cutoff2, "ACCEPT", "REJECT")

result1

x1=mean(weight_nonsmoker)

x2=mean(weight_smoker)

t = (x1-x2)/sqrt((((n1-1)*var1+(n2-1)*var2)/(n1+n2-2))*(1/n1+1/n2))

cutoffmain=-qt(0.025, n1+n2-2)

ifelse(abs(t)<cutoffmain, "ACCEPT", "REJECT")

Ans3.R

load("C:/Users/user/Downloads/df.RData")

df = dataset(994)

pet_housing=df[,c("housing","pet")]

table(pet_housing)

f11<-table(pet_housing)[1,1]

f12<-table(pet_housing)[1,2]

f21<-table(pet_housing)[2,1]

f22<-table(pet_housing)[2,2]

no.of.obs=dim(pet_housing)[1]

e11<-rowSums(table(pet_housing))[1]*colSums(table(pet_housing))[1]/no.of.obs

e12<-rowSums(table(pet_housing))[1]*colSums(table(pet_housing))[2]/no.of.obs

e21<-rowSums(table(pet_housing))[2]*colSums(table(pet_housing))[1]/no.of.obs

e22<-rowSums(table(pet_housing))[2]*colSums(table(pet_housing))[2]/no.of.obs

x <-(f11-e11)^2/e11 + (f12-e12)^2/e12 +(f21-e21)^2/e21 + (f22-e22)^2/e22

cutoff1=qchisq(1-0.05,1)

ifelse(x>cutoff1, "REJECT", "ACCEPT")

Report

  1. For the first question we needed to find a confidence interval for the parameter p of a nominal variable with two outcomes. We chose the variable "Clothes", which has outcomes "yes" and "no".

In order to calculate the confidence interval for the "yes" answers we compute p̂, the relative frequency of "yes", and the bounds p̂ ± z_(α/2) · sqrt(p̂(1 − p̂)/n).

After doing the above we get the confidence interval (0.4541742, 0.5418258).

  2. For the second question we need to test the difference between the means of a ratio variable of two populations. Here we chose the two populations to be smokers and non-smokers, and we chose our ratio variable to be weight.

In order to test the difference between means we first need to test whether the variances of the two groups are the same. To do so we use the F-statistic s₁²/s₂², where s₁² is the variance of the first group and s₂² is the variance of the second group. We get the ratio 1.283779. Since s₁²/s₂² lies between the critical values F_(1−α/2), n₁−1, n₂−1 (= 0.7799554) and F_(α/2), n₁−1, n₂−1 (= 1.284071), the test gets ACCEPTED and we assume equal variances.

Now we test the equality of the means of the two groups using the pooled t-statistic.

We get the value of t to be 3.01111. Since |t| > t_(α/2), n₁+n₂−2 (= 1.964739), this test gets REJECTED. This means that the mean weight of smokers and non-smokers is not the same.

  3. For the third question we needed to test whether two nominal variables are related. To do so we chose the variables Pet and Housing, and we check whether having a pet is related to living in a house or a flat.

To compute this we perform the chi-squared test of independence and use the test statistic χ² = Σ (fᵢⱼ − eᵢⱼ)² / eᵢⱼ.

The value of the test statistic comes out to be 12.74605, which is greater than χ²_(0.05, 1) (= 3.841459), so the test gets REJECTED. Thus the two variables Housing and Pet are dependent.