**Inferential Statistics Computation**

I have a project with three questions, miss, and I want the work done in Word and the RStudio programme.

Please find the RStudio file below; you have to write df <- dataset(994) to open it.

You will also find the attachment with my project, which is on pages 35 to 45.

I want these three questions answered, with an example and an explanation, in the Word file:

1. Inferential statistics 1: Compute a confidence interval (invent one) for the parameter p of one nominal variable with two outcomes and interpret the result. I want an example and an explanation.

2. Inferential statistics 2: Test the difference of two means of a ratio variable across two populations (like weight of male vs. female, or number of flowers of smoker vs. non-smoker; invent something yourself). Interpret your results. I want an example and an explanation.

3. Inferential statistics 3: Test whether two nominal variables are related (like: are smoking and swimming related?). Interpret your results.

**How to get your data frame for the group project**

Follow the steps:

1. Download from Canvas the file df.RData to your computer

2. (Optional: Clear the RStudio workspace. This can be done in the drop-down menu "Session" or by using the brush symbol in the upper right window (Environment/History). If you did that correctly, the upper right window (Environment/History) should be empty.)

3. Open the file df.RData in RStudio. This can be done in the bottom right window of RStudio by choosing the just-downloaded file df.RData and opening it by clicking on it; the command load(…) should then appear in the Console window of RStudio.

4. If you imported the file df.RData correctly into RStudio, you should see in the upper right window, under Functions, the line dataset function(x).

5. Enter in the Console, in the command line, df <- dataset(*), where * is the number of your group (we discussed this in class).

6. If everything worked, the data frame df of your group project now appears in the Environment window on the right.

7. You have now successfully stored your group project data set in R as the variable df (of course you can choose a different name) and can start working with it.

As an explanation, the variables of the data set are:

• Age in years
• Sex (male/female)
• Height in cm
• Weight in kg
• Education (high school, bachelor, master, phd)
• Smoker (yes/no)
• Housing (flat/house)
• Pet: does one have any (yes/no)
• Hair: length (no/short/long)
• Flowers: how many at home
• Commute: transfer time to work in minutes
• Siblings
• Internettime: daily
• Sports: is one interested in sports (yes/no)
• Clothes: is one interested in clothes (yes/no)
• Swimmer: how well can one swim (no/okay/good)
• Readingtime: how many minutes daily does one read

**What to do for the group project**

In short: You should discuss/work on the project in small groups (small group means a group of size from 1 (exceptionally) to 2 (preferred)). Each group will turn in one copy of their discussions/ideas/solutions, which should represent the work of everyone in the group. The group project should be a concise, nicely written report of at least 10 pages containing methods from descriptive as well as inferential statistics. The final report should contain

• Graphs done in R
• Results of your computations with R
• Inferential statistics done with R
• Explanation/interpretation of your results, especially those from inferential statistics.

The final report, as a .pdf, is due at the latest at the beginning of the last week of this course.

In more detail:

• Discuss some variables alone with techniques of descriptive statistics. It would be nice if you could motivate why you picked your variables. You should pick at least one nominal and one interval/ratio variable (explain which categories your variables belong to) and apply all the suitable techniques we learned in class at least once (so frequency distribution, pie chart, histogram, box plot, numerical descriptive techniques).

• Find two pairs of ratio variables and discuss whether they are related. One pair should be related, the other not. Use covariance and correlation to discuss the relation. Compute the linear regression for the related pair, plot the scatter plot together with the regression line and explain the findings.
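As a rough sketch of this step, here is what the analysis of one related pair could look like; it uses R's built-in cars data set (speed and stopping distance) rather than the project data:

```r
# Two related ratio variables from R's built-in cars data set
x <- cars$speed
y <- cars$dist

cov(x, y)  # covariance: positive, so the variables move together
cor(x, y)  # correlation: close to 1, so the linear relation is strong

# Linear regression dist = b0 + b1*speed
fit <- lm(y ~ x)
coef(fit)

# Scatter plot together with the fitted regression line
plot(x, y, xlab = "speed", ylab = "dist")
abline(fit)
```

For the project data the same pattern applies, with x and y replaced by two ratio variables from df.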

• Inferential statistics 1: Compute a confidence interval (invent one) for the parameter p of one nominal variable with two outcomes and interpret the result.

Here is an example of a similar kind which might help: Assume you have the following list of answers to the question "Do you like statistics?", stored as the variable "answers":

answers

## [1] "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
## [10] "yes" "yes" "yes" "no" "no" "no" "no" "yes" "yes"
## [19] "yes" "yes" "no" "yes" "yes" "no" "no" "yes" "yes"
## [28] "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
## [37] "yes" "yes" "yes" "no" "yes" "yes" "yes" "yes" "yes"
## [46] "yes" "no" "no" "yes" "yes"

In order to compute the confidence interval for the "yes" answers of this nominal variable, we use the test statistic

z = (p̂ - p) / sqrt( p̂(1 - p̂)/n ),

which is standard normally distributed (the theory behind that is the approximation of the binomial distribution by the normal distribution). Here p̂ denotes the relative frequency of successes in the data set, p is the unknown proportion which we want to estimate, and n is the size of our data set. This gives, as explained earlier in these notes, the following bounds for the confidence interval with significance level α:

p̂ ± z_{α/2} · sqrt( p̂(1 - p̂)/n )

For our data set "answers" we first compute the relative frequency of "yes" answers: the command table() helps, and the command as.data.frame() guarantees that we can use the output of table() like a data frame. We store the table as the variable "a", and the desired value p̂ is stored as P.

a<-as.data.frame(table(answers))

a

## answers Freq

## 1 no 12

## 2 yes 38

P<-a[2,2]/50

P

## [1] 0.76

Assume we choose α = 0.05; then -qnorm(0.025, 0, 1) is the value z_{α/2}, stored as the variable "z":

z <- -qnorm(0.025, 0, 1)

z

## [1] 1.959964

This gives the lower bound

P - z*sqrt(P*(1-P)/50)

## [1] 0.6416208

and the upper bound

P + z*sqrt(P*(1-P)/50)

## [1] 0.8783792
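The computation above can be wrapped into a small reusable helper; this is a sketch, and the function name prop_ci is my own invention, not part of the course material:

```r
# Normal-approximation confidence interval for a proportion p:
# phat +/- z_{alpha/2} * sqrt(phat*(1-phat)/n)
prop_ci <- function(successes, n, alpha = 0.05) {
  phat <- successes / n
  z <- -qnorm(alpha / 2)  # z_{alpha/2}; 1.959964 for alpha = 0.05
  half <- z * sqrt(phat * (1 - phat) / n)
  c(lower = phat - half, upper = phat + half)
}

# The "answers" example: 38 "yes" answers out of 50
prop_ci(38, 50)
```

Called as above, it reproduces the bounds 0.6416208 and 0.8783792 computed step by step in the example.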

• Inferential statistics 2: Test the difference of two means of a ratio variable across two populations (like weight of male vs. female, or number of flowers of smoker vs. non-smoker; invent something yourself). Interpret your results.

Again an example for this situation: Assume you measure the speeds of 100 yellow and black cars on a motorway. Here are the first of your observations, stored as the variable "motor":

head(motor)

## color speed

## 1 black 123

## 2 yellow 129

## 3 black 124

## 4 yellow 124

## 5 black 117

## 6 black 122

We want to test whether the mean speed of black cars is significantly different from the mean speed of yellow cars: we perform a test comparing the means of two populations.

Which test statistic to choose depends on whether the variances of the two populations are equal or not. This is checked by another test, which can be found on page 441 of Keller's book: the test statistic

F = s1² / s2²

is F-distributed with ν1 = n1 - 1 and ν2 = n2 - 1 degrees of freedom. n1 denotes the number of elements in population 1, n2 the number of elements in population 2; s1² and s2² are the sample variances of the two populations. In our example the two populations are the black and the yellow cars.

As explained in section 1.6, we first extract from our original data set

“motor” two data sets, one for the black cars and one for the yellow ones:

motor1 <- motor[motor$color == "black",]
motor2 <- motor[motor$color == "yellow",]

head(motor1)

## color speed

## 1 black 123

## 3 black 124

## 5 black 117

## 6 black 122

## 7 black 105

## 8 black 94

dim(motor1)

## [1] 75 2

head(motor2)

## color speed

## 2 yellow 129

## 4 yellow 124

## 10 yellow 138

## 21 yellow 140

## 22 yellow 133

## 27 yellow 140

dim(motor2)

## [1] 25 2

Thus, in our example, n1 is

n1<-nrow(motor1)

n1

## [1] 75

and n2 is

n2<-nrow(motor2)

n2

## [1] 25

The command *nrow*() gives the number of rows.

To test the equality of the variances we compute the ratio:

var1<- var(motor1$speed)

var2<- var(motor2$speed)

var1/var2

## [1] 1.132162

and the rejection region, for α = 0.05, is determined by F_{α/2, n1-1, n2-1} and F_{1-α/2, n1-1, n2-1}. In our example

qf(1-0.025,n1-1,n2-1)

## [1] 2.053903

and

qf(0.025,n1-1,n2-1)

## [1] 0.5448977
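As a sanity check (a sketch with simulated stand-in data, since the motor data itself is not reproduced here), the hand-computed variance ratio is exactly the statistic reported by R's built-in var.test():

```r
set.seed(1)
g1 <- rnorm(75, mean = 120, sd = 12)  # stand-in for the black cars' speeds
g2 <- rnorm(25, mean = 137, sd = 11)  # stand-in for the yellow cars' speeds

ratio <- var(g1) / var(g2)  # F statistic by hand

# Built-in F test for equality of two variances
ftest <- var.test(g1, g2)
all.equal(unname(ftest$statistic), ratio)  # the two statistics coincide
```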

Since the ratio 1.132162 lies between these two cutoffs, we do not reject the hypothesis of equal variances. To test the equality of the means we therefore use the pooled test statistic

t = (x̄1 - x̄2) / sqrt( s_p² · (1/n1 + 1/n2) ),   where s_p² = ((n1 - 1)s1² + (n2 - 1)s2²) / (n1 + n2 - 2),

which is t-distributed with n1 + n2 - 2 degrees of freedom.

Our two hypotheses are now

– H0: µ1 - µ2 = 0, that is, the means coincide.
– H1: µ1 - µ2 ≠ 0.

Thus we compute in R:

x1<-mean(motor1$speed)

x1

## [1] 120.5867

x2<-mean(motor2$speed)

x2

## [1] 136.72

var1<- var(motor1$speed)

var2<- var(motor2$speed)

t <- (x1-x2)/sqrt((((n1-1)*var1+(n2-1)*var2)/(n1+n2-2))*(1/n1+1/n2))

t

## [1] -6.561956

The rejection region is determined by ±t_{α/2, n1+n2-2}; that is, if we pick, as an example, α = 0.05, plus/minus

-qt(0.025, n1+n2-2)

## [1] 1.984467

Since t falls in the rejection region, we reject H0, i.e., we accept H1: µ1 ≠ µ2.
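For completeness, a similar sketch (again with simulated stand-in data) showing that the pooled t statistic computed by hand matches R's built-in t.test() with var.equal = TRUE:

```r
set.seed(2)
g1 <- rnorm(75, mean = 120, sd = 12)
g2 <- rnorm(25, mean = 137, sd = 12)
n1 <- length(g1); n2 <- length(g2)

# Pooled variance and two-sample t statistic, by hand
sp2 <- ((n1 - 1) * var(g1) + (n2 - 1) * var(g2)) / (n1 + n2 - 2)
t_hand <- (mean(g1) - mean(g2)) / sqrt(sp2 * (1 / n1 + 1 / n2))

# Built-in equal-variance two-sample t test
tt <- t.test(g1, g2, var.equal = TRUE)
all.equal(unname(tt$statistic), t_hand)  # the two statistics coincide
```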

• Inferential statistics 3: Test whether two nominal variables are related (like: are smoking and swimming related?). Interpret your results.

Again an example: Assume you want to analyse whether there is a relation between the instrument people play and their favorite fruits. 300 observations are given as the variable "music"; here are the first lines of the data set:

head(music)

## instrument fruit

## 1 piano ananas

## 2 flute ananas

## 3 guitar ananas

## 4 piano orange

## 5 guitar apple

## 6 piano ananas

To test whether the played instrument and the favorite fruit are related or not, we can apply a chi-squared test to a contingency table. The hypotheses are:

– H0: instrument and fruit are independent.
– H1: instrument and fruit are dependent.

The command table() immediately gives a contingency table:

table(music)

## fruit

## instrument ananas apple banana orange

## flute 14 15 28 10

## guitar 63 14 23 20

## piano 26 45 22 20

The test statistic

χ² = Σ_i Σ_j (f_ij - e_ij)² / e_ij,

where f_ij are the observed frequencies and e_ij = (i-th row sum)·(j-th column sum)/n are the expected frequencies under independence, is χ²-distributed with (r - 1)(c - 1) degrees of freedom; here (3 - 1)(4 - 1) = 6. To compute the test statistic in R we proceed as follows:

f11<-table(music)[1,1]

f12<-table(music)[1,2]

f13<-table(music)[1,3]

f14<-table(music)[1,4]

f21<-table(music)[2,1]

f22<-table(music)[2,2]

f23<-table(music)[2,3]

f24<-table(music)[2,4]

f31<-table(music)[3,1]

f32<-table(music)[3,2]

f33<-table(music)[3,3]

f34<-table(music)[3,4]

e11<-rowSums(table(music))[1]*colSums(table(music))[1]/300

e12<-rowSums(table(music))[1]*colSums(table(music))[2]/300

e13<-rowSums(table(music))[1]*colSums(table(music))[3]/300

e14<-rowSums(table(music))[1]*colSums(table(music))[4]/300

e21<-rowSums(table(music))[2]*colSums(table(music))[1]/300

e22<-rowSums(table(music))[2]*colSums(table(music))[2]/300

e23<-rowSums(table(music))[2]*colSums(table(music))[3]/300

e24<-rowSums(table(music))[2]*colSums(table(music))[4]/300

e31<-rowSums(table(music))[3]*colSums(table(music))[1]/300

e32<-rowSums(table(music))[3]*colSums(table(music))[2]/300

e33<-rowSums(table(music))[3]*colSums(table(music))[3]/300

e34<-rowSums(table(music))[3]*colSums(table(music))[4]/300

x <-(f11-e11)^2/e11 + (f12-e12)^2/e12 +

(f13-e13)^2/e13 + (f14-e14)^2/e14 +

(f21-e21)^2/e21 + (f22-e22)^2/e22 +

(f23-e23)^2/e23 + (f24-e24)^2/e24 +

(f31-e31)^2/e31 + (f32-e32)^2/e32 +

(f33-e33)^2/e33 + (f34-e34)^2/e34

x

## flute

## 49.16676

qchisq(1-0.05,6)

## [1] 12.59159

Since

x

## flute

## 49.16676

is greater than

qchisq(1-0.05,6)

## [1] 12.59159

we reject *H*0, i.e. we conclude that the two variables instrument and fruit

are dependent.

Remark: R has a built-in function chisq.test() which directly computes the value x; however, I ask you to compute it "by hand" as just explained. Here, for the sake of completeness:

options(width = 60)

chisq.test(table(music),simulate.p.value=T)

##

## Pearson’s Chi-squared test with simulated p-value

## (based on 2000 replicates)

##

## data: table(music)

## X-squared = 49.167, df = NA, p-value = 0.0004998
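The agreement between the by-hand statistic and chisq.test() can also be checked on a small invented table (a sketch; note that for 2x2 tables chisq.test() applies a continuity correction by default, so there you would need correct = FALSE to reproduce the by-hand value):

```r
# Invented 2x3 contingency table (e.g. smoker yes/no vs. swimmer no/okay/good)
tab <- matrix(c(10, 20, 15,
                25, 30, 40), nrow = 2, byrow = TRUE)

n <- sum(tab)
e <- outer(rowSums(tab), colSums(tab)) / n  # expected counts e_ij
x <- sum((tab - e)^2 / e)                   # chi-squared statistic by hand

# Built-in test; no continuity correction for tables larger than 2x2
ct <- chisq.test(tab)
all.equal(unname(ct$statistic), x)  # the two statistics coincide
```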

**Solution**

**Ans1.R**

load("C:/Users/user/Downloads/df.RData")
df <- dataset(994)
a <- as.data.frame(table(df$clothes))
no.of.obs <- dim(df)[1]
P <- a[2,2]/no.of.obs
z <- -qnorm(0.025, 0, 1)
low.bound <- P - z*sqrt(P*(1-P)/no.of.obs)
up.bound <- P + z*sqrt(P*(1-P)/no.of.obs)
Conf.interval <- c(low.bound, up.bound)

**Ans2.R**

load("C:/Users/user/Downloads/df.RData")
df <- dataset(994)
weight_nonsmoker <- df[which(df$smoker == "no"),]$weight
weight_smoker <- df[which(df$smoker == "yes"),]$weight
n1 <- length(weight_nonsmoker)
n2 <- length(weight_smoker)
var1 <- var(weight_nonsmoker)
var2 <- var(weight_smoker)
ratio <- var1/var2
cutoff1 <- qf(1-0.025, n1-1, n2-1)
cutoff2 <- qf(0.025, n1-1, n2-1)
result1 <- ifelse(ratio < cutoff1 && ratio > cutoff2, "ACCEPT", "REJECT")
result1
x1 <- mean(weight_nonsmoker)
x2 <- mean(weight_smoker)
t <- (x1-x2)/sqrt((((n1-1)*var1+(n2-1)*var2)/(n1+n2-2))*(1/n1+1/n2))
cutoffmain <- -qt(0.025, n1+n2-2)
ifelse(abs(t) < cutoffmain, "ACCEPT", "REJECT")

**Ans3.R**

load("C:/Users/user/Downloads/df.RData")

df <- dataset(994)
pet_housing <- df[, c("housing", "pet")]
table(pet_housing)
f11 <- table(pet_housing)[1,1]
f12 <- table(pet_housing)[1,2]
f21 <- table(pet_housing)[2,1]
f22 <- table(pet_housing)[2,2]
no.of.obs <- dim(pet_housing)[1]
e11 <- rowSums(table(pet_housing))[1]*colSums(table(pet_housing))[1]/no.of.obs
e12 <- rowSums(table(pet_housing))[1]*colSums(table(pet_housing))[2]/no.of.obs
e21 <- rowSums(table(pet_housing))[2]*colSums(table(pet_housing))[1]/no.of.obs
e22 <- rowSums(table(pet_housing))[2]*colSums(table(pet_housing))[2]/no.of.obs
x <- (f11-e11)^2/e11 + (f12-e12)^2/e12 + (f21-e21)^2/e21 + (f22-e22)^2/e22
cutoff1 <- qchisq(1-0.05, 1)
ifelse(x > cutoff1, "REJECT", "ACCEPT")

**Report**

- For the first question, we needed to find a confidence interval for the parameter p of a nominal variable with two outcomes. We chose the variable "Clothes", which has the outcomes "yes" and "no".

In order to calculate the confidence interval for the "yes" answers, we calculate the bounds P ± z_{α/2}·sqrt(P(1 - P)/n) with α = 0.05, as in the example above.

After doing the above we get the confidence interval (0.4541742, 0.5418258).

- For the second question, we needed to test the difference between the means of a ratio variable for two populations. We chose the two populations to be smokers and non-smokers, and the ratio variable to be weight.

In order to test the difference between the means, we first need to test whether the variances of the two groups are the same. To do so we use the F statistic s1²/s2², where s1² is the variance of the first group and s2² is the variance of the second group. We get the ratio 1.283779. Since s1²/s2² lies between F_{1-α/2, n1-1, n2-1} (= 0.7799554) and F_{α/2, n1-1, n2-1} (= 1.284071), the test gets ACCEPTED and we assume equal variances.

Now we test the equality of the means of the two groups using the pooled t statistic. We get the value of t to be 3.01111. Since abs(t) > t_{α/2, n1+n2-2} (= 1.964739), the null hypothesis gets REJECTED. This means that the mean weight of smokers and non-smokers is not the same.

- For the third question, we needed to test whether two nominal variables are related. We chose the variables pet and housing, and we check whether having a pet is related to living in a house or a flat.

To compute this we perform the chi-squared independence test, using the test statistic χ² = Σ (f_ij - e_ij)²/e_ij as above.

The value of the test statistic comes out to be 12.74605, which is greater than χ²_{0.05, ν} (= 3.841459) with ν = (2 - 1)(2 - 1) = 1 degree of freedom for a 2×2 table, so the test gets rejected. Thus the two variables housing and pet are dependent.