Inferential Statistics Computation
I have a project with three questions, and I want the work done in Word and in the RStudio program.
Please find the RStudio file below; you have to write df <- dataset(994) to open it.
Then you will find the attachment with my project, which is on pages 35 to 45.
I want these three questions answered, with an example and an explanation for each, in the Word file; they are:
1-
Inferential statistics 1: Compute a confidence interval (invent one) for
the parameter p of one nominal variable with two outcomes and interpret
the result. I want an example and an explanation.
2-
Inferential statistics 2: Test the difference of two means of a ratio variable
for two populations (like weight of male vs. female, number of flowers of
smoker vs. non-smoker, ... invent something yourself). Interpret your
results. I want an example and an explanation.
3-
Inferential statistics 3: Test whether two nominal variables are related
(like Are smoking and swimming related?). Interpret your results.
How to get your data frame for the group project
Follow the steps:
1. Download from Canvas the file df.RData to your computer
2. (Optional: Clear the RStudio workspace – this can be done in the drop-down
menu "Session" or by using the brush symbol in the upper right
window (Environment/History). If you did that correctly the upper right
window (Environment/History) should be empty.)
3. Upload the file df.RData in RStudio. This can be done in RStudio in the
bottom right window by choosing the just downloaded file df.RData and
opening it by clicking on it – the command load(…) should then appear
in the Console window in RStudio.
4. If you imported the file df.RData correctly into RStudio you should see in the
upper right window a section Functions and in it the line dataset
function(x).
5. Enter in the Console in the command line df <- dataset(∗),
where "∗" is the number of your group (we discussed this in class).
6. If everything worked your data frame df of the group project appears
now in the right environment window.
7. You have successfully stored your group project data set in R as the variable
df (of course you can choose a different name) – you can start working
with it now.
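As a minimal sketch of steps 3 to 7 (the file path and the group number 994 are only examples – use the location where you saved df.RData and your own group number):
load("C:/Users/user/Downloads/df.RData")   # step 3: makes the function dataset() available
df <- dataset(994)                         # step 5: replace 994 with your own group number
dim(df)                                    # quick check: number of observations and variables
head(df)                                   # first rows of your group's data frame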
As an explanation – the variables of the data set:
• Age in years
• Sex (male/female)
• Height in cm
• Weight in kg
• Education (high school, bachelor, master, phd)
• Smoker (yes/no)
• Housing (flat/house)
• Pet – does one have any (yes/no)
• Hair – length (no/short/long)
• Flowers – how many at home
• Commute – transfer time to work in minutes
• Siblings
• Internettime – daily
• Sports – is one interested in sports (yes/no)
• Clothes – is one interested in clothes (yes/no)
• Swimmer – how good can one swim (no/okay/good)
• Readingtime – how many minutes daily does one read
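To check which measurement level each variable has before you start, a quick look at the structure of the data frame can help, for example:
str(df)       # type of every variable: factor (nominal/ordinal) or numeric (interval/ratio)
summary(df)   # frequency counts for the factors, numerical summaries for the numeric variables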
What to do for the group project
In short: You should discuss/work on the project in small groups (small group
means a group of size from 1 (exceptionally) to 2 (preferred)). Each group will
turn in one copy of their discussions/ideas/solutions, which should represent
the work of everyone in the group. The group project should be a concise,
nicely written report of at least 10 pages containing methods from descriptive
as well as inferential statistics. The final report should contain
• Graphs done in R
• Results of your computation with R
• Inferential statistics done with R
• Explanation/interpretation of your results, especially those from inferential statistics.
The final report, as a .pdf, is due at the latest at the beginning of the last week of
this course.
More detailed:
• Discuss some variables alone with techniques of descriptive statistics. It
would be nice if you could motivate why you picked your variables. You
should pick at least one nominal and one interval/ratio variable (explain
which categories your variables belong to) and apply all the suitable
techniques we learned in class at least once (so frequency distribution,
pie chart, histogram, box plot, numerical descriptive techniques) – a short R sketch follows below.
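A minimal sketch of these techniques, where the choice of Smoker and Height and the lowercase column names are only assumptions – pick and motivate your own variables:
# nominal variable: frequency distribution and pie chart
table(df$smoker)
pie(table(df$smoker), main = "Smoker")
# interval/ratio variable: histogram, box plot and numerical measures
hist(df$height, main = "Height", xlab = "Height in cm")
boxplot(df$height, main = "Height in cm")
mean(df$height)
median(df$height)
sd(df$height)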
• Find two pairs of ratio variables and discuss whether they are related.
One pair should be related, the other not. Use covariance and correlation
to discuss the relation. Compute the linear regression for the related pair,
plot the scatter plot together with the regression line and explain the
findings (see the sketch after this item).
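A possible sketch for one such pair, assuming Height and Weight turn out to be the related pair (whether they really are has to be checked with your own data; the lowercase column names are an assumption as well):
cov(df$height, df$weight)               # covariance of the pair
cor(df$height, df$weight)               # correlation coefficient
reg <- lm(weight ~ height, data = df)   # linear regression of weight on height
summary(reg)                            # intercept, slope and goodness of fit
plot(df$height, df$weight, xlab = "Height in cm", ylab = "Weight in kg")
abline(reg)                             # scatter plot with the regression line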
• Inferential statistics 1: Compute a confidence interval (invent one) for
the parameter p of one nominal variable with two outcomes and interpret
the result.
Here is an example of a similar kind which might help: assume you have
the following list, stored as the variable "answers", of answers to the
question "Do you like statistics?":
answers
## [1] "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
## [10] "yes" "yes" "yes" "no" "no" "no" "no" "yes" "yes"
## [19] "yes" "yes" "no" "yes" "yes" "no" "no" "yes" "yes"
## [28] "yes" "no" "yes" "yes" "yes" "yes" "yes" "yes" "yes"
## [37] "yes" "yes" "yes" "no" "yes" "yes" "yes" "yes" "yes"
## [46] "yes" "no" "no" "yes" "yes"
In order to compute the confidence interval for the "yes" answers of this
nominal variable, we use the test statistic

Z = (p̂ − p) / √(p̂(1 − p̂)/n),

which is approximately standard normally distributed (the theory behind that is the
approximation of the binomial distribution by the normal distribution).
Here p̂ denotes the relative frequency of successes in the data set, p is
the unknown proportion which we want to estimate and n is the size of our
data set. This gives, as explained earlier in these notes, the
following bounds for the confidence interval with significance level α:

p̂ − z_{α/2}·√(p̂(1 − p̂)/n)  and  p̂ + z_{α/2}·√(p̂(1 − p̂)/n).
For our data set "answers" we first compute the relative frequency of
"yes" answers: the command table() helps, and the command as.data.frame()
guarantees that we can use the output of table() like a data frame.
We store the table as the variable "a" and the desired value p̂ as P.
a<-as.data.frame(table(answers))
a
## answers Freq
## 1 no 12
## 2 yes 38
P<-a[2,2]/50
P
## [1] 0.76
Assume we choose α = 0.05; then -qnorm(0.025, 0, 1) is the value z_{α/2},
stored as the variable "z":
z <- -qnorm(0.025, 0, 1)
z
## [1] 1.959964
This gives the lower bound
P – z*sqrt(P*(1-P)/50)
## [1] 0.6416208
and the upper bound
P + z*sqrt(P*(1-P)/50)
## [1] 0.8783792
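The same computation can be wrapped into a small helper function (prop_ci is just a name chosen here, not part of the course material), which makes it easy to reuse the bounds for the project variable:
# Wald confidence interval for a proportion, exactly as derived above
prop_ci <- function(successes, n, alpha = 0.05) {
  p_hat <- successes / n
  z <- -qnorm(alpha / 2)
  margin <- z * sqrt(p_hat * (1 - p_hat) / n)
  c(lower = p_hat - margin, upper = p_hat + margin)
}
prop_ci(38, 50)   # reproduces the bounds 0.6416208 and 0.8783792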
• Inferential statistics 2: Test the difference of two means of a ratio variable
for two populations (like weight of male vs. female, number of flowers of
smoker vs. non-smoker, ... invent something yourself). Interpret your
results.
Again an example for this situation: Assume you measure the speed
of 100 yellow and black cars on a motorway. Here are the first few of your
observations, stored as the variable "motor":
head(motor)
## color speed
## 1 black 123
## 2 yellow 129
## 3 black 124
## 4 yellow 124
## 5 black 117
## 6 black 122
We want to test whether the mean speed of the black cars is significantly different
from the mean speed of the yellow cars: we perform a test comparing the
means of two populations.
Which test statistic to choose depends on whether the variances of the two
populations are equal or not. This is checked by another test, which can
be found on page 441 in Keller's book:
The test statistic

F = s1² / s2²

is F distributed with ν1 = n1 − 1 and ν2 = n2 − 1 degrees of freedom.
n1 denotes the number of elements in population 1, n2 the number of
elements in population 2. s1² and s2² are the sample variances of the two populations. In our example the two populations are the black and the yellow cars.
As explained in section 1.6, we first extract from our original data set
“motor” two data sets, one for the black cars and one for the yellow ones:
motor1<- motor[motor$color=="black",]
motor2<- motor[motor$color=="yellow",]
head(motor1)
## color speed
## 1 black 123
## 3 black 124
## 5 black 117
## 6 black 122
## 7 black 105
## 8 black 94
dim(motor1)
## [1] 75 2
head(motor2)
## color speed
## 2 yellow 129
## 4 yellow 124
## 10 yellow 138
## 21 yellow 140
## 22 yellow 133
## 27 yellow 140
dim(motor2)
## [1] 25 2
Thus, in our example, n1 =
n1<-nrow(motor1)
n1
## [1] 75
and n2 =
n2<-nrow(motor2)
n2
## [1] 25
The command nrow() gives the number of rows.
To test the equality of the variances we compute the ratio of the sample variances:
var1<- var(motor1$speed)
var2<- var(motor2$speed)
var1/var2
## [1] 1.132162
and the rejection region, for α = 0.05, is determined by F_{α/2, n1−1, n2−1} and
F_{1−α/2, n1−1, n2−1}. In our example
qf(1-0.025,n1-1,n2-1)
## [1] 2.053903
and
qf(0.025,n1-1,n2-1)
## [1] 0.5448977
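As a cross-check (not required for the hand computation above), R's built-in var.test() performs the same F test of equal variances; its F value should coincide with the ratio var1/var2:
var.test(motor1$speed, motor2$speed)   # F test of equal variances, black vs. yellow cars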
Since the computed ratio 1.132162 lies between these two critical values, we do not reject the hypothesis of equal variances. To test the equality of the means we therefore use the pooled test statistic

t = (x̄1 − x̄2) / √( s_p² · (1/n1 + 1/n2) ),   where s_p² = ((n1 − 1)s1² + (n2 − 1)s2²) / (n1 + n2 − 2),

which is t distributed with n1 + n2 − 2 degrees of freedom.
Our two hypotheses are now
– H0: µ1 − µ2 = 0, that is, the means coincide.
– H1: µ1 − µ2 ≠ 0.
Thus we compute in R:
x1<-mean(motor1$speed)
x1
## [1] 120.5867
x2<-mean(motor2$speed)
x2
## [1] 136.72
var1<- var(motor1$speed)
var2<- var(motor2$speed)
t<- (x1-x2)/sqrt((((n1-1)*var1+(n2-1)*var2)/(n1+n2-2))
*(1/n1+1/n2))
t
## [1] -6.561956
The rejection region is determined by ±t_{α/2, n1+n2−2}, that is, if we pick,
as an example, α = 0.05, plus/minus
-qt(0.025, n1+n2-2)
## [1] 1.984467
Since t falls in the rejection region we reject H0, i.e., we accept
H1: µ1 ≠ µ2.
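As a cross-check, the same pooled two-sample t test is available as a built-in function; with var.equal = TRUE its t statistic should agree with the value computed by hand above:
t.test(speed ~ color, data = motor, var.equal = TRUE)   # pooled t test, black vs. yellow cars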
• Inferential statistics 3: Test whether two nominal variables are related
(like Are smoking and swimming related?). Interpret your results.
Again an example: Assume you want to analyse if there is a relation
between people who play a certain instrument and their favorite fruits.
300 observations are given as the variable “music”, here are the first lines
of the data set:
head(music)
## instrument fruit
## 1 piano ananas
## 2 flute ananas
## 3 guitar ananas
## 4 piano orange
## 5 guitar apple
## 6 piano ananas
To test whether the played instrument and the favorite fruit are related
or not we can apply a χ² test of independence on a contingency table. The hypotheses are:
– H0: instrument and fruit are independent.
– H1: instrument and fruit are dependent.
The command table() immediately gives a contingency table:
table(music)
## fruit
## instrument ananas apple banana orange
## flute 14 15 28 10
## guitar 63 14 23 20
## piano 26 45 22 20
The test statistic

χ² = Σ_{i,j} (f_ij − e_ij)² / e_ij,

where f_ij are the observed frequencies, e_ij = (row total of row i)·(column total of column j)/n are the expected frequencies under H0, and n is the total number of observations, is approximately χ² distributed with (r − 1)(c − 1) degrees of freedom; here r = 3 and c = 4, so 6 degrees of freedom.
To compute the test statistic in R we proceed as follows:
f11<-table(music)[1,1]
f12<-table(music)[1,2]
f13<-table(music)[1,3]
f14<-table(music)[1,4]
f21<-table(music)[2,1]
f22<-table(music)[2,2]
f23<-table(music)[2,3]
f24<-table(music)[2,4]
f31<-table(music)[3,1]
f32<-table(music)[3,2]
f33<-table(music)[3,3]
f34<-table(music)[3,4]
e11<-rowSums(table(music))[1]*colSums(table(music))[1]/300
e12<-rowSums(table(music))[1]*colSums(table(music))[2]/300
e13<-rowSums(table(music))[1]*colSums(table(music))[3]/300
e14<-rowSums(table(music))[1]*colSums(table(music))[4]/300
e21<-rowSums(table(music))[2]*colSums(table(music))[1]/300
e22<-rowSums(table(music))[2]*colSums(table(music))[2]/300
e23<-rowSums(table(music))[2]*colSums(table(music))[3]/300
e24<-rowSums(table(music))[2]*colSums(table(music))[4]/300
e31<-rowSums(table(music))[3]*colSums(table(music))[1]/300
e32<-rowSums(table(music))[3]*colSums(table(music))[2]/300
e33<-rowSums(table(music))[3]*colSums(table(music))[3]/300
e34<-rowSums(table(music))[3]*colSums(table(music))[4]/300
x <-(f11-e11)^2/e11 + (f12-e12)^2/e12 +
(f13-e13)^2/e13 + (f14-e14)^2/e14 +
(f21-e21)^2/e21 + (f22-e22)^2/e22 +
(f23-e23)^2/e23 + (f24-e24)^2/e24 +
(f31-e31)^2/e31 + (f32-e32)^2/e32 +
(f33-e33)^2/e33 + (f34-e34)^2/e34
x
## flute
## 49.16676
qchisq(1-0.05,6)
## [1] 12.59159
Since
x
## flute
## 49.16676
is greater than
qchisq(1-0.05,6)
## [1] 12.59159
we reject H0, i.e. we conclude that the two variables instrument and fruit
are dependent.
Remark: R has a built-in function chisq.test() which directly computes
the value x; however, I ask you to compute it "by hand" as just explained.
Here, for the sake of completeness:
options(width = 60)
chisq.test(table(music),simulate.p.value=T)
##
## Pearson’s Chi-squared test with simulated p-value
## (based on 2000 replicates)
##
## data: table(music)
## X-squared = 49.167, df = NA, p-value = 0.0004998
Solution
Ans1.R
load("C:/Users/user/Downloads/df.RData")
df = dataset(994)
a<-as.data.frame(table(df$clothes))
no.of.obs=dim(df)[1]
P<-a[2,2]/no.of.obs
z<- – qnorm(0.025, 0,1 )
low.bound=P-z*sqrt(P*(1-P)/no.of.obs)
up.bound=P+z*sqrt(P*(1-P)/no.of.obs)
Conf.interval= c(low.bound,up.bound)
Ans2.R
load("C:/Users/user/Downloads/df.RData")
df = dataset(994)
weight_nonsmoker=df[which(df$smoker=="no"),]$weight
weight_smoker=df[which(df$smoker=="yes"),]$weight
n1=length(weight_nonsmoker)
n2=length(weight_smoker)
var1=var(weight_nonsmoker)
var2=var(weight_smoker)
ratio=var1/var2
cutoff1=qf(1-0.025,n1-1,n2-1)
cutoff2=qf(0.025,n1-1,n2-1)
result1=ifelse(ratio<cutoff1 && ratio>cutoff2, "ACCEPT","REJECT")
result1
x1=mean(weight_nonsmoker)
x2=mean(weight_smoker)
t = (x1-x2)/sqrt((((n1-1)*var1+(n2-1)*var2)/(n1+n2-2))*(1/n1+1/n2))
cutoffmain=-qt(0.025, n1+n2-2)
ifelse(abs(t)<cutoffmain,"ACCEPT","REJECT")
Ans3.R
load("C:/Users/user/Downloads/df.RData")
df = dataset(994)
pet_housing=df[,c("housing","pet")]
table(pet_housing)
f11<-table(pet_housing)[1,1]
f12<-table(pet_housing)[1,2]
f21<-table(pet_housing)[2,1]
f22<-table(pet_housing)[2,2]
no.of.obs=dim(pet_housing)[1]
e11<-rowSums(table(pet_housing))[1]*colSums(table(pet_housing))[1]/no.of.obs
e12<-rowSums(table(pet_housing))[1]*colSums(table(pet_housing))[2]/no.of.obs
e21<-rowSums(table(pet_housing))[2]*colSums(table(pet_housing))[1]/no.of.obs
e22<-rowSums(table(pet_housing))[2]*colSums(table(pet_housing))[2]/no.of.obs
x <-(f11-e11)^2/e11 + (f12-e12)^2/e12 +(f21-e21)^2/e21 + (f22-e22)^2/e22
cutoff1=qchisq(1-0.05,1)
ifelse(x>cutoff1,"REJECT","ACCEPT")
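For completeness, the value of x can be checked against the built-in test; for a 2×2 table the continuity correction has to be switched off to reproduce the hand computation:
chisq.test(table(pet_housing), correct = FALSE)   # should give the same statistic as x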
Report
- For the first question we needed to find a confidence interval for the parameter p of a nominal variable with two outcomes. We chose the variable "Clothes", which has the outcomes "yes" and "no".
In order to calculate the confidence interval for the "yes" answers we compute the bounds p̂ ± z_{α/2}·√(p̂(1 − p̂)/n), with p̂ the relative frequency of "yes" answers and n the number of observations (see Ans1.R).
This gives the confidence interval (0.4541742, 0.5418258).
- For the second question we needed to test the difference between the means of a ratio variable for two populations. We chose the two populations to be smokers and non-smokers and the ratio variable to be weight.
In order to test the difference between the means we first need to test whether the variances of the two groups are equal. To do so we use the F statistic s1²/s2², where s1² is the variance of the first group and s2² the variance of the second group. We get the ratio 1.283779. Since it is neither greater than F_{α/2, n1−1, n2−1} (= 1.284071) nor smaller than F_{1−α/2, n1−1, n2−1} (= 0.7799554), this test is not rejected (the script reports ACCEPT) and we assume equal variances.
Now we test the equality of the means of the two groups using the t statistic t = (x̄1 − x̄2)/√(s_p²·(1/n1 + 1/n2)) with pooled variance s_p² = ((n1 − 1)s1² + (n2 − 1)s2²)/(n1 + n2 − 2).
We get the value of t to be 3.01111. Since |t| > t_{α/2, n1+n2−2} (= 1.964739), this test is REJECTED, i.e. H0 is rejected. This means that the mean weight of smokers and non-smokers is not the same.
- For the third question we needed to test whether two nominal variables are related. To do so we chose the variables pet and housing and checked whether having a pet is related to living in a house or a flat.
To compute this we perform the χ² independence test with the test statistic χ² = Σ (f_ij − e_ij)²/e_ij.
The value of the test statistic comes out to be 12.74605, which is greater than χ²_{0.05, 1} (= 3.841459), so H0 is rejected. Thus the two variables housing and pet are dependent.