Cluster Analysis
Part I: Manual Exercise of the Hierarchical Cluster Algorithm
Perform a cluster analysis by hand with the a set of five objects {A,B,C,D,E}. Use the city block distance metric to get the initial dissimilarity matrix and the complete linkage update rule. Should you encounter ties in the minimum distances, then merge those objects of the lower lexigraphical order.
Task 1: Calculate the dissimilarity matrix among the objects by using the city block distance metric. (0.2 points)
{A} | {B} | {C} | {D} | {E} | |
{A} | 0 | ||||
{B} | 0 | ||||
{C} | 0 | ||||
{D} | 0 | ||||
{E} | 0 |
Task 2: Identify the pair of objects for the firstmerger, update the class labels in the distance matrix and the distance matrix itself using the complete linkage rule. (0.2 points)
{ } | { } | { } | { } | |
{ } | 0 | |||
{ } | 0 | |||
{ } | 0 | |||
{ } | 0 |
Task 3: Identify the pair of objects/classes for the second merger and update the class labels and the distance matrix using the complete linkage rule. (0.2 points)
{ } | { } | { } | |
{ } | 0 | ||
{ } | 0 | ||
{ } | 0 |
Task 4: Identify the objects/classes for the third merger and update the class labels and the distance matrix using the complete linkage rule. (0.2 points)
{ } | { } | |
{ } | 0 | |
{ } | 0 |
Task 5: Draw the dendrogram of this classification. (0.2 points)
Part 2: Cluster the 25 European Countries (1973) According to their Protein Consumption
The zip-file OldEurope.zip holds the shapefileEuroProteinMap.shp of major 25 European countries in their 1980 boundaries and information of their population’s protein consumption by food group in 1973.
Figure 1: 25 major European countries in their 1980 boundaries, which were included in the protein study.
Note: nowadays some countries do no longer exist in this form and the protein consumption patterns will most likely have changed due to the increased affluence in these countries.
Task 6: Map the qualitative variable Regions of Europe (see EuroProteinMap.shp).Use the shapefileEuro1980Countries.shp as geographic reference frame for your map. In order to zoom into Europe plot the reference frame with the command:
plot(euro.shp, axes=T, bg=”lightblue”, border=”white”, col=grey(0.9), xlim=c(-20,40), ylim=c(30,70))
Task 7:For the subsequent tasks you may want to extract the data-frame for the shapefile with the command euroProt<- as.data.frame(euroProtMap).
[a] Provide a descriptive analysis (either numerical or visually) of the 9 different protein food groups and discuss your analysis.
[b] Justify whether it is advisable to jointly scale the food variables and perform the rescaling of the food groups as needed.
Task 8:Experiment with different dissimilarity metrics and updating rules combinations to find a classification of the 25 countries similar to that of the regions of Europe from task 6.
[a] Show the code that you use for your final classification results. Discuss the dendrogram and scree-plot as well as how many reasonable groups your cluster algorithm has identified.
[b] Perform a descriptive analysis of the identified groups and discuss it.
[c]
Compare in a confusion matrix your classification against the regions of Europe.
Task 9:Label your classification result with informative names (see the function factor( ) ) and visualize your classification in a properly designed map.
CarsClusterAnalysis (1) 1.r
## See also functions in the library(cluster)
rm(list=ls(all=TRUE))
rangeTrans<- function(df){
rT<- function(x){(x-min(x))/(max(x)-min(x))}
df<- df[, sapply(df, is.numeric)] # select numeric variables
df<- sapply(df,rT) # apply range transformation
return(df)
}
options(digits=2)
data(Cars93, package=”MASS”)
summary(Cars93)
Cars93$MPG <- (3*Cars93$MPG.city+2*Cars93$MPG.highway)/5 # Average MPG
## Select relevant variable dicriminating variable
cars<- subset(Cars93, select=c(Make,Type,Origin,Price,Horsepower,EngineSize,MPG,
Weight,Passengers))
## Select US models for origin specific analysis
#cars <- subset(cars, Origin==”USA”)
## Select foreign models
#cars <- subset(cars, Origin==”non-USA”)
## Organize data-frame
cars<- cars[order(cars$Type,cars$Origin), ] # Sort by Type and Origin
row.names(cars) <- cars$Make # Assigning the row-names based on Make
cars$Make<- NULL # drop Make since now in the row.names
## Descriptive analysis based on marketing labels
boxplot(Price~Type, data=cars, ylab=”Price”, main=”Price by Type”, varwidth=T)
boxplot(Horsepower~Type, data=cars, ylab=”Horsepower”, main=”Horsepower by Type”, varwidth=T)
boxplot(EngineSize~Type, data=cars, ylab=”Engine”, main=”Engine Size by Type”, varwidth=T)
boxplot(MPG~Type, data=cars, ylab=”MPG”, main=”MPG by Type”, varwidth=T)
boxplot(Weight~Type, data=cars, ylab=”Weight”, main=”Weight by Type”, varwidth=T)
boxplot(Passengers~Type, data=cars, ylab=”Passengers”, main=”Passengers by Type”, varwidth=T)
car::scatterplotMatrix(~Price+Horsepower+EngineSize+MPG+Weight+Passengers | Type,
data=cars, smoother=F, legend.plot=F)
aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$Type), mean)
aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$Type), sd)
## Scaling of metric variables
cars.trans<- scale(cars[ ,sapply(cars, is.numeric)]) # scale by z-transformation
#cars.trans<- rangeTrans(cars) # scale by range transformation
## Calculate inter-object distance matrix
cars.dist<- dist(cars.trans, method=”euclidean”) # Euclidean distance
#cars.dist<- dist(cars.trans, method=”manhattan”) # Euclidean distance
## Perform cluster analysis
#cars.clus<- hclust(cars.dist, method=”complete”) # clustering based on Complete method
cars.clus<- hclust(cars.dist, method=”ward.D2″) # Ward method with squared Euclidian distances
## Plot Dendrogram
plot(cars.clus, xlab=”Make”, hang=-1, ylab=”Dissimilaritiy”)
cars.class<- rect.hclust(cars.clus, k=5) # break dendrogram into clusters
#cars.class<- identify(cars.clus) # manually break dendogram into clusters
## Generate Scree Plot
plot(rev(cars.clus$height), type=”b”, main=”Determine # of Clusters”,
xlab=”Aglomeration Step”, ylab= “Increasing Dissimilarity”)
## Selecting clusters
n.class<- 5 # Number of clusters
abline(v=n.class,lty=2) # Show in scree plot
cars.class<- cutree(cars.clus, k=n.class) # Cutting clusters
table(cars.class) # number of objects in cluster
## Add generated classes to data-frame
cars<- data.frame(cars, myClass=cars.class) # New dataframewitht the clusters
## Checking cluster characteristics
boxplot(Price~myClass, data=cars, ylab=”Price”, main=”Price by Class”, varwidth=T)
boxplot(Horsepower~myClass, data=cars, ylab=”Horsepower”, main=”Horsepower by Class”, varwidth=T)
boxplot(EngineSize~myClass, data=cars, ylab=”Engine”, main=”Engine Size by Class”, varwidth=T)
boxplot(MPG~myClass, data=cars, ylab=”MPG”, main=”MPG by Class”, varwidth=T)
boxplot(Weight~myClass, data=cars, ylab=”Weight”, main=”Weight by Class”, varwidth=T)
boxplot(Passengers~myClass, data=cars, ylab=”Passengers”, main=”Passengers by Class”, varwidth=T)
## Assign descriptive label to the classes
cars$myClass<- factor(cars$myClass,
labels=c(“EcoBoxes”,”MidsizedCars”,”Large+Vans”,”Sporty”,”TinyBoxes”))
aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$myClass), mean)
aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$myClass), sd)
## Confusion Matrix
addmargins(table(cars$Type,cars$myClass))
Part 2.R
rm(list=ls(all=TRUE))
setwd(“C:\\Users\\Alamr_000\\Desktop\\Fall2017\\Data Analysis\\Labs\\OldEurope (2)”)
library(rgdal);library(rCarto)
euro.shp<- readOGR(dsn=”C:\\Users\\Alamr_000\\Desktop\\Fall2017\\Data Analysis\\Labs\\OldEurope (2)”,
layer = “Euro1980Countries”)
euroPro.shp<-readOGR(dsn=”C:\\Users\\Alamr_000\\Desktop\\Fall2017\\Data Analysis\\Labs\\OldEurope (2)”,
layer=”EuroProteinMap”)
plot(euroPro.shp, axes=T, bg=”lightblue”, border=”white”, col=grey(0.9), xlim=c(-20,40),ylim=c(30,70))
plot(euro.shp, axes=T, bg=”lightblue”, border=”white”, col=grey(0.9), xlim=c(-20,40),ylim=c(30,70),add=T)
Solution
PART I
Task 1
{A} | {B} | {C} | {D} | {E} | |
{A} | 0 | 2 | 6 | 7 | 6 |
{B} | 2 | 0 | 4 | 5 | 4 |
{C} | 6 | 4 | 0 | 3 | 4 |
{D} | 7 | 5 | 3 | 0 | 1 |
{E} | 6 | 4 | 4 | 1 | 0 |
Task 2
After 1st update, objects D and E are clubbed together.
{A} | {B} | {C} | {D,E} | |
{A} | 0 | 2 | 6 | 7 |
{B} | 2 | 0 | 4 | 5 |
{C} | 6 | 4 | 0 | 4 |
{D,E} | 7 | 5 | 4 | 0 |
Task 3
After 2nd update, objects A and B are clubbed together.
{A,B} | {C} | {D,E} | |
{A,B} | 0 | 6 | 7 |
{C} | 6 | 0 | 4 |
{D,E} | 7 | 4 | 0 |
Task 4
After 3rd update, objects C and {D,E} are clubbed together.
{A,B} | {C,D,E} | |
{A,B} | 0 | 7 |
{C,D,E} | 7 | 0 |
Task 5
PART II
Task 6
Figure 1 Map of Euro countries based on Region
Task 7
a)
As can be seen from the distribution of the different types of proteins, few of the types show very different distributions and hence can create a bias in the clustering process because of their larger magnitude. This is especially true for the Cereals class as it is much larger in magnitude than the other classes of proteins.
Task 8
a)
R CODE
library(maptools)
library(ggplot2)
library(reshape2)
EuroProteinMap=readShapePoly(“EuroProteinMap.shp”,verbose=TRUE)
euroProt<- as.data.frame(EuroProteinMap)
printClusters<- function(clust)
{
clust1 = c()
clust2 = c()
clust3 = c()
clust4 = c()
clust5 = c()
clust6 = c()
for(i in 1:25)
{
if (clust[i]==1){clust1 = c(clust1,names(clust)[i])}
if (clust[i]==2){clust2 = c(clust2,names(clust)[i])}
if (clust[i]==3){clust3 = c(clust3,names(clust)[i])}
if (clust[i]==4){clust4 = c(clust4,names(clust)[i])}
if (clust[i]==5){clust5 = c(clust5,names(clust)[i])}
if (clust[i]==6){clust6 = c(clust6,names(clust)[i])}
}
print(clust1)
print(clust2)
print(clust3)
print(clust4)
print(clust5)
print(clust6)
}
d = dist(prot,method =”manhattan”)
c = hclust(d,method= “complete”)
c$labels = euroProt$Country
plot(c,labels = euroProt$Country)
clusters =cutree(c,k=6)
printClusters(clusters)
The best clustering result was achieved using the manhattan distance metric and the complete update logic for the hierarchical clustering algorithm. The calculated dendogram when cut at 6 clusters gives clusters which show great resemblance to the Region the countries belong to.
The clusters identified by the algorithm are :
[1] “France” “Ireland” “UK” “Switzerland” “Belgium” “Austria” “Netherlands” “W Germany” [2] “Greece” “USSR” “Italy” [3] “Poland” “Czechoslovakia” “E Germany” [4] “Bulgaria” “Yugoslavia” “Romania” “Hungary” “Albania” [5] “Finland” “Denmark” “Norway” “Sweden” [6] “Spain” “Portugal”
The clustering algorithm misclassifies 3 countries altogether. Of the regions specified in the dataset, Iberia, WestEuro and Scania are correctly identified as clusters by the hierarchical clustering algorithm.
b)
Cluster 1
Cluster 2
Cluster 3
Cluster 4
Cluster 5
Cluster 6
The above plots showcase the distribution of different types of proteins among the 6 identified clusters. By visual examination alone, it can be seen that the distribution of proteins across the clusters are very different and were hence instrumental in correctly classifying the clusters according to the region they belong to.
c)
Confusion Matrix of Truth ( Region variable) against identified clusters
Cluster | |||||||
WestEuro | EastEuro | Balkan | Iberia | Scania | Mediterra | ||
Truth | WestEuro | 8 | |||||
EastEuro | 3 | 1 | 1 | ||||
Balkan | 4 | ||||||
Iberia | 2 | ||||||
Scania | 4 | ||||||
Mediterra | 2 |
d)
Figure 2 Clusters identified by Hierarchical algorithm
All R codes used:
library(maptools)
library(ggplot2)
library(reshape2)
EuroProteinMap=readShapePoly(“EuroProteinMap.shp”,verbose=TRUE)
plot(EuroProteinMap, ylim = c(30,90),xlim = c(-10,40),col= EuroProteinMap@data$Region)
euroProt<- as.data.frame(EuroProteinMap)
protList = names(euroProt)[7:15]
prot = euroProt[,protList]
meltDf = melt(prot)
head(meltDf)
ggplot(meltDf, aes(x=value, fill=variable)) +
geom_histogram(binwidth=10)+
facet_grid(variable~.)
#TASK 8
printClusters<- function(clust)
{
clust1 = c()
clust2 = c()
clust3 = c()
clust4 = c()
clust5 = c()
clust6 = c()
for(i in 1:25)
{
if (clust[i]==1){clust1 = c(clust1,names(clust)[i])}
if (clust[i]==2){clust2 = c(clust2,names(clust)[i])}
if (clust[i]==3){clust3 = c(clust3,names(clust)[i])}
if (clust[i]==4){clust4 = c(clust4,names(clust)[i])}
if (clust[i]==5){clust5 = c(clust5,names(clust)[i])}
if (clust[i]==6){clust6 = c(clust6,names(clust)[i])}
}
print(clust1)
print(clust2)
print(clust3)
print(clust4)
print(clust5)
print(clust6)
}
# WestEuro :France,Ireland,UK,Switzerland,Belgium,Austria,Netherlands,W Germany
# EastEuro :Poland,USSR,Hungary,Czechoslovakia,E Germany
# Balkan :Bulgaria,Yugoslavia,Romania,Albania
# Iberia : Spain, Portugal
# Scania :Finland,Denmark,Norway,Sweden
# Mediterra : Greece, Italy
# “manhattan”,”maximum”,”canberra”,”binary”,”euclidean”
# single, average, ward.D2
# Identifying the best clustering parameter
d = dist(prot,method =”manhattan”)
c = hclust(d,method= “complete”)
c$labels = euroProt$Country
plot(c,labels = euroProt$Country)
clusters =cutree(c,k=6)
printClusters(clusters)
euroProt$cluster = as.factor(clusters)
# Plotting distributions of different types of proteins for all identified clusters
for(i in 1:6)
{
protList = names(euroProt)[7:15]
prot = euroProt[euroProt$cluster==i,protList]
meltDf = melt(prot)
head(meltDf)
ggplot(meltDf, aes(x=value, fill=variable)) +
geom_histogram(binwidth=10)+
facet_grid(variable~.)
}
#Plotting the European map using the identified clusters
EuroProteinMap@data$cluster = as.factor(clusters)
plot(EuroProteinMap, ylim = c(30,90),xlim = c(-10,40),col= EuroProteinMap@data$cluster)