Cluster Analysis

Cluster Analysis

Part I: Manual Exercise of the Hierarchical Cluster Algorithm 

Perform a cluster analysis by hand with the a set of five objects {A,B,C,D,E}. Use the city block distance metric to get the initial dissimilarity matrix and the complete linkage update rule. Should you encounter ties in the minimum distances, then merge those objects of the lower lexigraphical order.

Task 1: Calculate the dissimilarity matrix among the objects by using the city block distance metric. (0.2 points)

  {A} {B} {C} {D} {E}
{A} 0
{B} 0
{C} 0
{D} 0
{E} 0

 Task 2: Identify the pair of objects for the firstmerger, update the class labels in the distance matrix and the distance matrix itself using the complete linkage rule. (0.2 points)

  {           } {           } {           } {           }
{           } 0
{           } 0
{           } 0
{           } 0

Task 3: Identify the pair of objects/classes for the second merger and update the class labels and the distance matrix using the complete linkage rule. (0.2 points)

  {           } {           } {           }
{           } 0
{           } 0
{           } 0

Task 4: Identify the objects/classes for the third merger and update the class labels and the distance matrix using the complete linkage rule. (0.2 points)

  {           } {           }
{           } 0
{           } 0

Task 5: Draw the dendrogram of this classification. (0.2 points)

Part 2: Cluster the 25 European Countries (1973) According to their Protein Consumption

The zip-file OldEurope.zip holds the shapefileEuroProteinMap.shp of major 25 European countries in their 1980 boundaries and information of their population’s protein consumption by food group in 1973.

Figure 1: 25 major European countries in their 1980 boundaries, which were included in the protein study.

Note: nowadays some countries do no longer exist in this form and the protein consumption patterns will most likely have changed due to the increased affluence in these countries.

Task 6: Map the qualitative variable Regions of Europe (see EuroProteinMap.shp).Use the shapefileEuro1980Countries.shp as geographic reference frame for your map. In order to zoom into Europe plot the reference frame with the command:

plot(euro.shp, axes=T, bg=”lightblue”, border=”white”, col=grey(0.9), xlim=c(-20,40), ylim=c(30,70))

Task 7:For the subsequent tasks you may want to extract the data-frame for the shapefile with the command euroProt<- as.data.frame(euroProtMap).

[a] Provide a descriptive analysis (either numerical or visually) of the 9 different protein food groups and discuss your analysis.

[b] Justify whether it is advisable to jointly scale the food variables and perform the rescaling of the food groups as needed.

Task 8:Experiment with different dissimilarity metrics and updating rules combinations to find a classification of the 25 countries similar to that of the regions of Europe from task 6.

[a] Show the code that you use for your final classification results. Discuss the dendrogram and scree-plot as well as how many reasonable groups your cluster algorithm has identified.

[b] Perform a descriptive analysis of the identified groups and discuss it.

[c]

Compare in a confusion matrix your classification against the regions of Europe.

Task 9:Label your classification result with informative names (see the function factor( ) ) and visualize your classification in a properly designed map. 

CarsClusterAnalysis (1) 1.r 

## See also functions in the library(cluster)

rm(list=ls(all=TRUE))

rangeTrans<- function(df){

rT<- function(x){(x-min(x))/(max(x)-min(x))}

df<- df[, sapply(df, is.numeric)]         # select numeric variables

df<- sapply(df,rT)                        # apply range transformation

return(df)

}

options(digits=2)

data(Cars93, package=”MASS”)

summary(Cars93)

Cars93$MPG <- (3*Cars93$MPG.city+2*Cars93$MPG.highway)/5           # Average MPG

## Select relevant variable dicriminating variable

cars<- subset(Cars93, select=c(Make,Type,Origin,Price,Horsepower,EngineSize,MPG,

Weight,Passengers))

## Select US models for origin specific analysis

#cars <- subset(cars, Origin==”USA”)

## Select foreign models

#cars <- subset(cars, Origin==”non-USA”)

## Organize data-frame

cars<- cars[order(cars$Type,cars$Origin), ]       # Sort by Type and Origin

row.names(cars) <- cars$Make                       # Assigning the row-names based on Make

cars$Make<- NULL                                  # drop Make since now in the row.names

## Descriptive analysis based on marketing labels

boxplot(Price~Type, data=cars, ylab=”Price”, main=”Price by Type”, varwidth=T)

boxplot(Horsepower~Type, data=cars, ylab=”Horsepower”, main=”Horsepower by Type”, varwidth=T)

boxplot(EngineSize~Type, data=cars, ylab=”Engine”, main=”Engine Size by Type”, varwidth=T)

boxplot(MPG~Type, data=cars, ylab=”MPG”, main=”MPG by Type”, varwidth=T)

boxplot(Weight~Type, data=cars, ylab=”Weight”, main=”Weight by Type”, varwidth=T)

boxplot(Passengers~Type, data=cars, ylab=”Passengers”, main=”Passengers by Type”, varwidth=T)

car::scatterplotMatrix(~Price+Horsepower+EngineSize+MPG+Weight+Passengers | Type,

data=cars, smoother=F, legend.plot=F)

aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$Type), mean)

aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$Type), sd)

## Scaling of metric variables

cars.trans<- scale(cars[ ,sapply(cars, is.numeric)])    # scale by z-transformation

#cars.trans<- rangeTrans(cars)                           # scale by range transformation

## Calculate inter-object distance matrix

cars.dist<- dist(cars.trans, method=”euclidean”)        # Euclidean distance

#cars.dist<- dist(cars.trans, method=”manhattan”)        # Euclidean distance

## Perform cluster analysis

#cars.clus<- hclust(cars.dist, method=”complete”)        # clustering based on Complete method

cars.clus<- hclust(cars.dist, method=”ward.D2″)        # Ward method with squared Euclidian distances

## Plot Dendrogram

plot(cars.clus, xlab=”Make”, hang=-1, ylab=”Dissimilaritiy”)

cars.class<- rect.hclust(cars.clus, k=5)                # break dendrogram into clusters

#cars.class<- identify(cars.clus)                        # manually break dendogram into clusters

## Generate Scree Plot

plot(rev(cars.clus$height), type=”b”, main=”Determine # of Clusters”,

xlab=”Aglomeration Step”, ylab= “Increasing Dissimilarity”)

## Selecting clusters

n.class<- 5                                       # Number of clusters

abline(v=n.class,lty=2)                            # Show in scree plot

cars.class<- cutree(cars.clus, k=n.class)         # Cutting clusters

table(cars.class)                                  # number of objects in cluster

## Add generated classes to data-frame

cars<- data.frame(cars, myClass=cars.class)        # New dataframewitht the clusters

## Checking cluster characteristics

boxplot(Price~myClass, data=cars, ylab=”Price”, main=”Price by Class”, varwidth=T)

boxplot(Horsepower~myClass, data=cars, ylab=”Horsepower”, main=”Horsepower by Class”, varwidth=T)

boxplot(EngineSize~myClass, data=cars, ylab=”Engine”, main=”Engine Size by Class”, varwidth=T)

boxplot(MPG~myClass, data=cars, ylab=”MPG”, main=”MPG by Class”, varwidth=T)

boxplot(Weight~myClass, data=cars, ylab=”Weight”, main=”Weight by Class”, varwidth=T)

boxplot(Passengers~myClass, data=cars, ylab=”Passengers”, main=”Passengers by Class”, varwidth=T)

## Assign descriptive label to the classes

cars$myClass<- factor(cars$myClass,

labels=c(“EcoBoxes”,”MidsizedCars”,”Large+Vans”,”Sporty”,”TinyBoxes”))

aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$myClass), mean)

aggregate(cars[, sapply(cars, is.numeric)], by=list(cars$myClass), sd)

## Confusion Matrix

addmargins(table(cars$Type,cars$myClass))

 Part 2.R 

rm(list=ls(all=TRUE))

setwd(“C:\\Users\\Alamr_000\\Desktop\\Fall2017\\Data Analysis\\Labs\\OldEurope (2)”)

library(rgdal);library(rCarto)

euro.shp<- readOGR(dsn=”C:\\Users\\Alamr_000\\Desktop\\Fall2017\\Data Analysis\\Labs\\OldEurope (2)”,

layer = “Euro1980Countries”)

euroPro.shp<-readOGR(dsn=”C:\\Users\\Alamr_000\\Desktop\\Fall2017\\Data Analysis\\Labs\\OldEurope (2)”,

layer=”EuroProteinMap”)

plot(euroPro.shp, axes=T, bg=”lightblue”, border=”white”, col=grey(0.9), xlim=c(-20,40),ylim=c(30,70))

plot(euro.shp, axes=T, bg=”lightblue”, border=”white”, col=grey(0.9), xlim=c(-20,40),ylim=c(30,70),add=T) 

Solution 

PART I 

Task 1

{A} {B} {C} {D} {E}
{A} 0 2 6 7 6
{B} 2 0 4 5 4
{C} 6 4 0 3 4
{D} 7 5 3 0 1
{E} 6 4 4 1 0

 

Task 2

After 1st update, objects D and E are clubbed together. 

{A} {B} {C} {D,E}
{A} 0 2 6 7
{B} 2 0 4 5
{C} 6 4 0 4
{D,E} 7 5 4 0

 Task 3

After 2nd update, objects A and B are clubbed together. 

{A,B} {C} {D,E}
{A,B} 0 6 7
{C} 6 0 4
{D,E} 7 4 0

 Task 4

After 3rd update, objects C and {D,E} are clubbed together. 

{A,B} {C,D,E}
{A,B} 0 7
{C,D,E} 7 0

 Task 5 

PART II

Task 6

Figure 1 Map of Euro countries based on Region

Task 7

a)

As can be seen from the distribution of the different types of proteins, few of the types show very different distributions and hence can create a bias in the clustering process because of their larger magnitude. This is especially true for the Cereals class as it is much larger in magnitude than the other classes of proteins. 

Task 8

a)

R CODE

library(maptools)

library(ggplot2)

library(reshape2)

EuroProteinMap=readShapePoly(“EuroProteinMap.shp”,verbose=TRUE)

euroProt<- as.data.frame(EuroProteinMap)

printClusters<- function(clust)

{

clust1 = c()

clust2 = c()

clust3 = c()

clust4 = c()

clust5 = c()

clust6 = c()

for(i in 1:25)

{

if (clust[i]==1){clust1 = c(clust1,names(clust)[i])}

if (clust[i]==2){clust2 = c(clust2,names(clust)[i])}

if (clust[i]==3){clust3 = c(clust3,names(clust)[i])}

if (clust[i]==4){clust4 = c(clust4,names(clust)[i])}

if (clust[i]==5){clust5 = c(clust5,names(clust)[i])}

if (clust[i]==6){clust6 = c(clust6,names(clust)[i])}

}

print(clust1)

print(clust2)

print(clust3)

print(clust4)

print(clust5)

print(clust6)

}

d = dist(prot,method  =”manhattan”)

c = hclust(d,method= “complete”)

c$labels = euroProt$Country

plot(c,labels = euroProt$Country)

clusters =cutree(c,k=6)

printClusters(clusters)

The best clustering result was achieved using the manhattan distance metric and the complete update logic for the hierarchical clustering algorithm. The calculated dendogram when cut at 6 clusters gives clusters which show great resemblance to the Region the countries belong to.

The clusters identified by the algorithm are :

[1] “France” “Ireland” “UK” “Switzerland” “Belgium” “Austria” “Netherlands” “W Germany”  [2] “Greece” “USSR” “Italy” [3] “Poland” “Czechoslovakia” “E Germany”     [4] “Bulgaria” “Yugoslavia” “Romania” “Hungary” “Albania”   [5] “Finland” “Denmark” “Norway” “Sweden” [6] “Spain” “Portugal”

The clustering algorithm misclassifies 3 countries altogether. Of the regions specified in the dataset, Iberia, WestEuro and Scania are correctly identified as clusters by the hierarchical clustering algorithm.

b)

Cluster 1

Cluster 2

Cluster 3

Cluster 4

Cluster 5

Cluster 6

The above plots showcase the distribution of different types of proteins among the 6 identified clusters. By visual examination alone, it can be seen that the distribution of proteins across the clusters are very different and were hence instrumental in correctly classifying the clusters according to the region they belong to.

c)

Confusion Matrix of Truth ( Region variable) against identified clusters

Cluster
WestEuro EastEuro Balkan Iberia Scania Mediterra
Truth WestEuro 8
EastEuro 3 1 1
Balkan 4
Iberia 2
Scania 4
Mediterra 2

d)

                                                                   Figure 2 Clusters identified by Hierarchical algorithm

All R codes used:

library(maptools)

library(ggplot2)

library(reshape2)

EuroProteinMap=readShapePoly(“EuroProteinMap.shp”,verbose=TRUE)

plot(EuroProteinMap, ylim = c(30,90),xlim = c(-10,40),col= EuroProteinMap@data$Region)

euroProt<- as.data.frame(EuroProteinMap)

protList = names(euroProt)[7:15]

prot = euroProt[,protList]

meltDf = melt(prot)

head(meltDf)

ggplot(meltDf, aes(x=value, fill=variable)) +

geom_histogram(binwidth=10)+

facet_grid(variable~.)

#TASK 8

printClusters<- function(clust)

{

clust1 = c()

clust2 = c()

clust3 = c()

clust4 = c()

clust5 = c()

clust6 = c()

for(i in 1:25)

{

if (clust[i]==1){clust1 = c(clust1,names(clust)[i])}

if (clust[i]==2){clust2 = c(clust2,names(clust)[i])}

if (clust[i]==3){clust3 = c(clust3,names(clust)[i])}

if (clust[i]==4){clust4 = c(clust4,names(clust)[i])}

if (clust[i]==5){clust5 = c(clust5,names(clust)[i])}

if (clust[i]==6){clust6 = c(clust6,names(clust)[i])}

}

print(clust1)

print(clust2)

print(clust3)

print(clust4)

print(clust5)

print(clust6)

}

# WestEuro :France,Ireland,UK,Switzerland,Belgium,Austria,Netherlands,W Germany

# EastEuro :Poland,USSR,Hungary,Czechoslovakia,E Germany

# Balkan :Bulgaria,Yugoslavia,Romania,Albania

# Iberia : Spain, Portugal

# Scania :Finland,Denmark,Norway,Sweden

# Mediterra : Greece, Italy

# “manhattan”,”maximum”,”canberra”,”binary”,”euclidean”

# single, average, ward.D2

# Identifying the best clustering parameter

d = dist(prot,method  =”manhattan”)

c = hclust(d,method= “complete”)

c$labels = euroProt$Country

plot(c,labels = euroProt$Country)

clusters =cutree(c,k=6)

printClusters(clusters)

euroProt$cluster = as.factor(clusters)

# Plotting distributions of different types of proteins for all identified clusters

for(i in 1:6)

{

protList = names(euroProt)[7:15]

prot = euroProt[euroProt$cluster==i,protList]

meltDf = melt(prot)

head(meltDf)

ggplot(meltDf, aes(x=value, fill=variable)) +

geom_histogram(binwidth=10)+

facet_grid(variable~.)

}

#Plotting the European map using the identified clusters

EuroProteinMap@data$cluster = as.factor(clusters)

plot(EuroProteinMap, ylim = c(30,90),xlim = c(-10,40),col= EuroProteinMap@data$cluster)