Clustering belongs to unsupervised learning, along with other branches like dimensionality reduction. The training samples carry no labels; the input data is grouped into classes by assigning each sample to a group based on some similarity measure (usually Euclidean distance). The classic clustering method is the kmeans algorithm.
For the purposes of plotting in 2D, the first two principal components are used as axes.
# principal components of the four measurements, used only as 2D plotting axes
iris.pc <- prcomp(iris[,1:4], center = FALSE, scale. = FALSE)$x %>% as.data.frame()
First, using the standard kmeans algorithm on the data set, we get:
# k = 3 clusters, matching the number of iris species
km.cluster <- kmeans(iris[,1:4], centers = 3, iter.max = 20, nstart = 2)
iris.pc$kmeans.cluster <- km.cluster$cluster
table(iris$Species, km.cluster$cluster)
##             
##               1  2  3
##   setosa      0  0 50
##   versicolor  2 48  0
##   virginica  36 14  0
mjs_plot(iris.pc, x=PC1, y=PC2) %>%
  mjs_point(color_accessor=kmeans.cluster) %>%
  mjs_labs(x="principal comp 1", y="principal comp 2")
Next, the randomForest algorithm is used, run in unsupervised mode by setting the outcome variable y = NULL. It generates a proximity matrix, which gives a rough estimate of the distance between samples based on the proportion of times they end up in the same leaf node. The proximity matrix is converted (as 1 - proximity) to a dist object, which is then fed to the hclust algorithm. The hierarchical tree is cut into k = 3 branches to obtain the final cluster assignment.
# unsupervised mode: y = NULL, so only the proximity matrix is of interest
rf.fit <- randomForest(x = iris[,1:4], y = NULL, ntree = 10000, proximity = TRUE, oob.prox = TRUE)
# hierarchical clustering with 1 - proximity as the distance
hclust.rf <- hclust(as.dist(1 - rf.fit$proximity), method = "ward.D2")
rf.cluster <- cutree(hclust.rf, k = 3)
iris.pc$rf.clusters <- rf.cluster
table(rf.cluster, iris$Species)
##           
## rf.cluster setosa versicolor virginica
##          1     50          0         0
##          2      0         49        14
##          3      0          1        36
mjs_plot(iris.pc, x=PC1, y=PC2) %>%
  mjs_point(color_accessor=rf.clusters) %>%
  mjs_labs(x="principal comp 1", y="principal comp 2")
From Breiman’s original description:
In unsupervised learning the data consist of a set of x-vectors of the same dimension with no class labels or response variables. There is no figure of merit to optimize, leaving the field open to ambiguous conclusions. The usual goal is to cluster the data - to see if it falls into different piles, each of which can be assigned some meaning.
The approach in random forests is to consider the original data as class 1 and to create a synthetic second class of the same size that will be labeled as class 2. The synthetic second class is created by sampling at random from the univariate distributions of the original data. Here is how a single member of class two is created - the first coordinate is sampled from the N values {x(1,n)}. The second coordinate is sampled independently from the N values {x(2,n)}, and so forth.
Thus, class two has the distribution of independent random variables, each one having the same univariate distribution as the corresponding variable in the original data. Class 2 thus destroys the dependency structure in the original data. But now, there are two classes and this artificial two-class problem can be run through random forests. This allows all of the random forests options to be applied to the original unlabeled data set.
If the oob misclassification rate in the two-class problem is, say, 40% or more, it implies that the x-variables look too much like independent variables to random forests. The dependencies do not have a large role and not much discrimination is taking place. If the misclassification rate is lower, then the dependencies are playing an important role.
Formulating it as a two class problem has a number of payoffs. Missing values can be replaced effectively. Outliers can be found. Variable importance can be measured. Scaling can be performed (in this case, if the original data had labels, the unsupervised scaling often retains the structure of the original scaling). But the most important payoff is the possibility of clustering.
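The two-class construction Breiman describes can be written out explicitly. The sketch below is only an illustration of the idea; randomForest performs this step internally when y = NULL, and names such as synth, two.class and rf.2class are made up for the example.
# class 1: the original observations
real <- iris[, 1:4]
# class 2: each column sampled independently from its own values,
# destroying the dependency structure between variables
set.seed(1)
synth <- as.data.frame(lapply(real, function(col) sample(col, length(col), replace = TRUE)))
two.class <- rbind(real, synth)
y <- factor(rep(c("real", "synthetic"), each = nrow(real)))
# run the artificial two-class problem through random forests
rf.2class <- randomForest(x = two.class, y = y, ntree = 1000, proximity = TRUE)
# a low OOB error rate means the dependencies in the original data carry real structure
rf.2class$err.rate[1000, "OOB"]
# the proximities among the original 150 rows could then be clustered as before, e.g.
# hclust(as.dist(1 - rf.2class$proximity[1:150, 1:150]), method = "ward.D2")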