Adapting the Nearest Neighbours Method for Spatial Clustering

Rastin Seysan
5 min read · Mar 19, 2019


Randomly generated data plotted on Open Street Maps base

Working on a spatial analysis project recently, I needed a method to assign a series of points to clusters, where all the points in a cluster had to be within reasonable walking distance. To make this more tangible, I will use a hypothetical scenario. Alexa is an enterprising entrepreneur who predicts an increase in demand for ice cream with climate change looming large. She is planning to set up multiple ice cream vans across an area but is worried that some vans may run out of ice cream, so she will need the attendants to transfer the ice cream between the vans quickly enough for it not to melt.

If Alexa has experience in clustering, one method that may immediately come to her mind is density-based clustering (DBSCAN), since its 𝜺 parameter controls the maximum distance allowed between a point and an existing cluster for the point to be added to it. However, in a very wide cluster, two points at opposite ends may be too far apart, and the attendants will not be able to transfer the ice cream between their vans. As a result, a hierarchical approach may seem more suitable. But hierarchical clustering also relies on the distances between points and clusters after the first iteration, so depending on where the tree is cut, it either produces too many small clusters or bunches together vans that are too far apart; it cannot provide the middle-ground solution that the ice cream business needs.

What Alexa’s business needs is to have each van belong to multiple clusters; in other words, a relational structure that connects all the vans that are close enough for moving ice cream. To form this structure, a square matrix D holding the distances between all pairs of deployed vans is built, where entry D(i,j) is the distance between van i and van j. Matrix D, also known as a distance (or cost) matrix, should be computed according to the topology of the space, which could be walking distance on an urban road network, a geodesic, or simply Euclidean distance. The matrix is then modified by replacing all unacceptable distances with zeros; for example, any entry greater than 800 metres is replaced by a zero. The resulting matrix can then be used as the adjacency matrix of a weighted graph that has an edge between every two vans that are close enough for transferring ice cream.

I will call this method NNr since it relies on connecting the nearest neighbours within a certain radius. NNr returns a weighted, undirected, but not necessarily connected graph. Where the resulting graph is disconnected, each component is similar to a DBSCAN cluster with the same 𝜺 as the cutoff distance in NNr. Briefly, NNr can be implemented in the following steps (a short R sketch follows the list):

  1. Find the distance matrix for the data, using the distance function appropriate to the geometry of the space.
  2. Replace all entries with distances larger than the acceptable distance 𝜺 with zeros.
  3. Interpret the resulting matrix as the adjacency matrix of a weighted graph; there are as many clusters as vertices (points), and the members of each cluster are the immediate neighbourhood of its vertex.
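
Putting these steps together, a minimal sketch of the whole procedure in R could look like the following. The function name nnr() and its arguments are mine for illustration, not part of any package:

library(igraph)

# Sketch of NNr: pairwise distances -> thresholded matrix -> weighted graph.
# nnr() is an illustrative helper, not an existing package function.
nnr <- function(coords, eps, dist_fun = dist) {
  d <- as.matrix(dist_fun(coords))  # step 1: distance matrix
  d[d > eps] <- 0                   # step 2: drop pairs further apart than eps
  # step 3: read the matrix as the adjacency matrix of a weighted, undirected graph
  graph_from_adjacency_matrix(d, mode = 'undirected', weighted = TRUE)
}

Each vertex of the returned graph, together with its immediate neighbourhood, is then one of the overlapping clusters described above.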

The main advantage of this method over the standard clustering approaches is the possibility of including each data point in multiple clusters, as opposed to creating sharply defined cluster boundaries, which is more meaningful for spatial applications. For Alexa, this means that each of her vans will be connected to a number of others in the graph, which are the ones that can supply it with ice cream.

Example

To see NNr in action, I generated a random sample of coordinates in Euclidean space, using R as follows:

# two Gaussian blobs of 30 points each
x <- data.frame(x = c(rnorm(30, 5, 2), rnorm(30, 2, 1)),
                y = c(rnorm(30, 4, 3), rnorm(30, 2, 2)))
plot(x, col = 'orange', ann = F, yaxt = 'n', xaxt = 'n', bty = 'n')
Randomly generated coordinates

The distance matrix is first calculated and converted to a matrix object, before replacing all distances larger than 1 with zeros. If geodesics are required, the distm() function from the geosphere package can be used instead to find the distances.

x_d <- as.matrix(dist(x, diag = T, upper = T))
x_d[x_d > 1] <- 0
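
For the geodesic case mentioned above, where the coordinates are longitude and latitude pairs rather than the Euclidean toy data used here, the matrix could be built along these lines (the vans data frame and its lon/lat columns are assumptions for illustration):

# Sketch assuming a hypothetical data frame 'vans' with 'lon' and 'lat' columns.
library(geosphere)
vans_d <- distm(vans[, c('lon', 'lat')], fun = distHaversine)  # distances in metres
vans_d[vans_d > 800] <- 0  # keep only pairs within the 800 m cutoff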

To visualise relational data, the igraph package can be used to convert the matrix to an igraph object before plotting. Converting the matrix to a graph object also makes manipulating the relational data easier.

library(igraph)
g <- graph_from_adjacency_matrix(x_d, mode = 'undirected', weighted = T)
plot(g, edge.width = E(g)$weight, vertex.size = 3, vertex.label.cex = .8, vertex.shape = 'none', layout = as.matrix(x))
NNr results with labelled points

If attributes need to be added to the graph object, such as van names or the types of ice cream a van holds, it is easier to use the graph_from_data_frame() function (older alias graph.data.frame()) to create the graph, supplying it with a data frame of edges and a data frame of vertices and their attributes.
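
As a rough sketch, with made-up van names, flavours and distances, this could look like:

# Hypothetical edge list and vertex attributes; names and values are illustrative.
edges <- data.frame(from   = c('van_1', 'van_1', 'van_2'),
                    to     = c('van_2', 'van_3', 'van_3'),
                    weight = c(350, 620, 410))            # distances in metres
vans  <- data.frame(name    = c('van_1', 'van_2', 'van_3'),
                    flavour = c('vanilla', 'pistachio', 'chocolate'))
g_attr <- graph_from_data_frame(edges, directed = FALSE, vertices = vans)
V(g_attr)$flavour  # vertex attributes travel with the graph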

Finding the immediate neighbourhood of a vertex, i.e. a cluster for a certain vertex, is simply accomplished when using a graph object.

neighborhood(g, nodes = 57)[[1]]
+ 5/60 vertices, named, from 1846441:
[1] 57 27 33 42 60

To compare the results against DBSCAN, k-means and hierarchical clustering, plots are also generated, colouring the vertices according to the cluster to which they belong. For DBSCAN, the dbscan() function from the dbscan package was used.

db <- dbscan::dbscan(x, 1, 2)
km <- kmeans(x, 5)
hc <- hclust(dist(x))
plot(g, edge.width = E(g)$weight, vertex.size = 3, vertex.label = NA, layout = as.matrix(x), vertex.color = db$cluster)
plot(g, edge.width = E(g)$weight, vertex.size = 3, vertex.label = NA, layout = as.matrix(x), vertex.color = km$cluster)
plot(g, edge.width = E(g)$weight, vertex.size = 3, vertex.label = NA, layout = as.matrix(x), vertex.color = cutree(hc, k = 30))
plot(g, edge.width = E(g)$weight, vertex.size = 3, vertex.label = NA, layout = as.matrix(x), vertex.color = cutree(hc, k = 5))
