I Sarafis PhD Thesis Abstract

Driven by advances in data collection and storage, increasingly large and high dimensional datasets are being stored. Without special tools, human analysts can no longer make sense of such enormous volumes of data. Hence, intelligent data mining (DM) techniques are being developed to semi-automate the process of mining nuggets of hidden knowledge, and extract them in forms that can be readily utilised in areas such as decision support. Clustering high dimensional data is especially challenging due to the inherent sparsity of the dataspace. Evolutionary algorithms (EAs) are a promising technique for DM clustering as population-based searches have intrinsic search parallelism, their stochastic nature avoids local optima and recovers from poor initialisation.

This thesis investigates the use of evolutionary algorithms to effectively and efficiently mine clusters from massive and high dimensional numerical databases. The fundamental question addressed by this thesis is: can a stochastic search cluster large high dimensional datasets, and extract knowledge that conforms to the important requirements for DM clustering? Experimental results on both artificial and real-world datasets lead us to conclude that it can.

The thesis proposes a novel EA methodology for DM clustering with the following three phases. Firstly, a sophisticated quantisation algorithm (TSQ: Two-Stage Quantization) imposes a uniform multi-dimensional grid onto the dataspace to reduce the search combinations. TSQ quantises the dataspace using a novel statistical analysis that reflects the local data distribution. It determines an appropriate grid resolution that enables the discrimination of clusters, while preserving accuracy and acceptable computational cost. Secondly, a novel EA (NOCEA: Non-Overlapping Clustering with an Evolutionary Algorithms) discovers high quality clustering rules using several novel semi-stochastic genetic operators, an integer-valued encoding scheme, and a simple data coverage maximisation fitness function. Both TSQ and NOCEA rely on a novel statistical analysis (UDA: Uniform-region Discovery Algorithm) identifying flat density regions (U-regions) in univariate histograms. U-regions detected in orthogonal uni-dimensional projections are ``signatures'' of clusters being embedded in higher dimensional spaces. Thirdly, a post-processing simplification phase that removes irrelevant dimensions (subspace clustering) and assembles the clusters. The thesis also explores task parallelism for several genetic operations to improve scalability when the data to be mined is large and high dimensional.

NOCEA is a generic and robust clustering algorithm that meets the key DM clustering criteria. The following properties of NOCEA are demonstrated on both benchmark artificial datasets, and in a substantial real-world case study clustering the seismic activity associated with the active crustal deformation along the African-Eurasian-Arabian tectonic plate boundary. NOCEA produces interpretable output in the form of disjoint and axis-aligned hyper-rectangular clustering rules with homogeneous data distribution; the output is minimised for ease of comprehension. NOCEA has the ability to discover homogeneous clusters of arbitrary density, geometry, and data coverage. NOCEA effectively treats high dimensional data, e.g. 200 dimensions, and it effectively identifies subspace clusters being embedded in arbitrary subsets of dimensions. NOCEA has near linear scalability with respect to the database size, and both data and cluster dimensionality. NOCEA has substantial potential for task parallelism, e.g. reaching a speed up of 13.8 on 16 processors. NOCEA produces similar quality results irrespective initialisation and order of input data. NOCEA is exceptionally resistant to background noise. Finally, NOCEA has minimal requirements for a priori knowledge, and does not presume any canonical distribution of the input data.