External Clustering Evaluation Indices, Imbalanced Data Sets, Bioinformatics, Text Mining

Clustering is a type of unsupervised learning that has as aim to find the underlying structure, composed of clusters, present in the data. Examples (objects or instances) belonging to each cluster should share some relevant property (similarity) regarding the data domain. The definition of effective evaluation measures is important to provide the users with a degree of "goodness" (quality) for the clustering results derived from the algorithms used. These assessments should be objective and have no bias to any algorithm. Indeed, several clustering validity measures have been proposed to evaluate the quality of clustering results. In this work, we are interested in external validation indices, which are employed to measure the extent to which cluster labels match externally supplied class labels.

Recently, there have been some works trying to characterize, based on some formal constraints, which aspects of the quality of a partition are captured by different external validation indices. Cluster homogeneity and cluster completeness are examples of basic constraints (or "desirable properties") that clustering accuracy measures should display. For instance, cluster homogeneity states that clusters should not mix objects belonging to different classes, that is, they should be homogeneous.

In this work, we are concerned with the behaviour of such external evaluation measures when one has to compare partitions generated from highly imbalanced data sets (or data with a highly skewed class distribution). For highly imbalanced data sets, almost all the instances are labelled as one class (majority class), whereas far fewer examples are labelled as the other classes. Examples of this are cancer gene expression data sets, which usually have a high number of genes (features) and a low number of patients (examples). The data sets in the context of text mining also follow this kind of pattern.

In the context of supervised learning, standard machine learning algorithms tend to be biased towards the majority class and ignore the minority class. Many researchers have addressed the problem of how to evaluate supervised learning algorithms in the case of class size imbalance. In contrast, up to now, there are few studies trying to characterize to which extent class imbalance can influence the measures used to compare the quality of the partitions generated with clustering algorithms.

Thus, our aim is to develop a wide analysis of the main existing external evaluation measures in the context of imbalanced data such as that from bioinformatics (e.g., microarray) and text mining. We have also as objective to propose novel indices and test them not only in the context of evaluation, but also as components with some cluster ensembles algorithms.

Cluster ensemble and multi-objective optimization

The selection of the best clustering algorithm for a given data set is one of the main difficulties in cluster analysis. In fact, there is a large number of clustering algorithms, each one looking for clusters according to a different cluster definition (or clustering criterion). These algorithms search for a homogeneous structure (all clusters conforming to the same cluster definition), whereas data can present a heterogeneous structure (each cluster conforming to a different cluster definition).

A classical approach to address these problems is by using a clustering validation technique. However, most of these techniques are biased toward a given clustering criterion (e.g., cluster compactness). Alternatively, the problems of algorithm selection and data presenting clusters with heterogeneous structures can be addressed by using cluster ensemble and multi-objective clustering approaches. We have introduced a multi-objective clustering ensemble algorithm (MOCLE, for short), which employs simultaneously concepts from both cluster ensemble and multi-objective clustering algorithms. The idea is not only to minimize the intrinsic problems of cluster analysis, but also the limitations of the cluster ensemble and multi-objective clustering methods when they are used separately.

Meta-learning and clustering

In several domains, such as in Machine Learning, there is a variety of algorithms that can be considered as candidates to solve particular problems. One of the most difficulty tasks in these domains is to predict when one algorithm is better than another to solve a given problem. Traditional approaches to predicting the performance of algorithms often involve costly trial-and-error procedures. Other approaches require expert knowledge, which is not always straightforward to acquire.

In the previous context, meta-learning approaches have arisen as effective solutions, able to automatically predicting algorithms performance for a given problem. Thus, such approaches could support non-expert users in the algorithm selection task. There are different interpretations for the term “meta-learning”. In our work, we use “meta-learning” meaning the automatic process of generating knowledge that relates the performance of machine learning algorithms to the characteristics of the problem (i.e., characteristics of its datasets).

So far, in the literature, meta-learning had been used only for selecting/ranking supervised learning algorithms. Motivated by this, recentely we extend the use of meta-learning approaches for clustering algorithms. We developed our case study in the context of clustering algorithms applied to cancer gene expression data generated by microarray. There are several works that can be developed from our proposal by implementing other meta-learners for different categories of datasets, and by using other meta-learning approaches that have not yet been used in the algorithm selection problem.

Complexity indices for classification problems: cancer gene expression data

Currently, cancer diagnosis at a molecular level has been made possible through the analysis of gene expression data. More specifically, one usually uses machine learning techniques to build, from cancer gene expression data, automatic diagnosis models (classifiers). Cancer gene expression data often present some characteristics that can have a negative impact in the generalization ability of the classifiers generated. Some of these properties are data sparsity and an unbalanced class distribution.

We investigate the results of a set of indices able to extract the intrinsic complexity information from the data. These indices measure statistics of data geometry, topology and shape of the classification boundary. They can be used to analyze, among other things, which particular characteristics of cancer gene expression data mostly impact the prediction ability of, for example, support vector machine classifiers.

Our basic goal is to investigate the capability of the complexity indices to explain the difficulty in the classification of cancer gene expression data, considering the popular experimental framework usually employed in the analysis of microrray data. This is accomplished mainly by analyzing the correlation of the classification error rates of the classifiers generated to the values yielded by the complexity indices.

Another challenging topic for further research is the definition of complexity measures, which take, at the same time, both data preparation (e.g., feature selection and treatment of missing values) and classification into account. Other important direction is to employ complexity indices, in particular those related to data sparsity and class balance, for building a method for recommending the number of samples necessary for a robust gene selection and classification results. Another interesting direction is the use of the complexity indices as meta-attributes for meta-learning. In such a context, for instance, we could build a classifier to suggest the most appropriate classification/feature selection method for a given data set.

Learning algorithms for weightless neural networks: a unified view

My co-workers and I address the problem of integrating prior knowledge and learning from examples in sequential Artificial Neural Networks (ANN). More specifically, our focus is on the Weightless Neural Networks (WNNs). We assume that, besides the training set, a learning algorithm will have access to some form of partial knowledge (prior knowledge) in terms of symbolic rules, on the problem. This partial knowledge will be represented by restrictions in the numerical values associated with each neuron of the network and then completed through learning from examples.

For this model of "unified learning", or hybrid WNNs, to be effective, we argue that knowledge integration must be uniform. Thus, just as for explicit rules (partial knowledge), the learned rules must be represented by restrictions in the numerical values of the neuron. Also, one has to specify the type of prior knowledge to use. Without loss of generality, we assume that the prior knowledge about the problem will be provided in terms of automata.

The role of learning now is to "refine" partial knowledge. In this context, the learning space is considerably restricted by a priori knowledge of a given domain. Therefore, the unified learning algorithm becomes closely related to the incorporation of knowledge through rules.

We address key issues in the integration of a priori knowledge and learning from examples in ANNs: the formal characterization of classes of automata that can be implemented in ANNs (prior knowledge), the formal characterization of the classes of automata that can be learned by ANNs when used with existing training algorithms (a posteriori knowledge) and development of learning algorithms that incorporates a priori and a posteriori knowledge (unified algorithms ) and applications/associations to new areas of computing such as cognitive computation.