ANR InvolvD: Interactive constraint elicitation for unsupervised and semi-supervised data mining.
Early machine learning (ML) and data mining (DM) research tried to fully automate knowledge discovery processes, reducing human intervention. There were good reasons for this: humans cannot process large amounts of (high-dimensional) data, tend to see patterns everywhere, and technical progress should help save time. This line of work continues today: Google, for example, offers an AutoML service.
In a supervised setting, using labels for parameter tuning and model selection can work, but with only a few labels available (a semi-supervised setting) or none at all (an unsupervised one), the opposite is necessary: putting the user into the loop and letting her react to results improves the mining process. In short, one needs interactive data mining.
This creates new challenges: result presentation needs to help the user make decisions, feedback needs to be intuitive for the user and useful in the mining process, and whereas AutoML can run for hours, interactive systems need to interact with the user frequently.
Overcoming these challenges not only improves mining results but adds another benefit: users are more likely to trust results if they understand the process generating them, either by participating in it or by understanding how others did. This is relevant in high-stakes settings, where investments (of time or money) or even human lives depend on correct results. Presenting results during interactive mining also helps to interpret final results, supporting hypothesis formation, data acquisition, and real-life experiments. Finally, recent regulations in the EU and the US define citizens’ rights to explanations for algorithmic decisions affecting them, and require companies or institutions to provide them. Such requirements have motivated research attempting to translate black box models, e.g. in Deep Learning, into interpretable ones via an intermediate step. In such methods, however, user feedback cannot directly influence either the model or the learning process.
To work towards explainable results in unsupervised and semi-supervised DM, we propose the project InvolvD, which addresses several challenges: identifying sense-making visualizations, offering explanations for informed feedback, transforming them into useful constraints, and developing new algorithms exploiting those. Using clustering and symbolic pattern mining, we will study problem settings where user reactions can be fed back directly into the process itself.
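To make the idea of turning feedback into constraints concrete, here is a minimal sketch, not the project's own algorithm. It assumes user feedback arrives as pairwise must-link and cannot-link constraints and applies them in a single assignment pass in the style of COP-KMeans, a well-known constrained clustering method. The function names, data, and constraints are hypothetical and chosen only for illustration.

```python
from math import dist  # Euclidean distance, Python 3.8+


def violates(i, cluster, labels, must_link, cannot_link):
    """Return True if putting point i into `cluster` breaks a constraint,
    given the labels assigned so far (hypothetical helper)."""
    for a, b in must_link:
        other = b if a == i else a if b == i else None
        # A must-link partner that is already assigned elsewhere is a violation.
        if other is not None and labels.get(other) not in (None, cluster):
            return True
    for a, b in cannot_link:
        other = b if a == i else a if b == i else None
        # A cannot-link partner already in this cluster is a violation.
        if other is not None and labels.get(other) == cluster:
            return True
    return False


def constrained_assign(points, centers, must_link, cannot_link):
    """One assignment pass: each point goes to its nearest feasible center."""
    labels = {}
    for i, p in enumerate(points):
        by_distance = sorted(range(len(centers)), key=lambda c: dist(p, centers[c]))
        feasible = [c for c in by_distance
                    if not violates(i, c, labels, must_link, cannot_link)]
        if not feasible:
            raise ValueError(f"no feasible cluster for point {i}")
        labels[i] = feasible[0]
    return [labels[i] for i in range(len(points))]


# Toy data: two obvious groups on a line, with fixed centers.
points = [(0.0, 0.0), (0.2, 0.0), (4.0, 0.0), (4.2, 0.0)]
centers = [(0.0, 0.0), (4.0, 0.0)]

# Without feedback, assignment follows distance alone.
unconstrained = constrained_assign(points, centers, [], [])

# Hypothetical feedback: the user says points 0 and 1 must be separated,
# and points 1 and 2 belong together.
with_feedback = constrained_assign(points, centers,
                                   must_link=[(1, 2)], cannot_link=[(0, 1)])
```

In this toy example the feedback moves point 1 out of its nearest cluster. A full interactive system would alternate such passes with center updates and new rounds of user feedback.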
The use case guiding our progress throughout the project is chemoinformatics, a prototypical domain for the issues outlined above. In drug design, exploratory data analysis is highly important, molecules need to be understood with respect to their structure and/or chemical properties, and experts have knowledge that is hard to exploit before seeing preliminary results.
Marcílio PEREIRA DE SOUTO