Project ideas
Big data infrastructure for evolutionary learning methods
We live in the era of big data. All areas of science and society generate very large amounts of data, from the internet to biology to
astronomy. All this data, however, is useless unless we have computational techniques that can cope with its scale and extract
meaningful information from it. Within this context, Google's MapReduce framework for handling big data
(popularised through the open-source Hadoop implementation) is currently the bread and butter of big data analysis and the platform of choice
for the majority of this community.
The objective of this project is to adapt the evolutionary data mining systems developed in the last few years at the University of Nottingham
to Hadoop, so that they can tackle problems of sizes that were previously impossible to solve. To achieve this aim, the project will make
use of the University's brand new High-Performance Computing cluster.
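As a rough illustration of why evolutionary rule learning fits the MapReduce model, the sketch below evaluates the fitness (here, plain accuracy) of one candidate rule over a partitioned dataset. It is plain Python that only mimics the map and reduce steps; it does not use the Hadoop API and is not the Nottingham system. The rule, the threshold, the function names and the toy data are all hypothetical.

```python
from functools import reduce

# Hypothetical candidate rule: predict class 1 when the first attribute
# exceeds a threshold, otherwise predict class 0.
def rule_predict(features, threshold=0.5):
    return 1 if features[0] > threshold else 0

def map_partition(partition, threshold=0.5):
    """Map step: return (correct predictions, rows) for one data partition."""
    correct = sum(1 for features, label in partition
                  if rule_predict(features, threshold) == label)
    return (correct, len(partition))

def reduce_counts(a, b):
    """Reduce step: add two partial (correct, rows) counts together."""
    return (a[0] + b[0], a[1] + b[1])

def fitness(partitions, threshold=0.5):
    """Accuracy of the rule over all partitions, combined from partial counts."""
    partials = map(lambda p: map_partition(p, threshold), partitions)
    correct, total = reduce(reduce_counts, partials, (0, 0))
    return correct / total if total else 0.0

# Toy data split into two partitions; each row is (features, label).
partitions = [
    [((0.9,), 1), ((0.2,), 0)],
    [((0.7,), 0), ((0.1,), 0), ((0.8,), 1)],
]
print(fitness(partitions))
```

Because each partition only returns a pair of counts, the expensive pass over the data can run wherever the data lives, and only small partial results need to be combined.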
Web service for -omics data analysis using rule-based machine learning
In the last decade biological research has seen the development of
many experimental technologies that are able to generate
high-throughput quantitative data from biological samples. Their usage
has improved our understanding of many different aspects of life.
However, the effectiveness of these technologies is constrained by the
limitations of the analysis methods applied to this data.
Recently at the University of Nottingham we have developed a new methodology based on rule-based machine learning to mine this kind of
dataset, generating robust prediction models and extracting meaningful information from the mining process. The goal of this project
is to develop a web service with a user-friendly web interface so that anybody can access our methodology. In the backend, this web service
will interact with the University's brand new High-Performance Computing cluster to perform all of its computationally heavy
tasks.
Rule-based knowledge representations for -omics data analysis
In the last decade biological research has seen the development of
many experimental technologies that are able to generate
high-throughput quantitative data from biological samples. Their usage
has improved our understanding of many different aspects of life.
However, the effectiveness of these technologies is constrained by the
limitations of the analysis methods applied to this data.
Recently at the University of Nottingham we have developed a new methodology based on rule-based machine learning to mine this kind of
dataset, generating robust prediction models and extracting meaningful information from the mining process. The type of information
that we can extract from the data mining process depends greatly on how the rule-based knowledge representations are defined. The goal
of this project is to create and thoroughly evaluate a broad range of knowledge representation variants for -omics data analysis in
order to identify their domains of competence in terms of prediction capacity and knowledge discovery.
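To make the idea of a rule-based knowledge representation concrete, the sketch below shows one possible variant: rules whose conditions are closed intervals over continuous attributes, applied in order as a decision list. This is an assumed example for illustration only, not necessarily a representation used in the Nottingham methodology; the attribute names (geneA, geneB), class labels and values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class IntervalRule:
    """A rule that fires when every listed attribute lies inside its interval."""
    conditions: Dict[str, Tuple[float, float]]  # attribute -> (low, high)
    predicted_class: str

    def matches(self, sample: Dict[str, float]) -> bool:
        # Attributes not mentioned in the rule are ignored ("don't care").
        return all(
            name in sample and low <= sample[name] <= high
            for name, (low, high) in self.conditions.items()
        )

def classify(rules: List[IntervalRule], sample: Dict[str, float], default: str) -> str:
    """Decision-list semantics: the first matching rule decides the class."""
    for rule in rules:
        if rule.matches(sample):
            return rule.predicted_class
    return default

# Hypothetical rule over two expression levels.
rules = [IntervalRule({"geneA": (2.0, 10.0), "geneB": (0.0, 1.5)}, "disease")]

print(classify(rules, {"geneA": 3.1, "geneB": 0.4}, default="control"))
print(classify(rules, {"geneA": 0.5, "geneB": 0.4}, default="control"))
```

A rule of this kind can be read directly as "if geneA is high and geneB is low then disease", which is the sort of information a different representation (for example, one using only greater-than or less-than tests) would express differently.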
Large Scale Data Mining Challenge: Contact Map prediction
Bioinformatics is a fascinating research area in which many disciplines,
such as mathematics, computer science and engineering, are
brought together to solve biological problems and bring new insight into our
understanding of how life works. Within the bioinformatics context, one of
the most relevant topics of research is proteomics, the study of the role
and structure of proteins and, in particular, the prediction of the
structure of proteins (PSP).
The prediction of the (sub)structure of proteins is a very challenging task
from a data mining point of view: very large sets of records, high-dimensional spaces and high class imbalance are just
some of the challenges. The focus of this project is on a specific type of PSP, contact map (CM) prediction, which involves all
of these challenges. The CM prediction method developed at Nottingham is currently one of the world's top methods for this class
of problems, but its training process is extremely costly, using tens of thousands of CPU hours.
The aim of this project is to perform a data mining-centric reassessment of our CM prediction method in order to (a) improve
the quality of the predictions and (b) reduce the computational cost of training the model.
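For readers unfamiliar with the problem, a contact map is a binary matrix that marks which pairs of residues in a protein lie close together in space. The sketch below builds one from 3D coordinates and reports the fraction of residue pairs in contact, which illustrates where the class imbalance mentioned above comes from. The 8 angstrom threshold and the toy coordinates are assumptions chosen for the example, and this is not the Nottingham prediction method, which has to infer contacts without knowing the coordinates.

```python
import math

def contact_map(coords, threshold=8.0):
    """Binary matrix: 1 where two residues are closer than `threshold` (assumed, in angstroms)."""
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1  # the map is symmetric
    return cmap

def contact_fraction(cmap):
    """Fraction of distinct residue pairs that are in contact (the positive class)."""
    n = len(cmap)
    pairs = n * (n - 1) // 2
    contacts = sum(cmap[i][j] for i in range(n) for j in range(i + 1, n))
    return contacts / pairs if pairs else 0.0

# Toy chain: four residues on a straight line, 5 angstroms apart.
coords = [(0.0, 0.0, 0.0), (5.0, 0.0, 0.0), (10.0, 0.0, 0.0), (15.0, 0.0, 0.0)]
cmap = contact_map(coords)
print(cmap)
print(contact_fraction(cmap))
```

The number of residue pairs grows quadratically with chain length while each residue has only a limited number of spatial neighbours, so in real proteins the contact class is a small minority of all pairs. The same quadratic growth is also what makes the training sets so large.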