Project ideas

Below you can find several project ideas that I am happy to supervise. I am also open to considering your own ideas, as long as they align with my research interests. Please refer to my Research interests and Publications for more information.

If you wish to talk about any of these, email me to arrange an appointment.

Vehicle Incident Hot Spots Identification

Transportation research mostly aims at establishing the means for improving driving performance, economy and safety. Logistics complexity coupled with large transport networks has required the widespread use of sensors, tracking devices and mobile communication equipment to enable such developments. These devices constantly gather information about vehicles and their journeys, including, for instance, safety hazards, vehicle diagnostics and driving behaviour. Given the speed at which large volumes of data are produced, the challenge is to establish effective tools for fast processing and analysis, so that the information can be used by transport stakeholders in a timely manner.

The aim of this project is to develop a big data algorithm (based on Spark Streaming) that identifies hot spots for traffic incidents and accidents in very large datasets.
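To make the idea concrete, here is a minimal sketch of grid-based hot spot counting over micro-batches of incidents. It is plain Python rather than Spark Streaming, and everything in it is an assumption made for illustration: incidents are taken to be (latitude, longitude) pairs, and the cell size, function names and threshold are hypothetical.

```python
from collections import Counter
import math

CELL_DEG = 0.01  # assumed grid cell size in degrees (about 1 km in latitude)

def cell_of(lat, lon, cell=CELL_DEG):
    """Map a coordinate to the integer index of the grid cell containing it."""
    return (math.floor(lat / cell), math.floor(lon / cell))

def update_counts(counts, batch):
    """Fold one micro-batch of (lat, lon) incidents into the running cell counts."""
    counts.update(cell_of(lat, lon) for lat, lon in batch)
    return counts

def hot_spots(counts, min_incidents):
    """Return (cell, count) pairs reaching the threshold, busiest first."""
    return [(c, n) for c, n in counts.most_common() if n >= min_incidents]

# Two micro-batches arriving one after the other.
counts = Counter()
update_counts(counts, [(52.9512, -1.1512), (52.9518, -1.1515)])
update_counts(counts, [(52.9515, -1.1518), (52.9712, -1.1912)])
spots = hot_spots(counts, min_incidents=3)
```

In a streaming setting the running counts would play the role of per-key state updated on every batch; a real solution would also need to deal with time windows and with hot spots that straddle cell borders.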

Visualising the behaviour of Semi-supervised classification techniques

The Semi-Supervised Learning (SSL) paradigm has attracted much attention in many fields, ranging from bioinformatics to web mining, where unlabeled data is easier to obtain than labeled data because labeling requires effort, expertise and time. In this context, traditional supervised learning is limited to using labeled data to build a model. SSL is a learning paradigm concerned with the design of models in the presence of both labeled and unlabeled data. Essentially, SSL methods use unlabeled samples to either modify or reprioritize the hypotheses obtained from labeled samples alone.

A successful methodology for tackling the SSL problem builds on traditional supervised classification algorithms. These techniques aim to enlarge the set of labeled examples by labeling the unlabeled training points. They iteratively add their most confident predictions to the pool of labeled examples and then repeat the process.

The idea of this project is to visualise the behaviour of this kind of technique. You will implement some basic semi-supervised models (self-training and co-training) and show, on a 2-D input dataset, how the decision boundaries between classes change as newly labeled points are added to the training data.
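As a starting point, the self-training loop can be sketched as follows. This is only an illustration under stated assumptions: the base classifier is a nearest-centroid rule chosen for brevity, confidence is taken to be the distance margin between the two nearest centroids, and all function names are hypothetical. The model recorded after each round is what a visualisation would draw.

```python
import math

def centroids(points, labels):
    """Mean point of each class in the labeled pool."""
    sums = {}
    for (x, y), c in zip(points, labels):
        sx, sy, n = sums.get(c, (0.0, 0.0, 0))
        sums[c] = (sx + x, sy + y, n + 1)
    return {c: (sx / n, sy / n) for c, (sx, sy, n) in sums.items()}

def predict(cents, p):
    """Return (label, confidence); confidence is the margin between the two nearest centroids."""
    dists = sorted((math.dist(p, m), c) for c, m in cents.items())
    margin = dists[1][0] - dists[0][0] if len(dists) > 1 else float("inf")
    return dists[0][1], margin

def self_training(labeled, labels, unlabeled, per_round=1):
    """Repeatedly move the most confidently predicted unlabeled points into the labeled pool."""
    labeled, labels, unlabeled = list(labeled), list(labels), list(unlabeled)
    history = []  # one model (set of centroids) per round, for plotting boundaries
    while unlabeled:
        cents = centroids(labeled, labels)
        history.append(cents)
        ranked = sorted(unlabeled, key=lambda p: predict(cents, p)[1], reverse=True)
        for p in ranked[:per_round]:
            labeled.append(p)
            labels.append(predict(cents, p)[0])
            unlabeled.remove(p)
    history.append(centroids(labeled, labels))  # final model
    return labeled, labels, history

points, point_labels, history = self_training(
    [(0.0, 0.0), (10.0, 10.0)], [0, 1],
    [(1.0, 1.0), (9.0, 9.0), (2.0, 1.0), (8.0, 9.0)],
)
```

Each entry in `history` defines one decision boundary (for two centroids, the perpendicular bisector between them), so plotting them in sequence shows how the boundary moves as points are added.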

Graph-based semi-supervised learning in Spark

This project focuses on exploiting small amounts of annotated data together with a large number of unannotated samples within a semi-supervised big data scheme. More specifically, we want to extend the capabilities of graph-based SSL models to the big data context. Despite their good performance, current approaches become ineffective and inefficient when applied to big datasets (a large number of examples and/or features) because of the sheer size of the problem.

In this project, you will learn big data technologies such as Apache Spark and use them to explore different strategies for designing graph-based SSL methods for big datasets.
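For orientation, a basic form of graph-based SSL is label propagation over a similarity graph. The sketch below is a small single-machine illustration in plain Python, not a Spark implementation; the RBF kernel, the parameter values and the function names are assumptions made for the example. Note that the dense graph has a quadratic number of entries, which is the kind of cost that stops such methods from scaling.

```python
import math

def rbf_graph(points, sigma=1.0):
    """Dense similarity matrix with an RBF kernel; it has n * n entries."""
    n = len(points)
    return [[0.0 if i == j
             else math.exp(-math.dist(points[i], points[j]) ** 2 / (2 * sigma ** 2))
             for j in range(n)] for i in range(n)]

def label_propagation(W, seeds, n_classes, iters=50):
    """Average the neighbours' label distributions repeatedly, keeping seed labels fixed."""
    n = len(W)
    F = [[1.0 / n_classes] * n_classes for _ in range(n)]
    for i, c in seeds.items():
        F[i] = [1.0 if k == c else 0.0 for k in range(n_classes)]
    for _ in range(iters):
        new_F = []
        for i in range(n):
            if i in seeds:  # labeled points are clamped to their known class
                new_F.append(F[i])
                continue
            total = sum(W[i]) or 1.0  # guard against an isolated node
            new_F.append([sum(W[i][j] * F[j][k] for j in range(n)) / total
                          for k in range(n_classes)])
        F = new_F
    return [max(range(n_classes), key=lambda k: row[k]) for row in F]

# Two small groups on a line, with one labeled point in each.
pts = [(0.0, 0.0), (0.5, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0), (6.0, 0.0)]
predicted = label_propagation(rbf_graph(pts), seeds={0: 0, 5: 1}, n_classes=2)
```

Sparse neighbourhood graphs and partitioned updates are among the strategies one might explore to make this kind of computation feasible on large data.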

Other Ideas

  • Any recent competition related to data science can be used as a basis for a project. For example, you can have a look at the data mining competitions proposed in
  • Visualisation tool for evolutionary models in data mining.
  • Visualisation tool for big data sets using Apache Spark.
  • Machine learning models in the big data context.

Previous proposals

Note that the following projects have already been carried out, but there is still the option of revisiting them from a different angle.

Fingerprint recognition

Personal identification is an important issue in many fields, such as criminology, forensic identification, payments or identification in computer systems. Among all the biometric features that can be used for identification, such as voice, iris or DNA, fingerprints are the most widely used.

The design of Automatic Fingerprint Identification Systems (AFISs) may be considered in two different settings, known as verification and identification, which give rise to the following project ideas:
  • You create an app that allows users to verify their identity by either using a fingerprint scanner or a picture of their fingerprint. The system must be robust to rotations, translations and deformations of the skin, and ensure that no one else is erroneously granted access.

    You will be provided with some software for feature extraction and some fingerprint databases. However, you will be responsible for implementing the matching algorithm.

    Your software must display the given fingerprint image along with the extracted features (see the picture below for an example), as well as a response indicating whether access is granted.
  • You create a parallel program that is able to identify an individual fingerprint among a large database of fingerprints. The system must be robust to rotations, translations and deformations of the skin.

    You will have access to some code on which to base your parallelisation scheme. I will be very happy if you want to explore different parallelisation technologies such as MPI or GPUs.

    The resulting program should have a graphical user interface that allows us to take a given fingerprint, and shows the most similar fingerprint in the database.
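To give a feel for what a matching algorithm involves, the following is a naive alignment-based sketch, not the provided software or any particular AFIS method. It assumes each extracted feature (minutia) is an (x, y, angle) tuple, tries every pairing of a query minutia with a template minutia as a reference for rotation and translation, and counts how many minutiae then coincide within a tolerance. The function names and tolerance values are hypothetical, and skin deformation is only absorbed by the tolerances.

```python
import math

def angle_diff(a, b):
    """Smallest absolute difference between two angles, in radians."""
    return abs((a - b + math.pi) % (2 * math.pi) - math.pi)

def align(query, q_ref, t_ref):
    """Rotate and translate the query so that minutia q_ref lands on t_ref."""
    d = t_ref[2] - q_ref[2]
    cos_d, sin_d = math.cos(d), math.sin(d)
    aligned = []
    for x, y, th in query:
        dx, dy = x - q_ref[0], y - q_ref[1]
        aligned.append((t_ref[0] + dx * cos_d - dy * sin_d,
                        t_ref[1] + dx * sin_d + dy * cos_d,
                        th + d))
    return aligned

def count_matches(aligned, template, dist_tol, angle_tol):
    """Greedily pair each aligned minutia with an unused template minutia within tolerance."""
    used = set()
    for x, y, th in aligned:
        for j, (tx, ty, tth) in enumerate(template):
            if (j not in used and math.hypot(x - tx, y - ty) <= dist_tol
                    and angle_diff(th, tth) <= angle_tol):
                used.add(j)
                break
    return len(used)

def match_score(query, template, dist_tol=10.0, angle_tol=math.radians(15)):
    """Best fraction of matched minutiae over all reference pairings, in [0, 1]."""
    best = max(count_matches(align(query, q, t), template, dist_tol, angle_tol)
               for q in query for t in template)
    return best / max(len(query), len(template))

template = [(10, 10, 0.1), (60, 20, 1.0), (30, 80, 2.0), (90, 90, -0.5), (50, 50, 3.0)]
a = math.radians(30)  # the same print, rotated by 30 degrees and shifted
rotated = [(x * math.cos(a) - y * math.sin(a) + 15,
            x * math.sin(a) + y * math.cos(a) - 7, th + a) for x, y, th in template]
same_finger = match_score(rotated, template)
```

For verification, the score would be compared against a threshold to grant or deny access; for identification, the same score is computed against every database entry, and those comparisons are independent of each other, which is what makes the search a natural candidate for parallelisation.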

Data Visualisation and Generation for Machine Learning Algorithms

In this project, the main idea is to design a program that allows the user to construct (draw) their own artificial dataset in an easy manner. The program must be highly flexible, allowing the user to choose the number of dimensions, the kind of distribution of the data (if any), the number of points (density), shapes, etc.

As a result, the user will get a dataset as a CSV text file that represents the drawn figure. The software then has to be linked to some classification algorithms (you can use implementations from scikit-learn, Weka, R, etc.) that label the given dataset. The decision boundaries between classes can then be represented (see figure below).
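As a rough illustration of the generation step, the sketch below samples two Gaussian clusters and serialises them as CSV text. It stands in for the interactive drawing tool only loosely: the column names, cluster parameters and function names are assumptions, and only the Python standard library is used.

```python
import csv
import io
import random

def gaussian_blob(center, std, n, label, rng):
    """Sample n 2-D points around `center` with isotropic Gaussian noise."""
    cx, cy = center
    return [(rng.gauss(cx, std), rng.gauss(cy, std), label) for _ in range(n)]

def to_csv(rows, header=("x", "y", "label")):
    """Serialise the rows as CSV text; write the returned string to a file to save it."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()

rng = random.Random(0)  # fixed seed so the same dataset is generated every time
rows = (gaussian_blob((0.0, 0.0), 0.5, 50, 0, rng)
        + gaussian_blob((3.0, 3.0), 0.5, 50, 1, rng))
csv_text = to_csv(rows)
```

A file in this shape can be loaded directly by most classification libraries, which is what makes CSV a convenient hand-off between the drawing tool and the learning algorithms.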

(c) Copyright: Isaac Triguero Velázquez
