I'm a senior lecturer in the School of Computing Science at the University of Glasgow. Before starting a permanent post in Glasgow I did a couple of Post Doctoral positions (under Mark Girolami) and prior to that I did my Undergraduate degree (Electrical and Electronic Engineering) and PhD (Machine Learning Techniques for Microarray Analysis, supervisor Colin Campbell) at the University of Bristol.
My research involves the development of Machine Learning and Statistical techniques to help with the analysis of complex datasets, particularly within the field of Metabolomics but also other -omics fields, Human-Computer Interaction and Information Retreival. My Metabolomics work is done in collaboration with many people, particularly Glasgow Polyomics (of which I'm an affiliate member).
In 2011, Mark Girolami and I published a textbook: A First Course in Machine Learning which you might find useful if you want an introduction to modern (fairly probabilistic) Machine Learning but don't know a great deal of maths. A second edition with additional material (Gaussian Processes, Dirichlet Processes, was published in August 2016.
I am always interested in hearing from potential PhD candidates. If any of the research below interests you, please get in touch. If you want to find my office, there are some instructions below.
I review for many journals and conferences, as well as being an associate editor for PLoS One. I have acted as an internal and external examiner for several PhD theses (in the UK and abroad) and am currently one of the external examiners for the computing science taught masters programmes at the University of Manchester.
I currently hold various non-research roles within the School of Computing Science: Graduate School Convenor (aka PG tutor), deputy senior adviser of studies and I'm a member of the College of Science and Engineering's appeals committee. I played a substantial role in the successful Athena Swan bronze application made by the School.
My main research activities are in the application of machine learning and applied statistical techniques to problems in metabolomics. Metabolomics is the study of the small molecule content of an organism and I'm particularly interested in the analysis of mass spectrometry metabolomics data. Mass Spectrometry allows, in theory, the simultanoues measurement of the abundance of many metabolites but use of this technology is hindered by the difficuly in analysising the data, particularly in identifying the molecules present in the data from the measured masses.
Some notable contributions in this area include a method for using the dependencies between metabolites to aid in identification, a probabilistic method for metabolite annotation / identification, and the use of peak groupe to improve alignment and differential expression computation.
Currently, I'm working on novel ways of analysing molecular fragmentation data.
Much of this work is done in collaboration with Glasgow Polyomics, of which I'm an affiliate member.
Microarray analysis: The research in my PhD centered around developing Machine Learning techniques to analyse data from cDNA microarrays, including a 2003 paper on the number of arrays needed to be able to classify diseased from healthy samples which is still being cited. I also performed one of the first analyses on combine time series transcriptomic and proteomic data.
Parameter inference: I am currently involved in an EPSRC funded project (EP/L020319/1) on parameter inference in mechanistic models of biological systems (with Dirk Husmeier and Maurizio Filippone) where we are interested in using gradient matching to speed up inference in systems of ordinary differential equations (ICML 2016) as well as a BBSRC funded project (BB/L001276/1) on stomatal dynamics with Mike Blatt (plant sciences, Glasgow) and others.
Leaukaemia: I co-supervise two PhD students who are using bioinformatics tools to help learn about Chronic Myeloid Leukaemia, with Prof. Tessa Holyoake.
I'm interested in how we can better model the uncertainty in user interactions with a goal towards making better systems. This work began with the work on FingerCloud, which used a particle filter to model the position of the finger in 3D above an array of capacitive sensors. This filter was used to control a map, where the less engaged the user was the more autonomous the map became. We extended this to AnglePose which inferred the pitch and yaw angles of the interacting finger, along with its position. This work was taken forward in Daryl Weir's PhD in which he looked at modelling uncertainty in touch. This started with a UIST paper that used Machine Learning to learn offset models for specific users and ended with a CHI paper demonstrating a keyboard that used both the learned offset and variability for specific users to improve typing performance. The offset work was built upon by Daniel Buschek, who published an interesting study in MobileHCI investigating how offset patterns varied across users and varied for the same user on different devices.
With John Williamson, I have been co-supervising Faizuddin M.Noor's PhD on using back of device sensors for implicit interactions. We have shown that it is possible to detect where a user is going to touch a screen before they do so (CHI 2014) and, more recently showed that it is possible to detect user errors due to hand movement and changes in their grip patterns (CHI 2016).
Information Retreival: Along with Iadh Ounis, I supervised Edwin Thuma's PhD on A semi-automated FAQ retrieval system for HIV/AIDS. Edwin built a system to which users could send an SMS including questions about HIV/AIDS that would attempt to send an answer from a FAQ handbook used by the department of health in Botswana. There were three publications arising from this thesis: Exploiting query logs and field-based models..., Detecting missing content queries..., and Evaluating bad query abandonment...
Robotics: With Paul Siebert, I currently supervise Li Sun's PhD on using a robot to manipulate clothing.
Joe Wandy Bayesian methods for metabolomics
Faizuddin M.Noor Implicit interactions for mobile devices
Eduardo Gómez Castañeda
Dr. Daryl Weir Modelling Uncertainity in Touch Interactions
Touch interaction is an increasingly ubiquitous input modality on modern devices. It appears on devices including phones, tablets, smartwatches and even some recent laptops. Despite its popularity, touch as an input technology suffers from a high level of measurement uncertainty. This stems from issues such as the ‘fat finger problem’, where the soft pad of the finger creates an ambiguous contact region with the screen that must be approximated by a single touch point. In addition to these physical uncertainties, there are issues of uncertainty of intent when the user is unsure of the goal of a touch. Perhaps the most common example is when typing a word, the user may be unsure of the spelling leading to touches on the wrong keys. The uncertainty of touch leads to an offset between the user’s intended target and the touch position recorded by the device. While numerous models have been proposed to model and correct for these offsets, existing techniques in general have assumed that the offset is a deterministic function of the input. We observe that this is not the case — touch also exhibits a random component. We propose in this dissertation that this property makes touch an excellent target for analysis using probabilistic techniques from machine learning. These techniques allow us to quantify the uncertainty expressed by a given touch, and the core assertion of our work is that this allows useful improvements to touch interaction to be obtained. We show this through a number of studies. In Chapter 4, we apply Gaussian Process regression to the touch offset problem, producing models which allow very accurate selection of small targets. In the process, we observe that offsets are both highly non-linear and highly user-specific. In Chapter 5, we make use of the predictive uncertainty of the GP model when applied to a soft keyboard — this allows us to obtain key press probabilities which we combine with a language model to perform autocorrection. In Chapter 6, we introduce an extension to this framework in which users are given direct control over the level of uncertainty they express. We show that not only can users control such a system succesfully, they can use it to improve their performance when typing words not known to the language model. Finally, in Chapter 7 we show that users’ touch behaviour is significantly different across different tasks, particularly for typing compared to pointing tasks. We use this to motivate an investigation of the use of a sparse regression algorithm, the Relevance Vector Machine, to train offset models using small amounts of data.
In this thesis, we consider the problem of long-term persistence in ecological and epidemiological systems. This is important in conservation biology for protecting species at risk of extinction and in epidemiology for reducing disease prevalence and working towards elimination. Understanding how to predict and control persistence is critical for these aims. In Chapter 2, we discuss existing ways of characterising persistence and their relationship with the modelling paradigms employed in ecology and epidemiology. We note that data are often limited to information on the state of particular patches or populations and are modelled using a metapopulation approach. In Chapter 3, we define persistence in relation to a pre-specified time horizon in stochastic single-species and two-species competition models, comparing results between discrete and continuous time simulations. We find that discrete and continuous time simulations can result in different persistence predictions, especially in the case of inter-specific competition. The study also serves to illustrate the shortcomings of defining persistence in relation to a specific time horizon. A more mathematically rigorous interpretation of persistence in stochastic models can be found by considering the quasi-stationary distribution (QSD) and the associated measure of mean time to extinction from quasi-stationarity. In Chapter 4, we investigate the contribution of individual patches to extinction times and metapopulation size, and provide predictors of patch value that can be calculated easily from readily available data. In Chapter 5, we focus directly on the QSD of heterogeneous systems. Through simulation, we investigate possible compressions of the QSD that could be used when standard numerical approaches fail due to high system dimensionality, and provide guidance on appropriate compression choices for different purposes. In Chapter 6, we consider deterministic models and investigate the effect of introducing additional patch states on the persistence threshold. We suggest a possible model that might be appropriate for making predictions that extend to stochastic systems. By considering a family of models as limiting cases of a more general model, we demonstrate a novel approach for deriving quantities of interest for linked models that should help guide modelling decisions. Finally, in Chapter 7, we draw out implications for conservation biology and disease control, as well as for future work on biological persistence.
Dr. Edwin Thuma A semi-automated FAQ retrieval system for HIV/AIDS
This thesis describes a semi-automated FAQ retrieval system that can be queried by users through short text messages on low-end mobile phones to provide answers on HIV/AIDS related queries. First we address the issue of result presentation on low-end mobile phones by proposing an iterative interaction retrieval strategy where the user engages with the FAQ retrieval system in the question answering process. At each iteration, the system returns only one question-answer pair to the user and the iterative process terminates after the user's information need has been satisfied. Since the proposed system is iterative, this thesis attempts to reduce the number of iterations (search length) between the users and the system so that users do not abandon the search process before their information need has been satisfied. Moreover, we conducted a user study to determine the number of iterations that users are willing to tolerate before abandoning the iterative search process. We subsequently used the bad abandonment statistics from this study to develop an evaluation measure for estimating the probability that any random user will be satisfied when using our FAQ retrieval system. In addition, we used a query log and its click-through data to address three main FAQ document collection deficiency problems in order to improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. Conclusions are derived concerning whether we can reduce the rate at which users abandon their search before their information need has been satisfied by using information from previous searches to: Address the term mismatch problem between the users' SMS queries and the relevant FAQ documents in the collection; to selectively rank the FAQ document according to how often they have been previously identified as relevant by users for a particular query term; and to identify those queries that do not have a relevant FAQ document in the collection. In particular, we proposed a novel template-based approach that uses queries from a query log for which the true relevant FAQ documents are known to enrich the FAQ documents with additional terms in order to alleviate the term mismatch problem. These terms are added as a separate field in a field-based model using two different proposed enrichment strategies, namely the Term Frequency and the Term Occurrence strategies. This thesis thoroughly investigates the effectiveness of the aforementioned FAQ document enrichment strategies using three different field-based models. Our findings suggest that we can improve the overall recall and the probability that any random user will be satisfied by enriching the FAQ documents with additional terms from queries in our query log. Moreover, our investigation suggests that it is important to use an FAQ document enrichment strategy that takes into consideration the number of times a term occurs in the query when enriching the FAQ documents. We subsequently show that our proposed enrichment approach for alleviating the term mismatch problem generalise well on other datasets. Through the evaluation of our proposed approach for selectively ranking the FAQ documents, we show that we can improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by incorporating the click popularity score of a query term t on an FAQ document d into the scoring and ranking process. Our results generalised well on a new dataset. However, when we deploy the click popularity score of a query term t on an FAQ document d on an enriched FAQ document collection, we saw a decrease in the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. Furthermore, we used our query log to build a binary classifier for detecting those queries that do not have a relevant FAQ document in the collection (Missing Content Queries (MCQs))). Before building such a classifier, we empirically evaluated several feature sets in order to determine the best combination of features for building a model that yields the best classification accuracy in identifying the MCQs and the non-MCQs. Using a different dataset, we show that we can improve the overall retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by deploying a MCQs detection subsystem in our FAQ retrieval system to filter out the MCQs. Finally, this thesis demonstrates that correcting spelling errors can help improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. We tested our FAQ retrieval system with two different testing sets, one containing the original SMS queries and the other containing the SMS queries which were manually corrected for spelling errors. Our results show a significant improvement in the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system.
In this thesis, I focus on methodological concepts of studying infectious disease transmission between agricultural premises. I used different disease systems as exemplars for spatial and network methods to investigate transmission patterns. Infectious diseases cause tangible economic threat to the farming industry worldwide by damaging livestock populations, reducing farm productivity and causing trade restriction. This implies the importance of veterinary epidemiological studies in control and eradication of pathogens. Recent increase in availability of data and computational power allowed for more opportunities to study mechanisms of pathogenic transmission. Nowadays, the bottleneck is primarily associated with efficient methods that can analyse vast amounts of high-resolution data. Here I address two livestock pathogens that differ in their epidemiology: bacteria Streptococcus agalactiae and foot-and-mouth disease (FMD) virus. Streptococcus agalactiae is a contagious pathogen that causes mastitis in cattle, and thus possesses a substantial economic burden to the dairy industry. Known transmission routes between cattle are restricted to those via milking machines, milkers’ hands and fomites during milking process. Additionally, recent studies suggested potential introductions from other host species: primarily, humans. However, strain typing data showed discrepancies in strain compositions of bacteria isolated from humans and bovines. In this thesis, strain-specific features of between-herd transmission of Streptococcus agalactiae within dairy cattle population in Denmark are investigated. Foot-and-mouth disease (FMD) is a viral infection that affects cloven-hoofed animals and is of big importance mainly because of the trade restrictions against infected regions and countries. Control programmes against FMD usually include vaccination and culling of animals. However, the debate on the optimal control for FMD is still ongoing. In this thesis, I address questions on identification of the routes of infection and on requirements for movement recording systems to be used for efficient contact tracing during an FMD outbreak. This thesis reveals several interesting findings. Firstly, the increased understanding of strain-specific transmission characteristics of Streptococcus agalactiae. One of the observed strains (ST103) showed significant and consistent spatial clustering of its cases among Danish dairy cattle herds in 2009–2011. Secondly, the network analysis of cattle movements and affiliations with veterinary practices showed that veterinary practices were exclusively associated with transmission of ST103 of Streptococcus agalactiae. Contrastingly, movement networks appeared to be important for all the three predominant bacterial strains (ST1, ST23 and ST103). Fourthly, the new extended approach that allows estimation of the whole transmission tree at once was proposed and tested for the Darlington cluster within the 2001 FMD UK epidemic. Finally, in chapter 6, it was shown that mathematical modelling did not suggest any advantages of ensuring smaller delays in the post-silent control of FMD-like pathogens.
Dr. Li (Kevin) Sun Thesis to appear
Machine Learning 4/M - Moodle
Advanced Programming (IT) - Moodle
S1 Systems Biology lectures - Material
I've supervised projects in many areas and I'm happy to discuss opportunities in any area broadly related to my research interests. I'm currently particularly interested in projects in computational metabolomics, ranging from software engineering projects to more open ended research ones. If you would like to see projects from previous years, here are some (email me if you would like one that isn't listed):
|MSc Bioinformatics||2012||Joe Wandy||...|
|MSc||2015||Scott Greig||FrAnK: A Web Application for the Annotation of Mass Spectral Peaks Using Fragmentation Spectra|
|MSc||2015||Mary McDowall||Structural Comparisons for Small Molecules|
|MSc||2015||James Ferguson||Investigating the Use of Peak Clustering for the Identification of Metabolites|
|MSc||2015||Nikolay Zhekov||Investigation of different features for predicting ionization efficiency of metabolites in mass spectrometry using Gaussian processes|
|MSc||2015||Cristina Mihailescu||Implementations of knapsack algorithms for metabolite identification|
|L4||2015||Liam Furphy||Computational Modelling of Red Blood Cells II|
|L4||2015||Daniel Budeanu||Text Prediction Visualizer for Android|
|L4||2015||Andrew Lau||A Junior Tennis Diary|
|MSci||2015||Dimitar Petrov||Improving Touchscreen Typing Using Back-of-Device Grip Interactions|
|MSci||2015||Martin Bevc||Predicting the Outcome of Tennis Matches From Point-by-Point Data|
|MSc||2014||Josef Idris Khan||Computational Modelling of Red Blood Cells|
|MSc||2014||Victor Semerdjiev||Machine Learning the Premier League|
|MSc||2014||Jean Georgina Power||Register Machine Simulator|
|L4||2016||Adam Kurkiewicz||The effect of natural selection on relatedness in randomly mating population|
|L4||2016||Christos Kitsos||Interactive Website For Rabies Researchers|
|L4||2016||Zoe Gerolemou||Automated Genome Informatics Workflow Development|
|L4||2016||Claire Thompson||Porting of a Mass2LDA tool and visualiser into FrAnK (Fragment Annotation Toolkit)|
This introductory level textbook was developed from a Machine Learning course delivered at the University of Glasgow initially by Mark Girolami and then taken over by me. As the course was aimed at final year Computing Science students who may not have taken any courses in statistics or mathematics, the book assumes almost no pre-requisites in mathematics or statistics.
More details can be found on the book webpage here.
The second edition was published in August 2016. This new edition includes three new chapters (>100 pages of new material) that covers some advanced topics: Gaussian Processes, Markov Chain Monte Carlo Sampling, and Dirichlet Processes. When you buy the hard copy you get the e-book version included.
From the foyer of the Sir Alwyn Williams Building (see map on right), either take the stairs to floor 'S' or take the lift to the second floor (use the left hand button; we have two second floors).
Once on floor 'S', walk along the corridor for about 50m until you come to S123 on the right.
If you get lost, return to the foyer and ring me using the phone there on extension 1649.