Development and Integration of Decision-making Tools in Data Analysis Procedures for HTS-based Plant Virus Diagnostics

Kaoutar Daoud / Jožef Stefan International Postgraduate School

Project description

Metagenomic sequencing is revolutionizing the detection and characterization of viruses, and a wide variety of software tools are available to analyze these data. The typical concern regarding high-throughput data analysis is that there is no “Swiss army knife” software that can handle all possible biological questions. Consulting the methods publications does not suffice. Even benchmarking these methods is no longer effective, because new tools are introduced continuously, most of them with their own and often extensive set of parameters.

The motivation behind my research, within the framework of INEXTVIR, is to bridge the gap between method developers and end users by developing a machine-learning-based Decision Support System that recommends bioinformatic pipelines for a given High Throughput Sequencing (HTS) based plant virus data analysis. With such a system, researchers will be able to make faster and better operational decisions that serve their needs.

Project goals

My research project will contribute to the bioinformatics and software development segments of the project through:

1- The development of a machine learning approach for decision support to help users select the most appropriate bioinformatic pipeline in:

post-sequencing detection (Kodoja, Taxonomer, VirusDetect, VirFinder, ...), and
post-sequencing classification (Kraken, Clark, MGmapper, Kaiju, MSC, ...) of plant viruses based on NGS data.

Among other factors, the choice of tool also depends on the purpose of the analysis, which may be:

Monitoring of viral pathogens.
Identification of anti-microbial resistance genes.
Phage identification.
Obtaining a complete catalogue of the organisms that are present.

2- The development of a computational approach for evaluating the performance of bioinformatics pipelines for the analysis of NGS data.

How should a classification within a taxonomy be properly evaluated?
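
The project description leaves this question open. One option from the hierarchical classification literature is to score each prediction by how much of the true lineage it recovers, so that a call in the right genus is penalized less than a call in the wrong family. The sketch below illustrates hierarchical precision and recall under that assumption; it is not the evaluation method adopted by the project. The function names (lineage, hierarchical_scores) and the tiny hand-written taxonomy are hypothetical and serve only as an illustration.

```python
from typing import Dict, Iterable, Set, Tuple

# Toy child -> parent map (illustrative only; a real analysis would load
# a full reference taxonomy instead of a hand-written dictionary).
PARENT: Dict[str, str] = {
    "Tobacco mosaic virus": "Tobamovirus",
    "Tomato mosaic virus": "Tobamovirus",
    "Tobamovirus": "Virgaviridae",
    "Virgaviridae": "Viruses",
    "Potato virus Y": "Potyvirus",
    "Potyvirus": "Potyviridae",
    "Potyviridae": "Viruses",
}


def lineage(taxon: str, parent: Dict[str, str]) -> Set[str]:
    """Return the taxon and all of its ancestors, excluding the root."""
    nodes: Set[str] = set()
    while taxon in parent:  # the root has no parent entry, so it is skipped
        nodes.add(taxon)
        taxon = parent[taxon]
    return nodes


def hierarchical_scores(
    pairs: Iterable[Tuple[str, str]], parent: Dict[str, str]
) -> Tuple[float, float, float]:
    """Compute hierarchical precision, recall and F1.

    `pairs` holds (true_taxon, predicted_taxon) tuples. Each label is
    expanded to its lineage, and the overlap of the two lineages is
    summed over all pairs before dividing.
    """
    overlap = predicted_total = true_total = 0
    for true_taxon, predicted_taxon in pairs:
        true_nodes = lineage(true_taxon, parent)
        predicted_nodes = lineage(predicted_taxon, parent)
        overlap += len(true_nodes & predicted_nodes)
        predicted_total += len(predicted_nodes)
        true_total += len(true_nodes)
    precision = overlap / predicted_total if predicted_total else 0.0
    recall = overlap / true_total if true_total else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Wrong species but correct genus: partial credit.
    print(hierarchical_scores(
        [("Tobacco mosaic virus", "Tomato mosaic virus")], PARENT))
    # Wrong family: no credit.
    print(hierarchical_scores(
        [("Tobacco mosaic virus", "Potato virus Y")], PARENT))
```

With this scoring, a tool that stops at the genus level (predicting Tobamovirus for a Tobacco mosaic virus read) keeps full precision but loses recall, which makes conservative and aggressive classifiers comparable on the same scale.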

3- The simplification and automation of bioinformatics pipelines in a system such as Galaxy, the integration of decision support modules into the pipeline, and integration with laboratory data management systems and workflow tools such as SciNote. Together, these should enable researchers without informatics expertise to perform computational analyses through a user-friendly interface.