Research

During my time as a member of the Notre Dame Bioinformatics Lab (NDBL), my research focused on bacterial and viral genomic data. I developed novel methods for rapid genotyping of shotgun sequence data, particularly for environmental (metagenomic) bacterial samples. Prior to that work, I was involved in research on Expressed Sequence Tags (ESTs). Much of my work involved parallelizing applications to run in a grid environment, primarily our 7,000-node campus Condor grid. I also did work in support of the VectorBase site.

My primary work in the NDBL was on scalable bioinformatics. This work was highly collaborative, involving members of the Cooperative Computing Lab and the Department of Biological Sciences. It produced a paper presented at the 5th annual Workshop on Workflows in Support of Large-Scale Science (WORKS 2010), held in conjunction with Supercomputing 2010 in New Orleans (Slides, Paper). My work on the MAKER genome annotation tool was presented at the 2nd IEEE Conference on Computational Advances in Bio and Medical Sciences (ICCABS) in 2012 (Slides, Paper).

My initial work after arriving at Notre Dame involved the parallelization of the sequence alignment tool SSAHA. This work was presented at a poster session at ISMB 2010 in Boston in July (Poster). It was also included in the article "Harnessing parallelism in multicore clusters with the All-Pairs, Wavefront, and Makeflow abstractions" in the Journal of Cluster Computing, September 2010 (Paper).

A full list of publications is available in my CV.

Course Related Research

CSE 60641 - Graduate Operating Systems

AVATAR - AVATAR is an abstraction for making use of virtual machines in a distributed computing environment. The goal of AVATAR is to present a homogeneous set of resources to the user of a heterogeneous grid resource, such as Condor. An additional goal is to provide this service in a way that is nearly transparent to the user. Therefore, AVATAR only requires the user to state the requirements of their job, much as they would for Condor. Our system then checks these requirements against the remote host on which the job will execute and decides whether to run the job inside a virtual machine or to execute it natively. When a virtual machine is necessary, the system fetches the filesystem and kernel and starts a virtual machine instance in which the job executes. The output of the job is then returned to the local filesystem, where the grid system, such as Condor, can pick it up and return it to the user.
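The decision described above can be sketched roughly as follows. This is only an illustrative sketch, not the AVATAR implementation: the names JobRequirements, HostInfo, choose_execution_mode, and plan_job, and the specific fields compared (operating system, architecture, memory), are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JobRequirements:
    # Hypothetical requirement fields; the real system may check others.
    os: str
    arch: str
    min_memory_mb: int


@dataclass(frozen=True)
class HostInfo:
    # Hypothetical description of the remote host the job landed on.
    os: str
    arch: str
    memory_mb: int


def host_satisfies(req: JobRequirements, host: HostInfo) -> bool:
    """Return True when the host already meets the job's requirements."""
    return (
        req.os == host.os
        and req.arch == host.arch
        and host.memory_mb >= req.min_memory_mb
    )


def choose_execution_mode(req: JobRequirements, host: HostInfo) -> str:
    """Run natively when the host matches, otherwise fall back to a VM."""
    return "native" if host_satisfies(req, host) else "vm"


def plan_job(req: JobRequirements, host: HostInfo) -> list:
    """List the steps taken for a job, in order, for the chosen mode."""
    if choose_execution_mode(req, host) == "native":
        return ["run job natively", "return output to local filesystem"]
    return [
        "fetch filesystem image and kernel",
        "start virtual machine",
        "run job inside virtual machine",
        "return output to local filesystem",
    ]
```

In this sketch, a job that asks for Linux on a host running another operating system would be planned as a virtual machine run, while a job whose requirements the host already meets skips the fetch and boot steps entirely.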

CSE 60543 - Algorithms for Biological Networks

Parallel short read assembly - Applied network methods, primarily data and graph partitioning ideas, to the data and graphs produced during modern short read assembly. The goal of this project was to reduce the very high RAM requirements associated with de Bruijn graph-based assembly.
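To illustrate why partitioning can lower per-machine memory use, here is a minimal sketch that builds a small de Bruijn graph from reads and splits its nodes across partitions. The project description does not say which partitioning scheme was used; the hash-based assignment below, and the function names build_de_bruijn, partition_of, and partition_graph, are assumptions chosen only for illustration.

```python
import zlib
from collections import defaultdict


def kmers(read, k):
    """Return every k-mer of a read, in order."""
    return [read[i:i + k] for i in range(len(read) - k + 1)]


def build_de_bruijn(reads, k):
    """Nodes are (k-1)-mers; each k-mer adds an edge from prefix to suffix."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def partition_of(node, n_parts):
    """Assign a node to a partition with a deterministic hash (assumed scheme)."""
    return zlib.crc32(node.encode("ascii")) % n_parts


def partition_graph(graph, n_parts):
    """Split the adjacency lists so each partition holds only its own nodes."""
    parts = [dict() for _ in range(n_parts)]
    for node, successors in graph.items():
        parts[partition_of(node, n_parts)][node] = successors
    return parts
```

Each partition holds only a share of the nodes, so no single machine needs the whole graph in RAM. The cost is that edges whose endpoints fall in different partitions require communication, which is where more careful graph partitioning would matter.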

CSE 60532 - Bioinformatics Computing

Examined EST data from the Salt Cress project, looking for structural variations between two related populations and among three related species. Used the EST pipeline described in my WORKS 2010 paper, which was also used in O'Neil et al., "Population-level transcriptome sequencing of non-model organisms Erynnis propertius and Papilio zelicaon."

CSE 60647 - Data Mining

Political speech is notorious for perfidy. Fact-checking organizations exist, but their credibility is undermined by subjective interpretations of key phrases, by semantic confusions, and by a perceived need to provide "balanced" evaluations. We posit that these threats to credibility can be substantially alleviated by focusing exclusively on statements that are both objective and quantifiable. In this work, we describe tools for identifying such statements as a possible first step towards a pipeline for credible, high-throughput fact checking. We acquire and structure a large corpus of United States Senate floor speeches, and annotate a subset for training and testing. We then apply data mining techniques to develop a model that can be applied to new documents in order to identify statements of quantitative fact.
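As a rough illustration of what "identifying statements of quantitative fact" might look like at its simplest, the sketch below flags sentences that contain dollar amounts, percentages, or large round quantities. This is a rule-based baseline invented for this example, not the trained model from the project; the pattern QUANTITY and the function quantity_features are hypothetical, and features like these would at most be inputs to a learned classifier.

```python
import re

# Hypothetical pattern covering a few common quantity forms.
QUANTITY = re.compile(
    r"""
    \$\s?\d[\d,]*(?:\.\d+)?                   # dollar amounts such as $3,000
    | \d[\d,]*(?:\.\d+)?\s?(?:%|percent)      # percentages such as 12% or 5 percent
    | \d[\d,]*(?:\.\d+)?\s(?:thousand|million|billion|trillion)\b
    """,
    re.VERBOSE | re.IGNORECASE,
)


def quantity_features(sentence):
    """Return simple features describing the quantities found in a sentence."""
    matches = QUANTITY.findall(sentence)
    return {"n_quantities": len(matches), "has_quantity": bool(matches)}


def candidate_statements(sentences):
    """Keep only the sentences that mention at least one quantity."""
    return [s for s in sentences if quantity_features(s)["has_quantity"]]
```

A sentence such as "Spending rose 12% to $3,000." would be kept as a candidate, while "This bill is a disgrace." would be dropped. A pattern like this cannot tell a checkable claim from an incidental number, which is the gap an annotated training set is meant to close.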

CSE 60817 - Healthcare Analytics

The goal of this project is to design a system that will help parents or guardians effectively manage the medical treatment of the children in their care. By providing reminders of dosage amounts and times, as well as helping to track medications for multiple children, our system can reduce the chances of improper dosage due to error or confusion.

Our group produced an iPhone application that achieves the initial goal: to give parents and guardians of children a simple application that allows for tracking of their children’s medical needs. This application currently pulls from a small (but medically correct) dosage database that, if expanded to a full-sized database, would allow parents to also search for recommended pediatric dosages for a wide variety of medications. The mini-EHR and reminders make it easy to keep track of administered vaccinations and current medications.