Distribution of errors and methods of inference for automated DNA sequencing
For a given input DNA fragment (clone), we study the distribution of the output sequence (observation) from an automated sequencer. Primary emphasis is placed on obtaining likelihood-based procedures for the main computational tasks in a shotgun sequencing strategy: ranking multiple sequence alignments, estimating a consensus sequence, and assessing the confidence of a reconstruction. These tasks depend on deciding which pairs of data sequences arise from overlapping fragments. The overlap detection problem is formulated both as point estimation and as hypothesis testing. Using theoretical models for data with no errors and with substitution-type errors, we compare the performance of maximum likelihood and maximum posterior estimates of overlap, and we evaluate likelihood ratio tests of overlap versus no overlap. The goal is to understand how different error processes affect procedures for detecting when cloned fragments overlap. Because the underlying biochemical mechanisms responsible for sequence reading errors are poorly understood, we propose new definitions and methods for characterizing error events when discrepancies are evident. Using results from our data study, we develop a probabilistic model for sequence read errors, Run Extension and Contraction (RECO). Methods for parameter estimation and overlap detection under the RECO model are described. Analysis of two independent, experimentally derived data sets demonstrates that the RECO model provides a good fit to the majority of sequence read errors.
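To make the likelihood ratio test of overlap versus no overlap concrete, the following is a minimal sketch under a deliberately simplified substitution-only model. It is not the RECO model and not necessarily the formulation used in this work. The assumptions are ours: bases are uniform over A, C, G, T; each read miscalls each base independently with probability `error_rate`, choosing uniformly among the other three bases; and the candidate overlap is an ungapped alignment of a suffix of one read against a prefix of the other. The function name and parameters are hypothetical.

```python
import math


def overlap_log_likelihood_ratio(read_a, read_b, offset, error_rate):
    """Log likelihood ratio of 'overlap at offset' versus 'no overlap'.

    Assumed model (illustrative only): uniform base composition,
    independent substitution errors at rate `error_rate` in each read,
    and no insertions or deletions. The suffix read_a[offset:] is
    aligned, without gaps, against the prefix of read_b.
    """
    if not 0.0 < error_rate < 0.75:
        raise ValueError("error_rate must lie strictly between 0 and 0.75")
    if not 0 <= offset < len(read_a):
        raise ValueError("offset must index a position inside read_a")

    # Under overlap, both reads observe the same true base. They agree if
    # both calls are correct, or if both are wrong in the same way.
    p_match = (1.0 - error_rate) ** 2 + error_rate ** 2 / 3.0

    # Under no overlap, the two bases are independent and uniform,
    # so they agree with probability 1/4.
    log_lr_match = math.log(p_match / 0.25)
    log_lr_mismatch = math.log((1.0 - p_match) / 0.75)

    n = min(len(read_a) - offset, len(read_b))
    matches = sum(
        1 for i in range(n) if read_a[offset + i] == read_b[i]
    )
    return matches * log_lr_match + (n - matches) * log_lr_mismatch


if __name__ == "__main__":
    a = "ACGTACGTAC"
    b = "TACGTACGGA"
    # Offset 3 aligns the suffix "TACGTAC" of a with the prefix of b.
    print(overlap_log_likelihood_ratio(a, b, offset=3, error_rate=0.05))
    # Offset 0 is a poor alignment, so the ratio favors no overlap.
    print(overlap_log_likelihood_ratio(a, b, offset=0, error_rate=0.05))
```

A positive value favors the overlap hypothesis and a negative value favors no overlap. Maximizing this quantity over `offset` gives a maximum likelihood estimate of the overlap position under the same simplified assumptions. Handling run extension and contraction errors would require an alignment model that allows gaps, which this sketch does not attempt.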