Distribution of errors and methods of inference for automated DNA sequencing

Alexander, Gregory E.

doi:10.57912/23866890.v1

thesesdissertations_2593_OBJ.pdf (5.95 MB)


thesis

posted on 2023-08-04, 15:00 authored by Gregory E. Alexander

For a given input DNA fragment (clone), we study the distribution of the output sequence (observation) from an automated sequencer. Primary emphasis is placed on obtaining likelihood-based procedures for the main computational tasks in a shotgun sequencing strategy: ranking multiple sequence alignments, estimating a consensus sequence, and assessing the confidence of a reconstruction. These tasks rely on the problem of deciding which pairs of data sequences arise from overlapping fragments. The overlap detection problem is formulated both as point estimation and as hypothesis testing. Using theoretical models for data with no errors and with substitution-type errors, the performance of maximum likelihood and maximum posterior estimates of overlap is compared, and likelihood ratio tests of overlap versus no overlap are evaluated. The goal is to understand how different error processes affect procedures for detecting when cloned fragments overlap. Because the underlying biochemical mechanisms responsible for sequence reading errors are poorly understood, we propose new definitions and methods for characterizing error events when discrepancies are evident. Using results from our data study, we develop a probabilistic model for sequence read errors, Run Extension and Contraction (RECO). Methods for parameter estimation and overlap detection under the RECO model are described. Analysis of two independent, experimentally derived data sets demonstrates that the RECO model provides a good fit to the majority of sequence read errors.
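To make the overlap test concrete, the following is a minimal sketch of a likelihood ratio for "overlap versus no overlap" under a deliberately simple substitution-only model. It is not the thesis's procedure and it is not the RECO model. The assumptions are ours: bases are uniform over A, C, G, T; each read miscalls a base independently with probability `e`, choosing uniformly among the other three bases; there are no insertions or deletions; and the candidate overlap aligns the last `k` bases of read `x` with the first `k` bases of read `y`. The function names `overlap_log_lr` and `best_overlap` are hypothetical.

```python
import math


def overlap_log_lr(x: str, y: str, k: int, e: float) -> float:
    """Log likelihood ratio of 'last k bases of x overlap first k bases of y'
    versus 'x and y are unrelated', under the assumed substitution-only model."""
    if not 0 < k <= min(len(x), len(y)):
        raise ValueError("k must be between 1 and the shorter read length")
    if not 0.0 < e < 0.75:
        raise ValueError("e must lie strictly between 0 and 0.75")

    # Two noisy copies of the same true base agree if both are correct,
    # or if both are wrong and happen to pick the same wrong base.
    p_match = (1.0 - e) ** 2 + e ** 2 / 3.0

    matches = sum(a == b for a, b in zip(x[-k:], y[:k]))
    mismatches = k - matches

    # Per position, overlap gives p_match/4 to a specific matching pair and
    # (1 - p_match)/12 to a specific mismatching pair; no overlap gives 1/16.
    return (matches * math.log(4.0 * p_match)
            + mismatches * math.log(4.0 * (1.0 - p_match) / 3.0))


def best_overlap(x: str, y: str, e: float, min_k: int = 1) -> int:
    """Overlap length that maximizes the log likelihood ratio. Because the
    no-overlap likelihood does not depend on k, this is also the maximum
    likelihood estimate of k under the assumed model."""
    candidates = range(min_k, min(len(x), len(y)) + 1)
    return max(candidates, key=lambda k: overlap_log_lr(x, y, k, e))


if __name__ == "__main__":
    read_x = "ACGTACGTTTGACC"
    read_y = "TTGACCGGATAC"
    k_hat = best_overlap(read_x, read_y, e=0.05)
    print(k_hat, overlap_log_lr(read_x, read_y, k_hat, e=0.05))
```

A positive value favors overlap and a negative value favors no overlap; in a test, the statistic would be compared against a threshold chosen for the desired error rates. A posterior estimate would add a log prior over `k` to each candidate before taking the maximum.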

History

Publisher

ProQuest

Language

English

Notes

Ph.D. American University 1997.

Handle

http://hdl.handle.net/1961/thesesdissertations:2593

Media type

application/pdf

Access statement

Unprocessed

Keywords

Statistics Molecular biology Genetics

Licence

In Copyright

