The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances

dc.contributor.author: Röhling, Sophie
dc.contributor.author: Linne, Alexander
dc.contributor.author: Schellhorn, Jendrik
dc.contributor.author: Hosseini, Morteza
dc.contributor.author: Dencker, Thomas
dc.contributor.author: Morgenstern, Burkhard
dc.date.accessioned: 2020-02-11T08:39:45Z
dc.date.available: 2020-02-11T08:39:45Z
dc.date.issued: 2020 [de]
dc.relation.ISSN: 1932-6203 [de]
dc.identifier.uri: http://resolver.sub.uni-goettingen.de/purl?gs-1/17161
dc.description.abstract: We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on Nk and that is affine-linear within a certain range of k. Integers kmin and kmax can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(kmin) and F(kmax). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies. [de]
dc.description.sponsorship: Open-Access-Publikationsfonds 2020
dc.language.iso: eng [de]
dc.rights: openAccess
dc.rights.uri: https://creativecommons.org/licenses/by/4.0/
dc.subject: Sequence alignment; Phylogenetic analysis; Multiple alignment calculation; Phylogenetics; Plant genomics; Bacterial genomics; Molecular evolution; Nucleotide sequencing [de]
dc.subject.ddc: 570
dc.title: The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances [de]
dc.type: journalArticle [de]
dc.identifier.doi: 10.1371/journal.pone.0228070
dc.identifier.doi: 10.1371/journal.pone.0228070.g001
dc.identifier.doi: 10.1371/journal.pone.0228070.g002
dc.identifier.doi: 10.1371/journal.pone.0228070.g003
dc.identifier.doi: 10.1371/journal.pone.0228070.g004
dc.identifier.doi: 10.1371/journal.pone.0228070.g005
dc.identifier.doi: 10.1371/journal.pone.0228070.t001
dc.identifier.doi: 10.1371/journal.pone.0228070.t002
dc.identifier.doi: 10.1371/journal.pone.0228070.s001
dc.identifier.doi: 10.1371/journal.pone.0228070.r001
dc.identifier.doi: 10.1371/journal.pone.0228070.r002
dc.identifier.doi: 10.1371/journal.pone.0228070.r003
dc.type.version: publishedVersion [de]
dc.relation.eISSN: 1932-6203
dc.bibliographicCitation.volume: 15 [de]
dc.bibliographicCitation.issue: 2 [de]
dc.type.subtype: journalArticle
dc.bibliographicCitation.articlenumber: e0228070 [de]
dc.description.status: peerReviewed [de]
dc.bibliographicCitation.journal: PLOS ONE [de]

These documents are available under the license:
openAccess