Significant non-existence of sequences in genomes and proteomes

by Grigorios Koulouras, Martin Frith

Published in Nucleic Acids Research by Oxford University Press (OUP).

2021  

Abstract

Minimal absent words (MAWs) are minimal-length oligomers absent from a genome or proteome. Although some artificially synthesized MAWs have deleterious effects, there is still a lack of a strategy for the classification of non-occurring sequences as potentially malicious or benign. In this work, by using Markovian models with multiple-testing correction, we reveal significant absent oligomers, which are statistically expected to exist. This suggests that their absence is due to negative selection. We survey genomes and proteomes covering the diversity of life and find thousands of significant absent sequences. Common significant MAWs are often mono- or dinucleotide tracts, or palindromic. Significant viral MAWs are often restriction sites and may indicate unknown restriction motifs. Surprisingly, significant mammal genome MAWs are often present, but rare, in other mammals, suggesting that they are suppressed but not completely forbidden. Significant human MAWs are frequently present in prokaryotes, suggesting immune function, but rarely present in human viruses, indicating viral mimicry of the host. More than one-fourth of human proteins are one substitution away from containing a significant MAW, with the majority of replacements being predicted harmful. We provide a web-based, interactive database of significant MAWs across genomes and proteomes.
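As an illustration of the definitions in the abstract, the sketch below enumerates MAWs of a toy sequence by brute force and estimates how often a word would be expected to occur. It is not the authors' implementation. The function names, the single-strand treatment, and the expected-count formula (a common maximal-order Markov estimate, N(prefix) x N(suffix) / N(middle)) are assumptions made for this example; the abstract does not specify the exact model or the multiple-testing correction, so no p-value is computed here.

```python
from itertools import product


def substrings(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def count_overlapping(seq, w):
    """Count occurrences of w in seq, allowing overlaps."""
    return sum(1 for i in range(len(seq) - len(w) + 1) if seq.startswith(w, i))


def minimal_absent_words(seq, alphabet="ACGT", max_len=4):
    """Brute-force MAWs of length 2..max_len (single string only).

    A word w is a MAW if w is absent from seq while both its longest
    proper prefix w[:-1] and longest proper suffix w[1:] are present.
    """
    maws = []
    for k in range(2, max_len + 1):
        present_k = substrings(seq, k)
        present_km1 = substrings(seq, k - 1)
        for letters in product(alphabet, repeat=k):
            w = "".join(letters)
            if (w not in present_k
                    and w[:-1] in present_km1
                    and w[1:] in present_km1):
                maws.append(w)
    return maws


def expected_count(seq, w):
    """Assumed estimator: N(w[:-1]) * N(w[1:]) / N(w[1:-1]), for len(w) >= 3."""
    if len(w) < 3:
        raise ValueError("estimator needs a word of length >= 3")
    middle = count_overlapping(seq, w[1:-1])
    if middle == 0:
        return 0.0
    return count_overlapping(seq, w[:-1]) * count_overlapping(seq, w[1:]) / middle


if __name__ == "__main__":
    toy = "AACCGGTT"  # hypothetical toy sequence, not real data
    for w in minimal_absent_words(toy, max_len=3):
        if len(w) >= 3:
            print(w, expected_count(toy, w))
```

In this framing, an absent word becomes interesting when its expected count is high enough that observing zero occurrences is unlikely by chance. Deciding which absences are significant across all candidate words is where a multiple-testing correction would enter.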

Archived Files and Locations

application/pdf  1.6 MB
file_27fjejpz5vf25nfjzzqu2zpkzm
watermark.silverchair.com (publisher)
web.archive.org (webarchive)
Preserved and Accessible
Type  article-journal
Stage   published
Date   2021-03-10
Language   en
DOI  10.1093/nar/gkab139
PubMed  33693858
Container Metadata
Open Access Publication
In DOAJ
In ISSN ROAD
In Keepers Registry
ISSN-L:  0305-1048
Work Entity
Access all versions, variants, and formats of this work (e.g., pre-prints)
Catalog Record
Revision: 44425ebc-2512-4417-bd41-5048f65626b8