Observation selection bias in contact prediction and its implications for structural bioinformatics

Title: Observation selection bias in contact prediction and its implications for structural bioinformatics
Publication Type: Journal Article
Year of Publication: 2016
Authors: Orlando, G, Raimondi, D, Vranken, WF
Journal: Scientific Reports
Volume: 6
Pagination: 36679
Date Published: 11/2016
Abstract

Next Generation Sequencing is dramatically increasing the number of known protein sequences, with related experimentally determined protein structures lagging behind. Structural bioinformatics is attempting to close this gap by developing approaches that predict structure-level characteristics for uncharacterized protein sequences, with most of the developed methods relying heavily on evolutionary information collected from homologous sequences. Here we show that there is a substantial observational selection bias in this approach: the predictions are validated on proteins with known structures from the PDB, but for exactly those proteins significantly more homologs are available than for less studied sequences randomly extracted from UniProt. Structural bioinformatics methods that were developed this way are thus likely to have overestimated performance; we demonstrate this for two contact prediction methods, where performance drops by up to 60% when taking into account a more realistic amount of evolutionary information. We provide NOUMENON, a bias-free dataset for the validation of contact prediction methods.

URL: http://dx.doi.org/10.1038/srep36679