Bi-level diversity optimisation for representative protein panel selection

Lookup NU author(s): Zhen Ou, Dr Katherine James, Professor Anil Wipat


Abstract

Selecting representative subsets from large protein sequence datasets is a common challenge in enzyme discovery and related tasks under limited screening capacity. In practice, candidate panels are often constructed using clustering-based redundancy reduction or manual selection guided by phylogenetic or similarity-network analyses, which do not directly optimise subset diversity and require threshold tuning or expert interpretation. Here, we present a bi-level diversity-optimisation framework for representative protein panel selection implemented using a local search heuristic that iteratively updates panel composition to improve diversity. The method formulates panel design as a combinatorial optimisation problem over pairwise distance matrices, combining a MaxMin objective to enforce minimum separation between selected sequences with a MaxSum objective to increase global dispersion. This formulation enables the direct construction of fixed-cardinality panels while remaining independent of the similarity representation used to compute pairwise distances. Benchmarking across four Pfam families shows that the bi-level formulation consistently reduces redundancy among selected sequences, lowering maximum pairwise identity by 43-46% relative to the previous MaxSum-based formulation, while maintaining comparable or improved EC-label coverage. The framework can incorporate sequence- or structure-based similarity measures, providing a flexible strategy for constructing diverse representative panels across homologous protein families.
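The optimisation described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes the two objectives are combined lexicographically (MaxMin first, MaxSum as a tie-breaker) and that the local search uses single-sequence swaps with immediate acceptance of any improvement, neither of which the abstract specifies. The names `panel_score` and `select_panel`, and the parameters `seed` and `max_rounds`, are hypothetical.

```python
import numpy as np


def panel_score(dist, panel):
    """Return (MaxMin, MaxSum): the smallest and the total pairwise distance."""
    idx = np.asarray(panel)
    sub = dist[np.ix_(idx, idx)]
    upper = sub[np.triu_indices(len(idx), k=1)]  # each pair counted once
    return (float(upper.min()), float(upper.sum()))


def select_panel(dist, k, seed=0, max_rounds=100):
    """Swap-based local search for a panel of exactly k items (k >= 2).

    Assumption: the bi-level objective is treated lexicographically, so a
    swap is accepted if it raises the minimum pairwise distance, or keeps
    it equal and raises the summed pairwise distance.
    """
    rng = np.random.default_rng(seed)
    n = dist.shape[0]
    panel = [int(i) for i in rng.choice(n, size=k, replace=False)]
    best = panel_score(dist, panel)

    for _ in range(max_rounds):
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in panel:
                    continue
                trial = panel.copy()
                trial[pos] = cand
                score = panel_score(dist, trial)
                # Tuple comparison: MaxMin decides, MaxSum breaks ties.
                if score > best:
                    panel, best, improved = trial, score, True
        if not improved:
            break  # no single swap improves the panel: local optimum
    return sorted(panel), best
```

Because the function only reads a square matrix of pairwise distances, the same sketch would apply whether those distances come from sequence identity or from a structure-based measure. Calling `select_panel(dist, k=3)` on any symmetric distance matrix returns the sorted panel indices together with its `(min, sum)` score. As with any local search, the result is a local optimum and may depend on the random starting panel.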


Publication metadata

Author(s): Ou Z, James K, Charnock S, Wipat A

Publication type: Article

Publication status: Submitted

Journal: bioRxiv

Year: 2026

Publisher: Cold Spring Harbor Laboratory

URL: https://doi.org/10.64898/2026.04.17.719243

DOI: 10.64898/2026.04.17.719243

Notes: This article is a preprint and has not been certified by peer review.

