Identifying representative sequences of protein families using submodular optimization.

Authors: David C Cantu, Anh N Luu, Ha Nguyen, Hung Nguyen, Phuong Nguyen, Tin Nguyen

Language: eng

Classification number: 658.533 Kinds of sequences

Publication information: England: Scientific Reports, 2025

Collection: NCBI

ID: 689502

Identifying representative sequences for groups of functionally similar proteins and enzymes poses significant computational challenges. In this study, we applied submodular optimization, a method effective in data summarization, to select representative sequences for thioesterase enzyme families. We introduced and validated two algorithms, Greedy and Bidirectional Greedy, using curated protein sequence data from the ThYme (Thioester-active enzYmes) database. Both algorithms generated sequence subsets that preserved completeness (inclusion of all known family sequences) and specificity (accurate family representation). The Greedy algorithm outperformed the Bidirectional Greedy algorithm and other methods, particularly in reducing redundancy. Our study offers an efficient approach for identifying representative protein sequences within families that have significant sequence similarity, likely to deliver results close to theoretical optima in polynomial time, with the potential to improve the selection and optimization of representative sequences in protein databases.
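
The greedy selection idea mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state which objective function the authors optimize, so this example assumes a facility-location objective (a standard monotone submodular function) and a small, made-up similarity matrix. The names `facility_location`, `greedy_select`, and `SIM` are hypothetical and are not taken from the paper or from the ThYme database.

```python
from typing import List, Sequence


def facility_location(sim: Sequence[Sequence[float]], selected: List[int]) -> float:
    """Assumed objective: for every sequence, take its best similarity to any
    selected representative, then sum those values over all sequences."""
    if not selected:
        return 0.0
    return sum(max(row[j] for j in selected) for row in sim)


def greedy_select(sim: Sequence[Sequence[float]], k: int) -> List[int]:
    """Pick up to k representatives, each time adding the candidate with the
    largest marginal gain in the objective."""
    selected: List[int] = []
    candidates = set(range(len(sim)))
    for _ in range(min(k, len(sim))):
        current = facility_location(sim, selected)
        # sorted() makes tie-breaking deterministic (lowest index wins).
        best = max(
            sorted(candidates),
            key=lambda j: facility_location(sim, selected + [j]) - current,
        )
        selected.append(best)
        candidates.remove(best)
    return selected


# Made-up pairwise similarities for six sequences in two groups:
# {0, 1, 2} and {3, 4, 5}. Real inputs would come from sequence alignment.
SIM = [
    [1.0, 0.9, 0.8, 0.1, 0.1, 0.1],
    [0.9, 1.0, 0.9, 0.1, 0.1, 0.1],
    [0.8, 0.9, 1.0, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 1.0, 0.8, 0.7],
    [0.1, 0.1, 0.1, 0.8, 1.0, 0.8],
    [0.1, 0.1, 0.1, 0.7, 0.8, 1.0],
]

representatives = greedy_select(SIM, k=2)
print(representatives)  # [1, 4]: the most central sequence of each group
```

For a monotone submodular objective under a cardinality constraint, this kind of greedy selection is known to reach at least a (1 - 1/e) fraction of the optimal objective value, which is consistent with the abstract's remark about results close to theoretical optima in polynomial time. Whether the paper's Greedy algorithm uses this exact objective or constraint is not stated in this record.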