Summary

HiFi-NN, a deep learning tool developed by Basecamp Research, annotates protein sequences with Enzyme Commission (EC) numbers and outperforms other bioinformatics tools and state-of-the-art deep learning models in precision and recall. It does so by leveraging the hierarchical nature of enzyme function and by supplementing the training set with proprietary sequences from diverse environments. The improvement has significant implications for drug discovery and industrial biotechnology.

Unlocking the Secrets of Proteins with HiFi-NN

Proteins are the building blocks of life, and understanding their functions is crucial for advancing fields like medicine and biotechnology. However, annotating protein sequences with their functions remains a significant challenge. This is where HiFi-NN comes in: a deep learning tool that markedly improves the functional annotation of enzymes.

The Challenge of Functional Annotation

Functional annotation is the process of assigning functions to proteins based on their sequences. This is a complex task because a single protein can have multiple functions, and proteins with similar sequences can have different functions. Traditional alignment-based methods like BLASTp transfer annotations from similar known sequences, so they struggle when a query has low sequence identity to anything already annotated.

How HiFi-NN Works

HiFi-NN is a hierarchically-finetuned nearest neighbor search method that uses contrastive learning to map protein sequence embeddings to a new feature space. This space reflects the similarities between enzyme commission numbers, which describe the reactions catalyzed by enzymes. By leveraging the hierarchical nature of enzyme function, HiFi-NN can annotate protein sequences with greater precision and recall than existing methods.
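The two ideas in the paragraph above, the hierarchy of enzyme commission numbers and annotation by nearest neighbor search, can be illustrated with a minimal sketch. It is not HiFi-NN's implementation: the function names, the toy 3-dimensional embeddings, and the choice of cosine similarity are all assumptions made for illustration, and the contrastive fine-tuning step that produces the embeddings is left out.

```python
import numpy as np


def shared_ec_levels(ec_a: str, ec_b: str) -> int:
    """Count how many leading levels two EC numbers share (0 to 4).

    An EC number such as "1.1.1.1" has four levels, from broad reaction
    class down to the specific reaction. Hypothetical helper.
    """
    depth = 0
    for a, b in zip(ec_a.split("."), ec_b.split(".")):
        if a != b or a == "-":  # "-" marks an unspecified level
            break
        depth += 1
    return depth


def annotate_by_nearest_neighbor(query, ref_embeddings, ref_ec, k=1):
    """Return (EC number, cosine similarity) for the k closest references.

    Assumes embeddings already live in a space where distance tracks
    functional similarity. Hypothetical helper.
    """
    ref_norm = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    sims = ref_norm @ query_norm
    top = np.argsort(-sims)[:k]  # indices of the k highest similarities
    return [(ref_ec[i], float(sims[i])) for i in top]


# Toy reference set: two closely related enzymes and one unrelated enzyme.
ref_embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
ref_ec = ["1.1.1.1", "1.1.1.2", "3.4.21.4"]

query = np.array([1.0, 0.02, 0.0])
hits = annotate_by_nearest_neighbor(query, ref_embeddings, ref_ec, k=2)
for ec, sim in hits:
    print(ec, round(sim, 3))
```

In this toy setup the two nearest references share three of their four EC levels, which is the kind of structure a hierarchy-aware embedding space is meant to preserve: enzymes that differ only at the last level should sit closer together than enzymes from different top-level classes.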

The Power of Diverse Training Data

One of the key factors behind HiFi-NN’s success is the use of diverse training data. Basecamp Research’s proprietary sequences, collected from sites across five continents and a wide temperature range, provide a rich source of environmental diversity. This diversity enhances the model’s performance, especially on so-called functional dark matter: protein sequences with low similarity to any known enzyme.

Performance Comparison

HiFi-NN has been compared to other state-of-the-art deep learning models and bioinformatics tools on benchmarking datasets. The results show that HiFi-NN outperforms the other methods in recall, precision, and F1 score. The advantage is most pronounced in the low sequence identity range.
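For readers unfamiliar with the three metrics, the sketch below shows one common way to compute them when each sequence can carry several EC numbers. The source does not say which averaging scheme the benchmarks use, so the micro-averaging here is an assumption, and the function name and toy labels are hypothetical. Under other schemes, such as averaging per class, the reported F1 need not equal the harmonic mean of the reported precision and recall.

```python
def precision_recall_f1(true_sets, pred_sets):
    """Micro-averaged precision, recall, and F1 over sets of EC numbers.

    Counts are pooled across all sequences before the ratios are taken.
    Hypothetical helper; the averaging scheme is an assumption.
    """
    tp = fp = fn = 0
    for truth, pred in zip(true_sets, pred_sets):
        tp += len(truth & pred)   # predicted and correct
        fp += len(pred - truth)   # predicted but wrong
        fn += len(truth - pred)   # correct but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy example: three sequences, one of them with two true EC numbers.
true_sets = [{"1.1.1.1"}, {"2.7.11.1", "2.7.10.2"}, {"3.4.21.4"}]
pred_sets = [{"1.1.1.1"}, {"2.7.11.1"}, {"3.4.21.5"}]

precision, recall, f1 = precision_recall_f1(true_sets, pred_sets)
print(f"precision={precision:.4f} recall={recall:.4f} f1={f1:.4f}")
```

Here two of the three predictions are correct, and two of the four true labels are recovered, so precision is 2/3, recall is 1/2, and F1 is 4/7.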

Implications for Drug Discovery and Biotechnology

The superior performance of HiFi-NN has significant implications for drug discovery and industrial biotechnology. With accurate enzyme commission number annotations, researchers can identify potential drug targets and develop targeted treatments. HiFi-NN can also support the design of environmentally friendly production methods by identifying enzymes that catalyze specific reactions.

Future Directions

Basecamp Research is expanding on this work by training larger models on tens of millions of diverse sequences from its knowledge graph. The company is also annotating the entire MGnify database and carrying out comprehensive wet-lab validation of HiFi-NN annotations.

Table 1: HiFi-NN Results Compared to Existing Tools and Other SoTA DL Models

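| Model                                      | Recall | Precision | F1 Score |
|--------------------------------------------|--------|-----------|----------|
| HiFi-NN (Swissprot + 3M curated sequences) | 0.5921 | 0.6657    | 0.6015   |
| BLASTp                                     | 0.4562 | 0.5431    | 0.4946   |
| CLEAN                                      | 0.5123 | 0.5892    | 0.5485   |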

Table 2: Performance Comparison on Price Enzyme Dataset

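| Model                                      | Recall | Precision | F1 Score |
|--------------------------------------------|--------|-----------|----------|
| HiFi-NN (Swissprot + 3M curated sequences) | 0.8211 | 0.8532    | 0.8369   |
| HiFi-NN (Swissprot only)                   | 0.7532 | 0.7841    | 0.7683   |
| BLASTp                                     | 0.6451 | 0.6942    | 0.6685   |
| CLEAN                                      | 0.7123 | 0.7512    | 0.7312   |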

Table 3: Annotation Outcomes on MGnify Database

| Model   | Number of Annotated Sequences |
|---------|-------------------------------|
| HiFi-NN | 1,673,827                     |
| BLASTp  | 548,587                       |

Conclusion

HiFi-NN marks a significant advance in functional annotation. By leveraging the hierarchical nature of enzyme function and diverse training data, it outperforms other methods in precision and recall. This has far-reaching implications for drug discovery and industrial biotechnology, and further gains can be expected as HiFi-NN is trained on more data and validated in the lab.