26.05.2022, updated 30.03.2023

Innovative Polish tool for protein classification


A new bioinformatics tool developed by scientists from the Faculty of Biology of the University of Warsaw enables quick and error-free classification of proteins, identification of potential drug binding sites, identification of proteins present on the surface of viruses, as well as, for example, RNA research.

BioS2Net (Biological Sequence and Structure Network) is an advanced algorithm using machine learning that enables the classification of newly discovered proteins not only on the basis of the similarity of amino acid sequences, but also their spatial structure. A publication about the tool appeared in the International Journal of Molecular Sciences (https://www.mdpi.com/1422-0067/23/6/2966).

The tool was developed by a team led by Dr. Takao Ishikawa from the Department of Molecular Biology, Faculty of Biology, University of Warsaw, in collaboration with a scientist from the Faculty of Mathematics, Informatics and Mechanics, University of Warsaw. According to the authors, its main application is the improved classification of proteins, as the current structural classification system is based on the painstaking process of comparing the structures of new proteins with those already categorized.

'While an automated system does exist, it is very restrictive and only takes into account the similarity of the protein sequences, completely ignoring their structures. A tool such as BioS2Net has the potential to significantly improve the entire process', explains Dr. Ishikawa. 'Additionally, after minor modifications, the architecture we have developed can be used for other tasks, not necessarily related to classification. For example, it could be used to detect binding sites for potential drugs in a protein, or to identify proteins on the surface of viruses'.

'For example, imagine a situation where, thanks to the use of BioS2Net, proteins previously classified in different groups will be categorized as very similar to each other in terms of their surface structure, despite a different protein chain folding inside the structure. And then it is possible that a molecule that interacts with one of these proteins (for example as a drug) will also turn out to be an effective interactor for the other protein', Dr. Ishikawa describes further potential practical applications of the tool. 'Another interesting application could be, for example, detecting binding sites in proteins that may be either drug targets or points of interaction with a viral protein'.

BioS2Net works by applying a sequence of mathematical operations to data about a specific protein. To work, the tool needs these data (the more, the better), software capable of performing the complex calculations involved in neural network training, and a lot of time.

As a result, BioS2Net creates a unique representation of each protein as a fixed-size vector. 'It can be compared to something like a barcode that describes each of the known proteins', explains Dr. Ishikawa. 'It is a great tool for classifying proteins based on the amino acid sequence and spatial structure. It is particularly important that it makes it possible to detect proteins with similar three-dimensional structures, but with a different protein chain fold'.

'Previously used methods would assign such proteins to separate groups. Meanwhile, there are known cases when these types of molecules perform similar functions. BioS2Net may be useful for detecting such groups of proteins', he adds.

The scientist says that new proteins are being discovered all the time. The vast majority of them, if they already have a described spatial structure, are deposited in the Protein Data Bank database, which anyone can access via the Internet. 'It is worth noting, however, that the process of discovering new proteins begins much earlier, at the stage of genome sequencing. In genome databases, we often find the annotation "hypothetical protein". Computer algorithms exist which, based on nucleotide sequences in a sequenced genome, predict gene-like regions that potentially code for proteins. We know a lot of such potential proteins. Their functions can be partially predicted on the basis of their similarity to previously described molecules, but to fully understand their role and mechanism of action, it is often necessary to determine their structure first, which requires months or years of experiments', says the researcher from the University of Warsaw.

In the case of proteins, a similar sequence of amino acids usually translates into a similar structure. Until recently, this was a dogma in structural biology. 'But today we know', says Dr. Ishikawa, 'that many proteins are intrinsically disordered proteins (IDPs), or at least contain such regions. Such proteins can have different structures depending on other proteins they interact with at the time'.

'Additionally, the entire context in which the protein folds is very important. For example, the presence of chaperones, and even the very rate of protein synthesis in a cell, can have a significant impact on its final shape, and therefore also its functions. However, this does not change the fact that the amino acid sequence is the fundamental feature of each protein', he emphasises.

Why is it so important to know the exact structure of a protein? The author of the publication explains that proteins always have a specific structure when carrying out their tasks in the cell. For example, if we want to design a new drug that will interact with a particular protein, it is fundamental to define the structure of that protein. 'During the SARS-CoV-2 pandemic, it was necessary, for example, to determine the structure of the viral S protein (the so-called spike) to be able to propose a specific molecule that would interact with it, and thus reduce the efficiency of infection of human cells', he says. 'In conclusion: studying the structures of proteins is of great importance for understanding their functions and mechanisms of action, as well as other molecules that interact with them'.

As for BioS2Net itself, first you need to download information about a given protein from the database and process it. Processing converts all the characteristics of a protein, such as atomic coordinates, types of amino acids, evolutionary profile, etc., into numbers that a computer can understand. Each individual atom of the molecule is described by several dozen numbers that express these features.
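The conversion of an atom into numbers can be illustrated with a minimal Python sketch. It is not the authors' code: the `Atom` class, the `featurize` function and the reduced feature set (coordinates, relative chain position and a one-hot amino acid type) are assumptions made for illustration, and the evolutionary profile mentioned above is left out.

```python
from dataclasses import dataclass

# The 20 standard amino acids as one-letter codes.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


@dataclass
class Atom:
    """Hypothetical, simplified atom record (not the tool's real data model)."""
    x: float
    y: float
    z: float
    residue: str        # one-letter code of the amino acid the atom belongs to
    residue_index: int  # position of that amino acid in the chain (0-based)


def featurize(atom: Atom, chain_length: int) -> list[float]:
    """Turn one atom into a flat list of numbers."""
    # One-hot encoding: 1.0 at the position of the atom's amino acid type.
    one_hot = [1.0 if atom.residue == aa else 0.0 for aa in AMINO_ACIDS]
    # Position along the chain, scaled to the range 0..1.
    relative_position = atom.residue_index / max(chain_length - 1, 1)
    return [atom.x, atom.y, atom.z, relative_position] + one_hot


# Two made-up atoms from a two-residue chain.
atoms = [Atom(0.0, 0.0, 0.0, "A", 0), Atom(1.5, 0.0, 0.0, "G", 1)]
features = [featurize(a, chain_length=2) for a in atoms]
```

In this reduced form each atom is described by 24 numbers; a fuller feature set would extend the list in the same way.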

These numbers are then fed into a neural network that analyses each of the atoms and their closest neighbours, taking into account both their spatial and sequential arrangement. The next step is to combine the groups of atoms into one 'superatom' which contains all the learned local information. This process is repeated until the 'superatom' contains aggregate information about the entire protein. 'This is our barcode, which we then use to classify the protein, using standard neural networks', says Dr. Ishikawa.
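The idea of repeatedly merging atoms until a single 'superatom' remains can be sketched as follows. This is only a rough illustration under strong assumptions: plain averaging stands in for the trained neural network layers, and all function names (`mix_with_neighbours`, `pool`, `barcode`) and parameters are hypothetical.

```python
import math


def mean(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def neighbours(coords, i, k):
    """Indices of the k atoms spatially closest to atom i (including i itself)."""
    order = sorted(range(len(coords)), key=lambda j: math.dist(coords[i], coords[j]))
    return order[:k]


def mix_with_neighbours(coords, feats, k=3):
    """Spatial step: each atom absorbs information from its nearest neighbours."""
    return [mean([feats[j] for j in neighbours(coords, i, k)])
            for i in range(len(feats))]


def pool(coords, feats, group=2):
    """Sequential step: merge consecutive atoms in the chain into 'superatoms'."""
    new_coords, new_feats = [], []
    for start in range(0, len(feats), group):
        new_coords.append(mean(coords[start:start + group]))
        new_feats.append(mean(feats[start:start + group]))
    return new_coords, new_feats


def barcode(coords, feats):
    """Repeat mixing and pooling until one vector describes the whole molecule."""
    while len(feats) > 1:
        feats = mix_with_neighbours(coords, feats)
        coords, feats = pool(coords, feats)
    return feats[0]


# Made-up toy molecule: five atoms, two features per atom.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0],
          [2.0, 1.0, 0.0], [2.0, 2.0, 0.0]]
feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
vector = barcode(coords, feats)
```

Whatever the number of atoms, the result has the same length as a single atom's feature list, which is what allows molecules of different sizes to be compared and classified.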

When asked about the accuracy of the new tool, the biologist explains that BioS2Net generates the unique vector representing each protein flawlessly: each protein is represented in exactly one way, and no other molecule is described in the same way.

'On the other hand, when we used BioS2Net to classify proteins, we achieved a result of 95.4 percent match to the current database classification. This means that in more than 95 cases out of 100, BioS2Net was able to correctly assign a protein to a given group. It is worth mentioning that this current classification is based on the similarity of amino acid sequences and ignores structural information', explains the author of the publication.
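A match rate of this kind is simply the fraction of proteins for which the predicted group equals the group in the reference database. A minimal Python sketch, with a hypothetical `agreement` function and made-up labels that are not real classification results:

```python
def agreement(predicted, reference):
    """Fraction of items whose predicted group equals the reference group."""
    if len(predicted) != len(reference):
        raise ValueError("both lists must have the same length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return matches / len(reference)


# Made-up group labels for four proteins (illustration only).
predicted = ["group_a", "group_b", "group_c", "group_d"]
reference = ["group_a", "group_b", "group_c", "group_x"]
rate = agreement(predicted, reference)
```

Here three of the four made-up labels agree, so `rate` is 0.75; a value of 0.954 would correspond to the 95.4 percent quoted above.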

Scientists emphasize that in addition to its main application (protein classification), BioS2Net will also be able to analyse other biological molecules, including RNA. 'We believe that the tool could also be used to classify completely different biological data, such as maps of the chromosomes in the cell nucleus. In fact, our architecture can be useful wherever structure and sequence are defined', they say.

Dr. Ishikawa adds that BioS2Net was developed as part of the bachelor's thesis of the first author, Albert Roethl, under his supervision. 'It is worth emphasizing, because it is an important signal that a bachelor's thesis is not necessarily just a diploma requirement that you have to complete, but something that has scientific potential and can be published in an international journal', the scientist says.

PAP - Science in Poland, Katarzyna Czechowicz

kap/ ekr/

tr. RL