LCRAnnotationsDB: a database of low complexity regions functional and structural annotations

Ziemska-Legiecka, Joanna; Jarnot, Patryk; Szymańska, Sylwia; Błaszczyk, Dagmara; Staśczak, Alicja; Langer-Macioł, Hanna; Lucińska, Kinga; Widzisz, Karolina; Janas, Aleksandra; Słowik, Hanna; Śliwińska, Wiktoria; Gruca, Aleksandra; Grynberg, Marcin

doi:10.1186/s12864-024-10960-5

Database
Open access
Published: 27 December 2024

Joanna Ziemska-Legiecka¹,
Patryk Jarnot²,
Sylwia Szymańska²,
Dagmara Błaszczyk³,
Alicja Staśczak^4,5,
Hanna Langer-Macioł^6,8,
Kinga Lucińska⁶,
Karolina Widzisz⁷,
Aleksandra Janas⁶,
Hanna Słowik⁶,
Wiktoria Śliwińska⁶,
Aleksandra Gruca²^na1 &
Marcin Grynberg¹^na1

BMC Genomics volume 25, Article number: 1251 (2024)

Abstract

Low Complexity Regions (LCRs) are segments of proteins with a low diversity of amino acid composition. These regions play important roles in proteins. However, annotations describing these functions are dispersed across databases and scientific literature. LCRAnnotationsDB aims to consolidate knowledge about LCRs and store relevant annotations in a single place. To unify redundant annotations, we assigned them categories based on similarity in function, protein structure, and biological process. Categories are organized hierarchically by linking them to Gene Ontology terms. The LCRAnnotationsDB database can be accessed at https://lcrannotdb.lcr-lab.org/.

Background

For a long time, Low Complexity Regions (LCRs) were considered non-functional and disordered fragments of proteins [1]. Consequently, LCRs were filtered out during analyses, resulting in a lack of resources that could provide systematized knowledge about their functions. In recent years, however, researchers have shown that many LCRs play important roles in proteins and have annotated their functions and structures in general protein databases [2, 3]. Despite these annotations, relevant data are scattered across various resources and scientific articles. Additionally, each database uses distinct identifiers and terms in protein annotations for the same characteristics. As a result, to obtain unified information about proteins, researchers use Gene Ontology (GO) to describe their functions in a standardized manner and automatically analyze large protein datasets using machine learning models [4]. Currently, GO is the most common source for automated functional annotation of protein sequences. However, this approach cannot be easily applied to the functional analysis of LCRs, because they rarely play a global function in proteins and are usually short fragments of the protein sequence.

Proteins are usually grouped by their functional and evolutionary relationships into families, which can be analyzed and stored in databases such as InterPro [5]. Additionally, we can group proteins based on their similar functions, even if these proteins do not have similar sequences. Here we focus only on LCRs, short fragments that determine specific local functions and can thereby influence the biological function of a protein. For example, a group of proteins with fragments rich in arginine binds to RNA and supports various functions of globular proteins [6, 7]. These regions can be homorepeats of arginine, as seen in the REV protein, or more irregular LCRs, as found in the antitermination protein N [8,9,10]. In both cases, these regions bind RNA and overlap with LCRs. Another example is liquid-liquid phase separation (LLPS) proteins that contain LCRs. LCRs in these proteins are located in crucial regions required for binding nucleotides or proteins and for phase separation [11,12,13]. For instance, the nucleoporin NSP1 contains an FG-rich region consisting of several LCRs. This region forms a hydrogel through interactions between repeats in LCRs [11].

In this work, we introduce LCRAnnotationsDB, which enables functional analysis of LCRs. Four services with data partially overlapping with our database exist: HRaP, LCR-eXXXplorer, LCD-Composer, and PlaToLoCo. The HRaP database contains homorepeats and predicted disordered patterns [14], providing information only about a small subset of LCRs without functional annotation, except for GOs. Additionally, it is unclear how frequently the information in this database is updated. LCR-eXXXplorer provides information on LCRs identified using the CAST [15] and SEG [16] methods, enriched with data on GO terms assigned to proteins, disordered regions identified using the IUPred [17] and ANCHOR predictors, and functional annotations from the UniProt/SwissProt database. While LCR-eXXXplorer is a valuable resource, it is not updated regularly, and the functional LCR annotations are limited. In contrast, LCRAnnotationsDB incorporates annotations from nine additional external databases, making our database the most comprehensive source of information on LCR functions. The LCD-Composer website allows searching for LCRs in proteins rich in a selected set of amino acids [18]. It does not provide additional functional annotations, except for GO terms assigned to proteins. The LCD-Composer results are available for download as TSV files, and the service does not offer any visualization of the results. HRaP, LCR-eXXXplorer, and LCD-Composer provide GO terms annotating the whole protein sequence, while Gene Ontology annotations available in LCRAnnotationsDB are assigned to LCR fragments only, ensuring that the functional information is specific to the particular low complexity fragment. PlaToLoCo identifies LCRs using various state-of-the-art methods and provides annotations of Pfam domains, predicted transmembrane regions, and signal peptides [19]. The functional information on LCRs provided by PlaToLoCo is limited, as its main aim is to integrate information from different tools for LCR discovery. None of these services include unified functional and structural information from a large number of publicly available protein databases.

To collect and systematize LCR annotations, we created LCRAnnotationsDB. This database contains unified information about LCRs and their functions, integrated from publicly available sources, making it unique among other resources. We introduced a category system that groups similar annotations from different source databases. This method and its outcome will help scientists describe LCRs of interest that are not annotated in canonical databases such as InterPro [5], SMART [20], or DisProt [21]. Researchers can use this information to discover new functions of LCRs and proteins by sequence similarity inference. LCRAnnotationsDB provides easy access to information about LCR functions and structures for analysis.

Construction and content

Construction

LCRAnnotationsDB is a database of LCRs identified in the canonical collection of protein sequences from UniProtKB/Swiss-Prot using the SEG method with the strict and default parameter sets [16, 22]. Each LCR record in LCRAnnotationsDB includes annotations from publicly available databases (UniProtKB/Swiss-Prot [23], NCBI RefSeq [24], RCSB PDB [25], neXtProt [26], InterPro [5], TOPDOM [27], DisProt [21], ELM [28], PhaSepDB [29], PhaSePro [30]), as well as annotations from predictive methods (IUPred3 [31], Phobius [32]). Annotations related to similar functions are grouped into categories, which we have introduced as part of LCRAnnotationsDB. Where possible, these categories are linked to matching GO terms, which provide hierarchical relationships between them. Note that GO terms related to categories describe the function of the particular LCR fragment they annotate. The category system was inspired by the InterPro and DisProt databases, from which we collected some of the categories [5, 21]. We also assigned and manually curated our own categories. The data integration is shown in Fig. 1. Additionally, we designed a user-friendly web interface that allows users to easily browse and find information about the biological role of LCRs (https://lcrannotdb.lcr-lab.org/). Users can also automate their analyses and download available data using the RESTful API (https://lcrannotdb.lcr-lab.org/api/).

Identification of LCRs

The LCRAnnotationsDB database contains LCRs identified by the SEG method [16, 22]. SEG has three parameters: window size (W), trigger complexity ($K_2(1)$), and extension complexity ($K_2(2)$). It slides a window of a given size (W) along the protein sequence to identify highly biased regions. As it scans the sequence, it calculates the entropy of the residues in the window. If the entropy drops below the $K_2(1)$ threshold, the algorithm starts to expand the identified region until it reaches the $K_2(2)$ threshold. We used two parameter sets of SEG to detect LCRs: strict (W = 15, $K_2(1)$ = 1.5, $K_2(2)$ = 1.8) [33] and default (W = 12, $K_2(1)$ = 2.2, $K_2(2)$ = 2.5) [16]. The strict parameters detect more regular and shorter LCRs that mostly consist of repetitive patterns [33]. Regular LCRs are composed of homogeneous sequences, repetitive fragments, or slightly irregular fragments. Default parameters extend the output of the strict parameters to include more regions that are irregular in composition.
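The trigger-and-extend idea described above can be illustrated with a short Python sketch. This is a simplified approximation and not the SEG implementation used to build the database: it covers only the windowed entropy scan, treats the thresholds as inclusive (an assumption), and omits the further refinement steps of the original algorithm. The function names `shannon_entropy` and `find_lcrs` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(window):
    """Shannon entropy (in bits) of the residue composition of a window."""
    counts = Counter(window)
    total = len(window)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def find_lcrs(seq, w=12, k1=2.2, k2=2.5):
    """Simplified SEG-like scan (illustrative only).

    Returns 0-based, end-exclusive (start, end) spans. Defaults mirror the
    'default' parameter set quoted in the text (W = 12, K2(1) = 2.2, K2(2) = 2.5).
    """
    n_windows = len(seq) - w + 1
    if n_windows < 1:
        return []
    entropies = [shannon_entropy(seq[i:i + w]) for i in range(n_windows)]
    regions = []
    i = 0
    while i < n_windows:
        if entropies[i] <= k1:
            # Trigger window found: extend left and right over neighbouring
            # windows whose entropy stays at or below the extension threshold.
            left = i
            while left > 0 and entropies[left - 1] <= k2:
                left -= 1
            right = i
            while right + 1 < n_windows and entropies[right + 1] <= k2:
                right += 1
            start, end = left, right + w
            if regions and start <= regions[-1][1]:
                # Merge with the previous span if the two overlap or touch.
                regions[-1] = (regions[-1][0], max(regions[-1][1], end))
            else:
                regions.append((start, end))
            i = right + 1
        else:
            i += 1
    return regions
```

Calling `find_lcrs(seq, w=15, k1=1.5, k2=1.8)` would apply the strict parameter set instead. Because the strict thresholds are lower, fewer windows trigger and extension stops earlier, which is consistent with the shorter, more regular regions reported for strict LCRs.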

Sources of data

As sources of data for LCRAnnotationsDB, we chose publicly available databases and methods for protein region identification that provide information about protein-protein interactions, structures and functional annotations. We selected diverse types of databases, including those that store general data about various types of protein regions and those that specialize in a particular subset (e.g., RCSB PDB, which stores protein structures) [25]. Our final result is a database integrating 12 sources of information. In LCRAnnotationsDB, we integrated annotations from 10 databases (UniProtKB/Swiss-Prot, NCBI RefSeq, RCSB PDB, neXtProt, InterPro, TOPDOM, DisProt, ELM, PhaSepDB, PhaSePro) and 2 predictive methods (IUPred3, Phobius). In Supplementary Material 1, we describe these sources and their methods of use.

The number of annotations derived from each source is presented in Table 1. The largest numbers of annotations come from the InterPro, neXtProt, and UniProtKB databases, each providing over 1 million annotations. There are also over 200 thousand annotations each from RCSB PDB and TOPDOM. From NCBI RefSeq, DisProt, ELM, PhaSepDB, and PhaSePro, we obtained 27,426; 3,879; 640; 455; and 97 LCR annotations, respectively. We also used two predictors. The first, Phobius, yielded over 500 thousand annotations of LCRs. The second, IUPred3, identified 200 thousand annotations associated with LCRs.

Table 1 The databases used to acquire data for LCRAnnotationsDB. Strict LCRs are fragments identified by SEG with strict parameters and the default LCRs are identified with the default parameters of SEG

User interface and implementation

On the Home page, the search form allows users to query proteins, LCRs, or annotations by UniProtACC, protein name, categories, and GO terms. It is also possible to select the source of annotations and to set the coverage of an LCR by an annotation, and vice versa. Coverage of an LCR by an annotation is the length of the overlap between the annotation region and the LCR divided by the total LCR length, while coverage of an annotation by an LCR is the same overlap divided by the annotation length; both are expressed as percentages (see Fig. 2).
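The two coverage measures can be written down in a few lines of Python. The sketch below is only an illustration of the definition given above, assuming 1-based, inclusive region coordinates; the function names and the tuple representation of regions are hypothetical and do not reflect the database code.

```python
def overlap_length(a_start, a_end, b_start, b_end):
    """Number of residues shared by two 1-based, inclusive regions."""
    return max(0, min(a_end, b_end) - max(a_start, b_start) + 1)


def coverage_percentages(lcr, annotation):
    """Return (coverage of LCR by annotation, coverage of annotation by LCR).

    Both regions are (start, end) tuples; both values are percentages.
    """
    overlap = overlap_length(lcr[0], lcr[1], annotation[0], annotation[1])
    lcr_length = lcr[1] - lcr[0] + 1
    annotation_length = annotation[1] - annotation[0] + 1
    return (100.0 * overlap / lcr_length, 100.0 * overlap / annotation_length)


def passes_mutual_threshold(lcr, annotation, threshold=70.0):
    """True if both coverage values reach the threshold (in percent)."""
    lcr_cov, ann_cov = coverage_percentages(lcr, annotation)
    return lcr_cov >= threshold and ann_cov >= threshold
```

For example, an LCR at positions 10 to 29 and an annotation at positions 15 to 34 share 15 residues, so both coverage values are 75%. A short LCR lying inside a long annotation has 100% coverage of the LCR but low coverage of the annotation, which is why setting both thresholds is useful when the annotation should be specific to the LCR.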

The Database page can be accessed from the top menu, and its main role is to present information stored in the database from different perspectives. Here, users can select the following subpages: Proteins, LCRs, Data Sources, Categories, and GOs (see Fig. 3).

The Protein subpage contains all identified LCRs with their annotations. The LCR subpage includes information about a selected LCR and its annotations. The Annotation subpage presents LCR locations of a selected annotation. The Data Sources table contains a list of annotations and allows users to browse annotations only from a selected source. The Category subpage contains information about proteins, annotations, and LCRs that have a category assigned. Finally, the GO table contains only categories that have assigned GOs. Each subpage provides a table with a list of corresponding records, which users can browse in detail. After selecting a record, the relevant LCR, protein, or annotation is displayed in a feature viewer on its corresponding subpage. The user can browse regions in a graphical representation and inspect the relationship between LCRs and other annotations. If a region is selected in the feature viewer (see Fig. 4), it is also highlighted in a FASTA sequence displayed below the feature viewer. In each subpage with the graphical representation, the user can download a sequence in FASTA format and view the results presented in the feature viewer as a picture.

LCRAnnotationsDB uses PostgreSQL to store data about proteins and annotations. Additionally, the database stores information about LCRs, proteins, GO, and categories (see Fig. 5). The data integration pipeline is written in Python 3.6, and the web server is supported by the Django framework.

The RESTful API allows users to download data directly from LCRAnnotationsDB in CSV or JSON format (https://lcrannotdb.lcr-lab.org/api/). The API landing page describes the interface details of the RESTful API, which can be used by external software.

Content

We identified 26,977 LCRs using SEG with strict parameters (which we further call strict LCRs) in 16,798 proteins and 808,169 LCRs using default parameters (default LCRs) in 333,123 proteins out of 563,579 proteins initially in the analysis. The average length of LCRs is 23.9 amino acids for strict and 15.3 amino acids for default parameters. Table 1 presents the number of annotations assigned by all databases and prediction methods. On average, each strict LCR has 9.7 assigned annotations, while a default LCR has 9.4. All of the LCRs in the database have at least one annotation. Since annotations from different databases can provide redundant functional information, we have created categories that group similar functions occurring in various source databases simultaneously. Our database contains 5,479 categories, with each category describing, on average, 886.8 annotations. For each annotation integrated from InterPro, we assigned categories with the same GO term that was assigned in the source database. Analogously, we assigned only those categories from DisProt that had GO term annotations. In summary, we assigned annotations to 5,347 categories based on InterPro, 118 based on DisProt, and manually assigned annotations to 98 categories. One category may come from more than one source. In total, we assigned categories to 66.9% of all annotations for strict LCRs and 63.2% for default LCRs. We also assigned GO terms to categories where possible (e.g., GO:0016020 term to the “membrane” category). The most popular categories for strict LCRs in the database are as follows: positional variant, disordered region, compositionally biased region, tertiary structure, acidic region, isoform, structural constituent of ribosome, ribosome, translation, translation initiation factor activity, membrane, and ATP binding (see Table 2). Additionally, users can submit suggestions of new annotations and categories supported by evidence from reliable scientific sources, or updates of protein sequences, through the suggestion subpage on the web interface.

The LCRAnnotationsDB interface may be used to search, browse, and download data. We provide users with a search form (Home page), annotations (Database), documentation (DOC), contact form (Contact), information about authors (About), and an API manual (API). Additionally, they can use detailed pages describing individual LCRs, proteins, annotations or categories. Detailed pages contain visualizations of sequences with additional information about them. The RESTful API may be used to automate tasks on data stored in the database. For instance, users can download data related to a particular category.

Table 2 The 30 most frequent annotations of LCRs identified by the SEG method with strict (strict LCRs) and default (default LCRs) parameter sets, assigned to categories integrated into LCRAnnotationsDB

In our database, most categories describe the structure of LCRs (tertiary structure, disordered region, helix, secondary structure, coiled-coil, structure, structural constituent of ribosome). There are also annotations of regions that bind nucleotides (RNA, DNA, ATP, GTP binding), ions, and proteins. Additionally, in LCRAnnotationsDB, some categories refer to participation in biological processes (e.g., GTPase activity, translation initiation factor activity). There are also numerous positional variants of human proteins from the neXtProt database and more general categories like ‘repeat’.

Enrichment of LCR annotations in protein families

LCRs are short fragments of the complete protein sequence. To investigate the relationship between functional annotations of LCRs and global protein functions, we analyzed the enrichment of LCR annotations in the InterPro Families set (InterPro version 99.0). First, we retrieved a dataset from our database that consisted of proteins with annotated LCRs, which overlap by at least 70% with the annotation and vice versa (the LCR dataset). This means that the LCR covers the annotation in at least 70% of the annotation length and that the annotation covers the LCR in at least 70% of the LCR length. We assume that in such cases the annotations are directly related to the presence of LCRs. Then, for each protein sequence from the LCR dataset, we retrieved the IDs of InterPro Families assigned to it. Having a list of InterPro IDs, we then downloaded all proteins from InterPro belonging to those families. We call this the InterPro dataset. In this analysis, we excluded annotations with InterPro Family IDs from the LCR dataset. This ensures that the analyzed Protein Family IDs are related to the entire protein sequence, allowing us to hypothesize about a possible relationship between the existence of LCRs and protein function. According to the InterPro Families definition, the annotation of a protein represents its whole length (https://interpro-documentation.readthedocs.io/en/latest/entries_info.html). For both the LCR and InterPro datasets, we then counted the number of sequences in each dataset separately and the number of sequences that belong to the intersection of both sets. Finally, to analyze the enrichment of LCR annotations in protein families, we performed hypergeometric tests:

$$P(X = x) = \frac{\binom{k}{x}\binom{N-k}{n-x}}{\binom{N}{n}}, \tag{1}$$

where N denotes the number of proteins in UniProtKB/Swiss-Prot, n is the number of proteins in the LCR dataset, k is the number of proteins in the InterPro dataset, and x is the number of proteins from the intersection of both sets. The resulting p-values were adjusted using the Benjamini-Hochberg multiple-testing correction procedure, assuming a false discovery rate of 5%.
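As an illustration of how Eq. (1) and the multiple-testing correction can be combined, the following Python sketch computes the hypergeometric probability, an enrichment p-value, and Benjamini-Hochberg adjusted p-values using only the standard library. Two points are assumptions rather than statements from the text: the p-value is taken as the upper tail $P(X \ge x)$, which is the usual choice for enrichment, and the function names are hypothetical.

```python
from math import comb


def hypergeom_pmf(x, N, k, n):
    """Eq. (1): probability of exactly x shared proteins.

    N: proteins in UniProtKB/Swiss-Prot, n: proteins in the LCR dataset,
    k: proteins in the InterPro dataset, x: proteins in the intersection.
    """
    return comb(k, x) * comb(N - k, n - x) / comb(N, n)


def enrichment_pvalue(x, N, k, n):
    """Upper-tail p-value P(X >= x); the tail direction is an assumption."""
    numerator = sum(comb(k, i) * comb(N - k, n - i)
                    for i in range(x, min(k, n) + 1))
    return numerator / comb(N, n)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted
```

Under these assumptions, a family and LCR annotation pair would be called significantly enriched when its adjusted p-value is at most 0.05, matching the 5% false discovery rate stated above. The exact integer arithmetic of `math.comb` avoids rounding problems, although for datasets of the size of Swiss-Prot a log-space or library implementation would be faster.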

We found that more than half (56.54%) of the LCR annotations in our database are significantly enriched within the InterPro Families set. These results may indicate a strong relationship between the presence of an LCR and the general protein function in the dataset analysed. To analyze the results in aggregated form, we grouped the annotations by their categories. We observed many significant annotations related to structural sequence features such as helix, secondary and tertiary structure, and coiled-coil. Other important categories included RNA binding, TAR and RRE motifs, and the RGG box. The aggregated results for the most frequent categories are presented in Table 3. All results of the enrichment analysis are available in Supplementary Material 3.

Table 3 Number of significant results from the hypergeometric test with Benjamini-Hochberg procedure. The most frequent family categories in this table are structural and nucleotide binding terms

Utility and discussion

Use case

LCRs with annotations in protein families

Previous sections of the paper focused on the usability of our database and a description of its content. In this part of the paper, we provide an in-depth examination of specific cases of categorized annotations that exhibit significant enrichment within a family. We describe these families and compare the sequences of LCRs to determine whether these regions may share similar annotations. Additionally, we identify proteins with unannotated LCRs and speculate on their possible functions.

To demonstrate the association between LCRs and protein families, we executed a series of steps. Initially, we obtained families significantly enriched in LCR annotations, as described in the previous section. To identify the most interesting pairs of families and LCR annotations, we determined the proportion of proteins with annotated LCRs within each protein family. We filtered out (1) pairs with a frequency of less than 30% in datasets of InterPro protein families, (2) pairs with a frequency of less than 30% in datasets of proteins with annotated LCRs, and (3) pairs with fewer than 5 common proteins between datasets. Additionally, we compared the GOs of (1) proteins from a particular family, (2) categories of its LCR annotations, and (3) the families themselves. These GOs come from (1) the QuickGO database, (2) the categories in LCRAnnotationsDB, and (3) information about families from the InterPro database.
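The three filtering criteria can be expressed as a small predicate. The sketch below reflects one reading of the text, namely that "frequency" means the share of common proteins relative to the size of each dataset; this interpretation, the function name `keep_pair`, and the use of Python sets of accession strings are assumptions made for illustration.

```python
def keep_pair(family_proteins, lcr_proteins, min_fraction=0.30, min_shared=5):
    """Decide whether a (family, LCR annotation) pair passes the filters.

    family_proteins: set of accessions in the InterPro family dataset.
    lcr_proteins: set of accessions in the LCR dataset for one annotation.
    """
    shared = family_proteins & lcr_proteins
    # Criterion (3): at least min_shared common proteins.
    if len(shared) < min_shared:
        return False
    # Criteria (1) and (2): common proteins make up at least min_fraction
    # of the family dataset and of the LCR dataset, respectively.
    in_family = len(shared) / len(family_proteins)
    in_lcr = len(shared) / len(lcr_proteins)
    return in_family >= min_fraction and in_lcr >= min_fraction
```

With placeholder accessions, a family of 74 proteins and an LCR dataset of 73 proteins sharing 71 members would be kept, whereas a pair sharing only four proteins, or one in which the common proteins form less than 30% of either dataset, would be discarded.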

Consequently, we obtained 506 unique pairs of InterPro and LCR datasets, whose LCR annotations fall into 49 categories. A substantial fraction of these categories (around 33%) are linked to nucleotide binding or structural states, underscoring the significance of these regions in molecular pathways and protein structure determination. Notably, our focus was drawn to the category of RNA binding proteins, including three distinct families: the anti-repression trans-activator protein (REV) family (InterPro Entry: IPR000625), the immunodeficiency virus transactivating regulatory protein (Tat) family (IPR001831), and the RNA-dependent RNA polymerase (RdRp) family (IPR002166). These families are involved in various facets of viral RNA metabolism, including transcription, translation, and replication. To identify the proteins from InterPro families with annotated LCRs described in this section, users may conduct a search using the form view illustrated in Fig. 6.

Firstly, we analyzed the anti-repression trans-activator protein (REV) family (InterPro Entry: IPR000625), which comprises proteins from immunodeficiency viruses essential for viral propagation. REV proteins possess an RRE-binding motif, enabling them to bind the Rev response element (RRE) to form complexes with mRNA. The arginine-rich RRE-binding motif serves as a nuclear localization signal, binding RRE-containing RNAs to form a complex for nuclear export [8, 9]. REV proteins are involved in the export pathway of viral mRNA: REV protein dimers form a complex with mRNA and host proteins such as CRM1 to facilitate the translocation of viral RNA from the nucleus to the cytoplasm [9, 34]. The objective is to shield viral mRNA from detection as foreign. The InterPro database lists 74 proteins from the UniProtKB/Swiss-Prot database in this family. Using LCRAnnotationsDB, we identified 71 proteins belonging to the REV family with LCRs overlapping with the RRE-binding motif annotations. Furthermore, we found 73 proteins with RRE motif annotations, all corresponding to UniProtKB REV proteins. Among these proteins, 71 belonged to the InterPro REV family. Interestingly, two of them (UniProtACC: P36339, P22379) were not categorized in InterPro Families as REV proteins, even though their sequences show 70% LCR coverage in the RRE motif. This suggests that these proteins merit further investigation as potential members of this family.

We noticed that some GOs of proteins from this family overlap with categories of LCR annotations. Three GOs (host cell nucleus, regulation of DNA-templated transcription, DNA-binding transcription factor activity) are assigned to the REV family and categories of LCR annotations. The other three GOs (RNA binding, protein export from nucleus, protein localization to nucleoplasm) describe only categories and proteins. There is only one GO annotation of a protein, nuclear import signal receptor activity, that is not assigned to LCR annotations. Additionally, more precise GO categories, such as mRNA binding, host cell cytoplasm, and host cell nucleolus, are assigned to categories of LCR annotations. Some proteins are categorized under the viral process.

Secondly, we analyzed the immunodeficiency virus transactivating regulatory protein family, Tat (IPR001831). These proteins work with polymerase and the transcription elongation complex in viral RNA synthesis [35]. Specifically, Tat recruits the transcription elongation factor pTEFb, the TATA box binding protein, and chromatin-modifying proteins to the LTR promoter, leading to RNA polymerase II phosphorylation, the assembly of novel transcription complexes, and chromatin remodeling [36,37,38,39]. Proteins in the Tat family contain the TAR-binding motif, which binds the trans-activating response element [40]. The TAR-binding motif in Tat is a critical functional component of this protein family, as it binds the RNA stem-loop structure in viral RNAs [35, 41]. Some studies suggest that Tat can also enhance viral transcription independently of binding the trans-activating response element (TAR) located in the viral transcripts [42, 43]. Tat’s involvement extends to HIV-1 RNA splicing, capping, translation, and reverse transcription [44,45,46,47]. InterPro indicates that this family contains 72 proteins from the UniProtKB/Swiss-Prot database, originating from human, simian, and bovine immunodeficiency viruses, as well as equine infectious anemia and Jembrana disease viruses. Using LCRAnnotationsDB, we identified 23 proteins with LCRs in the TAR-binding region, all meeting the 70% coverage threshold. Additionally, nine proteins showed TAR-binding motif LCRs below 70% coverage. The remaining 40 proteins had TAR-binding motif annotations in the UniProtKB/Swiss-Prot database, but these motifs were not classified as LCRs, due to greater diversity compared to the strict and default SEG thresholds.

For this family, we also compared the GOs of proteins and categories of LCRs. Only one GO, nuclear import signal receptor activity, was assigned to over 50% of the proteins and was not assigned to any other category. The GOs of families (positive regulation of viral transcription, RNA-binding transcription regulator activity, host cell nucleus) were not assigned to 9 proteins from this family and were not assigned to categories of LCR annotations for only one protein from this family. Categories for at least half of the proteins had additional GOs assigned, including protein serine/threonine phosphatase inhibitor activity, virus-mediated perturbation of host defense response, symbiont-mediated suppression of host type I interferon-mediated signaling pathway, metal ion binding, actinin binding, extracellular region, negative regulation of peptidyl-threonine phosphorylation, symbiont-mediated suppression of host translation initiation, host cell cytoplasm, apoptotic process, DNA-templated transcription, and cyclin binding.

The final group analyzed was the RNA-dependent RNA polymerase (RdRp) family (IPR002166), consisting of 56 proteins from the UniProtKB/Swiss-Prot database. This family includes viral ‘RNA-directed RNA polymerases’ and ‘genome polyprotein proteins’, which are involved in RNA replication [48, 49]. In LCRAnnotationsDB, 53 out of 57 proteins in this family contain at least one LCR. Among them, 18 proteins have LCRs in the RNA-binding motif, each surpassing 70% coverage. Another 11 proteins, classified as genome polyprotein proteins, showed lower coverage for LCR and RNA-binding annotations. The remaining 19 RNA-directed RNA polymerase proteins and 5 genome polyproteins lacked RNA-binding annotations in any source database. We hypothesize that all LCRs from this family, located in approximately the same position and rich in arginine, can be considered as RNA-binding regions.

In the RdRp family, there were four GOs assigned to proteins that were not assigned to annotations: protein binding, viral process, nuclear import signal receptor activity, and virion component. We also found 36 GOs assigned to categories but not to proteins. The most common ones were RNA-templated transcription, 5’-3’ RNA polymerase activity, and virus-mediated perturbation of host defense response. At least one category of LCR annotations was assigned to 57 proteins outside the family. Over 20 proteins from this family did not have GOs assigned.

The results of each of these three family analyses illustrate diverse issues related to database usage. The study of the REV family revealed proteins not included in the family. LCRAnnotationsDB allowed us to find and include them. The Tat family, on the other hand, contains an irregular LCR annotated as the TAR-binding motif. In some Tat family proteins, this region was not identified as an LCR. This suggests that TAR-binding LCRs are in the twilight zone of detection, possibly due to strong evolutionary pressure on this region. Our database showed the diversity in protein functional region composition within specific families. The issue with the RdRp family was its incomplete annotation in UniProtKB. Some members of the family have annotated LCRs, while others do not. These unannotated proteins contain LCRs rich in arginine with amino acid compositions similar to those with LCR annotations. In all families, proteins and categories of LCRs had more GOs assigned than InterPro Families. Additionally, not all GOs were assigned to each protein within the families. In all analyzed families, more GOs were assigned to LCRs than to protein and family categories. Full data from this analysis can be found in Supplementary Material 4.

RNA-binding LCR in prokaryotic large ribosomal subunit protein bL20c

Out of many proteins with RNA-binding motifs in LCRAnnotationsDB, we analyzed those containing more than one annotated LCR. We investigated this set manually and found that a subset of proteins annotated as both RNA- and protein-binding is the most interesting. This means these proteins possess both RNA-binding and protein-binding LCRs. We excluded the most obvious examples, such as serine/arginine repetitive matrix protein 2 (Q9UQ35), which has a known structure and mechanism of action. The most intriguing was the prokaryotic 50S subunit bL20, a protein with a poorly understood function [50, 51]. Using the LCRAnnotationsDB search by protein name, we identified 118 homologous proteins (see description of the search process in Fig. 7). The SEG method with strict parameters is not always able to identify both LCRs in each case, but the protein can bind both types of molecules. Several very recent articles show how bL20 participates in ribosome assembly [52,53,54]. bL20 is a crucial piece in ribosome assembly, forming the L20 block, which binds the assembly core at the early stages of the process. The element of interest is the RNA-binding motif, rich in arginines, located at the N terminus of the protein. The C terminus has a protein-binding motif rich in lysines, which is typical for this role. Arginine stretches are known to bind RNA [6, 7], and in this case, they specifically bind rRNA [52]. The protein-binding motif binds to elements of the Core and the L20 block [52].

Discussion

In the digital space, a plethora of general and more specific databases exist, offering annotations for protein sequences. These resources predominantly focus on high-complexity regions, such as protein domains, families, entire proteins, and complexes, which may include a subset of annotations pertaining to LCRs. Despite this, a dedicated database that exclusively focuses on the functions and structures of LCRs is absent. This deficiency makes retrieving LCR-specific annotations difficult and time-consuming. As a result, researchers frequently default to applying GO terms for the functional analysis of LCRs. While this provides a broad categorization for proteins containing LCRs, it falls short of delineating the specific functions of the LCRs themselves. The demand for such a specialized database has been consistently expressed by researchers engaged in exploring the Dark Proteome [55].

Annotations in biological databases are assigned in two ways: manually or automatically, with the support of different tools such as domain predictors or text mining methods [56, 57]. One challenge that arises when creating meta-databases is unifying data that are redundant across databases, which use different labels to describe similar biological phenomena. To address this issue, we designed a category system in which the same annotations, named differently in distinct data sources, are represented by a single category. Grouping annotations into categories allows us to analyze the functions of LCRs on a more general level, providing novel insights into LCRs and their relation to structural sequence features or protein functions. Most categories have assigned GO terms; however, some structural and other categories, such as coiled coils, secondary structures, or isoforms, lack GO term assignments. Similar systems are used in other databases, such as DisProt and InterPro. However, we must account for the strong bias of the UniProtKB/Swiss-Prot database in our analysis. This bias may result from the small size of the database and limited knowledge about specific organisms, processes, and proteins. Additionally, we integrated some biased and highly specialized databases, like DisProt or neXtProt, which focus on narrow groups of proteins.
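The category system described above can be pictured as a lookup from a source-specific label to a shared category. The following is a minimal sketch under stated assumptions: the labels, the category names, and the record layout are hypothetical and are not taken from the LCRAnnotationsDB schema.

```python
from collections import defaultdict

# Hypothetical (source, label) -> category table. The labels and category
# names are invented for illustration only.
CATEGORY_MAP = {
    ("UniProtKB", "RNA-binding"): "RNA binding",
    ("InterPro", "RNA recognition"): "RNA binding",
    ("DisProt", "protein binding"): "protein binding",
}


def unify(annotations):
    """Group (source, label, region) records by shared category.

    Records whose label has no category yet are collected under None,
    so that they can be reviewed when the category system is extended.
    """
    grouped = defaultdict(list)
    for source, label, region in annotations:
        category = CATEGORY_MAP.get((source, label.strip()))
        grouped[category].append((source, region))
    return dict(grouped)


records = [
    ("UniProtKB", "RNA-binding", (1, 20)),
    ("InterPro", "RNA recognition", (3, 18)),
    ("ELM", "unknown motif", (40, 45)),
]
print(unify(records))
```

In this sketch two differently named annotations from two sources end up in one category, while the uncovered annotation stays visible rather than being dropped.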

The significant enrichment of LCR annotations in some InterPro protein families shows that low complexity fragments in proteins can be strongly related to general protein function. These relationships occur between annotations from very diverse categories and families and can be observed for more than 50% of the annotations in our dataset.
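One common way to ask whether an annotation category is over-represented in a protein family is a one-sided hypergeometric test. The sketch below illustrates that general technique only; it is an assumption that such a test fits this setting, it is not the code from the enrichment repository listed under Data availability, and the counts in the example are invented.

```python
from math import comb


def hypergeom_sf(k: int, N: int, K: int, n: int) -> float:
    """P(X >= k) for X ~ Hypergeometric.

    N: all proteins in the dataset
    K: proteins carrying the LCR annotation category
    n: proteins in the family
    k: family members carrying the category
    """
    upper = min(K, n)
    total = comb(N, n)
    # math.comb returns 0 when the lower index exceeds the upper one,
    # so impossible outcomes contribute nothing to the sum.
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Invented counts: 1000 proteins, 50 with the category, a family of 20
# proteins of which 8 carry the category.
p_value = hypergeom_sf(8, 1000, 50, 20)
print(f"{p_value:.3e}")
```

When many family and category pairs are tested, the resulting p-values would normally also need a multiple-testing correction.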

The analysis of the co-occurrence of InterPro Families and LCR annotations shows that we can distinguish many functional LCR groups. This type of analysis may serve as an alternative to predicting protein region function based on similarity. We selected three example families: REV, Tat, and RdRp. Proteins belonging to different families play different roles in organisms, although all contain similar RNA-binding regions. This LCR group, common to the three different protein families, is known as the arginine-rich RNA-binding domain [6, 7]. We noticed significant differences between the GOs of protein families, proteins, and categories of LCRs. We suspect that analyzing only the GOs of proteins enriched with LCRs may overlook some important relationships between the occurrence of LCRs in proteins and their functions. In this case study, we have shown that LCRAnnotationsDB can be used to analyze LCRs in the context of protein families.

The final example, ribosomal subunit protein bL20c, concerns a protein with two LCRs, each possessing a distinct function. This protein features regions that bind to both proteins and RNA, each characterized by a unique amino acid composition. When these regions are analyzed using protein-level GO terms, both LCRs are erroneously classified as having both protein-binding and RNA-binding functions, although each function is realized by a different region. This analysis highlights the dangers of using general GO terms assigned to proteins to describe their regions. Such descriptions impede a precise annotation of LCRs.

In a future version, we plan to manually add information about LCRs from scientific publications. This approach is used in DisProt, where data about intrinsically disordered regions are manually curated based on experimental findings and scientific literature [21]. Similarly, databases such as PhaSePro and PhaSepDB also rely on manual annotation. We have noted that these resources often feature unique data not present in general databases such as UniProtKB or InterPro, highlighting the value of manual annotations in enriching database content. To add general GO terms assigned to categories of LCR annotations, our database invites users to share their research findings (https://lcrannotdb.lcr-lab.org/contact/). We believe that integrating the existing information on LCR functions in one place will enable our tool to facilitate future research on the function of LCRs. Furthermore, incorporating data from authors' publications significantly enhances the ability to present their results. We also plan to continuously expand our categorization system, since not all annotations are currently covered by categories.

These plans are not, however, devoid of challenges. The main challenge is the lack of a precise definition of LCRs [58]. This poses a significant issue for studies aiming to cover the whole LCR space. The subject is complicated by technical issues, such as database redundancy when retrieving data from other sources, the absence of universal standards for data formats, the need for format changes, and annotation bias. These challenges may be addressed through experience and experiments to optimize the method. The final challenge, manual annotation, cannot be resolved by technical means at present. It requires a group of annotators and a working methodology to retrieve data in an organized manner. This kind of endeavor requires a lot of resources and planning.

Conclusion

We have created LCRAnnotationsDB, a curated and comprehensive database for protein LCR annotations. It consolidates data from various sources and provides a platform to explore the diversity and importance of LCRs in different protein families and functional categories. Additionally, we have systematized annotations into a unified categorical framework to resolve the duplicative nature of annotations found in various databases. Most of these categories correspond to established GO terms; however, some annotations are classified under newly created categories.

We have presented the analysis of LCR sequences within protein families and demonstrated a pronounced enrichment of protein families with annotated LCRs. Our database enables researchers to easily screen LCRs within the protein sequences of their interest. LCRAnnotationsDB features a user-friendly web interface, complete with a tutorial on its documentation subpage. It can be accessed at https://lcrannotdb.lcr-lab.org/.

Our work fills a gap in the study of protein sequences. Most previous research focused on high-complexity sequences, leaving LCRs heavily understudied.

Data availability

Code for our enrichment analyses is available in the repository https://github.com/Addreoran/lcrannotationsdb_enrichment. All data included in LCRAnnotationsDB are available at https://lcrannotdb.lcr-lab.org/.

References

1. Shen H, Kan JLC, Green MR. Arginine-serine-rich domains bound at splicing enhancers contact the branchpoint to promote prespliceosome assembly. Mol Cell. 2004;13(3):367–76.
2. Ntountoumi C, Vlastaridis P, Mossialos D, Stathopoulos C, Iliopoulos I, Promponas V, et al. Low complexity regions in the proteins of prokaryotes perform important functional roles and are highly conserved. Nucleic Acids Res. 2019;47(19):9998–10009.
3. Faux NG, Bottomley SP, Lesk AM, Irving JA, Morrison JR, de La Banda MG, et al. Functional insights from the distribution and role of homopeptide repeat-containing proteins. Genome Res. 2005;15(4):537–51.
4. Kulmanov M, Smaili FZ, Gao X, Hoehndorf R. Semantic similarity and machine learning with ontologies. Brief Bioinform. 2021;22(4):bbaa199.
5. Paysan-Lafosse T, Blum M, Chuguransky S, Grego T, Pinto BL, Salazar GA, et al. InterPro in 2022. Nucleic Acids Res. 2023;51(D1):D418–27.
6. Tan R, Frankel AD. Structural variety of arginine-rich RNA-binding peptides. Proc Natl Acad Sci. 1995;92(12):5282–6.
7. Bayer TS, Booth LN, Knudsen SM, Ellington AD. Arginine-rich motifs present multiple interfaces for specific binding by RNA. RNA. 2005;11(12):1848–57.
8. Kjems J, Brown M, Chang DD, Sharp PA. Structural analysis of the interaction between the human immunodeficiency virus Rev protein and the Rev response element. Proc Natl Acad Sci. 1991;88(3):683–7.
9. Rausch J, Le Grice S. HIV Rev assembly on the Rev response element (RRE): a structural perspective. Viruses. 2015;7:3053–75.
10. Van Gilst MR, Rees WA, Das A, von Hippel PH. Complexes of N antitermination protein of phage λ with specific and nonspecific RNA target sites on the nascent transcript. Biochemistry. 1997;36(6):1514–24.
11. Frey S, Richter RP, Görlich D. FG-rich repeats of nuclear pore proteins form a three-dimensional meshwork with hydrogel-like properties. Science. 2006;314(5800):815–7.
12. Wang L, Kang J, Lim L, Wei Y, Song J. TDP-43 NTD can be induced while CTD is significantly enhanced by ssDNA to undergo liquid-liquid phase separation. Biochem Biophys Res Commun. 2018;499(2):189–95.
13. Boehning M, Dugast-Darzacq C, Rankovic M, Hansen AS, Yu T, Marie-Nelly H, et al. RNA polymerase II clustering through carboxy-terminal domain phase separation. Nat Struct Mol Biol. 2018;25(9):833–40.
14. Lobanov MY, Sokolovskiy IV, Galzitskaya OV. HRaP: database of occurrence of HomoRepeats and patterns in proteomes. Nucleic Acids Res. 2013;42(D1):D273–8. https://doi.org/10.1093/nar/gkt927.
15. Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, et al. CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics (Oxford, England). 2000;16(10):915–22.
16. Wootton JC, Federhen S. Statistics of local complexity in amino acid sequences and sequence databases. Comput Chem. 1993;17(2):149–63.
17. Dosztányi Z, Csizmók V, Tompa P, Simon I. IUPred: web server for the prediction of intrinsically unstructured regions of proteins based on estimated energy content. Bioinformatics. 2005;21(16):3433–4. https://doi.org/10.1093/bioinformatics/bti541.
18. Cascarina SM, King DC, Osborne Nishimura E, Ross ED. LCD-Composer: an intuitive, composition-centric method enabling the identification and detailed functional mapping of low-complexity domains. NAR Genomics Bioinforma. 2021;3(2):lqab048.
19. Jarnot P, Ziemska-Legiecka J, Dobson L, Merski M, Mier P, Andrade-Navarro MA, et al. PlaToLoCo: the first web meta-server for visualization and annotation of low complexity regions in proteins. Nucleic Acids Res. 2020;48(W1):W77–84.
20. Letunic I, Khedkar S, Bork P. SMART: recent updates, new developments and status in 2020. Nucleic Acids Res. 2021;49(D1):D458–60.
21. Quaglia F, Mészáros B, Salladini E, Hatos A, Pancsa R, Chemes LB, et al. DisProt in 2022: improved quality and accessibility of protein intrinsic disorder annotation. Nucleic Acids Res. 2022;50(D1):D480–7.
22. Wootton JC, Federhen S. Analysis of compositionally biased regions in sequence databases. Methods Enzymol. 1996;266:554–71. https://api.semanticscholar.org/CorpusID:30650486.
23. UniProt: the universal protein knowledgebase in 2023. Nucleic Acids Res. 2023;51(D1):D523–31.
24. O’Leary NA, Wright MW, Brister JR, Ciufo S, Haddad D, McVeigh R, et al. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Res. 2015;44:D733–45. https://api.semanticscholar.org/CorpusID:13488943.
25. Burley SK, Bhikadiya C, Bi C, Bittrich S, Chen L, Crichlow GV, et al. RCSB Protein Data Bank: powerful new tools for exploring 3D structures of biological macromolecules for basic and applied research and education in fundamental biology, biomedicine, biotechnology, bioengineering and energy sciences. Nucleic Acids Res. 2020;49:D437–51. https://api.semanticscholar.org/CorpusID:227079063.
26. Zahn-Zabal M, Michel PA, Gateau A, Nikitin F, Schaeffer M, Audot E, et al. The neXtProt knowledgebase in 2020: data, tools and usability improvements. Nucleic Acids Res. 2019;48:D328–34. https://api.semanticscholar.org/CorpusID:208018282.
27. Varga JK, Dobson L, Tusnády GE. TOPDOM: database of conservatively located domains and motifs in proteins. Bioinformatics. 2016;32:2725–26. https://api.semanticscholar.org/CorpusID:8475614.
28. Kumar M, Gouw M, Michael S, Sámano-Sánchez H, Pancsa R, Glavina J, et al. ELM—the eukaryotic linear motif resource in 2020. Nucleic Acids Res. 2019;48:D296–306. https://api.semanticscholar.org/CorpusID:207897679.
29. Hou C, Wang X, Xie H, Chen T, Zhu P, Xu X, et al. PhaSepDB in 2022: annotating phase separation-related proteins with droplet states, co-phase separation partners and other experimental information. Nucleic Acids Res. 2022;51(D1):D460–5.
30. Mészáros B, Erdős G, Szabó B, Schád É, Tantos Á, Abukhairan R, et al. PhaSePro: the database of proteins driving liquid-liquid phase separation. Nucleic Acids Res. 2020;48(D1):D360–7.
31. Erdös G, Pajkos M, Dosztányi Z. IUPred3: prediction of protein disorder enhanced with unambiguous experimental annotation and visualization of evolutionary conservation. Nucleic Acids Res. 2021;49:W297–303. https://api.semanticscholar.org/CorpusID:235242376.
32. Käll L, Krogh A, Sonnhammer ELL. A combined transmembrane topology and signal peptide prediction method. J Mol Biol. 2004;338(5):1027–36. https://api.semanticscholar.org/CorpusID:6858687.
33. Radó-Trilla N, Albà MM. Dissecting the role of low-complexity regions in the evolution of vertebrate proteins. BMC Evol Biol. 2012;12:155. https://api.semanticscholar.org/CorpusID:5208321.
34. Hope TJ. The ins and outs of HIV Rev. Arch Biochem Biophys. 1999;365(2):186–91.
35. Mujeeb A, Bishop K, Peterlin BM, Turck C, Parslow TG, James TL. NMR structure of a biologically active peptide containing the RNA-binding domain of human immunodeficiency virus type 1 Tat. Proc Natl Acad Sci. 1994;91(17):8248–52.
36. Parada CA, Roeder RG. Enhanced processivity of RNA polymerase II triggered by Tat-induced phosphorylation of its carboxy-terminal domain. Nature. 1996;384(6607):375–8.
37. Bieniasz PD, Grdina TA, Bogerd HP, Cullen BR. Recruitment of cyclin T1/P-TEFb to an HIV type 1 long terminal repeat promoter proximal RNA target is both necessary and sufficient for full activation of transcription. Proc Natl Acad Sci. 1999;96(14):7791–6.
38. Raha T, Cheng SWG, Green MR. HIV-1 Tat Stimulates Transcription Complex Assembly through Recruitment of TBP in the Absence of TAFs. PLoS Biol. 2005;3(2):e44.
39. Gatignol A. Transcription of HIV: Tat and cellular chromatin. Adv Pharmacol. 2007;55:137–59.
40. Gotora PT, van der Sluis R, Williams ME. HIV-1 Tat amino acid residues that influence Tat-TAR binding affinity: a scoping review. BMC Infect Dis. 2023;23(1):164.
41. Weeks KM, Ampe C, Schultz SC, Steitz TA, Crothers DM. Fragments of the HIV-1 Tat Protein Specifically Bind TAR RNA. Science. 1990;249(4974):1281–5.
42. Das AT, Harwig A, Berkhout B. The HIV-1 Tat Protein Has a Versatile Role in Activating Viral Transcription. J Virol. 2011;85(18):9506–16.
43. Barboric M, Taube R, Nekrep N, Fujinaga K, Peterlin BM. Binding of Tat to TAR and recruitment of positive transcription elongation factor b occur independently in bovine immunodeficiency virus. J Virol. 2000;74(13):6039–44.
44. Chiu YL, Coronel E, Ho CK, Shuman S, Rana TM. HIV-1 Tat Protein Interacts with Mammalian Capping Enzyme and Stimulates Capping of TAR RNA. J Biol Chem. 2001;276(16):12959–66.
45. Braddock M, Thorburn AM, Chambers A, Elliott GD, Anderson GJ, Kingsman AJ, et al. A nuclear translational block imposed by the HIV-1 U3 region is relieved by the Tat-TAR interaction. Cell. 1990;62(6):1123–33.
46. SenGupta DN, Berkhout B, Gatignol A, Zhou AM, Silverman RH. Direct evidence for translational regulation by leader RNA and Tat protein of human immunodeficiency virus type 1. Proc Natl Acad Sci. 1990;87(19):7492–6.
47. Harrich D. Tat is required for efficient HIV-1 reverse transcription. EMBO J. 1997;16(6):1224–35.
48. Ishido S, Fujita T, Hotta H. Complex Formation of NS5B with NS3 and NS4A Proteins of Hepatitis C Virus. Biochem Biophys Res Commun. 1998;244(1):35–40.
49. Behrens SE, Tomei L, De Francesco R. Identification and properties of the RNA-dependent RNA polymerase of hepatitis C virus. EMBO J. 1996;15(1):12–22.
50. Wittmann-Liebold B, Seib C. The primary structure of protein L20 from the large subunit of the Escherichia coli ribosome. FEBS Lett. 1979;103(1):61–5.
51. Fayat G, Mayaux JF, Sacerdot C, Fromant M, Springer M, Grunberg-Manago M, et al. Escherichia coli phenylalanyl-tRNA synthetase operon region. J Mol Biol. 1983;171(3):239–61.
52. Sheng K, Li N, Rabuck-Gibbons JN, Dong X, Lyumkis D, Williamson JR. Assembly landscape for the bacterial large ribosomal subunit. Nat Commun. 2023;14(1):5220.
53. Qin B, Lauer SM, Balke A, Vieira-Vieira CH, Bürger J, Mielke T, Selbach M, Scheerer P, Spahn CMT, Nikolay R. Cryo-EM captures early ribosome assembly in action. Nat Commun. 2023;14(1):898.
54. Dong X, Doerfel LK, Sheng K, Rabuck-Gibbons JN, Popova AM, Lyumkis D, et al. Near-physiological in vitro assembly of 50S ribosomes involves parallel pathways. Nucleic Acids Res. 2023;51(6):2862–76.
55. Perdigão N, Heinrich J, Stolte C, Sabir KS, Buckley MJ, Tabor B, et al. Unexpected features of the dark proteome. Proc Natl Acad Sci. 2015;112(52):15898–903.
56. Lu S, Wang J, Chitsaz F, Derbyshire MK, Geer RC, Gonzales NR, et al. CDD/SPARCLE: the conserved domain database in 2020. Nucleic Acids Res. 2020;48(D1):D265–8.
57. Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics (Oxford, England). 2020;36(4):1234–40.
58. Mier P, Paladin L, Tamana S, Petrosian S, Hajdu-Soltész B, Urbanek A, et al. Disentangling the complexity of low complexity proteins. Brief Bioinform. 2020;21(2):458–72.

Acknowledgements

We are grateful to Anna Muszewska and Krzysztof Pawłowski for their crucial comments, suggestions, and questions regarding our manuscript. We also thank Łukasz Kniżewski and Mateusz Jastrzebski for server administration and configuration.

Funding

This work was supported by the National Science Centre (grant no. 2020/39/B/ST6/03447), the Institute of Biochemistry and Biophysics PAS minigrant (SBM - 08/07/2021 to J.Z.-L.) and the Silesian University of Technology BKM grant (BKM 02/120/BKM24/0040 to S.S.).

Author information

Aleksandra Gruca and Marcin Grynberg contributed equally to this work.

Authors and Affiliations

Institute of Biochemistry and Biophysics, Polish Academy of Sciences, Warsaw, 02-106, Poland
Joanna Ziemska-Legiecka & Marcin Grynberg
Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, 44-100, Poland
Patryk Jarnot, Sylwia Szymańska & Aleksandra Gruca
Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, 30-387, Poland
Dagmara Błaszczyk
Biotechnology Center, Silesian University of Technology, Gliwice, 44-100, Poland
Alicja Staśczak
Department of Systems Biology and Engineering, Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, 44-100, Poland
Alicja Staśczak
Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, 44-100, Poland
Hanna Langer-Macioł, Kinga Lucińska, Aleksandra Janas, Hanna Słowik & Wiktoria Śliwińska
Department of Graphics, Computer Vision and Digital Systems, Silesian University of Technology, Gliwice, 44-100, Poland
Karolina Widzisz
Department of Clinical and Molecular Genetics, Maria Sklodowska-Curie National Research Institute of Oncology, Gliwice Branch, Gliwice, 44-100, Poland
Hanna Langer-Macioł

Contributions

J.Z.-L. designed and implemented the database and interface, improved data integration by removing errors, and adjusted the integration code to changing sources. P.J. designed the integration of data sources. S.S. provided documentation for LCRAnnotationsDB. J.Z.-L., P.J., A.G., and M.G. performed the analysis of LCRs’ functional significance in protein families. J.Z.-L., P.J., S.S., D.B., A.S., H.L.-M., K.L., K.W., A.J., H.S., and W.Ś. selected data sources and implemented data integration from databases and predictors. M.G. and A.G. conceived the study, supervised the project, and provided edits. All authors contributed to the final manuscript and tested the database interface.

Corresponding authors

Correspondence to Joanna Ziemska-Legiecka, Aleksandra Gruca or Marcin Grynberg.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Material 1.

Supplementary Material 2.

Supplementary Material 3.

Supplementary Material 4.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

About this article

Cite this article

Ziemska-Legiecka, J., Jarnot, P., Szymańska, S. et al. LCRAnnotationsDB: a database of low complexity regions functional and structural annotations. BMC Genomics 25, 1251 (2024). https://doi.org/10.1186/s12864-024-10960-5

Received: 29 May 2024
Accepted: 25 October 2024
Published: 27 December 2024
DOI: https://doi.org/10.1186/s12864-024-10960-5
