- Software
- Open access
Expanding the concept of ID conversion in TogoID by introducing multi-semantic and label features
Journal of Biomedical Semantics volume 16, Article number: 1 (2025)
Abstract
Background
TogoID (https://togoid.dbcls.jp/) is an identifier (ID) conversion service designed to link IDs across diverse categories of life science databases. With its ability to obtain IDs related in different semantic relationships, a user-friendly web interface, and a regular automatic data update system, TogoID has been a valuable tool for bioinformatics.
Results
We have recently expanded TogoID's ability to represent semantics between datasets, enabling it to handle multiple semantic relationships within dataset pairs. This enhancement enables TogoID to distinguish relationships such as "glycans bind to proteins" or "glycans are processed by proteins" between glycans and proteins. Additional new features include the ability to display labels corresponding to database IDs, making it easier to interpret the relationships between the various IDs available in TogoID, and the ability to convert labels to IDs, extending the entry point for ID conversion. The implementation of URL parameters, which reproduces the state of TogoID's web application, allows users to share complex search results through a simple URL.
Conclusions
These advancements improve TogoID’s utility in bioinformatics, allowing researchers to explore complex ID relationships. By introducing the tool’s multi-semantic and label features, TogoID expands the concept of ID conversion and supports more comprehensive and efficient data integration across life science databases.
Background
TogoID is an identifier (ID) conversion service developed by the authors, designed to link IDs across a wide range of life science databases, facilitating data exploration across multiple sources [1]. TogoID has several advantages, including a user-friendly web interface that supports multi-step conversion, open-source programs for collecting ID pairs, and weekly data updates to keep up with the original databases. One of its most distinctive features is the ability to handle IDs from databases regardless of their entity categories (e.g. genes, chemical compounds, and diseases), thus expanding the concept of ID conversion. Conventionally, the term "ID conversion" refers to linking IDs of identical entities from different databases. For instance, IDs of GlyTouCan, the repository of glycan structures [2], can be converted to IDs of PubChem, a comprehensive database of chemical compounds [3]. In TogoID, the term "ID conversion" is used not only in this conventional sense but also for obtaining IDs that are linked by a broad range of semantic relations. For example, users can input GlyTouCan IDs to obtain UniProt [4] IDs of proteins modified by the glycans. To clarify the semantics of these various relations between datasets, we defined the TogoID ontology (TIO) and used it to display the semantics in the web interface. These unique features of TogoID expand the utility of identifiers for broader bioinformatic analyses.
Since the previous publication, we have continuously added new datasets to TogoID and developed new functionality. In particular, the ability to cover multiple semantic relations between a single pair of datasets is important for fully describing complex relations between biological entities. In this manuscript, we report new features of TogoID that advance the concept of ID conversion.
Results
Selecting relations from multiple options
The initial version of TogoID assumed only one semantic relationship between a single dataset pair. Because this was insufficient to describe the diverse relationships between glycans and proteins, we modified the system to accommodate multiple relationship types per dataset pair. This makes it possible to cover different types of semantic relationships between proteins and glycans, such as "protein is modified with glycans", "protein enzymatically processes glycan", and "protein enzymatically produces glycan" (Fig. 1).
Users can either paste IDs of interest into the textbox or upload a file containing IDs. In the example shown in Fig. 2, GlyTouCan IDs are specified in the textbox, and the user selects "GlyTouCan" from the options displayed in the leftmost column, which shows the candidate datasets based on the pattern matching of ID notations. Subsequently, datasets linked to GlyTouCan are displayed. In this new version of TogoID, the "UniProt" dataset appears in three rows. The lines connecting "GlyTouCan" to each "UniProt" row indicate the semantics of the relationship, namely "glycan is attached to protein", "glycan is acceptor of enzymatic reaction by protein", and "glycan is product of enzymatic reaction by protein," respectively. By selecting the relationship of the user’s interest and clicking the table icon, a modal window to preview the ID relations and download the entire results is displayed.
Displaying labels
To give users a clear understanding of what IDs represent, we implemented a new feature that displays labels for IDs. Labels of database entries include, for example, gene symbols and IUPAC notations of glycans. Datasets that have no appropriate strings to display as labels, such as the interaction IDs in the IntAct database [5], are excluded. As a result, TogoID can display labels for 65 of the 105 datasets currently supported. Figure 3 shows the modal window that presents a conversion result in the updated version. Clicking the switch next to a column name displays the labels for the IDs in that column, making it clear what each ID means. Displayed labels can be included when the table is downloaded or copied.
Conversion from labels to IDs
While natural language labels are human-readable, they are often polysemous or synonymous, making them unsuitable as identifiers for database entries, which require precise and unambiguous references. Consequently, converting natural language labels into database identifiers is a crucial function for bridging the gap between human-friendly expressions and machine-readable data representations. To address this, we implemented this functionality into TogoID. To account for potential spelling variations or misspellings in user-supplied labels, we utilized PubDictionaries which offers a public web API for mapping labels to IDs with fuzzy mapping capabilities [6] (see Implementation for details).
The web interface for converting labels to IDs is shown in Fig. 4. Users begin by entering labels in the input window and selecting the target dataset from the "Dataset" drop-down menu. In Fig. 4A, the integrated disease ontology, MONDO, is selected. The "Threshold" parameter specifies the tolerance for fuzzy matching when querying PubDictionaries and takes values from 0.5 to 1. A value of 1 allows only exact matches, while lower values tolerate less exact matches. The types of labels to be searched are selected using the checkboxes. After the EXECUTE button is pressed, the matched IDs are displayed in a table. The "ID" column lists the candidate IDs obtained from the PubDictionaries API, and the "Match type" column indicates which label types ("name", "broader synonym", etc.) were matched. The table also shows the main label associated with each ID in the selected database. The "Score" column represents the string match score.
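The effect of the "Threshold" parameter can be illustrated with a minimal sketch. It does not reproduce the PubDictionaries matching algorithm or its API: the similarity measure from Python's standard `difflib` module is a stand-in, and the two-entry dictionary, the placeholder IDs (`EX:0001`, `EX:0002`), and the function names are assumptions made only for this illustration.

```python
from difflib import SequenceMatcher

# Hypothetical mini-dictionary: label -> ID (placeholder IDs, not real MONDO entries).
DICTIONARY = {
    "Parkinson disease": "EX:0001",
    "Alzheimer disease": "EX:0002",
}


def similarity(a: str, b: str) -> float:
    """Case-insensitive string similarity in [0, 1]; 1.0 means identical strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def labels_to_ids(query: str, threshold: float = 0.85) -> list[tuple[str, str, float]]:
    """Return (id, matched label, score) for entries scoring at or above the threshold."""
    hits = [
        (entry_id, label, similarity(query, label))
        for label, entry_id in DICTIONARY.items()
    ]
    hits = [hit for hit in hits if hit[2] >= threshold]
    # Best-scoring candidates first, as in a ranked result table.
    return sorted(hits, key=lambda hit: hit[2], reverse=True)


# A misspelled label still matches at a tolerant threshold but not at 1.0.
print(labels_to_ids("Parkinson desease", threshold=0.85))
print(labels_to_ids("Parkinson desease", threshold=1.0))
```

With a threshold of 1, only labels identical to a dictionary entry (ignoring case in this sketch) are returned, which mirrors the exact-match behaviour described above.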
Fig. 4 The web interface for conversion from labels to IDs. (A) Conversion from disease names to MONDO IDs. One of the input strings contains a misspelling but is still converted to the corresponding ID. (B) Conversion from gene symbols to NCBI Gene IDs. The Taxonomy ID of the species must be specified for the conversion to reduce ambiguity
In Fig. 4B, "NCBI Gene" is selected as the dataset. Since only exact matches are considered in the conversion of gene labels, the threshold parameter is not shown. Users specify the species using the "Species" drop-down menu. This menu includes only major organisms; if the intended species is not listed, users can enter the Taxonomy ID directly into the adjacent input window.
The conversion result table can be copied to the clipboard or downloaded in CSV or TSV format, just like the ID conversion result table. By clicking the "Convert IDs" button, the converted IDs are automatically copied to the ID input field, allowing users to perform further conversions using the TogoID ID conversion interface.
URL parameters
We developed URL parameters that reproduce the state of TogoID's web application. The application's state can represent a complex search result, for example a list of diseases linked with a list of variants of a specific glycosyltransferase. Because such search results can be shared via a URL, other applications can easily link from their resources to TogoID and use the ID link information accumulated in TogoID.
As an example, opening the following URL in a browser takes the user to the TogoID web page showing the ID conversion route from GlyTouCan to UniProt, with two GlyTouCan identifiers specified as input.
https://togoid.dbcls.jp/?route=glytoucan%2CTIO_000126%2Cuniprot&ids=G48258CR%2CG46677TE
To reproduce the state of the TogoID web application, we implemented the following parameters.
- route: A comma-separated list of dataset names and TogoID Ontology properties that connect between datasets. The properties are placed between the dataset names.
- ids: A comma-separated list of input IDs.
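As an illustration of how the two parameters combine, the following sketch builds and parses such a URL with Python's standard library. The function names are hypothetical; only the base URL, the parameter names `route` and `ids`, and the example values come from this article.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://togoid.dbcls.jp/"


def build_togoid_url(route: list[str], ids: list[str]) -> str:
    """Join route elements and input IDs with commas and percent-encode them."""
    query = urlencode({"route": ",".join(route), "ids": ",".join(ids)})
    return f"{BASE_URL}?{query}"


def parse_togoid_url(url: str) -> tuple[list[str], list[str]]:
    """Recover the route elements and the input IDs from a shared URL."""
    params = parse_qs(urlsplit(url).query)
    return params["route"][0].split(","), params["ids"][0].split(",")


url = build_togoid_url(
    route=["glytoucan", "TIO_000126", "uniprot"],
    ids=["G48258CR", "G46677TE"],
)
print(url)
print(parse_togoid_url(url))
```

The commas are percent-encoded as `%2C`, so the generated string has the same form as the example URL given above.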
Additionally, information on available relations between datasets can be obtained using the following API, where {source} and {target} represent names of source and target datasets respectively.
https://api.togoid.dbcls.jp/config/relation/{source}-{target}
For example, a request to
https://api.togoid.dbcls.jp/config/relation/glytoucan-uniprot
returns the following JSON.
[
{
"forward": {"display_label": "is attached to", "id": "TIO_000060"},
"reverse": {"display_label": "is modified with", "id": "TIO_000061"}
},
{
"forward": {"display_label": "is acceptor of enzymatic reaction by", "id": "TIO_000126"},
"reverse": {"display_label": "enzymatically processes", "id": "TIO_000127"}
},
{
"forward": {"display_label": "is product of enzymatic reaction by", "id": "TIO_000128"},
"reverse": {"display_label": "enzymatically produces", "id": "TIO_000129"}
}
]
This indicates that multiple semantic relations exist between these two datasets, each distinguished by a TogoID ontology (TIO) ID and its label. The TogoID ontology defines 133 relations to distinguish the types of relationships between pairs of datasets. Note that {source} in the API is the dataset from which the relation information is taken (the “forward” direction), so the relation in the “reverse” direction, {target}-{source}, is not always available. All possible relations can be obtained from https://api.togoid.dbcls.jp/config/relation. The entire list of TIO IDs can be seen at https://github.com/togoid/togoid-config/blob/main/ontology/property.tsv and the TogoID ontology is available at https://togoid.dbcls.jp/ontology.
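A client could turn such a response into a lookup table from display labels to TIO IDs. The sketch below parses a copy of the JSON shown above rather than calling the API, so it runs offline; the helper name `relation_ids` is hypothetical.

```python
import json

# Copy of the response for glytoucan-uniprot shown in this article.
RESPONSE = """
[
  {"forward": {"display_label": "is attached to", "id": "TIO_000060"},
   "reverse": {"display_label": "is modified with", "id": "TIO_000061"}},
  {"forward": {"display_label": "is acceptor of enzymatic reaction by", "id": "TIO_000126"},
   "reverse": {"display_label": "enzymatically processes", "id": "TIO_000127"}},
  {"forward": {"display_label": "is product of enzymatic reaction by", "id": "TIO_000128"},
   "reverse": {"display_label": "enzymatically produces", "id": "TIO_000129"}}
]
"""


def relation_ids(relations: list[dict], direction: str = "forward") -> dict[str, str]:
    """Map each display label to its TIO ID for the chosen direction."""
    return {rel[direction]["display_label"]: rel[direction]["id"] for rel in relations}


relations = json.loads(RESPONSE)
forward = relation_ids(relations)
reverse = relation_ids(relations, direction="reverse")

for label, tio_id in forward.items():
    print(f"glycan {label} protein -> {tio_id}")
```

The ID selected this way can then be placed between the dataset names in a conversion route.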
Data update
In the two years since our previous publication, TogoID has expanded its coverage to include 40 additional datasets, bringing the total to 105 datasets from 73 distinct databases (Table 1). As mentioned in our previous paper [1], some databases can contain a mix of IDs from different categories under a single namespace. For example, the NCI Thesaurus (NCIt) uses "C" prefixed IDs for both disease and tissue classifications. To distinguish such IDs, TogoID subdivides databases by category and refers to each subdivision as a dataset. The number of direct dataset pairs has also grown by 98, now totaling 263. The number of properties defined in the TogoID ontology to describe relationships between datasets has increased by 56, reaching 133.
Use cases
To illustrate the practical applications of the described features, we present two use cases that demonstrate how these functionalities can be utilized effectively. The first use case is an example of tracing from molecular IDs to disease IDs through multiple conversion steps. The second example is selecting appropriate semantics between IDs and narrowing down the results based on molecular functions.
- Use case 1: Obtain disease IDs related to the enzymes that produce glycans
In the example shown in Fig. 5A, the GlycoMotif ID of N-acetyllactosamine, G00055MO, is specified as an input ID. GlycoMotif is a collection of characteristic substructures observed in glycan structures, provided by GlyCosmos and included in GlyGen [7]. In the example, GlyTouCan IDs of glycans containing the input motifs are obtained. These GlyTouCan IDs are then linked to UniProt IDs with the relationship "is product of enzymatic reaction by." Subsequently, ClinVar [8], the database about relationships among genetic variations and diseases in humans, and Mondo [9], the ontology aiming to harmonize disease definitions across the world, are selected. This allows users to obtain disease IDs related to the enzymes that produce the glycans of interest, with variants that can affect the relationships between the enzymes and the diseases.
The URL to reproduce this conversion on the web application is:
- Use case 2: Obtain GO annotations for proteins modified by glycans
In the example shown in Fig. 5B, the GlycoMotif ID of N-Glycan high mannose, G00028MO, is input, and GlyTouCan IDs of glycans containing the motifs are obtained. These GlyTouCan IDs are then linked to UniProt IDs with the relationship "is attached to." Following this, GO (Gene Ontology) [10] is selected. After downloading the results, users can filter the UniProt entries using GO annotations. For example, by filtering with GO:0005886 (plasma membrane), users can explore glycosylated membrane proteins that may serve as drug targets.
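The filtering step could look like the following sketch. The column names, the tab-separated layout, and the placeholder protein identifiers are assumptions made for illustration; the actual downloaded file may be organized differently.

```python
import csv
import io

# Hypothetical excerpt of a downloaded result (tab-separated, placeholder protein IDs).
RESULT_TSV = (
    "uniprot_id\tgo_id\n"
    "PROTEIN_A\tGO:0005886\n"
    "PROTEIN_A\tGO:0005515\n"
    "PROTEIN_B\tGO:0005634\n"
    "PROTEIN_C\tGO:0005886\n"
)


def filter_by_go(tsv_text: str, go_id: str,
                 protein_column: str = "uniprot_id",
                 go_column: str = "go_id") -> list[str]:
    """Return the distinct protein IDs annotated with the given GO term, in file order."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    selected: list[str] = []
    for row in reader:
        if row[go_column] == go_id and row[protein_column] not in selected:
            selected.append(row[protein_column])
    return selected


# GO:0005886 is the plasma membrane term mentioned in the text.
print(filter_by_go(RESULT_TSV, "GO:0005886"))
```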
The URL to reproduce this conversion on the web application is:
Implementation
Changes to the backend database system
In the TogoID backend, the ID pairs are stored in a relational database (RDB) to ensure adequate response performance. Each of the tables has two columns which contain source and target IDs respectively. In the previous version, the names of the tables were "{source dataset}-{target dataset}", e.g. "glytoucan-uniprot" (Fig. 6A). The frontend system referred to each table with the name.
In the updated version, we split the tables to support dataset pairs with multiple semantic relations, so the tables are now named "{source dataset}-{target dataset}-{TogoID ontology ID}", where the TogoID ontology ID specifies the semantic relation between the datasets. For example, "glytoucan-uniprot-TIO_000060", "glytoucan-uniprot-TIO_000126" and "glytoucan-uniprot-TIO_000128" are the table names of the ID pairs between GlyTouCan and UniProt, and the semantic relations are "TIO_000060 (glycan is attached to protein)", "TIO_000126 (glycan is acceptor of enzymatic reaction by protein)", and "TIO_000128 (glycan is product of enzymatic reaction by protein)" respectively (Fig. 6B). With this extension, the frontend system can distinguish different relations.
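The naming scheme can be expressed as a small helper. This is a sketch of the convention described above, not the backend code: the function names are hypothetical, and the check that a TIO ID consists of "TIO_" followed by six digits is an assumption inferred from the examples.

```python
import re

# Assumed shape of a TogoID ontology ID, inferred from examples such as TIO_000060.
TIO_PATTERN = re.compile(r"TIO_\d{6}")


def legacy_table_name(source: str, target: str) -> str:
    """Table name in the previous version: one table per dataset pair."""
    return f"{source}-{target}"


def table_name(source: str, target: str, relation_id: str) -> str:
    """Table name in the updated version: one table per dataset pair and relation."""
    if not TIO_PATTERN.fullmatch(relation_id):
        raise ValueError(f"not a TogoID ontology ID: {relation_id!r}")
    return f"{source}-{target}-{relation_id}"


for relation in ("TIO_000060", "TIO_000126", "TIO_000128"):
    print(table_name("glytoucan", "uniprot", relation))
```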
Changes to API
TogoID provides an API that accepts IDs and returns the converted IDs (for details, see https://togoid.dbcls.jp/apidoc/). In the API, users specify the path of conversion as a comma-separated string, e.g. "glytoucan,uniprot". In the updated version, users can specify the semantic relation between datasets by inserting the corresponding TogoID ontology ID between the dataset names, e.g. "glytoucan,TIO_000060,uniprot". For dataset pairs with only one semantic relation defined, the TogoID ontology ID can be omitted.
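One way to interpret such a route string is to split it into conversion steps, each with an optional relation. The sketch below is illustrative rather than the actual API implementation: the `Hop` type and `parse_route` function are hypothetical, TIO IDs are recognized by an assumed "TIO_" plus six digits pattern, and the dataset name `glycomotif` in the usage example is an assumption.

```python
import re
from typing import NamedTuple, Optional

# Assumed shape of a TogoID ontology ID, inferred from examples such as TIO_000060.
TIO_PATTERN = re.compile(r"TIO_\d{6}")


class Hop(NamedTuple):
    source: str
    relation: Optional[str]  # None when the pair has a single, implicit relation
    target: str


def parse_route(route: str) -> list[Hop]:
    """Split a comma-separated route into (source, relation, target) steps."""
    tokens = [token.strip() for token in route.split(",") if token.strip()]
    hops: list[Hop] = []
    source: Optional[str] = None
    relation: Optional[str] = None
    for token in tokens:
        if TIO_PATTERN.fullmatch(token):
            if source is None or relation is not None:
                raise ValueError("a relation ID must sit between two dataset names")
            relation = token
        else:
            if source is not None:
                hops.append(Hop(source, relation, token))
            source, relation = token, None
    if relation is not None:
        raise ValueError("a route must not end with a relation ID")
    return hops


print(parse_route("glytoucan,uniprot"))
print(parse_route("glycomotif,glytoucan,TIO_000128,uniprot"))
```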
The system to display labels
The Resource Description Framework (RDF: https://www.w3.org/TR/rdf11-concepts/) is a standard model for representing data on the Semantic Web, where each resource is identified by a uniform resource identifier (URI). The standard query language SPARQL is used to access RDF data. Generally, database IDs can be turned into URIs by concatenating them with the corresponding URI prefix, and the resulting URIs can be used to query for relevant data in RDF graphs.
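The idea of turning IDs into URIs and querying for their labels can be sketched as follows. The URI prefix is a placeholder, and the use of `rdfs:label` as the label predicate is an assumption; individual RDF datasets may use other prefixes and predicates.

```python
# Placeholder prefix for illustration; each dataset has its own URI prefix.
URI_PREFIX = "http://example.org/dataset/"


def id_to_uri(prefix: str, identifier: str) -> str:
    """Concatenate a URI prefix and a database ID to obtain the resource URI."""
    return prefix + identifier


def label_query(uris: list[str]) -> str:
    """Build a SPARQL query that retrieves rdfs:label for the given resources."""
    values = " ".join(f"<{uri}>" for uri in uris)
    return (
        "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>\n"
        "SELECT ?entry ?label WHERE {\n"
        f"  VALUES ?entry {{ {values} }}\n"
        "  ?entry rdfs:label ?label .\n"
        "}"
    )


uris = [id_to_uri(URI_PREFIX, i) for i in ("G48258CR", "G46677TE")]
query = label_query(uris)
print(query)
```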
Based on the collection of RDF datasets hosted at the RDF Portal [11] and other sources, we developed a system to display labels of input IDs, utilizing a tool called Grasp (https://github.com/dbcls/grasp). Grasp is middleware that accepts GraphQL queries and generates SPARQL queries based on predefined configurations. In TogoID, we configured Grasp to generate SPARQL queries to retrieve labels from existing data. Regarding datasets managed by TogoID where RDF is not available, we created simple RDF data containing IDs and their corresponding labels. This RDF data for ID-to-label relationships is also available on the RDF Portal, alongside the existing RDF data for ID-to-ID relationships. Users can combine these datasets with other available data through the SPARQL endpoint of the RDF Portal (https://rdfportal.org/primary/sparql) to perform advanced SPARQL queries.
The system to convert labels to IDs
Implementing the function to convert labels to IDs was more complex than simply converting IDs to labels or between different types of IDs. Labels differ from IDs in several key ways:
- Labels are not necessarily unique within a database. For example, in gene databases covering multiple species, orthologous genes may share the same symbols across species.
- Labels for diseases etc. are often written in natural language and may be entered manually by humans, and thus they may contain typos and spelling variations.
- Records may have synonyms in addition to the main label. Users may need to search either only the main label or include synonyms.
We addressed these by using PubDictionaries [6], which supports fuzzy matching, and by creating a new user interface to specify biological species and label types. PubDictionaries is a repository of dictionaries, where each dictionary represents a collection of mappings between natural language labels and identifiers. Notably, it provides an API for converting labels to IDs based on any dictionaries uploaded to the platform. Since it performs fuzzy match searches, it can accommodate subtle surface differences, such as spelling variations or typographical errors.
We created several dictionaries and uploaded them to PubDictionaries (https://pubdictionaries.org/users/togoid). To enable users to specify the type of labels, such as primary labels or synonyms, based on their search intent, we created separate dictionaries for different label types. Also, PubDictionaries offers a feature that allows tags to be assigned to dictionary entries. Users can refine their search by specifying tags. For the dictionaries of gene databases, TogoID uses Taxonomy IDs as tags, and thus enables users to specify Taxonomy IDs together with input labels for searching. By using this system, we can easily accommodate the necessary labels for any datasets added to TogoID in the future.
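The combination of separate dictionaries per label type and tag-based filtering can be sketched with in-memory data. This does not use the PubDictionaries API; the data structures and the `lookup` function are hypothetical. The gene IDs and the "PC4" synonym are those discussed later in this article, and tagging them with Taxonomy ID 9606 (human) is an assumption of this sketch.

```python
# One dictionary per label type; each entry is (label, NCBI Gene ID, taxonomy tag).
SYMBOLS = [
    ("IFRD1", "3475", "9606"),
    ("KRT6B", "3854", "9606"),
    ("SUB1", "10923", "9606"),
    ("PCSK4", "54760", "9606"),
]
SYNONYMS = [
    ("PC4", "3475", "9606"),
    ("PC4", "3854", "9606"),
    ("PC4", "10923", "9606"),
    ("PC4", "54760", "9606"),
]
DICTIONARIES = {"symbol": SYMBOLS, "synonym": SYNONYMS}


def lookup(label: str, taxonomy_id: str,
           label_types: tuple[str, ...] = ("symbol",)) -> list[tuple[str, str]]:
    """Exact-match lookup restricted to the chosen label types and taxonomy tag."""
    hits = []
    for label_type in label_types:
        for entry_label, gene_id, tag in DICTIONARIES[label_type]:
            if entry_label == label and tag == taxonomy_id:
                hits.append((gene_id, label_type))
    return hits


print(lookup("SUB1", "9606"))                         # main symbols only
print(lookup("PC4", "9606"))                          # not a main symbol
print(lookup("PC4", "9606", ("symbol", "synonym")))   # ambiguous once synonyms are included
```

Searching synonyms widens recall at the cost of ambiguity, which is why the interface lets users choose the label types to search.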
Discussion and conclusion
Conventional ID conversion tools have overlooked the semantic relationships between IDs. While TogoID was already capable of handling semantics, its use of ontology made it possible to naturally extend the system to handle multiple semantics in a single dataset pair as part of this update. The relationships that can be described by the extension made to address this are not limited to the relationship between glycans and proteins described above. For example, TogoID covers the relation between Reactome reaction IDs [12] and UniProt IDs, which means proteins participate in reactions. However, what role the proteins play in the reaction, e.g. an enzyme, a reactant, or a product, has not yet been distinguished in TogoID. We plan to investigate the information described in the Reactome database and classify the relations in TogoID in the future.
On the other hand, addressing multiple semantics has also revealed new challenges.
- While it is now possible to obtain the relationship between an enzyme and its product, and between an enzyme and its reaction acceptor, TogoID can only obtain binary relationships, and it is not possible to obtain triplets of an enzyme, an acceptor, and a product.
- Another relationship between a protein and a glycan is that "a lectin recognizes a glycan." As in the lectin database LfDB [13], this relationship is described using affinity scores, and is not a simple binary relationship. If we were to cover this with TogoID, we would need a function that stores or calculates the relationship scores in the TogoID system and can output only relationships filtered by the scores.
We need to consider the balance of how much should be covered by ID conversion services, but enabling researchers to easily obtain such relationships would be beneficial and is a challenge to be addressed in the future.
As for the label-to-ID conversion function, one of the motivations for implementing it was to address the common issue where gene lists in published papers are provided only with symbols and lack corresponding IDs. This can hinder researchers when reanalyzing data. While TogoID helps mitigate this problem, it cannot fully resolve it. This is primarily due to gene symbols not always being unique, even within the same species, when synonyms are involved. For example, the string "PC4" is used as a synonym for four different genes: NCBI Gene ID 3475 (IFRD1), 3854 (KRT6B), 10923 (SUB1), and 54760 (PCSK4). It is crucial for researchers to be aware of this nature of gene symbols and use gene IDs when specifying genes to avoid ambiguity.
Regarding the display of labels, there is an issue that multiple labels for a single database entry are possible. For instance, in TogoID, glycans are labeled using IUPAC notation, but users may prefer alternative notations, such as WURCS [14]. Similarly, while TogoID currently displays the scientific name for Taxonomy IDs, some users might prefer the common name. Expanding the range of label types that can be displayed is an important area for future development.
In summary, the developments introduced in this manuscript advance the concept of ID conversion, making it a more powerful tool for bioinformatics. TogoID is expected to be used as a platform for tracing the relationships between various elements that will be elucidated by technical advances and enriched in databases in the future.
Availability and requirements
Project name: TogoID.
Project home page: https://togoid.dbcls.jp/
Operating system(s): Platform independent.
Other requirements: None.
License: The TogoID web service is free to use. The programs to collect ID relations are available at GitHub: https://github.com/togoid/togoid-config under the MIT License.
Any restrictions to use by non-academics: None.
Data availability
No datasets were generated or analysed during the current study.
Abbreviations
- API: Application programming interface
- GO: Gene Ontology
- ID: Identifier
- IUPAC: International Union of Pure and Applied Chemistry
- MONDO: Mondo Disease Ontology
- RDB: Relational Database
- RDF: Resource Description Framework
- TIO: TogoID Ontology
- URI: Uniform Resource Identifier
References
1. Ikeda S, Ono H, Ohta T, Chiba H, Naito Y, Moriya Y, et al. TogoID: an exploratory ID converter to bridge biological datasets. Bioinformatics. 2022;38(17):4194–9.
2. Fujita A, Aoki NP, Shinmachi D, Matsubara M, Tsuchiya S, Shiota M, et al. The international glycan repository GlyTouCan version 3.0. Nucleic Acids Res. 2021;49(D1):D1529–33.
3. Kim S, Chen J, Cheng T, Gindulyte A, He J, He S, et al. PubChem 2023 update. Nucleic Acids Res. 2023;51(D1):D1373–80.
4. The UniProt Consortium. UniProt: the Universal Protein Knowledgebase in 2023. Nucleic Acids Res. 2023;51(D1):D523–31.
5. del Toro N, Shrivastava A, Ragueneau E, Meldal B, Combe C, Barrera E, et al. The IntAct database: efficient access to fine-grained molecular interaction data. Nucleic Acids Res. 2022;50(D1):D648–53.
6. Kim JD, Wang Y, Fujiwara T, Okuda S, Callahan TJ, Cohen KB. Open Agile text mining for bioinformatics: the PubAnnotation ecosystem. Bioinformatics. 2019;35(21):4372–80.
7. GlycoMotif. Available from: https://glycomotif.glyomics.org/. Cited 2024 Aug 27.
8. Landrum MJ, Lee JM, Benson M, Brown GR, Chao C, Chitipiralla S, et al. ClinVar: improving access to variant interpretations and supporting evidence. Nucleic Acids Res. 2018;46(D1):D1062–7.
9. Vasilevsky NA, Matentzoglu NA, Toro S, Flack JE, Hegde H, Unni DR, et al. Mondo: Unifying diseases for the world, by the world. medRxiv; 2022. p. 2022.04.13.22273750. Available from: https://www.medrxiv.org/content/10.1101/2022.04.13.22273750v3. Cited 2024 Jul 24.
10. The Gene Ontology Consortium. The Gene Ontology knowledgebase in 2023. Genetics. 2023;224(1):iyad031. Available from: https://academic.oup.com/genetics/article/224/1/iyad031/7068118?login=false. Cited 2024 Jul 24.
11. Kawashima S, Katayama T, Hatanaka H, Kushida T, Takagi T. NBDC RDF portal: a comprehensive repository for semantic data in life sciences. Database. 2018;2018:bay123.
12. Milacic M, Beavers D, Conley P, Gong C, Gillespie M, Griss J, et al. The Reactome Pathway Knowledgebase 2024. Nucleic Acids Res. 2024;52(D1):D672–8.
13. Hirabayashi J, Tateno H, Shikanai T, Aoki-Kinoshita KF, Narimatsu H. The Lectin Frontier Database (LfDB), and data generation based on frontal affinity chromatography. Molecules. 2015;20(1):951–73.
14. Tanaka K, Aoki-Kinoshita KF, Kotera M, Sawaki H, Tsuchiya S, Fujita N, et al. WURCS: The Web3 Unique Representation of Carbohydrate Structures. J Chem Inf Model. 2014;54(6):1558–66. Available from: https://pubs.acs.org/doi/full/10.1021/ci400571e. Cited 2024 Oct 22.
Acknowledgements
We thank Info Lounge Corporation for the web interface and API development of the TogoID system.
Funding
This work was supported under the Life Science Database Integration Project, NBDC of Japan Science and Technology Agency.
Author information
Contributions
All authors contributed to the study conception and design. Data preparation was performed by S.I., M.H., H.C., T.T., S.K., and Y.M. Development of Grasp and its configuration for the RDF Portal SPARQL endpoint were performed by T.K., S.K., Y.M., and S.I. J.-D.K. extended PubDictionaries to support necessary functions for TogoID. The first draft of the manuscript was written by S.I., followed by discussions with M.H., Y.Y. and K.F.A.-K. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Ikeda, S., Aoki-Kinoshita, K.F., Chiba, H. et al. Expanding the concept of ID conversion in TogoID by introducing multi-semantic and label features. J Biomed Semant 16, 1 (2025). https://doi.org/10.1186/s13326-024-00322-1
DOI: https://doi.org/10.1186/s13326-024-00322-1