NAR Database List
I always look forward to the Nucleic Acids Research (NAR) database issue. It's a great way to learn what people are interested in and learn something new. It's also fun to count the number of databases, because each way they're counted, a different answer is obtained. As in 2016, in 2018 the number of databases listed by NAR decreased. This is in part due to the overall trend that the number of new databases submitted to the archive each year has been slowly decreasing since 2004. It is also due to an increase in the number of databases being removed from the archive. Both issues are discussed at the end of this blog.
But first, the numbers. The opening editorial by DJ Rigden and XM Fernández indicates that, in 2018, 66 new databases were added and 147 databases were removed. As stated in the editorial, the archive now lists 1613 databases. However, if the semi-alphabetical listing of databases is used as a source for counting, then there are 1697. I could have also used my estimate from last year. In 2018, I estimated 1737 databases. Using 1737 as a base, adding the new databases and subtracting the purged ones would give 1737 + 66 - 147 = 1656 databases. There you have it, three ways to count and three different answers; counting databases is hard.
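For anyone who wants to check the arithmetic, here is a minimal sketch in Python. The numbers are the ones quoted above; the variable names are my own labels, not anything NAR uses.

```python
# Figures quoted in the paragraph above; variable names are illustrative only.
editorial_total = 1613    # count stated in the NAR editorial
listing_total = 1697      # count from the semi-alphabetical listing
previous_estimate = 1737  # last year's estimate
added, removed = 66, 147  # databases added and removed per the editorial

# Roll last year's estimate forward using the editorial's adds and removals.
rolled_forward = previous_estimate + added - removed

counts = {
    "editorial": editorial_total,
    "listing": listing_total,
    "rolled forward": rolled_forward,
}
for label, n in counts.items():
    print(f"{label:>15}: {n}")

# Gap between the largest and smallest of the three answers.
spread = max(counts.values()) - min(counts.values())
print(f"{'spread':>15}: {spread}")
```

The three answers differ by as many as 84 databases, which is more than the number of new databases added this year.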
Semi-alphabetical is used above because the first six entries begin with N, E, B, D, E, and G for NCBI (USA), EBI (Europe), Big Data Center (China), DDBJ (Japan), European Nucleotide Archive (Europe), and GenBank (USA), respectively. Making an alphabetical list non-alphabetical requires some effort. Perhaps these are considered by the NAR team to be the most important bio databases, ranked according to importance? I'll leave any implied interpretations up to the reader (Go USA). I should add that this short list is followed by a seemingly alphabetical list that is really random and only later becomes alphabetical. Perhaps a better heading for this list would be "a list of databases."
As indicated by this year's title, I'm going to focus on databases related to immunology. Digital World Biology is working with Shoreline Community College to develop an immuno-bioinformatics class. Hence, it is worthwhile exploring the 31 resources listed under the NAR Immunological Databases Category, and summarized in a table on a linked page. Of the 31 listed databases, four are gone. The URLs might point to a working page, but the database no longer exists. Two of the entries are clearly not databases. Many more are likely inactive, as determined by the lack of an update date or a last update that is five or more years old. Removing these leaves between seven and 11 of the 31 that are likely to still be active. The larger number of active databases is due to a group of five entries that are part of the same website and publication.
In terms of content, twelve (>30%) of the databases focus(ed) on antigens and epitopes, with two of those focused on haptens. Haptens are chemicals that, alone, are too small to be immunogenic, but when bound to proteins, can elicit an immune response. Antigen databases are important and often used in vaccine development, which is a big area in biotechnology. Seven other databases focus on antigen receptor sequences, and the remainder are more specialized and focus on macrophage gene expression datasets, antibody structures, interaction networks, and innate immunity.
Two of the databases, IMGT (ImmunoGeneTics, left side of the figure below) and IEDB (Immune Epitope Database, right side of figure below), are standards in immunology. If you want to learn about antigen receptor (Ig/Ab and TCR) genes and obtain the gene sequences for many organisms, or search collections of gene sequences (DNA and amino acid), IMGT is the place to go. For example, the IgBLAST documentation (last blog) recommends IMGT for obtaining any needed FASTA files of V(D)J sequences. A charming appeal of IMGT is that it has maintained its quaint 1990s look and feel, but under that old UI is an amazing resource.
IEDB focuses on antigens and epitopes, and contains a large number of peptide and non-peptide (hapten and larger chemical) epitopes. Of significance, the resource also holds a wealth of assay data, so that a researcher can further evaluate the quality of an epitope of interest. In addition to epitopes, their assay information, and source organisms, IEDB also provides a large number of tools for epitope prediction. We will explore these tools in the Immuno-bioinformatics class and verify their results by comparing predicted epitopes to those identified in solved structures using Molecule World.™
It is also important to note that IEDB contains a large amount of data about MHC (Major Histocompatibility Complex) restriction. Any effort to design vaccines that broadly stimulate cellular immunity needs to take MHC variability into account. An aspect of IEDB that I find particularly useful is that the search boxes on the front page are accompanied by finders that allow a user to browse the available items they might want to search on. Too often bio databases start with a search, but do not help a user learn what the parameters are. Instead, IEDB has taken an approach that guides its users through the data. With this approach, IEDB communicates, and teaches, how immune systems work in a useful way.
Digital World Biology likes molecular structures, so another resource of interest is SAbDab, which according to its tag line is "The Structural Antibody Database." SAbDab most directly competes with AbDb, "a database of PDB-derived antibody structures" (this title is from AbDb's most recent article in the journal Database). Interestingly, AbDb is not in the NAR database list; I'll come back to that later. Both SAbDab and AbDb are extensive repositories of antibody structures and provide many tools to work with structures and sequences. SAbDab has an edge on the user interface. Search tools, statistics, and other parts are easily accessible or viewable from the front page, and it is easy to get back to the front page. AbDb, on the other hand, has a deep history of tools and is rich with data, but one quickly gets lost navigating the various links.
A feature I really like in SAbDab is the list of therapeutic antibodies because of its obvious link to biotechnology. This table lists 151 structures (Jan 24, 2019) for some smaller number of antibodies. Unfortunately the table does not contain any statistics, just two columns: antibody name and structure(s). In many cases there is more than one structure per antibody, and the way the data are presented makes counting the antibodies too cumbersome, so we'll just say fewer than 151; counting's hard.
Speaking of databases not in the NAR database list, we know that the NAR database list is built from user contributions. While useful, it suffers from a lack of both curation and contribution. Curation, we know, is time consuming and financially difficult to support. Thus, entries age and their value diminishes over time. Using the Immunologic Databases as an example, we learned that only ~25%-30% of the current entries are likely active databases. The NAR volunteer team runs link checkers, and that helps clean out the obvious dead links, but this approach cannot make deeper assessments.
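To make the limits of link checking concrete, here is a minimal sketch of what such a checker might look like, using only Python's standard library. This is my own illustration, not the NAR team's actual tooling; the `check_link` function and its `opener` parameter are hypothetical names I made up so the check can be exercised without touching the network.

```python
import urllib.error
import urllib.request


def check_link(url, opener=urllib.request.urlopen, timeout=10):
    """Return 'ok' if the URL answers with a 2xx status, otherwise 'dead'.

    `opener` is a hypothetical hook for this sketch: any callable that
    accepts (url, timeout=...) and returns a context manager exposing a
    `.status` attribute, as urllib.request.urlopen does.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return "ok" if 200 <= response.status < 300 else "dead"
    except (urllib.error.URLError, OSError, ValueError):
        # HTTP errors (404, 500, ...), DNS failures, timeouts, and
        # malformed URLs all count as a dead link here.
        return "dead"
```

Calling `check_link` on each URL in the list would flag pages that no longer respond. It would still report "ok" for a URL that points to a working page where the database itself no longer exists, or where the data have not been updated in years, which is exactly the kind of deeper assessment a link checker cannot make.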
A larger challenge is that researchers have other options for publishing their database work. The journal Database, an Oxford Academic journal like NAR, is one alternative. From a quick inspection of articles in Database, it's clear that many of the databases published there do not make it into the NAR listing, even though these journals are produced by the same publisher. And there are other venues for dissemination. In my previous research to find datasets for the Immuno-bioinformatics class, I identified BioGPS, JingleBells, and ImmPort as potential resources. None of these are in the NAR list. Counting databases is hard.
To close, databases are cool and important because they form the foundation from which new knowledge is derived from data and information. Anyone seeking to develop large-scale data analysis methods and insights from data will create a database of some kind. As the NAR list demonstrates, databases can be transitory or long-lasting depending on their scope and utility for a community. While the NAR list is a galaxy in a universe, it is a reasonable sampling and representation of bio database diversity and changes over time.