Bgee: Gene Expression Evolution

Bgee general documentation - Table of contents

General information

Note: you can retrieve data sources used in Bgee with details on the release used on the data sources page

Which expression data?

Bgee currently includes:
  • RNA-Seq samples from [GEO], runs retrieved from [SRA]. Each sample is manually annotated by Bgee curators to anatomical and developmental ontologies. Statistical tests are then performed to define expression calls, no-expression calls, and to assign a level of confidence to the data.
  • Affymetrix data from [GEO] and [ArrayExpress]. Each chip is manually annotated by Bgee curators to map it onto anatomical and developmental ontologies. Quality controls are performed to remove low-quality chips, duplicated chips, and chips of an incompatible chip type. Statistical tests are then performed to define expression calls, no-expression calls, differential expression calls, and to assign a level of confidence to the data.
  • In situ hybridization data from [ZFIN] for zebrafish, [Xenbase] for Xenopus, [MGI] for the mouse, and [BDGP] for Drosophila. These data are already mapped to ontologies by the source databases, and provided with quality information. They are used to generate expression calls and no-expression calls.
  • EST data from [UniGene]. Each EST library is manually annotated by Bgee curators to map it onto anatomical and developmental ontologies. Statistical tests are performed to define expression calls and to assign a level of confidence.

Which anatomical ontologies?

Bgee uses existing, well-established anatomical ontologies to describe the anatomy of each species in a computer-understandable way. Five species are currently integrated into Bgee:
  • Danio rerio: anatomical ontology [ZFA]
  • Homo sapiens: anatomical ontology [EHDAA] for embryo, and anatomical ontology [EV] for adult
  • Mus musculus: anatomical ontology [EMAPA] for embryo, and anatomical ontology [MA] for adult
  • Xenopus tropicalis: anatomical ontology [XAO]
  • Drosophila melanogaster: anatomical ontology [FBbt]

Which developmental stage ontologies?

Developmental stages are usually used simply as a list of terms (a controlled vocabulary). However, to manage the variability in annotations of data from different sources, developmental ontologies are required to describe the development of each species in a computer-understandable way.

We use the existing developmental stages listed in the anatomical ontologies (see above), and organize them around key events of development to generate ontologies. They are available in OBO format or through ontology browsing.

Which homology links?

The homology links are defined by curation, using internally developed software: Homolonto.

Homolonto uses an ontology alignment approach to propose homology relationships between anatomical ontologies, which are then manually validated by an expert. Once relationships are generated through this first step, they are manually reviewed and extended by Bgee curators.

The Homolonto software and source code are available in the download section.

Which relationships between developmental stages?

Although there is no direct equivalence between the developmental stages of two species because of heterochrony [heterochrony in Wikipedia], it is still possible to identify key events of development, common to all bilaterian animals.

We have developed a small ontology of "metastages", representing these key events. We then map the developmental stages of each species to these common "metastages".

The metastages ontology, and the mapping of the developmental stages of each species, are available in the download section.

Which genomes?

Bgee is currently based on genomes from [Ensembl]. See the data sources page for information on the release used.

Which gene families?

Bgee uses several prediction methods to define gene families:
  • Large families: For protein-coding genes, Bgee recovers the families as defined in Ensembl ("Protein families"). These family predictions are based on the Tribe MCL clustering method, and include all protein isoforms of every coding gene that Ensembl predicts, as well as all fungi/metazoa proteins present in Uniprot/SWISSPROT and Uniprot/SPTREMBL.
    For miRNA families, Bgee recovers the families as defined in miRBase. These families are taken from Rfam.
    Note that miRNAs are included only in this type of gene family.
  • Orthologs Vertebrates: Bgee reports groups of orthologs with a common ancestor gene in vertebrates or any sub-taxa (more precisely Euteleostomi or any sub-taxa), based on the Ensembl gene trees. These are based on TreeBeST, which aims to represent the evolutionary history of gene families, i.e. genes that diverged from duplication or speciation events [Gene Orthology/Paralogy prediction method].
  • Orthologs Animals: Bgee reports groups of orthologs with a common ancestor gene in animals (more precisely Coelomata), based on the Ensembl gene trees. These are based on TreeBeST, which aims to represent the evolutionary history of gene families, i.e. genes that diverged from duplication or speciation events [Gene Orthology/Paralogy prediction method].

Data analysis

Bgee provides "expression" data, "differential expression" data, and "no-expression" data, generated from the four data types listed under "Which expression data?" above: RNA-Seq, Affymetrix, in situ hybridization, and EST data.

RNA-Seq data analyses

RNA-Seq data preprocessing

The raw data in .sra format are downloaded from the Sequence Read Archive (SRA) database. The extracted reads, in fastq format, are mapped to regions of the reference genome, specified in a .gtf file: i) transcribed regions; ii) selected intergenic regions (see below); iii) exon junction regions.

The reads are mapped using TopHat2, which internally uses the Bowtie2 aligner. The maximum number of mappings allowed for a read is set to 1. The intergenic regions are chosen so that the distribution of their lengths matches the distribution of lengths of the transcriptome. The minimal distance between the boundaries of an intergenic region and the nearest gene is 5 kb. Reads that map to the features are summed up using the htseq-count software. The RPK (reads per kilobase) value of each feature is obtained by dividing the number of reads that map to the feature by its length in kilobases.

The present/absent calls

Our approach to define present/absent calls is based on Hebenstreit et al., Mol. Syst. Biol., 2011. The principle is to be able to distinguish biologically relevant signal of expression from experimental noise or background activity of the transcription machinery. Please note that this procedure is still experimental, and might be subject to change in the next releases.

For each RNA-Seq library independently, we define an RPK cutoff, k, for determining “present/absent” calls. It is set to the minimal value for which the ratio of the relative abundance of intergenic regions to that of genes, among features with RPK values above k, is equal to or lower than α (in Bgee, α = 0.05).

In other words, an RPK threshold is defined for each sample independently, such that a randomly chosen feature, from the set of genes and intergenic regions, with an RPK value above the threshold, has at least a 95% probability of being a gene.

For every feature $f \in F$:

$$RPB_f \geq k_\alpha \implies P(f \in G) \geq 1 - \alpha$$

Where

  • $RPB_f = j_f / l_f$: Reads Per Base for the feature $f$
  • $j_f$: number of reads mapped to the feature $f$
  • $l_f$: length of the feature $f$
  • $k_\alpha$: cutoff value
  • $\alpha$: significance level
  • $G$: set of genes
  • $I$: set of intergenic regions
  • $F = G \cup I$: set of the features that belong to the gene set or to the intergenic regions

Fig. 1. Example density distribution of log2 (RPK + 1e-08) values for different categories of features; the vertical dashed line marks the cutoff value for sample GSM752614.

Fig. 2. Distribution of log2 (RPK + 1e-08) values for different feature categories; the dashed line marks the cutoff.

Cutoff determination

1) For every value of $x$, define the ratio $r$ (Fig. 3):

$$r = \frac{n_{ix} \cdot N_g}{n_{gx} \cdot N_i}$$

where:

  • $n_{ix}$: number of intergenic regions with RPK values higher than $x$
  • $n_{gx}$: number of genes with RPK values higher than $x$
  • $N_i$: number of all intergenic regions
  • $N_g$: number of all genes

2) The cutoff value $k$ is the minimal value of $x$ for which $r$ is equal to or lower than $\alpha$.

Fig. 3. Ratio of relative abundance of intergenic and genes log2 (RPK + 1e-08) values
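
The cutoff determination can be sketched in a few lines of Python. This is only an illustration of the two steps described above, not the Bgee pipeline code: the function names `rpk` and `rpk_cutoff` are hypothetical, and scanning the observed RPK values as candidate cutoffs is an assumption.

```python
def rpk(read_count, length_bp):
    """Reads per kilobase: reads mapped to a feature divided by its length in kb."""
    return read_count / (length_bp / 1000.0)


def rpk_cutoff(gene_rpk, intergenic_rpk, alpha=0.05):
    """Return the smallest candidate x such that
    r = (n_ix * N_g) / (n_gx * N_i) <= alpha, or None if no candidate qualifies.
    """
    n_genes, n_inter = len(gene_rpk), len(intergenic_rpk)
    # Assumption: candidate cutoffs are the observed RPK values themselves.
    for x in sorted(set(gene_rpk) | set(intergenic_rpk)):
        n_gx = sum(1 for v in gene_rpk if v > x)        # genes above x
        n_ix = sum(1 for v in intergenic_rpk if v > x)  # intergenic regions above x
        if n_gx == 0:
            break  # no gene left above x: the ratio is undefined
        r = (n_ix * n_genes) / (n_gx * n_inter)
        if r <= alpha:
            return x
    return None


# Toy library: 5 genes and 4 intergenic regions (made-up values).
genes = [0.5, 2.0, 5.0, 10.0, 20.0]
intergenic = [0.1, 0.2, 0.4, 1.0]
cutoff = rpk_cutoff(genes, intergenic)
```

With these toy values, the first candidate for which no intergenic region remains above the cutoff is 1.0, so `cutoff` is 1.0; genes with an RPK above it would be called "present".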

Affymetrix data quality controls and filtering

Affymetrix chips are filtered before inclusion into Bgee, based on quality controls, on the identification of duplicated content, and on the control of chip types for incompatibility or errors:

Quality Controls

In order to select the best methodology for Affymetrix quality control, we have tested several methods intended for quality assessment of a single experiment: scaling factor, RNA degradation slope, MAS5 percent present, 5'/3' expression ratio of housekeeping genes, relative log expression (RLE) and normalized unscaled standard error (NUSE), global normalized unscaled standard error (GNUSE), and finally, Inter Quartile Range of average rank (arIQR), a new method that we have developed (see below).

As an independent measure of quality, we have computed the correlation of the gene expression profile of each array with a reference expression profile of homologous genes in the same organ, in a different species. Samples showing a lower correlation with the reference were correctly identified as being of poor quality by some of the quality metrics. The best methods for this task, arIQR and MAS5 percent present, were implemented in Bgee.

To filter Affymetrix chips, we use the new quality measurement that we have developed, along with MAS5 percent present:

  • Inter Quartile Range of average rank (arIQR): this statistic is obtained by ranking all the probe intensities from a given array, and by computing the average rank for each probeset. The Inter Quartile Range of the probeset average ranks serves as the quality score for the given array.
  • percent present score (MAS5 PP): percentage of probesets identified as "present" by MAS5.
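
As a rough illustration of the arIQR statistic, here is a minimal Python sketch. The function name `ar_iqr`, the input layout (a list of probeset/intensity pairs), the tie-breaking by input order, and the quartile interpolation method are all assumptions; the actual Bgee implementation may differ on each of these points.

```python
from statistics import mean, quantiles


def ar_iqr(probes):
    """Inter Quartile Range of probeset average ranks for one array.

    probes: list of (probeset_id, intensity) pairs, one per probe.
    Requires at least two probesets.
    """
    # Rank all probe intensities on the array (1 = lowest intensity).
    # Assumption: ties are broken by input order (sorted() is stable).
    order = sorted(range(len(probes)), key=lambda i: probes[i][1])
    ranks_by_probeset = {}
    for rank, i in enumerate(order, start=1):
        ranks_by_probeset.setdefault(probes[i][0], []).append(rank)
    # Average rank of the probes of each probeset.
    avg_ranks = [mean(ranks) for ranks in ranks_by_probeset.values()]
    # Assumption: quartiles computed with the "inclusive" interpolation method.
    q1, _, q3 = quantiles(avg_ranks, n=4, method="inclusive")
    return q3 - q1


toy_array = [("a", 1.0), ("a", 2.0), ("b", 3.0), ("b", 4.0), ("c", 5.0), ("c", 6.0)]
score = ar_iqr(toy_array)
```

In this toy array the probeset average ranks are 1.5, 3.5, and 5.5, which gives an arIQR of 2.0 under the stated quartile method.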

In order to determine the cutoff for quality scores, we have computed, for each chip type independently, the distribution of arIQR and MAS5 PP scores of microarrays in the [GEO DataSets]. The values of the quality metrics that separate the lowest 5% of arrays from the other chips of the same type were taken as cutoffs. This results in independent cutoffs for arIQR and MAS5 PP scores, for each chip type. If not enough data were available for a chip type, no quality score thresholds were computed.

The quality scores thresholds for each chip type used in Bgee are available here, and from the database (see Chip types table).

Filtering of the chips:

  • when raw CEL files are available: the quality is controlled on the basis of arIQR and MAS5 PP scores. A chip is removed if one of these scores is below the threshold defined for this chip type and this quality score. A chip is kept if both scores are above the defined thresholds, or if no thresholds could be computed for this chip type.
  • when only MAS5 files are available: only the MAS5 PP score is used. A chip is kept if the MAS5 PP score is above the threshold defined for this chip type, or if no threshold could be defined for this chip type.
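
The threshold computation and the filtering rules can be sketched as follows. The helper names (`low_quality_cutoff`, `keep_chip`), the dictionary keys, and the percentile interpolation method are hypothetical; keeping a chip whose score is exactly equal to the threshold is also an assumption, since the text does not specify that case.

```python
from statistics import quantiles


def low_quality_cutoff(scores_for_chip_type):
    """Score separating the lowest 5% of arrays of one chip type from the rest.

    Assumption: the 5th percentile with "inclusive" interpolation.
    """
    return quantiles(scores_for_chip_type, n=20, method="inclusive")[0]


def keep_chip(scores, thresholds):
    """Decide whether a chip passes the quality control.

    scores: dict with "mas5_pp" and, when a raw CEL file is available, "ariqr".
    thresholds: dict with the same keys for this chip type, or None when no
    thresholds could be computed for the chip type.
    """
    if thresholds is None:
        return True  # no thresholds for this chip type: the chip is kept
    # The chip is removed as soon as one available score is below its threshold.
    return all(scores[m] >= thresholds[m] for m in scores if m in thresholds)


thresholds = {"ariqr": 0.5, "mas5_pp": 30.0}       # made-up values
cel_chip = {"ariqr": 0.3, "mas5_pp": 45.0}         # raw CEL file available
mas5_only_chip = {"mas5_pp": 45.0}                 # only MAS5 file available
decisions = (keep_chip(cel_chip, thresholds), keep_chip(mas5_only_chip, thresholds))
```

Here the first chip would be removed because its arIQR is below the threshold, while the second is kept on the basis of its MAS5 PP score alone.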

Removal of duplicated content

We have identified duplicated content in the source databases: fully or partially duplicated experiments from independent data submissions, duplicated chips used in several experiments, duplicated chips inside experiments. We have implemented a procedure to identify and remove such duplicates:

  • when raw CEL files are available, for each CEL file: i) the SHA512 checksum is computed; ii) the scan date is retrieved from the file; iii) a unique string is generated on the basis of the arIQR score and the percent present. If any of these elements is identical between several chips, they are likely to be duplicated chips (although scan dates are often not unique, while checksums are almost irrefutable).
  • when only MAS5 files are available: i) the original files are filtered to obtain comparable files (probeset IDs and columns are ordered, headers and non standard lines are removed). SHA512 checksums are then computed on these filtered files; ii) unique strings are generated on the basis of number of "present" calls, "absent" calls, "marginal" calls, "undefined" calls. If any of these elements are identical between several chips, they are likely to be duplicated chips.
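
A minimal sketch of the duplicate search for CEL files, using Python's standard `hashlib` module for the SHA512 checksum. The function name `candidate_duplicates` and the input layout are hypothetical, and the way the score string is built is an assumption. The output is only a list of candidates for manual confirmation.

```python
import hashlib
from collections import defaultdict


def sha512_of(content: bytes) -> str:
    """SHA512 checksum of a file's content, as a hexadecimal string."""
    return hashlib.sha512(content).hexdigest()


def candidate_duplicates(chips):
    """Group chips that share a checksum, a scan date, or a score string.

    chips: dict mapping a chip ID to a dict with the keys "content" (bytes),
    "scan_date", "ariqr", and "percent_present" (hypothetical layout).
    Returns a sorted list of (criterion, [chip IDs]) for groups of 2+ chips.
    """
    groups = defaultdict(list)
    for chip_id, chip in chips.items():
        groups[("checksum", sha512_of(chip["content"]))].append(chip_id)
        groups[("scan_date", chip["scan_date"])].append(chip_id)
        # Assumption: the "unique string" simply concatenates both scores.
        score_string = f"{chip['ariqr']}_{chip['percent_present']}"
        groups[("scores", score_string)].append(chip_id)
    return sorted(
        (criterion, sorted(ids))
        for (criterion, _), ids in groups.items()
        if len(ids) > 1
    )


chips = {
    "chip_a": {"content": b"X", "scan_date": "2010-01-01", "ariqr": 1.0, "percent_present": 40.0},
    "chip_b": {"content": b"X", "scan_date": "2010-02-01", "ariqr": 1.0, "percent_present": 40.0},
    "chip_c": {"content": b"Y", "scan_date": "2010-01-01", "ariqr": 2.0, "percent_present": 50.0},
}
candidates = candidate_duplicates(chips)
```

In this toy example, `chip_a` and `chip_b` are flagged by both the checksum and the score string, whereas `chip_a` and `chip_c` only share a scan date, which is much weaker evidence.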

When files have been manually confirmed to be duplicates, we keep only one of them. When a chip is duplicated between different experiments, we keep the one that is part of the experiment with the highest number of chips.

Control of the chip type

We have identified chip types incompatible with the analyses used in Bgee, and have removed them. The list of incompatible chip types is available here, and from the database (see Chip types table). We have also identified CEL files for which the chip type provided in the experiment description is wrong. We correct this problem by using the CDF name present in the CEL files.

Generation of expression calls

For each data type, Bgee applies a dedicated analysis to generate expression calls and to assign a level of confidence to the data: low or high.

  • RNA-Seq data: summary of our method (detailed information available in our documentation).
    • Mapping of the reads of a library to transcribed regions, intergenic regions, and intronic regions.
    • Computation of RPK (Reads Per Kilobase) values for each region.
    • Definition of a RPK threshold, for each library independently, to distinguish between background noise and biologically relevant signal of expression, based on the relative ratio of intergenic and transcribed regions above the threshold.
    • Genes with an RPK value above the threshold are considered as expressed with high quality; genes with an RPK value below the threshold are considered as not expressed with high quality.
    • Removal of results for all genes never seen as expressed over the whole RNA-Seq dataset.
    • If the same gene, at the same developmental stage, in the same anatomical structure, is reported several times (different libraries), and the results are inconsistent: If the inconsistency is "expression" vs "no-expression", data quality of the overall expression summary is set to "expression low quality".
  • Affymetrix data:
    • if raw data are available: (i) normalization of the signal of the probesets by the gcRMA algorithm, (ii) Wilcoxon test on the signal of the probesets against a subset of weakly expressed probesets (see Schuster et al., Genome Biology (2007)). Individual probeset quality set to low if the p-value is between 1 and 5%, set to high quality if the p-value is lower than 1%.
    • Else (only normalized data are available): use of the present/absent calls provided by the MAS5 software (see Liu et al., Bioinformatics (2002)). Individual probeset quality set to low if the probeset is flagged as "marginal" or "present".
    • Removal of all probesets never seen as "expressed high quality" over all the experiments where raw data are available, or never seen as "present" over all the experiments where only normalized data are available.
    • If the same gene, at the same developmental stage, in the same anatomical structure, is reported several times (different chips or probesets), and the results between different individual probesets are inconsistent: if the inconsistency is "expression low quality" vs "no-expression", the overall probesets summary is removed. If the inconsistency is "expression high quality" vs "no-expression", data quality of the overall probesets summary is set to "expression low quality". If the inconsistency is "expression high quality" vs "expression low quality", data quality of the overall probesets summary is set to "expression high quality"
  • In situ hybridization data:
    • for data from [ZFIN], Bgee uses their "stars rating" when provided (3 to 5 stars, or no stars rating provided, considered as high quality in Bgee, 1 to 2 stars considered as low quality in Bgee).
    • For data from [MGI], Bgee uses the quality information provided (signal "present", "moderate", "strong", or "very strong" in MGI considered as high quality in Bgee. Signal "ambiguous", "trace", or "weak" considered as low quality in Bgee).
    • if the same gene, at the same developmental stage, in the same anatomical structure, is reported several times (e.g. different experiments), and results are inconsistent: if the inconsistency is "expression low quality" vs "no-expression", the overall summary is removed. If the inconsistency is "expression high quality" vs "no-expression", data quality of the overall summary is set to "expression low quality". If the inconsistency is "expression high quality" vs "expression low quality", data quality of the overall summary is set to "expression high quality"
  • EST data: expression low quality if, in a EST library, 1 to 6 ESTs are mapped to a transcript. High quality if, in a EST library, at least 7 ESTs are mapped to a transcript (see Audic and Claverie, Genome Research (1997)).
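
The reconciliation rules for Affymetrix and in situ hybridization data, and the EST thresholds, can be summarized by the following Python sketch. The function names are hypothetical. The order in which the rules are applied when "expression high quality", "expression low quality", and "no-expression" are all reported for the same gene and condition is an assumption, as the text only describes pairwise conflicts.

```python
HIGH = "expression high quality"
LOW = "expression low quality"
ABSENT = "no-expression"


def summarize_calls(calls):
    """Overall summary for one gene, anatomical structure, and stage.

    calls: individual calls (HIGH, LOW, or ABSENT) from different chips,
    probesets, or experiments. Returns the summary call, or None when the
    summary is removed.
    """
    seen = set(calls)
    if HIGH in seen and ABSENT in seen:
        return LOW     # high quality vs no-expression: downgraded to low
    if LOW in seen and ABSENT in seen:
        return None    # low quality vs no-expression: summary removed
    if HIGH in seen:
        return HIGH    # high vs low quality: high quality wins
    if LOW in seen:
        return LOW
    return ABSENT if ABSENT in seen else None


def est_quality(n_ests):
    """Call for a transcript in one EST library, from the number of mapped ESTs."""
    if n_ests >= 7:
        return HIGH
    if n_ests >= 1:
        return LOW
    return None  # no EST mapped: no call is produced from EST data


summary = summarize_calls([HIGH, LOW])
```

For RNA-Seq data the text describes a single rule ("expression" vs "no-expression" gives "expression low quality"), which corresponds to the first branch of `summarize_calls`.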

Generation of differential expression calls

For differential expression analyses, Bgee uses Affymetrix experiments studying at least 3 conditions (anatomical structures/developmental stages), with at least 2 replicates for each. An ANOVA is used to determine whether the gene shows a significant variation of its level of expression over the conditions. If it does, a multiple comparison to the mean is performed. If the adjusted p-value is between 1% and 5%, the differential-expression information is considered as low quality. If the adjusted p-value is below 1%, the differential-expression information is considered as high quality. Only genes shown to be expressed in this organ/stage are considered for this analysis (i.e., genes shown to have an absence of expression in this organ/stage are excluded).

If, in the same organ/stage, there is a conflict between several results for a gene (different probesets or chips identified the gene as being both over-expressed and under-expressed), the result with the lowest p-value is kept, and the overall quality of this result is set to low.
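
The quality assignment and the conflict rule can be sketched as below. The function names and the direction labels are hypothetical. Two points are assumptions not stated in the text: an adjusted p-value of exactly 1% is treated as low quality, and when several results agree on the direction, the one with the lowest p-value is reported with its own quality.

```python
def de_quality(adjusted_p):
    """Quality of a differential-expression result, or None if not significant."""
    if adjusted_p < 0.01:
        return "high quality"
    if adjusted_p <= 0.05:
        return "low quality"
    return None


def resolve_de_results(results):
    """Summarize several results for one gene in one organ/stage.

    results: non-empty list of (direction, adjusted_p) pairs, where direction
    is "over-expressed" or "under-expressed".
    Returns (direction, quality) for the retained result.
    """
    # The result with the lowest p-value is the one that is kept.
    direction, adjusted_p = min(results, key=lambda r: r[1])
    directions = {d for d, _ in results}
    if {"over-expressed", "under-expressed"} <= directions:
        return direction, "low quality"  # conflicting directions
    return direction, de_quality(adjusted_p)


conflict = resolve_de_results([("over-expressed", 0.001), ("under-expressed", 0.04)])
```

In the example, the over-expression result has the lowest p-value and is retained, but its quality is set to low because another probeset or chip reported under-expression.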

In some cases, some annotations can be too granular for a proper use in differential analyses. For instance, using our human developmental ontology, data annotated at the age of "23 year-old" would not be considered to be the same condition as "24 year-old". To solve this problem, we defined some developmental stages as being "too granular". All data mapped to such stages are transferred to their closest parent stage that is not too granular, for the differential analysis. In the example above, data annotated at "23 year-old" and "24 year-old" would be considered as being the same "young adult stage" condition. The list of "too granular" stages can be retrieved from the database (See developmental stages table description).
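
The transfer of data from "too granular" stages to their closest usable parent can be pictured as a simple walk up the stage ontology. This sketch assumes that each stage has a single parent, stored in a hypothetical `parent_of` mapping; the stage names are taken from the example above, and "adult" is a made-up parent used only for illustration.

```python
def closest_usable_stage(stage, parent_of, too_granular):
    """Return the closest ancestor of `stage` (or `stage` itself) that is not
    flagged as too granular.

    parent_of: dict mapping each stage to its parent stage (assumption: one
    parent per stage). too_granular: set of stages flagged as too granular.
    """
    while stage in too_granular:
        stage = parent_of[stage]
    return stage


parent_of = {
    "23 year-old": "young adult stage",
    "24 year-old": "young adult stage",
    "young adult stage": "adult",
}
too_granular = {"23 year-old", "24 year-old"}
condition_a = closest_usable_stage("23 year-old", parent_of, too_granular)
condition_b = closest_usable_stage("24 year-old", parent_of, too_granular)
```

Both samples end up in the same "young adult stage" condition, so they can be treated as replicates of one condition in the differential analysis.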

Generation of absent expression calls

RNA-Seq, Affymetrix, and in situ hybridization data are used to determine whether a gene is NOT expressed (gene expression not detected).
  • RNA-Seq data: summary of our method (detailed information available in our documentation).
    • Mapping of the reads of a library to transcribed regions, intergenic regions, and intronic regions.
    • Computation of RPK (Reads Per Kilobase) values for each region.
    • Definition of a RPK threshold, for each library independently, to distinguish between background noise and biologically relevant signal of expression, based on the relative ratio of intergenic and transcribed regions above the threshold.
    • Genes with an RPK value below the threshold are considered as not expressed with high quality.
    • Removal of results for all genes never seen as expressed over the whole RNA-Seq dataset.
    • If any other result is inconsistent (another library showed expression of this gene in the same anatomical structure at the same developmental stage), gene is considered as expressed (with a low quality). If any other RNA-Seq result found expression of the same gene in a substructure or a child stage, the no-expression result is removed.
  • Affymetrix data: analyses performed only when raw data are available: (i) normalization of the signal of the probesets by the gcRMA algorithm, (ii) Wilcoxon test on the signal of the probesets against a subset of weakly expressed probesets (see Schuster et al., Genome Biology (2007)). If the signal of a probeset is not significantly different from the background signal, gene considered as not expressed. Only high quality data are used. If any other result is inconsistent (another experiment/chip has detected expression of this gene in the same anatomical structure at the same developmental stage), gene is considered as expressed (with a low quality). If any other result found expression of the same gene in a substructure or a child stage, the no-expression result is removed.
  • In situ hybridization data: if the staining of a hybridization is not detected, gene considered as not expressed. Only high quality data are used. If any other result is inconsistent (another experiment has detected expression of this gene in the same anatomical structure at the same developmental stage), gene is considered as expressed (with a low quality). If any other result found expression of the same gene in a substructure or a child stage, the no-expression result is removed.
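
The two consistency rules shared by all three data types can be sketched as follows. The function and parameter names are hypothetical. The sketch assumes that the `substructures_of` and `child_stages_of` mappings already contain all descendants, and that expression in a substructure at a child stage also invalidates the no-expression result.

```python
def resolve_no_expression(structure, stage, expressed_in,
                          substructures_of, child_stages_of):
    """Outcome of a no-expression result for one gene.

    expressed_in: set of (structure, stage) pairs in which other results
    detected expression of the same gene.
    Returns "expression low quality", "no-expression", or None (removed).
    """
    # Direct conflict in the same structure at the same stage.
    if (structure, stage) in expressed_in:
        return "expression low quality"
    # Expression detected in a substructure or at a child stage.
    structures = {structure} | substructures_of.get(structure, set())
    stages = {stage} | child_stages_of.get(stage, set())
    if any((s, t) in expressed_in for s in structures for t in stages):
        return None
    return "no-expression"


substructures_of = {"brain": {"forebrain", "midbrain"}}
child_stages_of = {"adult": {"young adult"}}
outcome = resolve_no_expression(
    "brain", "adult", {("forebrain", "adult")}, substructures_of, child_stages_of
)
```

Here the no-expression result for the brain is removed, because expression of the gene was detected in the forebrain, one of its substructures.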

Homology relationships between anatomical ontologies

Homology relationships between anatomical structures of different species-specific ontologies are represented as groups of homologous anatomical structures (Homologous Organs Groups, HOGs). The creation and usage rules of these HOGs are described here. More details are available in: [Parmentier et al., Bioinformatics, 2010]. Please note that several changes have been made since the publication of this paper. The description on this website is the most up-to-date.

The HOGs are available in the download section as an OBO file with an association file, or as TSV data files. You can visualize a CARO-compliant version of this ontology for vertebrates only on the NCBO BioPortal: [the vHOG ontology].

HOGs creation

The homology relationships are defined by curation, using internally developed software: Homolonto (available in the download section). Homolonto uses an ontology alignment approach to propose homology relationships between anatomical ontologies, which are then manually validated by an expert. Once relationships are generated through this first step, they are manually reviewed and extended by Bgee curators.

During the revision process, a level of confidence is provided for each mapping: "obvious", "well-established" (a reference is provided), "debated" (a consensus has been chosen), "uncertain" (a reference may be provided).

The anatomical ontologies used to define the HOGs are:

  • Danio rerio: anatomical ontology [ZFA]
  • Homo sapiens: anatomical ontology [EHDAA] for embryo, and anatomical ontology [EV] for adult
  • Mus musculus: anatomical ontology [EMAPA] for embryo, and anatomical ontology [MA] for adult
  • Xenopus tropicalis: anatomical ontology [XAO]
  • Drosophila melanogaster: anatomical ontology [FBbt]

HOGs composition

If several anatomical structures from the same species belong to the same HOG, they define a species "unit". If two units from different species belong to the same HOG, then they are homologous.

For instance, here is the composition of the HOG "brain" (HOG:0000157) for the human:

  • EHDAA:300: future brain - exists from CarnegieStage09 to CarnegieStage09.
  • EHDAA:830: future brain - exists from CarnegieStage10 to CarnegieStage12.
  • EHDAA:2629: brain - exists from CarnegieStage13 to CarnegieStage20.
  • EV:0100164: brain - exists from Adult to Adult.

All these anatomical structures refer to the same concept of "brain"; they constitute the "human brain unit". Units for the other species are defined in the same way. These units are homologous.

As you can see in the example above, these "units" are mainly required because of the structure of the ontologies. They merge anatomical structures linked most of the time by develops_from or same_as relationships. But beyond some point, these relationships must not be followed. For instance, in the zebrafish anatomical ontology, the structure ZFA:0000008: brain develops_from the structure ZFA:0001135: neural tube. Obviously, the structure "neural tube" does not refer to the concept of brain, and should not be merged in the corresponding HOG.
The general rule applied by Bgee curators is: the "presumptive", "primitive", "future", "primordium", "degenerating" (etc.) structures are linked to the same HOG as their corresponding fully formed structure.

"Units" are also required to manage differences in the design of the ontologies. For instance, organs can be described by only one structure, or only by their right and left parts, or only by their substructures, or split into different structures depending on their location in the organism, and so on. The general rule applied by Bgee curators can be described by this figure:
Figure: homology relationships general rule schema
Legend: The circles correspond to structures in an ontology, the lines to their relationships. Black: structure exists - White: structure does not exist - Red arrows: mapping to HOG

  • Case A: in the first species ontology, on the left, an organ is described by a general structure and also by more precise substructures. In the second species ontology, on the right, the homologous organ is only described by precise structures, corresponding to the substructures of the first ontology. HOGs are created for each of the precise structures of both ontologies.
  • Case B: in the first species ontology, on the left, an organ is described only by a general structure. In the second species ontology, on the right, the homologous organ is only described by precise structures. A HOG is created for the general structure of the first ontology. The precise structures of the second ontology are all mapped to this general HOG, generating a species "unit" inside the HOG.

Relationships part_of and is_a amongst HOGs

The relationships between HOGs are generated through several steps. Several changes have been made since the publication of [Parmentier et al., Bioinformatics, 2010]. The description on this website is the most up-to-date.

  1. Initial Step: all possible paths between HOGs are retrieved. For instance, if an anatomical structure 'a', mapped to the HOG 'A', has a part_of relationship to the anatomical structure 'b', mapped to the HOG 'B', then a putative part_of relationship is defined between HOGs 'A' and 'B'. Relationships between HOGs are often indirect (e.g. structure 'a', mapped to HOG 'A', part_of structure 'c', part_of structure 'b', mapped to HOG 'B'). In such cases, based on [OBO foundry relation composition rules], part_of relations always "win". So, if a part_of relationship is seen along the path between two HOGs, then the putative relationship is part_of.
  2. Skipping relations from non-trusted ontologies: some ontologies do not follow OBO principles, and implement for instance only one type of relation amongst all concepts. All the putative relations inferred by these ontologies during initial step are then set as the [SKOS] type broader_than. But the final relation type between these HOGs can still be inferred thanks to other ontologies. For the current release of the HOG ontology, the non-trusted ontologies are: EV, EHDAA, EMAPA.
  3. Skipping relations defined by too few ontologies: if the proportion of ontologies defining a relation, compared to the total number of ontologies involved in the creation of the HOGs, is below a defined threshold ('ontology coverage'), then the relation is set to the [SKOS] type broader_than, and the algorithm stops examining relations between these HOGs. Note that if the relation is set to broader_than during this step, it is probably the existence of the relation itself that is uncertain, rather than its type. For the current release of the HOG ontology, the ontology coverage threshold is set to 1.
  4. Defining within-ontology agreement: several anatomical structures from the same ontology can belong to the same HOG. This can generate a within-ontology conflict when defining a relation type. For instance, structures 'a' and 'b' define a putative part_of relationship between HOGs 'A' and 'B', while structures 'a′' and 'b′', belonging to the same ontology, define a putative is_a relationship between these HOGs. The algorithm then calculates, for each relation type, the proportion of paths defining this relation type among all the paths between these two HOGs for this ontology. If, for a type, this proportion exceeds a defined threshold ('within-ontology agreement'), then this relation type is attributed for this ontology between these HOGs. Otherwise, the relation is set to the type broader_than for this ontology. For the current release of the HOG ontology, the within-ontology agreement threshold is set to 1.
  5. Defining inter-ontology agreement: different ontologies can define different relation types between two related HOGs. This conflict is resolved in the same way as at the previous step, by using a defined threshold ('inter-ontology agreement'). Note that broader_than relations are not taken into account at this step, e.g.: if three ontologies define a broader_than relation between two HOGs, and another ontology defines an is_a relation, the final relation will be is_a. If no agreement is found, the final relation type between these two HOGs is defined as broader_than. For the current release of the HOG ontology, the inter-ontology agreement threshold is set to 1.
  6. Removing cyclic relationships: automatically inferring the relationships between HOGs may generate cycles (e.g. HOG 'A' part_of HOG 'B' part_of HOG 'A'), whereas an ontology has to be acyclic. If such cycles are detected, the algorithm stops, and one of the involved relationships is manually removed.
  7. Removing redundancies: if several relationships are redundant, only the deepest relationship is kept. For instance, if a HOG 'A' has two substructures by a part_of relationship, 'B' and 'C', and if 'C' is also a substructure of 'B', then the direct relationship between the HOGs 'A' and 'C' is removed. Note that, according to the [OBO foundry relation composition rules], some relations that look redundant are not. For example, given 'C' is_a 'B' part_of 'A' and 'C' is_a 'A', the composition rules reduce 'C' is_a 'B' part_of 'A' to 'C' part_of 'A', so 'C' is_a 'A' is not redundant.
  8. The algorithm then searches for potential errors: HOGs with multiple or no is_a relations (according to the [FMA policy] of single is_a inheritance and completeness), HOGs with too many or no part_of relations.
  9. Curation step: curators then manually review all the broader_than relations, to attribute them to a type defined by the [OBO Relation Ontology]. Some custom relationships, not inferred by the algorithm, can also be added at this step, and some relations inferred by the algorithm can also be removed.
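
Steps 4 and 5 can be illustrated with a short Python sketch. It is not the code used by Bgee: the function names and the input layout (a mapping from an ontology name to the list of relation types found on its paths between two HOGs) are assumptions made for this example, and a proportion equal to the threshold is treated as sufficient, since the thresholds are currently set to 1.

```python
from collections import Counter

BROADER_THAN = "broader_than"


def agreed_type(relation_types, threshold=1.0):
    """Return the most frequent relation type if its share of the votes
    reaches the threshold; otherwise (or without votes) return broader_than."""
    if not relation_types:
        return BROADER_THAN
    best_type, best_count = Counter(relation_types).most_common(1)[0]
    if best_count / len(relation_types) >= threshold:
        return best_type
    return BROADER_THAN


def resolve_relation(paths_by_ontology, within=1.0, inter=1.0):
    """Combine within-ontology agreement (step 4) and
    inter-ontology agreement (step 5) for one pair of HOGs."""
    per_ontology = [
        agreed_type(types, within) for types in paths_by_ontology.values()
    ]
    # broader_than relations are not taken into account at the inter-ontology step.
    votes = [t for t in per_ontology if t != BROADER_THAN]
    return agreed_type(votes, inter)


# Hypothetical input: relation types found per ontology between two HOGs.
example = {"ZFA": ["part_of"], "MA": ["part_of", "is_a"], "EV": ["part_of"]}
final_type = resolve_relation(example)
```

With thresholds of 1, a single dissenting path within an ontology turns that ontology's vote into broader_than, which is then ignored when the ontologies are compared with each other.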

Using HOGs

When using the HOG ontology for their own analyses, users should be aware of two things:

  • When retrieving expression data for a HOG, a common mistake is to look for expression data in the organs directly mapped to the HOG only. Users should also:
    • retrieve the organs mapped to its descendant HOGs
    • retrieve expression data in the substructures of the organs mapped to the HOG and its descendant HOGs. For instance, when retrieving expression data in the HOG "brain" for human, users who consider only the expression in the organ "brain" would retrieve only experiments performed on the brain as a whole, and not experiments performed on the forebrain, the midbrain, etc.
  • When retrieving expression data in several HOGs, a common mistake is to forget that most HOGs are not independent (for instance, using expression data in both the HOG "brain" and the HOG "hindbrain" for statistical analyses would be wrong). Users should instead use independent HOGs (HOGs that are not substructures of each other).
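
Both points can be sketched in Python as follows. This is only an illustration: it assumes that the HOG relationships have been loaded as (parent HOG ID, descendant HOG ID) pairs, for instance from HOGRelationships.tsv, and the function names and HOG IDs are hypothetical.

```python
from collections import defaultdict


def descendant_hogs(relationships, hog_id):
    """Return every HOG reachable below hog_id through (parent, child) pairs."""
    children = defaultdict(set)
    for parent, child in relationships:
        children[parent].add(child)
    seen, stack = set(), [hog_id]
    while stack:
        for child in children[stack.pop()]:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen


def are_independent(relationships, hog_ids):
    """True when no selected HOG is a descendant of another selected HOG."""
    selected = set(hog_ids)
    return all(
        not (descendant_hogs(relationships, hog) & selected) for hog in selected
    )


# Hypothetical relationships, for illustration only.
relationships = [
    ("HOG:brain", "HOG:forebrain"),
    ("HOG:brain", "HOG:hindbrain"),
    ("HOG:hindbrain", "HOG:cerebellum"),
]
to_query = {"HOG:brain"} | descendant_hogs(relationships, "HOG:brain")
```

The set to_query lists the HOGs whose mapped organs should be considered when retrieving expression data for the "brain" HOG; the substructures of those organs still have to be retrieved separately from the anatomical structure relationships.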

TSV data files description

The full Bgee database is provided as TSV files or as a MySQL dump in the download section. This section describes these files and gives some basic information on their content. The data are linked between files using common IDs: ID columns with the same name refer to the same object.
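
As an illustration of how files can be linked through a shared ID column, here is a minimal Python sketch using only the standard library. It assumes that each file starts with a header row carrying the column names listed in this section, which should be checked against the downloaded files; the helper names and the inline sample values are hypothetical.

```python
import csv
import io


def read_tsv(text):
    """Parse tab-separated text with a header row into a list of dicts."""
    return list(csv.DictReader(io.StringIO(text), delimiter="\t"))


def join_on(left_rows, right_rows, key):
    """Inner join of two lists of rows on a column that has the same name in both."""
    index = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)
    return [
        {**left, **right}
        for left in left_rows
        for right in index.get(left[key], [])
    ]


# Invented sample content standing in for genes.tsv and expression.tsv.
genes = read_tsv("Gene ID\tGene name\nG1\tshh\nG2\tpax6\n")
expression = read_tsv("Expression ID\tGene ID\tStage ID\nE1\tG1\tS1\nE2\tG1\tS2\n")
joined = join_on(expression, genes, "Gene ID")
```

For real files, the same reader can be built from an open file handle instead of an in-memory string.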

Ontologies

Species

File: specie.tsv
Species ID | Species name

Homologous Organs Groups

File: homologousOrgansGroups.tsv
HOG ID | HOG name | HOG description

Description: homology relationships between anatomical structures of different species-specific ontologies are represented as groups of homologous anatomical structures (Homologous Organs Groups, HOGs). This file contains the list of HOGs. The mapping of the anatomical structures to these HOGs is present in the anatomicalStructures.tsv file. See also the Homology relationships section for more information.

HOG relationships

File: HOGRelationships.tsv
Parent HOG ID | Descent HOG ID | Relation type

Description: represents the relationships between HOGs (part_of, is_a, ...). See the Homology relationships section for more information.

Metastages

File: metastages.tsv
Metastage ID | Metastage name | Metastage description | Metastage left bound | Metastage right bound | Metastage level

Description: the metastages represent the key events of development common to all bilaterian animals. The developmental stages of each species are then mapped to these metastages. This file contains the list of metastages. The mapping of the developmental stages to these metastages is present in the stages.tsv file. The is_a relationships between metastages are represented as a Nested Set Model (see [Nested Set Model in Wikipedia]), by using the columns left bound, right bound, and level.
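
In a Nested Set Model, the descendants of a node are the nodes whose bounds lie strictly inside its own bounds. A minimal Python sketch, assuming the rows have been loaded as dictionaries with the hypothetical keys id, left and right (the IDs and bound values below are invented for illustration):

```python
def descendants(rows, node_id):
    """Return the IDs of all nodes nested inside the node with the given ID."""
    node = next(row for row in rows if row["id"] == node_id)
    return [
        row["id"]
        for row in rows
        if node["left"] < row["left"] and row["right"] < node["right"]
    ]


# Invented metastages: M1 contains M2 and M4, and M2 contains M3.
metastages = [
    {"id": "M1", "left": 1, "right": 8},
    {"id": "M2", "left": 2, "right": 5},
    {"id": "M3", "left": 3, "right": 4},
    {"id": "M4", "left": 6, "right": 7},
]
below_m1 = descendants(metastages, "M1")
```

Ancestors are obtained by reversing both comparisons.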

Metastage name synonyms

File: metastageNameSynonyms.tsv
Metastage ID | Metastage name synonym

Developmental stages

File: stages.tsv
Stage ID | Stage name | Stage description | Stage left bound | Stage right bound | Stage level | Species ID | Metastage ID | Too granular

Description: the is_a relationships between developmental stages are represented as a Nested Set Model (see [Nested Set Model in Wikipedia]), by using the columns left bound, right bound, and level. These values are of course independent for different "Species ID". The developmental stages are mapped to the metastages using the "Metastage ID" column. Note that any stages not explicitly mapped to a metastage ("Metastage ID" not defined), but with a mapped parent stage, are also mapped to this metastage.

The field "Too granular" flags stages that are too granular for differential expression analyses. For instance, using our human developmental ontology, data annotated at the age of "23 year-old" would not be considered to be the same condition as data annotated at "24 year-old". To solve this problem, we defined some developmental stages as being "too granular". For the differential analyses, all data mapped to such stages are transferred to their closest parent stage that is not too granular. In the example above, data annotated at "23 year-old" and "24 year-old" would be considered as being the same "young adult stage" condition. The parent stage that is actually used for the analysis can then be retrieved in the DEA chips groups table (see below).
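
The transfer to the closest parent stage can be sketched with the nested set bounds. This is an illustration under assumptions, not Bgee's implementation: stages are assumed to be loaded as dictionaries with hypothetical keys (id, species, left, right, too_granular), and the bound values and species ID are invented.

```python
def analysis_stage(stages, stage_id):
    """Return the stage itself if it is not too granular, otherwise its
    closest ancestor (same species) that is not too granular, or None."""
    stage = next(s for s in stages if s["id"] == stage_id)
    if not stage["too_granular"]:
        return stage_id
    candidates = [
        s for s in stages
        if s["species"] == stage["species"]
        and s["left"] < stage["left"]
        and s["right"] > stage["right"]
        and not s["too_granular"]
    ]
    if not candidates:
        return None
    # Among the ancestors, the closest one has the largest left bound.
    return max(candidates, key=lambda s: s["left"])["id"]


# Invented stages for one hypothetical species "SP1".
stages = [
    {"id": "adult", "species": "SP1", "left": 1, "right": 8, "too_granular": False},
    {"id": "young adult", "species": "SP1", "left": 2, "right": 7, "too_granular": False},
    {"id": "23 year-old", "species": "SP1", "left": 3, "right": 4, "too_granular": True},
    {"id": "24 year-old", "species": "SP1", "left": 5, "right": 6, "too_granular": True},
]
used_for_23 = analysis_stage(stages, "23 year-old")
```

Selecting on the left bound rather than on the level column avoids assuming whether levels increase or decrease with depth.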

Developmental stage name synonyms

File: stageNameSynonyms.tsv
Stage ID | Stage name synonym

Anatomical structures

File: anatomicalStructures.tsv
Anatomical structure ID | Anatomical structure name | Anatomical structure description | Start stage ID | End stage ID | HOG ID | HOG confidence | HOG reference

Description: represents the members of the anatomical ontologies. Homology relationships between anatomical structures are represented as groups of homologous structures (Homologous Organs Groups, HOGs). The anatomical structures are mapped to these HOGs using the "HOG ID" column. For each mapping, a level of confidence is provided: "obvious", "well-established" (a reference is provided), "debated" (a consensus has been chosen), "uncertain" (a reference may be provided), "Homolonto" (unreviewed automatic alignment). Note that any structures not explicitly mapped to a HOG ("HOG ID" not defined), but with a mapped parent structure, are also mapped to this HOG. See the Homology relationships section for more information.

Anatomical structure name synonyms

File: anatomicalStructureNameSynonyms.tsv
Anatomical structure ID | Anatomical structure name synonym

Anatomical structure relationships

File: anatomicalStructureRelationships.tsv
Parent anatomical structure ID | Descent anatomical structure ID | Relation type

Description: represents the relationships between anatomical structures (part_of, is_a, ...).

Genes and gene families

Gene family prediction methods

File: geneFamilyPredictionMethods.tsv
Gene family prediction method ID | Gene family prediction method

Description: Bgee uses several prediction methods to define gene families:
  • Large families: for protein-coding genes, Bgee retrieves the families as defined in Ensembl ("Protein families"). These family predictions are based on the Tribe MCL clustering method, including all protein isoforms of every coding gene that Ensembl predicts, but also all fungi/metazoa proteins present in Uniprot/SWISSPROT and Uniprot/SPTREMBL.
    For miRNA families, Bgee retrieves the families as defined in miRBase. These families are taken from Rfam.
    Note that miRNAs are assigned only to this type of gene family.
  • Orthologs Vertebrates: Bgee reports groups of orthologs with a common ancestor gene in vertebrates or any sub-taxa (more precisely Euteleostomi or any sub-taxa), based on the Ensembl gene trees. These are based on TreeBeST, which aims to represent the evolutionary history of gene families, i.e. genes that diverged from duplication or speciation events [Gene Orthology/Paralogy prediction method].
  • Orthologs Animals: Bgee reports groups of orthologs with a common ancestor gene in animals (more precisely Coelomata), based on the Ensembl gene trees. These are based on TreeBeST, which aims to represent the evolutionary history of gene families, i.e. genes that diverged from duplication or speciation events [Gene Orthology/Paralogy prediction method].

Gene families

File: geneFamilies.tsv
Gene family ID | Gene family name | Gene family description | Gene family prediction method ID

Gene types

File: geneBioTypes.txt
Gene type ID | Gene type name

Genes

File: genes.tsv
Gene ID | Gene name | Gene description | Gene type ID | Species ID

Gene to gene families

File: geneToGeneFamilies.tsv
Gene ID | Gene family ID

Description: as Bgee uses several gene family prediction methods, genes can belong to several gene families. This association file is thus required.

Global expression data

Data sources

File: dataSources.tsv
Data source ID | Data source name

Description: this file contains the list of primary databases used to construct the Bgee database. Expression data present in Bgee are then mapped to these data sources.

Expression

File: expression.tsv
Expression ID | Gene ID | Anatomical structure ID | Stage ID | Expression confidence for EST data | Expression confidence for Affymetrix data | Expression confidence for in situ hybridization data | Expression confidence for RNA-Seq data

Description: this file summarizes all the expression data stored in Bgee, whatever the data type. Each line represents an expression pattern: a gene, expressed in an anatomical structure, at a developmental stage. One column is provided for each data type, and can take three values: "no data", "poor quality", or "high quality". If this expression pattern has not been detected using this data type, the value is "no data". For EST data, "poor quality" and "high quality" represent the best data quality among all the data of this type that define this expression pattern. For Affymetrix, in situ hybridization, and RNA-Seq data, this value represents the overall expression summary. See the data analysis section for more information.

No expression

File: noExpression.tsv
No expression ID | Gene ID | Anatomical structure ID | Stage ID | No expression confidence for Affymetrix data | No expression confidence for in situ hybridization data | No expression confidence for RNA-Seq data

Description: this file summarizes the no-expression information stored in Bgee, whatever the data type. Each line represents the information that a gene is NOT expressed in an anatomical structure, at a developmental stage. One column is provided for each data type, and can take two values: "no data" or "high quality" ("no-expression" data are only "high quality" data). If this information has not been detected using this data type, the value is "no data". See the data analysis section for more information.

Differential expression

File: differentialExpression.tsv
Differential expression ID | Gene ID | Anatomical structure ID | Stage ID | Differential expression direction ('over' or 'under') | Differential expression confidence for Affymetrix data

Description: this file summarizes the differential expression information stored in Bgee, from Affymetrix data. Each line represents the information that a gene is under- or over-expressed in an anatomical structure, at a developmental stage. The last column represents the confidence in this information, and can take three values: "no data", "poor quality", or "high quality". See the data analysis section for more information.

RNA-Seq data

RNA-Seq experiments

File: rnaSeqExperiments.tsv
RNA-Seq experiment ID | RNA-Seq experiment name | RNA-Seq experiment description | Data source ID

Description: list of RNA-Seq experiments used in Bgee. They are linked to the original data source by the column "Data source ID".

RNA-Seq platforms

File: rnaSeqPlatforms.tsv
RNA-Seq platform ID | RNA-Seq platform description

Description: RNA-Seq platforms used, e.g. Illumina Genome Analyzer IIx.

RNA-Seq libraries

File: rnaSeqLibraries.tsv
RNA-Seq library ID | RNA-Seq secondary library ID | RNA-Seq experiment ID | RNA-Seq platform ID | Anatomical structure ID | Stage ID | log2 RPK threshold | Percentage of "present" genes | Percentage of "present" protein-coding genes | Percentage of "present" intronic regions | Percentage of "present" intergenic regions | Total reads count | Used reads count | Aligned reads count | Minimum read length | Maximum read length | Library type

Description: list of the RNA-Seq libraries annotated and used in Bgee. They are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID".

'RNA-Seq library ID' represents the ID of the sample in the GEO database.

'RNA-Seq secondary library ID' represents the ID of the library in the SRA database.

'log2 RPK threshold' represents the value above which genes are considered as expressed (see data analysis section for more information).

'Total reads count' is the total number of reads present in the library (all runs aggregated).

'Used reads count' is the number of remaining reads after filtering by the TopHat software.

RNA-Seq runs

File: rnaSeqRuns.tsv
RNA-Seq run ID | RNA-Seq library ID

Description: RNA-Seq runs from the SRA database, linked to their container library.

RNA-Seq results

File: rnaSeqResults.tsv
RNA-Seq library ID | Gene ID | log2 RPK value | Aligned reads count | Detection flag | Expression ID | No expression ID | Expression confidence for RNA-Seq data | If result excluded, reason for exclusion

Description: list of the gene present/absent calls for every RNA-Seq library stored in Bgee.

'log2 RPK value' is the value used to define whether a gene is expressed or not, as compared to the log2 RPK threshold for the given library (see data analysis section).

'Expression confidence for RNA-Seq data': please note that, for now, all RNA-Seq results are considered high quality data.

A result can be associated either with an expression result (Expression ID not null), or with a no-expression result (No expression ID not null).

A result with a detection flag "absent" can still be associated with an expression result (Expression ID not null) when it represents a conflict (another result, for the same gene at the same developmental stage in the same anatomical structure, shows expression of this gene). This result flagged as "absent" is used to decrease the quality of the associated expression result.

If both Expression ID and No expression ID are null, the result has been removed from the dataset. Reasons for exclusion are:

  • pre filtering: genes always seen as "absent" over the whole dataset are not considered.
  • noExpression conflict: a "no-expression" result has been removed because of expression of the same gene detected on other libraries in some substructures/child stages.
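
The rules above can be summarized in a small Python sketch. The function name and the returned labels are invented for this example, and it assumes that a null ID is read either as None or as an empty string, which should be checked against the actual files.

```python
def is_null(value):
    """Treat None and the empty string as null (an assumption about the files)."""
    return value is None or value == ""


def result_status(expression_id, no_expression_id, detection_flag):
    """Classify one RNA-Seq result from its two ID columns and its detection flag."""
    if not is_null(expression_id):
        if detection_flag == "absent":
            # Conflicting call: it lowers the quality of the expression result.
            return "expression (conflicting absent call)"
        return "expression"
    if not is_null(no_expression_id):
        return "no-expression"
    # Both IDs are null: the result was removed from the dataset.
    return "excluded"


status = result_status("E1", None, "absent")
```

The same reading of the Expression ID and No expression ID columns applies to the Affymetrix probesets and in situ spots files described below.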

Affymetrix data

Microarray experiments

File: microarrayExperiments.tsv
Microarray experiment ID | Microarray experiment name | Microarray experiment description | Data source ID

Description: list of microarray experiments used in Bgee. They are linked to the original data source by the column "Data source ID".

Chip types

File: chipTypes.tsv
Chip type ID | Chip type name | CDF name | Usable in Bgee | Bgee quality score threshold | Percent present

Description: list of the different chip types used in Bgee.

'Bgee quality score threshold' and 'Percent present threshold' are the thresholds used to consider a chip as of good quality. If their value is '0', it means that no threshold was defined for this chip type.

Normalization types

File: normalizationTypes.tsv
Normalization type ID | Normalization type name

Description: list of the different normalization methods used by Bgee to renormalize probeset signal intensities, e.g. "gcRMA". See the data analysis section for more information.

Detection types

File: detectionTypes.tsv
Detection type ID | Detection type name

Description: list of the different methods used by Bgee to determine whether a gene is expressed or not, and with which confidence, based on the normalized probeset signal intensities, e.g. "Schuster et al. method". See the data analysis section for more information.

Affymetrix chips

File: affymetrixChips.tsv
Bgee affymetrix chip ID | Affymetrix chip ID | Microarray experiment ID | Chip type ID | Scan date | Normalization type ID | Detection type ID | Bgee quality score | Percent present | Anatomical structure ID | Stage ID

Description: list of the Affymetrix chips annotated and used in Bgee. They are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID".

'Bgee quality score' is an Affymetrix quality score developed by Bgee (see documentation), used to remove low-quality chips. It can only be used when raw CEL files are available. The percentage of probesets with a 'present' state ('Percent present') is also used to remove low-quality chips, and is applied to all data, even when only MAS5 processed files are available. Thresholds on quality score and percent present are defined for each chip type independently. They can be retrieved from the table 'Chip types', using the 'Chip type ID' field. All chips present in Bgee have passed quality controls.

The field 'Affymetrix chip ID' represents the ID of the chip in the source database. 'Bgee affymetrix chip ID' is a Bgee internal ID, needed because 'Affymetrix chip ID' values are not unique (but the pairs 'Affymetrix chip ID' - 'Microarray experiment ID' are). This internal ID is used to link to other tables ('Affymetrix probesets' and 'DEA chips groups to affymetrix chips').

Affymetrix probesets

File: affymetrixProbesets.tsv
Affymetrix probeset ID | Bgee affymetrix chip ID | Gene ID | Normalized signal intensity | Detection flag | Expression ID | No Expression ID | Confidence for Affymetrix data | Reason for exclusion

Description: list of the probesets of every Affymetrix chip stored in Bgee. A level of confidence is assigned to each probeset: "poor quality" or "high quality" (see data analysis section).

A probeset can be associated either with an expression result (Expression ID not null), or with a no-expression result (No Expression ID not null).

A probeset with a detection flag "absent" can still be associated with an expression result (Expression ID not null) when it represents a conflict (another probeset, for the same gene at the same developmental stage in the same anatomical structure, is detected as expressed). This probeset flagged as "absent" is used to decrease the quality of the associated expression result.

If both Expression ID and No Expression ID are null, the probeset has been removed from the dataset. Reasons for exclusion are:

  • pre filtering: probesets always seen as "absent" or "marginal" over the whole dataset are removed.
  • bronze quality (quality too low): for a gene/organ/stage, mix of probesets "absent" and "marginal" (no "present", and inconsistency expression / no-expression).
  • absent low quality (MAS5): a no-expression result is retrieved only using MAS5. No-expression results must be confirmed by analyses where raw data are available.
  • noExpression conflict: a "no-expression" result has been removed because of expression of the same gene detected by other probesets in some substructures/child stages.

See the data analysis section for more information. Note that this file is very large.

Differential expression analysis (DEA) types

File: differentialExpressionAnalysisTypes.tsv
DEA type ID | DEA type name

Description: types of analysis used to detect over-expression of genes.

Differential expression analyses (DEA)

File: differentialExpressionAnalyses.tsv
DEA ID | DEA type ID | Microarray experiment ID

Description: analysis performed to detect over-expression of genes.

DEA chips groups

File: DEAChipsGroups.tsv
DEA chips group ID | DEA ID | Anatomical structure ID | Stage ID

Description: to detect over-expression of genes, Differential Expression Analyses can only be performed on groups of chips (more or less equivalent to a set of replicates) with the same conditions (anatomical structures/developmental stages), in order to estimate the variance of gene expression. The analysis (DEA ID) then uses at least 3 different groups of chips, from the same experiment, in different conditions (anatomical structures/developmental stages), to detect over-expression in some of them.

The Stage ID does not always correspond to the Stage ID of the corresponding Affymetrix chips in the Affymetrix chips table. They differ when the chip was annotated to a developmental stage that is too granular for differential expression analyses. In that case, data are transferred to the closest parent stage that is not too granular, and it is the Stage ID of this parent stage that is given in this table. When the annotation is not too granular, the Stage ID is the same as for the corresponding chips. See the documentation of the Developmental stages table for more information.

DEA chips groups to affymetrix chips

File: DEAChipsGroupsToAffymetrixChips.tsv
DEA chips group ID | Bgee affymetrix chip ID

Description: this file makes it possible to trace back to the original chips that were grouped to perform a Differential Expression Analysis.

DEA affymetrix probesets summaries

File: deaAffymetrixProbesetSummary.tsv
DEA Affymetrix probesets summary ID | DEA chips group ID | Gene ID | Fold change | Differential Expression ID | Expression confidence for DE Affymetrix data | Raw p-value | If excluded, reason for exclusion

Description: a line in this table is a summary of a set of probesets used for the differential expression analysis, belonging to different individual Affymetrix chips, and corresponding to one group of chips (DEA chips group ID). If Differential Expression ID is not null, then this gene is differentially expressed in these conditions.

A probeset showing under-expression can still be associated to an over-expression result (and vice versa): it means there was a conflict (another probeset or chip, for the same gene at the same developmental stage in the same anatomical structure, showed the opposite direction of differential expression). The result with the lowest p-value is considered and overall quality is set to "low quality".
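
The conflict rule described above can be sketched as follows. This is a hedged illustration, not Bgee's code: the function name and the input layout, a list of (direction, p-value) pairs for one gene in one condition, are assumptions, and the quality returned when there is no conflict is left undefined (None) because it is determined elsewhere in the analysis.

```python
def summarize_calls(calls):
    """Keep the call with the lowest p-value; flag the summary as
    "low quality" when the calls disagree on the direction."""
    direction, p_value = min(calls, key=lambda call: call[1])
    directions = {d for d, _ in calls}
    quality = "low quality" if len(directions) > 1 else None
    return direction, p_value, quality


# Invented calls for one gene in one anatomical structure and stage.
summary = summarize_calls([("over", 0.01), ("under", 0.001)])
```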

If Differential Expression ID is null, the probeset has been removed from the dataset. Reasons for exclusion are:

  • not expressed: the corresponding gene has been shown to be not expressed in this anatomical structure at this developmental stage (see noExpression table).

In situ hybridization data

In situ experiments

File: inSituExperiments.tsv
In situ experiment ID | In situ experiment name | In situ experiment description | Data source ID

Description: list of the in situ hybridization experiments used in Bgee. They are linked to the original data source by the column "Data source ID".

In situ evidences

File: inSituEvidences.tsv
In situ evidence ID | In situ experiment ID

Description: an "evidence" represents a material used to define a gene expression pattern. Most of the time, it is an image of an in situ hybridization, but it can also be, for instance, a publication.

In situ spots

File: inSituSpots.tsv
In situ spot ID | In situ evidence ID | Anatomical structure ID | Stage ID | Gene ID | Detection flag | Expression ID | No Expression ID | Confidence for in situ hybridization data | Reason for exclusion

Description: a spot represents a gene expression pattern defined by in situ hybridization data. Most of the time, it represents a labeled area of a hybridization image, but it can also represent information found in a publication. Spots are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID". A level of confidence is assigned to each spot: "poor quality" or "high quality" (see data analysis section).

A spot can be associated either with an expression result (Expression ID not null), or with a no-expression result (No Expression ID not null).

A spot with a detection flag "absent" can still be associated with an expression result (Expression ID not null) when it represents a conflict (another in situ spot, for the same gene at the same developmental stage in the same anatomical structure, is detected as expressed). This spot flagged as "absent" is used to decrease the quality of the associated expression result.

If both Expression ID and No Expression ID are null, the spot has been removed from the dataset. Reasons for exclusion are:

  • bronze quality (quality too low): for a gene/organ/stage, mix of spots "absent" and "expressed low quality" (no "expressed high quality", and inconsistency expression / no-expression).
  • absent low quality: a no-expression result has been retrieved using "low quality" spots only. A no-expression result must be confirmed by "high quality" data.
  • noExpression conflict: a "no-expression" result has been removed because of expression of the same gene detected by other spots in some substructures/child stages.
See the data analysis section for more information.

EST data

EST libraries

File: ESTLibraries.tsv
EST library ID | EST library name | EST library description | Anatomical structure ID | Stage ID | Data source ID

Description: list of EST libraries annotated and used in Bgee. They are mapped to the ontologies by the columns "Anatomical structure ID" and "Stage ID". They are linked to the original data source by the column "Data source ID".

Expressed Sequence Tags

File: ESTs.tsv
EST ID | EST library ID | Gene ID | Unigene cluster ID | Expression ID | Expression confidence for EST data

Description: a level of confidence is assigned for each EST: "poor quality" or "high quality". See the data analysis section for more information.