Database of Orthologous Promoters
Database Creation

The creation of the database was accomplished in two phases. In the first phase we created clusters that contain orthologous sequences from different species, each identified by BLAST (version 2.2.15; Altschul et al. 1990, 1997) searches using a Homo sapiens (in the case of the chordate section) or Arabidopsis thaliana (in the case of the plant section) sequence as query.

In the second phase we created a multiple alignment with DIALIGN (version 2.2.1; Morgenstern et al. 1998, Morgenstern 1999), and calculated a modified information content value for each column of the alignment. We extracted conserved motifs based on this information content value.

Creation of the orthologous clusters

The sequences in our BLAST databases were genomic DNA sequences belonging to the phylum Chordata (in the case of the chordate section) or the kingdom Viridiplantae (in the case of the plant section). They were taken from the nt, wgs, gss and htg sections of the NCBI GenBank database, together with the annotated whole genomes of Homo sapiens (release date April 2006) and Arabidopsis thaliana (release date December 2005; both downloaded from NCBI). We also used genomic sequences downloaded from Ensembl, the Broad Institute, the Baylor College of Medicine and the Joint Genome Institute.

In the first step we extracted the sequences of the first two annotated mRNA exons and the first two annotated CDS exons of each annotated gene in the reference species (Homo sapiens for the chordate section and Arabidopsis thaliana for the plant section). Depending on the start positions of the corresponding mRNA and CDS first exons, we distinguished the following 6 types of genes :

  • Type 1 : The corresponding mRNA and CDS first exons have the same start position. The first CDS exon is not less than 50 bp long.
  • Type 2 : The corresponding mRNA and CDS first exons have the same start position. The first CDS exon is less than 50 bp long.
  • Type 3 : The start position of the mRNA first exon is upstream of the corresponding CDS first exon. The first CDS exon is not less than 50 bp long.
  • Type 4 : The start position of the mRNA first exon is upstream of the corresponding CDS first exon. The first CDS exon is less than 50 bp long.
  • Type 5n : The start position of more than one (number: n) mRNA exon is upstream of the corresponding CDS first exon. The first mRNA exon is not less than 50 bp long.
  • Type 6n : The start position of more than one (number: n) mRNA exon is upstream of the corresponding CDS first exon. The first mRNA exon is less than 50 bp long.
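
The six types above can be expressed as a small decision procedure. The following is a minimal sketch, not the pipeline's actual code. The function name `classify_gene`, the `(start, end)` exon representation and the returned `(type, n)` pair are illustrative assumptions, and the sketch assumes plus-strand coordinates (for minus-strand genes the comparison would be mirrored).

```python
from typing import List, Tuple

# Hypothetical exon representation: (start, end), 1-based inclusive,
# listed in transcript order on the plus strand.
Exon = Tuple[int, int]


def exon_length(exon: Exon) -> int:
    start, end = exon
    return end - start + 1


def classify_gene(mrna_exons: List[Exon], cds_exons: List[Exon],
                  min_len: int = 50) -> Tuple[int, int]:
    """Return (gene type, n), where n is the number of mRNA exons
    that start upstream of the first CDS exon."""
    mrna_first, cds_first = mrna_exons[0], cds_exons[0]
    cds_start = cds_first[0]
    n_upstream = sum(1 for start, _ in mrna_exons if start < cds_start)

    if n_upstream == 0:
        # Types 1 and 2: mRNA and CDS first exons start at the same position.
        return (1 if exon_length(cds_first) >= min_len else 2, 0)
    if n_upstream == 1:
        # Types 3 and 4: only the first mRNA exon starts upstream of the CDS.
        return (3 if exon_length(cds_first) >= min_len else 4, 1)
    # Types 5n and 6n: several mRNA exons start upstream of the CDS;
    # here the length test applies to the first mRNA exon.
    return (5 if exon_length(mrna_first) >= min_len else 6, n_upstream)
```

For example, a gene with two mRNA exons starting upstream of the first CDS exon and a 101-bp first mRNA exon would be reported as `(5, 2)`, that is, type 5n with n = 2.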

Each sequence received a unique 8-digit numerical identifier. The identifiers of the Homo sapiens and Arabidopsis thaliana sequences were later also used as 'cluster identifiers' (see below).

Then we created a database from which we intended to select the orthologous sequences for each gene of the reference species. This database also contained the sequences of the reference species mentioned above. (These reference sequences consisted of the first exon plus the 3000 bp upstream region in the case of types 1, 3 and 5n, or of the first two exons plus the 3000 bp region upstream of the first exon in the case of types 2, 4 and 6n.) We used this database as a BLAST database and searched it with BLASTN (version 2.2.15), using the sequences of the first exons (in the case of types 1, 3 and 5n) or the combined sequences of the first two exons (in the case of types 2, 4 and 6n) as queries. The parameters of the BLASTN search were the following :

  • Word size : 12
  • Gap opening penalty : 1
  • Gap extension penalty : 1
  • E-value : 0.01
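
As an illustration, the parameters above could be passed to a BLAST 2.2.x installation roughly as follows. This is a sketch that assumes the legacy `blastall` front end was used; the helper name `blastn_command`, the file names and the tabular output option (`-m 8`) are assumptions and are not stated in the text.

```python
from typing import List


def blastn_command(query_fasta: str, database: str, output: str) -> List[str]:
    """Build a legacy blastall command line for the BLASTN search."""
    return [
        "blastall",
        "-p", "blastn",      # program
        "-d", database,      # formatted BLAST database (hypothetical name)
        "-i", query_fasta,   # query sequences (hypothetical name)
        "-o", output,        # result file (hypothetical name)
        "-W", "12",          # word size
        "-G", "1",           # gap opening penalty
        "-E", "1",           # gap extension penalty
        "-e", "0.01",        # E-value cutoff
        "-m", "8",           # tabular output (assumption, eases parsing)
    ]


if __name__ == "__main__":
    print(" ".join(blastn_command("first_exons.fa", "promoter_db", "hits.tsv")))
```

Returning the command as an argument list keeps it suitable for `subprocess.run` without shell quoting.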

We filtered the resulting hits with cutoff values of 50% identity and 75% combined alignment length relative to the length of the query. From the remaining hits we identified the 'orthologous' ones with the following simple algorithm. For each species that had more than one hit, the hit with the best score was chosen. If a species had two or more hits with identical scores, the sequence that extended furthest in the 5' direction relative to the hit was chosen. We then took the 500-bp, 1000-bp and 3000-bp sequences immediately upstream of the hit sequences as the orthologous promoter regions. If the available sequence was shorter than 500 bp, 1000 bp or 3000 bp, it was still accepted provided it reached the minimum required length of 300 bp (200 bp in the case of plants), 700 bp or 2000 bp, respectively. Finally, these sequences were collected into one cluster under the gene identifier of the corresponding gene in the reference species.
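
The filtering and selection rules can be sketched in a few lines. The code below is an illustrative reading of the description, not the original implementation. The `Hit` record, its field names and both function names are hypothetical, and the cutoffs are assumed to be inclusive.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Hit:
    species: str
    identity: float          # percent identity (0-100)
    aligned_length: int      # combined alignment length on the query, in bp
    score: float             # BLAST score
    upstream_available: int  # bp of subject sequence 5' of the hit


def select_orthologs(hits: Iterable[Hit], query_length: int,
                     min_identity: float = 50.0,
                     min_coverage: float = 0.75) -> Dict[str, Hit]:
    """Keep one hit per species: best score, ties broken by 5' extent."""
    best: Dict[str, Hit] = {}
    for hit in hits:
        if hit.identity < min_identity:
            continue
        if hit.aligned_length < min_coverage * query_length:
            continue
        current = best.get(hit.species)
        key = (hit.score, hit.upstream_available)
        if current is None or key > (current.score, current.upstream_available):
            best[hit.species] = hit
    return best


# Target promoter length -> minimum accepted length (bp).
WINDOWS = {500: 300, 1000: 700, 3000: 2000}


def promoter_lengths(upstream_available: int,
                     plant: bool = False) -> Dict[int, Optional[int]]:
    """For each target window, return the length to extract or None."""
    result: Dict[int, Optional[int]] = {}
    for target, minimum in WINDOWS.items():
        if plant and target == 500:
            minimum = 200  # relaxed minimum for the plant section
        if upstream_available >= target:
            result[target] = target
        elif upstream_available >= minimum:
            result[target] = upstream_available
        else:
            result[target] = None
    return result
```

With 2500 bp available upstream of a hit, `promoter_lengths(2500)` yields full 500-bp and 1000-bp regions and a truncated 2500-bp region for the 3000-bp window, since 2500 bp exceeds the 2000-bp minimum.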

Creation of the sequence subsets

To find the evolutionarily conserved motifs, we first grouped the sequences of a cluster based on their evolutionary distance from the reference species (Arabidopsis thaliana or Homo sapiens). There are 4 types of groups (subsets) in the case of plants and 10 in the case of chordates. The subsets are the following :

Plants
  • B Brassicaceae : contains sequences from the Brassicaceae family.
  • E eudicotyledons : contains sequences from all available dicotyledonous plants in addition to the Brassicaceae sequences (for example Populus trichocarpa or Ricinus communis).
  • M Magnoliophyta : usually contains monocotyledonous sequences in addition to the above-mentioned groups (for example Oryza sativa or Zea mays).
  • V Viridiplantae : contains other Viridiplantae sequences, not in the previous groups.

Chordates
  • P Primates : contains only primate sequences.
  • R Euarchontoglires : usually contains rodent sequences in addition to the primate sequences.
  • E Eutheria : placental mammals (for example Bos taurus, Canis familiaris, etc.).
  • H Theria : placental mammals and marsupials (usually Monodelphis domestica).
  • M Mammalia : all mammals, including Prototheria (for example Ornithorhynchus anatinus).
  • N Amniota : amniotes, including sauropsids (birds and reptiles).
  • T Tetrapoda : tetrapods, including amphibians (for example Xenopus laevis).
  • F Teleostomi : all bony vertebrates, including most fishes (for example Takifugu rubripes or Danio rerio).
  • V Vertebrata : all vertebrates.
  • C Chordata : all chordates, including Ciona intestinalis.
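
Because each subset contains the narrower ones, the grouping can be sketched as a lookup of the narrowest matching taxon followed by cumulative assignment. This is an illustrative sketch only. It assumes each sequence comes with a taxonomic lineage given as a list of taxon names (a hypothetical input), and that all subsets, including V in the plant section, are cumulative. The function names are also assumptions.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# (subset code, taxon name), ordered from narrowest to broadest.
CHORDATE_SUBSETS: List[Tuple[str, str]] = [
    ("P", "Primates"), ("R", "Euarchontoglires"), ("E", "Eutheria"),
    ("H", "Theria"), ("M", "Mammalia"), ("N", "Amniota"),
    ("T", "Tetrapoda"), ("F", "Teleostomi"), ("V", "Vertebrata"),
    ("C", "Chordata"),
]
PLANT_SUBSETS: List[Tuple[str, str]] = [
    ("B", "Brassicaceae"), ("E", "eudicotyledons"),
    ("M", "Magnoliophyta"), ("V", "Viridiplantae"),
]


def narrowest_subset(lineage: Sequence[str],
                     subsets: List[Tuple[str, str]]) -> Optional[str]:
    """Return the code of the narrowest subset whose taxon is in the lineage."""
    names = set(lineage)
    for code, taxon in subsets:
        if taxon in names:
            return code
    return None


def build_subsets(lineages: Dict[str, Sequence[str]],
                  subsets: List[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Assign every sequence to its narrowest subset and all broader ones."""
    codes = [code for code, _ in subsets]
    result: Dict[str, List[str]] = {code: [] for code in codes}
    for seq_id, lineage in lineages.items():
        narrow = narrowest_subset(lineage, subsets)
        if narrow is None:
            continue  # sequence lies outside the section's taxonomic scope
        for code in codes[codes.index(narrow):]:
            result[code].append(seq_id)
    return result
```

Under these assumptions a rodent sequence would appear in subset R and every broader subset, but not in P.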

Searching for conserved motifs

We first created a multiple alignment with the program DIALIGN, run with its default parameter values. The sequences in the clusters were masked for simple sequence repeats (SSR). The Tandem Repeats Finder (Benson 1999) was used for this masking with the following options :

  • Match : 2
  • Mismatch : 7
  • Delta : 7
  • PM : 80
  • PI : 10
  • Minscore : 24
  • Maxperiod : 100
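
For reference, Tandem Repeats Finder takes these seven values as positional arguments after the input file, in the order listed above. The following sketch builds such a command line. The helper name `trf_command` and the input file name are hypothetical, and the `-m` (write a masked sequence file) and `-h` (suppress HTML output) flags are an assumption about how the masking was run.

```python
from typing import List


def trf_command(fasta: str) -> List[str]:
    """Build a Tandem Repeats Finder command line with the listed options."""
    return [
        "trf", fasta,
        "2",    # Match
        "7",    # Mismatch
        "7",    # Delta
        "80",   # PM
        "10",   # PI
        "24",   # Minscore
        "100",  # Maxperiod
        "-m",   # write a masked copy of the input (assumption)
        "-h",   # suppress HTML output (assumption)
    ]


if __name__ == "__main__":
    print(" ".join(trf_command("cluster.fa")))
```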

After calculating the modified information content (IC) score, we searched for 'seed regions', where the local IC score reached 70% of the maximum. We extended these seed regions in both directions, until the IC score dropped below 60% of the maximum. Based on these regions, we extracted blocks from the multiple alignment, and also generated a consensus sequence.
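
The seed-and-extend step can be sketched as follows. The sketch takes a list of per-column scores as input and does not attempt to reproduce the modified IC calculation itself, which is not specified here. The function name `conserved_regions`, the inclusive thresholds and the 0-based column indices are assumptions.

```python
from typing import List, Sequence, Tuple


def conserved_regions(scores: Sequence[float],
                      seed_frac: float = 0.7,
                      extend_frac: float = 0.6) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, 0-based inclusive, found by
    seeding at >= seed_frac * max and extending while >= extend_frac * max."""
    if not scores:
        return []
    peak = max(scores)
    if peak <= 0:
        return []
    seed_cut = seed_frac * peak
    extend_cut = extend_frac * peak

    regions: List[Tuple[int, int]] = []
    n = len(scores)
    i = 0
    while i < n:
        if scores[i] >= seed_cut:
            # Extend the seed to the left and right until the score drops.
            start = i
            while start > 0 and scores[start - 1] >= extend_cut:
                start -= 1
            end = i
            while end + 1 < n and scores[end + 1] >= extend_cut:
                end += 1
            regions.append((start, end))
            i = end + 1  # continue scanning after the extended region
        else:
            i += 1
    return regions


if __name__ == "__main__":
    ic = [0.1, 0.65, 0.8, 1.0, 0.62, 0.3, 0.65, 0.2, 0.75, 0.5]
    print(conserved_regions(ic))
```

In the example, column 6 (score 0.65) exceeds the extension threshold but is not adjacent to any seed, so it is not reported. Each returned range would then define a block to cut from the multiple alignment.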

Created and maintained by Endre Sebestyén, Tibor Nagy & Endre Barta. 2007-2011, Agricultural Biotechnology Center