Format of pseudogene annotation files (.gff)
The pseudogene annotation files are tab-delimited, 
multi-field text files. Each file has a header line describing the content in 
each column. The columns from left to right are:
 
- ID: unique identifier for each processed pseudogene, in the 
  format chr$a_$b.$c, where $a is the chromosome name, $b is the Swissprot/TrEmbl 
  protein accession number, and $c is the sequential number of the pseudogene 
  that matches protein $b on chromosome $a. Example: chr1_P02404.1. 
  
- Short_ID: short version of the pseudogene ID, in the format $a_$b, 
  where $a is the Swissprot protein accession number and $b is the sequential 
  number of the pseudogene that matches protein $a in the whole genome. 
  Example: P02404_1. 
  
- Chr: chromosome name. 
  
- Chrom_start: starting coordinate of the pseudogene on the 
  chromosome, based on Build 28 of the GoldenPath assembly. 
  
- Chrom_end: end coordinate of the pseudogene on the chromosome. 
  
- Chrom_strand: "-" or "+" 
  
- Cytogenic_band: chromosomal band as predicted by Ensembl. Example: 
  "1p36.33". 
  
- Query_protein: accession number of the closest matching protein in 
  Swissprot/TrEmbl. 
  
- Query_start: starting amino acid number on the query protein that 
  the pseudogene matches. 
  
- Query_end: end amino acid number on the query protein that the 
  pseudogene matches. 
  
- Query_len: sequence length of the query protein in column 8. 
  
- Completeness: sequence completeness of the pseudogene compared with 
  the query protein. 
  
- E-value: Expect value of the pseudogene in the TBLASTX search. 
  
- AA_ident: amino acid sequence identity between the pseudogene and 
  query protein. 
  
- DNA_ident: nucleotide sequence identity between the pseudogene and 
  the coding sequence of the query protein (coding region only). Some query 
  proteins have no coding sequence available. 
  
- Polya: "0" or "1" or "2" or "3".
 
- Disable: "0" or "d" or "D". "0" indicates no disablement (only for 
  RP pseudogenes). "d" indicates disablement in a region of low sequence 
  identity. "D" indicates disablement in a region of high sequence identity. 
  
- GC_Pgene: GC content of the pseudogene sequence 
  
- GC_Isochore: GC content of the 100K bp window on the chromosome. 
  
- Isochore_class: isochore class in which the pseudogene resides: L1, 
  L2, H1, H2, or H3. 
  
- Kimura_Distance: evolutionary distance of the pseudogene sequence from 
  the present-day sequence. 
  
- Class: "PSSD1" indicates "true" processed pseudogenes. "PSSD2" 
  indicates putative processed pseudogenes. 
  
- Comment: cytoplasmic ribosomal protein pseudogenes are labeled as 
  "RP". 
  
- Protein_name: "Protein name" field of the query protein in the 
  Swissprot/TrEmbl. 
  
- Gene_name: "Gene name" field of the query protein in the 
  Swissprot/TrEmbl. 
  
- MIM: Entry of the query protein in the MIM database (Mendelian 
  Inheritance in Man). 
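
The column layout above can be read with a few lines of Python. The following 
is only a minimal sketch: it assumes the header line uses exactly the column 
names listed above, and that accession numbers contain no underscore or period. 
The function names are illustrative and are not part of the data distribution.

```python
import csv
import re

# chr$a_$b.$c: chromosome name, protein accession number, sequential number.
# Assumption: the accession ($b) contains neither "_" nor ".".
ID_PATTERN = re.compile(
    r"^chr(?P<chrom>.+)_(?P<accession>[^_.]+)\.(?P<serial>\d+)$"
)


def parse_pseudogene_id(pgene_id):
    """Split an ID such as 'chr1_P02404.1' into (chromosome, accession, serial)."""
    match = ID_PATTERN.match(pgene_id)
    if match is None:
        raise ValueError(f"unexpected pseudogene ID: {pgene_id!r}")
    return match.group("chrom"), match.group("accession"), int(match.group("serial"))


def read_annotation(handle):
    """Yield one dict per pseudogene, keyed by the names in the header line."""
    # QUOTE_NONE keeps any quote characters in free-text fields untouched.
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row


def true_processed(rows):
    """Keep only rows whose Class field is 'PSSD1' ("true" processed pseudogenes)."""
    return [row for row in rows if row.get("Class") == "PSSD1"]
```

All values are returned as strings; numeric columns such as Chrom_start or 
E-value would need to be converted by the caller.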
 
  
Format of pseudogene DNA sequence file (.dna)
 
These files contain the nucleotide sequences of the annotated processed 
pseudogenes in multiple-sequence FASTA format.
Each pseudogene entry has 2 lines. The header line begins with ">", followed 
by the unique pseudogene ID (field 1 in the corresponding .gff annotation 
file). Other attributes of the pseudogene are also provided on the header 
line, including "Chrom", "Chrom-start", "Chrom_end", "Strand", "band", 
"Query_protein", "Query_start", "Query_end", "Query_len", "Class_new", 
"Comment" and "Short_ID". These attributes are defined in the annotation file 
format above. The second line is the nucleotide sequence of the pseudogene.
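
As an illustration of the two-line layout, here is a minimal Python sketch 
that reads a .dna file. How the extra attributes are laid out on the header 
line is not specified here, so the sketch assumes only that the pseudogene ID 
is the first whitespace-delimited token after ">" and returns the rest of the 
header as an unparsed string. The function name is hypothetical.

```python
def read_dna_records(handle):
    """Yield (pseudogene_id, header_rest, sequence) for each two-line entry."""
    # Skip blank lines and strip line endings.
    lines = (line.rstrip("\r\n") for line in handle if line.strip())
    for header in lines:
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got: {header[:30]!r}")
        sequence = next(lines, None)
        if sequence is None or sequence.startswith(">"):
            raise ValueError(f"entry without a sequence line: {header[:30]!r}")
        fields = header[1:].split(None, 1)
        if not fields:
            raise ValueError("header line has no pseudogene ID")
        pgene_id = fields[0]
        header_rest = fields[1] if len(fields) > 1 else ""
        yield pgene_id, header_rest, sequence
```

Because each sequence sits on a single line, no joining of wrapped sequence 
lines is needed, unlike with general FASTA files.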
 
Format of pseudogene amino acid sequence file (.fa)
 
These files contain the predicted amino acid sequences of the annotated 
processed pseudogenes in multiple-sequence FASTA format. 
Each pseudogene entry has 3 lines. The header line begins with ">", followed 
by the unique pseudogene ID (field 1 in the corresponding .gff annotation 
file). Other attributes of the pseudogene are also provided on the header 
line, including "Chrom", "Chrom-start", "Chrom_end", "Strand", "band", 
"Query_protein", "Query_start", "Query_end", "Query_len", "Class_new", 
"Comment" and "Short_ID". These attributes are defined in the annotation file 
format above.
The second line is the amino acid sequence of the query protein, and the third 
line is the predicted amino acid sequence of the pseudogene. Frameshifts are 
indicated as "\" or "/", stop codons are indicated as "X", and gaps are shown 
as "-".
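
The three-line layout and the disablement symbols can be handled as in the 
Python sketch below. It is a sketch under assumptions: the pseudogene ID is 
taken to be the first whitespace-delimited token after ">", and the function 
names are hypothetical.

```python
def count_disablements(pseudogene_seq):
    """Count frameshifts, stop codons and gaps in a predicted pseudogene sequence."""
    return {
        # Frameshifts are marked with either a backslash or a forward slash.
        "frameshifts": pseudogene_seq.count("\\") + pseudogene_seq.count("/"),
        "stops": pseudogene_seq.count("X"),
        "gaps": pseudogene_seq.count("-"),
    }


def read_protein_records(handle):
    """Yield (pseudogene_id, query_seq, pseudogene_seq) for each three-line entry."""
    lines = [line.rstrip("\r\n") for line in handle if line.strip()]
    if len(lines) % 3 != 0:
        raise ValueError("expected three lines per pseudogene entry")
    for i in range(0, len(lines), 3):
        header, query_seq, pseudogene_seq = lines[i:i + 3]
        if not header.startswith(">"):
            raise ValueError(f"expected a header line, got: {header[:30]!r}")
        fields = header[1:].split()
        if not fields:
            raise ValueError("header line has no pseudogene ID")
        yield fields[0], query_seq, pseudogene_seq
```

For example, count_disablements applied to the third line of an entry gives a 
quick per-pseudogene tally that can be compared with the Disable column of the 
annotation file.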
 
Occurrences of processed pseudogenes 
  
 
The file contains multiple tab-delimited fields:
- Rank: ranking of the proteins based on number of processed pseudogenes.
- Count: number of processed pseudogenes (excluding putative ones) that closely 
    match the protein.
- AC: Swissprot/TrEMBL accession number of the protein.
- DB: "SWP" or "TREMBL".
- DB_Name: Swissprot entry name.
- Comment: Ribosomal protein sequences are labeled as "RP".
- Protein_len: Sequence length of the protein.
- CDS_len: Sequence length of the coding sequence (CDS) of the protein.
- CDS_GC: GC content of the coding sequence.
- EBI_Name: Description of the protein provided by EBI.
- Secondary_AC: Secondary accession number of the protein in Swissprot/TrEMBL.
- Protein_Name: protein name described in Swissprot/TrEMBL.
- Synonyms: alternative names for the protein, provided by Swissprot.
- Gene_Name: Associated gene name for the protein.
- MIM: Entry in the MIM (Mendelian Inheritance in Man) database.
- Key Words: biological keywords of the protein provided by Swissprot/TrEMBL.
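
A minimal Python sketch for reading this file follows. Whether the file starts 
with a header line is not stated here, so the sketch supplies the field names 
in the order listed above and offers a has_header switch; both the switch and 
the function names are assumptions made for illustration.

```python
import csv

# Field names in the order documented above.
OCCURRENCE_FIELDS = [
    "Rank", "Count", "AC", "DB", "DB_Name", "Comment", "Protein_len",
    "CDS_len", "CDS_GC", "EBI_Name", "Secondary_AC", "Protein_Name",
    "Synonyms", "Gene_Name", "MIM", "Key Words",
]


def read_occurrences(handle, has_header=False):
    """Yield one dict per protein; Count is converted to an integer."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the header line if the file has one
    for values in reader:
        if not values:
            continue  # ignore blank lines
        row = dict(zip(OCCURRENCE_FIELDS, values))
        row["Count"] = int(row["Count"])
        yield row


def non_ribosomal(rows):
    """Drop proteins whose Comment field labels them as ribosomal ("RP")."""
    return [row for row in rows if row.get("Comment") != "RP"]
```

Filtering out the "RP" rows is one way to look at the most frequently 
retrotransposed non-ribosomal proteins on their own.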
 
 
 
Updated 12/03/2002, ZL@bioinfo.mbb.yale.edu 
Copyright 2002, All Rights Reserved