See http://bioinfo.mbb.yale.edu/permissions.shtml for usage terms.

Please note that the pipeline is not engineered for casual use by
relatively novice users. You may need to roll up your sleeves and have a
look at the code from time to time.

1) BLAST an organism's genome against its proteome. We typically split
   the proteome into bite-size chunks and run a number of concurrent
   BLASTs. Use the '-m 8' option to produce tab-delimited output.

2) 'processBlastOutput.py' will convert the BLAST output into a form
   appropriate for use by the pipeline.

   Note: This script takes two arguments: a file containing the proteome
   in FASTA format and a directory in which the various BLAST split
   outputs are located. The script is hardwired to look for the pattern
   'splitXXXXOut', where each X is a digit. It would be straightforward
   to change this pattern if you wish.

3) The pipeline needs data to mask out known genes. There are a couple
   of options here:

     i) Provide null exon data files (one per chromosome). No masking.
    ii) Provide coordinates that span an entire gene. Intronic regions
        will then be masked.
   iii) Provide coordinates of just exons.

   Pseudogene.org uses option iii). The requisite data is extracted from
   the Ensembl mysql files:

       exon.txt.table
       exon_transcript.txt.table
       seq_region.txt.table
       translation.txt.table
       translation_stable_id.txt.table

   by the script 'extractKPExonLocations.py'. This script assumes that
   it is executed in the directory containing these files, and it takes
   as its argument a file containing the proteome in FASTA format.

4) Set environment variables. Here's an example file for bash:

       # We tend to gather all files together under one subdirectory.
       dataDir=/home1/njc2/bioInformatics/genomes/dr/34.5b

       # Describes a pattern used to find all BLAST output files. Note the
       # 'P': Step 1) produced output segregated by chromosome and strand.
       # Pseudogene.org runs the pipeline once for the plus ('P') strand and
       # once for the minus ('M') strand.
       #
       # Here and elsewhere, '%s' will be replaced by chromosome identifiers
       # during the execution of the script. E.g., if the script is run with
       # the arguments '1 X', then it will look for files named
       # 'chr1_P_blastHits.sorted' and 'chrX_P_blastHits.sorted'.
       export BlastoutSortedTemplate=${dataDir}/pgpipe/chr%s_P_blastHits.sorted

       # Location of chromosome DNA files from Ensembl.
       export ChromosomeFastaTemplate=${dataDir}/dna/Danio_rerio.ZFISH5.oct.dna.chromosome.%s.fa

       # Location of mask files (see Step 3) above).
       export ExonMaskTemplate=${dataDir}/mysql/chr%s_exLocs

       # The columns in the mask file that provide start and stop data (0-based).
       export ExonMaskFields='2 3'

       # Location of the FASTA program tfasty34.
       export FastaProgram=/home1/njc2/fromColossusHome/bioInformatics/fasta/tfasty34

       # The proteome in FASTA format.
       export ProteinQueryFile=${dataDir}/pep/Danio_rerio.ZFISH5.oct.pep.known.fa

5) In a fresh directory, run the pipeline via the script
   'runScripts.py'. We typically create directories 'plus' and 'minus'
   to keep the two runs separate and then merge the results later.
   'runScripts.py' expects a list of chromosomes as arguments.

   Note: The analysis of one chromosome is independent of all the
   others, so you may choose to run one or two small ones to test the
   setup before running with a list of all of the chromosomes. E.g.:

       mkdir plus
       cd plus
       source plusEnvVariables
       ${PipelinePath}/runScripts.py 22
       # if the results look good...
       ${PipelinePath}/runScripts.py 1 2 3 ...

6) The final output lives in the subdirectory 'pgenes'. Other
   directories contain intermediate results.
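
Appendix: checking the environment before a run

The '%s' templates from step 4) are easy to get subtly wrong (a
misspelled file name only shows up once a chromosome is processed). The
following is a minimal sketch, not part of the pipeline, of how the
templates could be expanded and checked before starting a run. The
function names (expand_templates, parse_mask_fields, missing_inputs)
are hypothetical, and the sketch assumes that each template contains
exactly one '%s' and is filled in with ordinary Python string
formatting; how 'runScripts.py' actually performs the substitution has
not been verified here.

```python
import os

# Environment variables from step 4) that contain a '%s' placeholder.
TEMPLATE_VARS = (
    "BlastoutSortedTemplate",
    "ChromosomeFastaTemplate",
    "ExonMaskTemplate",
)


def expand_templates(env, chromosomes):
    """Return {chromosome: {variable: path}} with '%s' replaced by the chromosome id.

    `env` is any mapping of variable names to values, e.g. os.environ.
    """
    expanded = {}
    for chrom in chromosomes:
        expanded[chrom] = {name: env[name] % chrom for name in TEMPLATE_VARS}
    return expanded


def parse_mask_fields(value):
    """Turn an ExonMaskFields string such as '2 3' into (start, stop) column indices."""
    start, stop = (int(field) for field in value.split())
    return start, stop


def missing_inputs(env, chromosomes):
    """List every expected per-chromosome input file that does not exist on disk."""
    missing = []
    for paths in expand_templates(env, chromosomes).values():
        missing.extend(path for path in paths.values() if not os.path.isfile(path))
    return sorted(missing)
```

After sourcing the environment file, calling
missing_inputs(os.environ, ['22']) returns the paths that would not be
found for chromosome 22; an empty list suggests the templates point at
real files. The check covers only the three templated variables, not
FastaProgram or ProteinQueryFile.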