Difference between revisions of "Creosote Assembly"

From CoGepedia
Jump to: navigation, search
Line 93: Line 93:
 
*Cat all the R1s together
 
*Cat all the R1s together
 
*Cat all the R2s together
 
*Cat all the R2s together
 +
 +
==Assembly using ABySS==
 +
ABySS requires paired end reads be marked in the names with "A" "B" or "1" "2" or "F" "R". The current read files didn't have this information, so I added an "A" to the end of every name in R1_all.pairs.fastq.gz using a perl one liner, and a "B" to the end of every name in R2_all.pairs.fastq.gz
 +
 +
Since this is paired end data I used the ABySS pair-end driver script 'abyss-pe'
 +
 +
Invocation:
 +
*First attempt: abyss-pe k=55 n=10 in="R1_all.pairs.fastq.gz R2_all.pairs.gastq.gz" se="merged_pairs.fastq.gz R1_all.frags.fastq.gz R2_all_frags.fastq.gz" <-- ABySS dies because the sequences in merged_pairs are not always the same length as the quality score strings
 +
*Second attempt (no single end data and added "A" and "B" to paired end reads): abyss-pe k=55 n=10 in="R1_all_pairsA.fastq R2_all_pairsB.fastq"

Revision as of 11:27, 29 August 2011

Creosote genome sequencing and assembly notes:

  • Sample obtained from front yard of 4951 W. McElroy Dr.
  • Sequences obtained from one lane of Illumina HiSeq2000
  • Fastq files delivered from UAGC
    • 82 files
    • Headers are Sanger format (code 33)
    • Pairend reads
      • lane3_NoIndex_L003_R1_041.fastq
      • lane3_NoIndex_L003_R2_041.fastq
    • Need to get adapter sequences used in sequencing
      • TGACCA (Not present in sequence reads)
  • Check quality with fastqc: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/

Steps:

  • Merge R1 files; merge R2 files
  • gzip them

Convert sequence formats (I haven't tried this yet)

Haibao Tang 8/5/11 11:09 AM 
on the conversion, i found an alternative:
zcat R1_all.fastq.gz | seqret fastq-sanger::stdin fastq-illumina::stdout | gzip > R1_converted.fastq.gz
install emboss
just tested, takes 20-30 minutes on R1_all

Trim sequences

  • Get this package of Haibaos:
  • Run this: python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz
    • Automatically trims and cleans sequences, also does the conversion to appropriate fastq format
    • NOTE: This program should download trimmomatic, but may need to update the path of the timmomatic program in the program
  • If the Trimmer script fails for silly reasons, you can run it from the command-line:
java -Xmx4g -cp Trimmomatic-0.13/trimmomatic-0.13.jar org.usadellab.trimmomatic.TrimmomaticPE lane3_NoIndex_L003_R1_001.b64.fastq.gz lane3_NoIndex_L003_R2_001.b64.fastq.gz lane3_NoIndex_L003_R1_001.pairs.fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz lane3_NoIndex_L003_R2_001.pairs.fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz ILLUMINACLIP:adapters.fasta:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36

Genome Assembly

  • Note: Bao recommends CLC for genome assembly. Runs faster, less memory, less sensitive to bad data. Compute intensive. THIS IS COMMERCIAL SOFTWARE

Running SOAPdenovo

SOAPdenovo31mer all -s ../../soap.config.eric -o SoapAssem -K 25 -p 16 -R -d -D -F
    • Note: if SOAP crashes, try another XXmer binary (e.g. 63mer)

Running Velvet

    • Need to interleave reads:
#test run
~/src/velvet_1.1.04/shuffleSequences_fastq.pl lane3_NoIndex_L003_R1_001.pairs.fastq lane3_NoIndex_L003_R2_001.pairs.fastq merged_pairs.fastq

#full data with modified shuffleSequences script that handles gzip files
~/src/velvet_1.1.04/shuffleSequences_fastq_gzip.pl R1_all.pairs.fastq.gz R2_all.pairs.fastq.gz merged_pairs.fastq.gz
    • set threading of velvet with env var
export OMP_NUM_THREADS=32
    • running velveth
OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz 
OMP_NUM_THREADS=32 velvetg VelvetAssem -scaffolding yes -exp_cov auto -cov_cutoff auto -min_contig_lgth 200 -ins_length 150
#FULL RUN
 OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz R1_all.frags.fastq.gz -short -fastq.gz R2_all.frags.fastq.gz




Other Stuff Trimming reads

Cleaning Single Reads:

>Adapter 4
TGACCA
>Adapter 4 rc
TGGTCA
    • Command-line run:
Running /home/elyons/bin/trimReads  -Q 33 -f /home/elyons/projects/genome/data/creosote/Sample_lane3/adapter/adapter.faa ./lane3_NoIndex_L003_R2_015.fastq


Converting sequences

  • python -m jcvi.formats.fastq convert (read help file, default conversion Sanger (code 33) to Illumina (code 64)

Other programs to clean sequences

  • python -m jcvi.apps.baseclean trim fastqfile (single ended)
  • python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz (paired ended)

keep sequences in single files (or two files for a pair of reads)

  • Cat all the R1s together
  • Cat all the R2s together

Assembly using ABySS

ABySS requires paired end reads be marked in the names with "A" "B" or "1" "2" or "F" "R". The current read files didn't have this information, so I added an "A" to the end of every name in R1_all.pairs.fastq.gz using a perl one liner, and a "B" to the end of every name in R2_all.pairs.fastq.gz

Since this is paired end data I used the ABySS pair-end driver script 'abyss-pe'

Invocation:

  • First attempt: abyss-pe k=55 n=10 in="R1_all.pairs.fastq.gz R2_all.pairs.gastq.gz" se="merged_pairs.fastq.gz R1_all.frags.fastq.gz R2_all_frags.fastq.gz" <-- ABySS dies because the sequences in merged_pairs are not always the same length as the quality score strings
  • Second attempt (no single end data and added "A" and "B" to paired end reads): abyss-pe k=55 n=10 in="R1_all_pairsA.fastq R2_all_pairsB.fastq"