Creosote Assembly: Difference between revisions
Jump to navigation
Jump to search
Created page with ' Creosote genome sequencing and assembly notes: *Sample obtained from front yard of 4951 W. McElroy Dr. *Sequences obtained from one lane of Illumina HiSeq2000 *Fastq files deli...' |
|||
(6 intermediate revisions by 2 users not shown) | |||
Line 1: | Line 1: | ||
Creosote genome sequencing and assembly notes: | Creosote genome sequencing and assembly notes: | ||
Line 19: | Line 18: | ||
*Merge R1 files; merge R2 files | *Merge R1 files; merge R2 files | ||
*gzip them | *gzip them | ||
'''Convert sequence formats''' (I haven't tried this yet) | |||
Haibao Tang 8/5/11 11:09 AM | |||
on the conversion, i found an alternative: | |||
zcat R1_all.fastq.gz | seqret fastq-sanger::stdin fastq-illumina::stdout | gzip > R1_converted.fastq.gz | |||
install emboss | |||
just tested, takes 20-30 minutes on R1_all | |||
'''Trim sequences''' | '''Trim sequences''' | ||
*Get this package of Haibaos: | *Get this package of Haibaos: | ||
Line 35: | Line 41: | ||
SOAPdenovo31mer all -s ../../soap.config.eric -o SoapAssem -K 25 -p 16 -R -d -D -F | SOAPdenovo31mer all -s ../../soap.config.eric -o SoapAssem -K 25 -p 16 -R -d -D -F | ||
**Note: if SOAP crashes, try another XXmer binary (e.g. 63mer) | **Note: if SOAP crashes, try another XXmer binary (e.g. 63mer) | ||
'''Running Velvet''' | '''Running Velvet''' | ||
**Need to interleave reads: | **Need to interleave reads: | ||
#test run | |||
~/src/velvet_1.1.04/shuffleSequences_fastq.pl lane3_NoIndex_L003_R1_001.pairs.fastq lane3_NoIndex_L003_R2_001.pairs.fastq merged_pairs.fastq | ~/src/velvet_1.1.04/shuffleSequences_fastq.pl lane3_NoIndex_L003_R1_001.pairs.fastq lane3_NoIndex_L003_R2_001.pairs.fastq merged_pairs.fastq | ||
#full data with modified shuffleSequences script that handles gzip files | |||
~/src/velvet_1.1.04/shuffleSequences_fastq_gzip.pl R1_all.pairs.fastq.gz R2_all.pairs.fastq.gz merged_pairs.fastq.gz | |||
**set threading of velvet with env var | **set threading of velvet with env var | ||
export OMP_NUM_THREADS=32 | export OMP_NUM_THREADS=32 | ||
Line 43: | Line 54: | ||
OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz | OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz | ||
OMP_NUM_THREADS=32 velvetg VelvetAssem -scaffolding yes -exp_cov auto -cov_cutoff auto -min_contig_lgth 200 -ins_length 150 | OMP_NUM_THREADS=32 velvetg VelvetAssem -scaffolding yes -exp_cov auto -cov_cutoff auto -min_contig_lgth 200 -ins_length 150 | ||
#FULL RUN | |||
OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz R1_all.frags.fastq.gz -short -fastq.gz R2_all.frags.fastq.gz | |||
Line 79: | Line 93: | ||
*Cat all the R1s together | *Cat all the R1s together | ||
*Cat all the R2s together | *Cat all the R2s together | ||
==Assembly using ABySS== | |||
By James S | |||
ABySS requires paired end reads be marked in the names with "A" "B" or "1" "2" or "F" "R". The current read files didn't have this information, so I added an "A" to the end of every name in R1_all.pairs.fastq.gz using a perl one liner, and a "B" to the end of every name in R2_all.pairs.fastq.gz | |||
Since this is paired end data I used the ABySS pair-end driver script 'abyss-pe' | |||
Invocation: | |||
*First attempt: abyss-pe k=55 n=10 -j2 name=creosote in="R1_all.pairs.fastq.gz R2_all.pairs.fastq.gz" se="merged_pairs.fastq.gz R1_all.frags.fastq.gz R2_all_frags.fastq.gz" <-- ABySS dies because the sequences in merged_pairs are not always the same length as the quality score strings | |||
Note by Eric: merged_pairs.fastq.gz is the merge of R1_all.pairs and R2_all.pairs. This merged file was required by velvet. | |||
*Second attempt (no single end data and added "A" and "B" to paired end reads): abyss-pe -j2 k=55 n=10 name=creosote in="R1_all_pairsA.fastq R2_all_pairsB.fastq" | |||
*Third attempt (trying for longer contigs): abyss-pe -j2 k=64 n=5 name=creosote90 in="R1_all_pairsA.fastq R2_all_pairsB.fastq" |
Latest revision as of 14:40, 31 August 2011
Creosote genome sequencing and assembly notes:
- Sample obtained from front yard of 4951 W. McElroy Dr.
- Sequences obtained from one lane of Illumina HiSeq2000
- Fastq files delivered from UAGC
- 82 files
- Headers are Sanger format (code 33)
- Description of Fastq file format with notes on specific decoding of header names generated by various technologies: http://en.wikipedia.org/wiki/FASTQ_format
- Pairend reads
- lane3_NoIndex_L003_R1_041.fastq
- lane3_NoIndex_L003_R2_041.fastq
- Need to get adapter sequences used in sequencing
- TGACCA (Not present in sequence reads)
- Check quality with fastqc: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
Steps:
- Merge R1 files; merge R2 files
- gzip them
Convert sequence formats (I haven't tried this yet)
Haibao Tang 8/5/11 11:09 AM on the conversion, i found an alternative: zcat R1_all.fastq.gz | seqret fastq-sanger::stdin fastq-illumina::stdout | gzip > R1_converted.fastq.gz install emboss just tested, takes 20-30 minutes on R1_all
Trim sequences
- Get this package of Haibaos:
- git clone git://github.com/tanghaibao/jcvi.git
- SET PATH: export PYTHONPATH=/lib/python (which is the dir above jcvi)
- may need to install biopython: sudo easy_install biopython
- Run this: python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz
- Automatically trims and cleans sequences, also does the conversion to appropriate fastq format
- NOTE: This program should download trimmomatic, but may need to update the path of the timmomatic program in the program
- If the Trimmer script fails for silly reasons, you can run it from the command-line:
java -Xmx4g -cp Trimmomatic-0.13/trimmomatic-0.13.jar org.usadellab.trimmomatic.TrimmomaticPE lane3_NoIndex_L003_R1_001.b64.fastq.gz lane3_NoIndex_L003_R2_001.b64.fastq.gz lane3_NoIndex_L003_R1_001.pairs.fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz lane3_NoIndex_L003_R2_001.pairs.fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz ILLUMINACLIP:adapters.fasta:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
Genome Assembly
- Note: Bao recommends CLC for genome assembly. Runs faster, less memory, less sensitive to bad data. Compute intensive. THIS IS COMMERCIAL SOFTWARE
Running SOAPdenovo
SOAPdenovo31mer all -s ../../soap.config.eric -o SoapAssem -K 25 -p 16 -R -d -D -F
- Note: if SOAP crashes, try another XXmer binary (e.g. 63mer)
Running Velvet
- Need to interleave reads:
#test run ~/src/velvet_1.1.04/shuffleSequences_fastq.pl lane3_NoIndex_L003_R1_001.pairs.fastq lane3_NoIndex_L003_R2_001.pairs.fastq merged_pairs.fastq #full data with modified shuffleSequences script that handles gzip files ~/src/velvet_1.1.04/shuffleSequences_fastq_gzip.pl R1_all.pairs.fastq.gz R2_all.pairs.fastq.gz merged_pairs.fastq.gz
- set threading of velvet with env var
export OMP_NUM_THREADS=32
- running velveth
OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz OMP_NUM_THREADS=32 velvetg VelvetAssem -scaffolding yes -exp_cov auto -cov_cutoff auto -min_contig_lgth 200 -ins_length 150 #FULL RUN OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz R1_all.frags.fastq.gz -short -fastq.gz R2_all.frags.fastq.gz
Other Stuff
Trimming reads
- Trim Paired ends with Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
- Assumes Illumina Encoding (code: 64, not code: 33)
- Need to convert for the HighSeq Reads:
- easy_install biopython
- git clone git://github.com/tanghaibao/jcvi.git
- export PYTHONPATH=/lib/python (which is the dir above jcvi)
- python -m jcvi.formats.fastq (Install missing packages)
Cleaning Single Reads:
- Sequences cleaned using trimReads by Haibao Tang: https://github.com/tanghaibao/trimReads/tree/
- Ran with supplied adapter sequence file:
>Adapter 4 TGACCA >Adapter 4 rc TGGTCA
- Command-line run:
Running /home/elyons/bin/trimReads -Q 33 -f /home/elyons/projects/genome/data/creosote/Sample_lane3/adapter/adapter.faa ./lane3_NoIndex_L003_R2_015.fastq
Converting sequences
- python -m jcvi.formats.fastq convert (read help file, default conversion Sanger (code 33) to Illumina (code 64)
Other programs to clean sequences
- python -m jcvi.apps.baseclean trim fastqfile (single ended)
- python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz (paired ended)
keep sequences in single files (or two files for a pair of reads)
- Cat all the R1s together
- Cat all the R2s together
Assembly using ABySS
By James S
ABySS requires paired end reads be marked in the names with "A" "B" or "1" "2" or "F" "R". The current read files didn't have this information, so I added an "A" to the end of every name in R1_all.pairs.fastq.gz using a perl one liner, and a "B" to the end of every name in R2_all.pairs.fastq.gz
Since this is paired end data I used the ABySS pair-end driver script 'abyss-pe'
Invocation:
- First attempt: abyss-pe k=55 n=10 -j2 name=creosote in="R1_all.pairs.fastq.gz R2_all.pairs.fastq.gz" se="merged_pairs.fastq.gz R1_all.frags.fastq.gz R2_all_frags.fastq.gz" <-- ABySS dies because the sequences in merged_pairs are not always the same length as the quality score strings
Note by Eric: merged_pairs.fastq.gz is the merge of R1_all.pairs and R2_all.pairs. This merged file was required by velvet.
- Second attempt (no single end data and added "A" and "B" to paired end reads): abyss-pe -j2 k=55 n=10 name=creosote in="R1_all_pairsA.fastq R2_all_pairsB.fastq"
- Third attempt (trying for longer contigs): abyss-pe -j2 k=64 n=5 name=creosote90 in="R1_all_pairsA.fastq R2_all_pairsB.fastq"