Creosote: Difference between revisions

Revision as of 23:00, 4 August 2011

Creosote genome sequencing and assembly notes:

Sample obtained from front yard of 4951 W. McElroy Dr.
Sequences obtained from one lane of Illumina HiSeq2000
Fastq files delivered from UAGC
- 82 files
- Quality scores are in Sanger format (code 33)
  - Description of Fastq file format with notes on specific decoding of header names generated by various technologies: http://en.wikipedia.org/wiki/FASTQ_format
- Paired-end reads
  - lane3_NoIndex_L003_R1_041.fastq
  - lane3_NoIndex_L003_R2_041.fastq
- Need to get adapter sequences used in sequencing
  - TGACCA (Not present in sequence reads)
Check quality with fastqc: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
- Creosote First Run FastQC
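
The "code 33" versus "code 64" distinction matters for the trimming step below, so it can be worth checking a file before processing it. The following is a minimal sketch, not part of the original notes: `guess_phred_offset` is a hypothetical helper that looks at the lowest quality character in the first reads. It assumes plain 4-line FASTQ records and returns None when the data is ambiguous (for example old Solexa-scaled files).

```python
from itertools import islice


def guess_phred_offset(lines, max_records=10000):
    """Guess the quality offset (33 or 64) from an iterable of FASTQ lines.

    Hypothetical helper for illustration. Assumes 4-line records.
    Returns 33, 64, or None if the sample is empty or ambiguous.
    """
    lowest = None
    # The quality string is the 4th line of every record (index 3, 7, 11, ...).
    for line in islice(lines, 3, max_records * 4, 4):
        qual = line.rstrip("\r\n")
        if qual:
            low = min(map(ord, qual))
            lowest = low if lowest is None else min(lowest, low)
    if lowest is None:
        return None
    if lowest < 59:
        # Characters below ';' cannot occur in any +64 encoding.
        return 33
    if lowest >= 64:
        # Nothing below '@' was seen, so +64 is the likely encoding.
        return 64
    return None
```

Any iterable of lines works, so an open file handle (or `gzip.open(path, "rt")` for a compressed file) can be passed directly. A result of 64 is only a guess: a small sample of uniformly high-quality code 33 reads would look the same.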

Trimming reads

Trim Paired ends with Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
Assumes Illumina Encoding (code: 64, not code: 33)
- Need to convert for the HiSeq reads:
- easy_install biopython
- git clone git://github.com/tanghaibao/jcvi.git
- export PYTHONPATH=/lib/python (which is the dir above jcvi)
- python -m jcvi.formats.fastq (Install missing packages)

Steps:

Merge R1 files; merge R2 files
gzip them
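
A minimal sketch of the merge-and-gzip step, assuming the delivered chunks follow the `..._R1_NNN.fastq` / `..._R2_NNN.fastq` naming shown above. The function names are hypothetical. The point of sorting both lists the same way is that read N of the merged R1 file must still be the mate of read N of the merged R2 file.

```python
import gzip
import shutil


def split_by_read(filenames):
    """Return (r1_files, r2_files), each sorted so chunk numbers line up."""
    r1 = sorted(f for f in filenames if "_R1_" in f)
    r2 = sorted(f for f in filenames if "_R2_" in f)
    # Every R1 chunk needs an R2 partner, otherwise pairing is lost.
    if [f.replace("_R1_", "_R2_") for f in r1] != r2:
        raise ValueError("R1 and R2 chunks do not pair up")
    return r1, r2


def merge_and_gzip(inputs, output):
    """Concatenate uncompressed FASTQ files, in the given order, into one .gz file."""
    with gzip.open(output, "wb") as out:
        for path in inputs:
            with open(path, "rb") as src:
                shutil.copyfileobj(src, out)
```

Typical use would be `r1, r2 = split_by_read(glob.glob("*.fastq"))` followed by one `merge_and_gzip` call per list, writing for example R1.fastq.gz and R2.fastq.gz.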

Trim sequences

Run this: python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz
- Automatically trims and cleans sequences, also does the conversion to appropriate fastq format
- NOTE: This program should download Trimmomatic automatically, but you may need to update the path to Trimmomatic inside the script
If the trimming script fails for silly reasons, you can run Trimmomatic directly from the command line:

java -Xmx4g -cp Trimmomatic-0.13/trimmomatic-0.13.jar org.usadellab.trimmomatic.TrimmomaticPE lane3_NoIndex_L003_R1_001.b64.fastq.gz lane3_NoIndex_L003_R2_001.b64.fastq.gz lane3_NoIndex_L003_R1_001.pairs.fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz lane3_NoIndex_L003_R2_001.pairs.fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz ILLUMINACLIP:adapters.fasta:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
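
Since there are many R1/R2 chunks, the command above is easier to reuse when it is built from the two file prefixes. This is a sketch only: `trimmomatic_pe_command` is a hypothetical helper, and the jar path and adapters.fasta are simply the values from the command above. The argument order is two inputs, then paired and unpaired output for R1, then paired and unpaired output for R2.

```python
def trimmomatic_pe_command(prefix_r1, prefix_r2,
                           jar="Trimmomatic-0.13/trimmomatic-0.13.jar",
                           adapters="adapters.fasta"):
    """Build the TrimmomaticPE argument list for one pair of read files."""
    return [
        "java", "-Xmx4g", "-cp", jar,
        "org.usadellab.trimmomatic.TrimmomaticPE",
        # Inputs (already converted to code 64).
        prefix_r1 + ".b64.fastq.gz",
        prefix_r2 + ".b64.fastq.gz",
        # R1 outputs: reads whose mate survived, then orphaned reads.
        prefix_r1 + ".pairs.fastq.gz",
        prefix_r1 + ".frags.fastq.gz",
        # R2 outputs, same layout.
        prefix_r2 + ".pairs.fastq.gz",
        prefix_r2 + ".frags.fastq.gz",
        # Adapter clipping: 2 seed mismatches, palindrome threshold 40,
        # simple clip threshold 15.
        "ILLUMINACLIP:%s:2:40:15" % adapters,
        "LEADING:3",           # drop leading bases below quality 3
        "TRAILING:3",          # drop trailing bases below quality 3
        "SLIDINGWINDOW:4:15",  # cut when a 4-base window averages below 15
        "MINLEN:36",           # discard reads shorter than 36 bases
    ]
```

The list can be handed to `subprocess.run(cmd, check=True)`, or joined with spaces to print the exact command line for the notes.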

Genome Assembly

Note: Bao recommends CLC for genome assembly. Runs faster, less memory, less sensitive to bad data. Compute intensive. THIS IS COMMERCIAL SOFTWARE

Running SOAPdenovo

SOAPdenovo31mer all -s ../../soap.config.eric -o SoapAssem -K 25 -p 16 -R -d -D -F

- Note: if SOAP crashes, try another XXmer binary (e.g. 63mer)
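
The contents of soap.config.eric are not recorded in these notes. As a rough sketch of what such a file could look like, the hypothetical helper below writes a single-library SOAPdenovo config. Everything numeric here is an assumption: avg_ins=150 is borrowed from the -ins_length used for Velvet below, max_rd_len=100 is a guess at the read length, and the file names are placeholders.

```python
def soap_config(pairs_r1, pairs_r2, frags=(), avg_ins=150, max_rd_len=100):
    """Return the text of a one-library SOAPdenovo config file (illustrative)."""
    lines = [
        "max_rd_len=%d" % max_rd_len,  # assumed read length, not from the notes
        "[LIB]",
        "avg_ins=%d" % avg_ins,        # assumed average insert size
        "reverse_seq=0",               # forward-reverse (short insert) library
        "asm_flags=3",                 # use reads for contigs and scaffolds
        "rank=1",
        "q1=" + pairs_r1,              # paired FASTQ, read 1
        "q2=" + pairs_r2,              # paired FASTQ, read 2
    ]
    # Orphaned reads left over from trimming go in as single-end FASTQ.
    lines += ["q=" + path for path in frags]
    return "\n".join(lines) + "\n"
```

Writing the returned string to a file and passing that file to -s would mirror the command above. The real config used for this run may differ.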

Running Velvet

- Need to interleave reads:

~/src/velvet_1.1.04/shuffleSequences_fastq.pl lane3_NoIndex_L003_R1_001.pairs.fastq lane3_NoIndex_L003_R2_001.pairs.fastq merged_pairs.fastq

- Set the number of Velvet threads with an environment variable:

export OMP_NUM_THREADS=32

- Run velveth, then velvetg:

OMP_NUM_THREADS=32 velveth VelvetAssem 31 -shortPaired -fastq.gz merged_pairs.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz -short -fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz 
OMP_NUM_THREADS=32 velvetg VelvetAssem -scaffolding yes -exp_cov auto -cov_cutoff auto -min_contig_lgth 200 -ins_length 150
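
For reference, the interleaving step above (shuffleSequences_fastq.pl) amounts to alternating one record from R1 with its mate from R2. The sketch below is not the Velvet script itself, just a stand-in that shows the idea. It assumes 4-line FASTQ records and two inputs that are already in the same order, and it fails loudly if the files have different lengths.

```python
from itertools import islice


def read_records(lines):
    """Yield FASTQ records as lists of 4 lines."""
    it = iter(lines)
    while True:
        record = list(islice(it, 4))
        if not record:
            return
        if len(record) != 4:
            raise ValueError("truncated FASTQ record")
        yield record


def interleave(r1_lines, r2_lines):
    """Yield lines alternating one R1 record with its R2 mate."""
    r2 = read_records(r2_lines)
    for rec1 in read_records(r1_lines):
        rec2 = next(r2, None)
        if rec2 is None:
            raise ValueError("R2 has fewer records than R1")
        yield from rec1
        yield from rec2
    if next(r2, None) is not None:
        raise ValueError("R1 has fewer records than R2")
```

Note that the shuffle command writes merged_pairs.fastq while velveth reads merged_pairs.fastq.gz, so the merged file presumably gets gzipped in between. With this sketch, writing the output through `gzip.open("merged_pairs.fastq.gz", "wt")` would do both in one pass.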

Other Stuff

Cleaning Single Reads:

Sequences cleaned using trimReads by Haibao Tang: https://github.com/tanghaibao/trimReads/tree/
- Ran with supplied adapter sequence file:

>Adapter 4
TGACCA
>Adapter 4 rc
TGGTCA

- Command-line run:

Running /home/elyons/bin/trimReads  -Q 33 -f /home/elyons/projects/genome/data/creosote/Sample_lane3/adapter/adapter.faa ./lane3_NoIndex_L003_R2_015.fastq
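
The adapter file above lists the adapter and its reverse complement ("rc"). A small sketch for generating such a file from the forward sequence, so the two entries cannot drift apart. The function names are hypothetical and the output layout simply copies the file shown above.

```python
# Translation table for DNA bases (N stays N).
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def adapter_fasta(name, seq):
    """Return FASTA text with the adapter and its reverse complement."""
    return ">%s\n%s\n>%s rc\n%s\n" % (name, seq, name, reverse_complement(seq))
```

Calling `adapter_fasta("Adapter 4", "TGACCA")` reproduces the four lines of the adapter file used here, with TGGTCA as the reverse complement.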

Converting sequences

python -m jcvi.formats.fastq convert (read the help file; the default conversion is Sanger (code 33) to Illumina (code 64))
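
What the Sanger (code 33) to Illumina (code 64) conversion does to each read is simple: every quality character is shifted up by 31, and nothing else changes. The sketch below illustrates that idea only. It is not the jcvi implementation, the function names are hypothetical, and it assumes 4-line FASTQ records.

```python
def sanger_to_illumina(qual):
    """Shift a code 33 quality string to code 64 (add 31 to each character)."""
    shifted = []
    for char in qual:
        code = ord(char)
        # Below '!' is not valid code 33; above '_' would leave printable ASCII.
        if code < 33 or code + 31 > 126:
            raise ValueError("quality character out of range: %r" % char)
        shifted.append(chr(code + 31))
    return "".join(shifted)


def convert_lines(lines):
    """Yield FASTQ lines with every quality line (4th of each record) converted."""
    for index, line in enumerate(lines):
        if index % 4 == 3:
            body = line.rstrip("\r\n")
            # Keep the original line ending untouched.
            yield sanger_to_illumina(body) + line[len(body):]
        else:
            yield line
```

For example, a quality of 0 is "!" in code 33 and "@" in code 64, which is why files in the wrong encoding look like very low or impossibly high quality to downstream tools.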

Other programs to clean sequences

python -m jcvi.apps.baseclean trim fastqfile (single-end)
python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz (paired-end)

Keep sequences in single files (or two files for a pair of reads)

Cat all the R1s together
Cat all the R2s together

@@ Line 5: / Line 5: @@
 *Fastq files delivered from UAGC
 **82 files
+**Headers are Sanger format (code 33)
+***Description of Fastq file format with notes on specific decoding of header names generated by various technologies: http://en.wikipedia.org/wiki/FASTQ_format
+**Pairend reads
 ***lane3_NoIndex_L003_R1_041.fastq
 ***lane3_NoIndex_L003_R2_041.fastq
-**Need to understand if these are paired-end reads
 **Need to get adapter sequences used in sequencing
-**Description of Fastq file format with notes on specific decoding of header names generated by various technologies: http://en.wikipedia.org/wiki/FASTQ_format
+***TGACCA (Not present in sequence reads)
 *Check quality with fastqc: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
 **[[Creosote First Run FastQC]]
+'''Trimming reads'''
-*Sequences cleaned using trimReads by Haibao Tang: https://github.com/tanghaibao/trimReads/tree/
-**NOte:  Only use on single reads
-**Ran with supplied adapter sequence file:
- >Adapter 4
- TGACCA
- >Adapter 4 rc
- TGGTCA
-**Command-line run:
- Running /home/elyons/bin/trimReads  -Q 33 -f /home/elyons/projects/genome/data/creosote/Sample_lane3/adapter/adapter.faa ./lane3_NoIndex_L003_R2_015.fastq
-**Output of trimReads:
-''New Notes for processing"
 *Trim Paired ends with Trimmomatic: http://www.usadellab.org/cms/index.php?page=trimmomatic
 *Assumes Illumina Encoding (code: 64, not code: 33)
@@ Line 39: / Line 26: @@
 *Merge R1 files; merge R2 files
 *gzip them
+'''Trim sequences'''
 *Run this: python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz
+**Automatically trims and cleans sequences, also does the conversion to appropriate fastq format
 **NOTE: This program should download trimmomatic, but may need to update the path of the timmomatic program in the program
-*Note:  Bao recommends CLC for genome assembly.  Runs faster, less memory, less sensitive to bad data.  Compute intensive.
 *If the Trimmer script fails for silly reasons, you can run it from the command-line:
   java -Xmx4g -cp Trimmomatic-0.13/trimmomatic-0.13.jar org.usadellab.trimmomatic.TrimmomaticPE lane3_NoIndex_L003_R1_001.b64.fastq.gz lane3_NoIndex_L003_R2_001.b64.fastq.gz lane3_NoIndex_L003_R1_001.pairs.fastq.gz lane3_NoIndex_L003_R1_001.frags.fastq.gz lane3_NoIndex_L003_R2_001.pairs.fastq.gz lane3_NoIndex_L003_R2_001.frags.fastq.gz ILLUMINACLIP:adapters.fasta:2:40:15 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36
-*Running SOAPdenovo:
+'''Genome Assembly'''
+*Note:  Bao recommends CLC for genome assembly.  Runs faster, less memory, less sensitive to bad data.  Compute intensive.  THIS IS COMMERCIAL SOFTWARE
+'''Running SOAPdenovo'''
   SOAPdenovo31mer all -s ../../soap.config.eric -o SoapAssem -K 25 -p 16 -R -d -D -F
 **Note: if SOAP crashes, try another XXmer binary (e.g. 63mer)
 '''Running Velvet'''
 **Need to interleave reads:
@@ Line 62: / Line 52: @@
 '''Other Stuff'''
-*python -m jcvi.formats.fastq convert  (read help file, default converstion
+Cleaning Single Reads:
+*Sequences cleaned using trimReads by Haibao Tang: https://github.com/tanghaibao/trimReads/tree/
+**Ran with supplied adapter sequence file:
+ >Adapter 4
+ TGACCA
+ >Adapter 4 rc
+ TGGTCA
+**Command-line run:
+ Running /home/elyons/bin/trimReads  -Q 33 -f /home/elyons/projects/genome/data/creosote/Sample_lane3/adapter/adapter.faa ./lane3_NoIndex_L003_R2_015.fastq
+'''Converting sequences'''
+*python -m jcvi.formats.fastq convert  (read help file, default conversion Sanger (code 33) to Illumina (code 64)
+'''Other programs to clean sequences'''
 *python -m jcvi.apps.baseclean trim fastqfile (single ended)
 *python -m jcvi.apps.baseclean trim R1.fastq.gz R2.fastq.gz (paired ended)
+'''keep sequences in single files (or two files for a pair of reads)'''
 *Cat all the R1s together
 *Cat all the R2s together
