Performing assemblies, mappings and some analyses for the CONCOCT paper. Code can be found at: https://github.com/BinPro/CONCOCT
Assemblies
https://docs.google.com/spreadsheets/d/1fDZjgK43Rg-Wn2zU1P_IbxPh7ZwXgdmcmMvKl9gJNek/edit#gid=0

* NewMock{1,2,2b} - Performed on Lindgren
* Sharon2013 - Performed on Milou
Mapping
Used these crazy one-liners:
# map all reads
d=`pwd`; for s in ../../NewMock2/Sorted_Sample*R1*.fasta.gz; do mkdir -p ${s##.*/}; cd ${s##.*/}; ls bowtie2/asm_pair-smds.coverage || sbatch -d afterok:1735572 -A b2010008 -t 12:00:00 -J NewMock1-ref-map -p core -n 4 ~/bin/sbatch_job bash $METASSEMBLE_DIR/scripts/map/map-bowtie2-markduplicates.sh -ct 4 -p '-f' ../$s ../${s/R1/R2} pair ../ref.fa asm bowtie2; cd $d; done

# check for completion of coverage generation
find /proj/b2013151/nobackup/private/per-sample-asm -name *.coverage | parallel grep -q 'genome' {} '||' echo no genome in {}

# copy mapping stats
for s in NewMock2b/out_41/remap/*/*.out; do cat $s | head -50 | awk -v sample=`echo ${s} | cut -d/ -f4` -v OFS="\t" '{if ($0 ~ "were paired") {a = $1} if ($0 ~ ") aligned concordantly 0 times") {b = $1} if ($0 ~ ") aligned concordantly exactly 1 time") {c = $1} if ($0 ~ "aligned concordantly >1 times") {d = $1} if ($0 ~ ") aligned 0 times") {e = $1} if ($0 ~ "% overall alignment rate") {f = $1}} END {print sample,a,b,c,d,e,f}'; done

# copy duplication stats
find . -name *.metrics | sort -k5 -t_ | xargs -n1 awk '{if (NR==8) {print $3}}'
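For the record, the mapping-stats one-liner just pulls counts out of each bowtie2 log; here is the same awk program unrolled and commented (the sample name and log file name are placeholders):

# sketch: extract mapping stats from a single bowtie2 stderr log
awk -v sample=Sample1 -v OFS="\t" '
    /were paired/                            { paired = $1 }  # read pairs in the input
    /\) aligned concordantly 0 times/        { conc0  = $1 }  # pairs without a concordant alignment
    /\) aligned concordantly exactly 1 time/ { conc1  = $1 }  # uniquely aligned concordant pairs
    /aligned concordantly >1 times/          { concM  = $1 }  # multi-mapping concordant pairs
    /\) aligned 0 times/                     { unal   = $1 }  # mates that did not align at all
    /% overall alignment rate/               { rate   = $1 }  # bowtie2 summary percentage
    END { print sample, paired, conc0, conc1, concM, unal, rate }
' bowtie2.out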
Generate CONCOCT input files
Generate the input table for CONCOCT after mapping:
sbatch -t 06:00:00 -A b2010008 -p core -n 8 -J gen_input_table --output=concoct_inputtable_4f4685e.tsv-slurm.out ~/bin/sbatch_job time python /glob/inod/github/CONCOCT/scripts/gen_input_table.py --isbedfiles --samplenames '<(for s in *_1.fastq.gz; do echo ${s%%_1.fastq.gz}; done)' Contigs.fasta *.fastq.gz/bowtie2/asm_pair-smds.coverage '>' concoct_inputtable_4f4685e.tsv
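The quotes around <(...) and > look odd but are presumably there so that sbatch_job expands them when the job script actually runs rather than at submission time; the linkage and read-count jobs below use the same trick. Stripped of the queueing wrapper, the job body boils down to:

python /glob/inod/github/CONCOCT/scripts/gen_input_table.py --isbedfiles \
    --samplenames <(for s in *_1.fastq.gz; do echo ${s%%_1.fastq.gz}; done) \
    Contigs.fasta *.fastq.gz/bowtie2/asm_pair-smds.coverage \
    > concoct_inputtable_4f4685e.tsv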
Generate the linkage table:
sbatch -A b2010008 -J ecoli-linkage -t 06:00:00 -p node -n 16 --output=concoct_linkage_4f4685e.tsv-slurm.out ~/bin/sbatch_job time python /glob/inod/github/CONCOCT/scripts/bam_to_linkage.py -m 16 --samplenames '<(for s in *.fastq.gz; do echo ${s%%_1.fastq.gz}; done)' --fullsearch --regionlength 500 Contigs.fasta *.fastq.gz/bowtie2/asm_pair-smds.bam '>' concoct_linkage_4f4685e.tsv
Generate read counts for simulated mock data for contig to genome assignments:
sbatch -A b2010008 -p node -n 16 -t 02:00:00 -J newmock1-41-count --output=contig_count_per_genome_4f4685e.tsv-slurm.out ~/bin/sbatch_job time python /glob/inod/github/CONCOCT/scripts/contig_read_count_per_genome.py -m 16 Contigs.fasta ../../../ref.fa Sorted*/bowtie2/asm_pair-smds.bam '>' contig_count_per_genome_4f4685e.tsv
Copying files between servers
Pretty useful to copy files between servers while keeping the full path of each file:
rsync --relative --progress -avhe 'ssh -p 99999' `find NewMock2b/ -regex '.*\(Contigs.fasta\|concoct_.*.tsv.*\|contig_count_per_genome.*.tsv.*\)'` ino@xx.xx.xx:/home/ino/projects/concoct-paper-assemblies/concoct-paper-assemblies/
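It is --relative that does the trick: rsync re-creates the source path, exactly as given on the command line, under the destination. A minimal example (host and destination hypothetical):

# --relative re-creates the given source path under the destination:
rsync --relative NewMock2b/out_41/Contigs.fasta host:/dest/
# ends up as /dest/NewMock2b/out_41/Contigs.fasta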
Cutting up contigs
We found that cutting up contigs into pieces of 10 kb improved the binning:
time python ~/gitrepos/biorhino-tools/scripts/br-cut-up-fasta.py -c 10000 -o 0 -m Contigs.fasta
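The script lives in a separate repo, so for the record, here is the rough idea as an awk sketch (not the actual script: -m merges the final short piece into the previous chunk, which the sketch mimics by letting the last piece run up to twice the chunk size; the .N header suffix is an assumption):

# cut every contig into 10 kb pieces; the final piece keeps the remainder
awk -v size=10000 '
    function emit(    i, n, part) {
        n = length(seq); i = 1; part = 0
        # emit full-size pieces while at least two chunks of sequence remain
        while (n - i + 1 >= 2 * size) {
            printf(">%s.%d\n%s\n", hdr, part++, substr(seq, i, size))
            i += size
        }
        printf(">%s.%d\n%s\n", hdr, part, substr(seq, i))  # last piece: size to 2*size-1 bp
    }
    /^>/ { if (seq != "") emit(); hdr = substr($0, 2); seq = ""; next }
         { seq = seq $0 }
    END  { if (seq != "") emit() }
' Contigs.fasta > Contigs_c10K.fasta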
MetaWatt v1.7
Used GBK files to build the reference database (includes Bacteria and Archaea):
ftp://ftp.ncbi.nlm.nih.gov/genomes/Bacteria/all.gbk.tar.gz
An explanation is in the MetaWatt README:
http://sourceforge.net/projects/metawatt/files/
We cut up the contigs into 10 kb pieces and filtered out contigs < 1 kb (see the sketch after the list below). Then we simply chose a minimum bin size that resulted in the expected number of bins for high-confidence binning:
- NewMock1: high confidence, 20 clusters, min bin size 1900 kb
- NewMock2: high confidence, 101 clusters, min bin size ??
- Sharon2013: high confidence, 34 clusters, min bin size 110 kb
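The < 1 kb filter mentioned above is a one-liner in the same vein (a sketch; file names hypothetical):

# drop chunked contigs shorter than 1 kb
awk -v min=1000 '
    /^>/ { if (seq != "" && length(seq) >= min) printf("%s\n%s\n", hdr, seq)
           hdr = $0; seq = ""; next }
         { seq = seq $0 }
    END  { if (seq != "" && length(seq) >= min) printf("%s\n%s\n", hdr, seq) }
' Contigs_c10K.fasta > Contigs_c10K_gt1kb.fasta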
Used this to extract the clusters:
less Contigs.fasta.data | grep '^#con' | grep ' All ' | awk -v OFS=',' '{print $1,$3}' | sed 's/Sharon2013_high_bin_//' | sed 's/^#//' > high_confidence.csv
Checking metagenome coverage and contig assignment
We cannot get more genomes out than we assembled, so it is good to look at how well the assembly actually covers the metagenome; see the notebook. We used MUMmer for the alignment:
d=`pwd`; for c in out_41/Contigs.fasta; do mkdir -p ${c%%/*}/mummer && sbatch --output=${c%%/*}/mummer-slurm.out -p core -n 4 -A b2010008 -t 01-00:00:00 -J mummer-NewMock2b-${c%%/*} ~/bin/sbatch_job bash -x /glob/inod/github/metassemble/scripts/validate/nucmer/run-nucmer.sh genomes_above25_80_new_final.fasta $c ${c%%/*}/mummer/nucmer; cd $d; done
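Assuming run-nucmer.sh writes the standard delta file under the given output prefix, the alignments can then be inspected with MUMmer's show-coords:

# -r sorts by reference, -c adds coverage columns, -l adds sequence lengths
show-coords -rcl out_41/mummer/nucmer.delta | less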
Duplication problem in Ray
Stumbled on a duplication problem for F. prau; see also the notebook. We checked for duplication by aligning the assembly against itself with MUMmer:
for c in AnotherMock1/out_41/Contigs.fasta; do mkdir -p ${c%/*}/mummer-dup && sbatch --output=${c%/*}/mummer-dup-slurm.out -p core -n 4 -A b2010008 -t 5:00:00 -J mummer-dup-${c%/*} ~/bin/sbatch_job bash -x /glob/inod/github/metassemble/scripts/validate/nucmer/run-nucmer.sh $c $c ${c%/*}/mummer-dup/nucmer; cd $d; done
Then removed the duplicate contigs with a small gist.
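The gist itself is not reproduced here; the idea is to flag contigs whose self-alignment shows them nearly fully contained in a different, longer contig, then drop those from the fasta. A rough sketch (column positions as printed by show-coords -rcl; the 95% containment cutoff is purely illustrative, not the value we used):

# list query contigs >95% covered by an alignment to a different, longer contig
# with -rcl: $12/$13 = ref/query length, $16 = query coverage, $18/$19 = ref/query name
show-coords -rcl AnotherMock1/out_41/mummer-dup/nucmer.delta | awk '
    NR > 5 && $18 != $19 && $13 <= $12 && $16 > 95 { print $19 }
' | sort -u > duplicated_contigs.txt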