I have been involved in a couple of projects that relied heavily on ChIP-seq data. See publications:

  1. male breast cancer project - published in Nature Communications
  2. prostate cancer project - published in Nature Communications

I wrote a Snakemake pipeline and Python scripts for robust, reproducible processing and visualization of the data.

Snakemake pipeline

The pipeline is hosted in a GitHub repository.

Roughly, the pipeline runs the following steps:

  • Downloading raw data (BAM or FASTQ files) from the locations (local, remote, or GEO) specified in DataList.csv
  • Alignment with bwa-mem (in case of fastq files)
  • Marking duplicate reads with picard
  • Removing low-quality reads (retain reads with mapping quality > 20)
  • Peak calling with MACS1.4/MACS2/DFilter (more than one peak caller can be used)
  • Taking the intersection of the resulting peak sets
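
To make the filtering and intersection steps concrete, here is a minimal pure-Python sketch of the logic. It is not the pipeline's code, which lives in the repository. The function names (keep_high_mapq, intersect_peaks), the tuple representation of a peak, and the toy coordinates are all assumptions made for illustration.

```python
from typing import Iterable, List, Tuple

# Hypothetical peak representation: (chromosome, start, end), BED-style half-open.
Peak = Tuple[str, int, int]


def keep_high_mapq(sam_lines: Iterable[str], min_mapq: int = 20) -> List[str]:
    """Keep SAM alignment lines whose MAPQ (column 5) is strictly above min_mapq."""
    kept = []
    for line in sam_lines:
        if line.startswith("@"):  # header lines pass through unchanged
            kept.append(line)
            continue
        fields = line.rstrip("\n").split("\t")
        if int(fields[4]) > min_mapq:  # MAPQ is the fifth tab-separated field
            kept.append(line)
    return kept


def intersect_peaks(a: Iterable[Peak], b: Iterable[Peak]) -> List[Peak]:
    """Return the overlapping parts of two peak sets (simple O(n*m) scan)."""
    b = list(b)  # allow b to be any iterable, since it is scanned repeatedly
    shared = []
    for chrom_a, start_a, end_a in a:
        for chrom_b, start_b, end_b in b:
            if chrom_a != chrom_b:
                continue
            start, end = max(start_a, start_b), min(end_a, end_b)
            if start < end:  # non-empty overlap
                shared.append((chrom_a, start, end))
    return sorted(shared)


# Toy peak sets standing in for the output of two peak callers.
caller_one = [("chr1", 100, 200), ("chr1", 500, 600), ("chr2", 50, 80)]
caller_two = [("chr1", 150, 250), ("chr2", 90, 120)]
print(intersect_peaks(caller_one, caller_two))  # [('chr1', 150, 200)]
```

Only the chr1 peaks at 100-200 and 150-250 overlap, so the intersection keeps the shared region 150-200. The quadratic scan is fine for a toy example; on real peak files a sorted sweep or a dedicated tool such as bedtools would be the more practical choice.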

See also README.md in the repository.

Python scripts for visualization

pybedtools (see also the online documentation) is a Python wrapper for bedtools.