Much of the biological research, from medicine to biotech, is moving toward sequence analysis. We are now generating targeted and whole genome big data, which needs to be analyzed to answer biological questions. To help you get started, you will be introduced to The Bioconductor project. Bioconductor is and builds the infrastructure to share software tools (packages), workflows and datasets for the analysis and comprehension of genomic data. Bioconductor is a great platform accessible to you, and it is a community developed open software resource. By the end of this course, you will be able to use essential Bioconductor packages and get a grasp of its infrastructure and some built-in datasets. Using BSgenome, Biostrings, IRanges, GenomicRanges, TxDB, ShortRead and Rqc with real datasets from different species is going to be an exceptional experience!
In this chapter you will get hands-on with Bioconductor. Bioconductor is the specialized repository for bioinformatics software, developed and maintained by the R community. You will learn how to install and use bioconductor packages. You will be introduced to S4 objects and functions, because most packages within Bioconductor inherit from S4. Additionally, you will use a real genomic dataset of a fungus to explore the BSgenome package.
Biostrings are memory efficient string containers. Biostring has matching algorithms, and other utilities, for fast manipulation of large biological sequences or sets of sequences. How efficient you can become by using the right containers for your sequences? You will learn about alphabets, and sequence manipulation by using the tiny genome of a virus.
The IRanges and GenomicRanges packages are also containers for storing and manipulating genomic intervals and variables defined along a genome. These packages provide infrastructure and support to many other Bioconductor packages because of their enriching features. You will learn how to use these containers and their associated metadata, for manipulation of your sequences. The dataset you will be looking at is a special gene of interest in the human genome.
ShortRead is the package for input, manipulation and assessment of fasta and fastq files. You can subset, trim and filter the sequences of interest, and even do a report of quality. An extra bonus towards the last exercises will give you the tools for parallel quality assessment, wink, wink Rqc. Exciting enough, for this you will use plant genome sequences!
In the following tracksAnalyzing Genomic Data
DatasetsZika Genomic DNA datasetA. Thaliana Short Reads with Quality datasetHuman Gene & Transcript ID datasetYeast Genome dataset
“I've used other sites—Coursera, Udacity, things like that—but DataCamp's been the one that I've stuck with.”
Devon Edwards Joseph
Lloyds Banking Group
“DataCamp is the top resource I recommend for learning data science.”
Harvard Business School
“DataCamp is by far my favorite website to learn from.”
Decision Science Analytics, USAA