Much biological research, from medicine to biotech, is moving toward sequence analysis. We are now generating large volumes of targeted and whole-genome sequencing data, which need to be analyzed to answer biological questions. To help you get started, this course introduces the Bioconductor project. Bioconductor builds the infrastructure to share software tools (packages), workflows, and datasets for the analysis and comprehension of genomic data. It is a community-developed, open software resource that is accessible to you. By the end of this course, you will be able to use essential Bioconductor packages and will have a grasp of its infrastructure and some built-in datasets. Using BSgenome, Biostrings, IRanges, GenomicRanges, TxDb, ShortRead and Rqc with real datasets from different species is going to be an exceptional experience!
What is Bioconductor?
In this chapter, you will get hands-on with Bioconductor, the specialized repository for bioinformatics software developed and maintained by the R community. You will learn how to install and use Bioconductor packages. You'll be introduced to S4 objects and functions, because most Bioconductor packages are built on the S4 class system. Additionally, you will use a real genomic dataset of a fungus to explore the BSgenome package.

Introduction to the Bioconductor Project (50 xp)
Bioconductor version (100 xp)
BiocManager to install packages (100 xp)
The role of S4 in Bioconductor (50 xp)
S4 class definition (50 xp)
Interaction with classes (100 xp)
Introducing biology of genomic datasets (50 xp)
Discovering the yeast genome (100 xp)
Partitioning the yeast genome (100 xp)
Available genomes (50 xp)
Biostrings and When to Use Them?
Biostrings provides memory-efficient string containers, along with matching algorithms and other utilities, for fast manipulation of large biological sequences or sets of sequences. How efficient can you become by using the right containers for your sequences? You will learn about alphabets and sequence manipulation by using the tiny genome of a virus.

Introduction to Biostrings (50 xp)
Exploring the Zika virus sequence (100 xp)
Biostrings containers (50 xp)
Manipulating Biostrings (100 xp)
Sequence handling (50 xp)
From a set to a single sequence (100 xp)
Subsetting a set (50 xp)
Common sequence manipulation functions (100 xp)
Why are we interested in patterns? (50 xp)
Searching for a pattern (50 xp)
Finding Palindromes (100 xp)
Finding a conserved region within six frames (100 xp)
Looking for a match (100 xp)
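To give a feel for the kind of sequence manipulation and pattern searching this chapter covers, here is a minimal sketch in plain Python. It is not the Biostrings API: the function names and the toy sequence are made up for illustration, and it assumes an alphabet of only A, C, G and T.

```python
# Plain-Python illustration of the concepts; this is NOT the Biostrings API.
# All names and the toy sequence below are hypothetical.

# Translation table mapping each base to its complement (A<->T, C<->G).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(COMPLEMENT)[::-1]


def find_pattern(seq, pattern):
    """Return the 0-based start of every match, including overlapping ones."""
    hits = []
    start = seq.find(pattern)
    while start != -1:
        hits.append(start)
        start = seq.find(pattern, start + 1)
    return hits


def is_dna_palindrome(seq):
    """A DNA palindrome reads the same as its own reverse complement."""
    return seq == reverse_complement(seq)


toy = "ACGAATTCGTGAATTC"  # made-up sequence, not real Zika data
print(reverse_complement("ATGC"))   # GCAT
print(find_pattern(toy, "GAATTC"))  # [2, 10]
print(is_dna_palindrome("GAATTC"))  # True
```

Note that positions here are 0-based, as is usual in Python, whereas R and Bioconductor count from 1.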
IRanges and GenomicRanges
The IRanges and GenomicRanges packages also provide containers, in this case for storing and manipulating genomic intervals and variables defined along a genome. These packages provide infrastructure and support to many other Bioconductor packages because of their rich features. You will learn how to use these containers and their associated metadata to manipulate your sequences. The dataset you will be looking at is a special gene of interest in the human genome.

IRanges and Genomic Structures (50 xp)
IRanges (50 xp)
Constructing IRanges (100 xp)
Interacting with IRanges (100 xp)
Gene of interest (50 xp)
From tabular data to Genomic Ranges (100 xp)
GenomicRanges accessors (100 xp)
ABCD1 mutation (50 xp)
Human genome chromosome X (100 xp)
Manipulating collections of GRanges (50 xp)
A sequence window (50 xp)
Is it there? (50 xp)
More about ABCD1 (100 xp)
How many transcripts? (100 xp)
From GRangesList object into a GRanges object (100 xp)
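The idea behind interval containers can be sketched in a few lines of plain Python. This is only an illustration, not the IRanges or GenomicRanges API: the `Interval` class, the helper functions and all coordinates are hypothetical. It follows the 1-based, inclusive convention that IRanges uses, where width = end - start + 1.

```python
# Plain-Python illustration; this is NOT the IRanges/GenomicRanges API.
# The class, helpers and coordinates are hypothetical.
from typing import NamedTuple


class Interval(NamedTuple):
    start: int  # 1-based, inclusive
    end: int    # inclusive

    @property
    def width(self):
        """Number of positions covered by the interval."""
        return self.end - self.start + 1


def overlaps(a, b):
    """Two closed intervals overlap if each starts before the other ends."""
    return a.start <= b.end and b.start <= a.end


def find_overlaps(query, subjects):
    """Return the indices of the subjects that overlap the query interval."""
    return [i for i, s in enumerate(subjects) if overlaps(query, s)]


window = Interval(100, 200)  # made-up sequence window
features = [
    Interval(50, 99),    # ends just before the window
    Interval(150, 250),  # partially inside the window
    Interval(200, 300),  # shares only position 200
    Interval(201, 210),  # starts just after the window
]
print(window.width)                     # 101
print(find_overlaps(window, features))  # [1, 2]
```

A linear scan like this is fine for a handful of ranges; dedicated containers matter once you have many intervals and metadata to carry along with them.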
ShortRead is the package for input, manipulation and assessment of fasta and fastq files. You can subset, trim and filter the sequences of interest, and even generate a quality report. As an extra bonus, the last exercises will give you the tools for parallel quality assessment (wink, wink, Rqc). Excitingly, you will use plant genome sequences for this!

Sequence files (50 xp)
Why fastq? (50 xp)
Reading in files (50 xp)
Exploring a fastq file (100 xp)
Extract a sample from a fastq file (100 xp)
Sequence quality (50 xp)
Exploring sequence quality (100 xp)
Base quality plot (50 xp)
Try your own nucleotide frequency plot (100 xp)
Match and filter (50 xp)
Filtering reads on the go! (100 xp)
Removing duplicates (50 xp)
More filtering! (100 xp)
Multiple assessment (50 xp)
Plotting cycle average quality (100 xp)
Introduction to Bioconductor (50 xp)
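To see what reading a fastq file and filtering on quality involves, here is a minimal plain-Python sketch. It is not the ShortRead or Rqc API: the function names, the two toy reads and the threshold of 20 are all made up. It assumes simple four-line records and Phred+33 quality encoding, where each quality score is the character's ASCII code minus 33.

```python
# Plain-Python illustration; this is NOT the ShortRead/Rqc API.
# Assumes four-line fastq records and Phred+33 quality encoding.
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) for each four-line record."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of input
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line is ignored
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@' from the id


def phred_scores(qual, offset=33):
    """Decode a quality string into a list of integer Phred scores."""
    return [ord(char) - offset for char in qual]


def mean_quality(qual):
    """Average Phred score of one read."""
    scores = phred_scores(qual)
    return sum(scores) / len(scores)


# Two made-up reads: 'I' decodes to 40 and '!' decodes to 0.
toy_fastq = "@read1\nACGT\n+\nIIII\n@read2\nACGT\n+\n!!!!\n"
records = list(read_fastq(io.StringIO(toy_fastq)))

# Keep only reads whose mean quality reaches a hypothetical threshold of 20.
kept = [read_id for read_id, _, qual in records if mean_quality(qual) >= 20]
print(kept)  # ['read1']
```

Because `read_fastq` is a generator, reads are processed one at a time rather than loaded all at once, which is the same idea behind filtering reads on the go.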
In the following tracks: Analyzing Genomic Data in R
Datasets
Zika Genomic DNA dataset
A. Thaliana Short Reads with Quality dataset
Human Gene & Transcript ID dataset
Yeast Genome dataset
James Chapman
Curriculum Manager, DataCamp
James is a Curriculum Manager at DataCamp, where he collaborates with experts from industry and academia to create courses on AI, data science, and analytics. He has led nine DataCamp courses on diverse topics in Python, R, AI developer tooling, and Google Sheets. He has a Master's degree in Physics and Astronomy from Durham University, where he specialized in high-redshift quasar detection. In his spare time, he enjoys restoring retro toys and electronics.
Follow James on LinkedIn
Paula Martinez
Data Scientist and Bioinformatician
Paula Andrea Martinez is currently working at The Life Sciences infrastructure ELIXIR Europe. She empowers life scientists by training them in software skills, data analysis, visualization and data stewardship best practices. She also advocates for open and reproducible science as evidenced by her volunteer roles with The Carpentries. Paula gained her PhD in applied Bioinformatics from The University of Queensland, using computational methods to study genomic diversity. She is particularly interested in R, databases, community building, open science, and diversity in STEM.