Data Science Versus Statistics
According to our “Learn Data Science In 8 (Easy) Steps” infographic, one of the first steps to learn data science is to get a good understanding of statistics, mathematics, and machine learning.
If you remember well, the next step is to learn how to code.
But once you know all the Python you need to know to do data science, it’s time to consolidate the knowledge that you have gained.
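The statistics topics for data science that this blog post references and includes resources for are statistics and probability theory, probability distributions, hypothesis testing, statistical modeling and fitting, machine learning, regression analysis, Bayesian thinking and modeling, and Markov chains.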
This list is not exhaustive of the statistics used in data science but is meant to get you started. :)
By the way, if you’re still looking to start learning Python for data science, you should consider taking our Intro to Python For Data Science course. It will help you apply statistics with Python.
Learning statistics for data science is very practical, and that’s why you should not forget to also (continue to) focus on practicing the theoretical concepts that you might have already learned at the start of your data science journey.
But what exactly is the difference between statistics and data science?
They are often confused, and some say that there is no difference and that the data scientist is actually a statistician.
But in the end, if you put those opinions aside for a moment, most can agree that statistics is one of the core components of data science.
Statistics with Python
Today’s post will focus on how you can learn statistics with Python, including the statistical analysis topics you will need to explore on your data science journey.
R is a good place to start with statistics. It was developed for statistical computing and graphics, so it offers a ton of statistical packages to its users. Python, on the other hand, is a general-purpose language that has many applications.
However, you can also use Python for statistics.
Some people say that they use Python because of its performance or because it can also do a lot of stuff that R can do.
But, in essence, this programming language has a rising popularity, and the number of packages that can be used for data science has certainly increased over recent years.
In short, there are definitely reasons to use Python for statistical analysis.
The tool that you will eventually choose will just depend on what type of analysis you want to do.
So, are you ready to get started with statistics in Python?
Python Statistics & Probability Theory
The first topic that you should probably tackle is statistics and probability theory. There are not only quite a few videos and courses out there that can help you, but there are also a lot of (printed) books that will help you to get started with statistics in Python.
Introduction to Statistics with Python
For an introduction to statistics, this tutorial with real-life examples is the way to go. The notebooks of this tutorial will introduce you to concepts like mean, median, standard deviation, and the basics of topics such as hypothesis testing and probability distributions.
A fine way to start your stats learning, since it is inspired by the books "Think Bayes" and "Think Stats", which are two top recommendations that will come back below!
If you’re looking for books, you can try out this free book on computational statistics in Python, which not only contains an introduction to programming with Python, but also treats topics such as Markov Chain Monte Carlo, the Expectation-Maximization (EM) algorithm, resampling methods, and much more.
Or you can buy this book by Thomas Haslwanter for a general introduction to common statistical tests, linear regression analysis and topics from survival analysis and Bayesian statistics. Note that this book does take life and medical sciences as an application area.
Both of the above books already introduce you to more advanced statistics topics with Python too, as you can see.
If you're a fan of videos, you should consider watching this tutorial on statistical data analysis with SciPy with Christopher Fonnesbeck, an Assistant Professor in the Department of Biostatistics at the Vanderbilt University School of Medicine. There's also this video on inferential and exploratory statistics with Python by Gaël Varoquaux. This last video makes use of the Python packages Pandas and StatsModels.
You’ll see that these are quite general resources to get you started with statistics in Python.
If you’re looking for resources that will quickly bring you up to speed with the basics of statistics, you should check out DataCamp’s Statistical Thinking in Python course, taught by Justin Bois. You’ll get introduced to concepts such as Exploratory Data Analysis (EDA), variance and covariance, means and medians, probability distributions, and so much more.
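To make a few of these concepts concrete, here is a minimal sketch that uses only Python's built-in statistics module. The page-view and sign-up numbers are made up for illustration and do not come from any of the resources above, and the sample_covariance helper is a hypothetical name introduced just for this example.

```python
import statistics

# Hypothetical data: page views and sign-ups on seven days (made up).
views = [12, 15, 11, 19, 15, 22, 18]
signups = [2, 3, 2, 4, 3, 5, 4]

mean_views = statistics.mean(views)
median_views = statistics.median(views)
variance_views = statistics.variance(views)  # sample variance (divides by n - 1)
stdev_views = statistics.stdev(views)        # square root of the sample variance


def sample_covariance(xs, ys):
    """Sample covariance of two equally long sequences (divides by n - 1)."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


cov = sample_covariance(views, signups)
print(mean_views, median_views, round(stdev_views, 2), round(cov, 2))
```

A positive covariance here simply says that days with more views tended to have more sign-ups in this toy data set.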
Probability Theory with Python
Probability theory is also something that is highly valuable to take into account when you’re learning statistics with Python. It’s the analysis of random phenomena. That means that the outcome of any random event is non-deterministic: it can be any of the several possible outcomes, and the eventual outcome is determined by chance.
Probability theory contains the conceptual origins of statistics.
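One way to get a feel for "determined by chance" is to simulate a random event many times. The sketch below rolls a fair six-sided die with Python's random module and compares the observed frequency of a six with the theoretical probability of 1/6. The seed and the number of rolls are arbitrary choices for this illustration.

```python
import random

random.seed(42)  # fixed seed so the simulation is repeatable
n_rolls = 10_000

# Each roll is a random event with six equally likely outcomes.
rolls = [random.randint(1, 6) for _ in range(n_rolls)]

observed_frequency = rolls.count(6) / n_rolls
theoretical_probability = 1 / 6

print(f"observed: {observed_frequency:.3f}, theory: {theoretical_probability:.3f}")
```

A single roll is unpredictable, but the long-run frequency settles close to the probability, which is the link between probability theory and the statistics you compute on data.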
The resources mentioned above gave a general introduction to statistics and, in some cases, they also covered probability theory (which seems reasonable given the above), but there are also resources that focus exclusively on this topic.
You can also check out the following resources:
One of the top recommendations is the Computational Probability and Inference course from edX. This hands-on course, taught by MIT instructors, will make you comfortable with the principles of probability and inference.
You should also read this free book, written by Professor Brian Blais, which is an introductory statistical inference textbook, motivated by probability theory as logic.
Python Probability Distributions
To really learn statistics with Python for data science, you should also develop a good intuition for which distribution to use when. A distribution is a listing or function that shows all the possible values or intervals of the data and how often they occur.
And, if you take a look at this list, you'll see that there are quite a few distributions to consider.
For an introduction to uniform, normal, binomial and Poisson probability distributions with SciPy, you can check out this blog post.
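If you want to see what these distributions look like in code before you install SciPy, here is a minimal sketch that uses only the standard library. It is not taken from the blog post mentioned above: binomial_pmf and poisson_pmf are hypothetical helper names written out from the textbook formulas, and the parameter values are arbitrary.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of exactly k events when lam events are expected."""
    return math.exp(-lam) * lam**k / math.factorial(k)


# Ten fair coin flips: probability of each possible number of heads.
coin_pmf = [binomial_pmf(k, 10, 0.5) for k in range(11)]

# Standard normal: share of values within one standard deviation of the mean.
standard_normal = NormalDist(mu=0, sigma=1)
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)

print(round(coin_pmf[5], 4), round(poisson_pmf(2, 3.0), 4), round(within_one_sd, 4))
```

SciPy exposes the same distributions as scipy.stats.binom, scipy.stats.poisson and scipy.stats.norm, which is what you would normally reach for in practice.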
A top recommendation is the fourth chapter in the “Think Stats: Probability and Statistics for Programmers” book, which will introduce you to continuous distributions. However, the fifth chapter will give you a solid introduction to probability distributions, too.
To visualize distributions, you can make use of histograms, among others. If you want to have a quick overview, you can check out this IPython notebook, which will give you a short introduction to descriptive statistics with mean values, quantiles, and histograms and their relations. To learn more about how you can visualize distributions, you can check out this Seaborn tutorial.
Note that if you want to follow a course that covers some of the distributions, such as binomial and Poisson, and distribution functions such as the empirical cumulative distribution function, or a course that will teach you how to visualize these distributions, you can check out DataCamp’s course on Statistical Thinking in Python.
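Both the histogram and the empirical cumulative distribution function (ECDF) are simple enough to sketch by hand. The example below is only an illustration with made-up data; the function names histogram and ecdf are hypothetical, and in practice you would usually plot these with Matplotlib or Seaborn instead of printing them.

```python
from collections import Counter


def histogram(data, bin_width):
    """Count how many values fall into each bin [start, start + bin_width)."""
    counts = Counter((x // bin_width) * bin_width for x in data)
    return dict(sorted(counts.items()))


def ecdf(data):
    """Return sorted values and the fraction of the data at or below each one."""
    xs = sorted(data)
    n = len(xs)
    ys = [(i + 1) / n for i in range(n)]
    return xs, ys


# Hypothetical measurements (made up for illustration).
data = [2, 3, 3, 5, 7, 8, 8, 9, 12, 14]

bins = histogram(data, bin_width=5)
xs, ys = ecdf(data)

for start, count in bins.items():
    print(f"{start:>2} to {start + 4:>2}: {'#' * count}")
```

The ECDF has the advantage that it does not depend on a choice of bin width, which is one reason it is often used alongside histograms.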
Python Hypothesis Testing
Hypothesis tests are statistical tests that are used to determine whether there is enough evidence in a sample of data to infer that a particular condition is true for the entire population.
The two central concepts of these tests are the null hypothesis and the alternative hypothesis, and the p-value is just as fundamental to hypothesis testing. These things are very hard to understand when you’re new to the field, and it will require some effort to grasp the alpha value, or significance level, that you compare your p-value against, and the difference between rejecting and failing to reject the null hypothesis.
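A permutation test is one way to see the null hypothesis, the p-value and alpha working together without any formulas. The sketch below assumes two made-up groups of measurements and tests the null hypothesis that the group labels do not matter. The function name permutation_test, the data and the choice of 10,000 permutations are all assumptions for this illustration.

```python
import random


def mean(values):
    return sum(values) / len(values)


def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in means.

    Returns the fraction of label shufflings whose absolute difference
    in means is at least as large as the observed one (the p-value).
    """
    rng = random.Random(seed)
    observed = abs(mean(a) - mean(b))
    pooled = list(a) + list(b)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)  # under the null hypothesis, labels are arbitrary
        shuffled_a, shuffled_b = pooled[:len(a)], pooled[len(a):]
        if abs(mean(shuffled_a) - mean(shuffled_b)) >= observed:
            at_least_as_extreme += 1
    return at_least_as_extreme / n_permutations


# Hypothetical measurements for two groups (made up for illustration).
group_a = [5.1, 4.9, 5.6, 5.8, 6.0, 5.5, 5.3, 5.7]
group_b = [6.9, 7.1, 6.5, 7.4, 6.8, 7.0, 7.3, 6.6]

alpha = 0.05  # significance level, chosen before looking at the data
p_value = permutation_test(group_a, group_b)
reject_null = p_value < alpha
print(p_value, reject_null)
```

If the p-value is below alpha you reject the null hypothesis; otherwise you fail to reject it, which is not the same as proving it true.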
You'll find a tutorial on the site of the SciPy library that works briefly with p-values and estimation.
These SciPy lectures will introduce you to the t-test, which you can use to test a hypothesis by comparing the means of two populations. You can also resort to this blog post if you want to explore t-tests.
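To show what a two-sample t-test actually computes, here is a minimal sketch of Welch's t statistic and its approximate degrees of freedom, written with the standard library. It is not the code from the lectures mentioned above; welch_t is a hypothetical helper name and the two samples are made up. Turning the statistic into a p-value needs the t distribution, which the standard library does not provide.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    mean_a, mean_b = statistics.mean(a), statistics.mean(b)
    # Squared standard error contributed by each sample.
    se2_a = statistics.variance(a) / len(a)
    se2_b = statistics.variance(b) / len(b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a**2 / (len(a) - 1) + se2_b**2 / (len(b) - 1)
    )
    return t, df


# Hypothetical samples (made up for illustration).
sample_a = [1, 2, 3, 4, 5]
sample_b = [2, 3, 4, 5, 6]

t_statistic, degrees_of_freedom = welch_t(sample_a, sample_b)
print(t_statistic, degrees_of_freedom)
```

With SciPy installed, scipy.stats.ttest_ind(sample_a, sample_b, equal_var=False) performs the same Welch test and also returns the p-value.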
If you want to read a book, the top recommendation “Think Stats: Probability and Statistics for Programmers” also holds for hypothesis testing. The seventh chapter will teach you all about hypothesis testing if you haven’t already gone through the other chapters to learn about distributions.
For people that are looking for courses, DataCamp's Statistical Thinking in Python (Part 2) offers an introduction and test examples for you to get the necessary knowledge and practice on hypothesis testing and so much more.
Statistical Modeling and Fitting in Python
Now that you've gotten the hang of hypothesis testing and distributions, you can first review or go deeper into how you can make statistical models and fit distributions to data.
Statistical models approximate the process that generates your data. In other words, a model is a representation of the complex phenomena that generated the data, and it can be used in data analysis to summarize data, to predict, and to simulate.
This, however, entails that you also need to be able to find out whether your data fits that model.
To provide the best fit between the model and the data, estimation can be used. Estimation is concerned with making inferences about a population, based on information obtained from a sample. Next to hypothesis testing, it’s a way of learning something about the population from the sample.
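As a small illustration of estimation, the sketch below draws a sample from a normal distribution with known parameters and then estimates those parameters back from the sample by maximum likelihood. The "true" values 10 and 2, the sample size and the seed are assumptions chosen only for this example, and normal_log_likelihood is a hypothetical helper name.

```python
import math
import random
import statistics

# Simulate a sample from a population we pretend not to know.
rng = random.Random(1)
sample = [rng.gauss(10.0, 2.0) for _ in range(5_000)]

# For a normal model, the maximum likelihood estimates have closed forms:
# the sample mean, and the standard deviation computed with n in the denominator.
mu_hat = statistics.fmean(sample)
sigma_hat = statistics.pstdev(sample)


def normal_log_likelihood(data, mu, sigma):
    """Log-likelihood of data under a normal model with the given parameters."""
    n = len(data)
    squared_error = sum((x - mu) ** 2 for x in data)
    return -n * math.log(sigma * math.sqrt(2 * math.pi)) - squared_error / (2 * sigma**2)


best = normal_log_likelihood(sample, mu_hat, sigma_hat)
shifted = normal_log_likelihood(sample, mu_hat + 0.5, sigma_hat)
print(round(mu_hat, 2), round(sigma_hat, 2), best > shifted)
```

Any other pair of parameters gives a lower log-likelihood than the estimates, which is exactly what "best fit" means in the maximum likelihood sense.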
This tutorial introduces you to the topic of fitting with the help of the Python library SciPy.
Statistical data modeling and fitting is also a chapter in this statistical analysis tutorial, elaborated in notebooks and made by Christopher Fonnesbeck. This name will sound familiar now!
For those who are more into videos, this tutorial is also available on YouTube in four videos and treats topics such as estimation (Maximum Likelihood and Method of Moments).
You can see the videos of this tutorial here.
By the way, if you want to know more about the Maximum Likelihood Estimation for statistical pattern classification, don't miss this IPython notebook or this notebook that explains how to compute this estimate for different distributions. These notebooks are part of the pattern classification repository made by Sebastian Raschka, who also has another repository for his Python Machine Learning book.
Machine Learning with Python
And with this last suggestion for the Machine Learning book, you might wonder: this post was about statistics, right?
And it's not that machine learning and statistics are the same, but they do ask the same question: how can we learn from data?
Also, both machine learning and statistics techniques are frequently used in, for example, pattern recognition or data mining.
Machine learning is quite a useful tool in your data science toolbox. It’s quite a broad topic, and you can spend a lot of time to figure out its concepts and algorithms.
That's why you'd better start now!
But it’s not very straightforward to figure out where to get started, since the topic is so broad and there are a lot of resources for getting proficient with machine learning in Python.
The general machine learning course taught by Andrew Ng is quite theoretical, but it’s still recommended if you first want to approach the main concepts and algorithms from a theoretical point of view.
However, there are also a lot more practical resources out there that can help you to get started.
The following resources are just a few that are out there:
This gentle introduction to machine learning with SciPy will help you to get on the right track. This tutorial is ideal for those who want to freshen up their basic stats knowledge and want to build on that. Kyle Kastner leads you to parameter estimation, regression, model estimation, and basic classification.
If you want a book to approach this topic, you could check out the IPython Interactive Computing and Visualization Cookbook. The eighth chapter gives you an introduction to fundamental machine learning concepts and illustrates algorithms such as logistic regression, Naive Bayes, K-nearest neighbors, Support Vector Machines, random forests, and others. The cookbook uses the Scikit-learn package for its examples.
If you want a tutorial with an introduction to machine learning with Scikit-learn, go here.
Also, don't miss this tutorial on the Naive Bayes classifier.
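To give you a taste of what a classifier does before you open any of these resources, here is a minimal sketch of k-nearest neighbors written from scratch. It is not code from the cookbook or the tutorials above: knn_predict is a hypothetical function name and the two clusters of points are made up.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Predict a label for query by majority vote among its k nearest points."""
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Hypothetical training data: two well-separated clusters (made up).
train_points = [(1.0, 1.1), (0.8, 1.3), (1.2, 0.7), (5.0, 5.2), (4.7, 5.1), (5.3, 4.8)]
train_labels = ["A", "A", "A", "B", "B", "B"]

print(knn_predict(train_points, train_labels, (1.2, 0.9)))
print(knn_predict(train_points, train_labels, (5.1, 4.8)))
```

Scikit-learn offers the same idea as sklearn.neighbors.KNeighborsClassifier, with proper handling of larger data sets and distance options.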
Regression Analysis with Python
Regression is certainly something that you cannot miss when it comes to statistics for data science. It's a statistical process for estimating the relationships among variables.
To understand how you can do regression with Python, you should first go through some material on linear regression.
But check out this tutorial first: it covers regression analysis using the StatsModels package with Quandl. It first explains the different types of regression that are out there and then provides you with a practical example.
Then, go through this linear regression tutorial for more practice.
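If you want to see what linear regression boils down to, the sketch below fits a straight line to a handful of made-up points with the closed-form least squares formulas. It is a standard-library illustration, not the StatsModels code from the tutorials above, and fit_line is a hypothetical helper name.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical noisy measurements that roughly follow y = 1 + 2x (made up).
xs = [1, 2, 3, 4, 5]
ys = [2.9, 5.2, 6.8, 9.1, 11.0]

slope, intercept = fit_line(xs, ys)
prediction_at_6 = intercept + slope * 6
print(round(slope, 2), round(intercept, 2), round(prediction_at_6, 2))
```

The slope is the covariance of x and y divided by the variance of x, which ties regression back to the descriptive statistics from the start of this post.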
Then, you can move beyond ordinary linear regression. For a tutorial on ridge and lasso regression in Python, which are regularized forms of linear regression, you can check out this Analytics Vidhya tutorial. The article makes use of the Python libraries NumPy, Pandas, Matplotlib and Scikit-learn to clearly explain to you how to approach this topic.
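For a single feature, ridge regression can be sketched in a few lines, which makes the effect of the penalty easy to see. The example below is an illustration under simplifying assumptions (one feature, an unpenalized intercept, made-up data); fit_ridge_line and the penalty values are hypothetical and not taken from the tutorial above.

```python
def fit_ridge_line(xs, ys, penalty):
    """Fit y = intercept + slope * x, minimizing squared error + penalty * slope**2."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)  # the penalty shrinks the slope toward zero
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data that follows y = 1 + 2x exactly (made up).
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 7, 9, 11]

for penalty in (0, 1, 10):
    slope, intercept = fit_ridge_line(xs, ys, penalty)
    print(penalty, round(slope, 3), round(intercept, 3))
```

With a penalty of zero you get the ordinary least squares line back, and larger penalties pull the slope toward zero. Lasso uses an absolute-value penalty instead and has no equally simple closed form in general.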
There’s also a great notebook tutorial on logistic regression that you can find here.
Bayesian Thinking & Modeling in Python
Bayesian statistics is a theory that expresses the evidence about the true state of the world in terms of degrees of belief known as Bayesian probabilities. Sometimes, you will want to take a Bayesian approach to data science problems.
What this exactly means will become clear in this excellent five-part series intro that will introduce you to frequentism and Bayesianism.
If, however, you’re more of a book fan, you can check out “Think Bayes: Bayesian Statistics in Python”. "Bayesian Methods For Hackers" is another great resource to get introduced to Bayesian inference. Two must-read books for anybody that wants to get started with Bayesian thinking and modeling!
Or, if you want an introduction in a notebook, you can go through this tutorial, which introduces you to the Bayes theorem.
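The Bayes theorem itself fits in a few lines of Python. The sketch below works through the classic diagnostic test example; the prevalence, sensitivity and false positive rate are made-up numbers, and the function name posterior_probability is a hypothetical choice for this illustration rather than something from the notebook above.

```python
def posterior_probability(prior, sensitivity, false_positive_rate):
    """P(condition | positive test) via the Bayes theorem."""
    # Total probability of a positive test, with or without the condition.
    evidence = sensitivity * prior + false_positive_rate * (1 - prior)
    return sensitivity * prior / evidence


# Hypothetical numbers: 1% prevalence, 95% sensitivity, 5% false positives.
posterior = posterior_probability(prior=0.01, sensitivity=0.95, false_positive_rate=0.05)
print(round(posterior, 3))  # about 0.161
```

Even with a fairly accurate test, the posterior stays low because the condition is rare to begin with. Updating a prior belief with evidence in this way is the core of Bayesian thinking.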
Also, don't miss this tutorial on Bayesian Statistical Analysis in Python and the accompanying YouTube videos that will introduce you to Bayesian statistics, Markov chain Monte Carlo, PyMC, Hierarchical Modeling and model checking and validation.
For a tutorial on Bayesian model fitting in Python, you should check out these IPython Notebooks and the accompanying YouTube video, which is a lecture by Jake VanderPlas at the ESAC Data Analysis and Statistics Workshop 2014.
If you want to re-use resources, you can check out the IPython Interactive Computing and Visualization Cookbook, which you might have already used to look at Machine Learning. The seventh chapter of this book is about statistical data analysis and focuses on frequentist and Bayesian methods for hypothesis testing, parametric and nonparametric estimation, and model inference.
Here is a tutorial on PyMC, a Python module that implements Bayesian statistical models and fitting algorithms, including Markov Chain Monte Carlo (MCMC). Also, this tutorial, in which you'll learn how to implement Bayesian linear regression models with PyMC3, is worth checking out.
Simply stated, Markov chains are mathematical systems that hop from one "state" to another. These states can be a situation or set of values. That means that you have a list of states available and, on top of that, a Markov chain tells you the probability of hopping, or "transitioning," from one state to any other state.
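The "list of states plus transition probabilities" idea translates directly into a dictionary. The sketch below simulates a two-state weather chain; the states, the probabilities, the seed and the function name simulate_chain are all made up for this illustration.

```python
import random

# Hypothetical transition probabilities: transitions[current][next].
transitions = {
    "sunny": {"sunny": 0.8, "rainy": 0.2},
    "rainy": {"sunny": 0.4, "rainy": 0.6},
}


def simulate_chain(start, n_steps, rng):
    """Walk the chain for n_steps, returning the list of visited states."""
    state = start
    path = [state]
    for _ in range(n_steps):
        next_states = list(transitions[state])
        weights = [transitions[state][s] for s in next_states]
        # The next state depends only on the current state.
        state = rng.choices(next_states, weights=weights, k=1)[0]
        path.append(state)
    return path


rng = random.Random(0)
path = simulate_chain("sunny", 20_000, rng)
share_sunny = path.count("sunny") / len(path)
print(round(share_sunny, 2))
```

For these probabilities, the long-run share of sunny days works out to 0.4 / (0.2 + 0.4), or about two thirds, and the simulated share should land close to that value.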
Some of the resources that were mentioned above already introduce you to this topic.
In addition to those resources, you might also want to watch this video: it’s a tutorial that uses Monte Carlo simulation and resampling, among others, to explore hypothesis testing and statistical modeling. It might be a good way to consolidate your knowledge before you go into Markov chains.
Also, the Computational Statistics in Python book will give you some insights into Markov Chains. It’s an awesome introduction to Markov Chain Monte Carlo.
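To connect Markov chains back to Bayesian modeling, here is a minimal sketch of a Metropolis sampler, one of the simplest Markov Chain Monte Carlo methods. It is not code from the book above or from PyMC: it assumes a made-up experiment of 7 heads in 10 coin flips with a uniform prior on the coin's bias, and the function names, step size, seed and burn-in length are arbitrary choices for this illustration.

```python
import math
import random


def log_posterior(p, heads, flips):
    """Unnormalized log-posterior of the bias p under a uniform prior."""
    if not 0 < p < 1:
        return -math.inf  # values outside (0, 1) are impossible
    return heads * math.log(p) + (flips - heads) * math.log(1 - p)


def metropolis(heads, flips, n_samples=20_000, step=0.1, seed=0):
    """Random-walk Metropolis sampler for the coin bias."""
    rng = random.Random(seed)
    current = 0.5
    current_lp = log_posterior(current, heads, flips)
    samples = []
    for _ in range(n_samples):
        proposal = current + rng.gauss(0, step)
        proposal_lp = log_posterior(proposal, heads, flips)
        # Accept with probability min(1, posterior ratio).
        accept_probability = math.exp(min(0.0, proposal_lp - current_lp))
        if rng.random() < accept_probability:
            current, current_lp = proposal, proposal_lp
        samples.append(current)
    return samples


samples = metropolis(heads=7, flips=10)
kept = samples[1_000:]  # drop early samples as burn-in
posterior_mean = sum(kept) / len(kept)
print(round(posterior_mean, 2))
```

The exact posterior in this toy case is a Beta(8, 4) distribution with mean 2/3, so the sample mean should come out near 0.67. The sampler is itself a Markov chain, because each proposal depends only on the current value.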
Getting Started with Statistics in Python
This list is just to get you started. You’ll see that many resources overlap or you might find other resources out there. Make sure to let us know on Twitter!
In any case, there’s no reason to wait any longer to start learning statistics with Python.