
Course Description
How do we get from data to answers? Exploratory data analysis is a process for exploring datasets, answering questions, and visualizing results. This course presents the tools you need to clean and validate data, to visualize distributions and relationships between variables, and to use regression models to predict and explain. You'll explore data related to demographics and health, including the National Survey of Family Growth and the General Social Survey, but the methods you learn apply to all areas of science, engineering, and business. You'll use Pandas, a powerful library for working with data, and other core Python libraries including NumPy and SciPy, StatsModels for regression, and Matplotlib for visualization. With these tools and skills, you will be prepared to work with real data, make discoveries, and present compelling results.
- 1. Read, clean, and validate (free)
  The first step of almost any data project is to read the data, check for errors and special cases, and prepare data for analysis. This is exactly what you'll do in this chapter, while working with a dataset obtained from the National Survey of Family Growth.
- 2. Distributions
  In the first chapter, having cleaned and validated your data, you began exploring it by using histograms to visualize distributions. In this chapter, you'll learn how to represent distributions using Probability Mass Functions (PMFs) and Cumulative Distribution Functions (CDFs). You'll learn when to use each of them, and why, while working with a new dataset obtained from the General Social Survey.
  Exercises: Probability mass functions (50 xp), Make a PMF (100 xp), Plot a PMF (100 xp), Cumulative distribution functions (50 xp), Make a CDF (100 xp), Compute IQR (100 xp), Plot a CDF (100 xp), Comparing distributions (50 xp), Distribution of education (50 xp), Extract education levels (100 xp), Plot income CDFs (100 xp), Modeling distributions (50 xp), Distribution of income (100 xp), Comparing CDFs (100 xp), Comparing PDFs (100 xp)
- 3. Relationships
  Up to this point, you've only looked at one variable at a time. In this chapter, you'll explore relationships between variables two at a time, using scatter plots and other visualizations to extract insights from a new dataset obtained from the Behavioral Risk Factor Surveillance System (BRFSS). You'll also learn how to quantify those relationships using correlation and simple regression.
  Exercises: Exploring relationships (50 xp), PMF of age (100 xp), Scatter plot (100 xp), Jittering (100 xp), Visualizing relationships (50 xp), Height and weight (100 xp), Distribution of income (100 xp), Income and height (100 xp), Correlation (50 xp), Computing correlations (100 xp), Interpreting correlations (50 xp), Simple regression (50 xp), Income and vegetables (100 xp), Fit a line (100 xp)
- 4. Multivariate Thinking
  Explore multivariate relationships using multiple regression to describe non-linear relationships, and logistic regression to explain and predict binary variables.
  Exercises: Limits of simple regression (50 xp), Regression and causation (50 xp), Using StatsModels (100 xp), Multiple regression (50 xp), Plot income and education (100 xp), Non-linear model of education (100 xp), Visualizing regression results (50 xp), Making predictions (100 xp), Visualizing predictions (100 xp), Logistic regression (50 xp), Predicting a binary variable (100 xp), Next steps (50 xp)
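To give a flavor of the PMF, CDF, and IQR ideas named in the Distributions chapter, here is a minimal sketch. It is not course code: it deliberately uses only the Python standard library rather than the Pandas, NumPy, and SciPy tools the course teaches, and the helper names `make_pmf`, `make_cdf`, and `quantile`, along with the small `education` sample, are hypothetical and made up for illustration.

```python
from collections import Counter
from fractions import Fraction


def make_pmf(values):
    """Map each distinct value to its relative frequency (an exact fraction)."""
    counts = Counter(values)
    n = len(values)
    return {x: Fraction(counts[x], n) for x in sorted(counts)}


def make_cdf(pmf):
    """Map each value x to P(X <= x) by accumulating the PMF in sorted order."""
    cdf = {}
    total = Fraction(0)
    for x in sorted(pmf):
        total += pmf[x]
        cdf[x] = total
    return cdf


def quantile(cdf, q):
    """Return the smallest value whose cumulative probability reaches q."""
    for x in sorted(cdf):
        if cdf[x] >= q:
            return x
    raise ValueError("q must be between 0 and 1")


# Hypothetical sample: years of education for ten respondents (not survey data).
education = [12, 12, 14, 16, 12, 16, 18, 14, 12, 16]

pmf = make_pmf(education)
cdf = make_cdf(pmf)

# Interquartile range: distance between the 75th and 25th percentiles.
iqr = quantile(cdf, Fraction(3, 4)) - quantile(cdf, Fraction(1, 4))
print(iqr)
```

For this made-up sample, the 25th percentile is 12 years and the 75th percentile is 16 years, so the IQR is 4 years. Exact fractions are used here only so the cumulative probabilities sum to exactly 1; with real survey data you would more likely work with floating-point values in Pandas.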
Datasets
- National Survey of Family Growth (NSFG)
- General Social Survey (GSS)
- Behavioral Risk Factor Surveillance System (BRFSS)


Prerequisites
- Python Toolbox