Course notes: Introduction to Data Science in Python

Course Notes


# Import any packages you want to use here

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

Getting Started in Python

Modules

Modules (sometimes called packages or libraries) help group together related tools in Python. For example, we might want to group together all of the tools that make different types of charts: bar charts, line charts, and histograms. Some common examples of modules are:

matplotlib (which creates charts)
pandas (which loads tabular data, for example by reading it from a file)
scikit-learn (which performs machine learning)
scipy (which contains statistics functions)
nltk (which works with text data)
numpy (which performs mathematical operations on lists of data)

We must import a module before we can use any of the tools it contains. By convention, we put all imports at the top of the script. If we don't import a module, we can't use the tools that it contains.

import pandas as pd

import seaborn as sns

from matplotlib import pyplot as plt

# Module names are often long, so we can shorten them with an alias. To give a module an alias, add "as" and a shorter name to the import statement. The first statement above aliases "pandas" as "pd".

Aliasing lets us shorten seaborn.scatterplot() to sns.scatterplot().
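A minimal sketch of aliasing. It uses the standard-library module statistics as a stand-in (an assumption made so the snippet runs without installing anything); the same pattern applies to pandas or seaborn.

```python
import statistics          # full module name
import statistics as st    # the same module under a shorter alias

values = [2, 4, 6]

# Both names point at the same module object, so the two calls are equivalent.
full_name_mean = statistics.mean(values)
alias_mean = st.mean(values)

print(full_name_mean, alias_mean)
```

The alias only changes the name we type; it does not create a second copy of the module.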

Creating variables

Variables help us reference a piece of data for later use. A variable gives us an easy-to-use shortcut to a piece of data. Whenever we use the variable name in our code, it will be replaced with the original piece of data.

Rules to define variables

Variable names must start with a letter or an underscore. You can use a capital letter, but we usually use lowercase.
Variable names can't contain special characters like exclamation points or dashes; only letters, digits, and underscores are allowed.
Variable names are case sensitive.
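One way to check these rules is the built-in string method isidentifier(), which reports whether a string has the shape of a valid name. The candidate names below are invented for illustration.

```python
# Candidate names, invented for illustration.
candidate_names = ["weight", "Weight", "dog_name", "2nd_dog", "dog-name", "wow!"]

# str.isidentifier() returns True when the string is shaped like a valid name.
validity = {name: name.isidentifier() for name in candidate_names}
for name, ok in validity.items():
    print(f"{name!r}: {'valid' if ok else 'invalid'}")

# Case sensitivity: these are two separate variables.
weight = 75.5
Weight = 80.0
print(weight == Weight)
```

Note that isidentifier() only checks the shape of the name; it does not tell you whether the name is a reserved word such as "for" or "import".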

Floats and strings

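Floats represent numbers with a decimal point. Whole numbers written without a decimal point, such as 24, are stored as integers (type int); the two can be mixed freely in arithmetic.

height = 24
weight = 75.5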

Strings represent text and can contain letters, numbers, spaces, and special characters.

name = "Bayes"
height = "24"
weight = 75.5

# Floats represent numbers with a decimal point; whole numbers such as 24 are stored as integers (type int).

height = 24
weight = 75.5

# Strings represent text and can contain letters, numbers, spaces, and special characters.

name = "Bayes"
breed = "Golden Retriever"

# We define a string by putting either single ('') or double ("") quotes around a piece of text. It doesn't matter if you use single (') or double (") quotes, but it's important to be consistent throughout your code.
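A short sketch of how Python tells these types apart, using the built-in type() function. The variable height_text is a hypothetical name introduced here to keep the string version separate from the numeric one.

```python
height = 24          # whole number, stored as an int
weight = 75.5        # number with a decimal point, stored as a float
height_text = "24"   # digits inside quotes are a string

print(type(height).__name__, type(weight).__name__, type(height_text).__name__)

# Numbers add arithmetically; strings are joined end to end.
number_sum = height + 1
text_join = height_text + "1"
print(number_sum, text_join)

# int() converts a string of digits into a number.
converted = int(height_text) + 1
print(converted)
```

This is why putting quotes around a number matters: "24" behaves as text until it is converted.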

Functions

A function is an action. It turns one or more inputs into an output.

For example, the function plt.plot takes tabular data and draws a line graph, here with letter_index on the x-axis and frequency on the y-axis.

The function name plt.plot has two parts:

The first part tells us which module the function comes from. In this case, it is plt, the alias we used when we imported matplotlib.pyplot.
The second part (which comes after the period) is the name of the function: plot.
The function name is always followed by parentheses ().

Positional arguments

Positional arguments are one type of input that a function can have. Positional arguments must come in a specific order.

In plt.plot, the first argument holds the x-value of each point, and the second argument holds the y-value of each point. Arguments are separated by commas.

It's good practice to put a space after each comma, but your code will run even if you forget that space.
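To see why order matters, here is a small sketch using the built-in pow function instead of plt.plot (a substitution chosen so the example needs no chart). Swapping the two positional arguments changes the result.

```python
# pow(base, exponent): the first argument is the base, the second the exponent.
two_cubed = pow(2, 3)       # 2 * 2 * 2
three_squared = pow(3, 2)   # 3 * 3

print(two_cubed, three_squared)
```

The same idea applies to plt.plot: swapping the first two arguments would swap the x-axis and y-axis data.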

Keyword arguments

Keyword arguments are passed by name, in the form name=value. They come after positional arguments, but if there are multiple keyword arguments, they can come in any order.

In a call such as plt.plot(x, y, label="Ransom"), the keyword argument is called "label". After the equals sign, we put the actual input to the function, which is "Ransom". Eventually, this argument will let us create a legend for our graph.
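A minimal sketch of a call that mixes positional and keyword arguments. The numbers are placeholder values invented for illustration, and the "Agg" backend is an assumption made so the code runs without opening a window.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen so the sketch runs without a display
import matplotlib.pyplot as plt

# Placeholder data, invented for illustration only.
letter_index = [1, 2, 3, 4]
frequency = [7, 2, 5, 3]

# Positional arguments come first (x, then y); keyword arguments follow.
lines = plt.plot(letter_index, frequency, label="Ransom")

# The label only shows up on the chart once a legend is drawn.
plt.legend()

print(lines[0].get_label())
```

plt.plot returns a list of the lines it drew, which is why the label is read from lines[0].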

Common function errors

What is Pandas

Pandas is a module for working with tabular data, or data that has rows and columns. Common examples of tabular data are spreadsheets or database tables.

Pandas gives you many tools for working with tabular data. You can:

load tabular data from multiple sources like spreadsheets or databases
search for particular rows or columns in your loaded data
calculate aggregate statistics (like averages or standard deviations)
combine data from multiple sources.
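The searching, aggregating, and combining steps in the list above can be sketched as follows. Both tables are hypothetical and built in memory, so nothing is loaded from a file here.

```python
import pandas as pd

# Hypothetical data, invented for illustration.
dogs = pd.DataFrame({
    "name": ["Bayes", "Rex", "Luna"],
    "weight": [75.5, 60.0, 44.5],
})
owners = pd.DataFrame({
    "name": ["Bayes", "Rex", "Luna"],
    "owner": ["Ana", "Ben", "Cy"],
})

# Search: keep only the rows where weight is above 50.
heavy = dogs[dogs["weight"] > 50]

# Aggregate: average of one column.
mean_weight = dogs["weight"].mean()

# Combine: join the two tables on their shared "name" column.
combined = dogs.merge(owners, on="name")

print(heavy)
print(mean_weight)
print(combined)
```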

Pandas introduces a new, more powerful data type: the DataFrame, which represents tabular data. Loading data into a DataFrame is the first step in using Pandas.

One of the easiest ways to create a DataFrame is from a CSV file. CSV stands for comma-separated values, a common way of storing tabular data as a text-only file.

Before we can start using Pandas, we have to import the pandas module. Recall that, by convention, we import pandas under the alias "pd".

Next, we create our first DataFrame from a CSV. Turning a CSV into a DataFrame is easy:

import pandas as pd

df = pd.read_csv('ransom.csv')

# pd.read_csv is a function; here we pass it one argument, the name of the CSV file as a string

# Notice that we saved this DataFrame to a variable called "df". We can display this variable by using the "print" function

print(df)

# When we print a DataFrame, pandas shows the whole table (for long DataFrames, it abbreviates the middle rows). Usually, we don't want to print an entire DataFrame to inspect it; we just want to view the first few rows. We can do this by using head:

import pandas as pd

df = pd.read_csv('ransom.csv')

df.head() # We call this type of function a method

# By default, the .head method returns the first five rows of "df".
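A self-contained sketch of loading a CSV and inspecting it with head. Because ransom.csv is not included with these notes, the snippet reads invented CSV text from memory with io.StringIO; the column names and values are assumptions for illustration.

```python
import io
import pandas as pd

# Stand-in for a CSV file on disk; the contents are invented for illustration.
rows = "\n".join(f"{i},{i * 2}" for i in range(1, 9))
csv_text = "letter_index,frequency\n" + rows

# read_csv accepts a file path or any file-like object.
df = pd.read_csv(io.StringIO(csv_text))

first_five = df.head()     # default: the first 5 rows
first_three = df.head(3)   # an optional argument changes the row count

print(df.shape)
print(first_five)
```

With a real file, the call would be pd.read_csv('ransom.csv') as in the notes above; everything after that line works the same way.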
