
Data Manipulation with pandas

Run the hidden code cell below to import the data used in this course.

# Import the course packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Import the four datasets
avocado = pd.read_csv("datasets/avocado.csv")
homelessness = pd.read_csv("datasets/homelessness.csv")
temperatures = pd.read_csv("datasets/temperatures.csv")
walmart = pd.read_csv("datasets/walmart.csv")

Take Notes

Data analysts use the six steps of the data analysis process to improve their workplace and its business processes. The six steps you have been learning in this program are: ask, prepare, process, analyze, share, and act.



There are six stages to the data life cycle. Here is a recap:

  • Plan: Decide what kind of data is needed, how it will be managed, and who will be responsible for it.
  • Capture: Collect or bring in data from a variety of different sources.
  • Manage: Care for and maintain the data. This includes determining how and where it is stored and the tools used to do so.
  • Analyze: Use the data to solve problems, make decisions, and support business goals.
  • Archive: Keep relevant data stored for long-term and future reference.
  • Destroy: Remove data from storage and delete any shared copies of the data.


Spreadsheets Data analysts rely on spreadsheets to collect and organize data. Two popular spreadsheet applications you will probably use a lot in your future role as a data analyst are Microsoft Excel and Google Sheets.

Spreadsheets structure data in a meaningful way by letting you:

  • Collect, store, organize, and sort information
  • Identify patterns and piece the data together in a way that works for each specific data project
  • Create excellent data visualizations, like graphs and charts

Databases and query languages A database is a collection of structured data stored in a computer system. Some popular Structured Query Language (SQL) programs include MySQL, Microsoft SQL Server, and BigQuery.

Query languages:

  • Allow analysts to isolate specific information from a database(s)
  • Make it easier for you to learn and understand the requests made to databases
  • Allow analysts to select, create, add, or download data from a database for analysis

Visualization tools Data analysts use a number of visualization tools, like graphs, maps, tables, charts, and more. Two popular visualization tools are Tableau and Looker.

These tools:

  • Turn complex numbers into a story that people can understand
  • Help stakeholders come up with conclusions that lead to informed decisions and effective business strategies
  • Have multiple features:

  • Tableau's simple drag-and-drop feature lets users create interactive graphs in dashboards and worksheets
  • Looker communicates directly with a database, allowing you to connect your data right to the visual tool you choose

A career as a data analyst also involves using programming languages, like R and Python, which are used a lot for statistical analysis, visualization, and other data analysis tasks.


Depending on which phase of the data analysis process you’re in, you will need to use different tools. For example, if you are focusing on creating complex and eye-catching visualizations, then the visualization tools we discussed earlier are the best choice. But if you are focusing on organizing, cleaning, and analyzing data, then you will probably be choosing between spreadsheets and databases using queries. Spreadsheets and databases both offer ways to store, manage, and use data. The basic content for both tools is sets of values. Yet, there are some key differences, too:

Spreadsheets

  • Software applications
  • Structure data in a row and column format
  • Organize information in cells
  • Provide access to a limited amount of data
  • Manual data entry
  • Generally one user at a time
  • Controlled by the user

Databases

  • Data stores accessed using a query language (e.g. SQL)
  • Structure data using rules and relationships
  • Organize information in complex collections
  • Provide access to huge amounts of data
  • Strict and consistent data entry
  • Multiple users
  • Controlled by a database management system


Basic structure of a SQL query:

SELECT ColumnA, ColumnB, ColumnC                 -- choose the columns you want
FROM table_name                                  -- from the appropriate table
WHERE Condition1 AND Condition2 AND Condition3   -- a certain condition is met
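As a quick sanity check, here is a minimal sketch of that SELECT/FROM/WHERE structure run end to end with Python's built-in sqlite3 module (the orders table, its columns, and the rows are invented for illustration):

import sqlite3

# Tiny in-memory table so the query skeleton above can actually run
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [("Ana", 120.0, "shipped"), ("Bo", 35.5, "pending"), ("Cy", 80.0, "shipped")],
)

# SELECT (columns) FROM (table) WHERE (conditions)
rows = conn.execute(
    "SELECT customer, amount, status "
    "FROM orders "
    "WHERE status = 'shipped' AND amount > 50"
).fetchall()
print(rows)  # [('Ana', 120.0, 'shipped'), ('Cy', 80.0, 'shipped')]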


SQL Notes

  • If you are looking for all customers with a last name that begins with the letters “Ch," the WHERE clause would be: WHERE field1 LIKE 'Ch%'

  • You can place comments alongside your SQL to help you remember what the name represents. Comments are text placed between certain characters, /* and */, or after two dashes (--)

  • You can also make it easier on yourself by assigning a new name or alias to the column or table names to make them easier to work with (and avoid the need for comments). column AS columnNew

  • <> means "does not equal"
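A small sketch tying these notes together, again with sqlite3 against a made-up customers table: LIKE 'Ch%', a -- comment, a /* */ comment, a column alias with AS, and <> in one query.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (first_name TEXT, last_name TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ana", "Chen", "US"), ("Bo", "Chavez", "MX"), ("Cy", "Lee", "US")],
)

rows = conn.execute(
    """
    -- customers whose last name starts with 'Ch' and who are not in the US
    SELECT
        first_name,
        last_name AS surname   /* alias makes the column easier to refer to */
    FROM customers
    WHERE last_name LIKE 'Ch%' AND country <> 'US'
    """
).fetchall()
print(rows)  # [('Bo', 'Chavez')]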


Course 2 – Ask Questions to Make Data-Driven Decisions

Asking effective questions: To do the job of a data analyst, you need to ask questions and problem-solve. In this part of the course, you’ll check out some common analysis problems and how analysts solve them. You’ll also learn about effective questioning techniques that can help guide your analysis.

Making data-driven decisions: In analytics, data drives decision making. In this part of the course, you’ll explore data of all kinds and its impact on decision making. You’ll also learn how to share your data through reports and dashboards.

Mastering spreadsheet basics: Spreadsheets are an important data analytics tool. In this part of the course, you’ll learn both why and how data analysts use spreadsheets in their work. You’ll also explore how structured thinking can help analysts better understand problems and come up with solutions.

Always remembering the stakeholder: Successful data analysts learn to balance needs and expectations. In this part of the course, you’ll learn strategies for managing the expectations of stakeholders while establishing clear communication with your team to achieve your objectives.

Completing the Course Challenge: At the end of this course, you will be able to put everything you have learned into practice with the Course Challenge. The Course Challenge will ask you questions about key principles you have been learning about and then give you an opportunity to apply those principles in three scenarios.


There are six data analysis phases that will help you make seamless decisions: ask, prepare, process, analyze, share, and act.

Step 1: Ask It’s impossible to solve a problem if you don’t know what it is. These are some things to consider:

  • Define the problem you’re trying to solve
  • Make sure you fully understand the stakeholder’s expectations
  • Focus on the actual problem and avoid any distractions
  • Collaborate with stakeholders and keep an open line of communication
  • Take a step back and see the whole situation in context

Questions to ask yourself in this step: What are my stakeholders saying their problems are? Now that I’ve identified the issues, how can I help the stakeholders resolve their questions?

Step 2: Prepare You will decide what data you need to collect in order to answer your questions and how to organize it so that it is useful. You might use your business task to decide:

  • What metrics to measure
  • Locate data in your database
  • Create security measures to protect that data

Questions to ask yourself in this step: What do I need to figure out how to solve this problem? What research do I need to do?

Step 3: Process Clean data is the best data and you will need to clean up your data to get rid of any possible errors, inaccuracies, or inconsistencies. This might mean:

  • Using spreadsheet functions to find incorrectly entered data
  • Using SQL functions to check for extra spaces
  • Removing repeated entries
  • Checking as much as possible for bias in the data

Questions to ask yourself in this step: What data errors or inaccuracies might get in my way of getting the best possible answer to the problem I am trying to solve? How can I clean my data so the information I have is more consistent?

Step 4: Analyze You will want to think analytically about your data. At this stage, you might sort and format your data to make it easier to:

  • Perform calculations
  • Combine data from multiple sources
  • Create tables with your results

Questions to ask yourself in this step: What story is my data telling me? How will my data help me solve this problem? Who needs my company’s product or service? What type of person is most likely to use it?

Step 5: Share Everyone shares their results differently, so be sure to summarize your results with clear and enticing visuals of your analysis using tools like graphs or dashboards. This is your chance to show the stakeholders you have solved their problem and how you got there. Sharing will certainly help your team:

  • Make better decisions
  • Make more informed decisions
  • Lead to stronger outcomes
  • Successfully communicate your findings

Questions to ask yourself in this step: How can I make what I present to the stakeholders engaging and easy to understand? What would help me understand this if I were the listener?

Step 6: Act Now it’s time to act on your data. You will take everything you have learned from your data analysis and put it to use. This could mean providing your stakeholders with recommendations based on your findings so they can make data-driven decisions.

Questions to ask yourself in this step: How can I use the feedback I received during the share phase (step 5) to actually meet the stakeholder’s needs and expectations?

These six steps can help you to break the data analysis process into smaller, manageable parts, which is called structured thinking. This process involves four basic activities:

  • Recognizing the current problem or situation
  • Organizing available information
  • Revealing gaps and opportunities
  • Identifying your options

When you are starting out in your career as a data analyst, it is normal to feel pulled in a few different directions with your role and expectations. Following processes like the ones outlined here and using structured thinking skills can help get you back on track, fill in any gaps and let you know exactly what you need.


Making predictions A company that wants to know the best advertising method to bring in new customers is an example of a problem requiring analysts to make predictions. Analysts with data on location, type of media, and number of new customers acquired as a result of past ads can't guarantee future results, but they can help predict the best placement of advertising to reach the target audience.

Categorizing things An example of a problem requiring analysts to categorize things is a company's goal to improve customer satisfaction. Analysts might classify customer service calls based on certain keywords or scores. This could help identify top-performing customer service representatives or help correlate certain actions taken with higher customer satisfaction scores.

Spotting something unusual A company that sells smart watches that help people monitor their health would be interested in designing their software to spot something unusual. Analysts who have analyzed aggregated health data can help product developers determine the right algorithms to spot and set off alarms when certain data doesn't trend normally.

Identifying themes User experience (UX) designers might rely on analysts to analyze user interaction data. Similar to problems that require analysts to categorize things, usability improvement projects might require analysts to identify themes to help prioritize the right product features for improvement. Themes are most often used to help researchers explore certain aspects of data. In a user study, user beliefs, practices, and needs are examples of themes.

By now you might be wondering if there is a difference between categorizing things and identifying themes. The best way to think about it is: categorizing things involves assigning items to categories; identifying themes takes those categories a step further by grouping them into broader themes.

Discovering connections A third-party logistics company working with another company to get shipments delivered to customers on time is a problem requiring analysts to discover connections. By analyzing the wait times at shipping hubs, analysts can determine the appropriate schedule changes to increase the number of on-time deliveries.

Finding patterns Minimizing downtime caused by machine failure is an example of a problem requiring analysts to find patterns in data. For example, by analyzing maintenance data, they might discover that most failures happen if regular maintenance is delayed by more than a 15-day window.


Examples of SMART questions Here's an example that breaks down the thought process of turning a problem question into one or more SMART questions using the SMART method: What features do people look for when buying a new car?

Specific: Does the question focus on a particular car feature?
Measurable: Does the question include a feature rating system?
Action-oriented: Does the question influence creation of different or new feature packages?
Relevant: Does the question identify which features make or break a potential car purchase?
Time-bound: Does the question validate data on the most popular features from the last three years?

Questions should be open-ended. This is the best way to get responses that will help you accurately qualify or disqualify potential solutions to your specific problem. So, based on the thought process, possible SMART questions might be:

  • On a scale of 1-10 (with 10 being the most important), how important is your car having four-wheel drive?
  • What are the top five features you would like to see in a car package?
  • What features, if included with four-wheel drive, would make you more inclined to buy the car?
  • How much more would you pay for a car with four-wheel drive?
  • Has four-wheel drive become more or less popular in the last three years?


The three (or four) V words for big data

When thinking about the benefits and challenges of big data, it helps to think about the three Vs: volume, variety, and velocity. Volume describes the amount of data. Variety describes the different kinds of data. Velocity describes how fast the data can be processed. Some data analysts also consider a fourth V: veracity. Veracity refers to the quality and reliability of the data. These are all important considerations related to processing huge, complex data sets.


How the data will be collected Decide if you will collect the data using your own resources or receive (and possibly purchase) it from another party. Data that you collect yourself is called first-party data.

Data sources If you don’t collect the data using your own resources, you might get data from second-party or third-party data providers. Second-party data is collected directly by another group and then sold. Third-party data is sold by a provider that didn’t collect the data themselves. Third-party data might come from a number of different sources.

Solving your business problem Datasets can show a lot of interesting information. But be sure to choose data that can actually help solve your problem question. For example, if you are analyzing trends over time, make sure you use time series data — in other words, data that includes dates.

How much data to collect If you are collecting your own data, make reasonable decisions about sample size. A random sample from existing data might be fine for some projects. Other projects might need more strategic data collection to focus on certain criteria. Each project has its own needs.

Time frame If you are collecting your own data, decide how long you will need to collect it, especially if you are tracking trends over a long period of time. If you need an immediate answer, you might not have time to collect new data. In this case, you would need to use historical data that already exists.


Data formats in practice: https://www.coursera.org/learn/data-preparation/supplement/mBSNa/data-formats-in-practice


What is data modeling? Data modeling is the process of creating diagrams that visually represent how data is organized and structured. These visual representations are called data models. You can think of data modeling as a blueprint of a house. At any point, there might be electricians, carpenters, and plumbers using that blueprint. Each one of these builders has a different relationship to the blueprint, but they all need it to understand the overall structure of the house. Data models are similar; different users might have different data needs, but the data model gives them an understanding of the structure as a whole.

Levels of data modeling Each level of data modeling has a different level of detail.

Conceptual data modeling gives a high-level view of the data structure, such as how data interacts across an organization. For example, a conceptual data model may be used to define the business requirements for a new database. A conceptual data model doesn't contain technical details.

_Logical data modeling_ focuses on the technical details of a database such as relationships, attributes, and entities. For example, a logical data model defines how individual records are uniquely identified in a database. But it doesn't spell out actual names of database tables. That's the job of a physical data model.

_Physical data modeling_ depicts how a database operates. A physical data model defines all entities and attributes used; for example, it includes table names, column names, and data types for the database.


U.S. government data site : Data.gov is one of the most comprehensive data sources in the US. This resource gives users the data and tools that they need to do research, and even helps them develop web and mobile applications and design data visualizations.

U.S. Census Bureau : This open data source offers demographic information from federal, state, and local governments, and commercial entities in the U.S. too.

Open Data Network : This data source has a really powerful search engine and advanced filters. Here, you can find data on topics like finance, public safety, infrastructure, and housing and development.

Google Cloud Public Datasets : There is a selection of public datasets available through the Google Cloud Public Dataset Program that you can find already loaded into BigQuery.

Dataset Search : The Dataset Search is a search engine designed specifically for data sets; you can use this to search for specific data sets.


Use the course's decision tree as a reminder of how to deal with data errors or not enough data.



Margin of error is used to determine how close your sample’s result is to what the result would likely have been if you could have surveyed or tested the entire population. Margin of error helps you understand and interpret survey or test results in real-life. Calculating the margin of error is particularly helpful when you are given the data to analyze. After using a calculator to calculate the margin of error, you will know how much the sample results might differ from the results of the entire population.
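For intuition, here is a minimal sketch of the usual margin-of-error calculation for a proportion. The 95% z-score and the survey numbers are illustrative assumptions, not from the course.

import numpy as np

# Hypothetical survey: 60% of a 500-person sample answered "yes"
n = 500
p_hat = 0.60
z = 1.96  # z-score for a 95% confidence level

margin_of_error = z * np.sqrt(p_hat * (1 - p_hat) / n)
print(f"Margin of error: +/- {margin_of_error:.1%}")  # roughly +/- 4.3%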


Common mistakes to avoid

Not checking for spelling errors: Misspellings can be as simple as typing or input errors. Most of the time the wrong spelling or common grammatical errors can be detected, but it gets harder with things like names or addresses. For example, if you are working with a spreadsheet table of customer data, you might come across a customer named “John” whose name has been input incorrectly as “Jon” in some places. The spreadsheet’s spellcheck probably won’t flag this, so if you don’t double-check for spelling errors and catch this, your analysis will have mistakes in it.

Forgetting to document errors: Documenting your errors can be a big time saver, as it helps you avoid those errors in the future by showing you how you resolved them. For example, you might find an error in a formula in your spreadsheet. You discover that some of the dates in one of your columns haven’t been formatted correctly. If you make a note of this fix, you can reference it the next time your formula is broken, and get a head start on troubleshooting. Documenting your errors also helps you keep track of changes in your work, so that you can backtrack if a fix didn’t work.

Not checking for misfielded values: A misfielded value happens when the values are entered into the wrong field. These values might still be formatted correctly, which makes them harder to catch if you aren’t careful. For example, you might have a dataset with columns for cities and countries. These are the same type of data, so they are easy to mix up. But if you were trying to find all of the instances of Spain in the country column, and Spain had mistakenly been entered into the city column, you would miss key data points. Making sure your data has been entered correctly is key to accurate, complete analysis.

Overlooking missing values: Missing values in your dataset can create errors and give you inaccurate conclusions. For example, if you were trying to get the total number of sales from the last three months, but a week of transactions were missing, your calculations would be inaccurate. As a best practice, try to keep your data as clean as possible by maintaining completeness and consistency.

Only looking at a subset of the data: It is important to think about all of the relevant data when you are cleaning. This helps make sure you understand the whole story the data is telling, and that you are paying attention to all possible errors. For example, if you are working with data about bird migration patterns from different sources, but you only clean one source, you might not realize that some of the data is being repeated. This will cause problems in your analysis later on. If you want to avoid common errors like duplicates, each field of your data requires equal attention.

Losing track of business objectives: When you are cleaning data, you might make new and interesting discoveries about your dataset-- but you don’t want those discoveries to distract you from the task at hand. For example, if you were working with weather data to find the average number of rainy days in your city, you might notice some interesting patterns about snowfall, too. That is really interesting, but it isn’t related to the question you are trying to answer right now. Being curious is great! But try not to let it distract you from the task at hand.

Not fixing the source of the error: Fixing the error itself is important. But if that error is actually part of a bigger problem, you need to find the source of the issue. Otherwise, you will have to keep fixing that same error over and over again. For example, imagine you have a team spreadsheet that tracks everyone’s progress. The table keeps breaking because different people are entering different values. You can keep fixing all of these problems one by one, or you can set up your table to streamline data entry so everyone is on the same page. Addressing the source of the errors in your data will save you a lot of time in the long run.

Not analyzing the system prior to data cleaning: If you want to clean your data and avoid future errors, you need to understand the root cause of your dirty data. Imagine you are an auto mechanic. You would find the cause of the problem before you started fixing the car, right? The same goes for data. First, you figure out where the errors come from. Maybe it is from a data entry error, not setting up a spell check, lack of formats, or from duplicates. Then, once you understand where bad data comes from, you can control it and keep your data clean.

Not backing up your data prior to data cleaning: It is always good to be proactive and create your data backup before you start your data clean-up. If your program crashes, or if your changes cause a problem in your dataset, you can always go back to the saved version and restore it. The simple procedure of backing up your data can save you hours of work-- and most importantly, a headache.

Not accounting for data cleaning in your deadlines/process: All good things take time, and that includes data cleaning. It is important to keep that in mind when going through your process and looking at your deadlines. When you set aside time for data cleaning, it helps you get a more accurate estimate for ETAs for stakeholders, and can help you know when to request an adjusted ETA.


To add data to an existing table: INSERT INTO table_name (column1, column2, column3) VALUES (value_for_column1, value_for_column2, value_for_column3)

To update data in an existing table: UPDATE table_name SET column1 = 'new_value' WHERE column2 = 'reference_value_in_the_same_row'

To output data while avoiding duplicates: SELECT DISTINCT column

To use the substring function: SELECT col FROM table WHERE SUBSTR(column1, starting_point, number_of_characters) = 'value'

To change a data type with the CAST function: CAST(column AS FLOAT64)

Using COALESCE() to return non-null data: SELECT COALESCE(col1, col2) -- returns whatever data is in col1, but if it is null, returns the data in col2
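A small end-to-end sketch of those statements using Python's sqlite3 (SQLite spells the cast target REAL rather than FLOAT64, and the sales table and its values are made up):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, product TEXT, amount TEXT, backup_amount TEXT)")

# INSERT INTO ... VALUES: add rows to an existing table
conn.executemany(
    "INSERT INTO sales (region, product, amount, backup_amount) VALUES (?, ?, ?, ?)",
    [("North", "apples", "10.5", None),
     ("North", "pears", None, "7.0"),
     ("South", "apples", "3.2", None)],
)

# UPDATE ... SET ... WHERE: change a value in the matching row
conn.execute("UPDATE sales SET amount = '11.0' WHERE region = 'North' AND product = 'apples'")

# SELECT DISTINCT: output data while avoiding duplicates
print(conn.execute("SELECT DISTINCT region FROM sales").fetchall())  # [('North',), ('South',)]

# SUBSTR as a condition, COALESCE for a null fallback, CAST to change the type
print(conn.execute(
    """
    SELECT product,
           CAST(COALESCE(amount, backup_amount) AS REAL) AS amount_num
    FROM sales
    WHERE SUBSTR(region, 1, 1) = 'N'
    """
).fetchall())  # [('apples', 11.0), ('pears', 7.0)]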

DataFrame methods:

.head()  # returns the first few rows (the "head" of the DataFrame)
.info()  # shows information on each of the columns, such as the data type and number of missing values
.shape  # returns the number of rows and columns of the DataFrame
.describe()  # calculates a few summary statistics for each column
.values  # a two-dimensional NumPy array of values
.columns  # an index of columns: the column names
.index  # an index for the rows: either row numbers or row names
df.sort_values("breed", ascending=True)  # sorts by one column
df.sort_values(["breed", "weight"], ascending=[True, False])  # sorts by multiple columns
dogs[(dogs["height_cm"] > 60) & (dogs["color"] == "tan")]  # filtering/selecting rows
dogs[dogs["color"].isin(["brown", "black", "tan"])]  # .isin() filters rows matching any of several values
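A minimal runnable sketch of those methods, using a tiny hand-made dogs DataFrame (the column names mirror the snippets above; the data is invented):

import pandas as pd

dogs = pd.DataFrame({
    "breed": ["Labrador", "Poodle", "Chow Chow", "Schnauzer"],
    "color": ["brown", "black", "tan", "grey"],
    "height_cm": [56, 43, 46, 49],
    "weight_kg": [25, 23, 22, 17],
})

print(dogs.head())        # first rows
print(dogs.shape)         # (4, 4)
print(dogs.describe())    # summary statistics for the numeric columns
print(dogs.sort_values(["breed", "weight_kg"], ascending=[True, False]))
print(dogs[(dogs["height_cm"] > 45) & (dogs["color"] == "tan")])   # boolean filtering
print(dogs[dogs["color"].isin(["brown", "black", "tan"])])         # membership filtering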

Summary statistics

df["column_name"].mean()  # returns the mean of the column (also .median() / .mode() / .min() / .max() / .var() / .std() / .quantile())
df["column"].agg(function)  # .agg() lets you apply your own custom function (or a list of functions) to a DataFrame or column
df["column"].cumsum()  # cumulative sum down the column: first value, first plus second, and so on (also .cummax(), .cummin(), .cumprod())
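A short sketch of .agg() with a custom function, on a hypothetical column of dog weights (same invented numbers as above):

import pandas as pd

weights = pd.Series([25, 23, 22, 17], name="weight_kg")

def pct30(column):
    # Custom aggregation: the 30th percentile of a column
    return column.quantile(0.3)

print(weights.mean())                    # 21.75
print(weights.agg([pct30, "median"]))    # custom function alongside a built-in one
print(weights.cumsum())                  # 25, 48, 70, 87 -- running total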

Counting

df.drop_duplicates(subset=["column1", "column2"])  # removes rows that are duplicated across the given column(s)
df["column"].value_counts(sort=True, normalize=True)  # counts occurrences of each value; sort orders the counts, normalize returns proportions

Grouped summary statistics

df.groupby("column_name")["another_column"].mean()  # mean of another_column within each group of column_name
df.groupby("column_name")["another_column"].agg([min, max, sum])  # several statistics at once
df.pivot_table(values="column_of_interest", index="column_of_separation", columns="other_column_of_separation", aggfunc=[function, function2], fill_value=0, margins=True)  # margins adds an "All" row and column aggregated over each column and group
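A compact sketch of groupby versus pivot_table on an invented sales table (the store/month/revenue columns are hypothetical):

import pandas as pd

sales = pd.DataFrame({
    "store": ["A", "A", "B", "B", "B"],
    "month": ["Jan", "Feb", "Jan", "Feb", "Feb"],
    "revenue": [100, 120, 80, 90, 70],
})

# Grouped summary statistics with groupby
print(sales.groupby("store")["revenue"].mean())              # A: 110.0, B: 80.0
print(sales.groupby("store")["revenue"].agg(["min", "max", "sum"]))

# The same idea expressed as a pivot table
print(sales.pivot_table(values="revenue", index="store", columns="month",
                        aggfunc="mean", fill_value=0, margins=True))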

Explicit indexes

df.set_index("column_to_use_as_the_index")
df.reset_index(drop=True)  # resets the index; drop=True discards the old index instead of keeping it as a column
df.loc[["row_name", "row_name_2"]]  # subsetting on the outer level
df.loc[[(outer_row, inner_row), (outer_row_1, inner_row_1)]]  # subsetting on the inner level
df.sort_index(level=["index_name", "index_name_2"], ascending=[True, False])
df.loc[outer_level_1:outer_level_2]  # slices rows from level 1 to level 2, including level 2
df.loc[("outer_level_1", "inner_level_1"):("outer_level_2", "inner_level_2")]
df.loc[:, "column1":"column2"]

Compared to slicing lists, there are a few things to remember: you can only slice an index if the index is sorted (using .sort_index()); to slice at the outer level, first and last can be strings; to slice at inner levels, first and last should be tuples; and if you pass a single slice to .loc[], it will slice the rows.
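A runnable sketch of multi-index subsetting and slicing on a tiny invented temperature table (countries, cities, and values are made up):

import pandas as pd

temps = pd.DataFrame({
    "country": ["Canada", "Canada", "Italy", "Italy"],
    "city": ["Toronto", "Vancouver", "Rome", "Milan"],
    "avg_temp_c": [9.0, 11.0, 16.0, 13.0],
})

temps_ind = temps.set_index(["country", "city"]).sort_index()

print(temps_ind.loc[["Canada", "Italy"]])                         # outer-level subsetting
print(temps_ind.loc[[("Canada", "Toronto"), ("Italy", "Rome")]])  # inner-level subsetting
print(temps_ind.loc["Canada":"Italy"])                            # outer-level slice (inclusive)
print(temps_ind.loc[("Canada", "Toronto"):("Italy", "Milan")])    # inner-level slice with tuples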

df['column_name'].hist(bins=number_of_bins, alpha=number_from_0_to_1)
df.groupby('column_name')['column_name'].mean()
df.plot(x='', y='', kind='bar'/'line'/'scatter', title='', rot=angle_in_degrees)
plt.legend(['F', 'M'])

df.isna().any()  # .isna() returns a True/False DataFrame with True marking NaN values; .any() then reports, per column, whether it has at least one NaN
df.isnull()  # alias of .isna()
df.isna().sum()  # how many NaNs in each column
df.dropna()  # removes rows with NaN values
df.fillna(value)  # replaces NaNs with value

new_df = pd.DataFrame(list_of_dicts)
new_df = pd.DataFrame(dict_of_lists)
pd.read_csv('path')
df.to_csv('path')

Joining with pandas

df1_df2 = df1.merge(df2, on=['common_column', ...], suffixes=('_left', '_right'), how='left'/'inner'/'right'/'outer')  # suffixes are appended to column names that appear in both DataFrames
df1_df2 = df1.merge(df2, how='left')  # merges df2 into df1; the result has at least as many rows as df1
df1_df2 = df1.merge(df2, how='right', left_on='column_from_left_table', right_on='column_from_right_table')  # df1 and df2 can be the same DataFrame, which is called a self merge
df1.merge(df2, left_index=True, right_index=True)  # set left_index and right_index when merging on indexes
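A small sketch of a left merge with invented stores and sales tables (the column names are hypothetical):

import pandas as pd

stores = pd.DataFrame({"store_id": [1, 2, 3], "city": ["Rome", "Milan", "Turin"]})
sales = pd.DataFrame({"store_id": [1, 1, 2], "revenue": [100, 150, 80]})

# A left merge keeps every store, even those with no sales rows
merged = stores.merge(sales, on="store_id", how="left")
print(merged)
#    store_id   city  revenue
# 0         1   Rome    100.0
# 1         1   Rome    150.0
# 2         2  Milan     80.0
# 3         3  Turin      NaN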

Joining tables vertically

pd.concat([df1, df2, df3], ignore_index=False, keys=['jan', 'feb', 'mar'])  # returns a multi-index table with the keys as the outer index and 0, 1, 2, ... as the inner index
pd.concat([df1, df2, df3], sort=True, join='inner')

df1.merge(df2, ..., validate='one_to_one')
pd.concat([...], verify_integrity=True/False)

df1.merge(df2) is a method; pd.merge_ordered(df1, df2) is a function
pd.merge_ordered(df1, df2, fill_method='ffill')  # forward fill: empty cells take the value from the row above
pd.merge_asof(df1, df2, direction='forward'/'backward')  # matches each left-table row to the nearest key in the right table that is greater (forward) or less (backward) than or equal to it

.query('a string of boolean expressions')
df.melt(id_vars=['col1', 'col2'], value_vars=['col3', 'col4'], var_name='str', value_name='str')  # id_vars are the columns you do not want to change; value_vars are the columns you want to unpivot

Statistics:

Two types:-

  1. Descriptive: describe and summarize data
  2. Inferential: use a sample of data to make inferences about a larger population

Two types of data:-

  1. Numeric (Quantitative): Continuous (measured), e.g. time spent waiting in line; Discrete (counted), e.g. number of pets
  2. Categorical (Qualitative): Nominal (unordered), e.g. country of residence; Ordinal (ordered), e.g. agree / somewhat agree / ...

Measuring Spreads

  • Variance: subtract the mean from each data point -> square the distances -> sum them -> divide by (number of data points - 1), OR np.var(df1['col'], ddof=1) (see the sketch after this list)

  • Standard deviation: sqrt of the var OR np.std(df1['col'], ddof=1)

  • Mean absolute deviation: dists = df['col'] - np.mean(df['col']) -> np.mean(np.abs(dists))

  • Quantiles: np.quantile(df['col'], 0.5)  # the 0.5 quantile is the median. Quartiles: np.quantile(df['col'], [0, 0.25, 0.5, 0.75, 1]) or np.quantile(df['col'], np.linspace(0, 1, 5))  # np.linspace(start, stop, num). Quintiles: np.quantile(df['col'], [0, 0.2, 0.4, 0.6, 0.8, 1])

  • IQR (interquartile range): np.quantile(df['col'], 0.75) - np.quantile(df['col'], 0.25), OR from scipy.stats import iqr -> iqr(df['col'])

  • Outliers: data point < Quartile1 - 1.5 * IQR OR data point > Quartile3 + 1.5 * IQR
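A minimal sketch computing those spread measures on an invented column of values (one deliberate outlier included):

import numpy as np
from scipy.stats import iqr

values = np.array([2.0, 4.0, 4.0, 5.0, 7.0, 9.0, 31.0])  # made-up data

variance = np.var(values, ddof=1)                 # sample variance
std_dev = np.std(values, ddof=1)                  # sample standard deviation
mad = np.mean(np.abs(values - np.mean(values)))   # mean absolute deviation
q1, q3 = np.quantile(values, [0.25, 0.75])
spread_iqr = iqr(values)                          # same as q3 - q1
outliers = values[(values < q1 - 1.5 * spread_iqr) | (values > q3 + 1.5 * spread_iqr)]
print(variance, std_dev, mad, spread_iqr, outliers)  # outliers -> [31.]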

.describe() is your bread and butter

Measuring chance: num of ways event can happen / num of possible outcomes

df.sample(2, replace=True)  # randomly samples 2 rows; replace=True allows the same row to be picked more than once
np.random.seed(10)  # sets the random seed for reproducibility

Expected value can be calculated by multiplying each possible outcome with its corresponding probability and taking the sum
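For example, a quick sketch of the expected value of a made-up game (outcomes and probabilities are invented):

import numpy as np

outcomes = np.array([0, 10, 100])             # hypothetical prizes
probabilities = np.array([0.7, 0.25, 0.05])   # must sum to 1

expected_value = np.sum(outcomes * probabilities)
print(expected_value)  # 0*0.7 + 10*0.25 + 100*0.05 = 7.5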

from scipy.stats import uniform

P(waiting time <= 7): uniform.cdf(7, loc=lower_limit, scale=upper_limit - lower_limit)  # probability of waiting 7 mins or less
1 - uniform.cdf(7, loc=lower_limit, scale=upper_limit - lower_limit)  # probability of waiting more than 7 mins

P(4 <= waiting time <= 7): uniform.cdf(7, loc=lower_limit, scale=upper_limit - lower_limit) - uniform.cdf(4, loc=lower_limit, scale=upper_limit - lower_limit)

uniform.rvs(loc=lower_limit, scale=upper_limit - lower_limit, size=number_of_random_values)  # generates values between loc and loc + scale
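A concrete sketch with an assumed 0-12 minute wait (the numbers are illustrative):

from scipy.stats import uniform

# Wait time uniformly distributed between 0 and 12 minutes: loc=0, scale=12
print(uniform.cdf(7, loc=0, scale=12))                                    # P(wait <= 7) ~= 0.583
print(1 - uniform.cdf(7, loc=0, scale=12))                                # P(wait > 7)  ~= 0.417
print(uniform.cdf(7, loc=0, scale=12) - uniform.cdf(4, loc=0, scale=12))  # P(4 <= wait <= 7) = 0.25
print(uniform.rvs(loc=0, scale=12, size=5))                               # 5 random wait times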

from scipy.stats import binom
binom.rvs(number_of_coins, probability_of_success, size=number_of_trials)
binom.rvs(1, 0.5, size=1)  # one coin, 0.5 chance: returns 1 for heads, 0 for tails

P(heads = 7) -> binom.pmf(num_of_heads, num_of_trials, prob_of_heads) -> binom.pmf(7, 10, 0.5)
P(heads <= 7) -> binom.cdf(7, 10, 0.5)
P(heads > 7) -> 1 - binom.cdf(7, 10, 0.5)

THE NORMAL DISTRIBUTION

from scipy.stats import norm
norm.cdf(number_of_interest, mean, std)  # proportion of values below the number of interest

What height are 90% of women shorter than? -> norm.ppf(0.9, mean, std)
What height are 90% of women taller than? -> norm.ppf(1 - 0.9, mean, std)
norm.rvs(mean, std, size=x)

The poisson distribution

from scipy.stats import poisson
poisson.pmf(5, 8)  # if there are 8 adoptions per week on average, the probability of exactly 5 adoptions in a week
poisson.cdf(5, 8)  # 5 adoptions or fewer
poisson.rvs(8, size=10)  # taking samples

The exponential distribution Probability of time between Poisson events

from scipy.stats import expon
expon.cdf(x, scale=expected_time_between_events)  # scale is 1/lambda, the mean time between events

The t-distribution Its shape is controlled by the degrees of freedom (df)

The log-normal distribution A variable whose logarithm is normally distributed

Correlation

import seaborn as sns
sns.scatterplot(x='', y='', data=df_name)
sns.lmplot(x='', y='', data=df_name, ci=None)
df['col'].corr(df['col2'])  # returns the correlation coefficient

DATA VISUALISATION

fig, ax = plt.subplots()
ax.plot(col_x, col_y, marker="o" or "v", linestyle="--" or "None", color='r')
ax.set_xlabel("Whatever")
ax.set_ylabel("Whatever")
ax.set_title("Whatever")
ax.set_xticklabels(x, rotation=90)
plt.show()

fig, ax = plt.subplots(rows, columns)  # rows is the number of subplots vertically; columns is the number of subplots horizontally
ax[0, 0].plot(...)

fig, ax = plt.subplots(rows, cols, sharey=True)

To draw two graphs into the same figure but with different axes:
ax.plot(x, y)
ax2 = ax.twinx()
ax2.plot(x, y2)

To change y axis labels' color: ax.tick_params('y', colors='...')

Annotation: ax.annotate("text..", xy=(...), xytext=(...), arrowprops={"arrowstyle":"->", "color":"grey"}) # ...=(pd.Timestamp('date'), y value)

Stacked bar charts: ax.bar(x, y, bottom=first_y + second_y, label="ex. bronze") then ax.legend()

Histograms: ax.hist(col, label=..., bins=10 or [edge1, edge2, ...], histtype="step")

Error bars:
ax.bar(x, y.mean(), yerr=y.std())
ax.errorbar(x_col, y_col, yerr=err_col)
ax.boxplot([y_col1, y_col2]) then ax.set_xticklabels(["...", "..."])

Scatter plots:
ax.scatter(x, y, color="...", label="...")
ax.scatter(x, y, c=a_time_series_col)  # makes a gradient-colored plot based on time

Changing a graph style: plt.style.use("ggplot") # other ones include: "default", "bmh", "seaborn-colorblind", "tableau-colorblind10", "grayscale", 'Solarize_Light2'

Sharing figures:
fig.savefig("filename.png", quality=1-100, dpi=...)  # other formats include .jpg and .svg
fig.set_size_inches([width, height])

col.unique()  # array of unique values

Seaborn plots:
sns.scatterplot(x=..., y=...)
sns.countplot(x (or y)='...', data=df)  # bar plot of counts

HUE sns.scatterplot(x=..., y=..., data=..., hue='col', hue_order=[..., ...], palette={'yes':'black', 'no':'red'})

Relational plots sns.relplot(x=..., y=..., data=..., kind="ex. scatter", row or/and col=..., col_wrap="to specify how many subplots in a row", col_order=...)

sns.relplot(x=..., y=..., data=..., kind=..., size='col', hue='same col as size')  # useful for quantitative columns
sns.relplot(x=..., y=..., data=..., style='col', hue='same as style')  # useful for 2-3 types of points
sns.relplot(x=..., y=..., data=..., alpha=0-1)  # for overlapping points
sns.relplot(x=..., y=..., data=..., kind='line', markers=..., dashes=..., ci='sd' or None)

Categorical plots:
sns.catplot(x=..., y=..., data=..., kind='count', order=..., ci=...)
sns.catplot(x=..., y=..., data=..., kind='box', order=..., sym="" to hide the outliers, whis=2.0 or [5, 95])  # whis changes the whiskers: a number is the IQR multiplier (1.5 by default); a list changes the percentile range
sns.catplot(x=..., y=..., data=..., kind='point', join=False to remove lines, estimator=median (mean by default), capsize=..., ci=...)

Change the style of a graph:
sns.set_style()  # "white", "dark", "whitegrid", "darkgrid", "ticks"
sns.set_palette()  # "RdBu", "PRGn", "RdBu_r", "PRGn_r", "Greys", "Blues", "PuRd", "GnBu"
sns.set_context()  # from smallest to largest: "paper", "notebook", "talk", and "poster"

Types of objects created by Seaborn: FacetGrid (relplot, catplot, etc.) and AxesSubplot (scatterplot, countplot, etc.)

Adding titles:
FacetGrid: g = sns.relplot(...) then g.fig.suptitle("...", y=...)  # y controls how high the title sits
AxesSubplot: g.set_title("...")

For multiple subplots you can use g.fig.suptitle to create a title above all the subplots and g.set_titles for the subplot titles. g.set_titles("This is {col_name}") will insert the column name into the string.

Setting labels and customizing: g.set(xlabel=..., ylabel=...) works for both FacetGrid and AxesSubplot; plt.xticks(rotation=90) works for both as well.

sns.displot(df['Award_Amount'], kind='kde', rug=True, fill=True)

More Seaborn plot types: lmplot and regplot
sns.despine(left=True)  # removes the y-axis spine

Set colors in Seaborn sns.set(color_codes=True)

Palettes:
Circular color palettes - non-ordered data
Sequential palettes - high-low data
Diverging palettes - when low and high data are interesting
sns.palplot(sns.color_palette('Purples', 8))  # to show color palettes

Create a plot with 1 row and 2 columns that share the y axis:
fig, (ax0, ax1) = plt.subplots(nrows=1, ncols=2, sharey=True)

Plot the distribution of 1-bedroom apartments on ax0:
sns.histplot(df['fmr_1'], ax=ax0)
ax0.set(xlabel="1 Bedroom Fair Market Rent", xlim=(100, 1500))

Categorical plots:
stripplot and swarmplot - show every observation on the graph
boxplot, violinplot and boxenplot - abstract representations
barplot, pointplot and countplot - statistical estimates

Regression plots: regplot, residplot

Matrix plots: the sns.heatmap() function requires the data to be in a grid format; pd.crosstab() is used to reshape the data.
sns.heatmap(data, annot=True, fmt='d', cmap='YlGnBu', cbar=False, linewidths=0.5)  # annot adds annotations, fmt='d' formats the values as integers, cmap sets the color shades, cbar=False hides the color bar, linewidths sets the spacing between cells

PairGrid & pairplot sns.pairplot(df, vars=[], kind=..., diag_kind=...)

JointGrid & jointplot

FUNCTIONS

global variable_name  # lets a function alter the value of a global variable; nonlocal variable_name does the same for a variable in the enclosing function's scope

def function(*args)  # lets a function accept any number of arguments

LAMBDA FUNCTIONS e.g. add = lambda x, y: x + y -> add(1, 2) -> 3

map(function, seq)  # applies the function to all the elements in the seq, such as a list. The function filter() offers a way to filter out elements from a list that don't satisfy certain criteria. The function reduce() also takes a function and a seq, combining the elements into a single value.
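A short sketch of all three on a made-up list of numbers:

from functools import reduce

nums = [1, 2, 3, 4, 5]

doubled = list(map(lambda x: x * 2, nums))        # [2, 4, 6, 8, 10]
evens = list(filter(lambda x: x % 2 == 0, nums))  # [2, 4]
total = reduce(lambda acc, x: acc + x, nums)      # 15
print(doubled, evens, total)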

try: and except:  # runs the code under try; if it raises an error, the code under except runs instead

raise  # used to raise an error such as TypeError, ValueError, etc.
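A tiny sketch of try/except and raise, using a made-up validation function:

def sqrt_of(x):
    # Raise a ValueError for negative input instead of returning a wrong answer
    if x < 0:
        raise ValueError("x must be non-negative")
    return x ** 0.5

try:
    print(sqrt_of(9))    # 3.0
    print(sqrt_of(-1))   # raises ValueError
except ValueError as err:
    print("Caught:", err)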

ITERATIONS

word ="DA" -> it = iter(word) -> next(it) -> 'D'-> next(it) -> 'A' print(*it) -> D A T A

enumerate(list, start=...) -> creates an object pairing each element of the list with its index. You can set the starting index using the start argument.

zip(list1, list2) -> pairs up the corresponding elements from the two lists; zip(*zipped_lists) -> unzips the zipped lists
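A quick sketch of both, with made-up lists:

names = ["avocado", "homelessness", "walmart"]
sizes = [18249, 51, 10774]  # invented row counts

for i, name in enumerate(names, start=1):
    print(i, name)            # 1 avocado, 2 homelessness, 3 walmart

pairs = list(zip(names, sizes))
print(pairs)                  # [('avocado', 18249), ('homelessness', 51), ('walmart', 10774)]
unzipped_names, unzipped_sizes = zip(*pairs)
print(unzipped_names)         # ('avocado', 'homelessness', 'walmart')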

pd.read_csv('file.csv', chunksize=100)  # loads the data from the csv file in chunks, useful when the data is too big to fit in memory

List comprehension: [output expression for iterator variable in iterable if predicate expression]
new_nums = [num + 1 for num in nums]  # adds 1 to all the values in nums and creates the new list new_nums
new_fellowship = [member if len(member) >= 7 else "" for member in fellowship]

Exploratory Data Analysis

# Add your code snippets here

Explore Datasets

Use the DataFrames imported in the first cell to explore the data and practice your skills!

  • Print the highest weekly sales for each department in the walmart DataFrame. Limit your results to the top five departments, in descending order. If you're stuck, try reviewing this video.
  • What was the total nb_sold of organic avocados in 2017 in the avocado DataFrame? If you're stuck, try reviewing this video.
  • Create a bar plot of the total number of homeless people by region in the homelessness DataFrame. Order the bars in descending order. Bonus: create a horizontal bar chart. If you're stuck, try reviewing this video.
  • Create a line plot with two lines representing the temperatures in Toronto and Rome. Make sure to properly label your plot. Bonus: add a legend for the two lines. If you're stuck, try reviewing this video.