CHAPTER 1
Data Science Workflow: Data Collection -> Data Cleaning -> Data Visualization -> Data Prediction
APPLICATIONS OF DATA SCIENCE
- Traditional Machine Learning
- Internet of Things (IoT)
- Deep Learning
Machine Learning
What do we need?
- A well-defined question
  - e.g., "What's the probability that this transaction is fraudulent?"
- A set of example data
  - e.g., old transactions labeled as "fraudulent" or "valid"
- A new set of data to use our algorithm on
  - e.g., new credit card transactions
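The three ingredients above can be sketched in a few lines. This is a toy illustration with made-up transactions and a deliberately simple threshold rule standing in for a real machine learning model:

```python
# Example data (hypothetical): old transactions labeled "fraudulent" or "valid"
labeled = [
    {"amount": 12.50,  "label": "valid"},
    {"amount": 8.99,   "label": "valid"},
    {"amount": 950.0,  "label": "fraudulent"},
    {"amount": 1200.0, "label": "fraudulent"},
]

# "Training": learn a threshold halfway between the largest valid amount
# and the smallest fraudulent amount (a stand-in for a real algorithm).
max_valid = max(t["amount"] for t in labeled if t["label"] == "valid")
min_fraud = min(t["amount"] for t in labeled if t["label"] == "fraudulent")
threshold = (max_valid + min_fraud) / 2

def classify(amount):
    """Answer the well-defined question for a new transaction."""
    return "fraudulent" if amount > threshold else "valid"

# New data: score incoming transactions the model has never seen
print(classify(15.00))
print(classify(1100.00))
```

A real project would swap the threshold rule for a library model (e.g. scikit-learn), but the shape — question, labeled examples, new data — stays the same.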
The Internet of Things (IoT)
- Refers to gadgets that aren't standard computers; usually a physical device
  - smart watches
  - electronic toll collection systems
  - building energy management systems
  - much more!
- Case study: smart watch data
Deep Learning
- Many neurons work together
- Requires much more training data
- Used in complex problems:
  - image classification
  - language understanding
  - text summarization
- Case study: an image as a grid of pixels
DATA SCIENCE ROLES AND TOOLS
Roles: Data Engineer, Data Analyst, Data Scientist, Machine Learning Scientist
Data Engineer
- Information architects
- Build data pipelines and storage solutions
- Ensure data is collected, easily obtained, and processed
- Focus on data entry and storage
Tools:
- SQL: store and organize data
- Java, Scala, or Python: programming languages to process data
- Shell: command line to automate and run tasks
- Cloud computing: Amazon Web Services (AWS), Azure, Google Cloud Platform for large amounts of data
Data Analyst
- Perform simpler analyses that describe data
- Create reports and dashboards to summarize data
- Clean data for analysis
- Less programming and stats experience
- Focus on data prep, data exploration, and visualization
Tools:
- SQL: retrieve and aggregate data
- Spreadsheets (Excel or Google Sheets): simple analysis
- BI tools (Tableau, Power BI, Looker): dashboards and visualizations
- May have Python or R: for cleaning and analyzing data
Data Scientist
- Versed in statistical methods
- Run experiments and analyses for insights
- Traditional machine learning for prediction and forecasting
- Focus on last three stages of flow: Clean, Visualize, Explore
Tools:
- SQL: retrieve and aggregate data
- Python and/or R: data science libraries, e.g. pandas (Python) and tidyverse (R)
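The "retrieve and aggregate" step the notes attribute to SQL and pandas can be sketched without either. This toy example (made-up transaction data) shows the same group-and-average idea using only Python's standard library:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactions; in practice this would come from a database
transactions = [
    {"region": "north", "amount": 40.0},
    {"region": "north", "amount": 60.0},
    {"region": "south", "amount": 10.0},
]

# Group amounts by region, then average each group
groups = defaultdict(list)
for t in transactions:
    groups[t["region"]].append(t["amount"])

avg_by_region = {region: mean(vals) for region, vals in groups.items()}
print(avg_by_region)
```

In pandas this whole block collapses to a one-line `groupby("region")["amount"].mean()`, which is why data scientists reach for such libraries.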
Machine Learning Scientist
- Predictions and extrapolations
- Classification
- Deep learning:
  - image processing
  - natural language processing
- Focus on last three stages, with a strong focus on prediction
Tools:
- Python and/or R
- Machine learning libraries, e.g. TensorFlow or Spark
#Data Sources
#Company Data
- Collected by companies
- helps make data-driven decisions
#Open Data
- free, open data sources
- can be used, shared and built on by anyone
#Company data
- web events
- survey data
- customer data
- logistics data
- financial transactions
Web Data (e.g., used to calculate conversion rates)
- events
- timestamps
- user IDs
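A conversion rate can be computed directly from an event log like the one described above. A minimal sketch with hypothetical web events (one conversion rate definition among several: converting users divided by visiting users):

```python
# Hypothetical web-event log: user ID, timestamp, and event type
events = [
    {"user_id": 1, "timestamp": "2024-01-01T10:00", "event": "visit"},
    {"user_id": 1, "timestamp": "2024-01-01T10:05", "event": "purchase"},
    {"user_id": 2, "timestamp": "2024-01-01T11:00", "event": "visit"},
    {"user_id": 3, "timestamp": "2024-01-01T12:00", "event": "visit"},
    {"user_id": 3, "timestamp": "2024-01-01T12:09", "event": "purchase"},
]

# Distinct users who visited vs. distinct users who purchased
visitors = {e["user_id"] for e in events if e["event"] == "visit"}
converters = {e["user_id"] for e in events if e["event"] == "purchase"}

# 2 of 3 visitors purchased
conversion_rate = len(converters & visitors) / len(visitors)
print(f"{conversion_rate:.0%}")
```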
Survey Data
- Asking people for their opinions
- Methods:
  - face-to-face interview
  - online questionnaire
  - focus group
Net Promoter Score (NPS): asks how likely a user is to recommend a product to a friend or colleague
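The standard NPS calculation (not detailed in these notes, but the conventional definition) buckets 0-10 responses into promoters (9-10) and detractors (0-6), then subtracts the detractor percentage from the promoter percentage:

```python
def nps(scores):
    """Net Promoter Score from a list of 0-10 survey responses."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical survey responses: 4 promoters, 2 passives (7-8), 2 detractors
responses = [10, 9, 8, 7, 6, 10, 3, 9]
print(nps(responses))
```

The result ranges from -100 (all detractors) to +100 (all promoters).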
#Open Data
Public data APIs
- Application Programming Interface
- Request data over the internet
- e.g., Twitter, Wikipedia, Yahoo Finance, Google Maps, and more
Tracking a hashtag
- All tweets with #DataFramed (Datacamp's podcast!)
- Use Twitter API
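Requesting data from a public API usually means building a URL with query parameters and parsing a JSON response. A sketch using Wikipedia's API (named in the notes); the request URL is built but not actually sent here, and the response shown is a canned example of the JSON shape:

```python
import json
from urllib.parse import urlencode

# Build a request URL for the MediaWiki search API
params = {
    "action": "query",
    "list": "search",
    "srsearch": "data science",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)

# Sending `url` (e.g. with urllib.request.urlopen or the `requests`
# library) returns JSON; a canned example of parsing such a response:
response_text = '{"query": {"search": [{"title": "Data science"}]}}'
data = json.loads(response_text)
print(data["query"]["search"][0]["title"])
```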
Public Records
- International organizations
  - e.g., World Bank, UN, WTO
- National statistical offices
  - e.g., censuses, surveys
- Government agencies
  - e.g., weather, environment, population
- Data.gov (USA)
#Data Types
Why is it important?
- Storing the data
- Visualizing and analyzing the data
#Quantitative and Qualitative Data
#Quantitative
- Deals with numbers
- Can be counted, measured, and expressed in numbers
Case Study: Refrigerator
- is 60in Tall
- has 2 apples in it
- Costs $1,000
#Qualitative
- Data that is descriptive
- Can be observed but not measured
Case study: Refrigerator
- is red
- was made in Italy
- Smells like fish
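The refrigerator case study maps directly onto data fields: quantitative properties are numbers, qualitative ones are descriptions. A small sketch of splitting a record by field type:

```python
# The refrigerator from the case study as one record
fridge = {
    "height_in": 60,      # quantitative: 60 inches tall
    "apples": 2,          # quantitative: 2 apples inside
    "price_usd": 1000,    # quantitative: costs $1,000
    "color": "red",       # qualitative: is red
    "made_in": "Italy",   # qualitative: made in Italy
    "smell": "fish",      # qualitative: smells like fish
}

# Numbers are quantitative; descriptive strings are qualitative
quantitative = {k: v for k, v in fridge.items() if isinstance(v, (int, float))}
qualitative = {k: v for k, v in fridge.items() if isinstance(v, str)}
print(sorted(quantitative))
print(sorted(qualitative))
```

This distinction matters downstream: quantitative fields support means and sums, while qualitative fields support counts and groupings.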
#Other Types of Data
- Image data
  - pixels make up images
- Text data
  - written reviews of companies
- Geospatial data
  - data with location info
  - e.g., street data, building data, vegetation data, all integrated into map layers (Google Maps, Waze)
- Network data
  - people or things in a network, and the relationships between them (Facebook)
- ...
#Data Storage and Retrieval
Step 1 in Data Science Workflow -> Data Collection and Storage
Things to Consider when Storing Data:
- Where (Location)
- What Kind of Data (Data Type)
- How will the data be accessed? (Retrieval)
Location: Parallel Storage Solutions
- Companies keep clusters or servers of computers on premises to allow manageable storage of information without overloading a single computer.
Location: the Cloud
- Pay someone else to store your info for you
- e.g. Microsoft Azure, Amazon Web Services (AWS), and Google Cloud
- help with not only storage, but also data analytics, machine learning and deep learning
Types of data storage
Unstructured
- text
- video and audio files
- web pages
- social media
- typically stored in a Document Database
Tabular (in a table, like a spreadsheet; more common)
- typically stored in a Relational Database
Both document and relational databases can be found on the cloud storage providers mentioned earlier.
Retrieval: Data Querying

| Database Type       | Query Language |
| ------------------- | -------------- |
| Document Database   | NoSQL          |
| Relational Database | SQL            |

SQL: Structured Query Language
NoSQL: Not Only SQL
Storing data is like building a library
- Decide where to build the library (Location)
  - on-premises cluster
  - cloud provider (AWS, Azure, Google Cloud)
- Decide what types of shelves to install to store the books (Data Type)
  - the types of shelves depend on the types of books

| Data Type    | Storage Solution    |
| ------------ | ------------------- |
| Unstructured | Document Database   |
| Tabular      | Relational Database |

- Build a system for referencing and checking out books (Retrieval)
  - the way you locate and retrieve each book depends on how that book is stored

| Storage Solution    | Query Language |
| ------------------- | -------------- |
| Document Database   | NoSQL          |
| Relational Database | SQL            |
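The "store on a shelf, then check out with a query" idea can be demonstrated end to end with Python's built-in SQLite, a small relational database standing in for any SQL system:

```python
import sqlite3

# Create an in-memory relational database and a tabular "shelf"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, shelf TEXT)")

# Store the books (hypothetical titles)
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("SQL Basics", "A1"), ("Deep Learning", "B2")],
)

# Retrieval: query by title to find where the book lives
row = conn.execute(
    "SELECT shelf FROM books WHERE title = ?", ("SQL Basics",)
).fetchone()
print(row[0])
conn.close()
```

A document database would instead store each book as a JSON-like document and query it with a NoSQL language, but the store-then-query pattern is the same.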
Data Pipelines
How do we scale? More than one data source:
- APIs
- Public Records
- Databases
Different Data Types:
- Unstructured
- Tabular
- Real-time streaming data e.g. Live tweets
#Extract
What is a data pipeline?
- Moves data into defined stages
- Automated collection and storage
  - scheduled hourly, daily, weekly, monthly, etc.
  - triggered by an event
- Monitored with generated alerts
- Necessary for big data projects
- Data engineers work to customize solutions
- ETL (Extract, Transform, Load)
Case Study: Smart Home
With all the data coming in, how do we keep it organized and easy to use?
#Transform
Example transformations:
- joining data sources into one set
- converting data structures to fit database schemas
- removing irrelevant data
Note: data preparation and exploration does not occur at this stage
#Load
Once we've set up all those steps, we automate the whole pipeline.
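The three ETL stages can be sketched as plain functions. This toy pipeline uses made-up smart-home-style data: extract pulls from two hypothetical sources, transform joins them and drops an irrelevant field (as the notes describe), and load writes the result into a target store:

```python
def extract():
    """Pull raw records from two (hypothetical) sources."""
    users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    temps = [{"id": 1, "temp_c": 21.5, "debug": "x"},
             {"id": 2, "temp_c": 19.0, "debug": "y"}]
    return users, temps

def transform(users, temps):
    """Join the sources on id and remove the irrelevant 'debug' field."""
    by_id = {t["id"]: t["temp_c"] for t in temps}
    return [{"name": u["name"], "temp_c": by_id[u["id"]]} for u in users]

def load(rows, store):
    """Write the transformed rows into the target store."""
    store.extend(rows)

warehouse = []  # stand-in for a real database
load(transform(*extract()), warehouse)
print(warehouse)
```

In production, a scheduler or event trigger would run this function chain automatically (hourly, daily, or on each new event), which is exactly the automation the notes call for.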