CHAPTER 1

Data science workflow: Data Collection, Data Cleaning, Data Visualization, and Data Prediction

APPLICATIONS OF DATA SCIENCE

  • Traditional Machine Learning
  • Internet of Things (IoT)
  • Deep Learning

Machine Learning

What do we need?

  • A well-defined question
    • "What's the probability that this transaction is fraudulent?"
  • A set of example data
    • Old transactions labeled as "fraudulent" or "valid"
  • A new set of data to use our algorithm on
    • New credit card transactions
  • Usually a physical device
  • Case Study: Smart Watch data
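
The three machine learning ingredients listed above (a question, labeled examples, and new data) can be sketched in a few lines of Python. This is only an illustration, not a real fraud model: the transaction amounts, the labels, and the "closest average amount" rule are all invented for the example, and it predicts a label rather than a probability to keep things short.

```python
# Hypothetical labeled examples: (transaction amount, label).
labeled = [
    (12.0, "valid"), (25.0, "valid"), (40.0, "valid"),
    (900.0, "fraudulent"), (1200.0, "fraudulent"),
]

def train(examples):
    """Learn the average amount seen for each label."""
    totals, counts = {}, {}
    for amount, label in examples:
        totals[label] = totals.get(label, 0.0) + amount
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}

def predict(model, amount):
    """Assign the label whose average amount is closest to this amount."""
    return min(model, key=lambda label: abs(model[label] - amount))

model = train(labeled)

# New, unlabeled credit card transactions to run the model on.
new_transactions = [18.0, 1000.0]
predictions = [predict(model, amount) for amount in new_transactions]
print(predictions)
```

A real model would use many more features than the amount and far more examples, but the shape stays the same: learn from labeled data, then apply to new data.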

The Internet of Things (IoT)

  • Refers to gadgets that aren't standard computers
    • Smart watches
    • Electronic toll collection systems
    • Building energy management systems
    • Much more!
  • Case Study: Image of Pixels

Deep Learning

  • Many neurons work together
  • requires much more training data
  • used in complex problems:
    • image classification
    • Language understanding
    • Text Summary

DATA SCIENCE ROLES AND TOOLS

  • Data Engineer
  • Data Analyst
  • Data Scientist
  • Machine Learning Scientist

Data Engineer

  • Information Architects
  • Build data pipelines and storage solutions
  • Make sure data is collected and can be easily obtained and processed
  • Focus on the first stage of the workflow: data collection and storage

  • SQL: To store and organize data
  • Java, Scala, or Python: Programming languages to process data
  • Shell: Command line to automate and run tasks
  • Cloud computing: Amazon Web Services (AWS), Azure, Google Cloud Platform for large amounts of data

Data Analyst

  • Perform simpler analyses that describe data

  • Create reports and dashboards to summarize data

  • Clean data for analysis

  • Less programming and stats experience

  • Focus on data preparation, exploration, and visualization

  • SQL: Retrieve and aggregate data

  • Spreadsheets (Excel or Google Sheets) : Simple analysis

  • BI Tools (Tableau, Power BI, Looker) : Dashboards and visualizations

  • May have: Python or R : For cleaning and analyzing data

Data Scientist

  • Versed in statistical methods
  • Run experiments and analyses for insights
  • Traditional machine learning for prediction and forecasting
  • Focus on last three stages of the workflow: Clean, Visualize, Predict

  • SQL: Retrieve and aggregate data
  • Python and/or R: Data science libraries, e.g. pandas (Python) and tidyverse (R)

Machine Learning Scientist

  • Predictions and extrapolations
  • Classification
  • Deep learning:
    • Image Processing
    • Natural Language Processing
  • Focus on last three stages, with a strong focus on Prediction

Python and/or R

  • Machine learning libraries e.g. TensorFlow or Spark

#Data Sources

#Company Data

  • Collected by companies
  • helps make data-driven decisions

#Open Data

  • free, open data sources
  • can be used, shared and built on by anyone

#Company data

  • web events
  • survey data
  • customer data
  • logistics data
  • financial transactions

Web Data (used to calculate conversion rates)

  • Events
  • timestamps
  • User ID

Survey Data

  • Asking people for their opinions
  • Methods:
    • Face-to-face interview
    • Online questionnaire
    • Focus group

Net Promoter Score (NPS): asks how likely a user is to recommend a product to a friend or colleague
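
The notes do not give the formula, so as a supplement: the Net Promoter Score is conventionally computed from 0-10 ratings as the percentage of promoters (9-10) minus the percentage of detractors (0-6). A minimal Python sketch, with made-up survey responses:

```python
def net_promoter_score(ratings):
    """NPS = % promoters (9-10) minus % detractors (0-6), on a 0-10 scale."""
    if not ratings:
        raise ValueError("need at least one rating")
    promoters = sum(1 for r in ratings if r >= 9)
    detractors = sum(1 for r in ratings if r <= 6)
    return 100 * (promoters - detractors) / len(ratings)

# Hypothetical survey responses.
ratings = [10, 9, 9, 8, 7, 6, 3, 10, 5, 9]
nps = net_promoter_score(ratings)
print(nps)
```

Ratings of 7 and 8 ("passives") count toward the total number of responses but toward neither group, so the score ranges from -100 to 100.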

#Open Data

Public Data APIs

  • Application Programming Interface
  • Request Data over the internet
    • e.g. Twitter, Wikipedia, Yahoo Finance, Google Maps, and more

Tracking a hashtag

  • All tweets with #DataFramed (DataCamp's podcast!)
  • Use Twitter API
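
To make "request data over the internet" concrete, here is a hedged Python sketch of the general pattern: build a URL with query parameters, send a request, and parse the JSON that comes back. It does not use the real Twitter API. The endpoint `https://api.example.com/search`, the `q` and `limit` parameters, and the response shape are all placeholders; a real API documents its own URLs, parameters, and authentication.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/search"

def build_url(base_url, **params):
    """Encode query parameters into a request URL."""
    return f"{base_url}?{urlencode(params)}"

def fetch_json(url):
    """Send a GET request and decode the JSON body (requires network access)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))

def extract_texts(payload):
    """Pull the text of each result out of the assumed response shape."""
    return [item["text"] for item in payload.get("results", [])]

url = build_url(BASE_URL, q="#DataFramed", limit=10)
print(url)

# A canned response in the assumed shape, so the parsing step runs offline.
sample = json.loads('{"results": [{"text": "Loving #DataFramed"}]}')
texts = extract_texts(sample)
print(texts)
```

`fetch_json` is defined but not called here, since the endpoint is fictional; against a real API it would sit between `build_url` and `extract_texts`.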

Public Records

  • International organizations
    • e.g. World Bank, UN, WTO
  • National statistical offices
    • e.g. censuses, surveys
  • Government agencies
    • e.g. weather, environment, population

Data.gov (USA)

#Data Types

Why is it important?

  • Storing the data
  • Visualizing and analyzing the data

#Quantitative and Qualitative Data

#Quantitative

  • Deals with numbers
  • Can be counted, measured, and expressed in numbers

Case Study: Refrigerator

  • Is 60 inches tall
  • Has 2 apples in it
  • Costs $1,000

#Qualitative

  • Data that is descriptive
  • Can be observed but not measured

Case study: Refrigerator

  • is red
  • was made in Italy
  • Smells like fish
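
The two refrigerator case studies can be combined into one record to show why the distinction matters when storing data. The sketch below uses a simple rule of thumb (numbers are quantitative, everything else is qualitative); the field names are invented, and the rule would misclassify numeric codes such as ZIP codes, which are numbers that describe rather than measure.

```python
# The refrigerator from the case studies, as one record.
refrigerator = {
    "height_in": 60,
    "apples": 2,
    "cost_usd": 1000,
    "color": "red",
    "made_in": "Italy",
    "smell": "fish",
}

def split_by_kind(record):
    """Treat numeric values as quantitative and everything else as qualitative."""
    quantitative, qualitative = {}, {}
    for name, value in record.items():
        is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
        (quantitative if is_number else qualitative)[name] = value
    return quantitative, qualitative

quantitative, qualitative = split_by_kind(refrigerator)
print(sorted(quantitative))
print(sorted(qualitative))
```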

#Other types of data

  • Image Data
    • pixels make up images
  • Text Data
    • written reviews of companies
  • Geospatial Data
    • data with location info
    • e.g. street data, building data, and vegetation data, all integrated as map layers (Google Maps, Waze)
  • Network Data
    • people or things in a network, and the relationships between them (Facebook)
  • ...
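
Two of the data types above are easy to picture as plain Python structures. This is an illustrative sketch only: the pixel values and the names in the network are made up, and real image or network data is usually stored in specialized formats.

```python
# A tiny grayscale image: each number is one pixel's brightness (0-255).
image = [
    [0, 255, 0],
    [255, 255, 255],
    [0, 255, 0],
]
height, width = len(image), len(image[0])

# A tiny network: each person maps to the set of people they are connected to.
network = {
    "ana": {"ben", "cy"},
    "ben": {"ana"},
    "cy": {"ana"},
}
most_connected = max(network, key=lambda person: len(network[person]))

print(height, width, most_connected)
```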

#Data Storage and Retrieval

Step 1 in Data Science Workflow -> Data Collection and Storage

Things to Consider when Storing Data:

  • Where (Location)
  • What Kind of Data (Data Type)
  • How will the data be accessed (Retrieval)

Location: Parallel Storage Solutions

  • Companies keep clusters of servers on premises, spreading data across many machines so that no single computer is overloaded.

Location: the Cloud

  • Pay someone else to store your info for you
  • e.g. Microsoft Azure, Amazon Web Services (AWS), and Google Cloud
  • help with not only storage, but also data analytics, machine learning and deep learning

Types of Data Storage

Unstructured

  • email
  • text
  • video and audio files
  • web pages
  • social media

This type of data is typically stored in a Document Database.

Tabular (in a table, like a spreadsheet; more common)

  • Typically stored in a Relational Database

Both Document and Relational Databases are available from the cloud storage providers mentioned earlier.

Retrieval: Data Querying

Database Type          Query Language
Document Database      NoSQL
Relational Database    SQL

  • SQL: Structured Query Language
  • NoSQL: Not Only SQL
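
For the relational case, a short sketch of what "retrieve and aggregate" looks like in SQL, run from Python with the standard-library `sqlite3` module. The `transactions` table and its rows are invented for the example; document databases each have their own query interfaces and are not shown here.

```python
import sqlite3

# An in-memory relational database with one hypothetical table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE transactions (customer TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO transactions VALUES (?, ?)",
    [("ana", 20.0), ("ana", 35.0), ("ben", 10.0)],
)

# Retrieve and aggregate: total amount per customer.
rows = conn.execute(
    "SELECT customer, SUM(amount) AS total "
    "FROM transactions GROUP BY customer ORDER BY customer"
).fetchall()
print(rows)
conn.close()
```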

Storing Data is like building a library

  1. Decide where I'll build my library (Location)
    • On-premises cluster
    • Cloud provider (AWS, Azure, Google Cloud)
  2. Decide what types of shelves to install to store my books (Data Type)
    • The types of shelves will depend on the types of books

Data Type       Storage Solution
Unstructured    Document Database
Tabular         Relational Database

  3. Set up a system for referencing and checking out books (Retrieval)
    • The way you locate and retrieve each book depends on how that book is stored

Database Type          Query Language
Document Database      NoSQL
Relational Database    SQL

Data Pipelines

How do we scale?

More than one data source:

  • APIs
  • Public Records
  • Databases

Different Data Types:

  • Unstructured
  • Tabular
  • Real-time streaming data e.g. Live tweets

#Extract

What is a data pipeline?

  • Moves data into defined stages
  • Automated collection and storage
    • Scheduled hourly, daily, weekly, monthly, etc.
    • Triggered by an event
  • Monitored with generated alerts
  • Necessary for big data projects
  • Data engineers work to customize solutions
  • ETL (Extract, Transform, Load)

Case Study: Smart Home

With all the data coming in, how do we keep it organized and easy to use?

#Transform

Example transformations:

  • Joining data sources into one set
  • Converting data structures to fit database schemas
  • Removing irrelevant data

Data preparation and exploration do not occur at this stage.

#Load

Once all of these steps are set up, we automate the pipeline.
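
The Extract, Transform, and Load stages can be sketched as three small Python functions. This is a toy version under stated assumptions: the smart-home sources, field names, and the `readings` table are all hypothetical, the "sources" are hard-coded lists rather than real devices or APIs, and scheduling is left out.

```python
import sqlite3

def extract():
    """Return raw records from two hypothetical smart-home sources."""
    thermostat = [{"room": "kitchen", "temp_f": "68", "firmware": "1.2"},
                  {"room": "bedroom", "temp_f": "64", "firmware": "1.2"}]
    lights = [{"room": "kitchen", "on": True}, {"room": "bedroom", "on": False}]
    return thermostat, lights

def transform(thermostat, lights):
    """Join the sources by room, convert types, and drop irrelevant fields."""
    lights_by_room = {row["room"]: row["on"] for row in lights}
    return [
        (row["room"], float(row["temp_f"]), int(lights_by_room[row["room"]]))
        for row in thermostat  # the firmware field is dropped as irrelevant
    ]

def load(rows, conn):
    """Write the transformed rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS readings (room TEXT, temp_f REAL, light_on INTEGER)"
    )
    conn.executemany("INSERT INTO readings VALUES (?, ?, ?)", rows)
    conn.commit()

def run_pipeline(conn):
    """One pipeline run; a scheduler or event trigger would call this."""
    load(transform(*extract()), conn)

conn = sqlite3.connect(":memory:")
run_pipeline(conn)
stored = conn.execute("SELECT * FROM readings ORDER BY room").fetchall()
print(stored)
```

Keeping each stage as its own function mirrors the "defined stages" idea: any one stage can be monitored, tested, or replaced without touching the others.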