CHAPTER 1
Data Science Workflow: Data Collection -> Data Cleaning -> Data Visualization -> Data Prediction
APPLICATIONS OF DATA SCIENCE
- Traditional Machine Learning
- Internet of Things (IoT)
- Deep Learning
Machine Learning
What do we need?
- A well-defined question
  - e.g., "What's the probability that this transaction is fraudulent?"
- A set of example data
  - e.g., old transactions labeled as "fraudulent" or "valid"
- A new set of data to use our algorithm on
  - e.g., new credit card transactions
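The three ingredients above can be sketched in a few lines. This is a toy illustration with made-up transactions and a deliberately simple threshold rule standing in for a real machine learning model:

```python
# Example data (hypothetical): old transactions labeled "fraudulent" or "valid"
labeled = [
    {"amount": 12.50,  "label": "valid"},
    {"amount": 8.99,   "label": "valid"},
    {"amount": 950.0,  "label": "fraudulent"},
    {"amount": 1200.0, "label": "fraudulent"},
]

# "Training": learn a threshold halfway between the largest valid amount
# and the smallest fraudulent amount (a stand-in for a real algorithm).
max_valid = max(t["amount"] for t in labeled if t["label"] == "valid")
min_fraud = min(t["amount"] for t in labeled if t["label"] == "fraudulent")
threshold = (max_valid + min_fraud) / 2

def classify(amount):
    """Answer the well-defined question for a new transaction."""
    return "fraudulent" if amount > threshold else "valid"

# New data: score incoming transactions the model has never seen
print(classify(15.00))
print(classify(1100.00))
```

A real project would swap the threshold rule for a library model (e.g. scikit-learn), but the shape — question, labeled examples, new data — stays the same.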
The Internet of Things (IoT)
- Refers to gadgets that aren't standard computers; usually a physical device
  - smart watches
  - electronic toll collection systems
  - building energy management systems
  - much more!
- Case study: smart watch data
Deep Learning
- Many neurons work together
- Requires much more training data
- Used in complex problems:
  - image classification
  - language understanding
  - text summarization
- Case study: an image as a grid of pixels
DATA SCIENCE ROLES AND TOOLS
Roles: Data Engineer, Data Analyst, Data Scientist, Machine Learning Scientist
Data Engineer
- Information architects
- Build data pipelines and storage solutions
- Ensure data is collected, easily obtained, and processed
- Focus on data entry and storage
Tools:
- SQL: store and organize data
- Java, Scala, or Python: programming languages to process data
- Shell: command line to automate and run tasks
- Cloud computing: Amazon Web Services (AWS), Azure, Google Cloud Platform for large amounts of data
Data Analyst
- Perform simpler analyses that describe data
- Create reports and dashboards to summarize data
- Clean data for analysis
- Less programming and stats experience
- Focus on data prep, data exploration, and visualization
Tools:
- SQL: retrieve and aggregate data
- Spreadsheets (Excel or Google Sheets): simple analysis
- BI tools (Tableau, Power BI, Looker): dashboards and visualizations
- May have Python or R: for cleaning and analyzing data
Data Scientist
- Versed in statistical methods
- Run experiments and analyses for insights
- Traditional machine learning for prediction and forecasting
- Focus on last three stages of flow: Clean, Visualize, Explore
Tools:
- SQL: retrieve and aggregate data
- Python and/or R: data science libraries, e.g. pandas (Python) and tidyverse (R)
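The "retrieve and aggregate" step the notes attribute to SQL and pandas can be sketched without either. This toy example (made-up transaction data) shows the same group-and-average idea using only Python's standard library:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical transactions; in practice this would come from a database
transactions = [
    {"region": "north", "amount": 40.0},
    {"region": "north", "amount": 60.0},
    {"region": "south", "amount": 10.0},
]

# Group amounts by region, then average each group
groups = defaultdict(list)
for t in transactions:
    groups[t["region"]].append(t["amount"])

avg_by_region = {region: mean(vals) for region, vals in groups.items()}
print(avg_by_region)
```

In pandas this whole block collapses to a one-line `groupby("region")["amount"].mean()`, which is why data scientists reach for such libraries.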
Machine Learning Scientist
- Predictions and extrapolations
- Classification
- Deep learning:
  - image processing
  - natural language processing
- Focus on last three stages, with a strong focus on prediction
Tools:
- Python and/or R
- Machine learning libraries, e.g. TensorFlow or Spark
#Data Sources
#Company Data
- Collected by companies
- helps make data-driven decisions
#Open Data
- free, open data sources
- can be used, shared and built on by anyone
#Company data
- web events
- survey data
- customer data
- logistics data
- financial transactions
Web Data (e.g., used to calculate conversion rates)
- events
- timestamps
- user IDs
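A conversion rate can be computed directly from an event log like the one described above. A minimal sketch with hypothetical web events (one conversion rate definition among several: converting users divided by visiting users):

```python
# Hypothetical web-event log: user ID, timestamp, and event type
events = [
    {"user_id": 1, "timestamp": "2024-01-01T10:00", "event": "visit"},
    {"user_id": 1, "timestamp": "2024-01-01T10:05", "event": "purchase"},
    {"user_id": 2, "timestamp": "2024-01-01T11:00", "event": "visit"},
    {"user_id": 3, "timestamp": "2024-01-01T12:00", "event": "visit"},
    {"user_id": 3, "timestamp": "2024-01-01T12:09", "event": "purchase"},
]

# Distinct users who visited vs. distinct users who purchased
visitors = {e["user_id"] for e in events if e["event"] == "visit"}
converters = {e["user_id"] for e in events if e["event"] == "purchase"}

# 2 of 3 visitors purchased
conversion_rate = len(converters & visitors) / len(visitors)
print(f"{conversion_rate:.0%}")
```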
Survey Data
- Asking people for their opinions
- Methods:
  - face-to-face interview
  - online questionnaire
  - focus group
Net Promoter Score (NPS): asks how likely a user is to recommend a product to a friend or colleague
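The standard NPS calculation (not detailed in these notes, but the conventional definition) buckets 0-10 responses into promoters (9-10) and detractors (0-6), then subtracts the detractor percentage from the promoter percentage:

```python
def nps(scores):
    """Net Promoter Score from a list of 0-10 survey responses."""
    promoters = sum(1 for s in scores if s >= 9)
    detractors = sum(1 for s in scores if s <= 6)
    return 100 * (promoters - detractors) / len(scores)

# Hypothetical survey responses: 4 promoters, 2 passives (7-8), 2 detractors
responses = [10, 9, 8, 7, 6, 10, 3, 9]
print(nps(responses))
```

The result ranges from -100 (all detractors) to +100 (all promoters).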
#Open Data
Public data APIs
- Application Programming Interface
- Request data over the internet
- e.g., Twitter, Wikipedia, Yahoo Finance, Google Maps, and more
Tracking a hashtag
- All tweets with #DataFramed (Datacamp's podcast!)
- Use Twitter API
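Requesting data from a public API usually means building a URL with query parameters and parsing a JSON response. A sketch using Wikipedia's API (named in the notes); the request URL is built but not actually sent here, and the response shown is a canned example of the JSON shape:

```python
import json
from urllib.parse import urlencode

# Build a request URL for the MediaWiki search API
params = {
    "action": "query",
    "list": "search",
    "srsearch": "data science",
    "format": "json",
}
url = "https://en.wikipedia.org/w/api.php?" + urlencode(params)
print(url)

# Sending `url` (e.g. with urllib.request.urlopen or the `requests`
# library) returns JSON; a canned example of parsing such a response:
response_text = '{"query": {"search": [{"title": "Data science"}]}}'
data = json.loads(response_text)
print(data["query"]["search"][0]["title"])
```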
Public Records
- International organizations
  - e.g., World Bank, UN, WTO
- National statistical offices
  - e.g., censuses, surveys
- Government agencies
  - e.g., weather, environment, population
- Data.gov (USA)
#Data Types
Why is it important?
- Storing the data
- Visualizing and analyzing the data
#Quantitative and Qualitative Data
#Quantitative
- Deals with numbers
- Can be counted, measured, and expressed in numbers
Case Study: Refrigerator
- is 60in Tall
- has 2 apples in it
- Costs $1,000
#Qualitative
- Data that is descriptive
- Can be observed but not measured
Case study: Refrigerator
- is red
- was made in Italy
- Smells like fish
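The refrigerator case study maps directly onto data fields: quantitative properties are numbers, qualitative ones are descriptions. A small sketch of splitting a record by field type:

```python
# The refrigerator from the case study as one record
fridge = {
    "height_in": 60,      # quantitative: 60 inches tall
    "apples": 2,          # quantitative: 2 apples inside
    "price_usd": 1000,    # quantitative: costs $1,000
    "color": "red",       # qualitative: is red
    "made_in": "Italy",   # qualitative: made in Italy
    "smell": "fish",      # qualitative: smells like fish
}

# Numbers are quantitative; descriptive strings are qualitative
quantitative = {k: v for k, v in fridge.items() if isinstance(v, (int, float))}
qualitative = {k: v for k, v in fridge.items() if isinstance(v, str)}
print(sorted(quantitative))
print(sorted(qualitative))
```

This distinction matters downstream: quantitative fields support means and sums, while qualitative fields support counts and groupings.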
#Other Types of Data
- Image data
  - pixels make up images
- Text data
  - written reviews of companies
- Geospatial data
  - data with location info
  - e.g., street data, building data, vegetation data, all integrated into map layers (Google Maps, Waze)
- Network data
  - people or things in a network, and the relationships between them (Facebook)
- ...
#Data Storage and Retrieval
Step 1 in Data Science Workflow -> Data Collection and Storage
Things to Consider when Storing Data:
- Where (Location)
- What Kind of Data (Data Type)
- How will the data be accessed? (Retrieval)
Location: Parallel Storage Solutions
- Companies keep clusters or servers of computers on premises to allow manageable storage of information without overloading a single computer.
Location: the Cloud
- Pay someone else to store your info for you
- e.g. Microsoft Azure, Amazon Web Services (AWS), and Google Cloud
- help with not only storage, but also data analytics, machine learning and deep learning
Types of data storage
Unstructured
- text
- video and audio files
- web pages
- social media
- typically stored in a Document Database
Tabular (in a table, like a spreadsheet; more common)
- typically stored in a Relational Database
Both document and relational databases can be found on the cloud storage providers mentioned earlier.
Retrieval: Data Querying

| Database Type       | Query Language |
| ------------------- | -------------- |
| Document Database   | NoSQL          |
| Relational Database | SQL            |

SQL: Structured Query Language
NoSQL: Not Only SQL
Storing data is like building a library
- Decide where to build the library (Location)
  - on-premises cluster
  - cloud provider (AWS, Azure, Google Cloud)
- Decide what types of shelves to install to store the books (Data Type)
  - the types of shelves depend on the types of books

| Data Type    | Storage Solution    |
| ------------ | ------------------- |
| Unstructured | Document Database   |
| Tabular      | Relational Database |

- Build a system for referencing and checking out books (Retrieval)
  - the way you locate and retrieve each book depends on how that book is stored

| Storage Solution    | Query Language |
| ------------------- | -------------- |
| Document Database   | NoSQL          |
| Relational Database | SQL            |
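The "store on a shelf, then check out with a query" idea can be demonstrated end to end with Python's built-in SQLite, a small relational database standing in for any SQL system:

```python
import sqlite3

# Create an in-memory relational database and a tabular "shelf"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE books (title TEXT, shelf TEXT)")

# Store the books (hypothetical titles)
conn.executemany(
    "INSERT INTO books VALUES (?, ?)",
    [("SQL Basics", "A1"), ("Deep Learning", "B2")],
)

# Retrieval: query by title to find where the book lives
row = conn.execute(
    "SELECT shelf FROM books WHERE title = ?", ("SQL Basics",)
).fetchone()
print(row[0])
conn.close()
```

A document database would instead store each book as a JSON-like document and query it with a NoSQL language, but the store-then-query pattern is the same.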
Data Pipelines
How do we scale? More than one data source:
- APIs
- Public Records
- Databases
Different Data Types:
- Unstructured
- Tabular
- Real-time streaming data e.g. Live tweets
#Extract
What is a data pipeline?
- Moves data into defined stages
- Automated collection and storage
  - scheduled hourly, daily, weekly, monthly, etc.
  - triggered by an event
- Monitored with generated alerts
- Necessary for big data projects
- Data engineers work to customize solutions
- ETL (Extract, Transform, Load)
Case Study: Smart Home
With all the data coming in, how do we keep it organized and easy to use?
#Transform
Example transformations:
- joining data sources into one set
- converting data structures to fit database schemas
- removing irrelevant data
Note: data preparation and exploration does not occur at this stage
#Load
Once we've set up all those steps, we automate the whole pipeline.
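The three ETL stages can be sketched as plain functions. This toy pipeline uses made-up smart-home-style data: extract pulls from two hypothetical sources, transform joins them and drops an irrelevant field (as the notes describe), and load writes the result into a target store:

```python
def extract():
    """Pull raw records from two (hypothetical) sources."""
    users = [{"id": 1, "name": "Ada"}, {"id": 2, "name": "Grace"}]
    temps = [{"id": 1, "temp_c": 21.5, "debug": "x"},
             {"id": 2, "temp_c": 19.0, "debug": "y"}]
    return users, temps

def transform(users, temps):
    """Join the sources on id and remove the irrelevant 'debug' field."""
    by_id = {t["id"]: t["temp_c"] for t in temps}
    return [{"name": u["name"], "temp_c": by_id[u["id"]]} for u in users]

def load(rows, store):
    """Write the transformed rows into the target store."""
    store.extend(rows)

warehouse = []  # stand-in for a real database
load(transform(*extract()), warehouse)
print(warehouse)
```

In production, a scheduler or event trigger would run this function chain automatically (hourly, daily, or on each new event), which is exactly the automation the notes call for.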