Intermediate Importing Data in Python
# Import package
from urllib.request import urlretrieve

# Import pandas
import pandas as pd

# Assign url of file: url
url = 'https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Save file locally
urlretrieve(url, 'winequality-red.csv')

# Read file into a DataFrame and print its head
df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())
# Import packages
import matplotlib.pyplot as plt
import pandas as pd

# Assign url of file: url
url = 'https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

# Read file into a DataFrame: df
df = pd.read_csv(url, sep=';')

# Print the head of the DataFrame
print(df.head())

# Plot first column of df
df.iloc[:, 0].hist()
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.show()
Reading a specific sheet from an Excel file
# Import package
import pandas as pd

# Assign url of file: url
url = 'https://assets.datacamp.com/course/importing_data_into_r/latitude.xls'

# Read in all sheets of Excel file: xls
xls = pd.read_excel(url, sheet_name=None)

# Print the sheetnames to the shell
print(xls.keys())

# Print the head of the first sheet (using its name, NOT its index)
print(xls['1700'].head())
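If you only need one sheet, you can pass its name (or index) to sheet_name instead of None. A minimal sketch, reusing the url above and the '1700' sheet we just saw:

# Read a single sheet by name (sheet_name=0 would select the first sheet by index)
df_1700 = pd.read_excel(url, sheet_name='1700')
print(df_1700.head())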
Performing HTTP requests in Python using urllib
# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request: request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Print the datatype of response
print(type(response))

# Be polite and close the response!
response.close()
Since the response came from an HTML page, you can read it to extract the HTML; in fact, such an http.client.HTTPResponse object has an associated read() method. In this exercise, you'll build on your previous work to extract the response and print the HTML.
# Import packages
from urllib.request import urlopen, Request

# Specify the url
url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

# This packages the request
request = Request(url)

# Sends the request and catches the response: response
response = urlopen(request)

# Extract the response: html
html = response.read()

# Print the html
print(html)

# Be polite and close the response!
response.close()
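A variation you may prefer: using urlopen() as a context manager closes the response automatically, so you can't forget to be polite. A sketch of the same request:

# Open the connection in a with-block; the response is closed on exit
from urllib.request import urlopen, Request

request = Request("https://campus.datacamp.com/courses/1606/4135?ex=2")
with urlopen(request) as response:
    html = response.read()
print(html)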
Now that you've got your head and hands around making HTTP requests using the urllib package, you're going to figure out how to do the same using the higher-level requests library.
# Import package
import requests

# Specify the url: url
url = "http://www.datacamp.com/teach/documentation"

# Packages the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response: text
text = r.text

# Print the html
print(text)
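Before trusting r.text, it can be worth checking that the request actually succeeded; requests exposes the HTTP status code and a raise_for_status() helper:

# Check the HTTP status code (200 means OK)
print(r.status_code)

# Or raise an HTTPError automatically for 4xx/5xx responses
r.raise_for_status()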
BeautifulSoup: extract data from HTML (e.g. only the text, or a list of all the hyperlinks on the page)
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Prettify the BeautifulSoup object: pretty_soup
pretty_soup = soup.prettify()

# Print the response
print(pretty_soup)
Turning a webpage into data using BeautifulSoup: getting the text
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url: url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extract the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Get the title of Guido's webpage: guido_title
guido_title = soup.title

# Print the title of Guido's webpage to the shell
print(guido_title)

# Get Guido's text: guido_text
guido_text = soup.get_text()

# Print Guido's text to the shell
print(guido_text)
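Note that soup.title returns the whole <title> tag, markup included; if you only want the text inside it, the tag's .string attribute strips the markup:

# Print just the text of the title tag, without the surrounding markup
print(soup.title.string)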
Turning a webpage into data using BeautifulSoup: getting the hyperlinks
Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag <a> but are passed to find_all() without angle brackets.
# Import packages
import requests
from bs4 import BeautifulSoup

# Specify url
url = 'https://www.python.org/~guido/'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Extracts the response as html: html_doc
html_doc = r.text

# Create a BeautifulSoup object from the HTML: soup
soup = BeautifulSoup(html_doc, 'html.parser')

# Print the title of Guido's webpage
print(soup.title)

# Find all 'a' tags (which define hyperlinks): a_tags
a_tags = soup.find_all('a')

# Print the URLs to the shell
for link in a_tags:
    print(link.get('href'))
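link.get('href') can also return None or a relative path; if you only want absolute links, a small filter on the loop above does the trick:

# Keep only absolute URLs; get('href') returns None when the attribute is missing
for link in a_tags:
    href = link.get('href')
    if href and href.startswith('http'):
        print(href)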
### Loading and exploring a JSON
# Import package
import json

# Load JSON: json_data
with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
In [1]: import json
   ...: with open("a_movie.json") as json_file:
   ...:     json_data = json.load(json_file)

In [2]: print(json_data['Title'])
The Social Network

In [3]: print(json_data['Year'])
2010
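If the raw key-value dump is hard to read, json.dumps() can pretty-print the whole dictionary with indentation:

# Pretty-print the JSON with two-space indentation
print(json.dumps(json_data, indent=2))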
### API
# Import requests package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Print the text of the response
print(r.text)
# Import package
import requests

# Assign URL to variable: url
url = 'http://www.omdbapi.com/?apikey=72bc447a&t=social+network'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print each key-value pair in json_data
for k in json_data.keys():
    print(k + ': ', json_data[k])
Extract a Wikipedia page on pizza
# Import package
import requests

# Assign URL to variable: url
url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

# Package the request, send the request and catch the response: r
r = requests.get(url)

# Decode the JSON data into a dictionary: json_data
json_data = r.json()

# Print the Wikipedia page extract
pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
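The page id '24768' is specific to the Pizza article. To avoid hard-coding it, you can iterate over whichever pages the response contains; a sketch assuming the same response structure:

# Loop over all returned pages instead of hard-coding the page id
for page_id, page in json_data['query']['pages'].items():
    print(page['extract'])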
Twitter
Unlike the other APIs in this course, the Twitter API requires you to authenticate first with the credentials of your own Twitter profile.
# Import package
import tweepy

# Store credentials in relevant variables
consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"

# Create your Stream object with credentials
stream = tweepy.Stream(consumer_key, consumer_secret, access_token, access_token_secret)

# Filter your Stream variable on a list of keywords
stream.filter(track=["clinton", "trump", "sanders", "cruz"])
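For reference, here is a minimal sketch of how a file like tweets.txt (used below) might be captured, assuming tweepy 4.x; TweetWriter is a hypothetical helper class, not part of the course code:

# Subclass tweepy.Stream and override on_data to append each raw JSON payload to a file
import tweepy

class TweetWriter(tweepy.Stream):  # hypothetical helper class
    def on_data(self, raw_data):
        # raw_data is the raw JSON payload for one streamed tweet
        if isinstance(raw_data, bytes):
            raw_data = raw_data.decode('utf-8')
        with open('tweets.txt', 'a') as file:
            file.write(raw_data + '\n')

stream = TweetWriter(consumer_key, consumer_secret, access_token, access_token_secret)
stream.filter(track=["clinton", "trump", "sanders", "cruz"])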
# Import package
import json

# String of path to file: tweets_data_path
tweets_data_path = 'tweets.txt'

# Initialize empty list to store tweets: tweets_data
tweets_data = []

# Open connection to file
tweets_file = open(tweets_data_path, "r")

# Read in tweets and store in list: tweets_data
for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

# Close connection to file
tweets_file.close()

# Print the keys of the first tweet dict
print(tweets_data[0].keys())
# Import package
import pandas as pd

# Build DataFrame of tweet texts and languages
df = pd.DataFrame(tweets_data, columns=['text', 'lang'])

# Print head of DataFrame
print(df.head())
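For a quick look at which languages dominate before any keyword counting, pandas' value_counts() summarizes the lang column:

# Count tweets per language
print(df['lang'].value_counts())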
A little bit of Twitter text analysis
Now that you have your DataFrame of tweets set up, you're going to do a bit of text analysis to count how many tweets contain the words 'clinton', 'trump', 'sanders' and 'cruz'. In the pre-exercise code, the following function word_in_text() has been defined; it tells you whether the first argument (a word) occurs within the second argument (a tweet).
import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False
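For example, in the shell:

In [1]: word_in_text('clinton', 'Hillary Clinton is campaigning today')
Out[1]: True

In [2]: word_in_text('cruz', 'Hillary Clinton is campaigning today')
Out[2]: False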
You're going to iterate over the rows of the DataFrame and calculate how many tweets contain each of our keywords! The counter for each candidate has been initialized to 0.
# Initialize list to store tweet counts
[clinton, trump, sanders, cruz] = [0, 0, 0, 0]
# Iterate through df, counting the number of tweets in which each candidate
# is mentioned; the booleans returned by word_in_text() count as 1/0 when summed
for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])
# Import packages
import matplotlib.pyplot as plt
import seaborn as sns

# Set seaborn style
sns.set(color_codes=True)

# Create a list of labels: cd
cd = ['clinton', 'trump', 'sanders', 'cruz']

# Plot the bar chart
ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()