
Intermediate Importing Data in Python


Import package

from urllib.request import urlretrieve

Import pandas

import pandas as pd

Assign url of file: url

url = 'https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

Save file locally

urlretrieve(url, 'winequality-red.csv')

Read file into a DataFrame and print its head

df = pd.read_csv('winequality-red.csv', sep=';')
print(df.head())

Import packages

import matplotlib.pyplot as plt
import pandas as pd

Assign url of file: url

url = 'https://assets.datacamp.com/production/course_1606/datasets/winequality-red.csv'

Read file into a DataFrame: df

df = pd.read_csv(url, sep=';')

Print the head of the DataFrame

print(df.head())

Plot first column of df

df.iloc[:, 0].hist()
plt.xlabel('fixed acidity (g(tartaric acid)/dm$^3$)')
plt.ylabel('count')
plt.show()

Read a specific sheet from an Excel file

Import package

import pandas as pd

Assign url of file: url

url = 'https://assets.datacamp.com/course/importing_data_into_r/latitude.xls'

Read in all sheets of Excel file: xls

xls = pd.read_excel(url, sheet_name=None)

Print the sheetnames to the shell

print(xls.keys())

Print the head of the first sheet (using its name, NOT its index)

print(xls['1700'].head())

Performing HTTP requests in Python using urllib

Import packages

from urllib.request import urlopen, Request

Specify the url

url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

This packages the request: request

request = Request(url)

Sends the request and catches the response: response

response = urlopen(request)

Print the datatype of response

print(type(response))

Be polite and close the response!

response.close()
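Request objects can also carry custom headers; some servers reject requests that lack a User-Agent. A small sketch below, with a made-up header value (the attribute and method names are urllib's, but this example is not from the course):

```python
from urllib.request import Request

# Package a request with a custom header (the value here is illustrative)
request = Request("https://campus.datacamp.com/courses/1606/4135?ex=2",
                  headers={"User-Agent": "python-notes-example"})

# Nothing is sent yet; the Request just records the URL and headers
print(request.full_url)
print(request.get_header("User-agent"))
```

Note that urllib normalizes header names, so `get_header` expects the capitalized form `"User-agent"`.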

Since the response came from an HTML page, you can read it to extract the HTML; such an http.client.HTTPResponse object has an associated read() method. In this exercise, you'll build on your previous work to extract the response and print the HTML.

Import packages

from urllib.request import urlopen, Request

Specify the url

url = "https://campus.datacamp.com/courses/1606/4135?ex=2"

This packages the request

request = Request(url)

Sends the request and catches the response: response

response = urlopen(request)

Extract the response: html

html = response.read()

Print the html

print(html)

Be polite and close the response!

response.close()

Now that you've got your head and hands around making HTTP requests using the urllib package, you're going to figure out how to do the same using the higher-level requests library.

Import package

import requests

Specify the url: url

url = "http://www.datacamp.com/teach/documentation"

Package the request, send the request and catch the response: r

r = requests.get(url)

Extract the response: text

text = r.text

Print the html

print(text)

BeautifulSoup: extract data from HTML (e.g. only the text, or a list of all the web links on the page)

Import packages

import requests
from bs4 import BeautifulSoup

Specify url: url

url = 'https://www.python.org/~guido/'

Package the request, send the request and catch the response: r

r = requests.get(url)

Extracts the response as html: html_doc

html_doc = r.text

Create a BeautifulSoup object from the HTML: soup

soup = BeautifulSoup(html_doc, 'html.parser')

Prettify the BeautifulSoup object: pretty_soup

pretty_soup = soup.prettify()

Print the response

print(pretty_soup)

Turning a webpage into data using BeautifulSoup: getting the text

Import packages

import requests
from bs4 import BeautifulSoup

Specify url: url

url = 'https://www.python.org/~guido/'

Package the request, send the request and catch the response: r

r = requests.get(url)

Extract the response as html: html_doc

html_doc = r.text

Create a BeautifulSoup object from the HTML: soup

soup = BeautifulSoup(html_doc, 'html.parser')

Get the title of Guido's webpage: guido_title

guido_title = soup.title

Print the title of Guido's webpage to the shell

print(guido_title)

Get Guido's text: guido_text

guido_text = soup.get_text()

Print Guido's text to the shell

print(guido_text)

Turning a webpage into data using BeautifulSoup: getting the hyperlinks

Use the method find_all() to find all hyperlinks in soup, remembering that hyperlinks are defined by the HTML tag 'a' but passed to find_all() without angle brackets.

Import packages

import requests
from bs4 import BeautifulSoup

Specify url

url = 'https://www.python.org/~guido/'

Package the request, send the request and catch the response: r

r = requests.get(url)

Extracts the response as html: html_doc

html_doc = r.text

Create a BeautifulSoup object from the HTML: soup

soup = BeautifulSoup(html_doc, 'html.parser')

Print the title of Guido's webpage

print(soup.title)

Find all 'a' tags (which define hyperlinks): a_tags

a_tags = soup.find_all('a')

Print the URLs to the shell

for link in a_tags:
    print(link.get('href'))
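If bs4 isn't available, the standard library's html.parser can pull out the same href attributes. A minimal sketch over an inline HTML snippet (the snippet and class name are made up for illustration):

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect href attributes from 'a' tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs
        if tag == 'a':
            for name, value in attrs:
                if name == 'href':
                    self.links.append(value)

html_doc = '<p><a href="https://www.python.org/">Python</a> <a href="/~guido/">Guido</a></p>'
parser = LinkCollector()
parser.feed(html_doc)
print(parser.links)
```

BeautifulSoup remains the more convenient choice for real pages; the stdlib version avoids the extra dependency.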

### Loading and exploring a JSON

Load JSON: json_data

import json

with open("a_movie.json") as json_file:
    json_data = json.load(json_file)

Print each key-value pair in json_data

for k in json_data.keys():
    print(k + ': ', json_data[k])
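Alongside json.load for files, json.dumps and json.loads convert between dicts and JSON strings. A quick round trip with a made-up record (not from the course data):

```python
import json

# A small stand-in record
movie = {"Title": "The Social Network", "Year": "2010"}

as_text = json.dumps(movie)   # serialize the dict to a JSON string
back = json.loads(as_text)    # parse the string back into a dict
print(back == movie)
```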

In [1]: import json
   ...: with open("a_movie.json") as json_file:
   ...:     json_data = json.load(json_file)

In [2]: print(json_data['Title'])
The Social Network

In [3]: print(json_data['Year'])
2010

### API

Import requests package

import requests

Assign URL to variable: url

url = 'http://www.omdbapi.com/?apikey=72bc447a&t=the+social+network'

Package the request, send the request and catch the response: r

r = requests.get(url)

Print the text of the response

print(r.text)

Import package

import requests

Assign URL to variable: url

url = 'http://www.omdbapi.com/?apikey=72bc447a&t=social+network'

Package the request, send the request and catch the response: r

r = requests.get(url)

Decode the JSON data into a dictionary: json_data

json_data = r.json()

Print each key-value pair in json_data

for k in json_data.keys():
    print(k + ': ', json_data[k])

Extract a Wikipedia page on pizza

Import package

import requests

Assign URL to variable: url

url = 'https://en.wikipedia.org/w/api.php?action=query&prop=extracts&format=json&exintro=&titles=pizza'

Package the request, send the request and catch the response: r

r = requests.get(url)

Decode the JSON data into a dictionary: json_data

json_data = r.json()

Print the Wikipedia page extract

pizza_extract = json_data['query']['pages']['24768']['extract']
print(pizza_extract)
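The page id ('24768' here) differs for every title, so a more robust sketch takes the first entry of the pages dict instead of hard-coding the id. The stub below only mirrors the shape of the MediaWiki response; its values are made up:

```python
# Stub mimicking the nested structure of the API response
json_data = {'query': {'pages': {'24768': {'extract': '<p>Pizza is a dish...</p>'}}}}

# 'pages' is keyed by page id, which varies per title,
# so take the first value rather than hard-coding '24768'
page = next(iter(json_data['query']['pages'].values()))
print(page['extract'])
```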

Twitter

Unlike many other APIs, the Twitter API requires you to authenticate first with individual credentials tied to your Twitter account.

Store credentials in relevant variables

consumer_key = "nZ6EA0FxZ293SxGNg8g8aP0HM"
consumer_secret = "fJGEodwe3KiKUnsYJC3VRndj7jevVvXbK2D5EiJ2nehafRgA6i"
access_token = "1092294848-aHN7DcRP9B4VMTQIhwqOYiB14YkW92fFO8k8EPy"
access_token_secret = "X4dHmhPfaksHcQ7SCbmZa2oYBBVSD2g8uIHXsp5CTaksx"

Create your Stream object with credentials

import tweepy

stream = tweepy.Stream(consumer_key, consumer_secret, access_token, access_token_secret)

Filter your Stream variable

stream.filter(track=["clinton", "trump", "sanders", "cruz"])

Import package

import json

String of path to file: tweets_data_path

tweets_data_path = 'tweets.txt'

Initialize empty list to store tweets: tweets_data

tweets_data = []

Open connection to file

tweets_file = open(tweets_data_path, "r")

Read in tweets and store in list: tweets_data

for line in tweets_file:
    tweet = json.loads(line)
    tweets_data.append(tweet)

Close connection to file

tweets_file.close()

Print the keys of the first tweet dict

print(tweets_data[0].keys())
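The same line-by-line pattern can be tried without streaming from Twitter by writing a tiny synthetic file of one-JSON-object-per-line records; all fields below are made up:

```python
import json
import os
import tempfile

# Write two fake records, one JSON object per line, to a temp file
records = [{"text": "hello", "lang": "en"}, {"text": "hola", "lang": "es"}]
path = os.path.join(tempfile.mkdtemp(), "tweets.txt")
with open(path, "w") as f:
    for rec in records:
        f.write(json.dumps(rec) + "\n")

# Read them back with the same line-by-line pattern as above
tweets_data = []
with open(path) as f:
    for line in f:
        tweets_data.append(json.loads(line))

print(tweets_data[0]["lang"])
```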

Import package

import pandas as pd

Build DataFrame of tweet texts and languages

df = pd.DataFrame(tweets_data, columns=['text','lang'])

Print head of DataFrame

print(df.head())
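With the text and lang columns in a DataFrame, value_counts gives a quick language breakdown. A sketch on made-up records standing in for the streamed tweets:

```python
import pandas as pd

# Tiny made-up tweet records (real streamed tweets have many more keys)
tweets_data = [
    {"text": "a", "lang": "en"},
    {"text": "b", "lang": "en"},
    {"text": "c", "lang": "es"},
]
df = pd.DataFrame(tweets_data, columns=["text", "lang"])

# Count how many tweets appear in each language
lang_counts = df["lang"].value_counts()
print(lang_counts)
```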

A little bit of Twitter text analysis

Now that you have your DataFrame of tweets set up, you're going to do a bit of text analysis to count how many tweets contain the words 'clinton', 'trump', 'sanders' and 'cruz'. The function word_in_text() defined below tells you whether the first argument (a word) occurs within the second argument (a tweet).

import re

def word_in_text(word, text):
    word = word.lower()
    text = text.lower()
    match = re.search(word, text)

    if match:
        return True
    return False
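Because re.search matches substrings, 'trump' would also count an occurrence of 'trumpet'. A variant with \b word boundaries (a hypothetical helper, not from the course) restricts matches to whole words:

```python
import re

def word_in_text_exact(word, text):
    # \b anchors restrict the match to whole words, unlike plain re.search,
    # which also matches substrings (e.g. 'trump' inside 'trumpet');
    # re.escape guards against regex metacharacters in the word
    return bool(re.search(r'\b' + re.escape(word) + r'\b', text.lower()))
```

For the candidate names in this exercise the difference is unlikely to matter much, but it becomes important for shorter search terms.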

You're going to iterate over the rows of the DataFrame and calculate how many tweets contain each of our keywords! The counter for each candidate has been initialized to 0.

Initialize counters to store tweet counts

[clinton, trump, sanders, cruz] = [0, 0, 0, 0]

Iterate through df, counting the number of tweets in which each candidate is mentioned

for index, row in df.iterrows():
    clinton += word_in_text('clinton', row['text'])
    trump += word_in_text('trump', row['text'])
    sanders += word_in_text('sanders', row['text'])
    cruz += word_in_text('cruz', row['text'])

Import packages

import matplotlib.pyplot as plt
import seaborn as sns

Set seaborn style

sns.set(color_codes=True)

Create a list of labels: cd

cd = ['clinton', 'trump', 'sanders', 'cruz']

Plot the bar chart

ax = sns.barplot(x=cd, y=[clinton, trump, sanders, cruz])
ax.set(ylabel="count")
plt.show()
