Using PostgreSQL in Python

In this tutorial, you will learn about working with databases in Python, with a few examples using the PostgreSQL database.
Oct 2018 · 10 min read

Databases such as PostgreSQL require user authentication to access, and each is particular to a given database structure. Once you have access to one database, however, you can employ similar techniques with others.

PostgreSQL with Python Tutorial

This module illustrates some basic techniques for connecting to and using data from a relational database, in this case PostgreSQL, one of several SQL-based databases. If you would like to learn more about the systems-theory side of relational database management systems (RDBMS), "Database Systems and Concepts" is an excellent resource.

Using a SQL Database

Using a database in Python relies on being able to write database queries in SQL. In fact, a good way to test your Python scripts is to first test the SQL commands in a third-party program such as DBeaver.

A good place to practice these skills interactively is SQLZOO.

The Basics

In SQL databases, data is stored in tables. Tables are often grouped into a "schema," and a database may contain multiple schemas.

$$ \text{Database} \supset \text{Schema} \supset \text{Table} $$

Example: suppose you have a database holding the data and results of a machine learning problem, such as predicting adverse police incidents. The results are stored in the "results" schema, and specific model metrics live in the "evaluations" table. To view all such results:

```sql
select * from results.evaluations;
```

Decomposing this query, you have several essential elements. First, what task are you performing? Here, `select`. Next, what do you want to return? If everything, use `*`; if you only want one column, use `your_column_name`. Thus `select *` or `select your_column_name`. Finally, where are your data stored? `from` specifies a `schema.table`.

Putting this together, you have the general form: `select * from schema.table;` Note: it is best to end your SQL statements with a `;`.

Now suppose you want only the results for a specific model id, metric, and parameter. You can use a `where` clause to specify:

```sql
select * from results.evaluations
    where model_id=971
    and metric='false [email protected]' and parameter='75.0';
```

Connecting to a Database in Python

A simple way to connect to a database from Python is with the psycopg2 package. First, you will need a credentials file, such as example_psql.py:

```python
PGHOST = "your_database_host"
PGDATABASE = "your_database_name"
PGUSER = "your_database_username"
PGPASSWORD = "your_database_secret_password"
```
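Note that storing a plain-text password in a module is convenient for a tutorial but should not be committed to version control. A common alternative (an assumption on my part, not part of the original setup) is to read the credentials from environment variables instead:

```python
import os

# The names mirror the libpq defaults (PGHOST, PGDATABASE, PGUSER, PGPASSWORD).
PGHOST = os.environ.get("PGHOST", "localhost")
PGDATABASE = os.environ["PGDATABASE"]   # raises KeyError if unset
PGUSER = os.environ["PGUSER"]
PGPASSWORD = os.environ["PGPASSWORD"]
```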

Next, you will need to import several packages:

```python
import psycopg2
import sys, os
import numpy as np
import pandas as pd
import example_psql as creds
import pandas.io.sql as psql
```

Finally, the database connection can be relatively simple:

```python
## ****** LOAD PSQL DATABASE ***** ##

# Set up a connection to the postgres server.
conn_string = "host=" + creds.PGHOST + " port=5432 dbname=" + creds.PGDATABASE + \
              " user=" + creds.PGUSER + " password=" + creds.PGPASSWORD
conn = psycopg2.connect(conn_string)
print("Connected!")

# Create a cursor object
cursor = conn.cursor()


def load_data(schema, table):
    # Build the query string from the schema and table names.
    sql_command = "SELECT * FROM {}.{};".format(str(schema), str(table))
    print(sql_command)

    # Load the data into a pandas data frame.
    data = pd.read_sql(sql_command, conn)

    print(data.shape)
    return data
```

To underscore what this code is doing: it connects to your database using your credentials and returns the data you queried, i.e., `select * from schema.table;`, as a pandas data frame. You can then visualize or analyze this data just as you would any data loaded into pandas from a CSV.
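For example, to pull the results.evaluations table from earlier into a data frame (a usage sketch; remember to close the connection when you are done):

```python
# Load a table and inspect the first few rows.
df = load_data("results", "evaluations")
print(df.head())

# Clean up once finished.
cursor.close()
conn.close()
```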

A More Complex Example

Rather than making this simple connection inline, you can instead use a series of functions in a separate script to connect to the database. This script is located in setup/setup_environment.py:

```python
#!/usr/bin/env python
import os
import yaml
from sqlalchemy import create_engine
import logging

log = logging.getLogger(__name__)


def get_database():
    try:
        engine = get_connection_from_profile()
        log.info("Connected to PostgreSQL database!")
    except IOError:
        log.exception("Failed to get database connection!")
        return None

    return engine


def get_connection_from_profile(config_file_name="default_profile.yaml"):
    """
    Sets up database connection from config file.
    Input:
    config_file_name: File containing PGHOST, PGUSER,
                      PGPASSWORD, PGDATABASE, PGPORT, which are the
                      credentials for the PostgreSQL database
    """

    with open(config_file_name, 'r') as f:
        vals = yaml.safe_load(f)

    if not ('PGHOST' in vals and
            'PGUSER' in vals and
            'PGPASSWORD' in vals and
            'PGDATABASE' in vals and
            'PGPORT' in vals):
        raise Exception('Bad config file: ' + config_file_name)

    return get_engine(vals['PGDATABASE'], vals['PGUSER'],
                      vals['PGHOST'], vals['PGPORT'],
                      vals['PGPASSWORD'])


def get_engine(db, user, host, port, passwd):
    """
    Get SQLalchemy engine using credentials.
    Input:
    db: database name
    user: Username
    host: Hostname of the database server
    port: Port number
    passwd: Password for the database
    """

    url = 'postgresql://{user}:{passwd}@{host}:{port}/{db}'.format(
        user=user, passwd=passwd, host=host, port=port, db=db)
    engine = create_engine(url, pool_size=50)
    return engine
```
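The tutorial does not show default_profile.yaml itself, but given the keys that get_connection_from_profile checks for, it would look something like this (all values are placeholders):

```yaml
PGHOST: your_database_host
PGPORT: 5432
PGDATABASE: your_database_name
PGUSER: your_database_username
PGPASSWORD: your_database_secret_password
```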

Once we have this script, we can connect to the database in a new script:

```python
import sys
import os
import pandas as pd
import subprocess
import argparse
import pdb
import pickle
from setup import setup_environment

# Make PostgreSQL Connection
engine = setup_environment.get_database()
try:
    con = engine.raw_connection()
    con.cursor().execute("SET SCHEMA '{}'".format('your_schema_name'))
except Exception:
    # If setting the schema fails, continue with the default search path.
    pass
```

Note: in this code example, you would want to replace 'your_schema_name' with the name of your particular schema, for example, the 'models' schema.
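With the SQLAlchemy engine in hand, you can also query straight into pandas without touching the raw connection. A minimal sketch, reusing the results.evaluations table from earlier:

```python
import pandas as pd

from setup import setup_environment

# pd.read_sql accepts a SQLAlchemy engine directly.
engine = setup_environment.get_database()
df = pd.read_sql("SELECT * FROM results.evaluations;", engine)
print(df.shape)
```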

Trying Some Complex Queries

Now that you have made the database connection, you can try a more complex query, such as returning the pickle files (a Pythonic method of serializing data) for model results built using scikit-learn.

```python
def get_pickle_best_models(timestamp, metric, parameter=None, number=25, directory="results"):
    """
    --------------------------------------------------------
    Get the PICKLE FILE of the best models
    by the specified timestamp and given metric
    RETURNS the PICKLE FILE to a DIRECTORY
    --------------------------------------------------------
    ARGUMENTS:
        timestamp:  models run on or after given timestamp
                    example: '2018-09-02'
        metric:     metric to be optimized
                    example: '[email protected]'
        parameter:  parameter value or threshold if any
                    default=None
                    example: '10.0'
        number:     maximum number of desired results
                    default = 25
    --------------------------------------------------------
    """

    if parameter is None:
        query = ("SELECT pickle_blob, run_time FROM \
                    (SELECT evaluations.model_id, run_time \
                        FROM results.evaluations JOIN results.models \
                        ON evaluations.model_id=models.model_id \
                        WHERE run_time >= '{}' \
                        AND value is not null \
                        AND metric = '{}' \
                        ORDER BY value DESC LIMIT {}) \
                    AS top_models \
                    INNER JOIN results.data \
                    ON top_models.model_id=data.model_id ;").format(timestamp, metric, number)

    elif parameter is not None:
        query = ("SELECT pickle_blob, run_time FROM \
                    (SELECT evaluations.model_id, run_time \
                        FROM results.evaluations JOIN results.models \
                        ON evaluations.model_id=models.model_id \
                        WHERE run_time >= '{}' \
                        AND value is not null \
                        AND metric = '{}' \
                        AND parameter = '{}' \
                        ORDER BY value DESC LIMIT {}) \
                    AS top_models \
                    INNER JOIN results.data \
                    ON top_models.model_id=data.model_id ;").format(timestamp, metric, parameter, number)

    # Run the query and store the results as a pandas data frame.
    df_models = pd.read_sql(query, con=con)
    N = len(df_models['pickle_blob'])

    # Unpickle each blob and write it back out to a timestamped .pkl file.
    for file_number in range(N):
        pickle_file = pickle.loads(df_models['pickle_blob'].iloc[file_number])
        file_name = df_models['run_time'].apply(lambda x: str(x).replace(' ', 'T')).iloc[file_number]
        if parameter is None:
            full_file_name = "police_eis_results_" + "top_" + metric + "any" + "_" + file_name + ".pkl"
        elif parameter is not None:
            full_file_name = "police_eis_results_" + "top_" + metric + parameter + "_" + file_name + ".pkl"
        # os.path.join supplies the separator the plain concatenation was missing.
        file_path = os.path.join(directory, full_file_name)
        with open(file_path, "wb") as f:
            pickle.dump(pickle_file, f)

    return None
```
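A hypothetical call might look like the following; the metric and parameter strings are placeholders that depend entirely on what your results schema stores:

```python
# Save pickles for the ten best models evaluated on or after 2018-09-02.
# 'your_metric' and '10.0' are placeholders for values in your own tables.
get_pickle_best_models(timestamp="2018-09-02",
                       metric="your_metric",
                       parameter="10.0",
                       number=10,
                       directory="results")
```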

Dissecting this Function

Taking out some of the detail, you have these general processes:

```python
def get_pickle_best_models(timestamp, metric, parameter=None, number=25, directory="results"):

    if parameter is None:
        query = ...  # Do Query WITHOUT Parameter

    elif parameter is not None:
        query = ...  # Do Query WITH Parameter

    # Run Query and Store Results as Pandas Data Frame
    df_models = pd.read_sql(query, con=con)

    # Loop Over Data Frame to Save Pickle Files for Different Model Run Times
    # (Code Here)
```

Writing a Python Query

As seen in the example, you write the SQL query as a string:

query = ("Long SQL Query as a String")

Here, you can insert variables into the query using the .format(var) method. In this way, you can systematically vary the query based on various arguments passed to the function. The full form of one of the above queries is as follows:

query = ("SELECT pickle_blob, run_time  FROM \
              (SELECT evaluations.model_id, run_time \
              FROM results.evaluations JOIN results.models \
              ON evaluations.model_id=models.model_id \
              WHERE run_time >= '{}' \
              AND value is not null \
              AND metric = '{}' \
              AND parameter = '{}' \
              ORDER BY value DESC LIMIT {}) \
          AS top_models \
          INNER JOIN results.data \
          ON top_models.model_id=data.model_id ; " ).format(timestamp, metric, parameter, number)

Note the format of the query. In the Python code, you need to break query lines with `\` and use the format function `'{}'.format(x)` to insert variables into the script. Each format bracket needs surrounding single quotes when you are passing a string value, and none when you are passing a numeric value. Abstracting away the Python manipulation, you are left with a plain SQL query. Below, you can see the query with example values substituted for the timestamp, metric, parameter, and number of results:

```sql
SELECT pickle_blob, run_time  FROM
          (SELECT evaluations.model_id, run_time
              FROM results.evaluations JOIN results.models
              ON evaluations.model_id=models.model_id
              WHERE run_time >= '2018-09-02'
              AND value is not null
              AND metric = '[email protected]'
              AND parameter = '10.0'
              ORDER BY value DESC LIMIT 10)
          AS top_models
          INNER JOIN results.data
          ON top_models.model_id=data.model_id ;
```
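As an aside: string formatting works, but it forces you to manage the quoting yourself and is vulnerable to SQL injection if any value comes from user input. psycopg2 also supports parameterized queries with `%s` placeholders, which handle quoting for you. A minimal sketch of the same subquery filter, using a placeholder metric name:

```python
import pandas as pd

# Assumes `con` is the psycopg2 connection opened earlier.
# psycopg2 substitutes the parameters safely; no manual quoting is needed,
# regardless of whether a value is a string or a number.
query = """
    SELECT evaluations.model_id, run_time
    FROM results.evaluations JOIN results.models
      ON evaluations.model_id = models.model_id
    WHERE run_time >= %s
      AND value IS NOT NULL
      AND metric = %s
      AND parameter = %s
    ORDER BY value DESC
    LIMIT %s;
"""
# 'your_metric' is a placeholder; use a metric stored in your own tables.
df = pd.read_sql(query, con, params=("2018-09-02", "your_metric", "10.0", 10))
```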

What are you returning?

The above query is a more complex example, but a valuable one to work with. First, what do you want to return:

  • pickle_blob
  • run_time

e.g., `SELECT pickle_blob, run_time`

Where does this data come from?

  • A join of the results.data table with the result of joining results.evaluations and results.models in a subquery.

In the above example, you want both the model's pickle_blob and the model's run_time. You also want only the best models by certain criteria. Yet all of these attributes live in separate tables in the results schema. As a result, you must do several joins of these tables.

Technically, you are doing what is called a subquery. Note the first line of the main query, `SELECT pickle_blob, run_time FROM`, where the `FROM` refers to a subquery in parentheses, `(SELECT...LIMIT 10)`, which you refer to with the alias `AS top_models`.

Here, top_models is an arbitrary name defined and used only within the query. You could replace top_models with x or y or my_best_model_ever. The name does not matter as long as it is distinct from the other table names and reserved SQL keywords such as SELECT. It merely refers to the result of the subquery.

In essence, the subquery returns a selective version of the joined results.models and results.evaluations tables. You then do an INNER JOIN of the top_models subquery with results.data on a key id, using ON top_models.model_id=data.model_id.

What is your subquery?

```sql
SELECT evaluations.model_id, run_time
      FROM results.evaluations JOIN results.models
      ON evaluations.model_id=models.model_id
      WHERE run_time >= '2018-09-02'
      AND value is not null
      AND metric = '[email protected]'
      AND parameter = '10.0'
      ORDER BY value DESC LIMIT 10
```

The subquery itself joins the results.evaluations and results.models tables on model_id, and returns the model_id and run_time columns from the joined result.

Criteria:

Here, we only want models that fit certain criteria:

  • timestamp: `WHERE run_time >= '2018-09-02'`
  • non-null results in the value column: `AND value is not null`
  • metric: `AND metric = '[email protected]'`
  • parameter: `AND parameter = '10.0'`
  • n-top results: `ORDER BY value DESC LIMIT 10`

As you can see, although this query has many elements, you can manipulate each of them from Python. In the end, you can harness the power of databases and the functionality of Python to build dynamic code.

Conclusion

Well done! You now know the basics of working with databases in Python, illustrated with a few examples in PostgreSQL. You can continue building on these skills with the many resources available. If you are interested in learning more about database topics in general, read DataCamp's tutorial Execute Python/R in SQL.
