A correctly configured workstation makes development easier. That means installing and configuring the right tools and setting up everything that helps developers do their best work when building software. Data scientists are no exception, so a correct setup is crucial for building anything from visualization tools to ingestion processes. This tutorial explains the basics of a computer setup for data science, but it will also be useful for anyone interested in developing software with Python. Let's get to it!
Scope
Python is available for a wide variety of operating systems. This tutorial shows snippets executed on Ubuntu Linux, but most of them also apply to other UNIX-based systems and macOS. For MS Windows there are minor differences, which are mentioned wherever they apply.
Python: the Core Piece
tl;dr Use the Python 3 implementation from the official website
Treating the Python interpreter as the core piece for developing applications in this language seems like an obvious choice. Even so, you may get stuck when looking for the best Python interpreter to use. You may also have read about two different versions of the language, so choosing the correct version and interpreter is the first important decision to make.
Python, Anaconda, Jython, WinPython, IronPython... too many Snakes for a Single Cage
Due to the popularity of the language, there are many implementations of its interpreter. Some of them are ready to be installed on your computer, while others are in early development and are only recommended for their own developers or for brave programmers. The official Python wiki lists some of the available interpreters and points out their specific features compared to other implementations. As we will see throughout this tutorial, the Python distribution available on the official website includes everything you need to start developing your Python programs, without missing any of the features other implementations may offer.
Python2 or Python3? Which Version Should You Use?
Even though Python 3 was released ten years ago, the transition from Python 2 to the newer version has not been a piece of cake. One of the main reasons for the low adoption rate was the set of syntax differences, which meant that existing libraries were not available for the new version. As you will notice while developing Python software, the availability of libraries for all kinds of purposes is one of the key strengths of the language. During this period, tools and libraries such as 2to3 and six were created to narrow the gap for existing software and ease its migration to the new version. At the time of writing, the vast majority of libraries are readily available for Python 3, so the decision is quite straightforward: Python 3. Furthermore, Python 2 will reach the end of its lifecycle in 2020. Some people are even planning a Celebration of Life party at PyCon 2020, so do not hesitate to join them to thank Python 2 for what it did for you and your software!
Installing Python on your Computer
The Python interpreter is available for most operating systems, and some of them even provide packages that ease its installation. On Ubuntu, both Python 2 and Python 3 are included by default in desktop environments. Keep in mind that the python command still points to Python 2 rather than Python 3; you can change it, or define an alias in your ~/.bashrc file, if you find it useful. As mentioned above, it is expected that in the coming years only one Python version will be installed. My recommendation at this point is to install the most recent version of Python. If you already have Python 3 installed on your computer, check that its version is 3.5 or higher. If you have an older version installed, you are encouraged to upgrade it and enjoy the latest features of the language.
$ python3 --version
Python 3.5.2
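The same check can be scripted, which is handy at the top of a setup script. The following is a minimal sketch rather than an official recipe: the helper name is_recent_enough is made up for this example, and the (3, 5) threshold is simply the minimum version suggested above.

```python
import sys

# Assumption: (3, 5) mirrors the minimum version suggested in this tutorial.
MINIMUM_VERSION = (3, 5)


def is_recent_enough(version_info=sys.version_info, minimum=MINIMUM_VERSION):
    """Return True when the (major, minor) pair is at least `minimum`."""
    return tuple(version_info[:2]) >= tuple(minimum)


# sys.version_info describes the interpreter running this script.
print("Running Python {}.{}".format(*sys.version_info[:2]))
print("Recent enough:", is_recent_enough())
```

Passing a tuple explicitly, for example is_recent_enough((2, 7, 15)), lets you try the check against versions other than the one you are running.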
For Windows users, installing the Python interpreter is also quite simple. The official downloads page provides an installer that sets up the interpreter with just a few clicks.
Managing Project Dependencies with pip
For basic development, the Python interpreter is all you need to start coding. The Python software bundle comes with the interpreter plus some tools aimed at easing application management. One of these tools is called pip, and its purpose is to help you install third-party libraries stored in PyPI (the Python Package Index) or in other public or private repositories.
Our First Python Project: Jupyter Notebooks
Let's think of our first data science project: a Jupyter notebook that will wrangle some data and present some stunning visualizations. The first library required to work with notebooks is Jupyter. Installing Jupyter on your computer is as simple as follows:
$ pip install --user jupyter
Notice that the --user option installs the library only for the user that runs the command. A system-wide installation can be done by removing this option, but it may require a privileged user. Apart from installing, the pip command lets you download libraries, query the current configuration, upgrade a library, and uninstall it when it is no longer needed. The following table lists some of the commands that will be useful for managing your application's needs:
Command | Description |
---|---|
pip install <package> | Install a package |
pip install --user <package> | Install a package for a single user |
pip install --upgrade <package> | Update a package |
pip uninstall <package> | Uninstall a package |
pip download <package> | Download a package |
pip freeze | List installed packages |
pip install -r requirements.txt | Install packages listed in the given requirements.txt file |
pip show <package> | Show information about the package |
We will come back to the commands listed above in order to manage project dependencies. For now, think of pip as the dependency manager for your application. Last but not least, it is important to know that the pip command installs not only the requested libraries but also their dependencies, so there is no need to deal with complex installation procedures. pip install is the way to go!
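To make the requirements file less abstract: pip freeze prints one name==version line per installed package, and that output is what usually ends up in requirements.txt. The sketch below parses such lines into a dictionary. It is illustrative only: parse_pinned_requirements is a hypothetical helper, the sample content is made up for the example, and the code handles just the simple pinned form, not the full requirements syntax that pip accepts.

```python
def parse_pinned_requirements(text):
    """Map package names to versions for simple `name==version` lines."""
    pinned = {}
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and padding
        if not line or "==" not in line:
            continue  # skip blanks and anything that is not a simple pin
        name, version = line.split("==", 1)
        pinned[name.strip().lower()] = version.strip()
    return pinned


# Made-up sample in the style of `pip freeze` output.
sample = """
# project dependencies
jupyter==1.0.0
requests==2.19.1
"""

print(parse_pinned_requirements(sample))
```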
Virtual Environments: Isolate your Project Settings
Imagine that, while working on your notebooks project, you discover a new, shiny library with some features you would like to try before adding it to your project configuration. This library requires a different version of Jupyter to be installed. At this point you have the following options:
- Install the new package without its dependencies. The library will probably malfunction unless you make some changes to its code.
- Install the new package with dependencies. Your current project will be affected, forcing you to upgrade the project configuration and make some changes so that it works as expected.
The described scenarios may happen when working on two projects simultaneously. Now think of more complex situations, where you work on more than two projects, or have a set of Python-based applications that depend on a library that must stay installed without modifications. That may be a real headache. Fortunately, Python provides a mechanism for isolating a project's configuration from other projects or applications. This mechanism is called virtual environments. Isolation tools already existed back in Python 2, and since Python 3.3 the venv module has been part of the standard library, so recent interpreters need nothing extra (you are encouraged to install version 3.6 or a newer one!).
For older interpreters, including Python 2 (unless you prefer upgrading), the equivalent third-party library is called virtualenv and can be installed using pip as follows:
$ pip install -U virtualenv
On Ubuntu, the venv module is distributed as a separate system package that can be installed using apt-get as follows:
$ sudo apt-get install python3-venv
Creating a new virtual environment is easy. From the command line, type the following command to create a new virtual environment inside the new_venv directory.
$ python -m venv new_venv
As a recommendation, run the command above from the workspace directory. This way you get a self-contained directory with your source code and the virtual environment where the libraries required by your application will be stored. Once the virtual environment has been created, you can start working in isolated mode by sourcing the activate script located at new_venv/bin/activate.
$ source new_venv/bin/activate
(new_venv) $ ...
Windows users should run the activate script installed inside the Scripts directory.
> new_venv\Scripts\activate
(new_venv) > ...
Notice that once the virtual environment has been loaded, the shell prompt changes and shows the virtual environment name between parentheses. At a glance, the virtual environment contains the following elements:
- A Python interpreter: the same one that was used for creating the virtual environment. Notice that unless you change the python alias, you will have to explicitly use python3, or even python3.6 if you have more than one Python 3 installed.
- The pip package manager. It installs all the libraries inside the lib/pythonX.Y/site-packages directory, which the Python interpreter uses for resolving dependencies.
- Activation/deactivation scripts. They help you load the project environment and return to the default configuration once it is no longer needed.
With the virtual environment loaded successfully, you can install any library as follows:
$ pip install requests
This command installs the requests library inside the virtual environment, where it can be imported by the Python interpreter stored in the bin folder of the virtual environment directory. Notice that no privileged user access was required to install the library. You can check the installation with the pip show command (pip show requests), which provides details about the installed library. Finally, once you stop working on a project, you can deactivate the virtual environment by executing the deactivate command.
(new_venv) $ deactivate
$
So simple, so powerful, so helpful. Virtual environments are here to help! :D
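Environments can also be created from Python code through the standard library venv module, which is what python -m venv runs. As a rough sketch, the snippet below builds a throwaway environment in a temporary directory and checks for two of the elements described above. The helper name create_environment is invented for this example, and with_pip=False is chosen only to keep the sketch fast; it means pip is not installed in this particular environment.

```python
import os
import tempfile
import venv


def create_environment(env_dir):
    """Create a virtual environment in `env_dir` and return key paths."""
    # with_pip=False skips bootstrapping pip (an assumption made for speed).
    venv.create(env_dir, with_pip=False)
    # Scripts live in "Scripts" on Windows and in "bin" elsewhere.
    scripts = "Scripts" if os.name == "nt" else "bin"
    return {
        "config": os.path.join(env_dir, "pyvenv.cfg"),
        "scripts": os.path.join(env_dir, scripts),
    }


# The temporary directory is removed as soon as the block ends.
with tempfile.TemporaryDirectory() as workspace:
    layout = create_environment(os.path.join(workspace, "new_venv"))
    found = {name: os.path.exists(path) for name, path in layout.items()}

print(found)
```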
TIP: The location where packages are installed can be checked through the sys.prefix variable. Try printing it inside an activated environment, and also in the default configuration.
>>> import sys
>>> print(sys.prefix)
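Building on this tip, a script can tell whether it is running inside an environment created with venv by comparing sys.prefix with sys.base_prefix (available since Python 3.3). This is a minimal sketch in which in_virtual_environment is a made-up helper name; environments created by older virtualenv releases may not be detected this way.

```python
import sys


def in_virtual_environment():
    """Return True when the interpreter runs inside a venv-style environment."""
    # Inside a venv, sys.prefix points at the environment directory while
    # sys.base_prefix still points at the installation it was created from.
    return sys.prefix != sys.base_prefix


print("prefix:     ", sys.prefix)
print("base prefix:", sys.base_prefix)
print("virtual environment active:", in_virtual_environment())
```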
Project Scaffolding: That's Easy-Peasy Using Cookiecutter!
You, like most of us, probably started developing Python applications with a simple script (probably Hello World). Thinking of a similar script, along with the concepts explained above, leads us to the following workspace structure:
- A directory containing our workspace, which holds:
  - A virtual environment directory, and
  - A simple script with your Hello World application.
Even though Python projects are not required to follow any particular workspace structure, it is recommended to structure your code, along with its associated test suite and any other scripts or configuration files that help you work through a full development lifecycle. In its most straightforward form, a Python workspace could look as follows:
$ tree -L 1
.
├── requirements.txt ---> Project dependencies file
├── src ---> Source code goes here
├── tests ---> Test code goes here
└── venv ---> Virtual environment files
We talked about the requirements.txt file earlier, in the pip section. This file contains the project dependencies, which can be generated using the pip freeze command. The src and tests directories contain the business logic and its associated tests. The venv directory (it is usually named like this) stores the virtual environment configuration.
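The layout above can also be created with a few lines of Python. This is only a sketch of the idea, using a temporary directory so that nothing is left behind; create_workspace and the my_project name are hypothetical, and the venv directory is left out because it is normally produced by python -m venv.

```python
import tempfile
from pathlib import Path


def create_workspace(root):
    """Create the minimal layout: requirements.txt, src/ and tests/."""
    root = Path(root)
    for directory in ("src", "tests"):
        (root / directory).mkdir(parents=True, exist_ok=True)
    (root / "requirements.txt").touch()  # empty until dependencies are frozen
    return sorted(entry.name for entry in root.iterdir())


# The temporary directory is removed as soon as the block ends.
with tempfile.TemporaryDirectory() as workspace:
    created = create_workspace(Path(workspace) / "my_project")

print(created)
```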
Creating Project Scaffolding the Easy Way: Cookiecutter
Even though the approach presented above may be enough, as your development skills grow you may want to work with a more evolved scaffolding that includes more tools to ease your development process. Creating Python scaffolding manually can be tedious. One alternative is to maintain a copy of a reference project and, after copying it, replace some of its contents to set up a new project. Even though this may work, it is error-prone and may leave you with an inconsistent setup if you forget to adjust some resources. A better option is the cookiecutter tool, which makes setting up new projects easy. This tool takes a project template as input and stores a copy of it after replacing file names and their contents with a set of input parameters that the user is prompted for. The tool is implemented in Python and can be installed with pip.
$ pip install --user cookiecutter
Remember, the --user option installs the tool in your user space. You can skip it, but then you will need to install it as root (or as a sudoer). Ubuntu users can also find this tool as a package that can be installed using apt-get:
$ sudo apt-get install cookiecutter
In its simplest form, the cookiecutter tool can be invoked as follows:
$ cookiecutter <TEMPLATE_LOCATION>
The input template can be stored locally or specified as a Git URL. The latter option is convenient for developers and organizations interested in maintaining and evolving their project workspace templates.
To understand how cookiecutter works, take a look at a sample cookiecutter project. The core file of this and any other cookiecutter project is named cookiecutter.json and contains the set of values that the user is asked to specify in order to create the project workspace. This file is written in JSON format and holds a dictionary of key-value pairs, where the values are the defaults used when the user skips an input. For a list value such as use_pipenv, the user picks one of the listed options and the first item is the default.
{
    "full_name": "Steven Loria",
    "email": "sloria1@gmail.com",
    "github_username": "sloria",
    "project_name": "My Flask App",
    "app_name": "myflaskapp",
    "project_short_description": "A flasky app.",
    "use_pipenv": ["no", "yes"]
}
From the input values, cookiecutter replaces the placeholders written between double curly braces {{...}}, both in the names of the template's files and directories and in their contents.
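To illustrate the replacement step, here is a simplified sketch of how a placeholder such as {{cookiecutter.app_name}} turns into a concrete name. The real tool renders templates with Jinja2 and exposes the answers under the cookiecutter namespace; the regular expression and the render helper below are written for this example only and are not cookiecutter's own code.

```python
import re

# Matches placeholders such as {{cookiecutter.app_name}} or {{ cookiecutter.app_name }}.
PLACEHOLDER = re.compile(r"\{\{\s*cookiecutter\.(\w+)\s*\}\}")


def render(text, context):
    """Replace each placeholder with the matching value from `context`."""
    return PLACEHOLDER.sub(lambda match: str(context[match.group(1)]), text)


# Values taken from the cookiecutter.json defaults shown above.
context = {"app_name": "myflaskapp", "project_name": "My Flask App"}

directory_name = render("{{cookiecutter.app_name}}", context)
readme_title = render("# {{ cookiecutter.project_name }}", context)

print(directory_name)
print(readme_title)
```

The same substitution is applied to directory names, file names, and file contents, which is why a single set of answers is enough to produce a consistent project.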
For those interested in creating their own archetypes, it is a great idea to start a cookiecutter project and put into it anything you consider valuable for developing apps. For those who can't wait to start a new project, there are lots of cookiecutter projects available and ready to be instantiated. Looking for a cookiecutter that fits your needs? Try searching for one using this cookiecutter search engine! More good news: this tool is not limited to Python project scaffolding, but works for any other project type as well!
Putting them all Together: A Project Development Lifecycle
Now that package management, virtual environments, and project scaffolding have been introduced, we can put them all together and describe how you can work once your computer is set up as in this tutorial:
- Use Python 3 as your interpreter. Even better if you start using Python 3.6.
- Create a new project, where your Python modules and dependencies will be stored.
- Create a new project easily by using cookiecutter along with a reference template of your choice.
- Create a new virtual environment inside the application folder, and activate it once you are ready to code your app.
- Code the best app of your life, and make use of third-party libraries if you need to.
- Don't forget to declare the dependencies in a requirements.txt file by using the pip freeze command.
- [Optional] Upload your project to a version control system (VCS) such as Git. But don't upload the virtual environment contents. You can recreate it anytime you need to, as explained below.
Once the code and the dependencies file are stored, you can recreate the project structure easily as follows:
- Copy the code and dependencies file into a workspace folder (Git users may clone the repository).
- Create a new virtual environment inside the application folder, and activate it for installing the application dependencies.
- Install the dependencies using pip install -r requirements.txt.
- Work on improvements or fixes on your projects. Do not forget to specify the new dependencies in the requirements file using pip freeze.
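The steps for recreating a project can be sketched in Python as well. The snippet below only builds and prints the commands; the recreate function that would actually run them is defined but not called, since it creates an environment and downloads packages. The helper names and the my_project folder are hypothetical, and calling the environment's own interpreter with -m pip is one way to install into it without activating it first.

```python
import os
import subprocess
import sys


def recreate_commands(workspace, env_name="venv"):
    """Build the commands that recreate the environment for `workspace`."""
    env_dir = os.path.join(workspace, env_name)
    # Scripts live in "Scripts" on Windows and in "bin" elsewhere.
    scripts = "Scripts" if os.name == "nt" else "bin"
    env_python = os.path.join(env_dir, scripts, "python")
    requirements = os.path.join(workspace, "requirements.txt")
    return [
        # 1. Create the virtual environment inside the workspace.
        [sys.executable, "-m", "venv", env_dir],
        # 2. Install the declared dependencies with the environment's pip.
        [env_python, "-m", "pip", "install", "-r", requirements],
    ]


def recreate(workspace):
    """Run the commands; stops at the first failure (not called here)."""
    for command in recreate_commands(workspace):
        subprocess.run(command, check=True)


commands = recreate_commands("my_project")
for command in commands:
    print(" ".join(command))
```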
One more thing before completing this tutorial. Even though we have not mentioned any of the IDEs available for coding in Python, you are free to choose your own and customize it to fit your needs.
Happy Python coding!
If you would like to learn more about Python, take DataCamp's free Intro to Python for Data Science course.