Set Up a Data Science Environment on Your Computer
After learning on an online interactive training and education platform like DataCamp, one of the next steps is to take the skills you gained in Python, R, Git, or Unix Shell and use them on your local computer. It is not always easy to know what you need to install for the various projects you have in mind. This tutorial explains which packages and software you need to install to get started with each of these technologies.
With that, let's get started!
To be able to use Python on your local computer, you first need to install it. There are many different Python distributions, but for data science, the Anaconda Python Distribution is the most popular.
Benefits of Anaconda
Anaconda is a Python distribution that bundles a package manager and environment manager (conda) with a large collection of open source packages. An installation of Anaconda comes with many packages such as numpy, scikit-learn, scipy, and pandas preinstalled, and it is also the recommended way to install Jupyter Notebooks. The image below shows a Jupyter Notebook in action. Jupyter Notebooks contain both code and rich text elements, such as figures, links, and equations. You can learn more about Jupyter Notebooks here.
Some other benefits of Anaconda include:
If you need additional packages after installing Anaconda, you can use Anaconda's package manager, conda, or pip to install them. This is highly advantageous because you don't have to manage dependencies between multiple packages yourself. Conda even makes it easy to switch between Python 2 and 3 (you can learn more here).
Anaconda comes with Spyder, a Python Integrated Development Environment (IDE). An IDE is a coding tool that lets you write, test, and debug your code. IDEs typically offer code completion, syntax highlighting, resource management, and debugging tools, among many other features. It is also possible to integrate Anaconda with other Python IDEs, including PyCharm and Atom. You can learn more about different Python IDEs here.
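Once Anaconda is installed, a quick way to confirm that the preinstalled packages are really there is to ask Python for their versions. The snippet below is a minimal sketch using only the standard library (Python 3.8 or newer); the function name installed_versions and the PACKAGES list are illustrative choices, not part of Anaconda.

```python
from importlib import metadata

# Packages the text above lists as preinstalled with Anaconda.
PACKAGES = ["numpy", "scipy", "pandas", "scikit-learn"]


def installed_versions(names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions(PACKAGES).items():
        print(f"{name}: {version or 'not installed'}")
```

If a package shows up as not installed, a command such as conda install pandas or pip install pandas, run from your terminal, should add it.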
How to Install Anaconda (Python)
The guides linked below explain how to install Anaconda on your operating system.
R Programming Language
Most people install RStudio alongside the R programming language. The RStudio integrated development environment (IDE) is generally considered the easiest and best way to work with the R programming language.
Benefits of RStudio
An install of the R programming language gives you a set of functions and objects from the R language and an R interpreter that allows you to build and run commands. RStudio gives you an integrated development environment that works alongside the R interpreter.
When you open RStudio, a screen like the one above appears. The four RStudio panes contain: (A) a text editor, (B) a dashboard of the work environment, (C) the R interpreter, and (D) a help window and package management system. With all these features, RStudio is all you really need after installing R.
How to Install R and RStudio
The guides linked below explain how to install R and RStudio on your operating system.
Navigating directories, copying files, using virtual machines, and more are a regular part of a data scientist's job. You will often find the Unix Shell utilized to accomplish these tasks.
Some Uses of a Unix Shell
1 - Many cloud computing platforms are Linux based, and Linux provides a flavor of Unix Shell. For instance, if you want to Setup a Data Science Environment on Google Cloud or do Deep Learning With Jupyter Notebooks In The Cloud (AWS EC2), you need some Unix Shell knowledge. There are times when you may have a use for a Windows virtual machine, but that is less common.
2 - Unix Shell provides a number of useful commands such as:
the wc command, which counts the number of lines or words in a file,
the cat command, which concatenates/merges files,
the tail command, which helps you subset large files. You can learn more about this in 8 Useful Shell Commands for Data Science. Also, check out DataCamp's course Data Processing in Shell.
3 - You will often find Unix Shell integrated with other technologies as you will see throughout the rest of the article.
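To make the commands listed above more concrete, here is a rough sketch of what wc, cat, and tail do, written in plain Python. This is an illustration under simplifying assumptions (UTF-8 text files, hypothetical function names), not a replacement for the real commands, which are faster and have many more options.

```python
from collections import deque
from pathlib import Path


def count_lines(path):
    """Roughly `wc -l`: count the lines in a text file."""
    with open(path, encoding="utf-8") as handle:
        return sum(1 for _ in handle)


def concatenate(paths, destination):
    """Roughly `cat a b > destination`: join files in the given order."""
    with open(destination, "w", encoding="utf-8") as out:
        for path in paths:
            out.write(Path(path).read_text(encoding="utf-8"))


def last_lines(path, n=10):
    """Roughly `tail -n N`: return the last n lines of a file."""
    with open(path, encoding="utf-8") as handle:
        # A bounded deque keeps only the most recent n lines in memory.
        return [line.rstrip("\n") for line in deque(handle, maxlen=n)]
```

Seeing the Python versions side by side shows why the shell commands are popular: each one replaces several lines of code with a single word at the prompt.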
Integration with Other Technologies
You will often find Unix Shell commands integrated in other technologies. For example, it is common to find shell commands in Jupyter Notebooks alongside Python code. In a Jupyter Notebook, you can escape to the shell by prefixing a command with an exclamation mark (!). In the code below, the result of the shell command ls (which lists all the files in the current directory) is assigned to the Python variable myfiles.
myfiles = !ls
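The ! escape only works inside Jupyter or IPython. In a plain Python script, a similar result can be had with the standard library's subprocess module. The following is a minimal sketch, not the notebook's own mechanism: the function name list_files is just illustrative, and it assumes an ls command is available on your PATH (so it will not work from a stock Windows Command Prompt).

```python
import subprocess


def list_files(directory="."):
    """Run `ls` in `directory` and return its output as a list of names."""
    completed = subprocess.run(
        ["ls"],
        cwd=directory,
        capture_output=True,  # collect stdout instead of printing it
        text=True,            # decode bytes to str
        check=True,           # raise CalledProcessError if ls fails
    )
    return completed.stdout.splitlines()


if __name__ == "__main__":
    myfiles = list_files()
    print(myfiles)
```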
The image below shows some Python code integrated in a workflow to combine multiple datasets. Notice a Unix Shell command (enclosed in the red rectangle) integrated in a Jupyter Notebook.
Keep in mind that the code in the image above isn't the only way to do the task; it is just a small example of how you may see Unix utilized. If you want to learn how to use Unix for Data Science, DataCamp has a free course, Introduction to Shell for Data Science, which I highly recommend. It is a skill that lots of aspiring data scientists forget about, but it is a very important skill in the workplace.
Unix Shell on Mac
Mac comes with a Unix shell, so you usually don't need to install anything! An important point is that there are a variety of Unix systems, and they do not all ship with the same commands. Sometimes you will find that a command available on another Unix system (like wget) is missing on yours. Similar to how you have package managers through RStudio and Anaconda, Mac can have a package manager called Homebrew if you install it. The link below goes over how to install and use Homebrew.
Unix Shell Commands on Windows
Windows does not come with a Unix Shell. Keep in mind that what Unix Shell does for you is give you useful commands for Data Science. There are many different ways to get these useful commands on Windows. You can install Git on Windows with the optional Unix tools so that you have Unix commands on your Command Prompt. Alternatively, you could install Gnu on Windows (GOW) (10 MB) or Cygwin (100 MB minimum), among many other options.
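Whichever operating system you are on, it can be useful to check which Unix commands are actually available before relying on them. The sketch below uses shutil.which from Python's standard library; the COMMANDS list and the function name find_commands are illustrative assumptions, so adjust them to the commands you care about.

```python
import shutil

# Commands mentioned in this tutorial; edit the list to suit your needs.
COMMANDS = ["wc", "cat", "tail", "ls", "wget"]


def find_commands(names):
    """Map each command name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_commands(COMMANDS).items():
        print(f"{name}: {path or 'not found'}")
```

A command reported as not found is a hint that you may need a package manager such as Homebrew on Mac, or one of the Windows options described above.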
Git is the most widely used version control system. A version control system records changes to a file or set of files over time so that you can recall specific versions later. Git is an important technology because it helps you work with others, and it is something you will find in a lot of workplaces. Some of the benefits of learning Git include:
Nothing version controlled using Git is ever lost, so you can always go back to see previous versions of your programs.
Git notifies you when your work conflicts with someone else's, so it's harder (but not impossible) to accidentally overwrite work.
Git can synchronize work done by different people on different machines, so it scales as your team does.
Knowing Git makes it easier to contribute to open source development of packages in R and Python.
Integration with Other Technologies
One of the cool things about Git is that you often find it integrated with other technologies. Earlier I mentioned that the RStudio integrated development environment (IDE) is generally considered the best way to work with the R programming language. RStudio offers version control support, as do many Python IDEs (learn more here).
If you want to learn how to use Git for Data Science, DataCamp has a free course Introduction to Git for Data Science which I highly recommend.
How to Install Git
The guides linked below explain how to install Git on your operating system.
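After installing, you may want to confirm that Git is reachable from your programs. The following is a small sketch that runs git --version through Python's subprocess module; the function name git_version is hypothetical, and the function simply returns None when no git executable is found on your PATH.

```python
import shutil
import subprocess


def git_version():
    """Return the output of `git --version`, or None if git is not on PATH."""
    if shutil.which("git") is None:
        return None
    completed = subprocess.run(
        ["git", "--version"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if the command fails
    )
    return completed.stdout.strip()


if __name__ == "__main__":
    print(git_version() or "Git does not appear to be installed")
```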
This tutorial provides a way to set up a data science environment on your local computer. An important point to emphasize is that these technologies can be, and often are, integrated together. If you have any questions or thoughts on the tutorial, feel free to reach out in the comments below or through Twitter. Also, feel free to check out my other installation-based tutorials on my GitHub or my Medium blog.