
Kaggle Datasets Tutorial: Kaggle Notebooks

Learn about Kaggle datasets and notebooks and get a head start on creating your Kaggle profile.
Mar 2022  · 7 min read

A "Kaggle Notebook" is a free, hosted Jupyter notebook environment with optional GPU acceleration. Like DataCamp Workspace notebooks, it lets you run machine learning workloads on cloud machines instead of your own computer. You edit and run every Kaggle Notebook in the browser, so there is no need to set up a local Jupyter environment: sign in to Kaggle, create a notebook, and start working. The notebooks you have already created are listed on the Kaggle Notebook page, and you can also review other people's notebooks.

To create a notebook, go to the Kaggle Code page and click "New Notebook" (Figure 3.1). Kaggle then allocates cloud resources for you and creates the notebook right away. To rename it, click the notebook title in the upper left corner. As you can see at first glance, many of the options and features found in Jupyter notebooks are available here as well.

The most frequent questions about Kaggle Notebooks are how to share a notebook publicly, how to add another person as a collaborator, how to import a dataset to a notebook, and how to use the GPU. You can see the buttons required to perform each of these operations in Figure 3.2.

Figure 3.1: Kaggle Code

To make a notebook public, you first have to commit a version of it. Click the "Save Version" button at the top right to create a new version. Then click the "Share" button to the left of "Save Version" to make the notebook accessible to everyone, or use the same menu to add collaborators. To import a dataset, click the "Add data" button below "Save Version" in the right-hand menu and select the dataset you want to add. To activate the GPU, select the GPU option in the accelerator section of the same right-hand menu. GPU usage on Kaggle is capped at 30 hours per week.
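Once a dataset has been added, a quick way to confirm it is available is to list the files the notebook can see. The sketch below assumes the usual Kaggle convention of mounting attached datasets under `/kaggle/input`; the helper name `list_input_files` is made up for this illustration.

```python
from pathlib import Path


def list_input_files(root="/kaggle/input"):
    """Return the sorted paths of all files found under `root`.

    Assumption: attached datasets are mounted read-only under /kaggle/input.
    """
    root_path = Path(root)
    if not root_path.is_dir():
        # Outside Kaggle, or before any dataset is attached, the folder may not exist.
        return []
    return sorted(str(p) for p in root_path.rglob("*") if p.is_file())


# Print every file the notebook can read from its attached datasets.
for path in list_input_files():
    print(path)
```

Each printed path can then be passed to whatever loader you prefer, for example a CSV reader.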

Figure 3.2: Kaggle Notebook

All other features of notebooks are explained in detail in the Kaggle documentation.

KAGGLE DATASETS

WHAT ARE KAGGLE DATASETS?

Kaggle is best known as a data science platform, but it also hosts datasets. "Kaggle Datasets" lets you create your own custom datasets, share them with others, and easily import them into your notebooks. You can also add private datasets that are visible only to you.

What makes this one of Kaggle's most important features is the access it gives you to a wide variety of high-quality datasets shared by other users. A few searches and filters are usually enough to find the dataset you want.

DATASET SEARCH FILTERS

To search for a dataset, type your keywords into the search field, as shown in Figure 4.1. For example, typing "Covid" in the search bar returns several datasets about the pandemic.

If you click on "Filters" on the right side of the search bar, more filtering options will appear (Figure 4.2). With these, you can narrow your search by entering dataset tags, file type, and other values like the minimum or maximum size of the dataset (Figure 4.3).


Figure 4.1: Dataset Search Filters

Kaggle lets you download any dataset for free, but depending on how you plan to use it, you may need to pay attention to its license. For example, you may need additional permission from the owner to use a dataset in an academic paper or for commercial purposes.

Figure 4.2: Dataset Search Filters by Tags

There are three main license types on Kaggle:

  1. Creative Commons: There are several kinds of Creative Commons licenses:
    1. CC0, which places the dataset in the public domain, so anyone can use it for any purpose.
    2. CC-BY, which requires the dataset user to credit its owner.
    3. CC-BY-SA, which also requires the owner to be credited and adds the condition that the dataset keeps the same kind of license even after it's modified.
  2. GPL: This license grants four main freedoms:
    1. You can use the dataset without restriction.
    2. You can examine the dataset and modify it.
    3. You can distribute unlimited copies of the dataset.
    4. You can distribute modified versions of the dataset as well.
  3. Open Database: This lets users share, modify, and use the dataset, but requires any modified dataset to carry the same kind of license.

Figure 4.3: Dataset Search Filters

DATA EXPLORER

The Data Explorer section lets you quickly browse the content and structure of a dataset. It gives you an overview of the files and the columns in the data, along with a histogram for each column (Figure 4.4).
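To get a feel for what such a column overview involves, here is a rough sketch that counts the values in each column of a CSV file and renders a simple text histogram. It uses only the Python standard library; the function names and the sample data are invented for this illustration and have nothing to do with how Kaggle builds its Data Explorer.

```python
import csv
import io
from collections import Counter


def count_values(csv_text):
    """Map each column name to a Counter of the values found in that column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return {
        name: Counter(row[name] for row in rows)
        for name in (reader.fieldnames or [])
    }


def text_histogram(counts, top=5):
    """Return one line per value, most frequent first, with a bar of '#' marks."""
    return [f"{value:<10} {'#' * n} ({n})" for value, n in counts.most_common(top)]


# Made-up sample data, for illustration only.
SAMPLE_CSV = "species,weight_kg\ncat,4\ndog,12\ncat,5\n"

for column, counts in count_values(SAMPLE_CSV).items():
    print(column)
    for line in text_histogram(counts):
        print("  " + line)
```

In a notebook you would more likely read the file into a dataframe and use its built-in summary and plotting tools, but the idea is the same: one frequency summary per column.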

Figure 4.4: Data Explorer

DATASETS FOR BEGINNERS

The following datasets are fun and easy to play with as a beginner. You can fetch these to your notebooks and start getting your hands dirty by visualizing the data. Once you come up with an idea, you can even build machine learning models with some of these datasets.

CUSTOM DATASETS

You can also upload and use your own datasets on Kaggle. This is handy when you have data of your own, or when you have modified an existing dataset and want to use it in your notebook. To upload a dataset, first zip your main dataset file. Then click "New Dataset" in the Datasets section, give your dataset a name, and upload the zip file (Figure 4.5).
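If you prefer to create the zip file from code rather than with your operating system's tools, the standard library's `zipfile` module is enough. This is a minimal sketch; the helper name `zip_dataset` and the file names in the usage comment are placeholders.

```python
import zipfile
from pathlib import Path


def zip_dataset(source_file, archive_path):
    """Compress a single dataset file into a zip archive and return the archive path."""
    source = Path(source_file)
    archive = Path(archive_path)
    with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        # arcname keeps only the file name, so the local folder structure
        # is not stored inside the archive.
        zf.write(source, arcname=source.name)
    return archive


# Example usage (placeholder file names):
# zip_dataset("my_data.csv", "my_data.zip")
```

The resulting archive is the file you select in the upload dialog.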

Figure 4.5: Importing Custom Datasets

And that's it. You can now fetch the uploaded dataset into your notebook and start using it, as shown in section 2. To keep the dataset private, make sure the label in the bottom-right corner of the upload screen reads "Private". To make it public instead, click the label and change it to "Public".
