Kaggle Datasets Tutorial: Kaggle Notebooks

Learn about Kaggle datasets and notebooks and get a head start on creating your Kaggle profile.
Mar 2022  · 7 min read

A "Kaggle Notebook" is a free, hosted Jupyter notebook environment with optional GPU support. Just like DataCamp Workspace notebooks, it lets you run machine learning workloads on cloud machines instead of your own computer. You edit and run each Kaggle Notebook in the browser, so there is no need to set up your own Jupyter environment: sign in to Kaggle, create a notebook, and start working. The notebooks you have already created are listed on the Kaggle Notebook page, where you can also review other people's notebooks.

To create a notebook, navigate to the Kaggle Code page and click "New Notebook" (Figure 3.1). Kaggle then allocates cloud resources for you and creates the notebook right away. To name your notebook, click the notebook title in the upper-left corner. As you can see at first glance, many of the options and features found in Jupyter notebooks are also available here.

The most frequent questions about Kaggle Notebooks are how to share a notebook publicly, how to add another person as a collaborator, how to import a dataset to a notebook, and how to use the GPU. You can see the buttons required to perform each of these operations in Figure 3.2.

Figure 3.1: Kaggle Code

To make a notebook public, you first have to save a version of it by clicking the "Save Version" button at the top right. You can then make the notebook accessible to everyone by clicking the "Share" button to the left of "Save Version", or add others as collaborators from the same menu. To import a dataset, click the "Add data" button in the right-hand menu, below "Save Version", and select the dataset you want to add. To activate the GPU, select the GPU option in the accelerator section of the right-hand menu. The maximum GPU time you can use on Kaggle is set at 30 hours per week.
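Once a dataset has been added, its files can be read from inside the notebook. The following is a minimal sketch, assuming the notebook mounts added datasets under `/kaggle/input` (the usual location in Kaggle Notebooks); the helper name `list_input_files` is our own and not part of any Kaggle API.

```python
import os


def list_input_files(root="/kaggle/input"):
    """Return the sorted paths of all files found under `root`.

    The default root is an assumption about where Kaggle mounts added
    datasets; pass a different directory when running elsewhere.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            paths.append(os.path.join(dirpath, name))
    return sorted(paths)


# Print every file that the attached datasets expose to the notebook.
# If the directory does not exist, nothing is printed.
for path in list_input_files():
    print(path)
```

Each printed path can then be passed to whatever reader suits the file type, for example a CSV parser.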

Figure 3.2: Kaggle Notebook

All other features of notebooks are explained in detail in the Kaggle documentation.

KAGGLE DATASETS

WHAT ARE KAGGLE DATASETS?

Kaggle is a data science platform, but it also hosts and manages datasets. "Kaggle Datasets" allows you to create your own custom datasets, share them with others, and easily import them into your notebooks. You can also add private datasets that are visible only to you.

What makes this feature one of the most important ones in Kaggle is that it gives you access to a wide variety of top-quality datasets shared by other users. You can easily find the datasets you want with just a few search and filtering methods.

DATASET SEARCH FILTERS

To search for a dataset, write your keywords in the search field, as shown in Figure 4.1. Here you can see that we can access several datasets about the pandemic just by typing "Covid" in the search bar.

If you click on "Filters" on the right side of the search bar, more filtering options will appear (Figure 4.2). With these, you can narrow your search by entering dataset tags, file type, and other values like the minimum or maximum size of the dataset (Figure 4.3).


Figure 4.1: Dataset Search Filters

Kaggle allows you to download any dataset for free, but depending on how you plan to use it, you may need to pay attention to the dataset's license type. In some cases, for example when you want to use a dataset in an academic paper or for commercial purposes, you may need to obtain additional permission from its owner.

Figure 4.2: Dataset Search Filters by Tags

There are three main license types on Kaggle:

  1. Creative Commons: There are several kinds of Creative Commons licenses:
    1. CC0, a public domain dedication, which means that the dataset is available to everyone for any purpose.
    2. CC-BY, which requires the dataset user to credit its owner.
    3. CC-BY-SA, which also requires the owner to be credited and adds the condition that the dataset keeps the same kind of license even after it's modified.
  2. GPL: This license provides four main usage options:
    1. You get unlimited use of the dataset.
    2. You can examine how the dataset works and modify it.
    3. You are entitled to distribute unlimited copies of the dataset.
    4. You can distribute modified versions of the dataset as well.
  3. Open Database: This allows users to share, modify and use the dataset but it makes it mandatory to establish the same kind of license for the modified dataset.

Figure 4.3: Dataset Search Filters

DATA EXPLORER

The Data Explorer section allows you to quickly browse through the content and structure of the datasets. It gives you an overview of the files and the columns in the data, as well as their histogram graphs (Figure 4.4).
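A similar overview can be produced in a notebook once a dataset is loaded. The sketch below is only an illustration of the idea, not how the Data Explorer itself is implemented: it uses Python's standard `csv` module to list the columns of a CSV file and count how often each value occurs, which is the raw material for a histogram. The function name `summarize_csv` and the tiny sample data are made up for the example.

```python
import csv
import io
from collections import Counter


def summarize_csv(text):
    """Return {column name: Counter of its values} for CSV text with a header row."""
    reader = csv.DictReader(io.StringIO(text))
    # One counter per column; an empty input has no header and yields {}.
    counts = {name: Counter() for name in (reader.fieldnames or [])}
    for row in reader:
        for name in counts:
            counts[name][row[name]] += 1
    return counts


# Hypothetical sample data, used only to demonstrate the function.
sample = "color,size\nred,S\nred,M\nblue,M\n"
for column, counter in summarize_csv(sample).items():
    print(column, dict(counter))
```

For a real file, read its contents into a string first (or adapt the function to accept an open file object) and pass that to `summarize_csv`.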

Figure 4.4: Data Explorer

DATASETS FOR BEGINNERS

The following datasets are fun and easy to play with as a beginner. You can fetch these to your notebooks and start getting your hands dirty by visualizing the data. Once you come up with an idea, you can even build machine learning models with some of these datasets.

CUSTOM DATASETS

You can also upload and use your own datasets in Kaggle. This feature comes in handy when you have your own data or when you've modified an existing dataset and want to use the result in your notebook. To upload a dataset, first zip your main dataset file. Then click on "New Dataset" in the Datasets section, give your dataset a name, and upload your zip file (Figure 4.5).
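The zipping step can be done with any archive tool. As one option, here is a minimal sketch using Python's standard `zipfile` module; the helper name `zip_dataset` and the file names in the demo are hypothetical and chosen only for illustration.

```python
import os
import tempfile
import zipfile


def zip_dataset(file_path, zip_path):
    """Compress a single dataset file into a zip archive and return the archive path."""
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # Store only the file name so the archive does not contain local folders.
        archive.write(file_path, arcname=os.path.basename(file_path))
    return zip_path


# Demo on a throwaway file; replace the paths with your own dataset file.
with tempfile.TemporaryDirectory() as workdir:
    csv_path = os.path.join(workdir, "my_data.csv")
    with open(csv_path, "w", newline="") as handle:
        handle.write("id,value\n1,10\n2,20\n")

    archive_path = zip_dataset(csv_path, os.path.join(workdir, "my_data.zip"))
    with zipfile.ZipFile(archive_path) as archive:
        print(archive.namelist())
```

The resulting `.zip` file is what you would select in the upload dialog.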

Figure 4.5: Importing Custom Datasets

And that's it. You can now fetch the uploaded dataset to your notebook and start using it, as shown in section 2. If you want to keep the dataset private, make sure that the label in the bottom-right corner of the upload screen reads "Private". If you want it to be public instead, click on the label and change it to "Public".
