# Data frames

| September 10th, 2013

## What's a data frame?

You may remember from the chapter about matrices that all the elements that you put in a matrix should be of the same type. Back then, your data set on Star Wars only contained numeric elements.

When doing a market research survey, however, you often have questions such as:

• 'Are you married?' or other 'yes/no' questions (`logical`)
• 'How old are you?' (`numeric`)
• 'What is your opinion on this product?' or other 'open-ended' questions (`character`)
• ...

The output, namely the respondents' answers to the questions formulated above, is a data set of different data types. You will often find yourself working with data sets that contain different data types instead of only one.

A data frame has the variables of a data set as columns and the observations as rows. This will be a familiar concept for those coming from different statistical software packages such as SAS or SPSS.
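
As a minimal sketch of the difference with a matrix, the snippet below stores a few made-up survey answers both ways. The names `married`, `age`, `opinion` and `survey`, and all values, are illustrative assumptions and not part of the course data.

```r
# Hypothetical survey answers; names and values are made up for illustration
married <- c(TRUE, FALSE, TRUE)
age     <- c(34, 27, 45)
opinion <- c("Great", "Too expensive", "Fine")

# A matrix holds a single type, so everything is coerced to character
survey_matrix <- cbind(married, age, opinion)
class(survey_matrix[1, 1])

# A data frame keeps a separate type for each column
survey <- data.frame(married, age, opinion, stringsAsFactors = FALSE)
sapply(survey, class)
```

Each column of `survey` keeps its own type, which is what makes a data frame a natural fit for survey-style data.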

### Instructions

Click 'Submit Answer'. The data from the built-in example data frame `mtcars` will be printed to the console.

```
# Print out built-in R data frame
mtcars
```

Just click 'Submit Answer' and witness the magic!

## Quick, have a look at your data set

Wow, that is a lot of cars!

Working with large data sets is not uncommon in data analysis. When you work with (extremely) large data sets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire data set.

So how to do this in R? Well, the function `head()` enables you to show the first observations of a data frame. Similarly, the function `tail()` prints out the last observations in your data set.

Both `head()` and `tail()` print a top line called the 'header', which contains the names of the different variables in your data set.
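
Both functions show six observations by default and accept an `n` argument to change that. A short sketch on the built-in `mtcars` data frame (the variable names `first_three` and `last_two` are just illustrative):

```r
# Show only the first three observations of mtcars
first_three <- head(mtcars, n = 3)
first_three

# Show only the last two observations of mtcars
last_two <- tail(mtcars, n = 2)
last_two
```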

### Instructions

Call `head()` on the `mtcars` data set to have a look at the header and the first observations.

```
# Call head() on mtcars
head(mtcars)
```

So, what do we have in this data set? For example, `hp` represents the car's horsepower; the Datsun has the lowest horsepower of the six cars that are displayed. For a full overview of the variables' meaning, type `?mtcars` in the console and read the help page.

`head(mtcars)` will show the first observations of the `mtcars` data frame.

## Have a look at the structure

Another method that is often used to get a rapid overview of your data is the function `str()`. The function `str()` shows you the structure of your data set. For a data frame it tells you:

• The total number of observations (e.g. 32 car types)
• The total number of variables (e.g. 11 car features)
• A full list of the variable names (e.g. `mpg`, `cyl` ... )
• The data type of each variable (e.g. `num`)
• The first observations

Applying the `str()` function will often be the first thing that you do when receiving a new data set or data frame. It is a great way to get more insight into your data set before diving into the real analysis.
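
If you need one of these pieces of information on its own, base R has a separate function for each. This is only a sketch of some companions to `str()`, not a replacement for it:

```r
# Number of observations (rows) and variables (columns)
dim(mtcars)

# Names of the variables
names(mtcars)

# Data type of each variable
sapply(mtcars, class)
```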

### Instructions

Investigate the structure of `mtcars`. Make sure that you see the same numbers, variables and data types as mentioned above.

```
# Investigate the structure of mtcars
str(mtcars)
```

## Creating a data frame

Since using built-in data sets is not even half the fun of creating your own data sets, the rest of this chapter is based on your personally developed data set. Put your jet pack on because it is time for some space exploration!

As a first goal, you want to construct a data frame that describes the main characteristics of eight planets in our solar system. According to your good friend Buzz, the main features of a planet are:

• The type of planet (Terrestrial or Gas Giant).
• The planet's diameter relative to the diameter of the Earth.
• The planet's rotation period relative to that of the Earth.
• If the planet has rings or not (TRUE or FALSE).

After doing some high-quality research on Wikipedia, you feel confident enough to create the necessary vectors: `name`, `type`, `diameter`, `rotation` and `rings`; these vectors have already been coded up below. The first element in each of these vectors corresponds to the first observation.

You construct a data frame with the `data.frame()` function. As arguments, you pass the vectors from before: they will become the different columns of your data frame. Because every column has the same length, the vectors you pass should also have the same length. But don't forget that it is possible (and likely) that they contain different types of data.
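
To illustrate the length requirement, here is a small sketch with made-up vectors (`moon` and `orbits` are hypothetical names, not part of the exercise). R only recycles a shorter vector when the lengths divide evenly; otherwise `data.frame()` stops with an error, which `tryCatch()` captures here so the script keeps running.

```r
# Hypothetical vectors, made up for illustration
moon   <- c("Moon", "Phobos", "Deimos")
orbits <- c("Earth", "Mars", "Mars")

# Vectors of equal length become the columns of the data frame
moons_df <- data.frame(moon, orbits)
dim(moons_df)

# Three names but only two planets: data.frame() raises an error
result <- tryCatch(
  data.frame(moon, orbits = c("Earth", "Mars")),
  error = function(e) conditionMessage(e)
)
result
```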

### Instructions

Use the function `data.frame()` to construct a data frame. Pass the vectors `name`, `type`, `diameter`, `rotation` and `rings` as arguments to `data.frame()`, in this order. Call the resulting data frame `planets_df`.

```
# Definition of vectors
name <- c("Mercury", "Venus", "Earth", "Mars", "Jupiter", "Saturn", "Uranus", "Neptune")
type <- c("Terrestrial planet", "Terrestrial planet", "Terrestrial planet",
          "Terrestrial planet", "Gas giant", "Gas giant", "Gas giant", "Gas giant")
diameter <- c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883)
rotation <- c(58.64, -243.02, 1, 1.03, 0.41, 0.43, -0.72, 0.67)
rings <- c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE)

# Create a data frame from the vectors
planets_df <- data.frame(name, type, diameter, rotation, rings)
```

Your `data.frame()` call starts as follows: `data.frame(name, type, diameter, ...)`. Can you finish it?

## Creating a data frame (2)

The `planets_df` data frame should have 8 observations and 5 variables. It has been made available in the workspace, so you can directly use it.

### Instructions

Use `str()` to investigate the structure of the new `planets_df` variable.

```
# Check the structure of planets_df
str(planets_df)
```

`planets_df` is already available in your workspace, so `str(planets_df)` will do the trick.

## Selection of data frame elements

Similar to vectors and matrices, you select elements from a data frame with the help of square brackets `[ ]`. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

• `my_df[1,2]` selects the value at the first row and second column in `my_df`.
• `my_df[1:3,2:4]` selects rows 1, 2, 3 and columns 2, 3, 4 in `my_df`.

Sometimes you want to select all elements of a row or column. For example, `my_df[1, ]` selects all elements of the first row. Let us now apply this technique on `planets_df`!
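
The selections above can be tried on a small data frame. The sketch below uses a hypothetical `my_df` with made-up columns, purely for illustration:

```r
# Hypothetical data frame, made up for illustration
my_df <- data.frame(
  id    = 1:4,
  score = c(7.5, 8.1, 6.3, 9.0),
  group = c("a", "b", "a", "b"),
  pass  = c(TRUE, TRUE, FALSE, TRUE)
)

my_df[1, 2]      # single value: first row, second column
my_df[1:3, 2:4]  # rows 1 to 3, columns 2 to 4
my_df[1, ]       # entire first row (still a data frame)
my_df[, 2]       # entire second column (a plain vector)
```

Note that selecting a single column drops down to a vector, while selecting a row keeps the data frame structure, because a row can mix several types.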

### Instructions

• From `planets_df`, select the diameter of Mercury: this is the value at the first row and the third column. Simply print out the result.
• From `planets_df`, select all data on Mars (the fourth row). Simply print out the result.

```
# Print out diameter of Mercury (row 1, column 3)
planets_df[1, 3]

# Print out data for Mars (entire fourth row)
planets_df[4, ]
```

To select the diameter for Venus (the second row), you would need: `planets_df[2,3]`. What do you need for Mercury then?

## Selection of data frame elements (2)

Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

Suppose you want to select the first three elements of the `type` column, which is the second column of `planets_df`. One way to do this is

```
planets_df[1:3,2]
```

A possible disadvantage of this approach is that you have to know (or look up) the column number of `type`, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:

```
planets_df[1:3,"type"]
```
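
A further reason to prefer names is that they keep working when the column order changes. The sketch below uses `planets_small`, a hypothetical three-row stand-in for `planets_df`, to show this:

```r
# Small stand-in for planets_df (first three planets only)
planets_small <- data.frame(
  name     = c("Mercury", "Venus", "Earth"),
  type     = rep("Terrestrial planet", 3),
  diameter = c(0.382, 0.949, 1)
)

# Look up the column number of "type" instead of counting by hand
type_position <- which(names(planets_small) == "type")
type_position

# Selecting by number and by name gives the same result
identical(planets_small[1:3, type_position], planets_small[1:3, "type"])

# After reordering the columns, the name still finds the right column
reordered <- planets_small[, c("diameter", "name", "type")]
reordered[1:3, "type"]
```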

### Instructions

Select and print out the first 5 values in the `"diameter"` column of `planets_df`.

```
# Select first 5 values of diameter column
planets_df[1:5, "diameter"]
```

You can select the first five values with `planets_df[1:5, ...]`. Can you fill in the `...` bit to only select the `"diameter"` column?

## Only planets with rings

You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable `diameter`, for example, both of these will do the trick:

```
planets_df[,3]
planets_df[,"diameter"]
```

However, there is a short-cut. If your columns have names, you can use the `$` sign:

```
planets_df$diameter
```
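
As a quick check that these forms are interchangeable, the sketch below compares them on `planets_small`, a hypothetical two-column stand-in for `planets_df`. The double-bracket form `[[ ]]` is a fourth base R way to pull out a single column.

```r
# Small stand-in for planets_df
planets_small <- data.frame(
  name     = c("Mercury", "Venus", "Earth"),
  diameter = c(0.382, 0.949, 1)
)

by_number  <- planets_small[, 2]
by_name    <- planets_small[, "diameter"]
by_dollar  <- planets_small$diameter
by_double  <- planets_small[["diameter"]]

# All four return the same numeric vector
identical(by_number, by_dollar)
identical(by_name, by_double)
```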

### Instructions

• Use the `$` sign to select the `rings` variable from `planets_df`. Store the vector that results as `rings_vector`.
• Print out `rings_vector` to see if you got it right.

```
# Select the rings variable from planets_df
rings_vector <- planets_df$rings

# Print out rings_vector
rings_vector
```

`planets_df$diameter` selects the `diameter` column from `planets_df`; what do you need to select the `rings` column then?

## Only planets with rings (2)

You probably remember from high school that some planets in our solar system have rings and others do not. But due to other priorities at that time (read: puberty) you cannot recall their names, let alone their rotation speed, etc.

If you type `rings_vector` in the console, you get:

```
[1] FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE
```

This means that the first four observations (or planets) do not have a ring (`FALSE`), but the other four do (`TRUE`). However, you do not get a nice overview of the names of these planets, their diameter, etc. Let's try to use `rings_vector` to select the data for the four planets with rings.
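
The underlying mechanism is logical indexing: a logical vector placed in the row position keeps the rows where it is `TRUE`. A minimal sketch on a made-up data frame (`stock`, `item` and `in_stock` are hypothetical names):

```r
# Hypothetical data frame, made up for illustration
stock <- data.frame(
  item     = c("bolt", "nut", "screw", "washer"),
  in_stock = c(FALSE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

keep <- stock$in_stock

stock[keep, "item"]  # items for the rows where keep is TRUE
stock[!keep, ]       # all columns for the rows where keep is FALSE
sum(keep)            # TRUE counts as 1, so this is the number of kept rows
```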

### Instructions

The code below selects the `name` column of all planets that have rings. Adapt the code so that instead of only the `name` column, all columns for planets that have rings are selected.

```
# Starting point: the name column for planets with rings
planets_df[rings_vector, "name"]

# Adapted: all columns for planets with rings
planets_df[rings_vector, ]
```

Remember that to select all columns, you simply have to leave the columns part inside the `[ ]` empty! This means you'll need `[rings_vector, ]`.

## Only planets with rings but shorter

So what exactly did you learn in the previous exercises? You selected a subset from a data frame (`planets_df`) based on whether or not a certain condition was true (rings or no rings), and you managed to pull out all relevant data. Pretty awesome! By now, NASA is probably already flirting with your CV ;-).

Now, let us move up one level and use the function `subset()`. You should see the `subset()` function as a short-cut to do exactly the same as what you did in the previous exercises.

```
subset(my_df, subset = some_condition)
```

The first argument of `subset()` specifies the data set for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

The code below will give the exact same result as you got in the previous exercise, but this time, you didn't need the `rings_vector`!

```
subset(planets_df, subset = rings)
```
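
To see the equivalence side by side, the sketch below uses `planets_small`, a hypothetical four-row stand-in for `planets_df`. It also shows the optional `select` argument of `subset()`, which limits the columns that are returned.

```r
# Small stand-in for planets_df
planets_small <- data.frame(
  name     = c("Mercury", "Earth", "Jupiter", "Saturn"),
  diameter = c(0.382, 1, 11.209, 9.449),
  rings    = c(FALSE, FALSE, TRUE, TRUE)
)

# Square brackets and subset() select the same rows
with_brackets <- planets_small[planets_small$rings, ]
with_subset   <- subset(planets_small, subset = rings)
all.equal(with_brackets, with_subset)

# subset() can also restrict the columns through select
big_planets <- subset(planets_small, subset = diameter > 1,
                      select = c(name, diameter))
big_planets
```

Inside `subset()`, column names such as `rings` and `diameter` can be used directly, without the `planets_small$` prefix.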

### Instructions

Use `subset()` on `planets_df` to select planets that have a diameter smaller than Earth. Because the `diameter` variable is a relative measure of the planet's diameter with respect to that of planet Earth, your condition is `diameter < 1`.

```
# Select planets with diameter < 1
subset(planets_df, subset = diameter < 1)
```

Not only is the `subset()` function more concise, it is probably also more understandable for people who read your code.

`subset(planets_df, subset = ...)` almost solves it; can you fill in the `...`?

## Sorting

Making and creating rankings is one of mankind's favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).

In data analysis you can sort your data according to a certain variable in the data set. In R, this is done with the help of the function `order()`.

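`order()` is a function that, when applied to a vector, returns the positions of its elements in sorted order: the first value is the index of the smallest element, the second value is the index of the next smallest element, and so on. For example: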

```
> a <- c(100, 10, 1000)
> order(a)
[1] 2 1 3
```

10, which is the second element in `a`, is the smallest element, so 2 comes first in the output of `order(a)`. 100, which is the first element in `a` is the second smallest element, so 1 comes second in the output of `order(a)`.

This means we can use the output of `order(a)` to reshuffle `a`:

```
> a[order(a)]
[1]   10  100 1000
```
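
It is easy to confuse `order()` with `rank()`, and for the vector `a` above the two happen to return the same numbers. The sketch below uses a different made-up vector `b` where they differ, and also shows the `decreasing` argument for sorting from large to small:

```r
b <- c(30, 10, 20)

# order(): where to find the smallest, second smallest, ... element
positions <- order(b)
positions

# rank(): the rank of each element, in its original position
ranks <- rank(b)
ranks

# Sort from largest to smallest
descending <- b[order(b, decreasing = TRUE)]
descending
```

Here `order(b)` starts with 2 because the smallest value sits at position 2, while `rank(b)` starts with 3 because the first element is the third smallest.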

### Instructions

Experiment with the `order()` function in the console. Click 'Submit Answer' when you are ready to continue.


Just play with the `order()` function in the console!

Alright, now that you understand the `order()` function, let us do something useful with it. You would like to rearrange your data frame such that it starts with the smallest planet and ends with the largest one. In other words, you want to sort on the `diameter` column.

### Instructions

• Call `order()` on `planets_df$diameter` (the `diameter` column of `planets_df`). Store the result as `positions`.
• Now reshuffle `planets_df` with the `positions` vector as row indexes inside square brackets. Keep all columns. Simply print out the result.

```
# Use order() to create positions
positions <- order(planets_df$diameter)

# Use positions to sort planets_df
planets_df[positions, ]
```

• Use `order(planets_df$diameter)` to create `positions`.
• Now, you can use `positions` inside square brackets: `planets_df[...]`; can you fill in the `...`?

This exercise concludes the chapter on data frames. Remember that data frames are extremely important in R; you will need them all the time. Another very often used data structure is the list, which is the subject of the next chapter.