Course Notes: Introduction to R


    => After finishing this course, one of your favorite functions in R will be summary(). This will give you a quick overview of the contents of a variable:

    summary(my_var)
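As a minimal sketch of what summary() reports, the example below applies it to a numeric vector and to a factor. The names my_var and my_factor and their values are made up for illustration.

```r
# Hypothetical example data, not from the course datasets
my_var <- c(2, 4, 4, 7, 10)
my_factor <- factor(c("a", "b", "a"))

# For a numeric vector: minimum, quartiles, median, mean and maximum
num_summary <- summary(my_var)
print(num_summary)

# For a factor: the number of observations per level
fac_summary <- summary(my_factor)
print(fac_summary)
```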

    In R, a matrix is a collection of elements of the same data type (numeric, character, or logical) arranged into a fixed number of rows and columns. Since you are only working with rows and columns, a matrix is called two-dimensional.

    To create a matrix, use the function matrix(). The calls in these notes pass three arguments:

    • The first is the collection of elements to fill the matrix with (for example, a vector).
    • The second is byrow: byrow = TRUE fills the matrix row by row, and byrow = FALSE (the default) fills it column by column.
    • The third is nrow, the number of rows you want the matrix to have.

    Pass byrow and nrow by name: positionally, matrix() expects nrow and ncol right after the data.
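The three arguments above can be sketched as follows. The values 1:9 and the names my_matrix and by_col are chosen only for illustration.

```r
# Fill the numbers 1 to 9 row by row into a matrix with 3 rows
my_matrix <- matrix(1:9, byrow = TRUE, nrow = 3)
print(my_matrix)
#      [,1] [,2] [,3]
# [1,]    1    2    3
# [2,]    4    5    6
# [3,]    7    8    9

# With byrow = FALSE the same values are filled column by column,
# so 1, 2, 3 end up in the first column instead of the first row
by_col <- matrix(1:9, byrow = FALSE, nrow = 3)
print(by_col)
```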

    In R, the function rowSums() conveniently calculates the totals for each row of a matrix. This function creates a new vector:

    rowSums(my_matrix)

    You can add a column or multiple columns to a matrix with the cbind() function, which merges matrices and/or vectors together by column. For example:

    big_matrix <- cbind(matrix1, matrix2, vector1 ...)

    A vector is essentially an ordered collection of values, and the simplest data structure in R. Its elements can be, for example, numbers (including the results of arithmetic expressions), logical values or character strings, and all elements of a vector share the same data type.

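    • Just like every action has a reaction, every cbind() has an rbind(): rbind() merges matrices and/or vectors together by row.

    • Just like cbind() has rbind(), rowSums() has colSums(), which calculates the totals for each column of a matrix.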
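A small sketch of how these four functions relate, using two made-up 2 x 2 matrices (matrix1 and matrix2 here are illustrative values, not course data):

```r
matrix1 <- matrix(1:4, byrow = TRUE, nrow = 2)   # rows: (1, 2) and (3, 4)
matrix2 <- matrix(5:8, byrow = TRUE, nrow = 2)   # rows: (5, 6) and (7, 8)

# cbind() places the matrices side by side: 2 rows, 4 columns
wide <- cbind(matrix1, matrix2)

# rbind() stacks them on top of each other: 4 rows, 2 columns
tall <- rbind(matrix1, matrix2)

print(rowSums(wide))   # one total per row: 14 22
print(colSums(tall))   # one total per column: 16 20
```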

    Similar to vectors, you can use the square brackets [ ] to select one or multiple elements from a matrix. Whereas vectors have one dimension, matrices have two dimensions. You should therefore use a comma to separate the rows you want to select from the columns. For example:

    => my_matrix[1,2] selects the element at the first row and second column.
    => my_matrix[1:3,2:4] results in a matrix with the data on the rows 1, 2, 3 and columns 2, 3, 4.

    If you want to select all elements of a row or a column, no number is needed before or after the comma, respectively:

    => my_matrix[,1] selects all elements of the first column.
    => my_matrix[1,] selects all elements of the first row.
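The selections above can be tried on a small example. The 3 x 4 matrix below is an assumption made only so that the indices have something to select from.

```r
# Hypothetical matrix with 3 rows and 4 columns, filled row by row
my_matrix <- matrix(1:12, byrow = TRUE, nrow = 3)

single <- my_matrix[1, 2]       # element at row 1, column 2
block  <- my_matrix[1:3, 2:4]   # rows 1 to 3, columns 2 to 4 (a 3 x 3 matrix)
col1   <- my_matrix[, 1]        # all elements of the first column
row1   <- my_matrix[1, ]        # all elements of the first row

print(single)   # 2
print(col1)     # 1 5 9
print(row1)     # 1 2 3 4
```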

    The term FACTOR refers to a statistical data type used to store categorical variables. To create factors in R, you make use of the function factor().

    • A categorical variable can belong to only a limited number of categories. A good example of a categorical variable is sex. In many circumstances you can limit the sex categories to "Male" or "Female". (Sometimes you may need different categories. For example, you may need to consider chromosomal variation, hermaphroditic animals, or different cultural norms, but you will always have a finite number of categories.)

    • A continuous variable, on the other hand, can correspond to an infinite number of values.

    There are two types of categorical variables: a nominal categorical variable and an ordinal categorical variable.

    • A nominal variable is a categorical variable without an implied order (no ranks). This means that it is impossible to say that 'one is worth more than the other'. For example, think of the categorical variable animals_vector with the categories "Elephant", "Giraffe", "Donkey" and "Horse". Here, it is impossible to say that one stands above or below the other.
    • Ordinal variables do have a natural ordering. Consider for example the categorical variable temperature_vector with the categories: "Low", "Medium" and "High". Here it is obvious that "Medium" stands above "Low", and "High" stands above "Medium".

    To define a factor for an ordinal categorical variable, pass three arguments to factor():

    => The vector containing the categorical values.
    => ordered = TRUE to rank the levels, or ordered = FALSE otherwise (order = TRUE also works, because R partially matches argument names).
    => levels = c("value1", "value2", ...) to list the levels from lowest to highest rank. The levels must match the values in the vector; any value that is not listed becomes NA.

    For example:

    my_code <- c("new", "medium", "old")
    ordinal_my_code <- factor(my_code, ordered = TRUE, levels = c("new", "medium", "old"))
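A runnable sketch of an ordered factor, using the temperature categories mentioned above. The five observations in temperature_vector are assumed values for illustration. Note that the levels must be spelled exactly like the values in the vector.

```r
# Assumed observations; only the three category names come from the notes
temperature_vector <- c("High", "Low", "High", "Low", "Medium")

# ordered = TRUE plus levels listed from lowest to highest rank
factor_temperature_vector <- factor(temperature_vector,
                                    ordered = TRUE,
                                    levels = c("Low", "Medium", "High"))
print(factor_temperature_vector)

# Ordered factors can be compared: is the first observation above the second?
first_above_second <- factor_temperature_vector[1] > factor_temperature_vector[2]
print(first_above_second)   # TRUE, because "High" ranks above "Low"

# Levels that do not match any value turn every element into NA
mismatched <- factor(temperature_vector, ordered = TRUE,
                     levels = c("level 0", "level 1", "level 2"))
print(mismatched)
```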

    When you first get a dataset, you will often notice that it contains factors with specific factor levels. However, sometimes you will want to change the names of these levels for clarity or other reasons. R allows you to do this with the function levels():

    levels(factor_vector) <- c("name1", "name2",...)

    A good illustration is the raw data that is provided to you by a survey. A common question for every questionnaire is the sex of the respondent. Here, for simplicity, just two categories were recorded, "M" and "F". (You usually need more categories for survey data; either way, you use a factor to store the categorical data.)

    survey_vector <- c("M", "F", "F", "M", "M")

    Recording the sex with the abbreviations "M" and "F" can be convenient if you are collecting data with pen and paper, but it can introduce confusion when analyzing the data. At that point, you will often want to change the factor levels to "Male" and "Female" instead of "M" and "F" for clarity.

    Watch out: the order in which you assign the levels is important. If you type levels(factor_survey_vector), you'll see that it outputs [1] "F" "M". If you don't specify the levels when creating the factor, R will automatically assign them alphabetically. To correctly map "F" to "Female" and "M" to "Male", the levels should be set to c("Female", "Male"), in this order.
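Putting the survey example together, a short sketch of the renaming step might look like this. It uses the survey_vector from the notes; factor_survey_vector is the name the text refers to.

```r
survey_vector <- c("M", "F", "F", "M", "M")
factor_survey_vector <- factor(survey_vector)

# Without explicit levels, R sorts them alphabetically
original_levels <- levels(factor_survey_vector)
print(original_levels)   # "F" "M"

# Rename in the same order: "F" becomes "Female", "M" becomes "Male"
levels(factor_survey_vector) <- c("Female", "Male")
print(factor_survey_vector)
```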

    Working with large datasets is not uncommon in data analysis. When you work with (extremely) large datasets and data frames, your first task as a data analyst is to develop a clear understanding of its structure and main elements. Therefore, it is often useful to show only a small part of the entire dataset.

    So how to do this in R? Well, the function head() enables you to show the first observations of a data frame. Similarly, the function tail() prints out the last observations in your dataset.

    Both head() and tail() print a top line called the 'header', which contains the names of the different variables in your dataset.
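As a quick sketch, head() and tail() can be tried on mtcars, a data frame that ships with R. Using mtcars here is an assumption based on the "32 car types" example in these notes.

```r
# head() shows the first 6 observations by default
first_rows <- head(mtcars)
print(first_rows)

# The optional n argument changes how many observations are shown
last_rows <- tail(mtcars, n = 3)
print(last_rows)
```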

    ++> Have a look at the structure

    Another method that is often used to get a rapid overview of your data is the function str(). The function str() shows you the structure of your dataset. For a data frame it tells you:

    • The total number of observations (e.g. 32 car types)
    • The total number of variables (e.g. 11 car features)
    • A full list of the variable names (e.g. mpg, cyl … )
    • The data type of each variable (e.g. num)
    • The first observations

    Applying the str() function will often be the first thing that you do when receiving a new dataset or data frame. It is a great way to get more insight into your dataset before diving into the real analysis.
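A minimal sketch of str() on the built-in mtcars data frame (assumed here because its 32 observations and 11 variables match the examples above). The individual pieces that str() reports can also be retrieved separately:

```r
# Prints the number of observations and variables, then one line per
# variable with its name, data type and first few values
str(mtcars)

n_obs  <- nrow(mtcars)    # number of observations
n_vars <- ncol(mtcars)    # number of variables
var_names <- names(mtcars)
print(var_names)
```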

    ++> Selection of data frame elements

    Similar to vectors and matrices, you select elements from a data frame with the help of square brackets [ ]. By using a comma, you can indicate what to select from the rows and the columns respectively. For example:

    • my_df[1,2] selects the value at the first row and second column in my_df.
    • my_df[1:3,2:4] selects rows 1, 2, 3 and columns 2, 3, 4 in my_df.

    Sometimes you want to select all elements of a row or column. For example, my_df[1, ] selects all elements of the first row.

    Instead of using numerics to select elements of a data frame, you can also use the variable names to select columns of a data frame.

    Suppose you want to select the first three elements of the type column. One way to do this is

    planets_df[1:3,2]

    A possible disadvantage of this approach is that you have to know (or look up) the column number of type, which gets hard if you have a lot of variables. It is often easier to just make use of the variable name:

    planets_df[1:3,"type"]

    You will often want to select an entire column, namely one specific variable from a data frame. If you want to select all elements of the variable diameter, for example, both of these will do the trick:

    planets_df[,3]
    planets_df[,"diameter"]

    However, there is a short-cut. If your columns have names, you can use the $ sign:

    planets_df$diameter
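The equivalent selections can be checked on a small stand-in for planets_df. The data frame below is an assumption: it only mimics the column layout implied by the notes (type as column 2, diameter as column 3), and its values are illustrative.

```r
# Hypothetical stand-in for planets_df (values are illustrative only)
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars"),
  type     = rep("Terrestrial planet", 4),
  diameter = c(0.382, 0.949, 1, 0.532),
  stringsAsFactors = FALSE
)

# First three elements of type: by column number and by column name
by_number <- planets_df[1:3, 2]
by_name   <- planets_df[1:3, "type"]

# Three ways to select the whole diameter column
d1 <- planets_df[, 3]
d2 <- planets_df[, "diameter"]
d3 <- planets_df$diameter
print(d3)
```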

    You probably remember from high school that some planets in our solar system have rings and others do not. Unfortunately, you cannot recall their names. Could R help you out?

    If you type rings_vector in the console, you get:

    [1] FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

    This means that the first four observations (or planets) do not have a ring (FALSE), but the other four do (TRUE). However, you do not get a nice overview of the names of these planets, their diameter, etc. Let's try to use rings_vector to select the data for the four planets with rings. Remember that to select all columns, you simply have to leave the columns part inside the [ ] empty! This means you'll need planets_df[rings_vector, ].
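A sketch of this logical selection, assuming a simplified planets_df with only name, diameter and rings columns. The diameter values are illustrative; the rings values follow the output shown above.

```r
# Hypothetical, simplified stand-in for planets_df
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# The rings column is a logical vector with one entry per planet
rings_vector <- planets_df$rings

# Rows where rings_vector is TRUE are kept; the empty column part keeps all columns
planets_with_rings <- planets_df[rings_vector, ]
print(planets_with_rings)
```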

    You should see the subset() function as a short-cut to do exactly the same as what you did in the previous exercises.

    subset(my_df, subset = some_condition)

    The first argument of subset() specifies the dataset for which you want a subset. By adding the second argument, you give R the necessary information and conditions to select the correct subset.

    The code below will give the exact same result as you got in the previous exercise, but this time, you didn't need the rings_vector!

    subset(planets_df, subset = rings)
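To see that subset() matches the bracket-based selection, here is a sketch on an assumed, simplified planets_df (illustrative values). Inside subset(), column names such as rings and diameter can be used directly in the condition.

```r
# Hypothetical, simplified stand-in for planets_df
planets_df <- data.frame(
  name     = c("Mercury", "Venus", "Earth", "Mars",
               "Jupiter", "Saturn", "Uranus", "Neptune"),
  diameter = c(0.382, 0.949, 1, 0.532, 11.209, 9.449, 4.007, 3.883),
  rings    = c(FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Same rows as planets_df[planets_df$rings, ]
with_rings <- subset(planets_df, subset = rings)
print(with_rings)

# Any logical condition on the columns works, e.g. planets smaller than Earth
small_planets <- subset(planets_df, subset = diameter < 1)
print(small_planets)
```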

    ++> Sorting

    Making and creating rankings is one of mankind's favorite affairs. These rankings can be useful (best universities in the world), entertaining (most influential movie stars) or pointless (best 007 look-a-like).

    In data analysis you can sort your data according to a certain variable in the dataset. In R, this is done with the help of the function order().

    order() is a function that returns the positions of the elements in sorted order when it is applied to a variable, such as a vector:

    a <- c(100, 10, 1000)
    order(a)
    [1] 2 1 3

    10, which is the second element in a, is the smallest element, so 2 comes first in the output of order(a). 100, which is the first element in a, is the second smallest element, so 1 comes second in the output of order(a).

    This means we can use the output of order(a) to reshuffle a:

    a[order(a)]
    [1] 10 100 1000
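The sketch below repeats the example with a and then applies the same idea to a data frame. The vector b and the small planets_df are assumptions added for illustration; b is chosen so that order() and rank() give different results, which shows that order() returns positions rather than ranks.

```r
a <- c(100, 10, 1000)
sorted_a <- a[order(a)]                        # 10 100 1000
sorted_desc <- a[order(a, decreasing = TRUE)]  # 1000 100 10

# order() gives the positions to pick from; rank() gives each element's rank
b <- c(30, 10, 20)
print(order(b))   # 2 3 1: take element 2 first, then 3, then 1
print(rank(b))    # 3 1 2: 30 is the largest, 10 the smallest

# Hypothetical data frame, sorted by diameter from small to large
planets_df <- data.frame(
  name     = c("Earth", "Mercury", "Jupiter"),
  diameter = c(1, 0.382, 11.209),
  stringsAsFactors = FALSE
)
positions <- order(planets_df$diameter)
sorted_planets <- planets_df[positions, ]
print(sorted_planets)
```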