Blog

Five Tips to Improve your R Code

Five useful tips that you can use to effectively improve your R code, from using seq() to create sequences to ditching which() and much more!

Updated Dec 2017 · 8 min read

@drsimonj here with five simple tricks I find myself sharing all the time with fellow R users to improve their code!

1. More Fun to Ssequence from 1

Next time you use the colon operator to create a sequence from 1 like 1:n, try seq().

# Sequence a vector
x <- runif(10)
seq(x)
#>  [1]  1  2  3  4  5  6  7  8  9 10

# Sequence an integer
seq(nrow(mtcars))
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23
#> [24] 24 25 26 27 28 29 30 31 32

The colon operator can produce unexpected results that can create all sorts of problems without you noticing! Take a look at what happens when you want to sequence the length of an empty vector:

# Empty vector
x <- c()

1:length(x)
#> [1] 1 0

seq(x)
#> integer(0)

You'll also notice that this saves you from using functions like length(). When applied to an object of a certain length, seq() will automatically create a sequence from 1 to the length of the object.

2. `vector()` what you `c()`

Next time you create an empty vector with c(), try to replace it with vector("type", length).

# A numeric vector with 5 elements
vector("numeric", 5)
#> [1] 0 0 0 0 0

# A character vector with 3 elements
vector("character", 3)
#> [1] "" "" ""

Doing this improves memory usage and increases speed! You often know upfront what type of values will go into a vector, and how long the vector will be. Using c() means R has to slowly work both of these things out. So help give it a boost with vector()!

A good example of this value is in a for loop. People often write loops by declaring an empty vector and growing it with c() like this:

x <- c()
for (i in seq(5)) {
  x <- c(x, i)
}

#> x at step 1 : 1
#> x at step 2 : 1, 2
#> x at step 3 : 1, 2, 3
#> x at step 4 : 1, 2, 3, 4
#> x at step 5 : 1, 2, 3, 4, 5

Instead, pre-define the type and length with vector(), and reference positions by index, like this:

n <- 5
x <- vector("integer", n)
for (i in seq(n)) {
  x[i] <- i
}

#> x at step 1 : 1, 0, 0, 0, 0
#> x at step 2 : 1, 2, 0, 0, 0
#> x at step 3 : 1, 2, 3, 0, 0
#> x at step 4 : 1, 2, 3, 4, 0
#> x at step 5 : 1, 2, 3, 4, 5

Here's a quick speed comparison:

n <- 1e5

x_empty <- c()
system.time(for(i in seq(n)) x_empty <- c(x_empty, i))
#>    user  system elapsed 
#>  15.238   2.327  17.650

x_zeros <- vector("integer", n)
system.time(for(i in seq(n)) x_zeros[i] <- i)
#>    user  system elapsed 
#>   0.007   0.000   0.007

That should be convincing enough!

Master your data skills with DataCamp

More than 10 million people learn Python, R, SQL, and other tech skills using our hands-on courses crafted by industry experts.

Start Learning

3. Ditch the `which()`

Next time you use which(), try to ditch it! People often use which() to get indices from some boolean condition, and then select values at those indices. This is not necessary.

Getting vector elements greater than 5:

x <- 3:7

# Using which (not necessary)
x[which(x > 5)]
#> [1] 6 7

# No which
x[x > 5]
#> [1] 6 7

Or counting number of values greater than 5:

# Using which
length(which(x > 5))
#> [1] 2

# Without which
sum(x > 5)
#> [1] 2

Why should you ditch which()? It's often unnecessary and boolean vectors are all you need.

For example, R lets you select elements flagged as TRUE in a boolean vector:

condition <- x > 5
condition
#> [1] FALSE FALSE FALSE  TRUE  TRUE
x[condition]
#> [1] 6 7

Also, when combined with sum() or mean(), boolean vectors can be used to get the count or proportion of values meeting a condition:

sum(condition)
#> [1] 2
mean(condition)
#> [1] 0.4

which() tells you the indices of TRUE values:

which(condition)
#> [1] 4 5

And while the results are not wrong, it's just not necessary. For example, I often see people combining which() and length() to test whether any or all values are TRUE. Instead, you just need any() or all():

x <- c(1, 2, 12)

# Using `which()` and `length()` to test if any values are greater than 10
if (length(which(x > 10)) > 0)
  print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Wrapping a boolean vector with `any()`
if (any(x > 10))
  print("At least one value is greater than 10")
#> [1] "At least one value is greater than 10"

# Using `which()` and `length()` to test if all values are positive
if (length(which(x > 0)) == length(x))
  print("All values are positive")
#> [1] "All values are positive"

# Wrapping a boolean vector with `all()`
if (all(x > 0))
  print("All values are positive")
#> [1] "All values are positive"

Oh, and it saves you a little time...

x <- runif(1e8)

system.time(x[which(x > .5)])
#>    user  system elapsed 
#>   1.156   0.522   1.686

system.time(x[x > .5])
#>    user  system elapsed 
#>   1.071   0.442   1.662

4. `factor` that factor!

Ever removed values from a factor and found you're stuck with old levels that don't exist anymore? I see all sorts of creative ways to deal with this. The simplest solution is often just to wrap it in factor() again.

This example creates a factor with four levels ("a", "b", "c" and "d"):

# A factor with four levels
x <- factor(c("a", "b", "c", "d"))
x
#> [1] a b c d
#> Levels: a b c d

plot(x)

If you drop all cases of one level ("d"), the level is still recorded in the factor:

# Drop all values for one level
x <- x[x != "d"]

# But we still have this level!
x
#> [1] a b c
#> Levels: a b c d

plot(x)

A super simple method for removing it is to use factor() again:

x <- factor(x)
x
#> [1] a b c
#> Levels: a b c

plot(x)

This is typically a good solution to a problem that gets a lot of people mad. So save yourself a headache and factor that factor!

5. First you get the `$`, then you get the power

Next time you want to extract values from a data.frame column where the rows meet a condition, specify the column with $ before the rows with [.

Say you want the horsepower (hp) for cars with 4 cylinders (cyl), using the mtcars data set. You can write either of these:

# rows first, column second - not ideal
mtcars[mtcars$cyl == 4, ]$hp
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

# column first, rows second - much better
mtcars$hp[mtcars$cyl == 4]
#>  [1]  93  62  95  66  52  65  97  66  91 113 109

The tip here is to use the second approach.

But why is that?

First reason: do away with that pesky comma! When you specify rows before the column, you need to remember the comma: mtcars[mtcars$cyl == 4,]$hp. When you specify column first, this means that you're now referring to a vector, and don't need the comma!

Second reason: speed! Let's test it out on a larger data frame:

# Simulate a data frame...
n <- 1e7
d <- data.frame(
  a = seq(n),
  b = runif(n)
)

# rows first, column second - not ideal
system.time(d[d$b > .5, ]$a)
#>    user  system elapsed 
#>   0.497   0.126   0.629

# column first, rows second - much better
system.time(d$a[d$b > .5])
#>    user  system elapsed 
#>   0.089   0.017   0.107

Worth it, right?

Still, if you want to hone your skills as an R data frame ninja, I suggest learning dplyr. You can get a good overview on the dplyr website or really learn the ropes with online courses like DataCamp's Data Manipulation in R with dplyr.

Sign off

Thanks for reading and I hope this was useful for you.

For updates of recent blog posts, follow @drsimonj on Twitter, or email me at [email protected] to get in touch.

If you'd like the code that produced this blog, check out the blogR GitHub repository.

Check out our Getting Started with the Tidyverse: Tutorial.

Topics

R Programming

Data Science

R Courses

Certification available

Course

Introduction to R

4 hr

2.7M

Master the basics of data analysis in R, including vectors, lists, and data frames, and practice R with real data sets.

See Details

Start Course

Certification available

Course

Intermediate R

6 hr

588.2K

Continue your journey to becoming an R ninja by learning about conditional statements, loops, and vector functions.

See Details

Start Course

Certification available

Course

Introduction to the Tidyverse

4 hr

311.1K

Get started on the path to exploring and visualizing your own data with the tidyverse, a powerful and popular collection of data science tools within R.

See Details

Start Course

Data Science in Finance: Unlocking New Potentials in Financial Markets

Discover the role of data science in finance, shaping tomorrow's financial strategies. Gain insights into advanced analytics and investment trends.

Shawn Plummer

9 min

5 Common Data Science Challenges and Effective Solutions

Emerging technologies are changing the data science world, bringing new data science challenges to businesses. Here are 5 data science challenges and solutions.

DataCamp Team

8 min

Navigating R Certifications in 2024: A Comprehensive Guide

Explore DataCamp's R programming certifications with our guide. Learn about Data Scientist and Data Analyst paths, preparation tips, and career advancement.

Matt Crabtree

8 min

A Data Science Roadmap for 2024

Do you want to start or grow in the field of data science? This data science roadmap helps you understand and get started in the data science landscape.

Mark Graus

10 min

R Markdown Tutorial for Beginners

Learn what R Markdown is, what it's used for, how to install it, what capacities it provides for working with code, text, and plots, what syntax it uses, what output formats it supports, and how to render and publish R Markdown documents.

Elena Kosourova

12 min

Introduction to DynamoDB: Mastering NoSQL Database with Node.js | A Beginner's Tutorial

Learn to master DynamoDB with Node.js in this beginner's guide. Explore table creation, CRUD operations, and scalability in AWS's NoSQL database.

Gary Alway

11 min

See More See More

1. More Fun to Ssequence from 1

2. vector() what you c()

Master your data skills with DataCamp

3. Ditch the which()

4. factor that factor!

5. First you get the $, then you get the power

Sign off

Data Science in Finance: Unlocking New Potentials in Financial Markets

5 Common Data Science Challenges and Effective Solutions

Navigating R Certifications in 2024: A Comprehensive Guide

A Data Science Roadmap for 2024

R Markdown Tutorial for Beginners

Introduction to DynamoDB: Mastering NoSQL Database with Node.js | A Beginner's Tutorial

.css-1531qan{-webkit-text-decoration:none;text-decoration:none;color:inherit;}Introduction to R

Intermediate R

Introduction to the Tidyverse

Data Science in Finance: Unlocking New Potentials in Financial Markets

5 Common Data Science Challenges and Effective Solutions

Navigating R Certifications in 2024: A Comprehensive Guide

A Data Science Roadmap for 2024

R Markdown Tutorial for Beginners

Introduction to DynamoDB: Mastering NoSQL Database with Node.js | A Beginner's Tutorial

2. `vector()` what you `c()`

3. Ditch the `which()`

4. `factor` that factor!

5. First you get the `$`, then you get the power

Introduction to R