RDocumentation: Scoring and Ranking

Learn more on how the search results on RDocumentation are generated!
Mar 2017  · 7 min read

One of the core features of RDocumentation is its search functionality. From the start, we wanted a super simple search bar that finds what you are looking for, without a complex form asking for a package name, function name, version or anything else. Just a simple search bar.

In today's technical blog post, we will highlight the technologies and techniques that we use to provide relevant and meaningful results to our users.

Elasticsearch

RDocumentation uses Elasticsearch to index and search through all R packages and topics.

Elasticsearch is an open-source, scalable, distributed, enterprise-grade search engine.

Elasticsearch is a good fit for querying documentation because it does not store data in conventional SQL tables; instead, it stores documents in a JSON-like data structure. Each document is just a set of key-value pairs with simple data types (strings, numbers, lists, dates, …). Its distributed nature means Elasticsearch can be extremely fast.

An Elasticsearch cluster can have multiple indexes, and each index can have multiple document types. A document type just describes what the structure of the document should look like. To learn more about Elasticsearch types, you can visit the Elasticsearch guide. RDocumentation uses three different types: package_version, topic, and package. The first two are the main ones; let’s discuss package later.

Because RDocumentation is open-source, you can see the Elasticsearch mappings in our GitHub repo.

package_version type

The package_version type is like a translation of the DESCRIPTION file of a package. It features the main fields that one can find in there: package_name, version, title, description, release_date, license, url, copyright, created_at, updated_at, latest_version, maintainer and collaborators. The maintainer and collaborators are extracted from the Authors field in the DESCRIPTION file.
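To make the shape of such a document concrete, here is a minimal sketch of a package_version document as a Python dictionary. The keys are the fields listed above; every value, and the type chosen for each value, is made up for illustration and is not taken from the actual RDocumentation mappings.

```python
import json

# Hypothetical package_version document. The keys follow the field list
# above; all values (and their types) are assumptions for illustration.
package_version_doc = {
    "package_name": "examplepkg",
    "version": "1.0.0",
    "title": "An Example Package",
    "description": "Illustrates the shape of a package_version document.",
    "release_date": "2017-03-01",
    "license": "MIT",
    "url": "https://example.org/examplepkg",
    "copyright": None,
    "created_at": "2017-03-01T00:00:00Z",
    "updated_at": "2017-03-01T00:00:00Z",
    "latest_version": True,
    "maintainer": {"name": "Jane Doe", "email": "jane@example.org"},
    "collaborators": [{"name": "John Roe"}],
}

# Elasticsearch receives documents as JSON, so the dictionary is serialized.
print(json.dumps(package_version_doc, indent=2))
```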

topic type

The topic documents are parsed from the Rd files, the standard format of documentation in R. The topic type has the following keys: name, title, description, usage, details, value, references, note, author, seealso, examples, created_at, updated_at, sections, aliases and keywords.

Scoring in Elasticsearch

Before doing any scoring, Elasticsearch first tries to reduce the set of candidates by checking whether each document actually matches the query. Basically, a query is a word (or a set of words). Based on the query settings, Elasticsearch searches for a match in certain fields of certain types.
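The matching step can be pictured with a toy filter. This is only a sketch of the idea: real Elasticsearch analyzes text and looks terms up in an inverted index rather than scanning documents, and the function name and sample documents below are assumptions made for the example.

```python
def matches(document, query, fields):
    """Return True if any query word occurs as a token in any given field."""
    words = query.lower().split()
    for field in fields:
        tokens = str(document.get(field, "")).lower().split()
        if any(word in tokens for word in words):
            return True
    return False


# Illustrative topic documents with only two of the topic keys filled in.
documents = [
    {"type": "topic", "name": "lm", "title": "Fitting Linear Models"},
    {"type": "topic", "name": "plot", "title": "Generic X-Y Plotting"},
]

# Only documents that match the query in the chosen fields go on to scoring.
candidates = [d for d in documents if matches(d, "linear models", ["name", "title"])]
print([d["name"] for d in candidates])
```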

However, a match does not necessarily mean that the document is relevant; the same word can have different meanings in different contexts. Based on the query settings, we can filter by type and field, and include more contextual information. This contextual information improves relevancy, and this is where scoring comes into play.

Elasticsearch uses Lucene under the hood, so the scoring is based on Lucene’s Practical Scoring Function, which brings together models such as TF-IDF, the Vector Space Model and the Boolean Model to score each document.
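As a rough illustration of the TF-IDF part, the sketch below uses the term frequency, inverse document frequency and field-length norm in the form they take in Lucene's classic similarity. It is a simplification: query normalization, the coordination factor and boosts are left out, and the tiny corpus is invented for the example.

```python
import math


def tf(term, tokens):
    # Term frequency: square root of the number of occurrences in the field.
    return math.sqrt(tokens.count(term))


def idf(term, corpus):
    # Inverse document frequency: terms found in fewer documents weigh more.
    doc_freq = sum(1 for tokens in corpus if term in tokens)
    return 1.0 + math.log(len(corpus) / (doc_freq + 1))


def field_norm(tokens):
    # Field-length norm: a match in a short field counts for more.
    return 1.0 / math.sqrt(len(tokens)) if tokens else 0.0


def score(query_terms, tokens, corpus):
    # Simplified score: sum over query terms of tf * idf^2 * norm.
    return sum(
        tf(t, tokens) * idf(t, corpus) ** 2 * field_norm(tokens)
        for t in query_terms
    )


corpus = [
    "fitting linear models".split(),
    "generic plotting of linear data and more linear data".split(),
    "read a file".split(),
]
scores = [score(["linear"], doc, corpus) for doc in corpus]
print(scores)
```

In this toy corpus the short title outranks the longer one even though the longer one mentions the term twice, because the field-length norm outweighs the extra occurrence.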

If you want to learn more about how that function is used in Elasticsearch, you can check out the relevant section of the Elasticsearch guide.

One way to improve relevancy is to apply a boost to some fields. For example, in the full search, we naturally boost fields like package_name and title for packages and aliases and name for topics.
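Field boosts like these can be expressed in an Elasticsearch multi_match query with the `field^boost` notation. The sketch below builds such a query body as a plain Python dictionary; the boost values and the inclusion of the description field are assumptions chosen for illustration, not the values RDocumentation uses.

```python
# Hypothetical boost factors; the real values are not given in this post.
PACKAGE_BOOSTS = {"package_name": 3.0, "title": 2.0, "description": 1.0}
TOPIC_BOOSTS = {"aliases": 3.0, "name": 2.0, "description": 1.0}


def multi_match(query, boosts):
    """Build a multi_match clause whose fields carry per-field boosts."""
    fields = [f"{field}^{boost:g}" for field, boost in boosts.items()]
    return {"multi_match": {"query": query, "fields": fields}}


package_query = multi_match("linear models", PACKAGE_BOOSTS)
topic_query = multi_match("linear models", TOPIC_BOOSTS)
print(package_query)
print(topic_query)
```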

Another effective way to improve relevancy is to boost documents based on their popularity. The idea is that the more popular a package is, the more likely a user is to search for it. Showing the more popular packages first increases the probability that we show what the user is actually looking for.

Using downloads as a popularity measure

There are multiple ways to measure popularity. We could use direct measures like votes or rankings that users give (like ratings on Amazon products), or indirect measures like the number of items sold or the number of views (for YouTube videos).

At RDocumentation, we chose the latter. More specifically, we use the number of downloads as a measure of popularity. Indirect measures are typically easier to collect because they don’t require active user input.


One problem that arises when using the number of downloads is that old packages will naturally have more total downloads than newer packages. That does not mean that they are more popular, however; they have just been around longer. What if a package was very popular years ago, but has now become obsolete and is no longer being actively used by the community?

To solve this problem, we only take into account the number of downloads in the last month. That way, older packages’ popularity score is not artificially boosted, and obsolete packages will quickly fade out.
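A minimal sketch of this windowing, assuming daily download counts are available as a mapping from date to count and that "last month" means the 30 days before today; both the data layout and the window length are assumptions for the example.

```python
from datetime import date, timedelta


def downloads_last_month(daily_counts, today, window_days=30):
    """Sum the downloads that fall in the window_days before today."""
    start = today - timedelta(days=window_days)
    return sum(
        count for day, count in daily_counts.items() if start <= day < today
    )


today = date(2017, 3, 15)

# An old package: many downloads years ago, very few recently.
old_package = {date(2015, 1, 1): 50000, date(2017, 3, 10): 20}
# A newer package: fewer downloads in total, but all of them recent.
new_package = {date(2017, 3, 1): 400, date(2017, 3, 12): 600}

print(downloads_last_month(old_package, today))
print(downloads_last_month(new_package, today))
```

Counting only the recent window ranks the newer package above the old one, even though the old one has far more downloads in total.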

Direct vs Indirect downloads

Another problem arises from reverse dependencies. R packages typically depend on a wide range of other packages. Packages with a lot of reverse dependencies will get downloaded far more often than others. However, these packages are typically more low-level and are not used directly by the end user. We have to be careful not to give their number of downloads too much weight.

As an example, take Rcpp. Over 70 percent of all packages on CRAN, the Comprehensive R Archive Network, depend on this package, which obviously makes it the most downloaded R package. However, rather few R users will directly use this package and search for its documentation.

To solve this problem, we needed to separate direct downloads (downloads that happen because a user requested the package) from indirect downloads (downloads that happen because a dependent package was downloaded). To distinguish direct from indirect downloads in the CRAN logs, we use the same heuristic described in the cran.stats package by Arun Srinivasan.
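To give a feel for the idea, here is a deliberately simplified stand-in. It is not the cran.stats heuristic itself: it assumes a download is indirect whenever the same client (the anonymized ip_id in the CRAN logs) downloaded, on the same day, a package that depends on it. The function name, the row layout and the dependency map are all assumptions for the example.

```python
from collections import defaultdict


def count_direct_downloads(log_rows, reverse_deps):
    """Count downloads not explained by a dependent package.

    log_rows: list of (day, ip_id, package) tuples.
    reverse_deps: maps a package to the set of packages that depend on it.
    """
    # Collect everything each client downloaded on each day.
    by_client_day = defaultdict(set)
    for day, ip_id, package in log_rows:
        by_client_day[(day, ip_id)].add(package)

    direct = defaultdict(int)
    for day, ip_id, package in log_rows:
        dependents = reverse_deps.get(package, set())
        # Indirect if the client also fetched a package that depends on it.
        if not (by_client_day[(day, ip_id)] & dependents):
            direct[package] += 1
    return dict(direct)


# Illustrative log rows and dependency map.
rows = [
    ("2017-03-01", 1, "dplyr"),
    ("2017-03-01", 1, "Rcpp"),  # pulled in by dplyr for client 1
    ("2017-03-01", 2, "Rcpp"),  # requested on its own by client 2
]
reverse_deps = {"Rcpp": {"dplyr"}}

direct_counts = count_direct_downloads(rows, reverse_deps)
print(direct_counts)
```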

We now have a meaningful popularity metric: the number of direct downloads in the last month. Elasticsearch provides an easy way to inject this additional information into the score; for more details, check out the Elasticsearch guide.

The score is modified as follows:

new_score = old_score * log(1 + number of direct downloads in the last month)

We use a log() function to smooth out the number of downloads, because each additional download carries less weight; the difference between 0 and 1000 downloads should have a bigger impact on a popularity score than the difference between 100,000 and 101,000 downloads.
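The effect of the logarithm can be checked with a few lines of Python. This is a sketch under two assumptions: the logarithm is taken in base 10, as the log1p modifier of Elasticsearch's field_value_factor function does, and the download count lives in a hypothetical field called last_month_downloads. The choice of base only rescales every score by the same constant, so it does not change the ranking.

```python
import math


def rescore(old_score, direct_downloads_last_month):
    """new_score = old_score * log(1 + direct downloads in the last month)."""
    return old_score * math.log10(1 + direct_downloads_last_month)


# The same 1000 extra downloads matter far more for an unknown package.
gain_small = rescore(1.0, 1000) - rescore(1.0, 0)
gain_large = rescore(1.0, 101000) - rescore(1.0, 100000)
print(gain_small, gain_large)


def popularity_query(inner_query, field="last_month_downloads"):
    """Wrap a query in a function_score clause that multiplies its score."""
    return {
        "function_score": {
            "query": inner_query,
            "field_value_factor": {"field": field, "modifier": "log1p"},
            "boost_mode": "multiply",
        }
    }


body = {"query": popularity_query({"match": {"title": "linear models"}})}
print(body)
```

Note that under this formula a document with zero direct downloads in the last month ends up with a score of zero, whatever its text relevance.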

This re-scoring improves the overall relevancy of the search results presented by RDocumentation and, as a result, users can focus on reading documentation instead of searching for it.

If you want to find out more about how exactly the Elasticsearch query is implemented, you can take a look at the RDocumentation project on GitHub. The query itself is located in the SearchController.

If you want to learn more about how RDocumentation is implemented, check out our repos on GitHub.

About RDocumentation  

RDocumentation aggregates help documentation for R packages from CRAN, BioConductor, and GitHub, the three most common sources of current R documentation. RDocumentation goes beyond simply aggregating this information, however, by bringing all of this documentation to your fingertips via the RDocumentation package. The RDocumentation package overwrites the basic help functions from the utils package and gives you access to RDocumentation from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples.

