Skip to main content
HomeBlogR Programming

RDocumentation: Scoring and Ranking

Learn more on how the search results on RDocumentation are generated!
Mar 2017  · 7 min read

One of the core features of is its search functionality. From the start, we wanted to have a super simple search bar that finds what you are looking for, without a complex form asking for a package name, function name, versions or anything else. Just a simple search bar.

In today's technical blog post, we will highlight the technologies and techniques that we use to provide relevant and meaningful results to our users.

Elasticsearch uses Elasticsearch to index and search through all R packages and topics.

Elasticsearch is an open-source, scalable, distributed, enterprise-grade search engine.

Elasticsearch is perfect for querying documentation because it doesn’t use conventional SQL data, but it stores documents in a JSON-like data structure instead. Each document is just a set of key-value pairs with simple data types (strings, numbers, lists, dates, …). Its distributed nature means Elasticsearch can be extremely fast.

An Elasticsearch cluster can have multiple indexes and each index can have multiple document types. A document type just describes what the structure of the document should look like. To learn more about Elasticsearch types, you can visit the guide on uses three different types: package_version, topic, and package. The first 2 are the main ones; let’s discuss package later.

Because is open-source, you can see the Elasticsearch’s mappings in our github repo

package_version type

The package_version type is like a translation of the DESCRIPTION file of a package, it features the main field that one can find in there; package_name, version, title, description, release_date, license, url, copyright, created_at, updated_at, latest_version, maintainer and collaborators. The maintainer and collaborators are extracted from the Authors field in the DESCRIPTION file

topic type

The topic documents are parsed from the Rd files, the standard format of documentation in R. The topic type has the following keys: name, title, description, usage, details, value, references, note, author, seealso, examples, created_at, updated_at, sections, aliases and keywords.

Scoring in Elasticsearch

Before doing any scoring, Elasticseach first tries to reduce the set of candidates by checking if the document actually matches the query. Basically, a query is a word (or a set of words). Based on the query setting, Elasticsearch searches for a match in certain fields of certain types.

However, a match does not necessarily mean that the document is relevant; the same word can have different meanings in different contexts. Based on the query settings, we can filter by type and field, and include more contextual information. This contextual information will improve the relevancy and this is where scoring comes into place.

Elasticsearch uses Lucene under the hood, so the scoring is based on Lucene’s Practical Scoring Function which brings together some models like the TF-IDF, Vector Space Model and Boolean Model to score the document.

If you want to lean more about how that function is used in Elasticsearch, you can check out this section of guide.

One way to improve relevancy is to apply a boost to some fields. For example, in the full search, we naturally boost fields like package_name and title for packages and aliases and name for topics.

Another effective way to improve relevancy is to boost documents based on their popularity. The idea behind that is that if a package is more popular, the user is more likely to search for this package. Showing the more populars packages first will increase the probability that we show what the user is actually looking for.

Using downloads as a popularity measure

There are multiple ways to measure popularity. We could use direct measures like votes or rankings that users give (like ratings on Amazon products), or indirect measures like the number of items sold or the number of views (for YouTube videos).

At, we chose the latter. More specifically, we use the number of downloads as a measure of popularity. Indirect measures are typically easier to collect because they don’t require active user input.


One problem that arises when using the number of downloads is that old packages will naturally have more total downloads than newer packages. That does not mean that they are more popular, however, they have just been around longer. What if a package was very popular years ago, but has now become obsolete and is no longer being actively used by the community?

To solve this problem, we only take into account the number of downloads in the last month. That way, older packages’ popularity score is not artificially boosted, and obsolete packages will quickly fade out.

Direct vs Indirect downloads

Another problem arises from reverse dependencies. R packages typically depend on a wide range of other packages. Packages with a lot of reverse dependencies will get downloaded way more than others. However, these packages are typically more low-level and are not used directly by the end-user. We have to watch out to not give their number of downloads too much weight.

As an example, take Rcpp. Over 70 percent of all packages on CRAN, the comprehensive R archive network, depend on this package, which obviously makes it the most downloaded R package. However, rather few R users will directly use this package and search for its documentation.

To solve this problem, we needed to separate direct downloads (downloads that happens because a user requested it) and indirect downloads (downloads that happen because a dependent packages was downloaded). To distinguish the direct and indirect downloads from the CRAN logs, we use the same heuristic described in the cran.stats package by Arun Srinivasan.

We now have a meaningful popularity metric: the number of direct downloads in the last month. Elasticsearch provides an easy way to inject this additional information; for more details, check out this article on

The score is modified as follows:

new_score = old_score * log(1 + number of direct downloads in the last month)

We use a log() function to smooth out the number of downloads value, because each subsequent downloads has less weight; the difference between 0 and 1000 downloads should have a bigger impact on a popularity score than the difference between 100,000 and 101,000 downloads.

This re-scoring improves the overall relevancy of the search results presented by and as a result, users can focus on reading documentation instead of searching for it.

If you want to find out more about how exactly the Elasticsearch query is implemented, you can take a look at the RDocumentation project on GitHub. The query itself is located in the SearchController.

If you want to learn more about how is implemented, check out our repos:

About RDocumentation  

RDocumentation aggregates help documentation for R packages from CRAN, BioConductor, and GitHub - the three most common sources of current R documentation. goes beyond simply aggregating this information, however, by bringing all of this documentation to your fingertips via the RDocumentation package. The RDocumentation package overwrites the basic help functions from the utils package and gives you access to from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples. 



The Venture Mindset with Ilya Strebulaev, Economist Professor at Stanford Graduate School of Business

Richie and Ilya explore the venture mindset, the importance of embracing unknowns, how VCs deal with unpredictability, how our education affects our decision-making ability, venture mindset principles and much more. 
Richie Cotton's photo

Richie Cotton

59 min

cheat sheet

LaTeX Cheat Sheet

Learn everything you need to know about LaTeX in this convenient cheat sheet!
Richie Cotton's photo

Richie Cotton


Sorting Data in R

How to sort a data frame in R.
DataCamp Team's photo

DataCamp Team

2 min


Scatterplot in R

Learn how to create a scatterplot in R. The basic function is plot(x, y), where x and y are numeric vectors denoting the (x,y) points to plot.
DataCamp Team's photo

DataCamp Team


Operators in R

Learn how to use arithmetic and logical operators in R. These binary operators work on vectors, matrices, and scalars.
DataCamp Team's photo

DataCamp Team

4 min


How to Transpose a Matrix in R: A Quick Tutorial

Learn three methods to transpose a matrix in R in this quick tutorial
Adel Nehme's photo

Adel Nehme

See MoreSee More