
RDocumentation: Scoring and Ranking

March 10th, 2017 in R Programming

One of the core features of RDocumentation.org is its search functionality. From the start, we wanted to have a super simple search bar that finds what you are looking for, without a complex form asking for a package name, function name, versions or anything else. Just a simple search bar.

In today's technical blog post, we will highlight the technologies and techniques that we use to provide relevant and meaningful results to our users.

Elasticsearch

RDocumentation.org uses Elasticsearch to index and search through all R packages and topics.

Elasticsearch is an open-source, scalable, distributed, enterprise-grade search engine.

Elasticsearch is well suited to querying documentation because it does not store data in conventional SQL tables; instead, it stores documents in a JSON-like data structure. Each document is just a set of key-value pairs with simple data types (strings, numbers, lists, dates, …). Its distributed nature also allows Elasticsearch to be extremely fast.

An Elasticsearch cluster can have multiple indexes and each index can have multiple document types. A document type just describes what the structure of the document should look like. To learn more about Elasticsearch types, you can visit the guide on elastic.co.

RDocumentation.org uses three different types: package_version, topic, and package. The first two are the main ones; let’s discuss package later.

Because RDocumentation.org is open-source, you can see the Elasticsearch mappings in our GitHub repo.

package_version type

The package_version type is essentially a translation of a package's DESCRIPTION file. It features the main fields found there: package_name, version, title, description, release_date, license, url, copyright, created_at, updated_at, latest_version, maintainer and collaborators. The maintainer and collaborators are extracted from the Authors field in the DESCRIPTION file.

topic type

The topic documents are parsed from the Rd files, the standard format of documentation in R. The topic type has the following keys: name, title, description, usage, details, value, references, note, author, seealso, examples, created_at, updated_at, sections, aliases and keywords.
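To make the two document types more concrete, here is a small sketch of what such documents could look like as plain key-value pairs. The key names come from the lists above; every value (package name, people, dates, and the data type of each field) is invented for illustration and is not taken from the actual index.

```python
# Keys listed in this post for each document type.
PACKAGE_VERSION_KEYS = {
    "package_name", "version", "title", "description", "release_date",
    "license", "url", "copyright", "created_at", "updated_at",
    "latest_version", "maintainer", "collaborators",
}
TOPIC_KEYS = {
    "name", "title", "description", "usage", "details", "value",
    "references", "note", "author", "seealso", "examples",
    "created_at", "updated_at", "sections", "aliases", "keywords",
}

# Hypothetical documents: all values below are made up.
package_version_doc = {
    "package_name": "examplepkg",
    "version": "1.0.0",
    "title": "An Example Package",
    "description": "Tools that illustrate the document layout.",
    "release_date": "2017-03-01",
    "license": "MIT",
    "maintainer": "Jane Doe",
    "collaborators": ["John Roe"],
}

topic_doc = {
    "name": "example_function",
    "title": "Run an Example",
    "description": "Runs a small example.",
    "usage": "example_function(x)",
    "aliases": ["example_function", "ex_fun"],
    "keywords": ["utilities"],
}


def unknown_keys(doc, allowed_keys):
    """Return the keys of `doc` that are not part of the given type."""
    return sorted(set(doc) - allowed_keys)


print(unknown_keys(package_version_doc, PACKAGE_VERSION_KEYS))
print(unknown_keys(topic_doc, TOPIC_KEYS))
```

Both calls print an empty list, since the sample documents only use keys that belong to their type.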

Scoring in Elasticsearch

Before doing any scoring, Elasticsearch first tries to reduce the set of candidates by checking if the document actually matches the query. Basically, a query is a word (or a set of words). Based on the query settings, Elasticsearch searches for a match in certain fields of certain types.

However, a match does not necessarily mean that the document is relevant; the same word can have different meanings in different contexts. Based on the query settings, we can filter by type and field, and include more contextual information. This contextual information improves the relevancy, and this is where scoring comes into play.

Elasticsearch uses Lucene under the hood, so the scoring is based on Lucene’s Practical Scoring Function, which brings together models such as TF-IDF, the Vector Space Model and the Boolean Model to score the document.

If you want to learn more about how that function is used in Elasticsearch, you can check out this section of the elastic.co guide.
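To give a feel for the TF-IDF part of that function, here is a deliberately simplified sketch. It uses the term frequency, inverse document frequency and field-length norm formulas of Lucene's classic similarity, but leaves out query normalization, the coordination factor and boosts, so it is a toy model rather than the full Practical Scoring Function. The function names and the tiny corpus are made up for this example.

```python
import math


def tf(term_freq):
    """Term frequency: more occurrences help, with diminishing returns."""
    return math.sqrt(term_freq)


def idf(num_docs, doc_freq):
    """Inverse document frequency: rare terms weigh more than common ones."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def field_norm(num_terms):
    """Field-length norm: a match in a short field counts for more."""
    return 1.0 / math.sqrt(num_terms)


def toy_score(query_terms, doc_tokens, corpus):
    """Sum tf * idf^2 * norm over the query terms found in the document."""
    score = 0.0
    for term in query_terms:
        freq = doc_tokens.count(term)
        if freq == 0:
            continue  # this term does not match, so it contributes nothing
        doc_freq = sum(1 for tokens in corpus if term in tokens)
        score += (tf(freq)
                  * idf(len(corpus), doc_freq) ** 2
                  * field_norm(len(doc_tokens)))
    return score


# Hypothetical, already tokenized documents.
corpus = [
    ["linear", "model", "fit"],
    ["plot", "linear", "regression", "model", "with", "many", "extra", "words"],
    ["string", "utilities"],
]
for tokens in corpus:
    print(tokens, toy_score(["linear"], tokens, corpus))
```

In this toy corpus the short first document outranks the longer second one for the query "linear", and the third document, which does not match, scores zero.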

One way to improve relevancy is to apply a boost to some fields. For example, in the RDocumentation.org full search, we naturally boost fields like package_name and title for packages and aliases and name for topics.
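As a rough sketch, field boosts can be expressed in the Elasticsearch query DSL by appending `^` and a boost factor to a field name in a `multi_match` query. The query below is only an illustration built as a Python dictionary: the boost values and the overall shape of the query are assumptions for this example, not the query that RDocumentation.org actually runs.

```python
import json

# Hypothetical boost factors; the real values are not given in this post.
PACKAGE_FIELDS = ["package_name^6", "title^4", "description"]
TOPIC_FIELDS = ["aliases^6", "name^4", "title", "description"]


def build_search_query(text):
    """Build a query body that searches package and topic fields with boosts."""
    return {
        "query": {
            "bool": {
                "should": [
                    {"multi_match": {"query": text, "fields": PACKAGE_FIELDS}},
                    {"multi_match": {"query": text, "fields": TOPIC_FIELDS}},
                ]
            }
        }
    }


print(json.dumps(build_search_query("linear model"), indent=2))
```

With boosts like these, a match on package_name or aliases contributes more to the final score than the same match in a description.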

Another effective way to improve relevancy is to boost documents based on their popularity. The idea is that the more popular a package is, the more likely it is that a user is searching for it. Showing the more popular packages first increases the probability that we show what the user is actually looking for.

Using downloads as a popularity measure

There are multiple ways to measure popularity. We could use direct measures like votes or rankings that users give (like ratings on Amazon products), or indirect measures like the number of items sold or the number of views (for YouTube videos).

At RDocumentation.org, we chose the latter. More specifically, we use the number of downloads as a measure of popularity. Indirect measures are typically easier to collect because they don’t require active user input.

Timeframing

One problem that arises when using the number of downloads is that old packages will naturally have more total downloads than newer packages. That does not mean that they are more popular, however; they have just been around longer. And what if a package was very popular years ago, but has now become obsolete and is no longer actively used by the community?

To solve this problem, we only take into account the number of downloads in the last month. That way, older packages’ popularity score is not artificially boosted, and obsolete packages will quickly fade out.
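The timeframing idea can be sketched in a few lines. The snippet assumes that daily download counts per package are already available as a dictionary keyed by date, and it approximates "the last month" with a 30-day window; both the data layout and the window length are assumptions for this example.

```python
from datetime import date, timedelta


def downloads_last_month(daily_counts, today, window_days=30):
    """Sum the downloads that fall within the last `window_days` days."""
    cutoff = today - timedelta(days=window_days)
    return sum(count for day, count in daily_counts.items()
               if cutoff < day <= today)


# Hypothetical counts for one package: a large spike long ago and
# modest recent activity.
counts = {
    date(2016, 1, 1): 50000,
    date(2017, 2, 20): 120,
    date(2017, 3, 9): 80,
    date(2017, 3, 10): 5,
}
print(downloads_last_month(counts, today=date(2017, 3, 10)))
```

The old spike of 50,000 downloads falls outside the window, so only the recent activity counts toward the popularity of the package.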

Direct vs Indirect downloads

Another problem arises from reverse dependencies. R packages typically depend on a wide range of other packages. Packages with a lot of reverse dependencies will get downloaded far more often than others. However, these packages are typically more low-level and are not used directly by the end user. We have to be careful not to give their number of downloads too much weight.

As an example, take Rcpp. Over 70 percent of all packages on CRAN, the Comprehensive R Archive Network, depend on this package, which obviously makes it the most downloaded R package. However, rather few R users will directly use this package and search for its documentation.

To solve this problem, we needed to separate direct downloads (downloads that happen because a user requested the package) from indirect downloads (downloads that happen because a dependent package was downloaded). To distinguish direct from indirect downloads in the CRAN logs, we use the same heuristic as described in the cran.stats package by Arun Srinivasan.
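To illustrate the general idea of separating the two kinds of downloads, here is a simplified sketch. It is not the cran.stats heuristic itself: it assumes that a download counts as indirect whenever the same client (identified by date and ip_id, as in the CRAN logs) also downloaded, on the same day, a package that depends on it. The function name, the dependency table and the log rows are all hypothetical.

```python
from collections import defaultdict


def count_direct_downloads(log_rows, depends_on):
    """Count downloads per package that were not pulled in as a dependency.

    log_rows:   list of (date, ip_id, package) tuples.
    depends_on: dict mapping a package to the set of packages it depends on.
    """
    # Group the packages each client downloaded on a given day.
    by_client = defaultdict(set)
    for day, ip_id, package in log_rows:
        by_client[(day, ip_id)].add(package)

    direct = defaultdict(int)
    for day, ip_id, package in log_rows:
        others = by_client[(day, ip_id)] - {package}
        # Indirect if another package in the same session depends on this one.
        pulled_in = any(package in depends_on.get(other, set())
                        for other in others)
        if not pulled_in:
            direct[package] += 1
    return dict(direct)


# Hypothetical data: "fastpkg" depends on "Rcpp".
depends_on = {"fastpkg": {"Rcpp"}}
log_rows = [
    ("2017-03-01", 1, "fastpkg"),  # client 1 asks for fastpkg ...
    ("2017-03-01", 1, "Rcpp"),     # ... and gets Rcpp along with it
    ("2017-03-01", 2, "Rcpp"),     # client 2 asks for Rcpp itself
]
print(count_direct_downloads(log_rows, depends_on))
```

Here Rcpp is downloaded twice, but only the download by client 2 is counted as direct.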

We now have a meaningful popularity metric: the number of direct downloads in the last month. Elasticsearch provides an easy way to inject this additional information; for more details, check out this article on elastic.co.

The score is modified as follows:

new_score = old_score * log(1 + number of direct downloads in the last month)

We use a log() function to smooth out the number of downloads, because each additional download carries less weight; the difference between 0 and 1000 downloads should have a bigger impact on a popularity score than the difference between 100,000 and 101,000 downloads.
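The effect of the logarithm is easy to check numerically. The sketch below assumes a base-10 logarithm, since the formula above does not state the base, and also shows how such a rescoring could be written as a `function_score` query with a `field_value_factor` in the Elasticsearch query DSL. The field name `last_month_downloads` is a placeholder for this example, not the name used in the real index.

```python
import math


def rescore(old_score, direct_downloads):
    """new_score = old_score * log(1 + direct downloads in the last month)."""
    return old_score * math.log10(1 + direct_downloads)


def with_popularity(query, field="last_month_downloads"):
    """Wrap a query so that its score is multiplied by log10(1 + field)."""
    return {
        "query": {
            "function_score": {
                "query": query,
                "field_value_factor": {
                    "field": field,       # hypothetical field name
                    "modifier": "log1p",  # common logarithm of (1 + value)
                    "missing": 0,
                },
                "boost_mode": "multiply",
            }
        }
    }


# Gain from the first 1000 downloads versus 1000 extra on top of 100,000.
gain_low = rescore(1.0, 1000) - rescore(1.0, 0)
gain_high = rescore(1.0, 101000) - rescore(1.0, 100000)
print(gain_low, gain_high)
```

The first gain is roughly 3.0 and the second roughly 0.004, so the first thousand downloads move the score far more than a thousand extra downloads of an already popular package. Note that, taken literally, the formula gives a score of zero to a package with no direct downloads at all.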

This re-scoring improves the overall relevancy of the search results presented by RDocumentation.org and as a result, users can focus on reading documentation instead of searching for it.

If you want to find out more about how exactly the Elasticsearch query is implemented, you can take a look at the RDocumentation project on GitHub. The query itself is located in the SearchController.

If you want to learn more about how RDocumentation.org is implemented, check out our repos:


About RDocumentation  

RDocumentation aggregates help documentation for R packages from CRAN, Bioconductor, and GitHub - the three most common sources of current R documentation. RDocumentation.org goes beyond simply aggregating this information, however, by bringing all of this documentation to your fingertips via the RDocumentation package. The RDocumentation package overwrites the basic help functions from the utils package and gives you access to RDocumentation.org from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples.


Comments

muenchen-bob
This scoring system is a very significant contribution to the R community. New packages are appearing so rapidly that we need this to help sort through them all. With downloads distributed across CRAN, how are you gathering the counts? In your discussion of the importance of using the log function, I thought you had it backwards until I realized you must be talking about 100,000 (American use of comma) not 100.00 with an extra zero.
03/13/17 1:36 PM |
ludo
Hello Bob,
Thanks for your feedback, we corrected the misused comma.
About the downloads count, we are using the logs of the RStudio mirror (http://cran-logs.rstudio.com/) to get the counts. Of course we are missing the counts from other mirrors but what's important here is not the absolute numbers but how much a package is downloaded relatively to others packages. Since the RStudio mirror is one of the biggest, the numbers it provide should be pretty relevant.

Ludovic
03/15/17 8:51 AM |