RDocumentation: Scoring and Ranking
One of the core features of RDocumentation.org is its search functionality. From the start, we wanted to have a super simple search bar that finds what you are looking for, without a complex form asking for a package name, function name, versions or anything else. Just a simple search bar.
In today's technical blog post, we will highlight the technologies and techniques that we use to provide relevant and meaningful results to our users.
Elasticsearch
RDocumentation.org uses Elasticsearch to index and search through all R packages and topics.
Elasticsearch is an open-source, scalable, distributed, enterprise-grade search engine.
Elasticsearch is well suited to querying documentation because it does not store data in conventional SQL tables; instead, it stores documents in a JSON-like data structure. Each document is just a set of key-value pairs with simple data types (strings, numbers, lists, dates, …). Because it is distributed, Elasticsearch can spread the work of a query across several nodes, which helps keep searches fast.
An Elasticsearch cluster can have multiple indexes and each index can have multiple document types. A document type just describes what the structure of the document should look like. To learn more about Elasticsearch types, you can visit the guide on elastic.co.
RDocumentation.org uses three different types: package_version, topic, and package. The first two are the main ones; let’s discuss package later.
Because RDocumentation.org is open-source, you can see the Elasticsearch mappings in our GitHub repo.
package_version type
The package_version type is essentially a translation of a package’s DESCRIPTION file. It features the main fields that one can find in there: package_name, version, title, description, release_date, license, url, copyright, created_at, updated_at, latest_version, maintainer and collaborators. The maintainer and collaborators are extracted from the Authors field in the DESCRIPTION file.
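To make the shape of such a document concrete, here is a minimal sketch of a package_version document written as a Python dictionary and serialized to JSON. The keys come from the list above; every value, and the nested structure chosen for maintainer and collaborators, is an invented placeholder rather than real data or the actual mapping used by RDocumentation.org.

```python
import json

# Hypothetical package_version document. The keys mirror the fields listed
# above; all values are made-up placeholders for illustration only.
package_version_doc = {
    "package_name": "examplepkg",
    "version": "1.0.0",
    "title": "An Example Package",
    "description": "Placeholder description taken from a DESCRIPTION file.",
    "release_date": "2017-01-01",
    "license": "MIT",
    "url": "https://example.org/examplepkg",
    "copyright": None,
    "created_at": "2017-01-02T00:00:00Z",
    "updated_at": "2017-01-02T00:00:00Z",
    "latest_version": True,  # assumed to be a boolean flag
    "maintainer": {"name": "Jane Doe", "email": "jane@example.org"},  # assumed shape
    "collaborators": [{"name": "John Doe"}],  # assumed shape
}

# Elasticsearch receives documents as JSON, so the dict must serialize cleanly.
serialized = json.dumps(package_version_doc, indent=2)
print(serialized)
```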
topic type
The topic documents are parsed from the Rd files, the standard format of documentation in R. The topic type has the following keys: name, title, description, usage, details, value, references, note, author, seealso, examples, created_at, updated_at, sections, aliases and keywords.
Scoring in Elasticsearch
Before doing any scoring, Elasticsearch first tries to reduce the set of candidates by checking whether each document actually matches the query. Basically, a query is a word (or a set of words). Based on the query settings, Elasticsearch searches for a match in certain fields of certain types.
However, a match does not necessarily mean that the document is relevant; the same word can have different meanings in different contexts. Based on the query settings, we can filter by type and field, and include more contextual information. This contextual information improves relevancy, and this is where scoring comes into play.
Elasticsearch uses Lucene under the hood, so the scoring is based on Lucene’s Practical Scoring Function, which brings together models such as TF-IDF, the Vector Space Model and the Boolean Model to score the document.
If you want to learn more about how that function is used in Elasticsearch, you can check out this section of the elastic.co guide.
One way to improve relevancy is to apply a boost to some fields. For example, in the RDocumentation.org full search, we naturally boost fields like package_name and title for packages and aliases and name for topics.
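As an illustration of field boosting, the sketch below builds an Elasticsearch multi_match query body in which each field carries a weight, using the `field^boost` syntax. The helper name `boosted_query`, the search text and the boost values are all assumptions made for this example; the weights actually used by RDocumentation.org are not stated here.

```python
def boosted_query(text, fields_with_boost):
    """Build a multi_match query body where each field has its own boost."""
    # "package_name^4" means a match in package_name counts four times as much.
    fields = [f"{name}^{boost}" for name, boost in fields_with_boost.items()]
    return {"query": {"multi_match": {"query": text, "fields": fields}}}


# Hypothetical boost values: the fields named in the text get the higher weights.
package_query = boosted_query(
    "linear model", {"package_name": 4, "title": 2, "description": 1}
)
topic_query = boosted_query(
    "linear model", {"aliases": 4, "name": 2, "description": 1}
)

print(package_query)
print(topic_query)
```

A match in a boosted field contributes more to the final score than the same match in an unboosted field, so a package whose name matches the query tends to rank above one that only mentions the term in its description.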
Boosting the popular documents
Another effective way to improve relevancy is to boost documents based on their popularity. The idea is that the more popular a package is, the more likely a user is to be searching for it. Showing the more popular packages first increases the probability that we show what the user is actually looking for.
Using downloads as a popularity measure
There are multiple ways to measure popularity. We could use direct measures like votes or rankings that users give (like ratings on Amazon products), or indirect measures like the number of items sold or the number of views (for YouTube videos).
At RDocumentation.org, we chose the latter. More specifically, we use the number of downloads as a measure of popularity. Indirect measures are typically easier to collect because they don’t require active user input.
Timeframing
One problem that arises when using the number of downloads is that old packages will naturally have more total downloads than newer packages. That does not mean that they are more popular, however; they have just been around longer. What if a package was very popular years ago, but has now become obsolete and is no longer being actively used by the community?
To solve this problem, we only take into account the number of downloads in the last month. That way, older packages’ popularity score is not artificially boosted, and obsolete packages will quickly fade out.
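The effect of this time window can be sketched in a few lines of Python. The function name, the dictionary of daily counts and the choice of 30 days for "the last month" are assumptions made for this example, and the numbers are invented.

```python
from datetime import date, timedelta


def downloads_in_last_month(daily_counts, today, window_days=30):
    """Sum the downloads on the window_days days before today (today excluded)."""
    start = today - timedelta(days=window_days)
    return sum(n for day, n in daily_counts.items() if start <= day < today)


today = date(2017, 3, 25)

# An old package: huge historical total, barely downloaded any more.
old_package = {date(2015, 1, 1): 50000, date(2017, 3, 1): 10, date(2017, 3, 20): 5}
# A recent package: small lifetime total, but all of it is recent.
new_package = {date(2017, 3, 10): 400, date(2017, 3, 24): 600}

old_recent = downloads_in_last_month(old_package, today)
new_recent = downloads_in_last_month(new_package, today)
print(old_recent, new_recent)
```

With the window applied, the old package's 50,000 historical downloads no longer count, so the recent package ends up with the higher popularity figure.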
Direct vs Indirect downloads
Another problem arises from reverse dependencies. R packages typically depend on a wide range of other packages. Packages with a lot of reverse dependencies will get downloaded far more often than others. However, these packages are typically more low-level and are not used directly by the end-user. We have to be careful not to give their number of downloads too much weight.
As an example, take Rcpp. Over 70 percent of all packages on CRAN, the Comprehensive R Archive Network, depend on this package, which obviously makes it the most downloaded R package. However, relatively few R users will use this package directly and search for its documentation.
To solve this problem, we needed to separate direct downloads (downloads that happen because a user requested the package) from indirect downloads (downloads that happen because a dependent package was downloaded). To distinguish direct from indirect downloads in the CRAN logs, we use the same heuristic described in the cran.stats package by Arun Srinivasan.
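To give a feel for what such a split involves, here is a deliberately simplified stand-in. It is not the cran.stats implementation, and its rule is an assumption made for this example: a download counts as indirect when the same client, on the same day, also downloaded another package that depends on it. The row layout (date, ip_id, package), the function name and the package name `somepkg` are likewise hypothetical.

```python
from collections import defaultdict


def split_direct_indirect(log_rows, dependencies):
    """Count direct and indirect downloads per package.

    log_rows: list of (day, ip_id, package) tuples.
    dependencies: dict mapping a package to the set of packages it depends on.
    """
    # Group the packages fetched by each client on each day.
    sessions = defaultdict(set)
    for day, ip_id, package in log_rows:
        sessions[(day, ip_id)].add(package)

    counts = defaultdict(lambda: {"direct": 0, "indirect": 0})
    for day, ip_id, package in log_rows:
        others = sessions[(day, ip_id)] - {package}
        # Indirect if some other package in the same session depends on this one.
        pulled_in = any(package in dependencies.get(o, set()) for o in others)
        counts[package]["indirect" if pulled_in else "direct"] += 1
    return dict(counts)


rows = [
    ("2017-03-01", 1, "somepkg"),  # client 1 asks for somepkg ...
    ("2017-03-01", 1, "Rcpp"),     # ... which pulls in Rcpp
    ("2017-03-01", 2, "Rcpp"),     # client 2 asks for Rcpp itself
]
deps = {"somepkg": {"Rcpp"}}  # hypothetical dependency table

counts = split_direct_indirect(rows, deps)
print(counts)
```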
We now have a meaningful popularity metric: the number of direct downloads in the last month. Elasticsearch provides an easy way to inject this additional information; for more details, check out this article on elastic.co.
The score is modified as follows:
new_score = old_score * log(1 + number of direct downloads in the last month)
We use a log() function to smooth out the number of downloads value, because each additional download should carry less weight; the difference between 0 and 1000 downloads should have a bigger impact on a popularity score than the difference between 100,000 and 101,000 downloads.
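The smoothing effect of the logarithm can be checked with a short calculation. The sketch below applies the formula above with a natural logarithm, then shows one way the same idea could be expressed as an Elasticsearch function_score query. The field name `last_month_downloads` and the inner match query are assumptions made for this example, not the query RDocumentation.org runs. Elasticsearch's log1p modifier uses a base-10 logarithm, which only rescales every score by the same constant and so leaves the ranking unchanged.

```python
import math


def rescore(old_score, direct_downloads_last_month):
    """new_score = old_score * log(1 + direct downloads in the last month)."""
    return old_score * math.log(1 + direct_downloads_last_month)


# The same 1000 extra downloads matter far more at the low end of the scale.
gain_small = rescore(1.0, 1000) - rescore(1.0, 0)
gain_large = rescore(1.0, 101000) - rescore(1.0, 100000)
print(gain_small, gain_large)

# A possible Elasticsearch equivalent (hypothetical field and inner query).
popularity_query = {
    "query": {
        "function_score": {
            "query": {"match": {"title": "regression"}},
            "field_value_factor": {
                "field": "last_month_downloads",  # hypothetical field name
                "modifier": "log1p",              # log10(1 + value)
                "missing": 0,                     # treat absent counts as zero
            },
            "boost_mode": "multiply",             # old_score * factor
        }
    }
}
```

One consequence of the formula is that a document with zero direct downloads in the last month gets a score of zero, since log(1 + 0) = 0.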
This re-scoring improves the overall relevancy of the search results presented by RDocumentation.org and as a result, users can focus on reading documentation instead of searching for it.
If you want to find out more about how exactly the Elasticsearch query is implemented, you can take a look at the RDocumentation project on GitHub. The query itself is located in the SearchController.
If you want to learn more about how RDocumentation.org is implemented, check out our repos:
- RDocumentation-app: The web application running rdocumentation.org.
- RDocumentation-elasticsearch: Configuration and feeders of the Elasticsearch server serving rdocumentation.org.
- RDocumentation: R package to integrate rdocumentation.org into your R workflow.
- RDocumentation-lambda-worker: AWS Lambda pipeline to parse package documentation for rdocumentation.org.
About RDocumentation
RDocumentation aggregates help documentation for R packages from CRAN, Bioconductor, and GitHub - the three most common sources of current R documentation. RDocumentation.org goes beyond simply aggregating this information, however, by bringing all of this documentation to your fingertips via the RDocumentation package. The RDocumentation package overrides the basic help functions from the utils package and gives you access to RDocumentation.org from the comfort of your RStudio IDE. Look up the newest and most popular R packages, search through documentation and post community examples.