10 Python Packages to Add to Your Data Science Stack in 2022

Bekhruz Tuychiev,
May 25, 2022 5 min read
Looking to expand your data science stack in 2022? This guide highlights 10 of the fastest-growing Python packages that solve various problems you might encounter.

As data science matures and evolves, so does the set of tools at the disposal of practitioners. While libraries such as scikit-learn, pandas, numpy, and matplotlib are foundational to the PyData Stack, learning and mastering new libraries and packages is essential to growing in a data career. 

For this reason, this article will cover ten increasingly popular packages in the Python machine learning and data science ecosystems that have emerged over the past few years. 

1. SHAP

As machine learning moves from experimentation to operationalization, the explainability of models is a must. Depending on the use case, organizations are making model explainability and transparency a requirement as part of the deployment process.

Interest in explainability in machine learning has accelerated over the past few years. Search trends for the term “explainable AI” over the past ten years show this:

Google Trends screenshot by the author — link to the result

This rising interest in Explainable AI (XAI) comes from the need to avoid harmful outcomes associated with machine learning models, especially in high-risk use-cases in industries such as finance or healthcare. Machine learning models can produce outputs riddled with biases that amplify existing stereotypes. This is on display in Google Translate, which relies on one of the most commonly used language models in the world:

An example of how machine learning models can amplify harmful stereotypes

The sample on the left was in Uzbek, a gender-neutral language. However, when translating the query into English, Google Translate’s language model reinforced sexist stereotypes with its results. You can observe similar results for other gender-neutral languages, such as Turkish or Persian. 

Such examples of bias can have extremely detrimental outcomes in machine learning use-cases such as credit risk modeling or credit approval. To minimize these risks, data scientists are using explainable AI (XAI) techniques to understand the inner workings of machine learning systems. 

One of the most popular tools for XAI is the SHAP library created by Scott M. Lundberg and Su-In Lee. SHapley Additive exPlanations (SHAP) uses a game theory approach to explain what is driving the output of a wide range of machine learning models.

A major part of its appeal is its elegant visualization of Shapley values, which can explain model outputs both globally (across the whole dataset) and individually (for a single prediction). You can dig deeper into SHAP and its examples by looking through the documentation.
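To make the game theory idea concrete, here is a minimal sketch that computes exact Shapley values by brute force for a tiny toy model. It does not use the SHAP library's API; the function `shapley_values`, the `toy_model` scorer, and the feature names are all hypothetical and exist only for illustration. The real library relies on far more efficient, model-specific algorithms and approximations.

```python
from itertools import combinations
from math import factorial


def shapley_values(value_fn, features):
    """Exact Shapley values by enumerating every coalition of features."""
    n = len(features)
    result = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for size in range(n):
            # Weight of a coalition of this size in the Shapley formula.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for coalition in combinations(others, size):
                with_f = value_fn(set(coalition) | {f})
                without_f = value_fn(set(coalition))
                # Marginal contribution of f when it joins this coalition.
                total += weight * (with_f - without_f)
        result[f] = total
    return result


def toy_model(present):
    """Hypothetical scorer: output depends on which features are 'known'."""
    score = 0.0
    if "income" in present:
        score += 30.0
    if "age" in present:
        score += 10.0
    if "income" in present and "debt" in present:
        score -= 12.0  # interaction: debt only matters once income is known
    return score


phi = shapley_values(toy_model, ["income", "age", "debt"])
print(phi)
```

The attributions add up to the difference between the model output with all features and with none, which is the additivity property that gives SHAP its name. The interaction penalty is split evenly between the two features that cause it.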

Images from SHAP docs — MIT License

🌟 GitHub Stars: 16.2K

📦 Issues: 1.3K

🍴 Forks: 2.5K

🔗 Useful links: docs, comprehensive tutorial

2. UMAP

As datasets keep growing in size, so does the need for better, more efficient dimensionality reduction algorithms. 

While PCA is fast and efficient, it is a linear method and can return oversimplified results: it reduces the number of dimensions without necessarily preserving nonlinear structure in the data. t-SNE tries to remedy that by placing more importance on the structure of the data, but that process makes it sluggish for larger datasets.

Fortunately, in 2018, Leland McInnes and his colleagues introduced the UMAP (Uniform Manifold Approximation and Projection) algorithm as a middle ground between the two methods. The UMAP Python package reduces the dimensions of tabular datasets with an emphasis on preserving the global topological structure of the data.

The package is trendy on Kaggle, and its docs outline other interesting applications beyond dimensionality reduction, like faster outlier detection for larger datasets. Its results are practical and beautiful when visualized:

Images from UMAP docs — BSD-3-Clause License

🌟 GitHub Stars: 5.6K

📦 Issues: 313

🍴 Forks: 633

🔗 Useful links: docs, comprehensive tutorial

3 & 4. LightGBM and CatBoost

When the XGBoost library became stable in 2015, it quickly dominated tabular competitions on Kaggle. It was fast and outperformed other gradient-boosting implementations. However, it was not perfect. Two billion-dollar companies, Microsoft and Yandex, were inspired by Tianqi Chen's work on gradient boosted machines and open-sourced the LightGBM and CatBoost libraries, respectively.

Their aim was straightforward: to improve on the weaknesses of XGBoost. While LightGBM vastly reduced the memory footprint of the boosted trees formed by XGBoost, CatBoost became even faster than XGBoost and achieved impressive results with default parameters.

In Kaggle’s State of Data Science and Machine Learning Survey of 2021, the two libraries ranked in the top seven most popular machine learning frameworks. 

Screenshot by the author from Kaggle State of ML & DS Survey

🌟 GitHub Stars (LGBM, CB): 13.7K, 6.5K

📦 Issues: 174, 363

🍴 Forks: 3.5K, 1K

🔗 Useful links: LGBM docs, CB docs, tutorials — LGBM, CB

5. BentoML

Deploying models into production has never been more important. In this section, we’ll talk about how BentoML simplifies the process of deploying models as API endpoints. Historically, data scientists used web frameworks like Flask, Django, or FastAPI to deploy models as API endpoints, but these tools often come with a relatively steep learning curve.

BentoML simplifies creating an API service, requiring only a few lines of code. It works with virtually any machine learning framework and can deploy models as API endpoints in a few minutes. Even though BentoML was released last year and is still in beta, it has amassed a significant community. The project provides a variety of examples of BentoML in action.

🌟 GitHub Stars: 3.5K

📦 Issues: 395

🍴 Forks: 53

🔗 Useful links: docs, comprehensive tutorial

6 & 7. Streamlit and Gradio

A machine learning solution should be accessible to everyone, and while API deployment benefits your coworkers, teammates, and your programmer friends, a model should have a user-friendly interface for the non-technical community as well.

Two of the fastest-growing packages for building such interfaces are Streamlit and Gradio. They both offer low-code Pythonic APIs to build web apps to showcase your models. Using simple functions, you can create HTML components that take different types of user input, such as images, text, video, speech, or sketches, and return a prediction.

Streamlit is especially useful, as you can use it to tell beautiful data stories with its rich media tools. You can check out a wide variety of examples from Streamlit in their gallery. 

An example of a streamlit web app for machine learning

Combining an API service like BentoML with UI tools like Streamlit or Gradio is one of the most lightweight ways to deploy machine learning models in 2022.

🌟 GitHub Stars (Streamlit, Gradio): 18.9K, 6.6K

📦 Issues: 264, 119

🍴 Forks: 1.7K, 422

🔗 Useful links: Streamlit docs, Gradio docs, tutorials — Streamlit, Gradio

8. PyCaret

PyCaret is a low-code machine learning library that has been capturing a lot of attention recently. Using PyCaret, you can automate almost any stage of a machine learning pipeline with only a few lines of code. It combines some of the best features and algorithms from other popular packages such as scikit-learn, XGBoost, and transformers. Its main attraction is its ability to go from data preparation to model deployment within a few minutes in a notebook environment.

PyCaret has separate sub-modules for classification, regression, NLP, clustering, anomaly detection, and a dedicated module for time series analysis as of its latest release. PyCaret is the go-to library if you want to automate and accelerate parts of your machine learning workflow. 

🌟 GitHub Stars: 6.5K

📦 Issues: 248

🍴 Forks: 1.3K

🔗 Useful links: docs, tutorials

9. Optuna

In the penultimate slot, we have Optuna, a hyperparameter tuning library that’s gaining steam on Kaggle.

An example of Optuna visualizations

Optuna is a Bayesian hyperparameter tuning library that works with virtually any ML framework. It has numerous advantages over its rivals, such as:

  • Platform-agnostic design
  • Pythonic search space — you can define hyperparameters with conditionals and loops
  • An extensive suite of state-of-the-art tuning algorithms, which you can switch between with a single keyword
  • Easy and efficient parallelization, which lets you scale across available resources through an argument
  • Visualizations of tuning experiments, which let you compare the importance of hyperparameters
  • An API built on objects called studies and trials, which together let you control how long a tuning session runs, pause and resume it, and more

🌟 GitHub Stars: 6.3K

📦 Issues: 108

🍴 Forks: 701

🔗 Useful links: docs, comprehensive tutorial

10. Data Version Control — DVC

Screenshot of DVC homepage

As data landscapes become more complex, a clear understanding of changes to datasets is becoming essential. That’s what DVC aims to provide: it versions and manages your massive data files and models as efficiently as Git manages your codebase.

While Git is very useful for tracking changes to codebases, it falters when versioning large files, which has hindered the progress of open-source data science. Data scientists needed a system to keep track of changes made to both code and data simultaneously and work on experiments in isolated branches without duplicating data sources.

Data Version Control (DVC) by Iterative.ai made this all possible. With a simple remote or local repo to store the data, DVC can capture changes to data and models just like code and track metrics and model artifacts to monitor experiments.

When combined with DagsHub (i.e., GitHub for data scientists), it becomes a game-changing tool since DagsHub offers free storage for DVC caches and can be configured with a single CLI command.

🌟 GitHub Stars: 9.7K

📦 Issues: 619

🍴 Forks: 924

🔗 Useful links: docs, comprehensive tutorial, sample project made with DVC and DagsHub

Learn more about the latest tools

The data science & machine learning landscapes are vibrant and ever-growing. While the tools listed above are gaining steam, we can definitely expect even more tools and consolidation in the modern data stack. To learn more about new tools and advancements in data science, check out the following resources:

  1. Subscribe to the DataFramed Podcast
  2. Subscribe to our YouTube page and keep track of the Weekly Roundup, a bite-sized weekly video news bulletin
  3. Check out cheat sheets