As data science matures and evolves, so does the set of tools at the disposal of practitioners. While libraries such as scikit-learn, pandas, numpy, and matplotlib are foundational to the PyData Stack, learning and mastering new libraries and packages is essential to growing in a data career.
For this reason, this article will cover ten increasingly popular packages in the Python machine learning and data science ecosystems that have emerged over the past few years.
1. SHAP

As machine learning moves from experimentation to operationalization, model explainability is becoming a must. Depending on the use case, organizations now require explainability and transparency as part of the deployment process.
The rise of explainability in machine learning has accelerated over the past few years; one look at a decade of search trends for the term “explainable AI” makes this clear:
Google Trends screenshot by the author
This rising interest in Explainable AI (XAI) stems from the need to avoid harmful outcomes associated with machine learning models, especially in high-risk use cases in industries such as finance or healthcare. Machine learning models can produce outcomes riddled with biases that amplify existing stereotypes. This is on display in Google Translate, one of the most widely used language models in the world:
An example of how machine learning models can amplify harmful stereotypes
The sample on the left was in Uzbek, a gender-neutral language. However, when translating the query into English, Google Translate’s language model reinforced sexist stereotypes with its results. You can observe similar results for other gender-neutral languages, such as Turkish or Persian.
Such examples of bias can have extremely detrimental outcomes in machine learning use cases such as credit risk modeling or credit approval. To minimize these risks, data scientists use XAI techniques to understand the inner workings of machine learning systems.
One of the most popular tools for XAI is the SHAP library created by Scott M. Lundberg and Su-In Lee. SHapley Additive exPlanations (SHAP) uses a game-theoretic approach to explain what’s driving the output of a wide array of machine learning models.
A major part of its mass appeal is its elegant visualization of Shapley values, which can explain model outputs both globally and for individual predictions. You can dive deeper into SHAP and its examples in the documentation.
Images from SHAP docs — MIT License
🌟 GitHub Stars: 16.2K
📦 Issues: 1.3K
🍴 Forks: 2.5K
2. UMAP

As datasets keep growing in size, so does the need for better, more efficient dimensionality reduction algorithms.
While PCA is fast and efficient, it can oversimplify results: as a linear method, it reduces the number of dimensions without necessarily preserving the underlying structure of the data. t-SNE tries to remedy that by placing more importance on the data’s structure, but that extra work makes it sluggish on larger datasets.
Fortunately, in 2018, Leland McInnes and his colleagues introduced UMAP (Uniform Manifold Approximation and Projection) as a middle ground between the two methods. The UMAP Python package reduces the dimensions of tabular datasets more intelligently, emphasizing the global topological structure of the data.
The package is trendy on Kaggle, and its docs outline other interesting applications beyond dimensionality reduction, like faster outlier detection for larger datasets. Its results are practical and beautiful when visualized:
Images from UMAP docs — BSD-3-Clause License
🌟 GitHub Stars: 5.6K
📦 Issues: 313
🍴 Forks: 633
3 & 4. LightGBM and CatBoost
When the XGBoost library became stable in 2015, it quickly dominated tabular competitions on Kaggle. It was fast and outperformed other gradient-boosting implementations. However, it was not perfect. Two billion-dollar companies, Microsoft and Yandex, took inspiration from Tianqi Chen’s work on gradient-boosted machines and open-sourced the LightGBM and CatBoost libraries, respectively.
Their aim was straightforward: to improve on the weaknesses of XGBoost. LightGBM vastly reduced the memory footprint of the boosted trees formed by XGBoost, while CatBoost became even faster than XGBoost and achieved impressive results with default parameters.
In Kaggle’s State of Data Science and Machine Learning Survey of 2021, the two libraries ranked in the top seven most popular machine learning frameworks.
Screenshot by the author from Kaggle State of ML & DS Survey
🌟 GitHub Stars (LGBM, CB): 13.7K, 6.5K
📦 Issues: 174, 363
🍴 Forks: 3.5K, 1K
5. BentoML

Deploying models into production has never been more important, and this section covers how BentoML simplifies the process of deploying models as API endpoints. Historically, data scientists used web frameworks like Flask, Django, or FastAPI for this, but those tools carry a steeper learning curve.
BentoML simplifies creating an API service, requiring only a few lines of code. It works with virtually any machine learning framework and can serve models as API endpoints within minutes. Even though BentoML was released only last year and is still in beta, it has already amassed a significant community. You can check out a variety of examples of BentoML in action here.
🌟 GitHub Stars: 3.5K
📦 Issues: 395
🍴 Forks: 53
6 & 7. Streamlit and Gradio
A machine learning solution should be accessible to everyone. While an API deployment serves your teammates and fellow programmers well, a model should also have a user-friendly interface for the non-technical community.
Two of the fastest-growing packages for building such interfaces are Streamlit and Gradio. They both offer low-code Pythonic APIs to build web apps to showcase your models. Using simple functions, you can create HTML components to take different types of user input, such as images, text, video, speech, sketches, etc., and return a prediction.
Streamlit is especially useful, as you can use it to tell beautiful data stories with its rich media tools. You can check out a wide variety of examples from Streamlit in their gallery.
An example of a streamlit web app for machine learning
Combining an API service like BentoML with a UI tool like Streamlit or Gradio is one of the lightest ways to deploy machine learning models in 2022.
🌟 GitHub Stars (Streamlit, Gradio): 18.9K, 6.6K
📦 Issues: 264, 119
🍴 Forks: 1.7K, 422
8. PyCaret

PyCaret is a low-code machine learning library that has been capturing a lot of attention recently. With PyCaret, you can automate almost any stage of a machine learning pipeline with only a few lines of code. It combines some of the best features and algorithms from other popular packages such as scikit-learn, XGBoost, and transformers. Its main attraction is the ability to go from data preparation to model deployment within minutes in a notebook environment.
PyCaret has separate sub-modules for classification, regression, NLP, clustering, anomaly detection, and a dedicated module for time series analysis as of its latest release. PyCaret is the go-to library if you want to automate and accelerate parts of your machine learning workflow.
🌟 GitHub Stars: 6.5K
📦 Issues: 248
🍴 Forks: 1.3K
9. Optuna

In the penultimate slot, we have Optuna, a hyperparameter optimization library that’s gaining steam on Kaggle.
An example of Optuna visualizations
Optuna is a Bayesian hyperparameter tuning library that works with virtually any ML framework. It has numerous advantages over its rivals, such as:
- Platform-agnostic design
- Pythonic search space — you can define hyperparameters with conditionals and loops
- An extensive suite of state-of-the-art tuning algorithms, available to change with a single keyword
- Easy and efficient parallelization, which lets you scale across available resources through an argument
- Visualizations of tuning experiments, which let you compare the importance of hyperparameters
Optuna’s API is built around two objects, studies and trials. Combined, they give you the ability to control how long a tuning session runs and to pause and resume it.
🌟 GitHub Stars: 6.3K
📦 Issues: 108
🍴 Forks: 701
10. Data Version Control — DVC
Screenshot of DVC homepage
As data landscapes become more and more complex, a clear understanding of how datasets change is increasingly essential. That’s exactly what DVC aims to provide: versioning and managing your massive data files and models as efficiently as Git manages your codebase.
While Git is very useful for tracking changes to codebases, it falters when versioning large files, which has hindered the progress of open-source data science. Data scientists needed a system to keep track of changes made to both code and data simultaneously and work on experiments in isolated branches without duplicating data sources.
Data Version Control (DVC) by Iterative.ai made this all possible. With a simple remote or local repo to store the data, DVC can capture changes to data and models just like code and track metrics and model artifacts to monitor experiments.
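A sketch of the core workflow inside an existing Git repository (the file path and S3 bucket here are hypothetical):

```shell
# Initialize DVC alongside Git
dvc init

# Track a large data file; DVC moves it into its cache and writes a
# small .dvc pointer file that Git can version
dvc add data/train.csv
git add data/train.csv.dvc .gitignore
git commit -m "Track training data with DVC"

# Configure remote storage and push the actual file contents there
dvc remote add -d storage s3://my-bucket/dvc-store
dvc push
```

Teammates then run `git pull` followed by `dvc pull` to get both the code and the matching version of the data.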
When combined with DagsHub (i.e., GitHub for data scientists), it becomes a game-changing tool since DagsHub offers free storage for DVC caches and can be configured with a single CLI command.
🌟 GitHub Stars: 9.7K
📦 Issues: 619
🍴 Forks: 924
Learn more about the latest tools
The data science & machine learning landscapes are vibrant and ever-growing. While the tools listed above are gaining steam, we can definitely expect even more tools and consolidation in the modern data stack. To learn more about new tools and advancements in data science, check out the following resources: