Data science is one of the fastest-moving industries out there. New research papers, packages, tutorials, and technologies are launched every day, making it hard to keep tabs on everything. For any data practitioner, staying on top of the latest in data science is essential to keep learning and growing. In this article, you’ll find a collection of the latest news, tutorials, research, and insights you might have missed over the past month.
Newly Released Tutorials
Two ways to create custom transformers with scikit-learn
This Towards Data Science blog post is a straightforward tutorial on creating custom transformers with scikit-learn. Scikit-learn is one of the most widely used packages in data science, and arguably the most popular machine learning package out there. It provides a host of pre-processing functionality, such as one-hot encoding, min-max scaling, and more. Sometimes, however, you need custom pre-processing logic that pre-built transformers such as One Hot Encoder don’t cover. This tutorial teaches you how to create custom transformers in scikit-learn to streamline data pre-processing for machine learning.
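To give a flavor of what the tutorial covers, here is a minimal sketch of the two common approaches: wrapping a plain function with `FunctionTransformer`, and subclassing `BaseEstimator`/`TransformerMixin` for transforms that learn state during `fit`. The `OutlierClipper` class and its percentile bounds are illustrative examples, not from the tutorial itself:

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

# Approach 1: wrap a stateless function with FunctionTransformer
log_transformer = FunctionTransformer(np.log1p)

# Approach 2: subclass BaseEstimator/TransformerMixin for stateful transforms
class OutlierClipper(BaseEstimator, TransformerMixin):
    """Clips each column to the [1st, 99th] percentile range learned in fit."""

    def fit(self, X, y=None):
        self.low_ = np.percentile(X, 1, axis=0)
        self.high_ = np.percentile(X, 99, axis=0)
        return self

    def transform(self, X):
        return np.clip(X, self.low_, self.high_)

X = np.array([[0.0], [1.0], [2.0], [100.0]])
clipped = OutlierClipper().fit_transform(X)
```

Because both follow the `fit`/`transform` contract, either can be dropped straight into a scikit-learn `Pipeline` alongside built-in steps.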
MLOps in BigQuery ML using Vertex AI
Over the past year, big tech companies have made great strides in the field of MLOps. Bringing machine learning models into production environments is one of the biggest challenges faced by data teams today. This video provides a quick tutorial on how you can use Google Cloud’s Vertex AI and BigQuery ML services together to build a production-ready Vertex AI pipeline.
DataCamp’s New Power BI Tutorials
With the launch of its new Data Analyst in Power BI Track in partnership with Microsoft, DataCamp released a few tutorials on Power BI over the past month. Power BI is one of the most widely used business intelligence tools out there, and if your company uses Microsoft Office, chances are you already have access to it. It’s also a great stepping stone for anyone looking to go beyond Excel and dive deeper into the world of data. These tutorials cover the gamut of working with Power BI, from Transitioning from Excel to Power BI, to a deeper tutorial on Power BI for Beginners, Data Modelling in Power BI, and a tutorial on Power BI’s DAX Formulas for Beginners. You can also download this handy cheat sheet as you go along your learning journey.
Trailblazing Machine Learning Research
DALL-E 2 by OpenAI
Over the past month, OpenAI wowed the world with its DALL-E 2 image generation model. OpenAI trained this new machine learning model to create photorealistic images and art from simple natural-language text input. DALL-E 2, the successor to the original DALL-E, achieves its level of photorealism through a process called “diffusion,” which starts from a pattern of random dots and gradually alters it toward an image that matches the text. You can read about it here, and watch Isabella Leslie Miller, DataCamp’s Data Journalist, break it down in our Weekly Roundup News Video.
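To get an intuition for the “start from random dots, gradually refine” idea, here is a heavily simplified numpy toy, not DALL-E 2’s actual model: a learned reverse-diffusion step is stood in for by a fixed nudge toward a target pattern, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": the pattern a real model would recover from the text prompt
target = np.linspace(0.0, 1.0, 64)

# Start from pure random noise
image = rng.normal(size=64)
initial_error = np.abs(image - target).mean()

# Iteratively "denoise": each step nudges the noise toward the target,
# standing in for the learned reverse-diffusion step of a real model
for step in range(50):
    image = image + 0.1 * (target - image)

error = np.abs(image - target).mean()
```

The real model learns what each denoising step should look like from data; the toy hardcodes it, but the shape of the process, many small refinements of an initially random canvas, is the same.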
Pathways Language Model (PaLM)
PaLM demonstrates the first large-scale use of the Pathways system. The Pathways system architecture was introduced by Google last year to efficiently train a new generation of models that can handle a variety of tasks across different domains. With PaLM, Google created a humongous 540-billion parameter Transformer model that has achieved state-of-the-art performance on a variety of language and reasoning tasks. These tasks include code generation, joke explanation, cause-and-effect identification, and more. Check out the article for a deeper dive on Pathways and the type of use cases PaLM unlocks.
New Packages and Models
PyTorch 1.11 Release
With the release of PyTorch 1.11 comes a variety of improvements. The latest version of PyTorch now offers beta releases of TorchData, the successor to the DataLoader API, and functorch, which offers composable function transforms for PyTorch modules. Beyond these new libraries, PyTorch 1.11 boasts a variety of performance improvements, such as a 40% faster startup time for mobile and edge deployments, and more. To learn more about PyTorch, check out this DataCamp Course for Beginners.
EpyNN
EpyNN stands for “Educational python for Neural Networks.” It is intended for teachers, students, scientists, and anyone with relatively minimal Python skills who wants to understand and build neural networks from scratch. It provides a host of architecture templates and practical examples that shorten the time it takes to learn neural network development.
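The kind of from-scratch network a library like EpyNN teaches can be sketched in a few dozen lines of numpy. This example is not EpyNN code; it is a generic two-layer network trained on the classic XOR problem with plain gradient descent, where the architecture, seed, and learning rate are all arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Tiny network: 2 inputs -> 4 hidden units -> 1 output, trained on XOR
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4))
b1 = np.zeros((1, 4))
W2 = rng.normal(size=(4, 1))
b2 = np.zeros((1, 1))

sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Loss before any training, for comparison
out = sigmoid(sigmoid(X @ W1 + b1) @ W2 + b2)
initial_loss = ((out - y) ** 2).mean()

lr = 0.5
for epoch in range(10000):
    # Forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)

    # Backward pass (squared-error loss, sigmoid derivatives)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)

    # Gradient descent update
    W2 -= lr * h.T @ d_out
    b2 -= lr * d_out.sum(axis=0, keepdims=True)
    W1 -= lr * X.T @ d_h
    b1 -= lr * d_h.sum(axis=0, keepdims=True)

final_loss = ((out - y) ** 2).mean()
```

Writing the forward and backward passes by hand like this is exactly the exercise that makes framework abstractions such as autograd layers click later on.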
BigScience Large Language Model Training (tr11-176B-ml-logs)
BigScience is a popular space on Hugging Face on a mission to create large-scale language models in the open with thousands of researchers around the world. Training of BigScience’s main model started on March 11, 2022, and will run for three to four months on 384 A100 80GB GPUs of the Jean Zay public supercomputer. This open-source model will have 176B parameters and a GPT-like decoder-only architecture. You can follow the training updates on Twitter, and you will be able to use the model sometime around June.
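The defining ingredient of a GPT-like decoder-only architecture is causal self-attention: each token can only attend to itself and earlier tokens. Here is a minimal single-head numpy sketch of that mechanism, not BigScience’s actual implementation, with toy dimensions and random weights chosen purely for illustration:

```python
import numpy as np

def causal_self_attention(x, Wq, Wk, Wv):
    """Single-head causal self-attention: each position attends only to
    itself and earlier positions, as in a GPT-style decoder block."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Causal mask: forbid attention to future positions
    mask = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(mask, -1e9, scores)
    # Row-wise softmax over the allowed positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

rng = np.random.default_rng(0)
seq_len, d = 5, 8
x = rng.normal(size=(seq_len, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = causal_self_attention(x, Wq, Wk, Wv)

# Perturbing only the last token leaves all earlier outputs unchanged,
# which is what makes left-to-right text generation possible
x2 = x.copy()
x2[-1] += 1.0
out2 = causal_self_attention(x2, Wq, Wk, Wv)
```

A 176B-parameter model stacks many such blocks (multi-headed, with feed-forward layers and normalization in between), but the masking idea is the same at any scale.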
Data Science & Machine Learning Use Cases
La Liga Partners with Databricks for Football Analytics
La Liga’s analytics team has partnered with Databricks to deploy a data lakehouse within the league’s analytics infrastructure. This will allow La Liga Tech, the new, consolidated analytics wing of the league, to structure and manage its data much more efficiently, and to leverage machine learning for interesting use cases such as injury prediction, content recommendations for fans, and more.
Using Deep Learning to Annotate the Protein Universe
Scientists have long used computational tools to infer and annotate protein function directly from sequence. Google’s AI team has now used deep learning to predict the function of proteins reliably; their model, ProtENN, has enabled them to add about 6.8 million entries to the Pfam database. The team has released the ProtENN model along with a Distill-like interactive article for experimentation. Solving these kinds of problems with machine learning will enable faster, more reliable discovery of novel drugs and therapeutics.
Insights & Opinions
Empowering the Modern Data Analyst
In the latest DataFramed Podcast, Peter Fishman, CEO of Mozart Data, breaks down the state of the modern data stack, and how the latest tools in data science are designed to empower the modern data analyst. Moreover, he goes through his experience leading data teams, what makes an excellent data analyst, the importance of subject matter expertise and listening to users, and more. Listen and subscribe to DataFramed wherever you get your podcasts.
MLOps is a Mess But That’s to be Expected
In this article, Mihail Eric, founder of Confetti AI, shares his insights on the state of MLOps tooling today. Just as the headline suggests, Mihail brilliantly breaks down the disjointed state of the machine learning tooling landscape. Because machine learning operations is still in its infancy, the standards, tools, and best practices surrounding it are still taking shape, and there is no canonical stack practitioners can rely on. Moreover, solving talent shortages and embedding machine learning thinking within organizational cultures will shape the adoption of MLOps in the years to come.
We hope you enjoyed this round-up of top stories, insights, tutorials, and research. For more on the latest in data science, check out the following resources:
Navigating the World of MLOps Certifications
A Data Science Roadmap for 2024