FLAML AutoML in PySpark
Using FLAML and MLflow to Automate and Track Machine Learning Tasks with Spark DataFrames
Machine learning is a powerful and versatile tool for data analysis, but it can also be time-consuming and complex. That’s why I decided to try out FLAML, a fast and lightweight AutoML library. Here are the main points I worked on in this notebook:
- In this notebook, I'm experimenting with FLAML AutoML to automate machine learning tasks, using Spark DataFrames for streamlined data processing.
- Additionally, we're implementing tracking through MLflow, so we can keep a close eye on the results of each trial in the AutoML process (a minimal sketch of this pattern follows this list).
- As an extra step, we'll store the artifacts of the best model generated by AutoML, so the top-performing model is easy to retrieve for future use or reference.
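Here is a minimal sketch of that tracking pattern, assuming a prepared pandas-on-Spark dataframe psdf and a placeholder label column named target; the experiment name and search settings are illustrative, not the exact values used later:
import mlflow
from flaml import AutoML

mlflow.set_experiment("flaml-automl-pyspark")  # illustrative experiment name

automl = AutoML()
settings = {
    "time_budget": 120,                # seconds allotted to the search
    "metric": "r2",
    "task": "regression",
    "estimator_list": ["lgbm_spark"],  # the SynapseML LightGBM estimator
}

with mlflow.start_run():
    # psdf: a pandas-on-Spark dataframe prepared as described below
    automl.fit(dataframe=psdf, label="target", **settings)
    mlflow.log_params(automl.best_config)            # best hyperparameters found
    mlflow.log_metric("best_loss", automl.best_loss)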
Some important points about this experiment are:
- The experiment was conducted with Spark in local mode on a laptop. You can follow this list if you want to set up Spark via Windows WSL. However, I don’t think this will work in the DataCamp workspace.
- The estimator (machine learning algorithm) that FLAML used in this case is not one of the estimators available in Spark MLlib, but rather LightGBM (LGBM) from SynapseML. Therefore, you need to install SynapseML separately when you start a Spark session.
- The input data for AutoML is a pandas-on-Spark dataframe, so we need to use the to_pandas_on_spark function to convert the Spark DataFrame (see the conversion sketch after this list).
- Based on the estimator and the input data, I think the transformer pipeline, such as preprocessing, should be applied before converting the Spark DataFrame; a fuller sketch appears after the data-loading step below.
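For the conversion itself, FLAML ships a helper function; a minimal sketch, using the train_sdf dataframe that is loaded later in this notebook:
from flaml.automl.spark.utils import to_pandas_on_spark

# Convert the (already preprocessed) Spark DataFrame into the
# pandas-on-Spark format that FLAML's Spark estimators expect
psdf = to_pandas_on_spark(train_sdf)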
This is a snapshot of the experiment outcomes with FLAML AutoML that you can view through the MLflow UI:
Initiate Spark session and install SynapseML
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, OneHotEncoder, StringIndexer
from pyspark.ml import Pipeline
spark = (SparkSession.builder.appName("FLAML AutoML in PySpark")
         # Pull in the SynapseML package: use 1.0.1 for Spark 3.2 and 1.0.1-spark3.3 for Spark 3.3
         .config("spark.jars.packages", "com.microsoft.azure:synapseml_2.12:1.0.1")
         .config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven")
         .getOrCreate())
# Verify that the SynapseML package resolved and is importable
import synapse.ml
spark
Load dataset
# Read the training and test sets, letting Spark infer column types
train_sdf = spark.read.options(inferSchema=True).csv("train.csv/", header=True)
test_sdf = spark.read.options(inferSchema=True).csv("test.csv/", header=True)
# How many partitions is the training data split across?
train_sdf.rdd.getNumPartitions()
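With the data loaded, the MLlib transformers imported earlier can be chained into a preprocessing pipeline before the pandas-on-Spark conversion. A minimal sketch, assuming a hypothetical categorical column named category, a hypothetical label column named target, and numeric values in the remaining columns:
# Hypothetical columns: "category" (categorical feature), "target" (label)
indexer = StringIndexer(inputCol="category", outputCol="category_idx")
encoder = OneHotEncoder(inputCols=["category_idx"], outputCols=["category_vec"])

# Assemble every remaining (numeric) column plus the encoded vector into one features column
feature_cols = [c for c in train_sdf.columns if c not in ("category", "target")]
assembler = VectorAssembler(inputCols=feature_cols + ["category_vec"],
                            outputCol="features")

pipeline = Pipeline(stages=[indexer, encoder, assembler])
prepared_sdf = pipeline.fit(train_sdf).transform(train_sdf).select("features", "target")

# Only after preprocessing do we convert the data for FLAML
psdf = to_pandas_on_spark(prepared_sdf)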