AutoML: cutting through the hype

Are the machines getting too smart?

Winning at chess, Go, Starcraft, and now, maybe, winning at machine learning. I set out to find out what AutoML means and what it can actually deliver.

AutoML is reportedly at the peak of its hype cycle, so what is it, and does it work? Is it amazing? Or just the next way to scam unsuspecting business people?

Trends in web searches for “automl”.

If you don’t want to read the whole article, here is my conclusion:

TL;DR

AutoML is robust enough for well-behaved tabular datasets. AutoML is likely to gain traction in large organizations with well-defined ML problems and IT infrastructure that can support ML deployments. AutoML is not flexible enough for all ML problems. Even with AutoML, there are challenges in problem identification, data identification, data cleaning and consistency, data/model management and deployment.

Me

So what does AutoML actually mean? I identified three specific functions that AutoML can serve, as well as which products fall into each category.

  1. Point-and-click ML: Train a new model from scratch on user-input data. The software can decide what model(s) to use, automatically do cross-validation, and generate deployable models.
  2. Machine learning as a service (MLaaS): software/infra platforms that manage data storage/transport, model training and model deployment.
  3. Transfer learning: fine tune existing models with domain-specific data.

Point-and-click AutoML

There are a number of point-and-click AutoML products out there, which are positioned to become the “Microsoft Excel of ML”.

In general, these products automatically select a model, or an ensemble of models, that work the best on a provided data set. They usually include ML best practices like cross-validation, feature importance, etc.

H2O Driverless AI

H2O.ai has been a relatively early player in the AutoML world and has raised a lot of money (over $150 million to date). H2O.ai has distinguished itself by releasing its H2O ML platform as open source.

H2O Driverless AI is a proprietary point-and-click interface for H2O. You load in data via tabular files (.csv, etc.) and then train models automatically with built-in cross-validation and model selection.
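
For comparison, H2O’s open-source Python API exposes a similar workflow. The sketch below uses the open-source H2OAutoML class rather than Driverless AI itself; the file name and target column are placeholders for your own data.

```python
# Minimal sketch using the open-source H2O AutoML API (not Driverless AI itself);
# "train.csv" and the "target" column are placeholders for your own data.
import h2o
from h2o.automl import H2OAutoML

h2o.init()

# Load a tabular file; H2O infers column types automatically
train = h2o.import_file("train.csv")

# Train up to 20 models with built-in cross-validation and model selection
aml = H2OAutoML(max_models=20, seed=1)
aml.train(y="target", training_frame=train)

# The leaderboard ranks the trained models/ensembles by cross-validated metric
print(aml.leaderboard.head())
```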

A selling point of H2O’s products is that they are like having a “Kaggle Grandmaster in a box” and, allegedly, perform in the top 1% of Kagglers.

H2O also advertises the ability to export Java models for production deployment.

RapidMiner Auto Model

RapidMiner has been around for a while. Its core product is a point-and-click data analysis tool which includes an AutoML feature.

They also offer a managed “AI Cloud” and “Automated Model Operations”, which provide some nice model deployment and monitoring tools.

Models can be trained locally and then “deployed” with a click (or so they claim; in reality, configuring a live application for such a deployment is probably trickier).

DataRobot Automated Machine Learning

DataRobot is another point-and-click AutoML solution. It looks very similar to H2O and has the features you would expect for an AutoML platform.

DataRobot also sells a number of managed cloud solutions, which H2O does not prominently advertise.

MLaaS

Some of the point-and-click options (especially RapidMiner and DataRobot) have some managed cloud and deployment options. However, in the MLaaS category, I’ll focus on AWS.

Amazon AWS SageMaker

Amazon AWS SageMaker is a suite of ML products and the most extensive MLaaS platform I surveyed.

SageMaker allows models to be trained on data stored in the AWS cloud and deployed on AWS, or exported and deployed on other hardware. There are numerous tie-ins to other AWS products. It includes an automatic model trainer with the usual features.

The fact that AWS is the most popular cloud computing vendor makes SageMaker automatically a compelling option, as many companies already have portions of their infrastructure operating on AWS.

SageMaker can load data from S3, use elastic compute, deploy on AWS compute, and even integrate with Mechanical Turk to get human-labeled data.
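
To get a feel for that workflow, here is a rough sketch with the SageMaker Python SDK; the S3 paths, IAM role, container image, and instance types are placeholders rather than a working configuration.

```python
# Rough sketch of the SageMaker Python SDK flow described above; the S3 paths,
# IAM role, container image, and instance types below are placeholders.
import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/MySageMakerRole"  # placeholder IAM role

# Train on data already sitting in S3, using a prebuilt algorithm container
estimator = Estimator(
    image_uri="<algorithm-container-uri>",   # e.g. a built-in algorithm image
    role=role,
    instance_count=1,
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/model-artifacts/",
    sagemaker_session=session,
)
estimator.fit({"train": "s3://my-bucket/train/"})

# One call deploys the trained model behind a managed HTTPS endpoint
predictor = estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")

# The endpoint can then be hit like any other API
# (exact payload format depends on the container)
result = predictor.predict(b"1.0,2.0,3.0")
```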

Transfer learning

Google AutoML

Google Cloud Platform (GCP) has a number of AutoML products.

One of the distinguishing features of the Google products is the breadth of the transfer learning products. There are offerings in computer vision, machine translation and NLP.

For users with generic needs, there are pre-existing models for vision and translation. However, the AutoML products cater to more specific use-cases. The AutoML products allow users to train specialized models by providing a modest number of labeled training examples.

A potential strength of these products is their ability to leverage Google’s existing base of deep learning models. In technical terms, Google brings a number of pre-trained layers and allows each user to train the last few layers with customized training examples.
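
Google AutoML hides this behind a GUI and an API, but the underlying idea looks roughly like the following Keras sketch, where a pre-trained backbone is frozen and only a small new head is trained; the backbone choice and class count are illustrative.

```python
# Illustrative Keras sketch of the transfer-learning idea described above:
# keep a pre-trained backbone frozen and train only a small new head.
import tensorflow as tf

# Pre-trained ImageNet backbone, without its original classification head
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet"
)
base.trainable = False  # freeze the pre-trained layers

# New "last few layers" trained on your own (modest) labeled dataset
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(5, activation="softmax"),  # 5 = example number of classes
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# model.fit(train_dataset, epochs=5)  # train_dataset: your labeled examples
```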

However, these products are not all point-and-click solutions and likely require some amount of coding to work.

Google AutoML also integrates with the rest of GCP, which includes ample file storage and compute options, effectively making it an MLaaS platform as well.

Open source offerings

To date, there are not many open source offerings for AutoML. Much of the value of AutoML (customizable point-and-click software, pre-trained models, managed cloud resources) does not translate well to an open source model. However, there are a couple of packages worth investigating.

  • auto-keras: Automated neural architecture search (NAS) using Keras.
  • auto-sklearn: Automated ensemble model generation using scikit-learn estimators.
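
To give a sense of how little code these require, here is a minimal auto-sklearn sketch on a toy dataset; the time budgets are illustrative.

```python
# Minimal auto-sklearn sketch on a toy dataset; time budgets are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import autosklearn.classification

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

# Searches over scikit-learn estimators/preprocessors and builds an ensemble
automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=300,  # total search budget (seconds)
    per_run_time_limit=30,        # budget per candidate model
)
automl.fit(X_train, y_train)

print(accuracy_score(y_test, automl.predict(X_test)))
```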

What are the strengths and weaknesses of AutoML?

Strengths of AutoML

  • Using best ML practices: steps like k-fold validation, histogram inspection, and hyperparameter search can take a while to code up in Python. Having a program do them automatically is nice (see the concrete example after this list).
  • Interpretability: many AutoML platforms have built-in interpretability tools.
  • Small-to-medium data: Training a vision or translation model from scratch would take mountains of data. The possibility of using transfer learning for this task makes it more approachable.
  • Non-coders: It’s useful to provide the power of advanced ML in a package that is not much more difficult to use (potentially) than Microsoft Excel.
  • Model operations: Imagine – upload data to a cloud platform, train a model with an automated GUI, deploy with a click, and hit an API to evaluate the model. It’s an appealing thought.
  • Companies that don’t have ML in production yet: for companies that don’t have strong ML competency, AutoML could be a way to catch up.
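
To make the first point concrete, below is the kind of boilerplate that an AutoML tool runs for you automatically: k-fold cross-validation wrapped around a hyperparameter search. The model and parameter grid are arbitrary examples.

```python
# The kind of boilerplate an AutoML tool runs for you: k-fold cross-validation
# plus a hyperparameter search. The model and grid here are arbitrary examples.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"n_estimators": [100, 300], "max_depth": [None, 5, 10]}
cv = KFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=cv, scoring="accuracy")
search.fit(X, y)

print(search.best_params_, search.best_score_)
```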

Weaknesses of AutoML

  • Training time and memory management: Even with hand-built models, I run into memory limits and training time constraints. With AutoML, which must train dozens of different models on several different cross-validation sets, these time and memory constraints get worse.
  • Bigger data: Related to the above, the ability to train with larger data sets is constrained when training time and overhead are increased. This can be partly alleviated with MLaaS cloud hardware and distributed training, but that adds a lot of complexity to experimental/training workflows.
  • Tunnel vision: Having all the data/modeling/features in one place is an appealing vision. However, machine learning models can frequently be improved by “going back” to the source data, finding other related data sets, or discovering data quality and consistency issues. If we take the claim that AutoML does everything at face value, we may lose sight of alternative data sources and data consistency issues.
  • Feature engineering/feature selection: Many AutoML platforms claim to do automated “feature engineering”. However, many feature engineering and feature selection tasks can be accomplished only, or far more quickly, with human intuition about the real-world relationships between features.
  • Claiming too much: AutoML companies claim things like automated feature engineering and one-click deployments, but these claims conceal a lot of complexity.
  • Flexibility: for well-behaved tabular data sets, AutoML is likely to approach the best performance of human-constructed models. However, for less structured tasks, multi-step tasks, hybrid ML/rules-based tasks, and unsupervised/representation learning, the AutoML frameworks are probably not flexible enough.
  • Support for deep learning: The support for deep learning in AutoML (other than well-defined tasks in transfer learning) is not yet very good. Neural architecture search is generally a harder problem than ensemble construction/parameter search, so I expect this to continue.

The critics weigh in

Gartner.com publishes reviews of many tech products, including most of these AutoML platforms (H2O Driverless AI, DataRobot, and Amazon AWS SageMaker among them). What do industry people have to say about them?

The verdict: What should data scientists know?

It’s easy for data scientists to get defensive and assume that AutoML is all hype. The reality is that some parts of the data science workflow are boilerplate, and therefore automation-ready (test-train splits, model selection, hyperparameter search).

For smaller companies, especially start-ups with more specific, niche ML applications, AutoML is probably going to be too generic, too expensive and too slow (from a computation point of view).

Large organizations in healthcare, finance, or insurance have well-defined business processes and built-up IT infrastructure. They probably have many ML needs that are complex, yet relatively generic, and they may not have ML expertise or infrastructure. AutoML platforms could be a good “plug-and-play” fit into the overall business and IT workflows in such cases. I would expect AutoML to capture a growing share of ML tasks in mature enterprises.

Still, there is a lot of manual work required, like problem definition, data discovery, data cleaning/merging, data consistency, feature engineering/generation, data/model management, model deployment, etc. The AutoML platforms handle the training and selection of the model.

Data scientists should be aware of AutoML and be wary if too much of their job becomes the parts that can be automated. Data scientists whose only skills are plugging parameters into scikit-learn, perhaps, should be worried. It is always a good idea to diversify your skills, but there is still a lot of work for data scientists to do, and interest in the field continues growing quickly.

Google web search trend for “data science”.