This article was written by Miles Adkins and originally appeared on the DataRobot Blog here: https://www.datarobot.com/blog/datarobot-enables-scalable-feature-engineering-and-inference-leveraging-snowflake-snowpark-java-udfs/
One of the most common issues we see in the machine learning life cycle (feature engineering, training, scoring, etc.) is the involvement of several separate and often slow compute environments. This needlessly increases model pipeline complexity. Two pain points stand out: moving data out of the production databases where it lives, and handing machine learning operations teams a series of model artifacts with dependencies that must be reassembled to put the model into production.
To help combat this problem, DataRobot introduced Portable Prediction Servers in its 6.3 release, allowing organizations to bring any DataRobot model closer to their production data and to integrate it into existing pipelines and applications.
With Snowflake’s announcement of Snowpark and Java UDFs (user-defined functions), DataRobot has continued to expand on this theme. With Snowpark, data preparation tasks in Zepl can be pushed down into Snowflake for in-database feature engineering. And to further reduce the disparate compute environment problem between machine learning models and data, DataRobot Java Scoring Code can be paired with Snowflake Java UDFs for in-database model scoring/inference.
Using Snowpark in Zepl for Feature Engineering
Snowpark is a new developer experience for Snowflake, allowing you to build efficient and powerful pipelines with familiar constructs in your programming language of choice. Snowflake has always delivered performance and ease-of-use for users familiar with SQL. Now Snowpark enables users to write in Scala and Java using a DataFrame model that is widely used and familiar.
Inside a Zepl Notebook, users set their cell runtime to “%snowpark” and configure a “Snowpark” data source. From there, code executed in Zepl is translated and pushed down to the Snowflake platform, where the data already lives, taking advantage of Snowflake’s performance, scalability, and concurrency.
[Code sample in the original post: Scala code that queries a Snowflake table and returns a set of filtered rows.]
Pairing DataRobot Scoring Code Models with Java UDFs
With Java UDFs, customers can run Java functions right inside Snowflake’s Data Cloud, with better performance, scalability, and concurrency than hosted external services.
One mechanism DataRobot supports for exporting and running models in external environments is Java Scoring Code. These Scoring Code JARs contain prediction calculation logic identical to the DataRobot API, can be run anywhere Java code can be executed, and are often the best choice for low-latency, high-scale scoring.
Models that support Java Scoring Code can be identified by their tag in the leaderboard.
Once models are deployed to a Snowflake Prediction Environment, users execute the generated script to upload the JAR and create an associated UDF to perform inference directly in Snowflake.
With the Java UDF, users can make predictions anywhere they currently leverage Snowflake—all without moving any data outside the database.
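To make the handler side of a Java UDF more concrete, here is a minimal sketch, not the script DataRobot generates. It assumes a hypothetical class named ScoringUdfSketch, a hypothetical stage and JAR name, and made-up logistic coefficients standing in for the model logic that the Scoring Code JAR would actually provide. The only point it illustrates is that a Snowflake Java UDF resolves to a plain Java method that is called once per row.

```java
/**
 * Hypothetical handler class for a Snowflake Java UDF.
 * The logistic formula below is a stand-in; a real deployment would
 * delegate to the model inside the DataRobot Scoring Code JAR instead.
 *
 * Illustrative DDL (stage, JAR, and function names are assumptions, and
 * this is not the script DataRobot generates):
 *
 *   CREATE OR REPLACE FUNCTION score_sketch(x1 FLOAT, x2 FLOAT)
 *     RETURNS FLOAT
 *     LANGUAGE JAVA
 *     IMPORTS = ('@my_stage/scoring-udf-sketch.jar')
 *     HANDLER = 'ScoringUdfSketch.score';
 */
class ScoringUdfSketch {
    // Made-up coefficients standing in for a trained model.
    private static final double INTERCEPT = -1.0;
    private static final double WEIGHT_X1 = 0.5;
    private static final double WEIGHT_X2 = 0.25;

    /** Called once per row; SQL FLOAT arguments arrive as Java doubles. */
    public static double score(double x1, double x2) {
        double linear = INTERCEPT + WEIGHT_X1 * x1 + WEIGHT_X2 * x2;
        // Logistic link maps the linear term to a value between 0 and 1.
        return 1.0 / (1.0 + Math.exp(-linear));
    }

    // Local smoke test only; Snowflake itself invokes the handler method.
    public static void main(String[] args) {
        System.out.println(score(2.0, 0.0));
    }
}
```

Under these assumptions, once the function is registered it could be called from ordinary SQL, for example SELECT score_sketch(col_a, col_b) FROM some_table, where the column and table names are again placeholders.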
To close the loop and understand a model’s performance on data in production, users can ingest service and prediction data back into DataRobot MLOps for analysis.