This article is written by Snowflake and originally appeared on the Snowflake Blog here: https://www.snowflake.com/blog/3-snowflake-features-that-make-data-science-easier/
Data science is proving to be a major competitive advantage for companies. While business intelligence (BI) helps companies with reporting and historical analysis, data science goes a step further and predicts the future. It can leverage much more data from many more sources, and using machine learning (ML) principles, it automatically identifies patterns and trends to model, predict, or forecast future outcomes.
Data science is being used for a wide variety of purposes, from providing personalized movie and TV show suggestions to forecasting where a virus is likely to spread next and helping save lives. This giant leap to advanced analytics has largely been enabled by the cloud. Companies can inexpensively collect, store, and analyze more data than ever before, and with graphics processing unit (GPU)–accelerated computing, they can train multiple ML models simultaneously in just minutes and then choose the most accurate ones to deploy.
But most data science projects fail. According to a VentureBeat article, 87% never even make it into production, largely because of the complexity involved in building them. Snowflake’s cloud data platform helps companies streamline their data science initiatives. A newly released Deloitte report surveyed more than 2,700 global companies about how they are preparing for AI, and respondents ranked modernization of their data infrastructure as their top initiative for gaining a competitive advantage because it is “foundational to every AI-related initiative.” That ranking is evidence that a modern cloud data platform such as Snowflake can be the linchpin for delivering successful data science projects.
Figure 1. An illustration of a typical data science workflow
Snowflake Features That Power Successful Data Science Projects
Here are three Snowflake features that make it simpler for companies to run successful data science projects so they can leverage AI and ML to enable advanced analytics and gain a competitive edge.
A single, consolidated source for all data
For the highest accuracy, data scientists need to incorporate a wide variety of information when training their ML models. But data can reside in many places and come in various formats. According to InfoWorld, data scientists typically spend up to 80% of their time finding, retrieving, consolidating, cleaning, and preparing data, and only the remaining 20% on building, training, and deploying their models. Much of this is because getting the right data isn’t just “one-and-done.” Data scientists often need to go back and collect additional data multiple times during the course of one project. This entire process can take weeks or months, contributing to latency in the data science workflow. In addition, the data used for analysis needs to have a high level of integrity, or the results won’t be valid or trusted.
By bringing data in from multiple environments, Snowflake provides all data in a single high-performance platform, removing the complexity and latency caused by traditional ETL jobs. Data can be profiled and cleansed directly in Snowflake, ensuring a high level of data integrity. Snowflake also provides data discovery capabilities so users can find and access their data more easily and quickly. In addition, Snowflake Data Marketplace offers instant access to diverse third-party data sets: unique data from hundreds of providers is available immediately on demand.
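As a rough illustration of what profiling data with SQL can look like, the sketch below counts rows, nulls, and distinct values for one column. It is not Snowflake-specific: it assumes Python's built-in sqlite3 module as a local stand-in for a warehouse connection, and the `customers` table, its columns, and the helper names `profile_column` and `make_profile_demo` are hypothetical.

```python
import sqlite3


def profile_column(conn, table, column):
    """Return row count, null count, and distinct count for one column."""
    # Identifiers cannot be bound as query parameters, so validate them
    # before interpolating them into the SQL text.
    for name in (table, column):
        if not name.isidentifier():
            raise ValueError(f"unsafe identifier: {name!r}")
    total, non_null, distinct = conn.execute(
        f"SELECT COUNT(*), COUNT({column}), COUNT(DISTINCT {column}) FROM {table}"
    ).fetchone()
    return {"rows": total, "nulls": total - non_null, "distinct": distinct}


def make_profile_demo():
    """Build a tiny in-memory table (hypothetical data) to profile."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customers (id INTEGER, country TEXT)")
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "US"), (2, "US"), (3, None), (4, "DE")],
    )
    return conn


if __name__ == "__main__":
    print(profile_column(make_profile_demo(), "customers", "country"))
```

A high null count or an unexpectedly low distinct count in a profile like this is the kind of signal that would prompt cleansing before the data is used for model training.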
Powerful compute resources for data preparation
Data scientists need powerful compute resources to process and prepare data before they can feed it into modern ML models and deep learning tools. As mentioned above, data scientists spend most of their time understanding, processing, and transforming data they find in multiple formats. One such compute-intensive process is feature engineering, which involves transforming raw data into new, clearer signals that are more meaningful and lead to more-accurate predictive models. Creating new features that are predictive can be complex and time-consuming, involving domain expertise, familiarity with each model’s unique requirements, and multiple iterations. Most legacy tools, including Apache Spark, are overly complex and highly inefficient at data preparation, resulting in brittle and expensive data pipelines.
Snowflake’s unique architecture provides dedicated compute clusters for each workload and team, so there is no resource contention between data engineering, BI, and data science workloads. Snowflake’s ML partners push down much of their automated feature engineering into Snowflake’s cloud data platform, providing a significant speed boost to automated machine learning (AutoML). Manual feature engineering can be done in many languages through Snowflake’s Python, Apache Spark, and ODBC/JDBC connectors. Transforming data with SQL makes feature engineering accessible to a broader audience of data workers and can result in a 10-times boost in speed and efficiency compared with Apache Spark.
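To make SQL-based feature engineering concrete, here is a minimal sketch that turns raw order rows into one feature row per customer with a single aggregate query. It assumes Python's built-in sqlite3 module as a local stand-in for a warehouse connection; the `orders` table, its columns, and the helper names `build_customer_features` and `make_orders_demo` are hypothetical and do not come from Snowflake.

```python
import sqlite3


def build_customer_features(conn):
    """Aggregate raw order rows into one feature row per customer using SQL."""
    query = """
        SELECT
            customer_id,
            COUNT(*)        AS order_count,
            SUM(amount)     AS total_spend,
            AVG(amount)     AS avg_order_value,
            MAX(order_date) AS last_order_date
        FROM orders
        GROUP BY customer_id
        ORDER BY customer_id
    """
    return conn.execute(query).fetchall()


def make_orders_demo():
    """Build a tiny in-memory orders table (hypothetical data)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders (customer_id INTEGER, amount REAL, order_date TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders VALUES (?, ?, ?)",
        [
            (1, 20.0, "2021-01-05"),
            (1, 40.0, "2021-02-10"),
            (2, 15.0, "2021-01-20"),
        ],
    )
    return conn


if __name__ == "__main__":
    for row in build_customer_features(make_orders_demo()):
        print(row)
```

Because the aggregation is expressed in SQL, the database engine does the heavy lifting and only the compact feature table is returned to the client, which is the general idea behind pushing feature engineering down into the data platform.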
An extensive partner ecosystem
Data scientists use many tool sets, and the ML space is rapidly evolving, with new tools being added each year. However, legacy data infrastructure can’t always support the demands of multiple different tool sets, and new technologies such as AutoML require a modern infrastructure to function properly.
Through Snowflake’s extensive partner ecosystem, customers can take advantage of direct connections to all existing and emerging data science tools, platforms, and languages such as Python, R, Java, and Scala; open source libraries such as PyTorch, XGBoost, TensorFlow, and scikit-learn; notebooks such as Jupyter and Zeppelin; and platforms such as DataRobot, Dataiku, H2O.ai, Zepl, Amazon SageMaker, and many others. Snowflake also offers integrations with the latest ML tools and libraries, such as Dask and Saturn Cloud. By offering a single consistent repository for data, Snowflake removes the need to retool the underlying data every time tools, languages, or libraries are changed. Furthermore, the output from these tools can be seamlessly integrated back into Snowflake.
Snowflake Is an Engine for Business Value
Once predictive models are deployed, their scored data can be fed back into traditional BI decision-making processes and embedded into applications such as Salesforce. Feeding powerful data science results back to business users can unlock insights that provide unprecedented business growth. In addition, when Snowflake is used with leading ML tools, it can drastically reduce latency in the data science workflow by cutting the time required for developing models from weeks or months to hours.