This article was written by Grover Righter and originally appeared on the DataRobot Blog here: https://www.datarobot.com/blog/picking-the-right-notebook-for-your-data-science-team/
The bar for AI keeps rising. Seventy-six percent of companies prioritize AI and machine learning (ML) over other IT initiatives, according to Algorithmia’s 2021 enterprise trends in machine learning report. With growing pressure on data scientists, every organization needs to ensure that their teams are empowered with the right tools. At the same time, the toolkit needs to meet enterprise needs and regulatory requirements.
Data science notebooks have become a crucial part of the data science practice. As a data scientist at heart, and drawing on direct work with our customers and community, I am sharing my observations about the advantages and challenges different notebook solutions bring to the table.
Open Source vs. Cloud-Integrated Solutions
When it comes to scalability and speed, you need to look at the stack you are currently working with and ask a few key questions:
- How well are your tools integrated?
- How are your systems performing?
- What is the level of complexity?
- How regular and reliable is your system?
Also, since security and risk management have become board-level issues for organizations (Gartner), you need to think about these as well.
Before deciding what would be the best tool for your data science team, let’s look at the criteria for how you choose a notebook solution:
- Efficiency: What languages can I use? Can I use several different languages?
- Speed and Scalability: How many resources do I need for compute?
- Collaboration and Sharing: Is it easy to collaborate? How can team members reuse work already done?
- Visualizations: How flexible is plotting? What different visualizations does the solution support?
- Governance and Security: How can I ensure security of my data? How can I mitigate security risks?
Let’s take a look at one of the open source solutions.
Open source systems (OSS) are easy to love. Jupyter, for example, can run multiple kernels (language interpreters). It also runs in standard browsers, and it keeps a historical record of work on many datasets, along with the data visualizations produced from them.
Open source notebooks exist because most data science languages are a mix of object-oriented code, complex libraries, and functional programming. The output was designed for the command line world, not a graphical plot world. Plotting graphics using Python, R, Scala or other languages has always depended on conversion to JPEG format or some other graphical output that does not display when created. Tables of data and the graphics they created were viewed in different tools. Data analysts spent many hours converting assets into reports or refactoring them in more graphic native tools, such as Tableau.
By implementing open source notebooks like Jupyter in a browser, data scientists can combine programming, some documentation (using Markdown), tables, and graphics all in the same environment. From the beginning, the practice arose of naming notebooks after the experiment, the date, and the author. This allowed for a review of historic progress on a project without unwinding history in version control.
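The naming practice described above can be sketched in a few lines of Python. This is only an illustration: the function name, the separator characters, the `.ipynb` extension, and the sample values are assumptions, not a convention prescribed by Jupyter or by any team mentioned here.

```python
import re
from datetime import date


def notebook_filename(experiment: str, author: str, run_date: date) -> str:
    """Build a notebook file name from the experiment, the date, and the author."""

    def slug(text: str) -> str:
        # Lowercase the text and collapse every run of other characters into "-".
        return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

    # ISO dates (YYYY-MM-DD) sort chronologically when file names are sorted.
    return f"{slug(experiment)}_{run_date.isoformat()}_{slug(author)}.ipynb"


# Hypothetical experiment, author, and date, for illustration only.
print(notebook_filename("Churn Model v2", "A. Analyst", date(2021, 6, 14)))
```

Putting the date in ISO format means that a plain directory listing already reads as a timeline of the experiment.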
My team used this notebook previously as well, but at one point, I realized that it no longer met the expectations that the market and organizations set for our team. We had a lot of workarounds to address many of the issues that I’ll share later in this blog. But most importantly, when we choose a tool, we have to ask: do we want to spend time figuring out how to work around issues, or would we rather spend it delivering real value?
A Breakdown of DataRobot Zepl – Integrated Cloud Solution
Flex Scale without Manual Container Deployment
Open source notebooks are normally run either on a local computer or in a single container with remote access. The resources available in an open source notebook are constrained by the computer or container in which it is deployed. Changing the memory, CPU, and other performance-scale attributes is non-trivial. While we do have solutions to stand up a new container, size it “upwards,” install an open source notebook, install a kernel environment, run a project, save the results and tear it down, the process is still a bit manual, slow, and inefficient. In addition, homing in on the “right size” environment to run a project can take many slow iterations.
With DataRobot Zepl, we simply create a notebook using any size initial container we wish. As we decide we need more resources, a drop-down menu lets us switch the notebook to run in a bigger (or smaller) container and be up and running in a few seconds. This advantage has changed how much time teams spend on container switching, overall resources used, and project efficiency. Until one has worked on exploratory datasets across several projects, one has no idea how much effort it takes to “right size” environments to projects. With DataRobot Zepl, a drop-down menu has changed the way we operate.
Flexible, Multi-Kernel Code Sets in a Single Notebook
Open source notebooks like Jupyter can be deployed and configured to run almost any kernel. But the process to change from Python to Scala, for example, or Python to R is usually static and results in a new single-kernel solution. Worst of all, the notebooks are now “not as portable,” because in addition to the code in the notebook, we need to exactly recreate the custom kernel used when the notebook was created. It is not practical to keep custom instances up and running when not needed, so our teams often created a deployment model to recreate custom kernels. Creating and maintaining these custom environments required a lot of time and engineering resources.
DataRobot Zepl is inherently multi-kernel in every instance. You can specify a mix of Python, R and Scala in any notebook with zero kernel setup required, and the environment can be reproduced by loading and running the notebook. Mixing R code for some unique libraries with Python code for more general data frame access, with common display graphics for both, is a big leap forward.
Cloud-to-Cloud Data Performance 1,000 to 1,000,000 Times Faster
Prior to the 21st Century, most developers owned a “compiler book.” This was not a book one read about compilers; it was a book one read while building and slowly compiling software. The 21st Century equivalent should be called the “query and download book.” When an open source notebook is deployed on a local machine, and the data required are located across a network, it can take (literally) hours for a complex query with large datasets to resolve and be available on the local machine. If the data are static, fine. One can download once and run locally—although this violates many security policies. But if the data are dynamic, there can be many multi-hour pauses in progress. This is not an imaginary issue. The author of this blog has flown on red-eye flights several times when projects became stalled due to remote data with the only solution being to fly to the data warehouse facility and work in the NOC to get actual data access.
DataRobot Zepl operates 100% in the cloud. In addition, most of the data sources are also cloud-based and peered with DataRobot data centers. Across multiple projects, we have seen data access times reduced by factors ranging from 1,000-to-1 to 1,000,000-to-1. Using DataRobot Zepl, a very large, complex query may take long enough to get a cup of coffee, but never long enough to crack open a book.
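To put those ratios in perspective, here is a small worked calculation. The two-hour baseline is a hypothetical figure chosen only to match the "hours" scale described above; it is not a measurement from any project, and the function name is illustrative.

```python
from datetime import timedelta

# Hypothetical baseline: a query that takes two hours to reach a local machine.
baseline = timedelta(hours=2)


def accelerated(duration: timedelta, factor: int) -> timedelta:
    """Return the duration after dividing it by a speedup factor."""
    return duration / factor


# The two ends of the range quoted in the article.
for factor in (1_000, 1_000_000):
    seconds = accelerated(baseline, factor).total_seconds()
    print(f"{factor:>9,}x faster: {seconds:.4f} seconds")
```

Under that assumed baseline, even the low end of the range turns a multi-hour wait into seconds.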
Secrets and Passwords
All projects, small or large, need a place to store secrets. On larger projects, we can invest real resources in technology to embed bootstrapping (secrets to get to secrets) inside the container .yaml files. On smaller projects and ad hoc data science work, team members often simply embed confidential user names, access codes, and passwords in files. While this is a real security risk in and of itself, the risk is multiplied when code is stored in version-control repositories. In many cases, the secrets apply to very broad data resources.
It is fine to make policies to prevent embedding passwords and user names in code. But for small discovery projects, there is no convenient and universal secrets-keeping model. Thus, secrets end up in open source notebooks on a regular basis, exposing organizations to risk.
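As a point of comparison, one common way to follow a not-in-code policy in an ordinary Python notebook is to read credentials from environment variables. The sketch below assumes that approach; the variable names (`WAREHOUSE_USER`, `WAREHOUSE_PASSWORD`) and helper functions are hypothetical, and this is not the DataRobot Zepl credentials mechanism.

```python
import os


def load_credentials(prefix: str = "WAREHOUSE") -> dict:
    """Read a user name and password from environment variables.

    The variable names (<prefix>_USER, <prefix>_PASSWORD) are hypothetical.
    """
    try:
        return {
            "user": os.environ[f"{prefix}_USER"],
            "password": os.environ[f"{prefix}_PASSWORD"],
        }
    except KeyError as missing:
        # Fail with a clear message instead of falling back to a hard-coded value.
        raise RuntimeError(
            f"Set {missing.args[0]} before running this notebook"
        ) from None


def masked(secret: str) -> str:
    """Return a fixed-width placeholder so the secret never appears in output."""
    return "*" * 8 if secret else ""


if __name__ == "__main__":
    try:
        creds = load_credentials()
        print("user:", creds["user"], "password:", masked(creds["password"]))
    except RuntimeError as err:
        print(err)
```

The notebook file then contains no secret values, so committing it to version control does not leak them. Each team member still has to set the variables on every machine, which is part of why this approach is neither convenient nor universal.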
With DataRobot Zepl, there is a simple, secure, built-in set of methods to retain secrets. Not only does the credentials model reside in the correct location (it is co-located with data source definitions), but the model also does not allow for the open display of secrets when notebooks are shared. This lowers the cost of protecting passwords and raises compliance with not-in-code policies to a very high level.
Data Security
When open source notebooks like Jupyter are installed on local machines, the data often gets downloaded to these local machines as well. The reason mirrors the 1,000-times speed improvement noted above. It is simply too slow to run models on a local machine and have the data pulled down for every job run, since data science is very iterative. This can result in multiple local copies of very sensitive data.
CI/CD Flows from External Sources
While we prefer DataRobot Zepl for enterprise data science, we also must incorporate prior art from previous notebooks, Python code, R code, and Scala code. This external code is open and iterative and is being updated while projects and data science models are in progress.
DataRobot Zepl allows both the inclusion of external code and the simple import of code into DataRobot Zepl notebooks, where it can be joined with other notebook logic.
When DataRobot Zepl code needs to inform external notebooks, entire notebooks can be exported in the previous format, although some display and multi-kernel functionality may be lost, of course.
All of this cooperation with other notebook and non-notebook code allows us to utilize DataRobot Zepl as a core platform for larger collaborative CI/CD multi-team projects.
Collaboration and Sharing
We can always use GitHub to share code in other open source notebooks, and this works fine for the code itself. But enterprise data science projects are combinations of code and data. DataRobot Zepl provides a team collaboration model where entire notebooks can be shared, along with the basics of data sources and also historic display runs.
Notebooks can be shared with co-developers who can modify or clone notebooks. Notebooks can also be shared with non-developers to see report runs and data results, but not have any access to code or data.
Better Graphics and Presentation Layer
DataRobot Zepl has more powerful, more professional and more “ready to display” graphing and charting options. Localized widgets make creating executive-ready presentations simpler and faster than transporting results into another platform. In addition, as new code or data is added, the team can simply rerun the notebook to get fresh results with all code, data access, and display layer in the DataRobot Zepl notebook.
Ready to Advance Your Data Science Toolkit with DataRobot Zepl?
With the DataRobot Zepl trial, you can start for free today. To get you started, access the public documentation and library of Notebook Accelerators that we have collected for you. Learn how Embrace Home Loans utilizes DataRobot Zepl to improve their team’s efficiency and maximize ROI from their marketing efforts.