This article is by Sadovsky and originally appeared on the Alteryx Analytics Blog here: https://community.alteryx.com/t5/Analytics-Blog/Unleashing-Advanced-Analytics-with-the-Alteryx-Intelligence/ba-p/574803
As the data landscape evolves, organizations’ analytical needs grow increasingly complex. Historically, modern data-science techniques have been isolated to the select few who not only were experts at programming but also were deeply entrenched in statistics. Now that is no longer true! The Alteryx Intelligence Suite democratizes advanced analytical capabilities for any interested Alteryx user. It brings the power of predictive machine learning and natural language processing to all organizations looking to unlock the power of their data with Alteryx.
In our initial release of the Alteryx Intelligence Suite, we chose to focus on two of the most common data science challenges facing organizations today:
- Data is no longer constrained to just spreadsheets. Data-savvy organizations have image, PDF, and other text assets they can extract value from. With the Intelligence Suite’s text mining capabilities, we’re opening organizations to a whole new avenue of analytical capabilities. Social media comments, legal documents, support emails, and more all contain a treasure trove of data that, in many organizations, has never been fully tapped. These building blocks let you get this type of data into Alteryx, prep it for analysis, and then explore the underlying topics and themes throughout the text and visualize your results to better understand the humans behind the data.
- Success today requires staying two steps ahead. Business planning in the modern world is more complex than ever. Your organization needs every competitive edge and having a clearer view of what's to come could make or break the sustainability of your business. With the Intelligence Suite’s machine learning capabilities, we’re addressing how organizations use data to create models for prediction and interpretation. In all aspects of business, you often need to make decisions given incomplete data. Predictive modeling lets you use patterns observed in the past to infer what might happen in the future. The Intelligence Suite equips users with the building blocks to answer these types of questions, along with a guided experience to help navigate the complexities inherent in this process.
The Alteryx Intelligence Suite is designed for organizations in all stages of their analytic journey. For a beginner analyst or an organization that is just starting to adopt advanced analytics, the Intelligence Suite gives you all that’s needed to begin your analytic journey knowing that the choices you make via our drag-and-drop building blocks and on-screen guides are backed by best-in-class data science via established open source libraries like scikit-learn and XGBoost. For advanced users, the building blocks provide in-depth configuration and customization of these libraries integrated into the Alteryx environment.
Users can begin to explore their predictive problems, such as ranking customers who are most likely to churn or predicting the probability of an event of interest, using Assisted Modeling. As the organization matures, models can be deployed via Alteryx Promote or Alteryx Server for production. If desired, models can always be translated to raw Python code to share with other data scientists or deploy in a cloud ecosystem.
Whether being used for prototyping or production, the process is transparent, letting business analysts and citizen data scientists work together. The same best-in-class data science underpins our text mining building blocks: they are built on libraries like Tesseract, VADER, and scikit-learn, ensuring that users get the best capabilities available in the market, all with the ease of use of Alteryx.
The Intelligence Suite’s Text Mining Capabilities
I’m excited to highlight some of the amazing capabilities of our Text Mining tool group. At its core, the Text Mining tool group makes it simple to get text into Alteryx from virtually any format, including PDFs and images via optical character recognition. This capability alone enables a whole new way for users to bring data into Alteryx. Once data is there, the Text Mining tool group also provides building blocks for manipulating and processing that data even further.
Preparing Text for Analysis
The tool group includes a building block specifically to prep text data for analysis, performing “lemmatization.” Put simply, lemmatization reduces different forms of a word to its base grammatical form. For example, “am” / ”are” / ”is” all become “be,” and “cat” / “cats” / “cat’s” / “cats’” all become just “cat.” When performing advanced learning on text, this step is crucial for generalizing large bodies of complex text into a simpler underlying structure.
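To make the idea concrete, here is a minimal lookup-based sketch of what lemmatization does. This is illustrative only: real lemmatizers (such as those in spaCy or NLTK) use full vocabularies and part-of-speech context rather than a hand-built table, and the `LEMMA_TABLE` below is a hypothetical stand-in.

```python
# Illustrative only: a tiny lookup-based lemmatizer. Real tools use
# large vocabularies plus part-of-speech context to pick the lemma.
LEMMA_TABLE = {
    "am": "be", "are": "be", "is": "be",
    "cats": "cat", "cat's": "cat", "cats'": "cat",
}

def lemmatize(token: str) -> str:
    """Map a token to its base form, falling back to the token itself."""
    return LEMMA_TABLE.get(token.lower(), token.lower())

print([lemmatize(t) for t in ["Am", "are", "is", "cats", "cat's"]])
# ['be', 'be', 'be', 'cat', 'cat']
```

After this step, every surface form of a word counts toward the same underlying token, which is exactly what downstream analysis like topic modeling needs.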
With the Intelligence Suite, the task is as easy as dragging a building block onto the Designer canvas and clicking your way to a custom configuration.
Mining the social web has become a disruptive new way for organizations to understand their product impact in near real time. Tweets can be collected and defined as being positive, neutral, or negative, and businesses can keep a daily metric of “positive to negative ratio” commentary to see how the web is reacting. That said, defining a tweet’s sentiment, and doing it at a large scale, used to require someone to get deep in code. With our code-free sentiment analysis building block, this becomes an easy task.
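For intuition, here is a conceptual sketch of the lexicon-based scoring approach that VADER takes. The word lists below are tiny illustrative stand-ins; VADER’s real lexicon contains thousands of scored entries plus rules for negation, punctuation, and emphasis, and the Intelligence Suite wraps all of that behind a code-free building block.

```python
# Conceptual sketch of lexicon-based sentiment scoring (the family of
# techniques VADER belongs to). Word lists are illustrative stand-ins.
POSITIVE = {"love", "great", "awesome", "happy"}
NEGATIVE = {"hate", "terrible", "awful", "sad"}

def score_tweet(text: str) -> str:
    """Classify text as positive, neutral, or negative by word counts."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

tweets = ["I love this product", "terrible support awful experience", "just ordered it"]
print([score_tweet(t) for t in tweets])
# ['positive', 'negative', 'neutral']
```

Running a classifier like this over a day’s tweets is what makes the “positive to negative ratio” metric described above possible at scale.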
With a very simple workflow leveraging the Intelligence Suite, you can create an effective way to bulk analyze tweets!
Michael Jordan, along with David Blei and Andrew Ng, is one of the main authors on the journal article introducing Latent Dirichlet Allocation, the research that underlies the field of topic modeling. To no one’s surprise, this isn’t the same Jordan as the 14-time NBA all-star and brief minor-league baseball player for my favorite team, the Chicago White Sox. Imagine, though, that you had two giant blocks of text about both the sports star and the University of California, Berkeley machine-learning star. How could you tell them apart?
Well, the distribution of words in those documents would likely be very different. Topic Modeling looks at those distributions, realizing that some words might be common to both, but likely co-occur in other unique patterns. Applying Topic Modeling to these texts could help you annotate all your documents with topics like “Basketball” or “Machine Learning,” but you also might discover other themes like “Sneakers,” or “Space Jam,” that could help you further organize, search or summarize your texts. One can imagine how organizations armed with lots and lots of text documents could begin to utilize this technology.
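The word-count intuition behind topic modeling can be sketched with nothing but Python’s standard library. The mini-documents below are hypothetical, but they show the kind of co-occurrence signal LDA works from: a few words are shared across documents, while most cluster by topic.

```python
from collections import Counter

# Hypothetical mini-documents about the two Michael Jordans. Topic
# models like LDA start from exactly this kind of word-count data.
doc_sports = "jordan scored points basketball bulls basketball championship"
doc_ml = "jordan published paper machine learning graphical models learning"

counts_sports = Counter(doc_sports.split())
counts_ml = Counter(doc_ml.split())

# Words shared by both documents vs. words unique to each.
shared = set(counts_sports) & set(counts_ml)
only_sports = set(counts_sports) - set(counts_ml)
only_ml = set(counts_ml) - set(counts_sports)

print(sorted(shared))        # ['jordan']
print(sorted(only_sports))
print(sorted(only_ml))
```

A topic model does this at scale and probabilistically: rather than hard set differences, it learns distributions over words that tend to co-occur, and labels each document with a mixture of those topics.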
During my PhD, I had the luxury of learning topic modeling from John Lafferty, a coauthor of David Blei. Bringing this technology to users of all academic and work backgrounds is, for me, a personally close and exciting venture into democratizing data science! Now, instead of struggling to get code working from complex underlying mathematical models, I can drag and drop tools in Alteryx and quickly start to explore the topics in any set of documents.
Visualizing Your Output
The Text Mining tool group lets you build word clouds from your output, giving you a graphical representation of your analysis, complete with filters and options to make your graphics shine! For example, below is our data science word cloud, in the shape of a cloud.
Machine Learning with Alteryx Intelligence Suite
Walking through all our new machine learning capabilities is too much content for this post, so instead I’d like to focus on some of my favorite features in the new Machine Learning tool group.
Full Transparency and Control
The Assisted Modeling building block keeps humans in the loop with machine learning. While it profiles data to make the best suggestions possible given multiple heuristics and best practices, no one knows your data better than you! As opposed to other black-box solutions, Assisted Modeling shows why it’s making the recommendations it is and at what certainty, and always allows you to override its choices.
Picking the right data for a model is hard. If you’re not careful, data that wouldn’t be available to the model in the future could accidentally be included in your training set. This phenomenon is often referred to as “data leakage,” and it can cause models in production to fail entirely or produce subpar results. On the other end of the spectrum, we often don’t know what data is important to the task at hand, so we throw in everything we have. This is often the best agnostic approach; however, it can slow down the modeling process and complicate algorithms, causing them to perform worse than they would otherwise.
Assisted Modeling uses two techniques (Gini Impurity and Goodman-Kruskal Tau) to identify the best set of features to use to efficiently generate an unbiased, high-quality model.
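Gini impurity itself is simple to state: one minus the sum of squared class probabilities. A feature that splits the target into pure groups drives impurity toward zero. Here is a minimal implementation of the measure; how Assisted Modeling applies it internally is not shown here, this is just the textbook formula.

```python
from collections import Counter

def gini_impurity(labels):
    """Gini impurity: 1 - sum of squared class probabilities.
    0.0 means a pure group; higher values mean more class mixing."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# A group containing only one class is perfectly pure...
print(gini_impurity(["churn", "churn", "churn"]))        # 0.0
# ...while a 50/50 mix is maximally impure for two classes.
print(gini_impurity(["churn", "stay", "churn", "stay"])) # 0.5
```

Features whose splits yield low-impurity groups carry real signal about the target and are good candidates to keep; features that leave labels well mixed add little.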
Perhaps my favorite theorem in all of machine learning is the “No Free Lunch Theorem”. Roughly paraphrased, it implies that there’s no way to know which modeling algorithm is going to be right for any particular dataset. While XGBoost may be best for one set of data, a simple Linear Model could work well for another. Our only solution to this problem is to run multiple models on training data, and empirically see which one works best.
Assisted Modeling’s leaderboard page allows us to do just that, with multiple models optimized to run in parallel given the constraints of your computer.
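The leaderboard’s logic can be sketched as a simple loop: score every candidate on held-out data and rank by the result. The toy rule-based “models” below are hypothetical stand-ins for real estimators like XGBoost or a linear model, and the dataset is invented for illustration.

```python
# A minimal sketch of the empirical model-comparison loop a leaderboard
# automates. The "models" are toy predictors standing in for real ones.
data = [(1, "churn"), (2, "churn"), (3, "stay"), (4, "stay"), (5, "stay")]

def always_stay(x):
    """Baseline: predict the majority class no matter what."""
    return "stay"

def threshold_model(x):
    """Predict churn for low values of the (hypothetical) feature."""
    return "churn" if x <= 2 else "stay"

def accuracy(model, data):
    """Fraction of examples the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)

candidates = {"always_stay": always_stay, "threshold": threshold_model}
leaderboard = sorted(candidates.items(),
                     key=lambda kv: accuracy(kv[1], data), reverse=True)
print(leaderboard[0][0])  # threshold
```

The No Free Lunch Theorem says we cannot know in advance which candidate wins, so running them all and comparing empirically, as the leaderboard does, is the honest answer.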
For many analysts, the most valuable part of Assisted Modeling will be that it helps you get better at machine learning, with the option to see your work graphically or as bare code. It carefully guides you through the modeling process, explaining what it is doing and why, while providing a detailed glossary that explains terms and methodology in plain English. You can simply click through default options or, as you gain experience, begin to experiment on your own, putting the “science” in data science! As you practice, you can skip the assisted mode altogether and build models right on the canvas. Ultimately, you can turn your model into raw Python code, letting you model in the graphical interface and then see and edit in code what your guided modeling experience created.
Whether you’re a newcomer or experienced, Assisted Modeling helps you build or prototype, and ultimately share or explore models in their native Python representation, completing the journey from building blocks to executable code.