Introduction
For whatever reason, you might want to (or need to) collect, store, and process unstructured text items with associated metadata. You might, for example, set up a library for research purposes or build a tool that watches the public discourse for relevant conversation pieces. This guide showcases a just-the-basics approach to building one possible implementation of such a tool in Alteryx (there are surely many ways to do it differently or better).
Before We Start
The first step is to decide which kind of item will be collected. Many kinds of items are possible, including:
- News articles
- Blog articles
- Generic websites
- Data rows from any table or database
- Social media comments such as tweets
- Text and metadata from PDF documents
Generating Query Strings
Into the component goes a list of search terms. Out comes a query string suited to the API.
To search News API's collection of news articles, you pass a query string ("q") that consists of individual or logically combined search terms. The list of search terms that make up the query string can become fairly complex depending on what you are searching for.
News API limits the length of the query string, which is also the reason why I only put search terms of more than one word into parentheses:
The following is the setup of the Summarize tool:
In addition, "tesla AND" was added via a Formula tool to limit the number of results for the tutorial by specifying an additional search condition:
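Put together, the query-building step might look roughly like this in Python (a sketch outside of Alteryx; the search terms and the exact bracketing are made up for illustration):

```python
# A minimal sketch of what the Summarize and Formula tools produce:
# only multi-word terms get parentheses (to save characters), the terms
# are joined with OR, and "tesla AND" is prepended as an extra condition.
terms = ["gigafactory", "model 3", "battery day", "autopilot"]

wrapped = [f"({t})" if " " in t else t for t in terms]
q = "tesla AND (" + " OR ".join(wrapped) + ")"
print(q)
# tesla AND (gigafactory OR (model 3) OR (battery day) OR autopilot)
```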
Executing API Requests
In goes the previously generated query string. Out comes an API response for further processing.
The next step looks pretty simple at first, but the magnifying glass icon hides an iterative macro that gave me a headache. Either macros are not implemented as easily as they could be, or I'm just not smart enough for them. Or maybe I'm just spoiled by the ease of use of most other tools in Alteryx.
The iterative macro takes the data provided by the macro input. This consists of the queries generated in step one (including the respective categories and subcategories) as well as the contents of the Formula tool:
I'm not exactly sure whether one needs to "initialize the page number variable" outside the macro, but I'm leaving it in there because it feels right. I read several tutorials and community questions when building the macro; this tutorial was particularly helpful.
This is what the first half of the iterative macro looks like:
In my version of the workflow, the parameters for language, apiKey, and maximum page size are hard-coded into the Download tool (the one with the lightning icon) that performs the actual API requests. One might also pass them into the macro from the workflow in which the macro is embedded.
In many cases, an API key can also be provided via the request headers. This hides the key a bit better from potential eavesdroppers (it's an HTTP request, after all) and may reduce the total length of the query URL. I'm leaving it in the query string, as it's a bit easier to understand that way.
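For reference, here is a minimal Python sketch of the request the Download tool sends (YOUR_API_KEY is a placeholder, and the query string is the one generated in step one):

```python
import requests

# One request against News API's "everything" endpoint.
params = {
    "q": "tesla AND (gigafactory OR (model 3))",
    "language": "en",   # hard-coded in the Download tool
    "pageSize": 100,    # News API's maximum results per page
    "page": 1,          # incremented by the iterative macro
    "apiKey": "YOUR_API_KEY",
}
response = requests.get("https://newsapi.org/v2/everything", params=params)
data = response.json()

# The header-based alternative mentioned above:
# headers = {"X-Api-Key": "YOUR_API_KEY"}
```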
Executing the request will result in a JSON formatted answer by the API. Fortunately, Alteryx has the JSON Parse tool to convert it into table format:
{"status":"ok","totalResults":244,"articles":[{"source":{"id":null,"name":"Cleantechnica.com"},"author":"Guest Contributor","title":"Tesla’s [Ahem, Elon’s] Unconventional Approach To Media Criticism","description":"Negative media coverage is a fact of l...
becomes:
To keep track of the requests, I'm adding a timestamp via a Formula tool.
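To get a feel for the long format the JSON Parse tool produces, here is a rough Python approximation (the exact naming scheme of the tool's output may differ):

```python
import json

def flatten(obj, prefix=""):
    # Yield one (name, value) pair per leaf property, roughly
    # mimicking the long format produced by the JSON Parse tool.
    if isinstance(obj, dict):
        for key, value in obj.items():
            yield from flatten(value, f"{prefix}{key}.")
    elif isinstance(obj, list):
        for i, value in enumerate(obj):
            yield from flatten(value, f"{prefix}{i}.")
    else:
        yield prefix.rstrip("."), obj

raw = '{"status": "ok", "totalResults": 244, "articles": [{"title": "..."}]}'
for name, value in flatten(json.loads(raw)):
    print(name, "=", value)
# status = ok
# totalResults = 244
# articles.0.title = ...
```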
The second half of the iterative macro has two branches:
The upper stream takes only those lines that contain the values for total results per search query. If the number of results is above 100 (the maximum per page for News API), the tools calculate the total number of pages, and the iterative macro then re-runs the query until all pages have been obtained. I'm making use of an IF statement and the ceiling function CEIL() to calculate the total pages. I'm not sure whether this is the most elegant way, but it seems to work:
Only those requests for which the current page value is lower than or equal to the total number of pages remain in the iteration and get passed back to the beginning for another run with page number incremented by 1.
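In Python terms, the paging logic of the macro amounts to something like this (field and variable names are my own; the IF statement covers the single-page case):

```python
import math

# CEIL() in the Alteryx formula maps to math.ceil here.
total_results = 244          # "totalResults" from the first response
page_size = 100              # News API's maximum per page
total_pages = math.ceil(total_results / page_size) if total_results > page_size else 1

page = 1                     # initialized outside the macro
while page <= total_pages:
    # ... request this page and collect its articles ...
    page += 1                # the iteration increments the page number
```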
All other lines (and the line with the total results merged back in for potential statistics and logging purposes further down the workflow) are passed outside the macro (back into the workflow above) via the bottom Output tool. There, they're cached in an Alteryx database file (.yxdb).
Hint: After a successful API request, disable the first two tool containers ("Disable Container" via the icon in the upper left corner) to preserve your allotted number of API calls.
Converting API Responses
In goes the raw response data from the API request. Out come deduplicated items with associated metadata.
In the first half of the conversion workflow, the tabular JSON data is transformed into a much more usable regular table structure. Before this, each line is a property of an item from the JSON response, potentially split up further into sub-properties such as a list of tags within a general "tag" property.
The Cross Tab tool is the opposite of the "transpose" option in Microsoft Excel (of course, Alteryx also has a Transpose tool): it converts rows into columns. In the Cross Tab tool, select the properties that describe a unique item and on which the rows should be grouped (these properties are kept as columns).
Everything except "JSON_Name2" should be clear from the descriptions above. JSON_Name2 is the part of JSON_Name that contains the article number of a certain item on a page.
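A rough Python equivalent of the Cross Tab step could look like this (the sample rows and column names are made up for illustration):

```python
from collections import defaultdict

# Group the long-format rows by the keys that identify a unique item
# (here the query and JSON_Name2, the article number on a page) and
# pivot the property names into columns.
rows = [
    {"query": "tesla", "JSON_Name2": "0", "name": "title", "value": "Article A"},
    {"query": "tesla", "JSON_Name2": "0", "name": "url", "value": "https://example.com/a"},
    {"query": "tesla", "JSON_Name2": "1", "name": "title", "value": "Article B"},
]

items = defaultdict(dict)
for row in rows:
    items[(row["query"], row["JSON_Name2"])][row["name"]] = row["value"]

for key, columns in items.items():
    print(key, columns)
# ('tesla', '0') {'title': 'Article A', 'url': 'https://example.com/a'}
# ('tesla', '1') {'title': 'Article B'}
```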
After some re-arranging and sorting, the data looks like this:
It may happen (and depending on the search terms, it may happen a lot) that articles are found through different search term subcategories. As a remedy, I calculate a unique ID for each item at the beginning of the second half of the workflow. The assumption behind this is that no article will be published via the same URL at the exact same time more than once without it being an exact duplicate. To create the ID, the URL and publication timestamp are concatenated and hashed as an MD5 checksum (this is just to shorten the ID; the raw concatenated string would work as well).
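The ID calculation itself is straightforward; in Python it might look like this (the URL and timestamp are hypothetical):

```python
import hashlib

# Concatenate URL and publication timestamp, then hash with MD5
# to shorten the result into a fixed-length ID.
url = "https://example.com/article"
published_at = "2019-02-01T09:30:00Z"
item_id = hashlib.md5((url + published_at).encode("utf-8")).hexdigest()
print(item_id)
```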
The Summarize tool that follows groups by item IDs and concatenates values that may differ even for the same item. After grouping, items may have duplicate category and even duplicate subcategory values. The latter may be due to the API not deduplicating according to URL and timestamp. As Alteryx does not have a "Concat Unique" option for Summarize (feature request!), the deduplication of metadata within each field must be done manually by splitting the values up into rows again, using the Unique tool and re-concatenating the unique values.
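The split/Unique/re-concatenate detour could be sketched in Python as follows (the separator is an assumption):

```python
# Emulating the missing "Concat Unique": split the concatenated field
# back into rows, keep each value once (the Unique tool), and
# re-concatenate the remaining values.
def concat_unique(concatenated, sep=","):
    seen = []
    for value in concatenated.split(sep):
        if value not in seen:
            seen.append(value)
    return sep.join(seen)

print(concat_unique("economy,technology,economy"))  # economy,technology
```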
The end result is a list of items with associated metadata (see the item with duplicate categories in line 22). Of course, there will still be duplicates as the URL and the timestamp are not sufficient for content deduplication. For example, some news sites will publish one article several times to different websites in some kind of federation (which results in slightly different URLs but exactly the same content).
For more information, visit: https://community.alteryx.com/t5/Engine-Works-Blog/Guide-to-Building-a-Simple-Application-for-Collecting-Text-Items/ba-p/360209