This article was written by StephaneP and originally appeared on the Alteryx Engine Works blog here: https://community.alteryx.com/t5/Engine-Works/Four-Methods-to-Consolidate-Files/ba-p/912480
Loading multiple files is an Alteryx core capability. But there can be complications...
Don't worry, you will have learned all the tricks by the end of this article.
The standard tools are can handle most cases but certain projects might have more demanding requirements.
- I have not 1 or 2, but dozens of files to load
- My files are supposed to have the same format but sometimes I have small deviations
- I want to load all files except 2 with certain characteristics
- My users sometimes add extra columns
- The files are not in the same directories
- My number of files varies from one time to another
It is then necessary to switch to other more powerful techniques. It is therefore important to understand these methods and which ones to use depending on what you are looking for:
- simplicity of use
- robustness to format changes
- customization to handle special cases
This article will also cover the different types or errors that may arise. They may be good indicators to switch to the more advanced ways of loading. Often you think your files are all the same, but they are not even if visually they look identical, and Alteryx raises an error to highlight it.
The four options in order of increasing complexity:
- Manual consolidation of files via the UNION tool
- Automatic consolidation via the INPUT tool and "wildcards"
- Automatic consolidation via the DYNAMIC INPUT tool
- Automatic consolidation via a macro package using the INPUT tool
Summary of their strengths/weaknesses :
|Dynamism||Simplicity||Robustness to format changes||Customization to manage special cases|
|1. Manual consolidation of files via the UNION||X||
|2. Automatic consolidation via the INPUT tool and the "wildcards"||XXX||XXX||X||X|
|3. Automatic consolidation via the DYNAMIC INPUT tool||XXX||XX||X||XX|
|4. Automatic consolidation via a macro package using the INPUT tool||XXX||X||XXX||XXX|
Description of the case study
To demonstrate the nuances, we will try to consolidate four files that are supposed to be identical. As is often the case in real life, these files are not really identical and in fact have three distinct formats that can cause problems for standard tools.
FinanceData_APAC.xlx and FinanceData_Europe.xlxs are identical according to this format:
FinanceData_USA.xlx has 2 differences. First the column "Production Price" has been renamed to "Manufacturing Price" but has the same numeric format. Second, an additional column "Comment" has been added at the end.
Finance_USA2.xlsx has the same columns but the format of the "Production Price" column is V_String (text) instead of Double (numeric).
We will see how the different methods react to these "small" differences which are the ones that consume our time every day. It is important to master them to know which technique to use.
1 - Manual consolidation of files via the UNION tool
The UNION tool automatically proposes the output format that will consolidate all the files in terms of format and also the number of columns. The name of the output columns is imposed by the first file read.
There are 3 consolidation options:
- Autoconfig by position (it aligns column 1 to column 1, column 2 to column 2…)
- Autoconfig by name (it aligns based on the name)
- Manually configure fields (= you are doing the mapping)
Here it is the "Auto config by position" option that has been chosen. It allows for the fact that the same column has two different names from one file to another (Production Price / Manufacturing price). It is the name of the column in the first file that will be used. The option “Autoconfig by name” is also very practical when the columns are not in the same order.
- It never results in error and does not require reformatting the source files directly in the file or via Alteryx processing.
- Very easy to use and understand.
- The "By name" option allows you to consolidate files with the same column names but different order.
- Requires hard coding a source per file. Cumbersome if there are many. Impossible to make dynamic if the names or the number change.
- You don't have control over the output format since all the "exotic" columns of the source files are in the output. Not applicable if you control the format of these files.
- The same goes for the format since for the same column it will adapt the format to find one that can absorb all the data types of the sources.
- Only one tab can be loaded per workbook, the one indicated in the entries.
2 - Automatic consolidation via the INPUT DATA tool and the "wildcards"
The wildcards can be used in the file name:
- "*" represents any number of characters
- "?" represents one and only one character, but any character
FinanceData_*.xlsx represents all the files that start with "Finance_Data_" and any number of characters and end with ".xlsx". So this scans the directory and brings back our four files. The INPUT DATA will therefore try to load all the files listed, assuming that they all have exactly the same format. Any discrepancy (column or format) leads to non-loading.
- USA is not loaded because it has a different number of columns ("Comment" in addition)
- USA2 is not loaded because at least one column (“Production Margin” in string) does not have the same format as the first file loaded (same duplicate column)
Warning, this does not result in errors but in warnings. Here are the typical messages:
- "...has a different number of fields than the 1st file in the set and will be skipped"
- "...has a different schema than the 1st file in the set and will be skipped"
- Dynamic this time. No need to name the files. If a new file arrives, no need to update the workflow.
- We control the output format which is necessarily equivalent to the 1st file loaded (since otherwise it is rejected).
- Any discrepancy in the format leads to non-loading.
- To integrate the files in error it is necessary to physically modify them. Nothing can be done with only this tool to automate these corrections.
- Does not allow complex selection of files to be uploaded. You can only retrieve files that respect the wildcard definition. But for example it is impossible to retrieve "only the last updated file," or "all files except APAC."
3 - Automatic consolidation via the DYNAMIC INPUT tool
We specify the directory where the files are located.
And on the same principle as the previous case, we specify the characteristics of the files to load.
The wildcards can be used in the file name:
- "*" which represents any number of characters
- "?" which represents one and only one character, but any character
FinanceData_*.xlsx represents all the files that start with "Finance_Data_" and any number of characters and end with ".xlsx". So this scans the directory and brings back our four files. In output, the object brings back all the characteristics of the files including the complete path that we use in the next step.
This tool allows, on the basis of a list of files, to dynamically load all the files via a single object. It is necessary to specify an example file format and a list of paths to these files.
This example file imposes the expected format. Any file that does not respect this first format will be rejected. The DYNAMIC INPUT will therefore try to load all the files in the list. Its format reference is that of the template (here Europe). The ones that are different are discarded.
- USA is not loaded because it has a different number of columns ("Comment" in addition).
- USA2 is not loaded because at least one column (“Production Margin”) does not have the same format as the first file loaded
The OK lines of the other files are still processed and continue in the flow.
Warning, the non-loading for an additional column reason does not result in an error but in a warning. On the other hand, a format change triggers the error. Here are the typical error messages:
- “…has a different number of fields than the 1st file in the set and will be skipped”
- “…has a different schema than the 1st file in the set and will be skipped”
- Dynamic. No need to name the files. If a new file arrives, no need to update the workflow.
- We control the output format which is necessarily equivalent to that of the 1st file loaded (since otherwise it is rejected).
- You can reprocess the dynamic list of files before processing. For example, take the last updated file, or do not take the one from a particular country, or rename it... There is a step in Alteryx between the generation of the list and the upload that is commonly used to filter files on criteria or update the name of the files to be loaded.
- Any discrepancy in format leads to non-uploading.
- To integrate the files in error, you have to physically modify them. Nothing can be done in this tool to automate these corrections.
- You can only load one tab of a workbook, the one indicated in the template.
4 - Automatic consolidation via a macro package using the INPUT tool
A batch macro is a combination of tools that are packaged and provided with a list of parameters. It will perform N times the packaged treatment for all the entries of the list.
In our case, we package an INPUT DATA object to which we will provide the list of files to be loaded. The strength of this method is that at each iteration, the macro considers that it is processing one and only one entry without taking into account the history. It is as if we had N times an input with one file and not an input with N files. Thus, we no longer have the problem of format divergence between the 1st and the Nth. Each file is being considered as unique so is necessarily coherent with itself.
And this is where the bonus effect appears. The macro output consolidates the results as the UNION object would and therefore handles all the problems of format or number of distinct columns. This way there is no rejection and the formats are standardized and optimized.
Two things about this macro:
- It asks to choose the column which contains the complete path (path + tab). You can therefore reuse it for your own projects.
- It consolidates the outputs based on the position of the columns (not on the name). This can be changed in the macro. See screenshot of the macro.
Warning, the non-loading for an additional column reason does not result in an error but in a warning. On the other hand, a format change triggers an error.
Note: The macro properties can be found by navigating to View > Interface Designer, then clicking on the gear icon at the bottom left.
This is where you determine whether the macro will consolidate the data on the basis of the column position (the 1st with the 1st, the 2nd with the 2nd...) or on the basis of column name (whatever the order, it searches for the names and aligns them).
- Dynamic. No need to name files. If a new file arrives, there is no need to update the workflow.
- Format or column discrepancies are automatically absorbed without rejection and without raising an error.
- The dynamic list of files can be preprocessed. For example, take the last updated file, or do not take the one from a particular country. There is a step in Alteryx between the generation of the list and the loading that is commonly used to filter the files on the criteria.
- Within the macro, you can also integrate other more complex processing (e.g. renaming columns, managing the order of columns, deleting unwanted columns...).
- You can potentially load several tabs. Here we hard-coded the "KPI" tab but we can imagine loading several tabs of workbooks.
- More complex to use because it is necessary to understand macros. Even if, with this example, it remains simple.
- You don't have control over the output format because all the "exotic" columns of the source files are in the output. Not applicable if you master the format of these files.
To see if an object is a macro, activate the option "Options" / "User settings" / "Canvas" / "Display macro indicators on tools".
This will display a (+) on the tools that are macros.
Then just right click on it and launch the "Open Macro" option to open a dedicated tab for this macro.
For more information on macros, see the Macros interactive lessons in the Community.
There you go, no more excuses to not dynamically load everything that moves and automatically manage all these cases without having to touch anything from one launch to the next while keeping it simple when you don't need complexity.
Isn't life great?