This article is by David Ha and originally appeared on the Alteryx Engine Works Blog here: https://community.alteryx.com/t5/Engine-Works-Blog/Web-Scraping-will-it-Alteryx/ba-p/429475
Will it Alteryx?
Starting your web scraping journey with BeautifulSoup and the Python tool takes no time at all.
First, identify a table on a webpage that you'd like to pull into Designer for analysis. Since fantasy football draft season is upon us, and the running back is the most important position of all (sssh, don't tell anyone else that), I'll be analyzing last year's running back stats to try and understand who I might want to draft this season.
With the web page open, use your browser's web development tool to view the HTML source code. On most browsers, this is available by hitting F12. From here, you can simply navigate through the source code watching the blue highlighting on the left as a guide until you find the code supporting the table in question.
The source providing the table in our example is below. We'll need this in a bit...
<table cellpadding="0" border="0" id="data" class="table table-bordered table-striped table-hover tablesorter"> ... </table>
Next, simply drag the Python tool (in the Developer category) on the canvas and add the following code to the Python code editor.
# Import packages from ayx import Package from ayx import Alteryx # Note these two lines only need to be executed one time to install the package. Package.installPackages(['bs4']) Package.installPackages(['html5lib']) import pandas import requests from bs4 import BeautifulSoup # URL of whatever web page you want to scrape url = "https://www.fantasypros.com/nfl/stats/rb.php" page = requests.get(url) soup = BeautifulSoup(page.text, 'html')
Note, the Package.installPackages lines only need to be executed once and require that you launch Designer by right-clicking and selecting Run as Administrator. Run this cell using the Python Run button circled below.
Assuming no errors, you will now have an object of type BeautifulSoup which contains the HTML page ready for parsing. From the source code we identified earlier, we're looking for an HTML tableobject with an id of data. BeautifulSoup makes extracting that information a one-line call:
# Find a specific table that includes an id attribute table = soup.find('table', id="data")
Some HTML tables have no id attribute. In that case, you can simply find all the tables on the page and then specify which one based on its order of appearance on the page:
# for tables without an 'id' attribute, you can simply find all the tables and then grab the one you want. Remember the 1st one will be position '0'. table = soup.find_all('table')
Now that we have the desired table, we need to convert it to a Pandas DataFrame, which is the required format to output from the Python tool to use in an Alteryx workflow. Most solutions you'll find out there for converting an HTML table into a Pandas DataFrame show that you must hardcode the column names, which isn't a very scalable solution. What we want is something that is repeatable regardless of the data being ingested in our workflow. Hello pandas.read_html().
df = pandas.read_html(table.prettify(), header=1, flavor='bs4') df.head()
A few of points on the read_html() function call above:
- The read_html() function returns a list of dataframes, but in our case, there's only one, so we simply grab the first one off the list.
- The prettify() method returns the string representation of the table. We could also use str(table) and accomplish the same thing.
- We must specify to use bs4 as the flavor since our table was parsed using BeautifulSoup. Mmm, flavor, soup, I'm getting hungry.
- The 'header=1' option specifies to use the 1st row from the table as the column headers in our DataFrame. I had to specify this in my example since the table had an extra column header (Rushing, Receiving, Misc) above the real column headers:
If all that processed correctly, our table will be converted to a DataFrame, and we can use the df.head() function to output the first 5 rows for verification:
With my table now converted to a DataFrame, I can send it to the 1st output anchor on the Python tool with the code below:
At this point, we are free to start analyzing the data in question to solve our problem. In my case, the very important problem of understanding what statistics have the biggest impact on fantasy points for running backs.
The combination of the Python tool with the BeautifulSoup package makes scraping data from the web attainable for everyone and provides for a robust, scalable, repeatable solution. If you've tried this out and found success, problems, or a better approach, please leave a comment and let us know!
If you have any technologies you would like to see explored in future installments of the "Will it Alteryx?" series, please leave a comment below!
And lastly, for those wondering, the data shows the highest correlation to fantasy points for running backs is from rushing yards, number of rush attempts, and rush TDs. Yards per carry is less influential, so go for volume!