#IronViz – Let’s Go on a Pokémon Safari!

It’s that time again – Iron Viz!  The second round of Iron Viz entered my world via an email with a very enticing “Iron Viz goes on Safari!” theme.  My mind immediately got stuck on one thing: Pokémon Safari Zone.

Growing up I was a huge gamer and Pokémon was (and still is) one of my favorites.  I even have a cat named after a Pokémon, it’s Starly (find her in the viz!).  So I knew if I was going to participate that the idea for Pokémon Safari was the only way to go.

I spent a lot of time thinking about how I might want to bring this to life.  Did I want to do a virtual safari of all the pocket monsters?  Did I want to focus on the journey of Ash Ketchum through the Safari Zone?  Did I want to focus on the video games?

After all the thoughts swirled through my mind – I settled on the idea of doing a long form re-creation of Ash Ketchum’s adventure through the Safari Zone in the anime.  I sat down and googled to figure out the episode number and go watch.  But to my surprise the episode has been banned.  It hasn’t made it on much TV and the reason it is banned makes it very unattractive and unfriendly for an Iron Viz long form.  I was gutted and had to set off on a different path.

The investment into the Safari Zone episode got me looking through the general details of the Safari Zone in the games.  And that’s what ended up being my hook.  I tend to think in a very structured format and because there were 4 regions that HAD Safari Zones (or what I’d consider to be the general spirit of one) it made it easy for me to compare each of them against each other.

Beyond that I knew I wanted to keep the spirit of the styling similar to the games.  My goal for the viz is to give the end user an understanding of the types of Pokémon in each game.  To show some basic details about each pocket monster, but to have users almost feel like they’re on the Safari.

There’s also this feeling I wanted to capture – for anyone who has played Pokémon you may know it.  It’s the shake of the tall grass.  It is the tug of the Fishing Pole.  It’s the screen transition.  In a nutshell: what Pokémon did I just encounter?  There is a lot of magic in that moment of tall grass shake and transition to ‘battle’ or ‘encounter’ screen.

My hope is that I captured that well with the treemaps.  You are walking through each individual area and encountering Pokémon.  For the seasoned Safari-goer, you’ll be more interested in knowing WHERE you should go and understanding WHAT you can find there.  Hence the corresponding visuals surrounding.

The last component of this visualization was the Hover interactivity.  I hope it translates well because I wanted the interactivity to be very fluid.  It isn’t a click and uncover – that’s too active.  I wanted this to be a very passive and openly interactive visualization where the user would unearth more through exploring and not have to click.

#WorkoutWednesday Week 24 – Math Musings

The Workout Wednesday for week 24 is a great way to represent where a result for a particular value falls with respect to a broader collection.  I’ve used a spine chart recently on a project where most data was centered around certain points and I wanted to show the range.  Propagating maximums, minimums, averages, quartiles, and (when appropriate) medians can help to profile data very effectively.

So I started off really enjoying where this visualization was going.  Also because the spine chart I made on a recent project was before I even knew the thing I developed had already been named.  (Sad on my part, I should read more!)

My enjoyment turned into caution really quickly once I saw the data set.  There are several ratios in the data set and very few counts/sums of things.  My math brain screams trap!  Especially when we start tiptoeing into the world of what we semantically call “average of all” or “overall average” or something that somehow represents a larger collective (“everybody”).  There is a lot of open-ended interpretation that goes into this particular calculation and when you’re working with pre-computed ratios it gets really tricky really quickly.

Here’s a picture of the underlying data set:


Some things to notice right away – the ratios for each response are pre-computed.  The number of responses is different for each institution.  (To simplify this view, I’m on one year and one question).

So the heart of the initial question is this: if I want to compare my results to the overall results, how would I do that?  Now there are probably 2 distinct camps here.  1: take the average of one of the columns and use that to represent the “overall average”.  Let’s be clear on what that is: it is the average pre-computed ratio of a survey.  It is NOT the observed percentage of all individuals surveyed.  That would be option 2: the weighted average.  For the weighted average or to calculate a representation of all respondents we could add up all the qualifying respondents answering ‘agree’ and divide it by the total respondents.

Now we all know this concept of average of an average vs. weighted average can cause issues.  Specifically we’d feel the friction immediately if there were several low-end responses commingled with several higher response capturing entities.  EX: Place A: 2 people out of 2 answered yes (100%) and  Place B: 5 out of 100 answered ‘yes’ (5%).  If we average 100% and 5% we’ll get 52.5%.  But what if we take 7 out of 102, that’s 6.86% – a way different number.  (Intentionally extreme example.)

So my math brain was convinced that the “overall average” or “ratio for all” should be inclusive of the weights of each Institution.  That was fairly easy to compensate for: take each ratio and multiply it by the number of respondents to get raw counts and then add those all back up together.

The next sort of messy thing to deal with was finding the minimums and maximums of these values.  It seems straightforward, but when reviewing the data set and the specifications of what is being displayed there’s caution to throw with regard to level of aggregation and how the data is filtered.  As an example, depending on how the ratios are leveraged, you could end up finding the minimum of 3 differently weighted subjects to a subject group.  You could also probably find the minimum Institution + subject result at the subject level of all the subjects within a group.  Again I think the best bet here is to tread cautiously over the ratios and get into raw counts as quickly as possible.

So what does this all mean?  To me it means tread carefully and ask clear questions about what people are trying to measure.  This is also where I will go the distance and include calculations in tool tips to help demonstrate what the values I am calculating represent.  Ratios are tricky and averaging them is even trickier.  There likely isn’t a perfect way to deal with them and it’s something we all witness consistently throughout our professional lives (how many of us have averaged a pre-computed average handle time?).

Beyond the math tangent – I want to reiterate how great a visualization I think this is.  I also want to highlight that because I went deep-end math on it that I decided to go deep end development different.

The main difference from the development perspective?  Instead of using reference bands, I used a gannt bar as the IQR.  I really like using the bar because it gives users an easier target to hover over.  It also reduce some of the noise of the default labeling that occurs with reference lines.  To create the gannt bar – simply compute the IQR as a calculated field and use it as the size.  You can select one of the percentile points to be the start of the mark.

#MakeoverMonday Week 25 | Maricopa County Ozone Readings

We had another giant data set this week – 202 million records of EPA Ozone readings across the United States.  The giant data set is generously hosted by Exasol.  I encourage you to register here to gain access to the data.

The heart of the data is pretty straight forward – PPM readings across several sites around the nation for the past 25+ years.  As I went through and browsed the data set, it’s easy to see that there are multiple readings per site per day.  Here’s the basic data model:

Parameter Name only has Ozone, Units of Measure only has Parts per million.  There is one little tweak to this data set – the Datum field.  Now this wasn’t a familiar term for me, so I described the domain to see what it had.

I know exactly what one of these 4 things means (beyond Unknown) – that’s WGS84.  I was literally at the Alteryx Inspire conference two weeks ago and in a Spatial Analytics session where people were talking about different standards for coordinate systems on Earth.  The facilitators mentioned that WGS84 was a main standard.  For fun I decided to plot the number of records for each Datum per year to see how the Lat/Lon have potentially changed in measurement over time.  Since 2012 it seems like WGS84 has dominated as the preferred standard.

So armed with that knowledge I sort of kept it in my back pocket of something I may need to be mindful of if I enter the world of mapping.

Beyond that, I had to start my focus on preparing something for Tableau Public.  202 million records unfortunately won’t sit on Public and I have to extract the data.  Naturally I did what every human would do and zeroed in on my city: Phoenix Metropolitan area aka Maricopa County.

So going through the data set there are multiple sites that are taking measurements.  And more than that, these sites are taking measurements multiple times per day.  I really wanted to express that somehow in my final visualization.  Here’s all the site averages plotted each day for the past 30 years – thanks Exasol!

So this is averaged per day per site – and you can see how much variation there is.  Some are reporting very low numbers, even zeros.  Some are very high.

If I take off the site ID, here’s what I get for the daily averages:

Notice the Y-axis – much less dramatic.  Now the EPA has the AQI measurements and it doesn’t even get into the “bad” range until 0.071 PPM (Unhealthy for Sensitive Groups).  So there’s less of a story to some extent when we take the averages.  This COULD be because of the sites in Maricopa county (maybe there are low or faulty numbers dragging down the average) or it could be because when you do the average you’re getting better precision of truth.

I’m going down this path because at this point I decided to make a decision: I wanted to look at the maximum daily measurement.  Given that these are instantaneous measurements, I felt that knowing the maximum measurement in a given day would also provide insight and value into how Ozone levels are faring.  And more specifically, knowing my region a little bit – the measurement sites could be outside of well populated areas and may naturally have lower occurring measurements.

So that was step one for me: move to the world of MAX.  This let me leverage all the site data and get going.  (Also originally I wanted to jitter and display all the sites because I thought that would be interesting – I distilled the data down further because I wasn’t getting what I wanted in terms of presentation in the end result).

Okay – next up was plotting the data.  I wanted to do a single page very dense data display that had all the years and the months and allowed for easy comparisons.  I had thought a cycle plot may be appropriate, but after trying a few combinations I didn’t see anything special about day of the week additions and noticed that the measurement really is about time of year (the month).  Secondary comparison being each year.

Now that I’ve covered that part – next up was how to plot.  Again, this originally started out its life as dots that were going to be color encoded using the AQI scale with PPM on the Y-axis.  And I almost published it that way.  But to be honest with you, I don’t know if the minutia of the PPM really matters that much.  I think that dimension defined on top of the measurement is easier for an end user to understand.  Hence my final development fork: turn the categorical result into a unit measure (1, 2, 3, 4 etc.) as a byproduct to represent height of a bar chart.  And that’s where I got really inspired.  I made “Good” -1 and “Moderate” 0.  That way anything positive on the Y-axis is a bad day.  To me this will allow you to see the streaks of bad throughout the time periods.

Close up of 2015 – I love this.  Look at those moderates just continuing the axis.  Look how clear the not so good to very bad is.  This resonates with me.

Okay – so final steps here were going to be to have a map of all the measurements at each site (again the max for each site based on the user clicking a day).  It was actually quite cute showing Phoenix more close up.  And then I was going to have national readings (max for each site upon clicking a day) as a comparison.  This would have been super awesome – here’s the picture:

So good.  And perhaps I could have kept this, but knowing I have to go to Tableau Public – it just isn’t going to handle the national data well.  So I sat on this for an evening and while I was driving to work I decided to do a marginal chart that showed the breakdown of number of days of each type.  The “why” was because it looks like things are getting better – more attention needs to be drawn to that!

So last steps ended up being to add on the marginal bar charts and then go one step further to isolate the “bad days” per year and have them be the final distilled metric at the far far right.  My thought process: scan each year, get an idea of performance, see it aggregated to the bar chart, then see the bad as a single number.  For sheer visual pleasure I decided to distill the “bad” further into one more chart.  I had a stacked bar chart to start, but didn’t like it.  I figured for the sake of artistry I could get away with the area chart and I really like the effect it brings.  You can see that the “very bad” days have become less prominent in recent years.

So that pretty much sums up the development process.  Here’s the full viz again and a comparison to the original output for Maricopa County, which echos the sentiment of my maximums – Ozone measurements are going down.



#MakeoverMonday Week 24 – The Watercolours of Tate

First – I apologize.  I did a lot of web editing this week that has led to a series of system fails.  The first was spelling the hashtag wrong.  Next I decided to re-upload the workbook and ruin the bit link.  What will be the next fail?

Anyway – to rectify the series of fails I decided that the best thing to do would be to create a blog post.  Blog posts merit new tweets and new links!

So week 24’s data was the Tate Collection, which upon click through of this link indicates it is a decent approximation of artwork housed at Tate.

Looking at the underlying data set, here’s the columns we get:

And the records:

So I started off decently excited about the fact that there were 2 URLs to leverage in the data set.  One with just a thumbnail image only and the other a full link to the asset.  However, the Tate website can’t be accessed via HTTPS, so it doesn’t work for on dashboard URLs on Tableau Public.  I guess Tableau wants us to be secure – and I respect that!

So my first idea of going the route of all float with an image in the background was out.

Now my next idea was to limit the data set.  I had originally thought to do the “Castles of Tate” – check out the number of titles:

A solid number: 2,791 works of art.  A great foundation for the underneath.  Except of course for what we knew to be true of the data: Turner.

Sigh – this bummed me out.  Apparently only Turner really likes to label works of art with “Castle.”  Same was true for River and Mountain.  Fortunately I was able to easily see that using the URL actions on Tableau Desktop (again can’t do that on Public because of security reasons):

Here is a classic Turner castle:

Now yes, it is artwork – but doesn’t necessarily evoke what I was looking to unearth in the Tate collection.

So I went another path, focusing on the medium.  There was a decent collection of watercolour (intentional European spelling).  And within that a few additional artist representations beyond our good friend Turner.

So this informed the rest of the visualization.  Lucky for me there was a decent amount of distribution date wise, both from a creation and acquisition standpoint.  This allowed me to do some really pretty things with binned time buckets.  And inspired by the Tate logo: I took a very abstract approach to the visualization this week.  The output is intentionally meant for data discovery.  I am not deriving insights for you, I’m building a view for you to explore.

One of my most favorite elements is the small multiples bubble chart.  This is not intended to aid in cognition, this is intended to be artwork of artwork.  I think that pretty much describes the entire visualization if I’m being honest.  Something that could stand alone as a picture perhaps or be drilled deep to the depths of going to each piece’s website and finding out more.

Some oddities with color I explored this week included: using an index and placing that on the color shelf with a diverging color palette (that’s what is coloring the bubble charts).  And also using modulo on the individual asset names to spark some fun visual encoding.  Better than all one color, I felt breaking up the values in a programmatic way would be fun and different.

Perhaps my most favorite of this is the top section with the bubble charts and bar charts below with the binned year ranges between.  Pure data art blots.

Here’s the full visualization on Tableau Public – I promise not to tinker further with the URLs.

#WorkoutWednesday Week 23 – American National Parks

I’m now back in full force from an amazing analytics experience at the Alteryx Inspire conference in Las Vegas.  The week was packed with learning, inspiration, and community – things I adore and am honored to be a part of.  Despite the awesome nature of the event, I have to admit I’m happy to be home and keeping up with my workout routine.

So here goes the “how” of this week’s Workout Wednesday week 23.  Specifications and backstory can be found on Andy’s blog here.

Here’s a picture of my final product and my general assessment of what would be required for approach:

Things you can see from the static image that will be required –

  • Y axis grid lines are on specific demarcations with ordinal indicators
  • X-axis also has specific years marked
  • Colors are for specific parks
  • Bump chart of parks is fairly straight forward, will require index() calculation
  • Labels are only on colored lines – tricky

Now here’s the animated version showing how interactivity works

  • Highlight box has specific actions
    • When ‘none’ is selected, defaults to static image
    • When park of specific color is selected, only that park has different coloration and it is labeled
    • When park of unspecified color is selected, only that park has different coloration (black) and it is labeled

Getting started is the easy part here – building the bump chart.  Based on the data set and instructions it’s important to recognize that this is limited to parks of type ‘National Historical Park’ and ‘National Park.’  Here’s the basic bump chart setup:

and the custom sort for the table calculation:

Describing this is pretty straight for – index (rank) each park by the descending sum of recreation visitors every year.  Once you’ve got that setup, flipping the Y-axis to reversed will get you to the basic layout you’re trying to achieve.

Now – the grid lines and the y-axis header.  Perhaps I’ve been at this game too long, but anytime I notice custom grid lines I immediately think of reference lines.  Adding constant reference lines gives ultimate flexibility in what they’re labelled with and how they’re displayed.  So each of the rank grid lines are reference lines.  You can add the ‘Rank’ header to the axis by creating an ad-hoc calculation of a text string called ‘Rank.’  A quick note on this: if you add dimensions and measures to your sheet be prepared to double check and modify your table calculations.  Sometimes dimensions get incorporated when it wasn’t intended.

Now on to the most challenging part of this visualization: the coloration and labels.  I’ll start by saying there are probably several ways to complete this task and this represents my approach (not necessarily the most efficient one):

First up: making colors for specific parks called out:

(probably should have just used the Grouping functionality, but I’m a fast typer)

Then making a parameter to allow for highlighting:

(you’ll notice here that I had the right subset of parks, this is because I made the Park Type a data source filter and later an extract filter – thus removing them from the domain)

Once the parameter is made, build in functionality for that:

And then I set a calculation to dynamically flip between the two calculations depending on what the parameter was set to.

Looking back on this: I didn’t need the third calculation, it’s exactly the same functionality as the second one.  In fact as I write this, I tested it using the second calculation only and it functions just fine.  I think the over-build speaks to my thought process.

  1. First let’s isolate and color the specific parks
  2. Let’s make all the others a certain color
  3. Adding in the parameter functionality, I need the colors to be there if it is set to ‘(None)’
  4. Otherwise I need it to be black
  5. And just for kicks, let’s ensure that when the parameter is set to ‘(None)’ that I really want it to be the colors I’ve specified in the first calc
  6. Otherwise I want the functionality to follow calc 2

Here’s the last bit of logic to get the labels on the lines.  Essentially I know we’re going to want to label the end point and because of functionality I’m going to have to require all labels to be visible and determine which ones actually have values for the label.  PS: I’m really happy to use that match color functionality on this viz.

And the label setting:

That wraps up the build for this week’s workout with the last components being to add in additional components to the tooltip and to stylize.  A great workout that demonstrates the compelling nature of interactive visualization and the always compelling bump chart.

Interact with the full visualization here on my Tableau Public.

Alteryx Inspire – Day 1

When I went to the Tableau Conference last year, I felt it was important to spend some time documenting my experience.  Anytime I go to a conference related to my professional aspirations I’m always taken by the wealth of knowledge that’s uncovered.

The Alteryx Inspire conference is a pared down conference with about 2,000 attendees.  It is comfortably housed in the Aria hotel across 2 spacious and open floors.  There are escalators that split between level 3 and level 1 – there’s nice flow to it and plenty of natural light.  Events take place over three days: Monday, Tuesday, and Wednesday.  Monday is mostly a product training day and the bulk of sessions are the remainder of the week.  Opening keynote is Tuesday.

This year – being my first – I was extremely fortunate to be able to attend and to do the product training track.  This gives me a firsthand opportunity to see how the product company sells and trains on its tool.  Facilitators are typically great at selling the ‘why’ and ‘how’ behind something.

Today I sat for a full day going through the introduction to Alteryx Designer.  Not because it was my first time using the tool, but because I believe there’s something very powerful about origin stories.  There’s something you learn in the first 30 minutes that someone who doesn’t have the ‘formal training’ may never pick up.  That happened for me today and it was great to see everything in action.

As an advocate for data-informed decision making the tool is indispensable.  Just by listening to the 100+ in my classroom, it’s scary to witness firsthand the youth that exists with businesses accessing data.  Yes, there have been really great strides, but so many people are just at the beginning.  I chuckle when I hear the typical ‘Excel’ analogies, but the overwhelming majority are nodding with how much they relate to the joke.

I’ve always seen Alteryx as a natural companion for a data analyst.  For anyone out there trying to manage data it offers up a solution.  If only for the single act of being able to see a visual output of the thought process and work that went in to producing a data model.  A data model or report that can be shared, saved, printed (please don’t print), and most importantly: be communicated.  For someone doing data prep, blending, gathering – this is how you explain to your boss what you do.  This is the demonstration of what it takes to be the data wrangler.  This is how you share your critical thinking skills.

I’ve just scratched the surface and have 2 more full days of Alteryx.  One that has already been peppered with amazing collaboration opportunities and sharing of enthusiasm.  The vibe is chill, the people are great, and the mission is achievable.

Tomorrow is another day and an opportunity to take the building blocks and dream of skyscrapers.