#MakeoverMonday Week 24 – The Watercolours of Tate

First – I apologize.  I did a lot of web editing this week, which led to a series of system fails.  The first was spelling the hashtag wrong.  Next I decided to re-upload the workbook and break the bit.ly link.  What will be the next fail?

Anyway – to rectify the series of fails I decided that the best thing to do would be to create a blog post.  Blog posts merit new tweets and new links!

So week 24’s data was the Tate Collection, which, upon clicking through the link, turns out to be a decent approximation of the artwork housed at Tate.

Looking at the underlying data set, here are the columns we get:

And the records:

So I started off decently excited that there were 2 URLs to leverage in the data set: one with just a thumbnail image and the other a full link to the asset.  However, the Tate website can’t be accessed via HTTPS, so it doesn’t work for on-dashboard URLs on Tableau Public.  I guess Tableau wants us to be secure – and I respect that!

So my first idea – going the all-floating route with an image in the background – was out.

Now my next idea was to limit the data set.  I had originally thought to do the “Castles of Tate” – check out the number of titles:

A solid number: 2,791 works of art – a great foundation to build on.  Except, of course, for what we knew to be true of the data: Turner.

Sigh – this bummed me out.  Apparently only Turner really likes to label works of art with “Castle.”  The same was true for River and Mountain.  Fortunately I was able to see that easily using URL actions in Tableau Desktop (again, can’t do that on Public for security reasons):

Here is a classic Turner castle:

Now yes, it is artwork – but it doesn’t necessarily evoke what I was looking to unearth in the Tate collection.

So I went another path, focusing on the medium.  There was a decent collection of watercolour (intentional European spelling).  And within that a few additional artist representations beyond our good friend Turner.

So this informed the rest of the visualization.  Lucky for me there was a decent amount of distribution date-wise, from both a creation and an acquisition standpoint.  This allowed me to do some really pretty things with binned time buckets.  And, inspired by the Tate logo, I took a very abstract approach to the visualization this week.  The output is intentionally meant for data discovery.  I am not deriving insights for you; I’m building a view for you to explore.
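
For the curious, the binning is nothing fancy – a decade bucket built off the year does the job.  A minimal sketch, assuming the creation year lives in a field I’ll call [Year]:

    // Hypothetical sketch – bucket creation years into decades
    INT([Year] / 10) * 10

The same idea works for the acquisition year.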

One of my favorite elements is the small multiples bubble chart.  This is not intended to aid in cognition; this is intended to be artwork of artwork.  I think that pretty much describes the entire visualization, if I’m being honest.  Something that could stand alone as a picture, perhaps, or be drilled deep to the depths of going to each piece’s website and finding out more.

Some oddities with color I explored this week: using an index placed on the color shelf with a diverging color palette (that’s what is coloring the bubble charts), and using modulo on the individual asset names to spark some fun visual encoding.  Rather than keeping everything one color, I felt breaking up the values in a programmatic way would be fun and different.
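
The modulo piece is a one-liner.  A hedged sketch, assuming the artwork title sits in a field called [Title] and an arbitrary bucket count of 7:

    // Hypothetical sketch – break titles into a handful of color buckets
    LEN([Title]) % 7

Any deterministic integer derived from the name would do; modulo just keeps the number of buckets manageable.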

Perhaps my favorite part is the top section, with the bubble charts and bar charts below and the binned year ranges between.  Pure data art blots.

Here’s the full visualization on Tableau Public – I promise not to tinker further with the URLs.

#WorkoutWednesday Week 23 – American National Parks

I’m now back in full force from an amazing analytics experience at the Alteryx Inspire conference in Las Vegas.  The week was packed with learning, inspiration, and community – things I adore and am honored to be a part of.  Despite the awesome nature of the event, I have to admit I’m happy to be home and keeping up with my workout routine.

So here goes the “how” of Workout Wednesday week 23.  Specifications and backstory can be found on Andy’s blog here.

Here’s a picture of my final product and my general assessment of what the approach would require:

Things you can see from the static image that will be required –

  • Y-axis grid lines are on specific demarcations with ordinal indicators
  • X-axis also has specific years marked
  • Colors are for specific parks
  • Bump chart of parks is fairly straightforward; will require an INDEX() calculation
  • Labels are only on colored lines – tricky

Now here’s the animated version showing how the interactivity works:

  • Highlight box has specific actions
    • When ‘none’ is selected, defaults to static image
    • When park of specific color is selected, only that park has different coloration and it is labeled
    • When park of unspecified color is selected, only that park has different coloration (black) and it is labeled

Getting started is the easy part here – building the bump chart.  Based on the data set and instructions it’s important to recognize that this is limited to parks of type ‘National Historical Park’ and ‘National Park.’  Here’s the basic bump chart setup:

and the custom sort for the table calculation:

Describing this is pretty straightforward – index (rank) each park by the descending sum of recreation visitors for every year.  Once you’ve got that set up, flipping the Y-axis to reversed will get you to the basic layout you’re trying to achieve.
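
For reference, the table calculation itself is nothing more than INDEX() – the work is in the compute-using configuration.  A hedged sketch, with the field name assumed:

    // The bump-chart rank: INDEX(), computed along each park,
    // custom-sorted by SUM([Recreation Visitors]) descending, restarting every year
    INDEX()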

Now – the grid lines and the Y-axis header.  Perhaps I’ve been at this game too long, but anytime I notice custom grid lines I immediately think of reference lines.  Adding constant reference lines gives ultimate flexibility in how they’re labelled and how they’re displayed.  So each of the rank grid lines is a reference line.  You can add the ‘Rank’ header to the axis by creating an ad-hoc calculation that is just the text string ‘Rank.’  A quick note on this: if you add dimensions and measures to your sheet, be prepared to double check and modify your table calculations.  Sometimes dimensions get incorporated when it wasn’t intended.

Now on to the most challenging part of this visualization: the coloration and labels.  I’ll start by saying there are probably several ways to complete this task and this represents my approach (not necessarily the most efficient one):

First up: a calculation to call out colors for specific parks:
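
As a hedged sketch of the shape that calc takes (the park names below are placeholders, not necessarily the ones called out in the viz):

    // Hypothetical sketch – name a handful of parks explicitly, lump the rest together
    IF [Park Name] = 'Great Smoky Mountains NP' THEN 'Great Smoky Mountains NP'
    ELSEIF [Park Name] = 'Grand Canyon NP' THEN 'Grand Canyon NP'
    ELSEIF [Park Name] = 'Yosemite NP' THEN 'Yosemite NP'
    ELSE 'Other'
    END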

(probably should have just used the Grouping functionality, but I’m a fast typer)

Then making a parameter to allow for highlighting:

(you’ll notice here that I already had the right subset of parks; that’s because I made Park Type a data source filter and later an extract filter – thus removing the other park types from the domain)

Once the parameter is made, build in the functionality for it:
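
A hedged sketch of what that parameter-aware calc might look like, assuming a string parameter I’ll call [Highlight Park] with ‘(None)’ as one of its values, plus the specific-park color field from above:

    // Hypothetical sketch – respect the specific-park colors until a park is picked
    IF [Highlight Park] = '(None)' THEN [Park Color]
    ELSEIF [Park Name] = [Highlight Park] THEN [Park Name]
    ELSE 'Other'
    END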

And then I set a calculation to dynamically flip between the two calculations depending on what the parameter was set to.

Looking back on this: I didn’t need the third calculation, it’s exactly the same functionality as the second one.  In fact as I write this, I tested it using the second calculation only and it functions just fine.  I think the over-build speaks to my thought process.

  1. First let’s isolate and color the specific parks
  2. Let’s make all the others a certain color
  3. Adding in the parameter functionality, I need the colors to be there if it is set to ‘(None)’
  4. Otherwise I need it to be black
  5. And just for kicks, let’s ensure that when the parameter is set to ‘(None)’ it really does fall back to the colors I specified in the first calc
  6. Otherwise I want the functionality to follow calc 2

Here’s the last bit of logic to get the labels on the lines.  Essentially I know we want to label the end point, and because of how labeling works I’ll have to allow all labels to be visible and then determine which marks actually return a value for the label.  PS: I’m really happy to use the match-mark-color label functionality on this viz.
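
A hedged sketch of that label field, reusing the assumed names from above – it returns the park name only for parks that are colored or currently highlighted, and NULL for everything else, so the ‘blank’ labels simply never draw:

    // Hypothetical sketch – only colored or highlighted parks return a label
    IF [Park Color] <> 'Other' OR [Park Name] = [Highlight Park] THEN [Park Name] END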

And the label setting:

That wraps up the build for this week’s workout, with the last steps being to add additional detail to the tooltip and to stylize.  A great workout that demonstrates the compelling nature of interactive visualization and the ever-reliable bump chart.

Interact with the full visualization here on my Tableau Public.

Alteryx Inspire – Day 1

When I went to the Tableau Conference last year, I felt it was important to spend some time documenting my experience.  Anytime I go to a conference related to my professional aspirations I’m always taken by the wealth of knowledge that’s uncovered.

The Alteryx Inspire conference is a pared-down conference with about 2,000 attendees.  It is comfortably housed in the Aria hotel across 2 spacious and open floors.  There are escalators that split between level 3 and level 1 – there’s a nice flow to it and plenty of natural light.  Events take place over three days: Monday, Tuesday, and Wednesday.  Monday is mostly a product training day and the bulk of the sessions are the remainder of the week.  The opening keynote is Tuesday.

This year – being my first – I was extremely fortunate to be able to attend and to do the product training track.  This gave me a firsthand opportunity to see how the company sells and trains on its own tool.  Facilitators are typically great at selling the ‘why’ and ‘how’ behind something.

Today I sat for a full day going through the introduction to Alteryx Designer.  Not because it was my first time using the tool, but because I believe there’s something very powerful about origin stories.  There’s something you learn in the first 30 minutes that someone who doesn’t have the ‘formal training’ may never pick up.  That happened for me today and it was great to see everything in action.

As an advocate for data-informed decision making, I find the tool indispensable.  Just by listening to the 100+ people in my classroom, it’s scary to witness firsthand how young many businesses still are when it comes to accessing their data.  Yes, there have been really great strides, but so many people are just at the beginning.  I chuckle when I hear the typical ‘Excel’ analogies, but the overwhelming majority are nodding at how much they relate to the joke.

I’ve always seen Alteryx as a natural companion for a data analyst.  For anyone out there trying to manage data, it offers up a solution – if only for the single act of being able to see a visual output of the thought process and work that went into producing a data model.  A data model or report that can be shared, saved, printed (please don’t print), and most importantly: communicated.  For someone doing data prep, blending, and gathering – this is how you explain to your boss what you do.  This is the demonstration of what it takes to be the data wrangler.  This is how you share your critical thinking skills.

I’ve just scratched the surface and have 2 more full days of Alteryx – days already peppered with amazing collaboration opportunities and shared enthusiasm.  The vibe is chill, the people are great, and the mission is achievable.

Tomorrow is another day and an opportunity to take the building blocks and dream of skyscrapers.

#MakeoverMonday Week 22 – Internet Usage by Country

This week’s data set captures the number of internet users per 100 people by country, spanning several years.  The original data set and accompanying visualization start as an interactive map with the ability to animate through the changing values year by year.  Additionally, the interactor can click into a country to see percentage changes, or compare changes across multiple countries.

Channeling my inner Hans Rosling – I was drawn to play through the animation of the change by year, starting with 1960.  What sort of narrative could I see play out?

Perhaps it was the developer inside of me, but I couldn’t get over the color legend.  For the first 30 years (1960 to 1989) there are only a few data points, all signifying zero.  Why?  Does this mean that those few countries actually measured this value in those years, or is it just bad data?  Moving past the first 30 years, my mind started trying to resolve the rest of the usage changes.  However – here again my mind was hurt by the coloration.  The color legend shifts from year to year.  There’s always a green, greenish yellow, yellow, orange, and red.  How am I to ascertain growth or change when I must constantly refer to the legend?  Sure, there’s something to be said for comparing country to country, but it loses alignment once you start paginating through the years.

Moving past my general take on the visualization – there were certain things I picked up on and wanted to carry forward into my makeover.  The first was the value out of 100 people.  Because the color legend was increasing year to year, the overall number of users had to be increasing.  Similarly, when comparing the countries, coloration changed, meaning ranks were changing.

I’ll tell you – my mind was originally drawn to the idea of 3 slope charts sitting next to each other: one representing the first 5 years, the next 5 years, and so on, with each country as a line.  Well, that wasn’t really possible because the data has 1990 to 2000 as the first set of years – so I went down the path of the first 10 years.  It doesn’t tell me much other than something somewhat obvious: internet usage exploded from 1990 to 2000.

Here’s how the full set would have maybe played out:

This is perhaps a bit more interesting, but my mind doesn’t like the 10-year gap between 1990 and 2000, the five-year gaps from 2000 to 2010, and then the annual measurements from 2010 to 2015 (which I didn’t include on this chart).  More to the point, it seems to me that 2000 may be a better starting measurement point.  And it created the inflection point of my narrative.

Looking at this chart – I went ahead and decided my narrative would be to show not only how much more internet usage there is per country, but also how certain countries have grown throughout the time periods.  I limited the data set to the top 50 in 2015 to eliminate some of the data noise (there were 196 members in the country domain; when I cut it to 100 there were still some 0s in 2000).

To help demonstrate that usage was simply more prolific overall, I developed a consistent dimension to bucket the number of users.  So as you read it, the marks go from light gray to blue depending on the value.  The point being that as we get nearer in time, there’s more dark blue and no light gray.
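
A hedged sketch of what such a bucket dimension might look like, assuming the measure is a field called [Users per 100] (the bucket width is arbitrary):

    // Hypothetical sketch – a consistent bucket for users per 100 people
    IF [Users per 100] < 20 THEN '0-20'
    ELSEIF [Users per 100] < 40 THEN '20-40'
    ELSEIF [Users per 100] < 60 THEN '40-60'
    ELSEIF [Users per 100] < 80 THEN '60-80'
    ELSE '80-100'
    END

Because the buckets are fixed, the same color means the same thing in 2000 as it does in 2015 – exactly what the original’s shifting legend lacked.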

And then I went the route of a bump chart to show how the ranks have changed.  Norway had been at the top of the charts; now it’s Iceland.  When you hover over the lines you can see what happened.  And in some cases it makes sense – a country already dominating usage can only increase so far.

But there are some amazing stories that can unfold in this data set: check out Andorra.  It went from #33 all the way up to #3.

You can take this visualization and step back into different years and benchmark each country on how prolific internet usage was during the time.  And do direct peer comparatives to boot.

This one deserves time spent focused on the interactivity of the visualization.  That’s part of the reason why it is so dense at first glance.  I’m intentionally trying to get the end user to see 3 things up front: overall internet usage in 2000 (by size and color encoding) and the starting rank of countries, the overall global increase in internet usage (demonstrated by coloration change over the spans), and then who the current usage leader is.

Take some time to play with the visualization here.

Workout Wednesday Week 21 – Part 1 (My approach to existing structure)

This week’s Workout Wednesday had us taking NCAA data and developing a single chart that shows the cumulative progression of a basketball game – more specifically, a line chart where the X-axis is a countdown of time and the Y-axis is the current score.  There’s some additional detail in the form of the size of each dot representing 1, 2, or 3 points.  (See cover photo.)

Here’s what the underlying data set looks like:

Comparing the data structure to the image and what needs to be produced my brain started to hurt.  Some things I noticed right away:

  • Teams are in separate columns
  • Score is consolidated into one column and only displayed when it changes
  • Time amount is in 20 minute increments and resets each half
  • Flavor text (detail) is in separate columns (the team columns)
  • Event ID restarts each half, seriously.

My mind doesn’t like that there’s a team dimension that isn’t in a single column.  It doesn’t like the restarting time either.  It really doesn’t like the way the score is done.  These aren’t numbers I can aggregate; they are raw outputs stored as strings.

Nonetheless, my goal for the Workout was to take what I had in that structure and see if I could make the viz.  What I don’t know is this: did Andy do it the same way?

My approach:

First I needed to get the X-axis working.  I’ve done a good bit of work with time, so I knew a few things needed to happen.  The first part was to convert what was in MM:SS to seconds.  In my mind, this changes the data to a continuous axis that I can later format back into MM:SS.  Here’s the calculation:

I cheated and didn’t write my calculated field for longevity.  I saw that there was a dropped digit in the data and compensated by breaking the string up into two parts.  A more holistic way would be to check whether the string is of length 4, prepend a ‘0’ if so, and then go about the same process.  Here are the described results showing the domain:
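
Purely as a hedged sketch of that more holistic version, assuming the raw clock lives in a string field I’ll call [Time] with values like ‘9:45’ or ‘19:45’:

    // Hypothetical sketch – [Time Padded]: restore a dropped leading digit so every value is MM:SS
    IF LEN([Time]) = 4 THEN '0' + [Time] ELSE [Time] END

    // [Seconds Remaining] – convert the padded MM:SS string to total seconds
    INT(LEFT([Time Padded], 2)) * 60 + INT(RIGHT([Time Padded], 2))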

Validation check: the time goes from 0 to 20 minutes (0 to 20*60 seconds aka 1200 seconds).  We’re good.

Next I needed to format that time into a continuous MM:SS format.  I took that calculation from Jonathan Drummey.  I’ve used this more than once, so my Google search is appropriately ‘Jonathan Drummey time formatting.’  The resultant time ‘measure’ was almost there, but I wasn’t taking into consideration the +20 minutes for the first half and that the time axis spans the full game.  So here are the two calculations that I made (first the +20 mins, then the formatting):
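
A hedged sketch of the first of those, assuming a [Half] field exists (or has been derived):

    // Hypothetical sketch – first-half events are a further 20 minutes from the end of the game
    IF [Half] = 1 THEN [Seconds Remaining] + (20 * 60) ELSE [Seconds Remaining] END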

At this point I felt like I was kind of getting somewhere – almost to the point of making the line chart – but I needed to break apart the teams.  For that bit I leveraged the fact that the individual team fields only have detail in them when that team scores.  Here’s the calc:
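
A hedged sketch, assuming the two flavor-text columns are named for the teams (e.g. [UNC] and [Gonzaga]):

    // Hypothetical sketch – derive a single Team dimension from whichever detail column is populated
    IF NOT ISNULL([UNC]) THEN 'UNC'
    ELSEIF NOT ISNULL([Gonzaga]) THEN 'Gonzaga'
    END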

I still don’t have a lot going on – at best I have a dot plot where I can draw out the event ID and start plotting the individual points.

Getting the score was relatively easy.  I also did this in a way that’s custom to the data set, with 3 calculations – find the left score, find the right score, then tag the scores to the teams.
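
A hedged sketch of those three, assuming the consolidated score column holds strings like ‘71-65’ with the left team first (the real calcs may slice the string differently):

    // Hypothetical sketches
    // [LeftScore]
    INT(SPLIT([Score], '-', 1))
    // [RightScore]
    INT(SPLIT([Score], '-', 2))
    // [Team Score] – tag the right number to the team on the row, using the derived [Team] field
    IF [Team] = 'UNC' THEN [LeftScore] ELSE [RightScore] END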

Throwing that on rows, here’s the viz:

All the events are out of order and this is really difficult to understand.  To get closer to the view I did a few things all at once:

  • Reverse the time axis
  • Add Sum of the Team Score to the path
  • Put a combined half + event field on detail (since event restarts per half)

Also – I tried Event & Half separately and my lines weren’t connected (broken at half time), so creating a derived combined field proved useful at connecting the line for me.
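
Something as simple as this hedged sketch does the trick:

    // Hypothetical sketch – a combined key so the line isn't broken at half time
    STR([Half]) + '-' + STR([Event ID])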

Here’s that viz:

It’s looking really good.  Next steps are to get the dots to represent the ball sizes.

One of my last calculations:

That got dropped on Size on a duplicated and synchronized “Team Score.”  Getting the pesky null to not display in the legend was a simple right click and ‘Hide.’  I also had to sort the Ball Size dimension to align with the perceived sizing.  The line size was also made super skinny.

Now some cool things happened because of how I did this: I could leverage the right and left scores for tooltips.  I could also leverage them in the titling of the overall scores, e.g. UNC = {MAX([LeftScore])}.

Probably the last component was counting the number of baskets (within the scope of making it a single returned value in a title per the specs of the ask).  Those were repeated LODs:
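
A hedged sketch of one of them, with field names assumed (one such calc per team and basket value):

    // Hypothetical sketch – UNC's count of 3-point baskets, returned as a single value for the title
    { FIXED : SUM(IIF([Team] = 'UNC' AND [Ball Size] = 3, 1, 0)) }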

And thankfully the final component – the oversized scores on the last marks – could be accomplished with the ‘Always Show’ label option.

Now I profess this may not be the most efficient way to develop the result, heck here’s what my final sheet looks like:

All that being said: I definitely accomplished the task.

In Part 2 of this series, I’ll be dissecting how Andy approached it.  We obviously did something different, because it seems like he may have used the Attribute function (I saw some * in tooltips).  My final viz has all data points and no asterisks (e.g. 22:03 remaining, UNC).  Looking at that part, mine has each individual point and the score at each instantaneous spot; his drops the score.  Could it be that he tiptoed around the data structure in a very different way?

I encourage you to download the workbook and review what I did via Tableau Public.


#MakeoverMonday Week 21 – Are Britons Drinking Less?

After some botched attempts at reestablishing routine, #MakeoverMonday week 21 got made within the time-boxed week!  I have one pending makeover and an in-progress blog post to talk about Viz Club and the 4 vizzes developed during that special time.  But for now, a quick recap of the how and why behind this week’s viz.

This week’s data set was straightforward – aggregated measures sliced by a few dimensions.  And in what I believe is now becoming an obvious trend in how data is published, it included both aggregate and lower-level members within the same field (read this as “men,” “women,” “all people”).  The structured side of me doesn’t like it and screams for me to exclude the aggregates from any visualizations, but this week I figured I’d take a different approach.

The key questions asked related to alcohol consumption frequency by different age and gender combinations (plus those aggregates) – so there was lots of opportunity to compare within those dimensions.  More to that, the original question and how the data was presented begged to be rephrased into what became the more direct title (Are Britons Drinking Less?).

The question really informed the visualizations – and more to that point, the phrasing of the original article seemed to dictate to me that this was a “falling measure.”  Meaning it has been declining for years or year-to-year, or now compared to then – you get the idea.

With it being a falling measure and already in percentages, using a “difference from first” table calculation was a natural progression.  With that calculation the first year of the measure is anchored at zero and subsequent years are compared to it – essentially asking and answering, for every year, “was it more or less than the first year we asked?”  Here’s the beautiful small multiple:
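
For reference, Tableau’s generated “difference from first” formula looks roughly like this, assuming the survey measure is a field called [Percent] and the calc is computed along the years:

    // Difference from the first year in the partition
    ZN(SUM([Percent])) - LOOKUP(ZN(SUM([Percent])), FIRST())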

Here the demographics are set to color, lightest blue being youngest to darkest blue being oldest; red is the ‘all.’  I actually really enjoyed being able to toss the red on there for a comparison and it is really nice to see the natural over/under of the age groups (which mathematically follows if they’re aggregates of the different groups).

One thing I did to add further emphasis was to put positive deltas on size – that is to say, to over-emphasize (in a very subdued, probably-only-Ann-appreciates-the-humor kind of way) when a group bucks the trend.  Or more directly stated: draw the reader’s attention to the points where the percentage response has increased.

Here’s the resultant:

So older demographics are drinking more than they used to and that’s fueled by women.  This becomes more obvious to the point of the original article when looking at the Teetotal groups and seeing many more fat lines.

Here’s the calculation to create the line sizing:
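
In spirit it’s the same difference-from-first expression, gated on being positive – a hedged sketch (the sizes themselves are arbitrary):

    // Hypothetical sketch – fatten the line only when the change versus the first year is positive
    IF ZN(SUM([Percent])) - LOOKUP(ZN(SUM([Percent])), FIRST()) > 0 THEN 2 ELSE 1 END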

Last up was to make one more view to help sell the message.  I figured a dot plot would mimic champagne bubbles in a very abstract way.  And I also thought open/closed circles in combination with the color encoding would be pleasant for readers.  The last custom change there was to flip the vertical axis of time to run in reverse.  Time is read top-down, and you can see it start to push down and to the left in some of the different groupings.

If you go the full distance and interact with the dashboard, the last thing I hope you’ll notice and appreciate is the color legend/filter bar at the top.  I hate color legends because they lack utility.  Adding in a treemap version of a legend that does double duty as highlight buttons is my happy medium (and only when I feel like color encoding is not actively communicated enough).

#MakeoverMonday Week 18

{witty intro}  This week’s makeover challenge was to take Sydney ferry data for 7 ferry lines and 8 months.  What’s even better is there was another dimension with a domain of 9 members.  This is a dream data set.  I say it’s a dream from the perspective of having two dimensions that can be manipulated and managed (no deciding HOW they have to be reduced or further grouped) and there’s decent data volume with each one.

In the world of visualization, I think this is a great starter data set.  And it was fun for me because I could focus on some of the design rather than deciding on a deep analytical angle.  Plus in the spirit of the original, my approach was to redo the output of “who’s riding the ferries” and make it more accessible.

So the lowdown: first decision made was the color palette.  The ferry route map had a lot of greens in it.  And obviously a lot of blues because of water.

So I wanted to take that idea and push it one step further.  That landed me in a world of deep blues and greens – with the darkest blue/green throughout typically representing the “most” of something.

These colors informed most decisions that came afterward.  I really wanted to stick to small multiples on this one, just given the line-up of the two medium/small-domained dimensions.  Unfortunately – nothing of that nature turned out very interesting.  Here’s an example:

It’s okay and somewhat interesting – especially giving each row the opportunity to have a different axis range.  But you can see the “problem” immediately: there are a few routes that are pretty flat, and further to that, end users are likely going to be frustrated by the independent axes when they dive deeper to compare.

Pivoting from that point led me to the conclusion that the dimensions shouldn’t necessarily be shown together, but instead one within the other.  But – worth noting – in the small multiple above you can see that the ‘Adult’ fare is just the most everywhere, all the time.  Which led to this guy:

Where the bars are overall and the dots are Adult fares.  I felt that representing them in this context could free up the other dwarfed fare types to play with the data.

The last step from my end was to highlight those fare types and add a little whimsy.  I knew switching to % of total would be ideal because of the differing trip amounts for each route.  Interpret this as: normalizing to proportions gave the opportunity to compare the routes.
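
That is just the standard percent-of-total quick table calculation, roughly (assuming a [Trips] measure, computed within each route):

    // Percent of total trips within the partition
    SUM([Trips]) / TOTAL(SUM([Trips]))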

I actually landed on the area chart by accident – I was stuck with lines, did my typical CTRL + drag of same pill to try and do some fun dual axis… and Tableau decided to automatically build me an area chart.

The original view of this was obviously not as attractive and I’ve done a few things to enhance how this displays.  The main thing was to eliminate the adult fare from the view visually.  We KNOW it’s the most, let’s move on.  Next was to stretch out the data a bit to see what’s going on in the remaining 30%-ish of rides. (Nerd moment: look at what I titled the sheet.)

Finishing up – there’s some label magic to show only the fare types that are non-adult.  I also RETAINED the axis labels – I am hoping this helps demonstrate and draw attention to the tagged axis at 50%.  What’s probably the most fun about this viz: you can hover over that same blue space and see the adult contribution – no data lost.
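
The label magic is nothing exotic – a hedged sketch, with the field name assumed:

    // Hypothetical sketch – return a label only for the non-adult fare types
    IF [Fare Type] <> 'Adult' THEN [Fare Type] END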

Overall I’m happy with the final effect.  A visually attractive display of data that hopefully invites users into deeper exploration.  Smaller dimension members given a chance to shine, and some straightforward questions asked and answered.

#MakeoverMonday Week 17

After a bit of life prioritization, I’m back in full force on a mission to contribute to Makeover Monday.  To that end, I’m super thrilled to share that I’ve completed my MBA.  I’ve always been an individual destined not to settle for one higher education degree, so having that box checked has felt amazing.

Now on to the Makeover!  This week’s data set was extra special because it was published on the Tableau blog – essentially more incentive to participate and contribute (there’s plenty of innate incentive IMO).

The data was courtesy of LinkedIn and represented 3 years’ worth of “top skills.”  Here’s my best snapshot of the data:


This almost perfectly describes the data set, minus the added bonus of there also being a ‘Global’ member in the Country dimension.  Mixing aggregations – or concepts of what people believe can be aggregated – made me sigh just a little bit.  I also sighed at seeing that some countries are missing 2014 skills and that 2016 is truncated to 10 skills each.

So the limitations of the data set meant there had to be some clever dealing to get around this.  My approach was to take it from a 2016 perspective, and furthermore to “look back” to 2014 whenever there was any sort of comparison.  I made the decision to eliminate “Global” and any countries without 2014 data from the data set.  I find that the data lends itself best to comparison within a given country (my perspective) – so eliminating countries was something I could rationalize.

Probably the only visualization I really cared about was a slope chart.  I thought this would be a good representation of how a skill has gotten hotter (or not).  Here’s that:

Some things I did to jazz it up a bit: added a simple boolean expression on color to denote whether the rank has improved since 2014, and added reference lines for the years to anchor the lines.  I’ve done slope charts different ways, but this one somehow evolved into this approach.  Here’s what the sheet looks like:

Walking through it, starting with the filter shelf: I’ve got an action filter on Country (driven by action filter buttons elsewhere on the dashboard).  Year has been added to context and 2015 eliminated.  A data source filter removed the countries without 2014 data, plus Global.  Skill is filtered to an LOD for 2016 Rank <> 0, which ensures I’m only using 2016 skills.  The context filters keep everything looking pretty for the countries.
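
A hedged sketch of that 2016 Rank LOD, with field names assumed:

    // Hypothetical sketch – a skill's 2016 rank within its country (0 when it didn't chart in 2016)
    { FIXED [Country], [Skill] : MAX(IIF([Year] = 2016, [Rank], 0)) }

Filtering that field to <> 0 keeps only the skills that actually appear on the 2016 list.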

The year lines are reference lines – all headers are hidden.  There’s a dual axis on Rows to get a line chart and a circle chart.  The second Year in Columns is redundant and left over from an abandoned labeling attempt (but it automatically adds nice dual labels to my reference lines).

Just as a note – I had combined the 2016 LOD with a 2014 LOD to do some cute math for line size – I didn’t like it, so I abandoned it.

The last steps were to add additional context to the “value” of 2016 skills – so, a quick unit chart and word cloud.  One thing I like to do on my word clouds these days is square the values on size.  I find that this makes the visual indicator for size easier to understand.  What’s great here is that a smaller rank is better, so instead of “^2” it became this:
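
I won’t misquote the exact field, but the idea is an inverted square – a hedged sketch:

    // Hypothetical sketch – invert the rank (smaller is better) before squaring for size
    (1 / [2016 Rank]) ^ 2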

Sometimes math just does you a real solid.

The kicker of this entire data set for me, and the knowledge gained: Statistical Analysis and Data Mining are hot!  Super hot!  I also really like that User Interface Design and Algorithm Design made it into the top 10 for the United States.  I would tell anyone that a huge component of my job is designing analytical outputs for all types of users, and that requires an amount of UX design.  And coincidentally, I’m building an algorithm to determine how to eliminate a backlog, all in Tableau.  (Basic linear equation.)

#MakeoverMonday Week 12 – All About March Madness

This week’s Makeover Monday topic was based on an article attempting to explain why it is harder for people to correctly pick their March Madness brackets.  The original visualization is this guy:

With most Makeover Monday approaches I like to review the inspiration and visualization and let that somewhat decide the direction of my analysis. In this case I found that completely impossible. For my own sake I’m going to try and digest what it is I’m seeing/interpreting.

  • Title indicates that we’re looking at the seeds making the final 4
  • Each year is represented as a discrete value
  • I should be able to infer from the title that the number above each column represents the sum of Final Four seeds
  • In the article it says in 2008 that all the seeds which made it to the final 4 were #1 – validates my logical assumption
  • Tracking this down further, I am now thinking each color represents a region – no idea which colors mean what – I take that back, I think they are ranked by the seed value (it looks like the first instance of the best seed rank is always yellow)
  • And then there’s an annotation tacked on: % of Final Four teams seeded 7th or lower for two different time periods
    • Does 5.2% from 1985 to 2008 equal: count the blue bars (plus one red bar) with values >=7 – that’s 5 out of (24 * 4) = 5/96 = 5.2%
    • Same logic for the second statistic: 7 out of (8 * 4) = 21.9%

And then there’s the final distraction of the sum of the seed values above each bar.  What does this accomplish?  Am I going to use it to quickly try and calculate an “average seed value” for each year?  Because my math degree didn’t teach me to compute ratios at the speed of thought – it taught me to solve problems by using a combination of algorithms and creative thinking.  It also doesn’t help me with understanding interesting years – the height of the stacked bars does this just fine.

So to me this seems like an article where they’ve decided to take up more real estate and beef up the analysis with a visual display.  It’s not working and I’m sad that it is a “Chart of the Day.”

Now on to what I did and why.  I’ll add a little preface and say that I was VERY compelled to do a repeat of my Big Game Battle visualization, because I really like the idea of using small multiples to represent sports and team flux.  Here’s that display again:

Yes – you have to interact to understand, but once you do it is very clear.  Each line represents a win/loss result for the teams.  They are then bundled together by their regions to see how they progressed into the Superbowl.  In the line chart it is a running sum.  So you can quickly see that the Patriots and Falcons both had very strong seasons.  The 49ers were awful.

So that was my original inspiration, but I didn’t want to do the same thing and I had less time.  So I went a super-distilled route, cutting down the idea behind the original article even further: let’s just focus on the seed rank of the teams in the championship game.  To an extent I don’t really think there’s a dramatic story in the Final Four rankings – the “worst” seed that made it there was 11th, and we don’t even know if that team made it further.

In my world I’ve got championship winners vs. losers, with position indicating their seed rank.  Color represents the result for the team for the year, and for overall visual appeal I’ve made the color a ramp.  To help orient the reader, I’ve added min/max ranks (I screwed this up and scoped it to pane for the winner when it should have been table, like it is for the loser, but it looks nice anyway).  I’ve also added strategic years to help demonstrate that it’s a timeline.  If you were to interact, you’d see the name of the team and a few more specifics about what it is you’re looking at.

The reality of my takeaway here: a #1 seed usually wins.  Consistently wins, wins in streaks.  And there’s even a fair number of #1 losers.  If I had to make a recommendation based on 32 years of championships: pick the #1 seeds and stick with them.  Using the original math from the article: 19 out of 32 winners were seed #1 (60%) and 11 out of 32 losers were seed #1 (34%).  Odds of a #1 being in the final 2 across all the years?  47% – and yes, that is said very tongue in cheek.

Makeover Monday Week 10 – Top 500 YouTube Game(r) Channels

We’re officially 10 weeks into Makeover Monday, which is a phenomenal achievement.  This means that I’ve actively participated in recreating 10 different visualizations with data varying from tourism, to Trump, to this week’s Youtube gamers.

First, some commentary people may not like to read: the data set was not that great.  There’s one huge reason why: one of the measures (plus a dimension) was a dependent variable derived from two independent variables, and that dependent variable was produced by a pre-built algorithm.  So it would almost make more sense to use the resultant dependent variable to enrich other data.

I’m being very abstract right now – here’s the structure of the data set:

Let’s walk through the fields:

  • Rank – this is based entirely on the sort chosen at the top of the site (for this view it is by video views; not sure what those random 2 are, I just screencapped the site)
  • SB Score/Rank – this is some sort of ranking value applied to a user based on a proprietary algorithm that takes a few variables into consideration
  • SB Score (as a letter grade) – the letter grade expression of the SB score
  • User – the name of the gamer channel
  • Subscribers – the # of channel subscribers
  • Video Views – the # of video views

As best as I can tell from reading the methodology, the SB score/rank (the # and the alpha) are influenced in part by the subscribers and video views.  Which means putting these in the same view is really sort of silly.  You’re kind of at a disadvantage if you scatterplot subscribers vs. video views, because the score is purportedly more accurate in terms of finding overall value/quality.

There’s also not enough information contained within the data set to amass any new insights on who is the best and why.  What you can do best with this data set is summarization, categorization, and displaying what I consider data set “vitals.”

So this is the approach that I took.  And more to that point, I wanted to make over a very specific chart style that I have seen Alberto Cairo employ a few times throughout my 6-week adventure in his MOOC.

That view: a bar chart sliced through with lines to help understand size of chunks a little bit better.  This guy:

So my energy was focused on that – which only happened after I did a few natural (in my mind) steps in summarizing the data, namely histograms:

Notice here that I’ve leveraged the axis values across all 3 charts (starting with SB grade and carrying through to its sibling charts to minimize clutter).  I think this has a decent effect, but I admit that the bars aren’t equal width across each bar chart.  That’s not pleasant.

My final two visualizations were to demonstrate magnitude and add more specifics in a visual manner to what was previously a giant text table.

The scatterplot helps achieve this by displaying the 2 independent variables, with the overall “SB grade” encoded on both color and size.  Note: for size I used powers of 2 (2^9, 2^8, 2^7 … 2^1), which gave a decent exponential effect to break up the sizing in a consistent manner.
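
A hedged sketch of how that size field might be expressed – the grade-to-size mapping below is assumed, not the exact one used:

    // Hypothetical sketch – map the letter grade to an explicit power-of-2 size
    CASE [SB Grade]
        WHEN 'A+' THEN 512   // 2^9
        WHEN 'A'  THEN 256
        WHEN 'A-' THEN 128
        WHEN 'B+' THEN 64
        WHEN 'B'  THEN 32
        WHEN 'B-' THEN 16
        WHEN 'C+' THEN 8
        WHEN 'C'  THEN 4
        ELSE 2               // 2^1
    END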

The unit chart on the right is to help demonstrate not only the individual members, but display the elite A+ status and the terrible C+, D+, and D statuses.  The color palette used throughout is supposed to highlight these capstones – bright on the edges and random neutrals between.

This is aptly named an exploration because I firmly believe the resultant visualization was built to broadly pluck away at the different channels and get intrigued by the “details.”  In a more real-world scenario I would be out hunting for additional data to tie back to this – money, endorsements, average video length, number of videos uploaded, subject matter area, type of ads utilized by the user.  All of these, appended to this basic metric aimed at measuring a user’s “influence,” would lead down the path of a true analysis.