#WorkoutWednesday Week 24 – Math Musings

The Workout Wednesday for week 24 is a great way to represent where a result for a particular value falls with respect to a broader collection.  I’ve used a spine chart recently on a project where most data was centered around certain points and I wanted to show the range.  Propagating maximums, minimums, averages, quartiles, and (when appropriate) medians can help to profile data very effectively.

So I started off really enjoying where this visualization was going, partly because I made the spine chart on that recent project before I even knew the thing I developed had already been named.  (Sad on my part, I should read more!)

My enjoyment turned into caution really quickly once I saw the data set.  There are several ratios in the data set and very few counts/sums of things.  My math brain screams trap!  Especially when we start tiptoeing into the world of what we semantically call “average of all” or “overall average” or something that somehow represents a larger collective (“everybody”).  There is a lot of open-ended interpretation that goes into this particular calculation and when you’re working with pre-computed ratios it gets really tricky really quickly.

Here’s a picture of the underlying data set:


Some things to notice right away – the ratios for each response are pre-computed, and the number of responses is different for each institution.  (To simplify this view, I’ve filtered to one year and one question.)

So the heart of the initial question is this: if I want to compare my results to the overall results, how would I do that?  Now there are probably 2 distinct camps here.  1: take the average of one of the columns and use that to represent the “overall average”.  Let’s be clear on what that is: it is the average pre-computed ratio of a survey.  It is NOT the observed percentage of all individuals surveyed.  That would be option 2: the weighted average.  For the weighted average or to calculate a representation of all respondents we could add up all the qualifying respondents answering ‘agree’ and divide it by the total respondents.

Now we all know this concept of average of an average vs. weighted average can cause issues.  Specifically we’d feel the friction immediately if entities with only a handful of responses were commingled with entities that captured far more.  EX: Place A: 2 people out of 2 answered ‘yes’ (100%) and Place B: 5 out of 100 answered ‘yes’ (5%).  If we average 100% and 5% we’ll get 52.5%.  But if we take 7 out of 102, that’s 6.86% – a way different number.  (Intentionally extreme example.)
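To make the gap concrete, here is a minimal Python sketch of the Place A / Place B example.  The variable names are just for illustration; the numbers are the ones from the example.

```python
# The two-place example: (yes answers, respondents).
places = {"Place A": (2, 2), "Place B": (5, 100)}

# Option 1: average the pre-computed ratios, ignoring how many people answered.
ratios = [yes / total for yes, total in places.values()]
average_of_ratios = sum(ratios) / len(ratios)

# Option 2: pool the raw counts first, then divide once.
total_yes = sum(yes for yes, _ in places.values())
total_respondents = sum(total for _, total in places.values())
weighted_average = total_yes / total_respondents

print(f"Average of ratios: {average_of_ratios:.2%}")  # 52.50%
print(f"Weighted average:  {weighted_average:.2%}")   # 6.86%
```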

So my math brain was convinced that the “overall average” or “ratio for all” should be inclusive of the weights of each Institution.  That was fairly easy to compensate for: take each ratio and multiply it by the number of respondents to get raw counts and then add those all back up together.
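A rough sketch of that compensation in Python, assuming each row carries a pre-computed ratio and a respondent count.  The function name and the institution figures are made up for illustration, and because published ratios are usually rounded, the recovered counts are only approximate.

```python
def weighted_ratio(rows):
    """rows: list of (ratio, respondents) pairs, one per institution."""
    # Recover the approximate raw count behind each pre-computed ratio.
    counts = [ratio * respondents for ratio, respondents in rows]
    total_respondents = sum(respondents for _, respondents in rows)
    if total_respondents == 0:
        raise ValueError("no respondents to weight by")
    # Add the counts back up and divide by everybody who answered.
    return sum(counts) / total_respondents

# Made-up institutions: (share answering 'agree', number of respondents).
institutions = [(0.80, 50), (0.60, 200), (0.90, 10)]
overall = weighted_ratio(institutions)
print(f"Ratio for all respondents: {overall:.1%}")
```

The largest institution pulls the overall figure toward its own ratio, which is exactly the behavior a plain average of the three ratios would hide.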

The next sort of messy thing to deal with was finding the minimums and maximums of these values.  It seems straightforward, but when reviewing the data set and the specifications of what is being displayed there’s reason for caution with regard to the level of aggregation and how the data is filtered.  As an example, depending on how the ratios are leveraged, you could end up finding the minimum of 3 differently weighted subjects that roll up to a subject group.  You could also find the minimum Institution + subject result across all the subjects within a group.  Again I think the best bet here is to tread cautiously over the ratios and get into raw counts as quickly as possible.
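Here is a small sketch of how the "minimum" changes with the level of aggregation.  The rows, institution names, and subjects are all invented; the point is only that the lowest subject-level ratio and the lowest pooled group-level ratio are different numbers answering different questions.

```python
from collections import defaultdict

# Made-up rows: (institution, subject group, subject, ratio agreeing, respondents).
rows = [
    ("Inst 1", "Sciences", "Physics", 0.50, 10),
    ("Inst 1", "Sciences", "Biology", 0.90, 90),
    ("Inst 2", "Sciences", "Physics", 0.70, 40),
    ("Inst 2", "Sciences", "Biology", 0.80, 60),
]

# Minimum A: lowest institution + subject ratio found anywhere in the group.
min_subject_level = min(ratio for _, _, _, ratio, _ in rows)

# Minimum B: pool each institution's subjects into raw counts first,
# then take the lowest institution-level ratio for the group.
agree = defaultdict(float)
respondents = defaultdict(int)
for inst, group, _, ratio, n in rows:
    agree[(inst, group)] += ratio * n
    respondents[(inst, group)] += n
group_ratios = {key: agree[key] / respondents[key] for key in agree}
min_group_level = min(group_ratios.values())

print(f"Minimum at subject level: {min_subject_level:.0%}")
print(f"Minimum at group level:   {min_group_level:.0%}")
```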

So what does this all mean?  To me it means tread carefully and ask clear questions about what people are trying to measure.  This is also where I will go the distance and include calculations in tool tips to help demonstrate what the values I am calculating represent.  Ratios are tricky and averaging them is even trickier.  There likely isn’t a perfect way to deal with them and it’s something we all witness consistently throughout our professional lives (how many of us have averaged a pre-computed average handle time?).

Beyond the math tangent – I want to reiterate how great a visualization I think this is.  I also want to highlight that because I went deep-end math on it, I decided to go off the deep end on the development side too and build it differently.

The main difference from the development perspective?  Instead of using reference bands, I used a Gantt bar as the IQR.  I really like using the bar because it gives users an easier target to hover over.  It also reduces some of the noise of the default labeling that occurs with reference lines.  To create the Gantt bar – simply compute the IQR as a calculated field and use it as the size.  You can select one of the percentile points to be the start of the mark.
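The arithmetic behind the bar can be sketched outside of Tableau.  This Python version uses made-up results and the standard library's `statistics.quantiles`; quartile methods differ between tools, so treat the exact cut points as an assumption rather than what Tableau would return.

```python
import statistics

# Made-up institution-level results (percent agreeing) for one question.
results = [52, 58, 61, 64, 67, 70, 73, 79, 85]

# quantiles(n=4) returns the three cut points: Q1, median, Q3.
q1, median, q3 = statistics.quantiles(results, n=4, method="inclusive")

bar_start = q1      # the percentile point where the Gantt mark begins
bar_size = q3 - q1  # the IQR, used as the size (length) of the mark

print(f"Start the bar at {bar_start} and size it by {bar_size}")
```

Starting the mark at the 25th percentile and sizing it by the IQR makes the bar end at the 75th percentile, which is the same span a reference band would cover.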

March & April Combined Book Binge

Time for another recount of the content I’ve been consuming.  I missed my March post, so I figured it would be fine to do a combined effort.

First up:

The Icarus Deception by Seth Godin

In my last post I mentioned that I got a recommendation to tune in to Seth and got the opportunity to hear him firsthand on Design Matters.  Well, here’s the first full Seth book I’ve consumed and it didn’t disappoint.  If I had to describe what this book contains – I would say that it is a near manifesto for the modern artist.  The world is run by industrialists and the artist is trying to break through.

I appreciate how Seth frames the concept of an artist – he unpacks the term and invites or ENCOURAGES everyone to identify as such.  Being an artist means being emotionally invested, showing up, giving a shit.  That giving a shit, caring, connecting is ALL there is.  That you succeed in the world by connecting, by sharing your art.  These concepts and ideals resonate deeply with me.  He also explains how vulnerable and gutting it can be to live as an artist – something I’ve felt and experienced several times.

During the course of listening to this book I was on site with a client.  We got to a certain point, agreed on the direction and visualizations, then shared them with the broader team.  The broader team came heavy with design suggestions – most notably the green/red discussion came into play.  I welcome these challenges and as an artist and communicator it is my responsibility to share my process, listen to feedback, and collaborate to find a solution.  That definitely occurred throughout the process, but honestly caused me to lose my balance for a moment.

As I reflected on what happened – I was drawn to this idea that as a designer I try to have ultimate empathy for the end user.  And furthermore the amount of care given to the end user is never fully realized by the casual interactor.  A melancholy realization, but one that should not be neglected or forgotten.

Moving on to the next book:

Rework by Jason Fried & David Heinemeier Hansson

This one landed in my lap because it was available while I was perusing library books.

A quick read that talks about how to succeed in business.  It takes an extreme focus on being married to a vision and committing to it.  The authors focus on getting work done.  Sticking to a position and seeing it through.  I very much appreciated that they were PROUD of decisions they made for their products and company.  Active decisions NOT to do something can be more liberating and make someone more successful than being everything to everyone.

Last up was this guy:

Envisioning Information by Edward Tufte

A continuation of reading through all the Tufte books.  I am being lazy by saying “more of the same.”  Or “what I’ve come to expect.”  These are lazy terms, but they encapsulate what Tufte writes about: understanding visual displays of information.  Analyzing at a deep level the good, bad, and ugly of displays to get to the heart of how we can communicate through visuals.

I particularly loved some of the amazing train time tables displayed.  This concept of using lines to represent timing of different routes was amazing to see.  And the way color is explored and leveraged is on another level.  I highly recommend this one if the thought of verbalizing your witnessing of Tufte’s strong tongue-in-cheek style sounds entertaining.  I know for me it was.

Makeover Monday Week 10 – Top 500 YouTube Game(r) Channels

We’re officially 10 weeks into Makeover Monday, which is a phenomenal achievement.  This means that I’ve actively participated in recreating 10 different visualizations with data varying from tourism, to Trump, to this week’s YouTube gamers.

First some commentary people may not like to read: the data set was not that great.  There’s one huge reason why it wasn’t great: one of the measures (plus a dimension) was a dependent variable on two independent variables.  And that dependent variable was processed via a pre-built algorithm.  So it would almost make sense to use the resultant dependent variable to enrich other data.

I’m being very abstract right now – here’s the structure of the data set:

Let’s walk through the fields:

  • Rank – this is a component based entirely on the sort chosen by the top (for this view it is by video views, not sure what those random 2 are, I just screencapped the site)
  • SB Score/Rank – this is some sort of ranking value applied to a user based on a proprietary algorithm that takes a few variables into consideration
  • SB Score (as a letter grade) – the letter grade expression of the SB score
  • User – the name of the gamer channel
  • Subscribers – the # of channel subscribers
  • Video Views – the # of video views

As best as I can tell through reading the methodology – SB score/rank (the # and the alpha) are influenced in part from the subscribers and video views.  Which means putting these in the same view is really sort of silly.  You’re kind of at a disadvantage if you scatterplot subscribers vs. video views because the score is purportedly more accurate in terms of finding overall value/quality.

There’s also not enough information contained within the data set to amass any new insights on who is the best and why.  What you can do best with this data set is summarization, categorization, and displaying what I consider data set “vitals.”

So this is the approach that I took.  And more to that point, I wanted to make over a very specific chart style that I have seen Alberto Cairo employ a few times throughout my 6 week adventure in his MOOC.

That view: a bar chart sliced through with lines to help understand size of chunks a little bit better.  This guy:

So my energy was focused on that – which only happened after I did a few natural (in my mind) steps in summarizing the data, namely histograms:

Notice here that I’ve leveraged the axis values across all 3 charts (starting with SB grade and through to its sibling charts) to minimize clutter.  I think this has decent effect, but I admit that the bars aren’t equal width across each bar chart.  That’s not pleasant.

My final two visualizations were to demonstrate magnitude and add more specifics in a visual manner to what was previously a giant text table.

The scatterplot helps to achieve this by displaying the 2 independent variables with the overall “SB grade” encoded on both color and size.  Note: for size I did powers of 2: 2^9, 2^8, 2^7…2^1.  This was a decent exponential effect to break up the sizing in a consistent manner.
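The sizing scheme amounts to a lookup from grade to a power of 2.  A minimal sketch follows; the grade list and its ordering are assumptions for illustration (nine levels so that the exponents run from 9 down to 1), and the actual grades in the data set may differ.

```python
# Assumed ordering of SB grades from best to worst (hypothetical list).
grades_best_to_worst = ["A+", "A", "A-", "B+", "B", "B-", "C+", "D+", "D"]

# Best grade gets the largest exponent; each step down halves the size.
top_exponent = len(grades_best_to_worst)
size_by_grade = {
    grade: 2 ** (top_exponent - i)
    for i, grade in enumerate(grades_best_to_worst)
}

for grade, size in size_by_grade.items():
    print(f"{grade:>2}: {size}")
```

Because each grade is half the size of the one above it, the top grades stay visually dominant while the lower grades remain distinguishable from each other.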

The unit chart on the right is to help demonstrate not only the individual members, but display the elite A+ status and the terrible C+, D+, and D statuses.  The color palette used throughout is supposed to highlight these capstones – bright on the edges and random neutrals between.

This is aptly named an exploration because I firmly believe the resultant visualization was built to broadly pluck away at the different channels and get intrigued by the “details.”  In a more real world I would be out hunting for additional data to tag this back to – money, endorsements, average video length, number of videos uploaded, subject matter area, type of ads utilized by the user.  All of these appended to this basic metric aimed at measuring a user’s “influence” would lead down the path of a true analysis.

The Flow of Human Migration

Today I decided to take a bit of a detour while working on a potential project for #VizForSocialGood.  I was focused on a data set provided by UNICEF that showed the number of migrants from different areas/regions/countries to destination regions/countries.  I’m pretty sure it is the direct companion to a chord diagram that UNICEF published as part of their Uprooted report.

As I was working through the data, I wanted to take it and start at the same place.  Focus on migration globally and then narrow the focus in on children affected by migration.

Needless to say – I got sidetracked.  I started by wanting to make paths on maps showing the movement of migrants.  I haven’t really done this very often, so I figured this would be a great data set to play with.  Once I set that up, it quickly devolved into something else.

I wasn’t satisfied with the density of the data.  The clarity of how it was displayed wasn’t there for me.  So I decided to take an abstract take on the same concept.  As if by fate I had received Chart Chooser cards in the mail earlier and Josh and I were reviewing them.  We were having a conversation about the various uses of each chart and brainstorming on how it could be incorporated into our next Tableau user group (I really do eat, drink, and breathe this stuff).

Anyway – one of the charts we were talking about was the sankey diagram.  So it was already on my mind and I’d seen it accomplished multiple times in Tableau.  It was time to dive in and see how this abstraction would apply to the geospatial.

I started with Chris Love’s basic tutorial of how to set up a sankey.  It’s a really straightforward read that explains all the concepts required to make this work.  Here’s the quick how-to in my paraphrased words.

  1. Duplicate your data via a Union, identify the original and the copy.  (Which is great because I had already done this for the pathing.)  As I understand it from Chris’s write-up this lets us ‘stretch out’ the data so to speak.
  2. Once the data is stretched out, it’s filled in by manipulating the binning feature in Tableau.  My interpretation would be that the bins ‘kind of’ act like dimensions (labeled out by individual integers).  This becomes useful in creating individual points that eventually turn into the line (curve).
  3. Next there are ranking functions made to determine the starting and end points of the curves.
  4. Finally the curve is built using a mathematical function called a sigmoid function.  This is basically an S-shaped function that levels off toward an asymptote at each end (0 and 1 for the logistic form, -1 and 1 for the tanh form) and has a steeper middle area.
  5. After the curve is developed, the points are plotted.  This is where the ranking is set up to determine the leftmost and rightmost points.  Chris’s original specifications had the ranking straightforward for each of the dimensions.  My final viz is a riff on this.
  6. The last steps are to switch the chart to a line chart and then build out the width (size) of the line based on the measure you used in the ranking (percent of total) calculation.
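The curve-building steps above can be sketched in Python.  This is an illustration of the idea rather than the Tableau calculations themselves: it assumes the logistic form of the sigmoid (which runs from 0 to 1), and the function names, the 49-point count, and the -6 to 6 range for t are all assumptions.

```python
import math

def sigmoid(t):
    """Logistic sigmoid: close to 0 for very negative t, close to 1 for very positive t."""
    return 1.0 / (1.0 + math.exp(-t))

def sankey_curve(start_rank, end_rank, points=49, t_min=-6.0, t_max=6.0):
    """Return (t, y) pairs tracing one chord from its start rank to its end rank."""
    step = (t_max - t_min) / (points - 1)
    curve = []
    for i in range(points):
        t = t_min + i * step
        # Blend from the start position to the end position along the S-curve.
        y = start_rank + (end_rank - start_rank) * sigmoid(t)
        curve.append((t, y))
    return curve

# One chord running from 0.2 on the origin side to 0.7 on the destination side.
curve = sankey_curve(start_rank=0.2, end_rank=0.7)
print(curve[0], curve[len(curve) // 2], curve[-1])
```

Each point plays the role of one of the bins from step 2, and when every start rank collapses to the same value, all chords fan out from a single point, which is the tree-root effect.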

So I did all those steps and ended up with exactly what was described – a sankey diagram.  A brilliant one too, I could quickly switch the origin dimension to different levels (major area, region, country) and do similar work on the destination side.  This is what ultimately led me to the final viz I made.

So while adjusting the table calculations, I came to one view that I really enjoyed.  The ranking pretty much “broke” for the initial starting point (everything was at 1), but the destination was right.  What this did for the viz was take everything from a single point and then create roots outward.  Initial setup had this going from left to right – but it was quite obvious that it looked like tree roots.  So I flipped it all.

I’ll admit – this is mostly a fun data shaping/vizzing exercise.  You can definitely gain insights through the way it is deployed (take a look at Latin America & Caribbean).

After the creation of the curvy (onion shape), it was a “what to add next” free for all.  I had wrestled with the names of the destination countries to try and get something reasonable, but couldn’t figure out how to display them in proximity with the lines.  No matter – the idea of a word cloud seemed kind of interesting.  You’d get the same concept of the different chord sizes passed on again and see a ton of data on where people are migrating.  This also led to some natural interactivity of clicking on a country code to see its corresponding chords above.

Finally, to add more visual context and tell the story a bit further, I added a simple breakdown of the major regions from origin to destination.  The story point for me: most migrants move within their same region, except for those from Latin America/Caribbean.

How do you add value through data analytics?

I recently read this article that really ignited a lot of thoughts that often swirl around in my mind.  If you were to ask me what my drive is, it’s making data-informed, data-driven decisions.  My mechanism for this is through data visualization.  More broadly than that, it is communicating complex ideas in a visual manner.  Often when you take an idea and paint it into a picture people can connect more deeply to it and it becomes the catalyst for change.

All that being said – I’ve encountered a sobering problem.  Those on the more “analytical” side of the industry sometimes fail to see the value in the communication aspect of data analytics.  They’ve become mired in the concept that knowing statistical programming languages, database theory, and structured query language is the most important aspect of the process.  While I don’t discount the significance of these tools (and the ability to utilize them correctly), I can’t be completely on board with it.

We’ve all sat in a meeting that is born out of one idea: how do we get better?  We don’t get better by writing the most clever and efficient SQL query.  We get better by talking through and really understanding what it IS we’re trying to measure.  When we say X what do we mean?  How do we define X?  Defining X is the hard part – pulling it out of the database, not as difficult.  If you can get really good at definitions, it becomes intuitive when you start trying to incorporate it into your business initiatives.

As we continue to evolve in the business world, I highly encourage those from both ends of the spectrum to try and meet somewhere in the middle.  We have an unbelievable amount of technical tools at our disposal, yet quite often you step into a business that is still trying to figure out HOW to measure the most basic of metrics.  Let’s stop and consider how this happened and work on achieving excellence and improvement through the marriage of business and technical acumen – with artistry and creativity thrown in there for good measure.