Workout Wednesday Week 21 – Part 1 (My approach to existing structure)

This week’s Workout Wednesday had us taking NCAA data and developing a single chart that showed the cumulative progression of a basketball game.  More specifically, a line chart where the X axis is a countdown of game time and the Y axis is the current score.  There’s some additional detail in the form of the size of each dot representing 1, 2, or 3 points.  (see cover photo)

Here’s what the underlying data set looks like:

Comparing the data structure to the image and what needed to be produced, my brain started to hurt.  Some things I noticed right away:

  • Teams are in separate columns
  • Score is consolidated into one column and only displayed when it changes
  • Time amount is in 20 minute increments and resets each half
  • Flavor text (detail) is in separate columns (the team columns)
  • Event ID restarts each half, seriously.

My mind doesn’t like that the team dimension isn’t in a single column.  It doesn’t like the restarting time either.  It really doesn’t like the way the score is done.  These aren’t numbers I can aggregate together; they are raw outputs in a string format.

Nonetheless, my goal for the Workout was to take what I had in that structure and see if I could make the viz.  What I don’t know is this: did Andy do it the same way?

My approach:

First I needed to get the X axis working.  I’ve done a good bit of work with time, so I knew a few things needed to happen.  The first part was to convert what was in MM:SS to seconds.  The idea was to change the data into a continuous axis that I could then format back into MM:SS.  Here’s the calculation:

I cheated and didn’t write my calculated field for longevity.  I saw that there was a dropped digit in the data and compensated by breaking it up into two parts.  Probably a more holistic way to do this would be to say if it is of length 4 then append a 0 to the string and then go about the same process.  Here are the Describe results showing the domain:
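
For reference, a simplified sketch of the conversion – this version just splits on the colon (which also sidesteps the length issue) rather than my two-part field, and the field names are illustrative:

    // [Time in Seconds] – minutes before the colon, seconds after it
    INT(LEFT([Time], FIND([Time], ":") - 1)) * 60
    + INT(MID([Time], FIND([Time], ":") + 1))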

Validation check: the time goes from 0 to 20 minutes (0 to 20*60 seconds aka 1200 seconds).  We’re good.

Next I needed to format that time into MM:SS continuous format.  I took that calculation from Jonathan Drummey.  I’ve used this more than once, so my google search is appropriately ‘Jonathan Drummey time formatting.’  So the resultant time ‘measure’ was almost there, but I wasn’t taking into account the +20 minutes for the first half, or that the time axis spans the full game duration.  So here are the two calculations that I made (first is +20 mins, then the formatting):
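
A sketch of the general shape of those two fields (names are illustrative, and the second is a simple string stand-in – the actual Drummey technique keeps the measure continuous and handles the formatting):

    // [Game Seconds Remaining] – push the first half out by 20 minutes (1200 seconds)
    IIF([Half] = 1, [Time in Seconds] + 1200, [Time in Seconds])

    // [MM:SS Label] – one simple way to express seconds as MM:SS
    STR(INT([Game Seconds Remaining] / 60)) + ":" + RIGHT("0" + STR([Game Seconds Remaining] % 60), 2)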

At this point I felt like I was kind of getting somewhere – almost to the point of making the line chart, but I needed to break apart the teams.  For that bit I leveraged the fact that the individual team fields only have details in them when that team scores.  Here’s the calc:
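
A sketch of the idea (the second team name is a placeholder, since only UNC is called out in this post):

    // [Team] – whichever team column has play detail is the team for that event
    IF NOT ISNULL([UNC]) THEN "UNC"
    ELSEIF NOT ISNULL([Other Team]) THEN "Other Team"
    END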

I still don’t have a lot going on – at best I have a dot plot where I can draw out the event ID and start plotting the individual points.

Getting the score was relatively easy.  I also did this in a way that’s custom to the data set, with 3 calculations – find the left score, find the right score, then tag the scores to the teams.
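
A hedged reconstruction of those three fields, assuming the consolidated score reads like “54-48” (which side belongs to which team is a guess):

    // [LeftScore]
    INT(LEFT([Score], FIND([Score], "-") - 1))

    // [RightScore]
    INT(MID([Score], FIND([Score], "-") + 1))

    // [Team Score] – tag the right number to the scoring team
    IIF([Team] = "UNC", [LeftScore], [RightScore])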

Throwing that on rows, here’s the viz:

All the events are out of order and this is really difficult to understand.  To get closer to the view I did a few things all at once:

  • Reverse the time axis
  • Add Sum of the Team Score to the path
  • Put a combined half + event field on detail (since event restarts per half)

Also – I tried Event & Half separately and my lines weren’t connected (broken at half time; so creating a derived combined field proved useful at connecting the line for me)

Here’s that viz:

It’s looking really good.  Next steps are to get the dots to represent the ball sizes.

One of my last calculations:
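
The screenshot isn’t reproduced here; as a sketch, one way to derive a “Ball Size” of 1, 2, or 3 is the jump in a team’s score from its previous basket, computed as a table calculation along the event order (parsing the play-by-play text would be another route):

    // [Ball Size] – points scored on this event
    SUM([Team Score]) - LOOKUP(SUM([Team Score]), -1)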

That got dropped on Size on a duplicated and synchronized “Team Score” axis.  Getting the pesky null not to display in the legend was a simple right click and ‘Hide.’  I also had to sort the Ball Size dimension members to align with the perceived sizing.  Also, the line size was made super skinny.

Now some cool things happened because of how I did this:  I could leverage the right and left scores for tooltips.  I could also leverage them in the titling of the overall scores, e.g. UNC = {MAX([LeftScore])}.

Probably the last component was counting the number of baskets (scoped so it returns a single value in a title, per the specs of the ask).  Those were repeated LODs:
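
As a sketch of what one of those repeated LODs might look like (field names carried over from the earlier sketches; the real ones were presumably one per team and/or basket type):

    // [UNC Baskets] – count scoring rows for one team, returned as a single value
    {FIXED : SUM(IIF([Team] = "UNC" AND NOT ISNULL([Score]), 1, 0))}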

And thankfully the final component of the oversized scores on the last marks could be accomplished by the ‘Always Show’ option.

Now I profess this may not be the most efficient way to develop the result; heck, here’s what my final sheet looks like:

All that being said: I definitely accomplished the task.

In Part 2 of this series, I’ll be dissecting how Andy approached it.  We obviously did something different because it seems like he may have used the Attribute function (saw some * in tooltips).  My final viz has all data points and no asterisks (e.g., 22:03 remaining, UNC).  Looking at that part, mine has each individual point and the score at each instantaneous spot; his drops the score.  Could it be that he tiptoed around the data structure in a very different way?

I encourage you to download the workbook and review what I did via Tableau Public.

 

#MakeoverMonday Week 21 – Are Britons Drinking Less?

After some botched attempts at reestablishing routine, #MakeoverMonday week 21 got made within the time-boxed week!  I have one pending makeover and an in-progress blog post to talk about Viz Club and the 4 developed during that special time.  But for now, a quick recap of the how and why behind this week’s viz.

This week’s data set was straightforward – aggregated measures sliced by a few dimensions.  And in what I believe is now becoming an obvious trend in how data is published, it included both aggregated and lower-level members within the same field (read this as “men,” “women,” “all people”).  The structured side of me doesn’t like it and screams for me to exclude it from any visualizations, but this week I figured I’d take a different approach.

The key questions asked related to alcohol consumption frequency by different age and gender combinations (plus those aggregates) – so there was lots of opportunity to compare within those dimensions.  More to that, the original question and how the data was presented begged to be rephrased into what became the more direct title (Are Britons Drinking Less?).

The question really informed the visualizations – and more to that point, the phrasing of the original article seemed to dictate to me that this was a “falling measure.”  Meaning it has been declining for years or year-to-year, or now compared to then – you get the idea.

With it being a falling measure and already in percentages, this made the concept of using a “difference from first” table calculation a natural progression.  When using the calculation, the first year of the measure would be anchored at zero and subsequent years would be compared to it.  Essentially asking and answering for every year “was it more or less than the first year we asked?”  Here’s the beautiful small multiple:
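
(For reference before the chart: written out, the “difference from first” table calculation is essentially the one below, computed along Year, with [Value] standing in for the survey percentage measure.)

    // Difference from the first year on the axis
    ZN(SUM([Value])) - LOOKUP(ZN(SUM([Value])), FIRST())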

Here the demographics are set to color, lightest blue being youngest to darkest blue being oldest; red is the ‘all.’  I actually really enjoyed being able to toss the red on there for a comparison and it is really nice to see the natural over/under of the age groups (which mathematically follows if they’re aggregates of the different groups).

One thing I did to add further emphasis was to put positive deltas on size – that is to say, to over-emphasize (in a very subdued way whose humor probably only Ann appreciates) when a point runs counter to the trend.  Or more directly stated: draw the reader’s attention to the points where the percentage response has increased.

Here’s the resultant:

So older demographics are drinking more than they used to and that’s fueled by women.  This becomes more obvious to the point of the original article when looking at the Teetotal groups and seeing many more fat lines.

Here’s the calculation to create the line sizing:
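
Since the screenshot isn’t reproduced here, a plausible version of that sizing field is the same difference-from-first expression wrapped in a test for a positive delta (a sketch, not the original field):

    // [Line Size] – fatter only where the response has increased vs. the first year
    IIF(ZN(SUM([Value])) - LOOKUP(ZN(SUM([Value])), FIRST()) > 0, 2, 1)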

Last up was to make one more view to help sell the message.  I figured a dot plot would mimic champagne bubbles in a very abstract way.  And I also thought open/closed circles in combination with the color encoding would be pleasant for the readers.  Last custom change there was to flip the vertical axis of time to be in reverse.  Time is read top down and you can see it start to push down to the left in some of the different groupings.

If you go the full distance and interact with the dashboard, the last thing I hope you’ll notice and appreciate is the color legend/filter bar at the top.  I hate color legends because they lack utility.  Adding in a treemap version of a legend that does double duty as highlight buttons is my happy medium (and only when I feel like color encoding is not actively communicated enough).

#MakeoverMonday Week 18

{witty intro}  This week’s makeover challenge was to take Sydney ferry data for 7 ferry lines and 8 months.  What’s even better is there was another dimension with a domain of 9 members.  This is a dream data set.  I say it’s a dream from the perspective of having two dimensions that can be manipulated and managed (no deciding HOW they have to be reduced or further grouped) and there’s decent data volume with each one.

In the world of visualization, I think this is a great starter data set.  And it was fun for me because I could focus on some of the design rather than deciding on a deep analytical angle.  Plus in the spirit of the original, my approach was to redo the output of “who’s riding the ferries” and make it more accessible.

So the lowdown: first decision made was the color palette.  The ferry route map had a lot of greens in it.  And obviously a lot of blues because of water.

So I wanted to take that idea and take it one step further.  That landed me in a world of deep blues and greens – using the darkest blue/green throughout to typically represent the “most” of something.

These colors informed most decisions that came afterward.  I really wanted to stick to small multiples on this one, just because of the natural lineup of the two small/medium-domained dimensions.  Unfortunately – nothing of that nature turned out very interesting.  Here’s an example:

Like it’s okay and somewhat interesting – especially giving each row the opportunity to have a different axis range.  But you can see the “problem” immediately: there are a few routes that are pretty flat, and further to that, end users are likely going to be frustrated by the independent axes when they dive deeper to compare.

Pivoting from that point led me to the conclusion that the dimensions shouldn’t necessarily be shown together, but instead one shown within the other.  But – worth noting, in the small multiple above you can see that the ‘Adult’ fare is just the most everywhere all the time.  Which led to this guy:

Where the bars are overall and the dots are Adult fares.  I felt that representing them in this context could free up the other dwarfed fare types to play with the data.

Last step from my end was to highlight those fare types and add a little whimsy.  I knew switching to % of total would be ideal because of the differing trip volumes for each route.  Interpret this as: normalizing to proportions gave the opportunity to compare the routes.
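
The normalization is just Tableau’s percent-of-total table calculation, which written out is the following (with [Trips] standing in for the trip count measure, computed across the fare types within each route):

    // [% of Trips]
    SUM([Trips]) / TOTAL(SUM([Trips]))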

I actually landed on the area chart by accident – I was stuck with lines, did my typical CTRL + drag of same pill to try and do some fun dual axis… and Tableau decided to automatically build me an area chart.

The original view of this was obviously not as attractive and I’ve done a few things to enhance how this displays.  The main thing was to eliminate the adult fare from the view visually.  We KNOW it’s the most, let’s move on.  Next was to stretch out the data a bit to see what’s going on in the remaining 30%-ish of rides. (Nerd moment: look at what I titled the sheet.)

Finishing up – there’s some label magic to show only those that are non-adult.  I also RETAINED the axis labels – I am hoping this helps to demonstrate and draw attention to the tagged axis at 50%.  What’s probably the most fun about this viz – you can hover over that same blue space and see the adult contribution – no data lost.

Overall I’m happy with the final effect.  A visually attractive display of data that hopefully invites users into deeper exploration.  Smaller dimension members given a chance to shine, and some straightforward questions asked and answered.

#MakeoverMonday Week 17

After a bit of life prioritization, I’m back in full force on a mission to contribute to Makeover Monday.  To that end, I’m super thrilled to share that I’ve completed my MBA.  I’ve always been an individual destined not to settle for one higher education degree, so having that box checked has felt amazing.

Now on to the Makeover!  This week’s data set was extra special because it was published on the Tableau blog – essentially more incentive to participate and contribute (there’s plenty of innate incentive IMO).

The data was courtesy of LinkedIn and represented 3 years’ worth of “top skills.”  Here’s my best snapshot of the data:

 

This almost perfectly describes the data set, minus the added bonus of there also being a ‘Global’ member in the Country dimension.  Mixing aggregations, or concepts of what people believe can be aggregated, I sighed just a little bit.  I also sighed at seeing some countries are missing 2014 skills and 2016 is truncated to 10 skills each.

So the limitations of the data set meant that there had to be some clever dealing to get around this.  My approach was to take it from a 2016 perspective.  And furthermore to “look back” to 2014 whenever there was any sort of comparison.    I made the decision to eliminate “Global” and any countries without 2014 from the data set.  I find that the data lends itself best to comparison within a given country (my perspective) – so eliminating countries was something I could rationalize.

Probably the only visualization I really cared about was a slope chart.  I thought this would be a good representation of how a skill has gotten hotter (or not).  Here’s that:

Some things I did to jazz it up a bit.  Added a simple boolean expression to color to denote if the rank has improved since 2014.  Added on reference lines for the years to anchor the lines.  I’ve done slope charts different ways, but this one somehow evolved into this approach.  Here’s what the sheet looks like:

Walking through it, starting with the filter shelf.  I’ve got an Action filter on country (based on action filter buttons elsewhere on the dashboard).  Year has been added to context and 2015 eliminated.  Datasource filtered out the countries without 2014 data & global.  Skill is filtered to an LOD for 2016 Rank <> 0.  This ensures I’m only using 2016 skills.  The context filters keep everything looking pretty for the countries.
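
Roughly sketched (not the exact fields), the boolean on color and the LOD used as a filter could look like this – the year test assumes Year is stored as a number:

    // [Rank Improved] – lower rank number in 2016 than in 2014 = hotter skill
    [2016 Rank] < [2014 Rank]

    // [Has 2016 Rank] – keep only skills that actually have a 2016 rank for the country
    {FIXED [Country], [Skill] : MAX(IIF([Year] = 2016, [Rank], 0))} <> 0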

The year lines are reference lines – all headers are hidden.  There’s a dual axis on rows to have line chart & circle chart.  The second Year in columns is redundant and leftover from an abandoned labeling attempt (but adds nice dual labels automatically to my reference lines).

Just as a note – I made the 2016 LOD with a 2014 LOD to do some cute math for line size – I didn’t like it, so I abandoned it.

Last steps were to add additional context to the “value” of 2016 skills.  So a quick unit chart and word cloud.  One thing I like to do on my word clouds these days is square the values on size.  I find that this makes the visual indicator for size easier to understand.  What’s great about this is that smaller rank is better, so instead of “^2” it became this:
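
(The calc isn’t pictured here; my best guess at what “^2” became is a negative exponent, which keeps the squaring while flipping the direction so a better – smaller – rank gets a bigger word.)

    // [Word Size] – hypothetical
    [2016 Rank] ^ (-2)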

Sometimes math just does you a real solid.

The kicker of this entire data set for me, and the knowledge gained: Statistical Analysis and Data Mining are hot!  Super hot!  I also really like that User Interface Design and Algorithm Design made it to the top 10 for the United States.  I would tell anyone that a huge component of my job is designing analytical outputs for all types of users and that requires an amount of UX design.  And coincidentally I’m making an algorithm to determine how to eliminate a backlog, all in Tableau.  (basic linear equation)

#MakeoverMonday Week 12 – All About March Madness

This week’s Makeover Monday topic was based on an article attempting to provide analysis into why it is harder for people to correctly pick their March Madness brackets. The original visualization is this guy:

With most Makeover Monday approaches I like to review the inspiration and visualization and let that somewhat decide the direction of my analysis. In this case I found that completely impossible. For my own sake I’m going to try and digest what it is I’m seeing/interpreting.

  • Title indicates that we’re looking at the seeds making the final 4
  • Each year is represented as a discrete value
  • I should be able to infer that “number above column represents sum of Final Four seeds” by the title
  • In the article it says in 2008 that all the seeds which made it to the final 4 were #1 – validates my logical assumption
  • Tracking this down further, I am now thinking each color represents a region – no idea which colors mean what – I take that back, I think they are ranked by the seed value (it looks like the first instance of the best seed rank is always yellow)
  • And then there’s an annotation tacked on % of Final Four teams seeded 7th or lower for two different time periods
    • Does 5.2% from 1985 to 2008 equal: count the blue bars (plus one red bar) with values >=7 – that’s 5 out of (24 * 4) = 5/96 = 5.2%
    • Same logic for the second statistic: 7 out of (8 * 4) = 21.9%

And then there’s the final distraction of the sum of the seed values above each bar.  What does this accomplish?  Am I going to use it to quickly try and calculate an “average seed value” for each year?  Because my math degree didn’t teach me to compute ratios at the speed of thought – it taught me to solve problems by using a combination of algorithms and creative thinking.  It also doesn’t help me with understanding interesting years – the height of the stacked bars does this just fine.

So to me this seems like an article where they’ve decided to take up more real estate and beef up the analysis with a visual display.  It’s not working and I’m sad that it is a “Chart of the Day.”

Now on to what I did and why.  I’ll add a little preface and say that I was VERY compelled to do a repeat of my Big Game Battle visualization, because I really like the idea of using small multiples to represent sports and team flux.  Here’s that display again:

Yes – you have to interact to understand, but once you do it is very clear.  Each line represents a win/loss result for the teams.  They are then bundled together by their regions to see how they progressed into the Superbowl.  In the line chart it is a running sum.  So you can quickly see that the Patriots and Falcons both had very strong seasons.  The 49ers were awful.

So that was my original inspiration, but I didn’t want to do the same thing and I had less time.  So I went a super distilled route of cutting down the idea behind the original article further.  Let’s just focus on seed rank of those in the championship.  To an extent I don’t really think there’s a dramatic story in the final 4 rankings – the “worst” seed that made it there was 11th.  We don’t even know if that team made it further.

In my world I’ve got championship winners vs. losers with position indicating their seed rank.  Color represents the result for the team for the year and for overall visual appeal I’ve made the color ramp.  To help orient the reader, I’ve added min/max ranks (I screwed this up and did pane for winner, should have been table like it is for loser, but it looks nice anyway).  I’ve also added on strategic years to help demonstrate that it’s a timeline.  If you were to interact, you’d see the name of the team and a few more specifics about what it is you’re looking at.

The reality of my takeaway here – a #1 seed usually wins.  Consistently wins, wins in streaks.  And there’s even a fair amount of #1 losers.  If I had to make a recommendation based on 32 years of championships: pick the #1 seeds and stick with them.  Using the original math from the article: 19 out of 32 winners were seed #1 (60%) and 11 out of 32 losers were seed #1 (34%).  Odds of a 1 being in the final 2 across all the years?  47% – And yes, that is said very tongue in cheek.

Makeover Monday Week 10 – Top 500 YouTube Game(r) Channels

We’re officially 10 weeks into Makeover Monday, which is a phenomenal achievement.  This means that I’ve actively participated in recreating 10 different visualizations with data varying from tourism, to Trump, to this week’s Youtube gamers.

First some commentary people may not like to read: the data set was not that great.  There’s one huge reason why it wasn’t great: one of the measures (plus a dimension) was a dependent variable built from two independent variables.  And that dependent variable was processed via a pre-built algorithm.  So it would almost make sense to use the resultant dependent variable to enrich other data.

I’m being very abstract right now – here’s the structure of the data set:

Let’s walk through the fields:

  • Rank – this is based entirely on the sort chosen at the top of the site (for this view it is by video views; not sure what those random 2 are, I just screencapped the site)
  • SB Score/Rank – this is some sort of ranking value applied to a user based on a proprietary algorithm that takes a few variables into consideration
  • SB Score (as a letter grade) – the letter grade expression of the SB score
  • User – the name of the gamer channel
  • Subscribers – the # of channel subscribers
  • Video Views – the # of video views

As best as I can tell through reading the methodology – SB score/rank (the # and the alpha) are influenced in part by the subscribers and video views.  Which means putting these in the same view is really sort of silly.  You’re kind of at a disadvantage if you scatterplot subscribers vs. video views because the score is purportedly more accurate in terms of finding overall value/quality.

There’s also not enough information contained within the data set to amass any new insights on who is the best and why.  What you can do best with this data set is summarization, categorization, and displaying what I consider data set “vitals.”

So this is the approach that I took.  And more to that point, I wanted to make over a very specific chart style that I have seen Alberto Cairo employ a few times throughout my 6 week adventure in his MOOC.

That view: a bar chart sliced through with lines to help understand size of chunks a little bit better.  This guy:

So my energy was focused on that – which only happened after I did a few natural (in my mind) steps in summarizing the data, namely histograms:

Notice here that I’ve leveraged the axis values across all 3 charts (starting with SB grade and carrying through to its sibling charts to minimize clutter).  I think this has decent effect, but I admit that the bars aren’t equal width across each bar chart.  That’s not pleasant.

My final two visualizations were to demonstrate magnitude and add more specifics in a visual manner to what was previously a giant text table.

The scatterplot helps to achieve this by displaying the 2 independent variables with the overall “SB grade” encoded on both color and size.  Note: for size I did powers of 2: 2^9, 2^8, 2^7…2^1.  This was a decent exponential effect to break up the sizing in a consistent manner.
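
As a sketch of that sizing, assuming a hypothetical [Grade Ordinal] helper that maps A+ to 1 down through the lowest grade:

    // [Mark Size] – 2^(10 - ordinal): A+ → 2^9, the next grade down → 2^8, and so on
    2 ^ (10 - [Grade Ordinal])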

The unit chart on the right is to help demonstrate not only the individual members, but display the elite A+ status and the terrible C+, D+, and D statuses.  The color palette used throughout is supposed to highlight these capstones – bright on the edges and random neutrals between.

This is aptly named an exploration because I firmly believe the resultant visualization was built to broadly pluck away at the different channels and get intrigued by the “details.”  In a more real world I would be out hunting for additional data to tag this back to – money, endorsements, average video length, number of videos uploaded, subject matter area, type of ads utilized by the user.  All of these appended to this basic metric aimed at measuring a user’s “influence” would lead down the path of a true analysis.

Makeover Monday Week 9 – Andy’s AMEX

So I started my dream job at the beginning of February.  This means I’ve been spending the month adjusting and tweaking my personal schedule and working on bringing back good habits.  In particular – I’ve missed out on doing daily workouts and consistently blogging about data viz.  Fortunately I’ve been keeping up with the practice component (Makeover Monday, Hackathon, Workout Wednesday), but I wholeheartedly believe in the holistic approach of sharing the thought process behind the viz.  (TL;DR – this was my paragraph of empty excuses)

Moving on then – the thought process behind the makeover.  And what’s perhaps even more interesting is that, almost post-viz, I can take some of the thoughts that Andy had regarding this week’s visualizations and provide my context.

Based on the original visualization I had an inkling that there wasn’t going to be a ton of data funneling in.  Being an individual who tracks all expenses and has seen them visually represented, I felt like food should represent a larger proportion of expenses.

Andy’s AMEX ’16

For reference, here’s a wonderful donut chart that Mint.com provided me on my top 3 most used credit cards.  I funnel everything I can through credit cards, and food in general takes up a huge portion of spend.

Ann’s ’16 Credit Card Spending

Both of these visualizations leave something to be desired.  I like Andy’s original AMEX one better than the donut I got, but they are both very distilled.  Andy spent a lot on transportation and travel, and apparently I spent a lot on shopping and education.

Getting REALLY specific about the data – there were 110 records (FYI my donut is 477 records, 209 of which are food/dining).  Plotting the data quickly over time, there were large gaps of time with no purchases.

Armed with this, I decided to take an approach of piggybacking off the predefined categories to see if throughout time Andy typically has one category that gets a lot of spend, or to see if the spend trends are lumped together.

More to that point, I wanted to show the way the data was dispersed in a daily fashion… so I went down this path.  The largest transaction for each day plotted (using the category on color, amount on size) across the 12 months.  I actually really like this view because I can clearly see the large vehicle purchase in December and you get a better feel for how spread out the card’s utilization was.  (I am guessing my lack of axis label on the day of the month is jarring.)

Also because I hate color legends, this meant I needed to introduce the idea of a color legend via data points elsewhere, which led to the first view:

So… I kind of got really interested in utilization frequency and wanted to take it further.  So the next step was to make a barcode chart.  Very similar in concept to what the “top daily spend” is showing, but not limiting the data to only the top daily in this case.

Insights gained here – I get this feeling that Andy may only (or mostly) use his AMEX for meals where he’s out traveling.  Hovering over the points would add more insight to the transaction values.  More than that, we get a feel for what this card is generally swiped for: the 3 categories at the bottom (and FYI I ranked these by sum dollars spent).

Finally – bundling it together in a palatable format – what were the headline transactions for the year?  I wanted to do monthly and have categories, but there wasn’t enough data.  So I opted to go transaction level and keep it top 5 each quarter.  I think there’s novelty here in terms of presentation, but also value in quick rough comparisons of values over each quarter.

And this rounds out the end of the analysis.  Most of the transactions here are centered around travel.  My brain is not sanitized enough to say what you can infer – I have too much generalized knowledge of how Andy’s profession could explain these findings to present from a pure lack of knowledge standpoint.  (TL;DR – I know that Andy travels for his job, was surprised #data16 wasn’t an obvious point within the data set)

So the thought process in general behind the path I took is this: I wanted to explore how often Andy spends money in certain categories.  I was intrigued by frequency of usage to see if it could eventually point back to provide the data creator (the guy who bought stuff) some additional aha! moments.

To be more honest – I actually think this is something that I would want for myself.  I in particular would love to plot my Amazon.com transactions and see how that changes throughout the year.  Both in barcode for frequency (imagining Black Friday is heavy) and then to see if I’m utilizing their services any differently.  (I have this feeling that grocery type purchases are on the climb).

Oh – and in terms of asking about colors and fonts: I did go to Andy’s blog for inspiration.  I wanted to do a red/blue motif based on the blog, but needed more colors.  So I think I googled “blue color palette” and ended up with this cute starting palette that evolved into having pops of orange and yellow.  Font: I left this to something minimal that I thought Andy would be okay with (Arial Narrow) that would also bode well across all platforms.

Makeover Monday Week 8 – Potatoes in the EU

I’ll say this first – I don’t eat potatoes.  Although potatoes are super tasty, I refuse to have them as part of my diet.  So I was less than thrilled about approaching a week that was pure potato (especially coming off the joy of Valentine’s Day).  Nonetheless – it presented itself with a perfect opportunity for growth and skill testing.  Essentially, if I could make a viz I loved about a vegetable I hate – that would speak to my ability to interpret varying data sets and build out displays.

I’m very pleased with the end results.  I think it has a very Stephen Few-esque approach.  Several small multiples with high and low denoted, color playing throughout as a dual encoder.  And there’s even visual interest in how the data was sorted for data shape.

So how did I arrive there?  It started with the bar chart of annual yield.  I had an idea on color scheme and knew that I wanted to make it more than gray.

This gave perfect opportunity to highlight the minimum and maximum yields.  To see in which years different countries’ production was affected by things like weather and climate.  It’s actually very interesting to see that not too many of the dark bars (max value) are in more recent times.  Seems like agricultural innovation is keeping pace with climate issues.

After that I was hooked on this idea of sets of 3.  So I knew I wanted to replicate a small multiple in a different way using the same sort order.  That’s where Total Yield came in.  I’ve been pondering this one in the shower, on the legitimacy of adding up annual ratios for an overall yield.  My brain says it’s fine because the size of the country doesn’t change.  But my vulnerable brain part says that someone may take issue with it.  I’d love for a potato farming expert to come along and tell me if that’s a silly thing to add up.  I see the value in doing a straight total comparison of the years.  Because although there’s fluctuation in the yield annually, we have a normalized way to show how much each country produces irrespective of total land size.

Next was the dot plot of the current year.  This actually started out its life as a KPI indicator of up or down from previous, but it was too much for the visual.  I felt the idea of the dot plot of current year would do more justice to “right now” understanding.  Especially because you can do some additional visual comparison to its flanks and see more insight.

And then rinse/repeat for the right side.  This is really where things get super interesting.  The amount of variability in pricing for each country, both by average and current year.  Also – 2013 was a great year for potatoes.

Book Binge – January Edition

It’s time for another edition of book binge – a random category of blog posts devised (and now only on its second iteration) where I walk through the different books I’ve read and purchased this month.

First – a personal breakthrough!  I have always been an avid reader, but admittedly become lazy in recent years.  Instead of reading at least one book a month, I was going on small reading sprees of 2 or 3 books every four or five months.  After the success of my December reads, I figured I would keep things going and try to substitute books as entertainment whenever possible.

Here are a few books I read in January:

The Functional Art by Alberto Cairo

I picked this one up because it is quintessential to the world of data journalism and data visualization.  I also thought it would be great to get into the head of one of the instructors of a MOOC I’m taking.  Plus who can resist the draw of the slope chart on the cover?

I loved this one.  I like Alberto’s writing style.  It is rooted in logic, and his use of text spacing and bold as emphasis is heavy on impact.  I appreciate that he says data visualization has to first be functional, but reminds us that it has to be seen to matter.  It’s also interesting to read the interviews/profiles of journalists at the end of the book.  This is an excellent way for me to shift my perspective and paradigm.  I come from the analysis/mathematical side of things – these folks are there to communicate stories of data.  A great read that is broken up in such a way that it is easy to digest.

Next book was The Visual Display of Quantitative Information by Edward Tufte

Obviously a classic read for anyone in the data visualization world by the “father” of modern information graphics.  I must admit I picked up all 4 of Tufte’s books in December, and couldn’t get my brain wrapped around them.  I was flipping through the pages to get a sense for how the information was contained and felt a little intimidated.  That intimidation was all in my head.  Once I began reading – the flow of information made perfect sense.

I appreciate Tufte’s voice and axiom type approach to information graphics.  Yes – there are times when it is snarky and absurd, but it is also full of purpose.  He walks through information graphics history, spotlighting many of the greats and lamenting the lack of recent progression (or more of a recession) in the art.

I have two favorites in this one: how he communicates small multiples and sparklines.  The verbiage used to describe the impact (and amount of information) small multiples can convey is poetic (and I don’t really like poetry).  His work on developing and demonstrating sparklines is truly illuminating.  There were times where I had dreams of putting together some of the high “data-ink” low “chartjunk” visuals that he described.  And his epilogue makes me forgive all the snarkiness.  The first in a series that I am ecstatic to continue to read.

The last book I’ll highlight this month was a short read – a Christmas present from a friend.

Together is Better by Simon Sinek

I’m very familiar with Simon – mostly because of his famous TED talk on starting with why. I’ve read his book on the subject as well. So I was delighted to be handed this tiny gem.  Written in hybrid format of children’s book and inspirational quote book – this is a good one to flip through if you’re in need of a quiet moment.  Simon calls himself a self professed optimist at the end, and that’s definitely how I left the book feeling.

It aims at sparking the inner fire we all have – and the most powerful moment: Simon saying that you don’t have to invent a new idea and then follow it.  It is perfectly acceptable to commit to someone else’s vision and follow them.  It frees you completely from the world of “special,” new, and different that entrepreneurial and ambitious types (myself included) get hung up on.  You don’t have to make up an original idea – just find something that resonates deeply with you and latch on.  That is just as powerful as being a visionary.

The other part of this book devotes a significant amount of space to snippet-style takes on leadership.  A friendly reminder of what leadership looks like.  Leadership is not management.

I’ve got more books on the way and will be back in a month with three new reads to share.

The Flow of Human Migration

Today I decided to take a bit of a detour while working on a potential project for #VizForSocialGood.  I was focused on a data set provided by UNICEF that showed the number of migrants from different areas/regions/countries to destination regions/countries.  I’m pretty sure it is the direct companion to a chord diagram that UNICEF published as part of their Uprooted report.

As I was working through the data, I wanted to take it and start at the same place.  Focus on migration globally and then narrow the focus in on children affected by migration.

Needless to say – I got sidetracked.  I started by wanting to make paths on maps showing the movement of migrants.  I haven’t really done this very often, so I figured this would be a great data set to play with.  Once I set that up, it quickly devolved into something else.

I wasn’t satisfied with the density of the data.  The clarity of how it was displayed wasn’t there for me.  So I decided to take an abstract take on the same concept.  As if by fate I had received Chart Chooser cards in the mail earlier and Josh and I were reviewing them.  We were having a conversation about the various uses of each chart and brainstorming on how it could be incorporated into our next Tableau user group (I really do eat, drink, and breathe this stuff).

Anyway – one of the charts we were talking about was the sankey diagram.  So it was already on my mind and I’d seen it accomplished multiple times in Tableau.  It was time to dive in and see how this abstraction would apply to the geospatial.

I started with Chris Love’s basic tutorial of how to set up a sankey.  It’s a really straightforward read that explains all the concepts required to make this work.  Here’s the quick how-to in my paraphrased words.

  1. Duplicate your data via a Union and identify the original and the copy (which is great because I had already done this for the pathing).  As I understand it from Chris’s write-up, this lets us ‘stretch out’ the data, so to speak.
  2. Once the data is stretched out, it’s filled in by manipulating the binning feature in Tableau.  My interpretation would be that the bins ‘kind of’ act like dimensions (labeled out by individual integers).  This becomes useful in creating individual points that eventually turn into the line (curve).
  3. Next there are ranking functions made to determine the starting and end points of the curves.
  4. Finally the curve is built using a mathematical function called a sigmoid function.  This is basically an asymptotic function that goes from -1 to 1 and has a middle area with a slope of ~1.  (See the sketch just after this list.)
  5. After the curve is developed, the points are plotted.  This is where the ranking is set up to determine the leftmost and rightmost points.  Chris’s original specifications had the ranking straightforward for each of the dimensions.  My final viz is a riff on this.
  6. The last steps are to switch the chart to a line chart and then build out the width (size) of the line based on the measure you used in the ranking (percent of total) calculation.
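
For reference, a rough sketch of the core pieces in that style of template (the field names, the bin padding, and the -6 to 6 spread are my assumptions, not Chris’s exact calcs):

    // [T] – spread the padded bin points across roughly -6 to 6
    (INDEX() - 25) / 4

    // [Sigmoid] – S-shaped easing between the two sides (this variant runs 0 to 1; a tanh-style version runs -1 to 1)
    1 / (1 + EXP(-[T]))

    // [Rank 1] and [Rank 2] – running percent of total, computed along origin vs. destination
    RUNNING_SUM(SUM([Migrants])) / TOTAL(SUM([Migrants]))

    // [Curve] – blend the two positions with the sigmoid to place each point on the path
    [Rank 1] + ([Rank 2] - [Rank 1]) * [Sigmoid]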

So I did all those steps and ended up with exactly what was described – a sankey diagram.  A brilliant one too, I could quickly switch the origin dimension to different levels (major area, region, country) and do similar work on the destination side.  This is what ultimately led me to the final viz I made.

So while adjusting the table calculations, I came to one view that I really enjoyed.  The ranking pretty much “broke” for the initial starting point (everything was at 1), but the destination was right.  What this did for the viz was take everything from a single point and then create roots outward.  Initial setup had this going from left to right – but it was quite obvious that it looked like tree roots.  So I flipped it all.

I’ll admit – this is mostly a fun data shaping/vizzing exercise.  You can definitely gain insights through the way it is deployed (take a look at Latin America & Caribbean).

After the creation of the curvy onion shape, it was a “what to add next” free-for-all.  I had wrestled with the names of the destination countries to try and get something reasonable, but couldn’t figure out how to display them in proximity to the lines.  No matter – the idea of a word cloud seemed kind of interesting.  You’d get the same concept of the different chord sizes passed on again and see a ton of data on where people are migrating.  This also led to some natural interactivity of clicking on a country code to see its corresponding chords above.

Finally, to add more visual context, a simple breakdown of the major regions’ origins to destinations – to tell the story a bit further.  The story point for me: most migrants move within their same region, except for Latin America/Caribbean.