February Book Binge

Another month has passed, so it’s time to recount what I’ve been reading.

Admittedly it was kind of a busy month for me, so I decided to mix some podcasts into my book habit. To reflect that, I've decided to share a mixture of both.


First up is Rhinoceros Success by Scott Alexander

This is a short read designed to ignite fire and passion in whoever reads it. It walks through how a big, burly rhino would approach everyday life, and how you, as a rhino, should follow suit.

I read this one while I was transitioning between jobs and found it to be a great source of humor during the process. It helps articulate the ‘why’ behind things you may be doing and puts it in the context of what a rhino would do. This got me through some rough patches of uncertainty.

The next book was Made to Stick by the Heath brothers

This was another recommendation and one that I enjoyed. I will caveat this by saying that the book is really long. I struggled to get through a chapter at a time (~300 pages and only 7 chapters). It is chock-full of stories to help the reader understand the model required to make ideas stick.

I read this one because a big part of my job is often communicating a yet-to-be-seen vision, and trying to get people to buy in to a new way of thinking. Neither is easy, and both can be met with resistance. The tools that the Heath brothers offer are simple and straightforward, and I think they extend further, to writing and public speaking. How do you communicate a compelling idea that will resonate with your audience?

I’ve got their two other books and will be reading one of them in March.

Lastly – I wanted to spend a little bit of time sharing a podcast that I’ve come to enjoy. It is Design Matters with Debbie Millman.

This was shared with me by someone on Twitter. I found myself commuting much more than average this month (as part of the job change) and was looking for media to consume during the variable-length (30 to 60 minute) commute. This podcast fits that time slot richly. What’s awesome is that the first episode I listened to had Seth Godin on it (reading one of his books now), so it served a dual purpose: I could hear Seth and preview whether I should read one of his many books, and also get a dose of Debbie.

The beauty of this podcast for me is that Debbie spends a lot of time exploring the personality and history of modern artists/designers. She does this by amassing research on each individual and then having a very long sit-down to discuss her findings. Oftentimes this involves analyzing individual perspectives and recounting significant past events. I always find it illuminating to hear how these people view the world and how they’ve “arrived” at their current place in life.

That wraps up my content diet for the month – and I’m off to listen to Seth.

Book Binge – January Edition

It’s time for another edition of book binge – a random category of blog posts I devised (now only on its second iteration) where I walk through the different books I’ve read and purchased this month.

First – a personal breakthrough! I have always been an avid reader, but had admittedly become lazy in recent years. Instead of reading at least one book a month, I was going on small reading sprees of 2 or 3 books every four or five months. After the success of my December reads, I figured I would keep things going and try to substitute books for other entertainment whenever possible.

Here are a few books I read in January:

The Functional Art by Alberto Cairo

I picked this one up because it is a cornerstone of the data journalism and data visualization world. I also thought it would be great to get inside the head of one of the instructors of a MOOC I’m taking. Plus, who can resist the draw of the slope chart on the cover?

I loved this one. I like Alberto’s writing style: it is rooted in logic, and his use of text spacing and bold for emphasis is heavy on impact. I appreciate that he says data visualization has to be functional first, but reminds us that it has to be seen to matter. It’s also interesting to read the interviews/profiles of journalists at the end of the book. This is an excellent way for me to shift my perspective and paradigm. I come from the analysis/mathematical side of things – these folks are there to communicate stories with data. A great read, broken up in such a way that it is easy to digest.

The next book was The Visual Display of Quantitative Information by Edward Tufte

Obviously a classic read for anyone in the data visualization world, by the “father” of modern information graphics. I must admit I picked up all four of Tufte’s books in December and couldn’t get my brain wrapped around them. I was flipping through the pages to get a sense of how the information was laid out and felt a little intimidated. That intimidation was all in my head. Once I began reading, the flow of information made perfect sense.

I appreciate Tufte’s voice and his axiom-style approach to information graphics. Yes – there are times when it is snarky and absurd, but it is also full of purpose. He walks through the history of information graphics, spotlighting many of the greats and lamenting the lack of recent progression (or, more accurately, the recession) in the art.

I have two favorites in this one: how he communicates small multiples and sparklines. The verbiage used to describe the impact (and amount of information) small multiples can convey is poetic (and I don’t really like poetry). His work on developing and demonstrating sparklines is truly illuminating. There were times when I dreamed of putting together some of the high “data-ink,” low “chartjunk” visuals that he described. And his epilogue makes me forgive all the snarkiness. The first in a series that I am ecstatic to continue reading.

The last book I’ll highlight this month was a short read – a Christmas present from a friend.

Together is Better by Simon Sinek

I’m very familiar with Simon – mostly because of his famous TED talk on starting with why. I’ve read his book on the subject as well, so I was delighted to be handed this tiny gem. Written in a hybrid format of children’s book and inspirational quote book, this is a good one to flip through if you’re in need of a quiet moment. Simon professes to be an optimist at the end, and that’s definitely how I left the book feeling.

It aims at sparking the inner fire we all have. The most powerful moment for me: Simon saying that you don’t have to invent a new idea; it is perfectly acceptable to commit to someone else’s vision and follow them. That frees you completely from the world of “special,” new, and different that entrepreneurial and ambitious types (myself included) get hung up on. You don’t have to come up with an original idea – just find something that resonates deeply with you and latch on. That is just as powerful as being a visionary.

The other part of the book devotes significant space to snippet-sized takes on leadership. A friendly reminder of what leadership looks like. Leadership is not management.

I’ve got more books on the way and will be back in a month with three new reads to share.

The Flow of Human Migration

Today I decided to take a bit of a detour while working on a potential project for #VizForSocialGood.  I was focused on a data set provided by UNICEF that showed the number of migrants from different areas/regions/countries to destination regions/countries.  I’m pretty sure it is the direct companion to a chord diagram that UNICEF published as part of their Uprooted report.

As I was working through the data, I wanted to start at the same place: focus on migration globally and then narrow in on children affected by migration.

Needless to say – I got sidetracked. I started out wanting to make paths on maps showing the movement of migrants. I haven’t really done this very often, so I figured this would be a great data set to play with. Once I set that up, it quickly devolved into something else.

I wasn’t satisfied with the density of the data; the clarity of how it was displayed wasn’t there for me. So I decided to take a more abstract approach to the same concept. As if by fate, I had received Chart Chooser cards in the mail earlier, and Josh and I were reviewing them. We were having a conversation about the various uses of each chart and brainstorming how they could be incorporated into our next Tableau user group (I really do eat, drink, and breathe this stuff).

Anyway – one of the charts we talked about was the sankey diagram. It was already on my mind, and I’d seen it accomplished multiple times in Tableau. It was time to dive in and see how this abstraction would apply to the geospatial.

I started with Chris Love’s basic tutorial on how to set up a sankey. It’s a really straightforward read that explains all the concepts required to make this work. Here’s the quick how-to, paraphrased in my own words.

  1. Duplicate your data via a union and identify the original and the copy (great, because I had already done this for the pathing).  As I understand it from Chris’s write-up, this lets us ‘stretch out’ the data, so to speak.
  2. Once the data is stretched out, it’s filled in by manipulating Tableau’s binning feature.  My interpretation is that the bins ‘kind of’ act like dimensions (labeled by individual integers).  This becomes useful for creating the individual points that eventually turn into the line (curve).
  3. Next, ranking functions are made to determine the starting and ending points of the curves.
  4. The curve itself is built using a mathematical function called a sigmoid function – basically an asymptotic function that runs between -1 and 1 with a middle section whose slope is ~1 (see the sketch after this list).
  5. After the curve is developed, the points are plotted.  This is where the ranking comes in, determining the leftmost and rightmost points.  Chris’s original specification keeps the ranking straightforward for each of the dimensions; my final viz is a riff on this.
  6. The last steps are to switch the chart to a line chart and build out the width (size) of the line based on the measure used in the ranking (percent of total) calculation.
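Outside of Tableau, the heart of step 4 is only a few lines. Here’s a minimal Python sketch of the idea, with hypothetical start/end heights standing in for the rank calculations (in Tableau this all happens in calculated fields over the bins):

```python
import numpy as np

# Hypothetical start/end heights, standing in for the leftmost and
# rightmost rank (percent-of-total) positions from steps 3 and 5.
y_start, y_end = 0.2, 0.8

# t sweeps the horizontal span of the curve; in Tableau the bins play this role.
t = np.linspace(-6, 6, 49)
curve = (np.tanh(t) + 1) / 2              # sigmoid from -1..1, rescaled to 0..1
y = y_start + (y_end - y_start) * curve   # ease from origin rank to destination rank
```

Each (t, y) pair becomes one point on the line; size the line by the percent-of-total measure and you have one chord of the sankey.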

So I did all those steps and ended up with exactly what was described – a sankey diagram. A brilliant one, too: I could quickly switch the origin dimension between levels (major area, region, country) and do similar work on the destination side. This is what ultimately led me to the final viz I made.

While adjusting the table calculations, I came to one view that I really enjoyed. The ranking pretty much “broke” for the initial starting point (everything was at 1), but the destination side was right. What this did for the viz was take everything from a single point and then send roots outward. The initial setup had this going from left to right – but it was quite obvious that it looked like tree roots. So I flipped it all.

I’ll admit – this is mostly a fun data shaping/vizzing exercise.  You can definitely gain insights through the way it is deployed (take a look at Latin America & Caribbean).

After creating the curvy onion shape, it was a “what to add next” free-for-all. I had wrestled with the names of the destination countries to try to get something reasonable, but couldn’t figure out how to display them in proximity to the lines. No matter – the idea of a word cloud seemed kind of interesting. You’d get the same concept of the different chord sizes carried through again and see a ton of data on where people are migrating. This also led to some natural interactivity: clicking on a country code shows its corresponding chords above.

Finally, to add more visual context and tell the story a bit further, I included a simple breakdown of the major regions’ origins and destinations. The story point for me: most migrants move within their own region, except for Latin America & the Caribbean.

And so it begins – Adventures in Python

Tableau 10.2 is on the horizon, and with it come several new features – one of particular interest to me is the new Python integration. Here’s the beta program beauty shot:

Essentially what this means is that more advanced programming languages, aimed at doing more sophisticated analysis, will become an easy-to-use extension of Tableau. As you can see from the picture, it’ll work similarly to the R integration: the end user wraps native Python code in the SCRIPT_STR() function and gets the output back in Tableau.
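As a rough sketch of the pattern – with a hypothetical field name, and mirroring how the existing R script functions behave rather than anything confirmed for 10.2 – a calculated field might look like this:

```
SCRIPT_STR("
return ['long' if len(s) > 100 else 'short' for s in _arg1]
", ATTR([Review Text]))
```

Tableau hands the aggregated expression to the external service as the list _arg1, runs the Python snippet, and expects a value (or list of values) back.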

I have to admit that I’m pretty excited by this. I see it propelling some data science concepts into the mainstream and making it much easier to communicate and understand the purpose behind them.

In preparation I wanted to spend some time setting up a Linux Virtual Machine to start getting a ‘feel’ for Python.

(Detour) My computer science backstory: my intro to programming was C++ and Java. They both came easily to me. Later on I tried to take a mathematics class based in UNIX – probably a precursor to some of the modern languages we’re seeing – but I couldn’t get on board with the terminal-level entry. Very off-putting, coming from a world where you have a better feedback loop for what you’re coding. Since that time (~9 years ago) I haven’t had the opportunity to encounter or use these types of languages. In my professional world everything is built on SQL.

Anyway, back to the heart of the matter – getting a box set up for Python. I’m a very independent person and like to take the knowledge I’ve accumulated and troubleshoot my way to results. The process of failing and learning on the spot with minimal guidance helps me solidify my knowledge.

Here are the steps I went through – mind you, I have a PC and am intentionally running Windows 7 (a big reason why I made a Linux VM).

  1. Download and install VirtualBox by Oracle
  2. Download x86 ISO of Ubuntu
  3. Build out Linux VM
  4. Install Ubuntu

These first four steps are pretty straightforward in my mind. It’s a typical Windows installer for VirtualBox, getting the image is very easy, and so is the build (just pick a few settings).

Next came the Python part.  I figured I’d have to install something on my Ubuntu machine, but I was pleasantly surprised to learn that Ubuntu already comes with Python 2.7 and 3.5.  A step I don’t have to do, yay!
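If you want to double-check which interpreter you’re getting, a two-liner works the same from either version:

```python
import sys
print(sys.version)  # shows 2.7.x or 3.5.x on a stock Ubuntu install
```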

Now came the part where I hit my first real challenge. I had this idea of working up to the sentiment analysis steps outlined by Brit Cava on the Tableau blog. I’d reviewed the code and could follow the logic decently well, and I think it’s a very extensible starting point.

So, based on the blog post, I knew there would be some Python modules I’d need. Googling led me to believe that installing Anaconda would be the best path forward: it contains several of the most popular Python modules, so installing it would eliminate the need to add modules individually.
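A quick sanity check once it’s in place – if these imports succeed, the modules the blog post calls for are available (time ships with Python itself either way):

```python
# If these imports succeed, the install has what the tutorial needs.
import pandas
import nltk
import time

print(pandas.__version__, nltk.__version__)
```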

I downloaded the file just fine, but the instructions on “installing” were less than stellar. Here they are:

Directions on installing Anaconda on Linux

So, as someone who takes instructions very literally (and, again, doesn’t know UNIX very well), I was greeted with a nasty, unhelpful error message. Feelings from years ago crept in quickly. Still, I Googled my way through it (and had a pretty good inkling that the shell simply couldn’t ‘find’ the file).

What they said to run (notice I already dropped the _64, since mine isn’t the 64-bit version).

That was all that was needed to get the file to install!

So installing Anaconda turned out to be pretty easy once I got the right command in the prompt. Then came the fun part: trying to do sentiment analysis. I knew from my reading that Anaconda came with the three modules mentioned – pandas, nltk, and time – so I felt like this would be pretty easy to test out, coding directly from the terminal.

Well – I hit my second major challenge. The lexicon required to do the sentiment analysis wasn’t included, so I had no way of actually running it and was left to figure things out on my own. This part was actually not that bad: Python gave me a good prompt to fix the problem – essentially, call the nltk downloader and get the lexicon. The nltk downloader even has a cute little GUI for finding the right lexicon (vader). I got it installed pretty quickly.
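For reference, the fix boils down to a couple of lines – the downloader can also be driven without the GUI:

```python
import nltk
nltk.download('vader_lexicon')  # fetch the VADER lexicon the analyzer depends on

from nltk.sentiment.vader import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()
print(sia.polarity_scores("delighted"))  # neg/neu/pos plus a 'compound' score
```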

Finally, I was confident that I could input the code and come up with some results. And this is where I hit my last obstacle – probably the most frustrating of the night. When pasting in the code (in raw form from the blog post), I kept running into errors. The message wasn’t very helpful, and I started cutting out lines of code I didn’t need.

What’s the deal with line 5?

Eventually I figured out the problem: there were stray spaces in the raw code snippet. After some additional googling (this time by my husband), he kindly reported that “apparently spaces matter, according to this forum.” No big deal – lesson learned: whitespace is significant in Python!

Yes! Success!

So what did I get at the end of the day?  A wonderful CSV output of sentiment scores for all the words in the original data set.

Looking good – there are words and scores!
Back to my comfort zone – a CSV
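Stripped to its core, the pipeline looks something like this (hypothetical file and column names – the code in the blog post does more):

```python
import pandas as pd
from nltk.sentiment.vader import SentimentIntensityAnalyzer

# Hypothetical input: a CSV with a single 'word' column to score.
words = pd.read_csv("words.csv")

sia = SentimentIntensityAnalyzer()
words["compound"] = words["word"].apply(
    lambda w: sia.polarity_scores(w)["compound"]
)

words.to_csv("word_sentiment.csv", index=False)  # the CSV shown above
```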

Now for the final step – validating that my results aligned with expectations. They did – yay!

0.3182 = 0.3182

Next steps: viz the data (obviously). I’m also hoping to extend this to additional sentiment analysis, maybe even something from Twitter. Oh, and I ended up running a Jupyter notebook (you guessed it – already installed) to get past the pain of typing directly in the terminal.