And so it begins – Adventures in Python

Tableau 10.2 is on the horizon and with it come several new features – one that is of particular interest to me is their new Python integration.  Here’s the Beta program beauty shot:

Essentially what this will mean is that more advanced programming languages aimed at doing more sophisticated analysis will become an easy-to-use extension of Tableau.  As you can see from the picture, it’ll work similarly to the R integration, with the end user calling the SCRIPT_STR() function to pass through the native Python code and return its output.

I have to admit that I’m pretty excited by this.  For me I see this propelling some data science concepts more into the mainstream and making it much easier to communicate and understand the purpose behind them.

In preparation I wanted to spend some time setting up a Linux Virtual Machine to start getting a ‘feel’ for Python.

(Detour) My computer science backstory: my intro to programming was C++ and Java.  They both came easy to me.  I tried to take a mathematics class based in UNIX later on that was probably the precursor to some of the modern languages we’re seeing, but I couldn’t get on board with the “terminal” level entry.  Very off putting coming from a world where you have a better feedback loop in terms of what you’re coding.  Since that time (~9 years ago) I haven’t had the opportunity to encounter or use these types of languages.  In my professional world everything is built on SQL.

Anyway, back to the main heart – getting a box set up for Python.  I’m a very independent person and like to take the knowledge I’ve learned over time and troubleshoot my way to results.  The process of failing and learning on the spot with minimal guidance helps me solidify my knowledge.

Here are the steps I went through – mind you I have a PC and I am intentionally running Windows 7.  (This is a big reason why I made a Linux VM.)

  1. Download and install VirtualBox by Oracle
  2. Download x86 ISO of Ubuntu
  3. Build out Linux VM
  4. Install Ubuntu

These first four steps are pretty straightforward in my mind.  Typical Windows installer for VirtualBox.  Getting the image is very easy as is the build (just pick a few settings).

Next came the Python part.  I figured I’d have to install something on my Ubuntu machine, but I was pleasantly surprised to learn that Ubuntu already comes with Python 2.7 and 3.5.  A step I don’t have to do, yay!

Now came the part where I hit my first real challenge.  I had this idea of getting to a point where I could go through steps of doing sentiment analysis outlined by Brit Cava on the Tableau blog.  I’d reviewed the code and could follow the logic decently well.  And I think this is a very extensible starting point.

So based on the blog post I knew there would be some Python modules I’d be in need of.  Googling led me to believe that installing Anaconda would be the best path forward, since it contains several of the most popular Python modules.  Thus installing it would eliminate the need to individually add in modules.

I downloaded the file just fine, but instructions on “installing” were less than stellar.  Here are the instructions:

Directions on installing Anaconda on Linux

So as someone who takes instructions very literally (and again – doesn’t know UNIX very well) I was unfortunately greeted with a nasty error message lacking any help.  Feelings from years ago were creeping in quickly.  Alas, I Googled my way through this (and had a pretty good inkling that it just couldn’t ‘find’ the file).

What they said (also notice I already dropped the _64, since mine isn’t 64-bit).

 

Alas – all that was needed to get the file to install!

So installing Anaconda turned out to be pretty easy, after getting the right code in the prompt.  Then came the fun part, trying to do sentiment analysis.  I knew enough based on reading that Anaconda came with the three modules mentioned: pandas, nltk, and time.  So I felt like this was going to be pretty easy to try and test out – coding directly from the terminal.

Well – I hit my second major challenge.  The lexicon required to do the sentiment analysis wasn’t included.  So, I had no way of actually doing the sentiment analysis and was left to figure it out on my own.  This part was actually not that bad: Python gave me a good prompt for the fix – essentially to call the nltk downloader and get the lexicon.  And the nltk downloader has a cute little GUI to find the right lexicon (vader).  I got this installed pretty quickly.

Finally – I was confident that I could input the code and come up with some results.  And this is where I hit my last obstacle and probably the most frustrating of the night.  When pasting in the code (raw form from blog post) I kept running into errors.  The message wasn’t very helpful and I started cutting out lines of code that I didn’t need.

What’s the deal with line 5?

Eventually I figured out the problem – there were weird spaces in the raw code snippet.  After some additional googling (this time by my husband), he kindly reported that “apparently spaces matter according to this forum.”  No big deal – lesson learned!
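For anyone hitting the same wall, here is a minimal Python sketch of the lesson.  It assumes the "weird spaces" were non-breaking spaces (U+00A0), which often sneak in when code is copied from a web page; I never confirmed which characters were actually involved, and the pasted snippet and helper names below are made up for illustration.

```python
NBSP = "\u00a0"  # non-breaking space, a common stowaway in code copied from web pages


def clean_pasted_code(source: str) -> str:
    """Swap non-breaking spaces for ordinary spaces and expand tabs to 4 columns."""
    return source.replace(NBSP, " ").expandtabs(4)


def compiles(source: str) -> bool:
    """Return True if Python can parse the source, False on any syntax error."""
    try:
        compile(source, "<pasted>", "exec")
        return True
    except SyntaxError:  # IndentationError and TabError are subclasses
        return False


# Hypothetical pasted snippet: the indentation on line 2 uses non-breaking spaces.
pasted = "for word in ['good', 'bad']:\n" + NBSP * 4 + "print(word)\n"

print(compiles(pasted))                     # the raw paste does not parse
print(compiles(clean_pasted_code(pasted)))  # the cleaned version does
```

Running the check before pasting into the terminal would at least separate "my code is wrong" from "my whitespace is wrong".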

Yes! Success!

So what did I get at the end of the day?  A wonderful CSV output of sentiment scores for all the words in the original data set.

Looking good, there are words and scores!
Back to my comfort zone – a CSV
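The shape of that last step can be sketched in a few lines of Python.  This is not the code from the Tableau blog post; it is a stand-in that assumes a list of words and any function that returns a score per word.  The toy lexicon and its values are invented so the sketch runs without NLTK, and the commented lines show how NLTK's VADER analyzer could be swapped in once the lexicon is downloaded.

```python
import csv
import io


def write_scores(words, scorer, out):
    """Write a header plus one (word, score) row per word to an open text file."""
    writer = csv.writer(out)
    writer.writerow(["word", "score"])
    for word in words:
        writer.writerow([word, scorer(word)])


# Stand-in scorer with made-up values so the sketch runs without NLTK.
# With NLTK and the vader lexicon installed, the scorer could instead be:
#   from nltk.sentiment.vader import SentimentIntensityAnalyzer
#   sia = SentimentIntensityAnalyzer()
#   scorer = lambda text: sia.polarity_scores(text)["compound"]
TOY_LEXICON = {"great": 0.6, "awful": -0.5}


def toy_scorer(word):
    """Look up a word in the toy lexicon; unknown words score 0.0."""
    return TOY_LEXICON.get(word.lower(), 0.0)


# An in-memory buffer stands in for a real file such as open("scores.csv", "w", newline="").
buffer = io.StringIO()
write_scores(["great", "awful", "table"], toy_scorer, buffer)
print(buffer.getvalue())
```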

Now for the final step – validate that my results aligned with expectations.  And they did – yay!

0.3182 = 0.3182

Next steps: viz the data (obviously).  And I’m hoping to extend this to an additional sentiment analysis, maybe even something from Twitter.  Oh and I also ended up running (you guessed it, already installed) a Jupyter notebook to get over the pain of typing directly in the Terminal.

Synergy through Action

This has been an amazing week for me.  On the personal side of things my ship is sailing in the right direction.  It’s amazing what the new year can do to clarify values and vision.

Getting to the specifics of why I’m calling this post “Synergy through Action.”  That’s the best way for me to describe how my participation in this week’s Tableau and data visualization community offerings has influenced me.

It all actually started on Saturday.  I woke up and spent the morning working on a VizforSocialGood project, specifically a map to represent the multiple locations connected to the February 2017 Women in Data Science conference.  I’d been called out on Twitter (thanks Chloe) and felt compelled to participate.  The kick of passion I received after submitting my viz propelled me into the right mind space to tackle 2 papers toward my MBA.

Things continued to hold steady on Sunday where I took on the #MakeoverMonday task of Donald Trump’s tweets.  I have to imagine that the joy from accomplishment was the huge motivator here.  Otherwise I can easily imagine myself hitting a wall.  Or perhaps it gets easier as time goes on?  Who knows, but I finished that viz feeling really great about where the week was headed.

Monday – Alberto Cairo and Heather Krause’s MOOC was finally open!  Thankfully I had the day off to soak it all in.  This kept my brain churning.  And by Wednesday I was ready for a workout!

So now that I’ve described my week – what’s the synergy in action part?  Well I took all the thoughts from the social good project, workout Wednesday, and the sage wisdom from the MOOC this week to hit on something much closer to home.

I wound up creating a visualization (in the vein of) the #WorkoutWednesday redo offered up.  What’s it of?  Graduation rates of specific demographics for every county in Arizona for the past 10ish years.  Stylized into small multiples using a smattering of slick tricks I was required to use to complete the workout.

Here’s the viz – although admittedly it is designed more as a static view (not quite an infographic).

 

And to sum it all up: this could be the start of yet another spectacular thing.  Bringing my passion to the local community that I live in – but more on a widespread level (in the words of Dan Murray, user groups are for “Tableau zealots”).

Makeover Monday 2017 – Week 3 Trump Tweets

**Update (1/20/17) : The original data set had a date formatting snafu resulting in 1307 tweets at the 12:00-12:59 PM (UTC time) hour being displayed as 00:00-00:59 (aka 12 AM hour).  This affected 4.3% of the original data set visualization and has been corrected.  I have also added a footnote denoting the visualization is in EST.  This affects the shape of the data in both the 4 AM – 8 AM and 4 PM – 8 PM sections.

Rolling right along into week 3’s Makeover Monday.  The data set this week: Donald Trump’s tweets.  The original Buzzfeed viz and article accompanying this analyzed Trump’s retweet activity since his announcement of running for president.  The final viz ended up being what I would best describe as bubble charts of the top users he retweeted during this time:

What’s interesting is that the actual article goes into significant depth on how their team systematically reviewed the tweets.  It’s a bummer that the additional analysis done couldn’t be synthesized into visual form.

My take on the makeover this week was driven completely by the underlying data available.  The TDE provided had the following fields:

Two things stuck out to me with the data.  First: the username being retweeted wasn’t included; second: the entire tweet text was included.  Having full text available just screams for some sort of text analysis.  I got committed at that point to doing something with the text.

My initial idea was to do some sort of sentiment analysis.  Recently I had installed both R-Studio and Python on my PC to try integration with Tableau.  I’d had success with R-Studio (mind you after watching a brief YouTube video), but I hadn’t gotten Python to cooperate (my effort in assisting in this cooperation = 2 out of 10).  I figured since I had both available maybe I should make an attempt.  After marinating on the concept I didn’t feel comfortable adding more sentiment analysis to the fire of American politics.  (On a personal note: I have been politically checked out since the early primaries.)

So instead of doing sentiment analysis, I decided to turn the data more into text mining for mentions and hashtags.  I had done some fiddling with the time component and was digging how the cycle plot/horizon chart were playing out visually.  So it seemed natural to continue on a progression of getting more details out of the bars and times of day.

Note on the time: time is graciously parsed into correct format with the data.  In looking at the original time, I am under the impression it was represented in GMT (+0000).  To adjust for this, I added -5 hours to all of the parsed dates to put it in EST aka Trump time.
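Outside of Tableau, the same shift can be sketched in Python.  The timestamp format below is an assumption (the post does not show the raw field), and the fixed -5 hour offset mirrors what I did in the viz, so it ignores daylight saving; for wall-clock Eastern time year-round, the standard library's zoneinfo with "America/New_York" would be the safer choice.

```python
from datetime import datetime, timedelta, timezone

# Fixed offset matching the "-5 hours" adjustment; deliberately no daylight saving.
EST = timezone(timedelta(hours=-5), name="EST")


def utc_to_est(timestamp: str) -> datetime:
    """Parse a 'YYYY-MM-DD HH:MM:SS' string as UTC and convert it to fixed EST."""
    utc_time = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S").replace(tzinfo=timezone.utc)
    return utc_time.astimezone(EST)


local = utc_to_est("2017-01-16 12:30:00")
print(local.isoformat())  # 2017-01-16T07:30:00-05:00

# Early-morning UTC tweets roll back to the previous calendar day in EST.
print(utc_to_est("2017-01-16 03:00:00").date())  # 2017-01-15
```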

So back to text mining.  Post #data16 conference, a colleague of mine was recounting how to use regex to scrub through text.  I walked away from his talk thinking I need to use that next time I have the opportunity.  And what I love about it: NATIVE FUNCTION TO TABLEAU!!  So this was making me sing.  Now I don’t know a ton about regex (lots of notation I have yet to memorize), so I decided to quickly google my way to getting the user handles and hashtags.  These handy results really made this analysis zip along: regexr & regex+twitter.
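For the curious, here is roughly what that extraction looks like in Python rather than in a Tableau calculation.  The patterns are a simplified sketch, not the exact expressions I used in the workbook, and the sample tweet is invented; the lookbehind is there so that email addresses are not mistaken for mentions.

```python
import re

# A handle or hashtag is '@' or '#' followed by word characters, and must not be
# glued to a preceding word character (which would make it part of an email or URL).
MENTION = re.compile(r"(?<!\w)@(\w+)")
HASHTAG = re.compile(r"(?<!\w)#(\w+)")


def extract_entities(tweet: str) -> dict:
    """Return the handles and hashtags found in a tweet, without the leading symbol."""
    return {
        "mentions": MENTION.findall(tweet),
        "hashtags": HASHTAG.findall(tweet),
    }


# Made-up tweet text for illustration.
sample = "Great rally with @some_user and @AnotherUser today! #Hashtag1 #MAGA contact: me@example.com"
print(extract_entities(sample))
```

One row per tweet with a list of handles is handy for counting; in Tableau the equivalent usually means one calculated field per match position.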

Everything else came to life pretty quickly.  I knew I wanted to include at least one or two tweets to read through, but I wanted to keep it curated.  I think this was accomplished well and I spent a good deal of time trying out different time combinations just to see what would bubble to the surface.

A final note on aesthetics this week: I’m reading Alberto Cairo’s The Functional Art, and as I mentioned in an earlier post, I’m also participating in his MOOC that starts tomorrow.  I am only 4 chapters in, but Alberto has me taking a few things to heart.  I don’t think it is by coincidence that I decided to push the beauty side of things.  I always strive for elegance, but I strive for it through white space and keeping that “data ink ratio” at a certain point.  But I’m not blind to the different visualizations out there that attract people.  So for once I used a non-white background (yay!).  And I also went for a font that’s well outside of the look of my usual vizzing font.

More important than the aesthetics, of course, is the function of the viz.  I tried to spend more time thinking about the audience and what they were going to “get” out of it.  I hope that the final product is less of a “visual aid” to my analysis and more of an interactive tool to explore the tweets of the soon to be President.

Full viz available on my Tableau public page.

#DataResolutions – More than a hashtag

This gem of a blog post appeared on Tableau Public and within my twitter feed earlier this week asking what my #DataResolutions are.  Here was my lofty response:

 


Sound like a ton of goals and setting myself up for failure?  Think again.  At the heart of most of my work with data visualization are 2 concepts: growth and community.  I’ve had the amazing opportunity to co-lead and grow the Phoenix Tableau user group over the past 5+ months.  And one thing I’ve learned along the way: to be a good leader you have to show up.  Regardless of skill level, technical background, formal education, we’re all bound together by our passion for data visualization and data analytics.

To ensure that I communicate my passion, I feel that it’s critical to demonstrate it.  It grows me as a person and stretches me outside of my comfort zone to an extreme.  And it opens up opportunities and doors for me to grow in ways I didn’t know existed.  A great example of this is enrolling in Alberto Cairo and Heather Krause’s MOOC Data Exploration and Storytelling: Finding Stories in Data with Exploratory Analysis and Visualization.  I see drama and storytelling as a development area for me personally.  Quite often I think I get so wrapped up in the development of data stories that the final product is a single component being used as my own visual aid.  I’d like to learn how to communicate the entire process within a visualization and guide a reader through.  I also want to be surrounded by 4k peers who have their own passion and opinions.

Moving on to collaborations.  There are 2 collaborations I mentioned above, one surrounding data+women and the other is data mashup.  My intention behind developing out these is to once again grow out of my comfort zone.  Data Mashup is also a great way for me to enforce accountability to Makeover Monday and to develop out my visualization interpretation skills.  The data+women project is still in an incubation phase, but my goal there is to spread some social good.  In our very cerebral world, sometimes it takes a jolt from someone new to be used as fuel for validation and action.  I’m hoping to create some of this magic and get some of the goodness of it from others.

More to come, but one thing is for sure: I can’t fail if I don’t write down what I want to achieve.  The same is true for achievement, unless it’s written down, how can I measure?

Makeover Monday 2017 – Week 2

It’s time for Makeover Monday – Week 2.  This week’s data set was the quarterly sales (by units) of Apple iPhones for the past 10ish years.  The original article accompanying the data indicated that the golden years of Apple may be over.

So let me start by saying – I broke the rules (or rather, the guidelines).  Makeover Monday guidelines indicate that the goal is to improve upon the original visualization and stick to the original data fields.  I may have overlooked that guideline this week in favor of adding a little more context.

When I first approached the data set and dropped it into Tableau, the first thing I immediately noticed was that Q4 always has a dip compared to the other quarters of the year.

This view contradicted all of my existing knowledge of how iPhone releases work.  Typically every year Apple holds a conference around the middle/end of September announcing the “new” iPhone.  That can either be the gap increase (off year, aka the S) or the new generation.  It lines up such that pre-sales and sales come in the weeks shortly following.  And in addition to that I would suspect that sales would stay heightened throughout the holiday season.

This is where I immediately went back to the data to challenge it and I noticed that Apple defines its fiscal year differently.  Specifically October to December (of the previous year) counts as Q1 of the current year.  Essentially Q1 of 2017 is actually 10/1/16 to 12/31/16.  Meaning that in the normalized world thinking about quarters, everything should be adjusted.
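The adjustment itself is a one-quarter shift, which can be sketched in Python as follows.  It assumes, as described above, that fiscal quarters line up exactly with calendar quarters (fiscal Q1 = October through December of the prior year); the function name is just for illustration.

```python
def fiscal_to_calendar(fiscal_year: int, fiscal_quarter: int) -> tuple:
    """Map an Apple-style fiscal (year, quarter) to the calendar (year, quarter).

    Fiscal Q1 covers October-December of the previous calendar year, so every
    fiscal quarter sits one quarter ahead of the calendar quarter it describes.
    """
    if fiscal_quarter not in (1, 2, 3, 4):
        raise ValueError("fiscal_quarter must be 1, 2, 3, or 4")
    if fiscal_quarter == 1:
        return fiscal_year - 1, 4
    return fiscal_year, fiscal_quarter - 1


print(fiscal_to_calendar(2017, 1))  # (2016, 4): 10/1/16 to 12/31/16
print(fiscal_to_calendar(2016, 4))  # (2016, 3): July to September 2016
```

With this mapping the September launch and the holiday season land in calendar Q3 and Q4, which is where the sales bumps show up once the axis is corrected.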

Now I was starting to feel much better about how things were looking.  It aligned with my real world expectations.

I still couldn’t help but feel that a significant portion of the story was missing.  In my mind it wasn’t fair to only look at iPhone sales over time without understanding more data points of the smartphone market.  I narrowed it down to overall sales of smartphones and number of smartphone users.  The idea I had was this: have we reached a point where the number of smartphone users is now a majority?  Essentially the Adoption Curve came to my mind – maybe we’ve hit that sweet spot where the Late Majority is now getting in on smartphones.

To validate the theory and keep things simple, I did quick searches for data sets I could bring into the view.  As if through serendipity, the two additional sources I stumbled upon came from the same site as the original (statista.com).  I went ahead and added them into my data set and got to work.

My initial idea was this: line plot of iPhone sales vs. overall smartphone sales.  See if directionality was the same.  Place a smaller graph of smartphone users to the side (mainly because it was US only, couldn’t find a free global data set).  And the last viz was going to be a combination of the 3 showing basic “growth” change.  That in my mind would in a very basic way display an answer to my questioning.

I went through a couple of iterations and finally landed on the view below as my final.

I think it sums up the thought process and answers the question I originally asked myself when I approached the data set.  And hopefully I can be pardoned (if even necessary) since the accompanying data added in merely enhanced information at hand and kept with the simplicity of data points available (units and time).

Makeover Monday 2017 – Week 1

It’s officially 2017 – the start of a new year.  As such, this is a great time for anyone in the Tableau universe to make a fresh commitment to participate in the community challenge known as Makeover Monday.

As I jump into this challenge, I’ve made the conscious decision to start with the things I already like doing and to add on each time.  This to me is the way that I’ll be able to stay actively involved and enthusiastic.  Essentially: keep it simple.

For this week’s data set it was obvious that something of a comparative nature needed to be applied.  I started off with a basic dot plot and went from there.

What I ended up with: a slope chart with the slope representing the delta in rank of income by gender, the size of the line representing the annual monetary difference in income, and 3 colors representing categorized multipliers on the wage gap.

I wanted this to be for a phone, so I held to the idea of a single viz.  Interactivity is really limited to tooltips; most other nuance comes from the presentation of the visualization itself.

And I pushed myself to add a little journalistic flair this week.  Not really my style, but I figured I would see where it took me.

The Float Plot

One of the more interesting aspects of data visualization is how new visualization methods are created.  There are several substantial charts, graphs, and plots out there that visualization artists typically rely on.

As I’ve spent time reading more about data visualization, I started thinking about potential visualizations out there that could be added into the toolkit.  Here’s the first one that I’ve come up with: The Float Plot.

The idea behind the float plot is simple.  Plot one value that has some sort of range of good/acceptable/bad values and use color banding to display where it falls.  It works well with percentage values.

I’ve also made a version that incorporates peers.  Peers could be previous time period values or they could be less important categories.  The version with peers reminds me somewhat of a dot plot, but I particularly appreciate the difference in size to distinguish the important data point.

What’s also great about the Float Plot is that it doesn’t have to take up much space.  It looks great scaled short vertically or narrow horizontally.

Enjoy the visualization on my Tableau public profile here.

Statistical Process Control Charts

I’ve had this idea for a while now – create a blog post and video tutorial discussing what Statistical Process Control is and how to use different Control Chart “tests” in Tableau.

I’ve spent a significant portion of my professional career in business process improvement and always like it when I can integrate techniques learned from a discipline derived from industrial engineering and apply it in a broader sense.

It also gives me a great chance to brush up on my knowledge and learn how to order my thoughts for presenting to a wide audience.  And let’s not forget: an opportunity to showcase data visualization and Tableau as the delivery mechanism of these insights to my end users.

So why Statistical Process Control?  Well it’s a great way to use the data you have and apply different tests to start early detection.  Several of the rules out there are aimed at finding “out-of-control,” non-normal, or repetitive parts within a stream of data.  Different rules have been developed based on how we might be able to detect them.

The video tutorial above goes through the first 3 Western Electric rules.  Full details on Western Electric via Wikipedia: here.

Rule 1: Very basic, uses the principle of a bell curve to put a spotlight on points that are above or below the Upper Control Limit (UCL) or Lower Control Limit (LCL) also known as +/- 3 standard deviations from the mean.  These are essentially outlier data points that don’t fall within our typical span of 99.7%.

Rule 2: Takes into consideration surrounding observations.  Looking at 3 consecutive observations, are 2 out of 3 beyond the 2 SD mark from the average?  In this rule the observations must be on the same side of the average line when beyond 2 SD.  Since roughly 95% of observations fall within 2 SD, having 2 out of 3 in a set beyond that range could signal an issue.

Rule 3: Starts to consider even more data points within a collection of observations.  In this scenario we’re now looking for 4 out of 5 consecutive observations beyond 1 SD from the average.  Again, the flagged observations must all sit on the same side of the average line (all above or all below).  This one really shows the emergence of a trend.
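The three rules above can be sketched outside of Tableau in a few lines of Python.  This is a rough illustration and not the calculation from the video: it assumes the center line and standard deviation come from the data itself unless you pass them in, and the function and parameter names are my own.  Each flagged index is the last observation of a window that trips the rule.

```python
from statistics import mean, pstdev


def western_electric_flags(values, center=None, sigma=None):
    """Return the indexes that trip Western Electric rules 1-3.

    center and sigma default to the mean and population standard deviation
    of the values; pass them explicitly to use limits from a baseline period.
    """
    mu = mean(values) if center is None else center
    sd = pstdev(values) if sigma is None else sigma
    if sd == 0:
        raise ValueError("standard deviation is zero; no control limits to test")
    z = [(v - mu) / sd for v in values]  # distance from the center line in SDs

    def run_rule(window, needed, limit):
        """Flag the last index of each window with `needed` points beyond `limit` SD on one side."""
        flagged = []
        for end in range(window - 1, len(z)):
            chunk = z[end - window + 1 : end + 1]
            above = sum(1 for s in chunk if s > limit)
            below = sum(1 for s in chunk if s < -limit)
            if above >= needed or below >= needed:
                flagged.append(end)
        return flagged

    return {
        "rule1": [i for i, s in enumerate(z) if abs(s) > 3],  # beyond UCL/LCL
        "rule2": run_rule(window=3, needed=2, limit=2),       # 2 of 3 beyond 2 SD
        "rule3": run_rule(window=5, needed=4, limit=1),       # 4 of 5 beyond 1 SD
    }


# Made-up readings, already expressed in SD units around a center of 0.
readings = [0.1, -0.2, 3.5, 0.0, 2.1, 0.3, 2.4, 1.2, 1.5, 1.1, 0.2, 1.3]
print(western_electric_flags(readings, center=0, sigma=1))
```

Passing a fixed center and sigma matters in practice: if the limits are recomputed from the same data that contains the streak, the streak itself widens the limits and can hide the signal.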

I applied the first 3 rules to my own calorie data to detect any potential issues.  It’s very interesting to see the results.  For my own particular data set, Rule 3 was of significant value.  Having it in line as the new daily data funnels in could prevent me from going on a “streak” of either over or under consuming.

 

Interact with the full version on my Tableau Public profile here.

#MakeoverMonday 11/22/16 – Advanced Logging Edition

And it’s time – my first ever Makeover Monday.  I’ll admit, I’ve attempted to catch up in the past, but always lost steam.  I think the first data set might be related to sports and I struggle to focus on making something interesting.

Despite my follies, I’m proud to say that I’ve participated in this week’s Makeover Monday in honor of the special advanced logging that is taking place.  Along with submitting work with the hashtag on twitter, Tableau has asked for us to upload a copy of our log files and workbook.  Contained within the advanced log files are .PNGs that show analysis iterations.

I went into this Monday with the idea of doing a basic “best practices” version.  One that would mimic something I might create for ultimate exploration and zero data journalism.  I tried to stick with one element that I thought worked well and create the dashboard around it.

Looking at the other participants, I’m already thinking that my time heatmap could be improved.  My mind was stuck on the day numbers and quarters.  I should have switched to days of the week!  Irrespective – here it is:


And the GIF:

makeover-monday-112116

#data16 Data Dump

Last night was our monthly Phoenix Tableau User Group (PHXTUG) meeting and as part of the post-excitement of Tableau’s 2016 conference we took some time to go through their strategy and some upcoming features.

Full video is available here:

Interested in reusing the slides? Find the deck here: