Tableau 10.2 is on the horizon, and with it come several new features – the one of particular interest to me is the new Python integration. Here’s the Beta program beauty shot:
Essentially, this means that a more advanced programming language, suited to more sophisticated analysis, becomes an easy-to-use extension of Tableau. As you can see from the picture, it works much like the existing R integration: the end user passes native Python code through the SCRIPT_STR() function and gets the output back.
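To give a concrete sense of the shape this takes, here’s a hypothetical calculated field (the field name and script body are my own illustration, not something from the Beta). In the R integration the passed-in aggregates show up as special argument variables inside the script string, and the Python flavor follows the same pattern with `_arg1`, `_arg2`, and so on:

```
// Hypothetical example: upper-case a word field by passing Python through SCRIPT_STR
SCRIPT_STR("return [w.upper() for w in _arg1]", ATTR([Word]))
```

Note that these run as table calculations, so the script receives a list of values and has to return a list of the same length.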
I have to admit that I’m pretty excited by this. I see it propelling some data science concepts further into the mainstream and making their purpose much easier to communicate and understand.
In preparation I wanted to spend some time setting up a Linux Virtual Machine to start getting a ‘feel’ for Python.
(Detour) My computer science backstory: my intro to programming was C++ and Java. They both came easily to me. Later I tried to take a mathematics class based in UNIX, probably a precursor to some of the modern languages we’re seeing, but I couldn’t get on board with the “terminal” level of entry. Very off-putting coming from a world where you have a better feedback loop on what you’re coding. Since that time (~9 years ago) I haven’t had the opportunity to encounter or use these types of languages; in my professional world everything is built on SQL.
Anyway, back to the heart of it – getting a box set up for Python. I’m a very independent person and like to take the knowledge I’ve built up over time and troubleshoot my way to results. The process of failing and learning on the spot, with minimal guidance, helps me solidify what I know.
Here are the steps I went through – mind you, I have a PC and I am intentionally running Windows 7. (This is a big reason why I made a Linux VM.)
- Download and install VirtualBox by Oracle
- Download x86 ISO of Ubuntu
- Build out Linux VM
- Install Ubuntu
These first four steps are pretty straightforward in my mind. It’s a typical Windows installer for VirtualBox, and getting the image is very easy, as is the build (just pick a few settings).
Next came the Python part. I figured I’d have to install something on my Ubuntu machine, but I was pleasantly surprised to learn that Ubuntu already comes with Python 2.7 and 3.5. A step I don’t have to do, yay!
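If you want to verify this yourself, a quick check from the Ubuntu terminal shows both interpreters without installing anything (exact minor versions will depend on your Ubuntu release):

```shell
# Ubuntu ships with Python out of the box – nothing to install
python3 --version                      # e.g. Python 3.5.x on a 16.04-era image
python --version 2>/dev/null || true   # Python 2.7.x, if the python2 alias exists
```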
Now came the part where I hit my first real challenge. I had this idea of working up to the sentiment analysis steps Brit Cava outlined on the Tableau blog. I’d reviewed the code and could follow the logic decently well, and I think it makes a very extensible starting point.
So based on the blog post, I knew there would be some Python modules I’d need. Googling led me to believe that installing Anaconda would be the best path forward: it bundles several of the most popular Python modules, so installing it would eliminate the need to add modules individually.
I downloaded the file just fine, but the instructions on “installing” were less than stellar. Here are the instructions:
So, as someone who takes instructions very literally (and, again, doesn’t know UNIX very well), I was unfortunately greeted with a nasty error message lacking any help. Feelings from years ago were creeping in quickly. Alas, I Googled my way through it (and had a pretty good inkling that bash simply couldn’t ‘find’ the file).
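For anyone hitting the same wall, the gotcha can be recreated with a stand-in script (the paths and filename below are my own demo, not the real installer): bash resolves a bare filename against the current directory, so running the installer from the wrong place produces exactly that unhelpful “No such file or directory” error.

```shell
# Recreate the gotcha with a stand-in installer script
mkdir -p /tmp/downloads
printf 'echo "installer ran"\n' > /tmp/downloads/Anaconda3-demo-Linux-x86_64.sh

cd /tmp             # wrong directory: bash can't find the file by name alone
bash Anaconda3-demo-Linux-x86_64.sh 2>/dev/null || echo "no such file from here"

cd /tmp/downloads   # right directory (or give bash the full path instead)
bash Anaconda3-demo-Linux-x86_64.sh
```

The fix for the real installer is the same idea: `cd` to wherever the `.sh` downloaded, or pass bash the full path.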
So installing Anaconda turned out to be pretty easy once I got the right command into the prompt. Then came the fun part: trying to do sentiment analysis. I knew from my reading that Anaconda covered the three modules the post mentioned – pandas, nltk, and time (that last one is actually part of Python’s standard library). So I felt like this was going to be pretty easy to test out, coding directly from the terminal.
Well – I hit my second major challenge. The lexicon required for the sentiment analysis wasn’t included, so I had no way of actually running it and was left to figure things out on my own. This part was actually not that bad: Python gave me a helpful error message, essentially telling me to call the nltk downloader and fetch the lexicon. And the nltk downloader has a cute little GUI for finding the right lexicon (vader). I had it installed pretty quickly.
Finally – I was confident that I could input the code and come up with some results. And this is where I hit my last obstacle, probably the most frustrating of the night. When pasting in the code (in raw form from the blog post) I kept running into errors. The messages weren’t very helpful, so I started cutting out lines of code I didn’t need.
Eventually I figured out the problem: there were stray spaces in the raw code snippet, and in Python, indentation is part of the syntax. After some additional googling (this time by my husband), he kindly reported, “apparently spaces matter according to this forum.” No big deal – lesson learned!
So what did I get at the end of the day? A wonderful CSV output of sentiment scores for all the words in the original data set.
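The last mile of that output is just pandas. Here’s a tiny sketch of the shape of it – the words and compound scores below are stand-in illustrative values, not numbers computed from the real data set:

```python
import pandas as pd

# Hypothetical word/score pairs standing in for the real VADER output;
# in the actual workflow the compound column comes from the analyzer
scores = pd.DataFrame({
    "word": ["delightful", "dreadful", "okay"],
    "compound": [0.57, -0.63, 0.23],  # illustrative numbers only
})

# Write the scored words out for Tableau to pick up
scores.to_csv("word_sentiment.csv", index=False)
```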
Now for the final step – validating that my results aligned with expectations. They did – yay!
Next steps: viz the data (obviously). I’m hoping to extend this to another sentiment analysis, maybe even something from Twitter. Oh, and I also ended up running a Jupyter notebook (you guessed it: already installed) to get past the pain of typing directly in the terminal.