Tag: sentiment analysis

  • The Shape of Shakespeare’s Sonnets | #IronViz Books & Literature

    If it’s springtime, that can only mean it’s time for the feeder rounds of Tableau’s Iron Viz contest.  The kick-off global theme for the first feeder is books & literature, a massive topic with lots of room for interpretation.  So without further delay, I’m excited to share my submission: The Shape of Shakespeare’s Sonnets.

    The genesis of the idea

    The idea came after a rocky start and an abandoned first attempt.  My initial plan was to approach the topic with a meta-analysis of the overall subject (‘books’) and to avoid focusing on a single book.  I found a wonderful data set of NYT non-fiction best-seller lists, but was uninspired after spending a significant amount of time consuming and prepping the data.  So I switched mid-stream: keep the parameters of a meta-analysis, but change to a body of literature that one could actually be performed on.  I landed on Shakespeare’s Sonnets for several reasons:

    • Rigid structure – great for identifying patterns
    • 154 divides evenly for small multiples (11×14 grid)
    • Concepts of rhyme and sentiment could easily be analyzed
    • More passionate subject: themes of love, death, wanting, beauty, time
    • Open source text, should be easy to find
    • Focus on my strengths: data density, abstract design, minimalism

    Getting Started

    I wasn’t disappointed with my Google search – it took me about 5 minutes to locate a fantastic CSV containing all of the Sonnets (and more) in a nice relational format.  There was one criterion the data set needed to meet to be usable – namely, each line of each sonnet had to be its own record.  From that point, I knew I could explode and reshape the data as necessary to get to a final analysis.

    Prepping & Analyzing the Data

    The strong structure of the sonnets meant that counting things like the number of characters and the number of words would yield interesting results, and that drove the first data preparation step.  Using Alteryx, I expanded each line out into columns of individual words.  Those were then transposed back into rows and affixed to the original data set.  Why?  This allows quick character counting in Tableau, repeats the line-level dimensions (like line and sonnet number), and adds a dimension for each word’s position within its line.
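    I did this in Alteryx, but the same reshaping is easy to sketch in pandas for anyone following along without it (the column names below are placeholders, not the fields from the actual source CSV):

        import pandas as pd

        # Hypothetical layout: one record per sonnet line
        lines = pd.DataFrame({
            "sonnet": [18, 18],
            "line_number": [1, 2],
            "line_text": [
                "Shall I compare thee to a summer's day?",
                "Thou art more lovely and more temperate:",
            ],
        })

        # Split each line into words, then explode to one row per word,
        # keeping the line-level dimensions alongside a word-position index
        words = lines.assign(word=lines["line_text"].str.split()).explode("word")
        words["word_number"] = words.groupby(["sonnet", "line_number"]).cumcount() + 1
        words["char_count"] = words["word"].str.len()  # punctuation counts here too

        print(words[["sonnet", "line_number", "word_number", "word", "char_count"]])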

    I also extracted all of the unique words, counted their frequency, and exported them to a CSV for sentiment analysis.  Sentiment analysis is a way to score words, phrases, or whole texts to determine the intention, sentiment, or attitude behind them.  For the sake of this analysis, I chose to go with a negative/positive scoring system.  Using Python and the nltk package, each word was scored with VADER.  VADER is optimized for social media, but I found the results fit the words within the sonnets well.
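    To give a feel for what VADER returns, here’s a minimal call (the example words are mine, and the vader lexicon has to be downloaded first, as noted further down):

        from nltk.sentiment.vader import SentimentIntensityAnalyzer

        # Requires the lexicon: import nltk; nltk.download('vader_lexicon')
        sia = SentimentIntensityAnalyzer()

        # polarity_scores returns neg/neu/pos proportions plus a normalized
        # 'compound' score from -1 (most negative) to +1 (most positive)
        print(sia.polarity_scores("love"))    # compound is strongly positive (~0.64)
        print(sia.polarity_scores("death"))   # compound is strongly negative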

    The same process was completed for each sonnet line to get a more aggregated, overall sentiment score.  Again, Alteryx was the key to extracting the data in the format I needed to run it through a quick Python script.

    Here’s the entire Alteryx workflow for the project:

    The major components:
    • Start with original data set (poem_lines.csv)
      • filter to Sonnets
      • Text to column for line rows
      • Isolate words, aggregate and export to new CSV (sonnetwords.csv)
      • Isolate lines, export to new CSV (sonnetlines)
      • Join swordscore to transformed data set
      • Join slinescore to transformed data set
      • Export as XLSX for Tableau consumption (sonnets2.xlsx)
    Python snippet
    make sure you download the nltk lexicons after importing; thanks to Brit Cava for the code inspiration
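    In rough form, the scoring step boils down to something like this sketch (the file and column names are my guesses based on the workflow above; the exact code, like Brit’s, differs in its details):

        import pandas as pd
        import nltk
        from nltk.sentiment.vader import SentimentIntensityAnalyzer

        # Grab the VADER lexicon first -- without it, polarity_scores will fail
        nltk.download('vader_lexicon')

        sia = SentimentIntensityAnalyzer()

        # Score every unique word (sonnetwords.csv from the workflow above)
        words = pd.read_csv('sonnetwords.csv')
        words['score'] = words['word'].apply(lambda w: sia.polarity_scores(w)['compound'])
        words.to_csv('swordscore.csv', index=False)

        # Score every full line the same way for the aggregated view
        lines = pd.read_csv('sonnetlines.csv')
        lines['score'] = lines['line'].apply(lambda t: sia.polarity_scores(t)['compound'])
        lines.to_csv('slinescore.csv', index=False)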

    The Python code is heavily inspired by a blog post from Brit Cava from December 2016.  Blog posts like hers are critically important: they help others within the community do deeper analysis and build new skills.

    Bringing it all together

    Part of my vision was to provoke patterns, build a highly dense data display, and use an 11×14 grid.  My first iteration actually started with mini bar charts for the number of characters in each word.  The visual this produced was what ultimately led down the path of including word sentiment.

    height = word length, bars are in word order

    This eventually changed to circles, which led to the progression of adding a bar to represent the word count of each individual line.  The size of the words at this point became somewhat of a disruption at the micro scale, so word sentiment was distilled down into 3 colors: negative, neutral, or positive.  The sentiment of the entire line instead has a gradient spectrum (same color endpoints for negative/positive).  The sentiment score for each word was reserved for a Viz in Tooltip – which provided inspiration for the name of the project.

    Sonnet 72, line 2
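    Collapsing each word’s compound score into those 3 colors is a one-liner; here’s a minimal sketch using VADER’s conventional ±0.05 cutoffs (my assumption – the viz may use different thresholds):

        def sentiment_bucket(compound: float) -> str:
            """Collapse a VADER compound score into the three word colors."""
            if compound >= 0.05:
                return "positive"
            if compound <= -0.05:
                return "negative"
            return "neutral"

        print(sentiment_bucket(0.6369))   # positive
        print(sentiment_bucket(0.0))      # neutral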

    Each component is easy to see and repeated in macro format at the bottom – it also gives the end user an easy way to read each Sonnet from start to finish.

    designed to show the progression of abstraction

    And there you have it – a grand scale visualization showing the sentiment behind all 154 of Shakespeare’s Sonnets.  Spend some time reciting poetry, exploring the patterns, and finding the meaning behind this famous body of literature.

    Closing words: thank you to Luke Stanke for being a constant source of motivation, feedback, and friendship.  And to Josh Jackson for helping me battle through the creative process.

    The Shape of Shakespeare’s Sonnets

    click to interact at Tableau Public

  • And so it begins – Adventures in Python

    Tableau 10.2 is on the horizon and with it comes several new features – one that is of particular interest to me is their new Python integration.  Here’s the Beta program beauty shot:

    Essentially, this means that more advanced programming languages aimed at more sophisticated analysis will become an easy-to-use extension of Tableau.  As you can see from the picture, it’ll work similarly to the R integration: the end user passes native Python code through the SCRIPT_STR() function and gets the output back in a calculated field.
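    For context, once the integration is connected, a calculated field would look roughly like this (the field name and the classification logic here are invented for illustration):

        // Tableau calculated field: the Python runs server-side, the
        // second argument is exposed to the script as the list _arg1,
        // and the script returns a list of strings back to Tableau
        SCRIPT_STR(
            "return ['positive' if x >= 0.05 else 'negative' for x in _arg1]",
            AVG([Sentiment Score])
        )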

    I have to admit that I’m pretty excited by this.  I see it propelling some data science concepts into the mainstream and making it much easier to communicate and understand the purpose behind them.

    In preparation I wanted to spend some time setting up a Linux Virtual Machine to start getting a ‘feel’ for Python.

    (Detour) My computer science backstory: my intro to programming was C++ and Java.  They both came easily to me.  Later on I tried to take a mathematics class based in UNIX – probably the precursor to some of the modern languages we’re seeing – but I couldn’t get on board with the “terminal” level of entry.  It was very off-putting coming from a world with a much better feedback loop on what you’re coding.  Since that time (~9 years ago) I haven’t had the opportunity to encounter or use these types of languages.  In my professional world everything is built on SQL.

    Anyway, back to the heart of it – getting a box set up for Python.  I’m a very independent person and like to take the knowledge I’ve accumulated over time and troubleshoot my way to results.  The process of failing and learning on the spot with minimal guidance helps me solidify my knowledge.

    Here are the steps I went through – mind you, I have a PC and I am intentionally running Windows 7.  (This is a big reason why I made a Linux VM.)

    1. Download and install VirtualBox by Oracle
    2. Download x86 ISO of Ubuntu
    3. Build out Linux VM
    4. Install Ubuntu

    These first four steps are pretty straightforward in my mind.  It’s a typical Windows installer for VirtualBox, and getting the image is very easy, as is the build (just pick a few settings).

    Next came the Python part.  I figured I’d have to install something on my Ubuntu machine, but I was pleasantly surprised to learn that Ubuntu already comes with Python 2.7 and 3.5.  A step I don’t have to do, yay!

    Now came the part where I hit my first real challenge.  I had this idea of getting to a point where I could go through steps of doing sentiment analysis outlined by Brit Cava on the Tableau blog.  I’d reviewed the code and could follow the logic decently well.  And I think this is a very extensible starting point.

    Based on the blog post, I knew there would be some Python modules I’d need.  Googling led me to believe that installing Anaconda would be the best path forward, since it contains several of the most popular Python modules; installing it would eliminate the need to add modules individually.

    I downloaded the file just fine, but the instructions on “installing” were less than stellar.  Here they are:

    Directions on installing Anaconda on Linux

    So, as someone who takes instructions very literally (and, again, doesn’t know UNIX very well), I was unfortunately greeted with a nasty error message lacking any help.  Feelings from years ago were creeping in quickly.  In the end I Googled my way through it (and had a pretty good inkling that the shell just couldn’t ‘find’ the file).

    What they said (notice I had already dropped the _64, since mine isn’t 64-bit).

    At last – all that was needed to get the file to install!

    So installing Anaconda turned out to be pretty easy – once I got the right command into the prompt.  Then came the fun part: trying to do sentiment analysis.  I knew from my reading that Anaconda came with the three modules mentioned: pandas, nltk, and time.  So I felt like this was going to be pretty easy to test out, coding directly from the terminal.

    Well – I hit my second major challenge.  The lexicon required for the sentiment analysis wasn’t included, so I had no way of actually running it and was left to figure things out on my own.  This part was actually not that bad: Python gave me a helpful prompt to fix it – essentially, call the nltk downloader and get the lexicon.  The nltk downloader has a cute little GUI for finding the right lexicon (vader), and I got it installed pretty quickly.
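    For anyone hitting the same wall, the fix amounts to two lines at the Python prompt:

        import nltk

        # grab the VADER lexicon directly...
        nltk.download('vader_lexicon')

        # ...or call the downloader with no arguments to browse its little GUI
        # nltk.download()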

    Finally, I was confident that I could input the code and come up with some results.  And this is where I hit my last obstacle – probably the most frustrating of the night.  When pasting in the code (in raw form from the blog post) I kept running into errors.  The message wasn’t very helpful, so I started cutting out lines of code that I didn’t need.

    What’s the deal with line 5?

    Eventually I figured out the problem – there were stray spaces in the raw code snippet.  After some additional googling (this time by my husband), he kindly reported that “apparently spaces matter, according to this forum.”  No big deal – lesson learned!
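    For anyone coming from SQL or C-style languages: whitespace is part of Python’s syntax, so a single stray leading space is enough to break a paste.  A minimal reproduction (not the actual code from the post):

        >>> scores = []
        >>>  scores.append(0.3182)
          File "<stdin>", line 1
            scores.append(0.3182)
        IndentationError: unexpected indent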

    Yes! Success!

    So what did I get at the end of the day?  A wonderful CSV output of sentiment scores for all the words in the original data set.

    Looking good, there’s words and scores!
    Back to my comfort zone – a CSV

    Now for the final step – validating that my results aligned with expectations.  And they did – yay!

    0.3182 = 0.3182

    Next steps: viz the data (obviously).  I’m also hoping to extend this to an additional sentiment analysis, maybe even something from Twitter.  Oh, and I ended up running a Jupyter notebook (you guessed it: already installed) to get over the pain of typing directly in the terminal.