Earlier in the year (March), I participated in a panel discussion for a data+women series at the Phoenix TUG. As prep I was given a list of questions and asked to pick a few I felt comfortable answering.
To share my experiences, I wrote a blog post on some of the more intriguing questions. What follows is that post, which I originally published on my company’s social media site on 3/23/16.
What is your first step of investigation when you are given a dataset?
Before investigating a dataset, I think it is critical to understand what the data contains. That means meeting with the business owners or subject matter experts who are responsible for creating, maintaining, or using the data. This helps ensure you understand what they are looking for and what the different fields mean.
Assuming the pre-work has been done with business owners, I always approach a new dataset the same way. There are three things I do every time.
The first is to find a categorical dimension that seems to be at the highest level. Typically I’m looking for a category that has fewer than 20 members (10 to 15 being ideal). This sets a great foundation for understanding the overall population of records. Once I’ve found that categorical dimension, I immediately create a Pareto chart. Recall that a Pareto chart is a bar chart where the bars are ranked from largest to smallest, with a line on the secondary axis displaying the cumulative percentage. The goal is to see immediately where most of the records (or problems) are concentrated. It usually follows the 80/20 rule: roughly 80% of the volume comes from about 20% of the categories.
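To make the Pareto idea concrete, here is a minimal sketch in plain Python of the numbers behind such a chart: counts ranked from largest to smallest plus a running cumulative percentage. The function name `pareto_table` and the sample categories are made up for illustration; they are not from any particular tool or dataset.

```python
from collections import Counter


def pareto_table(categories):
    """Return (category, count, cumulative_percent) rows, largest count first."""
    counts = Counter(categories)
    total = sum(counts.values())
    rows = []
    running = 0
    # most_common() yields categories ordered from highest count to lowest
    for category, count in counts.most_common():
        running += count
        rows.append((category, count, round(100 * running / total, 1)))
    return rows


# Hypothetical sample data: one category value per record
records = ["Billing"] * 5 + ["Shipping"] * 3 + ["Returns"] * 2

for row in pareto_table(records):
    print(row)
```

The counts drive the bars and the third column drives the cumulative line, so the point where that column crosses 80 shows how many categories account for most of the records.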
The next thing I do is plot records over time, usually by month, sometimes by week. This is great for understanding any seasonality and for finding immediate trends. Often systems are changing and new processes are implemented. By plotting records over time, you can immediately take any funny data points back to your business owner and ask for clarity.
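As a rough sketch of the monthly view, the plain-Python snippet below counts records per month and fills in months with no records as zero, since an empty month is often exactly the kind of data point worth asking about. The function name `records_per_month` and the sample dates are hypothetical.

```python
from collections import Counter
from datetime import date


def records_per_month(dates):
    """Return [("YYYY-MM", count), ...] for every month from first to last record."""
    counts = Counter((d.year, d.month) for d in dates)
    if not counts:
        return []
    (year, month), last = min(counts), max(counts)
    series = []
    # Walk month by month so gaps show up as explicit zeros
    while (year, month) <= last:
        series.append((f"{year}-{month:02d}", counts.get((year, month), 0)))
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return series


# Hypothetical sample data: one date per record
sample_dates = [date(2015, 11, 3), date(2015, 11, 20), date(2016, 1, 5)]

for month, count in records_per_month(sample_dates):
    print(month, count)
```

Grouping by week instead would follow the same pattern, keyed on the ISO year and week from `date.isocalendar()`.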
The more clarity you have around your dataset, the more successful you’re going to be at understanding the information you can retrieve from it.
The last thing I do is take that categorical dimension and plot each of its members over time. To a large extent, this is a combination of the first two visuals. From this you’ll be able to further pinpoint any trends that are category specific and find even more potential seasonality or process changes.
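Combining the two earlier steps can be sketched the same way: count records per category per month, giving one time series per category. This is only an illustration in plain Python; `category_by_month` and the sample records are assumptions, not part of the original workflow.

```python
from collections import Counter, defaultdict
from datetime import date


def category_by_month(records):
    """records: iterable of (category, date) pairs.

    Returns {category: {"YYYY-MM": count}} with months in chronological order.
    """
    table = defaultdict(Counter)
    for category, d in records:
        table[category][f"{d.year}-{d.month:02d}"] += 1
    return {
        category: dict(sorted(months.items()))
        for category, months in table.items()
    }


# Hypothetical sample data: (category, date) per record
sample_records = [
    ("Billing", date(2016, 1, 5)),
    ("Billing", date(2016, 1, 9)),
    ("Shipping", date(2016, 2, 1)),
    ("Billing", date(2016, 2, 3)),
]

for category, months in category_by_month(sample_records).items():
    print(category, months)
```

Each inner dictionary is one line on the chart, which makes it easy to see whether a spike or drop belongs to a single category or to the whole dataset.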
By creating the three visuals I mentioned above, I usually feel pretty confident about the “vitals” of the dataset. I can speak the same language as the business owners and accurately reference volumes. I’ve found that this is a great first step when working with almost any set of data. It makes an excellent conversation starter with business owners and is also a useful communication tool when bringing new team members into a project.
That’s my initial analysis workflow. What’s yours?