By Andrew Jeavons
I’m delighted to be presenting at the upcoming ESOMAR “Big Data World” conference in Berlin, taking place November 15th-17th. In my opinion the Berlin Intercontinental Hotel has one of the best bars in the world; sadly, I will not be staying there.
What I will be doing is talking about a form of statistical analysis for text data called Automated Topic Analysis. The conference is about “big data”. The flood of data researchers can now access has led to an urgent need for analytical techniques that let us make sense of this deluge of information. One of these growth areas has been text analytics: suddenly we have textual information from a vast variety of sources. The first wave of analytic innovation for text data was sentiment analysis. Is the comment good or bad? This is vital information for any business collecting enterprise feedback.
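To make the good/bad idea concrete, here is a toy lexicon-based sentiment scorer. The word lists are invented for the example; production systems use much larger lexicons or trained classifiers.

```python
# Toy lexicon-based sentiment scorer: counts positive and negative
# words and returns a label. The tiny lexicons below are illustrative
# only, not a real sentiment resource.
POSITIVE = {"great", "love", "excellent", "good", "helpful"}
NEGATIVE = {"bad", "terrible", "hate", "slow", "broken"}

def sentiment(text):
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("The support team was helpful and the product is excellent"))  # positive
print(sentiment("Checkout is slow and the app feels broken"))                  # negative
```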
Sentiment analysis is very useful, but it lacks depth. People talk about many different things in textual data. A comment on a website may be neither good nor bad, yet still convey important information that a business needs to hear. Automated topic analysis allows themes, or topics, to be generated from textual data. Rather than imposing a simple good/bad classification, it allows latent topics in the text to be discovered and analyzed.
At the start of March 2016 I began collecting Twitter data from the main candidates in the US Presidential election. Originally I was interested in emotional analysis of the tweets, but as time passed it occurred to me that it might be interesting to see what themes emerged over the course of the election. The paper I am presenting is based on automated topic analysis of these tweets over a period ending on August 14th (data collection is still ongoing).
What I found was that this form of topic analysis did reveal some radical differences between the tweets of the candidates. A topic is a pattern of word usage, but it is worth remembering that topics may not be semantically meaningful. That doesn’t mean they are not useful for segmenting populations. Style and vocabulary can be key identifiers of population segments, yet they may not have any semantic meaning per se. Automated topic analysis is very loosely analogous to cluster analysis, but it uses patterns of word usage rather than numeric values to develop topics. Topics are identified by the words associated with them.
A very useful characteristic of the topic analysis technique I used is that every tweet can be assigned a primary topic. Text items may have more than one topic associated with them, but for the sake of simplicity I took the most likely topic for each tweet as the basis for segmenting the data.
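Picking the primary topic is simply taking the highest-weight topic for each item. A minimal sketch, using an invented document-topic matrix:

```python
# Each row gives one tweet's weights over three topics (invented numbers;
# a topic model would produce these as a probability distribution per tweet).
doc_topics = [
    [0.10, 0.75, 0.15],
    [0.60, 0.25, 0.15],
    [0.05, 0.15, 0.80],
]

# Primary topic = index of the largest weight in each row.
primary = [max(range(len(row)), key=row.__getitem__) for row in doc_topics]
print(primary)  # [1, 0, 2]
```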
Using this approach I was able to produce an analysis showing how the main candidates (Clinton, Trump, Cruz and Sanders) varied the topics they tweeted about over time. For instance, it was very clear that Trump had a very different topic structure from the rest of the candidates. I could also statistically analyze how consistent each candidate's tweet topics were. Did they stay “on message”? What was really interesting was the relationship between who was most consistent in their tweet topics and how they performed in the nomination process. Did staying “on message” help?
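The paper does not specify which consistency statistic was used, but one simple, assumed-for-illustration measure is the entropy of an account's primary-topic distribution: the lower the entropy, the more the tweets concentrate on a few topics, i.e. the more "on message" the account is.

```python
import math
from collections import Counter

def topic_entropy(primary_topics):
    """Shannon entropy (in bits) of a list of primary-topic labels."""
    counts = Counter(primary_topics)
    n = len(primary_topics)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

# Invented primary-topic sequences for two hypothetical accounts.
on_message = [0, 0, 0, 1, 0, 0, 0, 0]   # mostly a single topic
scattered  = [0, 1, 2, 3, 0, 1, 2, 3]   # spread evenly over four topics

print(round(topic_entropy(on_message), 3))  # low entropy: consistent
print(round(topic_entropy(scattered), 3))   # high entropy: 2.0 bits
```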
“Big Data” is all about analytics: researchers now face a challenge of analysis, not collection. Text data in particular is a rich but complex data type to analyze. Quantification of any data is important, but as I found with the topic analysis I performed, a significant component of qualitative interpretation remains when it comes to text.
So please join me in Berlin; it promises to be a great experience. New analytics for big data are incredibly important to researchers, and this conference is the ideal venue to learn about them.
By Andrew Jeavons, Mass Cognition