This is the first in a series. You can read the second, third and fourth parts here.
Last May, I wrote a blog post in response to my daughter’s question “what’s the RSA Conference about, Daddy?” Being a data-driven guy, I wanted to give her an answer that used a quantitative approach but also captured qualitative aspects of the event. I was able to obtain session titles from the RSAC website for the last four years and do some textual analysis to determine that RSAC was, at least recently, about “the risk from cyber threats to data in the cloud” (you can read the post to find out why).
I wanted to go back farther than four years, but wasn’t able to find the data to support longer-term analysis. But the good folks behind RSAC saw my request and contacted me to offer all session titles since the conference began back in 1991. I wasn’t involved in the data collection process, but I understand it involved translation from stone tablets, papyrus fragments, Gutenberg’s printing press, and a scale model of Babbage’s Difference Engine. My hat goes off to them for that effort, and I naturally jumped at the chance to discover what the infosec industry has been discussing at its premier conference over the last quarter century.
Figure 1. Number of RSAC sessions per year from 1991 to 2015. *2016 totals not finalized at time of writing.
Before digging into the titles, though, let’s start with a quick look at the conference itself. I wasn’t around for the 1991 RSAC, but I’m going to venture a guess that it didn’t look very much like what we see today. There were just two sessions, though one of them had a title long enough that it could have been split into a few more: “Cryptography, Industry & Public Policy - Government: Cryptographic Policy and Its Effect on Cryptography Standards, Export Controls and Legislation.” You know you’re at a crypto conference when the keynote title includes three instances of the word “crypto.” By comparison, the 2015 RSAC featured nearly 500 sessions (and none of them used “crypto” more than once). That’s a compound annual growth rate (CAGR) of 26%. Nice job, RSAC peeps! I sure wish the HACK ETF showed that kind of growth.
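For anyone who wants to check the arithmetic, here is a minimal sketch of the growth-rate calculation. It assumes "nearly 500" is rounded to exactly 500 sessions and counts 24 annual periods between 1991 and 2015; the function name is purely illustrative.

```python
def cagr(start_value, end_value, periods):
    """Compound annual growth rate as a fraction (0.26 means 26%)."""
    return (end_value / start_value) ** (1 / periods) - 1


# 2 sessions in 1991; "nearly 500" in 2015 is rounded to 500 here (an assumption).
growth = cagr(start_value=2, end_value=500, periods=2015 - 1991)
print(f"{growth:.0%}")  # prints 26%
```

Because the exponent is 1/24, the result is not very sensitive to the exact 2015 count: any figure reasonably close to 500 still rounds to about 26%.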
I don’t want to bore you with too many details and caveats, but I should clarify a few things before we move on with the analysis of session titles. First, these are titles only, not abstracts and all that other stuff. Second, I performed minor surgery on this dataset to prep it for dissection. That included things like removing symbols, punctuation, stop words, and universal yet analytically uninteresting terms like “security” and “secure.” I was going to employ a stemming function across the entire corpus, but that proved too heavy-handed. For instance, I thought preserving the difference between “attacker” and “attack” might be interesting. Thus, I opted for a more controlled, manual stemming process. I also combined terms I wanted to treat as synonymous, such as “Internet of Things,” “Internet of Everything,” and “IoT.” The last major thing I did was drop all words that were not used at least five times, which left me just south of 800 words to play with. Over the course of this series, we’re going to examine those words from a bunch of different angles to see what insights they contain about key topics, trends, and transitions in the security industry over the last 25 years.
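For the curious, the prep steps above might look roughly like the following. This is a sketch under stated assumptions, not the actual pipeline: the stop-word list, synonym map, and manual stem table are short hypothetical stand-ins, and the function names are made up for illustration.

```python
import re
from collections import Counter

# Hypothetical, abbreviated lists; the real ones used for the analysis are not given.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "for", "on", "with", "is"}
UNINTERESTING = {"security", "secure"}
SYNONYMS = {"internet of things": "iot", "internet of everything": "iot"}
# Manual stemming: only the mappings listed here are applied, so
# "attacker" and "attack" remain distinct words.
MANUAL_STEMS = {"attacks": "attack", "attackers": "attacker", "threats": "threat"}


def tokenize(title):
    """Turn one session title into a list of cleaned words."""
    text = title.lower()
    # Merge multi-word synonyms before the title is split into words.
    for phrase, canonical in SYNONYMS.items():
        text = text.replace(phrase, canonical)
    # Replace symbols and punctuation with spaces.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    words = []
    for word in text.split():
        if word in STOP_WORDS or word in UNINTERESTING:
            continue
        words.append(MANUAL_STEMS.get(word, word))
    return words


def build_vocabulary(titles, min_count=5):
    """Count words across all titles and keep those used at least min_count times."""
    counts = Counter(word for title in titles for word in tokenize(title))
    return {word: n for word, n in counts.items() if n >= min_count}


print(tokenize("Securing the Internet of Things: Attackers & Threats"))
# ['securing', 'iot', 'attacker', 'threat']
```

The explicit stem table is what keeps the process "controlled": an automatic stemmer would apply its rules to every word, whereas here nothing changes unless it has been listed by hand.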
To start things off, as well as close out this initial post, I’d like to provide some analytical justification for the title of this series. The figure below shows the percentage of RSAC session titles that included four “c” words: “commerce,” “crypto,” “cloud,” and “cyber.” It’s easy to see that the early years of RSAC were dominated by discussion of cryptography, which was so vital to the explosion of online commerce in the 1990s. Also evident is the growth of cloud computing and the persistent advance of all things “cyber” in more recent times.
Figure 2. Percentage of RSAC session titles over time that include four key words: commerce, crypto, cloud, cyber.
The figure illustrates well, I think, the vast changes that have taken place within our industry and how RSAC acts as a kind of mirror for that change. I realize there’s a huge unexplained gap between the twin “c” peaks and that the four terms account for less than a quarter of all sessions. But hopefully it “peaks” your interest for what’s to come as we dig further into this series.
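A chart like Figure 2 boils down to a per-year share of titles containing each word. Here is a rough sketch of that calculation, assuming the data arrives as (year, title) pairs and that a simple case-insensitive substring match is good enough (so “cryptography” counts toward “crypto” and “cybersecurity” toward “cyber”). Both the input shape and the matching rule are assumptions, and the sample titles in the test data below the function would be invented, not drawn from the real dataset.

```python
from collections import defaultdict

KEYWORDS = ("commerce", "crypto", "cloud", "cyber")


def keyword_share_by_year(sessions, keywords=KEYWORDS):
    """Return {year: {keyword: percent of that year's titles containing it}}.

    sessions is an iterable of (year, title) pairs (an assumed input shape).
    """
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for year, title in sessions:
        totals[year] += 1
        lowered = title.lower()
        for keyword in keywords:
            # Substring match: a title counts once per keyword, however
            # many times the keyword appears in it.
            if keyword in lowered:
                hits[year][keyword] += 1
    return {
        year: {kw: 100 * hits[year][kw] / totals[year] for kw in keywords}
        for year in sorted(totals)
    }
```

Dividing by each year's own session count matters here: with two sessions in 1991 and nearly 500 in 2015, raw counts would hide the early dominance of crypto entirely.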
I also want to announce that I’ll be moderating a panel at RSAC based on this analysis with Jay Jacobs, Alex Pinto, and Bob Rudis. Be sure to attend if you want to see what some of the best data scientists in our industry (not including myself in that group) found fun and interesting within this dataset.
Next up, I’ll cover the most common words across the years and highlight the biggest winners and losers.