Political Violence at a Glance carried this post the other day: Raining on the Parade: Some Cautions Regarding the Global Database of Events, Language and Tone Dataset. Shortly thereafter this was posted at the Conflict Research group on Facebook.
I’m not very familiar with the GDELT dataset, but I know that many scholars in this group are very familiar with it. I would love to get people’s take on this post.
Hence, I began drafting a post, but it was too long, so I have divided it into two parts. In the end I offer little assistance to Keels, as I have not spent much time at all with the GDELT data (and therefore can neither disparage nor endorse it). I can, however, point out a few rather troubling problems with the post on PV@G. That is not to say that the post has no merit. There are some interesting bits there. But there are also a few alarming statements, and this series of posts will hopefully help folks like Keels who are (presumably) new to events data think about them better than the authors of the post have done.
Let me begin by borrowing from Jon Stewart and note that the post strongly suggests that the authors imagine a data analysis world filled with rainbows and unicorns. Whatever events data one turns to, GDELT or any of many others, there will be neither unicorns nor rainbows.
We have, sadly, seen this type of “Holy Cow, these data are not at all what I expected. They’re awful” reaction before, even quite prominently. Charles Brockett managed to publish a journal article about his crushing first experience with events data in the American Political Science Review [ungated PDF]. It turns out that there is a great deal of published work on the type of data produced when we use humans or machines to perform content analysis on news reports to produce events data. And if one does not avail oneself of such research, and wanders into working with any events data set carrying implicit assumptions about rainbows and unicorns, then one is going to be disappointed with how far real events data are from that imagined world.
As noted, the post reports a number of interesting points that anyone wishing to use GDELT will want to consider. So please read it. However, the first three points they raise are, in my view, flatly off base. I discuss the first two issues here and the third in a follow-up.
The post tells us:
An initial issue is an ease-of-use issue related to the dataset. If reports are collected from multiple news sources then major events will likely be covered multiple times by these myriad sources… Quality-of-life issues are important for third-parties seeking to use datasets, because the inability to understand a dataset will lead it to be used in ways not intended by the authors and may eventually hurt the dataset’s reputation if it is inappropriately utilized.
This doesn’t rise to the level lampooned in David Thorne’s email exchange with Simon Edhouse, but it is of broadly the same kind: “Thank you for doing so much work for free. You know what would be really useful, though? If you could imagine the use to which I want to put your data, and then do a whole bunch more work for free to create that information, that would really be swell. In return, I will pay you with rainbows and unicorns.”
The second issue raised in the post is:
The creators of the GDELT do not allow for third-party verification: they do not release the articles nor do they list the article sources and dates. In an email communication with Kalev Leetaru (January 11, 2014) we were told: “Our licensing restrictions are quite tight on the data and we cannot make the text available.” This struck us as odd given that most of the sources are publicly available news reports, but more importantly if the underlying data cannot be shared it imperils notions of transparency and replication central to science.
I have to say, this point left me somewhat speechless. My initial draft of this post contained here a series of snarky, rhetorical questions. But I will ask one genuine question: have these folks ever noticed that when one does a news search on Lexis-Nexis or Factiva one is restricted to 100 downloads? The reason is that those companies assert copyright over the documents and do not want people to write software that would allow them to download major chunks of the content and then make it available online. Indeed, Aaron Swartz, the programmer who committed suicide in January of 2013, was being prosecuted by the US government in an effort to enforce JSTOR’s copyright claim after he wrote just such a program and downloaded the JSTOR catalog. Given this, what would be “odd” would be GDELT (or any project like it) challenging content providers’ copyright claims by making the “underlying data” available online. For the uninitiated, these companies’ claims are contested, and Phil Schrodt just recently wrote a wonderful post on The Legal Status of Events Data.
OK, that’s sufficient for one post. If you aren’t bored already, check in tomorrow as I will discuss what I think is probably the most troubling aspect of the post: the expectation that Actor and Target would not have missing (or Unknown) values.
 See, for example, Snyder & Kelly (1977), Franzosi (1987), Martin (1988), Olzak (1989), McCarthy, McPhail & Smith (1996), Oliver & Myers (1999), Sommer & Scarritt (1999), Oliver & Maney (2000), Maney & Oliver (2001), Poe, Carey & Vazquez (2001), Davenport & Ball (2002), Koopmans & Rucht (2002), Almeida & Lichbach (2003), Earl, Martin, McCarthy & Soule (2004), and Davenport (2010), among many others.
 If you do not have experience with events data and this point strikes you as off target, please read part II of this series: I provide the necessary backstory there.
 No, I do not inhabit the world of rainbows and unicorns where I imagine the users of Lexis-Nexis, Factiva, etc. read the licensing agreements that spell out the intellectual property claims of these corporations.