Political Violence at a Glance carried this post last week: Raining on the Parade: Some Cautions Regarding the Global Database of Events, Language and Tone Dataset. Yesterday I posted about two of the three issues I have with the post: (1) GDELT has not formatted the data in a way that is friendly for the use to which the authors planned to put it, and (2) GDELT does not make the source articles/text available.
Here I discuss the third: the unwarranted expectation that there will be few, if any, cases in an events dataset like GDELT assigned missing (or Unknown) values to the Actor and Target variables.
The “raining” post makes its way into the criticism thusly: “The remaining issues all involve the constructs and related measures that underlie the data,” and goes to Rudyard Kipling to… er, let’s go to the quote:
Assuming the legitimacy of the GDELT primary data sources, their method for coding data raises additional concerns. Rudyard Kipling once famously wrote:
“I Keep six honest serving-men:
(They taught me all I knew)
Their names are What and Where and When
And How and Why and Who.”
This is one of the standard tenets of journalism,
I see. Yes. Yes, it is. Please, continue…
but much of this basic information is missing in the GDELT dataset… It seems to us that the Actor1 and Actor2 variables, which represent who is performing an action and who the action is being performed on respectively, should rarely, if ever, be missing… Yet, in only 39% of the total observations are both Actors 1 and 2 identified.
Thirty-nine percent is definitely attention-grabbing. And when I downloaded their data to…
Huh, the post does not contain a link to the data. So the post that rains on GDELT does not make available the data we can use to eyeball their claim. So, as I noted in yesterday’s post, the “raining” post is comfortable levying the (accurate) charge of poor transparency at GDELT, yet did not meet that standard itself. Nice.
Of course, I could go to GDELT, download the data, make some guesses about how the sample was drawn, and try to replicate the finding. What could possibly go wrong? Or I could expect the authors of a post to include a hyperlink to the data. If I did, would I be asking them to do a bunch of extra work for free? <= [Rhetorical question]
OK, enough snark. Let’s return to the eye-popping allegation about missingness and consider whether it is actually surprising. To do so, let’s begin with a recent news story about an event that GDELT should code. It was published by Al Jazeera America and is titled “Baghdad motorbike bomb kills dozens.” These are the sentences that strike me as containing information we would want to code:
At least 42 people have been killed after a motorcycle rigged with explosives was detonated in Baghdad’s Sadr City and armed men targeted mostly Shia neighbourhoods across the country.
The motorcycle was parked in a second-hand market in the Shia Muslim neighbourhood that sells used bikes and was filled with people, mostly young men, when it exploded late on Thursday afternoon, killing 31 and wounding 51 others, Iraqi medical and police sources said.
It was not clear who was behind the bombing but violence against Shia Muslims is often blamed on the Sunni Islamic State of Iraq and the Levant (ISIL), a group that al-Qaeda central leadership has disowned.
In other violence Thursday, four people died from bombs on two different mini-buses in Shia sections of Baghdad.
An attacker smashed his explosives-packed vehicle into a checkpoint, killing three soldiers and wounding six others in Mushaada, a Sunni district, in northern Baghdad, police said.
In Salahuddin province, a pro-government Sunni-manned checkpoint in the town of Shirqat was hit by a bomb that killed two fighters and wounded four others, police said.
Also to the north in Tuz Khurmatu, a bomb in an outdoor marketplace frequented by Shia Turkmen killed two people and wounded 11 others.
OK, what will Rudyard Kipling have us do with those Actor and Target variables? Would he argue that there is no reasonable case for assigning a missing (or Unknown) value to either Actor or Target on any of the four bombings in Iraq on 27 February 2014?
This is the thing: news reports of contentious politics (protest, insurgency, government coercion, terror, etc.) frequently fail to identify a specific Actor and/or specific Target of an Event that those of us interested in conflict might want to code.
The “raining” post appears to be convinced that the information needed to assign non-missing values is contained in the reports. This implies that, had human coders applied the CAMEO scheme to the articles, there would be no missing data on Actor / Target (excepting human error). I cannot address this directly, as I have not evaluated the GDELT data. But I can consider Best, Carpino & Crescenzi’s (2013) recently published article. They compare events data produced via human coding of full Reuters articles with data produced using TABARI, the automated coding system of which GDELT’s coder is a modification, to code only the ledes of those articles. They report:
we find that, contrary to expectations, hand coding full news stories does not lead to significant improvements in the accuracy or depth of actor information compared with machine coding by TABARI using lead sentences. These findings should bolster the confidence of researchers using TABARI coded data, with the caveat that TABARI’s ability to distinguish between actors is dependent upon the detail available in the actor dictionaries.
So human coding of full articles and machine coding of the ledes, drawn from the same source and using the same coding protocol, produced comparably accurate and detailed actor information, as long as the actor dictionaries are sufficiently detailed.
Of course, that finding does not rule out that GDELT has lousy actor dictionaries, has mangled TABARI, or suffers from some sort of source SNAFU or post-coding data management problem that renders the data FUBAR. I confess, I do not know whether the GDELT data are excellent, mediocre, or a train wreck. But I do know that the information provided in the post is not grounds for suggesting that they are probably a train wreck.
To probe this issue further, please consider the following discussion of assigning a value to the Target, found in the User Guide for the Intranational Political Interactions (IPI) project:
Sometimes, coding the target can represent a problem, as in the following example: Govt reports 11 dead in guerrilla attack; repts 14 others dead in Dec 31 attacks. From this report it is unclear who is the target and, as a consequence, how many events should be coded. Unfortunately, this record is incomplete. We know that at least two events took place, that the actor in both cases was unspecified guerrillas, and that eleven and fourteen people died in either armed attacks or military clashes, depending on who the target was. Since it is better to err on the conservative side, we code it as an armed attack (604 on the conflict scale) on an unknown target (99) and indicate in our event record that we need more information. 
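For readers who think in code, here is a minimal sketch of the conservative rule the IPI guide describes. The codes 604 (armed attack) and 99 (unknown target) come from the passage quoted above; the record layout, the field names, the actor label, and the function name are my own illustrative assumptions, not the IPI project's actual data format.

```python
from dataclasses import dataclass

ARMED_ATTACK = 604    # conflict-scale code quoted in the IPI User Guide
UNKNOWN_TARGET = 99   # unknown-target code quoted in the IPI User Guide


@dataclass
class EventRecord:
    # Hypothetical layout, for illustration only.
    actor: str
    event_code: int
    target_code: int
    deaths: int
    needs_more_info: bool


def code_with_unknown_target(actor: str, deaths: int) -> EventRecord:
    """Apply the conservative rule: when the report does not identify the
    target, record an armed attack on an unknown target and flag the
    record so a coder can revisit it if later sources fill the gap."""
    return EventRecord(
        actor=actor,
        event_code=ARMED_ATTACK,
        target_code=UNKNOWN_TARGET,
        deaths=deaths,
        needs_more_info=True,
    )


# The quoted report implies at least two events, with 11 and 14 deaths.
records = [code_with_unknown_target("unspecified guerrillas", d) for d in (11, 14)]
for record in records:
    print(record)
```

The point of the sketch is that "unknown" is a deliberate, documented value here, not an accident of sloppy coding.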
Lest the uninitiated wish to object and quote the author who brought us Mowgli, this issue is not unique to the IPI project: it is generic to any events data set that was built using content analysis of natural language. Reports of events of interest to students of contentious politics frequently fail to contain information about the identity of actors and/or targets, the quips of a late British Poet Laureate notwithstanding. And different projects will make different decisions about what sort of values to assign when information on Actor and/or Target is vague or absent. Indeed, CAMEO has a rather sophisticated, multi-tiered approach, but one could not know that from the “raining” post. To get a taste for the complexity that GDELT reports that it uses, peruse chapter 3 of CAMEO (Conflict and Mediation Event Observations): Event and Actor Codebook, v 1.1b3 (ungated PDF at the GDELT site).
Let me offer one other example. The Ill Treatment and Torture (ITT) project, on which I was a co-PI, conducts human content analysis on Amnesty International documents. We coded the government agency responsible for the alleged abuse (actor) and the victim of the alleged abuse (target). In an article introducing one version of the ITT data we report that Unnamed is by far the most common Victim Type, and Unnamed is more or less tied with Police as the most common Government Agency (p. 209; ungated PDF here). You might also be interested in Chad Clay’s post in which he maps the ITT data.
This discussion hopefully demonstrates that a high percentage of missing (Unknown) values for Actor / Target is not, on its own, sufficient evidence to damn an events dataset. A 39% rate for complete cases on Actor and Target could, very well, indicate that there is a problem, but we need considerably more information to make a judgment. The key point is: users need to go into working with an events dataset on contentious politics with an expectation that there will be lots of events that cannot be assigned Actor and/or Target values independent of a number of coding decisions, any of which a given researcher might dispute.
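To make "more information" concrete: a single complete-case percentage hides whether the Actor, the Target, or both are missing. Below is a minimal sketch of the breakdown I would want to see alongside a figure like 39%. The rows are made up for illustration, the field names Actor1Code and Actor2Code are assumed, and treating an empty string as missing is likewise an assumption about how the data are stored.

```python
# Made-up rows standing in for event records; NOT real GDELT data.
# Field names are assumed, and an empty string marks a missing value.
events = [
    {"Actor1Code": "IRQ", "Actor2Code": "IRQCVL"},
    {"Actor1Code": "", "Actor2Code": "IRQCVL"},
    {"Actor1Code": "IRQGOV", "Actor2Code": ""},
    {"Actor1Code": "", "Actor2Code": ""},
    {"Actor1Code": "USA", "Actor2Code": "IRQ"},
]


def is_missing(value):
    """Treat None and empty or whitespace-only strings as missing."""
    return value is None or str(value).strip() == ""


def missingness_profile(rows, actor_field="Actor1Code", target_field="Actor2Code"):
    """Return the share of rows in each of four categories: both actor and
    target identified, only the actor, only the target, or neither."""
    counts = {"both": 0, "actor_only": 0, "target_only": 0, "neither": 0}
    for row in rows:
        has_actor = not is_missing(row.get(actor_field))
        has_target = not is_missing(row.get(target_field))
        if has_actor and has_target:
            counts["both"] += 1
        elif has_actor:
            counts["actor_only"] += 1
        elif has_target:
            counts["target_only"] += 1
        else:
            counts["neither"] += 1
    total = len(rows)
    if total == 0:
        return {key: 0.0 for key in counts}
    return {key: count / total for key, count in counts.items()}


profile = missingness_profile(events)
for category, share in profile.items():
    print(f"{category}: {share:.0%}")
```

Even this breakdown only describes the pattern of missingness; judging whether the missing values were assigned appropriately still requires the source text and the coding rules.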
To wrap up: my criticism aside, the authors of the post at PV@G offer several additional critiques, most importantly that the documentation for GDELT leaves much to be desired. Once one gets past the initial three critiques, the others do not suffer from the problems I have described. Yes, there is some useful information in the post. But it is fronted by three “concerns” that may well not be concerns at all, but are likely to appear to be concerns to folks who wander into this type of research without relevant experience. Those of us who work with contentious politics events data need to do a better job making it easier for newbies to get up to speed. And newbies need to do a better job of reaching out to experienced researchers and asking whether the issues that alarm them are truly alarming.
 The hypocrisy is especially salient given the GDELT community’s norm of posting code and data (usually at GitHub) in posts about the data, a practice that puts into action the transparency norms the authors of the “raining” post preach, but chose not to practice. Hypocrisy of course does not mitigate the issue of missing data in the Actor fields, so I address that topic directly.
 That missing values might justifiably be assigned to Actor and/or Target for a given coding scheme does not suggest that the GDELT data has appropriately assigned the missing values reported in the post. It may have done so appropriately, and it may well not have done so appropriately. Given the information in the post we cannot tell, which makes the charge levied there sloppy and irresponsible.
 Alex Hanna recently posted an analysis that generates a very different outcome, but notes that
It’s not clear whether the non-correspondence between GDELT and DoCA is due to the limitations of the New York Times as a source, the limitations of the GDELT search protocol for protest events, or both.
When both the coding protocol and the sources are different, it is tough to say what is going on. I will have more to say on this issue in the coming days.
 The IPI User Guide continues:
We hope that as we code additional sources, more facts will be reported which will allow us to fill in the missing information, although there will be a substantial number of events for which this is not the case.
The IPI project used human coders, so it was possible to include a “need more info” variable and then have coders go through the completed data set seeking further info from other events that would permit the missing data to be recorded. In an automated system this is not, to the best of my knowledge, feasible. But as a co-PI on the IPI project I can tell you that a very small percentage (certainly less than 15%) of missing values were replaced as a consequence.
 It will hopefully also persuade you that you want to look closely at the decisions the project used to assign (non-)missing values as your project might wish to make very different operational decisions, and if the project is well documented, you should be able to do some recoding of values to improve the situation.
 Readers interested in the standards and best practices for collecting conflict data should visit the Conflict Consortium’s Creating Conflict Data: Standards & Best Practices (while you are there, sign up and become a member).