Two Rubes Walk into a Bar, Order Event Data (Part 2)

Political Violence at a Glance carried this post last week: Raining on the Parade: Some Cautions Regarding the Global Database of Events, Language and Tone Dataset. Yesterday I posted about two of the three issues I have with the post: (1) GDELT has not formatted the data in a way that is friendly for the use to which the authors planned to put it, and (2) GDELT does not make the source articles/text available.

Here I discuss the third: the unwarranted expectation that there will be few, if any, cases in an events dataset like GDELT assigned missing (or Unknown) values to the Actor and Target variables.

The “raining” post makes its way into the criticism thusly: “The remaining issues all involve the constructs and related measures that underlie the data,” and goes to Rudyard Kipling to… er, let’s go to the quote:

Assuming the legitimacy of the GDELT primary data sources, their method for coding data raises additional concerns. Rudyard Kipling once famously wrote:

“I Keep six honest serving-men:
(They taught me all I knew)
Their names are What and Where and When
And How and Why and Who.”

This is one of the standard tenets of journalism,

I see. Yes. Yes, it is. Please, continue…

but much of this basic information is missing in the GDELT dataset… It seems to us that the Actor1 and Actor2 variables, which represent who is performing an action and who the action is being performed on respectively, should rarely, if ever, be missing… Yet, in only 39% of the total observations are both Actors 1 and 2 identified.

Thirty nine percent is definitely attention grabbing. And when I downloaded their data to…

Huh, the post does not contain a link to the data. So the post that rains on GDELT does not make available the data we can use to eyeball their claim. So, as I noted in yesterday’s post, the “raining” post is comfortable levying the (accurate) charge of poor transparency at GDELT, yet did not meet that standard itself. Nice.[1]

Of course, I could go to GDELT, download the data, make some guesses about how the sample was drawn, and try to replicate the finding. What could possibly go wrong? Or I could expect the authors of a post to include a hyperlink to the data. If I did, would I be asking them to do a bunch of extra work for free? <= [Rhetorical question]

OK, enough snark. Let’s return to the eye-popping allegation about missingness and consider whether it is actually surprising. To do so, let’s begin with a recent Reuters story about an event that GDELT should code. The story was carried by Al Jazeera America under the title “Baghdad motorbike bomb kills dozens.” These are the sentences that strike me as containing information we would want to code:

At least 42 people have been killed after a motorcycle rigged with explosives was detonated in Baghdad’s Sadr City and armed men targeted mostly Shia neighbourhoods across the country.

The motorcycle was parked in a second-hand market in the Shia Muslim neighbourhood that sells used bikes and was filled with people, mostly young men, when it exploded late on Thursday afternoon, killing 31 and wounding 51 others, Iraqi medical and police sources said.

It was not clear who was behind the bombing but violence against Shia Muslims is often blamed on the Sunni Islamic State of Iraq and the Levant (ISIL), a group that al-Qaeda central leadership has disowned.

In other violence Thursday, four people died from bombs on two different mini-buses in Shia sections of Baghdad.

An attacker smashed his explosives-packed vehicle into a checkpoint, killing three soldiers and wounding six others in Mushaada, a Sunni district, in northern Baghdad, police said.

In Salahuddin province, a pro-government Sunni-manned checkpoint in the town of Shirqat was hit by a bomb that killed two fighters and wounded four others, police said.

Also to the north in Tuz Khurmatu, a bomb in an outdoor marketplace frequented by Shia Turkmen killed two people and wounded 11 others.

OK, what will Rudyard Kipling have us do with those Actor and Target variables? Would he argue that there is no reasonable case for assigning a missing (or Unknown) value to either Actor or Target on any of the four bombings in Iraq on 27 February 2014?[2]

This is the thing: news reports of contentious politics (protest, insurgency, government coercion, terror, etc.) frequently fail to identify a specific Actor and/or specific Target of an Event that those of us interested in conflict might want to code.

The “raining” post appears to be convinced that the information needed to assign non-missing values is contained in the reports. This implies that, had human coders applied the CAMEO scheme to the articles, there would be no missing data on Actor / Target (excepting human error). I cannot address this directly, as I have not evaluated the GDELT data. But I can consider Best, Carpino & Crescenzi’s (2013) recently published article. They compare events data produced via human coding of full Reuters articles with data produced using TABARI, the automated coding system of which GDELT’s coder is a modification, to code only the ledes of those articles. They report:[3]

we find that, contrary to expectations, hand coding full news stories does not lead to significant improvements in the accuracy or depth of actor information compared with machine coding by TABARI using lead sentences. These findings should bolster the confidence of researchers using TABARI coded data, with the caveat that TABARI’s ability to distinguish between actors is dependent upon the detail available in the actor dictionaries.

So a comparison of human versus machine coding that used the same sources and the same coding protocol found no significant difference in the actor information recovered, as long as the actor dictionaries are sufficiently detailed.[3]

Of course, that finding does not rule out that GDELT has lousy actor dictionaries, has mangled TABARI, or suffers from some sort of source SNAFU or post-coding data management problem that renders the data FUBAR. I confess, I do not know whether the GDELT data are excellent, mediocre, or a train wreck. But I do know that the information provided in the post is not grounds for suggesting that they are probably the last of these.

Is the GDELT data a train wreck?

To probe this issue further, please consider the following discussion of assigning a value to the Target, found in the User Guide for the Intranational Political Interactions (IPI) project:

Sometimes, coding the target can represent a problem, as in the following example: Govt reports 11 dead in guerrilla attack; repts 14 others dead in Dec 31 attacks. From this report it is unclear who is the target and, as a consequence, how many events should be coded. Unfortunately, this record is incomplete. We know that at least two events took place, that the actor in both cases was unspecified guerrillas, and that eleven and fourteen people died in either armed attacks or military clashes, depending on who the target was. Since it is better to err on the conservative side, we code it as an armed attack (604 on the conflict scale) on an unknown target (99) and indicate in our event record that we need more information. [4]

Lest the uninitiated wish to object and quote the author who brought us Mowgli, this issue is not unique to the IPI project: it is generic to any events data set built using content analysis of natural language. Reports of events of interest to students of contentious politics routinely fail to contain information about the identity of actors and/or targets, the quips of a late British Nobel laureate notwithstanding. And different projects will make different decisions about what sort of values to assign when information on Actor and/or Target is vague or absent. Indeed, CAMEO has a rather sophisticated, multi-tiered approach, but one could not know that from the “raining” post. To get a taste for the complexity that GDELT reports that it uses, peruse chapter 3 of CAMEO (Conflict and Mediation Event Observations): Event and Actor Codebook, v 1.1b3 (ungated PDF at the GDELT site).

Let me offer one other example. The Ill Treatment and Torture (ITT) project, on which I was a co-PI, conducts human content analysis on Amnesty International documents. We coded the government agency responsible for the alleged abuse (actor) and the victim of the alleged abuse (target). In an article introducing one version of the ITT data we report that Unnamed is by far the most common Victim Type, and Unnamed is more or less tied with Police as the most common Government Agency (p. 209; ungated PDF here). You might also be interested in Chad Clay’s post in which he maps the ITT data.

This discussion hopefully demonstrates that a high percentage of missing (Unknown) values for Actor / Target is not, on its own, sufficient evidence to damn an events dataset.[5] A 39% rate for complete cases on Actor and Target could, very well, indicate that there is a problem, but we need considerably more information to make a judgment. The key point is: users need to go into working with an events dataset on contentious politics with an expectation that there will be lots of events that cannot be assigned Actor and/or Target values independent of a number of coding decisions, any of which a given researcher might dispute.
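For readers who want to eyeball a figure like that 39% on their own extract, here is a minimal sketch of the complete-case calculation. It is an illustration only, not the procedure used in the “raining” post: the function name is hypothetical, the field names Actor1Code and Actor2Code are an assumption about how your extract is labeled, and the four toy rows (including their header line) are invented.

```python
import csv
import io


def complete_case_share(rows, actor_field="Actor1Code", target_field="Actor2Code"):
    """Return the fraction of rows in which both actor fields are non-empty.

    The field names are assumptions; change them to match your own extract.
    """
    total = 0
    complete = 0
    for row in rows:
        total += 1
        actor = (row.get(actor_field) or "").strip()
        target = (row.get(target_field) or "").strip()
        if actor and target:
            complete += 1
    # An empty sample has no meaningful share, so return NaN rather than 0.
    return complete / total if total else float("nan")


# Invented toy data: tab-delimited, with a header row added for readability.
sample = io.StringIO(
    "Actor1Code\tActor2Code\tEventCode\n"
    "IRQ\tIRQGOV\t183\n"   # both actor and target present
    "\tIRQ\t183\n"         # actor missing
    "IRQREB\t\t190\n"      # target missing
    "\t\t183\n"            # both missing
)
rows = csv.DictReader(sample, delimiter="\t")
share = complete_case_share(rows)
print(f"{share:.0%} of events have both fields filled")
```

A number produced this way says nothing about why a field is empty. Whether the report itself omitted the actor or the coder failed to extract it is a question only the source text and the coding rules can answer.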

To wrap up, my criticism aside, the authors of the post at PV@G have several additional critiques, most importantly that the documentation for GDELT leaves much to be desired,[6] and once one gets past the initial three critiques the others do not suffer from the problems I have described. Yes, there is some useful information in the post. But it is fronted by three “concerns” that may well not be concerns at all, but are likely to appear to be concerns to folks who wander into this type of research without relevant experience. Those of us who work with contentious politics events data need to do a better job making it easier for newbies to get up to speed. And newbies need to do a better job of reaching out to experienced researchers and asking whether the issues that alarm them are truly alarming.

@WilHMoo

[1] The hypocrisy is especially salient given the GDELT community’s norm of posting code and data (usually at GitHub) in posts about the data, a practice that puts into action the transparency norms the authors of the “raining” post preach, but chose not to practice. Hypocrisy of course does not mitigate the issue of missing data in the Actor fields, so I address that topic directly.

[2] That missing values might justifiably be assigned to Actor and/or Target for a given coding scheme does not suggest that the GDELT data has appropriately assigned the missing values reported in the post. It may have done so appropriately, and it may well not have done so appropriately. Given the information in the post we cannot tell, which makes the charge levied there sloppy and irresponsible.

[3] Alex Hanna recently posted an analysis that generates a very different outcome, but notes that

It’s not clear whether the non-correspondence between GDELT and DoCA is due to the limitations of the New York Times as a source, the limitations of the GDELT search protocol for protest events, or both.

When both the coding protocol and the sources are different it is tough to say what is going on. I will have more to say on this issue in the coming days.

[4] The IPI User Guide continues:

We hope that as we code additional sources, more facts will be reported which will allow us to fill in the missing information, although there will be a substantial number of events for which this is not the case.

The IPI project used human coders, so it was possible to include a “need more info” variable and then have coders go through the completed data set seeking further info from other events that would permit the missing data to be recorded. In an automated system this is not, to the best of my knowledge, feasible. But as a co-PI on the IPI project I can tell you that a very small percentage (certainly less than 15%) of missing values were replaced as a consequence.

[5] It will hopefully also persuade you that you want to look closely at the decisions the project used to assign (non-)missing values as your project might wish to make very different operational decisions, and if the project is well documented, you should be able to do some recoding of values to improve the situation.

[6] Readers interested in the standards and best practices for collecting conflict data should visit the Conflict Consortium’s Creating Conflict Data: Standards & Best Practices (while you are there, sign up and become a member).

6 Responses to Two Rubes Walk into a Bar, Order Event Data (Part 2)

Nick Weller says:

March 9, 2014 at 12:46 am

We have also posted this reply at: dornsife.usc.edu/gdelt which contains the relevant links to the supporting data.

Dear Will,

Thanks for these thoughts.

It seems you enjoyed yourself in what you wrote. It seems that you are quite defensive about the kinds of questions we raised. It’s nonetheless great to get a sense of what others think about our comment. Our biggest takeaway from your post is that if we’re going to become regular bloggers like yourself we have a lot to learn – mostly about how to, as you say, be “snarky.” Your blog post could have been very useful had you addressed the substance of our blog post rather than engage in name calling. We, honestly, don’t understand the “I’m smarter than you” attitude of your post, but we believe it doesn’t benefit any of us.

There are essentially two parts to our argument.
1. The underlying data for GDELT is not available, which is a concern for the creation of scientific knowledge. At a minimum the data collection and presentation process must be transparent and we must be able to replicate it. Simply making code and the GDELT data available does not create the conditions for transparency, because the underlying data are still unknown. It is like observing a burning bush in the mountains. It may have, in fact, happened exactly as it has been attributed, and billions of people may believe it. Nonetheless, it is still not science. Blaming the Department of Justice for the inability to meet scientific standards is not helpful. We know the challenges. But, positive steps could be taken.

The citations could be made available and perhaps even short excerpts from the articles that were used to identify things like Target, Source, etc. Doing so may be difficult, may take a long time, and be expensive, but we need to find a way to move beyond the excuses. Perhaps you, or the creators of GDELT, could organize the Event Coding equivalent of the “Cooperative Congressional Election Study” (see http://projects.iq.harvard.edu/cces).

We also all need to collaborate on developing standards for how to make sure that both copyright rules are followed and that the data is reliable and valid. We don’t see how that’s possible as long as no one can independently verify the data. [We are told that the citations will be available in GDELT 2.0, which is a positive step].

2. You discuss the difficulty of coding the messy journalistic articles associated with conflict events. No doubt this is true. And as you point out there are good reasons why data might be missing in the coded newspaper articles (data to support our original claim here). Even you recognize that, “A 39% rate for complete cases on Actor and Target could, very well, indicate that there is a problem, but we need considerably more information to make a judgment.” What information do we need? How would we make that judgment in the absence of the actual data or citations to the articles? How would we know if the missing information is a result of the news stories or the algorithm for extracting the information? How can we compare different methods of coding the data? It seems that at the very least we need some information about the exact news stories that were coded for a given event.

As we stated in the original post: “we want these datasets to be useful, which requires that they meet standards of transparency, reliability, and validity.” We became interested in GDELT as a result of working on a project for Data Challenge for SBP 2014. When an international society of computer scientists, psychologists, economists, and physicians makes a data project their central challenge, it behooves us to ask questions about the data. Unlike you, we had the decency to ask our questions of a GDELT organizer before we made our post. The response did not assuage our concerns, and while we had already produced some results we have chosen not to use this dataset for the time period due to our concerns.

As far as we can tell nothing you’ve said helps us make progress towards building the science surrounding GDELT or suggests that these are inappropriate standards for science. Our sincere hope is that we can find ways to move forward on issues such as the ones we raised about GDELT, and that ultimately the discipline will decide how to do so.

Cheers,
Nick and Kenny

- Nick Weller says:
  
  March 9, 2014 at 12:48 am
  
  P.S. The website with the links and what not is: http://dornsife.usc.edu/weller/gdelt
  
- Will H. Moore says:
  
  March 9, 2014 at 3:06 am
  
  Gents,
  
  Thanks for stopping by and sharing your thoughts. I appreciate you taking the time, offering the clarification, and especially the link to the data. I would like to offer two comments.
  
  First, you may not find your quote of Rudyard Kipling snarky. If so, then we have different views on the matter. My tone saw that bit and raised it several. Had you not gone that route in the post, I would have used a different tone (and title) in mine.
  
  Second, with respect to your entirely appropriate concerns about science you may be surprised to learn that we share views. I encourage you to read sections 2.1 and 3 in “Creating Conflict Data: Standards & Best Practices,” co-authored by Christian Davenport and myself (available here: http://conflictconsortium.weebly.com/standards–best-practices.html ). Should you be able to create the time to read the full document, we will very much welcome feedback on how we might improve it.
  
  – Will
  
  - Nick Weller says:
    
    March 10, 2014 at 12:48 pm
    
    Hi Will,
    
    Thanks for your response.
    
    We certainly didn’t intend the Kipling quotation to be snarky. Perhaps that was not conveyed appropriately.
    
    I have read a number of your pieces on philosophy of science topics, and know that we share many of the same concerns/views. I haven’t read the piece you mentioned above, but I will take a look at it and send you feedback if anything comes to mind.
    
    Best,
    
    Nick
