The World Cup has a long, storied history. 708 matches across 18 tournaments, involving 80ish different countries, more than 5000 players, and more than 2000 goals. That's a lot of soccer.  

It's not really that much data, though. Soccer isn't a sport for actuaries. My AAC file for Shakira's 2010 World Cup theme song is approximately three times the size of my data file containing more or less all salient info about the entire match-history of the Cup finals. When people talk about Big Data, this is not what they mean. This is Small Data.  



Even Small Data can be hard to get right, though. Who scored the second Cuban goal in their 3-3 draw with Romania on 5 June 1938? I'm betting you don't quite remember the guy's name, either.  

FIFA's official stats page for this match claims that the second Cuban goal was scored by Jose Magrina in the 69th minute. The listed Cuban lineup, however, includes no such player among either the starters or the substitutes.  

Planet World Cup's version has the middle Cuban goal scored by "Maquina", whom the FIFA lineup doesn't list either, and PWC doesn't have a lineup of its own to reference. They also have the second Romanian goal coming after the second Cuban goal, not before.  

Scoreshelf has a page for Carlos Maquina Oliveira, which matches FIFA's listing of Carlos Oliveira, so that's something. But Scoreshelf credits him with the second and third Cuban goals, and their match page has fairly different timings for all six goals, and credits the first Romanian goal to a different player.  

InfoFootballOnline's 1938 stats page claims Maquina had 2 goals for Cuba, but it also claims Héctor Socorro had 3, including 2 of the 3 in the 3-3 draw, contradicting other sites' credit of one of the Cuban goals to Tomas Fernández.  

The Wikipedia page for 1938 has yet another set of timings, and expresses its own creativity by giving the third Cuban goal to Juan Tuñas.  

So I can say, with pretty good confidence, that I don't know who scored these goals, nor when they occurred. If you know of an explicably authoritative source for the goal credits for Cuba's 1938 World Cup games, send me the reference. But of all the versions, FIFA's official one is clearly and ironically the least sensible, as it involves a player that none of these sources, including FIFA themselves, list as having been in the game. Machines can't tell us how we're wrong, but they ought to be able to easily tell us when we're not making any sense.  
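
That kind of nonsense check is nearly trivial to express. Here is a minimal sketch in Python, which is not what I actually used: the function name `unknown_scorers` and the record layout are assumptions made up for illustration, and the roster is just a handful of names mentioned in this post, not a real team sheet.

```python
def unknown_scorers(lineup, goals):
    """Return the goals credited to players who are not in the listed lineup."""
    return [goal for goal in goals if goal["scorer"] not in lineup]


# Illustrative data only: a partial roster assembled from names that appear
# in this post (an assumption, not FIFA's actual lineup).
cuba_lineup = {"Carlos Oliveira", "Héctor Socorro", "Tomas Fernández", "Juan Tuñas"}

# FIFA's credit for the second Cuban goal, as described above.
fifa_goals = [{"scorer": "Jose Magrina", "minute": 69}]

for goal in unknown_scorers(cuba_lineup, fifa_goals):
    print(f"minute {goal['minute']}: {goal['scorer']} is not in the lineup")
```

The check can't say who really scored. It can only say that this particular claim contradicts the lineup sitting right next to it.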

So for my version of World Cup History in Needle, I've made my own decisions, too, but I can at least say, with computationally verified certainty, that they're internally consistent. Across this whole small sprawling history, there are no goals or cards attributed to unknown players, the itemized goal totals (including own-goals) match the official final scores, and the computed champions match the record books. I've fixed dozens of games where FIFA's stats list starters being replaced by ghosts, or 12th players joining the fray. I found the two cases where players were carded while they weren't even in the game, and looked them up to make sure that's what actually happened. I've checked that there are no goals credited during overtime of games that didn't have any. I fixed the game that was listed as happening in February, and I fixed the extra errors my own non-soccer-specific software introduced (did you know that "NGA", the country-code for Nigeria, is also the Vietnamese word for Russia?).  
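
To give a sense of what a few of these checks look like, here is a rough sketch in Python rather than Needle. Everything in it is assumed for illustration: the `check_match` function, the dictionary layout, and the made-up teams "AAA" and "BBB" are hypothetical, not my actual data model. It covers three of the rules above: scorers must be known players, itemized goals (with own-goals counted for the opponent) must add up to the final score, and substitutions can't replace ghosts or put a 12th player on the field.

```python
from collections import Counter


def check_match(match):
    """Return a list of human-readable inconsistencies found in one match record."""
    problems = []
    teams = list(match["teams"])  # assumed to hold exactly two team codes

    # Everyone who could legitimately appear in an event, per team.
    squads = {
        code: set(info["starters"]) | set(info["bench"])
        for code, info in match["teams"].items()
    }

    # Goals: the scorer must be a known player, and an own goal
    # counts toward the opposing team's total.
    tally = Counter()
    for goal in match["goals"]:
        code = goal["team"]
        if goal["scorer"] not in squads[code]:
            problems.append(f"goal credited to unknown player {goal['scorer']}")
        if goal.get("own_goal"):
            code = teams[1] if code == teams[0] else teams[0]
        tally[code] += 1
    for code in teams:
        if tally[code] != match["score"][code]:
            problems.append(
                f"{code}: {tally[code]} itemized goals vs. final score {match['score'][code]}"
            )

    # Substitutions: the player going off must be on the field,
    # and a team can never have more than 11 players on it.
    for code, info in match["teams"].items():
        on_field = set(info["starters"])
        for sub in info["subs"]:
            if sub["off"] in on_field:
                on_field.remove(sub["off"])
            else:
                problems.append(f"{code}: {sub['off']} replaced while not on the field")
            on_field.add(sub["on"])
            if len(on_field) > 11:
                problems.append(f"{code}: {len(on_field)} players on the field")
    return problems


# Hypothetical match: the score adds up, but AAA "replaces" a ghost (A99),
# which leaves them with a 12th player on the field.
match = {
    "teams": {
        "AAA": {
            "starters": [f"A{i}" for i in range(1, 12)],
            "bench": ["A12"],
            "subs": [{"off": "A99", "on": "A12"}],
        },
        "BBB": {"starters": [f"B{i}" for i in range(1, 12)], "bench": [], "subs": []},
    },
    "score": {"AAA": 2, "BBB": 1},
    "goals": [
        {"team": "AAA", "scorer": "A9"},
        {"team": "BBB", "scorer": "B4", "own_goal": True},  # counts for AAA
        {"team": "BBB", "scorer": "B10"},
    ],
}

problems = check_match(match)
for problem in problems:
    print(problem)
```

None of these rules knows anything about who actually played. They only insist that each match record agree with itself, which is exactly the part a machine can verify.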

It shouldn't have come to this. I'm an art major working for a company that makes airline software. Between FIFA and a dozen sports and news companies, somebody who lives by this data ought to have fixed it all years ago. The worst thing is, I suspect they've been trying. And yet, every source I checked had problems I could tell they couldn't find. When all data had to do was look right in the galley proofs, it was a publishing problem. But bringing publishing tools to data problems is like bringing a Strunk & White to a math fight.
Site contents published by glenn mcdonald under a Creative Commons BY/NC/ND License except where otherwise noted.