The first of April is the day we remember what we are the other 364 days of the year.--Mark Twain
I bet you read 364 days a year about how unreliable and inaccurate the free Web resources are vs. the commercial ones, which are touted to deliver accurate, complete, and timely information for reasonable fees. Let me give you a little sample of reality in this column, which will focus on large-scale errors of commission and omission in well-known (because they're widely advertised) factographic databases. With some effort these mistakes could have been detected and corrected by the database producers. I myself usually do some routine quality checks when odd results make me suspicious and force me to explore deeper--I'll also discuss this. But first let's take a look at the most traditional sources of errors.
Errors Everywhere
Most of the database errors we encounter are born in the print sources, and they're then transferred to the digital world. We're conditioned to see mistakes whenever we open a magazine or newspaper, or browse a database, even if these information sources are among the most reputable.
Many of the errors are such that a junior editor, or even a reasonably informed person, could spot them. In the January 2002 issue of Travel Holiday (p. 46), I read the writer's opinion of a package tour that said "the trip's biggest weakness is timing: Bangkok is a 10-hour flight from New York via Tokyo." It could have been a typo or a senior moment, but then the author explains insightfully, "It's a long way to travel for a week vacation." I'll bet that this author, who writes with the false aura of expertise, has never traveled from New York to Bangkok, which can't be done in less than 20 hours. Even a nonstop flight to Tokyo takes 14 hours from the Big Apple.
Errors show up even in The New York Times, the paper of record. Yes, many of them are misspellings of names, but many are factual errors. While I was working on this piece, the paper said that the Volga River flows into the Black Sea (instead of the Caspian Sea). Commenting on the uproar caused by the introduction of car surveillance cameras to catch speeders, The New York Times wrote that traffic fatalities in Hawaii rose from 23 in 1997 to 60 in 2001. (This would make Hawaii the safest state in terms of its traffic fatality rate; however, the real numbers were 131 in 1997 and 141 in 2001. The numbers the newspaper mentioned were for alcohol-related fatalities--indeed a drastic increase.)
Likewise, I was quite surprised to read about a so-called insurgency in Nepal in the February 20, 2002 issue of The New York Times: "King Birendra first ordered the army into action against the rebels last November. A government official later said the king was not taking a stand in a heated debate underway over extending the state of emergency declared then to tackle the revolt." Well, King Birendra and his family were massacred last June. He could hardly have ordered the army into action in November, so I'm not surprised that he didn't take a stand in the ongoing debate. This piece of news was a product of the joint New York Times/Reuters project, which unites two of the biggest names in news. You can imagine what gets printed by lesser-known publishers.
Speaking of name clout, The Guinness World Records database claimed that Lake Baikal, the world's deepest lake, and its surrounding area "covers 88,000 km (234,000 miles)." (See Figure 1.) For accuracy's sake, area coverage is measured in square units. Furthermore, the equator's length is only about 40,000 km, so the figure would be impossible even if the database creators had meant to report a length. Then comes the glaring conversion error from kilometers to miles: 88,000 km is roughly 54,700 miles, not 234,000.
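A producer could catch this kind of slip mechanically. The following is a minimal sketch in Python, not anything Guinness is known to run: the function names and the 5 percent tolerance are my own assumptions, and the only figures taken from the record are the 88,000 and 234,000 quoted above.

```python
KM_PER_MILE = 1.609344   # international mile, exact by definition
EQUATOR_KM = 40_075      # approximate length of the equator


def km_to_miles(km: float) -> float:
    """Convert a length in kilometers to miles."""
    return km / KM_PER_MILE


def sq_km_to_sq_miles(sq_km: float) -> float:
    """Convert an area in square kilometers to square miles."""
    return sq_km / KM_PER_MILE ** 2


def conversion_is_consistent(km: float, miles: float, tolerance: float = 0.05) -> bool:
    """True when the stated mile figure is within `tolerance` of the converted km figure."""
    expected = km_to_miles(km)
    return abs(miles - expected) / expected <= tolerance


claimed_km, claimed_miles = 88_000, 234_000

# The pair of numbers does not agree with itself.
print("conversion consistent:", conversion_is_consistent(claimed_km, claimed_miles))

# Read as a length, the figure is longer than the equator.
print("longer than the equator:", claimed_km > EQUATOR_KM)

# What the parenthetical should have said, as a length and as an area.
print("as miles:", round(km_to_miles(claimed_km)))
print("as square miles:", round(sq_km_to_sq_miles(claimed_km)))
```

The same three tests (does the conversion agree with itself, is the unit right for the quantity, is the magnitude physically possible) apply to any record that reports a measurement in two systems.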
What's in a ZIP Code?
Many of these errors are then perpetuated through the online services that syndicate others' content. Third parties who license content can add to the problem when they massage the source data, but do not apply the updated records that include not only more current data but also corrections. This is the case with Claritas, Inc.'s expensive Population Demographics (PD) database that's based on data provided by the U.S. Census Bureau (which itself has a nifty site with much of its data available for free). PD costs $80/hour on DIALOG and a full record sets you back $14.25. Dialog proudly announced last May that the database was updated with 1990 Census data. I wasn't thrilled and am still waiting for a new update, as much of the 2000 Census has been readily available for free. PD is meant to be used for market analysis, territory analysis, and strategic planning for advertising campaigns. Think twice before you do so.
PD has demographic and economic information about the entire U.S. at the ZIP code level. For product targeting, it could, for example, be the perfect tool for gauging the demographic profile of all of a state's ZIP codes. The problem is that there are more than 6,500 records that consist of nothing but zeros. Why? I can only guess that Claritas didn't apply the changes and corrections appropriately or at all to its version of the database. These records full of zeros don't appear in the U.S. Census Bureau's version (see my examples at http://www2.hawaii.edu/~jacso/extra).
This is a large-scale problem. In the case of Hawaii, which has 210 ZIP codes in the PD database, 20 percent of the records are all zeros--yet you are charged $14.25 for each of them!
If the records with zeros aren't absurd enough, you may be interested to learn that there are many records in which the total population in a ZIP code is reported as zero, but the average household income for ghost households is more than two times the national average--such as in Chester, Pennsylvania, with a $112,523 average household income. (See Figure 2.) Of greater concern are those records for territories where there are a few thousand people with $0 household and personal income; that's way below anyone's poverty level, if it's true. And I regret that for Kaneohe, Hawaii, which is a military base town a few miles from where I live, PD reported that the per capita income was $210,617 in 1990 based on a total population of 6 (which was estimated to decrease to 4 by 2000 and to remain at that in 2005). (See Figure 3.) Rather, I believe my eyes and the Census Bureau's statistics, which report the population as a realistic 2,505. Pity the consultant who bases a marketing plan on this database. The thousands of nonsense records must be removed from this database by Claritas--a very easy process.
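To show how easy, here is a rough sketch of the consistency rules described above, written in Python. It is not Claritas's or DIALOG's software: the `ZipRecord` layout, the field names, and the $100,000 per capita ceiling are assumptions made for illustration, and the third sample record is hypothetical. The Chester and Kaneohe values are the ones cited in this column; fields the column does not cite are left empty.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ZipRecord:
    # Hypothetical layout; None means the value is not given here.
    place: str
    population: int
    avg_household_income: Optional[int] = None
    per_capita_income: Optional[int] = None


def audit(record: ZipRecord, per_capita_ceiling: int = 100_000) -> List[str]:
    """Return the list of consistency problems found in one record."""
    incomes = [v for v in (record.avg_household_income, record.per_capita_income)
               if v is not None]
    # A record with no people and nothing but zero incomes carries no information.
    if record.population == 0 and incomes and all(v == 0 for v in incomes):
        return ["all-zero record"]
    problems = []
    if record.population == 0 and any(v > 0 for v in incomes):
        problems.append("income reported for zero population")
    if record.population > 0 and incomes and all(v == 0 for v in incomes):
        problems.append("population with zero income")
    # The ceiling is an arbitrary assumption; tune it to the data at hand.
    if (record.per_capita_income is not None
            and record.per_capita_income > per_capita_ceiling):
        problems.append("implausible per capita income")
    return problems


records = [
    ZipRecord("Chester, PA", population=0, avg_household_income=112_523),
    ZipRecord("Kaneohe, HI", population=6, per_capita_income=210_617),
    ZipRecord("hypothetical empty record", population=0,
              avg_household_income=0, per_capita_income=0),
]

for r in records:
    print(r.place, "->", audit(r) or "ok")
```

Run against the whole file before loading, rules like these would separate the suspect records from the usable ones, so that they could be corrected or withheld rather than sold.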
No Info Is Good Info?
It's better to say "I don't know" than to dispense wrong information and charge for it. This may have been the principle behind the decision at InfoUSA in creating the American Business Directory about U.S. companies. It has a half million records in which a listed company was not assigned to any one of 15 major industry areas (agriculture, transportation, healthcare, etc.), as they were perceived by InfoUSA as "non-classifiable establishments." Mind you, these are not mom and pop shops in Timbuktu, but sizable U.S. corporations. The company claims that "17 million phone calls are made each year, and businesses with 100 or more employees are phone-verified at least twice a year" (http://www.referenceusa.com/au/au.asp).
Just a cursory look at some of the half million "non-classifiable establishments" suggests to me that more telephone calls (or better people) would be needed. I don't think that InfoUSA needs highly paid employees to figure out under which industry classification code Young & Rubicam, Inc. or Volkswagen of America should be listed--without making a single phone call. Omaha Steaks, Intl.'s line of business certainly should ring familiar to the Omaha-based InfoUSA. (See Figure 4.)
Though the CD-ROM version of subsets of this monstrous database is cheap, on DIALOG the hourly rate is $80 and the per-record fee is $1.33. True, there are several records for the same company (for the headquarters and for the branches), but if you limit your search to the broad industry category (IN=) field you're out of luck. But why should you be penalized for American Business Directory's massive sloppiness while you search and search and display records ad nauseam until an informative one is found?
Does She or Doesn't She?
Quite often you discover by serendipity a large-scale problem that shows up in many records. You may recall that in my October 2001 column I wrote about the massive volume of missing data elements in records contained in R.R. Bowker's Ulrich's database, such as the very important circulation information that is unavailable in more than 50 percent of the records. That problem is aggravated by the fact that only some of the circulation data are obviously wrong--others are just questionable or only somewhat wrong. I got to the problem by finding some odd circulation data that just didn't jibe with my knowledge, or that simply contradicted common sense.
I wonder, for example, if Link-Up's circulation really is twice as high (10,000) as Information Today's (5,000). Most of the time, it's impossible for users to determine which data are wrong, but a casual browsing of my search results just reinforced my concern about the accuracy of circulation data in Ulrich's. Bowker is good at correcting most of the errors that I report, but it would be in a far better position if it did systematic spot-checks and compared its data with that available from ABC (no, not the American Bowling Congress, to be discussed below, but the Audit Bureau of Circulations) for at least the subset of titles contained in both databases. Often a phone call to the publisher would suffice to change the obviously absurd data.
Take as an example the bowling magazine WB--For the Woman Who Bowls, which Bowker claims has a circulation of 4 million copies. (See Figure 5.) This is a whopping number, especially if you consider that only members (female only?) of the American Bowling Congress are entitled to it, not just any woman who bowls. Or maybe I just know too little about women?
Distressed by this thought, I made a quick and unscientific survey among my female acquaintances, asking them if they bowl. Some didn't understand the question because of my accent and first thought it was a euphemism for some indecent activity, yet others thought it was a joke. However, one thing was common: None of them bowled, and by now I know that the entire membership of ABC is 2 million, including men who bowl.
Similarly, I think that Ulrich's impressively precise-looking 7,111,081 circulation figure of U.S. Surfing at a subscription rate of $35 is also excessively high. The publisher of a publication that grosses $210 million certainly could be traced down to verify the figure, which is reported by Bowker as unverified. This implies that in all those cases where the status is not unverified it is, well, verified. These occasionally discovered errors led me then to explore the database further and to find many errors.
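A systematic version of this kind of spot-check is simple to sketch. The Python below is only an illustration under my own assumptions: the `over_ceiling` helper and the idea of storing a ceiling (such as the sponsoring organization's total membership) next to each title are hypothetical, not features of Ulrich's. The figures are the ones quoted in this column.

```python
def over_ceiling(circulation: int, ceiling: int) -> bool:
    """True when a reported circulation exceeds the largest audience it could have."""
    return circulation > ceiling


# (title, reported circulation, ceiling) -- figures as quoted in this column;
# the ceiling here is the total membership of the sponsoring organization.
spot_checks = [
    ("WB--For the Woman Who Bowls", 4_000_000, 2_000_000),
]

suspects = [title for title, circ, cap in spot_checks if over_ceiling(circ, cap)]
print(suspects)
```

Any title that lands in `suspects` is a candidate for the phone call to the publisher.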
There are more mysterious oddities that require much deeper probing. I spent many hours figuring out how a new company, Prestige Factor, may have arrived at the conclusion that among all the 1,468 social science periodicals (for which it calculates a "prestige factor" using an undisclosed formula) the Annual Review of Psychology is the least prestigious, sharing this rank with John Wiley & Sons' Mental Retardation and Developmental Research Review. (Look for this story next month in the Internet Publishing section.)
People try to make others believe that something false is true, and not only on April Fool's Day. They may not set out to do so; they just neglect to apply an essential quality-control process and pass the costs on to the users. These users may be considered All Year Fools if they pay for and pass on the acquired false information, fulfilling what Mark Twain said so well. You can find many examples of various kinds of large-scale absurdities and methods to discover them in my book Content Evaluation of Textual CD-ROM and Web Databases, recently published by Libraries Unlimited.
Peter Jacso is associate professor of library and information science at the University of Hawaii's Department of Information and Computer Sciences. His e-mail address is jacso@hawaii.edu.
