In his 30-year career, Richard Taylor, an information technology architect with Lockheed Martin Mission Systems, has taken on some extremely complex jobs. He has helped turn the FBI’s fingerprint files into an electronic database, designed systems for airlines that make sure there is a gate free for every arriving flight, and developed an archive of high-quality digital images for the National Gallery of Art.
But nothing that Taylor has done approaches his current task: equipping computers to make sense of the scribbles on 1.5 billion pages of U.S. census forms returned by 121 million households.
The current census is the first to use computer software, rather than human eyes, to decipher and record information. Taylor called it “the biggest data-capture project in history,” the Super Bowl of handwriting recognition.
The Census Bureau has given Taylor’s system only until early July to do the job, about 100 days starting with the official Census Day, April 1. When you consider that many people cannot even read their own handwriting, the assignment seems that much more daunting.
In the last two censuses, in 1980 and 1990, getting those words off paper and into computers started with automated cameras built by the Census Bureau. Mechanical arms in the cameras opened the forms and smoothed the pages flat so they could be photographed for microfilm. Computers then scanned the microfilm to make note of which boxes had been checked. Seven thousand clerks, working around the clock at seven processing centers, looked at the forms and entered the remaining handwritten information into computers.
When the 1990 census ended, the Census Bureau was determined to have computers do more of the job themselves, said J. Gary Doyle, the bureau’s systems integration manager.
“We were pretty sure it would work,” Doyle said. “It was just a question of how well it would work. But even if we got, say, 50 percent with character recognition, that’s 50 percent [fewer] keyers needed.”
But the Census Bureau will not be satisfied with a 50 percent accuracy rate from Lockheed Martin Mission Systems, a unit of Lockheed Martin Corp., which won the contract in 1997. The bureau wants 98 percent accuracy, about 3 percentage points higher than the best accuracy rate people can achieve when typing the information.
In 1997, Taylor had his choice of about seven handwriting recognition programs. All of them operated in about the same way, by using mathematical probabilities and pattern matching to figure out if, for example, an individual letter was a C or an O that had not been quite closed.
“Different recognition engines are better at different things,” Taylor said. “Some are best with all uppercase; others are better with numbers.”
He settled on recognition software produced by a German company, CGK Computer Gesellschaft Konstanz, partly because Lockheed Martin’s engineers had used it in other projects and knew how to fine-tune it.
But to reach the Census Bureau’s accuracy target, character recognition was not going to be enough. Taylor’s group decided that it had to develop custom software to identify whole words.
Eventually, the software engineers took three approaches. The simplest one was to develop vast dictionaries of place names, street names and occupations from past census data and such sources as the Postal Service. In a trial run of the system using advance forms from South Carolina, the dictionaries quickly proved their value.
“I don’t want to say anything about the educational system there,” Taylor said, “but a lot of people can’t spell Carolina.”
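The dictionary approach can be illustrated with a minimal sketch. It is not Lockheed Martin's software: the word list, the `correct_place` helper and the 0.8 similarity cutoff are all assumptions made for illustration, and Python's standard `difflib` module stands in for whatever matching method the engineers actually used.

```python
import difflib

# Hypothetical mini-dictionary. The system described in the article drew its
# dictionaries from past census data and sources such as the Postal Service.
PLACE_NAMES = ["South Carolina", "North Carolina", "South Dakota",
               "Columbia", "Charleston"]


def correct_place(raw, vocabulary=PLACE_NAMES, cutoff=0.8):
    """Return the closest dictionary entry, or None if nothing is close enough."""
    # Compare case-insensitively, but return the dictionary's own spelling.
    lookup = {name.lower(): name for name in vocabulary}
    matches = difflib.get_close_matches(raw.lower(), list(lookup), n=1, cutoff=cutoff)
    return lookup[matches[0]] if matches else None


print(correct_place("South Carolinna"))
```

A misspelled entry is snapped to the nearest known name, while a string that resembles nothing in the dictionary comes back as `None` and would need another check.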
A second piece of software cross-checks data that appear more than once. For example, it makes sure that the current age given corresponds to the date of birth entered elsewhere on the form.
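A cross-check of this kind might look like the following sketch. The function names and the one-year tolerance are hypothetical, and the year 2000 for Census Day is an assumption based on the article's dates, not a detail of the actual software.

```python
from datetime import date

# Census Day is April 1 per the article; the year 2000 is an assumption.
CENSUS_DAY = date(2000, 4, 1)


def age_on(birth_date, reference=CENSUS_DAY):
    """Whole years elapsed between birth_date and the reference date."""
    years = reference.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (reference.month, reference.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_matches_birth_date(stated_age, birth_date, tolerance=1):
    """True if the stated age agrees with the birth date within the tolerance."""
    return abs(age_on(birth_date) - stated_age) <= tolerance


print(age_matches_birth_date(30, date(1970, 1, 15)))
print(age_matches_birth_date(45, date(1970, 1, 15)))
```

When the two fields disagree, at least one of them was probably misread, so the form could be flagged for a closer look.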
The final piece of word-reading software uses a more arcane approach called trigram analysis. The software team created a list of every possible three-letter combination that can be made from the alphabet. Those triplets were then compared with a database containing all the words in a large English-language dictionary.
From that comparison came tables that indicate the likelihood of any particular letter combination appearing at the beginning, in the middle or at the end of a word. Using those tables, the software checks the three-letter combinations within words (identified as words by the other recognition software) and adjusts the accuracy confidence ratings accordingly.
For example, Taylor said that if the software scanned his surname, it would increase its assessment of the recognition software’s accuracy after finding that “tay” was followed by “ayl.” By contrast, the system might reject any reading that contained “zkx.”
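The idea behind trigram analysis can be sketched in a few lines. This is a toy under stated assumptions, not the census software: the five-word list stands in for a large dictionary, simple counts replace real likelihood tables, and the `boost` and `penalty` values are invented for illustration.

```python
from collections import Counter


def trigrams_by_position(word):
    """Yield (position, trigram) pairs; position is 'start', 'middle' or 'end'."""
    word = word.lower()
    last = len(word) - 3  # index of the final trigram; negative for short words
    for i in range(last + 1):
        if i == 0:
            position = "start"
        elif i == last:
            position = "end"
        else:
            position = "middle"
        yield position, word[i:i + 3]


def build_tables(words):
    """Count how often each trigram occurs at each position in the word list."""
    tables = {"start": Counter(), "middle": Counter(), "end": Counter()}
    for word in words:
        for position, trigram in trigrams_by_position(word):
            tables[position][trigram] += 1
    return tables


def adjust_confidence(candidate, base_confidence, tables, boost=0.05, penalty=0.5):
    """Raise confidence for trigrams seen at that position, cut it for unseen ones."""
    confidence = base_confidence
    for position, trigram in trigrams_by_position(candidate):
        if tables[position][trigram] > 0:
            confidence = min(1.0, confidence + boost)
        else:
            confidence *= penalty
    return confidence


# Hypothetical word list standing in for a large English-language dictionary.
tables = build_tables(["taylor", "tailor", "baylor", "sailor", "naylor"])
print(adjust_confidence("taylor", 0.7, tables))
print(adjust_confidence("tzkxor", 0.7, tables))
```

Here a reading of "taylor" gains confidence because "tay" is a known opening and "ayl" a known middle, while a reading containing "zkx" is sharply marked down.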
While the software group was working, another team developed a system to minimize the amount of paper handling at the four data processing centers built by Lockheed Martin Mission Systems and the Census Bureau. At their busiest, the centers will each handle about 17 semi-trailer loads of forms daily.
Each form is run across a bar code reader to confirm that it was returned by the household that received it. People then examine it for things like blobs of peanut butter that might gum up the system, and it is fed into oven-size scanners made by Eastman Kodak. Each scanner handles 23,400 pages a day.
The scanners are attached to banks of Dell servers with 400-megahertz Pentium III processors running Windows NT. The data are stored, and the handwriting recognition begins. Taylor said that when he began work three years ago, he designed the system to accommodate more servers as their prices fell.
“I think of servers as pigs eating out of the trough,” he said. “If you put more pigs in, they’ll eat it up faster.”
After the forms have been read, the data are sent to the Census Bureau for compilation. Both the data and the scanned images are recorded at the processing centers onto two backup tapes, which are stored in different locations. (All the actual paper forms will be reduced to fine particles by shredders.)
The Census Bureau will not release statistics on how the system is performing until the count ends. But Lockheed Martin Mission Systems estimates that it is running at about 99 percent accuracy.
If the system has a weak link, it is the people who are still essential to the process. Each center still needs clerks to type in data, although far fewer than for previous censuses.
If the software systems review a word and are still uncertain about what it is, a digital image of that section of the form is forwarded to a person, who deciphers it and types the information into the system. If the person cannot figure out the word from the image, which happens rarely, the actual paper form is retrieved and examined.
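That escalation path, from software to image to paper, can be summarized in a short sketch. The `FieldReading` type, the `route` function and the 0.9 confidence cutoff are hypothetical; the article does not say what threshold the real system uses.

```python
from dataclasses import dataclass

# Hypothetical cutoff below which a reading goes to a human keyer.
REVIEW_THRESHOLD = 0.9


@dataclass
class FieldReading:
    field_id: str
    text: str
    confidence: float  # the software's own estimate, from 0.0 to 1.0


def route(reading, keyer_can_read_image=True, threshold=REVIEW_THRESHOLD):
    """Decide the next step for one recognized field."""
    if reading.confidence >= threshold:
        return "accept"
    if keyer_can_read_image:
        return "key_from_image"
    # The rare case: the image itself is unreadable, so the paper is pulled.
    return "retrieve_paper_form"


print(route(FieldReading("surname", "Taylor", 0.97)))
print(route(FieldReading("surname", "Tzkxor", 0.40)))
print(route(FieldReading("surname", "Tzkxor", 0.40), keyer_can_read_image=False))
```

Each step costs more than the one before it, so the better the software's confidence estimates, the fewer fields reach a clerk.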
Taylor was counting on the people backing up the automated equipment, who are hired for only a few months, to produce about 6,000 keystrokes an hour, about half the industry average. But early test runs in Baltimore showed that they were likely to do only 4,500. “This was a worry,” he said. To make sure that the Census Bureau meets its deadline, the system is initially processing only the required information from short and long forms; the forms will be run through again later for complete processing.
As impressive as the system may be, it is in a sense already obsolete, rather like the printed census form itself, which dates to 1830.
“My guess is that this is the biggest paper census–forever,” Taylor said. His company is planning to build a similar system for Britain.
This year, for the first time, the Census Bureau has allowed people to file their forms on-line. But the program has not been heavily promoted, Doyle said, and may generate only about 60,000 replies. He expects on-line filing to become more popular. “Just like anything else, you’ll see more and more people doing it on-line,” he said.
So far, the digital world has no answer for the question of how to preserve the census data for a long time. The bureau probably will use special digital printers, again from Kodak, that will record the material onto old-fashioned microfilm. Independent laboratories have certified that Kodak’s microfilm will last for 500 years, and the company says internal tests show that it is good for at least twice that. Digital storage devices, by contrast, need to be renewed every 10 to 100 years.
Microfilm offers another advantage. In the distant future, when Windows-based servers will be long forgotten, a simple magnifying glass will be able to resurrect the data stored on microfilm.




