New, Bad Information
There's a very large category of informational problem that I'm sure the information science people have a good name for, but I call it The New, Bad Information Problem. The problem, which I touched on in an earlier post, is how you deal with corrections and new details of all sorts. This is a very common issue with bibliographic details for books in print, which often change during pre-publication--I have written books that went from 600 to 1,000 pages between signing a contract and delivering electronic files. Errors are bound to occur because most bibliographic information isn't a neat XML-based hand-off of triple-checked data provided by the publisher to booksellers.
No.
Heaven forbid. Far too easy.
Instead, the vast majority of bibliographic data is entered by many hands in many places. The publisher may provide one form of database, or even a very clear XML dump, but probably in their own schema. As I talked about in my Scottish post just previously, unless you agree to normalization and authority, you're lost.
If Stephen King is listed as "King, Stephen (author)" in the data sent by one publisher and "Stephen E. King" by another, there has to be a mechanism by which you re-normalize or map all incoming autonomously normalized data into one standard set. This is the process of authority, generally speaking, but all data suppliers--authors, publishers, independent collectors of book data like Books In Print--must be consulting the same authority, or have a mapping from their normalization to an agreed-on normalization.
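The mapping step can be sketched in a few lines of Python. This is only an illustration: the `AUTHORITY` table, the agreed heading "King, Stephen", and the `to_authority` function are hypothetical names invented here, not part of any real authority file.

```python
# Hypothetical authority table: each supplier's own form of a name maps to
# one agreed-on heading. The heading "King, Stephen" is an assumption made
# for this example.
AUTHORITY = {
    "king, stephen (author)": "King, Stephen",
    "stephen e. king": "King, Stephen",
}


def to_authority(raw_name):
    """Return the agreed heading for a supplier's name string, or None if unmapped."""
    # Lowercase and collapse whitespace so trivial differences do not matter.
    key = " ".join(raw_name.lower().split())
    return AUTHORITY.get(key)


print(to_authority("King, Stephen (author)"))
print(to_authority("Stephen E. King"))
print(to_authority("S. King"))  # not in the table, so it needs human review
```

Returning None for an unmapped form, rather than guessing, keeps unknown variants visible so that someone can add them to the table.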
Thus, the new, bad information problem. If a name is misspelled and a new piece of data is collected that indicates the name is misspelled, how can a database reconcile those two pieces of data? Most booksellers and book information providers, and I speak from broad experience, produce flat outputs in which new information is always presumed to be better than old information.
This raises a further question, obviously: if you take information from multiple sources and each produces feeds of corrections and updates, how can you tell, between two sources updated on the same date or even at the same minute, which one is now authoritative and which is not?
You can't.
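A small Python sketch makes the dead end concrete. It assumes a flat "newest wins" rule; the `newest_wins` function and the record layout are invented for illustration.

```python
def newest_wins(updates):
    """Flat reconciliation: for each field, keep the update with the latest timestamp."""
    current = {}
    for update in updates:
        field = update["field"]
        # On a timestamp tie, whichever update arrives later silently replaces the other.
        if field not in current or update["timestamp"] >= current[field]["timestamp"]:
            current[field] = update
    return {field: update["value"] for field, update in current.items()}


# Two hypothetical feeds disagree about the same field at the same minute.
from_a = {"field": "author", "value": "Alfred A. Smith",
          "timestamp": "2004-10-17T07:16", "source": "A"}
from_b = {"field": "author", "value": "A. Alfred Smith",
          "timestamp": "2004-10-17T07:16", "source": "B"}

print(newest_wins([from_a, from_b]))
print(newest_wins([from_b, from_a]))
```

The two calls return different author names from identical data, because with nothing but a timestamp to go on, the outcome depends on arrival order alone.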
I support a wholesale revision in the method by which book information should be corrected, and am thinking about how to build a system to support my method of thinking. In my world, you have a definitive current snapshot of the most reliable reconciliation of all data--that's your live feed. But you also have a very deep mine of the change history with authoritativeness attached to it.
My proposed database structure would have a table for each field, and each table would have a datestamp, the informational value, and a score representing authority; a comment field would also be necessary for human review when needed.
The score would correspond to values like:
Physically examined the book
Author provided detail in email
Provided in updated XML feed from publisher
Manually renormalized
and so on.
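A rough sketch of one such per-field table, using Python's built-in sqlite3 module, might look like the following. The `book_id` column, the table and function names, and the numeric ranks are all assumptions made for the example; in particular, the ordering of the scores is illustrative and not a settled ranking.

```python
import sqlite3

# Assumed numeric ranks for the kinds of evidence listed above.
# The specific numbers and their ordering are illustrative only.
AUTHORITY_SCORES = {
    "physically_examined": 40,
    "author_email": 30,
    "publisher_xml_update": 20,
    "manually_renormalized": 10,
}


def open_store():
    """Create an in-memory database with one history table for the page-count field."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE page_count (
               book_id     TEXT NOT NULL,     -- hypothetical key identifying the book
               recorded_at TEXT NOT NULL,     -- datestamp, ISO 8601
               value       TEXT NOT NULL,     -- the informational value
               authority   INTEGER NOT NULL,  -- score representing authority
               comment     TEXT               -- notes for human review
           )"""
    )
    return conn


def record(conn, book_id, recorded_at, value, source_kind, comment=None):
    """Append a claim to the history; nothing is ever overwritten."""
    conn.execute(
        "INSERT INTO page_count VALUES (?, ?, ?, ?, ?)",
        (book_id, recorded_at, value, AUTHORITY_SCORES[source_kind], comment),
    )


def history(conn, book_id):
    """Return every claim recorded for a book, oldest first."""
    query = ("SELECT recorded_at, value, authority, comment "
             "FROM page_count WHERE book_id = ? ORDER BY recorded_at")
    return conn.execute(query, (book_id,)).fetchall()


conn = open_store()
record(conn, "book-1", "2004-03-01", "600", "publisher_xml_update")
record(conn, "book-1", "2004-09-15", "1000", "physically_examined", "counted by hand")
for row in history(conn, "book-1"):
    print(row)
```

Because rows are only ever appended, the full change history stays available alongside whatever current value is derived from it.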
The tricky part here is establishing what trumps what. For a page count field, you might have a simple hierarchy in which a physical examination by a trusted party of the actual book trumps everything else except a "physical examination was incorrect" override by a master authority.
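The page-count hierarchy could be sketched in Python as below. The `Claim` class, the `RANK` values, and the source-kind names are hypothetical, and the sketch assumes that a master override carries the corrected value with it.

```python
from dataclasses import dataclass

# Assumed hierarchy: a higher rank trumps a lower one. A master override
# outranks a physical examination, which outranks everything else.
RANK = {
    "publisher_feed": 1,
    "author_email": 2,
    "physical_exam": 3,
    "master_override": 4,
}


@dataclass
class Claim:
    value: int
    source_kind: str
    recorded_at: str  # ISO 8601 date, so string order matches time order


def current_page_count(claims):
    """Return the value of the highest-ranked claim; the newer claim breaks ties."""
    if not claims:
        return None
    best = max(claims, key=lambda c: (RANK[c.source_kind], c.recorded_at))
    return best.value


claims = [
    Claim(600, "publisher_feed", "2004-01-10"),
    Claim(1000, "physical_exam", "2004-06-01"),
    Claim(980, "publisher_feed", "2004-09-01"),  # newer, but lower authority
]
print(current_page_count(claims))

# A master authority decides the physical examination was incorrect.
claims.append(Claim(1008, "master_override", "2004-10-01"))
print(current_page_count(claims))
```

Note that the later publisher feed does not displace the physical examination: recency only matters between claims of equal rank.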
It gets more complicated when you look into normalization, as always. Let's say you get data from sources A, B, and C. A tells you in an initial entry for a new book that the author's name is A. Alfred Smith. B tells you it's Alfred A. Smith. C tells you it's A.A. Smith. Don't laugh: I've seen the same title of a book--often with the image on the booksellers' sites--appear six different ways on eight stores.
How does poor Mr. Smith get his name correctly formed? One aspect might be to score the corrections from each source. If A has fewer patches applied to its data by a certain margin than B and C, A trumps. If B provides a new record later, then B's changes to the name might be ignored, unless the name goes from A. Alfred Smith to B. Benjamin Smythe. Notice the level of logic you need to compare normalizations: we'd want to score the differences between components of each kind of fielded data.
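Both ideas can be roughed out in Python. Everything here is an assumption for illustration: the function names, the rule that the last word is the surname, and the crude test that two names are format variants when the surname and the set of given-name initials match.

```python
import re


def components(name):
    """Split a name into (surname, sorted given-name initials).

    Assumes names are written given-names-first with the surname last.
    """
    tokens = re.findall(r"[^\W\d_]+", name.lower())
    if not tokens:
        return ("", [])
    return (tokens[-1], sorted(token[0] for token in tokens[:-1]))


def is_format_variant(old, new):
    """True when two names differ only in ordering or abbreviation of given names."""
    return components(old) == components(new)


def most_trusted(patch_counts, margin):
    """Return the source with the fewest patches if it leads by at least `margin`."""
    ranked = sorted(patch_counts.items(), key=lambda item: item[1])
    if not ranked:
        return None
    if len(ranked) == 1:
        return ranked[0][0]
    (best, best_count), (_, next_count) = ranked[0], ranked[1]
    return best if next_count - best_count >= margin else None


print(is_format_variant("A. Alfred Smith", "Alfred A. Smith"))
print(is_format_variant("A. Alfred Smith", "A.A. Smith"))
print(is_format_variant("A. Alfred Smith", "B. Benjamin Smythe"))
print(most_trusted({"A": 2, "B": 9, "C": 12}, margin=5))
```

The initials test is deliberately coarse: it would also treat "Andrew A. Smith" as a variant of "A. Alfred Smith", so a real comparison would need to score each component rather than return a simple yes or no.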
This is all to say that new, bad information is a substantial problem, and any method of trying to fix the accuracy, authoritativeness, and normalization of book--or any kind of fielded data--has to involve scoring, comparisons, and history.
Comments
Um...is there some reason why the book information that's created and stored and shared by *libraries* isn't adequate to your needs?
What you're describing is what catalogers do. Physical description, authority control, etc, etc. Why on earth do you want to reinvent the wheel here?
Posted by: Liz Lawley | October 17, 2004 07:16 AM
What I'm proposing is not reinventing HOW to create authority, or any of these categories, but a method of reconciling how this information is reported, matched, normalized, and made available.
Authority is not generally available in a comprehensive manner, nor is it subject to the controls that I describe for the actual information that's flowing into any system, public or private.
Show me a library that maintains deep information and can reconcile against many sources to produce the authoritativeness I'm talking about, while reconciling the new, bad information problem; and collect information from publishers, the general public, and readers; and produce a public source of information that can be used freely...
It doesn't exist. I'm not denigrating the efforts of librarians at all. But unless I'm missing something, there's no collective, deep pool of this information that's available outside of certain closed systems, and the "deepness" doesn't involve the reconciliation over time that I'm talking about.
My general problem with book information is that it doesn't tend to get better over time. In my last eight years of working with licensed and public sources, I find that errors reverberate. There's very little in the way of tamping down these resonances that's built into any system.
Posted by: Glenn Fleishman | October 17, 2004 08:55 AM
Just a for instance: the LOC authority site says: "Yes. All authority information in Library of Congress Authorities is available free of charge via this Web site (authorities.loc.gov). Users do not have to register or request permission to search, save, print, or email the LC authority records. The only limitation is that authority records may only be saved, printed or emailed one at a time." So very useful in one sense; not at all in another.
Posted by: Glenn Fleishman | October 17, 2004 09:42 AM