
April 1, 2007

Switching over to ISBN-13

Isbn.nu was abruptly dragged into 2007 when several bookstores that I work with to retrieve price information suddenly switched to ISBN-13 as their feed or query format! We're in the process of rooting out all the 10-digit ISBN references and updating databases so that we're compliant with ISBN-13.

There are no 979 ISBN-13s out there yet (as far as I can tell), so it's a good time to make the transition.
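
A minimal sketch of the 10-to-13 conversion, for readers who want to see the arithmetic. It relies on the standard rule that an ISBN-10 becomes an ISBN-13 by prefixing 978, dropping the old check digit, and computing a new one with alternating weights of 1 and 3. The function names are made up for illustration, and the sketch does not verify the incoming ISBN-10 check digit.

```python
def isbn13_check_digit(first12: str) -> str:
    """Check digit for the first 12 digits of an ISBN-13 (weights 1, 3, 1, 3, ...)."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_to_isbn13(isbn10: str) -> str:
    """Convert an ISBN-10 to its 978-prefixed ISBN-13 (hypothetical helper)."""
    digits = isbn10.replace("-", "").replace(" ", "").upper()
    # The tenth character may be 'X'; it is discarded, so only the first nine must be digits.
    if len(digits) != 10 or not digits[:9].isdigit():
        raise ValueError(f"not a 10-character ISBN: {isbn10!r}")
    core = "978" + digits[:9]
    return core + isbn13_check_digit(core)


print(isbn10_to_isbn13("0-306-40615-2"))
```

Because only the 978 prefix is in use, the mapping is reversible for every ISBN-13 in circulation; a 979 number would have no ISBN-10 equivalent.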

I've written up a page that explains to my isbn.nu users precisely what's going on. It might be too technical/industry for them.

November 12, 2005

The Problem of Persistence

I've left this blog alone for far too long, so I come back with The Problem of Persistence, another element in my series of problems relating to structuring data, which includes The New, Bad Information Problem and The Scottish Problem.

The problem of persistence relates to creating objects that represent a particular set of ideas and that persist as the same object over time. Good example: The Köchel or K Numbers assigned to Mozart's works. Mozart himself didn't number or catalog his own oeuvre, but a 19th-century intellectual did. The works are numbered by appearance and the numbers have been used widely to refer to specific works. The numbers don't change, even if additional works are discovered between existing numbers, to avoid destroying decades of research and publishing. The numbers are persistent object identifiers to which many attributes are attached.

In Bookland, our happy 978 and 979 UPC code prefix world, there's no such animal. I am finding there is no such animal among many realms of information in which there are discrete entities that have members but not a persistent and consistent method of numbering them.

In developing a new DVD and video price comparison service that I hope to launch later this year or in early 2006, I worked with a programmer to develop persistence in the kind of unique work objects that I create for isbn.nu every time the database is regenerated.

The idea was simple, once we finally figured it out: You have to build a structure in which the information present in the structure is authoritative when a record exists. New information is always considered less authoritative for specific values in an object, and either reconciliation can happen or a queue of details to reconcile can be built, based on the manner in which a particular new detail is out of sync with existing records.

Practically speaking, this means you prime the pump of a database on, say, movies with a good, authoritative, and deep source and then choose specific values that are unique to become authoritative, such as a director's name (even the spelling) or a movie production's title, like "Star Wars: Episode I." When new information arrives, it's checked by SKU against existing information. If the authoritative details vary, they are either ignored or stuck into a queue for review--it's possible that the SKU had the wrong title, for instance. In either case, non-authoritative details, such as the movie's length or the DVD region encoding, are updated to whatever the new information provides.
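
The update rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the production system: the field names, the choice of which fields are authoritative, and the shape of the review queue are all hypothetical.

```python
# Assumption: which fields count as authoritative is a per-database choice.
AUTHORITATIVE_FIELDS = {"title", "director"}


def apply_update(store, sku, incoming, review_queue):
    """Merge an incoming record into store[sku], protecting authoritative fields."""
    existing = store.get(sku)
    if existing is None:
        # First sighting of this SKU: the incoming record primes the pump.
        store[sku] = dict(incoming)
        return store[sku]
    for field, value in incoming.items():
        if field in AUTHORITATIVE_FIELDS and field in existing:
            if existing[field] != value:
                # Keep the existing value; queue the conflict for human review.
                review_queue.append((sku, field, existing[field], value))
        else:
            # Non-authoritative details simply take the newest value.
            existing[field] = value
    return existing


store = {"sku-1": {"title": "Star Wars: Episode I", "runtime_min": 133}}
queue = []
apply_update(store, "sku-1", {"title": "Star Wars Episode 1", "runtime_min": 136}, queue)
print(store["sku-1"], queue)
```

Queuing the mismatch rather than silently dropping it leaves room for the case where the stored title, not the new one, was wrong.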

This isn't the full depth of hygiene and reconciliation I'd like, but it means that once an object is created and numbered for a unique movie production such as "Pride and Prejudice" as produced in 2005, that object can retain the same ID number or other code permanently. All other details will accrue to or revolve around that initial unique movie production information.
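
One way to picture the permanent numbering is a registry that hands out an ID the first time it sees a unique production and returns the same ID ever after. The sketch below is an assumption about how such a registry might look; the class name and the choice of (title, year) as the identifying key are hypothetical.

```python
import itertools


class WorkRegistry:
    """Hands out permanent IDs for unique works; IDs are never reused or renumbered."""

    def __init__(self):
        self._ids = {}
        self._next = itertools.count(1)

    def id_for(self, title, year):
        # Assumption: a production is identified by its case-folded title plus year.
        key = (" ".join(title.lower().split()), year)
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]


registry = WorkRegistry()
first = registry.id_for("Pride and Prejudice", 2005)
again = registry.id_for("Pride and Prejudice", 2005)
print(first == again)
```

Like a K number, the ID carries no meaning of its own; everything else about the production hangs off it.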

I'll write more about this as the system rolls out, as it will for both the new movie site and for my existing isbn.nu book site.

Remarkable coincidence: This David Weinberger piece about the issues of metadata and book identification and sub-book (chapters, etc.) identification appears today in the Boston Globe.

October 18, 2004

Whither Onix?

Is anyone using Onix, an attempt to standardize fielded data for book information in an XML schema and DTD? The site appears to be stuck in 2001 except for what one blog reader noted: the 2.1 specification was released in July 2004.

So I ask: is anyone using Onix?

And Onix doesn't solve normalization or authority, just data fielding and formatting.

October 12, 2004

New, Bad Information

There's a very large category of informational problem that I'm sure the information science people have a good name for, but I call it The New, Bad Information Problem. The problem, which I discussed a little bit in an earlier post, is how you deal with corrections and new details of all sorts. This is a very common issue with bibliographic details for books in print, which are often changed during pre-publication--I have written books that go from 600 to 1,000 pages between signing a contract and delivering electronic files. Errors are bound to occur because most bibliographic information isn't a neat XML-based hand-off of triple-checked data provided by the publisher to booksellers.

No.

Heaven forbid. Far too easy.

Instead, the vast majority of bibliographic data is entered by many hands in many places. The publisher may provide one form of database, or even a very clear XML dump, but probably in their own schema. As I talked about in my Scottish post just previously, unless you agree to normalization and authority, you're lost.

If Stephen King is listed as "King, Stephen (author)" in the data sent by one publisher and "Stephen E. King" by another, there has to be a mechanism by which you re-normalize or map all incoming autonomously normalized data into one standard set. This is the process of authority, generally speaking, but all data suppliers--authors, publishers, independent collectors of book data like Books In Print--must be consulting the same authority, or have a mapping from their normalization to an agreed-on normalization.
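
A mapping from each supplier's form to an agreed-on form might be sketched like this. The table contents and the canonical spelling "Stephen King" are assumptions for illustration; a real authority file would be far larger and shared among suppliers.

```python
# Hypothetical authority table: supplier-specific forms -> one agreed-on form.
AUTHOR_AUTHORITY = {
    "king, stephen (author)": "Stephen King",
    "stephen e. king": "Stephen King",
}


def normalize_author(raw, authority=AUTHOR_AUTHORITY):
    """Return the agreed-on form of a name, or None if the form is unmapped."""
    key = " ".join(raw.lower().split())  # fold case and collapse whitespace
    return authority.get(key)


for form in ("King, Stephen (author)", "Stephen E. King", "S. King"):
    print(form, "->", normalize_author(form))
```

Returning None for an unmapped form, rather than guessing, keeps unknown variants visible so that someone can extend the mapping.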

Thus, the new, bad information problem. If a name is misspelled and a new piece of data is collected that indicates the name is misspelled, how can a database be reconciled between those two pieces of data? Most booksellers and book information providers, and I speak from broad experience, produce flat outputs in which all new information is supposed to always be better than old information.

This raises an additional problem, obviously: if you take information from multiple sources and each produces feeds of corrections and updates, how can you tell, between two sources updated on the same date or even in the same minute, which one is now authoritative and which is not?

You can't.

I support a wholesale revision in the method by which book information is corrected, and am thinking about how to build a system to support my way of thinking. In my world, you have a definitive current snapshot of the most reliable reconciliation of all data--that's your live feed. But you also have a very deep mine of the change history with authoritativeness attached to it.

My proposed database structure would have a table for each field, and each table would have a datestamp, the informational value, and a score representing authority; a comment field would also be necessary for human review when needed.

The score would correspond to values like:
Physically examined the book
Author provided detail in email
Provided in updated XML feed from publisher
Manually renormalized
and so on.
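
The proposed structure can be sketched with an in-memory SQLite table for a single field, page count. The numeric scores, the column names, and the rule that the highest score wins (with the newest datestamp breaking ties) are assumptions made for this example; the proposal above does not fix them.

```python
import sqlite3

# Assumed scores, following the order of the list above; higher means more authoritative.
AUTHORITY = {
    "physical_examination": 40,
    "author_email": 30,
    "publisher_feed": 20,
    "manual_renormalization": 10,
}


def open_db():
    """Create one table for one field, as in the table-per-field proposal."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE page_count ("
        " isbn TEXT NOT NULL, stamp TEXT NOT NULL, value TEXT NOT NULL,"
        " authority INTEGER NOT NULL, comment TEXT)"
    )
    return db


def record(db, isbn, stamp, value, source, comment=""):
    """Append a revision; nothing is ever overwritten."""
    db.execute(
        "INSERT INTO page_count VALUES (?, ?, ?, ?, ?)",
        (isbn, stamp, str(value), AUTHORITY[source], comment),
    )


def live_value(db, isbn):
    """Current snapshot: most authoritative revision, newest first on ties."""
    row = db.execute(
        "SELECT value FROM page_count WHERE isbn = ?"
        " ORDER BY authority DESC, stamp DESC LIMIT 1",
        (isbn,),
    ).fetchone()
    return row[0] if row else None


db = open_db()
record(db, "9780306406157", "2004-01-05", 600, "publisher_feed")
record(db, "9780306406157", "2004-06-01", 1000, "physical_examination", "counted by hand")
record(db, "9780306406157", "2004-09-12", 640, "publisher_feed")
print(live_value(db, "9780306406157"))
```

The later publisher feed is kept in the history but does not displace the hand count, which is the behavior a flat "newest wins" output cannot give.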

The tricky part here is establishing what trumps what. For a page count field, you might have a simple hierarchy in which a physical examination by a trusted party of the actual book trumps everything else except a "physical examination was incorrect" override by a master authority.

It gets more complicated when you look into normalization, as always. Let's say you get data from sources A, B, and C. A tells you in an initial entry for a new book that the author's name is A. Alfred Smith. B tells you it's Alfred A. Smith. C tells you it's A.A. Smith. Don't laugh: I've seen the same title of a book--often with the image on the booksellers' sites--appear six different ways on eight stores.

How does poor Mr. Smith get his name correctly formed? One aspect might be to score the corrections from each source. If A has fewer patches applied to its data by a certain margin than B and C, A trumps. If B provides a new record later, then B's changes to the name might be ignored, unless the name goes from A. Alfred Smith to B. Benjamin Smythe. Notice the level of logic you need to compare normalizations: we'd want to score the differences between components of each kind of fielded data.
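
Scoring the differences between name components might look something like the sketch below. It is deliberately crude and rests on assumptions: the surname is taken to be the last token, given names are compared only by their initials, and the three score levels are arbitrary.

```python
import re


def name_tokens(name):
    """Split on periods and whitespace: 'A.A. Smith' -> ['a', 'a', 'smith']."""
    return [t for t in re.split(r"[.\s]+", name.lower()) if t]


def name_match_score(a, b):
    """1.0: same surname and same set of initials; 0.5: surname only; 0.0: neither."""
    ta, tb = name_tokens(a), name_tokens(b)
    if not ta or not tb or ta[-1] != tb[-1]:
        return 0.0
    initials_a = sorted(t[0] for t in ta[:-1])
    initials_b = sorted(t[0] for t in tb[:-1])
    return 1.0 if initials_a == initials_b else 0.5


for other in ("Alfred A. Smith", "A.A. Smith", "B. Benjamin Smythe"):
    print(other, name_match_score("A. Alfred Smith", other))
```

Under these assumptions the three Smith variants score as the same person, so a later change among them could be ignored, while a jump to B. Benjamin Smythe scores zero and would warrant review.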

This is all to say that new, bad information is a substantial problem, and any method of trying to fix the accuracy, authoritativeness, and normalization of book data--or any kind of fielded data--has to involve scoring, comparisons, and history.

October 6, 2004

If it isn't Scottish, it's crap!

When I worked at Amazon.com, we had a situation we called the Scottish Problem. I think Rebecca, the catalog programmer, was the coiner, as she was given to a nice turn of phrase. The Scottish Problem wasn't to do with Macbeth or tartans, but rather that Ingram Book Company delivered its electronic book information to us in a normalized fashion. Their normalization routines stunk, frankly.

One of the normalizations was that whenever the word macintosh appeared anywhere in a book title, the elves or daemons in their clean-up routine changed it to MacIntosh. That's right: with hundreds of computer books including the computer model Macintosh (no capital letter i) in the title, Ingram was overriding this.
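
Ingram's actual routine isn't known here, but a blanket rule of the following kind would produce exactly this result. The sketch is a guess at the shape of such a rule, paired with a guarded version that consults a list of protected words; both function names and the protected list are hypothetical.

```python
import re

# Matches a word starting with "Mac"/"mac" followed by a lowercase letter.
MAC_WORD = re.compile(r"\b[Mm]ac([a-z])(\w*)")


def naive_scottish_fix(title):
    """Blanket rule: capitalize the letter after 'Mac' in every matching word."""
    return MAC_WORD.sub(lambda m: "Mac" + m.group(1).upper() + m.group(2), title)


# Assumption: words that merely look like Scottish surnames are listed by hand.
PROTECTED = {"macintosh"}


def guarded_scottish_fix(title, protected=PROTECTED):
    """Same rule, but protected words pass through untouched."""
    def fix(m):
        word = m.group(0)
        if word.lower() in protected:
            return word
        return "Mac" + m.group(1).upper() + m.group(2)
    return MAC_WORD.sub(fix, title)


print(naive_scottish_fix("The Macintosh Bible"))
print(guarded_scottish_fix("The Macintosh Bible"))
```

A protected list only patches the rule; it does nothing about the deeper issue of a supplier's normalization overwriting a correction made downstream.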

The Scottish Problem was a knotty one because of a related issue that I call the New, Bad Information Problem, which I will spend some time on this blog discussing. If you fail to tag with characteristics every field of information that you receive from another party or enter yourself, then when you receive new, bad information, it can easily overwrite your existing revised, good information.

Thus you need depth to every field, from page count to title to scholastic level. Otherwise, there's no way for anyone managing the information or collecting it to understand why a change was made, and whether that change should permanently overwrite any new, bad information. In fact, you can't even evaluate whether new information is bad or not without a history and a process of reconciliation.

I'll write more about this soon, including my ideas on how to use a revision history per field per record along with prioritization.