
October 25, 2004

Lucky Number 13

The Millions (A Blog About Books) writes about the 13-digit ISBN transition. By 2007, the 10-digit ISBN (nine real information digits and one checksum digit) will be replaced by a 13-digit ISBN (3-digit prefix, 9 info digits, 1 checksum digit), which takes the same form as the current 13-digit BookLand EAN: existing ISBNs keep the 978 prefix, and newly assigned ones will start with 979.

C. Max Magee explains that there will be some real transition issues for smaller booksellers who have to convert legacy systems. On the other hand, I've been told by a lot of the bookseller aggregator sites--ones like Alibris and ABEBooks, which entirely or largely list books from independent booksellers--that most of the stores use one of a few inventory management packages, so the fix may come down to updating a handful of products. (Hey, all the online booksellers will have to move to using EANs primarily, too; some accept them now.)

The 978 prefix, which identifies a mythical country in the EAN worldspace called BookLand, allowed ISBNs to be freely converted into internationally compatible EAN systems. More confusingly, the US has been using a 12-digit UPC in retailing, which is itself transitioning to the full 13 digits for better global compatibility.

(Mark of the Beast theorists, start your biblical engines!)

It's an expansion of the namespace for ISBNs, too, because all existing ISBNs will be honored in the 978-prefix namespace forever. Publishers with existing ISBN inventories can continue to use them by converting them into the 978 system. New ISBNs will be assigned as a complete EAN number starting with 979 and will not be convertible back to the old 10-digit system. This opens 1,000,000,000 new ISBNs for assignment, just incidentally, since the two namespaces (978 and 979) will be independent.
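The conversion of an existing ISBN into the 978 namespace is mechanical: drop the old checksum digit, prepend the prefix, and compute a new EAN-style checksum. Here is a minimal Python sketch, assuming the standard checksum rules (ISBN-10 uses weights 10 down to 1, modulo 11; EAN-13 uses alternating weights 1 and 3, modulo 10). The function names are illustrative and not taken from any ISBN agency's tooling.

```python
def isbn10_is_valid(isbn10: str) -> bool:
    """Return True if the 10-character ISBN has a correct mod-11 checksum."""
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    if len(chars) != 10:
        return False
    total = 0
    for position, char in enumerate(chars):
        if char == "X" and position == 9:
            value = 10  # "X" stands for 10, and only in the checksum slot
        elif char in "0123456789":
            value = int(char)
        else:
            return False
        total += (10 - position) * value
    return total % 11 == 0


def ean13_check_digit(first12: str) -> str:
    """Compute the EAN-13 check digit for the first 12 digits."""
    total = sum(int(d) * (3 if i % 2 else 1) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_to_isbn13(isbn10: str, prefix: str = "978") -> str:
    """Convert a valid ISBN-10 into its 13-digit form under the given prefix."""
    chars = isbn10.replace("-", "").replace(" ", "").upper()
    if not isbn10_is_valid(chars):
        raise ValueError(f"not a valid ISBN-10: {isbn10!r}")
    body = prefix + chars[:9]  # the old checksum digit is discarded
    return body + ean13_check_digit(body)


print(isbn10_to_isbn13("0-306-40615-2"))  # 9780306406157
```

There is no matching function for the 979 range: a 979 number has no 10-digit equivalent, which is exactly why the two namespaces are independent.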

Here's the U.S. ISBN authority's explanation.

Magee's site is worth reading for any inside-baseball bibliophile.

October 18, 2004

Whither Onix?

Is anyone using Onix, an attempt to standardize fielded data for book information in an XML schema and DTD? The site appears to be stuck in 2001 except for what one blog reader noted: the 2.1 specification was released in July 2004.

So I ask: is anyone using Onix?

And Onix doesn't solve normalization or authority, just data fielding and formatting.

October 12, 2004

Suggestions

I've already received some interesting suggestions of topics to cover here, including the new transition to full UPC code in the U.S.--13 digits instead of 12--and how a Wiki-biblio-o-pedia might be expanded to include other media, like DVDs and music.

I have to say that I'm pretty excited about the notion of building a Wiki sort of informational site. The Wikipedia itself is a shining example of how to manage such a project, and because facts about books aren't copyrighted, there's very little risk of having people contribute tainted data. In fact, depending on how well the project is structured, some online bookstores might provide a core of data to start with in the project in order to get the benefit of collaborative error fixing.

I know as an author it frustrates me that there's no good way for me to say to the world of bookselling, here's the definitive, authoritative, author-a-tative version of my book's details.

New, Bad Information

There's a very large category of informational problem that I'm sure the information science people have a good name for, but I call it The New, Bad Information Problem. The problem, which I touched on in an earlier post, is how you deal with corrections and new details of all sorts. This is a very common issue with bibliographic details for books in print, which often change during pre-publication--I have written books that went from 600 to 1,000 pages between signing a contract and delivering electronic files. Errors are bound to occur because most bibliographic information isn't a neat XML-based hand-off of triple-checked data provided by the publisher to booksellers.

No.

Heaven forbid. Far too easy.

Instead, the vast majority of bibliographic data is entered by many hands in many places. The publisher may provide one form of database, or even a very clear XML dump, but probably in their own schema. As I talked about in my Scottish post just previously, unless you agree to normalization and authority, you're lost.

If Stephen King is listed as "King, Stephen (author)" in the data sent by one publisher and "Stephen E. King" by another, there has to be a mechanism by which you re-normalize or map all incoming autonomously normalized data into one standard set. This is the process of authority, generally speaking, but all data suppliers--authors, publishers, independent collectors of book data like Books In Print--must be consulting the same authority, or have a mapping from their normalization to an agreed-on normalization.
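In its simplest form, that mapping is just a lookup table from every known variant to one agreed form. The sketch below is a toy illustration in Python: the table contents, the choice of "Stephen King" as the agreed form, and the function names are all assumptions made for the example, not a description of any real authority file.

```python
from typing import Optional

# Hypothetical authority table: every known variant maps to one agreed form.
AUTHORITY = {
    "king, stephen (author)": "Stephen King",
    "stephen e. king": "Stephen King",
    "stephen king": "Stephen King",
}


def lookup_key(raw_name: str) -> str:
    """Fold case and collapse whitespace so trivial differences don't matter."""
    return " ".join(raw_name.lower().split())


def to_authority_form(raw_name: str) -> Optional[str]:
    """Return the agreed form, or None if the variant is not yet mapped."""
    return AUTHORITY.get(lookup_key(raw_name))


print(to_authority_form("King, Stephen (author)"))  # Stephen King
print(to_authority_form("S. King"))                 # None
```

Returning None for an unmapped variant, rather than guessing, leaves room for a human to decide whether "S. King" is the same person.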

Thus, the new, bad information problem. If a name is misspelled and a new piece of data is collected that indicates the name is misspelled, how can a database reconcile those two pieces of data? Most booksellers and book information providers, and I speak from broad experience, produce flat outputs in which all new information is supposed to always be better than old information.

This raises an additional question, obviously: if you take information from multiple sources and each produces feeds of corrections and updates, how can you tell, between two sources updated on the same date or even in the same minute, which one is now authoritative and which is not?

You can't.

I support a wholesale revision in the method by which book information should be corrected, and am thinking about how to build a system to support my method of thinking. In my world, you have a definitive current snapshot of the most reliable reconciliation of all data--that's your live feed. But you also have a very deep mine of the change history with authoritativeness attached to it.

My proposed database structure would have a table for each field, and each table would have a datestamp, the informational value, and a score representing authority; a comment field would also be necessary for human review when needed.

The score would correspond to values like:
Physically examined the book
Author provided detail in email
Provided in updated XML feed from publisher
Manually renormalized
and so on.

The tricky part here is establishing what trumps what. For a page count field, you might have a simple hierarchy in which a physical examination by a trusted party of the actual book trumps everything else except a "physical examination was incorrect" override by a master authority.
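Here is a minimal Python sketch of one field's history under this scheme, assuming the simple page-count hierarchy just described. The numeric scores, their relative order below "physically examined", and all of the names are illustrative assumptions; a real system would need a separate hierarchy for each kind of field.

```python
from dataclasses import dataclass
from datetime import datetime
from enum import IntEnum


class Authority(IntEnum):
    """Hypothetical scores; a higher value trumps a lower one."""
    MANUALLY_RENORMALIZED = 10
    PUBLISHER_XML_FEED = 20
    AUTHOR_EMAIL = 30
    PHYSICALLY_EXAMINED = 40
    MASTER_OVERRIDE = 50  # "physical examination was incorrect"


@dataclass(frozen=True)
class Revision:
    stamp: datetime
    value: str
    authority: Authority
    comment: str = ""  # free text for human review


def live_value(history: list[Revision]) -> str:
    """Pick the highest-authority revision; the newest wins among equals."""
    best = max(history, key=lambda rev: (rev.authority, rev.stamp))
    return best.value


page_count = [
    Revision(datetime(2004, 1, 5), "600", Authority.PUBLISHER_XML_FEED),
    Revision(datetime(2004, 6, 1), "1000", Authority.PHYSICALLY_EXAMINED),
    # New, bad information arrives later but does not win:
    Revision(datetime(2004, 9, 1), "600", Authority.PUBLISHER_XML_FEED),
]
print(live_value(page_count))  # 1000
```

Nothing is ever deleted: the stale feed value stays in the history, so a reviewer can see why it lost and reverse the decision if the scoring turns out to be wrong.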

It gets more complicated when you look into normalization, as always. Let's say you get data from sources A, B, and C. A tells you in an initial entry for a new book that the author's name is A. Alfred Smith. B tells you it's Alfred A. Smith. C tells you it's A.A. Smith. Don't laugh: I've seen the same title of a book--often with the image on the booksellers' sites--appear six different ways on eight stores.

How does poor Mr. Smith get his name correctly formed? One aspect might be to score the corrections from each source. If A has fewer patches applied to its data by a certain margin than B and C, A trumps. If B provides a new record later, then B's changes to the name might be ignored, unless the name goes from A. Alfred Smith to B. Benjamin Smythe. Notice the level of logic you need to compare normalizations: we'd want to score the differences between components of each kind of fielded data.
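To make the component comparison concrete, here is a crude Python sketch. It assumes names arrive in "given names, then surname" order, treats the last token as the surname, and reduces the given names to their initials; real names break all three assumptions regularly, and the function names are invented for the example.

```python
import re


def name_components(name: str) -> tuple[str, list[str]]:
    """Split a name into (surname, sorted initials of the given names)."""
    tokens = re.sub(r"[.,]", " ", name).lower().split()
    if not tokens:
        raise ValueError("empty name")
    surname = tokens[-1]
    initials = sorted(token[0] for token in tokens[:-1])
    return surname, initials


def difference_score(a: str, b: str) -> int:
    """0 = same components, 1 = one component differs, 2 = both differ."""
    surname_a, initials_a = name_components(a)
    surname_b, initials_b = name_components(b)
    return int(surname_a != surname_b) + int(initials_a != initials_b)


def is_substantive_change(current: str, proposed: str) -> bool:
    """True when the new name is more than a reshuffle of the old one."""
    return difference_score(current, proposed) > 0


print(difference_score("A. Alfred Smith", "Alfred A. Smith"))     # 0
print(difference_score("A. Alfred Smith", "A.A. Smith"))          # 0
print(difference_score("A. Alfred Smith", "B. Benjamin Smythe"))  # 2
```

A score of 0 from a less-trusted source can be ignored as a mere re-normalization, while a score of 2 suggests the source knows something new and deserves a look.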

This is all to say that new, bad information is a substantial problem, and any method of trying to fix the accuracy, authoritativeness, and normalization of book data--or any kind of fielded data--has to involve scoring, comparisons, and history.

October 06, 2004

If it isn't Scottish, it's crap!

When I worked at Amazon.com, we had a situation we called the Scottish Problem. I think Rebecca, the catalog programmer, was the coiner, as she was given to a nice turn of phrase. The Scottish Problem wasn't to do with Macbeth or tartans, but rather that Ingram Book Company delivered its electronic book information to us in a normalized fashion. Their normalization routines stank, frankly.

One of the normalizations was that whenever the word macintosh appeared anywhere in a book title, the elves or daemons in their clean-up routine changed it to MacIntosh. That's right: with hundreds of computer books including the computer model Macintosh (no capital letter i) in the title, Ingram was overriding this.
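I never saw Ingram's code, so the following Python sketch is purely a guess at the kind of rule that produces this behavior, alongside one possible guard against it. The rule, the protected-terms table, and the function names are all hypothetical.

```python
import re


def naive_scottish_cleanup(title: str) -> str:
    """Hypothetical blanket rule: force every 'macintosh' to 'MacIntosh'."""
    return re.sub(r"\bmacintosh\b", "MacIntosh", title, flags=re.IGNORECASE)


# Hypothetical exception list: spellings that must survive any cleanup.
PROTECTED_TERMS = {"macintosh": "Macintosh"}


def restore_protected_terms(title: str) -> str:
    """Put protected words back into their known-good form."""
    def fix(match: re.Match) -> str:
        word = match.group(0)
        return PROTECTED_TERMS.get(word.lower(), word)
    return re.sub(r"\b\w+\b", fix, title)


damaged = naive_scottish_cleanup("The Macintosh Bible")
print(damaged)                           # The MacIntosh Bible
print(restore_protected_terms(damaged))  # The Macintosh Bible
```

The guard only papers over the symptom, of course, and it would wrongly "fix" a book by someone actually named MacIntosh.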

The Scottish Problem was a knotty one because of a related issue that I call the New, Bad Information Problem, which I will spend some time discussing on this blog. If you fail to tag every field of information you receive from another party, or enter yourself, with characteristics such as its source and date, then when you receive new, bad information, it can easily overwrite your existing revised, good information.

Thus you need depth to every field, from page count to title to scholastic level. Otherwise, there's no way for anyone managing the information or collecting it to understand why a change was made, and whether that change should permanently take precedence over any new, bad information. In fact, you can't even evaluate whether new information is bad or not without a history and a process of reconciliation.

I'll write more about this soon, including my ideas on how to use a revision history per field per record along with prioritization.

What makes me an expert, anyway?

One word: experience.

I'm launching this site in the interests of starting conversations about the way in which book details -- author, title, subject, and even page count -- are collected, sold, disseminated, updated, broken, and misused.

My credentials? I worked for Amazon.com from 1996 to 1997 as catalog manager. In that capacity, I worked with data vendors, and developed or helped develop in-house resources as well as Web site tools that would provide the best information about each book we listed. We wanted the list to be exhaustive, but also accurate. I developed an intimate knowledge of several data feeds from book distributors and the Library of Congress in the process.

Part of the outgrowth of my time at Amazon was the realization that the way in which most online bookstores and library catalogs dealt with searching was structured around the ISBN or a library cataloguing number, like the LCCN. Instead of dealing with books as works--that is, as a discrete idea rather than an instantiation as an edition--the searches always seemed to pull down every instance. If I search on Wizard of Oz, I'm generally thinking about the work "Wizard of Oz"--I need tools that turn my concept into a set that I can then refine down to its individual members.
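The distinction can be sketched as a small data structure: each edition carries its own identifier plus a pointer to the work it instantiates, and a search collapses editions into works first. This Python sketch uses placeholder identifiers and invented field names; it does not describe how any particular catalog stores its data.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Edition:
    isbn: str      # placeholder identifiers below, not real ISBNs
    title: str
    binding: str
    work_id: str   # hypothetical key naming the underlying work


def group_by_work(editions: list[Edition]) -> dict[str, list[Edition]]:
    """Collapse a flat result list into one entry per work."""
    works: dict[str, list[Edition]] = defaultdict(list)
    for edition in editions:
        works[edition.work_id].append(edition)
    return dict(works)


results = [
    Edition("ISBN-A", "The Wizard of Oz", "hardcover", "wizard-of-oz"),
    Edition("ISBN-B", "The Wonderful Wizard of Oz", "paperback", "wizard-of-oz"),
    Edition("ISBN-C", "The Marvelous Land of Oz", "paperback", "land-of-oz"),
]
for work_id, members in group_by_work(results).items():
    print(work_id, [edition.binding for edition in members])
```

The searcher sees two works instead of three rows, and only then refines down to the hardcover or paperback.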

I used these concepts as part of my consulting for Powell's Books and Half.com after my non-compete with Amazon.com expired, although neither really implemented this part of my idea. I built isbn.nu partly as a programming experiment in 1999, and partly as a forum in which to try out my ideas on this topic.

Years after I left Amazon.com, they introduced a version of this kind of linkage, which has no particular name. Search on Wizard of Oz, and you're still presented with a long list, but the second item has a link that lets you see all 98 titles linked to a single authoritative Wizard of Oz. It's still not great: you can't view through editions. And if you click through to the first result in the list, you see a link that shows eight other editions in a better format. But what about the other 91 results?

ISBN.nu has had a kind of authority that I dub work authority for all of the books I list, which number nearly three million. Library scientists use the term authority to denote how to normalize multiple instances of, say, an author's name into a single definitive version. So if Jimmy Carter, James E. Carter, and James Earl Carter Jr. are really the same person, the authority entry for this person maps those to, say, Jimmy Carter (which is President Carter's preference, I believe).

Authority doesn't mean that the information is correct. Rather, it means that you have authoritatively settled on a single form of a category of information that might be represented in several ways. It's a way to collapse lots of individual information that is fundamentally about the same thing into a single set of information that is mapped to the same thing. This is closer to how people conceive of what they want from a book search than any of the tools I've seen.

ISBN.nu's system isn't perfect. I haven't developed full author or title authority yet, so the same author may be listed in different ways and different authors with the same name are thought to have written books they have not. I have worked with a programmer to fully normalize our data, which means to remove a lot of the detritus in information that comes into my system from a licensed source and file down the parts that aren't legitimate differences, like capitalization, extra spaces, and punctuation.
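As a rough illustration of that filing-down step, here is a short Python sketch that reduces a string to a comparison key by folding case, turning ASCII punctuation into spaces, and collapsing whitespace. It is a generic technique, not the routine ISBN.nu actually runs, and the function name is made up for the example.

```python
import string

# Translation table that turns every ASCII punctuation mark into a space.
_PUNCT_TO_SPACE = str.maketrans({mark: " " for mark in string.punctuation})


def normalization_key(text: str) -> str:
    """Reduce text to a key that ignores case, punctuation, and spacing."""
    folded = text.casefold()
    without_punct = folded.translate(_PUNCT_TO_SPACE)
    return " ".join(without_punct.split())


print(normalization_key("The  Wizard, of OZ!"))  # the wizard of oz
print(normalization_key("the wizard of oz"))     # the wizard of oz
```

Two records with the same key differ only in ways that aren't legitimate differences; the key is for matching, and the displayed title should still come from whichever record the authority process prefers.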

The next step beyond normalization will be full authority development, which I have in process.

In the coming entries, I hope to discuss a number of book information and meta-information problems I have encountered and coped with, including The Scottish Problem, The New, Bad Information Problem, The Chunky Problem, and others with names that I hope you find similarly amusing.

I'll start talking about OCLC Research's xISBN project, and how I believe that with a little effort and some coordination, along with the tenets of Feist v. Rural, the Internet community could develop a WikiBook-a-pedia, or a compendium of updatable bibliographic information along with authority and chunky linkages that could be distributed freely.

Comments are welcome. I'm using TypeKey from Movable Type to prevent comment spam and other problems. I apologize for requiring a centralized, email-verified registration, but this seems like the only solution to ensure that comments aren't turned into a horrible mush as happened on other sites I operate.