January 23, 2009

Testing New Ideas

A test version of isbn.nu at beta.isbn.nu is up and running. In this new version, the informational backend is still more or less the same, but after all the cosmetic and flow issues are dealt with, I'll be adding more ways in which information can be reported for correction by users.

September 11, 2008

Fixing Authors in Their Place

I haven't written on this blog for some time, due to the exigencies of other demands, but I wanted to note here (in case I have any readers left) that I just updated isbn.nu a few days ago to provide unique author pages. This means that 100 John Smiths can all be uniquely identified with a permanent and distinct code and URL location for their unique set of books. It also means that I could license and add author biographies, link to author Web sites, and allow reader- and author-submitted biographies or other details. That's to come.

My colleague Jeff built the outline of this months ago, and I finally had the time to integrate all the pieces. Jeff did all the heavy lifting on isbn.nu for the past few years in terms of refactoring old code, turning it into a maintainable and scalable beast, and answering my programming questions. He's helped me become an object-oriented programmer, and to understand how to build code that works.

In the last few weeks, I've built an API for isbn.nu, an interface that will allow an application programmer I'm working with to create an iPhone/iPod touch application that will work as a front-end to isbn.nu. That API sort of necessitated fixing authors into their unique identities and solving a number of other site problems that have been lagging.

I've got a bunch of other features and improvements coming at long last, but this unique author page/identity is one of the ones I've wanted to add for years.

April 1, 2007

Switching over to ISBN-13

Isbn.nu was abruptly dragged into 2007 when several bookstores that I work with to retrieve price information suddenly switched to ISBN-13 as their feed or query format! We're in the process of rooting out all the 10-digit ISBN references and updating databases, so we're compliant with ISBN-13.

There are no 979 ISBN-13s out there yet (as far as I can tell), so it's a good time to make the transition.

I've written up a page that explains to my isbn.nu users precisely what's going on. It might be too technical/industry for them.

October 8, 2006

Get Ready for ISBN-13 and 979

We're rapidly approaching January 1, 2007, when two important developments in the history of ISBN occur.

First, the 10-digit ISBN (known now as ISBN-10) will no longer be the standard for use. R.R. Bowker, which coordinates ISBNs in the US, says that ISBN-10s can be phased out from use on books starting on January 1. It's not obligatory to get rid of them, but it will make less and less sense, because of point 2.

Second, the use of ISBN-13s that start with 979 will come into being. Up until Dec. 31, 2006, ISBN-10s have been freely convertible into EAN UPCs, or the global barcodes used to identify national origin and product stockkeeping unit (SKU) numbers. An ISBN-10 can be converted by taking the digits 978 plus the first nine digits of the ISBN-10 and computing a new modulus-10 check digit for digit 13. (ISBN-10s use a modulus-11 check digit, which is why the 10th digit can be zero through nine, or X representing 10.)
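Here's a minimal sketch of that conversion in Python. The function names are my own for illustration, not anything from isbn.nu's code, and the sketch assumes the input is a well-formed ISBN-10 (it strips hyphens and spaces but doesn't verify the old check digit).

```python
def isbn13_check_digit(first12: str) -> str:
    """Modulus-10 check digit: weights alternate 1, 3, 1, 3, ... over 12 digits."""
    total = sum(int(d) * (1 if i % 2 == 0 else 3) for i, d in enumerate(first12))
    return str((10 - total % 10) % 10)


def isbn10_to_isbn13(isbn10: str) -> str:
    """Prepend 978 to the first nine digits and compute a new 13th digit."""
    cleaned = isbn10.replace("-", "").replace(" ", "").upper()
    if len(cleaned) != 10 or not cleaned[:9].isdigit():
        raise ValueError("expected a 10-character ISBN-10")
    # The old 10th character (possibly X) is dropped; it never carries over.
    first12 = "978" + cleaned[:9]
    return first12 + isbn13_check_digit(first12)


print(isbn10_to_isbn13("0-306-40615-2"))
```

Note that the old check digit is simply thrown away, which is why an X never shows up in an ISBN-13.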

With the introduction of 979s, not all ISBN-13s will be convertible into ISBN-10s; only 978-prepended ISBN-13s have a corresponding ISBN-10.
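The reverse direction can be sketched the same way. Again, the names are hypothetical and this assumes a well-formed 13-digit input; the point is only that a 978 prefix yields an ISBN-10 and anything else yields nothing.

```python
from typing import Optional


def isbn10_check_digit(first9: str) -> str:
    """Modulus-11 check digit: weights run 10 down to 2; a value of 10 is written X."""
    total = sum(int(d) * w for d, w in zip(first9, range(10, 1, -1)))
    value = (11 - total % 11) % 11
    return "X" if value == 10 else str(value)


def isbn13_to_isbn10(isbn13: str) -> Optional[str]:
    """Return the matching ISBN-10, or None when the prefix isn't 978."""
    digits = isbn13.replace("-", "").replace(" ", "")
    if len(digits) != 13 or not digits.isdigit():
        raise ValueError("expected a 13-digit ISBN-13")
    if not digits.startswith("978"):
        return None  # 979 and beyond: no ISBN-10 exists
    core = digits[3:12]
    return core + isbn10_check_digit(core)


print(isbn13_to_isbn10("9780306406157"))
```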

The reason for this is numberspace. They've run out of ISBNs in the current 10-digit space, and by adding another EAN prefix, they buy 10 to the 9th power potential new ISBNs. These are assigned in blocks to publishers, based on the publisher's scale and country of operation, and thus the numbers are used inefficiently. Certain ranges are reserved, as well.

I wrote about this at some length and less clarity two years ago. The deadline is finally upon us!

November 13, 2005

The Chunky Problem

Back in October 2004, I promised to write about three problems of book information: The New, Bad Information Problem (information hygiene over time); The Scottish Problem (normalization errors); and the Chunky Problem. I wrote about the other two back then; here's my exegesis on chunkiness.

The Chunky Problem relates to how people conceptualize information and then how that information is represented in databases and on Web sites, to name two concrete instantiations. Despite decades of analogy between the psychic construct that is the mind and card catalogs, computer filing systems, and so forth, it's clear from equally long stretches of research that people represent information in chunks that interrelate. Thus, even someone who can remember 1000 discrete ISBNs or phone numbers isn't representing that information with one neuron per digit. Rather, there's a holistic standing wave in which information flows as a basis of triggers and relationships that allows us to contain these facts.

You wouldn't know this to look at most Web sites that present information of any kind about books, media, or, really, any data that's available in masses larger than a few. Web sites typically represent data as stored in discrete records, and offer very flat databases. A flat database lacks relationships: each record is defined uniquely by one or more fields, and all of the information in the record is fielded. There's always an entry (even null) for each field. A relational database allows richer information by creating virtual objects: a list of attributes in one table might be assigned via another table to an object that resides in a third table which uniquely defines it.

For instance, a book might be a library edition paperback with an embossed cover. One table stores book attributes: paperback, hardcover, turtlebound, etc. Another table stores a list of books by ISBN. The joining table then pairs the unique ISBN with the attributes to create an object.
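To make the three-table idea concrete, here's a small sketch using Python's built-in sqlite3 module. The table and column names are my own assumptions for illustration, and the ISBN and title are placeholders, not a real book.

```python
import sqlite3
from typing import List


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with books, attributes, and a joining table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE books (isbn TEXT PRIMARY KEY, title TEXT NOT NULL);
        CREATE TABLE attributes (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
        CREATE TABLE book_attributes (
            isbn TEXT NOT NULL REFERENCES books(isbn),
            attribute_id INTEGER NOT NULL REFERENCES attributes(id),
            PRIMARY KEY (isbn, attribute_id)
        );
        """
    )
    conn.executemany(
        "INSERT INTO attributes (id, name) VALUES (?, ?)",
        [(1, "paperback"), (2, "hardcover"), (3, "library edition"), (4, "embossed cover")],
    )
    # Placeholder ISBN and title, purely for illustration.
    conn.execute("INSERT INTO books VALUES (?, ?)", ("0000000000", "Placeholder Title"))
    conn.executemany(
        "INSERT INTO book_attributes VALUES (?, ?)",
        [("0000000000", 1), ("0000000000", 3), ("0000000000", 4)],
    )
    return conn


def attributes_for(conn: sqlite3.Connection, isbn: str) -> List[str]:
    """Join through book_attributes to list a book's attribute names, sorted."""
    rows = conn.execute(
        """
        SELECT a.name
        FROM book_attributes AS ba
        JOIN attributes AS a ON a.id = ba.attribute_id
        WHERE ba.isbn = ?
        ORDER BY a.name
        """,
        (isbn,),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_demo_db()
print(attributes_for(conn, "0000000000"))
```

Each attribute name is stored once; the joining table carries only the pairings.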

XML is quite exciting to most people involved in structuring information because it allows the definition of an object-oriented structure and attributes that adhere to it without requiring the rigid formalized storage of a database. In an XML file, you can create objects explicitly, producing human-readable and machine-parseable results that represent data in a simpler and substantially more powerful method.

Now The Chunky Problem is that most media data is in very flat form. In fact, when I license data or review data for license for media like books and movies (in DVD and other form), I find that there is often neither table structure nor normalization. Rather than have a table of actors with attributes (like a biographic sketch) and tie those by reference using a unique identifier to a table of SKUs (individual movie UPC codes, for instance), it's just a long list with repetitive information all organized uniquely by SKU.

Since information wants to be chunked, or so I think, I have spent nearly a decade working on systems--mentally and in actual code--that rework flat information into rich relationships that can be represented chunkily to an end user.

Thus, my long-standing example of The Wizard of Oz. There is a work of fiction called The Wizard of Oz, and it is instantiated as many, many things, which include dozens of ISBNs, dozens of books that precede the days of ISBNs and are out of print, collections in which it appears as an item in a larger compendium, essays about it, and parodies.

At the chunkiest level, you have the work, The Wizard of Oz. Move down a chunk and you're into categories of things: books, movies, parodies. Move into the books chunk, because you're thinking "I want to buy the book, Wizard of Oz," and you find the sea of types: paperback, hardcover, large print. Finally, we can delve into a specific SKU or ISBN.

There are ways to short circuit this, too. Imagine the chunky statement, "I want to buy a new copy of the latest edition of the book Wizard of Oz." Zoom down through and the user is on that precise book's offering page with details. Or, "find me a collection in print of The Wizard of Oz and Ozma of Oz." A system that's chunky knows about those two works, and can track those works as uniquely identified items across containers, which are instantiated editions. Thus, the system knows that ISBN Z is a container of unique work Wizard of Oz and unique work Ozma of Oz.
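A rough sketch of works and containers in Python might look like the following. The class names, fields, and the ISBN-X/Y/Z labels are placeholders I've made up to illustrate the idea; this is not isbn.nu's actual data model.

```python
from dataclasses import dataclass, field
from typing import Iterable, List, Set


@dataclass(frozen=True)
class Work:
    """The chunkiest level: a uniquely identified work, independent of any edition."""
    work_id: str
    title: str


@dataclass
class Edition:
    """A container: one ISBN that instantiates one or more works."""
    isbn: str
    binding: str
    in_print: bool
    work_ids: Set[str] = field(default_factory=set)


def editions_containing(
    editions: Iterable[Edition], wanted: Iterable[Work], in_print_only: bool = True
) -> List[Edition]:
    """Return editions that hold every wanted work (optionally only those in print)."""
    wanted_ids = {w.work_id for w in wanted}
    return [
        e
        for e in editions
        if wanted_ids <= e.work_ids and (e.in_print or not in_print_only)
    ]


wizard = Work("W1", "The Wizard of Oz")
ozma = Work("W2", "Ozma of Oz")
catalog = [
    Edition("ISBN-X", "paperback", True, {"W1"}),
    Edition("ISBN-Y", "hardcover", False, {"W1", "W2"}),
    Edition("ISBN-Z", "paperback", True, {"W1", "W2"}),
]
print([e.isbn for e in editions_containing(catalog, [wizard, ozma])])
```

The query never touches titles: it matches on work identifiers, so a misspelled title on one edition wouldn't hide it from the result.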

Chunkiness works as a way to share information across objects, too. A review of Wizard of Oz as a work should adhere at the chunkiest level. A review of a particular newly edited edition that might appear as several ISBNs is much less chunky. Chunkiness needs to adhere to objects at various scales.

There's a great little joke about a professor who brings a large glass jar into a class along with a bin of rocks, a bin of pebbles, and a container of sand. He pours rocks into the jar and asks the class if it's full. They agree it is. Then he shakes in pebbles. Is it full now? Yes. Then he pours in sand. Now? Yes. Then (in some versions) a couple of beers. It's apparently a metaphor for life. (The moral in many tellings: there's always room for a couple of beers.)

The Chunky Problem looks at this as the reverse: it used to be when you went to a bookstore online, the whole jar was full of sand. Over time, it became full of pebbles. Now it's rocks, pebbles, and sand. I'd rather view it this way: the jar contains rocks. The rocks, when broken down, turn into pebbles. The pebbles, when pulverized, are sand. But unlike with physical objects, you can go from a jar to a grain of sand in one step, and the process is reversible at any time.

Most product sites would benefit from getting chunky.

November 12, 2005

The Problem of Persistence

I've left this blog alone for far too long, so I come back with The Problem of Persistence, which is another element in my series of problems relating to structuring data, which include The New, Bad Information Problem and The Scottish Problem.

The problem with persistence relates to creating objects that represent a particular set of ideas that have a persistence as that object over time. Good example: The Köchel or K Numbers assigned to Mozart's works. Mozart himself didn't number or catalog his own oeuvre, but a 19th-century intellectual did. The works are numbered by appearance and the numbers have been used widely to refer to specific works. The numbers don't change, even if additional works are discovered between existing numbers, to avoid destroying decades of research and publishing. The numbers are persistent object identifiers to which many attributes are attached.

In Bookland, our happy 978 and 979 UPC code prefix world, there's no such animal. I am finding there is no such animal among many realms of information in which there are discrete entities that have members but not a persistent and consistent method of numbering them.

In developing a new DVD and video price comparison service that I hope to launch later this year or in early 2006, I worked with a programmer to develop persistence in the kind of unique work objects that I create for isbn.nu every time the database is regenerated.

The idea is simple, when we finally figured it out: You have to build a structure in which the information present in the structure is authoritative when a record exists. New information is always considered less authoritative for specific values in an object, and reconciliation can happen or a queue of details to reconcile can be built based on the manner in which a particular new detail is out of sync with existing records.

Practically speaking, this means you prime the pump of a database on, say, movies with a good authoritative and deep source and then choose specific values that are unique to become authoritative, such as a director's name (even the spelling) or a movie production's title, like "Star Wars: Episode I." When new information arrives, it's checked by SKU against existing information. If the authoritative details vary, they are either ignored or stuck into a queue for review--it's possible that the SKU had the wrong title, for instance. In either case, non-authoritative details, such as the movie's length or the DVD region encoding, are updated to whatever the new information provides.
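In rough Python, the update rule could be sketched like this. The field names, the choice of which fields count as authoritative, the SKU label, and the sample values are all assumptions for illustration; the system I'm describing is more involved than this.

```python
from typing import Any, Dict, List, Tuple

# Assumption: these are the fields chosen as authoritative once a record exists.
AUTHORITATIVE_FIELDS = {"title", "director"}

Record = Dict[str, Any]
# Each queued item: (sku, field, existing value, incoming value)
ReviewItem = Tuple[str, str, Any, Any]


def apply_update(
    store: Dict[str, Record], sku: str, incoming: Record, review_queue: List[ReviewItem]
) -> None:
    """Merge incoming details for a SKU, protecting authoritative fields."""
    if sku not in store:
        # First sighting primes the record; everything is accepted as-is.
        store[sku] = dict(incoming)
        return
    record = store[sku]
    for name, new_value in incoming.items():
        if name in AUTHORITATIVE_FIELDS and name in record:
            if record[name] != new_value:
                # Keep the existing value and queue the mismatch for review.
                review_queue.append((sku, name, record[name], new_value))
            continue
        # Non-authoritative details simply take the newest value.
        record[name] = new_value


# Illustrative values only.
store = {"SKU-1": {"title": "Star Wars: Episode I", "length_minutes": 130}}
queue: List[ReviewItem] = []
apply_update(store, "SKU-1", {"title": "Star Wars Episode 1", "length_minutes": 133, "region": 1}, queue)
print(store["SKU-1"])
print(queue)
```

Swapping the queue for a no-op gives the "ignore" variant; the existing authoritative value wins either way.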

This isn't the full depth of hygiene and reconciliation I'd like, but it means that once an object is created and numbered for a unique movie production such as "Pride and Prejudice" as produced in 2005, that object can retain the same ID number or other code permanently. All other details will accrue to or revolve around that initial unique movie production information.

I'll write more about this as the system rolls out, as it will for both the new movie site and for my existing isbn.nu book site.

Remarkable coincidence: This David Weinberger piece about the issues of metadata and book identification and sub-book (chapters, etc.) identification appears today in the Boston Globe.

February 1, 2005

Why Certain Books Are Popular Searches

I started seeing lions everywhere.

Okay, not everywhere. They were just among the topics of the top book searches at isbn.nu. I have a page that shows me and anyone else who cares the most recent 10 searches and (since I added this a few days ago) the top 10 searches since the database for popularity was reset.

This has helped me understand how the Internet works a little better, as I can explain the source of the popularity of some of these links.

How to Have Sex in the Woods. Many people search Yahoo asking "how to have sex"--there's nothing about the woods in there at all. The book price page on my site is the #11 answer to this question.

101 Ways to Promote Your Web Site. One spammer trick is to flood your site with referrals. They hope that you either review or publish your statistics, thus increasing their Google Whuffie. This book was highly "promoted" by get-rich-quick sites through referral spam. It doesn't have anything to do with the quality or nature of the book; I suspect it's just a random link choice.

Dollar Bill Origami. Sounds like a cool book, but folks arrive here because they're searching on dollar bill origami over at Google. I did not know so many people were interested in folding dollar bills.

Endangered Tigers. Google thinks I know something about this topic.

Rapid Application Development with Mozilla. A single page on RDF links to where to find prices for one book. And it provokes dozens of clickthroughs. Must be a popular page.

Quarkexpress 6: For Print and Web Design. The program name is misspelled in this iteration of the book title in my database. It should be QuarkXPress. That extra 'e' in the middle means that Google points folks to me whenever they mis-search on Quark's flagship program.

Lions: Life in the Pride and The African Lion: This one flummoxed me until I found that these two books and links to my site were used as part of an example in "Representing Classes As Property Values on the Semantic Web." I wrote the author telling her how amusing I thought this random traffic was! She was apologetic, and I said I did not mind: the more people that find my site, the better.

Finally, an old favorite, Susie Bright's book Mommy's Little Girl: On Sex, Motherhood, Porn, and Cherry Pie. She's a fantastic writer. But why me? (Or rather, why isbn.nu?) The answer is, unfortunately, horrible. Searches on "little girl porn" lead one to my price service's doorstep. I feel like putting up a custom page for those referrers--not for this book--"Y'oughta be ashamed of yourselves, pervs!"