Back in October 2004, I promised to write about three problems of book information: The New, Bad Information Problem (information hygiene over time); The Scottish Problem (normalization errors); and the Chunky Problem. I wrote about the other two back then; here's my exegesis on chunkiness.
The Chunky Problem relates to how people conceptualize information and then how that information is represented in databases and on Web sites, to name two concrete instantiations. Despite decades of analogy between the psychic construct that is the mind and card catalogs, computer filing systems, and so forth, it's clear from equally long stretches of research that people represent information in chunks that interrelate. Thus, even someone who can remember 1000 discrete ISBNs or phone numbers isn't representing that information with one neuron per digit. Rather, there's a holistic standing wave in which information flows on a basis of triggers and relationships that allows us to contain these facts.
You wouldn't know this to look at most Web sites that present information of any kind about books, media, or, really, any data that's available in masses larger than a few. Web sites typically represent data as stored in discrete, very flat databases. A flat database lacks relationships: each record is defined uniquely by one or more fields, and all of the information in the record is fielded. There's always an entry (even null) for each field. A relational database allows richer information by creating virtual objects: a list of attributes in one table might be assigned via another table to an object that resides in a third table which uniquely defines it.
For instance, a book might be a library edition paperback with an embossed cover. One table stores book attributes: paperback, hardcover, turtlebound, etc. Another table stores a list of books by ISBN. The joining table then pairs the unique ISBN with the attributes to create an object.
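The three-table arrangement just described can be sketched with Python's built-in sqlite3 module. This is only a minimal illustration: the table names, column names, the placeholder "ISBN-A", and the book title are all assumptions made up for the example, not anyone's real schema or data.

```python
import sqlite3

# Hypothetical schema: every table and column name here is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE attributes (id INTEGER PRIMARY KEY, name TEXT UNIQUE);
    CREATE TABLE books (isbn TEXT PRIMARY KEY, title TEXT);
    CREATE TABLE book_attributes (
        isbn TEXT REFERENCES books(isbn),
        attribute_id INTEGER REFERENCES attributes(id),
        PRIMARY KEY (isbn, attribute_id)
    );
""")

# Table 1: the list of possible attributes.
conn.executemany(
    "INSERT INTO attributes (name) VALUES (?)",
    [("paperback",), ("hardcover",), ("library edition",), ("embossed cover",)],
)

# Table 2: books keyed by ISBN ("ISBN-A" is a placeholder, not a real ISBN).
conn.execute("INSERT INTO books VALUES (?, ?)", ("ISBN-A", "An Example Book"))

# Table 3: the joining table pairs an ISBN with each attribute it has.
conn.executemany(
    "INSERT INTO book_attributes SELECT ?, id FROM attributes WHERE name = ?",
    [("ISBN-A", "paperback"), ("ISBN-A", "library edition"),
     ("ISBN-A", "embossed cover")],
)

def attributes_for(isbn):
    """Return the attribute names joined to one ISBN, sorted by name."""
    rows = conn.execute(
        "SELECT a.name FROM attributes a "
        "JOIN book_attributes ba ON ba.attribute_id = a.id "
        "WHERE ba.isbn = ? ORDER BY a.name",
        (isbn,),
    ).fetchall()
    return [name for (name,) in rows]

print(attributes_for("ISBN-A"))
```

Nothing about the book lives in a fixed "binding" or "cover" field; the object is assembled by the join, so a new attribute is a new row rather than a new column.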
XML is quite exciting to most people involved in structuring information because it allows the definition of an object-oriented structure and attributes that adhere to it without requiring the rigid, formalized storage of a database. In an XML file, you can create objects explicitly, producing human-readable and machine-parseable results that represent data in a simpler and substantially more powerful way.
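As a rough sketch of what "creating objects explicitly" might look like, here is the same kind of book expressed as a small XML fragment and read back with Python's standard xml.etree.ElementTree. The element names, the placeholder "ISBN-A", and the title are assumptions invented for the example; no particular book-data XML format is implied.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: element and attribute names are made up.
DOC = """
<book isbn="ISBN-A">
  <title>An Example Book</title>
  <attribute>paperback</attribute>
  <attribute>library edition</attribute>
  <attribute>embossed cover</attribute>
</book>
""".strip()

book = ET.fromstring(DOC)

# The object carries its own attributes; no joining table is needed to read it.
isbn = book.get("isbn")
title = book.findtext("title")
attributes = [node.text for node in book.findall("attribute")]

print(isbn, title, attributes)
```

The file is readable by a person as it stands, and a program can walk it without knowing in advance how many attributes a given book has.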
Now The Chunky Problem is that most media data is in very flat form. In fact, when I license data or review data for license for media like books and movies (in DVD and other form), I find that there is often neither table structure nor normalization. Rather than have a table of actors with attributes (like a biographic sketch) and tie those by reference using a unique identifier to a table of SKUs (individual movie UPC codes, for instance), it's just a long list with repetitive information all organized uniquely by SKU.
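To make the difference concrete, here is a minimal sketch of turning that kind of flat, SKU-keyed list into an actor table and a SKU table tied together by identifier. The field names and the sample rows ("UPC-1", "Actor A", and so on) are placeholders invented for the illustration, and real licensed feeds will certainly be messier.

```python
# Hypothetical flat feed: one row per SKU/actor pair, bio repeated each time.
flat_rows = [
    {"sku": "UPC-1", "title": "Film One", "actor": "Actor A", "bio": "Bio of A."},
    {"sku": "UPC-2", "title": "Film Two", "actor": "Actor A", "bio": "Bio of A."},
    {"sku": "UPC-2", "title": "Film Two", "actor": "Actor B", "bio": "Bio of B."},
]

def normalize(rows):
    """Split flat rows into an actor table and a SKU table linked by actor id."""
    ids_by_name = {}  # actor name -> assigned id
    actors = {}       # actor id -> {"name": ..., "bio": ...}
    skus = {}         # sku -> {"title": ..., "actor_ids": [...]}
    for row in rows:
        # Assign the next id the first time a name is seen; reuse it afterward.
        actor_id = ids_by_name.setdefault(row["actor"], len(ids_by_name) + 1)
        # Store the bio once, however many SKUs the actor appears on.
        actors.setdefault(actor_id, {"name": row["actor"], "bio": row["bio"]})
        product = skus.setdefault(
            row["sku"], {"title": row["title"], "actor_ids": []}
        )
        if actor_id not in product["actor_ids"]:
            product["actor_ids"].append(actor_id)
    return actors, skus

actors, skus = normalize(flat_rows)
print(actors)
print(skus)
```

This sketch matches actors by exact name, which is itself an assumption: two people with the same name, or one person spelled two ways, would need a better key than the flat data usually provides.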
Since information wants to be chunked, or so I think, I have spent nearly a decade working on systems--mentally and in actual code--that rework flat information into rich relationships that can be represented chunkily to an end user.
Thus, my long-standing example of The Wizard of Oz. There is a work of fiction called The Wizard of Oz, and it is instantiated as many, many things, including dozens of ISBNs, dozens of books that precede the days of ISBNs and are out of print, collections in which it appears as an item in a larger compendium, essays about it, and parodies.
At the chunkiest level, you have the work, The Wizard of Oz. Move down a chunk and you're into categories of things: books, movies, parodies. Move into the books chunk, because you're thinking "I want to buy the book, Wizard of Oz," and you find the sea of types: paperback, hardcover, large print. Finally, we can delve into a specific SKU or ISBN.
There are ways to short circuit this, too. Imagine the chunky statement, "I want to buy a new copy of the latest edition of the book Wizard of Oz." Zoom down through and the user is on that precise book's offering page with details. Or, "find me a collection in print of The Wizard of Oz and Ozma of Oz." A system that's chunky knows about those two works, and can track those works as uniquely identified items across containers, which are instantiated editions. Thus, the system knows that ISBN Z is a container of unique work Wizard of Oz and unique work Ozma of Oz.
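One way to picture works and containers is the sketch below, in which each edition records the set of work identifiers it contains. The class names, field names, identifiers, placeholder ISBNs, and years are all hypothetical and chosen only to mirror the two requests above; this is not a description of any existing system.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Work:
    work_id: str  # unique identifier for the abstract work
    title: str

@dataclass(frozen=True)
class Edition:
    isbn: str             # placeholder strings, not real ISBNs
    binding: str
    in_print: bool
    year: int             # made-up publication years
    work_ids: frozenset   # the works this edition is a container of

wizard = Work("w-wizard", "The Wizard of Oz")
ozma = Work("w-ozma", "Ozma of Oz")

editions = [
    Edition("ISBN-X", "paperback", True, 2004, frozenset({wizard.work_id})),
    Edition("ISBN-Y", "hardcover", False, 1995,
            frozenset({wizard.work_id, ozma.work_id})),
    Edition("ISBN-Z", "paperback", True, 2003,
            frozenset({wizard.work_id, ozma.work_id})),
]

def editions_containing(all_editions, wanted_ids, in_print_only=True):
    """Editions whose contents include every wanted work."""
    wanted = frozenset(wanted_ids)
    return [
        e for e in all_editions
        if wanted <= e.work_ids and (e.in_print or not in_print_only)
    ]

def latest_edition(all_editions, work_id):
    """Most recent in-print edition containing the work, or None."""
    candidates = editions_containing(all_editions, [work_id])
    return max(candidates, key=lambda e: e.year, default=None)

# "Find me a collection in print of The Wizard of Oz and Ozma of Oz."
collections = editions_containing(editions, [wizard.work_id, ozma.work_id])
print([e.isbn for e in collections])

# "I want the latest edition of the book Wizard of Oz."
print(latest_edition(editions, wizard.work_id).isbn)
```

Because the question is asked in terms of works rather than titles or keywords, the out-of-print ISBN-Y drops out on its own and the single-work ISBN-X never appears as a collection.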
Chunkiness works as a way to share information across objects, too. A review of Wizard of Oz as a work should adhere at the chunkiest level. A review of a particular newly edited edition that might appear as several ISBNs is much less chunky. Chunkiness needs to adhere to objects at various scales.
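Here is a small sketch of reviews adhering at different scales, assuming a simple parent map from each ISBN up through an edited text to the work. The node names, the intermediate "text" level, and the review strings are all invented for the example.

```python
# Hypothetical hierarchy: each node points to the chunkier node above it.
parents = {
    "isbn:ISBN-A": "text:new-edit",   # two ISBNs of the same newly edited text
    "isbn:ISBN-B": "text:new-edit",
    "text:new-edit": "work:wizard",
    "isbn:ISBN-C": "work:wizard",     # an unrelated edition of the same work
}

# Reviews are attached at whatever level they actually describe.
reviews = {
    "work:wizard": ["Review of the story itself."],
    "text:new-edit": ["Review of the newly edited text."],
}

def reviews_for(node):
    """Collect reviews from a node and every chunkier node above it."""
    found = []
    while node is not None:
        found.extend(reviews.get(node, []))
        node = parents.get(node)  # None once we pass the top of the hierarchy
    return found

print(reviews_for("isbn:ISBN-A"))
print(reviews_for("isbn:ISBN-C"))
```

Each review is stored once, at the level it is about, and every ISBN beneath that level inherits it without any copying.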
There's a great little joke about a professor who brings a large glass jar into a class along with a bin of rocks, a bin of pebbles, and a container of sand. He pours rocks into the jar and asks the class if it's full. They agree it is. Then he shakes in pebbles. Is it full now? Yes. Then he pours in sand. Now? Yes. Then (in some versions) a couple of beers. It's apparently a metaphor for life. (The moral in many tellings: there's always room for a couple of beers.)
The Chunky Problem looks at this as the reverse: it used to be when you went to a bookstore online, the whole jar was full of sand. Over time, it became full of pebbles. Now it's rocks, pebbles, and sand. I'd rather view it this way: the jar contains rocks. The rocks, when broken down, turn into pebbles. The pebbles, when pulverized, are sand. But unlike physical objects, you can go from a jar to a grain of sand in one step, and the process is reversible at any time.
Most product sites would benefit from getting chunky.