The Problem of Persistence
I've left this blog alone for far too long, so I come back with The Problem of Persistence, another element in my series of problems relating to structuring data, which includes The New, Bad Information Problem and The Scottish Problem.
The problem with persistence relates to creating objects that represent a particular set of ideas and that persist as the same object over time. Good example: The Köchel or K Numbers assigned to Mozart's works. Mozart himself didn't number or catalog his own oeuvre, but a 19th-century intellectual did. The works are numbered chronologically and the numbers have been used widely to refer to specific works. The numbers don't change, even if additional works are discovered that fall between existing numbers, to avoid destroying decades of research and publishing. The numbers are persistent object identifiers to which many attributes are attached.
In Bookland, our happy world of 978 and 979 EAN barcode prefixes, there's no such animal. I'm finding the same is true in many realms of information: there are discrete entities that have members, but no persistent and consistent method of numbering them.
In developing a new DVD and video price comparison service that I hope to launch later this year or in early 2006, I worked with a programmer to develop persistence in the kind of unique work objects that I create for isbn.nu every time the database is regenerated.
The idea turned out to be simple once we finally figured it out: You have to build a structure in which the information present in the structure is authoritative when a record exists. New information is always considered less authoritative for specific values in an object. When a new detail is out of sync with existing records, it can either be reconciled on the spot or added to a queue of details to reconcile, depending on the manner in which it differs.
Practically speaking, this means you prime the pump of a database on, say, movies with a good authoritative and deep source and then choose specific values that are unique to become authoritative, such as a director's name (even the spelling) or a movie production's title, like "Star Wars: Episode I." When new information arrives, it's checked by SKU against existing information. If the authoritative details vary, the new values are either ignored or stuck into a queue for review--it's possible that the SKU had the wrong title, for instance. In either case, non-authoritative details, such as the movie's length or the DVD region encoding, are updated to whatever the new information provides.
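Here's a rough sketch of that update rule in Python, just to make it concrete. It isn't the code we actually wrote: the `Store` class, the `ingest` method, the field names, and the choice of which fields count as authoritative are all made up for illustration.

```python
from dataclasses import dataclass, field

# Assumption: which fields are authoritative is a per-database decision.
# These two are picked only to match the examples in the text.
AUTHORITATIVE = {"title", "director"}


@dataclass
class Store:
    """Hypothetical in-memory stand-in for the real database."""
    records: dict = field(default_factory=dict)       # sku -> dict of fields
    review_queue: list = field(default_factory=list)  # conflicts for a human

    def ingest(self, sku, incoming):
        existing = self.records.get(sku)
        if existing is None:
            # No record yet: the first source primes the pump.
            self.records[sku] = dict(incoming)
            return
        for key, value in incoming.items():
            if key in AUTHORITATIVE and key in existing:
                # The stored value wins; a differing value is only queued.
                if existing[key] != value:
                    self.review_queue.append((sku, key, existing[key], value))
            else:
                # Non-authoritative (or previously missing) details are
                # simply overwritten with whatever the new feed provides.
                existing[key] = value


store = Store()
store.ingest("sku-1", {"title": "Star Wars: Episode I", "length_min": 133})
store.ingest("sku-1", {"title": "Star Wars Episode 1", "length_min": 136})
```

After the second call the stored title is unchanged, the length has been updated, and the mismatched title sits in `review_queue` waiting for someone to decide whether the feed or the record was wrong.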
This isn't the full depth of hygiene and reconciliation I'd like, but it means that once an object is created and numbered for a unique movie production such as "Pride and Prejudice" as produced in 2005, that object can retain the same ID number or other code permanently. All other details will accrue to or revolve around that initial unique movie production information.
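The permanent numbering could look something like the following, again as a hypothetical sketch rather than the real system. The `ProductionRegistry` name, the choice of title plus year as the key, and the lowercase-and-trim normalization are my assumptions here.

```python
import itertools


class ProductionRegistry:
    """Hands out an ID once per unique production and never changes it."""

    def __init__(self):
        self._ids = {}                   # (normalized title, year) -> ID
        self._next = itertools.count(1)  # IDs are never reused or renumbered

    def id_for(self, title, year):
        # Assumption: title plus year is enough to identify a production.
        key = (title.strip().lower(), year)
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]


registry = ProductionRegistry()
pride_2005 = registry.id_for("Pride and Prejudice", 2005)
pride_1995 = registry.id_for("Pride and Prejudice", 1995)
```

The 2005 and 1995 productions get different IDs, and asking again for the 2005 production returns the same ID it was given the first time, no matter what else has been added in between.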
I'll write more about this as the system rolls out, as it will for both the new movie site and for my existing isbn.nu book site.
Remarkable coincidence: This David Weinberger piece about the issues of metadata and book identification and sub-book (chapters, etc.) identification appears today in the Boston Globe.