Public Media Metadata Co-op

How Metadata Disrupted the Book Industry


Find out what lessons public media can take from the book industry's experience with metadata and digital transformation in this interview between Rachelle Byars-Sargent, who heads the Information and Knowledge Metadata team at PBS, and Meg Varley Keller, who worked for many years in scholarly book and database publishing before joining PBS as a Product Manager for the Metadata Education program. The interview is excerpted from the Introduction to EIDR webinar.

Q: So Meg, before you came to public television, you worked in the book industry, which, as you've mentioned, went through a similar supply chain disruption when first chain booksellers and then e-books and Amazon appeared on the scene in the 1990s. Can you talk a bit about that?

A: Absolutely – the parallels are almost uncanny, because in both cases the transition is ultimately from a thing-based economy – a physical book or, in this case, videotapes – to a file-based, digital economy.

And in both cases, metadata became a really big deal seemingly overnight, because file-based inventories – any digital inventory – can go anywhere, really quickly. The content is literally freed from its container. You very quickly get to a point where it’s no longer possible to keep up with all of the end points in the distribution chain using tools designed for tracking physical inventory. Even with satellite, you’re dealing with something that takes up a specific amount of space on a server and takes a fair amount of time to deliver on a fixed schedule – whereas digital files can be delivered nearly instantaneously, on demand. It makes sense that the same tools you use for one won’t be up to the task once that content can go anywhere in seconds.

Q: So in the book business, what did that look like, the transition to a digital lifecycle?

A: Well, in retrospect – I was working in book marketing for a university press – it felt like all of a sudden my job changed from making beautiful catalogs and sending those out to bookstores each season to sending out lots and lots of spreadsheets with data about those same books – because that's what Barnes & Noble and Borders (remember Borders?) and, later, Amazon had to have. And it was a pain, because we had to do both. We still had to do the catalogs, because buyers at independent bookstores still used them, and many buyers even at B&N still wanted them to pick the books they wanted to sell – but on top of that we had to start providing more and more metadata and cover images, and we had to get all of that metadata out earlier and earlier in the process. Book publishing was a very established business with a set way of doing things, and change was really hard, because for a while it meant doing everything both ways.

Q: But didn’t the book industry develop ISBNs much earlier – so it was already using unique IDs and metadata like subject codes, right?

A: They were, yes – that started back in the 1960s, I think, but ISBNs did more to enable copyright standards than supply chain efficiencies, and the BISAC subject codes were basically created as a crosswalk between the way librarians organized books and the way bookstores did. The book industry later adopted UPC codes for retail, like every other retail-based industry at the time, but to make it possible to track digital books (and distinguish them from print copies) AND to enable really robust search capabilities (the kind that let Borders or Amazon recommend new books to you in a way that in those days seemed like magic), it had to develop a more robust standard, which is called ONIX. Before ONIX, publishers were going crazy trying to deliver metadata in a different format to every single wholesale and retail partner – they all had their own requirements, just as today's video platforms and SRD partners do. That's another uncanny parallel: a new industry standard, even though it means a bit more work in the short term, actually means less work in the long run once everyone's on the same page.

Q: Was there anything positive about that period of change that you see in retrospect, anything that seemed like short-term wins or motivating reasons to move more quickly toward a data-driven workflow?

A: One big advantage, because I was working initially at a small university press, was that metadata puts you, at least at first, on a level playing field. The biggest publishers could throw a lot of money around with advertising, or at putting their books at the front of the store in Barnes & Noble, and things like that, but information in this instance is free – we could get to market faster and squeeze out greater profitability just by being more efficient with the data, by following the rules. So it's at least in theory a democratizer – machines reading that data don't care whether you have a big budget. If the data is industry-standard and meets partner and platform needs, you have a competitive advantage, because your content can be more discoverable than that of the biggest networks. It doesn't even take a lot of technology – there are lots of ways to get the metadata together and out without spending a lot on the tech. As you always say, it's not about the technology – in the 1990s we were still using spreadsheets at first. Later we were very happy to switch to software that could do an XML output for us, but the market quickly develops new tools when there are needs like that, and you're seeing that start to happen in this industry, too.
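To make that "spreadsheet in, standardized feed out" idea concrete, here is a minimal sketch in Python. The record structure below is a simplified, ONIX-inspired illustration rather than the actual ONIX for Books schema, and the element names, sample rows, and output file name are hypothetical – the point is only that one standardized export can replace a different spreadsheet for every partner.

```python
# Minimal sketch: turn tabular title data into one standardized XML feed.
# The element names here are illustrative, NOT the real ONIX schema.
import xml.etree.ElementTree as ET

# In practice these rows would come from the marketing spreadsheet (e.g. a CSV export).
rows = [
    {"isbn": "9780000000001", "title": "Example Title",  "subject_code": "HIS036060", "pub_date": "1998-09-01"},
    {"isbn": "9780000000002", "title": "Another Title",  "subject_code": "SCI055000", "pub_date": "1999-03-15"},
]

feed = ET.Element("TitleFeed")
for row in rows:
    product = ET.SubElement(feed, "Product")
    ET.SubElement(product, "ISBN").text = row["isbn"]
    ET.SubElement(product, "Title").text = row["title"]
    ET.SubElement(product, "SubjectCode").text = row["subject_code"]   # e.g. a BISAC code
    ET.SubElement(product, "PublicationDate").text = row["pub_date"]

# One standardized file every partner can ingest, instead of one spreadsheet per partner.
ET.ElementTree(feed).write("title_feed.xml", encoding="utf-8", xml_declaration=True)
```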

The second big advantage was that it quickly became obvious that once we got our systems set up internally to deliver the metadata in the right way at the right time, we moved closer and closer to automating the most tedious parts of distribution. Everything internally gets much more efficient once you are focused on getting all of the information together in a standardized way. There's no point in maintaining the same data separately in different departments, and you have to eliminate the risk of one of those sources being wrong. Everyone has to get on the same page, or the risk of sending out the wrong information is too high.

Finally – the third big win – was "the long tail," all of the deep backlist titles we could start selling. Once everything was captured in metadata, you could promote backlist titles alongside the new releases they belonged with, based on their subject area or other metadata elements – and it was very easy, whereas before you had to have a lot of historical knowledge about the books the press had published to know which titles to cross-promote. Once the metadata was in place, you could pull them up in a second – and the "long tail" of subject-driven search was a big deal, especially for small niche publishers and for niche-based retailers; sales of older titles went up a lot. In the same way now, metadata will eventually make it very easy for any producer or distributor to quickly put their hands on every program focused on pandemics, for example, or civil rights. Programmers don't have to look only to the latest program offers for content; they've got an amazing backlist of evergreen content to leverage, which is really helpful right now with the gap in production, for example. So these efficiencies really open up new opportunities internally, in the way stations do business, as well as for viewers out in the world looking for a frictionless findability experience.
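As a rough illustration of that subject-driven backlist lookup, here is a small Python sketch. The records, field names, and subject terms are invented for the example; the point is that once subjects and dates live in structured metadata, finding every related older title is a simple query rather than institutional memory.

```python
# Minimal sketch: pull older titles that share a subject with a new release.
from datetime import date

catalog = [
    {"id": "T001", "title": "New Pandemic History",     "subjects": {"pandemics", "public health"}, "released": date(2020, 5, 1)},
    {"id": "T002", "title": "1918 Influenza Revisited", "subjects": {"pandemics", "history"},       "released": date(2009, 2, 1)},
    {"id": "T003", "title": "Civil Rights on Film",     "subjects": {"civil rights"},               "released": date(2014, 8, 1)},
]

def related_backlist(catalog, subject, newer_than):
    """Return older titles sharing a subject, for cross-promotion with new releases."""
    return [item for item in catalog
            if subject in item["subjects"] and item["released"] < newer_than]

print(related_backlist(catalog, "pandemics", date(2020, 1, 1)))
# -> the 2009 title, surfaced instantly from its subject metadata
```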

Q: So are there any crystal ball predictions you can think of based on having gone through a similar process in another content industry?

A: In the book industry, once there was a reliable, trustworthy, industry-standard, machine-readable "hub" of core metadata, a lot of other tools got built around it. Once that platform of data is in place, as it now is with EIDR, it can be cross-linked via APIs to create new combinations of data and to automate various pieces of the content discovery and distribution lifecycle.

In the book industry, a number of tools emerged – starting with ones that eventually eliminated the need for all of those spreadsheets. Once all the necessary data is there, the platforms and partners that need it will build tools to automate the import of that metadata into their systems, and that's vastly more efficient than dealing with each content creator individually. For the fields above and beyond what is in EIDR – keywords are just one example – maybe you still have to send that data over to the partner, but that piece can get automated, too, and linked to the appropriate EIDR ID using APIs. This is already happening: at PBS we've had platforms say that rather than receive the data directly from PBS, they'd rather just get the EIDR IDs. That's easier for everyone. Eventually, in the book industry, tools emerged that made it really easy for content creators to push out their metadata as XML without ever having to learn XML or even understand what it was – and then, no more spreadsheets.
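Here is a minimal sketch of what that automated hand-off can look like: supplemental fields that live outside EIDR (keywords, in this example) are keyed to the EIDR ID and pushed to a partner programmatically. The partner endpoint, payload shape, and example ID below are all hypothetical placeholders – a real integration would follow that partner's actual API and authentication – but the pattern of "EIDR ID plus extra fields over an API" is the idea described above.

```python
# Minimal sketch: deliver supplemental metadata keyed to an EIDR ID.
# Endpoint, payload shape, and the ID below are hypothetical placeholders.
import json
import urllib.request

record = {
    "eidr_id": "10.5240/0000-0000-0000-0000-0000-X",   # placeholder, not a real EIDR ID
    "keywords": ["pandemics", "public health", "history"],
}

req = urllib.request.Request(
    "https://partner.example.com/api/supplemental-metadata",  # hypothetical partner endpoint
    data=json.dumps(record).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(req) as resp:
    # The partner matches the keywords to the right title via the EIDR ID.
    print(resp.status)
```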

The next thing that happens – or it did in the book industry – is that once that data is there, tools get built that content creators or distributors find useful enough to pay for. In the book industry, push-button catalogs became something publishers were more than happy to pay for; the company that created that tool then built another for book reviewers and media to get advance access to e-books so they could review them pre-publication, and built tools leveraging the same data for independent bookstores, because it made ordering books so much easier. So my bet is that lots of new services will appear that leverage the single source of truth being built with EIDR, and, as was the case in the book world, the key to getting past the spreadsheets and the hassle of metadata is, ironically, the metadata itself. Embrace the metadata, and the metadata will (eventually) set you free.
