Holding Out for a CV Hero: The Frick Computer Vision Symposium

16 Apr 2018

What follows are some informal notes I jotted down after moderating a symposium at The Frick Collection’s Digital Art History Lab called “Searching Through Seeing: Optimizing Computer Vision Technology for the Arts” on April 12-13, 2018:

Everyone wants visual search, and a lot of people are doing it. The majority of the projects we saw relied in some way on calculating proximities of artworks using the feature vectors generated by variously-trained convolutional neural networks. With these proximities, you can return ranked search results of objects that appear visually similar. But “the one true similarity measurement” does not exist: from a mathematical standpoint, every feature space generated by a neural network, no matter how many dimensions, can offer a distinct kind of visual similarity. Sometimes that similarity had to do with detecting shared poses of figures; sometimes it was focused on stylistic affinities such as color or texture instead. And of course, there are manifold art historical similarities as well that are often bound up in properties not directly visible in 2D digital surrogates, such as production histories that involve states, versions, and copies, not to mention the social relationships between artists and higher-order conceptual similarities that have little to do with the merely visual. Art historians will need to invest a bit more time to understand the subtleties of visual similarity as seen through the eyes of neural networks in order to better structure our questions in ways that mesh with these approaches.
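
To make the "no one true similarity" point concrete, here is a minimal sketch of ranked visual search by cosine similarity. Everything in it is an assumption for illustration: the artwork names and the tiny three-dimensional vectors are invented, and real feature vectors would come from a trained network and have hundreds or thousands of dimensions. It is not the method used by any particular project at the symposium.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_similar(query_id, features):
    """Return every other id in `features`, most similar to the query first."""
    query = features[query_id]
    others = [key for key in features if key != query_id]
    return sorted(
        others,
        key=lambda key: cosine_similarity(query, features[key]),
        reverse=True,
    )

# Hypothetical feature space from a network trained to attend to figure pose.
pose_features = {
    "portrait_a": [0.9, 0.1, 0.0],
    "portrait_b": [0.8, 0.2, 0.1],
    "landscape_c": [0.1, 0.9, 0.3],
}

# Hypothetical feature space from a network trained to attend to color.
color_features = {
    "portrait_a": [0.2, 0.7, 0.1],
    "portrait_b": [0.9, 0.1, 0.3],
    "landscape_c": [0.3, 0.8, 0.2],
}

pose_ranking = rank_similar("portrait_a", pose_features)
color_ranking = rank_similar("portrait_a", color_features)
print(pose_ranking)   # ['portrait_b', 'landscape_c']
print(color_ranking)  # ['landscape_c', 'portrait_b']
```

The same query returns opposite rankings depending on which feature space is consulted, which is why it matters to ask what a given network was trained to notice before trusting its notion of "similar."
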
Very few of the presenters openly talked about error rates in their work. One shining exception was the pair of graduate researchers XY Han and Vardan Papyan who had collaborated with the Frick on a pilot program to automate category tagging of a portion of the Art Reference Library’s photo archive. Focusing on 19th century American portraiture, they tried to train a multi-label classifier to apply some of the Frick’s custom vocabularies for tagging images such as Man, Woman, Half-Length, With Hat, Hands Showing, Hands Not Showing. Where there were a lot of training examples (e.g. Man, Woman) their classifier performed quite well. Where they had only a few dozen training examples (e.g. Full-Length) performance was much lower. During discussion, they also shared some of the more delightful patterns they found when assessing errors made by their model, including how it seemed to learn that white cuffs were a reliable indicator of hands being present in a portrait… leading it to erroneously add Hands Showing to some images that prominently featured sheep! Focusing on error rates like this is productive in several ways. It helps crucially deflate the hype machine around deep learning, yes. But it also reminds us that such rates can be realistically evaluated against the amount of human labor that would otherwise be needed to categorize these images; something essential for decision-makers in cultural heritage orgs to understand. The art historian in me was also fascinated by the way that this particular cuff/sheep error suggests lines of inquiry into fashion history as seen through portraiture from this period…
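
For readers who want to see what reporting error rates per tag might look like, here is a small sketch that computes support, precision, and recall for each label of a multi-label classifier. The tag names are borrowed from the list above, but the four "images" and their predictions are invented toy data, and the function name `per_label_report` is my own; none of this reproduces Han and Papyan's actual evaluation.

```python
def per_label_report(true_tags, predicted_tags, labels):
    """Compute support, precision, and recall for each tag.

    `true_tags` and `predicted_tags` are parallel lists of sets,
    one set of tags per image.
    """
    pairs = list(zip(true_tags, predicted_tags))
    report = {}
    for label in labels:
        true_pos = sum(1 for t, p in pairs if label in t and label in p)
        false_pos = sum(1 for t, p in pairs if label not in t and label in p)
        false_neg = sum(1 for t, p in pairs if label in t and label not in p)
        predicted = true_pos + false_pos
        support = true_pos + false_neg  # how many images truly carry the tag
        report[label] = {
            "support": support,
            "precision": true_pos / predicted if predicted else None,
            "recall": true_pos / support if support else None,
        }
    return report

# Invented toy data: human-assigned tags vs. model predictions.
true_tags = [
    {"Man", "Hands Showing"},
    {"Woman"},
    {"Man"},
    {"Woman", "Hands Showing"},
]
predicted_tags = [
    {"Man", "Hands Showing"},
    {"Woman", "Hands Showing"},  # a false positive for Hands Showing
    {"Man"},
    {"Woman"},                   # a missed Hands Showing
]

report = per_label_report(
    true_tags, predicted_tags, ["Man", "Woman", "Hands Showing"]
)
for label, scores in report.items():
    print(label, scores)
```

Reporting support next to each score matters: a tag with only a few dozen examples can post a misleading number in either direction, which is exactly the pattern the presenters described.
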
Speaking of interpretation, this is another topic that I left the event wanting to learn more about. Deep learning interpretability is a hot topic in machine learning circles: how do we explain why a neural network made a classification for one particular image? Moreover, can we “read” the internals of a successfully-trained model to understand what generic visual features allow it to distinguish between, for example, Man and Woman in nineteenth-century American portraiture? Because if so, I’d love to learn how those generic visual signals overlapped with, or differed from, the signals distinguishing those tags in eighteenth-century vs. twentieth-century portraiture. Tools like pixel activations are letting us get some insight into the former. But the latter is still very difficult to do. Using deep learning in a way that enhances the way art historians interpret the history of style itself is something I’d love to see, but we’re a ways off from that still.
Another disappointment: none of the computer scientists had much to say at all about the extreme prevalence of the WikiArt database in all of these presentations. It wasn’t for lack of art historians asking, though: Titia Hulst vocally inquired after projects that worked with corpora of abstract paintings. Notable, also, was that no one seemed ready to take up the question of what computers were not seeing - that discussion table was left conspicuously empty when we asked participants to select one of seven different discussion options. This event highlighted the continuing gaps in the digital canon, not only in non-Western art, but also modern and contemporary art (is there a copyright & licensing problem here like there is in quantitative literary analysis?). This isn’t even to mention problems posed by images of three-dimensional artworks…
User interface projects played a larger role than I had expected going in. This was a very useful surprise, though, because I began to understand that these interfaces are key not just for enabling new ways to do visual search, but even more for their importance in the original data creation process. Creating the training data for a deep learning model requires thousands or tens of thousands of human-tagged images. This tagging work should benefit from modern interface design, but instead all too often happens inside clunky cataloging systems like TMS. Training networks to understand useful types of visual similarity is an even more difficult problem, as one must declare image-to-image links. Work by EPFL on the REPLICA project paves a useful path forward for this, but so too does the comparatively simple Image Investigation Tool developed by art historian Elizabeth Honig, or the ARIES interface soon to be debuted by the Frick. More human-centered interfaces and workflows are also needed for doing quality control on the predictions made by these systems. It’s foolish for institutions to pay for an entire system to help automate metadata, only to insist on hand checking 100% of its predictions. But the field still needs to learn what questions to ask when determining where this cutoff lies in the context of a particular project.
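
One way to start asking where that cutoff lies, sketched here under loud assumptions: suppose the system reports a confidence score with each prediction, and suppose staff have hand-checked a representative sample. The function below (the name `choose_review_cutoff` and the sample values are invented for illustration) finds the lowest confidence at which automatically accepted predictions still meet a target accuracy; everything below that cutoff would go to human review.

```python
def choose_review_cutoff(validated, target_accuracy):
    """Pick the lowest confidence cutoff whose accepted predictions meet the target.

    `validated` is a list of (confidence, was_correct) pairs from a
    hand-checked sample. Returns (cutoff, share_auto_accepted), or
    (None, 0.0) if no cutoff reaches the target accuracy.
    """
    ranked = sorted(validated, key=lambda pair: pair[0], reverse=True)
    best = (None, 0.0)
    correct = 0
    for count, (confidence, was_correct) in enumerate(ranked, start=1):
        if was_correct:
            correct += 1
        # Only evaluate a cutoff once all predictions tied at this
        # confidence have been counted.
        if count < len(ranked) and ranked[count][0] == confidence:
            continue
        if correct / count >= target_accuracy:
            best = (confidence, count / len(ranked))
    return best

# Invented hand-checked sample: (model confidence, was the tag right?)
sample = [
    (0.99, True), (0.95, True), (0.90, True), (0.80, True),
    (0.70, False), (0.60, True), (0.55, False), (0.50, False),
]

cutoff, share = choose_review_cutoff(sample, target_accuracy=0.9)
print(cutoff, share)  # 0.8 0.5
```

In this toy sample, accepting everything at confidence 0.8 or above would skip review for half the predictions while staying above the 90% target. The real questions are institutional ones: what target accuracy is acceptable for which tags, and how large and representative the hand-checked sample needs to be before the estimate can be trusted.
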
Speaking of deliverables, I’ll finish with the big question about next steps. What is the “product” that this field needs next? Is it a trained model for other institutions to implement? A hosted service that everyone can easily plug in to? Even with my journeyman understanding of these deep learning models, I suspect that these are the wrong targets. Individual institutions and researchers will always have too many odd requirements for their own projects for prebuilt models to solve satisfactorily. Everyone will want to customize, so it is unlikely your single model will foment much of a shift in the field. And anyway, whatever model you build now may be hopelessly out of date within two years. As for hosted services, Google and Amazon will always do that better than we will, so why should we attempt to beat them at what is very much their game? I’m much more drawn to Carl Stahmer’s forceful call for cultural heritage institutions to do what we do uniquely well: build collections. Not just physical collections, though, but series of well-documented, specialized digital corpora like the English Broadside Ballad Archive that can be used as more appropriate benchmarks for CV projects than the over-generalized and under-documented collections of WikiArt and the like.

Alongside those solid datasets, I also think good white papers with lessons learned on projects like the early experiments the Frick has run with their photo archive will be invaluable as more institutions begin to develop projects and funding proposals. We need realistic numbers about how much training data is needed for certain tasks, expected accuracy rates, reasons for using packaged model training services like Clarifai vs. building your own with TensorFlow, and example workflows for assessing results and adding them into existing cataloging systems. I suspect that these kinds of solid practicalities will best help drive productive conversations about what ongoing models for collaboration will look like in this field.

Postscript

Serendipitously, a marvelous article by Rachel Sagner Buurma and Laura Heffernan on the pioneering quantitative literary work of Josephine Miles was making the rounds on Twitter today. One of their quotes from Miles (here talking about the contributions of Penny Gee, a punch card operator) struck a chord with me:

Later, Miles remembered Gee as “very smart and good” and—most importantly—a true collaborator, as opposed to those “IBM people from San Jose” who would arrive periodically to flatly ask, “What can we do to help you?” “I’ve never been able to connect with them,” Miles explains, “though I did with Penny Gee. She really taught me.”

In my wrap-up remarks on the second day of the conference, I noted that there seemed to be no shortage of computer science and software engineering experts interested in collaborating. We heard many variations on “What can we do to help you?” However, comparatively few art historians in attendance seemed to be piping up with possible answers. The two things that computer vision offers out of the box - multi-label classification and visual search - are exciting for cultural heritage institutions because they have lots of implications for how those institutions catalog and serve their assets. But for many art historians, even the digitally curious, this just looks like a scaling up of the usual business. We find images, we write about them. What computer vision “can do to help us” seems, well, a bit boring at first glance.

A much smaller handful of art historians are engaging with the more methodologically revolutionary affordances of explicitly-model-based historical argumentation. Diana Greenwald, for example, highlighted during our discussions at the Frick that the ability to count certain subjects across large image corpora could be an enormous boon for social art history.¹ I also hold out hopes that the right bouquet of customized models can help art historians begin to do the type of predictive modeling work that can capture varying histories of style akin to the kind of work that Ted Underwood has done on literary prestige. It’s the kind of work that I’m conducting right now with Sandra van Ginhoven as we work through the Getty Provenance Index databases. But we long to turn our gaze from data about the production and movements of artworks to computationally consider their visual information as well.

¹ Not incidentally, Greenwald recently published an article on collecting artists that very explicitly uses model-based argumentation: “Colleague Collectors: A Statistical Analysis of Artists’ Collecting Networks in Nineteenth-Century New York,” Nineteenth-Century Art Worldwide 17, no. 1 (2018) http://www.19thc-artworldwide.org/spring18/greenwald-on-artists-collecting-networks-in-nineteenth-century-new-york

Lincoln, Matthew D. "Holding Out for a CV Hero: The Frick Computer Vision Symposium." Matthew Lincoln, PhD (blog), 16 Apr 2018, https://matthewlincoln.net/2018/04/16/holding-out-for-a-cv-hero-the-frick-computer-vision-symposium.html.

