Scoping The Index of Digital Humanities Conferences for Now and for Later
Earlier this month, Scott Weingart, Nickoal Eichmann-Kalwara, and I were delighted to release The Index of Digital Humanities Conferences, a searchable database of conferences, authors, and abstracts charting a long history of scholarly digital humanities events back to 1960.
You can read much more about the details of the project on its About page. As I’ve done for other CMU projects, I want to reflect on some of the behind-the-scenes development process for this resource.
While Scott and Nickoal have been working on this dataset for years, my involvement began in my first few weeks on the job here at CMU in 2018. The spreadsheets they had been compiling were gradually starting to strain under the complexity of their data. Authors changed institutions, provided multiple affiliations even on a single abstract, and gave different variations of their names (or changed their names) over time, amongst a host of other tiny challenges. I’m a firm proponent of working in spreadsheets whenever you possibly can… up until the point where the complexity of your data demands a more stringent database solution.
Because I needed to pick up Django in order to develop some plugins for Janeway as part of another CMU project, we figured it would be worth the time for me to learn more of the framework’s ins and outs by setting up a database and data entry interface for the DH conferences data they’d collected so far. This would give them better control over links between entities like abstracts, authors, and institutions, and would also allow me to program some useful data cleaning capabilities like merging two author records together, something that would be difficult to do in a valid way with a bunch of disconnected spreadsheets. I’m very happy with the decision to work with Django, not least because its documentation spans from beginner tutorials to mid-level guides to key concepts like models or the QuerySet API, down to very readable object- and method-level reference docs.
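To make the merge problem concrete, here is a minimal sketch of what merging two author records involves once the links live in a database. It uses Python's built-in sqlite3 module so that it runs on its own. The two-table schema, the column names, and the merge_authors helper are hypothetical illustrations, not the project's actual Django models or code.

```python
import sqlite3


def merge_authors(conn, keep_id, duplicate_id):
    """Re-point every authorship from duplicate_id to keep_id, then delete the duplicate."""
    if keep_id == duplicate_id:
        raise ValueError("cannot merge an author record into itself")
    with conn:  # one transaction: commits on success, rolls back on error
        conn.execute(
            "UPDATE authorship SET author_id = ? WHERE author_id = ?",
            (keep_id, duplicate_id),
        )
        conn.execute("DELETE FROM author WHERE id = ?", (duplicate_id,))


# Hypothetical schema and sample rows, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE author (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE authorship (
        id INTEGER PRIMARY KEY,
        author_id INTEGER NOT NULL REFERENCES author(id),
        abstract_title TEXT NOT NULL
    );
    INSERT INTO author VALUES (1, 'J. Smith'), (2, 'Jane Smith');
    INSERT INTO authorship VALUES (1, 1, 'Paper A'), (2, 2, 'Paper B');
""")

merge_authors(conn, keep_id=2, duplicate_id=1)
```

Because the foreign key is enforced, the duplicate can only be deleted after every row that pointed at it has been moved, which is the kind of guarantee a set of disconnected spreadsheets cannot offer. In Django the same two steps would typically be a queryset update followed by a delete inside a transaction.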
As with any research software project, though, the concept and scope evolved over time. In retrospect, I think we did a pretty good job of heading off the most punishing levels of scope creep. A lot of this had to do with hammering out the boundaries of source material for the project. Because these DH conference programs provide a unique view (albeit just one kind of view) of the people that make up the history of DH, it was tempting to turn the project into an index of digital humanities practitioners. The scope of evidence suddenly changes with this move though - now programs aren’t enough, are we also to enter info from CVs? Courses? All kinds of other publications? Surely not.
Restricting the scope of the project to evidence that could be found within conference programs both clarified the research project and had a significant impact on the data model we developed. Rather than needing to create complex timelines of job changes for a people-centric database, we could focus our efforts on a document-centric database where any attribute about a person had to come from a document, allowing us to pin any attribute to a certain point in time. I describe this in more depth in the project colophon.
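As a rough illustration of the document-centric idea, the sketch below stores a person's name and affiliation only as they were printed on a particular abstract, so every attribute inherits a date from the conference it appeared at. The class names, fields, and sample records are assumptions made up for this example. The real data model is described in the colophon.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Abstract:
    title: str
    conference_year: int


@dataclass(frozen=True)
class Authorship:
    """One author's appearance on one abstract, recorded as printed."""
    author_id: int
    abstract: Abstract
    name_as_printed: str
    affiliation_as_printed: str


def affiliations_over_time(authorships, author_id):
    """Return sorted, de-duplicated (year, affiliation) pairs for one author."""
    pairs = {
        (a.abstract.conference_year, a.affiliation_as_printed)
        for a in authorships
        if a.author_id == author_id
    }
    return sorted(pairs)


# Hypothetical records, for illustration only.
paper_a = Abstract("Paper A", 2004)
paper_b = Abstract("Paper B", 2012)
records = [
    Authorship(1, paper_a, "J. Smith", "Example University"),
    Authorship(1, paper_b, "Jane Smith", "Sample College"),
    Authorship(2, paper_b, "A. Jones", "Example University"),
]

timeline = affiliations_over_time(records, author_id=1)
```

Nothing here records when a job change happened. The timeline is simply what the documents attest, year by year, which is all that the narrowed scope requires.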
While I am mostly satisfied that this data model was the right choice for our scope, and will be accommodating to potential changes in the future, there are other parts of the application infrastructure that I’d do over if I had the time to fully rebuild the project code from the ground up, now with two full years of Django experience under my belt:
- I would partition the application into multiple modules! My views.py are embarrassingly long single files that are a major pain to navigate. Very amateurish! I learned my lesson the hard way, and was much more proactive about splitting out the codebase into smaller logical files for a more recent Django-backed project. I would also want to be much more consistent about using class-based views now that I have a much better understanding of how they work and how to customize them.
- I would implement a more full-featured search backend using Elasticsearch. Searching on the site right now is handled through a mix of regular Django filters and full-text search powered by the PostgreSQL database that sits behind the webapp. I started off just using the built-in FTS provided by Postgres and Django’s extension for it because at first it just looked like we needed simple search on the full text of articles when it was available, and that came baked into Postgres. As I started to build out more interfaces for our editors to do data entry and cleanup though, we started needing better search for names and institutions, and more complex faceting capabilities. By the time our feature list edged over that line where I’d say it was worth adding a whole new layer to our tech stack, we were already facing a tight deadline to get the project public. So what we have up now works fine, but only because I wrote a lot of rather janky workarounds into the Django code. Elasticsearch could likely support most of those needs and accommodate new ones as they came up.
- I would construct a better pipeline for bulk-loading data from XML or CSV and then cleaning it within the app. I scripted a few functions for managing TEI XML produced from abstracts for the more recent ADHO conferences, and a few times we had the opportunity to bulk load data from a CSV exported from conference management software like ConfTool. But there are so many edge cases (especially when it comes to formatting institutional affiliations!) that follow-up cleanup was always necessary, even for “well structured” TEI. I had begun to work out a module for flagging records for review after a bulk import, but we put that aside in favor of finishing manual entry of the full bibliography of events. If it ever comes to pass, version 2.0 of this application would absolutely need to include a better cleaning interface for bulk-loaded data from other sources.
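The flag-for-review idea in that last item could start out as simply as the sketch below. It reads a CSV export, trims each field, and attaches a list of reasons a human editor should look at the row. The column names, the two review rules, and the sample data are all hypothetical. A real ConfTool export will have different headers and many more edge cases.

```python
import csv
import io

# Hypothetical column names; a real export would need its own mapping.
REQUIRED = ("title", "author", "affiliation")


def load_with_flags(csv_text):
    """Parse CSV rows and attach a list of review reasons to each one."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        # Missing columns come back as None, so normalize to empty strings.
        cleaned = {key: (row.get(key) or "").strip() for key in REQUIRED}
        reasons = [f"missing {key}" for key in REQUIRED if not cleaned[key]]
        # Crude heuristic: separators often hide several institutions in one cell.
        if ";" in cleaned["affiliation"] or " / " in cleaned["affiliation"]:
            reasons.append("possible multiple affiliations")
        records.append({**cleaned, "review_reasons": reasons})
    return records


# Made-up sample export, for illustration only.
sample = """title,author,affiliation
Paper A,Jane Smith,Example University
Paper B,A. Jones,Example University; Sample College
Paper C,B. Lee,
"""

records = load_with_flags(sample)
needs_review = [r["title"] for r in records if r["review_reasons"]]
```

Every row is still imported. The flags only decide which ones land in an editor's review queue, so a bulk load never silently drops or "fixes" data that a person should check.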