debate, learn, share text encoding and modelling

2015-10-15

paregorios
01:44:18 PM

I guess this is the place


raffazizzi
01:47:50 PM

yup. If we’re not careful it may soon turn into venting about overlapping hierarchies


paregorios
01:48:10 PM

eek


mdlincoln
01:48:34 PM

This popped up in my feed today: sentiment analysis being no better than a coin flip: https://relativeinsight.com/zurich-university-sentiment-analysis-is-a-statistical-flip-of-a-coin/ (@edsu this might be useful to show ppl in response to questions about Twitter analytics)


raffazizzi
01:51:03 PM

@mdlincoln: I’d post that again in #announcements :simple_smile:


mdlincoln
01:53:17 PM

blerg - guess who the slack newbie is :sweat:


2015-10-16


paregorios
01:14:26 PM

now that TEI is on github, is there a way for us to autotrack/announce releases here in slack?


raffazizzi
02:07:55 PM

That’s a good idea! Let me know how you get along - happy to jump in if needed


paregorios
02:09:38 PM

they assume you’re a member of the dev team on a given project at github


paregorios
02:14:14 PM

@raffazizzi: I’ve subscribed the channel to the TEI Guidelines releases Atom feed at https://github.com/TEIC/TEI/releases.atom
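For anyone curious what consuming a releases feed like this involves, here is a minimal sketch using only the Python standard library. The sample feed is invented to have the general shape of an Atom document, not copied from the TEI repository, and the function names `list_releases` and `new_releases` are hypothetical. How Slack polls a feed internally is not covered here.

```python
import xml.etree.ElementTree as ET

ATOM_NS = {"atom": "http://www.w3.org/2005/Atom"}

# Invented sample data in the shape of an Atom feed (not real release data).
SAMPLE_FEED = """<feed xmlns="http://www.w3.org/2005/Atom">
  <title>Release notes (sample)</title>
  <entry>
    <id>tag:example.org,2015:release/2</id>
    <updated>2015-10-15T12:00:00Z</updated>
    <title>Example release 2</title>
    <link rel="alternate" type="text/html" href="https://example.org/releases/2"/>
  </entry>
  <entry>
    <id>tag:example.org,2015:release/1</id>
    <updated>2015-04-01T12:00:00Z</updated>
    <title>Example release 1</title>
    <link rel="alternate" type="text/html" href="https://example.org/releases/1"/>
  </entry>
</feed>
"""


def list_releases(feed_xml):
    """Return one dict per <entry> in an Atom feed, in document order."""
    root = ET.fromstring(feed_xml)
    releases = []
    for entry in root.findall("atom:entry", ATOM_NS):
        link = entry.find("atom:link", ATOM_NS)
        releases.append({
            "id": entry.findtext("atom:id", default="", namespaces=ATOM_NS),
            "title": entry.findtext("atom:title", default="", namespaces=ATOM_NS),
            "updated": entry.findtext("atom:updated", default="", namespaces=ATOM_NS),
            "url": link.get("href") if link is not None else None,
        })
    return releases


def new_releases(feed_xml, seen_ids):
    """Keep only the entries whose id has not been announced yet."""
    return [r for r in list_releases(feed_xml) if r["id"] not in seen_ids]
```

Tracking entry ids rather than timestamps keeps the "only announce releases" behaviour simple: each poll announces whatever `new_releases` returns and then adds those ids to the seen set.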


raffazizzi
02:14:23 PM

well I am :stuck_out_tongue: so maybe I can set it up


raffazizzi
02:14:52 PM

oh, maybe that’s actually better - otherwise we’d be spammed with every push and comments on issues


paregorios
02:14:52 PM

I suppose it depends a bit on how much granularity we want


paregorios
02:14:55 PM

yeah


paregorios
02:14:58 PM

zactly


raffazizzi
02:15:02 PM

:+1:


paregorios
02:15:11 PM

we can see how it goes


2015-10-17

2015-10-18

2015-10-19

2015-10-20

timfinnegan
03:45:56 AM

james joyce’s works just came out of copyright in 2011, and there seems to be some residual timidity/ inertia that’s left us with a dismal selection of etexts that mostly use CAPS for italics… which I’m looking to fix. I expect we’ll be generating dozens of new offerings in the near future, but I’m not sure what archives are serious about supporting multiple editions of classic texts, in multiple formats, with maybe a little crowdsourced proofreading if that proves necessary. (I believe http://archive.org balks at the multiple-editions principle?)


2015-10-21

timfinnegan
03:54:55 AM

is there a consensus on when to italicize the punctuation around italic passages– eg parens and quotes and exclamation marks?


paregorios
08:07:30 AM

@timfinnegan: iirc Chicago Manual of Style et al address those sorts of issues; I think the answer generally is that adjacent punctuation is attracted to the text style of neighboring characters, but none of the neurons to which I presently have access remembers details.


timfinnegan
10:20:25 AM

hmm, that leads to the question: does the CMoS still apply to web css? eg if web rendering has different issues to address?


paregorios
02:22:19 PM

I’m not sure I really understand the issue. Italics are a convention of formatting of text for visual display. CSS is a mechanism for effecting formatting of visual display online. Neither constitutes semantic markup. If the issue is what to italicize in a web page, one decides what one wants to see. If it’s about setting up for post-processing on the basis of what the italics are perceived to mean, one either needs to intervene earlier (with appropriate rules) to markup the text with something unambiguous (i.e. other than italics) or one has to embed italics-inference rules in the post-processing context.
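As an illustration of what "embedding italics-inference rules in the post-processing context" can look like, here is a deliberately naive sketch for the CAPS-for-italics etexts mentioned earlier in the channel. The rule (any run of words of two or more capital letters was originally italic) is an assumption made for the example, and `mark_caps_as_italic` is a hypothetical name. The output uses the TEI `<hi rend="italic">` element without a namespace, for brevity.

```python
import re
from xml.sax.saxutils import escape

# One or more consecutive words written entirely in capitals,
# each at least two letters long (so the pronoun "I" is ignored).
CAPS_RUN = re.compile(r"\b[A-Z]{2,}(?:\s+[A-Z]{2,})*\b")


def mark_caps_as_italic(text):
    """Wrap every all-caps run in <hi rend="italic">, after XML-escaping the text."""
    escaped = escape(text)
    return CAPS_RUN.sub(
        lambda match: '<hi rend="italic">' + match.group(0) + "</hi>",
        escaped,
    )
```

The rule cannot tell an italicized title from an acronym such as BBQ, and it leaves the letters in capitals because the original casing is not recoverable from the etext alone. That ambiguity is the reason to mark the text up with something unambiguous as early as possible.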


timfinnegan
02:25:36 PM

oh, i just wonder if cmos recommendations sometimes make css look bad, so web designers take exception


paregorios
02:27:30 PM

ah, dunno


paregorios
05:27:10 PM

xml catalogs as used in OxygenXML (the only place I’ve ever used them) seem backwards to me: https://www.oxygenxml.com/doc/versions/17.0/ug-editor/index.html#topics/using-XML-Catalogs.html


2015-10-23

2015-10-25

2015-10-26

raffazizzi
10:47:23 AM

Hi people - the Text Encoding Initiative Members Meeting conference has started today with pre-conference workshops


raffazizzi
10:47:36 AM

Follow it on twitter at #TEI2015


raffazizzi
10:48:15 AM

Anyone here at the conference? We could use this slack channel as a backchannel during the conference


paregorios
03:45:30 PM

@wsalesky!


wsalesky
04:21:23 PM

@paregorios! :simple_smile:


2015-10-27

benwbrum
05:34:43 AM

Last week was the first MEDEA workshop on encoding account books. I did a rotten job of tweeting, but would be happy to chat with anyone else working with account books about the issues that were raised there.


paregorios
08:32:46 AM

@benwbrum: I got to meet and talk with Kathryn Tomasek in DC a few weeks ago. Cool project!


benwbrum
09:56:49 AM

It is really neat, @paregorios. There was a broad spectrum of perspectives at the workshop, including linguists who hated the normalization the economic historians were doing (since it made the data unusable for historic linguistic and orthographic change) and of course economic historians who hated the verbatim-et-literatim encoding the linguists were doing in return (since the lack of normalization kills any quantitative economic analysis). That contrast was really valuable for those of us in the middle, who’d been foolishly toying with utopian encoding schemes.


paregorios
09:57:28 AM

sounds like a true humanities project!


2015-10-28

martindholmes
02:55:18 AM

Hi all.


2015-10-29

paregorios
08:28:16 AM

@benwbrum: for fun some ancient accounts (just the ones for which the http://papyri.info site has texts and images): http://tinyurl.com/np32pp2 marked up in TEIXML


benwbrum
08:39:58 AM

Thanks! I felt like there were two major gaps in representation at MEDEA – nobody was working with texts that pre-dated the 13th century, so we missed anything from the ancient world as well as any cuneiform accounts.


paregorios
08:41:23 AM

yeah, I’m not sure what the CDLI has in the way of the latter



benwbrum
08:43:20 AM

Interestingly, only Kathryn Tomasek was working with a non-obsolete currency. Even those of us dealing with the 19th-c US were working with “Virginia money” and such.


paregorios
08:43:21 AM

that’s an interesting record for a number of technical reasons …


paregorios
08:44:05 AM

if the aggregation there is working right it means that separate “papyri” held at separate institutions (Columbia and Harvard) are thought to be part of the same document.


benwbrum
08:46:01 AM

That is interesting. Has http://papyri.info done any work with IIIF? It’s designed to support such cases.


paregorios
08:46:24 AM

and that is indeed what one of the HGV records says: “Gehört zu PSI VI 625 und P.Cair. Zen. IV 59799”


paregorios
08:46:42 AM

that would be a question for @hcayless or @ryanfb


benwbrum
08:47:40 AM

OK.


paregorios
08:48:10 AM

this happens as I understand it not uncommonly with the papyri: artifact of the 19th and (especially) 20th century antiquities trade in and out of Egypt


paregorios
08:48:29 AM

documents were subdivided and sold separately in order to increase unit returns


benwbrum
08:50:41 AM

Apparently it’s common enough in medieval manuscripts to serve as a major motivation for IIIF.


paregorios
08:51:26 AM

actually there’s 3 fragments: one in the Egyptian museum in Cairo, one at Columbia, and one in Florence (per http://www.trismegistos.org/text/1424)


benwbrum
08:51:28 AM

They allow you to bring in several images from different sources to create a “page” on a canvas, or to pull different pages from different repositories into the same text.


paregorios
08:51:48 AM

that’s cool; it looks to me like we only have an image here of the Columbia piece


paregorios
08:52:51 AM

one would have to run down the CE 1978 citations to see how they’re thought to fit together


paregorios
08:54:24 AM

I suspect you’ll find that the XML encoding of all of these doesn’t privilege the “account” document type very much. Numbers should be marked up as such, but otherwise, not.



paregorios
08:56:38 AM

currencies aren’t glossed/called out, for example


hcayless
09:25:26 AM

I got a message that my name was being spoken. We don’t do IIIF yet, but plan to whenever we have the spare time


paregorios
09:28:53 AM

@hcayless: thanks


ryanfb
09:29:06 AM

Though we’re likely to do it first for images we host. External images is a whole other bag of worms


hcayless
09:30:03 AM

and by worms, he means dragons


hcayless
09:32:06 AM

they bring out the barbecue sauce, you better run


paregorios
09:41:05 AM

mmmmmm barbecue


paregorios
09:42:21 AM

which brings up an interesting encoding question …. how would one markup the pseudo-acronym BBQ in TEI?


benwbrum
09:43:12 AM

I’ve been adding IIIF support to FromThePage, and have struggled a bit with how to handle self-hosted pages vs. pages hosted elsewhere.


benwbrum
09:44:05 AM

One option is to generate manifests for both, with image services (I.e. URLs) that point to the FromThePage server, then use a “shim” like approach to proxy image calls to the actual image hosts.


benwbrum
09:44:24 AM

That seems like the only option for hosts that don’t actually support the IIIF Image API.


benwbrum
09:45:11 AM

However, the hosts I’m integrating with have either recently added support for IIIF (the Internet Archive) or have plug-ins or shims available to support it (Omeka).


benwbrum
09:46:09 AM

In that case it seems like the thing to do is to ingest an IIIF manifest from the host, then produce a modified manifest based on it which directs images to the original host, but provides transcripts via annotations hosted by my server.


edsu
09:46:17 AM

i thought if the CORS headers were set up correctly the images could be anywhere?


benwbrum
09:46:24 AM

They can, yes.


edsu
09:46:36 AM

ok, i guess i don’t understand the problem then :simple_smile:


benwbrum
09:47:02 AM

The problem isn’t ‘is this possible?’, it’s “what’s the best way to do this?”


benwbrum
09:47:41 AM

Generating manifests for someone else’s images is straightforward when adding additional content (transcripts & translations, in my case).


edsu
09:47:48 AM

i would try to avoid shims, personally


edsu
09:48:08 AM

i’m not sure if that helps :simple_smile:


benwbrum
09:48:36 AM

But I’d like to avoid losing information contained in the original manifests when I generate derivative manifests – like repository-specific metadata or non-transcript annotations.
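The derivative-manifest idea can be sketched roughly as follows, assuming manifests shaped like the IIIF Presentation API 2.x (sequences of canvases, with extra annotation lists referenced from each canvas via `otherContent`). The sample manifest, the example.org URLs, and the names `derive_manifest` and `transcript_list_url` are all made up for illustration; this is not FromThePage's actual code.

```python
import copy
from urllib.parse import quote

# Invented sample manifest; a real one would be fetched from the image host.
SOURCE_MANIFEST = {
    "@context": "http://iiif.io/api/presentation/2/context.json",
    "@id": "https://images.example.org/iiif/book1/manifest",
    "@type": "sc:Manifest",
    "label": "Sample account book",
    "metadata": [{"label": "Repository", "value": "Example Library"}],
    "sequences": [{
        "@type": "sc:Sequence",
        "canvases": [{
            "@id": "https://images.example.org/iiif/book1/canvas/p1",
            "@type": "sc:Canvas",
            "label": "p. 1",
            "height": 2000,
            "width": 1500,
            "images": [{
                "@type": "oa:Annotation",
                "motivation": "sc:painting",
                "resource": {
                    "@id": "https://images.example.org/iiif/book1-p1/full/full/0/default.jpg",
                    "@type": "dctypes:Image",
                },
                "on": "https://images.example.org/iiif/book1/canvas/p1",
            }],
        }],
    }],
}


def transcript_list_url(canvas_id):
    """Hypothetical URL scheme for transcript annotation lists on our own server."""
    return "https://transcripts.example.org/iiif/list?canvas=" + quote(canvas_id, safe="")


def derive_manifest(source, new_manifest_id, list_url_for=transcript_list_url):
    """Copy a manifest wholesale, then point each canvas at our transcript list.

    Images keep pointing at the original host, and repository metadata and any
    existing annotation lists survive because nothing is removed from the copy.
    """
    derived = copy.deepcopy(source)
    derived["@id"] = new_manifest_id
    for sequence in derived.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            canvas.setdefault("otherContent", []).append({
                "@id": list_url_for(canvas["@id"]),
                "@type": "sc:AnnotationList",
            })
    return derived
```

Starting from a deep copy and only appending is the conservative choice: the derived manifest carries everything the source had, at the cost of republishing data the application may never use.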


benwbrum
09:49:09 AM

That may be unnecessary in the linked data world – I’m certainly new to LODLAM, and don’t really have my head around it.


edsu
09:49:33 AM

sounds like you have your head around it pretty well to me


benwbrum
09:50:13 AM

So if http://papyri.info starts presenting transcripts along with IIIF, I’ll be watching closely.


edsu
09:50:30 AM

i would probably only aggregate what information you actually need for FromThePage


edsu
09:51:04 AM

and not worry too much about other data, that in theory would be good to have, but you have no use for at the moment


edsu
09:51:43 AM

let your application drive your decisions, rather than what seems like the right thing to do


edsu
09:52:11 AM

here ends my Pontification for the Day


benwbrum
09:52:35 AM

Thanks, @edsu. You sounded positively guru-like.


edsu
09:53:05 AM

i’m a total sham


edsu
09:53:12 AM

i also am LATE TO A MEETING ARGH!


edsu
09:53:15 AM

seeya


paregorios
12:06:44 PM

Anybody here have experience with handling <saxon:collation> in XSL in OxygenXML with Saxon? I have previously working code that now fails after an OxygenXML upgrade.


paregorios
12:33:47 PM

nm. with moral support on twitter from @wsalesky I read the fine manual and got with the program: http://www.saxonica.com/documentation/index.html#!extensibility/config-extend/collation/implementing-collation


2015-11-03

2015-11-04

2015-11-05

suttonkoeser
05:20:00 PM

Anybody working with TEI + annotation, TEI facsimile, or markdown to TEI?


2015-11-06

andersoncliffb
09:25:45 AM

@suttonkoeser: I’m not working with markdown to TEI, but I’m certainly very interested in the topic since I write frequently in both.


hcayless
09:36:00 AM

@suttonkoeser: there’s a markdown to TEI conversion in the TEI Stylesheets. Haven’t used it though. Not at this moment working on TEI annotation, but have and will again soonish.


suttonkoeser
09:42:33 AM

I presume this stylesheet is the one you mean, I found it in my initial searches: https://github.com/TEIC/Stylesheets/blob/master/markdown/markdown-to-tei.xsl
But since it’s based on regular expressions and has no comments, I didn’t think it would be very easy to modify or work with


suttonkoeser
09:45:07 AM

I’m working in python, so currently I’m using the mistune markdown parser (https://github.com/lepture/mistune) and creating a custom tei renderer.


hcayless
09:46:27 AM

I’ve absolutely no idea how it works, but find myself sharing responsibility for it :simple_smile:


hcayless
09:47:26 AM

I vaguely assume the sanest approach would be to convert to HTML and then transform to TEI


hcayless
09:47:52 AM

that way you’d stand a chance of flattening some of the variance in markdown flavors
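A rough sketch of the second half of that markdown -> HTML -> TEI route, assuming the markdown library has already produced well-formed XHTML (so it parses as XML). The tag mapping below is one plausible choice rather than an agreed standard, the function names are hypothetical, and the TEI namespace is left off for brevity.

```python
import xml.etree.ElementTree as ET

# HTML element name -> (TEI element name, fixed attributes). Illustrative only.
TAG_MAP = {
    "p": ("p", {}),
    "em": ("hi", {"rend": "italic"}),
    "i": ("hi", {"rend": "italic"}),
    "strong": ("hi", {"rend": "bold"}),
    "b": ("hi", {"rend": "bold"}),
    "blockquote": ("quote", {}),
    "code": ("code", {}),
    "ul": ("list", {"rend": "bulleted"}),
    "ol": ("list", {"rend": "numbered"}),
    "li": ("item", {}),
}


def html_to_tei(node):
    """Recursively rebuild an HTML element as its TEI counterpart."""
    if node.tag == "a":
        tei = ET.Element("ref", {"target": node.get("href", "")})
    elif node.tag in TAG_MAP:
        name, attrs = TAG_MAP[node.tag]
        tei = ET.Element(name, dict(attrs))
    else:
        # Fail loudly instead of silently dropping markup we have no rule for.
        raise ValueError("no TEI mapping for <%s>" % node.tag)
    tei.text = node.text
    tei.tail = node.tail
    for child in node:
        tei.append(html_to_tei(child))
    return tei


def convert_fragment(html_fragment):
    """Convert a string of sibling XHTML elements into a string of TEI elements."""
    wrapper = ET.fromstring("<wrapper>" + html_fragment + "</wrapper>")
    return "".join(
        ET.tostring(html_to_tei(child), encoding="unicode") for child in wrapper
    )
```

Because the converter only ever sees HTML, differences between markdown flavors are absorbed by whichever library renders the HTML, which is the flattening effect described above.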


hcayless
09:48:18 AM

but my ignorance here is vast and embarrassing :simple_smile:


suttonkoeser
09:50:04 AM

That’s interesting, hadn’t thought of going from markdown -> html -> tei. I suppose it would be ideal if the markdown was converted to html using the same library that users see when they enter and preview their content, but the markdown is getting entered as annotation content using annotator.js and the meltdown javascript library.


hcayless
09:51:11 AM

it depends what you’re doing really, whether it’s a generic markdown -> TEI converter or whether you’re targeting a specific flavor


hcayless
09:52:46 AM

sounds like you have a specific flavor in mind, so you might not need that


suttonkoeser
09:54:10 AM

Right. I think the markdown is pretty generic at the moment, only non-standard addition I’m aware of is footnotes (which is a fairly standard non-standard from what I can tell). But I guess we can decide what TEI output we want for our use case, and it doesn’t have to be a general solution for everyone. Although I think most of the tags I’m outputting are fairly standard


fmcc
09:55:44 AM

There seems to be some discussion of this kind of conversion with pandoc: https://github.com/jgm/pandoc/issues/2047


suttonkoeser
09:56:30 AM

@fmcc: interesting, good to know


fmcc
09:58:00 AM

doesn’t seem to have been much activity since then though


hcayless
11:28:25 AM

Just looking at the TEI Stylesheets implementation and I fear I share @suttonkoeser ’s suspicion of it (despite knowing it was built by super-smart people).


2015-11-09

raffazizzi
09:17:58 AM

@suttonkoeser: Hi! Re: TEI+annotation, I’m very interested and worked on some aspects / have some plans. What are you working on?


2015-11-10

suttonkoeser
12:48:15 PM

Hey @raffazizzi. The TEI + annotation work is in relation to readux, which is for our digitized books and an annotation/critical edition platform - http://readux.library.emory.edu/ for the site, https://github.com/emory-libraries/readux for code. I have a new release almost out the door that supports annotation with annotator.js, and we’re generating TEI facsimile from a couple of different OCR xml formats to support that.


suttonkoeser
12:49:40 PM

The next step, which I’ve started working on, is to generate a TEI export of a single volume packaged with all of a user’s annotations for that volume, so that we have it as an artifact and as an interim step to generating a more user-friendly annotated edition


suttonkoeser
12:57:16 PM

We’ve decided for our purposes that we don’t need all of the annotation data (in particular annotation created/updated don’t seem relevant in the TEI export), but I think we’ve figured out a reasonable way to map the annotation data to TEI notes and insert annotation reference markers into the TEI facsimile data.
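To make the mapping concrete, here is a minimal sketch of turning one annotation into a TEI `<note>` that points at a facsimile zone. The sample annotation, the `annotation-` id prefix, the zone id, and the choice of `resp` and `target` attributes are assumptions for illustration, not the readux implementation. As described above, the created/updated fields are simply not carried over.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Invented sample, loosely in the shape of an annotator.js annotation.
SAMPLE_ANNOTATION = {
    "id": "a1",
    "user": "jdoe",
    "text": "Compare the spelling in the 1850 edition.",
    "quote": "recieved",
    "created": "2015-11-09T10:00:00Z",
    "updated": "2015-11-09T10:05:00Z",
}


def annotation_to_note(annotation, zone_id):
    """Build a TEI <note> for one annotation, targeting a facsimile zone by id."""
    note = ET.Element("note", {
        XML_ID: "annotation-%s" % annotation["id"],
        "type": "annotation",
        "resp": "#%s" % annotation["user"],
        "target": "#%s" % zone_id,
    })
    # The annotation body becomes the note content; timestamps are dropped.
    ET.SubElement(note, "p").text = annotation["text"]
    return note
```

For example, `ET.tostring(annotation_to_note(SAMPLE_ANNOTATION, "zone-p12-l3"), encoding="unicode")` yields a single `<note>` element that can be appended wherever the export collects its annotations.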


suttonkoeser
12:58:04 PM

If anyone is interested, I could share some examples of the TEI we’re generating. We did have some detail questions (like how to reference the annotated content), but I think we’ve come up with something workable.


literature_geek
03:18:00 PM

@suttonkoeser: I’m interested in examples of the TEI!


suttonkoeser
05:09:41 PM

@literature_geek: cool, I’ll share some soon - I think it will be a little easier to discuss when you can look at the annotation features in readux and see the tei facsimile that it’s using


hcayless
07:06:48 PM

@suttonkoeser: I’d be interested to see it too!


2015-11-11

raffazizzi
10:38:35 AM

@suttonkoeser: sounds great! I’d also be interested in seeing some examples :simple_smile:


2015-11-14

timfinnegan
11:10:56 AM

visualisation-god edward tufte’s views (once removed?) on css: https://edwardtufte.github.io/tufte-css/