textencoding
debate, learn, share text encoding and modelling
2015-10-15
I guess this is the place
yup. If we’re not careful it may soon turn into venting about overlapping hierarchies
eek
This popped up in my feed today: sentiment analysis being no better than a coin flip: https://relativeinsight.com/zurich-university-sentiment-analysis-is-a-statistical-flip-of-a-coin/ (@edsu this might be useful to show ppl in response to questions about Twitter analytics)
blerg - guess who the slack newbie is :sweat:
2015-10-16
now that TEI is on github, is there a way for us to autotrack/announce releases here in slack?
That’s a good idea! Let me know how you get along - happy to jump in if needed
they assume you’re a member of the dev team on a given project at github
@raffazizzi: I’ve subscribed the channel to the TEI Guidelines releases Atom feed at https://github.com/TEIC/TEI/releases.atom
well I am :stuck_out_tongue: so maybe I can set it up
oh, maybe that’s actually better - otherwise we’d be spammed with every push and comments on issues
I suppose it depends a bit on how much granularity we want
yeah
zactly
:+1:
we can see how it goes
2015-10-17
2015-10-18
2015-10-19
2015-10-20
james joyce’s works just came out of copyright in 2011, and there seems to be some residual timidity/inertia that’s left us with a dismal selection of etexts that mostly use CAPS for italics… which I’m looking to fix. I expect we’ll be generating dozens of new offerings in the near future, but I’m not sure what archives are serious about supporting multiple editions of classic texts, in multiple formats, with maybe a little crowdsourced proofreading if that proves necessary. (I believe http://archive.org balks at the multiple-editions principle?)
2015-10-21
is there a consensus on when to italicize the punctuation around italic passages, e.g. parens and quotes and exclamation marks?
@timfinnegan: iirc Chicago Manual of Style et al address those sorts of issues; I think the answer generally is that adjacent punctuation is attracted to the text style of neighboring characters, but none of the neurons to which I presently have access remembers details.
hmm, that leads to the question: does the CMoS still apply to web css? eg if web rendering has different issues to address?
I’m not sure I really understand the issue. Italics are a convention of formatting of text for visual display. CSS is a mechanism for effecting formatting of visual display online. Neither constitutes semantic markup. If the issue is what to italicize in a web page, one decides what one wants to see. If it’s about setting up for post-processing on the basis of what the italics are perceived to mean, one either needs to intervene earlier (with appropriate rules) to markup the text with something unambiguous (i.e. other than italics) or one has to embed italics-inference rules in the post-processing context.
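The distinction being drawn here is exactly what TEI tries to capture. A hedged sketch of how the same italicized string might be encoded once you decide what the italics meant (element choices are illustrative, not a prescription):

```xml
<!-- presentational: all we know is that it was italic -->
<hi rend="italic">Ulysses</hi>

<!-- semantic: we have decided what the italics meant -->
<title>Ulysses</title>
<emph>never</emph>
<foreign xml:lang="fr">mot juste</foreign>
```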
oh, i just wonder if cmos recommendations sometimes make css look bad, so web designers take exception
ah, dunno
xml catalogs as used in OxygenXML (the only place I’ve ever used them) seem backwards to me: https://www.oxygenxml.com/doc/versions/17.0/ug-editor/index.html#topics/using-XML-Catalogs.html
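for anyone else puzzling over them, a minimal OASIS XML catalog of the kind Oxygen consumes looks something like this (the local path is hypothetical):

```xml
<?xml version="1.0"?>
<catalog xmlns="urn:oasis:names:tc:entity:xmlns:xml:catalog">
  <!-- map a remote schema URI to a local copy -->
  <uri name="http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng"
       uri="local/schemas/tei_all.rng"/>
</catalog>
```

the "backwards" feeling is real: the catalog maps the URI the document asks for to the file you actually have, not the other way round.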
2015-10-23
2015-10-25
2015-10-26
Hi people - the Text Encoding Initiative Members Meeting conference has started today with pre-conference workshops
Follow it on twitter at #TEI2015
Anyone here at the conference? We could use this slack channel as a backchannel during the conference
@wsalesky!
@paregorios! :simple_smile:
2015-10-27
Last week was the first MEDEA workshop on encoding account books. I did a rotten job of tweeting, but would be happy to chat with anyone else working with account books about the issues that were raised there.
@benwbrum: I got to meet and talk with Kathryn Tomasek in DC a few weeks ago. Cool project!
It is really neat, @paregorios. There was a broad spectrum of perspectives at the workshop, including linguists who hated the normalization the economic historians were doing (since it made the data unusable for historic linguistic and orthographic change) and of course economic historians who hated the verbatim-et-literatim encoding the linguists were doing in return (since the lack of normalization kills any quantitative economic analysis). That contrast was really valuable for those of us in the middle, who’d been foolishly toying with utopian encoding schemes.
sounds like a true humanities project!
2015-10-28
Hi all.
2015-10-29
@benwbrum: for fun some ancient accounts (just the ones for which the http://papyri.info site has texts and images): http://tinyurl.com/np32pp2 marked up in TEI XML
Thanks! I felt like there were two major gaps in representation at MEDEA – nobody was working with texts that pre-dated the 13th century, so we missed anything from the ancient world as well as any cuneiform accounts.
yeah, I’m not sure what the CDLI has in the way of the latter
Hey – thanks to the link you sent, I found an account with currency as well as goods! http://papyri.info/ddbdp/p.cair.zen;4;59799/?q=PHRASE:%28abrechnung+OR+account%29&rows=3&start=19&fl=id%2Ctitle&fq=has_transcription%3Atrue&fq=%28images-int%3Atrue%29&fq=metadata%3A%28abrechnung+OR+account%29&sort=series+asc%2Cvolume+asc%2Citem+asc&p=20&t=113
Interestingly, only Kathryn Tomasek was working with a non-obsolete currency. Even those of us dealing with the 19th-c US were working with “Virginia money” and such.
that’s an interesting record for a number of technical reasons …
if the aggregation there is working right it means that separate “papyri” held at separate institutions (Columbia and Harvard) are thought to be part of the same document.
That is interesting. Has http://papyri.info done any work with IIIF? It’s designed to support such cases.
and that is indeed what one of the HGV records says: “Gehört zu PSI VI 625 und P.Cair. Zen. IV 59799”
that would be a question for @hcayless or @ryanfb
OK.
this happens as I understand it not uncommonly with the papyri: artifact of the 19th and (especially) 20th century antiquities trade in and out of Egypt
documents were subdivided and sold separately in order to increase unit returns
Apparently it’s common enough in medieval manuscripts to serve as a major motivation for IIIF.
actually there’s 3 fragments: one in the Egyptian museum in Cairo, one at Columbia, and one in Florence (per http://www.trismegistos.org/text/1424)
They allow you to bring in several images from different sources to create a “page” on a canvas, or to pull different pages from different repositories into the same text.
that’s cool; it looks to me like we only have an image here of the Columbia piece
one would have to run down the CE 1978 citations to see how they’re thought to fit together
I suspect you’ll find that the XML encoding of all of these doesn’t privilege the “account” document type very much. Numbers should be marked up as such, but otherwise, not.
currencies aren’t glossed/called out, for example
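for comparison, a hedged sketch of how a currency amount could be called out if a project wanted to (attribute values here are illustrative, not what papyri.info actually does):

```xml
<measure type="currency" unit="drachma" quantity="20">(δραχμαὶ) κ</measure>
```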
I got a message that my name was being spoken. We don’t do IIIF yet, but plan to whenever we have the spare time
@hcayless: thanks
Though we’re likely to do it first for images we host. External images is a whole other bag of worms
and by worms, he means dragons
they bring out the barbecue sauce, you better run
mmmmmm barbecue
which brings up an interesting encoding question… how would one mark up the pseudo-acronym BBQ in TEI?
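a stab at it, using the standard abbreviation machinery (whether BBQ deserves this treatment at all is left to the reader):

```xml
<choice>
  <abbr>BBQ</abbr>
  <expan>barbecue</expan>
</choice>
```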
I’ve been adding IIIF support to FromThePage, and have struggled a bit with how to handle self-hosted pages vs. pages hosted elsewhere.
One option is to generate manifests for both, with image services (i.e. URLs) that point to the FromThePage server, then use a “shim”-like approach to proxy image calls to the actual image hosts.
That seems like the only option for hosts that don’t actually support the IIIF Image API.
However, the hosts I’m integrating with have either recently added support for IIIF (the Internet Archive) or have plug-ins or shims available to support it (Omeka).
In that case it seems like the thing to do is to ingest an IIIF manifest from the host, then produce a modified manifest based on it which directs images to the original host, but provides transcripts via annotations hosted by my server.
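The "derivative manifest" idea can be sketched in a few lines of Python. This is a minimal illustration, not FromThePage's actual code: the manifest is a hand-made IIIF Presentation 2.x-style dict, and TRANSCRIPT_BASE and all URLs are invented.

```python
# Sketch: take a host's IIIF manifest, leave its image resources alone,
# and attach a transcript annotation list to each canvas via otherContent.
import copy
import json

TRANSCRIPT_BASE = "https://fromthepage.example.org/annotations"  # assumed

def derive_manifest(manifest):
    """Return a deep copy of `manifest` whose canvases reference our
    transcript annotation lists, without touching the original."""
    derived = copy.deepcopy(manifest)
    for sequence in derived.get("sequences", []):
        for canvas in sequence.get("canvases", []):
            canvas.setdefault("otherContent", []).append({
                "@id": "%s/%s" % (TRANSCRIPT_BASE,
                                  canvas["@id"].rsplit("/", 1)[-1]),
                "@type": "sc:AnnotationList",
            })
    return derived

source = {
    "@id": "https://images.example.edu/manifest.json",
    "sequences": [
        {"canvases": [{"@id": "https://images.example.edu/canvas/p1"}]}
    ],
}
print(json.dumps(derive_manifest(source), indent=2))
```

the deep copy matters: the source manifest stays byte-identical to what the host published, and only the derivative carries the transcript pointers.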
i thought if the CORS headers were set up correctly the images could be anywhere?
They can, yes.
ok, i guess i don’t understand the problem then :simple_smile:
The problem isn’t ‘is this possible?’, it’s “what’s the best way to do this?”
Generating manifests for someone else’s images is straightforward when adding additional content (transcripts & translations, in my case).
i would try to avoid shims, personally
i’m not sure if that helps :simple_smile:
But I’d like to avoid losing information contained in the original manifests when I generate derivative manifests – like repository-specific metadata or non-transcript annotations.
That may be unnecessary in the linked data world – I’m certainly new to LODLAM, and don’t really have my head around it.
sounds like you have your head around it pretty well to me
So if http://papyri.info starts presenting transcripts along with IIIF, I’ll be watching closely.
i would probably only aggregate what information you actually need for FromThePage
and not worry too much about other data, that in theory would be good to have, but you have no use for at the moment
let your application drive your decisions, rather than what seems like the right thing to do
here ends my Pontification for the Day
Thanks, @edsu. You sounded positively guru-like.
i’m a total sham
i also am LATE TO A MEETING ARGH!
seeya
Anybody here have experience with handling <saxon:collation> in XSL in OxygenXML with Saxon? I have previously working code that now fails after an OxygenXML upgrade.
nm. with moral support on twitter from @wsalesky I read the fine manual and got with the program: http://www.saxonica.com/documentation/index.html#!extensibility/config-extend/collation/implementing-collation
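for anyone who hits the same wall: if I’m reading that page right, the current Saxon approach is to reference a collation URI directly rather than declare it with saxon:collation, something like the following (com.example.MyCollation is a placeholder for your own Java comparator):

```xml
<xsl:sort select="."
          collation="http://saxon.sf.net/collation?class=com.example.MyCollation"/>
```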
2015-11-03
2015-11-04
2015-11-05
Anybody working with TEI + annotation, TEI facsimile, or markdown to TEI?
2015-11-06
@suttonkoeser: I’m not working with markdown to TEI, but I’m certainly very interested in the topic since I write frequently in both.
@suttonkoeser: there’s a markdown to TEI conversion in the TEI Stylesheets. Haven’t used it though. Not at this moment working on TEI annotation, but have and will again soonish.
I presume this stylesheet is the one you mean, I found it in my initial searches: https://github.com/TEIC/Stylesheets/blob/master/markdown/markdown-to-tei.xsl
But since it’s based on regular expressions and has no comments, I didn’t think it would be very easy to modify or work with
I’m working in python, so currently I’m using the mistune markdown parser (https://github.com/lepture/mistune) and creating a custom tei renderer.
I’ve absolutely no idea how it works, but find myself sharing responsibility for it :simple_smile:
I vaguely assume the sanest approach would be to convert to HTML and then transform to TEI
that way you’d stand a chance of flattening some of the variance in markdown flavors
but my ignorance here is vast and embarrassing :simple_smile:
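for the curious, the HTML-to-TEI leg of that route can be sketched with nothing but the standard library; the tag mapping here is an assumption for illustration, not anyone’s actual conventions:

```python
# Sketch: convert already-rendered HTML (from any markdown library)
# into TEI-ish markup by remapping a handful of elements.
from html.parser import HTMLParser

# Assumed HTML -> TEI mapping of (open, close) strings; a real project
# would refine this against its own schema.
TAG_MAP = {
    "p": ("<p>", "</p>"),
    "em": ('<hi rend="italic">', "</hi>"),
    "strong": ('<hi rend="bold">', "</hi>"),
    "blockquote": ("<quote>", "</quote>"),
    "ul": ('<list rend="bulleted">', "</list>"),
    "li": ("<item>", "</item>"),
}

class HtmlToTei(HTMLParser):
    """Collect a TEI-ish transcription of simple HTML input."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_MAP:
            self.out.append(TAG_MAP[tag][0])

    def handle_endtag(self, tag):
        if tag in TAG_MAP:
            self.out.append(TAG_MAP[tag][1])

    def handle_data(self, data):
        self.out.append(data)

def html_to_tei(html):
    parser = HtmlToTei()
    parser.feed(html)
    return "".join(parser.out)

print(html_to_tei("<p>He read <em>Ulysses</em> twice.</p>"))
# -> <p>He read <hi rend="italic">Ulysses</hi> twice.</p>
```

going via HTML like this does flatten a lot of the markdown-flavor variance, at the cost of losing anything the HTML renderer itself discarded.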
That’s interesting, hadn’t thought of going from markdown -> html -> tei. I suppose it would be ideal if the markdown was converted to html using the same library that users see when they enter and preview their content, but the markdown is getting entered as annotation content using annotator.js and the meltdown javascript library.
it depends what you’re doing really, whether it’s a generic markdown -> TEI converter or whether you’re targeting a specific flavor
sounds like you have a specific flavor in mind, so you might not need that
Right. I think the markdown is pretty generic at the moment, only non-standard addition I’m aware of is footnotes (which is a fairly standard non-standard from what I can tell). But I guess we can decide what TEI output we want for our use case, and it doesn’t have to be a general solution for everyone. Although I think most of the tags I’m outputting are fairly standard
There seems to be some discussion of this kind of conversion with pandoc: https://github.com/jgm/pandoc/issues/2047
@fmcc: interesting, good to know
doesn’t seem to have been much activity since then though
Just looking at the TEI Stylesheets implementation and I fear I share @suttonkoeser ’s suspicion of it (despite knowing it was built by super-smart people).
2015-11-09
@suttonkoeser: Hi! Re: TEI+annotation, I’m very interested and worked on some aspects / have some plans. What are you working on?
2015-11-10
Hey @raffazizzi. The TEI + annotation work is in relation to readux, an annotation/critical edition platform for our digitized books - http://readux.library.emory.edu/ for the site, https://github.com/emory-libraries/readux for code. I have a new release almost out the door that supports annotation with annotator.js, and we’re generating TEI facsimile from a couple of different OCR xml formats to support that.
The next step, which I’ve started working on, is to generate a TEI export of a single volume packaged with all of a user’s annotations for that volume, so that we have it as an artifact and as an interim step to generating a more user-friendly annotated edition
We’ve decided for our purposes that we don’t need all of the annotation data (in particular annotation created/updated timestamps don’t seem relevant in the TEI export), but I think we’ve figured out a reasonable way to map the annotation data to TEI notes and insert annotation reference markers into the TEI facsimile data.
If anyone is interested, I could share some examples of the TEI we’re generating. We did have some detail questions (like how to reference the annotated content), but I think we’ve come up with something workable.
@suttonkoeser: I’m interested in examples of the TEI!
@literature_geek: cool, I’ll share some soon - I think it will be a little easier to discuss when you can look at the annotation features in readux and see the tei facsimile that it’s using
@suttonkoeser: I’d be interested to see it too!
2015-11-11
@suttonkoeser: sounds great! I’d also be interested in seeing some examples :simple_smile:
2015-11-14
visualisation-god edward tufte’s views (once removed?) on css: https://edwardtufte.github.io/tufte-css/