Standards, protocols, strategies for distributing data

2015-10-16

mdlincoln
04:35:21 PM

I’m on the hunt for GLAM or other DH projects that are either using git and/or torrents as a way to distribute versioned data. I know the usual suspects (Tate, MoMA, Cooper Hewitt). N.B. I’m distinctly not looking for JSON-based API stuff


roxanne
04:45:17 PM

I don’t know of projects off-hand, but some library folks are planning to discuss at DLF in a couple weeks. Might be of interest http://dlfforum2015.sched.org/mobile/#session:b37651e41eceef99db5b6be017da48a2


mdlincoln
04:46:56 PM

@roxanne: brilliant, thanks - I will have to keep an eye on that


2015-10-17

roxanne
01:06:59 PM

Also IU Libraries does some of this https://github.com/iulibdcs


krisshaffer
10:42:53 PM

@mdlincoln: Not sure if this is what you’re looking for, but I’m part of a poetry-and-music corpus analysis project that has all data and scripts version controlled and stored on GitHub: https://github.com/corpusmusic/liederCorpusAnalysis.


2015-10-18

mdlincoln
02:32:19 PM

Also on the topic of torrents, has anyone heard of or used http://academictorrents.com/ ?


thomaspadilla
05:12:10 PM

I vaguely remember this crossing the bow of the ol twitter - seems like a cool approach


thomaspadilla
05:14:20 PM

we were just discussing making some library data available via torrent the other day, but it didn’t seem optimal for a relatively exceptional case where collections have restrictions


mdlincoln
05:56:14 PM

yeah, certainly it wouldn’t make sense for data that had selective permissions


mdlincoln
05:58:25 PM

But, as an example, I’ve spent days of scripting time pulling down the CC0 collection data and images from the semi-dysfunctional https://www.rijksmuseum.nl/nl/api and it seems like a torrent of the filedump (~150GB) would be a much better way to share the info
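
For context, the scraping side of that looks roughly like this - a sketch assuming the documented collection endpoint and its key/ps/p parameters, with a placeholder API key and page count:

```
KEY=your-api-key   # hypothetical; keys come via Rijksstudio registration
# page through the collection metadata, collecting each webImage URL
for p in $(seq 1 100); do
  curl -s "https://www.rijksmuseum.nl/api/en/collection?key=$KEY&format=json&ps=100&p=$p" \
    | jq -r '.artObjects[] | select(.webImage != null) | .webImage.url'
done > image_urls.txt
# then fetch the images themselves, politely
wget --input-file=image_urls.txt --wait=1 --directory-prefix=images/
```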


thomaspadilla
07:58:34 PM

oh yeah, totally agree, and am interested in alternatives


thomaspadilla
07:58:50 PM

have you looked into OPenn’s rsync option?


thomaspadilla
07:59:09 PM

seems like a promising approach


thomaspadilla
08:00:03 PM

they also offer ftp access



mdlincoln
08:40:28 PM

O if only everyone thought like Will Noel :simple_smile:


2015-10-19

thomaspadilla
06:38:56 AM

I’d also throw the following in the mix, nice if you need to turn some json > csv for intro workshops https://github.com/n3mo/jsan


eby
01:02:42 PM

@mdlincoln: i’ve seen it in passing but haven’t seen much use. likely due to many academics dealing with licensing issues. If you are looking at other examples of torrent use then the Internet Archive is a good one as they offer most of their downloads as a torrent option. I know @edsu has put up some sets like the ferguson tweet archive http://inkdroid.org/2014/11/18/on-forgetting/


mdlincoln
05:39:24 PM

Well, if any of you brave souls have ~160GB of free space, I’ve tried assembling my first torrent file here - I’d love to know if it can actually work: http://matthewlincoln.net/2015/10/19/the-rijksmuseum-as-bittorrent.html
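
The assembly step itself is small - a minimal sketch with mktorrent, where the tracker URL, piece size, and directory name are illustrative choices, not necessarily what this torrent actually used:

```
# -l 22 gives 4 MiB pieces, a reasonable choice for a ~160GB payload
mktorrent -a udp://tracker.openbittorrent.com:80 \
          -l 22 \
          -o rijksmuseum.torrent \
          rijksmuseum-data/
```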


2015-10-22

mdlincoln
09:18:46 AM

A nice overview of CIDOC-CRM by way of mapping the Victoria & Albert API to LOD: http://conaltuohy.com/blog/bridging-conceptual-gap-api-cidoc-crm/


mdlincoln
03:32:39 PM

Interesting piece on http://Academia.edu, open access, and a shift from content-gatekeeping to metrics-gatekeeping: http://blogs.lse.ac.uk/impactofsocialsciences/2015/10/22/does-academia-edu-mean-open-access-is-becoming-irrelevant/


2015-10-23

mdlincoln
09:15:24 AM

Speaking of distributed content delivery, has anyone been watching the development of IPFS?


edsu
11:48:40 AM

i tuned out for a few days and now there’s all this interesting convo to review!


edsu
11:50:18 AM

@mdlincoln: i have been tracking ipfs a bit ; i think it’s a really interesting idea, and ties into what you were talking about earlier w/r/t bittorrent right?


edsu
11:51:21 AM

it attracted the attention of Brewster Kahle at Internet Archive fairly recently too http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-call-for-a-distributed-web-2/


mdlincoln
11:55:13 AM

well that’s some good attention!


edsu
11:55:35 AM

i know right! it would be fun to take an hour to try to get it going sometime
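
The basic flow, going by the IPFS docs of the time (an untested sketch; the hash is a placeholder for what the add step prints):

```
ipfs init                       # create a local node identity
ipfs daemon &                   # join the network
ipfs add -r rijksmuseum-data/   # prints a content hash for the directory
ipfs get <hash>                 # anyone can fetch by that hash while a node hosts it
```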


edsu
11:55:51 AM

i’ve only read about it so far


edsu
11:56:09 AM

have you announced that torrent very widely yet?


mdlincoln
11:56:15 AM

As someone outside the institutional repository loop, I’ve always been curious how much these types of technologies, or even something as “dull” as rsync, are discussed


edsu
11:56:28 AM

not often enough, imho


ryanfb
11:56:33 AM

Interesting, hadn’t seen IPFS before


mdlincoln
11:56:36 AM

Yes, as best I could! Trevor Owens gave it a good signal boost with a tweet earlier


edsu
11:56:58 AM

oh!


ryanfb
11:57:00 AM

Now wondering how hard it would be to adapt ipfspics into an IIIF shim/proxy


mdlincoln
11:57:09 AM

Though at the moment we’ve just got a handful of downloaders, including a patient person with high bandwidth in brooklyn :stuck_out_tongue:


mdlincoln
11:58:18 AM

I’ve got permission to seed it from one of the department’s machines, so there is at least one full copy going aside from mine, when I’m home and have my external HD hooked up


mdlincoln
11:58:48 AM

I guess ideally, this is the kind of thing that you could get a consortium of libraries to all seed together


mdlincoln
11:59:12 AM

:stuck_out_tongue:


edsu
11:59:13 AM

you would think right?


mdlincoln
11:59:16 AM

haha


mdlincoln
11:59:56 AM

I mean, I think it would be a great idea for research libraries to mirror datadumps put on github


mdlincoln
12:00:18 PM

hint hint


mdlincoln
12:01:18 PM

Though as @thomaspadilla pointed out, it doesn’t even have to be as fancy as git or bittorrent - OPenn does just fine offering rsync


edsu
12:02:20 PM

rsync has the added benefit of allowing the dataset to change over time, which is a bit trickier with bittorrent
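
Which is the appeal of the rsync model: consumers re-run one command and pull only the deltas. A sketch, with a made-up host and module path:

```
# -a preserves structure and times, -z compresses in transit, --delete mirrors removals
rsync -avz --partial --delete rsync://data.example.org/openn/Data/ ./openn-mirror/
```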


edsu
12:02:41 PM

i think that’s something the academictorrents people were trying to work with?


mdlincoln
12:03:01 PM

yes, dataset change is certainly an issue


edsu
12:03:33 PM

the disadvantage of rsync is that you don’t get the advantage of the swarm, where everyone shares the cost of distributing the data


edsu
12:03:44 PM

right?


mdlincoln
12:03:50 PM

yes, that’s my understanding


edsu
12:04:17 PM

so OPenn pays to make it available to the world


edsu
12:04:41 PM

a familiar model :simple_smile:


mdlincoln
12:06:15 PM

yep. basically, we want git/git-annex crossed with bittorrent


fmcc
12:07:08 PM

I think there is actually a git/bittorrent project kicking about, i’ll try to find it


fmcc
12:07:51 PM

called gittorrent … of course



edsu
12:08:33 PM

heh, i wonder if that actually works


edsu
12:08:45 PM

119 forks, woah


fmcc
12:09:56 PM

I don’t really know anything about it, though I think I did encounter some discussion of potential issues - probably a HN thread FWIW


edsu
12:10:18 PM

it wouldn’t be computing if there weren’t issues I guess - haha


ryanfb
12:12:59 PM
Uploaded file: 0dotdv_4.png
Comment: Well, this experiment has revealed that Duke *might* be throttling torrents by default…

fmcc
12:13:11 PM

yeah, IPFS seems a better shout than the gittorrent


edsu
12:13:55 PM

ipfs is an interesting social experiment on its own


mdlincoln
12:14:59 PM

@ryanfb: good luck!


mdlincoln
12:16:31 PM

@ryanfb: FWIW our upload speed at the UMD seed has fluctuated between 2K and 4M, so…


ryanfb
12:17:11 PM

Ah, ok…maybe I’ll just leave it running over the weekend and see what happens


ryanfb
12:17:24 PM

I know that un-shaped, the connection on this machine is pretty fast


mdlincoln
12:18:09 PM

yes, do try if you can. At the moment we’ve just got one downloader on the UMD seed, which was just going quite speedily an hour ago


mdlincoln
12:18:21 PM

so it may be them, not us


mdlincoln
12:19:21 PM

ps @edsu if this is something that MITH et al could/would seed, I can hand deliver the data on a hard drive :simple_smile:


thomaspadilla
12:25:56 PM

this channel is amaze, my two cents :simple_smile:


mdlincoln
12:32:08 PM

glad you like, @thomaspadilla :goat: I must admit I understand the mechanics of the IPFS and gittorrent stuff just enough to know that they sound both interesting and complex (socially, as much as technically) to implement, and that’s about it


thomaspadilla
12:33:49 PM

indeed-y : cultural heritage orgs need provocations like this


edsu
12:38:28 PM

@mdlincoln: I’d have to ask trevor :simple_smile: all we have is AWS now, i believe you can seed s3 buckets


edsu
12:38:42 PM

@mdlincoln how about i put it up on InternetArchive?


mdlincoln
12:39:34 PM

@edsu: let’s talk after my talk next Tuesday - I can give you a copy of the data at the very least


edsu
12:42:04 PM

@mdlincoln: you know that things can torrent from InternetArchive right?


ryanfb
12:42:38 PM

I was looking into that as well, apparently you can also torrent upload but it won’t seed back on the same torrent after it finishes http://archive.org/about/faqs.php#Archive_BitTorrents


edsu
12:42:56 PM

wow, torrent upload — cool


edsu
01:05:34 PM

@mdlincoln did you follow the work that john resig did w/ computer vision for the Frick Museum?


edsu
01:07:23 PM

i wonder if something could be done with these images


thomaspadilla
01:08:35 PM

maybe something with opencv?


edsu
01:09:01 PM

yeah, i’m definitely not an expert when it comes to this sort of thing


edsu
01:09:07 PM

i know resig was working with tineye


edsu
01:09:10 PM

at the time


thomaspadilla
01:09:21 PM

yeah I had seen miriam posner post a couple of things recently, seems like interesting applications


thomaspadilla
01:09:36 PM

probably also something to learn from wragge re: facial identification and so forth


thomaspadilla
01:10:16 PM

probs also the cool stuff cooper hewitt labs have been doing


edsu
01:10:25 PM

resig used it to connect up japanese prints from the same woodcuts


edsu
01:10:50 PM

so you could see how woodcuts were used across and within museum collections


edsu
01:15:58 PM

yes, they did some interesting work browsing by color


edsu
01:16:12 PM

that would be cool to see right?


mdlincoln
01:17:18 PM

hah I suspect that highbandwidth Brooklyn leecher is john :simple_smile:


mdlincoln
01:17:43 PM

Shannon entropy could be pretty interesting to work with vis-a-vis prints, in fact


mdlincoln
01:18:12 PM

I use the measure in my diss for thinking about artistic diversity based on subject keywords, in fact


mdlincoln
01:18:22 PM

but it’s applicable to all sorts of signals


edsu
01:22:54 PM

@mdlincoln: they could use an Art category here https://aws.amazon.com/datasets/


edsu
01:23:13 PM

sigh


mdlincoln
01:29:38 PM

I actually wonder what would happen if I advertised it instead as a computer vision dataset with richly tagged images :)


edsu
01:44:44 PM

@mdlincoln see a new leecher?


edsu
01:46:10 PM

oops started the torrent on the wrong partition - restarting :simple_smile:


edsu
02:05:13 PM

@mdlincoln it does look like you can upload the .torrent file to internet archive and they will leech it and then seed it


edsu
02:05:31 PM

these words are so weird, i always feel like i am using them wrong


mdlincoln
02:05:34 PM

haha


mdlincoln
02:05:42 PM

well you do indeed seem to be downloading it :simple_smile:


mdlincoln
02:05:47 PM

yay local network connections


edsu
02:05:50 PM

i almost did it myself, but then thought you should be the one?


edsu
02:06:10 PM

that’s actually going out to amazon cloud


edsu
02:06:26 PM

internet archive make it very easy to do


edsu
02:07:06 PM

i’ll admit i got slightly cold feet when i was reading the terms of service and wondering what an app was


edsu
02:07:40 PM

it clearly says the data is in the public domain or cc0


mdlincoln
03:33:36 PM

Also just mentioned on my blog: http://dat-data.com/


mdlincoln
03:34:34 PM

I’ve heard tell of this project before, but haven’t followed it much. It looks like it is more built around versioning and distributing modeled data (key/value or tabular) - not sure about how it handles large binary files


mdlincoln
03:34:51 PM

but I do see it has R bindings so that makes me :dancer:


edsu
03:35:34 PM

yeah, it is a neat project


edsu
03:35:56 PM

the lead developer has put some interesting videos up


mdlincoln
03:37:15 PM

hmm ok, added to the list of things to look at more in depth


mdlincoln
03:37:51 PM

I could imagine developing some very interesting multi-party data collaboration platform built on a dat “trunk”



edsu
03:38:43 PM

haven’t watched that one ; but it might be relevant


abrennr
04:00:30 PM

:smile: Earlier today when I was reading this thread I was thinking “oh, this reminds me, what about that Dat project?” and there y’all are. :muscle: I did step through most of the tutorial at http://try-dat.com which is cool; it does on-the-fly docker environment deployment to let you run the tutorial code in the browser as you go.


abrennr
04:01:13 PM

@mdlincoln: here’s what the tutorial says about binary files


abrennr
04:01:25 PM

> But what if you have a large non-tabular file that you want store in dat? To accommodate this, dat is capable of adding files to a dataset that is named, appropriately, “files.” These attachments are sometimes called “blobs,” which is short for “Binary Large OBjectS.” A blob can be any form of binary data, but for now just think of a blob as a file, like one you might put in Dropbox or attach to an email.


abrennr
04:01:43 PM

> For the sake of speed and efficiency, dat doesn’t store blobs inside the datasets. Instead they’re kept in a “blob store” – a special directory – with each having an indexed “blob key.”


abrennr
04:06:02 PM

Also TIL I can’t hardly type dat and not do data^H


edsu
04:06:11 PM

sounds similar in principle to how git-lfs works
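
For comparison, a minimal git-lfs sketch of the same pointer/blob split (the tracked pattern is hypothetical):

```
git lfs install           # one-time setup of the smudge/clean filters
git lfs track "*.jpeg"    # matching files become pointers; blobs live in the LFS store
git add .gitattributes images/
git commit -m "add images via LFS"
```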


2015-10-25

shawngraham
08:41:26 PM

folks, I’m starting to play around with jupyter notebooks. I don’t do much python, but I do futz from time to time with R. This: http://irkernel.github.io/ seems completely borked (and the discussion in the issues channel on github is pretty much beyond me). But I tried again with miniconda and R essentials ( https://www.continuum.io/blog/developer/jupyter-and-conda-r ) and I was able to get an R notebook up and running. But the kernel keeps dying. Anyway - just wondered if anyone else was messing around with this stuff.


2015-10-26

fmcc
06:07:25 AM

@shawngraham: I’m not an R user, but I use Jupyter quite a lot - not tried out any of the more recent d3.js integration yet though


fmcc
06:19:51 AM

I’ve not looked into it that much at the moment, but am interested in the idea of ipython notebooks as a way of publishing research or notes on a subject


fmcc
06:21:23 AM

been trying to use the markdown boxes to keep commentary on code and what i’m actually up to - had wanted to look into pandoc integration more


fmcc
06:22:10 AM

I’d like to see if there were ways to transform the notebooks into slicker webpages etc.


paregorios
08:50:47 AM

fwiw I recently saw Patrick Burns (https://fordham.academia.edu/PatrickBurns == http://pbartleby.com/ == @diyclassics ) give a talk illustrated entirely from an ipython notebook, but I don’t see any of that stuff online anywhere.


shawngraham
09:50:23 AM

very cool. I’ve used knitr etc to get my R code & results online, and I’ve played with pykwiki as a way of pushing my reading notes online. I believe you can push jupyter to a reveal.js type slide show too. I’m thinking of OA journals and ideas around taking a jupyter notebook as an article… this is all very nebulous in my head. Turns out though there’s something fairly recent which is preventing the R kernel from playing nice with jupyter - https://github.com/IRkernel/IRkernel/issues/205 Hopefully gets resolved soon.


fmcc
10:12:36 AM

ah knitr - I think that’s definitely what sparked off my thoughts about pandoc etc. above - I’m pretty sure i’d read this article http://galahad.well.ox.ac.uk/repro/ and had intended to look into it more carefully


fmcc
10:12:51 AM

really I just want to never write any LaTeX again


paregorios
10:20:45 AM

@fmcc blasphemy! :wink:


fmcc
10:23:28 AM

@paregorios: I know, so many sunk hours!


paregorios
10:24:03 AM

somehow I never actually got sucked down that slippery slope … I’m just in a taunting mood


paregorios
10:24:05 AM

sorry


abrennr
10:27:47 AM

paging @mcburton, just back from the jupyterday meetup in NYC this weekend


fmcc
10:31:06 AM

@paregorios: Well, it was a combination of doing a commentary on a Greek text, and wanting to include technical line drawings, and being a bit of a typographical aesthete


fmcc
10:31:32 AM

(first and last time i’ll ever call myself an aesthete…)


edsu
10:38:31 AM

@mcburton: how was jupyterday?


mdlincoln
10:51:07 AM

I’ve not tried jupyter with R, as between RMarkdown and Shiny there is already a fairly functional dynamic publication system in place. I have been meaning to give it a try though


mcburton
10:56:58 AM

@shawngraham @fmcc: I have used Jupyter Notebooks extensively and have thought about publishing with Notebooks. You can use the nbconvert tool to transform Notebooks into HTML
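
For instance (filename hypothetical; the slides target produces a reveal.js deck, per the slideshow question above):

```
jupyter nbconvert --to html analysis.ipynb       # standalone HTML page
jupyter nbconvert --to slides analysis.ipynb     # reveal.js slideshow
jupyter nbconvert --to markdown analysis.ipynb   # handy input for a pandoc pipeline
```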


mcburton
10:57:47 AM

@shawngraham: also, Jupyter can use R, it is no longer just a python project.


mcburton
10:59:17 AM

@edsu: JupyterDay was amazing. It was mainly a bunch of computational scientists and folks from industry. If you haven’t already, check out the hashtag https://twitter.com/search?q=%23jupyterday&src=typd


edsu
10:59:57 AM

i did watch some of the twitter activity ; but will take a closer look


edsu
11:00:05 AM

any big takeaways?


mcburton
11:01:04 AM

Buzzfeed and the data journalists are WAAAAAY ahead of the digital humanities folks when it comes to publishing this stuff


mcburton
11:01:47 AM

integrating code + narrative + data


mcburton
11:01:58 AM

yes, that’s right, BUZZFEED



mcburton
11:03:10 AM

O’Reilly is also starting some experiments with Thebe which lets you embed executable code cells in any HTML document



mcburton
11:03:24 AM

but it is really hacky at the moment


mcburton
11:03:46 AM

Lev Manovich showed up, so I wasn’t the only digital humanist


mcburton
11:06:20 AM

Lorena Barba, https://twitter.com/LorenaABarba, gave a really nice talk about computational literacy and computational learning. The Jupyter grant proposal has some really interesting stuff around computational narratives, http://blog.jupyter.org/2015/07/07/project-jupyter-computational-narratives-as-the-engine-of-collaborative-data-science/, that humanists could really help them understand better


edsu
11:07:47 AM

Radical/Networks was going on not too far away, which looked interesting too http://radicalnetworks.org/program/index.html


mcburton
11:11:10 AM

too many things happening in NYC


edsu
11:12:35 AM

totally ; i did run across some of the radical/networks presentations here http://livestream.com/internetsociety/radicalnetworks


mcburton
11:13:54 AM

on the topic of #jupyterday, there was an interesting conversation in meatspace and on twitter about long-term preservation of notebooks and code being produced by journalists


mcburton
11:14:04 AM

Buzzfeed basically just uses Github as their repository



mcburton
11:15:36 AM

yes


mcburton
11:16:25 AM

that led me to put my foot in my mouth about Zenodo (I thought it was a for-profit repository)


edsu
11:17:36 AM

they do some nice work


mcburton
11:18:04 AM

yeah, I need to learn a bit more about the project. I didn’t realize it was run out of CERN


edsu
11:18:42 AM

interesting, i didn’t know who was doing it


mcburton
11:18:50 AM

science people


mcburton
11:19:09 AM

I also learned of a new digital repository, Invenio http://invenio.readthedocs.org/en/latest/introduction/about.html


edsu
11:19:09 AM

i had used it to register a DOI for the twitter archiving utility i worked on https://github.com/edsu/twarc


edsu
11:19:32 AM

eek, who is that dude


mcburton
11:20:04 AM

twarc man


mcburton
11:21:14 AM

@shawngraham @fmcc: Here is a recent post about D3.js integration into the Notebook. http://blog.thedataincubator.com/2015/08/embedding-d3-in-an-ipython-notebook/


mcburton
11:24:32 AM

@edsu: Maybe you’ve seen this, but the Binder project is pretty great for creating temporary Notebook deployments right out of a github repository http://mybinder.org/


edsu
11:25:20 AM

wow, no i hadn’t seen that


edsu
11:25:53 AM

although pinboard is telling me I already bookmarked it ; so I guess my memory is faulty :simple_smile:


edsu
11:26:13 AM

seems like it could be super handy for classes?


mcburton
11:26:49 AM

yes, I used it for my Twitter bot workshop



mcburton
11:27:05 AM

click the “Launch Binder” shield


edsu
11:29:00 AM

neato


edsu
11:29:29 AM

are there any ramifications to actually putting auth keys into that?


mcburton
11:30:12 AM

well, I don’t want to put my keys into the Github Repository…


edsu
11:30:36 AM

understandably


mcburton
11:31:46 AM

I did play around with having a single app on my account and then set up a workflow where the students would authorize their bot accounts with that app. But I opted to teach them how to set up their own developer accounts instead


edsu
11:35:32 AM

it’s nice ; does their modified notebook get deleted when they leave the session?


mcburton
11:43:12 AM

yes, timeout is an hour


mcburton
11:43:16 AM

of inactivity


mcburton
11:43:43 AM

they are working on ways to make it so modified notebooks could be pushed back up to github


edsu
11:44:21 AM

very cool


mcburton
11:44:45 AM

yeah, it is by a group of Neuroscientists at Janelia https://twitter.com/thefreemanlab


mcburton
11:44:55 AM

they are CRAZY productive and not too far from DC


edsu
11:47:35 AM

part of janelia it looks like? https://www.janelia.org/about-us


mcburton
11:49:49 AM

They work in a building that looks like starfleet academy and they are building the matrix (for mice and zebrafish)


edsu
12:11:22 PM

. o O (kind of awesome john resig popped into #visualization)


shawngraham
12:11:32 PM

@mcburton: the R kernel kept dying on me all the time - something to do with conda and rzmq not playing nice in recent days. Anyway, I think I’ll just continue to watch from the sidelines, given, as @mdlincoln says, R markdown & shiny are pretty nice…


mcburton
12:14:44 PM

@shawngraham: Rmarkdown is awesome, if I was doing R I’d use that over Jupyter Notebooks


shawngraham
12:16:24 PM

@mcburton yeah, I’m using R for text stuff, and python for sound stuff. Probably the most awkward dh guy ever, me.


mcburton
12:16:37 PM

@edsu: @jeresig is here too. Hey John!


jeresig
12:17:08 PM

hello! :smile:


shawngraham
12:18:41 PM

@mcburton: it was this that seems to be killing jupyter for me: https://github.com/IRkernel/IRkernel/issues/205


mcburton
12:25:14 PM

@shawngraham: you might look into using venv instead of conda for managing your python environment https://docs.python.org/3/library/venv.html
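
The basic venv flow, for reference (the environment name is arbitrary):

```
python3 -m venv jupyter-env       # create an isolated environment
source jupyter-env/bin/activate   # activate it in this shell
pip install jupyter               # installs land in the env, not system python
```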


shawngraham
12:44:38 PM

@mcburton: ah cool, thank you! yeah, I gotta learn to keep things separate for different tasks/projects.


shawngraham
12:45:05 PM

that’s probably something we could do a bit more of in terms of teaching dh stuff. Or maybe, more accurately, i ought to do more of…


fmcc
12:55:46 PM

@mcburton @shawngraham venv is python only though, so won’t help much with R kernel installation issues?


shawngraham
12:59:11 PM

@fmcc but at least it’ll keep me from screwing up other things on this machine!


fmcc
12:59:38 PM

If you’re doing python i’d say it was totally essential


paregorios
01:17:11 PM

invest in python virtual environments; the payback is huge. And I like https://virtualenvwrapper.readthedocs.org/en/latest/


shawngraham
01:19:47 PM

oh cool @paregorios thanks!


paregorios
01:26:45 PM

the pattern I use is to create a directory ~/Envs/ and use mkvirtualenv to create all the components of each environment I need there. It’s disk-extravagant, but I just create a new one for each project and name the environment to match the top-level project directory. This has allowed me to write a little script that puts me in the project directory, activates the associated virtual environment, etc.
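
In virtualenvwrapper terms that pattern is roughly this (project name hypothetical):

```
export WORKON_HOME=~/Envs   # where mkvirtualenv keeps environments
mkvirtualenv myproject      # create ~/Envs/myproject and activate it
workon myproject            # jump back into it later, from any directory
deactivate
```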


paregorios
01:27:35 PM

that keeps the venv and its binaries out of the way of your project-related repository etc.


ryanfb
02:14:36 PM

@mdlincoln: Quick question about the Rijksmuseum torrent - any particular reason for splitting it into TGZ files? I only see ~5GB total savings from compression…I’d think uncompressed would make it easier to use the dataset while continuing to seed it


jeresig
02:26:29 PM

(don’t know if I thanked you for it yet, @mdlincoln, but that Rijksmuseum torrent is just awesome!)


edsu
02:28:08 PM

@jeresig we were chatting about your work with tineye and the frick the other day in here, thinking about that torrent, and wondering if there is something interesting to do


jeresig
02:28:46 PM

oh nice!! Hmm, there very well may be some opportunities there


jeresig
02:29:31 PM

on a related note (I haven’t written about it yet), I’ve been having very good luck with the Open Source “pastec” framework recently. It’s very similar to TinEye’s MatchEngine in functionality, the quality is roughly comparable, but it’s Open Source! https://github.com/Visu4link/pastec


edsu
02:29:56 PM

oh! i was just going to ask if you were still working w/ tineye, nice


jeresig
02:29:57 PM

There are some features that I wish existed so I want to start talking with its creator - and maybe start writing guides on how to use it.


edsu
02:30:20 PM

that would be awesome


jeresig
02:30:28 PM

For http://Ukiyo-e.org I’m still using it - but I think I might start transitioning, or at least providing an alternative, in my open source projects


jeresig
02:30:39 PM

Great!


jeresig
02:31:35 PM

I have some data showing a quality comparison between the two technologies, as well. However it’s using some private images from the Frick and I need to get permission to release them.


ryanfb
02:46:18 PM

jeresig: Thanks for the heads-up on Pastec; currently experimenting with bouncing all of the Rijksmuseum torrent into it :simple_smile:
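
That bouncing amounts to two HTTP calls per image - a sketch based on Pastec’s documented HTTP interface (default port 4212; treat the exact endpoint paths as an assumption and check the project README; filenames are placeholders):

```
# index an image under numeric id 42
curl -X PUT --data-binary @images/00/SK-C-5.jpeg http://localhost:4212/index/images/42
# search with a query image; the JSON response lists matching image_ids (and scores)
curl -X POST --data-binary @query.jpeg http://localhost:4212/index/searcher
```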


jeresig
02:47:43 PM

@ryanfb: haha, awesome! let me know what you find out :simple_smile: One issue that exists right now is that if you want to query something that is already in the database you need to re-upload the image again (rather than just say “give me everything that looks like image #123”). But it’s not a huge deal, thankfully!


ryanfb
02:48:06 PM

Yeah, my plan was to use it that way to try to detect duplicate images


ryanfb
02:48:14 PM

*near-duplicate


ryanfb
02:49:28 PM

Also I’ve been idly pondering the idea of trying to use something like this for numismatics, and there are a couple domain-specific publications about CBIR for coins (though no running servers I can find), but if this gets OK results out of the box it might be a cool proof of concept :simple_smile:


jeresig
02:50:48 PM

@ryanfb: Nice!! I’ve considered that exact use case, as well. One concern that I had (without testing any data) was over how it’d work with well-worn coin faces (I suspect that it’d probably struggle). But I’d be very interested in seeing the results of that!


jeresig
02:51:07 PM

Also it might be also worth playing around with imgSeek (which is just an image similarity tool, doesn’t have a concept of “duplicate”) http://sourceforge.net/projects/imgseek/


ryanfb
02:52:08 PM

Thanks for reminding me about it!


jeresig
02:54:50 PM

No problem! I’m constantly watching for new open source solutions to these problems - I’m very surprised that there aren’t more!


ryanfb
02:55:49 PM

Yeah, seems like a lot of computer vision stuff gets locked up in proprietary systems, unfortunately…


mdlincoln
03:56:02 PM

Cheers @jeresig - I thought you’d like the torrent :stuck_out_tongue:


2015-10-28

ryanfb
01:19:19 PM

@mdlincoln: one note on the Rijksmuseum torrent…seems like there’s 13,101 empty image files, which correspond to copyright-protected images (which don’t have a webImage resource in the Rijksmuseum API). Might be something to note in the torrent details.


edsu
02:00:56 PM

@mcburton: you seen this before? https://github.com/aaren/notedown


edsu
02:01:04 PM

seems like it could be useful sometimes


mdlincoln
02:11:07 PM

@ryanfb: hmmm, that’s frustrating. Can you give me a filename example? If the object does not have a webImage, it ought to never have been downloaded and made into a file, anyway


mdlincoln
02:11:41 PM

but I wouldn’t put it past either their API or my bash skills to have messed that up somehow


ryanfb
02:21:20 PM

@mdlincoln: images/55/RP-F-F03020.jpeg


ryanfb
02:21:48 PM

You can find them all with find images -type f -name '*.jpeg' -size 0


ryanfb
02:22:20 PM

I’m actually double-checking that list and there seem to be a handful that have zero-size files and a non-nil webImage
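
A quick way to cross-check one of them: hit the per-object endpoint and inspect webImage (a sketch assuming the collection API’s object lookup, with $KEY as a placeholder API key):

```
curl -s "https://www.rijksmuseum.nl/api/en/collection/RP-F-F03020?key=$KEY&format=json" \
  | jq '.artObject.webImage'   # null here would mean the empty file is expected
```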


mdlincoln
02:41:18 PM

ah, handy command there


mdlincoln
02:42:21 PM

it’s worth investigating - though I can’t say when it’ll get up to the top of my to-do list :tired_face:


ryanfb
02:46:40 PM

no worries - current count is just 3 images like that :simple_smile:


ryanfb
02:48:39 PM

I’ll probably have a blog post with the results of my Pastec experiment sometime next week


mcburton
03:10:15 PM

@edsu: Yes, I remember coming across this a while ago and forgot about it. I haven’t used it, but it is something I want to play with


mcburton
04:23:42 PM

https://source.opennews.org/en-US/articles/introducing-agate/

> agate is a Python data analysis library in the vein of numpy or pandas, but with one crucial difference. Whereas those libraries optimize for the needs of scientists—namely, being incredibly fast when working with vast numerical datasets—agate instead optimizes for the performance of the human who is using it. That means stripping out those technical optimizations and instead focusing on designing code that is easy to learn, readable, and flexible enough to handle any weird data you throw at it.


2015-10-29

mdlincoln
02:24:54 PM

for when someone has sent you XML that merely needed to be JSON


2015-10-30

mdlincoln
11:07:26 AM

So I may have spoken too soon regarding the above utility - when asked to strip namespaces, it also demolishes attributes :cry: Does anyone have other utilities they have found useful for converting a lot of XML files into JSON?


mdlincoln
11:11:10 AM

Yes, I could bite the bullet and get back to using xpath selectors… but JQ works so well for producing normalized tables out of denormalized documents
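
That is, one-liners of this shape (field names invented for illustration):

```
# denormalized documents in, one CSV row per record out
jq -r '.records[] | [.id, .title, (.tags | join(";"))] | @csv' objects.json > objects.csv
```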


fmcc
11:17:25 AM

@mdlincoln: The namespace thing doesn’t look like it would be too much code to change


mdlincoln
11:21:48 AM

hmm, looking at it now, that’s a good point. I can guess my way through python, yah? :wink:


fmcc
11:27:08 AM

yeah, well if you give me a short example file, and what you expect the output to be, i’ll pull and modify it


mdlincoln
11:46:26 AM

Thanks! But looking through forks of the script, I found one that’s already done it: https://github.com/edyesed/xml2json


fmcc
11:56:19 AM

fantastic! Too obvious a fix to have been left undone.


mdlincoln
11:57:12 AM

:+1: for descriptive commit messages, too - otherwise I’d never have found it


2015-11-02

mcburton
12:19:26 PM

Can someone explain what Proquest offers for EEBO that they can charge access for? What is preventing us from just liberating all the page scans from http://eebo.chadwyck.com/ ? Page scans can’t be copyrighted anyway…


mcburton
12:21:14 PM

are we all just trying to avoid the heat from a license violation?


mcburton
12:53:42 PM

someone needs to challenge ProQuest on their assertion of copyright https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_Corp.


mcburton
12:54:10 PM

> exact photographic copies of public domain images could not be protected by copyright in the United States because the copies lack originality. The Court found that despite the fact that accurate reproductions might require a great deal of skill, experience and effort, the key element to determine whether a work is copyrightable under U.S. law is originality.


edsu
01:59:54 PM

For some reason I have it in my head that the situation is different in the UK?


edsu
02:01:15 PM

Still ProQuest is in Michigan, so it doesn’t matter right?


edsu
02:03:37 PM

I guess if you do decide to challenge them it’s a good idea to have a friend who is an IP lawyer.


edsu
02:04:26 PM

It’s hard to imagine them not responding after they have put language like that into their terms of service.


benwbrum
03:48:15 PM

I’m pretty sure that any attempt to liberate page scans from EEBO will run into CFAA pretty quickly, rather than copyright assertions.


benwbrum
03:50:16 PM

There has been no good test case to establish Bridgeman v. Corel in the UK. Most cultural institutions there assert copyright over scans of public-domain materials through arguments including sweat-of-the-brow copyright.


benwbrum
03:51:10 PM

I’ve run into that a lot with parish registers (and census images) that are not expressive and in some cases either centuries old or government created.


mdlincoln
03:53:27 PM

CFAA?


benwbrum
03:56:01 PM

Computer Fraud and Abuse Act. See https://www.techdirt.com/search-g.php?num=20&q=CFAA&search=Search for examples.


benwbrum
03:57:19 PM

According to Wikipedia, it was what Aaron Swartz was prosecuted under: https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act


2015-11-03

jeresig
11:59:00 AM

@ryanfb: your initial pastec results (as you posted on twitter) are very fun - I especially love the different photos where the bench is the common factor :simple_smile:


jeresig
12:00:40 PM

@ryanfb: I had a question about the index size — did you feed full-size images in or did you re-size them ahead of time? If I remember correctly pastec will resize them smaller if you don’t. I’m just curious if the index size would be smaller if you fed in deliberately smaller images (and if that would impact the quality of the matches at all)


jeresig
12:01:09 PM

@ryanfb: and when you counted the number of matches are you counting A -> B and B -> A separately or counting both as one match?


ryanfb
12:01:20 PM

@jeresig: Yeah I just bounced in the full size images and let Pastec handle the resizing.


ryanfb
12:02:26 PM

@jeresig: Currently writing up a blog post on the whole process right now…for the 33,029 “unique” matches number, that’s filtered down by a script where it considers any unique set of image_ids in a Pastec search result a unique match


ryanfb
12:03:00 PM

So e.g. image_ids: [3,2,1] and image_ids: [1,2,3] will be one match


jeresig
12:03:06 PM

interesting!


ryanfb
12:03:22 PM

But image_ids: [4,3,2,1] would be a new match
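
That filter is small enough to sketch - assuming one search result per line in a JSON-lines file, sort each image_ids array so order doesn’t matter, drop self-only hits, and count the distinct sets:

```
jq -c '.image_ids | sort | select(length > 1)' results.jsonl | sort -u | wc -l
```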


jeresig
12:04:43 PM

@ryanfb: are you filtering out single matches (e.g. you re-upload an image and it just matches itself again, returning something like image_ids: [1])


ryanfb
12:04:47 PM

Yes


jeresig
12:04:52 PM

cool!


jeresig
12:09:34 PM

@ryanfb: I’m very interested to see your results! Not sure if you’ve seen it but this is the analysis that I did with the Frick Art Reference Library Anonymous Italian Art photo archive: http://ejohn.org/research/computer-vision-photo-archives/ I did a lot of manual work to try and verify the quality of the matches. And, as I think I mentioned before, pastec seems to be roughly comparable, if slightly worse, than MatchEngine (but likely still “good enough”). Usually the big question that comes up is “what are we missing with these matches? and what false positives are we getting?” Beyond a certain point (too many misses, too many false positives) whatever algorithm will become unusable. Not sure if you’re storing the “score” field as well, but I found that setting the minimum score to about 19 worked well in some of my testing.


ryanfb
12:12:07 PM

Yeah, that’s always a hard question. For example, those plates (which I consider a really interesting match) had a score of 14. But I get a lot of matches where it seems like a calibration target in the image is causing it to match every other image with a similar calibration target, and the score is higher than that (d’oh)


jeresig
12:12:34 PM

:disappointed: ugh yeah, calibration targets/color bars are a real pain


jeresig
12:13:04 PM

btw, I have Node module that I’ve been working on for interfacing with Pastec, fwiw: https://github.com/jeresig/node-pastec


jeresig
12:13:23 PM

actively developing it, fixing up some issues


ryanfb
12:13:30 PM

Right now, I’m planning on sharing the Pastec index and match results with the blog post so anyone can play with them


jeresig
12:14:10 PM

nice!


ryanfb
12:16:37 PM

My next idea is some sort of twitter bot tweeting GIFs of every “unique” match, with the Rijksmuseum URLs they’re made from incorporated as well so they show up as already tweeted on the Rijksmuseum object page (and searching Twitter for that URL will turn it up)
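
The GIFs themselves would be one ImageMagick call per matched pair (filenames are placeholders):

```
# flip between the two matched images every 0.8s, looping forever
convert -delay 80 -loop 0 match-a.jpeg match-b.jpeg -resize 600x600 match.gif
```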


ryanfb
12:17:08 PM

Though maybe too many to feasibly do without being a data-hose, as one per hour would be almost 4 years


jeresig
12:17:19 PM

:smile:


jeresig
12:19:53 PM

@ryanfb: you may also be interested in some graph/cluster analysis that I did using the MatchEngine-derived links. It came up with some really interesting groupings of artworks that were quite unexpected: https://www.youtube.com/watch?v=PL6J8MtTsPo&t=27m8s


ryanfb
12:20:36 PM

Awesome, will take a look. Thanks!


ryanfb
12:25:30 PM

@jeresig: I hadn’t thought of using this process to remove calibration targets before, but if you’re interested in automatically detecting calibration charts I’ve written up a survey of some different approaches: https://ryanfb.github.io/etc/2015/07/08/automatic_colorchecker_detection.html


jeresig
12:26:47 PM

@ryanfb: that’s fantastic!! thank you so much


mdlincoln
12:32:59 PM

Can’t wait to see this - make sure to ping me when the blog post is up :)


jeresig
12:38:36 PM

ughh - I wish I had more time for experimentation with new technology! it’s so much fun :simple_smile:


edsu
03:46:26 PM

@ryanfb did i miss a post from you about you pastec work?


edsu
03:48:05 PM

@mdlincoln: so did I hear right that you are Dr Lincoln now?


ryanfb
03:53:19 PM

@edsu @mdlincoln @jeresig - just published the blog post now :simple_smile: http://ryanfb.github.io/etc/2015/11/03/finding_near-matches_in_the_rijksmuseum_with_pastec.html


ryanfb
03:55:30 PM

We’ll see if a 1.5GB file trips some sort of download limit on my institutional Box account…


jeresig
03:58:10 PM

@ryanfb: fantastic work! thank you for detailing all the steps you took and providing links to all the resulting data - that’s most helpful!


jeresig
04:02:11 PM

it might be interesting to treat the Rijksmuseum image set as a sort of canonical test dataset to analyze the quality of matches. Most of the other sets that I have include images that I can’t easily distribute.


ryanfb
04:04:49 PM

I was thinking about checking out some of the other things on https://en.wikipedia.org/wiki/List_of_CBIR_engines and seeing what some of the other free/open ones can produce in comparison


ryanfb
04:05:04 PM

(if any of them are in as-usable a state as Pastec…)


jeresig
04:06:19 PM

@ryanfb: Most of the other ones are just “similarity” not “duplicate”. The only “duplicate” ones I know of are Pastec and MatchEngine. imgSeek is also open source but it just finds similar images (it’ll just keep giving you matches and it never cuts off the results - also it doesn’t look at image features like Pastec and Matchengine, it can be tricked really easily, unfortunately)


jeresig
04:06:31 PM

but if you find anything, I’d be extremely interested!


ryanfb
04:10:51 PM

I’ll definitely share anything I find. Ultimately (for my work at Duke), my interest is in matching images of e.g. Ancient Greek inscriptions, for which I have a crazy-idea-which-just-might-work that I need to actually take the time and implement.


jeresig
04:11:15 PM

oh, that sounds cool!


fmcc
04:25:04 PM

@ryanfb: Is a big issue with matching inscription images not that they’re relatively similar in appearance, and would perhaps need some other kind of vectorisation that these CBIR systems don’t have?


ryanfb
04:27:53 PM

@fmcc: Yes. The rough outline of my plan is to try to use intrinsic self-similarity within an image to match to other images (with similar self-similarity)


fmcc
04:30:27 PM

@ryanfb: Is that approach based upon a particular paper?


ryanfb
04:31:23 PM

No, and that’s why it’s taking me so long to get around to it :wink:



fmcc
04:33:26 PM

I’ve not come across self-similarity at all before


ryanfb
04:34:30 PM

I think I may have turned that up before. There are a lot based around e.g. rotational self-similarity


fmcc
04:38:53 PM

sorry - that was literally the first thing that I came across googling, and I mentioned it for context, not because I thought it might be new to you…


ryanfb
04:39:11 PM

No worries :simple_smile:


ryanfb
05:43:44 PM

Just realized I can probably get away with dumping all the Rijksmuseum match GIFs on Flickr (especially now that I can set a Public Domain license). Not as discoverable through the existing Rijksmuseum interface, but oh well


2015-11-04

edsu
10:08:37 AM

that would be awesome


benwbrum
10:12:46 AM

I’m in the middle of a conversation with the Free UK Genealogy folks about opening up their data via an API. What would be the ideal format for census records, parish registers, or other vital statistics? RDF via an API? Record-type-specific CSV?


benwbrum
10:13:30 AM

One of our challenges is that individual records can change as they are corrected or added, making it very hard to create a persistent URL to an individual record.


benwbrum
10:14:32 AM

By contrast, sub-sets of our records could be delivered pretty easily via a URL scheme like <http://freereg.org.uk/COUNTY/YEAR/RECORD_TYPE>


edsu
10:15:35 AM

is the expectation that the persistent URL would always return the same data?


edsu
10:16:22 AM

i could imagine wanting to be able to periodically scrub my local data to get any updates, using the persistent url


edsu
10:17:07 AM

my advice is to learn about how they have it stored now, and come up with an initial solution that impacts that the least


edsu
10:17:42 AM

oh i’m just reading your next message ; so you want to have metadata in the URL


benwbrum
10:17:43 AM

I believe that the goal is that–as a database of records–users will want to link to a specific record. E.g. a family tree record for a marriage may link to the parish register entry recording the wedding.


edsu
10:18:06 AM

and if someone updates the county then people’s links break?


benwbrum
10:19:34 AM

There’s not really a unique ID (other than the database primary key) for the record. If someone replaces the register file containing that entry, and if the bride’s father’s surname has changed, we delete the old record and create a new one with the correct info.


benwbrum
10:19:43 AM

Any permalink to the record will be broken.


edsu
10:20:07 AM

if it uses the primary key?


benwbrum
10:20:37 AM

That would not be true of a permalink to e.g. “all entries for St. Mary’s, Burton-upon-Trent, Shropshire, 1743”


benwbrum
10:20:45 AM

If it uses the primary key, yes.


edsu
10:21:04 AM

but the problem with a permalink style url is that if the metadata changes so will the url?


benwbrum
10:21:07 AM

Similarly if it uses a smart key derived from all meaningful fields, since one of the meaningful fields will have changed during the correction


benwbrum
10:21:09 AM

Right.


benwbrum
10:21:28 AM

Permalinks to sets of records work fine, permalinks to an individual record can break.


edsu
10:21:33 AM

seems like either way things are changing and that if you don’t want links to break you need to remember the old ones


edsu
10:21:47 AM

and 301 redirect from the old to the new
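
That is, the old URL keeps answering but points at the new one; with hypothetical record URLs a client would see:

```
curl -sI http://freereg.org.uk/records/OLD-ID | head -3
# HTTP/1.1 301 Moved Permanently
# Location: http://freereg.org.uk/records/NEW-ID
```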


edsu
10:21:59 AM

i really like how WordPress does this for example


edsu
10:22:39 AM

great question btw :smile:


benwbrum
10:22:43 AM

Yes. That’ll require some human intervention to say “Is old record XYZ the same as new record XWZ?”, but I suspect we could do that.


benwbrum
10:23:01 AM

Then you keep the URL for the old record and redirect, as you suggest.


edsu
10:23:51 AM

i guess i prefer the permalink style url


edsu
10:23:58 AM

if it can be achieved without too much trouble


edsu
10:24:06 AM

they are much more hackable


benwbrum
10:24:08 AM

Thanks! This will be less of a problem as we move the volunteers onto an online transcription system. At the moment they’re uploading CSV files with large batches of entries.


edsu
10:24:51 AM

one compromise between RDF and CSV is CSV on the Web


edsu
10:25:17 AM

it’s basically just csv, with a sidecar json-ld file that defines the semantics of the csv file for anyone that wants to turn it into RDF
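
A minimal sidecar, following the spec’s data.csv-metadata.json naming convention (columns and property URLs invented for illustration):

```
cat > data.csv-metadata.json <<'EOF'
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "data.csv",
  "tableSchema": {
    "columns": [
      {"name": "name",   "titles": "Name",   "propertyUrl": "http://schema.org/name"},
      {"name": "county", "titles": "County", "propertyUrl": "http://schema.org/addressRegion"}
    ]
  }
}
EOF
```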


edsu
10:26:43 AM

a while ago i tried to create a simple example here http://edsu.github.io/csvw-template/


edsu
10:27:29 AM

oh, i see the conversion is no longer working #sigh


benwbrum
10:32:09 AM

This is very interesting to me. Thank you for the link.


paregorios
10:41:38 AM

++


edsu
10:54:45 AM

@benwbrum: you probably know this already but there was significant interest from the semweb/linkeddata community in genealogy


benwbrum
10:58:43 AM

I’m pretty new to the linked data world. My whole experience started by skimming the O’Reilly Linked Data book on the flight to Philadelphia for the IIIF hackathon a few weeks ago, then plumbing a JSON manifest generator into FromThePage. So there’s a huge amount I don’t know.


edsu
12:16:39 PM

@benwbrum: btw, I fixed my csv metadata file so that conversion works now http://edsu.github.io/csvw-template/


edsu
12:17:33 PM

i think CSVW is a nice example of how you can make your data easy to consume and use, while also making a high fidelity semantic version available too


edsu
12:17:39 PM

for the people that want that


edsu
12:19:28 PM

@benwbrum: ian davis was a very prominent figure in the linked data community and also quite into genealogy which he blogged about http://blog.iandavis.com/tags/genealogy/


edsu
12:20:35 PM

he’s still around, but last i heard was more interested in game development with golang :simple_smile:


2015-11-05

thomaspadilla
10:24:52 AM

Interesting piece, nice that it highlights the promise of eMOP


thomaspadilla
10:26:25 AM

I think that the re-use angle is a bit off - probably more conceptually accurate to think about preparation of collections for unanticipated use


mdlincoln
10:43:18 AM

and there’s the chicken&egg problem: can’t create a data aggregation site if no one is following a standard <-> no one will follow a standard if they don’t need to submit their work to an aggregation site


thomaspadilla
10:44:12 AM

Yeah, and there’s a bit of conflation going on - e.g. institutions that create collections vs. researchers that create derivative datasets from them (wherein of course we’d like to see reuse)


mdlincoln
10:44:48 AM

Ah! Yes, I think that’s what was rubbing me the wrong way.


thomaspadilla
10:47:45 AM

Interesting either way of course! Went into the ol’ Zotero. Will probably reference it in a piece I’m working on right now.


jheppler
12:57:45 PM

Anybody have some Gephi data I could use today for a quick tutorial I’m giving to grad students?


jheppler
01:00:51 PM

Thought about using the Les Miserables set, just for simplicity, but wondered if there was something out there more interesting.


thomaspadilla
01:01:42 PM

i have the data from the dh conference ya’ll had there a few years ago


thomaspadilla
01:01:48 PM

derived I think from a dataset that elijah created


thomaspadilla
01:01:57 PM

its linked off the tutorial here http://thomaspadilla.org/na2014


coryandrewtaylor
01:11:45 PM

@jheppler: I’ve got a couple here:


coryandrewtaylor
01:12:32 PM

They’re literary networks, taken from the Gospel of Luke.


jheppler
01:13:17 PM

@coryandrewtaylor: Thanks!


coryandrewtaylor
01:15:41 PM

@jheppler: No problem!


shawngraham
01:32:15 PM

@jheppler we had an ma student a few years back do historical SNA for his thesis - all his files are on figshare http://figshare.com/authors/Peter_Holdsworth/402385


edsu
01:54:13 PM

@jheppler i bet @mdlincoln has some about Dutch Engravers :smile:


edsu
01:55:13 PM

@shawngraham what is Holdsworth 1898?


edsu
01:55:41 PM

oh holdsworth is the name of the student?


shawngraham
01:56:14 PM

http://figshare.com/articles/Holdsworth_1898_Dataset/727769 so what he did was look at the membership rolls of women’s service organizations in the run up to the centennary of the war of 1812, to see how ideas of commemoration spread around Ontario


shawngraham
01:56:31 PM

there’s a neat bit where he looks at the structure of social networks against the structure of the rail network…


shawngraham
01:57:51 PM

yes, Peter Holdsworth. Really neat guy.


mdlincoln
02:26:27 PM

If they need to work with an unwieldy dynamic network with dated nodes and edges: https://gitlab.com/mdlincoln/dh2015/tree/master/data-raw


mdlincoln
02:27:30 PM

The bm_print_nodes and bm_print_edges might be the easier ones to work with


mdlincoln
02:29:10 PM

the full R package comes with documentation for all those data, too - but I think you know how to navigate that @jheppler


mdlincoln
02:30:17 PM

unless i am misremembering my various DHers’ languages


mdlincoln
02:30:31 PM

^ wouldn’t THAT make for an interesting paper


mdlincoln
02:34:27 PM

The rkm nodes and edges are a similar format, though much of the descriptive data is in Dutch


edsu
03:03:18 PM

@mdlincoln: did you ever run across http://umd-r-users.github.io/studyGroup/ ?


ryanfb
03:53:06 PM

@edsu: Thanks! Yeah I looked into Flickr but their support for animated GIFs is kind of weird


edsu
03:53:29 PM

yeah, tumblr is probably better


edsu
03:54:22 PM

@ryanfb i managed to get Peter Gorgels attention https://twitter.com/pgorgels


ryanfb
03:55:42 PM

Oh, nice!


edsu
03:56:02 PM

that will get Rijksmuseum eyes on it i think


edsu
03:56:23 PM

maybe they are already on it anyway :simple_smile:


ryanfb
03:56:53 PM

Yeah I’ve @’d the main account a few times but I’m sure that account is probably a notifications hose for some poor person…


edsu
03:57:05 PM

haha, yeah


edsu
03:57:29 PM

peter is pretty awesome


edsu
03:57:35 PM

i’m a fan anyway


edsu
03:58:26 PM

I saw him present at NDF in New Zealand a few years ago about Rijksstudio https://www.youtube.com/watch?v=iW17d-OQsIs


ryanfb
04:03:03 PM

Cool :simple_smile: Yeah, anyone who’s there involved in their push to make everything of theirs freely available online is probably good people…


2015-11-06

thomaspadilla
11:41:34 AM

figured there could be some stuff of interest to folks here http://socialcomputing.asu.edu/pages/datasets


jeresig
12:49:14 PM

@ryanfb: that tumblr is fantastic! Browsing through the matches is very fun - and I like that you added tags to the posts, as well!


mdlincoln
01:32:45 PM
Comment: This one has got to be my favorite so far

jheppler
01:52:59 PM

Nice!



fmcc
02:55:17 PM

@ryanfb: “Age yourself to view the future you!”


jeresig
02:57:35 PM

That’s a great one! Love seeing how the woodcut medium was (ab)used :simple_smile:


fmcc
03:03:36 PM

@jeresig: So the centre of the block was cut out so they only had to recreate the face?


jeresig
03:03:59 PM

@fmcc: precisely! they did this in Japanese Woodblock prints, too - one sec, let me post a photo


fmcc
03:06:25 PM

Trying to look up where the two images were then - quite interesting with that information that it’s Charles the Second that’s being removed to make way for William of Orange


jeresig
03:06:57 PM

[posts photos of two pairs of matched Japanese woodblock prints]

jeresig
03:08:07 PM

My CV stuff found both of these matches. In the one above (with the two men), not only are they different kabuki actors but the artist signatures on the prints are different, too! They had chopped out the old signature and added a different one, for some reason.


ryanfb
03:08:27 PM

Nice!


fmcc
03:09:54 PM

Do you know which one the original was?


jeresig
03:11:06 PM

@fmcc: not 100% sure (esp for the first one, since there are few other identifying details). I bet you could look really close and see where the woodblock had chipped — whichever one had less chips in it would’ve been the earlier one (since they naturally degrade as they’re used)


fmcc
03:18:18 PM

That’s interesting - the one on the right looks like the quality isn’t quite as good, but I guess that could be just ink bleed


fmcc
03:19:45 PM

that’s based a bit on intaglio printing though - really have no idea what wood block is like vs. linocut which is the only other relief printing i’ve done.


ryanfb
03:24:49 PM

@jeresig: if you ever make a computer vision system for automatically ordering woodblock prints based on the chipping, I vote for calling it “woodblockchain”


jeresig
03:40:04 PM

@ryanfb: :smile:


mcburton
04:11:51 PM

The Carnegie Museum of Art has posted all of their metadata as a CSV file on Github. +1 for great documentation too! https://github.com/cmoa/collection


jeresig
06:04:22 PM

Are people familiar with dat? http://dat-data.com/ I feel like it’s really awesome, especially for all these large CSV datasets


jeresig
06:04:48 PM

It’d be cool if there was a central server which hosted all these datasets for easy access


mdlincoln
06:08:35 PM

Yup, it’s been a subject of discussion a ways up in the group history - I’d love to see some uni libraries start to mirror & archive some of these datasets


mdlincoln
06:09:26 PM

I’ve yet to experiment with dat personally, though - is it stable-ish yet?


2015-11-07

thomaspadilla
05:13:04 AM

re: librarians - we’re working on it!


2015-11-09

coryandrewtaylor
12:19:36 PM

Not sure if this is the best channel, but Google has open-sourced their TensorFlow machine learning library:


fmcc
12:24:53 PM

@coryandrewtaylor: this looks really interesting - i’m going to poke about with this tonight


benwbrum
12:26:57 PM

Looks like the Zooniverse/NYPL collaboration to extend the Scribe codebase is finally available: http://scribeproject.github.io/


benwbrum
12:27:31 PM

I’ll be checking it out in detail for a client over the next several weeks.


mdlincoln
01:51:04 PM

Tropy, a tool for research photograph management, is looking for input on user practices and needs: https://docs.google.com/forms/d/1gxeRwzxQZNeOr4VJSvaUBYYfLomo2UFtAhA0YJ7vr14/viewform



edsu
02:03:10 PM

@benwbrum: is the NYPL Scribe codebase related to the Zooniverse Scribe project?


benwbrum
02:44:49 PM

@edsu: Zooniverse developed Scribe back in 2010-2011, open-sourcing a version of it at the end of 2011.


edsu
02:45:02 PM

yup i remember that


benwbrum
02:45:15 PM

Both FreeUKGenealogy (my client) and NYPL forked that to start separate projects.


benwbrum
02:45:35 PM

NYPL’s went into Ensemble, and was the basis for an NEH grant to extend it further.


benwbrum
02:46:07 PM

We used Scribe as a jumping off point for the “delivery mechanism” – the searchable database populated by data entry through Scribe.


benwbrum
02:47:28 PM

When NYPL/Zooniverse got the NEH grant in September 2013, we shelved that effort to focus elsewhere while they made the Scribe code more robust. (That wasn’t the only factor).


benwbrum
02:47:31 PM

And now it’s out!


edsu
02:47:51 PM

scribeAPI?


benwbrum
02:48:54 PM

What’s not entirely clear to me is whether the ScribeProject code (which appears to be ScribeAPI) is behind non-NYPL projects like Shakespeare’s World (Zooniverse+Folger) and AnnoTate (Zooniverse+Tate).


benwbrum
02:49:12 PM

It certainly is behind Measuring the ANZACs.


benwbrum
02:50:18 PM

Regardless, I’ve only had 10-15 minutes to go over the docs and no time to go over the code. I should have figured out more soon.


edsu
02:50:35 PM

thanks for the info, that helped me a lot!


benwbrum
02:50:37 PM

Sadly, I’ve lost my technical contact at the Zooniverse, as Stuart Lynn is now at CartoDB.


benwbrum
02:51:46 PM

Free UK Genealogy will be rebooting their online structured transcription effort shortly – maybe after version 2.1.4 of FreeREG, maybe after 2.1.1 of FreeCEN. Next 2-3 months, regardless.


benwbrum
02:52:13 PM

I’m not sure whether we’ll work under the aegis of Open Source Indexing again or not.


benwbrum
02:52:56 PM

I do hope we’ll be able to use ScribeAPI, since its predecessor influenced so much of our technical stack three years ago.


edsu
02:53:37 PM

have you written about that at all?


benwbrum
02:55:07 PM

We got a lot of sample project definitions.


edsu
02:55:09 PM

thanks!


benwbrum
02:55:16 PM

Only one offer to help, from geneanum.


benwbrum
02:56:58 PM

There was a lot of interest in creating an open-source tool in the same space as FamilySearch Indexing at RootsTech in spring 2013. Not a lot of resistance to open source for the tool, though most of the vendors hoped to use the tool to build paywalled databases, of course. I did get the impression that the idea was novel.


edsu
02:57:49 PM

i like the idea of a framework, rather than a turnkey solution


edsu
02:58:35 PM

since transcription efforts seem to vary so much in their presentation


edsu
02:58:55 PM

but i’m a newb when it comes to this stuff


benwbrum
03:00:21 PM

Have you seen the Zooniverse Project Builder (“Panoptes”)?


edsu
03:00:30 PM

no, i have not


benwbrum
03:00:33 PM

It’s a nice crowdsourcing framework that doesn’t include transcription.


edsu
03:00:51 PM

good name



benwbrum
03:01:30 PM

Very impressive, very usable. @mia and I used it in our crowdsourcing class at HILT this summer.


benwbrum
03:01:33 PM

That’s the one.


benwbrum
03:01:38 PM

It’s hosted, and open for anyone.


benwbrum
03:02:38 PM

You’ll need an account, however.


benwbrum
03:03:36 PM

It lets you ask multiple-choice questions, or “drawing” questions that ask users to select a region of the image.


benwbrum
03:04:00 PM

Those latter answers can have another step, presenting them with multiple choice questions about the drawing.


2015-11-11

mdlincoln
03:40:16 PM

I’ve raised the git/github question on here before, but I’m wondering if anyone has examples of CONTRIBUTOR policies for data repos? I’m curious what best practices would be for handling, say, pull requests on github for a repo that is generated from an upstream CMS. You might not want to just accept the changes without implementing them in your CMS and/or generating scripts, so how does one make that process clear to people who clone/fork your repo as if it were any other open-source project?



2015-11-12

sambrenner
11:48:18 AM

we (cooper hewitt) don’t have anything stated - but we have a link on our website’s object pages for people to email (using zendesk) corrections etc., which gets passed on to the appropriate curator to make the update in TMS.


sambrenner
11:49:37 AM

so i imagine our statement would say something like “we welcome all pull requests which concern data formatting, organization etc… for cataloging errors, please visit the appropriate page on our website and follow the feedback link there”


mdlincoln
02:19:20 PM

So what if someone sends a PR that, say, reformats the way you’ve serialized your data (expressing an array as an object instead), and you want to incorporate their changes. Would you actually accept that PR, and then reverse-engineer your export scripts to reproduce that formatting change in your next database update?


mdlincoln
02:22:35 PM
Uploaded file: cmoa PR policy
Comment: CMOA just updated their documentation to include the following explicit process documentation

sambrenner
02:30:13 PM

i guess we would encourage any data-reformatting to be done in a script that future users could run (e.g. in this folder - https://github.com/cooperhewitt/collection/tree/master/bin). like if someone wanted to write an import-to-(db of choice) script we’d accept that. how we format our data is kind of irrelevant at that point
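
For example, a contributed bin/ script that derives an alternate serialization from the canonical CSV rather than changing it (csvjson is csvkit’s; paths hypothetical):

```
#!/bin/sh
# bin/export-json.sh: regenerate the JSON view from the canonical CSV extract
csvjson data/collection.csv > data/collection.json
```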


mdlincoln
02:31:49 PM

Neat - I like the idea of encouraging documented scripts that refer to the canonical extract


mdlincoln
02:32:32 PM

It usefully side-steps the demand of having to be everything to everyone


sambrenner
02:34:22 PM

we should definitely be more explicit about it, though. i’ll come up with something and add it in before the week’s out