data-sharing
Standards, protocols, strategies for distributing data
2015-10-16
I’m on the hunt for GLAM or other DH projects that are using git and/or torrents as a way to distribute versioned data. I know the usual suspects (Tate, MoMA, Cooper Hewitt). N.B. I’m distinctly not looking for JSON-based API stuff
I don’t know of projects off-hand, but some library folks are planning to discuss at DLF in a couple weeks. Might be of interest http://dlfforum2015.sched.org/mobile/#session:b37651e41eceef99db5b6be017da48a2
@roxanne: brilliant, thanks - I will have to keep an eye on that
2015-10-17
@mdlincoln: Not sure if this is what you’re looking for, but I’m part of a poetry-and-music corpus analysis project that has all data and scripts version controlled and stored on GitHub: https://github.com/corpusmusic/liederCorpusAnalysis.
2015-10-18
Also on the topic of torrents, has anyone heard of or used http://academictorrents.com/ ?
I vaguely remember this crossing the bow of the ol twitter - seems like a cool approach
we were just discussing making some library data available via torrent the other day, but it didn’t seem optimal for a relatively exceptional case where collections have restrictions
yeah, certainly it wouldn’t make sense for data that had selective permissions
But, as an example, I’ve spent days of scripting time pulling down the CC0 collection data and images from the semi-dysfunctional https://www.rijksmuseum.nl/nl/api and it seems like a torrent of the filedump (~150GB) would be a much better way to share the info
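(For anyone curious what that scraping involves, here is a minimal sketch of a paging harvester — not the actual script used above. It assumes you have your own API key and that the /api/nl/collection endpoint, with its artObjects, objectNumber, and optional webImage fields, behaves as documented.)
```python
# Minimal sketch of a paging harvester for the Rijksmuseum collection API.
# Assumptions: you have an API key, and /api/nl/collection returns
# "artObjects" with "objectNumber" and an optional "webImage" dict.
import os
import requests

API_KEY = "your-api-key"
BASE = "https://www.rijksmuseum.nl/api/nl/collection"

def harvest(pages=5, page_size=100, out_dir="images"):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for page in range(1, pages + 1):
        resp = requests.get(BASE, params={
            "key": API_KEY, "format": "json", "ps": page_size, "p": page})
        resp.raise_for_status()
        for obj in resp.json().get("artObjects", []):
            web_image = obj.get("webImage")
            if not web_image:
                # rights-restricted objects have no downloadable web image
                continue
            img = requests.get(web_image["url"])
            img.raise_for_status()
            path = os.path.join(out_dir, obj["objectNumber"] + ".jpeg")
            with open(path, "wb") as f:
                f.write(img.content)

harvest()
```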
oh yeah, totally agree, and am interested in alternatives
have you looked into OPenn’s rsync option?
seems like a promising approach
they also offer ftp access
O if only everyone thought like Will Noel :simple_smile:
2015-10-19
I’d also throw the following in the mix, nice if you need to turn some json > csv for intro workshops https://github.com/n3mo/jsan
@mdlincoln: i’ve seen it in passing but haven’t seen much use. likely due to many academics dealing with licensing issues. If you are looking at other examples of torrent use then the Internet Archive is a good one as they offer most of their downloads as a torrent option. I know @edsu has put up some sets like the ferguson tweet archive http://inkdroid.org/2014/11/18/on-forgetting/
Well, if any of you brave souls have ~160GB of free space, I’ve tried assembling my first torrent file here - I’d love to know if it can actually work: http://matthewlincoln.net/2015/10/19/the-rijksmuseum-as-bittorrent.html
2015-10-20
2015-10-21
2015-10-22
A nice overview of CIDOC-CRM by way of mapping the Victoria & Albert API to LOD: http://conaltuohy.com/blog/bridging-conceptual-gap-api-cidoc-crm/
Interesting piece on http://Academia.edu, open access, and a shift from content-gatekeeping to metrics-gatekeeping: http://blogs.lse.ac.uk/impactofsocialsciences/2015/10/22/does-academia-edu-mean-open-access-is-becoming-irrelevant/
2015-10-23
Speaking of distributed content delivery, has anyone been watching the development of IPFS?
i tuned out for a few days and now there’s all this interesting convo to review!
@mdlincoln: i have been tracking ipfs a bit ; i think it’s a really interesting idea, and ties into what you were talking about earlier w/r/t bittorrent right?
it attracted the attention of Brewster Kahle at Internet Archive fairly recently too http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-call-for-a-distributed-web-2/
well that’s some good attention!
i know right! it would be fun to take an hour to try to get it going sometime
i’ve only read about it so far
have you announced that torrent very widely yet?
As someone outside the institutional repository loop, I’ve always been curious how much these types of technologies, or even something as “dull” as rsync, are discussed
not often enough, imho
Interesting, hadn’t seen IPFS before
Yes, as best I could! Trevor Owens gave it a good signal boost with a tweet earlier
oh!
Now wondering how hard it would be to adapt ipfspics into an IIIF shim/proxy
Though at the moment we’ve just got a handful of downloaders, including a patient person with high bandwidth in brooklyn :stuck_out_tongue:
I’ve got permission to seed it from one of the department’s machines, so there is at least one full copy going aside from mine, when I’m home and have my external HD hooked up
I guess ideally, this is the kind of thing that you could get a consortium of libraries to all seed together
:stuck_out_tongue:
you would think right?
haha
I mean, I think it would be a great idea for research libraries to mirror datadumps put on github
hint hint
Though as @thomaspadilla pointed out, it doesn’t even have to be as fancy as git or bittorrent - OPenn does just fine offering rsync
rsync has the added benefit of allowing the dataset to change over time, which is a bit trickier with bittorrent
i think that’s something the academictorrents people were trying to work with?
yes, dataset change is certainly an issue
the disadvantage of rsync is that you don’t get the advantage of the swarm, where everyone shares the cost of distributing the data
right?
yes, that’s my understanding
so OPenn pays to make it available to the world
a familiar model :simple_smile:
yep. basically, we want git/git-annex crossed with bittorrent
I think there is actually a git/bittorrent project kicking about, i’ll try to find it
called gittorrent … of course
heh, i wonder if that actually works
119 forks, woah
I don’t really know anything about it, though I think I did encounter some discussion of potential issues - probably a HN thread FWIW
it wouldn’t be computing if there weren’t issues I guess - haha
yeah, IPFS seems a better shout than the gittorrent
ipfs is an interesting social experiment on its own
@ryanfb: good luck!
@ryanfb: FWIW our upload speed at the UMD seed has fluctuated between 2K and 4M, so…
Ah, ok…maybe I’ll just leave it running over the weekend and see what happens
I know that un-shaped, the connection on this machine is pretty fast
yes, do try if you can. At the moment we’ve just got one downloader on the UMD seed, which was just going quite speedily an hour ago
so it may be them, not us
ps @edsu if this is something that MITH et al could/would seed, I can hand deliver the data on a hard drive :simple_smile:
this channel is amaze, my two cents :simple_smile:
glad you like, @thomaspadilla :goat: I must admit I understand the mechanics of the IPFS and gittorrent stuff just enough to know that they sound both interesting and complex (socially, as much as technically) to implement, and that’s about it
indeed-y : cultural heritage orgs need provocations like this
@mdlincoln: I’d have to ask trevor :simple_smile: all we have is AWS now, i believe you can seed s3 buckets
@mdlincoln how about i put it up on InternetArchive?
@edsu: let’s talk after my talk next Tuesday - I can give you a copy of the data at the very least
@mdlincoln: you know that things can torrent from InternetArchive right?
I was looking into that as well, apparently you can also torrent upload but it won’t seed back on the same torrent after it finishes http://archive.org/about/faqs.php#Archive_BitTorrents
wow, torrent upload — cool
@mdlincoln did you follow the work that john resig did w/ computer vision for the Frick Museum?
i wonder if something could be done with these images
maybe something with opencv?
yeah, i’m definitely not an expert when it comes to this sort of thing
i know resig was working with tineye
at the time
yeah I had seen miriam posner post a couple of things recently, seems like interesting applications
probably also something to learn from wragge re: facial identification and so forth
probs also the cool stuff cooper hewitt labs have been doing
resig used it to connect up japanese prints from the same woodcuts
so you could see how woodcuts were used across and within museum collections
yes, they did some interesting work browsing by color
that would be cool to see right?
hah I suspect that highbandwidth Brooklyn leecher is john :simple_smile:
Shannon entropy could be pretty interesting to work with vis-a-vis prints, in fact
I use the measure in my diss for thinking about artistic diversity based on subject keywords
but it’s applicable to all sorts of signals
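(For context, Shannon entropy over a bag of subject keywords is quick to compute — a sketch of the measure, not the actual dissertation code:)
```python
# Shannon entropy of a print's subject keywords: a quick illustration of the
# measure mentioned above, not the actual dissertation code.
from collections import Counter
from math import log

def shannon_entropy(keywords):
    counts = Counter(keywords)
    total = float(sum(counts.values()))
    return -sum((n / total) * log(n / total, 2) for n in counts.values())

# Four distinct subjects -> maximal diversity for four tags (2.0 bits);
# one dominant subject -> lower entropy (~0.81 bits).
print(shannon_entropy(["landscape", "cattle", "river", "windmill"]))
print(shannon_entropy(["portrait", "portrait", "portrait", "hat"]))
```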
sigh
I actually wonder what would happen if I advertised it instead as a computer vision dataset with richly tagged images :)
@mdlincoln see a new leecher?
oops started the torrent on the wrong partition - restarting :simple_smile:
@mdlincoln it does look like you can upload the .torrent file to internet archive and they will leech it and then seed it
these words are so weird, i always feel like i am using them wrong
haha
well you do indeed seem to be downloading it :simple_smile:
yay local network connections
i almost did it myself, but then thought you should be the one?
that’s actually going out to amazon cloud
internet archive make it very easy to do
i’ll admit i got slightly cold feet when i was reading the terms of service and wondering what an app was
it clearly says the data is in the public domain or cc0
I’ve heard tell of this project before, but haven’t followed it much. It looks like it is more built around versioning and distributing modeled data (key/value or tabular) - not sure about how it handles large binary files
but I do see it has R bindings so that makes me :dancer:
yeah, it is a neat project
the lead developer has put some interesting videos up
hmm ok, added to the list of things to look at more in depth
I could imagine developing some very interesting multi-party data collaboration platform built on a dat “trunk”
haven’t watched that one ; but it might be relevant
:smile: Earlier today when I was reading this thread I was thinking “oh, this reminds me, what about that Dat project?” and there y’all are. :muscle: I did step through most of the tutorial at http://try-dat.com which is cool, does on-the-fly docker environment deployment to let you run the tutorial code in browser as you go.
@mdlincoln: here’s what the tutorial says about binary files
> But what if you have a large non-tabular file that you want store in dat? To accommodate this, dat is capable of adding files to a dataset that is named, appropriately, “files.” These attachments are sometimes called “blobs,” which is short for “Binary Large OBjectS.” A blob can be any form of binary data, but for now just think of a blob as a file, like one you might put in Dropbox or attach to an email.
> For the sake of speed and efficiency, dat doesn’t store blobs inside the datasets. Instead they’re kept in a “blob store” – a special directory – with each having an indexed “blob key.”
Also TIL I can’t hardly type dat and not do data^H
sounds similar in principle to how git-lfs works
2015-10-25
folks, I’m starting to play around with jupyter notebooks. I don’t do much python, but I do futz from time to time with R. This: http://irkernel.github.io/ seems completely borked (and the discussion in the issues channel on github is pretty much beyond me). But I tried again with miniconda and R essentials ( https://www.continuum.io/blog/developer/jupyter-and-conda-r ) and I was able to get an R notebook up and running. But the kernel keeps dying. Anyway - just wondered if anyone else was messing around with this stuff.
2015-10-26
@shawngraham: I’m not an R user, but I use Jupyter quite a lot - not tried out any of the more recent d3.js integration yet though
I’ve not looked into it that much at the moment, but am interested in the idea of ipython notebooks as a way of publishing research or notes on a subject
been trying to use the markdown boxes to keep commentary on code and what i’m actually up to - had wanted to look into pandoc integration more
I’d like to see if there were ways to transform the notebooks into slicker webpages etc.
fwiw I recently saw Patrick Burns (https://fordham.academia.edu/PatrickBurns == http://pbartleby.com/ == @diyclassics ) give a talk illustrated entirely from an ipython notebook, but I don’t see any of that stuff online anywhere.
very cool. I’ve used knitr etc to get my R code & results online, and I’ve played with pykwiki as a way of pushing my reading notes online. I believe you can push jupyter to a reveal.js type slide show too. I’m thinking of OA journals and ideas around taking a jupyter notebook as an article… this is all very nebulous in my head. Turns out though there’s something fairly recent which is preventing the R kernel from playing nice with jupyter - https://github.com/IRkernel/IRkernel/issues/205 Hopefully gets resolved soon.
ah knitr - I think that’s definitely what sparked off my thoughts about pandoc etc. above - I’m pretty sure i’d read this article http://galahad.well.ox.ac.uk/repro/ and had intended to look into it more carefully
really I just want to never write any LaTeX again
@fmcc blasphemy! :wink:
@paregorios: I know, so many sunk hours!
somehow I never actually got sucked down that slippery slope … I’m just in a taunting mood
sorry
paging @mcburton, just back from the jupyterday meetup in NYC this weekend
@paregorios: Well, it was a combination of doing a commentary on a Greek text, and wanting to include technical line drawings, and being a bit of a typographical aesthete
(first and last time i’ll ever call myself an aesthete…)
@mcburton: how was jupyterday?
I’ve not tried Jupyter with R, as between RMarkdown and Shiny there is already a fairly functional dynamic publication system in place. I have been meaning to give it a try though
@shawngraham @fmcc: I have used Jupyter Notebooks extensively and have thought about publishing with Notebooks. You can use the nbconvert tool to transform Notebooks into HTML
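(nbconvert can be driven from the command line — jupyter nbconvert --to html notebook.ipynb — or from Python; a minimal sketch assuming a recent nbconvert install. There is also a SlidesExporter for reveal.js-style slides.)
```python
# Minimal sketch: convert a notebook to standalone HTML via nbconvert's
# Python API (equivalent to `jupyter nbconvert --to html notebook.ipynb`).
# Assumes nbconvert is installed alongside Jupyter.
from nbconvert import HTMLExporter

exporter = HTMLExporter()
body, resources = exporter.from_filename("notebook.ipynb")
with open("notebook.html", "w") as f:
    f.write(body)
```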
@shawngraham: also, Jupyter can use R, it is no longer just a python project.
@edsu: JupyterDay was amazing. It was mainly a bunch of computational scientists and folks from industry. If you haven’t already, check out the hashtag https://twitter.com/search?q=%23jupyterday&src=typd
i did watch some of the twitter activity ; but will take a closer look
any big takeaways?
Buzzfeed and the data journalists are WAAAAAY ahead of the digital humanities folks when it comes to publishing this stuff
integrating code + narrative + data
yes, thats right, BUZZFEED
O’Reilly is also starting some experiments with Thebe which lets you embed executable code cells in any HTML document
but it is really hacky at the moment
Lev Manovich showed up, so I wasn’t the only digital humanist
Lorena Barba, https://twitter.com/LorenaABarba, gave a really nice talk about computational literacy and computational learning. The Jupyter Grant proposal has some really interesting stuff around computational narratives, http://blog.jupyter.org/2015/07/07/project-jupyter-computational-narratives-as-the-engine-of-collaborative-data-science/, that really needs humanists to help them better understand
Radical/Networks was going on not too far away, which looked interesting too http://radicalnetworks.org/program/index.html
too many things happening in NYC
totally ; i did run across some of the radical/networks presentations here http://livestream.com/internetsociety/radicalnetworks
on the topic of #jupyterday, there was an interesting conversation in meatspace and on twitter about long-term preservation of notebooks and code being produced by journalists
Buzzfeed basically just uses Github as their repository
yes
that led me to put my foot in my mouth about Zenodo (I thought it was a for-profit repository)
they do some nice work
yeah, I need to learn a bit more about the project. I didn’t realize it was run out of CERN
interesting, i didn’t know who was doing it
science people
I also learned of a new digital repository, Invenio http://invenio.readthedocs.org/en/latest/introduction/about.html
i had used it to register a DOI for the twitter archiving utility i worked on https://github.com/edsu/twarc
eek, who is that dude
twarc man
@shawngraham @fmcc: Here is a recent post about D3.js integration into the Notebook. http://blog.thedataincubator.com/2015/08/embedding-d3-in-an-ipython-notebook/
@edsu: Maybe you’ve seen this, but the Binder project is pretty great for creating temporary Notebook deployments right out of a github repository http://mybinder.org/
wow, no i hadn’t seen that
although pinboard is telling me I already bookmarked it ; so I guess my memory is faulty :simple_smile:
seems like it could be super handy for classes?
yes, I used it for my Twitter bot workshop
click the “Launch Binder” shield
neato
are there any ramifications to actually putting auth keys into that?
well, I don’t want to put my keys into the Github Repository…
understandably
I did play around with having a single app on my account and then set up a workflow where the students would authorize their bot accounts with that app. But I opted to teach them how to set up their own developer accounts instead
it’s nice ; does their modified notebook get deleted when they leave the session?
yes, timeout is an hour
of inactivity
they are working on ways to make it so modified notebooks could be pushed back up to github
very cool
yeah, it is by a group of neuroscientists at Janelia https://twitter.com/thefreemanlab
they are CRAZY productive and not too far from DC
They work in a building that looks like starfleet academy and they are building the matrix (for mice and zebrafish)
@mcburton: the R kernel kept dying on me all the time - something to do with conda and rzmq not playing nice in recent days. Anyway, I think I’ll just continue to watch from the sidelines, given, as @mdlincoln says, R markdown & shiny are pretty nice…
@shawngraham: Rmarkdown is awesome, if I was doing R I’d use that over Jupyter Notebooks
@mcburton yeah, I’m using R for text stuff, and python for sound stuff. Probably the most awkward dh guy ever, me.
@edsu: @jeresig is here too. Hey John!
hello! :smile:
@mcburton: it was this that seems to be killing jupyter for me: https://github.com/IRkernel/IRkernel/issues/205
@shawngraham: you might look into using venv
instead of conda for managing your python environment https://docs.python.org/3/library/venv.html
@mcburton: ah cool, thank you! yeah, I gotta learn to keep things separate for different tasks/projects.
that’s probably something we could do a bit more of in terms of teaching dh stuff. Or maybe, more accurately, i ought to do more of…
@mcburton @shawngraham venv is python only though, so won’t help much with R kernel installation issues?
@fmcc but at least it’ll keep me from screwing up other things on this machine!
If you’re doing python i’d say it was totally essential
invest in python virtual environments; the payback is huge. And I like https://virtualenvwrapper.readthedocs.org/en/latest/
oh cool @paregorios thanks!
the pattern I use is to create a directory ~/Envs/ and use mkvirtualenv to create all the components of each environment I need there. It’s disk-extravagant, but I just create a new one for each project and name the environment to match the top-level project directory. This has allowed me to write a little script that puts me in the project directory, activates the associated virtual environment, etc.
that keeps the venv and its binaries out of the way of your project-related repository etc.
@mdlincoln: Quick question about the Rijksmuseum torrent - any particular reason for splitting it into TGZ files? I only see ~5GB total savings from compression…I’d think uncompressed would make it easier to use the dataset while continuing to seed it
(don’t know if I thanked you for it yet, @mdlincoln, but that Rijksmuseum torrent is just awesome!)
@jeresig we were chatting about your work with tineye and the frick the other day in here, thinking about that torrent, and wondering if there is something interesting to do
oh nice!! Hmm, there very well may be some opportunities there
on a related note (I haven’t written about it yet), I’ve been having very good luck with the Open Source “pastec” framework recently. It’s very similar to TinEye’s MatchEngine in functionality, the quality is roughly comparable, but it’s Open Source! https://github.com/Visu4link/pastec
oh! i was just going to ask if you were still working w/ tineye, nice
There are some features that I wish existed so I want to start talking with its creator - and maybe start writing guides on how to use it.
that would be awesome
For http://Ukiyo-e.org I’m still using it - but I think I might start transitioning, or at least providing an alternative, in my open source projects
Great!
I have some data showing a quality comparison between the two technologies, as well. However it’s using some private images from the Frick and I need to get permission to release them.
jeresig: Thanks for the heads-up on Pastec; currently experimenting with bouncing all of the Rijksmuseum torrent into it :simple_smile:
@ryanfb: haha, awesome! let me know what you find out :simple_smile: One issue that exists right now is that if you want to query something that is already in the database you need to re-upload the image again (rather than just say “give me everything that looks like image #123”). But it’s not a huge deal, thankfully!
Yeah, my plan was to use it that way to try to detect duplicate images
*near-duplicate
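(Rough sketch of what that near-duplicate pass could look like against a local Pastec server. The HTTP endpoints here — PUT /index/images/&lt;id&gt; to add, POST /index/searcher to query, default port 4212 — are assumptions taken from memory of the Pastec README, so double-check them against the docs.)
```python
# Rough near-duplicate pass against a local Pastec server. Endpoints are
# assumptions based on the Pastec README - verify before relying on them.
import glob
import requests

PASTEC = "http://localhost:4212"
paths = sorted(glob.glob("images/*/*.jpeg"))

# 1. Index every image, using its list position as the Pastec image id.
for image_id, path in enumerate(paths):
    with open(path, "rb") as f:
        requests.put("{}/index/images/{}".format(PASTEC, image_id), data=f.read())

# 2. Re-upload each image as a query; keep results that match more than itself.
matches = []
for image_id, path in enumerate(paths):
    with open(path, "rb") as f:
        result = requests.post("{}/index/searcher".format(PASTEC), data=f.read()).json()
    ids = result.get("image_ids", [])
    if len(ids) > 1:
        matches.append((path, sorted(ids)))

print(len(matches), "images with at least one near-duplicate candidate")
```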
Also I’ve been idly pondering the idea of trying to use something like this for numismatics, and there are a couple domain-specific publications about CBIR for coins (though no running servers I can find), but if this gets OK results out of the box it might be a cool proof of concept :simple_smile:
@ryanfb: Nice!! I’ve considered that exact use case, as well. One concern that I had (without testing any data) was over how well it’d work with well-worn coin faces (I suspect that it’d probably struggle). But I’d be very interested in seeing the results of that!
Also it might be also worth playing around with imgSeek (which is just an image similarity tool, doesn’t have a concept of “duplicate”) http://sourceforge.net/projects/imgseek/
Thanks for reminding me about it!
No problem! I’m constantly watching for new open source solutions to these problems - I’m very surprised that there aren’t more!
Yeah, seems like a lot of computer vision stuff gets locked up in proprietary systems, unfortunately…
Cheers @jeresig - I thought you’d like the torrent :stuck_out_tongue:
2015-10-27
2015-10-28
@mdlincoln: one note on the Rijksmuseum torrent…seems like there are 13,101 empty image files, which correspond to copyright-protected images (which don’t have a webImage resource in the Rijksmuseum API). Might be something to note in the torrent details.
seems like it could be useful sometimes
@ryanfb: hmmm, that’s frustrating. Can you give me a filename example? If the object does not have a webImage, it ought to never have been downloaded and made into a file, anyway
but I wouldn’t put it past either their API or my bash skills to have messed that up somehow
@mdlincoln: images/55/RP-F-F03020.jpeg
You can find them all with find images -type f -name '*.jpeg' -size 0
I’m actually double-checking that list and there seem to be a handful that have zero-size files and a non-nil webImage
ah, handy command there
it’s worth investigating - though I can’t say when it’ll get up to the top of my to-do list :tired_face:
no worries - current count is just 3 images like that :simple_smile:
I’ll probably have a blog post with the results of my Pastec experiment sometime next week
@edsu: Yes, I remember coming across this a while ago and forgot about it. I haven’t used it, but it is something I want to play with
https://source.opennews.org/en-US/articles/introducing-agate/ > agate is a Python data analysis library in the vein of numpy or pandas, but with one crucial difference. Whereas those libraries optimize for the needs of scientists—namely, being incredibly fast when working with vast numerical datasets—agate instead optimizes for the performance of the human who is using it. That means stripping out those technical optimizations and instead focusing on designing code that is easy to learn, readable, and flexible enough to handle any weird data you throw at it.
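(A tiny taste of what agate code looks like, using a hypothetical objects.csv — the column names below are made up for illustration.)
```python
# A small taste of agate with a hypothetical objects.csv - the "century"
# column name is made up for illustration.
import agate

table = agate.Table.from_csv("objects.csv")

# Count objects per century and print a quick text table, largest first.
by_century = table.group_by("century").aggregate([("count", agate.Count())])
by_century.order_by("count", reverse=True).print_table()
```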
2015-10-29
for when someone has sent you XML that merely needed to be JSON
2015-10-30
So I may have spoken too soon regarding the above utility - when asked to strip namespaces, it also demolishes attributes :cry: Does anyone have other utilities they have found useful for converting a lot of XML files into JSON?
Yes, I could bite the bullet and get back to using xpath selectors… but JQ works so well for producing normalized tables out of denormalized documents
@mdlincoln: The namespace thing doesn’t look like it would be too much code to change
hmm, looking at it now, that’s a good point. I can guess my way through python, yah? :wink:
yeah, well if you give me a short example file, and what you expect the output to be, i’ll pull and modify it
Thanks! But looking through forks of the script, I found one that’s already done it: https://github.com/edyesed/xml2json
fantastic! Too obvious a fix to have been left undone.
:+1: for descriptive commit messages, too - otherwise I’d never have found it
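(Another route for the XML-to-JSON step, for what it’s worth: the xmltodict library keeps element attributes and has configurable namespace handling — a minimal sketch, not the script discussed above.)
```python
# A different route for XML -> JSON: xmltodict keeps element attributes as
# "@"-prefixed keys; namespace handling is configurable via its
# process_namespaces/namespaces arguments.
import json
import xmltodict

with open("record.xml") as f:
    doc = xmltodict.parse(f.read())

print(json.dumps(doc, indent=2))  # pipe this into jq as usual
```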
2015-11-02
Can someone explain what Proquest offers for EEBO that they can charge access for? What is preventing us from just liberating all the page scans from http://eebo.chadwyck.com/ ? Page scans can’t be copyrighted anyway…
are we all just trying to avoid the heat from a license violation?
@mcburton: i guess you saw https://twitter.com/whitneytrettien/status/659514110783135744 ?
someone needs to challenge ProQuest on their assertion of copyright https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_Corp.
> exact photographic copies of public domain images could not be protected by copyright in the United States because the copies lack originality. The Court found that despite the fact that accurate reproductions might require a great deal of skill, experience and effort, the key element to determine whether a work is copyrightable under U.S. law is originality.
For some reason I have it in my head that the situation is different in the UK?
Still ProQuest is in Michigan, so it doesn’t matter right?
I guess if you do decide to challenge them it’s a good idea to have a friend who is an IP lawyer.
It’s hard to imagine them not responding after they have put language like that into their terms of service.
I’m pretty sure that any attempt to liberate page scans from EEBO will run into CFAA pretty quickly, rather than copyright assertions.
There has been no good test case in the UK to establish Bridgeman v. Corel there. Most cultural institutions assert copyright over scans of public-domain materials there through arguments including sweat-of-the-brow copyright.
I’ve run into that a lot with parish registers (and census images) that are not expressive and in some cases either centuries old or government created.
CFAA?
Computer Fraud and Abuse Act. See https://www.techdirt.com/search-g.php?num=20&q=CFAA&search=Search for examples.
According to Wikipedia, it was what Aaron Swartz was prosecuted under: https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act
2015-11-03
@ryanfb: your initial pastec results (as you posted on twitter) are very fun - I especially love the different photos where the bench is the common factor :simple_smile:
@ryanfb: I had a question about the index size — did you feed full-size images in or did you re-size them ahead of time? If I remember correctly pastec will resize them smaller if you don’t. I’m just curious if the index size would be smaller if you fed in deliberately smaller images (and if that would impact the quality of the matches at all)
@ryanfb: and when you counted the number of matches are you counting A -> B and B -> A separately or counting both as one match?
@jeresig: Yeah I just bounced in the full size images and let Pastec handle the resizing.
@jeresig: Currently writing up a blog post on the whole process right now…for the 33,029 “unique” matches number, that’s filtered down by a script where it considers any unique set of image_ids
in a Pastec search result a unique match
So e.g. image_ids: [3,2,1]
and image_ids: [1,2,3]
will be one match
interesting!
But image_ids: [4,3,2,1]
would be a new match
@ryanfb: are you filtering out single matches (e.g. you re-upload an image and it just matches itself again, returning something like image_ids: [1]
)
Yes
cool!
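(A minimal sketch of that filtering step — treating each result’s image_ids as an unordered set and dropping self-only matches:)
```python
# Sketch of the filtering described above: treat each Pastec result's
# image_ids as an unordered set, drop self-only matches, count each set once.
def unique_matches(search_results):
    seen = set()
    for result in search_results:
        ids = frozenset(result["image_ids"])
        if len(ids) > 1:
            seen.add(ids)
    return seen

results = [
    {"image_ids": [3, 2, 1]},
    {"image_ids": [1, 2, 3]},     # same set as above -> counted once
    {"image_ids": [4, 3, 2, 1]},  # a different set -> a new match
    {"image_ids": [7]},           # an image matching only itself -> dropped
]
print(len(unique_matches(results)))  # 2
```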
@ryanfb: I’m very interested to see your results! Not sure if you’ve seen it but this is the analysis that I did with the Frick Art Reference Library Anonymous Italian Art photo archive: http://ejohn.org/research/computer-vision-photo-archives/ I did a lot of manual work to try and verify the quality of the matches. And, as I think I mentioned before, pastec seems to be roughly comparable, if slightly worse, than MatchEngine (but likely still “good enough”). Usually the big question that comes up is “what are we missing with these matches? and what false positives are we getting?” Beyond a certain point (too many misses, too many false positives) whatever algorithm will become unusable. Not sure if you’re storing the “score” field as well, but I found that setting the minimum score to about 19 worked well in some of my testing.
Yeah, that’s always a hard question. For example, those plates (which I consider a really interesting match) had a score of 14. But I get a lot of matches where it seems like a calibration target in the image is causing it to match every other image with a similar calibration target, and the score is higher than that (d’oh)
:disappointed: ugh yeah, calibration targets/color bars are a real pain
btw, I have a Node module that I’ve been working on for interfacing with Pastec, fwiw: https://github.com/jeresig/node-pastec
actively developing it, fixing up some issues
Right now, I’m planning on sharing the Pastec index and match results with the blog post so anyone can play with them
nice!
My next idea is some sort of twitter bot tweeting GIFs of every “unique” match, with the Rijksmuseum URLs they’re made from incorporated as well so they show up as already tweeted on the Rijksmuseum object page (and searching Twitter for that URL will turn it up)
Though maybe too many to feasibly do without being a data-hose, as one per hour would be almost 4 years
:smile:
@ryanfb: you may also be interested in some graph/cluster analysis that I did using the MatchEngine-derived links. It came up with some really interesting groupings of artworks that were quite unexpected: https://www.youtube.com/watch?v=PL6J8MtTsPo&t=27m8s
Awesome, will take a look. Thanks!
@jeresig: I hadn’t thought of using this process to remove calibration targets before, but if you’re interested in automatically detecting calibration charts I’ve written up a survey of some different approaches: https://ryanfb.github.io/etc/2015/07/08/automatic_colorchecker_detection.html
@ryanfb: that’s fantastic!! thank you so much
Can’t wait to see this - make sure to ping me when the blog post is up :)
ughh - I wish I had more time for experimentation with new technology! it’s so much fun :simple_smile:
@ryanfb did i miss a post from you about you pastec work?
@mdlincoln: so did I hear right that you are Dr Lincoln now?
@edsu @mdlincoln @jeresig - just published the blog post now :simple_smile: http://ryanfb.github.io/etc/2015/11/03/finding_near-matches_in_the_rijksmuseum_with_pastec.html
We’ll see if a 1.5GB file trips some sort of download limit on my institutional Box account…
@ryanfb: fantastic work! thank you for detailing all the steps you took and providing links to all the resulting data - that’s most helpful!
it might be interesting to treat the Rijksmuseum image set as a sort of canonical test dataset to analyze the quality of matches. Most of the other sets that I have include images that I can’t easily distribute.
I was thinking about checking out some of the other things on https://en.wikipedia.org/wiki/List_of_CBIR_engines and seeing what some of the other free/open ones can produce in comparison
(if any of them are in as-usable a state as Pastec…)
@ryanfb: Most of the other ones are just “similarity” not “duplicate”. The only “duplicate” ones I know of are Pastec and MatchEngine. imgSeek is also open source but it just finds similar images (it’ll just keep giving you matches and it never cuts off the results - also it doesn’t look at image features like Pastec and Matchengine, it can be tricked really easily, unfortunately)
but if you find anything, I’d be extremely interested!
I’ll definitely share anything I find. Ultimately (for my work at Duke), my interest is in matching images of e.g. Ancient Greek inscriptions, for which I have a crazy-idea-which-just-might-work that I need to actually take the time and implement.
oh, that sounds cool!
@ryanfb: Is a big issue with matching inscription images not that they’re relatively similar in appearance, and would perhaps need some other kind of vectorisation that these CBIR systems don’t have?
@fmcc: Yes. The rough outline of my plan is to try to use intrinsic self-similarity within an image to match to other images (with similar self-similarity)
@ryanfb: Is that approach based upon a particular paper?
No, and that’s why it’s taking me so long to get around to it :wink:
Cool - i’m having a look at this http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.442.8702&rep=rep1&type=pdf
I’ve not come across self-similarity at all before
I think I may have turned that up before. There are a lot based around e.g. rotational self-similarity
sorry - that was literally the first thing that I came across googling, and I mentioned it for context, not because I thought it might be new to you…
No worries :simple_smile:
Just realized I can probably get away with dumping all the Rijksmuseum match GIFs on Flickr (especially now that I can set a Public Domain license). Not as discoverable through the existing Rijksmuseum interface, but oh well
2015-11-04
that would be awesome
I’m in the middle of a conversation with the Free UK Genealogy folks about opening up their data via an API. What would be the ideal format for census records, parish registers, or other vital statistics? RDF via an API? Record-type-specific CSV?
One of our challenges is that individual records can change as they are corrected or added, making it very hard to create a persistent URL to an individual record.
By contrast, sub-sets of our records could be delivered pretty easily via a URL scheme like <http://freereg.org.uk/COUNTY/YEAR/RECORD_TYPE>
is the expectation that the persistent URL would always return the same data?
i could imagine wanting to be able to periodically scrub my local data to get any updates, using the persistent url
my advice is to learn about how they have it stored now, and come up with an initial solution that impacts that the least
oh i’m just reading your next message ; so you want to have metadata in the URL
I believe that the goal is that–as a database of records–users will want to link to a specific record. E.g. a family tree record for a marriage may link to the parish register entry recording the wedding.
and if someone updates the county then people’s links break?
There’s not really a unique ID (other than the database primary key) for the record. If someone replaces the register file containing that entry, and if the bride’s father’s surname has changed, we delete the old record and create a new one with the correct info.
Any permalink to the record will be broken.
if it uses the primary key?
That would not be true of a permalink to e.g. “all entries for St. Mary’s, Burton-upon-Trent, Shropshire, 1743”
If it uses the primary key, yes.
but the problem with a permalink style url is that if the metadata changes so will the url?
Similarly if it uses a smart key derived from all meaningful fields, since one of the meaningful fields will have changed during the correction
Right.
Permalinks to sets of records work fine, permalinks to an individual record can break.
seems like either way things are changing and that if you don’t want links to break you need to remember the old ones
and 301 redirect from the old to the new
i really like how WordPress does this for example
great question btw :smile:
Yes. That’ll require some human intervention to say “Is old record XYZ the same as new record XWZ?”, but I suspect we could do that.
Then you keep the URL for the old record and redirect, as you suggest.
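(The keep-the-old-URL-and-301 idea in miniature — a Flask-flavoured toy for illustration, not how FreeREG actually does it; the record ids and data below are made up.)
```python
# Toy version of "remember the old record id and 301 it to the new one".
# A Flask-flavoured illustration only; ids and record data are made up.
from flask import Flask, abort, jsonify, redirect

app = Flask(__name__)

records = {"def456": {"surname": "Smith", "parish": "St. Mary's", "year": 1743}}
superseded = {"abc123": "def456"}  # filled in whenever a correction replaces a record

@app.route("/records/<record_id>")
def show_record(record_id):
    if record_id in superseded:
        return redirect("/records/" + superseded[record_id], code=301)
    if record_id not in records:
        abort(404)
    return jsonify(records[record_id])
```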
i guess i prefer the permalink style url
if it can be achieved without too much trouble
they are much more hackable
Thanks! This will be less of a problem as we move the volunteers onto an online transcription system. At the moment they’re uploading CSV files with large batches of entries.
one compromise between RDF and CSV is CSV on the Web
it’s basically just csv, with a sidecar json-ld file that defines the semantics of the csv file for anyone that wants to turn it into RDF
a while ago i tried to create a simple example here http://edsu.github.io/csvw-template/
oh, i see the conversion is no longer working #sigh
This is very interesting to me. Thank you for the link.
++
@benwbrum: you probably know this already but there was significant interest from the semweb/linked data community in genealogy
I’m pretty new to the linked data world. My whole experience started by skimming the O’Reilly Linked Data book on the flight to Philadelphia for the IIIF hackathon a few weeks ago, then plumbing a JSON manifest generator into FromThePage. So there’s a huge amount I don’t know.
@benwbrum: btw, I fixed my csv metadata file so that conversion works now http://edsu.github.io/csvw-template/
i think CSVW is a nice example of how you can make your data easy to consume and use, while also making a high fidelity semantic version available too
for the people that want that
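(For reference, a CSVW sidecar is just a small JSON-LD file next to the CSV that declares what each column means — a minimal sketch with made-up column names, following the same pattern as the csvw-template example above, written out from Python:)
```python
# Rough sketch of a CSV on the Web sidecar: a small JSON-LD file that sits
# next to the CSV and says what each column means. Column names are made up;
# see the csvw-template example above for a real one.
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "freereg-sample.csv",
    "tableSchema": {
        "columns": [
            {"name": "surname", "titles": "Surname", "datatype": "string"},
            {"name": "parish", "titles": "Parish", "datatype": "string"},
            {"name": "year", "titles": "Year", "datatype": "gYear"},
        ]
    },
}

with open("freereg-sample.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```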
@benmiller: ian davis was a very prominent figure in the linked data community and also quite into genealogy which he blogged about http://blog.iandavis.com/tags/genealogy/
he’s still around, but last i heard was more interested in game development with golang :simple_smile:
2015-11-05
Curious what everyone thinks about this: http://talkinghumanities.blogs.sas.ac.uk/2015/11/05/re-using-bad-data-in-the-humanities/
Interesting piece, nice that it highlights the promise of eMOP
I think that the re-use angle is a bit off - probably more conceptually accurate to think about preparation of collections for unanticipated use
and there’s the chicken&egg problem: can’t create a data aggregation site if no one is following a standard <-> no one will follow a standard if they don’t need to submit their work to an aggregation site
Yeah, and there’s a bit of conflation going on - e.g. institutions that create collections vs. researchers that create derivative datasets from them (wherein of course we’d like to see reuse)
Ah! Yes, I think that’s what was rubbing me the wrong way.
Interesting either way of course! Went into the ol’ Zotero. Will probably reference in a piece I’m working on right now.
Anybody have some Gephi data I could use today for a quick tutorial I’m giving to grad students?
Thought about using the Les Miserables set, just for simplicity, but wondered if there was something out there more interesting.
i have the data from the dh conference ya’ll had there a few years ago
derived I think from a dataset that elijah created
@jheppler: I’ve got a couple here:
They’re literary networks, taken from the Gospel of Luke.
@coryandrewtaylor: Thanks!
@jheppler: No problem!
@jheppler we had an ma student a few years back do historical SNA for his thesis - all his files are on figshare http://figshare.com/authors/Peter_Holdsworth/402385
@jheppler i bet @mdlincoln has some about Dutch Engravers :smile:
@shawngraham what is Holdsworth 1898?
oh holdsworth is the name of the student?
http://figshare.com/articles/Holdsworth_1898_Dataset/727769 so what he did was look at the membership rolls of women’s service organizations in the run up to the centenary of the war of 1812, to see how ideas of commemoration spread around Ontario
there’s a neat bit where he looks at the structure of social networks against the structure of the rail network…
yes, Peter Holdsworth. Really neat guy.
If they need to work with an unwieldy dynamic network with dated nodes and edges: https://gitlab.com/mdlincoln/dh2015/tree/master/data-raw
The bm_print_nodes and bm_print_edges might be the easier ones to work with
the full R package comes with documentation for all those data, too - but I think you know how to navigate that @jheppler
unless i am misremembering my various DHers’ languages
^ wouldn’t THAT make for an interesting paper
The rkm nodes and edges are a similar format, though much of the descriptive data is in Dutch
@edsu: Thanks! Yeah I looked into Flickr but their support for animated GIFs is kind of weird
yeah, tumblr is probably better
Oh, nice!
that will get Rijksmuseum eyes on it i think
maybe they are already on it anyway :simple_smile:
Yeah I’ve @’d the main account a few times but I’m sure that account is probably a notifications hose for some poor person…
haha, yeah
peter is pretty awesome
i’m a fan anyway
I saw him present at NDF in New Zealand a few years ago about Rijksstudio https://www.youtube.com/watch?v=iW17d-OQsIs
Cool :simple_smile: Yeah, anyone who’s there involved in their push to make everything of theirs freely available online is probably good people…
2015-11-06
figured there could be some stuff of interest to folks here http://socialcomputing.asu.edu/pages/datasets
@ryanfb: that tumblr is fantastic! Browsing through the matches is very fun - and I like that you added tags to the posts, as well!
Nice!
@ryanfb: “Age yourself to view the future you!”
That’s a great one! Love seeing how the woodcut medium was (ab)used :simple_smile:
@jeresig: So the centre of the block was cut out so they only had to recreate the face?
@fmcc: precisely! they did this in Japanese Woodblock prints, too - one sec, let me post a photo
Trying to look up where the two images were then - quite interesting with that information that it’s Charles the Second that’s being removed to make way for William of Orange
My CV stuff found both of these matches. The one above (with the two men) are not only different kabuki actors but the artist signatures on the prints are different, too! They had chopped out the old signature and added a different one, for some reason.
Nice!
Do you know which one the original was?
@fmcc: not 100% sure (esp for the first one, since there are few other identifying details). I bet you could look really close and see where the woodblock had chipped — whichever one had less chips in it would’ve been the earlier one (since they naturally degrade as they’re used)
That’s interesting - the one on the right looks like the quality isn’t quite as good, but I guess that could be just ink bleed
that’s based a bit on intaglio printing though - really have no idea what wood block is like vs. linocut which is the only other relief printing i’ve done.
@jeresig: if you ever make a computer vision system for automatically ordering woodblock prints based on the chipping, I vote for calling it “woodblockchain”
@ryanfb: :smile:
The Carnegie Museum of Art has posted all of their metadata as a CSV file on Github. +1 for great documentation too! https://github.com/cmoa/collection
Are people familiar with dat
? http://dat-data.com/ I feel like it’s really awesome, especially for all these large CSV datasets
It’d be cool if there was a central server which hosted all these datasets for easy access
Yup, it’s been a subject of discussion a ways up in the group history - I’d love to see some uni libraries start to mirror & archive some of these datasets
I’ve yet to experiment with dat personally, though - is it stable-ish yet?
2015-11-07
re: librarians - we’re working on it!
2015-11-09
Not sure if this is the best channel, but Google has open-sourced their TensorFlow machine learning library:
@coryandrewtaylor: this looks really interesting - i’m going to poke about with this tonight
Looks like the Zooniverse/NYPL collaboration to extend the Scribe codebase is finally available: http://scribeproject.github.io/
I’ll be checking it out in detail for a client over the next several weeks.
Tropy, a tool for research photograph management, is looking for input on user practices and needs: https://docs.google.com/forms/d/1gxeRwzxQZNeOr4VJSvaUBYYfLomo2UFtAhA0YJ7vr14/viewform
Original announcement for Tropy at CHNM: http://chnm.gmu.edu/news/rrchnm-to-build-software-to-help-researchers-organize-digital-photographs/
@benwbrum: is the NYPL Scribe codebase related to the Zooniverse Scribe project?
@edsu: Zooniverse developed Scribe back in 2010-2011, open-sourcing a version of it at the end of 2011.
yup i remember that
Both FreeUKGenealogy (my client) and NYPL forked that to start separate projects.
NYPL’s went into Ensemble, and was the basis for an NEH grant to extend it further.
We used Scribe as a jumping off point for the “delivery mechanism” – the searchable database populated by data entry through Scribe.
When NYPL/Zooniverse got the NEH grant in September 2013, we shelved that effort to focus elsewhere while they made the Scribe code more robust. (That wasn’t the only factor).
And now it’s out!
scribeAPI?
What’s not entirely clear to me is whether the ScribeProject code (which appears to be ScribeAPI) is behind non-NYPL projects like Shakespeare’s World (Zooniverse+Folger) and AnnoTate (Zooniverse+Tate).
It certainly is behind Measuring the ANZACs.
Regardless, I’ve only had 10-15 minutes to go over the docs and no time to go over the code. I should have more figured out soon.
thanks for the info, that helped me a lot!
Sadly, I’ve lost my technical contact at the Zooniverse, as Stuart Lynn is now at CartoDB.
Free UK Genealogy will be rebooting their online structured transcription effort shortly – maybe after version 2.1.4 of FreeREG, maybe after 2.1.1 of FreeCEN. Next 2-3 months, regardless.
I’m not sure whether we’ll work under the aegis of Open Source Indexing again or not.
I do hope we’ll be able to use ScribeAPI, since its predecessor influenced so much of our technical stack three years ago.
have you written about that at all?
We got a lot of sample project definitions.
thanks!
Only one offer to help, from geneanum.
There was a lot of interest in creating an open-source tool in the same space as FamilySearch Indexing at RootsTech in spring 2013. Not a lot of resistance to open source for the tool, though most of the vendors hoped to use the tool to build paywalled databases, of course. I did get the impression that the idea was novel.
i like the idea of a framework, rather than a turnkey solution
since transcription efforts seem to vary so much in their presentation
but i’m a newb when it comes to this stuff
Have you seen the Zooniverse Project Builder (“Panoptes”)?
no, i have not
It’s a nice crowdsourcing framework that doesn’t include transcription.
good name
Very impressive, very usable. @mia and I used it in our crowdsourcing class at HILT this summer.
That’s the one.
It’s hosted, and open for anyone.
You’ll need an account, however.
It lets you ask multiple-choice questions, or “drawing” questions that ask users to select a region of the image.
Those latter answers can have another step, presenting them with multiple choice questions about the drawing.
2015-11-10
2015-11-11
I’ve raised the git/github question on here before, but I’m wondering if anyone has examples of CONTRIBUTOR policies for data repos? I’m curious what best practices would be for handling, say, pull requests on github for a repo that is generated from an upstream CMS. You might not want to just accept the changes without implementing them in your CMS and/or the scripts that generate the repo, so how does one make that process clear to people who clone/fork your repo as if it were any other open-source project?
2015-11-12
we (cooper hewitt) don’t have anything stated - but we have a link on our website’s object pages for people to email (using zendesk) corrections etc., which gets passed on to the appropriate curator to make the update in TMS.
so i imagine our statement would say something like “we welcome all pull requests which concern data formatting, organization etc… for cataloging errors, please visit the appropriate page on our website and follow the feedback link there”
So what if someone sends a PR that, say, reformats the way that you’ve serialized your data (say, expressing an array as an object instead) and you want to incorporate their changes. Would you actually accept that PR, and then reverse-engineer your export scripts to reproduce that formatting change in your next database update?
i guess we would encourage any data-reformatting to be done in a script that future users could run (eg in this folder - https://github.com/cooperhewitt/collection/tree/master/bin). like if someone wanted to write an import-to-(db of choice) script we’d accept that. how we format our data is kind of irrelevant at that point
Neat - I like the idea of encouraging documented scripts that refer to the canonical extract
It usefully side-steps the demand of having to be everything to everyone
we should definitely be more explicit about it, though. i’ll come up with something and add it in before the week’s out