data-sharing
Standards, protocols, strategies for distributing data
2015-10-16
I’m on the hunt for GLAM or other DH projects that are using git and/or torrents as a way to distribute versioned data. I know the usual suspects (Tate, MoMA, Cooper Hewitt). N.B. I’m distinctly not looking for JSON-based API stuff
I don’t know of projects off-hand, but some library folks are planning to discuss at DLF in a couple weeks. Might be of interest http://dlfforum2015.sched.org/mobile/#session:b37651e41eceef99db5b6be017da48a2
@roxanne: brilliant, thanks - I will have to keep an eye on that
2015-10-17
@mdlincoln: Not sure if this is what you’re looking for, but I’m part of a poetry-and-music corpus analysis project that has all data and scripts version controlled and stored on GitHub: https://github.com/corpusmusic/liederCorpusAnalysis.
2015-10-18
Also on the topic of torrents, has anyone heard of or used http://academictorrents.com/ ?
I vaguely remember this crossing the bow of the ol twitter - seems like a cool approach
we were just discussing making some library data available via torrent the other day, but it didn’t seem optimal for a relatively exceptional case where collections have restrictions
yeah, certainly it wouldn’t make sense for data that had selective permissions
But, as an example, I’ve spent days of scripting time pulling down the CC0 collection data and images from the semi-dysfunctional https://www.rijksmuseum.nl/nl/api and it seems like a torrent of the filedump (~150GB) would be a much better way to share the info
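(For anyone curious what that scraping involves, here is a minimal sketch of a paging harvester — not the actual script used above. It assumes you have your own API key and that the /api/nl/collection endpoint, with its artObjects, objectNumber, and optional webImage fields, behaves as documented.)
```python
# Minimal sketch of a paging harvester for the Rijksmuseum collection API.
# Assumptions: you have an API key, and /api/nl/collection returns
# "artObjects" with "objectNumber" and an optional "webImage" dict.
import os
import requests

API_KEY = "your-api-key"
BASE = "https://www.rijksmuseum.nl/api/nl/collection"

def harvest(pages=5, page_size=100, out_dir="images"):
    if not os.path.isdir(out_dir):
        os.makedirs(out_dir)
    for page in range(1, pages + 1):
        resp = requests.get(BASE, params={
            "key": API_KEY, "format": "json", "ps": page_size, "p": page})
        resp.raise_for_status()
        for obj in resp.json().get("artObjects", []):
            web_image = obj.get("webImage")
            if not web_image:
                # rights-restricted objects have no downloadable web image
                continue
            img = requests.get(web_image["url"])
            img.raise_for_status()
            path = os.path.join(out_dir, obj["objectNumber"] + ".jpeg")
            with open(path, "wb") as f:
                f.write(img.content)

harvest()
```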
oh yeah, totally agree, and am interested in alternatives
have you looked into OPenn’s rsync option?
seems like a promising approach
they also offer ftp access
O if only everyone thought like Will Noel :simple_smile:
2015-10-19
I’d also throw the following in the mix, nice if you need to turn some json > csv for intro workshops https://github.com/n3mo/jsan
@mdlincoln: i’ve seen it in passing but haven’t seen much use. likely due to many academics dealing with licensing issues. If you are looking at other examples of torrent use then the Internet Archive is a good one as they offer most of their downloads as a torrent option. I know @edsu has put up some sets like the ferguson tweet archive http://inkdroid.org/2014/11/18/on-forgetting/
Well, if any of you brave souls have ~160GB of free space, I’ve tried assembling my first torrent file here - I’d love to know if it can actually work: http://matthewlincoln.net/2015/10/19/the-rijksmuseum-as-bittorrent.html
2015-10-20
2015-10-21
2015-10-22
A nice overview of CIDOC-CRM by way of mapping the Victoria & Albert API to LOD: http://conaltuohy.com/blog/bridging-conceptual-gap-api-cidoc-crm/
Interesting piece on http://Academia.edu, open access, and a shift from content-gatekeeping to metrics-gatekeeping: http://blogs.lse.ac.uk/impactofsocialsciences/2015/10/22/does-academia-edu-mean-open-access-is-becoming-irrelevant/
2015-10-23
Speaking of distributed content delivery, has anyone been watching the development of IPFS?
i tuned out for a few days and now there’s all this interesting convo to review!
@mdlincoln: i have been tracking ipfs a bit ; i think it’s a really interesting idea, and ties into what you were talking about earlier w/r/t bittorrent right?
it attracted the attention of Brewster Kahle at Internet Archive fairly recently too http://brewster.kahle.org/2015/08/11/locking-the-web-open-a-call-for-a-distributed-web-2/
well that’s some good attention!
i know right! it would be fun to take an hour to try to get it going sometime
i’ve only read about it so far
have you announced that torrent very widely yet?
As someone outside the institutional repository loop, I’ve always been curious how much these types of technologies, or even something as “dull” as rsync, are discussed
not often enough, imho
Interesting, hadn’t seen IPFS before
Yes, as best I could! Trevor Owens gave it a good signal boost with a tweet earlier
oh!
Now wondering how hard it would be to adapt ipfspics into an IIIF shim/proxy
Though at the moment we’ve just got a handful of downloaders, including a patient person with high bandwidth in brooklyn :stuck_out_tongue:
I’ve got permission to seed it from one of the department’s machines, so there is at least one full copy going aside from mine, when I’m home and have my external HD hooked up
I guess ideally, this is the kind of thing that you could get a consortium of libraries to all seed together
:stuck_out_tongue:
you would think right?
haha
I mean, I think it would be a great idea for research libraries to mirror datadumps put on github
hint hint
Though as @thomaspadilla pointed out, it doesn’t even have to be as fancy as git or bittorrent - OPenn does just fine offering rsync
rsync has the added benefit of allowing the dataset to change over time, which is a bit trickier with bittorrent
i think that’s something the academictorrents people were trying to work with?
yes, dataset change is certainly an issue
the disadvantage of rsync is that you don’t get the advantage of the swarm, where everyone shares the cost of distributing the data
right?
yes, that’s my understanding
so OPenn pays to make it available to the world
a familiar model :simple_smile:
yep. basically, we want git/git-annex crossed with bittorrent
I think there is actually a git/bittorrent project kicking about, i’ll try to find it
called gittorrent … of course
heh, i wonder if that actually works
119 forks, woah
I don’t really know anything about it, though I think I did encounter some discussion of potential issues - probably a HN thread FWIW
it wouldn’t be computing if there weren’t issues I guess - haha
yeah, IPFS seems a better shout than the gittorrent
ipfs is an interesting social experiment on its own
@ryanfb: good luck!
@ryanfb: FWIW our upload speed at the UMD seed has fluctuated between 2K and 4M, so…
Ah, ok…maybe I’ll just leave it running over the weekend and see what happens
I know that un-shaped, the connection on this machine is pretty fast
yes, do try if you can. At the moment we’ve just got one downloader on the UMD seed, which was just going quite speedily an hour ago
so it may be them, not us
ps @edsu if this is something that MITH et al could/would seed, I can hand deliver the data on a hard drive :simple_smile:
this channel is amaze, my two cents :simple_smile:
glad you like, @thomaspadilla :goat: I must admit I understand the mechanics of the IPFS and gittorrent stuff just enough to know that they sound both interesting and complex (socially, as much as technically) to implement, and that’s about it
indeed-y : cultural heritage orgs need provocations like this
@mdlincoln: I’d have to ask trevor :simple_smile: all we have is AWS now, i believe you can seed s3 buckets
@mdlincoln how about i put it up on InternetArchive?
@edsu: let’s talk after my talk next Tuesday - I can give you a copy of the data at the very least
@mdlincoln: you know that things can torrent from InternetArchive right?
I was looking into that as well, apparently you can also torrent upload but it won’t seed back on the same torrent after it finishes http://archive.org/about/faqs.php#Archive_BitTorrents
wow, torrent upload — cool
@mdlincoln did you follow the work that john resig did w/ computer vision for the Frick Museum?
i wonder if something could be done with these images
maybe something with opencv?
yeah, i’m definitely not an expert when it comes to this sort of thing
i know resig was working with tineye
at the time
yeah I had seen miriam posner post a couple of things recently, seems like interesting applications
probably also something to learn from wragge re: facial identification and so forth
probs also the cool stuff cooper hewitt labs have been doing
resig used it to connect up japanese prints from the same woodcuts
so you could see how woodcuts were used across and within museum collections
yes, they did some interesting work browsing by color
that would be cool to see right?
hah I suspect that highbandwidth Brooklyn leecher is john :simple_smile:
Shannon entropy could be pretty interesting to work with vis-a-vis prints, in fact
I use the measure in my diss for thinking about artistic diversity based on subject keywords
but it’s applicable to all sorts of signals
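(For context, Shannon entropy over a bag of subject keywords is quick to compute — a sketch of the measure, not the actual dissertation code:)
```python
# Shannon entropy of a print's subject keywords: a quick illustration of the
# measure mentioned above, not the actual dissertation code.
from collections import Counter
from math import log

def shannon_entropy(keywords):
    counts = Counter(keywords)
    total = float(sum(counts.values()))
    return -sum((n / total) * log(n / total, 2) for n in counts.values())

# Four distinct subjects -> maximal diversity for four tags (2.0 bits);
# one dominant subject -> lower entropy (~0.81 bits).
print(shannon_entropy(["landscape", "cattle", "river", "windmill"]))
print(shannon_entropy(["portrait", "portrait", "portrait", "hat"]))
```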
sigh
I actually wonder what would happen if I advertised it instead as a computer vision dataset with richly tagged images :)
@mdlincoln see a new leecher?
oops started the torrent on the wrong partition - restarting :simple_smile:
@mdlincoln it does look like you can upload the .torrent file to internet archive and they will leech it and then seed it
these words are so weird, i always feel like i am using them wrong
haha
well you do indeed seem to be downloading it :simple_smile:
yay local network connections
i almost did it myself, but then thought you should be the one?
that’s actually going out to amazon cloud
internet archive make it very easy to do
i’ll admit i got slightly cold feet when i was reading the terms of service and wondering what an app was
it clearly says the data is in the public domain or cc0
I’ve heard tell of this project before, but haven’t followed it much. It looks like it is more built around versioning and distributing modeled data (key/value or tabular) - not sure about how it handles large binary files
but I do see it has R bindings so that makes me :dancer:
yeah, it is a neat project
the lead developer has put some interesting videos up
hmm ok, added to the list of things to look at more in depth
I could imagine developing some very interesting multi-party data collaboration platform built on a dat “trunk”
haven’t watched that one ; but it might be relevant
:smile: Earlier today when I was reading this thread I was thinking “oh, this reminds me, what about that Dat project?” and there y’all are. :muscle: I did step through most of the tutorial at http://try-dat.com which is cool, does on-the-fly docker environment deployment to let you run the tutorial code in browser as you go.
@mdlincoln: here’s what the tutorial says about binary files
> But what if you have a large non-tabular file that you want store in dat? To accommodate this, dat is capable of adding files to a dataset that is named, appropriately, “files.” These attachments are sometimes called “blobs,” which is short for “Binary Large OBjectS.” A blob can be any form of binary data, but for now just think of a blob as a file, like one you might put in Dropbox or attach to an email.
> For the sake of speed and efficiency, dat doesn’t store blobs inside the datasets. Instead they’re kept in a “blob store” – a special directory – with each having an indexed “blob key.”
Also TIL I can’t hardly type dat and not do data^H
sounds similar in principle to how git-lfs works
2015-10-25
folks, I’m starting to play around with jupyter notebooks. I don’t do much python, but I do futz from time to time with R. This: http://irkernel.github.io/ seems completely borked (and the discussion in the issues channel on github is pretty much beyond me). But I tried again with miniconda and R essentials ( https://www.continuum.io/blog/developer/jupyter-and-conda-r ) and I was able to get an R notebook up and running. But the kernel keeps dying. Anyway - just wondered if anyone else was messing around with this stuff.
2015-10-26
@shawngraham: I’m not an R user, but I use Jupyter quite a lot - not tried out any of the more recent d3.js integration yet though
I’ve not looked into it that much at the moment, but am interested in the idea of ipython notebooks as a way of publishing research or notes on a subject
been trying to use the markdown boxes to keep commentary on code and what i’m actually up to - had wanted to look into pandoc integration more
I’d like to see if there were ways to transform the notebooks into slicker webpages etc.
fwiw I recently saw Patrick Burns (https://fordham.academia.edu/PatrickBurns == http://pbartleby.com/ == @diyclassics ) give a talk illustrated entirely from an ipython notebook, but I don’t see any of that stuff online anywhere.
very cool. I’ve used knitr etc to get my R code & results online, and I’ve played with pykwiki as a way of pushing my reading notes online. I believe you can push jupyter to a reveal.js type slide show too. I’m thinking of OA journals and ideas around taking a jupyter notebook as an article… this is all very nebulous in my head. Turns out though there’s something fairly recent which is preventing the R kernel from playing nice with jupyter - https://github.com/IRkernel/IRkernel/issues/205 Hopefully gets resolved soon.
ah knitr - I think that’s definitely what sparked off my thoughts about pandoc etc. above - I’m pretty sure i’d read this article http://galahad.well.ox.ac.uk/repro/ and had intended to look into it more carefully
really I just want to never write any LaTeX again
@fmcc blasphemy! :wink:
@paregorios: I know, so many sunk hours!
somehow I never actually got sucked down that slippery slope … I’m just in a taunting mood
sorry
paging @mcburton, just back from the jupyterday meetup in NYC this weekend
@paregorios: Well, it was a combination of doing a commentary on a Greek text, and wanting to include technical line drawings, and being a bit of a typographical aesthete
(first and last time i’ll ever call myself an aesthete…)
@mcburton: how was jupyterday?
I’ve not tried Jupyter with R, as between RMarkdown and Shiny there is already a fairly functional dynamic publication system in place. I have been meaning to give it a try though
@shawngraham @fmcc: I have used Jupyter Notebooks extensively and have thought about publishing with Notebooks. You can use the nbconvert tool to transform Notebooks into HTML
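(nbconvert can be driven from the command line — jupyter nbconvert --to html notebook.ipynb — or from Python; a minimal sketch assuming a recent nbconvert install. There is also a SlidesExporter for reveal.js-style slides.)
```python
# Minimal sketch: convert a notebook to standalone HTML via nbconvert's
# Python API (equivalent to `jupyter nbconvert --to html notebook.ipynb`).
# Assumes nbconvert is installed alongside Jupyter.
from nbconvert import HTMLExporter

exporter = HTMLExporter()
body, resources = exporter.from_filename("notebook.ipynb")
with open("notebook.html", "w") as f:
    f.write(body)
```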
@shawngraham: also, Jupyter can use R, it is no longer just a python project.
@edsu: JupyterDay was amazing. It was mainly a bunch of computational scientists and folks from industry. If you haven’t already, check out the hashtag https://twitter.com/search?q=%23jupyterday&src=typd
i did watch some of the twitter activity ; but will take a closer look
any big takeaways?
Buzzfeed and the data journalists are WAAAAAY ahead of the digital humanities folks when it comes to publishing this stuff
integrating code + narrative + data
yes, thats right, BUZZFEED
O’Reilly is also starting some experiments with Thebe which lets you embed executable code cells in any HTML document
but it is really hacky at the moment
Lev Manovich showed up, so I wasn’t the only digital humanist
Lorena Barba, https://twitter.com/LorenaABarba, gave a really nice talk about computational literacy and computational learning. The Jupyter Grant proposal has some really interesting stuff around computational narratives, http://blog.jupyter.org/2015/07/07/project-jupyter-computational-narratives-as-the-engine-of-collaborative-data-science/, that really needs humanists to help them better understand
Radical/Networks was going on not too far away, which looked interesting too http://radicalnetworks.org/program/index.html
too many things happening in NYC
totally ; i did run across some of the radical/networks presentations here http://livestream.com/internetsociety/radicalnetworks
on the topic of #jupyterday, there was an interesting conversation in meatspace and on twitter about long-term preservation of notebooks and code being produced by journalists
Buzzfeed basically just uses Github as their repository
yes
that led me to put my foot in my mouth about Zenodo (I thought it was a for-profit repository)
they do some nice work
yeah, I need to learn a bit more about the project. I didn’t realize it was run out of CERN
interesting, i didn’t know who was doing it
science people
I also learned of a new digital repository, Invenio http://invenio.readthedocs.org/en/latest/introduction/about.html
i had used it to register a DOI for the twitter archiving utility i worked on https://github.com/edsu/twarc
eek, who is that dude
twarc man
@shawngraham @fmcc: Here is a recent post about D3.js integration into the Notebook. http://blog.thedataincubator.com/2015/08/embedding-d3-in-an-ipython-notebook/
@edsu: Maybe you’ve seen this, but the Binder project is pretty great for creating temporary Notebook deployments right out of a github repository http://mybinder.org/
wow, no i hadn’t seen that
although pinboard is telling me I already bookmarked it ; so I guess my memory is faulty :simple_smile:
seems like it could be super handy for classes?
yes, I used it for my Twitter bot workshop
click the “Launch Binder” shield
neato
are there any ramifications to actually putting auth keys into that?
well, I don’t want to put my keys into the Github Repository…
understandably
I did play around with having a single app on my account and then set up a workflow where the students would authorize their bot accounts with that app. But I opted to teach them how to set up their own developer accounts instead
it’s nice ; does their modified notebook get deleted when they leave the session?
yes, timeout is an hour
of inactivity
they are working on ways to make it so modified notebooks could be pushed back up to github
very cool
yeah, it is by a group of neuroscientists at Janelia https://twitter.com/thefreemanlab
they are CRAZY productive and not too far from DC
They work in a building that looks like starfleet academy and they are building the matrix (for mice and zebrafish)
@mcburton: the R kernel kept dying on me all the time - something to do with conda and rzmq not playing nice in recent days. Anyway, I think I’ll just continue to watch from the sidelines, given, as @mdlincoln says, R markdown & shiny are pretty nice…
@shawngraham: Rmarkdown is awesome, if I was doing R I’d use that over Jupyter Notebooks
@mcburton yeah, I’m using R for text stuff, and python for sound stuff. Probably the most awkward dh guy ever, me.
@edsu: @jeresig is here too. Hey John!
hello! :smile:
@mcburton: it was this that seems to be killing jupyter for me: https://github.com/IRkernel/IRkernel/issues/205
@shawngraham: you might look into using venv
instead of conda for managing your python environment https://docs.python.org/3/library/venv.html
@mcburton: ah cool, thank you! yeah, I gotta learn to keep things separate for different tasks/projects.
that’s probably something we could do a bit more of in terms of teaching dh stuff. Or maybe, more accurately, i ought to do more of…
@mcburton @shawngraham venv is python only though, so won’t help much with R kernel installation issues?
@fmcc but at least it’ll keep me from screwing up other things on this machine!
If you’re doing python i’d say it was totally essential
invest in python virtual environments; the payback is huge. And I like https://virtualenvwrapper.readthedocs.org/en/latest/
oh cool @paregorios thanks!
the pattern I use is to create a directory ~/Envs/ and use mkvirtualenv to create all the components of each environment I need there. It’s disk-extravagant, but I just create a new one for each project and name the environment to match the top-level project directory. This has allowed me to write a little script that puts me in the project directory, activates the associated virtual environment, etc.
that keeps the venv and its binaries out of the way of your project-related repository etc.
@mdlincoln: Quick question about the Rijksmuseum torrent - any particular reason for splitting it into TGZ files? I only see ~5GB total savings from compression…I’d think uncompressed would make it easier to use the dataset while continuing to seed it
(don’t know if I thanked you for it yet, @mdlincoln, but that Rijksmuseum torrent is just awesome!)
@jeresig we were chatting about your work with tineye and the frick the other day in here, thinking about that torrent, and wondering if there is something interesting to do
oh nice!! Hmm, there very well may be some opportunities there
on a related note (I haven’t written about it yet), I’ve been having very good luck with the Open Source “pastec” framework recently. It’s very similar to TinEye’s MatchEngine in functionality, the quality is roughly comparable, but it’s Open Source! https://github.com/Visu4link/pastec
oh! i was just going to ask if you were still working w/ tineye, nice
There are some features that I wish existed so I want to start talking with its creator - and maybe start writing guides on how to use it.
that would be awesome
For http://Ukiyo-e.org I’m still using it - but I think I might start transitioning, or at least providing an alternative, in my open source projects
Great!
I have some data showing a quality comparison between the two technologies, as well. However it’s using some private images from the Frick and I need to get permission to release them.
jeresig: Thanks for the heads-up on Pastec; currently experimenting with bouncing all of the Rijksmuseum torrent into it :simple_smile:
@ryanfb: haha, awesome! let me know what you find out :simple_smile: One issue that exists right now is that if you want to query something that is already in the database you need to re-upload the image again (rather than just say “give me everything that looks like image #123”). But it’s not a huge deal, thankfully!
Yeah, my plan was to use it that way to try to detect duplicate images
*near-duplicate
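(Rough sketch of what that near-duplicate pass could look like against a local Pastec server. The HTTP endpoints here — PUT /index/images/&lt;id&gt; to add, POST /index/searcher to query, default port 4212 — are assumptions taken from memory of the Pastec README, so double-check them against the docs.)
```python
# Rough near-duplicate pass against a local Pastec server. Endpoints are
# assumptions based on the Pastec README - verify before relying on them.
import glob
import requests

PASTEC = "http://localhost:4212"
paths = sorted(glob.glob("images/*/*.jpeg"))

# 1. Index every image, using its list position as the Pastec image id.
for image_id, path in enumerate(paths):
    with open(path, "rb") as f:
        requests.put("{}/index/images/{}".format(PASTEC, image_id), data=f.read())

# 2. Re-upload each image as a query; keep results that match more than itself.
matches = []
for image_id, path in enumerate(paths):
    with open(path, "rb") as f:
        result = requests.post("{}/index/searcher".format(PASTEC), data=f.read()).json()
    ids = result.get("image_ids", [])
    if len(ids) > 1:
        matches.append((path, sorted(ids)))

print(len(matches), "images with at least one near-duplicate candidate")
```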
Also I’ve been idly pondering the idea of trying to use something like this for numismatics, and there are a couple domain-specific publications about CBIR for coins (though no running servers I can find), but if this gets OK results out of the box it might be a cool proof of concept :simple_smile:
@ryanfb: Nice!! I’ve considered that exact use case, as well. One concern that I had (without testing any data) was over how well it’d work with well-worn coin faces (I suspect that it’d probably struggle). But I’d be very interested in seeing the results of that!
Also it might be also worth playing around with imgSeek (which is just an image similarity tool, doesn’t have a concept of “duplicate”) http://sourceforge.net/projects/imgseek/
Thanks for reminding me about it!
No problem! I’m constantly watching for new open source solutions to these problems - I’m very surprised that there aren’t more!
Yeah, seems like a lot of computer vision stuff gets locked up in proprietary systems, unfortunately…
Cheers @jeresig - I thought you’d like the torrent :stuck_out_tongue:
2015-10-27
2015-10-28
@mdlincoln: one note on the Rijksmuseum torrent…seems like there are 13,101 empty image files, which correspond to copyright-protected images (which don’t have a webImage resource in the Rijksmuseum API). Might be something to note in the torrent details.
seems like it could be useful sometimes
@ryanfb: hmmm, that’s frustrating. Can you give me a filename example? If the object does not have a webImage, it ought to never have been downloaded and made into a file, anyway
but I wouldn’t put it past either their API or my bash skills to have messed that up somehow
@mdlincoln: images/55/RP-F-F03020.jpeg
You can find them all with find images -type f -name '*.jpeg' -size 0
I’m actually double-checking that list and there seem to be a handful that have zero-size files and a non-nil webImage
ah, handy command there
it’s worth investigating - though I can’t say when it’ll get up to the top of my to-do list :tired_face:
no worries - current count is just 3 images like that :simple_smile:
I’ll probably have a blog post with the results of my Pastec experiment sometime next week
@edsu: Yes, I remember coming across this a while ago and forgot about it. I haven’t used it, but it is something I want to play with
https://source.opennews.org/en-US/articles/introducing-agate/ > agate is a Python data analysis library in the vein of numpy or pandas, but with one crucial difference. Whereas those libraries optimize for the needs of scientists—namely, being incredibly fast when working with vast numerical datasets—agate instead optimizes for the performance of the human who is using it. That means stripping out those technical optimizations and instead focusing on designing code that is easy to learn, readable, and flexible enough to handle any weird data you throw at it.
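(A tiny taste of what agate code looks like, using a hypothetical objects.csv — the column names below are made up for illustration.)
```python
# A small taste of agate with a hypothetical objects.csv - the "century"
# column name is made up for illustration.
import agate

table = agate.Table.from_csv("objects.csv")

# Count objects per century and print a quick text table, largest first.
by_century = table.group_by("century").aggregate([("count", agate.Count())])
by_century.order_by("count", reverse=True).print_table()
```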
2015-10-29
for when someone has sent you XML that merely needed to be JSON
2015-10-30
So I may have spoken too soon regarding the above utility - when asked to strip namespaces, it also demolishes attributes :cry: Does anyone have other utilities they have found useful for converting a lot of XML files into JSON?
Yes, I could bite the bullet and get back to using xpath selectors… but JQ works so well for producing normalized tables out of denormalized documents
@mdlincoln: The namespace thing doesn’t look like it would be too much code to change
hmm, looking at it now, that’s a good point. I can guess my way through python, yah? :wink:
yeah, well if you give me a short example file, and what you expect the output to be, i’ll pull and modify it
Thanks! But looking through forks of the script, I found one that’s already done it: https://github.com/edyesed/xml2json
fantastic! Too obvious a fix to have been left undone.
:+1: for descriptive commit messages, too - otherwise I’d never have found it
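(Another route for the XML-to-JSON step, for what it’s worth: the xmltodict library keeps element attributes and has configurable namespace handling — a minimal sketch, not the script discussed above.)
```python
# A different route for XML -> JSON: xmltodict keeps element attributes as
# "@"-prefixed keys; namespace handling is configurable via its
# process_namespaces/namespaces arguments.
import json
import xmltodict

with open("record.xml") as f:
    doc = xmltodict.parse(f.read())

print(json.dumps(doc, indent=2))  # pipe this into jq as usual
```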
2015-11-02
Can someone explain what Proquest offers for EEBO that they can charge access for? What is preventing us from just liberating all the page scans from http://eebo.chadwyck.com/ ? Page scans can’t be copyrighted anyway…
are we all just trying to avoid the heat from a license violation?
@mcburton: i guess you saw https://twitter.com/whitneytrettien/status/659514110783135744 ?
someone needs to challenge ProQuest on their assertion of copyright https://en.wikipedia.org/wiki/Bridgeman_Art_Library_v._Corel_Corp.
> exact photographic copies of public domain images could not be protected by copyright in the United States because the copies lack originality. The Court found that despite the fact that accurate reproductions might require a great deal of skill, experience and effort, the key element to determine whether a work is copyrightable under U.S. law is originality.
For some reason I have it in my head that the situation is different in the UK?
Still ProQuest is in Michigan, so it doesn’t matter right?
I guess if you do decide to challenge them it’s a good idea to have a friend who is an IP lawyer.
It’s hard to imagine them not responding after they have put language like that into their terms of service.
I’m pretty sure that any attempt to liberate page scans from EEBO will run into CFAA pretty quickly, rather than copyright assertions.
There has been no good test case in the UK to establish Bridgeman v. Corel there. Most cultural institutions assert copyright over scans of public-domain materials there through arguments including sweat-of-the-brow copyright.
I’ve run into that a lot with parish registers (and census images) that are not expressive and in some cases either centuries old or government created.
CFAA?
Computer Fraud and Abuse Act. See https://www.techdirt.com/search-g.php?num=20&q=CFAA&search=Search for examples.
According to Wikipedia, it was what Aaron Swartz was prosecuted under: https://en.wikipedia.org/wiki/Computer_Fraud_and_Abuse_Act
2015-11-03
@ryanfb: your initial pastec results (as you posted on twitter) are very fun - I especially love the different photos where the bench is the common factor :simple_smile:
@ryanfb: I had a question about the index size — did you feed full-size images in or did you re-size them ahead of time? If I remember correctly pastec will resize them smaller if you don’t. I’m just curious if the index size would be smaller if you fed in deliberately smaller images (and if that would impact the quality of the matches at all)
@ryanfb: and when you counted the number of matches are you counting A -> B and B -> A separately or counting both as one match?
@jeresig: Yeah I just bounced in the full size images and let Pastec handle the resizing.
@jeresig: Currently writing up a blog post on the whole process right now…for the 33,029 “unique” matches number, that’s filtered down by a script where it considers any unique set of image_ids
in a Pastec search result a unique match
So e.g. image_ids: [3,2,1]
and image_ids: [1,2,3]
will be one match
interesting!
But image_ids: [4,3,2,1]
would be a new match
@ryanfb: are you filtering out single matches (e.g. you re-upload an image and it just matches itself again, returning something like image_ids: [1]
)
Yes
cool!
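(A minimal sketch of that filtering step — treating each result’s image_ids as an unordered set and dropping self-only matches:)
```python
# Sketch of the filtering described above: treat each Pastec result's
# image_ids as an unordered set, drop self-only matches, count each set once.
def unique_matches(search_results):
    seen = set()
    for result in search_results:
        ids = frozenset(result["image_ids"])
        if len(ids) > 1:
            seen.add(ids)
    return seen

results = [
    {"image_ids": [3, 2, 1]},
    {"image_ids": [1, 2, 3]},     # same set as above -> counted once
    {"image_ids": [4, 3, 2, 1]},  # a different set -> a new match
    {"image_ids": [7]},           # an image matching only itself -> dropped
]
print(len(unique_matches(results)))  # 2
```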
@ryanfb: I’m very interested to see your results! Not sure if you’ve seen it but this is the analysis that I did with the Frick Art Reference Library Anonymous Italian Art photo archive: http://ejohn.org/research/computer-vision-photo-archives/ I did a lot of manual work to try and verify the quality of the matches. And, as I think I mentioned before, pastec seems to be roughly comparable, if slightly worse, than MatchEngine (but likely still “good enough”). Usually the big question that comes up is “what are we missing with these matches? and what false positives are we getting?” Beyond a certain point (too many misses, too many false positives) whatever algorithm will become unusable. Not sure if you’re storing the “score” field as well, but I found that setting the minimum score to about 19 worked well in some of my testing.
Yeah, that’s always a hard question. For example, those plates (which I consider a really interesting match) had a score of 14. But I get a lot of matches where it seems like a calibration target in the image is causing it to match every other image with a similar calibration target, and the score is higher than that (d’oh)
:disappointed: ugh yeah, calibration targets/color bars are a real pain
btw, I have a Node module that I’ve been working on for interfacing with Pastec, fwiw: https://github.com/jeresig/node-pastec
actively developing it, fixing up some issues
Right now, I’m planning on sharing the Pastec index and match results with the blog post so anyone can play with them
nice!
My next idea is some sort of twitter bot tweeting GIFs of every “unique” match, with the Rijksmuseum URLs they’re made from incorporated as well so they show up as already tweeted on the Rijksmuseum object page (and searching Twitter for that URL will turn it up)
Though maybe too many to feasibly do without being a data-hose, as one per hour would be almost 4 years
:smile:
@ryanfb: you may also be interested in some graph/cluster analysis that I did using the MatchEngine-derived links. It came up with some really interesting groupings of artworks that were quite unexpected: https://www.youtube.com/watch?v=PL6J8MtTsPo&t=27m8s
Awesome, will take a look. Thanks!
@jeresig: I hadn’t thought of using this process to remove calibration targets before, but if you’re interested in automatically detecting calibration charts I’ve written up a survey of some different approaches: https://ryanfb.github.io/etc/2015/07/08/automatic_colorchecker_detection.html
@ryanfb: that’s fantastic!! thank you so much
Can’t wait to see this - make sure to ping me when the blog post is up :)
ughh - I wish I had more time for experimentation with new technology! it’s so much fun :simple_smile:
@ryanfb did i miss a post from you about you pastec work?
@mdlincoln: so did I hear right that you are Dr Lincoln now?
@edsu @mdlincoln @jeresig - just published the blog post now :simple_smile: http://ryanfb.github.io/etc/2015/11/03/finding_near-matches_in_the_rijksmuseum_with_pastec.html
We’ll see if a 1.5GB file trips some sort of download limit on my institutional Box account…
@ryanfb: fantastic work! thank you for detailing all the steps you took and providing links to all the resulting data - that’s most helpful!
it might be interesting to treat the Rijksmuseum image set as a sort of canonical test dataset to analyze the quality of matches. Most of the other sets that I have include images that I can’t easily distribute.
I was thinking about checking out some of the other things on https://en.wikipedia.org/wiki/List_of_CBIR_engines and seeing what some of the other free/open ones can produce in comparison
(if any of them are in as-usable a state as Pastec…)
@ryanfb: Most of the other ones are just “similarity” not “duplicate”. The only “duplicate” ones I know of are Pastec and MatchEngine. imgSeek is also open source but it just finds similar images (it’ll just keep giving you matches and it never cuts off the results - also it doesn’t look at image features like Pastec and Matchengine, it can be tricked really easily, unfortunately)
but if you find anything, I’d be extremely interested!
I’ll definitely share anything I find. Ultimately (for my work at Duke), my interest is in matching images of e.g. Ancient Greek inscriptions, for which I have a crazy-idea-which-just-might-work that I need to actually take the time and implement.
oh, that sounds cool!
@ryanfb: Is a big issue with matching inscription images not that they’re relatively similar in appearance, and would perhaps need some other kind of vectorisation that these CBIR systems don’t have?
@fmcc: Yes. The rough outline of my plan is to try to use intrinsic self-similarity within an image to match to other images (with similar self-similarity)
@ryanfb: Is that approach based upon a particular paper?
No, and that’s why it’s taking me so long to get around to it :wink:
Cool - i’m having a look at this http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.442.8702&rep=rep1&type=pdf
I’ve not come across self-similarity at all before
I think I may have turned that up before. There are a lot based around e.g. rotational self-similarity
sorry - that was literally the first thing that I came across googling, and I mentioned it for context, not because I thought it might be new to you…
No worries :simple_smile:
Just realized I can probably get away with dumping all the Rijksmuseum match GIFs on Flickr (especially now that I can set a Public Domain license). Not as discoverable through the existing Rijksmuseum interface, but oh well
2015-11-04
that would be awesome
I’m in the middle of a conversation with the Free UK Genealogy folks about opening up their data via an API. What would be the ideal format for census records, parish registers, or other vital statistics? RDF via an API? Record-type-specific CSV?
One of our challenges is that individual records can change as they are corrected or added, making it very hard to create a persistent URL to an individual record.
By contrast, sub-sets of our records could be delivered pretty easily via a URL scheme like <http://freereg.org.uk/COUNTY/YEAR/RECORD_TYPE>
is the expectation that the persistent URL would always return the same data?
i could imagine wanting to be able to periodically scrub my local data to get any updates, using the persistent url
my advice is to learn about how they have it stored now, and come up with an initial solution that impacts that the least
oh i’m just reading your next message ; so you want to have metadata in the URL
I believe that the goal is that–as a database of records–users will want to link to a specific record. E.g. a family tree record for a marriage may link to the parish register entry recording the wedding.
and if someone updates the county then people’s links break?
There’s not really a unique ID (other than the database primary key) for the record. If someone replaces the register file containing that entry, and if the bride’s father’s surname has changed, we delete the old record and create a new one with the correct info.
Any permalink to the record will be broken.
if it uses the primary key?
That would not be true of a permalink to e.g. “all entries for St. Mary’s, Burton-upon-Trent, Shropshire, 1743”
If it uses the primary key, yes.
but the problem with a permalink style url is that if the metadata changes so will the url?
Similarly if it uses a smart key derived from all meaningful fields, since one of the meaningful fields will have changed during the correction
Right.
Permalinks to sets of records work fine, permalinks to an individual record can break.
seems like either way things are changing and that if you don’t want links to break you need to remember the old ones
and 301 redirect from the old to the new
i really like how WordPress does this for example
great question btw :smile:
Yes. That’ll require some human intervention to say “Is old record XYZ the same as new record XWZ?”, but I suspect we could do that.
Then you keep the URL for the old record and redirect, as you suggest.
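(The keep-the-old-URL-and-301 idea in miniature — a Flask-flavoured toy for illustration, not how FreeREG actually does it; the record ids and data below are made up.)
```python
# Toy version of "remember the old record id and 301 it to the new one".
# A Flask-flavoured illustration only; ids and record data are made up.
from flask import Flask, abort, jsonify, redirect

app = Flask(__name__)

records = {"def456": {"surname": "Smith", "parish": "St. Mary's", "year": 1743}}
superseded = {"abc123": "def456"}  # filled in whenever a correction replaces a record

@app.route("/records/<record_id>")
def show_record(record_id):
    if record_id in superseded:
        return redirect("/records/" + superseded[record_id], code=301)
    if record_id not in records:
        abort(404)
    return jsonify(records[record_id])
```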
i guess i prefer the permalink style url
if it can be achieved without too much trouble
they are much more hackable
Thanks! This will be less of a problem as we move the volunteers onto an online transcription system. At the moment they’re uploading CSV files with large batches of entries.
one compromise between RDF and CSV is CSV on the Web
it’s basically just csv, with a sidecar json-ld file that defines the semantics of the csv file for anyone that wants to turn it into RDF
a while ago i tried to create a simple example here http://edsu.github.io/csvw-template/
oh, i see the conversion is no longer working #sigh
This is very interesting to me. Thank you for the link.
++
@benwbrum: you probably know this already but there was significant interest from the semweb/linked data community in genealogy
I’m pretty new to the linked data world. My whole experience started by skimming the O’Reilly Linked Data book on the flight to Philadelphia for the IIIF hackathon a few weeks ago, then plumbing a JSON manifest generator into FromThePage. So there’s a huge amount I don’t know.
@benwbrum: btw, I fixed my csv metadata file so that conversion works now http://edsu.github.io/csvw-template/
i think CSVW is a nice example of how you can make your data easy to consume and use, while also making a high fidelity semantic version available too
for the people that want that
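(For reference, a CSVW sidecar is just a small JSON-LD file next to the CSV that declares what each column means — a minimal sketch with made-up column names, following the same pattern as the csvw-template example above, written out from Python:)
```python
# Rough sketch of a CSV on the Web sidecar: a small JSON-LD file that sits
# next to the CSV and says what each column means. Column names are made up;
# see the csvw-template example above for a real one.
import json

metadata = {
    "@context": "http://www.w3.org/ns/csvw",
    "url": "freereg-sample.csv",
    "tableSchema": {
        "columns": [
            {"name": "surname", "titles": "Surname", "datatype": "string"},
            {"name": "parish", "titles": "Parish", "datatype": "string"},
            {"name": "year", "titles": "Year", "datatype": "gYear"},
        ]
    },
}

with open("freereg-sample.csv-metadata.json", "w") as f:
    json.dump(metadata, f, indent=2)
```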
@benmiller: ian davis was a very prominent figure in the linked data community and also quite into genealogy which he blogged about http://blog.iandavis.com/tags/genealogy/
he’s still around, but last i heard was more interested in game development with golang :simple_smile:
2015-11-05
Curious what everyone thinks about this: http://talkinghumanities.blogs.sas.ac.uk/2015/11/05/re-using-bad-data-in-the-humanities/
Interesting piece, nice that it highlights the promise of eMOP
I think that the re-use angle is a bit off - probably more conceptually accurate to think about preparation of collections for unanticipated use
and there’s the chicken&egg problem: can’t create a data aggregation site if no one is following a standard <-> no one will follow a standard if they don’t need to submit their work to an aggregation site
Yeah, and there’s a bit of conflation going on - e.g. institutions that create collections vs. researchers that create derivative datasets from them (wherein of course we’d like to see reuse)
Ah! Yes, I think that’s what was rubbing me the wrong way.
Interesting either way of course! Went into the ol’ Zotero. Will probably reference in a piece I’m working on right now.
Anybody have some Gephi data I could use today for a quick tutorial I’m giving to grad students?
Thought about using the Les Miserables set, just for simplicity, but wondered if there was something out there more interesting.
i have the data from the dh conference ya’ll had there a few years ago
derived I think from a dataset that elijah created
@jheppler: I’ve got a couple here:
They’re literary networks, taken from the Gospel of Luke.
@coryandrewtaylor: Thanks!
@jheppler: No problem!
@jheppler we had an ma student a few years back do historical SNA for his thesis - all his files are on figshare http://figshare.com/authors/Peter_Holdsworth/402385
@jheppler i bet @mdlincoln has some about Dutch Engravers :smile:
@shawngraham what is Holdsworth 1898?
oh holdsworth is the name of the student?
http://figshare.com/articles/Holdsworth_1898_Dataset/727769 so what he did was look at the membership rolls of women’s service organizations in the run up to the centenary of the war of 1812, to see how ideas of commemoration spread around Ontario
there’s a neat bit where he looks at the structure of social networks against the structure of the rail network…
yes, Peter Holdsworth. Really neat guy.
If they need to work with an unwieldy dynamic network with dated nodes and edges: https://gitlab.com/mdlincoln/dh2015/tree/master/data-raw
The bm_print_nodes and bm_print_edges might be the easier ones to work with
the full R package comes with documentation for all those data, too - but I think you know how to navigate that @jheppler
unless i am misremembering my various DHers’ languages
^ wouldn’t THAT make for an interesting paper
The rkm nodes and edges are a similar format, though much of the descriptive data is in Dutch
@edsu: Thanks! Yeah I looked into Flickr but their support for animated GIFs is kind of weird
yeah, tumblr is probably better
Oh, nice!
that will get Rijksmuseum eyes on it i think
maybe they are already on it anyway :simple_smile:
Yeah I’ve @’d the main account a few times but I’m sure that account is probably a notifications hose for some poor person…
haha, yeah
peter is pretty awesome
i’m a fan anyway
I saw him present at NDF in New Zealand a few years ago about Rijksstudio https://www.youtube.com/watch?v=iW17d-OQsIs
Cool :simple_smile: Yeah, anyone who’s there involved in their push to make everything of theirs freely available online is probably good people…
2015-11-06
figured there could be some stuff of interest to folks here http://socialcomputing.asu.edu/pages/datasets
@ryanfb: that tumblr is fantastic! Browsing through the matches is very fun - and I like that you added tags to the posts, as well!
Nice!
@ryanfb: “Age yourself to view the future you!”
That’s a great one! Love seeing how the woodcut medium was (ab)used :simple_smile:
@jeresig: So the centre of the block was cut out so they only had to recreate the face?
@fmcc: precisely! they did this in Japanese Woodblock prints, too - one sec, let me post a photo
Trying to look up where the two images were then - quite interesting with that information that it’s Charles the Second that’s being removed to make way for William of Orange
My CV stuff found both of these matches. The one above (with the two men) are not only different kabuki actors but the artist signatures on the prints are different, too! They had chopped out the old signature and added a different one, for some reason.
Nice!
Do you know which one the original was?
@fmcc: not 100% sure (esp for the first one, since there are few other identifying details). I bet you could look really close and see where the woodblock had chipped — whichever one had less chips in it would’ve been the earlier one (since they naturally degrade as they’re used)
That’s interesting - the one on the right looks like the quality isn’t quite as good, but I guess that could be just ink bleed
that’s based a bit on intaglio printing though - really have no idea what wood block is like vs. linocut which is the only other relief printing i’ve done.
@jeresig: if you ever make a computer vision system for automatically ordering woodblock prints based on the chipping, I vote for calling it “woodblockchain”
@ryanfb: :smile:
The Carnegie Museum of Art has posted all of their metadata as a CSV file on Github. +1 for great documentation too! https://github.com/cmoa/collection
Are people familiar with dat
? http://dat-data.com/ I feel like it’s really awesome, especially for all these large CSV datasets
It’d be cool if there was a central server which hosted all these datasets for easy access
Yup, it’s been a subject of discussion a ways up in the group history - I’d love to see some uni libraries start to mirror & archive some of these datasets
I’ve yet to experiment with dat personally, though - is it stable-ish yet?
2015-11-07
re: librarians - we’re working on it!
2015-11-09
Not sure if this is the best channel, but Google has open-sourced their TensorFlow machine learning library:
@coryandrewtaylor: this looks really interesting - i’m going to poke about with this tonight
Looks like the Zooniverse/NYPL collaboration to extend the Scribe codebase is finally available: http://scribeproject.github.io/
I’ll be checking it out in detail for a client over the next several weeks.
Tropy, a tool for research photograph management, is looking for input on user practices and needs: https://docs.google.com/forms/d/1gxeRwzxQZNeOr4VJSvaUBYYfLomo2UFtAhA0YJ7vr14/viewform
Original announcement for Tropy at CHNM: http://chnm.gmu.edu/news/rrchnm-to-build-software-to-help-researchers-organize-digital-photographs/
@benwbrum: is the NYPL Scribe codebase related to the Zooniverse Scribe project?
@edsu: Zooniverse developed Scribe back in 2010-2011, open-sourcing a version of it at the end of 2011.
yup i remember that
Both FreeUKGenealogy (my client) and NYPL forked that to start separate projects.
NYPL’s went into Ensemble, and was the basis for an NEH grant to extend it further.
We used Scribe as a jumping off point for the “delivery mechanism” – the searchable database populated by data entry through Scribe.
When NYPL/Zooniverse got the NEH grant in September 2013, we shelved that effort to focus elsewhere while they made the Scribe code more robust. (That wasn’t the only factor).
And now it’s out!
scribeAPI?
What’s not entirely clear to me is whether the ScribeProject code (which appears to be ScribeAPI) is behind non-NYPL projects like Shakespeare’s World (Zooniverse+Folger) and AnnoTate (Zooniverse+Tate).
It certainly is behind Measuring the ANZACs.
Regardless, I’ve only had 10-15 minutes to go over the docs and no time to go over the code. I should have more figured out soon.
thanks for the info, that helped me a lot!
Sadly, I’ve lost my technical contact at the Zooniverse, as Stuart Lynn is now at CartoDB.
Free UK Genealogy will be rebooting their online structured transcription effort shortly – maybe after version 2.1.4 of FreeREG, maybe after 2.1.1 of FreeCEN. Next 2-3 months, regardless.
I’m not sure whether we’ll work under the aegis of Open Source Indexing again or not.
I do hope we’ll be able to use ScribeAPI, since its predecessor influenced so much of our technical stack three years ago.
have you written about that at all?
We got a lot of sample project definitions.
thanks!
Only one offer to help, from geneanum.
There was a lot of interest in creating an open-source tool in the same space as FamilySearch Indexing at RootsTech in spring 2013. Not a lot of resistance to open source for the tool, though most of the vendors hoped to use the tool to build paywalled databases, of course. I did get the impression that the idea was novel.
i like the idea of a framework, rather than a turnkey solution
since transcription efforts seem to vary so much in their presentation
but i’m a newb when it comes to this stuff
Have you seen the Zooniverse Project Builder (“Panoptes”)?
no, i have not
It’s a nice crowdsourcing framework that doesn’t include transcription.
good name
Very impressive, very usable. @mia and I used it in our crowdsourcing class at HILT this summer.
That’s the one.
It’s hosted, and open for anyone.
You’ll need an account, however.
It lets you ask multiple-choice questions, or “drawing” questions that ask users to select a region of the image.
Those latter answers can have another step, presenting them with multiple choice questions about the drawing.
2015-11-10
2015-11-11
I’ve raised the git/github question on here before, but I’m wondering if anyone has examples of CONTRIBUTOR policies for data repos? I’m curious what best practices would be for handling, say, pull requests on github for a repo that is generated from an upstream CMS. You might not want to just accept the changes without implementing them in your CMS and/or the scripts that generate the repo, so how does one make that process clear to people who clone/fork your repo as if it were any other open-source project?
2015-11-12
we (cooper hewitt) don’t have anything stated - but we have a link on our website’s object pages for people to email (using zendesk) corrections etc., which gets passed on to the appropriate curator to make the update in TMS.
so i imagine our statement would say something like “we welcome all pull requests which concern data formatting, organization etc… for cataloging errors, please visit the appropriate page on our website and follow the feedback link there”
So what if someone sends a PR that, say, reformats the way that you’ve serialized your data (say, expressing an array as an object instead) and you want to incorporate their changes. Would you actually accept that PR, and then reverse-engineer your export scripts to reproduce that formatting change in your next database update?
i guess we would encourage any data-reformatting to be done in a script that future users could run (eg in this folder - https://github.com/cooperhewitt/collection/tree/master/bin). like if someone wanted to write an import-to-(db of choice) script we’d accept that. how we format our data is kind of irrelevant at that point
Neat - I like the idea of encouraging documented scripts that refer to the canonical extract
It usefully side-steps the demand of having to be everything to everyone
we should definitely be more explicit about it, though. i’ll come up with something and add it in before the week’s out