I have been approached by the good people in the HSBC UK archives to help with a little project. The archives contain, amongst many other fascinating things, four boxes of war cards. Rachael Porter from the UK Archives team explains:
“Next year marks 100 years since the start of the First World War. Here at HSBC Archives we are keen to mark this anniversary and, in doing so, also showcase some of our records relating to this period. We have a set of index cards, naming Midland Bank staff members who went off to serve in the armed forces during the conflict. Each card records detailed information about the employee, and in some cases unique information about his service record, which cannot be found anywhere else. We’d love for members of these men’s families, and military history enthusiasts, to be able to access these records; and a project centered around digitizing them and making them available, whilst also tying in with the anniversary, would be a real achievement for us.”
The question is how and where to do that, so that people get as much value as possible out of this amazing and important data. I am after a bit of help.
The how/where? (a few quickly thrown together ideas)
The boxes contain approximately 5,000 index cards. Each card has typed field names and handwritten details such as staff name, branch, rank and regiment. The back of the card includes notes on the staff member's time in the forces.
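To make the idea of "structured data" a bit more concrete, here is a rough sketch of how one transcribed card might be represented. This is only an illustration: the field names, the `WarCard` class and the `search` helper are my own assumptions based on the description above, not the actual card layout or any existing system.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class WarCard:
    """One index card. All field names are hypothetical, for illustration only."""
    card_id: str                      # identifier assigned at scan time (assumed)
    front_scan: str                   # file name of the front image (assumed)
    back_scan: str                    # file name of the back image (assumed)
    staff_name: Optional[str] = None  # None until someone has transcribed it
    branch: Optional[str] = None
    rank: Optional[str] = None
    regiment: Optional[str] = None
    service_notes: str = ""           # free text from the back of the card

    def is_transcribed(self) -> bool:
        """True once every structured field on the front has a value."""
        return all([self.staff_name, self.branch, self.rank, self.regiment])


def search(cards, **criteria):
    """Return cards whose named fields contain the given text, ignoring case."""
    matches = []
    for card in cards:
        record = asdict(card)
        if all(
            str(wanted).lower() in str(record.get(name) or "").lower()
            for name, wanted in criteria.items()
        ):
            matches.append(card)
    return matches
```

Keeping the scan file names alongside the transcribed fields means whatever platform we end up with could always show the original card next to the text, and untranscribed cards are easy to find with `is_transcribed()`.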
I think it would be great to scan the cards front and back and store them on a platform that allows other people to transcribe and add to the data. The actual scanning is tricky and time-intensive (and there may be no way round that) because of both the volume and the fact that these cards are 100 years old, so sticking them in a duplex scanner may be a little risky. If anyone has experience of scanning index cards with these kinds of machines I would love to know more. Have you been involved in any projects that had to scan in a large set of paper-based data?
A platform like the one built (in a week) for the Guardian MP’s Expenses investigation sprang to mind. This allowed masses of PDF scans to be uploaded and then provided some simple yet powerful tools for annotating and categorising the data. Unfortunately the site is no longer live but there are some good reads on how and why they built it.
Another good example of the genre is Old Weather, which published thousands of old nautical weather logs and asks people to transcribe them. Whether it is feasible for the project to build something to this level is debatable (small budget and short timescales, as usual), but it would be great to try because it would become reusable for future data sets rather than just being a one-off exercise in scanning and tagging data.
I am looking for any platforms that support this kind of data load and amendment. At a basic level we could use Flickr to upload the images and then use tagging, sets (or whatever they are called now), comments etc to try and build a usable and searchable set of data. Evernote was another tool that sprang to mind, to capture and attempt to transcribe the data, but I am not sure if it is really suited to this kind of task, especially if you wanted to build something else on top of it.
I am also looking for suggestions of other tools that would assist with this, whether that is a platform like Flickr or a set of open source tools somewhere. Anything we can have a play with.
Also, are there any organisations or people that specialise in this, other archives or museums for example? Any specialist military history sites? Once the data is scanned and annotated, where should it live? Submitting the data set to the National Archives was a good suggestion by my colleague as it seems they have a nice looking API.
This was a very quick scrawl of ideas. The key for me is tools to help with the capture, storage and annotation of the data. Any help would be greatly appreciated. Feel free to get in touch via the comments below or Twitter if you prefer.
Update 14:30 29/05/13
I have had some lovely ideas shared via Twitter and on our internal blog at work. Mechanical Turk has had many mentions and I was foolish to not include it. Simon shared the brilliant looking open hardware book scanner. Please keep them coming you lovely people.
Update 10:30 31/05/13
I have had some pointers to Guardian staff that worked on the MP’s expenses system and an offer to help code such a system from a brilliant man who used to work for the Guardian. We may have a solution from our Central Scanning department in Coventry but we cannot test it until the week of June the 10th due to system access requirements, fingers crossed. Rachael has been in touch with the Imperial War Museum. Things are looking good.
Update 15:40 26/06/13
It has been slow progress over the last few weeks, then some interesting things happened within a couple of days. First, we have the go-ahead and the means to scan the cards. This will be happening next Wednesday in Coventry, where we will be running the cards through these scanning beasts. We also had a great meeting with Luke Smith (see his great comment below for links to some interesting resources) from the Imperial War Museum, who is working on the fantastic looking Lives of the First World War http://www.livesofthefirstworldwar.org/. We have also received some very good advice, pointers and introductions to interesting people from Kim Plowright. It is getting a bit exciting.
Why not see if someone like Ancestry.com would partner with you? They have the platform and the experience at these sorts of things. If anyone is researching military history they are likely already on this site thus giving you the right audience as well.
I have them on a list of sites to get in touch with. Thank you.
Paul, I sent them a tweet and they sent back a link to a page explaining how to use a scanner. Got a few interesting avenues opening up (see update above) so will see how they go.
Hi Aden. I work at IWM and have been in touch with Rachael. We are speaking on Tuesday — perhaps you will join the call.
Ben Brumfield is a good thinker on crowdsourced transcription:
I added a few entries to his Collaborative Transcription Tools round up at:
Thank you very, very much for the links and yes I will be joining the call next Tuesday.
Thanks to Luke for the kind words.
In my opinion, the timing and the nature of the material make this a very good candidate for a successful crowdsourcing project: You’re likely to get some very passionate volunteers.
The big question is the choice of platform, which essentially depends on what you want to do with the data. The card you posted isn’t quite the sort of structured data you find with census forms, nor is it totally free-form prose like letters or diaries.
If your goal is to transcribe them in a way that renders them searchable/browseable by event type, location, name, unit or other structured fields while preserving the text, I’d be happy to offer FromThePage as the best solution. If you’re just looking for plain-text transcripts with no structure, you have a lot of options from the Transcription Tools list, of which the most mature may be Scripto and the most visually appealing may be the NARA Drupal plugin. A distributed Google Spreadsheet effort might be appropriate if you only want to abstract the structured data and don’t care about analysis of the rest of the text — a task which may make it easy to do visualizations and timelines.
What do you think the “end product” will be, if there is a single one you have in mind?
We definitely want to add some structure to the data and we would also like to try a crowdsourcing approach just to prove the model to others inside the organisation. We have not really had a think about what the end product will be but we are very impressed with what the IWM are doing and I suspect we will donate the dataset to their initiative as long as we can get the go-ahead from Legal etc. Thanks for the offer of help with transcription, I have sent you a link to some of the first scans. Will be in touch.
I had forgotten until a work discussion today that HSBC have a full digital archive system. You should probably also give some thought as to how this material could be ingested into that, ideally combining the transcribed material with the images of the original cards.
Hopefully there will be some info up on The National Archives website soon about how we approach this sort of project, and the system that HSBC Archives have bought was essentially (to cut a long story short) originally developed to meet our digital preservation needs, so there should be some overlap.
That system is not yet operational but will certainly be housing what we do with the War Cards. I will be writing a big update on the War Cards next week hopefully as we have made a bit of progress.