Brad Ideas: Crazy ideas, inventions, essays and links from Brad Templeton
DSLR works surprisingly well for text
Over the years I have accumulated a fair bit of documentation in the form of photographed pages. I have a habit of photographing stuff people try to hand me to carry about, then giving it back. A while ago I set one of the OCR programs to scan every photo I've stored and extract only the text, building a matching tree in another directory. Then I deleted all the docs smaller than 1 kB and merged the rest down to one directory (with a smart rename to avoid name clashes). That gave me about 300 documents after a week. Most were surprisingly accurate - for a page of text in good light I was getting ~5 errors per page. For the stuff that counts (my handwritten additions to printed docs) the error rate was worse than a person trying to read my writing. But that's expected, and I will eventually have to type things in off the images if I want that text.
An example of the stuff that I want OCR'd but don't think will happen this decade: http://www.mozbike.com/build/long-2/one-less-ute-01.jpg
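For anyone wanting to repeat the batch run described above (OCR every photo into a matching tree, drop outputs under 1 kB, then merge into one directory with a rename on clashes), here is a rough Python sketch. It is not the tool I actually used: the function names, the list of image suffixes and the `-1`, `-2` rename scheme are all made up for illustration, and `ocr` is a placeholder for whatever OCR program you wrap.

```python
import shutil
from pathlib import Path
from typing import Callable

# Assumed set of photo extensions; adjust to whatever your camera produces.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
MIN_BYTES = 1024  # the "1 kB" cutoff from the description above


def ocr_tree(photo_root: Path, text_root: Path, ocr: Callable[[Path], str]) -> None:
    """OCR every image under photo_root into a mirrored tree under text_root.

    `ocr` is a hypothetical callable (image path -> extracted text).
    """
    for image in sorted(photo_root.rglob("*")):
        if image.is_file() and image.suffix.lower() in IMAGE_SUFFIXES:
            target = text_root / image.relative_to(photo_root).with_suffix(".txt")
            target.parent.mkdir(parents=True, exist_ok=True)
            target.write_text(ocr(image), encoding="utf-8")


def drop_small(text_root: Path, min_bytes: int = MIN_BYTES) -> None:
    """Delete text files smaller than min_bytes (mostly photos with no real text)."""
    for doc in list(text_root.rglob("*.txt")):
        if doc.stat().st_size < min_bytes:
            doc.unlink()


def flatten(text_root: Path, flat_dir: Path) -> None:
    """Move every remaining text file into flat_dir, renaming on name clashes.

    Assumes flat_dir is not inside text_root.
    """
    flat_dir.mkdir(parents=True, exist_ok=True)
    for doc in sorted(text_root.rglob("*.txt")):
        candidate = flat_dir / doc.name
        n = 1
        while candidate.exists():
            # "Smart rename" here just means appending -1, -2, ... to the stem.
            candidate = flat_dir / f"{doc.stem}-{n}{doc.suffix}"
            n += 1
        shutil.move(str(doc), str(candidate))
```

You would call `ocr_tree`, then `drop_small`, then `flatten`, in that order. Keeping the mirrored tree as a separate step means you can inspect what the 1 kB filter is about to throw away before anything gets merged.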
I suspect that you could get 90% accuracy just doing photo+scan. So it depends a lot on what your documents are - if they're publications, I think just waiting until the googleborg sucks everything up and then doing a get on your partial matches would work. For stuff that is not going to be borged it's harder, because you want more accuracy. But that might be surprisingly little material for most people. I'm thinking of all the government RFC/RFD blurf I accumulate with the matching submissions, for instance, and those will likely be put on the net at some stage without my help. It's personal bills and so on that won't be, but for many of those you explicitly want an image, not a scan.