DSLR works surprisingly well for text

Over the years I have accumulated a fair bit of documentation in the form of photographed pages. I have a habit of photographing stuff people try to hand me to carry about, then giving it back. A while ago I set one of the OCR programs to scan every photo I've stored and extract only the text, building a matching tree in another directory. Then I deleted all the docs smaller than 1 kB and merged everything down to one directory (smart rename). That gave me about 300 documents after a week. Most were surprisingly accurate: for a page of text in good light I was getting ~5 errors per page. For the stuff that counts (my handwritten additions to printed docs) the error rate was worse than a person trying to read my writing. But that's expected, and I will eventually have to type things in off the images if I want that text.
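For anyone wanting to try the same thing, here is a rough sketch of that workflow in Python. It is not the program I used. The OCR call assumes the pytesseract and Pillow packages (any OCR function that takes a path and returns text can be passed in instead). The function names, the list of image suffixes, and the "smart rename" rule (folding the directory path into the file name) are all guesses for illustration.

```python
import shutil
from pathlib import Path

# Assumed set of photo extensions; adjust to whatever the camera produces.
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png", ".tif", ".tiff"}
MIN_BYTES = 1024  # the "smaller than 1 kB" cutoff


def tesseract_ocr(image_path):
    """Default OCR step. Assumes pytesseract and Pillow are installed."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def ocr_tree(photo_root, text_root, ocr=tesseract_ocr):
    """OCR every image under photo_root into a matching tree under text_root."""
    photo_root, text_root = Path(photo_root), Path(text_root)
    for image in sorted(photo_root.rglob("*")):
        if not image.is_file() or image.suffix.lower() not in IMAGE_SUFFIXES:
            continue
        rel = image.relative_to(photo_root)
        # Keep the image name and append .txt so page.jpg and page.png don't collide.
        out = text_root / rel.parent / (rel.name + ".txt")
        out.parent.mkdir(parents=True, exist_ok=True)
        out.write_text(ocr(image), encoding="utf-8")


def prune_and_flatten(text_root, flat_dir, min_bytes=MIN_BYTES):
    """Delete tiny text files, then move the rest into one flat directory."""
    text_root, flat_dir = Path(text_root), Path(flat_dir)
    flat_dir.mkdir(parents=True, exist_ok=True)
    kept = []
    for txt in sorted(text_root.rglob("*.txt")):
        if txt.stat().st_size < min_bytes:
            txt.unlink()  # almost certainly a photo with no real text in it
            continue
        # Hypothetical "smart rename": fold the relative path into the name.
        name = "_".join(txt.relative_to(text_root).parts)
        target = flat_dir / name
        n = 1
        while target.exists():  # never overwrite an earlier document
            target = flat_dir / f"{Path(name).stem}-{n}.txt"
            n += 1
        shutil.move(str(txt), str(target))
        kept.append(target)
    return kept
```

Typical use would be ocr_tree("photos", "ocr_text") followed by prune_and_flatten("ocr_text", "ocr_flat"). Keep the flat directory outside the text tree, or a second run will pick up files that were already moved.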

Example of stuff that I want OCR'd but don't think will happen this decade: http://www.mozbike.com/build/long-2/one-less-ute-01.jpg

I suspect that you could get 90% accuracy just doing photo+scan. So it depends a lot on what your documents are. If they're publications, I think just waiting until the googleborg sucks everything up, then fetching the full text for your partial matches, would work. For stuff that is not going to be borged it's harder, because you want more accuracy. But that might be surprisingly little material for most people. I'm thinking of all the government RFC/RFD blurf I accumulate with the matching submissions, for instance, and those will likely be put on the net at some stage without my help. It's personal bills and so on that won't, but for many of those you explicitly want an image, not a scan.
