digitizing

Digitizing your papers, literally, for the future, with 4K video

I have so much paper that I’ve been on a slow quest to scan things. So I have high speed scanners and other tools, but it remains a great deal of work to get it done, especially reliably enough that you would throw away the scanned papers. I have done around 10 posts on digitizing and gathered them under that tag.

Recently, I was asked by a friend who could not figure out what to do with the papers of a deceased parent. Scanning them on your own or in scanning shops is time consuming and expensive, so a new thought came to me.

Set up a scanning table by mounting a camera that shoots 4K video looking down on the table. I have tripods that have an arm that extends out but there are many ways to mount it. Light the table brightly, and bring your papers. Then start the 4K video and start slapping the pages down (or pulling them off) as fast as you can.

There is no software today that can turn that video into a well scanned document. But there will be. Truth is, we could write it today, but nobody has. If you scan this way, you’re making the bet that somebody will. Even if nobody does, you can still go into the video and find any page and pull it out by hand. It will just be a lot of work, and you would only do this for single pages, not for whole documents. You are literally saving the document “for the future” because you are depending on future technology to easily extract it.

Where's my fast, smart, overhead scanner?

Back in 2008, I proposed the idea of a scanner club which would share high-end scanning equipment to rid houses of the glut of paper. It’s a harder problem than it sounds. I bought a high-end Fujitsu office scanner (original price $5K, but I paid a lot less) and it’s done some things for me, but it’s still way too hard to use on general scanning problems.

I’ve bought a lot of scanners in my day. There are now lots of portable hand scanners that just scan to an SD card, which I like. I also have several flatbeds and a couple of high volume sheetfeds.

In the scanner club article, I outlined a different design for how I would like a scanner to work. This design is faster and much less expensive and probably more reliable than all the other designs, yet 7 years later, nobody has built it.

The design is similar to the “document camera” family of scanners which feature a camera suspended over a flat surface, equipped with some LED lighting. Thanks to the progress in digital cameras, a fast, high resolution camera is now something you can get cheap. One example is the $350 Hovercam Solo 8, which provides an 8 megapixel (4K) image at 30 frames per second. Soon, 4K cameras will become very cheap. You don’t need video at that resolution, and still cameras in the 20 megapixel range (which means around 500 pixels/inch scanning of letter sized paper) are cheap and plentiful.
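The resolution arithmetic can be sketched as follows. This is a back-of-envelope estimate, not a camera spec: the 3:2 aspect ratio, the function names and the assumption that the page fills the frame exactly are all mine. Because a letter page is not the same shape as the sensor, the short side of the page ends up with somewhat fewer pixels per inch than the long side, so the sketch reports the smaller of the two figures.

```python
import math


def sensor_dimensions(megapixels, aspect=(3, 2)):
    """Approximate pixel width and height of a sensor from its megapixel count."""
    w_ratio, h_ratio = aspect
    total = megapixels * 1_000_000
    height = math.sqrt(total * h_ratio / w_ratio)
    width = total / height
    return width, height


def scan_ppi(megapixels, page_in=(11.0, 8.5), aspect=(3, 2)):
    """Pixels per inch when the page's long side lies along the sensor's long side.

    The page has to fit in both directions, so the smaller ratio is the limit.
    """
    width, height = sensor_dimensions(megapixels, aspect)
    long_side, short_side = max(page_in), min(page_in)
    return min(width / long_side, height / short_side)


if __name__ == "__main__":
    for mp in (8, 20):
        print(f"{mp} MP -> about {scan_ppi(mp):.0f} ppi on letter paper")
```

Raising the camera to cover a larger page simply means passing a bigger `page_in`, which lowers the result.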

Under the camera you could put anything, but a surface of a distinct colour (like green screen) is a good idea. Anything but the same colour as your paper will do. To get extra fancy, the table could be perforated with small holes like an air hockey table, and have a small suction pump, so that paper put on it is instantly held flat, sticking slightly to the surface.
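To illustrate why a distinctly coloured surface helps, here is a minimal sketch of finding the page against the background. It assumes the image is a plain list of rows of (r, g, b) tuples and that the page sits roughly square to the frame; the function name and the tolerance value are made up for the example, and real software would also have to handle rotation.

```python
def page_bounding_box(image, background, tolerance=40):
    """Return (top, left, bottom, right) of the pixels that differ from the background.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    Returns None when every pixel looks like background (an empty table).
    """
    def is_background(pixel):
        return all(abs(c - b) <= tolerance for c, b in zip(pixel, background))

    # Rows and columns that contain at least one non-background pixel.
    rows = [y for y, row in enumerate(image)
            if not all(is_background(p) for p in row)]
    cols = [x for x in range(len(image[0]))
            if not all(is_background(row[x]) for row in image)]
    if not rows or not cols:
        return None
    return rows[0], cols[0], rows[-1], cols[-1]
```

If the paper were the same colour as the table, every pixel would pass the background test and there would be nothing to crop to, which is the reason for the green surface.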

No-button scanning

The real feature I want is an ability to scan pages as fast as a human being can slap them down on the table. To scan a document, you would just take pages and quickly put them down, one after the other, as fast as you can, so long as you pause long enough for your hand to leave the view and the paper to stay still for 100 milliseconds or so.

The system will be watching with a 60 frame per second standard HD video camera (these are very cheap today.) It will watch until a new page arrives and your hand leaves. Because it will have an image of the table or papers under the new sheet, it can spot the difference. It can also spot when the image becomes still for a few frames, and when it doesn’t have your hand in it. This would trigger a high resolution still image. The LEDs would flash with that still image, which is your signal to know the image has been taken and the system is ready to drop a new page on. Every so often you would clear the stack so it doesn’t grow too high.
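The trigger logic described above could look something like this sketch. It works on frames represented as flat lists of grayscale values, and the thresholds and names are placeholders I picked for illustration. It also takes a shortcut: it treats "the scene is still and different from the last capture" as a stand-in for "the hand has left", so a hand held perfectly still over the table would fool it. A real system would need actual hand detection.

```python
def mean_abs_diff(a, b):
    """Average absolute difference between two equally sized grayscale frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def find_capture_points(frames, still_frames=6,
                        motion_threshold=2.0, change_threshold=10.0):
    """Return indices of frames at which a high-resolution still would be triggered.

    A capture fires once the scene has been still for `still_frames` consecutive
    frames (6 frames is 100 ms at 60 fps) and looks different from the scene at
    the previous capture, meaning a new page has arrived.
    """
    captures = []
    reference = frames[0]  # assumed to show the empty table
    still_run = 0
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[i - 1]) < motion_threshold:
            still_run += 1
        else:
            still_run = 0
        if (still_run >= still_frames
                and mean_abs_diff(frames[i], reference) > change_threshold):
            captures.append(i)
            reference = frames[i]  # later pages are compared against this one
    return captures
```

Updating `reference` after each capture is what lets the stack grow: each new page is compared with the page under it rather than with the bare table.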

Alternately, you could remove pages before you add a new one. This would be slower, but you would get no movement of papers under the top page. If you had the suction table, each page would be held nice and flat, with a green background around it, for a highly accurate rotation and crop in the final image. With two hands it might not be much slower to pull pages out while adding new ones.

No button is pressed between scans or even to start and finish scanning. You might have some buttons on the scanner to indicate you are clearing the stack, or to select modes (colour, black and white, line art, double sided, exposure modes etc.) Instead of buttons, you could also have little tokens you put on the surface with codes that can be read by the camera. This can include sheets of paper you print with bar codes to insert in the middle of your scanning streams.

By warning the scanner, you could also scan bound books and pamphlets and even stapled documents without unstapling. You will get some small distortions but the scans will be fine if the goal is document storage rather than publishing. (You can even eliminate those distortions if you use 3D scanning techniques like structured light projection onto the pages, or having 2 cameras for stereo.)

For books, this is already worked out, and many places like the Internet Archive build special scanners that use overhead cameras for books. They have not attacked the “loose pile of paper” problem that so many of us have in our files and boxes of paper.

Why this method?

I believe this method is much faster than even high speed commercial scanners on all but the most regular of documents. You can flip pages at better than 1 per second. With small things, like business cards and photos, you can lay down multiple pages per second. That’s already the speed of typical high end office scanners. But the difference is actually far greater.

For those office scanners, you tend to need a fairly regular stack or the document feeder may mess up. Scanning a pile of different sized pages is problematic, and even general loose pages run the risk of skipping pages or other errors. As such, you always do a little bit of prep with your stacks of documents before you put them in the scanner. No button scanning will work with a random pile of cards and papers, including even folded papers. You would unfold them as you scan, but the overall process will take less time.

A scanner like this can handle almost any size and shape of paper. It could offer the option to zoom the camera out or pull it higher to scan very large pages, which the other scanners just can’t do. You get a lower ppi number on the larger pages, but if you can’t accept that, scan sections at full ppi and stitch them together like you would on an older scanner.

The scans will not be as clean as a flatbed or sheetfed scanner. There will be variations in lighting and shading from curvature of the pages, along with minor distortions unless you use the suction table for all pages. A regular scanner puts a light source right on the page and full 3-colour scanning elements right next to it, so it’s going to be higher quality. For publication and professional archiving, the big scanners will still win. On the other hand, this scanner could handle 3-dimensional objects and any thickness of paper.

Another thing that’s slower here is double sided pages. A few options are available here:

  • Flip every page. Have software in the scanner able to identify the act of flipping — especially easy if you have the 3D imaging with structured light.
  • Run the whole stack through again, upside-down. Runs the risk of getting out of sync. You want to be sure you tie every page with its other side.
  • Build a fancier double sided table where the surface is a sheet of glass or plexi, and there are cameras on both sides. (Flash the flash at two different times of course to avoid translucent paper.) Probably no holes in the glass for suction as those would show in the lower image.
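For the second option in the list above (running the whole stack through again), tying each page to its other side might be sketched like this. It rests on an assumption of mine: that the stack is turned over as a block, so the backs arrive in the reverse order of the fronts. The function name is hypothetical.

```python
def pair_sides(fronts, backs_in_scan_order):
    """Match each front with its back after a second pass over a flipped stack.

    Assumes the stack was turned over as a whole, so the back of the last
    page was scanned first. Refuses to guess if the counts differ, because
    that means the two passes are out of sync.
    """
    if len(fronts) != len(backs_in_scan_order):
        raise ValueError(
            f"{len(fronts)} fronts but {len(backs_in_scan_order)} backs: "
            "passes are out of sync"
        )
    return list(zip(fronts, reversed(backs_in_scan_order)))
```

A count mismatch only tells you that something went wrong, not where, which is why this option is the riskiest of the three.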

Ideally, all of this would work without a computer, storing the images to a flash card. Fancier adjustments and OCR could be done later on the computer, as well as converting images to PDFs and breaking things up into different documents. Even better if it can work on batteries, and fold up for small storage. But frankly, I would be happy to have it always there, always on. Any paper I received in the mail would get a quick slap-down on the scanning table and the paper could go in the recycling right away.

You could also hire teens to go through your old filing cabinets and scan them. I believe this scanner design would be inexpensive, so there would be less need to share it.

Getting Fancy

As Moore’s law progresses, we can do even more. If we realize we’re taking video and have the power to process it, it becomes possible to combine all the video frames with a page in it, and produce an image that is better than any one frame, with sub-pixel resolution, and superior elimination of gradations in lighting and distortions.
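A very reduced illustration of combining frames: take the per-pixel median across several frames of the same page. This is a sketch under strong assumptions (the frames are already perfectly aligned, and each is a flat list of grayscale values), and it only shows the noise and glare rejection part of the idea. It does not attempt sub-pixel resolution, which needs the frames to be registered at fractions of a pixel first.

```python
from statistics import median


def stack_frames(frames):
    """Per-pixel median across aligned frames of the same page.

    A transient problem in one frame (a shadow, a glint, sensor noise)
    is outvoted by the other frames.
    """
    if not frames:
        raise ValueError("need at least one frame")
    length = len(frames[0])
    if any(len(frame) != length for frame in frames):
        raise ValueError("all frames must have the same size")
    # zip(*frames) groups together the values each frame has for one pixel.
    return [median(values) for values in zip(*frames)]
```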

As noted in the comments, it also becomes possible to do all this with what’s in a mobile phone, or any video camera with post-processing. One can even imagine:

  • Flipping through a book at high speed in front of a high-speed camera, and getting an image of the entire book in just a few seconds. Yes, some pages will get missed so you just do it again until it says it has all the pages. Update: This lab did something like this.
  • Vernor Vinge’s crazy scanner from Rainbows End, which sliced off the spines and blew the pages down a tube, being imaged all the way along to capture everything.
  • Using a big table and a group of people who just slap things down on the table until the computer, using a projector, shows you which things have been scanned and can be replaced. Thousands of pages could go by in minutes.

Review of standalone wand scanner

Back at the start of this blog, in 2004, I described a product I wanted to see, which I called the Paperless Home Scanner. Of late, several companies have been making products like this (not necessarily because of this blog of course) and so I finally picked one up to see how things pan out.

Because I’m cheap, I was able to pick up an Asian-made scanner sold under many brand names for only $38 on eBay. This scanner, sometimes called the Handyscan or PS-4100 or similar numbers, can also be found on Amazon for much more.

The product I described is a portable sheetfed scanner which runs on batteries and does not need to be connected to a computer because it just writes to a flash card. This particular scanner isn’t that because it’s a hand scanner you swipe over your documents. For many years I have used a Visioneer Strobe, which is a slow sheetfed unit that has to be connected to a Windows computer. I found that having to turn the computer on and loading the right software and selecting the directory to scan was a burden. (You don’t strictly have to do that but strangely you seem motivated to do so.) The older scanner was not very fast, and suffered a variety of problems, being unable to scan thermal paper receipts (they are so thin it gets confused) and having problems with even slight skew on the documents.

I was interested in the hand-scanner approach because I presumed there had been vast improvements using the laser surface scanning found in mice. I figured a new scanner could do very good registration even if you were uneven in your wanding. Here are some of my observations:

  • While it does a better job of making an undistorted scan than older hand scanners, it is still far from perfect, and any twists or catches can distort the scan, though not that much. Enough that you wouldn’t use it to print a copy, but fine for records archiving.
  • It’s exactly 8.5” wide. Since it’s hard to be exactly straight on any scan, that’s an annoyance as you will often drift slightly from a page. A scanner for letter paper should really be about 9” wide. I’ll gladly pay the extra for that.
  • Even today with Moore’s law it’s too slow scanning colour. Often the red light comes on to warn that you are scanning too fast in colour. In B&W it is rare but still can happen. Frankly, by this time we should be able to make things fast and sensitive enough to allow scanning as fast as anybody is likely to do it.
  • While it is nice and small (and thus good for travel,) for use in the home, I would prefer it be a bit wider so I can get it on to the paper and scan the whole page with no risk of catching on the paper. And yes, there is always a risk of it catching.
  • It also catches on bends and folds in the paper, and so ideally you are holding the paper with one hand somehow and swiping with the other, but of course that is not really easy to do if scanning the whole page.
  • This particular scanner resets every time it turns off. And it resets to colour-300dpi. I wish it just remembered my settings.
  • In spite of what it said, it does not appear to have a monochrome setting, such as bitmap-600dpi or even 300dpi. That turns out to be fine, and even what you want, for records archiving. Sure, why throw away information in this era of cheap storage, I agree. On the other hand if it allows scanning-super-fast it may be worth it. A trick might be to start in grayscale and get levels, and then switch to bitmap/threshold.
  • One huge difference with swipe scanners is they don’t know where the edges of the paper are. You can scan on a black background and have software crop and straighten, but feeding scanners do that for you because they know where those edges are. Again, having a bit of the background there is fine for archiving bills etc.
  • Overall, I do now realize that not having a view of what I scanned is more of a burden than I thought. Particularly if you are thinking of disposing of the document after scanning. Did you get a good scan or not? Though it would add a lot to the cost and size, I now wonder if a very small display screen might be in order.
  • Instead of a display screen, one alternative might be bluetooth, and send the scan image to your smartphone or computer directly. Not required, so you can still scan at-will, but if you have your device with you, you can get a review screen and perhaps some more advanced UI.
  • Indeed, the bluetooth approach would save you the trouble of having to transfer the files, or of having a flash card. (A modest number of megs of internal flash would probably do the job of storing until you can get near the computer.)
  • While it does plug into USB (to read the flash card) that would be a pain if you wanted to scan to screen. Bluetooth is better.

Hand swipe vs. motor fed

Negative copier for digital camera

As digital cameras have developed enough resolution to work as scanners, such as in the scanning table proposal I wrote about earlier, some people are also using them to digitize slides. You can purchase what is called a “slide copier” which is just a simple lens and holder which goes in front of the camera to take pictures of slides. These have existed for a long time as they were used to duplicate slides in film days. However, they were not adapted for negatives since you can’t readily duplicate a colour negative this way, because it is a negative and because it has an orange cast from the substrate.

There is at least one slide copier (The Opteka) which offers a negative strip holder, however that requires a bit of manual manipulation and the orange cast reduces the color gamut you will get after processing the image. Digital photography allows imaging of negatives because we can invert and colour adjust the result.
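The “invert and colour adjust” step can be sketched per pixel as below. This is a deliberately naive linear version, and the names are my own: it assumes you can sample the colour of unexposed film (the orange base, for example from the gap between frames), divides each channel by that to cancel the cast, then inverts. Real negative conversion also has to deal with the film’s tone curve, which this ignores.

```python
def invert_negative(pixel, film_base, white=255):
    """Convert one (r, g, b) pixel of a photographed negative to a positive.

    `film_base` is the colour of unexposed film, i.e. the orange mask.
    Dividing by it removes the cast; subtracting from white inverts.
    """
    out = []
    for value, base in zip(pixel, film_base):
        # Fraction of the base brightness that got through this channel.
        normalised = min(value / base, 1.0) if base else 0.0
        out.append(round(white * (1.0 - normalised)))
    return tuple(out)


def invert_image(image, film_base):
    """Apply invert_negative to every pixel of a list-of-rows image."""
    return [[invert_negative(p, film_base) for p in row] for row in image]
```

With this scheme a patch of clear film base comes out black, as it should, since unexposed negative means no light reached the film.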

To get the product I want, we don’t have too far to go. First of all, you want a negative strip holder which has wheels in the sprocket holes. Once you have placed your negative strip correctly with one wheel, a second wheel should be able to advance exactly one frame, just like the reel in the camera did when it was shooting. You may need to do some fine adjustments, but it is also satisfactory to have the image cover more than 36mm so that you don’t have to be perfectly accurate, and have the software do some cropping.

Secondly, you would like it so that ideally, after you wind one frame, it triggers the shutter using a remote release. (Remote release is sadly a complex thing, with many different ways for different cameras, including wired cable releases where you just close a contact but need a proprietary connector, infrared remote controls and USB shooting. Sadly, this complexity might end up adding more to the cost than everything else, so you may have to suffer and squeeze it yourself.) As a plus, a little air bulb should be available to blow air over negatives before shooting them.

Next, you want an illuminator behind the negative or slide. For slides you want white of course. For negatives however, you would like a colour chosen to undo the effects of the orange cast, so that the gamut of light received matches the range of the camera sensors. This might be done most easily with 3 LEDs matched to camera sensors in the appropriate range of brightness.

You could also simply make a product out of this light, to be used with existing slide duplicators; that’s the simplest way to do this in the small scale.

Why do all this, when a real negative scanner is not that expensive, and higher quality? Digitizing your negatives this way would be fast. Negative scanners all tend to be very slow. This approach would let you slot in a negative strip, and go wind-click-wind-click-wind-click-wind-click in just a couple of seconds, not unlike shooting fast on an old film camera. You would get quite decent scans with today’s high quality DSLRs. My 5D Mark II with 21 megapixels would effectively be getting around 4000 dpi, though with Bayer interpolation. If you wanted a scan for professional work or printing, you could then go back to that negative and do it on a more expensive negative scanner, cleaning it first etc.

Another solution is just to send all the negatives off to one of the services which send them to India for cheap scanning, though these tend to be at a more modest resolution. This approach would let you quickly get a catalog of your negatives.

Light Table

Of course, to get a really quick catalog, another approach would be to create a grid of 3 rows of negative strip holders which could then be placed on a light table, ideally a light table with a blueish light to compensate for the orange cast. Take a photo of the entire grid to get 12 individual photos in one shot. This will result (on the 5D) in about 1.5 megapixel versions of each negative. Not sufficient to work with but fine for screen and web use, and not too far off the basic service you get from the consumer scanning companies.

I have some of my old negatives in plastic sheets that go in binders, so I could do it directly with them, but it’s work to put negatives into these and would be much easier to slide strips into a plastic holder which keeps them flat. Of course, another approach would be to simply lay the strips on the light table and put a sheet of clear plexiglass on top of them, and shoot in a dim room to avoid reflections.

Negative viewer

It would also be useful if digital cameras or video cameras tossed in a “view colour negative” mode which did its best to show an invert of the live preview image with orange cast reverted. Then you could browse your negatives by holding them up to your camera (in macro mode) and see them in their true form, if at lower resolution. Of course you can usually figure out what’s in a negative but sometimes it’s not so easy and requires the loupe, and it might not in this case.

Scanning table for old digital cameras

I have several sheetfed scanners. They are great in many ways — though not nearly as automatic as they could be — but they are expensive and have their limitations when it comes to real-world documents, which are often not in pristine shape.

I still believe in sheetfed scanners for the home, in fact one of my first blog posts here was about the paperless home, and some products are now on the market similar to this design, though none have the concept I really wanted — a battery powered scanner which simply scans to flash cards, and you take the flash card to a computer later for processing.

My multi-page document scanners will do a whole document, but they sometimes mis-feed. My single-page sheetfed scanner isn’t as fast or fancy but it’s still faster than using a flatbed because the act of putting the paper in the scanner is the act of scanning. There is no “open the top, remove old document, put in new one, lower top, push scan button” process.

Here’s a design that might be cheap and just what a house needs to get rid of its documents. It begins with a table which has an arm coming out from one side which has a tripod screw to hold a digital camera. Also running up the arm is a USB cable to the camera. Also on the arm, at enough of an angle to avoid glare and reflections, are lights, either white LED or CCFL tubes.

In the bed of the table is a capacitive sensor able to tell if your hand is near the table, as well as a simple photosensor to tell if there is a document on the table. All of this plugs into a laptop for control.

You slap a document on the table. As soon as you draw your hand away, the light flashes and the camera takes a picture. Then go and replace or flip the document and it happens again. No need to push a button, the removal of your hand with a document in place causes the photo. A button will be present to say “take it again” or “erase that” but you should not need to push it much. The light should be bright enough so the camera can shoot fairly stopped down, allowing a sharp image with good depth of field. The light might be on all the time in the single-sided version.
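The control logic on the laptop is small. Here is a sketch of it, assuming the two sensors are polled together and delivered as a sequence of (hand_near, document_present) booleans; that sample format and the function name are inventions for the example, not any real sensor API.

```python
def capture_events(samples):
    """Return the index of each sample at which the shutter should fire.

    `samples` is a sequence of (hand_near, document_present) booleans.
    The camera fires at the moment the hand goes from near to away
    while the photosensor reports a document on the table.
    """
    events = []
    previous_hand = False
    for i, (hand_near, document_present) in enumerate(samples):
        if previous_hand and not hand_near and document_present:
            events.append(i)
        previous_hand = hand_near
    return events
```

Triggering on the transition rather than on the state is what stops the camera from shooting the same page over and over while it just sits there, and withdrawing your hand after removing a page fires nothing because the table is empty.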

The camera can’t be any camera, alas, but many older cameras in the 6MP range would get about 300dpi colour from a typical letter sized page, which is quite fine. Key is that the camera has macro mode (or can otherwise focus close) and can be made to shoot over USB. An infrared LED could also be used to trigger many consumer cameras. Another plus is manual focus. It would be nice if the camera can just be locked in focus at the right distance, as that means much faster shooting for typical consumer digital cameras. And ideally all this (macro mode, manual focus) can all be set by USB control and thus be done under the control of the computer.

Of course, 3-D objects can also be shot in this way, though they might get glare from the lights if they have surfaces at the wrong angles. A fancier box would put the lights behind cloth diffusers, making things bulkier, though it can all pack down pretty small. In fact, since the arm can be designed to be easily removed, the whole thing can pack down into a very small box. A sheet of plexi would be available to flatten crumpled papers, though with good depth of field, this might not strictly be necessary.

One nice option might be a table filled with holes and a small suction pump. This would hold paper flat to the table. It would also make it easy to determine when paper is on the table. It would not help stacks of paper much but could be turned off, of course.

A fancier and bulkier version would have legs and support a 2nd camera below the table, which would now be a transparent piece of plexiglass. Double sided shots could then be taken, though in this case the lights would have to be turned off on the other side when shooting, and a darkened room or shade around the bottom and part of the top would be a good idea, to avoid bleed through the page. Suction might not be such a good idea here. The software should figure if the other side is blank and discard or highly compress that image. Of course the software must also crop images to size, and straighten rectangular items.

There are other options besides the capacitive hand sensor. These include a button, of course, a simple voice command detector, and clever use of the preview video mode that many digital cameras now have over USB. (ie. the computer can look through the camera and see when the document is in place and the hand is removed.) This approach would also allow gesture commands, little hand signals to indicate if the document is single sided, or B&W, or needs other special treatment.

The goal however, is a table where you can just slap pages down, move your hand away slightly and then slap down another. For stacks of documents one could even put down the whole stack and take pages off one at a time though this would surely bump the stack a bit requiring a bit of cleverness in straightening and cropping. Many people would find they could do this as fast as some of the faster professional document scanners, and with no errors on imperfect pages. The scans would not be as good as true scanner output, but good enough for many purposes.

In fact, digital camera photography’s speed (and ability to handle 3-D objects) led both Google Books and the Internet Archive to use it for their book scanning projects. This was of course primarily because they were unwilling to destroy books. Google came up with the idea of using a laser rangefinder to map the shape of the curved book page to correct any distortions in it. While this could be done here it is probably overkill.

One nice bonus here is that it’s very easy to design this to handle large documents, and even to be adjustable to handle both small and large documents. Normally scanners wide enough for large items are very expensive.

Going paperless by making manuals easier to find

As I move to get more paper out of my life, one thing I’m throwing away with more confidence is manuals. It’s pretty frequent that I can do a search for product model numbers or other things on a manual, and find a place to download the PDF. Then I can toss the manual. I need to download the PDF, because the company might die and their web site might go away.

I would like to make this even easier. For starters, it would be nice if the UPC database (UPC are the bar codes found on all retail products) would also offer a link to getting all manuals and paper that come with a product. I would then be able to just photograph the bar codes of all my products with my phone or camera, and cause automatic download or escrow of all manuals. Perhaps a symbol next to the UPC could tell me this is guaranteed to work.

It would be even better if companies escrowed the manuals, which is to say paid a one-time fee to a trustable company which would promise to keep the documents online forever. This company must be backed by a very solid company itself, perhaps a consortium of all the major vendors with a pact that if any of them go under, the rest take up the slack of maintaining the site.

In fact, all free, public documents should have a code on them that can be turned into a URL where I can fetch the document, as PDF, HTML or even MSWord. Any attempt to scan such a document would pick up this code and know it doesn’t have to scan the rest unless it is marked up. For books, we should key off the ISBN as well as the UPC. Eventually one of the newer, compact 2-D “barcodes” could be used to code a number to find the docs.

Of course, many products are now coming without manuals at all, and that’s largely fine with me.

OCR Page numbers and detect double feed

I’m scanning my documents on an ADF document scanner now, and it’s largely pretty impressive, but I’m surprised at some things the system won’t do.

Double page feeding is the bane of document scanning. To prevent it, many scanners offer methods of double feed detection, including ultrasonic detection of double thickness and detection when one page is suddenly longer than all the others (because it’s really two.)

There are a number of other tricks they could do, I think. I think a paper feeder that used air suction or gecko-foot van-der-waals force pluckers on both sides of a page to try to pull the sides in two different directions could help not just detect, but eliminate such feeds.

However, the most the double feed detectors do is signal an exception to stop the scan. Which means work re-feeding and a need to stand by.

However, many documents have page numbers. And we’re going to OCR them and the OCR engine is pretty good at detecting page numbers (mostly out of desire to remove them.) However, it seems to me a good approach would be to look for gaps in the page numbers, especially combined with the other results of a double feed. Then don’t stop the scan, just keep going, and report to the operator which pages need to be scanned again. Those would be scanned, their number extracted, and they would be inserted in the right place in the final document.
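The gap check could be as simple as the sketch below. It assumes the OCR step hands back one value per scanned sheet side, in scan order, with `None` where no page number was found; that interface is hypothetical, not something any particular OCR package is known to provide. Unnumbered scans between two numbered ones are given the benefit of the doubt, and a number that goes backwards (a new chapter restarting at 1) is not treated as a gap.

```python
def pages_to_rescan(ocr_numbers):
    """Find places where page numbers skip, suggesting a double feed.

    `ocr_numbers` lists the page number read from each scan, in order,
    with None for scans where no number was found. Returns (before, after)
    pairs of page numbers between which pages appear to be missing.
    """
    gaps = []
    last = None
    unnumbered = 0  # scans with no number seen since the last numbered one
    for number in ocr_numbers:
        if number is None:
            unnumbered += 1
            continue
        # More pages skipped than unnumbered scans could account for.
        if last is not None and number - last - 1 > unnumbered:
            gaps.append((last, number))
        last = number
        unnumbered = 0
    return gaps
```

The scan keeps running the whole time; the list of gaps is just the report the operator gets at the end.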

Of course, it’s not perfect. Sometimes page numbers are not put on blank pages, and some documents number only within chapters. So you might not catch everything, but you could catch a lot of stuff. Operators could quickly discern the page numbering scheme (though I think the OCR could do this too) to guide the effort.

I’m seeking a maximum convenience workflow. I think to do that the best plan is to have several scanners going, and the OCR after the fact in the background. That way there’s always something for the operator to do — fixing bad feeds, loading new documents, naming them — for maximum throughput. Though I also would hope the OCR software could do better at naming the documents for you, or at least suggesting names. Perhaps it can, the manual for Omnipage is pretty sparse.

While some higher end scanners do have the scanner figure out the size of the page (at least the length) I am not sure why it isn’t a trivial feature for all ADF scanners to do this. My $100 Strobe sheetfed scanner does it. That my $6,000 (retail) FI-5650 needs extra software seems odd to me.

Forming a "scanner club"

I’ve accumulated tons of paper, and automated scanner technology keeps getting better and better. I’m thinking about creating a “Scanner club.” This club would purchase a high-end document scanner, ideally used on eBay. This would be combined with other needed tools such as a paper cutter able to remove the spines off bound documents (and even less-loved books) and possibly a dedicated computer. Then members of the club would each get a week with the scanner to do their documents, and at the end of that period, it would be re-sold on eBay, ie. a “ReBay.” The cost, divided up among members, should be modest. Alternately the scanner could be kept and time-shared among members from then on.

A number of people I have spoken to are interested, so recruiting enough members is no issue. The question is, what scanner to get? Document scanners can range from $500 for a “workgroup” scanner to anywhere from $1,500 to $10,000 for a “production” scanner. (There are also $100,000 scanning-house scanners that are beyond the budget.) The $500 units are not worth sharing and are more modest in ability.

My question is, what scanner to get? As you go up in price, the main thing that changes is speed in pages per minute. That’s useful, but for private users not the most important attribute. (What may make it important is if you need to monitor the scanning job to fix jams or re-feed. Then speed makes a big difference.)

To my mind the most important feature is how automatic the process is — can you put in a big stack of papers and come back later? This means a scanner which is very good at not jamming or double-feeding, and which handles papers of different sizes and thicknesses, and can tolerate papers that have been folded. My readings of reviews and spec sheets show many scanners that are good at detecting double feeds (the scanner grabs two sheets) as well as detecting staples, but the result is to stop and fix by hand. But what scanners require the least fixing-by-hand in the first place?

All the higher end units scan both sides in the same pass. Older ones may not do colour. Other things you get as you pay more will be:

  • Bigger input hoppers — up to around 500 sheets at a time. This seems very useful.
  • Higher daily duty cycles, for all-day scanning.
  • Staple detectors (stops scan) and ultrasonic double feed detectors (also stop scan.)
  • Better, fancier OCR (generating searchable PDFs) including OCR right in the hardware.
  • Automatic orientation detection
  • Ability to handle business cards. Stack up all those old business cards!
  • The VRS software system, a high end tool which figures out if the document needs colour, grayscale or threshold, discards blank pages or blank backs and so on.
  • In a few cases, a CD burner, so the scanner can be used without a computer.
  • Buttons to label “who” a document is being scanned for (can double as classification buttons.)
  • Ability to scan larger documents. (Most high-end seem to do 11” wide which is enough for me.)

One thing I haven’t seen a lot of talk about is easy tools to classify documents, notably if you put several documents in a stack. At a minimum it would be nice if the units recognized a “divider page” which could be a piece of coloured paper or a piece of paper with a special symbol on it which means “start new document.” One could then handwrite text on this page to have it as a cover page for later classification at the computer, or if neatly printed, OCR is not out of the question. But even just a sure-fire way to divide up the documents makes sense here. Comments suggest such tools are common.
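To make the divider-page idea concrete, here is a minimal sketch in Python of how software on the computer might cut one long scan into separate documents. Everything in it is an assumption for illustration: the function name `split_on_dividers` is hypothetical, and the `is_divider` test stands in for whatever detection (a coloured sheet, a special symbol) a real tool would use.

```python
from typing import Callable, Iterable, List


def split_on_dividers(pages: Iterable[str],
                      is_divider: Callable[[str], bool]) -> List[List[str]]:
    """Group scanned pages into documents, starting a new one at each divider.

    `pages` is the scan order of the whole stack. Divider pages themselves
    are dropped here; a real tool might keep them as cover pages instead.
    """
    documents: List[List[str]] = []
    current: List[str] = []
    for page in pages:
        if is_divider(page):
            # A divider closes the document in progress, if there is one.
            if current:
                documents.append(current)
            current = []
        else:
            current.append(page)
    # The last document has no divider after it.
    if current:
        documents.append(current)
    return documents
```

Two dividers in a row, or a divider at the very start of the stack, simply produce no empty document, so a stray coloured sheet does no harm.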

It may be that the most workable solution is to hire teen-agers or similar to operate the scanner, fix jams and feed and classify documents. At the speeds of these scanners (as much as 100 pages/minute for the higher end) it seems there will be something to do very often.

Anyway, anybody have experience with some of the major models and comments on which are best? The major vendors include Canon, Xerox Documate/Visioneer, Fujitsu, Kodak, Bell and Howell and Panasonic.

Photograph your shelves to catalog your library

A lot of people want to catalog their extensive libraries, to be able to know what they have, to find books and even to join social sites which match you with people with similar book tastes, or even trade books with folks.

There are sites and programs to help you catalog your library, such as LibraryThing. You can do fast searches by typing in subsets of book titles. The most reliable quick way is to get a bar code scanner, like the free CueCats we were all given a decade ago, and scan the ISBN or UPC code. Several of these sites also support you taking a digital photograph of the UPC or ISBN barcode, which they will decode for you, but it's not as quick or reliable as an actual barcode scanner.

So I propose something far faster -- take a picture with a modern hi-res digital camera of your whole shelf. Light it well first, to avoid flash glare, perhaps by carrying a lamp in your hand. Colour is not that important. Take the shelves in a predictable order so picture number is a shelf number.

What you need next is some OCR of above average sophistication, since it has to deal with text in all sorts of changing fonts and sizes, some fine print and switching orientations. But it also has a simpler problem than most OCR packages because it has a database of known book titles, authors, publisher names and other tag phrases. And it even would have, after some time, a database of actual images of fully identified book spines taken by other users. There may be millions of books to consider but that's actually a much smaller space than most OCR has to deal with when it must consider arbitrary human sentences.

Even so, it won't do the OCR perfectly on many books. But that doesn't matter so much for some applications such as search for a book. Because if you want to know "Where's my copy of *The Internet Jokebook*" it only has to find the book whose text looks the most like that from a small set. It doesn't have to get all the letters right by any stretch. If it finds more than one match it can quickly show you them as images and you can figure it out right away.
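Here is a rough sketch of that "looks the most like" search, using Python's standard `difflib` module. It is only an illustration under assumptions: the `find_book` name and the list of (shelf number, OCR text) pairs are made up for this example, and a real system would likely use something smarter than plain string similarity.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def find_book(query: str,
              spines: List[Tuple[int, str]],
              n: int = 3) -> List[Tuple[int, str]]:
    """Return the n spines whose noisy OCR text is most similar to the query.

    `spines` holds (shelf_number, ocr_text) pairs, one per book spine found
    in the shelf photographs. The OCR text may be badly garbled.
    """
    def similarity(entry: Tuple[int, str]) -> float:
        # ratio() is 1.0 for identical strings and falls toward 0.0
        # as the strings share less text.
        return SequenceMatcher(None, query.lower(), entry[1].lower()).ratio()

    return sorted(spines, key=similarity, reverse=True)[:n]


spines = [
    (1, "Sn0w Crash Stephenson"),
    (2, "The lnternet Jokeb00k"),
    (3, "C Programrning Langnage"),
]
best_shelf, best_text = find_book("The Internet Jokebook", spines)[0]
```

Even with three wrong characters in the OCR text, the garbled spine on shelf 2 still ranks first, and the other top candidates are what you would show as images when the match is ambiguous.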

If you want a detailed catalog, you can also just get the system to list only the books it could not figure out, and you can use the other techniques to reliably identify them. The easiest is looking at the image on screen and typing the name, but it could also print out those images per shelf, and send you over to get the barcode. The right software could catalog your whole library in minutes.

This would also have useful commercial application in bookstores, especially used ones, in all sorts of libraries and on corporate bookshelves.

Of course, the photograph technique is actually worthwhile without the OCR. You can still peruse such photographs pretty easily, much more easily than going down to look at books in storage boxes. And, should your library be destroyed in a fire, it's a great thing to have for insurance and replacement purposes. And it's also easy to update. If you don't always re-shelve books in the same place (who does) it is quick to re-photograph every so often, and software to figure out that one book moved from A to B is a much simpler challenge since it already has an image of the spine from before.

Converting vinyl to digital, watch the tone arm

After going through the VHS to digital process, which I lamented earlier, I started wondering what the state of digitizing old vinyl albums and tapes is.

There are a few turntable/cd-writer combinations out there, but like most people today, I’m interested in the convenience of compressed digital audio, which means I don’t want to burn to CDs at all. Nor would I want to burn to 70 minute CDs that I have to change all the time just so I can compress later. But all this means I am probably not looking for audiophile quality, or I wouldn’t be making MP3s at all. (I might be making FLACs or sampling at a high rate, I suppose.)

What I would want is convenience and low price. Because if I have to spend $500 I probably would be better off buying my favourite 500 tracks at online music stores, which is much more convenient. (And of course, there is the argument over whether I should have to re-buy music I already own, but that’s another story. Some in the RIAA don’t even think I should be able to digitize my vinyl.)

For around $100 you can also get a “USB turntable.” I don’t have one yet, but the low end ones are very simple — a basic turntable with a USB sound chip in it. They just have you record into Audacity. Nothing very fancy. But I feel this is missing something.

Just as the VHS/DVD combo is able to make use of information like knowing the tape speed and length, detecting index marks and blank tape, so should our album recorder. It should have a simple sensor on the tone arm to see as it moves over the album (for example a disk on the axis of the arm with rings of very fine lines and an optical sensor.) It should be able to tell us when the album starts, when it ends, and also detect those 2-second long periods between tracks when the tone arm is suddenly moving inward much faster than it normally is. Because that’s a far better way to break the album into tracks than silence detection. (Of course, you can also use CDDB/Freedb to get track lengths, but they are never perfect, so combining the tone arm sensor, net data and silence detection should get you perfect track splits.) It would also detect skips and repeats this way.
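As a sketch of how the track breaks could be pulled out of the tone arm sensor data, here is some Python. It assumes, purely for illustration, that the sensor gives a list of (time, position) samples with strictly increasing times and a position that shrinks as the arm moves inward. The `find_track_gaps` name and the `factor` threshold are hypothetical, not anything a real turntable provides.

```python
from statistics import median
from typing import List, Tuple


def find_track_gaps(samples: List[Tuple[float, float]],
                    factor: float = 3.0) -> List[Tuple[float, float]]:
    """Find time intervals where the arm moves inward unusually fast.

    `samples` is a list of (time_seconds, arm_position) pairs. An interval
    counts as a gap between tracks when its inward speed exceeds `factor`
    times the median speed over the whole side.
    """
    # Inward speed for each pair of consecutive samples.
    speeds = [
        (t0, t1, (p0 - p1) / (t1 - t0))
        for (t0, p0), (t1, p1) in zip(samples, samples[1:])
    ]
    typical = median(speed for _, _, speed in speeds)

    gaps: List[Tuple[float, float]] = []
    for t0, t1, speed in speeds:
        if speed > factor * typical:
            if gaps and gaps[-1][1] == t0:
                # Extend the gap that ended where this interval starts.
                gaps[-1] = (gaps[-1][0], t1)
            else:
                gaps.append((t0, t1))
    return gaps
```

The midpoint of each returned interval is a candidate split point, which could then be checked against silence detection and the track lengths from the net. A negative speed (the arm jumping outward) would be one sign of a repeat, though this sketch does not look for that.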

Paperless Home

In my home I now have a "home computer", for a while in the kitchen, now in the living room. Of course I have had computers in my home since the 70s, but this one is different. It's an old cheap laptop I picked up, not powerful at all. What's different is how I use it.

I have connected a Visioneer sheetfed scanner to it. I can drop papers and business cards into it and it scans them. Then I drop them in a box. I have scans of all the receipts and other documents, but if for some reason I need the original I can see by date which ones were nearby and presumably find it quickly enough. A good idea might be to drop a coloured sheet in once a month.

This has inspired me to design a simple device which would be very cheap to build. It's a small sheetfed scanner like this one, which is the size of a 3-hole punch. It's battery powered so it can be stuck anywhere. It has no cables going into it, instead it has a memory card slot, such as compact flash.

When you scan, the data would just be written to the flash card. (Nicely this means the scans are fast.) A button or two on the scanner might set some basic options (like colour or gray, delete and rescan etc.) At most a small LCD panel would be all you get.

When the flash is full, you would take it to the computer, which would do all the real work. Read and process the data. Convert grayscales to bitmaps at the right level as desired. OCR the text for searching and indexing. Detect orientation, tilt, business cards (by size) etc. all automatically.
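Two of those steps are simple enough to sketch in Python. This is only an illustration under assumptions: the function names are made up, the size limits are my guess at a tolerance around the standard 3.5 by 2 inch US business card, and the fixed threshold stands in for whatever "right level" a real program would work out per page.

```python
from typing import List


def classify_by_size(width_px: int, height_px: int, dpi: int) -> str:
    """Guess what a scan is from its physical size.

    Converts pixels to inches using the scan resolution, then compares
    the short and long sides against assumed business card limits.
    """
    short_side, long_side = sorted((width_px / dpi, height_px / dpi))
    if short_side <= 2.5 and long_side <= 4.0:
        return "business card"
    return "page"


def to_bitmap(gray_rows: List[List[int]], level: int = 128) -> List[List[int]]:
    """Threshold 0-255 grayscale pixels into 1 (light) or 0 (dark)."""
    return [[1 if px >= level else 0 for px in row] for row in gray_rows]
```

For example, a 1050 by 600 pixel scan at 300 dpi is 3.5 by 2 inches, so it is classed as a business card, while a 2550 by 3300 pixel letter-size scan at the same resolution is a page.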

And of course then let you view and organize your papers on the computer.

We've dreamed of a paperless office for some time but never gotten there (though we get a little closer all the time.) But can we get to a paperless home, or at least a lower-paper home? I find the paperwork and nitty-gritty of managing a home gets more frustrating with time and hope something can help it.

As noted, my design is extremely cheap. The scanner, a small controller and a flash interface are about all there is to it. It should be cheaper than the current scanners, which can be had for well under $100. The flash card is the expensive part.
