Back in the "old days," before people began to use computerized word processors, they produced documents in three ways: by hand, by typewriter, and by offset printing. This series of articles explains how to take a typed or offset-printed document and convert it to a computerized document that you can then use to create a book, CD-ROM, or other stored computer document.
I based these articles on my experiences with transforming my father's books and my "old" Master's thesis to a computerized format. I will fill the articles with advice based on my own errors in the hopes that I can prevent others from making the same mistakes.
I did this process with a Macintosh computer. Therefore, some references to specific programs are Mac based references. Whether or not the same program is available on the Windows platform, I do not know. I also cannot say if the program works in the same way on the different platform. However, I will attempt to make the remarks general enough to apply to all similar programs.
Before starting this project, I had purchased an HP OfficeJet all-in-one printer, scanner, and copier specifically for this project. I wanted a device that could copy multiple pages at once, scan multiple pages at once and print in both black & white and color. If you are converting a long document, I would recommend that you have such a device for your computer. To find the best such device currently available, check the reviews of "all-in-one" machines with the trade magazines for your computer's operating system.
To understand what I started with, I will briefly explain what I inherited from my father. My father wrote three books that were genealogical studies of our family. He typed all of his books himself. He did not really trust computers. He believed in "saving space." So, he heavily used abbreviations and "see" references, which would direct readers to bottoms of pages that he had left blank from previous or future chapters in the book. The result was a smaller book but one which was difficult to read. My father copyrighted these books, which he xeroxed and had GBC bound for sale to family members and others interested in the genealogical work. When he died, I inherited his work. After a few years, I decided to transform the books to computerized documents. After I cleaned them up and printed them in a more readable format, the books grew in size: one doubled and the other two tripled. I don't know if such an increase is standard for printed or typed documents. However, you should anticipate some increase in the size of the work.
Copyright Warning
This series of articles presupposes that you have control of the intellectual property that you are working with. That is, you (or the person you are helping) own the copyright to the material. It is illegal to reproduce material that is still under copyright protection without the specific approval of the person or persons who own the copyright. One of the goals of this series of articles is to help you to obtain your own copyright to the material that you have created.
Step 1. Make a Copy of the Original
The first step is to take the original and make a copy of it. To do this, you may have to take the document apart. Because my father's books were GBC bound, it was easier to take them apart. I made copies of both sides of the books and then assembled the pages in order. If you are dealing with a traditionally bound book, you have a set of nasty choices. You can disassemble the book; copy the pages; and, hopefully, have the book rebound professionally. Or, you can manually copy each page one at a time. If the book is large, this will take a long time to do.
I have developed a few Dos and Don'ts for copying large documents that you wish to scan to a computer.
1. Do make the "cleanest" copy possible. The fewer marks, black streaks and other oddball items on the page, the better it will be for you later on in the process. This may mean making a copy of the original page, using "white out" on the extraneous parts and recopying the page.
2. Don't reduce the image when copying. No matter how tempting, do not reduce the image of the typed page. It will make it more difficult later on, if you do.
3. Don't copy two pages of the book onto one page. Even if you are dealing with a book that when opened flat can fit two pages onto one 8.5 by 11 page, don't do it. Copy each page individually or you will have to cut the pages apart before scanning. If you don't, you will only have to scan them again.
4. Do make special note of pages that have "unique" material on them. This includes: photographs, handwritten comments, xeroxed documents within the text, and other material. I found it beneficial to make separate duplicated copies of these pages.
5. If you are dealing with a "bad" original, it may be that no amount of xeroxing or other work will make the document presentable enough to scan and save you time.
After you have made a xerox copy of the entire document and assembled it in page order, you are ready for the second step.
Step 2. Scan to the Computer
There are two steps to the scanning process: (1) the original scan, and (2) the use of the OCR program. Most scanners come with their own OCR programs. This is what I used for the process. The method you will need to use to scan will depend on your scanner. There are two types of scanners available: (1) single sheet (page) scanners, and (2) multiple sheet (page) scanners. If you are doing a long document, I recommend that you use a multiple sheet (page) scanner. In either case, I would definitely recommend a flat bed scanner.
Now, place the material to be scanned either on the scanner's flat bed or into the tray for the multiple documents. Make sure that the document is correctly loaded. That is, the area that you want to scan is facing the correct direction for the machine.
At this point, you can choose either to scan only or to do an OCR scan, which scans the document and passes it directly to the OCR program connected with the scanner. If you are using a program that did not come with the scanner, I recommend that you first scan the document and then perform the OCR function. If you are using the OCR program that came with the scanner, then you can do both at the same time, provided your scanner controls let you do so.
Again, I have a few recommendations concerning the scanning process.
1. Do use the cleanest xerox copies to scan.
2. Do save pictures and "special pages" for scanning separately.
3. If "special pages" have both text and pictures, handwriting or other special aspects, scan them in order and then set them aside for further scanning, if you have not made duplicates of them already.
4. Even multiple page scanners seem to have a level of tolerance. That is, they can only scan so many pages at once. My scanner's rate was between 10 and 12 pages. I determined this rate by beginning to scan at 20 pages and then getting cut off during the OCR process. Through the process of trial and error, I managed to determine what the optimum rate was. Yes, every once in a while it would take more than the 10-12 pages, but it wasn't worth the time in rescanning to push beyond the basic optimum rate.
5. Scan pictures, photographs, graphics, and special documents separately. Do them one at a time. Scan each one as a picture and store it separately. Save it as a ".jpg" (JPEG) file, which is a picture format. Scanning and saving it this way will give you the most flexibility with the document at a later date. If you are curious about what I mean by "special" documents, I will use an example from my father's books. He had xeroxed original handwritten documents such as wills, census forms, and deeds. He referred to these special documents within his text, and I needed to treat them as JPEG files to achieve the same impact. In addition, my father had hand-drawn maps, hand-drawn graphics, and photographs. Except for the photographs, I scanned all of these in separately and then worked with them to make them look as good as possible. The photographs I had a friend place on a CD-ROM for me, and I brought them in separately. (Thank you, John.) I later learned to scan the photos in myself from classes that I took from that same friend.
6. The scanner places the scanned information into a document that looks like the document to the right. You will not be able to manipulate it, change it, or do anything with it. In order to change this scanned document to a document capable of being word processed, you will need to send it through an OCR program. That is the next step.
However, look at this scanned version and note a few items. Paragraphs are indented, the title is centered, a table of people is intact, and the signature line came through. The document looks much like the original page. Most of the text is readable. This will change to some degree, depending on the OCR program that you use.
7. Do not expect that everything will come through as it was in the original. Some of it will not come through well at all. At this point, you will have to decide whether it would be easier to retype the information or to work with the scanned material.
My college thesis, which I typed in 1977, was such a document. At the time, I typed it using a blue ink typewriter ribbon. Xerox copies of this faded typed document did not scan well. I type at 55+ words a minute. It was quicker to retype the whole document than to deal with the bad scans. If you are dealing with a document that is in a questionable state, I would scan only a few pages and see how they turn out before deciding whether the scanning is worth it. The decision also depends on how fast you can type, or on whether you want to hire someone to type it for you and how much they would charge.
8. No matter how many pages your scanner can process at a time, use natural stopping points even if it means scanning only three pages at one time. For instance, suppose your scanner operates best with 10 pages at once and you have a chapter that is 33 pages long. You can do three scans of 10 pages. That will leave you with three pages for the next scan. Unless the next chapter is less than 7 pages, I would recommend that you scan the three pages by themselves and then proceed to the next chapter. It will make it easier during the OCR function and during the editing functions later on.
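The arithmetic in that last recommendation can be worked out ahead of time. What follows is only a minimal sketch in Python, not a tool I used: the function name plan_batches and the chapter lengths are made up for illustration, and it always stops at the end of a chapter, so it does not model the exception for a very short next chapter.

```python
def plan_batches(chapter_lengths, batch_size=10):
    """Split each chapter into scan batches that never cross a chapter boundary."""
    plan = []
    for chapter, pages in enumerate(chapter_lengths, start=1):
        first = 1
        while first <= pages:
            # A batch ends at the batch limit or at the chapter's last page.
            last = min(first + batch_size - 1, pages)
            plan.append((chapter, first, last))
            first = last + 1
    return plan


# Hypothetical book: a 33-page chapter followed by a 12-page chapter.
for chapter, first, last in plan_batches([33, 12]):
    print(f"Chapter {chapter}: pages {first}-{last} ({last - first + 1} pages)")
```

For the 33-page chapter this gives three scans of 10 pages and one scan of 3 pages, which matches the example above.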
Step 3. Use of the OCR Program
OCR is an acronym. It stands for Optical Character Recognition or Optical Character Reader. An OCR program is one that takes documents that have been scanned in and creates a computerized text document that can then be manipulated. I used the program that came with my HP machine. It was called ReadIris. Yes, there were limits to this program. Yes, other programs exist that have a better rating than this program. However, this program had one definite advantage: it was free.
In addition, since it came with the HP scanner, the scan portion of the OCR function immediately looked for the ReadIris program. I was not sure that another program would work as smoothly or allow the process to be done in one step. I was also not sure that any other program would produce a document that was any better than the document that this program produced.
If your scanner does not come with an OCR program, review the documentation for the available OCR programs and then make the best selection from the group. If you can, test a scanned document with that particular OCR program to observe what the results are. If your scanner does come with an OCR program, try it out. If it works most of the time, or is satisfactory for you, then you don't need to upgrade. If it does not, then upgrade. Part of it depends on how much editing that you will need to do with the text. As I indicated, with my father's text, I knew I would have to do a heavy amount of editing anyway. The "extra" editing that came because of the OCR program was not that significant.
The document to the right shows the results of the OCR conversion of the same scanned document shown earlier. Notice that while the OCR function kept the paragraphs, there is no indentation. In addition, the OCR did not center the chapter number and title. This was common for this particular OCR program.
Also notice that the date "1736" became "17)6." This too was common. I discovered that this OCR program did not always read the "3" that my father's typewriter created as a "3." Notice that further down on the page, it read the "3" correctly. However, in most cases, it read it as a close parenthesis mark ")." This was a problem with the OCR program.
Notice that the signature line came through as "~d.J;;L 7f~~." I am not sure whether this was a result of the OCR program or a result of something that was simply unreadable no matter what OCR program was used. Handwriting, both cursive and printed, was not meant to be transferred to an OCR program and most of these types of information do not transfer well.
Another problem is obvious with this translation. The material that appears as a table in the original does not come across as a table in this version. The OCR program scans across the page and runs all the material on the same line together whether it follows each other or not. It would do the same thing if it were "reading" two separate pages that were next to each other. So, if it were page 3 and page 4 scanned at the same time, the OCR would scan line one of page 3, then line one of page 4, and then combine them. This would make the text virtually unreadable.
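To see why two pages scanned side by side become unreadable, here is a small illustration in Python. It is only a sketch of the row-by-row behavior described above, not how any particular OCR program is written, and the sample sentences are invented.

```python
from itertools import zip_longest


def read_across(left_page, right_page):
    """Join line N of the left page with line N of the right page, row by row."""
    return [
        " ".join(part for part in pair if part)
        for pair in zip_longest(left_page, right_page, fillvalue="")
    ]


# Invented sample text standing in for page 3 and page 4.
page_3 = ["He was born in the", "spring of that year."]
page_4 = ["His son moved west", "and bought a farm."]

for line in read_across(page_3, page_4):
    print(line)
```

Each printed line mixes half a sentence from one page with half a sentence from the other, which is the kind of result you would have to untangle by hand.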
Finally, there were several times that the computer could not distinguish between a lower case "L" and the number "1." The particular typewriter that my father used caused part of this. The original manual typewriter did not have a "1" key. Therefore, as was the convention of the time, he used the lower case "L" for the number one. Later on, when he switched to an electrical typewriter, he would forget that it had a "1" key and he would still use the lower case "L." This was not a result of the OCR program and would have been the same in any OCR program because it was based on the inconsistencies in the original document. So, if you are scanning and OCRing an older typed document, remember that certain conventions were used that may confound the best OCR program.
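Misreadings like these follow a pattern, so some of them can be caught with a search-and-replace pass before you start editing by hand. The Python sketch below is my own assumption about how such a pass might look, not a feature of ReadIris. It covers only the two problems described above (a ")" between digits, and an "L" standing next to digits), and the name fix_digits is made up. Review every change it makes, since a rule this simple can also alter text that was correct.

```python
import re

# Assumed patterns, based only on the two misreadings described in the article.
PAREN_FOR_THREE = re.compile(r"(?<=\d)\)(?=\d)")         # "17)6" -> "1736"
L_BEFORE_DIGIT = re.compile(r"(?<![A-Za-z])[lL](?=\d)")   # "l847" -> "1847"
L_AFTER_DIGIT = re.compile(r"(?<=\d)[lL](?![A-Za-z])")    # "184l" -> "1841"


def fix_digits(text):
    """Replace ')' and 'L' with '3' and '1' where they sit inside a number."""
    text = PAREN_FOR_THREE.sub("3", text)
    text = L_BEFORE_DIGIT.sub("1", text)
    text = L_AFTER_DIGIT.sub("1", text)
    return text


print(fix_digits("THE FAMILY OF ZACHARIAH LYERLY (17)6-L847)"))
```

The letters in ordinary words are left alone because the patterns only match an "L" that touches a digit and is not part of a longer word.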
Most OCR programs allow you to select the word processing program that you are using and then save to that program's format. After much experimentation, I saved the document in the generic TextEdit format because that would not limit my use of any word processor.

The document to the left shows the same document saved to a Word document. Notice that it has the same mistakes that the document saved to the generic TextEdit format does. However, I could manipulate this text so that the top two lines would reveal a problem:
CHAPTER 4
THE FAMILY OF ZACHARIAH LYERLY (17)6-L847)
Notice that the "1" in the date "1847" is really a lower case "L," which became obvious when I switched the text to all caps. This error was a bit more obvious in the documents created for Word than it was in the generic TextEdit format. So, which word processing program you want to save to is up to you. I found that this particular OCR program had fewer problems and ran more smoothly when the program saved the material to the TextEdit program.
The selection of the OCR program will determine how much work that you will have to do in the editing phase of the document. No matter how good the program is, do not believe that everything is going to transfer completely and correctly. It will not. You are still going to have to do work on the editing side of the document. The next article will deal with the series of steps in the process of editing your work once it is on the computer.
I can summarize some of my recommendations concerning the OCR program as follows.
1. If you choose to use the "free" OCR program, realize that it is free for a reason and may be limited in function. If this does not bother you, then go with it. On the other hand, if your typing is slow or you would prefer a better OCR program, then review the literature to find the best one available for your operating system, scanner, and computer at the time you are doing your project.
2. Realize that no matter how good the OCR program is, the original material limits it. If the original document used conventions of the day that documents do not use now, then the OCR will translate and keep these conventions in the final document. You should make notes of the ones that you observe for correction during the editing process.
3. When setting up the preferences concerning the type of word processing document you wish to save to, try different ones to see which one produces the document that you can read the best with the least amount of difficulties with the OCR program itself.
4. Scanned columns and tables tend not to translate well into an OCR document. Pre-existing table of contents pages, multiple pages on one page, or columns should be scanned with care and noted when converting using the OCR program. If possible, you should avoid scanning columns, tables, and double-paged text. You should split them apart and scan them in the proper sequence. Tables of contents tend to come in with the spaces missing between the title of the article and the page number.
Step 4. Save, Copy, Store
Depending on the length of your document, you should do this step at various times throughout this first series of steps. If the document has chapters, I recommend that you save the chapters (both the scan documents and the OCR documents) to a separate storage disk at the end of each chapter. Then, carefully store this disk away where it won't be hurt, damaged or otherwise hexed by the gremlins of the computer world.
If your document does not have chapters, I would recommend that you perform the save function at the end of each day's work or at points that you can easily remember (pages 25, 50, 75, etc.).
I would label the folder that holds the results of the OCR conversion the "take from" folder. You will not touch these documents directly. They will only be used to copy from. Place in this folder the JPEG files that you have, if any. Also, place in this folder any other material that will be coming into the final document that you are creating. This would be covers, introductory material, additional material that you have created, table of contents, lists, etc. This folder represents your final, entire document.
When you are done with the entire document, I would recommend that you not only save the material onto the disk you have been saving to but that you also save it onto a CD. This CD I would then store separately from the other saved documents. And, if this document is valuable to you, I would place it in a fireproof container. I joke not. Take it from me that starting over in this process is most disheartening.
Therefore, for this step in the process, I would recommend the following.
1. Do copy on a regular basis. You may copy onto the same hard drive you are using. However, it would be wiser to copy onto a separate drive.
2. You should copy onto a disk that is big enough to hold most, if not all, of your document. A zip disk or a CD will usually give you enough space to save the material.
3. Store the material away from your computer in a separate location: another room, at work, a storage area, or a safe deposit box. If you can't do that, invest in a fireproof box and store it in that. If you are really paranoid, do both. Believe me, you can't be too careful about this.
4. Create a folder called the "take from" folder and store all of your documents in that folder. When you are finished with the scan and the OCR functions, you should copy this entire folder to a new, separate disk from the one that you have been saving to in the previous steps. You should also protect and store this disk in a separate location. If you are really paranoid, you should create several (two should do) and store them in two or more locations.
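If you are comfortable running a small script, you can also have the computer confirm that a copy really matches the original instead of trusting that the copy worked. The Python sketch below is one possible approach under my own assumptions, not the method I used: the function names are made up, and it needs Python 3.8 or later for the dirs_exist_ok option.

```python
import hashlib
import shutil
from pathlib import Path


def file_digest(path):
    """Return the SHA-256 digest of one file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_and_verify(source, destination):
    """Copy the source folder to destination, then list files that are missing or differ."""
    source, destination = Path(source), Path(destination)
    shutil.copytree(source, destination, dirs_exist_ok=True)
    problems = []
    for original in source.rglob("*"):
        if original.is_file():
            copy = destination / original.relative_to(source)
            # A file is a problem if the copy is absent or its contents differ.
            if not copy.is_file() or file_digest(copy) != file_digest(original):
                problems.append(str(original.relative_to(source)))
    return problems
```

You would call backup_and_verify with the path of your "take from" folder and the path of a folder on the backup disk. An empty list back means every file was copied and matches. Anything in the list should be copied again before you put the disk away.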
Conclusion
As I have hinted at in the last step, I lost one of the books even though I thought I had backed it up. I lost it not during this phase but at the end phase, when I had completely finished the document. I thought I would die. I didn't, and I believe the second time through with that book turned out better than the first. However, if I had backed it up several times in the process to a separate disk that I had stored elsewhere, this would not have happened. I didn't, and the fickle finger of fate did not properly copy it to the backup disk that I thought I had copied it to. If the document is long enough and important enough to go through this entire process with, then it is valuable enough to be placed on several disks and stored in several locations for safety reasons.
Now that you have your document saved into a folder and backed up onto other folders, you are ready to begin the editing process for the document. This is the process that I will cover in the next article.