Scanning SBW magazines
Since the first stage of the project is largely complete, this page is mainly for information. It may help other organisations undertaking digital history projects.
1. Remove the magazines from the springback binder
The magazines are all stored in springback binders, one year (usually 12 issues) per binder. A springback binder works like a big bulldog clip: by bending back both halves of the cover, the folder with the magazines inside can be carefully removed.
2. Remove staples from the magazine
Use a staple remover to carefully remove the staples from the magazine. For the old magazines the staples are generally rusty in any case. Removing staples means that the pages can be made to lie completely flat on the scanner, which is critical for a high quality scan. Make sure that you keep the pages in the correct order!
3. Scan the magazine
Scan the pages one at a time, into a multipage PDF. A high-end office scanner of the kind found in most large corporates is ideal - modern ones scan quickly and at high quality.
I use the following settings:
- scan to PDF
- resolution: 300 x 300dpi
- paper size: A4 (if you leave it on Auto, sometimes the machine will get it wrong)
- Black and white text (don't use photo settings, as you will get speckles; don't use greyscale)
- If you have a density setting, somewhere in the middle is probably best - too high and you will start to see speckles, too low and the text will be a bit faint.
- If you have a contrast setting, higher is generally better.
High-end office scanners normally default to PDF these days. Normally you have about 60 seconds after scanning one page to scan the next, in order to keep them in one document. I find that it takes about 15 seconds/page once you get going.
It may be possible to use the automatic document feeder for magazines with more modern paper. The main risk with using the automatic document feeder is that if something goes wrong, it can seriously damage the pages. The ADF tends to work better when the documents are fed in lengthways (rather than sideways, which is probably the default on most machines) - this may need configuring on the scanner.
Before you start scanning in earnest, check that the quality of the scan is as high as possible. Spend some time trying various settings to see which works best. Any areas that are not text/pictures should ideally come out white in the scan. Any speckles or colouring make the documents harder to OCR afterwards, as the examples below show.
Also note that the best settings for OCR may not be the best for other purposes, eg photos.
1. Black and white – text/line art setting
Result of OCR
the lap of luxury. IYOu know", said Frank, "a tourist's life won't be bad." "No", said Snow, "just on the tarn." But we were all looking forward to it!
2. Black and white – text only setting
Result of OCR
the lap of luxury. "You know", said Frank, "a tourist's life won't be bad." "No", said Snow, "just on the turn." But we were all looking forward to it!
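To see exactly where two OCR results disagree, a word-by-word comparison can help when trialling scanner settings. The following is only a sketch using Python's standard difflib module on the two results quoted above; the function name word_differences is made up for illustration, and it reports differences without judging which version is correct.

```python
from difflib import SequenceMatcher

# The two OCR results quoted above, copied verbatim.
TEXT_LINE_ART = """the lap of luxury. IYOu know", said Frank, "a tourist's life won't be bad." "No", said Snow, "just on the tarn." But we were all looking forward to it!"""
TEXT_ONLY = """the lap of luxury. "You know", said Frank, "a tourist's life won't be bad." "No", said Snow, "just on the turn." But we were all looking forward to it!"""


def word_differences(a, b):
    """Return (words_in_a, words_in_b) pairs for each stretch where a and b disagree."""
    words_a, words_b = a.split(), b.split()
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(words_a[i1:i2]), " ".join(words_b[j1:j2])))
    return diffs


if __name__ == "__main__":
    for left, right in word_differences(TEXT_LINE_ART, TEXT_ONLY):
        print(f"{left!r} -> {right!r}")
```

Running the same comparison on a test page scanned with several settings gives a quick way to count how many words each setting changes.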
The margins are not even, so take this into account when you are scanning, and shift the pages one way or the other, otherwise you may lose text off the sides. This is primarily an issue for older magazines which were printed on quarto paper, which is wider than A4. Newer ones are A4 and the text should fit on an A4 page regardless.
I suggest scanning the various cover sections (front cover, inside front cover, back cover, inside back cover) separately. This makes it cheaper and easier to OCR. For the cover (and any advertisement pages), I use a low density setting, as otherwise it comes out very dark. Again, some trial and error may be needed to get good results.
4. Saving and sending
Save the sections of each magazine under the following names:
- main part - <yyyymm>_<issuenumber>.pdf eg 195401_230.pdf for January 1954, issue number 230.
- front cover - eg 195401_230_cover.pdf
In a lot of cases, the same inside covers and back cover are used for a whole year. You only need to scan them once. Name them:
- 1954_inside_cover.pdf
- 1954_back_cover.pdf
- 1954_inside_back_cover.pdf
More recent magazines may not have issue numbers, in which case you can leave it out eg 198701.pdf
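If you are renaming a lot of files, the naming scheme above could be generated rather than typed. This is a minimal sketch only; the function names scan_filename and yearly_cover_filename and their parameters are invented for illustration and are not part of any existing tool.

```python
def scan_filename(year, month, issue=None, part=None):
    """Build a name like 195401_230.pdf, 195401_230_cover.pdf or 198701.pdf."""
    if not 1 <= month <= 12:
        raise ValueError(f"month must be 1-12, got {month}")
    stem = f"{year:04d}{month:02d}"
    if issue is not None:  # recent magazines may have no issue number
        stem += f"_{issue}"
    if part:  # eg "cover" for the front cover
        stem += f"_{part}"
    return stem + ".pdf"


def yearly_cover_filename(year, part):
    """Build a name for a cover section shared across a whole year."""
    allowed = {"inside_cover", "back_cover", "inside_back_cover"}
    if part not in allowed:
        raise ValueError(f"part must be one of {sorted(allowed)}")
    return f"{year:04d}_{part}.pdf"


if __name__ == "__main__":
    print(scan_filename(1954, 1, issue=230))
    print(scan_filename(1954, 1, issue=230, part="cover"))
    print(scan_filename(1987, 1))
    print(yearly_cover_filename(1954, "back_cover"))
```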
I can compile the various sections back into a magazine for the website. If you make a mistake (eg a page scans badly, page out of order) make a note of it and just keep scanning. For badly scanned pages, either rescan the page straight away, or scan it separately afterwards. I have software which can manipulate/insert/remove pages after the fact. This is usually easier than starting again.
Zip sets of PDF files together (max of about 10MB per zip file) and email them. Include a note explaining any errors that need to be corrected or pages rearranged.
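The zipping step can be scripted. The sketch below groups PDFs into batches of roughly 10MB and writes one zip per batch using Python's standard zipfile module. It assumes all the PDFs sit in one folder; the function names and the "scans" prefix are made up for illustration. The limit is applied to the PDF sizes before compression, so the zip files should come out at or a little under that size.

```python
import zipfile
from pathlib import Path

MAX_BYTES = 10 * 1024 * 1024  # about 10MB per zip file


def plan_batches(sizes, limit=MAX_BYTES):
    """Greedily group (item, size) pairs, in order, into batches totalling <= limit.

    A single item larger than the limit still gets a batch of its own.
    """
    batches, current, total = [], [], 0
    for item, size in sizes:
        if current and total + size > limit:
            batches.append(current)
            current, total = [], 0
        current.append(item)
        total += size
    if current:
        batches.append(current)
    return batches


def zip_scans(folder, out_prefix="scans"):
    """Write scans_01.zip, scans_02.zip, ... into folder and return their paths."""
    folder = Path(folder)
    pdfs = sorted(folder.glob("*.pdf"))
    sizes = [(pdf, pdf.stat().st_size) for pdf in pdfs]
    written = []
    for number, batch in enumerate(plan_batches(sizes), start=1):
        target = folder / f"{out_prefix}_{number:02d}.zip"
        with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as archive:
            for pdf in batch:
                archive.write(pdf, arcname=pdf.name)
        written.append(target)
    return written
```

Because the files are sorted by name, each zip holds a consecutive run of issues, which makes the accompanying note about errors easier to write.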
5. Finishing up
Insert the magazines back inside the springback binder. Don't restaple them – we'll look for a better method of archiving that doesn't mark the magazines.
Other notes
Technical notes - to come back and fill in more detail on:
- Double sided scans via newer photocopiers
- Use of OcrMyPDF (including Tesseract) for OCRing, plus OnlineOCR
- Centering magazines
- Use of GhostScript to reduce file size
- Use of PDFSAM to merge double sided files (not needed if double sided scanner is used)
- Use of iconv to remove some non-printing characters
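Until these notes are filled in, here is a rough sketch of what the OCR and file size steps might look like when driven from Python. The exact options used on this project are not recorded here, so everything below is an assumption: the language code, the --deskew option, and the "ebook" Ghostscript preset are common choices, not the project's settings. The last function is a plain Python stand-in for the iconv clean-up, not iconv itself. The functions only build the commands; pass the result to subprocess.run(cmd, check=True) to actually run one, which requires ocrmypdf and gs to be installed.

```python
import shlex


def ocr_command(src, dst, language="eng"):
    """Command line for OCRmyPDF: add a text layer to src and write dst."""
    return ["ocrmypdf", "-l", language, "--deskew", src, dst]


def shrink_command(src, dst, preset="ebook"):
    """Command line for Ghostscript: rewrite src as a smaller PDF at dst.

    preset is one of Ghostscript's PDFSETTINGS names, eg screen, ebook, printer.
    """
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE",
        "-dBATCH",
        "-dQUIET",
        f"-sOutputFile={dst}",
        src,
    ]


def strip_nonprinting(text):
    """Drop non-printing characters (eg form feeds) but keep newlines and tabs.

    Note: this also drops non-ASCII space characters such as non-breaking spaces.
    """
    return "".join(ch for ch in text if ch.isprintable() or ch in "\n\t")


if __name__ == "__main__":
    print(shlex.join(ocr_command("195401_230.pdf", "195401_230_ocr.pdf")))
    print(shlex.join(shrink_command("195401_230_ocr.pdf", "195401_230_small.pdf")))
```

Whether Ghostscript makes a file smaller depends on how the scanner compressed it in the first place, so compare file sizes before replacing the original.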