scanning

Differences

This shows you the differences between two versions of the page.

scanning [2015/12/02 07:41] sbw
scanning [2023/08/16 13:22] (current) sbw
Line 1: Line 1:
 ====== Scanning SBW magazines ======
  
 +Now that the first stage of the project is largely complete, this page is mainly for information. It may help other organisations undertaking digital history projects.
  
 ===== 1. Remove the magazines from the springback binder =====
Line 16: Line 17:
 ===== 3. Scan the magazine =====
  
-Scan the pages one at a time, into a multipage PDF.
+Scan the pages one at a time into a multipage PDF. A high-end office scanner, of the kind found in most large corporates, is ideal: modern ones scan quickly and at high quality.
  
 I use the following settings:
Line 28: Line 29:
 High-end office scanners normally default to PDF these days. You normally have about 60 seconds after scanning one page to scan the next in order to keep it as one document. I find that it takes about 15 seconds/page once you get going.
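As a rough planning aid, the 15 seconds/page figure can be turned into a time estimate per magazine. This is only a sketch: the function name and the 20-page example are illustrative assumptions, not figures from the project.

```python
# Rough manual-scanning time estimate, based on the ~15 seconds/page pace
# quoted in the text. The page counts used with it are hypothetical.
SECONDS_PER_PAGE = 15


def estimate_scan_minutes(pages: int, seconds_per_page: float = SECONDS_PER_PAGE) -> float:
    """Return the approximate time in minutes to scan `pages` pages by hand."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return pages * seconds_per_page / 60


if __name__ == "__main__":
    # 20 pages is an assumed magazine length, chosen only for illustration.
    print(f"{estimate_scan_minutes(20):.1f} minutes")
```

At that pace each page is finished well inside the roughly 60-second window, so the document should normally stay in one piece.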
  
-It may be possible to use the automatic document feeder for more recent magazines. The main risk with using the automatic document feeder is that if something goes wrong, it may damage the pages. However, most of the magazines (anything prior to 1984) are quarto size, not A4, which may make automatic scanning more difficult.
+It may be possible to use the automatic document feeder (ADF) for magazines with more modern paper. The main risk with using the ADF is that if something goes wrong, it can seriously damage the pages. The ADF tends to work better when the documents are fed in lengthways (rather than sideways, which is probably the default on most machines); this may need configuring on the scanner.
  
 Before you start scanning in earnest, check that the quality of the scan is as high as possible. Spend some time trying various settings to see which works best. Any areas that are not text/pictures should ideally come out white in the scan. Any speckles or colouring reduce the ability to subsequently OCR the documents, as the examples below show.
 +
 +Also note that the settings that give the best quality for OCR may not be the best for other purposes, e.g. photos.
  
 **1. Black and white – text/line art setting**
Line 75: Line 78:
  
 Insert the magazines back inside the springback binder. Don't restaple them – we'll look for a better method of archiving that doesn't mark the magazines.
 +
 +===== Other notes =====
 +Technical notes - to come back and fill in more detail on:
 +  * Double-sided scans via newer photocopiers
 +  * Use of OCRmyPDF (including Tesseract) for OCRing
 +    * plus OnlineOCR
 +  * Centering magazines
 +  * Use of Ghostscript to reduce file size
 +  * Use of PDFsam to merge double-sided files (not needed if a double-sided scanner is used)
 +  * Use of iconv to remove some non-printing characters
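Until these notes are filled in, the OCR, file-size and clean-up steps might look something like the sketch below. The page does not record the actual commands or settings used, so everything here is an assumption: the function names are hypothetical, the `/ebook` preset and `--deskew` option are just plausible choices, and the character clean-up is a Python stand-in for the iconv step rather than iconv itself.

```python
import subprocess
import unicodedata
from typing import List


def ocrmypdf_command(src: str, dst: str, language: str = "eng") -> List[str]:
    """Build an OCRmyPDF command line (OCRmyPDF runs Tesseract internally)."""
    # -l selects the OCR language; --deskew straightens slightly crooked scans.
    return ["ocrmypdf", "-l", language, "--deskew", src, dst]


def ghostscript_shrink_command(src: str, dst: str, preset: str = "/ebook") -> List[str]:
    """Build a Ghostscript command that rewrites a PDF at a smaller size."""
    # -dPDFSETTINGS picks a quality preset (/screen, /ebook, /printer, /prepress).
    return [
        "gs", "-sDEVICE=pdfwrite", f"-dPDFSETTINGS={preset}",
        "-dNOPAUSE", "-dBATCH", "-dQUIET", f"-sOutputFile={dst}", src,
    ]


def strip_non_printing(text: str) -> str:
    """Drop control and format characters, keeping newlines and tabs."""
    return "".join(
        ch for ch in text
        if ch in "\n\t" or unicodedata.category(ch) not in ("Cc", "Cf")
    )


def run_pipeline(scan_pdf: str, ocr_pdf: str, small_pdf: str) -> None:
    """Run OCR then compression; needs ocrmypdf and gs installed on the PATH."""
    subprocess.run(ocrmypdf_command(scan_pdf, ocr_pdf), check=True)
    subprocess.run(ghostscript_shrink_command(ocr_pdf, small_pdf), check=True)
```

Calling `run_pipeline("scan.pdf", "scan-ocr.pdf", "scan-small.pdf")` (hypothetical file names) would OCR the scan and then write a smaller copy; it is worth comparing the output against the original, since stronger compression presets can degrade photos.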
 +
 +
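For the merge step for double-sided files, the underlying idea is an alternating merge: one file holds the front sides, another the back sides, and the pages are taken in turn. The sketch below only computes the page order and does not touch any PDF. The file labels, the function name and the `backs_reversed` option (for when the stack was flipped over before scanning the backs) are assumptions for illustration.

```python
from typing import List, Tuple


def interleave_pages(front_count: int, back_count: int,
                     backs_reversed: bool = False) -> List[Tuple[str, int]]:
    """Return the merged page order as (file label, 1-based page number) pairs."""
    backs = list(range(1, back_count + 1))
    if backs_reversed:
        # Flipping the whole stack means the last back side was scanned first.
        backs.reverse()
    order: List[Tuple[str, int]] = []
    for i in range(max(front_count, back_count)):
        if i < front_count:
            order.append(("fronts", i + 1))
        if i < back_count:
            order.append(("backs", backs[i]))
    return order


if __name__ == "__main__":
    for label, page in interleave_pages(3, 3, backs_reversed=True):
        print(label, page)
```

Checking the computed order against the printed magazine for the first and last few pages is a quick way to confirm whether the back sides were scanned in reverse.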
scanning.1449002504.txt.gz · Last modified: 2015/12/02 07:41 by sbw
