This article is provided courtesy of Sean Keegan, SensusAccess Solutions. This is the original Conversion Best Practices PDF.
The quality of a conversion depends on the quality of the original document. In addition, the output format may include navigation enhancements if the original file contains the appropriate semantic markup. For instance, a Microsoft Word document that uses heading styles for chapters (such as Heading 1, Heading 2, and so on) will convert into a more usable DAISY or EPUB file with the relevant chapter navigation elements. The following best practices identify simple ways to prepare a file before conversion in order to achieve high-quality output.
PDF and image-based files
- PDF and image-based files will be processed using optical character recognition (OCR) to create a text-based version of the document.
- If scanning the document, ensure the scanned image is free from smudges, dark marks, highlighted text or artifacts in the image. These will affect the accuracy of the OCR process.
- Minimize any skewing. If the image is scanned at an angle, the OCR process will be less accurate, resulting in a lower-quality text version.
- If you are starting with an image-based format and want a text format, SensusAccess can convert directly from an image file to a text file. For some image documents, however, you may get better results by first converting to Tagged PDF and then copying and pasting the text into a Microsoft Word document (see the Converting to Microsoft Word and text files section below).
Converting to Microsoft Word and text files
SensusAccess will convert image-based documents into Microsoft Word, RTF and text files. You may also find it useful with some image-based documents to convert initially to Tagged PDF and then copy and paste the text from the Tagged PDF into Microsoft Word. This may result in a better reading experience and may remove non-essential content.
With the Microsoft Word version of the document, you can more accurately clean the content for conversion into MP3 audio or for use with assistive technologies. Most conversions will take just a few seconds within Microsoft Word and involve the use of the Find and Replace tools. For more information on using the Find and Replace tools, see Using the Find and Replace in Microsoft Word for removing special characters in a document.
NOTE: In the Find and Replace examples below, replace the <space> value with a single space (press the spacebar once) and do not include any quotation marks.
Image-file to tagged PDF to Microsoft Word document
- Submit the image-based document to SensusAccess and select Tagged PDF as the output option.
- Open the Tagged PDF and select all the text. Copy and paste this into a Microsoft Word document (Open Office may also be used).
- Using Find and Replace, perform the following steps in order:
  - Search for .<space>^p and replace with .^p^p
  - Search for <space>^p and replace with <space>
  - Search for <space>•<space> and replace with ^p•<space>
  - Search for -<space> and replace with no value
- Save the document in your preferred text format.
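For readers who prefer to script this cleanup on plain text, the four Find and Replace steps above can be sketched in Python. This is a minimal illustration, not part of SensusAccess or Microsoft Word: the function name `clean_pasted_text` is hypothetical, and it assumes that Word's ^p (paragraph mark) corresponds to a newline character in the pasted text.

```python
def clean_pasted_text(text: str) -> str:
    """Apply the four Find and Replace steps, in order, to plain text.

    Assumption: Word's ^p (paragraph mark) corresponds to "\n" here.
    """
    # 1. Period + space + line end marks a real paragraph end: add a blank line.
    text = text.replace(". \n", ".\n\n")
    # 2. Any remaining space + line end is a wrap inside a paragraph: join the lines.
    text = text.replace(" \n", " ")
    # 3. Put each bullet on its own line.
    text = text.replace(" \u2022 ", "\n\u2022 ")
    # 4. Remove hyphen + space left behind by words split across lines.
    text = text.replace("- ", "")
    return text


sample = "The conver- \nsion is done. \nNext item \u2022 first \u2022 second"
print(clean_pasted_text(sample))
```

The order matters: step 1 must run before step 2, otherwise the sentence-ending line breaks would be joined along with the wrapped lines. Note that step 4 removes every hyphen followed by a space, so review the result if the document uses spaced hyphens intentionally.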
Image-file to Microsoft Word document
To clean up a Microsoft Word file for use with assistive technology or for creating MP3 files, perform a search and replace to remove optional hyphens and section breaks. Identify the special character you wish to find in the Find: box and leave the Replace with: box empty. See Using the Find and Replace in Microsoft Word for additional information on removing special characters in a document.
- Submit the image-based document to SensusAccess and select Microsoft Word as the output option.
- Open the converted Microsoft Word document. Open Office may also be used.
- Using Find and Replace:
  - Search for Optional Hyphen under Special Formatting and replace with no value
  - Search for Section Breaks under Special Formatting and replace with ^p^p
  - Search for Manual Page Breaks and replace with ^p^p
- Save the document in your preferred text format.
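If the document has already been saved as plain text, a similar cleanup might be scripted as shown below. This is only a sketch under stated assumptions: it assumes optional hyphens appear in the text as the Unicode soft hyphen (U+00AD) and that page and section breaks appear as form feed characters, which may not hold for every export. The function name `strip_break_characters` is hypothetical.

```python
SOFT_HYPHEN = "\u00ad"  # assumption: optional hyphens appear as U+00AD
FORM_FEED = "\f"        # assumption: page and section breaks appear as form feeds


def strip_break_characters(text: str) -> str:
    """Remove soft hyphens and turn break characters into blank lines."""
    # Equivalent to replacing Optional Hyphen with no value.
    text = text.replace(SOFT_HYPHEN, "")
    # Equivalent to replacing section and page breaks with ^p^p.
    text = text.replace(FORM_FEED, "\n\n")
    return text


print(strip_break_characters("con\u00adversion\fNext page"))
```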
Authoring Microsoft Word, RTF, text files
- Use Microsoft Word styles to specify document headings. For example, the style Heading 1 could be used to identify the title of the document and the style Heading 2 could be used to identify chapter information. It is best to use only one Heading 1 to facilitate accurate conversions into other document formats (such as DAISY, EPUB, and Braille).
- Provide short descriptions for content-related images in your Microsoft Word document.
- Avoid using text boxes in your document. If you want to customize the layout, use a Column Tool or a Section Break.
- If converting to DAISY, page numbers will be identified based on the Microsoft Word pagination. To obtain custom pagination, use the PageNumber style from the Save As DAISY plug-in for Microsoft Office for your custom page numbers.
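As a quick check of the single Heading 1 recommendation above, the following sketch counts heading styles in the main XML part of a .docx file. It is illustrative only: the function name `count_heading_styles` is hypothetical, and it assumes the heading style IDs are "Heading1", "Heading2", and so on, which is typical for English-language Word but may differ in localized or customized templates.

```python
import xml.etree.ElementTree as ET

# WordprocessingML namespace used by word/document.xml inside a .docx file.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def count_heading_styles(document_xml: str) -> dict:
    """Count paragraphs per heading style ID in a word/document.xml string."""
    root = ET.fromstring(document_xml)
    counts = {}
    for style in root.iter(f"{{{W_NS}}}pStyle"):
        style_id = style.get(f"{{{W_NS}}}val", "")
        # Assumption: heading style IDs start with "Heading" (English Word default).
        if style_id.startswith("Heading"):
            counts[style_id] = counts.get(style_id, 0) + 1
    return counts


# A tiny stand-in for a real document.xml: one title and two chapters.
sample_xml = (
    f'<w:document xmlns:w="{W_NS}"><w:body>'
    '<w:p><w:pPr><w:pStyle w:val="Heading1"/></w:pPr></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr></w:p>'
    '<w:p><w:pPr><w:pStyle w:val="Heading2"/></w:pPr></w:p>'
    '</w:body></w:document>'
)
print(count_heading_styles(sample_xml))
```

For a real file, the XML string can be read with the standard library, for example `zipfile.ZipFile(path).read("word/document.xml").decode("utf-8")`, where `path` is the location of the .docx file. A count greater than one for Heading1 suggests the document may need restructuring before conversion.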
Authoring HTML files
- Use HTML heading markup (e.g., <h1>, <h2>, etc.) to designate headings in the document. For example, <h1> could be used to identify the title of the document and <h2> could be used to identify chapter information.
- Provide short descriptions for content-related images in the HTML document.
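The two HTML recommendations above can be checked before submitting a file. The sketch below uses Python's standard `html.parser` module to list the headings in a document and flag images that have no alt text. The class name `StructureCheck` is hypothetical, and the check is deliberately simple: it cannot tell a content-related image from a decorative one, so flagged images still need human review.

```python
from html.parser import HTMLParser

HEADING_TAGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class StructureCheck(HTMLParser):
    """Collect heading tags and the src of images lacking alt text."""

    def __init__(self):
        super().__init__()
        self.headings = []
        self.images_missing_alt = []

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag in HEADING_TAGS:
            self.headings.append(tag)
        elif tag == "img" and not (attr_map.get("alt") or "").strip():
            # Missing or empty alt: record the image source for review.
            self.images_missing_alt.append(attr_map.get("src", ""))


sample_html = (
    "<h1>Title</h1><h2>Chapter 1</h2>"
    '<img src="a.png" alt="Chart of results"><img src="b.png">'
)
checker = StructureCheck()
checker.feed(sample_html)
print(checker.headings)
print(checker.images_missing_alt)
```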