Extract plain text with external converters
Last updated at 12:27 pm UTC on 23 August 2007
How to extract plain text from popular file formats such as PDF, PostScript, Word, PowerPoint and Excel with the help of external converters.
This howto gives some background information on the topic and shows a few small examples. The examples can be evaluated in a workspace.
Because TextIndexNG (http://opensource.zopyx.com/software/textindexng3/) is open source, you can see how this Zope/Plone product uses free (GPL) external converters to extract plain text from various file formats and then uses that text to build an index for full-text search. Perhaps this is useful for some Squeakers too.
I tested these external converters for PDF, PostScript, Word, PowerPoint and Excel files under Debian Linux with Squeak 3.9 and the image sq3.9-7067dev07.07.1.
File format / package requirements:
- PDF requires pdftotext from the xpdf tools,
- PostScript requires ps2ascii from ghostscript,
- Word requires wvWare version 1 (version 2 is not supported),
- PowerPoint requires ppthtml from the xlhtml package, plus html2text to turn the HTML into plain text, and
- Excel requires xls2csv from the catdoc package.
I've tested it on Debian 4.0. Ubuntu should have the same packages. You should find those packages on nearly every Linux distribution.
- The tests were made with the same call parameters that TextIndexNG uses. Since TextIndexNG is widely used with Plone/Zope, these parameters should be a reasonable default.
- Have a look at the manpage of each converter (e.g. man pdftotext) to see in detail which document versions are supported and which additional parameters exist.
- You may use different encodings; in the examples below the encoding is set with -enc (pdftotext), -c (wvWare) and -d (xls2csv).
- Error checking needs to be done after the converter call.
- The plain text of OpenOffice files should be extractable with the OpenOffice importer package from SqueakMap (I haven't tried it).
- The new XML-based MS Office file formats are not supported by the converters here but should be readable with the right XML tools in Squeak.
- Most of the converters are also available on Windows, but I haven't tested this.
- I've used PipeableOSProcess to call the external converters. I don't know if this is the most elegant way, but it worked for me.
- If you search for the terms GPL, Linux and converter you will find more converters for more file formats.
- When you use this approach to implement a full-text search, you should run a stemmer such as http://www.squeaksource.com/Stemmer.html over the plain text after extracting it.
The following code shows how it works. If you have the converter tools installed, evaluate the code in a workspace. You can use the attached test files (myTestFiles.zip -> my.pdf, my.ps, my.doc, my.ppt, my.xls).
path2myTestFiles := '/var/DirOfMyTestFiles/'.
" pdf "
myPDFFile := path2myTestFiles, 'my.pdf'.
result := (PipeableOSProcess command: 'pdftotext -enc UTF-8 ', myPDFFile, ' -') output.
" postscript "
myPSFFile := path2myTestFiles, 'my.ps'.
result := (PipeableOSProcess command: 'ps2ascii ', myPSFFile) output.
" word "
wvWareConfigFile := '/usr/share/wv/wvText.xml'.
myWordFile := path2myTestFiles, 'my.doc'.
result := (PipeableOSProcess command: 'wvWare -c utf-8 --nographics -x ', wvWareConfigFile,
' ', myWordFile, ' 2> /dev/null') output.
" powerpoint "
myPPTFile := path2myTestFiles, 'my.ppt'.
"this only converts to HTML"
result := (PipeableOSProcess command: 'ppthtml ', myPPTFile, ' 2> /dev/null') output.
"to get plain text we need two steps: first convert to an HTML file, then extract the text from that HTML file"
result := (PipeableOSProcess command: 'ppthtml ', myPPTFile, ' > ', myPPTFile , '.html') output.
result := (PipeableOSProcess command: 'html2text ', myPPTFile, '.html 2> /dev/null') output.
" excel "
"remark: this didn't work when I had diagrams in my Excel file"
myExcelFile := path2myTestFiles, 'my.xls'.
result := (PipeableOSProcess command: 'xls2csv -d 8859-1 -q 0 ', myExcelFile, ' 2> /dev/null') output.
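The note above says that error checking has to be done after the converter call. The following is only a minimal sketch of one way to do that in a workspace, not a tested recipe. The names commandBuilders and extractText are made up for this illustration and are not part of OSProcess or TextIndexNG; the command lines are the ones from the examples above. The sketch picks the converter by file extension and treats empty output as a failure. Checking the converter's exit status and its error output is left out here.

```smalltalk
"Hypothetical helper: map a file extension to a block that builds
 the converter command line used in the examples above."
commandBuilders := Dictionary new.
commandBuilders
	at: 'pdf' put: [:file | 'pdftotext -enc UTF-8 ', file, ' -'];
	at: 'ps' put: [:file | 'ps2ascii ', file];
	at: 'xls' put: [:file | 'xls2csv -d 8859-1 -q 0 ', file, ' 2> /dev/null'].

"Hypothetical helper: answer the extracted text, or nil if no converter
 is registered for the extension or the converter produced no output."
extractText := [:fileName | | builder text |
	builder := commandBuilders
		at: (fileName copyAfterLast: $.) asLowercase
		ifAbsent: [nil].
	builder isNil
		ifTrue: [nil]
		ifFalse: [
			text := (PipeableOSProcess command: (builder value: fileName)) output.
			text isEmpty ifTrue: [nil] ifFalse: [text]]].
```

It could then be used as: result := extractText value: path2myTestFiles, 'my.pdf'. A nil result only says that no text came back. That can mean the converter is missing or failed, but also that the document simply contains no text, so treat it as a hint to look closer rather than as a definite error.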