Extract plain text with external converters
Last updated at 12:27 pm UTC on 23 August 2007

How to extract plain text from popular file formats such as PDF, PostScript, Word, PowerPoint and Excel with the help of external converters

This Howto gives some background information on the topic and shows a few small examples, which can be evaluated in a workspace.
Because TextIndexNG (http://opensource.zopyx.com/software/textindexng3/) is open source, you can see how this Zope/Plone product uses free (GPL) external converters to extract plain text from various file formats and then uses that text to build an index for full-text search. Perhaps this is useful for some Squeakers too.
I've tested those external converters with PDF, PostScript, Word, PowerPoint and Excel files under Debian Linux with Squeak 3.9 and the image sq3.9-7067dev07.07.1.


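File format / package requirements:

The converter commands used in the code below must be installed: pdftotext (PDF), ps2ascii (PostScript), wvWare (Word), ppthtml and html2text (PowerPoint), and xls2csv (Excel).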

I've tested it on Debian 4.0. Ubuntu should have the same packages. You should find those packages on nearly every Linux distribution.


The following code shows how it works. If you have the converter tools installed, evaluate the code in a workspace. You can use the attached test files (myTestFiles.zip -> my.pdf, my.ps, my.doc, my.ppt, my.xls).

path2myTestFiles := '/var/DirOfMyTestFiles/'.

" pdf "
myPDFFile := path2myTestFiles, 'my.pdf'.
result := (PipeableOSProcess command: 'pdftotext -enc UTF-8 ', myPDFFile, ' -') output.

" postscript "
myPSFile := path2myTestFiles, 'my.ps'.
result := (PipeableOSProcess command: 'ps2ascii ', myPSFile) output.

" word "
wvWareConfigFile := '/usr/share/wv/wvText.xml'.
myWordFile := path2myTestFiles, 'my.doc'.
result := (PipeableOSProcess command: 'wvWare -c utf-8 --nographics -x ', wvWareConfigFile,
   ' ', myWordFile, ' 2> /dev/null') output.

" powerpoint "
myPPTFile := path2myTestFiles, 'my.ppt'.
"ppthtml only converts to HTML"
result := (PipeableOSProcess command: 'ppthtml ', myPPTFile, ' 2> /dev/null') output.
"so this is a two-step process: first write the HTML to a file, then extract the text from that file"
"use > rather than >> so that a second run overwrites the HTML file instead of appending to it"
result := (PipeableOSProcess command: 'ppthtml ', myPPTFile, ' > ', myPPTFile, '.html 2> /dev/null') output.
result := (PipeableOSProcess command: 'html2text ', myPPTFile, '.html 2> /dev/null') output.

" excel "
"remark: this didn't work when I had diagrams in my Excel file"
myExcelFile := path2myTestFiles, 'my.xls'.
result := (PipeableOSProcess command: 'xls2csv -d 8859-1 -q 0 ', myExcelFile, ' 2> /dev/null') output.
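
The per-format snippets above can be folded into one lookup by file extension. The following is only a minimal sketch and was not part of the original tests. The names converters, commandFor and extractText are hypothetical. It assumes the same converter tools are installed and that PipeableOSProcess is available in the image, as in the examples above. File names that contain spaces or shell metacharacters would need quoting, which this sketch does not attempt.

```smalltalk
"Hypothetical sketch: choose the converter command by file extension.
 The command strings are the same ones used in the examples above."
converters := Dictionary new.
converters at: 'pdf' put: [:f | 'pdftotext -enc UTF-8 ', f, ' -'].
converters at: 'ps' put: [:f | 'ps2ascii ', f].
converters at: 'doc' put: [:f |
   'wvWare -c utf-8 --nographics -x /usr/share/wv/wvText.xml ', f, ' 2> /dev/null'].
"PowerPoint: write the HTML to a file first, then extract the text from it"
converters at: 'ppt' put: [:f |
   'ppthtml ', f, ' > ', f, '.html 2> /dev/null && html2text ', f, '.html 2> /dev/null'].
converters at: 'xls' put: [:f | 'xls2csv -d 8859-1 -q 0 ', f, ' 2> /dev/null'].

"Answer the extracted text, or nil when no converter is registered for the extension"
extractText := [:f |
   commandFor := converters at: (FileDirectory extensionFor: f) asLowercase ifAbsent: [nil].
   commandFor isNil
      ifTrue: [nil]
      ifFalse: [(PipeableOSProcess command: (commandFor value: f)) output]].

result := extractText value: '/var/DirOfMyTestFiles/my.pdf'.
```

Keeping the commands in a Dictionary means that another format can be supported by adding one more entry, without touching the block that runs the converter.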