Microsoft Word Document HTML Cleanup in PHP


I had to clean up HTML generated from an MS Word document by hand.  Honestly, it is a pain to manually search for and replace all the junk Word generates.  So I wrote a text conversion function in PHP that automatically cleans up the MS Word junk and outputs HTML entities.

It also helps when pasting from Microsoft Word into a web form. If you would rather not rely on the "Paste from Word" feature in TinyMCE or FCKEditor, which does not seem to work most of the time, this is a simple server-side solution that strips and replaces Word formatting for clean HTML output.


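function word_cleanup($str)
{
    // Match empty elements such as <p></p> or <span>&nbsp;</span>.
    // Single quotes matter here: inside double quotes PHP reads \1 as an
    // octal escape, so the backreference would never match.
    $pattern = '/<(\w+)>(\s|&nbsp;)*<\/\1>/';
    $str = preg_replace($pattern, '', $str);
    // Convert non-ASCII characters (curly quotes, dashes) to HTML entities.
    // Note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2.
    return mb_convert_encoding($str, 'HTML-ENTITIES', 'UTF-8');
}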
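The function above only drops empty elements and converts characters to entities. As a rough sketch of what "stripping Word formatting" could additionally involve, the helper below removes a few kinds of markup Word commonly emits. The name word_strip_markup and the specific patterns are assumptions for illustration, not part of the original function, and regular expressions are not a full HTML parser, so treat it as a starting point rather than a complete cleaner.

```php
<?php
// Hypothetical helper (not from the original post): removes some
// Word-specific markup before any further cleanup is applied.
function word_strip_markup(string $html): string
{
    // Conditional comments such as <!--[if gte mso 9]> ... <![endif]-->
    $html = preg_replace('/<!--\[if.*?<!\[endif\]-->/is', '', $html);

    // Office namespace tags such as <o:p> and </o:p>; their contents are kept
    $html = preg_replace('/<\/?[ovwxp]:\w+[^>]*>/i', '', $html);

    // Word class names such as class="MsoNormal"
    $html = preg_replace('/\sclass="?Mso\w+"?/i', '', $html);

    // Inline style attributes that contain mso-* properties
    $html = preg_replace('/\sstyle="[^"]*mso-[^"]*"/i', '', $html);

    return $html;
}

$dirty = '<p class="MsoNormal" style="mso-margin-top-alt:auto">Hello<o:p></o:p></p>'
       . '<!--[if gte mso 9]><xml>x</xml><![endif]-->';
echo word_strip_markup($dirty), "\n";
```

The result of word_strip_markup() could then be passed to word_cleanup() to drop any elements left empty and to encode the remaining special characters. Note that this sketch works on pasted or exported HTML only; it does not read .doc files and does not know about document headers or footers.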

Remove junk from a Word file being parsed by PHP

Have you tried to search for a word in an MS Word file? I tried to use your parser with little success. (There was no error using it ;-)) It just didn't remove the header and footer of the document.