Read Microsoft Office 2007 files with PHP

In a web application, it’s often handy to obtain the text content of files saved from other programs such as Microsoft Office. An example would be knowledge base software that allows file uploads: the file contents can be read and saved to the database for future keyword searches.

For Microsoft Office versions earlier than 2007, obtaining the text content from Word, Excel, or PowerPoint files with PHP is cumbersome, and in some cases requires third-party extensions to achieve the task.

With Office 2007, Microsoft has made it incredibly easy to read the text content from Word, Excel, or PowerPoint files with PHP, using minimal configuration.

Unzip

Office 2007 files use the extensions docx, xlsx, and pptx, but they are really just ZIP archives. If you replace the extension of an Office 2007 file with .zip, you can uncompress it and view the individual files it contains:

Screenshot of Windows file

Screenshot of Windows Zip utility

Most of the files contained within are XML files, which contain the raw text content of the Office document. We can easily read XML files with PHP.
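The renaming step is only needed for browsing by hand; PHP can look inside the file directly. As a minimal sketch, assuming the Zip extension is available, the helper below lists the entries of an Office 2007 file. The name list_zip_entries() is made up for this example.

```php
<?php
// List the entry names inside an Office 2007 file (any ZIP archive works).
// list_zip_entries() is a hypothetical helper name, not part of the article's code.
function list_zip_entries(string $path): array
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return []; // the file could not be opened as a ZIP archive
    }

    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if ($name !== false) {
            $names[] = $name;
        }
    }
    $zip->close();

    return $names;
}
```

Calling list_zip_entries() on a docx file should return names such as word/document.xml alongside the other parts of the package.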

XML files to read

Relative to the directory structure of the unzipped document file, here are the actual XML files that can be read with PHP to obtain the text content:

Microsoft Word 2007 (docx)

Read file word/document.xml:

Screenshot of Windows Explorer

Microsoft Excel 2007 (xlsx)

Read file xl/sharedStrings.xml:

Screenshot of Windows Explorer

Microsoft PowerPoint 2007 (pptx)

Read files in ppt/slides/. For each individual slide, there will be a slide#.xml file. For example, slide1.xml, slide2.xml, etc.

Screenshot of Windows Explorer
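Because each format keeps its text in a different place, the entry names alone are enough to tell the three apart. The sketch below uses only the paths listed above; detect_office_kind() is a made-up helper name, and it expects an array of entry names from the archive.

```php
<?php
// Guess the document kind from the entry names of the unzipped file.
// detect_office_kind() is a hypothetical helper; the paths are the ones listed in the article.
function detect_office_kind(array $entryNames): ?string
{
    if (in_array('word/document.xml', $entryNames, true)) {
        return 'docx';
    }
    if (in_array('xl/sharedStrings.xml', $entryNames, true)) {
        return 'xlsx';
    }
    foreach ($entryNames as $name) {
        // slide1.xml, slide2.xml, ... directly under ppt/slides/
        if (preg_match('#^ppt/slides/slide\d+\.xml$#', $name) === 1) {
            return 'pptx';
        }
    }

    return null; // none of the known text parts were found
}
```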

PHP code

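To read the Office files with PHP, first check the file type reported for the upload. It typically arrives as application/octet-stream, although a client that recognizes the format may send a specific type instead, such as application/vnd.openxmlformats-officedocument.wordprocessingml.document for docx.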
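The reported type comes from the client, so it can be worth checking the file's own first bytes as well. A non-empty ZIP archive starts with the four bytes "PK\x03\x04". The sketch below tests for that signature; looks_like_zip() is a made-up helper name.

```php
<?php
// Check whether a file starts with the ZIP local file header signature "PK\x03\x04".
// looks_like_zip() is a hypothetical helper name, not part of the article's code.
function looks_like_zip(string $path): bool
{
    $handle = @fopen($path, 'rb');
    if ($handle === false) {
        return false; // unreadable or missing file
    }
    $signature = fread($handle, 4);
    fclose($handle);

    return $signature === "PK\x03\x04";
}
```

This only says the file looks like a ZIP archive; whether it is actually an Office document still depends on the entries found inside.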

Then check to make sure the PHP Zip extension is installed:

if ( class_exists("ZipArchive") ) {...

Then declare a few initial variables:

// the zip class
$zip = new ZipArchive;

// the string variable that will hold the file content
$file_content = "";

// the uploaded file
$file_upload = $file -> upload["tmp_name"];

Then proceed to read the appropriate file:

// open file
if ($zip -> open($file_upload) === true) {

  // docx
  if ( ($index = $zip -> locateName("word/document.xml")) !== false ) {
    $data = $zip -> getFromIndex($index);
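    // loadXML() is an instance method: a static call raises a Strict Standards notice, and fails outright in PHP 8
    // LIBXML_NOENT is left out so that entities in an uploaded file are not expanded
    $xml = new DOMDocument();
    $xml -> loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING);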
    $file_content = strip_tags($xml -> saveXML());
  }

  echo $file_content;
}

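The code above reads the XML for a docx file. The same code works for xlsx: replace word/document.xml with xl/sharedStrings.xml. Note that sharedStrings.xml holds only the workbook's text strings; numeric cell values are stored in the worksheet files under xl/worksheets/.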
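With xlsx, stripping the tags from the whole document runs the strings together, because nothing separates one entry from the next. As a sketch of one way around that, the helper below reads each <si> element (one shared string) and joins them with a separator. The name shared_strings_to_text() and the default separator are assumptions for this example, and it assumes the elements are written without a namespace prefix.

```php
<?php
// Join the entries of xl/sharedStrings.xml with a separator instead of
// stripping tags from the whole document.
// shared_strings_to_text() is a hypothetical helper; $separator is an assumption.
function shared_strings_to_text(string $xmlData, string $separator = ' '): string
{
    if ($xmlData === '') {
        return '';
    }

    $xml = new DOMDocument();
    if (!$xml->loadXML($xmlData, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return ''; // not well-formed XML
    }

    $strings = [];
    // each <si> element holds one shared string (plain or made of several runs)
    foreach ($xml->getElementsByTagName('si') as $item) {
        $strings[] = $item->textContent;
    }

    return implode($separator, $strings);
}
```

Pass the result of $zip -> getFromIndex($index) as $xmlData. A newline separator may suit a search index better than a space.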

For pptx, we need to loop through all of the slide#.xml files:

// open file
if ($zip -> open($file_upload) === true) {

  // loop through all slide#.xml files
  $slide = 1;

  while ( ($index = $zip -> locateName("ppt/slides/slide" . $slide . ".xml")) !== false ) {

    $data = $zip -> getFromIndex($index);
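    // loadXML() is an instance method: a static call raises a Strict Standards notice, and fails outright in PHP 8
    // LIBXML_NOENT is left out so that entities in an uploaded file are not expanded
    $xml = new DOMDocument();
    $xml -> loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING);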
    $file_content .= strip_tags($xml -> saveXML());

    $slide++;
  }

  echo $file_content;
}
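The loop above stops at the first slide number it cannot find, so a gap in the numbering would cut the text short. A sketch of an alternative is to scan the archive for every matching entry and sort the names naturally. The helper name slide_entry_names() is made up for this example.

```php
<?php
// Collect slide part names by scanning the archive rather than counting up
// from slide1.xml. slide_entry_names() is a hypothetical helper name.
function slide_entry_names(ZipArchive $zip): array
{
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $name = $zip->getNameIndex($i);
        if ($name !== false && preg_match('#^ppt/slides/slide\d+\.xml$#', $name) === 1) {
            $names[] = $name;
        }
    }

    // natural sort so that slide10.xml comes after slide9.xml
    sort($names, SORT_NATURAL);

    return $names;
}
```

Each returned name can then be read with $zip -> getFromName($name) and processed the same way as in the loop above.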

Final notes

Keep in mind this does not work with old school doc, xls, or ppt files from Office ’97-’03. Those are binary formats rather than ZIP archives, so the approach fails on them even when the file was saved from Office ’07 in one of the legacy formats.
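One way to handle such files gracefully is to look at the value returned by ZipArchive::open(), which is true on success and an error code otherwise. The sketch below turns that into a short message; explain_open_failure() is a made-up helper name and the message wording is only an example.

```php
<?php
// Explain why a file could not be opened as an Office 2007 archive.
// explain_open_failure() is a hypothetical helper; it returns null on success.
function explain_open_failure(string $path): ?string
{
    $zip = new ZipArchive();
    $result = $zip->open($path);

    if ($result === true) {
        $zip->close();
        return null;
    }
    if ($result === ZipArchive::ER_NOZIP) {
        return 'not a ZIP archive (possibly a legacy doc, xls, or ppt file)';
    }

    return 'could not open file (ZipArchive error code ' . $result . ')';
}
```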

3 thoughts on “Read Microsoft Office 2007 files with PHP”

  1. Excellent!!

    Just a comment: when trying to read a .xlsx file, the first thing displayed is the following message:

    Strict Standards: Non-static method DOMDocument::loadXML() should not be called statically in C:\Program Files\Apache Software Foundation\Apache2.2\htdocs\3espacio\test.php on line 9

    Any workaround to avoid this message?

    Thanks a lot in advance.

  2. I’m reading an xlsx file and it works fine, but it displays all of the file content on one line. How can I add spaces between the entries?
    Thanks
