In a web application, it’s often handy to obtain the text content of files saved from other programs such as Microsoft Office. An example would be KnowledgeBase software that allows file uploads. The file contents could be read and saved to the database for future keyword searches.
For Microsoft Office versions earlier than 2007, obtaining the text content from Word, Excel, or PowerPoint files with PHP is cumbersome, and in some cases requires third-party extensions to achieve the task.
With Office 2007, Microsoft has made it incredibly easy to read the text content from Word, Excel, or PowerPoint files with PHP, using minimal configuration.
Unzip
Office 2007 files use the extensions docx, xlsx, and pptx, but they are really just ZIP archives. If you replace the extension of an Office 2007 file with .zip, you can uncompress it and view the individual files contained within.

Most of the files contained within are XML files, which contain the raw text content of the Office document. We can easily read XML files with PHP.
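The same inspection can be done from PHP without renaming anything. The following is a minimal sketch, assuming the Zip extension is available; the helper name list_office_entries is made up for illustration and is not part of any library.

```php
<?php
// Hypothetical helper: returns the names of all entries inside a ZIP-based Office file.
function list_office_entries(string $path): array
{
    $zip = new ZipArchive();
    // open() returns true on success and an integer error code otherwise
    if ($zip->open($path) !== true) {
        return [];
    }
    $names = [];
    for ($i = 0; $i < $zip->numFiles; $i++) {
        $names[] = $zip->getNameIndex($i);
    }
    $zip->close();
    return $names;
}
```

Calling list_office_entries() on a docx file should return entries such as word/document.xml alongside the supporting XML files.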
XML files to read
Relative to the directory structure of the unzipped document file, here are the actual XML files that can be read with PHP to obtain the text content:
Microsoft Word 2007 (docx)
Read file word/document.xml:

Microsoft Excel 2007 (xlsx)
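Read file xl/sharedStrings.xml. This file holds the workbook's shared string table; numeric cell values are stored in the per-sheet files under xl/worksheets/ and are not included.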

Microsoft PowerPoint 2007 (pptx)
Read files in ppt/slides/. For each individual slide, there will be a slide#.xml file. For example, slide1.xml, slide2.xml, etc.
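The list above can be expressed as a small lookup. This is only a sketch: the function name office_text_parts and the idea of passing in a list of entry names are assumptions made for illustration. It picks the relevant parts out of an archive listing and sorts slides in natural order so that slide10.xml follows slide9.xml.

```php
<?php
// Hypothetical helper: given the entry names of an archive and a file extension,
// returns the XML parts that hold the text content.
function office_text_parts(array $entry_names, string $extension): array
{
    switch (strtolower($extension)) {
        case "docx":
            $pattern = '#^word/document\.xml$#';
            break;
        case "xlsx":
            $pattern = '#^xl/sharedStrings\.xml$#';
            break;
        case "pptx":
            $pattern = '#^ppt/slides/slide\d+\.xml$#';
            break;
        default:
            return [];
    }
    $parts = preg_grep($pattern, $entry_names);
    // Natural sort so that slide10.xml comes after slide9.xml
    natsort($parts);
    return array_values($parts);
}
```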

PHP code
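To read the Office files with PHP, first check the uploaded file's type. Browsers that do not recognise the Office 2007 formats often report it as application/octet-stream, while others send the registered type, such as application/vnd.openxmlformats-officedocument.wordprocessingml.document for docx.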
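Because the reported type comes from the browser, it can be worth looking at the file itself as well. A minimal sketch, assuming a hypothetical helper named looks_like_zip: it checks for the four-byte signature ("PK" followed by bytes 3 and 4) that begins a non-empty ZIP archive. It only shows that the file is a ZIP archive, not that it is a valid Office document.

```php
<?php
// Hypothetical helper: true when the file begins with the ZIP local file header signature.
function looks_like_zip(string $path): bool
{
    $handle = @fopen($path, "rb");
    if ($handle === false) {
        return false;
    }
    // Read the first four bytes of the file
    $signature = fread($handle, 4);
    fclose($handle);
    return $signature === "PK\x03\x04";
}
```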
Then check to make sure the PHP Zip extension is installed:
if ( class_exists("ZipArchive") ) {...
Then declare a few initial variables:
// the zip class
$zip = new ZipArchive;
// the string variable that will hold the file content
$file_content = "";
// the uploaded file
$file_upload = $file -> upload["tmp_name"];
Then proceed to read the appropriate file:
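// open file
if ($zip -> open($file_upload) === true) {
    // docx
    if ( ($index = $zip -> locateName("word/document.xml")) !== false ) {
        $data = $zip -> getFromIndex($index);
        // loadXML() is an instance method; calling it statically throws an Error as of PHP 8
        $xml = new DOMDocument();
        // LIBXML_NOENT is left out so entities in an uploaded file are not expanded
        $xml -> loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING);
        $file_content = strip_tags($xml -> saveXML());
    }
    $zip -> close();
    echo $file_content;
}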
The code above reads the XML for a docx file. The same code works for xlsx if you replace word/document.xml with xl/sharedStrings.xml.
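Since only the part name changes between docx and xlsx, the read can be wrapped in one function. This is a sketch under assumptions rather than a finished implementation: read_office_part is a made-up name, and it returns null when the archive or the part cannot be read.

```php
<?php
// Hypothetical helper: returns the tag-stripped text of one XML part, or null on failure.
function read_office_part(string $path, string $part_name): ?string
{
    $zip = new ZipArchive();
    if ($zip->open($path) !== true) {
        return null;
    }
    // getFromName() returns false when the entry does not exist
    $data = $zip->getFromName($part_name);
    $zip->close();
    if ($data === false || $data === "") {
        return null;
    }
    $xml = new DOMDocument();
    // LIBXML_NONET blocks network access while parsing
    if (!$xml->loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return null;
    }
    return strip_tags($xml->saveXML());
}
```

For example, read_office_part($file_upload, "xl/sharedStrings.xml") would cover the xlsx case, and passing "word/document.xml" would cover docx.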
For pptx, we need to loop through all of the slide#.xml files:
// open file
if ($zip -> open($file_upload) === true) {
    // loop through all slide#.xml files
    $slide = 1;
    while ( ($index = $zip -> locateName("ppt/slides/slide" . $slide . ".xml")) !== false ) {
        $data = $zip -> getFromIndex($index);
        // loadXML() is an instance method; calling it statically throws an Error as of PHP 8
        $xml = new DOMDocument();
        // LIBXML_NOENT is left out so entities in an uploaded file are not expanded
        $xml -> loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING);
        $file_content .= strip_tags($xml -> saveXML());
        $slide++;
    }
    $zip -> close();
    echo $file_content;
}
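Because strip_tags() simply deletes the markup, text from neighbouring elements runs together, which is most noticeable with xlsx where every cell string ends up on one line. One possible alternative, sketched here with a made-up function name, is to collect the text nodes through DOMXPath and join them with a separator. Note the assumption: in docx files a single word can be split across several elements, so a separator may occasionally land inside a word.

```php
<?php
// Hypothetical helper: joins every non-empty text node in an XML string with a separator.
function xml_text_with_separator(string $data, string $separator = " "): string
{
    if ($data === "") {
        return "";
    }
    $xml = new DOMDocument();
    if (!$xml->loadXML($data, LIBXML_NONET | LIBXML_NOERROR | LIBXML_NOWARNING)) {
        return "";
    }
    $xpath = new DOMXPath($xml);
    $pieces = [];
    // "//text()" selects every text node, regardless of element namespace
    foreach ($xpath->query("//text()") as $node) {
        $text = trim($node->nodeValue);
        if ($text !== "") {
            $pieces[] = $text;
        }
    }
    return implode($separator, $pieces);
}
```

In the examples above, this would take the place of the strip_tags($xml -> saveXML()) call, with $data passed in directly.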
Final notes
Keep in mind this does not work with old school doc, xls, or ppt files from Office ’97-’03, which use a binary format rather than a ZIP archive. The same applies when a user saves a file from Office ’07 into one of these legacy formats: the result is still a binary file, so it needs to be saved as docx, xlsx, or pptx for this approach to work.
Excellent!!
Just a comment: when trying to read a .xlsx file, the first thing displayed is the following message:
Any workaround to avoid this message?
Thanks a lot in advance.
Thanks a lot. It's working. Thanks
I’m reading an xlsx file. It works fine, but it displays all the file content on one line. How can I add spaces between the values?
Thanks