

Legacy Office documents are not crawled in FAST Search

In FAST Search Server 2010, crawled items are submitted for item processing, where IFilters extract content and metadata from them. You can also enable the Advanced Filter Pack, which extracts text and metadata from several hundred file formats and complements the document formats supported by the Microsoft Filter Pack. The Advanced Filter Pack is disabled by default; to enable it, run the PowerShell command “.\AdvancedFilterPack.ps1 -enable”.

You may have noticed that crawling legacy Office documents (.doc, .xls, .ppt, etc.) sometimes fails with the following warning message in the crawl log:

"The FAST Search backend reported warnings when processing the item. Document conversion failed: LoadIFilter() failed: Bad extension for file (0x800401e6)."

 

 

This issue affects legacy Office 97-2003 documents: the Microsoft IFilter 2.0 fails to convert these files to HTML in preparation for indexing.

Unfortunately, no fix is available for the Microsoft IFilter 2.0, but a configuration change in the FAST environment provides a workaround. The change ensures that legacy file types such as “.doc”, “.ppt” and “.xls” are converted by the SearchExportConverter instead of the IFilterConverter. The SearchExportConverter supports legacy formats and handles their conversion better than the IFilterConverter. The IFilterConverter is still used to convert newer Office formats such as “.docx”, “.pptx” and “.xlsx”.
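As a quick way to reason about which items the workaround affects, here is a minimal Python sketch of the routing described above. It is only an illustration: the function name and the two extension sets are made up for this example and cover just the extensions named in this post, not the full rule set that FAST applies.

```python
from pathlib import PureWindowsPath
from typing import Optional

# Only the extensions named in this post; the real FAST rules cover many more.
LEGACY_EXTENSIONS = {".doc", ".xls", ".ppt"}
NEWER_EXTENSIONS = {".docx", ".xlsx", ".pptx"}


def expected_converter(file_name: str) -> Optional[str]:
    """Return the converter expected to handle file_name once the workaround is applied."""
    ext = PureWindowsPath(file_name).suffix.lower()
    if ext in LEGACY_EXTENSIONS:
        return "SearchExportConverter"
    if ext in NEWER_EXTENSIONS:
        return "IFilterConverter"
    return None  # not covered by this sketch
```

For example, `expected_converter("Report.DOC")` gives `"SearchExportConverter"`, while a `.docx` file still maps to `"IFilterConverter"`.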

 

To resolve the issue, perform the steps below on each server in your FAST environment. I recommend backing up the files before making any changes to them.
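The backup can be scripted. The following Python sketch copies the three configuration files touched by the steps into a separate folder, keeping their relative paths. The function name and the backup folder are hypothetical; the install root is whatever your environment uses (E:\FASTSearch in this post).

```python
import shutil
from pathlib import Path

# The three files edited in the steps, relative to the FAST install root.
CONFIG_FILES = [
    "etc/config_data/DocumentProcessor/optionalprocessing.xml",
    "etc/formatdetector/converter_rules.xml",
    "etc/pipelineconfig.xml",
]


def backup_configs(install_root, backup_dir):
    """Copy each config file into backup_dir, keeping its relative path.

    Returns the list of backup copies that were written.
    """
    root, dest_root = Path(install_root), Path(backup_dir)
    copied = []
    for rel in CONFIG_FILES:
        src = root / rel
        if not src.is_file():
            continue  # skip files that do not exist on this server
        dest = dest_root / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)  # copy2 also preserves timestamps
        copied.append(dest)
    return copied
```

A call such as `backup_configs(r"E:\FASTSearch", r"E:\FASTSearch_backup")` would be run once per server before editing; the backup path here is only a placeholder.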

 

1)     Open E:\FASTSearch\etc\config_data\DocumentProcessor\optionalprocessing.xml, change “active” to “yes” for “SearchExportConverter”, and save the file.

 

2)     Open E:\FASTSearch\etc\formatdetector\converter_rules.xml on the FAST server.

 

3)     In converter_rules.xml, go to the node on line 673 and comment out the first “doc” type, e.g. "<!--<ext>.doc</ext>-->". This allows the “.doc” extension to be handled by another document processing stage, “SearchExportConverter”. Repeat this for the .xls and .ppt document types.

 

4)     Open E:\FASTSearch\etc\pipelineconfig.xml, and locate the following pipelines:

              <pipeline name="Attachments (webcluster)" default="0">

              <pipeline name="Office14 (webcluster)" default="1">

 

5)     Within both pipeline sections, swap the order of “IFilterConverter” and “SearchExportConverter” so that “SearchExportConverter” comes before “IFilterConverter” in both pipelines.

 

6)      In FAST PowerShell, run “nctrl reloadcfg”

 

7)      In FAST PowerShell, run “nctrl restart procserver_X”, where X is the procserver number. Repeat this for every procserver in the system; there may be several.
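The three XML edits in steps 1, 3 and 5 can be sketched in Python as below. This is an illustration under stated assumptions, not a tested tool for production files: the element name `processor` and its `name` and `active` attributes are assumptions about the file layout, so check them against your own files. The functions work on XML text in memory, and `ElementTree` drops comments and changes formatting when it rewrites a document, so editing the real files by hand as described in the steps remains the safer route.

```python
import xml.etree.ElementTree as ET


def enable_processor(optional_xml: str, name: str) -> str:
    """Step 1: set active="yes" on the named processor (assumed element layout)."""
    root = ET.fromstring(optional_xml)
    for proc in root.iter("processor"):
        if proc.get("name") == name:
            proc.set("active", "yes")
    return ET.tostring(root, encoding="unicode")


def comment_out_first(rules_xml: str, ext: str) -> str:
    """Step 3: wrap the first <ext> entry for ext in an XML comment."""
    tag = f"<ext>{ext}</ext>"
    idx = rules_xml.find(tag)
    # Leave the text alone if the entry is missing or already commented out.
    if idx == -1 or rules_xml[:idx].endswith("<!--"):
        return rules_xml
    return rules_xml[:idx] + f"<!--{tag}-->" + rules_xml[idx + len(tag):]


def swap_order(pipeline_xml: str, pipeline: str, first: str, second: str) -> str:
    """Step 5: swap two processors in one pipeline if `first` comes after `second`."""
    root = ET.fromstring(pipeline_xml)
    for pipe in root.iter("pipeline"):
        if pipe.get("name") != pipeline:
            continue
        names = [child.get("name") for child in pipe]
        if first in names and second in names:
            i, j = names.index(first), names.index(second)
            if i > j:
                pipe[i], pipe[j] = pipe[j], pipe[i]
    return ET.tostring(root, encoding="unicode")
```

For step 5, the call would be made once per pipeline, for example `swap_order(xml_text, "Office14 (webcluster)", "SearchExportConverter", "IFilterConverter")` and again with `"Attachments (webcluster)"`.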

 

Once you have completed all the steps on every server, run a full crawl.
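Because steps 6 and 7 have to be repeated on each server and for each procserver, it can help to generate the command list up front. The Python sketch below only builds the command strings from the two nctrl commands given in the steps; the helper name is made up for this example, and the procserver numbers are placeholders that must be replaced with the ones present in your system.

```python
def restart_commands(procserver_numbers):
    """Return the nctrl commands from steps 6 and 7, one restart per procserver."""
    commands = ["nctrl reloadcfg"]
    commands += [f"nctrl restart procserver_{n}" for n in procserver_numbers]
    return commands


# Placeholder numbers; use the procservers that actually exist on the server.
for command in restart_commands([1, 2]):
    print(command)
```

The printed lines can then be run in FAST PowerShell on the server in question.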

 

Published by - Prasad Joshi