How to customize the indexing of html documents

Currently the PageIndex, the BlogIndex, and the NewsIndex index HTML content. Sitefinity provides control over the way these HTML items (pages, blog posts, news articles) are indexed. The Sitefinity search service creates a file called fieldsInfoProvider.xml in each index directory. For example, if the index is named myIndex, the file is created in ~[your site]/App_Data/Search/myIndex/fieldsInfoProvider.xml.
Here is sample content of this file:
<fields>
  <field filtertag="title" weight="1" name="title" />
  <field filterattributes="name:keywords;" filtertag="meta" indexattribute="content" weight="1" name="keywords" />
  <field filterattributes="name:description;" filtertag="meta" indexattribute="content" weight="1" name="description" />
  <field filtertag="script" weight="-1" name="script" indexattribute="" filterattributes="" />
  <field filtertag="style" weight="-1" name="style" indexattribute="" filterattributes="" />
</fields>

The file contains a list of field elements that describe how text from an HTML page should be extracted and written to the index. The search service then looks up information in that index.

By default Sitefinity indexes all the text content of an HTML page (the text between opening and closing HTML tags, but not the markup itself), excluding the content of the script and style tags. By adding more field rules to this file, you can customize this behavior. The meaning of the attributes is as follows:
  • name – this is the name of the field. Sitefinity uses several fields named content, title, keywords and description. The content field contains the main text that will be indexed by the crawler. The title field contains the heading of the search result item.

  • weight – used for ranking the search results. If you set a higher weight on, for example, the title and keywords fields, pages with a matching search term in these fields will rank higher and appear at the top of the search results list. Weight attributes are normalized from 0 to 1.
Note: If weight is set to -1, the text of the filtered tag will not be indexed. This is useful if you want to exclude some HTML from indexing. For example, these rules remove the script and style tags from indexing:
<field name="script" weight="-1" filtertag="script" indexattribute="" filterattributes="" />
<field name="style" weight="-1" filtertag="style" indexattribute="" filterattributes="" />

Example: You want to exclude a master page content placeholder from indexing (assuming the content placeholder sits inside a div tag with id="idOfTheDivContainingPlaceHolder"):
<field filterattributes="id:idOfTheDivContainingPlaceHolder" filtertag="div" weight="-1" name="exampleExcludeIndexingFieldRule" />
If weight is -1, the name attribute plays no role, so you can set it to any value.

  • filterTag – the name of the HTML tag, which should be handled according to this field rule.
Example: Extract the text inside the <title> tag as the title field.
<field filtertag="title" weight="1" name="title" />

  • indexAttribute – the name of the HTML attribute, containing the value that should be extracted and indexed.
Note: Usually only the text between opening and closing tags is indexed. Using indexAttribute you can index the value of an html attribute.
Example: Extract the value of the content attribute from a meta tag that has the attribute/value pair name="keywords":

<field name="keywords"  
weight="1"  
indexAttribute="content"  
filterTag="meta"  
filterAttributes="name:keywords;" /> 


  • filterAttributes – contains names of HTML attributes and the values they must match, written as name:value pairs (for example name:keywords; in the rule above). It provides a way to narrow down the tags already selected by filterTag.
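
Putting the attributes together, a customized fieldsInfoProvider.xml might look like the sketch below. It keeps the default rules from the sample file and appends the exclusion rule from the weight example. The div id is the placeholder used in that example, not a real one. The attribute names follow the all-lowercase spelling of the sample file (the attribute descriptions spell them in camelCase), and placing the new rule last is an assumption, since this article does not say whether rule order matters.

```xml
<fields>
  <!-- Default rules, as in the sample file -->
  <field filtertag="title" weight="1" name="title" />
  <field filterattributes="name:keywords;" filtertag="meta" indexattribute="content" weight="1" name="keywords" />
  <field filterattributes="name:description;" filtertag="meta" indexattribute="content" weight="1" name="description" />
  <field filtertag="script" weight="-1" name="script" indexattribute="" filterattributes="" />
  <field filtertag="style" weight="-1" name="style" indexattribute="" filterattributes="" />
  <!-- Added rule (illustrative): skip the div that wraps a master page placeholder.
       The id value is a placeholder; the name is arbitrary because weight is -1. -->
  <field filterattributes="id:idOfTheDivContainingPlaceHolder" filtertag="div" weight="-1" name="exampleExcludeIndexingFieldRule" />
</fields>
```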

If you have any questions about search indexing, don't hesitate to open a support thread.

Article Info

Article relates to v3.x
Created by Parvan Gyoshev
Last modified by Parvan Gyoshev
Related categories: Modules

Copyright © 2002-2010 Telerik. All rights reserved.