
Troubleshooting Lucene Search Issues

by Laurent Poulain

Sometimes a search query does not return the expected results, displays entries that should not be returned, or returns results in a counter-intuitive order. This article shows ways to troubleshoot those issues.

Luke to the rescue

Luke is the tool of choice to understand search results. It is a free Java utility that lets you analyse a Lucene index.

When trying to understand why some items are not returned, Luke lets you determine whether the issue is upstream (the information is not indexed) or downstream (a widget is not reading the index correctly).

When trying to understand why some unexpected items are returned, Luke gives more information about said items.

Last but not least, Luke can give precise details about how an item's ranking is computed.

You can download this tool from https://luke.googlecode.com/files/lukeall-3.5.0.jar. To run it, type "java -jar lukeall-3.5.0.jar" in a console, then open the Sitefinity Lucene index directory, located by default in ~/App_Data/Sitefinity/Search/[index name].

Luke overview

On the default Overview tab, the bottom left table shows the indexed fields along with the number of terms in each:

[Screenshot: Luke Overview tab]

The bottom right table shows the top terms for the whole index, or just for a particular field. For instance, if you select the ContentType field and click "Show top terms >>", you will see the number of documents indexed per type (e.g. the count next to telerik.sitefinity.pages.model.pagenode is the number of Pages being indexed):

[Screenshot: top terms for the ContentType field]

You can also perform some searches in the Search tab (in the rest of this article, search queries will be displayed inside square brackets):

[Screenshot: Luke Search tab]

The search uses a series of one or more <field name>:<term> predicates, e.g. [Title:hello]. Note that the query is case sensitive: the field name must use exactly the same case as the field name displayed in Luke, whereas the term should always be lowercase. Adding a + before a predicate makes it mandatory, e.g. [+Title:hello +Title:world] searches for content whose title contains both "hello" and "world".

Notice how the ContentType field indicates the type of content you're dealing with. When a result shows up when it shouldn't, it is generally because Sitefinity is indexing a similar item of another type. The ContentType combined with the Id (and sometimes the Link field) helps pinpoint the exact item being indexed.
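The query syntax described above is easy to get wrong by hand (a capitalized term, a missing +). Below is a minimal sketch of a helper that assembles such a query string for pasting into Luke. The class and method names are hypothetical and are not part of Luke or Sitefinity.

```csharp
using System;
using System.Linq;

public static class LukeQuery
{
    // Builds a query in which every term is mandatory for the given field.
    // The field name keeps its case; the terms are lowercased.
    public static string AllTermsRequired(string field, params string[] terms)
    {
        return string.Join(" ", terms.Select(t => "+" + field + ":" + t.ToLowerInvariant()));
    }
}
```

For instance, LukeQuery.AllTermsRequired("Title", "Hello", "World") produces the [+Title:hello +Title:world] query used above.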

How are Sitefinity search queries translated to Lucene queries?

When trying to understand the search results, the first step is often to reproduce the issue inside Luke by running a Lucene search query similar to what Sitefinity is running under the hood. Here are the general rules:

  • In the case of a single-term search, [term] will typically generate a Lucene query like [(Title:term Content:term)], meaning it will search for "term" in either the Title or the Content field.
  • Sitefinity will, however, first verify that the term is indexed. If it finds that "term" is not indexed for the field "Title", it will strip this field from the query, e.g. [(Content:term)].
  • If it finds other indexed terms starting with "term", it will add them to the query, e.g. [(Content:term Content:term1 Content:term2)].
  • In the case of a multiple-term search, [term1 term2] will typically generate a Lucene query like [(+Title:term1 +Title:term2) (+Content:term1 +Content:term2)].
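The rules above can be sketched in code to predict what to type into Luke. This is only an illustration under stated assumptions, not Sitefinity's actual implementation: the index is modelled as a hypothetical dictionary of field name to indexed terms, the class and method names are invented, and the prefix expansion from the third rule is left out to keep the sketch short.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class QueryTranslationSketch
{
    // index: field name -> set of indexed terms (a stand-in for the real Lucene index).
    public static string Translate(string search, string[] fields,
                                   IDictionary<string, ISet<string>> index)
    {
        var terms = search.ToLowerInvariant()
                          .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries);

        if (terms.Length == 1)
        {
            // Single term: keep only the fields in which the term is actually indexed.
            var clauses = new List<string>();
            foreach (var field in fields)
            {
                ISet<string> indexed;
                if (index.TryGetValue(field, out indexed) && indexed.Contains(terms[0]))
                {
                    clauses.Add(field + ":" + terms[0]);
                }
            }
            return "(" + string.Join(" ", clauses) + ")";
        }

        // Multiple terms: one group per field, every term mandatory inside the group.
        var groups = fields.Select(
            f => "(" + string.Join(" ", terms.Select(t => "+" + f + ":" + t)) + ")");
        return string.Join(" ", groups);
    }
}
```

With an index where "term" only appears in Content, Translate("term", new[] { "Title", "Content" }, index) yields (Content:term), which matches the second rule.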

Exact match search

By default, searching for [company] will search for any term starting with "company". This is not achieved by using wildcards (even though they are supported by Lucene), but by rewriting the query internally before sending it to Lucene.

When searching for [company], Sitefinity will look for terms in the Lucene index starting with "company" (e.g. "companyA", "companyB"). If it finds such terms, it will rewrite the query internally to search for company, companyA or companyB.

This behavior can be disabled by going to Administration / Settings / Advanced / Search and checking "Enable exact match".

Note that Sitefinity does NOT support stemming, e.g. searching for [company] will NOT find occurrences of "companies". However, searching for [compan] would look for occurrences of "company", "companies", "companyA" and "companyB" if such terms are already indexed.
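The difference between prefix expansion and stemming can be illustrated with a short sketch. It assumes the expansion is a plain "starts with" test against the terms already in the index, as described above. The names are hypothetical, the exactMatch flag merely stands in for the "Enable exact match" setting, and the terms are written in lowercase because that is how they appear in the index.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class PrefixExpansionSketch
{
    // Returns the terms the rewritten query would search for.
    public static List<string> Expand(string term, IEnumerable<string> indexedTerms,
                                      bool exactMatch)
    {
        if (exactMatch)
        {
            // No rewriting: only the term as typed is searched.
            return new List<string> { term };
        }

        // Keep every indexed term that begins with the searched term.
        return indexedTerms
            .Where(t => t.StartsWith(term, StringComparison.Ordinal))
            .Distinct()
            .OrderBy(t => t, StringComparer.Ordinal)
            .ToList();
    }
}
```

Given the indexed terms "company", "companya", "companyb" and "companies", expanding "company" leaves out "companies" because it does not start with "company", whereas expanding "compan" picks up all four.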

Indexing custom fields

By default, Sitefinity looks at only two fields when performing a search: Title and Content. You can, however, index and search extra fields. In the example below, we add "Symptom" (a field added to a dynamic module) by editing the index and adding the field name under "Additional fields for indexing" in the Advanced section:

[Screenshot: "Additional fields for indexing" setting of the search index]

After a reindex, we now see a new field:

[Screenshot: the new Symptom field listed in Luke]

The last step is to update the Search Results widget (in the Advanced properties) so that it both searches the Symptom field and highlights any keyword found in that field:

[Screenshot: Search Results widget advanced properties]

Customized search

A common request is to be able to perform a search more granular than a given type, e.g. search for documents inside a particular library. This can be achieved with a custom Search Results widget, as described in the rest of this section.

Keep in mind that anything that the search widgets process must be in the Lucene index - whether filtering results or displaying extra fields. Before adding any clause that filters on a particular field, make sure that the index contains actual values for that field. Those values can also give you hints about what to filter.

For example, filtering documents stored in a particular library requires filtering for results whose "Link" field begins with, say, "~/docs/default-source/sub-library/". A look at the Link top terms in the Overview tab, however, shows that this field is broken down into words. In other words, it is not possible to add a Lucene query filter that looks for results whose Link field starts with "~/docs/default-source/sub-library/". It is thus best to filter the results after the search has run. Below is an implementation example:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Web;
 
using Telerik.Sitefinity.Abstractions;
using Telerik.Sitefinity.Search;
using Telerik.Sitefinity.Services.Search;
using Telerik.Sitefinity.Services.Search.Web.UI.Public;
using Telerik.Sitefinity.Services.Search.Data;
using Telerik.Sitefinity.Services.Search.Model;
using System.Text.RegularExpressions;
using System.ComponentModel;
 
namespace SitefinityWebApp.Custom
{
 
    public class MySearch : SearchResults
    {
        [Category("Custom Filter")]
        public string Link { get; set; }
 
        protected override ISearchResultsBuilder GetSearcher()
        {
            return new MySearcher(this);
        }
 
        public class MySearcher : ISearchResultsBuilder
        {
            public MySearcher(SearchResults control)
            {
                this.control = control;
            }
 
            public IEnumerable<IDocument> Search(string query, string catalogue, string[] searchFields, string[] highlightedFields, int skip, int take, out int hitCount)
            {
                var control = this.control;
                var service = Telerik.Sitefinity.Services.ServiceBus.ResolveService<ISearchService>();
                var queryBuilder = ObjectFactory.Resolve<IQueryBuilder>();
                var searchQuery = queryBuilder.BuildQuery(query, control.SearchFields);
                searchQuery.IndexName = catalogue;
                searchQuery.Skip = skip;
                searchQuery.Take = take;
                searchQuery.OrderBy = null;
                searchQuery.HighlightedFields = control.HighlightedFields;
 
                // Contains the default filter - by current language
                var currentFilter = searchQuery.Filter;
                var myFilter = new SearchFilter();
                myFilter.Operator = QueryOperator.And;
 
                MySearch myControl = (MySearch)control;
 
                // Persist the language filter, if exists
                if (currentFilter != null) myFilter.AddFilter(currentFilter);
                searchQuery.Filter = myFilter;
                IResultSet result = service.Search(searchQuery);
 
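                // Post-filter on the Link prefix. The filter runs after skip/take,
                // so hitCount below only counts the filtered items of the current page.
                var filtered_result = string.IsNullOrEmpty(myControl.Link) ?
                    result :
                    result.Where(r => r.GetValue("Link") != null &&
                                      r.GetValue("Link").ToString().StartsWith(myControl.Link, StringComparison.Ordinal));
                List<IDocument> documents = filtered_result.SetContentLinks().ToList<IDocument>();
                hitCount = documents.Count;
                return documents;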
            }
 
            protected readonly SearchResults control;
        }
    }
}

 

Note that this control defines a Link property which can be accessed when looking at the advanced properties of the widget:

[Screenshot: Link property in the widget's advanced properties]

This avoids the need to hard-code the library path in the control itself, making it reusable.

Ranking

Ranking is always a difficult topic, as there will always be some users who disagree with the ranking.

Nonetheless, Luke can help you understand the rationale behind a particular ranking. In the Search tab, select a result and click on the Explain button.

[Screenshot: Luke Explain output for a search result]

Lucene relies on three scores to determine ranking:

  • Term frequency (TF): the number of term occurrences
  • Inverse Document Frequency (IDF): this is only useful when searching for multiple terms, as it rates the relative importance of each term of the query. The more a term is used across the whole index, the lower its score. The idea is that, when searching for [company ACME], the term "company" carries less weight than "ACME" because it is used more often. As a result, an item containing ten occurrences of "ACME" and one occurrence of "company" will rank higher than an item containing one occurrence of "ACME" and ten occurrences of "company".
  • Field Normalization: the longer the whole text, the lower the ranking. In other words, when searching for [ACME], a short news item which contains only "ACME" in its title will have a higher ranking than a news item which contains the same term in its title but with a lot of extra text.
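To make these three factors concrete, here is a rough sketch of how they combine. It assumes the formulas of Lucene's classic default similarity (tf as the square root of the frequency, idf as 1 + ln(numDocs / (docFreq + 1)), length norm as 1 / sqrt(number of terms in the field)). Whether your Sitefinity version uses exactly these formulas is an assumption to check against Luke's Explain output. Query-level factors are omitted because they are the same for every document matching the same terms, and the index statistics used in the usage note are made up.

```csharp
using System;

public static class ClassicScoringSketch
{
    // Term frequency: grows with the number of occurrences, but sub-linearly.
    public static double Tf(int freq) { return Math.Sqrt(freq); }

    // Inverse document frequency: the more documents contain the term, the lower the value.
    public static double Idf(int docFreq, int numDocs)
    {
        return 1.0 + Math.Log((double)numDocs / (docFreq + 1));
    }

    // Field normalization: the longer the field, the lower the value.
    public static double LengthNorm(int numTerms) { return 1.0 / Math.Sqrt(numTerms); }

    // Contribution of a single query term to a document's score.
    public static double TermScore(int freq, int docFreq, int numDocs, int fieldLength)
    {
        double idf = Idf(docFreq, numDocs);
        return Tf(freq) * idf * idf * LengthNorm(fieldLength);
    }
}
```

Taking a hypothetical index of 1000 documents in which "company" appears in 500 documents and "acme" in 10, the score of an 11-word item with ten "acme" and one "company" is TermScore(10, 10, 1000, 11) + TermScore(1, 500, 1000, 11). This comes out higher than TermScore(1, 10, 1000, 11) + TermScore(10, 500, 1000, 11) for the item with the counts reversed, in line with the [company ACME] example above.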
