
Search Indexes TOO Much

63 posts, 0 answered
  1. Nikifor
    232 posts
    Registered:
    18 May 2013
    22 May 2008
    Hello Zubair,

    Unfortunately, we were once again unable to reproduce the reported behavior. The problem may be specific to your data, so it would help the investigation if you could provide the exact text of the missing records. That would let us narrow down the possibilities.

    Thank you for your cooperation in advance.

    Best wishes,
    Nikifor
    the Telerik team

    Instantly find answers to your questions at the new Telerik Support Center
  2. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    01 Jun 2008
    Hi,

    I guess I don't understand how to implement this, and I haven't found the documentation.  I think I finally found where to put the .xml file (~/App_Data/Search/[index name]/fieldsInfoProvider.xml, right?), but I don't know how to remove the divs that contain the menus, for example.  I would think it would be something like this, assuming the div had an ID of "header":

    <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" />

    Please advise if that's the right location for the XML and how I remove content in certain divs.  Also, when does this come into play?  On Index? On Search?

    Also, please let us know if you have a central place for this kind of information and documentation.  I don't know how other people would find this information, aside from stumbling on this post.  I also worry about what I'm missing out on, since I don't know of a central place for this stuff.

    Thanks

    Ben
  3. Nikola
    51 posts
    Registered:
    24 Sep 2012
    02 Jun 2008
    Hello Ben Alexandra,

    The file is located at ~/App_Data/Search/[IndexName]/fieldsInfoProvider.xml. It is created upon indexing, so you don't have to create it manually. Once indexing is done, you can edit, add, or remove fields in the file.

    Your assumption is correct: you can add new fields to the .xml file, for example:
    <field name="header" weight="-1" filterTag="div" filterAttributes="id:header" indexAttribute="" />
    This way, certain areas can be excluded or given a higher rank on indexing and search. A weight of "-1" means that the element is excluded from indexing; after a re-index, it won't appear in the search results either.
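
    For readers following the thread, the exclusion rule described above can be sketched in Python. This is only an illustration of the idea, not Sitefinity's implementation: the is_excluded function and the dictionary layout are assumptions made for the example, and only the weight, filterTag, and filterAttributes names come from the XML quoted in this thread.

```python
# Illustrative sketch only; not Sitefinity code.
# Each rule mirrors one <field> entry, with filterAttributes already
# split into a dictionary (a hypothetical in-memory layout).

def is_excluded(tag, attrs, rules):
    """Return True if a rule with weight -1 matches the tag and its attributes."""
    for rule in rules:
        same_tag = rule["filterTag"] == tag.lower()
        # Every attribute named by the rule must be present with the same value.
        same_attrs = all(
            attrs.get(key) == value
            for key, value in rule["filterAttributes"].items()
        )
        if rule["weight"] == -1 and same_tag and same_attrs:
            return True
    return False


rules = [
    {"name": "title", "weight": 1, "filterTag": "title", "filterAttributes": {}},
    {"name": "header", "weight": -1, "filterTag": "div", "filterAttributes": {"id": "header"}},
]

print(is_excluded("div", {"id": "header"}, rules))   # True
print(is_excluded("div", {"id": "content"}, rules))  # False
print(is_excluded("title", {}, rules))               # False
```

    Under these assumptions, only the div whose id is "header" is dropped; other divs and the title are still indexed.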

    More samples and instructions on that matter can be found in the previous posts of the current thread.
    Did it work for you? Let us know about the results.

    Best wishes,
    Nikola
    the Telerik team

  4. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    02 Jun 2008
    Hi,

    Sorry, but that's not working for me. On my beta site, if you search for SPAM you'll see that EVERY page (45) is returned, because I have the word SPAM in the top menu.  Below is my file, with the last line added by me.  Maybe you can take a look at my XML and at the HTML on my website and see why this isn't working.  I've reindexed several times and I STILL have the problem: if you search for ANY word that shows up in my sitemap, you get back EVERY page in the sitemap.

    Thanks for any help you can provide!

    <?xml version="1.0" encoding="utf-8"?> 
    <fields> 
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" /> 
      <field name="keywords" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" /> 
      <field name="description" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" /> 
      <field name="script" weight="-1" indexAttribute="" filterTag="script" filterAttributes="" /> 
      <field name="style" weight="-1" indexAttribute="" filterTag="style" filterAttributes="" /> 
     
      <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" /> 
    </fields> 

    Ben
  5. Nikola
    51 posts
    Registered:
    24 Sep 2012
    03 Jun 2008
    Hello Ben Alexandra,

    The indexer is picking up your navigation links, which are wrapped in a div element with id=left_navpanel. There is no field for that div in the given .xml file, so nothing restricts it.

    Can you try and put the following:
    <field name="navigation" weight="-1" indexAttribute="content" filterTag="div" filterAttributes="id:left_panel" /> 
    After that, re-index the web index and try searching for SPAM again. Let us know if it worked.

    All the best,
    Nikola
    the Telerik team

  6. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    03 Jun 2008
    Sorry, nope!  Same result (44 pages).

    That panel also only has the nav for that branch of the tree. Getting 10 results or so would make sense, but 90% of the pages wouldn't have anything about SPAM even in that panel.

    Any other suggestions?

    Ben
  7. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    04 Jun 2008
    Hi,

    I also tried creating a whole new search index, thinking maybe something was corrupted, but that didn't work either.

    Just to be clear, here's my XML file (~/App_Data/Search/web/fieldsInfoProvider.xml).  Please double-check it's in the right place and set up correctly.  You can see the HTML that gets returned.  Is there another field I should be using?  Do I have a typo?

    <?xml version="1.0" encoding="utf-8"?> 
    <fields> 
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" /> 
      <field name="keywords" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" /> 
      <field name="description" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" /> 
      <field name="script" weight="-1" indexAttribute="" filterTag="script" filterAttributes="" /> 
      <field name="style" weight="-1" indexAttribute="" filterTag="style" filterAttributes="" /> 
     
      <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" /> 
      <field name="navigation" weight="-1" indexAttribute="content" filterTag="div" filterAttributes="id:left_panel" /> 
    </fields> 


    If you search for Vista, which isn't in my keywords or description but is in the top menu, you get every page.

    Thanks

    Ben
  8. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    04 Jun 2008
    Does the search engine actually go out to the website and look at the returned content, or does it just look at the database and construct what it expects the page to be returned as?

    Also, should it be id:header or name:header?

    Thanks

    Ben
  9. Nikola
    51 posts
    Registered:
    24 Sep 2012
    05 Jun 2008
    Hello Ben Alexandra,

    Your XML looks fine. Actually, the typo is mine: in the code I sent previously, indexAttribute="content" is not needed for the navigation field. The correct value is indexAttribute="".

    However, we tested by indexing your site as an External page and discovered a small bug in the HTML parsing algorithm. The problem occurs when indexing nested controls, because the parser does not recognize the proper closing tag for the content that should be skipped.
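
    For readers, this kind of problem (losing track of the proper closing tag when elements are nested) is commonly avoided by counting nesting depth. The Python sketch below uses the standard library's html.parser to show the technique. It is an illustration only and not the Sitefinity parser; the class name and the sample HTML are made up for the example.

```python
from html.parser import HTMLParser


class SkippingTextExtractor(HTMLParser):
    """Collects page text, skipping one element identified by tag name and id."""

    def __init__(self, skip_tag, skip_id):
        super().__init__()
        self.skip_tag = skip_tag
        self.skip_id = skip_id
        self.depth = 0      # greater than 0 while inside the skipped element
        self.chunks = []    # text that would be handed to the indexer

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested tags of the same name so that their closing
            # tags are not mistaken for the end of the skipped element.
            if tag == self.skip_tag:
                self.depth += 1
        elif tag == self.skip_tag and dict(attrs).get("id") == self.skip_id:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.skip_tag:
            self.depth -= 1

    def handle_data(self, data):
        if not self.depth and data.strip():
            self.chunks.append(data.strip())


html = (
    '<div id="header"><div class="menu"><a href="/spam">SPAM</a></div>'
    '<div>Vista</div></div>'
    '<div id="content">Real page text</div>'
)
parser = SkippingTextExtractor("div", "header")
parser.feed(html)
parser.close()
print(parser.chunks)  # ['Real page text']
```

    A parser that stopped skipping at the first closing div would let "Vista" leak into the index, which matches the symptom reported in this thread.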

    The bug will be fixed in the upcoming hotfix.

    You can find an alternative approach to prevent navigation from indexing in one of the previous posts from Bob:

        protected override void Render(HtmlTextWriter writer)     
        {     
            // Checks if this is called by the Search Indexer and does not render anything if so.     
            // Navigation controls are present in every page and should NOT be indexed multiple times.     
            if (!CmsContext.IsCrawlerRequest)     
                base.Render(writer);     
        }  
    Please note that this is implemented in the Sitefinity navigation controls such as SitePanelBar, SiteMenu etc. It is not implemented in the RadControls. If you're using RadMenu and/or RadPanelBar in your template as navigation controls, the contents of the menus will be indexed by the Indexing Service.

    You can use the code shown above in the code-behind of your navigation control, use SitePanelBar and SiteMenu instead, or wait a few days for the hotfix.


    Greetings,
    Nikola
    the Telerik team

  10. Craig
    8 posts
    Registered:
    16 May 2008
    06 Jun 2008
    Was this bug resolved? We're running 3.2 SP2 and have noticed that the indexer is still refusing to ignore the div structure we're defining in the fieldsInfoProvider.xml. We can force the navigation to not render for the crawler, but this doesn't help us for other aspects of the site we don't want crawled. (header, footer etc).
  11. Nikola
    51 posts
    Registered:
    24 Sep 2012
    06 Jun 2008
    Hi Craig,

    It is resolved and will ship shortly with the upcoming hotfix, as mentioned in my previous post.

    Greetings,
    Nikola
    the Telerik team

  12. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    06 Jun 2008
    Thanks for fixing it!  When will the hotfix be released?  Today by any chance?

    Ben
  13. Rebecca
    536 posts
    Registered:
    24 Sep 2012
    06 Jun 2008
    Hi Ben Alexandra,

    The hotfix should be out by the end of next week, so stay tuned.

    Kind regards,
    Rebecca
    the Telerik team

  14. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    08 Jul 2008
    Hi Guys,

    Thanks for your work on the search.  Generally it works well, but I have a couple of small issues.  Take the link below, for example:
    http://www.sustainablecards.com/search/index.aspx?IndexCatalogue=web&SearchQuery=peo

    So this is a search for a person who works with this company.  He shows up on 2 pages.  The first is the first page in a page group, and you'll notice that the page name that Search finds is " | Sustainable Cards, LLC. | "Leave No Carbon Footprint!"".  The problem is that it should start with "Overview | Susta...".  Click the first link returned and then look at the ACTUAL title (it is different from the one that Search sees when it indexes the site).

    The second issue is removing areas of search that are repeated.  For example, on this site we append " Sustainable Cards, LLC. | Leave No Carbon Footprint" to all page titles in the C# for the Master Page.  They then show in the search results.  If I set the Title weight to -1 instead of 1, the title is hidden in the search results and the formatting looks silly.  And if someone searches for Sustainable, every page is returned.

    I'm not sure how to suggest you handle this second issue, except for getting the title from the database, not from the page. I'm not really sure how to get around it.  I can't put a div in the title because then browsers will return the HTML inline in the title bar.  Any suggestions?

    I don't want to remove that information from the page title, as it is often helpful for search engines and people.  I sometimes put keywords in the page title using code and that can help with SEO / page ranking.

    Also, how do external sites work?  I have a site (http://usa.weleda.com) that has content on shop.weleda.com and articles.weleda.com.  Would those get indexed by your search?  Can I add those domains in the XML and get them returned?  Really they are the same site, though they're not in Sitefinity and they're on different domains.

    Thanks for any help you can suggest.

    Ben
  15. Nikola
    51 posts
    Registered:
    24 Sep 2012
    08 Jul 2008
    Hello Ben Alexandra,

    We tested indexing a single page with title Overview | Sustainable Cards, LLC. | "Leave No Carbon Footprint!".
    Using the default page index settings the page was indexed fine, the index contained the whole title.
    Searching for "Carbon" returned the page in the results list.

    Then we set the weight of the title field to "-1" in fieldsInfoProvider.xml and reindexed, which causes the title not to be indexed. When we then searched for "Carbon", the page did not show in the results at all.

    As for the external pages, you have to create pages in the CMS and select External page for Page type.
    If those pages are included in the index and have Search Options > Index this page = Yes, they should be indexed.

    Sincerely yours,
    Nikola
    the Telerik team

  16. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    09 Jul 2008
    Hi Nikola,

    Thanks for your rapid reply. Your support is always appreciated. I think, however, that you missed the point a bit.  Let me try to clarify.

    If you look at the link I sent, you'll see that the first result is not showing the page name properly.  Look at the page title in the results, then go to the page and look at the title there: the word Overview is missing in the search results (bug!).

    The second issue is that if I put any keywords in the page title, as we do for SEO purposes, and someone searches for those words, EVERY page gets returned.  Right now if you search for Sustainable every page gets returned, because we append Sustainable Cards, LLC. to the page title via C# on Page_Load.  I'm wondering if there's a way around that.  If I exclude the page title from the search results (which I'd like to do), the results that get returned are not formatted correctly, as they don't show the page title.  I know that's a tricky one, but I was hoping you could SHOW the page title but ignore it in the algorithm that calculates which pages get returned.  Or give it a 0.00001 ranking.

    Does that make any sense?

    Thanks a lot

    Ben
  17. Nikola
    51 posts
    Registered:
    24 Sep 2012
    10 Jul 2008
    Hello Ben Alexandra,

    I did notice the difference between the Page title in the Search Results and the actual page title seen at the title bar of the browser window. My point was that when the page title is properly entered in the page, it is then properly indexed. So by design, it is working as expected.

    If you append the title at runtime, the indexer may have missed it.

    As for the second issue, if you append titles to all the pages at runtime, the Indexing Service crawler indexes those titles, because it requests the page and receives it the way you would see it in your browser.

    You can skip indexing the common title text by detecting when the page is being requested by the Indexing Service. Just an idea: in the method where you append the title, include a check for CmsContext.IsCrawlerRequest, and if it is true, do not add the common title text.

    Let me know if you need further assistance. We would be glad to continue this discussion in a new support thread.

    All the best,
    Nikola
    the Telerik team

  18. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    10 Jul 2008
    Hi Nikola,

    Thanks for the suggestion; I'll try that.  It DOES, however, still seem to be a bug, as one result is fine and one is missing the real page title.  Compare the 2 results, then go and look at each page: you'll see the pages are set up identically (except one is the first in a page group), yet the title is handled differently for each, so I think it must be a bug, no?

    Thanks

    Ben
  19. Nikola
    51 posts
    Registered:
    24 Sep 2012
    11 Jul 2008
    Hello Ben Alexandra,

    Try searching for "sustainable" and you'll see that the page title is missing not for the first result item specifically, but for pages that have no title entered in the meta fields. For example, the following pages are missing titles:

        ~/sitemap.aspx
        ~/about_us/index.aspx
        ~/products/index.aspx

    It would probably be easier if you sent us the code you're using to substitute the pages' titles at runtime.

    Regards,
    Nikola
    the Telerik team

  20. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    11 Jul 2008
    Hi,

    There is actually an error with some pages.  I debugged the crawl process, and for some of the pages Page.Title came back null, whereas for other pages it came back with the correct page title.  All the pages are set up the same, except some are page groups, but those get ignored.  I can't figure out what the pages that don't index correctly have in common.  Even if I comment out the code that changes the title, I get some pages with null titles.  I don't get it.  Check out these 2 links:

    Error with Search (Screen Shot) / Search for Cards on Sustainable Cards

    Any ideas what could be going on?

    Ben
  21. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    11 Jul 2008
    OOOOOOOOOOOOOOOOOOOH!!!!  Got it!  Pages where I don't specifically specify a Title don't show a title in the search!

    I've never done that, unless it's different from the Menu title, as it's redundant and has never been necessary.  It seems that SF just uses the Menu Title if there's no Page title.

    Can I fix that in the Master.cs file so I don't have to go through every page and assign a title manually, so it gets indexed correctly?  Or could you at least send some SQL so I can update the blank titles automatically?  I guess I'd have to do that from time to time to make sure there were no gaps, and would have to retrain all my clients.  C# code that sets the page title if it's a CmsContext.IsCrawlerRequest would be easier ;)

    Thanks

    Ben
  22. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    11 Jul 2008
    Hi,

    One more thing.  Is there any way to index the contents of PDF documents and have those returned by search?

    Ben
  23. Brook
    39 posts
    Registered:
    21 Mar 2007
    11 Jul 2008
    I am curious about the PDF search also. I had noticed that the ASPOSE binaries are in the bin folder.

    Brook
  24. Nikola
    51 posts
    Registered:
    24 Sep 2012
    14 Jul 2008
    Hi Brook,

    A PDF indexer is not implemented yet, but we plan to implement one.

    Ben,

    You can substitute the empty titles by putting the following code in your .master page:

    <script runat="server">  
        protected void Page_Load(object sender, EventArgs e)  
        {  
            if (Telerik.CmsContext.IsCrawlerRequest  
                && this.Page.Title.Equals("Untitled Page", StringComparison.OrdinalIgnoreCase))  
            {  
                this.Page.Title = "Title1";  
            }  
        }  
    </script> 

    The second condition is valid if you have "Untitled Page" set by default in your master page. This will prevent indexing empty titles, which I think could happen only when you have no title for the particular page and no title defined in the .master file.

    Don't hesitate to contact us if you have other questions.

    All the best,
    Nikola
    the Telerik team

  25. Ben Alexandra
    215 posts
    Registered:
    15 Sep 2012
    14 Jul 2008
    Thanks, but that doesn't REALLY help.  It just returns Title1 for half the titles.  Is there no way to access the Menu Title, as that's what gets replaced in the page if there's no Page Title, right?  We generally don't use the Page Title field, unless we want it to be different from the Menu Title (which is rare).  If not, can you send some SQL to update the database to replace all the empty page titles with their menu titles?

    Thanks

    Ben
  26. Nikola
    51 posts
    Registered:
    24 Sep 2012
    17 Jul 2008
    Hello Ben Alexandra,

    You can replace the empty page title by overriding the OnPreInit method of Telerik.Cms.Web.InternalPage.
    Create a new class item in your project's App_Code folder and add the following code:

    using System;

    namespace Telerik.Cms.Web
    {
        public class SmartTitleInternalPage : InternalPage
        {
            protected override void OnPreInit(EventArgs e)
            {
                base.OnPreInit(e);
                // Fall back to the menu name when the page has no title of its own.
                if (string.IsNullOrEmpty(base.CmsPage.Title))
                {
                    this.Title = base.CmsPage.MenuName;
                }
            }
        }
    }

    To use this class instead of the InternalPage class, you have to modify the Inherits property of the cmsentrypoint.aspx located in the Sitefinity folder as follows:

    <%@ Page Inherits="Telerik.Cms.Web.SmartTitleInternalPage" MasterPageFile="~/Sitefinity/Dummy.master" %> 

    Bear in mind that the InternalPage class is responsible for rendering all CMS pages properly, so any modification made when overriding it could lead to unexpected results and should be done carefully.

    Let me know if this works for you.

    Regards,
    Nikola
    the Telerik team

  27. Randel
    50 posts
    Registered:
    30 Aug 2012
    08 Oct 2008

    OK, I think I understand the concept and have created a new search index that searches my complete site.  However, I don't know all the values available for the following attributes (indexAttribute, filterTag, and filterAttributes), or what each attribute is specifically used for.

    I'm using the fieldsInfoProvider.xml file created for this search as my base of understanding and have pasted it below.

    For example, I see:
     - (3) empty strings and (2) "content" for indexAttribute
     - "title", "meta", "script", and "style" for filterTag
     - (3) empty strings, "name:keywords;", and "name:description;" for filterAttributes

    Again, what is "indexAttribute" used for?  Does "content" mean that the field will only search the content of a page, and is that the content of the page itself and not the template it's created from?  Also, if it does mean to only search the page content, what does the empty string mean?  Same thing, what are "filterTag" and "filterAttributes" used for / reference in a page?

    I still consider myself very new to web development and feel as if I'm missing some puzzle pieces here.  So I'll be more than happy to take any help/guidance.


    <?xml version="1.0" encoding="utf-8"?>
    <fields>
      <field name="title"
             weight="1"
             indexAttribute=""
             filterTag="title"
             filterAttributes="" />
      <field name="keywords"
             weight="1"
             indexAttribute="content"
             filterTag="meta"
             filterAttributes="name:keywords;" />
      <field name="description"
             weight="1"
             indexAttribute="content"
             filterTag="meta"
             filterAttributes="name:description;" />
      <field name="script"
             weight="-1"
             indexAttribute=""
             filterTag="script"
             filterAttributes="" />
      <field name="style"
             weight="-1"
             indexAttribute=""
             filterTag="style"
             filterAttributes="" />
    </fields>
  28. Georgi
    3583 posts
    Registered:
    28 Oct 2016
    10 Oct 2008
    Hi Randel,

    Here is a list of the properties that are taken into account, along with their description:

    filterTag - the tag name that will be used to extract the data to be indexed.

    filterAttributes - a collection of attributes that the tag must have, used to make sure the right tag has been captured.
    For example, filterAttributes="name:description" results in one item in the list, containing the KeyValuePair with Key="name" and Value="description".

    weight - Weight assigned to terms found in this index field.

    indexAttribute - the attribute to be indexed within the tag. If null, the data between the starting & ending tag will be indexed.
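
    For readers who want to see how these four properties fit together, here is a small Python sketch that reads a fields file of the shape quoted in this thread and reports what each entry would select. It is an illustration under stated assumptions, not Sitefinity code: the function names are invented, the "key:value;" parsing is inferred from the samples above, and a weight of -1 is treated as "excluded" as described earlier in the thread.

```python
import xml.etree.ElementTree as ET

# Sample in the same shape as the fieldsInfoProvider.xml quoted in this thread.
SAMPLE = """<fields>
  <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" />
  <field name="keywords" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" />
  <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" />
</fields>"""


def parse_filter_attributes(raw):
    """Turn 'name:keywords;' into {'name': 'keywords'} (assumed format)."""
    pairs = {}
    for chunk in raw.split(";"):
        key, sep, value = chunk.partition(":")
        if sep:
            pairs[key.strip()] = value.strip()
    return pairs


def describe(field):
    """Summarize one <field> element as a plain dictionary."""
    return {
        "name": field.get("name"),
        "tag": field.get("filterTag"),
        "match": parse_filter_attributes(field.get("filterAttributes", "")),
        # An empty indexAttribute means the text between the tags is used.
        "source": field.get("indexAttribute") or "inner text",
        "excluded": float(field.get("weight")) == -1,
    }


fields = [describe(f) for f in ET.fromstring(SAMPLE).iter("field")]
for entry in fields:
    print(entry)
```

    In this sketch the keywords entry reads the content attribute of the matching meta tag, while the title entry reads the text between the title tags.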

    Please let us know if you need further information on the matter.

    Greetings,
    Georgi
    the Telerik team

    Check out Telerik Trainer, the state of the art learning tool for Telerik products.
  29. Armen
    11 posts
    Registered:
    07 Feb 2008
    18 Dec 2008
    Hi,

    Thank you for this thread; it helped me a lot. I would like to understand how to have the fieldsInfoProvider.xml file generated the way I need. After I deleted the index and created a new one, all my changes to the XML file were lost. I would like my modified fieldsInfoProvider.xml to be kept or regenerated whenever the end user (administrator) deletes indexes and/or creates new ones, so that they don't need to contact me about an indexing problem every time.

    Can you please let me know how to generate the fieldsInfoProvider.xml my way when creating the index? I believe that would be very handy for all devs.

    Thanks, 
    Armen
  30. Georgi
    3583 posts
    Registered:
    28 Oct 2016
    20 Dec 2008
    Hi Armen,

    Unfortunately, there is no way to override the method that creates the fieldsInfoProvider.xml file. We will consider this for one of the next versions. What you can do for now is re-index the web site instead of creating a new index every time.

    Kind regards,
    Georgi
    the Telerik team
