Telerik Forums, Sitefinity 3.x: Bugs & Issues

Search Indexes TOO Much

  • Ben Alexandra (Intermediate)

    Posted on Oct 17, 2007

    Hi,

    Thanks so much for getting search working. It's great to have. I was, however, a little surprised by how you implemented a couple of things. The biggest thing is that it seems to actually spider the built pages (instead of just reading the data stored in the database). This has some advantages (like reading custom WebUserControls as rendered) but also has at least one big problem.

    The biggest problem is that if someone searches for something that is in the menu, it returns EVERY PAGE! For example, on my site, a lot of people will be searching for CampusTrakker or Web Hosting, and even though those are very different things, they both return every page on the site.

    One idea I had for getting around indexing things in the template would be for you to try to distinguish template info and links from content links, as Google must. (If you search Google for Trakkware Pricing, it only returns my pricing page, not every page, even though the word Pricing is on every page.) Now of course I realize you are not Google and don't have their resources, but I'm wondering if you can either look for template data or, as an easier (and maybe temporary) solution, have an exclude tag.

    I guess what I'm thinking is something like
    <html> 
    <head> 
        <header info....> 
    </head> 
    <body> 
    <!-- BEGIN_IGNORE_FOR_SITEFINITY_SEARCH --> 
        <template data, menus, etc......> 
    <!-- END_IGNORE_FOR_SITEFINITY_SEARCH --> 
          
        regular content, editable regions, etc....  
     
    <!-- BEGIN_IGNORE_FOR_SITEFINITY_SEARCH --> 
        <more template data, mainly ending tags.....> 
    <!-- END_IGNORE_FOR_SITEFINITY_SEARCH --> 
    </body> 
    </html> 

    The nice thing about something like that is people don't have to use it, but if they are having trouble, they could just drop a couple of begin and end ignore tags on their master page and they'd be set.

    Does that make any sense?  Does that seem like it would be a good thing to do?  Would other people be interested in having a feature like that? Would that be something doable by SP1?

    Ben

  • Telerik Admin

    Posted on Oct 17, 2007

    Hello Ben Alexandra,

    You will have full control of what is indexed and what not in v3.2. Unfortunately, we will not be able to make it for SP1.

    Our approach is a bit different. You will be able to specify filter fields for each index catalogue and assign a weight to each. The weight is used to rank the page within the result set, and a negative weight means the field is not indexed. You will be able to filter by tags and attributes. Here is an example of such a configuration:
    <?xml version="1.0" encoding="utf-8"?>   
    <Fields>   
        <field name="text" filterTag="body" filterAttrebutes="" weight="1.0" indexAttrebute="" />   
        <field name="keywords" filterTag="meta" filterAttrebutes="name:keywords" weight="2.0" indexAttrebute="content" />   
        <field name="description" filterTag="meta" filterAttrebutes="name:description" weight="1.5" indexAttrebute="content" />   
        <field name="header" filterTag="h1" filterAttrebutes="" weight="1.6" /> 
        <field name="noindex" filterTag="div" filterAttrebutes="class:noindex" weight="-1" /> 
    </Fields>  

    Of course, we will provide user interface for these settings in the Search/Index section.

    Furthermore, we will provide a way for controls to determine whether the current request comes from the crawler, so you can decide what to render. For example, the navigation controls will render an empty string by default in this case.

    Do you think this is flexible enough? Let me know what you think.

    Best wishes,
    Bob
    the Telerik team

    Instantly find answers to your questions at the new Telerik Support Center

  • Ben Alexandra (Intermediate)

    Posted on Oct 18, 2007

    Hi Bob,

    I think that would work well, but it's a long way off.  Are there any temporary work-arounds?  I find it pretty much useless at this point, unfortunately.  It sounds like your solution is good, but I need to decide if I'm going to remove Search functionality from my sites (all recent sites have been built with a search box in the template based on it being available for 3.1).

    I know my suggestion about Ignore is pretty hacky, but it seems like an easy temporary solution to the problem of too much being indexed. Or at least give us the option of ignoring RadMenus and RadPanelbars.

    Thanks.  Just let me know what you decide and if there are any temporary fixes I can do for now.

    Thanks a ton!

    Ben

    PS Attribute is spelled with an i, not an e ;)

  • Telerik Admin

    Posted on Oct 18, 2007

    Hello Ben,

    Thank you for correcting my spelling. I should take another English course:)

    Most of the functionality will be available in SP1. Menus and PanelBars will handle that themselves by rendering nothing for the indexer. Keywords, Title and Description will be predefined, and you will be able to set a weight for each of them, but you will not be able to add your own tags, so you will not yet be able to specify custom areas to exclude from indexing.

    Greetings,
    Bob
    the Telerik team

  • Ben Alexandra (Intermediate)

    Posted on Oct 18, 2007

    Hi Bob,

    No, your English is fantastic. I only mention it because it's source code, and we wouldn't want any typos in Sitefinity's source code, right? ;)

    OK, so I'm confused.  I thought you said: "You will have full control of what is indexed and what not in v3.2. Unfortunately, we will not be able to make it for SP1."  Now you're saying "Most of the functionality will be available in SP1."

    I guess I'm wondering if you're going to have initial functionality in SP1, then more full features in 3.2 or did you decide to move up the functionality from 3.2 to SP1?

    I guess the main question is, will the big bug (too much being indexed) be resolved by SP1?  Even if it's not perfect and not ranking (which sounds cool), will I at least get decent results?  What functionality will you provide in SP1, what will be added to 3.2?

    Thanks so much!  Keep up the great work!

    Ben

    PS Is there a tentative date for 3.2?  Sitefinity kicks ass!  And with every version it kicks more ass and kicks it harder, so of course I'm anxious for each new release!

  • Telerik Admin

    Posted on Oct 18, 2007

    Hi Ben,

    Thank you for the nice words.

    So, we are going to have initial functionality in v3.1 SP1 and full features in v3.2. The big bug (too much to index) will be solved. What you won’t be able to do in SP1 is define your own areas to affect the page ranking.

    A release date for version 3.2 has not been set yet, but it will be in January. We pushed it back a little as we were about two weeks late with 3.1. Soon we will publish the road map for 3.2 and 4.0. Some very exciting features are on the way.
     
    Sincerely yours,
    Bob
    the Telerik team

  • Ben Alexandra (Intermediate)

    Posted on Oct 18, 2007

    That sounds perfect.  Yeah, as long as it's at least workable, that's great.  I can wait till Jan or Feb for full features.  I understand SP1 will be out in a week or so, is that right? 

    The features for 3.2 sound sweet.  I can't even imagine what you're cooking up for 4.0 ;)

    Thanks

    Ben

    PS Is there an API for search?

  • Telerik Admin

    Posted on Oct 18, 2007

    Hi Ben,

    The service pack should be out next week.

    Yes, there is an API for search. In fact, you can provide your own sources for indexing. I hope that we will be able to provide examples soon.

    Sincerely yours,
    Bob
    the Telerik team

  • Ben Alexandra (Intermediate)

    Posted on Feb 27, 2008

    Hi,

    Is it set up to only search certain regions yet? Do you have documentation on how to set that up?

    http://cms.newcenturybank.com/search/index.aspx?IndexCatalogue=web&SearchQuery=community

    Also, have you looked at the issue of duplicate pages showing up due to multiple page addresses? It seems it should only find the primary page, not the other URLs, no? Look at the link above: you'll see it returns /index/search.aspx and /search/index.aspx, which are the same page.

    Thanks

    Ben

  • Telerik Admin

    Posted on Mar 4, 2008

    Hi Ben Alexandra,

    Unfortunately none of these issues could make it to this release. A lot of changes and optimizations have started on this front and I hope we will be able to finish them for the SP1 of v3.2.

    I’m sorry for not being able to help this time.

    Kind regards,
    Bob
    the Telerik team

  • Brook

    Posted on Mar 17, 2008

    Any information on what may or may not make it into 3.2 SP1 with regard to the search engine?

    Thanks in Advance...

  • Ben Alexandra (Intermediate)

    Posted on Mar 18, 2008

    Yes, we're anxiously awaiting, as the current model really doesn't work.  The 2 biggest problems being 1) Duplicate Page Urls being returned (should be just based on the sitemap, not on all Urls for a page) and 2) Extraneous words in template (not content of pages) being returned, so if you search for something that exists in the template, such as Products, EVERY page is returned, and each page is returned multiple times due to the first problem.
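    As a rough illustration of the first problem, the Python sketch below collapses several URLs of the same page into a single result. This is not Sitefinity code: the `primary_of` mapping, the `/about.aspx` hit, and the choice of `/search/index.aspx` as the primary URL are assumptions made for the example, since the thread does not show how Sitefinity stores alternate URLs.

```python
def collapse_to_primary(hit_urls, primary_of):
    """Map each hit to its page's primary URL and keep only the first hit per page.

    primary_of is a hypothetical lookup from an alternate URL to the primary URL;
    URLs that are not in the lookup are treated as already primary.
    """
    seen = set()
    unique = []
    for url in hit_urls:
        primary = primary_of.get(url, url)
        if primary not in seen:
            seen.add(primary)
            unique.append(primary)
    return unique


# Assumed example data: two addresses for one page, plus one unrelated page.
HITS = ["/index/search.aspx", "/search/index.aspx", "/about.aspx"]
PRIMARY_OF = {"/index/search.aspx": "/search/index.aspx"}

print(collapse_to_primary(HITS, PRIMARY_OF))
```

    The order of the remaining results follows the order of the original hits, so ranking is preserved while duplicates are dropped.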

    If you solve those 2 problems, I think it'll work really well. Hopefully creating sections to be ignored in the template will be easy.

    Thanks a lot!

    Ben

  • Telerik Admin

    Posted on Mar 19, 2008

    Hi Ben, Brook

    We are working hard on improving the Search Module for SP1. We found several other issues with the search functionality, as well as possible areas for improvement, and the fixes will all be included in Sitefinity 3.2 Service Pack 1.

    We apologize for any inconvenience caused.

    All the best,
    the Telerik team



  • Paul Dain

    Posted on Apr 28, 2008

    It looks like this was actually implemented in SP1 -- can you verify?

  • Telerik Admin

    Posted on Apr 29, 2008

    Hello Paul Dain,

    The issue with duplicated search results is fixed in SP1. There are still issues with the search engine that we are working on, and they will be fixed in the next service release (in May).

    Greetings,
    Georgi
    the Telerik team

  • Paul Dain

    Posted on Apr 29, 2008

    Thanks for the info.

    We have been trying the fieldsInfoProvider.xml, but there does not appear to be any documentation for it. Specifically, what attributes/values are allowed and what are the various effects. Is this something you can provide?

    Thanks,

    - Paul

  • Telerik Admin

    Posted on Apr 30, 2008

    Hi Paul Dain,

    Sure, we can provide that information.

    This is a new feature introduced in Service Pack 1. The file is used by the search engine to handle content better while indexing. Here is an example of what the file contains:
    <fields>
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" />
      <field name="keywords" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" />
      <field name="description" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" />
    </fields>

    As we can see, there are 3 different fields here: title, keywords and description. These correspond to the title and meta tags found in every HTML page. Every field has a weight attribute. The search engine spiders through the pages, indexing the content and assigning a weight to the different parts of each page. The weight depends on the values set in this file. Later, when you search for something, the results are sorted based on that weight: the pages where your search term appears in higher-weighted fields come first in the list. The search engine also takes into account how often the search term is repeated in each page.

    This way you can exclude certain content of a page from indexing, or give higher priority to the keywords of a page, even though the keywords are not displayed (because the keywords field is a meta tag). By certain content I mean that you can even exclude content within a given tag, or within a tag with a specified class.

    Excluding, for example, the indexing of the title tag of the page would look like this :

    <field name="title" weight="-1"  
     indexAttribute="" filterTag="title" filterAttributes="" />  

    Please note that the weight attribute has a negative value. Content matching a field with such a weight will not appear in the search results at all.

    It is true that our documentation lacks this information. We will definitely work to change this fact, and provide a full information on that file.
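    To make the weighting idea concrete, here is a toy Python model of it. It is only a sketch under stated assumptions: the scoring rule (weight times number of occurrences, summed over fields) is invented for illustration and is not how Lucene.Net or Sitefinity actually rank results, and the page data and the `body` and `navigation` field names are made up.

```python
def score_page(field_texts, field_weights, term):
    """Sum weight * occurrences over all fields; negative-weight fields are skipped."""
    term = term.lower()
    score = 0.0
    for field, text in field_texts.items():
        weight = field_weights.get(field, 1.0)
        if weight < 0:
            continue  # a negative weight means the field is treated as not indexed
        score += weight * text.lower().split().count(term)
    return score


def rank(pages, field_weights, term):
    """Return URLs of matching pages, highest score first (ties broken by URL)."""
    scored = [(score_page(fields, field_weights, term), url)
              for url, fields in pages.items()]
    return [url for s, url in sorted(scored, key=lambda p: (-p[0], p[1])) if s > 0]


# Assumed example data: every page shares the same navigation text.
WEIGHTS = {"title": 1.0, "keywords": 2.0, "body": 1.0, "navigation": -1.0}
NAV = "Home Pricing Contact"
PAGES = {
    "/pricing.aspx": {"title": "Pricing", "keywords": "pricing plans",
                      "body": "Our pricing is simple", "navigation": NAV},
    "/contact.aspx": {"title": "Contact", "keywords": "contact",
                      "body": "Call us", "navigation": NAV},
    "/home.aspx": {"title": "Home", "keywords": "home",
                   "body": "See our pricing page", "navigation": NAV},
}

print(rank(PAGES, WEIGHTS, "pricing"))
```

    With the navigation field weighted -1, only the pages whose own content mentions the term are returned. Giving navigation a positive weight in this model makes every page match, which is the symptom discussed throughout this thread.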

    All the best,
    Georgi
    the Telerik team

  • SelArom (MVP)

    Posted on Apr 30, 2008

    I really love the way this was implemented, it is very intuitive (once you figure it out of course). you can restrict any element from being indexed, not just meta or whatever, you just have to set the appropriate fields.

    for example, my search results were bringing up every single page because certain terms are in the navigation menu. so I set the search to exclude everything in the radmenu div using its class, and another menu using its id, so that they are not indexed. now only my actual page content is scanned and searched. just set the filterTag to div, and then select the element through filterAttributes with a type:name syntax:

    <field name="navigation" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:navigation" />

    <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="class:RadMenu" />


    very cool! I would assume that you could also ADD pages to your search by specifying divs to add special weight if a certain div is present like 

    <field name="header" weight="5" indexAttribute="" filterTag="div" filterAttributes="class:importantstuff" />


    so that these results come up first. not too shabby. man i love sitefinity more every day!

  • Brook

    Posted on Apr 30, 2008

    This is great news. I wish there was a better way to communicate when these new features are implemented. Perhaps there could be an area in the clients section organized by functionality (Search, Blogs, etc.) in which the development team could post announcements of new or changed features, with links to the documentation?

  • Telerik Admin

    Posted on May 3, 2008

    Hi Brook,

    We are putting a lot of thought and effort into making the communication process (as well as the communication infrastructure) better, faster and more accurate. We fully understand that you need this kind of information in order to plan your own activities and the work on your projects.

    I just wanted to let you know that this need has been recognized and we are already taking steps to improve this area. Thank you for all your great input and the patience you've demonstrated.

    Greetings,
    Ivan
    the Telerik team

  • Zubair

    Posted on May 15, 2008

    OK, I came here looking for a way to keep page titles from being indexed. It looks like I've found it here, and I'm going to implement it (and come back and post issues, if any).

    For now I'll just second Brook's opinion. I've also noticed that a lot of improvements and bug fixes go undocumented, and nobody knows what's available out of the box in a new version or service pack (like this one). So please, please do add a section where you make all the announcements, and I'd say put a link to it on the homepage. Thanks.

  • Zubair

    Posted on May 15, 2008

    I have been facing this problem for some time, and even more now because I need to exclude Title from search.

    I noticed that after running the index once, I cannot run it again; it gives me the following error.

    Could not find file '....\Index\segments'.

    Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

    Exception Details: System.IO.FileNotFoundException: Could not find file '.............\Index\segments'.

    Source Error:

    An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.

    Stack Trace:

    [FileNotFoundException: Could not find file 'D:\Web\DIC.website\App_Data\Search\DubaiInternetCity\Index\segments'.]
       System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath) +1971213
       System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy) +998
       System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share) +114
       Lucene.Net.Store.FSIndexInput..ctor(FileInfo path) +70
       Lucene.Net.Store.FSDirectory.OpenInput(String name) +66
       Lucene.Net.Index.SegmentInfos.Read(Directory directory) +44
       Lucene.Net.Index.AnonymousClassWith.DoBody() +40
       Lucene.Net.Store.With.Run() +56
       Lucene.Net.Index.IndexReader.Open(Directory directory, Boolean closeDirectory) +102
       Telerik.Search.Engine.SearchManager.GetIndexingStatistics(String Provider) +144
       Telerik.Search.WebControls.Admin.ControlPanel.Indexes_ItemDataBound(Object sender, RepeaterItemEventArgs e) +

    Also, when I set the weight of the title field to -1 in fieldsInfoProvider.xml, I notice that some pages with the search keyword in the content area don't appear in the results, and when I try to index again I get the above error. Previously I was able to recover from the error by deleting the Search folder under App_Data, but now even that doesn't work.

    Please tell me what's going on. Thanks

  • SelArom (MVP)

    Posted on May 15, 2008

    hmm, I seem to not have this down after all. I thought maybe it was just that I had indexed everything in full, so I deleted the index and created a new one. unfortunately this too is indexing the whole page including the navigation. I've included my filter below, can you tell me if I've done anything wrong?

    <?xml version="1.0" encoding="utf-8"?>  
    <fields> 
        <field name="title" weight="3" indexAttribute="content" filterTag="title" filterAttributes="" /> 
        <field name="keywords" weight="2" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords" /> 
        <field name="description" weight="1" filterTag="meta" filterAttributes="name:description" indexAttribute="" /> 
        <field name="navigation" weight="-1" filterTag="div" filterAttributes="class:nav" indexAttribute="" /> 
        <field name="header" weight="-1" filterTag="div" filterAttributes="class:topnav" indexAttribute="" /> 
    </fields> 

  • Telerik Admin

    Posted on May 17, 2008

    Hello Zubair,

    Unfortunately, a few bugs were discovered involving improperly locked or deleted index files. To fix your problem, you have to delete the entire index folder (~/App_Data/Search/[Index Name]) and then reindex the site. This issue has been fixed for v3.2 SP2.

    Josh,

    Your filter is correct and should work just fine. Note that the filter does not work in versions prior to v3.2 SP1, although the file is present in them. If you are using the latest version and you still have this problem, could you please send us your project so we can examine it? There is also an alternative way to prevent navigation from being indexed. Please consider the code below:
        protected override void Render(HtmlTextWriter writer)  
        {  
            // Checks if this is called by the Search Indexer and does not render anything if so.  
            // Navigation controls are present in every page and should NOT be indexed multiple times.  
            if (!CmsContext.IsCrawlerRequest)  
                base.Render(writer);  
        } 
    Please take a look at the implementation of the navigational controls in ~/Sitefinity/UserControls/Navigation/.

    The second approach will work for previous versions as well.

    There are a lot of improvements and bug fixes in Search for SP2 so stay tuned.


    All the best,
    Bob
    the Telerik team

  • Zubair

    Posted on May 18, 2008

    hi,

    I'm facing some problems with Search and Search paging.

    I've set PostsPerPage to 6, but I noticed that when I get more than 6 results, the repeater sometimes shows only 5 results on a page: it skips the <AlternateItem> template for the 6th item and shows the 6th result on the next page. This happens on all pages; however, reindexing the site solves the issue.

    • I also noticed that sometimes, if I search for something and get 23 results while showing 6 posts per page, the total page count returned is 4, which is fine. The problem is that the 2nd page only shows me 4 results, while pages 1 and 3 show 6 results and page 4 shows 5. So where did my 2 results go?
    • Another issue is that sometimes the description of some pages is not shown; this happens randomly.

    I think there are a lot of issues with the search, and I'm hoping that they're addressed in SP2.

    (I can send you a Url to test this issues, please post your email)

    Please let me know what's going wrong. Thanks
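    For reference, the page sizes one would expect for the 23-result case can be worked out with a short Python sketch. It only does the arithmetic; it says nothing about how the Search Results control pages internally, and the function name is made up for the example.

```python
import math


def page_sizes(total_results, per_page):
    """Return the number of results expected on each page, in page order."""
    if per_page <= 0:
        raise ValueError("per_page must be positive")
    pages = math.ceil(total_results / per_page)
    return [min(per_page, total_results - page * per_page) for page in range(pages)]


print(page_sizes(23, 6))  # [6, 6, 6, 5]
```

    So 4 pages with 5 results on the last one is the expected layout, and a second page holding only 4 results means 2 of the 23 hits were dropped from display.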

  • Telerik Admin

    Posted on May 19, 2008

    Hello Zubair,

    Unfortunately, we could not reproduce the reported behavior using Sitefinity's Search Results control. We tried the same scenario (23 search results, 6 per page) and did not see the problem. Can you please provide us with a link where we can see the exact issue and troubleshoot it further? Also, please elaborate on the <AlternateItem> template and where exactly you are setting it.

    Thank you for your cooperation in advance.

    Best wishes,
    Nikifor
    the Telerik team

  • SelArom (MVP)

    Posted on May 19, 2008

    i am using the latest version, and still getting everything indexed; however, your workaround works PERFECTLY. This will work for now until I have some time to troubleshoot the filters. maybe the new sp will clear it up

    thanks!

  • Zubair

    Posted on May 20, 2008

    Thanks Nikifor, please provide me your email or shall I send it to support@telerik.com ?

  • Zubair

    Posted on May 20, 2008

    hi Nikifor,

    I've just sent an email to support@telerik.com with the details of the issue.

  • Telerik Admin

    Posted on May 20, 2008

    Hi Zubair,

    Thank you for providing the information. We will get on this and as soon as we have any result we will update this forum thread.

    Thank you for your time.

    Greetings,
    Nikifor
    the Telerik team

  • Telerik Admin

    Posted on May 22, 2008

    Hello Zubair,

    Unfortunately, we were again unable to reproduce the reported behavior. Our concern is that the problem could be something more specific, so it would help the investigation if you could provide us with the exact text of the missing records. That would let us narrow down the possibilities.

    Thank you for your cooperation in advance.

    Best wishes,
    Nikifor
    the Telerik team

  • Ben Alexandra (Intermediate)

    Posted on Jun 1, 2008

    Hi,

    I guess I don't understand how to implement this, and I haven't found the documentation. I think I finally found where to put the .xml file (~/App_Data/Search/[index name]/fieldsInfoProvider.xml, right?) but I don't know how to exclude the divs that contain the menus, for example. I would think it would be something like this, assuming the div had an ID of "header":

    <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" />

    Please advise if that's the right location for the XML and how I remove content in certain divs.  Also, when does this come into play?  On Index? On Search?

    Also, please let us know if you have a central place for this kind of information and documentation. I don't know how other people would find this information, aside from stumbling on this post. I also worry about what I'm missing out on, since I don't know of a central place for this stuff.

    Thanks

    Ben

  • Telerik Admin

    Posted on Jun 2, 2008

    Hello Ben Alexandra,

    The file is located at ~/App_Data/Search/[IndexName]/fieldsInfoProvider.xml. It is created upon indexing, so you don't have to create it manually. Once indexing is done, you can edit, add, or remove fields in the file.

    Your presumption is correct: you can add new fields to the .xml file, for example:
    <field name="header" weight="-1" filterTag="div" filterAttributes="id:header" indexAttribute="" />
    This way, certain areas can be excluded or given a higher rank during indexing and search. A weight of "-1" means that the element is excluded from indexing; after the re-index is done, it won't appear in the search results either.

    More samples and instructions on that matter can be found in the previous posts of the current thread.
    Did it work for you? Let us know about the results.
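    Since a misspelled attribute name in this file is easy to miss, a quick sanity check can help before re-indexing. The Python sketch below is a hypothetical helper, not part of Sitefinity: it lists the fields with a negative weight and flags attribute names that do not appear in the examples in this thread. The set of known attributes is an assumption taken from those examples, and the real schema may allow more.

```python
import xml.etree.ElementTree as ET

# Attribute names seen in the examples in this thread (assumed, not an official schema).
KNOWN_ATTRIBUTES = {"name", "weight", "indexAttribute", "filterTag", "filterAttributes"}


def check_fields(xml_text):
    """Return (excluded_field_names, warnings) for a fieldsInfoProvider-style document."""
    excluded, warnings = [], []
    for node in ET.fromstring(xml_text).iter("field"):
        name = node.get("name", "?")
        for attr in node.attrib:
            if attr not in KNOWN_ATTRIBUTES:
                warnings.append(f"{name}: unrecognized attribute '{attr}'")
        try:
            weight = float(node.get("weight", "1"))
        except ValueError:
            warnings.append(f"{name}: weight is not a number")
            continue
        if weight < 0:
            excluded.append(name)
    return excluded, warnings


# Sample with a deliberately misspelled attribute on the second field.
SAMPLE = """<fields>
  <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" />
  <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAtrributes="id:header" />
</fields>"""

print(check_fields(SAMPLE))
```

    In the sample, the header field is reported as excluded, but the warning shows that its filter attribute is misspelled, so the filter would presumably never match anything.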

    Best wishes,
    Nikola
    the Telerik team

  • Ben Alexandra (Intermediate)

    Posted on Jun 2, 2008

    Hi,

    Sorry, but that's not working for me. On my beta site, if you do a search for SPAM you'll see EVERY page (45) is returned, because I have the word SPAM in the top menu. Below is my file, with the last line added by me. Maybe you can take a look at my XML and at the HTML on my website and see why this isn't working. I've reindexed several times and I STILL have the problem where if you search for ANY word that shows up in my sitemap, you get back EVERY page in the sitemap.

    Thanks for any help you can provide!

    <?xml version="1.0" encoding="utf-8"?> 
    <fields> 
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" /> 
      <field name="keywords" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" /> 
      <field name="description" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" /> 
      <field name="script" weight="-1" indexAttribute="" filterTag="script" filterAttributes="" /> 
      <field name="style" weight="-1" indexAttribute="" filterTag="style" filterAttributes="" /> 
     
      <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" /> 
    </fields> 

    Ben

  • Telerik Admin

    Posted on Jun 3, 2008

    Hello Ben Alexandra,

    It is indexing your navigation links, which are wrapped in a div element with id=left_navpanel. There is no field for that element in the given .xml file, which means nothing excludes it.

    Can you try and put the following:
    <field name="navigation" weight="-1" indexAttribute="content" filterTag="div" filterAttributes="id:left_panel" /> 
    After that re-index the web index and try again searching for SPAM. Let us know if it worked.

    All the best,
    Nikola
    the Telerik team

  • Ben Alexandra (Intermediate)

    Posted on Jun 3, 2008

    Sorry, nope!  Same result (44 pages).

    That panel also only has the nav for that branch of the tree, so even if I were getting 10 results, or something, that would make sense, but 90% of the pages wouldn't have anything about the SPAM even in that panel.

    Any other suggestions?

    Ben

  • Ben Alexandra (Intermediate)

    Posted on Jun 4, 2008

    Hi,

    I also tried creating a whole new search index, thinking maybe something was corrupted or something, but that didn't work either.

    Just to be clear, here's my XML file (~/App_Data/Search/web/fieldsInfoProvider.xml).  Please double-check it's in the right place and set up correctly.  You can see the HTML that gets returned.  Is there another field I should be using?  Do I have a typo?

    <?xml version="1.0" encoding="utf-8"?> 
    <fields> 
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" /> 
      <field name="keywords" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" /> 
      <field name="description" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" /> 
      <field name="script" weight="-1" indexAttribute="" filterTag="script" filterAttributes="" /> 
      <field name="style" weight="-1" indexAttribute="" filterTag="style" filterAttributes="" /> 
     
      <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" /> 
      <field name="navigation" weight="-1" indexAttribute="content" filterTag="div" filterAttributes="id:left_panel" /> 
    </fields> 


    If you search for Vista, which isn't in my keywords or description but is in the top menu, you get every page.

    Thanks

    Ben

  • Ben Alexandra (Intermediate)

    Posted on Jun 4, 2008

    Does the search engine actually go out to the website and look at the returned content, or does it just look at the database and construct what it expects the page to be returned as?

    Also, should it be id:header or name:header?

    Thanks

    Ben


  • Telerik Admin admin's avatar

    Posted on Jun 5, 2008 (permalink)

    Hello Ben Alexandra,

    Your XML looks fine. Actually, the typo is mine: in the code I sent previously, indexAttribute="content" is not needed for the navigation field; the correct value is indexAttribute="".
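
    For reference, with that correction applied, the two exclusion fields from your file would presumably read as follows. This is only a sketch: the ids are copied from your XML, and nothing else is changed.

    ```xml
    <!-- Fragment of fieldsInfoProvider.xml: both fields sit inside the <fields> element. -->
    <!-- weight="-1" marks the matched element as content that should not be indexed. -->
    <field name="header" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:header" />
    <!-- indexAttribute is now empty instead of "content". -->
    <field name="navigation" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:left_panel" />
    ```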

    However, we tested by indexing your site as an External page and discovered a small bug in the HTML parsing algorithm. The problem occurs when indexing nested controls: the parser doesn't recognize the proper closing tag for the content that should be skipped.

    The bug will be fixed in the upcoming hotfix.

    You can find an alternative approach to prevent navigation from being indexed in one of the previous posts from Bob:

        protected override void Render(HtmlTextWriter writer)     
        {     
            // Checks if this is called by the Search Indexer and does not render anything if so.     
            // Navigation controls are present in every page and should NOT be indexed multiple times.     
            if (!CmsContext.IsCrawlerRequest)     
                base.Render(writer);     
        }  

    Please note that this check is implemented in the Sitefinity navigation controls such as SitePanelBar, SiteMenu, etc. It is not implemented in the RadControls. If you're using RadMenu and/or RadPanelBar in your template as navigation controls, the contents of the menus will be indexed by the Indexing Service.

    You can use the code shown above in the code-behind of your navigation control, use SitePanelBar and SiteMenu instead, or wait a few days for the hotfix.


    Greetings,
    Nikola
    the Telerik team


  • Craig avatar

    Posted on Jun 5, 2008 (permalink)

    Was this bug resolved? We're running 3.2 SP2 and have noticed that the indexer still doesn't ignore the div structure we're defining in fieldsInfoProvider.xml. We can force the navigation not to render for the crawler, but this doesn't help with other parts of the site we don't want crawled (header, footer, etc.).


  • Telerik Admin admin's avatar

    Posted on Jun 6, 2008 (permalink)

    Hi Craig,

    It's resolved, and the fix will ship with the upcoming hotfix shortly, as mentioned in my previous post.

    Greetings,
    Nikola
    the Telerik team


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jun 6, 2008 (permalink)

    Thanks for fixing it!  When will the hotfix be released?  Today by any chance?

    Ben


  • Telerik Admin admin's avatar

    Posted on Jun 6, 2008 (permalink)

    Hi Ben Alexandra,

    The hotfix should be out by the end of next week, so stay tuned.

    Kind regards,
    Rebecca
    the Telerik team


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jul 7, 2008 (permalink)

    Hi Guys,

    Thanks for your work on the search.  Generally it works well, but I have a couple of small issues.  Take the link below, for example:
    http://www.sustainablecards.com/search/index.aspx?IndexCatalogue=web&SearchQuery=peo

    This is a search for a person who works with this company.  He shows up on 2 pages.  The first is the first page in a page group, and you'll notice that the page name that Search finds is " | Sustainable Cards, LLC. | "Leave No Carbon Footprint!"".  The problem is that the actual title starts with "Overview | Susta...".  Click the first link returned and then look at the ACTUAL title (it's different from the one that Search sees when it indexes the site).

    The second issue is that I need to exclude text that is repeated on every page from the search.  For example, on this site we append " Sustainable Cards, LLC. | Leave No Carbon Footprint" to all page titles in the C# for the Master Page.  That text then shows in the search results.  If I set the title weight to -1 instead of 1, the title is hidden in the search results and the formatting looks silly.  If someone searches for Sustainable, it also returns every page.

    I'm not sure how to suggest you handle this second issue, other than getting the title from the database instead of from the page.  I can't put a div in the title because then browsers will show the HTML inline in the title bar.  Any suggestions?

    I don't want to remove that information from the page title, as it is often helpful for search engines and people.  I sometimes put keywords in the page title using code and that can help with SEO / page ranking.

    Also, how do external sites work?  I have a site (http://usa.weleda.com) that has content on shop.weleda.com and articles.weleda.com.  Would those get indexed by your search?  Can I add those domains in the XML and get them returned?  Really they are the same site, though they're not in Sitefinity and they're on different domains.

    Thanks for any help you can suggest.

    Ben


  • Telerik Admin admin's avatar

    Posted on Jul 8, 2008 (permalink)

    Hello Ben Alexandra,

    We tested indexing a single page with title Overview | Sustainable Cards, LLC. | "Leave No Carbon Footprint!".
    Using the default page index settings the page was indexed fine, the index contained the whole title.
    Searching for "Carbon" returned the page in the results list.

    Then we set weight="-1" for the title field in fieldsInfoProvider.xml and reindexed, which causes the title not to be indexed. So when we searched for "Carbon", the page did not show in the results at all.

    As for the external pages, you have to create pages in the CMS and select External page as the Page type.
    If those pages are included in the index and have Search Options > Index this page set to Yes, they should be indexed.

    Sincerely yours,
    Nikola
    the Telerik team


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jul 9, 2008 (permalink)

    Hi Nikola,

    Thanks for your rapid reply. Your support is always appreciated. I think, however, that you missed the point a bit.  Let me try to clarify.

    If you look at the link I sent, you'll see that the first result is not showing the page name properly.  Look at the page title and then go to the page and look at the title there (the word Overview is missing in the search results (bug!)). 

    The second issue is that if I put any keywords in the page title, like we do for SEO purposes, and someone searches for those words, EVERY page gets returned.  Right now if you search for Sustainable every page gets returned, because we append Sustainable Cards, LLC. to the page title via C# on Page_Load.  I'm wondering if there's a way around that.  If I exclude the page title from the search results (which I'd like to do), the results that get returned are not formatted correctly, as they don't show the page title.  I know that's a tricky one, but I guess I was hoping you could SHOW the page title, but ignore it in the algorithm that calculates which pages get returned.  Or give it a 0.00001 ranking.

    Does that make any sense?

    Thanks a lot

    Ben


  • Telerik Admin admin's avatar

    Posted on Jul 10, 2008 (permalink)

    Hello Ben Alexandra,

    I did notice the difference between the Page title in the Search Results and the actual page title seen at the title bar of the browser window. My point was that when the page title is properly entered in the page, it is then properly indexed. So by design, it is working as expected.

    Probably, if you append the title at run-time it could have been missed by the indexer.

    As for the second issue, if you append titles to all the pages at runtime, the Indexing Service crawler indexes those titles because it requests the page and receives it the same way you see it in your browser.

    You can skip indexing the common title text by detecting when the page is being requested by the Indexing Service. Just an idea: in the method where you append the title, include a check for CmsContext.IsCrawlerRequest, and if it's true, do not add the common title text.

    Let me know if you need further assistance. We would be glad to continue this discussion in a new support thread.

    All the best,
    Nikola
    the Telerik team


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jul 10, 2008 (permalink)

    Hi Nikola,

    Thanks for the suggestion, I'll try that.  It DOES, however, still seem to be a bug: one result is fine and the other is missing the real page title.  Look at the results page and compare the 2 results, then go and look at each page.  You'll see they are set up identically (except one is the first in a page group) and each has its own page title, so I think it must be a bug, no?

    Thanks

    Ben


  • Telerik Admin admin's avatar

    Posted on Jul 11, 2008 (permalink)

    Hello Ben Alexandra,

    Try searching for "sustainable" and you'll see that the page title is not missing for the first result item but for pages without titles entered in the meta fields. For example, the following pages are missing titles:

        ~/sitemap.aspx
        ~/about_us/index.aspx
        ~/products/index.aspx

    It would probably be easier if you sent us the code you're using to substitute the pages' titles at runtime.

    Regards,
    Nikola
    the Telerik team


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jul 11, 2008 (permalink)

    Hi,

    There is actually an error with some pages.  I debugged the crawl process and for some of the pages Page.Title came back null, whereas for other pages it came back with the correct page title.  All the pages are set up the same, except some are page groups, but those get ignored.  I can't figure out the common factor among the pages that don't index correctly.  Even if I comment out the code that changes the title, I get some pages with null titles.  I don't get it.  Check out these 2 links:

    Error with Search (Screen Shot) / Search for Cards on Sustainable Cards

    Any ideas what could be going on?

    Ben


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jul 11, 2008 (permalink)

    OOOOOOOOOOOOOOOOOOOH!!!!  Got it!  Pages where I don't explicitly specify a Title don't show a title in the search!

    I've never done that, unless it's different from the Menu title, as it's redundant and has never been necessary.  It seems that SF just uses the Menu Title if there's no Page title.

    Can I fix that in the Master.cs file so I don't have to go through every page and assign a title manually, so it gets indexed correctly?  Or could you at least send some SQL so I can update the blank titles automatically?  I guess I'd have to do that from time to time to make sure there were no gaps, and would have to retrain all my clients.  C# code that sets the page title if it's a CmsContext.IsCrawlerRequest would be easier ;)

    Thanks

    Ben


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jul 11, 2008 (permalink)

    Hi,

    One more thing.  Is there any way to index the contents of PDF documents and have those returned by search?

    Ben


  • Brook Brook's avatar

    Posted on Jul 11, 2008 (permalink)

    I am curious about the PDF search too.  I noticed that the ASPOSE binaries are in the bin folder.

    Brook


  • Telerik Admin admin's avatar

    Posted on Jul 14, 2008 (permalink)

    Hi Brook,

    A PDF indexer is not implemented yet, but we plan to add one.

    Ben,

    You can substitute the empty titles by putting the following code in your .master page:

    <script runat="server">  
        protected void Page_Load(object sender, EventArgs e)  
        {  
            if (Telerik.CmsContext.IsCrawlerRequest  
                && this.Page.Title.Equals("Untitled Page", StringComparison.OrdinalIgnoreCase))  
            {  
                this.Page.Title = "Title1";  
            }  
        }  
    </script> 

    The second condition is valid if you have "Untitled Page" set by default in your master page. This will prevent indexing empty titles, which I think could happen only when you have no title for the particular page and no title defined in the .master file.

    Don't hesitate to contact us if you have other questions.

    All the best,
    Nikola
    the Telerik team


  • Ben Alexandra Intermediate Ben Alexandra's avatar

    Posted on Jul 14, 2008 (permalink)

    Thanks, but that doesn't REALLY help.  It just returns Title1 for half the titles.  Is there no way to access the Menu Title, as that's what gets replaced in the page if there's no Page Title, right?  We generally don't use the Page Title field, unless we want it to be different from the Menu Title (which is rare).  If not, can you send some SQL to update the database to replace all the empty page titles with their menu titles?

    Thanks

    Ben


  • Telerik Admin admin's avatar

    Posted on Jul 17, 2008 (permalink)

    Hello Ben Alexandra,

    You can replace the empty page title by overriding the OnPreInit method of Telerik.Cms.Web.InternalPage.
    Create a new class item in your project's App_Code folder and add the following code:

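    using System;

    namespace Telerik.Cms.Web
    {
        // Falls back to the menu name when a CMS page has no explicit title.
        public class SmartTitleInternalPage : InternalPage
        {
            protected override void OnPreInit(EventArgs e)
            {
                base.OnPreInit(e);
                // Only substitute when no page title was entered in the CMS.
                if (string.IsNullOrEmpty(base.CmsPage.Title))
                {
                    this.Title = base.CmsPage.MenuName;
                }
            }
        }
    }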

    To use this class instead of the InternalPage class, modify the Inherits attribute of cmsentrypoint.aspx, located in the Sitefinity folder, as follows:

    <%@ Page Inherits="Telerik.Cms.Web.SmartTitleInternalPage" MasterPageFile="~/Sitefinity/Dummy.master" %> 

    Bear in mind that the InternalPage class is responsible for rendering all CMS pages properly, so any modification made when overriding it could lead to unexpected results and should be done carefully.

    Let me know if this works for you.

    Regards,
    Nikola
    the Telerik team


  • Randel avatar

    Posted on Oct 8, 2008 (permalink)

    OK, I think I understand the concept and have created a new search that searches my complete site.  However, I don't know all the values available for the following attributes: indexAttribute, filterTag, and filterAttributes, or what each attribute is specifically used for.

    I'm using the fieldsInfoProvider.xml file created for this search as my base of understanding and have pasted it below.

    For example, I see:
     - (3) empty strings and (2) "content" for indexAttribute
     - "title", "meta", "script", and "style" for filterTag
     - (3) empty strings, "name:keywords;", and "name:description;" for filterAttributes

    Again, what is "indexAttribute" used for?  Does "content" mean that the field will only search the content of a page, and is that the content of the page itself and not the template it's created from?  Also, if it does mean to only search the page content, what does the empty string mean?  Same thing, what are "filterTag" and "filterAttributes" used for / reference in a page?

    I still consider myself very new to web development and feel as if I'm missing some puzzle pieces here.  So I'll be more than happy to take any help/guidance.


    <?xml version="1.0" encoding="utf-8"?>
    <fields>
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" />
      <field name="keywords" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" />
      <field name="description" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" />
      <field name="script" weight="-1" indexAttribute="" filterTag="script" filterAttributes="" />
      <field name="style" weight="-1" indexAttribute="" filterTag="style" filterAttributes="" />
    </fields>


  • Telerik Admin admin's avatar

    Posted on Oct 10, 2008 (permalink)

    Hi Randel,

    Here is a list of the properties that are taken into account, along with their description:

    filterTag - the name of the tag from which the data to be indexed is extracted.

    filterAttributes - a collection of attribute name:value pairs that the tag must carry, to ensure the right tag has been captured.
    For example, filterAttributes="name:description;" results in one item in the list, containing the KeyValuePair Key="name", Value="description".

    weight - the weight assigned to terms found in this index field.

    indexAttribute - the attribute to be indexed within the tag. If it is empty (null), the data between the starting and ending tags will be indexed.
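
    As a rough sketch of how these properties combine, here is a small annotated file. The first two fields are taken from the default file posted above; the third uses a hypothetical id (site_footer) purely for illustration, so substitute whatever id your own template uses. The use of weight="-1" to skip content follows the script and style fields in the default file.

    ```xml
    <?xml version="1.0" encoding="utf-8"?>
    <fields>
      <!-- Empty indexAttribute: index the text between <title> and </title>. -->
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" />
      <!-- Match only <meta name="description" ...> and index the value of its content attribute. -->
      <field name="description" weight="1" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" />
      <!-- Hypothetical example: skip everything inside <div id="site_footer">. -->
      <field name="footer" weight="-1" indexAttribute="" filterTag="div" filterAttributes="id:site_footer;" />
    </fields>
    ```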

    Please let us know if you need further information on the matter.

    Greetings,
    Georgi
    the Telerik team

    Check out Telerik Trainer, the state of the art learning tool for Telerik products.


  • Armen avatar

    Posted on Dec 18, 2008 (permalink)

    Hi,

    Thank you for this thread. It helped me a lot. However, I would like to understand how to have the fieldsInfoProvider.xml file generated the way I need it. After deleting the index and creating a new one, all my changes to the XML file were lost. I would like my modified fieldsInfoProvider.xml to be kept or regenerated whenever the end user (administrator) deletes indexes and/or creates new ones, so that they don't need to contact me about an indexing problem every time.

    Can you please let me know how to generate the fieldsInfoProvider.xml my way while creating the index? I believe it would be very handy for all devs.

    Thanks, 
    Armen


  • Telerik Admin admin's avatar

    Posted on Dec 20, 2008 (permalink)

    Hi Armen,

    Unfortunately, there is no way to override the method that creates the fieldsInfoProvider.xml file. We will consider this for one of the next versions. What you could do for now is re-index the web site instead of creating a new index every time.

    Kind regards,
    Georgi
    the Telerik team


  • Zubair Zubair's avatar

    Posted on Apr 20, 2010 (permalink)

    Hi Georgi,

    This is an older thread, I know, but I'm also seeing search index too much.  Here's the situation.

    We have a few overlay popup pages that we don't want indexed.

    Here's how I'm doing it in the fieldsInfoProvider.xml file:

    <?xml version="1.0" encoding="utf-8"?>
    <fields>
      <field name="content" weight="4" indexAttribute="id:content;" filterTag="div" filterAttributes="" />
      <field name="overlay" weight="-1" indexAttribute="" filterTag="div" filterAttributes="class:scroll-pane;" />
      <field name="title" weight="1" indexAttribute="" filterTag="title" filterAttributes="" />
      <field name="keywords" weight="2" indexAttribute="content" filterTag="meta" filterAttributes="name:keywords;" />
      <field name="description" weight="3" indexAttribute="content" filterTag="meta" filterAttributes="name:description;" />
      <field name="script" weight="-1" indexAttribute="" filterTag="script" filterAttributes="" />
      <field name="style" weight="-1" indexAttribute="" filterTag="style" filterAttributes="" />
    </fields>

    Notice the field with name="overlay".  The page contains a <div class="scroll-pane">...content..</div>, and I want everything within that <div> not to be indexed.  How can I do that?

    Also, does specifying a field like this mean that everything within the matching tag(s) is excluded from the search index?  Correct me if I'm wrong.

    I'm hoping you'll get back to me as soon as you can.

    Thanks in anticipation.


  • Telerik Admin admin's avatar

    Posted on Apr 20, 2010 (permalink)

    Hello Zubair,

    If you have div like this one shown below

    <div class="scroll-pane">
        IVAN
    </div>

    and the following field settings

    <field name="overlay"  weight="-1"  indexAttribute=""  filterTag="div"  filterAttributes="class:scroll-pane;"  />

    the text "IVAN" should not be indexed. Note that after you have added the new field name to the xml file the application should be restarted and the index should be run again. Then the Lucene segments file will be updated and SearchResults will return the correct results.
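
    One more observation, offered as an assumption rather than something verified against the indexer: going by the property descriptions given earlier in this thread, the content field in your file seems to have indexAttribute and filterAttributes swapped. If the intent is to index the text inside the div with id "content", the field would presumably look like this:

    ```xml
    <!-- Sketch only: filterAttributes selects the div by id; an empty indexAttribute indexes the text between the tags. -->
    <field name="content" weight="4" indexAttribute="" filterTag="div" filterAttributes="id:content;" />
    ```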

    Best wishes,
    Ivan Dimitrov
    the Telerik team

    Do you want to have your say when we set our development plans? Do you want to know when a feature you care about is added or when a bug fixed? Explore the Telerik Public Issue Tracking system and vote to affect the priority of the items.


  • Zubair Zubair's avatar

    Posted on Apr 20, 2010 (permalink)

    Thanks for the quick reply.  Yes, this is how the markup is.  I wasn't restarting the app, just reindexing, and it didn't work.  I'll try restarting the app.


Copyright © 2002-2010 Telerik. All rights reserved.