Home / Developer Network / Forums / Sitefinity Older Versions (3.x): Set-up & Installation > Does Search index PDFs and DOC files?

Does Search index PDFs and DOC files?

  • David Willis

    Posted on Oct 26, 2007

    Can search index PDFs and DOC files?

  • Pepi (admin)

    Posted on Oct 29, 2007

    Hello David,

    For the time being, Search doesn't support indexing of .pdf and .doc files, but we plan to implement this functionality as we extend the Search module in future releases.

    Don't hesitate to ask if you come up with other questions.

    Greetings,
    Pepi
    the Telerik team

  • mexner

    Posted on Apr 3, 2008

    With SF 3.2, SP1, I noticed in the web.config a section for "Indexers". Does this provide the ability to search/index .doc and .pdf?

    <indexers>
        <add extensionOrMimeType=".aspx" type="Telerik.Search.Engine.HtmlIndexer"/>
        <add extensionOrMimeType="text/html" type="Telerik.Search.Engine.HtmlIndexer"/>
    </indexers>


    Is there a "type" we can currently provide to search .pdf?
    Thanks!

  • Georgi (admin)

    Posted on Apr 7, 2008

    Hi Chris,

    Our search engine still doesn't support indexing of DOC and PDF files. We plan to implement this functionality, and your assumption about the <indexers> section is correct: when the spiders that index DOC and PDF files are ready, they will be added in this section.

    Let us know if there is anything else you would like to know.

    Greetings,
    Georgi
    the Telerik team

  • Meister Intermediate

    Posted on Apr 30, 2008

    Hi

    Can you let me know when you are planning to support indexing of PDF files please?

  • Posted on Apr 30, 2008

    do you mean that it will actually index the CONTENTS of these documents? that's crazy awesome :) but what if they're not in the library, but rather just linked and stored physically in a location on the server. would they get indexed or is this only for library files?

    thanks!

  • Ivan (admin)

    Posted on May 1, 2008

    Hi SelArom,

    this is among the more important features we want to implement, but it will surely not be available in the upcoming Service Pack.

    The reason for the indexers section in web.config is that, like anything else in Sitefinity, search indexers can be developed and implemented by clients. For example, if you develop a module with complex data and want to control how it is searched, you will be able to implement your own search logic, and this section is where you would register it. Unfortunately, the exact process and the tutorials for this are not yet available, and it is not a trivial thing to do.
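
As a rough sketch of the registration described above, a custom indexer would presumably be added next to the built-in entries. The `extensionOrMimeType` and `type` attributes come from the existing section; the `MyCompany.Search.PdfIndexer` type and `MyCompany.Search` assembly are hypothetical placeholders, and no such indexer ships with Sitefinity 3.x.

```xml
<indexers>
    <!-- Built-in entries from the default web.config -->
    <add extensionOrMimeType=".aspx" type="Telerik.Search.Engine.HtmlIndexer"/>
    <add extensionOrMimeType="text/html" type="Telerik.Search.Engine.HtmlIndexer"/>
    <!-- Hypothetical custom indexer: the type and assembly names are placeholders -->
    <add extensionOrMimeType=".pdf" type="MyCompany.Search.PdfIndexer, MyCompany.Search"/>
</indexers>
```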

    To conclude: once implemented, the PDF and DOC search will search the contents of a file. The file will not need to be in the Images & Documents library. Finally, we don't have a fixed date for this release, but we are working on it.

    Sincerely yours,
    Ivan
    the Telerik team

  • Jeff Mah

    Posted on Dec 18, 2008

    What's the current status on indexing PDF and DOC files? Are you able to share a probable release date?

  • Georgi (admin)

    Posted on Dec 19, 2008

    Hello Jeff Mah,

    This is still not implemented. We are trying to include it in the list of tasks for 4.0, which should come in April or May.

    Best wishes,
    Georgi
    the Telerik team

  • Tony

    Posted on Jan 23, 2009

    But I want the feature now :P

    Suggestion: a sign-up for a digest email on a per-feature basis, so that once a week/month/whatever we get an update on the features we want to sell to clients, would be great! (Then I could be sure I was on top of the latest info to tell my clients.)

  • Georgi (admin)

    Posted on Jan 26, 2009

    Hi Tony,

    We are already working on a similar solution: you will be able to track all issues and feature requests directly on our website. I still cannot commit to a time frame for it, though.

    Best wishes,
    Georgi
    the Telerik team

  • martin

    Posted on Feb 24, 2009

    Hi,

    I appreciate that being able to index the content of uploaded PDFs and files would/will be great... I'm going to ask a question about searching for a file just by its name.

    I'm running a "trial" version of Sitefinity (prior to purchase) and have created a document library within the "images & documents" module. I have uploaded a PDF and a .rtf into that library. I have created a search index and a search page. I can successfully search for general text held within general pages (including news, events and lists) - but it seems to totally fail to search/find these test files I've built into the mock-up download page.

    Am I missing something very obvious? Does Sitefinity currently (out-of-the-box) search these files too, even if only by file name? Because if it doesn't... it's going to become pretty useless for our client's needs and an alternative CMS will have to be used.

    Any help greatly received,
    Martin.

  • Posted on Feb 24, 2009

    I'm not sure how it works out-of-the-box, but the search index is actually very extensible! you can use it to index any content you wish, all you have to do is add an index provider...

    take a look at this forum post: http://www.sitefinity.com/support/forums/support-forum-thread/b1043S-becdhd.aspx

    there's an example project that you can build on. In my blog, I used it to index content from my events module into my site search results. the details are here:
    http://www.selarom.net/blog/2009-01-23/Sitefinity_Index_and_Search_Events.aspx

    it would probably not be too difficult to add the filenames (and links to the files) to the index using this method...

    I hope this was helpful!

  • martin

    Posted on Feb 24, 2009

    Hi SelArom,

    Thank you for your reply. I'm sure your information will indeed be helpful. Unfortunately, as it currently stands, I'm more of a web "designer" than an ASP.NET "coder" and thus on a VERY steep learning curve. I shall read through your links and see what I can make from it.

    I'm pleased to see how flexible Sitefinity is, but a lot of that flexibility does seem to involve a lot of 'back-end' coding (as opposed to working within the Admin interface).

    On a similar note (still talking about the "images & documents" module), I've noticed a funny thing happening with my test download files. I can download them OK, but the downloaded file doesn't end in .pdf; it ends in .sflb.ashx and so is not recognized by the OS. If I then manually replace .sflb.ashx with .pdf, it opens just fine.

    This is both bizarre and annoying, and it will need to be sorted out before a "real" site is created.

    Regards,
    m.

  • Georgi (admin)

    Posted on Feb 25, 2009

    Hello,

    Josh, thank you for providing this information to Martin. As usual, we appreciate your help in the forums!

    Martin, we are already working on PDF and DOC indexing functionality, and it will be available in version 4.0. As for the Images and Documents module, if you are using the 3.6 Hotfix version, you should have no problems with the download extensions. Although the file names end with strange extensions, the files should be downloaded with the proper ones. In the hotfix version, it is even possible to use the real file extensions in the URLs. More information on this can be found in the KB article "How to use the real extensions for the items in the Images and Documents module".

    I hope this helps as well.

    Sincerely yours,
    Georgi
    the Telerik team

  • Venkat

    Posted on Aug 3, 2009

    The link provided in this post is not working.

    Can anyone please check this?

    Thanks
    Venkat.

  • Georgi (admin)

    Posted on Aug 3, 2009

    Hi Venkat,

    Can you please let us know which link you are referring to, since there are several links on the page? I did not find a broken link.

    Kind regards,
    Georgi
    the Telerik team

  • Venkat

    Posted on Aug 3, 2009

    Hi George,

    Thank you for your reply.

    I tried it this morning and the link below did not work at that time, but it is working now.

    http://www.selarom.net/blog/2009-01-23/Sitefinity_Index_and_Search_Events.aspx

    regards
    -Venkat.

  • Georgi (admin)

    Posted on Aug 3, 2009

    Hi Venkat,

    Thank you for the follow up. I am glad that the resource is accessible now.

    Regards,
    Georgi
    the Telerik team

  • Shanti Boyanapalli

    Posted on Oct 26, 2009

    Hi
    I would like to know when the functionality for searching through images and documents will be released.

  • Nikolai (admin)

    Posted on Oct 27, 2009

    Hello Shanti Boyanapalli,

    We will try our best to add this functionality in the official Sitefinity 4.0 release.

    Greetings,
    Nikolai
    the Telerik team

  • Jake

    Posted on Feb 28, 2011

    Can anyone verify that this functionality was included in the 4.0 build? The post on Feb 25 made it sound like it was to be added, but I can't seem to find any confirmation in the 4.0 features documentation. 

    Thanks in advance! 
    Jake

  • Ivan Dimitrov (admin)

    Posted on Mar 1, 2011

    Hello Shanti,

    In Sitefinity 4.0 the content of PDF and DOC files is not added to the index. We will not be able to implement this before Q2.

    All the best,
    Ivan Dimitrov
    the Telerik team

  • Thomas Brooke

    Posted on Aug 23, 2011

    Hi,

    Any update on this? Or any alternative suggestions (e.g. Marketplace)?

    This is a real must for my client and would be a serious blow to the project if we can't do it.

    Using SF 4.2.

    Thanks,
    Thom

  • Stanislav Velikov (admin)

    Posted on Aug 26, 2011

    Hi Thomas,

    Sitefinity still doesn't support searching document contents. Documents are searched only by the name of the content item.
    Excuse us for the inconvenience.

    Kind regards,
    Stanislav Velikov
    the Telerik team

  • Ryan

    Posted on Aug 26, 2011

    Lack of PDF content indexing is a show stopper for sure.

    Can Sitefinity Search read other indexes?

    What other alternatives are there?

  • Stanislav Velikov (admin)

    Posted on Aug 31, 2011

    Hello Ryan,

    For customized search you have to implement a custom pipe. You can see how this is done in the TxtDocumentSearchInboundPipe in the sample project from the "Publishing system brief walkthrough" post. It creates a pipe that pushes items into the search index when they are published. The only requirement is that the items be published through the fluent API. You have to implement PushData and ToPublishingPoint; those methods actually put the item into the publishing point. Note that you also have to add some settings/mappings, as explained in "Registering custom pipes in Sitefinity".

    Regards,
    Stanislav Velikov
    the Telerik team

  • shae

    Posted on Feb 8, 2012

    Has this functionality been added yet?

  • Stanislav Velikov (admin)

    Posted on Feb 8, 2012

    Hello,

     It is not available yet.

    All the best,
    Stanislav Velikov
    the Telerik team

  • shae

    Posted on Feb 8, 2012

    It's been over 4 years since the first mention of this. Is it on the 5-year plan?

  • James Greaves

    Posted on May 10, 2012

    I am really just bumping this thread.

    As Shae said...We are approaching 5 years now.  Is there any movement on this issue? 

  • Stanislav Velikov (admin)

    Posted on May 15, 2012

    Hi,

    The feature is currently on the roadmap for Sitefinity 5.1. Until then you can modify the TxtDocumentSearchInboundPipe pipe and use it to search through different types of content. I've attached a sample, which is modified to search through .pdf files. For this purpose we're using a third-party library called iTextSharp.text. The sample pipe also searches in one library only (take a look at the PushData() method), so that it suits your second requirement. What is different from the sample in my colleague's blog post is that in the CanProcessItem() method we check whether the item we process has a .pdf extension, instead of a .txt extension.
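    // Inside CanProcessItem(): accept only documents with a .pdf extension.
    if (documentType.IsAssignableFrom(item.GetType()))
    {
        var docItem = (Telerik.Sitefinity.Libraries.Model.Document)item;
        // The extension may be stored with or without the leading dot.
        return docItem.Extension == "pdf" || docItem.Extension == ".pdf";
    }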
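
The string comparison above is case-sensitive, so an extension stored as `.PDF` would be skipped. A minimal sketch of a more tolerant check, assuming nothing beyond the .NET base class library; the `ExtensionCheck` class and `IsPdf` method are hypothetical names, not part of the attached sample:

```csharp
using System;

public static class ExtensionCheck
{
    // Returns true for "pdf", ".pdf", "PDF", ".Pdf" and similar spellings.
    public static bool IsPdf(string extension)
    {
        if (string.IsNullOrEmpty(extension))
        {
            return false;
        }

        // Drop an optional leading dot, then compare without regard to case.
        return string.Equals(
            extension.TrimStart('.'),
            "pdf",
            StringComparison.OrdinalIgnoreCase);
    }
}
```

With this helper, the check inside CanProcessItem() could read `return ExtensionCheck.IsPdf(docItem.Extension);`.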
    Then in the GetFileLink() method we get the url:
    var manager = LibrariesManager.GetManager();
    var docUrl = String.Concat("~", manager.Provider.GetItemUrl(doc), doc.Extension);
    docUrl = Telerik.Sitefinity.Web.RouteHelper.ResolveUrl(docUrl, UrlResolveOptions.Absolute);
    return docUrl;
    The last thing we do is pass its value to the openPDF() method, which retrieves the .pdf content using a sample method from the page of the iTextSharp.text library:
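    // Requires: using System.Text; using iTextSharp.text.pdf;
    private string openPDF(string fileUrl)
    {
        var text = new StringBuilder();
        PdfReader reader = new PdfReader(fileUrl);
        try
        {
            // iTextSharp page numbers are 1-based.
            for (int i = 1; i <= reader.NumberOfPages; i++)
            {
                byte[] pageBytes = reader.GetPageContent(i);
                // ExtractTextFromPDFBytes is the helper taken from the iTextSharp sample.
                text.Append(ExtractTextFromPDFBytes(pageBytes));
            }
        }
        finally
        {
            // Release the underlying file or network resources.
            reader.Close();
        }
        return text.ToString();
    }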
    I have attached the modifications made to the project from this blog post.
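
As an alternative to parsing raw page bytes by hand, iTextSharp 5.x also includes a `PdfTextExtractor` class in the `iTextSharp.text.pdf.parser` namespace. The following is only a sketch under that assumption (an iTextSharp 5.x reference in the project); the `PdfTextHelper` class and `ExtractText` method are hypothetical names and are not part of the attached sample.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextHelper
{
    // Returns the text of every page in the PDF at the given path or URL.
    public static string ExtractText(string fileUrl)
    {
        var text = new StringBuilder();
        PdfReader reader = new PdfReader(fileUrl);
        try
        {
            // iTextSharp page numbers start at 1.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            // Always release the reader, even if extraction fails.
            reader.Close();
        }
        return text.ToString();
    }
}
```

In the pipe, a call such as `PdfTextHelper.ExtractText(docUrl)` could stand in for the openPDF() call, with the returned string pushed into the index in the same way.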

    Regards,
    Stanislav Velikov
    the Telerik team