PDF Searching error

7 posts, 0 answered
  1. KMac
    133 posts
    Registered:
    15 Dec 2008
    10 Nov 2009
    Hello,

    Because I can't wait for SF4.0 any more, I've gone ahead and created a customIndex that indexes PDF documents. Using some code supplied by Ivan (which is way more elegant than what I would have put together) and PDFBox (open-source PDF utility classes from pdfbox.org), I've managed to get it working smoothly as long as I'm only using this customIndex. Feel free to use it to index your own PDF documents until SF4.0.

    The problem I have is when I try to add a newsIndex together with my customIndex. The index runs for a while and then I get the following error:

    Could not find file 'C:\inetpub\wwwroot\HughesAmysIntranet\App_Data\Search\Case_Blawg_PDFs\Index\_5z.cfs'.

    Description: An unhandled exception occurred during the execution of the current web request. Please review the stack trace for more information about the error and where it originated in the code.

    Exception Details: System.IO.FileNotFoundException: Could not find file 'C:\inetpub\wwwroot\Hugh\App_Data\Search\PDFIndex\Index\_5z.cfs'.

    Source Error:

    An unhandled exception was generated during the execution of the current web request. Information regarding the origin and location of the exception can be identified using the exception stack trace below.

    Stack Trace:

    [FileNotFoundException: Could not find file 'C:\inetpub\wwwroot\Hugh\App_Data\Search\PDFIndex\Index\_5z.cfs'.]
       System.IO.__Error.WinIOError(Int32 errorCode, String maybeFullPath) +305
       System.IO.FileStream.Init(String path, FileMode mode, FileAccess access, Int32 rights, Boolean useRights, FileShare share, Int32 bufferSize, FileOptions options, SECURITY_ATTRIBUTES secAttrs, String msgPath, Boolean bFromProxy) +1162
       System.IO.FileStream..ctor(String path, FileMode mode, FileAccess access, FileShare share) +66
       Telerik.Lucene.Net.Store.Descriptor..ctor(FSIndexInput enclosingInstance, FileInfo file, FileAccess mode) +73
       Telerik.Lucene.Net.Store.FSIndexInput..ctor(FileInfo path) +64
       Telerik.Lucene.Net.Store.FSDirectory.OpenInput(String name) +102
       Telerik.Lucene.Net.Index.CompoundFileReader..ctor(Directory dir, String name) +183
       Telerik.Lucene.Net.Index.SegmentReader.Initialize(SegmentInfo si) +233
       Telerik.Lucene.Net.Index.SegmentReader.Get(Directory dir, SegmentInfo si, SegmentInfos sis, Boolean closeDir, Boolean ownDir) +227
       Telerik.Lucene.Net.Index.SegmentReader.Get(SegmentInfo si) +48
       Telerik.Lucene.Net.Index.IndexWriter.MergeSegments(SegmentInfos sourceSegments, Int32 minSegment, Int32 end) +700
       Telerik.Lucene.Net.Index.IndexWriter.Optimize() +211
       Telerik.Search.Engine.BaseIndexer.Close() +43
       Telerik.Search.Engine.Crawler.Index(String provider, String[] urls, LinkedList`1 data, Boolean appendToIndex) +340
       Telerik.Search.Engine.IndexingManager.StartIndexing(IIndexingService service, Boolean appendToIndex) +110
       Telerik.Search.Engine.IndexingService.Index(Boolean appendToIndex) +38
       Telerik.Search.WebControls.Admin.ControlPanel.Service_Command(Object sender, CommandEventArgs e) +451
       System.Web.UI.WebControls.LinkButton.OnCommand(CommandEventArgs e) +108
       System.Web.UI.WebControls.LinkButton.RaisePostBackEvent(String eventArgument) +135
       System.Web.UI.WebControls.LinkButton.System.Web.UI.IPostBackEventHandler.RaisePostBackEvent(String eventArgument) +10
       System.Web.UI.Page.RaisePostBackEvent(IPostBackEventHandler sourceControl, String eventArgument) +13
       System.Web.UI.Page.RaisePostBackEvent(NameValueCollection postData) +175
       System.Web.UI.Page.ProcessRequestMain(Boolean includeStagesBeforeAsyncPoint, Boolean includeStagesAfterAsyncPoint) +1565
    

    When I check the folder it refers to, it ain't lying. That file isn't there. I can't do anything after that until I remove the files from the App_Data\Search\Hugh folder.

    Like I said, most of the code I'm using is from Ivan and has been floating around on these boards before. I just tweaked it to extract text from PDF documents. The main code is below:

    FileIndexProvider: Basically just grabs all PDFs in a specified library called "Accounting" and creates FileIndexerInfo items
    using System;  
    using System.Collections.Generic;  
    using System.Linq;  
    using System.Web;  
    using Telerik.Framework.Search;  
    using Telerik.Cms.Engine;  
    using System.Collections;  
    using Telerik.Libraries;  
     
    namespace Telerik.Cms.Search  
    {  
        /// <summary>    
        /// Summary description for FileIndexProvider    
        /// </summary>    
        public class FileIndexProvider : IIndexingServiceClient  
        {  
            public FileIndexProvider()  
            {  
            }  
     
            #region IIndexingServiceClient Members  
     
     
            // return the name here    
            public string Name  
            {  
                get { return "FileIndex"; }  
            }  
            // add some description    
            public string Description  
            {  
                get { return "Indexes text files from Libraries module"; }  
            }  
     
            private string _LibraryToIndex = "Accounting";  
            // add some description    
            public string LibraryToIndex  
            {  
                get { return _LibraryToIndex; }  
                set { _LibraryToIndex = value; }  
            }  
            // call GetUrlsToIndex method    
            public string[] GetUrlsToIndex()  
            {  
                return new string[0];  
            }  
            // GetContent - here we are creating a new instance of ContentManager    
            // with specified provider and get the content    
            public IIndexerInfo[] GetContentToIndex()  
            {  
                LibraryManager manager = new LibraryManager("Libraries");  
                  
                List<IIndexerInfo> list = new List<IIndexerInfo>();  
     
     
                ILibrary library = manager.GetLibrary(LibraryToIndex);  
                IList files = library.GetItems();  
     
                foreach (IContent file in files)  
                {  
                    // here we are checking whether the MIME type is plain text or PDF
                    if (file.MimeType == "text/plain" || file.MimeType == "application/pdf")  
                    {  
                        string urlWithExtension = file.Url + manager.Provider.ContentExtension;  
                        IIndexerInfo fileIndexInfo = new FileIndexerInfo(urlWithExtension, file.ID, null, file.MimeType);  
                        list.Add(fileIndexInfo);  
                    }  
                }  
     
     
                return list.ToArray();  
            }  
     
            public void Initialize(IDictionary<string, string> settings)  
            {  
                // initialize settings    
            }  
     
            public event EventHandler<IndexEventArgs> Index;  
     
            #endregion  
        }  
    }  
     



    FileIndexerInfo: uses WebClient to download each PDF to a temporary file (PDFBox doesn't seem to work with URLs), then calls parsePDF to extract the text from each PDF to add to the index
    using System;  
    using System.Collections.Generic;  
    using System.Linq;  
    using System.Web;  
    using Telerik.Cms.Search;  
    using System.Globalization;  
    using Telerik.Framework.Search;  
    using System.Text;  
    using System.Net;  
    using Telerik.Web;  
    using org.pdfbox.pdmodel;  
    using org.pdfbox.util;  
    using System.IO;  
     
     
    namespace Telerik.Cms.Search  
    {  
        /// <summary>    
        /// Summary description for FileIndexInfo    
        /// </summary>    
        public class FileIndexerInfo : IIndexerInfo  
        {  
     
             
              
            public FileIndexerInfo(string url, Guid itemId, CultureInfo culture, string mimeType)  
            {  
                this.url = url;  
                this.itemId = itemId;  
                this.culture = (culture != null ? culture.Name : string.Empty);  
                this.mimeType = mimeType;  
            }  
     
            #region Private fields  
     
            private string url;  
            private Guid itemId;  
            private string culture;  
            private string mimeType;  
            private static readonly object syncObject = new object();  
     
            #endregion  
     
            #region IIndexerInfo Members  
     
            public string Path  
            {  
                get { return this.url; }  
            }  
     
            public string MimeType  
            {  
                get { return mimeType; }  
            }  
     
            public byte[] GetData()  
            {  
                if (HttpContext.Current == null)  
                    throw new InvalidOperationException();  
     
                lock (syncObject)
                {
                    // WebClient is IDisposable, so release it when we are done
                    using (WebClient client = new WebClient())
                    {
                        string absoluteUrl = UrlPath.ResolveAbsoluteUrl(this.url);

                        if (mimeType == "application/pdf")
                        {
                            // PDFBox needs a local file, so download to a temporary path first
                            string outFile = "c:/inetpub/wwwroot/hugh/tempfile.pdf";
                            client.DownloadFile(absoluteUrl, outFile);

                            string extractedText = parsePDF(outFile);
                            File.Delete(outFile);
                            return this.Encoding.GetBytes(extractedText);
                        }

                        // plain text: index the raw bytes (this branch previously returned null)
                        return client.DownloadData(absoluteUrl);
                    }
                }
            }  
            public string parsePDF(string pdf_in)  
            {  
                PDDocument doc = null;  
                try  
                {  
                      
                    doc = PDDocument.load(pdf_in);  
                    PDFTextStripper stripper = new PDFTextStripper();  
                    return stripper.getText(doc);  
                }  
                catch (Exception ex)  
                {  
                    HttpContext.Current.Response.Write(ex.Message.ToString());  
     
                    //Do Error checking here;  
                      
     
                }  
                finally  
                {  
                    if (doc != null)  
                    {  
                        doc.close();  
                    }  
                }  
                   return string.Empty;  
            }  
            public Encoding Encoding  
            {  
                get { return Encoding.UTF8; }  
            }  
     
            public string Culture  
            {  
                get { return this.culture; }  
            }  
     
            public Guid ItemID  
            {  
                get { return this.itemId; }  
            }  
     
            #endregion  
     
            #region IIndexerInfo Members  
     
     
            public string ResolveIndexPath()  
            {  
                return Path;  
            }  
     
            #endregion  
        }  
     
     
    }  
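
    The listing above writes every download to one hard-coded file name, so it depends on the lock and on that exact folder being writable. A minimal sketch of one alternative, assuming the application pool identity can write to the system temp folder: let Path.GetTempFileName() pick a unique name and delete the file in a finally block. TempFileHelper and ExtractViaTempFile are hypothetical names used only for illustration; they are not part of Sitefinity or PDFBox.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration; not part of Sitefinity or PDFBox.
public static class TempFileHelper
{
    // Creates a uniquely named temp file, lets the caller fill it and read it,
    // and always deletes it afterwards.
    public static string ExtractViaTempFile(Action<string> download, Func<string, string> extract)
    {
        // GetTempFileName creates a zero-byte file with a unique name and returns its path.
        string tempPath = Path.GetTempFileName();
        try
        {
            download(tempPath);       // e.g. save the PDF to tempPath
            return extract(tempPath); // e.g. pull the text out of the saved PDF
        }
        finally
        {
            // Runs even if download or extract throws.
            File.Delete(tempPath);
        }
    }
}
```

    Inside GetData this could be called as TempFileHelper.ExtractViaTempFile(p => client.DownloadFile(absoluteUrl, p), parsePDF), where absoluteUrl stands for the resolved URL of the item. Whether this has any bearing on the missing .cfs file is not known.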
     

    The current site I'm trying to index has about 450 PDF documents and 450 corresponding news items (hence why I'm trying to incorporate the two into one searchable index). Have I reached some sort of maximum number of files? Is it a bug with Lucene? I've seen others having the same problem of missing .cfs or similar files during indexing, but none of the solutions seemed to work. I tried deleting the index and recreating it, removing all items from the database tables, reordering the indexes, and running on a different server, all to no avail. I don't know what else to check.

    Any help would be much appreciated.

  2. Ivan Dimitrov
    16072 posts
    Registered:
    12 Sep 2017
    17 Nov 2009
    Hello KMac,

    We spent several hours debugging the problem, but we could not come up with a solution. The problem comes from somewhere in the Lucene engine when the segments are merged. As you noticed, a single index works correctly. We will need more time to figure out what is going wrong. I suggest that you run the indexes separately until there is a solution. I am sorry for the inconvenience that this may cause you.

    (The same reply was sent to the other request you opened.)

    Regards,
    Ivan Dimitrov
    the Telerik team

    Instantly find answers to your questions on the new Telerik Support Portal.
    Watch a video on how to optimize your support resource searches and check out more tips on the blogs.
  3. KMac
    133 posts
    Registered:
    15 Dec 2008
    17 Nov 2009

    Thanks Ivan,

    Is there a way to return results from both indexes at the same time? For this particular project the two indexes are unfortunately intertwined, in that each news item has a corresponding PDF and I need to be able to show the results of both from one search query.

    Also, is this something that is fixed with 4.0?

     

    Thanks again for the great support!

  4. Ivan Dimitrov
    16072 posts
    Registered:
    12 Sep 2017
    18 Nov 2009
    Hi KMac,

    We have plans to implement a document search provider for 4.0, but we will need enough time for debugging and testing, because it is a whole new implementation. Currently the only way is to return the result items from both indexes at the same time, using two Repeater controls and overriding CreateChildControls() of the SearchResults control.

    Below is some sample code.

    public class SearchResultsCustom : SearchResult
    {
        public SearchResultsCustom()
        {
        }
     
        protected override void CreateChildControls()
        {
     
            this.Controls.Clear();
     
            this.layoutCnt = new Container(this);
            this.LayoutTemplate.InstantiateIn(layoutCnt);
     
            if (!string.IsNullOrEmpty(Query))
            {
                string vPath = PathUtil.GetIndexPhysicalPath(this.IndexCatalogue);
                 string nPath = PathUtil.GetIndexPhysicalPath(NewCatalogName);
                if (Directory.Exists(vPath) && Directory.Exists(nPath) )
                {
                    // check if query string is supplied to select the current page
                    if (!string.IsNullOrEmpty(HttpContext.Current.Request.QueryString[PageKey]))
                    {
                        this.CurrentPage = Convert.ToInt32(
                            HttpContext.Current.Request.QueryString[PageKey]);
                    }
     
                    string searchQuery = Query;
                    string message = string.Empty;
     
                    bool isValid = this.EscapeSpecialChars ? this.EscapeSpecialChars : ValidateQuery(ref searchQuery, out message);
     
                    if (isValid)
                    {
                        int totalItems;
                        int totalItems2;
                        int startIndex = (this.CurrentPage - 1) * this.PostsPerPage;
     
                        try
                        {
                            IList<ResultItem> dataSource = SearchManager.Search(
                                searchQuery,
                                this.IndexCatalogue,
                                startIndex,
                                this.PostsPerPage,
                                this.WordsMode,
                                this.EscapeSpecialChars,
                                out totalItems);
     
     
                            IList<ResultItem> newDS = SearchManager.Search(
                                   searchQuery,
                                   NewCatalogName,
                                   startIndex,
                                   this.PostsPerPage,
                                   this.WordsMode,
                                   this.EscapeSpecialChars,
                                   out totalItems2);
     
     
     
                            IList<ResultItem> newList = new List<ResultItem>();
                            foreach(ResultItem item in dataSource)
                            {
                                newList.Add(item);
                            }
                            foreach(ResultItem newDSitem in newDS)
                            {
                                newList.Add(newDSitem);
                            }
     
     
                            totalItems = totalItems + totalItems2;
     
                            int numberOfPages = (this.PostsPerPage == 0) ? 1 : (int)Math.Ceiling((double)totalItems / (double)this.PostsPerPage);
     
                            if (numberOfPages == 0 && totalItems > 0)
                                numberOfPages = 1;
     
                            string queryText = this.Query.Trim('\"');
                            ((Control)this.layoutCnt.ResultsStats).EnableViewState = false;
                            this.layoutCnt.ResultsStats.Text = string.Format(this.layoutCnt.ResultsStats.Text, totalItems, queryText);
     
                            if (this.AllowPaging)
                            {
                                this.layoutCnt.Pager1.SelectedPageChanged += new EventHandler<EventArgs>(Pager_SelectedPageChanged);
                                this.layoutCnt.Pager1.PageCount = numberOfPages;
                                this.layoutCnt.Pager1.SelectedPage = this.CurrentPage;
                            }
                            this.layoutCnt.ResultsList.DataSource = newList;
     
                        }
                        catch (Telerik.Lucene.Net.QueryParsers.ParseException ex)
                        {
                            this.layoutCnt.ResultsStats.Text = "strange chars";
                            Log.Exception(ex);
                        }
                    }
                    else
                    {
                        this.layoutCnt.ResultsStats.Text = message;
                    }
                }
                else
                {
                    this.layoutCnt.ResultsStats.Text = String.Format("InvalidIndex", this.IndexCatalogue);
                }
            }
            else
            {
                this.layoutCnt.ResultsStats.Text = String.Empty;
            }
     
            this.layoutCnt.ResultsList.ItemDataBound += new RepeaterItemEventHandler(ResultsList_ItemDataBound);
            this.layoutCnt.ResultsList.SkinID = this.SkinID;
     
            Controls.Add(this.layoutCnt);
            this.layoutCnt.ResultsList.DataBind();
            base.CreateChildControls();
        }
     
     
        public string NewCatalogName
        {
            get
            {
                return this.newCat;
            }
            set
            {
                this.newCat = value;
            }
        }
     
        private string newCat;
    }
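
    One thing to watch in the sample above: both Search calls use the same startIndex and PostsPerPage, so a page can hold up to twice PostsPerPage items while the page count assumes PostsPerPage per page. A minimal sketch of one way to keep the two consistent, assuming each index is asked for its results from position 0 up to startIndex + PostsPerPage and the two lists are then sliced as one combined list. ResultPaging and PageOfCombined are hypothetical names, not part of the Sitefinity API, and the sketch simply appends the second list to the first without interleaving by relevance.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the Sitefinity API.
public static class ResultPaging
{
    // Returns one page of the concatenation "first followed by second".
    // startIndex is the zero-based position of the page's first item in the combined list.
    public static IList<T> PageOfCombined<T>(IList<T> first, IList<T> second, int startIndex, int pageSize)
    {
        if (startIndex < 0)
        {
            startIndex = 0;
        }

        List<T> page = new List<T>();
        int total = first.Count + second.Count;

        // Walk the combined index range and stop once the page is full.
        for (int i = startIndex; i < total && page.Count < pageSize; i++)
        {
            page.Add(i < first.Count ? first[i] : second[i - first.Count]);
        }

        return page;
    }
}
```

    With this approach the repeater would be bound to ResultPaging.PageOfCombined(dataSource, newDS, startIndex, this.PostsPerPage) instead of the full merged list, and the page count computed from totalItems + totalItems2 would then match what is shown.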


    I suggest that you should wait until we find the problem.

    Kind regards,
    Ivan Dimitrov
    the Telerik team

  5. KMac
    133 posts
    Registered:
    15 Dec 2008
    18 Nov 2009
    Hey Ivan,

    I'd love to wait for this to be fixed, but if it's not going to be fixed until 4.0, I can't. I've been promising this client PDF searching since February, when I first heard 4.0 was coming out, and they're out of patience, unfortunately. I'll gladly implement the solution you provided, only it's a bit above my head. Is this a user control that I have to create and add 2 repeaters to, or is it something else entirely? Do the repeaters require specific IDs? When I try creating a user control, the following error is detected:

     ASPNET: Make sure that the class defined in this code file matches the 'inherits' attribute, and that it extends the correct base class (e.g. Page or UserControl).

    Any help would be greatly appreciated.

     

  6. Ivan Dimitrov
    16072 posts
    Registered:
    12 Sep 2017
    19 Nov 2009
    Hi KMac,

    You need to create a custom control located under the App_Code folder (or in a class library if you intend to have compiled code). Then you need to use this custom control instead of the default SearchResults control. Generally you do not need two repeaters, because the code I sent you merges everything into one list that is passed to the default repeater in the SearchResults.ascx template located under the Sitefinity/ControlTemplates/Search folder.

    Kind regards,
    Ivan Dimitrov
    the Telerik team

  7. KMac
    133 posts
    Registered:
    15 Dec 2008
    23 Nov 2009
    Hi Ivan,

    I just wanted to say what is reiterated on these forums time and time again. The support that the Telerik team provides for their products is second to none and I think I speak for most here when I say I really appreciate both the quality of products you produce and the unparalleled help and guidance you provide. Your assistance with this particular issue was much appreciated.