An implementation of a Copernic Desktop Search Custom Extractor in C#

As I mentioned in a previous post, Copernic Desktop Search 1.5 beta is currently available to the public. For me, one of the most important new features is the introduction of an extensibility API.

When I wrote that first announcement, I hadn’t had a close look at the API. By now I’ve found out that it’s about extracting data from new file types, nothing more or less than that. I wish they would extend that extensibility support to other areas (maybe I should add that to my list of things I’d like to see in CDS), like creating plugins for file type preview, or even introducing completely new ranges of searchable objects. Well, maybe that’s in the future.

For now, I’ve taken the plunge and tried to implement a custom extractor for CDS. And while I was at it, I wanted to do it in C#. I succeeded, and these are the results. Maybe somebody will find this useful to implement a custom extractor that really does something worthwhile 🙂

The interface

The first task was to create a C# interface from the IDL description that’s given in the API documentation. My experience with COM and ActiveX stems from work I did in C++ and Delphi some years back. I’ve done close to nothing with .NET interop, so I’m no expert at this. Nevertheless, I managed to create the following definition:

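// Namespaces used by the snippets in this post.
using System;
using System.IO;
using System.Runtime.InteropServices;
using System.Runtime.InteropServices.ComTypes;
using Microsoft.Win32;

[ComVisible(true), ComImport, Guid("7E337435-5E47-40A0-B8A9-315BDD1BAE0D"),
  InterfaceType(ComInterfaceType.InterfaceIsIUnknown)]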
public interface ICopernicDesktopSearchFileExtractor {
  [DispId(1)] 
  void LoadURI([MarshalAs(UnmanagedType.BStr)] string uri);
  [DispId(5)] 
  void GetContentStream([MarshalAs(UnmanagedType.IUnknown)] out object contentStream);
  [DispId(9)] 
  void get_IsContentUnicode([MarshalAs(UnmanagedType.VariantBool)] out bool value);
}

The implementation

The next job was to create an implementation of the interface in a COM server. BTW, the assembly you use for this must have the Register for COM Interop flag set. This is my implementation:

[ClassInterface(ClassInterfaceType.None), Guid("1bb4b2a5-d516-4a00-868b-cfc49a84881a")]
public class CustomExtractor: ICopernicDesktopSearchFileExtractor {
  void ICopernicDesktopSearchFileExtractor.LoadURI(string uri) {
    currentURI = uri;
  }

  private string currentURI;

  void ICopernicDesktopSearchFileExtractor.GetContentStream(out object contentStream) {
    contentStream = null;
    if (currentURI != null && File.Exists(currentURI)) 
      contentStream = new IStreamWrapper(File.OpenRead(currentURI));
  }

  void ICopernicDesktopSearchFileExtractor.get_IsContentUnicode(out bool value) {
    value = false; // return true if the file contains Unicode content
  }

  const string regPath = @"SOFTWARE\Copernic\DesktopSearch\CustomExtractors";

  [ComRegisterFunction]
  public static void Register(Type t) { 
    // Map the file extension to this class's CLSID so CDS can find the extractor.
    using (RegistryKey rkey = Registry.LocalMachine.CreateSubKey(regPath)) {
      rkey.SetValue(".testfile", "{1bb4b2a5-d516-4a00-868b-cfc49a84881a}", 
        RegistryValueKind.String);
    }
  }

  [ComUnregisterFunction]
  public static void Unregister(Type t) {
    // Open the key writable rather than creating it; it may not exist any more.
    using (RegistryKey rkey = Registry.LocalMachine.OpenSubKey(regPath, true)) {
      if (rkey != null)
        rkey.DeleteValue(".testfile", false);
    }
  }
}

This implementation registers itself for a “.testfile” filetype. In reality, I simply created a few text files and renamed them to that extension.

The stream wrapper

As you can see, there’s a class in use called IStreamWrapper. This is another class I had to write because CDS wants to access the extracted data from the file using the COM interface IStream. The .NET System.IO.Stream doesn’t implement that interface, so I needed a wrapper class for a .NET stream that I could expose via COM. Luckily, I found that the implementation was rather straightforward and that CDS only calls two methods from the IStream interface during a normal run, so I didn’t need to implement many of the methods. Here’s what I came up with:

public class IStreamWrapper : IStream {
  public IStreamWrapper(Stream stream) {
    if (stream == null)
      throw new ArgumentNullException("stream", "Can't wrap null stream.");
    this.stream = stream;
  }

  Stream stream;

  public void Clone(out System.Runtime.InteropServices.ComTypes.IStream ppstm) { 
    ppstm = null;
  }

  public void Commit(int grfCommitFlags) { }

  public void CopyTo(System.Runtime.InteropServices.ComTypes.IStream pstm, 
    long cb, System.IntPtr pcbRead, System.IntPtr pcbWritten) { }

  public void LockRegion(long libOffset, long cb, int dwLockType) { }

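  public void Read(byte[] pv, int cb, System.IntPtr pcbRead) {
    int bytesRead = stream.Read(pv, 0, cb);
    // pcbRead points to a 32-bit ULONG and may be null if the caller ignores the count.
    if (pcbRead != System.IntPtr.Zero)
      Marshal.WriteInt32(pcbRead, bytesRead);
  }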

  public void Revert( ) { }

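  public void Seek(long dlibMove, int dwOrigin, System.IntPtr plibNewPosition) {
    // STREAM_SEEK_SET/CUR/END (0/1/2) match the SeekOrigin values, so the cast works.
    long newPosition = stream.Seek(dlibMove, (SeekOrigin) dwOrigin);
    // plibNewPosition may be null if the caller doesn't need the new position.
    if (plibNewPosition != System.IntPtr.Zero)
      Marshal.WriteInt64(plibNewPosition, newPosition);
  }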

  public void SetSize(long libNewSize) { }

  public void Stat(out System.Runtime.InteropServices.ComTypes.STATSTG pstatstg, int grfStatFlag) {
    pstatstg = new System.Runtime.InteropServices.ComTypes.STATSTG( );
  }

  public void UnlockRegion(long libOffset, long cb, int dwLockType) { }

  public void Write(byte[] pv, int cb, System.IntPtr pcbWritten) { }
}

The only methods that CDS really uses are Seek and Read.
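
One caveat about Read: a .NET Stream.Read call may return fewer bytes than requested even before the end of the stream, while COM callers commonly treat a short read from IStream::Read as the end of the data. For a plain file stream this rarely matters, but for other stream types it can. Here is a minimal sketch of a helper that keeps reading until the requested count is reached or the stream really ends. The names StreamReadHelper and ReadFully are hypothetical and not part of the original wrapper or of the CDS API.

```csharp
using System;
using System.IO;

// Hypothetical helper for illustration only; not part of the original wrapper.
public static class StreamReadHelper {
  // Calls Stream.Read repeatedly until cb bytes have been read or the
  // stream reports end of data. Returns the number of bytes actually read.
  public static int ReadFully(Stream stream, byte[] buffer, int cb) {
    if (stream == null) throw new ArgumentNullException("stream");
    if (buffer == null) throw new ArgumentNullException("buffer");
    if (cb < 0 || cb > buffer.Length) throw new ArgumentOutOfRangeException("cb");

    int total = 0;
    while (total < cb) {
      int n = stream.Read(buffer, total, cb - total);
      if (n == 0) break; // Stream.Read returns 0 only at the end of the stream
      total += n;
    }
    return total;
  }
}
```

Inside IStreamWrapper.Read, the call stream.Read(pv, 0, cb) could then be swapped for StreamReadHelper.ReadFully(stream, pv, cb), so that a short count is only reported once the underlying stream is exhausted.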

The test setup

To try things out, I shut down CDS and moved my old index files away. Then I changed the indexed folders setting to include only the one folder where I had created the test files with the .testfile extension. I compiled and registered the library with the classes above (to register a .NET assembly you use regasm, not regsvr32 as the documentation describes) and started CDS again. I found the most reliable way to get CDS to rebuild its index was the link on the Advanced page of the options dialog that says “Clear index contents and reindex all files and folders”. The documentation’s suggestion, to use the Options/Update index/Files menu entry, didn’t work reliably for me.

Once everything was in place, I found the whole mechanism working flawlessly. CDS would index my files and I could find them when entering search words that I knew were in the files. The only fly in the ointment, as I already hinted above, is that CDS doesn’t provide any kind of preview for custom file types, and there’s no (public?) way to extend it in that regard. Apart from that, thanks to Copernic for listening to those of your users who suggested an extensibility feature!

11 Comments on An implementation of a Copernic Desktop Search Custom Extractor in C#

  1. I’ve become interested in Copernic after a long absence (v1.2), and after thinking that I should use MS Indexing Service (on 2000 / XP / 2003 systems and NTFS volumes only) instead. I have a couple of quick questions: (1) Have you tried your C# code with the actual 1.5 release? (2) Have you got any ideas on how to implement indexing of NTFS alternate data streams using Copernic? I’ll be in touch later, once I have done a few tests with Copernic DS v1.5. Regards, Ian Thomas

  2. The code works with the final 1.5 release just as well as it did with the beta. About indexing alternate data streams… you’d have to combine the content from the file with that from the alternate stream, I guess. It would be easy to write an extension to my IStreamWrapper that would simply “append” content from other sources to the data from the file itself. Copernic won’t be able to distinguish between data from the file and data from alternate streams, but as there’s no way to visualise your data in Copernic up to now, this shouldn’t be a great problem.

  3. ian thomas // May 5, 2005 at 8:51 am

    A couple of things, here. First, your remark “there’s no way to visualise your data in Copernic up to now”. There is a way: I have followed, in recipe fashion, the approach used by ‘Pythonner’ (http://pythonner.blogspot.com/2005/03/copernic-desktop-search-plug-in.html), where he presents a quick-and-dirty solution for indexing and previewing .PS (PostScript) files with CDS. This uses a C# wrapper. I have compiled, signed and installed this, and it all works as expected on my Win2K system with Copernic Desktop Search installed. Because in this special case GhostScript is used, and it has a mechanism to process any .PS file using a conversion file (PS2ASCII.PS) and a range of command parameters, producing a plain text stream, search and preview are possible. (I used GhostScript 8.15.) Using something like GhostScript is a very heavy burden (several MB of DLLs and files, file locations to be explicitly defined, etc.), but I guess it’s no worse than using the Adobe iFilter for Acrobat 7 with Microsoft’s OS Indexing Service: that’s 7 MB, as I recall. I might use the same approach, since it is quite feasible for an application I have in mind: extracting quite particular text from a GIS document file, using an installed application (which is closer to 70 MB in size, as it happens). Secondly, I’m a real novice with COM; I can *just* follow what’s in your code. First, I need to implement what you’ve written in your 3 March 2005 blog article before scratching my head about how to ‘simply “append” content from other sources to the data from the file itself’, as you say. That part’s relatively easy, I guess. Thanks for the helping hand.

  4. Actually that code won’t work with the newest release, due to a new bug in build 646 of Copernic. And it turns out that if you had implemented IStream::Seek() (as I did), it also wouldn’t work in build 644, which they released as 1.5, due to a bug in their Preview (though indexing works). And lastly, it takes about 20 seconds (and 250 spurious reads) for their preview to return. All in all, there are quite a few issues with Copernic’s current API. I hope they get it debugged soon, but they haven’t been very responsive yet.

  5. Guys, thanks for the updated information on this. I haven’t had a look at the newer release versions of CDS and the corresponding APIs for quite a while, because the original purpose of the extension I was looking at got pushed to a lower priority. I appreciate the current information, though!

  6. I can’t get it running; I always get a missing GUID reference or something like that.

  7. Thanks for the IStream wrapper; it saved me an awful lot of work. I’ve adapted it to work on the .NET Compact Framework 2.0; if you’re interested, it’s posted here with full credit. Thanks again for publishing your findings! (BTW, your Captcha provider is horrible: the text is barely legible, and I think it would still be easier on OCRs than most Captchas. Please consider replacing it?)

  8. Tomer, thanks for posting this, and also for your work. You’re the second guy to comment on the captcha, so I decided to fiddle with the settings somewhat to make it easier to read in the majority of cases. Before, it managed to kill at least 99.9% of the comment spam coming this way (and I know that this means thousands of spams, from the time I didn’t have a captcha) – I’ll see how well it does now. Thanks for the feedback in any case!

  9. In case someone still needs this, the URL for the .NET CF version of the IStream wrapper has moved to http://www.tomergabel.com/ManagedIStreamWrapperForNETCompactFramework20.aspx

  10. Matthew Howells // July 18, 2012 at 4:41 pm

    Your implementation of IStream.Read is wrong. It does not take into account that reading fewer bytes than requested from an IStream means the end of the stream has been reached, whereas the same is not true of a Stream.

  11. Hey, do any of you know whether custom file indexers are possible in new versions of Copernic?
