Medusa search engine improvements

Published 17/02/2003, last modified 26/06/2013

It’s been a long time since I wrote anything in this section. Today is the day I blab about what I thought up last night, which kept me up all night. This is my proposal to improve Medusa dramatically. What is Medusa, you ask? Medusa is a file indexing and search service created alongside Nautilus by Eazel, to be used in the GNOME environment. Too much already? Then don’t read the rest: a full page of technicalese awaits.

Medusa search/index improvement proposal

Hey there:

I had a couple of ideas that kept me restless, so I’m writing this very late in the evening (or very early in the morning, choose the one you prefer - my bet’s number 2).

Goals

  • to leverage existing code for local area networks
  • to index removable devices and all kinds of volumes for creation of an offline searchable catalog
  • to improve overall user experience

Medusa is both a file indexing daemon and a file search service. So far so good. But Medusa (a slocate on steroids) can be extremely useful for many other things beyond the "personal computer" use case.

Medusa can be leveraged to support two things I’d like to see:

  • Removable media
  • Storage/Local area networks

Right now, Medusa has (AFAIK) no real support for removable volumes. It can index whatever happens to be mounted, local or removable, but it does not keep the offline searchable catalog I have in mind. Coupled with Nautilus, such a catalog could mean that searching for a file containing "user friendly" gives me results in (e.g.) a multicolumn view, from where double-clicking a file prompts me to insert the “Backups #2” volume into my Hitachi 48X plus CD-ROM drive, after which Nautilus could easily resume regular operations and show me the file’s contents.

To add removable device support, three infrastructural areas need to be changed:

  • database changes
  • system interaction changes
  • user interface changes

Database

  • The full path name to a file should no longer be stored. Instead, Medusa should store the volume’s unique ID (GUID, whatever) and the volume name, along with the base path (say a file is in /usr/lib/gkrellm/plugins/ and /usr is a filesystem; the base path here would be /lib/gkrellm/plugins).
    I don’t know if this is the way Medusa handles it right now, but, well, it seems good. This gives us room to include CD-ROMs and floppies in the mix. If you look carefully enough, perhaps storing the mount points or device files isn’t even needed, because you store the type of media (hard disk, CD-ROM, et cetera; information readily available via standard system interfaces) along with the volume information. When a file is requested, you can then reconstruct the path independently of wherever the volume was first indexed (so you can pop your CD into the Hitachi CD-ROM drive while you are listening to music in the Plextor, where you usually index your volumes).
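
A minimal sketch of such a record, just to make the idea concrete. Everything here is an assumption for illustration: the class name, the field names and the `mounted` mapping are hypothetical and are not taken from Medusa’s actual schema or code.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath
from typing import Dict, Optional


@dataclass(frozen=True)
class IndexedFile:
    """One catalog entry, keyed by volume instead of by absolute path."""
    volume_id: str    # unique ID of the volume (format is an assumption)
    volume_name: str  # human-readable label, e.g. "Backups #2"
    media_type: str   # "hard disk", "CD-ROM", ...
    base_path: str    # path relative to the volume root


def current_path(entry: IndexedFile, mounted: Dict[str, str]) -> Optional[str]:
    """Rebuild the absolute path from wherever the volume is mounted now.

    `mounted` maps volume_id -> mount point. Returns None when the volume
    is offline, which is the caller's cue to ask for it to be inserted.
    """
    mount_point = mounted.get(entry.volume_id)
    if mount_point is None:
        return None
    return str(PurePosixPath(mount_point) / entry.base_path.lstrip("/"))
```

Because the entry never records a mount point, the same CD resolves correctly whether it sits in one drive or the other; only the `mounted` mapping changes.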

System interaction

  • Medusa should be signaled by autorun whenever media is inserted, or by ‘dynamic’ when a volume is hotplugged - the whole idea being that Medusa should detect mounts and act accordingly.
  • Medusa should start indexing files if the volume is new, or begin monitoring its files for changes if it is already known.
  • Medusa should perhaps detect unmounts and immediately close all files being indexed, to avoid the “busy: cannot umount” problem when umounting drives. And it should be compatible with supermount.
  • It should also monitor for changes in files so as to avoid rescanning the entire hard disk the braindead way slocate does nowadays (I think Medusa already acts smart in this particular issue). And it should be ‘nice’ to system resources (not be a hog when indexing).
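
As a rough illustration of "detect mounts and act accordingly", here is a sketch that compares two snapshots of the mount table in the format Linux exposes in /proc/mounts. Note that this swaps in polling for the autorun signaling proposed above, purely because it is easy to show; the function names and the sample devices are made up.

```python
from typing import Dict, List, Tuple


def parse_mounts(text: str) -> Dict[str, str]:
    """Map mount point -> device from /proc/mounts-style text.

    Each line reads "device mount_point fstype options dump pass".
    """
    mounts = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) >= 2:
            mounts[fields[1]] = fields[0]
    return mounts


def diff_mounts(before: Dict[str, str],
                after: Dict[str, str]) -> Tuple[List[str], List[str]]:
    """Return (newly mounted, no longer mounted) mount points."""
    appeared = sorted(set(after) - set(before))
    vanished = sorted(set(before) - set(after))
    return appeared, vanished


# Two example snapshots (made-up devices and mount points).
BEFORE = "/dev/hda1 / ext3 rw 0 0\n"
AFTER = BEFORE + "/dev/hdc /mnt/cdrom iso9660 ro 0 0\n"

appeared, vanished = diff_mounts(parse_mounts(BEFORE), parse_mounts(AFTER))
# appeared: volumes to index (if new) or to start monitoring.
# vanished: volumes to mark as offline in the catalog.
```

A poller like this only notices an unmount after it has happened, so it cannot help with the "busy" problem; that is one reason being signaled is the better design.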

User interface integration

  • Nautilus icons for drives should have a right click menu item that says “Index this volume now” to signal Medusa that it should begin indexing the removable volume.
  • Search results would include files in removable volumes (or at least an option to include them!) and would index words in all files which have text. Activating an entry in the search dialog should prompt me for the volume if it’s missing, and mount it, evidently only if this is possible at all.

As you can see, with the proper elbow grease at the Nautilus level, and the proper steel framing at the system level, we would have a very powerful cataloguing system. As file systems evolve and delve into the metadata thing, the cataloguing system will get richer all by itself, with little future work. And I could look for all music files by ATB on all my CDs, which would by far surpass anything the Microsofties offer nowadays. All this using only distro-provided Linux software, and at no user effort (except, perhaps, for floppies, since all other types of removable media are handled by either autorun or dynamic).

Network scenarios

Great, huh? Now imagine this gets extended to support NFS networks. Medusa could be accessible via the network: a Medusa search service on my local machine could, instead of indexing network mounts, delegate the search to the Medusa search service on the machine that exports the network mount, much the way SGI FAM does (*).

Medusa should also respect the /etc/exports conventions.
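
To give an idea of what "respecting /etc/exports" could look like, here is a small sketch that reads the common `path client(options)` line format and checks whether a file lies under an exported directory. It is deliberately simplified (no quoted paths, no line continuations), and the function names are hypothetical, not part of Medusa.

```python
from pathlib import PurePosixPath
from typing import Dict, List


def parse_exports(text: str) -> Dict[str, Dict[str, List[str]]]:
    """Map exported directory -> {client spec: [options]}."""
    exports: Dict[str, Dict[str, List[str]]] = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        path, *clients = line.split()
        allowed: Dict[str, List[str]] = {}
        for spec in clients:
            host, _, opts = spec.partition("(")
            opts = opts.rstrip(")")
            allowed[host] = opts.split(",") if opts else []
        exports[path] = allowed
    return exports


def is_exported(file_path: str, exports: Dict[str, Dict[str, List[str]]]) -> bool:
    """True if file_path is an exported directory or lies inside one."""
    target = PurePosixPath(file_path)
    return any(
        PurePosixPath(root) == target or PurePosixPath(root) in target.parents
        for root in exports
    )
```

A search service on the server could use a check like `is_exported` to drop results that fall outside the shares, so remote clients never learn about files the administrator did not export.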

This way, we leverage the NFS networks’ facilities, with zero extra configuration, while still providing an extremely low-resource network search facility. This could mean that a newbie corporate hire could look for every document with the word “policy” in it, with nearly zero network overhead, on every corporate file server, and have the network show him ONLY the files he can see (by traditional UNIX security semantics, which both Medusa and NFS respect - and slocate as well). And so our new hire can get to work quickly and know all company policies instead of getting a two-hundred page book or complicated instructions on how to “Connect to a network drive and access folder XYZ”.
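
The "ONLY the files he can see" part boils down to the classic owner/group/other read check. Below is a simplified sketch of filtering search results that way; the result dictionaries and function names are assumptions for illustration, and the check ignores root, ACLs and the search permission needed on parent directories.

```python
import stat
from typing import Iterable, List, Set


def can_read(mode: int, owner_uid: int, owner_gid: int,
             uid: int, gids: Set[int]) -> bool:
    """Traditional UNIX check: owner bits, else group bits, else other bits."""
    if uid == owner_uid:
        return bool(mode & stat.S_IRUSR)
    if owner_gid in gids:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


def visible(results: Iterable[dict], uid: int, gids: Set[int]) -> List[str]:
    """Keep only the paths the requesting user is allowed to read."""
    return [
        r["path"]
        for r in results
        if can_read(r["mode"], r["uid"], r["gid"], uid, gids)
    ]
```

Here each result carries the `mode`, `uid` and `gid` recorded at indexing time, so the filter needs no extra disk access when answering a query.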

(*) FAM delegates file monitoring to remote FAMs. When FAM cannot connect to a remote NFS FAM server, it falls back to standard dnotify. This behavior can be mimicked in Medusa-searchd. To prevent failed file accesses, removable media search results wouldn’t be returned to the client search service.

To work properly, this would evidently need autoconfiguration. This won’t succeed if the Medusa daemon needs to be configured in client or server machines. Medusa has to work drop-in, out of the box, with current network configuration.

  • Medusa-searchd in the NFS server should accept remote connections if the NFS server is up
  • Medusa-searchd in the NFS server should respect the /etc/exports conventions and use the existing configuration files
  • Medusa-indexd in the client should never index NFS mounted volumes.
  • Medusa-searchd in the client should always attempt first to connect to the NFS server and, failing that, either use standard search methods or show no results from the server at all. Many NFS networks would be brought to their knees by traffic from every client hammering the entire exported share to index it.
  • Medusa-indexd/-searchd code should be audited for possible vulnerabilities involving feeding purposefully corrupted files or search queries
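
The client-side rule above (ask the server first, then fall back or return nothing) can be sketched as follows. The function and its callables are hypothetical stand-ins; no real Medusa network protocol is implied.

```python
from typing import Callable, List, Optional


def search_nfs_mount(
    query: str,
    server: str,
    remote_search: Callable[[str, str], List[str]],
    fallback_search: Optional[Callable[[str], List[str]]] = None,
) -> List[str]:
    """Delegate the query to the server; never index the mount locally.

    remote_search(server, query) is expected to raise OSError (for
    example ConnectionRefusedError) when the server cannot be reached.
    """
    try:
        return remote_search(server, query)
    except OSError:
        if fallback_search is None:
            return []  # show no results from that server at all
        return fallback_search(query)
```

Leaving `fallback_search` unset is the cautious default: an unreachable server simply contributes no results, rather than triggering a crawl of the share over the network.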

To boot, this could even be reused in a web project as a search service for intranets (killing the need for htdig, which doesn’t really go beyond HTML files).

I have the feeling that a couple of changes in Medusa would make it far more useful than it currently promises to be. I bet that if this gets worked on, even the KDE people would get around to using it. Remember how ugly and slow the search box in KDE is: it’s even slower than Windows’ file search tool. KDE already takes advantage of SGI FAM. Medusa could be the search service everyone has been waiting for.

Good luck!

Rudd-O

This will also be submitted to the Nautilus mailing list.
