The database

The metadata service uses a persistent store (called "database" in this documentation) to keep:
- representations of indexed files (as serialized File instances)
- a mapping from filename to file representation (as a FileList instance)
- indexes for each of the indexed files' attributes (named "<attributename>_<indextype>")

Database structure

The structure starts at the database root, which is a dictionary object.  The root contains an entry named, aptly, "file_list", which is a FileList instance.  The "file_list" object maps filenames to representations of files, which are instances of File or of classes derived from it.
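
As a rough illustration of the layout described above, here is a minimal sketch that uses plain Python objects in place of ZODB's persistent ones.  The File and FileList classes below are simplified stand-ins, not the real implementations; the index type name "field" and the idea that indexes sit next to "file_list" in the root are assumptions made for the example.

```python
class File(object):
    """Simplified stand-in for the service's File class."""

    def __init__(self, filename, mimetype):
        self.filename = filename
        self.mimetype = mimetype


class FileList(dict):
    """Simplified stand-in for FileList: maps filename -> File instance."""


def index_name(attribute_name, index_type):
    # Indexes are named "<attributename>_<indextype>".
    return "%s_%s" % (attribute_name, index_type)


# The database root is a dictionary object.
root = {}

# The root contains the file list under the key "file_list".
root["file_list"] = FileList()
root["file_list"]["/home/user/notes.txt"] = File("/home/user/notes.txt", "text/plain")

# Assumption: an index maps attribute values to filenames and lives in the root.
# "field" is a hypothetical index type name.
root[index_name("mimetype", "field")] = {"text/plain": ["/home/user/notes.txt"]}
```

Looking up a file is then a plain mapping access, for example root["file_list"]["/home/user/notes.txt"].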

Technology

The database uses ZODB (Zope Object Database) as its backend.  ZODB is an advanced object database (more accurately, an object persistence framework) which boasts, among others, the following features:
- transactions
- indexes
- extremely fast searches
All of these are used by Advanced metadata services.  ZODB was chosen because it allows objects to be persisted and modified transparently, with no impedance mismatch.  However, ZODB is not without weaknesses.  Using ZODB means that:
- (re)creation of an index requires each file representation to be loaded, which is O(n)
- there is no schema migration plan for objects, which means every File-derived object needs to maintain its public API forever
- currently, there is no known way to maintain multiple index types for the same File attribute
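
To see why (re)creating an index is O(n), consider this minimal sketch.  It uses plain dictionaries instead of ZODB structures, and the Record class and rebuild_index function are hypothetical names invented for the illustration.  With a real object database, each attribute access in the loop would also force the file representation to be loaded from storage.

```python
class Record(object):
    """Hypothetical stand-in for a stored file representation."""

    def __init__(self, **attributes):
        self.__dict__.update(attributes)


def rebuild_index(file_list, attribute_name):
    """Build a value -> [filenames] mapping by visiting every file once."""
    index = {}
    for filename, file_obj in file_list.items():
        value = getattr(file_obj, attribute_name, None)
        if value is None:
            continue  # this file has no value for the attribute
        index.setdefault(value, []).append(filename)
    return index


file_list = {
    "a.txt": Record(mimetype="text/plain", lines=10),
    "b.txt": Record(mimetype="text/plain", lines=25),
    "c.mp3": Record(mimetype="audio/mpeg"),
}

mimetype_index = rebuild_index(file_list, "mimetype")
lines_index = rebuild_index(file_list, "lines")
```

The loop touches all n entries no matter how few of them carry the attribute, which is what makes adding a new index expensive on a large database.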





Advanced metadata services is composed of several applications:

- The metadata service 'pymetadata', in charge of indexing all files and responding to metadata queries
- The command-line 'search' application, a simple application to search the metadata service
- The command-line 'get-metadata' application, a simple application to query the metadata service for a file's metadata



File types:

Currently, Advanced metadata services indexes basic attributes of all files, and indexes metadata from the following file types:
- MP3
- Ogg Vorbis
- plain text
- HTML

Slogans:
"Finding the proverbial needle in a haystack of a million [strands|files] in less than one second"


Goals:

These are the main goals of Advanced metadata services:
- enable end users to conduct rich and fast searches of their entire information space
- provide application and desktop environment developers with a simple, consistent and standards-based interface to enable metadata and search integration in their software
- provide developers with a simple, standards-based plugin interface to integrate their file formats into Advanced metadata services


Available challenges:
- reduce on-disk database size
- make the database use a single file which reuses deleted objects' space
- reduce time needed to create a new index
- make new file format plugins (yay!)


How to develop a file format plugin

Developing a file format plugin for Advanced metadata services is simple.  Advanced metadata services provides you with an easy-to-use interface to KDE's powerful KFileMetaInfo system.  This means that, in most cases, you only need to code a few assignment operations.

File format plugins are Python modules.  The following (real and commented) example of a file format plugin illustrates this:

--------------------------------------------
from MetadataPlugins import get_kde_metadata_expert, File, DataType
from DateTime.DateTime import DateTime

# you define a class, always derived from File
class PlainTextFile(File):

# you define the public attributes, corresponding to each metadata item's name
	format = None
	lines = None
	characters = None

# in the initialization phase, you get a file name and a MIME type, ask the expert to deliver the metadata for the filename, and use the returned dictionary to fill the public attributes.  Make sure you cast (using whatever function) each attribute to its correct data type
	def __init__(self,filename,mimetype):
# you need to call the parent's initialization routine, always
		File.__init__(self,filename,mimetype)
# retrieve the metadata attributes using the metadata expert
		metadata = get_kde_metadata_expert().get_metadata(filename)
# assign the attributes using whatever data you have available
# FIXME refer to KDE's KFile documentation
		if "general.format" in metadata:
			self.format = metadata["general.format"]
		if "general.lines" in metadata:
			self.lines = int(metadata["general.lines"])
		if "general.characters" in metadata:
			self.characters = int(metadata["general.characters"])
# write a short function which describes the file in a human-readable way
	def __repr__(self):
		# %s is used for characters because the attribute may still be None
		return "plain text file %s (%s characters, %s format)"%(self.filename,self.characters,self.format)

# declare the class name
def get_factory():
	return PlainTextFile

# declare the list of MIME types supported by this plugin
def get_mimetypes():
	return ["text/plain"]

# declare the public attributes you want to have indexes on, and their data types
def get_data_types():
	return {"format":DataType.TYPE_STRING,"lines":DataType.TYPE_INT,"characters":DataType.TYPE_INT}
# FIXME this seems to be no longer needed now
------------------------------------------------------

Yes, that's all you need to do.  Should you need an attribute not returned by the metadata expert, feel free to import any of Python's available libraries and do the work =).
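
To give an idea of how the two module-level functions from the example above might be consumed, here is a minimal sketch of a MIME type registry.  This is not the metadata service's actual loader: register_plugin, make_file, the simplified File class and the fallback to plain File for unknown MIME types are all assumptions made for illustration.

```python
import types


class File(object):
    """Simplified stand-in for the File base class."""

    def __init__(self, filename, mimetype):
        self.filename = filename
        self.mimetype = mimetype


class PlainTextFile(File):
    """Simplified version of the plugin class from the example above."""


# A plugin is a Python module; an in-memory module object stands in for it here.
plugin = types.ModuleType("text_plain_plugin")
plugin.get_factory = lambda: PlainTextFile
plugin.get_mimetypes = lambda: ["text/plain"]


def register_plugin(registry, plugin_module):
    """Map every MIME type the plugin declares to the plugin's class."""
    factory = plugin_module.get_factory()
    for mimetype in plugin_module.get_mimetypes():
        registry[mimetype] = factory


def make_file(registry, filename, mimetype):
    """Instantiate the registered class, or plain File if none matches."""
    factory = registry.get(mimetype, File)
    return factory(filename, mimetype)


registry = {}
register_plugin(registry, plugin)
text_file = make_file(registry, "notes.txt", "text/plain")
other_file = make_file(registry, "blob.bin", "application/octet-stream")
```

Here text_file is a PlainTextFile, while other_file falls back to the base class, so only basic attributes would be available for it.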

Things to watch:
- the following names are reserved: filename, size, container, modification_date, mimetype
- all string and text attributes must be platform-encoded strings. To convert a string from another encoding, use the function transcode_to_platform_encoding(yourstring,source_encoding), in the common module.  There's also a convenience function called platform_to_utf8, which converts a UTF-8 encoded string to the platform's encoding.  Fortunately for you, all strings that the get_metadata() call returns are guaranteed to be platform-encoded.
- the customary attribute for the file's contents is named 'contents'.  This attribute, if present, must always be of type TYPE_TEXT and contain whatever text you extract.  String encoding rules also apply, so you should also use the functions mentioned in the previous item.  If the transcode_to_platform_encoding function cannot represent the passed string in your platform's encoding, do not worry: it will raise UnicodeDecodeError or UnicodeEncodeError.  Catch the exception and provide None as the value for contents.  This also applies to all other text attributes.
- short text attributes should be declared as TYPE_STRING.  This allows the index to perform index optimizations and search types not available with TYPE_TEXT
- the data type namespace is shared among all plugins.  If another plugin has already defined an attribute with the same name, do not redefine the attribute's data type yourself, or the metadata service will create an additional index, possibly leading to conflict.  Use a different name for your attribute.
- store dates and times using the included DateTime class (from common import DateTime).  To create a DateTime from a timestamp, call DateTime.gmtime(time_in_utc_epoch), where time_in_utc_epoch is the offset in seconds since the epoch, in UTC, such as the values returned by os.stat() or time.time().  To create a specific date, in UTC as well, DateTime(year,month,day,hour,minute,second) will suffice.  See the audio-vorbis plugin (how to set the track date) for more information, and consult the Python documentation on the module 'time' to find out how to get and generate UTC timestamps.  All dates in your class attributes must be of DateTime type.
- To log a message, use the logging module, with the "plugins.pluginfilename" logger name.  For example:
	import logging
	logging.getLogger("plugins.text-plain").info("Could not determine file contents encoding")
This is useful when you need to inform the system administrator that some aspect of your plugin failed but you want to keep going.  Just trap the exception and, in the except clause, log the problem.  If you need a traceback of what happened, pass exc_info=True:
	logging.getLogger("plugins.text-plain").info("Could not determine file contents encoding",exc_info=True)
- To see log messages, raise the metadata service log level. 
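
Two of the points above, the None fallback for text that cannot be transcoded and the handling of UTC timestamps, can be sketched as follows.  This is only an illustration: transcode_stand_in is a hypothetical replacement for transcode_to_platform_encoding (the real function lives in the common module and takes two arguments), and extract_contents and utc_fields are names made up for this example.  The tuple returned by utc_fields holds the six values that the DateTime(year,month,day,hour,minute,second) form expects.

```python
import logging
import time

log = logging.getLogger("plugins.text-plain")


def transcode_stand_in(raw_bytes, source_encoding, platform_encoding="ascii"):
    """Hypothetical stand-in for transcode_to_platform_encoding.

    The third argument is not part of the real function; it pins the
    "platform" encoding so that the sketch behaves the same everywhere.
    """
    return raw_bytes.decode(source_encoding).encode(platform_encoding)


def extract_contents(raw_bytes, source_encoding, transcode=transcode_stand_in):
    """Return platform-encoded contents, or None if transcoding fails."""
    try:
        return transcode(raw_bytes, source_encoding)
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Log the problem with a traceback and keep going.
        log.info("Could not determine file contents encoding", exc_info=True)
        return None


def utc_fields(seconds_since_epoch):
    """Return (year, month, day, hour, minute, second) in UTC."""
    return tuple(time.gmtime(seconds_since_epoch)[:6])


good = extract_contents(b"hello", "utf-8")           # representable text
unencodable = extract_contents(b"\xc3\xa9", "utf-8")  # not representable in ASCII
undecodable = extract_contents(b"\xff", "utf-8")      # not valid UTF-8
epoch = utc_fields(0)
```

In this sketch both failure cases end up as None, so a plugin would store None in contents and the rest of the indexing could proceed.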


Testing the plugin

- use the provided application "test-file-plugin" to test your plugin.  It simulates the interface to the metadata service, at the command line.  Pass any filename as the first argument, and watch the printout or tracebacks.

Installing the plugin

Put your plugin file in the directory # FIXME where?


Querying the metadata service:

The metadata service provides two equivalent search interfaces:
- a TCP XML-RPC interface, located at port #FIXME
- a UNIX domain socket XML-RPC interface, located at #FIXME

#FIXME implement search engine with a web interface???

To query the metadata service, use the available methods on the search interfaces.  The search interfaces are self-documenting: visit the TCP interface with your Web browser to read the documentation for each available method.  The TCP interface requires HTTP Basic authentication, while the UNIX domain socket interface will assume your current user's credentials.
#FIXME implement this!
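
Since the interface is still to be implemented, the following is only a sketch of what a client for the TCP interface could look like, using Python's standard XML-RPC client module.  The method name "search" and the query string are hypothetical, and the host, port and credentials are left as parameters because this document does not define them yet.

```python
try:
    import xmlrpc.client as xmlrpclib  # Python 3
except ImportError:
    import xmlrpclib  # Python 2


def connect(host, port, user, password):
    """Create a proxy for the TCP interface.

    HTTP Basic credentials are passed in the URL.  No connection is made
    until a remote method is actually called on the proxy.
    """
    url = "http://%s:%s@%s:%d/" % (user, password, host, port)
    return xmlrpclib.ServerProxy(url)


def build_request(method_name, *params):
    """Marshal a call into the XML body that a client would POST."""
    return xmlrpclib.dumps(tuple(params), methodname=method_name)


# "search" and its argument are hypothetical; the real method names would be
# listed by the self-documenting interface.
body = build_request("search", "mimetype = text/plain")
decoded_params, decoded_method = xmlrpclib.loads(body)
```

With a running service, a call would look like connect(host, port, user, password).search(expression), assuming a method of that name exists.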


Metadata service features:
- extremely fast searches: a simple search expression which returns 50 results from more than 200,000 files takes approx. 0.6 seconds.
- fast indexing: the average file takes less than 0.4 seconds to be indexed
- rich metadata: you can query metadata on any local file available to you
- arbitrarily complex searches: any search expression can be composed with other expressions in classical OR and AND fashion.  Search expressions can also be nested with no depth limit.
- wide range of search types: in addition to standard equality and less-than/greater-than query types, you can query text attributes using glob characters
#FIXME implement - incremental searches: 
- adaptive resource usage: you can configure the metadata service for a target system average load, and, if the system load goes above the limit, the indexing process will be throttled to preserve system responsiveness
#FIXME implement - incremental indexing: a file gets re-indexed when it is modified
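
The way search expressions compose (AND, OR, nesting, glob matching) can be illustrated with a small sketch.  The class names Eq, LessThan, Glob, And and Or are hypothetical and are not the service's actual query API.  The sketch also scans a list of dictionaries linearly, whereas the real service answers queries from its indexes.

```python
import fnmatch


class Eq(object):
    def __init__(self, attribute, value):
        self.attribute, self.value = attribute, value

    def matches(self, record):
        return record.get(self.attribute) == self.value


class LessThan(object):
    def __init__(self, attribute, value):
        self.attribute, self.value = attribute, value

    def matches(self, record):
        actual = record.get(self.attribute)
        return actual is not None and actual < self.value


class Glob(object):
    def __init__(self, attribute, pattern):
        self.attribute, self.pattern = attribute, pattern

    def matches(self, record):
        actual = record.get(self.attribute)
        return actual is not None and fnmatch.fnmatchcase(actual, self.pattern)


class And(object):
    def __init__(self, *expressions):
        self.expressions = expressions

    def matches(self, record):
        return all(e.matches(record) for e in self.expressions)


class Or(And):
    def matches(self, record):
        return any(e.matches(record) for e in self.expressions)


def search(records, expression):
    return [r for r in records if expression.matches(r)]


files = [
    {"filename": "song.mp3", "mimetype": "audio/mpeg", "size": 4000000},
    {"filename": "notes.txt", "mimetype": "text/plain", "size": 1200},
    {"filename": "todo.txt", "mimetype": "text/plain", "size": 90000},
]

# (filename matches *.txt AND size < 10000) OR mimetype == audio/mpeg
expression = Or(And(Glob("filename", "*.txt"), LessThan("size", 10000)),
                Eq("mimetype", "audio/mpeg"))
found = [r["filename"] for r in search(files, expression)]
```

Because And and Or accept any expression object, including other And and Or instances, nesting has no depth limit other than Python's own recursion limit.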

Requirements:

Advanced metadata services requires the following software to work:
- Python 2.3 or later
- Zope 2.7 or later
- KDE (kdebase and KFile plugins)
- the PyQt and PyKDE bindings

(with great things come great requirements, yay!)


Contacting

The project leader and author is Manuel Amador (Rudd-O) <rudd-o@amautacorp.com>.  Send him patches, new file format plugins, comments, contributions, gifts, anything you find appropriate.
