Hi, guys:

I'm building a metadata engine for UNIX operating systems, and I'm using Zope (ZODB and ZCatalog) in the process.  Since there's a lot of data to store (millions of objects, each representing a file in the file system), I've written my own DirectoryStorage which (simplifying a bit) just stores every object as a file in a single directory.  It has full transactional semantics, so there should be no problem using it in production, provided that directories with millions of files do not cause major performance degradation.  O(log(n)) is the target, and that seems to be the case with ReiserFS.  It does not support Undo or Versions (it subclasses MappingStorage).  What's great about this approach is how little disk space it needs (noticing how FileStorage and BDBStorage grow exponentially with the number of stored objects is what encouraged me to write it).
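
To make the "one file per object" idea concrete, here is a minimal sketch.  It is not the actual DirectoryStorage: the class name ObjectDirectory, the file naming scheme and the use of pickle are all assumptions for illustration, and it only shows an atomic write per object, not multi-object transactions.

```python
import os
import pickle
import tempfile


class ObjectDirectory(object):
    """Hypothetical one-file-per-object store (not the real DirectoryStorage)."""

    def __init__(self, root):
        self.root = root
        if not os.path.isdir(root):
            os.makedirs(root)

    def _path(self, oid):
        # One file per object, named after the (integer) object id.
        return os.path.join(self.root, "%016x" % oid)

    def store(self, oid, obj):
        # Write to a temporary file in the same directory, then rename it
        # into place, so a reader never sees a half-written object.
        fd, tmp = tempfile.mkstemp(dir=self.root)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(obj, f)
        os.replace(tmp, self._path(oid))

    def load(self, oid):
        with open(self._path(oid), "rb") as f:
            return pickle.load(f)


store = ObjectDirectory(tempfile.mkdtemp())
store.store(1, {"path": "/bin/ls"})
store.store(1, {"path": "/bin/cat"})  # overwriting replaces the old file
loaded = store.load(1)
```

With this layout, lookup cost depends entirely on how the file system handles very large directories, which is why the ReiserFS behaviour matters.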

Now, my ZODB structure is as follows:
- a volume list, which is a PersistentMapping containing the volumes, plus several catalogs, one for each index type (FieldIndex, TextIndexNG2, PathIndex).  I use separate catalogs because I must deal with the fact that a.attribute may be of a different type than b.attribute, and because the same attribute has to be searchable in different ways: sometimes I need to search for "/bin" (FieldIndex) and sometimes for "/bin/*" (PathIndex).  If I had only one catalog with a PathIndex and a FieldIndex on the same attribute, both indexes would be consulted on every search, and the undesired intersection of the two result sets would yield only "/bin".  (If you have a solution for this, please let me know.)
- the volumes, each of which contains a file list (OOBTree) and an OIBTree of file modification dates, so that sweeping the disk does not require waking every object up.
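
The intersection problem described in the first item can be sketched with plain Python sets.  This is only an illustration: the two functions are hypothetical stand-ins for a FieldIndex and a (much simplified) PathIndex, not the ZCatalog API.

```python
paths = ["/bin", "/bin/ls", "/bin/cat", "/usr/bin"]


def field_match(query):
    # FieldIndex-style lookup (stand-in): exact value match only.
    return set(p for p in paths if p == query)


def path_match(query):
    # PathIndex-style lookup (simplified stand-in): the path itself
    # plus everything below it.
    return set(p for p in paths if p == query or p.startswith(query + "/"))


exact = field_match("/bin")
subtree = path_match("/bin")

# A single catalog with both indexes on the same attribute would
# intersect the two result sets, losing everything below "/bin".
both = exact & subtree
```

Here subtree holds "/bin", "/bin/ls" and "/bin/cat", but both holds only "/bin".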

When the time comes to store an object, it is stored in the file list of the corresponding volume and catalogued in all pertinent catalogs.  Sometimes storing an object causes about 50 objects to change (I suppose because the trees for the file list and the catalogs are rebalancing themselves), but sometimes it causes 1000, or even 16000, changes.  All of them need to be committed, and performance drops from 0.10 seconds per file to 33 seconds per file.

Why is that?  If you are interested in taking a look, the code is <here>, and it logs everything in fair detail, so you can follow what's going on.  Also, after hours of indexing, performance drops even for small changesets of 50~60 objects.  I *can tell* it has nothing to do with the file system, because I'm timing disk accesses in isolation and they do not go up.  The growth in time seems to occur between the call to get_transaction().commit() and the point where _finish gets called in the storage module.  Why?
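
For reference, this is roughly how the phases can be timed separately.  It is a minimal sketch, not the actual logging code: the timed() helper and the labels are hypothetical, and the sleep calls merely stand in for the real disk write and the real get_transaction().commit().

```python
import time
from contextlib import contextmanager

timings = {}


@contextmanager
def timed(label):
    # Accumulate wall-clock time per label, even if the body raises.
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        timings[label] = timings.get(label, 0.0) + elapsed


with timed("disk"):
    time.sleep(0.01)  # stand-in for a raw file write

with timed("commit"):
    time.sleep(0.01)  # stand-in for get_transaction().commit()
```

Comparing timings["disk"] with timings["commit"] over a long run shows which phase is the one that grows.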


Another issue:
Let's say we have an OOBTree with five entries, keys ["a","b","c","d","e"].  This code:
>>> for a in theoobtree.keys(): del theoobtree[a]
yields:
>>> print theoobtree.keys()
["a","c"]

So, I expected to be able to empty the OOBTree that way, and it did not work.  I had to do:
>>> for a in [ a for a in theoobtree.keys() ]: del theoobtree[a]
for it to work.  Why is that?  AFAIK theoobtree.keys() should not change during an iteration, since theoobtree.keys() yields a list of keys, not an iterator object, as theoobtree.iterkeys() does, right?
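
The workaround boils down to copying the keys before deleting anything.  A minimal sketch, assuming a plain dict as a stand-in for the OOBTree (BTrees is not imported here); the behaviour reported above would be consistent with keys() handing back a live view of the tree rather than an independent list.

```python
# Stand-in for the OOBTree with keys "a" through "e".
tree = dict.fromkeys(["a", "b", "c", "d", "e"])

# list(...) copies the keys before the loop starts, so deleting entries
# cannot disturb the sequence being iterated over.
for key in list(tree.keys()):
    del tree[key]

remaining = list(tree.keys())
```

Both dict and OOBTree also provide a clear() method, which empties the container without any loop.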


Another issue: I found a bug in TextIndexNG2 which prevents using it with non-ZClass (I think?) objects.  This patch fixes it:
------------------------------------------------
--- TextIndexNG.py.old	2005-03-03 03:37:02.000000000 -0500
+++ TextIndexNG.py	2005-03-04 16:59:30.200742800 -0500
@@ -258,7 +258,7 @@
             if result is None: return None
             source, mimetype, encoding = result
 
-        elif obj.meta_type in ('File', 'Portal File') and  \
+        elif hasattr(obj,"meta_type") and obj.meta_type in ('File', 'Portal File') and  \
            attr in ('PrincipiaSearchSource', 'SearchableText'):
 
             source= getattr(obj, attr, None)
@@ -268,12 +268,12 @@
                 source = str(obj)
             mimetype = obj.content_type
 
-        elif obj.meta_type == 'ExtFile' and \
+        elif hasattr(obj,"meta_type") and obj.meta_type == 'ExtFile' and \
            attr in ('PrincipiaSearchSource', 'SearchableText'):
             source = obj.index_html()
             mimetype = obj.getContentType()
 
-        elif obj.meta_type in ('ZMSFile',):
+        elif hasattr(obj,"meta_type") and obj.meta_type in ('ZMSFile',):
             lang = attr[attr.rfind('_')+1:]
             req = {'lang' : lang}
             file = obj.getObjProperty('file', req)
@@ -283,7 +283,7 @@
                 source = file.getData()
                 mimetype = file.getContentType()
    
-        elif obj.meta_type in ('TTWObject',) and attr not in ('SearchableText', ): 
+        elif hasattr(obj,"meta_type") and obj.meta_type in ('TTWObject',) and attr not in ('SearchableText', ): 
             field = obj.get(attr)
             source = str(field)
             if field.meta_type in ( 'ZMSFile', 'File' ):
----------------------------------------
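
The same guard can also be written with getattr and a default, which avoids repeating the hasattr check in every branch.  This is only a sketch of the pattern, not a tested replacement for the patch; the two classes and the is_file_like() helper are hypothetical.

```python
class PlainObject(object):
    """Hypothetical non-Zope object: it has no meta_type attribute."""
    content_type = "text/plain"


class FileLike(object):
    """Hypothetical stand-in for a Zope File object."""
    meta_type = "File"


def is_file_like(obj):
    # getattr with a default never raises AttributeError, so objects
    # without meta_type simply fail the test and fall through.
    meta_type = getattr(obj, "meta_type", None)
    return meta_type in ("File", "Portal File")


plain_result = is_file_like(PlainObject())
file_result = is_file_like(FileLike())
```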

TextIndexNG2 is also creating a parsetab.py file, which should be created as a temporary file instead.  Evidently this can be considered a security bug.
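
One way to keep such a generated file out of a shared location is to write it into a private temporary directory.  A minimal sketch using only the standard tempfile module; how TextIndexNG2's parser would be told to use this directory is not shown, and the file content is a placeholder.

```python
import os
import tempfile

# mkdtemp creates a new directory that only the creating user can
# read, write or search.
table_dir = tempfile.mkdtemp(prefix="parsetab-")
table_path = os.path.join(table_dir, "parsetab.py")

with open(table_path, "w") as f:
    f.write("# generated parser tables would go here\n")
```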

luck,

      Rudd-O
      
ps. This stuff needs to be better documented.  I've learned 90% of this by reading the code, which isn't right, because it took me a helluva long time.