KDE PIM/Meetings/Osnabrueck 4/PIM Storage Service Requirements
Result of our requirements brainstorming session on Friday
- Flat Access
- Categorization by Attributes
- Job Priorities
- Virtual Folders
- Filtering
- Access to parts of objects (mimetypes)
- Change Notification
- Shared Cache
- Asynchronous Access
- Out-of-process service plugins
- Online/Offline state management
- PIM Object Handle
- No hard locks
- Conflict Handling
- Referencing objects (local or on server)
- Capacities and Capabilities of Storage Backends
- URI scheme to identify resources
- Lazy Loading
- Copy-on-write implementation for PIM objects (using snapshots)
- Using Changesets
- Syncing with groupware servers
- Undo
- Resources (storage units)
- Non-global resource activation profiles
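To make the "PIM Object Handle", "No hard locks" and "Copy-on-write implementation for PIM objects (using snapshots)" items a bit more concrete, here is a minimal C++ sketch of one way such a handle could behave. It was not discussed in this form at the meeting; `PimObject`, `PimHandle` and all member names are hypothetical, and a real implementation would also need change notification and conflict handling.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical PIM object; a real one would hold an event, contact or mail.
struct PimObject {
    std::string summary;
};

// Hypothetical handle: readers share one immutable snapshot, and every write
// goes to a private copy that replaces the snapshot for this handle only.
// Other handles keep seeing their old snapshot, so no hard lock is taken.
class PimHandle {
public:
    explicit PimHandle(PimObject obj)
        : snapshot_(std::make_shared<const PimObject>(std::move(obj))) {}

    // Read access never copies.
    const PimObject &read() const { return *snapshot_; }

    // Write access clones the current snapshot, changes the clone and
    // swaps it in as this handle's new snapshot.
    void setSummary(const std::string &summary) {
        auto copy = std::make_shared<PimObject>(*snapshot_);
        copy->summary = summary;
        snapshot_ = std::move(copy);
    }

    // True while both handles still point at the same snapshot.
    bool sharesSnapshotWith(const PimHandle &other) const {
        return snapshot_ == other.snapshot_;
    }

private:
    std::shared_ptr<const PimObject> snapshot_;
};
```

Copying a `PimHandle` is cheap because only the shared pointer is copied; the object data is duplicated only when one of the copies is modified.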
Till's mail about the requirements from the mail side
From: Till Adam <[email protected]>
To: KDE PIM <[email protected]>, [email protected]
Date: Thu, 6 Oct 2005 09:18:58 +0200

On Thursday 06 October 2005 08:15, Cornelius Schumacher wrote:
> On Thursday 06 October 2005 05:06, Mark Bucciarelli wrote:
> > With this approach, I imagine we would see gains of two orders of
> > magnitude in memory usage for large (year-long) files. If korg only
> > loads event headers for the current month, then startup would be a
> > constant speed no matter how large the data set.
>
> That's what I called "proxy objects" in my reply to the PIM daemon
> proposal. The drawback of this would be that you have a delay when loading
> the missing data. When you for example navigate through several months in
> KOrganizer then you would see an empty month at first and the events would
> pop up later. Not a very user-friendly solution. It would also mean that if
> you open an editor there would be a delay until all the data is loaded, so
> that you would start with an empty, disabled editor and the content of the
> fields would be filled in later until you are finally able to use the
> editor. Not pretty.

I've been thinking about the mail side of things a bit and have come to the conclusion that something like a proxy or facade object is definitely needed for mail. We currently have three sorts of pointers to messages, and then two flags per message that signal the state of their "completeness". This is not sufficient, and the fact that pointers to messages go away and are replaced by something else is a major problem and our number one source of crashes. Yet the reason for this design was the need to have something extremely lightweight to represent a message until more information is needed, because otherwise a folder with 10000 or 100000 mails would become completely unusable.

Having the to-be-lazy-loaded information readily enough available that the user perceives no or only little delay when requesting it is of course a challenge, and in the presence of across-the-network retrieval also has physical limitations, but online IMAP in KMail, which already works like that to an extent, proves that it can be done. Caching could be a lot better, but more on that later.

Before I go into details, a couple of general comments: I think a design meeting is a great idea, I would welcome it. I agree that this design is crucial, should be very well thought out, and not rushed in any way. We need to get this right. I agree that we should look at EDS and also other solutions, they must have solved many of the same problems. If compatibility with EDS seems achievable, I would consider that a worthy goal, but not if it hurts our power or flexibility.

Braindump of my musings on mail storage thus far, in no particular order:

- mails are identified by a globally unique serial number (whether to expose that outside of KMail or use a URI scheme for that - possibly including the sernum - is a separate discussion)
- there is a one-to-one mapping between the serial number and a ref-counted pointer to a Message object, which is initially an empty skeleton, containing no information beyond the serial number
- internally, the mail store holds mappings of serial number, storage URL and cache URL
- the Message API allows retrieval of those parts of the mail that are needed, such as Envelope (what is needed for display in the headers list), Headers, body parts, etc.
  If they are in the cache, they come from there, otherwise from the storage location (server)
- access to all of these parts is asynchronous, with possibly synchronous convenience wrappers where access needs to be immediate for performance reasons and can reasonably be expected to be immediate, such as envelope requests
- caching policies, which can apply to accounts, folders, even messages, govern how much information of a mail is locally present, and how much of the lazy-loaded information that isn't, initially, is kept around. This allows scenarios such as "in this folder, don't download anything from the imap server beyond the envelope, but if I look at the mail, keep the bodies around in the cache", or "sync everything for this account, but not attachments, and not mails over 5 MB on mailcheck or mails in my SPAM folder"
- messages (sernums) can have an arbitrary set of category flags associated with them, a la GMail labels, references to other PIM data, via URIs maybe
- storage folder location can be used as one (but not the only) grouping criterion, possibly modelled as a category flag, internally
- local mail (cache) storage is in maildir format, a local maildir account is simply one with cache URL == storage URL (implementation detail)
- the current folderstorage subclasses become machines for mapping storage URL to cache URL and shifting data from one to the other on request
- the internal mapping of sernum, storage URL, cache URL, category flags and performance-critical envelope data (what used to be the index) is stored in a relational database, such as SQLite, which provides central, transactional, integrity-guaranteed access to that information through the API (implementation detail)
- I imagine access to all of this via a libemailstorage (or even libpimdata, or something) which dishes out handles to read-only (vast majority, for mail) and read-write instances of mails, handles locking, copy-on-write, etc.
  Whether that is implemented via a server process, which the lib talks to, or by concurrent access to the above-mentioned database is a yet-to-be-resolved implementation detail, and mostly orthogonal to the storage layer API, I believe

Open questions:

- how do accounts fit in? Should an account be a set of credentials for access to a set of storage URLs plus a set of attributes, such as cache policies, and managed by a pim-wide entity? How about connection tracking, is that orthogonal?
- are all of the special features of certain server types (IMAP, Groupwise, HTTPMail, etc.) integrable into such a scheme? Things like quota, ACLs, etc.
- what should the query language look like? A special API, aware of mail semantics? Should URI schemes (mail:/#12345/headers/from, mail:/#12345/body/attachment) be used, SQL, IMAP?
- how to integrate this with Interview? Should folders be filtered (proxy) models on a global mailstore model? Sorting and threading as sorted (proxy) models on top of that? How much of that should be in the library, and how much in KMail? Does it make sense to be able to display a folder in any QAbstractItemView?
- probably many more ...
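
The skeleton-message and cache-first lookup ideas from the mail can be sketched roughly as follows. This is only an illustration under stated assumptions, not KMail code: `Backend`, `MailStore`, `Message`, `Envelope`, the `keepInCache` flag and the URLs used with them are all hypothetical, the in-memory `Backend` stands in for both the maildir cache and a server, and only the synchronous envelope convenience accessor is shown (the mail proposes asynchronous access for everything else).

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <utility>

using SerialNumber = std::uint64_t;  // hypothetical type for the "sernum"

struct Envelope {
    std::string from;
    std::string subject;
};

// Hypothetical stand-in for a cache or a storage location, keyed by URL.
class Backend {
public:
    void put(const std::string &url, Envelope env) { data_[url] = std::move(env); }
    std::optional<Envelope> get(const std::string &url) {
        ++reads_;
        auto it = data_.find(url);
        if (it == data_.end()) return std::nullopt;
        return it->second;
    }
    int reads() const { return reads_; }

private:
    std::map<std::string, Envelope> data_;
    int reads_ = 0;
};

// Skeleton message: holds nothing but the serial number until asked.
class Message {
public:
    explicit Message(SerialNumber sernum) : sernum_(sernum) {}
    SerialNumber serialNumber() const { return sernum_; }
    bool hasEnvelope() const { return envelope_.has_value(); }

private:
    friend class MailStore;
    SerialNumber sernum_;
    std::optional<Envelope> envelope_;
};

class MailStore {
public:
    MailStore(Backend &cache, Backend &storage) : cache_(cache), storage_(storage) {}

    // Records the sernum -> (storage URL, cache URL) mapping; loads no data.
    void registerMessage(SerialNumber sernum, std::string storageUrl, std::string cacheUrl) {
        assert(!storageUrl.empty() && !cacheUrl.empty());
        locations_[sernum] = {std::move(storageUrl), std::move(cacheUrl)};
    }

    // One ref-counted Message per serial number, created empty on first use.
    std::shared_ptr<Message> message(SerialNumber sernum) {
        auto &slot = messages_[sernum];
        if (!slot) slot = std::make_shared<Message>(sernum);
        return slot;
    }

    // Envelope lookup: message itself, then cache, then storage.
    // keepInCache models a caching policy deciding whether to keep the result.
    std::optional<Envelope> envelope(Message &msg, bool keepInCache) {
        if (msg.envelope_) return msg.envelope_;
        auto loc = locations_.find(msg.sernum_);
        if (loc == locations_.end()) return std::nullopt;
        auto env = cache_.get(loc->second.cacheUrl);
        if (!env) {
            env = storage_.get(loc->second.storageUrl);
            if (env && keepInCache) cache_.put(loc->second.cacheUrl, *env);
        }
        msg.envelope_ = env;
        return env;
    }

private:
    struct Location {
        std::string storageUrl;
        std::string cacheUrl;
    };
    Backend &cache_;
    Backend &storage_;
    std::map<SerialNumber, Location> locations_;
    std::map<SerialNumber, std::shared_ptr<Message>> messages_;
};
```

A caller would register a message's locations once, obtain the shared `Message` via `message(sernum)`, and only trigger a cache or server read when it calls `envelope()`. Because every caller gets the same ref-counted object for a given serial number, the pointer never has to be replaced when more data arrives, which is the crash source the mail describes.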