PIM/Akonadi/SearchPolishing: Difference between revisions

From KDE Community Wiki
(include feedback)
 
(fix linebreaks)
Line 5: Line 5:
===Performance faults while indexing===
===Performance faults while indexing===
====Excessive work per item====
====Excessive work per item====
* FIXED Excessive queries per item kde#289932#c58 [1],
* FIXED Excessive queries per item kde#289932#c58 [1], kde#289932#c87 (754275eda610dce1160286a76339353097d8764c in kde-runtime/4.8)
kde#289932#c87 (754275eda610dce1160286a76339353097d8764c in kde-runtime/4.8)
* Attachments fetched but not effectively indexed. The problem with attachments is that they are indexed by a helper process (nepomukindexer), which needs to final URI of the attachment object. However, what we pass in is the temporary _:xxxx URIs that still need to be resolved by DMS. StoreResourceJob contains the mapping AFAICT, so it's probably just a matter of deferring the indexData() calls until we have the result of that job.
* Attachments fetched but not effectively indexed. The problem with attachments is that they are indexed by a helper process (nepomukindexer), which needs to final URI of the attachment object. However, what we pass in is the temporary _:xxxx URIs that still need to be resolved by DMS. StoreResourceJob contains the mapping AFAICT, so it's probably just a matter of deferring the indexData() calls until we have the result of that job.
* "setting the same icons on mails, their attachments and their tags while indexing; is this necessary? This is commented in non-mail feeder plugins.  The icons are added to have pretty search results.  The (expensive) resource identification only happens when creating new SimpleResource objects, not when setting existing URIs as properties. So, simply caching the icons should fix this.
* "setting the same icons on mails, their attachments and their tags while indexing; is this necessary? This is commented in non-mail feeder plugins.  The icons are added to have pretty search results.  The (expensive) resource identification only happens when creating new SimpleResource objects, not when setting existing URIs as properties. So, simply caching the icons should fix this.
Line 12: Line 11:
=====Failures to index items=====
=====Failures to index items=====
* FIXED Cardinality fault on messageHeader
* FIXED Cardinality fault on messageHeader
http://oscaf.git.sourceforge.net/git/gitweb.cgi?p=oscaf/shared-desktop-
http://oscaf.git.sourceforge.net/git/gitweb.cgi?p=oscaf/shared-desktop-ontologies;a=commitdiff;h=4697389c39b7112aaf0f6ac1a36b216e78ab5e14
ontologies;a=commitdiff;h=4697389c39b7112aaf0f6ac1a36b216e78ab5e14
* FIXED Cardinality fault on PIMO:Persons' propertiesd732592b in kde-runtime/master
* FIXED Cardinality fault on PIMO:Persons' properties
d732592b in kde-runtime/master
=====Redundant reindexing=====
=====Redundant reindexing=====
* kde#289932#c58?
* kde#289932#c58?
====Repeated indexing per collection====
====Repeated indexing per collection====
* FIXED Attempted indexing of collections we cannot index
* FIXED Attempted indexing of collections we cannot index ec4f19eb781514ce0dfc09fe4e9ea4591ecc31e9 in kdepim-runtime/4.8
ec4f19eb781514ce0dfc09fe4e9ea4591ecc31e9 in kdepim-runtime/4.8
* FIXED Mark each collection on completion with indexing level 2729771b765d0bd6e0e03d0a5b055e36bc48944c in kdepim-runtime/master (does this prevent discovery of items changed while feeder was not running?)
* FIXED Mark each collection on completion with indexing level
2729771b765d0bd6e0e03d0a5b055e36bc48944c in kdepim-runtime/master
(does this prevent discovery of items changed while feeder was not running?)
====Indexing interferes with other work====
====Indexing interferes with other work====
* FIXED Hide indexing until user is idle kde#289932#c58
* FIXED Hide indexing until user is idle kde#289932#c58
====Low nominal performance====
====Low nominal performance====
* Eg. 5700 (42MB mbox) kde-core-devel mails in 20 minutes (4.8 items/sec) on
* Eg. 5700 (42MB mbox) kde-core-devel mails in 20 minutes (4.8 items/sec) on Core i7-2620M (4x2.7GHz, HT), idle detection disabled. Not clear what is the bottleneck.  Virtuoso using 80-90% of one core during this.
Core i7-2620M (4x2.7GHz, HT), idle detection disabled. Not clear what is the
* Akonadi->feeder->dbus->nepomukstorage->virtuoso of all mail negates performance advantage of fast Akonadi protocol. Seeing the huge improvement after Sebastian's changes on the resource identification in DMS, I'd guess that this is where most of the time is spent. But that's just gut feeling. If that turns out to be true though, we can probably apply some more clever caching for e.g. email addresses (in a typically folder I'd assume some of them repeat quite often) to avoid running identification on them over and over again. List-Id is another good candidate for that.
bottleneck.  Virtuoso using 80-90% of one core during this.
* Akonadi->feeder->dbus->nepomukstorage->virtuoso of all mail negates
performance advantage of fast Akonadi protocol. Seeing the huge improvement after Sebastian's changes on the resource identification in DMS, I'd guess that this is where most of the time is spend.
But that's just gut feeling. If that turns out to be true though, we can probably apply some more clever
caching for e.g. email addresses (in a typically folder I'd assume some of them repeat quite often) to avoid running identification on them over and over again. List-Id is another good candidate for that.
== Ability to utilise indexing work (working search)==
== Ability to utilise indexing work (working search)==
===Search features that fully use indexed data===
===Search features that fully use indexed data===
* Quicksearch now does fulltext search in 4.9
* Quicksearch now does fulltext search in 4.9
* Indexed: Date, Subject, From, Sender, To, Cc, Bcc, List-Id, Organization,
* Indexed: Date, Subject, From, Sender, To, Cc, Bcc, List-Id, Organization, some X-headers, Status flags, Tags, Important, Todo, Watched, Plain text body. Searchable: Age(days), Subject, From, To, Cc, Reply-To, List-Id, Organization, some X-headers, Status flags, Tags, all headers (probably not useful), message body.  Would be nice to capture List-Id: as mailing list resources in the NMO ontology so we can search explicitly for mails to lists.
some X-headers, Status flags, Tags, Important, Todo, Watched, Plain text body
* No way to search by the actual PIMO Persons/Contacts created by indexing, user must input part of name.
Searchable: Age(days), Subject, From, To, Cc, Reply-To, List-Id, Organization,
some X-headers, Status flags, Tags, all headers (probably not useful), message body.  Would be nice to capture List-Id: as mailing list resources in the NMO ontology so we can search explicitly for mails to lists.
* No way to search by the actual PIMO Persons/Contacts created by indexing,
user must input part of name.
* No way to search attachments or whether something has an attachment
* No way to search attachments or whether something has an attachment
* WIP Till: Composer address auto-completion based on all available Nepomuk data.
* WIP Till: Composer address auto-completion based on all available Nepomuk data.
Line 48: Line 33:
====Server side====
====Server side====
* RE-BROKEN Truncated query strings cause broken search folders (Limit needs to be more than 1024 chars)
* RE-BROKEN Truncated query strings cause broken search folders (Limit needs to be more than 1024 chars)
2.2.2 Client side
====Client side====
* Dialog allows modifying existing search folder by name but fails (modifies
* Dialog allows modifying existing search folder by name but fails (modifies remote id)
remote id)
* Possible to create search in search folders; doesn't work
* Possible to create search in search folders; doesn't work
====Viewing search results changes search results====
====Viewing search results changes search results====
* search on unread message status, messages disappear from search as message
* search on unread message status, messages disappear from search as message preview makes them read
preview makes them read
* Just viewing search results causes some messages to disappear from search collection.  itemChanged currently is handled in the feeder as add/remove. For emails this case can be optimized for the common case of flag/tag changes, as they rarely change content.
* Just viewing search results causes some messages to disappear from search
collection.  itemChanged currently is handled in the feeder as add/remove. For emails
this case can be optimized for the common case of flag/tag changes, as they rarely change content.


===Minimising indexing work===
===Minimising indexing work===
Line 63: Line 44:
* Change default set of indexed folders
* Change default set of indexed folders
* Make it easy to change per folder indexing attribute
* Make it easy to change per folder indexing attribute
* Show indexing status, allow attr change directly in folder selector in
* Show indexing status, allow attr change directly in folder selector in search dialog.
search dialog.
* Indexing all except full text a useful compromise?
* Indexing all except full text a useful compromise?


[1] https://bugs.kde.org/show_bug.cgi?id=289932
[1] https://bugs.kde.org/show_bug.cgi?id=289932

Revision as of 12:12, 19 March 2012

Fixing and Polishing Search

At the Osnabrück PIM Meeting 2012 we started an effort to make search and indexing of PIM data useful. The following tree classifies the work that has been done.

Faults in indexing

Performance faults while indexing

Excessive work per item

  • FIXED Excessive queries per item kde#289932#c58 [1], kde#289932#c87 (754275eda610dce1160286a76339353097d8764c in kde-runtime/4.8)
  • Attachments fetched but not effectively indexed. The problem with attachments is that they are indexed by a helper process (nepomukindexer), which needs the final URI of the attachment object. However, what we pass in are the temporary _:xxxx URIs that still need to be resolved by DMS. StoreResourceJob contains the mapping AFAICT, so it's probably just a matter of deferring the indexData() calls until we have the result of that job.
  • Setting the same icons on mails, their attachments and their tags while indexing; is this necessary? This is commented out in non-mail feeder plugins. The icons are added to have pretty search results. The (expensive) resource identification only happens when creating new SimpleResource objects, not when setting existing URIs as properties. So, simply caching the icons should fix this.
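
The deferral suggested for attachments could look roughly like the sketch below. It is only an illustration in Python, not feeder code: the class, its methods and the example URIs are hypothetical, and the only thing taken from the notes above is the idea of holding back indexData() until the temporary-to-final URI mapping is known.

```python
# Hypothetical sketch: none of these names are Nepomuk or Akonadi API.

class DeferredAttachmentIndexer:
    """Holds attachment payloads until their temporary URIs are resolved."""

    def __init__(self, index_data):
        # index_data stands in for the real indexData() call:
        # a callable taking (final_uri, payload).
        self._index_data = index_data
        self._pending = []  # (temporary_uri, payload) pairs

    def queue(self, temporary_uri, payload):
        """Remember an attachment instead of indexing it immediately."""
        self._pending.append((temporary_uri, payload))

    def on_store_finished(self, mappings):
        """Index everything the finished store job resolved.

        mappings is assumed to be a dict of temporary URI -> final URI.
        Returns the number of attachments that are still unresolved.
        """
        unresolved = []
        for temporary_uri, payload in self._pending:
            final_uri = mappings.get(temporary_uri)
            if final_uri is None:
                unresolved.append((temporary_uri, payload))
            else:
                self._index_data(final_uri, payload)
        self._pending = unresolved
        return len(unresolved)
```

Attachments whose temporary URI is missing from the mapping stay queued, so a later job result can still pick them up.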

Repeated indexing per item

Failures to index items
  • FIXED Cardinality fault on messageHeader

http://oscaf.git.sourceforge.net/git/gitweb.cgi?p=oscaf/shared-desktop-ontologies;a=commitdiff;h=4697389c39b7112aaf0f6ac1a36b216e78ab5e14

  • FIXED Cardinality fault on PIMO:Persons' properties d732592b in kde-runtime/master
Redundant reindexing
  • kde#289932#c58?

Repeated indexing per collection

  • FIXED Attempted indexing of collections we cannot index ec4f19eb781514ce0dfc09fe4e9ea4591ecc31e9 in kdepim-runtime/4.8
  • FIXED Mark each collection on completion with indexing level 2729771b765d0bd6e0e03d0a5b055e36bc48944c in kdepim-runtime/master (does this prevent discovery of items changed while feeder was not running?)

Indexing interferes with other work

  • FIXED Hide indexing until user is idle kde#289932#c58

Low nominal performance

  • E.g. 5700 (42MB mbox) kde-core-devel mails in 20 minutes (4.8 items/sec) on Core i7-2620M (4x2.7GHz, HT), idle detection disabled. It is not clear what the bottleneck is. Virtuoso was using 80-90% of one core during this.
  • Akonadi->feeder->dbus->nepomukstorage->virtuoso of all mail negates performance advantage of fast Akonadi protocol. Seeing the huge improvement after Sebastian's changes on the resource identification in DMS, I'd guess that this is where most of the time is spent. But that's just a gut feeling. If that turns out to be true though, we can probably apply some more clever caching for e.g. email addresses (in a typical folder I'd assume some of them repeat quite often) to avoid running identification on them over and over again. List-Id is another good candidate for that.
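
The caching idea for repeated email addresses and List-Id values amounts to memoising the identification step. A minimal Python sketch follows, assuming identification can be treated as a function from a normalised string to a resource URI. That function, the class name and the normalisation (trimming and lower-casing) are assumptions for illustration, not the DMS interface.

```python
# Hypothetical sketch: identify() stands in for the expensive
# resource identification; it is not a real DMS call.

class IdentificationCache:
    """Runs identification once per distinct address or List-Id."""

    def __init__(self, identify):
        self._identify = identify  # callable(key) -> resource URI
        self._known = {}           # normalised key -> resource URI
        self.lookups = 0           # calls that actually reached identify()

    def resolve(self, value):
        # Assumption: trimming and lower-casing is an acceptable
        # normalisation for addresses and List-Ids in one folder.
        key = value.strip().lower()
        if key not in self._known:
            self.lookups += 1
            self._known[key] = self._identify(key)
        return self._known[key]
```

Comparing `lookups` against the number of `resolve()` calls for a folder would also show whether addresses repeat often enough for the cache to matter, which is the open question in the note above.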

Ability to utilise indexing work (working search)

Search features that fully use indexed data

  • Quicksearch now does fulltext search in 4.9
  • Indexed: Date, Subject, From, Sender, To, Cc, Bcc, List-Id, Organization, some X-headers, Status flags, Tags, Important, Todo, Watched, Plain text body. Searchable: Age(days), Subject, From, To, Cc, Reply-To, List-Id, Organization, some X-headers, Status flags, Tags, all headers (probably not useful), message body. Would be nice to capture List-Id: as mailing list resources in the NMO ontology so we can search explicitly for mails to lists.
  • No way to search by the actual PIMO Persons/Contacts created by indexing, user must input part of name.
  • No way to search attachments or whether something has an attachment
  • WIP Till: Composer address auto-completion based on all available Nepomuk data.

Faults in search

Server side

  • RE-BROKEN Truncated query strings cause broken search folders (Limit needs to be more than 1024 chars)

Client side

  • Dialog allows modifying existing search folder by name but fails (modifies remote id)
  • Possible to create search in search folders; doesn't work

Viewing search results changes search results

  • Search on unread message status: messages disappear from search as the message preview makes them read
  • Just viewing search results causes some messages to disappear from search collection. itemChanged currently is handled in the feeder as add/remove. For emails this case can be optimized for the common case of flag/tag changes, as they rarely change content.
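
The flag/tag optimisation mentioned above boils down to deciding, per change notification, whether a full remove-and-add is needed. The Python sketch below shows only that decision. The part names "FLAGS" and "TAGS" and the returned action strings are made up for illustration, and whether an in-place update actually keeps the item in the search collection has not been verified.

```python
# Hypothetical sketch: part names and action strings are illustrative only.

LIGHTWEIGHT_PARTS = frozenset({"FLAGS", "TAGS"})

def plan_item_change(changed_parts):
    """Pick how the feeder should react to a changed item.

    Returns "update-properties" when only flags and/or tags changed,
    otherwise "remove-and-add" (the current behaviour for every change).
    """
    parts = set(changed_parts)
    if parts and parts <= LIGHTWEIGHT_PARTS:
        return "update-properties"
    # Unknown or empty change sets fall back to the safe full reindex.
    return "remove-and-add"
```

For example, `plan_item_change(["FLAGS"])` selects the lightweight path, while a change set that also includes the message body falls back to remove-and-add.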

Minimising indexing work

Assuming there is no/low demand for search, do less of the expensive indexing.

  • Change default set of indexed folders
  • Make it easy to change per folder indexing attribute
  • Show indexing status, allow attr change directly in folder selector in search dialog.
  • Indexing all except full text a useful compromise?

[1] https://bugs.kde.org/show_bug.cgi?id=289932