Projects/Nepomuk/FileIndexing
Revision as of 18:26, 10 September 2012
Nepomuk currently acts as the file indexer for the KDE platform, applications and workspaces. Even though we frequently tout that we are not just a file indexer, we need to index the files properly.
File indexing solutions
Strigi
As of version 4.9, the KDE software releases use Strigi's libstreamanalyzer to index files. Current problems with Strigi:
- Difficult to contribute to
- No documentation
- Un-maintained
- Does not reuse libraries
Lists the current status of indexing different files.
Roll our own?
File Formats
This section lists the different file formats and which of them are supported by the different file indexing solutions.
Images
- JPEG - Use exiv2 - Strigi also uses exiv2 - currently broken
- PNG - Strigi rolls its own - detects the application name, color depth and interlace mode as well
- GIF - there isn't much metadata
- EXIF
- TIFF
- BMP
- SVG - Strigi stores them as plain text
We could just use exiv2 and cover almost everything. Plus, the code would be super simple.
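For context on what a hand-rolled PNG analyzer has to read: the color depth (bit depth) and interlace mode mentioned above live in the IHDR chunk that directly follows the PNG signature. The following is a minimal sketch in Python using only the standard library. It is not Strigi's code, and the function name `read_png_header` is made up for illustration.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Return the basic IHDR fields from the first bytes of a PNG file.

    Hypothetical helper for illustration; it does not validate the CRC.
    """
    # 8 bytes signature + 8 bytes chunk header + 13 bytes IHDR payload
    if len(data) < 29 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")

    # Every chunk starts with a big-endian length and a 4-byte type.
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing IHDR chunk")

    # IHDR payload: width, height, bit depth, color type,
    # compression method, filter method, interlace method.
    (width, height, bit_depth, color_type,
     _compression, _filter, interlace) = struct.unpack(">IIBBBBB", data[16:29])

    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,
        "color_type": color_type,
        "interlaced": interlace == 1,  # 1 means Adam7 interlacing
    }
```

Reading only the first 29 bytes of a file is enough for these fields. The application name is stored elsewhere (in a text chunk) and is not covered by this sketch.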
Videos
Strigi uses ffmpeg except for ID3, vorbis and OggS. It also has to seek through the file. Not sure what that is for.
Overall, we could just use ffmpeg for everything. It's very fast and pretty much supports all the formats.
Audio
- MP3
- FLAC
- WAV
Strigi rolls its own for id3 metadata. We should use taglib.
Documents
- PDF - Strigi uses its own parser, which is crap. We should use poppler.
- ODF - Strigi inbuilt. We should
Microsoft Formats
- DOC - OLE 2 Compound Document and Office Open XML - Custom parser by Strigi. What can we use?
- XLS - http://qt-project.org/wiki/Handling_Microsoft_Excel_file_format
- spreadsheet formats
Maybe we can use some libreoffice or calligra libraries?
Open document formats
ODF? Custom analyzer by Strigi.
Ebook formats
- epub - Strigi reuses their ODF parser for epub
- mobi
- rtf
- lrf
Check out what Okular uses and try using that.
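One likely reason an ODF parser can be reused for epub is that both formats are ZIP containers that carry a `mimetype` entry naming the actual format. The sketch below, in Python with only the standard library, shows how an indexer could tell the two apart before picking an extractor. The function name `container_mimetype` is hypothetical, and this is an assumption about the approach, not a description of what Strigi or Okular do.

```python
import io
import zipfile
from typing import Optional


def container_mimetype(source) -> Optional[str]:
    """Return the content of the 'mimetype' entry of a ZIP container.

    `source` is a path or a binary file-like object. Returns None when the
    input is not a ZIP file or has no readable 'mimetype' entry.
    """
    try:
        with zipfile.ZipFile(source) as archive:
            return archive.read("mimetype").decode("ascii").strip()
    except (zipfile.BadZipFile, KeyError, UnicodeDecodeError):
        return None
```

An epub file would be expected to yield `application/epub+zip`, and an ODF text document `application/vnd.oasis.opendocument.text`. Formats such as mobi, rtf and lrf are not ZIP based, so they would return None here and need a different code path.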
Other
- lyx
- tex
- cbz - Comic books
Archives
- tar
- gzip
- whatever ..
Strigi has its own analyzer for each archive format, but these do not really add any metadata. They just add the type nfo:Archive. We can do the same based on the mimetype.
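Assigning the type from the mimetype could be as small as a lookup table. The sketch below is in Python for brevity. The set of mimetypes and the function name `type_for_mimetype` are illustrative assumptions; only the `nfo:Archive` type comes from the notes above.

```python
from typing import Optional

# Illustrative, incomplete set of archive mimetypes (an assumption,
# not a list taken from Strigi or Nepomuk).
ARCHIVE_MIMETYPES = {
    "application/x-tar",
    "application/gzip",
    "application/x-gzip",
    "application/zip",
    "application/x-bzip2",
}


def type_for_mimetype(mimetype: str) -> Optional[str]:
    """Return the ontology type to attach to a file, or None if unknown."""
    if mimetype.lower() in ARCHIVE_MIMETYPES:
        return "nfo:Archive"
    return None
```

With a table like this, supporting another archive format means adding one entry instead of writing a new analyzer.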
Emails
- mbox format - There was a bug report
Text Files
- Text files
- Source Code
ISO images
Add the type based on the mimetype
Executable files
Use the mimetype