
Projects/Nepomuk/FileIndexing: Difference between revisions

From KDE Community Wiki
Vhanda (talk | contribs)
Nepomuk currently acts as the file indexer for the KDE platform, applications and workspaces. Even though we frequently tout that we are not just a file indexer, we still need to index files properly.
This page attempts to catalogue the file formats Nepomuk supports, and which formats remain.


Revision as of 01:06, 6 November 2012

MimeType         | Status                         | Plugin           | Comments
image/jpeg       | Implemented - Requires Testing | Exiv2Extractor   | No Comments
application/pdf  | Implemented - Requires Testing | PopplerExtractor | No Comments


File indexing solutions

Strigi

As of version 4.9, the KDE software releases use libstreamanalyzer (part of Strigi) to index files. Current problems with Strigi:

  • Difficult to contribute to
  • No documentation
  • Un-maintained
  • Does not reuse libraries
  • Has its own huge parsers for archives, utf, etc.

Roll our own?

Maybe it would be better to roll our own file parsers which are just light wrappers over the existing libraries.
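
A minimal sketch of what such a "light wrapper" could look like. This is only an illustration: the registry, the `extractor_for` decorator, the `extract` function and the metadata key names are all hypothetical, and Python's standard `wave` module stands in for an existing library such as taglib. The wrapper does no parsing of its own; it only maps a mimetype to a small function that calls the library.

```python
import wave
from typing import Callable, Dict

Metadata = Dict[str, object]
Extractor = Callable[[str], Metadata]

# Hypothetical registry: mimetype -> extractor function.
_extractors: Dict[str, Extractor] = {}


def extractor_for(mimetype: str):
    """Decorator that registers a function as the extractor for a mimetype."""
    def decorator(func: Extractor) -> Extractor:
        _extractors[mimetype] = func
        return func
    return decorator


@extractor_for("audio/x-wav")
def extract_wav(path: str) -> Metadata:
    # All parsing is delegated to the existing `wave` library.
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sampleRate": rate,
            "durationSeconds": frames / rate if rate else 0.0,
        }


def extract(mimetype: str, path: str) -> Metadata:
    """Dispatch to the registered extractor; unknown mimetypes yield no metadata."""
    func = _extractors.get(mimetype)
    return func(path) if func else {}
```

Supporting a new format would then mean adding one more small decorated function, rather than writing a parser.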

File Formats

This section lists the different file formats and which of them are supported by the different file indexing solutions.

Images

  • JPEG - Use exiv2 - Strigi also uses exiv2 - currently broken
  • PNG - Strigi rolls its own - detects the application name, color depth and interlace mode as well
  • GIF - there isn't much metadata
  • EXIF
  • TIFF
  • BMP
  • SVG - Strigi stores them as plain text

We could just use exiv2, which covers almost all of these formats. The code would also be very simple.
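
To make the PNG entry above concrete: the color depth and interlace mode live in the fixed-size IHDR chunk at the start of every PNG file. The sketch below reads them using only Python's standard `struct` module. It is not how Strigi or exiv2 do it; `read_png_header` and the key names are hypothetical, and the chunk CRC is deliberately not verified.

```python
import struct

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def read_png_header(data: bytes) -> dict:
    """Return basic properties from the IHDR chunk of PNG data."""
    # 8 bytes signature + 4 length + 4 type + 13 bytes of IHDR data = 29
    if len(data) < 29 or data[:8] != PNG_SIGNATURE:
        raise ValueError("not a PNG file")
    length, chunk_type = struct.unpack(">I4s", data[8:16])
    if chunk_type != b"IHDR" or length != 13:
        raise ValueError("missing IHDR chunk")
    (width, height, bit_depth, color_type,
     _compression, _filter, interlace) = struct.unpack(">IIBBBBB", data[16:29])
    return {
        "width": width,
        "height": height,
        "bit_depth": bit_depth,        # bits per sample, not per pixel
        "color_type": color_type,      # e.g. 2 = truecolor, 6 = truecolor + alpha
        "interlaced": interlace == 1,  # 1 = Adam7 interlacing
    }
```

Only the first 29 bytes of the file are needed, so a caller could pass `open(path, "rb").read(29)`.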

Videos

Strigi uses ffmpeg for everything except ID3, Vorbis and OggS. It also has to seek through the file; it is not clear what that is for.

Overall, we could just use ffmpeg for everything. It is very fast and supports nearly all formats.
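
As a rough illustration of how little code this would need, the sketch below shells out to the `ffprobe` command-line tool that ships with ffmpeg and reduces its JSON output to a few fields. This assumes `ffprobe` is installed and on the PATH; a real indexer would more likely link against the ffmpeg libraries directly. The function names and the summary keys are hypothetical.

```python
import json
import subprocess


def ffprobe_command(path: str) -> list:
    """Build an ffprobe invocation that prints container and stream info as JSON."""
    return ["ffprobe", "-v", "quiet", "-print_format", "json",
            "-show_format", "-show_streams", path]


def summarize(probe: dict) -> dict:
    """Reduce ffprobe's JSON document to a handful of indexable fields."""
    fmt = probe.get("format", {})
    streams = probe.get("streams", [])
    # Take the first video stream, if there is one.
    video = next((s for s in streams if s.get("codec_type") == "video"), {})
    duration = fmt.get("duration")  # ffprobe reports this as a string
    return {
        "container": fmt.get("format_name"),
        "duration_seconds": float(duration) if duration is not None else None,
        "video_codec": video.get("codec_name"),
        "width": video.get("width"),
        "height": video.get("height"),
    }


def probe_file(path: str) -> dict:
    """Run ffprobe on a file and return the summarized metadata."""
    result = subprocess.run(ffprobe_command(path), capture_output=True,
                            text=True, check=True)
    return summarize(json.loads(result.stdout))
```

Keeping `summarize` separate from the subprocess call means the mapping can be checked without ffmpeg being present.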

Audio

  • MP3
  • FLAC
  • WAV

Strigi rolls its own parser for ID3 metadata. We should use TagLib or ffmpeg instead. It seems to handle FLAC and WAV files pretty well.

Documents

  • PDF - Strigi uses its own parser, which works poorly. We should use poppler.
  • ODF - Built into Strigi. We should

Microsoft Formats

  • DOC - OLE 2 Compound Document and Office Open XML - Custom parser by Strigi. What can we use?
  • XLS - http://qt-project.org/wiki/Handling_Microsoft_Excel_file_format
  • Spreadsheet formats

Maybe we can use some LibreOffice or Calligra libraries?

Open document formats

ODF? Custom analyzer by Strigi.

Ebook formats

  • epub - Strigi reuses its ODF parser for epub
  • mobi
  • rtf
  • lrf

We could use libepub. Also check what Okular uses and try using that.

Other

  • lyx
  • tex
  • cbz - Comic books

Archives

  • tar
  • gzip
  • whatever ..

Strigi has its own analyzer for each archive format, but these do not really add any metadata. Each one just adds the type nfo:Archive. We can do the same based on the mimetype.
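
Typing by mimetype could be as small as the sketch below. The set of mimetypes is illustrative only, neither exhaustive nor the list Nepomuk actually uses, and `types_for_mimetype` is a hypothetical name.

```python
# Illustrative archive mimetypes; a real list would be longer.
ARCHIVE_MIMETYPES = frozenset({
    "application/x-tar",
    "application/zip",
    "application/gzip",
    "application/x-bzip2",
    "application/x-7z-compressed",
})


def types_for_mimetype(mimetype: str) -> list:
    """Return the extra types to attach to a file, based on its mimetype alone."""
    if mimetype in ARCHIVE_MIMETYPES:
        return ["nfo:Archive"]
    return []
```

No archive is opened at all, so this is cheap even for very large files.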

Emails

  • mbox format - There was a bug report

Text Files

  • Text files
  • Source Code

ISO images

Add the type based on the mimetype

Executable files

Use the mimetype