Projects/Nepomuk/WebExtractor

What is Nepomuk Web Extractor

Nepomuk Web Extractor is a project (services and libraries), built on Nepomuk, that is intended to automatically retrieve, semi-automatically review, and apply new information or changes to existing information.

Example: Given the files "House.M.D.s06e02.avi" and "House.S06E02_Broken (Part 2).srt", it should recognize them as the video and the subtitles for the House.M.D. series, season 6, episode 2, create the necessary Nepomuk data, and link them together with the connection "subtitlesFor". (All names are fictional, all the coincidences are accidental.)
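The recognition step in this example can be illustrated with a small sketch. It is not the Web Extractor implementation: the regular expression, the `Episode` tuple, and the prefix-based series comparison are all assumptions made for illustration, and a real plugin would presumably consult a web source to resolve the series name.

```python
import os
import re
from typing import NamedTuple, Optional

# Assumed naming scheme: "<series>.sNNeNN..." with '.', ' ' or '_' before the marker.
EPISODE_RE = re.compile(
    r"^(?P<series>.+?)[. _]s(?P<season>\d{1,2})e(?P<episode>\d{1,2})",
    re.IGNORECASE,
)
# Hypothetical mapping from file extension to the kind of resource.
KINDS = {".avi": "video", ".mkv": "video", ".srt": "subtitles"}


class Episode(NamedTuple):
    series: str
    season: int
    episode: int
    kind: str


def parse_episode(filename: str) -> Optional[Episode]:
    """Extract series, season, episode and kind from a file name, or None."""
    stem, ext = os.path.splitext(filename)
    match = EPISODE_RE.match(stem)
    kind = KINDS.get(ext.lower())
    if match is None or kind is None:
        return None
    return Episode(match["series"].lower(), int(match["season"]),
                   int(match["episode"]), kind)


def is_subtitles_for(sub: Episode, video: Episode) -> bool:
    """Crude heuristic: same season/episode and one series name prefixes the other."""
    same_show = (video.series.startswith(sub.series)
                 or sub.series.startswith(video.series))
    return (sub.kind == "subtitles" and video.kind == "video" and same_show
            and (sub.season, sub.episode) == (video.season, video.episode))


video = parse_episode("House.M.D.s06e02.avi")
subs = parse_episode("House.S06E02_Broken (Part 2).srt")
print(video, subs, is_subtitles_for(subs, video))
```

The two file names yield different series strings ("house.m.d" and "house"), which is why the sketch falls back to a prefix comparison before proposing the "subtitlesFor" link.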

What are the main design ideas of Web Extractor

The main component of Web Extractor is a Decision. A Decision is basically a .diff for Nepomuk. The important components are the Web Extractor Service and the Decision Management Service. The first is used to create Decisions; the second is a review board.

The workflow of Web Extractor is shown in the following pictures. The first image shows how it works now; the second shows how I would like it to work.

Extending Web Extractor

Web Extractor Service is a plugin-based system. In simple terms, every file or resource is processed by all registered plugins that are suitable for it. Every plugin should return zero, one, or more Decisions. Everything after that is handled by the service(s).
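The plugin contract described above can be sketched as follows. This is a minimal illustration, not the actual Web Extractor plugin interface: the class names, `is_suitable`, `decisions_for`, and `run_plugins` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Decision:
    plugin: str        # which plugin proposed the change
    description: str   # human-readable summary for the reviewer


class Plugin:
    """Hypothetical base class; the real interface may differ."""
    name = "plugin"

    def is_suitable(self, url: str) -> bool:
        raise NotImplementedError

    def decisions_for(self, url: str) -> List[Decision]:
        raise NotImplementedError


class SubtitlePlugin(Plugin):
    name = "subtitles"

    def is_suitable(self, url):
        return url.lower().endswith(".srt")

    def decisions_for(self, url):
        return [Decision(self.name, f"link {url} to its video")]


class CoverArtPlugin(Plugin):
    name = "cover-art"

    def is_suitable(self, url):
        return url.lower().endswith(".mp3")

    def decisions_for(self, url):
        return []  # suitable, but nothing was found: zero Decisions


def run_plugins(url: str, plugins: List[Plugin]) -> List[Decision]:
    """Ask every suitable plugin for its Decisions and collect them all."""
    collected = []
    for plugin in plugins:
        if plugin.is_suitable(url):
            collected.extend(plugin.decisions_for(url))
    return collected


PLUGINS = [SubtitlePlugin(), CoverArtPlugin()]
print(run_plugins("file:///tmp/episode.srt", PLUGINS))
```

A plugin never applies anything itself in this sketch; it only returns proposals, which matches the idea that the service(s) handle everything afterwards.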

Workflow of creating a Decision

Here is the algorithm for creating a Decision:

The plugin is called with the URL of the resource to work with.
The plugin explicitly marks some resources as targets. These are the resources that it wishes to apply changes to.
A new in-memory RDF storage is created. Every selected resource, all its properties, all properties of those properties, and so on are recursively copied into this model.
For every selected resource a so-called identification set is created. This set will be used to find the target resource when the Decision is applied. It is necessary because:
1. The URL of the resource may change.
2. The resource itself may change, making the Decision no longer applicable (just as with usual diffs, where changing the source code may make them inapplicable). The identification set therefore remembers the state of the resource at the moment the Decision was created.
Now you have a new model. It is automatically wrapped in a Nepomuk::ResourceManager instance. You make your changes through this ResourceManager as if you were working with the main Nepomuk model.
The system logs every change you make. This is not strictly true, but for simplicity you can think of it this way.
Based on the recorded changes, a diff is created.
Now you should estimate the quality of your diff, give it a description, and possibly provide some other metadata.
The Decision is then simply the generated diff plus your metadata.
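The steps above can be sketched with plain sets of (subject, predicate, object) triples standing in for RDF models. This is a simplified illustration under stated assumptions and does not use the Nepomuk API: `copy_closure`, `identification_set`, `create_decision`, and the dictionary layout of the Decision are hypothetical, and the diff is computed by comparing the model before and after the edit instead of logging individual changes.

```python
def copy_closure(store, roots):
    """Copy each root and, recursively, everything reachable from it."""
    model, todo, seen = set(), list(roots), set()
    while todo:
        node = todo.pop()
        if node in seen:
            continue
        seen.add(node)
        for s, p, o in store:
            if s == node:
                model.add((s, p, o))
                todo.append(o)  # follow properties of properties
    return model


def identification_set(model, resource):
    """Snapshot of the resource's properties, independent of its URL."""
    return frozenset((p, o) for s, p, o in model if s == resource)


def create_decision(store, targets, edit, description, rank):
    before = copy_closure(store, targets)       # private in-memory copy
    ident = {t: identification_set(before, t) for t in targets}
    after = set(before)
    edit(after)                                 # plugin changes only the copy
    return {
        "identification": ident,
        "added": after - before,                # the "diff"
        "removed": before - after,
        "description": description,             # metadata from the plugin
        "rank": rank,                           # assumed quality estimate
    }


store = {
    ("file:/v.avi", "type", "Video"),
    ("file:/s.srt", "type", "Subtitles"),
    ("file:/other.txt", "type", "Text"),
}
snapshot = set(store)
decision = create_decision(
    store,
    ["file:/s.srt", "file:/v.avi"],
    lambda model: model.add(("file:/s.srt", "subtitlesFor", "file:/v.avi")),
    description="link subtitles to video",
    rank=0.9,
)
print(decision["added"])
```

Because the edit touches only the copy, the main store is left unchanged until the Decision is reviewed and applied.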

What happens when a Decision is applied?

First, all target resources are located using the identification sets that were created. If this operation fails, the Decision is marked as invalid.
Then all changes in the diff are applied intelligently. That is, if your diff includes the creation of a new tag "T1" and this tag already exists in Nepomuk, a second instance of tag "T1" won't be created; the existing one is used instead.
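A minimal sketch of these two phases, again using plain triple sets instead of the Nepomuk API. The function names, the Decision layout, and the matching rule (a resource matches when it carries every property in the identification set) are assumptions made for illustration; the real matching logic is likely more elaborate.

```python
def find_resources(store, props):
    """Return all subjects in the store that carry every (predicate, object) in props."""
    subjects = {s for s, _, _ in store}
    return sorted(
        s for s in subjects
        if props <= {(p, o) for s2, p, o in store if s2 == s}
    )


def apply_decision(store, decision):
    """Apply the diff; return False (Decision invalid) if a target cannot be found."""
    mapping = {}
    # Phase 1: locate every target through its identification set.
    for placeholder, ident in decision["identification"].items():
        found = find_resources(store, ident)
        if len(found) != 1:
            return False  # missing or ambiguous target: Decision is invalid
        mapping[placeholder] = found[0]
    # Phase 2: reuse existing resources instead of creating duplicates.
    new_nodes = {s for s, _, _ in decision["added"]} - set(mapping)
    for node in new_nodes:
        props = frozenset((p, o) for s, p, o in decision["added"] if s == node)
        existing = find_resources(store, props)
        mapping[node] = existing[0] if existing else node
    for s, p, o in decision["added"]:
        store.add((mapping.get(s, s), p, mapping.get(o, o)))
    return True


decision = {
    "identification": {"res:target": frozenset({("name", "song.mp3")})},
    "added": {
        ("res:target", "hasTag", "new:tag"),
        ("new:tag", "type", "Tag"),
        ("new:tag", "label", "T1"),
    },
}
# The file was moved after the Decision was created, and tag "T1" already exists.
store = {
    ("file:/music/renamed.mp3", "name", "song.mp3"),
    ("tag:42", "type", "Tag"),
    ("tag:42", "label", "T1"),
}
applied = apply_decision(store, decision)
rejected = apply_decision({("tag:42", "type", "Tag")}, decision)
print(applied, rejected)
```

In this example the target is found despite its changed URL, the existing tag is reused, and the same Decision is rejected against a store where the target is absent.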