Apache ManifoldCF (CF meaning Connector Framework) provides a framework for extracting content from multiple repositories, enriching it with document-level security information, and outputting the resulting documents into Solr, based on the security model found in Microsoft's Active Directory platform. Working with ManifoldCF requires an understanding of how four pieces interact: extracting content from repositories via a Repository Connector, outputting the documents and security tokens into Solr via an Output Connector, listing a specific user's access tokens via an Authority Connector, and finally performing a search that filters the document results based on that list of tokens. As content and its security classifications are updated in the underlying repositories, ManifoldCF keeps Solr in sync, either on a schedule or through continuous monitoring. Finally, it has a convenient web UI for managing the connector states.
ManifoldCF provides connectors that index content into Solr from a number of enterprise content repositories, including SharePoint, Documentum, Meridio, LiveLink, and FileNet. Competing with DataImportHandler and Nutch, ManifoldCF also crawls web pages, RSS feeds, JDBC databases, remote Windows shares, and local filesystems, adding document-level security tokens where applicable. Also of note is its MediaWiki connector. The most compelling use case for ManifoldCF is leveraging Active Directory to provide access tokens for content indexed in Microsoft SharePoint repositories, followed by simply gaining access to content in the other enterprise content repositories.
While the sweet spot for using ManifoldCF is with an authority like Active Directory, we're going to reuse our MusicBrainz.org data and come up with a simple scenario for playing with ManifoldCF and Solr. We will use our own MusicBrainzConnector class to read in data from a simple CSV file that contains a MusicBrainz ID for an artist, the artist's name, and a list of music genre tags for the artist:
4967c0a1-b9f3-465e-8440-4598fd9fc33c,Enya,folk,pop,irish
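To make the record layout concrete, here is a minimal sketch of how one such line could be split into its three parts. It is not the MusicBrainzConnector code (which ships with the examples); the function name parse_artist_line is hypothetical and used only for illustration.

```python
import csv
from io import StringIO


def parse_artist_line(line):
    """Split one CSV record into (MusicBrainz ID, artist name, genre tags).

    Assumes the layout described in the text: the first two columns are
    the ID and the name, and every remaining column is a genre tag.
    """
    row = next(csv.reader(StringIO(line)))
    mbid, name, *genres = [field.strip() for field in row]
    return mbid, name, genres


mbid, name, genres = parse_artist_line(
    "4967c0a1-b9f3-465e-8440-4598fd9fc33c,Enya,folk,pop,irish"
)
print(mbid, name, genres)
```

Using the csv module rather than a plain split keeps the sketch safe for artist names that contain quoted commas.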
The data will be streamed through ManifoldCF and out to our /manifoldcf Solr core, with the list of genres used as the access tokens. To simulate an Authority service that translates a username to a list of access tokens, we will use our own GenreAuthority. It will take the first character of the supplied username and return a list of genres that start with the same character. So a call to ManifoldCF for the username [email protected] would return the access tokens pop and punk. A search for "Chris" would match on "Chris Isaak" since he is tagged with pop, but "Chris Cagle" would be filtered out since he plays only American and country music.
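The token lookup and the resulting filtering can be sketched in a few lines. This is only an illustration of the rule described above, not the GenreAuthority source; the function names, the username paula, and the small genre and artist tables are all assumptions made up for the example.

```python
def tokens_for_user(username, known_genres):
    """Return the genres that start with the same letter as the username."""
    if not username:
        return []
    first = username[0].lower()
    return sorted(g for g in known_genres if g.lower().startswith(first))


def is_visible(doc_genres, user_tokens):
    """A document is visible when it shares at least one token with the user."""
    return any(genre in user_tokens for genre in doc_genres)


# Illustrative data only; the real genre list comes from the CSV file.
GENRES = ["american", "country", "folk", "irish", "pop", "punk"]
ARTISTS = {
    "Chris Isaak": ["pop"],
    "Chris Cagle": ["american", "country"],
}

tokens = tokens_for_user("paula", GENRES)  # hypothetical username starting with "p"
for artist, genres in ARTISTS.items():
    print(artist, is_visible(genres, tokens))
```

With a username that starts with p, only Chris Isaak passes the check; a username that starts with c would receive the country token and see Chris Cagle instead.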
Browse the source for both MusicBrainzConnector and GenreAuthority in ./examples/9/manifoldcf/connectors/ to get a better sense of how specific connectors work with the greater ManifoldCF framework.
To get started, we need to add some new dynamic fields to our schema in cores/manifoldcf/conf/schema.xml:
<dynamicField name="allow_token_*" type="string" indexed="true" stored="true" multiValued="true"/>
<dynamicField name="deny_token_*" type="string" indexed="true" stored="true" multiValued="true"/>
These rules will allow the Solr output connector to store access tokens in fields such as allow_token_document and deny_token_document.
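As a rough illustration of what these rules cover, the sketch below checks field names against the two trailing-wildcard patterns and builds the kind of document that could end up in the index. The helper names and the id and name fields are assumptions for the example; the actual document layout is decided by the Solr output connector.

```python
DYNAMIC_PATTERNS = ["allow_token_*", "deny_token_*"]


def matches_dynamic_field(field_name, patterns=DYNAMIC_PATTERNS):
    """True if field_name matches a trailing-wildcard pattern such as allow_token_*."""
    return any(
        pattern.endswith("*") and field_name.startswith(pattern[:-1])
        for pattern in patterns
    )


def to_solr_doc(mbid, name, genres):
    """Build an illustrative document with the genres stored as allow tokens.

    The "id" and "name" field names are assumptions for this sketch.
    """
    return {"id": mbid, "name": name, "allow_token_document": list(genres)}


doc = to_solr_doc("4967c0a1-b9f3-465e-8440-4598fd9fc33c", "Enya", ["folk", "pop", "irish"])
print(doc)
print(matches_dynamic_field("allow_token_document"))
```

Because both fields are declared multiValued, each genre is stored as a separate value rather than as one comma-separated string.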
Now we can start up ManifoldCF. The version distributed with this book is a stripped-down version, with just the specific connectors required for this demo! In a separate terminal window, from the ./examples/9/manifoldcf/example directory, run the following command:
>>java -jar start.jar
ManifoldCF ships with Jetty as a servlet container, hence the very similar start command to the one Solr uses!
Browse to http://localhost:8345/mcf-crawler-ui/ to access the ManifoldCF user interface, which exposes the following main functions:
manifoldcf Solr core.
http://localhost:8345/mcf-authority-service/[email protected] and verifying you receive a list of genre access tokens starting with the letter p.
Go ahead and choose the Status and Job Management screen and trigger the indexing job. Click on Refresh a couple of times, and you will see the artist's content being indexed into Solr. To see the various genres being used as access tokens, browse to:
http://localhost:8983/solr/manifoldcf/select?q=*:*&facet=true&facet.field=allow_token_document&rows=0
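If you would rather script this check than paste the URL into a browser, the facet request can be assembled as in the sketch below. The helper names are hypothetical. The second helper assumes you ask for JSON output (for example by passing wt=json), where Solr by default returns facet values as a flat list of alternating terms and counts; the sample counts are made up for illustration.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8983/solr/manifoldcf/select"


def facet_url(field, base=SOLR_SELECT, extra=()):
    """Build a match-all query that returns only facet counts for one field."""
    params = [("q", "*:*"), ("facet", "true"), ("facet.field", field), ("rows", "0")]
    params.extend(extra)
    # Keep "*" and ":" readable so the result matches the URL shown in the text.
    return base + "?" + urlencode(params, safe="*:")


def pair_facet_counts(flat):
    """Turn a flat [term, count, term, count, ...] list into (term, count) pairs."""
    return list(zip(flat[0::2], flat[1::2]))


print(facet_url("allow_token_document"))
print(pair_facet_counts(["pop", 3, "punk", 1]))  # illustrative counts, not real data
```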
At the time of writing, neither ManifoldCF nor Solr has a component that hooks ManifoldCF-based permissions directly into Solr. However, based on the code from the ManifoldCF in Action manuscript, available at http://code.google.com/p/manifoldcfinaction/, you can easily add a Search Component to your request handler. Add the following code to solrconfig.xml:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
  <arr name="components">
    <str>manifoldcf</str>
  </arr>
</requestHandler>
<searchComponent name="manifoldcf" class="org.apache.manifoldcf.examples.ManifoldCFSecurityFilter">
  <str name="AUTHENTICATED_USER_NAME">username</str>
</searchComponent>
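Conceptually, a security component like this turns the user's access tokens into an extra filter on the token fields. The sketch below shows one way such a filter string could be assembled. It is an assumption about the general technique, not the actual logic of ManifoldCFSecurityFilter, and the function names are hypothetical.

```python
def quote_term(token):
    """Wrap a token in double quotes, escaping backslashes and quotes."""
    escaped = token.replace("\\", "\\\\").replace('"', '\\"')
    return '"' + escaped + '"'


def security_filter(tokens,
                    allow_field="allow_token_document",
                    deny_field="deny_token_document"):
    """Build a filter string: must carry an allowed token, must not carry a denied one."""
    if not tokens:
        # A user with no tokens should see nothing.
        return "-*:*"
    terms = " OR ".join(quote_term(t) for t in tokens)
    return "+{0}:({2}) -{1}:({2})".format(allow_field, deny_field, terms)


print(security_filter(["pop", "punk"]))
```

A string like this could be applied as a filter query so that it restricts the result set without affecting relevancy scoring.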
You are now ready to perform your first query! Do a search for Chris, specifying your username as [email protected], and you should see only pop and punk music artists being returned:
http://localhost:8983/solr/manifoldcf/select?q=Chris&[email protected]
Change the username parameter to [email protected] and Chris Cagle, a country singer, should be returned! As documents are added to or removed from the CSV file, ManifoldCF will notice the changes and reindex the updated content.