arcusfelis/xapian-erlang-bindings: Xapian binding for Erlang (GSOC2012 project)

Xapian binding for Erlang

License: MIT, GPL2 or higher (Xapian is still under GPL only.)

Author: Uvarov Michael (arcusfelis@gmail.com)

Xapian is an Open Source Search Engine Library, written in C++. Xapian is a highly adaptable toolkit which allows developers to easily add advanced indexing and search facilities to their own applications.

Xapian library

Install the Xapian library itself first. On Gentoo Linux, run emerge dev-libs/xapian.

Installation

I use rebar for building.

Try as a stand-alone Erlang application:

git clone git://github.com/arcusfelis/xapian-erlang-bindings.git xapian
cd xapian
./rebar get-deps compile
./start-dev.sh

Add as a dependency to rebar.config:

{deps, [ {xapian, ".*", {git, "git://github.com/arcusfelis/xapian-erlang-bindings.git", "master"}} ]}.

Google hash map (optional)

You can use Google sparsehash for storing resource ids.

Defining the C++ preprocessor macro GOOGLE_HASH_MAP enables the Google hash map as the hash map implementation.

In the Debian and Ubuntu repositories, the library is packaged as libsparsehash-dev. On Gentoo Linux, run emerge dev-cpp/sparsehash.

Using

This application uses records defined in the file include/xapian.hrl. To include it, use:

-include_lib("xapian/include/xapian.hrl").

Tests

The following command runs the tests:

$ ./rebar eunit skip_deps=true

A pool of readers

Path = filename:join([code:priv_dir(xapian), test_db, simple]).
{ok, Pid} = xapian_pool:open([{name, simple}], Path, []).
result = xapian_pool:checkout([simple], fun([Server]) -> io:write(Server), result end).

Readers use the Poolboy application. There is only one writer for each database, so there is no writer pool. You can use a named process and a supervisor instead:

{ok, Pid} = xapian_server:open(Path, [{name, simple_writer}, write]).
xapian_server:add_document(simple_writer, [#x_text{value = "Paragraph 1"}]).

If you run this code from the Erlang shell, the following command will be useful:

rr(code:lib_dir(xapian, include) ++ "/xapian.hrl").

It loads the record definitions into the shell.

A pool is supervised by xapian_sup. That is why calling the xapian_pool:open function does not link the parent process with the new process.

As with xapian_drv:transaction, you can check out several pools at once.

xapian_pool:checkout([pool1, pool2], fun([Server1, Server2]) -> actions_here end).

If an error occurs, an exception will be thrown and workers will be returned into the pool.

catch xapian_pool:checkout([simple], fun([S]) -> 5 = 2 + 2 end).
{'EXIT',{{badmatch,4},[{erl_eval,expr,3,[]}]}}

Multi-database support

You can use this code for opening two databases from the directories "DB1" and "DB2".

{ok, Server} = xapian_driver:open([#x_database{path="DB1"}, #x_database{path="DB2"}], []).

Only read-only databases can be used.

Two fields hold a document's id: docid and multi_docid. They are equal if only one database is used.

Otherwise, docid contains the document's id within its own database (so it can repeat across databases), and multi_docid is a unique identifier, which is calculated from docid and db_number.

db_number is the number of the document's database counting from 1.
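The README does not spell out how multi_docid is calculated. The sketch below shows the interleaving scheme that Xapian itself uses for multi-database document ids, under the assumption that the binding's db_number follows the same order in which the databases were opened. The module name multi_docid_sketch and its functions are hypothetical and are not part of this application.

```erlang
-module(multi_docid_sketch).
-export([multi_docid/3, split/2]).

%% Assumption: ids are interleaved across DbCount databases, with
%% DocId and DbNumber both counting from 1.
%% With DbCount = 1 the result equals DocId.
multi_docid(DocId, DbNumber, DbCount)
  when DocId >= 1, DbNumber >= 1, DbNumber =< DbCount ->
    (DocId - 1) * DbCount + DbNumber.

%% Inverse mapping: recover {DocId, DbNumber} from a combined id.
split(MultiDocId, DbCount) when MultiDocId >= 1, DbCount >= 1 ->
    DocId = (MultiDocId - 1) div DbCount + 1,
    DbNumber = (MultiDocId - 1) rem DbCount + 1,
    {DocId, DbNumber}.
```

Under this scheme, with two databases, document 4 of the first database gets the combined id 7, and document 1 of the second database gets the combined id 2.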

The db_name field contains the alias of the document's database. It is taken from the name field of the #x_database{} record, which is undefined by default.

Here is a full multi-database example:

%% qlc:q/1 is a parse transform and needs qlc.hrl
-include_lib("stdlib/include/qlc.hrl").
-include_lib("xapian/include/xapian.hrl").

-record(document, {docid, db_name, multi_docid, db_number}).

example() ->
    DB1 = #x_database{name=db1, path="DB1"},
    DB2 = #x_database{name=db2, path="DB2"},
    {ok, Server} = xapian_driver:open([DB1, DB2], []),
    EnquireResourceId = xapian_driver:enquire(Server, "query string"),
    MSetResourceId = xapian_driver:match_set(Server, EnquireResourceId),
    %% Use a record_info call for retrieving a list of field names
    Meta = xapian_record:record(document, record_info(fields, document)),
    Table = xapian_mset_qlc:table(Server, MSetResourceId, Meta),
    %% Collect the multi_docid of every matched document
    qlc:e(qlc:q([DocId || #document{multi_docid=DocId} <- Table])).

Resources

A resource is a C++ object, which can be passed and stored inside an Erlang VM. Each server can have its own set of resources. Resources from other servers cannot be used or controlled. Resources are _not_ automatically garbage-collected, but if a control process (server) dies, all its resources are released.

Use the release_resource(Server, Resource) function call to free a resource which is no longer needed.

A second call of this function with the same arguments will cause an error:

1> Path = filename:join([code:priv_dir(xapian), test_db, simple]).
"/home/user/erlang/xapian/priv/test_db/simple"
2> {ok, Server} = xapian_server:open(Path, []).
{ok,<0.57.0>}
3> ResourceId = xapian_server:enquire(Server, "query").
#Ref<0.0.0.69>
4> xapian_server:release_resource(Server, ResourceId).
ok
5> xapian_server:release_resource(Server, ResourceId).
** exception error: elem_not_found
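Because resources are not garbage-collected and a second release fails, it can help to release each resource exactly once in an after clause. The following is a minimal sketch of that pattern, not part of this application: the module resource_sketch and the function with_resource/3 are hypothetical, and the acquire and release steps are passed in as funs so that the helper does not depend on any particular API.

```erlang
-module(resource_sketch).
-export([with_resource/3]).

%% Acquire: fun(() -> Resource)
%% Release: fun((Resource) -> any())
%% Fun:     fun((Resource) -> Result)
%%
%% Possible use with the calls shown above (Server is an open server):
%%   with_resource(
%%     fun() -> xapian_server:enquire(Server, "query") end,
%%     fun(R) -> xapian_server:release_resource(Server, R) end,
%%     fun(R) -> R end).
with_resource(Acquire, Release, Fun) ->
    Resource = Acquire(),
    try
        Fun(Resource)
    after
        %% Runs once, both on normal return and when Fun raises.
        Release(Resource)
    end.
```

The after clause runs whether Fun returns normally or raises, so the resource is released once in both cases and the exception, if any, still propagates to the caller.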

Using a port

Unlike a linked-in driver, a port runs as a separate OS process, so a crash in it cannot bring down the Erlang VM. The port program will be compiled by rebar.

To run a single server in port mode, use:

{ok, Server} = xapian_driver:open(Path, [port|Params]).

To run all servers in port mode, use:

application:set_env(xapian, default_open_parameters, [port]).
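The same default could presumably also be set at node startup through a standard Erlang configuration file, instead of calling application:set_env by hand. A minimal sketch, assuming the node is started with erl -config sys (the file name sys.config is an assumption; the key is the one used above):

```erlang
%% sys.config: application environment loaded at node startup.
%% Assumption: xapian reads default_open_parameters from its
%% application environment, as the set_env call above suggests.
[
 {xapian, [
   {default_open_parameters, [port]}
 ]}
].
```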

Testing a port

$ erl -pa ./.eunit/ ./../xapian/ebin ./deps/?*/ebin

application:set_env(xapian, default_open_parameters, [port]).
eunit:test({application, xapian}, [verbose]).

Document forms

Conversations:

Helpers

Stand-alone Stemmer

1> {ok, S} = xapian_server:open([],[]).
{ok,<0.79.0>}

2> xapian_helper:stem(S, <<"english">>, "octopus cat").
[#x_term{value = <<"Zcat">>,position = [],frequency = 1},
 #x_term{value = <<"Zoctopus">>,position = [],frequency = 1},
 #x_term{value = <<"cat">>, position = [2], frequency = 1},
 #x_term{value = <<"octopus">>, position = [1], frequency = 1}]

3> xapian_helper:stem(S, <<"english">>, "octopus cats").
[#x_term{value = <<"Zcat">>,position = [],frequency = 1},
 #x_term{value = <<"Zoctopus">>,position = [],frequency = 1},
 #x_term{value = <<"cats">>, position = [2], frequency = 1},
 #x_term{value = <<"octopus">>, position = [1], frequency = 1}]

4> xapian_helper:stem(S, none, "octopus cats").
[#x_term{value = <<"cats">>, position = [2], frequency = 1},
 #x_term{value = <<"octopus">>, position = [1], frequency = 1}]

5> xapian_helper:stem(S, "english", "Zcat").
[#x_term{value = <<"Zzcat">>,position = [], frequency = 1},
 #x_term{value = <<"zcat">>, position = [1], frequency = 1}]

6> xapian_helper:stem(S, "english", "cat octo-cat").
[#x_term{value = <<"Zcat">>,position = [],frequency = 2},
 #x_term{value = <<"Zocto">>,position = [],frequency = 1},
 #x_term{value = <<"cat">>, position = [1,3], frequency = 2},
 #x_term{value = <<"octo">>, position = [2], frequency = 1}]

The "Z" prefix marks a stemmed term. As the output above shows, each word yields the original term with its positions and a "Z"-prefixed stemmed term without positions.

Code examples

Documentation