Crowley Code! 
 (Take 12)

Haystack and Whoosh notes 2009/04/26

Real search is always better than running LIKE queries against MySQL, so today I picked up Haystack [1] and Whoosh [2].  I chose this combination for its low barrier to entry and its easy upgrade path, should one be required.  Both are pure Python and speak setup.py.

The first problem I ran into has actually been fixed but not committed.  The Gist embedded below patches Whoosh as recommended in the bug report [3].  The bug manifests itself as “IOError: [Errno 24] Too many open files” when you try to load even modestly sized datasets all at once.  I can’t make my laptop give me more than 8192 file descriptors and my Slice will only give me 1024 so I could never see just how bad things got on a 1.5 million row sample.  With the patch, though, everything is golden.
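
To see what file descriptor limit a Python process is actually running under, the standard library's resource module will report it.  This is a minimal sketch for Unix-like systems only; fd_limits and describe are names made up for illustration and have nothing to do with Whoosh itself.

```python
import resource

def fd_limits():
    """Return the (soft, hard) limits on open file descriptors."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)

def describe(limit):
    """Render a limit as text, treating RLIM_INFINITY as 'unlimited'."""
    if limit == resource.RLIM_INFINITY:
        return "unlimited"
    return str(limit)

if __name__ == "__main__":
    soft, hard = fd_limits()
    print("soft limit: %s, hard limit: %s" % (describe(soft), describe(hard)))
```

The soft limit is the one that triggers Errno 24; an unprivileged process can raise it only as far as the hard limit.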

The second and last problem I encountered was more of a documentation problem.  Some of the official tutorial is a bit overkill, so here’s the fastest get-up-and-go tutorial I can distill:

  1. Add 'haystack' to INSTALLED_APPS in your settings.py.  Also add these two lines to let Haystack know where to keep your Whoosh index files:

    HAYSTACK_SEARCH_ENGINE = 'whoosh'
    HAYSTACK_WHOOSH_PATH = '/path/to/server/writable/directory'

  2. Add two lines to your global urls.py:

    import haystack
    haystack.autodiscover()

  3. Create a file called search_indexes.py next to models.py.  This file will contain model-like classes defining your search schema.  It is important to list every field you will want in your search results (the primary key, for example) in the search schema.  The field defined with document=True and the prepare method determine the searchable data.

    Update: Daniel Lindsley pointed out that Haystack reserves id for itself, so I’ve changed my example to use a slug field.  The same point applies; just don’t use an id field in your subclasses of SearchIndex.

    from haystack import indexes
    from haystack.sites import site
    from models import Foo

    class FooIndex(indexes.SearchIndex):
    	# The document field holds the text that actually gets searched.
    	text = indexes.CharField(document=True)
    	# Fields pulled from the model so they are available on each result.
    	slug = indexes.CharField(model_attr='slug')
    	name = indexes.CharField(model_attr='name')
    	city = indexes.CharField(model_attr='city')
    	state = indexes.CharField(model_attr='state')
    	def prepare(self, obj):
    		self.prepared_data = super(FooIndex, self).prepare(obj)
    		# Make the name the only searchable data.
    		self.prepared_data['text'] = obj.name
    		return self.prepared_data

    site.register(Foo, FooIndex)

    I’ve called the indexable data “text” and use the prepare method to explicitly allow searching by name only.  The official documentation asks for a template file to use during preparation but I think this is overkill.

  4. Replace your old ORM-based search view with something like this:

    from haystack.views import SearchView
    def search(req):
    	return SearchView(template='search.html')(req)

  5. Replace your search page’s template with something like this:

    {% extends 'layout.html' %}
    {% url core.views.search as base %}
    {% block content %}
    <form action="{{ base }}" method="get">
    <h1><label for="query">{% block title %}Search{% endblock %}</label>
    for <input id="query" name="query" type="text" value="{{ query }}" />
    <input type="submit" value="Search" class="button" /></h1>
    </form>
    {% if page.object_list %}
    	<ol start="{{ page.start_index }}">
    	{% for o in page.object_list %}
    		<li><a href="{{ base }}/{{ o.slug }}">{{ o.name }}</a></li>
    	{% endfor %}
    	</ol>
    	<p>Page {{ page.number }} of {{ page.paginator.num_pages }}</p>
    	<ul>
    	{% if page.has_previous %}
    		<li><a href="{{ base }}?query={{ query|urlencode }}&amp;page={{ page.previous_page_number }}">&larr; Previous</a></li>
    	{% endif %}
    	{% if page.has_next %}
    		<li><a href="{{ base }}?query={{ query|urlencode }}&amp;page={{ page.next_page_number }}">Next &rarr;</a></li>
    	{% endif %}
    	</ul>
    {% else %}
    	{% if query %}
    		<p>We couldn’t find anything named <strong>{{ query }}</strong></p>
    	{% endif %}
    {% endif %}
    {% endblock %}

    The view gets the query, a page from a regular Django Paginator and the paginator itself.  A form comes along too but I prefer to ignore this.  If you defined your own User model that lives at req.user, it must implement get_and_delete_messages [4] because django.contrib.auth.models.User leaks into django.core.context_processors a bit.
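
The get_and_delete_messages requirement from step 5 can be met with a stub.  This is a minimal sketch, assuming a custom user class that has no per-user message queue; the User class here is a plain-Python stand-in for illustration, not Django's model and not necessarily what the commit in [4] does.

```python
class User(object):
    """Hypothetical custom user class; a stand-in, not django.contrib.auth's User."""

    def __init__(self, name):
        self.name = name

    def get_and_delete_messages(self):
        # The auth context processor calls this method on the request's user.
        # With no per-user message queue there is never anything to return,
        # so an empty list keeps template rendering from failing.
        return []
```

Returning an empty list means any template loop over messages simply renders nothing.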

Here’s the previously mentioned patch-and-install script (Updated to reflect bugfixes merged into the trunk of Whoosh!):

  1. http://haystacksearch.org/
  2. http://whoosh.ca/
  3. http://trac.whoosh.ca/ticket/25
  4. http://github.com/rcrowley/django-twitterauth/commit/83ba6a07df3e97455abd4ab2b3ffaba7096407bc

Richard Crowley?  Kentuckian engineer who cooks and eats in between bicycling and beering.


© 2009 Richard Crowley.  Managed by Bashpress.