eMail fulltext search with Clojure and Lucene in half a dozen lines of code

March 2019 · Last updated 2019-03-24 · 569 words

Recently I moved my eMail from an old server I was managing myself to a cheap but nice shared web host.

The new host unfortunately does not cope well with the 50k+ eMails in my inbox - search does not work. At all. I need search. Search was never great before either - not in Roundcube, not in Thunderbird, not in mutt. I couldn’t get the hang of notmuch.

For a long time I wanted something better all around. Search with boolean terms, resilient against typos, featuring stemming, maybe even coping with synonyms.

I wanted to use this as a learning opportunity and also to try out some technology I had been longing to play with for quite a while: Clojure and Lucene.

After fiddling around for half a day, here is my prototype: fast and feature-rich fulltext IMAP mail search in a couple of SLOC. Yeah, I know, it’s more than the six lines the title implies, but I like to count only lines that actually do something useful.

Indexing my IMAP inbox

Using a lein repl in the project directory I can index my inbox like this. Syntax and coding style might not be what a seasoned Clojure dev would like to see… Please bear with me here, this is the first Clojure I have written, ever.

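;; Refer only the core.async names used below: with :refer :all,
;; clojure.core.async/map would replace clojure.core/map in this namespace
(require '[clucy.core :as clucy]
         '[clojure-mail.core :refer :all]
         '[clojure-mail.message :refer (read-message)]
         '[clojure.core.async :as async :refer [chan thread onto-chan <!!]])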

(defn connect-imap []
  (inbox (mail-store (store "mailserver.net" "username" "password"))))

(defn get-all []
  ;; Nice: map returns a lazy sequence, so messages are only read as they are consumed
  (map read-message (connect-imap)))

;; Create a channel with a buffer capacity of 42 messages
(def c (chan 42))

;; Connect the IMAP client to the channel
(thread (onto-chan c (get-all)))

;; Connect the channel to our search indexer
(def index (clucy/disk-index "/tmp/imap-lucene.idx"))
(thread
  (loop []
    ;; <!! returns nil once the channel is closed and drained, so stop there
    (when-let [msg (<!! c)]
      (clucy/add index msg)
      (recur))))

This creates a buffered channel (very similar to Unix pipes) and connects the IMAP client to one end and the search engine indexer to the other. Both the IMAP client and the indexer run in their own threads; the channel takes care of the synchronization. Voilà!
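
To play with the channel plumbing without a mail server or an index at hand, here is a minimal sketch of the same producer/consumer shape. It assumes only core.async on the classpath; `fake-messages` and the `indexed` atom are made-up stand-ins for the IMAP inbox and the Lucene index, not part of the project.

```clojure
(require '[clojure.core.async :refer [chan thread onto-chan <!!]])

;; Stand-in for the IMAP inbox: a lazy sequence of fake messages (hypothetical data).
(defn fake-messages []
  (map (fn [i] {:subject (str "Mail " i)}) (range 100)))

;; Stand-in for the Lucene index: an atom collecting everything that gets "indexed".
(def indexed (atom []))

;; Small buffer: the producer has to wait once it is 8 messages ahead of the consumer.
(def ch (chan 8))

;; Producer: puts every message on the channel in the background,
;; then closes the channel (onto-chan's default behaviour).
(onto-chan ch (fake-messages))

;; Consumer: take until the channel is closed and drained, i.e. until <!! returns nil.
(def consumer
  (thread
    (loop []
      (when-let [msg (<!! ch)]
        (swap! indexed conj msg)
        (recur)))))

;; thread returns a channel that closes when its body finishes,
;; so taking from it blocks until the consumer is done.
(<!! consumer)
(println (count @indexed) "messages indexed")
```

The buffer size only controls how far the producer may run ahead; with a slow indexer and a fast mail server a larger buffer smooths things out, at the cost of holding more messages in memory.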

By the way: I always loved Hoare’s CSP model - it feels very natural to me.

Isn’t it great we have such powerful tools at our disposal? Software development more and more feels like “throw a few building blocks together, create anything you want”.

Querying

The first example is basically hello world:

;; Search for eMails containing the word "test" anywhere in
;; their headers or their body, return the first ten results
(clucy/search index "test" 10)

This works, but we can do much better than that. The query string can be anything Lucene accepts, which is much more than the simple example above. Here’s what I used after playing with my eMail index for a while:

;; Pretty print: show subject, from and received date of the
;; first 10 hits for a fuzzy search for all mails about
;; festivals in 2018
(pprint (map #(select-keys % [:subject :from :date-received])
              (clucy/search index "festival~ 2018" 10)))
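
To give an idea of what "anything Lucene accepts" covers, here is a small sketch of a few query styles from Lucene's classic query syntax, run against a throwaway in-memory index. The three sample messages and the `subjects` helper are made up for illustration, and I am assuming clucy's default analyzer and query parser here; results on a real mailbox will depend on how the messages were indexed.

```clojure
(require '[clucy.core :as clucy])

;; In-memory index with three made-up messages (hypothetical sample data).
(def demo-index (clucy/memory-index))

(clucy/add demo-index
           {:subject "Festival tickets 2018" :from "Alice"}
           {:subject "Invoice for March"     :from "Billing"}
           {:subject "Festival lineup 2019"  :from "Bob"})

;; Hypothetical helper: run a query and keep only the subjects of the hits.
(defn subjects [query]
  (map :subject (clucy/search demo-index query 10)))

(subjects "festival AND 2018")      ; boolean: both terms must occur
(subjects "festival NOT 2019")      ; boolean: exclude a term
(subjects "festivl~")               ; fuzzy: tolerates the typo
(subjects "subject:invoice")        ; restrict the search to one field
(subjects "fest*")                  ; prefix wildcard
(subjects "\"festival tickets\"")   ; exact phrase
```

Field names in the query are the keys of the indexed map, so on the real index something like `subject:festival~` should narrow a fuzzy search to the subject line.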

The next steps, as I see them, are to add a daemon that indexes eMails as they come in (using IMAP IDLE) and an HTTP API for querying.

Follow the progress or download the whole project on GitHub.