Skip to content

Building a Search Index in Naiad

by on October 11, 2012

In this post we’ll expand on our string-counting program to develop an analogue of a real-time search index (suggested and prototyped by Russell Power). More concretely, we’ll build a Naiad computation that is a function of two inputs: the first is an updatable collection of documents, and the second is an updatable collection of query strings. The computation is meant to return any (document, query) pairs where all the words in the query appear in the document. Similar code is in NaiadExamples/SearchIndex.cs in the Naiad download.

First off we’ll want to define some inputs. Let’s do

var documents = controller.NewInput<Document>();
var queries = controller.NewInput<Query>();

Notice that we have used new types here: Document and Query. These are structs we have defined to hold text (strings) and identifiers (integers), and in the case of Query a threshold count (an integer). To use new types in Naiad they need to be structs (for now) and need to implement the IEquatable<T> interface, allowing Naiad to tell when it has two that are the same (an important part of differencing). It is also pretty important to override the GetHashCode method to make sure that you have a good distribution of hash codes (things will still work if you don’t, just slowly).

The first thing we’ll want to do with each of these inputs is break down the documents and queries into words, each associated with the original identifier. We’ll re-use the Document and Query types, just keeping single terms rather than whole lines of text:


var dTerms =
      documents.SelectMany(d => d.text
                                 .Split(' ')
                                 .Select(t => new Document(t, d.id)))
               .Distinct();

var qTerms =
      queries.SelectMany(q => q.text
                               .Split(' ')
                               .Select(t => new Query(t, q.id, q.thresh)))
             .Distinct();

SelectMany pulls out each of the terms in both the documents and in the queries, and Distinct ensures that we only keep one representative from each. We don’t want to report a match just because there were five occurences of one of the five query terms!

Now that we have a big pile of terms, some with document identifiers and some with query identifiers, we want to start looking for matches. The easiest way to do that is with a Join. Naiad’s Join is basically the same as LINQ’s Join, which is like an equi-join in SQL. You specify two inputs and a key selector function for each. Each pair of records whose keys match produce an output, determined by applying a result selector to the pair of them.

var matches = qTerms.Join(dTerms,
                          q => q.text,
                          d => d.text,
                          (q, d) => new Match(q.id, d.id, q.thresh));

Going slowly, the two collections are qTerms and dTerms, and their key selector functions look at their corresponding terms. Whenever we find a match we apply the result selector, which creates a new instance of the Match type, another struct that holds three integers for us (query id, doc id, and query threshold).

The next thing to do is count up the number of matches for each query and document pair. Once we’ve counted each pair, we’ll keep those with enough matches and print them to the screen. We’re using the Count result selector to subtract the count from the threshold, then looking for non-positive thresholds.

matches.Count(x => x, (x, i) => new Match(x.qid, x.did., x.thresh - i))
       .Where(x => x.thresh <= 0)
       .Subscribe(l => { foreach (var x in l) Console.WriteLine(x); });

To see how this works, we’ll pre-load the documents collection with a bunch of fake documents, using integers as a source for our strings. If you have a more interesting source of text, this would be a great place to plug it in.

documents.OnNext(GenerateFakeDocuments(1000000));
queries.OnNext();

Notice that even though we don’t have any queries to submit, we still need to call OnNext on it. Naiad promises to call any subscribe functions once per input batch (“epoch”, to Naiad); if one input is holding out on presenting input epochs, Naiad will patiently wait around expecting some input that isn’t going to arrive. Likewise, when we start submitting queries we’ll need to tick the documents collection too, even if we don’t have any new documents.

var queryCount = 0;
var line = Console.ReadLine();
while (line != "")
{
   queries.OnNext(new Query(line, queryCount++, line.Split(' ').Length));
   documents.OnNext();

   line = Console.ReadLine();
}

Well this sure is fun! But the real measure of a real-time index isn’t whether it can keep up with a human typing at a console. We want to see how fast this goes when we connect it to a serious source of queries. Let’s imagine instead of that interactive stuff above we wrote the following:

var batchSize = 100;
var iterations = 10000;

var stopwatch = System.Diagnostics.Stopwatch.StartNew();

for (int i = 0; i < iterations; i++)
{
   var next = new Query[batchSize];
   for (int j = i * batchSize; j < i * batchSize + next.Length; j++)
      next[j] = new Query(String.Format("{0}", j % documentCount), j, 1);

   queries.OnNext(next);
   documents.OnNext();
   controller.Sync(queries.CurrentEpoch);
}

stopwatch.Stop();

Console.WriteLine("Queries per second: {0}", iterations * batchSize / (stopwatch.ElapsedMilliseconds / 1000.0));

When run on my laptop this comes out to about 60Kq/s, which is a pretty decent number. Of course, the text corpus isn’t as skewed as real text is, and the queries are pretty silly too (single terms). And if you know anything about real search indices, you know that there is a lot more cleverness that goes into them than the program above. Nonetheless, with fairly simple tools, Join and Count, we were able to put together a simple search index that responds with low latency, has high throughput, and even supports incrementally updating the document collection (try it! it turns out we actually built a pub-sub search index).

I hope this starts to give a sense for the breadth of programs you can write with just a few tens of lines of code in Naiad. Coming up next, we’ll start writing Naiad programs with loops in them, determining the connected components structure of a graph in a scalable, interactive fashion.

Advertisements

From → Naiad

Leave a Comment

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: