The brave new world of search engines

14 Aug 2011

In an earlier post, I talked about current Google's search results in terms of personalization, and whether to like it or not. This post takes another aspect of 2011 Google search: what they do with complex queries.

(Executive summary: Since the introduction of Instant Search, Google does weird things with queries longer than two words and I'm among the unhappy few).

Despite my claims that the information that That Big Search Engine has collected about me actually leads to it giving me better search results, I have now spent a couple weeks with DuckDuckGo as my primary search engine on one computer, and Bing as the primary search engine on the other. One reason for this is that Google, through its user modelling prowess, gets thrown off the track pretty quickly whenever your query contains multiple terms (say, three or more) and you actually mean them.

As an example, entering "correct horse battery staple" on Google will give you the XKCD comic about password strength. If you give Google the string "battery staple correct", it will direct you not to the XKCD comic, and also not to battery-driven stapling machines; instead, you get links to descriptions on how to use your laptop battery ("correct battery", without any hint of staples), while Bing gives you a mixture of blog posts on the XKCD comics, battery-driven staplers, and buying batteries at Staples (the office material store). Google effectively ignores any but the first two terms.

The chances that you can do realtime search with complex queries is pretty slim, and Google doesn't take that chance - it just gives you some pages that look like they could be good responses to that query but that don't have anything in common with the terms you entered.

Hooray, we've gone full circle from searching our way through categories (Yahoo, circa 1997) to fulltext search (Altavista, circa 1999), to personalized fulltext search (Google, circa 2006) to a more limited model where you enter some terms and get directed to a category. Other search engines are better at answering multiword queries, but sometimes seem to suffer from a smaller index size, or seem to be led astray by assuming the wrong topic.

What can we learn from it? On one hand, Google apparently still has the biggest index, but they are not showing it off as entering weird queries will just give you random results. Instead, they rely on their own secret sauce of personalized search and topic guessing to give the 90% of average Joes and Janes a search experience that will make them happy (and it's indeed the average Joes and Janes, and not the corpus linguists or the SEO people who will click on all those ads).

The next thing you notice is that there is a separation between search providers and search engines: As of recently, Yahoo uses Microsoft's search, which means that Yahoo search and Bing actually work from a common index, while the search data from the actual users is not shared between the companies. In a similar fashion, DuckDuckGo only maintains a small index of its own, but supplements it with data from Yahoo's BOSS search service, from Bing and from a company called entireweb which also powers the searches of ixquick and a handful of other search engines.

In the late nineties, a related architecture -- one search engine using other companies' search, was realized by so-called meta-search engines which took the results from other engines and aggregated or re-ranked them. With the difference being that now, in 2011, (meta)search engines pay the search providers a considerable amount of money; And, besides search quality, they usually offer additional goodies -- results in manually curated resource (Wikipedia or the Python Package index) that are displayed specially, or (in the case of both DuckDuckGo and IxQuick) a promise of increased privacy.

The Information Retrieval view

Why do we (or at least I do) expect that, if we put in a search query with terms X, Y, and Z, we get a page containing the terms (namely, X, Y, and Z)? It has to with the standard paradigm of fulltext search: You take all pages that have the query terms in them (basically retrieving from the index the document sets that correspond to each search term and intersect them), and then do a ranking of the pages in the intersection, using a variety of methods (with the most basic one, tfidf, ranking pages by the number of occurrences of informative terms from the query).

The usefulness of fulltext search was recognized in the 1960s, in the Cranfield Experiments where different ways of retrieving texts were compared empirically - and it turned out that no scheme for putting documents into categories was better than just comparing the words in the text to those of the query. Most of today's search engines, as well as all desktop search engines such as Lucene, Xapian, or htDig, therefore use indexes containing document sets (called an inverted index) and use these to produce a ranked list of documents containing all query terms.

Besides fancier ways of ranking documents, you can try to normalize terms in a query - by stemming, which tries to normalize morphological variants by stripping off the endings of terms - yielding the same result for both "fly" and "flies"; semantic query expansion adds synonyms, which are however context-dependent and may result in noise, which is why the addition of terms is usually done to influence the ranking but not the basic set of documents.

Google's instant search only seems to be using the first two terms of the query for general searches. Besides being the horror dream of corpus linguists looking for a specific construction, this presumably allows a more efficient processing of queries because you can just cache the result set for any two-term query. Already in this SIGIR 1999 paper [1], the average query length is 2.35, with about two thirds of the non-empty queries having up to two terms.

Only looking at the first two words already works quite well when queries are short enough, but it's worth considering what happens to the additional terms. In some cases, the words simply serve to disambiguate the topic (for example "latex jpeg" may be disambiguated with terms such as "includegraphics" or "pdflatex"); in this case, a soft clustering algorithm like LDA (which is rumored to be part of Google's secret sauce) could use the additional terms to determine suitable topics without having to cache more than the highest-ranked documents for each word1-word2-topic combination. Another important case that can be special-cased consists in "local searches" which include place names: Googling for "jazz disco basingstoke" or "jazz disco oschatz" (i.e., "jazz disco" plus some small European city) works well in my local Google edition, whereas in "jazz disco blumenau", the third term seems to be ignored unless you use the Brazilian Portuguese edition of Google (despite none of the results really having to do with the city).

[1] Silverstein, Henzinger, Marais, Moricz (1999): Analysis of a Very Large Web Search Engine Query Log. SIGIR Forum 33(1), 6-12.

Blog posts

Neural Networks are Quite Neat (a rant)
After decades of Neural Network overhype, and a following time of disrespect, Neural Networks have become popular again - for a reason, as they can fit large amounts of data better than the feature-based models that came before them. Nonetheless, people who lived through the first overhyped episod are asking critical questions - the answers to which are (hopefully!) enlightening (more ...)

The brave new world of search engines
In an earlier post, I talked about current Google's search results in terms of personalization, and whether to like it or not. This post takes another aspect of 2011 Google search: what they do with complex queries. (more...)

Simple Pattern extraction from Google n-grams
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep. (more...)

Useful links

Fast dependency parsing
For doing syntactic preprocessing without spending too much time (CPU or engineering) on it, SpaCy and NLP4J should be among the first things to try. SpaCy covers English and German, whereas NLP4J covers only English, but is trained on biomedical treebanks (in addition to the WSJ news that everyone trains on), which makes it especially useful for that kind of texts. If you're looking towards parsing French, the Bonsai Model collection from the French Alpage group and the Mate Parser from Bernd Bohnet (now at Google) are good first guesses.

Neural Network Toolkits
My favorite toolkit for modeling natural language text using LSTMs and other gadgetry is DyNet, which uses dynamically constructed computation graphs and allows to model recursive neural networks and other gadgetry without much fuss. The static network structure of more standard neural network libraries such as TensorFlow trade off flexibility for the ability to join groups of examples in a minibatch (which DyNet allows, but does not enforce), which leads to greater training speed.

Conditional Random Fields.
Hanna Wallach has a very useful link collection on Conditional Random Fields. I'd recommend especially her tutorial on CRFs (which is also the introductory part of her MSc thesis) as well as Simon Lacoste-Juliens tutorial on SVMs, graphical models, and Max-Margin Markov Networks (also linked there).

Nice blogs

Language Log
Technologies du Langage
Earning my Turns
Leiter Reports