I want my filter bubble back

23 Jun 2011

Personalizing search results leads to a situation where you're more likely to see content that you agree with - limiting the diversity of your intellectual "diet". Clearly, we need to get away from this evil. But - can we?

Personalization and Context have been among the most conspicuous buzzwords in information retrieval for quite a while now. As a result, advertising networks (including Google, Bing and a dozen others) track you with cookies and infer from the pages you look at that you like, say, computers, Linux, musical instruments, Java programming, and restaurants. You can see your own profile (at least for Google) here. Other sites have less to choose from and use more direct methods - if you look at a toaster, you will be chased by toaster ads for the next two weeks because, after all, you are interested in toasters (or software synthesizers, or external hard disks...)

I do prefer being chased by normal-sized toaster ads to being chased by extra-large animated rollout Flash non-targeted ads for, say, travel or washing machines or typical spam material (get-rich-quick schemes, YOU HAVE WON, or, in earlier times when it was still something new and exciting, online banking). You can still get the worst of both worlds - lots of annoying ads despite all those powerful ways of targeting Facebook ads.

But what about search engines? DuckDuckGo, a new, privacy-friendly search engine that also has other nice features (HTTPS by default, lots of specialized searches), has a presentation titled Escape your search engine filter bubble!. As a result of personalized search, one person searches for Egypt and finds information on political protests; another searches for the same term and gets travel information. This point is made even more convincingly by Eli Pariser, the inventor of the term "filter bubble", in his TED talk: filtering algorithms that look at our clicking behaviour to predict a useful ranking for the next things to look at expose us to a kind of informational junk food - not just on search engines, but also on social networks and in recommender engines for things we buy or rent.

Blame the algorithms? A commenter on the web page of Pariser's talk claims that we already have the solution: This talk supports why Twitter is such a powerful tool. Twitter will continue to make inroads as a significant news service by providing a platform for unbiased and unfiltered information. But here's the catch: whatever Twitter, RSS or other feeds we subscribe to will never be unbiased or unfiltered information. Because we cannot handle unfiltered information - we want to shut out the scrawny thirteen-year-olds peddling a ripoff of a ripoff of a tutorial for often-used software, the narcissistic bloggers who dig out inane facts to turn the heads of an ADHD-ridden audience (cough yours truly cough), or simply the salesmen who come to peddle the next vacuum cleaner, self-improvement course, or telephone contract.

My own solution to the problem has been to live inside my filter bubble, but occasionally take a peek at MetaFilter, which is thematically relatively broad (or: contains a lot of non-techie stuff) but manages to bring up relevant and often societally important topics and cool stuff.

But I'm not averse to trying out new things, and I did switch my browser's default search engine over to DuckDuckGo's bubble-free service, which presumably chooses information by quality or relevance only. DuckDuckGo offers some tools to make this easier: if you want the JavaDoc for, say, the InputStream class, you can simply type !java InputStream and search only the official Java2SE pages. But you do notice that your searches are rather ambiguous without context. While Google, despite me never having been interested in Michael Collins the Irish leader, the astronaut, or the movie actor, includes the MIT professor Collins at position six, DuckDuckGo has no idea: it gives me three pages on the Irish hero, an additional one on the ultra-marathoner, one for the politician, and one for the therapist - and it makes me scroll down half a metre to find the MIT professor unless I include an additional term such as "parsing".
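Under the hood there is not much magic to the bang syntax: it is just a prefix on the query string. As a small sketch (the !java bang name comes from the example above; the query endpoint is DuckDuckGo's standard one, and the helper function is purely illustrative), one could build such a query URL like this:

```python
from urllib.parse import quote_plus

def ddg_bang_url(bang: str, query: str) -> str:
    """Build a DuckDuckGo URL whose query starts with a !bang prefix,
    which DuckDuckGo turns into a site-restricted search or redirect."""
    return "https://duckduckgo.com/?q=" + quote_plus(f"!{bang} {query}")

print(ddg_bang_url("java", "InputStream"))
# https://duckduckgo.com/?q=%21java+InputStream
```

Opening that URL in a browser behaves like typing !java InputStream into the search box.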

I'll probably switch back to that currently-most-popular search engine from the currently-most-successful engineering-driven web advertising company. Maybe I'll switch to their encrypted version if I feel a bit paranoid that day, or occasionally try out new search engines. I've decided not to give up my filter bubble, while occasionally sending out the odd pseudopod to get hold of new perspectives on known topics, or to learn more about topics in which I'm currently not versed. Please, dear search engine companies, can I have a "surprise me!" button that will lead me to the next interesting survey article on the Stanford Encyclopedia of Philosophy, philpapers, or even something high-quality and relevant that I did not even know about?

Blog posts

Neural Networks are Quite Neat (a rant)
After decades of Neural Network overhype, and a subsequent period of disrespect, Neural Networks have become popular again - for a reason, as they can fit large amounts of data better than the feature-based models that came before them. Nonetheless, people who lived through the first overhyped episode are asking critical questions - the answers to which are (hopefully!) enlightening (more ...)

The brave new world of search engines
In an earlier post, I talked about Google's current search results in terms of personalization, and whether to like it or not. This post takes up another aspect of 2011 Google search: what they do with complex queries. (more...)

Simple Pattern extraction from Google n-grams
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep. (more...)
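The post linked above uses C++ for speed, but the idea behind the extraction can be sketched in a few lines of Python. This sketch assumes the Web 1T-style line format of space-separated tokens, a tab, and a count; the "X of Y" pattern and the sample lines are made up for illustration:

```python
import re

# Assumed n-gram line format: "w1 w2 w3<TAB>count"
PATTERN = re.compile(r"^(\w+) of (\w+)$")  # hypothetical target pattern "X of Y"

def extract(lines):
    """Yield (X, Y, count) for every 3-gram line matching 'X of Y'."""
    for line in lines:
        ngram, _, count = line.rstrip("\n").partition("\t")
        m = PATTERN.match(ngram)
        if m:
            yield m.group(1), m.group(2), int(count)

sample = ["piece of cake\t4210\n", "out of order\t991\n", "in the box\t57\n"]
print(list(extract(sample)))
# [('piece', 'cake', 4210), ('out', 'order', 991)]
```

For bulk extraction with many patterns and lemmatization, a compiled language (as in the post) pays off; the logic stays the same.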

Useful links

Fast dependency parsing
For doing syntactic preprocessing without spending too much time (CPU or engineering) on it, SpaCy and NLP4J should be among the first things to try. SpaCy covers English and German, whereas NLP4J covers only English but is trained on biomedical treebanks (in addition to the WSJ news that everyone trains on), which makes it especially useful for that kind of text. If you're looking to parse French, the Bonsai model collection from the French Alpage group and the Mate Parser from Bernd Bohnet (now at Google) are good first guesses.

Neural Network Toolkits
My favorite toolkit for modeling natural language text using LSTMs and other gadgetry is DyNet, which uses dynamically constructed computation graphs and lets you model recursive neural networks and similar architectures without much fuss. The static network structure of more standard neural network libraries such as TensorFlow trades off flexibility for the ability to join groups of examples in a minibatch (which DyNet allows, but does not enforce), which leads to greater training speed.

Conditional Random Fields
Hanna Wallach has a very useful link collection on Conditional Random Fields. I'd recommend especially her tutorial on CRFs (which is also the introductory part of her MSc thesis) as well as Simon Lacoste-Julien's tutorial on SVMs, graphical models, and Max-Margin Markov Networks (also linked there).

Nice blogs

Language Log
Technologies du Langage
Earning my Turns
Leiter Reports