The brave new world of search engines

14 Aug 2011
In an earlier post, I talked about Google's current search results in terms of personalization, and whether to like it or not. This post looks at another aspect of Google search in 2011: what it does with complex queries. For a more current perspective, see this presentation (by Will Critchlow) from 2013. (Read more)

Simple Pattern extraction from Google n-grams

02 Jul 2011
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep. (Read more)
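
As a rough illustration of what bulk pattern extraction can look like, here is a minimal sketch in Python (not the C++ code from the post). It assumes each input line is tab-separated with the n-gram first, the year second and the match count third; the function name `extract_pattern` and the example pattern are made up for illustration, and lemmatization is left out.

```python
import re
from collections import Counter


def extract_pattern(lines, pattern):
    """Sum match counts over all years for every n-gram matching `pattern`.

    Assumption: each line is tab-separated as n-gram, year, match count,
    possibly followed by further count columns that are ignored here.
    """
    regex = re.compile(pattern)
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip lines that do not have the assumed columns
        ngram, _year, match_count = fields[:3]
        match = regex.fullmatch(ngram)
        if match:
            # Key by the captured slots so counts from different years add up.
            totals[match.groups()] += int(match_count)
    return totals


# Hypothetical sample lines in the assumed format.
sample = [
    "cats and dogs\t1999\t12\t10\t8\n",
    "cats and dogs\t2000\t5\t5\t4\n",
    "salt and pepper\t2000\t7\t7\t6\n",
    "cats or dogs\t2000\t3\t3\t3\n",
]
for slots, count in extract_pattern(sample, r"(\w+) and (\w+)").most_common():
    print(slots, count)
```

Because the function takes any iterable of lines, it can be fed from an open file object, so a whole file never has to be held in memory.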

I want my filter bubble back

23 Jun 2011
Personalizing search results leads to a situation where you're more likely to see content that you agree with - limiting the diversity in your intellectual "diet". Clearly, we need to get away from this evil. But - can we? (Read more)

Injecting a TCP server

17 Jun 2011
You can network-enable any program by dynamically adding appropriate code that acts as a TCP server and redirects the program's standard input and output. (Read more)
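
The redirection half of that idea can be sketched without any injection at all. The following Python snippet is a simplified stand-in, not the technique from the post: instead of adding code to a running program, it starts the program as a child process whose standard input and output are the accepted TCP connection. It assumes a POSIX system, and the names `make_listener` and `serve_once` are invented for this example.

```python
import socket
import subprocess


def make_listener(host="127.0.0.1", port=0):
    """Create a listening TCP socket; port 0 lets the OS pick a free port."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen(1)
    return listener


def serve_once(command, listener):
    """Accept one client and run `command` with stdin/stdout on that socket.

    Returns the exit code of the child process.
    """
    conn, _addr = listener.accept()
    with conn:
        # Handing over the socket's file descriptor makes the child read
        # from and write to the TCP connection instead of the terminal.
        child = subprocess.Popen(
            command, stdin=conn.fileno(), stdout=conn.fileno()
        )
        return child.wait()
```

For instance, `serve_once(["cat"], make_listener(port=5000))` would echo back whatever a single client sends to port 5000 (the port number is arbitrary). Injecting the server into an existing binary, as the post describes, achieves the same redirection without changing how the program is started.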

Blog posts

Neural Networks are Quite Neat (a rant)
After decades of Neural Network overhype, and a subsequent period of disrespect, Neural Networks have become popular again - for a reason, as they can fit large amounts of data better than the feature-based models that came before them. Nonetheless, people who lived through the first overhyped episode are asking critical questions - the answers to which are (hopefully!) enlightening. (more...)

Useful links

Fast dependency parsing
For doing syntactic preprocessing without spending too much time (CPU or engineering) on it, spaCy and NLP4J should be among the first things to try. spaCy covers English and German, whereas NLP4J covers only English, but is trained on biomedical treebanks (in addition to the WSJ news that everyone trains on), which makes it especially useful for that kind of text. If you want to parse French, the Bonsai Model collection from the French Alpage group and the Mate Parser from Bernd Bohnet (now at Google) are good first guesses. If you have a suitable treebank at hand and want neural network parsing, you might as well try UDPipe and its Parsito parser (for speed) or the BiLSTM graph-based parser by Eliyahu Kiperwasser and Yoav Goldberg (for accuracy). If you want to spend a day (or more) using exotic build tools and specific outdated versions of TensorFlow, you can also try SyntaxNet, which requires substantially more model-specific parameter tuning than Kiperwasser and Goldberg's parser. (Don't use the German SyntaxNet model, which is trained on the tiny German Universal Dependencies treebank and not on one of the many large treebanks that exist for German.)

Neural Network Toolkits
My favorite toolkit for modeling natural language text using LSTMs and other gadgetry is DyNet, which uses dynamically constructed computation graphs and makes it possible to model recursive neural networks and similar structures without much fuss. The static network structure of more standard neural network libraries such as TensorFlow trades flexibility for the ability to join groups of examples into a minibatch (which DyNet allows, but does not enforce); minibatching in turn leads to greater training speed.

Conditional Random Fields
Hanna Wallach has a very useful link collection on Conditional Random Fields. I'd especially recommend her tutorial on CRFs (which is also the introductory part of her MSc thesis) as well as Simon Lacoste-Julien's tutorial on SVMs, graphical models, and Max-Margin Markov Networks (also linked there).

Nice blogs

Language Log
NLPers
hunch.net
Technologies du Langage
Earning my Turns
Leiter Reports