Neural Networks are Quite Neat (a rant)
After decades of Neural Network overhype, and a subsequent period of disrespect, Neural Networks have become popular again - for good reason, as they can fit large amounts of data better than the feature-based models that came before them. Nonetheless, people who lived through the first overhyped episode are asking critical questions - the answers to which are (hopefully!) enlightening
(more ...)
The brave new world of search engines
In an earlier post, I talked about Google's current search results in terms of personalization, and whether or not to like it. This post looks at another aspect of Google search as of 2011: what they do with complex queries. For a more current perspective, see this presentation (by Will Critchlow) from 2013.
(more...)
Simple Pattern extraction from Google n-grams
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep.
(more...)
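The post's extractor is written in C++; purely as an illustration of the general idea (not the code from the post), here is a minimal Python sketch that scans bzip2-compressed n-gram files and aggregates counts for n-grams matching a pattern. The assumed file layout (one tab-separated ngram/count record per line) and the example pattern are simplifications for illustration; lemmatization is left out.

    import bz2
    import re
    import sys
    from collections import Counter

    # Assumed layout: one "<ngram>\t<count>" record per line; adjust the
    # field indices to the n-gram release you are actually using.
    PATTERN = re.compile(r"^\S+ (of|for) the \S+$")  # made-up example pattern

    counts = Counter()
    for path in sys.argv[1:]:
        with bz2.open(path, "rt", encoding="utf-8", errors="replace") as f:
            for line in f:
                fields = line.rstrip("\n").split("\t")
                if len(fields) < 2:
                    continue
                ngram, count = fields[0], int(fields[1])
                if PATTERN.match(ngram):
                    counts[ngram] += count

    for ngram, count in counts.most_common(20):
        print(count, ngram)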
Fast dependency parsing
For doing syntactic preprocessing without spending too much time (CPU or
engineering) on it, SpaCy
and NLP4J should be among the
first things to try. SpaCy covers English and German, whereas NLP4J covers
only English but is trained on biomedical treebanks (in addition to the
WSJ news that everyone trains on), which makes it especially
useful for that kind of text (see the short usage sketch at the end of
this section). If you're looking towards parsing French,
the Bonsai
Model collection from the French Alpage group and the
Mate Parser from Bernd Bohnet (now at Google) are good first guesses.
If you have a suitable treebank at hand and want neural network parsing, you
might as well try UDPipe and its
Parsito parser (for speed) or the
BiLSTM graph-based parser
by Eliyahu Kiperwasser and Yoav Goldberg (for accuracy). If you want to spend
a day (or more) using exotic build tools and specific outdated versions of TensorFlow,
you can also try SyntaxNet, which requires substantially greater amounts of
model-specific parameter tuning than Kiperwasser and Goldberg's parser.
(Don't use the German SyntaxNet model, which is trained on the tiny
German Universal Dependencies treebank and not on one of the many large
treebanks that exist for German).
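For the impatient, getting a dependency parse out of SpaCy takes only a few lines. This sketch assumes a recent spaCy with an English model installed (e.g. via python -m spacy download en_core_web_sm); the example sentence is made up:

    import spacy

    # Load the small English model (installed separately, see above).
    nlp = spacy.load("en_core_web_sm")

    doc = nlp("The parser attaches each word to its syntactic head.")
    for token in doc:
        # dep_ is the dependency label, head is the governing token
        print(token.text, token.pos_, token.dep_, token.head.text)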
Neural Network Toolkits
My favorite toolkit for modeling natural language text with LSTMs and other
gadgetry is DyNet, which uses dynamically
constructed computation graphs and lets you model recursive neural networks
and other dynamically shaped architectures without much fuss. More
standard neural network libraries such as
TensorFlow use a static network structure, trading flexibility for
the ability to join groups of examples into a minibatch (which DyNet allows, but
does not enforce), which yields greater training speed.
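To make the dynamic-graph point concrete, here is a minimal sketch using DyNet's Python bindings; the toy classifier, the dimensions, and the word-id data are made up for illustration. A fresh computation graph is built for every example, so each sentence can have its own length without padding or bucketing:

    import dynet as dy

    # Made-up sizes for a toy sequence classifier (word ids in, 2 classes out).
    VOCAB, EMB, HID, CLASSES = 1000, 64, 128, 2
    pc = dy.ParameterCollection()
    E = pc.add_lookup_parameters((VOCAB, EMB))
    lstm = dy.LSTMBuilder(1, EMB, HID, pc)
    W = pc.add_parameters((CLASSES, HID))
    b = pc.add_parameters(CLASSES)
    trainer = dy.SimpleSGDTrainer(pc)

    data = [([12, 5, 40], 0), ([7, 99, 3, 1, 25], 1)]  # (word ids, label)

    for words, label in data:
        dy.renew_cg()                  # a brand-new graph for every example
        s = lstm.initial_state()
        for w in words:                # the graph grows with the sentence
            s = s.add_input(E[w])
        scores = dy.parameter(W) * s.output() + dy.parameter(b)
        loss = dy.pickneglogsoftmax(scores, label)
        loss.value()                   # run the forward pass
        loss.backward()
        trainer.update()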
Conditional Random Fields
Hanna Wallach has a very useful link collection
on Conditional Random Fields. I'd especially recommend her tutorial on CRFs (which is also the
introductory part of her MSc thesis), as well as Simon Lacoste-Julien's
tutorial on SVMs, graphical models, and Max-Margin Markov Networks
(also linked there).