Coreference Resolution for English

I am one of the authors of BART, a modular framework for coreference resolution. BART originated as a joint effort in the JHU Summer Workshop project "Exploiting Lexical and Encyclopedic Resources for Entity Disambiguation", and it is being actively used and developed by multiple research groups.

Discriminative Parser

Lexicalized Parsing with Morphology
The source code for the parser from (Versley and Rehbein, 2009) is available on Bitbucket. It includes parts that are based on code from Helmut Schmid's BitPar (included with kind permission), and it is available under similar terms as BitPar, i.e. for non-commercial/research purposes only.

Python interface to CWB

Efficient Access to large Corpora
The Open Corpus Workbench (CWB) allows you to efficiently store and query large (>100M words) corpora. cwb-python is a Python interface (similar to the existing Perl one) that allows you to quickly retrieve, e.g., a certain sentence, or the occurrences of a certain word.
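
To make the two kinds of lookup mentioned above concrete (fetching a given sentence, and finding the occurrences of a word), here is a toy in-memory sketch. It is not the cwb-python API and says nothing about how CWB stores its data; the class `ToyCorpus` and its methods are hypothetical names chosen only for illustration.

```python
from bisect import bisect_right
from collections import defaultdict


class ToyCorpus:
    """In-memory stand-in for a token-indexed corpus (hypothetical)."""

    def __init__(self, sentences):
        self.tokens = []       # all tokens, addressed by corpus position
        self.sent_starts = []  # position of the first token of each sentence
        self.positions = defaultdict(list)  # word -> sorted corpus positions
        for sent in sentences:
            self.sent_starts.append(len(self.tokens))
            for word in sent:
                self.positions[word].append(len(self.tokens))
                self.tokens.append(word)

    def sentence(self, n):
        """Return the tokens of the n-th sentence (0-based)."""
        start = self.sent_starts[n]
        if n + 1 < len(self.sent_starts):
            end = self.sent_starts[n + 1]
        else:
            end = len(self.tokens)
        return self.tokens[start:end]

    def occurrences(self, word):
        """Return (sentence number, corpus position) for each occurrence."""
        return [(bisect_right(self.sent_starts, pos) - 1, pos)
                for pos in self.positions.get(word, [])]


corpus = ToyCorpus([
    ["the", "parser", "works"],
    ["we", "tested", "the", "parser"],
])
print(corpus.sentence(1))
print(corpus.occurrences("parser"))
```

Both lookups avoid scanning the whole text: a sentence is a slice between two stored offsets, and a word's occurrences come from a precomputed position list.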

Blog posts

Neural Networks are Quite Neat (a rant)
After decades of Neural Network overhype, followed by a period of disrespect, Neural Networks have become popular again - for a reason, as they can fit large amounts of data better than the feature-based models that came before them. Nonetheless, people who lived through the first overhyped episode are asking critical questions - the answers to which are (hopefully!) enlightening (more...)

The brave new world of search engines
In an earlier post, I talked about Google's current search results in terms of personalization, and whether to like it or not. This post takes up another aspect of 2011 Google search: what they do with complex queries. For a more current perspective, see this presentation (by Will Critchlow) from 2013. (more...)

Simple Pattern extraction from Google n-grams
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep. (more...)
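
As a rough illustration of this kind of bulk pattern extraction, here is a minimal Python sketch (it is not the C++ code from the post). It assumes tab-separated input lines of the form "n-gram, TAB, count", which is an assumption about the data format; the pattern "X such as Y", the function name `extract_pattern_instances`, and the sample lines are all hypothetical, and lemmatization is left out.

```python
from collections import Counter


def extract_pattern_instances(lines, pattern):
    """Sum the counts of slot fillers for n-grams matching `pattern`.

    `pattern` is a tuple of tokens in which None marks a slot, e.g.
    (None, "such", "as", None). Each line is assumed to look like
    "tok1 tok2 ... tokN<TAB>count" (an assumed input format).
    """
    counts = Counter()
    for line in lines:
        ngram, _, count = line.rstrip("\n").partition("\t")
        tokens = ngram.split(" ")
        # Skip malformed lines and n-grams of the wrong length.
        if len(tokens) != len(pattern) or not count.isdigit():
            continue
        if all(p is None or p == t for p, t in zip(pattern, tokens)):
            fillers = tuple(t for p, t in zip(pattern, tokens) if p is None)
            counts[fillers] += int(count)
    return counts


# Made-up sample lines, not real n-gram data.
sample = [
    "fruits such as apples\t120",
    "fruits such as pears\t45",
    "fruits such as apples\t30",
    "fruits and other things\t999",
    "such as apples\t500",
]
instances = extract_pattern_instances(sample, (None, "such", "as", None))
for fillers, count in instances.most_common():
    print(" ".join(fillers), count)
```

Because the function only iterates over lines, it can be fed an open (possibly decompressed on the fly) file object instead of a list, so the data never has to fit in memory.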

Useful links

Fast dependency parsing
For doing syntactic preprocessing without spending too much time (CPU or engineering) on it, SpaCy and NLP4J should be among the first things to try. SpaCy covers English and German, whereas NLP4J covers only English but is trained on biomedical treebanks (in addition to the WSJ news that everyone trains on), which makes it especially useful for that kind of text. If you're looking towards parsing French, the Bonsai Model collection from the French Alpage group and the Mate Parser from Bernd Bohnet (now at Google) are good first guesses. If you have a suitable treebank at hand and want neural network parsing, you might as well try UDPipe and its Parsito parser (for speed) or the BiLSTM graph-based parser by Eliyahu Kiperwasser and Yoav Goldberg (for accuracy). If you want to spend a day (or more) using exotic build tools and specific outdated versions of TensorFlow, you can also try SyntaxNet, which requires substantially more model-specific parameter tuning than Kiperwasser and Goldberg's parser. (Don't use the German SyntaxNet model, which is trained on the tiny German Universal Dependencies treebank and not on one of the many large treebanks that exist for German.)

Neural Network Toolkits
My favorite toolkit for modeling natural language text using LSTMs and other gadgetry is DyNet, which uses dynamically constructed computation graphs and makes it possible to model recursive neural networks and similar architectures without much fuss. The static network structure of more standard neural network libraries such as TensorFlow trades off flexibility for the ability to join groups of examples in a minibatch (which DyNet allows, but does not enforce), and minibatching leads to greater training speed.

Conditional Random Fields
Hanna Wallach has a very useful link collection on Conditional Random Fields. I'd especially recommend her tutorial on CRFs (which is also the introductory part of her MSc thesis) as well as Simon Lacoste-Julien's tutorial on SVMs, graphical models, and Max-Margin Markov Networks (also linked there).

Nice blogs

Language Log
Technologies du Langage
Earning my Turns
Leiter Reports