Simple pattern extraction from Google n-grams
Google has released n-gram datasets for multiple languages, including English and German. For my needs (lots of patterns, with lemmatization), writing a small bit of C++ allows me to extract pattern instances in bulk, more quickly and comfortably than with bzgrep.
Normally, people build a database around the n-gram datasets (such as Stefan Evert's Web1T5-Easy) to be able to perform ad-hoc queries, or at least uncompress each file so that they can seek through it and run queries online. If you're looking for specific patterns anyway, however, it can make sense to just extract those patterns.
A couple years ago, when trying out pattern searches for BART, I would query Google's n-gram data like I would query a search engine API (back in the day when it seemed like this would give you reproducible results and the downside of "it takes horribly long" hadn't been discovered yet): given two lemmas that were interesting to you (such as "falafel" and "food"), you would construct the appropriate morphological variant (i.e., the plural "foods") and fit the whole thing into a pattern. Then, armed with such a query string (or possibly a gappy version of it such as "foods * falafel"), you would run the query, and the query engine, in the case of BART, would do a bisection search in the uncompressed n-gram files. This actually works decently (sub-second query times), as long as you're interested in things individually and you don't mind keeping around the 80GB (uncompressed text) or 220GB (Web1T5-Easy) of data.
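The bisection itself is nothing fancy: because the uncompressed n-gram files are sorted lexicographically, you can binary-search over byte offsets and realign to the next line boundary after each probe. The following is a simplified sketch of the idea (my own illustration, not BART's actual lookup code):
#include <fstream>
#include <string>

// Binary search over byte offsets in a sorted, uncompressed n-gram file.
// Returns the first line starting with prefix, or "" if there is none.
std::string find_first_with_prefix(const char* fname, const std::string& prefix) {
    std::ifstream in(fname, std::ios::binary);
    in.seekg(0, std::ios::end);
    long lo = 0, hi = in.tellg();
    std::string line;
    while (lo < hi) {
        long mid = lo + (hi - lo) / 2;
        in.clear();
        in.seekg(mid);
        std::getline(in, line);            // skip the partial line we landed in
        long next = in.tellg();            // start of the next complete line
        if (!std::getline(in, line)) { hi = mid; continue; }
        if (line < prefix) lo = next;      // that whole line sorts before the prefix
        else               hi = mid;
    }
    in.clear();
    in.seekg(lo);
    if (lo != 0) std::getline(in, line);   // the line starting at lo is known to sort before the prefix
    while (std::getline(in, line)) {
        if (line.compare(0, prefix.size(), prefix) == 0) return line;
        if (line > prefix) break;          // past the range where the prefix could occur
    }
    return "";
}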
If you want to extract features for clustering, however, you know beforehand which patterns you're interested in, and you need to find a large number of fillers for these patterns - hence it is much easier to do bulk queries (for reasonably precise patterns) and keep the extracted instances around (usually only hundreds of thousands of lines, instead of gigabytes over gigabytes). My old method for achieving this first used bzgrep to extract everything that looked like a pattern instance, and then used a Python script to read in the matched lines, throw out the pattern part, and lemmatize the words you're interested in.
With the new approach (which I will talk about now), I have only one program that I need to call, and I don't have to run the lemmatization part over and over or make up my mind on how to cache its results. You just call extract_ngrams @NNS '%and$' '%other$' @NNS and it will use a pre-determined list of nouns that were tagged as plural nouns in the corpus I'm interested in, together with their lemmas. The result is actually a nice example of the transparent decompression of streams in the Boost C++ libraries, and of the use of multithreading through OpenMP, which makes it excellent fodder for a blog post.
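In rough outline, matching a single template position against a single token works like this: an @-token such as @NNS is a slot that is looked up in the pre-determined word list (and yields the lemma), while the quoted tokens are matched as PCRE patterns. The snippet below is a simplified sketch rather than the code from the repository (Slot, match_token and nns_lemmas are illustrative names, and the translation of the %...$ anchors into a regex is left out):
#include <pcre.h>
#include <cstring>
#include <string>
#include <unordered_map>

// One position of the user-supplied template.
struct Slot {
    bool is_lexicon;   // true for @NNS-style slots
    pcre* re;          // otherwise: pattern compiled with pcre_compile() elsewhere
};

// Check one token of an n-gram line against one slot of the template;
// for lexicon slots, report the lemma instead of the surface form.
bool match_token(const Slot& slot,
                 const std::unordered_map<std::string, std::string>& nns_lemmas,
                 const char* token, std::string* lemma_out) {
    if (slot.is_lexicon) {
        auto it = nns_lemmas.find(token);
        if (it == nns_lemmas.end()) return false;
        *lemma_out = it->second;
        return true;
    }
    int ovector[30];
    return pcre_exec(slot.re, NULL, token, std::strlen(token),
                     0, 0, ovector, 30) >= 0;
}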
With Boost's IO library, transparently opening Bzip2 files has actually become (almost) as easy as it is in Python or Java: you just set up a pipeline of objects, as in
namespace BI = boost::iostreams;   // shorthand used in the snippets below
std::ifstream myfile(filename, std::ios::binary|std::ios::in);
BI::filtering_stream<BI::input> my_filter;
my_filter.push(BI::bzip2_decompressor());
my_filter.push(myfile);
and then keep reading from the file with calls to
my_filter.getline(buf,BUF_SIZE);
and that's all. You could do fancier things than reading lines one at a time; here, each line is just read in, split with ''strtok_r'' (remember, this is multithreaded code, so no non-reentrant functions for you), and then the individual tokens of the line are matched against the template supplied by the user.
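Put together, the per-file reading loop looks roughly like this (a sketch under my own assumptions - the buffer size, the splitting on tabs and spaces, and the match_line placeholder are mine, the rest mirrors what was described above):
#include <boost/iostreams/filtering_stream.hpp>
#include <boost/iostreams/filter/bzip2.hpp>
#include <fstream>
#include <cstring>

namespace BI = boost::iostreams;

// Decompress one .bz2 n-gram file, read it line by line, and cut each line
// into tokens with the reentrant strtok_r.
void do_filtering(const char* filename) {
    std::ifstream myfile(filename, std::ios::binary | std::ios::in);
    BI::filtering_stream<BI::input> my_filter;
    my_filter.push(BI::bzip2_decompressor());
    my_filter.push(myfile);

    const size_t BUF_SIZE = 65536;
    char buf[BUF_SIZE];
    while (my_filter.getline(buf, BUF_SIZE)) {
        char* saveptr;
        const char* tokens[8];
        int n = 0;
        // n-gram lines look like "w1 w2 ... wn <TAB> count", so split on tab and space
        for (char* tok = strtok_r(buf, " \t", &saveptr);
             tok != NULL && n < 8;
             tok = strtok_r(NULL, " \t", &saveptr))
            tokens[n++] = tok;
        // match_line(tokens, n);   // compare the tokens against the user's template
    }
}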
Using OpenMP to search through multiple bzipped files at once is similarly trivial: you put an appropriate pragma (''#pragma omp parallel for'') in front of the loop, make sure that all non-shared variables are declared inside the loop, and that's it:
omp_set_num_threads(3);
#pragma omp parallel for
for (int i=0; i<info.n_files; i++) {
  char buf[BUF_SIZE];                    // declared inside the loop, hence private to each thread
  sprintf(buf, info.path_template, i);   // fill in the i-th file name
  engine.do_filtering(buf);
}
That's it? Well, the buffering of the output is done per thread (in a thread-local buffer), and, as it turned out, you have to put a critical section around the ''write'' calls, because occasional bad things happen otherwise. (In the code snippet, writeout is the name of the critical section.)
if ((out_ptr-out_buf) > BUF_SIZE) {
  #pragma omp critical (writeout)
  write(1, out_buf, (out_ptr-out_buf));
  out_ptr = out_buf;
}
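One more detail: whatever is left in a thread's buffer when its file is done still has to go out through the same critical section; schematically (my own summary, not a verbatim quote from the repository):
// final flush after a thread has finished its file
if (out_ptr > out_buf) {
  #pragma omp critical (writeout)
  write(1, out_buf, (out_ptr-out_buf));
  out_ptr = out_buf;
}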
If you think this could be useful to you, you can find the source on Bitbucket. You will need the Boost IO libraries and PCRE (the regular expression library), and you will need to fix the paths to the n-gram data in the source code.
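For building, something along the lines of g++ -O2 -fopenmp extract_ngrams.cc -o extract_ngrams -lboost_iostreams -lpcre should do the trick (the source file name here is just a placeholder, and the exact Boost library name may differ between distributions).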