Injecting a TCP server
You can network-enable any program by dynamically adding code that acts as a TCP server and redirects the program's standard input and output.
Many programs that use machine learning behave roughly as follows:
- Read a model
- Read data from stdin
- Write data to stdout
Adhering to this model means that you can easily construct a pipeline out of multiple programs which feed each other - for example, take raw text, run it through a tokenizer, then a part-of-speech tagger, then a parser. Things become more complicated when you want to process small pieces of data without paying the startup cost of reading the model data every time.
For code that you do not control, or that is written in another language, you have to achieve this using a wrapper: you write a module, in your favorite programming language, that runs that other program, feeds it portions of the appropriate input (using a pair of pipes), and takes back corresponding portions of the appropriate output.
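As a rough illustration of such a wrapper, the sketch below keeps one child process alive and talks to it over a pair of pipes. The LineWrapper class and the CHILD command are hypothetical stand-ins (the child here just upper-cases each line); a real tool would take its place. The sketch assumes the child answers every input line with exactly one output line and flushes it right away.

```python
import subprocess
import sys

# Hypothetical stand-in for the external program: it upper-cases each
# input line and flushes immediately, one output line per input line.
CHILD_SOURCE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    print(line.strip().upper(), flush=True)\n"
)
CHILD = [sys.executable, "-c", CHILD_SOURCE]


class LineWrapper:
    """Start the child once, then exchange one line at a time with it."""

    def __init__(self, argv):
        # One pipe feeds the child's stdin, the other carries its stdout.
        self.proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,  # line-buffered on our side
        )

    def process(self, line):
        """Send one line and block until the child answers with one line."""
        self.proc.stdin.write(line + "\n")
        self.proc.stdin.flush()
        return self.proc.stdout.readline().rstrip("\n")

    def close(self):
        """Signal end of input and wait for the child to exit."""
        self.proc.stdin.close()
        self.proc.wait()
        self.proc.stdout.close()


if __name__ == "__main__":
    wrapper = LineWrapper(CHILD)
    print(wrapper.process("the model is loaded only once"))
    print(wrapper.process("and reused for every request"))
    wrapper.close()
```

If the child buffers its output or reads ahead before answering, process() blocks forever, which is exactly where this kind of wrapper stops being enough.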
In some cases, though, a program has non-negligible startup time and still wants to read its input in one go (or in irregular pieces). For example, TreeTagger (a popular part-of-speech tagger) reads an arbitrary portion of its input before giving you the part-of-speech tag for the first input line.
A nicer solution, if we can make the program cooperate somehow, would be to connect via a TCP socket and have the program, instead of reading its input only once, do its work for each new incoming connection.
Instead of
    load_model()
    while things_to_do:
        read_data()
        write_processed_output()
    exit()
it would be much nicer to have something along the lines of
    load_model()
    open_socket()
    for each connection:
        fork subprocess:
            read_data()
            write_processed_output()
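For a program whose source we do control, the pseudocode above could look roughly like the following sketch. All names here (load_model, open_socket, handle, serve, and the max_connections parameter) are made up for illustration, and the "model" is a trivial upper-casing function standing in for something expensive to load. It relies on os.fork, so it assumes a Unix-like system.

```python
import os
import socket


def load_model():
    # Hypothetical stand-in for the slow startup step.
    return {"transform": bytes.upper}


def open_socket(host="127.0.0.1", port=0):
    # Port 0 lets the OS pick a free port; a real server would fix one.
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    listener.bind((host, port))
    listener.listen()
    return listener


def handle(conn, model):
    # read_data() / write_processed_output() for one connection.
    with conn, conn.makefile("rb") as reader:
        for line in reader:
            conn.sendall(model["transform"](line))


def reap_children():
    # Collect finished children without blocking, so zombies don't pile up.
    try:
        while os.waitpid(-1, os.WNOHANG)[0] != 0:
            pass
    except ChildProcessError:
        pass  # no children left


def serve(listener, model, max_connections=None):
    served = 0
    while max_connections is None or served < max_connections:
        conn, _addr = listener.accept()
        if os.fork() == 0:
            # Child: already has the loaded model in memory.
            listener.close()
            try:
                handle(conn, model)
            finally:
                os._exit(0)
        # Parent: the child owns the connection now.
        conn.close()
        reap_children()
        served += 1
```

Calling serve(open_socket(), load_model()) loops forever; max_connections only exists so that the sketch can be exercised in a test. The model is loaded once in the parent, and every forked child inherits it for free.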
To make the program do this despite not being written for it, we would need to intercept libc's read function. The replacement would watch for reads and, just before the first read from stdin, do the accept-connection-and-fork thing and then return control to the original program.
One possibility for intercepting the read function would be to write a replacement (which then calls the old read) and tell the dynamic linker to load the code via the LD_PRELOAD environment variable.
Such an approach is sketched here, and is used in Debian's fakeroot command (which runs programs in a mode where they appear to be able to do things that only root can do, including installing things in /usr/bin).
This does not work here because the program we are looking at is statically linked: it never goes through the dynamic linker, so it won't care about LD_PRELOAD. What a pity.
We can, however, run the program in debugging mode using the ptrace system call, which lets us stop the program, write to its registers and memory, or (hear, hear) intercept system calls.
Hence the approach I'll describe here does the following:
- start the program in debugging mode (with PTRACE_TRACEME)
- wait until the first system call that reads from stdin
- add ('inject') a bit of program code for the accept-connection-and-fork task
- give control to the bit of program code we added, which runs the TCP server code and, for each connection, forks off a new process that returns control to the original program (after replacing standard input and output with the file descriptor of the network connection).
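The last step, replacing standard input and output with the connection, is the part that lets the original program carry on unmodified. The real injected code does this in raw assembler; the sketch below only shows the same idea at a higher level using dup2. The names redirect_stdio and original_program are hypothetical, and original_program is a stand-in that knows nothing about sockets.

```python
import os
import socket


def redirect_stdio(conn):
    """Make file descriptors 0 and 1 refer to the connection's socket."""
    os.dup2(conn.fileno(), 0)  # stdin now reads from the connection
    os.dup2(conn.fileno(), 1)  # stdout now writes to the connection
    conn.close()               # fds 0 and 1 keep the socket open


def original_program():
    """Hypothetical stand-in for the unmodified program.

    It only ever touches fd 0 and fd 1 and has no idea that they
    might be a network connection.
    """
    while True:
        chunk = os.read(0, 4096)
        if not chunk:
            break  # end of input
        os.write(1, chunk.upper())
```

In a forked child that holds an accepted connection, calling redirect_stdio(conn) followed by original_program() is enough: everything the program reads and writes goes over the network from then on.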
Sounds easy? There's a catch though: in the added program, we cannot use the standard library (libc or anything else), and because we're loading program code directly, we have to write it in assembler and scrape together the raw bytes. (There's probably a more comfortable way if you write your own ELF loader and some standard library routines, but that would take even more time.)
Find the source on Bitbucket. (Note: there is a GitHub project with the same name, which does the simpler variant of always starting the program anew. I discovered it too late.)