Strace to the rescue

Recently I was debugging a strange error when using the eventlet support in the new coverage 4.0 alpha. The issue manifested itself as a network connection problem: Name or service not known. This was confusing, since the host it was trying to connect to was localhost. How can it fail to resolve localhost?! When I switched off the eventlet tracing, the problem went away.

After banging my head against this for a few days, I finally remembered a tool I rarely think to pull out: strace.

There's an excellent blog post showing the basics of strace by Chad Fowler, The Magic of Strace. After tracing my test process, I could easily search the output for my error message:

11045 write(2, "2014-10-15 09:16:48,348 [ERROR] py2neo.packages.httpstream.http: !!! NetworkAddressError: Name or service not known", 127) = 127

and a few lines above lay the solution to my mystery:

11045 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = -1 EMFILE (Too many open files)
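
Lines like this one are easy to miss in a long trace, so it can help to pick out the failing calls mechanically. Here is a minimal sketch, assuming the trace was saved to a text file in the same format as the lines quoted here. The `failed_syscalls` helper, the regular expression, and the second sample line are all illustrative and are not part of strace or of my original session.

```python
import re

# Matches failed calls in strace output, for example:
#   11045 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = -1 EMFILE (Too many open files)
# The leading PID is optional because strace only prints it in some modes.
FAILED_CALL = re.compile(
    r'^(?:(?P<pid>\d+)\s+)?'       # optional process ID
    r'(?P<syscall>\w+)'            # syscall name
    r'\((?P<args>.*)\)'            # raw argument text
    r'\s+=\s+-1\s+'                # a return value of -1 signals failure
    r'(?P<errno>E[A-Z0-9]+)'       # symbolic errno, e.g. EMFILE
    r'\s+\((?P<message>.*)\)\s*$'  # human-readable description
)


def failed_syscalls(lines, errno=None):
    """Yield a dict per failed syscall, optionally filtered by errno name."""
    for line in lines:
        match = FAILED_CALL.match(line)
        if match and (errno is None or match.group("errno") == errno):
            yield match.groupdict()


# The first line is taken from the trace above; the second is a made-up
# successful call, included only to show that it is skipped.
sample = [
    '11045 open("/etc/hosts", O_RDONLY|O_CLOEXEC) = -1 EMFILE (Too many open files)',
    '11045 write(2, "hello", 5) = 5',
]
hits = list(failed_syscalls(sample, errno="EMFILE"))
for hit in hits:
    print(hit["pid"], hit["syscall"], hit["errno"], hit["message"])
```

For a real trace, pass an open file object instead of `sample`; dropping the `errno` argument lists every failed call, which is noisy but sometimes revealing.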

It turns out the eventlet tracer was causing my code to leak file descriptors (a problem I'm still investigating), eventually hitting my relatively low ulimit. Once I bumped the limit in /etc/security/limits.conf, the problem disappeared!
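
A quick way to see how close a process is to that limit is to compare its open descriptor count with the limit itself. The following is a rough sketch, not what I actually ran at the time: `fd_usage` is a hypothetical helper, and it assumes a Unix-like system that exposes open descriptors under /proc/self/fd (Linux) or /dev/fd.

```python
import os
import resource


def fd_usage():
    """Return (open_fds, soft_limit, hard_limit) for the current process."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # /proc/self/fd is Linux-specific; /dev/fd is a fallback on other Unixes.
    fd_dir = "/proc/self/fd" if os.path.isdir("/proc/self/fd") else "/dev/fd"
    # Listing the directory briefly opens a descriptor of its own,
    # so the count may be one higher than the steady-state value.
    open_fds = len(os.listdir(fd_dir))
    return open_fds, soft, hard


open_fds, soft, hard = fd_usage()
print(f"{open_fds} descriptors open (soft limit {soft}, hard limit {hard})")
```

Calling this periodically from a test run should show the count climbing if descriptors are leaking, well before open() starts failing with EMFILE.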

I must remember to reach for strace sooner when trying to debug odd system behaviours.
