2013-11-18

Logging IO activity of a process

Okay, your 3rd party app sucks. It sucks big time. Generates heavy disk traffic at seemingly random times, and you just can't think of anything, anymore. The users are revolting, the website is lagging, your boss is raging.

It is time to check what the hell the app is actually doing.

The following script is using strace to catch all IO-related syscalls done by the given process, and dump them in a CSV manner. Later, you can aggregate by seconds, minutes, file systems, or subsystems (like Lucene, etc), create charts, graphs, and pivots.

#!/bin/sh

if [ "x$1" == "x-h" ]; then
 echo "Usage: ./iotrace.sh <pid>"
 exit 0
fi

if [ "$(id -u)" != "0" ]; then
   echo "This script must be run as root " 1>&2
   exit 1
fi

if [ $# -gt 0 ]; then
PID=$1
else
echo "Hey, I need a parameter!"
exit 1
fi

ps -eL | grep $PID | awk '{print"-p " $2}' | xargs strace -q -f -v -ttt -T -s 0 -e trace=open,close,read,write 2>&1 | awk -v pid=$PID '
function output(a, f, r, t)
{
 # a - action
 # f - file descriptor
 # r - result
 # t - time as unix epoch
 if (f in fd)
  file = fd[f];
 else
 {
  ("readlink /proc/" pid "/fd/" f) | getline file;
  fd[f] = file;
 }
 if (file !~ /^(socket|pipe|\/dev|\/proc)/ || r ~ /\d+/)
  print a, file, r, strftime("%Y-%m-%d %H:%M:%S"); #substr(t, 0, index(t, ".")-1));
}

BEGIN { OFS=";"; print "op;path;bytes;epoch";}
{
 if($6 ~ /resumed>/)
 {
  if ($5 ~ /open/){fd[$(NF-1)] = pending[$2];}
  else if ($5 ~ /close/){match($4, /([0-9]+)/, a);delete fd[a[1]];}
  else if ($5 ~ /write/){match($4, /([0-9]+)/, a);output("write", pending[$2], $(NF-1), $3);}
  else if ($5 ~ /read/) {match($4, /([0-9]+)/, a);output("read", pending[$2], $(NF-1), $3);}
  
  delete pending[$2];
 }
 else if ($4 ~ /^open\(/)
 {
  match($4, /\"(.+)\"/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   fd[$(NF-1)] = f;
  }
 }
 else if ($4 ~ /^close\(/)
 {
  match($4, /([0-9]+)/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   delete fd[f];
  }
 }
 else if ($4 ~ /^write\(/)
 {
  match($4, /([0-9]+)/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   output("write", f, $(NF-1), $3);
  }
 }
 else if ($4 ~ /^read\(/)
 {
  match($4, /([0-9]+)/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   output("read", f, $(NF-1), $3);
  }
 }
}'

What it does?
  1. Takes your input of a process ID
  2. Reads all the child processes of this process
  3. Feeds these into xargs to make strace to attach to all of them, also make strace to only print the four syscalls we are interested in (open, close, read, write), these are used by normal java IO methods
  4. Make a dictionary of file descriptors and filenames, and pretty-print the filenames with the acutal number of processed bytes
  5. Also, take care of the interrupted syscall printouts.

An average java webapp with Lucene can produce 3.5M rows in an hour. Note, that it cannot be opened in Excel ;-)

No comments :

Post a Comment