Against Boredom: shell


2015-05-12

Shell: Get threaddump directly from the java process

While inspecting the Java source, I found a pretty easy way to extract info from a running java process without starting another java process to do it :-)

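# Note: pgrep may print several PIDs if more than one java process is running
PID=$(pgrep java)
SCKT=/tmp/.java_pid$PID
SGNL=/tmp/.attach_pid$PID
CMD='1\0threaddump\0\0\0\0'

# No socket yet: create the attach trigger file and signal the JVM
if [ ! -r "$SCKT" ]; then
 touch "$SGNL" || exit 2
 kill -s QUIT "$PID"
 sleep 5
 rm "$SGNL"
 if [ ! -r "$SCKT" ]; then
  echo "Cannot read $SCKT ... either you are not the correct user for this, or the java process does not 'see' our attach request."
  exit 1
 fi
 echo Done
fi

# printf expands the \0 escapes in $CMD to NUL bytes
printf "$CMD" | nc -U "$SCKT"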
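
The fixed sleep 5 is a guess at how long the JVM needs to create its socket. A minimal sketch of polling instead, assuming a hypothetical helper named wait_for_path (not part of the original script) and a timeout given in whole seconds:

```shell
wait_for_path() {
    # $1 = path to wait for, $2 = timeout in seconds (hypothetical helper)
    path=$1
    remaining=$2
    while [ ! -e "$path" ]; do
        # Give up once the timeout is used up
        [ "$remaining" -le 0 ] && return 1
        sleep 1
        remaining=$((remaining - 1))
    done
    return 0
}
```

In the script above it could be called as wait_for_path "$SCKT" 5 in place of the sleep, so the script continues as soon as the socket shows up.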

Possible options and variations I know about:

1\0threaddump\0-l\0\0\0 (lowercase L, the equivalent of the jstack -l option)
1\0inspectheap\0\0\0\0
1\0inspectheap\0-live\0\0\0

For other commands, see attachListener.cpp in the JDK sources (JDK7, JDK8).
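
Judging by the examples above, every request has the same shape: the version string "1", the command name, and three argument slots, each terminated by a NUL byte. A small sketch that assembles such a request, assuming a hypothetical helper name build_attach_cmd and that unused argument slots stay empty:

```shell
build_attach_cmd() {
    # $1 = command name, $2..$4 = optional arguments (empty when omitted).
    # The \0 escapes in the printf format are written out as NUL bytes.
    printf '1\0%s\0%s\0%s\0%s\0' "$1" "${2:-}" "${3:-}" "${4:-}"
}
```

For example, build_attach_cmd threaddump -l | nc -U "$SCKT" would send the second variation from the list, and piping the output through od -c shows the raw bytes.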

2015-04-27

Nagios: Run query as a service check

Sometimes it is a good idea to check things directly in the database. A few months (years?) ago I ran into an issue with JIRA: data integrity is not enforced in any way, and user-detected issues are repaired with ad-hoc features like the dreaded Integrity Checker. I have two major issues with Atlassian's standpoint on this:

The Integrity Checker is reactive: it only helps after a user has reported the problem, and that can take weeks(!!!). You cannot solve an issue before the user even notices it if you have no tools to... detect it.
A multi-$10k software not using foreign keys and other (not so) advanced RDBMS features? Come on...

Since we are already using Nagios, let's see if I can put together a command and a service that reports things to me.

Now, we have 2 important tasks:

Run a SELECT on the Oracle server from a script, and report the result the way a standard Nagios check script does.
Run this command from a Nagios service and make it accept arguments.

Let's see, what info does our script need:

We have a tnsnames.ora, so we only need the service name. We can assemble this based on the application name and the tier code.
If we hardcode a service user for this, we'll only need its password (as it may vary depending on the tier).
It is easier to host functions on the server and just call them. So we need a parameter for the currently desired function. Why function? We'll see it later.

Now, our check script looks like this:

#!/bin/bash

export ORACLE_HOME=/usr/local/oracle/product/11.2.0/client

OK=0
WARN=1
CRIT=2
UNKN=3

usage()
{
cat << EOF
SCRIPT PROBLEM|Called as $0 $@
usage: $0 options

This script runs db_check() (or the function given with -f) on the specified Oracle service.

OPTIONS:
   -a           application (eg. jira)
   -t           tier (eg. d1, t1); omit for production
   -p           nagiossvc password (default: welcome)
   -f           function name (default: DB_CHECK)
   -o           PL/SQL package name (default: NAGIOS_<application>)

These will be combined to be <tier><application>, and apps.nagios_<application> inside.
EOF
}

APP=
TIER=
PASSWD=
FUNCTION=
OBJECT=

while getopts "ht:a:p:f:o:" OPTION
do
     case $OPTION in
         h)
             usage
             exit $UNKN
             ;;
         t)
             TIER=$OPTARG
             ;;
         a)
             APP=$OPTARG
             ;;
         p)
             PASSWD=$OPTARG
             ;;
         f)
             FUNCTION=$OPTARG
             ;;
         o)
            OBJECT=$OPTARG
            ;;
         ?)
             usage
             exit $UNKN
             ;;
     esac
done

if [ -z "$APP" ]
then
     usage
     exit $UNKN
fi

# No tier given means production
if [ -z "$TIER" ]
then
    export TNS_ADMIN=/etc/tnsnames/prod
else
    export TNS_ADMIN=/etc/tnsnames/dev
fi

if [ -z "$FUNCTION" ]
then
    FUNCTION=DB_CHECK
fi

if [ -z "$OBJECT" ]
then
    OBJECT=NAGIOS_$APP
fi

if [ -z "$PASSWD" ]
then
        if [ -z "$TIER" ]
        then
            PASSWD=nagiossvc_passwd
        else
            PASSWD=welcome
        fi
fi

START=$(date +%s)
RESULTSET="$(${ORACLE_HOME}/bin/sqlplus -S -R 3 -L nagiossvc/${PASSWD}@${TIER}${APP} <<OURQUERY
set colsep ,
set pagesize 0
set linesize 10240
set trimspool on
set longchunksize 2000000 long 2000000 pages 0
SELECT ${OBJECT}.${FUNCTION}() AS ERRORS FROM DUAL;
OURQUERY
)"
END=$(date +%s)
TIMESPAN=$((END-START))

if [[ $RESULTSET == *ORA-* ]]
then
    echo "Script error!|${TIMESPAN}sec"
    echo "$RESULTSET"
    exit $CRIT
elif [ -n "$RESULTSET" ]
then
    echo "Issues were found.|${TIMESPAN}sec"
    echo "${RESULTSET}"
    exit $WARN
else
    echo "OK|${TIMESPAN}sec"
    exit $OK
fi

exit $UNKN

That's it for the first part. Now we need a package that hosts this function for us. Why a package? Our DBAs are crazy about packages. So let's create a package!

CREATE OR REPLACE PACKAGE nagios_jira AUTHID DEFINER AS
   FUNCTION db_check RETURN CLOB;
   -- Application specific functions:
   FUNCTION check_workflow_entry_states RETURN CLOB;
   FUNCTION check_issue_summary_not_null RETURN CLOB;
   FUNCTION check_invalid_issuelink RETURN CLOB;
   FUNCTION check_deleted_is_watcher RETURN CLOB;
   FUNCTION check_fileattachment_nulled RETURN CLOB;
END nagios_jira;
/
CREATE OR REPLACE PACKAGE BODY nagios_jira AS
FUNCTION db_check RETURN CLOB IS
   v_results CLOB := '';
BEGIN
   v_results := v_results || check_workflow_entry_states;
   v_results := v_results || check_issue_summary_not_null;
   v_results := v_results || check_invalid_issuelink;
   v_results := v_results || check_deleted_is_watcher;
   v_results := v_results || check_fileattachment_nulled;
   -- repeat
   RETURN v_results;
END db_check;

FUNCTION check_workflow_entry_states RETURN CLOB IS
   v_errorcount NUMBER(6);
BEGIN
   SELECT COUNT(*) INTO v_errorcount FROM JIRAUSER.JIRAISSUE  
   INNER JOIN JIRAUSER.OS_WFENTRY ON JIRAISSUE.WORKFLOW_ID = OS_WFENTRY.ID
   WHERE OS_WFENTRY.STATE IS NULL OR OS_WFENTRY.STATE = 0;
   IF (v_errorcount > 0) THEN
      RETURN 'CHECK_WORKFLOW_ENTRY_STATES('||v_errorcount||')' || CHR(10);
   ELSE
      RETURN '';
   END IF;
END check_workflow_entry_states;

FUNCTION check_issue_summary_not_null RETURN CLOB IS
   v_errorcount NUMBER(6);
BEGIN
   SELECT COUNT(*) INTO v_errorcount FROM JIRAUSER.JIRAISSUE WHERE SUMMARY IS NULL;
   IF (v_errorcount > 0) THEN
      RETURN 'CHECK_ISSUE_SUMMARY_NOT_NULL('||v_errorcount||')' || CHR(10);
   ELSE
      RETURN '';
   END IF;
END check_issue_summary_not_null;

FUNCTION check_invalid_issuelink RETURN CLOB IS
   v_errorcount NUMBER(6);
BEGIN
   SELECT COUNT(*) INTO v_errorcount FROM JIRAUSER.ISSUELINK L, JIRAUSER.JIRAISSUE I1, JIRAUSER.JIRAISSUE I2 WHERE I1.ID(+) = L.SOURCE AND I2.ID(+) = L.DESTINATION AND (I1.ID IS NULL OR I2.ID IS NULL);
   IF (v_errorcount > 0) THEN
      RETURN 'CHECK_INVALID_ISSUELINK('||v_errorcount||')' || CHR(10);
   ELSE
      RETURN '';
   END IF;
END check_invalid_issuelink;

FUNCTION check_deleted_is_watcher RETURN CLOB IS
   v_errorcount NUMBER(6);
BEGIN
   SELECT COUNT(*) INTO v_errorcount FROM (SELECT DISTINCT LOWER(SOURCE_NAME) FROM JIRAUSER.USERASSOCIATION MINUS SELECT DISTINCT LOWER_USER_NAME FROM JIRAUSER.CWD_USER);
   IF (v_errorcount > 0) THEN
      RETURN 'CHECK_DELETED_USERS_IN_WATCHERS('||v_errorcount||')' || CHR(10);
   ELSE
      RETURN '';
   END IF;
END check_deleted_is_watcher;

FUNCTION check_fileattachment_nulled RETURN CLOB IS
   v_errorcount NUMBER(6);
BEGIN
   SELECT COUNT(*) INTO v_errorcount FROM JIRAUSER.FILEATTACHMENT WHERE FILENAME IS NULL;
   IF (v_errorcount > 0) THEN
      RETURN 'CHECK_FILEATTACHMENT_WITHOUT_FILENAME('||v_errorcount||')' || CHR(10);
   ELSE
      RETURN '';
   END IF;
END check_fileattachment_nulled;

END nagios_jira;
/
GRANT EXECUTE ON nagios_jira to nagiossvc;
CREATE SYNONYM nagiossvc.nagios_jira FOR APPS.nagios_jira;

Ugh, those joins are ugly, but our architect is stuck in pre-9i times...

So if we get the return value from this function, it will contain only one cell for us. If that's not empty, then we found issues, and the details are listed.
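
The decision rule can be tried out without a database. The sketch below is a POSIX case version of it, assuming a hypothetical helper named status_for_result that receives the raw query output and prints the Nagios status code instead of exiting:

```shell
status_for_result() {
    # $1 = raw text returned by the query (hypothetical helper)
    case "$1" in
        *ORA-*) echo 2 ;;   # Oracle error: CRITICAL
        "")     echo 0 ;;   # empty cell: OK
        *)      echo 1 ;;   # anything else lists the failed checks: WARNING
    esac
}
```

Calling status_for_result "CHECK_INVALID_ISSUELINK(3)" prints 1, the same WARNING status the check script would exit with for that result.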

Now wire it into Nagios. First, a Nagios command definition for this script:

# Run the check on the specified db
# ARG1 - application
# ARG2 - tier (defaults to prod)
# ARG3 - password of nagiossvc - optional (defaults in script)
# ARG4 - function name - optional (defaults to DB_CHECK)
# ARG5 - plsql package object name - optional (defaults to NAGIOS_<ARG1>)
define command {
    command_name check_jira_integrity
    command_line /usr/local/whatever/bin/check_db.sh -a "$ARG1$" -t "$ARG2$" -p "$ARG3$" -f "$ARG4$" -o "$ARG5$"
}

Call this command from a Nagios service:

define service {
        service_description jira_prod_check_integrity
        host_name myjira
        check_command check_jira_integrity!jira
        check_interval 15
        notification_interval 15
        retry_interval 5
}

(Check every 15 minutes. When problems are detected, retry every 5 minutes.)

Now we are playing.

2013-12-18

Sending emails to addresses and subjects in a CSV with a fixed message in a textfile

The following script was written to notify approvers that an audit initiated the deletion of some users they had not approved (again). The subject contains the usernames, and the message is a template with some PC blahblah.

#!/bin/bash

if [ $# -lt 2 ]; then
 echo "No arguments passed! I need a 'mailaddr;subject' .csv and a txt containing the fixed message."
 echo "csvmailer.sh addresses.csv message.txt"
 exit 1
fi

while read -r p; do
 IFS=';'
 TOKENS=($p)
 EMAIL=${TOKENS[0]}
 SUBJ=${TOKENS[1]}
 IFS=' '
 mail -s "$SUBJ" "$EMAIL" < "$2"
done < "$1"

The CSV is like "jane.doe@company.com;This is a message in the subject about deleting the account of 'johndoe@company.com'\n". Do NOT put a semicolon into the subject column :-D

IFS sets the Internal Field Separator to semicolon (';'), instead of the default whitespace. This way the array generator/tokenizer is dead simple in the next line.

$1 is the CSV file, read into $p line-by-line.

$2 is the fixed message piped into mail.
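
As a possible variation, read itself can do the splitting when IFS is set only for that one command. This is a sketch rather than the script used above; split_line is a hypothetical helper name. With two variables, everything after the first semicolon ends up in the second one, so a semicolon inside the subject would survive:

```shell
split_line() {
    # $1 = one "address;subject" line; prints the address, then the subject.
    printf '%s\n' "$1" | {
        # IFS is changed for this read only, so nothing has to be restored
        IFS=';' read -r email subj
        printf '%s\n%s\n' "$email" "$subj"
    }
}
```

In the mailer loop the same idea would read as: while IFS=';' read -r EMAIL SUBJ; do ... done < "$1".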

2013-12-05

Linux: Send files in e-mail from console

So I wanted to send some files, but my mailx package did not have support for the famous -a parameter.

#!/bin/bash

function create_attachment_block()
{
        echo -ne "--$BOUNDARY\r\nContent-Transfer-Encoding: base64\r\n"
        echo -ne "Content-Type: $(file -bi "$1"); name=\"$1\"\r\n"
        echo -ne "Content-Disposition: attachment; filename=\"$1\"\r\n\r\n$(base64 -w 0 "$1")\r\n\r\n"
}

if [ $# -lt 2 ]; then
        echo No files specified...
        exit 1;
fi

BOUNDARY="==combine-autogun==_$(date +%Y%m%d%H%M%S)_$$_=="
BODY=""

# "$1" is the recipient, so the attachments start at the second argument
for a in "${@:2}"
do
        if [ -s "$a" ] && [ -f "$a" ] && [ -r "$a" ]; then
                BODY="$BODY$(create_attachment_block "$a")"
        fi
done

/usr/sbin/sendmail -oi -t << COMPLEX_MAIL
To: $1
Subject: Please see files attached
MIME-Version: 1.0
User-Agent: $0
Content-Type: multipart/mixed; boundary="$BOUNDARY"

$BODY
--${BOUNDARY}--
COMPLEX_MAIL

2013-11-18

Logging IO activity of a process

Okay, your 3rd party app sucks. It sucks big time. Generates heavy disk traffic at seemingly random times, and you just can't think of anything, anymore. The users are revolting, the website is lagging, your boss is raging.

It is time to check what the hell the app is actually doing.

The following script uses strace to catch all IO-related syscalls made by the given process, and dumps them as semicolon-separated CSV rows. Later, you can aggregate by seconds, minutes, file systems, or subsystems (like Lucene, etc), create charts, graphs, and pivots.

#!/bin/sh

if [ "x$1" = "x-h" ]; then
 echo "Usage: ./iotrace.sh <pid>"
 exit 0
fi

if [ "$(id -u)" != "0" ]; then
   echo "This script must be run as root " 1>&2
   exit 1
fi

if [ $# -gt 0 ]; then
PID=$1
else
echo "Hey, I need a parameter!"
exit 1
fi

ps -L -o lwp= -p "$PID" | awk '{print "-p " $1}' | xargs strace -q -f -v -ttt -T -s 0 -e trace=open,close,read,write 2>&1 | awk -v pid="$PID" '
function output(a, f, r, t)
{
 # a - action
 # f - file descriptor
 # r - result
 # t - time as unix epoch
 if (f in fd)
  file = fd[f];
 else
 {
  ("readlink /proc/" pid "/fd/" f) | getline file;
  fd[f] = file;
 }
 if (file !~ /^(socket|pipe|\/dev|\/proc)/ && r ~ /^[0-9]+$/)
  print a, file, r, strftime("%Y-%m-%d %H:%M:%S"); #substr(t, 0, index(t, ".")-1));
}

BEGIN { OFS=";"; print "op;path;bytes;epoch";}
{
 if($6 ~ /resumed>/)
 {
  if ($5 ~ /open/){fd[$(NF-1)] = pending[$2];}
  else if ($5 ~ /close/){match($4, /([0-9]+)/, a);delete fd[a[1]];}
  else if ($5 ~ /write/){match($4, /([0-9]+)/, a);output("write", pending[$2], $(NF-1), $3);}
  else if ($5 ~ /read/) {match($4, /([0-9]+)/, a);output("read", pending[$2], $(NF-1), $3);}
  
  delete pending[$2];
 }
 else if ($4 ~ /^open\(/)
 {
  match($4, /\"(.+)\"/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   fd[$(NF-1)] = f;
  }
 }
 else if ($4 ~ /^close\(/)
 {
  match($4, /([0-9]+)/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   delete fd[f];
  }
 }
 else if ($4 ~ /^write\(/)
 {
  match($4, /([0-9]+)/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   output("write", f, $(NF-1), $3);
  }
 }
 else if ($4 ~ /^read\(/)
 {
  match($4, /([0-9]+)/, a);
  f = a[1];
  if ($(NF-1) == "<unfinished")
  {
   pending[$2] = f;
  } else {
   output("read", f, $(NF-1), $3);
  }
 }
}'

What does it do?

Takes a process ID as input
Lists all the threads (LWPs) of this process
Feeds these into xargs so that strace attaches to all of them, and makes strace print only the four syscalls we are interested in (open, close, read, write), which are the ones used by normal Java IO methods
Builds a dictionary of file descriptors and filenames, and pretty-prints the filenames with the actual number of bytes processed
Also takes care of the interrupted (unfinished/resumed) syscall printouts.

An average Java webapp with Lucene can produce 3.5M rows in an hour. Note that a file this size cannot be opened in Excel ;-)
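
With that many rows, a first aggregation is easier to do in the shell. The sketch below sums the bytes per minute and operation. It assumes the op;path;bytes;timestamp layout printed by the script above, with the timestamp in the "YYYY-MM-DD HH:MM:SS" form produced by strftime; aggregate_by_minute is a hypothetical helper name.

```shell
aggregate_by_minute() {
    # Reads "op;path;bytes;timestamp" rows on stdin, skips the header row
    # and rows without a numeric byte count, and prints
    # "minute;op;total_bytes" lines in sorted order.
    awk -F';' 'NR > 1 && $3 ~ /^[0-9]+$/ {
        minute = substr($4, 1, 16)      # "YYYY-MM-DD HH:MM"
        sum[minute ";" $1] += $3
    }
    END { for (k in sum) print k ";" sum[k] }' | sort
}
```

It would be used as aggregate_by_minute < iotrace.csv, where iotrace.csv stands for wherever the trace output was saved. Grouping on $2 instead of $1 gives the totals per file.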