Parsing haproxy log files (python)

Holger Just

2011-03-19 16:32:08 UTC

Hi Roy,

Post by Roy Smith
Before I reinvent the wheel, has anybody already written code to parse
haproxy log messages with Python?

I have, although it's not _that_ fast. My approach takes about one
minute per 100 MB of gzipped logs (at roughly 10:1 compression).
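
Taken at face value, those figures work out to somewhere around 17 MB of uncompressed log text per second. A quick back-of-the-envelope check, with the numbers from the paragraph above plugged in as assumptions (the variable names are only illustrative):

```python
# Figures quoted above, treated as rough assumptions rather than measurements.
compressed_mb = 100.0      # size of the gzipped log
compression_ratio = 10.0   # roughly 10:1
elapsed_seconds = 60.0     # about one minute

raw_mb = compressed_mb * compression_ratio
throughput_mb_per_s = raw_mb / elapsed_seconds

print(round(throughput_mb_per_s, 1))  # 16.7
```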

If your use case is covered by the features of halog, you should
definitely try that instead. It's written by Willy himself and can easily
max out your streaming file I/O (meaning it is orders of magnitude faster
than anything you could do in Python itself).

That said, the gist of my analysis implementation follows. It targets
the verbose HTTP log format of HAProxy and Python 2.4. The field names
follow the terminology of the HAProxy configuration manual; refer to it
for a description of the various fields.
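
To get a feel for the layout before reading the regular expression, here is a rough sketch that pulls one line apart with plain string splitting. The log line is purely illustrative (all values are made up), and the names `sample`, `tokens` and `timers` are hypothetical. This is not how the listing further down works: a naive whitespace split breaks as soon as a captured header contains a space, which is one reason to prefer a regex.

```python
# Illustrative HTTP-format log line; every value here is invented.
sample = ('Feb  6 12:14:14 localhost haproxy[14389]: 10.0.1.2:33317 '
          '[06/Feb/2009:12:14:14.655] http-in static/srv1 10/0/30/69/109 '
          '200 2750 - - ---- 1/1/1/1/0 0/0 {1wt.eu} {} '
          '"GET /index.html HTTP/1.1"')

# Drop the syslog prefix, then separate the quoted request from the rest.
payload = sample.split("]: ", 1)[1]
head, request = payload.split(' "', 1)
request = request.rstrip('"')

# The remaining fields are space-separated in this simple case.
tokens = head.split()

# The fifth token holds the five timers, separated by slashes.
timer_names = ("Tq", "Tw", "Tc", "Tr", "Tt")
timers = dict(zip(timer_names, (int(t) for t in tokens[4].split("/"))))

print(tokens[0], tokens[5], timers["Tt"], request)
```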

--Holger

--------------------------------------------------------------------------

#!/usr/bin/env python
# encoding: utf-8

import os
import re
import subprocess as sub

# Does the syslog server escape quotes?
template_escape = True

haproxy_re = (r'haproxy\[(?P<pid>\d+)\]: '
    r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):(?P<client_port>\d{1,5}) '
    r'\[(?P<date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})\] '
    r'(?P<listener_name>\S+) (?P<server_name>\S+) '
    r'(?P<Tq>(-1|\d+))/(?P<Tw>(-1|\d+))/(?P<Tc>(-1|\d+))/(?P<Tr>(-1|\d+))/'
    r'(?P<Tt>\+?\d+) '
    r'(?P<HTTP_return_code>\d{3}) (?P<bytes_read>\d+) '
    r'(?P<captured_request_cookie>\S+) (?P<captured_response_cookie>\S+) '
    r'(?P<termination_state>[\w-]{4}) (?P<actconn>\d+)/(?P<feconn>\d+)/'
    r'(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\d+) '
    r'(?P<server_queue>\d+)/(?P<listener_queue>\d+) '
    r'(\{(?P<captured_request_headers>.*?)\} )?'
    r'(\{(?P<captured_response_headers>.*?)\} )?')

if template_escape:
    haproxy_re += r'\\"(?P<HTTP_request>.+)\\"'
else:
    haproxy_re += r'"(?P<HTTP_request>.+)"'

haproxy_re = re.compile(haproxy_re)

def scan(logfile_path):
    (root, ext) = os.path.splitext(logfile_path)
    process = None
    if ext == ".gz":
        # Use a shellout for unzipping. This is about 2-5 times faster
        # than doing it in python.
        process = sub.Popen(["/bin/gunzip", "--stdout", logfile_path],
                            stdout=sub.PIPE, bufsize=1)
        fd = process.stdout
    else:
        fd = open(logfile_path, "r")

    line_no = 0
    for line in fd:
        line_no += 1
        try:
            match = haproxy_re.search(line)
            if not match:
                # A non-request, e.g. an error or an info message of HAProxy
                # We just ignore it and continue with the next line
                continue

            fields = match.groupdict()
            if fields["captured_request_headers"]:
                fields["captured_request_headers"] = \
                    fields["captured_request_headers"].split("|")
            if fields["captured_response_headers"]:
                fields["captured_response_headers"] = \
                    fields["captured_response_headers"].split("|")

            # Now you have the matched parts in the fields dict
            # And you can do whatever you like with it :)

        except:
            print "An error occurred in line %s. Last line was:" % line_no
            print line
            raise

    # finalize the file reading
    if process:
        process.communicate()
    else:
        fd.close()
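
--------------------------------------------------------------------------

To make the "do whatever you like with it" step a bit more concrete, here is a minimal sketch of one possible use: counting requests per HTTP status code. It makes several assumptions that are not part of the original post. It is written for Python 3 rather than Python 2.4, it assumes the syslog server does not escape quotes (the `template_escape = False` case), and it reads gzip files with the standard `gzip` module for brevity, which according to the comment in the listing is slower than shelling out to gunzip. The names `HAPROXY_RE`, `iter_requests` and `status_counts` are hypothetical.

```python
import gzip
import re
from collections import Counter

# Same pattern as in the listing above, for unescaped quotes.
HAPROXY_RE = re.compile(
    r'haproxy\[(?P<pid>\d+)\]: '
    r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):(?P<client_port>\d{1,5}) '
    r'\[(?P<date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})\] '
    r'(?P<listener_name>\S+) (?P<server_name>\S+) '
    r'(?P<Tq>(-1|\d+))/(?P<Tw>(-1|\d+))/(?P<Tc>(-1|\d+))/(?P<Tr>(-1|\d+))/'
    r'(?P<Tt>\+?\d+) '
    r'(?P<HTTP_return_code>\d{3}) (?P<bytes_read>\d+) '
    r'(?P<captured_request_cookie>\S+) (?P<captured_response_cookie>\S+) '
    r'(?P<termination_state>[\w-]{4}) (?P<actconn>\d+)/(?P<feconn>\d+)/'
    r'(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\d+) '
    r'(?P<server_queue>\d+)/(?P<listener_queue>\d+) '
    r'(\{(?P<captured_request_headers>.*?)\} )?'
    r'(\{(?P<captured_response_headers>.*?)\} )?'
    r'"(?P<HTTP_request>.+)"')


def iter_requests(logfile_path):
    """Yield one dict of named fields per line that looks like a request."""
    # Pick the opener by file extension; "rt" gives decoded text lines.
    opener = gzip.open if logfile_path.endswith(".gz") else open
    with opener(logfile_path, "rt") as fd:
        for line in fd:
            match = HAPROXY_RE.search(line)
            if not match:
                # Not a request line (e.g. an HAProxy info message): skip it.
                continue
            fields = match.groupdict()
            # Captured headers are pipe-separated, as in the listing above.
            for key in ("captured_request_headers",
                        "captured_response_headers"):
                if fields[key]:
                    fields[key] = fields[key].split("|")
            yield fields


def status_counts(logfile_path):
    """Count matching request lines per HTTP status code."""
    return Counter(f["HTTP_return_code"] for f in iter_requests(logfile_path))
```

Calling `status_counts("haproxy.log.gz")` on a log file of your own (the file name is only a placeholder) returns a `Counter` keyed by status code strings. Writing the reader as a generator keeps memory use flat regardless of log size, since only one line's fields are held at a time.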