Discussion:
Parsing haproxy log files (python)
Roy Smith
2011-03-18 21:21:18 UTC
Before I reinvent the wheel, has anybody already written code to parse
haproxy log messages with Python?
Holger Just
2011-03-19 16:32:08 UTC
Hi Roy,
Post by Roy Smith
Before I reinvent the wheel, has anybody already written code to parse
haproxy log messages with Python?
I have, although it's not _that_ fast. My approach takes about 1
minute per 100 MB of gzipped logs (at roughly 10:1 compression).

If your use case is covered by the features of halog, you should
definitely try that instead. It's written by Willy himself and can easily
max out your streaming file I/O (meaning it is orders of magnitude faster
than anything you could do in Python itself).

That said, the gist of my analysis implementation follows. It targets
the verbose HTTP log format of HAProxy and Python 2.4. The field names
follow the terminology of the HAProxy configuration manual; refer to it
for a description of the various fields.

--Holger

--------------------------------------------------------------------------

#!/usr/bin/env python
# encoding: utf-8

import os
import re
import subprocess as sub

# Does the syslog server escape quotes?
template_escape = True

haproxy_re = (r'haproxy\[(?P<pid>\d+)\]: '
r'(?P<client_ip>(\d{1,3}\.){3}\d{1,3}):(?P<client_port>\d{1,5}) '
r'\[(?P<date>\d{2}/\w{3}/\d{4}(:\d{2}){3}\.\d{3})\] '
r'(?P<listener_name>\S+) (?P<server_name>\S+) '
r'(?P<Tq>(-1|\d+))/(?P<Tw>(-1|\d+))/(?P<Tc>(-1|\d+))/(?P<Tr>(-1|\d+))/'
r'(?P<Tt>\+?\d+) '
r'(?P<HTTP_return_code>\d{3}) (?P<bytes_read>\d+) '
r'(?P<captured_request_cookie>\S+) (?P<captured_response_cookie>\S+) '
r'(?P<termination_state>[\w-]{4}) (?P<actconn>\d+)/(?P<feconn>\d+)/'
r'(?P<beconn>\d+)/(?P<srv_conn>\d+)/(?P<retries>\d+) '
r'(?P<server_queue>\d+)/(?P<listener_queue>\d+) '
r'(\{(?P<captured_request_headers>.*?)\} )?'
r'(\{(?P<captured_response_headers>.*?)\} )?')

if template_escape:
    haproxy_re += r'\\"(?P<HTTP_request>.+)\\"'
else:
    haproxy_re += r'"(?P<HTTP_request>.+)"'

haproxy_re = re.compile(haproxy_re)

def scan(logfile_path):
    (root, ext) = os.path.splitext(logfile_path)
    process = None
    if ext == ".gz":
        # Use a shellout for unzipping. This is about 2-5 times faster
        # than doing it in python.
        # universal_newlines=True makes the pipe yield text lines, so the
        # regex also works when this runs on Python 3.
        process = sub.Popen(["/bin/gunzip", "--stdout", logfile_path],
                            stdout=sub.PIPE, bufsize=1,
                            universal_newlines=True)
        fd = process.stdout
    else:
        fd = open(logfile_path, "r")

    line_no = 0
    for line in fd:
        line_no += 1
        try:
            match = haproxy_re.search(line)
            if not match:
                # A non-request, e.g. an error or an info message of HAProxy
                # We just ignore it and continue with the next line
                continue

            fields = match.groupdict()
            if fields["captured_request_headers"]:
                fields["captured_request_headers"] = \
                    fields["captured_request_headers"].split("|")
            if fields["captured_response_headers"]:
                fields["captured_response_headers"] = \
                    fields["captured_response_headers"].split("|")

            # Now you have the matched parts in the fields dict
            # And you can do whatever you like with it :)

        except:
            # Report where parsing failed, then re-raise the original error
            print("An error occurred in line %s. Last line was:" % line_no)
            print(line)
            raise

    # finalize the file reading
    if process:
        process.communicate()
    else:
        fd.close()
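
Once a line has matched, every value in the fields dict is still a string. The following is a minimal sketch of one way to convert them into typed values. It assumes a current Python 3 rather than the 2.4 the script targets, and `normalize` and `INT_FIELDS` are hypothetical names that are not part of the script above. The key names are the named groups of the regex.

```python
from datetime import datetime

# Named groups of the regex above that hold plain integers.
INT_FIELDS = ("pid", "client_port", "Tq", "Tw", "Tc", "Tr", "Tt",
              "HTTP_return_code", "bytes_read", "actconn", "feconn",
              "beconn", "srv_conn", "retries", "server_queue",
              "listener_queue")


def normalize(fields):
    """Return a copy of a groupdict() result with typed values.

    Hypothetical helper; missing or None fields are left untouched.
    """
    result = dict(fields)

    for name in INT_FIELDS:
        value = result.get(name)
        if value is not None:
            # int() handles both "-1" timers and the "+" prefix Tt may carry.
            result[name] = int(value)

    if result.get("date"):
        # %b is locale dependent; this assumes English month abbreviations.
        result["date"] = datetime.strptime(result["date"],
                                           "%d/%b/%Y:%H:%M:%S.%f")

    request = result.get("HTTP_request")
    if request:
        # Only split well-formed "METHOD PATH VERSION" request lines;
        # anything else is kept as the raw string only.
        parts = request.split(" ")
        if len(parts) == 3:
            result["method"], result["path"], result["http_version"] = parts

    return result
```

Inside the loop of scan() this would be called as `fields = normalize(match.groupdict())`, before the captured headers are split.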
Willy Tarreau
2011-03-20 12:36:59 UTC
Hi Holger,
Post by Holger Just
Hi Roy,
Post by Roy Smith
Before I reinvent the wheel, has anybody already written code to parse
haproxy log messages with Python?
I have, although it's not _that_ fast. My approach takes about 1
minute per 100 MB of gzipped logs (at roughly 10:1 compression).
If your use case is covered by the features of halog, you should
definitely try that instead. It's written by Willy himself and can easily
max out your streaming file I/O (meaning it is orders of magnitude faster
than anything you could do in Python itself).
In fact, I'd like halog to be more commonly usable as a low-level
"pre-parser", which means it would take care of extracting the useful
information from the logs so that higher-level scripts can process
pre-digested information.

Of course it will never be able to do everything, but if some scripts
don't need all the lines of a log file, we should ensure that halog
provides enough means to filter those lines out. For instance, right
now you can already use halog to ensure that only valid parsable lines
are returned. Most likely a number of other filtering options need to
be added; we need to figure out which ones.

Regards,
Willy
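
To illustrate the "pre-parser" split described above, here is a minimal sketch of a higher-level script that consumes lines already filtered by an earlier stage of a pipeline. The halog options needed for that filtering are not given in this thread, so none are shown; check halog's own help output. `count_status_codes` and `STATUS_RE` are hypothetical names, and the pattern is a simplified assumption reduced from the full regex earlier in the thread (the five timers followed by the status code).

```python
import re
from collections import Counter

# Simplified pattern (assumption): Tq/Tw/Tc/Tr/Tt followed by a
# three-digit HTTP status code, as in the HTTP log format.
STATUS_RE = re.compile(
    r' (?:-1|\d+)/(?:-1|\d+)/(?:-1|\d+)/(?:-1|\d+)/\+?\d+ (?P<status>\d{3}) ')


def count_status_codes(lines):
    """Count HTTP status codes over an iterable of log lines.

    Lines that do not look like HTTP request logs are skipped.
    """
    counts = Counter()
    for line in lines:
        match = STATUS_RE.search(line)
        if match:
            counts[match.group("status")] += 1
    return counts
```

At the end of a pipeline, a script would call `count_status_codes(sys.stdin)` after `import sys`, so whatever stage runs before it, halog or otherwise, decides which lines it sees.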
