Parsing Apache logs …

Seems I’m not alone in the world in wanting to parse Apache log files. Googling turned up lots of people bitterly complaining about it. Some folks wanted to write a grammar, and a flex/yacc/bison thingy. I am sure that there are some Java programmers who’ve been working on this … oh … 6 or 7 years or so, and may be approaching a solution, with Java bytecode only slightly below 1 PB in size.

But I digress. This is the core of the code I’ve mentioned before, and darn it, I wanted to get the logging in shape. So I looked at the horrible morass of terrible … ancient code. Really horrible stuff, that. And I looked at the logs.

And thought to myself … dammit, I can make a regex that handles this.

So I tried, and … sure enough, it works.

@column = ($line =~ /(\d+\.\d+\.\d+\.\d+)\s+(\S+)\s+(\S+)\s+\[(\d+\/\S+\/\d+):(\d+:\d+:\d+)\s+([-+]?\d+)\]\s+"(.*?)\s+HTTP\/\d+\.\d+"\s+(\d+)\s+(\d+|-)\s+"(.*?)"\s+"(.*?)"/);
	# parsed it BABY!!!
	# c[0] = IP address (IPv4 only; the dots are escaped so they match literal dots)
	# c[1] = identd / remote logname (almost always -)
	# c[2] = authenticated user name (or - for none)
	# c[3] = date
	# c[4] = time
	# c[5] = timezone (relative to GMT)
	# c[6] = incoming request (GET, PUT, HEAD, ... with relative URI part)
	# c[7] = return code (200, 404, ...)
	# c[8] = size of returned data in bytes (or - when no body was sent)
	# c[9] = referrer (or - for none)
	# c[10]= User Agent string

As Chris pointed out, there’s an XKCD for that.

Yeah. Baby! My inner loop just lost 80% of its lines. Much easier to understand (is it wrong that I can parse some subset of regexes in my head? The recursive ones give me a headache and I have to start banging my head against the wall to stop them).
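For anyone who can't parse it in their head, here is a rough sketch of the same idea spread out with the /x modifier, so each field gets its own line and comment. It assumes the stock Apache "combined" log format and IPv4 clients. The names `$combined` and `parse_line` and the sample line are made up for illustration, not what is in my actual loop.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# One field per line; /x lets whitespace and comments live inside the pattern.
my $combined = qr{
    (\d+\.\d+\.\d+\.\d+) \s+         # c[0]  IPv4 address
    (\S+) \s+                        # c[1]  identd name, usually -
    (\S+) \s+                        # c[2]  authenticated user, or -
    \[ (\d+/\S+/\d+) :               # c[3]  date
       (\d+:\d+:\d+) \s+             # c[4]  time
       ([-+]?\d+) \] \s+             # c[5]  timezone offset
    "(.*?) \s+ HTTP/\d+\.\d+" \s+    # c[6]  method and URI
    (\d+) \s+                        # c[7]  status code
    (\d+|-) \s+                      # c[8]  bytes sent, or -
    "(.*?)" \s+                      # c[9]  referrer
    "(.*?)"                          # c[10] user agent
}x;

# Hypothetical helper: returns an array ref of the 11 fields,
# or undef when the line does not look like a combined-format entry.
sub parse_line {
    my ($line) = @_;
    my @c = $line =~ $combined;
    return @c ? \@c : undef;
}

# Made-up sample entry in combined format.
my $sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           . '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
           . '"http://www.example.com/start.html" '
           . '"Mozilla/4.08 [en] (Win98; I ;Nav)"';

my $c = parse_line($sample);
print "$c->[0] $c->[7] $c->[6]\n" if $c;
```

Lines that don't match (IPv6 clients, a custom LogFormat, a request with no HTTP version) come back as undef, so the loop can count or dump them instead of silently mangling them.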

A minor error in my edits in the loop, will fix now. Nice that this works so well …
