Parsing Apache logs ...
By joe
Seems I'm not alone in the world in wanting to parse Apache log files. I googled and found lots of people bitterly complaining about it. Some folks wanted to write a grammar, and a flex/yacc/bison thingy. I am sure that there are some Java programmers who've been working on this … oh … 6 or 7 years or so, and may be approaching a solution, with Java bytecode only slightly below 1 PB in size. But I digress. This is the core of the code I've mentioned before, and darn it, I wanted to get the logging in shape. So I looked at the horrible morass of terrible … ancient code. Really horrible stuff, that. And I looked at the logs. And thought to myself … dammit, I can make a regex that handles this. So I tried, and … sure enough, it works.
<code>
@column = ($line =~ /(\d+\.\d+\.\d+\.\d+)\s+(\S+)\s+(\S+)\s+\[(\d+\/\S+\/\d+):(\d+:\d+:\d+)\s+([-+]?\d+)\]\s+"(.*?)\s+HTTP\/\d+\.\d+"\s+(\d+)\s+(\d+|-)\s+"(.*?)"\s+"(.*?)"/);
# parsed it BABY!!!
# c[0] = IP address (IPv4 only; the dots are escaped so they match literally)
# c[1] = remote logname from identd (usually -)
# c[2] = authenticated user name (or - for none)
# c[3] = date
# c[4] = time
# c[5] = timezone (relative to GMT)
# c[6] = incoming request (GET, PUT, HEAD, ... with relative URI part)
# c[7] = return code (200, 404, ...)
# c[8] = size of returned data in bytes (or - when no body was sent)
# c[9] = referrer (or - for none)
# c[10]= User Agent string
</code>
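For anyone who wants to poke at this outside the original script, here is a minimal, self-contained sketch of the same idea. It is not the code from the actual loop: the `parse_log_line` helper, the `$LOG_RE` name, and the field names are made up for illustration, and the sample line is just a typical combined-format entry. It assumes Perl 5.10 or newer (for named captures) and IPv4 client addresses, and it uses `/x` so the regex can be spread over several lines.

```perl
use strict;
use warnings;
use 5.010;    # named captures (?<name>...) and the %+ hash need 5.10+

# Same pattern as above, split up with /x and given field names.
# Assumption: Apache "combined" log format with IPv4 client addresses.
my $LOG_RE = qr{
    ^(?<ip>\d+\.\d+\.\d+\.\d+) \s+
    (?<ident>\S+) \s+
    (?<user>\S+) \s+
    \[ (?<date>\d+/\S+/\d+) : (?<time>\d+:\d+:\d+) \s+ (?<tz>[-+]?\d+) \] \s+
    " (?<request>.*?) \s+ HTTP/\d+\.\d+ " \s+
    (?<status>\d+) \s+
    (?<bytes>\d+|-) \s+
    " (?<referrer>.*?) " \s+
    " (?<agent>.*?) "
}x;

# Hypothetical helper: returns a hash ref of fields, or nothing on no match.
sub parse_log_line {
    my ($line) = @_;
    return unless $line =~ $LOG_RE;
    my %fields = %+;    # copy the captures before another match overwrites them
    return \%fields;
}

# Sample combined-format line, for illustration only.
my $sample = '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
           . '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
           . '"http://www.example.com/start.html" '
           . '"Mozilla/4.08 [en] (Win98; I ;Nav)"';

my $f = parse_log_line($sample);
print "$f->{ip} $f->{status} $f->{request}\n" if $f;
```

Named captures cost a few extra characters but mean nobody has to remember that index 8 is the byte count. Lines that do not match come back as undef, so the caller can count or log the rejects instead of silently working with an empty list.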
As Chris pointed out, there's an XKCD for that. Yeah. Baby! My inner loop just lost 80% of its lines. Much easier to understand (is it wrong that I can parse some subset of regexes in my head? The recursive ones give me a headache and I have to start banging my head against the wall to stop them). There was a minor error in my edits to the loop; will fix now. Nice that this works so well …