Log parsing in Python using regular expressions

Log parsing is a common task for DevOps engineers and SREs, and we already have an earlier post on the topic. You can find the post here

In the earlier post, I did not use regex; I parsed the logs using string manipulation only.

What is a regular expression?

A regular expression is a sequence of letters, digits, and symbols that defines a search pattern. Regex support is available in almost every programming language. A regex engine, often one based on backtracking, searches the input text for matches of the pattern.
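As a small illustration of a search pattern, the sketch below uses Python's built-in re module to pull IPv4-looking tokens out of a string. The sample text and the variable names are made up for this example.

```python
import re

# Hypothetical sample text, not taken from a real log.
text = "client 10.0.0.1 connected to 192.168.1.20 on port 443"

# One to three digits, followed by three more ".digits" groups.
# This only checks the shape of an address, not that each part is <= 255.
ip_like = re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b")

matches = ip_like.findall(text)
print(matches)
```

The lone number 443 is not reported because the pattern requires all four dot-separated parts.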

In this post, I am going to share code that you can use to parse logs with the help of regular expressions. The code was written by Marco.

#!/usr/bin/env python3

import sys
import re

# Example of the log line format this pattern expects:
# 27.59.104.166 - - [04/Oct/2019:21:15:54 +0000] "GET /users/login HTTP/1.1" 200 41716 "-" "okhttp/3.12.1"

LOG_LINE_REGEX = r'^(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*\[(?P<timestamp>.*)\]\s"(?P<verb>[A-Z]+)\s(?P<path>[\w\/]+)\s+(?P<protocol>[\w\/\.]+)"\s(?P<status_code>\d+)\s(?P<response_size>\d+).*'

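# Compile once so the same pattern object is reused for every line.
pattern = re.compile(LOG_LINE_REGEX)

# Read log lines from standard input; lines that do not match are skipped.
for line in sys.stdin:
    m = pattern.match(line)
    if m:
        # groupdict() maps each named group (IP, timestamp, ...) to its matched text.
        print(m.groupdict())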
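To try the pattern without piping a file through standard input, the following sketch applies it to the sample line from the comment and converts a few fields to richer types. The parse_line helper and the type conversions are my own additions for illustration, not part of Marco's script. The timestamp format string is an assumption based on the sample line, and %b expects English month names in the active locale.

```python
import re
from datetime import datetime

# Same pattern as in the script above.
LOG_LINE_REGEX = r'^(?P<IP>\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}).*\[(?P<timestamp>.*)\]\s"(?P<verb>[A-Z]+)\s(?P<path>[\w\/]+)\s+(?P<protocol>[\w\/\.]+)"\s(?P<status_code>\d+)\s(?P<response_size>\d+).*'
pattern = re.compile(LOG_LINE_REGEX)

# Sample line copied from the comment in the script.
SAMPLE = '27.59.104.166 - - [04/Oct/2019:21:15:54 +0000] "GET /users/login HTTP/1.1" 200 41716 "-" "okhttp/3.12.1"'


def parse_line(line):
    """Hypothetical helper: return a dict of typed fields, or None if the line does not match."""
    m = pattern.match(line)
    if m is None:
        return None
    record = m.groupdict()
    # Every group is captured as a string; convert the numeric ones.
    record["status_code"] = int(record["status_code"])
    record["response_size"] = int(record["response_size"])
    # Assumed timestamp layout: day/month-abbreviation/year:time, then a UTC offset.
    record["timestamp"] = datetime.strptime(record["timestamp"], "%d/%b/%Y:%H:%M:%S %z")
    return record


record = parse_line(SAMPLE)
print(record)
```

Returning None for lines that do not match keeps the decision about what to do with them (skip, count, or log) with the caller.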

You can use this code to parse logs; the expected log format is shown in the comment inside the code. To adapt the pattern to a different log format, you can use https://regexr.com/ to test your regex. You can also reach out to me through the comments anytime.
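As one example of adapting the pattern, note that the path group in the script, [\w\/]+, accepts only word characters and slashes, so a request for a path such as /index.html?x=1 would not match and the line would be skipped. The sketch below is a hypothetical stricter variant that describes the whole line and accepts such paths. The extra group names (ident, user, referrer, user_agent) and the assumption that the logs follow the layout of the sample line are mine, not part of the original script.

```python
import re

# Hypothetical variant: every field of the line is described explicitly.
STRICT_LOG_LINE = re.compile(
    r'(?P<IP>\d{1,3}(?:\.\d{1,3}){3})\s'
    r'(?P<ident>\S+)\s(?P<user>\S+)\s'
    r'\[(?P<timestamp>[^\]]+)\]\s'          # anything up to the first closing bracket
    r'"(?P<verb>[A-Z]+)\s(?P<path>\S+)\s(?P<protocol>HTTP/\d(?:\.\d)?)"\s'
    r'(?P<status_code>\d{3})\s'
    r'(?P<response_size>\d+|-)\s'           # assumed: "-" may appear when no size is logged
    r'"(?P<referrer>[^"]*)"\s'
    r'"(?P<user_agent>[^"]*)"'
)


def parse_strict(line):
    """Return the named groups if the entire line matches, otherwise None."""
    # fullmatch() rejects lines with unexpected leading or trailing content.
    m = STRICT_LOG_LINE.fullmatch(line.rstrip("\n"))
    return m.groupdict() if m else None


line = '27.59.104.166 - - [04/Oct/2019:21:15:54 +0000] "GET /index.html?x=1 HTTP/1.1" 200 41716 "-" "okhttp/3.12.1"'
fields = parse_strict(line)
print(fields)
```

Using negated character classes such as [^\]]+ and [^"]* instead of .* also leaves the engine fewer ways to backtrack when a line does not match.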

Note: If you use regular expressions and are not aware of the pitfalls of backtracking, please read up on them. A badly written regex can take down your systems. This is what happened to Cloudflare; you can read their blog post here:
https://blog.cloudflare.com/cloudflare-outage/
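To see the effect on a small scale, the sketch below times a textbook nested-quantifier pattern against an equivalent pattern without nesting. This is a generic illustration, not the rule involved in the Cloudflare incident; the pattern names and input sizes are chosen for this example, and the exact timings will vary by machine.

```python
import re
import time

# Nested quantifiers: the engine can split a run of "a" characters in many ways.
RISKY = re.compile(r'^(a+)+$')
# Accepts the same strings, but leaves only one way to consume the run.
SAFE = re.compile(r'^a+$')


def time_match(pattern, text):
    """Return (match result, elapsed seconds) for a single match attempt."""
    start = time.perf_counter()
    result = pattern.match(text)
    return result, time.perf_counter() - start


# The trailing "b" makes every attempt fail, which forces the engine to backtrack.
for n in (12, 16, 20):
    text = "a" * n + "b"
    _, risky_seconds = time_match(RISKY, text)
    _, safe_seconds = time_match(SAFE, text)
    print(f"n={n}: risky={risky_seconds:.4f}s safe={safe_seconds:.6f}s")
```

On a backtracking engine such as Python's re, the time for the nested pattern tends to grow very quickly as n increases, while the simple pattern stays fast. Keep n small when experimenting.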


Gaurav Yadav

Gaurav is a cloud infrastructure engineer, full stack web developer, and blogger. A sportsperson at heart, he loves football. He loves working on scale and is always keen to learn new tech. He is experienced with CI/CD, distributed cloud infrastructure, build systems, and a lot of SRE work.

1 COMMENT
  • Marco

    nice mention of the cloudflare incident.
    Also partial matches are a danger. That’s why I’d almost always recommend being as explicit as possible and describing the whole line.
    Might be more work but will help to avoid errors.
