Parsing Apache or Nginx Web Server Logs with Java

Web servers like Apache, IIS and Nginx serve client requests over HTTP. In the process, they record information about requests and responses in log files using standard log formats. Because the same formats are used across different web servers, log entries are easy to read regardless of which server wrote them.

In this post, I will go over how to parse known log formats using Java. The Java code is hosted on GitHub, and you can download it or use it as a base for your own programs.

In this example, I read the log file and stream through it line by line, looking for entries with IP 127.0.0.1 and counting the entries with HTTP status code 200.

Common Web Server Log Formats

Web servers write log entries using standard formats. Some of the common ones are the COMMON, COMBINED, and NCSA log formats. Working with standard formats makes log file entries easy to understand, and many free and commercial log parsing tools understand these formats as well.

You may be thinking that you have additional logging needs. Maybe, in case of errors, you want to log more detail than the web server writes by default. This can be done using custom log formats. When setting up your log analyzer tools, you then enter the custom log format string you have created. To understand Apache log formats and the available options, check out the post on Apache Logging.

In this post, though, I will be parsing only COMMON log format entries.

Parsing Logs with Java

To write a Java program that parses a log file, I chose to go with a standard command-line program. You can access the test code I have uploaded on GitHub and use it as a base to start.

I pass a config file to the program as a parameter, and it defines the various settings. My config file has the following entries.

# Name of the server. Could be Apache or Nginx.
server=apache
# Looking at the access log. The other option is the error log.
logtype=access
# Log file format. I have only coded for the common log format for now.
logformat=common
# Path to the log file.
logfilename=/var/logs/apache2/access.log

As the program develops further, more options can be added to the config file, but for now that is all I have.
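
A config file like this maps naturally onto java.util.Properties. The following is a minimal sketch of loading it, not the code from the repository: the class ConfigSketch and its parse method are hypothetical names, and the config text is inlined so the example runs on its own. One detail worth knowing is that Properties treats # as a comment marker only at the start of a line, so a # placed after a value is read as part of that value.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

// Hypothetical helper for illustration; not part of the original project.
class ConfigSketch {

    // Parses key=value text into a Properties object.
    // A real program would pass a FileReader for the config file instead.
    static Properties parse(String text) {
        Properties props = new Properties();
        try (StringReader reader = new StringReader(text)) {
            props.load(reader);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        String text = String.join("\n",
                "# Name of the server.",
                "server=apache",
                "logformat=common",
                "logfilename=/var/logs/apache2/access.log",
                "logtype=access # this trailing text becomes part of the value");

        Properties props = parse(text);
        System.out.println(props.getProperty("server"));      // prints "apache"
        System.out.println(props.getProperty("logfilename"));
        System.out.println(props.getProperty("logtype"));     // includes the trailing "# ..." text
        // A missing key falls back to the supplied default.
        System.out.println(props.getProperty("search-expression", "(not set)"));
    }
}
```

Keeping each comment on its own line avoids the surprise shown for logtype.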

Common Log Format RegEx

I am using the following regex to parse the common log format.

String COMMON_LOG_FORMAT = "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+)\\s*(\\S+)?\\s*\" (\\d{3}) (\\S+)";
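
To see what each capture group holds, it helps to run the expression against a single entry. The sketch below uses the regex exactly as given above; the class name CommonLogRegexDemo, the parse helper, and the sample log line are assumptions made for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex itself comes from the post.
class CommonLogRegexDemo {

    static final String COMMON_LOG_FORMAT =
            "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+)\\s*(\\S+)?\\s*\" (\\d{3}) (\\S+)";

    // Compile once and reuse; Pattern objects are immutable and thread-safe.
    static final Pattern PATTERN = Pattern.compile(COMMON_LOG_FORMAT);

    // Returns the nine captured fields in order, or null if the line does not match.
    static String[] parse(String line) {
        Matcher m = PATTERN.matcher(line);
        if (!m.find()) {
            return null;
        }
        String[] fields = new String[m.groupCount()];
        for (int i = 1; i <= m.groupCount(); i++) {
            fields[i - 1] = m.group(i);
        }
        return fields;
    }

    public static void main(String[] args) {
        // Made-up sample entry in common log format.
        String line = "127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] "
                + "\"GET /apache_pb.gif HTTP/1.0\" 200 2326";

        String[] labels = {"host", "identity", "user", "timestamp",
                "method", "resource", "protocol", "status", "bytes"};
        String[] fields = parse(line);
        for (int i = 0; i < fields.length; i++) {
            // Group numbers are 1-based, so group 1 is the host and group 8 the status.
            System.out.println("group " + (i + 1) + " (" + labels[i] + "): " + fields[i]);
        }
    }
}
```

Groups 1 and 8, the client address and the status code, are the two used in the rest of this post.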

Java Parsing Loop for Log Entries

My Java method to parse the log entries looks like this.

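import java.io.IOException;
import java.nio.file.Files;
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Stream;

public void filterWithRegex() throws IOException {

    // Optional line filter read from the config file (for example, an IP address).
    String regString = Config.props.getProperty("search-expression");
    final Pattern linePattern = (regString == null) ? null : Pattern.compile(regString);
    final Pattern columnPattern = Pattern.compile(COMMON_LOG_FORMAT);

    // Number of HTTP 200 responses seen per client IP address.
    Map<String, Integer> countIP = new HashMap<>();

    // Files.lines reads lazily, so the whole log is never held in memory at once.
    try (Stream<String> stream = Files.lines(logfile.getLogFile())) {

        stream.forEach(line -> {
            // Skip lines that do not match the filter expression, if one is set.
            if (linePattern != null && !linePattern.matcher(line).find()) {
                return;
            }

            // Each line holds at most one entry, so a single find() is enough.
            Matcher matcher = columnPattern.matcher(line);
            if (!matcher.find()) {
                return; // Not a common log format entry.
            }

            String ip = matcher.group(1);                     // Client address.
            int status = Integer.parseInt(matcher.group(8));  // HTTP status code.

            // Count the entry under its IP address if the response code is 200.
            if (status == 200) {
                countIP.merge(ip, 1, Integer::sum);
            }
        });
    }

    // Print result
    for (Map.Entry<String, Integer> entry : countIP.entrySet()) {
        System.out.println(entry.getKey() + " " + entry.getValue());
    }
}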

In the code above, I read the file one line at a time, parse each line with the regex, and count, per IP address, the entries that have a response code of 200.
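
The same per-IP count can also be expressed with stream collectors instead of a hand-maintained HashMap. This is only a sketch of that alternative, not the code from the repository: the class StatusCountSketch, the method countOkByIp, and the sample log lines are all made up for illustration, and the lines are held in memory rather than read from a file.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical class for illustration.
class StatusCountSketch {

    static final Pattern PATTERN = Pattern.compile(
            "^(\\S+) (\\S+) (\\S+) \\[([\\w:/]+\\s[+\\-]\\d{4})\\] \"(\\S+) (\\S+)\\s*(\\S+)?\\s*\" (\\d{3}) (\\S+)");

    // Counts entries with status 200 per client IP; the result is sorted by IP string.
    static Map<String, Long> countOkByIp(Stream<String> lines) {
        return lines
                .map(PATTERN::matcher)
                .filter(Matcher::find)                  // drop lines that do not match
                .filter(m -> m.group(8).equals("200"))  // keep only HTTP 200 entries
                .collect(Collectors.groupingBy(
                        m -> m.group(1), TreeMap::new, Collectors.counting()));
    }

    public static void main(String[] args) {
        // Made-up entries; a real program would pass Files.lines(path) instead.
        Stream<String> lines = Stream.of(
                "127.0.0.1 - - [10/Oct/2000:13:55:36 -0700] \"GET / HTTP/1.0\" 200 2326",
                "127.0.0.1 - - [10/Oct/2000:13:55:37 -0700] \"GET /missing HTTP/1.0\" 404 209",
                "10.0.0.5 - - [10/Oct/2000:13:55:38 -0700] \"GET /about HTTP/1.0\" 200 512",
                "127.0.0.1 - - [10/Oct/2000:13:55:39 -0700] \"GET /contact HTTP/1.0\" 200 100");

        System.out.println(countOkByIp(lines)); // prints {10.0.0.5=1, 127.0.0.1=2}
    }
}
```

Whether this reads better than the explicit loop is a matter of taste; the loop is easier to step through in a debugger, while the collector version keeps the counting logic in one expression.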

Results

The results I get are:

127.0.0.1 464

OK, so it is not very nicely formatted, but it shows that for IP 127.0.0.1 there are 464 responses with an HTTP code of 200.
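
If nicer output is wanted, the counts can be sorted and printed in aligned columns. The following is a rough sketch under assumptions: the class ReportSketch and its format method are hypothetical, and the second IP address and its count are invented so that the sorting has something to show.

```java
import java.util.Map;

// Hypothetical formatting helper for illustration.
class ReportSketch {

    // Builds a two-column report, highest count first.
    static String format(Map<String, Integer> counts) {
        StringBuilder sb = new StringBuilder();
        sb.append(String.format("%-15s %8s%n", "IP", "HTTP 200"));
        counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed())
                .forEach(e -> sb.append(
                        String.format("%-15s %8d%n", e.getKey(), e.getValue())));
        return sb.toString();
    }

    public static void main(String[] args) {
        // 464 is the count from the run above; the second entry is made up.
        Map<String, Integer> counts = Map.of("127.0.0.1", 464, "10.0.0.5", 12);
        System.out.print(format(counts));
    }
}
```

The width of 15 in the format string fits a dotted IPv4 address; IPv6 addresses would need a wider column.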

Conclusion

As you can see, parsing a web server log file from Apache or Nginx is not very hard, and the code I have provided does a good job of it. You can add features to it and do a lot more than what I have shown. But this is not the path that I recommend.

There are tools available, either free or, in the case of commercial offerings, at a nominal cost, that allow you to do a lot more with log analysis and filtering than you can by writing your own code.

Check out the main webpage on all posts and guides related to the Apache web server.
