Application, Network, Server, and Web Server Monitoring Metrics

In this post, I will discuss in detail application monitoring metrics, server monitoring metrics, and web server monitoring metrics.

Application, server, and web server monitoring metrics provide key insights into the health of your application infrastructure, surfacing real-time performance information and helping you troubleshoot issues before they escalate.

This post first reviews the monitoring categories, then looks at how metrics are grouped, and finally goes over the relevant metrics for each category.

Key Performance Indicators (KPIs) By Category

KPIs are essential metrics that help in evaluating the performance of applications and servers. Some of the critical KPIs for application and server monitoring include:


Latency

Latency refers to the time it takes for a request to travel from the client to the server and back. High latency can lead to slow application performance, negatively impacting user experience. Monitoring latency helps identify bottlenecks and ensure that your applications are responsive and efficient.

Key Latency Metrics Table
Content Download Time: Duration for downloading requested data or assets from the server to the client, affecting the user’s perceived loading time.
Database Query Time: Duration for executing and retrieving results from a database query, affecting application performance and server load.
DNS Resolution Time: Time required to resolve a domain name to an IP address, affecting the initial connection to an application or server.
Network Latency: The time it takes for data to travel from the web server to the requesting user’s device. High network latency can negatively impact user experience.
Round-Trip Time (RTT): The time taken for a request to be sent to the server and the response to be received by the client.
Server Processing Time: Time spent by the server in processing the client’s request and generating a response.
SSL/TLS Handshake Time: Time taken to negotiate and establish a secure connection between the client and server.
TCP Connection Time: Duration for establishing a TCP connection between the client and server, impacting the initial communication process.
Time to First Byte (TTFB): The duration between the client request and receiving the first byte of data from the server.
Time to Interactive (TTI): Time taken for a web page to become fully interactive, reflecting overall application responsiveness and user experience.
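As a rough sketch of how such timings are captured in practice, the snippet below times an arbitrary block of code with Python's `time.perf_counter`. The `measure_latency` helper and the simulated workload are illustrative, not part of any monitoring product.

```python
import time
from contextlib import contextmanager

@contextmanager
def measure_latency(results, label):
    """Record the wall-clock duration of the enclosed block in `results`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start

# Example: time a simulated server-processing step.
timings = {}
with measure_latency(timings, "server_processing"):
    time.sleep(0.05)  # stand-in for real work

print(f"server_processing: {timings['server_processing'] * 1000:.1f} ms")
```

The same wrapper can be placed around a DNS lookup, a database query, or a full request to break total latency into the phases listed above.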

Response Time

Response time measures the time taken for an application to process a request and return a response. This metric helps determine the efficiency of your applications and pinpoint potential performance issues.

Key Response Times Metrics Table
API response time: The time it takes for an API to process a request and return the requested data.
Application response time: The time an application takes to process and respond to a user request.
Average response time: The mean time taken for all requests to be processed and responded to.
Database query time: The duration required to execute and return results for a database query.
Network latency: The time required for data to travel between two points on a network.
Page load time: The duration it takes for a webpage to fully load and render in a user’s browser.
Server response time: The duration it takes for a server to process and return a request.
Time to first byte (TTFB): The time from making an HTTP request to receiving the first byte of data.
95th percentile response time: The response time within which 95% of requests are processed. This helps identify outlier requests that might skew the average.
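The 95th percentile from the table is straightforward to compute from raw samples. The sketch below uses the nearest-rank method; the `percentile` helper and the sample data are hypothetical.

```python
import math

def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[rank - 1]

# Hypothetical response times in milliseconds.
response_times = list(range(1, 101))  # 1..100 ms
p95 = percentile(response_times, 95)
avg = sum(response_times) / len(response_times)
print(f"average: {avg:.1f} ms, p95: {p95} ms")
```

Comparing the average against the p95 value in this way is a quick check for a long tail of slow requests hiding behind a healthy-looking mean.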


Throughput

Throughput refers to the rate at which requests are processed and completed by your application or server. High throughput ensures that your systems can handle large volumes of traffic without being overwhelmed. Monitoring throughput enables you to identify capacity constraints and make informed decisions about scaling.

Key Throughput Metrics Table
Application throughput: The number of transactions or requests processed by an application within a given time frame.
CPU throughput: The amount of processing work completed by a CPU in a given time.
Database throughput: The rate at which a database system can process queries or transactions.
Disk throughput: The rate at which data is transferred to or from a disk, highlighting the efficiency and performance of a storage device.
Message queue throughput: The rate at which messages are processed within a message queuing system.
Network throughput: The amount of data transmitted over a network in a given time frame.
Storage throughput: The speed at which a storage system can read or write data, representing the performance of the storage infrastructure.
Web server throughput: The number of HTTP requests processed by a web server in a specified time period.
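Average throughput can be derived from request arrival timestamps. The sketch below is a minimal illustration; the `requests_per_second` helper and the arrival times are made up for the example.

```python
def requests_per_second(timestamps):
    """Average throughput over the observed window (timestamps in seconds)."""
    if len(timestamps) < 2:
        return float(len(timestamps))
    window = max(timestamps) - min(timestamps)
    return len(timestamps) / window if window > 0 else float(len(timestamps))

# Hypothetical request arrival times spanning a 2-second window.
arrivals = [0.0, 0.2, 0.5, 0.7, 1.0, 1.2, 1.4, 1.6, 1.8, 2.0]
rps = requests_per_second(arrivals)
print(f"throughput: {rps:.1f} requests/second")
```

In practice you would feed this from access-log timestamps and track the value over successive windows to spot capacity trends.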

Error Rates

Error rates help you identify issues with your applications or servers that may affect performance or reliability. Monitoring error rates allows you to spot trends and address potential problems before they become critical.

Key Error Rates Metrics Table
API error rate: The percentage of API calls returning error responses.
Application exception rate: The frequency at which unhandled exceptions occur within the application, signaling potential bugs or configuration issues.
Database error rate: The proportion of failed database queries or transactions. This may reflect database connection errors or issues with the data or queries being executed.
HTTP error rate: The percentage of HTTP requests resulting in error status codes (e.g., 4xx, 5xx), indicating potential issues with the web server or application.
Network error rate: The proportion of network-related errors, such as timeouts or dropped connections, suggesting potential issues with network infrastructure or configuration.
Security error rate: The percentage of security-related errors, like failed authentication attempts or unauthorized access. This indicates potential vulnerabilities or threats.
System error rate: The rate at which system-level errors occur, such as hardware failures or operating system crashes.
User-reported error rate: The frequency of user-reported issues or errors, providing valuable feedback for identifying and addressing problems impacting user experience.
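HTTP error rate, for instance, can be computed directly from a sample of status codes, split into client (4xx) and server (5xx) errors. The function name and the sample codes below are illustrative.

```python
from collections import Counter

def error_rates(status_codes):
    """Percentage of 4xx (client) and 5xx (server) responses in a batch of requests."""
    total = len(status_codes)
    counts = Counter(code // 100 for code in status_codes)  # bucket by status class
    return {
        "client_error_pct": 100 * counts[4] / total,
        "server_error_pct": 100 * counts[5] / total,
    }

# Hypothetical sample of HTTP status codes from an access log.
codes = [200, 200, 200, 301, 404, 404, 500, 200, 200, 503]
rates = error_rates(codes)
print(rates)
```

Separating the two classes matters for alerting: a spike in 4xx often points at clients or broken links, while a spike in 5xx points at the server or application itself.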

Resource Utilization

Resource utilization metrics indicate how effectively your applications and servers are using system resources, such as CPU, memory, disk, and network. Monitoring resource utilization helps you identify inefficiencies, capacity constraints, and potential bottlenecks.

Key Resource Utilization Metrics Table
CPU usage: The percentage of CPU capacity being used by your application or server.
Cache Hit Ratio: The proportion of data requests being served from cache memory.
Disk I/O: Rate at which data is read from and written to storage devices, impacting application performance and server responsiveness.
Disk usage: Percentage of storage capacity used on a server, impacting read/write speeds and overall system performance.
Error Rate: Number of errors or exceptions encountered within an application or server, indicating potential issues and areas for improvement.
Garbage collection: Frequency and duration of memory cleanup operations, impacting application performance and resource management.
Load average: A measure of the server’s workload over a given period, typically reported as rolling averages over the last 1, 5, and 15 minutes.
Memory usage: Amount of RAM consumed by an application or server, affecting performance and efficiency.
Network usage: Rate at which data is transmitted and received over your server’s network connections.
Network bandwidth: Volume of data transmitted or received by an application or server over a specific period.
Network latency: Time taken for a data packet to travel between source and destination, whether for server-to-server communication or from a client to a server.
Thread count: The number of active threads in an application or server. Interpret this in the context of the app being monitored: some apps use many threads for processing while others work with only a few, so context is important for using this metric correctly.
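The cache hit ratio from the table reduces to a simple division over the hit and miss counters your cache exposes. A minimal sketch, with hypothetical counter values:

```python
def cache_hit_ratio(hits, misses):
    """Fraction of data requests served from cache rather than the backing store."""
    total = hits + misses
    return hits / total if total else 0.0

# Hypothetical counters scraped from a cache server.
ratio = cache_hit_ratio(hits=8_000, misses=2_000)
print(f"cache hit ratio: {ratio:.0%}")
```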

Additional Notes on Storage

Disk Capacity

Disk space usage affects the time it takes to access data from a given block on a storage device. In general, the higher the disk usage percentage (space consumed / total storage available), the slower the I/O operations.

Stored Data Access Patterns

Imagine an application that does a lot of random reads and writes from disk storage. Unlike memory, disk access is slow, often hundreds of times slower. This delay in accessing stored data leaves the CPU sitting idle while the disk works to fetch the required data. In these scenarios, you may want to look into faster storage or caching solutions to reduce the bottleneck.

Page Fault and Page Swaps

Available disk capacity also comes into play when applications consume most of the available memory and start to swap memory pages to disk. Excessive page swapping results in page faults, which can cause a system to thrash.

Thrashing describes the situation where the operating system keeps swapping pages between disk and memory, with one possible trigger being frequent process context switching.
Page faults can be a big problem on virtual servers, whether hosted locally or in the cloud.
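On Unix-like systems, Python's standard `resource` module exposes page-fault counters for the current process; the distinction between minor and major faults matters because only major faults required disk I/O. A minimal sketch (this API is not available on Windows):

```python
import resource

# Snapshot page-fault counters for the current process (Unix-only API).
usage = resource.getrusage(resource.RUSAGE_SELF)
print(f"minor page faults (no disk I/O): {usage.ru_minflt}")
print(f"major page faults (required disk I/O): {usage.ru_majflt}")
```

A steadily climbing major-fault count under load is one signal that a process is being pushed into swap.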


Availability

Availability metrics track the uptime and reliability of your applications and servers. High availability ensures that your systems are accessible and operational when users need them. Monitoring availability allows you to detect and address potential downtime issues.

Key Availability Metrics Table
Downtime: The total time a system, application, or service is unavailable or inaccessible to users due to scheduled maintenance, unexpected outages, or other issues.
Failover time: The amount of time it takes for a backup or redundant system to take over in the event of a primary system failure.
Incident response time: The time it takes for a support team to acknowledge and begin resolving an incident or outage.
Mean Time Between Failures (MTBF): The average time between system or component failures. MTBF can be used to estimate the expected lifespan of hardware components.
Mean Time To Recovery (MTTR): The average time it takes to restore a system, application, or service to full functionality after a failure or outage. A lower MTTR indicates a more efficient recovery process.
Redundancy ratio: The ratio of backup or redundant components to primary components in a system. A higher redundancy ratio indicates a more robust and resilient infrastructure.
Service Level Agreement (SLA) compliance: The percentage of time a service provider meets the agreed-upon availability targets outlined in the SLA.
Uptime: The amount of time a system, application, or service has been continuously operational without interruption.
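Availability, MTBF, and MTTR can all be derived from a measurement window and a list of outage durations. The helper functions and the month of sample data below are hypothetical:

```python
def availability_pct(total_seconds, downtime_seconds):
    """Uptime as a percentage of the measurement window."""
    return 100 * (total_seconds - downtime_seconds) / total_seconds

def mtbf_mttr(outages, total_seconds):
    """MTBF and MTTR from a list of outage durations (seconds) within the window."""
    downtime = sum(outages)
    mtbf = (total_seconds - downtime) / len(outages)  # average operating time between failures
    mttr = downtime / len(outages)                    # average time to recover per failure
    return mtbf, mttr

# Hypothetical month: 30 days with three outages totalling 45 minutes.
month = 30 * 24 * 3600
outages = [600, 900, 1200]  # seconds
print(f"availability: {availability_pct(month, sum(outages)):.3f}%")
mtbf, mttr = mtbf_mttr(outages, month)
print(f"MTBF: {mtbf / 3600:.1f} h, MTTR: {mttr / 60:.1f} min")
```

The same numbers feed SLA compliance directly: compare the computed availability percentage against the target promised in the agreement.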

Application, Network and Server Monitoring

Application Monitoring Metrics

Application monitoring focuses on measuring the performance and health of software applications from the end-user perspective. This type of monitoring tracks metrics such as response time, throughput, error rates, and user satisfaction.

By monitoring application performance, businesses can ensure smooth operation, minimize downtime, and deliver a seamless user experience.

Common Application Monitoring Metrics

  • Cache hit ratio
  • Error rates
  • Garbage collection
  • Latency
  • Response time
  • Resource utilization
  • Throughput

Apdex score: An industry-standard metric used to measure user satisfaction based on application response times. The Apdex score ranges from 0 to 1, with higher scores indicating better user satisfaction. This metric considers both satisfactory and unsatisfactory response times, providing a comprehensive view of the application’s performance from the user’s perspective.
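The Apdex formula itself is simple: satisfied requests count fully, tolerating requests count half, and frustrated requests count zero. Below is a minimal sketch using the standard tolerating band of four times the target threshold; the sample times are made up.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total, with the standard 4T tolerating band."""
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)

# Hypothetical response times (seconds) against a 0.5 s target threshold.
times = [0.2, 0.3, 0.5, 1.5, 5.0]
score = apdex(times, threshold=0.5)
print(f"Apdex: {score:.2f}")
```

Here three requests satisfy the 0.5 s target, one falls in the tolerating band (up to 2.0 s), and one is frustrated, giving (3 + 0.5) / 5 = 0.70.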

Tools For Application Monitoring

There are a variety of application monitoring tools available, offering different features and capabilities. Some popular application monitoring tools include:

  1. AppDynamics
  2. Datadog APM
  3. Dynatrace
  4. New Relic APM
  5. SolarWinds AppOptics

Application Monitoring Best Practices

  1. Focus on end-user experience: Prioritize monitoring metrics and events that have a direct impact on the end-user experience.
  2. Monitor dependencies: Track the performance of third-party services and components that your application relies on, as issues with dependencies can impact overall performance.
  3. Set up alerts and thresholds: Configure alerts based on predefined thresholds for critical performance metrics, enabling timely intervention when issues arise.
  4. Analyze trends and patterns: Regularly review performance data to identify trends and patterns, guiding optimization efforts and future development.
  5. Continuously improve: Use application monitoring insights to drive continuous improvement, refining processes and addressing performance bottlenecks.
Read: Application and Server Monitoring Best Practices

Key Metrics For Server Monitoring

Server monitoring metrics provide KPIs for tracking the performance and health of physical or virtual servers hosting applications, databases, and other services.

The primary objective of server monitoring is to ensure that the underlying hardware and software resources are functioning optimally and efficiently. Beyond resource optimization, you can also use these metrics to prevent downtime, plan capacity, and support security and compliance.

Common Server Monitoring Metrics

  • CPU utilization
  • Disk usage
  • Disk I/O
  • Load average
  • Memory utilization
  • Network latency
  • Network throughput
  • Server uptime

Tools For Server Monitoring

There are various tools available for server monitoring, ranging from open-source solutions to commercial products. Some popular server monitoring tools include:

  1. Datadog
  2. Nagios
  3. PRTG Network Monitor
  4. SolarWinds Server & Application Monitor
  5. Zabbix

Server Monitoring Best Practices

  1. Define key performance metrics: Identify the most relevant performance metrics for your specific server environment and focus on monitoring those.
  2. Set up alerts and thresholds: Configure alerts based on predefined thresholds to receive timely notifications of potential issues or anomalies.
  3. Monitor consistently: Regular, consistent monitoring helps establish a baseline for server performance, making it easier to detect deviations and trends.
  4. Automate monitoring tasks: Use monitoring tools to automate routine tasks, freeing up time for more strategic activities.
  5. Maintain a monitoring log: Keep a record of performance data, alerts, and incidents to facilitate root cause analysis and improve future monitoring efforts.
Read: Application and Server Monitoring Best Practices

Key Metrics For Web Server Monitoring

Webserver monitoring focuses on measuring the performance and availability of web servers, which are responsible for processing user requests and delivering content to them over the network.

Webserver monitoring is important for improving user experience, increasing application availability, enhancing security, and supporting search engine optimization and analytics.

Common Webserver Monitoring Metrics

Here is a list of common web server performance metrics along with a description of each:

  • Cache hit ratio
  • Concurrent connections
  • Error and status codes
  • HTTP response time
  • Network latency
  • Requests per second
  • Server resource utilization
  • Time to first byte (TTFB)

Tools For Webserver Monitoring

There are numerous web server monitoring tools available, catering to various needs and requirements. Some popular webserver monitoring tools include:

  1. ManageEngine Applications Manager
  2. NGINX Amplify
  3. Prometheus
  4. Sematext Synthetics
  5. SolarWinds Web Performance Monitor

Webserver Monitoring Best Practices

  1. Set performance baselines: Establish baseline performance metrics to help detect anomalies and deviations from expected web server behavior.
  2. Configure alerts and thresholds: Set up alerts based on predefined thresholds for critical performance metrics, enabling prompt intervention when issues arise.
  3. Monitor server logs: Regularly review webserver logs for errors, security issues, and trends that can inform optimization efforts.
  4. Optimize server configurations: Use web server monitoring insights to fine-tune server configurations and improve performance.
  5. Monitor security: Regularly check for security vulnerabilities and apply updates or patches to protect your web server from potential threats.
Read: Application and Server Monitoring Best Practices

Monitoring Strategies and Techniques

Now that we are equipped with all the monitoring metrics and KPIs, the final step is to develop a plan that ties all of this together.

In this section, I will discuss key considerations for selecting appropriate performance metrics, setting up alert thresholds, analyzing trends and patterns, and balancing proactive and reactive monitoring.

Selecting Appropriate Performance Metrics

Choose metrics that are directly related to the performance and user experience you are targeting. Identify the active applications, servers and networks you are going to monitor and select the relevant metrics.

The metrics you select for monitoring should be actionable, meaning that they provide relevant information that you can use to take a specific action.

Finally, make sure that together these metrics provide a complete view of your infrastructure.

Set up Alert Thresholds and Notifications

First, establish a baseline for all the metrics you have identified. Then, using internal and industry benchmarks and considering your Service Level Agreements (SLAs), determine appropriate thresholds for each metric.

Next, configure your monitoring tools to track these metrics. If a metric falls outside its established range, make sure notifications are sent to the people who can correct the issue.
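One common starting point for a threshold is the baseline mean plus a few standard deviations. The sketch below is illustrative only; the helper names and sample data are hypothetical, and real thresholds should be tuned against your SLAs.

```python
import statistics

def build_threshold(baseline_samples, k=3):
    """Alert threshold as baseline mean plus k standard deviations (a common starting point)."""
    mean = statistics.fmean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)
    return mean + k * stdev

def breaches(samples, threshold):
    """Values that should trigger a notification."""
    return [s for s in samples if s > threshold]

# Hypothetical baseline of response times (ms) and a new batch to evaluate.
baseline = [100, 105, 98, 110, 102, 97, 103, 101, 99, 104]
threshold = build_threshold(baseline)
new_batch = [101, 99, 250, 103]
print(f"threshold: {threshold:.1f} ms, breaches: {breaches(new_batch, threshold)}")
```

Recomputing the baseline periodically keeps the threshold aligned with gradual, legitimate shifts in normal behavior.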

Routinely review performance data against the established baseline, to identify trends and patterns that may indicate issues or opportunities for improvement.

When issues arise, perform root cause analysis to identify the underlying factors contributing to the problem and develop targeted solutions.

Compare all new data against historical data and trends to make capacity planning and resource allocation decisions.


By creating a plan to effectively monitor your infrastructure you can ensure the optimal performance and reliability of your applications, servers, and web servers.

Benefits of Effective Monitoring

Active monitoring involves consistently tracking the health and performance of your IT infrastructure. Using this information, you can identify potential bottlenecks and issues, with the goal of addressing them before they become critical problems.

In this section, I will discuss the key benefits of active monitoring.

Improved Performance and Reliability

Real-time monitoring lets you detect issues early, so you can resolve them before they impact end users. By taking preventive measures, you improve the overall stability of the IT infrastructure.

Enhanced User Experience

When systems are available and running efficiently, they help provide faster response times for end users. This results in a smooth user experience and reduced downtime.

Informed Decision Making

Through active monitoring, you can discover performance trends and patterns that inform capacity planning and resource allocation. By consistently monitoring and addressing issues, you also support a culture of continuous improvement, driving ongoing enhancements to your IT infrastructure and processes. This gives a business a competitive edge by offering a superior user experience.


In this post, I have provided metrics that you can add to a checklist for monitoring all of your network and application infrastructure. By keeping track of these metrics, you can proactively address issues, enhance system efficiency, and improve overall business operations.