App, Server & WebServer Monitoring: Key Performance Metrics

In this post, I will discuss in detail application monitoring metrics, server monitoring metrics, and web server monitoring metrics.

Application, server, and web server monitoring metrics provide key insights into the health of your application infrastructure. They give you real-time performance information and help you troubleshoot issues before they escalate.

Key Performance Indicators (KPIs) By Category

KPIs are essential metrics that help in evaluating the performance of applications and servers. Some of the critical KPIs for application and server monitoring include:

Latency

Latency refers to the time it takes for a request to travel from the client to the server and back. High latency can lead to slow application performance, negatively impacting user experience. Monitoring latency helps identify bottlenecks and ensure that your applications are responsive and efficient.

Key Latency Metrics Table

Metric	Description
Content Download Time	Duration for downloading requested data or assets from the server to the client, affecting the user’s perceived loading time.
Database Query Time	Duration for executing and retrieving results from a database query, affecting application performance and server load.
DNS Resolution Time	Time required for domain name conversion to an IP address, affecting the initial connection to an application or server.
Network latency	The time it takes for data to travel from the webserver to the requesting user’s device. High network latency can negatively impact user experience.
Round-trip time (RTT)	The time taken for a request to be sent to the server and the response to be received by the client.
Server Processing Time	Time spent by the server in processing the client’s request and generating a response.
SSL/TLS Handshake Time	Time taken to negotiate and establish a secure connection between the client and server.
TCP Connection Time	Duration for establishing a TCP connection between the client and server, impacting the initial communication process.
Time to First Byte (TTFB)	The duration between the client request and receiving the first byte of data from the server.
Time to Interactive (TTI)	Time taken for a web page to become fully interactive, reflecting overall application responsiveness and user experience.
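
To show how several of the latency metrics in the table relate to one another, here is a minimal sketch that derives them from the phase timestamps of a single request. The timestamp names (dns_start, connect_end, and so on) and the sample values are hypothetical, chosen only for illustration; real tools expose similar data under their own names, and the exact start point used for TTFB varies between tools.

```python
def latency_breakdown(t):
    """Derive per-phase latency (ms) from one request's phase timestamps (ms).

    `t` is a dict of hypothetical timestamp names; they are assumptions made
    for this sketch, not the output format of any particular tool.
    """
    return {
        "dns_ms": t["dns_end"] - t["dns_start"],           # DNS resolution time
        "tcp_ms": t["connect_end"] - t["connect_start"],   # TCP connection time
        "tls_ms": t["tls_end"] - t["tls_start"],           # SSL/TLS handshake time
        "ttfb_ms": t["first_byte"] - t["request_sent"],    # time to first byte
        "download_ms": t["last_byte"] - t["first_byte"],   # content download time
        "total_ms": t["last_byte"] - t["dns_start"],       # end-to-end time
    }


# Made-up timestamps for a single request, in milliseconds
sample = {
    "dns_start": 0, "dns_end": 20,
    "connect_start": 20, "connect_end": 50,
    "tls_start": 50, "tls_end": 110,
    "request_sent": 110, "first_byte": 260, "last_byte": 300,
}

breakdown = latency_breakdown(sample)
for name, value in breakdown.items():
    print(f"{name}: {value}")
```

Breaking a slow request down this way helps show whether the time is going to the network (DNS, TCP, TLS), to the server (TTFB), or to transferring the payload.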

Response Time

Response time measures the time taken for an application to process a request and return a response. This metric helps determine the efficiency of your applications and pinpoint potential performance issues.

Key Response Times Metrics Table

Metric	Description
API response time	The time it takes for an API to process a request and return the requested data.
Application response time	The time an application takes to process and respond to a user request.
Average response time	The mean time taken for all requests to be processed and responded to.
Database query time	The duration required to execute and return results for a database query.
Network latency	The time required for data to travel between two points on a network.
Page load time	The duration it takes for a webpage to fully load and render in a user’s browser.
Server response time	The duration it takes for a server to process a request and return a response.
Time to first byte (TTFB)	The time from making an HTTP request to receiving the first byte of data.
95th percentile response time	The response time within which 95% of requests are processed. This helps identify outlier requests that might skew the average.
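
The difference between average and 95th percentile response time is easier to see with numbers. The sketch below uses made-up response times and a nearest-rank percentile, which is one common definition; monitoring tools may interpolate differently, so treat the helper as an illustration rather than a reference implementation.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of
    all samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct / 100 * len(ordered))  # 1-based rank
    return ordered[max(rank, 1) - 1]


# Hypothetical response times in milliseconds, with one slow outlier
response_times_ms = [120, 95, 110, 105, 130, 98, 102, 115, 2500, 108]

average_ms = statistics.mean(response_times_ms)
median_ms = percentile(response_times_ms, 50)
p95_ms = percentile(response_times_ms, 95)

print(f"average: {average_ms:.1f} ms, median: {median_ms} ms, p95: {p95_ms} ms")
```

Here a single slow request pulls the average well above what most users experienced, while the median stays close to the typical request. Tracking percentiles alongside the average makes that kind of skew visible.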

Throughput

Throughput refers to the rate at which requests are processed and completed by your application or server. High throughput ensures that your systems can handle large volumes of traffic without being overwhelmed. Monitoring throughput enables you to identify capacity constraints and make informed decisions about scaling.

Key Throughput Metrics Table

Metric	Description
Application throughput	The number of transactions or requests processed by an application within a given time frame.
CPU throughput	The amount of processing work completed by a CPU in a given time.
Database throughput	The rate at which a database system can process queries or transactions.
Disk throughput	The rate at which data is transferred to or from a disk, highlighting the efficiency and performance of a storage device.
Message queue throughput	The rate at which messages are processed within a message queuing system.
Network throughput	The amount of data transmitted over a network in a given time frame.
Storage throughput	The speed at which a storage system can read or write data, representing the performance of the storage infrastructure.
Web server throughput	The number of HTTP requests processed by a web server in a specified time period.
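
As a rough illustration of how a throughput figure such as requests per second can be derived, the sketch below buckets request completion times into fixed windows and counts them. The timestamps and the function name are hypothetical; in practice the data would come from access logs or a metrics agent.

```python
from collections import Counter


def throughput_per_window(timestamps, window_seconds=1.0):
    """Count completed requests in each fixed-size time window.

    Returns a dict mapping window index -> number of requests completed
    in that window.
    """
    counts = Counter(int(ts // window_seconds) for ts in timestamps)
    return dict(sorted(counts.items()))


# Made-up completion times, in seconds since the start of the measurement
completion_times = [0.1, 0.4, 0.9, 1.2, 1.3, 2.0, 2.1, 2.5, 2.8]

per_second = throughput_per_window(completion_times)
peak = max(per_second.values())

print(per_second)
print("peak requests per second:", peak)
```

Looking at peak windows as well as the overall average helps when judging whether a system is approaching its capacity limit.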

Error Rates

Error rates help you identify issues with your applications or servers that may affect performance or reliability. Monitoring error rates allows you to spot trends and address potential problems before they become critical.

Key Error Rates Metrics Table

Metric	Description
API error rate	The percentage of API calls returning error responses.
Application exception rate	The frequency at which unhandled exceptions occur within the application, signaling potential bugs or configuration issues.
Database error rate	The proportion of failed database queries or transactions. This may reflect database connection errors or issues with the data or queries being executed.
HTTP error rate	The percentage of HTTP requests resulting in error status codes (e.g., 4xx, 5xx), indicating potential issues with the web server or application.
Network error rate	The proportion of network-related errors, such as timeouts or dropped connections, suggesting potential issues with network infrastructure or configuration.
Security error rate	The percentage of security-related errors, like failed authentication attempts or unauthorized access. This indicates potential vulnerabilities or threats.
System error rate	The rate at which system-level errors occur, such as hardware failures or operating system crashes.
User-reported error rate	The frequency of user-reported issues or errors, providing valuable feedback for identifying and addressing problems impacting user experience.
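
The HTTP error rate from the table can be computed directly from response status codes. This is a minimal sketch with a made-up list of codes; it reports 4xx and 5xx separately, since client errors and server errors usually point to different causes.

```python
def http_error_rate(status_codes):
    """Return (client_error_pct, server_error_pct) for a list of HTTP status codes."""
    total = len(status_codes)
    if total == 0:
        return 0.0, 0.0
    client_errors = sum(1 for code in status_codes if 400 <= code < 500)
    server_errors = sum(1 for code in status_codes if 500 <= code < 600)
    return 100 * client_errors / total, 100 * server_errors / total


# Hypothetical status codes taken from a batch of requests
codes = [200, 200, 404, 200, 500, 301, 200, 503, 200, 200]

client_pct, server_pct = http_error_rate(codes)
print(f"4xx: {client_pct:.1f}%  5xx: {server_pct:.1f}%")
```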

Resource Utilization

Resource utilization metrics indicate how effectively your applications and servers are using system resources, such as CPU, memory, disk, and network. Monitoring resource utilization helps you identify inefficiencies, capacity constraints, and potential bottlenecks.

Key Resource Utilization Metrics Table

Metric	Description
CPU usage	The percentage of CPU capacity being used by your application or server.
Cache Hit Ratio	The proportion of data requests being served from cache memory.
Disk I/O	Rate at which data is read from and written to storage devices, impacting application performance and server responsiveness.
Disk usage	Percentage of storage capacity used on a server, impacting read/write speeds and overall system performance.
Error rate	Number of errors or exceptions encountered within an application or server, indicating potential issues and areas for improvement.
Garbage collection	Frequency and duration of memory cleanup operations, impacting application performance and resource management.
Load average	A measure of the server’s workload over a given period. On Unix-like systems, this metric is typically reported as rolling averages over the last 1, 5, and 15 minutes.
Memory usage	Amount of RAM consumed by an application or server, affecting performance and efficiency.
Network usage	Rate at which data is transmitted and received over your server’s network connections.
Network bandwidth	Volume of data transmitted or received by an application or server over a specific period.
Network latency	Time taken for a data packet to travel between source and destination. This could be for server-to-server communication or it could be from the client to a server.
Thread count	The number of active threads in an application or server. Interpret this metric in the context of the app being monitored: some apps use many threads for processing, while others work with far fewer.
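
Two of the metrics above are easier to interpret once normalized. The sketch below divides load average by the number of CPU cores and computes a cache hit ratio from hit and miss counts. The helper names and sample numbers are assumptions for illustration; os.getloadavg() is only available on Unix-like systems, so the live reading is guarded.

```python
import os


def load_per_core(load_average, cpu_count):
    """Normalize load average by core count; values near or above 1.0
    suggest the CPUs are saturated."""
    return load_average / cpu_count


def cache_hit_ratio(hits, misses):
    """Fraction of requests served from cache (0.0 when there is no traffic)."""
    total = hits + misses
    return hits / total if total else 0.0


# Made-up numbers for illustration
print(load_per_core(4.0, 8))      # a load of 4.0 on an 8-core server
print(cache_hit_ratio(900, 100))  # 900 hits, 100 misses

# Live reading, where the platform supports it
if hasattr(os, "getloadavg"):
    try:
        one_min, five_min, fifteen_min = os.getloadavg()
        cores = os.cpu_count() or 1
        print(f"1-minute load per core: {load_per_core(one_min, cores):.2f}")
    except OSError:
        print("load average is not available on this system")
```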

Additional Notes on Storage

Disk Capacity

Disk space usage can affect the time it takes to access data on a storage device. In general, the higher the disk usage percentage (space consumed / total storage available), the slower I/O operations tend to be.

Stored Data Access Patterns

Imagine an application that does a lot of random reads and writes from disk storage. Unlike memory, disk access is very slow, often hundreds of times slower. This delay in accessing stored data leaves the CPU sitting idle while the disk works to fetch the required data. In these scenarios, you may want to look into faster storage or caching solutions to reduce the bottleneck.

Page Fault and Page Swaps

Available disk capacity also comes into play when applications consume most of the available memory and start to swap memory pages to disk. This page swapping results in page faults, which can cause a system to thrash.

Thrashing describes the process where the operating system keeps swapping pages from disk to memory and back. Thrashing can happen for multiple reasons, with one being process context switching.

Page faults can be a big problem on virtual servers, whether hosted locally or in the cloud.

Availability

Availability metrics track the uptime and reliability of your applications and servers. High availability ensures that your systems are accessible and operational when users need them. Monitoring availability allows you to detect and address potential downtime issues.

Key Availability Metrics Table

Metric	Description
Downtime	The total time a system, application, or service is unavailable or inaccessible to users due to scheduled maintenance, unexpected outages, or other issues.
Failover time	The amount of time it takes for a backup or redundant system to take over in the event of a primary system failure.
Incident response time	The time it takes for a support team to acknowledge and begin resolving an incident or outage.
Mean Time Between Failures (MTBF)	The average time between system or component failures. MTBF can be used to estimate the expected lifespan of hardware components.
Mean Time To Recovery (MTTR)	The average time it takes to restore a system, application, or service to full functionality after a failure or outage. A lower MTTR indicates a more efficient recovery process.
Redundancy ratio	The ratio of backup or redundant components to primary components in a system. A higher redundancy ratio indicates a more robust and resilient infrastructure.
Service Level Agreement (SLA) compliance	The percentage of time a service provider meets the agreed-upon availability targets outlined in the SLA.
Uptime	The amount of time a system, application, or service has been continuously operational without interruption.
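
Several of the availability metrics above can be derived from the same incident data. The following sketch assumes a hypothetical 30-day (720-hour) reporting period with three outages. It uses the common definitions of MTBF as total uptime divided by the number of failures and MTTR as total downtime divided by the number of failures; your SLA may define these differently.

```python
def availability_metrics(period_hours, outage_hours):
    """Compute availability, MTBF, and MTTR for one reporting period.

    `outage_hours` is a list with the duration of each failure, in hours.
    """
    downtime = sum(outage_hours)
    uptime = period_hours - downtime
    failures = len(outage_hours)
    return {
        "availability_pct": 100 * uptime / period_hours,
        "mtbf_hours": uptime / failures if failures else None,
        "mttr_hours": downtime / failures if failures else None,
    }


# Hypothetical month: 720 hours, three outages
metrics = availability_metrics(720, [1.5, 0.5, 2.0])
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Comparing availability_pct against the target in your SLA gives a simple compliance check for the period.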

Application, Network and Server Monitoring

Application Monitoring Metrics

Application monitoring focuses on measuring the performance and health of software applications from the end-user perspective. This type of monitoring tracks metrics such as response time, throughput, error rates, and user satisfaction.

By monitoring application performance, businesses can ensure smooth operation, minimize downtime, and deliver a seamless user experience.

Common Application Monitoring Metrics

Cache hit ratio
Error rates
Garbage collection
Latency
Response time
Resource utilization
Throughput

Apdex score: An industry-standard metric used to measure user satisfaction based on application response times. The Apdex score ranges from 0 to 1, with higher scores indicating better user satisfaction. This metric considers both satisfactory and unsatisfactory response times, providing a comprehensive view of the application’s performance from the user’s perspective.
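
As a worked example, the standard Apdex formula counts requests at or below a target threshold T as satisfied, requests between T and 4T as tolerating (worth half), and anything slower as frustrated. The sketch below applies it to made-up response times; the threshold of 0.5 seconds is an assumption, and you would pick your own target.

```python
def apdex(response_times, threshold):
    """Apdex = (satisfied + tolerating / 2) / total samples.

    satisfied:  response time <= threshold
    tolerating: threshold < response time <= 4 * threshold
    frustrated: anything slower (counts as zero)
    """
    if not response_times:
        raise ValueError("response_times must not be empty")
    satisfied = sum(1 for t in response_times if t <= threshold)
    tolerating = sum(1 for t in response_times if threshold < t <= 4 * threshold)
    return (satisfied + tolerating / 2) / len(response_times)


# Hypothetical response times in seconds
samples = [0.2, 0.4, 0.5, 0.9, 1.5, 1.9, 2.5, 0.3]

score = apdex(samples, threshold=0.5)
print(f"Apdex: {score:.2f}")
```

With these samples, four requests are satisfied, three are tolerating, and one is frustrated, giving (4 + 3/2) / 8.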

Tools For Application Monitoring

There are a variety of application monitoring tools available, offering different features and capabilities. Some popular application monitoring tools include:

Application Monitoring Best Practices

Focus on end-user experience: Prioritize monitoring metrics and events that have a direct impact on the end-user experience.
Monitor dependencies: Track the performance of third-party services and components that your application relies on, as issues with dependencies can impact overall performance.
Set up alerts and thresholds: Configure alerts based on predefined thresholds for critical performance metrics, enabling timely intervention when issues arise.
Analyze trends and patterns: Regularly review performance data to identify trends and patterns, guiding optimization efforts and future development.
Continuously improve: Use application monitoring insights to drive continuous improvement, refining processes and addressing performance bottlenecks.

Read: Application and Server Monitoring Best Practices

Key Metrics For Server Monitoring

Server monitoring metrics provide KPIs for tracking the performance and health of physical or virtual servers hosting applications, databases, and other services.

The primary objective of server monitoring is to ensure that the underlying hardware and software resources are functioning optimally and efficiently. Beyond resource optimization, you can use monitoring metrics to prevent downtime, plan capacity, and support security and compliance.

Common Server Monitoring Metrics

CPU utilization
Disk usage
Disk I/O
Load average
Memory utilization
Network latency
Network throughput
Server uptime

Tools For Server Monitoring

There are various tools available for server monitoring, ranging from open-source solutions to commercial products. Some popular server monitoring tools include:

Server Monitoring Best Practices

Define key performance metrics: Identify the most relevant performance metrics for your specific server environment and focus on monitoring those.
Set up alerts and thresholds: Configure alerts based on predefined thresholds to receive timely notifications of potential issues or anomalies.
Monitor consistently: Regular, consistent monitoring helps establish a baseline for server performance, making it easier to detect deviations and trends.
Automate monitoring tasks: Use monitoring tools to automate routine tasks, freeing up time for more strategic activities.
Maintain a monitoring log: Keep a record of performance data, alerts, and incidents to facilitate root cause analysis and improve future monitoring efforts.

Key Metrics For WebServer Monitoring

Webserver monitoring focuses on measuring the performance and availability of web servers, which are responsible for processing user requests and delivering content to them over the network.

Webserver monitoring is important for improving user experience, increasing application availability, enhancing security, and supporting search engine optimization and analytics.

Common Webserver Monitoring Metrics

Here is a list of common web server performance metrics:

Cache hit ratio
Concurrent connections
Error and status codes
HTTP response time
Network latency
Requests per second
Server resource utilization
Time to first byte (TTFB)

Tools For Webserver Monitoring

There are numerous web server monitoring tools available, catering to various needs and requirements. Some popular webserver monitoring tools include:

Webserver Monitoring Best Practices

Set performance baselines: Establish baseline performance metrics to help detect anomalies and deviations from expected web server behavior.
Configure alerts and thresholds: Set up alerts based on predefined thresholds for critical performance metrics, enabling prompt intervention when issues arise.
Monitor server logs: Regularly review webserver logs for errors, security issues, and trends that can inform optimization efforts.
Optimize server configurations: Use web server monitoring insights to fine-tune server configurations and improve performance.
Monitor security: Regularly check for security vulnerabilities and apply updates or patches to protect your web server from potential threats.

Monitoring Strategies and Techniques

Now that we are equipped with all the monitoring metrics and KPIs, the final step is to develop a plan that ties all of this together.

In this section, I will discuss key considerations for selecting appropriate performance metrics, setting up alert thresholds, analyzing trends and patterns, and balancing proactive and reactive monitoring.

Selecting Appropriate Performance Metrics

Choose metrics that are directly related to the performance and user experience you are targeting. Identify the active applications, servers and networks you are going to monitor and select the relevant metrics.

The metrics you select for monitoring should be actionable, meaning that they provide relevant information that you can use to take a specific action.

Finally, make sure that together these metrics provide a complete view of your infrastructure.

Set up Alert Thresholds and Notifications

First, establish a baseline for all the metrics you have identified. Then, using internal and industry standards and benchmarks, and taking Service Level Agreements (SLAs) into account, determine appropriate thresholds for each metric.

Configure your monitoring tools to track each metric against these thresholds. When a value falls outside its established range, make sure notifications go to the people who are able to correct the issue.

If your monitoring tools allow it, adjust alerts over time based on trends and patterns.
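
One simple way to turn a baseline into an alert threshold is to flag values that sit several standard deviations above the baseline mean. The sketch below is one possible approach under that assumption, using made-up CPU usage readings; the choice of three standard deviations is arbitrary and should be tuned against your own data and SLAs.

```python
import statistics


def threshold_from_baseline(baseline, sigmas=3.0):
    """Upper alert threshold: baseline mean plus `sigmas` sample standard deviations."""
    return statistics.mean(baseline) + sigmas * statistics.stdev(baseline)


def breaches(samples, threshold):
    """Return the samples that exceed the threshold and should trigger a notification."""
    return [s for s in samples if s > threshold]


# Hypothetical baseline of CPU usage readings, in percent
baseline_cpu = [40, 42, 38, 41, 39, 43, 40, 41]
limit = threshold_from_baseline(baseline_cpu)

# New readings to check against the threshold
alerts = breaches([41, 44, 67, 39], limit)

print(f"threshold: {limit:.1f}%")
print("readings that would alert:", alerts)
```

Recomputing the baseline periodically is one way to let thresholds follow longer-term trends instead of staying fixed.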

Analyze Trends and Patterns

Routinely review performance data against the established baseline to identify trends and patterns that may indicate issues or opportunities for improvement.

When issues arise, perform root cause analysis to identify the underlying factors contributing to the problem and develop targeted solutions.

Compare all new data against historical data and trends to make capacity planning and resource allocation decisions.
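
For capacity planning, a basic trend line over historical data can give a rough estimate of when a resource will run out. The following sketch fits a least-squares line to made-up daily disk usage readings and projects when usage would reach a chosen limit. It assumes growth is roughly linear, which will not hold for every workload.

```python
def linear_trend(values):
    """Least-squares slope and intercept for values sampled at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def periods_until(values, limit):
    """Sampling periods after the last sample until the trend line reaches `limit`.

    Returns None when the trend is flat or decreasing.
    """
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None
    last_x = len(values) - 1
    return (limit - intercept) / slope - last_x


# Hypothetical daily disk usage readings, in percent
disk_pct = [50, 52, 54, 56, 58]

days_left = periods_until(disk_pct, 90)
print(f"days until 90% disk usage: {days_left:.0f}")
```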

Summary

By creating a plan to effectively monitor your infrastructure you can ensure the optimal performance and reliability of your applications, servers, and web servers.

Benefits of Effective Monitoring

Active monitoring involves consistently tracking the health and performance of your IT infrastructure. Using this information, you can identify potential bottlenecks and issues and address them before they become critical problems.

In this section, I will discuss the key benefits of active monitoring.

Improved Performance and Reliability

Real-time monitoring provides the ability to detect issues early, so you can resolve them before they impact the end users. By taking preventive measures, the overall stability of the IT infrastructure goes up.

Enhance User Experience

When systems are available and running efficiently, they help provide faster response times for end users. This results in a smooth user experience and reduced downtime.

Informed Decision Making

Through active monitoring, you can discover performance trends and patterns that inform capacity planning and resource allocation. By consistently monitoring and addressing issues, you also support a culture of continuous improvement, driving ongoing enhancements to your IT infrastructure and processes. This can give a business a competitive edge by offering a superior user experience.

Conclusion

In this post, I have provided metrics that you can add to a checklist for monitoring all of your network and application infrastructure. By keeping track of these metrics, you can proactively address issues, enhance system efficiency, and improve overall business operations.
