In this post, I will discuss in detail application monitoring metrics, server monitoring metrics, and web server monitoring metrics.
Application, server, and web server monitoring metrics provide key insights into the health of your application infrastructure. They supply real-time performance data and help you troubleshoot issues before they escalate.
The flow of this post is to first review the KPI categories and the metrics grouped under each, and then go over the relevant metrics for each type of monitoring.
Key Performance Indicators (KPIs) By Category
KPIs are essential metrics that help in evaluating the performance of applications and servers. Some of the critical KPIs for application and server monitoring include:
Latency
Latency refers to the time it takes for a request to travel from the client to the server and back. High latency can lead to slow application performance, negatively impacting user experience. Monitoring latency helps identify bottlenecks and ensure that your applications are responsive and efficient.
Key Latency Metrics Table
Metric | Description |
---|---|
Content Download Time | Duration for downloading requested data or assets from the server to the client, affecting the user’s perceived loading time. |
Database Query Time | Duration for executing and retrieving results from a database query, affecting application performance and server load. |
DNS Resolution Time | Time required for domain name conversion to an IP address, affecting the initial connection to an application or server. |
Network latency | The time it takes for data to travel from the webserver to the requesting user’s device. High network latency can negatively impact user experience. |
Round-trip time (RTT) | The time taken for a request to be sent to the server and the response to be received by the client. |
Server Processing Time | Time spent by the server in processing the client’s request and generating a response. |
SSL/TLS Handshake Time | Time taken to negotiate and establish a secure connection between the client and server. |
TCP Connection Time | Duration for establishing a TCP connection between the client and server, impacting the initial communication process. |
Time to First Byte (TTFB) | The duration between the client request and receiving the first byte of data from the server. |
Time to Interactive (TTI) | Time taken for a web page to become fully interactive, reflecting overall application responsiveness and user experience. |
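To make the relationship between these latency metrics concrete, here is a minimal Python sketch that splits a single request into its phases. It assumes you already have phase timestamps from some client-side timing source. The `RequestTimings` class and its field names are hypothetical, and TTFB is assumed to be measured from the start of the request, including connection setup.

```python
from dataclasses import dataclass


@dataclass
class RequestTimings:
    """Milliseconds elapsed since the request started (hypothetical field names)."""
    dns_done: float        # DNS resolution finished
    tcp_connected: float   # TCP connection established
    tls_done: float        # SSL/TLS handshake finished
    first_byte: float      # first byte of the response received
    download_done: float   # last byte of the response received


def latency_breakdown(t: RequestTimings) -> dict:
    """Return the duration of each phase, plus TTFB and the total time."""
    return {
        "dns_resolution_ms": t.dns_done,
        "tcp_connection_ms": t.tcp_connected - t.dns_done,
        "tls_handshake_ms": t.tls_done - t.tcp_connected,
        # Request sent, server processing, and the first byte travelling back.
        "wait_for_first_byte_ms": t.first_byte - t.tls_done,
        "content_download_ms": t.download_done - t.first_byte,
        "ttfb_ms": t.first_byte,
        "total_ms": t.download_done,
    }


# Example values are made up for illustration.
sample = RequestTimings(dns_done=20, tcp_connected=50, tls_done=110,
                        first_byte=310, download_done=400)
for phase, duration in latency_breakdown(sample).items():
    print(f"{phase}: {duration}")
```

A breakdown like this shows which phase dominates the total, which tells you whether to look at DNS, the network, or the server first.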
Response Time
Response time measures the time taken for an application to process a request and return a response. This metric helps determine the efficiency of your applications and pinpoint potential performance issues.
Key Response Times Metrics Table
Metric | Description |
---|---|
API response time | The time it takes for an API to process a request and return the requested data. |
Application response time | The time an application takes to process and respond to a user request. |
Average response time | The mean time taken for all requests to be processed and responded to. |
Database query time | The duration required to execute and return results for a database query. |
Network latency | The time required for data to travel between two points on a network. |
Page load time | The duration it takes for a webpage to fully load and render in a user’s browser. |
Server response time | The duration it takes for a server to process a request and return a response. |
Time to first byte (TTFB) | The time from making an HTTP request to receiving the first byte of data. |
95th percentile response time | The response time within which 95% of requests are processed. This helps identify outlier requests that might skew the average. |
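The difference between the average and a percentile is easier to see with numbers. The sketch below uses made-up response times and a nearest-rank percentile, which is one of several common percentile definitions. The helper name `percentile` is my own and does not come from any monitoring tool.

```python
import math
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


# Nine typical requests and one very slow one (illustrative values, in ms).
response_times_ms = [120, 135, 110, 140, 125, 130, 115, 145, 128, 2400]

average = statistics.mean(response_times_ms)
p50 = percentile(response_times_ms, 50)
p95 = percentile(response_times_ms, 95)

print(f"average: {average} ms, median: {p50} ms, p95: {p95} ms")
```

Here a single slow request pulls the average far above the median, which is why percentiles are usually tracked alongside the average.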
Throughput
Throughput refers to the rate at which requests are processed and completed by your application or server. High throughput ensures that your systems can handle large volumes of traffic without being overwhelmed. Monitoring throughput enables you to identify capacity constraints and make informed decisions about scaling.
Key Throughput Metrics Table
Metric | Description |
---|---|
Application throughput | The number of transactions or requests processed by an application within a given time frame. |
CPU throughput | The amount of processing work completed by a CPU in a given time. |
Database throughput | The rate at which a database system can process queries or transactions. |
Disk throughput | The rate at which data is transferred to or from a disk, highlighting the efficiency and performance of a storage device. |
Message queue throughput | The rate at which messages are processed within a message queuing system. |
Network throughput | The amount of data transmitted over a network in a given time frame. |
Storage throughput | The speed at which a storage system can read or write data, representing the performance of the storage infrastructure. |
Web server throughput | The number of HTTP requests processed by a web server in a specified time period. |
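As a rough illustration of how a throughput number is derived, the following Python sketch computes average and peak requests per second from request timestamps. It assumes the timestamps are non-negative seconds (for example, parsed from an access log). Both function names are hypothetical.

```python
from collections import Counter


def average_throughput(request_count, window_seconds):
    """Average requests per second over a measurement window."""
    if window_seconds <= 0:
        raise ValueError("window_seconds must be positive")
    return request_count / window_seconds


def peak_requests_per_second(timestamps):
    """Highest number of requests that fell into any single one-second bucket."""
    buckets = Counter(int(ts) for ts in timestamps)
    return max(buckets.values(), default=0)


# Illustrative timestamps, in seconds since the start of the window.
timestamps = [0.1, 0.4, 0.9, 1.2, 2.5, 2.6, 2.7, 2.8]

print(average_throughput(len(timestamps), window_seconds=4))
print(peak_requests_per_second(timestamps))
```

Comparing the average with the peak helps with scaling decisions, since capacity has to cover bursts and not just the mean load.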
Error Rates
Error rates help you identify issues with your applications or servers that may affect performance or reliability. Monitoring error rates allows you to spot trends and address potential problems before they become critical.
Key Error Rates Metrics Table
Metric | Description |
---|---|
API error rate | The percentage of API calls returning error responses. |
Application exception rate | The frequency at which unhandled exceptions occur within the application, signaling potential bugs or configuration issues. |
Database error rate | The proportion of failed database queries or transactions. This may reflect database connection errors or issues with the data or queries being executed. |
HTTP error rate | The percentage of HTTP requests resulting in error status codes (e.g., 4xx, 5xx), indicating potential issues with the web server or application. |
Network error rate | The proportion of network-related errors, such as timeouts or dropped connections, suggesting potential issues with network infrastructure or configuration. |
Security error rate | The percentage of security-related errors, like failed authentication attempts or unauthorized access. This indicates potential vulnerabilities or threats. |
System error rate | The rate at which system-level errors occur, such as hardware failures or operating system crashes. |
User-reported error rate | The frequency of user-reported issues or errors, providing valuable feedback for identifying and addressing problems impacting user experience. |
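For example, the HTTP error rate can be computed directly from a list of response status codes. This is a minimal sketch with a hypothetical function name. How you collect the status codes (access logs, a metrics agent, and so on) is left open.

```python
def http_error_rates(status_codes):
    """Return the percentage of 4xx, 5xx, and combined error responses."""
    total = len(status_codes)
    if total == 0:
        return {"client_error_pct": 0.0, "server_error_pct": 0.0, "error_pct": 0.0}
    client_errors = sum(1 for code in status_codes if 400 <= code <= 499)
    server_errors = sum(1 for code in status_codes if 500 <= code <= 599)
    return {
        "client_error_pct": 100 * client_errors / total,
        "server_error_pct": 100 * server_errors / total,
        "error_pct": 100 * (client_errors + server_errors) / total,
    }


# Illustrative sample: 90 successes, 6 "not found" responses, 4 server errors.
codes = [200] * 90 + [404] * 6 + [500] * 4
print(http_error_rates(codes))
```

Keeping 4xx and 5xx separate is a deliberate choice: a spike in 5xx usually points at the server or application, while 4xx often points at clients or broken links.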
Resource Utilization
Resource utilization metrics indicate how effectively your applications and servers are using system resources, such as CPU, memory, disk, and network. Monitoring resource utilization helps you identify inefficiencies, capacity constraints, and potential bottlenecks.
Key Resource Utilization Metrics Table
Metric | Description |
---|---|
CPU usage | The percentage of CPU capacity being used by your application or server. |
Cache Hit Ratio | The proportion of data requests being served from cache memory. |
Disk I/O | Rate at which data is read from and written to storage devices, impacting application performance and server responsiveness. |
Disk usage | Percentage of storage capacity used on a server, impacting read/write speeds and overall system performance. |
Error rate | Number of errors or exceptions encountered within an application or server, indicating potential issues and areas for improvement. |
Garbage collection | Frequency and duration of memory cleanup operations, impacting application performance and resource management. |
Load average | A measure of the server’s workload over a given period. On Unix-like systems, this metric is reported as a rolling average over the last 1, 5, and 15 minutes. |
Memory usage | Amount of RAM consumed by an application or server, affecting performance and efficiency. |
Network usage | Rate at which data is transmitted and received over your server’s network connections. |
Network bandwidth | Volume of data transmitted or received by an application or server over a specific period. |
Network latency | Time taken for a data packet to travel between source and destination. This could be for server-to-server communication or it could be from the client to a server. |
Thread count | The number of active threads in an application or server. This needs to be looked at in the context of the app being monitored: some apps use many threads for processing, while others work with only a few. |
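Several of these resource metrics can be read with nothing more than the Python standard library. The sketch below is a minimal example and not a replacement for a monitoring agent. The function names are my own, and `os.getloadavg()` is only available on Unix-like systems, so the code returns `None` where it is missing.

```python
import os
import shutil


def disk_usage_percent(path="/"):
    """Space consumed / total storage available, as a percentage."""
    usage = shutil.disk_usage(path)
    return 100 * usage.used / usage.total if usage.total else 0.0


def load_average_per_core():
    """1, 5, and 15 minute load averages divided by the CPU core count."""
    try:
        load = os.getloadavg()
    except (AttributeError, OSError):
        return None  # not available on this platform
    cores = os.cpu_count() or 1
    return tuple(value / cores for value in load)


def cache_hit_ratio(hits, misses):
    """Proportion of requests served from cache."""
    total = hits + misses
    return hits / total if total else 0.0


print(f"disk usage: {disk_usage_percent('/'):.1f}%")
print(f"load per core (1, 5, 15 min): {load_average_per_core()}")
print(f"cache hit ratio: {cache_hit_ratio(hits=90, misses=10)}")
```

Dividing the load average by the core count makes the number comparable across servers: a value that stays above 1.0 per core suggests work is queuing for CPU time.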
Additional Notes on Storage
Disk Capacity
Disk space usage can affect the time it takes to access data on a storage device. In general, the higher the disk usage percentage (space consumed / total storage available), the slower I/O operations tend to be.
Stored Data Access Patterns
Imagine an application that does a lot of random reads and writes to disk storage. Unlike memory, disk access is very slow, often hundreds of times slower. While the disk works to fetch the required data, the CPU sits idle. In these scenarios, you may want to look into faster storage or caching solutions to reduce the bottleneck.
Page Fault and Page Swaps
Available disk capacity also comes into play when applications consume most of the available memory and the operating system starts to swap memory pages to disk. Heavy page swapping results in frequent page faults, which can cause a system to thrash.
Thrashing describes the state where the operating system keeps swapping pages between disk and memory, spending more time on paging than on useful work. Thrashing can happen for multiple reasons, with one being process context switching.
Page faults can be a big problem on virtual servers, whether they run locally or are hosted in the cloud.
Availability
Availability metrics track the uptime and reliability of your applications and servers. High availability ensures that your systems are accessible and operational when users need them. Monitoring availability allows you to detect and address potential downtime issues.
Key Availability Metrics Table
Metric | Description |
---|---|
Downtime | The total time a system, application, or service is unavailable or inaccessible to users due to scheduled maintenance, unexpected outages, or other issues. |
Failover time | The amount of time it takes for a backup or redundant system to take over in the event of a primary system failure. |
Incident response time | The time it takes for a support team to acknowledge and begin resolving an incident or outage. |
Mean Time Between Failures (MTBF) | The average time between system or component failures. MTBF can be used to estimate the expected lifespan of hardware components. |
Mean Time To Recovery (MTTR) | The average time it takes to restore a system, application, or service to full functionality after a failure or outage. A lower MTTR indicates a more efficient recovery process. |
Redundancy ratio | The ratio of backup or redundant components to primary components in a system. A higher redundancy ratio indicates a more robust and resilient infrastructure. |
Service Level Agreement (SLA) compliance | The percentage of time a service provider meets the agreed-upon availability targets outlined in the SLA. |
Uptime | The amount of time a system, application, or service has been continuously operational without interruption. |
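The availability metrics above are closely related, and a small worked example makes that visible. The Python sketch below uses the common definitions of MTBF (operational time divided by number of failures) and MTTR (repair time divided by number of failures). The function names and the sample numbers are illustrative assumptions.

```python
def availability_percent(uptime_hours, downtime_hours):
    """Share of the total period during which the service was operational."""
    total = uptime_hours + downtime_hours
    if total <= 0:
        raise ValueError("total time must be positive")
    return 100 * uptime_hours / total


def mtbf_hours(uptime_hours, failures):
    """Mean Time Between Failures: operational time per failure."""
    return uptime_hours / failures


def mttr_hours(downtime_hours, failures):
    """Mean Time To Recovery: repair time per failure."""
    return downtime_hours / failures


# A 30-day month (720 hours) with 3 outages totalling 1.5 hours of downtime.
uptime, downtime, failures = 718.5, 1.5, 3

mtbf = mtbf_hours(uptime, failures)
mttr = mttr_hours(downtime, failures)
print(f"availability: {availability_percent(uptime, downtime):.3f}%")
print(f"MTBF: {mtbf} h, MTTR: {mttr} h")
# The same availability can be derived from MTBF and MTTR.
print(f"MTBF / (MTBF + MTTR): {100 * mtbf / (mtbf + mttr):.3f}%")
```

The last line shows why both numbers matter: you can raise availability either by failing less often (higher MTBF) or by recovering faster (lower MTTR).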
Application Monitoring Metrics
Application monitoring focuses on measuring the performance and health of software applications from the end-user perspective. This type of monitoring tracks metrics such as response time, throughput, error rates, and user satisfaction.
By monitoring application performance, businesses can ensure smooth operation, minimize downtime, and deliver a seamless user experience.
Common Application Monitoring Metrics
- Cache hit ratio
- Error rates
- Garbage collection
- Latency
- Response time
- Resource utilization
- Throughput
Apdex score: An industry-standard metric used to measure user satisfaction based on application response times. The Apdex score ranges from 0 to 1, with higher scores indicating better user satisfaction. This metric considers both satisfactory and unsatisfactory response times, providing a comprehensive view of the application’s performance from the user’s perspective.
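The standard Apdex formula counts requests at or below a target threshold T as satisfied, requests between T and 4T as tolerating (worth half a point), and anything slower as frustrated. Here is a minimal Python sketch of that calculation. The 0.5 second default for T is only an assumption for the example, since the threshold is something you choose per application.

```python
def apdex(response_times_s, threshold_s=0.5):
    """Apdex = (satisfied + tolerating / 2) / total samples."""
    if not response_times_s:
        raise ValueError("response_times_s must not be empty")
    satisfied = sum(1 for t in response_times_s if t <= threshold_s)
    tolerating = sum(1 for t in response_times_s
                     if threshold_s < t <= 4 * threshold_s)
    return (satisfied + tolerating / 2) / len(response_times_s)


# Illustrative sample: 5 satisfied, 3 tolerating, 2 frustrated requests.
samples = [0.2, 0.3, 0.4, 0.45, 0.5, 0.6, 1.0, 1.9, 2.5, 3.0]
score = apdex(samples, threshold_s=0.5)
print(f"Apdex: {score:.2f}")
```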
Tools For Application Monitoring
There are a variety of application monitoring tools available, offering different features and capabilities. Some popular application monitoring tools include:
Application Monitoring Best Practices
- Focus on end-user experience: Prioritize monitoring metrics and events that have a direct impact on the end-user experience.
- Monitor dependencies: Track the performance of third-party services and components that your application relies on, as issues with dependencies can impact overall performance.
- Set up alerts and thresholds: Configure alerts based on predefined thresholds for critical performance metrics, enabling timely intervention when issues arise.
- Analyze trends and patterns: Regularly review performance data to identify trends and patterns, guiding optimization efforts and future development.
- Continuously improve: Use application monitoring insights to drive continuous improvement, refining processes and addressing performance bottlenecks.
Read: Application and Server Monitoring Best Practices
Key Metrics For Server Monitoring
Server monitoring metrics provide KPIs for tracking the performance and health of physical or virtual servers hosting applications, databases, and other services.
The primary objective of server monitoring is to ensure that the underlying hardware and software resources are functioning optimally and efficiently. Beyond resource optimization, you can use monitoring metrics to prevent downtime, plan capacity, and support security and compliance.
Common Server Monitoring Metrics
- CPU utilization
- Disk usage
- Disk I/O
- Load average
- Memory utilization
- Network latency
- Network throughput
- Server uptime
Tools For Server Monitoring
There are various tools available for server monitoring, ranging from open-source solutions to commercial products. Some popular server monitoring tools include:
Server Monitoring Best Practices
- Define key performance metrics: Identify the most relevant performance metrics for your specific server environment and focus on monitoring those.
- Set up alerts and thresholds: Configure alerts based on predefined thresholds to receive timely notifications of potential issues or anomalies.
- Monitor consistently: Regular, consistent monitoring helps establish a baseline for server performance, making it easier to detect deviations and trends.
- Automate monitoring tasks: Use monitoring tools to automate routine tasks, freeing up time for more strategic activities.
- Maintain a monitoring log: Keep a record of performance data, alerts, and incidents to facilitate root cause analysis and improve future monitoring efforts.
Key Metrics For Web Server Monitoring
Web server monitoring focuses on measuring the performance and availability of web servers, which are responsible for processing user requests and delivering content to users over the network.
Web server monitoring is important for improving user experience, increasing application availability, enhancing security, and supporting search engine optimization and analytics.
Common Web Server Monitoring Metrics
Here is a list of common web server performance metrics:
- Cache hit ratio
- Concurrent connections
- Error and status codes
- HTTP response time
- Network latency
- Requests per second
- Server resource utilization
- Time to first byte (TTFB)
Tools For Web Server Monitoring
There are numerous web server monitoring tools available, catering to various needs and requirements. Some popular web server monitoring tools include:
- ManageEngine Applications Manager
- NGINX Amplify
- Prometheus
- Sematext Synthetics
- SolarWinds Web Performance Monitor
Web Server Monitoring Best Practices
- Set performance baselines: Establish baseline performance metrics to help detect anomalies and deviations from expected web server behavior.
- Configure alerts and thresholds: Set up alerts based on predefined thresholds for critical performance metrics, enabling prompt intervention when issues arise.
- Monitor server logs: Regularly review web server logs for errors, security issues, and trends that can inform optimization efforts.
- Optimize server configurations: Use web server monitoring insights to fine-tune server configurations and improve performance.
- Monitor security: Regularly check for security vulnerabilities and apply updates or patches to protect your web server from potential threats.
Monitoring Strategies and Techniques
Now that we are equipped with all the monitoring metrics and KPIs, the final step is to develop a plan that ties all of this together.
In this section, I will discuss key considerations for selecting appropriate performance metrics, setting up alert thresholds, analyzing trends and patterns, and balancing proactive and reactive monitoring.
Selecting Appropriate Performance Metrics
Choose metrics that are directly related to the performance and user experience you are targeting. Identify the active applications, servers and networks you are going to monitor and select the relevant metrics.
The metrics you select for monitoring should be actionable, meaning that they provide relevant information that you can use to take a specific action.
Finally, make sure that together these metrics provide a complete view of your infrastructure.
Set up Alert Thresholds and Notifications
First, establish a baseline for all the metrics you have identified. Then, using internal and industry standards and benchmarks, and taking Service Level Agreements (SLAs) into account, determine appropriate thresholds for each metric.
Configure your monitoring tools to track each metric against these thresholds. When a value falls outside the established range, make sure notifications are sent to the people who are able to correct the issue.
If your monitoring tools allow it, adjust the alerts over time based on trends and patterns.
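The baseline-then-threshold process can be sketched in a few lines of Python. This example derives a threshold as the baseline mean plus a number of standard deviations, capped by an optional SLA limit. That is one common approach and not the only one. The function names, the three-sigma default, and the sample values are all assumptions for illustration.

```python
import statistics


def build_threshold(baseline_samples, sigmas=3.0, sla_limit=None):
    """Alert threshold: baseline mean + sigmas * standard deviation, capped by the SLA."""
    mean = statistics.mean(baseline_samples)
    stdev = statistics.stdev(baseline_samples)  # needs at least two samples
    threshold = mean + sigmas * stdev
    if sla_limit is not None:
        threshold = min(threshold, sla_limit)
    return threshold


def check_metric(name, value, threshold, notify=print):
    """Send a notification when the value exceeds the threshold."""
    if value > threshold:
        notify(f"ALERT: {name} is {value}, above threshold {threshold:.1f}")
        return True
    return False


# Baseline response times in ms, collected during normal operation (made up).
baseline_ms = [100, 102, 98, 101, 99]
threshold = build_threshold(baseline_ms)

check_metric("response_time_ms", 103, threshold)   # within range, no alert
check_metric("response_time_ms", 110, threshold)   # prints an alert
```

Passing `notify` as a parameter keeps the check independent of the delivery channel, so the same logic could feed email, chat, or a paging system.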
Analyze Trends and Patterns
Routinely review performance data against the established baseline, to identify trends and patterns that may indicate issues or opportunities for improvement.
When issues arise, perform root cause analysis to identify the underlying factors contributing to the problem and develop targeted solutions.
Compare all new data against historical data and trends to make capacity planning and resource allocation decisions.
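For capacity planning, even a simple linear trend over historical data can give a useful first estimate. The Python sketch below fits a least-squares line to equally spaced samples (for example, daily disk usage percentages) and estimates how many intervals remain before a limit is reached. It assumes growth is roughly linear, which you should verify against your own data. The function names are hypothetical.

```python
def linear_trend(values):
    """Least-squares slope and intercept for samples taken at x = 0, 1, 2, ..."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def intervals_until(values, limit):
    """Estimated intervals after the last sample until the trend line reaches the limit."""
    slope, intercept = linear_trend(values)
    if slope <= 0:
        return None  # flat or shrinking, the limit is never reached on this trend
    last_x = len(values) - 1
    return max((limit - intercept) / slope - last_x, 0.0)


# Daily disk usage in percent (illustrative values).
disk_usage_pct = [60, 62, 64, 66, 68]
print(linear_trend(disk_usage_pct))
print(intervals_until(disk_usage_pct, limit=90))
```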
Summary
By creating a plan to effectively monitor your infrastructure you can ensure the optimal performance and reliability of your applications, servers, and web servers.
Benefits of Effective Monitoring
Active monitoring involves consistently tracking the health and performance of your IT infrastructure. You use this information to identify potential bottlenecks and issues, with the goal of addressing them before they become critical problems.
In this section, I will discuss the key benefits of active monitoring.
Improved Performance and Reliability
Real-time monitoring provides the ability to detect issues early, so you can resolve them before they impact the end users. By taking preventive measures, the overall stability of the IT infrastructure goes up.
Enhanced User Experience
When systems are available and running efficiently, they help provide faster response times for end users. This results in a smooth user experience and reduced downtime.
Informed Decision Making
With active monitoring, you can discover performance trends and patterns that inform capacity planning and resource allocation. By consistently monitoring and addressing issues, you also support a culture of continuous improvement that drives ongoing enhancements to your IT infrastructure and processes. This can give a business a competitive edge by offering a superior user experience.
Conclusion
In this post, I have provided metrics that you can add to a checklist for monitoring all of your network and application infrastructure. By keeping track of these metrics, you can proactively address issues, enhance system efficiency, and improve overall business operations.