9.3 Routine monitoring checklist
Routine Monitoring Checklist
We provide you with a document to understand regular tasks required to monitor & maintain a running instance of OpenCRVS.
These cover tasks such as:
Tracking the OpenCRVS release and upgrading when required
Monitoring system upgrades such as for the server operating system
Refreshing expiring TLS/SSL certificates
Reacting to the automatic alerts that you will be notified with from Sentry or Kibana
Monitoring & Maintenance document & Excel
Download the Monitoring & Maintenance document and Excel checklist from the "Technical" zip in the OpenCRVS Requirements Templates
Built-in alerts from Kibana
OpenCRVS comes with a built-in set of automatic email alerts, that capture a minimal set of critical limits & conditions necessary for the product to work. When these alerts are triggered, the issues need to be solved as soon as possible. Oftentimes it's better if these issues are solved and planned for even before reaching the critical limits.
After deployment, these Alerts are configured automatically. But you will need to log in to Kibana to manually turn the alerts on after each deployment. We are addressing automatic enabling of these alerts by default in future versions of OpenCRVS.
It's a good practice to monitor your production installation's infrastructure manually on a daily basis. This practice improves the reliability of your environment and gives your team a chance to include server improvements in planned work. The following list captures some of the essential values that should be followed manually
Disk space usage on all nodes is less than 70%
Login to Kibana
Navigate to Observability -> Metrics -> Metrics Explorer
Use the parameters listed in 7.2 Infrastructure health under Available disk space
Verify all used disk space is under 70% on all nodes
CPU and memory usage less than 80% on all nodes
Login to Kibana
Navigate to Observability -> Metrics -> Metrics Explorer
Use the parameters listed in 7.2 Infrastructure health under CPU usage
Select a timeframe of 24 hours
Verify CPU load has not exceeded 80% on any of your server nodes
No errors in any services (Observability -> APM -> Services -> [service] -> Errors)
Last updated