Troubleshooting Common WmiAxon Issues and Fixes
1. Agent won’t connect to the server
- Symptom: Agent shows offline in dashboard or fails to register.
- Quick fixes:
- Network: Verify the agent can reach the server IP/hostname and the required port (ping tests basic reachability; telnet tests the TCP port itself).
- Firewall: Open agent outbound port and server inbound port.
- DNS: Confirm hostname resolves correctly.
- Time sync: Ensure the system clock is correct and NTP is working (clock drift often causes certificate validation failures).
- Logs: Check agent logs for TLS/auth errors and rotate credentials if expired.
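The DNS and port checks above can be scripted so the failing stage is obvious. The following is a minimal Python sketch using only the standard library; it is not a WmiAxon tool, and the function name `check_endpoint` is made up for illustration. It reports whether the failure happened at name resolution or at the TCP connection.

```python
import socket


def check_endpoint(host, port, timeout=3.0):
    """Resolve host, then attempt a TCP connection to host:port.

    Returns a (stage, detail) tuple where stage is "ok", "dns", or "tcp".
    """
    try:
        # DNS step: fails here if the hostname does not resolve.
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return ("dns", str(exc))

    last_error = "no addresses returned"
    for family, socktype, proto, _canon, sockaddr in infos:
        try:
            # TCP step: fails here if a firewall blocks the port
            # or nothing is listening on it.
            with socket.socket(family, socktype, proto) as sock:
                sock.settimeout(timeout)
                sock.connect(sockaddr)
            return ("ok", sockaddr[0])
        except OSError as exc:
            last_error = str(exc)
    return ("tcp", last_error)
```

Calling `check_endpoint("your-server-hostname", your_port)` from the agent host returns `"dns"` when resolution fails and `"tcp"` when the name resolves but the port cannot be reached, which usually points at a firewall or a stopped service.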
2. High CPU or memory usage on monitored hosts
- Symptom: Monitoring causes resource spikes or alerts about resource exhaustion.
- Quick fixes:
- Sampling rate: Lower polling frequency for heavy metrics.
- Disable unused checks: Turn off nonessential plugins/collectors.
- Batching: Enable metric batching or increase collection intervals.
- Upgrade agent: Install the latest agent release, which may include performance improvements.
- Profile: Use system profiler to find hot threads in the agent process.
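To show why batching lowers overhead, here is a small Python sketch of the general idea: samples are buffered and handed off in groups instead of one at a time. The `MetricBatcher` class and its `send` callback are hypothetical and do not reflect how the WmiAxon agent is implemented.

```python
class MetricBatcher:
    """Buffer samples and hand them to `send` in groups of `max_size`."""

    def __init__(self, send, max_size=100):
        self._send = send          # callable that receives a list of samples
        self._max_size = max_size  # flush automatically at this many samples
        self._buffer = []

    def add(self, sample):
        """Queue one sample; flush when the buffer is full."""
        self._buffer.append(sample)
        if len(self._buffer) >= self._max_size:
            self.flush()

    def flush(self):
        """Send whatever is buffered, if anything."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._send(batch)
```

With `max_size=100`, a collector producing 1,000 samples makes 10 calls to `send` instead of 1,000. The trade-off is latency: a sample can wait in the buffer until the batch fills or `flush()` is called.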
3. Missing or inconsistent metrics
- Symptom: Expected metrics absent or show gaps/inconsistent values.
- Quick fixes:
- Collector health: Verify each collector/plugin is enabled and healthy.
- Permissions: Ensure agent has permission to access required system resources (e.g., WMI, files, APIs).
- Network drops: Check for packet loss between agent and server; enable retries.
- Metric names/versions: Confirm metric names didn’t change after upgrades; update dashboards/queries.
- Log inspection: Review agent/plugin logs for errors or timeouts.
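When chasing gaps, it helps to pin down exactly when samples went missing before reading logs. The sketch below assumes you can export a metric's sample timestamps as plain numbers in seconds; the function name and the 1.5x tolerance are illustrative choices, not WmiAxon settings.

```python
def find_gaps(timestamps, expected_interval, tolerance=1.5):
    """Return (start, end) pairs where consecutive samples are too far apart.

    timestamps: sample times in seconds (any order).
    expected_interval: the configured collection interval in seconds.
    tolerance: a gap is reported when spacing exceeds interval * tolerance.
    """
    ordered = sorted(timestamps)
    limit = expected_interval * tolerance
    return [
        (prev, curr)
        for prev, curr in zip(ordered, ordered[1:])
        if curr - prev > limit
    ]
```

For a 60-second interval, `find_gaps([0, 60, 120, 300, 360], 60)` reports one gap between 120 and 300. Those two timestamps are the window to search in the agent and plugin logs.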
4. Alerts firing too often or false positives
- Symptom: Noisy alerts or repeated notifications for the same condition.
- Quick fixes:
- Threshold tuning: Adjust thresholds or require the condition to persist (e.g., for 5 min) before the alert fires.
- Aggregation: Aggregate metrics over a window before alert evaluation.
- Suppression/maintenance windows: Configure suppression during known maintenance periods.
- Dependencies: Use alert dependencies to avoid duplicate alerts from downstream services.
- Flapping detection: Enable flapping detection or add cooldown periods.
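The sustained-duration and cooldown ideas above can be sketched in a few lines of Python. This illustrates the logic only, under the assumption that samples arrive as (timestamp, value) pairs. The class name and parameters are hypothetical and do not correspond to WmiAxon configuration options.

```python
class SustainedAlert:
    """Fire only after a value stays above a threshold for `sustain` seconds,
    then stay quiet for `cooldown` seconds before firing again."""

    def __init__(self, threshold, sustain, cooldown):
        self.threshold = threshold
        self.sustain = sustain
        self.cooldown = cooldown
        self._breach_start = None  # when the current breach began
        self._last_fired = None    # when the alert last fired

    def observe(self, timestamp, value):
        """Record one sample; return True if an alert should fire now."""
        if value <= self.threshold:
            self._breach_start = None  # condition cleared, reset the timer
            return False
        if self._breach_start is None:
            self._breach_start = timestamp
        if timestamp - self._breach_start < self.sustain:
            return False  # breach has not lasted long enough yet
        if (self._last_fired is not None
                and timestamp - self._last_fired < self.cooldown):
            return False  # still inside the cooldown window
        self._last_fired = timestamp
        return True
```

With `SustainedAlert(threshold=90, sustain=300, cooldown=600)`, a short spike above 90 that clears within five minutes never fires. A breach that persists fires once, then stays silent for ten minutes.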
5. Dashboard or query performance problems
- Symptom: Dashboards load slowly or queries time out.
- Quick fixes:
- Query scope: Narrow time ranges and reduce high-cardinality group-bys.
- Downsampling: Use pre-aggregated or downsampled metrics for long-range views.
- Indexing: Ensure backend indexes are healthy and retention policies are appropriate.
- Panel limits: Reduce number of panels or concurrent queries per dashboard.
- Backend scaling: Scale out query nodes or add resources if the slowness is chronic.
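Downsampling is easiest to understand with a small example. The Python sketch below averages raw points into fixed-width buckets, which is one common approach. It assumes points are (timestamp, value) pairs and is not a description of how the WmiAxon backend stores data.

```python
from collections import defaultdict


def downsample(points, bucket_seconds):
    """Average (timestamp, value) points into fixed-width time buckets.

    Returns a list of (bucket_start, mean_value) sorted by bucket_start.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        # Align each timestamp to the start of its bucket.
        bucket_start = int(ts // bucket_seconds) * bucket_seconds
        buckets[bucket_start].append(value)
    return [
        (start, sum(vals) / len(vals))
        for start, vals in sorted(buckets.items())
    ]
```

A 30-day panel built on 1-hour buckets reads 720 points per series, where 10-second raw data would be 259,200. Averaging hides short spikes, so keep raw data for short-range views.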
6. Authentication and permission errors
- Symptom: Users cannot log in or access resources; agent auth failures.
- Quick fixes:
- Credentials: Confirm API keys, tokens, or service accounts are valid and unexpired.
- Role mapping: Verify RBAC roles permit required actions.
- SSO/SAML: Check identity provider connectivity and certificate validity.
- Audit logs: Inspect auth logs for denied requests and trace causes.
7. TLS/Certificate failures
- Symptom: Connection refused, handshake failures, or “certificate expired” errors.
- Quick fixes:
- Certificate validity: Check expiry and renew as needed.
- Chain and CA: Ensure full chain is presented and trusted by clients.
- Hostname mismatch: Confirm cert SANs include server hostnames.
- Protocol support: Ensure both sides support common TLS versions and ciphers.
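Certificate expiry can be checked ahead of time with Python's standard `ssl` module. The sketch below is a generic check, not a WmiAxon feature; the function names are made up for illustration, and you supply the server host and port.

```python
import socket
import ssl
import time


def days_until_expiry(not_after, now=None):
    """Days remaining before a certificate's notAfter time.

    not_after uses the format returned by getpeercert(), e.g.
    "Jan  5 09:34:43 2018 GMT". Negative results mean already expired.
    """
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def fetch_not_after(host, port, timeout=5.0):
    """Connect with default trust settings and return the peer cert's notAfter.

    Raises ssl.SSLCertVerificationError if the chain or hostname is not
    trusted, which is itself a useful diagnostic.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Because `fetch_not_after` verifies the chain and hostname, an already expired or mismatched certificate raises an error during the handshake and never returns a date. The date check is therefore most useful as an early warning, for example flagging anything with fewer than 30 days left.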
8. Upgrade or compatibility issues
- Symptom: After upgrade, features break or agents stop reporting.
- Quick fixes:
- Compatibility matrix: Verify agent/server versions are compatible before upgrading.
- Rollback plan: Keep backups/config exports and an easy rollback path.
- Staged rollout: Upgrade a subset first to validate.
- Migration notes: Follow vendor migration docs for schema or API changes.
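For a staged rollout, one simple way to pick the first subset is a stable hash of the hostname, so the same hosts stay in the early group as the percentage grows. This Python sketch shows that general technique and is not a built-in WmiAxon mechanism; the function name is hypothetical.

```python
import hashlib


def in_rollout_stage(hostname, percent):
    """Return True if hostname falls in the first `percent` of a stable hash space."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    # Map the hostname to a fixed bucket from 0 to 99.
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < percent
```

Start with a small value such as `percent=5`, confirm those agents report correctly, then raise it. A host selected at 5 percent is also selected at every higher percentage, so no host is skipped or upgraded twice.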
Diagnostic checklist
- Collect logs: Agent + server + system logs for the timeframe.
- Reproduce: Try to reproduce the issue in a test environment.
- Isolate: Disable nonessential plugins or components to narrow cause.
- Confirm environment: Check network, DNS, time, permissions, and certificates.
- Escalate: If unresolved, gather timestamps, logs, and config files, then open a support ticket.