1. What is Azure Monitor?
Conceptually:
Azure Monitor is the full-stack monitoring service for Azure, designed to collect, analyze, and act on telemetry data from your cloud and on-premises environments. Its purpose is to maximize the performance and availability of applications and services.
Key ideas for SREs:
- Observability is about metrics, logs, and traces.
- Monitoring is about alerting, troubleshooting, and proactive maintenance.
- Azure Monitor is centralized; it unifies monitoring for infrastructure, applications, and network.
Practical Example:
- You have an Azure App Service hosting an API. Azure Monitor can report its requests per second, response times, failed requests, and CPU/memory usage.
2. Key Concepts in Azure Monitor
2.1 Telemetry Types
Azure Monitor collects three main types of data:
1. Metrics
- Numeric values over time.
- High-frequency, structured, and lightweight.
- Example: CPU utilization of a VM, request count for an app.
2. Logs
- Rich, structured or unstructured data.
- Stored in a Log Analytics workspace.
- Example: Application exceptions, resource creation logs, security events.
3. Traces
- Part of distributed tracing for applications.
- Tracks a single request across multiple services.
- Example: An API request going through frontend → backend → database.
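To make the idea of a trace concrete, the sketch below groups spans by a shared operation ID and orders them by start time, which is roughly what a distributed tracing view does. The `Span` class, its field names, and the sample values are illustrative assumptions, not the Application Insights schema.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """One unit of work inside a request (hypothetical shape)."""
    operation_id: str  # shared by every span that belongs to the same request
    service: str       # service that did the work
    start_ms: int      # start offset in milliseconds
    duration_ms: int   # how long the work took


def build_traces(spans):
    """Group spans by operation_id and sort each group by start time."""
    traces = defaultdict(list)
    for span in spans:
        traces[span.operation_id].append(span)
    return {op: sorted(group, key=lambda s: s.start_ms) for op, group in traces.items()}


def describe(trace):
    """Render the path of one request as 'service -> service -> ...'."""
    return " -> ".join(span.service for span in trace)


# Spans arrive out of order, as they would from independent services.
spans = [
    Span("op-1", "database", 40, 15),
    Span("op-1", "frontend", 0, 80),
    Span("op-2", "frontend", 5, 20),
    Span("op-1", "backend", 10, 60),
]
traces = build_traces(spans)
print(describe(traces["op-1"]))
```

The shared identifier is what lets telemetry emitted by separate services be stitched back into a single request path.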
2.2 Core Components
1. Data Sources
- Azure resources: VMs, App Services, Functions, AKS, Storage.
- On-prem or other clouds: Using agents or exporters.
- Custom applications: Using SDKs or REST API.
2. Data Collection
- Agents: Azure Monitor Agent (AMA), which replaces the retired Log Analytics agent.
- Diagnostics settings: Configure which metrics/logs are sent.
3. Data Stores
- Metrics store: Time-series database optimized for quick visualization.
- Log Analytics workspace: Stores logs for querying with Kusto Query Language (KQL).
4. Analysis Tools
- Metrics Explorer: Visualize trends in near real time.
- Log Analytics: Query logs with KQL.
- Workbooks: Custom dashboards combining logs, metrics, text, and visuals.
5. Action/Automation
- Alerts: Triggered when metrics/logs exceed thresholds.
- Autoscale: Automatically scale resources based on metrics.
- Logic Apps/Runbooks: Automate operational responses.
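The autoscale item above boils down to a rule that compares a metric with thresholds and clamps the result to instance limits. The following is a minimal sketch of that decision logic only; the function name, the thresholds, and the one-instance step are assumptions for illustration and do not reproduce how Azure autoscale evaluates its rules.

```python
def desired_instances(current, avg_cpu, *, scale_out_above=70.0, scale_in_below=30.0,
                      minimum=1, maximum=10):
    """Return the instance count a simple CPU-based rule would ask for.

    All thresholds and limits are hypothetical defaults.
    """
    if avg_cpu > scale_out_above:
        target = current + 1   # add one instance under load
    elif avg_cpu < scale_in_below:
        target = current - 1   # remove one instance when idle
    else:
        target = current       # inside the comfort band: no change
    # Never go below the minimum or above the maximum instance count.
    return max(minimum, min(maximum, target))


print(desired_instances(2, 85.0))
print(desired_instances(1, 10.0))
```

Keeping a gap between the scale-out and scale-in thresholds helps avoid flapping, where the service repeatedly scales out and back in around a single value.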
2.3 Alerts
- Metric Alerts: Triggered when a numeric metric crosses a threshold.
- Log Alerts: Triggered by specific log patterns (using KQL queries).
- Smart detection and dynamic thresholds: Machine-learning-based anomaly detection.
- Action Groups: Define notifications/actions (email, SMS, webhook, Logic App).
Practical Example:
- Trigger an alert if VM CPU > 80% for 5 minutes.
- Trigger an alert if your API throws more than 50 errors in 1 minute.
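The "CPU > 80% for 5 minutes" rule can be read as: aggregate the metric over a 5-minute window and compare the result with the threshold. The sketch below assumes one CPU sample per minute and an average aggregation; the function name and sample values are hypothetical.

```python
from statistics import mean


def metric_alert_fires(samples, threshold=80.0, window=5):
    """Return True if the average of the last `window` samples exceeds `threshold`.

    Assumes one sample per minute, so window=5 covers five minutes.
    """
    if len(samples) < window:
        return False  # not enough data to evaluate the window yet
    return mean(samples[-window:]) > threshold


cpu_percent = [70, 85, 90, 88, 92, 86]
print(metric_alert_fires(cpu_percent))
```

Averaging over the window means a single short spike does not fire the alert, which is one way to keep alerts actionable.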
2.4 Visualizations
- Azure Dashboards: Combine metrics, logs, and workbooks.
- Workbooks: Interactive reports for troubleshooting and SRE reports.
- Application Insights: Part of Azure Monitor focused on application performance monitoring (APM).
3. Practical Implementation Steps
Step 1: Enable Azure Monitor
- Most Azure resources have Monitoring/Diagnostics settings.
- Configure diagnostic settings to send logs and metrics to:
  - Log Analytics workspace
  - Storage Account
  - Event Hub
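A diagnostic setting is essentially a small piece of configuration: what to send and where to send it. The sketch below models that idea and checks that at least one of the three destinations is set. The class and field names are hypothetical and are not the Azure resource schema; the IDs are placeholders.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class DiagnosticSetting:
    """Illustrative model of a diagnostic setting (not the real ARM schema)."""
    name: str
    log_categories: List[str] = field(default_factory=list)
    send_metrics: bool = True
    workspace_id: Optional[str] = None
    storage_account_id: Optional[str] = None
    event_hub_name: Optional[str] = None

    def destinations(self):
        """List the destinations that have been configured."""
        candidates = {
            "Log Analytics workspace": self.workspace_id,
            "Storage Account": self.storage_account_id,
            "Event Hub": self.event_hub_name,
        }
        return [label for label, value in candidates.items() if value]

    def validate(self):
        """Raise ValueError if the setting would send nothing, or send it nowhere."""
        if not self.destinations():
            raise ValueError("a diagnostic setting needs at least one destination")
        if not self.log_categories and not self.send_metrics:
            raise ValueError("select at least one log category or enable metrics")


setting = DiagnosticSetting(
    name="send-to-workspace",
    log_categories=["AppServiceHTTPLogs"],  # example category name, an assumption
    workspace_id="<workspace-resource-id>",  # placeholder
)
setting.validate()
print(setting.destinations())
```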
Step 2: Collect Data
- VMs → Install Azure Monitor agent.
- App Services → Enable Application Insights.
- AKS → Enable Container Insights.
Step 3: Query and Analyze
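- Use the Log Analytics workspace:

      // Count failed requests over the last hour, grouped into hourly bins.
      // The requests table is the Application Insights view; when querying the
      // workspace directly, the equivalent table is AppRequests (TimeGenerated, Success).
      requests
      | where timestamp > ago(1h)
      | where success == "False"
      | summarize count() by bin(timestamp, 1h)

- Metrics Explorer: visualize CPU, memory, disk, network over time.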
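For readers new to KQL, the failed-requests query has three steps: keep only recent rows, keep only failures, and count rows per time bin. The plain-Python sketch below mirrors those steps on in-memory data. The `(timestamp, success)` tuple format and the sample records are assumptions for illustration.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def count_failed_by_bin(requests, now, lookback=timedelta(hours=1),
                        bin_size=timedelta(hours=1)):
    """Count failed requests in the lookback window, grouped into fixed-size bins.

    `requests` is an iterable of (timestamp, success) tuples (hypothetical shape).
    """
    cutoff = now - lookback
    counts = Counter()
    for timestamp, success in requests:
        if success or timestamp < cutoff:
            continue  # skip successful or too-old requests
        # Floor the timestamp to the start of its bin, in the spirit of KQL's bin().
        bin_index = (timestamp - EPOCH) // bin_size
        counts[EPOCH + bin_index * bin_size] += 1
    return dict(sorted(counts.items()))


def at(hour, minute):
    """Helper for building sample timestamps on one fixed day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


sample = [
    (at(11, 5), False),
    (at(11, 10), False),
    (at(11, 40), False),
    (at(11, 50), True),    # succeeded, ignored
    (at(10, 30), False),   # older than one hour, ignored
]
now = at(12, 0)
print(count_failed_by_bin(sample, now))
print(count_failed_by_bin(sample, now, bin_size=timedelta(minutes=15)))
```

Smaller bins show when within the hour the failures clustered, at the cost of more rows in the result.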
Step 4: Set Alerts
- Metric alert: CPU > 80% for 5 minutes.
- Log alert: More than 10 failed requests in 5 minutes.
- Use action groups to notify by email or SMS, or to reach Slack/Teams through a webhook or Logic App.
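The log alert above ("more than 10 failed requests in 5 minutes") is a count over a time window. The sketch below checks that condition with a sliding window over failure timestamps. It is a simplified model: a real log alert rule evaluates a query on a schedule rather than per event, and the function name and defaults here are hypothetical.

```python
from collections import deque
from datetime import datetime, timedelta


def first_breach(failure_times, limit=10, window=timedelta(minutes=5)):
    """Return the time at which failures inside `window` first exceed `limit`, else None."""
    recent = deque()
    for ts in sorted(failure_times):
        recent.append(ts)
        # Drop failures that have fallen out of the window ending at ts.
        while ts - recent[0] >= window:
            recent.popleft()
        if len(recent) > limit:
            return ts
    return None


base = datetime(2024, 1, 1, 12, 0, 0)
burst = [base + timedelta(seconds=10 * i) for i in range(11)]   # 11 failures in 100 s
steady = [base + timedelta(minutes=i) for i in range(11)]       # 1 failure per minute
print(first_breach(burst))
print(first_breach(steady))
```

The burst breaches the limit while the steady trickle does not, even though both contain the same number of failures overall.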
Step 5: Visualize
- Build dashboards combining metrics, logs, and alerts.
- Use workbooks for SRE reporting and postmortems.
4. Advanced Concepts for SREs
1. Application Insights
- Performance monitoring: request duration, failure rate, dependencies.
- Distributed tracing for microservices.
- Live Metrics Stream for real-time observation.
2. Autoscale & Insights
- Scale VMs or App Services automatically based on metrics.
3. Log Analytics & KQL
- Powerful query language for troubleshooting.
- Example: Find the 5 VMs with the highest memory usage (lowest average available memory) in the last 24 hours:

      Perf
      | where TimeGenerated > ago(24h)
      | where ObjectName == "Memory" and CounterName == "Available MBytes"
      | summarize avg(CounterValue) by Computer
      | top 5 by avg_CounterValue asc
4. Integration
- Azure Monitor → PagerDuty/ServiceNow for incident response.
- Azure Monitor → Grafana for advanced visualizations.
- Log-based metrics → Autoscale decisions.
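Returning to the memory example under Log Analytics & KQL: the `Available MBytes` counter reports free memory, so the machines with the highest memory usage are the ones with the lowest average value. The sketch below shows that aggregation in plain Python. The `(computer, available_mb)` tuple format and the sample values are assumptions.

```python
from collections import defaultdict
from statistics import mean


def lowest_available_memory(samples, n=5):
    """Average 'Available MBytes' per computer and return the n lowest averages.

    `samples` is an iterable of (computer, available_mb) tuples (hypothetical shape).
    """
    by_computer = defaultdict(list)
    for computer, available_mb in samples:
        by_computer[computer].append(available_mb)
    averages = {computer: mean(values) for computer, values in by_computer.items()}
    # Ascending sort: least free memory first, i.e. highest memory pressure.
    return sorted(averages.items(), key=lambda item: item[1])[:n]


samples = [
    ("vm-a", 4000), ("vm-a", 2000),
    ("vm-b", 500),
    ("vm-c", 8000),
]
print(lowest_available_memory(samples, n=2))
```

Sorting the same averages in descending order would instead surface the machines with the most free memory.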
5. Best Practices for SRE Monitoring
- Centralized logging: Use one Log Analytics workspace per environment.
- Tag resources consistently: Easier filtering in logs.
- Alert only actionable events: Avoid alert fatigue.
- Use dashboards and workbooks for SRE reporting.
- Enable retention policies: Keep logs long enough for postmortems (30–90 days typical).
Summary
| Concept | Description | Example |
|---|---|---|
| Metrics | Numeric, high-frequency data | CPU %, HTTP requests/sec |
| Logs | Structured/unstructured event data | App exceptions, VM events |
| Traces | Distributed request tracking | API call through microservices |
| Data Sources | Azure, on-prem, custom apps | VMs, AKS, Functions |
| Collection | Agents, diagnostic settings | AMA, App Insights SDK |
| Analysis | Metrics Explorer, KQL, Workbooks | Query failed requests, visualize CPU trends |
| Alerts | Metric/log-based | CPU > 80%, API errors > 50/min |
| Actions | Notifications, automation | Email, Logic App, Autoscale |