Module 13.3: Log & Error Analysis
Estimated time: ~35 minutes
Prerequisite: Module 13.1 (Data Analysis), Module 13.2 (Report Generation)
Outcome: After this module, you will know how to analyze log files with Claude Code, identify error patterns, perform root cause analysis, and generate incident reports.
1. WHY — Why This Matters
Production is down. You have 500MB of logs. Somewhere in there is the answer. Traditional approach: grep for “ERROR”, scroll through thousands of lines, correlate timestamps manually. Takes hours. You might miss the real cause.
Claude Code approach: “Analyze these logs. Find all errors, identify patterns, suggest root cause.” Claude parses the logs, groups errors, finds correlations, and pinpoints the culprit — in minutes. Log analysis is where AI assistance provides the highest leverage.
2. CONCEPT — Core Ideas
Log Analysis Workflow
[Raw Logs: 500MB] → [Parse: Structure] → [Filter: Errors] → [Pattern: Groups] → [Correlate: Timeline] → [Root Cause: Answer]
Log Analysis Tasks
| Task | Description | Claude Prompt |
|---|---|---|
| Parse | Understand log format | “What format are these logs?” |
| Filter | Extract relevant entries | “Show all ERROR and WARN entries” |
| Count | Frequency analysis | “Count errors by type” |
| Timeline | Temporal patterns | “When did errors start spiking?” |
| Correlate | Find relationships | “What happens before each OOM error?” |
| Root Cause | Identify source | “What’s the likely root cause?” |
Common Log Formats
- Application logs (timestamp, level, message)
- Access logs (Apache, Nginx)
- JSON structured logs (a parsing sketch covering this format and access logs follows the list)
- Stack traces
- Syslog
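Before handing a file to Claude, it can help to see how differently these formats parse. Below is a minimal Python sketch covering two of them; the sample lines, field names, and regex are illustrative assumptions, not taken from any particular system.

```python
# Hedged sketch: parsing one JSON structured line and one access-log line.
# The sample lines, field names, and regex are illustrative assumptions.
import json
import re

json_line = (
    '{"timestamp": "2024-01-15T14:16:02Z", "level": "ERROR", '
    '"service": "api-gateway", "message": "Database connection timeout"}'
)
access_line = '192.168.1.10 - - [15/Jan/2024:14:16:02 +0000] "GET /api/users HTTP/1.1" 500 123'

# JSON structured logs: one json.loads() per line gives you named fields directly.
entry = json.loads(json_line)
print(entry["level"], entry["message"])

# Apache/Nginx "common"-style access logs: a regex captures the interesting fields.
ACCESS_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" (?P<status>\d{3}) (?P<size>\S+)'
)
match = ACCESS_RE.match(access_line)
if match:
    print(match.group("status"), match.group("path"))
```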
Error Pattern Types
- Frequency spike: Sudden increase in error rate (a detection sketch follows this list)
- Cascading failures: One error triggers others
- Periodic errors: Cron-related, scheduled tasks
- Resource exhaustion: Memory, disk, connections
- External dependency: Third-party service failures
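The first pattern, a frequency spike, is simple enough to sketch by hand if you want to verify what Claude reports. The per-minute counts, the 3-minute baseline window, and the 3x threshold below are assumptions chosen purely for illustration.

```python
# Hedged sketch: flagging a frequency spike. The per-minute counts, the 3-minute
# baseline window, and the 3x threshold are all assumptions for illustration.
from statistics import mean

errors_per_minute = {
    "14:13": 3, "14:14": 4, "14:15": 5, "14:16": 23, "14:17": 48, "14:18": 55,
}

BASELINE_WINDOW = 3  # minutes used for the rolling baseline
SPIKE_FACTOR = 3     # "spike" = more than 3x the recent baseline

minutes = sorted(errors_per_minute)
for i in range(BASELINE_WINDOW, len(minutes)):
    baseline = mean(errors_per_minute[m] for m in minutes[i - BASELINE_WINDOW:i])
    count = errors_per_minute[minutes[i]]
    if count > SPIKE_FACTOR * max(baseline, 1):
        print(f"Spike starts at {minutes[i]}: {count} errors/min (baseline ~{baseline:.1f})")
        break
```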
3. DEMO — Step by Step
Scenario: Production incident — API returning 500 errors. Analyze application.log.
Step 1: Initial Exploration
You: Read application.log (last 10000 lines). What's the log format?
Give me a summary: total entries, log levels distribution, time range.
Claude: [Reads and analyzes log]
Log Format: JSON structured
- timestamp: ISO 8601
- level: DEBUG, INFO, WARN, ERROR
- service: api-gateway
- message: free text
- trace_id: UUID (for correlation)
Summary (last 10000 entries):
- Time range: 2024-01-15 14:00 to 14:45 (45 minutes)
- DEBUG: 4,521 (45%)
- INFO: 3,892 (39%)
- WARN: 1,203 (12%)
- ERROR: 384 (4%) ← Elevated, normal is <1%
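If you want to spot-check a summary like this yourself, the counting takes only a few lines. A rough sketch, assuming the JSON-lines format identified above (the file name application.log comes from the demo):

```python
# Hedged sketch: reproducing the log-level summary by hand. Assumes the JSON-lines
# format from Step 1; the file name application.log mirrors the demo.
import json
from collections import Counter

levels = Counter()
timestamps = []
with open("application.log", encoding="utf-8") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the summary
        levels[entry.get("level", "UNKNOWN")] += 1
        ts = entry.get("timestamp")
        if ts:
            timestamps.append(ts)

total = sum(levels.values())
if timestamps:
    print(f"Entries: {total}, range: {min(timestamps)} .. {max(timestamps)}")
for level, count in levels.most_common():
    print(f"{level}: {count} ({count / total:.0%})")
```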
Step 2: Error Analysis
You: Extract all ERROR entries. Group by error type/message. Show count.
Claude: [Parses and groups]
Error Distribution:
- "Database connection timeout" — 156 (41%)
- "Redis connection refused" — 124 (32%)
- "Null pointer in UserService" — 62 (16%)
- "Request timeout exceeded" — 42 (11%)
Observation: Database and Redis connection errors dominate (73%). Suggests infrastructure issue, not application bug.
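The grouping step is just as easy to verify by hand. A rough sketch under the same JSON format assumption as Step 1:

```python
# Hedged sketch: grouping ERROR entries by message, as in the distribution above.
# Assumes the JSON-lines format from Step 1.
import json
from collections import Counter

error_messages = Counter()
with open("application.log", encoding="utf-8") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("level") == "ERROR":
            error_messages[entry.get("message", "<no message>")] += 1

total_errors = sum(error_messages.values())
for message, count in error_messages.most_common():
    print(f"{count:5d}  ({count / total_errors:.0%})  {message}")
```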
Step 3: Timeline Analysis
You: Show error frequency over time, minute by minute. When did spike start?
Claude: [Time-series analysis]
Error Timeline:
- 14:00-14:15: 2-5 errors/minute (baseline)
- 14:16: Jump to 23 errors ← SPIKE START
- 14:17-14:30: 40-60 errors/minute (elevated)
- 14:31-14:45: 20-30 errors/minute (partial recovery)
Spike began at exactly 14:16. Something changed at that moment.
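To reproduce the minute-by-minute view yourself, truncating ISO 8601 timestamps to the minute is usually enough. A sketch under the same format assumptions as before:

```python
# Hedged sketch: minute-by-minute error counts. Assumes ISO 8601 timestamps,
# so truncating the string to 16 characters ("YYYY-MM-DDTHH:MM") buckets by minute.
import json
from collections import Counter

per_minute = Counter()
with open("application.log", encoding="utf-8") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("level") == "ERROR":
            per_minute[entry.get("timestamp", "")[:16]] += 1

for minute in sorted(per_minute):
    count = per_minute[minute]
    print(f"{minute}  {count:4d}  {'#' * min(count, 60)}")
```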
Step 4: Correlation and Root Cause
You: What happened at 14:16? Look for unusual entries in the 2 minutes before the spike. What's the likely root cause?
Claude: [Correlation analysis]
Events at 14:14-14:16:
- 14:14:32 INFO "Deployment started: api-gateway v2.3.1"
- 14:15:01 INFO "Database connection pool resize: 10 → 50"
- 14:15:45 WARN "Redis memory usage at 95%"
- 14:16:02 ERROR "Database connection timeout" (first of spike)
ROOT CAUSE ANALYSIS:
Primary: Deployment at 14:14 triggered connection pool resize. The 5x increase in DB connections overwhelmed the database.
Secondary: Redis was already stressed (95% memory).
Cascade: DB timeout → retry → more connections → Redis overflow
Recommendation: Roll back deployment, investigate pool settings.
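The correlation step amounts to pulling every entry, at any level, from a short window just before the spike. A sketch with the demo's window hard-coded as an assumption:

```python
# Hedged sketch: correlation by hand. Pull every entry, at any level, from the
# window just before the spike. The timestamps mirror the demo but are assumptions.
import json

WINDOW_START = "2024-01-15T14:14"  # two minutes before the spike
SPIKE_START = "2024-01-15T14:16"

with open("application.log", encoding="utf-8") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        ts = entry.get("timestamp", "")
        # ISO 8601 strings sort lexicographically, so plain string comparison works.
        if WINDOW_START <= ts < SPIKE_START:
            print(ts, entry.get("level"), entry.get("message"))
```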
4. PRACTICE — Try It Yourself
Exercise 1: Log Format Recognition
Goal: Practice identifying different log formats.
Instructions:
- Find or create sample logs (different formats)
- Ask Claude to identify the format
- Ask for parsing strategy for each format
💡 Hint
Common formats: Apache access logs, JSON structured, syslog, custom application logs. Each has distinct patterns.
✅ Solution
Prompt: "Read [log file]. What format is this? Show the structure and suggest a parsing approach."
Claude will identify:
- Delimiter-based (space, tab, pipe)
- JSON structured
- Regex-parseable patterns
- Timestamp formats
Exercise 2: Error Pattern Finding
Goal: Group and analyze errors.
Instructions:
- Use a log file with multiple error types
- Ask Claude to group and count errors
- Ask for timeline analysis
- Identify the dominant pattern
💡 Hint
Ask: “Extract ERROR entries, group by message, show count for each, then show frequency over time.”
✅ Solution
Prompts in sequence:
1. "Extract all ERROR entries from [log]. Group by error message."
2. "Count each error type. Which is most frequent?"
3. "Show error frequency over time. Any spikes?"
4. "What pattern does this suggest? (spike, cascade, periodic, resource)"
Exercise 3: Incident Investigation
Goal: Complete root cause analysis.
Instructions:
- Create logs with a simulated “problem”
- Ask Claude to find root cause
- Request an incident report
- Compare Claude’s finding with actual cause
💡 Hint
Plant a specific cause (e.g., deployment, config change) in the logs. See if Claude finds it.
✅ Solution
Full investigation prompt:
"Analyze [log]. Find all errors, identify when they started spiking, look for events that preceded the spike, and determine the likely root cause. Generate an incident report."
5. CHEAT SHEET
Analysis Workflow
Parse → Filter → Count → Timeline → Correlate → Root Cause
Key Prompts
| Stage | Prompt |
|---|---|
| Exploration | “What format? Summary of log levels.” |
| Error focus | “All ERRORs grouped by type, with counts.” |
| Timeline | “Error frequency over time. When did spike start?” |
| Correlation | “What happens before each [error]?” |
| Root cause | “Based on this, what’s the likely root cause?” |
Error Pattern Types
- Frequency spike (sudden increase)
- Cascading failure (chain reaction)
- Periodic (scheduled/cron)
- Resource exhaustion (memory, disk)
- External dependency (third-party)
Output Requests
- “Generate incident report”
- “Create timeline visualization”
- “Summarize for non-technical stakeholders”
6. PITFALLS — Common Mistakes
| ❌ Mistake | ✅ Correct Approach |
|---|---|
| Analyzing full log file (too large) | Start with tail, recent entries, or time range |
| Focusing only on ERROR level | WARN often precedes ERROR; check both |
| Ignoring timestamps | Timeline is crucial for correlation |
| Missing trace IDs | Use trace_id to follow request flow (see the sketch after this table) |
| Assuming first error is root cause | Look for what PRECEDES errors |
| No context about system | Tell Claude about your architecture |
| Stopping at symptoms | Ask “why” repeatedly until you reach the root cause |
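Following a single request across services is mostly a filter on trace_id. A minimal sketch, assuming the JSON format from the demo; the UUID is a made-up placeholder:

```python
# Hedged sketch: following one request through the log via trace_id. The field
# name comes from the demo format; the UUID is a made-up placeholder.
import json

TRACE_ID = "00000000-0000-0000-0000-000000000000"  # hypothetical value

with open("application.log", encoding="utf-8") as f:
    for line in f:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue
        if entry.get("trace_id") == TRACE_ID:
            print(entry.get("timestamp"), entry.get("level"), entry.get("message"))
```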
7. REAL CASE — Production Story
Scenario: Vietnamese fintech, midnight production incident. Payment service returning errors. The on-call engineer had 3 hours of logs and was panicking.
Claude Code Investigation (15 minutes):
Step 1: “Analyze payment-service.log. Summary and error distribution.” → Found: 89% of errors were “Circuit breaker open: bank-api”
Step 2: “When did circuit breaker errors start? What happened before?” → Spike at 23:47. At 23:45: “Bank API response time: 15000ms” (normal: 200ms)
Step 3: “Is this our issue or bank’s issue?” → Analysis: Bank API latency spike caused our circuit breaker to trip. Our system behaved correctly. Bank API is the root cause.
Step 4: “Generate incident report for stakeholders.” → Complete report with timeline, analysis, and recommendation
Resolution:
- Confirmed with bank: their maintenance window overran
- Our circuit breaker worked as designed
- Incident report ready for morning standup
Quote: “What would have taken 2 hours of grep and scrolling, Claude did in 15 minutes. And it found the external dependency issue I might have missed.”
Phase 13 Complete! You now have data superpowers — from analysis to reports to log investigation.
Next Phase: Phase 14: Optimization →