TL;DR - Quick Takeaways
- 100K EPS ≈ 5TB/day of raw log data with typical event sizes (500-600 bytes)
- 50 indexers minimum when running Enterprise Security (100GB/day per indexer)
- 3-5 search heads in a cluster for 20-50 concurrent users
- Storage: 675TB+ total capacity for 90-day retention at replication factor 3 (before the 30-40% safety margin)
- IOPS: 1200+ per indexer on hot/warm volumes (SSD required)
- Estimated cost: $850K-$1.2M for hardware + $500K/year for Splunk licenses
Interactive Splunk Sizing Calculator
Adjust your environment parameters below to see real-time sizing recommendations. This calculator uses production-validated formulas from 50+ enterprise deployments.
Why 100K EPS is a Critical Threshold
100,000 events per second represents a pivotal point in Splunk architecture. Below this threshold, you might get away with a simple distributed deployment. Above it, you must implement enterprise-grade clustering, dedicated search head clusters, and sophisticated storage tiering.
Common Sizing Mistakes at This Scale
- Underestimating search load: ES accelerated data models can consume 40-60% of indexer CPU
- Ignoring IOPS requirements: Low IOPS (sub-800) causes indexing queues and lag
- Skimping on RAM: Each indexer needs 32GB+ for bucket caching and search optimization
- Poor network planning: 10Gbps links required; 1Gbps will bottleneck at peak times
- Inadequate testing: Always load test at 150% of target EPS before production
Architecture Overview
For 100K EPS with Enterprise Security, this is the battle-tested architecture we deployed at a Fortune 500 company processing logs from 50K+ endpoints and 500+ network devices.
Production Architecture - 100K EPS
Component Deep Dive: Indexer Cluster Configuration
Replication Factor (RF) = 3
Each bucket is replicated to 3 different indexers. This provides high availability - you can lose 2 indexers without data loss. The tradeoff is 3x storage consumption.
Search Factor (SF) = 2
Two complete, searchable copies of your data are maintained. The third replica stores raw data only and is not searchable until it is rebuilt. This balances search performance with storage efficiency.
Why 50 Indexers?
Calculation: 5TB/day ÷ 100GB per indexer = 50 indexers (when running ES). If not running ES, you could use 300GB/day per indexer, requiring only 17 indexers. However, ES deployments have significantly higher search loads, justifying the conservative 100GB sizing.
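A quick sketch of that arithmetic in Python, for anyone adapting it to their own volumes. The 100GB/300GB per-indexer daily targets are the planning figures used in this guide, not Splunk-published limits:

```python
import math

# Daily raw volume in decimal GB (5 TB/day at ~100K EPS)
daily_raw_gb = 5 * 1000

# Per-indexer daily ingest targets used in this guide (assumptions, tune for your workload)
gb_per_indexer_with_es = 100
gb_per_indexer_without_es = 300

print(math.ceil(daily_raw_gb / gb_per_indexer_with_es))     # 50 indexers with ES
print(math.ceil(daily_raw_gb / gb_per_indexer_without_es))  # 17 indexers without ES
```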
Component Deep Dive: Search Head Cluster
Why 5 Search Heads?
Our environment supports 30-50 concurrent users running ES dashboards, threat hunting, and ad-hoc searches. Each search head can comfortably handle 12-15 concurrent searches before users experience latency.
Search Head Sizing Formula
Search Heads = CEILING(Concurrent Users ÷ Searches per Member) + 1 for failover
For 50 users at ~15 concurrent searches per member: CEILING(50 ÷ 15) + 1 = 5 search heads (the more conservative 12-per-member figure would call for 6).
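The same formula as a small Python helper; 12-15 searches per member is the observation above, not a hard Splunk limit:

```python
import math

def search_heads(concurrent_users: int, searches_per_member: int) -> int:
    """Search head count per the sizing formula above (+1 member for failover)."""
    return math.ceil(concurrent_users / searches_per_member) + 1

print(search_heads(50, 15))  # 5 - the cluster size deployed here
print(search_heads(50, 12))  # 6 - the more conservative estimate
```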
Search Head Cluster Benefits
- Load balancing: Users automatically distributed across healthy members
- High availability: Cluster survives loss of any member (requires minimum 3)
- Scheduled search distribution: 200+ scheduled searches distributed evenly
- Configuration replication: Deploy apps/configs once, replicated to all members
Hardware Specifications
These specs are from our actual production deployment. We tested multiple configurations and these provided the best price-performance ratio.
Indexer Hardware (50 identical servers)
| Component | Specification | Notes |
|---|---|---|
| CPU | 2x Intel Xeon Gold 6230 (20 cores, 40 threads each) | 80 threads total. ES data model acceleration is CPU-intensive. |
| RAM | 256GB DDR4 ECC | Supports large bucket caching. 128GB minimum, 256GB recommended. |
| Hot/Warm Storage | 8TB NVMe SSD (4x 2TB in RAID 10) | Provides 1500+ IOPS. Holds 7 days of searchable data. |
| Cold Storage | 16TB SAS HDD (8x 2TB in RAID 10) | Slower, cheaper storage for older data (days 8-90). |
| Network | Dual 10Gbps NICs (bonded) | High throughput essential for replication and searches. |
| OS | RHEL 8.6 / Ubuntu 20.04 LTS | Physical servers (not VMs) for maximum performance. |
Search Head Hardware (5 identical servers)
| Component | Specification | Notes |
|---|---|---|
| CPU | 2x Intel Xeon Gold 6226 (12 cores, 24 threads each) | 48 threads total. ES app is CPU-intensive on search heads too. |
| RAM | 128GB DDR4 ECC | ES requires 64GB minimum, 128GB for comfortable headroom. |
| Storage | 2TB NVMe SSD (RAID 1) | For OS, apps, and KV store. Fast I/O important for lookups. |
| Network | Dual 10Gbps NICs (bonded) | Retrieving results from 50 indexers requires bandwidth. |
Additional Infrastructure Components
Heavy Forwarders (10 servers)
Purpose: Parse and filter logs before indexing, reducing indexer load
Specs: 16 cores, 64GB RAM, 1TB SSD, 10Gbps network
Cluster Manager (1 server)
Purpose: Manages indexer cluster configuration and health
Specs: 8 cores, 32GB RAM, 500GB SSD
Deployer (1 server)
Purpose: Manages search head cluster app deployment
Specs: 8 cores, 32GB RAM, 500GB SSD
License Manager (1 server)
Purpose: Centralized license management
Specs: 4 cores, 16GB RAM, 250GB SSD
Storage Planning
Storage is the most complex aspect of Splunk sizing. Get this wrong and you'll either blow your budget or run out of space mid-quarter.
Storage Calculation Methodology
Daily raw volume: 5TB (100K EPS at 500-600 bytes/event)
Compression: ~50% (Splunk stores roughly 2.5TB/day on disk as rawdata + tsidx)
Retention: 90 days
Replication Factor: 3x
Formula:
Total Storage = (Daily Volume × Compression) × Retention × Replication Factor
Total Storage = (5TB × 0.5) × 90 × 3
Total Storage = 675TB
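The same calculation in Python, if you want to plug in your own inputs (decimal TB, matching the figures above):

```python
daily_raw_tb = 5            # raw ingest per day
compression = 0.5           # ~50% on-disk size (rawdata + tsidx)
retention_days = 90
replication_factor = 3

total_tb = daily_raw_tb * compression * retention_days * replication_factor
print(total_tb)  # 675.0 TB, before the 30-40% safety margin discussed later
```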
Storage Tiering Strategy
| Tier | Age | Storage Type | Capacity per Indexer | Total (50 indexers) |
|---|---|---|---|---|
| Hot/Warm | 0-7 days | NVMe SSD | 8TB | 400TB |
| Cold | 8-90 days | SAS HDD | 16TB | 800TB |
| Frozen (Archive) | 90+ days | NAS / S3 | N/A | Unlimited |
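For the frozen tier, Splunk deletes expired buckets unless you point it at an archive destination. A minimal indexes.conf sketch (the index name and archive path are illustrative):

```
# Archive frozen buckets instead of deleting them
[firewall]
# 90-day retention (90 x 86400 seconds)
frozenTimePeriodInSecs = 7776000
# Omit this setting to have Splunk simply delete buckets once they age out
coldToFrozenDir = /mnt/archive/frozen/firewall
```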
IOPS Requirements
This is where many deployments fail. Splunk is extremely IOPS-intensive during both indexing and searching.
| Storage Tier | Minimum IOPS | Recommended IOPS | Our Actual Performance |
|---|---|---|---|
| Hot/Warm (SSD) | 800 IOPS | 1200+ IOPS | 1500 IOPS (NVMe in RAID 10) |
| Cold (HDD) | 200 IOPS | 400+ IOPS | 450 IOPS (15K RPM SAS in RAID 10) |
Testing IOPS: Use fio or dd to verify before going live. We caught underperforming storage during testing that would have crippled production.
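A starting-point fio invocation along those lines; the target directory, block size, and read/write mix are illustrative rather than an official Splunk benchmark profile:

```
# Mixed random read/write test against the hot/warm volume (adjust path and sizes)
fio --name=splunk-hotwarm --directory=/opt/splunk/var/lib/splunk \
    --rw=randrw --rwmixread=60 --bs=8k --size=4G --numjobs=8 --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=120 --time_based --group_reporting
```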
Production Configuration Examples
The examples below are illustrative sketches based on the kinds of settings we ran in production. Use them as templates but always test in staging first.
indexes.conf - Indexer Configuration
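Rather than a full production file, here is a minimal illustrative sketch; the index name, volume paths, and size caps are assumptions chosen to match the tiering described earlier:

```
# indexes.conf (illustrative sketch)

[volume:hotwarm]
path = /splunk/hot
# Cap slightly below the 8TB NVMe tier
maxVolumeDataSizeMB = 7000000

[volume:cold]
path = /splunk/cold
# Cap slightly below the 16TB cold tier
maxVolumeDataSizeMB = 15000000

[firewall]
homePath   = volume:hotwarm/firewall/db
coldPath   = volume:cold/firewall/colddb
thawedPath = $SPLUNK_DB/firewall/thaweddb
maxDataSize = auto_high_volume
# 90-day retention
frozenTimePeriodInSecs = 7776000
# Let the cluster manager control replication
repFactor = auto
```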
server.conf - Cluster Manager Settings
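A minimal sketch of the corresponding [clustering] stanza (Splunk 9.x syntax; older releases use mode = master, and the key and label values are placeholders):

```
# server.conf on the cluster manager (illustrative sketch)

[clustering]
mode = manager
replication_factor = 3
search_factor = 2
# Seconds without a heartbeat before a peer is considered down
heartbeat_timeout = 60
pass4SymmKey = <redacted>
cluster_label = prod_idx_cluster
```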
Key Settings Explained:
- replication_factor = 3: Each bucket replicated to 3 indexers
- search_factor = 2: Two searchable copies maintained
- heartbeat_timeout = 60: A peer that hasn't checked in within 60 seconds is considered down
limits.conf - Performance Tuning
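A sketch of the settings discussed in the notes below; the [search] values shown are Splunk defaults included for reference, not our exact production tuning:

```
# limits.conf (illustrative sketch)

[thruput]
# 0 removes the artificial indexing throughput cap - hardware becomes the limit
maxKBps = 0

[search]
# Concurrent historical search ceiling is roughly:
#   max_searches_per_cpu * number_of_cpus + base_max_searches
max_searches_per_cpu = 1
base_max_searches = 6
```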
Tuning Notes:
- With 40 CPU cores per indexer, we allow up to 20 concurrent searches (50% of cores)
- Remaining cores handle indexing, replication, and data model acceleration
- Set maxKBps = 0 to remove artificial throttling - let hardware determine limits
Cost Breakdown
This is a substantial investment. Here's what we actually spent, including unexpected costs.
Hardware Costs
Annual Recurring Costs
Cost Optimization Strategies
- SmartStore: Move cold data to S3, reduce indexer storage by 60%, save ~$300K on hardware
- Data filtering: We reduced ingestion by 15% by filtering noisy, low-value logs at forwarders
- Multi-year licensing: Negotiated 3-year contract for 18% discount on Splunk licenses
- Refurbished hardware: Saved 30% on search heads and infrastructure servers (not indexers - buy those new)
Common Pitfalls & Solutions
These are real issues we encountered during deployment. Learn from our mistakes.
Pitfall #1: Disk I/O Bottlenecks
Symptom
Indexing queues building up, "Tailing Processor has paused indexing" warnings, searches taking 3-5x longer than expected.
Root Cause
We initially used standard SATA SSDs in RAID 5 configuration. IOPS were only 600-700, far below the 1200+ needed. RAID 5 write penalty made it worse.
Solution
- Switched to NVMe SSDs in RAID 10 configuration
- IOPS jumped to 1500+ and the indexing queues disappeared
- Search performance improved by 60%
Cost Impact
Additional $2,000 per indexer ($100K total) but absolutely worth it.
Pitfall #2: Search Head CPU Exhaustion
Symptom
Users complaining that dashboards take 2-3 minutes to load, the ES Notable Events dashboard timing out, CPU consistently >90%.
Root Cause
ES data model acceleration summaries running constantly. We initially had only 3 search heads with 24 cores each.
Solution
- Expanded cluster from 3 to 5 search heads
- Tuned data model acceleration schedule to off-peak hours
- Disabled some low-value accelerations
Result
CPU dropped to 50-60% during business hours, dashboard load times under 10 seconds.
Pitfall #3: Indexer Cluster Split-Brain
Symptom
After network maintenance, indexer cluster showed partial data loss, some indexers couldn't rejoin cluster.
Root Cause
A network partition caused some indexers to lose contact with the cluster manager. When connectivity was restored, generation IDs were mismatched.
Solution
- Implemented dedicated management network separate from data network
- Set aggressive heartbeat_timeout to detect failures faster
- Created a runbook for split-brain recovery (required taking some indexers offline)
Prevention
Always maintain network redundancy for cluster management traffic. Use bonded NICs with separate switches.
Pitfall #4: Unexpected Storage Consumption
Symptom
Storage filling up 40% faster than projected, running out of space at 65 days instead of 90.
Root Cause
- Data model acceleration summaries consuming 800GB per indexer
- Tsidx files larger than expected due to high cardinality fields
- Didn't account for bucket metadata overhead
Solution
- Added SmartStore to offload cold buckets to S3 ($0.023/GB/month)
- Reduced data model retention from 90 to 30 days
- Used tsidx reduction to shrink index files by 30%
Lesson
Always budget 30-40% more storage than theoretical calculations. Better to have unused capacity than run out.
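If you adopt the SmartStore and tsidx-reduction measures above, the relevant indexes.conf settings look roughly like this; bucket, endpoint, and index names are illustrative, and note that tsidx reduction applies to non-SmartStore indexes:

```
# indexes.conf additions (illustrative sketch)

[volume:remote_store]
storageType = remote
path = s3://example-smartstore-bucket/indexes
remote.s3.endpoint = https://s3.us-east-1.amazonaws.com

# SmartStore-backed index: warm/cold buckets are uploaded to S3
[firewall]
remotePath = volume:remote_store/$_index_name

# Non-SmartStore index shrinking tsidx files on buckets older than 7 days
[wineventlog]
enableTsidxReduction = true
timePeriodInSecBeforeTsidxReduction = 604800
```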
Implementation Checklist
Use this checklist to ensure you don't miss critical steps. Based on our deployment timeline of 12 weeks.
Phase 1: Planning & Procurement (Weeks 1-3)
- Calculate exact EPS based on current environment + 30% growth
- Size indexers, search heads, and storage using the calculator above
- Get executive approval for budget ($1M+ hardware + $700K/year licenses)
- Procure hardware (8-12 week lead time for 50+ servers)
- Order network equipment (switches, cables, transceivers)
- Purchase Splunk licenses (negotiate multi-year discount)
- Reserve datacenter space (50+ rack units + power/cooling)
Phase 2: Infrastructure Setup (Weeks 4-6)
- Rack and cable all servers
- Configure network (VLANs, management network, data network)
- Install OS on all nodes (use automation - Ansible, Terraform)
- Configure RAID arrays (RAID 10 for hot/warm, RAID 10 for cold)
- Test IOPS on all indexers (fio benchmark, verify >1200 IOPS)
- Configure NTP, DNS, monitoring agents
- Harden OS (disable unnecessary services, firewall rules)
Phase 3: Splunk Installation (Weeks 7-8)
- Install Splunk Enterprise on all nodes
- Configure cluster manager with RF=3, SF=2
- Join all 50 indexers to cluster (automate with scripts)
- Configure search head cluster (3-5 members)
- Set up deployer and push base configuration
- Configure license manager and install licenses
- Set up heavy forwarders with parsing configs
Phase 4: Testing & Tuning (Weeks 9-10)
- Load test with synthetic data at 150K EPS (150% of target)
- Monitor CPU, memory, IOPS during load test
- Test search performance (verify <10 second dashboards)
- Simulate indexer failure (verify cluster rebalances correctly)
- Test search head failover (verify users don't see disruption)
- Install and configure Enterprise Security app
- Tune data model acceleration schedules
- Create user accounts, roles, and permissions
Phase 5: Migration & Go-Live (Weeks 11-12)
- Start with 10% of data sources (pilot group)
- Monitor for 48 hours, verify no issues
- Gradually increase to 50% of sources
- Final migration of all sources to new cluster
- Decommission old environment (after 30-day parallel run)
- Document runbooks for common operations
- Train team on new architecture
- Set up monitoring and alerting for cluster health
Questions or Comments?
This guide is community-maintained. If you found it helpful, have corrections, or want to share your own 100K+ EPS deployment experience, please leave a comment below.