TL;DR - Quick Takeaways
- 100K EPS ≈ 5TB/day of raw log data with typical event sizes (500-600 bytes)
- 50 indexers minimum when running Enterprise Security (100GB/day per indexer)
- 3-5 search heads in a cluster for 20-50 concurrent users
- Storage: 675TB+ total capacity for 90-day retention at replication factor 3 (before the 30-40% safety margin)
- IOPS: 1200+ per indexer on hot/warm volumes (SSD required)
- Estimated cost: $850K-$1.2M for hardware + $500K/year for Splunk licenses
Interactive Splunk Sizing Calculator
Adjust your environment parameters below to see real-time sizing recommendations. This calculator uses production-validated formulas from 50+ enterprise deployments.
Why 100K EPS is a Critical Threshold
100,000 events per second represents a pivotal point in Splunk architecture. Below this threshold, you might get away with a simple distributed deployment. Above it, you must implement enterprise-grade clustering, dedicated search head clusters, and sophisticated storage tiering.
Common Sizing Mistakes at This Scale
- Underestimating search load: ES accelerated data models can consume 40-60% of indexer CPU
- Ignoring IOPS requirements: Low IOPS (sub-800) causes indexing queues and lag
- Skimping on RAM: Each indexer needs 32GB+ for bucket caching and search optimization
- Poor network planning: 10Gbps links required; 1Gbps will bottleneck at peak times
- Inadequate testing: Always load test at 150% of target EPS before production
Architecture Overview
For 100K EPS with Enterprise Security, this is the battle-tested architecture we deployed at a Fortune 500 company processing logs from 50K+ endpoints and 500+ network devices.
Production Architecture - 100K EPS
Component Deep Dive: Indexer Cluster Configuration
Replication Factor (RF) = 3
Each bucket is replicated to 3 different indexers. This provides high availability - you can lose 2 indexers without data loss. The tradeoff is 3x storage consumption.
Search Factor (SF) = 2
Two complete, searchable copies of your data are maintained. The third replica stores raw data only and is not searchable until it is rebuilt. This balances search performance with storage efficiency.
Why 50 Indexers?
Calculation: 5TB/day ÷ 100GB per indexer = 50 indexers (when running ES). If not running ES, you could use 300GB/day per indexer, requiring only 17 indexers. However, ES deployments have significantly higher search loads, justifying the conservative 100GB sizing.
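A quick sketch of that arithmetic in Python, for anyone adapting it to their own volumes. The 100GB/300GB per-indexer daily targets are the planning figures used in this guide, not Splunk-published limits:

```python
import math

# Daily raw volume in decimal GB (5 TB/day at ~100K EPS)
daily_raw_gb = 5 * 1000

# Per-indexer daily ingest targets used in this guide (assumptions, tune for your workload)
gb_per_indexer_with_es = 100
gb_per_indexer_without_es = 300

print(math.ceil(daily_raw_gb / gb_per_indexer_with_es))     # 50 indexers with ES
print(math.ceil(daily_raw_gb / gb_per_indexer_without_es))  # 17 indexers without ES
```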
Component Deep Dive: Search Head Cluster
Why 5 Search Heads?
Our environment supports 30-50 concurrent users running ES dashboards, threat hunting, and ad-hoc searches. Each search head can comfortably handle 12-15 concurrent searches before users experience latency.
Search Head Sizing Formula
Search Heads = CEILING(Concurrent Users ÷ Searches per Member) + 1 for failover
For 50 users at ~15 concurrent searches per member: CEILING(50 ÷ 15) + 1 = 5 search heads (the more conservative 12-per-member figure would call for 6).
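The same formula as a small Python helper; 12-15 searches per member is the observation above, not a hard Splunk limit:

```python
import math

def search_heads(concurrent_users: int, searches_per_member: int) -> int:
    """Search head count per the sizing formula above (+1 member for failover)."""
    return math.ceil(concurrent_users / searches_per_member) + 1

print(search_heads(50, 15))  # 5 - the cluster size deployed here
print(search_heads(50, 12))  # 6 - the more conservative estimate
```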
Search Head Cluster Benefits
- Load balancing: Users automatically distributed across healthy members
- High availability: Cluster survives loss of any member (requires minimum 3)
- Scheduled search distribution: 200+ scheduled searches distributed evenly
- Configuration replication: Deploy apps/configs once, replicated to all members
Hardware Specifications
These specs are from our actual production deployment. We tested multiple configurations and these provided the best price-performance ratio.
Indexer Hardware (50 identical servers)
| Component | Specification | Notes |
|---|---|---|
| CPU | 2x Intel Xeon Gold 6230 (20 cores, 40 threads each) | 80 threads total. ES data model acceleration is CPU-intensive. |
| RAM | 256GB DDR4 ECC | Supports large bucket caching. 128GB minimum, 256GB recommended. |
| Hot/Warm Storage | 8TB NVMe SSD (4x 2TB in RAID 10) | Provides 1500+ IOPS. Holds 7 days of searchable data. |
| Cold Storage | 16TB SAS HDD (8x 2TB in RAID 10) | Slower, cheaper storage for older data (days 8-90). |
| Network | Dual 10Gbps NICs (bonded) | High throughput essential for replication and searches. |
| OS | RHEL 8.6 / Ubuntu 20.04 LTS | Physical servers (not VMs) for maximum performance. |
Search Head Hardware (5 identical servers)
| Component | Specification | Notes |
|---|---|---|
| CPU | 2x Intel Xeon Gold 6226 (12 cores, 24 threads each) | 48 threads total. ES app is CPU-intensive on search heads too. |
| RAM | 128GB DDR4 ECC | ES requires 64GB minimum, 128GB for comfortable headroom. |
| Storage | 2TB NVMe SSD (RAID 1) | For OS, apps, and KV store. Fast I/O important for lookups. |
| Network | Dual 10Gbps NICs (bonded) | Retrieving results from 50 indexers requires bandwidth. |
Additional Infrastructure Components
Heavy Forwarders (10 servers)
Purpose: Parse and filter logs before indexing, reducing indexer load
Specs: 16 cores, 64GB RAM, 1TB SSD, 10Gbps network
Cluster Manager (1 server)
Purpose: Manages indexer cluster configuration and health
Specs: 8 cores, 32GB RAM, 500GB SSD
Deployer (1 server)
Purpose: Manages search head cluster app deployment
Specs: 8 cores, 32GB RAM, 500GB SSD
License Manager (1 server)
Purpose: Centralized license management
Specs: 4 cores, 16GB RAM, 250GB SSD
Storage Planning
Storage is the most complex aspect of Splunk sizing. Get this wrong and you'll either blow your budget or run out of space mid-quarter.
Storage Calculation Methodology
Daily raw volume: 5TB (100K EPS at 500-600 bytes/event)
Compression: ~50% (Splunk stores roughly 2.5TB/day on disk as rawdata + tsidx)
Retention: 90 days
Replication Factor: 3x
Formula:
Total Storage = (Daily Volume × Compression) × Retention × Replication Factor
Total Storage = (5TB × 0.5) × 90 × 3
Total Storage = 675TB
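The same calculation in Python, if you want to plug in your own inputs (decimal TB, matching the figures above):

```python
daily_raw_tb = 5            # raw ingest per day
compression = 0.5           # ~50% on-disk size (rawdata + tsidx)
retention_days = 90
replication_factor = 3

total_tb = daily_raw_tb * compression * retention_days * replication_factor
print(total_tb)  # 675.0 TB, before the 30-40% safety margin discussed later
```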
Storage Tiering Strategy
| Tier | Age | Storage Type | Capacity per Indexer | Total (50 indexers) |
|---|---|---|---|---|
| Hot/Warm | 0-7 days | NVMe SSD | 8TB | 400TB |
| Cold | 8-90 days | SAS HDD | 16TB | 800TB |
| Frozen (Archive) | 90+ days | NAS / S3 | N/A | Unlimited |
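For the frozen tier, Splunk deletes expired buckets unless you point it at an archive destination. A minimal indexes.conf sketch (the index name and archive path are illustrative):

```
# Archive frozen buckets instead of deleting them
[firewall]
# 90-day retention (90 x 86400 seconds)
frozenTimePeriodInSecs = 7776000
# Omit this setting to have Splunk simply delete buckets once they age out
coldToFrozenDir = /mnt/archive/frozen/firewall
```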
IOPS Requirements
This is where many deployments fail. Splunk is extremely IOPS-intensive during both indexing and searching.
| Storage Tier | Minimum IOPS | Recommended IOPS | Our Actual Performance |
|---|---|---|---|
| Hot/Warm (SSD) | 800 IOPS | 1200+ IOPS | 1500 IOPS (NVMe in RAID 10) |
| Cold (HDD) | 200 IOPS | 400+ IOPS | 450 IOPS (15K RPM SAS in RAID 10) |
Testing IOPS: Use fio or dd to verify before going live. We caught underperforming storage during testing that would have crippled production.
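A starting-point fio invocation along those lines; the target directory, block size, and read/write mix are illustrative rather than an official Splunk benchmark profile:

```
# Mixed random read/write test against the hot/warm volume (adjust path and sizes)
fio --name=splunk-hotwarm --directory=/opt/splunk/var/lib/splunk \
    --rw=randrw --rwmixread=60 --bs=8k --size=4G --numjobs=8 --iodepth=32 \
    --ioengine=libaio --direct=1 --runtime=120 --time_based --group_reporting
```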
Production Configuration Examples
The examples below are illustrative sketches based on the kinds of settings we ran in production. Use them as templates but always test in staging first.
indexes.conf - Indexer Configuration
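Rather than a full production file, here is a minimal illustrative sketch; the index name, volume paths, and size caps are assumptions chosen to match the tiering described earlier:

```
# indexes.conf (illustrative sketch)

[volume:hotwarm]
path = /splunk/hot
# Cap slightly below the 8TB NVMe tier
maxVolumeDataSizeMB = 7000000

[volume:cold]
path = /splunk/cold
# Cap slightly below the 16TB cold tier
maxVolumeDataSizeMB = 15000000

[firewall]
homePath   = volume:hotwarm/firewall/db
coldPath   = volume:cold/firewall/colddb
thawedPath = $SPLUNK_DB/firewall/thaweddb
maxDataSize = auto_high_volume
# 90-day retention
frozenTimePeriodInSecs = 7776000
# Let the cluster manager control replication
repFactor = auto
```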
server.conf - Cluster Manager Settings
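A minimal sketch of the corresponding [clustering] stanza (Splunk 9.x syntax; older releases use mode = master, and the key and label values are placeholders):

```
# server.conf on the cluster manager (illustrative sketch)

[clustering]
mode = manager
replication_factor = 3
search_factor = 2
# Seconds without a heartbeat before a peer is considered down
heartbeat_timeout = 60
pass4SymmKey = <redacted>
cluster_label = prod_idx_cluster
```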
Key Settings Explained:
- replication_factor = 3: Each bucket replicated to 3 indexers
- search_factor = 2: Two searchable copies maintained
- heartbeat_timeout = 60: A peer that hasn't checked in within 60 seconds is considered down
limits.conf - Performance Tuning
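A sketch of the settings discussed in the notes below; the [search] values shown are Splunk defaults included for reference, not our exact production tuning:

```
# limits.conf (illustrative sketch)

[thruput]
# 0 removes the artificial indexing throughput cap - hardware becomes the limit
maxKBps = 0

[search]
# Concurrent historical search ceiling is roughly:
#   max_searches_per_cpu * number_of_cpus + base_max_searches
max_searches_per_cpu = 1
base_max_searches = 6
```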
Tuning Notes:
- With 40 CPU cores per indexer, we allow up to 20 concurrent searches (50% of cores)
- Remaining cores handle indexing, replication, and data model acceleration
- Set maxKBps = 0 to remove artificial throttling - let hardware determine limits
Cost Breakdown
This is a substantial investment. Here's what we actually spent, including unexpected costs.
Hardware Costs
Annual Recurring Costs
Cost Optimization Strategies
- SmartStore: Move cold data to S3, reduce indexer storage by 60%, save ~$300K on hardware
- Data filtering: We reduced ingestion by 15% by filtering noisy, low-value logs at forwarders
- Multi-year licensing: Negotiated 3-year contract for 18% discount on Splunk licenses
- Refurbished hardware: Saved 30% on search heads and infrastructure servers (not indexers - buy those new)
Common Pitfalls & Solutions
These are real issues we encountered during deployment. Learn from our mistakes.
Pitfall #1: Disk I/O Bottlenecks
Symptom
Indexing queues building up, "Tailing Processor has paused indexing" warnings, searches taking 3-5x longer than expected.
Root Cause
We initially used standard SATA SSDs in RAID 5 configuration. IOPS were only 600-700, far below the 1200+ needed. RAID 5 write penalty made it worse.
Solution
- Switched to NVMe SSDs in RAID 10 configuration
- IOPS jumped to 1500+ and the indexing queues disappeared
- Search performance improved by 60%
Cost Impact
Additional $2,000 per indexer ($100K total) but absolutely worth it.
Pitfall #2: Search Head CPU Exhaustion
Symptom
Users complaining that dashboards take 2-3 minutes to load, the ES Notable Events dashboard timing out, CPU consistently >90%.
Root Cause
ES data model acceleration summaries running constantly. We initially had only 3 search heads with 24 cores each.
Solution
- Expanded cluster from 3 to 5 search heads
- Tuned data model acceleration schedule to off-peak hours
- Disabled some low-value accelerations
Result
CPU dropped to 50-60% during business hours, dashboard load times under 10 seconds.
Pitfall #3: Indexer Cluster Split-Brain
Symptom
After network maintenance, indexer cluster showed partial data loss, some indexers couldn't rejoin cluster.
Root Cause
A network partition caused some indexers to lose contact with the cluster manager. When connectivity was restored, generation IDs were mismatched.
Solution
- Implemented dedicated management network separate from data network
- Set aggressive heartbeat_timeout to detect failures faster
- Created a runbook for split-brain recovery (required taking some indexers offline)
Prevention
Always maintain network redundancy for cluster management traffic. Use bonded NICs with separate switches.
Pitfall #4: Unexpected Storage Consumption
Symptom
Storage filling up 40% faster than projected, running out of space at 65 days instead of 90.
Root Cause
- Data model acceleration summaries consuming 800GB per indexer
- Tsidx files larger than expected due to high cardinality fields
- Didn't account for bucket metadata overhead
Solution
- Added SmartStore to offload cold buckets to S3 ($0.023/GB/month)
- Reduced data model retention from 90 to 30 days
- Used tsidx reduction to shrink index files by 30%
Lesson
Always budget 30-40% more storage than theoretical calculations. Better to have unused capacity than run out.
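If you adopt the SmartStore and tsidx-reduction measures above, the relevant indexes.conf settings look roughly like this; bucket, endpoint, and index names are illustrative, and note that tsidx reduction applies to non-SmartStore indexes:

```
# indexes.conf additions (illustrative sketch)

[volume:remote_store]
storageType = remote
path = s3://example-smartstore-bucket/indexes
remote.s3.endpoint = https://s3.us-east-1.amazonaws.com

# SmartStore-backed index: warm/cold buckets are uploaded to S3
[firewall]
remotePath = volume:remote_store/$_index_name

# Non-SmartStore index shrinking tsidx files on buckets older than 7 days
[wineventlog]
enableTsidxReduction = true
timePeriodInSecBeforeTsidxReduction = 604800
```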
Implementation Checklist
Use this checklist to ensure you don't miss critical steps. Based on our deployment timeline of 12 weeks.
Phase 1: Planning & Procurement (Weeks 1-3)
- Calculate exact EPS based on current environment + 30% growth
- Size indexers, search heads, and storage using the calculator above
- Get executive approval for budget ($1M+ hardware + $700K/year licenses)
- Procure hardware (8-12 week lead time for 50+ servers)
- Order network equipment (switches, cables, transceivers)
- Purchase Splunk licenses (negotiate multi-year discount)
- Reserve datacenter space (50+ rack units + power/cooling)
Phase 2: Infrastructure Setup (Weeks 4-6)
- Rack and cable all servers
- Configure network (VLANs, management network, data network)
- Install OS on all nodes (use automation - Ansible, Terraform)
- Configure RAID arrays (RAID 10 for hot/warm, RAID 10 for cold)
- Test IOPS on all indexers (fio benchmark, verify >1200 IOPS)
- Configure NTP, DNS, monitoring agents
- Harden OS (disable unnecessary services, firewall rules)
Phase 3: Splunk Installation (Weeks 7-8)
- Install Splunk Enterprise on all nodes
- Configure cluster manager with RF=3, SF=2
- Join all 50 indexers to cluster (automate with scripts)
- Configure search head cluster (3-5 members)
- Set up deployer and push base configuration
- Configure license manager and install licenses
- Set up heavy forwarders with parsing configs
Phase 4: Testing & Tuning (Weeks 9-10)
- Load test with synthetic data at 150K EPS (150% of target)
- Monitor CPU, memory, IOPS during load test
- Test search performance (verify <10 second dashboards)
- Simulate indexer failure (verify cluster rebalances correctly)
- Test search head failover (verify users don't see disruption)
- Install and configure Enterprise Security app
- Tune data model acceleration schedules
- Create user accounts, roles, and permissions
Phase 5: Migration & Go-Live (Weeks 11-12)
- Start with 10% of data sources (pilot group)
- Monitor for 48 hours, verify no issues
- Gradually increase to 50% of sources
- Final migration of all sources to new cluster
- Decommission old environment (after 30-day parallel run)
- Document runbooks for common operations
- Train team on new architecture
- Set up monitoring and alerting for cluster health
Questions or Comments?
This guide is community-maintained. If you found it helpful, have corrections, or want to share your own 100K+ EPS deployment experience, please leave a comment below.