Splunk

Next time your beeper goes off, turn to Splunk.


Overview

Nagios gives you an alert that there is a problem. Now what? How will you figure out what's h0rked in your IT infrastructure today? How long will it take you to recover? Will you spend hours finding your way through log files and other IT data? Splunk is the perfect complement to Nagios. Nagios monitors your network for problems and Splunk helps you get to the root cause.

Splunk is search software that indexes any fast-moving IT data as it happens, making it possible to actually see inside the data center at runtime. From your Web browser, you can navigate logs, configuration files, message queues, JMX notifications, SNMP and database transactions from any system, application or device. System administrators, developers and support staff everywhere can now diagnose and resolve problems faster, resulting in shorter mean time to repair (MTTR), better service availability and reduced cost of incident response.


More Information

Find out more about the collaboration going on between Nagios and Splunk.

After your Nagios alerts, try Splunk Professional free for 30 days to get to the root cause of IT problems.

To learn more about the free Splunk Server and Splunk Professional, try our online demo or take a tour.


What Can Splunk Do For You?

Nagios monitors your network services and host resources and sends you an alert when something is wrong. Which is great. But now you have to get to the root of the problem. Splunk it.

  • System administrators can find the root cause of problems quickly and locate latent system issues before they cause downtime.
  • Developers can debug interactions among multiple tiers and components in the code-test cycle, the migration from development to production or during production escalations.
  • Help desk and support teams can investigate reported incidents and alerts right away without having to reproduce the problem or call in senior analysts or developers.


How Does It Do That?

Splunk uses powerful algorithms to automatically organize any type of IT data into events. It then classifies these events and discovers relationships between events of different kinds. Events are indexed by time, terms and relationships.
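To make the idea of indexing events by time and terms concrete, here is a minimal sketch in Python. It is not Splunk's actual algorithm, which this page does not describe; the log lines, the timestamp layout and every function name are assumptions made up for illustration, and relationship discovery is left out entirely.

```python
import re
from collections import defaultdict
from datetime import datetime

# Hypothetical sample lines; the timestamp layout is an assumption for this sketch.
SAMPLE_LINES = [
    "2006-05-01 14:02:11 web01 httpd: GET /index.html 200",
    "2006-05-01 14:02:12 db01 mysqld: slow query detected",
    "2006-05-01 14:03:40 web01 httpd: GET /search 500",
]


def parse_event(line):
    """Split a line into a (timestamp, text) pair."""
    stamp, text = line[:19], line[20:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), text


def build_index(lines):
    """Return (events, term_index, minute_index) for the given lines."""
    events = []
    term_index = defaultdict(set)    # term -> ids of events containing it
    minute_index = defaultdict(set)  # minute -> ids of events in that minute
    for event_id, line in enumerate(lines):
        when, text = parse_event(line)
        events.append((when, text))
        minute_index[when.replace(second=0)].add(event_id)
        for term in re.findall(r"[a-z0-9_./]+", text.lower()):
            term_index[term].add(event_id)
    return events, term_index, minute_index


def search(term_index, *terms):
    """Return the sorted ids of events that contain every given term."""
    sets = [term_index.get(term.lower(), set()) for term in terms]
    return sorted(set.intersection(*sets)) if sets else []


events, term_index, minute_index = build_index(SAMPLE_LINES)
for event_id in search(term_index, "web01", "500"):
    print(events[event_id])
```

Looking up a term or a minute is then a dictionary access rather than a scan of every log file, which is the general reason an index makes searching large volumes of IT data quick.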


Want Some Examples?

DoS Attack
One afternoon, Sean, a sysadmin, was paged by Nagios with a puzzling cascade of failures. He went from system to system checking as much as he could. He finally determined the company was under a denial-of-service (DoS) attack.

"It was pretty interesting to see the pattern of failures on Nagios, but I wanted to know what led up to it. Things fell over, but how?"

Once he had dealt with the threat at the router, he splunked the logs to see what had happened. Those logs were on a 5-minute delay because the company was syncing them. The attack was pretty obvious to spot, but what he found useful was being able to easily determine how many events had occurred, where they came from and in what time interval they happened.

"It gave me some idea of the resources used to hit us and what our system's response was. Add some load testing and I can get a pretty good idea of our limits, not just with simulated traffic but with a real world event."
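The three questions in this story (how many events, from where, and in what time interval) can be sketched in a few lines of Python. This is only an illustration of the kind of summary described, not how Splunk computes it; the event data, addresses and function name below are invented for the example.

```python
from collections import Counter
from datetime import datetime

# Hypothetical (timestamp, source address) pairs standing in for parsed log events.
EVENTS = [
    ("2006-05-01 14:00:05", "192.0.2.10"),
    ("2006-05-01 14:00:40", "192.0.2.10"),
    ("2006-05-01 14:01:15", "198.51.100.7"),
    ("2006-05-01 14:06:30", "192.0.2.10"),
]


def summarize(events, bucket_minutes=5):
    """Count events in total, per source, and per time bucket.

    bucket_minutes is assumed to divide 60 evenly.
    """
    by_source = Counter()
    by_bucket = Counter()
    for stamp, source in events:
        when = datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S")
        # Round the timestamp down to the start of its bucket.
        start = when.replace(minute=when.minute - when.minute % bucket_minutes, second=0)
        by_source[source] += 1
        by_bucket[start] += 1
    return len(events), by_source, by_bucket


total, by_source, by_bucket = summarize(EVENTS)
print(total, "events")
print(by_source.most_common())
print(sorted(by_bucket.items()))
```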


Bad Query
Rob got paged by Nagios the other day because his company's Web page wasn't loading right. Access to the site was timing out. He was also getting paged for a MySQL failure.

He checked the Web server and there was an unusually large number of httpd processes.

"I thought, hey, were we Slashdotted? I decided to splunk the Web logs, but that showed a normal number of http GETs. Didn't look like Slashdot or an attack."

Splunking his MySQL slow query log for that time showed a much larger than normal number of events. Looking at these individually, Rob saw a series of queries taking a huge amount of time to complete. Now he needed to figure out which query was responsible.

"I showed all this to our DBA and it turns out there was an old query that takes forever in our new release. It should have been removed, but it slipped past QA. Eventually it was activated on almost every Web hit, which took down the site."
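As a rough sketch of how one might narrow down "which query" from a slow query log, the Python below groups entries that differ only in their literal values and ranks the groups by total time. It assumes the "# Query_time:" header line that MySQL writes before each logged statement and reads only the first line of each statement; the sample excerpt, table names and function names are invented for illustration.

```python
import re
from collections import defaultdict

# Hypothetical excerpt in the style of a MySQL slow query log.
SAMPLE_LOG = """\
# Query_time: 42  Lock_time: 0  Rows_sent: 1  Rows_examined: 900000
SELECT * FROM orders WHERE customer_id = 17;
# Query_time: 38  Lock_time: 0  Rows_sent: 1  Rows_examined: 900000
SELECT * FROM orders WHERE customer_id = 99;
# Query_time: 3  Lock_time: 0  Rows_sent: 5  Rows_examined: 200
SELECT name FROM users WHERE id = 4;
"""


def normalize(sql):
    """Replace literals with ? so queries differing only in values group together."""
    sql = re.sub(r"'[^']*'", "?", sql)   # quoted strings
    sql = re.sub(r"\b\d+\b", "?", sql)   # bare numbers
    return " ".join(sql.split())


def rank_slow_queries(log_text):
    """Return (query shape, count, total seconds) tuples, slowest total first."""
    totals = defaultdict(lambda: [0, 0.0])  # shape -> [count, total seconds]
    query_time = None
    for line in log_text.splitlines():
        match = re.match(r"# Query_time: ([\d.]+)", line)
        if match:
            query_time = float(match.group(1))
        elif query_time is not None and line.strip() and not line.startswith("#"):
            entry = totals[normalize(line)]
            entry[0] += 1
            entry[1] += query_time
            query_time = None  # only the first statement line is considered
    rows = [(shape, count, seconds) for shape, (count, seconds) in totals.items()]
    return sorted(rows, key=lambda row: row[2], reverse=True)


ranking = rank_slow_queries(SAMPLE_LOG)
for shape, count, seconds in ranking:
    print(f"{seconds:8.1f}s  {count:3d}x  {shape}")
```

A query shape that dominates the total, like the old query in Rob's story, ends up at the top of the list.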


Overtemp
Rhishi, an IT manager, got paged by Nagios a few weeks ago at about midnight. It was bad news--the server room temp had risen past the alarm threshold. He dragged himself out of bed, shut down some unnecessary servers and headed to the shop.

"When I got there, the place was just beginning to cool off. I had a graph of the temp, but why did it happen? The A/C unit seemed to be fine. I figured it must have been the servers."

Sure enough, he checked his Cacti graphs and most of the servers got busy right about that time. That's okay for short periods, but they'd planned their A/C for boxes that aren't all grinding CPU at the same time. Even in a colocation facility, if everyone started grinding CPU at the same time, the A/C would probably fail.

"Splunking the logs, I saw that most of the boxes, which are development boxes, had a new cronjob that starts at the same time. The nightly build. It's done on many different architectures, takes about an hour and really works out the hardware. Unfortunately, all of them were scheduled to run at the same time, stressing the A/C."

