Splunk Automation Guide
Reading Time: 8 minutes
Prototype Build: 1 day (installs, debugging, settings, configuration)
Download+Customize: 4 minutes
Collective Time Saved: days
Splunk is a powerful data ingestion, manipulation, and analytics platform that has grown over the years to form a whole suite of products. Today, we’ll be looking specifically at Splunk Enterprise, the original and still much loved core. We will explore and then automate search operations for a simple Threat Hunting example. We will then turn our learnings into a fully fledged self-service internal tool for use by colleagues (or perhaps other teams in your organization).
Splunk Automation and APIs
There are many reasons to automate Splunk’s operations. It is a rich and versatile platform that, once fed with multiple data sources, can help you surface and identify valuable insights and trigger actions. Splunk even has its own Search Processing Language (SPL) and multiple training and certification tracks. Its APIs are rich, mature, and first class! The Splunk Cloud trial has some API limitations and restrictions so we’ll be using Splunk Enterprise running on an Amazon AWS AMI instance. We recommend you do too if you want to explore the API functionality quickly and conveniently, though all commercial or enterprise versions should have the API enabled.
Before we dive into the API, some basic nomenclature and concepts should be understood around Splunk, mainly that of forwarders and receivers. Splunk can get very complicated very quickly if you do larger deployments or run clusters, so we’re just going to be running a single forwarder (Windows host + event logs) and a single receiver (our Splunk Enterprise host). Now, let’s talk APIs!
When diving into an API, the first concerns tend to be:
- Where and what sort of documentation does the API have?
- What specific endpoint(s) can I access?
- How do I authenticate?
- Are there any limitations (including rate limits) or ‘gotchas’?
- What tooling can I use to quickly prototype and test?
Splunk API Documentation (or related)
- Main Splunk documentation hub here (Splunk Enterprise manual here).
- REST API user manual here / reference manual here.
- REST API (Search) here (Search Tutorial/Tips here).
- Splunkbase (application marketplace) here.
- Splunk community forum here / developer site here.
Splunk API Endpoints
Depending on your type of install, be it Splunk Cloud or Splunk Enterprise, your host (FQDN) will differ. Endpoints, however, are organized into resource groups, of which “search” is one. The REST API is exposed on TCP port 8089 and responds with XML unless you ask for another output mode in the query or action.
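To make the endpoint layout a little more concrete, here is a minimal Python sketch that builds a REST URL on port 8089 and asks for JSON instead of the default XML. The hostname `splunk.example.com` and the helper name `splunk_url` are placeholders of ours, not anything Splunk defines.

```python
from urllib.parse import urlencode

MGMT_PORT = 8089  # TCP port on which the REST API is exposed


def splunk_url(host, path, output_mode="json", **params):
    """Build a URL for a Splunk REST endpoint (hypothetical helper)."""
    query = urlencode({"output_mode": output_mode, **params})
    return f"https://{host}:{MGMT_PORT}/{path.strip('/')}?{query}"


# Placeholder host; services/search/jobs sits in the "search" resource group.
print(splunk_url("splunk.example.com", "services/search/jobs"))
```

Keeping `output_mode` in one helper means every later call gets JSON back without repeating the parameter.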
Splunk API Authentication
Basic HTTP authentication (RFC 2617), session-based, and token authentication (Splunk v7.3+) types are available. In this guide, we will be defaulting to simple Basic HTTP authentication.
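For readers scripting outside of cURL, the following sketch shows how a Basic authentication header can be assembled by hand in Python. The username and password are placeholders, and `basic_auth_header` is a helper name we made up for illustration.

```python
import base64


def basic_auth_header(username, password):
    """Return the Authorization header for HTTP Basic authentication."""
    raw = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(raw).decode("ascii")
    return {"Authorization": f"Basic {token}"}


# Placeholder credentials for illustration only.
headers = basic_auth_header("admin", "changeme")
print(headers)
```

Basic authentication only encodes the credentials, it does not encrypt them, which is one more reason to keep all API calls on HTTPS.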
Splunk API Limitations
Splunk Cloud is extremely limited in terms of endpoint availability. It only exposes the “Search” endpoint, and only if you can get it enabled on your instance by manually calling support! We recommend not being put off by the term “enterprise” in Splunk Enterprise and going ahead with the AWS AMI (or doing a full install yourself on your own host). This will actually save you time and enable you to explore the API faster. Additional Splunk restrictions and limitations with the API are listed here.
cURL on the CLI is the fastest way to get going, although you could just as easily use Postman, PowerShell, or another scripting tool (or language) of your choice. Later we will also look at using Tines for a more holistic workflow automation approach.
Quick Setup
As mentioned briefly earlier, we are using an AWS Splunk AMI on the AWS free-tier to get us going with a clean and fully functional Splunk install. We’re also running a Windows 2012 R2 Datacenter instance in GCP (Google Cloud Platform) as our forwarder, which will send Splunk its Windows Event log data.
Forwarder (Windows 2012 R2)
The Windows host requires an install of the Splunk Universal Forwarder. Only the receiving Splunk Enterprise server hostname (or IP) and port TCP 9997 need to be configured during the install (for which you are prompted). You do not need to configure the deployment server option. Ensure the Windows firewall permits this port outbound to your receiver.
Receiver (Splunk Enterprise AWS AMI)
We give our AWS instance an Elastic IP, so it has a static IP throughout our usage. Also, rather than add the hostname to DNS, you could just edit your hosts file for testing purposes, or of course use the IP in the URL directly. After accessing the Splunk web interface via HTTP (not HTTPS, as it comes with an untrusted self-signed certificate) on TCP port 8000, we then go to “Apps/Find More Apps” and “Browse More Apps”. Use the keyword “Windows” in the search box and then “Splunk” for the support type tickbox to find and install the “Splunk Add-on For Microsoft Windows”. After a brief service restart, Splunk will now understand the Windows event and logging formats and will also use new data inputs.
Don’t forget to check your Splunk host is listening on TCP 9997 as a receiver. You can check this in “Settings/DATA/Forwarding and Receiving” under “Receive Data/Configure receiving” which should have TCP 9997 set to “Status” enabled. You can also SSH to the host using the ec2-user and check that a netstat -nlt shows the port in a listening state. The use of an AMI means the configured SecurityGroups should permit all the required traffic.
Note: All our API actions will be via HTTPS, but we will tell our action to ignore SSL warnings due to the self-signed certificate. For any commercial or production use, you would indeed generate and install a valid certificate and follow best current practice for securing and hardening your instance.
Starting Simple with cURL
By using cURL on the command line with the -d switch (to add parameters to the call) we automatically issue a POST request. The -s means silent as we don’t want any superfluous output. The -k means to ignore the certificate validation (as Splunk is using a self-signed certificate).
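If you prefer a scripting language to cURL, the -k behaviour has a rough equivalent in Python's standard library. The sketch below builds a TLS context that skips certificate checks; the function name is ours, and this is only sensible against a test instance with a self-signed certificate.

```python
import ssl


def insecure_tls_context():
    """TLS context that skips certificate checks, similar in effect to curl -k."""
    context = ssl.create_default_context()
    context.check_hostname = False  # must be turned off before relaxing verify_mode
    context.verify_mode = ssl.CERT_NONE
    return context


context = insecure_tls_context()
# Hand it over when sending, e.g. urllib.request.urlopen(request, context=context)
print(context.verify_mode)
```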
Let’s start a search job with a request for any and all events across all hosts. The ‘|’ pipe to jq (an additional handy multipurpose tool for JSON) gives us color and pretty printed output too.
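As a sketch of the same call in Python (not the exact command we ran), the snippet below prepares a POST to the search jobs endpoint without sending it. The host and the authorization value are placeholders, and `create_search_job_request` is a hypothetical helper. Note that the search string itself starts with the `search` command.

```python
from urllib.parse import urlencode
from urllib.request import Request


def create_search_job_request(host, spl, auth_header):
    """Prepare (but do not send) a POST that starts a Splunk search job."""
    body = urlencode({"search": spl, "output_mode": "json"}).encode("utf-8")
    return Request(
        f"https://{host}:8089/services/search/jobs",
        data=body,  # form-encoded parameters, like curl -d
        headers=auth_header,
        method="POST",
    )


# Placeholder host and header; "search host=*" asks for events from all hosts.
request = create_search_job_request(
    "splunk.example.com", "search host=*", {"Authorization": "Basic <token>"}
)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

Sending it would be a matter of passing `request` to `urllib.request.urlopen`, together with a TLS context that tolerates the self-signed certificate.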
We then get the below response which contains our job search ID.
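Pulling the search ID out of that response is a one-liner once it is JSON. The sketch below assumes the call was made with output_mode=json and that the body is a small object carrying a `sid` field; the sample body is trimmed down for illustration.

```python
import json


def extract_sid(response_body):
    """Return the search job ID from a job-creation response (JSON assumed)."""
    return json.loads(response_body)["sid"]


# Illustrative body using the job ID from this walkthrough.
sample_response = '{"sid": "1602150605.166"}'
print(extract_sid(sample_response))
```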
Let’s see if the job “1602150605.166” has surfaced any events or results by asking the job endpoint directly.
Indeed it has. The eventCount below is 381. There’s lots more information in the Splunk response, but for now we know it’s definitely working.
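To pick out just the interesting fields programmatically, something like the following would do. It assumes the JSON form of the job response, where the job's properties sit under `entry[0].content`; the sample here is heavily trimmed and only illustrative.

```python
import json


def job_summary(response_body):
    """Extract the handful of job fields we care about from a job status response."""
    content = json.loads(response_body)["entry"][0]["content"]
    wanted = ("dispatchState", "eventCount", "resultCount")
    return {key: content.get(key) for key in wanted}


# Trimmed-down, illustrative response; a real one carries many more fields.
sample = json.dumps(
    {"entry": [{"content": {"dispatchState": "DONE", "eventCount": 381}}]}
)
print(job_summary(sample))
```

Fields missing from the response come back as `None` rather than raising an error, which keeps the helper usable against partial responses.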
Rather than pull back all the event logs right now, let’s get more specific with our request, and then look into the results. With the below POST operation, we also need to ensure we’ve set some search terms, such as the host to search (previously we used host=*), the EventCode we’re interested in, and the earliest time (or how far back to search relative to the current time on the Splunk host). The Splunk SPL (Search Processing Language) has a rich syntax we can use.
The specific Windows EventCode we’re looking for is 4738, which indicates “A user account was changed”. It’s most likely not a good thing if lots of unidentified new accounts are popping up on servers or guest accounts are being enabled across a fleet! As before, we receive our search job ID.
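A small sketch of how such a search string might be assembled follows. The helper name `build_spl` and the 24-hour default window are assumptions of ours for illustration; adjust the `earliest` time modifier to suit your hunt.

```python
def build_spl(host, event_code, earliest="-24h"):
    """Assemble an SPL search for one Windows EventCode (hypothetical helper)."""
    return f'search host="{host}" EventCode={event_code} earliest={earliest}'


# 4738 is "A user account was changed"; "*" searches every host.
print(build_spl("*", 4738))
```

The resulting string goes into the `search` parameter of the POST, exactly as the host=* search did earlier.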
We do not get the results immediately as we’ve actually kicked off a background job. We don’t know how long the search will take, so we need to subsequently poll that SID endpoint for its state. Let’s ask about the job state:
What we’re looking for is the state (more specifically the “dispatchState”) of “DONE”, but as per the API reference there can be many other states to check for, such as QUEUED, PARSING, RUNNING, PAUSED, FINALIZING, and FAILED.
The response has many useful fields, such as information about the Splunk host and what endpoints to get further information from, but we're really only interested in what the eventCount or resultCount are (have a go yourself to see what the whole response entails!).
Note: "Events" are considered non-transformed raw data as supplied by the original forwarders, whereas "Results" are considered processed (or manipulated) events and have a maximum limit per API call.
Our search is DONE, but if it had been longer running or had failed, we would need some programming logic to poll the endpoint and respond to the differing states. For now let’s take a quick look at the results, or should I say events?
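The polling logic mentioned here could look roughly like the sketch below. It is deliberately decoupled from the network: `fetch_state` is any callable of yours that returns the job's current dispatchState, and the interval and attempt limits are arbitrary defaults we picked, not Splunk recommendations.

```python
import time


def wait_for_job(fetch_state, interval=2.0, max_attempts=30, sleep=time.sleep):
    """Poll fetch_state() until the job is DONE, fails, or we give up.

    fetch_state is any callable returning the current dispatchState string,
    e.g. one that GETs /services/search/jobs/<sid> and reads the field.
    """
    for _ in range(max_attempts):
        state = fetch_state()
        if state == "DONE":
            return state
        if state == "FAILED":
            raise RuntimeError("Splunk search job failed")
        sleep(interval)  # still QUEUED, PARSING, RUNNING, and so on
    raise TimeoutError(f"job not DONE after {max_attempts} polls")


# Offline demonstration with canned states instead of a live Splunk host.
canned = iter(["PARSING", "RUNNING", "DONE"])
print(wait_for_job(lambda: next(canned), sleep=lambda seconds: None))
```

Injecting `sleep` keeps the loop easy to test, and the hard cap on attempts stops a stuck job from hanging a script forever.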
Tip: Before looking at the “results”, and because we are on the CLI, let’s tell Splunk to save this job so we have more time to play around with it (rather than have it quickly expire and have to start again from scratch). To do this, we issue a request to the control endpoint of the specific job. Programmatic access would not necessarily need this step, but because we are issuing manual commands and taking our time to explore, saving the job avoids future annoyances. So, let’s issue the action to ‘save’ it:
Splunk will now give this job more time to live (604800 seconds is 7 days):
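For reference, a Python sketch of the save call is shown below, again prepared but not sent. The host and header are placeholders, the job ID is simply the one from earlier in this walkthrough, and `save_job_request` is a hypothetical helper. The last line double-checks the TTL arithmetic.

```python
from urllib.parse import urlencode
from urllib.request import Request


def save_job_request(host, sid, auth_header):
    """Prepare (but do not send) the POST that asks Splunk to save a job."""
    body = urlencode({"action": "save", "output_mode": "json"}).encode("utf-8")
    return Request(
        f"https://{host}:8089/services/search/jobs/{sid}/control",
        data=body,
        headers=auth_header,
        method="POST",
    )


request = save_job_request(
    "splunk.example.com", "1602150605.166", {"Authorization": "Basic <token>"}
)
print(request.full_url)

SAVED_TTL_SECONDS = 604800
print(SAVED_TTL_SECONDS / (24 * 60 * 60))  # time to live expressed in days
```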
OK, so far so good. Let’s issue a GET to the /services/search/jobs/<sid>/events endpoint, where we should get our “results”, for which we know there is definitely at least one!
Below we have found an instance of a changed user account, specifically a Guest account.
For clarity, the Message portion formatted cleanly reads:
But what if we could simply automate these steps (and include polling and error checking logic)? Perhaps we could add in some more useful Windows Event Codes to check for evidence of miscreants and wider foul play? Maybe we could provide a simple form to produce a canned report that would also email back to the requester with a CSV file containing the results?
Well let’s give it a try with a simple visual workflow builder...
Further and Faster with Tines
Here we start with a visual storyboard by dragging and dropping templated actions into our workflow story. We can quickly emulate our previous steps in Tines. We will, however, go further by making them more robust, modular, and useful with a few additional tweaks.
Our initial story flow works but unfortunately we were so fast (with each successive step) that the Splunk job status came back with a job dispatchState of “PARSING”. This means our job was not yet complete and our subsequent search results call returned empty.
Let’s wrap some simple defensive logic around our HTTP Request actions and also add in some error checking (and polling), modularity, and even an email capability.
We construct a simple story that also includes a ‘polling’ retry loop (for the search status), one that also handles any errors or problem conditions that are encountered. This logic can grow and evolve over time. It can also take advantage of a huge range of vendor or other platform endpoints to increase the automation’s value, i.e. not just Splunk functionality but a whole range of templated integrations (or you can make your own)!
Once our job completes, our workflow will send an email with an attached text file containing our CSV based results. This CSV file is not from the output_mode of type CSV in Splunk, but one we’ve simply reconstructed with the specific and pertinent information we need (or want to share) from the Splunk search.
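Outside of Tines, the same kind of reconstruction can be sketched in a few lines of Python: keep only the fields worth sharing and write them out as CSV. The field names and the sample event below are illustrative placeholders, not the output of our actual search.

```python
import csv
import io


def events_to_csv(events, fields):
    """Rebuild a small CSV from Splunk events, keeping only the chosen fields."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, lineterminator="\n")
    writer.writeheader()
    for event in events:
        # Missing fields become empty cells; unwanted fields are dropped.
        writer.writerow({field: event.get(field, "") for field in fields})
    return buffer.getvalue()


# Illustrative event; a real one would come from the events endpoint.
sample_events = [{"host": "example-host", "EventCode": "4738", "_raw": "..."}]
print(events_to_csv(sample_events, ["host", "EventCode"]))
```

Choosing the columns explicitly keeps bulky fields such as the raw event text out of the report unless you ask for them.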
Modularity and Reusability
With the addition of the Webhook action at the start, and an Event Transform action at the bottom, we can now also use our story in a more modular fashion. We can fully abstract it behind a single “Send To Story” action that can be used in other stories.
Behind the Curtain
If we were to take a look under the hood at any action, but more specifically the “Create Search in Splunk” action, it looks very close to our cURL example but is all JSON. It takes its search parameters from the initial Webhook action. The HTTP Request action also has additional error checking for unsuccessful HTTP status codes “[0,400-499,500-599]” and will alert us to any it encounters.
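The status check itself is simple enough to express in a couple of lines. The Python sketch below mirrors the “[0,400-499,500-599]” ranges; treating 0 as “no response received” is our reading of that list, and the function name is made up for illustration.

```python
def is_error_status(status):
    """True for the codes flagged as errors: 0, 400-499, and 500-599."""
    return status == 0 or 400 <= status <= 599


for code in (0, 200, 302, 404, 503):
    print(code, is_error_status(code))
```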
And now, for the pièce de résistance, let’s construct a simple front end form to top off our workflow so others can simply use it too. This means we will be able to easily provide canned queries that result in automations for any designated team’s use. This is rapid internal tooling made easy. We use an easy WYSIWYG (What You See is What You Get) process in Tines to construct the form which auto-sends to the Webhook action.
Here we are borrowing some very useful Event Codes from the excellent Splunk Blog namely an entry called “Peeping through Windows (Logs)”. The form above, once clicked, sends the data to our Webhook action below on the visual storyboard (the top and first action called “Receive Search Query”).
The ‘self-serve’ form, as an entry point to a workflow, means that different groups can now take advantage of automations that you might care to build and share. You could offer ‘self-serve’ to the non-SMEs in your team, to adjacent teams or departments, or even rapidly offer entirely different functionality to customers.
You can now leverage automation, without software development expertise, and choose who to expose it to and what specific utility you want to share. By empowering those around you, everyone gets to go further, faster.
*Please note we recently updated our terminology. Our "agents" are now known as "actions," but some visuals might not reflect this.*