Friday, November 10, 2017

Detecting Data Feed Issues with Splunk

by Tony Lee

As a Splunk admin, you don’t always control the devices that generate your data. As a result, you may only have control of the data once it reaches Splunk. But what happens when that data stops being sent to Splunk? How long does it take anyone to notice and how much data is lost in the meantime?

We have seen many customers struggle with monitoring and detecting data feed issues, so we figured we would share some of the challenges, along with a few possible methods for detecting and alerting on them.

Challenges

Before we discuss the solution, we want to highlight a few challenges to consider when trying to detect data feed issues:
1) This requires searching over a massive amount of data, so searches in high-volume environments may take a while to return.  We have you covered.
2) A complete loss of traffic may not be the only condition worth alerting on; a partial drop in traffic may be enough to warrant an alert.  We still have you covered.
3) There may be legitimate reductions in data (weekends, for example) which may produce false alarms, so the reduction percentage may need to be adjusted.  Yes, we still have you covered.

Constructing a solution

Given these challenges, we wanted to walk you through the solution we developed (skip straight to Step 4 for the final solution if you are short on time). This solution can be adapted to monitor indexes, sources, or sourcetypes, depending on what makes the most sense for you. If each of your data sources goes into its own index, then monitoring by index makes the most sense. If multiple data feeds share indexes but are distinguished by different sources or sourcetypes, then it may make more sense to monitor by source or sourcetype. To make that change, just change every instance of “index” (except for the first index=*) to “sourcetype” below.  Our example syntax below shows index monitoring, but the screenshots show sourcetype monitoring--this is very flexible.

The first challenge to consider in our searches is the massive amount of data we need to search.  We could use traditional searches such as index=*, but those searches would take far too long to finish, even in smaller environments.  For this reason, we use the tstats command.  In one fairly large environment, it was able to search through 3,663,760,230 events from two days’ worth of traffic in just 28.526 seconds.
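As a point of reference, that kind of raw event count can be reproduced with a single tstats call over the same two-day window (a quick sketch, with no grouping yet):

| tstats count where earliest=-2d@d latest=-0d@d index=*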

The first solution we arrived at was the following:

Step 1)  View data sources and traffic:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index


Figure 1:  Viewing your traffic

Step 2)  Transpose the data:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday


Figure 2:  Transposing the data to get the columns where we need them.

Step 3)  Alert Trigger for dead data source (Yesterday=0):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | where Yesterday=0


Figure 3:  Detecting a complete loss in traffic.  May not be the best solution.

The problem with this solution is that it would not detect partial losses of traffic.  Even if only one event was sent, you would not receive an alert.  Thus we changed the search to detect a percentage of drop-off instead.


Final solution:  Alert for a percentage of drop-off (the example below alerts on a reduction of 25% or greater):

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by index, _time span=1d | timechart useother=false limit=0 span=1d count by index |  eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=((Yesterday/TwoDaysAgo)*100) | where PercentageDiff<75

Figure 4:  Final solution to detect a percentage of decline in traffic

Caveats:

The solution above should get most people to where they need to be.  However, depending on your environment, you may need to make some adjustments, such as the percentage of traffic reduction; that is a simple change to the 75 in the search above.  We have also included some additional caveats below that we have encountered:
1) There may be legitimate indexes with low event counts or naturally occurring days with zero events.  Use “index!=<name>” after the index=* in the | tstats command to ignore those indexes (see the first example below).
2) Reminder:  maybe you send multiple data feeds into a single index, but instead separate them out by sourcetype.  No problem, just change the searches above to use sourcetype instead of index (see the second example below).
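For reference, here are two lightly adapted versions of the final search.  The first ignores a hypothetical low-volume index named lowvolume (substitute your own index name), and the second monitors by sourcetype instead of index:

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* index!=lowvolume by index, _time span=1d | timechart useother=false limit=0 span=1d count by index | eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=((Yesterday/TwoDaysAgo)*100) | where PercentageDiff<75

| tstats prestats=t count where earliest=-2d@d latest=-0d@d index=* by sourcetype, _time span=1d | timechart useother=false limit=0 span=1d count by sourcetype | eval _time=strftime(_time,"%Y-%m-%d") | transpose | rename column AS DataSource, "row 1" AS TwoDaysAgo, "row 2" AS Yesterday | eval PercentageDiff=((Yesterday/TwoDaysAgo)*100) | where PercentageDiff<75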

Conclusion

The final step is to click the “Save As” button and select “Alert”.  It could be scheduled to run daily and trigger when the number of results is greater than 0.  There may be a better way to monitor for data feed loss and we would love to hear it!  There is most likely a way to use the _internal logs since Splunk logs information about itself.  😉  If you have that solution, please feel free to share it in the comments section.  As you know, with Splunk, there is always more than one way to solve a problem.
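For the curious, one possible (untested) starting point along those lines uses Splunk’s own metrics.log, which records per-index throughput:

index=_internal source=*metrics.log group=per_index_thruput | timechart span=1d sum(ev) by series

Treat this as a sketch only; it would still need the same day-over-day comparison logic shown above before it could drive an alert.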

Sunday, October 15, 2017

Spelunking your Splunk – Part I

By Tony Lee

Introduction

Have you ever inherited a Splunk instance that you did not build?  If so, you probably have no idea what data sources are being sent into Splunk.  You probably don’t know much about where the data is being stored.  And you certainly do not know which hosts send the highest volume of data within the environment.

As consultants, this is the reality for nearly every engagement we encounter:  we did not build the environment, and documentation is sparse or inaccurate if we are lucky enough to have it at all.  So, what do we do?  We could run some fairly complex queries to figure this out, but many of those queries are not efficient enough to search over vast amounts of data or long periods of time, even in highly optimized environments.  All is not lost though; we have some tricks (and a handy dashboard) that we would like to share.

Note:  Maybe you did build the environment, but you need a sanity check to make sure you don’t have any misconfigured or runaway hosts.  You will also find value here.

tstats to the rescue!

If you have not discovered or used the tstats command, we recommend that you become familiar with it, even if only at a very high level.  In a nutshell, tstats can perform statistical queries on indexed fields very, very quickly.  These indexed fields are, by default, index, source, sourcetype, and host.  It just so happens that these are exactly the fields we need to understand the environment.  Best of all, even in an underpowered environment or one with lots of data ingested per day, these commands will still outperform your typical searches, even over long periods of time.  OK, time to answer some questions!
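For instance, all four indexed fields can be summarized in one shot; a quick sketch (best kept to a short time range, since host can be a high-cardinality field):

| tstats count where index=* by index, sourcetype, source, host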

Common questions

These are common questions we ask during consulting engagements, and this is how we get answers FAST.  Most of the time, 7 days’ worth of data is enough to give us a good understanding of the environment and weed out anomalies.
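If you prefer to bound these searches inline rather than relying on the time range picker, tstats also accepts earliest and latest in its where clause, for example:

| tstats count where earliest=-7d@d latest=now index=* by index, _time span=1d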

How many events are we ingesting per day?
| tstats count where index=* by _time

Figure 1:  Events per day


What are my most active indexes (events per day)?
| tstats prestats=t count where index=* by index, _time span=1d | timechart span=1d count by index

Figure 2:  Most active indexes


What are my most active sourcetypes (events per day)?
| tstats prestats=t count where index=* by sourcetype, _time span=1d | timechart span=1d count by sourcetype

Figure 3:  Most active sourcetypes


What are my most active sources (events per day)?
| tstats prestats=t count where index=* by source, _time span=1d | timechart span=1d count by source

Figure 4:  Most active sources


What is the noisiest host (events per day)?
| tstats prestats=t count where index=* by host, _time span=1d | timechart span=1d count by host

Figure 5:  Most active hosts


Dashboard Code

To make things even easier for you, try out the dashboard below (code at the bottom of this post), which combines the searches provided above and, as a bonus, adds filters to specify the index and time range.

Figure 6:  Data Explorer dashboard

Conclusion

Splunk is a very powerful search platform but it can grow to be a complicated beast--especially over time.  Feel free to use the searches and dashboard provided to regain control and really understand your environment.  This will allow you to trim the waste and regain efficiency.  Happy Splunking.


Dashboard XML code is below:


<form>
  <label>Data Explorer</label>
  <fieldset submitButton="true" autoRun="true">
    <input type="time" token="time">
      <label>Time Range Selector</label>
      <default>
        <earliest>-7d@h</earliest>
        <latest>now</latest>
      </default>
    </input>
    <input type="text" token="index">
      <label>Index</label>
      <default>*</default>
      <initialValue>*</initialValue>
    </input>
  </fieldset>
  <row>
    <panel>
      <chart>
        <title>Most Active Indexes</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by index, _time span=1d | timechart span=1d count by index</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Most Active Sourcetypes</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by sourcetype, _time span=1d | timechart span=1d count by sourcetype</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Most Active Sources</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by source, _time span=1d | timechart span=1d count by source</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
  <row>
    <panel>
      <chart>
        <title>Most Active Hosts</title>
        <search>
          <query>| tstats prestats=t count where index=$index$ by host, _time span=1d | timechart span=1d count by host</query>
          <earliest>$time.earliest$</earliest>
          <latest>$time.latest$</latest>
          <sampleRatio>1</sampleRatio>
        </search>
        <option name="charting.chart">column</option>
        <option name="charting.drilldown">none</option>
      </chart>
    </panel>
  </row>
</form>



Wednesday, September 13, 2017

Splunk Technology Add-on (TA) Creation Script

By Tony Lee


Introduction

If you develop a Splunk application, at some point you may find yourself needing a Technology Add-on (TA) to accompany the app. Essentially, the TA reuses most of the app's files, except for the user interface (UI/views). TAs are typically installed on indexers and heavy forwarders to process incoming data. Splunk briefly covers the difference between an app and an add-on in the link below:

https://docs.splunk.com/Documentation/Splunk/6.6.3/Admin/Whatsanapp

Maintaining two codebases can be time-consuming though. Instead, it is possible to develop one application and extract the necessary components to build a TA. There may be other solutions, such as the Splunk Add-on Builder (https://splunkbase.splunk.com/app/2962/), but I found the script below to be one of the easiest methods.

Approach

This could be written in any language; however, my development environment is Linux-based, so the quickest and easiest solution was to write the script in bash. Feel free to translate it to another language if needed.

Usage

Usage is simple.  Just supply the name of the application, and the script will create the TA from the existing app.

The app should be located here (if not, change the APP_HOME variable in the script):

/opt/splunk/etc/apps/<AppName>

Copy and paste the bash shell script (Create-TA.sh) below to the /tmp directory and make it executable:

chmod +x /tmp/Create-TA.sh

Then run the script from the tmp directory and supply the application name:

Create-TA.sh <AppName>

Ex:  Create-TA.sh cylance_protect

Once complete, the TA will be located here:  /tmp/TA-<AppName>.spl

Code

#!/bin/bash
# Create-TA
# anlee2 - at - vt.edu
# TA Creation tool written in bash
# Input:  App name   (ex: cylance_protect)
# Output: /tmp/TA-<app name>.spl

# Path to the Splunk app home.  Change if this is not accurate.
APP_HOME="/opt/splunk/etc/apps"


##### Function Usage #####
# Prints usage statement
##########################
Usage()
{
echo "TA-Create v1.0
Usage:  TA-Create.sh <App name>

  -h = help menu

Please report bugs to anlee2@vt.edu"
}


# Detect the absence of command line parameters.  If the user did not specify any, print usage statement
[[ $# -eq 0 || $1 == "-h" ]] && { Usage; exit 0; }

# Set the app name and TA name based on user input
APP_NAME=$1
TA_NAME="TA-$1"

echo -e "\nApp name is:  $APP_NAME\n"


echo -e "Creating directory structure under /tmp/$TA_NAME\n"
mkdir -p /tmp/$TA_NAME/default /tmp/$TA_NAME/metadata /tmp/$TA_NAME/lookups /tmp/$TA_NAME/static /tmp/$TA_NAME/appserver/static


echo -e "Copying files...\n"
cp $APP_HOME/$APP_NAME/default/eventtypes.conf /tmp/$TA_NAME/default/ 2>/dev/null
cp $APP_HOME/$APP_NAME/default/app.conf /tmp/$TA_NAME/default/ 2>/dev/null
cp $APP_HOME/$APP_NAME/default/props.conf /tmp/$TA_NAME/default/ 2>/dev/null
cp $APP_HOME/$APP_NAME/default/tags.conf /tmp/$TA_NAME/default/ 2>/dev/null
cp $APP_HOME/$APP_NAME/default/transforms.conf /tmp/$TA_NAME/default/ 2>/dev/null
cp $APP_HOME/$APP_NAME/static/appIcon.png  /tmp/$TA_NAME/static/appicon.png 2>/dev/null
cp $APP_HOME/$APP_NAME/static/appIcon.png  /tmp/$TA_NAME/appserver/static/appicon.png 2>/dev/null
cp $APP_HOME/$APP_NAME/README /tmp/$TA_NAME/ 2>/dev/null
cp $APP_HOME/$APP_NAME/lookups/* /tmp/$TA_NAME/lookups/ 2>/dev/null

echo -e "Modifying app.conf...\n"
sed -i s/$APP_NAME/$TA_NAME/g /tmp/$TA_NAME/default/app.conf
sed -i "s/is_visible = .*/is_visible = false/g" /tmp/$TA_NAME/default/app.conf
sed -i "s/description = .*/description = TA for $APP_NAME./g" /tmp/$TA_NAME/default/app.conf
sed -i "s/label = .*/label = TA for $APP_NAME./g" /tmp/$TA_NAME/default/app.conf


echo -e "Creating default.meta...\n"
cat >/tmp/$TA_NAME/metadata/default.meta <<EOL
# Application-level permissions
[]
access = read : [ * ], write : [ admin, power ]
export = system

### EVENT TYPES
[eventtypes]
export = system

### PROPS
[props]
export = system

### TRANSFORMS
[transforms]
export = system

### LOOKUPS
[lookups]
export = system

### VIEWSTATES: even normal users should be able to create shared viewstates
[viewstates]
access = read : [ * ], write : [ * ]
export = system
EOL

cd /tmp; tar -zcf TA-$APP_NAME.spl $TA_NAME


echo -e "Finished.\n\nPlease check for you file here:  /tmp/$TA_NAME.spl"

Conclusion

Hopefully this helps others save some time by maintaining one application and extracting the necessary data to create the technology add-on.

Props

Huge thanks to Mike McGinnis for testing and feedback.  :-)

Sunday, August 27, 2017

Splunk: The unsung hero of creative mainframe logging

By Tony Lee

The situation

Have you ever, in your life, heard a good sentence that started with: “So, we have this mainframe... that has logging and compliance requirements…” Yeah, me neither. But this was a unique situation that required a quick and creative solution--and it needed to be done yesterday.  Cue the horror music.

In summary:  We needed to quickly log and make sense of mainframe data for reporting and compliance reasons. The mainframe did not support external logging such as syslog. However, the mainframe could produce a CSV file and that file could be scheduled to upload to an FTP server (Not SFTP, FTPS, or SCP).  Yikes!

Possible solutions

We could stand up an FTP server and use the Splunk Universal Forwarder to monitor the FTP upload directory, but we did not have extra hardware or virtual capacity readily available. After a quick Google search, we ran across a little gem of an app called the Splunk FTP Receiver app (written by Luke Murphey):  https://splunkbase.splunk.com/app/3318/. This app cleverly creates a Python FTP server using Splunk; best of all, it leverages Splunk’s user accounts and role-based access controls.

How it worked

At a high level, here are the steps involved:
  1. Install the FTP Receiver app:  https://splunkbase.splunk.com/app/3318/
  2. Create an index for the mainframe data (Settings -> Indexes -> New -> Name: mainframe)
  3. Create an FTP directory for the uploaded files (mkdir /opt/splunk/ftp)
  4. Create FTP Data input (Settings -> Data Inputs -> Local Inputs -> FTP -> New -> name: mainframe, port: 2121, path: ftp, sourcetype: csv, index: mainframe)
  5. Create a role with the ftp_write privileges (Settings -> Access Controls -> Roles: Add new -> Name: ftp_write, Capabilities: ftp_write)
  6. Create a Splunk user for the FTP Receiver app (Settings -> Access Controls -> Users: Add new -> Name: mainframe, Assign to roles: ftp_write)
  7. Configure the mainframe to send to the FTP Receiver app port (on your own for that one)
  8. Create a local data input to monitor the FTP upload directory and ingest as CSV (Settings -> Data inputs -> Local inputs -> Files and Directories -> New -> Browse to /opt/splunk/ftp -> Continuously monitor -> Sourcetype: csv, index: mainframe)


Illustrated, the solution looks like this:


Figure 1:  Diagram of functional components

If you run into any issues, troubleshoot and confirm that the FTP server is working via a common web browser.



Figure 2:  Troubleshooting with the web browser
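Once files start arriving, a quick search (this assumes the index and sourcetype names used in the steps above) confirms that the monitored directory is actually being ingested into Splunk:

index=mainframe sourcetype=csv | stats count by source, host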

Conclusion

Putting aside concerns that the mainframe may be older than most of the IT staff and the fact that FTP is still a clear-text protocol, this was an interesting solution that was created using the flexibility of Splunk. Add some mitigating controls and a little bit of SPL + dashboard design and it may be the easiest and most powerful mainframe reporter in existence.



Figure 3:  Splunk rocks, the process works