Sneller vs. Elastic: A Tale of the Billion Event Tape

Venkat Krishnamurthy
May 4, 2022

Sneller is a serverless vectorized query engine designed specifically for analytics on semi-structured JSON event data from logging, observability, security/XDR and IoT pipelines, for which Elastic/OpenSearch is the most popular solution today. Sneller not only speaks SQL natively, but also Elastic’s query DSL, via a soon-to-be-open-sourced Elastic query language adapter.
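To make the adapter idea concrete, here is an illustrative, hand-written pairing of a Kibana-style Elastic DSL aggregation and a roughly equivalent query in generic SQL. This is a sketch only: the adapter is not yet released, the SQL dialect shown is generic, and the table/column names are placeholders, so the SQL Sneller actually accepts and the adapter actually emits may differ.

```go
// Illustration only: a hand-written sketch of the kind of translation the
// Elastic adapter performs. The SQL shown is generic; the exact dialect
// Sneller accepts and the SQL the adapter actually emits may differ.
package main

import "fmt"

// An Elastic DSL body of the sort Kibana emits for a dashboard panel:
// count events in 30-second buckets over the last 5 minutes.
const elasticDSL = `{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-5m", "lte": "now" } } },
  "aggs": {
    "per_interval": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "30s" }
    }
  }
}`

// A roughly equivalent query in generic SQL (table and column names are
// placeholders, not Sneller specifics).
const roughSQL = `
SELECT FLOOR(EXTRACT(EPOCH FROM event_time) / 30) * 30 AS bucket,
       COUNT(*)                                        AS hits
FROM   events
WHERE  event_time >= NOW() - INTERVAL '5' MINUTE
GROUP  BY 1
ORDER  BY 1`

func main() {
	fmt.Println("Elastic DSL:\n", elasticDSL)
	fmt.Println("SQL sketch:\n", roughSQL)
}
```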

In this post, we share insights from our head-to-head benchmarking of Sneller vs. Elastic on a realistic, production-scale event data workload (~1 billion JSON-serialized events of 1KiB each, ingested at >10k events per second over a 24h period; roughly 1TB/day raw).

Compared to Elastic Cloud, Sneller is 5x faster for a common monitoring dashboard workload over a single day’s worth of data. For a typical production event data pipeline that retains 90 days of data, Sneller is also ~10x less expensive, while being far simpler to operate - with no complex data tiering and no redundant high availability infrastructure.

Since most people reading these results will wonder about the benchmark infrastructure and setup, we’ll follow this post with another that dives into why and how we built our own benchmark infrastructure.

Overview and Benchmark Setup

We open-sourced Sneller back in May this year to help with analytics on large volumes of arbitrarily complex JSON-encoded event data (observability, security/XDR and user/product event analytics). During our journey, we learned that Elasticsearch/ELK (and variants like OpenSearch) are by far the most popular solution used today for these workloads.

Naturally, we had to demonstrate both to ourselves and our prospective customers that Sneller’s modern architecture is both cost-effective and performant compared to Elastic. To do this, we needed a benchmark. Cue XKCD…

In the words of a friend from HPC days, “Not all benchmarks lie, but all liars benchmark”

To keep our own benchmarking effort from becoming an exercise in self-congratulatory illusion, we first spoke to several prospects and design partners using Elastic about their actual production event data workloads, focusing specifically on ingest rates and query patterns. We used these learnings to simulate as realistic a production workload as possible.

We’ll dive straight into the results in this post, since they’re the most interesting part. Because that’s likely to raise a number of questions about the ‘how’, we’ll follow up with another post on why and how we had to build custom benchmarking infrastructure and a custom workload for event data.

A summary of the benchmark harness

Input Data
Benchmark Run
  • Run length: 24h
  • Ingest rate: ~12,000 events/sec
  • Total ingested events: ~1.036B events over 24h
  • Total (raw) data size: ~1TiB
Event Object
  • Raw data size: 1 KiB
  • Encoding: JSON
  • Compression: gzip
  • Prototype: AWS CloudTrail events
Event Source Pipeline
  • Custom golang + jsonnet template-based JSON data generator (a minimal sketch follows below)
  • vector (vector.dev) for routing event data to AWS S3 and Elastic simultaneously
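Our actual generator is golang + jsonnet template-based, but purely as a minimal sketch of the idea, a rate-limited Go loop emitting CloudTrail-shaped JSON events to stdout (for vector to pick up and fan out to S3 and Elastic) might look like the following. The field values, the event shape shown and the hard-coded rate are illustrative.

```go
// A minimal sketch (not our actual generator) of a rate-limited emitter that
// writes CloudTrail-shaped JSON events to stdout, where a collector such as
// vector can pick them up and route them to S3 and Elastic. Field names
// follow CloudTrail; values are synthetic.
package main

import (
	"encoding/json"
	"math/rand"
	"os"
	"time"
)

type event struct {
	EventVersion    string    `json:"eventVersion"`
	EventTime       time.Time `json:"eventTime"`
	EventSource     string    `json:"eventSource"`
	EventName       string    `json:"eventName"`
	AWSRegion       string    `json:"awsRegion"`
	SourceIPAddress string    `json:"sourceIPAddress"`
}

func main() {
	const eventsPerSec = 12000 // target ingest rate from the benchmark
	ticker := time.NewTicker(time.Second / eventsPerSec)
	defer ticker.Stop()

	enc := json.NewEncoder(os.Stdout)
	regions := []string{"us-east-1", "us-west-2", "eu-west-1"}
	names := []string{"GetObject", "PutObject", "AssumeRole", "DescribeInstances"}

	for range ticker.C {
		_ = enc.Encode(event{
			EventVersion:    "1.08",
			EventTime:       time.Now().UTC(),
			EventSource:     "s3.amazonaws.com",
			EventName:       names[rand.Intn(len(names))],
			AWSRegion:       regions[rand.Intn(len(regions))],
			SourceIPAddress: "203.0.113.10",
		})
	}
}
```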
Query Workload
  • Simulated usage pattern: monitoring dashboard (e.g. using Kibana/Grafana)
  • Number of queries: 9 queries × 7 lookback windows: 5min, 15min, 30min, 1h, 3h, 6h, 12h (these window sizes are borrowed from Kibana/Grafana)
  • Refresh interval: depends on the lookback window, from 30s (for the 5min lookback) up to 30min (for the 12h lookback window)
  • Query syntax: Elastic DSL, as generated by Kibana/Grafana (a representative example follows below)
  • Total queries over the 24h period: >100,000 across all lookback windows
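For concreteness, here is a hedged sketch of how a harness might template one such dashboard query for a given lookback window. The two-level shape (a terms sub-aggregation inside a date_histogram) is representative of what Kibana generates; the field name and its keyword mapping are assumptions, and this is not one of the actual nine benchmark queries.

```go
// A sketch of templating a Kibana-style dashboard query for a lookback
// window. The "eventName" field assumes a keyword-mapped field in the index;
// adjust to your own mapping.
package main

import "fmt"

// queryFor returns an Elastic DSL body counting the top event names per
// time bucket over the given lookback window (e.g. "5m", "1h", "12h").
func queryFor(lookback, interval string) string {
	return fmt.Sprintf(`{
  "size": 0,
  "query": { "range": { "@timestamp": { "gte": "now-%s", "lte": "now" } } },
  "aggs": {
    "per_interval": {
      "date_histogram": { "field": "@timestamp", "fixed_interval": "%s" },
      "aggs": {
        "top_events": { "terms": { "field": "eventName", "size": 5 } }
      }
    }
  }
}`, lookback, interval)
}

func main() {
	fmt.Println(queryFor("1h", "1m"))
}
```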
Infrastructure details
Elastic
  • Environment: Elastic Cloud on AWS
  • Instance specs: CPU-Optimized (ARM)
  • Instance type: c6gd.8xlarge
  • 1.9TiB NVMe (minimum capacity for >1TiB of data)
  • 32vCPU, 64GiB memory
  • Autoscaling: Off
  • Availability zones (AZ): 1

Note: We chose the CPU-optimized profile since it offers the best possible performance per Elastic’s own recommendations. We also turned off autoscaling and used a single AZ so that we did not disadvantage Elastic on costs - for production installations, Elastic strongly recommends autoscaling and more than one AZ.

Sneller
  • 1 x r6i.8xlarge
  • 16 Intel vCPUs with AVX-512, 128GiB DRAM, up to 12.5Gbps network
  • Autoscaling: Off
  • Availability zones: 1

Note: Sneller does not require multiple AZs for HA, since all persistent state (data + config) is in S3.

While we captured several metrics, we care most about performance, because that is what matters to customers using Elastic to power Kibana monitoring dashboards. More specifically, since we simulated a monitoring dashboard, we focused on measuring the total wall clock time for each ‘refresh’ - i.e. the time elapsed between starting the refresh and every chart/query in the dashboard updating with fresh data (see the video below).
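As a rough sketch of what this measurement looks like in code (the endpoint URL and query list below are placeholders, not our actual harness): fire every panel’s query concurrently and take the elapsed time until the last response arrives, which is the number a user actually experiences.

```go
// A minimal sketch of measuring a dashboard "refresh": issue every panel's
// query concurrently and record the wall clock time until the last response
// arrives. The endpoint and query bodies are placeholders.
package main

import (
	"bytes"
	"fmt"
	"net/http"
	"sync"
	"time"
)

func refreshWallClock(endpoint string, queries []string) time.Duration {
	start := time.Now()
	var wg sync.WaitGroup
	for _, q := range queries {
		wg.Add(1)
		go func(body string) {
			defer wg.Done()
			resp, err := http.Post(endpoint, "application/json", bytes.NewBufferString(body))
			if err == nil {
				resp.Body.Close()
			}
		}(q)
	}
	wg.Wait() // the refresh is done only when every chart has its data
	return time.Since(start)
}

func main() {
	queries := []string{ /* the nine dashboard query bodies */ }
	elapsed := refreshWallClock("http://localhost:9200/events/_search", queries)
	fmt.Println("refresh took", elapsed)
}
```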

In addition, we collected a separate stream of infrastructure metrics from the systems under test. Specifically, we measured the following:

  1. Elastic: CPU utilization (%), free disk space and memory utilization (%) - all of which are available via Elastic’s node stats API
  2. Sneller: CPU utilization and network traffic into the instance, via AWS CloudWatch metrics
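On the Elastic side, a minimal polling sketch might look like the following. This is illustrative rather than our actual collector: the localhost endpoint is a placeholder (the benchmark ran against Elastic Cloud), and the JSON paths noted in the comments come from the node stats response and can vary by version.

```go
// A hedged sketch of polling Elastic's node stats API for CPU, memory and
// disk figures. Endpoint and field paths are illustrative.
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

func main() {
	for range time.Tick(30 * time.Second) {
		resp, err := http.Get("http://localhost:9200/_nodes/stats/os,jvm,fs")
		if err != nil {
			fmt.Println("poll failed:", err)
			continue
		}
		var stats map[string]any
		_ = json.NewDecoder(resp.Body).Decode(&stats)
		resp.Body.Close()
		// Per node, the response carries (among much else) figures such as:
		//   nodes.<id>.os.cpu.percent              -> CPU utilization (%)
		//   nodes.<id>.jvm.mem.heap_used_percent   -> heap utilization (%)
		//   nodes.<id>.fs.total.available_in_bytes -> free disk space
		if nodes, ok := stats["nodes"].(map[string]any); ok {
			fmt.Println("collected node stats for", len(nodes), "node(s)")
		}
	}
}
```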

The Results

Before we dive into the nitty-gritty of the actual results, here’s the whole story told in a Grafana dashboard. We created this dashboard to show what an end user would see while running the same queries against Sneller and Elastic. Grafana allows different data sources to be used in the same dashboard, so we set up three identical charts against each backend.

Note that the queries here are only a subset (3 of the 9 queries) of the full benchmark harness. In fact, these charts use some of the simplest single-level aggregations from the benchmark query set, which also includes more complex multi-level (2- and 3-level) aggregations.

The three representative charts on the left are configured to query Sneller via our Elastic adapter, which translates Elastic queries into SQL under the hood. On the right-hand side are three identical charts pointing to the benchmark Elastic Cloud instance. In other words, on each refresh this dashboard runs the exact same queries against both the Sneller and Elastic backends.

What you see is the obvious difference in performance as you query larger time windows. For smaller windows (5 mins) the difference is negligible, but when querying 6 hours of data (roughly 250M events), as shown in the video, the three Sneller charts return within a few seconds, while Elastic takes roughly 3x longer, with the map chart timing out.

Now that you’ve seen the benchmark from the perspective of a real user, let’s dive into more detail on the key outcomes.

Comparing Performance

At the shortest lookback interval (5 mins), Elastic did better than Sneller, both in absolute wall clock time and in its predictability. Here’s a graph showing the measurements over the 24h run:


In absolute terms, Elastic’s 99th percentile latency (aka p99) is visibly better for this short lookback interval. This is primarily due to a constant overhead from Sneller using S3 as its primary storage tier. Note that the absolute difference from a user’s perspective is actually quite small (~1s for Elastic vs ~2s for Sneller).

Unlike the Elastic configuration, where the nodes have substantial fast local NVMe storage, Sneller by design has no local storage on its node whatsoever. The variance in query timings is an artifact of S3 being a remote storage system.

As the data windows become longer (15 mins, 30 mins, … all the way to 12h), Sneller easily outperforms Elastic on individual query timings. The most significant impact, however, is on refresh wall clock time, which is far more relevant to dashboards: it captures the full refresh cycle of a Kibana/Grafana dashboard, from the start of the refresh until every chart has finished updating with the latest data. Looking back 1h (i.e. ~43.2M events), Elastic’s tail latency is nearly 5x worse for completing a full refresh of the 9 queries (the strange V shape at the end is an artifact of the Elastic index being rolled over into the next day at 0:00 UTC).


Going up to 6h (~250M events), the difference is just as clear. Note that the 1h lookback window is configured to refresh more frequently than the 6h lookback window. You’ll see this clearly in the video above - Elastic essentially times out running the map chart with this cluster configuration.



Comparing Cost

To make the cost comparison realistic, we need to go beyond the narrow specifications of the single-day run that we used for the benchmark. The Elastic cluster we used for this run is as ‘barebones’ as permissible, and is far from a realistic production configuration. This is because we did the following:

  1. Used only a single node
  2. Did not enable autoscaling
  3. Turned off every service besides Elasticsearch
  4. Used only a single Availability Zone (a bad idea for any non-trivial deployment)
  5. Could only store a single day’s worth of data
  6. Did not enable any storage tiering (i.e. no hot/warm/cold storage tiers)

A realistic Elastic Cloud configuration that uses 2 AZs, turns on autoscaling and configures storage tiering with 2 days of hot data, 1 week of ‘warm’ data and 81 days of ‘cold’ data (90 days of retention overall) is a very different story, cost-wise. This cluster costs nearly $48 per hour.


On the other hand, Sneller’s cost for this run is $121 over the 24h period. Here are some salient details of this cost:

  1. Sneller is priced at $150/PiB of data processed, determined by the total amount of data scanned across all the queries run by the Sneller cluster.
  2. This single node running Sneller processed roughly 100k queries against the ~1TiB of ingested data; because every refresh re-scans its lookback window, the aggregate data scanned across all queries is far larger than the raw 1TiB (at $150/PiB, the $121 total implies roughly 0.8PiB scanned in aggregate).
  3. There is no additional storage or ingest cost (or infrastructure needed) - the cost is computed based only on the work done for successful queries.

Compared to Elastic, this is nearly a 10x reduction in cost ($48/hour × 24h ≈ $1,150, vs. $121). Here are three reasons why:

  1. Since Sneller Cloud uses AWS S3 as primary storage, you don’t need a costly high-availability architecture to prevent data loss. In other words, the cost of storing 90 days, 1 year or even 5 years’ worth of data is basically the cost of extremely reliable S3 storage, which is already relatively inexpensive (~$23/TB/month; at ~1TB/day, 90 days of retention is roughly 90TB, or about $2,000/month).
  2. You do not need complex sizing and upfront provisioning of compute with Sneller. You can start with as much compute as you need for, say, a single day’s worth of data (or your maximum lookback window for interactive queries), and grow according to your desired performance SLAs. Adding more capacity to a Sneller Cloud cluster yields a linear increase in performance, and as a user you will not need to manage this yourself, since we are implementing robust autoscaling to make it transparent.
  3. Sneller Cloud (soon to be launched) will adjust compute up or down based on your SLAs, and will guarantee high availability in a multi-tenant environment.

With our early design partners, we’re seeing at least a 50% cost reduction vs Elastic (and depending on workload and scale, we can do even better). That’s potentially a 2-5x performance advantage at 50% of the cost of Elastic Cloud/OpenSearch for your log/observability/security/product event data. You get this without having to rip and replace Elastic, or learn a whole new system, since Sneller speaks the Elastic query DSL as well as SQL.

Conclusions

Key takeaways from this benchmarking exercise:

Sneller can be up to 5x faster than Elastic in terms of real-world performance measured by wall clock time for monitoring/dashboard workloads.

This performance advantage changes with data window size:

  1. For the smallest intervals, Elastic will likely show better tail latency because it uses local NVMe as primary storage (with predictable IOPS/throughput).
  2. As the window sizes increase (two hours or more), Sneller shows steadily better performance, up to 5x faster than Elastic in this configuration.

Sneller also has a nearly 10x price/TCO advantage for realistic production workloads (assuming 90 days of retention), and the advantage grows with the retention window:
  1. Sneller is priced based on data processed, while Elastic Cloud’s pricing is an hourly/monthly charge for ‘always-on’ infrastructure.
  2. Sneller does not require a complex HA architecture, since it uses S3 (or, more generally, cloud object storage) as its primary storage, with no storage/retention limits.
  3. Sneller Cloud will transparently autoscale compute capacity up or down based on load SLAs and actual traffic/utilization.

In summary: if you’re using Elasticsearch (or OpenSearch) for event data analytics (on Security/XDR, Observability or IoT data), you should sign up for Sneller Cloud and test your own event data pipelines for free - using either standard SQL or the Elastic DSL.

Ready to speed up and simplify your event data analytics?

Sneller is also available as an open source project on GitHub.