Sneller is a cloud-native, serverless (i.e., no dedicated storage) SQL engine designed specifically for JSON. With Sneller, you can run interactive SQL on large, TB-sized JSON datasets, including deeply nested fields. Our target use cases are event data pipelines such as Security, Observability, Ops, User Events and Sensor/IoT.
If you use Elasticsearch for these use cases today, you can replace the core Elastic engine (the 'E' in the ELK stack) with Sneller via an included Elastic adapter. You benefit from unlimited storage scaling and up to 10x faster queries over multi-TB log datasets. (Note: unlike Elastic, Sneller is not a free-text search engine and does not require an indexing stage.)
If you instead use a Data Warehouse/Data Lake + SQL for these use cases, Sneller can significantly simplify and speed up your existing pipeline. You no longer need dedicated ETL or ELT just to transform or reshape JSON to ingest it into a relational model (columnar or otherwise). You can run SQL directly against deeply nested JSON without any non-standard, product-specific extensions. Sneller’s extensive use of low-level vectorization (using AVX-512) allows you to run sub-second SQL queries against billions of event records.
Sneller is open source under the AGPLv3 license, and you can sign up to try Sneller Cloud, which is currently in beta.
Event Data - What it is, where it comes from
A few months ago, Erik Bernhardsson, formerly of Spotify and someone who’s well known on data twitter, asked:
For a while now, we’ve been living in the world of event data that Erik is referring to. Events, in their simplest and most general form, are timestamped (and increasingly, location-stamped) data about any activity of interest.
Some of the earliest and best-known types of event data were (and still are) from Security/SIEM use cases that generate security events from applications, operating systems and network infrastructure. Another major source of event data is Observability/APM (often shortened to ‘o11y’), which covers application logs, infrastructure metrics (e.g. CPU load, memory utilization) and traces, which show the detailed path that a request (e.g. an API call or a user action) takes through the entire technology stack from end to end. A third well-known class of event data is Product Analytics, used in growth orgs and based on app instrumentation. At its heart, product event data is a time-ordered stream of well-defined user action events.
Event data also appears in many areas beyond security, observability and product/service analytics. Real-time sensor data from the rapidly expanding world of connected devices, vehicles and industrial equipment is also event data. Finally, in the rapidly emerging world of Ops (DevOps, DataOps, MLOps/AIOps), telemetry data from each stage in an Ops pipeline is again a form of event data.
Of note is OpenTelemetry, an emerging standardization effort for event data, currently focused on Observability, that aims to provide a common vocabulary for event data domains to avoid needless redundancy/reinvention.
Why a new data platform specifically for event data?
Since there’s been a Cambrian explosion of tools for every kind of data pipeline or use case, we have to justify why we felt the need to add another one to the mix. To do that, here's our view of the three most important features of any kind of event data:
Why do today’s data platforms fall short for event data?
If you're nodding as you read the above, you'll agree that an event data platform should at least meet the following needs:
- Cost effective, even when faced with essentially unbounded data retention windows or volumes
- Support both interactive monitoring and investigative analytics usage patterns on the same platform
- Schema-agnostic to handle schema evolution common to event data use cases without brittle schema evolution processes or ETL/ELT
- Ecosystem support allowing integration with the broadest ecosystem of data tools possible
Today, the choice comes down to two distinct kinds of data platforms:
Text Search platforms: Tools like Elasticsearch, OpenSearch, ChaosSearch and Quickwit belong in this category. They were designed first and foremost to handle querying of unstructured data such as text (and other modes such as voice and video). Because they need to extract structure from text (often by creating an inverted index), they've become a natural choice for handling the semi-structured nature of event data (such as application logging). Of these, Elastic has the advantage of a long-established open source solution with the ELK stack for observability and security use cases.
Structured data platforms: Every data warehouse, data lake or analytic database that supports SQL as its primary API belongs in this category. At the heart of these systems is the relational (tabular) model for which SQL was developed in the first place. The obvious advantage here is the unmatched familiarity of SQL, and the sheer scale of the supporting ecosystem that has developed over decades.
In our view, both types of platforms, while great for many use cases, don't really work well for event data. Here's why:
Sneller: Built for event data
This is what we set out to solve at Sneller. We wanted the following:
- An ideal combination of performance and economics at the typical scale of semi-structured event data (GB-TB/day or more)
- Standard SQL as the primary query language
- Simplified processing pipelines without brittle ETL/ELT only for reshaping JSON, and no headaches due to schema evolution
We’ll get into the details of how we achieved this in subsequent blog posts. For now, here’s an illustration of the complete Sneller user experience today.
Compared to existing alternatives for JSON processing, Sneller's biggest value is simplicity at scale. Just load (via batch or streaming) your JSON-encoded event data directly into S3 (and in the future, other cloud object stores), and start querying it with standard SQL. There's no need to provision dedicated capacity for indexing, build ETL/ELT pipelines, handle schema evolution, or worry about how to age your data out to cheaper storage. Four key design decisions enable this:
Storage/compute separation (aka Serverless/Unbundled)
As the dominant storage mechanism in a cloud-native world, centralized object storage services such as S3, GCS or ADLS are battle-tested, secure and cost-effective at petabyte (or greater) scale. They are a natural choice for the storage economics of unbounded event data collections.
Unlike most platforms, Sneller is foundationally built to use object storage as its primary storage. A major problem with data infrastructure today is over/under-provisioning. Because storage and compute are not separate, you always end up paying for compute you don't need while scaling storage (or vice versa). In contrast, Sneller lets you dial compute capacity independently up or down based on how your workload changes (or your SLAs), without worrying about scaling or availability of the storage subsystem. As a result, Sneller is far more cost effective than comparable solutions - you only pay based on the work you do, not the volume of stored data, or an 'always-on' infrastructure size you provisioned upfront.
Next, unlike many databases/data warehouses, Sneller does not ‘copy’ or require you to load your data into a custom format or data layout, columnar or otherwise (we’ll elaborate on this admittedly controversial tradeoff in subsequent posts).
Finally, Sneller Cloud compute nodes on EC2 do not have any locally attached storage, which considerably simplifies scaling and also means that we don't persist any customer data (even inadvertently) within our service. As a result, in Sneller Cloud, your data (source data, intermediate data and outputs) is always in your control, in your own account in S3 (and soon GCS and ADLS).
Sneller is completely schema-on-read. This is a huge advantage for event data, which requires structure but also flexibility in changing/evolving that structure. It avoids the need for brittle, hard-to-maintain ETL/ELT pipelines whose sole purpose is to shape or reshape JSON so it can be ingested into a specific storage format. In turn, this radically simplifies your event data pipeline. Compare this both to search platforms like Elastic, which require dedicated indexing capacity and infrastructure, and to data warehouses, which need a dedicated transformation pipeline for JSON. You can add new fields, change field types (and even handle type ambiguity issues) in your source data without worrying about schema evolution.
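To make the schema-on-read idea concrete, here is a minimal Python sketch. It only illustrates the general technique and is not how Sneller implements it: the event fields (`ts`, `status`, `user`, `latency_ms`) and the helper names are hypothetical. Each record is parsed on its own, and the type of `status` is resolved only when the query reads it, so a record that stores the status as a string or omits it entirely does not break anything.

```python
import json

# Hypothetical newline-delimited JSON events; field names are illustrative only.
RAW_EVENTS = """\
{"ts": "2022-08-01T10:00:00Z", "status": 200, "user": {"id": "u1"}}
{"ts": "2022-08-01T10:00:01Z", "status": "500", "user": {"id": "u2", "plan": "pro"}}
{"ts": "2022-08-01T10:00:02Z", "latency_ms": 12}
"""


def read_events(text):
    """Parse each line independently; no schema is declared up front."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def status_code(event):
    """Resolve the type at read time: accept ints or numeric strings, else None."""
    value = event.get("status")
    if isinstance(value, bool):  # bool is a subclass of int in Python; reject it
        return None
    if isinstance(value, int):
        return value
    if isinstance(value, str) and value.isdigit():
        return int(value)
    return None


def count_errors(events):
    """Count events with status >= 500, skipping events without a usable status."""
    return sum(1 for e in events if (status_code(e) or 0) >= 500)


events = read_events(RAW_EVENTS)
print(count_errors(events))
```

The second record adds a field (`plan`) and changes the type of `status`, and the third drops `status` altogether, yet the same query runs over all three without a migration step.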
With Sneller, you can run low latency, interactive SQL directly on TBs or more of semi-structured JSON data.
We did this by writing our engine from the ground up to be SIMD-aware. Specifically, we use the AVX-512 extensions available on modern Intel server CPUs (and likely to be available on AMD when Zen 4 arrives). This gives us up to 16x the throughput for core kernels in our engine without requiring specialized accelerator hardware. The ‘done right’ part is the trick though - we had to develop a library of assembly-level AVX-512 routines integrated into a Golang-based execution engine for this purpose. For the technically curious, not using LLVM-based JIT query compilation was another architectural tradeoff (details to follow in other blog posts).
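For a sense of where a figure like 16x can come from, the arithmetic below counts how many fixed-width values fit into one 512-bit vector register. Linking the lane count to the 16x number is our assumption (the post does not derive it), and real speedups also depend on memory bandwidth and on how much of a kernel can be vectorized. The `lanes` helper is hypothetical and exists only for this illustration.

```python
REGISTER_BITS = 512  # width of one AVX-512 vector register


def lanes(element_bits, register_bits=REGISTER_BITS):
    """Number of fixed-width elements a single vector instruction operates on."""
    if element_bits <= 0 or register_bits % element_bits != 0:
        raise ValueError("element width must evenly divide the register width")
    return register_bits // element_bits


for bits in (8, 16, 32, 64):
    print(f"{bits}-bit elements: {lanes(bits)} per instruction")
```

With 32-bit elements, one instruction covers 16 values, which is the upper bound on per-instruction speedup over a scalar loop that handles one 32-bit value at a time.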
While SQL is far and away the most popular UX (and API) for data, it is closely tied to the relational (row/column) data model. On the other hand, JSON-encoded event data allows for richer structure including nested objects and complex field types. To use SQL with JSON, databases often have to resort to workarounds like specialized variant types to store JSON and/or dedicated, non-standard JSON helper functions.
By contrast, Sneller leverages PartiQL, an extended dialect of SQL originally designed by Amazon specifically for semi-structured data (and supported by products such as DynamoDB and Glue within the AWS data platform ecosystem). You get the full power and flexibility of SQL for complex nested event data structures without requiring explicit ELT/ETL into a tabular model, or product-specific extensions to handle this kind of data.
Taken together, Sneller offers the familiarity and capability of SQL for semi-structured JSON data without the cost or complexity of either search platforms like Elastic, or data warehouse pipelines with dedicated ETL/ELT stages only for transforming or reshaping JSON. In our benchmarks we have seen up to a 50-80% cost advantage over Elastic Cloud while obtaining a 3-10x (or more) increase in query performance for typical queries on billion-event datasets that contain complex nested JSON. Watch out for an upcoming blog post that dives into great detail on how we built our benchmarking infrastructure (which is also soon to be open sourced).
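As a rough illustration of what querying nested data without reshaping means, the Python sketch below mimics a PartiQL-style query that navigates a nested object and unnests an array. The table name `events`, the field names, and the exact query text in the comment are hypothetical, and whether Sneller accepts this precise form is an assumption. The Python function only emulates the query's semantics on in-memory records.

```python
# PartiQL-style query being emulated (illustrative; not verified against Sneller):
#   SELECT e.customer.id, i.sku
#   FROM events AS e, e.items AS i
#   WHERE i.price > 10

# Hypothetical nested events; the third record has no "items" array at all.
events = [
    {"customer": {"id": "c1"},
     "items": [{"sku": "a", "price": 5}, {"sku": "b", "price": 20}]},
    {"customer": {"id": "c2"}, "items": [{"sku": "c", "price": 15}]},
    {"customer": {"id": "c3"}},
]


def expensive_items(records, min_price):
    """Unnest each record's items and keep those priced above min_price."""
    rows = []
    for e in records:
        # A missing array simply yields no rows, mirroring the unnesting join.
        for i in e.get("items", []):
            if i.get("price", 0) > min_price:
                rows.append({"id": e.get("customer", {}).get("id"),
                             "sku": i.get("sku")})
    return rows


rows = expensive_items(events, 10)
print(rows)
```

The point of the comparison is that the query addresses `customer.id` and the elements of `items` directly, with no flattening step into separate tables beforehand.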
Getting started with Sneller
We've open sourced Sneller’s core engine under the AGPLv3 license. You can download, build and try Sneller on your own JSON data. To get you started, we’ve provided canned datasets and also a conversion utility that creates ion (the compressed binary JSON format we use) from your own JSON data. You’ll need access to an AVX-512 capable instance or machine - these are widely available on all cloud providers.
We also have instructions on how you can set up and run Sneller with Kubernetes.
Since these are early days, we would love any feedback: issues, stars and comments are all appreciated!
Sign up/Learn more
If that piqued your interest, please drop by our website and sign up to try Sneller Cloud, our hosted offering, for free on your own data. Alternatively, you can head over to our GitHub page and build and try Sneller for yourself. Over the next few weeks, we’ll have many more deep dives into Sneller’s core technology and engineering, as well as how we benchmarked Sneller, so please stay tuned!