An Elegant Data Stack for Embedded Analytics


Mar 07, 2024

Background

At Astrodata, we have seen embedded analytics stacks of all shapes and sizes. We’ve learned the hard way that the cost of an embedded analytics data stack, combined with the front-end application, can quickly balloon beyond the anticipated ROI without careful consideration of 1) technology pricing models and unit cost, 2) realized value, and 3) developer productivity.

There is a new wave of tools that are built specifically for embedded analytics solutions, so we wrote a quick blog about how some of these tools fit together and why they are such great choices.

In this article, we specifically explore the cost vs value advantages of the MotherDuck / Cube / React (MDCuRe) stack for building data products.

Technology Pricing Models and Unit Cost

Start with the data…

Let’s get practical: most queries that external customers run aren’t that big. Take a classic data monetization strategy, benchmarking. The granular data that your customer has access to is a relatively small slice of your big data pie, and the benchmarking metrics you provide are aggregated and vastly smaller than the granular data sets.

When this is the case, there’s no need to build massive compute capabilities into your embedded analytics solutions when the required compute of the majority of queries is small. In fact, building a much more lightweight solution allows for drastically reduced unit costs when provisioning data to external customers.

A deep understanding of the pricing models used in your embedded analytics stack will allow you to predict the ROI you can expect from your product. Ideally, the cost of the technology you use will either grow sub-linearly (for example, logarithmically) as usage grows, or result in a cheap enough unit cost to make the initial procurement and development investment a no-brainer.

This is why MotherDuck is such an advantageous data storage and compute choice for embedded analytics. Its pricing model is proportionately fair and very straightforward.

Moving downstream…

When diving into the pricing models for common BI tools and semantic layers, we notice that there is all too often a per-user model. There is usually a base cost for access to the platform that may include a handful of users, and additional users, including embedded users, require an additional monthly or annual charge. This is one of the least advantageous pricing structures you can sign up for in embedded analytics initiatives, because cost scales linearly with your user base. The only leverage is the obvious one: the platform fee contributes less to unit cost as it is spread across more users.

Conversely, take a look at Cube’s pricing model. They price based on usage and uptime, and they are intentionally building nuance into their pricing model to make embedded analytics products not only economically viable, but vastly advantageous when scaled.
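
To make the difference between the two pricing shapes concrete, here is a rough sketch of the arithmetic. Every number in it (platform fee, seat fee, hourly rate, per-query cost) is a made-up illustration value, not a real price from Cube, MotherDuck, or any BI vendor, and the two functions are hypothetical helpers rather than anyone's actual billing formula.

```python
"""Hypothetical comparison of per-seat vs. usage-based pricing.

All figures are made-up illustration values, not any vendor's real prices.
"""


def per_seat_unit_cost(users: int, platform_fee: float,
                       included_seats: int, seat_fee: float) -> float:
    """Monthly cost per user when every extra user needs a paid seat."""
    extra_seats = max(0, users - included_seats)
    return (platform_fee + extra_seats * seat_fee) / users


def usage_unit_cost(users: int, uptime_hours: float, hourly_rate: float,
                    queries_per_user: int, cost_per_query: float) -> float:
    """Monthly cost per user when the bill tracks uptime and query volume."""
    total = uptime_hours * hourly_rate + users * queries_per_user * cost_per_query
    return total / users


if __name__ == "__main__":
    # Assumed inputs, chosen only to show the shape of each curve.
    for users in (10, 100, 1_000, 10_000):
        seat = per_seat_unit_cost(users, platform_fee=500.0,
                                  included_seats=10, seat_fee=20.0)
        usage = usage_unit_cost(users, uptime_hours=730.0, hourly_rate=1.0,
                                queries_per_user=200, cost_per_query=0.0005)
        print(f"{users:>6} users: per-seat ${seat:8.2f}/user, "
              f"usage-based ${usage:8.2f}/user")
```

With these assumed inputs, the per-seat unit cost flattens out just above the seat fee, while the usage-based unit cost keeps falling toward the per-user query cost as the fixed uptime charge is spread over more users. At very small user counts the usage-based model can be the more expensive of the two, which is why it pays to run this kind of calculation with your own numbers.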

Value to your Customers (and your business)

In order to charge for access to your data, it needs to be valuable. This technology stack allows us to add value to existing data without a ton of additional effort. Here are some ways this is realized:

Speed

MotherDuck provides the ability to segment compute by user (yes, by individual user), which gives many individual users a highly scalable environment for running analytical queries. This feature is unique to MotherDuck, and it is something of a breakthrough for embedded analytics, which often struggles with high concurrency under load. Even with mainstream Massively Parallel Processing (MPP) warehouses, hundreds or thousands of users accessing the same warehouse compete for compute resources.

But the most important aspect of MotherDuck is its ability to operate as part of the web application, as a local cache running in WASM, which removes the need for a round trip to a cloud-based warehouse every time a user wants to slice their data. As a result, analytical data can be returned as quickly as transactional data on your website.

Data Curation and Reliability

Cube provides us the ability to build business logic into the data that we deliver to customers, which adds value to the data. Consider a simple extract of raw data provided to a customer (or web developer). The customer will ingest it, transform it, and calculate business metrics using it. If you can do all of that work for them, it saves them time and effort, and is therefore more valuable.

Using a semantic layer gives us the ability to capitalize on the value it creates. With user profiles and authentication built into Cube, we can provision different levels of data access and exploration to customers belonging to different pricing tiers. Because Cube is entirely API-driven, we can also allow the front-end React application to easily upsert a user profile to give them new access when they upgrade their subscription.
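
As a rough illustration of tier-based provisioning, the sketch below maps subscription tiers to the measures and dimensions a customer may request, then shows what changes when a profile is upgraded. The tier names, member names, and both helper functions are hypothetical. This is plain Python standing in for the idea, not Cube's access-control API; the query dictionary only borrows the measures/dimensions shape of a Cube query.

```python
"""Sketch: map subscription tiers to the members a customer may query.

Tier names, member names, and helpers are hypothetical. In a real deployment
this policy would live in the semantic layer's access-control configuration.
"""

# Hypothetical tiers; each higher tier is a superset of the one below it.
TIER_MEMBERS = {
    "free": {"benchmarks.industry_median"},
    "pro": {"benchmarks.industry_median", "benchmarks.peer_group_median"},
    "enterprise": {
        "benchmarks.industry_median",
        "benchmarks.peer_group_median",
        "orders.total_revenue",
        "orders.region",
    },
}


def denied_members(query: dict, tier: str) -> list[str]:
    """Return the requested measures/dimensions the tier is not entitled to."""
    allowed = TIER_MEMBERS[tier]
    requested = query.get("measures", []) + query.get("dimensions", [])
    return [member for member in requested if member not in allowed]


def upgrade(profile: dict, new_tier: str) -> dict:
    """Return a copy of the user profile with the new tier applied."""
    if new_tier not in TIER_MEMBERS:
        raise ValueError(f"unknown tier: {new_tier}")
    return {**profile, "tier": new_tier}


if __name__ == "__main__":
    query = {
        "measures": ["benchmarks.peer_group_median"],
        "dimensions": ["orders.region"],
    }
    user = {"user_id": "u-123", "tier": "pro"}
    print(denied_members(query, user["tier"]))  # ['orders.region']
    user = upgrade(user, "enterprise")
    print(denied_members(query, user["tier"]))  # []
```

The point of the sketch is that an upgrade is just a profile change: the same query that was partly blocked for a "pro" customer goes through as soon as the profile says "enterprise", with no new data pipeline or deployment in between.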

Leveraging these two tools to power the bulk of your embedded analytics backend results in a justifiable unit cost as well as exceedingly low platform fees, not just when you get started, but continuing on as you add new customers.

Front-End Ecosystem & Performance

React remains the dominant front-end development framework due to its rich ecosystem, ease of use, and performance characteristics. Although there are many other high-quality front-end frameworks, React remains the first choice for all our projects that need a front end. There are many click-to-deploy hosting platforms that work well with React applications, like Vercel or Netlify, and they are built for speed. Additionally, we’re excited by the optimizing compiler planned for release in React version 19, which automatically memoizes components so they re-render only when their state or props change rather than on every render. That could save a lot of boilerplate code in visualization components built in React, which tend to carry a lot of functional complexity and have, in the past, been a drag on performance.

Developer Productivity

Amazon popularized the concept of the “two-pizza team” to denote the ideal team size to fully support a product or business capability from end to end. How many people can two pizzas feed? Industry lore says 5-8 people, but maybe a little higher when we’re talking deep dish. Nonetheless, keeping an outcome-focused and self-contained team to this size supports effective communication, decision-making, and thus action. And, on the flip side, it minimizes hand-offs and dependencies, which means the team can instead focus on customers’ needs. In the ideal world, a single web developer on the team should be able to ship a feature with minimal coupling to others outside of the team.

Tool choices can make or break the success of the product development team. Just as the team structure is geared towards efficient communication, team integration and delivery happiness, so must the tools reflect the same priorities. We want our stack to reduce – not introduce – friction to the team’s ability to continually ship working software.

The MotherDuck / Cube / React stack, or MDCuRe stack, is a worthy contender for teams building embedded analytics products using a continuous delivery workflow. Each of these tools allows teams to manage changes via source control, and provides a smooth path from local development, to integration environments, then production. In fact, a developer can instantiate a totally clean development sandbox on their workstation using this stack.

This conceivably allows even a single developer to build and deploy a feature end-to-end, including:

Write data warehouse migrations (using a tool like Atlas) and apply them to local DuckDB
Create and load data warehouse fixtures (using a tool like Mimesis or Faker) into DuckDB
Define a data model using Cube’s IDE and validate it in the playground
Set data access permissions via code
Implement front-end React code that consumes data from Cube’s REST or GraphQL interface
Check in data warehouse migrations, Cube model, and front-end changes to source control, and
Deploy all code changes to upper environments like staging and production, including the Cube Cloud and MotherDuck hosted instances, orchestrated via a CI/CD tool like GitHub Actions.
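
As one small example from the steps above, here is a sketch of the fixture step: generating reproducible fake rows and writing them out as CSV. The table layout (customer_id, region, order_total) is hypothetical, and the standard library's random module stands in for Mimesis or Faker so the snippet has no dependencies.

```python
"""Sketch: generate deterministic fixture rows and write them as CSV.

The table layout is hypothetical. The standard library's random module
stands in for a fixture library such as Mimesis or Faker.
"""
import csv
import io
import random

REGIONS = ["north", "south", "east", "west"]
FIELDS = ["customer_id", "region", "order_total"]


def make_orders(n: int, seed: int = 42) -> list[dict]:
    """Build n fake order rows; a fixed seed keeps fixtures reproducible."""
    rng = random.Random(seed)
    return [
        {
            "customer_id": f"cust-{rng.randint(1, 50):03d}",
            "region": rng.choice(REGIONS),
            "order_total": round(rng.uniform(5.0, 500.0), 2),
        }
        for _ in range(n)
    ]


def to_csv(rows: list[dict]) -> str:
    """Serialize rows to CSV text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv(make_orders(5)))
```

Saved to a file, output like this could be loaded into a local DuckDB database with its CSV reader. Seeding the generator means every developer and every CI run sees the same rows, which keeps tests against the Cube model stable.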

This approach to team and tooling supports rapid iteration: the product team can release more frequently, which creates more opportunities to gather customer feedback and hone the direction of the product.

To Summarize:

A quick review of the advantages of using the MDCuRe stack:

Database: MotherDuck

The ability to segment compute by user removes competition for compute among users of the embedded analytics solution, ensuring consistent query speed for every user.
The blazing fast in-memory returns of DuckDB make MotherDuck an optimal storage and compute solution for powering embedded analytics solutions.
Beyond the technical advantages of DuckDB, MotherDuck’s pricing model is fair and straightforward. In the majority of embedded analytics use cases, customers are running relatively small queries on small slices of a company’s dataset. When you manage your data well in your embedded analytics stack, you can optimize such that unit costs for storage and compute are pennies.

Semantic Layer: Cube.dev

The fact that Cube’s semantic layer can be written in JavaScript, YAML, or Python means that software engineers will feel right at home in the semantic modeling aspects of a data product development lifecycle.
Leveraging an API-driven semantic layer allows instant realization of value for customers when upgrading their subscription.
Cube’s fine-grained security controls give the governing business peace of mind that data and content access is secure.
Cube’s pricing model scales quite well with an embedded analytics use case, as unit costs decrease as more users are added.

Front End: React

Supports custom analytical workflows and fine-tuned branding considerations
Provides ultimate flexibility in terms of information architecture and interaction patterns to allow experiences that are specific to the user’s needs
Supports performance optimizations including server-side rendering and WebGL-based visualization libraries
Broad industry support and a huge ecosystem of open source libraries
Readily facilitates integration with the activation layer to enable users to ask a question, gain an insight, and take action within a single pane of glass.

It should go without saying that the components we’ve outlined in this post are the fundamental building blocks of a fully functioning data stack, and a final implementation will likely include more tooling and process, such as orchestration, ingestion, transformation, data quality/observability, and/or error handling. The MDCuRe stack is interoperable with many of our other favorite open and closed source tools out there, and is a great foundation for an embedded analytics platform.

Astrodata’s Substack
