Embedded Analytics - Under the Hood
Why is this something people should think about?
“Embedded analytics” is one of the buzzwords making the rounds in the data world, and there is real market value available to those who get it right. With the appropriate technologies and organizational practices in place, companies can position themselves to leverage embedded analytics to monetize their data and drive product usage.
Embedded analytics is all about bringing data into familiar places to meet customers where they live, and encouraging curiosity about data they would not otherwise have the ability, or the time, to interpret.
Disclaimer: For the purposes of this article, I’m going to be focusing heavily on fully customized embedded data applications. There are plenty of reasons to choose a simpler approach, including embedding an iframe into an existing portal. For more ideas on how to present data to end users, check out this blog post by my colleague, David Stocker.
Themes
Data should tell a story, and that story should appeal to any audience the data can help. Tailoring data experiences to customers increases the appeal of the data they are presented with and drives adoption. These are some of the common themes present in modern embedded analytics solutions:
Headless BI and the Semantic Layer
Headless BI tools leverage a semantic layer and a REST API to define business logic and expose data elements to other applications. They can serve as the foundation for all analytics solutions in a business, but are exceptionally helpful in building custom data applications.
This is a great article on semantic layers and their place in a data pipeline: https://airbyte.com/blog/the-rise-of-the-semantic-layer-metrics-on-the-fly
Some examples of headless BI solutions (or nearly headless):
cube.dev - a data model built in JavaScript with the sole intent of being an API-driven, extensible semantic layer. Probably the most flexible option for monetizing your data, and particularly useful if you plan to integrate data into multiple presentation layers.
Looker - Looker is not a headless BI solution by default because it has a “head” (its dashboarding layer), but its data model is best in class and can be decoupled from the visualization layer to function as a headless BI solution. Looker pioneered the embedded analytics world with its ability to control all aspects of the tool via the API. The main downside is cost.
There’s also Looker Modeler, which is LookML without the bells and whistles.
MetricFlow - the open source foundation for dbt’s semantic layer. Using MetricFlow or dbt is a viable option for building a semantic layer, but it is still a bit too new to be relied on for data monetization.
Omni - a budding BI tool taking a page out of Looker’s playbook. They are developing quickly and are on a trajectory to be a better version of Looker when it comes to embedded analytics, with API-driven embedding on the roadmap and cool features like using DuckDB for caching. They also have a much more palatable price tag.
Malloy - too new for enterprise use, but conceptually very interesting and serves the same purpose as most of the tools above.
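To make the “headless” part concrete, here is a minimal sketch of querying such a semantic layer over REST, modeled on Cube’s /cubejs-api/v1/load endpoint. The host, auth token, and the Orders cube with its measures and dimensions are all hypothetical:

```python
# A minimal sketch of querying a headless BI layer over REST (Cube-style).
# The host, token, and the Orders cube are hypothetical.
import requests

CUBE_URL = "https://analytics.example.com/cubejs-api/v1/load"
TOKEN = "..."  # a signed JWT in a real Cube deployment

query = {
    "measures": ["Orders.totalRevenue"],
    "dimensions": ["Orders.state"],
    "timeDimensions": [
        {"dimension": "Orders.createdAt", "granularity": "month"}
    ],
}

resp = requests.post(
    CUBE_URL,
    headers={"Authorization": TOKEN},
    json={"query": query},
    timeout=30,
)
resp.raise_for_status()

# Each row is a dict keyed by the requested measures and dimensions,
# ready to feed whatever presentation layer the product uses.
for row in resp.json()["data"]:
    print(row)
```

Because the business logic (what “totalRevenue” means) lives in the semantic layer, every application issuing this query gets the same answer.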
Data Governance
There are standardized methods for enforcing good data governance practices in an embedded analytics solution.
By defining business metrics in a central place, we have already applied one aspect of data governance to our system. A semantic layer allows a business to trust that any question asked of its data will return a correct answer. However, it is also the responsibility of the embedded analytics solution to make sure that data consumers are not exposed to data they should not see. This can be achieved in a few ways, and the right solution depends on surrounding technologies and policies. Here are some common patterns for enforcing this element of data governance:
Leverage the existing semantic layer and headless BI solution
In addition to housing business logic, many semantic layer tools allow for user permissions, roles, groups, etc. to be considered when forming a final SQL statement. Looker has a great model for this with its User Attributes feature, and cube.js allows for role-based access.
This method allows for some additional flexibility, but comes at the cost of reliability. Given that these tools are built by developers, and in some cases by developers with little exposure to software development best practices, there is always a risk that mistakes are pushed to production.
Example: A user profile contains a key:value pair that reads: “state: california”. The intent of this attribute is to limit the data that this user can see to records that contain a value of “california” in the “state” column. However, a developer working exclusively in code may not know, or may not remember, that they need to include a reference to this key:value pair when designing a query that ultimately gets exposed to the end customer. This could result in the end user viewing data for all states.
Well-implemented CI/CD practices can mitigate this risk.
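Another way to reduce the reliance on developer memory is to route every outbound query through a single chokepoint that merges the user’s attributes into the filter list. The sketch below is hypothetical (the UserContext class, the Orders.state member, and the filter shape are all illustrative); Cube’s queryRewrite hook and Looker’s access_filter parameter serve this same purpose in those tools:

```python
# A sketch of centrally enforcing user attributes on every query. All names
# here (UserContext, apply_row_filters, Orders.state) are hypothetical.
from dataclasses import dataclass, field

@dataclass
class UserContext:
    user_id: str
    attributes: dict = field(default_factory=dict)  # e.g. {"state": "california"}

def apply_row_filters(query: dict, user: UserContext) -> dict:
    """Merge mandatory row-level filters into an outbound query."""
    filters = list(query.get("filters", []))
    if "state" in user.attributes:
        filters.append({
            "member": "Orders.state",
            "operator": "equals",
            "values": [user.attributes["state"]],
        })
    return {**query, "filters": filters}

# Every query passes through the chokepoint before execution, so an
# individual query author cannot accidentally omit the state filter.
user = UserContext("u_42", {"state": "california"})
safe_query = apply_row_filters({"measures": ["Orders.count"]}, user)
print(safe_query["filters"])
```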
Control access in the data warehouse
Many businesses that expose data to their customers will partition access at the warehouse level. Most modern data warehouses can integrate with common authentication providers and protocols, such as SAML (via Okta or Auth0), Google, and Active Directory. This allows the organization to maintain all data consumer profiles and attributes in a central location and extend them to different tools.
With this method implemented, we need not worry about filtering datasets, because any query a customer runs can by default only reach the dataset that pertains to that specific customer. It’s as if each customer has their own mini database.
In cases where preventing unauthorized access to data is of the utmost importance, this method can provide a higher degree of confidence simply because there are fewer points of failure.
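Snowflake’s row access policies are one concrete example: the filter is attached to the table itself, so it applies no matter which tool issues the query. A minimal sketch, assuming a hypothetical user_state_map mapping table and an orders table with a state column:

```python
# A sketch of warehouse-level row security using a Snowflake row access
# policy. Table names, the mapping table, and connection parameters are
# hypothetical; the policy follows Snowflake's documented mapping-table pattern.
import snowflake.connector

POLICY_SQL = """
CREATE OR REPLACE ROW ACCESS POLICY security.state_policy
AS (row_state VARCHAR) RETURNS BOOLEAN ->
  EXISTS (
    SELECT 1
    FROM security.user_state_map m
    WHERE m.user_name = CURRENT_USER()
      AND m.state = row_state
  )
"""

ATTACH_SQL = """
ALTER TABLE analytics.orders
  ADD ROW ACCESS POLICY security.state_policy ON (state)
"""

conn = snowflake.connector.connect(
    account="my_account", user="admin", password="...", role="SECURITYADMIN"
)
try:
    cur = conn.cursor()
    cur.execute(POLICY_SQL)  # define the policy once, centrally
    cur.execute(ATTACH_SQL)  # every query against orders is now filtered
finally:
    conn.close()
```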
Culture and Process
In addition to technology, there are benefits to creating a culture around data governance. A good way to do this is to create a “Governance Committee” responsible for agreeing on and enforcing compliance with topics such as changes to business logic, levels of customer data access, and much more. Data governance is a broad and far-reaching topic that warrants much more attention than I can devote (or have the knowledge to provide) in this article.
Performance
Paying customers don’t like to wait for things. From the end user’s perspective, performance is latency, and there are a few ways to build your data foundation to optimize for it. Here are some methods for ensuring performance when customers access data:
Warehouse
Use a warehouse intended for the use case. Decision makers often gravitate towards the big, flashy names, such as Snowflake, Redshift, and BigQuery, when choosing a warehouse. While these warehouses are capable of pretty much anything, they are sometimes overkill, and sometimes not the right fit for embedded analytics use cases. Some factors to consider when choosing a data warehouse for embedded analytics purposes:
Rapid data ingestion - BQ and Snowflake are not known for their ability to ingest data in real time, but there is a large ecosystem of tools that enables these cloud warehouses and others to ingest, process, and return data in near real time. Still, some cases warrant moving away from these big names.
In cases where immediately reflecting changes in data to customers is essential, it’s worthwhile to consider options other than BQ and Snowflake, such as Apache Druid (or Imply for a managed version) or ClickHouse (see the sketch after this list). The balance between read requirements, anticipated query complexity, write requirements, cost, and ease of maintenance will help determine the right solution.
Big data - some use cases necessitate queries on truly massive datasets. In these cases, consider a database built to handle these queries, designed for speed and scale.
Trino (https://trino.io/) - query at the exabyte scale
MongoDB - not as easy to use for analytical use cases, as the BI drivers tend to be a bit limiting, but it is super fast and great for big data processing
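To illustrate the low-latency end of the spectrum, here is a minimal sketch of the kind of interactive aggregation ClickHouse handles well, using the clickhouse-connect Python client. The host, credentials, and events table are hypothetical:

```python
# A sketch of a low-latency aggregation query against ClickHouse via the
# clickhouse-connect client. Host, credentials, and the events table are
# hypothetical.
import clickhouse_connect

client = clickhouse_connect.get_client(
    host="clickhouse.example.com", username="reader", password="..."
)

# A large scan with aggregation, returned fast enough to back an
# interactive embedded chart.
result = client.query(
    """
    SELECT toStartOfHour(event_time) AS hour, count() AS events
    FROM events
    WHERE event_time >= now() - INTERVAL 1 DAY
    GROUP BY hour
    ORDER BY hour
    """
)
for hour, events in result.result_rows:
    print(hour, events)
```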
Transform
Query complexity increases latency, and pre-aggregating data is an efficient way to reduce query complexity and improve query performance (a minimal sketch follows the list below). Use a reliable data transformation tool to prepare data for efficient querying. Some tools include:
dbt
Best in class for enterprise use, with a large development and support community.
Coalesce
Up and coming. Requires Snowflake.
Dataform
GCP’s competitor to dbt. So far so good - don’t be surprised if it takes off in the near future. Main downside is that it requires BigQuery.
Orchestrator + SQL (stored procedures, scripts)
Fine for getting started, but the goal should be to leverage an orchestration tool to run one of the tools above.
Python - flexible and able to be integrated into most stacks. Not as accessible to SQL developers.
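To show what pre-aggregation looks like in practice, here is a minimal sketch that rolls raw orders up to one row per customer per day, so customer-facing queries scan the small rollup instead of the raw table. SQLite stands in for the warehouse, and the raw_orders table and its columns are hypothetical; in a real stack this SELECT would live in a dbt, Coalesce, or Dataform model run on a schedule:

```python
# A minimal pre-aggregation sketch. SQLite stands in for a warehouse, and
# the raw_orders table is hypothetical; in practice this SELECT would live
# in a dbt/Dataform model rather than an ad hoc script.
import sqlite3

ROLLUP_SQL = """
CREATE TABLE IF NOT EXISTS daily_order_rollup AS
SELECT
    order_date,
    customer_id,
    COUNT(*)    AS order_count,
    SUM(amount) AS revenue
FROM raw_orders
GROUP BY order_date, customer_id;
"""

conn = sqlite3.connect("warehouse.db")  # assumes raw_orders already exists here
conn.executescript(ROLLUP_SQL)
conn.close()
```

A dashboard query that once aggregated millions of raw rows now reads a few hundred pre-computed ones, which is usually the cheapest latency win available.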
Driving Product Engagement
Building an embedded analytics solution affords flexibility in how data is presented to customers. Driving product usage is a main motivation for building embedded analytics solutions, because embedded data keeps users in the main product when they would otherwise need to navigate to other systems for insights.
One of the most common precursors to an embedded analytics solution is a white-labeled BI solution that customers may access. Besides cost, which is covered below, there are product problems with using a white-labeled BI instance:
Customer experience - modern BI tools give users a ton of flexibility in how they view and interact with data. However, they are still their own tools with their own UX. The UX of the business’s product will not match the UX of the BI tool (unless the business has tailored its product to match that design system), which creates a choppy experience.
Tech-savvy customers will likely be able to tell quickly that they are viewing a white-labeled BI tool.
Customers prefer not to have to learn new tools, and navigating a modern BI tool takes training and time.
Bringing data analysis into the paid product solves the above problems, and ultimately gets users to spend more time using the product.
Some product process tips that help with this transition (at the discretion of the product owner):
Don’t rip and replace - this change should be gradual, and the existing solution should be retained as long as possible.
Focus first on simplifying the UX and bringing simple dashboards into the product (as opposed to self-serve).
Once the data quality and availability checks have passed, begin to bring self-serve analytics into the product. This process takes a large amount of user research, interviews, and thoughtfulness. There is a tricky balance to strike between giving users the flexibility to be curious and avoiding overwhelming them with non-essential features.
Be careful with allowing users to create their own content. Any experienced data practitioner will know that any BI tool that has been in use for more than a couple years will be bloated with unused dashboards. Given the volume of users that tend to onboard with embedded analytics solutions, this problem has the potential to become a real bear.
This is a great case for leveraging an existing content management system (CMS), like Strapi, if the chosen headless BI solution or semantic layer doesn’t have a built-in option, as sketched below. It is not fun to build and maintain content management solutions.
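A sketch of that pattern, persisting a user-created dashboard definition to a CMS instead of hand-rolling storage. The collection name, token, and payload shape are hypothetical; Strapi’s v4 REST API wraps request bodies in a data key:

```python
# A sketch of storing user-created dashboard definitions in a headless CMS
# (Strapi-style). The "dashboards" collection, token, and payload are
# hypothetical.
import requests

STRAPI_URL = "https://cms.example.com/api/dashboards"
TOKEN = "..."  # a Strapi API token

dashboard = {
    "name": "Monthly Revenue",
    "owner_id": "u_42",
    "layout": {"tiles": [{"query": "Orders.totalRevenue", "viz": "line"}]},
}

resp = requests.post(
    STRAPI_URL,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={"data": dashboard},  # Strapi v4 expects payloads under "data"
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["data"]["id"])
```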
Data Monetization
The other main driver for building an embedded analytics solution is data monetization. Across all industries, data is a currency that can be exchanged or provisioned for real dollars. Businesses that want to capitalize on the data they own can use an embedded analytics solution to turn that data into a key product offering and expand their business model.
Just as white-labeled BI solutions are inferior to embedded analytics solutions in terms of product engagement, they are also inferior in terms of data monetization:
Cost - BI tools can be expensive, and often charge on a per-license basis. This limits the leverage in scaling this line of business, as gross profit per customer is capped at the margin the business can make on a single license. Moving to an API-driven model and incorporating performance best practices in the upstream pipeline can drastically reduce cost.
Pricing structure - offering advanced analytics capabilities to higher-paying customers is a natural progression when monetizing data. However, this becomes difficult to manage within a BI platform due to the tool-specific constraints on data access and user provisioning. In other words, a business using a white-labeled BI solution will need to ensure that its pricing model works within the confines of the BI tool, rather than having the BI tool adapt to the business’s preferred pricing structure, as a custom solution affords.
Activation - It’s harder to engineer a data activation use case when using an embedded iframe from a BI tool. Because users can only realize the value of a data product when they act on it, we want to design and implement an experience that facilitates a cohesive workflow of question asked, insight delivered, and action taken.
Internal Data Driven Culture
For many of the same reasons that embedded analytics solutions are advantageous when presenting data to paying customers (whether they are paying explicitly for the data or not), embedded analytics can serve as an integral part of a business’s internal data function and culture.
As a general rule of thumb, building a large custom application for internal purposes is only worth the cost when the need for niche data analysis and exploration requires a custom solution. One of the main principles of data engineering, and I believe any type of engineering, is to avoid undifferentiated heavy lifting. As in, don’t build a thing if you can buy a thing for a reasonable price.
However, enterprises commonly have multiple BI tools in use. A headless semantic layer can provide access to well governed data models across an organization, within the tools users are accustomed to. The benefits of this level of consistency can be immense when it comes to driving adoption of new tools and building a resilient data culture.
If built using headless BI, the same semantic layer that powers an external solution can power an internal one, precisely because headless BI is API-driven and extensible.