When it comes to technology, I have often noticed an abundance of two things: “concepts” and “data.” Today, I would like to share one of my experiences, specifically about building concepts around data. A few months ago, I was observing a distributed system, focusing on key questions such as how many requests encountered errors within the last x minutes, what the average latency was for various types of requests, and what the P95 latency was for requests that completed successfully.
It was during this time that I realized the sheer volume of data being generated by multiple instrumented sources, even with just a few hundred users. As I dug deeper into data collection and storage, I came across a concept called “data cardinality.” So let us explore how high and low cardinality data can help us understand our application effectively.
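To make those questions concrete, here is a minimal sketch in Python. The request records and field names (timestamp, route, status, latency_ms) are made up for illustration; a real metrics or tracing backend would do this aggregation for us, but the raw, per-request data looks roughly like this before it is summarized:

```python
from datetime import datetime, timedelta

# Made-up request records; the field names are assumptions for illustration.
requests = [
    {"timestamp": datetime(2023, 6, 1, 12, 0, 5), "route": "/checkout", "status": 500, "latency_ms": 320},
    {"timestamp": datetime(2023, 6, 1, 12, 1, 10), "route": "/login", "status": 200, "latency_ms": 95},
    {"timestamp": datetime(2023, 6, 1, 12, 2, 30), "route": "/checkout", "status": 200, "latency_ms": 210},
    {"timestamp": datetime(2023, 6, 1, 12, 3, 45), "route": "/login", "status": 200, "latency_ms": 120},
]

now = datetime(2023, 6, 1, 12, 5, 0)
window = timedelta(minutes=5)

# How many requests encountered errors within the last 5 minutes?
recent = [r for r in requests if now - r["timestamp"] <= window]
errors = [r for r in recent if r["status"] >= 500]
print(f"errors in last 5 minutes: {len(errors)}")

# P95 latency of successful requests (simple nearest-rank percentile).
ok_latencies = sorted(r["latency_ms"] for r in recent if r["status"] < 400)
p95 = ok_latencies[max(0, int(round(0.95 * len(ok_latencies))) - 1)]
print(f"P95 latency of successful requests: {p95} ms")
```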
What is Data Cardinality?
Data cardinality does not typically refer to a specific numerical value. Instead, it is more common to classify cardinality as a dimension that is either high or low: high cardinality refers to a situation where there are numerous unique values, while low cardinality is characterized by a significant number of repeated values.
In simpler terms, high cardinality data enables us to analyze specific events, transactions, or requests at a granular level. It provides valuable insights that allow us to efficiently identify and resolve issues. For instance, in an e-commerce application, high cardinality data includes user-specific information such as user IDs and purchase history. Analyzing this data helps optimize personalization and address user-specific concerns.
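As a rough illustration (the event fields and helper below are hypothetical, not taken from any particular tool), a high-cardinality event carries attributes that are unique or nearly unique per user, session, or purchase, which is exactly what makes per-user drill-down possible:

```python
import uuid

# A hypothetical high-cardinality event: several attributes are unique (or nearly
# unique) per user, session, or purchase, so the number of distinct values grows
# with traffic.
event = {
    "user_id": "u-48213",             # one value per user
    "order_id": str(uuid.uuid4()),    # one value per purchase
    "session_id": str(uuid.uuid4()),  # one value per session
    "route": "/checkout",             # low cardinality by comparison
    "status": 200,
    "latency_ms": 245,
}

# Filtering on a high-cardinality attribute is what enables per-user drill-down.
def events_for_user(events, user_id):
    return [e for e in events if e["user_id"] == user_id]

print(events_for_user([event], "u-48213"))
```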
On the other hand, low cardinality data provides a broader understanding of system behavior. Instead of focusing on individual users, it allows us to analyze trends and identify system-wide issues. For example, low cardinality data may include metrics like average response time and error rate. In the case of the e-commerce application mentioned earlier, it could involve examining which products are most frequently purchased by a particular group of users or analyzing the average latency of different pages such as login, add to cart, and checkout. By analyzing this data, we gain insights into performance trends and can make informed decisions about system optimization and resource allocation.
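Continuing the same hypothetical sketch, collapsing those per-request events into per-page aggregates is what produces low-cardinality data: however many users there are, the result has only a handful of series, one per route.

```python
from collections import defaultdict

# Collapse per-request events (same hypothetical fields as above) into
# per-route aggregates: average latency and error rate per page.
def aggregate_by_route(events):
    stats = defaultdict(lambda: {"count": 0, "errors": 0, "latency_sum": 0})
    for e in events:
        s = stats[e["route"]]  # "route" has only a few distinct values
        s["count"] += 1
        s["errors"] += 1 if e["status"] >= 500 else 0
        s["latency_sum"] += e["latency_ms"]
    return {
        route: {
            "avg_latency_ms": s["latency_sum"] / s["count"],
            "error_rate": s["errors"] / s["count"],
        }
        for route, s in stats.items()
    }
```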
Pros of using High Cardinality Data:
- It allows detailed and granular analysis at the individual event or transaction level, providing insights into specific occurrences and enabling precise troubleshooting and optimization.
- It makes it easy to identify and isolate issues at the level of a specific user, request, or component, which enables the identification of user-specific patterns, preferences, and behaviors and facilitates personalized user experiences.
Cons of using High Cardinality Data:
- It often generates a larger volume of data, which requires more storage resources. This can lead to increased costs and infrastructure requirements for data storage and processing.
- Analyzing high cardinality data can be more complex due to the larger number of distinct values and the potential for outliers or noise. It may require more advanced analytics techniques and tools to extract meaningful insights.
Pros of using Low Cardinality Data:
- It provides a summarized view of system behavior and performance, simplifying trend analysis, identifying system-wide issues, and enabling strategic decision-making.
- It offers insights into overall system health indicators and supports capacity planning and resource allocation without requiring detailed examination of individual events or transactions.
Now the question arises: which type of data should we utilize?
There is no single defined answer to this question, as it ultimately depends on the specific use case and what we aim to accomplish, which determines whether high or low cardinality data is required.
However, by striking a balance between high and low cardinality data in observability, we can achieve a comprehensive understanding of a complex system’s performance and behavior. High cardinality data allows us to focus on specific areas of interest, while low cardinality data provides a broader view of system-wide patterns and trends. This holistic perspective empowers us to optimize complex systems effectively.
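One way this balance often plays out in practice is to keep the full high-cardinality event for later drill-down while updating only bounded, low-cardinality aggregates for dashboards and alerts. The sketch below uses a plain list and dict as hypothetical stand-ins for a trace/log store and a metrics store:

```python
# Hypothetical stand-ins: a list for the trace/log store, a dict for the metrics store.
def record_request(event, event_store, metrics):
    # Keep the full high-cardinality event (user_id, order_id, ...) so individual
    # requests can be debugged later.
    event_store.append(event)

    # Update only bounded, low-cardinality aggregates for dashboards: route plus a
    # coarse status class ("2xx", "5xx", ...).
    key = (event["route"], f"{event['status'] // 100}xx")
    metrics[key] = metrics.get(key, 0) + 1
```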