Data Quality of Service (Data QoS)
As your need for observing data grows with the maturity of your business, you realize that the growing number of attributes you want to measure brings more complexity than clarity.
That's why, back in 2021, taking inspiration from Mendeleev's work on classifying the chemical elements, I came up with the idea of combining data quality and service levels into a single table.
Quality of Service (QoS) is a well-established concept in network engineering. It measures the overall performance of a service, such as a telephony, computer network, or cloud computing service, particularly the quality seen by the network users. In networking, you must consider several criteria to quantitatively measure the QoS, such as packet loss, bit rate, throughput, transmission delay, and availability.
Data Quality is Not Enough
Regarding data, the industry standard for trust has often been limited to data quality — something I've agreed with for a long time.
At the 2017 Spark Summit, I even introduced CACTAR (Consistency, Accuracy, Completeness, Timeliness, Accessibility, and Reliability), an acronym for six data quality dimensions, later relayed in a Medium article. And although there is no official standard, the EDM Council added a seventh.
Let's break down the seven data quality dimensions:
Accuracy (Ac)
The measurement of the veracity of data against its authoritative source. Data might be provided, but that doesn't mean it's correct. Accuracy refers to how precise data is. It can be assessed by comparing it to the original documents and trusted sources, or by confirming it against business rules.
- A customer is 24 years old, but the system identifies them as 42 years old.
- A supplier address is valid, but it is not their address.
- Fractional quantities are rounded up or down.
Fun fact: Many accuracy problems come from the data input. If you have data entry people on your team, reward them for accuracy, not only speed!
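As a sketch of how such a check might look in practice, here is a minimal Python example that flags mismatches against an authoritative source. The record layout, keys, and values are purely illustrative, not from any real system:

```python
# Hypothetical accuracy check: compare a system of record against an
# authoritative source and list every field that disagrees.
authoritative = {"C001": {"age": 24}, "C002": {"age": 31}}
system_of_record = {"C001": {"age": 42}, "C002": {"age": 31}}

def accuracy_issues(reference, candidate):
    """Return (key, field, expected, actual) tuples for each mismatch."""
    issues = []
    for key, ref_rec in reference.items():
        for field, expected in ref_rec.items():
            actual = candidate.get(key, {}).get(field)
            if actual != expected:
                issues.append((key, field, expected, actual))
    return issues

# The customer aged 24 recorded as 42 shows up as a mismatch.
print(accuracy_issues(authoritative, system_of_record))
```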
Completeness (Cp)
Data is required to be populated with a value (aka not null, not nullable). Completeness checks whether all necessary data attributes are present in the dataset.
- A missing invoice number when it is required by business rules or law.
- A record with missing attributes.
- A missing expiration month in a credit card number.
Fun fact: A primary key is always a required field.
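A completeness measurement can be sketched in a few lines of Python. The required attributes and the sample records below are illustrative assumptions:

```python
# Hypothetical completeness check: share of records where every required
# attribute is present and non-null.
records = [
    {"invoice_no": "INV-1", "amount": 100.0},
    {"invoice_no": None, "amount": 250.0},   # required value is null
    {"amount": 75.0},                        # attribute absent entirely
]

REQUIRED = ("invoice_no", "amount")

def completeness(records, required):
    ok = sum(all(r.get(f) is not None for f in required) for r in records)
    return ok / len(records)

print(completeness(records, REQUIRED))  # only 1 of 3 records is complete
```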
Conformity (Cf)
Data content must align with required standards, syntax (format, type, range), or permissible domain values. Conformity assesses how closely data adheres to internal, external, or industry-wide standards.
- The customer identifier must be five characters long.
- The customer address type must be in the list of governed address types.
- Merchant address is filled with text but not an identifying address (invalid state/province, postal codes, country, etc.).
- Invalid ISO country codes.
Fun fact: ISO country codes are two or three letters (like FR and FRA for France). If you mix the two up in the same dataset, it's not a conformity problem; it's a consistency problem.
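A conformity check on country codes might start like the sketch below. Note that this only validates the shape of the value (two or three uppercase letters); a real check would also verify membership in the official ISO 3166-1 code list:

```python
import re

# Shape-only conformity check for ISO 3166-1 style country codes.
ALPHA2 = re.compile(r"^[A-Z]{2}$")
ALPHA3 = re.compile(r"^[A-Z]{3}$")

def code_shape(code):
    """Classify a candidate country code by its syntactic shape."""
    if ALPHA2.match(code):
        return "alpha-2"
    if ALPHA3.match(code):
        return "alpha-3"
    return "invalid"

print([code_shape(c) for c in ["FR", "FRA", "fr", "F1R"]])
# ['alpha-2', 'alpha-3', 'invalid', 'invalid']
```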
Consistency (Cs)
Data should retain consistent content across data stores. Consistency ensures that data values, formats, and definitions in one group match those in another.
- Numeric formats converted to characters in a dump.
- Within the same feed, some records have invalid data formats.
- Revenues are calculated differently in different data stores.
- Strings are shortened from a max length of 255 to 32 when they go from the website to the warehouse system.
Fun fact: I was born in France on May 10th, 1971, yet in the U.S. I am a Libra (October). Date strings pass through a localization filter: in Europe, my birth date is written 10/05/1971 (day/month), but read with the U.S. month/day convention, that same string becomes October 5th, 1971.
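The birthday ambiguity takes only a few lines of Python to demonstrate: the same string parsed under the day/month and month/day conventions yields two different dates.

```python
from datetime import datetime

raw = "10/05/1971"  # one string, two possible readings

as_day_month = datetime.strptime(raw, "%d/%m/%Y")  # European convention
as_month_day = datetime.strptime(raw, "%m/%d/%Y")  # U.S. convention

print(as_day_month.strftime("%B %d"), "vs", as_month_day.strftime("%B %d"))
# May 10 vs October 05
```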
Coverage (Cv)
All records should be contained in the data store or data source. Coverage relates to the extent to which the required data is present, and flags data that should be in the dataset but is absent.
- Every customer must be stored in the Customer database.
- The replicated database has missing rows or columns from the source.
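The replication example above can be sketched as a simple set comparison. The identifiers are hypothetical; in practice you would compare keys pulled from the source and replica stores:

```python
# Hypothetical coverage check: which source records are missing from the
# replica, and what fraction of the source is covered?
source_ids = {"C001", "C002", "C003", "C004"}
replica_ids = {"C001", "C003"}

missing = source_ids - replica_ids
coverage = len(replica_ids & source_ids) / len(source_ids)

print(sorted(missing), coverage)  # ['C002', 'C004'] 0.5
```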
Timeliness (Tm)
Data must represent current conditions and be available when needed. Timeliness gauges how well data reflects current market and business conditions and whether it is accessible at the moment it is required.
- A file delivered too late or a source table not fully updated for a business process or operation.
- A credit rating change was not updated on the day it was issued.
- An address is not up to date for a physical mailing.
Fun fact: Forty-five million Americans change addresses every year.
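A timeliness check boils down to comparing the last update against a freshness window. The 24-hour window and the timestamps below are illustrative assumptions:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical freshness check: is the latest update recent enough for the
# business process that consumes it?
last_updated = datetime(2024, 5, 9, 6, 0, tzinfo=timezone.utc)
now = datetime(2024, 5, 10, 8, 0, tzinfo=timezone.utc)

is_timely = (now - last_updated) <= timedelta(hours=24)
print(is_timely)  # False: the source is more than a day stale
```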
Uniqueness (Uq)
How much data can be duplicated? Uniqueness supports the idea that no record or attribute is recorded more than once: each record and attribute should be one-of-a-kind, aiming for a single, unique data entry (yeah, one can dream, right?).
- Two instances of the same customer, product, or partner with different identifiers or spelling.
- A share is represented as equity and debt in the same database.
Fun fact: data replication is not bad per se; involuntary data replication is!
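A naive duplicate detector can be sketched as below. Matching on a normalized name is for illustration only; real entity resolution is far more involved:

```python
from collections import Counter

# Hypothetical uniqueness check: find values that appear more than once
# after a simple normalization (trim whitespace, lowercase).
customers = ["Dupont", "Smith", "dupont ", "Lee"]

def duplicates(values):
    normalized = [v.strip().lower() for v in values]
    return [v for v, n in Counter(normalized).items() if n > 1]

print(duplicates(customers))  # ['dupont']
```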
Those seven dimensions are pretty well-rounded. As an industry, it's probably time to say, "Good enough." Of course, it completely ruins my CACTAR acronym (and its great backstory).
But I still feel it is not enough. Data quality does not answer questions about end-of-life, retention period, and time to repair when broken.
So now let's look at service levels.
Service levels complement quality
While data quality describes the condition of the data, service levels give you precious information about the expectations around its availability, delivery, and more.
Here is a list of service-level indicators (SLIs) you can apply to your data and its delivery. You will have to set objectives for your production systems (service-level objectives, or SLOs) and agree on expectations with your users (service-level agreements, or SLAs).
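One way to make the SLI/SLO relationship concrete is to encode objectives as plain data and check measurements against them. The indicator names and thresholds below are illustrative assumptions, not part of any standard:

```python
# Hypothetical SLOs for one dataset, checked against measured indicators.
slos = {"availability": 0.999, "error_rate": 0.01, "latency_minutes": 60}
measured = {"availability": 0.9995, "error_rate": 0.02, "latency_minutes": 45}

# Availability must stay above its objective; the others must stay below.
breaches = [
    k for k in slos
    if (measured[k] < slos[k] if k == "availability" else measured[k] > slos[k])
]
print(breaches)  # ['error_rate']
```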
Availability (Av)
In simple terms: is my database accessible? A data source may become inaccessible for various reasons, such as server issues or network interruptions. The fundamental requirement is for the database to respond affirmatively when you call JDBC's connect() method.
Throughput (Th)
Throughput is about how fast you can access the data. It can be measured in bytes or records per unit of time.
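The arithmetic is simple; a sketch with illustrative numbers:

```python
# Throughput as records transferred per unit of time.
records_read = 1_200_000
elapsed_seconds = 60.0

throughput_rps = records_read / elapsed_seconds
print(throughput_rps)  # 20000.0 records/second
```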
Error rate (Er)
How often will your data have errors, and over what period? What is your tolerance for those errors?
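As a sketch, the error rate over a delivery window can be computed and compared against a tolerance; the counts and the 1% objective below are illustrative:

```python
# Hypothetical error-rate indicator over one delivery window.
records_delivered = 50_000
records_in_error = 120

error_rate = records_in_error / records_delivered
tolerance = 0.01  # illustrative SLO: at most 1% erroneous records

print(error_rate, error_rate <= tolerance)  # 0.0024 True
```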
General availability (Ga)
General availability in software and product management means the product is now ready for public use, fully functional, stable, and supported. Here, it applies to when the data will be available for consumption. If your consumers require it, it can be a date associated with a specific version (alpha, beta, v1.0.0, v1.2.0, etc.).
End of support (Es)
The date at which your product will not have support anymore.
For data, it means that the data may still be available after this date, but if you have an issue with it, you won't be offered a break-fix solution. It also means that you, as a consumer, will likely have to adopt a replacement version.
Fun fact: Windows 10 is supported until October 14, 2025.
End of life (El)
The date at which your product will not be available anymore. No support, no access. Rien. Nothing. Nada. Nichts.
For data, this means that the connection will fail or the file will not be available. It can also be that the contract with an external data provider has ended.
Fun fact: Google Plus was shut down in April 2019. You can't access anything from Google's social network after this date.
Retention (Re)
How long are we keeping the records and documents? There is nothing extraordinary here; like most service-level indicators, the retention length can vary by use case and legal requirements.
Frequency [of update] (Fy)
How often is your data updated? Daily? Weekly? Monthly? A linked indicator to this frequency is the time of availability, which applies well to daily batch updates.
Latency (Ly)
Latency measures the time between the production of the data and its availability for consumption.
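With production and availability timestamps in hand, latency is a simple difference; the timestamps below are illustrative:

```python
from datetime import datetime, timezone

# Hypothetical latency measurement: gap between when the data was produced
# and when it became available for consumption.
produced_at = datetime(2024, 5, 10, 2, 0, tzinfo=timezone.utc)
available_at = datetime(2024, 5, 10, 2, 45, tzinfo=timezone.utc)

latency = available_at - produced_at
print(latency.total_seconds() / 60)  # 45.0 minutes
```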
Time to detect [an issue] (Td)
How fast can you detect a problem? A problem can either be a complete break, like your car not starting on a cold morning, or something slow and silent, like the data feeding your SEC (U.S. Securities and Exchange Commission) filings being wrong for several months.
How fast do you guarantee the detection of the problem? You can also see this service-level indicator called "failure detection time."
Fun fact: squirrels (or some similar creature) ate the gas line on my wife's car. We detected the problem quickly because the gauge dropped visibly, even over a few miles. But could you even drive the car to the mechanic?
Time to notify (Tn)
Once you see a problem, how much time do you need to notify your users? This is, of course, assuming you know your users.
Time to repair (Tr)
How long do you need to fix the issue once it is detected? This is a prevalent metric for network operators running backbone-level fiber networks.
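The three time-based indicators can be derived from an incident's timeline. The timestamps are illustrative, and I compute time to repair from the moment of detection, which is one common convention:

```python
from datetime import datetime

# Hypothetical incident timeline for deriving Td, Tn, and Tr.
failed_at = datetime(2024, 5, 10, 1, 0)
detected_at = datetime(2024, 5, 10, 1, 20)
notified_at = datetime(2024, 5, 10, 1, 35)
repaired_at = datetime(2024, 5, 10, 3, 20)

td = (detected_at - failed_at).total_seconds() / 60    # time to detect
tn = (notified_at - detected_at).total_seconds() / 60  # time to notify
tr = (repaired_at - detected_at).total_seconds() / 60  # time to repair

print(td, tn, tr)  # 20.0 15.0 120.0
```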
Of course, more service-level indicators will emerge over time. Agreements follow indicators and can include penalties. You can see that the description of the service can become very complex.
Representation of Data Quality and Service Levels
To represent the elements, I needed to identify each element precisely on two axes: its position in time (periods) and its grouping (families).
Each element received additional attributes, as shown in the following illustration.
Periods & time-related
The periods are time-sensitive elements. Some orderings are pretty obvious: "end of life" definitely comes after "general availability."
Classification of some elements, however, is more subtle. For example, when data arrives in your data store, you will check accuracy before consistency, and you can check uniqueness only once you have a significant amount of data. The elements have no strict chronological link, but they happen in sequence.
The second classification to find was about grouping. How can we group those elements? Is there a logical relation between them that would make sense?
Here's what I came up with:
- Data at rest (R).
- Data in motion (M).
- Performance (P).
- Lifecycle (C) of the product itself.
- Behavior (B) of the data: retention, refresh frequency, availability time, and latency.
- Time (T)-related indicators.
Why does classifying Data QoS elements matter?
There are a lot of benefits to classifying and defining the elements that form Data QoS, across both the service-level indicators and the data quality dimensions:
- It offers definitions we can agree on: The first step of the Information Technology Infrastructure Library (ITIL) is to set up a common vocabulary among the stakeholders. Although ITIL might not be adequate for everything, it's a good foundation. Data QoS offers an evolving framework with consistent terms and definitions.
- It provides compatibility with data contracts: The data contract needs to be built on standardized expectations. It's evident for the data retention period: you would probably not call it duration, safekeeping, or something else. However, latency and freshness are often interchanged; let's settle on latency.
- It sets the foundation: Data QoS is not carved in stone, even if it could be compared to the Rosetta stone. It supports evolution and innovation while delivering a solid base.
In this article, I shared my strong feeling, developed over the years, that data quality alone is insufficient. Although data quality is becoming increasingly standardized, it still lacks service levels.
Service levels can have a profusion of indicators and are open by nature. Combining data quality and service levels creates a higher-level set of dimensions and indicators, grouped under Data QoS.
The representation of Data QoS can be in a Mendeleev-like periodic table featuring each element in a neighboring context.
Do not hesitate to start a conversation and see how Data QoS can help your organization.