
Information Management #2 - Bad Data

Written by Charlie Harp

December 15, 2015 at 4:52 AM

In the last post we talked about a couple of topics:

  • Information is important
  • There are different categories of information that commonly exist in a software application.
  • Bad information impacts performance of an application.
  • The scope of the impact varies based on the category of the information.
  • Software is a consumer of information that is especially susceptible to the dangers of bad information.
  • Information is something useful we create from data.

In this post I am going to focus on “bad data”: the forms that it takes and how we can cope with it today and in the future. First of all, let's set the context. We are going to be talking about “badness” relative to our metaphorical software application. As was discussed in the last post, human beings have a great capability for coping with bad data but lack the speed required to process information on the scale we require.

Here are the categories of bad data that I am going to cover in this post:

Other Data

This is data that is in a coding system that the application does not understand. This is almost always data that comes from elsewhere through some interface. It can be a standard that the application does not use or proprietary data from the source that sent it. This is the easiest category of bad data to cope with (yes, really). Think about it: we know the source and we can isolate the point of entry, so all it requires is semantic interoperability. Establish a mechanism at the point of entry into your application that can evaluate incoming “foreign” terms and reconcile them to the application's native dictionaries. It will require some elbow grease initially, but a smart platform should be able to do a good bit of the heavy lifting and learn over time. It is worth mentioning that other data can also fall into the categories that follow, even after it has been reconciled.
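To make this a little more concrete, here is a minimal sketch (in Python) of what a point-of-entry reconciler might look like: known foreign codes are translated to the application's native dictionary, and anything unknown is parked for review instead of leaking into the application. The class names, the mapping table and the review queue are all hypothetical, not a reference to any particular product.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ForeignTerm:
    code_system: str   # the sender's code system identifier
    code: str
    description: str

class EntryPointReconciler:
    """Reconciles incoming "foreign" terms to the application's native dictionary."""

    def __init__(self, mappings: dict[tuple[str, str], str]):
        # (code_system, code) -> native dictionary code, curated over time
        self.mappings = mappings
        self.unreconciled: list[ForeignTerm] = []   # queue for human review

    def reconcile(self, term: ForeignTerm) -> Optional[str]:
        native_code = self.mappings.get((term.code_system, term.code))
        if native_code is None:
            # No mapping yet: isolate the term rather than let it into the application
            self.unreconciled.append(term)
        return native_code

# Usage: a known mapping resolves, an unknown one is parked for review.
reconciler = EntryPointReconciler({("SENDER-LAB", "GLU-1"): "NATIVE-2345"})
print(reconciler.reconcile(ForeignTerm("SENDER-LAB", "GLU-1", "Glucose, serum")))   # NATIVE-2345
print(reconciler.reconcile(ForeignTerm("SENDER-LAB", "XYZ-9", "Mystery test")))     # None
```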

Disemvoweled Data

This is data, whether it comes in as “other data” or it was spawned in the recesses of your own application, that is barely human readable due to missing vowels or truncated words. Some of our applications are long in the tooth... They were created many years ago, when we had database, screen and paper report field size limitations. Making a complex notion fit in 20 characters is more art than science, and it relies heavily on our amazing brains to convert it to something useful. A good semantic platform can do a lot to turn this into good data, but the best policy, especially if you control this data, is to undergo a data quality initiative. Review proprietary terms and fix them at the source. Not only will it help the software leverage the data, it may stop a less flexible human brain from misinterpreting it.
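To illustrate the kind of heavy lifting involved, here is a tiny sketch of an expansion pass over a truncated, vowel-starved term. The abbreviation table is invented for the example; a real semantic platform would rely on curated terminology content and learn new expansions over time.

```python
# Hypothetical expansion table; a real platform would use curated terminology content.
EXPANSIONS = {
    "HX": "history of",
    "FX": "fracture",
    "LFT": "left",
    "ARM": "arm",
}

def expand_disemvoweled(term: str) -> str:
    """Best-effort expansion of a truncated, abbreviation-heavy term."""
    words = term.upper().replace(".", " ").split()
    return " ".join(EXPANSIONS.get(w, w.lower()) for w in words)

print(expand_disemvoweled("HX FX LFT ARM"))   # history of fracture left arm
```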

Impostor Data

Most standard terminologies have something called a “code system ID”. This is a code that uniquely identifies a given terminology. If I have a source code for a term and a code system ID it should universally identify the term. This means if I have mapped that code system ID + source code already I am good to go... Right?

Most of the time I would say “yes”, but there are some applications where misguided users have the ability to change the term description for a given source code, either in the application dictionary OR on the instance data for a patient. This is particularly bad because (1) it is hard to spot and (2) the application will always assume the term is what the code system and source code say it is. A semantic platform can help find and reconcile these terms against a standard, but the trick is what to do with them. If your application is the source and the terminology is supposed to be a standard, then a data quality initiative is necessary. If the data comes from somewhere else and you identify it as an impostor, then isolation is probably necessary. The problem is you don't know whether the user selected the code or the term, so there is no way to know for sure which one was intended. This can also happen with proprietary local terminologies. It is important to have a policy of stability: the meaning of a code NEVER changes.
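One way to picture the detection step: compare the description stored locally against the description the standard publishes for the same code system ID and source code, and flag mismatches for human review rather than auto-correcting them, since (as noted above) you cannot know whether the code or the term was intended. The sketch below uses the SNOMED CT code system OID and a single hard-coded description as a stand-in for real standard content.

```python
# Stand-in for published standard content, keyed by (code_system_id, source_code).
STANDARD_DESCRIPTIONS = {
    ("2.16.840.1.113883.6.96", "44054006"): "Diabetes mellitus type 2",
}

def find_impostors(local_terms):
    """Yield terms whose local description no longer matches the standard."""
    for code_system_id, code, local_description in local_terms:
        standard = STANDARD_DESCRIPTIONS.get((code_system_id, code))
        if standard and standard.lower() != local_description.lower():
            # Flag for human review; do not auto-correct, since we cannot tell
            # whether the user meant the code or the edited description.
            yield (code_system_id, code, local_description, standard)

suspects = list(find_impostors([
    ("2.16.840.1.113883.6.96", "44054006", "Pre-diabetes"),              # impostor
    ("2.16.840.1.113883.6.96", "44054006", "Diabetes mellitus type 2"),  # fine
]))
print(suspects)
```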

Wrong Data

No matter how good your interfaces and policies are, there is always the chance that a user will associate data with a patient that is just not correct. This is not an interoperability problem or a term quality problem, but it is a data quality problem. Identifying wrong data is something humans do all the time. When you look at your clock and it says it's 2 AM but it's light outside, you know that you are dealing with wrong data. There are analytical mechanisms that we can put in place that evaluate a cluster of data and suggest that something isn't quite right with this picture. While this is not always considered under the heading of data quality or data governance, it still has a significant impact on the quality of our data. The best approach is to deal with this at the source, but ensuring that accurate data is entered is tricky for a number of reasons. Having a mechanism that periodically reviews instance data for contextual appropriateness would be a nice fallback, but that is also tricky. Once potentially wrong data is identified, a human must be involved as the final arbiter of its fate. Another important aspect of wrong data is making sure that, after it is removed, it does not get re-introduced accidentally.
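As a toy example of the “something isn't quite right with this picture” mechanism, here is a single invented plausibility rule applied to a cluster of patient data. A real system would carry a whole library of such rules, and every hit would go to a human as the final arbiter.

```python
from datetime import date

def implausible_findings(patient):
    """Return human-readable flags for combinations that look wrong together."""
    flags = []
    age = (date.today() - patient["birth_date"]).days // 365
    # Invented rule: a condition typically limited to adults on a young child's record.
    if age < 10 and "myocardial infarction" in patient["conditions"]:
        flags.append(f"Age {age} with 'myocardial infarction' on record - please verify.")
    return flags

patient = {"birth_date": date(2018, 5, 1), "conditions": ["myocardial infarction"]}
for flag in implausible_findings(patient):
    print(flag)   # surfaced for human review, never auto-deleted
```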

Missing Data

That’s right, no data can also be bad data. Data that is missing is as bad as data that is wrong or unusable. If I am trying to create a program for managing my diabetic patients it is important to know who they are. If a diabetic patient is not coded as such it will be that much more difficult to include them in the program. This is similar to “Wrong Data” in that it requires a mechanism that looks at a cluster of instance data and asks the question “what’s missing?” If a data gap is detected, like with wrong data, a human should be notified and they should determine if the introduction of the missing data is appropriate.
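A minimal sketch of the “what's missing?” question, using the diabetes example: if the instance data contains strong indicators (say, certain medications) but no corresponding diagnosis code, surface a suggestion for a human to confirm or dismiss. The indicator lists below are invented for illustration, not clinical content.

```python
# Invented indicator lists; a real rule would come from curated clinical content.
DIABETES_MED_HINTS = {"metformin", "insulin glargine"}
DIABETES_DX_CODES = {"E11.9"}   # ICD-10-CM: Type 2 diabetes mellitus without complications

def suggest_missing_diagnoses(medications, diagnosis_codes):
    """Suggest a possible data gap for a human to confirm or dismiss."""
    if DIABETES_MED_HINTS & {m.lower() for m in medications} and not (
        DIABETES_DX_CODES & set(diagnosis_codes)
    ):
        return ["Diabetes medications present but no diabetes diagnosis coded - review?"]
    return []

print(suggest_missing_diagnoses(["Metformin", "Lisinopril"], ["I10"]))
```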

Old Data

Time marches on, and as it does it has an impact on data. This category affects both instance data and reference data. For example, the fact that you had a broken arm when you were six may not be relevant now that you are 30... Knowing the shelf life of a piece of episodic data, and whether or not to bring it into the current context for a patient, can help reduce the noise and provide better results. Likewise, using outdated reference data can also be problematic, especially if that data drives clinical decision support.
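Here is a small sketch of the shelf-life idea: tag each kind of episodic data with a relevance window and leave it out of the current context once the window has passed. The windows below are made up for the example and are not clinical guidance.

```python
from datetime import date, timedelta

# Made-up relevance windows per data category; not clinical guidance.
SHELF_LIFE = {
    "resolved_fracture": timedelta(days=365 * 5),
    "chronic_condition": None,   # None means it never expires
}

def still_relevant(category, recorded_on, today=None):
    """Decide whether an episodic data point belongs in the current context."""
    today = today or date.today()
    window = SHELF_LIFE.get(category)
    return window is None or (today - recorded_on) <= window

print(still_relevant("resolved_fracture", date(2001, 6, 1)))   # False: decades old
print(still_relevant("chronic_condition", date(2001, 6, 1)))   # True: never expires
```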

Duplicated Data

Now that we are sharing data (feel the love), we are susceptible to a new type of problem. We are likely to receive data that is a near duplicate of the data we already have, or that we have also received from somewhere else. Being inundated with a plethora of data doppelgängers creates the risk of our instance data becoming a data junk drawer. It is not likely that the patient is taking Coumadin, warfarin and simvastatin simultaneously. Before too long we will need to evolve coping mechanisms to deal with this problem and synthesize a clinical summary of the patient’s current state.
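One coping mechanism is to normalize incoming entries to a shared concept before comparing them; for medications that usually means the active ingredient, so a brand name and its generic collapse into one entry. The tiny brand-to-ingredient table below is illustrative only; a real system would lean on a drug terminology such as RxNorm.

```python
# Illustrative brand-to-ingredient map; a real system would use a drug terminology.
BRAND_TO_INGREDIENT = {"coumadin": "warfarin"}

def deduplicate_medications(med_names):
    """Collapse brand/generic duplicates by normalizing to the ingredient."""
    seen, kept = set(), []
    for name in med_names:
        ingredient = BRAND_TO_INGREDIENT.get(name.lower(), name.lower())
        if ingredient not in seen:
            seen.add(ingredient)
            kept.append(name)
    return kept

print(deduplicate_medications(["Coumadin", "warfarin", "simvastatin"]))
# ['Coumadin', 'simvastatin'] - the warfarin entry is a doppelganger of Coumadin
```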

Uncoded Data

Often referred to as “free text terms”, this is typically what we find when an application has an “other” selection and the user gets to fill in the box. It could also come from an older application that does not use terminologies for some master data elements. The best option is to deal with this at the source. Even if you allow users to create “missing” local terms on the fly, that at least provides an architecture for reconciling those terms in a meaningful way. If that is not an option, a mechanism that assesses the text and reconciles it to a terminology can increase the likelihood that you can make use of this type of data. Natural Language Processing (NLP) solutions struggle in this use case, primarily because free text terms are rarely in the form of natural language, but there are approaches that can provide a fair degree of success.
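As a sketch of a non-NLP approach, here is a crude token-overlap matcher that scores a free text entry against candidate terminology descriptions. The candidate list and threshold are invented, and a real platform would add normalization, abbreviation handling and learned synonyms on top of something like this.

```python
# Invented candidate terms; a real platform would search full terminology content.
CANDIDATES = {
    "NATIVE-1001": "fracture of left arm",
    "NATIVE-1002": "type 2 diabetes mellitus",
}

def match_free_text(text, threshold=0.5):
    """Return (code, score) of the best candidate by simple token overlap."""
    tokens = set(text.lower().split())
    best_code, best_score = None, 0.0
    for code, description in CANDIDATES.items():
        desc_tokens = set(description.split())
        score = len(tokens & desc_tokens) / len(tokens | desc_tokens)
        if score > best_score:
            best_code, best_score = code, score
    return (best_code, best_score) if best_score >= threshold else (None, best_score)

print(match_free_text("left arm fracture"))   # matches NATIVE-1001
```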

Establishing a Data Quality Strategy

If we want to improve our ability to leverage our data, we must take action to cultivate better quality. Understanding the forms of bad data enables us to formulate short term tactics and longer term strategies for more meaningful, high fidelity data. I have shared my thoughts; please feel free to share yours. Is there a category that I missed? Do you have any good “ugly data” stories?

There is a lot of talk about data governance in healthcare. The question is, will a rigid, centrally controlled process work in our industry when it comes to master data and reference terminologies, or is there an alternate approach? The next post will be about a new way of approaching data quality across a distributed enterprise: the information ecosystem.