*Many readers have been asking for more “practical” advice on their database models, so I’ve asked a fellow healthcare data architect to lend a hand. Tom Maloney is a Senior Data Architect for Stockamp and Associates with over 25 years of experience working with and designing databases for most industries. Tom has done a lot of freelance contracting through his own company, where he lives and breathes data modeling. In this guest article Tom discusses the pitfalls of allowing NULLs in healthcare databases, and his suggestions and arguments have a lot of merit. As usual, recommendations like the ones Tom is making do not apply in all circumstances, but they are worth reviewing.*

On our patient claim system, the hospital wanted to know the average payment amount received between two dates. As we all know, the average is the sum of the values divided by the number of rows, so we constructed a simple query to calculate it.

Select SUM(Amount)/COUNT(*) FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

After the answer was returned, we remembered that SQL provides a function that calculates and returns the average, so we decided to use it. When we ran this query

Select AVG(Amount) FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

we had a completely different answer. Which one was correct, or could they both be wrong? What caused the difference? The culprit turned out to be NULL values in some of the columns.

To prove this we performed a little experiment by creating the following table, called Payment, and populating it with ten rows. We allowed some of the columns in a few rows to contain NULL values (note: the syntax is Transact-SQL from Microsoft SQL Server).

Table Creation:

CREATE TABLE [dbo].[Payment](
[Payment_Key] [int] NOT NULL,
[Amount] [money] NULL,
[EffectiveDate] [datetime] NULL,
[Note] [varchar](1024) NOT NULL DEFAULT (''),
[PaymentRefNumber] [varchar](16) NOT NULL,
PRIMARY KEY CLUSTERED
(
[Payment_Key] ASC
) WITH (IGNORE_DUP_KEY = OFF)
)
GO

Populate the Table:

INSERT INTO [dbo].[Payment] VALUES (1, $125.00, NULL, 'Payment 1', '1') -- NULL in date
INSERT INTO [dbo].[Payment] VALUES (2, $125.00, '20060403', 'Payment 2', '2')
INSERT INTO [dbo].[Payment] VALUES (3, NULL, '20060404', 'Payment 3', '3') -- NULL in Amount
INSERT INTO [dbo].[Payment] VALUES (4, $125.00, '20060404', 'Payment 4', '4')
INSERT INTO [dbo].[Payment] VALUES (5, $125.00, '20060405', 'Payment 5', '5')
INSERT INTO [dbo].[Payment] VALUES (6, $125.00, '20060405', 'Payment 6', '6')
INSERT INTO [dbo].[Payment] VALUES (7, NULL, NULL, 'Payment 7', '7') -- NULL Amount and date
INSERT INTO [dbo].[Payment] VALUES (8, $125.00, '20060406', 'Payment 8', '8')
INSERT INTO [dbo].[Payment] VALUES (9, $125.00, NULL, 'Payment 9', '9') -- NULL date
INSERT INTO [dbo].[Payment] VALUES (10, $125.00, '20060407', 'Payment 10', '10')
GO

If you add all of the values in the Amount column you get $1000.00, and there are ten rows, so the average should be $100.00, right? Not really. We were only looking for rows between Mar 30 and Apr 10, 2006. A NULL can represent any value; it is simply unknown at this time, and we do not know whether NULL means the value was never entered or whether it is a placeholder for a valid date within the range we are interested in.

As we will find out a little later, SQL sometimes counts NULLs and other times ignores them. With the BETWEEN predicate, rows with a NULL EffectiveDate are excluded, which leaves only seven rows to divide into the sum of the amounts. In addition, one of the rows with a date in range contains a NULL Amount, leaving only six Amounts to sum (both the SUM and AVG functions ignore NULLs), for a total of $750.00; divided by seven rows, we have an average of $107.14 (rounded). Not $1000.00 divided by 10 rows yielding a $100.00 average.

Select SUM(Amount)/COUNT(*) FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

The result is **$107.1428**, which is the answer we were looking for. Let’s try using the AVG function and see what is returned:

Select AVG(Amount) FROM [dbo].[Payment]
WHERE EffectiveDate BETWEEN '20060330' AND '20060410'

This time for the same set of data we got **$125.00** as an average. What’s going on here? The answer lies in how SQL treats NULLs.
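The discrepancy is easy to reproduce outside SQL Server. Here is a minimal sketch in Python using the built-in sqlite3 module as a stand-in (an assumption on my part — the article uses T-SQL, but the aggregate NULL semantics are the same):

```python
import sqlite3

# Rebuild the article's ten-row Payment table in an in-memory SQLite database.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Payment (Payment_Key INTEGER PRIMARY KEY, "
            "Amount REAL, EffectiveDate TEXT)")
rows = [
    (1, 125.0, None),        # NULL in date
    (2, 125.0, "20060403"),
    (3, None,  "20060404"),  # NULL in Amount
    (4, 125.0, "20060404"),
    (5, 125.0, "20060405"),
    (6, 125.0, "20060405"),
    (7, None,  None),        # NULL Amount and date
    (8, 125.0, "20060406"),
    (9, 125.0, None),        # NULL date
    (10, 125.0, "20060407"),
]
cur.executemany("INSERT INTO Payment VALUES (?, ?, ?)", rows)

where = "WHERE EffectiveDate BETWEEN '20060330' AND '20060410'"
# SUM skips the one NULL Amount (6 values = $750), COUNT(*) counts all 7 rows.
sum_over_count = cur.execute(
    f"SELECT SUM(Amount) * 1.0 / COUNT(*) FROM Payment {where}").fetchone()[0]
# AVG skips the NULL Amount in both numerator and denominator: $750 / 6.
avg = cur.execute(f"SELECT AVG(Amount) FROM Payment {where}").fetchone()[0]

print(round(sum_over_count, 4))  # 107.1429
print(avg)                       # 125.0
```

The two answers diverge exactly as in the article: the denominator is 7 in one query and 6 in the other.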

The following provides a deeper look into how SQL handles NULLs.

Other than the special way SQL Server has to store a NULL, and the extra logic it needs to identify a column containing one, here are a few favorites (comments from C. J. Date[1]):

- To test in a WHERE clause whether a field is null, SQL provides the special comparison “field IS NULL.” It is not intuitively obvious why the user has to write “field IS NULL” and not “field = NULL” – especially as “field = NULL” is what is used in the SET clause of the UPDATE statement (and the SET assignment process) to update a field to the null value. In fact, the clause “WHERE field = NULL” is syntactically illegal in standard SQL, although SQL Server’s syntax does allow it (and some vendors provide a SET statement to control this behavior).
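The “field = NULL” trap can be demonstrated directly. A small sketch using Python’s sqlite3 (an assumed stand-in; SQLite also accepts the comparison syntactically, and like ANSI SQL it evaluates it to unknown for every row):

```python
import sqlite3

# One non-null row and two null rows in a throwaway table.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE T (field INTEGER)")
cur.executemany("INSERT INTO T VALUES (?)", [(1,), (None,), (None,)])

# "= NULL" parses fine but the predicate is unknown for every row,
# so the WHERE clause never lets anything through.
equals_null = cur.execute(
    "SELECT COUNT(*) FROM T WHERE field = NULL").fetchone()[0]
# "IS NULL" is the special-case test that actually finds the null rows.
is_null = cur.execute(
    "SELECT COUNT(*) FROM T WHERE field IS NULL").fetchone()[0]

print(equals_null)  # 0
print(is_null)      # 2
```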
- SQL92 has solved the problem of comparing NULLs by adding a new predicate of the form <search condition> IS [NOT] TRUE | FALSE | UNKNOWN, which lets you map any combination of three-valued logic onto two-valued logic. For example, ((Age < 18) AND (Gender = 'F')) IS NOT FALSE will return TRUE if (Age IS NULL) or (Gender IS NULL) and the remaining condition is not FALSE.
- NULL values are considered duplicates of each other for the purposes of UNIQUE, DISTINCT, and ORDER BY, but not for the purposes of WHERE and GROUP BY. NULL values are also considered greater than all non-null values for the purposes of ORDER BY, but not for the purposes of WHERE.
- NULL values are always eliminated from the argument to a built-in function such as SUM or AVG, regardless of whether DISTINCT is specified in the function reference – except for the case of COUNT(*), which counts all rows, including duplicates and including all-null rows. Thus for example, given:

SELECT AVG(Status) FROM S -- Result: x
SELECT SUM(Status) FROM S -- Result: y
SELECT COUNT(*) FROM S -- Result: z

there is no guarantee that x = y/z.

- Likewise, the function reference SUM(F) is *not* semantically equivalent to the expression f1 + f2 + … + fn, where f1, f2, …, fn are the values appearing in field F at the time the function is evaluated. Perhaps even more counter-intuitively, the expression

SUM (F1 + F2)

is not equivalent to the expression

SUM (F1) + SUM (F2)
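This inequality is easy to see with a three-row example (the table and values here are hypothetical, sketched in Python’s sqlite3 rather than T-SQL): any row where either field is NULL makes F1 + F2 NULL, and SUM then drops that entire row, whereas SUM(F1) and SUM(F2) each drop only their own NULLs.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE T (F1 INTEGER, F2 INTEGER)")
# One complete row, plus two rows with a NULL on one side each.
cur.executemany("INSERT INTO T VALUES (?, ?)", [(1, 10), (2, None), (None, 20)])

# F1 + F2 is NULL for the last two rows, so only (1, 10) contributes.
sum_of_pairs = cur.execute("SELECT SUM(F1 + F2) FROM T").fetchone()[0]
# Each SUM skips only its own NULL: SUM(F1) = 3, SUM(F2) = 30.
pair_of_sums = cur.execute("SELECT SUM(F1) + SUM(F2) FROM T").fetchone()[0]

print(sum_of_pairs)  # 11
print(pair_of_sums)  # 33
```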

- Since by definition NULL represents an unknown value, we define the results in every case to be *unknown* (i.e., NULL) also, rather than *true* or *false*. To deal with NULL values properly, therefore, it is necessary to adopt 3-valued logic in place of the usual 2-valued logic. The 3-valued logic is defined by the truth tables shown below. Note that the *unknown* or null truth-value can reasonably be interpreted as “maybe.”

__AND__  T  ?  F
T        T  ?  F
?        ?  ?  F
F        F  F  F

__OR__   T  ?  F
T        T  T  T
?        T  ?  ?
F        T  ?  F

__NOT__
T        F
?        ?
F        T
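The truth tables can be spot-checked directly; in the sketch below (Python’s sqlite3, assumed as a stand-in for SQL Server) NULL plays the role of the unknown truth value:

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()
# Tiny helper: evaluate a scalar SQL expression and return the result.
q = lambda expr: cur.execute(f"SELECT {expr}").fetchone()[0]

print(q("NULL AND 0"))  # 0: unknown AND false is definitely false
print(q("NULL AND 1"))  # None: unknown AND true stays unknown
print(q("NULL OR 1"))   # 1: unknown OR true is definitely true
print(q("NULL OR 0"))   # None: unknown OR false stays unknown
print(q("NOT NULL"))    # None: NOT unknown stays unknown
```

Note that AND and OR can still produce a definite answer when one operand decides the outcome regardless of the unknown; only when the NULL could swing the result does the expression stay NULL.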

- Consider the question of whether sets are allowed to contain NULL values. Suppose, for example, that the collection C = {1, 2, 3, ?} is to be permitted as a legal set. There are two possibilities:

- The particular value the null in C stands for is of course unknown, but it is known to be distinct from 1, 2, and 3.
- The NULL value in C is completely unknown (i.e., it may in fact stand for one of the values 1, 2, or 3), in which case the cardinality of C is in turn unknown (it may be either 3 or 4).

- In a BETWEEN predicate, if both <low value expression> and <high value expression> are NULL, the result is *unknown* for any value of <value expression>. If <low value expression> or <high value expression> is NULL, but not both, the result is determined by the value of <value expression> and its comparison with the remaining non-NULL term. If <value expression> is NULL, the result is *unknown* for any values of <low value expression> and <high value expression>.
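These three cases can be verified directly, since BETWEEN expands to (x >= low AND x <= high). A sketch in Python’s sqlite3 (an assumed stand-in; the three-valued results match the standard’s description):

```python
import sqlite3

cur = sqlite3.connect(":memory:").cursor()

# Low bound NULL, value within the high bound: the comparison stays unknown.
low_null_maybe = cur.execute("SELECT 5 BETWEEN NULL AND 10").fetchone()[0]
# Low bound NULL, but 15 <= 10 is FALSE, so the AND is FALSE regardless.
low_null_false = cur.execute("SELECT 15 BETWEEN NULL AND 10").fetchone()[0]
# The value itself NULL: unknown no matter what the bounds are.
value_null = cur.execute("SELECT NULL BETWEEN 1 AND 10").fetchone()[0]

print(low_null_maybe)  # None
print(low_null_false)  # 0
print(value_null)      # None
```

The middle case shows why “one NULL term” does not always mean “unknown result”: a definite FALSE on the non-NULL side decides the conjunction.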

On the whole I define all columns as NOT NULL and provide default values when the column is not required. But what about dates and Boolean columns where the data is not known at time of entry? There are times in designing a healthcare database schema when NULL values may be allowed, but these columns need to be handled as exceptions rather than the norm. Usually I find a workaround; for example, if a patient’s gender is not always known at the time of entry, instead of using a Boolean, use a single character to hold a code (F=Female, M=Male, O=Other, U=Unknown, A=Ambiguous, N=Not applicable).

Another example is dates. If a date is not known at time of entry, I provide a default. Most dates record either an event in time or a date range (Begin and End dates). For dates that represent a point in time I choose a low (e.g., 01/01/1900) or high (e.g., 11/27/4637) default date, depending on how the column will be used in queries. If developers or anyone else want to see a NULL value returned, I create a view over the table that uses NULLIF[2] to return a NULL (e.g., NULLIF(MyDate, '19000101')). For columns that represent a date range, I default the From Date to 01/01/1900 and the Thru Date to 11/27/4637.
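The sentinel-plus-view approach can be sketched as follows (Python’s sqlite3 standing in for SQL Server; the table, view, and column names here are hypothetical): the base table stores the '19000101' default instead of NULL, and a view applies NULLIF so that consumers who want NULLs see them.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
# Base table: the date column is NOT NULL with a low sentinel default.
cur.execute("CREATE TABLE Payment (Payment_Key INTEGER PRIMARY KEY, "
            "EffectiveDate TEXT NOT NULL DEFAULT '19000101')")
# View: NULLIF turns the sentinel back into a NULL for display purposes.
cur.execute("CREATE VIEW Payment_V AS "
            "SELECT Payment_Key, "
            "NULLIF(EffectiveDate, '19000101') AS EffectiveDate "
            "FROM Payment")
cur.execute("INSERT INTO Payment (Payment_Key) VALUES (1)")  # date unknown
cur.execute("INSERT INTO Payment VALUES (2, '20060403')")

base = cur.execute(
    "SELECT EffectiveDate FROM Payment ORDER BY Payment_Key").fetchall()
viewed = cur.execute(
    "SELECT EffectiveDate FROM Payment_V ORDER BY Payment_Key").fetchall()

print(base)    # [('19000101',), ('20060403',)]
print(viewed)  # [(None,), ('20060403',)]
```

Queries against the base table get predictable, NULL-free behavior; anyone who prefers NULLs reads through the view instead.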

When my queries run against range dates with defaults, I get the behavior I expect. When the range date columns contain NULLs, the results are not what is expected, or I have to write special SQL to handle them. In the example, by changing the table to disallow NULLs and replacing unknown dates with 19000101 and unknown Amounts with $0.00, both AVG and SUM(Amount)/COUNT(*) return the same value.
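Reworking the earlier ten-row experiment with these defaults shows the two queries converging (again a Python sqlite3 sketch, an assumed stand-in for the T-SQL original): unknown dates become '19000101' and fall outside the BETWEEN range, and the one in-range unknown Amount becomes $0.00, so both queries divide $750 by 7.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE Payment (Payment_Key INTEGER PRIMARY KEY, "
            "Amount REAL NOT NULL, EffectiveDate TEXT NOT NULL)")
rows = [
    (1, 125.0, "19000101"),  # unknown date -> low default, out of range
    (2, 125.0, "20060403"),
    (3, 0.0,   "20060404"),  # unknown amount -> $0.00
    (4, 125.0, "20060404"),
    (5, 125.0, "20060405"),
    (6, 125.0, "20060405"),
    (7, 0.0,   "19000101"),  # both unknown -> defaults
    (8, 125.0, "20060406"),
    (9, 125.0, "19000101"),  # unknown date -> low default
    (10, 125.0, "20060407"),
]
cur.executemany("INSERT INTO Payment VALUES (?, ?, ?)", rows)

where = "WHERE EffectiveDate BETWEEN '20060330' AND '20060410'"
sum_over_count = cur.execute(
    f"SELECT SUM(Amount) / COUNT(*) FROM Payment {where}").fetchone()[0]
avg = cur.execute(f"SELECT AVG(Amount) FROM Payment {where}").fetchone()[0]

# With no NULLs left, SUM/COUNT(*) and AVG agree: 750.0 / 7 both times.
print(round(sum_over_count, 4))  # 107.1429
print(round(avg, 4))             # 107.1429
```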

All character and variable-character data types I default to an empty string. When the column is selected, it is displayed as an empty (invisible) value. By using NOT NULL data types you are assured that the results returned will be what you expect, without having to handle NULLs differently. In addition, some vendors’ database performance improves, because columns that allow NULLs must be evaluated differently for comparisons. If NULLs are used, try to minimize their use.


pete: Wow, nice article. Very well constructed… I’m embarrassed it took me so long to read it 😉

What I find interesting is that you used “SUM” and “COUNT” in the first pass at all. Being used to dealing with the lowest common denominator of data integrity in HIT, my first impulse is always to eyeball the raw data first. Saying

Select SUM(Amount)/COUNT(*) FROM [dbo].[Payment]

makes a lot of assumptions about the quality of the Payment table which, I have learned the hard way, aren’t always true. I always start with

Select Amount FROM [dbo].[Payment]

then eyeball the data returned and count the rows. Then (baby steps)

Select COUNT(*) FROM [dbo].[Payment]

Still make sense? Then

Select SUM(Amount) FROM [dbo].[Payment]

Still looking copacetic? Great, now I am ready to

Select SUM(Amount)/COUNT(*) FROM [dbo].[Payment]

Sad, but true. The lowest common denominator of data analysis is often the best place to start, particularly if you’re dealing with years of poorly maintained data on a 20-year-old legacy system.