Monday, May 18, 2009

Response to Comments on ‘To Date or Not To Date’

Steve Jones:  I agree with you, and I think I’m going to start from the beginning and do a series on this.  I guess I wanted to know if there was interest out there for this topic…and I think that question has been answered :-)

smunson:  I’m not sure what you mean by ‘re-use’, but I think you mean that you want to reuse an old row of data?  You can do this, but then you won’t have any history of past promotions and when they applied.  And if that’s OK for your implementation, then you don’ t need Effective Dating at all.  The point I was trying to make is:  how do you structure a table when you need to have historical, current, and future rows in it?  That’s the scenario I’m dealing with.  Once you have to deal with these requirements, then the complications start to appear…

larry.smith:  Yes, you have part of the idea that I’m trying to convey…but the missing part is:  why should we need to write all this trigger or constraint logic by hand?  I believe that a new datatype is in order that handles these functions seamlessly so we don’t have to.  That’s the rationale behind the INTERVAL datatype.

randy:  What you mentioned is another option for dealing with Effective Dating…using a StartDate and Duration instead of a StartDate and an EndDate.  I went with the 2 dates because I thought it would be simpler to do a BETWEEN instead of having to calculate the EndDate all the time.  Although you could do this with a computed column to simplify things, it would take up more CPU.  I guess I prefer storing the dates because the most common queries in my scenario call for a BETWEEN and I have little use for the Duration…although I could calculate that if necessary as well.

Manie Verster:  Sure, you can run reports on the table that has the 2 dates and things appear to work fine, but how would you prevent 2 projects from overlapping?  And how would you handle Referential Integrity from other tables?  How about gaps in the dates…would you allow or disallow them?  There are a whole host of issues that appear when you try to incorporate Effective Dating, and this was the point of the article.  Many of these problems have to be solved by programming, and my argument is that with an INTERVAL datatype this could be built into the Database Engine and make things much simpler.

As to your comment about the time, it depends on the granularity of time in the requirements.  If you only need to keep track of DATES without a TIME, then you don’t need the time in there as it will only sow confusion.  But there are scenarios where you may want rows to appear or disappear in the middle of the day, and you would need the TIME component for that.

marco:  I’m not sure I agree with the Database sticking to primitive types.  After all, what is a ‘primitive’ type?  Years ago you could have said that the DATETIME datatype was very complex and should not be handled in the database either.  Yet here it is and it’s been a resounding success.  Recently, SQL 2005 and 2008 have come out with several new datatypes based on the CLR integration that store more complex information and they have created some very elegant solutions with these new datatypes.  Check out the Spatial and Hierarchy datatypes in SQL 2008, for example.  So DB vendors are constantly pushing back the limits of what is ‘too complex’ to be represented as a datatype.  The question that remains is:  will this happen for date/time intervals?  Maybe we’re not at the point yet where we know enough about it to standardize it and encapsulate it into a datatype.    But maybe we are….the INTERVAL datatype would be complex, and there would probably be many attributes that you would be able to set to ‘customize’ it for your purposes, but in the end we are dealing with date periods, which are a very common component of everyday life, so they should be encapsulated in a datatype if possible.

Thanks for the feedback!

Wednesday, May 13, 2009

To Date or Not To Date….

OK….so I’m in the middle of a very interesting Effective Dating project…basically we’re dealing with some tables that only store current data, but there is a need for future-dated data as well.  Because the base table can’t hold that data, we have to come up with additional tables and a process to store future-dated data and then move it into the base table whenever we get to the future date.  Very cumbersome and fragile, and that’s why we’re looking at this project.

There is a great book on this subject by Richard Snodgrass which you can find at Amazon…this book is highly recommended (read:  required) for anyone looking at Effectively Dating tables.  Trust me, you won’t believe how complicated the subject is…here is the link to the book:

http://www.amazon.com/Developing-Time-Oriented-Database-Applications-Management/dp/1558604367/ref=sr_1_1?ie=UTF8&s=books&qid=1242268718&sr=1-1

There is also a book by C.J. Date on the subject, titled “Temporal Data & the Relational Model”, but I haven’t read it….if anyone has read it please feel free to post your feedback on it!

Anyways, the book makes an interesting case for some major enhancements to the Relational Database products out there, SQL Server included.  Basically, what you need when effectively dating tables is to add the valid date/time period for each row.  So, if you have an e-commerce website, for example, and you want to schedule a sales promotion for 3 days, you would indicate in your PROMOTION table that this particular sales promotion is only valid for 3 days.  The beauty of the solution is that the SQL queries in the application would query the tables by using the system date, as in “give me the promotions that are effective for today (the system date)”…so the website content can change as needed without any manual intervention. 

If the table was not effectively dated, you would have to schedule a job to run at the start date/time of the promotion and make all the data changes right then and there.  Of course, the promotion will probably start at midnight, and something will break when the scheduled job runs.  The promotion won’t show on the website, and the DBA will get called at half-past midnight in order to fix the problem to the tune of frantic curses and imprecations from the pained executive staff…but I won’t elaborate any further on the joys of our profession :-)

Now, how do you specify a valid date/time period for each row?  The obvious answer at first glance is to add 2 date columns:  StartDate and EndDate, and then add them to the Primary Key of the table.  That should work, right?

You won’t believe how off the mark that solution is.  Because what’s happening here is that the database is just storing 2 dates….it’s not really storing a date/time PERIOD (or INTERVAL).  The database doesn’t know that this period should include ALL DATES IN BETWEEN the 2 dates you’re storing.  So it’s possible to insert rows that overlap… as long as the StartDate and EndDate are different the Primary Key constraint will allow the INSERT…but you probably won’t want 2 of the same promotions running at the same time…..

And what about Foreign Keys?  As it turns out, these won’t work either….in our example, if you have a FK in the ORDER table pointing to the PromotionID, well now you can have duplicate promotionIDs in the PROMOTION table.  What if you have a promotion that you run every year for 3 weeks?  You keep the same promotion ID, because the substance of the promotion is the same, but every year you insert a new row with different effective dates.  So, if a customer bought something from PromotionID 6, and you’ve been running that promotion for 5 years, you now have 5 rows with PromotionID = 6 in the PROMOTION table.  The FK can’t handle that…it needs a unique constraint to point to.

So you need lots of constrants or triggers to get the job done….this is what the Snodgrass book deals with…..and by the way, the above is just a delicious sampling of the issues encountered…it’s not an all-inclusive list. 

In response to this, I have entered a suggestion in MS Connect in order to enhance SQL Server to fix this problem.  The solution entails defining a DATE/TIME PERIOD or INTERVAL datatype.  It turns out that the ANSI SQL standard already has this, but it’s not implemented in SQL Server.  Below I pasted the text of the Connect entry…if it sounds interesting to you, please follow this link and vote for it: 

https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=432281

Hefty subject…I think I should get 3 cents for this one…. :-)

SB

The text of the MS Connect entry follows:

“Problem:  SQL Server does not have an ANSI INTERVAL data type.  When we create effectively dated tables we have to simulate a time period by using 2 DATETIME columns (StartDate, EndDate).  However, often we don't want to allow 2 rows to have overlapping date ranges and many times we don't want to allow gaps in there either.  All of these validations have to be done in SQL.  It is cumbersome and probably doesn't perform as well as a data type would.

Solution:  Create an 'Interval' or 'Time Period' data type.  It should probably contain 2 values (StartDate, EndDate) stored in the same column and it can then treat all time periods between these 2 dates as a continuous length of time instead of just having 2 separate date columns in the DB. 

It should also contain a special value for 'forever' or 'until the end of time'.  This is heavily used for rows that don't currently have an end date.  NULLs are often used for this, but they create problems such as not being able to participate in a Primary Key and they also are treated differently in SQL and can cause performance problems.  So many times people just use an artificial value like '9999-12-31' for this end date, but this is not standardized, nor is its meaning obvious to the user.

It should have functions or operators to check for the gaps and overlaps. 

It can also contain options to make it more flexible, such as "should the period be inclusive of the start & end dates?" and "allow gaps/overlaps?".

The benefits of this would be that we don't have to code triggers and also the logic would be very neatly encapsulated in the datatype and would not clutter up our SQL excessively.  I imagine it would be faster as well.

Microsoft already did something like this in SQL 2008, where they created the Spatial Data Types.  Maybe some of the existing features developed for Spatial support can be reused for this data type?  Spatial data already has an 'Overlaps' operator, which is similar to what we would need for this data type.”

What’s a ‘DBA’?

Most of us are tired of this overused, vague term. How many of you have clicked on a 'DBA' job posting only to find that it's not what you do? Or asked for a DBA resume from a recruiter and gotten something other than what you expected? Or how about interviewing someone for a DBA position and finding that the candidate is not even close to what you're looking for, but can still rightly call himself a DBA....?

I got an interesting mini-speech on this a couple of years back and I've always referred to it ever since.  I can't take credit for it...the credit goes to my friend Yitzhak Khabinsky, although I have embellished it somewhat.  Here is the "mini-speech":

"Many people refer to every database professional as a 'DBA'.  This is wrong.  The term is much too vague.  What really exists is 5 separate disciplines:

1. Data Modeler - these people work with Business Analysts and developers during the requirements and design stages of the Development Lifeycle.  They usually are (or should be) one of the first people to get the requirements from the business people and translate those requirements into a data model.  They are experts at creating data models and working with data modeling applications.  It goes without saying that they are experts in relational database design from a conceptual standpoint.  They don't necessarily meddle in any one specific database platform, since much of their work is in the Logical Design Phase and is therefore platform-agnostic.

2. SQL Developer - these are the SQL experts who usually code review all the SQL written by anyone in their organization and take on the biggest SQL challenges themselves.  They are usually tied to one or more specific platforms.  They know all the ins and outs of writing clean, performant, scalable SQL and all the pitfalls to avoid.  Many times they are the last line of defense when confronting some particularly hairy SQL that is slow or is causing problems.  They also are usually the ones who take over a project after the data modeler finishes his work and develop the SQL core of new product offerings.

3. DB Administrator - these are the professionals who often work inside the "cold rooms" and deal with the actual SQL Server hardware of their organization.  They manage backups, restores, tape libraries, creation and consolidation of SQL instances, SANs, and the like.  They usually get into programming in order to automate administrative functions, but are usually not involved in application programming.  When they get involved in performance tuning it's usually the type of tuning where they don't touch the code; they try to optimize performance by laying out the database properly on the disk subsystems, making sure enough memory is available, configuring servers properly, and performing proper maintenance on the database servers.

4. ETL Developer - this is similar to the SQL developer above, but it has taken a direction of its own with the proliferation of Data Warehousing and Business Intelligence.  These professionals are usually experts in one or more of the "ETL" applications (Ascential, Informatica, SSIS, etc), which go beyond the realm of SQL development.

5. Database Architect - this is "all of the above" or close to it.  These are very senior professionals with many years of experience and can lead
teams of other Database Professionals.  In spite of many claims to this level of professionalism, there are in fact very few of these professionals around.  These professionals also get into the internals of how the database engines work and are also usually interested in database research as well.  The main litmus test is that they possess most of the 4 other skills if not all.

Also, the above 5 disciplines only apply to the OLTP world (with the exception of the ETL developer who has one foot in OLTP and the other in BI).  There are 5 other similar disciplines that apply to the BI world (Dimensional Modeler, MDX developer, Administrator, ETL Developer, BI Architect)."

So, that's the 'mini-speech'.  You will find that if you use this model you will get a FAR better idea of what you're looking for in a DBA job if you're looking for work, or a DBA candidate if you're looking to hire.  When I interview people I usually start with this mini speech and then try to put the person into one of these categories.  I have found it to be very effective in hiring top-notch candidates.  Obviously, there is some overlap...most people fit into more than one category, but you get a FAR better idea of what the person is capable of by following this model.

...and that's my 2 cents....

SB