Sam Bendayan's 2 Cents: 2009

Sunday, August 23, 2009

Printing out long VARCHAR(MAX) strings

I've had to do lots of work recently generating very large dynamic SQL strings, and one of the biggest annoyances is that when you PRINT a VARCHAR(MAX)/NVARCHAR(MAX) string that is longer than 8,000/4,000 characters, SSMS only displays the first 8,000/4,000 characters. This makes it nearly impossible to work with when generating large SQL strings. I always like to print them out and then copy/paste them to a new window to see if they run, but you can't because of this limitation. I spoke to someone in Microsoft a while back about this and was told that this behavior is 'by design'.

I can't see how that holds water. The whole point of the (MAX) datatypes is that they support very large sizes. So not being able to print them out equals a bug as far as I'm concerned. However, I was able to find an entry in Connect about this (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=196643) and it looks like it was admitted as a bug in that entry. If you've had this problem and find it annoying, you can follow the link above and vote for the issue.

Now, here is the workaround. I created the attached SP that is able to print out all the characters in these long strings. Instead of doing a PRINT command on the string, call this SP and pass it the string as a variable. You will receive the full length of the string printed out in the output window in SSMS.

As for the details of how it works, here is the scenario:

imagine you are generating a very long NVARCHAR(MAX) Dynamic SQL string, which is much longer than 4,000 characters. You'd like to print it out the Dynamic SQL string and see if it runs separately. However, when you do a PRINT command, you are surprised to see that it only prints out the first 4,000 characters. In other words, it truncates your string. And now the string is not runnable as an SQL Batch.
So, as a workaround, you realize that there is a simple solution: use multiple PRINT commands, and print out a 4,000 character chunk of the string with each PRINT statement. The result looks like the following, and it's messy but it appears to work:
PRINT SUBSTRING(@SQL, 1, 4000)
PRINT SUBSTRING(@SQL, 4001, 4000)
PRINT SUBSTRING(@SQL, 8002, 4000)
There are several drawbacks to the above approach. First, you don't know exactly how long the string is going to be every time you run it, so you don't know how many of the above lines you have to type. And you know that if at some point you try to generate a string longer than the maximum size (12,000 characters in the above case), you will have to add more PRINT statements. However, since this is only used for debugging and not production, the fix for that is relatively easy...just add more PRINT statements when the error happens. The REAL problem is that whenever one of these 4,000-character chunks finishes printing and the other one starts, SSMS inserts a CRLF character. If it so happens that this CRLF chops off an SQL statement in mid-sentence, then you have non-working SQL. You will get a syntax error if you try to run it, and this means you have to go and remove these CRLFs manually before running the SQL. This may not seem to be a big deal, but if you're dealing with a really long 50,000 or 100,000+ line string of Dynamic SQL code, it becomes a major nuisance. There are many CRLFs to remove, and you have to do this EVERY time you regenerate the code. Very annoying.
Hence, the attached SP comes into play. Here's how it works: the first step it does is to figure out how long the string is and break it up into 4,000-character chunks, much as we did manually above. However, after this, it fixes the 'broken SQL line at the 4,000th character position' problem as follows:
- It detects if there is a broken line at the end of the string. This is as simple as assuming that the line is broken if there is a character at the exact 4,000th position. This may not always signify a problem, as a word may occassionally end on the 4,000th position, but we can discard that possibility as immaterial, since the fix for this does not otherwise break the SQL in this case.
- Once it detects that there is a broken line, it traverses the problematic 4,000-character chunk backwards from the end and stops printing this chunk on the last CRLF before the line that contains the 4,000th character. The next chunk will then start at the point where this chunk stopped, and this gives you working code no matter the string length.
  So, if you have a string that looks like this (the 4,000th position is in red bold font):

....SELECT * FROM sys.objects

SELECT * FROM sys.objects WHERE obj

ect_id LIKE '%' (this is the first line of the next 4,000-character chunk)

It prints out the following:

....SELECT * FROM sys.objects (this is the last line of the first chunk)

SELECT * FROM sys.objects WHERE object_id LIKE '%' (this is the first line of the next chunk)

...

The last line of the first string is not printed out because we know we're going to run into problems if we do. So instead, we end that string prematurely and start the next string right after the premature end of the first one. This solution assumes that you are regularly inserting CRLF characters in your Dynamic SQL, which I think is a must in order to keep them neatly formatted. Especially if they are very long strings.

Now, the bug...the only problem I haven't fixed yet is that an extra CRLF is added to the printed out code. Notice the extra CRLF in the last code sample above. However, this is a minor issue. The main thing is that the code can be copied/pasted into a new window and it will run without any manual massaging. When I have time to work on this further I'll fix the bug and repost the SP on my blog.

In summary, I think you will find this a handy SP for working with very long Dynamic SQL strings. It has served me well in printing out very large Dynamic SQL strings, which I find indispensable in debugging complex Dynamic SQL operations.

Beware the REPLACE function when doing heavy string parsing

I found something interesting the other day. I was attempting to optimize a very heavy string parsing routine using T-SQL and was having problems. I don’t normally use T-SQL for such heavy string parsing, but this was a special case of a legacy structure that I had to work with, so I had no choice. Anyway, the said routine was a table function that basically takes a fixed length text file and turns it into a table (please don’t ask why) so you can join it to other tables in a database. The thing that made this interesting is that this fixed length text file had over 4,000 rows, so in order to turn this into a table the function has to slice and dice these 4,000 rows every time it’s executed, which it does by executing a WHILE loop.

When I finished this function, it ran in about 15 seconds, which was way too slow for our needs. Unbeknownst to me, one of my colleagues was attempting the same thing and wrote his code in a slightly different way, but he got it to run in about 1 second. When I analyzed his code, I found it to be virtually identical to mine, with one difference: during every iteration I was using the REPLACE function to ‘wittle’ down the string until none of the string was left, at which point the WHILE loop would exit and the function would finish executing. In contrast, his approach did not have any REPLACE function, he was simply navigating forward in the string until he reached the end.

At this point one of my very special mental “talents” kicked in: when my mind is stuck on a problem and someone ELSE tells me the solution to it my mind immediately says “oh, I knew that!” and proceeds to give me a detailed explanation of the solution and why it works. And leaves you wondering where all that great information was hiding 5 seconds ago when you didn’t have the answer and you really needed it….you have to wonder why such useless “talents” even exist….

So, of course, once I saw this then everything became obvious: my version of the function was doing over 4,000 REPLACE operations and his wasn’t. To add to this, I have to wonder if the string datatypes, such as the VARCHAR datatype that we were both using, are implemented as immutable data types in SQL Server. Does anyone know if this is the case? If it is then it would further explain the difference in speed. Since an immutable data type doesn’t really ‘change’ the instance of the string in memory and just adds a new instance of a string, this means that my function was creating over 4,000 instances in memory of the large string that I was parsing. Talk about a waste!

I mention this because lots of SQL developers may not have much experience in string parsing, since this is normally not done using SQL. This may be old hat to many application programmers, but it’s not something that DBAs run into as frequently, so I’m hoping that others will find it as useful as I did.

So, the moral of the story is: be careful when using REPLACE in heavy string parsing operations. Like many other things, if you need it you should use it, but the trick is in knowing when you REALLY need it. In my case I thought I needed it but in reality there was a much simpler solution available, and it ended up running 15 times faster. It proves the old adage of “Keep It Simple, Stupid”…given 2 equivalent solutions the simpler one is better.

…and that’s my $.02 for today….

Friday, August 7, 2009

Effective Dating Series – Part I – THE PROBLEM

Much research has been done regarding temporal databases, but what exactly is the problem with Date/Time data and why is it so complicated? What benefits do you get by properly expressing this data in an Effective Dating scenario? In this article I’m going to provide a clear, simple answer to this question and also lay the foundation for the later parts of this series of articles that explain what happens to Relational Databases when you introduce the concept of Effective Dating. Suffice to say for now that the impact to a RDBMS is heavy; most of the concepts that are well-known and we sometimes take for granted such as Primary Keys and Foreign Keys tend to change drastically the minute that you introduce effective dates into a database schema. Nonetheless, it’s still worth the trouble to implement it in the right scenarios, and I will show one such real-world scenario in this article.

THE CONCEPT

Although we don’t always express this in our database schemas, most data is temporal; it actually only lives over a period of time. Yet in many tables that we design we don’t put in effective dates to signify this. Many times there is a good reason for this: maybe we’re not interested in historical data, or maybe we’re not interested ENOUGH in order to justify the cost in adding dates to the design.

However, there are times where putting these effective dates in will TREMENDOUSLY help out a project. Unfortunately, many times we don’t see this benefit at the start and so our initial design is not effectively dated even though it really is in our best interest. We normally realize this when we find that we need to add all sorts of workarounds just because our system is not date-aware by design. Yet many times we go on blissfully ignoring the effectively dated nature of our business and we keep on piling workarounds, each more ridiculous than the last, on top of a design that is pretending to be eternal and failing. This is what I call the “Delusions of Eternity” anti-pattern. It’s sort of like ignoring our own Mortality (we humans are also “effectively dated”), so maybe that is where this attitude comes from. Who knows, but maybe our own Mortality tends to put a negative slant on the whole subject of Effective Dating and causes us not to think about it :-)

THE SCENARIO

Let’s solidify all of this with an example. Let’s say we’re designing a database that handles promotions for retail sales in an E-Commerce environment. A common design for this is to have a Promotions table which contains all the promotion details and then a ‘PromotedProducts’ table which links promotions to products. Below is a simple diagram of this schema:

THE PROBLEM

So what’s the problem here? It seems like a good design at first, but the fact is that Promotions are by definition Effectively Dated, and this design ignores that. Soon the cracks in the design start to appear. Let’s illustrate how this design painfully falls apart:

Yes, you can store Promotion Details and associate them to Products, but what happens when a Promotion starts and then ends? You have to switch prices in the products that are affected. So, how do you do this? Well, obviously you run some kind of batch job, at exactly the right time, to change these prices. The batch job would insert the row into the PromotedProducts table at the beginning of the promotion, and then would delete the row at the end of the promotion. The correct price for the product would be written to the OrderDetail table by using a SELECT with a LEFT JOIN on the PromotedProducts table. The price from PromotedProducts would be used, and if it’s NULL, then the price from the Products table would be used instead. No big deal, right? Wrong, by several counts:

Once you accept that you have to run a batch job, things start to go downhill from there:

What if management wants to start a promotion at midnight? You’d better be up and watching in case that batch job fails.
If the batch job does fail, you’re dealing with downtime or bad pricing, both of which result in a direct cost to the business.
Because you are deleting the promotion as soon as it’s over, you don’t have the capability to analyze how many orders the promotion generated, or do any other kind of analysis. So someone will probably get the idea that they should store this information off in an Excel spreadsheet somewhere so they can report it to management. Voila! You have now created an unnecessary business process and a questionable source of data to base Management Decisions on.
Every time you do a promotion you have to set up a new batch job in order to insert the data. So there is manual intervention involved with every promotion setup, which means that every promotion turns into a Development Project, with development and testing phases. It’s a simple project, but a project nonetheless. This results in the following sorry list of circumstances:

So we see many problems here…not the least of which are frantic, screaming managers and sleepless DBAs. And this reinforces another famous saying: Data Modeling is good for your health (OK, it’s not famous, but it should be).

We also see how the effects of a bad design can spill over into the other departments in a company, raising costs and blood pressures and creating many fragile and unnecessary business processes to accommodate the bad design. Surely there must be a better way.

THE SOLUTION

Now, let’s see what happens if we break down, repent, and just accept the fact that these promotions live and die by dates. We would include 2 dates in the PromotedProducts table signifying the Start and End dates of the Promotion. The first thing we then notice is that these 2 new columns must be included in the Primary Key, since there can now be several promotions for the same product. Here is the new database schema:

All we did was add the 2 dates in the PromotedProducts table, but the real magic is what the application does with this data: the Website application would be designed so that it looks at these dates and compares them to the current date/time to see what rows should be displayed. It would look in the PromotedProducts table using the effective dates to determine if the promotions are active and should be displayed. The pricing would still be calculated using the LEFT JOIN method we described above.

Using this design, let’s look at the weaknesses of the previous design and see what’s happened to them:

No batch job needs to be run. The system is date-aware and automatically displays the right price at the right time. So, no downtime is possible from a Development standpoint, and no staying up at night :-)
Setting up a new promotion goes from being a development project to a few simple data changes done by users with a simple interface over the Promotions tables. The Development Department doesn’t even have to be involved. This produces a much nicer list of circumstances:
The promotions are persisted in the table even after they are finished, so you have the ability to go back and analyze the statistics of the promotions.

CONCLUSION:

This simple example illustrates how beneficial it is to consider including Effective Dates into your database schema. I have seen both of these scenarios at work and the difference is too vast to be ignored. You go from a frantic, stressful environment focused on how to make the systems work to a relaxed, confident environment focused on how to better run the business. And all because you let the software handle all the work. And isn’t that what software is supposed to do in the first place?

So that is the good news. In the next article of the series I will present the impact and pitfalls of the Effectively Dated design above and how to overcome them. We will build on the simple data model that we started with here and see what issues surface as you take Effective Dating further and further. You will then start to see how far RDBMS’s still have to go to properly represent this type of data.

Monday, May 18, 2009

Response to Comments on ‘To Date or Not To Date’

Steve Jones: I agree with you, and I think I’m going to start from the beginning and do a series on this. I guess I wanted to know if there was interest out there for this topic…and I think that question has been answered :-)

smunson: I’m not sure what you mean by ‘re-use’, but I think you mean that you want to reuse an old row of data? You can do this, but then you won’t have any history of past promotions and when they applied. And if that’s OK for your implementation, then you don’ t need Effective Dating at all. The point I was trying to make is: how do you structure a table when you need to have historical, current, and future rows in it? That’s the scenario I’m dealing with. Once you have to deal with these requirements, then the complications start to appear…

larry.smith: Yes, you have part of the idea that I’m trying to convey…but the missing part is: why should we need to write all this trigger or constraint logic by hand? I believe that a new datatype is in order that handles these functions seamlessly so we don’t have to. That’s the rationale behind the INTERVAL datatype.

randy: What you mentioned is another option for dealing with Effective Dating…using a StartDate and Duration instead of a StartDate and an EndDate. I went with the 2 dates because I thought it would be simpler to do a BETWEEN instead of having to calculate the EndDate all the time. Although you could do this with a computed column to simplify things, it would take up more CPU. I guess I prefer storing the dates because the most common queries in my scenario call for a BETWEEN and I have little use for the Duration…although I could calculate that if necessary as well.

Manie Verster: Sure, you can run reports on the table that has the 2 dates and things appear to work fine, but how would you prevent 2 projects from overlapping? And how would you handle Referential Integrity from other tables? How about gaps in the dates…would you allow or disallow them? There are a whole host of issues that appear when you try to incorporate Effective Dating, and this was the point of the article. Many of these problems have to be solved by programming, and my argument is that with an INTERVAL datatype this could be built into the Database Engine and make things much simpler.

As to your comment about the time, it depends on the granularity of time in the requirements. If you only need to keep track of DATES without a TIME, then you don’t need the time in there as it will only sow confusion. But there are scenarios where you may want rows to appear or disappear in the middle of the day, and you would need the TIME component for that.

marco: I’m not sure I agree with the Database sticking to primitive types. After all, what is a ‘primitive’ type? Years ago you could have said that the DATETIME datatype was very complex and should not be handled in the database either. Yet here it is and it’s been a resounding success. Recently, SQL 2005 and 2008 have come out with several new datatypes based on the CLR integration that store more complex information and they have created some very elegant solutions with these new datatypes. Check out the Spatial and Hierarchy datatypes in SQL 2008, for example. So DB vendors are constantly pushing back the limits of what is ‘too complex’ to be represented as a datatype. The question that remains is: will this happen for date/time intervals? Maybe we’re not at the point yet where we know enough about it to standardize it and encapsulate it into a datatype. But maybe we are….the INTERVAL datatype would be complex, and there would probably be many attributes that you would be able to set to ‘customize’ it for your purposes, but in the end we are dealing with date periods, which are a very common component of everyday life, so they should be encapsulated in a datatype if possible.

Thanks for the feedback!

Wednesday, May 13, 2009

To Date or Not To Date….

OK….so I’m in the middle of a very interesting Effective Dating project…basically we’re dealing with some tables that only store current data, but there is a need for future-dated data as well. Because the base table can’t hold that data, we have to come up with additional tables and a process to store future-dated data and then move it into the base table whenever we get to the future date. Very cumbersome and fragile, and that’s why we’re looking at this project.

There is a great book on this subject by Richard Snodgrass which you can find at Amazon…this book is highly recommended (read: required) for anyone looking at Effectively Dating tables. Trust me, you won’t believe how complicated the subject is…here is the link to the book:

http://www.amazon.com/Developing-Time-Oriented-Database-Applications-Management/dp/1558604367/ref=sr_1_1?ie=UTF8&s=books&qid=1242268718&sr=1-1

There is also a book by C.J. Date on the subject, titled “Temporal Data & the Relational Model”, but I haven’t read it….if anyone has read it please feel free to post your feedback on it!

Anyways, the book makes an interesting case for some major enhancements to the Relational Database products out there, SQL Server included. Basically, what you need when effectively dating tables is to add the valid date/time period for each row. So, if you have an e-commerce website, for example, and you want to schedule a sales promotion for 3 days, you would indicate in your PROMOTION table that this particular sales promotion is only valid for 3 days. The beauty of the solution is that the SQL queries in the application would query the tables by using the system date, as in “give me the promotions that are effective for today (the system date)”…so the website content can change as needed without any manual intervention.

If the table was not effectively dated, you would have to schedule a job to run at the start date/time of the promotion and make all the data changes right then and there. Of course, the promotion will probably start at midnight, and something will break when the scheduled job runs. The promotion won’t show on the website, and the DBA will get called at half-past midnight in order to fix the problem to the tune of frantic curses and imprecations from the pained executive staff…but I won’t elaborate any further on the joys of our profession :-)

Now, how do you specify a valid date/time period for each row? The obvious answer at first glance is to add 2 date columns: StartDate and EndDate, and then add them to the Primary Key of the table. That should work, right?

You won’t believe how off the mark that solution is. Because what’s happening here is that the database is just storing 2 dates….it’s not really storing a date/time PERIOD (or INTERVAL). The database doesn’t know that this period should include ALL DATES IN BETWEEN the 2 dates you’re storing. So it’s possible to insert rows that overlap… as long as the StartDate and EndDate are different the Primary Key constraint will allow the INSERT…but you probably won’t want 2 of the same promotions running at the same time…..

And what about Foreign Keys? As it turns out, these won’t work either….in our example, if you have a FK in the ORDER table pointing to the PromotionID, well now you can have duplicate promotionIDs in the PROMOTION table. What if you have a promotion that you run every year for 3 weeks? You keep the same promotion ID, because the substance of the promotion is the same, but every year you insert a new row with different effective dates. So, if a customer bought something from PromotionID 6, and you’ve been running that promotion for 5 years, you now have 5 rows with PromotionID = 6 in the PROMOTION table. The FK can’t handle that…it needs a unique constraint to point to.

So you need lots of constrants or triggers to get the job done….this is what the Snodgrass book deals with…..and by the way, the above is just a delicious sampling of the issues encountered…it’s not an all-inclusive list.

In response to this, I have entered a suggestion in MS Connect in order to enhance SQL Server to fix this problem. The solution entails defining a DATE/TIME PERIOD or INTERVAL datatype. It turns out that the ANSI SQL standard already has this, but it’s not implemented in SQL Server. Below I pasted the text of the Connect entry…if it sounds interesting to you, please follow this link and vote for it:

https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=432281

Hefty subject…I think I should get 3 cents for this one…. :-)

The text of the MS Connect entry follows:

“Problem: SQL Server does not have an ANSI INTERVAL data type. When we create effectively dated tables we have to simulate a time period by using 2 DATETIME columns (StartDate, EndDate). However, often we don't want to allow 2 rows to have overlapping date ranges and many times we don't want to allow gaps in there either. All of these validations have to be done in SQL. It is cumbersome and probably doesn't perform as well as a data type would.

Solution: Create an 'Interval' or 'Time Period' data type. It should probably contain 2 values (StartDate, EndDate) stored in the same column and it can then treat all time periods between these 2 dates as a continuous length of time instead of just having 2 separate date columns in the DB.

It should also contain a special value for 'forever' or 'until the end of time'. This is heavily used for rows that don't currently have an end date. NULLs are often used for this, but they create problems such as not being able to participate in a Primary Key and they also are treated differently in SQL and can cause performance problems. So many times people just use an artificial value like '9999-12-31' for this end date, but this is not standardized, nor is its meaning obvious to the user.

It should have functions or operators to check for the gaps and overlaps.

It can also contain options to make it more flexible, such as "should the period be inclusive of the start & end dates?" and "allow gaps/overlaps?".

The benefits of this would be that we don't have to code triggers and also the logic would be very neatly encapsulated in the datatype and would not clutter up our SQL excessively. I imagine it would be faster as well.

Microsoft already did something like this in SQL 2008, where they created the Spatial Data Types. Maybe some of the existing features developed for Spatial support can be reused for this data type? Spatial data already has an 'Overlaps' operator, which is similar to what we would need for this data type.”

What’s a ‘DBA’?

Most of us are tired of this overused, vague term. How many of you have clicked on a 'DBA' job posting only to find that it's not what you do? Or asked for a DBA resume from a recruiter and gotten something other than what you expected? Or how about interviewing someone for a DBA position and finding that the candidate is not even close to what you're looking for, but can still rightly call himself a DBA....?

I got an interesting mini-speech on this a couple of years back and I've always referred to it ever since. I can't take credit for it...the credit goes to my friend Yitzhak Khabinsky, although I have embellished it somewhat. Here is the "mini-speech":

"Many people refer to every database professional as a 'DBA'. This is wrong. The term is much too vague. What really exists is 5 separate disciplines:

1. Data Modeler - these people work with Business Analysts and developers during the requirements and design stages of the Development Lifeycle. They usually are (or should be) one of the first people to get the requirements from the business people and translate those requirements into a data model. They are experts at creating data models and working with data modeling applications. It goes without saying that they are experts in relational database design from a conceptual standpoint. They don't necessarily meddle in any one specific database platform, since much of their work is in the Logical Design Phase and is therefore platform-agnostic.

2. SQL Developer - these are the SQL experts who usually code review all the SQL written by anyone in their organization and take on the biggest SQL challenges themselves. They are usually tied to one or more specific platforms. They know all the ins and outs of writing clean, performant, scalable SQL and all the pitfalls to avoid. Many times they are the last line of defense when confronting some particularly hairy SQL that is slow or is causing problems. They also are usually the ones who take over a project after the data modeler finishes his work and develop the SQL core of new product offerings.

3. DB Administrator - these are the professionals who often work inside the "cold rooms" and deal with the actual SQL Server hardware of their organization. They manage backups, restores, tape libraries, creation and consolidation of SQL instances, SANs, and the like. They usually get into programming in order to automate administrative functions, but are usually not involved in application programming. When they get involved in performance tuning it's usually the type of tuning where they don't touch the code; they try to optimize performance by laying out the database properly on the disk subsystems, making sure enough memory is available, configuring servers properly, and performing proper maintenance on the database servers.

4. ETL Developer - this is similar to the SQL developer above, but it has taken a direction of its own with the proliferation of Data Warehousing and Business Intelligence. These professionals are usually experts in one or more of the "ETL" applications (Ascential, Informatica, SSIS, etc), which go beyond the realm of SQL development.

5. Database Architect - this is "all of the above" or close to it. These are very senior professionals with many years of experience and can lead
teams of other Database Professionals. In spite of many claims to this level of professionalism, there are in fact very few of these professionals around. These professionals also get into the internals of how the database engines work and are also usually interested in database research as well. The main litmus test is that they possess most of the 4 other skills if not all.

Also, the above 5 disciplines only apply to the OLTP world (with the exception of the ETL developer who has one foot in OLTP and the other in BI). There are 5 other similar disciplines that apply to the BI world (Dimensional Modeler, MDX developer, Administrator, ETL Developer, BI Architect)."

So, that's the 'mini-speech'. You will find that if you use this model you will get a FAR better idea of what you're looking for in a DBA job if you're looking for work, or a DBA candidate if you're looking to hire. When I interview people I usually start with this mini speech and then try to put the person into one of these categories. I have found it to be very effective in hiring top-notch candidates. Obviously, there is some overlap...most people fit into more than one category, but you get a FAR better idea of what the person is capable of by following this model.

...and that's my 2 cents....

Sam Bendayan's 2 Cents