Sam Bendayan's 2 Cents: August 2009

Sunday, August 23, 2009

Printing out long VARCHAR(MAX) strings

I've had to do lots of work recently generating very large dynamic SQL strings, and one of the biggest annoyances is that when you PRINT a VARCHAR(MAX)/NVARCHAR(MAX) string that is longer than 8,000/4,000 characters, SSMS only displays the first 8,000/4,000 characters. This makes it nearly impossible to work with when generating large SQL strings. I always like to print them out and then copy/paste them to a new window to see if they run, but you can't because of this limitation. I spoke to someone in Microsoft a while back about this and was told that this behavior is 'by design'.

I can't see how that holds water. The whole point of the (MAX) datatypes is that they support very large sizes. So not being able to print them out equals a bug as far as I'm concerned. However, I was able to find an entry in Connect about this (https://connect.microsoft.com/SQLServer/feedback/ViewFeedback.aspx?FeedbackID=196643) and it looks like it was admitted as a bug in that entry. If you've had this problem and find it annoying, you can follow the link above and vote for the issue.

Now, here is the workaround. I created the attached SP that is able to print out all the characters in these long strings. Instead of doing a PRINT command on the string, call this SP and pass it the string as a variable. You will receive the full length of the string printed out in the output window in SSMS.

As for the details of how it works, here is the scenario:

imagine you are generating a very long NVARCHAR(MAX) Dynamic SQL string, which is much longer than 4,000 characters. You'd like to print it out the Dynamic SQL string and see if it runs separately. However, when you do a PRINT command, you are surprised to see that it only prints out the first 4,000 characters. In other words, it truncates your string. And now the string is not runnable as an SQL Batch.
So, as a workaround, you realize that there is a simple solution: use multiple PRINT commands, and print out a 4,000 character chunk of the string with each PRINT statement. The result looks like the following, and it's messy but it appears to work:
PRINT SUBSTRING(@SQL, 1, 4000)
PRINT SUBSTRING(@SQL, 4001, 4000)
PRINT SUBSTRING(@SQL, 8002, 4000)
There are several drawbacks to the above approach. First, you don't know exactly how long the string is going to be every time you run it, so you don't know how many of the above lines you have to type. And you know that if at some point you try to generate a string longer than the maximum size (12,000 characters in the above case), you will have to add more PRINT statements. However, since this is only used for debugging and not production, the fix for that is relatively easy...just add more PRINT statements when the error happens. The REAL problem is that whenever one of these 4,000-character chunks finishes printing and the other one starts, SSMS inserts a CRLF character. If it so happens that this CRLF chops off an SQL statement in mid-sentence, then you have non-working SQL. You will get a syntax error if you try to run it, and this means you have to go and remove these CRLFs manually before running the SQL. This may not seem to be a big deal, but if you're dealing with a really long 50,000 or 100,000+ line string of Dynamic SQL code, it becomes a major nuisance. There are many CRLFs to remove, and you have to do this EVERY time you regenerate the code. Very annoying.
Hence, the attached SP comes into play. Here's how it works: the first step it does is to figure out how long the string is and break it up into 4,000-character chunks, much as we did manually above. However, after this, it fixes the 'broken SQL line at the 4,000th character position' problem as follows:
- It detects if there is a broken line at the end of the string. This is as simple as assuming that the line is broken if there is a character at the exact 4,000th position. This may not always signify a problem, as a word may occassionally end on the 4,000th position, but we can discard that possibility as immaterial, since the fix for this does not otherwise break the SQL in this case.
- Once it detects that there is a broken line, it traverses the problematic 4,000-character chunk backwards from the end and stops printing this chunk on the last CRLF before the line that contains the 4,000th character. The next chunk will then start at the point where this chunk stopped, and this gives you working code no matter the string length.
  So, if you have a string that looks like this (the 4,000th position is in red bold font):

....SELECT * FROM sys.objects

SELECT * FROM sys.objects WHERE obj

ect_id LIKE '%' (this is the first line of the next 4,000-character chunk)

It prints out the following:

....SELECT * FROM sys.objects (this is the last line of the first chunk)

SELECT * FROM sys.objects WHERE object_id LIKE '%' (this is the first line of the next chunk)

...

The last line of the first string is not printed out because we know we're going to run into problems if we do. So instead, we end that string prematurely and start the next string right after the premature end of the first one. This solution assumes that you are regularly inserting CRLF characters in your Dynamic SQL, which I think is a must in order to keep them neatly formatted. Especially if they are very long strings.

Now, the bug...the only problem I haven't fixed yet is that an extra CRLF is added to the printed out code. Notice the extra CRLF in the last code sample above. However, this is a minor issue. The main thing is that the code can be copied/pasted into a new window and it will run without any manual massaging. When I have time to work on this further I'll fix the bug and repost the SP on my blog.

In summary, I think you will find this a handy SP for working with very long Dynamic SQL strings. It has served me well in printing out very large Dynamic SQL strings, which I find indispensable in debugging complex Dynamic SQL operations.

Beware the REPLACE function when doing heavy string parsing

I found something interesting the other day. I was attempting to optimize a very heavy string parsing routine using T-SQL and was having problems. I don’t normally use T-SQL for such heavy string parsing, but this was a special case of a legacy structure that I had to work with, so I had no choice. Anyway, the said routine was a table function that basically takes a fixed length text file and turns it into a table (please don’t ask why) so you can join it to other tables in a database. The thing that made this interesting is that this fixed length text file had over 4,000 rows, so in order to turn this into a table the function has to slice and dice these 4,000 rows every time it’s executed, which it does by executing a WHILE loop.

When I finished this function, it ran in about 15 seconds, which was way too slow for our needs. Unbeknownst to me, one of my colleagues was attempting the same thing and wrote his code in a slightly different way, but he got it to run in about 1 second. When I analyzed his code, I found it to be virtually identical to mine, with one difference: during every iteration I was using the REPLACE function to ‘wittle’ down the string until none of the string was left, at which point the WHILE loop would exit and the function would finish executing. In contrast, his approach did not have any REPLACE function, he was simply navigating forward in the string until he reached the end.

At this point one of my very special mental “talents” kicked in: when my mind is stuck on a problem and someone ELSE tells me the solution to it my mind immediately says “oh, I knew that!” and proceeds to give me a detailed explanation of the solution and why it works. And leaves you wondering where all that great information was hiding 5 seconds ago when you didn’t have the answer and you really needed it….you have to wonder why such useless “talents” even exist….

So, of course, once I saw this then everything became obvious: my version of the function was doing over 4,000 REPLACE operations and his wasn’t. To add to this, I have to wonder if the string datatypes, such as the VARCHAR datatype that we were both using, are implemented as immutable data types in SQL Server. Does anyone know if this is the case? If it is then it would further explain the difference in speed. Since an immutable data type doesn’t really ‘change’ the instance of the string in memory and just adds a new instance of a string, this means that my function was creating over 4,000 instances in memory of the large string that I was parsing. Talk about a waste!

I mention this because lots of SQL developers may not have much experience in string parsing, since this is normally not done using SQL. This may be old hat to many application programmers, but it’s not something that DBAs run into as frequently, so I’m hoping that others will find it as useful as I did.

So, the moral of the story is: be careful when using REPLACE in heavy string parsing operations. Like many other things, if you need it you should use it, but the trick is in knowing when you REALLY need it. In my case I thought I needed it but in reality there was a much simpler solution available, and it ended up running 15 times faster. It proves the old adage of “Keep It Simple, Stupid”…given 2 equivalent solutions the simpler one is better.

…and that’s my $.02 for today….

Friday, August 7, 2009

Effective Dating Series – Part I – THE PROBLEM

Much research has been done regarding temporal databases, but what exactly is the problem with Date/Time data and why is it so complicated? What benefits do you get by properly expressing this data in an Effective Dating scenario? In this article I’m going to provide a clear, simple answer to this question and also lay the foundation for the later parts of this series of articles that explain what happens to Relational Databases when you introduce the concept of Effective Dating. Suffice to say for now that the impact to a RDBMS is heavy; most of the concepts that are well-known and we sometimes take for granted such as Primary Keys and Foreign Keys tend to change drastically the minute that you introduce effective dates into a database schema. Nonetheless, it’s still worth the trouble to implement it in the right scenarios, and I will show one such real-world scenario in this article.

THE CONCEPT

Although we don’t always express this in our database schemas, most data is temporal; it actually only lives over a period of time. Yet in many tables that we design we don’t put in effective dates to signify this. Many times there is a good reason for this: maybe we’re not interested in historical data, or maybe we’re not interested ENOUGH in order to justify the cost in adding dates to the design.

However, there are times where putting these effective dates in will TREMENDOUSLY help out a project. Unfortunately, many times we don’t see this benefit at the start and so our initial design is not effectively dated even though it really is in our best interest. We normally realize this when we find that we need to add all sorts of workarounds just because our system is not date-aware by design. Yet many times we go on blissfully ignoring the effectively dated nature of our business and we keep on piling workarounds, each more ridiculous than the last, on top of a design that is pretending to be eternal and failing. This is what I call the “Delusions of Eternity” anti-pattern. It’s sort of like ignoring our own Mortality (we humans are also “effectively dated”), so maybe that is where this attitude comes from. Who knows, but maybe our own Mortality tends to put a negative slant on the whole subject of Effective Dating and causes us not to think about it :-)

THE SCENARIO

Let’s solidify all of this with an example. Let’s say we’re designing a database that handles promotions for retail sales in an E-Commerce environment. A common design for this is to have a Promotions table which contains all the promotion details and then a ‘PromotedProducts’ table which links promotions to products. Below is a simple diagram of this schema:

THE PROBLEM

So what’s the problem here? It seems like a good design at first, but the fact is that Promotions are by definition Effectively Dated, and this design ignores that. Soon the cracks in the design start to appear. Let’s illustrate how this design painfully falls apart:

Yes, you can store Promotion Details and associate them to Products, but what happens when a Promotion starts and then ends? You have to switch prices in the products that are affected. So, how do you do this? Well, obviously you run some kind of batch job, at exactly the right time, to change these prices. The batch job would insert the row into the PromotedProducts table at the beginning of the promotion, and then would delete the row at the end of the promotion. The correct price for the product would be written to the OrderDetail table by using a SELECT with a LEFT JOIN on the PromotedProducts table. The price from PromotedProducts would be used, and if it’s NULL, then the price from the Products table would be used instead. No big deal, right? Wrong, by several counts:

Once you accept that you have to run a batch job, things start to go downhill from there:

What if management wants to start a promotion at midnight? You’d better be up and watching in case that batch job fails.
If the batch job does fail, you’re dealing with downtime or bad pricing, both of which result in a direct cost to the business.
Because you are deleting the promotion as soon as it’s over, you don’t have the capability to analyze how many orders the promotion generated, or do any other kind of analysis. So someone will probably get the idea that they should store this information off in an Excel spreadsheet somewhere so they can report it to management. Voila! You have now created an unnecessary business process and a questionable source of data to base Management Decisions on.
Every time you do a promotion you have to set up a new batch job in order to insert the data. So there is manual intervention involved with every promotion setup, which means that every promotion turns into a Development Project, with development and testing phases. It’s a simple project, but a project nonetheless. This results in the following sorry list of circumstances:

So we see many problems here…not the least of which are frantic, screaming managers and sleepless DBAs. And this reinforces another famous saying: Data Modeling is good for your health (OK, it’s not famous, but it should be).

We also see how the effects of a bad design can spill over into the other departments in a company, raising costs and blood pressures and creating many fragile and unnecessary business processes to accommodate the bad design. Surely there must be a better way.

THE SOLUTION

Now, let’s see what happens if we break down, repent, and just accept the fact that these promotions live and die by dates. We would include 2 dates in the PromotedProducts table signifying the Start and End dates of the Promotion. The first thing we then notice is that these 2 new columns must be included in the Primary Key, since there can now be several promotions for the same product. Here is the new database schema:

All we did was add the 2 dates in the PromotedProducts table, but the real magic is what the application does with this data: the Website application would be designed so that it looks at these dates and compares them to the current date/time to see what rows should be displayed. It would look in the PromotedProducts table using the effective dates to determine if the promotions are active and should be displayed. The pricing would still be calculated using the LEFT JOIN method we described above.

Using this design, let’s look at the weaknesses of the previous design and see what’s happened to them:

No batch job needs to be run. The system is date-aware and automatically displays the right price at the right time. So, no downtime is possible from a Development standpoint, and no staying up at night :-)
Setting up a new promotion goes from being a development project to a few simple data changes done by users with a simple interface over the Promotions tables. The Development Department doesn’t even have to be involved. This produces a much nicer list of circumstances:
The promotions are persisted in the table even after they are finished, so you have the ability to go back and analyze the statistics of the promotions.

CONCLUSION:

This simple example illustrates how beneficial it is to consider including Effective Dates into your database schema. I have seen both of these scenarios at work and the difference is too vast to be ignored. You go from a frantic, stressful environment focused on how to make the systems work to a relaxed, confident environment focused on how to better run the business. And all because you let the software handle all the work. And isn’t that what software is supposed to do in the first place?

So that is the good news. In the next article of the series I will present the impact and pitfalls of the Effectively Dated design above and how to overcome them. We will build on the simple data model that we started with here and see what issues surface as you take Effective Dating further and further. You will then start to see how far RDBMS’s still have to go to properly represent this type of data.

Sam Bendayan's 2 Cents

Sunday, August 23, 2009

Printing out long VARCHAR(MAX) strings

Beware the REPLACE function when doing heavy string parsing

Friday, August 7, 2009

Effective Dating Series – Part I – THE PROBLEM

Followers

Blog Archive

About Me