Thursday, August 11, 2011

Optimize string handling in VB6 : Part II


Make your Visual Basic apps process text fast as lightning. Part II of this article dives deep into the performance of the VB6 String functions. We learn the functions to use and the ones to avoid. We also learn how to call the Windows Unicode API functions and how to build really, really huge strings without crashing our VB6 apps.

Memory layout of VB6 strings

To get some background, let's see how strings are stored in RAM. VB6 stores strings in Unicode format. In COM terminology, a VB String is a BSTR. A String requires six overhead bytes plus 2 bytes for each character. Thus, you spend 6 + Len(string)*2bytes for each string.
The string starts with a 4-byte length prefix for the size of the string. It's not the character length, though. This 32-bit integer namely counts the number of bytes in the string (not counting the terminating 2 zero bytes). After the length prefix comes the actual text data, 2 bytes for each character. The last 2 bytes are zeros, denoting a NULL terminator (a Unicode null character).
Memory layout of a VB6 String
Field
Length prefix
Datastring
Terminator
Byte






...




Number of bytes
4
2*Length
2
Value
Num_bytes
Unicode characters
NULL













Let's see how a sample string "Aivosto" is stored:
Memory layout of "Aivosto"
Field
Length prefix
Datastring
Terminator
Byte




















Number of bytes
4
2*7
2
Value
14
A
i
v
o
s
t
o
NULL





















NULL character note. It is perfectly valid to store a NULL character in the string. As the string length is stored in the first 4 bytes, you can store NULLs in the string data. They will not be treated as the terminating NULL as in C/C++.
BSTR note. BSTR is the COM datatype for a string pointer. The pointer points to the first character of the datastring, not to the length prefix.

Performance of simple string functions

We are now going to measure how fast the various built-in string functions of VB6 really are. For this purpose we compiled a little .exe, which called each function 100 million times. The test was run on a typical Pentium 4 processor (2.8 GHz). We found out that Len and LenB are the fastest functions. So we compared the rest of the functions to Len/LenB and put the results in the chart and table below.

Relative performance of VB6 string functions
Function
Performance
Description
Len
1
Len(S) returns the number of characters in string S.
(Note: works differently with UDTs)
LenB
1
LenB(S) returns the number of bytes in S.
AscW
1
AscW(S) returns the Unicode value of the first character in S.
AscB
1
AscB(S) returns the first byte in S.
Asc
7
Asc(S) returns the ANSI value of the first character in S.
ChrW$
17
ChrW$(U) returns a string containing the Unicode character U.
ChrB$
18
ChrB$(B) returns a byte string containing the character B.
Chr$
24
Chr$(A) returns a string containing the ANSI character A.
Left$
18
Left$(S,x) returns x characters from the start of S.
Right$
21
Right$(S,x) returns x characters from the end of S.
Mid$
24
Mid$(S,x,y) returns y characters from S, starting at the x'th character.
CStr
36
CStr(x) returns the string representation of x (localized).
Str$
111
Str$(x) returns the string representation of x (not localized).
How to read the table: Calling Len takes 1 unit of time, while Asc takes 7 units. You can call Len 7 times in the same time you call Asc once.
Len and LenB. The fastest functions are Len and LenB. These are lightning fast functions that simply read the 2 length bytes at the start of the string area. Len is implemented in 6 assembly instructions in the VB runtime. LenB is even shorter: it runs just 5 instructions. In principle, LenB should run faster. In practice, this is not the case. Their performance is equal on today's processors.
AscW, AscB and Asc. This group of functions is very fast as well. Note how Asc takes 7 times the time of AscW and AscB. This is because AscW and AscB simply return the first byte(s) of the string. Asc needs to convert the value to an ANSI character code. You can squeeze more performance out of your program by replacing Asc with AscW. Since Asc and AscW return different values for many characters (other than ASCII 0-127), you need to know what you are doing before choosing the other function. ChrW$, ChrB$ and Chr$. Performance degrades as we move to functions that create strings. This group of functions creates a string out of a numeric character code. Note how Chr$ takes about 40% more time to run than ChrW$. This is because Chr$ converts from ANSI to Unicode while ChrW$ works with pure Unicode values. You can squeeze more performance out of your program by replacing Chr$ with ChrW$. Since they return different characters for many values (other than 0-127), you need to know what you are doing before choosing the other function.
Left$, Right$ and Mid$. Performance keeps at the degraded level with this group of functions. These functions create new strings by copying some characters in the input string. These are the only functions that can access the individual characters in a string. As you can see, Mid$ is slower than Left$ or Right$. This means you should use Left$ and Right$ when possible and only resort to Mid$ when you really need to access characters in the middle.
Tip. To access the first character in a string, call AscW(S) instead of Left$(S,1). This will save you 95% of the time of running Left$. There is a caveat, though. S must not be empty. If S is empty, you will trigger a run-time error. Therefore, you need to make sure S is not empty by calling Len(S) or LenB(S) first. This is the proper way to do it: If LenB(S) Then x = AscW(S)
CStr and Str$. These slow functions are used to convert other data types to a string. You typically use them to convert a numeric value into a string. CStr is much faster than Str$. (Tested for integer input value 32.)
You can save time by replacing calls to Str$ with CStr. This is not a straightforward task, though, because CStr and Str$ return different values. CStr returns a localized string, while Str$ returns a non-localized one. What is more, Str$ prefixes positive values with a space. As an example, CStr(1.2) returns "1,2" in several European locales. Str$(1.2) always returns " 1.2". Thus, you can trust that Str$ always works the same way, while CStr works differently in different locales. If you simply replace calls to Str$ with CStr, your program may fail later if it fails to interpret the resulting localized string. The following table compares Str and CStr in the Finnish locale. The results will look similar in several non-English locales.
Str$ and CStr in Finnish locale
x
Str$(x)
CStr(x)
1.2
" 1.2"
"1,2"
.2
" .2"
"0,2"
-.2
"-.2"
"-0,2"

Performance of string functions, group 2

The next group consists of functions whose performance vary based on what you feed them as input. Generally speaking, the longer the input string, the slower the call. Performance drops considerably with longer strings.
The following chart reports timings for two input strings S defined as follows:
Short input
S consists of 21 characters: 10 spaces, "1" and 10 spaces.
S = "          1          "
Long input
S consists of 201 characters: 100 spaces, "1" and 100 spaces.



So, for the call Replace(text), the performance was 513 using the short input string and 968 using the long input. The table below presents the numeric performance values for the short input.

Relative performance of VB6 string functions
Function
Performance
Description
LTrim$
20
LTrim$(S) returns a copy of S without leading spaces.
RTrim$
21
RTrim$(S) returns a copy of S without trailing spaces.
Trim$
29
Trim$(S) returns a copy of S without leading or trailing spaces.
Val
89
Val(S) returns the numeric value contained in S (non-localized).
CInt
81
CInt(S) returns the integer value contained in S (localized, rounded).
CDbl
103
CDbl(S) returns the double floating point value contained in S (localized).
LCase$
53
LCase$(S) returns a copy of S with lower case letters.
UCase$
53
UCase$(S) returns a copy of S with upper case letters.
StrConv(vbLowerCase)
327
StrConv(S,vbLowerCase) returns a copy of S with lower case letters.
StrConv(vbUpperCase)
333
StrConv(S,vbUpperCase) returns a copy of S with upper case letters.
StrComp(vbBinaryCompare)
47
StrComp(S1,S2,vbBinaryCompare) compares string S1 and S2 based on their Unicode values.
StrComp(vbTextCompare)
65
StrComp(S1,S2,vbTextCompare) compares string S1 and S2 in alocale-dependent, case-insensitive way.
InStr(vbBinaryCompare)
14
InStr(x,S1,S2,vbBinaryCompare) looks for S2 in S1, starting at position x, in a locale-independent, case-sensitive way.
InStr(vbTextCompare)
156
InStr(x,S1,S2,vbTextCompare) looks for S2 in S1, starting at position x, using text comparison.
InStrRev(vbBinaryCompare)
69
InStrRev(S1,S2,x,vbBinaryCompare) looks for S2 in S1, from position x back to 1, in a locale-independent, case-sensitive way.
InStrRev(vbTextCompare)
193
InStrRev(S1,S2,x,vbTextCompare) looks for S2 in S1, from position x back to 1, using text comparison.
Replace(vbBinaryCompare)
375
Replace(S1,S2,S3,x,y,vbBinaryCompare) replaces S2 with S3 in S1, starting at position x, max y times, in a locale-independent, case-sensitive way.
Replace(vbTextCompare)
513
Replace(S1,S2,S3,x,y,vbBinaryCompare) replaces S2 with S3 in S1, starting at position x, max y times, using text comparison.

LTrim$, RTrim$ and Trim$. This useful group of functions removes spaces from the start or the end of the input string, or both. The more spaces there are, the slower they run. These functions return a copy of the input string.
Tip 1. Never nest like LTrim$(RTrim$(S)). Simply call Trim$(S) instead.
Tip 2. LTrim$ is the fastest alternative to test whether string S contains any non-space characters. Use this syntax:Len(LTrim$(S))<>0
Val, CInt, CDbl. This is a group of relatively slow conversion functions. They look for a number inside a string. Val is a non-localized version that is best used together with Str$. CInt and CDbl are localized versions that are compatible with CStr. Val is a safe function to call, while CInt and CDbl raise error 13 when conversion fails. — It is best to avoid costly conversion where possible. Always pass numeric values in numeric data types, not as strings. By converting numbers to strings and strings to numbers you spend extra CPU cycles and also risk error 13 if your code is not designed the right way.
LCase$, UCase$, StrConv(vbLowerCase), StrConv(vbUpperCase). This group of functions performs case conversion. LCase$ is the faster alternative to StrConv(vbLowerCase). UCase$ is the faster alternative to StrConv(vbUpperCase).
Keep in mind that LCase$ and UCase$ have full Unicode support, whereas StrConv(vbLowerCase) and StrConv(vbUpperCase) don't support all Unicode characters. In fact, calls to StrConv(vbLowerCase) and StrConv(vbUpperCase) can remove diacritic marks from Latin characters and convert unsupported characters to "?". As an example, StrConv(vbLowerCase) converts all Greek and Cyrillic characters to garbage on a Western system.
Tip. Avoid StrConv(vbLowerCase) and StrConv(vbUpperCase). There is no reason to use these slow and limited calls. If you see them used, replace them with LCase$ or UCase$. If there is a chance of a Null value in the input, use LCase or UCase, because the dollar versions LCase$ and UCase$ fail if the input is Null.
StrComp. The StrComp function compares two strings. The vbBinaryCompare option is faster than vbTextCompare.
Tip. vbTextCompare should only be used for sorting. vbTextCompare is often used to test for a case-insensitive match (test if "ABC" is equal to "abc"), but it is not suitable for that. This is because the result depends on the current locale. Certain character combinations can produce false matches. For example, StrComp("ss", "ß", vbTextCompare) returns 0. This means "ss" and "ß" are a match, which may not be what you intended to do. For a real case insensitive test (without extra effects), use something like LCase$(S1) = LCase$(S2).
StrComp(vbTextCompare) can produce other surprises as well. In the following table you can see how StrComp sorts certain characters in the Finnish locale (as an example). This serves as a proof of why vbTextCompare is not simply the case insensitive counterpart of vbBinaryCompare. It affects much more than just the case.
StrComp examples
vbBinaryCompare
(all locales)
vbTextCompare
(Finnish locale)
a < ae < z < ä < æ
a < ae = æ < z < ä
va < vb < wa
va < wa < vb
AE < ae < Æ < æ
AE = ae = Æ = æ
OE < oe < Œ < œ
OE = oe = Œ = œ
ss < ß
ss = ß
TH < th < Þ < þ
TH = th = Þ = þ
d < e < Ð < ð
d < Ð = ð < e
Note that Option Compare Text is the equivalent of vbTextCompare. Option Compare Binary is the equivalent ofvbBinaryCompare. It is also the default setting.
InStr. InStr(,,vbBinaryCompare) is a quick function. Its counterpart InStr(,,vbTextCompare) is very slow, on the other hand. This is where it really pays off to use a binary search.
As with StrComp, vbTextCompare with InStr is full of surprises. To name an example,InStr(1,"Straße","ss",vbTextCompare) locates "ß" at position 5. You searched for 2 characters, but InStr found just one! What is this? The character "ß" stands for "ss" in the German language. But, if your program was expecting the 2 characters "ss" or "SS", it can fail now. — Similarly, InStr(1,"Þingvellir","th",vbTextCompare) locates "Þ". This happens because the character "Þ" stands for "th" in Icelandic. Again, you searched for 2 characters, but InStr found just one. These are not the only examples. You get the same behavior for æ and œ. Don't use vbTextCompare unless this is what you really want.
Tip. Quick uses for InStr
Locate a substring
InStr(S, x)
Locates x in S
See if string contains a substring
InStr(S, x) <> 0
Tells us if S contains at least one x
See if character is one of alternatives
InStr("ABC", S) <> 0
Tells us if S is one of "A", "B" or "C"
(Len(S) must be 1)
See if string is one of alternatives
InStr("-AB-CD-EF-", "-" & S & "-") <> 0
Tells us if S is one of "AB", "CD" or "EF"
(S must not contain "-")
InStrRev. Also InStrRev has much better performance with vbBinaryCompare than with vbTextCompare.
InStrRev is a lot slower than a regular InStr. This difference is especially big with the vbBinaryCompare option. Searching for the middle character in our 21-character short input string, InStrRev spent 5 times the amount InStr spent.
InStrRev's performance also drops considerably when the input string is long, that is, when there is a lot to search in. It seems that InStrRev has a bad implementation.
Tip. Avoid InStrRev because of the bad performance. It may make sense to replace calls to InStrRev with InStr at times. Example: If the input is known to contain two TABs and you wish to find the latter one, it may be faster to call InStr twice rather than InStrRev once.
Replace is a slow function. As with the other text functions, vbBinaryCompare has better performance than vbTextCompare.
If a replace is unlikely to occur, you can add performance by not calling Replace unnecessarily. First test with InStr whether a replace is required, and then only call Replace if a replacement is about to happen. Thus, if you need to replace "£" with "€", first test with InStr if there really is a "£" in the input string. Only call Replace if there was.

Empty string vs. null string

As you probably know having read this far, there are two ways to represent a zero-length string:
  • the null string: vbNullString
  • the empty string: ""
As far as pure VB6 programming is concerned, they work quite exactly the same way. The only differences are thatvbNullString is faster and it saves 6 bytes of memory.
What exactly makes vbNullString different from ""? It's the way they are stored. In the table below, you can see how the empty string uses 6 overhead bytes, but no data bytes. The Datastring field between the Length prefix and Terminator fields is hidden, because its length is zero.
Memory layout of "", the empty string
Field
Length prefix
Terminator
Byte






Number of bytes
4
2
Value
0
NULL







What is the memory layout of vbNullString? The answer: Nothing! There is nothing to store, because vbNullString is simply a NULL pointer. Besides a zero pointer, there is nothing else to store. For the empty string, there is a non-zero pointer and a real 6-byte string allocated in RAM.
Now let's see what happens when you store a null string and an empty string in variable S. The null string is stored as a null pointer. The only thing to store in S is a zero. To store the empty string "", VB needs to do more. VB will allocate an empty string in the memory and set S to point to it. Thus, we consume both the pointer and the data area. You can see the difference in the table below.
Null string vs. empty string

Pointer

Data
StrPtr(S)
VarPtr(S)
S = vbNullString
0

(no data)
0
1308564
S = ""
1568888
->
6 bytes
1568888
1308564
S = "a"
1568888
->
8 bytes
1568888
1308564
Memory address
1308564

1568888


In the table, variable S is stored as a pointer at memory address 1308564.
  • When you let S = vbNullString, the pointer value is set to zero. Nothing else is stored.
  • When you let S = "", the value 1568888 is stored at address 1308564. Now S points to the data area at 1568888. As it happens, this area contains 6 overhead bytes for the empty string "".
  • Compare the empty string to a regular non-empty string. When you let S = "a", it is stored exactly the same way as "". Again, S points to the data area at 1568888. In this area you can find the actual string "a", which takes 8 bytes.
There is a difference in using "" vs. vbNullString. The difference is with API calls. Some API calls may accept only the other. Check out the API documentation before switching to vbNullString. VB itself doesn't make this difference.

Getting string pointers

VB6 supports two functions for getting pointers to strings.
VarPtr(S) returns a pointer to the variable S. That memory location contains a pointer (BSTR) to the actual string data. This means VarPtr essentially returns a pointer to a pointer.
StrPtr(S) returns a pointer to the actual string data currently stored in S. This is what you need when passing the string to Unicode API calls. The pointer you get points to the Datastring field, not the Length prefix field. In COM terminology, StrPtrreturns the value of the BSTR pointer.

API calls with Unicode strings

A lot of Windows API procedures come in two versions: ANSI and Unicode. The Ansi versions end in 'A' while the Unicode versions end in 'W'.
VB6 supports the ANSI 'A' versions. When calling a Declare Sub or Declare Function, VB automatically converts String parameters from Unicode to ANSI. This means you safely use "As String" parameters in your Declare statements as long as you work with ANSI Declares.
Declare Sub MySub Lib "x" Alias "MySubA" (Text As String)
Unfortunately VB6 doesn't directly support the Unicode 'W' Declares. You can pass a Unicode string quite easily, though. This is how you declare the Unicode version:
Declare Sub MySub Lib "x" Alias "MySubW" (ByVal Text As Long)
This is how you pass the Unicode string:
MySub StrPtr(Text)
You simply wrap the String parameters in StrPtr. That's how simple it really is! The API procedure gets a pointer to a Unicode string, not a copy of the string. By passing your strings as a pointer, you avoid the costly automated conversion to ANSI. What is more, your code is not restricted to the current ANSI character set, but can use the full range of Unicode characters. This is important with truly international applications.

Building huge strings

VB6 lacks a StringBuilder class for the creation of large strings. Consider building a big string in a loop. An example is when you load a text file line by line:
Do Until EOF(FileNr)
   Line Input #FileNr, Line$
   Big$ = Big$ & Line$ & vbNewline ' Bad!
Loop
You repeatedly copy stuff to the end of the string with the concatenation operator &. The bad news is, VB is not optimized for this. As you grow the string, VB repeatedly copies Big$ over and over again. This really degrades performance when repeated. VB constantly allocates new space and performs a copy.
The problem gets worse as the string exceeds 64K in size. Small strings are stored in a 64K string cache. As the string becomes larger than the cache, performance drops considerably.

Order of concatenation

Big$ = Big$ & "abc" & "def"   ' 1 - Default order
Big$ = Big$ & ("abc" & "def") ' 2 - Short strings first
On line 1 above, VB will first join Big$ and "abc". After this, it will perform another copy to add "def" to the end.
If Big$ is big, the same is better rewritten with parentheses around the shorter strings. The trick here is that you avoid copying the contents of Big$ twice. On line 2, VB will combine "abc" & "def" first before making a copy of Big$.

Avoiding concatenation

You can avoid the slow run-time concatenation by using constants:
Const TWO_NEWLINES = vbNewline & vbNewline
String constants are stored in the executable in their entirety. There will be no concatenation when your program executes.
Sometimes you cannot use a constant. You can still initialize a variable when your program starts and use the value where required:
Public EscapeSequence As String
EscapeSequence = ChrW$(27) & vbCr ' ESC CR

Building large strings with CString

A good approach to building large strings is provided in a class called CString, originally published in Francesco Balena's article in Visual Basic Programmer's Journal. Designed as a "string builder" class, CString is a nice replacement for big VB strings. It is easy to use and its performance is great.
CString stores the string in a byte array. As you add text to the string, CString allocates more space from time to time. Because it allocates far less often than VB, CString performs much better than a regular String.
The bad news is, CString works in ANSI. It stores the string in a byte array, one byte per character. While this saves memory, it doesn't work well with all international characters. Being ANSI means CString will silently convert your characters to something else. This isn't what you want if your strings are going to contain any characters outside of the current codepage. This means, CString doesn't handle international text.
Fortunately, it's possible to make CString work in Unicode. You need to change the byte array storage to an integer array, meaning two bytes per character. While this requires twice the memory, it ensures CString will work with all possible characters and not cause any nasty effects.

Building a huge string in a file

As your strings grow really, really huge, CString will not be enough. Memory allocation will eventually fail, even if there is free RAM available. The limit varies but there is a point where you get Error #7: Out of memory and your application is likely to crash. Maybe it's 50 MB or 300 MB, but Out of memory can hit your application as well. This can happen if your program produces large reports, logs, web pages, XML or other textual data files by building the data in a String before writing it to a text file. What can you do to keep your program running?
Save directly to a text file. Stop keeping the string in RAM. Write your string data to a file instead. If you are creating a report or an HTML page, for example, it's better to write the text directly to a file rather than building the text in a String and writing to disk afterwards.
Temporary text file. Writing text directly to a file isn't always an option, especially when existing code should be rewritten and you want to avoid a rewrite like the plague. This is where you can build a helper class. Let's call the class FileString. The class will use CString (or any other similar solution) as the default storage, the regular RAM string. As you append text to FileString, the class will store the text in the RAM string. If the string grows larger than a certain limit (say 1 MB), the FileString class creates atemporary file and dumps the RAM string in it. It then clears the RAM string to accept more stuff. You go on appending stuff to the RAM string. Every now and then FileString will store the RAM string into the temporary file.
Once you're ready, you can have FileString save the resulting huge string to a file. This is as simple as copying the temporary file into the final file. If there is any text left in the RAM string, that text is simply appended to the end of the final file.
This way you can build virtually unlimited strings, even exceeding the amount of available RAM.
The FileString solution is especially nice when you can't tell if the string will be small or large. When the string is small, it will be built in RAM. When the string is large, it will go onto the disk. There is no limit to the string data. You can be certain your code will run as long as there's free space on the disk.