The incredible expanding layer bug

A few months ago at work (January), a bug came across my desk for a performance problem in 3dsmax. I had always heard of bugs where someone in an architect’s office would load a file and it would take 45-50 minutes, perhaps an AutoCAD file with hundreds or thousands of layers in it. I’ve even had a guy at an architect’s office tell me they had files that took an hour to load. Just incredible. Anyways, this bug was that it took almost half an hour to create 1000 layers in max. The guy who logged the defect even gave us a maxscript to help reproduce the problem:

 

(
resetmaxfile(#noprompt)
clearlistener()
layerCount = 1000

for i = 1 to layerCount do
(
	t = timestamp()
	mLayer = LayerManager.newLayerFromName ("blarg_" + i as string)
	t = timestamp() - t
	format "%,%\n" i t
)
)

I first gave the script a run-through to reproduce the problem, and indeed it took a long, long time. The first run seemed to take an hour, but I wanted better numbers than that, so I modified the script to give me the total time.

The final result was 35 minutes to complete everything, during the course of which the 3dsmax maxscript listener simply stopped responding. When it finally finished, I dumped the printed results into an Excel spreadsheet and plotted them.

The following chart plots, in blue, the time it takes to create each of the 1000 layers. The layer number (1 to 1000) is on the X-axis; the time per layer, in milliseconds, is on the Y-axis.

 

[Image: layer creation time chart]

By the time the 1000th layer was created, creating a single new layer took nearly 5 seconds. *Ouch*. The blue curve is a classic parabolic shape and is in fact some form of N-squared polynomial: textbook non-linear performance degradation. Contrast that with the red line, the expected performance. Anyways, finding the bug was the real problem at hand. Why was it so slow?

My experiments were of course ridiculously slow to run: make a change, wait 35 minutes to test it. Finally I stumbled upon a function call that updated some UI. I commented it out and ran the script again. The results astounded me: 4 seconds! The code I removed did simply this: whenever a layer was created, it would update that little layer dropdown list on the layer toolbar:

[Image: the layer toolbar with its layer dropdown]

 

Remember that little guy? That UI tool that no one likes nor uses anymore? Well, the problem was that the little layer dropdown list would add the new layer to its data structure and then re-sort all the layers, every single time. This is a classic N-squared pattern: the more layers you have, the more sorting each new layer costs. Obviously a performance nightmare.

Once I temporarily removed that UI update function call, the time per layer was so low that it wouldn’t even register on the graph shown above. But creating layers should still update that dropdown list eventually, right? So if we remove the function call, how will it get updated? The fix was to add a call to suspend UI updates and another to resume them for that dropdown list: suspend before creating the 1000 layers, resume afterwards. In the end, the layer dropdown list gets rebuilt only once.
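In spirit, the fix looked something like this. The function names below are made up for illustration; they are not the actual 3dsmax internals:

// Hypothetical sketch of the shape of the fix; names are invented.
void SuspendLayerDropdownUpdates();   // stop per-layer refreshes of the dropdown
void ResumeLayerDropdownUpdates();    // rebuild and re-sort the dropdown once
void CreateLayer(int i);              // stand-in for the real layer creation call

void CreateManyLayers(int layerCount)
{
	SuspendLayerDropdownUpdates();        // no dropdown work happens per layer now
	for (int i = 0; i < layerCount; ++i)
		CreateLayer(i);
	ResumeLayerDropdownUpdates();         // the dropdown gets updated exactly once
}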

My favorite blogger, Joel Spolsky, wrote about this pattern in a classic piece on “Shlemiel the painter’s algorithm”:

http://www.joelonsoftware.com/articles/fog0000000319.html


Bad Casts and wrong assumptions about class hierarchies

Lately at work I have been looking into solving a certain crash in the product. This particular crash was occurring in a function that had accumulated many crash reports over the years.

The situation involved code in a base class that was doing work for a derived class. In fact, the base class was calling methods on the derived class. To do that, it cast its own ‘this’ pointer to a pointer to the derived class. But since there was more than one derived class, the cast was sometimes invalid: the code could be called on an instance of the other derived class.

So, sometimes when fixing a mysterious crash, just look for a bad cast. 🙂

The code had a class hierarchy roughly like this:

Class Base
	Class A
	Class B

So here, class A inherits from class Base. Similarly, class B also inherits from Base.

In this example case, some code in class Base was always casting the ‘this’ pointer to A*. However, when called on an instance of B, that assumption doesn’t hold, and the cast simply yields a bogus pointer. Furthermore, it did this with a C-style cast, which offers no warning and no protection.

This code demonstrates the problem:

void B::foo()
{
	A* temp = (A*)this; // B is not A!
	// crash attempting to do something with temp
}

So if called like this,

B* b = new B();
b->foo();

we should expect a crash calling foo since class B is not class A.

There were a few things wrong with this.

  1. A base class should not implement work that rightly belongs to a derived class. For instance, it would be bad if a Shape base class were in charge of drawing a Box and a Sphere. To be polymorphic, the Shape class would offer a virtual or abstract draw method and leave the implementation details to the derived Box and Sphere classes. A base class pointer could then refer to any derived instance and, with no switch statements, the right implementation just runs (see the Shape sketch further below).
    So in my case here, the whole concept of polymorphism had been violated. If a derived class performs an operation, the base class should not be calling it. In fact, the base class should be completely ignorant of the implementation details of its derived classes. And the operations being performed here didn’t involve any virtual methods at all; without virtual methods, polymorphism cannot happen. Hint: in C++ it is possible to replace switch statements and if chains with simple virtual method calls. This code was pure C masquerading as C++.
  2. An incomplete understanding of the class hierarchy deriving from this base class. The assumption may once have held, but once other derived classes were added later on, it failed. Casting everything to one particular derived class to perform operations was fragile and asking for trouble.
  3. Use of C-style casts here is plain wrong in C++. A cast like this should fail, but in C, no cast ever fails. A C-style cast may be fast, but it offers no protection. C++ does offer a measure of protection: with Run-Time Type Information (RTTI) turned on and a polymorphic class hierarchy (i.e. one with at least one virtual function), dynamic_cast<> checks the cast at runtime and returns NULL (or nullptr in VC 10.0) when the object is not actually of the target type. The other casts (static_cast<>, reinterpret_cast<>) perform no such runtime check. Thus B::foo could have been written like this:
void B::foo()
{
	A* temp = dynamic_cast<A*>(this);
	if (temp)
	{
		// do something with temp
	}
}

In this case, as shown, the code inside the if statement would never execute when foo is called on a B.
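To illustrate the first point above, here is a minimal sketch of the polymorphic alternative, using made-up Shape/Box/Sphere names rather than the real classes involved:

class Shape
{
public:
	virtual ~Shape() {}
	virtual void Draw() = 0; // the base class knows nothing about how drawing happens
};

class Box : public Shape
{
public:
	void Draw() { /* draw a box */ }
};

class Sphere : public Shape
{
public:
	void Draw() { /* draw a sphere */ }
};

void Render(Shape* s)
{
	s->Draw(); // no casts, no switch statements; the right Draw runs
}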

Finally, in fixing the bug, I searched through the entire source file for these C-style casts to the wrong type. Almost every function that used this bad cast had a pile of customer crash reports against it from over the years.

Solving STL assertion: ‘Assertion failed: vector iterators incompatible’ when calling std::vector::clear

Recently at work we ran across a problem that resulted from calling std::vector<>::clear(). Calling clear on a vector triggered a debug assertion, which in our case crashed the application. The callstack looked like this:

msvcp100d.dll!std::_Debug_message() Line 13 C++
gw_objio.dle!std::_Vector_const_iterator<std::_Vector_val<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > > >::_Compat() Line 239 C++
gw_objio.dle!std::_Vector_const_iterator<std::_Vector_val<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > > >::operator==() Line 203 C++
gw_objio.dle!std::_Vector_const_iterator<std::_Vector_val<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > > >::operator!=() Line 208 C++
gw_objio.dle!std::vector<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > >::erase() Line 1194 C++
gw_objio.dle!std::vector<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > >::clear() Line 1218 C++

Where the debug message says:

c:\program files (x86)\microsoft visual studio 10.0\vc\include\vector(238) : Assertion failed: vector iterators incompatible

If you look at the code for vector<>::clear it is really very simple:

void clear()
{ // erase all
erase(begin(), end());
}

So what was the problem?

In our particular case, the memory for the std::vector had been allocated via LocalAlloc by some programmer years ago. The problem is not immediately apparent, because LocalAlloc was allocating memory for a typedef’d struct that happened to contain the vector (among other members). Since the vector is NOT plain old data (POD), its constructor was never run, so its internal pointers were never initialized (and no destructor ran when the memory was freed, either). So when the time came to use those pointers (i.e. in clear), things blew up.
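The pattern looked roughly like this (a minimal sketch with made-up names; the real struct came from much older code):

#include <windows.h>
#include <string>
#include <vector>

// Hypothetical struct standing in for the real one; the vector member is the point.
struct FileRecord
{
	int id;
	std::vector<std::wstring> names;   // non-POD member
};

void Broken()
{
	// LocalAlloc hands back raw memory: no constructor runs, so the vector's
	// internal pointers are garbage.
	FileRecord* rec = (FileRecord*)LocalAlloc(LMEM_FIXED, sizeof(FileRecord));
	rec->names.clear();                // debug iterators compare garbage pointers -> assertion
	LocalFree(rec);                    // and no destructor ever runs
}

void Fixed()
{
	FileRecord* rec = new FileRecord();   // constructors run for all members
	rec->names.clear();                   // fine
	delete rec;                           // destructors run
}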

Not fun to debug.

I take it this once appeared to work. But the switch to the Microsoft Visual C++ 10.0 compiler, and its attendant changes in the STL, was apparently enough for the problem to bubble to the surface.

Point of the story:

1. We are in the 21st century now. If you are programming in C++ in the 21st century, don’t use Paleolithic-era APIs to allocate memory for non-POD data.

2. Don’t mix and match memory allocation routines. If you are using C++, use new/delete, especially for non-POD data.

How to display a Bitmap with Gamma Correction in Autodesk 3dsmax

A customer asked me the other day how to display a Bitmap with gamma correction. He was writing a custom maxscript UI control and wanted to display a maxscript bitmap on it. Fair enough. Once he displayed the image, he found it was too dark. The same problem exists in the ImgTag maxscript UI control (found in maxsdk\samples\maxscript\mxsagni\imgtag.cpp): a bitmap displayed in the ImgTag control also comes out too dark. This maxscript showed the problem when run in 3dsmax 2012:

p = "E:\Dev\defects\chris_haydon_100.jpg"
b = openbitmap p
display b

rollout TOO_DARK "too dark imgtag"
(
    imgtag n "fish" bitmap:b
)
createDialog TOO_DARK width:400 height:400

What you would see is that the image in the ImgTag control on the dialog was too dark, while the same image shown in the normal maxscript bitmap display window looked correct.

The reason is that starting in 3dsmax 2012, the gamma settings are turned on by default; in previous versions they were off. And the ImgTag code completely ignored the gamma settings.

To display a Bitmap (maxsdk\include\bitmap.h) correctly you have to pass TRUE for the last parameter of

Bitmap::ToDib( int depth, UWORD *gam, BOOL dither, BOOL displayGamma )

This is used in code like this:

PBITMAPINFO bmi = mbm->bm->ToDib(32, NULL, FALSE, TRUE);

So for instance, the ImgTag was fixed to display a bitmap correctly with gamma in the following code:

int ImgTag::SetBitmap(Value* val)
{
    if(val == &undefined)
    {
        if(m_hBitmap) DeleteObject(m_hBitmap);
        m_hBitmap = NULL;
        m_maxBitMap = NULL;
    }
    else
    {
        HWND hWnd = MAXScript_interface->GetMAXHWnd();

        MAXBitMap* mbm = (MAXBitMap*)val;
        type_check(mbm, MAXBitMap, _T("set .bitmap"));
        m_maxBitMap = val;

        HDC hDC = GetDC(hWnd);
        PBITMAPINFO bmi = mbm->bm->ToDib(32, NULL, FALSE, TRUE);
        if(m_hBitmap)
            DeleteObject(m_hBitmap);
        m_hBitmap = CreateDIBitmap(hDC, &bmi->bmiHeader, CBM_INIT, bmi->bmiColors, bmi, DIB_RGB_COLORS);
        LocalFree(bmi);
        ReleaseDC(hWnd, hDC);
    }

    Invalidate();
    return 1;
}

Solving Link error 1112

Yesterday I brushed off some old code I had not touched in quite a while. It was a layer manager plugin for Autodesk 3dsmax that I had written years ago. I don’t think I had even opened any project files for this project in the last 12 months. So I re-acquainted myself with the code, made some changes to the files, and went to recompile it. Now, this project compiles to both 32-bit and 64-bit targets. I randomly chose a 64-bit target and built the sucker.

Imagine my shock when I encountered a linker error stating:

module machine type ‘X86’ conflicts with target machine type ‘x64’

I knew enough about this error from the past, and had never had a problem before. Like I said, I hadn’t touched the code in a long time. What’s more, the code was under source control!

I opened up the project settings and double and triple checked my Linker Target Machine settings. Indeed it was properly set at: MachineX64. Perfect!

Still the problem persisted. Then I got worried that the library files I was linking in were corrupted, and that somehow 32-bit libraries were intermixed with 64-bit libraries. I didn’t want to download the entire 3dsMax SDK again on the off chance this hypothesis was true. So I researched how to check what type of libraries I actually had!

So I used the utility DumpBin.exe that is found here:

"C:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\bin\dumpbin.exe"

And by the way, in Windows 7 I searched a few times for ‘dumpbin’, which failed every time. Turns out you have to search for ‘dumpbin.exe’. Go figure.

Anyways, I used this tool to dump out the functions in my lib file. Then I was able to inspect the functions and see what platform they were compiled for.

For instance, notice the ‘Machine’ entry in a function I dumped out using dumpbin.exe:

Version      : 0
Machine      : 8664 (x64)
TimeDateStamp: 49B974BD Thu Mar 12 14:46:53 2009
SizeOfData   : 0000002D
DLL name     : maxutil.dll
Symbol name  : ??0Path@Util@MaxSDK@@QEAA@PEBD@Z (public: __cdecl MaxSDK::Util::Path::Path(char const *))
Type         : code
Name type    : name
Hint         : 18
Name         : ??0Path@Util@MaxSDK@@QEAA@PEBD@Z

Notice the second line contains x64, a dead giveaway that this function was compiled for an x64 target platform. So the problem remained: my project links in a lot of libraries from the 3dsmax SDK, which contains about 49 library files. How was I supposed to inspect every function from every library file?

Sounds like a job for a script:

@echo off
call "C:\Program Files (x86)\Microsoft Visual Studio 10.0\VC\vcvarsall.bat"

for %%f in (*.lib) do dumpbin.exe -headers %%f | findstr /c:"Machine      :" >> _lib-exports.txt

pause
@echo on

So I put the above script into a batch file and placed it in the directory that contained all my library files. It iterates through all the library files, dumping out the header information. That output is piped to findstr, which searches for the machine information, and the result is appended to a log file I could inspect at my leisure. The log file contained 13642 lines looking like this:

Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)
Machine      : 8664 (x64)

So that was it. I opened the file in my favorite text editor, and a search for x86 turned up nothing. It was x64 all the way down.

So now what was I supposed to do? I was stuck. Let me review all I had done.

  1. My target machine linker setting was properly set.
  2. My build configuration was properly set to ‘x64’, not Win32.
  3. All the libraries I was importing were actually 64-bit libraries.

All appeared lost and hopeless until I ran across a tidbit on an MSDN forum.

Here the gentleman answered:

If you have left the IDE at the default settings, then it will actually use the x86_amd64 compiler only. This is set under Tools -> Options -> Projects and Solutions -> VC++ Directories. Under the amd64 executables the directory should be $(VCInstallDir)bin\x86_amd64 followed by $(VCInstallDir)bin.

This method works for all versions of Windows so it’s a default setting. If you want to though you can change the x86_amd64 to just amd64.

So I looked at the executable settings for my x64 build configurations in VS 2008. It looked like this:

$(VCInstallDir)bin
$(WindowsSdkDir)bin

I changed it to this by adding the x86_amd64 directory:

$(VCInstallDir)bin\x86_amd64
$(VCInstallDir)bin
$(WindowsSdkDir)bin

And everything now works. My project now compiles and links just fine.

So I am left to surmise that some external add-on to Visual Studio hosed my executable tools settings for 64-bit configurations. Off and on over the last year I had been using Bullseye, which messes around with the project executable directory settings. So perhaps that is what messed up the setting in Visual Studio. But now it’s fixed, and I am very happy.

Pointer Truncation

Pointer truncation is an evil that all developers of native C++ code need to worry about. It is simply storing a wide pointer value in a narrower piece of memory. This becomes an issue when applications are ported from 32 bit to 64 bit. In 32-bit applications, pointers are of course 32 bits wide, so it is no big deal to do something like this:

int* foo = new int[45];
int bad = (int)foo;

However, in 64-bit applications that is bad: the pointer could be using the high bits of its address, and when cast (truncated) to a 32-bit int, it loses that information. The high bits of a pointer are the bits that address regions above 0xFFFFFFFF. For instance, a 32-bit pointer has 8 hexadecimal digits and can look like this:

0x12345678

But a 64 bit pointer has 16 digits (in hex) and can look like this:

0x1111111122222222

The bits shown as 1 above are the high bits. Since a 64-bit app can address more than 2^32 bytes of memory, these bits get used if the app requires lots of memory. There is also a way to force pointers into the higher regions of memory without the expense of actually allocating all that memory: use VirtualAlloc from the Windows API to reserve (but not commit) the lower 4 GB of the address space.

The results of a debugging session involving the code below show that the high bits are completely lost when the pointer is stored in the variable bad.

// a pointer, which is 64 bits wide on a 64 bit platform
int* foo = new int[45];
// this will truncate the memory pointer
int bad = (int)foo; // PURE Evil! Integers are 32 bits wide

Notice how the integer loses the leading (high) part of the pointer value, as shown in the debugger watch window below: [Image: debugger watch window]

If the coder really gets adventurous they can try converting that integer back to a pointer again, with especially disastrous results:

// a pointer, which is 64 bits wide on a 64 bit platform
int* foo = new int[45];
// this will truncate the memory pointer
int bad = (int)foo; // PURE Evil! Integers are 32 bits wide

// an accident waiting to happen
int* worse = (int*)bad; // catastrophic!
// worse now looks like this: 0xffffffffa0040140
// now the high bits of 'worse' are filled with 0xff, which automatically
// makes it point to invalid regions of memory
// Any accessing of this pointer will result in an access violation.

which shows this in the debug watch window:

[Image: debugger watch window]

Notice how the high bits of the pointer ‘worse’ now point to regions of memory that the Windows kernel has flagged as off limits. Any attempt to use this pointer will result in an access violation, and if that is not handled correctly, the application will be terminated.

One good way to find these bugs is to call VirtualAlloc with very big values when your application begins (I’m simplifying greatly here, trust me). This reserves large chunks of address space below 4 GB, so that by the time your application makes actual memory allocations, all or most of the pointers are above the 4 GB boundary. Then you have to test your application quite thoroughly to flush out the hot spots.
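A minimal sketch of that idea, assuming a 64-bit Windows build; the chunk size and loop bounds are arbitrary choices for illustration:

#include <windows.h>

// Greatly simplified (64-bit builds only): reserve, without committing, as much
// of the address space below 4 GB as possible, so later real allocations land
// above the 4 GB mark and truncation bugs show up quickly.
void ReserveLow4GB()
{
	const SIZE_T chunk = 64 * 1024;   // allocation granularity
	for (ULONG_PTR addr = 0x10000; addr < 0x100000000ULL; addr += chunk)
	{
		// MEM_RESERVE only: no physical memory or pagefile space is committed.
		// Failures are fine; the region may already be in use.
		VirtualAlloc(reinterpret_cast<void*>(addr), chunk, MEM_RESERVE, PAGE_NOACCESS);
	}
}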

You might ask: should not the compiler aid you in this situation? Not the MS VC++ 10.0 compiler. Even at warning level 4, nothing was emitted to warn about the truncation.

Also, Code Analysis will not help, as it is not enabled on 64-bit builds. i.e.

1>—— Build started: Project: VirtualMemTest, Configuration: Debug x64 ——
1>cl : Command line warning D9040: ignoring option ‘/analyze’; Code Analysis warnings are not available in this edition of the compiler

And on 32-bit builds of the same code, Code Analysis still emits no warning.
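For completeness, the safe way to hold a pointer value in an integer is to use a pointer-sized integer type; a minimal sketch:

#include <cstdint>   // intptr_t

void StorePointerSafely()
{
	int* foo = new int[45];
	// intptr_t (or Windows' INT_PTR / DWORD_PTR) is always wide enough for a pointer.
	intptr_t ok = reinterpret_cast<intptr_t>(foo);
	int* back = reinterpret_cast<int*>(ok);   // round-trips safely on 32-bit and 64-bit builds
	delete[] back;
}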

MiniDumpWriteDump for an external process

I have written an app to generate a minidump of an external process. It will ultimately be called by our test harness on processes that take too long to execute their unit tests, or on a process that hung (i.e. timed out) while executing unit tests. In those cases I would like to create a minidump before terminating the process.

The minidump will be used to inspect the callstacks of the hung process. By looking at the callstacks you can get an idea of what went wrong, or in our case, where the hang occurred.

So I wrote a handy little command-line tool to do that. It takes as its (only) parameter the full path name of the executable you want a dump of. The cool part is you can also run it while the application is still running. And if you have symbols for your app, you will get nice callstacks in your minidump report.

getminidump (Source code here)
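The core of such a tool boils down to something like the sketch below. This is not the actual source of getminidump, just the general shape; the function name, the dump type, and the access flags are my own choices, and error reporting is trimmed:

#include <windows.h>
#include <dbghelp.h>
#pragma comment(lib, "dbghelp.lib")

// Sketch: write a minidump for a process identified by its PID.
// The real tool takes an executable path and looks up the PID; that part is omitted.
bool WriteDumpForProcess(DWORD pid, const wchar_t* dumpPath)
{
	HANDLE hProcess = OpenProcess(PROCESS_QUERY_INFORMATION | PROCESS_VM_READ, FALSE, pid);
	if (!hProcess)
		return false;

	HANDLE hFile = CreateFileW(dumpPath, GENERIC_WRITE, 0, NULL,
	                           CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
	if (hFile == INVALID_HANDLE_VALUE)
	{
		CloseHandle(hProcess);
		return false;
	}

	// MiniDumpWithIndirectlyReferencedMemory gives decent callstacks without a full dump.
	BOOL ok = MiniDumpWriteDump(hProcess, pid, hFile,
	                            MiniDumpWithIndirectlyReferencedMemory,
	                            NULL, NULL, NULL);
	CloseHandle(hFile);
	CloseHandle(hProcess);
	return ok == TRUE;
}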

_CrtCheckMemory() ignores small heap corruptions

The C runtime libraries have some useful debugging tools in crtdbg.h. In that header file is a useful function called _CrtCheckMemory(), which is used to validate the debug heap. According to MSDN:

The _CrtCheckMemory function validates memory allocated by the debug heap manager by verifying the underlying base heap and inspecting every memory block. If an error or memory inconsistency is encountered in the underlying base heap, the debug header information, or the overwrite buffers, _CrtCheckMemory generates a debug report with information describing the error condition. When _DEBUG is not defined, calls to _CrtCheckMemory are removed during preprocessing.

So it seems like a good idea to put these in your application to check for memory corruptions.

But I found a problem, or rather a limitation, in using this: the function seems to ignore small heap corruptions.

For example, to overwrite an array by 5 elements, I would do the following:

void corruptMemory()
{
    const int ARRAY_SIZE = 500;
    int* iparray = new int[ARRAY_SIZE]; // a memory leak
    int offset = 5;
    int badIndex = ARRAY_SIZE + offset;
    iparray[badIndex] = 0xdeadbeaf; // heap corruption
}

While experimenting with this, I found that _CrtCheckMemory() does not catch this overrun:

void bar()
{
    corruptMemory();
    _ASSERTE(_CrtCheckMemory()); // this should trigger the assert
}

Running the above code, _CrtCheckMemory() remains blissfully ignorant of the memory corruption, returns TRUE to the assert macro, and nothing happens!

Only when I set the offset variable to 6 or higher does _CrtCheckMemory detect anything, presumably because the smaller overrun lands in unchecked slack space between this block’s guard bytes and the next block’s debug bookkeeping. Using the above code, detection manifests itself (in a debug build) as two assert message boxes.

The first assert:

[Image: the first assert dialog]

And the assert arising from the failure condition returned from _CrtCheckMemory to _ASSERTE:

[Image: the _ASSERTE assert dialog]

Of course when I set the offset to a big enough value, say 89, I can corrupt something really important and screw up the CRT, and also crash the application on exit.

I have a project that demonstrates this. The project also demonstrates how to dump memory leaks to the output window at the end of program execution.

I’ll try to put the link here:
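In the meantime, enabling the leak dump at program exit boils down to something like this minimal sketch:

#define _CRTDBG_MAP_ALLOC
#include <stdlib.h>
#include <crtdbg.h>

int main()
{
	// Ask the debug CRT to report any unfreed blocks in the output window
	// when the process exits.
	_CrtSetDbgFlag(_CrtSetDbgFlag(_CRTDBG_REPORT_FLAG) | _CRTDBG_LEAK_CHECK_DF);

	int* leak = new int[500];   // never freed; shows up in the leak report at exit
	(void)leak;
	return 0;
}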

A very subtle bug in class Tab

Over the last two days I’ve been trying to reproduce and fix a very mysterious crash in 3dsmax. The reports from the field showed a crash with the following partial stack frame:

	msvcr90.dll!free(0x20736563)  Line 110	C
 	geom.dll!BitArray::SetSize(0, 0)  Line 179	C++

The problem is that it was crashing while attempting to free an invalid pointer. It was completely mysterious at first: when you look at the code, SetSize doesn’t call free directly, and the memory being freed was completely private, with no outside access.

To tackle the problem, I wrote some unit tests, hoping to learn more about how class BitArray works and therefore how to reproduce the problem. After two days, a lot of unit tests, and a lot of stack frames, I came upon the answer.

The code listing below shows a unit test that demonstrates the crash.

void BitArrayTestSuite::TestCrash()
{
    Tab<BitArray> ar;
    int size = 10;
    // Calling SetCount here only allocates memory using malloc. It does NOT
    // construct instances of the item in the array.
    ar.SetCount(size);
    for (int i = 0; i < size; i++)
    {
        // The bit array is completely invalid here.
        // 1. It has not been constructed.
        // 2. It is full of garbage memory.
        // Use at your own risk.
        BitArray b = ar[i];
        // Take a gamble here and hope it doesn't crash.
        b.SetSize(0);
    }
}

Here a templated container class called Tab (for those who don’t know the 3dsmax SDK) calls SetCount to allocate memory for 10 items. It does this using only malloc; the Tab class does not call any constructors for BitArray. Yet when the memory is accessed, it is treated as a valid array of BitArrays, without so much as a warning to signal to the programmer that they have done something wrong. This is really dangerous: lying to the compiler surely brings consequences.

The final problem shows up when the SetSize method eventually frees the pointer to its ‘bits’. Since the BitArray contains nothing but garbage, it ends up freeing a pointer to random memory.

This isn’t a problem with class BitArray. This class is actually quite robust, up to a point. The problem is the Tab class allows you to shoot your foot off so extremely easily!! Meaning by calling SetCount, you can lie to the compiler, fooling it into thinking you have legitimately constructed elements when you really have NOT.

This problem will occur not only with BitArray, but with anything else that Tab may or can hold.

The particular crash bucket I was investigating, 1597131, mostly came from the following callstack:



  msvcr90.dll!free(0x20736563)  Line 110	C
  geom.dll!BitArray::SetSize(0, 0)  Line 179	C++
  MNMath.dll!MNNormalFace::Init()  Line 31	C++
> EPoly.dlo!CreateOnlyRestore::After()  Line 1497	C++
  EPoly.dlo!CreateOnlyRestore::Restore(1)  Line 1549	C++

The place that contained the Tab abuse was inside of CreateOnlyRestore::After, where there was this code:

MNNormalSpec *pNorm = mpEditPoly->mm.GetSpecifiedNormals();
if (pNorm)
{
	mNewNormFaces.SetCount (nfnum);
	for (int i=0; i<nfnum; i++) mNewNormFaces[i] = pNorm->Face(i+ofnum);
}

Notice how a Tab of MNNormalFace elements was magically ‘constructed’ using only the SetCount method. Then the Init method ended up being called on these phantom elements, which eventually boiled down to freeing a garbage pointer.

Luckily for us, back in August, Bernard fixed a bug that went right through the code above. His fix completely removed the above abuse of Tab in

d:\stage\Renoir_MAX_R106_RL_Stage\devel\3dswin\src\maxsdk\samples\mesh\editablepoly\restore.cpp

and thus the principal place that leads to bucket 1597131.

However, there are six other stack frames that lead to bucket 1597131. Each one will have to be investigated. Meanwhile, there is a good lesson to be learned here:

The method Tab::SetCount does NOT construct your elements for you. Use Tab at your own risk.
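If you really do need a Tab of non-trivial elements like this, one safer pattern (a sketch only, assuming the SDK’s Tab<> and BitArray behave as described above) is to construct, and later destroy, each element explicitly:

#include <new>   // placement new

void SafeTabOfBitArrays()
{
	Tab<BitArray> ar;
	const int size = 10;
	ar.SetCount(size);                 // raw, unconstructed storage
	for (int i = 0; i < size; i++)
	{
		new (&ar[i]) BitArray();       // construct each element in place
		ar[i].SetSize(0);              // now this is safe to call
	}
	// Tab won't run ~BitArray() either, so destroy the elements explicitly
	// before the Tab frees its storage.
	for (int i = 0; i < size; i++)
		ar[i].~BitArray();
}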

Lockup in call DialogBoxIndirectParam at call to NtUserCallHwndParamLock

One day, while running some of our automated test suites, I suddenly got a lockup while running the following script: fat_rendering_exposurecontrol.ms.

I had seen this before, last year, but it was a problem that would come and go on a whim. Even recompiling would make it go away.

The root problem was caused by a ‘less than’ comparison check failing. I’m not really interested here in why it failed, but it helps explain the rest of the problem.

Anyways, it all starts with a call to an STL min function like this:

std::min( a, b )

Eventually the less than operator devolves down to this grotesque mess:

c:\Program Files (x86)\Microsoft Visual Studio 9.0\VC\include\xutility [line 264]

template<class _Ty1, class _Ty2> inline
bool __CLRCALL_OR_CDECL _Debug_lt(const _Ty1& _Left, const _Ty2& _Right,
	const wchar_t *_Where, unsigned int _Line)
	{	// test if _Left < _Right and operator< is strict weak ordering
	if (!(_Left < _Right))
		return (false);
	else if (_Right < _Left)
		_DEBUG_ERROR2("invalid operator<", _Where, _Line);   // <-- here
	return (true);
	}

_Right was 3814.9946

_Left was 6405.5967

_Left < _Right    is false
!(_Left < _Right)    is true

Therefore the code had absolutely no business hitting the else clause of the if statement and calling _DEBUG_ERROR2. Weird!! Perhaps it’s a compiler error… But anyways…

The error eventually reports an assert to the debug CRT, which calls our own debug CRT report hook, whereupon we simply pop up an assert dialog. The problem arises when the assert dialog doesn’t appear, but apparently locks up on something.

The call stack for the assert:

>	user32.dll!NtUserCallHwndParamLock()	<-- That locks up here
	user32.dll!InternalDialogBox()
	user32.dll!DialogBoxIndirectParamAorW()
	user32.dll!DialogBoxIndirectParamW()
	BugslayerUtil.dll!JWnd::Dialog() Line 53	C++
	BugslayerUtil.dll!JModalDlg::DoModal() Line 72	C++
	BugslayerUtil.dll!PopTheFancyAssertion() Line 474	C++
	BugslayerUtil.dll!RealSuperAssertion() Line 281	C++
	BugslayerUtil.dll!SuperAssertionA() Line 566	C++
	maxutil.dll!assert1() Line 237	C++	<-- here we pop up a normal assert
	3dsmax.exe!CrashHandler::Imp::RTLReportHook() Line 1725	C++
	msvcr90d.dll!_VCrtDbgReportW() Line 576	C
	msvcr90d.dll!_CrtDbgReportWV() Line 242	C++
	msvcr90d.dll!_CrtDbgReportW(...) Line 42	C++
	msvcp90d.dll!std::_Debug_message() Line 22	C++	<-- less than operator flubs it here
	dltonerep.dlu!std::_Debug_lt<float,float>() Line 265	C++
	dltonerep.dlu!std::min<float>() Line 3399	C++

So I’m wondering how a call to

::DialogBoxIndirectParam

could lock up?

The NtUserCallHwndParamLock function assembly is this:

NtUserCallHwndParamLock:
00000000778DB7C0  mov     r10,rcx
00000000778DB7C3  mov     eax,1027h
00000000778DB7C8  syscall
00000000778DB7CA  ret
00000000778DB7CB  nop
00000000778DB7CC  nop
00000000778DB7CD  nop
00000000778DB7CE  nop
00000000778DB7CF  nop

It seems I cannot step through the assembly code, because I get a dialog box stating: “Unable to step. The process has been soft broken.”

[Edit]

Eventually I solved this by eliminating the use of std::min and std::max. I simply wrote my own functions to return the minimum and maximum of two values, and the lockup went away.
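The replacements were nothing fancy, something along these lines (a sketch; the real names differed):

// Plain replacements for std::min/std::max; these sidestep the debug CRT's
// _Debug_lt checking entirely.
template <typename T>
inline const T& MyMin(const T& a, const T& b) { return (b < a) ? b : a; }

template <typename T>
inline const T& MyMax(const T& a, const T& b) { return (a < b) ? b : a; }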

As for the lockup, well, it could have been because of some contention over resources, as I was trying to pop up a dialog, and perhaps the CRT was trying to as well.