The incredible expanding layer bug

So a few months ago at work (January), a bug came across my desk for a performance problem in 3dsmax. I have always heard of bugs where someone in an architect’s office would load a file, and it would take 45-50 minutes. Perhaps the file was an autocad file with hundreds or thousands of layers in it. I’ve even had a guy at an architect’s office tell me they had files that took an hour to load before…. Just incredible. Anyways, this bug was that it would take almost a half hour to create 1000 layers in max. The guy who logged the defect, even gave us a maxscript to help reproduce the problem:

 

(–resetmaxfile(#noprompt)clearlistener()layerCount=1000

for i = 1 to layerCount do

(

t = timestamp()

mLayer=LayerManager.newLayerFromName (“blarg_” + i as string)

t += timestamp()

format “%,%n” i t

)

)

I first gave the script a run through in order to reproduce the problem, and indeed I was able to see it took a long long time. I ran it the first time, and it seemed to take an hour. But I after all wanted better numbers than that. So I modified my script to give me the total time.

The final result was 35 minutes to complete everything. During the course of which, the 3dsmax maxscript listener simply stopped responding. Finally it finished and I dumped the printed results into an Excel spread sheet and plotted the results.

The following chart plots the time (in blue) that it takes to create 1000 layers. Each Nth layer from 1 to 1000 is on the X-axis on the bottom. The time (y-axis) is the vertical axis and is plotted in milliseconds.

 

layer_creation_time_chart

By the time the 1000th layer was created, it took nearly 5 seconds. *Ouch*. The blue graph is a class parabolic shape and is in fact some form of an N squared polynomial. This performance degradation is a classic non-linear form. Contrast that with the red line, the expected performance result. Anyways, finding the bug was the real problem at hand. Why was it so slow?

My experiments were of course ridiculously hard to test. After all, you make a change and wait 35 minutes to test it. Finally I stumbled upon a function call to update some UI. Okay, I commented it out, and ran it again. My results astounded me: 4 seconds! The code I removed was simply that, when-ever a layer was created, it would update that little layer dropdown list that is on the layer toolbar:

LayerToolbar

 

Remember that little guy? That UI tool that no one likes nor uses anymore? Well the problem was that little layer dropdown list would add the new layer to some data structure, and then resort all the layers. This was a classic n squared iteration over all the layers. The more layers, the more sorting you have to do. Obviously a performance nightmare.

Once I temporarily removed that UI date function call, the time per layer was so low, that it wouldn’t even register on that graph shown above. But after all, creating layers should update that UI dropdown list eventually right? So if we remove the function call, how will it get updated? To fix that, I simply put in place a call to suspend UI updates, and another to resume UI updates for that UI drop down list. So before creating the 1000 layers, I call that suspend function, and afterwards call the resume function. So that in the end, the Layer UI dropdown list gets updated only once.

My favorite blogger, Joel Spolsky, wrote about this in a classic piece: writing about “Shlemiel the painter’s algorithm”

http://www.joelonsoftware.com/articles/fog0000000319.html

Advertisements

Notes on the Max.Log file, timers and fixing a crash in the OBJ exporter

So lately I was given a task to fix a crash in the 3dsmax OBJ exporter. I won’t go over the history of  this project. But it’s the only .OBJ exporter that we have. Anyways, we have had it for a few years, but just recently it started to fail in our automated tests.

So I looked at it, and could never reproduce it on my machine. To make matters worse, I asked to look at the max.log file, only to find absolutely nothing. So not only could I not reproduce it, the minidump file I had was inconclusive and I possessed an empty log file. All this makes debugging extremely difficult (As in bang your head against the wall type of debugging).

On the other hand the automation machine was regularly failing. So I decided to use logging to help figure out where it was crashing. The theory was that if I could pull a lot of information from the logger then I could find out where it was failing.

But, the problem is that 3dsmax has extremely thin logging capabilities. The only log file that 3dsmax has was written for rendering: more specifically network rendering.  Hence a log file is created in this location:

<Max>NetworkMax.log

It quickly became the default logger for 3dsmax. Calls to the logger are interspersed through most of the code now. But the depth of coverage is not uniform. Also  the code for this logger was implemented in 3dsmax.exe. However since 3dsmax.exe is at the top of the DLL dependency chain, there is no .lib file for a DLL at the bottom of the dependency chain to link to use the Logger. And those DLL’s at the bottom of dependency chain can not link to core.lib to gain access to the core Interface pointer. That is because core.dll also depends on a few of those very DLL’s at the bottom of the ‘food chain’.

So, if you want to use the logger in one of those DLL’s that 3dsmax (or core) depends on… well you can’t. The dependencies would be all backwards. Independent 3rd party plugin developers for max don’t have this problem. They can access a virtual methods on Interface and gain access to the Logger. No problem for them really. As for those independent DLL’s like maxutil.dll,  mesh.dll, mnmanth.dll? Well they don’t have any logging. They are flat out of luck.

So what does this have to do with the OBJ exporter? Well not much, but it does explain the lack of a logger in some code.

So to continue on with the story of fixing the OBJ exporter. I added calls to the max logger from a variety of functions in the OBJ exporter. Basically any function that was 1 and 2 calls away from a call to virtual SceneExporter::DoExport logged it’s function name to the logger.

So I submitted my changes, and the next day the Automation Engineer sent me a max.log file. It showed the crash occured when the main virtual DoExport method ended, then another call to a method ObjExp method ended after that.

2013/03/13 13:29:24 DBG: [02388] [02648] ObjExp::DoExport - end
2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::GetTriObjectFromNode - end <-- Crash after this

The thing is that nothing should have been executing after DoExport finished. Wierd!

The beginning of the export sequence showed quite a few calls to ExportDaStuff (Nice name huh?)

2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::ExportDaStuff - start
2013/03/13 13:29:24 DBG: [02388] [01620] ObjExp::ExportDaStuff - start
2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::nodeEnum - start
2013/03/13 13:29:24 DBG: [02388] [02964] ObjExp::ExportDaStuff - start

When-ever it crashed the max logger showed extra threads. The ThreadID which is shown in the log file, shows 2 and sometimes 3 threads. Keep in mind that the first number in brackets is the process ID. The second number in brackets is the thread ID. OK, at this point having never touched the OBJ exporter in my life I had no idea this was a problem. So I kept looking. Eventually I found the callstack to create the thread and get ‘ExportDaStuff’ going is this:

ProgressDlgProc
CreateExportThread <-- Calls _beginthread
ThExport
ExportDaStuff

Ok, so what was calling CreateExportThread? I found it in the windows procedure for a dialog. So, the code was creating a timer, and on the first timer ‘tick’ (WM_TIMER) it killed the timer and then called the create thread method.

case WM_TIMER:
    if(wParam == TIM_START)
    {
        KillTimer(hWnd, TIM_START);
        ...
        CreateExportThread(hWnd, exp);
    }
...

The documentation for MSDN states that WM_TIMER is a low priority message and will be delivered after everything else is delivered that is more important. Hence that it will usually end up at the back of the message queue. So that means quite a few WM_TIMER messages could stack up before they get delivered all at once. Hardly a precise timer mechanism. So as the above code shows, KillTimer was indeed getting called, but after it got called the other WM_TIMER messages were already in the queue. And then they got delivered, and hence the extra threads were created.

So when the DoExport method finished it cleaned up resources out from under the extra thread which then crashed.

The fix was simply to not rely on timer ticks to create threads. The intent was to create only one thread in response to displaying the dialog. I was therefore happy to oblige by putting the call to CreateExportThread exactly where it belonged: in WM_INITDIALOG.

Lessons learned:

1. If you are writing a plugin for 3dsmax, please be liberal and puts tons of logging statements in your code.

2. Don’t create threads in response to WM_TIMER messages.

3. Or don’t mix business logic with your UI code.

Difference between Debug and Release Builds

 

I wrote this document way back in 2010, which I found sitting on my computer. I wrote it for folks in our QA department. But I thought it good enough to share with other folks since it describes some basic things. So here is the document:

In a debug build of max, uninitialized memory is initialized with not garbage, but a certain pattern used to help catch bugs: 0xcccccccc.

For instance:

0x0017EBD8 cccccccc
0x0017EBDC 00000000
0x0017EBE0 00000000
0x0017EBE4 00000000
0x0017EBE8 00000000
0x0017EBEC 00000302
0x0017EBF0 cccccccc
0x0017EBF4 3ed6d0e0

Uninitialized memory is marked with this pattern to help catch bugs. For any pointer dereferencing this value will cause an access violation and crash the application.

Case Study

The animation reaction manager displays a curve control. That curve control can of course display many curves. Each curve can contain many key points.

clip_image002

And here are the points:

clip_image004

The problem is manifested when you insert a point into the middle of the curve. You should expect this:

clip_image006

But instead, in a debug build you get this:

clip_image008

The problem is not the curve control, but actually the reactor controller. The reactor controller goes back and alters the points after the point was correctly inserted. (A bug was fixed by me btw) This reactor controller does its dirty business after the UI widget has already correctly displayed its data. This highlights two bugs:

1. Why is the reactor controller modifying the UI control when the UI was already correctly displayed?

2. Why is the reactor controller modifying the point with such a massive value?

The point where the first point gets altered is in the Reactor::UpdateCurves function in the following file:
…3dswinsrcmaxsdksamplescontrollersreactorreactor.cpp.

In this code:

CurvePoint pt = curve->GetPoint(0,i);
// pt here is correct
switch (type)
{
case REACTORFLOAT:
pt.p = Point2(masterState->fvalue, slaveStates[i].fstate);
break;

}
DbgAssert(pt.p.y > -1000.0); // assert fails since pt is incorrect
curve->SetPoint(0,i, &pt, FALSE, TRUE);

The value for slaveStates[i].fstate is the following floating point number:

-1.0737418e+008

Where did this value come from?

Well examining it in the debugger’s memory shows this:

memory address of some memory:

0x0017EBD8 cccccccc

As you can see, a debug build take uninitialized memory and fills it with 0xcccccccc. This is to help catch errors. Now if you take that value and interpret it as an integer you get:

0x0017EBD8 -858993460

And if you interpret it as unsigned integer you get:

0x0017EBD8 3435973836

And if you interpret those bits as a floating point number you get:

0x0017EBD8 -1.0737418e+008

That is scientific notation for -1.0737418 x 10^8, or simply -1.0737418 x 100,000,000 or rather, -107374180.0. Or basically negative 107 million.

So this explains why the value of the first point is such a massive negative number. Solving this bug is not the point of this document, but is another story altogether.

In every single Debug build (On Windows platforms), all over the world, all uninitialized variables (on windows platforms) will always be 0xcccccccc. That number is always rendered as the same floating point value shown above. It is always the same with NO EXCEPTIONS.

That means if you want predictability in reproducing bugs you go with a debug build. If you want random behavior you go with a release build:

Here is the same work-flow in a release build:

Max 2010 (Release build):

clip_image010

The inserted point has a completely random y position (thankfully not as big as -107 million!). Also the first point is assigned to a random value. What is the random value? It is whatever the value was the last time memory was used there in that memory location. That memory could be clean, or it could be leftover from anything else. It could be memory that was last used 2 seconds ago, or 2 hours ago, or 2 days ago.

So again, if you see bugs with random behavior, it’s most likely an uninitialized variable. If you want to narrow down the repro steps, and actually solve it, use a debug build.

Solving linker error ‘LNK2022: metadata operation failed’

If you are compiling managed C++ projects, and ever encountered this idiotic linker error:

1>xxx.obj : error LNK2022: metadata operation failed (8013118D) : Inconsistent layout information in duplicated types (_PROPSHEETPAGEW): (0x020001f8).

1>xxx.obj : error LNK2022: metadata operation failed (8013118D) : Inconsistent layout information in duplicated types (_PROPSHEETPAGEA): (0x02000206).

than this post is for you.

For some reason in our code, this linker error always comes in pairs. No matter, in our code base it usually manifests itself after rearranging header files in a file that is compiled with /clr compiler switch.

The problem is caused by windows.h being included after other header files:

#pragma unmanaged
#include “file.h”
#include “otherfile.h”
#include <windows.h>

#pragma managed
….

The solution is to make sure that windows.h is included before anything else.

#pragma unmanaged
#include <windows.h>
#include “file.h”
#include “otherfile.h”

#pragma managed

….

Using ADPlus to get a minidump of a crashing process

ADPlus is a debugging tool that Microsoft includes in their windows debugging toolkit. It can (among other things) create a minidump when an application crashes. It uses the command line debugger (CDB) and waits for the application to crash. It then generates a large heap dump of the process. This is useful for applications and process’s that have no mechanism for handling unhandled exceptions.

It works like this:

adplus –p <process ID> -crash –o C:outputdir

Where process ID is the id of the process to monitor (Or you simply supply the name of the application.

-crash signifies you want it to monitor for crashes in the target app.

-o is the specified output directory.