Over engineered build systems from Hell

While I was at Autodesk years ago we went through various build systems. When I first started the build was written in Perl. Dependencies were specified using a .csv file. I had no idea how it worked, nor did I care how it worked since I was interested in other things during those years. The build could routinely take an hour and 45 minutes, and was mind-numbingly boring since we usually had to do it everyday. And if you were unlucky multiple times per day. Even worse In fact on the build servers the build would routinely take almost 4 hours!

What a horrible tax.

Later on, a hot-shot programmer came along and rewrote the build system to use ruby and rake. It was supposed to be faster, which it kind of was. But the complexity was just as bad, and no one knew ruby nor how rake worked. Then the developer left the company, leaving behind a black box. So that silo of information was gone, and no one knew how the build system worked. It took 2 developers about a year and a half or so to learn the build system well. To really be able to work on it at the level the original developer had done.

To be sure there were problems with the build. It still took a long time to build the product. Somewhere near an hour. About the only thing we did gain was the ability to do incremental builds.

But modifying the build was the main problem. The build, written in ruby, completely reinvented the wheel on so many different areas. To understand this better, you have to understand that the product at the time was built using Microsoft tools because it was based solely on the Microsoft platform. Thus the source project files were in a build language that Microsoft created. That build language was built into visual studio and was called MSBuild. Instead of using Microsoft tools to create the build, ruby and rake were used instead. Instead of using Microsoft tools to parse the xml project files, a ruby gem was used. Instead of using anything from Microsoft to help with the build, everything was re-invented from scratch. Parsing visual studio .vcproj (and eventually .vcxproj) files was done tediously and laboriously and mind numbingly using rake and some xml gem. Talk about recreating the wheel! For instance, lots and lots of code was written to duplicate a simple function Microsoft already offered. Though a Microsoft API call that retrieved a fully instantiated object with full properties all intact.

Copying files to the build directory was another disaster too. It would take around 10 to 12 minutes to copy 7000~14000 files. It originally was somewhere near 7000 files, but grew over time. All written in ruby code that no one knew how to debug except to put print statements in.

Another problem was adding build properties. If you wanted to add a build property (a key value pair), you had to add it to multiple places in the ruby build, knowing exactly what to modify (in duplicate) and such. It was horrible.

Mixing ruby with MSBuild was like mixing iron and clay. They don’t mix well at all. It was like a ruby straight jacket that hindered the build and visual studio upon which it was based.

There had to be a better way.

Eventually when the frustrations with the build boiled over, I learned MSBuild and figured out how to build max without ruby. It took over a year, from when I first got a working prototype, to get something into the main branch of max. Simply due to bureaucratic inertia. There are lots of people in positions of power who simply say no before learning about a subject. Which was something all too common there. The first liberation was freeing the list of things to get built from ruby. Yes the list of DLL’s and EXE’s to get built was specified in some arcane ruby syntax somewhere. The first thing was moving that list to a democratic XML file. Now any build language could parse it and find out what to build. The second thing was moving the list of files to copy to an XML file. Now any build system could know what files to copy as well.

Once those two things were in place, it was time to put in the build system that I originally came up with (during a particular Christmas break).

It was written in pure XML, with one MSBuild extension that was written using C#. All the tools were native to visual studio, and what you built on the command line was what was built in visual studio. They both used the same properties (using native property sheets) and built in the same way.

What’s more I found that using native MSBuild tools to copy those 7000+ files to build now was incredibly fast. In fact, while once debugging through that old ruby code responsible for copying, I found the source of the 10 minute copy task. Or why it took 10 minutes. It was using an N factorial algorithm! So given directory A with B thru Z subdirectories, it would iterate through all directories n! times. Each directory was not parsed once, but N! times according to the amount of sub-directories that existed. It was an unholy mess that proves that re-inventing the wheel is usually a disaster waiting to happen. Now back to the improvement: With the new msbuild copy mechanism it took 20 seconds to copy all those files. 20 seconds versus 10 minutes was a big improvement.

Incremental builds also greatly improved. Go ahead and build your product from scratch. Now don’t change a thing and rebuild. If you have a smart build system, it should just take a few seconds and nothing will happen. The build system will be smart enough to essentially report that nothing changed and therefore it did no work. My new build did just that in a matter of seconds. The old build system used to take about 5 minutes to do that (And it still did work anyways…).

Speaking of performance. The time to compile and link the actual files didn’t change much, because that was always in visual studio’s corner and not ruby. The improvement in performance came from the copying actions that now took 20 seconds. Also noticeable was the shorter time involved from when the build started to when the first CPP file was getting compiled. In ruby/rake it took quite a few minutes. In the new build it took a few seconds. Eventually when I got a new SSD hard-drive, I was able to get the build down to 22 minutes on my machine.

The build at Century Software

Later on I moved to Century Software, a local company to where I live. That was such a fun place. Anyways their build system for their windows product was written in Make! Yes Make the original, ancient build system that you can’t even find documentation for anymore. I mean literally, I found (I think) one page of documentation somewhere on some professors lecture notes. The docs were  horrible. Make implemented here was so old it built one C file at a time. No multi-threading, no parallel builds nothing. Slow was the operative word here. That and a incomprehensible built output that was so verbose it was almost impossible to comprehend. The only good thing about it was that it immediately stopped on the first error.

So eventually I rebuilt that using MSBuild too. It took me a few months in my spare time. No bureaucratic inertia, no one telling me no. I just worked on it in my spare time and eventually I had a complete and fully functioning system for the tinyterm product. This build was the best I’ve ever done, with zero duplication, extremely small project files and a build that was very fast. It went from 45 minutes to a minute and a half.

When writing a product, the build system should be done using the tools that the platform provides. There should be no transmogrifying the code, or the build scripts (like openssl) before doing the build. When writing ruby on rails use rake for your build process’s. When targeting Microsoft platforms use msbuild. Using java, then use maven. Data that must be shared between build platforms should be in xml so that anything can parse them. And most important of all, distrust must go, and developers and managers must have an open mind to new things. Otherwise the development process’s will take so long, and be so costly, and exert such a tax that new features will suffer, bugs will not get fixed and customers will not be served and they will take their money elsewhere.

Writing Stable 3ds Max Plugins

I found this document while looking through old files today, and thought I’d share it. It was from a lecture I gave at Autodesk University back in 2012. It applies to 3ds max, but has some points that would be applicable to software development in general.

[Note] I wrote this along time ago, and today I saw this blog post again. I thought in fairness, I should add a few things. These were my guidelines I came up with based off of years of experience and years of experience of fixing bugs in 3dsmax. While I firmly believe in every single last one of them, unfortunately, hardly any of these things ever entered the thoughts of most 3dsmax developers. Most coded all day long being blissly unaware of warning level 4, and none ever showed an interest in static analysis except two people. In fact management most of the time was completely unsympathetic to these ideas. Of course management, even the development managers who used to be programmers simply didn’t care. They just wanted bugs fixed fast. No matter, nor interest was given to systematically fixing the fence at the top of the cliff. All thoughts were to get ambulances to the dead bodies at the bottom of the cliff as fast as possible. As a result the fences at the top were always full of holes. By the time I left Autodesk in the spring of 2014, only a dozen or so projects compiled at warning level 4. And no systematic static analysis was being done by anyone. I could go on, but that’s a thought for another blog post.


Preventing crashes and increasing stability in software is a difficult task. There is no practice nor set of practices that will completely prevent all crashes. However there are a few things that will help reduce errors. With that short introduction let us get started.

Basic Responsibilities

These are basic practices that would apply no matter where you worked and no matter which product you worked on.

Compile at warning level 4

You should compile your plugins at warning level 4. Level 4 warnings can help point out subtle programming errors that can lead to bugs that can be absurdly difficult to discover later on.  This is a free and practically instantaneous way to find bugs early in the development process.

Code compiled at level 4 is better than code compiled at anything less. Level 4 warnings should be turned on, and no warnings should be suppressed.

The 3ds Max SDK compiles cleanly at warning level 4, and has been that way for at least 3 years now.

Case in Point:

We turned on warning level 4 for an old project recently. One level 4 warning pointed to some unreachable code. This was caused by a break statement that was left in in a loop. This problem eventually resulted in a complete feature not working.

Compile with Static Analysis

The highest version of visual studio comes with a static analyzer called Code Analysis. This feature can be turned on for native or managed code in visual studio. Static analysis does a deep scrutinization of the code and can help spot bugs. These bugs are more complex than what level 3 or 4 warnings can give. But these warnings are usually so fundamental that they can be likened to level 1 or 2 warnings.

Case in Point:

The static analyzer can detect allocation/de-allocation mismatches. For instance we turned it on and found when memory was allocated with new[] but was de-allocated with delete, instead of delete []. We found lots of these scattered throughout our large source code base. The advantage of this is that it is so easy to detect. Without static analysis it would take a special tool like bounds checker to reveal a memory allocation mismatch, and that would only be after exhaustive testing.

Check Pointers for NULL

By far the most common fix I have seen for known crashes in 3dsmax is to check a pointer for NULL. This is the most persistent problem that I have ever seen in C/C++ code. Get into a habit now to check every pointer for NULL before using it.  A corollary to this is to initialize all pointers to NULL before and after they are used.

Case in Point:

The visual studio static analysis tool emits various warnings for possible dereferencing of null pointers. Consequently I have rarely seen this problem in code that compiles at level 4 with static analysis.
For the Rampage Release the 4th highest Crash in 64 bit max was a crash using Ray Traced Shadows. The shadow code contained a buffer of floating point values that was uninitialized. It was extremely difficult to track down, as it was only manifest when the debugger was NOT attached.

Check before casting

If you lie to the compiler, your application will come back and bite you. C is a language that is seemingly built on casts, where anything can seemingly be cast to anything else. This ability to so easily lie to the compiler and misrepresent your types to the compiler is dangerous and risky. Therefore prefer to use C++ style casts. By turning on RTTI and using C++ style casts, the results of the cast can be checked for validity.

Case in Point:

In the sdk header file imtl.h is a class called MtlBase which has 4 derived classes. One of those classes is class Mtl. I found functions in MtlBase that was blindly assuming the instance (i.e. this) was an instance of class Mtl. However this ignored the fact that there were 3 other derived classes from MtlBase. Thus it was casting the ‘this’ pointer to class Mtl, and then doing work on that miscast pointer.

Avoid stack based strings

A very common way to crash the application is over-reliance on stack based C strings. This code for instance is very dangerous:

void foo() {

One of the problems with stack based strings, is operating on a string that is bigger than the buffer. This of course can corrupt the callstack.  This is almost impossible to debug afterwards and usually makes reading minidump crash files an exercise in frustration.  The danger can be minimized by using the newer safe string functions. For instance instead of using strcat, which can easily run over the end of a string, you can use strcat_s which is a safer version.

When possible use TSTR or MSTR instead , where the actual string buffer is stored on the heap, and not the stack. Then if anything does go wrong, it will not corrupt the callstack.

Now a disclaimer: Max has a lot of stack based strings all over the place (It is has been around a long time of course). But their usage is getting reduced as we now favor TSTR or MSTR.

Case in Point:

The code for the customization dialog contained a for loop that concatenated a string into a stack based buffer of limited size. The for loop interated too many times and the buffer overflowed, corrupting other items on the stack. That stack based buffer was several frames up the stack. When that stack frame was cleaned up, it crashed. Diagnosing the problem was difficult since the symptom was several function calls away from the source of the problem.

Avoid using catch(…)

If at all possible avoid using catch(…). Prefer to handle more specific exceptions like catching an out of memory exception such as (std::bad_alloc). While using catch(…) may prevent the application from crashing, it can  also hide bugs and make it more difficult to solve crashes. It is useful for debugging to actually remove a catch(…) and let the program crash exactly where the cause of the crash is located. You should generally catch only those errors that you can handle, and let the ones that you cannot pass through so that the larger system can handle it if possible, or crash in the “correct” place rather than delay it.

Now catch(…) can be used when it does something to correct the program state. This should be done only after careful consideration, usually with multiple developers. Also side affects needs to be considered as well. If a catch is used to wrap a call to thousands of 3ds Max functions, than it probably shouldn’t be used. However wrapping a call to a 3rd party library is acceptable. Everything needs to be balanced of course.

Certain regular expressions can easily be written to help search for empty catch statements. The static analyzer PVS-Studio will also help identify these too.

Case in Point:

I regularly review the usage of catch(…) in the source code, and have over the years taken out a few catch(…). As a result, the clarity of crashes from customers in the field has increased.

Use Debug Builds

When debug builds are available, they should be used for testing and development. 3ds Max is the only M&E product that provides debug builds to ADN partners, all though they may be slow in delivery. However despite the delays a debug build provides a great resource in validating your plugins.

[Note: It turns out to be very ironic that I put this here, since the 3dsmax team does not use debug builds. Sure the devs do, but in all my years there I could never get management to move to have the QA testers use debug builds. Never the less I believe in debug builds and that they are far superior for testing than release builds.]

Watch log file, asserts and the debug output

Log File

3dsmax has a log file that writes to <max install>networkmax.log. This file is mainly used for debugging network rendering, which was its original purpose. However, it has grown to become a popular logging mechanism for max. This log can provide useful information, but it still is under-utilized and cannot be expected to report program state consistently across the product.


Do not ignore asserts (remember debug builds?). Use asserts liberally in your own code and don’t suppress asserts unless they are logged and checked afterwards (for example, using automated testing). The assert system will automatically log all asserts (whether they are suppressed or not) to the file: <max install>3dsmax.assert.log.

Debug output window

The Visual Studio debug output window (debugging window) provides significant output and can be useful to watch during debugging sessions. Be sure to turn on display for all types of events including exceptions (very important) and regular program messages. If you want to check debug output without attaching a debugger, than you can use a Microsoft tool from sysinternals called DbgView. See the following website for details: http://technet.microsoft.com/en-US/sysinternals

Disclaimer: The MetaSL system parses a lot of files when 3ds Max starts up. This will generate a lot of exceptions that are benign, so not to worry. The reason is The MetaSL system, from Mental Images, uses a 3rd party library (antler) to parse files, which in turn uses exceptions for program flow.

Enable Break On Exception:

Visual Studio has debugging options that allow it to break execution when an exception is thrown. This should be used as often as possible. This is the corollary to the “No catch(…)” above.  There are a few places where max actually does use catch(…), for example in the maxscript kernel.  By enabling this feature, exceptions are immediately brought to the attention of the developer.

Max Specific Problems

Do not hold naked simple pointers to ReferenceTarget’s

A class that is not a ReferenceMaker should not hold a plain old pointer to a ReferenceTarget, or a class that derives from a ReferenceTarget, without some mechanism to ensure validity of the pointer before use (i.e. AnimHandles). Instead replace the simple pointer with a SingleRefMaker class instance, and have that observe the ReferenceTarget.

Good Bad
class Good{…

SingleRefMaker mObserve;


class Risky{…

ReferenceTarget* mObserve;



Do not write dynamic arrays of ReferenceTarget’s.

Do not write a class that holds an array of ReferenceTarget’s: especially when that array grows and shrinks at runtime.

A class like this usually has a container that holds pointers to ReferenceTargets. It usually overrides ReferenceMaker::NumRefs like this:

int RumRefs() { return myArray.Count(); }

Instead of a fixed number of items:

int RumRefs() { return 3; }

This cannot be done correctly without considering undo and redo (Subclassing class RestoreObj). The fundamental weakness of the reference system is that it expects references to be in a fixed position. That reference index is an internal implementation of the ReferenceMaker that should be invisible to clients. However clients routinely use the reference index to get a certain Target. And one of those clients is the undo system. One of the complications of such an implementation is that the Undo System usually expects that internal array to never shrink in size. If a ReferenceTarget is removed from the internal array, a RestoreObj usually should or could point to its old reference slot. The Reference System of course has no idea that the internal array shrunk in size, so if an undo action occurs it may stick that Reference back into the wrong slot. To avoid that, a common practice is to make dynamic reference arrays grow but never shrink. This wastes memory.
For example: Undo and Redo can change the size of the internal array via SetReference. So if you have an array with 10 ReferenceTarget’s and your undo/redo object happens to ‘redo’ and stick a reference back in at slot 5, well, all your other pointers from index 5 to 10 have now had their indexes bumped up by one. So now anything dependent or holding on to those moved ReferenceTarget pointers are now dangling.

There are a few alternatives to this:

  • Use class IRefTargContainer.
  • Use an array of AnimHandle’s.
  • Use a ParameterBlock

Do not access the Reference System after NOTIFY_SYSTEM_SHUTDOWN

The notification message NOTIFY_SYSTEM_SHUTDOWN (See notify.h) is broadcast before plugins are unloaded. It is critically important to drop all references to plugins in response to this message. There are many plugin modules that define ReferenceTargets that will then get unloaded shortly afterwards. Once the plugin module is unloaded, trying to access a ReferenceTarget defined in that module can result in a crash.

Do minimal work in DllMain

The MSDN docs state that minimal work should be done in DllMain. Specifically it warns against loader lock, among other things. The DllMain function can be called as a result of LoadLibrary. When LoadLibrary is executed a critical section is locked while your DllMain is active. If you try to do work that for example needs another DLL to get loaded, it could lock up the application as a race condition. Instead of doing work in DllMain on shutdown, there are a few other ways to do plugin initialization and unitialization. For example:

  • You can do uninitialization work in response to NOTIFY_SYSTEM_SHUTDOWN. (see notify.h)
  • You can and should use a LibInitialize and LibShutdown functions.

A similar warning is not to do heavy work in static variables constructors, because a static variable will get constructed close in time to when DLLMain is called. Then, when the static variable is constructed, the DLL may not be fully loaded and types needed by the constructor may not be available yet.

Do not violate party etiquette

Uninvited guests should not crash the 3ds Max party. When the party is over: go home.

Uninvited guests

Every plugin has an appropriate time in which it should be initialized, do its work and shutdown. For example:

  • A plugin for a color picker should not instantiate one when 3ds max starts up.
  • A plugin for a scene browser should be active ONLY when its UI is active.

It is entirely possible and probable that users can start max and NEVER use your plugin. Therefore do not waste memory and resources for a feature that may not get used. Do the work when users actually invoke your feature. In other words when 3ds Max starts up, the plugin should not invite itself to the 3ds Max party, it should wait for an invitation.
This rules is violated on startup by loading 3rd party libraries, instantiating plugin classes, holding pointers to the node scene graph and registering callbacks to common scene events (my favorite pet peeve: “Hey max crashed in this function even though I never used this feature?”).  When max loads a plugin, the major things 3ds Max requires from a plugin are:

  • The number of class descriptors
  • A way to get those class descriptors.
  • Some pointers to LibInitialize and LibShutdown functions.

Therefore class descriptors really are the only things that should be instantiated on module load or startup. There should be no static instances of the actual plugin class, whether it is a material plugin, shadow, utility, or renderer. Of course there are exceptions such as function published interfaces and parameter block descriptors that often are statically defined: But I’m not talking about those.

No loitering

When 3ds Max shuts down, it sends out the most important broadcast notification in all of 3ds Max (found in notify.h): NOTIFY_SYSTEM_SHUTDOWN. This means the 3ds Max party is over. The plugin should completely shut itself down or disassociate itself completely from all max data. For example: All References should be dropped. All arrays holding pointers to INode’s should be cleared out etc… And most common and most dangerous: All callbacks functions that are registered should be unregistered.

When NOTIFY_SYSTEM_SHUTDOWN is broadcast, the entire max scene is completely intact and still in a completely valid state. During any callbacks or notifications after that, 3ds Max will contain less and less of a valid state to work with. In other words as 3ds Max progresses in its shutdown sequence less and less of the max scene will be valid. So for instance the other shutdown notification NOTIFY_SYSTEM_SHUTDOWN2 is called merely when the main 3dsmax window (think HWND) is destroyed. No plugin should be responding to that message to (for example) iterate through the scene graph. Likewise the LibShutdown functions should not be iterating the scene graph.

Case In Point

Say that a plugin that depends on another library like this:
plugin.dll -> library.dll
When the plugin is loaded by max, the tertiary library will also (automatically) get loaded. But when the plugin is unloaded the tertiary library will not get unloaded. That is unless the reference count on the library is decremented to zero. This will not happen unless FreeLibrary is specifically called on library.dll (Which is not a common nor recommended practice). Thus instead, the library will get freed or shutdown long after WinMain exits and max has uninitialized and is gone. Therefore the tertiary library should not contain any dependencies on anything in the 3ds Max SDK. Thus for example GetCOREInterface() should never be called in a DllMain of a dependent module to a plugin (i.e. library.dll ).

Quality Testing

Developers can implement the following practices in their software development processes:

Automated regression testing

All good production pipelines should have regression testing that occurs automatically after a build. This is critical to help catch bugs before they get to customers in the field. Also the developers should have access to these automated tests so that they also can run these tests before submitting their code.

Dynamic Memory Analysis

This means using 3rd party tools to profile, analyze, check and verify memory during runtime of the application.

The following list of tools is a partial example of what is available:

  • MicroFocus BoundsChecker: Checks for memory leaks, or memory allocation mismatches among a host of other things.
  • Microsoft’s Application Verifier also checks for various memory problems during runtime such as accessing an array out of bounds.
  • Visual Leak Detector (Open source on codeplex.com) checks for memory leaks. It is fast, efficient and stable.

Code coverage

This is using a tool to measure how much of your application or plugin was actually tested during execution. This helps a developer to know when they have tested the product enough. It also can help a developer find areas they have not tested. Simply put untested code is buggy code, and a code coverage tool helps in this regard. The best tool I have ever seen for this is Bullseye (bullseye.com). It works for native C++ and is easy to use, and very fast. It requires instrumentation of the code during the build which can double the build time,but runtime performance is excellent.









The incredible expanding layer bug

So a few months ago at work (January), a bug came across my desk for a performance problem in 3dsmax. I have always heard of bugs where someone in an architect’s office would load a file, and it would take 45-50 minutes. Perhaps the file was an autocad file with hundreds or thousands of layers in it. I’ve even had a guy at an architect’s office tell me they had files that took an hour to load before…. Just incredible. Anyways, this bug was that it would take almost a half hour to create 1000 layers in max. The guy who logged the defect, even gave us a maxscript to help reproduce the problem:

	for i = 1 to layerCount do
		t = timestamp()
		mLayer=LayerManager.newLayerFromName ("blarg_" + i as string)
		t += timestamp()
		format "%,%n" i t

I first gave the script a run through in order to reproduce the problem, and indeed I was able to see it took a long long time. I ran it the first time, and it seemed to take an hour. But I after all wanted better numbers than that. So I modified my script to give me the total time.

The final result was 35 minutes to complete everything. During the course of which, the 3dsmax maxscript listener simply stopped responding. Finally it finished and I dumped the printed results into an Excel spread sheet and plotted the results.

The following chart plots the time (in blue) that it takes to create 1000 layers. Each Nth layer from 1 to 1000 is on the X-axis on the bottom. The time (y-axis) is the vertical axis and is plotted in milliseconds.


By the time the 1000th layer was created, it took nearly 5 seconds. *Ouch*. The blue graph is a class parabolic shape and is in fact some form of an N squared polynomial. This performance degradation is a classic non-linear form. Contrast that with the red line, the expected performance result. Anyways, finding the bug was the real problem at hand. Why was it so slow?

My experiments were of course ridiculously hard to test. After all, you make a change and wait 35 minutes to test it. Finally I stumbled upon a function call to update some UI. Okay, I commented it out, and ran it again. My results astounded me: 4 seconds! The code I removed was simply that, when-ever a layer was created, it would update that little layer dropdown list that is on the layer toolbar:


Remember that little guy? That UI tool that no one likes nor uses anymore? Well the problem was that little layer dropdown list would add the new layer to some data structure, and then resort all the layers. This was a classic n squared iteration over all the layers. The more layers, the more sorting you have to do. Obviously a performance nightmare.

Once I temporarily removed that UI date function call, the time per layer was so low, that it wouldn’t even register on that graph shown above. But after all, creating layers should update that UI dropdown list eventually right? So if we remove the function call, how will it get updated? To fix that, I simply put in place a call to suspend UI updates, and another to resume UI updates for that UI drop down list. So before creating the 1000 layers, I call that suspend function, and afterwards call the resume function. So that in the end, the Layer UI dropdown list gets updated only once.

My favorite blogger, Joel Spolsky, wrote about this in a classic piece: writing about “Shlemiel the painter’s algorithm”


Notes on the Max.Log file, timers and fixing a crash in the OBJ exporter

So lately I was given a task to fix a crash in the 3dsmax OBJ exporter. I won’t go over the history of  this project. But it’s the only .OBJ exporter that we have. Anyways, we have had it for a few years, but just recently it started to fail in our automated tests.

So I looked at it, and could never reproduce it on my machine. To make matters worse, I asked to look at the max.log file, only to find absolutely nothing. So not only could I not reproduce it, the minidump file I had was inconclusive and I possessed an empty log file. All this makes debugging extremely difficult (As in bang your head against the wall type of debugging).

On the other hand the automation machine was regularly failing. So I decided to use logging to help figure out where it was crashing. The theory was that if I could pull a lot of information from the logger then I could find out where it was failing.

But, the problem is that 3dsmax has extremely thin logging capabilities. The only log file that 3dsmax has was written for rendering: more specifically network rendering.  Hence a log file is created in this location:


It quickly became the default logger for 3dsmax. Calls to the logger are interspersed through most of the code now. But the depth of coverage is not uniform. Also  the code for this logger was implemented in 3dsmax.exe. However since 3dsmax.exe is at the top of the DLL dependency chain, there is no .lib file for a DLL at the bottom of the dependency chain to link to use the Logger. And those DLL’s at the bottom of dependency chain can not link to core.lib to gain access to the core Interface pointer. That is because core.dll also depends on a few of those very DLL’s at the bottom of the ‘food chain’.

So, if you want to use the logger in one of those DLL’s that 3dsmax (or core) depends on… well you can’t. The dependencies would be all backwards. Independent 3rd party plugin developers for max don’t have this problem. They can access a virtual methods on Interface and gain access to the Logger. No problem for them really. As for those independent DLL’s like maxutil.dll,  mesh.dll, mnmanth.dll? Well they don’t have any logging. They are flat out of luck.

So what does this have to do with the OBJ exporter? Well not much, but it does explain the lack of a logger in some code.

So to continue on with the story of fixing the OBJ exporter. I added calls to the max logger from a variety of functions in the OBJ exporter. Basically any function that was 1 and 2 calls away from a call to virtual SceneExporter::DoExport logged it’s function name to the logger.

So I submitted my changes, and the next day the Automation Engineer sent me a max.log file. It showed the crash occured when the main virtual DoExport method ended, then another call to a method ObjExp method ended after that.

2013/03/13 13:29:24 DBG: [02388] [02648] ObjExp::DoExport - end
2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::GetTriObjectFromNode - end <-- Crash after this

The thing is that nothing should have been executing after DoExport finished. Wierd!

The beginning of the export sequence showed quite a few calls to ExportDaStuff (Nice name huh?)

2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::ExportDaStuff - start
2013/03/13 13:29:24 DBG: [02388] [01620] ObjExp::ExportDaStuff - start
2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::nodeEnum - start
2013/03/13 13:29:24 DBG: [02388] [02964] ObjExp::ExportDaStuff - start

When-ever it crashed the max logger showed extra threads. The ThreadID which is shown in the log file, shows 2 and sometimes 3 threads. Keep in mind that the first number in brackets is the process ID. The second number in brackets is the thread ID. OK, at this point having never touched the OBJ exporter in my life I had no idea this was a problem. So I kept looking. Eventually I found the callstack to create the thread and get ‘ExportDaStuff’ going is this:

CreateExportThread <-- Calls _beginthread

Ok, so what was calling CreateExportThread? I found it in the windows procedure for a dialog. So, the code was creating a timer, and on the first timer ‘tick’ (WM_TIMER) it killed the timer and then called the create thread method.

case WM_TIMER:
    if(wParam == TIM_START)
        KillTimer(hWnd, TIM_START);
        CreateExportThread(hWnd, exp);

The documentation for MSDN states that WM_TIMER is a low priority message and will be delivered after everything else is delivered that is more important. Hence that it will usually end up at the back of the message queue. So that means quite a few WM_TIMER messages could stack up before they get delivered all at once. Hardly a precise timer mechanism. So as the above code shows, KillTimer was indeed getting called, but after it got called the other WM_TIMER messages were already in the queue. And then they got delivered, and hence the extra threads were created.

So when the DoExport method finished it cleaned up resources out from under the extra thread which then crashed.

The fix was simply to not rely on timer ticks to create threads. The intent was to create only one thread in response to displaying the dialog. I was therefore happy to oblige by putting the call to CreateExportThread exactly where it belonged: in WM_INITDIALOG.

Lessons learned:

1. If you are writing a plugin for 3dsmax, please be liberal and puts tons of logging statements in your code.

2. Don’t create threads in response to WM_TIMER messages.

3. Or don’t mix business logic with your UI code.

Bad Casts and wrong assumptions about class hierarchies

Lately here at work, I have been looking into solving a certain crash in the product. This particular crash was occurring in a certain function that had many crashes over the years.

The situation involved code in a base class that that was doing work for a derived class. In fact the base class was calling methods on the derived class. To do that, it had to cast itself to a pointer of the derived class (using the ‘this’ pointer). But since there were was more derived classes than one, the cast was sometimes performing an invalid cast when called with an instance of the other derived class.

So, sometimes when fixing a mysterious crash, just look for a bad cast. 🙂

The code had a class hierarchy roughly like this:

Class Base
	Class A
	Class B

So here, class A inherits from class Base. Similarly, class B also inherits from Base.

In this example case, some code in class Base was always casting the ‘this’ pointer to A. However called with an instance of B, the assumption won’t hold, and a cast will simply corrupt the pointer. Furthermore, it did this with a C style cast, which offers no warning nor protection.

This code demonstrates the problem:

void B::foo()
	A* temp = (A*)this; // B is not A!
	// crash attempting to do something with temp

So if called like this,

B* b = new B();

we should expect a crash calling foo since class B is not class A.

There were a few things wrong with this.

  1. A base class should not implement work that rightly belongs to a derived class. For instance it would be bad if a shape class should be in charge of drawing a box and a sphere. To be polymorphic a shape class would offer a virtual or abstract draw method, and leave the implementation details to the derived box and sphere classes. Thus a derived class pointer could stand in for a base class pointer (demonstrating polymorphism) and with no switch statements just implement the needed functionality.
    So in this case here, the whole concept of polymorphism here seems to have been violated. If a derived class performs an operation, then the base class should not be calling it. In fact, the base class should always be completely ignorant of the implementation details of a derived class. In this case, the operations being performed didn’t involve any virtual methods…. Thus without a virtual method, polymorphism cannot be implemented. Hint: using C++ it is possible to replace switch statements and if statements with simple virtual method calls. Thus this code was pure C code masquerading as C++ code.
  2. An incomplete understanding of the class hierarchy that derives from this base class. While the assumptions may have once held, once other derived classes were added later on, the assumptions would fail. Thus casting everything to one particular derived class to perform operations was very fragile, and asking for trouble.
  3. Use of C style casts here is plain wrong in C++. A cast like this should fail, but in C, no cast ever fails. A C style casts may be fast, but it offers no protection. Thankfully the Microsoft Visual C compiler offers a measure of protection however for C++. By turning on Run Time Type Identification (RTTI) in C++ and using any of the standard casts (static_cast<>, dynamic_cast<>, reinterpret_cast<>) the operation will fail with invalid data and return NULL (or nullptr in VC 10.0) . Thus B::foo could have been written like this:
void B::foo()
	A* temp = dynamic_cast(this);
	if (temp)
		// do something with temp

Which in this case as shown, the code inside the if statement would never get executed.

Finally in fixing the bug, I did a search through the entire source file looking for these C style casts to the wrong data type. And for almost every function that used this bad cast, we had lots of crashes in that function from customers over the years.

Useless files in the 3dsmax build

For 3dsmax 2013 we moved to the Microsoft Visual C 10.0 compiler. Therefore we didn’t want to ship anything that was built with version 9.0 of that compiler. So as a result, I was digging through all the modules of 3dsmax and found a group of DLL’s that are NOT used at all. These files come with the RealDWG library and thus get automatically included with our product. I suspect that any product that uses the Autodesk RealDWG toolkit might have these files:


Apparently Ambercore is a company that makes software for working with point clouds. 3dsmax doesn’t use it, so if  you want to make your max installation smaller, leaner and meaner then you can delete the files.

Difference between Debug and Release Builds


I wrote this document way back in 2010, which I found sitting on my computer. I wrote it for folks in our QA department. But I thought it good enough to share with other folks since it describes some basic things. So here is the document:

In a debug build of max, uninitialized memory is initialized with not garbage, but a certain pattern used to help catch bugs: 0xcccccccc.

For instance:

0x0017EBD8 cccccccc
0x0017EBDC 00000000
0x0017EBE0 00000000
0x0017EBE4 00000000
0x0017EBE8 00000000
0x0017EBEC 00000302
0x0017EBF0 cccccccc
0x0017EBF4 3ed6d0e0

Uninitialized memory is marked with this pattern to help catch bugs. For any pointer dereferencing this value will cause an access violation and crash the application.

Case Study

The animation reaction manager displays a curve control. That curve control can of course display many curves. Each curve can contain many key points.


And here are the points:


The problem is manifested when you insert a point into the middle of the curve. You should expect this:


But instead, in a debug build you get this:


The problem is not the curve control, but actually the reactor controller. The reactor controller goes back and alters the points after the point was correctly inserted. (A bug was fixed by me btw) This reactor controller does its dirty business after the UI widget has already correctly displayed its data. This highlights two bugs:

1. Why is the reactor controller modifying the UI control when the UI was already correctly displayed?

2. Why is the reactor controller modifying the point with such a massive value?

The point where the first point gets altered is in the Reactor::UpdateCurves function in the following file:

In this code:

CurvePoint pt = curve->GetPoint(0,i);
// pt here is correct
switch (type)
pt.p = Point2(masterState->fvalue, slaveStates[i].fstate);

DbgAssert(pt.p.y > -1000.0); // assert fails since pt is incorrect
curve->SetPoint(0,i, &pt, FALSE, TRUE);

The value for slaveStates[i].fstate is the following floating point number:


Where did this value come from?

Well examining it in the debugger’s memory shows this:

memory address of some memory:

0x0017EBD8 cccccccc

As you can see, a debug build take uninitialized memory and fills it with 0xcccccccc. This is to help catch errors. Now if you take that value and interpret it as an integer you get:

0x0017EBD8 -858993460

And if you interpret it as unsigned integer you get:

0x0017EBD8 3435973836

And if you interpret those bits as a floating point number you get:

0x0017EBD8 -1.0737418e+008

That is scientific notation for -1.0737418 x 10^8, or simply -1.0737418 x 100,000,000 or rather, -107374180.0. Or basically negative 107 million.

So this explains why the value of the first point is such a massive negative number. Solving this bug is not the point of this document, but is another story altogether.

In every single Debug build (On Windows platforms), all over the world, all uninitialized variables (on windows platforms) will always be 0xcccccccc. That number is always rendered as the same floating point value shown above. It is always the same with NO EXCEPTIONS.

That means if you want predictability in reproducing bugs you go with a debug build. If you want random behavior you go with a release build:

Here is the same work-flow in a release build:

Max 2010 (Release build):


The inserted point has a completely random y position (thankfully not as big as -107 million!). Also the first point is assigned to a random value. What is the random value? It is whatever the value was the last time memory was used there in that memory location. That memory could be clean, or it could be leftover from anything else. It could be memory that was last used 2 seconds ago, or 2 hours ago, or 2 days ago.

So again, if you see bugs with random behavior, it’s most likely an uninitialized variable. If you want to narrow down the repro steps, and actually solve it, use a debug build.

Developers who litter in the code

Over the years an old code base accumulates a lot of  junk left over by programmers. Some of this is what I call vacation signs. This is a small explicit statement left in the code in the form of a comment (or a bug) stating that the developer was there. For instance:

// mjm - 06.12.00 - begin


// mjm - end

So now, not only do I know where his changes started, thankfully he informed the entire world where his changes ended. This is equivalent to taking a vacation to Italy and pounding a small sign in the ground when you arrived in a town stating “I <insert name here> arrived in Milan on Feb 1 2004”. And imagine when you left, you then pounded another small sign into the ground (perhaps at the other end of town?) stating your name and the date that you left. This is cute for the pictures send to your mom, but sooner or later the mayor of Milan will start to get offended at stupid tourists who do this, and will send the constables to arrest the idiot foreigner who is littering his town. This if course is utter stupidty. And in the case of these vacation signs left in our code above, they are an utter nuisance. Especially when there are thousands of these little signs.

Solving STL assertion: ‘Assertion failed: vector iterators incompatible’ when calling std::vector::clear

Recently here at work we ran across a problem in the STL code that result from calling std::vector<>::clear(). The problem was that calling clear on a vector threw a debug message, which in our case, crashed the application. The callstack look like this:

msvcp100d.dll!std::_Debug_message() Line 13 C++
gw_objio.dle!std::_Vector_const_iterator<std::_Vector_val<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > > >::_Compat() Line 239 C++
gw_objio.dle!std::_Vector_const_iterator<std::_Vector_val<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > > >::operator==() Line 203 C++
gw_objio.dle!std::_Vector_const_iterator<std::_Vector_val<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > > >::operator!=() Line 208 C++
gw_objio.dle!std::vector<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > >::erase() Line 1194 C++
gw_objio.dle!std::vector<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> >,std::allocator<std::basic_string<wchar_t,std::char_traits<wchar_t>,std::allocator<wchar_t> > > >::clear() Line 1218 C++

Where the debug message says:

c:program files (x86)microsoft visual studio 10.0vcincludevector(238) : Assertion failed: vector iterators incompatible

If you look at the code for vector<>::clear it is really very simple:

void clear()
{ // erase all
erase(begin(), end());

So what was the problem?

For our particular case, the memory for the std::vector was allocated via LocalAlloc by some programmer years ago. This problem is not immediately apparent since in this particular case, LocalAlloc was allocating memory for a typedef struct that happened to contain the vector (Among others). Now, since the vector is NOT plain old data (POD) it’s constructor was not called, and when the memory was freed it’s internal pointers were left dangling. So when the time came to use them (i.e. clear), things blew up.

Not fun to debug.

I take it this (perhaps) once used to work originally. But perhaps with the switch to the Microsoft Visual C 10.0 compiler and it’s attendant change in the STL, it was enough for this problem to bubble to the surface.

Point of the story:

1. We are in the 20th century now. If you are programming in C++ in the 20th century, don’t use Paleolithic era API’s to allocate memory for non POD data.

2. Don’t mix and match memory allocation routines. If using C++ use new/delete especially for non-POD data.

MetaSL Parser uses C++ exceptions for program flow

Beware to whoever integrates the mental images MetaSL parser into their application. For program logic, the parser makes heavy use of C++ exceptions to handle program flow. So instead of an if statement switching with if/then/else, it throws exceptions:

First-chance exception at 0x000007fefdfaaa7d (KernelBase.dll) in NNNN.exe: Microsoft C++ exception: antlr::MismatchedTokenException at memory location 0x4d69d458..
First-chance exception at 0x000007fefdfaaa7d (KernelBase.dll) in NNNN.exe: Microsoft C++ exception: antlr::NoViableAltException at memory location 0x4d69d270..
First-chance exception at 0x000007fefdfaaa7d (KernelBase.dll) in NNNN.exe: Microsoft C++ exception: [rethrow] at memory location 0x00000000..
First-chance exception at 0x000007fefdfaaa7d (KernelBase.dll) in NNNN.exe: Microsoft C++ exception: antlr::NoViableAltException at memory location 0x4d69d270..

I know… I know: Terrible!!

So the fallout for us poor souls who have to integrate this garbage is 10’s of thousands of C++ Exceptions, not to mention a terrible performance hit just for our application to use this stuff.  Yuck!