The Chris Test: 10 steps to better code

There is usually a set of things that a software shop does to ensure high-quality software. It doesn't matter whether the product is web, desktop or mobile: the principles are the same.

Years ago, someone else wrote a similar set of rules.

http://www.joelonsoftware.com/articles/fog0000000043.html

This article was very influential to me when I first read it. However….

It is more focused on the high-level issues that affect the company as a whole, not rules for good code. For instance, one rule is: do you use source control? These days every software shop should, and probably does. Maybe they didn't 13 years ago? I know of plenty of software projects and products that use source control and still have bad code, so that's not a real test of how good the code is. For the actual coders writing the software, some of the items in Joel's excellent post do not really affect them on a daily basis. Programmers live in their own little bubble, under a set of realities that affects only them. That is what I'm going to write about here.

  1. Do you compile at the highest warning level possible?
  2. Do you check for memory leaks?
  3. Do you have overly large functions?
  4. Do you have overly large code files?
  5. Do you test with debug builds?
  6. Do you have regression tests?
  7. Do you measure code coverage?
  8. Do you stop everything when tests fail?
  9. Do you do code reviews?
  10. Do you use a real build system?

Do you compile at the highest warning level possible?

This can be interpreted to mean different things for different types of languages. For instance, a web shop developing in JavaScript, CSS and Ruby will not have compiler warnings. Ruby has no compiler to warn you, but there are linters and other static analysis tools available to help enforce the rules of the language. A product written in C, C++, C# or Java, on the other hand, will live in the land of warnings every minute of every day.

Compiling code and watching a long list of warnings spew out is a special kind of demoralization. It's even worse when those build warnings go unfixed for months or years. That state means one thing: management, or whoever leads the developers, doesn't care that the product has subtle problems. In fact it means that nobody cares. Or it could be that someone cares, but feels actively discouraged from doing anything about it by peers or management, which is the worst state of all.

What I really don't get is how a developer could let a warning get into the build in the first place. How could a developer NOT look at the build output? It leads one to ask: if they aren't looking at the build output, what else are they not looking at? They have to know whether the build succeeded, right? They obviously are not blind, but it does indicate that something is lacking: either competence or desire. Either one is programmer malpractice.

The end result is that there will be very subtle bugs in the software that manifest themselves in very costly ways over the long term. A customer support person spends 4 hours with a customer resolving a problem. That gets logged to a developer, who spends a day trying to figure out what went wrong. The fix then gets distributed to customers 9 months later. Meanwhile everyone loses: the customer loses productivity, and the software shop pays for labor spent on something a compiler warning could have caught in a few seconds. Imagine the cumulative cost of such failures.

Now, most modern compilers and linters have a range of options, from the basic, to the important, all the way to the absurd. Not all compiler or linter warnings are created equal. When you start getting warnings that border on the absurd, you can create one (and only one) list of exceptions: warnings that are turned off. This list should live in one place, be checked into source control, and contain clearly worded, strong reasons why each warning is off. It should never be large, and no one person should ever be allowed to change it without code review.
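
To make the idea concrete, here is a hedged sketch of what such a centralized exception list might look like in a C++ code base built with MSVC. The file name and the specific warning numbers are only examples; the point is that this one reviewed file is the single home for every suppression, with a reason next to each.

// warnings_disabled.h -- the one and only list of suppressed warnings.
// Checked into source control; changes require code review.

// C4514: unreferenced inline function has been removed.
// Reason: informational noise that fires on nearly every standard header.
#pragma warning(disable : 4514)

// C4710: function not inlined.
// Reason: an optimizer report, not a defect in our code.
#pragma warning(disable : 4710)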

The compiler (for languages where it applies) is the first line of defense against bugs. Software shops that refuse to use it to its fullest potential are putting out sub-standard products.

Do you check for memory leaks?

Failing to correct (or even care about) memory leaks is a certain kind of wrong. Memory leaks are like lung cancer: they slowly suffocate their victim. A leak slowly and insidiously starves an application of its memory. Applications live on memory, and when the memory goes, so does the application. Customers aren't happy when their application dies. It's even worse when it dies and takes all their data with it.

This fundamentally begins with the programmer who writes the feature. They are the one who originally wrote the code that allocated the memory. They should have immediately considered where the memory would be deallocated, and who would own it in the meantime. And second, the original programmer should have tested their code.

There are tools out there to help identify memory leaks, and where they exist they should be used. Failing to check for leaks at all is just plain sloppy.
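
As a minimal sketch (assuming a Windows/MSVC debug build), even the C runtime itself can be asked to report leaks at exit; dedicated tools go much further, but this alone catches the sloppy cases:

#include <crtdbg.h>

int main()
{
    // In a debug build, dump any allocations that were never freed when the program exits.
    _CrtSetDbgFlag(_CRTDBG_ALLOC_MEM_DF | _CRTDBG_LEAK_CHECK_DF);

    int* forgotten = new int[100];   // never deleted: reported in the output window at exit
    (void)forgotten;
    return 0;
}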

Do you have overly large functions?

These are functions that are more than two screens in length, which seems to be the unwritten limit in this industry today; it wasn't always an accepted standard, however. Ruby developers seem to go by the rule that functions should never be larger than 5-10 lines of code, which is even better. Large functions quickly turn a program into an unreadable mess that is extremely difficult, or almost impossible, for another human to decipher. The problem is that such functions cannot be easily refactored without a real possibility of introducing bugs. I have personally seen functions that were thousands and thousands of lines long. The cyclomatic complexity of such a function shoots through the stratosphere, and any modification can have adverse, unintended side effects. Such a function is essentially abandoned code: a dumping ground for changes that no one cared about.

If you ever see an application that has not changed much over the years and has not been refactored, rest assured that overly large functions bear much of the responsibility.

Do you have overly large code files?

Large functions lay the foundation for large code files. These are files that contain 10,000 to 25,000 lines of code; files that contain hundreds of functions and dozens of class definitions; code files that try to be a God object for everything. These files cannot be easily read or browsed. They remain a black box to all but the author.

Now, there are tools to help browse through large code files. Tools to help list the structure of the application. But when code gets so big that you need a browsing tool for the browsing tool, then you definitely have a problem. When this happens, discoverability nosedives and mindshare in the product evaporates. At that point programmers stop reading the code and start searching it. And when programmers start searching the code, they are not really understanding it. They are in essence looking at a painting through a microscope, and not standing back and looking at the big picture.

Another really bad side effect is the risk of duplicate code. A huge haystack may be hiding the needle you are searching for; failing to find it, you create another needle. This in a profession where needlessly reinventing the needle is a cardinal sin.

Case in point: I once came across someone modifying a rather small class, about 40 lines of code at most. I looked around, saw that we had a few other classes with the same name, and informed them about it during a code review. It turned out there were about 10 to 20 different copies of that class floating around the code base. Apparently different programmers over the years had used that class a lot. Ok, that's a good thing. But instead of simply including a header file and linking against a library, they copied the class into their own code. So I pushed back in the code review and asked them to remove the duplicate code. All this because the code base was so large that no one knew where anything was, or whether a particular problem had already been solved.

Do you test with debug builds?

This really applies only to C and C++ based code. Sorry, all you script people out there: your language has evolved past this little sharp edge of history. C and C++ based languages have, by convention, two different build configurations: a release build and a debug build. A release build is what you ship to your customer. It is compiled for speed with little to no error checking built in. A debug build is slower, is not shipped to the customer, and has lots of error checking.

Software should be tested to be correct first, and then tested for speed and performance second. A good software quality assurance division of a company will understand that and not require testing only on release builds.

The reason is that in the world of C and C++ based languages, a debug build has certain extra safety checks built into the product that help identify problems before they escalate into major disasters. Such code has assertions, poison memory patterns, bounds checking, extra type checking and so on. These things make the code much safer and, ultimately, more enjoyable for the customer. A release build, on the other hand, simply compiles all those nice safety nets away and plops the code right in the middle of a speedway with no guard rails whatsoever. Testing only on release builds is like training a child to ride a bike in the middle of a busy road: with no safety precautions, something is going to break, and it will be difficult to decipher post-mortem what happened.
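
As a minimal sketch of the kind of safety net a debug build provides: the assert below is a real check when NDEBUG is not defined (the usual debug configuration) and compiles away to nothing in a typical release build.

#include <cassert>
#include <cstddef>

double average(const double* values, std::size_t count)
{
    // Debug build: misuse is caught right here, at the point of the error.
    // Release build: the check vanishes and you get a crash or a divide by zero somewhere else.
    assert(values != nullptr && count > 0);

    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i)
        sum += values[i];
    return sum / count;
}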

Do you have regression tests?

A set of regression tests is the second layer of defense against bugs. These tests must be run by developers as they write code, not just afterwards. At the very least, tests must be run before code is merged or submitted to the repository; otherwise the tests serve no purpose. Imagine the havoc if code that breaks things left and right were submitted on a daily basis, and no one knew. By the time someone discovers that the tests, and therefore a feature, are broken, it's usually much later in the development process. By then the programmer may have moved on to something else. Fixing it is time consuming, not to mention costly, and the entire software development process becomes less efficient.

But the ultimate havoc wreaked by not having regression tests is that bugs get through to customers. And customers get annoyed and buy someone else’s software.
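
As for what these tests look like, they don't have to be fancy. Here is a minimal sketch using plain asserts; the function and the old bug it pins down are hypothetical, and a real shop would use whatever test framework it already has.

#include <cassert>

// Hypothetical function under test.
int clamp(int value, int low, int high)
{
    if (value < low)  return low;
    if (value > high) return high;
    return value;
}

// Pins down a bug that was fixed once so it can never silently come back.
void test_clamp_regression()
{
    assert(clamp(5, 5, 10) == 5);    // the old (hypothetical) bug: a value equal to 'low' was mishandled
    assert(clamp(-1, 0, 10) == 0);
    assert(clamp(99, 0, 10) == 10);
}

int main() { test_clamp_regression(); return 0; }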

Do you measure Code Coverage?

Testing is great, but you also have to know when to stop. Most programming languages, except the very newest, have tools for measuring code coverage. They should be used. In fact, one company I know of runs its tests on a continuous integration server, records the coverage of the product, and rejects any commit whose test coverage is less than 95%. Way to go!!

Interestingly enough, figuring out code coverage is good insurance for when a programmer leaves. The programmer may leave, but tests, guaranteed by high code coverage, ensure the code can be maintained by someone else. So when the new guy or girl comes in, and starts maintaining the code, they can be reasonably sure that their code changes won’t break the world.

Code coverage also helps to ensure the quality of the code. When code is examined in the cold truth of coverage, you start to discover a few things. You can discover unused parameters, or branches of code that never get executed. You may also discover expensive, slow code that you didn't need to call because its results weren't needed. The good reasons go on and on, and will vary. You will also discover bugs, lots of them. I have never run code through code coverage and not discovered a bug. It is as simple as this: untested code is buggy code.
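
As a contrived sketch of what a coverage report tends to surface, consider a branch that no test ever reaches; the function and its retry flag are made up purely for illustration:

// A coverage run shows the 'retry' branch at 0% executed. Either the tests are
// missing a case, or the branch is dead code hiding a wrong condition.
bool save_file(const char* path, bool retry)
{
    if (path == nullptr)
        return false;

    if (retry)        // never hit in any test run: investigate before shipping
    {
        // ... retry logic that has effectively never been exercised ...
    }

    // ... normal save path ...
    return true;
}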

Do you stop everything when tests fail?

This should be a no-brainer: of course all development should stop if a regression test fails. All hands should be called on deck when the tests fail. Everyone should know not to sync code submitted during that time period. Everyone should know not to build upon similarly infected bad code. Everyone should also feel a responsibility to look, hunt, search for and fix the problem. In return, everyone is pretty much guaranteed a clean working environment where they can trust that the code they are basing new features on is stable, solid and bug-free.

But if the culture says that failing regression tests are ignored and swept under the rug, then no one can be sure of anything about the product. What if the test failure count keeps creeping upwards and no one is held accountable? Then the product as a whole gains technical debt. Someone traded a short-term gain that will not last for debt that grows over time, and there is no debt so small that it cannot grow into something huge. Things can eventually degenerate into a farce where certain groups refuse to run the regression tests because they are broken, yet also refuse to help fix them. Such downward spirals always end the same way: a spectacular collapse that burns some customers really badly.

Do you do code reviews?

No developer is immune from writing bugs into their code, especially those who are unfamiliar with the software, the language or the platform. It can take many years to master each of those three areas. In the meanwhile, code submitted by someone who has not mastered them should be reviewed by other people.

But this rule should not be applied with no regard for the circumstances. Code reviews slow down development speed a lot. As in drastically. Ever had a code review languish for weeks at a time? I have, and it's not fun. Productivity nosedives, morale plummets and old bugs languish. If a startup company absolutely required code reviews for everything, it would probably never ship a product at all before running out of investor money.

So, here are a few common sense solutions to the problem:

  1. Code reviews are optional for those who originally wrote the code. Those who wrote the code are always better informed about what the code does and how it works than anyone else, so reviews of changes they make to code they themselves wrote should be optional.
  2. Conversely, modifying any code you yourself didn’t write should always require a code review. Especially if the person who wrote the code is still available.
  3. Have an expiration date. Any code review not reviewed in a certain amount of time expires and the developer is free to submit the code.
  4. Require unit tests and regression tests instead of code reviews. This works especially well for those who originally wrote the code in the first place.

Requiring a code review of all code is an ideal that should be sought after. But remember, customers buy your software, not your code reviews. Until there is first shipping software, no amount of code reviews will pay the bills. So, until then, don’t sacrifice the new startup for a feature that customers don’t buy.

Do you use a real build system?

A slow or inefficient build is a tax that costs real time and real money. A build is the process of converting the source code into something else that you send to your customer. Applications written by programmers are also built by programmers, usually with custom code that is slow and very inefficient. It doesn't have to be this way. I once worked at a company that used an archaic build system that took 45 minutes to compile and link; I fixed it up so that it took a minute and a half. Another product I once saw had no build system at all, but rather a collection of bash scripts that did a few things in a certain order, whether they were required or not. That build was actually fast considering it was building 8 million lines of C/C++ code, but it still did only one thing: build the product, with little to no configurability.

Each programming language and environment has its own build system. C has the antique make system (good riddance). C++ has none really, unless you are using Microsoft's Visual Studio and MSBuild. Java has Maven. Ruby has rake. These tools should be used instead of reinventing the wheel by writing custom scripts for your build. Once a build system is written properly, it can be executed in different ways according to the needs of the user.

A full clean build can be executed and be guaranteed to work due to no preconditions polluting the source code.

An incremental build can be executed based off of localized changes. These builds don’t start from scratch but rather build only what is needed based off of the actual changes.

A clean simply deletes everything that a build creates, returning the source tree to the state it was in when it was first retrieved from the source code repository.

If a build cannot do this, then it is in the stone ages and not helping anyone nor anything.

Over-engineered build systems from Hell

While I was at Autodesk, years ago, we went through various build systems. When I first started, the build was written in Perl, with dependencies specified in a .csv file. I had no idea how it worked, nor did I care, since I was interested in other things during those years. The build could routinely take an hour and 45 minutes, and was mind-numbingly boring since we usually had to do it every day, and if you were unlucky, multiple times per day. Even worse, on the build servers the build would routinely take almost 4 hours!

What a horrible tax.

Later on, a hot-shot programmer came along and rewrote the build system to use Ruby and rake. It was supposed to be faster, which it kind of was. But the complexity was just as bad, and no one knew Ruby nor how rake worked. Then the developer left the company, leaving behind a black box. That silo of information was gone, and no one knew how the build system worked. It took two developers about a year and a half to learn the build system well enough to work on it at the level the original developer had.

To be sure there were problems with the build. It still took a long time to build the product. Somewhere near an hour. About the only thing we did gain was the ability to do incremental builds.

But modifying the build was the main problem. The build, written in Ruby, completely reinvented the wheel in many different areas. To understand this better, you have to understand that the product at the time was built using Microsoft tools, because it was based solely on the Microsoft platform. The source project files were therefore written in a build language that Microsoft created, built into Visual Studio, called MSBuild. So instead of using Microsoft tools to create the build, Ruby and rake were used. Instead of using Microsoft tools to parse the XML project files, a Ruby gem was used. Instead of using anything from Microsoft to help with the build, everything was reinvented from scratch. Parsing Visual Studio .vcproj (and eventually .vcxproj) files was done tediously, laboriously and mind-numbingly using rake and some XML gem. Talk about recreating the wheel! So much code was written to duplicate what a single call to a Microsoft API could have done: return a fully instantiated project object with all its properties intact.

Copying files to the build directory was another disaster. It would take around 10 to 12 minutes to copy 7,000 to 14,000 files (it was originally somewhere near 7000 files, but grew over time), all driven by Ruby code that no one knew how to debug except by putting print statements in.

Another problem was adding build properties. If you wanted to add a build property (a key value pair), you had to add it to multiple places in the ruby build, knowing exactly what to modify (in duplicate) and such. It was horrible.

Mixing Ruby with MSBuild was like mixing iron and clay. They don't mix well at all. The Ruby was a straitjacket that hindered the build and the Visual Studio tooling upon which it was based.

There had to be a better way.

Eventually, when the frustrations with the build boiled over, I learned MSBuild and figured out how to build max without Ruby. It took over a year from when I first got a working prototype to get something into the main branch of max, simply due to bureaucratic inertia. There are lots of people in positions of power who simply say no before learning about a subject, which was all too common there. The first liberation was freeing the list of things to build from Ruby. Yes, the list of DLLs and EXEs to build was specified in some arcane Ruby syntax somewhere. The first step was moving that list to a democratic XML file, so any build language could parse it and find out what to build. The second was moving the list of files to copy into an XML file, so any build system could know what files to copy as well.

Once those two things were in place, it was time to put in the build system that I originally came up with during a particular Christmas break.

It was written in pure MSBuild XML, with one MSBuild extension written in C#. All the tools were native to Visual Studio, and what you built on the command line was what was built in Visual Studio. They both used the same properties (via native property sheets) and built in the same way.

What's more, I found that using the native MSBuild copy tasks to copy those 7000+ files was incredibly fast. In fact, while once debugging through that old Ruby copying code, I found the source of the 10-minute copy task: it was using an N-factorial algorithm! Given directory A with subdirectories B through Z, it would iterate through all the directories N! times; each directory was not parsed once, but N! times, according to the number of sub-directories that existed. It was an unholy mess that proves that reinventing the wheel is usually a disaster waiting to happen. Now back to the improvement: with the new MSBuild copy mechanism it took 20 seconds to copy all those files. 20 seconds versus 10 minutes was a big improvement.
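
The original Ruby is long gone, so this is only a hedged C++ sketch of the shape of the mistake, not the actual code: the broken version re-walks the whole source tree for work it has already done, while the sane version visits each file exactly once. (The sketch flattens files into one destination folder for brevity, and shows a quadratic rather than factorial blow-up, but the lesson is the same.)

#include <filesystem>
namespace fs = std::filesystem;

// Shape of the old behaviour: for every file to copy, re-enumerate the entire
// source tree just to find the file we already had in hand.
void copy_slow(const fs::path& root, const fs::path& dest)
{
    for (const auto& wanted : fs::recursive_directory_iterator(root))
    {
        if (!wanted.is_regular_file())
            continue;
        for (const auto& entry : fs::recursive_directory_iterator(root))   // redundant re-walk
            if (entry.path() == wanted.path())
                fs::copy_file(entry.path(), dest / entry.path().filename(),
                              fs::copy_options::overwrite_existing);
    }
}

// One pass over the tree: each file is visited and copied exactly once.
void copy_fast(const fs::path& root, const fs::path& dest)
{
    for (const auto& entry : fs::recursive_directory_iterator(root))
        if (entry.is_regular_file())
            fs::copy_file(entry.path(), dest / entry.path().filename(),
                          fs::copy_options::overwrite_existing);
}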

Incremental builds also greatly improved. Go ahead and build your product from scratch. Now don’t change a thing and rebuild. If you have a smart build system, it should just take a few seconds and nothing will happen. The build system will be smart enough to essentially report that nothing changed and therefore it did no work. My new build did just that in a matter of seconds. The old build system used to take about 5 minutes to do that (And it still did work anyways…).

Speaking of performance: the time to compile and link the actual files didn't change much, because that was always in Visual Studio's corner and not Ruby's. The improvement came from the copying step that now took 20 seconds. Also noticeable was the shorter time from when the build started to when the first CPP file began compiling: in Ruby/rake it took quite a few minutes; in the new build it took a few seconds. Eventually, when I got an SSD, I was able to get the build down to 22 minutes on my machine.

Better yet was the removal of duplication of anything. Everything was native and iron was mixed with iron and clay was mixed with clay. Sort of….

One developer (whom we called Doctor No, because he said no to everything good), holding on to the fantasy that max would be multi-platform someday, would not let Ruby go. So there were in essence two build systems that could do the same thing. In fact he wanted an option to invoke one build system from the other, so I had to put in an option to invoke MSBuild from Ruby/rake! This was hobbling MSBuild with an old clunker, kind of like buying a new car and towing the old one around everywhere you go. Yes, extremely stupid and frustrating.

Which goes to show that old ways of thinking die hard, or don’t die at all.

The build at Century Software

Later on I moved to Century Software, a company local to where I live. That was such a fun place. Anyway, their build system for their Windows product was written in Make! Yes, Make, the original, ancient build system that you can barely find documentation for anymore. I mean literally: I found (I think) one page of documentation somewhere in some professor's lecture notes. The docs were horrible. The Make setup there was so old it built one C file at a time: no multi-threading, no parallel builds, nothing. Slow was the operative word, that and a build output so verbose it was almost impossible to comprehend. The only good thing about it was that it immediately stopped on the first error.

So eventually I rebuilt that using MSBuild too. It took me a few months in my spare time. No bureaucratic inertia, no one telling me no. I just worked on it in my spare time and eventually I had a complete and fully functioning system for the tinyterm product. This build was the best I’ve ever done, with zero duplication, extremely small project files and a build that was very fast. It went from 45 minutes to a minute and a half.

When writing a product, the build system should be done using the tools that the platform provides. There should be no transmogrifying the code or the build scripts (like OpenSSL does) before doing the build. When writing Ruby on Rails, use rake for your build processes. When targeting Microsoft platforms, use MSBuild. Using Java, then use Maven. Data that must be shared between build platforms should be in XML so that anything can parse it. And most important of all, distrust must go: developers and managers must have an open mind to new things. Otherwise the development process will take so long, be so costly, and exert such a tax that new features will suffer, bugs will not get fixed, and customers will not be served and will take their money elsewhere.

Now working at….

I no longer work at Autodesk. In fact I have not worked there since last spring, when I, another developer, and almost the entire documentation department were laid off. That same spring the entire Softimage product was discontinued too.

My highlights of working at Autodesk were these:

1. It was my first job at a big company (before that I was self-employed). I went through a big company, but the big company didn't go through me. Once a company gets big enough it is hard for it to listen to customers, especially when they are all clamoring for bug fixes that should have been fixed last year, or 18 years ago.
2. I learned the craft of software development, for which I’ll always be grateful.
3. I learned how to make good code, and saw lots of examples of bad code.
4. I learned that software quality is like a river of fire: if it gets out of control and spills over the banks guiding the river, it will burn everything. Therefore bug counts have to be kept under strict control for a software product. If they are allowed to proliferate beyond the ability to fix them, the product suffers.
5. I almost single-handedly saved 3dsmax. When I started, the average time (in days) between crashes was in the mid single digits. When I finished, the max 2014 release was more stable than Maya: in the low 40's. That was mostly my effort, with a bit of help in the last year for the 2014 release. Still, the stability of 3dsmax was nowhere near the rock-solid stability of AutoCAD. In all the years I used AutoCAD (at work and personally) I don't think it ever crashed on me, at least not that I can remember. My personal opinion is that Max will never again have a sustained effort to fix crashes like it had when I was there, simply because no one cared like I did.
6. I think it is time for a new 3D product to be built from the ground up. To avoid the problems that a mountain of code imposes. To have a modern take on a 3D application that is not hobbled by the 30 year old windows win32 API, and that is easy to develop for. If anyone is interested in such an effort please contact me privately.

Anyway, after I left Autodesk I worked at a local boutique software house called Century Software. It was the opposite of Autodesk: small, with fewer than 10 people. They have a very old product called TinyTerm, a terminal emulator. While it is old, it is a great product. The team there was a lot of fun, and it was a breath of fresh air. I re-learned old C skills, for which I'm grateful, but I also learned how to use a source control package named git, for which I'm even more grateful.

So what am I doing now?

As of January of 2015 I work at a company called Instructure. The environment is terrific, and the people there are young and very smart. It is an awesome place for developers. I can't say enough good things about this place. I work on a beautiful product called Bridge, which is a learning management system for businesses.

Writing Stable 3ds Max Plugins

I found this document while looking through old files today, and thought I’d share it. It was from a lecture I gave at Autodesk University back in 2012. It applies to 3ds max, but has some points that would be applicable to software development in general.

[Note] I wrote this a long time ago, and today I saw this blog post again. I thought, in fairness, I should add a few things. These were guidelines I came up with based on years of experience, and years of fixing bugs in 3dsmax. While I firmly believe in every single last one of them, unfortunately, hardly any of these things ever entered the thoughts of most 3dsmax developers. Most coded all day long blissfully unaware of warning level 4, and only two people ever showed an interest in static analysis. Management, most of the time, was completely unsympathetic to these ideas. Even the development managers who used to be programmers simply didn't care; they just wanted bugs fixed fast. No attention or interest was given to systematically fixing the fence at the top of the cliff. All thought went to getting ambulances to the dead bodies at the bottom of the cliff as fast as possible. As a result, the fences at the top were always full of holes. By the time I left Autodesk in the spring of 2014, only a dozen or so projects compiled at warning level 4, and no systematic static analysis was being done by anyone. I could go on, but that's a thought for another blog post.

Introduction

Preventing crashes and increasing stability in software is a difficult task. There is no practice nor set of practices that will completely prevent all crashes. However there are a few things that will help reduce errors. With that short introduction let us get started.

Basic Responsibilities

These are basic practices that would apply no matter where you worked and no matter which product you worked on.

Compile at warning level 4

You should compile your plugins at warning level 4. Level 4 warnings can help point out subtle programming errors that can lead to bugs that can be absurdly difficult to discover later on.  This is a free and practically instantaneous way to find bugs early in the development process.

Code compiled at level 4 is better than code compiled at anything less. Level 4 warnings should be turned on, and no warnings should be suppressed.

The 3ds Max SDK compiles cleanly at warning level 4, and has been that way for at least 3 years now.

Case in Point:

We turned on warning level 4 for an old project recently. One level 4 warning pointed to some unreachable code, caused by a break statement that had been left in a loop. This problem eventually resulted in a complete feature not working.
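
Here is a minimal reconstruction of that kind of mistake (not the actual 3ds Max code): at /W4 the compiler can flag the call after the stray break as unreachable code, while at lower warning levels the loop just silently does nothing.

#include <vector>

void process(int item);   // hypothetical worker function

void process_all(const std::vector<int>& items)
{
    for (int item : items)
    {
        break;             // leftover from a debugging session
        process(item);     // unreachable: the whole feature quietly stops working
    }
}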

Compile with Static Analysis

The higher-end editions of Visual Studio come with a static analyzer called Code Analysis. This feature can be turned on for native or managed code in Visual Studio. Static analysis does a deep scrutinization of the code and can help spot bugs that are more complex than what level 3 or 4 warnings can find, yet the problems it flags are usually so fundamental that they can be likened to level 1 or 2 warnings.

Case in Point:

The static analyzer can detect allocation/de-allocation mismatches. For instance, we turned it on and found places where memory was allocated with new[] but de-allocated with delete instead of delete[]. We found lots of these scattered throughout our large source code base. The advantage is that static analysis makes this so easy to detect; without it, it would take a special tool like BoundsChecker to reveal a memory allocation mismatch, and only after exhaustive testing.
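
The mismatch itself is tiny and easy to miss in review, which is exactly why a tool should be looking for it. A minimal sketch:

void bad()
{
    int* values = new int[64];
    // ... use values ...
    delete values;        // mismatch: allocated with new[], so this is undefined behavior
}

void good()
{
    int* values = new int[64];
    // ... use values ...
    delete[] values;      // matching array form
}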

Check Pointers for NULL

By far the most common fix I have seen for known crashes in 3dsmax is to check a pointer for NULL. This is the most persistent problem I have ever seen in C/C++ code. Get into the habit now of checking every pointer for NULL before using it. A corollary to this is to initialize all pointers to NULL before and after they are used.

Case in Point:

The Visual Studio static analysis tool emits various warnings for possible dereferencing of null pointers. Consequently, I have rarely seen this problem in code that compiles at level 4 with static analysis.

For the Rampage release, the 4th highest crash in 64-bit max was a crash using Ray Traced Shadows. The shadow code contained a buffer of floating point values that was uninitialized. It was extremely difficult to track down, as it only manifested when the debugger was NOT attached.
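
Going back to the basic habit itself, here is a minimal sketch using a hypothetical Widget class and the NULL spelling of that era:

#include <cstddef>

struct Widget { const char* label; };    // hypothetical class, purely for illustration

Widget* gWidget = NULL;                  // initialize pointers to NULL rather than leaving them dangling

void set_label(Widget* widget, const char* text)
{
    if (widget == NULL || text == NULL)  // check every pointer before using it
        return;
    widget->label = text;
}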

Check before casting

If you lie to the compiler, your application will come back and bite you. C is a language seemingly built on casts, where anything can be cast to anything else. This ability to so easily lie to the compiler and misrepresent your types is dangerous and risky. Therefore prefer C++ style casts: by turning on RTTI and using C++ style casts, the result of a cast can be checked for validity.

Case in Point:

In the SDK header file imtl.h is a class called MtlBase, which has 4 derived classes. One of those classes is class Mtl. I found functions in MtlBase that were blindly assuming the instance (i.e. this) was an instance of class Mtl, ignoring the fact that there were 3 other classes derived from MtlBase. They were casting the 'this' pointer to class Mtl and then doing work on that miscast pointer.
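
Here is a hedged sketch of the safer pattern, borrowing the class names from that example but simplifying everything else; with RTTI on, dynamic_cast returns a null pointer when the object is really one of the other MtlBase-derived classes, instead of silently miscasting it.

// Assumes the 3ds Max SDK headers (imtl.h) that define MtlBase and Mtl.
void useAsMaterial(MtlBase* base)
{
    Mtl* mtl = dynamic_cast<Mtl*>(base);   // checked cast instead of a blind C-style cast
    if (mtl == NULL)
        return;                            // not actually a Mtl: bail out rather than corrupt memory

    // ... safe to use mtl here ...
}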

Avoid stack based strings

A very common way to crash the application is over-reliance on stack based C strings. This code for instance is very dangerous:

void foo() {
    TCHAR buf[SIZE];
    // ... unbounded string operations on buf can overflow the stack ...
}
One of the problems with stack based strings is operating on a string that is bigger than the buffer, which corrupts the callstack. This is almost impossible to debug afterwards and usually makes reading minidump crash files an exercise in frustration. The danger can be minimized by using the newer safe string functions. For instance, instead of strcat, which can easily run over the end of a buffer, you can use strcat_s, which is a safer version.
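
A minimal sketch of the difference; the buffer size is arbitrary and the narrow-character functions are used for brevity:

#include <string.h>

void build_message(const char* name)
{
    char buf[32];

    // Dangerous: if 'name' is longer than the space left in 'buf',
    // strcat writes past the end and corrupts the stack.
    //   strcpy(buf, "Hello ");
    //   strcat(buf, name);

    // Safer: the _s variants know the real buffer size and fail
    // instead of overflowing.
    strcpy_s(buf, sizeof(buf), "Hello ");
    strcat_s(buf, sizeof(buf), name);
}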

When possible use TSTR or MSTR instead, where the actual string buffer is stored on the heap, not the stack. Then if anything does go wrong, it will not corrupt the callstack.

Now a disclaimer: Max has a lot of stack based strings all over the place (it has been around a long time, of course). But their usage is getting reduced as we now favor TSTR or MSTR.

Case in Point:

The code for the customization dialog contained a for loop that concatenated a string into a stack based buffer of limited size. The loop iterated too many times and the buffer overflowed, corrupting other items on the stack. That stack based buffer lived several frames up the stack, and when that stack frame was cleaned up, it crashed. Diagnosing the problem was difficult since the symptom was several function calls away from the source of the problem.

Avoid using catch(…)

If at all possible avoid using catch(…). Prefer to handle more specific exceptions, like an out-of-memory exception (std::bad_alloc). While using catch(…) may prevent the application from crashing, it can also hide bugs and make it more difficult to solve crashes. It is useful for debugging to actually remove a catch(…) and let the program crash exactly where the cause of the crash is located. You should generally catch only those errors that you can handle, and let the ones that you cannot pass through so that the larger system can handle them if possible, or crash in the "correct" place rather than delay it.

Now, catch(…) can be used when it does something to correct the program state. This should be done only after careful consideration, usually with multiple developers, and side effects need to be considered as well. If a catch is used to wrap a call into thousands of 3ds Max functions, then it probably shouldn't be used. Wrapping a call to a 3rd party library, however, is acceptable. Everything needs to be balanced, of course.
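
A minimal sketch of catching the one exception you actually know how to handle, while letting everything else crash where it happened; the cache-building function is hypothetical.

#include <cstddef>
#include <new>
#include <vector>

bool tryBuildCache(std::vector<float>& cache, std::size_t count)
{
    try
    {
        cache.resize(count);          // may legitimately run out of memory
        return true;
    }
    catch (const std::bad_alloc&)     // the specific failure we can recover from
    {
        cache.clear();
        return false;                 // caller degrades gracefully
    }
    // No catch (...) here: a genuine programming error should crash at its source.
}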

Certain regular expressions can easily be written to help search for empty catch statements. The static analyzer PVS-Studio will also help identify these.

Case in Point:

I regularly review the usage of catch(…) in the source code, and over the years have taken a few of them out. As a result, the crash reports coming from customers in the field have become clearer.

Use Debug Builds

When debug builds are available, they should be used for testing and development. 3ds Max is the only M&E product that provides debug builds to ADN partners, although they may be slow in arriving. Despite the delays, a debug build is a great resource for validating your plugins.

[Note: It turns out to be very ironic that I put this here, since the 3dsmax team does not use debug builds. Sure, the devs do, but in all my years there I could never get management to have the QA testers use debug builds. Nevertheless I believe in debug builds, and that they are far superior for testing than release builds.]

Watch log file, asserts and the debug output

Log File

3dsmax has a log file that is written to <max install>\network\max.log. This file is mainly used for debugging network rendering, which was its original purpose. However, it has grown to become a popular logging mechanism for max. This log can provide useful information, but it is still under-utilized and cannot be expected to report program state consistently across the product.

Asserts

Do not ignore asserts (remember debug builds?). Use asserts liberally in your own code, and don't suppress asserts unless they are logged and checked afterwards (for example, using automated testing). The assert system will automatically log all asserts (whether they are suppressed or not) to the file <max install>\3dsmax.assert.log.

Debug output window

The Visual Studio debug output window provides significant output and can be useful to watch during debugging sessions. Be sure to turn on display for all types of events, including exceptions (very important) and regular program messages. If you want to check debug output without attaching a debugger, then you can use a Microsoft tool from Sysinternals called DbgView. See the following website for details: http://technet.microsoft.com/en-US/sysinternals

Disclaimer: The MetaSL system parses a lot of files when 3ds Max starts up. This will generate a lot of exceptions that are benign, so no need to worry. The reason is that the MetaSL system, from Mental Images, uses a 3rd party library (antler) to parse files, which in turn uses exceptions for program flow.

Enable Break On Exception:

Visual Studio has debugging options that allow it to break execution when an exception is thrown. This should be used as often as possible. This is the corollary to the “No catch(…)” above.  There are a few places where max actually does use catch(…), for example in the maxscript kernel.  By enabling this feature, exceptions are immediately brought to the attention of the developer.

Max Specific Problems

Do not hold naked simple pointers to ReferenceTarget’s

A class that is not a ReferenceMaker should not hold a plain old pointer to a ReferenceTarget, or a class that derives from a ReferenceTarget, without some mechanism to ensure validity of the pointer before use (i.e. AnimHandles). Instead replace the simple pointer with a SingleRefMaker class instance, and have that observe the ReferenceTarget.

Good:

class Good {
    …
    SingleRefMaker mObserve;
};

Bad:

class Risky {
    …
    ReferenceTarget* mObserve;
};

Do not write dynamic arrays of ReferenceTarget’s.

Do not write a class that holds an array of ReferenceTarget’s: especially when that array grows and shrinks at runtime.

A class like this usually has a container that holds pointers to ReferenceTargets. It usually overrides ReferenceMaker::NumRefs like this:

int NumRefs() { return myArray.Count(); }

Instead of a fixed number of items:

int NumRefs() { return 3; }

This cannot be done correctly without considering undo and redo (subclassing class RestoreObj). The fundamental weakness of the reference system is that it expects references to be in a fixed position. That reference index is an internal implementation detail of the ReferenceMaker that should be invisible to clients. However, clients routinely use the reference index to get a certain target, and one of those clients is the undo system. One complication of such an implementation is that the undo system usually expects that internal array to never shrink in size. If a ReferenceTarget is removed from the internal array, a RestoreObj may still point to its old reference slot. The reference system has no idea that the internal array shrank, so if an undo action occurs it may stick that reference back into the wrong slot. To avoid that, a common practice is to make dynamic reference arrays grow but never shrink, which wastes memory.
For example: undo and redo can change the size of the internal array via SetReference. So if you have an array with 10 ReferenceTargets and your undo/redo object happens to 'redo' and stick a reference back in at slot 5, all your other pointers from index 5 to 10 have now had their indexes bumped up by one. Anything depending on or holding those moved ReferenceTarget pointers is now dangling.

There are a few alternatives to this:

  • Use class IRefTargContainer.
  • Use an array of AnimHandle’s.
  • Use a ParameterBlock

Do not access the Reference System after NOTIFY_SYSTEM_SHUTDOWN

The notification message NOTIFY_SYSTEM_SHUTDOWN (See notify.h) is broadcast before plugins are unloaded. It is critically important to drop all references to plugins in response to this message. There are many plugin modules that define ReferenceTargets that will then get unloaded shortly afterwards. Once the plugin module is unloaded, trying to access a ReferenceTarget defined in that module can result in a crash.

Do minimal work in DllMain

The MSDN docs state that minimal work should be done in DllMain. Specifically, they warn against loader lock, among other things. The DllMain function can be called as a result of LoadLibrary. When LoadLibrary executes, a critical section is held while your DllMain is active. If you try to do work that, for example, needs another DLL to get loaded, it can deadlock the application. Instead of doing work in DllMain on startup or shutdown, there are a few other ways to do plugin initialization and uninitialization. For example:

  • You can do uninitialization work in response to NOTIFY_SYSTEM_SHUTDOWN. (see notify.h)
  • You can and should use the LibInitialize and LibShutdown functions.

A similar warning is not to do heavy work in the constructors of static variables, because a static variable gets constructed close in time to when DllMain is called. When the static variable is constructed, the DLL may not be fully loaded and types needed by the constructor may not be available yet.

Do not violate party etiquette

Uninvited guests should not crash the 3ds Max party. When the party is over: go home.

Uninvited guests

Every plugin has an appropriate time in which it should be initialized, do its work and shutdown. For example:

  • A plugin for a color picker should not instantiate one when 3ds max starts up.
  • A plugin for a scene browser should be active ONLY when its UI is active.

It is entirely possible and probable that users can start max and NEVER use your plugin. Therefore do not waste memory and resources for a feature that may not get used. Do the work when users actually invoke your feature. In other words when 3ds Max starts up, the plugin should not invite itself to the 3ds Max party, it should wait for an invitation.
This rule is violated on startup by loading 3rd party libraries, instantiating plugin classes, holding pointers to the node scene graph and registering callbacks for common scene events (my favorite pet peeve: "Hey, max crashed in this function even though I never used this feature?"). When max loads a plugin, the major things 3ds Max requires from it are:

  • The number of class descriptors
  • A way to get those class descriptors.
  • Some pointers to LibInitialize and LibShutdown functions.

Therefore class descriptors really are the only things that should be instantiated on module load or startup. There should be no static instances of the actual plugin class, whether it is a material plugin, shadow, utility, or renderer. Of course there are exceptions such as function published interfaces and parameter block descriptors that often are statically defined: But I’m not talking about those.

No loitering

When 3ds Max shuts down, it sends out the most important broadcast notification in all of 3ds Max (found in notify.h): NOTIFY_SYSTEM_SHUTDOWN. This means the 3ds Max party is over. The plugin should completely shut itself down and disassociate itself from all max data. For example: all references should be dropped, all arrays holding pointers to INodes should be cleared out, etc. And most common and most dangerous: all callback functions that were registered should be unregistered.

When NOTIFY_SYSTEM_SHUTDOWN is broadcast, the entire max scene is completely intact and still in a completely valid state. During any callbacks or notifications after that, 3ds Max will contain less and less of a valid state to work with. In other words as 3ds Max progresses in its shutdown sequence less and less of the max scene will be valid. So for instance the other shutdown notification NOTIFY_SYSTEM_SHUTDOWN2 is called merely when the main 3dsmax window (think HWND) is destroyed. No plugin should be responding to that message to (for example) iterate through the scene graph. Likewise the LibShutdown functions should not be iterating the scene graph.

Case In Point

Say that a plugin depends on another library like this:
plugin.dll -> library.dll
When the plugin is loaded by max, the dependent library will also (automatically) get loaded. But when the plugin is unloaded, the dependent library will not get unloaded unless its reference count is decremented to zero, which will not happen unless FreeLibrary is specifically called on library.dll (which is neither a common nor a recommended practice). Instead, the library will get freed or shut down long after WinMain exits and max has uninitialized and is gone. Therefore the dependent library should not contain any dependencies on anything in the 3ds Max SDK. For example, GetCOREInterface() should never be called in the DllMain of a module that a plugin depends on (i.e. library.dll).

Quality Testing

Developers can implement the following practices in their software development processes:

Automated regression testing

All good production pipelines should have regression testing that occurs automatically after a build. This is critical to help catch bugs before they get to customers in the field. Also the developers should have access to these automated tests so that they also can run these tests before submitting their code.

Dynamic Memory Analysis

This means using 3rd party tools to profile, analyze, check and verify memory during runtime of the application.

The following list of tools is a partial example of what is available:

  • MicroFocus BoundsChecker: Checks for memory leaks, or memory allocation mismatches among a host of other things.
  • Microsoft’s Application Verifier also checks for various memory problems during runtime such as accessing an array out of bounds.
  • Visual Leak Detector (Open source on codeplex.com) checks for memory leaks. It is fast, efficient and stable.

Code coverage

This is using a tool to measure how much of your application or plugin was actually exercised during testing. It helps a developer know when they have tested the product enough, and it can help a developer find areas they have not tested. Simply put, untested code is buggy code, and a code coverage tool helps in this regard. The best tool I have ever seen for this is Bullseye (bullseye.com). It works for native C++, is easy to use, and is very fast. It requires instrumentation of the code during the build, which can double the build time, but runtime performance is excellent.


Beware the hook hiding under the clutter

A few years ago, while working on a release of 3ds Max (2014?), I was fixing warnings from a static analysis tool (PVS-Studio). The analyzer had identified code in the status panel that could lead to possible memory corruption. As I started to look around in the code related to the status panel, I found code for unrelated features and some dead code. In other words, code for features that was not where it was supposed to be.

These features used to be on the status panel, but had since been moved or removed. For instance, the snaps used to be on the status panel (remember max 3?), but have since moved to the toolbar at the top of the app. What I found was that all the old features that had ever been on the status panel were still there! When the UI got updated over the years, the programmer(s) just turned off the UI features and simply left the old code. They didn't even comment out the code, or indicate that a particular UI widget was no longer used. It was just abandoned. It was hard to know what was still relevant and what was not.

The snaps at least were still relevant: I put in a breakpoint, executed some snaps and hit that code. Great! But wait. The snaps were now in a different functional area of the UI than the status panel. To be exact, for people not familiar with 3dsmax, the snaps are at the top of the screen on the main toolbar, and the status panel is the row of UI controls at the bottom of the screen. So why should toolbar code be hiding in the status panel code? What a zig-zag! Who cares? The computer doesn't care where the code is; as long as the appropriate function gets compiled to an object file and is available to the linker, the code can be anywhere. So why should humans care? Because humans are much more expensive to operate than a computer. Humans read the code and have to understand it first. So as long as that feature was hiding in the status panel, it was as good as taking a random book in the library, ripping off the catalogue ID, and placing it in a random location in the library.

A disorganized library. In code, this is typified by dead code and features implemented in the wrong files.

The tax associated with disorganization in code is immense. I've seen it first hand. I remember the first time I tried to find the code for the material editor: it took me an hour. I had to find some text that appeared somewhere in the material editor, do a brute force text search of every file in the source code, then look up some more things and do more brute force searches until I found the code I was looking for. It was the equivalent of taking a book in the library, ripping off the cover with its catalogue ID, putting on a different cover with an obfuscated name (generally with no vowels), and putting the book somewhere else in the library. Good luck finding it.

For an example of that, try finding the code for the editable mesh modifier in the public SDK of 3dsmax. You will have a hard time if you look for editablemesh.vcxproj, since it’s actually called pack1.vcxproj. Thankfully it is in a folder called /samples/mesh/editablemesh. (btw, the SDK sample code is only organized because of work I did way back in 2005).

Which leads us to another problem. A feature will be called one thing in the documentation, have a completely different name for its source code files, and have completely different class names inside those files. Imagine a code file named foo.cpp, where the class names in the file are variants of the name baz, and the feature as exposed to the customer is named wizz. It's all a confusing job to learn the lay of the land. Maps don't exist in this land; only through experience do you learn where things are. Browsing the code is simply useless.

So I started to move the misplaced code. I used a code coverage tool called Bullseye to aid me in my testing. At one point I had moved most of the code and hit around 85% functional coverage, and something slightly less for MC/DC coverage. (MC/DC coverage is the real indicator of code coverage, by the way.) But there was this stubborn Windows hook code that I could just not hit. Doing a brute text search for the hook, I could not find where it was used. It appeared to be dead code. So I removed it.

Fast forward a few months, and a bug report came that the cross hair feature was broken. That is where the cursor in the viewport can be replaced by cross hair lines. Ok, what caused it? My removal of the apparently dead hook. Ok, sorry about that.

Now, if you know anything about hooks, you know that they should only be used as a last resort. And even then, I and a few colleagues agree that using a hook is still generally the wrong approach. Hooks are an evil API that literally traps every single message that passes through your message pump (millions in a typical session). For each message the system will execute your callback and allow you to do who knows what with the message. There is no way that hooks can have a beneficial effect on performance. In fact, with dozens (or more) of messages getting sent every second, they can only have a negative effect on performance.
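
For readers who have not met them, here is a hedged sketch of what installing such a hook looks like with the Win32 API. The callback runs for every message the thread pumps, which is exactly why an always-on hook is such a drag:

#include <windows.h>

static HHOOK gHook = NULL;

// Called for every single message retrieved by the thread's message pump.
static LRESULT CALLBACK MessageSpy(int code, WPARAM wParam, LPARAM lParam)
{
    // ... whatever the hook wants to do with the message ...
    return CallNextHookEx(gHook, code, wParam, lParam);   // always pass the message on
}

void InstallHook()
{
    gHook = SetWindowsHookEx(WH_GETMESSAGE, MessageSpy, NULL, GetCurrentThreadId());
}

void RemoveHook()
{
    if (gHook != NULL)
    {
        UnhookWindowsHookEx(gHook);
        gHook = NULL;
    }
}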

So the cross hair bug was reported, and the dev who was fixing it (not me) asked to revert my code changes, putting back the hook to fix the cross hair cursor. Not so fast, I said. Why was the hook code in the status panel in the first place? Viewport code had its own place in the code base, and this hook code was far, far removed from that. Finding the hook here was like finding a gas line in the glove compartment of an automobile: in other words, a complete and utter hack. This cross hair business had no place in status panel code. Something smelled fishy. So I suggested that the other developer research the problem in greater depth and find another solution to the problem of the hook and the crosshair. They did, and in a week or so they came back with a simple fix consisting of setting a variable (or something trivial like that) to turn on the cross hair. Best of all, the fix was in its proper location: the viewport code. An excellent fix! Congratulations to the developer who found the real place to fix it.

Meanwhile, once I found out what the hook was for, I researched it a little more and discovered just how bad it was. That hook got installed and activated when max first started up, and stuck around until max shut down. It didn't matter if you ever used the 3D viewport cross hair or not: that hook was installed and running 100% of the time, sucking CPU time for every message it trapped.


The incredible expanding layer bug

A few months ago at work (January), a bug came across my desk for a performance problem in 3dsmax. I have always heard of bugs where someone in an architect's office would load a file and it would take 45-50 minutes, perhaps because the file was an AutoCAD file with hundreds or thousands of layers in it. I've even had a guy at an architect's office tell me they had files that took an hour to load. Just incredible. Anyway, this bug was that it took almost a half hour to create 1000 layers in max. The person who logged the defect even gave us a maxscript to help reproduce the problem:

 

(
	--resetmaxfile(#noprompt)
	clearlistener()
	layerCount = 1000

	for i = 1 to layerCount do
	(
		t = timestamp()
		mLayer = LayerManager.newLayerFromName ("blarg_" + i as string)
		t = timestamp() - t
		format "%,%\n" i t
	)
)

I first gave the script a run-through in order to reproduce the problem, and indeed I could see that it took a long, long time. The first run seemed to take an hour, but I wanted better numbers than that, so I modified the script to give me the total time.

The final result was 35 minutes to complete everything, during the course of which the 3dsmax maxscript listener simply stopped responding. When it finally finished, I dumped the printed results into an Excel spreadsheet and plotted them.

The following chart plots, in blue, the time it takes to create each of the 1000 layers. The layer index from 1 to 1000 runs along the x-axis; the creation time for that layer, in milliseconds, is on the y-axis.

 

[Chart: layer_creation_time_chart, per-layer creation time in milliseconds]

By the time the 1000th layer was created, creating a single layer took nearly 5 seconds. *Ouch*. The blue curve shows the classic non-linear degradation of an N-squared algorithm: the more layers exist, the longer each new one takes. Contrast that with the red line, the expected performance. Anyways, finding the bug was the real problem at hand. Why was it so slow?

Experimenting was of course ridiculously slow: you make a change and then wait 35 minutes to test it. Finally I stumbled upon a function call that updates some UI. I commented it out and ran the script again. The result astounded me: 4 seconds! The code I removed did just one thing: whenever a layer was created, it would update that little layer dropdown list on the layer toolbar:

[Screenshot: the layer dropdown list on the layer toolbar]

 

Remember that little guy? That UI tool that no one likes or uses anymore? Well, the problem was that the little layer dropdown list would add the new layer to some data structure, and then re-sort all the layers. This is a classic N-squared pattern: the more layers there are, the more work every new layer costs. Obviously a performance nightmare.
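In rough terms the pattern looked like the sketch below; the names and types are illustrative, not the real 3dsmax source.

#include <algorithm>
#include <string>
#include <vector>

struct Layer { std::wstring name; };
static std::vector<Layer*> s_allLayers;

static bool CompareLayersByName(const Layer* a, const Layer* b) { return a->name < b->name; }
static void RebuildLayerDropdown(const std::vector<Layer*>& layers) { /* repopulate the UI control */ }

// Every insertion re-sorts and rebuilds everything, so creating N layers
// does work proportional to 1 + 2 + ... + N, which grows quadratically.
void OnLayerCreated(Layer* newLayer)
{
    s_allLayers.push_back(newLayer);
    std::sort(s_allLayers.begin(), s_allLayers.end(), CompareLayersByName);
    RebuildLayerDropdown(s_allLayers);
}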

Once I temporarily removed that UI update function call, the time per layer was so low that it wouldn't even register on the graph shown above. But creating layers should still update that UI dropdown list eventually, right? So if we remove the function call, how will it get updated? To fix that, I put in place one call to suspend UI updates for that dropdown list and another to resume them. Before creating the 1000 layers I call the suspend function, and afterwards I call the resume function, so that in the end the layer UI dropdown list gets updated only once.
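Sketched out, the fix looked something like this; the suspend/resume and creation function names below are placeholders of mine, not the actual 3dsmax internals.

#include <string>
#include <vector>

// Placeholder declarations for the calls described above; the real names differ.
void SuspendLayerDropdownUpdates();
void ResumeLayerDropdownUpdates();
void CreateLayer(const std::wstring& name);

void CreateManyLayers(const std::vector<std::wstring>& names)
{
    SuspendLayerDropdownUpdates();      // stop rebuilding the dropdown for every layer
    for (const auto& name : names)
        CreateLayer(name);              // layer creation no longer touches the UI
    ResumeLayerDropdownUpdates();       // rebuild and sort the dropdown exactly once
}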

My favorite blogger, Joel Spolsky, wrote about this class of bug in a classic piece on “Shlemiel the painter’s algorithm”:

http://www.joelonsoftware.com/articles/fog0000000319.html

How to use Perforce’s P4.NET API for basic tasks

So recently I had a task to rename some files in Perforce. One or two files is easy; that can be done by hand. But this task required renaming around 800 files based on data from an external source. So I started looking for a .NET API that would let me access Perforce from C# rather than use the command line.

Thankfully Perforce has one.

http://www.perforce.com/product/components/apis

The only problem is that the documentation is basically no good:

http://www.perforce.com/perforce/doc.current/manuals/p4api.net/p4api.net_reference/Index.html

The help docs have one measly page that gives a tutorial on how it works, and that covers only connecting and disconnecting. Most of the rest of the reference guide contains no description of what anything does, what anything is, or how to use it. The docs are basically just a skeleton of a help file. And on every page is this disclaimer: [This is preliminary documentation and is subject to change.]. Indeed it is. The docs are so bare, they are basically useless. For example, you get a function signature with no description, like this:

Syntax

C#
public List<FileSpec> IntegrateFiles(
	IList<FileSpec> toFiles,
	FileSpec fromFile,
	Options options
)

Parameters

toFiles

Type: System.Collections.Generic.IList(FileSpec)

fromFile

Type: Perforce.P4.FileSpec

options

Type: Perforce.P4.Options

Return Value

 

You basically have to guess how to use this API!

To add insult to the whole comical affair, Perforce did add documentation for this particular method: not for the parameters as you see above, but for the method in general. But it was documentation for the command line! As if command-line syntax and idioms could possibly apply to C#! In stunned disbelief I had to press forward, as this was my only option for accessing Perforce from .NET. I had used a different Perforce .NET API years ago, but it was so old that it wasn't compatible with .NET 4.0 or the newer Perforce servers we use now.

So the point of this blog post is not to complain, but to share how I figured all this stuff out.

First off I wrote a class to encapsulate the credentials needed to log on to the Perforce Server. It contained data like the server name and port, the username and the client spec:

public class PerforceID
{
	public PerforceID(String serverURI, String user_name, String client_spec)
	{
		mServerURI = serverURI;
		mUserName = user_name;
		mClientSpec = client_spec;
	}
	private String mServerURI;
	public System.String ServerURI
	{
		get { return mServerURI; }
	}
	private String mUserName;
	public System.String UserName
	{
		get { return mUserName; }
	}
	private String mClientSpec;
	public System.String ClientSpec
	{
		get { return mClientSpec; }
	}
}

This instance is constructed with the appropriate data and used later.

I kept all my methods in a static class, with just a few static data members.

How to Connect to a Perforce Server

Obviously connecting to the Perforce server was of paramount importance. Thankfully the Perforce documentation did describe how to connect (at least). So here I basically borrow from their documentation, except that I am using my own static variables. And as you can see, I am using my own PerforceID instance to provide the server address (id.ServerURI), the user name (id.UserName) and the client spec I want to use (id.ClientSpec).

private static Perforce.P4.Connection mPerforceConnection;
private static Perforce.P4.Repository mRepository;
public static void Init(PerforceID id)
{
	// initialize the connection variables
	// note: this is a connection without using a password

	// define the server, repository and connection
	Server server = new Server(new ServerAddress(id.ServerURI));
	mRepository = new Repository(server);
	mPerforceConnection = mRepository.Connection;

	// use the connection variables for this connection
	mPerforceConnection.UserName = id.UserName;
	mPerforceConnection.Client = new Client();
	mPerforceConnection.Client.Name = id.ClientSpec;

	// connect to the server
	mPerforceConnection.Connect(null);
}

As you can see above, the Repository and Connection instances are needed later, which is why I keep hold of them in static members.

How to Disconnect from a Perforce Server

This part is easy (again borrowed from their docs).

public static void UnInit()
{
	if (mPerforceConnection != null)
	{
		mPerforceConnection.Disconnect();
		mPerforceConnection = null;
	}
}

How to open a file for edit

Editing a file is the most basic operation I could hope to do. Unfortunately it was not straightforward at all. Given a string that contains the full file path, this method will open the file for editing in Perforce.

public static bool Edit(String filename)
{
	Perforce.P4.Options options = new Options();
	mPerforceConnection.Client.EditFiles(options, new FileSpec[] { new FileSpec( new ClientPath( filename) ) });
	return true;
}

It was not my intention to open it for editing in any particular changelist. Thus after calling this method, the file will be opened for editing in the default changelist.

I found that the Perforce.P4.Client class contained a lot of the methods I would need to do various familiar operations. Operations that you could logically imagine a user doing manually such as Adding files to a changelist, deleting files, integrating files, shelving and unshelving, merging, moving, locking files etc…

How to integrate a file

Integration, or branching (versus copying), is important because it retains the file history. For this operation you need two arguments: the old and the new file name. Both must be fully qualified paths containing the directory and the file name.

using Perforce.P4;
public static bool Integrate(String old_name, String new_name)
{
	bool result = false;
	try
	{
		var from = new FileSpec(new ClientPath(old_name), VersionSpec.Head);
		var to = new FileSpec(new ClientPath(new_name), VersionSpec.None);

		mPerforceConnection.Client.IntegrateFiles(from, options, to);
		result = true;
	}
	catch (Exception e)
	{
		Console.WriteLine("Unknown exception calling Perforce.");
		Console.Write(e.Message);
		result = false;
	}

	return result;
}
static IntegrateFilesCmdOptions options = new IntegrateFilesCmdOptions(IntegrateFilesCmdFlags.None, -1, 0, null, null, null);

The reason I instantiated the IntegrateFilesCmdOptions instance outside of the function is that I had to call this function hundreds of times, so it didn't make sense to redo that instantiation every time. Also, like all my Perforce calls, I wrapped it in an exception handler, since I was guessing how the API worked. For instance, I had to guess what parameters I could pass into my options instance above. This will integrate the files and put them in the default changelist. The last three parameters I still have no idea what they do, but passing in null works!

This method works, but the only problem is that calling this 800 times is very slow. After about 30 seconds of this I decided I needed a faster approach. I needed to integrate 800 files in a few seconds.

How to integrate lots of files (fast)

So given the requirement that I needed to integrate hundreds of files, here is the fast way to do it. I needed to replace 800 calls to the server with just a few. This is done by first creating a branch spec and then integrating the branch. The branch spec contains which files get branched where, and the integrate command is given the branch spec as its main specification. So basically two commands get sent to the Perforce server instead of a few hundred. The result is spectacularly fast compared to the old way.

At the heart of a branch definition is a mapping of old files to the new files. The old file is always on the left, and the new file is on the right. So given that I want to integrate or branch file A.cpp to a new location and rename it to B.cpp, the mapping would look like this:

//depot/foo/A.cpp       //depot/new/output/B.cpp

How to Create a branch

Given those file specifications, creating a branch spec also requires giving it a name, and specifying the user name:

using Perforce.P4;
public static bool CreateBranch(String branch_id, String username, Dictionary<String,String> fileList)
{
	try
	{
		var view_map = new ViewMap();
		foreach( var pair in fileList )
		{
			var from = new ClientPath(pair.Key);
			var to = new ClientPath(pair.Value);

			view_map.Add(new MapEntry(MapType.Include, from, to));
		}

		var msg = "Created programmatically via the Perforce .NET api. Used for integrating lots of files at once.";
		bool locked = true;
		var branch = new BranchSpec(branch_id, username, DateTime.Now, DateTime.Now, msg, locked, view_map, null, null);
		var created = mRepository.CreateBranchSpec(branch);
		Debug.Assert(created != null);
	}
	catch (Exception e)
	{
		Console.WriteLine("Unknown exception calling Perforce.");
		Console.Write(e.Message);
		return false;
	}

	return true;
}

Here the first argument to the method, branch_id, is the unique identifying name of the branch. The username argument is required for some reason. The last argument, the Dictionary, contains the file mappings from old to new names. The method returns true if it succeeded and false otherwise. Again I wrapped it in an exception handler, because I had to figure out how to use this API on my own, by guessing and a lot of trial and error.

To verify that the branch was created, I opened perforce and browsed the branch specs that were owned by me, and indeed the new branch spec showed up in the list.

How to Integrate a branch spec

Now that my branch spec is defined, I can easily integrate it using just the unique name I gave it when the branch was created:

using Perforce.P4;
public static void IntegrateBranch(String branch_id)
{
	try
	{
		var change_list = -1;
		var max_files = -1;
		var branch_options = new IntegrateFilesCmdOptions(IntegrateFilesCmdFlags.None, change_list, max_files, branch_id, null, null);
		var created = mPerforceConnection.Client.IntegrateFiles(branch_options, null);
		Debug.Assert(created != null);
	}
	catch (Exception e)
	{
		Console.WriteLine("Unknown exception calling Perforce.");
		Console.Write(e.Message);
	}
}

So the two methods above could be called in succession like this:


CreateBranch("mybranch", "John_Doe", ….);
IntegrateBranch("mybranch");

Notes on the Max.Log file, timers and fixing a crash in the OBJ exporter

So lately I was given a task to fix a crash in the 3dsmax OBJ exporter. I won't go over the history of this project, but it's the only .OBJ exporter that we have. Anyways, we have had it for a few years, and just recently it started to fail in our automated tests.

So I looked at it, and could never reproduce it on my machine. To make matters worse, I asked for the max.log file, only to find absolutely nothing in it. So not only could I not reproduce the crash, the minidump file I had was inconclusive and the log file was empty. All of this makes debugging extremely difficult (as in bang-your-head-against-the-wall debugging).

On the other hand, the automation machine was failing regularly. So I decided to use logging to help figure out where it was crashing. The theory was that if I could pull enough information from the logger, then I could find out where it was failing.

But the problem is that 3dsmax has extremely thin logging capabilities. The only log file that 3dsmax has was written for rendering, more specifically network rendering. Hence a log file is created in this location:

<Max>\Network\Max.log

It quickly became the default logger for 3dsmax. Calls to the logger are interspersed through most of the code now, though the depth of coverage is not uniform. The code for this logger is implemented in 3dsmax.exe. However, since 3dsmax.exe is at the top of the DLL dependency chain, there is no .lib file for a DLL at the bottom of the dependency chain to link against in order to use the logger. And those DLLs at the bottom of the dependency chain cannot link against core.lib to gain access to the core Interface pointer either, because core.dll itself depends on a few of those very DLLs at the bottom of the 'food chain'.

So, if you want to use the logger in one of those DLLs that 3dsmax (or core) depends on… well, you can't. The dependencies would be all backwards. Independent third-party plugin developers for max don't have this problem: they can call a virtual method on Interface and gain access to the logger. No problem for them really. As for those independent DLLs like maxutil.dll, mesh.dll and mnmath.dll? Well, they don't have any logging. They are flat out of luck.

So what does this have to do with the OBJ exporter? Well not much, but it does explain the lack of a logger in some code.

So, to continue the story of fixing the OBJ exporter: I added calls to the max logger in a variety of functions in the OBJ exporter. Basically, any function that was one or two calls away from the virtual SceneExporter::DoExport logged its function name to the logger.
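The logging itself was nothing fancy. Here is a sketch of the kind of entry/exit logging I added; the macro is a convenience wrapper of my own (not part of the SDK), it assumes the LogSys interface that plugins reach through GetCOREInterface(), and the class and function signature are simplified for the sketch.

#include <max.h>
#include <log.h>

// My own convenience macro wrapping the core logger that writes to max.log.
#define OBJEXP_LOG(msg) \
    GetCOREInterface()->Log()->LogEntry(SYSLOG_DEBUG, FALSE, \
        _T("ObjExp"), _T("%s\n"), _T(msg))

class ObjExp { public: void ExportDaStuff(); /* ... */ };   // stub for the sketch

void ObjExp::ExportDaStuff()
{
    OBJEXP_LOG("ObjExp::ExportDaStuff - start");
    // ... the actual export work ...
    OBJEXP_LOG("ObjExp::ExportDaStuff - end");
}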

So I submitted my changes, and the next day the automation engineer sent me a max.log file. It showed that the crash occurred after the main virtual DoExport method ended, and then another ObjExp method ended after that:

2013/03/13 13:29:24 DBG: [02388] [02648] ObjExp::DoExport - end
2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::GetTriObjectFromNode - end <-- Crash after this

The thing is, nothing should have been executing after DoExport finished. Weird!

The beginning of the export sequence showed quite a few calls to ExportDaStuff (nice name, huh?):

2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::ExportDaStuff - start
2013/03/13 13:29:24 DBG: [02388] [01620] ObjExp::ExportDaStuff - start
2013/03/13 13:29:24 DBG: [02388] [02572] ObjExp::nodeEnum - start
2013/03/13 13:29:24 DBG: [02388] [02964] ObjExp::ExportDaStuff - start

Whenever it crashed, the max log showed extra threads: the thread ID column shows two and sometimes three distinct threads. Keep in mind that the first number in brackets is the process ID; the second number in brackets is the thread ID. At this point, having never touched the OBJ exporter in my life, I had no idea whether this was a problem. So I kept looking. Eventually I found that the call stack that creates the thread and gets ExportDaStuff going is this:

ProgressDlgProc
CreateExportThread <-- Calls _beginthread
ThExport
ExportDaStuff

Ok, so what was calling CreateExportThread? I found it in the window procedure for a dialog. The code was creating a timer, and on the first timer 'tick' (WM_TIMER) it killed the timer and then called the create-thread method.

case WM_TIMER:
    if(wParam == TIM_START)
    {
        KillTimer(hWnd, TIM_START);
        ...
        CreateExportThread(hWnd, exp);
    }
...

The MSDN documentation states that WM_TIMER is a low-priority message: it is delivered only after everything more important has been delivered, so it usually ends up at the back of the message queue. That means several WM_TIMER messages can pile up before they get delivered all at once; hardly a precise timing mechanism. So, as the code above shows, KillTimer was indeed getting called, but by the time it was called the other WM_TIMER messages were already in the queue. They then got delivered, and hence the extra threads were created.

So when the DoExport method finished, it cleaned up resources out from under the extra thread, which then crashed.

The fix was simply to not rely on timer ticks to create threads. The intent was to create only one thread in response to displaying the dialog. I was therefore happy to oblige by putting the call to CreateExportThread exactly where it belonged: in WM_INITDIALOG.
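Trimmed down to the relevant case (mirroring the snippet above, where exp is the exporter instance the dialog already had access to), the corrected handling looks like this:

case WM_INITDIALOG:
    // Create the single export worker thread when the dialog first appears,
    // instead of on a WM_TIMER tick that may fire more than once.
    CreateExportThread(hWnd, exp);
    return TRUE;
...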

Lessons learned:

1. If you are writing a plugin for 3dsmax, please be liberal and put tons of logging statements in your code.

2. Don't create threads in response to WM_TIMER messages.

3. Better yet, don't mix business logic with your UI code.

Avoid stack based string buffers like the plague

This week I was directed to a crash in the VRML exporter. This exporter has been with 3dsmax for ages. The crash report was quite direct in showing exactly where the product was crashing. So I looked at the function and saw a huge mess of code that was attempting to create unique names from the 3dsmax scene object names. It conflated a custom, do-it-yourself hashing algorithm with some really weird attempts to make the names unique on top of that by concatenating names together.

So in attempting to reproduce the crash I created 120 scene objects that all have the same name. Now please understand that 3dsmax couldn’t care less if you named 100,000 scene objects with the same name. It makes no difference since the scene graph is agnostic about the INode names.

But not all 3D file formats are that way, apparently VRML isn’t. Is that perhaps a reason why it’s obsolete?

So once I understood what the VRML exporter wanted, I said to myself: “There are easier ways to generate a unique name than this.” The easiest is to call INode::GetHandle(), which returns an unsigned value; I think the value is 32 bits wide (I'm going off memory here).

So the easiest way to get a unique string for a scene node is to just print the node name and the handle value to a string:

   TSTR buf;
   buf.printf(_T("%s - %u"), node->GetName(), node->GetHandle());

Once those few lines of code replaced the few hundred lines, the VRML exporter was able to export my test scene.

Oh, did I mention the cause of the crash in the original code? All those attempts to make a unique name were concatenating the scene node name over and over into a fixed-length, stack-based buffer that was 256 characters long….

Yes I’m sure you have heard of that problem before…
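For contrast, here is an illustrative sketch of the two patterns; this is not the actual exporter code, and the growable TSTR class with its printf method is the Max SDK facility used above (pulled in here via max.h).

#include <max.h>      // for TSTR, the Max SDK's growable string class
#include <tchar.h>

// The crash-prone pattern: repeated concatenation into a fixed 256-character stack buffer.
void BuildNameUnsafe(const TCHAR* nodeName, TCHAR out[256])
{
	_tcscpy(out, nodeName);
	_tcscat(out, _T("_"));
	_tcscat(out, nodeName);   // append a long enough name a few times and the stack is toast
}

// The replacement: let a growable string do the allocating.
TSTR BuildNameSafe(const TCHAR* nodeName, unsigned handle)
{
	TSTR buf;
	buf.printf(_T("%s - %u"), nodeName, handle);
	return buf;
}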

In a related area, I was testing with a camera whose name was way longer than 256 characters. It turns out 3dsmax does NOT support camera names longer than 256 characters; otherwise it crashes somewhere. So I got that fixed too…

So for safety’s sake: Do not attempt to feed strings into 3dsmax that are longer than 256 characters.

Bad Casts and wrong assumptions about class hierarchies

Lately here at work, I have been looking into solving a certain crash in the product. This particular crash was occurring in a function that has had many crashes over the years.

The situation involved code in a base class that was doing work for a derived class. In fact, the base class was calling methods on the derived class. To do that, it had to cast itself (the 'this' pointer) to a pointer of the derived class. But since there was more than one derived class, the cast was sometimes invalid: it was being performed on an instance of the other derived class.

So, sometimes when fixing a mysterious crash, just look for a bad cast. :)

The code had a class hierarchy roughly like this:

Class Base
	Class A
	Class B

So here, class A inherits from class Base. Similarly, class B also inherits from Base.

In this example case, some code in class Base was always casting the 'this' pointer to A. When called on an instance of B, that assumption doesn't hold, and the cast simply produces a corrupt pointer. Furthermore, it did this with a C-style cast, which offers no warning and no protection.

This code demonstrates the problem:

void B::foo()
{
	A* temp = (A*)this; // B is not A!
	// crash attempting to do something with temp
}

So if called like this,

B* b = new B();
b->foo();

we should expect a crash calling foo since class B is not class A.

There were a few things wrong with this.

  1. A base class should not implement work that rightly belongs to a derived class. For instance, it would be bad for a shape base class to be in charge of drawing a box and a sphere. To be polymorphic, the shape class should offer a virtual or abstract draw method and leave the implementation details to the derived box and sphere classes. A derived class instance can then stand in wherever a base class pointer or reference is expected, with no switch statements needed (a minimal sketch of this appears after this list).
    In my case here, the whole concept of polymorphism had been violated. If a derived class performs an operation, the base class should not be reaching down to call it; the base class should be completely ignorant of the implementation details of its derived classes. Worse, the operations being performed didn't involve any virtual methods at all, and without virtual methods there is no polymorphism. Hint: in C++ it is often possible to replace switch statements and if-else chains with simple virtual method calls. This code was pure C masquerading as C++.
  2. An incomplete understanding of the class hierarchy that derives from this base class. The assumptions may have held once, but as other derived classes were added later, they failed. Casting everything to one particular derived class to perform operations was very fragile, and asking for trouble.
  3. Using a C-style cast here is plain wrong in C++. A C-style cast may be fast, but it never fails and offers no protection. Thankfully C++ offers a measure of protection: with Run Time Type Identification (RTTI) enabled, a dynamic_cast<> on a pointer performs a runtime check and returns NULL (or nullptr as of VC 10.0) when the object is not actually of the target type, instead of silently handing back a corrupt pointer. (The other standard casts, static_cast<> and reinterpret_cast<>, do no runtime checking, and dynamic_cast requires the base class to have at least one virtual function.) Thus B::foo could have been written like this:
void B::foo()
{
	A* temp = dynamic_cast<A*>(this);	// returns nullptr here, since this B is not an A
	if (temp)
	{
		// do something with temp
	}
}

In this case, as shown, the code inside the if statement would never be executed, because the cast from a B to an A correctly fails.
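For reference, here is a minimal sketch of the polymorphic design described in point 1 above: the base class declares the operation, each derived class implements it, and no casts or switch statements are needed.

class Shape
{
public:
	virtual ~Shape() {}
	virtual void Draw() const = 0;   // the base knows *that* shapes draw, not *how*
};

class Box : public Shape
{
public:
	void Draw() const override { /* draw a box */ }
};

class Sphere : public Shape
{
public:
	void Draw() const override { /* draw a sphere */ }
};

void Render(const Shape& shape)
{
	shape.Draw();   // dispatches to Box::Draw or Sphere::Draw; no casts required
}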

Finally, in fixing the bug, I did a search through the entire source file looking for these C-style casts to the wrong type. Almost every function that used this bad cast had produced plenty of customer crash reports over the years.