Machine Learning’s Complexity Problem

I’ve been experimenting with machine learning lately.   For someone who started writing code in the early 90’s and witnessed firsthand the explosion of the web and all the software engineering practices that evolved from it, I find amazing how machine learning  flips traditional software engineering on its head.

Traditional software engineering taught us to divide and conquer, minimize coupling, maximize cohesion while artfully abstracting concepts in the problem domain to produce functional and maintainable code in the solution domain.  Our favorite static code analysis tools helped keep our code (and its complexity) in check.

Similarly, traditional software architecture taught us to worry less about code complexity and more about architectural complexity for it had farther reaching consequences. Architectural complexity had the potential to negatively impact teams, businesses and customers alike, not to mention all phases of the software development lifecycle.

Yes, this was the good ol’ world of traditional software engineering.

And machine learning flips this world on its head.  Instead of writing code,  the engineering team collects tons of input and output data that characterize the problem at hand.  Instead of carving component boundaries on concepts artfully abstracted from the problem domain,  engineers experiment with mathematics to unearth boundaries from the data directly.

And this is where machine learning’s complexity problem begins.  Training data sets rarely derive from a single cohesive set.  They instead depend on a number of  other data sets and algorithms.     Although the final training data set may be neatly organized as a large table of features and targets, the number of underlying data dependencies required to support this can be quite dramatic.

Traditional software engineering became really good at refactoring away dependencies in static code and system architectures in order to tame the complexity beast, the challenge now is to do the same for data dependencies in machine learning systems.

In conclusion, the paper “Machine Learning: The High Interest Credit Card of Technical Debt” summarized this and a number of other ML complexity challenges nicely:

No inputs are ever really independent. We refer to this here as the CACE principle: Changing Anything Changes Everything.”

Tagged ,

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: