Beware of ‘Magic Machine Learning Algorithms’ and Embrace Your Data



Machine learning (ML) is a trending topic in business applications of data science. There is no question that the application of ML can be a powerful tool to generate business value from data. However, the buzz around ML, combined with a broad shortage of advanced skills in this space, fuels a lot of misconceptions around what is required to generate real value from ML algorithms.

Terms like “proprietary ML algorithm” often pique my interest while also raising some healthy skepticism, because experience tells me that it is the data, rather than the underlying algorithms, that makes or breaks applications of ML.

There is evidence that data, and not a special algorithm, is the key source of value in business applications of ML:

Given the same dataset, leading teams tend to get broadly the same results with ML.
This is clearly demonstrated by the leaderboards of most Kaggle competitions. If you have ever participated in such a competition, you know that the top of the leaderboard quickly becomes crowded with competitors whose scores are virtually indistinguishable.

In such competitions, there is an answer within the data, and once you’ve reached peak performance, you’ve reached it. Beyond that point, statistical noise overtakes any additional predictive value current ML algorithms can provide. Paradigm shifts in performance are rare. Different teams may take different approaches to feature selection and ensemble different algorithms with different weights, but the results they produce are broadly similar. If the secret algorithms developed by each team were truly differentiating, this would not happen.
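To make the point about statistical noise concrete, here is a minimal sketch of how one might check whether the gap between two leaderboard scores is larger than sampling noise. The scores, the test-set size, and the function names are all hypothetical, and the check treats the two scores as independent, which is a simplification when both are measured on the same hidden test set.

```python
import math


def accuracy_standard_error(accuracy: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy measured on n_examples rows."""
    return math.sqrt(accuracy * (1.0 - accuracy) / n_examples)


def within_noise(acc_a: float, acc_b: float, n_examples: int, z: float = 1.96) -> bool:
    """Return True if the gap between two scores is below ~95% sampling noise.

    Assumption: the two scores are treated as independent estimates.
    """
    se_a = accuracy_standard_error(acc_a, n_examples)
    se_b = accuracy_standard_error(acc_b, n_examples)
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return abs(acc_a - acc_b) < z * se_diff


# Hypothetical leaderboard entries scored on a 5,000-row hidden test set.
print(within_noise(0.912, 0.914, 5000))  # two top teams, 0.2 points apart
print(within_noise(0.800, 0.914, 5000))  # a weak baseline vs. a top team
```

Under these assumptions the two top scores cannot be told apart, while the gap to the weak baseline is far outside the noise band. That is the pattern the crowded top of a leaderboard tends to show.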

Those producing the biggest breakthroughs in ML technology tend to release those algorithms to everyone for free.
There are a lot of people using ML, but there are relatively few companies and open source efforts generating genuinely new approaches to ML. Why would the likes of Facebook, Google, Bloomberg and others take their most fundamental advances in ML technology (e.g., TensorFlow for neural networks) and release them to the masses for free? Isn’t this giving away their secret sauce?

If you ask these firms they’ll tell you no, because their secret sauce is the data they hold which you can’t easily access. They protect, and charge for, their data and the output of their ML—but they largely give their ML algorithm advances away for free.

There are some exceptions to this, such as proprietary voice transcription algorithms. However, even in this space, where a great deal of advanced ML unquestionably goes into developing the algorithms, the best results are produced by those with the best proprietary datasets: think of AWS Transcribe and the massive amount of data collected every day from people speaking to Amazon’s Alexa. You could poach the engineers creating those algorithms, but without the underlying data that the algorithms are built on, it would be very hard to replicate their performance.

The increasing tendency to make such private algorithms available at commodity-level prices further reinforces the idea that what you do with the algorithm is ultimately far more valuable than the algorithm itself. Given a voice recording, anyone can now obtain a high-quality transcript of that recording for just a few cents—but not everyone has access to the original recording.

In other business sectors where ML is often applied, such as developing trading algorithms for hedge funds, the above points apply as well. Increasingly, there is more talk about proprietary datasets for sale in this space than about proprietary algorithms.

This is all broadly good news for most businesses looking to deploy ML because it means you probably already have the key differentiating factor—data—that your competition either doesn’t have or isn’t (yet) using.

In order to get the most value from ML using your data, you need:
1. A clear business problem to solve: There is no shortage of business problems, but make sure the problem is truly an ML problem.
2. Underlying ML technologies: These are mostly free, and everyone generally has access to the same tools.
3. Advanced skillsets to apply these technologies to business problems: Skills and experience matter. Demand for the best skills still significantly outpaces supply, although even moderately skilled teams can have a big impact, given high-quality datasets and standard approaches.
4. Data to generate valuable insights: This is mostly still the realm of private datasets, although they are often supplemented by publicly available or purchased data.
5. An understanding of what parts of this data to use and how: More data doesn’t necessarily result in better models. Selecting what bits of data to use, or not use, is often more important than the algorithm itself.
6. An effective data-driven change management process: Data and models alone will not automatically make everything better. Even with advanced ML, the company still needs effective and adaptable business process management practices.
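As a small illustration of point 5, the sketch below keeps only the columns that show a reasonably strong linear relationship with the target and drops the rest. It is a deliberately simple filter, not a recommended method: the column names, the values, and the 0.5 threshold are hypothetical, and a correlation filter will miss non-linear effects and interactions between columns.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length lists (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, target, min_abs_corr=0.5):
    """Rank columns by |correlation| with the target and keep the strong ones."""
    scores = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    kept = [name for name, score in ranked if score >= min_abs_corr]
    return kept, scores


# Hypothetical toy dataset: revenue is the target, the rest are candidate inputs.
revenue = [10, 20, 30, 40, 50, 60]
columns = {
    "units_sold": [1, 2, 3, 4, 5, 6],      # strongly related to revenue
    "discount": [6, 5, 4, 3, 2, 2],        # strongly (negatively) related
    "row_id_parity": [0, 1, 0, 1, 0, 1],   # weakly related, mostly noise
    "constant_flag": [1, 1, 1, 1, 1, 1],   # carries no information at all
}

kept, scores = select_features(columns, revenue)
print(kept)
```

Here two of the four columns survive the filter. In practice the decision about what to keep also depends on domain knowledge about where each column comes from and whether it will be available when the model is used.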

The ML algorithm is important in that list, but it is only one step of many. For the reasons discussed earlier, it is likely not the main differentiating factor in successfully deriving business value from data. The secret sauce thus becomes the ability to use your data to deploy integrated end-to-end solutions, by:

• Collecting useful data from many source systems
• Integrating that data into a consistent digital footprint of business activities
• Deriving useful insights from that data (through a range of data science techniques including, but not limited to, ML)
• Communicating such findings to stakeholders
• Acting on those findings either via automation, change management or manual activities (or a combination thereof)
• Measuring the impact of those actions against defined business problems

When an organization asks, “How can we use ML here?”, what it really needs to think about is developing a solution that checks all of the boxes above. That means thinking more about end-to-end operational data integration than about an algorithm alone. In the long run, it is the data you have and what you do with it in your business that is differentiating.