In the final paragraph of my post revisiting the Linear Law of Patent Analysis, I mentioned that one of my primary motivations for coming back to this topic was my surprise at hearing a “big data” consultant recognize how critical the client’s business needs were to a successful outcome. Data scientists sometimes have a tendency to “let the data tell the story”. This occurs when the analyst begins to scrutinize a data collection without any preconceived notions and allows the insights to flow from the most prevalent trends or correlations. Occasionally this method, which relies on correlation without causation, yields an insight that helps with decision making. Often, though, the correlation is spurious because the data, or the context in which it is used, has been misunderstood. To reduce that uncertainty, as provided for in the Linear Law of Patent Analysis, I prefer to understand the business needs and the associated questions, or hypotheses, before I start thinking about the data and the analysis.
In a recent post on Data Science Central entitled To Hypothesize or Not to Hypothesize, analyst Michael Walker argues that in some circumstances, specifically when examining marketing and sales situations that involve complex human behavior, it is not necessary to develop a hypothesis before analyzing the data. He provides the following example as justification:
For example, we recently were engaged by a large financial firm to find meaning in data to help market and sell certain financial products. In one case, we found a strong correlation between two (2) variables suggesting the purchase of one product increased the purchase of another product. There was no rational explanation for this correlation and no way to prove causation. We suggested a number of controlled experiments to test different strategies. Human purchasing behavior is tricky yet by running a number of experiments we found the optimal marketing and selling process that significantly increased sales. This process would not meet traditional scientific standards – yet it worked for this particular purpose…No hypothesis, no expectations – just pure trial and error to see what worked and did not work, and attempt to explain why.
Mr. Walker goes on to generalize this by saying:
The dirty secret in business and public policy (but not hard scientific disciplines) – when dealing with unpredictable human behavior – is that running many experiments is often (but not always) superior to creating a model to test a hypothesis… Yet you do not need to build a general model to understand human behavior and purchase patterns. Finding a strong correlation between A and B and increased sales, you can run an experiment or better yet a series of controlled experiments to see what works. You don’t even need to know why it works or does not work – although that would be nice.
I generally don’t like using trial and error to determine the best course of action. I don’t mind correlation without causation in methods such as support vector machines for classifying documents, where the method either works well or it doesn’t. When it comes to decisions, though, I think it is important to understand more about why a course of action produces the desired result.
Interestingly, while Mr. Walker suggests that hypotheses are not always necessary for a successful outcome, he also concedes that the rigor that goes into selecting the controlled experiments is itself a form of generating a hypothesis:
To be fair, it may be argued that by selecting and designing the experiments in a certain manner we were in fact formulating and testing hypothesis.
So, even in a post suggesting that generating a hypothesis is not always necessary for rigorous analysis, we eventually get back to a place where understanding the business needs and the context of the data is critical to ensuring that the results are actionable for the client or decision maker. Keeping this in mind, an analyst would be hard-pressed to come up with a situation where allowing the “data to tell the story” independently of context would produce a useful result except by chance.
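To make the “by chance” point concrete, here is a minimal sketch in Python. It is my own illustration and is not taken from Mr. Walker’s post or from any client data. It generates variables that are unrelated by construction, scans every pair, and reports the strongest correlation it finds. The function names, the number of variables, and the number of observations are all hypothetical choices made for the example.

```python
import itertools
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def strongest_chance_correlation(n_variables, n_observations, seed=0):
    """Scan all pairs of independent random variables.

    Every column is pure noise, so any correlation found here is
    spurious by construction. Returns the pair of column indices with
    the largest absolute correlation, plus that correlation.
    """
    rng = random.Random(seed)
    columns = [
        [rng.gauss(0.0, 1.0) for _ in range(n_observations)]
        for _ in range(n_variables)
    ]
    best_pair, best_r = None, 0.0
    for i, j in itertools.combinations(range(n_variables), 2):
        r = pearson(columns[i], columns[j])
        if best_pair is None or abs(r) > abs(best_r):
            best_pair, best_r = (i, j), r
    return best_pair, best_r


if __name__ == "__main__":
    # Hypothetical sizes: 40 unrelated variables, 20 observations each.
    pair, r = strongest_chance_correlation(40, 20)
    print(f"Strongest pair {pair} has correlation {r:.2f}")
```

With 40 variables there are 780 pairs to compare, so a scan like this will typically turn up at least one correlation that looks strong even though nothing connects the two columns. Starting from a business question narrows the pairs worth examining, and a controlled experiment then checks whether a correlation that survives actually holds up.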