Friday, 2 December 2016

Agile In Data Science and Analytics

Is Agile an effective way to herd the data scientists into the production pen, or just an excuse to avoid documentation and planning? Which components of Agile do we recommend for analytics PoCs and full-fledged projects?

So let's discuss it.

Every organization starts with its business ambitions and then creates a roadmap of the technology, people and investment needed to unlock that business potential. To reach the objective, we go through a phase of initial discussions to understand the requirements and the technical workloads, such as: “I need a Linux server, a database, a recommendation engine, tools to handle big data...”
Technical requirements are quite straightforward most of the time, but analytical activity is vague and uncertain: we don’t know in advance which approach will best solve the problem, or how much time it will take to reach the best solution.
If we develop it with the traditional waterfall model, how will it go?

Developing a traditional analytics project:
Let’s say we need to build a recommendation engine for users. The use case seems pretty easy. A traditional analytics team would work endlessly on an engine that uses the entire user data, runs CBR (content-based recommendation) or CF (collaborative filtering), and after a long effort possibly delivers a powerful recommendation engine that can provide near-real-time recommendations to the users. Throughout this seemingly hassle-free development, there was no interaction with business people.

Challenges in Traditional Approach
We developed the entire engine but are not sure about the correctness of the model. What if we used the wrong data or the wrong variables? We don’t even know whether our data exploration and insights were correct. Oops, assume the stakeholders reject the model and send back feedback because it didn’t meet their expectations. Let’s rework now. Wouldn’t it have been awesome if we had used Agile from the start?
An Agile approach would have played a great role here, with rapid, iterative product development and rapid customer feedback cycles.
Now our problem and opportunity come at the intersection of two trends: how can we incorporate data science and analytics, which is applied research and needs exhaustive effort on an unpredictable timeline, into agile application development? How can analytics applications do better than the traditional waterfall model? How can we craft applications for unknown, evolving data models?

What is Agile

Agile software development focuses on four values (from the Agile Manifesto):
  • Individuals and interactions over processes and tools
  • Working software over comprehensive documentation
  • Customer collaboration over contract negotiation
  • Responding to change over following a plan
Engineering products and engineering data science are different, as data science is less deterministic. It needs a lot of creativity and thought to derive the best approach. Agile helps to manage this in cycles, where the team explores, learns something about the data, shares the insights with the business team/stakeholders, aligns the needs and approach, takes the feedback and starts again in the same direction.

How Agile Analytics approach unfolds

The main difference between the traditional and the Agile analytics approach is the use of an iterative process: sharing the learnings with stakeholders, getting rapid feedback, and refining the business questions and the description of the datasets as we learn.
A team of data scientists, business analysts and other SMEs works with the stakeholders to discuss each question until they have:
  • A clear scope that is as narrow as possible
  • Potential datasets and variables to be used for analysis
  • Questions to be answered
Data scientists provide insights on the nature and quality of the dataset, hone the questions and hypotheses, and provide a concrete list of algorithms that could viably answer those questions. These outputs turn into proofs of concept or prototypes of an analytics solution.
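To make the idea of a prototype concrete, here is a minimal sketch of what a first-iteration recommendation PoC might look like: a user-based collaborative filter on a handful of ratings, small enough to show stakeholders after one short cycle. The ratings, the user and item names, and the helper functions `cosine` and `recommend` are all hypothetical and invented for illustration; a real engine would use the actual user data and a proper library.

```python
from math import sqrt

# Hypothetical toy ratings (user -> {item: rating}); illustrative only.
RATINGS = {
    "ann": {"A": 5, "B": 4, "C": 1},
    "bob": {"A": 4, "B": 5},
    "cat": {"C": 5, "D": 4},
    "dan": {"A": 1, "C": 4, "D": 5},
}


def cosine(u, v):
    """Cosine similarity between two users' rating dicts.

    The dot product runs over items both users rated; the norms use
    each user's full rating vector.
    """
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[i] * v[i] for i in shared)
    norm_u = sqrt(sum(r * r for r in u.values()))
    norm_v = sqrt(sum(r * r for r in v.values()))
    return dot / (norm_u * norm_v)


def recommend(user, ratings, top_n=2):
    """Rank items the user has not rated by the similarity-weighted
    average rating given by other users."""
    scores, weights = {}, {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine(ratings[user], other_ratings)
        if sim <= 0:
            continue  # ignore users with nothing in common
        for item, rating in other_ratings.items():
            if item in ratings[user]:
                continue  # skip items the user already rated
            scores[item] = scores.get(item, 0.0) + sim * rating
            weights[item] = weights.get(item, 0.0) + sim
    ranked = sorted(((scores[i] / weights[i], i) for i in scores), reverse=True)
    return [item for _, item in ranked[:top_n]]


if __name__ == "__main__":
    print(recommend("bob", RATINGS))
```

A sketch like this is deliberately throwaway. Its value is that the business team can look at the recommendations for a few known users and say early whether the data and variables are the right ones, before anyone invests in the full engine.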
Agile analytics is a voyage of discovery. The structure below, known as the data-value pyramid, explains that.

Every project needs an investment, and building an analytics solution is generally costlier than developing application software. Because each business silo can point to a different domain or a different data source, there is high risk in the investment.
Agile analytics helps to minimize the risk of pursuing blind alleys. With its iterative approach and cyclic interaction with the business team, it mitigates the risk of implementing models that turn out to be garbage.