“Building a Culture around Evidence-based Decision Making and Quantitative Machine Learning is the Hallmark of Future-Proof Organisations” says Denis Bauer, Keynote Speaker for Digital Transformation Day
The Digital Transformation Day on March 8th, 2018, at the Agile India 2018 Conference, is as much about learning how research is finding innovative solutions to problems by applying cloud technology and machine learning as it is about learning how enterprises and organizations are benefiting from the era of digitization. Denis Bauer from the Commonwealth Scientific and Industrial Research Organisation (CSIRO) will present a keynote talk, “How Novel Compute Technology Transforms Life Science Research”. Denis will talk about her team’s achievements and advances in building solutions that use modern architectures, such as serverless (AWS Lambda), as well as customized machine learning on Apache Spark. Through a Q&A session with Denis, #agileindia2018 learned how building genome-scale data pipelines is similar to addressing computational challenges in business workflows, and how cancer genomics and bioinformatics mind-sets are useful for enterprises and organizations in the era of digitalization and data-driven decisions.
Organizations are investing heavily in Big Data adoption. How do you think they can be selective and strategic in choosing the right problems to solve with the right technology?
In my opinion, adopting Big Data and ML has no influence on which problems need solving, as pain points are dictated by the market (customer service, filling a gap in the market, being innovative etc.). However, they can and should have an influence on the choice of solutions. For example, using historic Big Data (e.g. customer behavior over the last 3 years) enables evidence-based decisions rather than relying on hunches or expensive trial-and-error tests (prospective A/B testing). Similarly, quantifying trends using machine learning enables more robust and generalizable approaches to be implemented rather than ad-hoc rule-based decisions. I argue that building a culture around evidence-based decision-making using historic data and unbiased quantitative machine learning is the hallmark of a future-proof company/enterprise strategy.
Do you see a difference in approaching digital transformation on the research side versus the business side of things?
Digital transformation in research is likely much swifter than in business. Research is inherently about communicating knowledge, sharing advancements and building on each other’s body of work. That being said, research can be less outcome focused than business and is more tolerant of unstable solutions. Therefore, to achieve a wide-reaching adoption of a technology, the innovation ecosystem needs three elements:
- the trailblazing research community for generating new ideas
- the innovative startups to identify commercially viable ideas
- the trusted companies to solidify adoption with robust stable solutions.
What are the key challenges in scaling up machine learning algorithms and architecting solutions for genome-size data?
One of the key issues is a lack of data, which is a bit counter-intuitive as genomics is producing more data than other Big Data disciplines. While there is a lot of activity in the genomics space resulting in thousands of genomes being generated, the complexity of the problem researchers are trying to solve is even bigger. To truly understand the genetic drivers of common diseases like diabetes, we really need a whole-population-scale approach. This is difficult from a technical perspective (data volumes, scaling computational approaches) but even more so from a political and social perspective. We therefore first need to address the questions of how individuals can gain and control access to their genomic and medical data, how to keep analyses of such data volumes sustainable and how to quantify the validity of the resulting research. Current approaches involving blockchain and cryptocurrency go some way on ownership and provenance. However, to truly leapfrog to the position we need to be in for innovation, the general public needs to form an opinion and arrive at a consensus. That is why starting this conversation is important for building the innovative health care system of the future.
Can you provide high-level insights on the customizations done on ML and the use of AWS Lambda?
To the best of my knowledge, VariantSpark is the first ML technology able to deal with extremely ‘wide’ data, that is, data with millions of features, whereas traditional ML algorithms cater for only 1,000 to 10,000 features. This is because traditional Big Data does not have as much information for each data point (e.g. for customer data, there is only so much information to be collected on customer type, location, interactions etc.).
Genomics, which describes a whole individual through their 3-billion-letter genome, is therefore the first discipline that requires the analysis of such wide data sets. We use VariantSpark for classification problems, e.g. predicting disease risk for a new patient, as well as to gain biological insights, e.g. which mutations can cause disease. The latter can also be described as a feature selection task and as such may be interesting to a wide range of disciplines facing ever wider datasets as part of the ongoing datafication. For example, VariantSpark could be used to find out which sensor reading is predictive of impending failure or which customer actions are predictive of churn.
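To make the feature selection idea concrete, here is a minimal sketch on a synthetic 'wide' dataset with more features than samples. It is not VariantSpark's algorithm or API: it simply scores each binary feature on its own by the Gini impurity reduction of a single split and ranks the features by that score. All names (`gini`, `split_gain`, `rank_features`, `make_wide_dataset`), the dataset sizes and the two planted "causal" feature indices are assumptions made for illustration.

```python
import random


def gini(labels):
    """Gini impurity of a list of binary (0/1) labels."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def split_gain(column, labels):
    """Impurity reduction from splitting the samples on one binary feature."""
    left = [y for x, y in zip(column, labels) if x == 0]
    right = [y for x, y in zip(column, labels) if x == 1]
    weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
    return gini(labels) - weighted


def rank_features(rows, labels, top_k=5):
    """Return the indices of the top_k features, highest gain first."""
    scored = []
    for j in range(len(rows[0])):
        column = [row[j] for row in rows]
        scored.append((split_gain(column, labels), j))
    scored.sort(reverse=True)
    return [j for _, j in scored[:top_k]]


def make_wide_dataset(n_samples=300, n_features=1000, causal=(7, 612), seed=0):
    """Synthetic data (assumption): random binary features, two of which drive the label."""
    rng = random.Random(seed)
    rows = [[rng.randint(0, 1) for _ in range(n_features)] for _ in range(n_samples)]
    labels = []
    for row in rows:
        # Each sample copies its label from one of the two causal features at random.
        labels.append(row[rng.choice(causal)])
    return rows, labels


if __name__ == "__main__":
    rows, labels = make_wide_dataset()
    print("Top-ranked feature indices:", rank_features(rows, labels))
```

On data like this, the two planted features should rank near the top because the remaining features are pure noise. Scoring features one at a time misses interactions between features, so treat this only as an illustration of what "feature selection on wide data" means, not of how to do it at genome scale.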
Similarly, GT-Scan was one of the first complex serverless frameworks that demonstrated the power of Lambda-based workflows far beyond the scope Lambda was originally designed for. Overcoming the space and compute limitations of Lambda, we developed a new breed of distributed cloud-based architectures to handle large complex research workflows. We treat Lambda functions as “free-floating instantaneously recruitable CPUs” to build ad-hoc clusters or cheap persistent web-services. We therefore think that GT-Scan’s serverless pattern has a wider application scope than just the genomic research where it originated.
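The "recruitable CPUs" idea can be sketched as a fan-out/fan-in pattern: split a large job into small chunks, hand each chunk to a short-lived worker, then combine the partial results. The sketch below is not GT-Scan's code and does not call AWS. It uses a local thread pool as a stand-in for Lambda invocations, and the task (counting G and C bases in a sequence) as well as the names `split_into_chunks`, `count_gc` and `fan_out` are hypothetical choices for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(sequence, chunk_size):
    """Cut the input into pieces small enough for one short-lived worker."""
    return [sequence[i:i + chunk_size] for i in range(0, len(sequence), chunk_size)]


def count_gc(chunk):
    """Hypothetical per-chunk task; stands in for one function invocation."""
    return sum(1 for base in chunk if base in "GC")


def fan_out(sequence, chunk_size=4, max_workers=8):
    """Fan the chunks out to workers, then fan the partial counts back in."""
    chunks = split_into_chunks(sequence, chunk_size)
    # Local stand-in (assumption): a thread pool plays the role of the
    # on-demand functions; no cloud service is involved here.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        partial_counts = list(pool.map(count_gc, chunks))
    return sum(partial_counts)


if __name__ == "__main__":
    print(fan_out("ACGTACGGCC", chunk_size=3))
```

Keeping each chunk small is what lets every worker stay within tight memory and runtime limits, and because the partial results are simply added up, the workers never need to talk to each other.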
More about Dr. Denis Bauer
Dr. Denis Bauer leads the Transformational Bioinformatics team at Australia’s national science agency, CSIRO, the research institution behind fast WiFi, the Hendra virus vaccine, and polymer banknotes. She is also involved in initiatives to bring genomics into medical practice. Denis holds a Ph.D. in bioinformatics with expertise in machine learning and genomics. Her work has been featured in prestigious blogs and highlighted as one of Computer Weekly’s Top 10 Australian IT stories of 2017.