The data lake: tranquillity and value

The promise of advantage from industrial-scale data collection is compelling. Greg Hanson, SVP EMEA cloud at Informatica, explains how to avoid an unruly swamp and deliver crystal-clear insight.

Much has been said in recent times of the concept of the data lake. Some are claiming that it's the new data warehouse and a more efficient and all-encompassing way for us to store and access our data. In fact, it is much more than that. With the rise of the cloud and data-gathering Internet of Things (IoT) devices, many companies are facing an exponential increase in business-relevant data. The average mid-sized company can expect to draw in several petabytes (10¹⁵ bytes) of data every month from website traffic, customer engagements, sales leads, employee communications and much more. This clearly needs organising.

As a result, the data warehousing approach of the previous decade is beginning to reach its limits. This carefully ordered, file-based warehousing system requires an advanced level of categorisation at the input stage. For many companies this means time-consuming manual involvement and hand coding. Given that the point of large-scale data gathering is to maximise the resource and then mine it, this need to define data before it has even been requested is counterintuitive and impractical when dealing with such extremely large volumes of information.

The first step in achieving a high-capacity approach is to ensure that data from across the full breadth and depth of the company's cloud and on-premises architecture is collated and unified, so that users can rapidly access all relevant data. Once they have access, users must be able to search, discover and understand all of the data in the resulting data lake at speed, and without the need for manual intervention.
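As a rough illustration of this landing-zone style of ingestion, the sketch below copies raw exports into a lake directory unchanged and records basic provenance alongside each file. The paths, source-system names and directory layout are illustrative assumptions, not a description of any particular product.

```python
import hashlib
import json
import shutil
from datetime import datetime, timezone
from pathlib import Path

LANDING_ZONE = Path("lake/landing")  # hypothetical landing zone for raw data


def ingest_file(source_path: str, source_system: str) -> Path:
    """Copy a raw file into the lake unchanged and write a small ingestion record."""
    src = Path(source_path)
    dest_dir = LANDING_ZONE / source_system / datetime.now(timezone.utc).strftime("%Y/%m/%d")
    dest_dir.mkdir(parents=True, exist_ok=True)
    dest = dest_dir / src.name
    shutil.copy2(src, dest)

    # Record where the data came from and when; no schema is imposed at this stage.
    record = {
        "source_system": source_system,
        "original_path": str(src),
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "sha256": hashlib.sha256(dest.read_bytes()).hexdigest(),
    }
    (dest.parent / (dest.name + ".ingest.json")).write_text(json.dumps(record, indent=2))
    return dest


if __name__ == "__main__":
    # Hypothetical exports from a CRM system and an on-premises ERP system.
    ingest_file("exports/crm_leads.csv", source_system="crm")
    ingest_file("exports/erp_orders.csv", source_system="erp")
```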

The key to this is metadata cataloguing. Automatic metadata analysis will allow users to provide business context to raw data. Once context has been assigned, the next stage is to determine what relationships exist between different datasets in order to further determine their provenance and relevance to a specific task. It's also advisable to implement systems which can record the steps taken in these data preparation processes to ensure that they do not have to be re-entered or rearchitected for each new set of data.
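A minimal sketch of this kind of cataloguing might look like the code below: it profiles raw CSV files, attaches business terms from a small glossary, and infers candidate relationships between datasets from shared column names. The glossary, file layout and matching rules are simplified assumptions; a production catalogue would go much further.

```python
import csv
import json
from pathlib import Path

LANDING_ZONE = Path("lake/landing")       # raw files from the ingestion step
CATALOG_PATH = Path("lake/catalog.json")  # hypothetical metadata catalogue

# Illustrative glossary: column-name keywords mapped to business terms.
GLOSSARY = {"email": "customer contact", "order": "sales transaction", "lead": "sales lead"}


def profile_csv(path: Path) -> dict:
    """Extract technical metadata (columns, row count) and attach business context."""
    with path.open(newline="") as f:
        reader = csv.reader(f)
        columns = next(reader, [])
        rows = sum(1 for _ in reader)
    tags = sorted({term for col in columns for kw, term in GLOSSARY.items() if kw in col.lower()})
    return {"dataset": path.name, "columns": columns, "rows": rows, "business_tags": tags}


def build_catalog() -> dict:
    entries = [profile_csv(p) for p in LANDING_ZONE.rglob("*.csv")]
    # Infer candidate relationships: datasets sharing a column name may join on it.
    relationships = []
    for i, a in enumerate(entries):
        for b in entries[i + 1:]:
            shared = set(a["columns"]) & set(b["columns"])
            if shared:
                relationships.append({"datasets": [a["dataset"], b["dataset"]],
                                      "shared_columns": sorted(shared)})
    catalog = {"datasets": entries, "relationships": relationships}
    CATALOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    CATALOG_PATH.write_text(json.dumps(catalog, indent=2))
    return catalog


if __name__ == "__main__":
    print(json.dumps(build_catalog(), indent=2))
```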

Finally, the system must be architected with data security at its heart, so that data can be tracked across the full architecture, wherever it originates and wherever it resides. When storing petabytes of raw data, it's essential to be able to identify and track any proliferation in order to detect unusual access or movement.
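By way of illustration, the sketch below scans a hypothetical access log and flags reads that are unusually large or fall outside working hours. The log format, thresholds and rules are assumptions standing in for the far richer monitoring a real platform would provide.

```python
from datetime import datetime
from typing import Iterable

# Hypothetical access-log records: who read which dataset, when, and how much.
ACCESS_LOG = [
    {"user": "analyst1", "dataset": "crm_leads.csv", "time": "2024-03-04T10:12:00", "mb_read": 40},
    {"user": "analyst1", "dataset": "crm_leads.csv", "time": "2024-03-05T02:47:00", "mb_read": 900},
    {"user": "batch_job", "dataset": "erp_orders.csv", "time": "2024-03-05T01:00:00", "mb_read": 120},
]


def flag_unusual(events: Iterable[dict], mb_threshold: float = 500,
                 allowed_hours: range = range(7, 20)) -> list[dict]:
    """Flag reads that are unusually large or happen outside normal working hours."""
    flagged = []
    for e in events:
        hour = datetime.fromisoformat(e["time"]).hour
        reasons = []
        if e["mb_read"] > mb_threshold:
            reasons.append("large transfer")
        if hour not in allowed_hours:
            reasons.append("out-of-hours access")
        if reasons:
            flagged.append({**e, "reasons": reasons})
    return flagged


if __name__ == "__main__":
    for event in flag_unusual(ACCESS_LOG):
        print(event)
```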

The data lake is quickly becoming an essential component of a successful data strategy, and data volumes continue to rise rapidly. IT teams must ensure that they have a comprehensive data management platform in place: first, to be able to get data into a data lake, and second, to be able to catalogue all the data assets not only within the lake but across the organisation as a whole. Data quality must never be forgotten, and it must be built into the process before data reaches end users. The platform also needs to apply machine learning, to offer productivity gains and the ability to manage huge volumes of data without the need for slow human intervention.
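As a simple illustration of building quality into the process before data reaches end users, the sketch below validates a raw CSV against two hypothetical rules, a populated key column and well-formed email addresses, and signals that the file is fit to publish only when no issues are found. The column names and rules are assumptions for the example.

```python
import csv
import re
from pathlib import Path

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def quality_check(path: Path, key_column: str, email_column: str | None = None) -> list[str]:
    """Return a list of quality issues; an empty list means the file can be published."""
    issues = []
    with path.open(newline="") as f:
        for i, row in enumerate(csv.DictReader(f), start=2):  # header is line 1
            if not (row.get(key_column) or "").strip():
                issues.append(f"line {i}: missing value in key column '{key_column}'")
            if email_column and not EMAIL_RE.match(row.get(email_column) or ""):
                issues.append(f"line {i}: invalid email in '{email_column}'")
    return issues


if __name__ == "__main__":
    problems = quality_check(Path("exports/crm_leads.csv"),
                             key_column="lead_id", email_column="email")
    print("publish" if not problems else "\n".join(problems))
```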

Lastly, the value of a data lake, and indeed of the data itself, is realised when it is used to make decisions, create innovation and avoid failures. A great way of achieving this is to provide access to data resources through an easy-to-consume, business-friendly user interface in which users can self-serve their data needs. Such tools can offer unparalleled insight and create real competitive advantage. With the market burgeoning, companies need to act quickly or risk becoming lost in the approaching flood. NC