Creating a Safe Place to Fish

Bojan Zubovic
Sep 24, 2018

The basics of Data Security and Governance 

We have been in the era of data lakes for a few years now. The term has been thrown around by numerous industries as a solution to all of our data problems - a place to store all of our data. But how do we prevent failures and make it work? What is the way forward?

Everyone usually has great intentions, and having all your data sitting in one central place is a noble goal. But are you at risk of building a data swamp, or have you already built one?

We are passionate about data governance and feel it is a key component of the data space, but it can be difficult to understand at times and has been described by many as downright boring. We’re a fun bunch at Vibrato and feel that a good analogy can help explain complex concepts. When talking about data lakes, it was hard not to think of a lake as a habitat for fish (data) to live in, and we found several concepts that fit the analogy. So if you're ready for a fishing trip, bait your lines, because here we go...

Vibrato datascience - shutterstock

What species of fish do we have?

The difference between a data lake and a data swamp is an understanding of the fish species in our habitat. Think structured (CSV, XLS, Avro, Parquet), semi-structured (JSON, XML) and unstructured (images, videos, documents) data.

Knowing the species is great, but we want to drill down further - how do we know what family a fish comes from, or more precisely, what information the data contains?

It's a classification problem, and it really comes down to the data being tagged or, to be more specific, having metadata attached to it so that we can understand what information it contains.

We still have a problem: we have thousands of fish and want to know whether we are at the right spot in the lake to catch the right one. We need a data catalog, which indexes our metadata and lets us search it.
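To make the idea of tagging and cataloguing a little more concrete, here is a minimal in-memory sketch. The `DataCatalog` class, its methods and the file paths are hypothetical names invented for illustration; a real catalog would persist its index and handle far richer metadata.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CatalogEntry:
    path: str            # where the data set lives in the lake
    data_format: str     # e.g. "csv", "json", "parquet"
    tags: frozenset      # metadata describing what the data contains


class DataCatalog:
    """Hypothetical catalog: stores entries and indexes them by tag."""

    def __init__(self):
        self._entries = {}
        self._index = defaultdict(set)   # tag -> set of paths

    def register(self, path, data_format, tags):
        # Drop stale index entries if this path was registered before.
        old = self._entries.get(path)
        if old is not None:
            for tag in old.tags:
                self._index[tag].discard(path)
        entry = CatalogEntry(path, data_format,
                             frozenset(t.lower() for t in tags))
        self._entries[path] = entry
        for tag in entry.tags:
            self._index[tag].add(path)
        return entry

    def search(self, *tags):
        """Return the paths that carry ALL of the given tags."""
        wanted = [t.lower() for t in tags]
        if not wanted:
            return []
        matches = set.intersection(
            *(self._index.get(t, set()) for t in wanted))
        return sorted(matches)


catalog = DataCatalog()
catalog.register("raw/customers.csv", "csv", ["customer", "PII"])
catalog.register("raw/clicks.json", "json", ["web", "customer"])
print(catalog.search("customer", "pii"))
```

Searching by several tags at once narrows the result to data sets that carry all of them, which is the "right spot in the lake" question in miniature.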

Where are the fish coming from?

When we talk of data lineage, we can envision a multitude of streams converging into a lake. Streams, or to be precise, data pipelines, can be permanent or temporary, and we want to track which fish arrive from which stream and when, so that we can better understand, and if needed audit, the habitat.

While we can allow many different species of fish to live in our lake, we also want to understand the health of our fish, or more precisely the quality of our data. A data lake does not only store raw data; it should also be able to store temporary, structured and refined data. Perhaps a stream moves such data on to another data lake. These are all day-to-day complexities that need to be solved. We want to monitor the quality of our data, and to have the controls and the right people - the data stewards - to ensure the data is fit for purpose.

As the fish populations in the lake grow, some fish - some data - are only required for a certain time. Having a data retention capability and strategy is another important step in preventing a lake from becoming a swamp.
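As a rough illustration of lineage tracking, the sketch below records which stream delivered which data set and when, and then answers simple audit questions. `LineageLog`, `LineageEvent` and the stream names are assumptions made up for this example, not part of any particular tool.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LineageEvent:
    dataset: str          # which fish
    stream: str           # which pipeline it arrived through
    arrived_at: datetime  # when it arrived


class LineageLog:
    """Hypothetical append-only log of arrivals into the lake."""

    def __init__(self):
        self._events = []

    def record(self, dataset, stream, arrived_at=None):
        if arrived_at is None:
            arrived_at = datetime.now(timezone.utc)
        event = LineageEvent(dataset, stream, arrived_at)
        self._events.append(event)
        return event

    def streams_for(self, dataset):
        """All streams that have ever fed the given data set."""
        return sorted({e.stream for e in self._events
                       if e.dataset == dataset})

    def arrivals_between(self, start, end):
        """Events with start <= arrived_at < end, in recorded order."""
        return [e for e in self._events if start <= e.arrived_at < end]


log = LineageLog()
log.record("customers", "crm_export")
log.record("customers", "web_signup")
print(log.streams_for("customers"))
```

Keeping the log append-only is a deliberate choice in this sketch: an audit trail is only useful if past entries are never rewritten.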
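One simple way to monitor the health of the fish is a completeness check: count how often required fields are missing and compare that against a tolerance. The function names and the 5% threshold below are assumptions chosen for illustration; real quality rules would be agreed with the data stewards.

```python
def missing_counts(rows, required_fields):
    """Count rows where each required field is absent or blank."""
    counts = {field: 0 for field in required_fields}
    for row in rows:
        for field in required_fields:
            value = row.get(field)
            if value is None or str(value).strip() == "":
                counts[field] += 1
    return counts


def is_fit_for_purpose(rows, required_fields, max_missing_ratio=0.05):
    """True if no required field is missing in more than the allowed share of rows."""
    if not rows:
        return False  # an empty data set cannot be assessed as healthy
    counts = missing_counts(rows, required_fields)
    return all(count / len(rows) <= max_missing_ratio
               for count in counts.values())


rows = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": ""},
    {"id": 3},
]
print(missing_counts(rows, ["id", "email"]))
print(is_fit_for_purpose(rows, ["id", "email"]))
```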
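A retention strategy can start as nothing more than a rule per data set and a regular check against it. The sketch below assumes each data set has a creation time and a retention period; the paths and periods are invented for the example.

```python
from datetime import datetime, timedelta, timezone


def expired_paths(datasets, now):
    """Return paths whose retention period has elapsed as of `now`.

    `datasets` is an iterable of (path, created_at, retention) tuples.
    """
    return sorted(path for path, created_at, retention in datasets
                  if created_at + retention <= now)


created = datetime(2018, 1, 1, tzinfo=timezone.utc)
datasets = [
    ("tmp/session_export.csv", created, timedelta(days=30)),
    ("refined/sales.parquet", created, timedelta(days=3650)),
]
now = datetime(2018, 9, 24, tzinfo=timezone.utc)
print(expired_paths(datasets, now))
```

The function only reports candidates for removal. Whether expired data is deleted, archived or reviewed first is a policy decision that sits outside this sketch.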

Vibrato datascience - shutterstock

Who do we allow to fish?

Self-service analytics is critical to a successful business and a key aspect of becoming a data-driven organisation, but it does not come without concerns. Do we let everyone fish or trawl in the lake? Do we carelessly let everyone do whatever they want with the data?

It's about giving the right people the right access to the right data for the right amount of time. Cybercrime is on the rise, and with attackers using social networks and Machine Learning algorithms to mount increasingly sophisticated attacks, it has become critical to ensure your data is safe. Understanding the 5Ws - the who, what, when, where and why - is critical in creating a safe place to fish.

Requiring fishing licenses issued by a fishing ranger, while annoying to the honest people looking to fish, protects the habitat from degradation. In other words, it protects your business from data leaks and numerous other risks that could cause business disruption, fines and a tarnished brand.
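"The right access for the right amount of time" can be sketched as a grant that names who, what, why and for how long, plus a check that is made at the moment of access. `FishingLicense` and `may_fish` are hypothetical names for this example, and the sketch leaves out the "where" of the 5Ws for brevity.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class FishingLicense:
    user: str              # who
    dataset: str           # what
    reason: str            # why
    valid_from: datetime   # when access starts
    valid_until: datetime  # when access ends (exclusive)


def may_fish(licenses, user, dataset, at):
    """True if some license covers this user and data set at time `at`."""
    return any(
        lic.user == user
        and lic.dataset == dataset
        and lic.valid_from <= at < lic.valid_until
        for lic in licenses
    )


licenses = [
    FishingLicense(
        user="analyst_1",
        dataset="refined/sales.parquet",
        reason="quarterly revenue report",
        valid_from=datetime(2018, 9, 1, tzinfo=timezone.utc),
        valid_until=datetime(2018, 10, 1, tzinfo=timezone.utc),
    ),
]
print(may_fish(licenses, "analyst_1", "refined/sales.parquet",
               datetime(2018, 9, 24, tzinfo=timezone.utc)))
```

Because access is denied unless a matching license exists, an expired or missing grant fails closed rather than open.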

Why do we need to protect our fish?

Among the most critical data security concerns, data breaches, or more specifically customer data leaks, are considered to be amongst the worst possible scenarios for any business. No one wants their personal details exposed to the public. It gets worse though - personal details can end up in the hands of criminals, increasing the risk of identity theft. With this in mind, we have a moral duty to keep our customer data protected and away from illegal trawlers.

Historically, there have been cases where both public and commercialised data sets were not properly anonymised, causing leakage of personal and sensitive information. The traditional approach of masking data with a hashing algorithm such as MD5 or SHA-256 is not enough: GPU farms can hash entire dictionaries of candidate values (or use pre-generated ones) and match them against the masked data, leaving an opportunity for the anonymisation to be reversed.

Today, following best practice and compliance regulations is essential. GDPR is in effect in the EU, and we will potentially see similar laws implemented throughout the world. PII (Personally Identifiable Information) must be encrypted at rest and in transit, and anonymised before being used for analytical purposes. The approach must be thoroughly considered, so that the data anonymisation process is irreversible.
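To see why a plain hash is weak masking, consider values drawn from a small, guessable space. The sketch below assumes, purely for illustration, that the masked field is a 4-digit code; anyone can hash all 10,000 candidates and look the masked values up. The function names are hypothetical.

```python
import hashlib


def mask(value: str) -> str:
    """Mask a value with plain, unsalted SHA-256."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def reverse_masked(masked_values, candidates):
    """Recover original values by hashing every candidate (a dictionary attack)."""
    lookup = {mask(candidate): candidate for candidate in candidates}
    return {m: lookup[m] for m in masked_values if m in lookup}


# Assumed scenario: the "anonymised" column holds hashed 4-digit codes.
masked_column = [mask("0420"), mask("9731")]
all_codes = (f"{n:04d}" for n in range(10_000))

recovered = reverse_masked(masked_column, all_codes)
print(sorted(recovered.values()))
```

The same approach scales to phone numbers, dates of birth and other fields with a limited set of possible values, which is exactly the kind of data that tends to be personal.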
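One commonly used step beyond plain hashing is keyed hashing, where a secret key is mixed into the hash so that an outsider cannot simply hash a dictionary of guesses. The sketch below uses HMAC-SHA-256 from the Python standard library. Treat it as an assumption-laden illustration rather than a complete anonymisation scheme: anyone who holds the key can still run a dictionary of guesses through it, so the key must be kept secret or destroyed, and the `pseudonymise` name and sample email are invented for this example.

```python
import hashlib
import hmac
import secrets


def pseudonymise(value: str, key: bytes) -> str:
    """Replace a value with a keyed HMAC-SHA-256 token."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


# A random 256-bit key; in practice it would live in a secrets manager,
# never alongside the data it protects.
key = secrets.token_bytes(32)

token = pseudonymise("jane@example.com", key)
print(token)
```

With the same key, the same input always yields the same token, so records can still be joined for analytics without exposing the original value.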


No matter where you are in your data journey, whether you're focused on protecting your data assets and minimising risk exposure, or you want to increase efforts in strategic offensive activities, the prerequisites still apply. If you have aspirations for data monetisation, creating data products or applying advanced analytical methods such as AI and Machine Learning to gain competitive advantage, data governance must be considered. It is critical in future-proofing your business and providing a safe place to fish.

