What’s polluting your data lake?


A data lake is a large repository of files and unstructured data collected from many untrusted sources, then stored and dispensed for business services. That makes it susceptible to malware pollution. As enterprises continue to produce, collect, and store more data, the potential for costly cyber risks grows.


Every time you send an email or text, you are producing data. Every business service your organization has deployed is generating and exchanging data with third-party partners and supply chain providers. Every new merger and acquisition (M&A) results in a large volume of data being transferred between two companies. Every IoT device or subscription is generating data that's collected and stored in data lakes. You get the point: Mass data production and collection are unavoidable. And, as a result, our data lakes are becoming overwhelmingly large and a ripe target for cybercriminals.

Digital transformations (a.k.a. cloud adoptions and data migrations) over the past couple of years have significantly increased cloud data storage. As enterprise data lakes and cloud storage environments expand, cybersecurity will become a greater challenge.

The impacts of malware pollution

The impact of malware pollution on a data lake can best be understood by looking at how real-life pollution affects physical lakes.

Water is fed into lakes from groundwater, streams, and various types of precipitation run-off. Similarly, a data lake collects data from a multitude of sources, such as internal applications, third-party and supply chain partners, and IoT devices. All this data constantly flows in and out of the data lake. It can move into a data warehouse or other cloud storage environments, or be extracted for further business insights or reference. The same process occurs in freshwater lakes, where water is extracted for irrigation and flows out into other streams.

External “pollution” that feeds into a lake (both physical and digital) can harm the existing ecosystem. When unknown malware enters a data lake, bad actors can gain access to the data stored in the lake, manipulate it or mine it to sell on the dark web. This data can include…

Source…