Mastering Secure, Cost-Effective Cloud Data Lakes

Editor’s note: Ori Nakar and Johnathan Azaria are speakers for ODSC East this April 23-25. Be sure to check out their talk, “Unlock Safety & Savings: Mastering a Secure, Cost-Effective Cloud Data Lake,” there!

Have you ever experienced a surge in your cloud data lake expenses? Is this surge indicating a malicious activity or a legitimate operation? Data lakes have become a cornerstone of the digital age, prized for their flexibility and cost-effectiveness. Yet, as they expand, they bring forth challenges in security, access control, cost management, and monitoring. The stakes are high: unauthorized access can lead to data breaches, while even legitimate users can inadvertently drive up costs.

In-person conference | May 13th-15th, 2025 | Boston, MA

Join us on May 13th-15th, 2025, for 3 days of immersive learning and networking with AI experts.

🔹 World-class AI experts

🔹 Cutting-edge workshops

🔹 Hands-on Training

🔹 Strategic Insights

🔹 Thought Leadership

🔹 And much more!

With growing usage comes far more complexity. Both the volume of data and the number of objects are increasing rapidly. A growing number of users, both human and application, perform constant operations on the data lake, and this high volume of operations makes access and cost control a hard, ongoing task. Monitoring is complex as well, since there are many access paths and all of them should be monitored.

Attackers can also take advantage of the many access options to the data lake. They can use the advanced functionality of object stores and query engines for reconnaissance and to efficiently traverse, locate, and track sensitive data.

Figure: Data lake access

Traditional monitoring methods often fall short. Tracking object store access can be overwhelming: a single query can generate thousands of log records. Monitoring at the query engine level, on the other hand, demands a unique solution for each engine, adding complexity.

We suggest a two-tiered approach to deal with these issues.


The first tier is to adopt best practices, such as:

  • Using roles instead of keys
  • Using unique credentials and not sharing them between users and services
  • Granting tailored access permissions instead of wide ones
  • Applying lifecycle management, query size limits, alerts, and other guardrails

The second tier is monitoring your data for anomalies. By logging the queries performed on your data lake you can detect and stop numerous cases of abuse and misuse. Let us explain how.

The data lake is often accessed via query by two major user types:

  • Humans – employees often query the data to get information or as part of development work.
  • Applications – a deployed application will access the data as part of its normal function.

The major difference between the two is the usage pattern. Human queries are sporadic in nature, but they are normally limited to working hours and areas of work. Humans who work in marketing don’t normally wake up at 3AM to start a new project on production tables.

Applications either work in a periodic schedule, such as ETLs, or work on demand per user request, but they are normally limited to a predefined number of tables and often have a clear usage baseline. We don’t expect applications to change their queries, access new tables, or suddenly switch from a periodic schedule to an irregular one.
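The distinction above can be turned into a simple baseline check. The sketch below is an illustrative toy, not the authors' actual detection system: it builds, from a hypothetical query log, the set of tables and active hours each principal normally uses, then flags queries that touch a new table or run outside the usual hours.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical query-log records: (principal, table, timestamp).
log = [
    ("etl-app", "sales", datetime(2025, 3, 1, 2, 0)),
    ("etl-app", "sales", datetime(2025, 3, 2, 2, 0)),
    ("alice", "marketing", datetime(2025, 3, 1, 10, 30)),
]

# Baseline: which tables and which hours each principal normally uses.
baseline_tables = defaultdict(set)
baseline_hours = defaultdict(set)
for principal, table, ts in log:
    baseline_tables[principal].add(table)
    baseline_hours[principal].add(ts.hour)

def is_anomalous(principal: str, table: str, ts: datetime) -> bool:
    """Flag a query that touches a new table or runs at an unusual hour."""
    new_table = table not in baseline_tables[principal]
    odd_hour = ts.hour not in baseline_hours[principal]
    return new_table or odd_hour

# A marketing employee hitting a production table at 3AM is flagged.
print(is_anomalous("alice", "prod_users", datetime(2025, 3, 5, 3, 0)))  # True
# The ETL app running its usual 2AM query on its usual table is not.
print(is_anomalous("etl-app", "sales", datetime(2025, 3, 3, 2, 0)))  # False
```

A production version would use a richer baseline (query text, data volume, schedules) and statistical thresholds rather than exact set membership, but the principle is the same: log the queries, learn the normal pattern per principal, and alert on deviations.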


You should manage and protect your data lake carefully:

  • Adopt best practices for permissions management
  • Monitor access and use the monitoring data to create actionable insights

Doing so will help you prevent data leakage and manage your costs by detecting data abuse and misuse early.

About the Authors:

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure. In the Threat Research group, Ori is responsible for the data infrastructure and is involved in analytics projects, machine learning, and innovation projects.

Johnathan Azaria is a tech lead in data science @ Imperva, specializing in AI-driven security algorithms and digital protection.


