Mastering Secure, Cost-Effective Cloud Data Lakes

Editor’s note: Ori Nakar and Johnathan Azaria are speakers for ODSC East this April 23-25. Be sure to check out their talk, “Unlock Safety & Savings: Mastering a Secure, Cost-Effective Cloud Data Lake,” there!

Have you ever experienced a surge in your cloud data lake expenses? Is this surge indicating a malicious activity or a legitimate operation? Data lakes have become a cornerstone of the digital age, prized for their flexibility and cost-effectiveness. Yet, as they expand, they bring forth challenges in security, access control, cost management, and monitoring. The stakes are high: unauthorized access can lead to data breaches, while even legitimate users can inadvertently drive up costs.

In-person conference | May 13th-15th, 2025 | Boston, MA

Join us on May 13th-15th, 2025, for 3 days of immersive learning and networking with AI experts.

🔹 World-class AI experts

🔹 Cutting-edge workshops

🔹 Hands-on Training

🔹 Strategic Insights

🔹 Thought Leadership

🔹 And much more!

With growing usage comes far more complexity. Both the volume of data and the number of objects are increasing rapidly. A growing number of users, both human and application, perform constant operations on the data lake, and this high volume of operations makes access and cost control a hard, ongoing task. Monitoring is complex as well, since there are many access paths and all of them should be monitored.

Attackers can also take advantage of the many access options to the data lake. They can use the advanced functionality of object stores and query engines for reconnaissance and to efficiently traverse, locate, and track sensitive data.

Figure: Data lake access

Traditional monitoring methods often fall short. Tracking object store access can be overwhelming: a single query can generate thousands of log records. Monitoring at the query engine level, on the other hand, demands a unique solution for each engine, adding complexity.

We suggest a two-tiered approach to deal with these issues.


The first tier is to adopt best practices, such as:

  • Using roles instead of keys
  • Using unique credentials and not sharing them between users and services
  • Granting tailored access permissions instead of wide ones
  • Applying lifecycle management, query size limits, alerts, and other guardrails

The second tier is monitoring your data for anomalies. By logging the queries performed on your data lake you can detect and stop numerous cases of abuse and misuse. Let us explain how.

The data lake is often accessed via query by two major user types:

  • Humans – employees often query the data to get information or as part of development work.
  • Applications – a deployed application will access the data as part of its normal function.

The major difference between the two is the usage pattern. Human queries are sporadic in nature, but they are normally limited to working hours and areas of work. Humans who work in marketing don’t normally wake up at 3AM to start a new project on production tables.

Applications either work in a periodic schedule, such as ETLs, or work on demand per user request, but they are normally limited to a predefined number of tables and often have a clear usage baseline. We don’t expect applications to change their queries, access new tables, or suddenly switch from a periodic schedule to an irregular one.
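The distinction above can be turned into a simple baseline check. The sketch below is an illustrative toy, not the authors' actual detection system: it builds, from a hypothetical query log, the set of tables and active hours each principal normally uses, then flags queries that touch a new table or run outside the usual hours.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical query-log records: (principal, table, timestamp).
log = [
    ("etl-app", "sales", datetime(2025, 3, 1, 2, 0)),
    ("etl-app", "sales", datetime(2025, 3, 2, 2, 0)),
    ("alice", "marketing", datetime(2025, 3, 1, 10, 30)),
]

# Baseline: which tables and which hours each principal normally uses.
baseline_tables = defaultdict(set)
baseline_hours = defaultdict(set)
for principal, table, ts in log:
    baseline_tables[principal].add(table)
    baseline_hours[principal].add(ts.hour)

def is_anomalous(principal: str, table: str, ts: datetime) -> bool:
    """Flag a query that touches a new table or runs at an unusual hour."""
    new_table = table not in baseline_tables[principal]
    odd_hour = ts.hour not in baseline_hours[principal]
    return new_table or odd_hour

# A marketing employee hitting a production table at 3AM is flagged.
print(is_anomalous("alice", "prod_users", datetime(2025, 3, 5, 3, 0)))  # True
# The ETL app running its usual 2AM query on its usual table is not.
print(is_anomalous("etl-app", "sales", datetime(2025, 3, 3, 2, 0)))  # False
```

A production version would use a richer baseline (query text, data volume, schedules) and statistical thresholds rather than exact set membership, but the principle is the same: log the queries, learn the normal pattern per principal, and alert on deviations.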


You should manage and protect your data lake carefully:

  • Adopt best practices for permissions management
  • Monitor access and use the monitoring data to create actionable insights

Doing so will help you prevent data leakage and manage your costs by detecting data abuse and misuse early.

About the Authors:

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure. In the Threat Research group, Ori is responsible for the data infrastructure and is involved in analytics projects, machine learning, and innovation projects.

Johnathan Azaria is a tech lead in data science @ Imperva, specializing in AI-driven security algorithms and digital protection.


