Implementing GDPR Compliance in Data Lakes: A Case Study from Adevinta Spain

Abstract

Complying with the General Data Protection Regulation (GDPR) is a technical challenge, and a breach carries significant financial consequences. At Adevinta Spain, we have structured our data lake architecture so that it meets GDPR requirements while also reducing the costs of data protection and management.

In this post, we describe Adevinta Spain’s approach to data lake architecture and how it meets GDPR requirements cost-efficiently. We cover the main elements of the architecture: the history layer’s role in data defence, the market layer’s function in data refinement and organisation, and the strategies we use for security and GDPR compliance. We also discuss the challenges of GDPR adherence and how the architecture addresses them through streamlined user rights management, efficient infrastructure maintenance and optimised data storage and transfer. The common thread is balancing data privacy regulation with economic practicality in large-scale data management.

History Layer: The First Line in Data Defence

The history layer acts as the gateway for data in our data lake, typically receiving events through Apache Kafka. Here, a critical operation takes place: as data flows from Kafka, it is stored in Delta Lake format, with one key distinction. Personally identifiable information (PII) fields are identified and separated according to the data contract defined for each object.

These fields are channelled into a dedicated master table for analytically valuable PII, while non-PII data is stored in parallel tables.
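
The split described above can be sketched in a few lines of Python. This is a minimal illustration, not our production code: the DataContract class, its field names and the sample event are all hypothetical, and plain dictionaries stand in for rows that would be written to Delta tables.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class DataContract:
    """Hypothetical contract: names the key field and the PII fields of an object."""
    entity: str
    key_field: str
    pii_fields: frozenset[str]


def split_event(
    event: dict[str, Any], contract: DataContract
) -> tuple[dict[str, Any], dict[str, Any]]:
    """Return (pii_row, non_pii_row). Both rows keep the key so they can be re-joined."""
    key = event[contract.key_field]
    # The PII row records which source entity it came from (assumed column name).
    pii_row: dict[str, Any] = {contract.key_field: key, "source_entity": contract.entity}
    non_pii_row: dict[str, Any] = {contract.key_field: key}
    for name, value in event.items():
        if name == contract.key_field:
            continue
        if name in contract.pii_fields:
            pii_row[name] = value      # destined for the PII master table
        else:
            non_pii_row[name] = value  # destined for the parallel non-PII table
    return pii_row, non_pii_row


contract = DataContract("ad_reply", "user_id", frozenset({"email", "phone"}))
event = {
    "user_id": "u-1",
    "email": "a@example.com",
    "phone": "600000000",
    "ad_id": 42,
    "ts": "2024-01-01T00:00:00Z",
}
pii_row, non_pii_row = split_event(event, contract)
```

Keeping the key in both rows is the design choice that matters here: analytical queries can still join the two sides, while sensitive values live in one place only.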

This modular design facilitates centralised access and management of sensitive data, greatly simplifying the application of GDPR’s “right to be forgotten” rule.

This layer is where “Source Align Data Products” are placed after undergoing the data contract validation process.

Market Layer: Refinement and Organisation (big entities)

Moving to the market layer, events are grouped and refined by domains (big entities) rather than individual entities.

This layer provides the source of truth for Adevinta’s raw data in a domain semantic model format, ensuring integrated and standardised services for security, support, compliance, quality and observability. Essentially, it ensures that data is provided as data products.

The platform itself will provide these services in an integrated and maintained manner, allowing a smooth transition from dataset publication to data products without additional effort. It will also reduce redundancy and promote organisational and data modelling standardisation.

This step will be done without the need for code or data pipelines, relying on mappings and configurations. The codeless design allows a wider range of data owners or domain experts to include their data in the platform, meeting the required quality and service conditions for a data product.
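
To give a feel for what "mappings and configurations" can mean in practice, here is a small Python sketch. It assumes a mapping format of our own invention (the MAPPINGS dictionary, the entity names and the field names are hypothetical) and shows how two differently shaped source events could land in the same big entity purely through configuration.

```python
from typing import Any

# Hypothetical mapping configuration: source entity -> {domain field: source field}.
# A data owner would edit only this structure, not the function below.
MAPPINGS: dict[str, dict[str, str]] = {
    "ad_created": {"ad_id": "id", "user_id": "owner", "occurred_at": "created_at"},
    "ad_deleted": {"ad_id": "adId", "user_id": "userId", "occurred_at": "deleted_at"},
}


def to_domain_row(
    source_entity: str, event: dict[str, Any], domain: str = "ads"
) -> dict[str, Any]:
    """Project a source event onto the shared schema of a big entity."""
    mapping = MAPPINGS[source_entity]
    row: dict[str, Any] = {"domain": domain, "event_type": source_entity}
    for target_field, source_field in mapping.items():
        row[target_field] = event[source_field]
    return row


created = to_domain_row("ad_created", {"id": 7, "owner": "u-1", "created_at": "2024-01-01"})
deleted = to_domain_row("ad_deleted", {"adId": 7, "userId": "u-1", "deleted_at": "2024-02-01"})
```

Both rows end up with the same columns, so they can be stored in a single domain table and told apart by event_type.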

With the design of these big entities, we seek to achieve a balance where we capture the domain details required by producers and consumers, while avoiding an excessively large number of entities.

This domain approach results in a significant reduction in the number of tables and files to handle, optimising storage and simplifying data management. Additionally, incorporating PII fields into a unified column reinforces efficiency and security when handling sensitive information.
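
One way to picture the unified PII column is the sketch below. The column name "pii", the helper names and the sample row are assumptions made for illustration. The point is that once every sensitive field sits in one column, redacting a row means touching that column alone.

```python
from typing import Any

PII_COLUMN = "pii"  # hypothetical name for the unified PII column


def pack_pii(row: dict[str, Any], pii_fields: set[str]) -> dict[str, Any]:
    """Move all PII fields of a row into a single nested column."""
    packed = {k: v for k, v in row.items() if k not in pii_fields}
    packed[PII_COLUMN] = {k: row[k] for k in sorted(pii_fields) if k in row}
    return packed


def redact(row: dict[str, Any]) -> dict[str, Any]:
    """Return a copy of the row with the unified PII column emptied."""
    return {**row, PII_COLUMN: None}


row = {"ad_id": 42, "user_id": "u-1", "email": "a@example.com", "phone": "600000000"}
packed = pack_pii(row, {"email", "phone"})
redacted = redact(packed)
```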

Benefits of Fewer Tables and Files

The reduction in the number of tables and files is not just a matter of tidiness; it’s an economic strategy. Fewer tables imply lower storage costs and simpler data maintenance. Query performance optimisation is also simplified: techniques like z-ordering are more effective when applied to a concentrated set of tables. This not only improves query speed but also reduces computational load, resulting in direct savings in processing resources and, consequently, costs.
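
For readers unfamiliar with z-ordering, the following Python sketch illustrates the underlying idea. It is a conceptual toy, not how Delta Lake implements it (in Delta Lake the operation is requested with OPTIMIZE ... ZORDER BY). The 4x4 grid, the four-rows-per-file layout and the function names are assumptions chosen to keep the example small.

```python
from itertools import product


def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code: bit i of x goes to position 2*i, bit i of y to 2*i + 1."""
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


def files_to_scan(points, column: int, value: int, rows_per_file: int = 4) -> int:
    """Count files whose min/max range on `column` could contain `value`."""
    files = [points[i:i + rows_per_file] for i in range(0, len(points), rows_per_file)]
    count = 0
    for rows in files:
        values = [p[column] for p in rows]
        if min(values) <= value <= max(values):
            count += 1
    return count


grid = list(product(range(4), range(4)))  # 16 rows of (x, y)
by_x = sorted(grid)                       # plain sort on x, then y
by_z = sorted(grid, key=lambda p: interleave_bits(p[0], p[1]))

# Files touched when filtering on x == 0 and on y == 0, for each layout.
linear = (files_to_scan(by_x, 0, 0), files_to_scan(by_x, 1, 0))   # (1, 4)
zorder = (files_to_scan(by_z, 0, 0), files_to_scan(by_z, 1, 0))   # (2, 2)
```

With a plain sort, a filter on the second column has to open every file. With the z-ordered layout, a filter on either column skips half of them, which is why concentrating the effort on a few large tables pays off.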

Security and GDPR Compliance

Security and regulatory compliance are two cornerstones of our architecture. Access controls and continuous monitoring processes ensure privacy protection and GDPR compliance at every stage of data management.

Cost Efficiency and GDPR Challenges

The challenge of adhering to GDPR lies not only in establishing compliance procedures but also in doing so in an economically viable way. Our data lake architecture is designed with resource economy in mind, offering solutions to common challenges such as:

User Rights Requests Management: The right to be forgotten and data portability processes can be costly if not properly planned. By centralising PII fields in a master table, we streamline the search and deletion process, resulting in a considerable decrease in processing times and, therefore, a reduction in operational costs.
Infrastructure Maintenance: With the creation of “big entities,” we reduce data fragmentation. Fewer tables mean less maintenance, less complexity in metadata handling, and a simpler, more economical optimisation process. Z-ordering practices, for example, can be more effectively focused on a reduced set of dense tables, instead of being dispersed among thousands of them.
Data Storage and Transfer: In a data lake architecture with multiple entities, the number of files to manage can be overwhelming, increasing storage costs and latency in I/O operations. By grouping related events into single tables and maintaining a low volume of large tables, we optimise storage use and improve performance in read and write operations. This also reduces costs associated with data access and transfer between different services and platforms.
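
The user rights point above can be sketched as follows. This is a simplified illustration in which an in-memory list stands in for the PII master table; the function names, the user_id column and the sample rows are hypothetical. On a real Delta table, erasure would be a DELETE, and the underlying files are only physically removed once the table is vacuumed.

```python
import json
from typing import Any

Row = dict[str, Any]


def export_user(master: list[Row], user_id: str) -> str:
    """Data portability: return everything held about one user as JSON."""
    rows = [r for r in master if r["user_id"] == user_id]
    return json.dumps(rows, sort_keys=True)


def forget_user(master: list[Row], user_id: str) -> tuple[list[Row], int]:
    """Right to be forgotten: drop the user's rows and report how many were removed."""
    kept = [r for r in master if r["user_id"] != user_id]
    return kept, len(master) - len(kept)


pii_master: list[Row] = [
    {"user_id": "u-1", "source_entity": "ad_reply", "email": "a@example.com"},
    {"user_id": "u-2", "source_entity": "ad_reply", "email": "b@example.com"},
    {"user_id": "u-1", "source_entity": "signup", "phone": "600000000"},
]

exported = export_user(pii_master, "u-1")
pii_master, removed = forget_user(pii_master, "u-1")
```

Because all PII sits in one table, each request is a single filtered operation rather than a sweep across every table in the lake.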

Conclusion

Our data lake architecture reflects a balance between technological innovation and economic efficiency. The separation and centralisation of PII data in the history layer and the consolidation of events in the market layer provide a robust solution for large-scale data management. By implementing this architecture, Adevinta Spain not only aligns with data privacy regulations but also optimises its operations and maximises the value of its data assets. This architecture allows us to move forward confidently in a world where data is a critical asset and its management a pillar of corporate reputation and reliability.

Data Platform Team

Director: Marc Planaguma
Product Owner: Marta Díaz
Data Engineers: Gustavo Martín, Sergio Couto, Javier Carravilla, Enric Martínez, Christian Herrera
SREs: Joel LLacer, Ismael Arab, Jaime González, Roger Escuder
