Privacy Engineering

Learning from big tech: Why Meta invested heavily in privacy code scanning

Ben Werner
October 15, 2024

In August 2024, Meta published an extensive blog post detailing why and how it has invested years of engineering effort in building an internal privacy code scanning solution.

As the world’s largest social media platform and one of the largest ad tech platforms, Meta faces perhaps more privacy challenges than any other company as it tries to preserve privacy for billions of users.

Meta built a privacy code scanning solution to solve their biggest privacy challenge: purpose limitation. Limiting personal data processing to only explicitly approved purposes is central to all data privacy. Any company processing personal data that must comply with regulations like CPRA or GDPR faces this same challenge. 

Like privacy teams at other companies, Meta’s privacy team struggled with manual processes and point solutions that both slowed teams down and limited privacy risk mitigation, especially as constant software updates changed data flows.

By developing a privacy code scanning solution, Meta was able to move away from manual code audits and prevent privacy risks in real time and at scale. Let’s take a closer look at how Meta achieved this and what other companies can do with fewer resources.

Meta’s purpose limitation challenges 

Evolving code changes data flows

To limit data processing by purpose, Meta has to monitor and control data as it flows across systems and services. The diagram below shows how a user request initiates data flows through Meta’s internal software systems. 

[Figure: Sample Meta data flow]

In the diagram, code either processes personal data through some activity, such as running it through an ad personalization model, or sends it to a database.

To limit data processing by purpose, Meta previously relied on “point checking” controls made up of periodic code audits and access control mechanisms for datasets. Meta found that these controls did not scale and could not keep up with changing privacy and product requirements.  

Manual code audits don’t scale

To check whether code processed certain personal data for certain purposes, Meta ran periodic code audits using simple if statements.

These code audits could detect privacy risks at critical points in the codebase and help build data lineage graphs that map data flows, but they required extensive manual effort and still fell behind continuous code changes.
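To make the limitation concrete, here is a minimal Python sketch of what a hand-written point check might look like. Meta's post does not publish its audit code, so the function names, data categories, and purposes below are all hypothetical. The sketch only illustrates that each check guards a single call site and has to be placed and maintained by hand.

```python
# Hypothetical illustration of a "point check". The categories, purposes,
# and function names are assumptions for this sketch, not Meta's code.
ALLOWED_PURPOSES = {
    "precise_location": {"maps", "safety"},
    "email": {"account_management"},
}


def point_check(data_category: str, purpose: str) -> bool:
    """Return True if this purpose is approved for this data category."""
    if data_category not in ALLOWED_PURPOSES:
        # Unknown categories are rejected rather than silently allowed.
        return False
    if purpose in ALLOWED_PURPOSES[data_category]:
        return True
    return False


def personalize_ads(user_record: dict) -> str:
    # The check protects only this one call site. Every other place that
    # touches precise_location needs its own hand-written check.
    if not point_check("precise_location", "ads_personalization"):
        return "skipped: purpose not approved"
    return f"personalized for {user_record['precise_location']}"
```

Any new code path that reads the same data without calling the check goes unnoticed until the next audit, which is why this approach struggles to keep up with continuous code changes.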

Meta needed a programmatic solution that could detect privacy risks and map data flows in real-time while also eliminating costly human audits.   

Separating data to control access is expensive

Meta’s other “point checking” control was limiting access to datasets based on purpose. This approach worked when datasets could be physically separated by purpose, but it did not scale.

Data separation required costly maintenance and created significant processing limitations, especially when data was processed for different purposes by shared code. Additionally, when Meta started to address more complex purpose limitation requirements that crossed dozens of systems, these data access controls did not scale.

Privacy and product requirements change continually

On top of the limitations of “point checking” controls, ever-evolving requirements added even more strain and risk to Meta’s systems. As privacy regulation has increased and privacy risk mitigation has shifted further left in the software development lifecycle, adapting to changing privacy and product requirements has become increasingly complex for Meta.

When one dataset could be subject to many privacy requirements and one processing activity could affect many data flows downstream, Meta needed to focus on the source of data processing and privacy risk: the code, not the data. 

Only by building a solution that monitors how all code processes personal data could Meta implement complete and real-time privacy governance.

Meta’s privacy code scanning solution

As part of its Privacy Aware Infrastructure initiative, Meta built a privacy code scanning solution it calls Policy Zones.

Rather than relying on point checking, Meta found that monitoring data flows at the code level offered a more durable and sustainable approach to controlling how data is accessed and processed in real time.

Identify and classify personal data in the code 

By looking at data processing points in the code such as web requests, event logs, or database entries, Meta is able to build a complete and up-to-date inventory of personal data elements that need to be purpose limited. 

Machine-learning models then classify each data element by type and sensitivity to facilitate risk detection and policy enforcement.
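The shape of this step can be sketched in a few lines of Python. Note the substitution: Meta describes using machine-learning models, while the sketch below uses simple keyword rules on field names as a stand-in, purely to show what an inventory of classified data elements might look like. The rule patterns, type labels, and sensitivity levels are assumptions for illustration.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Classification:
    data_type: str
    sensitivity: str  # "high", "medium", or "none" in this sketch


# Hypothetical keyword rules standing in for a trained classifier.
RULES = [
    (re.compile(r"ssn|passport|health", re.I), "government_or_health_id", "high"),
    (re.compile(r"email|phone", re.I), "contact_info", "medium"),
    (re.compile(r"latitude|longitude|location", re.I), "location", "high"),
]


def classify_field(field_name: str) -> Classification:
    """Classify a field found at a data processing point by its name."""
    for pattern, data_type, sensitivity in RULES:
        if pattern.search(field_name):
            return Classification(data_type, sensitivity)
    return Classification("not_personal", "none")


def build_inventory(field_names):
    """Map each field that looks like personal data to its classification."""
    inventory = {}
    for name in field_names:
        result = classify_field(name)
        if result.data_type != "not_personal":
            inventory[name] = result
    return inventory
```

Calling `build_inventory(["user_email", "home_latitude", "page_load_ms"])` keeps the first two fields and drops the third. Name-based rules like these produce false positives and misses on real codebases, which is one reason a trained model is a better fit at scale.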

Map real-time data flows by monitoring entire codebase 

Because code controls how data is collected, used, shared, and stored, Meta is able to map the entire flow of each data element by looking at code. 

Once the privacy requirement owner has full data flow visibility, they can determine where the risk lies and what privacy checks need to be created in the Policy Zones solution.  
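One way to picture this kind of visibility is as a graph of data flows that can be walked from any starting point. The Python sketch below assumes the flows have already been extracted from code into a simple adjacency map; the node names and the `downstream` helper are hypothetical and say nothing about how Policy Zones represents flows internally.

```python
from collections import deque

# Hypothetical data flow edges extracted from code: source -> destinations.
FLOWS = {
    "web_request.email": ["signup_service"],
    "signup_service": ["accounts_db", "event_log"],
    "event_log": ["ads_model"],
    "accounts_db": [],
    "ads_model": [],
}


def downstream(flows, start):
    """Return every node reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in flows.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order
```

Here `downstream(FLOWS, "web_request.email")` returns `["signup_service", "accounts_db", "event_log", "ads_model"]`, showing that an email address collected at signup eventually reaches the ads model through the event log. That is the kind of path a requirement owner would want to review.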

Create privacy checks to enforce requirements

Policy Zones provides privacy requirement owners with the visibility to determine the necessary privacy checks and the governance to enforce the requirements.  

For example, violations can be triggered when select categories of data are sent anywhere except select destinations that only use the data for approved purposes. 
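A check like the one just described can be sketched as a rule over observed flows. The Python below is a simplified illustration, not Policy Zones itself: the `Flow` record, the `APPROVED_DESTINATIONS` policy, and the service names are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Flow:
    data_category: str
    source: str
    destination: str


# Hypothetical policy: each restricted category lists its approved destinations.
APPROVED_DESTINATIONS = {
    "precise_location": {"maps_service", "safety_service"},
}


def find_violations(flows, policy=APPROVED_DESTINATIONS):
    """Return flows that send a restricted category to an unapproved destination."""
    return [
        f for f in flows
        if f.data_category in policy
        and f.destination not in policy[f.data_category]
    ]
```

With this policy, a flow of `precise_location` to `maps_service` passes, the same category sent to `ads_service` is flagged, and categories the policy does not mention are ignored. Because the policy is plain data, a requirement owner could add a category or destination without touching the checking logic.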

Implementing checks based on code enables enforcement that is programmatic, comprehensive, and adaptable. As code changes and data flows change, violations can be flagged automatically. To handle new data flows and privacy requirements, Meta built tools for privacy requirement owners to adjust and add privacy checks as needed. 

Detect and communicate data flow violations

Once privacy checks are set, data flow violations are automatically communicated to the appropriate stakeholders. Because violations are detected at the code level, stakeholders can immediately see what application or service is causing the violation and what code needs to be addressed. Therefore, the privacy owner can easily decide what remediation action to take and who needs to be involved. 
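As a rough sketch of the routing step, the Python below groups violations by the team that owns the offending service and includes the file and line so the owner can see what code to fix. The ownership map, email addresses, and record fields are hypothetical; Meta's post does not describe its notification tooling at this level of detail.

```python
from collections import defaultdict

# Hypothetical ownership map; a real system would read this from service metadata.
SERVICE_OWNERS = {
    "ads_service": "ads-eng@example.com",
    "signup_service": "growth-eng@example.com",
}
FALLBACK_OWNER = "privacy-team@example.com"


def route_violations(violations):
    """Group human-readable violation messages by the owning team's address."""
    by_owner = defaultdict(list)
    for v in violations:
        # Services with no known owner fall back to the privacy team.
        owner = SERVICE_OWNERS.get(v["service"], FALLBACK_OWNER)
        message = (
            f"{v['service']} ({v['file']}:{v['line']}) sends "
            f"{v['data_category']} to {v['destination']}"
        )
        by_owner[owner].append(message)
    return dict(by_owner)
```

Each violation is a plain dictionary with `service`, `file`, `line`, `data_category`, and `destination` keys, so the output of a flow check can be passed straight in.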

Prevent future privacy risks at the code level

Policy Zones continuously monitors data flows processed by code and enforces policies to prevent new data flow violations. Meta can now enforce purpose limitation policies at scale even as code, data flows, and requirements continually change. 

Key takeaways 

Privacy code scanning preserves privacy at scale for Meta

Investing in privacy code scanning has enabled Meta to implement comprehensive privacy governance and eliminate costly manual processes. By moving its focus from data assets to “code assets”, Meta built a solution that can monitor data flows in real time and prevent risks before they even reach databases.

Meta took years to build and implement their own solution

Meta achieved a great deal with privacy code scanning, but building it in-house took considerable time and resources.

Meta’s blog post states that implementation “was a complex, lengthy, and challenging process”. For example, Meta had to develop libraries that could scan all necessary programming languages, such as Hack, C++, and Python, and accurately identify personal data flows.

Meta successfully overcame these software development challenges because it has more technical resources and privacy risk than almost any other company. For the 99.9% of companies that don’t have the resources Meta has, building an in-house privacy code scanning solution is likely cost-prohibitive.

Implement Privado privacy code scanning in a matter of weeks 

Compared to building an in-house solution, Privado enables its customers to implement privacy code scanning in a fraction of the time for a fraction of the cost. 

Privado began building the first privacy code scanning solution in 2020 and has been solely focused on improving it since. Privado has developed the leading privacy code scanning solution by applying learnings from successful implementations for customers across industries and geographies.  

Today, Privado customers can fully implement privacy code scanning in just three weeks. 

Within a few days, Privado can build a data map that typically takes companies 6-12 months to build. Even when companies spend 6-12 months building data maps by scanning data assets and sending questionnaires to developers, the resulting data maps are still significantly incomplete and quickly become out of date.

After three weeks of onboarding, Privado customers have a real-time view of personal data flows and a prioritized list of privacy risks based on out-of-the-box compliance checks for CPRA, GDPR, and others. Learn more about each of our key capabilities below.

  • Dynamic Data Maps: Build comprehensive and real-time data maps without manual assessments
  • SDK Governance: Continuously scan mobile apps to detect new SDKs and ensure SDKs only process approved data elements with proper consent 
  • Auto-Risk Discovery: Proactively identify risks during software development and alert engineering teams before the changes go live
  • Smart Assessments: Automatically update RoPAs and privacy assessments each time code is updated
  • Consent Compliance: Simulate website and mobile app user behavior to test that CMPs properly display banners and limit data flows according to user consent
  • Developer Tool Integrations: Enable developers to prevent risks with automated code scans that deliver privacy guidance as developers code

Ben leads product marketing at Privado
