The Principle of Least Data

Yesterday I wrote about the attack surface of data. In that post, I introduced a new principle for securing data: the principle of least data.

My initial definition was:

An organization must collect and store only the data needed to complete their task.

This differs from the principle of least privilege (give users/processes only the permissions needed to complete their task... and no more) and the principle of least knowledge (when developing software, objects should be loosely coupled).

While the principle of least data is closely related to least privilege, it speaks directly to the selection of the data you store, not the permissions to access it.

Data Design Decisions

A key point of failure for organizations is not evaluating the value of the data they capture during a workflow. Whether you're figuring out a trial sign-up process, event registration, new employee onboarding, or another activity, the question of what data to collect and store, and why, should be carefully considered.

Don't collect data because it might be useful at some point.

This opens the organization up to unnecessary risk. If the words "might", "possible", or "potential" are used in an argument supporting the collection of data, you're about to violate the principle of least data.

You should only collect and store data for a well-understood use. Data should be evaluated for its overall value to the organization and, just as importantly, the risk it can pose to the organization.

Unless the cost to acquire the data in the future is so ridiculously high that it's infeasible, you should always wait to collect and store the data until you have a concrete use for it.
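
One way to make this rule concrete is to have every stored field declare its purpose up front and to drop anything that has none. The sketch below is only an illustration under stated assumptions: the `FIELD_PURPOSES` registry, the `minimize` and `dropped_fields` helpers, and the sign-up field names are all hypothetical.

```python
# Hypothetical registry: each field a workflow may store, mapped to the
# concrete use that justifies it. A field with no entry is never stored.
FIELD_PURPOSES = {
    "email": "send the account activation link",
    "display_name": "show the user's name in the product",
}


def minimize(submitted: dict) -> dict:
    """Return only the submitted fields that have a documented purpose."""
    return {k: v for k, v in submitted.items() if k in FIELD_PURPOSES}


def dropped_fields(submitted: dict) -> list:
    """List the fields that would be discarded, so the team can review them."""
    return sorted(k for k in submitted if k not in FIELD_PURPOSES)


# Example trial sign-up submission (made-up values).
signup_form = {
    "email": "pat@example.com",
    "display_name": "Pat",
    "date_of_birth": "1990-01-01",  # "might be useful" is not a purpose
    "phone": "555-0100",
}

record = minimize(signup_form)
print(record)
print(dropped_fields(signup_form))
```

Adding a field then means adding a line to the registry, which forces someone to write down the use before the data is ever stored.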

The "Store It All" Trap

Thanks to Dr. Moore, the computational and storage costs for data are at an all-time low. At today's prices it costs about a dollar a day to store 1 TB of data. With the advances in statistical modelling, data analysis, and machine learning (collectively "big data"), it's no wonder we're storing more data than ever.

But there's also a security cost for that data. Everything you store poses some level of risk, and that risk is much harder to calculate.

Evaluating risk gets even harder when you try to calculate how data can combine to increase the risk to the business. A user's address might not seem particularly risky to store, but if it's combined with information about minors who live at that address (I'm looking at you, VTech), its risk level increases significantly.
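
A rough way to picture this is to score combinations of fields rather than fields in isolation. In the sketch below, the field names, the numeric scores, and the `RISKY_COMBINATIONS` table are invented purely for illustration; real values would have to come from your own risk review.

```python
# Illustrative per-field risk scores (1 = low, 5 = high). These numbers are
# assumptions made up for this example, not a standard scale.
FIELD_RISK = {
    "home_address": 2,
    "child_name": 3,
    "favourite_colour": 1,
}

# Hypothetical combinations that are riskier together than apart.
RISKY_COMBINATIONS = {
    frozenset({"home_address", "child_name"}): 5,
}


def record_risk(fields: set) -> int:
    """Score a record as the worst of its fields and its risky combinations.

    A field missing from FIELD_RISK raises KeyError, so unreviewed data
    cannot be scored silently.
    """
    score = max((FIELD_RISK[f] for f in fields), default=0)
    for combo, combo_score in RISKY_COMBINATIONS.items():
        if combo <= fields:  # every field in the combination is present
            score = max(score, combo_score)
    return score


address_only = record_risk({"home_address"})
address_and_child = record_risk({"home_address", "child_name"})
print(address_only, address_and_child)
```

The point of the sketch is only that the combined record scores higher than either field alone, which is the effect described above.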

Isolate Data By Risk

Even though the best defence is simply not to collect and store the data in the first place, there will always be situations where risky data needs to be collected. In these cases, you should isolate the data in separate data stores and apply the principle of least privilege to ensure that no user or process has unnecessary access.

Let's take a common example of billing data for an online store. Your profile on the store will contain various preferences, your email address, recently viewed items, and other information about your past shopping experiences.

While it's important for the store to keep a record of your billing address, the same processes (and users!) that access your profile must not have access to your billing address. Only the payment process should have access to that data. This is the principle of least privilege in action.

Taking that a step further, these data points should be stored in separate data stores in order to reduce the risk they pose to the business. Let's make this a secondary point of the principle of least data.

An organization must collect and store only the data needed to complete their task... and if you do have to collect & store it, make sure it's isolated by risk factor.

There may be more operational overhead, but it will be offset by a lower risk profile and (most likely) performance gains from simplifying the data schema (or document structure for the NoSQL crowd).

This method also provides a significant security upside, as it requires an attacker to breach additional defences to get the complete data set (a.k.a. don't put all your eggs in one basket).
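
The online store example can be sketched as two stores, each with its own grant list. This is a minimal in-memory illustration, not how any particular database enforces access; the `IsolatedStore` class, the `AccessDenied` exception, and the process names `storefront` and `payments` are all hypothetical.

```python
class AccessDenied(Exception):
    """Raised when a process touches a store it has no grant for."""


class IsolatedStore:
    """In-memory stand-in for a separate data store with its own grants."""

    def __init__(self, name: str, allowed_processes: set):
        self.name = name
        self._allowed = set(allowed_processes)
        self._rows = {}

    def _check(self, process: str) -> None:
        if process not in self._allowed:
            raise AccessDenied(f"{process} may not access {self.name}")

    def put(self, process: str, user_id: str, data: dict) -> None:
        self._check(process)
        self._rows[user_id] = dict(data)

    def get(self, process: str, user_id: str) -> dict:
        self._check(process)
        return dict(self._rows[user_id])


# Two stores, split by risk. Each process is granted only the store it needs.
profiles = IsolatedStore("profiles", {"storefront"})
billing = IsolatedStore("billing", {"payments"})

profiles.put("storefront", "u1", {"email": "pat@example.com"})
billing.put("payments", "u1", {"billing_address": "1 Example St"})

try:
    billing.get("storefront", "u1")  # the profile process asks for billing data
except AccessDenied as err:
    print(err)
```

In a real deployment the two stores would be separate databases or services with separate credentials, so compromising the profile process alone would not expose billing data.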

Finding A Balance

All organizations are going to have to store sensitive data (or sensitive in the aggregate) at some point. The key is to make sure that you only store what's truly needed to complete the task at hand.

The less data you collect and store, the less you have to worry.

For any new workflow, your team should apply:

  1. the principle of least data: making sure you're only collecting & storing what's needed
  2. the principle of least privilege: making sure only the required users/processes can access that data

Collecting and storing data is always going to be a judgement call. But it's critical to evaluate the upside and downside of that data. Blindly collecting more and more puts your organization at risk.