- Think about security before you start your big data project. You don't lock your doors after you've already been robbed, and you shouldn't wait for a data breach incident before you secure your data. Your IT security team and others involved in your big data project should have a serious data security discussion before installing and feeding data into your Hadoop cluster.
- Consider what data may get stored. If you're planning to use Hadoop to store and run analytics against data subject to regulation, you will likely need to comply with specific security requirements. Even if the data you're storing doesn't fall under regulatory jurisdiction, assess your risks, including loss of goodwill and potential loss of revenue, if data like personally identifiable information (PII) is lost.
- Centralise accountability. Right now, your data probably resides in diverse organizational silos and data sets. Centralising the accountability for data security ensures consistent policy enforcement and access control across these silos.
- Encrypt data both at rest and in motion. Add transparent data encryption at the file layer. SSL encryption can protect big data as it moves between nodes and applications. "File encryption addresses two attacker methods for circumventing normal application security controls," says Adrian Lane, analyst and CTO of security research and advisory firm Securosis. "Encryption protects in case malicious users or administrators gain access to data nodes and directly inspect files, and it also renders stolen files or disk images unreadable. It is transparent to both Hadoop and calling applications and scales out as the cluster grows. This is a cost-effective way to address several data security threats."
- Separate your keys and your encrypted data. Storing your encryption keys on the same server as your encrypted data is similar to locking your front door and then leaving the keys dangling from the lock. A key management system allows you to store your encryption keys safely and separately from the data you're trying to protect.
- Use the Kerberos network authentication protocol. You need to be able to govern which people and processes can access data stored within Hadoop. "This is an effective method for keeping rogue nodes and applications off your cluster," Lane says. "And it can help protect web console access, making administrative functions harder to compromise. We know Kerberos is a pain to set up, and (re-)validation of new nodes and applications takes work. But without bi-directional trust establishment, it is too easy to fool Hadoop into letting malicious applications into the cluster, or into accepting the introduction of malicious nodes - which can then add, alter or extract data. Kerberos is one of the most effective security controls at your disposal, and it's built into the Hadoop infrastructure, so use it."
- Use secure automation. You're dealing with a multi-node environment, so deployment consistency can be difficult to ensure. Automation tools like Chef and Puppet can help you stay on top of patching, application configuration, updating the Hadoop stack, collecting trusted machine images, certificates and platform discrepancies. "Building the scripts takes some time up front but pays for itself in reduced management time later, and additionally ensures that each node comes up with baseline security in place," Lane says.
- Add logging to your cluster. "Big data is a natural fit for collecting and managing log data," Lane says. "Many web companies started with big data specifically to manage log files. Why not add logging onto your existing cluster? It gives you a place to look when something fails, or if someone thinks perhaps you've been hacked. Without an event trace you are blind. Logging MR requests and other cluster activity is easy to do and increases storage and processing demands by a small fraction, but the data is indispensable when you need it."
- Implement secure communication between nodes, and between nodes and applications. To do this, you'll need an SSL/TLS implementation that protects all network communications rather than just a subset. Some Hadoop providers, like Cloudera, already do this, as do many cloud providers. If your setup doesn't have this capability, you'll need to integrate the services into your application stack.
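
Several of the tips above (Kerberos, encryption in motion, and a consistent security baseline on every node) come down to configuration that an automated check can verify. Here is a minimal sketch in Python of such a check, not a definitive implementation. The property names are standard Hadoop settings, but the baseline values in `REQUIRED_SETTINGS`, the helper names `parse_hadoop_config` and `find_violations`, and the `SAMPLE_CONFIG` document are assumptions made for illustration. In a real cluster `dfs.encrypt.data.transfer` normally lives in `hdfs-site.xml` and the others in `core-site.xml`; the sketch treats them as one merged set for brevity.

```python
import xml.etree.ElementTree as ET

# Baseline assumed for this sketch; adjust it to your own security policy.
REQUIRED_SETTINGS = {
    "hadoop.security.authentication": "kerberos",  # Kerberos instead of "simple"
    "hadoop.security.authorization": "true",       # enforce service-level authorization
    "hadoop.rpc.protection": "privacy",            # encrypt RPC traffic between nodes
    "dfs.encrypt.data.transfer": "true",           # encrypt HDFS block transfers
}


def parse_hadoop_config(xml_text):
    """Return {name: value} for every <property> in a Hadoop *-site.xml document."""
    root = ET.fromstring(xml_text)
    settings = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            settings[name.strip()] = (prop.findtext("value") or "").strip()
    return settings


def find_violations(settings, required=REQUIRED_SETTINGS):
    """List (name, expected, actual) for each setting that misses the baseline."""
    violations = []
    for name, expected in required.items():
        actual = settings.get(name)
        if actual is None or actual.lower() != expected:
            violations.append((name, expected, actual))
    return violations


# Hypothetical node configuration, used only to exercise the check.
SAMPLE_CONFIG = """<configuration>
  <property><name>hadoop.security.authentication</name><value>kerberos</value></property>
  <property><name>hadoop.security.authorization</name><value>true</value></property>
  <property><name>hadoop.rpc.protection</name><value>authentication</value></property>
</configuration>"""

if __name__ == "__main__":
    for name, expected, actual in find_violations(parse_hadoop_config(SAMPLE_CONFIG)):
        print(f"{name}: expected {expected!r}, found {actual!r}")
```

A check like this could run from whatever automation tool you already use, so that a node failing the baseline is flagged before it joins the cluster. It only inspects configuration values; it does not prove that Kerberos or TLS is actually working end to end.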
Monday, 17 December 2012
With Big Data, Don't Forget Compliance and Controls
9 tips for securing Big Data