Haviv Ohayon

The Data Classification Challenge

The current crop of data classification solutions has failed the industry. Learn why and what you can do about it.

Image Credits: Suridata

Keep it safe

We constantly hear about sensitive information and the need to keep it hidden and safe. It seems logical that organizations need to secure their clients’ data, and strict regulations have made it perfectly clear that any leak of sensitive information will prompt fines and restrictions. Advances in regulation and standardization in these areas have produced a myriad of solutions for these problems, but most organizations look at only one side of the coin: while they enforce restrictions and limitations on their clients’ data, their own sensitive data is sometimes left unchecked.

An organization’s sensitive data may include information about its own employees, information that may be valuable for phishing attacks, and even samples of its own source code. Attackers can use this information to further their attacks on the organization and, in extreme cases, even impersonate organization officials to attack its clients, causing reputational and financial damage. Most organizations know that this information should be secured and protected, and they actively do so, but blanket protection sometimes causes other, unforeseen problems.

Needles in haystacks

The first problem, which sometimes overwhelms organizations, is the sheer volume of sensitive data that has accumulated. Sensitive data is located everywhere in the organization’s systems, from personal workstations to development and production environments. The organization needs to classify each file according to whether it contains sensitive information. Finding every shred of sensitive data seems like a daunting task for an organization of 100 people, so what happens when we add another 100 hosts? What about 1,000? The vast network of hosts makes it almost impossible to search each and every computer for sensitive information. To solve this problem, we just protect everything, and herein lies our next problem.

The second problem is that by protecting everything, we sometimes harm our own business. Organizations have limited resources, and every resource spent on security is a resource that cannot be used elsewhere, so the result is strict, global enforcement across the system. This method may protect the sensitive data, but unfortunately, other business flows may become slow or incomplete. Therefore, there must be a balance between protecting sensitive data and finding it. To strike that balance, we need to automate the search for sensitive information by using rules and patterns. This method enables organizations to focus their protection only on the hosts and environments that require it.
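To make the idea of rules and patterns concrete, here is a minimal sketch in Python. The rule names, the regular expressions and the keyword list are all illustrative assumptions, not the rule set of any particular product; a real deployment would need far more patterns and per-organization tuning.

```python
import re

# Hypothetical rule set: every name and pattern here is an illustrative assumption.
PATTERNS = {
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b"),
    "us_ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "private_key": re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
}

# Hypothetical keyword dictionary (lowercase, single words only).
KEYWORDS = {"password", "salary", "confidential"}


def scan_text(text):
    """Return the sorted names of all rules that match the given text."""
    # Pattern rules: a rule fires if its regex matches anywhere in the text.
    hits = {name for name, pattern in PATTERNS.items() if pattern.search(text)}
    # Keyword rules: compare the dictionary against the lowercase words in the text.
    words = set(re.findall(r"[a-z]+", text.lower()))
    hits.update(f"keyword:{word}" for word in KEYWORDS & words)
    return sorted(hits)


print(scan_text("Contact jane@example.com, SSN 123-45-6789"))
print(scan_text("Lunch menu for Friday"))
```

A scanner like this can run over every host and flag only the files that trigger a rule, so protection can be concentrated where it is needed. It matches surface forms only: a note that spells out an ID number in words, for example, would pass through unflagged.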

This is the tricky part. Creating rules and dictionaries of keywords is an arduous job that requires pinpointing all the necessary keywords and their permutations. Creating those dictionaries may take several months, and even after the job is finished, research shows that dictionaries and rules may find only up to 60% of sensitive data, since those rules and patterns do not “understand” the data. So it seems that we still need a human to go over the files, and we have neatly returned to square one. How do we solve this problem?

Technology to the rescue

The problem can be solved by combining the best of all worlds with our understanding of AI. To search vast amounts of data, we must automate the process, but the system must understand the meaning of each file and discern whether it contains sensitive information. To achieve this, solutions may use Natural Language Processing (NLP) algorithms that enable the system to go over the files and, just like a human, decide whether a file contains sensitive information. Since sensitive information comes in all shapes and sizes, the solution must also use machine learning to add more information to its databases and assessments and gradually minimize tagging errors, so that it fits a myriad of organizations and their data.
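As a rough illustration of a tagger that learns from corrections, the sketch below uses a tiny naive Bayes text classifier written in plain Python. The article does not name a specific algorithm, so naive Bayes is an assumption chosen for brevity, and the class name, labels and training sentences are all hypothetical. Production NLP systems are considerably more sophisticated.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split text into lowercase word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesTagger:
    """Toy two-label text classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.word_counts = {"sensitive": Counter(), "public": Counter()}
        self.doc_counts = Counter()

    def learn(self, text, label):
        """Add one labeled example, e.g. a correction made by a human reviewer."""
        self.word_counts[label].update(tokenize(text))
        self.doc_counts[label] += 1

    def predict(self, text):
        """Return the label with the highest smoothed log-probability."""
        vocab = set(self.word_counts["sensitive"]) | set(self.word_counts["public"])
        total_docs = sum(self.doc_counts.values())
        scores = {}
        for label, counts in self.word_counts.items():
            # Log prior with add-one smoothing over the two labels.
            score = math.log((self.doc_counts[label] + 1) / (total_docs + 2))
            total_words = sum(counts.values())
            for word in tokenize(text):
                # Add-one smoothing; the extra +1 reserves mass for unseen words.
                score += math.log((counts[word] + 1) / (total_words + len(vocab) + 1))
            scores[label] = score
        return max(scores, key=scores.get)


tagger = NaiveBayesTagger()
tagger.learn("employee salary and home address list", "sensitive")
tagger.learn("customer passport number and bank account", "sensitive")
tagger.learn("cafeteria lunch menu for friday", "public")
tagger.learn("office holiday party schedule", "public")

print(tagger.predict("payroll file with salary and bank details"))

# A reviewer corrects a mistake, and the tagger folds the correction in.
snippet = "source code snippet from the billing service"
before = tagger.predict(snippet)
tagger.learn(snippet, "sensitive")
after = tagger.predict(snippet)
print(before, "->", after)
```

The last few lines show the feedback loop in miniature: the tagger first mislabels an unfamiliar kind of document, a human supplies the right label, and the next prediction reflects it. Each organization can feed in its own examples this way instead of maintaining keyword dictionaries by hand.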