Handling personally identifiable information (PII), such as credit card numbers and passport numbers, is very important for enterprises. Cloudera introduced a feature called Sensitive Data Redaction in version 5.4.0 to manage sensitive information in HDFS. For example, Sensitive Data Redaction can remove credit card numbers from log files and SQL queries.
In Cloudera Manager there are two new HDFS parameters: one to enable redaction and one to specify what to redact. Redaction is configured as an HDFS parameter, but it is applied to the whole cluster.
Before enabling redaction, let us check how queries are logged to /tmp/itversity/hive.log by running Hive queries from the command prompt.
To enable and configure redaction:
- Go to HDFS -> Configuration
- Search for “redaction”
- Select the “Enable Log and Query Redaction” checkbox
- To add policies, click the + sign in the “Log and Query Redaction Policy” section and add the rules.
- Each rule has the following fields to configure.
- Description – A name for the rule. This field has no impact on the redaction process.
- Trigger – A simple string match (not a regular expression). The Search regular expression is applied only to text that contains the trigger string.
- Search – A regular expression that finds the data to be redacted.
- Replace – The string that replaces the text matched by the Search field.
- To test the redaction rules, enter sample text and select “Test Redaction”. For example, for a rule that redacts email addresses, enter an email address. The output shows the sample text with the matched data replaced by the configured replacement string.
- Click Save Changes to commit the changes.
- Restart the cluster.
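The way the Trigger, Search and Replace fields work together can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not Cloudera's implementation: the function `apply_rule`, the two sample rules and their patterns are hypothetical, and the product evaluates Search as a Java regular expression, so a pattern should still be checked with “Test Redaction” before it is saved.

```python
import re


def apply_rule(line, search, replace, trigger=None):
    """Apply one redaction rule (hypothetical helper) to a line of text."""
    # Trigger is a plain substring check; the regex is skipped when it is absent.
    if trigger and trigger not in line:
        return line
    # Search is a regular expression; every match is replaced by the Replace string.
    return re.sub(search, replace, line)


# Hypothetical rule 1: 16-digit card numbers written as four groups of four digits.
CARD_SEARCH = r"\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}"
CARD_REPLACE = "XXXX-XXXX-XXXX-XXXX"

# Hypothetical rule 2: email addresses, with "@" as the trigger string.
EMAIL_SEARCH = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"
EMAIL_REPLACE = "email@redacted.host"

print(apply_rule("select * from t where cc = '1234-5678-9012-3456'",
                 CARD_SEARCH, CARD_REPLACE))
print(apply_rule("contact jane.doe@example.com",
                 EMAIL_SEARCH, EMAIL_REPLACE, trigger="@"))
```

Lines that do not contain the trigger string are returned unchanged without running the regular expression at all, which is why a trigger is the cheaper first check.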
To ensure data is redacted as expected, run the Hive queries once again and review the log file.
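Reviewing the log by eye is easy to get wrong on a large file, so the check can also be scripted. The sketch below assumes the rule being verified targets 16-digit card numbers. The pattern `CARD_PATTERN` and the helpers `find_unredacted` and `scan_log` are hypothetical names introduced here for illustration, and the pattern should be adjusted to match whatever the configured rule is meant to catch.

```python
import re

# Assumed pattern: 16-digit card numbers in four groups of four digits.
CARD_PATTERN = re.compile(r"\b\d{4}[ -]?\d{4}[ -]?\d{4}[ -]?\d{4}\b")


def find_unredacted(lines, pattern=CARD_PATTERN):
    """Return (line_number, text) for every line that still matches the pattern."""
    hits = []
    for number, line in enumerate(lines, start=1):
        if pattern.search(line):
            hits.append((number, line.rstrip("\n")))
    return hits


def scan_log(path):
    """Scan a log file and report the lines that were not redacted."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as handle:
        return find_unredacted(handle)
```

Calling `scan_log("/tmp/itversity/hive.log")` after re-running the queries should return an empty list if the rule is working. Entries logged before redaction was enabled remain in the file, so only lines written after the restart are meaningful for this check.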