Last week we talked about using decorators to conditionally anonymize users of our application to build a togglable “demo mode”. In our example, we anonymized every user by giving them the name "Jane Doe" and the phone number "555-867-5309". While this works, it doesn’t make for the most exciting demo experience. Ideally, we could incorporate more variety into our anonymized user base.
It turns out that with a little help from Faker.js and deterministic seeds, we can do just that!
Faker.js
Faker.js is a library that “generate[s] massive amounts of realistic fake data in Node.js and the browser.” This sounds like it’s exactly what we need.
As a first pass at incorporating Faker.js into our anonymization scheme, we might try generating a random name and phone number in the anonymize function attached to our User model:
const faker = require('faker');

userSchema.methods.anonymize = function() {
    return _.extend({}, this, {
        name: faker.name.findName(),
        phone: faker.phone.phoneNumber()
    });
};
We’re on the right path, but this approach has problems. Every call to anonymize will generate a new name and phone number for a given user. This means that the same user might be given multiple randomly generated identities if they’re returned from multiple resolvers.
Consistent Random Identities
Thankfully, Faker.js once again comes to the rescue. Faker.js lets us specify a seed, which it uses to configure its internal pseudo-random number generator. This generator is what’s used to generate fake names, phone numbers, and other data. By seeding Faker.js with a consistent value, we’ll be given a consistent stream of randomly generated data in return.
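To make the idea of a "consistent stream" concrete, here is a minimal, self-contained sketch of a seeded generator. It is only an illustration: createRandom is a small stand-in generator (mulberry32), not the Mersenne Twister that Faker.js uses, and FIRST_NAMES and fakeNames are hypothetical names made up for this example.

```javascript
// Stand-in seedable generator (mulberry32). This is NOT Faker.js' internal
// generator; it only illustrates how a seed fixes the output stream.
function createRandom(seed) {
    let state = seed >>> 0;
    return function next() {
        state = (state + 0x6d2b79f5) >>> 0;
        let t = state;
        t = Math.imul(t ^ (t >>> 15), t | 1);
        t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
        // Scale the 32-bit result into the range [0, 1).
        return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
    };
}

// Hypothetical name list, purely for illustration.
const FIRST_NAMES = ['Ada', 'Grace', 'Alan', 'Edsger', 'Barbara'];

// Draw `count` names from a generator seeded with `seed`.
function fakeNames(seed, count) {
    const random = createRandom(seed);
    const names = [];
    for (let i = 0; i < count; i++) {
        names.push(FIRST_NAMES[Math.floor(random() * FIRST_NAMES.length)]);
    }
    return names;
}

const firstRun = fakeNames(42, 5);
const secondRun = fakeNames(42, 5);

// The same seed always reproduces the same sequence of names.
console.log(firstRun.join(',') === secondRun.join(',')); // true
```

Because all of the generator's state is derived from the seed, re-seeding with the same value replays the same sequence. That is the property we want to borrow from Faker.js.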
Unfortunately, it looks like Faker.js’ faker.seed function accepts a number as its only argument. Ideally, we could pass the _id of our model being anonymized.
However, a little digging shows us that the faker.seed function calls out to a local Random module:
Faker.prototype.seed = function(value) {
    var Random = require('./random');
    this.seedValue = value;
    this.random = new Random(this, this.seedValue);
}
And the Random module calls out to the mersenne library, which supports seeds in the form of an array of numbers:
if (Array.isArray(seed) && seed.length) {
    mersenne.seed_array(seed);
}
Armed with this knowledge, let’s update our anonymize function to set a random seed based on the user’s _id. We’ll first need to turn our _id into an array of numbers:
// String() keeps this working if _id is an ObjectId rather than a plain string.
String(this._id).split("").map(c => c.charCodeAt(0));
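As a quick sanity check of that conversion, here is a small standalone sketch. The helper name idToSeed is hypothetical and only exists for this example. It assumes the _id is either a plain string or an object whose toString() returns the id string.

```javascript
// Hypothetical helper: convert an id into an array of character codes,
// the array-of-numbers seed format described above.
function idToSeed(id) {
    // String() handles plain string ids as well as id objects that
    // implement toString().
    return String(id).split('').map(c => c.charCodeAt(0));
}

console.log(idToSeed('5cb0')); // [ 53, 99, 98, 48 ]

// An object with a toString() method yields the same seed as the string.
console.log(idToSeed({ toString: () => '5cb0' })); // [ 53, 99, 98, 48 ]
```

Every character contributes one number, so a 24-character _id becomes a 24-element seed array, and two different ids always produce different arrays.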
And then pass that array into faker.seed before returning our anonymized data:
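userSchema.methods.anonymize = function() {
    // Seed Faker.js with the character codes of this user's _id.
    // String() keeps this working if _id is an ObjectId rather than a plain string.
    faker.seed(String(this._id).split("").map(c => c.charCodeAt(0)));
    return _.extend({}, this, {
        name: faker.name.findName(),
        phone: faker.phone.phoneNumber()
    });
};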
And that’s all there is to it.
Now every user will be given a consistent anonymous identity every time their user document is anonymized. For example, a user with an _id of "5cb0b6fd8f6a9f00b8666dcb" will always be given a name of "Arturo Friesen" and a phone number of "614-157-9046".
Final Thoughts
My client ultimately decided not to go this route and chose to stick with obviously fake “demo mode” identities. Even so, I think this is an interesting technique that I can see myself using in the future.
Seeding random number generators with deterministic values is a powerful technique for generating pseudo-random but repeatable data.
That said, it’s worth considering whether this is really enough to anonymize our users’ data. By consistently replacing a user’s name and phone number, we’re only masking one aspect of their identity in our application. Is that enough to truly anonymize them, or will other attributes or patterns in their behavior reveal their identity? Is it worth risking the privacy of our users just to build a more exciting demo mode? These are all questions worth asking.