Generating Realistic Pseudonyms with Faker.js and Deterministic Seeds

Last week we talked about using decorators to conditionally anonymize users of our application to build a togglable “demo mode”. In our example, we anonymized every user by giving them the name "Jane Doe" and the phone number "555-867-5309". While this works, it doesn’t make for the most exciting demo experience. Ideally, we could incorporate more variety into our anonymized user base.

It turns out that with a little help from Faker.js and deterministic seeds, we can do just that!

Faker.js

Faker.js is a library that “generate[s] massive amounts of realistic fake data in Node.js and the browser.” This sounds like it’s exactly what we need.

As a first pass at incorporating Faker.js into our anonymization scheme, we might try generating a random name and phone number in the anonymize function attached to our User model:


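const _ = require('lodash'); // assumed; underscore also provides _.extend
const faker = require('faker');

// Return a copy of the user with a freshly generated name and phone number.
userSchema.methods.anonymize = function() {
  return _.extend({}, this, {
    name: faker.name.findName(),
    phone: faker.phone.phoneNumber()
  });
};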

We’re on the right path, but this approach has problems. Every call to anonymize will generate a new name and phone number for a given user. This means that the same user might be given multiple randomly generated identities if they’re returned from multiple resolvers.

Consistent Random Identities

Thankfully, Faker.js once again comes to the rescue. Faker.js lets us specify a seed, which it uses to configure its internal pseudo-random number generator. This generator is what’s used to generate fake names, phone numbers, and other data. By seeding Faker.js with a consistent value, we’ll be given a consistent stream of randomly generated data in return.
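To make the idea of a "consistent stream" concrete, here is a minimal sketch that doesn't depend on Faker.js at all. It uses mulberry32, a small seeded generator, purely as a stand-in for Faker.js' internal one; the take helper is a hypothetical name introduced for this illustration.

```javascript
// mulberry32: a small 32-bit seeded PRNG, used here only as a stand-in
// for the generator inside Faker.js (which is a Mersenne Twister).
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper: draw `count` values from a freshly seeded generator.
function take(seed, count) {
  const next = mulberry32(seed);
  return Array.from({ length: count }, () => next());
}

const first = take(42, 3);
const second = take(42, 3);
const other = take(43, 3);

// Compare the stream from seed 42 against a second run and against seed 43.
console.log(first.every((value, i) => value === second[i]));
console.log(first.some((value, i) => value !== other[i]));
```

Re-creating the generator with the same seed replays the same values in the same order, which is exactly the property we want to borrow from Faker.js.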

Unfortunately, it looks like Faker.js’ faker.seed function accepts a number as its only argument. Ideally, we could seed it with the _id of the model being anonymized, but that _id isn’t a number.

However, a little digging shows us that the faker.seed function calls out to a local Random module:


Faker.prototype.seed = function(value) {
  var Random = require('./random');
  this.seedValue = value;
  this.random = new Random(this, this.seedValue);
}

And the Random module calls out to the mersenne library, which supports seeds in the form of an array of numbers:


if (Array.isArray(seed) && seed.length) {
  mersenne.seed_array(seed);
}

Armed with this knowledge, let’s update our anonymize function to set a random seed based on the user’s _id. We’ll first need to turn our _id into an array of numbers:


// String() covers the case where _id is an ObjectId rather than a plain string.
String(this._id).split("").map(c => c.charCodeAt(0));
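As a quick sanity check, here is that conversion applied to a sample _id. The idToSeed helper is a hypothetical name used only for this sketch, and the String() coercion is an assumption so that both plain strings and ObjectId-like values are accepted.

```javascript
// Hypothetical helper: turn an _id into an array of character codes.
// String() lets this accept plain strings as well as ObjectId-like values
// (anything with a meaningful toString).
function idToSeed(id) {
  return String(id).split('').map(c => c.charCodeAt(0));
}

const seed = idToSeed('5cb0b6fd8f6a9f00b8666dcb');

console.log(seed.length);      // 24, one number per hex character
console.log(seed.slice(0, 4)); // [ 53, 99, 98, 48 ] for "5", "c", "b", "0"
```

Each character becomes its UTF-16 code unit, so a 24-character hex _id turns into an array of 24 small integers.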

And then pass that array into faker.seed before returning our anonymized data:


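userSchema.methods.anonymize = function() {
  // Seed Faker.js from the _id so the same user always gets the same identity.
  // String() covers the case where _id is an ObjectId rather than a plain string.
  faker.seed(String(this._id).split("").map(c => c.charCodeAt(0)));
  return _.extend({}, this, {
    name: faker.name.findName(),
    phone: faker.phone.phoneNumber()
  });
};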

And that’s all there is to it.

Now every user will be given a consistent anonymous identity every time their user document is anonymized. For example, a user with an _id of "5cb0b6fd8f6a9f00b8666dcb" will always be given a name of "Arturo Friesen", and a phone number of "614-157-9046".
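The property we care about, that anonymizing the same document twice yields the same identity, can be checked without Faker.js. The sketch below is a stand-in under stated assumptions: the name lists, foldSeed, and createGenerator are all hypothetical, and the generator is a tiny xorshift32 rather than the Mersenne Twister Faker.js uses, so the names it produces won't match Faker's.

```javascript
// Stand-in data; Faker.js ships far larger lists.
const FIRST_NAMES = ['Ada', 'Grace', 'Alan', 'Edsger', 'Barbara', 'Donald'];
const LAST_NAMES = ['Lovelace', 'Hopper', 'Turing', 'Dijkstra', 'Liskov', 'Knuth'];

// Fold an array of numbers into one 32-bit value (FNV-1a style) so the
// stand-in generator can accept the same array-shaped seed.
function foldSeed(numbers) {
  let hash = 0x811c9dc5;
  for (const n of numbers) {
    hash ^= n & 0xff;
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

// xorshift32: a tiny seeded generator that returns integers in [0, limit).
function createGenerator(seedArray) {
  let state = foldSeed(seedArray) || 1; // xorshift state must be non-zero
  return function nextInt(limit) {
    state ^= state << 13;
    state ^= state >>> 17;
    state ^= state << 5;
    state >>>= 0;
    return state % limit;
  };
}

// Same shape as the anonymize method above, written as a plain function.
function anonymize(user) {
  const seed = String(user._id).split('').map(c => c.charCodeAt(0));
  const nextInt = createGenerator(seed);
  const firstName = FIRST_NAMES[nextInt(FIRST_NAMES.length)];
  const lastName = LAST_NAMES[nextInt(LAST_NAMES.length)];
  const digits = Array.from({ length: 4 }, () => nextInt(10)).join('');
  return Object.assign({}, user, {
    name: `${firstName} ${lastName}`,
    phone: `555-${digits}`
  });
}

const user = { _id: '5cb0b6fd8f6a9f00b8666dcb', email: 'real@example.com' };
const once = anonymize(user);
const twice = anonymize(user);

// Compare the pseudonym from two separate calls on the same user.
console.log(once.name === twice.name && once.phone === twice.phone);
```

Because the generator is rebuilt from the _id on every call, no pseudonyms need to be stored anywhere; the identity can always be recomputed from the document itself.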

Final Thoughts

My client ultimately decided against this route and stuck with obviously fake “demo mode” identities. Even so, I think this is an interesting technique that I can see myself using in the future.

Seeding random number generators with deterministic values is a powerful technique for generating pseudo-random but repeatable data.

That said, it’s worth considering whether this is really enough to anonymize our users’ data. By consistently replacing a user’s name and phone number, we’re masking only one aspect of their identity in our application. Is that enough to truly anonymize them, or will other attributes or patterns in their behavior reveal their identity? Is it worth risking the privacy of our users just to build a more exciting demo mode? These are all questions worth asking.