This post is written as a set of Literate Commits. The goal of this style is to show you how this program came together from beginning to end. Each commit in the project is represented by a section of the article. Click each section’s header to see the commit on Github, or check out the repository and follow along.
Project Setup
Today we’re going to be tackling the Molecule to Atom kata. The goal of this kata is to parse a string representation of a molecule into its component elements, or atoms.
As always, the first step is to get our project set up. We’ll be using Mocha as our test runner and Babel for ES6 support.
.babelrc
+{ + "presets": ["es2015"] +}
.gitignore
+node_modules/
package.json
+{ + "main": "index.js", + "scripts": { + "test": "mocha ./test --compilers js:babel-register" + }, + "dependencies": { + "babel-preset-es2015": "^6.9.0", + "babel-register": "^6.9.0", + "chai": "^3.5.0", + "lodash": "^4.12.0", + "mocha": "^2.4.5" + } +}
test/index.js
+import { expect } from "chai"; + +describe("index", function() { + + it("works"); + +});
Laying the Groundwork
The best place to start is at the beginning. We’ll begin solving this kata by writing the most basic test we can think of. We expect an empty string to be parsed into an empty object:
expect(parseMolecule("")).to.deep.equal({});
After writing this test, it fails. parseMolecule
is not defined. We
can quickly remedy this by importing parseMolecule
into our
test file and then exporting it from our main module.
Lastly, we need parseMolecule
to return an empty object. Just like
that, our tests are green.
index.js
+export function parseMolecule(formula) { + return {}; +}
test/index.js
import { expect } from "chai"; +import { parseMolecule } from "../index"; -describe("index", function() { +describe("Molecule to Atoms", function() { - it("works"); + it("it parses a molecule", () => { + expect(parseMolecule("")).to.deep.equal({}); + });
Introducing Our Abstractions
Knowing this solution is going to get complex fairly quickly, we’d like to leverage some abstractions to make it more testable and maintainable.
The most obvious abstraction we can think of is a molecule
. We create
a molecule
by passing in a formula
. From there, we can call parse
to get an object-based representation of the molecule.
We can rewrite our parseMolecule
function to use molecule
in a clean
way.
index.js
+export function molecule(formula) { + function parse() { + return {}; + } + + return { + parse + }; +} + export function parseMolecule(formula) { - return {}; + return molecule(formula).parse(); }
Testing Hydrogen
Let’s add a new test case:
expect(parseMolecule("H")).to.deep.equal({ H: 1 });
This test fails. It’s expecting {}
to equal { H: 1 }
.
The fastest way we can get ourselves to green is to return an object
that maps the provided formula
to 1
.
return {
[formula]: 1
};
After making this change, our first test fails. We now need to explicitly handle this base case:
if (!formula) {
return {};
}
And now we’re back to green again.
index.js
... function parse() { - return {}; + if (!formula) { + return {}; + } + return { + [formula]: 1 + }; }
test/index.js
... expect(parseMolecule("")).to.deep.equal({}); + expect(parseMolecule("H")).to.deep.equal({ H: 1 }); });
Multiple Elements
Next we’ll move onto parsing a molecule with multiple types of elements.
Let’s try to parse "HMg"
:
expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 });
As expected, this test fails. parse
is returning { HMg: 1 }
, rather
than correctly splitting "H"
and "Mg"
.
To fix this, we’ll need to implement a new method on molecule
that
will split a formula into a number of parts. We’ll start things off by
splitting formulas into its component elements:
formula.match(/[A-Z][a-z]*/g) || [];
The || []
default is required to keep our base case of parsing ""
happy.
After rewriting parse
to use this new parts
method to assemble the
molecule object, our tests flip back to green.
index.js
export function molecule(formula) { + function parts() { + return formula.match(/[A-Z][a-z]*/g) || []; + } + function parse() { - if (!formula) { - return {}; - } - return { - [formula]: 1 - }; + return parts().reduce((result, part) => { + result[part] = 1; + return result; + }, {}); } ... return { - parse + parse, + parts };
test/index.js
import { expect } from "chai"; -import { parseMolecule } from "../index"; +import { parseMolecule, molecule } from "../index"; ... expect(parseMolecule("H")).to.deep.equal({ H: 1 }); + expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 }); + }); + + describe("molecule", () => { + it("splits a formula into parts", () => { + expect(molecule("HMg").parts()).to.deep.equal(["H", "Mg"]); + }); });
Compiling Parts
Things are getting more complicated. Instead of just breaking apart our
formula
and looking for elements, we need to look for pairs of
elements and their corresponding count
, or ratio in the molecule.
A nice way to do this is using regular expression match groups and ES6 pattern matching. We can define our regex, and wrap the element and optional count in match groups:
let regex = /([A-Z][a-z]*)(\d*)/g;
As we iterate over our regex matches, we can destructure the result
into its corresponding element
and count
pair.
Finally, we refactored parts
to return an object with part
and
count
keys to keep things explicit. Thinking ahead, I think this might
come in handy when we need to deal with nested expressions.
index.js
... function parts() { - return formula.match(/[A-Z][a-z]*/g) || []; + let parts, results = []; + let regex = /([A-Z][a-z]*)(\d*)/g; + while (parts = regex.exec(formula)) { + let [_, part, count] = parts; + results.push({ + part, + count: parseInt(count) || 1 + }); + } + return results; } ... function parse() { - return parts().reduce((result, part) => { - result[part] = 1; + return parts().reduce((result, {part, count}) => { + result[part] = count; return result;
test/index.js
... expect(parseMolecule("HMg")).to.deep.equal({ H: 1, Mg: 1 }); + expect(parseMolecule("H2Mg")).to.deep.equal({ H: 2, Mg: 1 }); }); ... it("splits a formula into parts", () => { - expect(molecule("HMg").parts()).to.deep.equal(["H", "Mg"]); + expect(molecule("HMg").parts()).to.deep.equal([ + { part: "H", count: 1 }, + { part: "Mg", count: 1 }, + ]); });
Fixing an Edge Case
One thing I noticed while looking over our solution is that it
did not support multiple instances of the same element in a molecule.
For example, with "HMgH"
, the last "H"
element would override the
previous.
I added a test to confirm my suspicions, and quickly fixed the problem.
Instead of overriding the element in the result
object with count
,
increment it (if it exists) by count
:
result[part] = ~~result[part] + count;
The ~~
operator is rarely seen, but it’s an incredibly useful
operator. In this case, if result[part]
is undefined, it will be
converted to 0
, preventing a NaN
result from our addition.
index.js
... return parts().reduce((result, {part, count}) => { - result[part] = count; + result[part] = ~~result[part] + count; return result;
test/index.js
... expect(parseMolecule("H2Mg")).to.deep.equal({ H: 2, Mg: 1 }); + expect(parseMolecule("H2MgH")).to.deep.equal({ H: 3, Mg: 1 }); });
Beginning Nested Expressions
Next up on our feature todo list is adding support for nested
expressions. A simple nested expressions to get us going could be
"[H]Mg"
. Let’s set our expectations with a test:
expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 });
Now each “part” in our parts
method could be either an element, or a
nested expressions within square brackets. Let’s update our regular
expression to reflect that:
let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g;
Thanks to the magic of match groups and destructing, we can assign the
nested expression (if it exists) to square
:
let [_, __, square, part, count] = parts;
Finally, if square
exists, let’s recursively process the nested
expression and append its parts
onto our list of results
.
With those changes, our tests flip back to green. Beautiful.
index.js
... let parts, results = []; - let regex = /([A-Z][a-z]*)(\d*)/g; + let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; while (parts = regex.exec(formula)) { - let [_, part, count] = parts; - results.push({ - part, - count: parseInt(count) || 1 - }); + let [_, __, square, part, count] = parts; + if (square) { + let nested = molecule(square).parts(); + results = results.concat(nested); + } + else { + results.push({ + part, + count: parseInt(count) || 1 + }); + } }
test/index.js
... expect(parseMolecule("H2MgH")).to.deep.equal({ H: 3, Mg: 1 }); + expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 }); });
Refactoring
Looking forward, the next feature we’ll want to support is using ratios,
or count
on nested expressions. This means that we’ll need some notion
of “multiplying” molecules by some count
.
I imagine calling molcule(...).multiply(2)
would return a new
molecule
. This means that multiply
will need access to our parsed
interpretation of the formula
.
We could have multiply
call parts
, multiply each element by count
,
serialize the result back into a formula and then create and return a
new molecule
, but that’s a very roundabout way of doing things.
Instead, let’s define an internal _parts
list that’s created every
time a molecule
is created. The parts
method will simply return
_parts
, and multiply
will be able to modify _parts
directly.
After doing our refactoring, all of our tests are still passing. We’re safe.
index.js
export function molecule(formula) { - function parts() { - let parts, results = []; - let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; - while (parts = regex.exec(formula)) { - let [_, __, square, part, count] = parts; - if (square) { - let nested = molecule(square).parts(); - results = results.concat(nested); - } - else { - results.push({ - part, - count: parseInt(count) || 1 - }); - } + let _parts = []; + + let matches; + let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; + while (matches = regex.exec(formula)) { + let [_, __, square, part, count] = matches; + if (square) { + let nested = molecule(square).parts(); + _parts = _parts.concat(nested); + } + else { + _parts.push({ + part, + count: parseInt(count) || 1 + }); } - return results; + } + + function parts() { + return _parts; }
Multiplying Nested Expressions
With that refactor out of the way, we can easily implement multiply
.
To give ourselves an explicit objective, let’s write a test first:
expect(molecule("H").multiply(2).parse()).to.deep.equal({ H: 2 });
Implementing multiply
is straight forward. We just loop over each of
the _parts
of the molecule, multiplying its count
by the provided
count
. Finally, we return this
, so calls to molecule
methods can
be chained together.
Next, let’s write a test for handling ratios on nested expressions:
expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 });
This test fails, as expected, but it’s not difficult to get our suite
back to green. We just need to build our nested
molecule, multiply
is by its corresponding count
, and finally append its parts
onto our
results array.
index.js
... let [_, __, square, part, count] = matches; + count = parseInt(count) || 1; if (square) { - let nested = molecule(square).parts(); - _parts = _parts.concat(nested); + let nested = molecule(square).multiply(count); + _parts = _parts.concat(nested.parts()); } ... part, - count: parseInt(count) || 1 + count }); ... + function multiply(count) { + _parts.forEach((part) => { + part.count *= count; + }); + return this; + } + function parts() { ... parse, - parts + parts, + multiply };
test/index.js
... expect(parseMolecule("[H]Mg")).to.deep.equal({ H: 1, Mg: 1 }); + expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 }); }); ... }); + it("multiplies an object", () => { + expect(molecule("H").multiply(2).parse()).to.deep.equal({ H: 2 }); + }); });
Regex Woes
Unfortunately, there’s a big problem with our solution. Our regex
(\[.*\]
) is looking for opening and closing brackets and assuming that
everything within them are a nested expression.
This assumption breaks down when we have two nested expressions in the
same formula, like "[H]O[H]"
. Our regex will match on the first and
very last square brackets, returning "H]O[H"
as the nested expression.
We can write this as a test to see the failure in action:
expect(molecule("[H]2O[H]").parts()).to.deep.equal([
{ part: "H", count: 2 },
{ part: "O", count: 1 },
{ part: "H", count: 1 }
]);
Switching to a non-greedy regex (\[.*?\]
) would still fail, but this
time on nested subexpressions, like "[H[O]]"
.
We need a better way to parse out our formula parts.
Our new solution uses a regex to match on all opening brackets, closing brackets, and elements with a trailing count:
let regex = /((\[)|(\])|([A-Z][a-z]*))(\d*)/g;
Every time we encounter an opening bracket, we push a new formula onto
our new stack
:
if (open) {
stack.push({
formula: ""
});
}
A non-empty stack means that we’re collecting the pieces of a nested expression. That means we should just append any elements and counts we find to that sub-expression’s formula:
else if (stack.length) {
stack[stack.length - 1].formula += part + count;
}
Finally, whenever we encounter a closing bracket, we want to create a
new molecule
out of the nested expression we just collected, and
appends its parts
onto our list of parts
:
else if (close) {
let nested = molecule(stack.pop().formula).multiply(count);
_parts = _parts.concat(nested.parts());
}
Now we’re properly processing sub-expressions and our tests return to green.
index.js
... let matches; - let regex = /(\[(.*)\]|([A-Z][a-z]*))(\d*)/g; + let stack = []; + let regex = /((\[)|(\])|([A-Z][a-z]*))(\d*)/g; while (matches = regex.exec(formula)) { - let [_, __, square, part, count] = matches; + let [_, __, open, close, part, count] = matches; count = parseInt(count) || 1; - if (square) { - let nested = molecule(square).multiply(count); + if (open) { + stack.push({ + formula: "" + }); + } + else if (close) { + let nested = molecule(stack.pop().formula).multiply(count); _parts = _parts.concat(nested.parts()); } + else if (stack.length) { + stack[stack.length - 1].formula += part + count; + } else {
test/index.js
... ]); + expect(molecule("[H]2O[H]").parts()).to.deep.equal([ + { part: "H", count: 2 }, + { part: "O", count: 1 }, + { part: "H", count: 1 } + ]); });
Alternate Groupings and Bug Fixes
Last up on our feature list is the ability to use parentheses and curly brackets in addition to square brackets as sub-expression dividers.
We can add a test for this:
expect(parseMolecule("K4{ON(SO3)2}2")).to.deep.equal({K: 4, O: 14, N: 2, S: 4});
The most straight-forward way to accomplish this is to just replace all
"{"
and "("
characters with "["
, and "}"
and ")"
characters
with "]"
in our formula
:
formula.replace(/[{(]/g, "[").replace(/[})]/g, "]");
Unfortunately, this test also exposed another bug in our solution. it looks like counts on sub-expressions aren’t being applied to other nested expressions.
We can fix this by giving our stack
a bit more state. In addition to
building up our current expressions’s formula
as we go, we also want
to keep track of any nested molecules
we encounter, so we can multiply
them by our count
.
After making this change, all of our tests pass.
index.js
... + formula = formula.replace(/[{(]/g, "[").replace(/[})]/g, "]"); + let matches; ... stack.push({ - formula: "" + formula: "", + molecules: [] }); ... else if (close) { - let nested = molecule(stack.pop().formula).multiply(count); - _parts = _parts.concat(nested.parts()); + let popped = stack.pop(); + popped.molecules.push(molecule(popped.formula)); + popped.molecules.forEach((molecule) => { + molecule.multiply(count); + }); + if (!stack.length) { + popped.molecules.forEach((molecule) => { + _parts = _parts.concat(molecule.parts()); + }); + } + else { + let last = stack[stack.length - 1]; + last.molecules = last.molecules.concat(popped.molecules); + } }
test/index.js
... expect(parseMolecule("[HO]2Mg")).to.deep.equal({ H: 2, O: 2, Mg: 1 }); + expect(parseMolecule("K4{ON(SO3)2}2")).to.deep.equal({K: 4, O: 14, N: 2, S: 4}); + expect(parseMolecule("{[Co(NH3)4(OH)2]3Co}(SO4)3")).to.deep.equal({ + Co: 4, + N: 12, + H: 42, + O: 18, + S: 3 + }); });
Final Thoughts
This was a doozy of a kata.
All in all, there are a lot of ideas flying around here and I fear I didn’t explain myself as well as I could have. Balancing trying to keep things as concise, while staying true to the intent of explaining every change is definitely a challenge.
The limitations of regular expressions are explored in this problem was well. Initially, a totally regex based solution seemed like a good approach, but the nested structure of our data made regexes difficult to use.
When using regexes, it’s always important to keep in mind that there are entire classes of problem where they’re simply not suitable.
Finally, my final solution isn’t as clean and clear as I would have liked. By the time I landed on a working solution, I feared that more refactor commits would have muddied the water. In future problems, I need to focus on refactoring sooner and more often.