My most recent side project, Affiliate Crawler, is a web crawler designed to find links in your written content that can be turned into affiliate links.
If you’re curious, be sure to read more about the how, what, and why behind Affiliate Crawler.
Affiliate Crawler is given a starting URL that points to your content. It crawls that page, looking for external links that can potentially be monetized, and for internal links to crawl further.
Unfortunately, the Elixir-based crawler that powers Affiliate Crawler only works correctly when given fully qualified URLs, such as `http://www.east5th.co/`. Partial URLs, like `www.east5th.co`, result in parsing errors and a general "failure to crawl".
This is unfortunate because most people deal exclusively in partially formed URLs. You tell people to go to `google.com`, not `http://www.google.com/`. This seems to be especially true when people are prompted for URLs online.
What we need is a way to infer fully fleshed out URLs from user-provided partial URLs, just like we naturally would in conversation.
Thankfully, this is ridiculously easy thanks to Elixir's `URI` module, pattern matching, and the awesome power of recursion!
The Parts of a Parsed URL
Elixir's `URI.parse/1` function accepts a `uri` string as an argument and returns a `URI` struct that holds the component pieces that make up that `uri`.
For example, parsing `"http://www.east5th.co/"` returns the following data:
%URI{authority: "www.east5th.co", fragment: nil, host: "www.east5th.co",
path: "/", port: 80, query: nil, scheme: "http", userinfo: nil}
`URI.parse/1` only works on fully fleshed out URLs. Attempting to parse a partially specified `uri` string, like `"east5th.co"`, returns a mostly empty struct:
%URI{authority: nil, fragment: nil, host: nil, path: "east5th.co", port: nil,
query: nil, scheme: nil, userinfo: nil}
The major pieces of information missing from this `uri` are the `path` and the `scheme`. Given those two pieces of information, everything else can be inferred by `URI.parse/1`.
Thankfully, we can come up with some fairly reasonable defaults for both `path` and `scheme`. If no path is provided, as in a `uri` like `"http://www.east5th.co"`, we can assume a default of `"/"`. If no `scheme` is specified, as in `"www.east5th.co/"`, we can assume a default of `http`.
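A quick check confirms exactly which fields come back `nil` in each case (pattern matches below will raise if the shape differs; exact struct contents may vary slightly between Elixir versions):

```elixir
# A URL with a scheme but no path: path comes back nil,
# while the scheme, host, and port are all inferred.
%URI{scheme: "http", host: "www.east5th.co", path: nil} =
  URI.parse("http://www.east5th.co")

# A URL with no scheme: scheme and host come back nil, and the
# entire string is treated as a path.
%URI{scheme: nil, host: nil, path: "www.east5th.co/"} =
  URI.parse("www.east5th.co/")
```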
But how do we know when and where to add these default values to our `uri` string? Always prepending `"http://"` and appending `"/"` leads to obvious problems in most cases. Using a regex-based check sounds fragile and error-prone.
There has to be some other way.
Pattern Matching and Recursion
It turns out that pattern matching makes it incredibly easy to know whether we need to provide a default `path` or `scheme`. After our first pass through `URI.parse/1`, we can look for `nil` values in either `path` or `scheme`:
case URI.parse(uri) do
  %URI{scheme: nil} ->
    # Needs default scheme
  %URI{path: nil} ->
    # Needs default path
  %URI{} ->
    # Successfully parsed
end
But pattern matching alone doesn't give us our solution. What if `uri` needs both a `scheme` and a `path`, as in the case of `"www.east5th.co"`? And wouldn't we still need to re-parse the newly fleshed out `uri` to populate the inferred fields of the `URI` struct, like `port` and `authority`?
Thankfully, wrapping our `URI.parse/1` call in a function and calling that function recursively elegantly solves both problems:
defp parse(uri) do
  case URI.parse(uri) do
    %URI{scheme: nil} ->
      parse("http://#{uri}")
    %URI{path: nil} ->
      parse("#{uri}/")
    parsed ->
      parsed
  end
end
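To see the recursion in action, here's a minimal, self-contained sketch (the module name `URLParser` is invented for illustration, and `parse/1` is made public so it can be called directly). A bare domain takes two recursive passes, first gaining a scheme, then a path:

```elixir
defmodule URLParser do
  # Recursively re-parse the uri string, supplying a default
  # scheme ("http") and path ("/") whenever URI.parse/1 leaves
  # them nil.
  def parse(uri) do
    case URI.parse(uri) do
      %URI{scheme: nil} ->
        parse("http://#{uri}")

      %URI{path: nil} ->
        parse("#{uri}/")

      parsed ->
        parsed
    end
  end
end

# "east5th.co" -> "http://east5th.co" -> "http://east5th.co/"
URLParser.parse("east5th.co")
# => %URI{scheme: "http", host: "east5th.co", path: "/", port: 80, ...}
```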
We define a function called `parse`, and within that function we parse the provided `uri` string with a call to `URI.parse/1`.
If `URI.parse/1` returns a `URI` struct without a `scheme`, we recursively call `parse` with `"http://"` prepended to the current `uri`.
Similarly, if `URI.parse/1` returns a `URI` struct without a `path`, we recursively call `parse` with `"/"` appended to the current `uri` string.
Otherwise, we return the parsed `URI` struct generated from our fully-qualified URL.
Depending on your personal preference, we could even write our `parse/1` function with our pattern matching spread across multiple function heads:
def parse(uri) when is_binary(uri), do: parse(URI.parse(uri))
def parse(uri = %URI{scheme: nil}), do: parse("http://#{to_string(uri)}")
def parse(uri = %URI{path: nil}), do: parse("#{to_string(uri)}/")
def parse(uri), do: uri
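Either version round-trips a partial URL into a fully-qualified one. For example, wrapping the multi-head version in a module (the module name `URL` is invented for illustration):

```elixir
defmodule URL do
  # Multi-head version: accept either a raw string or a URI struct,
  # filling in a default scheme and path as needed.
  def parse(uri) when is_binary(uri), do: parse(URI.parse(uri))
  def parse(uri = %URI{scheme: nil}), do: parse("http://#{to_string(uri)}")
  def parse(uri = %URI{path: nil}), do: parse("#{to_string(uri)}/")
  def parse(uri), do: uri
end

URL.parse("www.east5th.co") |> to_string()
# => "http://www.east5th.co/"
```

The `is_binary/1` guard on the first head keeps string input and `URI` struct input from colliding: strings are parsed once, and all subsequent recursive calls match on the struct heads.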
Pattern matching and recursion, regardless of your preferred style, can lead to some truly beautiful code.
Final Thoughts
I'm continually amazed by the tools offered by the Elixir programming language straight out of the box. Pattern matching, a recursion-first philosophy, and even the incredibly useful `URI` module are examples of features that make your day-to-day development life easier when working with this language.
If you want to see how I used these tools to build fully-qualified URLs from partial user input, check out the Affiliate Crawler source code on GitHub, and see it in action at Affiliate Crawler.