My most recent side project, Affiliate Crawler, is a web crawler designed to find links in your written content that can be turned into affiliate links.
If you’re curious, be sure to read more about the how, what, and why behind Affiliate Crawler.
Affiliate Crawler is given a starting URL that points to your content. It crawls that page, looking for external links that can potentially be monetized, and for internal links to crawl further.
Unfortunately, the Elixir-based crawler that powers Affiliate Crawler only works correctly when given fully qualified URLs, such as `http://www.east5th.co/`. Partial URLs, like `www.east5th.co`, result in parsing errors and a general "failure to crawl".
This is unfortunate because most people deal exclusively in partially formed URLs. You tell people to go to `google.com`, not `http://www.google.com/`. This seems to be especially true when people are prompted for URLs online.
What we need is a way to infer fully fleshed out URLs from user-provided partial URLs, just like we naturally would in conversation.
Thankfully, this is ridiculously easy thanks to Elixir's `URI` module, pattern matching, and the awesome power of recursion!
The Parts of a Parsed URL
Elixir's `URI.parse/1` function accepts a `uri` string as an argument and returns a `URI` struct that holds the component pieces that make up that `uri`.
For example, parsing `"http://www.east5th.co/"` returns the following data:
%URI{authority: "www.east5th.co", fragment: nil, host: "www.east5th.co",
path: "/", port: 80, query: nil, scheme: "http", userinfo: nil}
`URI.parse/1` only works on fully fleshed out URLs. Attempting to parse a partially specified `uri` string, like `"east5th.co"`, returns a mostly empty struct:
%URI{authority: nil, fragment: nil, host: nil, path: "east5th.co", port: nil,
query: nil, scheme: nil, userinfo: nil}
The major pieces of information missing from this `uri` are the `path` and the `scheme`. Given those two pieces of information, everything else can be inferred by `URI.parse/1`.
Thankfully, we can come up with some fairly reasonable defaults for both `path` and `scheme`. If no path is provided, as in a `uri` like `"http://www.east5th.co"`, we can assume a default of `"/"`. If no `scheme` is specified, as in `"www.east5th.co/"`, we can assume a default of `http`.
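A quick check confirms exactly which fields come back `nil` in each case (pattern matches below will raise if the shape differs; exact struct contents may vary slightly between Elixir versions):

```elixir
# A URL with a scheme but no path: path comes back nil,
# while the scheme, host, and port are all inferred.
%URI{scheme: "http", host: "www.east5th.co", path: nil} =
  URI.parse("http://www.east5th.co")

# A URL with no scheme: scheme and host come back nil, and the
# entire string is treated as a path.
%URI{scheme: nil, host: nil, path: "www.east5th.co/"} =
  URI.parse("www.east5th.co/")
```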
But how do we know when and where to add these default values to our `uri` string? Always prepending `"http://"` and appending `"/"` leads to obvious problems in most cases. Using a regex-based check sounds fragile and error-prone.
There has to be some other way.
Pattern Matching and Recursion
It turns out that pattern matching makes it incredibly easy to know whether we need to provide a default `path` or `scheme`. After our first pass through `URI.parse/1`, we can look for `nil` values in either `path` or `scheme`:
case URI.parse(uri) do
  %URI{scheme: nil} ->
    # Needs default scheme
  %URI{path: nil} ->
    # Needs default path
  %URI{} ->
    # Successfully parsed
end
But pattern matching alone doesn't give us our solution. What if `uri` needs both a `scheme` and a `path`, as in the case of `"www.east5th.co"`? And wouldn't we still need to re-parse the newly fleshed out `uri` to populate the inferred fields of the `URI` struct, like `port` and `authority`?
Thankfully, wrapping our `URI.parse/1` call in a function and calling that function recursively elegantly solves both problems:
defp parse(uri) do
  case URI.parse(uri) do
    %URI{scheme: nil} ->
      parse("http://#{uri}")
    %URI{path: nil} ->
      parse("#{uri}/")
    parsed ->
      parsed
  end
end
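To see the recursion in action, here's a minimal, self-contained sketch (the module name `URLParser` is invented for illustration, and `parse/1` is made public so it can be called directly). A bare domain takes two recursive passes, first gaining a scheme, then a path:

```elixir
defmodule URLParser do
  # Recursively re-parse the uri string, supplying a default
  # scheme ("http") and path ("/") whenever URI.parse/1 leaves
  # them nil.
  def parse(uri) do
    case URI.parse(uri) do
      %URI{scheme: nil} ->
        parse("http://#{uri}")

      %URI{path: nil} ->
        parse("#{uri}/")

      parsed ->
        parsed
    end
  end
end

# "east5th.co" -> "http://east5th.co" -> "http://east5th.co/"
URLParser.parse("east5th.co")
# => %URI{scheme: "http", host: "east5th.co", path: "/", port: 80, ...}
```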
We define a function called `parse`, and within that function we parse the provided `uri` string with a call to `URI.parse/1`.
If `URI.parse/1` returns a `URI` struct without a `scheme`, we recursively call `parse` with `"http://"` prepended to the current `uri`.
Similarly, if `URI.parse/1` returns a `URI` struct without a `path`, we recursively call `parse` with `"/"` appended to the current `uri` string.
Otherwise, we return the parsed `URI` struct generated from our fully-qualified URL.
Depending on your personal preference, we could even write our `parse/1` function with our pattern matching spread across multiple function heads:
def parse(uri) when is_binary(uri), do: parse(URI.parse(uri))
def parse(uri = %URI{scheme: nil}), do: parse("http://#{to_string(uri)}")
def parse(uri = %URI{path: nil}), do: parse("#{to_string(uri)}/")
def parse(uri), do: uri
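Either version round-trips a partial URL into a fully-qualified one. For example, wrapping the multi-head version in a module (the module name `URL` is invented for illustration):

```elixir
defmodule URL do
  # Multi-head version: accept either a raw string or a URI struct,
  # filling in a default scheme and path as needed.
  def parse(uri) when is_binary(uri), do: parse(URI.parse(uri))
  def parse(uri = %URI{scheme: nil}), do: parse("http://#{to_string(uri)}")
  def parse(uri = %URI{path: nil}), do: parse("#{to_string(uri)}/")
  def parse(uri), do: uri
end

URL.parse("www.east5th.co") |> to_string()
# => "http://www.east5th.co/"
```

The `is_binary/1` guard on the first head keeps string input and `URI` struct input from colliding: strings are parsed once, and all subsequent recursive calls match on the struct heads.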
Pattern matching and recursion, regardless of your preferred style, can lead to some truly beautiful code.
Final Thoughts
I'm continually amazed by the tools offered by the Elixir programming language straight out of the box. Pattern matching, a recursion-first philosophy, and even the incredibly useful `URI` module are examples of features that make your day-to-day development life easier when working with this language.
If you want to see how I used these tools to build fully-qualified URLs from partial user input, check out the Affiliate Crawler source code on GitHub, and see it in action at Affiliate Crawler.