The heart of Lob’s print and mail API is our print delivery network. This network of vetted, industry-leading print partners allows us to fulfill orders quickly and at scale for our customers. Having a network of printers means that we can optimize incoming orders based on a variety of factors, such as:
Within Lob, the process that allows us to harness our print delivery network is known as routing. The Postmasters team, responsible for print and mail execution, recently completed a project to rebuild our order routing system to be more scalable, flexible, and easily configurable. What follows is the story of that process.
What we refer to as "routing" at Lob has humble origins. In the very early days our CEO, Leore, would make daily runs to the local Kinkos to fulfill orders. Because this was an incredibly manual and time-intensive process, we onboarded our first print partner and began taking steps toward automation. At first, we transferred orders to partners programmatically once a day by manually running a script.
As our order volume increased, we expanded our partner network. Orders needed to be transferred to partners multiple times a day so that they could process our orders continuously. We automated the execution of the script as a set of cron jobs. It would look something like this:
The use of cron to route orders worked, but it wasn’t perfect. You might be able to identify that this strategy greedily routes orders. Imagine that we added an additional rule:
Assuming a constant volume of available letters, this rule would route significantly fewer letters to Partner C because the second rule would have already routed all letters to Partner B.
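The starvation effect described above can be reproduced with a small sketch. This is not our production script: the `run_jobs` helper, the order fields, and the two letter-matching jobs are hypothetical stand-ins, and a single batch of orders is assumed. Each job claims every order it matches, so a later job only sees what earlier jobs left behind.

```python
from collections import Counter

def run_jobs(available, jobs):
    """Run routing jobs in order; each job greedily claims every matching order."""
    routed = Counter()
    remaining = list(available)
    for partner, matches in jobs:
        claimed = [o for o in remaining if matches(o)]
        remaining = [o for o in remaining if not matches(o)]
        routed[partner] += len(claimed)
    return routed, remaining

def is_letter(order):
    return order["type"] == "letter"

# Hypothetical batch: 100 letters and 20 postcards waiting to be routed.
orders = [{"type": "letter"}] * 100 + [{"type": "postcard"}] * 20

# Both jobs match letters; Partner B's job happens to run first.
jobs = [("Partner B", is_letter), ("Partner C", is_letter)]

routed, remaining = run_jobs(orders, jobs)
print(routed["Partner B"], routed["Partner C"], len(remaining))
```

In this single-batch sketch Partner C receives nothing at all. With orders arriving continuously, the later job would presumably pick up only the letters that arrived since the earlier job last ran.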
As demonstrated, this system scaled poorly. As we added print partners for redundancy and additional capability, the need to evenly distribute volume and account for spikes arose. We handled this by juggling job frequency and maintaining an increasingly complex order of operations. Over time, the routing system and the way it was configured had snowballed into a beast that everyone was afraid to touch and nobody truly understood. Something had to be done.
We understood that the aging routing system was slow, difficult to maintain and would not scale with our volume or configuration needs. Together, the product, partner operations and engineering teams broke down these problems into goals and prioritized them:
With the old system, complex routing behavior was difficult to configure and maintain, so we weren’t able to fully use routing to make intelligent business decisions. In the new system we wanted to make it easier to configure the various routing use cases we identified.
Naturally, routing would ensure that print partners received orders they were capable of producing. We also understood that routing could be used to manage the volume of orders we sent to our partners. Too much volume to a partner would push it over capacity and potentially incur production delays. Too little volume would fall short of our partners’ daily minimums and therefore would not be cost-effective.
Routing could also optimize for certain factors. Routing orders to print partners located closer to the destination address can improve speed to delivery. Orders with geographically concentrated destinations can be drop shipped, which also improves speed to delivery. Saturating a region with enough volume can additionally yield savings in postage.
In designing the new routing system, we wanted to be able to account for a basic set of these use cases as well as allow for more to be built into the system in the future.
Early on we recognized that the architecture of the routing system needed a redesign. The former routing system was bottlenecked on a single job scheduler that was difficult to scale. We identified that the old approach of using cron jobs had two distinct responsibilities:
We decided to break these operations into separate micro-services with limited responsibilities. This allowed us to scale parts of our routing pipeline independently and meet our throughput goals.
The new architecture centered around a "routing service", a real-time worker that accepted incoming orders and, based on a set of configured rules, routed them deliberately to print partners for production.
The basic process for each worker is as follows:
The heart of the new routing engine is a new rules language. The JSON-based language was designed to be flexible enough to capture the basic routing use cases and easy enough for non-technical users to configure.
At a high level the language defines routing rules by pairing expectations for an order with a set of partners that satisfy those expectations. The following are two examples of rules:
An order can be evaluated against a rule to determine a set of potential partners. For example, if a USPS First Class check is evaluated against the first rule above, the set of potential partners would be Partner A. If a postcard were evaluated against the same rule, the set of potential partners would be empty.
The representation of these rules in the rules language is:
Individual rules are the building blocks used to configure routing. Rules are grouped and evaluated in series to express the many use cases we listed above. The purpose of this process is to determine a single partner that the order should be sent to.
Initially, when an order is being evaluated, the set of potential partners is the set of all partners. As the order is evaluated against a series of rules, the set of potential partners is reduced until one partner is chosen.
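To make the narrowing concrete, here is a minimal Python sketch of the idea. It is not the actual rules language: the `expect` and `partners` keys, the order fields, and both example rules are hypothetical, and reducing the set by intersecting it with each rule's result is an assumption about how the reduction works.

```python
ALL_PARTNERS = {"Partner A", "Partner B", "Partner C"}

def evaluate(order, rule):
    """Return the rule's partners if the order meets every expectation, else an empty set."""
    if all(order.get(key) == value for key, value in rule["expect"].items()):
        return set(rule["partners"])
    return set()

def narrow(order, rules, partners=ALL_PARTNERS):
    """Start from all partners and keep only those each rule allows (assumed intersection)."""
    candidates = set(partners)
    for rule in rules:
        candidates &= evaluate(order, rule)
    return candidates

# Hypothetical rules, written as plain dicts rather than the real JSON syntax.
first_class_checks = {
    "expect": {"type": "check", "mail_class": "usps_first_class"},
    "partners": ["Partner A"],
}
any_check = {"expect": {"type": "check"}, "partners": ["Partner A", "Partner B"]}

check = {"type": "check", "mail_class": "usps_first_class"}
postcard = {"type": "postcard", "mail_class": "usps_first_class"}

print(evaluate(check, first_class_checks))      # the check matches the rule
print(evaluate(postcard, first_class_checks))   # the postcard does not
print(narrow(check, [any_check, first_class_checks]))
```

Evaluating the check against the first rule yields Partner A and the postcard yields an empty set, mirroring the example earlier in this section.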
For flexibility, we provided different ways of grouping rules that allowed both greedy and non-greedy evaluation. One grouping of rules, which we refer to as "capability rules", will evaluate all the rules in the group and perform a union on the resulting potential partners. For example, if an order is evaluated against two rules in a capability rule group and the resulting potential partners are:
The resulting partner from the evaluation of the group is Partner B.
The other grouping, which we refer to as "preference rules", will greedily evaluate rules based on the order in the group.
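A rough sketch of what greedy, order-dependent evaluation could look like follows. The function name, the rule shape, and the fallback of leaving the candidates unchanged when no rule applies are all assumptions made for illustration, not a description of the production service.

```python
def evaluate_preference_group(order, rules, candidates):
    """Return the partners of the first rule, in list order, that matches the order
    and overlaps the current candidates. Later rules are never consulted."""
    for rule in rules:
        matches = all(order.get(k) == v for k, v in rule["expect"].items())
        if matches:
            preferred = candidates & set(rule["partners"])
            if preferred:
                return preferred
    # Assumption: if no preference applies, the candidates pass through unchanged.
    return set(candidates)

# Hypothetical preference group: letters prefer Partner B, everything else Partner A.
preference_rules = [
    {"expect": {"type": "letter"}, "partners": ["Partner B"]},
    {"expect": {}, "partners": ["Partner A"]},  # empty expectations match any order
]

candidates = {"Partner A", "Partner B"}
letter_choice = evaluate_preference_group({"type": "letter"}, preference_rules, candidates)
postcard_choice = evaluate_preference_group({"type": "postcard"}, preference_rules, candidates)
print(letter_choice, postcard_choice)
```

Because evaluation stops at the first applicable rule, reordering the rules in the group changes the outcome. That is the property that makes a group like this suitable for expressing preferences.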
In practice, rules are combined into capability and preference groups and orders are evaluated against sequences of groups to determine a final partner. By combining groups of greedily and non-greedily evaluated rules we’re able to describe how orders should be routed in an organized and understandable way.
In building a new routing system, we also focused on creating a better experience for the users of the system through tooling. The old routing system was configured through a bloated and slow web application that would often take minutes to load or to apply changes. Additionally, there was a lack of transparency around how routing affected volume distributions. Users would often have to rely on intuition when making routing rule changes.
In our new routing system the rules are created and updated through an API that provides versioning and validation. The engineering team worked closely with product to design and implement a web GUI for viewing and enabling rules safely and easily.
Drawing from our engineering best practices, we also incorporated testing into the rule creation process. We have the capability to “unit test” rules to ensure that rules are performing a base set of expected actions. We also have the ability to run a simulation of historical volume through a new set of rules to check that volume distributions look correct before enabling the rules in production. This enabled us to transition from an intuition-driven approach to configuring routing rules to a data-driven one. Whereas in the past we weren’t sure how our routing would react to the complexities of our daily volume, we’re now able to rigorously test our assumptions and deploy rules with confidence.
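The two checks described above can be sketched in a few lines of Python. Everything here is illustrative: `route` stands in for a candidate rule set, `simulate` is a hypothetical helper, and the historical orders are made-up data rather than real volume.

```python
from collections import Counter

def route(order):
    """Hypothetical candidate rule set: letters go to Partner B, everything else to Partner A."""
    return "Partner B" if order["type"] == "letter" else "Partner A"

def simulate(orders, route_fn):
    """Replay orders through a routing function and return each partner's share of volume."""
    counts = Counter(route_fn(order) for order in orders)
    total = sum(counts.values())
    return {partner: n / total for partner, n in counts.items()}

# "Unit test": the rule set performs a basic expected action.
assert route({"type": "letter"}) == "Partner B"

# Simulation: replay (made-up) historical volume and inspect the distribution.
historical = [{"type": "letter"}] * 60 + [{"type": "postcard"}] * 40
shares = simulate(historical, route)
print(shares)
```

Comparing the resulting shares against each partner's expected capacity or minimums is the kind of check that can be done before a rule set is enabled.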
Since the initial release of the routing system we’ve continually invested in making routing more intelligent. We’re in the process of implementing automated routing to partners who have the capacity for more orders, minimizing production delays, and improving our unit economics. Next is adding the ability to optimize postage by saturating a region, or to route orders geographically based on distance to the destination address, using Lob’s Address Verification API.
Routing is just one example of the many unique problems that we solve here at Lob. The project is an excellent example of how engineers and non-technical stakeholders work together to bring clarity to vaguely defined problems and build iterative solutions for them.
Interested in working on problems like this? We’re hiring!