Scratch your own itch: why we are building Zenaton
A startup’s success is tied to its speed: its ability to iterate not only on its product but also, quite often, on its processes (especially business and marketing ones). For ten years now, I’ve been constantly frustrated by how slowly those processes improve.
When I was running Tellmewhere (a European Foursquare/Yelp, and a French pioneer on mobile with over 1M users), I was frustrated by the inability of our technical team (very skilled, but overworked) to produce simple but essential workflows. These were things like “When a user visits a location and hasn’t left a review 24 hours later, send a reminder to find out what they thought”, or “When a user hasn’t used our service for 7 days, send relevant recommendations to remind them about us”.
Now, I don’t know if those workflows were the right ones at the right time, but I do know that we were never able to put them in place fast enough, to test them quickly, to improve them continually — and that’s a fatal error for a startup.
Let’s take an example that seems simple: in a marketplace, a user makes a request and is waiting for estimates from professionals within the next hour. The initial implementation would be:
- When a user makes a request, a line describing it is entered into the database, and an asynchronous task is launched that will determine the list of professionals that should be able to answer and send those professionals an email asking them to respond with an estimate on the site within the hour.
- Every minute, you’ll ask your database which requests were made exactly one hour ago, so that you can collect the estimates received and send them by email to the user.
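A minimal, in-memory sketch of that initial implementation, just to make the two steps concrete. The “database” is a list, emails are printed, time is counted in minutes, and every name is a stand-in invented for this post, not a real API:

```python
requests_db = []

def find_matching_pros(request):
    # Placeholder matching logic; in reality, a query on skills/location.
    return ["pro-1@example.com", "pro-2@example.com"]

def send_email(to, subject):
    print(f"email to {to}: {subject}")

def create_request(user, description, now):
    # Step 1: store the request and notify matching professionals.
    request = {"user": user, "description": description,
               "created_at": now, "quotes": [], "closed": False}
    requests_db.append(request)
    # The "asynchronous task", run inline here for simplicity:
    for pro in find_matching_pros(request):
        send_email(pro, "A request matches your profile: quote within 1h")
    return request

def poll_requests(now):
    # Step 2: the cron-style job that runs every minute, finds requests
    # made one hour (60 minutes) ago, and sends collected quotes to the user.
    done = [r for r in requests_db
            if not r["closed"] and now - r["created_at"] >= 60]
    for r in done:
        r["closed"] = True
        send_email(r["user"], f"{len(r['quotes'])} quote(s) received")
    return done
```

Even this happy path hides the subtleties listed below.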
Here is a non-exhaustive list of the problems you may run into and, more importantly, of the improvements you will want to make:
- What to do when the first asynchronous task fails? (if something can fail, it will fail)
- What happens to the requests that should have been handled if the every-minute polling task stays down for several minutes?
- What happens when no professional able to respond is found?
- What happens when no response is received within the hour?
- What happens when too many responses are received within the hour?
- What happens when the user cancels or changes their request?
- What happens when none of the responses fulfill the user’s request?
- What happens with professionals who rarely or never respond?
- How do you avoid losing in-progress requests if you decide to lower the amount of time given for responses?
- How can you proactively reduce the wait time if enough responses have already been received?
These kinds of natural evolutions will push your technical team to create state variables that describe the specific situation of each user request. The process that runs every minute will then find the requests to handle and, depending on their states, try to perform the right actions.
Any change will make certain states obsolete, and every new feature will quickly make the code and databases harder to understand. You won’t be able to see the process as a whole; it will even be difficult to see clearly what has actually been coded. As a result, the tech team will lose a lot of time maintaining and understanding the system so that nothing breaks, and responding to the problems that will inevitably occur. The process itself will slowly deteriorate.
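Concretely, the every-minute job tends to become a dispatcher over ad-hoc state flags. A sketch of the pattern (all state names and fields here are hypothetical):

```python
def handle_request(request, now):
    """One tick of the every-minute job, driven by state flags.

    Every new business rule adds a state or a transition, and the
    overall process becomes harder and harder to read.
    """
    state = request["state"]
    if state == "waiting_for_quotes":
        if len(request["quotes"]) >= 3:
            # Enough answers already: close early to reduce the user's wait.
            request["state"] = "closing_early"
        elif now >= request["deadline"]:
            if request["quotes"]:
                request["state"] = "sending_quotes"
            else:
                request["state"] = "expired_no_quotes"
    elif state == "expired_no_quotes":
        pass  # relaunch? widen the professional search? notify the user?
    elif state == "cancelled":
        pass  # and what about quotes that arrive after cancellation?
    return request["state"]
```

Each branch answers one of the questions above, and none of them is visible as a whole workflow.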
Faced with these difficulties, major startups (those with hundreds of software engineers, which you don’t have) have developed their own solutions:
- Amazon: Simple Workflow Service, a paid service, complete but where the only thing that’s “simple” is the name;
- Spotify: Luigi, for Python;
- Uber: Cadence, for Go;
- Airbnb: Airflow, for Python;
- Pinterest: Pinball, for Python;
- Netflix: Conductor, for Java.
With the exception of the very recent Cadence and AWS SWF, these solutions all define the state of the process starting from a formal workflow definition. I myself had been thinking for a few years about a solution based on a workflow engine modeled as a Petri net (you may be interested in these libraries: flowgraph and flowmanager). But in the course of a project at The Family together with Louis, we came up with a much better solution.
Our main innovation is to take the opposite approach from existing solutions, which rely on a formal configuration of the workflow (either a configuration file or a graphical interface), and to instead let developers do what they know best: write code. With Zenaton, a workflow is modeled by a class in which the developer describes, in their preferred language, the succession of tasks that makes up the workflow, as well as how to react to external events.
```python
class OrderWorkflow(Workflow):
    def __init__(self, item, address):
        self.item = item
        self.address = address

    def handle(self):
        self.execute(PrepareOrder(self.item))
        event = self.execute(Wait(OrderPreparedEvent))
        self.execute(SendOrder(event.id, self.item, self.address))

    def on_event(self, event):
        if isinstance(event, AddressUpdatedEvent):
            self.address = event.address
```
In the above example (written in Python):
- First, the `PrepareOrder` task is executed. Its implementation is not shown here; e.g. it could send the needed information to your warehouse.
- Then, we wait for an `OrderPreparedEvent`, for an unlimited time. This event could be triggered, as soon as the order is prepared, by someone in your warehouse using a dedicated interface.
- Then, the `SendOrder` task is executed; e.g. it could send an email to your customer with all the details.
- Meanwhile, each time the shipping address is updated, an event is triggered and the new address is taken into account.
Not bad for 10 lines of code, eh?
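A nice property of this approach is that the class can be exercised locally as plain Python. With a few stand-ins for the base class, tasks and events (everything below is a toy approximation written for this post, not Zenaton’s real engine or API), the same `OrderWorkflow` runs end to end:

```python
import itertools

class Workflow:
    def execute(self, task_or_wait):
        return task_or_wait.run()

class PrepareOrder:
    def __init__(self, item):
        self.item = item
    def run(self):
        return f"prepared {self.item}"

class OrderPreparedEvent:
    _ids = itertools.count(1)
    def __init__(self):
        self.id = next(self._ids)

class Wait:
    def __init__(self, event_class):
        self.event_class = event_class
    def run(self):
        # The real engine would pause here until the event arrives;
        # locally we just hand one back immediately.
        return self.event_class()

class SendOrder:
    def __init__(self, event_id, item, address):
        self.event_id, self.item, self.address = event_id, item, address
    def run(self):
        return f"order #{self.event_id}: {self.item} -> {self.address}"

class AddressUpdatedEvent:
    def __init__(self, address):
        self.address = address

class OrderWorkflow(Workflow):
    def __init__(self, item, address):
        self.item = item
        self.address = address
    def handle(self):
        self.execute(PrepareOrder(self.item))
        event = self.execute(Wait(OrderPreparedEvent))
        return self.execute(SendOrder(event.id, self.item, self.address))
    def on_event(self, event):
        if isinstance(event, AddressUpdatedEvent):
            self.address = event.address
```

Updating the address mid-flow is just a method call on the instance; in production, the engine delivers the event for you.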
Zenaton’s secret sauce is knowing how to use this class to effectively drive the workflow. In the end, Zenaton works like a queue management system into which you can insert not only simple tasks but entire workflows, executed as described by each workflow’s class.
There are numerous benefits to this approach:
- Reading a workflow becomes very simple, as you can simply refer to the implementation of a single class that describes the workflow;
- Developing a workflow with this approach is quite easy, as the decider can be executed and tested locally without Zenaton;
- It is easy to modify a workflow by simply modifying the workflow’s class (it’s even easy to A/B test a workflow — you just need to choose randomly between different implementations of the decider!);
- You don’t need to store states anymore, as all of that is managed by Zenaton;
- If a task fails, it’s easy to restart the workflow exactly where it was prior to failure;
- Your back-office is easily scalable, with your code organized into autonomous and distributed tasks;
- Last but not least, Zenaton can tell you exactly what happened on your platform, providing statistics/analytics on your processes without the need for any additional coding.
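The A/B-testing point above is almost free: since a workflow is just a class, you can route each new execution randomly to one of two implementations of the decider. A sketch (class names are hypothetical):

```python
import random

class OrderWorkflowA:
    version = "A"  # e.g. waits the full hour for quotes

class OrderWorkflowB:
    version = "B"  # e.g. closes early once 3 quotes are in

def new_order_workflow(rng=random):
    # Each incoming order is routed to variant A or B with equal probability.
    cls = rng.choice([OrderWorkflowA, OrderWorkflowB])
    return cls()
```

Comparing the two variants then reduces to comparing their outcomes per `version`.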
I think the advantages for developers are significant: no configuration needed, the ability to test locally, a scalable and resilient architecture. I hope this will allow startups with limited technical teams to offer the same rich services as major companies with engineering teams numbering in the hundreds.
Please contact me if you have any questions or suggestions :)