Part 1. Asynchronous actions within AWS Step Functions without servers

Running another asynchronous process from within the Step Functions is a non-trivial task. Traditional approaches have many shortcomings which may make them unfeasible in production environment. The Activity Task can be used to solve this issue and avoid most of common problems. It is achieved by using pattern similar to Correlation Identifier.

In first part of this series the theory of the solution is shown and in the second part – the example implementation.

Possible use cases

Have you ever wanted to start a new long-running process from Step Functions’ State Machine Execution and await its completion? It may be counterintuitive at first because why anyone would run another process from within the State Machine if the State Machine itself represents an asynchronous process.

However, there is quite a few possible use cases, for example:

  • Run another State Machine that is already used by other parts of the system. We could copy that State Machine and inline it but it would be a violation of DRY principle.
  • Dynamically select another State Machine to be run. This allows us to build a single high-level State Machine which dispatches execution of stages to different context-specific State Machines. This would also enable us to implement the Strategy pattern on the State Machine level.
  • Integrate with an existing system.

In this post, we want to share our experience with running such processes from within Step Functions and show you an approach that is rather simple, yet not widely known.

Traditional solutions

There are two common solutions to this problem. First one is to leverage Job Status Poller pattern. It is very simple, however, it has the following drawbacks:

  • It eats up our AWS limits. When we have got a lot of Executions, it may use up a considerable amount of state transitions (currently from 400 to 1000 transitions per second depending on a region) or other resources like DynamoDB capacity if the status check requires a query.
  • Maximum Execution history size is currently 25’000 events. Each state transition generates two events (Enter & Leave) and the loop consists of three states. This gives us six events per iteration so around 4000 iterations at most – assuming that the status polling is the sole purpose of the State Machine. Also restarting the State Machine from any state is hard and makes the State Machine less readable.
  • It implements busy wait which is usually discouraged for long running jobs.
Implementation of Job Status Polling pattern as State Machine.

For simple use cases this solution worked pretty well. However, we have been running into issues with AWS limits and the resource usage. Also tracking the execution flow in complex State Machines is harder when there are loops unrelated to the actual business logic.

Another solution is to implement a worker on top of Activities. The worker would start another process and wait till its completion. This solution is similar in concept to the Job Status Poller, however, the Execution is no longer limited by the event history size. Also the visual graph of the State Machine is simpler. Despite being better, it is not perfect due to the following:

  • We have to keep running workers that will be constantly polling our Activity. Given the fact that we are using Step Functions it is likely that we would like to stick to serverless solutions.
  • If the process is not implemented within the Activity then it is equivalent to another implementation of busy wait which may result in eating up your resources and AWS limits.

In our cases this solution worked well when the actual process was implemented within the Activity Worker.

data analytics
State Machine visual graph when waiting is implemented inside Activity.

Both solutions were working in practice, however, their disadvantages pulled us off from them the reason being the use case of starting the inner State Machine from the outer one. We wanted a solution that does not require us to maintain servers and that works in a push model instead of a pull one – so we do not need to implement a busy-wait.

Using Activity as a wait condition

Let’s forget all common examples with Activities and just think about the behaviour of an Activity. In fact, it is a really simple thing – we can get a Task to execute and send a response to that Task. When the Task is pulled from the Activity, corresponding Execution will wait until the response is sent. It can also have the specified timeout, in which case, the Execution may be resumed before the response is received.

An interesting observation is that the response can be sent by anyone with an access to Step Functions as long as it has got a Task Token: the one we get by calling the GetActivityTask and then used in all worker examples to send the response.

In our case, we want to run the inner State Machine from the outer State Machine and get its result. To achieve this we will pass the Task Token to the asynchronous process. Then, that process will use this token to resume the outer State Machine when it is finished. As long as the nested process supports any way of notifying about the job completion, we can integrate it into our system. Usually, this can be implemented by saving the Task Token when the inner process starts and uses it to resume the State Machine when the completion notification is handled.

From this point on, I will replace the inner asynchronous process with the inner State Machine for simplicity but keep in mind though that this mechanism is not limited to running nested State Machines. Also the Execution of the outer State Machine will be called the outer Execution and similarly the Execution of the inner State Machine will be called the inner Execution. For clarity: the State Machine is the definition of a process and the Execution is an instance of that process.

The seemingly hard thing is to start the inner Execution when the Activity state is reached in the outer Execution. We could use the worker again, however, the solution wouldn’t be serverless anymore.

Instead, it is possible to run Lambda function in parallel to the Activity and initiate the inner Execution from such Lambda. In the picture below you can see how a State Machine looks like:

We can see that
Start the asynchronous action and Wait for the asynchronous action are both run in parallel. Their responsibilities are as follows:

  • Start asynchronous actionLambda Task calling the GetActivityTask on the Activity associated with Wait for the asynchronous action. After getting the Task Token, it starts the inner Execution and passes the Task Token down there.
  • Wait for asynchronous action – Activity Task that will pause the outer Execution until the inner Execution sends a response or a timeout is reached.

To get a better grasp on how the whole solution works, let’s take a look at the process flow:

The first step (1) is the call from Lambda function to the GetActivityTask. In this phase we are acquiring a Task Token for the Activity Task and an input for the inner Execution. It is important to use the input from the GetActivityTask response and not the one passed to Lambda due to the fact that in the result of the GetActivityTask call we may get the Task Token for the Activity in any execution of any State Machine that uses this Activity. Usually, this is not the case but our code has to take that into account.

In the next step (2) the inner Execution is started using StartExecution call from Step Functions API. The task Token and the State’s input from (1) is passed as an input for the inner Execution. In the example, this part is also executed within the Lambda function. Depending on the use case, it may be preferable to create two separate States – one for the GetActivityTask call and the second one for starting the inner Execution. The usual case for creating separate States for (1) and (2) is when there is a non-negligible risk that due to AWS limits starting of the inner Execution will fail. For simplicity sake, in this post, we are going to assume that the StartExecution call will succeed.

After (2) is finished, the outer Execution is waiting on Wait for asynchronous action State. In the meantime, the inner Execution is proceeding through its States. When it finishes, the step (3) is executed within the Send success State. It takes the Task Token from the inner Execution input along with the result that will be sent to the outer Execution. Then, it calls SendTaskSuccess. This way, the Wait for asynchronous task State from the outer Execution finishes and has the output from the inner Execution as its result.

After these three steps are over, the outer Execution can normally continue. In the example it goes straight to the End state but believe me – it might as well go to the next non-trivial State. 🙂

Next part: implementation details

In the next post of this series the actual implementation of the described technique will be shown. Not only the basic code will be presented, but also improvements and tricks with State Machine which simplify implementation and improve maintainability.

Stay tuned!

Leave a Comment

Your email address will not be published. Required fields are marked *