Jul. 29, 2022
AWS Step Functions and Apache Airflow are workflow orchestration services that allow companies to automate business processes by modeling them as workflows. Both services often play a vital role in enabling companies to increase efficiency while simultaneously reducing costs and minimizing errors. The aim of this article is to articulate the key distinctions between these two services and help companies make the choice that best meets their requirements.
Released in December 2016, AWS Step Functions is a “serverless” orchestration service that allows developers to combine AWS Lambda functions with other AWS services to create custom workflows. The intuitive AWS Step Functions graphical console displays an application’s workflow as a series of event-driven steps.
Before we discuss Step Functions, it is important to talk about state machines. MDN defines a state machine as the following: “A state machine is a mathematical abstraction used to design algorithms. A state machine reads a set of inputs and changes to a different state based on the inputs. A state is a description of the status of a system waiting to execute a transition.”
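The definition above can be illustrated with a minimal Python sketch. The turnstile example is our own illustration of the concept, not something tied to AWS:

```python
# A minimal finite-state machine: two states, two inputs, and a transition table.
TRANSITIONS = {
    ("locked", "coin"): "unlocked",    # paying unlocks the turnstile
    ("locked", "push"): "locked",      # pushing while locked does nothing
    ("unlocked", "push"): "locked",    # passing through locks it again
    ("unlocked", "coin"): "unlocked",  # extra coins are accepted but change nothing
}

def run(start, inputs):
    """Read a sequence of inputs and return the final state."""
    state = start
    for event in inputs:
        state = TRANSITIONS[(state, event)]
    return state

print(run("locked", ["coin", "push"]))  # -> locked
```

Each (state, input) pair maps to exactly one next state, which is what makes the machine’s behavior predictable and easy to reason about.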
What Do Step Functions Do?
AWS Step Functions help implement state machines. They have built-in controls to monitor the status of each step in a workflow and ensure each task is executed in the expected order.
AWS Step Functions can be employed for multiple purposes. Primarily, a Step Function manages the components and logic of an application, allowing developers to write less code and focus their efforts on building and updating the application itself.
AWS Step Functions offers two types of workflows: Standard and Express.
A Standard workflow can be used for both short-lived processes and long-running processes, with executions lasting up to one year. Standard workflows are the ideal choice for long-running workflows because they record a complete execution history and enable visual debugging.
An Express workflow can run for a maximum of five minutes. It is ideal for large numbers of fast executions that do not require a detailed execution history, such as continuous data processing and IoT data ingestion.
A state machine is defined using JSON or YAML. The following is an example JSON definition of a state machine named “HelloWorld” that runs two Lambda functions in sequence (the Lambda ARNs below are placeholders):

```json
{
  "Comment": "A simple sequential workflow",
  "StartAt": "HelloWorld",
  "States": {
    "HelloWorld": {
      "Comment": "Run the HelloWorld Lambda function",
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:HelloWorld",
      "Next": "GoodbyeWorld"
    },
    "GoodbyeWorld": {
      "Comment": "Run the GoodbyeWorld Lambda function",
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:GoodbyeWorld",
      "End": true
    }
  }
}
```
Each step of a workflow is defined as a state, identified by a unique string. Individual states can make decisions based on their input, perform actions, and pass output to other states.
In AWS Step Functions, workflows are defined using the Amazon States Language. The Step Functions console provides a graphical representation of a state machine to help visualize the logic of an application.
States can perform various functions in a state machine:
- Task: performs a unit of work, such as invoking a Lambda function
- Choice: branches between paths of execution based on its input
- Pass: passes its input to its output, optionally injecting fixed data
- Wait: delays execution for a set time or until a given timestamp
- Parallel: runs branches of the workflow concurrently
- Map: iterates over a set of items, running steps for each one
- Succeed or Fail: stops an execution with success or failure
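As an illustration of how a Choice state routes execution, here is a toy sketch in plain Python (ours, not the AWS runtime); it models only one comparison operator, `NumericGreaterThan`:

```python
# A Choice state: rules are checked in order, with a Default fallback.
CHOICE_STATE = {
    "Type": "Choice",
    "Choices": [{"NumericGreaterThan": 10, "Next": "BigNumber"}],
    "Default": "SmallNumber",
}

def choose(choice_state, value):
    """Return the name of the next state for a numeric input."""
    for rule in choice_state["Choices"]:
        if value > rule["NumericGreaterThan"]:  # only this operator is modeled here
            return rule["Next"]
    return choice_state["Default"]

print(choose(CHOICE_STATE, 42))  # -> BigNumber
print(choose(CHOICE_STATE, 3))   # -> SmallNumber
```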
Unlike AWS Step Functions, which uses its own language to define workflows, Apache Airflow workflows are written in Python, with each task represented by a Python class and arranged in a logical sequence. The entire workflow is defined and executed in one place. The following example defines a simple DAG:
```python
from datetime import datetime

from airflow import DAG
from airflow.operators.dummy_operator import DummyOperator
from airflow.operators.python_operator import PythonOperator

def print_hello():
    return 'Hello world!'

dag = DAG('hello_world', description='Simple tutorial DAG',
          schedule_interval='0 12 * * *',
          start_date=datetime(2017, 3, 20), catchup=False)

dummy_operator = DummyOperator(task_id='dummy_task', retries=3, dag=dag)

hello_operator = PythonOperator(task_id='hello_task', python_callable=print_hello, dag=dag)

dummy_operator >> hello_operator
```
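The `>>` dependency syntax used above is ordinary Python operator overloading. A minimal sketch of the pattern (our own illustration, not Airflow’s actual implementation):

```python
class Task:
    """A toy task that records its downstream dependencies."""

    def __init__(self, task_id):
        self.task_id = task_id
        self.downstream = []

    def __rshift__(self, other):
        """`a >> b` records b as downstream of a and returns b for chaining."""
        self.downstream.append(other)
        return other

extract, transform, load = Task("extract"), Task("transform"), Task("load")
extract >> transform >> load  # same chaining style as an Airflow DAG

print([t.task_id for t in extract.downstream])  # -> ['transform']
```

Because `__rshift__` returns its right-hand operand, dependencies can be chained into arbitrarily long sequences in a single readable line.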
One of the big differences between Apache Airflow and AWS Step Functions is that the former is not serverless. When working with Apache Airflow, it is therefore necessary to deploy infrastructure to host the service and to build user access management logic. Alternatively, Apache Airflow can be set up in a private network such as an AWS VPC, with access granted via VPN. Either way, using Apache Airflow increases costs, since the service requires infrastructure to be deployed and maintained.
Despite this apparent disadvantage, Apache Airflow remains an attractive choice for companies since it offers features not provided by AWS Step Functions, especially at the level of task visualization. Apache Airflow employs a simple interface providing users with a global view of all workflows and allows users to access a summary of all executions. AWS Step Functions grants users access to workflow data but does not integrate it directly within its interface. To gain a global view of their workflows, users have to fetch metrics using Amazon CloudWatch and set up dashboards to visualize this data.
Apache Airflow allows users to visually track the execution of each state, see live logs, stop an execution, or resume it at a specific step. With AWS Step Functions, by contrast, users cannot visually track executions; they can stop a state machine’s execution but cannot resume it from a specific step.
Gaining access to logs can also be cumbersome with AWS Step Functions: the process involves multiple steps, with users redirected to a separate log group for each Lambda function. They must then manually search the mixed logs for a specific execution time to find the information they need.
A separate case study describes the process the TrackIt team used to set up a large data transformation pipeline with Apache Airflow. Readers can access it at the following link: https://medium.com/trackit/data-pipeline-architecture-optimization-apache-airflow-implementation-915821d5ce5b
To conclude, AWS Step Functions and Apache Airflow are both used for task orchestration and have similar features. In the example described in the case study, either service could have been chosen. However, TrackIt decided to use Apache Airflow because the visualization of data and processes more closely met the client’s needs.