Create Custom Metrics for AWS Glue Jobs
Updated: Jul 11
As you know, CloudWatch lets you publish custom metrics from your applications. These are metrics that are not provided by the AWS services themselves.
Traditionally, applications published custom metrics by calling CloudWatch's PutMetricData API, most commonly through the AWS SDK for the language of your choice.
With the new CloudWatch Embedded Metric Format (EMF), you can simply embed the custom metrics in the logs that your application sends to CloudWatch, and CloudWatch will automatically extract the custom metrics from the log data. You can then graph these metrics in the CloudWatch console and even set alerts and alarms on them like other out-of-the-box metrics.
This works anywhere you publish CloudWatch logs from: EC2 instances, on-prem VMs, Docker/Kubernetes containers in ECS/EKS, Lambda functions, and so on.
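To make the idea concrete, here is a minimal sketch of what an EMF log record looks like when built by hand. The helper name build_emf_record and the namespace and job name are placeholders for illustration only; in practice a library such as aws-embedded-metrics produces this structure for you.

```python
import json
import time


def build_emf_record(namespace, dimensions, metric_name, value, unit="Count"):
    """Build one Embedded Metric Format record as a plain dict (hypothetical helper)."""
    record = {
        "_aws": {
            # EMF expects the timestamp in milliseconds since the epoch.
            "Timestamp": int(time.time() * 1000),
            "CloudWatchMetrics": [
                {
                    "Namespace": namespace,
                    # One inner list of dimension names per dimension set.
                    "Dimensions": [sorted(dimensions)],
                    "Metrics": [{"Name": metric_name, "Unit": unit}],
                }
            ],
        },
        # The metric value lives at the top level, keyed by the metric name.
        metric_name: value,
    }
    # Dimension values also live at the top level, next to the metric value.
    record.update(dimensions)
    return record


if __name__ == "__main__":
    # Placeholder namespace and job name, not taken from a real account.
    record = build_emf_record("ExampleNamespace", {"JobName": "example-job"}, "Failed", 1)
    print(json.dumps(record))
```

Writing this JSON as a single log line is the whole mechanism: the metadata under "_aws" tells CloudWatch which top-level keys to extract as metrics and which to treat as dimensions.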
In this case, we focus on custom metrics for AWS Glue job runs. The goal is to create a CloudWatch alarm that tells us whether a Glue job run succeeded or not. The proposed solution is shown in the following diagram.
Infrastructure
We will create the infrastructure and permissions needed with Terraform.
resource "aws_cloudwatch_event_rule" "custom_glue_job_metrics" {
  name        = "CustomGlueJobMetrics"
  description = "Create custom metrics from glue job events"
  is_enabled  = true
  event_pattern = jsonencode(
    {
      "source": [
        "aws.glue"
      ],
      "detail-type": [
        "Glue Job State Change"
      ]
    }
  )
}

resource "aws_cloudwatch_event_target" "custom_glue_job_metrics" {
  target_id = "CustomGlueJobMetrics"
  rule      = aws_cloudwatch_event_rule.custom_glue_job_metrics.name
  arn       = aws_lambda_function.custom_glue_job_metrics.arn
  retry_policy {
    maximum_event_age_in_seconds = 3600
    maximum_retry_attempts       = 0
  }
}

resource "aws_lambda_function" "custom_glue_job_metrics" {
  function_name    = "CustomGlueJobMetrics"
  filename         = "python/handler.zip"
  source_code_hash = filebase64sha256("python/handler.zip")
  role             = aws_iam_role.custom_glue_job_metrics.arn
  handler          = "handler.handler"
  runtime          = "python3.9"
  timeout          = 90
  tracing_config {
    mode = "PassThrough"
  }
}

resource "aws_lambda_permission" "allow_cloudwatch" {
  statement_id  = "AllowExecutionFromCloudWatch"
  action        = "lambda:InvokeFunction"
  function_name = aws_lambda_function.custom_glue_job_metrics.function_name
  principal     = "events.amazonaws.com"
  source_arn    = aws_cloudwatch_event_rule.custom_glue_job_metrics.arn
}

resource "aws_iam_role" "custom_glue_job_metrics" {
  name = "CustomGlueJobMetrics"
  assume_role_policy = jsonencode(
    {
      Version : "2012-10-17",
      Statement : [
        {
          Effect : "Allow",
          Principal : {
            Service : "lambda.amazonaws.com"
          },
          Action : "sts:AssumeRole"
        }
      ]
  })
}

resource "aws_iam_role_policy" "custom_glue_job_metrics" {
  name = "CustomGlueJobMetrics"
  role = aws_iam_role.custom_glue_job_metrics.id
  policy = jsonencode({
    Version : "2012-10-17",
    Statement : [
      {
        Effect : "Allow",
        Action : [
          "logs:CreateLogGroup",
          "logs:CreateLogStream",
          "logs:PutLogEvents"
        ],
        Resource : "arn:aws:logs:*:*:*"
      }
    ]
  })
}
At this point we have created an event rule, an event target, and a Lambda function (which will run handler.py), along with the permissions they need.
Note that we are using the default event bus.
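To illustrate which events the rule forwards to the Lambda function, the sketch below pairs a sample "Glue Job State Change" event with a deliberately simplified matcher. The field values are placeholders, and the matches function is a hypothetical helper that only compares top-level keys; real EventBridge pattern matching supports much more (nested fields, prefixes, and so on).

```python
# Sample event with placeholder values, shaped like a Glue job state change.
SAMPLE_EVENT = {
    "version": "0",
    "detail-type": "Glue Job State Change",
    "source": "aws.glue",
    "account": "123456789012",
    "region": "us-east-1",
    "detail": {
        "jobName": "example-job",
        "state": "FAILED",
        "jobRunId": "jr_0123456789abcdef",
        "message": "example failure message",
    },
}

# Same pattern as the one passed to jsonencode() in the event rule.
PATTERN = {"source": ["aws.glue"], "detail-type": ["Glue Job State Change"]}


def matches(pattern, event):
    """Simplified matching: each pattern key must be present in the event,
    and the event's value must be one of the values listed in the pattern."""
    return all(event.get(key) in allowed for key, allowed in pattern.items())


if __name__ == "__main__":
    print(matches(PATTERN, SAMPLE_EVENT))
    # An event from another service does not match the rule.
    print(matches(PATTERN, {**SAMPLE_EVENT, "source": "aws.ec2"}))
```

The handler only reads event["detail-type"] and the jobName, jobRunId, and state keys under event["detail"], so those are the fields that matter here.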
Python Code:
The Python code that will run in the Lambda function is the following:
from aws_embedded_metrics import metric_scope


@metric_scope
def handler(event, _context, metrics):
    glue_job_name = event["detail"]["jobName"]
    glue_job_run_id = event["detail"]["jobRunId"]

    metrics.set_namespace("GlueBasicMetrics")
    # Two dimension sets: one per job, one per job run.
    metrics.set_dimensions(
        {"JobName": glue_job_name},
        {"JobName": glue_job_name, "JobRunId": glue_job_run_id},
    )

    if event["detail-type"] == "Glue Job State Change":
        state = event["detail"]["state"]
        if state not in ["SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED"]:
            raise AttributeError("State is not supported.")

        # Emit 1 for the state that was reached, e.g. "Succeeded" or "Failed".
        metrics.put_metric(key=state.capitalize(), value=1, unit="Count")

        # Also emit an explicit 0 for the opposite outcome.
        if state == "SUCCEEDED":
            metrics.put_metric(key="Failed", value=0, unit="Count")
        else:
            metrics.put_metric(key="Succeeded", value=0, unit="Count")
This code creates a new namespace (GlueBasicMetrics) in CloudWatch metrics with two dimension sets (JobName, and JobName plus JobRunId). The metrics are updated each time the Glue job changes state, since that event is what triggers the function.
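The mapping from job state to emitted metrics can be checked without AWS by pulling it into a pure function. The sketch below mirrors the handler's logic; metrics_for_state and SUPPORTED_STATES are hypothetical names introduced here for illustration, and the helper raises ValueError rather than the handler's AttributeError.

```python
SUPPORTED_STATES = ("SUCCEEDED", "FAILED", "TIMEOUT", "STOPPED")


def metrics_for_state(state):
    """Return the metric name -> value pairs the handler emits for one state."""
    if state not in SUPPORTED_STATES:
        raise ValueError(f"State {state!r} is not supported.")
    # 1 for the state that was reached, e.g. "Succeeded" or "Timeout".
    emitted = {state.capitalize(): 1}
    # Explicit 0 for the opposite outcome, as in the handler.
    if state == "SUCCEEDED":
        emitted["Failed"] = 0
    else:
        emitted["Succeeded"] = 0
    return emitted


if __name__ == "__main__":
    for state in SUPPORTED_STATES:
        print(state, metrics_for_state(state))
```

One consequence worth keeping in mind: with this logic the TIMEOUT and STOPPED states emit their own "Timeout" and "Stopped" metrics and do not emit "Failed", so an alarm on the "Failed" metric alone would not react to those states.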
Install module and libraries
As you can see in the Terraform code, we upload the source code as a file called handler.zip. It is very important that the Python code, the installed module, and its dependencies are all compressed into this .zip file at the same path level.
Installation:
pip3 install aws-embedded-metrics
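One possible way to produce handler.zip is sketched below. It assumes you copied handler.py into a staging directory and installed the module into that same directory (for example with pip3 install aws-embedded-metrics --target build, where the directory name build is only an example). The build_zip function is a hypothetical helper, not part of any AWS tooling.

```python
import zipfile
from pathlib import Path


def build_zip(build_dir, zip_path):
    """Zip everything under build_dir so its contents sit at the archive root."""
    build_dir = Path(build_dir)
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(build_dir.rglob("*")):
            if path.is_file():
                # arcname is relative to build_dir, so handler.py and the
                # installed packages end up at the same (root) level.
                archive.write(path, arcname=path.relative_to(build_dir).as_posix())
    return zip_path


if __name__ == "__main__":
    # Assumed layout: build/handler.py plus the pip-installed packages,
    # with the python/ output directory already in place.
    build_zip("build", "python/handler.zip")
```

Keep the output file outside the staging directory; otherwise the archive would try to include itself.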
CloudWatch Alarm
Now that we have the necessary metrics, we can focus on the main objective of this article: creating a CloudWatch alarm that identifies when a job run failed. We will create this alarm with Terraform too.
resource "aws_cloudwatch_metric_alarm" "job_failed" {
  alarm_name          = "EtlJobFailed"
  metric_name         = "Failed"
  namespace           = "GlueBasicMetrics"
  period              = 60
  statistic           = "Sum"
  comparison_operator = "GreaterThanOrEqualToThreshold"
  threshold           = 1
  evaluation_periods  = 1
  treat_missing_data  = "ignore"
  dimensions = {
    JobName = "IotEtlTransformationJob"
  }
  # Resource references must not be quoted, or they become literal strings.
  alarm_actions = [aws_sns_topic.mail.arn, aws_sns_topic.chatbot.arn]
}
As an example, we use two SNS topics: one for email and one for a chatbot. Keep in mind that you must create the SNS topics (aws_sns_topic.mail and aws_sns_topic.chatbot here) that are notified when the alarm fires.
I hope that this TeraTip will be useful for you and help you accomplish the Performance Efficiency Pillar and Operational Excellence Pillar of the Well-Architected Framework in your environment.
Martín Carletti
Cloud Engineer
Teracloud