Dec 2024 · 7 min read

Deploying ML Models with AWS SageMaker

AWS · MLOps · SageMaker

Getting an ML model from a Jupyter notebook into production is one of the most underestimated engineering challenges. SageMaker provides a managed path that handles a lot of the heavy lifting — but understanding when and how to use its serverless capabilities is critical.

Why Serverless Inference

SageMaker's standard real-time endpoints use always-on instances — you pay whether or not requests are coming in. Serverless endpoints scale to zero, charging only for invocations and compute time. For low-to-medium traffic workloads, this reduces costs dramatically.
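A quick back-of-the-envelope comparison makes the trade-off concrete. The rates below are illustrative placeholders, not current AWS pricing — check the pricing page for your region — but the shape of the calculation holds:

```python
# Rough cost sketch: always-on real-time endpoint vs. serverless.
# All dollar rates below are hypothetical placeholders.

HOURS_PER_MONTH = 730

def realtime_monthly_cost(hourly_rate):
    """An always-on instance bills every hour, idle or not."""
    return hourly_rate * HOURS_PER_MONTH

def serverless_monthly_cost(requests, avg_seconds, memory_gb, price_per_gb_second):
    """Serverless bills only for request duration x configured memory."""
    return requests * avg_seconds * memory_gb * price_per_gb_second

# Example workload: 100k requests/month, 200 ms each, 3 GB memory
always_on = realtime_monthly_cost(hourly_rate=0.115)  # hypothetical rate
serverless = serverless_monthly_cost(
    requests=100_000, avg_seconds=0.2,
    memory_gb=3, price_per_gb_second=0.0000200,       # hypothetical rate
)
print(f"always-on: ${always_on:.2f}/mo, serverless: ${serverless:.2f}/mo")
```

At low request volumes the serverless bill is a small fraction of the always-on one; as traffic grows toward saturating an instance, the lines cross and provisioned endpoints win.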

Deployment Steps

import sagemaker
from sagemaker.model import Model
from sagemaker.serverless import ServerlessInferenceConfig

session = sagemaker.Session()
role = sagemaker.get_execution_role()  # IAM role with SageMaker permissions

# 1. Upload model artifacts to S3
model_data = session.upload_data(
    'model.tar.gz',
    bucket='my-bucket',
    key_prefix='models/v1'
)

# 2. Create SageMaker Model
model = Model(
    image_uri=container_image,  # ECR URI of your inference container
    model_data=model_data,
    role=role,
)

# 3. Deploy serverless endpoint
predictor = model.deploy(
    serverless_inference_config=ServerlessInferenceConfig(
        memory_size_in_mb=3008,
        max_concurrency=5,
    )
)

# 4. Run inference
response = predictor.predict(payload)
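One detail worth understanding before calling predict: the payload must be serialized to whatever content type the container expects (the SDK's sagemaker.serializers.JSONSerializer and JSONDeserializer handle this for JSON containers). A minimal stdlib stand-in, assuming a JSON-in/JSON-out container, shows what happens on the wire:

```python
import json

def serialize_payload(payload):
    """What a JSON serializer does: Python object -> JSON bytes for the endpoint."""
    return json.dumps(payload).encode("utf-8")

def deserialize_response(body):
    """What a JSON deserializer does: response bytes -> Python object."""
    return json.loads(body.decode("utf-8"))

payload = {"inputs": [1.0, 2.0, 3.0]}
body = serialize_payload(payload)       # sent with Content-Type: application/json
print(deserialize_response(body))
```

A mismatched content type is one of the most common first-deploy errors: the endpoint returns a 4xx from the container rather than anything model-related.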

Cold Start Mitigation

Serverless endpoints have cold starts — the first request after a period of inactivity incurs container initialization time (typically 2–10 seconds). Minimize this by keeping container images small, pre-loading model weights at container start, and warming the endpoint with scheduled pings if latency is critical.
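The "pre-load at container start" advice maps directly onto the model_fn/predict_fn hooks that SageMaker's inference toolkits call in your inference script. The sketch below uses a trivial stand-in for real weight loading (e.g. joblib or torch deserialization), but the structure is the point: loading happens once in model_fn, so only the cold start pays for it, never a warm request:

```python
# Sketch of a SageMaker-style inference script. model_fn and predict_fn are
# the hook names the inference toolkits invoke; the "model" here is a
# placeholder for real weight loading.

_MODEL_CACHE = {}

def model_fn(model_dir):
    """Called once at container startup: do all expensive loading here."""
    if "model" not in _MODEL_CACHE:
        # Stand-in for e.g. joblib.load(os.path.join(model_dir, "model.pkl"))
        _MODEL_CACHE["model"] = lambda x: x * 2
    return _MODEL_CACHE["model"]

def predict_fn(input_data, model):
    """Called per request: inference only, no loading or I/O setup."""
    return model(input_data)
```

If inference code instead loads weights inside the request path, every cold start *and* every warm request eats the deserialization cost, which defeats the mitigation entirely.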