New: AI & text-to-SQL on your own SupersetBook a demo

Data Engineering18 Apr 2026

AWS DMS for Live Database Replication into Your Lakehouse

Learn how AWS DMS replicates operational databases into lakehouses in real-time. Step-by-step guide for engineers building analytics infrastructure.

DTD23 Team

14 minutes read

Understanding AWS DMS and the Lakehouse Architecture

AWS Database Migration Service (DMS) is a managed service that replicates data from operational databases into cloud storage—typically Amazon S3—where it forms the foundation of a data lakehouse. For engineering teams building analytics infrastructure, DMS solves a critical problem: how to move live transactional data from production systems into analytics without slowing down your database or building brittle custom ETL pipelines.

The lakehouse architecture itself is the convergence of data lakes and data warehouses. A data lake stores raw, unstructured data at scale and low cost. A data warehouse provides structured schemas, strong consistency, and query optimization. A lakehouse combines both: you get the cost efficiency and flexibility of a lake with the query performance and governance of a warehouse. AWS DMS feeds the raw layer of this architecture, capturing every change from your operational database and landing it in S3 in near-real-time.

Why does this matter for analytics teams? Traditional approaches—full database exports, weekly batch jobs, or custom Kafka producers—create latency, operational burden, and risk. DMS handles the heavy lifting: connection pooling, schema inference, error recovery, and continuous replication. For mid-market companies and scale-ups adopting managed platforms like D23 for embedded analytics and self-serve BI, having a reliable, real-time data pipeline into your lakehouse is non-negotiable.

How AWS DMS Works: The Replication Engine

DMS operates in two phases: full load and change data capture (CDC). Understanding both is essential for production deployments.

Full Load Phase

During full load, DMS connects to your source database (RDS, Aurora, on-premises MySQL, PostgreSQL, SQL Server, Oracle, etc.), scans the tables you've specified, and writes the data to S3 in batches. This phase runs once and creates your initial dataset. DMS parallelizes this operation across multiple threads, so even large tables (millions or billions of rows) complete in reasonable time.

Key parameters you control:

Table mappings: Which tables and columns to replicate
Batch size: How many rows to write per S3 object
Parallel load settings: How many threads to use (higher = faster, but more database load)
LOB handling: How to deal with large objects (BLOBs, CLOBs) — inline, limited, or separate files

During full load, your source database experiences increased read load. For production systems, you'll want to run this during low-traffic windows or use read replicas to avoid impacting user-facing queries.

Change Data Capture (CDC) Phase

After full load completes, DMS switches to CDC mode. It continuously monitors your source database's transaction logs (or equivalent) and replicates every INSERT, UPDATE, and DELETE to S3. This is where "live" replication happens.

How CDC works depends on your source engine:

Aurora/RDS MySQL: DMS reads the binary log (binlog)
PostgreSQL: DMS uses logical replication slots
SQL Server: DMS reads the transaction log
Oracle: DMS uses LogMiner or native log reader

Each source has different retention requirements. For example, MySQL's binlog might rotate every 24 hours, so if DMS falls behind, you lose data. That's why monitoring replication lag is critical.

CDC writes changes to S3 in parquet or CSV format, typically partitioned by table and timestamp. This creates a continuous stream of change events that downstream processes (like Apache Spark jobs, Databricks Delta Live Tables, or AWS Glue) consume to keep your lakehouse in sync.

Setting Up DMS: Source, Target, and Task Configuration

A DMS replication task requires three components: a source endpoint, a target endpoint, and a task definition.

Source Endpoint Configuration

Your source is your operational database. DMS needs:

Hostname, port, username, password
Network connectivity (security groups, VPN, or public internet)
Appropriate database permissions (SELECT on source tables, access to logs)

For production databases, create a dedicated read-only user with minimal privileges. This isolates DMS from your main application credentials and limits blast radius if credentials leak.

Example for MySQL:

CREATE USER 'dms_user'@'%' IDENTIFIED BY 'strong_password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'dms_user'@'%';

The REPLICATION CLIENT and REPLICATION SLAVE privileges allow DMS to read binlogs for CDC.

Target Endpoint Configuration

Your target is Amazon S3. DMS needs:

S3 bucket name and prefix (e.g., s3://my-data-lake/dms-replication/)
IAM role with permissions to write to that bucket
Optional: KMS key for encryption at rest

Create an S3 bucket with versioning disabled (unnecessary for analytics) and a lifecycle policy to transition old objects to Glacier if you need long-term retention.

IAM policy example:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:PutObject",
        "s3:GetObject",
        "s3:DeleteObject",
        "s3:ListBucket"
      ],
      "Resource": [
        "arn:aws:s3:::my-data-lake",
        "arn:aws:s3:::my-data-lake/*"
      ]
    }
  ]
}

Task Definition

A DMS task specifies:

Which tables to replicate (via table mappings)
Full load settings (parallelism, batch size)
CDC settings (enable/disable, start position)
Data transformation rules (optional)
Target metadata (table names, column mappings)

Table mappings use JSON schema. A basic example:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-all-tables",
      "object-locator": {
        "schema-name": "public",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}

For large deployments, you might exclude certain tables (logs, temporary tables) or apply transformation rules (rename columns, filter rows, change data types).

Real-World Architecture: From Database to Lakehouse Analytics

Let's walk through a concrete example. Imagine you're a SaaS company with an Aurora MySQL database holding customer accounts, subscriptions, usage metrics, and transactions. You want to replicate this into a lakehouse for analytics without impacting production.

Architecture Flow:

Source: Aurora MySQL (prod database, 500 GB, 10,000 writes/sec)
DMS Replication: Full load to S3 (12 hours), then CDC continuous
S3 Raw Layer: Parquet files, partitioned by table and date
AWS Glue / Spark Jobs: Transform raw data, apply business logic, deduplicate CDC changes
S3 Curated Layer: Delta Lake tables (via Delta Lake or Iceberg), optimized for analytics
Query Engine: Athena, Redshift Spectrum, or Databricks SQL
BI Platform: D23 dashboards and embedded analytics connected to the curated layer

As outlined in the AWS blog on creating source-to-lakehouse replication pipes using Apache Hudi, AWS Glue, AWS DMS, and Amazon Redshift, this serverless approach scales automatically and costs only for data scanned and processed.

DMS writes changes to S3 in the CDC folder:

s3://my-data-lake/dms-replication/
├── accounts/
│   ├── 20240101/
│   │   ├── FULL_LOAD_COMPLETE
│   │   ├── data-00001.parquet
│   │   └── data-00002.parquet
│   ├── 20240102/
│   │   ├── cdc-00001.parquet (CDC changes)
│   │   └── cdc-00002.parquet
├── subscriptions/
├── transactions/

Your Glue or Spark job processes these files, handles deduplication (CDC can produce duplicates if DMS restarts), applies business logic (e.g., calculate customer lifetime value), and writes to Delta Lake:

s3://my-data-lake/curated/
├── customers/ (Delta Lake table)
├── subscriptions/ (Delta Lake table)
├── transactions/ (Delta Lake table)

Once in Delta Lake, your data is queryable via Athena, Redshift, or Databricks SQL. From there, D23 can connect directly to query your lakehouse data and power dashboards, embedded analytics, or self-serve BI for your teams.

Change Data Capture: The Engine of Live Analytics

CDC is where DMS earns its keep for analytics. Without CDC, you'd need to re-export your entire database every night—wasteful and slow. CDC captures only changes, so your lakehouse stays in sync with your operational database in near-real-time (typically 1–10 seconds latency, depending on transaction volume).

CDC Challenges and Solutions

Duplicates: When DMS restarts or network hiccups occur, it may re-send the same change. Your downstream processing must be idempotent (applying the same change twice produces the same result). Use a deduplication key (source table + primary key + timestamp) in your Spark job.

Out-of-order changes: If you have high transaction volume, changes might arrive out of order. Use timestamps or sequence numbers to reorder before applying to your curated layer.

Large transactions: If your source database runs a bulk update (e.g., UPDATE accounts SET status='active' WHERE created_at < '2024-01-01'), DMS may write millions of change records. This can overwhelm your CDC pipeline. Consider running bulk operations during maintenance windows or filtering them at the DMS task level.

Log retention: Your source database's transaction log has finite retention. If DMS falls behind (e.g., due to S3 throttling or network issues), it might lose changes. Monitor replication lag and set up CloudWatch alarms to alert if lag exceeds your SLA (e.g., 5 minutes).

As detailed in the Databricks blog on using Streaming Delta Live Tables and AWS DMS for Change Data Capture from MySQL, integrating DMS CDC with Delta Live Tables creates a reliable, auto-scaling pipeline that handles these challenges gracefully.

Hands-On: Replicating an RDS Database to S3

Let's build a concrete example. We'll replicate an RDS MySQL database to S3 and then query it.

Step 1: Create DMS Replication Instance

The replication instance is the compute engine that runs your DMS tasks. Create one in the AWS Console:

Open AWS DMS
Click "Create replication instance"
Configure:
- Instance class: dms.c5.xlarge (good balance of CPU, memory, network)
- Allocated storage: 100 GB (adjust based on task complexity)
- VPC: Same VPC as your RDS instance (or use VPC peering)
- Multi-AZ: Yes (for production)
- Publicly accessible: No (access via VPC)
Click "Create"

Wait 5–10 minutes for the instance to provision.

Step 2: Create Source Endpoint (RDS MySQL)

Click "Endpoints" → "Create endpoint"
Configure:
- Endpoint type: Source
- Engine: MySQL
- Server name: Your RDS endpoint (e.g., mydb.c9akciq32.us-east-1.rds.amazonaws.com)
- Port: 3306
- Username: dms_user (create this user in RDS first)
- Password: (strong password)
Click "Test connection" (should succeed if networking is correct)
Click "Create endpoint"

Step 3: Create Target Endpoint (S3)

Click "Endpoints" → "Create endpoint"
Configure:
- Endpoint type: Target
- Engine: Amazon S3
- Service access role ARN: (IAM role with S3 write permissions)
- Bucket name: my-data-lake
- Bucket folder: dms-replication/
Click "Create endpoint"

Step 4: Create Replication Task

Click "Database migration tasks" → "Create task"
Configure:
- Task identifier: mysql-to-s3-full-load-cdc
- Replication instance: (select the one you created)
- Source endpoint: (RDS MySQL)
- Target endpoint: (S3)
- Migration type: "Migrate existing data and replicate ongoing changes" (this enables CDC)

Configure table mappings:

{
  "rules": [
    {
      "rule-type": "selection",
      "rule-id": "1",
      "rule-name": "include-all",
      "object-locator": {
        "schema-name": "%",
        "table-name": "%"
      },
      "rule-action": "include"
    }
  ]
}

Configure task settings (JSON):

{
  "TargetMetadata": {
    "ParallelLoadThreads": 4,
    "BatchApplyEnabled": true
  },
  "FullLoadSettings": {
    "TargetSchema": "",
    "CreatePkAfterFullLoad": false,
    "StopTaskCachedSourceNotApplied": false,
    "StopTaskCachedSourceApplied": false,
    "MaxFullLoadSubTasks": 8,
    "TransactionConsistencyTimeout": 600,
    "CommitRate": 50000
  },
  "ChangeProcessingDdlHandlingPolicy": {
    "HandleSourceTableDropped": true,
    "HandleSourceTableTruncated": true,
    "HandleSourceTableAltered": true
  },
  "ValidationSettings": {
    "EnableValidation": false
  }
}

Click "Create task"

The task will start immediately. Monitor progress in the task details page. Full load typically takes hours for large tables; CDC begins automatically once full load completes.

Monitoring and Troubleshooting DMS Replication

Once your task is running, you need visibility into its health and performance.

Key Metrics to Monitor

Replication lag (CDC): Time between a change on the source and its appearance in S3. Aim for < 1 minute for most use cases. If lag grows, investigate bottlenecks (database log retention, S3 throttling, network bandwidth).
Full load progress: Percentage of tables completed. If stuck on one table, the table might be too large or the DMS instance under-resourced.
Network transmit/receive: High utilization suggests you need a larger replication instance or should parallelize across multiple tasks.
Source database CPU/connections: DMS is reading from your prod database. Monitor impact and adjust parallelism if needed.
S3 PutObject errors: If DMS can't write to S3, check IAM permissions and bucket policies.

CloudWatch Alarms

Set up alarms for:

Replication lag > 5 minutes
Task status = "failed"
Source database connections > 80% of max
S3 errors in CloudTrail

Common Issues and Fixes

"Cannot connect to source database": Check security group rules. DMS replication instance needs inbound access to RDS on port 3306 (or your database port). Verify the security group rule:

Source: DMS replication instance security group
Port: 3306
Protocol: TCP

"Replication lag is growing": DMS can't keep up with changes. Options:

Increase replication instance size
Reduce parallelism (fewer threads = less source database load, but slower)
Check S3 throttling (if writing > 3,500 PutObject/sec, you'll hit limits; request increase or use S3 Intelligent-Tiering)
Increase source database log retention

"CDC stopped after full load": Binary log might have rotated and DMS lost its position. Enable binlog retention on RDS:

CALL mysql.rds_set_configuration('binlog retention hours', 24);

"Duplicate rows in S3": DMS restarts can re-send changes. Your downstream processing must deduplicate. Use Spark SQL with ROW_NUMBER() partitioned by primary key and ordered by timestamp.

Integrating with Your Analytics Stack

Once data lands in S3, you need to process it and make it queryable. This is where your lakehouse truly comes alive.

AWS Glue and Spark for Transformation

Write a Glue job (Python or Scala) that:

Reads CDC files from S3
Deduplicates based on primary key + timestamp
Applies transformations (business logic, aggregations)
Writes to Delta Lake or Iceberg

Example PySpark job:

from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window
 
spark = SparkSession.builder.appName("DMS-CDC-Process").getOrCreate()
 
# Read CDC changes
cdc_df = spark.read.parquet("s3://my-data-lake/dms-replication/accounts/*/cdc-*.parquet")
 
# Deduplicate: keep latest version of each row
window = Window.partitionBy("pk").orderBy(col("timestamp").desc())
deduped = cdc_df.withColumn("rn", row_number().over(window)).filter(col("rn") == 1).drop("rn")
 
# Write to Delta Lake
deduped.write.mode("overwrite").option("mergeSchema", "true").format("delta").save("s3://my-data-lake/curated/accounts")

Schedule this job to run every 5 minutes (or your desired refresh cadence) using Glue triggers.

Querying with Athena or Redshift Spectrum

Once in Delta Lake, query directly with Athena:

SELECT 
  account_id,
  customer_name,
  subscription_status,
  COUNT(*) as transaction_count
FROM delta.`s3://my-data-lake/curated/accounts`
GROUP BY 1, 2, 3;

Or use Redshift Spectrum for faster queries on larger datasets:

CREATE EXTERNAL TABLE accounts_ext
STORED AS PARQUET
LOCATION 's3://my-data-lake/curated/accounts';
 
SELECT * FROM accounts_ext WHERE created_at > CURRENT_DATE - 30;

Connecting to D23 for BI and Dashboards

Once your lakehouse is queryable, D23 connects directly to your data to power dashboards, embedded analytics, and self-serve BI. D23 works with Athena, Redshift, Databricks, and other query engines, so you can build charts and dashboards on top of your DMS-replicated data without additional ETL.

For engineering teams embedding analytics into products, this is powerful: your product can query the lakehouse directly via D23's API-first architecture to display real-time metrics, customer data, or operational KPIs.

Cost Optimization and Best Practices

DMS and S3 can become expensive if not managed carefully. Here are strategies to optimize.

DMS Costs

DMS charges per replication instance hour. A dms.c5.xlarge costs roughly $1.50/hour. A 24/7 instance runs ~$1,000/month. To reduce:

Right-size your instance: Start with dms.c5.large ($0.75/hour) and upgrade only if needed
Schedule tasks: If you don't need real-time CDC, run full loads on a schedule (daily/weekly) and stop the instance between runs
Use table filtering: Replicate only tables you need for analytics
Optimize parallelism: More threads = faster but higher CPU cost. Find the sweet spot

S3 Costs

S3 storage costs $0.023/GB/month for standard storage. For 1 TB of replicated data, that's ~$23/month. But if you're writing CDC changes continuously, you might accumulate data quickly. To optimize:

Partition by date: Old partitions can be moved to Glacier ($0.004/GB/month) after 30 days
Enable S3 Intelligent-Tiering: Automatically moves objects between access tiers
Compress data: Parquet compression (snappy, gzip) reduces storage 50–80%
Clean up old CDC files: After processing, delete raw CDC files to keep only curated data

Best Practices

Test on non-production data first: Run DMS on a read replica or staging database before touching production
Monitor replication lag continuously: Set CloudWatch alarms to catch issues before they impact analytics
Version your transformation logic: Store Glue jobs in version control; track which version produced which data
Document schema mappings: Keep a record of how source tables map to lakehouse tables
Use IAM roles, not access keys: Secure DMS endpoints with IAM roles instead of hardcoded credentials
Enable encryption: Use KMS for S3 encryption at rest; use SSL/TLS for DMS connections
Plan for growth: As your database grows, replication lag may increase. Monitor and scale proactively

As referenced in the AWS documentation on migrating RDS SQL Server databases to S3 data lakes, AWS provides detailed guidance on security, performance, and cost optimization for DMS-based pipelines.

Advanced Patterns: Multi-Source and Heterogeneous Replication

For larger organizations, a single DMS task might not be enough. You might need to replicate from multiple source databases (MySQL, PostgreSQL, Oracle) into a unified lakehouse.

Multi-Source Architecture

Create separate DMS tasks for each source database:

Task 1: RDS MySQL → S3 dms-replication/mysql/
Task 2: RDS PostgreSQL → S3 dms-replication/postgres/
Task 3: On-premises Oracle → S3 dms-replication/oracle/

Your transformation layer (Glue/Spark) reads from all three locations, applies consistent business logic, and writes to a unified curated layer.

Handling Schema Evolution

When your source database schema changes (new columns, renamed tables), DMS propagates those changes to S3. Your transformation job should handle this gracefully:

# Read with schema inference to handle new columns
df = spark.read.option("inferSchema", "true").parquet("s3://my-data-lake/dms-replication/*/")
 
# Write with mergeSchema to allow new columns
df.write.option("mergeSchema", "true").format("delta").save("s3://my-data-lake/curated/")

Cross-Database Joins

With data from multiple sources in a unified lakehouse, you can join across databases:

SELECT 
  c.customer_id,
  c.name,
  o.order_id,
  o.amount
FROM delta.`s3://my-data-lake/curated/customers` c
JOIN delta.`s3://my-data-lake/curated/orders` o
  ON c.customer_id = o.customer_id
WHERE o.created_at > CURRENT_DATE - 30;

This is impossible with traditional BI tools querying separate databases; the lakehouse makes it seamless.

Comparing DMS to Alternatives

DMS isn't the only way to replicate data to a lakehouse. Here's how it compares:

DMS vs. Kafka + Connectors

DMS: Managed, simpler setup, CDC built-in, no operational overhead
Kafka: More flexible, enables real-time streaming to multiple targets, higher operational complexity

For most analytics teams, DMS is sufficient. Kafka shines if you need to stream to multiple destinations (analytics, search, ML pipelines) simultaneously.

DMS vs. Custom Python/Pandas Scripts

DMS: Reliable, handles failures, scales automatically
Custom scripts: Flexible, but fragile, requires monitoring and maintenance

Custom scripts are fine for one-off migrations; DMS is better for ongoing replication.

DMS vs. Native Cloud Tools (Fivetran, Stitch)

DMS: AWS-native, lower cost, but requires more setup
Fivetran/Stitch: Easier setup, pre-built connectors, higher cost ($1,000+/month)

Choose based on your engineering bandwidth and budget.

Real-World Example: E-Commerce Company

Let's walk through a realistic scenario. TechStore is an e-commerce company with:

Production database: RDS Aurora MySQL, 200 GB, 50,000 writes/sec
Analytics needs: Real-time customer dashboards, order analytics, inventory reporting
Current pain: Weekly batch exports, 2-day lag between order and visibility

Solution with DMS:

DMS Task: Replicate entire Aurora database to S3 with CDC
- Full load: 4 hours
- CDC lag: 30 seconds
- Cost: $1.50/hour (dms.c5.xlarge instance)
Glue Job: Every 5 minutes, process CDC changes
- Deduplicate orders, customers, inventory
- Calculate derived metrics (daily revenue, top products)
- Write to Delta Lake
- Cost: ~$0.40 per run (5 minutes of Glue compute)
Query Layer: Athena queries Delta Lake
- Cost: $5 per TB scanned
- Queries complete in < 5 seconds
BI Layer: D23 dashboards connect to Athena
- Real-time order dashboard
- Customer lifetime value analysis
- Inventory alerts
- Embedded analytics in customer portal

Before DMS:

Weekly exports: 1 hour manual work
Data freshness: 2 days
Cost: $2,000/month (database engineer time)

After DMS:

Fully automated
Data freshness: 30 seconds
Cost: ~$1,200/month (DMS + Glue + Athena + D23)

The result: faster insights, better customer experience, lower total cost of ownership.

Conclusion: Building Modern Analytics Infrastructure

AWS DMS is a foundational tool for modern analytics infrastructure. It solves the hard problem of moving live transactional data into a lakehouse without custom engineering or operational overhead.

For data leaders at scale-ups and mid-market companies, DMS enables:

Real-time analytics: Data freshness in seconds, not days
Cost efficiency: Pay only for compute and storage, not for proprietary BI licenses
Flexibility: Query the lakehouse with any tool (Athena, Redshift, Databricks, D23)
Scalability: Automatically handles growing data volumes

The lakehouse architecture—raw data in S3, transformation with Glue/Spark, query with Athena, BI with D23 embedded analytics—is becoming the standard for modern companies. DMS is the reliable plumbing that makes it work.

If you're evaluating managed analytics platforms like D23, ensure your data pipeline is solid first. DMS + S3 + Delta Lake + D23 dashboards create a powerful, cost-effective analytics stack that rivals Looker, Tableau, and Power BI without the licensing overhead or platform lock-in.

For engineering teams embedding analytics into products, this architecture is especially powerful. Your product can query the lakehouse directly, display real-time metrics, and provide self-serve BI capabilities without building custom reporting infrastructure.

Start with a single DMS task on a non-production database. Measure replication lag, transformation time, and query performance. Scale from there. The foundation is simple; the possibilities are limitless.