Learn how AWS DMS replicates operational databases into lakehouses in real-time. Step-by-step guide for engineers building analytics infrastructure.
AWS Database Migration Service (DMS) is a managed service that replicates data from operational databases into cloud storage—typically Amazon S3—where it forms the foundation of a data lakehouse. For engineering teams building analytics infrastructure, DMS solves a critical problem: how to move live transactional data from production systems into analytics without slowing down your database or building brittle custom ETL pipelines.
The lakehouse architecture itself is the convergence of data lakes and data warehouses. A data lake stores raw, unstructured data at scale and low cost. A data warehouse provides structured schemas, strong consistency, and query optimization. A lakehouse combines both: you get the cost efficiency and flexibility of a lake with the query performance and governance of a warehouse. AWS DMS feeds the raw layer of this architecture, capturing every change from your operational database and landing it in S3 in near-real-time.
Why does this matter for analytics teams? Traditional approaches—full database exports, weekly batch jobs, or custom Kafka producers—create latency, operational burden, and risk. DMS handles the heavy lifting: connection pooling, schema inference, error recovery, and continuous replication. For mid-market companies and scale-ups adopting managed platforms like D23 for embedded analytics and self-serve BI, having a reliable, real-time data pipeline into your lakehouse is non-negotiable.
DMS operates in two phases: full load and change data capture (CDC). Understanding both is essential for production deployments.
Full Load Phase
During full load, DMS connects to your source database (RDS, Aurora, on-premises MySQL, PostgreSQL, SQL Server, Oracle, etc.), scans the tables you've specified, and writes the data to S3 in batches. This phase runs once and creates your initial dataset. DMS parallelizes this operation across multiple threads, so even large tables (millions or billions of rows) complete in reasonable time.
Key parameters you control:
During full load, your source database experiences increased read load. For production systems, you'll want to run this during low-traffic windows or use read replicas to avoid impacting user-facing queries.
Change Data Capture (CDC) Phase
After full load completes, DMS switches to CDC mode. It continuously monitors your source database's transaction logs (or equivalent) and replicates every INSERT, UPDATE, and DELETE to S3. This is where "live" replication happens.
How CDC works depends on your source engine:
Each source has different retention requirements. For example, MySQL's binlog might rotate every 24 hours, so if DMS falls behind, you lose data. That's why monitoring replication lag is critical.
CDC writes changes to S3 in parquet or CSV format, typically partitioned by table and timestamp. This creates a continuous stream of change events that downstream processes (like Apache Spark jobs, Databricks Delta Live Tables, or AWS Glue) consume to keep your lakehouse in sync.
A DMS replication task requires three components: a source endpoint, a target endpoint, and a task definition.
Source Endpoint Configuration
Your source is your operational database. DMS needs:
For production databases, create a dedicated read-only user with minimal privileges. This isolates DMS from your main application credentials and limits blast radius if credentials leak.
Example for MySQL:
CREATE USER 'dms_user'@'%' IDENTIFIED BY 'strong_password';
GRANT SELECT, REPLICATION CLIENT, REPLICATION SLAVE ON *.* TO 'dms_user'@'%';
The REPLICATION CLIENT and REPLICATION SLAVE privileges allow DMS to read binlogs for CDC.
Target Endpoint Configuration
Your target is Amazon S3. DMS needs:
s3://my-data-lake/dms-replication/)Create an S3 bucket with versioning disabled (unnecessary for analytics) and a lifecycle policy to transition old objects to Glacier if you need long-term retention.
IAM policy example:
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"s3:PutObject",
"s3:GetObject",
"s3:DeleteObject",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::my-data-lake",
"arn:aws:s3:::my-data-lake/*"
]
}
]
}Task Definition
A DMS task specifies:
Table mappings use JSON schema. A basic example:
{
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "include-all-tables",
"object-locator": {
"schema-name": "public",
"table-name": "%"
},
"rule-action": "include"
}
]
}For large deployments, you might exclude certain tables (logs, temporary tables) or apply transformation rules (rename columns, filter rows, change data types).
Let's walk through a concrete example. Imagine you're a SaaS company with an Aurora MySQL database holding customer accounts, subscriptions, usage metrics, and transactions. You want to replicate this into a lakehouse for analytics without impacting production.
Architecture Flow:
As outlined in the AWS blog on creating source-to-lakehouse replication pipes using Apache Hudi, AWS Glue, AWS DMS, and Amazon Redshift, this serverless approach scales automatically and costs only for data scanned and processed.
DMS writes changes to S3 in the CDC folder:
s3://my-data-lake/dms-replication/
├── accounts/
│ ├── 20240101/
│ │ ├── FULL_LOAD_COMPLETE
│ │ ├── data-00001.parquet
│ │ └── data-00002.parquet
│ ├── 20240102/
│ │ ├── cdc-00001.parquet (CDC changes)
│ │ └── cdc-00002.parquet
├── subscriptions/
├── transactions/
Your Glue or Spark job processes these files, handles deduplication (CDC can produce duplicates if DMS restarts), applies business logic (e.g., calculate customer lifetime value), and writes to Delta Lake:
s3://my-data-lake/curated/
├── customers/ (Delta Lake table)
├── subscriptions/ (Delta Lake table)
├── transactions/ (Delta Lake table)
Once in Delta Lake, your data is queryable via Athena, Redshift, or Databricks SQL. From there, D23 can connect directly to query your lakehouse data and power dashboards, embedded analytics, or self-serve BI for your teams.
CDC is where DMS earns its keep for analytics. Without CDC, you'd need to re-export your entire database every night—wasteful and slow. CDC captures only changes, so your lakehouse stays in sync with your operational database in near-real-time (typically 1–10 seconds latency, depending on transaction volume).
CDC Challenges and Solutions
Duplicates: When DMS restarts or network hiccups occur, it may re-send the same change. Your downstream processing must be idempotent (applying the same change twice produces the same result). Use a deduplication key (source table + primary key + timestamp) in your Spark job.
Out-of-order changes: If you have high transaction volume, changes might arrive out of order. Use timestamps or sequence numbers to reorder before applying to your curated layer.
Large transactions: If your source database runs a bulk update (e.g., UPDATE accounts SET status='active' WHERE created_at < '2024-01-01'), DMS may write millions of change records. This can overwhelm your CDC pipeline. Consider running bulk operations during maintenance windows or filtering them at the DMS task level.
Log retention: Your source database's transaction log has finite retention. If DMS falls behind (e.g., due to S3 throttling or network issues), it might lose changes. Monitor replication lag and set up CloudWatch alarms to alert if lag exceeds your SLA (e.g., 5 minutes).
As detailed in the Databricks blog on using Streaming Delta Live Tables and AWS DMS for Change Data Capture from MySQL, integrating DMS CDC with Delta Live Tables creates a reliable, auto-scaling pipeline that handles these challenges gracefully.
Let's build a concrete example. We'll replicate an RDS MySQL database to S3 and then query it.
Step 1: Create DMS Replication Instance
The replication instance is the compute engine that runs your DMS tasks. Create one in the AWS Console:
dms.c5.xlarge (good balance of CPU, memory, network)Wait 5–10 minutes for the instance to provision.
Step 2: Create Source Endpoint (RDS MySQL)
mydb.c9akciq32.us-east-1.rds.amazonaws.com)dms_user (create this user in RDS first)Step 3: Create Target Endpoint (S3)
my-data-lakedms-replication/Step 4: Create Replication Task
mysql-to-s3-full-load-cdc{
"rules": [
{
"rule-type": "selection",
"rule-id": "1",
"rule-name": "include-all",
"object-locator": {
"schema-name": "%",
"table-name": "%"
},
"rule-action": "include"
}
]
}{
"TargetMetadata": {
"ParallelLoadThreads": 4,
"BatchApplyEnabled": true
},
"FullLoadSettings": {
"TargetSchema": "",
"CreatePkAfterFullLoad": false,
"StopTaskCachedSourceNotApplied": false,
"StopTaskCachedSourceApplied": false,
"MaxFullLoadSubTasks": 8,
"TransactionConsistencyTimeout": 600,
"CommitRate": 50000
},
"ChangeProcessingDdlHandlingPolicy": {
"HandleSourceTableDropped": true,
"HandleSourceTableTruncated": true,
"HandleSourceTableAltered": true
},
"ValidationSettings": {
"EnableValidation": false
}
}The task will start immediately. Monitor progress in the task details page. Full load typically takes hours for large tables; CDC begins automatically once full load completes.
Once your task is running, you need visibility into its health and performance.
Key Metrics to Monitor
CloudWatch Alarms
Set up alarms for:
Common Issues and Fixes
"Cannot connect to source database": Check security group rules. DMS replication instance needs inbound access to RDS on port 3306 (or your database port). Verify the security group rule:
Source: DMS replication instance security group
Port: 3306
Protocol: TCP
"Replication lag is growing": DMS can't keep up with changes. Options:
"CDC stopped after full load": Binary log might have rotated and DMS lost its position. Enable binlog retention on RDS:
CALL mysql.rds_set_configuration('binlog retention hours', 24);"Duplicate rows in S3": DMS restarts can re-send changes. Your downstream processing must deduplicate. Use Spark SQL with ROW_NUMBER() partitioned by primary key and ordered by timestamp.
Once data lands in S3, you need to process it and make it queryable. This is where your lakehouse truly comes alive.
AWS Glue and Spark for Transformation
Write a Glue job (Python or Scala) that:
Example PySpark job:
from pyspark.sql import SparkSession
from pyspark.sql.functions import row_number, col
from pyspark.sql.window import Window
spark = SparkSession.builder.appName("DMS-CDC-Process").getOrCreate()
# Read CDC changes
cdc_df = spark.read.parquet("s3://my-data-lake/dms-replication/accounts/*/cdc-*.parquet")
# Deduplicate: keep latest version of each row
window = Window.partitionBy("pk").orderBy(col("timestamp").desc())
deduped = cdc_df.withColumn("rn", row_number().over(window)).filter(col("rn") == 1).drop("rn")
# Write to Delta Lake
deduped.write.mode("overwrite").option("mergeSchema", "true").format("delta").save("s3://my-data-lake/curated/accounts")Schedule this job to run every 5 minutes (or your desired refresh cadence) using Glue triggers.
Querying with Athena or Redshift Spectrum
Once in Delta Lake, query directly with Athena:
SELECT
account_id,
customer_name,
subscription_status,
COUNT(*) as transaction_count
FROM delta.`s3://my-data-lake/curated/accounts`
GROUP BY 1, 2, 3;Or use Redshift Spectrum for faster queries on larger datasets:
CREATE EXTERNAL TABLE accounts_ext
STORED AS PARQUET
LOCATION 's3://my-data-lake/curated/accounts';
SELECT * FROM accounts_ext WHERE created_at > CURRENT_DATE - 30;Connecting to D23 for BI and Dashboards
Once your lakehouse is queryable, D23 connects directly to your data to power dashboards, embedded analytics, and self-serve BI. D23 works with Athena, Redshift, Databricks, and other query engines, so you can build charts and dashboards on top of your DMS-replicated data without additional ETL.
For engineering teams embedding analytics into products, this is powerful: your product can query the lakehouse directly via D23's API-first architecture to display real-time metrics, customer data, or operational KPIs.
DMS and S3 can become expensive if not managed carefully. Here are strategies to optimize.
DMS Costs
DMS charges per replication instance hour. A dms.c5.xlarge costs roughly $1.50/hour. A 24/7 instance runs ~$1,000/month. To reduce:
dms.c5.large ($0.75/hour) and upgrade only if neededS3 Costs
S3 storage costs $0.023/GB/month for standard storage. For 1 TB of replicated data, that's ~$23/month. But if you're writing CDC changes continuously, you might accumulate data quickly. To optimize:
Best Practices
As referenced in the AWS documentation on migrating RDS SQL Server databases to S3 data lakes, AWS provides detailed guidance on security, performance, and cost optimization for DMS-based pipelines.
For larger organizations, a single DMS task might not be enough. You might need to replicate from multiple source databases (MySQL, PostgreSQL, Oracle) into a unified lakehouse.
Multi-Source Architecture
Create separate DMS tasks for each source database:
dms-replication/mysql/dms-replication/postgres/dms-replication/oracle/Your transformation layer (Glue/Spark) reads from all three locations, applies consistent business logic, and writes to a unified curated layer.
Handling Schema Evolution
When your source database schema changes (new columns, renamed tables), DMS propagates those changes to S3. Your transformation job should handle this gracefully:
# Read with schema inference to handle new columns
df = spark.read.option("inferSchema", "true").parquet("s3://my-data-lake/dms-replication/*/")
# Write with mergeSchema to allow new columns
df.write.option("mergeSchema", "true").format("delta").save("s3://my-data-lake/curated/")Cross-Database Joins
With data from multiple sources in a unified lakehouse, you can join across databases:
SELECT
c.customer_id,
c.name,
o.order_id,
o.amount
FROM delta.`s3://my-data-lake/curated/customers` c
JOIN delta.`s3://my-data-lake/curated/orders` o
ON c.customer_id = o.customer_id
WHERE o.created_at > CURRENT_DATE - 30;This is impossible with traditional BI tools querying separate databases; the lakehouse makes it seamless.
DMS isn't the only way to replicate data to a lakehouse. Here's how it compares:
DMS vs. Kafka + Connectors
For most analytics teams, DMS is sufficient. Kafka shines if you need to stream to multiple destinations (analytics, search, ML pipelines) simultaneously.
DMS vs. Custom Python/Pandas Scripts
Custom scripts are fine for one-off migrations; DMS is better for ongoing replication.
DMS vs. Native Cloud Tools (Fivetran, Stitch)
Choose based on your engineering bandwidth and budget.
Let's walk through a realistic scenario. TechStore is an e-commerce company with:
Solution with DMS:
DMS Task: Replicate entire Aurora database to S3 with CDC
Glue Job: Every 5 minutes, process CDC changes
Query Layer: Athena queries Delta Lake
BI Layer: D23 dashboards connect to Athena
Before DMS:
After DMS:
The result: faster insights, better customer experience, lower total cost of ownership.
AWS DMS is a foundational tool for modern analytics infrastructure. It solves the hard problem of moving live transactional data into a lakehouse without custom engineering or operational overhead.
For data leaders at scale-ups and mid-market companies, DMS enables:
The lakehouse architecture—raw data in S3, transformation with Glue/Spark, query with Athena, BI with D23 embedded analytics—is becoming the standard for modern companies. DMS is the reliable plumbing that makes it work.
If you're evaluating managed analytics platforms like D23, ensure your data pipeline is solid first. DMS + S3 + Delta Lake + D23 dashboards create a powerful, cost-effective analytics stack that rivals Looker, Tableau, and Power BI without the licensing overhead or platform lock-in.
For engineering teams embedding analytics into products, this architecture is especially powerful. Your product can query the lakehouse directly, display real-time metrics, and provide self-serve BI capabilities without building custom reporting infrastructure.
Start with a single DMS task on a non-production database. Measure replication lag, transformation time, and query performance. Scale from there. The foundation is simple; the possibilities are limitless.