
10 Crucial Docker Commands Every Data Engineer Should Know (With Examples)


Imagine you’re a data engineer working on a large-scale data pipeline. Your team needs an efficient way to deploy and run ETL jobs, manage dependencies, and ensure consistency across different environments. Traditionally, setting up environments would be a time-consuming and error-prone process. But with Docker, you can package everything into a container and run it seamlessly anywhere.

Docker has transformed data engineering by simplifying deployment, scaling, and automation. In this article, we’ll explore 10 essential Docker commands that every data engineer should master. These commands will help you efficiently manage containers, debug workflows, and optimize data engineering tasks. Let’s dive in!



1. docker pull – Downloading Images Efficiently

Before running a Docker container, you need an image. Docker Hub hosts thousands of images for various applications, including databases, data processing tools, and machine learning frameworks.

Usage:

docker pull python:3.9

Key Points:

Images are pulled from Docker Hub by default; a full registry path can be used for private registries.
If no tag is specified, Docker pulls the latest tag.
Pinning an explicit tag (for example, python:3.9) keeps environments reproducible across machines.

Example:

A data engineer working on Apache Spark can pull an image like this:

docker pull bitnami/spark:latest
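
If image size or reproducibility matters, pull a specific variant instead of latest. The tags below are illustrative; check Docker Hub for the versions currently published:

# Slim variant keeps the image lightweight for simple ETL scripts
docker pull python:3.9-slim

# Pinning a major version avoids surprise upgrades on re-pulls
docker pull postgres:15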

2. docker run – Running Containers for Data Workloads

The docker run command creates and starts a container from an image.

Basic Syntax:

docker run -it --name my_container python:3.9

Common Flags:

Flag      Description
-it       Interactive mode (keeps the terminal open)
--name    Assigns a custom name to the container
-d        Runs the container in detached mode
-p        Maps ports from the container to the host

Example:

Running PostgreSQL for a data warehouse project:

docker run -d --name my_postgres -e POSTGRES_PASSWORD=mysecret -p 5432:5432 postgres:latest
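
Beyond the flags above, the -v flag mounts a local directory into the container, which is handy for one-off jobs. A minimal sketch, assuming a hypothetical etl_job.py in the current directory:

# Mount the current directory, set it as the working directory, run the script,
# and remove the container when it exits (--rm)
docker run --rm -v "$(pwd)":/app -w /app python:3.9 python etl_job.py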

3. docker ps – Listing Active Containers

To see which containers are running, use:

docker ps

Output Example:

CONTAINER ID   IMAGE             STATUS          PORTS
abc123         postgres:latest   Up 10 minutes   0.0.0.0:5432->5432/tcp

For all containers (including stopped ones):

docker ps -a
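
docker ps also accepts a --format flag to trim the output down to the columns you care about, for example:

# Show only names, status, and ports as a table
docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"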

4. docker stop & docker rm – Managing and Cleaning Up Containers

Stopping a Container:

docker stop my_container

Removing a Container:

docker rm my_container

Example Scenario:

A data engineer testing an ETL job in Airflow wants to stop and remove old containers:

docker stop airflow_container
docker rm airflow_container
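
The two steps can be combined, and filters make bulk clean-up easier. A quick sketch (the etl_ name prefix is just an illustration):

# Force-remove a running container in one step
docker rm -f airflow_container

# Remove every exited container whose name starts with "etl_"
docker rm $(docker ps -aq --filter "name=etl_" --filter "status=exited")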

5. docker images – Viewing and Managing Docker Images

In Docker, images act as blueprints for containers. To efficiently manage containerized applications, data engineers must frequently check available images and clean up unnecessary ones.

Listing Available Images

To view all Docker images on your system, use:

docker images

Output Example:

REPOSITORY   TAG      IMAGE ID   CREATED
python       3.9      a1b2c3d4   2 weeks ago
postgres     latest   e5f6g7h8   1 month ago

Filtering Images

To list images for a specific repository, such as python:

docker images python

To filter images by a specific label:

docker images --filter "label=project=mydata"

Removing Unused Images

To delete an unused image and free up disk space:

docker rmi image_id

For removing all dangling images (unused layers):

docker image prune

Example Use Case:

A data engineer working on an ETL pipeline realizes old images are consuming storage. Running the following ensures only necessary images remain:

docker images
# Identify old images

docker rmi old_image_id
# Remove unnecessary ones

Regularly managing images keeps the system optimized, preventing clutter and ensuring smoother deployments.
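
Before pruning, it helps to see how much disk space images, containers, and volumes are actually consuming:

# Summarize disk usage by images, containers, local volumes, and build cache
docker system df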


6. docker exec – Running Commands Inside a Running Container

In many data engineering workflows, you may need to interact with a running container to inspect its environment, run commands, or troubleshoot issues. The docker exec command allows you to do just that without stopping or restarting the container.

Basic Syntax:

docker exec -it <container_name> <command>

The -it flag ensures an interactive terminal session, useful when executing shell commands.

Example: Accessing a Running Container

Assume you have a running PostgreSQL container named my_postgres. To access it and interact with the database:

docker exec -it my_postgres psql -U postgres

This will open a PostgreSQL interactive terminal.

Use Case: Running Docker Commands Inside an ETL Pipeline Container

A data engineer working with Apache Airflow may need to check the installed Python packages within a running container:

docker exec -it airflow_container pip list

If dependencies are missing, they can install them directly:

docker exec -it airflow_container pip install pandas numpy

Executing a Bash Shell Inside a Running Container

For containers running Linux-based environments, you can open a shell session:

docker exec -it my_container bash

Or if using Alpine Linux:

docker exec -it my_container sh

Automating Container Maintenance with docker exec

For scheduled maintenance tasks, you can run scripts inside a running container. For instance, if a database backup script is stored at /backup.sh inside a PostgreSQL container, you can execute it as follows:

docker exec -it my_postgres sh /backup.sh

This flexibility makes docker exec a powerful tool for real-time debugging and maintenance in containerized data engineering environments.
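
docker exec also works non-interactively, which is useful in scripts. For example, dumping a database from the my_postgres container to the host (mydb is a placeholder database name):

# Redirection happens on the host, so the dump lands in the current directory
docker exec my_postgres pg_dump -U postgres mydb > mydb_backup.sql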

7. docker logs – Debugging with Container Logs

If a container crashes or behaves unexpectedly, check the logs:

docker logs my_container

Example: Debugging an Apache Kafka container:

docker logs kafka_broker
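
A few extra flags make log inspection more targeted:

# Follow new log lines as they arrive
docker logs -f kafka_broker

# Show only the last 100 lines produced in the past 30 minutes
docker logs --tail 100 --since 30m kafka_broker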


8. docker commit – Saving Changes to a New Image

If you modify a container and want to save the changes:

docker commit my_container my_custom_image

Example: A data engineer installing Python libraries in a container:

docker exec -it my_python_container pip install pandas numpy
docker commit my_python_container python_with_libs
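
A quick sanity check confirms the committed image contains the new libraries (using the python_with_libs image from the example above):

# Confirm pandas and numpy are baked into the new image
docker run --rm python_with_libs python -c "import pandas, numpy; print(pandas.__version__)"

For repeatable builds, a Dockerfile (see the FAQ below) is generally preferred over docker commit, since it documents exactly how the image was produced.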

9. docker network – Managing Container Networking

List networks:

docker network ls

Create a custom network:

docker network create my_network

Run a container in the custom network:

docker run -d --name db_container --network=my_network -e POSTGRES_PASSWORD=mysecret postgres:latest
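
Containers on the same user-defined network can reach each other by container name. For example, a throwaway client can connect to db_container without any port mapping (assuming the password set above):

# psql ships with the postgres image; the hostname is simply the container name
docker run --rm --network=my_network -e PGPASSWORD=mysecret postgres:latest \
  psql -h db_container -U postgres -c "SELECT version();"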

10. docker-compose up – Orchestrating Multi-Container Applications

docker-compose is crucial for data engineers running multiple services (e.g., Airflow, Kafka, Spark).

Example docker-compose.yml for PostgreSQL & pgAdmin:

version: '3'
services:
  postgres:
    image: postgres:latest
    environment:
      POSTGRES_PASSWORD: mysecret
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4
    environment:
      PGADMIN_DEFAULT_EMAIL: admin@example.com
      PGADMIN_DEFAULT_PASSWORD: admin
    ports:
      - "8080:80"

Run all services with:

docker-compose up -d
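
A few companion commands cover the rest of the lifecycle:

docker-compose ps        # list the services defined in the file
docker-compose logs -f   # stream logs from every service
docker-compose down      # stop and remove the services (add -v to drop volumes too)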

Wrap-Up

Docker is a game-changer for data engineering. Mastering these 10 essential commands will help you:

Efficiently manage containers
Automate data workflows
Ensure consistency across environments
Easily deploy scalable data pipelines

As you continue your Docker journey, try combining these commands to build powerful, automated workflows. Start experimenting today, and take your data engineering skills to the next level!

FAQs

What is the difference between docker run and docker start?

docker run creates a new container from an image and starts it.
docker start restarts an existing stopped container without creating a new one.
Example:
docker run -d --name my_container python:3.9 # Creates and starts a new container
docker stop my_container # Stops the container
docker start my_container # Starts the same container again

How do I clean up unused images, containers, and volumes to free disk space?

Use the following commands to remove unused resources:
# Remove all stopped containers
docker container prune

# Remove all unused images
docker image prune

# Remove all unused volumes
docker volume prune

# Remove everything unused: containers, images, networks (add --volumes to include volumes)
docker system prune -a

How do I copy files from a container to my local system?

Use the docker cp command:
docker cp my_container:/path/to/file /local/path
Example: Copying a database backup from a running PostgreSQL container:
docker cp my_postgres:/var/lib/postgresql/data/backup.sql ./backup.sql

How can I persist data in Docker containers?

By using volumes or bind mounts:
Named Volumes (Preferred for databases and persistent data)
docker volume create my_data
docker run -d -v my_data:/data my_image
Bind Mounts (Maps local folders to containers)
docker run -d -v /local/path:/container/path my_image

How do I check real-time logs for a running container?

Use the docker logs command:
docker logs -f my_container
The -f (follow) flag continuously streams logs in real-time.

How do I run multiple services in Docker, like PostgreSQL and Apache Airflow?

Use Docker Compose with a docker-compose.yml file. Example:

version: '3'
services:
  postgres:
    image: postgres:latest
    environment:
      POSTGRES_PASSWORD: mysecret
    ports:
      - "5432:5432"

  airflow:
    image: apache/airflow:latest
    depends_on:
      - postgres
    ports:
      - "8080:8080"

Run all services with:
docker-compose up -d

How do I access a running container’s shell?

Use the docker exec command:
docker exec -it my_container bash
For Alpine-based containers:
docker exec -it my_container sh

How do I create a custom Docker image with my own dependencies?

Create a Dockerfile and build the image:
Example Dockerfile for Python with Pandas and NumPy:
FROM python:3.9
RUN pip install pandas numpy
CMD ["python"]

Build and tag the image:
docker build -t my_python_image .

Run a container from it:
docker run -it my_python_image

How do I restart a crashed container automatically?

Use the --restart policy when running a container:
docker run -d --restart unless-stopped my_image
Other restart policies:
no (default) – Do not restart the container automatically.
always – Always restart, even if manually stopped.
unless-stopped – Restart unless explicitly stopped by the user.
on-failure – Restart only if the container exits with an error.

How can I check if a specific container is running?

Run:
docker ps --filter "name=my_container"

Or inspect the container state directly (useful in scripts):
docker inspect -f '{{.State.Running}}' my_container

If it returns true, the container is running.
