wallnerryan/Get-Started

Data on Kubernetes - Getting Started Guide

Contributors: Paul Au, Jonathan Battiato, Vindod Kumar, Alex Lines, Kallio Prinewill, Edith Puclla, Steve Sklar, Ryan Wallner, Gabriele Bartolini, Alastair Turner

About this Guide

Learning a new technology, and finding the community resources to help you learn it, can be quite a task. In this guide we have curated links to existing information to help a Data on Kubernetes beginner get started. The guide is broken into sections providing the theoretical and practical information needed to deploy a first stateful application on Kubernetes.

For more experienced members of the community, this guide is also intended to capture gaps in existing content so that we, as a community, can fill those gaps.

Table of Contents

Why Stateful Applications on Kubernetes

Running databases and message queues on Kubernetes is becoming more common, and not just for development environments. Various features of the Kubernetes ecosystem enable and simplify operations for these stateful workloads.

  • Health checks and automated restarts of application pods
  • The Kubernetes Operator model allows specialists to encode the processes for setting up and managing a stateful application into a program. This program can then manage the initial configuration of the application and ongoing operations tasks like backups and upgrades
  • Declarative configuration - specifying the desired configuration of the stateful application, rather than the steps to reach that configuration - allows these configurations to be version managed (enabling GitOps) and simplifies compliance checks and enforcement. The process of reconciling the current and desired configurations is handled partly by the Kubernetes controllers and partly by the Operator for the application.

Links to further information on these features, and how they enable stateful applications on Kubernetes, are in the sections below.
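The declarative model can be illustrated with a minimal Deployment manifest. This is a sketch with hypothetical names, not part of the guide's walkthroughs:

```yaml
# Desired state: three replicas of a hypothetical app ("example-app").
# The Deployment controller continuously reconciles the cluster toward
# this spec; if a Pod dies, a replacement is created automatically.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: nginx:1.25
```

Because this file fully describes the desired state, it can live in version control and be applied with `kubectl apply -f`, which is the foundation of GitOps workflows.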

Intro to Stateful

Purpose

A stateful workload, unlike a stateless workload, is an application or process that stores information persistently. Kubernetes supports data persistence for these workloads through an API that abstracts the attached storage. The API provides the PersistentVolume and PersistentVolumeClaim resources, allowing users to consume abstract storage from Pods or StatefulSets that need to persist their data.
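As a minimal sketch of how these resources fit together (assuming the cluster has a default StorageClass for dynamic provisioning; names are hypothetical):

```yaml
# A claim for 1Gi of storage, satisfied by the cluster's default StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
# A Pod that mounts the claim at /data; data written there survives
# container restarts because it lives on the PersistentVolume
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "echo hello > /data/hello.txt && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-pvc
```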

Resources

Types of workloads

Purpose

Provide a list of stateful workloads that run on Kubernetes, with a description and examples of each.

Stateful Workloads

  • Databases (stateful sets or CRD)
  • AI/ML (usually jobs) - https://developers.redhat.com/aiml/ai-workloads
  • Batch processing jobs
  • Stream processing
  • Machine learning and AI workloads
  • Data analytics
  • ETL (Extract, Transform, Load) pipelines
  • Data warehousing
  • Distributed databases
  • In-memory data grids
  • Time series databases
  • Search and indexing engines

Operators 101

"The goal of an Operator is to put operational knowledge into software" - https://operatorhub.io/what-is-an-operator

Operators take knowledge of how to implement, deploy, run, maintain, and protect software applications on Kubernetes and put it into a repeatable framework for automation. The framework and automation in turn provide Day 1 operations (installation, configuration, etc.) and Day 2 operations (reconfiguration, update, backup, failover, restore, etc.) for applications. You can read more about the framework at operatorframework.io
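In practice, an operator defines a CustomResourceDefinition, and you describe the database you want as a custom resource. The example below is entirely hypothetical (an imaginary "ExampleDB" operator); real operators such as Strimzi or the various PostgreSQL operators define their own schemas:

```yaml
# Hypothetical custom resource for an imaginary "ExampleDB" operator.
# The operator watches for resources of this kind and does the work:
apiVersion: exampledb.example.com/v1
kind: ExampleDB
metadata:
  name: my-database
spec:
  replicas: 3               # Day 1: the operator creates and wires up the pods
  version: "15.2"           # Day 2: the operator handles rolling upgrades
  backup:
    schedule: "0 2 * * *"   # Day 2: backups encoded declaratively
```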

Purpose

Provide resources explaining what operators are and what role they play in running data workloads on kubernetes

Resources:

Ecosystem 101

Purpose

List and describe open source projects that are a part of the DoK Ecosystem. This list is not comprehensive.

Databases

  • Vitess: MySQL-compatible, horizontally scalable, cloud-native database solution
  • Cassandra: Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.
  • PostgreSQL: PostgreSQL is a powerful, open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads.
  • MySql: An open-source relational database management system.

Cloud Native Storage

  • Rook: Rook is an open source cloud-native storage orchestrator, providing the platform, framework, and support for Ceph storage to natively integrate with cloud-native environments.
  • CubeFS: CubeFS is a new generation cloud-native open source storage system that supports access protocols such as S3, HDFS, and POSIX.
  • Longhorn: Longhorn is a lightweight, reliable and easy-to-use distributed block storage system for Kubernetes.

Scheduling

  • Apache Airflow: Apache Airflow is an open-source tool for managing data workflows, including scheduling, monitoring, and creating them.

Streaming

  • Kafka: Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
  • Spark: Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
  • Flink: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
  • Strimzi: Strimzi provides a way to run an Apache Kafka cluster on Kubernetes in various deployment configurations.
  • Apache Pulsar: Apache Pulsar is an open-source, distributed messaging and streaming platform built for the cloud.

AI/ML

  • Ray: Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure via a single, flexible framework—enabling any AI workload from data processing to model training to model serving and beyond.
  • Kubeflow: Kubeflow makes artificial intelligence and machine learning simple, portable, and scalable. We are an ecosystem of Kubernetes based components for each stage in the AI/ML Lifecycle with support for best-in-class open source tools and frameworks.

Batch Processing

  • Apache YuniKorn: light-weight, universal resource scheduler for container orchestrator systems.

Deploy your first database on kubernetes

Purpose

In this section, you'll learn how to use the knowledge you've accumulated to deploy a database to kubernetes.

Deploy MySQL using Killercoda Playground

Step 1: Launch the Killercoda Kubernetes Lab Environment from your web browser

Click here to access the environment

Step 2: Launch a MySQL Instance

kubectl apply -f https://k8s.io/examples/application/mysql/mysql-pv.yaml
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-deployment.yaml

[screenshot: deploy MySQL]

Step 3: View your MySQL Instance Running

kubectl get pvc,po

[screenshot: view MySQL]

Step 4: Attach to MySQL

When prompted for the MySQL password, it is password

kubectl exec -i -t $(kubectl get pod -l app=mysql -o name) -- bash
mysql -u root -p

[screenshot: attach to MySQL]

When you would like to exit the pod, type exit twice.

You've successfully deployed your first stateful database (MySQL) on Kubernetes with a persistent volume.

Run MongoDB using Docker Desktop

Step 1: Install Docker Desktop

Step 2: Enable Kubernetes on Docker Desktop

[screenshot: enable Kubernetes]

Step 3: Set your context using kubectl

kubectl config get-contexts
kubectl config use-context docker-desktop

Step 4: Run a MongoDB StatefulSet

You can copy the example MongoDB YAML and save it locally to mongo.yaml.
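The linked manifest may differ in detail, but a minimal StatefulSet along these lines (consistent with the output shown in Step 5) would be:

```yaml
# A sketch of a minimal MongoDB StatefulSet. The volumeClaimTemplates
# entry gives each replica its own PersistentVolumeClaim -- for replica 0
# that claim is named mongodb-data-mongodb-0, as seen in Step 5.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb
  replicas: 1
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongodb
          image: mongo:7
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: mongodb-data
              mountPath: /data/db   # MongoDB's default data directory
  volumeClaimTemplates:
    - metadata:
        name: mongodb-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```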

kubectl apply -f mongo.yaml 

Step 5: View your Mongo database

kubectl get pvc,po

It should look something like this

kubectl get pvc,po
NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/mongodb-data-mongodb-0    Bound    pvc-272ffc2a-2936-4609-a7b7-0cd20a8135af   1Gi        RWO            hostpath       <unset>                 77s
persistentvolumeclaim/mongodb-pvc               Bound    pvc-4b6071be-5425-473e-a214-07d9b8db0213   1Gi        RWO            hostpath       <unset>                 77s

NAME            READY   STATUS    RESTARTS   AGE
pod/mongodb-0   1/1     Running   0          77s

Step 6: Attach to your Mongo Database

kubectl exec -it pod/mongodb-0 -- bash
mongosh

You can then run show dbs and use myNewDB to test out the Mongo database

test> show dbs
test> use myNewDB
switched to db myNewDB
myNewDB>

When you would like to exit the pod, type exit twice.

You've successfully deployed your first StatefulSet database (MongoDB) on Kubernetes with a persistent volume.

Resources:

Next Steps

Now that you have gained an understanding of how to get started with Data on Kubernetes, it's time to think about next steps.

Next steps might mean thinking beyond getting started and tackling some of the following topics.

  • High Availability
  • Multi-Cluster / Multi-Cloud
  • Backup and Recovery
  • Disaster Recovery
  • Snapshots and Data Replication
  • Encryption
  • Running and managing multiple types of data services
  • Performance

Purpose

In this section, we'll list some resources to push you to the next level of understanding.

Resources:

Do you want to contribute?

This is a community-driven resource and we welcome contributions from the Data on Kubernetes Community. If you would like to contribute, feel free to submit a pull request. For more detail on what this repository is trying to achieve, please see the project proposal.

Feedback

We want your feedback! Let us know what you like and what you think is missing. Are there topics you would like us to add?

Submit Feedback
