wallnerryan/Get-Started

Data on Kubernetes - Getting Started Guide

Contributors: Paul Au, Jonathan Battiato, Vindod Kumar, Alex Lines, Kallio Prinewill, Edith Puclla, Steve Sklar, Ryan Wallner, Gabriele Bartolini, Alastair Turner

About this Guide

Learning a new technology, and finding the community resources to help you learn it, can be quite a task. In this guide we have curated links to existing information to help a Data on Kubernetes beginner get started. The guide is broken into sections providing the theoretical and practical information needed to deploy a first stateful application on Kubernetes.

For more experienced members of the community, this guide is also intended to capture gaps in existing content so that we, as a community, can fill those gaps.

Table of Contents

Why Stateful Applications on Kubernetes

Running databases and message queues on Kubernetes is becoming more common, and not just for development environments. Various features of the Kubernetes ecosystem enable and simplify operations for these stateful workloads.

  • Health checks and automated restarts of application pods
  • The Kubernetes Operator model allows specialists to encode the processes for setting up and managing a stateful application into a program. This program can then manage the initial configuration of the application and ongoing operations tasks like backups and upgrades
  • Declarative configuration - specifying the desired configuration of the stateful application, rather than the steps to reach that configuration - allows these configurations to be version managed (enabling GitOps) and simplifies compliance checks and enforcement. The process of reconciling the current and desired configurations is handled partly by the Kubernetes controllers and partly by the Operator for the application.

Links to further information on these features, and how they enable stateful applications on Kubernetes, are in the sections below.
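The declarative model can be illustrated with a minimal Deployment manifest. This is a sketch with hypothetical names, not part of the guide's walkthroughs:

```yaml
# Desired state: three replicas of a hypothetical app ("example-app").
# The Deployment controller continuously reconciles the cluster toward
# this spec; if a Pod dies, a replacement is created automatically.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: nginx:1.25
```

Because this file fully describes the desired state, it can live in version control and be applied with `kubectl apply -f`, which is the foundation of GitOps workflows.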

Intro to Stateful

Purpose

A stateful workload, unlike a stateless workload, is an application or process that stores information persistently. Kubernetes supports data persistence for these workloads through an API that abstracts the attached storage. The API provides the PersistentVolume and PersistentVolumeClaim resources, allowing users to consume abstract storage from Pods or StatefulSets that need to persist their data.
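As a minimal sketch of how these resources fit together (assuming the cluster has a default StorageClass for dynamic provisioning; names are hypothetical):

```yaml
# A claim for 1Gi of storage, satisfied by the cluster's default StorageClass
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: demo-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 1Gi
---
# A Pod that mounts the claim at /data; data written there survives
# container restarts because it lives on the PersistentVolume
apiVersion: v1
kind: Pod
metadata:
  name: demo-pod
spec:
  containers:
    - name: app
      image: busybox:1.36
      command: ["sh", "-c", "echo hello > /data/hello.txt && sleep 3600"]
      volumeMounts:
        - name: data
          mountPath: /data
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: demo-pvc
```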

Resources

Types of workloads

Purpose

Provide a list of stateful workloads that run on Kubernetes, with a description and examples of each.

Stateful Workloads

  • Databases (stateful sets or CRD)
  • AI/ML (usually jobs) - https://developers.redhat.com/aiml/ai-workloads
  • Batch processing jobs
  • Stream processing
  • Machine learning and AI workloads
  • Data analytics
  • ETL (Extract, Transform, Load) pipelines
  • Data warehousing
  • Distributed databases
  • In-memory data grids
  • Time series databases
  • Search and indexing engines

Operators 101

"The goal of an Operator is to put operational knowledge into software" - https://operatorhub.io/what-is-an-operator

Operators take knowledge of how to implement, deploy, run, maintain, and protect software applications on Kubernetes and put it into a repeatable framework for automation. The framework and automation in turn provide Day 1 operations (installation, configuration, etc.) and Day 2 operations (reconfiguration, update, backup, failover, restore, etc.) for applications. You can read more about the framework at operatorframework.io
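In practice, an operator defines a CustomResourceDefinition, and you describe the database you want as a custom resource. The example below is entirely hypothetical (an imaginary "ExampleDB" operator); real operators such as Strimzi or the various PostgreSQL operators define their own schemas:

```yaml
# Hypothetical custom resource for an imaginary "ExampleDB" operator.
# The operator watches for resources of this kind and does the work:
apiVersion: exampledb.example.com/v1
kind: ExampleDB
metadata:
  name: my-database
spec:
  replicas: 3               # Day 1: the operator creates and wires up the pods
  version: "15.2"           # Day 2: the operator handles rolling upgrades
  backup:
    schedule: "0 2 * * *"   # Day 2: backups encoded declaratively
```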

Purpose

Provide resources explaining what operators are and what role they play in running data workloads on kubernetes

Resources:

Ecosystem 101

Purpose

List and describe open source projects that are a part of the DoK Ecosystem. This list is not comprehensive.

Databases

  • Vitess: MySQL-compatible, horizontally scalable, cloud-native database solution
  • Cassandra: Apache Cassandra is a highly-scalable partitioned row store. Rows are organized into tables with a required primary key.
  • PostgreSQL: PostgreSQL is a powerful, open source object-relational database system that uses and extends the SQL language combined with many features that safely store and scale the most complicated data workloads.
  • MySql: An open-source relational database management system.

Cloud Native Storage

  • Rook: Rook is an open source cloud-native storage orchestrator, providing the platform, framework, and support for Ceph storage to natively integrate with cloud-native environments.
  • CubeFS: CubeFS is a new generation cloud-native open source storage system that supports access protocols such as S3, HDFS, and POSIX.
  • Longhorn: Longhorn is a lightweight, reliable and easy-to-use distributed block storage system for Kubernetes.

Scheduling

  • Apache Airflow: Apache Airflow is an open-source tool for managing data workflows, including scheduling, monitoring, and creating them.

Streaming

  • Kafka: Apache Kafka is an open-source distributed event streaming platform used by thousands of companies for high-performance data pipelines, streaming analytics, data integration, and mission-critical applications.
  • Spark: Apache Spark™ is a multi-language engine for executing data engineering, data science, and machine learning on single-node machines or clusters.
  • Flink: Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink has been designed to run in all common cluster environments, perform computations at in-memory speed and at any scale.
  • Strimzi: Strimzi provides a way to run an Apache Kafka cluster on Kubernetes in various deployment configurations.
  • Apache Pulsar: Apache Pulsar is an open-source, distributed messaging and streaming platform built for the cloud.

AI/ML

  • Ray: Ray manages, executes, and optimizes compute needs across AI workloads. It unifies infrastructure via a single, flexible framework—enabling any AI workload from data processing to model training to model serving and beyond.
  • Kubeflow: Kubeflow makes artificial intelligence and machine learning simple, portable, and scalable. We are an ecosystem of Kubernetes based components for each stage in the AI/ML Lifecycle with support for best-in-class open source tools and frameworks.

Batch Processing

  • Apache YuniKorn: light-weight, universal resource scheduler for container orchestrator systems.

Deploy your first database on kubernetes

Purpose

In this section, you'll learn how to use the knowledge you've accumulated to deploy a database to kubernetes.

Deploy MySQL using Killercoda Playground

Step 1: Launch the Killercoda Kubernetes Lab Environment from your web browser

Click here to access the environment

Step 2: Launch a MySQL Instance

kubectl apply -f https://k8s.io/examples/application/mysql/mysql-pv.yaml
kubectl apply -f https://k8s.io/examples/application/mysql/mysql-deployment.yaml

[screenshot: deploy MySQL]

Step 3: View your MySQL Instance Running

kubectl get pvc,po

[screenshot: view MySQL]

Step 4: Attach to MySQL

When prompted for the MySQL password, it is password

kubectl exec -i -t $(kubectl get pod -l app=mysql -o name) -- bash
mysql -u root -p

[screenshot: attach to MySQL]

When you would like to exit the pod, type exit twice.

You've successfully deployed your first stateful database (MySQL) on Kubernetes with a persistent volume.

Run MongoDB using Docker Desktop

Step 1: Install Docker Desktop

Step 2: Enable Kubernetes on Docker Desktop

[screenshot: enable Kubernetes]

Step 3: Set your context using kubectl

kubectl config get-contexts
kubectl config use-context docker-desktop

Step 4: Run a MongoDB StatefulSet

You can copy the example MongoDB YAML and save it locally to mongo.yaml.
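The linked manifest may differ in detail, but a minimal StatefulSet along these lines (consistent with the output shown in Step 5) would be:

```yaml
# A sketch of a minimal MongoDB StatefulSet. The volumeClaimTemplates
# entry gives each replica its own PersistentVolumeClaim -- for replica 0
# that claim is named mongodb-data-mongodb-0, as seen in Step 5.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: mongodb
spec:
  serviceName: mongodb
  replicas: 1
  selector:
    matchLabels:
      app: mongodb
  template:
    metadata:
      labels:
        app: mongodb
    spec:
      containers:
        - name: mongodb
          image: mongo:7
          ports:
            - containerPort: 27017
          volumeMounts:
            - name: mongodb-data
              mountPath: /data/db   # MongoDB's default data directory
  volumeClaimTemplates:
    - metadata:
        name: mongodb-data
      spec:
        accessModes: ["ReadWriteOnce"]
        resources:
          requests:
            storage: 1Gi
```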

kubectl apply -f mongo.yaml 

Step 5: View your Mongo database

kubectl get pvc,po

It should look something like this

kubectl get pvc,po
NAME                                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
persistentvolumeclaim/mongodb-data-mongodb-0    Bound    pvc-272ffc2a-2936-4609-a7b7-0cd20a8135af   1Gi        RWO            hostpath       <unset>                 77s
persistentvolumeclaim/mongodb-pvc               Bound    pvc-4b6071be-5425-473e-a214-07d9b8db0213   1Gi        RWO            hostpath       <unset>                 77s

NAME            READY   STATUS    RESTARTS   AGE
pod/mongodb-0   1/1     Running   0          77s

Step 6: Attach to your Mongo Database

kubectl exec -it pod/mongodb-0 -- bash
mongosh

You can then run show dbs and use myNewDB to test out the Mongo database

test> show dbs
test> use myNewDB
switched to db myNewDB
myNewDB>

When you would like to exit the pod, type exit twice.

You've successfully deployed your first StatefulSet database (MongoDB) on Kubernetes with a persistent volume.

Resources:

Next Steps

Now that you have gained an understanding of how to get started with Data on Kubernetes, it's time to think about next steps.

Next steps might mean thinking beyond getting started and tackling some of the following topics.

  • High Availability
  • Multi-Cluster / Multi-Cloud
  • Backup and Recovery
  • Disaster Recovery
  • Snapshots and Data Replication
  • Encryption
  • Running and managing multiple types of data services
  • Performance

Purpose

In this section, we'll list some resources to push you to the next level of understanding.

Resources:

Do you want to contribute?

This is a community-driven resource and we welcome contributions from the Data on Kubernetes Community. If you would like to contribute, feel free to submit a pull request. For more detail on what this repository is trying to achieve, please see the project proposal.

Feedback

We want your feedback! Let us know what you like and what you think is missing. Are there topics you would like us to add?

Submit Feedback
