Managing & Scaling Data Pipelines with Databricks
Esha Shah, Senior Data Engineer, Atlassian Go-To-Market Data Engineering
Richa Singhal, Senior Data Engineer, Atlassian Go-To-Market Data Engineering
Agenda
Atlassian Overview
Data Platform Challenges
Adopting Databricks
Summary
Growth over the last 5 years: data volume has grown 20x (multiple petabytes); the number of internal users has grown 5x; events per day have grown 5x (billions).
Atlassian Data Architecture (Before Databricks)
Key Challenges with Legacy Architecture Development Cross-team dependencies Cluster management Collaboration
Prepping for Scale Self-service Standardization Automation Agility Cost Optimization
Current Atlassian Data Architecture
Our Success Story
Rapid Development: reduced development time
Collaboration: increased team and project efficiency with simplified sharing and co-authoring
Scaling: supported growth while reducing infrastructure cost
Self Service: removed the Data Engineering dependency for Analytics and Data Science teams
Adopting Databricks at Atlassian Building Data Pipelines Orchestration Leveraging Databricks Delta Databricks for Analytics and Data Science
Building Data Pipelines
Data Pipelines with Databricks Data Pipelines using Notebooks Data Pipelines using DB-Connect
Development using Databricks Notebook (diagram): a Jira ticket drives work on a Bitbucket branch; notebooks are imported/exported between the branch and the Databricks workspace via the command line, and run on interactive or ephemeral Databricks clusters in AWS.
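The command-line import/export step in the diagram can be scripted against the Databricks Workspace API (`POST /api/2.0/workspace/import`). A minimal sketch; the workspace URL, token, and paths are placeholders, not Atlassian's actual setup:

```python
import base64
import json


def build_notebook_import_request(host, token, workspace_path, source):
    """Build the HTTP pieces for POST /api/2.0/workspace/import.

    `source` is the local notebook's Python source as a string;
    the API expects it base64-encoded.
    """
    url = f"{host}/api/2.0/workspace/import"
    headers = {"Authorization": f"Bearer {token}"}
    payload = {
        "path": workspace_path,
        "format": "SOURCE",
        "language": "PYTHON",
        "overwrite": True,
        "content": base64.b64encode(source.encode("utf-8")).decode("ascii"),
    }
    return url, headers, payload


if __name__ == "__main__":
    url, headers, payload = build_notebook_import_request(
        "https://example.cloud.databricks.com",  # placeholder workspace URL
        "dapi-XXXX",                             # placeholder token
        "/Users/dev/my_pipeline",
        "print('hello')",
    )
    print(json.dumps(payload, indent=2))
```

In practice the same call is available through the `databricks workspace import` CLI command, which wraps this endpoint.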
Multi-stage Envs using Databricks Workspaces (diagram): a Bitbucket CICD pipeline promotes notebooks from a Dev folder (local/development) to Stg and Prod folders (stage/production) in the Databricks workspace, each stage paired with its own Stg or Prod cluster.
Bitbucket CICD Pipeline (bitbucket-pipelines.yml):

branches:
  main:
    - step:
        name: Check configuration file
        deployment: test
        script:
          - pip install -r requirements.txt
          - 'yamllint -d "{extends: default, rules: {}}" config.yaml'
          - python databricks_cicd/check_duplicates.py
    - step:
        name: Move code to Databricks
        deployment: production
        caches:
          - pip
        script:
          - pip install -r requirements.txt
          - bash databricks_cicd/move_code_to_databricks.sh prod
    - step:
        name: Update the job in Databricks
        script:
          - pip install -r requirements.txt
          - python databricks_cicd/configure_job_in_databricks.py
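The contents of `check_duplicates.py` aren't shown in the deck; a minimal sketch of what such a check might do (assuming `config.yaml` defines a list of job names) could look like:

```python
from collections import Counter


def find_duplicate_jobs(job_names):
    """Return the job names that appear more than once in the config."""
    counts = Counter(job_names)
    return sorted(name for name, n in counts.items() if n > 1)


if __name__ == "__main__":
    # In the real script the names would be parsed out of config.yaml;
    # hard-coded here to keep the sketch dependency-free.
    print(find_duplicate_jobs(["ingest_daily", "score_model", "ingest_daily"]))
```

Failing the pipeline on duplicates catches copy-paste mistakes in job configuration before anything is deployed to Databricks.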
Development using DB-Connect Library (diagram): a Jira ticket drives work on a Bitbucket branch; code is written in a local IDE and executed via db-connect against interactive or ephemeral Databricks clusters in AWS, with changes flowing through pull request/merge.
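With db-connect, locally written PySpark code executes on a remote Databricks cluster. A sketch of the pattern; the table name is a placeholder, and the session setup assumes `databricks-connect configure` has pointed the local environment at a cluster:

```python
def daily_event_counts(df):
    """Pure transformation: aggregate events per day.

    Keeping logic in plain functions like this makes it
    unit-testable locally, without a cluster.
    """
    return df.groupBy("event_date").count()


def run_remote():
    """Run the transformation on the remote cluster via db-connect."""
    # Requires `pip install databricks-connect` and a configured cluster;
    # the SparkSession below is then backed by the remote Databricks cluster.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()
    events = spark.table("analytics.events")  # placeholder table name
    daily_event_counts(events).show()
```

The split between pure transformations and the session-owning entry point is what makes the local IDE workflow pay off: the same file runs in unit tests, via db-connect, and as a submitted job.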
Multi-stage Envs using AWS S3 (diagram): a Bitbucket CICD pipeline (running in Docker) ships code from the local IDE to Dev, Stg, and Prod S3 buckets (local/development and stage/production), from which the corresponding Stg and Prod Databricks clusters pick it up.
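The stage-to-bucket routing can be captured in a small helper; the bucket names and path layout below are illustrative, not Atlassian's real conventions:

```python
# Illustrative mapping from deployment stage to S3 code location.
CODE_BUCKETS = {
    "dev": "s3://example-data-code-dev",
    "stg": "s3://example-data-code-stg",
    "prod": "s3://example-data-code-prod",
}


def code_path(env, pipeline, version):
    """Resolve where the CICD pipeline uploads a pipeline's artifact,
    and where the matching cluster reads it from."""
    if env not in CODE_BUCKETS:
        raise ValueError(f"unknown environment: {env}")
    return f"{CODE_BUCKETS[env]}/{pipeline}/{version}/job.zip"
```

Keeping the mapping in one place means the CICD pipeline and the cluster-side job configuration can't drift apart on where code lives.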
Orchestration
Orchestration using Airflow (diagram): Airflow on Kubernetes runs SparkSubmit tasks (code on S3) and Notebook tasks (notebooks in the Databricks workspace); pipelines integrate with YODA (the in-house data quality platform), SignalFx, Opsgenie on-call, and Slack notifications.
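A notebook task submitted from Airflow is typically described by a Jobs API runs-submit spec, the JSON that operators like `DatabricksSubmitRunOperator` accept. A sketch of such a spec builder; the runtime version, node type, and notebook path are assumptions:

```python
def notebook_run_spec(notebook_path, cluster_env, params=None):
    """Build a Databricks Jobs API runs-submit spec for a notebook task,
    as it might be passed to DatabricksSubmitRunOperator(json=...)."""
    return {
        "run_name": f"airflow-{notebook_path.rsplit('/', 1)[-1]}",
        "new_cluster": {
            "spark_version": "7.3.x-scala2.12",  # illustrative runtime
            "node_type_id": "i3.xlarge",         # illustrative node type
            "num_workers": 2,
            # Tags flow into usage reporting (see the cost-tracking slide).
            "custom_tags": {"environment": cluster_env, "user": "airflow"},
        },
        "notebook_task": {
            "notebook_path": notebook_path,
            "base_parameters": params or {},
        },
    }
```

Using an ephemeral `new_cluster` per run keeps task executions isolated and lets each run carry its own tags.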
Tracking Resource Usage and Cost: job metadata attached as custom tags flows into the data lake for ad-hoc reporting on Databricks jobs.

'custom_tags': {
    'business_unit': 'Data Engineering',
    'environment': cluster_env,
    'pipeline': 'Team_name',
    'user': 'airflow',
    'resource_owner': '<resource_owner>',
    'service_name': '<service-name>'
}
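Once tagged usage lands in the data lake, ad-hoc cost reporting reduces to grouping by a tag. A sketch with an assumed record layout (the billing schema here is illustrative):

```python
from collections import defaultdict


def cost_by_tag(usage_records, tag):
    """Aggregate cost from usage records by one custom tag.

    Each record is assumed to look like:
    {"cost": 12.5, "tags": {"pipeline": "Team_name", ...}}
    Records missing the tag are bucketed under "untagged" so
    unattributed spend stays visible.
    """
    totals = defaultdict(float)
    for rec in usage_records:
        totals[rec["tags"].get(tag, "untagged")] += rec["cost"]
    return dict(totals)
```

Slicing by `pipeline` or `resource_owner` this way is what makes the tagging convention on the cluster spec pay off.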
Leveraging Databricks Delta
Delta: Time Travel, Merge, Auto-optimize
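The three Delta features the slide names can each be shown as one SQL statement; the table names below are placeholders, and the statements would run via `spark.sql(...)` on a Databricks cluster:

```python
# Time travel: query the table as of an earlier version.
TIME_TRAVEL_SQL = """
SELECT count(*) FROM events VERSION AS OF 5
"""

# Merge: upsert late-arriving or corrected records into the Delta table.
MERGE_SQL = """
MERGE INTO events AS t
USING events_updates AS s
ON t.event_id = s.event_id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
"""

# Auto-optimize: compact small files on write via table properties.
AUTO_OPTIMIZE_SQL = """
ALTER TABLE events SET TBLPROPERTIES (
  'delta.autoOptimize.optimizeWrite' = 'true',
  'delta.autoOptimize.autoCompact' = 'true'
)
"""


def apply_on_cluster(spark):
    """Run the statements via an existing SparkSession (Databricks only)."""
    spark.sql(AUTO_OPTIMIZE_SQL)
    spark.sql(MERGE_SQL)
    return spark.sql(TIME_TRAVEL_SQL)
```

Time travel is what makes debugging and reprocessing cheap (re-run against yesterday's version), while MERGE replaces the delete-and-rewrite patterns that partitioned Parquet tables force.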
Databricks for Analytics and Data Science
Analytics Use Cases
Exploratory and root cause analysis
Analysis for strategic decisions
POC for new metrics and business logic
Creating and refreshing ad-hoc datasets
Team onboarding templates
Big Wins: Analytics Self-service Collaboration
Data Science Use Cases
Exploration and sizing
Feature generation
Model training
Scoring
Experiments and analyzing results
Model serving
Big Wins: Data Science Faster local stack to cloud cycle No infrastructure overhead Increased ML adoption across teams Governance & Tracking
Summary
Key Takeaways Delivery time reduced by 30% Decreased infrastructure costs by 60% Databricks used by 50% of all Atlassians Reduced Data team dependencies by more than 70%
Thank you!
Feedback Your feedback is important to us Don’t forget to rate and review the sessions
