Transcript
Matthias Niehoff: I'm Matthias. I work for codecentric, a small consultancy in Germany. Before I joined codecentric, I did my professional training at an insurance company. When I finished that training, I came back to the company and they said, congratulations, we want you to stay with us, and we have a new job for you. Back then I was in the architecture team, whatever that meant. The new job was going to be in the data warehouse team. I was like, what? Because for me, data was something the others were doing.
From a software engineering perspective, I didn't care about data warehouses. That's where the data goes, and they get it in some magical way. It turned out to be a joke: I stayed in the architecture team. After joining codecentric, I moved into the data space. I started with Apache Spark 10 years ago and then grew into all the data topics, but I come at data with a software engineering background. These challenges remain. From a software engineering perspective, it's often still: data for analytics? That's not something we care about.
Changing Data
I'll give you an example. We have some order data that is stored, with an order id and a quantity; on the right side, we have the analytics side. They just select the order id and the quantity. Then the application team decides they want to track whether an order was unfulfilled. The analytics side still only selects order id and quantity, and wonders why there are more orders than there was stock, because they didn't know there was an unfulfilled field they had to track and select as well. Eventually they catch up and select it too, but now the producer wants to measure the number of unfulfilled items and changes the data type from a Boolean to an integer. What happens? Either the consuming side converts it implicitly to a Boolean and it basically fails silently, or the load just fails because the data type has changed. This is a quite common scenario when providing data. What is the reason?
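To make the failure mode concrete, here is a minimal Python sketch of that scenario - hypothetical data, not client code - showing how the implicit conversion keeps running but silently answers the wrong question:

```python
# Hypothetical illustration of the "unfulfilled" example, not actual client code.
orders_v1 = [
    {"order_id": 1, "quantity": 3, "unfulfilled": True},   # old schema: Boolean flag
    {"order_id": 2, "quantity": 5, "unfulfilled": False},
]
orders_v2 = [
    {"order_id": 1, "quantity": 3, "unfulfilled": 2},       # new schema: item count
    {"order_id": 2, "quantity": 5, "unfulfilled": 0},
]

def unfulfilled_orders(rows):
    # Consumer logic written against the old schema: implicit bool conversion.
    return sum(1 for r in rows if bool(r["unfulfilled"]))

def unfulfilled_items(rows):
    # What the analytics side actually needs after the change: a sum of counts.
    return sum(int(r["unfulfilled"]) for r in rows)

print(unfulfilled_orders(orders_v1))  # 1 -> correct for the old schema
print(unfulfilled_orders(orders_v2))  # 1 -> still runs, but answers the wrong question
print(unfulfilled_items(orders_v2))   # 2 -> the number of unfulfilled items they wanted
```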
The problem is that most of the time only the technical schema is exposed to the consumers of the data. The consumers just get that schema and try to reverse engineer what is actually happening: What is the business process that creates this data? What are the constraints? What is the data model underneath? Also, is this source really trustworthy? Can I rely on it? How do I access it? Is it stable? Is it durable? All these questions remain open. More often than not, it might just be a dump on a file system somewhere.
On the other side, the producers - if they even care who uses the data - often don't know who uses it, for what reason, for what purpose. They're blind on this side most of the time. To summarize: most of the time, data is still a second-class citizen in software architectures. It's just a database we store the data in. This leads to quick one-off solutions when data has to be provided: we do dumps. We just say, here, you get a user for our database, you can access it. Nobody would normally do that, but yes, it's done. The data team is left on its own.
Customer Case - FinTech
I'm going to show you some examples of what we did with clients to manage this. The first example is a small FinTech in the realm of a larger bank in Germany. They were doing B2C business, and they wanted to analyze marketing campaigns, how their customers behave, and what kinds of customers they have. They wanted to check their online sales process. Nothing fancy, just gathering some data from different systems. There was a small team of data analysts who were really curious about getting results out of the data - not deeply technical people who know Python and SQL and how to start a Jupyter Notebook, but people more interested in what the data has to say. They had minimal support from the organization. The organization was really focused on the application; there were DevOps people and software engineers and all this, but they just wanted to build online applications. They didn't really see the case for data. As the data team, what did we do? We got files, in this case dumped onto an SFTP server.
The first thing we did was write tests on the source data. In this case we used dbt. We wrote tests to check what is in the data: Does the data change? Is it what we expected? The goal was to get notified early if there were changes in the data. For instance, a column that is suddenly all null would very likely mean there was a problem in the export from the source system. We made it visible. We built dashboards - this is Elementary, built on dbt. Most of the time it didn't look like this; this is the best case. We wanted to make it really visible.
Having this as a chart we looked at every day, and also using it to communicate with our stakeholders: we don't have the data we need, the data is broken, and so on. We started documenting what we learned about the data. How is the data created? What is the business process behind it? What are the constraints? We learned on the fly, basically, while working with the data, and we adjusted the tests so they captured the knowledge we had gained. We extended this; it was all part of dbt in the end. But this was all just mitigation of the problems; what we actually wanted was to get the producer on board. We wanted to get to the source and involve the producers to help us get better data.
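The checks themselves were dbt source tests; as a plain-Python illustration (not the actual dbt implementation) of the all-null-column check described above, with a hypothetical file name:

```python
# The real checks were dbt source tests; this plain-Python sketch only
# illustrates the all-null-column check. The file name is hypothetical.
import csv
from collections import defaultdict

def all_null_columns(path: str) -> list[str]:
    non_null_seen: dict[str, bool] = defaultdict(bool)
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        for row in reader:
            for col in columns:
                if row.get(col) not in (None, "", "NULL"):
                    non_null_seen[col] = True
    return [col for col in columns if not non_null_seen[col]]

suspicious = all_null_columns("orders_export.csv")  # hypothetical SFTP dump
if suspicious:
    print(f"Alert: columns {suspicious} are entirely null - check the source export")
```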
Data Sharing Model
Maybe let's take a step back. You have an application data model where you store the data of your application; you have defined it somewhere. It might be in some normal form, it might be normalized, or whatever. Now another application wants to access this data. No software engineer on earth would say, here's the application model, just access it directly. You would always build some API on top. It might be a REST API, it might be a stream, you might be putting messages on AMQP, whatever. You would always use an abstraction.
For analytics, more often than not, it's not like this: you just dump out the internal application model. What we're thinking about is using an API here as well. Maybe not a REST API, which is optimized for single-record access, but something more efficient. Or it might even be a separate data model created specifically for sharing the data with the outside. All of this should actually be the responsibility of the team. We should care about what those interfaces look like - the interfaces where we share data from the team towards the analytics side.
One question is, how is this sharing data model created? It needs some pipeline; there has to be some process that creates it. The sharing data model itself, from a technology perspective, could be a view in the same database the application is using. It might be a topic in some broker. It might be files in Blob storage - Iceberg tables, Delta tables, CSV, whatever. It might be something in a data warehouse or data lake. With MCP servers now giving LLMs access to data, the first ideas are popping up to use MCP servers to access this data as well. I'm not sure it's a good idea yet, but at least it's something to have in mind. We need those pipelines so that software developers can easily share this data. These pipelines should be part of the whole application, owned by the development team. The development team is responsible for creating this data model for sharing the data.
First of all, this means more responsibility, but it also gives you opportunities. I think a lot of you who work in data know that CI/CD, tests, and so on are not as widespread as in software engineering. It might be a chance for software development teams to build this with their own software stack, using the best practices - or good practices - they already have: writing tests, doing CI/CD. Taking ownership means not only responsibility, but also the freedom to implement it the way they want, possibly in the same tech stack. It's just another API they are building next to maybe an existing REST API. Still, those pipelines need to be supported. We want to support those people with blueprints and libraries for creating this. The application tech might be Java or Kotlin or whatever.
On the data side, the consumer side where you want to provide the data, it might be something like a Delta Lake managed by Databricks - I happen to have an example afterwards. You need to provide more support for this. You need to support execution and retries, something that is not that common in a normal web application. You have to build frameworks and tooling for this. You have to enable monitoring; yes, they can use the monitoring they use for applications, but it might be something different. Schema evolution, writing to data-specific infrastructure, and also business-level things like data changes have to be supported. Those are things developers are not used to when writing a normal application.
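As a small, hedged sketch of what "support for execution and retries" can mean in practice - names and parameters are illustrative, not the framework we actually built:

```python
# Minimal sketch of execution-and-retry support for a publishing step.
# Names and parameters are illustrative, not the framework we actually built.
import time

def with_retries(step, attempts: int = 5, base_delay: float = 2.0):
    """Run `step`, retrying with exponential backoff before giving up."""
    for attempt in range(1, attempts + 1):
        try:
            return step()
        except Exception as exc:  # a real framework would retry only retryable errors
            if attempt == attempts:
                raise
            delay = base_delay * 2 ** (attempt - 1)
            print(f"attempt {attempt} failed ({exc}); retrying in {delay}s")
            time.sleep(delay)

# Usage: wrap the step that publishes a snapshot towards the analytics side.
# with_retries(lambda: publish_orders_snapshot())
```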
Java App Writes to Databricks-Powered Data Lake
Next example: we have a Java application writing to a Databricks-powered data lake. Most of the developers who would implement this application don't know what Databricks is, how to use it, or how to do things in it. What they can do is write data to Blob storage, maybe as Parquet files instead of JSON, or something a little more efficient. Then there are two options, basically: you can register this as an external table in Unity Catalog, or you can load it into Databricks. I won't discuss which one is better, because as a software developer, I just don't care. I don't want to have to know how to do this.
This is what the pipeline and the library should support. We built libraries that made it possible to just say: I have a larger amount of data here, in some file or in a database, and I want to publish it. The library, with the pipeline behind it, takes care of putting the data into Blob storage, loading it into Databricks, and making it available there. The SDK, or library, had some input fields so you could add a description and attach information to the data, and that got published to Unity Catalog as well. This was just the beginning of the interface we were describing: we started collecting not only the data, but also information and metadata about it.
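To give a feel for the shape of such an interface, here is a hedged Python sketch; the real library lived in the application's own tech stack, and every name here (DatasetMetadata, publish_dataset, blob_store, catalog) is hypothetical:

```python
# Sketch of the publishing interface only; the real library lived in the
# application's own tech stack. All names here are hypothetical.
from dataclasses import dataclass, field

@dataclass
class DatasetMetadata:
    name: str
    description: str
    owner: str
    columns: dict[str, str] = field(default_factory=dict)  # column -> description

def publish_dataset(rows, metadata: DatasetMetadata, blob_store, catalog) -> None:
    """Write the rows as Parquet to Blob storage, register the table in the
    data platform, and publish the descriptions to the catalog."""
    path = blob_store.write_parquet(metadata.name, rows)           # Parquet instead of JSON
    catalog.register_external_table(metadata.name, location=path)  # or: load into Databricks
    catalog.set_table_comment(metadata.name, metadata.description)
    for column, comment in metadata.columns.items():
        catalog.set_column_comment(metadata.name, column, comment)
```

The point of the design is that the developer only hands over data plus a description; Blob storage, Parquet, and the catalog registration stay hidden behind the library.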
Data Contracts
We took another step forward after this. One thing we used was data contracts. I think most of you have heard of data mesh - a few years old by now, huge impact, huge hype back then, a lot of different concepts. I think the most tangible concept in there was data products, and out of that came data contracts. A contract basically describes the data shared between a data provider that has some data product - in this case the application team - and the data consumer: what data they share, what it looks like, and what guarantees come with it. The data contract is often described as a YAML file, and you can use it for enforcement; I'll come to that later.
The most important part is that you make it explicit. You explicitly describe: we share or provide this data, and we use it - and the two sides talk to each other, which most of the time didn't happen before. What would a data contract look like? I have a high-level overview of a data contract here on the right side. You have servers: where is the data stored? It might be a database, an S3 bucket, a topic, anything. You have the schema. You can put price tags on it - consumption costs this much - whether you enforce that or keep it informational only. You give information about who the team is, you define roles, you give SLA properties: we guarantee that this data is up to date, or is updated every day, and at most lags one day behind the event that occurred. We guarantee that we will resolve data failures within so many days. This is really something you argue about, and more often than not you come to points where you say: you expect this, but we can only deliver that, because we have this or that constraint.
Then you find a solution together. We created those contracts data first: we had the data, and then we started creating a contract from that data. You can just say, here's the data, build a contract out of it. We versioned those contracts with semantic versioning and put them into a central Git repository. What we used was the Open Data Contract Standard. There are multiple standards; currently it seems like this is the one evolving out of all of those standards, and it's going to be the main standard - until there's another one. We were lucky and made the right choice at this point. We can also enforce those contracts. There's a CLI, datacontract-cli as it's called, where you can test those contracts. You have defined the contract in a YAML file, with all the information inside - where the data is stored, what kind of schema we expect - and you run tests against it. You can also check: do we have breaking changes?
If we create another version of the data contract, we can just pass in the YAML files of the old contract and the new contract and ask: is this breaking, or is it forward compatible? Can we still use it? What we also did was create source tests. Remember, right at the beginning we manually created source tests; now we can generate dbt source tests based on the information we have in the contract, and use those to enforce and test the data as a consumer.
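A simplified sketch of that generation step is below; the contract layout it reads is a reduced, illustrative structure rather than the exact ODCS field names, and the datacontract tooling can do such conversions for you as well:

```python
# Simplified sketch of generating dbt source tests from a data contract.
# The contract layout assumed here is reduced and illustrative, not the
# exact ODCS field names. Requires PyYAML.
import yaml

def dbt_sources_from_contract(contract_path: str) -> dict:
    with open(contract_path) as f:
        contract = yaml.safe_load(f)

    tables = []
    for model_name, model in contract.get("models", {}).items():
        columns = []
        for field_name, spec in model.get("fields", {}).items():
            tests = []
            if spec.get("required"):
                tests.append("not_null")
            if spec.get("unique"):
                tests.append("unique")
            columns.append({"name": field_name, "tests": tests})
        tables.append({"name": model_name, "columns": columns})

    return {
        "version": 2,
        "sources": [{"name": contract.get("id", "contract_source"), "tables": tables}],
    }

# print(yaml.safe_dump(dbt_sources_from_contract("datacontract.yaml")))
```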
Where did we enforce the contracts in the end? Once on the publisher side, in CI/CD. We changed an API, we created a pull request, and within the pull request we checked the changes against the contract: the format, the schema, the data quality - is it still what we expect? If all these checks passed, that was one of the necessary requirements for getting the pull request merged. On the consumer side, when we retrieved the data, the first thing we did was check it: check the schema, check the quality, check whether the SLAs we had agreed on were actually met. That is what we did there. Where and how did we check for reliability? Because reliability is one of the topics. We did unit tests and tested the contracts during app development.
During pipeline execution, we tested against the data, and we integrated that into the application monitoring, so there is one view - not application monitoring in Grafana or whatever and some other view somewhere else. We did source tests and tests in further stages along the processing as well; I won't go into details here. Reliable and trusted data is important to scale data usage. When I talk about scale, I'm not talking about having billions and billions of records; by scale, I mean getting more out of the data we have, empowering more use cases, getting more use cases done, creating more value for the business. That, for me, is scale. I don't need the tech scale. It's interesting, but it's not necessarily the most important part.
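As a hedged sketch of the publisher-side CI gate described above - assuming the datacontract-cli's test and breaking-change subcommands behave as documented, and with illustrative file names:

```python
# Sketch of the publisher-side CI gate. Assumes the datacontract CLI is
# installed and that its `test` and `breaking` subcommands behave as
# documented; the contract file names are illustrative.
import subprocess
import sys

def run(cmd: list[str]) -> None:
    if subprocess.run(cmd).returncode != 0:
        sys.exit(f"contract check failed: {' '.join(cmd)}")

# Does the shared data still match the contract (schema, quality, format)?
run(["datacontract", "test", "datacontract.yaml"])
# Is the new contract version a breaking change compared to the previous one?
run(["datacontract", "breaking", "datacontract-previous.yaml", "datacontract.yaml"])
print("contract checks passed - the pull request may be merged")
```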
What Is the Role of Tech and Platform in This Case?
Next to this trusted data, what is the role of tech and platform? We talked about the pipelines getting the data, but what is the role of tech and platform? A few of you may know this. What does it mean when you see it? For me, it's sometimes a real cognitive overload. This is all the tooling one should know, or at least have an idea of what it does, when it comes to data, and it's getting bigger and bigger. GenAI is included now, which is a big part of it, but it's really a lot. For me, it sometimes causes constant stress: I go through LinkedIn or Hacker News and see there's a new tool, there's something cool, there's something I want to dig into, and it's constant stress. What do you do when you're confronted with this? There are multiple strategies. One is to try to look everywhere and pick up a bit of everything.
You get an idea here and there, and so you know nothing about everything, basically. Another strategy is: that's too much, I'll just focus on this one thing and go really deep. I know everything about Apache Spark; I don't care what's left and right, but in Apache Spark I'm the absolute expert. Or you say: no, back then we were just using Apache NiFi, for example - we can do everything with it. You're clinging to the past, also a strategy people follow. Or you get dogmatic, or you just give up and say: what else can I do? I just don't care. This complexity, and people and the market telling you that you need this and that to actually build a data platform, is unhealthy for people, and it's unhealthy for organizations.
That's a reason to come back to essential complexity versus accidental complexity. Essential complexity is what you need to fulfill your functional requirements, your non-functional requirements, and possibly some boundary conditions. A lot of you might be in banking, so there are boundary conditions from regulations or whatever that you have to comply with. That is essential complexity. You cannot solve the problem without it; it is inherent to the problem, to the essence of what you are solving.
All the rest is accidental complexity. Is using Kubernetes and kubectl really essential to the problem you're solving? Might be, might not be. In data, I tend to say, more often than not it's not. For me, there's a lot of accidental complexity we introduce through the tooling we use. Therefore, it's really important to embrace simple and effective solutions, to build things that solve the problems at hand, and to scale that.
A good strategy for this is to choose boring technology: technology that is well understood, both in what it can and can't do and in what failures might occur. You'll find all the failure modes on Stack Overflow, or ChatGPT has learned about them. It's well understood, you have a community, you have people to talk to. Sometimes a new problem comes up where the existing, boring technology doesn't work. One of the most boring technologies, which is pretty good at data, is actually Postgres. It's not as sexy as Snowflake, it's not as fast as DuckDB, but Postgres is a really good solution for a lot of the data cases we have. There's also the concept of innovation tokens, a really good idea: as a team, you have one or two or three innovation tokens per year that you can spend on new technologies. That way you manage yourself and don't lose yourself in new stuff.
Next to boring technology, and Postgres, build on well-understood technical foundations and open standards. Open standards give you the flexibility to argue with vendors and to switch things. They are well understood, and there's a foundation to build on. Next to Postgres, a lot of open standards are emerging around data storage right now; I think data storage in the cloud is largely becoming a commodity. If you're using Delta or Iceberg or Parquet, those are open standards. Query engines are also converging on open standards, and you can build on top of this.
Another way to keep things simple is leveraging the key concepts of the cloud: automation, scaling to zero, and reducing vertical integration. As I said, I'm not a big fan of Kubernetes in data, because I think there are a lot of good PaaS and SaaS platforms that operate at a higher level and enable you to do more with less in the end. This might be very familiar to a lot of software engineers, but in the data space you see a gap between what software engineers know to be good practice and what data engineering does.
Customer Case - Banking
Another case: a pretty conservative bank - slow, not that large, a regional bank, a cooperative - wanted to make more data-driven decisions. Which management doesn't want to make more data-driven decisions? The existing solutions didn't scale - not in terms of data volume, but in terms of what decisions can we make, what kind of insights do we get. They built a cloud data platform with a central team for the platform - no data mesh - but with different data user teams that actually use the platform. What we built is based on open standards.
A platform in the cloud - in this case Azure, but it could be any cloud - using Delta Lake as an open standard for storage and PySpark, which is open source, to process the data. Nobody wants to run Spark on their own cluster themselves; you never want to do that, so we used Databricks. We used Unity Catalog, which is proprietary, yes, but it gives us things we didn't get from any standard: permissions, management. It's not a business data catalog, it's a technical data catalog. We're using Databricks Serverless SQL to access that data.
Those are proprietary things, but the foundations are open source and open standards. We also used the Unity Catalog features to provide the data to others - as I said, we had a core team and multiple individual teams. The core team owns catalogs, and those can be shared with other teams; we're using Databricks' Unity Catalog features for that. If you build something like this on Kubernetes, you have to find your own solution for cataloging. Unity Catalog is open source, but I've not seen a lot of companies actually running it themselves. You want to get it managed, because you won't get paid for running a catalog most of the time. We use the catalogs for data sharing between those departments. This is proprietary, yes, but built on an open foundation. That open foundation enabled us, for instance, to use different query engines. Databricks is a great platform, I really like it; other platforms might be good as well, I'm just not that experienced with them.
If you're dealing with data that is not huge but still sizable - 30 gigs, 40 gigs, or whatever - Spark is actually really slow; there's a lot of overhead to process the data. Thanks to the open standards, we could use DuckDB to connect to that data. We can get a token from Unity Catalog and read the data from the Delta Lake. It's an open catalog, an open standard, nothing proprietary. Because one thing is, data is smaller than you might think. This is an analysis of Amazon Redshift, done by the team behind Amazon Redshift; it's a few months old already. The title is, Why TPC Is Not Enough.
TPC is a benchmark used in analytics to measure the speed of warehouses and SQL engines. The paper asked: is TPC still the right benchmark? They analyzed all the data they have in their system and found that most of it falls into the row-count buckets up to 10 million rows shown here. Ten million rows isn't that much data, but most of the data is around 10 million rows or less. It's not that big. Sometimes it's even enough to run on a single machine. You don't need distributed systems at all, which makes it simpler, easier, more cost-effective, and more reliable to build data platforms.
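Coming back to the DuckDB access mentioned above: a minimal sketch, assuming DuckDB's delta extension and its delta_scan table function; the storage path, table, and column names are placeholders, and the credential setup (in our case short-lived credentials obtained via Unity Catalog) is left out because it depends on your storage.

```python
# Minimal sketch of reading the Delta Lake with DuckDB instead of Spark.
# Assumes DuckDB's delta extension; path, table, and columns are placeholders.
import duckdb

con = duckdb.connect()
con.execute("INSTALL delta; LOAD delta;")
# ... register storage credentials here, e.g. via DuckDB secrets ...

df = con.execute("""
    SELECT order_date, sum(quantity) AS items
    FROM delta_scan('abfss://lake@account.dfs.core.windows.net/sales/orders')
    GROUP BY order_date
    ORDER BY order_date
""").df()
print(df.head())
```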
What We've Built - Small Stack (FinTech)
Another example: a smaller stack we built for a FinTech. They wanted to do everything on-prem, and there were reasons for that. We built things on Postgres, with dbt, and with Airflow to orchestrate - one can argue whether Airflow is a bit of overkill, yes. We used Python to ingest the data. We used the contracts from before to synchronize on the data and have a clear interface in between. It was really simple tech, nothing fancy, but it got the job done.
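A minimal sketch of what that orchestration can look like - it assumes Airflow 2.x with the standard Bash and Python operators, and the DAG id, schedule, and commands are illustrative, not the client's actual pipeline:

```python
# Minimal sketch of the orchestration on the small stack: ingest with Python,
# then run dbt models and tests against Postgres. Assumes Airflow 2.x.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator

def ingest() -> None:
    # Pull the source files / API data and load them into Postgres staging tables.
    ...

with DAG(
    dag_id="daily_ingest_and_transform",
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
):
    ingest_task = PythonOperator(task_id="ingest", python_callable=ingest)
    dbt_run = BashOperator(task_id="dbt_run", bash_command="dbt run")
    dbt_test = BashOperator(task_id="dbt_test", bash_command="dbt test")

    ingest_task >> dbt_run >> dbt_test
```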
One thing I want to give you: apply software engineering best practices. I'm at a more or less software engineering conference, so for you these are givens - writing tests for code and data, monitoring and observability, infrastructure as code - but in the data world they often aren't. It can be hard to convince people that tests are a benefit. You also have to understand that some things are definitely different: the data itself is another dimension you have to test as well, using separate environments.
Software engineers will say, of course I use separate environments. In data, it more often than not comes down to: separate environments? Yes, we have prod. Bringing it together: reducing the complexity of the platform also reduces the error surface, and reducing complexity reduces costs. All in all, this improves the reliability of your data platform and data pipelines. Using standards and proven practices simplifies operations - less complexity, less overhead - and sometimes a single-node system can take you a long way.
Customer Case - Industrial IoT Startup
We talked about having the pipelines and contracts in the platform, and about simplifying. Maybe let's take one step further here; another idea I just want to present. Another case: an industrial IoT startup. They collected sensor data from their electrical grid and visualized that data; eventually they also want to control the grid based on the data. What they did was put all the data into a document database. The team consisted mostly of full-stack developers, heavily leaning towards TypeScript and the ecosystem around it. A document database is fine then, because if you're building a frontend with TypeScript, all the data is in JSON - perfect.
At some point, they wanted to start doing data analysis. What we had was this: there was an ingest, the data was put into a document database, there was a UI and APIs connected to the document database, and there was a relational database for some master data - what stations do we have, what sensors do we have out in the field? Pretty common. The document database was also using the time-series mode those databases offer. Then they wanted to add aggregations, because they wanted hourly and daily values, so they wrote a Python script that runs against the document database. Then they wanted alerts running on the document database, getting the data out of it. The relational database was extended somehow because they wanted to do analytics. The center of the data was actually the document database.
To summarize: this was actually not a web application, but a data platform with a UI. The center of this application is data; we built a UI on it, but it's not a UI with the data somewhere behind it. So we redesigned it. We have the ingest, and the first place the data lands is the data platform. There is no application in between; the data platform is responsible for storing and analyzing the data. We do all the aggregation and the alerting on this data platform.
Then we have the document database, which is fed from the data platform with the data needed for visualization. It might be aggregated, it might be filtered, it might be optimized to fit the needs of the visualization. We do data analytics directly on the data platform. What we did is basically a shift left. Shift left is everywhere; shift left in security is about having it earlier in the process, in design as well. Here we shift left in the data flow, not so much in the process of creating the software - there might be ideas there as well, but this is a shift left in the data flow. Before, we went from data source to application, to some ETL or ELT, to somewhere we do analytics. We moved to: we have a data platform, with some optional ETL or ELT - might be there, might not be, depending on what data platform you have. You don't have to load the data; you might access the data platform directly.
Then we connect some analytics, and some web application where a user can do something. I'm not saying this should be done everywhere, definitely not. But especially for applications where data is at the center - and I see this a lot in IoT and industrial applications where you get a lot of sensor data, or in marketing where you get tracking data - it might actually make sense to put data at the center and not treat it like an application. We moved the data to the left.
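As an illustrative, hedged sketch of that shift-left flow: the platform owns the aggregation and feeds a visualization-ready view into the document database behind the UI. DuckDB only stands in for whatever engine the platform uses, and all table, column, and function names are hypothetical.

```python
# Illustrative sketch of the shift-left flow: the data platform owns the
# hourly aggregation and pushes a visualization-ready view into the document
# database behind the UI. Names are hypothetical.
import duckdb

def hourly_aggregates(platform_db_path: str) -> list[tuple]:
    con = duckdb.connect(platform_db_path)
    return con.execute("""
        SELECT station_id,
               date_trunc('hour', measured_at) AS hour,
               avg(value) AS avg_value,
               max(value) AS max_value
        FROM sensor_readings
        GROUP BY station_id, date_trunc('hour', measured_at)
    """).fetchall()

def push_to_document_db(rows: list[tuple]) -> None:
    # Write the pre-aggregated, UI-shaped documents into the document database
    # that serves the web application (implementation intentionally omitted).
    ...

# push_to_document_db(hourly_aggregates("platform.duckdb"))
```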
Why Build Different Platforms for Data and Applications?
When we think about this, we can take it one step further: why do we actually build different platforms for data and for applications? There was a talk about platform engineering and developer experience. It was largely about building application platforms for software engineers, with CI/CD, golden paths, and improving the developer experience, making it easy to deploy stuff. That's all good and fine. But what we end up with is an application platform where we run all the applications, and a data platform, which stores the data and powers ML, AI, and BI use cases. Take observability, for instance. Observability is a huge topic in application platforms; monitoring, logs, all that is built in. We definitely need more of this in data platforms, and there's a lot that could be reused.
Most of the time, though, the data platform solves it on its own. Take Databricks as an example: getting data out of Databricks for monitoring and observability purposes is possible, but definitely not easy. So we have two completely separate subsystems with different approaches and different technologies. I don't think it has to be like this. It actually widens the gap between those systems, because we're building on different platforms and different paradigms, and we should try to avoid that. We use the data platform for getting the data and for data management. Why don't we build combined application and data platforms - platforms that have in mind from the beginning that we want to support application development, running applications, and using the data from those applications for analytical purposes? We have things like storage and runtime.
The storage might be different for data than for applications, and the runtime might be different, but it comes from one vision - maybe not one team, but one platform effort. Everything around CI/CD: there are proven patterns for how to do CI/CD in pretty much every software company, and I think 80% of the time those can be reused for data. Why not provide them to data within the platform as well? The same goes for secret management - there's a lot of tooling around it, but it gets reinvented in data as well - and for test data management, and for the contracts themselves. The contracts sit between application and data; they are what is common to both and of course should be managed within this platform. You might be asking, what is the tech? There isn't one technology that powers all of this. You might say, we build everything on Kubernetes because we built our developer platform and our data platform on Kubernetes.
One might say Databricks or Snowflake or whatever is your thing. Coming back to what I said about platforms: think about what the essential complexity is, what the accidental complexity is, what complexity is actually needed to solve this. It might be that Kubernetes powers all of your cases, including data. It might be that you have two different technologies and you just create good bridges and good integrations between them. It might be that there's yet another solution. There isn't one answer.
Going Beyond Two-Tier Data Architectures
I promised I'd come back to DuckDB as well. This is an idea from DuckDB - I haven't seen it in practice - but I think it's worth thinking about: going beyond two-tier data architectures. The two-tier data architecture is basically this: we have data in the application, and we have data in the analytics part. The idea is this: DuckDB is a pretty fast embedded analytical query engine and database that can run basically everywhere - on a server, in the browser with Wasm, or in a Lambda function - and can read from basically anywhere. So you have a central data lake - in this case maybe more like a swamp, or something between a lake and a swamp - as the central datastore, powering both applications and analytics in the same way. It's an idea; we haven't seen this in a project.
I think it's an idea worth thinking about, maybe having in mind when designing this. Because, in my opinion, it would be a good thing if data and applications became one. We would have less friction between operational and analytics: common ground, a common platform, maybe common ideas for building things. We could define and encapsulate responsibilities more clearly. Remember the example from the IoT startup? There was a long discussion: who should do the aggregation, who is responsible for the alerting - the application team or the data team? Where does the knowledge live about when to do what? If you think of this as one thing, then you find common ground and you can decide together where it belongs. You might have optimized data representations for different cases.
A document database isn't actually that good for analytics - there are better options for that - but it's really good for UIs, for example. Likewise, Blob storage isn't that good for real-time requests from a web UI, but a document database is. It's really important to use the right tooling on both sides: not saying, we have Kubernetes and Postgres and everything, so just get started with that, but knowing the constraints and the differences that exist. And use the engineering best practices and good practices, and apply them to data.
Takeaways
What do we take away from this? First, validate and monitor your data early on. Have contracts on the provider side, and do it on the consumer side as well. Start at the source. Really bring the devs into this. Don't say, we just get the data and we're on our own. Try to get the devs on board by building solutions and abstractions that actually help them, that make it easy for them. Don't tell them, here's Databricks, publish it there, you can build workflows there, you just need to write Python and write it to Parquet. No, that's not going to work. Really go to the devs, meet them where they are, and address those pains. Of course, the organization has to prioritize data as well, otherwise every product owner of every team will just say: data? I don't care. Simplify the architecture; don't make it more complex.
Really make it manageable, also for smaller teams and for teams that are not that experienced. It doesn't have to be the latest, newest technology everywhere. Apply the good practices you know from software engineering. It might be worth thinking about the data and application platform together, not separating those worlds from the beginning, from the foundation. It gets split up in the applications and in the usage anyway, but maybe there's a common ground to build on. We've talked a lot about tech, but data modeling is really a thing, too. Just writing tables, joining everything, and ending up with one large table might not always be sufficient. Things like star schemas and dimensional modeling aren't the sexy part, but they're actually really helpful. Besides all the tech, there are also things to learn from classic data warehousing - sometimes it's really dry, but it's helpful.
Questions and Answers
Nardon: My question is on your take about using the same platform for the data and for the operational system. Usually, in companies, especially when you have several departments, you have teams that need data from other departments. That's usually why there's a data platform, where you can advertise your data and share it with other teams. If you're using the same platform for data and applications, this gets a little more complex, because you have the teams developing the data and having to share it. How do you do a data mesh architecture and share data between departments if you're using the same architecture?
Matthias Niehoff: Even when you build the same platform for applications and data, I would still have the typical data platform capabilities: a catalog, access management for different data, and a point where data is shared with others. That would still be a capability built into this application and data platform, definitely. It's not that we no longer have the traditional data platform with all its functions.
We still have those functions; they've just moved to one platform that is developed under one common vision, with one common ground and one common goal. I'm pretty sure that in a large organization, it won't be a single team building this platform either. What's more interesting is that the data department, especially in classical environments, often sits somewhere in finance, with the CFO as the boss, while the application platform sits in the CDO, CTO, or CIO department - completely different areas of the organization, with different cultures as well. That's the tricky part I see that would hinder adoption of this, yes, definitely.
Participant 1: How do you know when you should start buying a tool instead of building on your own? Because a lot of the other talks so far have said that you should buy before you build, but you say to choose boring tech. At what point do you know that you should switch to buying existing software?
Matthias Niehoff: Boring tech doesn't mean you shouldn't buy. Postgres, ok, you don't buy Postgres, but you could buy Databricks with PySpark, and PySpark is boring tech. It depends. You could use Wardley Maps or something similar to work out what is commodity and what is actually new, in genesis. If you're doing undifferentiated heavy lifting - things that don't really provide value to you - running a Kubernetes cluster as a data team is the example I would take.
Then you think about whether there is something that would reduce the vertical integration and give you more direct ways to work with the data, so that I don't need one person who is only there to run the Kubernetes cluster. I think that would be the thing. There isn't one single rule for this; it depends on your situation.
Participant 2: I think my question is around the complexities of having a platform that has already been adopted or consumed by multiple data teams. How easy is it actually to change technology? I think I'm asking this because we're moving in a world where technologies are released every day. We see that there's a Microsoft Fabric that's been introduced now to say that it's also removing the complexities around infrastructure. How do you actually manage that? Because we are also in a distributed type of mesh, and data teams are actually responsible for their own tech. How do you ensure that all those principles can actually be a central component?
Matthias Niehoff: When you're changing tech, the question is: what would be the motivation? Why would you change the tech? Just to standardize, so that everyone has the same technology? That alone wouldn't be enough of an argument. If you say we have to standardize because license costs are increasing, because everybody has bought their own tooling and so on - yes, then it's more or less a classical migration and modernization project, with a clear path for moving each of those teams to a central technology or a central platform. You still want distributed responsibilities, I understand, so teams stay responsible for their data, but they should use common infrastructure. It's more like a normal modernization project.
Participant 3: You commented that running the same data platform for applications as well as, let's say, analytics is not something you've seen much of. I'm aware of a few that run it - Uber, us - and usually the complexity is around non-functionals. There are two patterns I know: running multiple concurrent versions of the same platform optimized for different non-functionals, and the one we're attempting, which is to use the same platform, overlaying the tooling on the same source. You keep talking about DuckDB and these tools. Given those two paradigms, do you have a thought as to which one would be more efficient, profitable, functional?
Matthias Niehoff: You mean by having one platform or having separate platforms?
Participant 3: One platform with three different copies of the data optimized for the different non-functional use cases or one version of the data platform that then has to support optimizing for different use cases.
Matthias Niehoff: Yes, either optimizing the data or optimizing the processing is basically the question.
Participant 3: For me, the question is usually around the idea of overlaying tooling, so different interfaces, query products.
Matthias Niehoff: It depends a bit. For instance, if you're using a database as your storage, then the query engine is baked in and you can't change much. If your data is stored in Blob storage underneath and you just use different engines on top, then you're flexible, of course. I think that flexibility actually pays off, and I would go that way. It depends a lot on how you store the data, but you made one good point: this tight coupling of storage and query engine is going away more and more. We have storage, often pretty cheap Blob storage, and we have query engines optimized for different use cases, definitely.