
Ways to handle huge transaction volumes on any database?

I have a data transformation product that lets one select tables in a source database and transform their row data into a destination database. A transformation could be masking or encrypting row data (often done for test databases).

The current product (a Java-based workbench and engine) handles this by processing 1,000 rows at a time across 10 parallel threads (a minimal sketch of this pattern follows the list below). This approach works fine on smaller data sets, but when I have to transform a huge data set (say about 25 million records) in one run, it still works, except that:

  • The CPU of the host machine on which my product runs is under heavy load.
  • The source and target databases are hit with so many transactions that they start slowing down. (This could partly be because the database server is running on slower hardware.)
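For context, a minimal sketch of the current "1,000 rows per batch, 10 worker threads" pattern might look like the following. The row source, the maskValue function, and the "write" step are placeholders, not the product's actual code; the real engine reads from the source database and writes to the destination (e.g. over JDBC with batched statements).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Sketch of the batch-and-thread-pool pattern: split the rows into fixed-size
// batches and let a pool of 10 workers transform and "write" each batch.
public class BatchTransformSketch {

    static final int BATCH_SIZE = 1_000;
    static final int WORKER_THREADS = 10;

    // Hypothetical masking routine; stands in for the product's configured transform.
    static String maskValue(String value) {
        return new StringBuilder(value).reverse().toString();
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated source rows; in the real product these are fetched from the source DB.
        List<String> sourceRows = new ArrayList<>();
        for (int i = 0; i < 25_000; i++) {
            sourceRows.add("row-" + i);
        }

        ExecutorService pool = Executors.newFixedThreadPool(WORKER_THREADS);

        for (int start = 0; start < sourceRows.size(); start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, sourceRows.size());
            List<String> batch = sourceRows.subList(start, end);

            pool.submit(() -> {
                // Transform the whole batch, then write it to the destination in one go
                // (in the real product this would be a batched INSERT/UPDATE).
                List<String> transformed = new ArrayList<>(batch.size());
                for (String row : batch) {
                    transformed.add(maskValue(row));
                }
                System.out.println("wrote batch of " + transformed.size() + " rows");
            });
        }

        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

The problem is that all 10 workers run on the single host machine and hammer both databases at once, which is exactly where the CPU load and transaction pressure described above come from.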

I started looking for solutions and was quick to address this by asking my client to beef up the hardware on the source/destination database server machines, say by buying a new multi-core CPU and some extra RAM. It turns out upgrading hardware wasn't the only concern: my client was also faced with buying additional database licenses, because the database is licensed per core and the new CPU has multiple cores.

So the ball is in my court now: I will have to address this issue by making changes to my product, and this is where I need your help. At the moment, I can think of one possible approach to handling huge loads:

Approach 1

  1. Read data from the source database and persist it to a temporary medium (a file).
  2. Transform the data in the persisted file by processing it in a distributed environment of cheaper, single-core machines, accepting the trade-off of switching to file persistence. (Something like Apache Hadoop could handle the distributed computing part; a mapper sketch follows this list.)
  3. Write the transformed data to the destination database.
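As an illustration of step 2, a map-only Hadoop job could do the per-row transformation over the exported file. This is a sketch under assumptions: the export is one row per line, comma-separated, and MaskRowMapper, SENSITIVE_COLUMN, and the masking logic are hypothetical placeholders for whatever the product's configured transform is.

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only transformation: each mapper reads lines of the exported file,
// masks the sensitive column, and emits the transformed line. The output
// files would then be bulk-loaded into the destination database in step 3.
public class MaskRowMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    // Hypothetical: position of the column to mask in the exported row.
    private static final int SENSITIVE_COLUMN = 2;

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String[] columns = line.toString().split(",", -1);
        if (SENSITIVE_COLUMN < columns.length) {
            // Placeholder masking; the real product would apply its configured
            // masking/encryption routine here.
            columns[SENSITIVE_COLUMN] = "***MASKED***";
        }
        context.write(NullWritable.get(), new Text(String.join(",", columns)));
    }
}
```

With zero reducers there is no shuffle overhead, and the transformed files could be bulk-loaded into the destination in step 3 instead of issuing row-by-row transactions, which is where I hope the pressure on the target database would be relieved.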

This is all I could come up with for now from an architectural perspective. Have you handled this situation before? If so, how did you handle it? I'd appreciate your suggestions and help.