
I am trying to understand how to implement a machine learning algorithm, where the preprocessing and postprocessing are heavy tasks, inside AWS SageMaker. The main idea is to get data from S3: each time the data changes in S3, CloudWatch triggers a Lambda function that invokes a SageMaker endpoint. The problem is that, once the algorithm is trained, I need to preprocess the new data (custom NLP preprocessing) before predicting on it. Once the algorithm has made its prediction, I need to take that prediction, post-process it, and then send the post-processed data to S3. The idea I have in mind is to create a Docker image:

├── text_classification/ - ml scripts
│   ├── app.py
│   ├── config.py
│   ├── data.py
│   ├── models.py
│   ├── predict.py - pre-processing and post-processing of data
│   ├── train.py
│   ├── utils.py

So I will do the pre-processing and the post-processing inside "predict.py". When I invoke the endpoint for a prediction, that script will run. Is this correct?
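For context, here is a rough sketch of what I imagine "predict.py" doing; the helpers load_model and clean_text are placeholders I have not written yet:

```python
# predict.py -- rough sketch only; load_model and clean_text are placeholders
import json

from models import load_model   # assumed helper in models.py
from data import clean_text     # assumed NLP preprocessing helper in data.py

model = load_model()             # loaded once when the container starts


def preprocess(raw_payload):
    """Custom NLP preprocessing applied before inference."""
    records = json.loads(raw_payload)
    return [clean_text(r["text"]) for r in records]


def postprocess(predictions):
    """Shape the model output before it is sent back to S3."""
    return json.dumps({"predictions": [float(p) for p in predictions]})


def handle(raw_payload):
    """Called by app.py for each /invocations request on the endpoint."""
    features = preprocess(raw_payload)
    predictions = model.predict(features)
    return postprocess(predictions)
```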

2 Answers


Take a look at using Step Functions to orchestrate the entire workflow for you.

Have the CloudWatch event trigger a Step Function that would do the following:

  • Preprocess the data.
  • Create predictions (if it's a batch process, why not use Batch Transform instead?).
  • Use a retry loop to check whether inference has completed (a sketch of the corresponding Lambda handlers follows this list).
  • Once inference has completed, run post-processing on the data and copy the results to S3.
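As a rough sketch, assuming steps 2 and 3 are implemented as Lambda functions called by the state machine (the model name, job prefix, and S3 paths below are placeholders, not anything from your setup):

```python
# Hypothetical Lambda handlers wired into the Step Functions workflow above.
import boto3

sagemaker = boto3.client("sagemaker")


def start_batch_transform(event, context):
    """Kick off a SageMaker Batch Transform job on the preprocessed data."""
    job_name = f"text-clf-{event['execution_id']}"
    sagemaker.create_transform_job(
        TransformJobName=job_name,
        ModelName="text-classification-model",           # assumed model name
        TransformInput={
            "DataSource": {"S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": "s3://my-bucket/preprocessed/",  # assumed prefix
            }},
            "ContentType": "application/json",
        },
        TransformOutput={"S3OutputPath": "s3://my-bucket/predictions/"},
        TransformResources={"InstanceType": "ml.m5.large", "InstanceCount": 1},
    )
    return {"transform_job_name": job_name}


def check_transform_status(event, context):
    """Polled by the Step Functions retry loop until the job finishes."""
    status = sagemaker.describe_transform_job(
        TransformJobName=event["transform_job_name"]
    )["TransformJobStatus"]
    return {"transform_job_name": event["transform_job_name"], "status": status}
```

The retry/catch configuration on the status-check state would keep re-invoking check_transform_status until it returns Completed or Failed, then move on to the post-processing step.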

4 Comments

Many thanks. In your opinion, is this solution easier to implement than the one I proposed?
Not only easier, but also more resilient, easier to debug, and a better architecture.
Last question: when you say "preprocess data", how? As far as I understood, it is not possible (in a serverless environment) to call a Python script to preprocess (I could use a Lambda function to preprocess the data, but there are too many constraints). What should I use? Thanks again, really.
You would either invoke a Lambda function that runs the preprocessing of the data, run an ECS task (which could be Fargate), or run the task on EC2. Step Functions can trigger any of these.
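For illustration, a preprocessing Lambda along those lines could look roughly like this (the bucket layout and the clean_text stand-in are assumptions; preprocessing that exceeds Lambda's limits would move to Fargate as suggested):

```python
# Hypothetical preprocessing Lambda; bucket names, key layout, and clean_text
# are placeholders for illustration only.
import json
import boto3

s3 = boto3.client("s3")


def clean_text(text):
    """Stand-in for the custom NLP preprocessing (trivial placeholder)."""
    return text.lower().strip()


def handler(event, context):
    """Read raw records from S3, preprocess them, write them back for inference."""
    bucket, key = event["bucket"], event["key"]
    body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
    records = json.loads(body)
    cleaned = [{"text": clean_text(r["text"])} for r in records]
    out_key = f"preprocessed/{key.split('/')[-1]}"
    s3.put_object(Bucket=bucket, Key=out_key,
                  Body=json.dumps(cleaned).encode("utf-8"))
    return {"bucket": bucket, "key": out_key}
```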

You can also explore Amazon SageMaker Inference Pipelines.

An inference pipeline is an Amazon SageMaker model that is composed of a linear sequence of two to five containers that process requests for inferences on data. You use an inference pipeline to define and deploy any combination of pretrained Amazon SageMaker built-in algorithms and your own custom algorithms packaged in Docker containers. You can use an inference pipeline to combine preprocessing, predictions, and post-processing data science tasks. Inference pipelines are fully managed.
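For example, here is a minimal sketch with the SageMaker Python SDK, assuming the preprocessing container and the trained classifier have already been pushed to ECR (the image URIs, model artifact paths, role ARN, and endpoint name are placeholders):

```python
# Hypothetical inference pipeline built with the SageMaker Python SDK.
import sagemaker
from sagemaker.model import Model
from sagemaker.pipeline import PipelineModel

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"   # assumed execution role

# Container 1: custom NLP preprocessing (and post-processing on the way out).
preprocess_model = Model(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/nlp-preprocess:latest",
    model_data="s3://my-bucket/preprocess/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# Container 2: the trained text-classification model.
classifier_model = Model(
    image_uri="123456789012.dkr.ecr.eu-west-1.amazonaws.com/text-classification:latest",
    model_data="s3://my-bucket/training/model.tar.gz",
    role=role,
    sagemaker_session=session,
)

# The containers are invoked in order for every request to the endpoint.
pipeline = PipelineModel(
    name="text-classification-pipeline",
    role=role,
    models=[preprocess_model, classifier_model],
    sagemaker_session=session,
)
pipeline.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.large",
    endpoint_name="text-classification-pipeline",
)
```

Each request to the endpoint is passed through the containers in sequence, so preprocessing, prediction, and post-processing all happen behind a single InvokeEndpoint call.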

Comments
