$\begingroup$

I'm relatively new to ML/statistical analysis, and I'm facing a dataset structured like this:

    person_id, pay, task, hours
    1, 560, A, 3
    1, 560, B, 5
    2, 650, A, 7
    3, 520, C, 6
    3, 520, A, 2
    ...

meaning person 1 is paid 560 in total for performing task A for 3 hours and task B for 5 hours; person 2 is paid 650 for task A for 7 hours; person 3 is paid 520 for task C for 6 hours and task A for 2 hours, and so on.

I'd like to fit a regression where X is (task, hours) and Y is the per-person pay, but I haven't yet figured out how to approach such a problem. My preferred toolbox is Python with scikit-learn, but a generic discussion would be useful as well.

Grouped by person, the same data looks like this:

    person_id, pay, tasks
    [1, 560, [[A, 3], [B, 5]]]
    [2, 650, [[A, 7]]]
    [3, 520, [[C, 6], [A, 2]]]
    ...

Here person_id is a high-cardinality identifier that can safely be ignored, and the target Y is pay. The tasks feature (X) has its own structure: each entry has a fixed shape (2 dimensions here, task and hours), but the number of entries per person is not predetermined, although it is bounded (maybe 5-10 possible distinct tasks). I can't see how to fit such structured feature data into a regression scheme. Should I "flatten" tasks by giving every possible task its own column (A hours, B hours, C hours, etc.), or is a more general approach possible?
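To make the "flattening" option concrete, here is a minimal sketch of what I have in mind, assuming pandas and scikit-learn. The column names follow the toy table in this question; the names `long_df`, `X`, `y` and `model` are just illustrative, and a plain linear model is only one possible choice.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Toy rows from the question, in long format: one row per (person, task).
long_df = pd.DataFrame(
    {
        "person_id": [1, 1, 2, 3, 3],
        "pay": [560, 560, 650, 520, 520],
        "task": ["A", "B", "A", "C", "A"],
        "hours": [3, 5, 7, 6, 2],
    }
)

# One row per person, one column per task.
# Tasks a person did not perform get 0 hours.
X = long_df.pivot_table(
    index="person_id",
    columns="task",
    values="hours",
    aggfunc="sum",
    fill_value=0,
)

# pay is repeated on every row of a person, so keep one value per person
# and align it with the rows of X.
y = long_df.groupby("person_id")["pay"].first().loc[X.index]

# Assumption: pay is roughly linear in the hours spent on each task.
model = LinearRegression().fit(X, y)
```

With this layout each coefficient would read as an approximate hourly rate for one task, which only makes sense if pay is close to additive across tasks.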

Moreover, this is a simplified version of my problem, chosen to keep the description short. The real data could include more dimensions in the tasks structure, in which case the number of "flattened" task features would quickly explode to account for all possible combinations.
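As a rough illustration of how the column count grows, here is a sketch with one extra per-task measure. The `pieces` column and its values are hypothetical, made up purely for this example. As far as I can tell, a plain pivot gives (number of tasks) × (number of measures) columns; it would only grow faster than that if interaction terms between them were added.

```python
import pandas as pd

# Hypothetical extension of the toy data: "pieces" is an invented second
# measure per task, with made-up values, used only to show the layout.
long_df = pd.DataFrame(
    {
        "person_id": [1, 1, 2, 3, 3],
        "task": ["A", "B", "A", "C", "A"],
        "hours": [3, 5, 7, 6, 2],
        "pieces": [10, 4, 12, 8, 1],
    }
)

# Pivot both measures at once; the columns become a (measure, task) MultiIndex.
wide = long_df.pivot_table(
    index="person_id",
    columns="task",
    values=["hours", "pieces"],
    aggfunc="sum",
    fill_value=0,
)

# Flatten the MultiIndex into names such as "hours_A" or "pieces_C".
wide.columns = [f"{measure}_{task}" for measure, task in wide.columns]

# 3 tasks x 2 measures = 6 feature columns.
n_features = wide.shape[1]
```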

Any help is welcome and appreciated. Thanks!

$\endgroup$
  • $\begingroup$ Can you expand on the non-simplified case: How many distinct tasks? Do the task-groupings have any structure? (e.g. "A & B go together usually", "B & C do not go together usually") $\endgroup$ Commented Mar 26, 2021 at 16:32
  • $\begingroup$ Hi @GeoMatt22 - actually there are no such structures that I can predict. The non-simplified case might be like "task type: X, nr. of hrs: n, nr. of pieces: m", or have even more dimensions beyond task type, hours, or pieces. After thinking about it a bit more, it seems to me that this resembles a deep learning exercise more than a regression: each collection of points in X space (task type, hours, pieces) maps to an expected value of Y (pay), like a collection of dots in an n-dimensional image mapping to a label, or something similar. $\endgroup$ Commented Mar 29, 2021 at 7:19
