$\begingroup$

I'm not sure if this is the right place to put this.

I have a dataframe with just 4 columns but around 2000 rows, loaded from a CSV file. A typical value of the first column, which is titled MONTH, looks like this: 1;1990-01-02;1990;1;23. It seems that 1 is the index, followed by the year, month, and day. The two other numbers don't make sense to me, and I think I can throw them out. I want to clean up this column and convert it to a date format, because the 4th column contains precipitation information (with many NaNs) that I want to look into. I'd like to get the maximum precipitation for every month.

Here's what I started out by doing

[Screenshot of code; image not available]

I then want to get just the element at index 1 of that list by doing

df["MONTH"]=df["MONTH"].map(lambda x: x[1]) 

but I get an error saying

AttributeError: 'list' object has no attribute 'split' 

The next step would have been to do something like

df["MONTH"]=pd.to_datetime(df["MONTH"])

How do I fix my error (or is there a better way of doing it?), and then get the max PRCP for January of 1990, February of 1990, ..., November of 2020? (Some dates in some months are missing.)

$\endgroup$

1 Answer

$\begingroup$

How do I fix my error?

The error comes from mapping x.split(';') a second time, after the MONTH strings had already been converted into lists. That line can only be run once on the freshly loaded data. Also note that index 1 can be accessed directly with x.split(';')[1], instead of mapping x.split(';') and then mapping x[1].
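A minimal sketch that reproduces the error and shows the single-step version. The Series `s` and its values are made up for illustration in the question's format; they are not the asker's actual data.

```python
import pandas as pd

# Toy data in the question's format (illustrative values only).
s = pd.Series(['1;1990-01-02;1990;1;23', '2;1990-01-04;1990;1;23'])

# First pass: every string becomes a list of fields.
lists = s.map(lambda x: x.split(';'))

# A second pass over the same data fails, because lists have no .split method.
try:
    lists.map(lambda x: x.split(';'))
    error_message = None
except AttributeError as exc:
    error_message = str(exc)

# Single-step version: split and take index 1 in one map call.
dates = s.map(lambda x: x.split(';')[1])

print(error_message)
print(dates.tolist())
```

Reloading the CSV (or re-creating the dataframe) before rerunning the split line avoids the second pass.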

Or is there a better way of doing it?

Yes, it's best to use the vectorized Series.str methods instead of Series.map:

    import numpy as np
    import pandas as pd

    # Small sample in the same format as the question's data
    df = pd.DataFrame({
        'MONTH': ['1;1990-01-02;1990;1;23',
                  '2;1990-01-04;1990;1;23',
                  '3;1990-01-05;1990;1;25',
                  '4;1990-02-02;1990;2;23',
                  '5;1990-02-04;1990;2;23'],
        'PRCP': [np.nan, np.nan, 3, 9, 5],
    })
    #                     MONTH  PRCP
    # 0  1;1990-01-02;1990;1;23   NaN
    # 1  2;1990-01-04;1990;1;23   NaN
    # 2  3;1990-01-05;1990;1;25   3.0
    # 3  4;1990-02-02;1990;2;23   9.0
    # 4  5;1990-02-04;1990;2;23   5.0

    # Split on ';', keep the field at index 1 (the date), and parse it
    df['MONTH'] = pd.to_datetime(df['MONTH'].str.split(';').str[1])
    #        MONTH  PRCP
    # 0 1990-01-02   NaN
    # 1 1990-01-04   NaN
    # 2 1990-01-05   3.0
    # 3 1990-02-02   9.0
    # 4 1990-02-04   5.0
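If some rows of the real file turn out not to follow the expected format, a slightly more defensive variant may help. This is only a sketch under that assumption: the malformed rows below are invented for illustration, and the explicit `format` and `errors='coerce'` arguments are an addition to the approach above, not something the question's data is known to need.

```python
import pandas as pd

# Illustrative values: one well-formed row and two hypothetical malformed rows.
raw = pd.Series([
    '1;1990-01-02;1990;1;23',   # well-formed
    '2;not-a-date;1990;1;23',   # date field is not a date
    'garbage',                  # no ';' separators at all
])

# .str[1] yields NaN when a row has no field at index 1.
date_field = raw.str.split(';').str[1]

# errors='coerce' turns anything unparsable into NaT instead of raising.
parsed = pd.to_datetime(date_field, format='%Y-%m-%d', errors='coerce')

print(parsed)
print(parsed.isna().sum(), 'row(s) could not be parsed')
```

Checking `parsed.isna()` afterwards shows which rows would need a closer look before grouping.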

How do I get the max PRCP for January of 1990, February of 1990, ..., November of 2020?

Use groupby followed by max on the PRCP column to compute the max PRCP per year-month (NaN values are skipped by default):

  • Either group by "Month of YYYY" strings using Series.dt.strftime (note that the result is sorted alphabetically by label, not chronologically):

    ym = df.MONTH.dt.strftime('%B of %Y')
    df.groupby(ym).PRCP.max()
    # MONTH
    # February of 1990    9.0
    # January of 1990     3.0
    # Name: PRCP, dtype: float64
  • Or group by explicit years and months using Series.dt.year and Series.dt.month:

    ym = [df['MONTH'].dt.year, df['MONTH'].dt.month]
    df.groupby(ym)['PRCP'].max()
    # MONTH  MONTH
    # 1990   1        3.0
    #        2        9.0
    # Name: PRCP, dtype: float64
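A third variant, not part of the two options above, groups by monthly periods via Series.dt.to_period('M'). It gives a single-level index that sorts chronologically. The sketch below rebuilds the sample data with already-parsed dates so that it runs on its own; the name `monthly_max` is just for illustration.

```python
import numpy as np
import pandas as pd

# Same sample values as above, with MONTH already parsed to datetimes.
df = pd.DataFrame({
    'MONTH': pd.to_datetime(['1990-01-02', '1990-01-04', '1990-01-05',
                             '1990-02-02', '1990-02-04']),
    'PRCP': [np.nan, np.nan, 3, 9, 5],
})

# One Period per calendar month; max() skips NaN values by default.
monthly_max = df.groupby(df['MONTH'].dt.to_period('M'))['PRCP'].max()
print(monthly_max)
```

Months with no rows at all simply do not appear in the result, which matches the other two variants.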
$\endgroup$