difference between df.loc[:, reversed(colnames)] and df.loc[:, list(reversed(colnames))]

df.loc[:, reversed(colnames)] can retrieve a slice from df, but it cannot be used to assign to df. To assign to that slice of df, I have to use df.loc[:, list(reversed(colnames))] instead.
Observe the following code and output for clarification:
import pandas as pd
import numpy as np
df = pd.DataFrame({'a': list('asdf'), 'b': range(4)})
print('df==', df, sep='\n')
colnames = ['a', 'b']
print('\ndf.loc[:, reversed(colnames)]:', df.loc[:, reversed(colnames)], sep='\n')
df.loc[:, reversed(colnames)] = np.nan
print('\ndf after df.loc[:, reversed(colnames)]=np.nan:', df, sep='\n')
df.loc[:, list(reversed(colnames))] = np.nan
print('\ndf after df.loc[:, list(reversed(colnames))]=np.nan:', df, sep='\n')
and the output:
df==
a b
0 a 0
1 s 1
2 d 2
3 f 3
df.loc[:, reversed(colnames)]:
b a
0 0 a
1 1 s
2 2 d
3 3 f
df after df.loc[:, reversed(colnames)]=np.nan:
a b
0 a 0
1 s 1
2 d 2
3 f 3
df after df.loc[:, list(reversed(colnames))]=np.nan:
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
This behavior is new: I didn't need list() to do the assignment when I was using Anaconda 2018.12, but I recently upgraded to Anaconda 2019.03.
I wonder what changed to cause this.
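Whatever changed internally, the output above points at a safe pattern: materialize the iterator before using it as an indexer in an assignment. A minimal sketch of that workaround (my own, not from the original post):
import pandas as pd
import numpy as np

df = pd.DataFrame({'a': list('asdf'), 'b': range(4)})
colnames = ['a', 'b']

# reversed() returns a one-shot iterator; materializing it into a concrete
# list of labels makes the .loc assignment behave the same as retrieval.
cols = list(reversed(colnames))
df.loc[:, cols] = np.nan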

Related

DataFrame: take union of columns and retain first non-NaN value

DataFrame df has many thousands of columns and rows. For a subset of columns given in a particular sequence, say columns B, C, E, I want to fill NaN values in B with the first non-NaN value found in the remaining columns (C, E), searching sequentially. Finally, C and E are dropped.
A sample df can be built as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(10*(2+np.random.randn(6, 5)), columns=list('ABCDE'))
df.loc[1, 'B'] = np.nan
df.loc[2, 'B'] = np.nan
df.loc[5, 'B'] = np.nan
df.loc[2, 'C'] = np.nan
df.loc[5, 'C'] = np.nan
df.loc[2, 'D'] = np.nan
df.loc[2, 'E'] = np.nan
df.loc[4, 'E'] = np.nan
df
A B C D E
0 18.161033 6.453597 25.253036 18.542586 20.667311
1 27.629402 NaN 40.654821 22.804547 23.633502
2 15.459256 NaN NaN NaN NaN
3 19.115203 4.002131 14.167508 23.796780 29.557706
4 27.180622 NaN 20.763618 15.923794 NaN
5 17.917170 NaN NaN 21.865184 9.867743
The expected outcome is as follows:
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here is one way:
drop = ['C', 'E']
fill = 'B'
d = dict(zip(df.columns, [fill if x in drop else x for x in df.columns]))
df.groupby(d, axis=1).first()
Out[172]:
A B D
0 14.472915 30.598602 24.528571
1 22.010242 22.215140 15.412039
2 5.383674 NaN NaN
3 38.265940 24.746673 35.367622
4 22.730089 20.244289 27.570413
5 31.216037 15.496690 9.746814
IIUC, use bfill to backfill, then drop to remove unwanted columns.
df.assign(B=df[['B', 'C', 'E']].bfill(axis=1)['B']).drop(['C', 'E'], axis=1)
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here's a slightly more generalised version of the one above:
to_drop = ['C', 'E']
upd = 'B'
df.update(df[[upd, *to_drop]].bfill(axis=1)[upd]) # in-place
df.drop(to_drop, axis=1) # not in-place, need to assign
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
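Since the question mentions many thousands of columns, it may help to wrap the fill-then-drop step in a small helper so it can run for several column groups. A sketch under that assumption; the function name and signature are mine, not from any answer:
import pandas as pd

def fill_first_then_drop(frame, target, donors):
    # Fill NaNs in `target` with the first non-NaN value found in
    # `donors` (searched left to right), then drop the donor columns.
    out = frame.copy()
    out[target] = out[[target, *donors]].bfill(axis=1)[target]
    return out.drop(columns=donors)

# e.g. fill_first_then_drop(df, 'B', ['C', 'E'])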

map DataFrame index and forward fill nan values

I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced). I want to create a new DataFrame with equally spaced index values and forward-fill the column values. Below is a simple example.
I have:
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
and I want to use the above to create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
import numpy as np

df = df.reindex(np.arange(0, df.index.max() + 1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0], df.index[-1] + 1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(), df.index.max() + 1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
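All three answers do the same thing here. For completeness, a variant that avoids the numpy import by building the dense index with pd.RangeIndex (my own tweak, not from the answers):
import pandas as pd

df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
# pd.RangeIndex(start, stop) plays the role of np.arange here.
df = df.reindex(pd.RangeIndex(df.index.min(), df.index.max() + 1), method='ffill')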

combining columns in pandas dataframe

I have the following dataframe:
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan],
    'user_b': ['A', 'B', np.nan, 'D']
})
I would like to create a new column called user that holds, for each row, the non-missing user value.
What's the best way to do this for many user columns?
Forward-fill the missing values along each row, then select the last column with iloc:
df = pd.DataFrame({
    'user_a': ['A', 'B', 'C', np.nan, np.nan],
    'user_b': ['A', 'B', np.nan, 'D', np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
Use the .apply method:
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
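Note that the two answers pick from opposite ends: ffill(axis=1).iloc[:, -1] keeps the last non-missing value in each row, while the apply version keeps the first. If that distinction matters, a vectorized equivalent of the apply answer is (my own sketch, not from the answers):
# Take the first (rather than last) non-missing value per row, vectorized:
df['user'] = df[['user_a', 'user_b']].bfill(axis=1).iloc[:, 0]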

Why does concat Series to DataFrame with index matching columns not work?

I want to append a Series to a DataFrame where the Series' index matches the DataFrame's columns, using pd.concat, but it gives me a surprise:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1, 2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not aware that the Series' index matches the DataFrame's columns. So is it true that if I want to concat a Series as a new row to a DataFrame, I can only use df.append instead?
You need a DataFrame from the Series, via to_frame and transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
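Worth noting for newer pandas (my addition, not from the original post): DataFrame.append was deprecated in pandas 1.4 and removed in 2.0, so the to_frame().T + concat pattern above is the forward-compatible one:
import pandas as pd

df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1, 2], index=['a', 'b'], name=1)
# The one-row DataFrame aligns on columns, so concat stacks it as a row.
df = pd.concat([df, sr.to_frame().T])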

Combine multiple columns in Pandas excluding NaNs

My sample df has four columns with NaN values. The goal is to concatenate the values in each row while excluding the NaN values.
import pandas as pd
import numpy as np
df = pd.DataFrame({'keywords_0': ["a", np.nan, "c"],
                   'keywords_1': ["d", "e", np.nan],
                   'keywords_2': [np.nan, np.nan, "b"],
                   'keywords_3': ["f", np.nan, "g"]})
keywords_0 keywords_1 keywords_2 keywords_3
0 a d NaN f
1 NaN e NaN NaN
2 c NaN b g
I want to accomplish the following:
keywords_0 keywords_1 keywords_2 keywords_3 keywords_all
0 a d NaN f a,d,f
1 NaN e NaN NaN e
2 c NaN b g c,b,g
Pseudo code:
cols = [df.keywords_0, df.keywords_1, df.keywords_2, df.keywords_3]
df["keywords_all"] = df["keywords_all"].apply(lambda cols: ",".join(cols), axis=1)
I know I can use ",".join() to get the exact result, but I am unsure how to pass the column names into the function.
You can apply ",".join() on each row by passing axis=1 to the apply method. You first need to drop the NaNs though. Otherwise you will get a TypeError.
df.apply(lambda x: ','.join(x.dropna()), axis=1)
Out:
0 a,d,f
1 e
2 c,b,g
dtype: object
You can assign this back to the original DataFrame with
df["keywords_all"] = df.apply(lambda x: ','.join(x.dropna()), axis=1)
Or if you want to specify columns as you did in the question:
cols = ['keywords_0', 'keywords_1', 'keywords_2', 'keywords_3']
df["keywords_all"] = df[cols].apply(lambda x: ','.join(x.dropna()), axis=1)
Here is another solution, using to_string:
df1 = df[cols].copy()  # assuming df1 is a copy of the keyword columns
df1[df1.isnull()] = ''
df1.apply(lambda x: x.to_string(index=False, na_rep=False), axis=1).replace({'\n': ','}, regex=True)
Then just assign it back to your keywords_all column:
df['keywords_all'] = df1.apply(lambda x: x.to_string(index=False, na_rep=False), axis=1).replace({'\n': ','}, regex=True)
or
df.assign(keywords_all=df1.apply(lambda x: x.to_string(index=False, na_rep=False), axis=1).replace({'\n': ','}, regex=True))
Out[397]:
keywords_0 keywords_1 keywords_2 keywords_3 keywords_all
0 a d NaN f a,d,f
1 NaN e NaN NaN e
2 c NaN b g b,c,g
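For comparison, a minimal alternative using stack, which drops NaN by default (my own sketch, not from the answers):
cols = ['keywords_0', 'keywords_1', 'keywords_2', 'keywords_3']
# stack() drops NaN and yields a Series with a (row, column) MultiIndex;
# grouping on the original row index (level 0) joins the surviving values.
df['keywords_all'] = df[cols].stack().groupby(level=0).agg(','.join)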
