df.loc[:, reversed(colnames)] can retrieve a slice from df, but can not be used to assign to df. In order to assign to that slice of df, I have to use df.loc[:, list(reversed(colnames))]
Observe the following code and output for clarification:
import pandas as pd
import numpy as np
df = pd.DataFrame(
{'a':list('asdf'), 'b':range(4)}
)
print('df==', df, sep='\n')
colnames = ['a', 'b']
print('\ndf.loc[:, reversed(colnames)]:', df.loc[:, reversed(colnames)], sep='\n')
df.loc[:, reversed(colnames)] = np.nan
print('\ndf after df.loc[:, reversed(colnames)]=np.nan:', df, sep='\n')
df.loc[:, list(reversed(colnames))] = np.nan
print('\ndf after df.loc[:, list(reversed(colnames))]=np.nan:', df, sep='\n')
and the output:
df==
a b
0 a 0
1 s 1
2 d 2
3 f 3
df.loc[:, reversed(colnames)]:
b a
0 0 a
1 1 s
2 2 d
3 3 f
df after df.loc[:, reversed(colnames)]=np.nan:
a b
0 a 0
1 s 1
2 d 2
3 f 3
df after df.loc[:, list(reversed(colnames))]=np.nan:
a b
0 NaN NaN
1 NaN NaN
2 NaN NaN
3 NaN NaN
This behavior is new. I didn't have to list() to do assignment when I was using Anaconda 2018.12, I recently upgraded to Anaconda 2019.03.
I wonder what changed and caused this.
Related
Dataframe df has many thousand columns and rows. For a subset of columns that are given in a particular sequence, say columns B, C, E, I want to fill NaN values in B with first non-NaN value found in remaining columns (C, E) searching sequentially. Finally C, E are dropped
Sample df can be built as follows:
import numpy as np
import pandas as pd
df = pd.DataFrame(10*(2+np.random.randn(6, 5)), columns=list('ABCDE'))
df.loc[1, 'B'] = np.nan
df.loc[2, 'B'] = np.nan
df.loc[5, 'B'] = np.nan
df.loc[2, 'C'] = np.nan
df.loc[5, 'C'] = np.nan
df.loc[2, 'D'] = np.nan
df.loc[2, 'E'] = np.nan
df.loc[4, 'E'] = np.nan
df
A B C D E
0 18.161033 6.453597 25.253036 18.542586 20.667311
1 27.629402 NaN 40.654821 22.804547 23.633502
2 15.459256 NaN NaN NaN NaN
3 19.115203 4.002131 14.167508 23.796780 29.557706
4 27.180622 NaN 20.763618 15.923794 NaN
5 17.917170 NaN NaN 21.865184 9.867743
The expected outcome is as follows:
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here is one way
drop = ['C', 'E']
fill= 'B'
d=dict(zip(df.columns,[fill if x in drop else x for x in df.columns.tolist() ]))
df.groupby(d,axis=1).first()
Out[172]:
A B D
0 14.472915 30.598602 24.528571
1 22.010242 22.215140 15.412039
2 5.383674 NaN NaN
3 38.265940 24.746673 35.367622
4 22.730089 20.244289 27.570413
5 31.216037 15.496690 9.746814
IIUC, use bfill to backfill, then drop to remove unwanted columns.
df.assign(B=df[['B', 'C', 'E']].bfill(axis=1)['B']).drop(['C', 'E'], axis=1)
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
Here's a slightly more generalised version of the one above,
to_drop = ['C', 'E']
upd = 'B'
df.update(df[[upd, *to_drop]].bfill(axis=1)[upd]) # in-place
df.drop(to_drop, axis=1) # not in-place, need to assign
A B D
0 18.161033 6.453597 18.542586
1 27.629402 40.654821 22.804547
2 15.459256 NaN NaN
3 19.115203 4.002131 23.796780
4 27.180622 20.763618 15.923794
5 17.917170 9.867743 21.865184
I have a DataFrame with integer indexes that are missing some values (i.e. not equally spaced), I want to create a new DataFrame with equally spaced index values and forward fill column values. Below is a simple example:
have
import pandas as pd
df = pd.DataFrame(['A', 'B', 'C'], index=[0, 2, 4])
0
0 A
2 B
4 C
want to use above and create:
0
0 A
1 A
2 B
3 B
4 C
Use reindex with method='ffill':
df = df.reindex(np.arange(0, df.index.max()+1), method='ffill')
Or:
df = df.reindex(np.arange(df.index.min(), df.index.max() + 1), method='ffill')
print (df)
0
0 A
1 A
2 B
3 B
4 C
Using reindex and ffill:
df = df.reindex(range(df.index[0],df.index[-1]+1)).ffill()
print(df)
0
0 A
1 A
2 B
3 B
4 C
You can do this:
In [319]: df.reindex(list(range(df.index.min(),df.index.max()+1))).ffill()
Out[319]:
0
0 A
1 A
2 B
3 B
4 C
I have the following dataframe:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan],
'user_b':['A','B',np.nan,'D']
})
I would like to create a new column called user and have the resulting dataframe:
What's the best way to do this for many users?
Use forward filling missing values and then select last column by iloc:
df = pd.DataFrame({
'user_a':['A','B','C',np.nan,np.nan],
'user_b':['A','B',np.nan,'D',np.nan]
})
df['user'] = df.ffill(axis=1).iloc[:, -1]
print (df)
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
4 NaN NaN NaN
use .apply method:
In [24]: df = pd.DataFrame({'user_a':['A','B','C',np.nan],'user_b':['A','B',np.nan,'D']})
In [25]: df
Out[25]:
user_a user_b
0 A A
1 B B
2 C NaN
3 NaN D
In [26]: df['user'] = df.apply(lambda x: [i for i in x if not pd.isna(i)][0], axis=1)
In [27]: df
Out[27]:
user_a user_b user
0 A A A
1 B B B
2 C NaN C
3 NaN D D
I want to append a Series to a DataFrame where Series's index matches DataFrame's columns using pd.concat, but it gives me surprises:
df = pd.DataFrame(columns=['a', 'b'])
sr = pd.Series(data=[1,2], index=['a', 'b'], name=1)
pd.concat([df, sr], axis=0)
Out[11]:
a b 0
a NaN NaN 1.0
b NaN NaN 2.0
What I expected is of course:
df.append(sr)
Out[14]:
a b
1 1 2
It really surprises me that pd.concat is not index-columns aware. So is it true that if I want to concat a Series as a new row to a DF, then I can only use df.append instead?
Need DataFrame from Series by to_frame and transpose:
a = pd.concat([df, sr.to_frame(1).T])
print (a)
a b
1 1 2
Detail:
print (sr.to_frame(1).T)
a b
1 1 2
Or use setting with enlargement:
df.loc[1] = sr
print (df)
a b
1 1 2
My sample df has four columns with NaN values. The goal is to concatenate all the rows while excluding the NaN values.
import pandas as pd
import numpy as np
df = pd.DataFrame({'keywords_0':["a", np.nan, "c"],
'keywords_1':["d", "e", np.nan],
'keywords_2':[np.nan, np.nan, "b"],
'keywords_3':["f", np.nan, "g"]})
keywords_0 keywords_1 keywords_2 keywords_3
0 a d NaN f
1 NaN e NaN NaN
2 c NaN b g
Want to accomplish the following:
keywords_0 keywords_1 keywords_2 keywords_3 keywords_all
0 a d NaN f a,d,f
1 NaN e NaN NaN e
2 c NaN b g c,b,g
Pseudo code:
cols = [df.keywords_0, df.keywords_1, df.keywords_2, df.keywords_3]
df["keywords_all"] = df["keywords_all"].apply(lambda cols: ",".join(cols), axis=1)
I know I can use ",".join() to get the exact result, but I am unsure how to pass the column names into the function.
You can apply ",".join() on each row by passing axis=1 to the apply method. You first need to drop the NaNs though. Otherwise you will get a TypeError.
df.apply(lambda x: ','.join(x.dropna()), axis=1)
Out:
0 a,d,f
1 e
2 c,b,g
dtype: object
You can assign this back to the original DataFrame with
df["keywords_all"] = df.apply(lambda x: ','.join(x.dropna()), axis=1)
Or if you want to specify columns as you did in the question:
cols = ['keywords_0', 'keywords_1', 'keywords_2', 'keywords_3']
df["keywords_all"] = df[cols].apply(lambda x: ','.join(x.dropna()), axis=1)
Just provide another solution with to_string :
df1[df1.isnull()]=''
df1.apply(lambda x : x.to_string(index =False,na_rep=False),axis=1).replace({"\n":','},regex=True)
Then just assign it back to your column keywords_all by using
df['keywords_all']=df1.apply(lambda x : x.to_string(index =False,na_rep=False),axis=1).replace({"\n":','},regex=True)
or
df.assign(keywords_all=df1.apply(lambda x : x.to_string(index =False,na_rep=False),axis=1).replace({"\n":','},regex=True)
)
Out[397]:
keywords_0 keywords_1 keywords_2 keywords_3 keywords_all
0 a d NaN f a,d,f
1 NaN e NaN NaN e
2 c NaN b g b,c,g