Pandas - Replace multiple column values with previous column value when condition is met - python

I have a large dataframe that looks like this:
Start End Alm_No1 Val_No1 Alm_No2 Val_No2 Alm_No3 Val_No3
1/1/19 0:00 1/2/19 0:00 1 0 2 1 3 0
1/2/19 0:00 1/3/19 0:00 1 0 2 0 3 1
1/3/19 0:00 1/4/19 0:00 1 1 2 0 3 0
1/4/19 0:00 1/5/19 0:00 1 0 2 0 3 1
1/5/19 0:00 1/6/19 0:00 1 1 2 0 3 0
1/6/19 0:00 1/7/19 0:00 1 0 2 1 3 1
1/7/19 0:00 1/8/19 0:00 4 0 5 1 6 0
1/8/19 0:00 1/9/19 0:00 4 0 5 1 6 1
1/9/19 0:00 1/10/19 0:00 4 1 5 1 6 0
I want to update all values in the "Val" columns with the number from the associated "Alm" column when the value is 1, so that I can get rid of the "Alm" columns.
The outcome would look like this:
Start End Alm_No1 Val_No1 Alm_No2 Val_No2 Alm_No3 Val_No3
1/1/19 0:00 1/2/19 0:00 1 0 2 2 3 0
1/2/19 0:00 1/3/19 0:00 1 0 2 0 3 3
1/3/19 0:00 1/4/19 0:00 1 1 2 0 3 0
1/4/19 0:00 1/5/19 0:00 1 0 2 0 3 3
1/5/19 0:00 1/6/19 0:00 1 1 2 0 3 0
1/6/19 0:00 1/7/19 0:00 1 0 2 2 3 3
1/7/19 0:00 1/8/19 0:00 4 0 5 5 6 0
1/8/19 0:00 1/9/19 0:00 4 0 5 5 6 6
1/9/19 0:00 1/10/19 0:00 4 4 5 5 6 0
I have created the list of columns whose values should be changed:
val_col = df.columns.tolist()
val_list = []
for i in range(0, len(val_col)):
    if val_col[i].startswith('Val'):
        val_list.append(i)
Then I tried creating a while loop to iterate over the columns:
for x in val_list:
    i = 0
    while i < len(df):
        if df.iloc[i, x] == 1:
            df.iloc[i, x] = df.iloc[i, x-1]
        i += 1
It takes forever to run, and I have a hard time finding something that works with lambda or apply. Any hint?
Thanks in advance!

Never loop over the rows of a dataframe. You should instead set whole columns in a single operation.
for i in range(1, 4):
    df[f'Val_No{i}'] *= df[f'Alm_No{i}']
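This works because each Val column holds only 0 or 1, so multiplying it by the matching Alm column leaves zeros untouched and copies the alarm number wherever the flag is 1. A minimal, self-contained sketch of the same idea (with made-up data, not the question's frame):
import pandas as pd

# Toy frame with the same column layout as in the question (hypothetical values).
df = pd.DataFrame({'Alm_No1': [1, 1, 1], 'Val_No1': [0, 1, 0],
                   'Alm_No2': [2, 2, 2], 'Val_No2': [1, 0, 1]})

# Val is a 0/1 flag, so Val *= Alm keeps 0 where the flag is off
# and writes the alarm number where the flag is on.
for i in range(1, 3):
    df[f'Val_No{i}'] *= df[f'Alm_No{i}']

print(df)  # Val_No1 -> [0, 1, 0], Val_No2 -> [2, 0, 2]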

I feel silly answering my own questions just a few minutes later but I think I found something that works:
for x in val_list:
    df.loc[df.iloc[:, x] == 1, df.columns[x]] = df.iloc[:, x-1]
Worked like a charm!
234 ms ± 15.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
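For reference, a name-based variant of the same assignment (a sketch; it assumes every Val_NoX column has a matching Alm_NoX column, so it does not rely on the Alm column sitting immediately to the left of its Val column):
for val in [c for c in df.columns if c.startswith('Val_No')]:
    alm = val.replace('Val_', 'Alm_')        # paired alarm column for this Val column
    mask = df[val] == 1
    df.loc[mask, val] = df.loc[mask, alm]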

I came up with a solution that works for an arbitrary number of Alm_No... / Val_No... columns.
Let's start from a function to be applied to each row:
def fn(row):
    for i in range(2, row.size, 2):
        j = i + 1
        if row.iloc[j]:
            row.iloc[j] = row.iloc[i]
    return row
Note the construction of the for loop. It starts from 2 (the position of the Alm_No1 column) with a step of 2 (the distance to the Alm_No2 column).
j holds the position of the next column (the Val_No... column).
If the "current" Val_No is not 0, the value from the "current" Alm_No is substituted in its place.
After the loop completes, the changed row is returned.
So the only thing to do is to apply this function to each row:
df.apply(fn, axis=1)
My timeit measurements indicated that my solution runs a little (7%) quicker than yours and about 35 times quicker than the one proposed by BallpointBen.
Apparently, the use of f-strings accounts for part of this (quite significant) difference.
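One small usage note on the apply-based version: apply is not guaranteed to mutate the original frame even though fn modifies the row it receives, so it is safest to assign the result back (a hedged note, not something the timings above depend on):
df = df.apply(fn, axis=1)  # keep the returned frame rather than relying on in-place changes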

Related

Python Multi Category Ratios

I have a project that requires the daily ratios per week per year, like below:
Week | Year | Weekday | Volume
1 2000 1 0
1 2000 2 10
1 2000 3 10
2 2000 1 10
2 2000 2 0
1 2001 1 0
1 2001 2 10
1 2001 3 10
2 2001 1 10
2 2001 2 0
I want the output to be something like this
Week | Year | Weekday | Volume | Ratio
1 2000 1 0 0
1 2000 2 10 .5
1 2000 3 10 .5
2 2000 1 10 1
2 2000 2 0 0
1 2001 1 0 0
1 2001 2 10 .5
1 2001 3 10 .5
2 2001 1 10 1
2 2001 2 0 0
I have a current solution that does something similar to this:
for year in years:
    for week in weeks:
        ratio = week / weeklytotal
        weeklyratios = pd.concat([weeklyratios, ratio], blablabla)
The problem with this is that it's incredibly inefficient, especially since I have to do this process over 30k times. It ends up at about a 2.3 second run time per iteration, which adds up to roughly a 24 hour total run time.
Is there a better way to do this that can let it run faster?
You can use a groupby to compute the total volume per week. Then you can join that total volume to the original dataframe and compute the ratio in a vectorized way.
Assuming that the original dataframe is df (dtype being int):
Week Year Weekday Volume
0 1 2000 1 0
1 1 2000 2 10
2 1 2000 3 10
3 2 2000 1 10
4 2 2000 2 0
5 1 2001 1 0
6 1 2001 2 10
7 1 2001 3 10
8 2 2001 1 10
9 2 2001 2 0
you can use:
s = df.groupby(['Week', 'Year']).sum().drop('Weekday', axis=1)
df2 = df.set_index(['Week', 'Year']).join(s,rsuffix='_tot').sort_index(level=1)
df2['ratio'] = df2.Volume / df2.Volume_tot
print(df2)
gives:
Weekday Volume Volume_tot ratio
Week Year
1 2000 1 0 20 0.0
2000 2 10 20 0.5
2000 3 10 20 0.5
2 2000 1 10 10 1.0
2000 2 0 10 0.0
1 2001 1 0 20 0.0
2001 2 10 20 0.5
2001 3 10 20 0.5
2 2001 1 10 10 1.0
2001 2 0 10 0.0
You can get your expected output with:
print(df2.drop('Volume_tot', axis=1).reset_index())
which gives:
Week Year Weekday Volume ratio
0 1 2000 1 0 0.0
1 1 2000 2 10 0.5
2 1 2000 3 10 0.5
3 2 2000 1 10 1.0
4 2 2000 2 0 0.0
5 1 2001 1 0 0.0
6 1 2001 2 10 0.5
7 1 2001 3 10 0.5
8 2 2001 1 10 1.0
9 2 2001 2 0 0.0
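For comparison, the same ratio can also be computed without an explicit join by broadcasting the weekly totals back onto every row with groupby(...).transform (a minimal sketch, assuming df is the original frame shown above):
# total volume per (Week, Year), repeated on every row of that group
weekly_total = df.groupby(['Week', 'Year'])['Volume'].transform('sum')
df['Ratio'] = df['Volume'] / weekly_total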
You can execute grouped operations using the indexing and groupby functionality in pandas.
Assuming you have a dataframe, df, with columns ['week','year','weekday','volume'], your solution would look something like this:
import numpy as np
import pandas as pd
import timeit as t
# make up some data, only 1000 groups not your 30000, but it gets the point across
dates = pd.date_range(start = '2000-01-01', end = '2019-02-28', freq = 'D')
volume = np.random.randint(0,100,len(dates))
df = pd.DataFrame(list(zip(dates.week, dates.year, dates.dayofweek, volume)),
                  columns=['week', 'year', 'weekday', 'volume'])
# group
grp = df.groupby(['year','week'])
grp_vol = grp['volume'].sum()
# rename to avoid overlap in names
grp_vol.name = 'weekly_volume'
# rejoin to calculate your ratio
df = df.join(grp_vol, on = ['year','week'])
df['ratio'] = df['volume']/df['weekly_volume']
And then time it for good measure
%timeit df['ratio'] = df['volume']/df['weekly_volume']
196 µs ± 4.32 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
So a lot less than 24 hrs.

Normalize column in pandas [duplicate]

This question already has an answer here: pandas - groupby and re-scale values (1 answer)
I want to max/min normalize the pageviews for each page in a pandas dataframe that looks like this:
page days since publishing pageviews
example.com/a 1 5000
example.com/a 2 10000
example.com/a 3 7500
example.com/b 1 10000
example.com/b 2 20000
example.com/b 3 15000
I'd like to produce something like:
page days since publishing pageviews
example.com/a 1 0
example.com/a 2 1
example.com/a 3 0.5
example.com/b 1 0
example.com/b 2 1
example.com/b 3 0.5
The dataset is about 100 000 rows. Any help getting this done effectively would be much appreciated.
a = pd.DataFrame(pd.read_csv('input.csv'))
b = a.groupby('page').min()
b.reset_index(inplace=True)
a = pd.merge(a,b,how='left',right_on = 'page',left_on = 'page')
a['minmaxscale'] = (a.pageviews_x-a.pageviews_y)/a.pageviews_y
produces
page days since publishing_x pageviews_x \
0 example.com/a 1 5000
1 example.com/a 2 10000
2 example.com/a 3 7500
3 example.com/b 1 10000
4 example.com/b 2 20000
5 example.com/b 3 15000
days since publishing_y pageviews_y minmaxscale
0 1 5000 0.0
1 1 5000 1.0
2 1 5000 0.5
3 1 10000 0.0
4 1 10000 1.0
5 1 10000 0.5
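Note that the expression above scales by the group minimum, which reproduces the 0 / 1 / 0.5 output only because in this sample each page's maximum happens to be exactly twice its minimum. A per-group min/max scaling can be written directly with transform (a sketch, starting from the original frame a before the merge):
gmin = a.groupby('page')['pageviews'].transform('min')   # per-page minimum on every row
gmax = a.groupby('page')['pageviews'].transform('max')   # per-page maximum on every row
a['minmaxscale'] = (a['pageviews'] - gmin) / (gmax - gmin)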

Pandas - Take current row, compare value to X previous rows and return how many matches (within x% range)

I have a pandas column like so:
index colA
1 10.2
2 10.8
3 11.6
4 10.7
5 9.5
6 6.2
7 12.9
8 10.6
9 6.4
10 20.5
I want to search the current row value and find matches from previous rows that are close. For example index4 (10.7) would return a match of 1 because it is close to index2 (10.8). Similarly index8 (10.6) would return a match of 2 because it is close to both index2 and 4.
Using a threshold of +/- 5% for this example would output the below:
index colA matches
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
With a large dataframe, I would like to limit the search to the previous X (300?) rows rather than the entire dataframe.
Use lower-triangle indices to ensure we only look backwards, then np.bincount to accumulate the matches.
a = df.colA.values
i, j = np.tril_indices(len(a), -1)
mask = np.abs(a[i] - a[j]) / a[i] <= .05
df.assign(matches=np.bincount(i[mask], minlength=len(a)))
colA matches
index
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
If you are having resource issues, consider using good ol' fashioned loops. However, if you have access to numba, you can make this considerably faster.
import numpy as np
from numba import njit

@njit
def counter(a):
    c = np.arange(len(a)) * 0
    for i, x in enumerate(a):
        for j, y in enumerate(a):
            if j < i:
                if abs(x - y) / x <= .05:
                    c[i] += 1
    return c
df.assign(matches=counter(a))
colA matches
index
1 10.2 0
2 10.8 0
3 11.6 0
4 10.7 2
5 9.5 0
6 6.2 0
7 12.9 0
8 10.6 3
9 6.4 1
10 20.5 0
Here's a numpy solution that leverages broadcasted comparison:
i = df.colA.values
j = np.arange(len(df))
df['matches'] = (
    (np.abs(i - i[:, None]) < i * .05) & (j < j[:, None])
).sum(1)
df
index colA matches
0 1 10.2 0
1 2 10.8 0
2 3 11.6 0
3 4 10.7 2
4 5 9.5 0
5 6 6.2 0
6 7 12.9 0
7 8 10.6 3
8 9 6.4 1
9 10 20.5 0
Note: this is extremely fast, but it does not handle the 300-row limitation for large dataframes.
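If the lookback really does need to be capped at the previous N rows (the 300 mentioned in the question), one option is to add a window condition to the same broadcasted comparison. This is only a sketch and still builds an n x n matrix, so for very large frames adding an i - j <= window check inside the numba loop above may be the better route:
import numpy as np

window = 300                      # hypothetical cap on how far back to look
i = df.colA.values
j = np.arange(len(df))

# previous rows only, and no further back than `window` positions
behind = (j < j[:, None]) & (j >= j[:, None] - window)
df['matches'] = ((np.abs(i - i[:, None]) < i * .05) & behind).sum(1)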
Rolling with apply; if speed matters, please look into cold's answer.
df.colA.rolling(window=len(df),min_periods=1).apply(lambda x : sum(abs((x-x[-1])/x[-1])<0.05)-1)
Out[113]:
index
1 0.0
2 0.0
3 0.0
4 2.0
5 0.0
6 0.0
7 0.0
8 3.0
9 1.0
10 0.0
Name: colA, dtype: float64

Out of every 7 rows, get the nth row pandas

I have a df like this that's about 1000 rows:
0 1
0 1.345 2.456
1 2.123 3.564
2 0.023 3.548
3 3.457 2.456
4 1.754 3.564
5 0.905 3.548
6 3.674 7.543
7 9.443 6.4433...
The way it's organized is every 7 rows constitutes one "set" of data (data cannot be sorted here). Within each of the "groups" of 7 rows I want to get the first row so my new data frame would look like:
0 1
0 1.345 2.456
7 9.443 6.4433
I can solve it by creating a new column that repeats 1-7 & filtering by only that column...
0 1 groupby_col
0 1.345 2.456 1
1 2.123 3.564 2
2 0.023 3.548 3
3 3.457 2.456 4
4 1.754 3.564 5
5 0.905 3.548 6
6 3.674 7.543 7
7 9.443 6.4433 1...
then...
df[df['groupby_col'] == 1]
Is there a way I can do this in pandas without having to create an additional column then filter?
Option 1:
In [54]: df.iloc[::7]
Out[54]:
0 1
0 1.345 2.4560
7 9.443 6.4433
Option 2:
In [53]: df.iloc[np.arange(len(df))%7==0]
Out[53]:
0 1
0 1.345 2.4560
7 9.443 6.4433
df.loc[df.index%7==0]
Out[124]:
0 1
0 1.345 2.4560
7 9.443 6.4433
Or
df.groupby(df.index//7,as_index=False).first()
Out[128]:
0 1
0 1.345 2.4560
1 9.443 6.4433
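If you need the nth row of every block of 7 rather than the first (as the title suggests), the same ideas generalize. A sketch, where n is a hypothetical 0-based position within each block:
n = 2                               # hypothetical: which row of each block of 7 to keep
df.iloc[n::7]                       # positional slicing, relies on the rows being in order
df.groupby(df.index // 7).nth(n)    # groupby equivalent of taking the nth row per block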

Grouping Elements on criteria and Finding difference between The Highest and Lowest of Group

I have a dataframe like the following.
Best Bid Best Offer Best Offer Sign Lone Time
0 197.0 0.0 1 1
1 198.0 0.0 1 2
2 199.0 0.0 1 3
3 197.0 221.0 0 0
4 221.0 221.0 0 0
5 221.0 0.0 1 1
6 222.0 0.0 1 2
I want to make groups for each situation where Lone Time increases before it hits 0 (it will always be increasing, never decreasing) and find the difference between the highest and lowest Best Bid values in each group. So, as an example:
Best Bid Best Offer Best Offer Sign Lone Time diff
0 197.0 0.0 1 1 0
1 198.0 0.0 1 2 0
2 200.0 0.0 1 3 3
3 197.0 221.0 0 0 0
4 221.0 221.0 0 0 0
5 221.0 0.0 1 1 0
6 250.0 0.0 1 2 29
Notice that indexes 2 and 6 have values of 3 and 29 respectively. For index 2 I have taken the diff of the Best Bid (index 2 - index 0), and for index 6 I have taken the diff of the Best Bid (index 6 - index 5). How do I achieve this?
I think this here will do what you're looking for:
import numpy as np
import pandas as pd

dat = pd.read_csv('dat.csv', sep=';')
dat['jump'] = (dat['Lone Time'].shift(-1) - dat['Lone Time']).fillna(-1)
dat['jump'] = ((1 - np.sign(dat['jump'])) / 2.).astype(int)
dat['series'] = dat['jump'].shift(1).fillna(0).cumsum()
dat['diff'] = 0.0
for s, df in dat.groupby('series'):
    dat.loc[df.index[-1], 'diff'] = df['Best Bid'].max() - df['Best Bid'].min()
dat = dat.drop(['jump', 'series'], axis=1)
print(dat)
1) compute the change in Lone Time from one row to the next
2) mark the last row in each series, shift the marker by one and fill the missing value with a zero
2.1) distinguish between series with a cumulative sum, so every sequence now has its own "identifier"
3) group by series and find the span between the min and max values
4) cleanup: drop the intermediate columns
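For comparison, here is a sketch that builds the runs directly from the Lone Time column and groups only the rows where Lone Time > 0, which appears to be what the expected output in the question uses (it assumes the frame is named dat, as above):
active = dat['Lone Time'] > 0                     # rows that belong to an increasing run
run_id = (~active).cumsum()                       # constant within each run
last_of_run = active & ~active.shift(-1, fill_value=False)

# max - min of Best Bid within each run, written onto the run's last row
span = dat.loc[active].groupby(run_id[active])['Best Bid'].agg(lambda s: s.max() - s.min())
dat['diff'] = 0.0
dat.loc[last_of_run, 'diff'] = span.to_numpy()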
