TypeError: 'int' object is not subscriptable: adding a column with a calculation - python

I found an implementation of the bisection method in Python, which finds a root of a function. My function depends on several variables. The function and the bisection method both work perfectly. However, I have now created a dataframe with many rows (>1000), and I want to apply the function to every row. The variables of the function are in the columns (0-8).
I've tried some simple code (appending a column with the function, see the code below), but the main error I get is:
'int' object is not subscriptable
So the bisection code is done, but now I want to apply it to every row and append the result to the dataframe as a new column.
The code of the bisection method:
def bisection(f,a,b,N):
    '''Approximate solution of f(x)=0 on interval [a,b] by the bisection method.
    Parameters
    ----------
    f : function
        The function for which we are trying to approximate a solution f(x)=0.
    a,b : numbers
        The interval in which to search for a solution. The function returns
        None if f(a)*f(b) >= 0 since a solution is not guaranteed.
    N : (positive) integer
        The number of iterations to implement.
    Returns
    -------
    x_N : number
        The midpoint of the Nth interval computed by the bisection method. The
        initial interval [a_0,b_0] is given by [a,b]. If f(m_n) == 0 for some
        midpoint m_n = (a_n + b_n)/2, then the function returns this solution.
        If all signs of values f(a_n), f(b_n) and f(m_n) are the same at any
        iteration, the bisection method fails and return None.
    Examples
    --------
    >>> f = lambda x: x**2 - x - 1
    >>> bisection(f,1,2,25)
    1.618033990263939
    >>> f = lambda x: (2*x - 1)*(x - 3)
    >>> bisection(f,0,1,10)
    0.5
    '''
    if f(a)*f(b) >= 0:
        print("Bisection method fails.")
        return None
    a_n = a
    b_n = b
    for n in range(1,N+1):
        m_n = (a_n + b_n)/2
        f_m_n = f(m_n)
        if f(a_n)*f_m_n < 0:
            a_n = a_n
            b_n = m_n
        elif f(b_n)*f_m_n < 0:
            a_n = m_n
            b_n = b_n
        elif f_m_n == 0:
            print("Found exact solution.")
            return m_n
        else:
            print("Bisection method fails.")
            return None
    return (a_n + b_n)/2
Then, the second part of the bisection method also works perfectly:
from scipy.stats import poisson
import numpy as np
f = lambda x: x['h']*x['p']*0.5-
(x['D']*x['oc'])/(x**2)+x['w']*x['D']*
(-1*max((x*poisson.pmf(x['xm'],x,loc=0)+(x-x['xm'])*(1-
poisson.cdf(x['xm'],x,loc=0))),0))/(x**2)+
x['w']*x['D']*poisson.cdf(x,x['xm'],loc=0)/x+
(0.723465*x['p']*x['h']*(x['SL']-1))/(((x-
x['SL']*x)/(x['Dev']*np.sqrt(x['RT+LT'])))**0.865)+(0.51012*
(x['SL']-1)*((x-
x['SL']*x)/x['Dev']*np.sqrt(x['RT+LT']))**0.3)
/(x['Dev']*np.sqrt(x['RT+LT']))
b = np.sqrt((2*x['D']*x['oc'])/(x['h']*x['p'])).round()
a = 1
N = 100
'f' is just a nasty derivative.
And now I have a dataframe with 9 columns.
xm w D SL oc p h Dev RT+LT
0 34 5 1097 0.95 4 5 0.29 10 2.37
1 23 34 4166 0.95 11 34 0.23 19 0.53
2 15 39 188 0.95 19 39 0.32 4 2.77
3 34 39 9005 0.95 9 39 0.12 27 0.67
4 24 43 2555 0.95 14 43 0.10 15 0.66
5 41 6 7168 0.95 4 6 0.41 24 1.90
6 44 39 1390 0.95 42 39 0.21 11 1.79
7 34 4 6522 0.95 28 4 0.11 23 1.8
(Sorry, I can't seem to get the table to display nicely.)
Now I want to add a column by calling the following for every row:
bisection(f,a,b,N)
I tried simply:
x['f'] = bisection(f,a,b,N)
but I keep getting the error 'int' object is not subscriptable.
Does anyone have an idea what is going wrong? If I just use random numbers in the function f, the code gives me the correct result.
Steven
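A likely cause, judging from the code above: inside f the name x stands both for the dataframe row (x['h'], x['D'], ...) and for the scalar unknown that bisection varies. When bisection calls f(a) with a = 1, x is an int, so x['h'] raises 'int' object is not subscriptable. Below is a minimal sketch of one way around this, assuming a simplified stand-in objective. The names bisect_root, objective, solve_row and q_opt are hypothetical, and the real derivative would replace objective.

```python
import numpy as np
import pandas as pd

def bisect_root(g, a, b, n_iter=100):
    """Plain bisection on [a, b]; returns NaN if there is no sign change."""
    fa, fb = g(a), g(b)
    if fa * fb > 0:
        return np.nan
    for _ in range(n_iter):
        m = (a + b) / 2
        fm = g(m)
        if fm == 0:
            return m
        if fa * fm < 0:
            b, fb = m, fm
        else:
            a, fa = m, fm
    return (a + b) / 2

def objective(q, row):
    # Hypothetical stand-in for the real derivative: q is the scalar unknown,
    # row supplies the per-row parameters, so the two are never mixed up.
    return row["h"] * row["p"] * 0.5 - row["D"] * row["oc"] / q**2

def solve_row(row):
    # Assumed bracket: from 1 up to twice the closed-form root of the stand-in.
    upper = 2 * np.sqrt(2 * row["D"] * row["oc"] / (row["h"] * row["p"]))
    return bisect_root(lambda q: objective(q, row), 1.0, upper)

df = pd.DataFrame({"D": [1097, 4166], "oc": [4, 11],
                   "p": [5, 34], "h": [0.29, 0.23]})
# axis=1 passes each row to solve_row as a Series
df["q_opt"] = df.apply(solve_row, axis=1)
```

The key design choice is that the function handed to the root finder takes only the scalar unknown, while the row is captured by the lambda. The bracket [a, b] also has to be computed per row, since b depends on the row's values.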

Related

Python: numpy/pandas change values on condition

I would like to know if there is a faster and more "pythonic" way of doing the following, e.g. using some built in methods.
Given a pandas DataFrame or numpy array of floats, if the value is equal or smaller than 0.5 I need to calculate the reciprocal value and multiply with -1 and replace the old value with the newly calculated one.
"Transform" is probably a bad choice of words, please tell me if you have a better/more accurate description.
Thank you for your help and support!!
Data:
import numpy as np
import pandas as pd
dicti = {"A" : np.arange(0.0, 3, 0.1),
"B" : np.arange(0, 30, 1),
"C" : list("ELVISLIVES")*3}
df = pd.DataFrame(dicti)
my function:
def transform_colname(df, colname):
    series = df[colname]
    newval_list = []
    for val in series:
        if val <= 0.5:
            newval = (1/val)*-1
            newval_list.append(newval)
        else:
            newval_list.append(val)
    df[colname] = newval_list
    return df
function call:
transform_colname(df, colname="A")
**--> I'm summing up the results here, since comments wouldn't allow to post code (or I don't know how to do it).**
Thank you all for your fast and great answers!!
using ipython "%timeit" with "real" data:
my function:
10 loops, best of 3: 24.1 ms per loop
from jojo:
def transform_colname_v2(df, colname):
    series = df[colname]
    df[colname] = np.where(series <= 0.5, 1/series*-1, series)
    return df
100 loops, best of 3: 2.76 ms per loop
from FooBar:
def transform_colname_v3(df, colname):
    df.loc[df[colname] <= 0.5, colname] = - 1 / df[colname][df[colname] <= 0.5]
    return df
100 loops, best of 3: 3.32 ms per loop
from dmvianna:
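def transform_colname_v4(df, colname):
    # .where keeps values where the condition is True and replaces the rest,
    # so the condition must select the values to keep (those above 0.5)
    df[colname] = df[colname].where(df[colname] > 0.5, (1/df[colname])*-1)
    return df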
100 loops, best of 3: 3.7 ms per loop
Please tell/show me if you would implement your code in a different way!
One final QUESTION: (answered)
How could "FooBar" and "dmvianna" 's versions be made "generic"? I mean, I had to write the name of the column into the function (since using it as a variable didn't work). Please explain this last point!
--> thanks jojo, ".loc" isn't the right way, but very simple df[colname] is sufficient. changed the functions above to be more "generic". (also changed ">" to be "<=", and updated timing)
Thank you very much!!
If we are talking about arrays:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(1 / a[a <= 0.5] * (-1))
This will, however, only return the transformed values, i.e. those that were less than or equal to 0.5.
Alternatively use np.where:
import numpy as np
a = np.array([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], dtype=float)
print(np.where(a <= 0.5, 1 / a * (-1), a))
Talking about pandas DataFrame:
As in @dmvianna's answer (so give some credit to him ;) ), adapting it to pd.DataFrame:
df.A = df.A.where(df.A > 0.5, (1 / df.A) * (-1))
The typical trick is to write a general mathematical operation to apply to the whole column, but then use indicators to select rows for which we actually apply it:
df.loc[df.A < 0.5, 'A'] = - 1 / df.A[df.A < 0.5]
In[13]: df
Out[13]:
A B C
0 -inf 0 E
1 -10.000000 1 L
2 -5.000000 2 V
3 -3.333333 3 I
4 -2.500000 4 S
5 0.500000 5 L
6 0.600000 6 I
7 0.700000 7 V
8 0.800000 8 E
9 0.900000 9 S
10 1.000000 10 E
11 1.100000 11 L
12 1.200000 12 V
13 1.300000 13 I
14 1.400000 14 S
15 1.500000 15 L
16 1.600000 16 I
17 1.700000 17 V
18 1.800000 18 E
19 1.900000 19 S
20 2.000000 20 E
21 2.100000 21 L
22 2.200000 22 V
23 2.300000 23 I
24 2.400000 24 S
25 2.500000 25 L
26 2.600000 26 I
27 2.700000 27 V
28 2.800000 28 E
29 2.900000 29 S
As in @jojo's answer, but using pandas:
df.A = df.A.where(df.A > 0.5, (1/df.A)*-1)
or
df.A.where(df.A > 0.5, (1/df.A)*-1, inplace=True) # this should be faster
.where docstring:
Definition: df.A.where(self, cond, other=nan, inplace=False,
axis=None, level=None, try_cast=False, raise_on_error=True)
Docstring:
Return an object of same shape as self and whose corresponding entries
are from self where cond is True and otherwise are from other.
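The direction of the condition in .where is easy to get backwards, so here is a small sketch contrasting it with .mask, which is the mirror image. The series s and the names via_where and via_mask are made up for illustration.

```python
import pandas as pd

s = pd.Series([0.1, 0.25, 0.5, 0.8, 2.0])

# where: keep entries where the condition is True, replace all others
via_where = s.where(s > 0.5, -1 / s)

# mask: replace entries where the condition is True, keep all others
via_mask = s.mask(s <= 0.5, -1 / s)

# Both spell the same transformation: values <= 0.5 become -1/value
print(via_where.equals(via_mask))
```

With .mask the condition reads the same way as the original requirement ("if the value is equal or smaller than 0.5, replace it"), which may make it the less error-prone of the two here.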

Producing a “best fit” slope gradient from pandas df and populating new column

I'm trying to add a slope calculation on individual subsets of two fields in a dataframe and have that value of slope applied to all rows in each subset. (I've used the "slope" function in Excel previously, although I'm not married to the exact algorithm.) The "desired_output" field is what I'm expecting as the output. The subsets are distinguished by the "strike_order" column, with each subset starting at 1 and having no specific highest value.
"IV" is the y value
"Strike" is the x value
Any help would be appreciated as I don't even know where to begin with this....
import pandas
df = pandas.DataFrame([[1200,1,.4,0.005],[1210,2,.35,0.005],[1220,3,.3,0.005],
[1230,4,.25,0.005],[1200,1,.4,0.003],[1210,2,.37,.003]],columns=
["strike","strike_order","IV","desired_output"])
df
strike strike_order IV desired_output
0 1200 1 0.40 0.005
1 1210 2 0.35 0.005
2 1220 3 0.30 0.005
3 1230 4 0.25 0.005
4 1200 1 0.40 0.003
5 1210 2 0.37 0.003
Let me know if this isn't a well posed question and I'll try to make it better.
You can use numpy's least squares solver.
We can rewrite the line equation y = mx + c as y = Ap, where A = [[x 1]] and p = [[m], [c]]. Then use lstsq to solve for p, so we need to create A by adding a column of ones to df:
import numpy as np
df['ones']=1
A = df[['strike','ones']]
y = df['IV']
# rcond=None selects the current default cutoff and avoids a FutureWarning on older numpy
m, c = np.linalg.lstsq(A,y,rcond=None)[0]
Alternatively, you can use scikit-learn's linear_model.LinearRegression model.
You can verify the result by plotting the data as a scatter plot and drawing the fitted line on top of it:
import matplotlib.pyplot as plt
plt.scatter(df['strike'],df['IV'],color='r',marker='d')
x = df['strike']
#plug x in the equation y=mx+c
y_line = c + m * x
# plot the fitted line (y_line), not the raw IV values
plt.plot(x,y_line)
plt.xlabel('Strike')
plt.ylabel('IV')
plt.show()
[Figure: the resulting plot]
Try this.
First create a subset column by iterating over the dataframe, using the strike_order value transitioning to 1 as the boundary between subsets
#create subset column
subset_counter = 0
for index, row in df.iterrows():
    if row["strike_order"] == 1:
        df.loc[index,'subset'] = subset_counter
        subset_counter += 1
    else:
        df.loc[index,'subset'] = df.loc[index-1,'subset']
df['subset'] = df['subset'].astype(int)
Then run a linear regression over each subset using groupby
# run linear regression on subsets of the dataframe using groupby
from sklearn import linear_model
model = linear_model.LinearRegression()
for (group, df_gp) in df.groupby('subset'):
    X=df_gp[['strike']]
    y=df_gp.IV
    model.fit(X,y)
    # coef_ is a length-1 array; take the scalar slope before assigning
    df.loc[df.subset == df_gp.iloc[0].subset, 'slope'] = model.coef_[0]
df
strike strike_order IV desired_output subset slope
0 1200 1 0.40 0.005 0 -0.005
1 1210 2 0.35 0.005 0 -0.005
2 1220 3 0.30 0.005 0 -0.005
3 1230 4 0.25 0.005 0 -0.005
4 1200 1 0.40 0.003 1 -0.003
5 1210 2 0.37 0.003 1 -0.003
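For comparison, here is a loop-free sketch of the same two steps. It is a swapped-in technique, not the code from the answer: it assumes that every subset starts with strike_order == 1, labels subsets with a cumulative sum, and fits each slope with np.polyfit instead of scikit-learn. The helper name fit_slope is hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    [[1200, 1, 0.40], [1210, 2, 0.35], [1220, 3, 0.30],
     [1230, 4, 0.25], [1200, 1, 0.40], [1210, 2, 0.37]],
    columns=["strike", "strike_order", "IV"],
)

# Each row with strike_order == 1 opens a new subset; the cumulative sum
# of that boolean flag gives a subset label without iterating over rows.
df["subset"] = (df["strike_order"] == 1).cumsum() - 1

def fit_slope(group):
    # Degree-1 least-squares fit; the first coefficient is the slope
    return np.polyfit(group["strike"], group["IV"], 1)[0]

# One slope per subset, then broadcast back to every row of that subset
slopes = {key: fit_slope(group) for key, group in df.groupby("subset")}
df["slope"] = df["subset"].map(slopes)
```

Because the labels come from the data itself rather than from the previous row, this should not need a seed value for the first row.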
@Scott This worked, except that the subset values went 0, 1 and then all subsequent subsets got the value 2. I added an extra conditional at the beginning and a very clumsy "seed" value to stop it looking for row -1.
import scipy
seed=df.loc[0,"date_exp"]
#seed ="08/11/200015/06/2001C"
#print(seed)
subset_counter = 0
for index, row in df.iterrows():
    #if index['strike_order']==0:
    if row['date_exp'] ==seed:
        df.loc[index,'subset']=0
    elif row["strike_order"] == 1:
        df.loc[index,'subset'] = subset_counter
        subset_counter = 1 + df.loc[index-1,'subset']
    else:
        df.loc[index,'subset'] = df.loc[index-1,'subset']
df['subset'] = df['subset'].astype(int)
This now does exactly what I want, although I think using the seed value is clunky; I would have preferred to use something like if row == 0. But it's Friday and this works.
Cheers

Portfolio Selection in Python with constraints from a fixed set

I am working on a project where I am trying to select the optimal subset of players from a set of 125 players (example below)
The constraints are:
a) Number of players = 3
b) Sum of prices <= 30
The optimization function is Max(Sum of Votes)
Player Vote Price
William Smith 0.67 8.6
Robert Thompson 0.31 6.7
Joseph Robinson 0.61 6.2
Richard Johnson 0.88 4.3
Richard Hall 0.28 9.7
I looked at the scipy optimize package but I can't find anywhere a way to constraint the universe to this subset. Can anyone point me if there is a library that would do that?
Thanks
The problem is well suited to being formulated as a mathematical program and can be solved with various optimization libraries.
It is known as the exact k-item knapsack problem.
You can use the package PuLP, for example. It has interfaces to different optimization software packages, but comes bundled with a free solver.
pip install pulp
Free solvers are often much slower than commercial ones, but I think PuLP should be able to solve reasonably large versions of your problem with its standard solver.
Your problem can be solved with PuLP as follows:
from pulp import *
# Data input
players = ["William Smith", "Robert Thompson", "Joseph Robinson", "Richard Johnson", "Richard Hall"]
vote = [0.67, 0.31, 0.61, 0.88, 0.28]
price = [8.6, 6.7, 6.2, 4.3, 9.7]
P = range(len(players))
# Declare problem instance, maximization problem
prob = LpProblem("Portfolio", LpMaximize)
# Declare decision variable x, which is 1 if a
# player is part of the portfolio and 0 else
x = LpVariable.matrix("x", list(P), 0, 1, LpInteger)
# Objective function -> Maximize votes
prob += sum(vote[p] * x[p] for p in P)
# Constraint definition
prob += sum(x[p] for p in P) == 3
prob += sum(price[p] * x[p] for p in P) <= 30
# Start solving the problem instance
prob.solve()
# Extract solution
portfolio = [players[p] for p in P if x[p].varValue]
print(portfolio)
The runtime to draw 3 players from 125 with the same random data as used by Brad Solomon is 0.5 seconds on my machine.
Your problem is a discrete optimization task because of constraint a). You should introduce discrete variables to represent taken/not taken players. Consider the following MiniZinc pseudocode:
array[players_num] of var bool: taken_players;
array[players_num] of float: votes;
array[players_num] of float: prices;
constraint sum (taken_players * prices) <= 30;
constraint sum (taken_players) = 3;
solve maximize sum (taken_players * votes);
As far as I know, you can't use scipy to solve such problems (e.g. this).
You can solve your problem in these ways:
1) Generate a MiniZinc problem in Python and solve it by calling an external solver. This seems to be more scalable and robust.
2) Use simulated annealing.
3) Use a mixed integer approach.
The second option seems to be simpler for you. But, personally, I prefer the first one: it lets you introduce a wide range of constraints, and the problem formulation feels more natural and clear.
@CaptainTrunky is correct, scipy.optimize.minimize will not work here.
Here is an awfully crappy workaround using itertools; please ignore it if one of the other methods has worked. Consider that drawing 3 players from 125 creates 317,750 combinations, n!/((n - k)! * k!). Runtime on the main loop ~ 6m.
from itertools import combinations
import numpy as np
import pandas as pd
from pandas import DataFrame
df = DataFrame({'Player' : np.arange(0, 125),
                'Vote' : 10 * np.random.random(125),
                'Price' : np.random.randint(1, 10, 125)
                })
df
Out[109]:
Player Price Vote
0 0 4 7.52425
1 1 6 3.62207
2 2 9 4.69236
3 3 4 5.24461
4 4 4 5.41303
.. ... ... ...
120 120 9 8.48551
121 121 8 9.95126
122 122 8 6.29137
123 123 8 1.07988
124 124 4 2.02374
players = df.Player.values
idx = pd.MultiIndex.from_tuples([i for i in combinations(players, 3)])
votes = []
prices = []
for i in combinations(players, 3):
    vote = df[df.Player.isin(i)].sum()['Vote']
    price = df[df.Player.isin(i)].sum()['Price']
    votes.append(vote); prices.append(price)
result = DataFrame({'Price' : prices, 'Vote' : votes}, index=idx)
# The index below is (first player, second player, third player)
result[result.Price <= 30].sort_values('Vote', ascending=False)
Out[128]:
Price Vote
63 87 121 25.0 29.75051
64 121 20.0 29.62626
64 87 121 19.0 29.61032
63 64 87 20.0 29.56665
65 121 24.0 29.54248
... ...
18 22 78 12.0 1.06352
23 103 20.0 1.02450
22 23 103 20.0 1.00835
18 22 103 15.0 0.98461
23 14.0 0.98372
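Most of the runtime above probably comes from filtering the DataFrame once per combination. A minimal sketch of the same brute-force idea with plain Python lookups, run on the five sample players from the question, could look like this. The function name best_team and its parameters are hypothetical.

```python
from itertools import combinations

# Sample data from the question: name -> (vote, price)
players = {
    "William Smith": (0.67, 8.6),
    "Robert Thompson": (0.31, 6.7),
    "Joseph Robinson": (0.61, 6.2),
    "Richard Johnson": (0.88, 4.3),
    "Richard Hall": (0.28, 9.7),
}

def best_team(players, size=3, budget=30):
    """Try every team of the given size and keep the best one within budget."""
    best, best_votes = None, float("-inf")
    for team in combinations(players, size):
        total_price = sum(players[name][1] for name in team)
        total_votes = sum(players[name][0] for name in team)
        if total_price <= budget and total_votes > best_votes:
            best, best_votes = team, total_votes
    return best, best_votes

team, votes = best_team(players)
print(team, round(votes, 2))
```

This still enumerates every combination, so it only suits small team sizes; for larger problems the integer programming formulation scales far better.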

Applying diff on selected rows for comparing angles from math.atan2

I have a data frame like this that want to apply diff function on:
test = pd.DataFrame({ 'Observation' : ['0','1','2',
'3','4','5',
'6','7','8'],
'Value' : [30,60,170,-170,-130,-60,-30,10,20]
})
Observation Value
0 30
1 60
2 170
3 -170
4 -130
5 -60
6 -30
7 10
8 20
The column 'Value' is in degrees. So, the difference between -170 and 170 should be 20, not -340. In other words, when d2*d1 < 0, instead of d2-d1, I'd like to get 360-(abs(d1)+abs(d2))
Here's what I tried, but I don't know how to continue without using a for loop:
test['Value_diff_1st_attempt'] = test['Value'].diff(1)
test['sign_temp'] = test['Value'].shift()
test['Sign'] = np.sign(test['Value']*test['sign_temp'])
Here's what the result should look like:
Observation Value Delta_Value
0 30 NAN
1 60 30
2 170 110
3 -170 20
4 -130 40
5 -60 70
6 -30 30
7 10 40
8 20 10
Eventually I'd like to get just the magnitude of differences all in positive values. Thanks.
Update: So, the value results are derived from math.atan2 function. The values are from 0<theta<180 or -180<theta<0. The problem arises when we are dealing with a change of direction from 170 (upper left corner) to -170 (lower left corner) for example, where the change is really just 20 degrees. However, when we go from -30 (Lower right corner) to 10 (upper right corner), the change is really 40 degrees. I hope I explained it well.
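I believe this should work (took the definition from @JasonD's answer). Note raw=True, which passes each window as a NumPy array so that x[0] and x[1] index by position:
test["Value"].rolling(2).apply(lambda x: 180 - abs(abs(x[0] - x[1]) - 180), raw=True)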
Out[45]:
0 NaN
1 30.0
2 110.0
3 20.0
4 40.0
5 70.0
6 30.0
7 40.0
8 10.0
Name: Value, dtype: float64
How it works:
Based on your question, the two angles a and b are between 0 and +/-180. For 0 < d < 180 I will write d < 180 and for -180 < d < 0 I will write d < 0. There are four possibilities:
a < 180, b < 180 -> the result is simply |a - b|. Since |a - b| cannot be greater than 180 here, the formula reduces to a - b if a > b and b - a if b > a.
a < 0, b < 0 -> The same logic applies here. Both are negative and their absolute difference cannot be greater than 180. The result will be |a - b|.
a < 180, b < 0 -> a - b will be greater than 0 for sure. For the cases where |a - b| > 180, we should look at the other angle, and this translates to 360 - |a - b|.
a < 0, b < 180 -> again, similar to the above. If the absolute difference is greater than 180, calculate 360 - absolute difference.
For the pandas part: rolling(n) creates arrays of size n. For 2: (row 0, row1), (row1, row2), ... With apply, you apply that formula to every rolling pair where x[0] is the first element (a) and x[1] is the second element.
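The same formula can also be written without rolling and apply, using diff on the whole column. This is a sketch of that vectorized variant, using the Delta_Value column name from the question's desired output.

```python
import pandas as pd

test = pd.DataFrame({"Value": [30, 60, 170, -170, -130, -60, -30, 10, 20]})

# Absolute raw difference between consecutive angles (NaN for the first row)
raw = test["Value"].diff().abs()

# Fold differences above 180 back onto the shorter arc: 180 - | |d| - 180 |
test["Delta_Value"] = 180 - (raw - 180).abs()
```

For the 170 to -170 step the raw difference is 340, which folds to 20, while the -30 to 10 step stays at 40.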

selecting data using pandas

I have a large catalog that I am selecting data from according to the following criteria:
columns = ["System", "rp", "mp", "logg"]
catalog = pd.read_csv('data.txt', skiprows=1, sep ='\s+', names=columns)
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
new_catalog = pd.DataFrame(catalog[i])
print("{0} targets after cuts".format(len(new_catalog)))
When I perform the above cuts the code works fine. Next, I want to add one more cut: I want to select all the targets that have 4.0 < logg < 5.0. However, some of the targets have logg = -1 (which means that the value is not available). Luckily, I can calculate logg from the other available parameters. So here are my updated cuts:
# CUTS
i = (catalog.rp != -1) & (catalog.mp != -1)
if catalog.logg[i] == -1:
catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
i &= (4 <= catalog.logg) & (catalog.logg <= 5)
However, I am receiving an error:
if catalog.logg[i] == -1:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
Can someone please explain what I am doing wrong and how I can fix it. Thank you
Edit 1
My dataframe looks like the following:
Data columns:
System 477 non-null values
rp 477 non-null values
mp 477 non-null values
logg 477 non-null values
dtypes: float64(37), int64(3), object(3)None
Edit 2
System rp mp logg FeH FeHu FeHl Mstar Mstaru Mstarl
0 target-01 5196 24 24 0.31 0.04 0.04 0.905 0.015 0.015
1 target-02 5950 150 150 -0.30 0.25 0.25 0.950 0.110 0.110
2 target-03 5598 50 50 0.04 0.05 0.05 0.997 0.049 0.049
3 target-04 6558 44 -1 0.14 0.04 0.04 1.403 0.061 0.061
4 target-05 6190 60 60 0.05 0.07 0.07 1.194 0.049 0.050
....
[5 rows x 43 columns]
Edit 3
My code in a format that I understand should be:
for row in range(len(catalog)):
    parameter = catalog['logg'][row]
    if parameter == -1:
        parameter = catalog['mp'][row] / catalog['rp'][row]
    if parameter > 4.0 and parameter < 5.0:
        # select this row for further analysis
However, I am trying to write my code in a more simple and professional way. I don't want to use the for loop. How can I do it?
EDIT 4
Consider the following small example:
System rp mp logg
target-01 2 -1 2 # will NOT be selected since mp = -1
target-02 -1 3 4 # will NOT be selected since rp = -1
target-03 7 6 4.3 # will be selected since mp != -1, rp != -1, and 4 < logg <5
target-04 3.2 15 -1 # will be selected since mp != -1, rp != -1, logg = mp / rp = 15/3.2 = 4.68 (which is between 4 and 5)
You get the error because catalog.logg[i] is not a scalar but a Series, so you should use vectorized operations instead:
catalog.loc[i,'logg'] = catalog.loc[i,'mp']/catalog.loc[i,'rp']
which modifies the logg column in place.
As for edit 3:
rows=catalog.loc[(catalog.logg > 4) & (catalog.logg < 5)]
which selects the rows that satisfy the condition.
Instead of this code:
if catalog.logg[i] == -1:
    catalog.logg[i] = catalog.mp[i] / catalog.rp[i]
you could use the following (.loc is used here because the older .ix indexer was removed in pandas 1.0):
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
For your edit 3 you need to add this line:
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
Full code:
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
EDIT
Probably I still don't understand what you want, but I get your desired output:
import pandas as pd
from io import StringIO
data = """
System rp mp logg
target-01 2 -1 2
target-02 -1 3 4
target-03 7 6 4.3
target-04 3.2 15 -1
"""
catalog = pd.read_csv(StringIO(data), sep=r'\s+')
i = (catalog.rp != -1) & (catalog.mp != -1)
i &= catalog.logg == -1
catalog.loc[i, 'logg'] = catalog.loc[i, 'mp'] / catalog.loc[i, 'rp']
your_rows = catalog[(catalog.logg > 4) & (catalog.logg < 5)]
In [7]: your_rows
Out[7]:
System rp mp logg
2 target-03 7.0 6 4.3000
3 target-04 3.2 15 4.6875
Am I still wrong?
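One thing to watch in the code above: the final selection filters on logg only, so a row with rp = -1 or mp = -1 and a logg between 4 and 5 would still be selected. A minimal sketch that keeps the cuts as a separate mask and applies them in the final selection, using the small example from EDIT 4, is shown here. The names valid, missing and selected are made up for illustration.

```python
import pandas as pd

catalog = pd.DataFrame({
    "System": ["target-01", "target-02", "target-03", "target-04"],
    "rp": [2, -1, 7, 3.2],
    "mp": [-1, 3, 6, 15],
    "logg": [2, 4, 4.3, -1],
})

# Rows that pass the rp/mp cuts
valid = (catalog["rp"] != -1) & (catalog["mp"] != -1)

# Among those, rows whose logg is missing and must be computed
missing = valid & (catalog["logg"] == -1)
catalog.loc[missing, "logg"] = (
    catalog.loc[missing, "mp"] / catalog.loc[missing, "rp"]
)

# Final selection: all cuts combined, with strict bounds on logg
selected = catalog[valid & (catalog["logg"] > 4) & (catalog["logg"] < 5)]
print(selected)
```

Keeping valid and missing as two masks avoids overwriting the first cut with i &= ..., so both stay available for the last step.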
