KDB: How to join tables with different column names

If I have the following tables:
t1:([] c1: 1 2 3; c2: 120 234 876)
t2:([] cd1:1 2; d: 999 899)
How can I join the tables on t1.c1 = t2.cd1, given that c1 and cd1 are not linked columns?

You're looking to use a left join lj as follows:
q)t1: ([] c1: 1 2 3; c2: 120 234 876)
q)t2:([] cd1:1 2; d: 999 899)
q)t1 lj 1!`c1 xcol t2
c1 c2 d
----------
1 120 999
2 234 899
3 876
where we use xcol to rename the column cd1 in t2 to c1 so that it matches t1, and 1! to key the renamed table on its first column before the left join.
You can read more on joins at https://code.kx.com/q/ref/joins/
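For comparison, a pandas sketch of the same rename-then-left-join (not part of the original q answer; the table contents are taken from the question):
import pandas as pd

t1 = pd.DataFrame({'c1': [1, 2, 3], 'c2': [120, 234, 876]})
t2 = pd.DataFrame({'cd1': [1, 2], 'd': [999, 899]})
# Rename cd1 to c1, then left-join on the now-shared key, mirroring `c1 xcol t2 and lj
out = t1.merge(t2.rename(columns={'cd1': 'c1'}), on='c1', how='left')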

Related

Convert Pandas Column which Consist of list of JSON into new columns

I have DataFrame which have 3 columns:
order_id user_id Details
5c7c9 A [{"amount": "160",'id':'p2'},{"amount": "260",'id':'p3'}]
5c5c4 B [{"amount": "10",'id':'p1'},{"amount": "260",'id':'p3'}]
I want my final Dataframe to be like:
order_id user_id amount id
5c7c9 A 160 p2
5c7c9 A 260 p3
5c5c4 B 10 p1
5c5c4 B 260 p3
First, if necessary, convert the values to lists of dictionaries with ast.literal_eval, then use a dictionary comprehension with the DataFrame constructor and concat, and finally use DataFrame.join to add the result to the original:
import ast
#df['Details'] = df['Details'].apply(ast.literal_eval)
df1 = (pd.concat({k: pd.DataFrame(v) for k, v in df.pop('Details').items()})
.reset_index(level=1, drop=True))
df = df.join(df1, rsuffix='_').reset_index(drop=True)
print (df)
order_id user_id amount id
0 5c7c9 A 160 p2
1 5c7c9 A 260 p3
2 5c5c4 B 10 p1
3 5c5c4 B 260 p3
You can also use (here df1 denotes the original DataFrame):
s = pd.DataFrame([[x] + [z] for x, y in zip(df1.index, df1.Details) for z in y])
s = s.merge(df1, left_on=0, right_index=True).drop(['Details', 0], axis=1)
print(s.pop(1).apply(pd.Series).join(s))
amount id order_id user_id
0 160 p2 5c7c9 A
1 260 p3 5c7c9 A
2 10 p1 5c5c4 B
3 260 p3 5c5c4 B

How to expand a dataframe whose rows contain lists of values? [duplicate]

This question already has an answer here:
How to unnest (explode) a column in a pandas DataFrame?
I have a dataframe like this:
c1 c2 c3
0 [1, 2] [[a, b], [c, d, e]] [[aff , bgg], [cff, ddd, edd]]
I want the output to be like :
c1 c2 c3
0 1 a aff
1 1 b bgg
2 2 c cff
3 2 d ddd
4 2 e edd
You can use np.repeat() and chain.from_iterable():
from itertools import chain
import numpy as np
import pandas as pd

df = pd.DataFrame({'c1': np.repeat(df['c1'].values[0], [len(x) for x in chain.from_iterable(df['c2'])]),
                   'c2': list(chain.from_iterable(chain.from_iterable(df['c2']))),
                   'c3': list(chain.from_iterable(chain.from_iterable(df['c3'])))})
Returns:
c1 c2 c3
0 1 a aff
1 1 b bgg
2 2 c cff
3 2 d ddd
4 2 e edd
Keep in mind that this is relatively specific to your use case. It assumes that your c2 and c3 columns are instantiated with the same shape.
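A more general alternative, assuming pandas 1.3+ (where DataFrame.explode accepts a list of columns) and the same aligned shapes, is to explode all three columns together and then unnest the inner lists; this is a sketch, not the answer's method:
import pandas as pd

# Reconstruction of the example frame from the question
df = pd.DataFrame({'c1': [[1, 2]],
                   'c2': [[['a', 'b'], ['c', 'd', 'e']]],
                   'c3': [[['aff', 'bgg'], ['cff', 'ddd', 'edd']]]})
# First explode pairs each c1 value with its sublists; second explode unnests the sublists
out = df.explode(['c1', 'c2', 'c3']).explode(['c2', 'c3']).reset_index(drop=True)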

Compare Excel cells Python

I would like to compare two parts of two different columns from an Excel file that have different numbers of elements. The comparison is between a part of Column 3 and a part of Column 2: the Column 3 part has j elements and starts at row 1, while the Column 2 part has k elements (k > j) and starts at row j+1. If an element of the Column 3 part matches an element of the Column 2 part, then check whether the Column 1 element at the same index as the matched Column 3 item (i.e. before row j) matches the Column 1 element at the same index as the matched Column 2 item (i.e. between rows j+1 and k). If it does, the Column 4 element at the same index as the matched Column 2 element should be written to a new Excel sheet.
Example: Column3[1] == Column2[2] (element 'A') => Column1[1] == Column1[j+2] (element 'P') => Column4[j+2] should be written to a new sheet.
Column 1 Column 2 Column 3 Column 4
P F A S
B G X T
C H K V
D I M W
P B R B
P A R D
C D H E
D E J k
E M K W
F F L Q
Q F K Q
For reading cells from the original sheet, I have used df27.ix[:j-1, 1].
One part of the code which reads the values of the mentioned parts of column 3 and column 2 might be:
for j in range(1, j):
    c3 = sheet['C' + str(j)].value
    for k in range(j, j + k):
        c2 = sheet['B' + str(k)].value
Any hint how I can accomplish this?
UPDATED
I have tried new code which takes into consideration that we may have '-', as Joaquin used in his example.
Joaquin's example:
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
New code:
from pandas import DataFrame as df
import pandas as pd
import openpyxl

wb = openpyxl.load_workbook('/media/sf_vboxshared/x.xlsx')
sheet = wb.get_sheet_by_name('Sheet1')
C13 = []
C12 = []
C1 = []
C2 = []
C3 = []
for s in range(2, sheet.max_row + 1):
    C1second = sheet['A' + str(s)].value
    C2second = sheet['B' + str(s)].value
    C3second = sheet['C' + str(s)].value
    C1.append(C1second)
    C2.append(C2second)
    C3.append(C3second)
C1 = [x.encode('UTF8') for x in C1]
for y in C2:
    if y is not None:
        C2 = [x.encode('UTF8') if x is not None else None for x in C2]
for z in C3:
    if z is not None:
        C3 = [x.encode('UTF8') if x is not None else None for x in C3]
for x in C1:
    C13.append(x)
for x in C3:
    C13.append(x)
for x in C1:
    C12.append(x)
for x in C2:
    C12.append(x)
tosave = pd.DataFrame()
df[C13] = pd.DataFrame(C13)
df[C12] = pd.DataFrame(C12)
for item in df[C13]:
    if '-' in item: continue
    new = df[df[C12] == item]
    tosave = tosave.append(new)
But I still get the following error: df[C13]=pd.DataFrame(C13) TypeError: 'type' object does not support item assignment. Any idea what is wrong?
Many thanks in advance,
Dan
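As for the TypeError: the import line from pandas import DataFrame as df binds the name df to the DataFrame class itself, so df[C13] = pd.DataFrame(C13) attempts item assignment on a type. A minimal sketch of a fix, keeping the rest of the code as posted:
import pandas as pd

# Build an instance and use string column labels instead of assigning into the class
df = pd.DataFrame({'C13': C13, 'C12': C12})
tosave = pd.DataFrame()
for item in df['C13']:
    if '-' in item:
        continue
    tosave = tosave.append(df[df['C12'] == item])  # DataFrame.append assumes an older pandas, as in the rest of this code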
Given your df is
C1 C2 C3 C4
0 P - A -
1 B - X -
2 C - K -
3 D - M -
4 P B - B
5 P A - D
6 C D - E
7 D E - k
8 E M - W
9 F F - Q
10 Q F - Q
Then I combine C1 with C3, and C1 with C2:
df['C13'] = df.apply(lambda x: x['C1'] + x['C3'], axis=1)
df['C12'] = df.apply(lambda x: x['C1'] + x['C2'], axis=1)
and compare which rows have the same pair of characters in columns C13 and C12, saving the matches in tosave:
tosave = pd.DataFrame()
for item in df['C13']:
    if '-' in item:
        continue
    new = df[df['C12'] == item]
    tosave = tosave.append(new)
this gives you a tosave dataframe with the matching rows:
C1 C2 C3 C4 C13 C12
5 P A - D P- PA
That can be saved directly as it is, or you can save just column C4.
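For the final step the question asks about (writing the result to a new Excel sheet), a minimal sketch, where x.xlsx and 'matches' are placeholder names:
import pandas as pd

# Append a new sheet containing only column C4 of the matched rows
with pd.ExcelWriter('x.xlsx', engine='openpyxl', mode='a') as writer:
    tosave[['C4']].to_excel(writer, sheet_name='matches', index=False)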
UPDATE: If you have data in every row, then you cannot use the '-' detection (or any other detection based on the difference between empty and filled columns). On the other hand, if j and k are not defined (that is, for any j and k), your problem reduces to finding, for each row, identical pairs below that row. Consequently, this:
tosave = pd.DataFrame()
for idx, item in enumerate(df['C13']):
    new = df[df['C12'] == item]
    tosave = tosave.append(new.loc[idx + 1:])
solves the problem, given that your labels and data are like:
C1 C2 C3 C4
0 P F A S
1 B G X T
2 C H K V
3 D I M W
4 P B R B
5 P A R D
6 C D H E
7 D E J k
8 E M K W
9 F F L Q
10 Q F K Q
This code also produces the same output as before:
C1 C2 C3 C4 C13 C12
5 P A R D PR PA
Note this probably needs some refinement (e.g. when a row produces two matches, the second of those rows will itself produce one match, and you will need to remove the duplicates from the final output).
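A sketch of that de-duplication, dropping rows that were appended more than once (identified by repeated index labels):
tosave = tosave[~tosave.index.duplicated(keep='first')]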

How to get rows where a set of columns are equal to a given value in Pandas?

I have a dataframe with many columns (around 1000).
Given a set of columns (around 10), which have 0 or 1 as values, I would like to select all the rows where I have 1s in the aforementioned set of columns.
Toy example. My dataframe is something like this:
c1,c2,c3,c4,c5
'a',1,1,0,1
'b',0,1,0,0
'c',0,0,1,1
'd',0,1,0,0
'e',1,0,0,1
And I would like to get the rows where the columns c2 and c5 are equal to 1:
'a',1,1,0,1
'e',1,0,0,1
What would be the most efficient way to do it?
Thanks!
This would be more generic for a list of columns cols:
In [1277]: cols = ['c2', 'c5']
In [1278]: df[(df[cols] == 1).all(1)]
Out[1278]:
c1 c2 c3 c4 c5
0 'a' 1 1 0 1
4 'e' 1 0 0 1
Or,
In [1284]: df[np.logical_and.reduce([df[x]==1 for x in cols])]
Out[1284]:
c1 c2 c3 c4 c5
0 'a' 1 1 0 1
4 'e' 1 0 0 1
Or,
In [1279]: df.query(' and '.join(['%s==1'%x for x in cols]))
Out[1279]:
c1 c2 c3 c4 c5
0 'a' 1 1 0 1
4 'e' 1 0 0 1
You can try doing something like this (the comparisons must be parenthesized, because & binds tighter than ==):
df.loc[(df['c2'] == 1) & (df['c5'] == 1)]
import pandas as pd

frame = pd.DataFrame([
    ['a', 1, 1, 0, 1],
    ['b', 0, 1, 0, 0],
    ['c', 0, 0, 1, 1],
    ['d', 0, 1, 0, 0],
    ['e', 1, 0, 0, 1]], columns='c1,c2,c3,c4,c5'.split(','))
print(frame.loc[(frame['c2'] == 1) & (frame['c5'] == 1)])

How to filter pandas dataframe on multiple columns based on a dictionary?

I have 3 dictionaries:
A, B, C
and a pandas dataframe with these columns:
['id',
't1',
't2',
't3',
't4']
Now all I want to do is keep only those rows whose t1 is present in dict A, t2 in dict B, and t3 in dict C.
I tried dataframe['t1'] in A,
which gives this error: TypeError: 'Series' objects are mutable, thus they cannot be hashed.
You can try something like this.
df.loc[(df['t1'].isin(A.keys()) & df['t2'].isin(B.keys()) & df['t3'].isin(C.keys()))]
I hope this is what you want.
In [51]: df
Out[51]:
t1 t2 t3 t4 max_value
0 1 4 5 2 5
1 34 70 1 5 70
2 43 89 4 11 89
3 22 76 4 3 76
In [52]: A = {34: 4}
In [53]: B = {70: 5, 89: 3}
In [54]: C = {1: 3, 5:1}
In [55]: df.loc[(df['t1'].isin(A.keys()) & df['t2'].isin(B.keys()) & df['t3'].isin(C.keys()))]
Out[55]:
t1 t2 t3 t4 max_value
1 34 70 1 5 70
To answer @EdChum, I have assumed the OP wants to check the presence of values in the dictionary keys.
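If membership among the dictionary values were wanted instead, the same isin pattern applies; a sketch:
df.loc[df['t1'].isin(A.values()) & df['t2'].isin(B.values()) & df['t3'].isin(C.values())]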
