If a row in a DataFrame contains a certain string, delete it - Python

I have to delete rows of a DataFrame if they contain a certain string.
The problem is that the rows are very long and contain text.
A loop does not work, and collecting the indices in a list and then calling .drop on them does not work either.
column1
8
8
8
8 total <-------- This must be deleted
8
8
8
8
8
...
Thanks

Suppose your dataframe is called df. Then use:
df_filtered = df[~df['column1'].str.contains('total')]
Explanation:
df['column1'].str.contains('total') gives you a boolean array, the length of the DataFrame column, that is True wherever df['column1'] contains 'total'. With ~ you swap the True and False values of this array. Finally, df_filtered = df[...] keeps only the rows for which 'total' is not contained.
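
For reference, here is a minimal self-contained sketch of the same idea on data shaped like the column1 example above (the sample values are assumptions for illustration):
import pandas as pd

# assumed sample data mimicking the question's column1
df = pd.DataFrame({'column1': ['8', '8', '8', '8 total', '8']})

# True wherever the text contains 'total'
mask = df['column1'].str.contains('total')

# keep only the rows where the mask is False
df_filtered = df[~mask]
print(df_filtered)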

If I understood correctly, here is a small example where the DataFrame is called df and I want to search for 'mixfruit' and delete it.
>>> df
name num
0 apple 5
1 banana 3
2 mixfruit 5
3 carret 6
One way, as others mentioned, is to go with str.contains as follows:
>>> df[~df.name.str.contains("mix")]
name num
0 apple 5
1 banana 3
3 carret 6
You can use isin as well, which will drop all rows whose value matches the given string exactly:
>>> df[~df['name'].isin(['mixfruit'])]
name num
0 apple 5
1 banana 3
3 carret 6
However, you can achieve the same as follows...
>>> df[df['name'] != 'mixfruit']
name num
0 apple 5
1 banana 3
3 carret 6
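
As a side note (not from the original answers): str.contains also accepts case and na arguments, which help when the column has mixed case or missing values. A small sketch, assuming such values may occur in 'name':
# case-insensitive match, and NaN cells count as "no match"
df[~df['name'].str.contains('mix', case=False, na=False)]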

Related

Pandas DataFrame drop duplicates keeping first 'x' occurrences [duplicate]

What I am looking for is a function that works exactly like pandas.DataFrame.drop_duplicates() but that allows me to keep not only the first occurrence but the first 'x' occurrences (say, 10). Does anything like that exist?
Thanks for your help!
IIUC, one way to do this would be with groupby and head, to select the first x occurrences. As noted in the docs, head:
Returns first n rows of each group.
Sample code:
x = 10
df.groupby('col').head(x)
Here col is the column you want to check for duplicates, and x is the number of occurrences you want to keep for each value in col.
For instance:
In [81]: df.head()
Out[81]:
a b
0 3 0.912355
1 3 2.091888
2 3 -0.422637
3 1 -0.293578
4 2 -0.817454
....
# keep 3 first instances of each value in column a:
x = 3
df.groupby('a').head(x)
Out[82]:
a b
0 3 0.912355
1 3 2.091888
2 3 -0.422637
3 1 -0.293578
4 2 -0.817454
5 1 1.476599
6 1 0.898684
8 2 -0.824963
9 2 -0.290499
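
For a fully reproducible sketch of the groupby/head idea (column names and values below are made up for illustration):
import numpy as np
import pandas as pd

# assumed example data
np.random.seed(0)
df = pd.DataFrame({'a': [3, 3, 3, 1, 2, 1, 1, 2, 2, 2],
                   'b': np.random.randn(10)})

# keep at most the first 3 rows for each distinct value in column 'a'
x = 3
print(df.groupby('a').head(x))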

Drop all rows that have at least one cell containing only '-' in a DataFrame

There are negative numbers, and some cells contain only -. They are there when the data is imported. How can I drop them from the DataFrame? Basically this is a notation for a missing value.
For example, you have a df
df = pd.DataFrame(dict(A=[2,2,2,2,2,2,2], what=["what","what","what","-","what","-","what"]))
Then use
df[df.what.str.contains("-") == False]
output
A what
0 2 what
1 2 what
2 2 what
4 2 what
6 2 what
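
If the '-' placeholder can appear in any column rather than only in what, one possible sketch (an assumption about the intent, reusing the example df above) is to compare the whole frame against '-' and drop any row with a match:
# drop every row in which at least one cell is exactly '-'
df_clean = df[~df.eq('-').any(axis=1)]
print(df_clean)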

Iterating over dataframe doesn't give expected result

I am importing a column of a CSV file into my Python script using pandas.read_csv().
I am doing:
data = pandas.read_csv(path)
for i in data:
    print(i)
Why does it print only the first element of the column?
And when I convert it to a NumPy array using npdata = np.array(data) and then print it, it prints everything except the first element.
Actually, what I am trying to do is load a time/date column from a CSV file and do some feature engineering, but I have a problem loading it correctly.
Because iterating over data iterates over the column names, which is not what you're looking for.
To iterate over rows, use df.iterrows instead:
data = pandas.read_csv(path)
for i, row in data.iterrows():
    print(row)
MCVE:
print(df)
A B
0 1 2
1 3 4
2 5 6
3 7 8
4 9 10
for d in df:
    print(d)
A
B
for i, d in df.iterrows():
    print(d['A'], d['B'])
1 2
3 4
5 6
7 8
9 10
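
As an aside (not part of the original answer), df.itertuples is usually a faster alternative to iterrows when you only need the row values:
for row in df.itertuples(index=False):
    print(row.A, row.B)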
As mentioned in comments, if all you want to do is take a peek at your data, print out df.head:
print(df.head(3))  # the argument is any value greater than zero, giving the number of rows to show
A B
0 1 2
1 3 4
2 5 6
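
Since the original goal was loading a time/date column, here is a small sketch of one common approach (the file path and the column name 'timestamp' are assumptions):
import pandas as pd

# parse the assumed 'timestamp' column as datetimes while reading
data = pd.read_csv('data.csv', parse_dates=['timestamp'])

# iterate over the values of one column instead of over the DataFrame itself
for ts in data['timestamp']:
    print(ts)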

Attempting to delete multiple rows from Pandas Dataframe but more rows than intended are being deleted

I have a list, to_delete, of row indexes that I want to delete from both of my two Pandas Dataframes, df1 & df2. They both have 500 rows. to_delete has 50 entries.
I run this:
df1.drop(df1.index[to_delete], inplace=True)
df2.drop(df2.index[to_delete], inplace=True)
But this results in df1 and df2 having 250 rows each. It deletes 250 rows from each, and not the 50 specific rows that I want it to...
to_delete is sorted in descending order.
The full method:
def method(results):
    # results is a 500 x 1 matrix of 1s and -1s
    global df1, df2
    deletions = []
    for i in xrange(len(results)-1, -1, -1):
        if results[i] == -1:
            deletions.append(i)
    df1.drop(df1.index[deletions], inplace=True)
    df2.drop(df2.index[deletions], inplace=True)
Any suggestions as to what I'm doing wrong?
(I've also tried using .iloc instead of .index, and deleting in the if statement instead of appending to a list first.)
Your index values are not unique, and when you use drop it removes all rows with those index values. to_delete may have had length 50, but there were 250 rows that carried those particular index values.
Consider the example
df = pd.DataFrame(dict(A=range(10)), [0, 1, 2, 3, 4] * 2)
df
A
0 0
1 1
2 2
3 3
4 4
0 5
1 6
2 7
3 8
4 9
Let's say you want to remove the first, third, and fourth rows.
to_del = [0, 2, 3]
Using your method
df.drop(df.index[to_del])
A
1 1
4 4
1 6
4 9
Which is a problem
Option 1
Use np.in1d to find the complement of to_del.
This is more self-explanatory than the others. I'm looking at an array of positions from 0 to n and checking whether each is in to_del. The result is a boolean array the same length as df. I use ~ to get the negation and use that to slice the dataframe.
df[~np.in1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 2
Use np.bincount to find the complement of to_del.
This accomplishes the same thing as option 1 by counting the positions defined in to_del. I end up with an array of 0s and 1s, with a 1 in each position defined in to_del and 0 elsewhere. I want to keep the 0s, so I build a boolean array by finding where it is equal to 0. I then use this to slice the dataframe.
df[np.bincount(to_del, minlength=len(df)) == 0]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
Option 3
Use np.setdiff1d to find the positions to keep.
This uses set logic to find the difference between the full array of positions and the ones I want to delete. I then use iloc to select.
df.iloc[np.setdiff1d(np.arange(len(df)), to_del)]
A
1 1
4 4
0 5
1 6
2 7
3 8
4 9
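
Applied back to the original question (df1, df2, and to_delete as in the post, with to_delete holding positional indices), a sketch using Option 1 could look like this:
import numpy as np

# boolean mask of positions to keep: the complement of to_delete
keep = ~np.in1d(np.arange(len(df1)), to_delete)

df1 = df1[keep]
df2 = df2[keep]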

Python pandas: Append rows of DataFrame and delete the appended rows

import pandas as pd
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11],
    'text': ['abc', 'zxc', 'qwe', 'asf', 'efe', 'ert', 'poi', 'wer', 'eer', 'poy', 'wqr']})
I have a DataFrame with columns:
id text
1 abc
2 zxc
3 qwe
4 asf
5 efe
6 ert
7 poi
8 wer
9 eer
10 poy
11 wqr
I have a list L = [1,3,6,10] which contains ids.
I am trying to join the text column using this list: taking 1 and 3 (the first two values in the list), I append the text of id 2 to the row with id = 1 and then delete the row with id 2; similarly, taking 3 and 6, I append the text where id = 4, 5 to id 3 and then delete the rows with id = 4 and 5, and so on for each consecutive pair (x, x+1) of elements in the list.
My final output would look like this:
id text
1 abczxc # joining id 1 and 2
3 qweasfefe # joining id 3,4 and 5
6 ertpoiwereer # joining id 6,7,8,9
10 poywqr # joining id 10 and 11
You can use isin with where and ffill to build a grouping Series, which is then used in groupby with apply and join:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
df1 = df.groupby(s)['text'].apply(''.join).reset_index()
print (df1)
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr
It works because:
s = df.id.where(df.id.isin(L)).ffill().astype(int)
print (s)
0 1
1 1
2 3
3 3
4 3
5 6
6 6
7 6
8 6
9 10
10 10
Name: id, dtype: int32
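
As a related sketch (an assumption on my part, not code from the original answer): the isin mask can also be turned into a group key with cumsum, since every id in L starts a new block; the key is then numbered 1 through 4 instead of carrying the id values:
# running count of ids that appear in L -> 1,1,2,2,2,3,3,3,3,4,4
g = df.id.isin(L).cumsum().rename('grp')
out = df.groupby(g).agg({'id': 'first', 'text': ''.join}).reset_index(drop=True)
print(out)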
I changed the values not in the list to np.nan, then used ffill and groupby. Though @Jezrael's approach is much better; I need to remember to use cumsum. :)
import numpy as np

l = [1, 3, 6, 10]
df.loc[~df.id.isin(l), 'id'] = np.nan
df = df.ffill().groupby('id').sum()
text
id
1.0 abczxc
3.0 qweasfefe
6.0 ertpoiwereer
10.0 poywqr
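
A small follow-up sketch (not part of the original answer): after the ffill and groupby, the id values end up as a float index, so they can be restored as an integer column if needed:
df = df.reset_index()
df['id'] = df['id'].astype(int)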
Use pd.cut to create your bins, then groupby with a lambda function to join the text in each group.
df.groupby(pd.cut(df.id,L+[np.inf],right=False, labels=[i for i in L])).apply(lambda x: ''.join(x.text))
EDIT:
(df.groupby(pd.cut(df.id, L + [np.inf],
                   right=False,
                   labels=[i for i in L]))
   .apply(lambda x: ''.join(x.text))
   .reset_index()
   .rename(columns={0: 'text'}))
Output:
id text
0 1 abczxc
1 3 qweasfefe
2 6 ertpoiwereer
3 10 poywqr
