Sorting contents of file alphabetically with trailing numbers - python

I've got a text file with contents similar to this:
ABC-330 YELLOW
ABC-1150 GREEN
ABC-888 YELLOW
ABC-450 YELLOW
ABC-50 YELLOW
ABC-550 GREEN
etc.
I'd like to have it sorted like this:
ABC-50 YELLOW
ABC-330 YELLOW
ABC-450 YELLOW
ABC-550 GREEN
ABC-888 YELLOW
ABC-1150 GREEN
I've tried sorted(file_content.readlines()) but it puts ABC-1150 GREEN at the beginning, since it begins with '1'.
How to sort the contents the way I'd like?

Start by obtaining a list from the text file:
l = [line.rstrip('\n') for line in open('your_file.txt')]
And then use sorted with the following custom function in the key parameter, so that the items in the list are sorted by the first substring and then by the digits turned into int:
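import re
def fun(x):
    # maxsplit=1 keeps exactly three parts even if the rest of the line contains '-<digits>'
    st, dig, _ = re.split(r'-(\d+)', x, maxsplit=1)
    return st, int(dig)
sorted(l, key=fun)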
['ABC-50 YELLOW',
'ABC-330 YELLOW',
'ABC-450 YELLOW',
'ABC-550 GREEN',
'ABC-888 YELLOW',
'ABC-1150 GREEN']
Where:
re.split(r'-(\d+)', 'ABC-330 YELLOW')
# ['ABC', '330', ' YELLOW']
gives the items of interest within the string. You only need to tell sorted to sort by the first two, casting the second to an integer.
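If the text before the number can vary, the same idea generalizes to what is often called a natural sort key. The following is a minimal sketch rather than part of the answer above: `natural_key` and the sample `lines` list are illustrative names, and it assumes every run of digits should compare as an integer.

```python
import re

def natural_key(text):
    """Build a sort key in which digit runs compare as integers.

    re.split with a capturing group returns non-digit chunks at even
    indices and the captured digit runs at odd indices.
    """
    chunks = re.split(r'(\d+)', text)
    return [int(chunk) if index % 2 else chunk
            for index, chunk in enumerate(chunks)]

# Sample data copied from the question's example lines.
lines = [
    'ABC-330 YELLOW',
    'ABC-1150 GREEN',
    'ABC-888 YELLOW',
    'ABC-450 YELLOW',
    'ABC-50 YELLOW',
    'ABC-550 GREEN',
]

sorted_lines = sorted(lines, key=natural_key)
print(sorted_lines)
```

Because strings and integers always alternate in the key, two keys never compare a string against an integer at the same position.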

Using sorted with key
Ex:
import re
with open(filename) as infile:
    data = infile.readlines()
print(sorted(data, key=lambda x: int(x.split()[0].split("-")[1])))
# Or using a regex.
print(sorted(data, key=lambda x: int(re.search(r"(\d+)", x).group(1))))
Output:
['ABC-50 YELLOW\n',
'ABC-330 YELLOW\n',
'ABC-450 YELLOW\n',
'ABC-550 GREEN',
'ABC-888 YELLOW\n',
'ABC-1150 GREEN\n']
Note: The lambda function is used to fetch the integer in the string

Use sorted with a key function:
import re
with open(filename) as f:
    data = f.read()
# splitlines() avoids a trailing empty string, for which re.search would return None
sorted_data = '\n'.join(sorted(data.splitlines(), key=lambda row: int(re.search(r'\d+', row).group())))
print(sorted_data)
with open(filename, 'w') as f:
    f.write(sorted_data)
Output:
ABC-50 YELLOW
ABC-330 YELLOW
ABC-450 YELLOW
ABC-550 GREEN
ABC-888 YELLOW
ABC-1150 GREEN

Related

Formatting file into list with price , product name and quantity

I need the simplest way to do the following.
I have a file like this, containing product names with prices:
blackberry 23 100
Black shirt with hoody (small) 4 800
Pastel Paint (red) (oil) 2 600
How can I format these into a list like this?
lst=[['blackberry' ,23 ,100],['Black shirt with hoody (small)' ,4 ,800],['Pastel Paint (red) (oil)' ,2 ,600]]
I am trying with split. It works when the product name contains only one word, for example blackberry, but it stops working when the name contains more words, because I am splitting on spaces.
Use str.rsplit. It splits starting from the right end of the string and performs at most as many splits as you give in the second argument (the first argument is the separator), as follows:
l = [
    "blackberry 23 100",
    "Black shirt with hoody (small) 4 800",
    "Pastel Paint (red) (oil) 2 600"
]
outlist = [x.rsplit(" ", 2) for x in l]
print(outlist)
You've accurately described the logic issue: you need to gather all of the words in one phrase, rather than splitting on spaces. Note the common characteristic of the input lines: you have words followed by two integers. One way is to split, but then recombine all but the last two elements. Another is to use the rsplit method with a limit of 2 fields to split. The second is probably better.
You could also handle this with a regular expression (regex), but that would require learning another facility, likely more than you want right now.
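The two approaches described above (split and recombine, or rsplit with a limit of two splits) can be sketched side by side. This is an illustration under assumptions, not the answerer's code: the function names and the neutral names `first_number` and `second_number` are made up here, and both numeric fields are assumed to be whole numbers.

```python
def parse_by_recombining(line):
    """Split on whitespace, then rejoin everything except the last two fields."""
    parts = line.split()
    name = ' '.join(parts[:-2])
    return [name, int(parts[-2]), int(parts[-1])]

def parse_by_rsplit(line):
    """Split at most twice, starting from the right end of the line."""
    name, first_number, second_number = line.strip().rsplit(None, 2)
    return [name, int(first_number), int(second_number)]

# Sample lines copied from the question.
sample_lines = [
    'blackberry 23 100',
    'Black shirt with hoody (small) 4 800',
    'Pastel Paint (red) (oil) 2 600',
]

# Both approaches agree on this data.
for line in sample_lines:
    assert parse_by_recombining(line) == parse_by_rsplit(line)

parsed = [parse_by_rsplit(line) for line in sample_lines]
print(parsed)
```

One difference worth knowing: recombining with ' '.join collapses repeated spaces inside the product name, while rsplit leaves the name exactly as it appears in the line.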
with open('demo.txt') as f:  # demo.txt is your file
    lines = f.readlines()
datas = [line.strip().rsplit(' ', 2) for line in lines]
print(datas)
Output
[['blackberry', '23', '100'], ['Black shirt with hoody (small)', '4', '800'], ['Pastel Paint (red) (oil)', '2', '600']]
Here's one way using a list comprehension and str.rsplit. We use str.isdigit to select items for integer conversion:
from io import StringIO
mystr = StringIO("""blackberry 23 100
Black shirt with hoody (small) 4 800
Pastel Paint (red) (oil) 2 600""")
res = []
# replace mystr with open('file.txt', 'r')
with mystr as fin:
    for line in fin:
        res.append([i if not i.isdigit() else int(i)
                    for i in line.strip().rsplit(' ', 2)])
[['blackberry', 23, 100],
['Black shirt with hoody (small)', 4, 800],
['Pastel Paint (red) (oil)', 2, 600]]
You can index from the end of the split line. For example, the last item is the price, the second-to-last is the quantity, and everything before that is the product name. Once you have each item, you can append them to the list.
lst = []
with open('test.txt', 'r') as file:
    content = file.readlines()
for c in content:
    new = c.split()
    price = new[-1]
    quantity = new[-2]
    name = ' '.join(new[:-2])
    nlst = [name, quantity, price]
    lst.append(nlst)
Output:
[['blackberry phone', '2', '500']]
You can use re.split and re.findall:
import re
data = [re.split(r'(?<=[a-zA-Z\W])\s(?=\d)', i.strip('\n')) for i in open('filename.txt')]
final_data = [[a, *map(int, re.findall(r'\d+', b))] for a, b in data]
Output:
[['blackberry', 23, 100], ['Black shirt with hoody (small)', 4, 800], ['Pastel Paint (red) (oil)', 2, 600]]

how to count number of lines in each several paragraph

I wish to count the number of lines in paragraph from text file which looks like this:
text file =
black
yellow
pink

hills
mountain
liver

barbecue
spaghetti
I want to know that the last paragraph has less or more lines than others and then remove it.
The result I want:
black
yellow
pink

hills
mountain
liver
I tried in this way:
c = []
with open(file) as paragraph:
    index = 0
    for line in paragraph:
        if line.strip():
            index += 1
        c.append(index)
but it struck me that this could be too complicated... maybe?
The file test_line.txt
black
yellow
pink

hills
mountain
liver

barbecue
spaghetti
Start counting the lines using index.
On line 6, check whether a blank line was reached. If so, append the number of lines counted for the paragraph to the list and reset index to 0.
On line 9, count the lines.
On line 11, append the count for the last paragraph.
Now you have a list that contains the number of lines in each paragraph. Do with it as you please.
Here's your modified code-
file = "test_line.txt"
c = []
with open(file) as paragraph:
    index = 0
    for line in paragraph:
        if line == '\n':
            c.append(index)
            index = 0
        else:
            index+=1
    c.append(index)
print(c)
OUTPUT
[3, 3, 2]
Hope it helps!
You could split by \n\n and use a list comprehension:
test.txt
black
yellow
pink

hills
mountain
liver

barbecue
spaghetti
test.py
with open('test.txt') as f:
    output = f.read().strip()  # drop a trailing newline so the last paragraph is not overcounted
x = [len(i.split('\n')) for i in output.split('\n\n')]
print(x)
Output:
[3, 3, 2] # 2 is the one you want to remove
You can use something like this:
from itertools import groupby
lines = open("test.txt").read().splitlines()
paragraphs = [list(groups) for keys, groups in groupby(lines, lambda x: x != "") if keys]
This reads the file and splits it into lines, then itertools.groupby groups consecutive lines by whether they are empty. Without the if keys filter, the grouping would be:
[['black', 'yellow', 'pink'], [''], ['hills', 'mountain', 'liver'], [''], ['barbecue', 'spaghetti']]
The if keys filter drops the groups of empty lines, leaving one sublist per paragraph.
Output:
[['black', 'yellow', 'pink'], ['hills', 'mountain', 'liver'], ['barbecue', 'spaghetti']]
Each sublist is now a paragraph whose lines you can count. For the first paragraph, len(paragraphs[0]) gives 3. For example:
for paragraph in paragraphs:
    print(len(paragraph))
Output:
3
3
2
Now you just need to add your own logic to finish this. You can use del paragraphs[i] to delete the ith sublist.
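One possible way to finish the task is sketched below. It is a hedged illustration, not the answerer's code: `split_paragraphs`, `drop_last_if_different`, and `sample_text` are hypothetical names, and it assumes the last paragraph should be removed when its line count differs from that of the first paragraph.

```python
from itertools import groupby

def split_paragraphs(text):
    """Group consecutive non-blank lines into paragraphs (lists of lines)."""
    lines = text.splitlines()
    return [list(group)
            for has_text, group in groupby(lines, key=lambda line: line.strip() != '')
            if has_text]

def drop_last_if_different(paragraphs):
    """Return the paragraphs without the last one if its length differs.

    Assumption: the last paragraph is compared against the first one.
    """
    if len(paragraphs) > 1 and len(paragraphs[-1]) != len(paragraphs[0]):
        return paragraphs[:-1]
    return paragraphs

# Sample text matching the file from the question.
sample_text = "black\nyellow\npink\n\nhills\nmountain\nliver\n\nbarbecue\nspaghetti\n"

kept = drop_last_if_different(split_paragraphs(sample_text))
result_text = '\n\n'.join('\n'.join(paragraph) for paragraph in kept)
print(result_text)
```

If the paragraphs before the last one can have different lengths themselves, the comparison rule would need to be adapted, for example by comparing against the most common length.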

Making lists from multiple lines in a text file (Python)

I am trying to turn a text file into multiple lists but I am unsure how,
say for example the text file is:
Bob 16 Green
Sam 19 Blue
Sally 18 Brown
I then want to make three lists,
[Bob, Sam, Sally] , [16, 19, 18], [Green, Blue, Brown]
thanks
Keeping tokens as strings (not converting integers or anything), using a generator expression:
Iterate over the file/text lines, split each line into words, and zip the word lists together. That "transposes" the lists the way you want:
f = """Bob 16 Green
Sam 19 Blue
Sally 18 Brown""".splitlines()
print (list(zip(*(line.split() for line in f))))
result (as a list of tuples):
[('Bob', 'Sam', 'Sally'), ('16', '19', '18'), ('Green', 'Blue', 'Brown')]
* unpacks the generator expression into arguments of zip, so zip processes the results of split.
Even simpler: using map avoids the generator expression, since we have split handy (str.split(x) is the functional notation for x.split()). It should even be slightly faster:
print(list(zip(*map(str.split, f))))
Note that my example is standalone, but you can replace f with a file handle.
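If actual lists (and integer ages, as in the question's desired output) are wanted instead of tuples of strings, the transposed columns can be converted afterwards. This is a small sketch under the assumption that the second column always holds whole numbers; the variable names are illustrative.

```python
text = """Bob 16 Green
Sam 19 Blue
Sally 18 Brown"""

# Transpose the rows into columns, turning each tuple from zip into a list.
names, ages, colours = (list(column)
                        for column in zip(*map(str.split, text.splitlines())))

# Assumption: the second column always holds whole numbers.
ages = [int(age) for age in ages]

print(names, ages, colours)
```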
A simple oneliner, assuming you load the file as a string:
list(zip(*[line.split(' ') for line in filecontent.split('\n')]))
I first split it at all the new lines, each of those at all of the spaces, and then flip it (zip-operator)
The code below does what you want:
names = []
ages = []
fav_color = []
# splitlines() avoids a trailing empty line, which would raise an error below
flist = open('myfile.txt', 'r').read().splitlines()
for line in flist:
    name, age, color = line.split(' ')
    names.append(name)
    ages.append(age)
    fav_color.append(color)
print(names)
print(ages)
print(fav_color)
lines.txt has the text contents:
*
Bob 16 Green
Sam 19 Blue
Sally 18 Brown
*
Python code:
with open("lines.txt", "r") as myfile:
    data = myfile.readlines()
for line in data:
    words = line.split()  # avoid naming this "list", which would shadow the builtin
    print(words)

Not getting the proper output with re.findall

I'm trying to write code that reads my input csv file with pandas (df_input) and then uses re.findall to find any occurrence of the variables in a list. This list is imported from another .csv file, where column[0] (df_expression) contains the variables I want the code to search for and column[1] (df_translation) contains the values I want the code to return when there's an exact match. This way, when I search for colors like 'burgundy' and 'maroon', they get translated to 'red'. I've been trying this setup so I can change my expressions and translations without having to change the code itself.
import re
df_name = df_input[0]
def expression(expr, string):
    return len(re.findall(r'\b' + expr + r'\b', string, re.I)) > 0
resultlist = []
for lineIndex in range(0, len(df_input)):
    matches_list = []
    for expIndex in range(0, len(df_expressions)):
        # .ix was removed in pandas 1.0; .iloc selects by position
        if expression(str(df_expressions.iloc[expIndex]), str(df_name.iloc[lineIndex])):
            matches_list.append(df_translation.iloc[expIndex])
    resultlist.append(matches_list)
df_input['Color'] = resultlist
These are the return values:
resultlist
[['Black'], ['White'], ['Blue'], ['Red', 'Black'], ['Pink'], .....
Current output as found in my output.csv after df_input.to_csv(filepath+filename):
Name,Color
a black car,['Black']
a white paper,['White']
the sky is blue,['Blue']
this product is burgundy and black,['Red, Black']
just pink,['Pink']
Preferred output.csv:
Name,Color
a black car,Black
a white paper,White
the sky is blue,Blue
this product is burgundy and black,Red;Black
just pink,Pink
Is it possible to lose the brackets and quotes so whenever I do df_input.to_csv(filepath+filename) I get a clean output?
I've tried df.replace() - doesn't work, neither does adding [0] at the end of my re.findall and a bunch of other stuff. Only thing that seems to do the job is to str(resultlist).replace(), but then the index-match combination is pretty messed up. Any suggestions?
Try following changes and see how it behaves.
Replace
df_input['Color'] = resultlist
With
df_input['Color'] = [', '.join(c) for c in resultlist]
This should transform resultlist into ['Black', 'White', 'Blue', 'Red, Black', 'Pink', ...]. Use ';'.join(c) instead to get the semicolon separator shown in the preferred output.
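The effect of joining can be checked without pandas. The sketch below uses the standard csv module as a stand-in for df_input.to_csv, so it only illustrates the string handling; the `names` list is copied from the question's output and everything else is an assumption made for the example.

```python
import csv
import io

# Stand-ins for the DataFrame columns in the question.
names = [
    'a black car',
    'a white paper',
    'the sky is blue',
    'this product is burgundy and black',
    'just pink',
]
resultlist = [['Black'], ['White'], ['Blue'], ['Red', 'Black'], ['Pink']]

# Join each list of matches into one string before writing it out.
colors = [';'.join(matches) for matches in resultlist]

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator='\n')
writer.writerow(['Name', 'Color'])
writer.writerows(zip(names, colors))
csv_text = buffer.getvalue()
print(csv_text)
```

Writing the lists directly would store their str() form, brackets and quotes included, which is where the unwanted characters in the current output come from.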

How to split a string into different values?

I have:
file=open("file.txt","r")
and the file is in the form:
apple\red\3 months
pear\green\4 months
how do I split the file so it becomes in the form of a list:
fruit = ['apple', 'pear']
colour = ['red','green']
expire = ['3 months', '4 months']
I have absolutely no idea and would appreciate help. What I have now:
file = open('file.txt','r')
for i in file:
    i.readline()
    i.split('\ ')
I don't know if this is right, and I have no idea what to do once I've split it into the form of:
apple
red
3 months
pear
green
4 months
How do I make the first row and every third row after it into one list, the second row and every third row after it into another, and so on?
You can split the line and add each part to a list. For example:
fruit = []
colour = []
expire = []
with open('file.txt', 'r') as file:
    for i in file:
        # strip() removes the trailing newline before splitting
        fruit_, colour_, expire_ = i.strip().split('\\')
        fruit.append(fruit_)
        colour.append(colour_)
        expire.append(expire_)
You can do that as follows:
s = [r'apple\red\3 months', r'pear\green\4 months']  # raw strings keep the backslashes literal
fruit = [i.split('\\')[0] for i in s]
colour = [i.split('\\')[1] for i in s]
expire = [i.split('\\')[2] for i in s]
You can use sequence unpacking with the .split() method, and then add each value to their separate list. Note that backslashes are used to escape special sequences, so you have to escape the backslash with a backslash and split on '\\' instead.
>>> line = r'apple\red\3 months'
>>> line = line.split('\\')
>>> line
['apple', 'red', '3 months']
>>> fruit, colour, expire = line
>>> print(fruit, colour, expire)
apple red 3 months
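The escaping issue only affects string literals typed into source code. The following sketch shows the difference; the variable names are illustrative.

```python
# Without escaping, '\r' is a carriage return and '\3' is an octal escape,
# so this literal contains no backslash characters at all.
unescaped = 'apple\red\3 months'
print(unescaped.split('\\'))  # a single-element list: nothing to split on

# Either double the backslashes or use a raw string literal.
escaped = 'apple\\red\\3 months'
raw = r'apple\red\3 months'
assert escaped == raw

fruit, colour, expire = raw.split('\\')
print(fruit, colour, expire)
```

Text read from a file already contains the literal backslash characters, so no raw string is needed there; only the separator passed to split has to be written as '\\'.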
When reading from files you also have to .strip() each line because they have newline characters at the end. Solution:
data = {'fruits': [], 'colours': [], 'expires': []}
with open('file.txt') as f:
    for line in f:
        fruit, colour, expire = line.strip().split('\\')
        data['fruits'].append(fruit)
        data['colours'].append(colour)
        data['expires'].append(expire)
Extendable version:
columns = ['fruits', 'colours', 'expires']
data = {c: [] for c in columns}
with open('file.txt') as f:
    for line in f:
        line = line.strip().split('\\')
        for i, c in enumerate(columns):
            data[c].append(line[i])
One-liner (the column names must be wrapped in their own list so that zip treats them as one row):
with open('file.txt') as f:
    data = {c: d for c, *d in zip(*([['fruits', 'colours', 'expires']] + [line.strip().split('\\') for line in f]))}
for l in open("i.txt"):
    for m in l.split('\\'):
        print(m.strip())
Use zip with *:
>>> s = r'''apple\red\3 months
... pear\green\4 months'''
>>> list(zip(*(x.rstrip().split('\\') for x in s.splitlines())))
[('apple', 'pear'), ('red', 'green'), ('3 months', '4 months')]
For a file you can do something like:
with open("file.txt") as f:
    fruit, colour, expire = zip(*(line.rstrip().split('\\') for line in f))
zip returns tuples instead of lists, so you can convert them to lists using list().
fruit, color, expire = [], [], []
for i in open('file.txt'):
    fr, col, exp = i.strip().split('\\')
    fruit.append(fr)
    color.append(col)
    expire.append(exp)
print(fruit)   # ['apple', 'pear']
print(color)   # ['red', 'green']
print(expire)  # ['3 months', '4 months']
You are on the right track. Drop the i.readline() call, since iterating over the file already yields each line as a string:
fruit = []
color = []
expire = []
file = open('file.txt','r')
for i in file:
    f, c, exp = i.strip().split('\\')
    fruit.append(f)
    color.append(c)
    expire.append(exp)
