Convert this Weight/Score matrix into a list of column names sorted by their Weight/Score using Python - python

Convert the Weight/Score matrix read from an input .csv file into a list of column names sorted by descending Weight/Score, using Python Apache Beam, and write the result into another .csv file.
Input .csv file
user_id, cat_1, cat_2, cat_3, cat_4, cat_5, cat_6
1 , 0.10, 0.2, 0.20, 0.12, 0.7, 0.6
2 , 0.6, 0.20, 0.12, 0.15, 0.13, 0.11
3 , 0.11, 0.10, 0.8, 0.12, 0.3, 0.7
4 , 0.2, 0.11, 0.12, 0.6, 0.9, 0.21
5 , 0.9, 0.8, 0.5, 0.1, 0.0, 0.11
Desired output .csv file
user_id, top_3_categories
1, [cat_3, cat_4, cat_1]
2, [cat_2, cat_4, cat_3]
3, [cat_4, cat_1, cat_2]
4, [cat_6, cat_3, cat_2]
5, [cat_6, cat_1, cat_2]
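The per-row ranking can be sketched in plain Python as below. This is a minimal sketch, not a Beam pipeline: the names `top_categories` and `parse_csv` and the shortened sample data are assumptions made for illustration. It ranks strictly by descending score; the rows in the desired output above do not all appear to follow that order for the sample scores, so the expected ranking rule may need checking.

```python
import csv
import io


def top_categories(row, n=3, id_field="user_id"):
    """Return (user_id, names of the n highest-scoring columns)."""
    scores = {k: float(v) for k, v in row.items() if k != id_field}
    # sorted() is stable, so columns with equal scores keep their original order.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return row[id_field], ranked[:n]


def parse_csv(text):
    """Parse CSV text into dicts, trimming the padding around each cell."""
    reader = csv.reader(io.StringIO(text))
    header = [h.strip() for h in next(reader)]
    return [dict(zip(header, (c.strip() for c in cells))) for cells in reader if cells]


# Shortened, hypothetical sample in the same layout as the input file.
sample = """user_id, cat_1, cat_2, cat_3
1 , 0.10, 0.2, 0.7
2 , 0.6, 0.20, 0.12
"""

results = [top_categories(row, n=2) for row in parse_csv(sample)]
for user_id, cats in results:
    print(f"{user_id}, [{', '.join(cats)}]")
```

In a Beam pipeline, a function like `top_categories` could presumably be applied to each parsed row with `beam.Map`; reading and writing the files is not shown here.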

Related

How do I plot three histograms in the same figure, where each of them represents a range of values?

I'm new to this forum. I have searched earlier posts for a long time, but no answer fully solved my problem.
I read the posts at the following links:
Multiple Histograms, each for a label of x-axis, on the same graph matplotlib
Plot two histograms at the same time with matplotlib
and many others, but found nothing that fits.
So I decided to ask my question.
I have two arrays of probabilities like these (for simplicity I show two short lists):
a = [0.1, 0.2, 0.4, 0.56, 0.67, 0.70, 0.89, 0.90]
b = [0.15, 0.22, 0.41, 0.47, 0.45, 0.59, 0.66, 0.75, 0.83, 0.99]
I must create a histogram that shows three groups of bars, each made of 2 bars (one for array a and one for array b).
The first group of bars must represent the values that are between 0.0 (included) and 0.4 (excluded), the second group the values that are between 0.4 (included) and 0.65 (excluded), and the last group the remaining values.
On the y axis I would prefer to have relative frequency (instead of absolute frequency).
I should obtain something like this https://ibb.co/41BdCCP (which I found at https://plot.ly/python/bar-charts/), but on the x axis I would like the ranges of values (instead of animal names) and on the y axis the relative frequency (as I wrote before).
Thank you so much; I hope that someone is able to solve my problem.
I don't know exactly what you want: a bar chart or a histogram. But here is a histogram based on your question.
Here I am using a mask to define the groups you want to plot. You can adapt the solution to your problem. I created some test arrays as an example:
import numpy as np
import matplotlib.pyplot as plt
a = np.array([0.04, 0.09, 0.1, 0.12, 0.2, 0.4, 0.42, 0.44, 0.47, 0.5, 0.53, 0.56, 0.67, 0.70, 0.75, 0.76, 0.78, 0.79, 0.89, 0.90])
b = np.array([0.05, 0.08, 0.15, 0.12, 0.22, 0.41, 0.43, 0.44, 0.46, 0.47, 0.45, 0.51, 0.54, 0.59, 0.66, 0.75, 0.75, 0.76, 0.77, 0.8, 0.83, 0.99])
lim = [0, 0.4, 0.65, 1]
# Plot each range separately, selecting its values with a boolean mask
for i in range(len(lim)-1):
    plt.hist(a[(a>=lim[i]) & (a<lim[i+1])], color='r')
    plt.hist(b[(b>=lim[i]) & (b<lim[i+1])], color='b')
plt.show()
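If grouped bars with relative frequencies are what is wanted, one possible approach is to count the values per range first and then draw the counts as side-by-side bars. The sketch below is only an illustration: the helper names `relative_frequencies` and `plot_grouped` are made up here, and it assumes all values are probabilities, so everything below 0.4 falls in the first group and everything from 0.65 upward in the last.

```python
from bisect import bisect_right


def relative_frequencies(values, inner_edges):
    """Share of values in each range; each range includes its lower edge."""
    counts = [0] * (len(inner_edges) + 1)
    for v in values:
        # bisect_right puts a value equal to an edge into the range that starts there.
        counts[bisect_right(inner_edges, v)] += 1
    return [c / len(values) for c in counts]


def plot_grouped(freq_a, freq_b, labels):
    """Draw two bars per range, one for each array (requires matplotlib)."""
    import matplotlib.pyplot as plt

    width = 0.4
    xs = list(range(len(labels)))
    plt.bar([x - width / 2 for x in xs], freq_a, width, label="a")
    plt.bar([x + width / 2 for x in xs], freq_b, width, label="b")
    plt.xticks(xs, labels)
    plt.ylabel("relative frequency")
    plt.legend()
    plt.show()


a = [0.1, 0.2, 0.4, 0.56, 0.67, 0.70, 0.89, 0.90]
b = [0.15, 0.22, 0.41, 0.47, 0.45, 0.59, 0.66, 0.75, 0.83, 0.99]
inner_edges = [0.4, 0.65]  # boundaries between the three groups

freq_a = relative_frequencies(a, inner_edges)
freq_b = relative_frequencies(b, inner_edges)
print(freq_a, freq_b)

# Uncomment to draw the chart:
# plot_grouped(freq_a, freq_b, ["[0, 0.4)", "[0.4, 0.65)", ">= 0.65"])
```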

How to manipulate data in Python 3 with matplotlib?

I have the following data:
[0.21, 0.21, 0.33, 0.52, 0.22, 0.35, 0.43]
I would like to do the following things:
Draw a bar chart like this:
The x-axis should be 0.21, 0.22, 0.33, 0.35, 0.43, 0.52.
The second thing I would like to do is use value ranges; for example, I would like to change the x-axis to: 0.01-0.2, 0.2-0.4, 0.4-0.6
Instead of looping over the values one by one, is there a smarter way?
Part 2. of your question is simple enough to do - you just need to define the width of the bins by passing a list of the boundaries.
import matplotlib.pyplot as plt
X = [0.21, 0.21, 0.33, 0.52, 0.22, 0.35, 0.43]
plt.hist(X, bins=[0.0, 0.2, 0.4, 0.6])
plt.show()
This will create bins [0.0, 0.2), [0.2, 0.4), [0.4, 0.6], where '[' and ']' are inclusive and ')' is exclusive.
It is not clear what you require in part 1 of your question.
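One possible reading of part 1 is a bar per distinct value, with the bar height being how often that value occurs. Under that assumption (the question does not confirm it), the counting can be sketched with `collections.Counter`; the helper `plot_counts` is a made-up name and needs matplotlib only when called.

```python
from collections import Counter

data = [0.21, 0.21, 0.33, 0.52, 0.22, 0.35, 0.43]

counts = Counter(data)                   # occurrences of each distinct value
values = sorted(counts)                  # x-axis: distinct values in ascending order
heights = [counts[v] for v in values]    # y-axis: how often each value occurs


def plot_counts(values, heights):
    """Draw one evenly spaced bar per distinct value (requires matplotlib)."""
    import matplotlib.pyplot as plt

    positions = list(range(len(values)))
    plt.bar(positions, heights)
    plt.xticks(positions, [str(v) for v in values])
    plt.show()


print(values, heights)
# Uncomment to draw the chart:
# plot_counts(values, heights)
```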

How to correctly make a SQL query with a parameter?

Currently cur.fetchone returns "None". But if I change ID=?, (number_row,) to ID=2 (or another number), everything works as planned. Changing "cur.fetchone" to "cur.fetchall" returns an empty result. Where have I gone wrong?
import sqlite3
import numpy as np
con = sqlite3.connect('database_all.db')
power_list = np.array([-70, 200, 300, 400, 480, 520, 600])
temperature = 90
number_row = np.searchsorted(power_list, temperature, side='right')
print(number_row)
cur = con.cursor()
cur.execute("SELECT VARIANTS from P_steel_3_2 WHERE ID=?", (number_row,))
pressure_variants = cur.fetchone()
print(pressure_variants)
cur.close()
And my table (generated in DB Browser for SQLite)
BEGIN TRANSACTION;
CREATE TABLE "P_steel_3_2" (
`ID` INTEGER,
`TEMPERATURE` REAL,
`VARIANTS` REAL
);
INSERT INTO `P_steel_3_2` (ID,TEMPERATURE,VARIANTS) VALUES (0,'-70_200','0.1, 0.25, 0.4, 0.6 ,1 ,1.6, 2.5 ,4, 6.4, 10, 16');
INSERT INTO `P_steel_3_2` (ID,TEMPERATURE,VARIANTS) VALUES (1,'200-300','0.09, 0.22, 0.36, 0.56, 0.9, 1.4, 2.2, 3.6, 5.6, 9, 14');
INSERT INTO `P_steel_3_2` (ID,TEMPERATURE,VARIANTS) VALUES (2,'300-400','0.08, 0.2, 0.32, 0.5, 0.8, 1.25, 2, 3.2, 5, 8, 12.5');
INSERT INTO `P_steel_3_2` (ID,TEMPERATURE,VARIANTS) VALUES (3,'400-480','0.07, 0.18, 0.28, 0.45, 0.7, 1.1, 1.8, 2.8, 4.5, 7.1, 10.2');
INSERT INTO `P_steel_3_2` (ID,TEMPERATURE,VARIANTS) VALUES (5,'520-600','0.05, 0.11, 0.18, 0.28, 0.45, 0.7, 1.1, 1.8, 2.8, 4.5, 7.1');
INSERT INTO `P_steel_3_2` (ID,TEMPERATURE,VARIANTS) VALUES (6,'600-700','0, 0.06, 0.09, 0.15, 0.22, 0.36, 0.56, 0.9, 1.4, 2.2, 3.6');
INSERT INTO `P_steel_3_2` (ID,TEMPERATURE,VARIANTS) VALUES (4,'480-520','0.06, 0.16, 0.25, 0.4, 0.64, 1, 1.6, 2.5, 4, 6.4, 10');
COMMIT;
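np.searchsorted returns NumPy indices: an array for array input, and a NumPy integer scalar (not a built-in Python int) for a scalar input such as temperature. sqlite3 may not bind that NumPy type as an INTEGER, so the comparison ID=? matches no row. You need to convert the index to a plain int before using it in the query.
# Works for both a scalar result and an array of indices
row_id = int(np.asarray(number_row).ravel()[0])
cur = con.cursor()
cur.execute("SELECT VARIANTS from P_steel_3_2 WHERE ID=?", (row_id,))
pressure_variants = cur.fetchone()
print(pressure_variants)
cur.close()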
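A self-contained sketch of the parameterized lookup, with the index passed to sqlite3 as a plain Python int, could look like this. It makes several assumptions for illustration: an in-memory database, TEXT columns, shortened sample rows, and `bisect.bisect_right` from the standard library standing in for `np.searchsorted(..., side='right')` so that the example runs without NumPy.

```python
import sqlite3
from bisect import bisect_right

# In-memory stand-in for database_all.db, with shortened sample rows.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE P_steel_3_2 (ID INTEGER, TEMPERATURE TEXT, VARIANTS TEXT)")
con.executemany(
    "INSERT INTO P_steel_3_2 (ID, TEMPERATURE, VARIANTS) VALUES (?, ?, ?)",
    [
        (0, "-70_200", "0.1, 0.25, 0.4"),
        (1, "200-300", "0.09, 0.22, 0.36"),
        (2, "300-400", "0.08, 0.2, 0.32"),
    ],
)

power_list = [-70, 200, 300, 400, 480, 520, 600]
temperature = 90

# bisect_right plays the role of np.searchsorted(..., side='right') here.
# int() keeps the bound parameter a built-in int; with NumPy the same
# conversion would be applied to the value returned by np.searchsorted.
row_id = int(bisect_right(power_list, temperature))

cur = con.cursor()
cur.execute("SELECT VARIANTS FROM P_steel_3_2 WHERE ID=?", (row_id,))
pressure_variants = cur.fetchone()
print(pressure_variants)
cur.close()
con.close()
```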

Mixed length object type in pandas dataframe

I want to use the pandas library to store mixed length objects.
Let's say for instance that I want to have a dataframe with two columns: the first one storing a float and the second one storing a list of float.
What is the best way to do this in pandas, bearing in mind that I want to be able to sort the data using the first column?
import pandas as pd
data = {
'a': [.1,.2,.3],
'b': [ [.1,.2], [.3,.4,.5,.6,.7], [.8,.9,1.] ],
}
df = pd.DataFrame(data)
print(df)
Result:
     a                          b
0  0.1                 [0.1, 0.2]
1  0.2  [0.3, 0.4, 0.5, 0.6, 0.7]
2  0.3            [0.8, 0.9, 1.0]
Sorted in descending order of a:
print(df.sort_values('a', ascending=False))
     a                          b
2  0.3            [0.8, 0.9, 1.0]
1  0.2  [0.3, 0.4, 0.5, 0.6, 0.7]
0  0.1                 [0.1, 0.2]

searching k nearest neighbors in numpy

I'm new to Python. I want to use numpy and sklearn to do KNN. However, there's a nan in my data. I set the dtype of genfromtxt to None, but the array then looks like this:
[('ADT1_YEAST', 0.58, 0.61, 0.47, 0.13, 0.5, 0.0, 0.48, 0.22, 'MIT')
('ADT2_YEAST', 0.43, 0.67, 0.48, 0.27, 0.5, 0.0, 0.53, 0.22, 'MIT')
('ADT3_YEAST', 0.64, 0.62, 0.49, 0.15, 0.5, 0.0, 0.53, 0.22, 'MIT') ...,
('ZNRP_YEAST', 0.67, 0.57, 0.36, 0.19, 0.5, 0.0, 0.56, 0.22, 'ME2')
('ZUO1_YEAST', 0.43, 0.4, 0.6, 0.16, 0.5, 0.0, 0.53, 0.39, 'NUC')
('G6PD_YEAST', 0.65, 0.54, 0.54, 0.13, 0.5, 0.0, 0.53, 0.22, 'CYT')]
Then I get a "data type not understood" error from the NearestNeighbors function.
Here is my code:
npGem = np.genfromtxt('temp.data', dtype=None)
X = np.array(npGem)
nbrs = NearestNeighbors(n_neighbors=5, algorithm='ball_tree').fit(X)
Can anyone show me how to get this data read correctly? Thanks in advance.
I think you need to get the data into a matrix properly. I typically do something like this:
import numpy as np
features = []  # list of lists of the feature variables
classes = []   # list of the target variables
with open('temp.data') as f:  # file name taken from the question
    for line in f:
        line = line.strip().split()  # split the line into pieces on any whitespace
        features.append(line[1:-1])  # or whatever indices your features are in
        classes.append(line[-1])     # or whatever index your target variable is in
classes = np.array(classes)
features = np.array(features, dtype=float)  # np.float was removed from NumPy; use float
If I understand the problem, you're really asking how to encode the categorical variables such that they can be properly interpreted by the nearest neighbors algorithm. You can do this with sklearn as explained in 4.2.4. Encoding categorical features. On the other hand, if you have incomplete features, see 4.2.6. Imputation of missing values.
