Error when using .between() to check a long/lat location pulled from a dictionary - python

I'm scanning through a data frame which is grouped by a specific id and am trying to return a location's surface type depending on certain long/lat boundaries which I have in a dictionary. The problem with the data set is that it's recorded at 100 frames per second, so I am trying to use the median value, as values before and after this point are incorrect.
I am using pandas in a Jupyter notebook.
This is the function that should pull the locations from the dictionary; the location is just a made-up example:
pitch_boundaries = {
    'Astro': {'max_long': -6.123456, 'min_long': -6.123456,
              'max_lat': 53.123456, 'min_lat': 53.123456},
}

def get_loc_name(loc_df, pitch_boundaries):
    for pitch_name, coord_limits in pitch_boundaries.items():
        between_long_limits = loc_df['longitude'].median().between(coord_limits['min_long'], coord_limits['max_long'])
        between_lat_limits = loc_df['latitude'].median().between(coord_limits['min_lat'], coord_limits['max_lat'])
        if between_long_limits.any() and between_lat_limits.any():
            return pitch_name
    # If we get here then there is no pitch.
I call it here:
def makeAverageDataFrame(df):
    pitchBounds = get_loc_name(df, pitch_boundaries)
Finally, here is where the error occurs:
for region, df_region in df_Will.groupby('session_id'):
    makeAverageDataFrame(df_region)
Actual results
# AttributeError: 'float' object has no attribute 'between'
or, if I remove .median(), the result is None
What I want is a new dataframe with something like:

| surface |
|---------|
| Astro   |
| Grass   |
| Astro   |

loc_df['longitude'] is a Series, and loc_df['longitude'].median() gives you a float, which does not have a between method. Try loc_df[['longitude']] instead:
def get_loc_name(loc_df, pitch_boundaries):
    for pitch_name, coord_limits in pitch_boundaries.items():
        between_long_limits = loc_df[['longitude']].median().between(coord_limits['min_long'], coord_limits['max_long'])
        between_lat_limits = loc_df[['latitude']].median().between(coord_limits['min_lat'], coord_limits['max_lat'])
        if between_long_limits.any() and between_lat_limits.any():
            return pitch_name
And your problem with returning None is that your makeAverageDataFrame does not return anything (hence None). Try:
def makeAverageDataFrame(df):
    pitchBounds = get_loc_name(df, pitch_boundaries)
    return pitchBounds
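To see the whole fix end to end, here is a minimal, self-contained sketch; the boundary coordinates and session values are made up for illustration:

import pandas as pd

pitch_boundaries = {
    'Astro': {'max_long': -6.0, 'min_long': -6.2,
              'max_lat': 53.2, 'min_lat': 53.0},
}

def get_loc_name(loc_df, pitch_boundaries):
    for pitch_name, coord_limits in pitch_boundaries.items():
        # Double brackets keep a DataFrame, so .median() yields a Series
        # that still has a .between() method.
        between_long = loc_df[['longitude']].median().between(
            coord_limits['min_long'], coord_limits['max_long'])
        between_lat = loc_df[['latitude']].median().between(
            coord_limits['min_lat'], coord_limits['max_lat'])
        if between_long.any() and between_lat.any():
            return pitch_name

# Two sessions: one inside the 'Astro' bounds, one outside.
df = pd.DataFrame({
    'session_id': [1, 1, 2, 2],
    'longitude': [-6.10, -6.11, -5.00, -5.01],
    'latitude': [53.10, 53.11, 52.00, 52.01],
})

surfaces = [get_loc_name(g, pitch_boundaries) for _, g in df.groupby('session_id')]
print(surfaces)  # ['Astro', None]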

Related

Using a function with apply in Python

I have a column within my dataframe that I want to analyze and return a specific value to a new column called "Trade". Data in the column being analyzed looks like this:
QTY
(1.00)
1,418,999.89
328,536.93
-100
If the value in the quantity column is greater than 0 I want to return "Buy_Code"; if not, return "Sell_Code".
I tried to create a function to loop through the dataframe and then use that function within apply, but it's not working right. I know there is a ton of info out there on this topic, but I am struggling to grasp how this should be written. Thanks for the help.
def trade_type():
    for index, row in df_loan_tape.iterrows():
        if row['QTY'] > 0:
            df_loan_tape['Trade'] = 'Buy_Code'
        else:
            df_loan_tape['Trade'] = 'Sell_Code'

df_loan_tape['Trade'] = df_loan_tape['QTY'].apply(trade_type)
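A vectorized approach avoids both the iterrows loop and the apply call. Here is a minimal sketch, assuming QTY has already been parsed to numbers (the sample values contain commas and accounting-style parentheses, which would need cleaning first):

import numpy as np
import pandas as pd

# Hypothetical stand-in for df_loan_tape with QTY already numeric.
df_loan_tape = pd.DataFrame({'QTY': [-1.00, 1418999.89, 328536.93, -100]})

# np.where evaluates the whole column at once: no loop, no row-wise apply.
df_loan_tape['Trade'] = np.where(df_loan_tape['QTY'] > 0, 'Buy_Code', 'Sell_Code')
print(df_loan_tape)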

ECM Classification Binary Label Issue; Python 3.x recordlinkage library

I've been running into a problem in Python when using the recordlinkage library. My attempts at using the fit function of an ECMClassifier object have all resulted in errors. I've tried looking around, but I've had no success, as none of the solutions I found relate to the recordlinkage library.
To help with the understanding of the problem, I'll describe what the dataframes I am working with are like.
Dataframe1 contains seven columns in the following order: Block_Key, cust_name, physical_padress1, physical_address2, city, state, zip.
Dataframe2 has eight columns: Block_Key, business_name, tradestyle, sec_trdstl, physical_street, physical_city, physical_state, physical_zip.
They are indexed with blocking on the Block_Key to return a list of MultiIndex objects.
def block_key_pairing(self, block_keys):
    df1Data = pandas.read_excel(self.Dataframe1)
    df2Data = pandas.read_excel(self.Dataframe2)
    candidate_df = pandas.DataFrame()
    pair_list = []
    for block_key in block_keys:
        bkIndexer = BlockKeyIndexer(block_key)
        pair = bkIndexer.index(df1Data, df2Data)
        for index in pair:
            candidate_df.loc[index] = block_key
        pair_list.append(pair)
    candidate_df.to_excel('C:\\Users\\Documents\\Pairs & Block Keys.xlsx')
    return pair_list
BlockKeyIndexer is a custom indexing class; I needed it because the regular Indexer returns pairs with different block keys when they differ by only one or two characters.
The function where the candidate pairs are used is shown below; it is also where I fiddled with classifications over the course of a couple of days.
def comparator(self, candidate_links):
    comp = recordlinkage.Compare()
    vector_df = pandas.DataFrame()
    comp.string("cust_name", "business_name", method='levenshtein', label='Business Name')
    comp.string("cust_name", "tradestyle", method='levenshtein', label='Trade Style')
    comp.string("cust_name", "sec_trdstl", method='levenshtein', label='Secondary TS')
    comp.string("physical_address_1", "physical_street", method='levenshtein', label='Primary Address')
    comp.string("physical_address_2", "physical_street", method='levenshtein', label='Secondary Address')
    comp.string("city", "physical_city", method='levenshtein', label='City')
    comp.string("state", "physical_state", method='levenshtein', label='State')
    comp.numeric("zip", "physical_zip", label='Zip')
    df1Data = pandas.read_excel(self.Dataframe1)
    df2Data = pandas.read_excel(self.Dataframe2)
    for candidate_pair in candidate_links:
        vector_df = vector_df.append(comp.compute(candidate_pair, df1Data, df2Data))
    vector_df = vector_df.astype(np.int_)
    ecm = recordlinkage.ECMClassifier(init='jaro', binarize=0.8)
    result_ecm = ecm.fit(vector_df)
    print(len(result_ecm))
    return vector_df
So a vector dataframe called vector_df is created, and when I try to use the fit function, it yields this error:
ValueError: Only binary labels are allowed for 'jaro'method. Column 2 has 1 different labels.
I've become very perplexed by this because it seems to imply that Columns 0 & 1 are acceptable.
So I figured the problem was that the values inside the vector_df weren't binary, so I tried modifying the comparator function like so.
def comparator(self, candidate_links):
    comp = recordlinkage.Compare()
    vector_df = pandas.DataFrame()
    comp.string("cust_name", "business_name", method='levenshtein', label='Business Name')
    comp.string("cust_name", "tradestyle", method='levenshtein', label='Trade Style')
    #comp.string("cust_name", "sec_trdstl", method='levenshtein', label='Secondary TS')
    comp.string("physical_address_1", "physical_street", method='levenshtein', label='Primary Address')
    comp.string("physical_address_2", "physical_street", method='levenshtein', label='Secondary Address')
    comp.string("city", "physical_city", method='levenshtein', label='City')
    comp.string("state", "physical_state", method='levenshtein', label='State')
    comp.numeric("zip", "physical_zip", label='Zip')
    df1Data = pandas.read_excel(self.Dataframe1)
    df2Data = pandas.read_excel(self.Dataframe2)
    for candidate_pair in candidate_links:
        vector_df = vector_df.append(comp.compute(candidate_pair, df1Data, df2Data))
    vector_df = vector_df.astype(np.int_)
    ecm = recordlinkage.ECMClassifier(init='jaro', binarize=0.8)
    result_ecm = ecm.fit(vector_df)
    print(len(result_ecm))
    return vector_df
However, this yields a new error when running:
ValueError: could not broadcast input array from shape (12) into shape (13)
I've tried debugging and I looked inside the library to see what the process is but I still can't figure out what I need to do to the vector_df to have my code run correctly.
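One plausible culprit, offered as a guess rather than a confirmed fix: vector_df.astype(np.int_) truncates the float similarity scores (all in [0, 1]) to 0 for everything below 1.0, which can easily leave a column with a single distinct label, and that is exactly what the first error complains about. A minimal sketch that thresholds instead of truncating and drops any columns that still end up constant (the 0.8 cutoff is an assumption carried over from the binarize argument):

# Sketch only: assumes vector_df holds float similarity scores in [0, 1].
binary_df = (vector_df >= 0.8).astype(int)               # threshold, don't truncate
binary_df = binary_df.loc[:, binary_df.nunique() > 1]    # drop single-label columns
ecm = recordlinkage.ECMClassifier()
matches = ecm.fit_predict(binary_df)
print(len(matches))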

bound method DataFrame.toJSON of DataFrame

I am trying to convert my dataframe into a JSON string. I am using PySpark.
This is the code I am using:
def produceTrainData(self, csvData):  # Array[String] = {
    trainData = csvData.withColumn("therapyClass", lit("REMODULIN"))\
        .withColumn("patientAge", lit(52))\
        .withColumn("patientSex", lit("M"))\
        .withColumn("serviceType", lit("PHARMACY"))\
        .withColumn("npiId", lit("27"))\
        .withColumn("requestID", lit(419568891))\
        .withColumn("requestDateTime", lit("20171909 21:30:55"))
    selectData = trainData.select("payorId", "patientId", "therapyType", "therapyClass", "ndcNumber", "procedureCode", "patientAge", "patientSex",
                                  "placeOfService", "serviceDuration", "daysOrUnits", "charges", "serviceDate", "serviceType", "serviceBranchId",
                                  "npiId", "diagnosisCode", "authNbr", "requestID", "requestDateTime")
    authNbrFilter = col("authNbr") != "-"
    filterData = selectData.where(authNbrFilter)  # .limit(20)
    print(filterData)
    filterData.show(20, False)
    jsons = filterData.toJSON
    print(jsons)
There are two errors:
When I print the jsons variable (print(jsons)), it does not return an RDD like it is supposed to; instead it returns:
bound method DataFrame.toJSON of DataFrame
I am very interested to know the reason for this error.
When I try to collect the jsons variable, it shows this error:
AttributeError: 'function' object has no attribute 'collect'.
Do you know the reason for this error?
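Both symptoms point to the same cause: toJSON is a method, and without parentheses you are printing (and later calling collect on) the bound method object itself rather than its result. A minimal sketch of the likely fix:

# Call the method: DataFrame.toJSON() returns an RDD of JSON strings.
jsons = filterData.toJSON()
print(jsons)            # now an RDD rather than a bound method
print(jsons.collect())  # a list of JSON strings, one per row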

Converting a column with address and coordinates to string with .astype(str) drops the coordinates

I'm using the geopy package to search addresses for their coordinates, with the column returning the matched address and coordinates. I want to just get the coordinates.
Here is a test to show you how it works:
# Test to see if response is obtained for easy address
location = geolocator.geocode("175 5th Avenue NYC", timeout=10)
print((location.latitude, location.longitude))
>>> (40.7410861, -73.9896298241625)
In my code I have a CSV with cities that are then looked up using the geopy package:
data['geocode_result'] = [geolocator.geocode(x, timeout = 60) for x in data['ghana_city']]
I want to just get the coordinates from here
Using extract doesn't seem to work and just returns NaN values despite the regex being fine:
p = r'(?P<latitude>-?\d+\.\d+)?(?P<longitude>-?\d+\.\d+)'
data[['g_latitude', 'g_longitude']] = data['geocode_result2'].str.extract(p, expand=True)
data
I have a feeling that these problems are coming about due to the object that's returned from geopy in the column.
The regex is sound, as verified on Regexr.com.
I have tried converting the column to a string, but the coordinates are dropped?!
data['geocode_result2'] = (data['geocode_result2']).astype(str)
data
Can anyone help here? Thanks a lot.
Dummy data:
The column I want to extract the coordinates from is geocode_result2 or geocode_result
geocode_result2
1 (Agona Swedru, Central Region, Ghana, (5.534454, -0.700763))
2 (Madina, Adenta, Greater Accra Region, PMB 107 MD, Ghana, (5.6864962, -0.1677052))
3 (Ashaiman, Greater Accra Region, TM3 8AA, Ghana, (5.77329565, -0.110766330148484))
Final code to get coordinates:
data['geocode_result'] = [geolocator.geocode(x, timeout = 60) for x in data['ghana_city']]
x = data['geocode_result']
data.dropna(subset=['geocode_result'], inplace=True)
data['g_latitude'] = data['geocode_result'].apply(lambda loc: loc.latitude)
data['g_longitude'] = data['geocode_result'].apply(lambda loc: loc.longitude)
data
geolocator.geocode returns a Location object rather than a string (though its string representation actually contains the lat/long you were trying to parse), so the lat/long can be retrieved by accessing the location.latitude / location.longitude attributes respectively.
# Make geocoding requests
data['geocode_result'] = [geolocator.geocode(x, timeout = 60) for x in data['ghana_city']]
# Extract lat/long to separate columns
data['g_latitude'] = data['geocode_result'].apply(lambda loc: loc.latitude)
data['g_longitude'] = data['geocode_result'].apply(lambda loc: loc.longitude)
Result
(I'm unable to comment due to lack of reputation, so I'm answering the coordinate-drop confusion here.)
str(location) returns a textual address (without coordinates), but repr(location) returns a string in the following format (which includes the coordinates):
Location(%(address)s, (%(latitude)s, %(longitude)s, %(altitude)s))
What you see when you print data uses repr (pandas seems to drop the leading Location type for brevity), so you can see the coordinates. But when the column is converted to str, it uses the str representation, which doesn't include the coordinates. That's the whole magic here.
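A quick sketch makes the difference visible; Nominatim and the user_agent string here are just one way to obtain a geocoder, and any geopy Location behaves the same:

from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="str-vs-repr-demo")  # Nominatim requires a user_agent
location = geolocator.geocode("175 5th Avenue NYC", timeout=10)

print(str(location))   # textual address only, no coordinates
print(repr(location))  # Location(address, (latitude, longitude, altitude))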
You can try using .apply and .str
Ex:
import re

def getLatLog(d):
    try:
        return re.findall(r"\d+\.\d+", d)
    except:
        return [None, None]

df['g_latitude'], df['g_longitude'] = df["geocode_result2"].apply(lambda x: getLatLog(x)).str

print(df["g_latitude"])
print(df["g_longitude"])
Output:
0 5.534454
1 5.6864962
2 5.77329565
Name: g_latitude, dtype: object
0 0.700763
1 0.1677052
2 0.110766330148484
Name: g_longitude, dtype: object
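One caveat with this approach: the pattern \d+\.\d+ makes no provision for a leading minus sign, which is why the longitudes in the output above come back positive even though the source values are negative. Using r"-?\d+\.\d+" instead would preserve the sign.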

Get results in an Earth Engine python script

I'm trying to get the NDVI mean for every polygon in a feature collection with the Earth Engine Python API.
I think that I succeeded in getting the result (a feature collection in a feature collection), but then I don't know how to get data out of it.
The data I want is the ID of each feature and the NDVI mean in each feature.
import datetime
import ee
ee.Initialize()

# Feature collection
fc = ee.FeatureCollection("ft:1s57dkY_Sg_E_COTe3sy1tIR_U-5Gw-BQNwHh4Xel")
fc_filtered = fc.filter(ee.Filter.equals('NUM_DECS', 1))

# Image collection
Sentinel_collection1 = ee.ImageCollection('COPERNICUS/S2').filterBounds(fc_filtered)
Sentinel_collection2 = Sentinel_collection1.filterDate(datetime.datetime(2017, 1, 1), datetime.datetime(2017, 8, 1))

# NDVI function to use with ee map
def NDVIcalc(image):
    red = image.select('B4')
    nir = image.select('B8')
    ndvi = nir.subtract(red).divide(nir.add(red)).rename('NDVI')
    # NDVI mean calculation with reduceRegions
    MeansFeatures = ndvi.reduceRegions(reducer=ee.Reducer.mean(), collection=fc_filtered, scale=10)
    return MeansFeatures

# The result I don't know how to get the information from: feature IDs and NDVI means
result = Sentinel_collection2.map(NDVIcalc)
If the result is small, you can pull it into Python using result.getInfo(). That will give you a Python dictionary containing a list of FeatureCollections (which are more dictionaries). However, if the results are large or the polygons cover large regions, you'll have to Export the collection instead.
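For instance, here is a minimal sketch of walking the getInfo() payload once the nested result has been cast and flattened (flatten() is suggested in point 1 below; the 'mean' property name is what a mean reducer typically attaches, so print one feature first to confirm the keys):

flat = ee.FeatureCollection(result).flatten()
info = flat.getInfo()  # plain Python dicts and lists from here on
for f in info['features']:
    # Each feature dict carries an 'id' and a 'properties' mapping.
    print(f['id'], f['properties'].get('mean'))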
That said, there are probably some other things you'll want to do first:
1) You might want to flatten() the collection, so you don't have nested collections. It'll be easier to handle that way.
2) You might want to add a date to each result so you know what time the result came from. You can do that with a map on the result, inside your NDVIcalc function:
return MeansFeatures.map(lambda f: f.set('date', image.date().format()))
3) If what you really want is a time-series of NDVI over time for each polygon (most common), then restructuring your code to map over polygons first will be easier:
Sentinel_collection = (ee.ImageCollection('COPERNICUS/S2')
    .filterBounds(fc_filtered)
    .filterDate(ee.Date('2017-01-01'), ee.Date('2017-08-01')))

def GetSeries(feature):
    def NDVIcalc(img):
        red = img.select('B4')
        nir = img.select('B8')
        ndvi = nir.subtract(red).divide(nir.add(red)).rename(['NDVI'])
        return (feature
                .set(ndvi.reduceRegion(ee.Reducer.mean(), feature.geometry(), 10))
                .set('date', img.date().format("YYYYMMdd")))

    series = Sentinel_collection.map(NDVIcalc)

    # Get the time-series of values as two lists.
    values = series.reduceColumns(ee.Reducer.toList(2), ['date', 'NDVI']).get('list')
    return feature.set(ee.Dictionary(ee.List(values).flatten()))

result = fc_filtered.map(GetSeries)
print(result.getInfo())
4) And finally, if you're going to try to Export the result, you're likely to run into an issue where the columns of the exported table are selected from whatever columns the first feature has, so it's good to provide a "header" feature that has all columns (times), that you can merge() with the result as the first feature:
# Get all possible dates.
dates = ee.List(Sentinel_collection.map(
    lambda img: ee.Feature(None, {'date': img.date().format("YYYYMMdd")})
).aggregate_array('date'))

# Make a default value for every date.
header = ee.Feature(None, ee.Dictionary.fromLists(dates, ee.List.repeat(-1, dates.size())))
output = ee.FeatureCollection(header).merge(result)

ee.batch.Export.table.toDrive(...)
