Is there a faster way to generate the required output than using a one-to-many join in Proc SQL?

I require an output that shows the total number of hours worked in a rolling 24 hour window. The data is currently stored such that each row is one hourly slot (for example 7-8am on Jan 2nd) per person and how much they worked in that hour stored as "Hour". What I need to create is another field that is the sum of the most recent 24 hourly slots (inclusive) for each row. So for the 7-8am example above I would want the sum of "Hour" across the 24 rows: Jan 1st 8-9am, Jan 1st 9-10am... Jan 2nd 6-7am, Jan 2nd 7-8am.
Rinse and repeat for each hourly slot.
There are 6000 people, and we have 6 months of data, which means the table has 6000 * 183 days * 24 hours = 26.3m rows.
I have currently done this using the code below, which works very easily on a sample of 50 people but, somewhat understandably, grinds to a halt when I try it on the full table.
Does anyone have any other ideas? All date/time variables are in datetime format.
proc sql;
create table want as
select x.*
, case when Hours_Wrkd_In_Window > 16 then 1 else 0 end as Correct
from (
select a.ID
, a.Start_DTTM
, a.End_DTTM
, sum(b.hours) as Hours_Wrkd_In_Window
from have a
left join have b
on a.ID = b.ID
and b.start_dttm > a.start_dttm - (24 * 60 * 60)
and b.start_dttm <= a.start_dttm
where datepart(a.Start_dttm) >= &report_start_date.
and datepart(a.Start_dttm) < &report_end_date.
group by a.ID
, a.Start_DTTM
, a.End_DTTM
) x
order by x.ID
, x.Start_DTTM
;quit;

A compound index on the columns being accessed in the joined table - id + start_dttm + hours - would be useful if there isn't one already.
Using msglevel=i will print some diagnostics about how the query is executed. It may give some additional hints.

The most performant DATA step solution most likely involves a ring array that tracks the 1-hour time slots and the hours worked within them. The ring allows a rolling aggregate (sum and count) to be computed from what goes into and out of the ring.
If you have a wide SAS license, look into the procedures in SAS/ETS (Econometrics and Time Series). Proc EXPAND might have some rolling aggregate capability.
This sample DATA step code took <10s (WORK folder on SSD) to run on simulated data for 6k people with 6 months of complete coverage of 1-hour time slots.
data have(keep=id start_dt end_dt hours);
do id = 1 to 6000;
do start_dt
= intnx('dtmonth', datetime(), -12)
to intnx('dtmonth', datetime(), -6)
by dhms(0,1,0,0)
;
end_dt = start_dt + dhms(0,1,0,0);
hours = 0.25 * floor (5 * ranuni(123)); * 0, 1/4, 1/2, 3/4 or 1 hour;
output;
end;
end;
format hours 5.2;
run;
/* %let log= ; options obs=50 linesize=200; * submit this (instead of next) if you want to log the logic; */
%let log=*; options obs=max;
data want2(keep=id start_dt end_dt hours hours_rolling_sum hours_rolling_cnt hours_out_:);
array dt_ring(24) _temporary_;
array hr_ring(24) _temporary_;
call missing (of dt_ring(*));
call missing (of hr_ring(*));
if 0 then set have; * prep pdv column order;
hours_rolling_sum = 0;
hours_rolling_cnt = 0;
label hours_rolling_sum = 'Hours worked in prior 24 hours';
index = 0;
do until (last.id);
set have;
by id start_dt;
index + 1;
if index > 24 then index = 1;
hours_out_sum = 0;
hours_out_cnt = 0;
do clear = 1 by 1 until (clear=0);
if sum (dt_ring(index), 0) = 0 then do;
* index is first go through ring array, or hit a zeroed slot;
&log putlog 'NOTE: ' index= 'clear for empty ring item. ';
clear = 0;
end;
else
if start_dt - dt_ring(index) >= %sysfunc(dhms(0,24,0,0)) then do;
&log putlog / 'NOTE: ' index= 'reducing and zeroing.' /;
hours_out_sum + hr_ring(index);
hours_out_cnt + 1;
hours_rolling_sum = hours_rolling_sum - hr_ring(index);
hours_rolling_cnt = hours_rolling_cnt - 1;
dt_ring(index) = 0;
hr_ring(index) = 0;
* advance item to next item, that might also be more than 24 hours ago;
index = index + 1;
if index > 24 then index = 1;
end;
else do;
&log putlog / 'NOTE: ' index= 'back off !' /;
* index was advanced to an item within 24 hours, back off one;
index = index - 1;
if index < 1 then index = 24;
clear = 0;
end;
end; /* do clear */
dt_ring(index) = start_dt;
hr_ring(index) = hours;
hours_rolling_sum + hours;
hours_rolling_cnt + 1;
&log putlog 'NOTE: ' index= 'overlaying and aggregating.' / 'NOTE: ' start_dt= hours= hours_rolling_sum= hours_rolling_cnt=;
output;
end; /* do until */
format hours_rolling_sum 5.2 hours_rolling_cnt 2.;
format hours_out_sum 5.2 hours_out_cnt 2.;
run;
options obs=max;
When reviewing the results you should notice that the delta for hours_rolling_sum is +(hours in slot) - (hours_out_sum, which is the hours removed from the ring).
If you must use SQL, I would suggest following @jspascal's advice and indexing the table, but rearranging the query to left join the original data to an inner-joined subselect (so that SQL will do an index-involved hash join on the ids). For the same small number of people it should be faster than the original query, but it will still be too slow for all 6K.
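For readers who want to see the ring idea outside SAS, here is a minimal Python sketch of the same rolling aggregate. It is not a line-by-line translation of the DATA step: it swaps the fixed 24-element array for a `collections.deque`, and the names `rolling_hours` and `WINDOW_SECONDS` are made up for illustration. It assumes one person's slots arrive sorted by start time, with start times expressed in seconds.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # width of the rolling window (assumed: 24 hours)


def rolling_hours(slots):
    """Yield (start, hours, rolling_sum) for one person's slots.

    `slots` is an iterable of (start_seconds, hours) pairs sorted by start.
    """
    window = deque()  # slots still inside the window, oldest first
    total = 0.0
    for start, hours in slots:
        # Drop slots that started 24 hours or more before the current one,
        # subtracting their hours from the running total as they leave.
        while window and start - window[0][0] >= WINDOW_SECONDS:
            total -= window.popleft()[1]
        window.append((start, hours))
        total += hours
        yield start, hours, total
```

Each slot enters and leaves the window exactly once, so the work per person is linear in the number of rows, which is what makes this approach so much cheaper than the one-to-many join.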
proc sql;
create index id on have;
create index id_slot on have (id, start_dt);
quit;
proc sql _method;
reset inobs=50; * limit data so you can see the _method;
create table want as
select
have.*
, case
when ROLLING.HOURS_WORKED_24_HOUR_PRIOR > 16
then 1
else 0
end as REVIEW_TIME_CLOCKING_FLAG
from
have
left join
(
select
EACH_SLOT.id
, EACH_SLOT.start_dt
, count(*) as SLOT_COUNT_24_HOUR_PRIOR
, sum(PRIOR_SLOT.hours) as HOURS_WORKED_24_HOUR_PRIOR
from
have as EACH_SLOT
join
have as PRIOR_SLOT
on
EACH_SLOT.ID = PRIOR_SLOT.ID
and EACH_SLOT.start_dt - PRIOR_SLOT.start_dt between 0 and %sysfunc(dhms(0,24,0,0))-0.1
group by
EACH_SLOT.id, EACH_SLOT.start_dt
) as ROLLING
on
have.ID = ROLLING.ID
and have.start_dt = ROLLING.start_dt
order by
id, start_dt
;
%put NOTE: SQLOOPS = &SQLOOPS;
quit;
The inner join is pyramid-like and still involves a lot of internal looping.

Related

how to select multiple cells using offset variable but keeping what is in top 2 rows selected

I want to select multiple ranges of cells on a worksheet. I might start by selecting a2:g22, but next time I want to select a2:g2 and offset the remaining rows by 20 so they become a23:g42. The offset will be held in a variable that has 20 added to it each time the code runs.
NextRow = Range("ba2")
NextRow = NextRow + 20
Range("a2:g2,a3:g22").Offset(NextRow, 0).Select
If NextRow = 0 then a2:g2 and a3:g22 are selected. Then I add 20 to NextRow and I want a2:g2 and a23:g42 to be selected.
What I get instead is a22:g22 and a23:g42 selected.
After doing something else for a few hours I realized that Offset always works on the entire range, whatever that might be. So I tried replacing the cell references in the range with a variable, with only partial success. Would it work if I put the cell references in a variable called SelRow? That code looked like this:
SelRow = "A23:G42"
Range("A2:G2",SelRow).Select
This didn't work because my selection was A2:G42 so I changed it to this:
SelRow = "A2:G2,A23:G42"
Range(SelRow).Select
And that worked. So all I needed to do was to start with a reference number that was used to calculate 2 more variables called StRow and EndRow. The reference number can and will be anything, but for now I started with 23.
StRow=23
EndRow = StRow + 19
NowSel = "a2:G2,A" & StRow & ":G" & EndRow
Range(NowSel).Select
That worked. The result of NowSel in this case was A2:G2,A23:G42. This can of course get quite long, so my final code looks like this:
StRow = Worksheets("Cell List").Range("A1") 'Get the last row number used
StRow = StRow + 20 'Set the next start row to last row + 20
EndRow = StRow + 19 'Set the last row for selection to start row + 19
NowSel = "a2:c2,f2:g2,a" & StRow & ":c" & EndRow & ",f" & StRow & ":g" & EndRow
Range(NowSel).Select
And now the following cells are selected: A2:C2 and F2:G2 and A23:C42 and F23:G42.
Now I can create charts just by changing a reference number.
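The string-building step is the heart of this solution, so here is a small Python sketch of the same logic, useful for checking the address a given start row produces. The function name `selection_address` and its parameters are hypothetical; the column groups a:c and f:g, the header row 2, and the block of 20 rows are taken from the final VBA code above.

```python
def selection_address(start_row, block_rows=20,
                      column_groups=(("a", "c"), ("f", "g")), header_row=2):
    """Build a multi-area address: fixed header cells plus one block of rows."""
    end_row = start_row + block_rows - 1
    # The header areas never move, whatever the start row is.
    header = [f"{first}{header_row}:{last}{header_row}"
              for first, last in column_groups]
    # The data areas shift with the start row.
    block = [f"{first}{start_row}:{last}{end_row}"
             for first, last in column_groups]
    return ",".join(header + block)


print(selection_address(23))  # a2:c2,f2:g2,a23:c42,f23:g42
```

Because the header and data areas are built separately, only the data areas shift, which is exactly what a single Offset on the whole range could not do.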

t-sql: return range in which a given number falls

I am trying to figure out a good way to return a string 'name' of a range in which a given number falls. Ranges are spans of 1000, so the first range is '0000-0999', the second is '1000-1999' etc. For example, given the number 1234, I want to return the literal string '1000-1999'.
It seems to me that I could maintain a reference table with these ranges, like this
--create & populate temp table with ranges
create table #ranges (st int,en int)
go
insert into #ranges values(0,999)
insert into #ranges values(1000,1999)
insert into #ranges values(2000,2999)
go
--example query
select replace(str(st,4),' ','0') + '-' + replace(str(en,4),' ','0') as TheStringIWant
from #ranges
where 1234 between st and en
...but it seems to me that the ranges should be able to be determined from the given number itself, and that I shouldn't need the (redundant) reference table (or, for that matter, a function) just for this.
It also seems to me that I should be able to figure this out with just a bit of brain power, except that I've just had 2 beers in quick succession this evening...
Another way:
select case when value / 1000 < 1
then '0000-0999'
else cast(value / 1000 * 1000 as varchar(16)) + '-' + cast(value / 1000 * 1000 + 999 as varchar(16))
end
You can use mathematical functions to avoid using the temp table:
SELECT 1234,
RIGHT('0000' + CAST(FLOOR(1234/1000.0) * 1000 AS VARCHAR(11)),4)
+ '-'
+ RIGHT('0000' + CAST( (FLOOR(1234/1000.0) * 1000) + 999 AS VARCHAR(11)),4)
In the shell, I can use integer arithmetic to cut off the 234 and calculate the string with a simple formula; however, it would produce 0-999 rather than 0000-0999 for the first 1000.
v=1234
echo $(((v/1000)*1000))-$(((v/1000)*1000+999))
1000-1999
I don't know how to adapt it to T-SQL, or whether that is possible at all.
declare @range int;
set @range = 1000;
select replace(str(st,4),' ','0') + '-' +
replace(str(st + @range - 1,4),' ','0') as TheStringIWant
from (
select st = v / @range * @range
from (select v = 1234) s
) s
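All of the answers above rely on the same arithmetic: integer-divide by the span, multiply back to get the lower bound, add span minus one for the upper bound, then zero-pad. A minimal Python sketch of that arithmetic follows. The name `range_label` and its parameters are illustrative, and it assumes non-negative integers.

```python
def range_label(value, span=1000, width=4):
    """Return the zero-padded 'start-end' label of the span containing value."""
    start = (value // span) * span  # integer division drops the remainder
    end = start + span - 1          # last value inside the same span
    return f"{start:0{width}d}-{end:0{width}d}"


print(range_label(1234))  # 1000-1999
print(range_label(5))     # 0000-0999
```

Padding both bounds to a fixed width removes the need for a special case for values below 1000.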

SAS, Python, Excel Create Constantly Updating Function

I have a very large dataset. I've been working in SAS; however, I am open to working in Python or Excel (Excel only with detailed instructions, as I've never programmed in it). Each individual has an identification number and observations ordered by time, row by row. In some of the rows, I have a binary observation indicating a "success" or a "failure", marked by a 1 or a 0 respectively. I'd like to add three more columns (onto each row that contains a success/failure): the total number of successes (as they accumulate), the total number of failures (as they accumulate), and the ratio between the two. The ratio is trivial; however, I just don't know how to do the first two. Any help would be greatly appreciated. Thanks!
As an update: Here is an idea of my dataset:
ID    Success  Failure  totalSuccess  totalFailure  ratio
1234  -        -        -             -             -
1234  1        0        1             0             1/(1+0)
2345  -        -        -             -             -
2345  0        1        0             1             0/(0+1)
1234  0        1        1             1             1/(1+1)
PROC SORT DATA = HAVE;
BY ID;
RUN;
DATA WANT / VIEW = WANT;
SET HAVE;
BY ID;
IF FIRST.ID THEN DO;
TOTALSUCCESS = 0;
TOTALFAILURE = 0;
END;
TOTALSUCCESS + SUCCESS; /* sum statement: retains the running total and ignores missing values */
TOTALFAILURE + FAILURE;
RUN;
In SAS you can create a view so that it updates as your table updates. Regardless of which solution you use, it's important to clarify how your table is being updated.
data have;
do id=1 to 10;
numobs=ceil(rand('uniform')*5);
do i=1 to numobs;
value=rand('bernoulli', 0.3);
output;
end;
end;
drop i numobs;
run;
proc sql;
create view want as
select id, value, sum(value) as success, count(value)-sum(value) as failure, sum(value)/(count(value)) as ratio
from have
group by id;
quit;
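Since the question mentions Python as an option, here is a minimal sketch of the accumulating totals in plain Python. The function name `running_totals` and the row layout of (id, success, failure) tuples are assumptions for illustration. Rows with no observation (the "-" rows in the sample table) are modelled as `None` and left blank in the output.

```python
from collections import defaultdict


def running_totals(rows):
    """Return (id, total_success, total_failure, ratio) for each input row.

    `rows` is a time-ordered list of (id, success, failure); rows without an
    observation carry None and get None for all three computed columns.
    """
    totals = defaultdict(lambda: [0, 0])  # id -> [successes, failures] so far
    out = []
    for person_id, success, failure in rows:
        if success is None or failure is None:
            out.append((person_id, None, None, None))
            continue
        t = totals[person_id]
        t[0] += success
        t[1] += failure
        ratio = t[0] / (t[0] + t[1]) if (t[0] + t[1]) else None
        out.append((person_id, t[0], t[1], ratio))
    return out
```

Keeping the totals in a dictionary keyed by ID means the rows do not need to be sorted by ID first, so the original time order is preserved.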

LINQ Query to get grouping based on range of calculated values

I have a sql query that I use to get a distribution for measurements into multiple Bins:
SELECT FLOOR(Value / @Step) * @Step AS Bin,
COUNT(*) AS Cnt FROM Measurements WHERE (StepId = @StepId)
GROUP BY Bin ORDER BY Bin
where
Value = value of measurements returned based on StepId (primary key in measurements table)
Step = actually the number of groups(Bins in distribution).
How can I use LINQ to create a grouping based on a dynamically created range of values?
Please advise.
I think this is what you're after:
double step = 100;
var stepId = 1;
var query =
from m in context.Measurements
where m.StepId == stepId
let bin = Math.Floor(m.Value / step) * step // lower edge of the bin
group m by bin into x
orderby x.Key // order the groups, not the rows
select new { Bin = x.Key, Count = x.Count() };
You can even omit the Floor call if Value and step are both integers, because the "/" operator on integers already truncates, as it does in SQL.
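To make the binning arithmetic concrete independently of LINQ, here is a small Python sketch of the same idea: floor each value to the lower edge of its bin, then count per bin. The function name `bin_counts` and the sample values are made up for illustration.

```python
import math
from collections import Counter


def bin_counts(values, step):
    """Count values per bin, where a bin is labelled by its lower edge."""
    counts = Counter(math.floor(v / step) * step for v in values)
    return sorted(counts.items())  # ordered by bin, like ORDER BY Bin


print(bin_counts([5, 120, 150, 199.9, 250], 100))
# [(0, 1), (100, 3), (200, 1)]
```

Note that in both the SQL and this sketch, step is the width of each bin, so the number of bins follows from the range of the data.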

Find the next occurrence of a day of the week in SQL

I'm trying to update a SQL report sproc's WHERE clause to check whether a given date falls on or before the next occurrence of a class. Classes have a StartDate and occur once per week on the same day each week. Given the StartDate, how can I find the next occurrence of that day of the week?
E.G. If the StartDate is 1/18/2012, a Wednesday, and I run the report as of today, 1/26/2012, I need to find 2/1/2012 which is the next Wednesday after 1/26. If the StartDate is 1/19, a Thurs, and I run the report today, the formula should give me Thurs 1/26 which is today.
Here's sort of the idea in SQL:
SELECT *
FROM tbl_Class cs
INNER JOIN tbl_Enrollment sce ON cs.pk_ClassID = sce.fk_ClassID
WHERE ...
AND sce.StartDate < [Find date of next class after @AsOfDate using cs.StartDate]
Here's some example SQL that I came up with. 3 iterations so you can follow how I got to the end. The 3rd iteration should be something you can incorporate into a WHERE clause by substituting your column names for the variables.
Setup:
DECLARE @Startdate DATETIME,@currentdate datetime
SET @Startdate = '1-26-2012'
SET @Currentdate = '1-23-2012'
--This section just normalizes it so you can use 7 as the interval
--The offset depends on your current setting for DATEFIRST, U.S. English default is 7, Sunday.
-- see http://msdn.microsoft.com/en-us/library/ms187766.aspx
DECLARE @StartDateWorkingDayOfWeek int,@CurrentDateWorkingDayOfWeek int
SELECT @StartDateWorkingDayOfWeek =(DATEPART(weekday,@Startdate)-2)
SELECT @CurrentDateWorkingDayOfWeek=(DATEPART(weekday,@Currentdate)-2)
Iteration #1
--Iteration 1
IF @StartDateWorkingDayOfWeek < @CurrentDateWorkingDayOfWeek
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 + 7,@StartDateWorkingDayOfWeek)
else
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 + 0,@StartDateWorkingDayOfWeek)
Iteration #2
--Iteration 2
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 +
CASE WHEN @StartDateWorkingDayOfWeek < @CurrentDateWorkingDayOfWeek
then 7
ELSE 0
end
,@StartDateWorkingDayOfWeek)
Iteration #3
--iteration 3
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 +
CASE WHEN (DATEPART(weekday,@Startdate)-2) < (DATEPART(weekday,@Currentdate)-2)
then 7
ELSE 0
end
,(DATEPART(weekday,@Startdate)-2))
Hat tip to this article:
http://www.sqlmag.com/article/tsql3/datetime-calculations-part-3
Here's what I came up with thanks to TetonSig and his reference to this link: http://www.sqlmag.com/article/tsql3/datetime-calculations-part-3
We can get the date of the previous Monday exclusive of the current date (@AsOfDate) like so:
SELECT DATEADD(day, DATEDIFF(day,0, @AsOfDate-1) /7*7, 0);
This gets the number of days between 1/1/1900 and @AsOfDate. /7*7 converts that to whole weeks, and then adds it back to 1/1/1900 (a Mon) to get the Monday before @AsOfDate. The -1 makes it exclusive of @AsOfDate. Without the minus 1, if @AsOfDate were on a Monday, it would be counted as the "previous" Monday.
Next the author shows that to get the inclusive next Monday, we simply need to add 7 to the exclusive previous Monday formula:
SELECT DATEADD(d, DATEDIFF(day,0, @AsOfDate-1) /7*7, 0)+7;
Voila! We've now got the first Monday on or after @AsOfDate. The only problem is, the Monday (0) above is a moving target in my case. I need the first [DayOfWeek] determined by the class date, not the first Monday. I need to swap out a ClassDayOfWeek calculation for the 0s above:
DATEADD(d, DATEDIFF(d, [ClassDayOfWeek], @AsOfDate-1)/7*7, [ClassDayOfWeek])+7
I wanted to calculate the ClassDayOfWeek without being dependent on or having to mess with setting @@DATEFIRST. So I calculated it relative to the base date:
DATEDIFF(d, 0, StartDate)%7
This gives 0 for Mon, 6 for Sun so we can now plug that in for [ClassDayOfWeek]. I should point out that this 0-6 value is dates 1/1/1900-1/7/1900 represented as an int.
DATEADD(d, DATEDIFF(d, DATEDIFF(d, 0, StartDate)%7, @AsOfDate-1)/7*7, DATEDIFF(d, 0, StartDate)%7)+7
And in use per the question:
SELECT *
FROM tbl_Class cs
INNER JOIN tbl_Enrollment sce ON cs.pk_ClassID = sce.fk_ClassID
WHERE ...
AND sce.StartDate < DATEADD(d,
DATEDIFF(d,
DATEDIFF(d, 0, cs.StartDate)%7,
@AsOfDate-1)/7*7,
DATEDIFF(d, 0, cs.StartDate)%7)+7
I derived the answer with a simple case statement.
In your situation @targetDOW would be the day of the week of the class.
DECLARE @todayDOW INT = DATEPART(dw, GETDATE());
DECLARE @diff INT = (@targetDOW - @todayDOW);
SELECT
CASE
WHEN @diff = 0 THEN GETDATE()
WHEN @diff > 0 THEN DATEADD(d,@diff,GETDATE())
WHEN @diff < 0 THEN DATEADD(d,@diff + 7,GETDATE())
END;
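The CASE expression above amounts to taking the weekday difference modulo 7. A minimal Python sketch of that calculation, checked against the two examples in the question, is shown below. The function name `next_class_date` is hypothetical, and the sketch assumes the as-of date is not earlier than the class start date.

```python
from datetime import date, timedelta


def next_class_date(start_date, as_of):
    """First date on or after as_of that falls on start_date's weekday."""
    # weekday() is 0 for Monday through 6 for Sunday, independent of locale.
    days_ahead = (start_date.weekday() - as_of.weekday()) % 7
    return as_of + timedelta(days=days_ahead)


print(next_class_date(date(2012, 1, 18), date(2012, 1, 26)))  # 2012-02-01
print(next_class_date(date(2012, 1, 19), date(2012, 1, 26)))  # 2012-01-26
```

The modulo folds the three CASE branches into one expression: a zero difference stays zero, a positive one is kept, and a negative one wraps forward by a week.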
