## Is there a faster way to generate the required output than using a one-to-many join in Proc SQL? - join

I require an output that shows the total number of hours worked in a rolling 24 hour window. The data is currently stored such that each row is one hourly slot (for example 7-8am on Jan 2nd) per person and how much they worked in that hour stored as "Hour". What I need to create is another field that is the sum of the most recent 24 hourly slots (inclusive) for each row. So for the 7-8am example above I would want the sum of "Hour" across the 24 rows: Jan 1st 8-9am, Jan 1st 9-10am... Jan 2nd 6-7am, Jan 2nd 7-8am.
Rinse and repeat for each hourly slot.
There are 6000 people, and we have 6 months of data, which means the table has 6000 * 183 days * 24 hours = 26.3m rows.
I am currently doing this using the code below, which runs easily on a sample of 50 people but, somewhat understandably, grinds to a halt when I try it on the full table.
Does anyone have any other ideas? All date/time variables are in datetime format.
proc sql;
create table want as
select x.*
, case when Hours_Wrkd_In_Window > 16 then 1 else 0 end as Correct
from (
select a.ID
, a.Start_DTTM
, a.End_DTTM
, sum(b.hours) as Hours_Wrkd_In_Window
from have a
left join have b
on a.ID = b.ID
and b.start_dttm > a.start_dttm - (24 * 60 * 60)
and b.start_dttm <= a.start_dttm
where datepart(a.Start_dttm) >= &report_start_date.
and datepart(a.Start_dttm) < &report_end_date.
group by ID
, a.Start_DTTM
, a.End_DTTM
) x
order by x.ID
, x.Start_DTTM
;quit;

A compound index on the columns being accessed in the joined table - id + start_dttm + hours - would be useful if there isn't one already.
Using msglevel=i will print some diagnostics about how the query is executed. It may give some additional hints.

The most performant DATA step solution most likely involves a ring-array to track the 1hr time slots and hours worked within. The ring will allow a rolling aggregate (sum and count) to be computed based on what goes into and out of the ring.
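The ring idea can be sketched outside SAS as well. The following is a minimal Python illustration, not the DATA step itself: it assumes one person's slots arrive sorted by start time as `(start_seconds, hours)` pairs, and the names `rolling_hours` and `WINDOW_SECONDS` are hypothetical. A queue plays the role of the ring: slots that started 24 hours or more before the current slot leave the window and are subtracted from the running sum.

```python
from collections import deque

WINDOW_SECONDS = 24 * 60 * 60  # 24-hour window, in seconds like a SAS datetime


def rolling_hours(slots):
    """Yield (start, hours, rolling_sum) for one person's time-ordered slots."""
    window = deque()  # (start, hours) pairs that are still inside the window
    total = 0.0
    for start, hours in slots:
        # Evict slots that started 24 hours or more before the current slot.
        while window and start - window[0][0] >= WINDOW_SECONDS:
            _, old_hours = window.popleft()
            total -= old_hours
        window.append((start, hours))
        total += hours
        yield start, hours, total


# Assumed sample data: 30 consecutive hourly slots, one hour worked in each.
slots = [(h * 3600, 1.0) for h in range(30)]
results = list(rolling_hours(slots))
print(results[0][2], results[23][2], results[29][2])
```

Each slot enters and leaves the queue exactly once, so the work grows linearly with the number of rows instead of with rows times window size.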
If you have a broad SAS license, look into the procedures in SAS/ETS (Econometrics and Time Series). Proc EXPAND might have some rolling aggregate capability.
This sample DATA step code took under 10 seconds (WORK folder on SSD) to run on simulated data for 6k people with 6 months of complete coverage of 1-hour time slots.
data have(keep=id start_dt end_dt hours);
do id = 1 to 6000;
do start_dt
= intnx('dtmonth', datetime(), -12)
to intnx('dtmonth', datetime(), -6)
by dhms(0,1,0,0)
;
end_dt = start_dt + dhms(0,1,0,0);
hours = 0.25 * floor (5 * ranuni(123)); * 0, 1/4, 1/2, 3/4 or 1 hour;
output;
end;
end;
format hours 5.2;
run;
/* %let log= ; options obs=50 linesize=200; * submit this (instead of next) if you want to log the logic; */
%let log=*; options obs=max;
data want2(keep=id start_dt end_dt hours hours_rolling_sum hours_rolling_cnt hours_out_:);
array dt_ring(24) _temporary_;
array hr_ring(24) _temporary_;
call missing (of dt_ring(*));
call missing (of hr_ring(*));
if 0 then set have; * prep pdv column order;
hours_rolling_sum = 0;
hours_rolling_cnt = 0;
label hours_rolling_sum = 'Hours worked in prior 24 hours';
index = 0;
do until (last.id);
set have;
by id start_dt;
index + 1;
if index > 24 then index = 1;
hours_out_sum = 0;
hours_out_cnt = 0;
do clear = 1 by 1 until (clear=0);
if sum (dt_ring(index), 0) = 0 then do;
* index is first go through ring array, or hit a zeroed slot;
&log putlog 'NOTE: ' index= 'clear for empty ring item. ';
clear = 0;
end;
else
if start_dt - dt_ring(index) >= %sysfunc(dhms(0,24,0,0)) then do;
&log putlog / 'NOTE: ' index= 'reducting and zeroing.' /;
hours_out_sum + hr_ring(index);
hours_out_cnt + 1;
hours_rolling_sum = hours_rolling_sum - hr_ring(index);
hours_rolling_cnt = hours_rolling_cnt - 1;
dt_ring(index) = 0;
hr_ring(index) = 0;
* advance item to next item, that might also be more than 24 hours ago;
index = index + 1;
if index > 24 then index = 1;
end;
else do;
&log putlog / 'NOTE: ' index= 'back off !' /;
* index was advanced to an item within 24 hours, back off one;
index = index - 1;
if index < 1 then index = 24;
clear = 0;
end;
end; /* do clear */
dt_ring(index) = start_dt;
hr_ring(index) = hours;
hours_rolling_sum + hours;
hours_rolling_cnt + 1;
&log putlog 'NOTE: ' index= 'overlaying and aggregating.' / 'NOTE: ' start_dt= hours= hours_rolling_sum= hours_rolling_cnt=;
output;
end; /* do until */
format hours_rolling_sum 5.2 hours_rolling_cnt 2.;
format hours_out_sum 5.2 hours_out_cnt 2.;
run;
options obs=max;
When reviewing the results you should notice that the delta for hours_rolling_sum is +(hours in slot) - (hours_out_sum, which is the hours removed from the ring).
If you must use SQL, I would suggest following @jspascal and indexing the table, but rearrange the query to left join the original data to an inner-joined subselect (so that SQL will do an index-involved hash join on the ids). For the same small sample of people it should be faster than the original query, but it will still be too slow for all 6K.
proc sql;
create index id on have (id);
create index id_slot on have (id, start_dt);
quit;
proc sql _method;
reset inobs=50; * limit data so you can see the _method;
create table want as
select
have.*
, case
when ROLLING.HOURS_WORKED_24_HOUR_PRIOR > 16
then 1
else 0
end as REVIEW_TIME_CLOCKING_FLAG
from
have
left join
(
select
EACH_SLOT.id
, EACH_SLOT.start_dt
, count(*) as SLOT_COUNT_24_HOUR_PRIOR
, sum(PRIOR_SLOT.hours) as HOURS_WORKED_24_HOUR_PRIOR
from
have as EACH_SLOT
join
have as PRIOR_SLOT
on
EACH_SLOT.ID = PRIOR_SLOT.ID
and EACH_SLOT.start_dt - PRIOR_SLOT.start_dt between 0 and %sysfunc(dhms(0,24,0,0))-0.1
group by
EACH_SLOT.id, EACH_SLOT.start_dt
) as ROLLING
on
have.ID = ROLLING.ID
and have.start_dt = ROLLING.start_dt
order by
id, start_dt
;
%put NOTE: SQLOOPS = &SQLOOPS;
quit;
The inner join is pyramid-like and still involves a lot of internal looping.
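To get a rough sense of the volume, the number of (slot, prior slot) pairs the self-join produces can be counted directly. This is a back-of-the-envelope sketch in Python that assumes complete hourly coverage for every person, using the figures from the question (6000 people, 183 days); the function name `joined_pairs` is hypothetical.

```python
def joined_pairs(slots_per_person, window=24):
    """Count the (slot, prior slot) pairs the self-join yields for one person."""
    # The i-th slot matches itself plus up to window - 1 earlier slots.
    return sum(min(i, window) for i in range(1, slots_per_person + 1))


people = 6000
slots = 183 * 24  # hourly slots per person over 183 days
total_pairs = people * joined_pairs(slots)
print(f"{total_pairs:,}")  # 630,792,000
```

Under these assumptions the join materializes roughly 24 intermediate rows for every input row before the GROUP BY collapses them again.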

## Related

### how to select multiple cells using offset variable but keeping what is in top 2 rows selected

I want to select multiple cells on a worksheet. I might start by selecting a2:g22, but next time I want to select a2:g2 and offset the remaining rows by 20 so that they become a23:g42. The offset is held in a variable that has 20 added to it each time the code runs.
NextRow = Range("ba2")
NextRow = NextRow + 20
Range("a2:g2,a3:g22").Offset(NextRow, 0).Select
If NextRow = 0, then a2:g2 and a3:g22 are selected. Then I add 20 to NextRow and I want a2:g2 and a23:g42 to be selected. What I get instead is a22:g22 and a23:g42.

After doing something else for a few hours I realized that Offset always works on the entire range, whatever that might be. So I tried replacing the cell references in the range with a variable, with only partial success. I wondered whether it would work if I put the cell references in a variable called SelRow. That code looked like this:
SelRow = "A23:G42"
Range("A2:G2",SelRow).Select
This didn't work because my selection was A2:G42, so I changed it to this:
SelRow = "A2:G2,A23:G42"
Range(SelRow).Select
And that worked. So all I needed to do was to start with a reference number that was used to calculate two more variables called StRow and EndRow. The reference number can and will be anything, but for now I started with 23.
StRow = 23
EndRow = StRow + 19
NowSel = "a2:G2,A" & StRow & ":G" & EndRow
Range(NowSel).Select
That worked. The result of NowSel in this case was A2:G2,A23:G42. This can of course get quite long, so my final code looks like this:
StRow = Worksheets("Cell List").Range("A1") 'Get the last row number used
StRow = StRow + 20 'Set the next start row to last row + 20
EndRow = StRow + 19 'Set the last row for selection to start row + 19
NowSel = "a2:c2,f2:g2,a" & StRow & ":c" & EndRow & ",f" & StRow & ":g" & EndRow
Range(NowSel).Select
And now the following cells are selected: A2:C2 and F2:G2 and A23:C42 and F23:G42. Now I can create charts just by changing a reference number.
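The address-building step is plain string concatenation, so it can be checked outside Excel. A minimal Python sketch of the same arithmetic follows; the function name `selection_address` and the `block_rows` parameter are hypothetical, and the column letters are the ones from the final code above.

```python
def selection_address(start_row, block_rows=20):
    """Build the multi-area address: fixed header cells plus one block of rows."""
    end_row = start_row + block_rows - 1
    return f"A2:C2,F2:G2,A{start_row}:C{end_row},F{start_row}:G{end_row}"


print(selection_address(23))  # A2:C2,F2:G2,A23:C42,F23:G42
```

Keeping the header areas as literal text and computing only the row numbers avoids the problem with Offset, which shifts every area in the range.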

### t-sql: return range in which a given number falls

I am trying to figure out a good way to return a string 'name' of a range in which a given number falls. Ranges are spans of 1000, so the first range is '0000-0999', the second is '1000-1999', etc. For example, given the number 1234, I want to return the literal string '1000-1999'. It seems to me that I could maintain a reference table with these ranges, like this:
--create & populate temp table with ranges
create table #ranges (st int,en int)
go
insert into #ranges values(0,999)
insert into #ranges values(1000,1999)
insert into #ranges values(2000,2999)
go
--example query
select replace(str(st,4),' ','0') + '-' + replace(str(en,4),' ','0') as TheStringIWant
from #ranges
where 1234 between st and en
...but it seems to me that the ranges should be able to be determined from the given number itself, and that I shouldn't need the (redundant) reference table (or, for that matter, a function) just for this. It also seems to me that I should be able to figure this out with just a bit of brain power, except that I've just had 2 beers in quick succession this evening...

Another way:
select case
when value / 1000 < 1 then '0000-0999'
else cast(value / 1000 * 1000 as varchar(16)) + '-' + cast(value / 1000 * 1000 + 999 as varchar(16))
end

You can use mathematical functions to avoid using the temp table:
SELECT 1234,
RIGHT('0000' + CAST(FLOOR(1234/1000.0) * 1000 AS VARCHAR(11)),4)
+ '-'
+ RIGHT('0000' + CAST( (FLOOR(1234/1000.0) * 1000) + 999 AS VARCHAR(11)),4)

In the shell, I can use integer arithmetic to cut off the 234 and calculate the string with a simple formula; however, it wouldn't produce 0000-0999 but 0-999 for the first 1000.
v=1234
echo $(((v/1000)*1000))-$(((v/1000)*1000+999))
1000-1999
I don't know how to adapt it to T-SQL, or whether that is possible at all.

declare @range int;
set @range = 1000;
select replace(str(st,4),' ','0') + '-' + replace(str(st + @range - 1,4),' ','0') as TheStringIWant
from (
select st = v / @range * @range
from (select v = 1234) s
) s
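All of the answers above rest on the same integer arithmetic: divide by the span, multiply back, and zero-pad. A minimal Python sketch of that calculation, assuming non-negative integers; the function name `range_label` and its parameters are hypothetical.

```python
def range_label(value, span=1000, width=4):
    """Return the zero-padded 'low-high' label of the span containing value."""
    low = (value // span) * span  # integer division drops the remainder
    high = low + span - 1
    return f"{low:0{width}d}-{high:0{width}d}"


print(range_label(1234))  # 1000-1999
print(range_label(5))     # 0000-0999
```

Because the padding is applied to both ends, the first span comes out as 0000-0999 without a special case.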

### SAS, Python, Excel Create Constantly Updating Function

I have a very large dataset. I've been working out of SAS; however, I am open to working out of Python and Excel (only Excel with good details, as I've never programmed there). There is an identification number for each individual, who has ordered (by time) observations row by row. In some of the rows, I have a binary observation indicating a "success" or a "failure", marked by a 1 or a 0 respectively. I'd like to add three more columns (onto each row that contains a success/failure) holding the total number of successes (as they accumulate) and the total number of failures (as they accumulate), along with the ratio between the two. The ratio is trivial; however, I just don't know how to do the first two. Any help would be greatly appreciated. Thanks!

As an update, here is an idea of my dataset:

| ID | Success | Failure | totalSuccess | totalFailure | ratio |
| --- | --- | --- | --- | --- | --- |
| 1234 | - | - | - | - | - |
| 1234 | 1 | 0 | 1 | 0 | 1/(1+0) |
| 2345 | - | - | - | - | - |
| 2345 | 0 | 1 | 0 | 1 | 0/(1+0) |
| 1234 | 0 | 1 | 1 | 1 | 1/(1+1) |

PROC SORT DATA = HAVE;
BY ID;
RUN;
DATA WANT / VIEW = WANT;
SET HAVE;
BY ID;
IF FIRST.ID THEN DO;
TOTALSUCCES = 0;
TOTALFAILURE = 0;
END;
TOTALSUCCES + SUCCESS;
TOTALFAILURE + FAILURE;
RUN;

In SAS you can create a view so that it updates as your table updates. Regardless of which solution you use, it's important to clarify how your table is being updated.
data have;
do id=1 to 10;
numobs=ceil(rand('uniform')*5);
do i=1 to numobs;
value=rand('bernoulli', 0.3);
output;
end;
end;
drop i numobs;
run;
proc sql;
create view want as
select id, value, sum(value) as success, count(value)-sum(value) as failure, sum(value)/(count(value)) as ratio
from have
group by id;
quit;
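Since the question is also open to Python, the accumulating totals can be sketched as a single pass that keeps one pair of counters per ID. This is a minimal illustration rather than a translation of either SAS answer: the function name `running_totals` is hypothetical, rows are assumed to arrive in time order, and rows without an observation are represented with `None`.

```python
from collections import defaultdict


def running_totals(rows):
    """Return (id, total_success, total_failure, ratio) for each input row."""
    totals = defaultdict(lambda: [0, 0])  # id -> [successes, failures] so far
    out = []
    for id_, success, failure in rows:
        if success is None:  # row carries no success/failure observation
            out.append((id_, None, None, None))
            continue
        counts = totals[id_]
        counts[0] += success
        counts[1] += failure
        seen = counts[0] + counts[1]
        ratio = counts[0] / seen if seen else None
        out.append((id_, counts[0], counts[1], ratio))
    return out


# The sample rows from the question.
rows = [(1234, None, None), (1234, 1, 0), (2345, None, None),
        (2345, 0, 1), (1234, 0, 1)]
for row in running_totals(rows):
    print(row)
```

Because the counters are keyed by ID, the rows do not need to be sorted by ID first, only kept in time order within each ID.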

### LINQ Query to get grouping based on range of calculated values

I have a SQL query that I use to get a distribution of measurements into multiple bins:
SELECT FLOOR(Value / @Step) * @Step AS Bin, COUNT(*) AS Cnt
FROM Measurements
WHERE (StepId = @StepId)
GROUP BY Bin
ORDER BY Bin
where Value is the value of the measurements returned based on StepId (primary key in the measurements table) and Step is actually the number of groups (bins in the distribution). How can I use LINQ to create a grouping based on a dynamically created range of values? Please advise.

I think this is what you're after:
double step = 100;
var stepId = 1;
from m in context.Measurements
where m.StepId == stepId
let bin = Math.Floor(m.Value / step) * step
orderby bin
group m by bin into x
select new { x.Key, Count = x.Count() }
If Value and step are both integers, you can even omit the Floor call, because the "/" operator on integers already truncates in SQL.
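The binning rule itself, floor(value / step) * step followed by a count per bin, is easy to check on a small list. A minimal Python sketch follows; the function name `histogram` and the sample values are assumptions for illustration only.

```python
import math
from collections import Counter


def histogram(values, step):
    """Count values per bin, where a bin is labeled by its lower edge."""
    counts = Counter(math.floor(v / step) * step for v in values)
    return sorted(counts.items())  # ordered by bin, like ORDER BY Bin


print(histogram([5, 99, 100, 250, 299.5], 100))  # [(0, 2), (100, 1), (200, 2)]
```

The lower edge is inclusive and the upper edge exclusive, so 100 falls in the bin labeled 100 rather than the bin labeled 0.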

### Find the next occurrence of a day of the week in SQL

I'm trying to update a SQL report sproc's WHERE clause to check whether a given date falls on or before the next occurrence of a class. Classes have a StartDate and occur once per week on the same day each week. Given the StartDate, how can I find the next occurrence of that day of the week? E.g., if the StartDate is 1/18/2012, a Wednesday, and I run the report as of today, 1/26/2012, I need to find 2/1/2012, which is the next Wednesday after 1/26. If the StartDate is 1/19, a Thursday, and I run the report today, the formula should give me Thursday 1/26, which is today. Here's sort of the idea in SQL:
SELECT *
FROM tbl_Class cs
INNER JOIN tbl_Enrollment sce ON cs.pk_ClassID = sce.fk_ClassID
WHERE ...
AND sce.StartDate < [Find date of next class after @AsOfDate using cs.StartDate]

Here's some example SQL that I came up with. There are 3 iterations so you can follow how I got to the end. The 3rd iteration should be something you can incorporate into a WHERE clause by substituting your column names for the variables.
Setup:
DECLARE @Startdate DATETIME, @currentdate datetime
SET @Startdate = '1-26-2012'
SET @Currentdate = '1-23-2012'
--This section just normalizes it so you can use 7 as the interval
--The offset depends on your current setting for DATEFIRST, U.S. English default is 7, Sunday.
-- see http://msdn.microsoft.com/en-us/library/ms187766.aspx
DECLARE @StartDateWorkingDayOfWeek int, @CurrentDateWorkingDayOfWeek int
SELECT @StartDateWorkingDayOfWeek = (DATEPART(weekday,@Startdate)-2)
SELECT @CurrentDateWorkingDayOfWeek = (DATEPART(weekday,@Currentdate)-2)
Iteration #1
--Iteration 1
IF @StartDateWorkingDayOfWeek < @CurrentDateWorkingDayOfWeek
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 + 7,@StartDateWorkingDayOfWeek)
else
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 + 0,@StartDateWorkingDayOfWeek)
Iteration #2
--Iteration 2
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 + CASE WHEN @StartDateWorkingDayOfWeek < @CurrentDateWorkingDayOfWeek then 7 ELSE 0 end, @StartDateWorkingDayOfWeek)
Iteration #3
--iteration 3
SELECT DATEADD(DAY,DATEDIFF(DAY,0,@Currentdate)/7*7 + CASE WHEN (DATEPART(weekday,@Startdate)-2) < (DATEPART(weekday,@Currentdate)-2) then 7 ELSE 0 end, (DATEPART(weekday,@Startdate)-2))
Hat tip to this article: http://www.sqlmag.com/article/tsql3/datetime-calculations-part-3

Here's what I came up with thanks to TetonSig and his reference to this link: http://www.sqlmag.com/article/tsql3/datetime-calculations-part-3
We can get the date of the previous Monday exclusive of the current date (@AsOfDate) like so:
SELECT DATEADD(day, DATEDIFF(day,0, @AsOfDate-1) /7*7, 0);
This gets the number of days between 1/1/1900 and @AsOfDate. /7*7 converts that to whole weeks, and then adds it back to 1/1/1900 (a Monday) to get the Monday before @AsOfDate. The -1 makes it exclusive of @AsOfDate. Without the minus 1, if @AsOfDate were on a Monday, it would be counted as the "previous" Monday.
Next the author shows that to get the inclusive next Monday, we simply need to add 7 to the exclusive previous Monday formula:
SELECT DATEADD(d, DATEDIFF(day,0, @AsOfDate-1) /7*7, 0)+7;
Voila! We've now got the first Monday on or after @AsOfDate. The only problem is that the Monday (0) above is a moving target in my case. I need the first [DayOfWeek] determined by the class date, not the first Monday. I need to swap out a ClassDayOfWeek calculation for the 0s above:
DATEADD(d, DATEDIFF(d, [ClassDayOfWeek], @AsOfDate-1)/7*7, [ClassDayOfWeek])+7
I wanted to calculate the ClassDayOfWeek without being dependent on or having to mess with setting @@DATEFIRST. So I calculated it relative to the base date:
DATEDIFF(d, 0, StartDate)%7
This gives 0 for Monday and 6 for Sunday, so we can now plug that in for [ClassDayOfWeek]. I should point out that this 0-6 value is the dates 1/1/1900-1/7/1900 represented as an int.
DATEADD(d, DATEDIFF(d, DATEDIFF(d, 0, StartDate)%7, @AsOfDate-1)/7*7, DATEDIFF(d, 0, StartDate)%7)+7
And in use per the question:
SELECT *
FROM tbl_Class cs
INNER JOIN tbl_Enrollment sce ON cs.pk_ClassID = sce.fk_ClassID
WHERE ...
AND sce.StartDate < DATEADD(d, DATEDIFF(d, DATEDIFF(d, 0, cs.StartDate)%7, @AsOfDate-1)/7*7, DATEDIFF(d, 0, cs.StartDate)%7)+7

I derived the answer with a simple case statement. In your situation @targetDOW would be the day of the week of the class.
DECLARE @todayDOW INT = DATEPART(dw, GETDATE());
DECLARE @diff INT = (@targetDOW - @todayDOW);
SELECT CASE
WHEN @diff = 0 THEN GETDATE()
WHEN @diff > 0 THEN DATEADD(d,@diff,GETDATE())
WHEN @diff < 0 THEN DATEADD(d,@diff + 7,GETDATE())
END;
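The three CASE branches collapse into one modulo expression: the number of days to move forward is the weekday difference taken modulo 7. A minimal Python sketch of that rule, checked against the two examples in the question; the function name `next_class_date` is hypothetical, and this illustrates the arithmetic rather than the T-SQL itself.

```python
from datetime import date, timedelta


def next_class_date(start_date, as_of):
    """Return the first date on or after as_of that shares start_date's weekday."""
    # weekday() is 0 for Monday through 6 for Sunday; % 7 keeps the gap in 0..6.
    days_ahead = (start_date.weekday() - as_of.weekday()) % 7
    return as_of + timedelta(days=days_ahead)


print(next_class_date(date(2012, 1, 18), date(2012, 1, 26)))  # 2012-02-01
print(next_class_date(date(2012, 1, 19), date(2012, 1, 26)))  # 2012-01-26
```

A gap of 0 means the class meets on the as-of date itself, which matches the "on or after" behavior asked for in the question.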