Will the monitors automatically capture the change in the payload data - ibm-watson

Let's say that, due to a system error or for testing purposes, payload data that meets certain criteria should be removed. For example, a bug on the client side sent many identical scoring requests, or a certain group of users ("black sheep") sent bad data, and the data engineers removed those records from the payload data.
In this case, will OpenScale re-calculate the metrics for the monitors or not? Or will there be an option to filter the data?

Related

Azure Service Bus - Two Way Communication Performance Challenge

I need to establish two-way communication between a Publisher and a Subscriber. The front-end MVC3 application defines a Subscription with a CorrelationFilter and then places a message onto a Topic. Finally, the MVC3 controller calls BeginReceive() on the SubscriptionClient and awaits the response.
The issue seems to be the creation and deletion of these Subscription objects. The overhead is enormous, and it slows the application to a crawl. This is not to mention the various limitations to work around, such as no more than 2000 Subscriptions on a Topic.
What is the best practice for establishing this kind of two-way communication between a Publisher and Subscriber? We want the MVC3 app to publish a message and then wait for a response to that exact message (via the CorrelationId property and a CorrelationFilter). We already cache the NamespaceManager and MessagingFactory, as those are also prohibitively expensive, resource-wise, and also because we were advised that Service Bus uses an explicit provisioning model, where we are expected to pre-create most of these things during role startup.
So, this leaves us with the challenge of correlating requests to responses while bearing the tremendous overhead of creating and deleting Subscriptions. What better practice exists? Should we keep a cache of SubscriptionClients and swap the Filter each time? What does everyone else do? I need request throughput on the order of 5 to 10 thousand MVC3 requests per second through the Web Role cluster. We are already using AsyncController and the asynchronous BeginReceive() on SubscriptionClient. It appears to be the creation and deletion of Subscriptions by the thousands that is choking the system at this point.
UPDATE1:
Based on the great advice provided here, we have updated this solution to keep a cache of SubscriptionClient objects on each web role instance. Additionally, we have migrated to a MessageSession oriented approach.
However, this is still not scaling. It seems that AcceptMessageSession() is a very expensive operation. Should MessageSession objects also be cached and re-used? Does each open MessageSession object consume a connection to the Service Bus? If so, is this counted against the Subscription's concurrent connection quota?
Many thanks. I think we are getting there. Most of the example code on the web shows: CreateTopic(), then CreateSubscription(), then CreateSubscriptionClient(), then BeginReceive() on the client, then teardown of all of the objects. All I can say is that if you did this in real life, your server would be crushed and you would max out on connections in no time.
We need to put thousands of requests per second through this thing, and it is very apparent that these objects must be cached and reused heavily. So, is MessageSession yet another item to cache? Caching it will be fun, because we will have to implement a reference-counting mechanism where only one reference to a MessageSession can be handed out at a time; since this is HTTP request-specific request/response, we cannot have other subscribers using the MessageSession objects concurrently.
UPDATE2:
OK, it is not feasible to cache MessageSession objects for re-use, because they live only as long as the LockDuration on the Subscription. This is a bummer, because the maximum LockDuration is 5 minutes. These appear to be meant for short-lived pub/sub, not for long-running distributed processes. It looks like we need to go back to polling Azure Tables.
SUMMARY/COMMENTARY
We tried to build on Service Bus because of its scale potential and its durability and delivery semantics. However, there seem to be situations, high-volume request/response among them, that it is not suited to. The publishing part works great, and having competing consumers on the back end is great, but having a front-end request wait on a defined, single-consumer response does not scale well at all, because MessageSessions take far too long to create via AcceptMessageSession() or BeginAcceptMessageSession(), and because they are not suited to caching.
If someone has an alternative view, I would love to hear it.
This scenario is classic request/response and a good candidate for sessions, which are another correlation mechanism. Create a simple request queue and a response queue. Each web role thread creates a unique session id for a request and puts that value in the ReplyToSessionId property of the BrokeredMessage. The same thread also calls AcceptMessageSession() on the response queue with that session id, which locks it. The brokered message is sent to the request queue, where all worker roles compete for messages. When a worker role gets a request, it processes it, creates a response message, and sets the SessionId property on the response message to the ReplyToSessionId of the request. The response is then sent to the response queue and will only be delivered to the thread that has locked that session id. A detailed sample using sessions is here. There are 2 additional samples here using Queues and Topics to achieve the request/response correlation.
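For anyone revisiting this today, here is a minimal sketch of that session-based request/response pattern written against the current azure-servicebus Python SDK rather than the original .NET MVC3 client. The connection string, queue names, and worker loop are assumptions, not part of the original question; the response queue is assumed to be session-enabled.

```python
# Sketch of session-based request/response with the azure-servicebus SDK (v7).
# Queue names and the connection string are placeholders.
import uuid

from azure.servicebus import ServiceBusClient, ServiceBusMessage

CONN_STR = "<service-bus-connection-string>"   # assumption: provisioned elsewhere
REQUEST_QUEUE = "requests"                     # assumption: pre-created queue
RESPONSE_QUEUE = "responses"                   # assumption: session-enabled queue


def send_request_and_wait(client: ServiceBusClient, body: str, timeout: int = 30) -> str:
    """Send a request and block until the matching session response arrives."""
    session_id = str(uuid.uuid4())  # unique per request; used to correlate the reply

    # Publish the request, telling workers which session to reply to.
    with client.get_queue_sender(REQUEST_QUEUE) as sender:
        sender.send_messages(ServiceBusMessage(body, reply_to_session_id=session_id))

    # Lock the session on the response queue and wait for the single reply.
    with client.get_queue_receiver(
        RESPONSE_QUEUE, session_id=session_id, max_wait_time=timeout
    ) as receiver:
        for msg in receiver:
            receiver.complete_message(msg)
            return str(msg)
    raise TimeoutError("no response received for session " + session_id)


def handle_one_request(client: ServiceBusClient) -> None:
    """Worker side: receive a request, reply on the session named in reply_to_session_id."""
    with client.get_queue_receiver(REQUEST_QUEUE, max_wait_time=5) as receiver:
        for msg in receiver:
            reply = ServiceBusMessage(
                "processed: " + str(msg), session_id=msg.reply_to_session_id
            )
            with client.get_queue_sender(RESPONSE_QUEUE) as sender:
                sender.send_messages(reply)
            receiver.complete_message(msg)


if __name__ == "__main__":
    with ServiceBusClient.from_connection_string(CONN_STR) as sb_client:
        print(send_request_and_wait(sb_client, "ping"))
```

The key point, as in the answer above, is that the requesting thread locks the response session before publishing, so the reply can only be delivered to it.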
For a single sender and receiver you need only have a single topic/subscription. You can set several correlation filters on the same subscription and thus have all the responses that you need correlated show up at the same subscription. Currently we support up to 100,000 correlation filter instances on a single subscription and these can be added/removed (w/ transactions if needed) along with message send/receive operations on the topic/subscription.
In addition you can use Rules with Actions to stamp the message with additional properties if needed depending on which filter matched.
An alternate approach is to perform message correlation on the web role instance by using a dictionary of correlation ids and callback delegates.
Each web role (publisher) instance has a single subscription, filtered by a single subscription id, to the response topic.
Before a message is sent, the callback for that message is registered in the dictionary with the correlation id as the key. The correlation id and a subscription id are sent along with the message so that the response can be routed back to the correct subscription together with the correlation id.
Each web role instance monitors the single subscription and on receiving the response removes the dictionary entry and invokes the callback.
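A minimal sketch of that in-process correlation approach follows, assuming a hypothetical publish function standing in for the actual topic send. The dictionary is keyed by correlation id and guarded by a lock so concurrent request threads can share it.

```python
# Sketch of in-process correlation: map correlation ids to callbacks and
# complete them when the matching response arrives on this instance's
# single subscription. Transport calls are stubbed out.
import threading
import uuid
from typing import Callable, Dict

_pending: Dict[str, Callable[[str], None]] = {}
_lock = threading.Lock()


def send_request(body: str, publish: Callable[[str, str], None],
                 callback: Callable[[str], None]) -> str:
    """Register a callback keyed by a new correlation id, then publish.

    `publish(body, correlation_id)` is a placeholder for the actual topic
    send; the correlation id travels with the message so the responder
    can echo it back.
    """
    correlation_id = str(uuid.uuid4())
    with _lock:
        _pending[correlation_id] = callback
    publish(body, correlation_id)
    return correlation_id


def on_response(correlation_id: str, payload: str) -> None:
    """Called by the single subscription listener for every response."""
    with _lock:
        callback = _pending.pop(correlation_id, None)
    if callback is not None:
        callback(payload)   # hand the response to the waiting request
    # else: the response arrived after a timeout/cleanup; drop or log it
```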

Calling the same API over 2000 times every minute (JSON/PYTHON), use (HTTP sync, HTTP async, Websocket, or Other)?

I am trying to gather 1-minute data across 2000+ stocks from a financial institution's API, every minute, as soon as the data becomes available. I want to gather the data during trading hours, using Python.
Example API URL [Not a valid URL]: https://api.finance.com/v1/marketdata/[STOCK]/1minute
Conditions:
We know that the 1-minute data for all 2000+ stocks is available for retrieval once the minute has passed. For example, if the current time is 10:02:00 AM and I want the 10:01:00 AM data for GOOG, I call the URL https://api.finance.com/v1/marketdata/GOOG/1minute and see the 10:01:00 AM data.
We know the data is stored in JSON format.
There is a throttling limit. Suppose a 500-millisecond wait is required between requests.
I need the one-minute tick data (i.e. Open, Low, High, Close).
Question: How can I gather all 2000+ stocks data within 30 seconds?
Below are the solutions I came up with, although I don't know whether they are optimal for this situation, or whether my understanding of HTTP requests, asynchronous HTTP, and WebSockets is lacking in some way.
Possible Solutions?:
HTTP Request with For Loop: Currently I am using a simple for loop and time.sleep(). It is the simplest to implement, but the problem is that, because of the throttling limit, it takes at best about 16-17 minutes (2000+ requests at 500 ms each).
HTTP Asynchronous: From what I understand, I could create a separate thread for each stock and gather the one-minute data that way. But based on what I have read, I can probably have at most about 100 threads running simultaneously. Is that a correct assumption? Also, wouldn't most servers disallow that many simultaneous requests from one client machine?
WebSocket: From what I understand, I could create a single connection to the server and get the data without having to worry about the throttle limit. Ideally, I would build the application with WebSockets. Is it correct to assume this is the best method for this sort of problem? The issue I currently have with this method, however, is that their 1-minute data is only available via the API URL above; as far as I know, I cannot retrieve that data through a WebSocket connection (i.e. if I connect to their WebSocket URL wss://stream-finance.com/ws, the 1-minute data is not one of the feeds available on the other end). My questions here are: is it possible to create a WebSocket connection to the HTTPS URL? And is it possible to retrieve the 1-minute data through their WebSocket URL wss://stream-finance.com/ws if that data isn't one of the available options?
Other: Is there another method that would work better in this instance?
Best Solution?: The best solution I can see is to create one single connection to their server and then have each stock update every minute in "realtime". But I don't know how to implement that through the HTTPS URL they provide.
Your question is a bit confusing, so let's separate some concerns here.
The server
The selection of the communication protocol is entirely dependent on what the server side implements, you just develop your client accordingly.
Below are some options and techniques I've seen used:
HTTP Polling (what I like to call the hit-and-run technique): simply make an HTTP request to the endpoint of interest on an interval basis; every minute seems to be your use case. This is supported by pretty much every HTTP API out there.
HTTP Long Polling: similar to polling, but a server supporting long polling will hold onto an HTTP connection until it has appropriate data to provide or a timeout is reached. From the client's perspective, you're still continuously requesting data on an interval basis; it's just that the server postpones replying until data is available.
WebSockets (this is ideal for your case): essentially a combination of the two techniques (oversimplifying here); the server responds while holding onto the connection and continuously sends new data over it.
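For the WebSocket route, a minimal consumption sketch using the third-party websockets package might look like the following. The stream URL is the one mentioned in the question; the subscribe message format is purely an assumption, since every provider defines its own protocol, and the question notes the 1-minute bars may not be exposed on this feed at all.

```python
# Sketch of consuming a streaming feed with the `websockets` package.
# The subscribe message below is hypothetical; check the provider's docs.
import asyncio
import json

import websockets

STREAM_URL = "wss://stream-finance.com/ws"


async def stream_quotes(symbols):
    async with websockets.connect(STREAM_URL) as ws:
        # Hypothetical subscription handshake.
        await ws.send(json.dumps({"action": "subscribe", "symbols": symbols}))
        async for raw in ws:
            tick = json.loads(raw)
            print(tick)          # replace with real handling/storage


if __name__ == "__main__":
    asyncio.run(stream_quotes(["GOOG", "AAPL"]))
```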
The client
Regardless of the protocol you decide to use, your client software will have to handle network latency in some way or another.
For the sake of simplicity, suppose you have a function called get_data() that fetches data from the server using one of the techniques above.
Synchronous: Your client calls get_data(), processes the response, and repeats. You're used to this; that's how Python normally works.
Asynchronous: Your client dispatches calls to get_data() to some worker, and some other function is executed when the data comes in. This essentially allows your program to issue multiple API calls and process them in response order rather than request order (see the aiohttp sketch after the note below).
Threads: Analogous to the synchronous behavior, but you process each request/response in a separate thread, handling data as it comes in (with all the caveats of Python's GIL).
Note
As obvious as it may seem, 2000+ threads is not a good solution.
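For the asynchronous option, a minimal sketch using aiohttp could look like this. The URL template comes from the question; the concurrency cap is an arbitrary placeholder and must respect whatever rate limit the provider actually enforces.

```python
# Minimal asynchronous client using aiohttp: one event loop, many
# in-flight requests, concurrency capped by a semaphore.
import asyncio

import aiohttp

URL = "https://api.finance.com/v1/marketdata/{symbol}/1minute"
MAX_CONCURRENCY = 50   # assumption; tune against the provider's limits


async def fetch_one(session, sem, symbol):
    async with sem:
        async with session.get(URL.format(symbol=symbol)) as resp:
            resp.raise_for_status()
            return symbol, await resp.json()


async def fetch_all(symbols):
    sem = asyncio.Semaphore(MAX_CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_one(session, sem, s) for s in symbols]
        return await asyncio.gather(*tasks)


if __name__ == "__main__":
    bars = asyncio.run(fetch_all(["GOOG", "AAPL", "MSFT"]))
    for symbol, data in bars:
        print(symbol, data)
```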

Sending Millions of data points to flask API

This question might be a little too subjective, but I am looking for an optimal way to send millions of datapoints to a flask API.
My current approach is essentially as follows:
Send a list of data points as JSON objects, along with some information that pertains to all of the data points, such as the person they were collected from and the date they were collected
This updates two tables: a Use table that records the person, date, etc., and a Data table that associates data points with a given Use. This all occurs as one POST request to the Use endpoint
I'm afraid that with this approach it might timeout when sending millions of datapoints.
I'm looking for a way to combat this, some ways I have been considering are
Sending an initial POST request to create the Use, then sending the datapoints in batches, either as PATCH requests to the same endpoint or as POST requests to a new data endpoint
Sending a CSV in a POST request and then parsing the CSV on the server
I haven't been able to find any similar questions online, so I'm looking to see whether there is an industry standard or best practice for doing something like this.
Whether you receive it as JSON or CSV, it will still be a lot of data. You might want to shorten your JSON keys or change your JSON data types to consume less space.
It also depends on whether you're using the API to connect to your own website: if so, you might just chop up the data (using JS) and send several AJAX requests, preventing timeouts on slower connections. If you want others to use your API, then you might want to have a look at the last answer on this question.
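As an illustration of the "create the Use first, then send the points in batches" idea from the question, a minimal client-side sketch using requests might look like this. The endpoint paths, payload shape, and batch size are all assumptions.

```python
# Sketch of the batched-upload idea: create the Use record first, then
# POST the datapoints in fixed-size chunks. Endpoints and the batch
# size are hypothetical.
import requests

BASE_URL = "https://example.com/api"
BATCH_SIZE = 10_000


def upload(use_metadata: dict, datapoints: list) -> None:
    # 1) Create the Use record and get its id back.
    resp = requests.post(f"{BASE_URL}/use", json=use_metadata, timeout=30)
    resp.raise_for_status()
    use_id = resp.json()["id"]          # assumes the API echoes back an id

    # 2) Send the points in batches so no single request can time out.
    for start in range(0, len(datapoints), BATCH_SIZE):
        chunk = datapoints[start:start + BATCH_SIZE]
        resp = requests.post(
            f"{BASE_URL}/use/{use_id}/data", json={"points": chunk}, timeout=60
        )
        resp.raise_for_status()
```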

Ordering of streaming data with kinesis stream and firehose

I have an architecture dilemma for my current project, which involves near-real-time processing of a large amount of data. Here is a diagram of the current architecture:
Here is an explanation of my idea which led me to that picture:
When the API gateway receives a request, it is put into the stream (this is because of the nature of my application: "fire and forget"). That's how I came to that conclusion. The input data is separated into shards based on a specific request attribute, which guarantees the correct ordering.
Then I have a Lambda that takes care of validating the input and detecting anomalies, so it is an abstraction that keeps the data clean for the next layer: data enrichment. This Lambda sends the data to a Kinesis Firehose, because Firehose can back up the "raw" data (something I definitely want to have) and can also attach a transformation Lambda that does the enrichment, so saving the data to S3 comes out of the box. Everything is great until the moment I need the ordering of the received data preserved (the enricher is doing sessionization), and that ordering is lost in the Firehose, because there is no data separation there the way there is in Kinesis streams.
So the only thing I could think of is to move the sessionization into the first Lambda, which would break my abstraction, because that Lambda would then also care about data enrichment; the bigger drawback is that the backup data would contain enriched data, which also breaks the architecture. And all of this happens because of the missing sharding concept in Firehose.
So can someone think of a solution to this problem that doesn't lose the out-of-the-box features that AWS provides?
I think that sessionization and data enrichment are two different abstractions and will need to be split between the Lambdas.
A session is a time-bound, strictly ordered flow of events bounded by a purpose or task. You only have that information at the first Lambda stage (from the Kinesis stream categorization), so you should label flows with session context at the source, where sessions can be bounded.
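As a rough illustration of labelling flows with session context at the source, the first Lambda could stamp each record before forwarding it to Firehose, along these lines. The delivery stream name, field names, and record layout are assumptions; Firehose still does not guarantee ordering, the label merely lets the enricher regroup events later.

```python
# Sketch: stamp session context onto each Kinesis record in the first
# Lambda, then forward it to Firehose. Names and layout are hypothetical.
import base64
import json

import boto3

firehose = boto3.client("firehose")
DELIVERY_STREAM = "raw-events"   # placeholder name


def handler(event, context):
    for record in event["Records"]:
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))

        # Reuse the attribute that drives sharding as the session key so
        # the enrichment stage can reconstruct per-session ordering.
        payload["session_key"] = record["kinesis"]["partitionKey"]
        payload["sequence_number"] = record["kinesis"]["sequenceNumber"]

        firehose.put_record(
            DeliveryStreamName=DELIVERY_STREAM,
            Record={"Data": (json.dumps(payload) + "\n").encode("utf-8")},
        )
```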
If storing session information in a backup is a problem, it may be that the definition of a session is not well specified or subject to redefinition. If sessions are subject to future recasting, the session data already calculated can be ignored, provided enough additional data to inform the unpredictable future concepts of possible sessions has also been recorded with enough detail.
Additional enrichment providing business context (aka externally identifiable data) should process the sessions transactionally within the previously recorded boundaries.
If sessions aren't transactional at the business level, then the definition of a session is over or under specified. If that is the case, you are out of the stream processing business and into batch processing, where you will need to scale state to the number of possible simultaneous interleaved sessions and their maximum durations -- querying the entire corpus of events to bracket sessions of hopefully manageable time durations.

How to efficiently process a large data response from a REST API?

One of our clients, who will be supplying data to us, has a REST-based API. This API fetches data from the client's big-data columnar store and returns the data as a response to the requested query parameters.
We will be issuing queries as below
http://api.example.com/biodataid/xxxxx
The challenge is that the response is quite large: for a given id, it contains a JSON or XML response with at least 800-900 attributes. The client is refusing to change the service for reasons I can't cite here. In addition, due to some constraints, we only get a 4-5 hour window daily to download this data for about 25,000 to 100,000 ids.
I have read about synchronous vs. asynchronous handling of responses. What options are available for designing a data processing service that loads this data efficiently into a relational database? We use Python for data processing, MySQL as the current (more recent) data store, and HBase as the backend big data store (recent and historical data). The goal is to retrieve this data, process it, and load it into either the MySQL database or the HBase store as fast as possible.
If you have built high-throughput processing services, any pointers will be helpful. Are there any resources for creating such services, with example implementations?
PS - If this question sounds too high level please comment and I will provide additional details.
I appreciate your response.
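One common pattern under these constraints is a thread pool for the I/O-bound downloads combined with batched inserts into MySQL. The sketch below assumes hypothetical table columns, credentials, and a flatten() mapping; only the per-id URL pattern comes from the question.

```python
# Sketch: parallel downloads with a thread pool, batched executemany()
# inserts into MySQL. Table, columns, credentials, and flatten() are
# placeholders to be adapted to the real 800-900 attribute schema.
from concurrent.futures import ThreadPoolExecutor

import pymysql
import requests

API = "http://api.example.com/biodataid/{id}"
WORKERS = 20           # tune against what the API tolerates
BATCH_SIZE = 500       # rows per executemany() call


def fetch(record_id):
    resp = requests.get(API.format(id=record_id), timeout=60)
    resp.raise_for_status()
    return record_id, resp.json()


def flatten(record_id, doc):
    # Placeholder: map the attributes to the target table's columns.
    return (record_id, doc.get("attr1"), doc.get("attr2"))


def load(ids):
    conn = pymysql.connect(host="localhost", user="etl",
                           password="change-me", database="biodata")
    sql = "INSERT INTO biodata (id, attr1, attr2) VALUES (%s, %s, %s)"
    batch = []
    try:
        with conn.cursor() as cur, ThreadPoolExecutor(WORKERS) as pool:
            for record_id, doc in pool.map(fetch, ids):
                batch.append(flatten(record_id, doc))
                if len(batch) >= BATCH_SIZE:
                    cur.executemany(sql, batch)
                    conn.commit()
                    batch.clear()
            if batch:
                cur.executemany(sql, batch)
                conn.commit()
    finally:
        conn.close()
```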
