Cross JOIN collections and GroupBy CosmosDB Javascript API - join

I am searching for a solution in the Javascript API for CosmosDB, where you can perform an INNER/OUTER JOIN between two document collections.
I have been unsuccessful.
From my understanding, Javascript Stored Procedures run within a collection, and cannot access/reference data in another collection.
If the above is true, where does this leave our application's data source, which has been designed in a relational way? If the business requires an immediate query to collect the following data:
All agreements/contracts that have been migrated to a new product offering, within a specific region, for a given time frame. How would I go about this query if there are about 5 collections containing all the information related to it?
Any guidance?
UPDATE
Customer
{
  "id": "d02e6668-ce24-455d-b241-32835bb2dcb5",
  "Name": "Test User One",
  "Surname": "Test"
}
Agreement
{
  "id": "ee1094bd-16f4-45ec-9f5e-7ecd91d4e729",
  "CustomerId": "d02e6668-ce24-455d-b241-32835bb2dcb5",
  "RetailProductVersionInstance": [
    {
      "id": "8ce31e7c-7b1a-4221-89a3-449ae4fd6622",
      "RetailProductVersionId": "ce7a44a4-7e49-434b-8a51-840599fbbfbb",
      "AgreementInstanceUser": {
        "FirstName": "Luke",
        "LastName": "Pothier",
        "AgreementUserTypeId": ""
      },
      "AgreementInstanceMSISDN": {
        "IsoCountryDialingCode": null,
        "PhoneNumber": "0839263922",
        "NetworkOperatorId": "30303728-9983-47f9-a494-1de853d66254"
      },
      "RetailProductVersionInstanceState": "IN USE",
      "IsPrimaryRetailProduct": true,
      "RetailProductVersionInstancePhysicalItems": [
        {
          "id": "f8090aba-f06b-4233-9f9e-eb2567a20afe",
          "PhysicalItemId": "75f64ab3-81d2-f600-6acb-d37da216846f",
          "RetailProductVersionInstancePhysicalItemNumbers": [
            {
              "id": "9905058b-8369-4a64-b9a5-e17e28750fba",
              "PhysicalItemNumberTypeId": "39226b5a-429b-4634-bbce-2213974e5bab",
              "PhysicalItemNumberValue": "KJDS959405"
            },
            {
              "id": "1fe09dd2-fb8a-49b3-99e6-8c51df10adb1",
              "PhysicalItemNumberTypeId": "960a1750-64be-4333-9a7f-c8da419d670a",
              "PhysicalItemNumberValue": "DJDJ94943"
            }
          ],
          "RetailProductVersionInstancePhysicalItemState": "IN USE",
          "DateCreatedUtc": "2018-11-21T13:55:00Z",
          "DateUpdatedUtc": "2020-11-21T13:55:00Z"
        }
      ]
    }
  ]
}
RetailProduct
{
  "id": "ce7a44a4-7e49-434b-8a51-840599fbbfbb",
  "FriendlyName": "Data-Package 100GB",
  "WholeSaleProductId": "d054dae5-173d-478b-bb0e-7516e6a24476"
}
WholeSaleProduct
{
  "id": "d054dae5-173d-478b-bb0e-7516e6a24476",
  "ProductName": "Data 100",
  "ProviderLiabilities": []
}
Above, I have added some sample documents.
Relationships:
Agreement.CustomerId links to Customer.id
Agreement.RetailProductVersionInstance.RetailProductVersionId links to RetailProduct.id
RetailProduct.WholeSaleProductId links to WholeSaleProduct.id
How would I write a JavaScript stored procedure in Cosmos DB to perform joins between these 4 collections?

Short answer is that you cannot perform joins between different collections via SQL in Cosmos DB.
Generally, the solution to this type of question is multiple queries or different schema. In your scenario, if you can denormalize your schema into one collection without duplicating data, then it is easy.
If you provide your schemas, it'd be possible to provide a more comprehensive answer.
-- Edit 1 --
Stored procedures are only good candidates for work that requires multiple operations against the same collection and partition key. This makes them good for bulk insert/delete/update, transactions (which need at least a read and a write), and a few other things. They aren't good for CPU-intensive work, but rather for work that would normally be IO-bound by network latency. They can't be used for cross-partition or cross-collection scenarios; in those cases, you must perform the operations exclusively from the remote client.
In your case, it's a fairly straightforward 2 + 2N separate reads, where N is the number of products. You need to read the agreement first. Then you can look up the customer and the product records in parallel, and finally the wholesale records, so you should have a latency of 3s + C, where s is the average duration of a single read request and C is some constant CPU time to perform the join, issue the requests, etc.
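As a rough sketch of that read pattern from the client side (not from a stored procedure), using the Python SDK (azure-cosmos): the container names and the assumption that every container is partitioned by its id come from the sample documents above, and the endpoint, key and database name are placeholders, not part of the question.

from concurrent.futures import ThreadPoolExecutor
from azure.cosmos import CosmosClient

# Placeholders: endpoint, key and database name are not from the question.
client = CosmosClient("https://<account>.documents.azure.com:443/", credential="<key>")
db = client.get_database_client("<database>")
agreements = db.get_container_client("Agreement")
customers = db.get_container_client("Customer")
retail_products = db.get_container_client("RetailProduct")
wholesale_products = db.get_container_client("WholeSaleProduct")

agreement_id = "ee1094bd-16f4-45ec-9f5e-7ecd91d4e729"

# Round trip 1: read the agreement (assumes each container is partitioned by /id).
agreement = agreements.read_item(item=agreement_id, partition_key=agreement_id)

with ThreadPoolExecutor() as pool:
    # Round trip 2: the customer and product reads are independent, so issue them in parallel.
    customer_future = pool.submit(
        customers.read_item,
        item=agreement["CustomerId"], partition_key=agreement["CustomerId"])
    product_futures = [
        pool.submit(retail_products.read_item,
                    item=inst["RetailProductVersionId"],
                    partition_key=inst["RetailProductVersionId"])
        for inst in agreement["RetailProductVersionInstance"]]
    customer = customer_future.result()
    products = [f.result() for f in product_futures]

    # Round trip 3: one wholesale read per product, also in parallel.
    wholesale_futures = [
        pool.submit(wholesale_products.read_item,
                    item=p["WholeSaleProductId"], partition_key=p["WholeSaleProductId"])
        for p in products]
    wholesale = [f.result() for f in wholesale_futures]

# The "join" itself is then plain application code over agreement, customer, products and wholesale.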
It's worth considering whether you can consolidate RetailProduct and WholeSaleProduct into a single record, where the wholesale document contains all of its retail products in an array, or into separate documents partitioned by the wholesale id, with a well-known id that holds the wholesale product info in its own document. That would reduce your latency by a third. If you go with the partitioning-by-wholesale-id idea, you could write one query for any records that share a wholesale id, so you'd get 2 + log(N) reads but the same effective latency. For that strategy, you'd store a composite index of "wholesaleid+productid" in the agreement. One issue to worry about is that this duplicates the wholesale+product relationship, but as long as that relationship doesn't change, I don't think there is anything to worry about, and it provides a good optimization for lookups.

Related

Secondary indexes for Dynamodb flexibility

Coming from a SQL background, I'm trying to understand NoSQL, particularly DynamoDB options. Given this schema:
{
  "publist": [
    {
      "Author": "John Scalzi",
      "Title": "Old Man's War",
      "Publisher": "Tor Books",
      "Tags": [
        "DeepSpace",
        "SciFi"
      ]
    },
    {
      "Author": "Ursula Le Guin",
      "Title": "Wizard of Earthsea",
      "Publisher": "Mifflin Harcourt",
      "Tags": [
        "MustRead",
        "Fantasy"
      ]
    },
    {
      "Author": "Cory Doctorow",
      "Title": "Little Brother",
      "Publisher": "Doherty"
    }
  ]
}
I could have the main table use Author/Title as the hash/range keys. A global secondary index could be Publisher/Title. What are the best practices here? How can I get a list of all authors for a publisher without a total table scan? I can't have a secondary index because Publisher/Author is not unique! Also, what are my options if I want all the titles that have a tag of DeepSpace?
EDIT: See the RPM & Vikdor answers below. A GSI need not be unique, so Publisher/Author is possible. But the question remains: is there any workaround for getting all authors by tag, without a full table scan?
Can't have a secondary index because Publisher/Author is not unique!
Sure you can; just make sure your Publisher/Title index has Author as a projection. You can then query by publisher and iterate over the results to collect the authors (see the sketch below).
When you set up your indexes, you can choose which attributes are projected into the index. Having a Publisher or Publisher/Title key doesn't mean you can only view the Publisher or the Publisher and Title; it means you can only query by Publisher or Title. So if you project all attributes, or at least the Author attribute, into your index, you can get a list of authors by publisher using a query rather than a full table scan.
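A minimal boto3 sketch of that pattern; the table and index names are assumptions for illustration, and the index is assumed to project the Author attribute:

from boto3.dynamodb.conditions import Key
import boto3

table = boto3.resource("dynamodb").Table("publist")  # hypothetical table name

# Query the Publisher/Title GSI (assumed name) and collect the projected authors.
authors = set()
response = table.query(
    IndexName="PublisherTitleIndex",  # hypothetical index name
    KeyConditionExpression=Key("Publisher").eq("Tor Books"),
)
for item in response["Items"]:
    authors.add(item["Author"])
# For large result sets, keep paging with response["LastEvaluatedKey"].
print(authors)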
Can't have a secondary index because Publisher/Author is not unique!
The (hash primary key, range primary key) tuple need not be unique when defining a Global Secondary Index. This is only a requirement for the table-level key definition, i.e. the table cannot have multiple rows with the same (hash primary key, range primary key) tuple.
How can I get a list of all Authors for a publisher without a total table scan
You define a GSI on Publisher (Hash PK), Author (Range PK) and use DynamoDB query on the GSI with the Publisher attribute set as the Hash Key Value.
Unlike SQL, where you can create non-clustered indexes on arbitrary columns based on the retrieval patterns, DynamoDB limits the number of Local Secondary Indexes and Global Secondary Indexes per table. It is therefore important to list the use cases for retrieving data before deciding on the Hash Primary Key and Range Primary Key for a table, and to leverage Local Secondary Indexes as much as possible, as they use the table's read and write capacity and are strongly consistent (you can also choose to run eventually consistent queries on LSIs to save capacity). GSIs need their own read and write capacity and are eventually consistent.
Unfortunately, this is not currently supported in DynamoDB. DDB does not provide the capability to query on nested documents the way MongoDB does.
In this situation, consider modelling the data differently and putting the nested documents in a separate table.
Hope this helps.
Cheers,

How to query for negated results in CouchDB (Python)

Does CouchDB have a way to return back the documents that do not meet a certain filter criteria? I am using Python' API and provided an example below:
couch['test_db'].view('doc/entrybyname', key=value, include_docs=True)
Say I wanted all the documents that didn't match the key value...does CouchDB offer a way to do this?
Right now I am getting all documents and then filtering them as needed, which is very inefficient, especially as the database grows in size.
Thanks for your help in advance.
Brian
There is no way to return data that is not in an index; only data that is in the index can be returned. The Mango/Query mechanism does allow you to perform queries such as this:
{
  "selector": {
    "country_code": { "$ne": "UK" }
  }
}
which reads as "find me all of the documents where country_code is not equal to 'UK'", but the query would not be powered by an index - it would require a scan of all the documents - so it would not be performant for large data volumes.
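For reference, a minimal way to run that Mango query from Python is to POST it to the database's _find endpoint; the URL and credentials below are placeholders:

import requests

# Sketch only: URL, database name and credentials are placeholders.
resp = requests.post(
    "http://localhost:5984/test_db/_find",
    json={"selector": {"country_code": {"$ne": "UK"}}},
    auth=("admin", "password"),
)
for doc in resp.json()["docs"]:
    print(doc["_id"])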
Depending on your use-case, you can create a custom Map/Reduce index that only includes the documents you are interested in e.g.
function(doc) {
  if (doc.country_code != 'UK') {
    emit(doc.country_code, null);
  }
}
which creates an index of all the documents which are not in the UK, keyed on the country code.
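Assuming that map function is saved in a design document (the design document and view names below are placeholders), it can then be queried with the same Python API used in the question:

# Hypothetical design doc/view name; include_docs pulls the full documents back.
for row in couch['test_db'].view('notuk/by_country', include_docs=True):
    print(row.key, row.doc)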

Are Cassandra user defined data types recommended in view of performance?

I have a Cassandra Customers table which is going to keep a list of customers. Every customer has an address, which is a set of standard fields:
{
  CustomerName: "",
  etc...,
  Address: {
    street: "",
    city: "",
    province: "",
    etc...
  }
}
My question is: if I have a million customers in this table and I use a user-defined type Address to keep the address information for each customer in the Customers table, what are the implications of such a model, especially in terms of disk space? Is this going to be very expensive? Should I use the Address user-defined type, flatten the address information, or even use a separate table?
Basically, what happens in this case is that Cassandra will serialize instances of the address type into a blob, which is stored as a single column as part of your customer table. I don't have any numbers at hand on how much overhead the serialization adds in terms of disk or CPU usage, but it probably will not make a big difference for your use case. You should test both cases to be sure.
Edit: Another aspect I should have mentioned: handling UDTs as single blobs means that the complete UDT has to be replaced for any update. This is less efficient than updating individual columns and is a potential cause of inconsistencies: in the case of concurrent updates, both writes could overwrite each other's changes. See CASSANDRA-7423.
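To make the two options concrete, here is a rough sketch using the Python driver (cassandra-driver); the keyspace, table and column names are placeholders. Option A stores the address as a frozen UDT that is rewritten as a whole on update; option B flattens it into columns that can be updated individually:

from cassandra.cluster import Cluster

# Placeholder contact point and keyspace.
session = Cluster(["127.0.0.1"]).connect("my_keyspace")

# Option A: user-defined type, stored (and replaced) as a single frozen value.
session.execute("""
    CREATE TYPE IF NOT EXISTS address (
        street text, city text, province text, postal_code text)
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS customers_udt (
        id uuid PRIMARY KEY, customer_name text, address frozen<address>)
""")

# Option B: flattened columns; each address field can be updated on its own.
session.execute("""
    CREATE TABLE IF NOT EXISTS customers_flat (
        id uuid PRIMARY KEY, customer_name text,
        street text, city text, province text, postal_code text)
""")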

Is this an appropriate use-case for Amazon DynamoDB / NoSQL?

I'm working on a web application that uses a bunch of Amazon Web Services. I'd like to use DynamoDB for a particular part of the application but I'm not sure if it's an appropriate use-case.
When a registered user on the site performs a "job", an entry is recorded and stored for that job. The job has a bunch of details associated with it, but the most relevant thing is that each job has a unique identifier and an associated username. Usernames are unique too, but there can of course be multiple job entries for the same user, each with different job identifiers.
The only query that I need to perform on this data is: give me all the job entries (and their associated details) for username X.
I started to create a DynamoDB table but I'm not sure if it's right. My understanding is that the chosen hash key should be the key that's used for querying/indexing into the table, but it should be unique per item/row. Username is what I want to query by, but username will not be unique per item/row.
If I make the job identifier the primary hash key and the username a secondary index, will that work? Can I have duplicate values for a secondary index? But that means I will never use the primary hash key for querying/indexing into the table, which is the whole point of it, isn't it?
Is there something I'm missing, or is this just not a good fit for NoSQL?
Edit:
The accepted answer helped me find out what I was looking for as well as this question.
I'm not totally clear on what you're asking, but I'll give it a shot...
With DynamoDB, the combination of your hash key and range key must uniquely identify an item. Range key is optional; without it, hash key alone must uniquely identify an item.
You can also store a list of values (rather than just a single value) as an item's attributes. If, for example, each item represented a user, an attribute on that item could be a list of that user's job entries.
If you're concerned about hitting the size limitation of DynamoDB records, you can use S3 as backing storage for that list - essentially use the DDB item to store a reference to the S3 resource containing the complete list for a given user. This gives you flexibility to query for or store other attributes rather easily. Alternatively (as you suggested in your answer), you could put the entire user's record in S3, but you'd lose some of the flexibility and throughput of doing your querying/updating through DDB.
Perhaps a "Jobs" table would work better for you than a "User" table. Here's what I mean.
If you're worried about all of those jobs inside a user document adding up to more than the 400kb limit, why not store the jobs individually in a table like:
my_jobs_table:
[
  {
    Username: toby,
    JobId: 1234,
    Status: Active,
    CreationDate: 2014-10-05,
    FileRef: some-reference1
  },
  {
    Username: toby,
    JobId: 5678,
    Status: Closed,
    CreationDate: 2014-10-01,
    FileRef: some-reference2
  },
  {
    Username: bob,
    JobId: 1111,
    Status: Closed,
    CreationDate: 2014-09-01,
    FileRef: some-reference3
  }
]
Username is the hash and JobId is the range. You can query on the Username to get all the user's jobs.
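A quick boto3 sketch of that query; the table name is a placeholder:

from boto3.dynamodb.conditions import Key
import boto3

jobs = boto3.resource("dynamodb").Table("my_jobs_table")  # hypothetical table name

# All of toby's jobs, straight off the table's Username hash key.
response = jobs.query(KeyConditionExpression=Key("Username").eq("toby"))
for job in response["Items"]:
    print(job["JobId"], job["Status"])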
Now that the size of each document is more limited, you could think about putting all the data for each job in the dynamo db record instead of using the FileRef and looking it up in S3. This would probably save a significant amount of latency.
Each record might then look like:
{
  Username: bob,
  JobId: 1111,
  Status: Closed,
  CreationDate: 2014-09-01,
  JobCategory: housework,
  JobDescription: Doing the dishes,
  EstimatedDifficulty: Extreme,
  EstimatedDuration: 9001
}
I reckon I didn't really play with the DynamoDB console for long enough to get a good understanding before posting this question. I only just understood now that a DynamoDB table (and presumably any other NoSQL table) is really just a giant dictionary/hash data structure. So to answer my question, yes I can use DynamoDB, and each item/row would look something like this:
{
  "Username": "SomeUser",
  "Jobs": {
    "gdjk345nj34j3nj378jh4": {
      "Status": "Active",
      "CreationDate": "2014-10-05",
      "FileRef": "some-reference"
    },
    "ghj3j76k8bg3vb44h6l22": {
      "Status": "Closed",
      "CreationDate": "2014-09-14",
      "FileRef": "another-reference"
    }
  }
}
But I'm not sure it's even worth using DynamoDB after all that. It might be simpler to just store a JSON file containing the content structure above in an S3 bucket, where the filename is the username, i.e. username.json.
Edit:
For what it's worth, I just realized that DynamoDB has a 400KB size limit on items. That's a huge amount of data, relatively speaking for my use-case, but I can't take the chance so I'll have to go with S3.
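For the S3 route, a minimal boto3 sketch (the bucket name and key layout are assumptions):

import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-user-jobs"  # hypothetical bucket name

def save_user_jobs(username, record):
    # Store the whole per-user document as <username>.json.
    s3.put_object(Bucket=BUCKET, Key=f"{username}.json", Body=json.dumps(record))

def load_user_jobs(username):
    obj = s3.get_object(Bucket=BUCKET, Key=f"{username}.json")
    return json.loads(obj["Body"].read())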
It seems that username as the hash key and a unique job_id as the range key, as others have already suggested, would serve you well in DynamoDB. Using a query you can quickly search for all records for a username.
Another option is to take advantage of local secondary indexes and sparse indexes. It seems that there is a status column, but based upon what I've read you could add another column, perhaps 'not_processed': 'x', and build your local secondary index on username + not_processed. Only records which have this field are indexed, and once a job is complete you delete the field. This means you can effectively replace a table scan with an index query for username where not_processed = x (see the sketch below). Your index will also stay small.
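A rough boto3 sketch of that sparse-index pattern; the table, index and attribute names are assumptions. You query the LSI for a user's pending jobs, then remove the flag attribute when a job completes so the item drops out of the index:

from boto3.dynamodb.conditions import Key
import boto3

table = boto3.resource("dynamodb").Table("my_jobs_table")  # hypothetical table name

# Pending jobs for a user: only items still carrying not_processed are in the sparse index.
pending = table.query(
    IndexName="username-not_processed-index",  # hypothetical LSI name
    KeyConditionExpression=Key("Username").eq("toby") & Key("not_processed").eq("x"),
)["Items"]

# When a job finishes, remove the flag so the item leaves the sparse index.
table.update_item(
    Key={"Username": "toby", "JobId": 1234},
    UpdateExpression="REMOVE not_processed",
)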
All my relational db experience seems to be getting in the way for my understanding dynamodb. Good luck!

Storing json vs Object vs Map in Hazelcast.

I am new to Hazelcast and would appreciate your thoughts on the below.
Use case: I have a database table CUSTOMER (id, firstname, lastname, age) and would like to store it in a distributed map. There will be a need to query the collection (maybe with predicates) as well as general get/put operations. There will be somewhere around a million records, and I have 2 nodes at my disposal.
What would the best approach be, keeping performance and memory in mind?
1. Store the records as a map of maps: IMap<String, IMap<String, String>>, where the keys in the inner map are the column names
Or
2. Store the records as JSON: IMap<String, String>, e.g. "123" : { "id" : "123", "firstname" : "john", "lastname" : "Deer", "age" : "25" }
Or
3. Create a Customer DTO and store it in an IMap<String, Customer>
Thanks
The last option, using an entity class, is preferred; however, if you expect the object to change often, JSON might be preferable since it is schemaless (though since you're talking about a database table, I guess you don't need that).
Btw, your first option won't work, since IMap itself is not serializable. You could do IMap<String, Map<String, String>>, but the problem with this approach is that the inner map needs to be completely deserialized on every get, which will kill your performance ;-)
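The discussion above is in Java terms (IMap, DTOs); purely as an illustration of the trade-off, here is a rough sketch with the Hazelcast Python client, where the connection defaults and map name are assumptions. Storing the record as a JSON value keeps it inspectable on the cluster side, whereas a nested dict/map value travels as a single serialized blob that has to be fully deserialized on every get:

import json
import hazelcast
from hazelcast.core import HazelcastJsonValue

client = hazelcast.HazelcastClient()  # assumes a cluster reachable with default settings
customers = client.get_map("customers").blocking()  # hypothetical map name

# Option 2 from the question: store each customer as JSON rather than a nested dict,
# so the value keeps a structure the cluster understands.
customers.put("123", HazelcastJsonValue(json.dumps(
    {"id": "123", "firstname": "john", "lastname": "Deer", "age": 25})))

# A get returns just this entry (a HazelcastJsonValue wrapping the stored JSON).
print(customers.get("123"))

client.shutdown()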
