Message | Time |
---|---|
2 Jan 2022 | ||
3 Jan 2022 | ||
hello! is there by any chance a data template for what the content of the BlueSky protocol will look like? 🙂 | 20:26:18 |
i think probably application/json -- but is there anything like a post-type body, etc.? | 20:27:17 |
4 Jan 2022 | ||
Anyone tried the ERC-725 Identity Standard? | 06:57:22 |
there are no schemas yet, but it will likely be split into layers: for example allowing application/json or application/cbor as the document format, and then a schema on top that is defined in terms of strings, lists, and dicts. For example, a post could be any dict with an author and a body, {"author", "body"}, but you may get more power by having those fields comply with a schema that gives them more meaning, like the author being a DID or the body being Content-Typed. | |
We should be able to evolve the schema and the representation independently. The signature should not depend on whether the data is in json, cbor, parquet, or a proprietary index/datastore. https://github.com/blueskyCommunity/aozora/blob/main/anatomy-of-a-social-network.md#the-document-web [personal opinion] explicit post-types are an anti-pattern, as a post is likely to match many different schemas, and tools only care about the ones they care about. The fact that the post is a subclass and has an extra field could be totally irrelevant, and your thing being type mammal should not fail to match when a tool is scanning for all animals, just because the post-type strings are not the same. | 19:34:53 |
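The layered-schema idea above can be sketched as nested predicates, where a stricter layer refines a looser one. This is a minimal illustration, not any real BlueSky schema; `is_post`, `is_rich_post`, and the `did:` prefix check are all hypothetical names.

```python
def is_post(doc) -> bool:
    """Base layer: any dict with an author and a body counts as a post."""
    return isinstance(doc, dict) and "author" in doc and "body" in doc

def is_rich_post(doc) -> bool:
    """Stricter layer: the same fields, but with more meaning attached:
    the author is a DID and the body carries a content type."""
    return (
        is_post(doc)
        and isinstance(doc["author"], str)
        and doc["author"].startswith("did:")
        and isinstance(doc["body"], dict)
        and "content_type" in doc["body"]
    )

rich = {"author": "did:example:alice",
        "body": {"content_type": "text/plain", "text": "hi"}}
loose = {"author": "alice", "body": "hi"}

assert is_post(rich) and is_rich_post(rich)
assert is_post(loose) and not is_rich_post(loose)  # matches the loose layer only
```

Tools that only need the loose layer keep working unchanged on documents that also satisfy the stricter one, which is the point of the anti-pattern argument above.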
The signature should not depend on whether the data is in json, cbor | 19:40:35 | |
How do you plan to achieve this? | 19:40:55 | |
my personal plan would be a hash based on the list or dict structure. For a list, hash each element, then hash each pair of hashes, then the pairs of those, up a merkle tree. For dicts, hash the keys and build a radix tree or prefix trie, then hash the key-value pairs and walk up the tree calculating hashes. Strings/bytes get split into blocks, and you hash the list of blocks. Now you have a hash that depends on the list order and the dict keys, but not on the encoding, and where a small chunk can be validated without needing to send the whole object across the wire. I could have a big structure like dict[twitter_user_id][tweet_id] = tweet and prove a tweet is in it, and by a specific user, with (log2(||user_ids||) + log2(||tweet_ids for user||)) * hash_size + the tweet. Even large byte arrays or strings would be able to have a small proof of a slice in the middle of the bytes/string, by sending the merkle path. | 20:37:22 |
I should make slides for this; I don't think I am explaining it well for anyone who does not already know what a merkle tree is. 🤔 | 20:38:24 |
note: this pattern will have lots of branch misprediction and be much slower than hashing the bytes of an already canonicalised json or cbor. | 20:41:53 |
most elements, keys, and values will be less than the block size and just be hashed, but if some fool wanted to use a megabyte- or petabyte-sized string as a key, I would want the tree-structured hashing so we can go at it in parallel. | 20:47:54 |
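One possible reading of the structural-hashing scheme described above, as a sketch. The block size, the tag bytes, and using sorted keys as a stand-in for the radix/prefix-trie walk are all assumptions for illustration, not a specification.

```python
import hashlib

BLOCK = 1024  # block size for chunking strings/bytes (assumed)

def h(tag: bytes, payload: bytes) -> bytes:
    """Domain-separated hash: the tag keeps lists, dicts, and strings distinct."""
    return hashlib.sha256(tag + payload).digest()

def merkle(leaves: list, tag: bytes) -> bytes:
    """Pairwise-combine hashes up to a single merkle root."""
    if not leaves:
        return h(tag, b"")
    while len(leaves) > 1:
        pairs = [leaves[i:i + 2] for i in range(0, len(leaves), 2)]
        leaves = [h(tag, b"".join(p)) for p in pairs]
    return leaves[0]

def shash(obj) -> bytes:
    """Structural hash: depends on list order and dict keys, not the wire encoding."""
    if isinstance(obj, str):
        obj = obj.encode("utf-8")  # string as an alias for utf-8 bytes
    if isinstance(obj, bytes):
        blocks = [obj[i:i + BLOCK] for i in range(0, len(obj), BLOCK)] or [b""]
        return merkle([h(b"blk", b) for b in blocks], b"str")
    if isinstance(obj, list):
        return merkle([shash(x) for x in obj], b"list")
    if isinstance(obj, dict):
        # sorted keys stand in for the radix-tree walk in the description above
        kv = [h(b"kv", shash(k) + shash(v)) for k, v in sorted(obj.items())]
        return merkle(kv, b"dict")
    raise TypeError(f"unsupported type: {type(obj)}")

post = {"author": "alice", "body": "hello"}
assert shash(post) == shash({"body": "hello", "author": "alice"})  # key order irrelevant
assert shash(post) != shash({"author": "alice", "body": "hello!"})
```

Because each node's hash only covers its children, updating one value re-hashes just the path to the root, and a merkle path from the root to one key-value pair serves as the small membership proof mentioned above.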
This sounds a bit like defining a new canonical format, crossed with IPLD. What canonical byte encoding will you use for strings? For doubles? | 20:53:55 |
I would use string as an alias for byteArray utf-8, if there is a way to define a typed byteArray. Doubles are a tagged byte[8], and then you need a way to express an unrecognised tag. If I know what a u8, i8, u16, i16, u32, i32, u64, i64, f32, f64 are, then I can treat them specially; if I don't, then they can be expanded to a {"tag/f64": b'\x1f\x85\xebQ\xb8\x1e\t@'} -- probably b85 for json, where you can't have bytes. This lets you have custom types that are also self-describing for when the receiver does not know the custom types. | 22:09:48 |
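The tagged fallback above can be sketched like this. The `tag/f64` wrapper key mirrors the message; the function names and the choice of little-endian IEEE-754 are assumptions, while `struct.pack` and `base64.b85encode` are real stdlib calls.

```python
import base64
import struct

def encode_f64(x: float) -> bytes:
    # canonical 8-byte little-endian IEEE-754 encoding (assumed convention)
    return struct.pack("<d", x)

def to_json_safe(raw: bytes) -> dict:
    # receiver doesn't know the tag? expand to a self-describing dict;
    # b85-encode because JSON can't carry raw bytes
    return {"tag/f64": base64.b85encode(raw).decode("ascii")}

def decode_f64(doc: dict) -> float:
    return struct.unpack("<d", base64.b85decode(doc["tag/f64"]))[0]

raw = encode_f64(3.14)
assert raw == b"\x1f\x85\xebQ\xb8\x1e\t@"  # the exact bytes quoted in the chat
assert decode_f64(to_json_safe(raw)) == 3.14
```

A receiver that understands f64 can unpack the 8 bytes directly; one that doesn't can still store, forward, and re-hash the self-describing dict unchanged.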
Yes, it is a canonical format, but hopefully one that makes updating a part of the object only re-hash the path to the root. I am trying to get the benefits of structural sharing, to make working with immutable objects logarithmic compared to the linear costs associated with json/cbor. I want it to be affordable to hash objects whether they are laid out as documents, row store, column store, triple store, or whatever. | 22:16:03 |
Array of Structs, Struct of Arrays, or Array of Structs of Arrays -- we need the objects to have the canonical hash so we can store and search efficiently. For small objects, just serialize out and hash the bytes, but for very large objects, like the sum of all facebook public posts, that won't work; it needs to work more like a git hash than a document hash. | 22:20:48 |
also yes, I do think this is like IPLD https://ipld.io/specs/codecs/dag-cbor/spec/ and I hope they move more in this direction. | 22:24:54 |
It means everyone will need at least two encoders, json/cbor and this new one, and there are no pre-existing libraries for this to build on, which might slow adoption. There's also a size blowup, because most strings are < hash size, and all primitive types like ints, doubles, etc. are as well. Is that a problem? For doubles specifically, I was worried about the canonical conversion from a double to 8 bytes, which is super error-prone. | 22:39:56 |
I think you can do all this with IPLD as it is today. | 22:40:06 | |
If you have a huge map or whatever, then it just gets sharded in a standard way, like a CHAMP or similar. And your max "small object" size is 1MiB. | 22:41:38 |