Saurav Shrivastav: "I think we must go over some clarifying questions to begin." Definitely. That's the way to go.

"I later found out that there are distributed file systems too, and they can be used in such a scenario." Ah. I assumed that we have a single machine and need to transfer data to 10k machines, so I ruled this out from the very beginning. My bad. 🙈

"Peer-to-peer transfer, more like how the Tor network deals with transfers." Yeah. But even in P2P we need to store a mapping from chunkId to machineId(s). I'm not sure we can have an eventually consistent mapping/index in a system with so many network failures.

"Providing access to the file by uploading it to something like an S3 bucket and running a script that loops over the machines and issues commands remotely to download the file. This would leave the whole tracking and network-failure part to the S3 library." Assumption: the machines are able to connect to AWS S3 infra. That would be the best option. S3 is a distributed object storage system, so it does make sense. (I have a gut feeling that we're missing something. I don't know what. 🥲)

"We can divide the file into chunks and then transfer these chunks to the machines, appending a checksum for each chunk, which can be verified at the client machine; an acknowledgment is then sent back to the server, which keeps track of any failures and retries sending the chunk a few more times. If a machine is not able to receive the chunks, it can be flagged and retried later, or a report can be generated for such machines. This would require establishing a protocol and writing some code." Damn, that's cool. Although, doesn't TCP perform this flow control, retries, in-order transmission, and acknowledgement thing by default? 🤔 By the way, I think this manual chunk-formation thing is definitely needed in Approach 1. Anyhow, nice work, buddy.

PS: Now my old comment sounds dumber. lol
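To make the chunk + checksum + ack idea concrete, here's a minimal Python sketch (all names are hypothetical, and a real transfer would stream chunks from disk instead of holding everything in memory). As noted above, TCP already gives reliable, in-order delivery within one connection; an application-level checksum/ack loop like this catches corruption and failures that span connections and retries:

```python
import hashlib

CHUNK_SIZE = 64 * 1024 * 1024  # hypothetical chunk size: 64 MiB

def make_chunks(data: bytes, chunk_size: int = CHUNK_SIZE):
    """Split a payload into (index, chunk, sha256-hex-digest) triples."""
    return [
        (i // chunk_size,
         data[i:i + chunk_size],
         hashlib.sha256(data[i:i + chunk_size]).hexdigest())
        for i in range(0, len(data), chunk_size)
    ]

def verify_chunk(chunk: bytes, expected_digest: str) -> bool:
    """Client side: ack a chunk only if its checksum matches."""
    return hashlib.sha256(chunk).hexdigest() == expected_digest

def send_with_retries(chunk, digest, send, max_attempts=3):
    """Server side: resend until the client acks, then give up and flag."""
    for _ in range(max_attempts):
        if send(chunk, digest):   # send() returns the client's ack (True/False)
            return True
    return False                  # flag this machine for a later retry / report
```

The `send` callable is a stand-in for whatever transport the real protocol would use; the point is just the chunk → checksum → ack → retry loop.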
Congrats, buddy. 🙌 I'm intrigued by the following question: there is a file of size 1 TB which needs to be transferred to 10k machines. How can this be done efficiently, keeping in mind that there can be network failures, machines running out of disk space, etc.?

I initially thought that since there could be network failures, we need something reliable, so the transport layer must be TCP. The simplest thing I can think of is FTP. But since there are 10k machines, we cannot transfer to all of them at once, because (I think) our server cannot hold 2×10k TCP connections simultaneously. Even if it could, the transfer would be very slow: a network failure is very likely to occur between the server and some of the machines, and the operation is also IO-bound.

Maybe peer-to-peer is the way to go. A network failure between one machine and the server is very likely, but a network failure between a machine and all other machines and servers is very unlikely. I'm not sure how we'll manage/sync a table which maps fragmentId to machineId(s), though. Kind of like a gossip protocol, maybe?

I would really appreciate it if you could share your approach and any specific details the interviewer shared on it. Feel free to criticise anything you didn't like in my thinking process above. 🙂 Thank you, bhai. And again, congratulations on this new feat. 💪
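For the fragmentId → machineId(s) table, a toy in-memory version might look like the sketch below (all names are invented; a real swarm would keep this on a tracker node, or gossip it / store it in a DHT rather than in one process):

```python
from collections import defaultdict

class ChunkIndex:
    """Toy in-memory fragmentId -> machineId(s) table.
    In a real P2P system this lives on a tracker, or is gossiped / kept
    in a DHT, and entries expire as machines fail."""

    def __init__(self):
        self._holders = defaultdict(set)

    def announce(self, chunk_id: int, machine_id: str) -> None:
        """A machine announces that it now holds a chunk."""
        self._holders[chunk_id].add(machine_id)

    def peers_for(self, chunk_id: int) -> set:
        """Which machines can currently serve this chunk?"""
        return set(self._holders[chunk_id])

    def rarest_first(self) -> list:
        """BitTorrent-style heuristic: replicate the rarest chunks first."""
        return sorted(self._holders, key=lambda c: len(self._holders[c]))
```

Once a few machines hold a chunk, downloaders can pull it from each other instead of the origin server, which is what makes the P2P fan-out fast.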
Nice and concise article. I would like to add a few points:

Before sharding, we should definitely consider vertical partitioning.

Sharding fits nicely for key-value stores, but what if you're not a key-value store? Actually, it doesn't matter: we just need to hash the given key(s) to a range [0, numberOfShards - 1]. For example, you can have a table with a composite primary key of (employeeId (int), department (string)). Even in that scenario, we can hash it to an integer and then take the modulo with numberOfShards.

Important ⚠️: We need to think a bit about the hash function and the numberOfShards value, and ensure that the data gets spread evenly.

If you have queries which very often require reading from multiple shards, then you need to redesign the data model, or maybe perform some data duplication. It's really an issue with access patterns and data modelling, not with sharding itself.

Important: Because of sharding, transactions won't work if we're writing across multiple shards.
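A minimal sketch of hashing that composite key into [0, numberOfShards - 1] (the shard count and key format here are assumptions for the example):

```python
import hashlib

NUMBER_OF_SHARDS = 8  # assumed value for the example

def shard_for(employee_id: int, department: str,
              num_shards: int = NUMBER_OF_SHARDS) -> int:
    """Map the composite key (employeeId, department) to [0, num_shards - 1].
    Using a stable digest (not Python's per-process hash()) keeps the
    routing consistent across machines and restarts."""
    key = f"{employee_id}:{department}".encode()
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

One caveat worth knowing: with plain modulo, changing numberOfShards moves almost every key to a different shard, so if resharding is expected, consistent hashing is the usual alternative.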
Hey buddy. It's an excellent article indeed. I would like to add a few points for the benefit of readers: considering that the data in a portfolio isn't very dynamic, I think a static site would have been a better alternative. If you have some dynamic data you want frequently updated on the page, you could have gone with Incremental Static Regeneration (Next.js, Nuxt, and Gatsby all support it). These static alternatives require less computation, have a faster time to first byte, are cheaper to host, and can be scaled via a CDN.
Ákos Kőműves: thank you for the follow-up.

"I always do some kind of validation on my REST endpoints as well - mostly with AJV." True. But please note that GraphQL validation is limited to type checking. Say there is a mutation where you need a specific field to always be an integer greater than 5; this additional validation needs to be implemented inside the resolver. In the case of REST, you could have used libraries like yup for both type checking and the additional validation.

"I actually avoided the N+1 queries with GraphQL, this was one of the reasons why I started using it." You got me wrong. You're saying that GraphQL helps us ensure that data fetching is exact (neither over-fetched nor under-fetched), and that we can do it in a single query from the client. I was talking about the number of database queries in the resolvers. Say we have a Countries table and a Cities table; clearly there's a one-to-many relation. Assume there are x countries with c1, c2, ..., cx cities respectively. If I ask for all countries and their cities, then without the data loader pattern the following happens in GraphQL: fetch the countries; then, for every country's city, fetch the details. Total number of database queries: 1 (to fetch the countries) + c1 + c2 + ... + cx. With a data loader, it will be 1 + x database queries. If you provide a single endpoint (as in REST), the number of DB accesses will be just 1, using joins.

By the way, please note that I equally adore GraphQL (although my arguments might sound otherwise 🥺). Maintaining software is a big challenge, and this is where GraphQL shines.
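A toy illustration of those query counts (FakeDB and all function names here are invented; in a real Node/GraphQL server you'd use the dataloader library). In this sketch the batch collapses every city lookup into a single `IN (...)` query, so it's 1 + 1 = 2 queries total; if the batch function instead issued one query per country, you'd get the 1 + x count:

```python
class FakeDB:
    """Stand-in database that just counts the queries it receives."""

    def __init__(self, cities_by_country):
        self.data = cities_by_country
        self.queries = 0

    def countries(self):
        self.queries += 1                 # 1 query for the country list
        return list(self.data)

    def city(self, country, name):
        self.queries += 1                 # one query per city: the N+1 shape
        return name

    def cities_in(self, countries):
        self.queries += 1                 # one batched `IN (...)` query
        return {c: list(self.data[c]) for c in countries}

def naive_resolve(db):
    """1 + c1 + c2 + ... + cx queries: every city is fetched on its own."""
    return {c: [db.city(c, n) for n in db.data[c]] for c in db.countries()}

def batched_resolve(db):
    """Data-loader style: collect the keys, resolve them in one batch."""
    return db.cities_in(db.countries())
```

Either way, the REST-with-joins version still wins on raw query count (a single query), which was the original point.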
Thank you for this nice article. I would like to add a few more things:

In GraphQL we have the introspection query, which is really handy during development. (Don't forget to disable it in prod. 🥲)

One of the less-talked-about things in GraphQL is data modelling. For the Apollo cache to work efficiently, we must ensure that the data modelling is done properly. If you make sure you're not using any antipattern, then it's usually safe. I've faced this issue earlier. (Maybe I should document it? 🤔)

REST is faster: GraphQL does type checking and so on, which makes it slower than REST, but it also makes the system less prone to errors. (There are tradeoffs to be made.)

The problem of N+1 queries: we might need to use the data loader pattern to solve the N+1 issue, which is an additional implementation overhead. Even after using the data loader pattern, the number of queries needed to resolve one request will be greater than the one simple query we'd need to write in REST. But do note that the number of customized endpoints is smaller in GraphQL, and we have granular control over data fetching, which are a few of its pros.