I'm diving deeper into concurrent programming (using multithreading to split up workloads, in my case), and while I'm tackling a few specific problems in Rust, I'm sure there's a handful of common problems that arise in concurrent programming, along with typical methods for solving them.
Unfortunately, I haven't gotten my hands on a good high-level book or course on concurrent programming, so I could use some general advice about problems you've solved, in addition to guidance with this specific but probably very common scenario:
One recurring theme for me has been the need to read some given number of bytes at a time from a buffered reader and process them, with all threads pulling data from the same reader... This could also be done by mapping a large area of memory, loading the entire file, splitting it into pieces, and giving each thread a piece to work on. But so far that has been easier said than done when it comes to things like keeping track of line numbers across threads and splitting the workload properly.
For example, say for some reason I wanted to load a 2GB text file and, line by line, run some algorithm on each line... Having one thread do this is pretty slow, but if I could fire up 8 threads and have the first thread work on the first eighth of the file, the second thread on the second eighth, and so on, I could finish the task faster... However, I've run into a few issues with this method and was wondering if anyone had advice for doing this type of thing. I've heard of worker pools but have never seen them given a proper treatment. Thank you.
Do you want a low-level solution where you implement your own worker pools and so on, or do you just want something solid off the shelf? Because if it's Rust, Actix is an actor framework that abstracts the problem the same way Erlang, for example, does.
Actors pass messages to each other, and you have an arbiter that spawns your actors when you send messages to them.
As for concurrency, there are many solutions. A thread pool just means spawning a certain number of threads and pushing work to them via a queue/buffer.
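A minimal sketch of that idea in Rust using only the standard library: the workers share one mpsc receiver behind a mutex, which stands in for the queue. The function name and the squaring "work" are invented for illustration.

```rust
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// Spawn `n_workers` threads that all pull jobs from one shared queue.
// Returns the sum of the results, just to have something observable.
fn pool_map_sum(jobs: Vec<u64>, n_workers: usize) -> u64 {
    let (tx, rx) = mpsc::channel::<u64>();
    // mpsc receivers are single-consumer, so share one behind a mutex.
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..n_workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut local = 0;
                loop {
                    // Lock only long enough to grab the next job.
                    let job = rx.lock().unwrap().recv();
                    match job {
                        Ok(n) => local += n * n, // stand-in for real work
                        Err(_) => break,         // channel closed: no more jobs
                    }
                }
                local
            })
        })
        .collect();

    for job in jobs {
        tx.send(job).unwrap();
    }
    drop(tx); // close the channel so workers see the end of input

    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    println!("{}", pool_map_sum((1..=10).collect(), 4)); // 385
}
```

Dropping the sender is what shuts the pool down cleanly: every worker's `recv()` starts failing once the channel is closed and drained.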
A more efficient way, in my opinion, would be using the kernel's event loop (epoll/kqueue). Maybe reading Node's stream implementation would help with your particular problem?
I've just programmed different solutions in different languages/paradigms, and I read the scientific papers for fun. I don't have a practical guide or a good talk on concurrency besides the Jepsen ones, but those are about the consequences rather than the implementations.
It's also very language-dependent: Go/Node, for example, make concurrency super easy, while C/C++ can make it a pain in the ass. Java/Rust/C# give you different threading models, Erlang uses its own process infrastructure in the VM, ... The idea, however, always remains the same:
package your input, pass it on, and if you can, don't use shared memory between threads.
If you do need sharing, use buffered readers that are thread-safe, or an in-memory database that handles the concurrency checks for you.
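As a sketch of the "thread-safe shared reader" idea in Rust: the standard `BufReader` isn't thread-safe by itself, but wrapping it in `Arc<Mutex<..>>` lets every thread pull from the same source. The `Cursor` stands in for a real file, and the function name is made up for illustration.

```rust
use std::io::{BufRead, BufReader, Cursor};
use std::sync::{Arc, Mutex};
use std::thread;

// Several threads pull lines from one shared reader; each holds the lock
// only while reading a single line, then processes it without the lock.
fn count_lines_shared(data: &'static str, n_threads: usize) -> usize {
    let reader = Arc::new(Mutex::new(BufReader::new(Cursor::new(data))));
    let mut handles = Vec::new();
    for _ in 0..n_threads {
        let reader = Arc::clone(&reader);
        handles.push(thread::spawn(move || {
            let mut count = 0;
            loop {
                let mut line = String::new();
                // The lock guard is a temporary: released at the semicolon.
                let n = reader.lock().unwrap().read_line(&mut line).unwrap();
                if n == 0 {
                    break; // EOF
                }
                count += 1; // "process" the line here, outside the lock
            }
            count
        }));
    }
    handles.into_iter().map(|h| h.join().unwrap()).sum()
}

fn main() {
    let total = count_lines_shared("a\nb\nc\nd\n", 4);
    println!("processed {total} lines");
}
```

Note the trade-off: the mutex serializes the reads themselves, so this only pays off when the per-line processing dominates the read time.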
The first question I would ask myself is about order... a CSV, for example, does not need to be processed in a specific order; a book, however, does.
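When order does matter, the usual trick is to tag each item with its index before fanning out, then sort the results back into place. A sketch in Rust, with invented names and `to_uppercase` standing in for the real per-line work:

```rust
use std::sync::mpsc;
use std::thread;

// Tag each line with its index, process out of order across workers,
// then reassemble the results in the original order.
fn process_in_order(lines: Vec<String>, n_workers: usize) -> Vec<String> {
    let (result_tx, result_rx) = mpsc::channel::<(usize, String)>();

    let mut handles = Vec::new();
    for w in 0..n_workers {
        let tx = result_tx.clone();
        // Round-robin the indexed lines across workers.
        let chunk: Vec<(usize, String)> = lines
            .iter()
            .enumerate()
            .filter(|(i, _)| i % n_workers == w)
            .map(|(i, s)| (i, s.clone()))
            .collect();
        handles.push(thread::spawn(move || {
            for (i, line) in chunk {
                tx.send((i, line.to_uppercase())).unwrap();
            }
        }));
    }
    drop(result_tx); // so the receiver iterator terminates

    let mut results: Vec<(usize, String)> = result_rx.iter().collect();
    for h in handles {
        h.join().unwrap();
    }
    results.sort_by_key(|(i, _)| *i); // restore the original order
    results.into_iter().map(|(_, s)| s).collect()
}

fn main() {
    let out = process_in_order(vec!["a".into(), "b".into(), "c".into()], 2);
    println!("{:?}", out); // ["A", "B", "C"]
}
```

For the CSV-style case where order doesn't matter, you can simply skip the indexing and the final sort.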
But you probably know all of this.
I'll keep it high level: approach everything in a way that lets anything be replayed. If you keep the data/messages immutable and replayable, concurrency becomes a lot less complicated.
"Writing clean concurrent programs is hard—very hard. It is much easier to write code that executes in a single thread. It is also easy to write multithreaded code that looks fine on the surface but is broken at a deeper level. Such code works fine until the system is placed under stress." As Robert C. Martin states in Clean Code. I recommend reading its chapter on concurrency.
Btw, I think you could even implement this with a combination of bash commands:
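Something along these lines, assuming GNU coreutils (`split -n l/8` is a GNU extension); `tr` stands in for the real per-line algorithm, and `big.txt` for the real input file:

```shell
# Split on line boundaries, fan out with xargs, recombine in order.
seq 1 100000 > big.txt               # demo input; replace with your real file
split -n l/8 big.txt chunk_          # 8 chunks, never splitting mid-line (GNU split)
ls chunk_?? | xargs -P 8 -I{} sh -c 'tr a-z A-Z < {} > {}.out'  # 8 parallel jobs
cat chunk_??.out > results.txt       # glob order matches the original order
rm chunk_?? chunk_??.out
```

Since each chunk is written to its own `.out` file and the final `cat` runs over the sorted glob, the recombined output keeps the original line order even though the chunks finish in any order.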
You're actually thinking about the problem backwards, but that's a common misunderstanding in multithreaded programming.
Back when I was working on some BeWare in the music industry some two decades ago, a co-worker explained the ideas of workers, multithreading, multitasking, and so forth to me in a way I've rarely seen stated so clearly. BeOS was a very strange OS in that it was "pervasively multithreaded": EVERY program typically had at LEAST three threads, one for input, one for output, one for logic. The OS was built around the concept that very few things needed to, or even should, work in lock-step.
The issue at hand is that a LOT of tasks DO NOT work well when broken into smaller threads. They are by nature linear in how they operate and should remain isolated to a single thread of execution. The king of these is device I/O, of which filesystem access is a part.
Hence the idea is to ONLY parallelize what gains a benefit from being in parallel. In this case you have the reading of data, which should be one thread only; it hands off to other threads to process that data while it goes off to get MORE. Those other threads that process what you've read? Those are your workers.
Hence your example of reading a 2GB file is perfect for us to work with here. The OPTIMAL way to read a file is with a single linear thread. If you break it into pieces you go from fast sequential reads to slow random reads; and while modern SSDs don't have seek times, if you test with something like CrystalDiskMark you'll still see that even some of the fastest SSDs in existence take a huge hit on random reads.
In those benchmark results, the top row, sequential reads/writes, is the fastest. You want to leverage that when possible, so breaking the file into chunks that different threads read out of order would be a performance DOWNGRADE.
What you want to do is run your PROCESSING in parallel with the read, and that's where the concept of workers shines. Your workers would be one or more threads that your file-read loop sends data to, the read loop itself being a SINGLE thread, possibly part of the executable's main thread.
In the case of your 2GB file, let's say it's something like CSV: one record per line, or some other easily handled type of delimited or block-based data.
Loop:
    Read a block from the file
    Are any workers available?
        No: Are we at the maximum allowable number of workers?
            No: make a worker
            Yes: wait until a worker is available
        Yes: drop through, no action needed
    Send the data to the available worker
End loop
After the end of the file, wait until all workers are complete, then delete the workers.

This approach creates workers as and only when needed, so you aren't creating extra threads you don't need should the read be the bottleneck -- which with files it often is.
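A sketch of that loop in Rust with std threads; a bounded (sync) channel gives you the "wait until a worker is available" step for free, because the reader's `send` blocks when all workers are busy. The file is faked with an in-memory cursor, and the names are invented; the real pseudocode above also creates workers lazily, which this sketch simplifies away by spawning them up front.

```rust
use std::io::{BufRead, BufReader, Cursor};
use std::sync::mpsc;
use std::sync::{Arc, Mutex};
use std::thread;

// One sequential reader feeds lines to a fixed set of workers through a
// bounded channel; a full channel suspends the reader, which plays the
// role of "wait until a worker is available".
fn read_and_process(data: &str, n_workers: usize) -> usize {
    // A small capacity gives backpressure: sends block when workers lag.
    let (tx, rx) = mpsc::sync_channel::<String>(n_workers);
    let rx = Arc::new(Mutex::new(rx));

    let workers: Vec<_> = (0..n_workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || {
                let mut processed = 0;
                loop {
                    let msg = rx.lock().unwrap().recv(); // lock released here
                    match msg {
                        Ok(line) => {
                            let _ = line.len(); // stand-in for the real algorithm
                            processed += 1;
                        }
                        Err(_) => break, // channel closed: end of input
                    }
                }
                processed
            })
        })
        .collect();

    // The single, sequential read loop.
    for line in BufReader::new(Cursor::new(data.to_owned())).lines() {
        tx.send(line.unwrap()).unwrap();
    }
    drop(tx); // end of file: closing the channel lets the workers drain and exit

    // Wait until all workers are complete, then "delete" them by joining.
    workers.into_iter().map(|w| w.join().unwrap()).sum()
}

fn main() {
    let n = read_and_process("one\ntwo\nthree\n", 4);
    println!("processed {n} lines"); // processed 3 lines
}
```

The channel capacity is the tuning knob here: it bounds how far the reader can run ahead of the workers, which is exactly the "file reads are suspended if the workers are all busy" behaviour described below.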
That's typically the optimal way to handle it: file reads are suspended if the workers are all busy and we don't have the cores to justify making more, but for the most part the file reads run in parallel with your processing.
All a 'worker' is, in the end, is another thread.
Does that help any? Do you see how breaking the file READS into multiple threads doesn't make sense, and how you should instead have one read thread with one or more data-processing threads running in parallel with it?