Comment by j on "Why is cache invalidation called the biggest problem in software development?"


stuff ;)

Aug 18, 2016

Because how can you determine when to invalidate it? Only if you knew the future could you actually do it properly. At least that's the idea behind Bélády's algorithm.
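To make the "know the future" part concrete, here is a minimal sketch of Bélády-style eviction. It assumes the whole request sequence is known up front, which is exactly what a real cache never has. The function name `belady_misses` and the sample keys are made up for illustration.

```python
def belady_misses(requests, capacity):
    """Count misses when we always evict the key whose next use is furthest away."""
    cache = set()
    misses = 0
    for i, key in enumerate(requests):
        if key in cache:
            continue  # hit, nothing to do
        misses += 1
        if len(cache) >= capacity:
            future = requests[i + 1:]  # this lookahead is the "knowing the future" part

            def next_use(k):
                # Keys that never show up again are the best eviction candidates.
                return future.index(k) if k in future else float("inf")

            cache.remove(max(cache, key=next_use))
        cache.add(key)
    return misses


print(belady_misses(["a", "b", "c", "a", "b", "d", "a"], capacity=2))  # prints 5
```

Any real policy (LRU, random, ...) has to guess what `future` looks like, so it can only approximate this miss count.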

The next question is what and where to cache and for how long. For example, if you use a memcache cluster that is caching your database ... how do you know when to invalidate the cache? You could now cockily say "after a successful database write", which leads to the next question: how do you keep it consistent? Should the database trigger your memcache invalidation? Because that would be the most realistic way. And that's still ignoring the CAP theorem, the question of "when do you know something is done", and concurrency issues.
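A rough sketch of the "invalidate after a successful write" idea and of one way it can still go wrong. Plain dicts stand in for the database and the memcache cluster here, and `read`, `write` and the key `user:1` are hypothetical names, not any real client API.

```python
database = {"user:1": "alice"}  # stand-in for the real database
cache = {}                      # stand-in for the memcache cluster


def read(key):
    """Read through the cache; fill it from the database on a miss."""
    if key in cache:
        return cache[key]
    value = database[key]
    cache[key] = value
    return value


def write(key, value):
    """Write to the database first, then invalidate the cached copy."""
    database[key] = value
    cache.pop(key, None)


# One unlucky interleaving of a reader and a writer, replayed step by step:
stale = database["user:1"]  # reader misses the cache and loads the old value
write("user:1", "bob")      # writer updates the database and invalidates
cache["user:1"] = stale     # reader finishes late and caches the old value
print(read("user:1"))       # prints alice although the database says bob
```

So even with invalidation on every write, the cache can end up holding stale data until the next write or expiry, which is the concurrency issue mentioned above.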

Is it efficient to discard information at this point? The main problem with cache invalidation is efficiency.

And when you think of the level of complexity in distribution models, going from a single core to multi core, and then you add multiple machines, containers in virtual memory and network distribution, the whole thing starts to get really complicated .... Those are just things off the top of my head and I won't get into the algorithms :)

Covenant Chukwudi

Software Engineer

Aug 18, 2016

How do you handle the Big question "Is it efficient to discard information at this point?"

That's exactly the point: you have to do it case by case.

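Hard disk caches are often described as MRU -> strictly, MRU evicts the most recently used entry first; the behaviour where a file that is read more often is less likely to get discarded is frequency based (LFU)
User caches use LRU -> so the things that were used least recently are the first to get discarded
Some CPUs use RR (random replacement) -> so basically the entry that gets overwritten is picked at random, without tracking usage at all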
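To compare two of the policies from the list, here is a small sketch of an LRU cache and a random replacement cache. The class names are invented for illustration, and real hardware or OS caches are of course not implemented like this.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Evicts the entry that has gone unused for the longest time."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()

    def get(self, key):
        if key not in self.data:
            return None
        self.data.move_to_end(key)  # mark as most recently used
        return self.data[key]

    def put(self, key, value):
        if key in self.data:
            self.data.move_to_end(key)
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)  # drop the least recently used entry


class RandomCache:
    """Evicts a randomly chosen entry; no usage bookkeeping at all."""

    def __init__(self, capacity, rng=None):
        self.capacity = capacity
        self.data = {}
        self.rng = rng or random.Random()

    def get(self, key):
        return self.data.get(key)

    def put(self, key, value):
        if key not in self.data and len(self.data) >= self.capacity:
            victim = self.rng.choice(list(self.data))
            del self.data[victim]
        self.data[key] = value


lru = LRUCache(2)
lru.put("a", 1)
lru.put("b", 2)
lru.get("a")         # "a" is now the most recently used entry
lru.put("c", 3)      # evicts "b", the least recently used
print(lru.get("b"))  # prints None
```

LRU pays for its better guesses with bookkeeping on every access, while random replacement needs none, which is one reason it can be attractive where that bookkeeping is expensive.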

Efficiency is always a use case thing. For example: it could be very efficient to cache a variable at runtime so you only fetch it once from the API / hard disk [costly -> ms], but if you get a lot of writes against that "point of truth" you risk inconsistencies, and how do you know when that happens? You could go for statistical probability, but as I mentioned, now the concurrency model and the CAP theorem kick in and it gets complicated.
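One common compromise for the "fetch it once" case is to let the cached value expire after a fixed time. This is only a sketch under that assumption: `TTLValue`, the fake clock and the 10 second lifetime are all made up for illustration, and picking the lifetime is exactly the case-by-case guess described above.

```python
import time


class TTLValue:
    """Caches one value and refetches it once it is older than ttl_seconds."""

    def __init__(self, fetch, ttl_seconds, clock=time.monotonic):
        self.fetch = fetch      # the costly call (API, hard disk, ...)
        self.ttl = ttl_seconds
        self.clock = clock      # injectable so the example is deterministic
        self.value = None
        self.expires_at = None

    def get(self):
        now = self.clock()
        if self.expires_at is None or now >= self.expires_at:
            self.value = self.fetch()
            self.expires_at = now + self.ttl
        return self.value


source = {"price": 1}  # stand-in for the "point of truth"
now = [0.0]            # fake clock we can move forward by hand
cached = TTLValue(lambda: source["price"], ttl_seconds=10, clock=lambda: now[0])

first = cached.get()   # 1, fetched from the source
source["price"] = 2    # a write hits the point of truth
stale = cached.get()   # still 1, the cache does not know about the write
now[0] = 10.0          # the lifetime has passed
fresh = cached.get()   # 2, refetched
print(first, stale, fresh)  # prints 1 1 2
```

The window between the write and the expiry is the inconsistency risk: a shorter lifetime shrinks it but costs more fetches.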

Usually I just apply the rule: "everything of essential importance that goes into the database is not cached". So transactions that actually produce value are always handled inside the database¹, and if that's slow I try to optimize the database before I try to cache. I also log them several times in between so I can hope for minimal data loss.

But that's just a rule of thumb. I think you just need to look at the issue at hand: for example, if you have a shop you could cache all the products that will be displayed, but not the cart ..... anyhow, a complex topic to think about :)

¹ Virtual machines can break the ACID guarantees (the durability part) of MySQL transactions -> a transaction is over when the data is persisted on the file system. A virtual machine may tell the database it's done before it's done :) because "it's virtually written, now let the host system take over" .... but that leads too far :)
