Out of curiosity: have you heard of vector symbolic architectures (also known as hyperdimensional computing)? They are not LLMs, but in some ways they seem to share very similar underlying dynamics, and they tend to use 1-bit or ternary representations. That said, I think their biggest draw is an easy-to-understand mathematical framework for how arbitrarily complex knowledge structures can be meaningfully built and manipulated in a high-dimensional vector space. If you decide to give them a closer look, it would be interesting to hear your thoughts on whether VSAs and LLMs might be doing something fundamentally similar. :-)
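For anyone curious, the two core VSA operations (binding and bundling) fit in a few lines of NumPy. This is only a toy sketch of the common bipolar variant; the record being encoded and all names are illustrative, not from any particular VSA library:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # high dimensionality is what makes retrieval reliable

def random_hv():
    # Random bipolar (+1/-1) hypervector: the "1-bit" style representation
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    # Binding (elementwise multiply) associates two vectors;
    # for bipolar vectors it is its own inverse: bind(bind(a, b), b) == a
    return a * b

def bundle(*vs):
    # Bundling (sign of the elementwise sum) superposes vectors into one
    # vector that stays similar to each of its inputs
    return np.sign(np.sum(vs, axis=0))

def sim(a, b):
    # Normalized dot product, i.e. cosine similarity for bipolar vectors
    return float(a @ b) / D

# Encode a tiny record {color: red, shape: round} as a single hypervector
color, shape = random_hv(), random_hv()
red, round_ = random_hv(), random_hv()
record = bundle(bind(color, red), bind(shape, round_))

# Unbinding with the "color" role vector queries the record: the noisy
# result is clearly similar to "red" and nearly orthogonal to everything else
guess = bind(record, color)
print(sim(guess, red))     # high similarity
print(sim(guess, round_))  # near zero
```

The point of the exercise: structured, queryable records live in a single flat vector, with similarity doing the retrieval — which is what invites the comparison to how LLM residual streams seem to superpose features.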
Satoshi Takahashi
Medical AI researcher, neurosurgeon
Great article. The comparison to the post-quantization model certainly deserves a more thorough treatment.