
How to get document embeddings using GPT-2?

Youssef · May 6, 2020

Note: I'm a Python programmer. Disclaimer: I'm a beginner to both ML and NLP; I've only fine-tuned GPT-2 a handful of times with the help of some packages.

I'm curious whether using GPT-2 might yield more accurate document vectors (for documents of greatly varying length) or not. Would it surpass the state of the art?

Really, I'm most interested in getting document embeddings that are as accurate as possible. I'm wondering whether GPT-2 would give more accurate results than, for example, Paragraph Vectors.

I heard that in order to get vectors from GPT-2 "you can use a weighted sum and/or concatenation of vector outputs at its hidden layers (typically the last few hidden layers) as a representation of its corresponding words or even the 'meaning' of the entire text, although for this role BERT is used more often, as it is bi-directional and takes into account both forward and backward contexts."
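From what I can tell, that might look something like the sketch below, using a recent version of the Hugging Face transformers library. This is just my rough guess, not a recommendation: the model name "gpt2", truncating to the context window, and mean-pooling the last hidden layer are all illustrative choices, and the weighted-sum/concatenation variants from the quote above would replace the pooling step.

```python
# Rough sketch: pull hidden-state vectors out of GPT-2 with Hugging Face
# transformers and pool them into one document embedding.
# Assumptions (not from the original question): model "gpt2", truncation to
# 1024 tokens, mean pooling over the last hidden layer.
import torch
from transformers import GPT2Tokenizer, GPT2Model

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2Model.from_pretrained("gpt2", output_hidden_states=True)
model.eval()

def embed(text: str) -> torch.Tensor:
    # GPT-2's context window is 1024 tokens, so long documents have to be
    # truncated or chunked; truncation is used here for simplicity.
    inputs = tokenizer(text, return_tensors="pt",
                       truncation=True, max_length=1024)
    with torch.no_grad():
        outputs = model(**inputs)
    # outputs.hidden_states is a tuple: the embedding layer output plus one
    # tensor per transformer block, each of shape (batch, seq_len, hidden_size).
    last_hidden = outputs.hidden_states[-1]
    # Mean-pool over the token dimension to get a single vector per document.
    # A weighted sum or concatenation of the last few layers would go here
    # instead, per the quoted suggestion.
    return last_hidden.mean(dim=1).squeeze(0)

doc_vector = embed("An example document whose embedding we want.")
print(doc_vector.shape)  # torch.Size([768]) for the base GPT-2 model
```

Is something along these lines the right idea, or is there a better-established way to do it?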

As a machine learning and NLP beginner, I'd love to know how to go about this, or to be pointed in the right direction to learn more about how to attempt this in Python.

I've tried fine-tuning GPT-2 before, but I have no idea how to extract vectors from it for a given text.