[EMNLP] What is GloVe? Part IV

An introduction to unsupervised learning of word embeddings from co-occurrence matrices.

I’m running out of GloVe-related images. Sorry.

If you’re just joining us, please feel free to read Parts I, II, and III first, as we’re picking up right where they left off.

In this article, we’ll discuss one of the newer methods of creating vector space models of word semantics, more commonly known as word embeddings. The original paper by J. Pennington, R. Socher, and C. Manning is available here: http://www.aclweb.org/anthology/D14-1162. This method combines elements from the two main word embedding approaches that existed when GloVe, short for “Global Vectors [for word representation],” was proposed: global matrix factorization and local context window methods. In Part I, we compared these two approaches. In Part II, we began walking through the authors’ development of the GloVe model. In Part III, we got into the algebra of the model design and explained the conditions our function must satisfy for the embeddings to make sense. Now, we’ll state the final model function (I know, I know, I said we’d get to it in the last article, but the math ended up taking a lot more space than I thought) and discuss the motivations behind it a bit.

Recall from Part III that we’re trying to find a function F which behaves “nicely” and satisfies the conditions we discussed last time. A nice place to start would be something that gives us a natural homomorphism between the additive group of the real numbers and the multiplicative group of the positive reals, i.e. a function that turns addition into multiplication (or vice versa, as long as we have an inverse where we need it). The clear choice is a function of the form a^x for some constant a > 0. For simplicity, we’ll take a = e, so F is the exponential function. Then, taking the natural logarithm of both sides of the condition from Part III, F(w_i^T \tilde{w}_k) = P_{ik} = X_{ik}/X_i, we get
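w_i^T \tilde{w}_k = \log P_{ik} = \log X_{ik} - \log X_i.

Here, as in the paper, w_i is a word vector, \tilde{w}_k is a separate context word vector (the tilde marks context vectors), and X_i = \sum_j X_{ij} is the total number of co-occurrences involving word i.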

Now recall that we wanted “exchange” symmetry of our function. This means that we want to be able to switch the places of w_i and w_k without altering the right hand side. Recalling the definition of X_{ik}, the number of times word k appears in the context of word i, we see that X_{ik} = X_{ki}, since word k appears in the context of word i if and only if word i appears in the context of word k. So we would have this symmetry in the above equation were it not for the log X_i term. But since this term is independent of the choice of word k, we can replace it with a bias term for word i. Introducing a similar term for word k makes our function symmetric:
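w_i^T \tilde{w}_k + b_i + \tilde{b}_k = \log X_{ik},

where b_i and \tilde{b}_k are scalar biases attached to word i and context word k, respectively.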

You can think of these bias terms as just functions of word i and word k. Since we’re adding them to our expression, commutativity of addition saves our symmetry. This can be checked by simply switching the places of word i and word k on either side of the above equation and observing that the equation keeps the same form. The biases, like the word vectors themselves, are parameters that must be learned.
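Concretely, swapping i and k gives

w_k^T \tilde{w}_i + b_k + \tilde{b}_i = \log X_{ki} = \log X_{ik}:

the right-hand side is unchanged because X is symmetric, and the left-hand side has exactly the same structure, so the model makes no distinction between the two roles.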

However, we’re still not out of the woods. The authors note that the above function (treated as a function of X_{ik}) is not well-defined when the number of co-occurrences is zero, since the logarithm diverges there. This problem can be fixed by shifting the argument, i.e. taking log(1 + X_{ik}). But they further note that this whole approach is still flawed because it weights all entries of the term-term co-occurrence matrix equally, even those X_{ik} which are very small or zero (remember they are non-negative integers). One could argue that the input of the logarithm constitutes some sort of implicit weighting, since that term is affected by the size of X_{ik}, but the log drastically compresses variation in the magnitude of its input. Instead, the authors suggest weighting each training example according to the value of X_{ik}. That way, when we’re learning the word vectors and biases, whether by some sort of regression or another algorithm, the low co-occurrence counts, which are likely noisy, are de-emphasized, and the high co-occurrence counts are emphasized. The authors incorporate this into a weighted least-squares regression problem in which we try to minimize the expression
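J = \sum_{i,j=1}^{V} f(X_{ij}) \left( w_i^T \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^2

(the sum runs over all V words in the vocabulary),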

where f is our weighting function, which we require to have the following properties:

  1. f(0) = 0. If f is assumed to be continuous, we want it to go to zero fast enough that f(x)log²(x) has a finite limit as x → 0, since otherwise, for co-occurrence counts of zero, the model might “blow up” the magnitudes of the dot product and biases in order to compensate for the log.
  2. f(x) should be nondecreasing. This addresses the issue we outlined in the paragraph above: it makes sure rare co-occurrences are not overweighted and gives larger counts at least as much weight. The authors note that it’s not unreasonable for 85% of the entries of X to be zeroes.
  3. f(x) should be relatively small for large values of x, i.e. it shouldn’t blow up too fast; anything that grows too quickly would probably not work all that well. I apologize for using “faster” here to describe higher-degree functions, when you CS people are probably used to thinking of slower growth as the good thing in the context of time complexity, but since we’re dealing with these functions from an analytic perspective, I think this makes more sense. This is just so we don’t wildly overweight the frequent co-occurrences; a toy sketch of one candidate weighting function follows this list.

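To make the weighting concrete, here is a small NumPy sketch (my own toy illustration, not the paper’s code) that plugs one naive candidate, f(x) = x / (x + c), into the weighted least-squares objective above. This candidate does satisfy the three properties, but it isn’t necessarily the function the authors settled on; the constant c, the matrix shapes, and the function names here are all illustrative assumptions.

```python
import numpy as np

# One naive candidate weighting function (NOT necessarily the authors' choice):
# a saturating ratio f(x) = x / (x + c). It satisfies the three properties above:
#   1. f(0) = 0, and f(x) * log(x)**2 -> 0 as x -> 0+, since f(x) ~ x/c near 0.
#   2. f is nondecreasing in x.
#   3. f is bounded above by 1, so very frequent co-occurrences cannot dominate.
def candidate_weight(x, c=10.0):
    return x / (x + c)

def glove_objective(X, W, W_tilde, b, b_tilde, weight_fn=candidate_weight):
    """Weighted least-squares objective J for a toy GloVe-style model.

    X: (V, V) co-occurrence counts; W, W_tilde: (V, d) word / context vectors;
    b, b_tilde: (V,) word / context biases. Returns the scalar loss J.
    """
    V = X.shape[0]
    J = 0.0
    for i in range(V):
        for j in range(V):
            if X[i, j] == 0:
                continue  # f(0) = 0, so zero counts contribute nothing
            residual = W[i] @ W_tilde[j] + b[i] + b_tilde[j] - np.log(X[i, j])
            J += weight_fn(X[i, j]) * residual ** 2
    return J

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    V, d = 6, 4                              # tiny vocabulary, tiny embedding dim
    X = rng.poisson(2.0, size=(V, V))        # fake co-occurrence counts
    X = X + X.T                              # enforce the symmetry X_ik = X_ki
    W, W_tilde = rng.normal(size=(V, d)), rng.normal(size=(V, d))
    b, b_tilde = rng.normal(size=V), rng.normal(size=V)
    print("toy objective J =", glove_objective(X, W, W_tilde, b, b_tilde))
```

Whether this particular f is a good choice is exactly the question left open below.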
I’m sorry for dragging this out into so many parts, but I feel it’s more digestible in bite-sized pieces. Try and guess a reasonable weighting function before the next part is published and we discuss the authors’ suggestion.

Once again, please check out the source paper!
