week 1 ¶
- deep neural networks => learning of more abstract representations and better performance than baselines like naive Bayes
- structure of neural networks (already known, see [[FAST.AI Notes]])
- task for tweet sentiment classification:
- take tweet vector representation, pass through an embedding layer
- hidden layer relu activation,
- output layer with softmax to classify
- the initial vector representation is built by mapping the vocab to indices; each tweet => a vector of the indices of its words in order, padded with 0s so all vectors are the same size
- uses trax https://github.com/google/trax
- use jax https://jax.readthedocs.io/en/latest/index.html because you're a cool kid
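- a minimal trax sketch of the classifier above (the vocab size, embedding dimension, and hidden width are placeholder assumptions, not the course's exact values):
```python
import trax.layers as tl

# Placeholder sizes: 10k-word vocab, 256-dim embeddings, 2 sentiment classes.
model = tl.Serial(
    tl.Embedding(vocab_size=10000, d_feature=256),  # tweet indices -> vectors
    tl.Mean(axis=1),     # mean over words: one vector per tweet
    tl.Dense(64),        # hidden layer
    tl.Relu(),
    tl.Dense(2),         # positive / negative logits
    tl.LogSoftmax(),     # log-probabilities for classification
)
```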
- dense layer = computation of the inner product between a set of trainable weights and an input vector (https://en.wikipedia.org/wiki/Inner_product_space)
- embedding layer: train a matrix of size vocab * emb_dimension; multiplying a word's one-hot vector by it (i.e. looking up its row) gives that word's embedding. These weights will be trained in the backprop process and allow you to convert your tweet to a vector encoding its meaning relative to our task.
- each tweet then yields a large matrix (one embedding per word), so you take the mean of each embedding feature over the words (a mean layer), similar to simple sentence representations
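- in plain numpy, the lookup + mean step reduces to something like this (toy sizes, random weights):
```python
import numpy as np

vocab_size, emb_dim = 8, 4
rng = np.random.default_rng(0)
E = rng.normal(size=(vocab_size, emb_dim))  # trainable embedding matrix

tweet = np.array([3, 5, 1, 0, 0])  # word indices, 0-padded to fixed length
emb = E[tweet]                     # row lookup == one-hot @ E, shape (5, 4)
sentence_vec = emb.mean(axis=0)    # mean per embedding feature, shape (4,)
```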
week 2 ¶
- ngrams vs sequence models
- sequence models allow you to track context from much further back in the sequence than n-grams, which are limited by the RAM needed to store counts for large n
- a lot of the computation shares parameters: the same weights are applied at every timestep, so the model carries context from earlier words forward and combines it with new input
- RNN math :
- you train weights for the state -> state, input vector -> state, state -> output. You maintain state over the recurrence. clean af
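- a minimal numpy sketch of one recurrence step (weight names are mine; tanh as the state non-linearity is an assumption):
```python
import numpy as np

def rnn_step(x, h, W_xh, W_hh, W_hy, b_h, b_y):
    """One timestep: new state from [input -> state] + [state -> state],
    then an output from [state -> output]."""
    h_new = np.tanh(W_xh @ x + W_hh @ h + b_h)
    y = W_hy @ h_new + b_y
    return h_new, y
```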
- backpropagation through time: each output takes into account the hidden states of previous timesteps, so calculating the loss involves the gradients of previous states
- vertical concatenation of matrices (stacking rows) = np.vstack
- horizontal concatenation (side by side) = np.hstack
- trax also has a one_hot helper for index -> one-hot conversion
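- quick check of the two:
```python
import numpy as np

a = np.array([[1, 2]])
b = np.array([[3, 4]])
np.vstack([a, b])  # vertical (rows stacked) -> shape (2, 2)
np.hstack([a, b])  # horizontal (side by side) -> shape (1, 4)
```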
- gated recurrent unit: improvement of rnn that allows information to be conserved through state even over long sequences
- uses relevance and update gates
- at every step takes variable x and state h
- trains two gates, update and relevance; each applies a sigmoid to the [previous state, input] pair and determines which info from previous states is relevant and how much of the state should be updated
- then a candidate hidden state is computed from the previous state, the relevance gate, and the input
- the actual hidden state is then computed using the update gate, which decides how much of the previous hidden state to keep versus the candidate hidden state
- more operations, so longer processing times and higher memory usage (a numpy sketch of one step follows this list)
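- a numpy sketch of one GRU step under the gate equations described above (weight/bias names are mine):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x, h, Wu, Wr, Wh, bu, br, bh):
    """One GRU step; each W acts on the concatenated [state, input] vector."""
    hx = np.concatenate([h, x])
    u = sigmoid(Wu @ hx + bu)   # update gate: how much of the state to rewrite
    r = sigmoid(Wr @ hx + br)   # relevance (reset) gate: what past info matters
    h_cand = np.tanh(Wh @ np.concatenate([r * h, x]) + bh)  # candidate state
    return (1 - u) * h + u * h_cand  # blend old state with the candidate
```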
- running an rnn in one direction lacks context from the words on the other side
- you can use bidirectional RNNs: a second network runs over the sequence backwards
- once you've calculated the two hidden states (one for each direction), you compute the output normally by processing the concatenated states
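- a sketch of the bidirectional pass, assuming a step(x, h) -> h function like the rnn step above:
```python
import numpy as np

def bidirectional_states(xs, h0, step):
    """Run step(x, h) -> h over the sequence in both directions."""
    fwd, h = [], h0
    for x in xs:
        h = step(x, h)
        fwd.append(h)
    bwd, h = [], h0
    for x in reversed(xs):
        h = step(x, h)
        bwd.append(h)
    bwd.reverse()  # align backward states with forward timesteps
    # each timestep's representation concatenates both directions
    return [np.concatenate([f, b]) for f, b in zip(fwd, bwd)]
```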
- deep RNNs
- same layers process the input at each timestep, but x passes through multiple stacked recurrent sublayers, each feeding its hidden states to the next
week 3 ¶
- named entity recognition
- using lstm = long short-term memory unit
- to calculate the loss of an rnn you backpropagate through time, multiplying the gradients of the previous states together; these repeated products lead to exploding or vanishing gradients
- this is a problem because contributions from farther back in the sequence either blow up or get discounted to almost nothing in the loss calculation
- solutions:
- initialization with weights as identity matrices and zero biases + relu activation allows for an initial stable state that is less vulnerable to faulty gradients
- gradient clipping -> "cut" gradients whose magnitude exceeds a chosen threshold back down to it (like a flipped relu capping values at a given maximum m; see the sketch after this list)
- skip connections: for earlier layers, concatenate the input vector from that earlier stage with the later output so early information keeps more weight
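- element-wise clipping in numpy (the threshold value is arbitrary):
```python
import numpy as np

def clip_gradient(grad, threshold=5.0):
    """Element-wise clipping: anything beyond +/- threshold is cut back to it."""
    return np.clip(grad, -threshold, threshold)
```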
- LSTM
- learns when to remember and when to forget
- composed of:
- cell state = form of memory
- hidden state that performs computations to decide changes
- multiple other gates that manage the processing/removal of information and prevent exploding / vanishing gradients
- the cell state goes through a forget gate, which uses the input and previous states to discard info
- an input gate where the input and previous hidden state filter which new information is relevant before it is combined with the cell state
- output gate that determines which info from cell state and hidden state is relevant to form an output
- useful for: llms, music composition
- https://colah.github.io/posts/2015-08-Understanding-LSTMs/
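- a numpy sketch of one LSTM step matching the gate description above (weight/bias names are mine):
```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, Wf, Wi, Wo, Wc, bf, bi, bo, bc):
    """One LSTM step over input x, hidden state h, and cell state c."""
    hx = np.concatenate([h, x])
    f = sigmoid(Wf @ hx + bf)                  # forget gate: what to drop from c
    i = sigmoid(Wi @ hx + bi)                  # input gate: what new info to store
    o = sigmoid(Wo @ hx + bo)                  # output gate: what to expose as h
    c_new = f * c + i * np.tanh(Wc @ hx + bc)  # updated cell state (the memory)
    h_new = o * np.tanh(c_new)                 # new hidden state
    return h_new, c_new
```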
- ner data processing:
- assign words and entity classes ids
- need fixed sequence length so pick one and then PAD
- to train ner:
- create a tensor for each input and its label
- gather them into a batch and feed it to an lstm unit
- run the output through a dense layer and predict using a log softmax over k entity classes
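- a minimal trax sketch of that pipeline (vocab size, embedding dimension, and number of entity classes are placeholders):
```python
import trax.layers as tl

# Placeholder sizes: 35k-word vocab, 50-dim embeddings, 17 entity classes.
ner_model = tl.Serial(
    tl.Embedding(vocab_size=35000, d_feature=50),  # word ids -> embeddings
    tl.LSTM(50),                                   # LSTM over the padded sequence
    tl.Dense(17),                                  # one score per entity class
    tl.LogSoftmax(),                               # log p(class) for each token
)
```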
Siamese networks ¶
- neural network that uses same weights while working in tandem on two different input vectors to make comparable output vectors
- useful e.g. for finding question duplicates
- the weights of each network are shared
- the siamese loss operates on triplets of questions; minimize the loss on each triplet based on similarity
- you add a margin of safety alpha in the cost
- you want to avoid your model having a negative loss, so you use a relu non-linearity (max with 0); the alpha margin ensures there is some progression beyond merely differentiating between positive and negative samples
- to train you build batches of unique questions, negative questions for a given question are simply the other questions in the batch.
- the loss is just a sum of our cost over each sample.
- ways of calculating loss:
- mean_negative, where you take the mean cos similarity over the other (negative) questions and use that instead of s(A, N) in your loss term
- closest_negative, where you take the highest cos similarity among your negative examples (the hardest negative)
- L = L(mean_neg) + L(closest_neg)
- mean_neg allows for faster training, while closest_neg penalizes the model more for a comparable example and so gives a larger cost (see the sketch below)
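- a numpy sketch of the batch triplet loss with both terms (alpha = 0.25 is just an example margin; assumes the embeddings are already L2-normalized):
```python
import numpy as np

def triplet_loss(A, P, alpha=0.25):
    """A, P: (batch, d) L2-normalized embeddings from the two twin networks;
    row i of A and row i of P are duplicates, other rows act as negatives."""
    sim = A @ P.T                          # cosine similarity matrix
    batch = sim.shape[0]
    pos = np.diag(sim)                     # sim(anchor, its positive)
    eye = np.eye(batch)
    mean_neg = np.sum(sim * (1 - eye), axis=1) / (batch - 1)
    closest_neg = np.max(np.where(eye == 1, -2.0, sim), axis=1)  # -2 < any cosine
    l1 = np.maximum(0.0, mean_neg - pos + alpha)     # mean_negative term
    l2 = np.maximum(0.0, closest_neg - pos + alpha)  # closest_negative term
    return np.mean(l1 + l2)
```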
- one shot learning = instead of retraining a classifier on k+1 classes whenever new inputs arrive, train a similarity function once; then define a threshold, compare new inputs to existing values, and declare them the same if they meet that threshold