In Table 3, the function for Swish should use sigmoid(beta*x) rather than sigmoid(x), since beta is a trainable parameter; this is precisely what distinguishes Swish from SiLU, where beta is fixed at 1. Accordingly, for f(x) = x*sigmoid(beta*x), the derivative should be: f'(x) = sigmoid(beta*x) + beta*x*sigmoid'(beta*x), where sigmoid' denotes the derivative of the sigmoid with respect to its argument, sigmoid'(beta*x) = sigmoid(beta*x)*(1 - sigmoid(beta*x)). Equivalently, writing y = sigmoid(beta*x), we have y' = y*(1 - y).
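
The corrected derivative can be verified numerically against a finite-difference approximation; a minimal sketch (function names are my own, not from the paper):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def swish(x, beta):
    # Swish: f(x) = x * sigmoid(beta * x), with beta a trainable parameter.
    # beta = 1 recovers SiLU.
    return x * sigmoid(beta * x)

def swish_grad(x, beta):
    # Proposed analytic derivative:
    # f'(x) = sigmoid(beta*x) + beta * x * sigmoid(beta*x) * (1 - sigmoid(beta*x))
    s = sigmoid(beta * x)
    return s + beta * x * s * (1.0 - s)

def numeric_grad(f, x, beta, h=1e-6):
    # Central finite difference for comparison.
    return (f(x + h, beta) - f(x - h, beta)) / (2.0 * h)

# Check agreement over a few points and beta values.
for x in (-2.0, -0.5, 0.0, 1.5, 3.0):
    for beta in (0.5, 1.0, 2.0):
        assert abs(swish_grad(x, beta) - numeric_grad(swish, x, beta)) < 1e-5
```

Note that at x = 0 the derivative is sigmoid(0) = 0.5 for any beta, which the formula reproduces.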