I really enjoyed the video, but I think it can be a bit misleading. It gives the impression that the universal approximation property is what makes neural nets so effective, when, of course, the learning algorithm "memorize the training data and, on input x, return the output y associated with the training input nearest to x" also has the universal approximation property. Given training data from a continuous function f sampled via a distribution with non-zero density on an interval I, it will converge to f (uniformly if I is a finite, and hence compact, interval). Nor does the geometric explanation at the start have anything to do with why neural nets are so effective.
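To make that concrete, here's a minimal sketch of what I mean (Python; the target function, sample sizes, and the little `memorize` helper are just illustrative choices, not anything from the video): pure memorization plus nearest-neighbour lookup already approximates a continuous target as well as you like once the sample gets dense enough.

```python
import numpy as np

def memorize(xs, ys):
    """'Learn' by storing the training data; predict by returning the y of the nearest stored x."""
    xs = np.asarray(xs)
    ys = np.asarray(ys)
    def predict(x):
        return ys[np.argmin(np.abs(xs - x))]
    return predict

# A continuous target on [0, 1]
f = lambda x: np.sin(2 * np.pi * x)

rng = np.random.default_rng(0)
for n in (10, 100, 10_000):
    xs = rng.uniform(0.0, 1.0, n)          # sampling density > 0 on the whole interval
    model = memorize(xs, f(xs))
    grid = np.linspace(0.0, 1.0, 1_000)
    err = max(abs(model(x) - f(x)) for x in grid)
    print(f"n={n:>6}: sup-norm error ~ {err:.3f}")   # shrinks toward 0 as n grows
```

So "can approximate anything in the limit" is a very low bar; it says nothing about why neural nets in particular generalize well from realistic amounts of data.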
I'm honestly unsure of the theoretical reason neural nets are so effective at language processing. But any such explanation certainly requires some characterization of the problem space (e.g. problems where the function to be learned has such-and-such a property), since there are plenty of mathematical techniques for approximating well-behaved continuous functions that far outperform neural nets at learning various classes of mathematically nice functions.
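As one example of what I have in mind (again just an illustrative sketch, with an arbitrary smooth target): classical Chebyshev interpolation nails a smooth function to near machine precision from a handful of samples, which is exactly the kind of "mathematically nice" regime where these older techniques beat a trained network.

```python
import numpy as np
from numpy.polynomial import chebyshev as C

f = lambda x: np.sin(2 * np.pi * x)          # a "mathematically nice" target on [-1, 1]
grid = np.linspace(-1.0, 1.0, 2_000)

for deg in (5, 10, 20, 30):
    coef = C.chebinterpolate(f, deg)          # interpolate at deg+1 Chebyshev points
    err = np.max(np.abs(C.chebval(grid, coef) - f(grid)))
    print(f"degree {deg:>2}: sup-norm error ~ {err:.1e}")   # drops geometrically for smooth f
```

The interesting question is what property of language-like problems puts them outside the reach of such classical methods but inside the reach of neural nets.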