Let $F: \mathbb{R}^n\to \mathbb{R}^m$ be some smooth map with components $F(x) = (f_1(x),\dots,f_m(x))^T$. The Jacobian matrix $JF(x)$ is defined to be the unique linear map satisfying
$$
\lim_{h\to 0} \frac{F(x+h) - F(x) - JF(x)h}{\|h\|}=0.
$$
This implies, among other things, that $JF(x)$ is the best linear approximation to $F$ at $x$. That is, $\Delta F = JF(x)\Delta x$ to first order in $\Delta x$. Notice that this definition immediately gives us the shape of $JF$ as a matrix with $n$ columns and $m$ rows, as it must act on $n$-vectors and output $m$-vectors. If you write out the first component of this linear approximation, we have
$$
\Delta f_1 = J_{11}\Delta x_1 + \dots + J_{1n}\Delta x_n.
$$
From multivariable calculus, we know that the coefficients of the best linear approximation to a scalar function are its partial derivatives, so we have $[JF(x)]_{ij} = \frac{\partial f_i}{\partial x_j}(x)$. That is, the $i$th row of $JF(x)$ consists of the partial derivatives of the $i$th component of $F$.
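To make the shapes concrete, here is a minimal sketch using JAX's automatic differentiation; the particular map $F:\mathbb{R}^3\to\mathbb{R}^2$ is just an illustrative choice. The computed Jacobian has $m$ rows and $n$ columns, and each row holds the partial derivatives of one component of $F$.

```python
import jax
import jax.numpy as jnp

def F(x):
    # Illustrative map F: R^3 -> R^2 with components
    # f_1(x) = x_1 * x_2 and f_2(x) = sin(x_3)
    return jnp.array([x[0] * x[1], jnp.sin(x[2])])

x = jnp.array([1.0, 2.0, 3.0])
J = jax.jacfwd(F)(x)   # forward-mode Jacobian at x

print(J.shape)  # (2, 3): m rows, n columns
print(J)
# Row 1: partials of f_1 = (x_2, x_1, 0)    -> [2., 1., 0.]
# Row 2: partials of f_2 = (0, 0, cos(x_3)) -> [0., 0., approx -0.99]
```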
Now consider the case where $F$ has only a single component $f:\mathbb{R}^n\to \mathbb{R}$. The analysis above shows that $JF(x)$ is a $1\times n$ matrix given by $JF(x) = (f_{x_1}, \dots, f_{x_n})$. Indeed, we have
$$
\Delta f = (f_{x_1}, \dots, f_{x_n}) \begin{pmatrix} \Delta x_1 \\ \vdots \\ \Delta x_n\end{pmatrix} = f_{x_1}\Delta x_1 + \dots + f_{x_n}\Delta x_n,
$$
as expected. However, we can identify this $1\times n$ matrix with the $n$-vector (column vector!) $g = (f_{x_1}, \dots, f_{x_n})^T$. This vector satisfies $JF(x)\Delta x = g\cdot \Delta x$. That is, the action of the derivative is replaced by a dot product with a fixed vector. This is exactly the defining property of the gradient, so we define $\nabla f = g = (f_{x_1}, \dots, f_{x_n})^T = JF(x)^T$.
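The same identification can be seen numerically. In the sketch below (again in JAX, with an arbitrary illustrative $f:\mathbb{R}^3\to\mathbb{R}$), `jax.grad` returns the gradient as an $n$-vector living in the input space, while treating $f$ as a map into $\mathbb{R}^1$ gives a $1\times n$ Jacobian whose single row matches the gradient and whose action on $\Delta x$ agrees with the dot product $\nabla f \cdot \Delta x$.

```python
import jax
import jax.numpy as jnp

def f(x):
    # Illustrative scalar function f: R^3 -> R
    return x[0] ** 2 + x[1] * x[2]

x = jnp.array([1.0, 2.0, 3.0])

g = jax.grad(f)(x)                              # gradient: an n-vector
J = jax.jacfwd(lambda x: jnp.array([f(x)]))(x)  # same f viewed as a map R^3 -> R^1

print(g.shape)                 # (3,)   -- an element of the input space R^n
print(J.shape)                 # (1, 3) -- a 1 x n matrix, i.e. a linear map R^n -> R^1
print(jnp.allclose(J[0], g))   # True: the single row of J is the gradient

# The action of the derivative is a dot product with g:
dx = jnp.array([0.1, -0.2, 0.05])
print(jnp.allclose(J @ dx, jnp.dot(g, dx)))     # True
```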
From this it is clear how the Jacobian and gradient differ. The Jacobian is a linear map between the same input and output spaces as $F$; it is the object singled out by the definition of the derivative for multivariable functions. The gradient is not this type of object: it is an element of the input space that can be identified with a linear map via the dot product. This distinction comes up a lot when doing multivariable calculus on large systems, as in machine learning: if you write some expressions in terms of gradients and others in terms of Jacobians, the matrices end up the wrong size and cannot be multiplied in the way the chain and product rules for multivariable functions require.
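A small example of that bookkeeping, again in JAX with illustrative functions: for a composition $g\circ F$ with $F:\mathbb{R}^n\to\mathbb{R}^m$ and $g:\mathbb{R}^m\to\mathbb{R}$, the chain rule in Jacobian form is the matrix product $J(g\circ F)(x) = Jg(F(x))\,JF(x)$, a $(1\times m)(m\times n)$ product, while in gradient form the Jacobian must be transposed: $\nabla(g\circ F)(x) = JF(x)^T\,\nabla g(F(x))$.

```python
import jax
import jax.numpy as jnp

def F(x):
    # Illustrative map F: R^3 -> R^2
    return jnp.array([x[0] * x[1], x[1] + x[2]])

def g(y):
    # Illustrative scalar function g: R^2 -> R
    return jnp.sum(y ** 2)

x = jnp.array([1.0, 2.0, 3.0])

JF = jax.jacfwd(F)(x)                                # (2, 3)
Jg = jax.jacfwd(lambda y: jnp.array([g(y)]))(F(x))   # (1, 2): Jacobian of g as a 1 x m matrix
grad_g = jax.grad(g)(F(x))                           # (2,):   gradient of g, an element of R^m

# Chain rule with Jacobians: (1 x 2) @ (2 x 3) gives the 1 x 3 Jacobian of g composed with F.
J_composite = Jg @ JF

# Chain rule with gradients: transpose the Jacobian to land back in the input space R^n.
grad_composite = JF.T @ grad_g                       # (3,)

print(jnp.allclose(J_composite[0], grad_composite))                   # True
print(jnp.allclose(jax.grad(lambda x: g(F(x)))(x), grad_composite))   # True
```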