One idea that comes to mind is having the features (inputs) be the digits of the number rather than the number itself. You might also try converting to binary beforehand and using those digits as features, or use some kind of one-hot encoding for the decimal digits. The rationale is that finding primes is a discrete task over the integers, so you don't want your model interpreting the input as a continuous value.
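For concreteness, here is one way those feature encodings could look; the function names and fixed widths are my own choices, not anything canonical:

```python
def binary_features(n, width=16):
    """Represent n as a fixed-width list of binary digits (MSB first)."""
    return [(n >> i) & 1 for i in range(width - 1, -1, -1)]

def onehot_decimal_features(n, width=5):
    """One-hot encode each decimal digit of n, zero-padded to `width` digits."""
    digits = [int(d) for d in str(n).zfill(width)]
    feats = []
    for d in digits:
        row = [0] * 10  # one slot per decimal digit 0-9
        row[d] = 1
        feats.extend(row)
    return feats
```

Either encoding gives the network a fixed-length vector of discrete values instead of a single continuous scalar.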
You might get better results if you also include the input modulo a few small primes (or just random integers) as features.
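A sketch of such residue features; the particular moduli here are an arbitrary choice for illustration:

```python
def residue_features(n, moduli=(2, 3, 5, 7, 11)):
    """n mod p for each modulus p; these could also be one-hot encoded per modulus."""
    return [n % p for p in moduli]
```

These residues are cheap to compute and directly expose divisibility information the network would otherwise have to learn from the digits.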
As for the hidden layers, pick your favorite architecture. Standard linear layers followed by ReLU activations are acceptable. As you said, the last layer would be a sigmoid activation (preceded by a linear layer, not a ReLU). You can experiment with things like batch norm if you want, although I wouldn't expect them to help much.
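The answer doesn't name a framework, so here is a minimal NumPy sketch of that architecture (in PyTorch this would just be a stack of `nn.Linear` and `nn.ReLU` ending in `nn.Sigmoid`); the initialization scale is an arbitrary placeholder:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(sizes):
    """Random weights for a stack of linear layers, e.g. sizes=(16, 32, 32, 1)."""
    return [(rng.normal(0, 0.1, (m, n)), np.zeros(n)) for m, n in zip(sizes, sizes[1:])]

def forward(params, x):
    """Linear + ReLU for every hidden layer; final layer is linear + sigmoid."""
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)          # ReLU
    W, b = params[-1]
    return 1.0 / (1.0 + np.exp(-(x @ W + b)))   # sigmoid keeps output in (0, 1)
```

Note the final linear layer feeds the sigmoid directly, with no ReLU in between, matching the setup described above.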
With this setup your training data would have the binary digits of a positive integer as features, and the target would be whether or not that integer is prime (0 or 1). Your loss would be binary cross-entropy. Train with stochastic gradient descent or a similar optimizer until convergence. Note that your training data is very imbalanced, as most numbers are not prime, so you may want to weight the positive class or oversample the primes.
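Putting the pieces together, a sketch of the dataset construction and the loss in plain Python; the `pos_weight` knob is my addition for the imbalance (it mirrors the `pos_weight` argument of PyTorch's `BCEWithLogitsLoss`):

```python
import math

def is_prime(n):
    """Trial division; fine for generating training labels."""
    if n < 2:
        return 0
    return int(all(n % d for d in range(2, int(n ** 0.5) + 1)))

def binary_digits(n, width=10):
    return [(n >> i) & 1 for i in range(width - 1, -1, -1)]

# Features: binary digits of 1..1000; targets: 0/1 primality labels.
N = 1000
X = [binary_digits(n) for n in range(1, N + 1)]
y = [is_prime(n) for n in range(1, N + 1)]

def bce(p, t, pos_weight=1.0):
    """Binary cross-entropy for one prediction p in (0, 1), target t in {0, 1}.
    pos_weight > 1 upweights primes to counter the class imbalance."""
    return -(pos_weight * t * math.log(p) + (1 - t) * math.log(1 - p))

# Only 168 of the first 1000 integers are prime, hence the imbalance.
prime_fraction = sum(y) / N
```

With roughly one positive per six examples at this scale (and the ratio worsening as the range grows), an unweighted model can score well by predicting "not prime" everywhere, which is why the weighting matters.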