Representing your input system with a state machine seems to be the way to go here. I don't see why it would get "gradually more complex each with every new spell type", because most of the spell-specific logic could be encapsulated in the object representing the spell being cast. The state machine itself could be largely spell-agnostic.
The only thing the "pick target" state needs to know about the current spell is its targeting mode (TargetType). Once the player has picked a valid target, the state just needs to pass the target and the spell on to the "cast spell" state. What the spell actually does is of no concern to the "pick target" state.
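To make that concrete, here is a minimal sketch of a "pick target" state that looks only at the spell's TargetType. The enum values, the stand-in GameObject class, the IsEnemy flag, and the onTargetPicked delegate are all assumptions made for illustration; only the name TargetType comes from the description above.

```csharp
using System;

// Hypothetical targeting modes; only the name TargetType is given, not its values.
public enum TargetType { Enemy, Ally }

// Stand-in for the engine's GameObject, reduced to what this sketch needs.
public class GameObject
{
    public string Name;
    public bool IsEnemy;
}

public class Spell
{
    public string Name;
    public TargetType TargetType;
}

public class PickTargetState
{
    private readonly Spell spell;

    // Assumption: the hand-off to the "cast spell" state is a plain delegate.
    private readonly Action<Spell, GameObject> onTargetPicked;

    public PickTargetState(Spell spell, Action<Spell, GameObject> onTargetPicked)
    {
        this.spell = spell;
        this.onTargetPicked = onTargetPicked;
    }

    // The only spell-specific input here is spell.TargetType.
    public bool IsValidTarget(GameObject candidate)
    {
        if (candidate == null) return false;
        switch (spell.TargetType)
        {
            case TargetType.Enemy: return candidate.IsEnemy;
            case TargetType.Ally: return !candidate.IsEnemy;
            default: return false;
        }
    }

    // Called by the input system on a click; returns true if the target was handed off.
    public bool OnTargetClicked(GameObject candidate)
    {
        if (!IsValidTarget(candidate)) return false;
        onTargetPicked(spell, candidate);
        return true;
    }
}
```

Adding a new spell does not touch this class. A new targeting mode only adds one case to IsValidTarget.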
The "cast spell" state wouldn't need to know much about the spell either. It would just call methods on the spell object. The spell object itself would manage its own execution and call back when it is done. What I would do is have a method public abstract void Execute(GameObject target, PickTargetState callbackState) in my Spell base class. The implementation would then do what the spell is supposed to do and call callbackState.SpellFinished() when it is done. PickTargetState.SpellFinished() would then switch to the next state.
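A rough sketch of that callback arrangement could look like the following. The Execute signature and SpellFinished() are taken from the paragraph above; the Fireball class, the Health field, the Cast method, and the switchToNextState delegate are hypothetical names added for illustration. GameObject is a small stand-in for the engine class. Because Execute takes a PickTargetState, this sketch lets that state start the spell directly; a dedicated "cast spell" state could play the same role.

```csharp
using System;

// Stand-in for the engine's GameObject (assumption: a name and hit points suffice here).
public class GameObject
{
    public string Name;
    public int Health;
}

public abstract class Spell
{
    // The spell runs itself and reports back through callbackState when it is done.
    public abstract void Execute(GameObject target, PickTargetState callbackState);
}

// Hypothetical instant spell: applies its effect, then signals completion right away.
public class Fireball : Spell
{
    public override void Execute(GameObject target, PickTargetState callbackState)
    {
        target.Health -= 10;
        callbackState.SpellFinished();
    }
}

public class PickTargetState
{
    // Assumption: the state machine exposes its transition as a plain delegate.
    private readonly Action switchToNextState;

    public PickTargetState(Action switchToNextState)
    {
        this.switchToNextState = switchToNextState;
    }

    // Hands the chosen target to the spell without knowing what the spell does.
    public void Cast(Spell spell, GameObject target)
    {
        spell.Execute(target, this);
    }

    // Called by the spell once it has finished.
    public void SpellFinished()
    {
        switchToNextState();
    }
}
```

A spell with an animation or a projectile would keep a reference to callbackState and call SpellFinished() later, for example when the projectile hits. The state code stays the same either way.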