The human voice is full of inflection: whether you're sad or happy, talking in past or present tense, trying to sound quiet or fill a room, and so on.
I would probably start by recording full sentences if the responses are predictable. Obviously, you want your users to be happy, so make that come through in your speech. You'll probably have to record each sentence a few times to get it right.
Ultimately, this is inefficient, right? Trying to predict every single scenario where your voice is needed might be difficult, unless you really know your app and only need your voice in a few spots. As far as I know, in all of the current AI systems the speech is generated. They started with base voice actors, figured out how to make a machine sound the same, and went from there. You could try to evolve toward this over time if needed.
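
If it helps, here is a rough sketch of how I'd wire that up: look for a recorded clip for the exact sentence first, and only fall back to generated speech when there isn't one. Everything here is made up for illustration (the `RECORDED_CLIPS` table, the file paths, and the `pick_audio` name); it only decides which source to use and doesn't play or synthesize anything.

```python
from pathlib import Path
from typing import Optional, Tuple

# Hypothetical table of predictable responses and their pre-recorded clips.
# The sentences and file paths are placeholders, not real assets.
RECORDED_CLIPS = {
    "Welcome back!": Path("clips/welcome_back.wav"),
    "Your order has shipped.": Path("clips/order_shipped.wav"),
}


def pick_audio(response: str) -> Tuple[str, Optional[Path]]:
    """Choose between a recorded clip and generated speech for a response.

    Returns ("recorded", path) when a clip exists for the exact sentence,
    otherwise ("generated", None) so the caller can hand the text to
    whatever speech generator it uses.
    """
    clip = RECORDED_CLIPS.get(response)
    if clip is not None:
        return ("recorded", clip)
    return ("generated", None)
```

Starting this way lets you record only the handful of sentences you're sure about, and the fallback branch is where a generated voice could take over later.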