How do you measure a language model's ability to analyze prompts?