-
Worldwide
-
Ongoing
-
Fixed rate per hour
-
Description:
This project consists of two different approaches that we will call “workflows”:
Workflow #1: Manual SxS Human Evaluation
In this task, you will see a user prompt and two AI-generated responses (from two different AI models). You will assess each response on several dimensions: Safety/Harmlessness, Writing Style, Verbosity, Instruction Following, Truthfulness, and Overall Quality. At the end, you will select which response you think is better and explain why. Finally, you will rewrite your chosen response to improve it.
Workflow #2: Quality Evaluation
In this task, you will be given an original prompt and two translated versions of it, each produced by a different LLM. You will read all three prompts (the original and both translations) and then rate each translated prompt on four aspects:
- Verbatim Accuracy
- Formatting Preservation
- Semantic Equivalence
- Extraneous Information
This workflow contains two types of tasks:
- After rating, compare both translated prompts and add a brief comment justifying your ratings.
- After rating, compare both translated prompts and rewrite the prompt in the target language.