ACCESSING YOUR SCORES ONLINE
CELPIP Test scores are available online through your CELPIP Account within 8 business days of your test date. For test takers who order Express Rating, scores are available within 3 business days of the test date. Please note that business days do not include weekends or holidays. You will receive an email notification once your scores are available. Your CELPIP Test scores can be accessed and viewed online in your CELPIP Account for a period of 2 years from your test date.
Two print copies of your CELPIP Official Score Report will be mailed to your registered address via Canada Post. The first two CELPIP Official Score Reports are included in your test registration fee. Additional Official Score Reports can be ordered through your CELPIP Account for a fee of $20.00 CAD plus tax. Priority shipping may also be requested for an additional fee; shipping prices vary depending on your address.
Additional score reports are only available for tests that have been taken within the last 2 years. The length of time that your scores are considered valid by various institutions, however, is determined by their individual policies. You can obtain this information from these institutions.
Please note that the fees charged for additional CELPIP Official Score Reports are non-refundable.
UNDERSTANDING YOUR SCORES
Each component of the CELPIP-General Test and the CELPIP-General LS Test is given a CELPIP level. Below is a chart of each CELPIP level and its corresponding description. Since the CELPIP test scores have been calibrated against the Canadian Language Benchmark (CLB) levels, we have included the CLB level equivalencies for your information.
| CELPIP LEVEL | CELPIP DESCRIPTOR | CLB LEVEL |
|---|---|---|
| 12 | Advanced proficiency in workplace and community contexts | 12 |
| 11 | Advanced proficiency in workplace and community contexts | 11 |
| 10 | Highly effective proficiency in workplace and community contexts | 10 |
| 9 | Effective proficiency in workplace and community contexts | 9 |
| 8 | Good proficiency in workplace and community contexts | 8 |
| 7 | Adequate proficiency in workplace and community contexts | 7 |
| 6 | Developing proficiency in workplace and community contexts | 6 |
| 5 | Acquiring proficiency in workplace and community contexts | 5 |
| 4 | Adequate proficiency for daily life activities | 4 |
| 3 | Some proficiency in limited contexts | 3 |
| M | Minimal proficiency or insufficient information to assess | 0, 1, 2 |
| NA | Not Administered: test taker did not receive this test component | / |
The computer automatically scores the Listening and Reading Tests. Responses to the Writing and Speaking Tests are scored by trained and experienced CELPIP Raters.
SPEAKING PERFORMANCE STANDARDS
1. Content/Coherence
2. Vocabulary
3. Listenability
4. Task Fulfillment
WRITING PERFORMANCE STANDARDS
1. Content/Coherence
- Number of ideas
- Quality of ideas
- Organization of ideas
- Examples and supporting details
2. Vocabulary
- Word choice
- Suitable use of words and phrases
- Range of words and phrases
- Precision and accuracy
3. Readability
- Format and paragraphing
- Connectors and transitions
- Grammar and sentence structure
- Spelling and punctuation
4. Task Fulfillment
- Word count
APPROXIMATE SCORES AND CELPIP LEVELS FOR THE LISTENING TEST
| CELPIP LEVEL | LISTENING SCORE /38 |
Disclaimer: This example chart shows how raw scores in the Listening Test approximately correspond to CELPIP Levels. Since questions may have different levels of difficulty and may therefore be equated differently, the raw score for a certain level may vary slightly from one test to another.
APPROXIMATE SCORES AND CELPIP LEVELS FOR THE READING TEST
SCORING THE CELPIP READING AND LISTENING COMPONENTS
The goal of any test is to provide a fair and accurate assessment of each test taker, regardless of the specific questions presented during the test. Though CELPIP tests are assembled following guidelines for content and difficulty, test forms can still vary slightly in difficulty. Since a raw score is simply the number of questions a test taker has answered correctly, it cannot account for these slight variations. Consequently, a raw score of 30 will not necessarily have the same meaning across different forms of a test, which makes different test takers' raw scores hard to interpret and compare.
In order to account for differences between test forms, Paragon transforms test takers’ raw scores into a scaled score. Scaled scores adjust raw scores in a consistent way so that test takers’ scores on different test forms can be compared.
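The document does not specify Paragon's equating method, but the idea of a scaled score can be illustrated with a simple linear equating sketch. The slope and intercept below are hypothetical, form-specific constants, not Paragon's actual values:

```python
def scale_score(raw_score, slope, intercept, min_scaled=0, max_scaled=100):
    """Transform a raw score into a scaled score via linear equating.

    slope and intercept are form-specific equating constants
    (hypothetical values here); clamping keeps the result on the
    reporting scale.
    """
    scaled = slope * raw_score + intercept
    return max(min_scaled, min(max_scaled, round(scaled)))

# Hypothetical: the easier form gets a lower slope, so the same raw
# score of 30 maps to a lower scaled score than on the harder form.
harder_form = scale_score(30, slope=2.6, intercept=10)  # 88
easier_form = scale_score(30, slope=2.4, intercept=10)  # 82
```

In this sketch, equal raw scores on forms of different difficulty yield different scaled scores, which is exactly the adjustment that makes scores comparable across forms.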
HOW ARE CELPIP LEVELS DETERMINED?
After the Reading and Listening scaled scores have been determined, they are transformed into a CELPIP level. The transformation rules were established by English language experts who participated in a standard setting exercise. Standard setting is an extensive, research-based process. Language experts work with testing professionals to identify what language learners need to be able to do at each performance level, such as CLB 8. The experts then analyze the test in detail and determine what level of performance a test taker needs to demonstrate for each CELPIP level. This process has established a defensible link between each Reading and Listening scaled score and its corresponding CELPIP level.
HOW RELIABLE ARE CELPIP READING AND LISTENING SCORES?
There are many ways to measure the reliability of a test. One widely used measure is Cronbach’s alpha, which gauges the internal consistency of a test form. Its value is at most +1, where +1 indicates perfect internal consistency; values near zero or below indicate a lack of consistency. A result of 0.80 or higher is generally considered excellent. For both the CELPIP Reading and Listening components, test forms have an average Cronbach’s alpha of 0.88, indicating that they demonstrate excellent internal consistency.
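As a sketch of how Cronbach’s alpha is computed from item-level responses (the data below are made up purely for illustration and have no connection to real CELPIP forms):

```python
def cronbach_alpha(item_scores):
    """Cronbach's alpha for a test form.

    item_scores: one row per test taker, one column per question
    (e.g. 1 = correct, 0 = incorrect).
    """
    k = len(item_scores[0])            # number of items on the form

    def variance(xs):                  # population variance
        mean = sum(xs) / len(xs)
        return sum((x - mean) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(variance([row[i] for row in item_scores])
                        for i in range(k))
    total_var = variance([sum(row) for row in item_scores])
    return k / (k - 1) * (1 - sum_item_vars / total_var)

# Made-up responses from four test takers on a three-item form:
responses = [[1, 1, 1], [1, 1, 0], [0, 1, 0], [0, 0, 0]]
print(cronbach_alpha(responses))  # 0.75 for this toy data
```

Alpha rises when test takers who do well on one item tend to do well on the others, which is what "internal consistency" means in practice.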
SCORING THE CELPIP SPEAKING AND WRITING COMPONENTS
Paragon employs raters with the following qualifications:
- Teaching & Assessment Experience
Raters receive ongoing training to ensure that the scoring criteria are consistently and systematically applied by all raters, and to minimize potential bias introduced by human judgment.
- Initial Rater Training
All raters attend an initial training program to guide them through Paragon’s rating approach. After completing a training manual, exercises, and rating samples, trainees engage in a certification process during which they rate 3-6 certification sets. In order to certify, trainees must achieve a minimum 80% agreement with the official score assigned to each performance in at least three consecutive sets. Only certified raters can start operational rating.
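The certification rule described above (at least 80% agreement on three consecutive sets) can be sketched as a simple check. The function and data are illustrative only, not Paragon's actual tooling:

```python
def is_certified(set_agreements, threshold=0.80, streak_needed=3):
    """Check the certification rule: agreement of at least `threshold`
    on `streak_needed` consecutive certification sets.

    set_agreements: fraction of ratings matching the official scores,
    one value per certification set, in the order rated.
    """
    streak = 0
    for agreement in set_agreements:
        streak = streak + 1 if agreement >= threshold else 0
        if streak >= streak_needed:
            return True
    return False

is_certified([0.85, 0.82, 0.90])        # three consecutive passes
is_certified([0.85, 0.70, 0.82, 0.90])  # streak broken by the 0.70 set
```

The streak resets whenever a set falls below the threshold, so passing sets must be consecutive, not merely three out of the total.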
- Operational Rater Training
To maintain a shared perspective on relevant rating principles and criteria, all operational raters receive ongoing in-service training and monitoring, including:
- Weekly feedback on their agreement with other raters
- Weekly sample performances rated by expert raters
- Biweekly in-depth training materials in the form of communications hosted online, including a range of sample performances rated and justified by expert raters
- Detailed notes from rater seminars, in which challenging responses are discussed and rated by expert raters
- Rater Monitoring
Rater performance is analyzed monthly to monitor the reliability of the rater pool and to identify raters with unsatisfactory rater agreement. Underperforming raters receive personalized feedback on rating samples that show a significant discrepancy between their ratings and the benchmark ratings. Additional samples are provided upon request by a rater. Once identified as underperforming, a rater must demonstrate improvement within 8 weeks. If an underperforming rater does not improve enough to meet Paragon’s rating standards within that period, Paragon may terminate their rating contract.
All tests are randomly assigned to raters by an online system, and test taker anonymity is maintained at all times. Each test taker’s performance (their responses to all tasks in the component) is assessed by multiple raters: each CELPIP speaking performance is rated by a minimum of three speaking raters, and each CELPIP writing performance is rated by a minimum of four writing raters. Raters work independently of one another and have no knowledge of the ratings assigned by other raters.
- Rating criteria
The rating dimensions developed for the Writing and Speaking components are listed above on this page in the Performance Standards sections:
Speaking: Content/Coherence, Vocabulary, Listenability, and Task Fulfillment
Writing: Content/Coherence, Vocabulary, Readability, and Task Fulfillment
Each dimension is divided into five performance levels. Performance descriptors are provided for each level in each dimension. Raters assign a level in each dimension by identifying tangible evidence in the test taker’s performance that matches the descriptors in the rating scale.
When the ratings of a test taker’s performance are complete, they are inspected for agreement. If the ratings are in disagreement, a benchmark rater is automatically assigned to assess the performance. All benchmark raters are experienced raters who have demonstrated consistent accuracy and reliability in rating. Benchmark raters have no knowledge of the initial ratings.
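A minimal sketch of this adjudication step, assuming a hypothetical tolerance of one level between the highest and lowest rating (the actual agreement rule is not published):

```python
def needs_benchmark_rater(ratings, max_spread=1):
    """Flag a performance for benchmark review when raters disagree.

    ratings: the level each independent rater assigned on one dimension.
    max_spread is a hypothetical tolerance, not Paragon's actual rule.
    """
    return max(ratings) - min(ratings) > max_spread

needs_benchmark_rater([4, 4, 5])  # adjacent levels: ratings accepted
needs_benchmark_rater([3, 5, 4])  # spread of 2: benchmark rater assigned
```

Under this sketch, adjacent-level ratings count as agreement, while a wider spread triggers an independent benchmark rating, mirroring the process described above.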
- How is the final score determined?
The Speaking and Writing component scores are derived from the dimensional ratings assigned by the raters. These scores are then transformed into a CELPIP level. The transformation rules have been established by English language experts who participated in a standard setting exercise. Standard setting is an extensive, research-based process. Language experts work with testing professionals to identify what language learners need to be able to do at each performance level, such as CLB 8. The experts then analyze the test in detail and determine what level of performance a test taker needs to demonstrate for each CELPIP level. This process has established a defensible link between each Speaking and Writing component score and its corresponding CELPIP level.