From Kitchen Scales to People Scales: The Science of Reliable HR Measurement
An evidence-based discussion on measuring skills and performance
In the interest of providing multiple means of engaging with this blog, below is a podcast-style conversation featuring our AI friends Johnny and Joanne. I have also recorded a voice-over of the original blog, or you can read the usual written format below.
Would you bake a cake using broken scales? Probably not. Yet many HR practitioners and leaders use equally unreliable tools to make critical decisions about people's careers. From recruitment to development and rewards, we rely heavily on measurements - but how reliable are they really?
Think about your morning routine. You check the weather app to decide what to wear, your maps app to plan your route, maybe even your fitness tracker to monitor your steps. You trust these tools because they're consistently reliable. But can you say the same about the tools you use to assess your team's performance, potential, or skills?
The last year has seen significant progress within the skills-based organisation. Some of the most exciting elements come from the potential ability to measure the value (or price) of a skill; more on that in a future blog (Stephany & Teutloff, 2024). But this progress has predominantly been within the technology and analytics space.

For those wondering, what is a skills-based organisation (SBO)? A skills-based organisation goes beyond merely measuring skills: it uses skills as the organising architecture for managing work. A traditional architecture is organised around professions, or job families, e.g. Finance, HR or Engineering. In my experience, these functional silos make organising work hard, make mobility between groups hard, and many of the critical skills cut across these legacy groupings. Instead, skills-based organisations will look to leverage the growth in data and analytics to organise work around skills rather than job families. For example, data analytics is a skill held in Finance, HR and Engineering, while change management is often a skill without a functional home. This will not be an overnight change, but in 10 years' time your skills, as much as the professional body you belong to, will dictate your work.
Whilst the progress in technology and analytics has been immense, the question still not answered is how we reliably measure a skill, and how we know that what we are measuring actually underpins performance (validity). How do we know that Johnny's level 2 in data analytics is the same as Joanne's level 2 in data analytics, and how do we know that improving Johnny or Joanne to a level 3 in data analytics will improve performance in a role? And is it all of data analytics that matters, or just a sub-skill within it?
This reliability and validity challenge is a tale as old as time. Anyone who has ever tried to measure something knows it, whether they are baking a cake, recruiting a candidate or selecting a psychometric tool for a team development session. So what are reliability and validity?
Reliability
Reliability refers to the stability/consistency of measurement. Is 100 grams of flour measured consistently between and within different scales? Or is an extroverted employee consistently described as extroverted by different people (inter-rater reliability) and at different points during the day, week, month or year (test-retest reliability)?
Validity
Validity refers to how well the measure actually measures what it claims to measure (internal validity) and how well this measure actually predicts performance (external validity). Is an extroverted person actually an extroverted person (internal validity), and is extroversion predictive of leadership (external validity)?
With most iterations of a skills-based measure, there are two variables whose reliability and validity need to be considered. The first is the skill itself; the second is the level of proficiency. This is because most skills-based measures will rate the skill on an incremental scale (1 - no knowledge, 2 - novice, 3 - practitioner, 4 - expert). This is like your home scales measuring both the weight and the ingredient. Is it 100 grams and is it flour (reliability), and does increasing or decreasing the amount of flour affect the cake as we expect (validity)?
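To make those two variables concrete, here is a minimal sketch (with hypothetical names and a generic Python structure, not any particular vendor's schema) of what a single skill rating record might capture: the skill being claimed and the proficiency level on the 1-4 scale above.

```python
from dataclasses import dataclass
from enum import IntEnum


class Proficiency(IntEnum):
    # The incremental scale from the paragraph above
    NO_KNOWLEDGE = 1
    NOVICE = 2
    PRACTITIONER = 3
    EXPERT = 4


@dataclass
class SkillRating:
    employee: str       # who is being rated
    skill: str          # what is being measured (the "ingredient")
    level: Proficiency  # how much of it (the "weight")
    rater: str          # who assigned the rating (matters for reliability)


# Hypothetical example using the blog's own characters
rating = SkillRating(employee="Johnny", skill="Data analytics",
                     level=Proficiency.NOVICE, rater="Manager A")
```

Both fields have to be trustworthy: the skill label (is it really data analytics?) and the level (is Johnny really a 2?), which is exactly where reliability and validity come in.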
A leading-edge skills measurement approach needs to be able to show how it is tackling these issues. This is not to say any first iteration needs to be perfect; the first iteration of any skills-based organisation may be "level 1, novice" by its own measure of proficiency. What is critical is knowing where it might have lower reliability or validity, and then articulating in the tool's roadmap the development that moves it up to "level 4, expert". A world-leading skills measurement tool will have to evidence industry standard-setting levels of reliability and validity, not just fancy tech and maths.
This is again where an evidence-based approach can help organisations improve the reliability and validity of their tools. Using the four sources of evidence (internal data, external research, other practitioners and stakeholder views), practitioners can continuously and incrementally tackle the reliability and validity challenges, answering the all-important question:
where would improving the reliability and validity of our skills measurement most improve our ability to execute our strategy?
I will now dive into the different elements all HR practitioners should be aware of when selecting any measurement tool, whether that is for recruiting someone, selecting a psychometric tool or choosing a skills measurement tool. HR practitioners don't need to be able to answer these questions directly, but they should know who they can ask for assurance (I/O psychologists). A further explanation of the terms is below, in relation to a skills measurement tool. You can return to this section in the future as a reference for assessing your own tools.
Reliability
Inter-rater reliability. Manager ratings and self-ratings are, by themselves, known to be fairly unreliable; this is known as the idiosyncratic rater effect. Broadly speaking, a manager's rating of an employee is more a measure of the manager than of the employee. What this means is that two different employees may be rated more similarly to each other by the same manager than the same employee is by two different managers.
"Over the last fifteen years a significant body of research has demonstrated that each of us is a disturbingly unreliable rater of other people’s performance. The effect that ruins our ability to rate others has a name: the Idiosyncratic Rater Effect, which tells us that my rating of you on a quality such as “potential” is driven not by who you are, but instead by my own idiosyncrasies—how I define “potential,” how much of it I think I have, how tough a rater I usually am. This effect is resilient — no amount of training seems able to lessen it. And it is large — on average, 61% of my rating of you is a reflection of me." (Buckingham, 2015)
Test-retest. This is how much a rating varies depending on when it is taken by the same rater, all other things being equal. The industry standard is .70, meaning an assessment should correlate at around .70 with itself across time points. Job performance ratings for low-complexity jobs have shown test-retest reliability of up to .83, with high-complexity jobs showing only .50 (Sturman et al., 2005). In other words, performance ratings for complex jobs correlated only .50 with themselves a year later, holding all else equal. Not hugely reliable as a consistent measure of job performance.
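As a rough illustration of what that benchmark looks like in practice, here is a minimal sketch of computing test-retest reliability as the correlation between the same rater's ratings of the same employees at two points in time. The ratings below are invented purely for illustration.

```python
import numpy as np

# Hypothetical proficiency ratings (1-4) for eight employees,
# given by the same manager a year apart
ratings_t1 = np.array([3, 2, 4, 3, 1, 4, 2, 3])
ratings_t2 = np.array([3, 3, 4, 2, 1, 4, 2, 4])

# Test-retest reliability as the correlation between the two time points
test_retest = np.corrcoef(ratings_t1, ratings_t2)[0, 1]
print(f"Test-retest reliability: {test_retest:.2f}")
print("Meets the ~.70 benchmark" if test_retest >= 0.70 else "Below the ~.70 benchmark")
```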
Both types of reliability can be improved by using multi-source ratings; although variance is still present, it is reduced (Hoffman et al., 2010). This then becomes a trade-off between quality and quantity of data: where are you happy to have higher quantities of less reliable data from a single rater and ratee, versus more reliable data from multiple sources (e.g. 360 feedback) that take more time to collect? Based on the evidence, the more complex a role, or the more a skill is linked to strategic differentiation, the more the focus should be on increasing the number of rating sources to achieve higher reliability.
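That quality-versus-quantity trade-off can be roughly quantified with the Spearman-Brown prophecy formula, a standard psychometric rule of thumb (not one cited by the sources above) for estimating how much reliability improves as you average across more raters. A minimal sketch, with illustrative numbers:

```python
def spearman_brown(single_rater_reliability: float, k: int) -> float:
    """Estimated reliability of the average of k parallel raters."""
    r = single_rater_reliability
    return (k * r) / (1 + (k - 1) * r)


single = 0.50  # e.g. a single-source rating of a highly complex role
for k in (1, 2, 4, 6):
    print(f"{k} rater(s): estimated reliability = {spearman_brown(single, k):.2f}")
```

With a single-rater reliability of .50, averaging four independent raters would be expected to lift reliability to roughly .80. Treat this as an upper bound: the formula assumes raters are interchangeable, which the idiosyncratic rater effect suggests they are not entirely.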
Validity
This is made up of internal and external validity.
Internal Validity focuses on how well the skills measurement tool measures what it's supposed to measure within its specific context. If it is measuring coding skills, does the assessment actually reflect coding ability rather than just theoretical knowledge (armchair experts)?
In terms of the proficiency levels, we should also be aware of what's known as the "rater competency threshold" issue. Here's how it typically manifests:
1. Ceiling Effect:
A Level 2 (novice) rater might struggle to accurately distinguish between Level 3 (practitioner) and Level 4 (expert) performances
They might rate everything above their own level as simply "Level 3" because they can't recognise expert-level nuances. For instance, I might not be able to tell the difference between a semi-professional tennis player and a professional tennis player; I might rate them all as "expert". But a semi-professional tennis player will be able to distinguish the differences (a small simulation of this ceiling effect follows this list).
2. Dunning-Kruger Impact:
Less skilled raters might overestimate their ability to assess others.
More skilled raters (Level 3-4) might underestimate others due to higher standards.
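Here is a toy simulation of the ceiling effect described above. It assumes a rater can recognise performance up to roughly one level above their own, and that everything beyond that collapses into a single rating; the "+1" recognition window is an illustrative assumption, not an empirical finding.

```python
def observed_rating(true_level: int, rater_level: int) -> int:
    # A rater cannot distinguish performances more than one level above
    # their own, so those performances all receive the same capped rating.
    return min(true_level, rater_level + 1)


rater_level = 2  # a novice-level rater
for true_level in (1, 2, 3, 4):
    print(f"True level {true_level} -> rated as level {observed_rating(true_level, rater_level)}")
# Both level 3 and level 4 performances come back as "level 3",
# exactly the flattening described in the ceiling effect above.
```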
To improve internal validity given these constraints, you might (a small sketch of the first two points follows this list):
1. Set Rater Qualifications:
Ensure raters are at least one level above what they're assessing
Use expert panels for higher-level assessments
2. Create Objective Evidence Requirements:
Define specific, observable behaviours for each level
Require concrete examples/evidence for ratings
3. Implement Calibration Mechanisms:
Regular calibration sessions among raters
Use of rubrics with detailed descriptors
Multiple rater perspectives for higher levels
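As a sketch of how points 1 and 2 might be enforced in practice; the function and field names below are illustrative, not any real tool's API.

```python
def can_rate(rater_level: int, claimed_level: int, evidence: list[str]) -> bool:
    """Hypothetical gate combining the rater-qualification and evidence rules."""
    qualified = rater_level >= claimed_level + 1  # at least one level above the claim
    has_evidence = len(evidence) > 0              # concrete examples required
    return qualified and has_evidence


print(can_rate(rater_level=4, claimed_level=3, evidence=["Led the churn model build"]))  # True
print(can_rate(rater_level=2, claimed_level=3, evidence=["Led the churn model build"]))  # False
# Note: level-4 claims can never satisfy the "+1" rule on a 1-4 scale,
# which is where the expert-panel route above would come in.
```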
External Validity concerns how well these measurements apply beyond your immediate testing environment. Do they actually predict performance (e.g. does IQ predict job performance, or does extroversion predict leadership)? This again covers both the proficiency level and the skill itself.
For example, imagine you've developed a leadership assessment:
Strong internal validity would mean it accurately measures leadership capabilities in your test environment
Strong external validity would mean these measurements actually predict leadership performance in real workplace situations
External validity is normally established by identifying referent performers: those who perform best in their field. By determining the shared characteristics that differentiate the best from the rest, you can check whether your measure captures what actually matters. This is not to say there is a single 'best profile' for any one role; even within the same context, there may be many ways to configure a set of skills to achieve the same outcome (Mason et al., 2015). For instance, someone might have low coding skills but high collaboration, enabling them to leverage the skills of others to achieve the outcome. This is where diversity comes in, specifically diversity of thought. With managers configuring the skills, how do we guide them towards the configurations that most correlate with diverse high performers, rather than what they personally value or what looks most like them? Again, using referent performers and job analysis should increase external validity, whilst also looking out for adverse impact on those with cognitive diversity.
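A minimal sketch of checking external (predictive) validity for one skill: correlate the proficiency ratings with a later performance outcome and see how strong the relationship is. All numbers below are invented for illustration; a real validation study would also consider sample size, range restriction and adverse impact.

```python
import numpy as np

# Hypothetical proficiency ratings (1-4) and a later performance score
skill_level = np.array([2, 3, 4, 1, 3, 4, 2, 3])
performance = np.array([55, 70, 82, 48, 66, 90, 60, 74])

# Predictive validity as the correlation between rating and later outcome
validity = np.corrcoef(skill_level, performance)[0, 1]
print(f"Predictive validity (correlation with later performance): {validity:.2f}")
```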
So as a leader or HR practitioner, have a think about the tools you are using to measure people, whether that is for the end-of-year performance review, the interview for that open position, or the psychometric tool used to select or develop your team. What evidence do you have of its reliability and suitability (validity)? You wouldn't use unreliable scales to cook your family dinner, so why would you use unreliable tools to select and develop your teams?
Buckingham, M. (2015, February 9). Most HR Data Is Bad Data. Harvard Business Review. https://hbr.org/2015/02/most-hr-data-is-bad-data#
Hoffman, B., Lance, C. E., Bynum, B., & Gentry, W. A. (2010). Rater source effects are alive and well after all. Personnel Psychology, 63(1), 119–151. https://doi.org/10.1111/j.1744-6570.2009.01164.x
Mason, P. H., Domínguez D., J. F., Winter, B., & Grignolio, A. (2015). Hidden in plain view: degeneracy in complex systems. Biosystems, 128, 1–8. https://doi.org/10.1016/j.biosystems.2014.12.003
Stephany, F., & Teutloff, O. (2024). What is the price of a skill? The value of complementarity. Research Policy, 53(1). https://doi.org/10.1016/j.respol.2023.104898
Sturman, M. C., Cheramie, R. A., & Cashen, L. H. (2005). The impact of job complexity and performance measurement on the temporal consistency, stability, and test-retest reliability of employee job performance ratings. Journal of Applied Psychology, 90(2), 269–283. https://doi.org/10.1037/0021-9010.90.2.269