Data science is an infamously nebulous term. It can, depending on the organization, involve building dashboards, writing data pipelines, developing metrics, conducting causal inference analyses, doing ad-hoc analytics work, pulling numbers for stakeholders who don’t know SQL, building ML models offline, deploying those ML models to production, writing data tests and monitoring data “drift”, running A/B tests, or simply being a more quantitatively-minded PM.

At “big tech” companies, however, the term is somewhat less vague — particularly if we focus our attention on “product data scientists” (as opposed to, say, marketing, ads, or other specializations). In my brief (two-year) stint at Spotify, I was a product data scientist. I supported an engineering team that built new playlists and improved existing ones. More generally, here is my definition of what product data scientists do: they use quantitative methods (analytics, statistics, and, occasionally, machine learning) to help their engineering and product partners launch “better” products on a faster timeline.

I have an embarrassing admission to make, though. Much of my time at Spotify was not spent doing those things! And this was not because I was particularly bad at my job; the same could be said of many of my colleagues. Avoiding that trap, I think, is the key to becoming a great product data scientist.

Replacement-level

There is a concept in baseball analytics/sabermetrics known as “WAR”: “wins above replacement”. It tries to quantify how valuable a player is to a team compared to a “replacement-level” player: roughly, the freely available bench or minor-league substitute a team could call up at any time. Most players, of course, are valuable in some absolute or objective sense. Having them is better than having no one at all. But many players are not valuable in a relative, or “replacement-adjusted”, sense. They do not generate an extra hit, RBI, or defensive play beyond what a mediocre replacement would produce.

Here is my mental model of a “replacement-level” product data scientist.

  1. Experimentation: They are handed information about an A/B test after it is run. If they are lucky, they have access to a test specification that clearly lays out the test hypothesis, key metrics, and decision tree; more likely, that information was never written down, so they talk to the product manager or engineering lead for context on what was run and why. They compute the metrics they believe to be important and run standard tests to measure statistical significance (a minimal sketch of such a test appears after this list). They summarize these numbers in a short presentation and deliver it at a team meeting. If the numbers appear promising, they usually stop there and everyone pats each other on the back. If not, the data scientist is tasked with further analysis. The entire process takes a few days, or, in rare cases, up to a week or two.
  2. Analytics: The product manager formulates a hypothesis about how a feature can be made better, but they lack the ability to interrogate the data themselves. The DS in question reformulates the hypothesis into something tractable: a SQL query or R/Python analysis. The cycle repeats, several times, until the PM has the quantitative evidence needed to support their beliefs. This is called “data-driven decision making”.
  3. Dashboarding: Some stakeholder wants to understand why a metric is moving in a particular direction. She tasks a data scientist with adding “cuts” to this metric, in a dashboard, so that she can monitor for which groups the metric is changing. The data scientist rewrites the SQL pipelines and dashboard extracts to expose this new, finer level of granularity. Doing so introduces additional signal and, of course, additional noise. It leads to yet more questions and yet more dashboarding tasks.
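
To make the experimentation item concrete, here is a minimal sketch of the kind of “standard test” I mean: a two-proportion z-test on a made-up conversion metric. The counts, the metric, and the choice of statsmodels are illustrative assumptions, not a prescription.

```python
# Minimal sketch of a "standard" A/B significance check: a two-proportion
# z-test on an invented conversion metric. All numbers are hypothetical.
from statsmodels.stats.proportion import proportions_ztest

# Hypothetical results: users who completed the action / users exposed.
conversions = [10_250, 10_580]   # [control, treatment]
exposures = [200_000, 199_500]

z_stat, p_value = proportions_ztest(count=conversions, nobs=exposures)
lift = conversions[1] / exposures[1] - conversions[0] / exposures[0]

print(f"absolute lift: {lift:.4%}, z = {z_stat:.2f}, p = {p_value:.4f}")
```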

My previous descriptions undoubtedly sound negative and cynical. You might be asking: why do companies, particularly big tech companies, pay data scientists so handsomely for so little “value added”? And, relatedly, how might we, as aspiring “above-replacement” product data scientists, contribute in ways that prove our “WAR”?

But before I get there, I want to make a few points.

First, not all of the work I previously described is dumb. Sometimes it’s necessary. Dashboards can provide valuable insights; it is important to understand user behavior via exploratory data analysis; and, of course, someone has to confirm that our feature improved user metrics in a statistically significant way. The problem arises when these tasks comprise most, or all, of what a data scientist does. If you find yourself getting sucked into nonstop cycles of A/B test analysis, Tableau tickets, or SQL monkey work, you should stop and reevaluate whether you’re actually contributing to the team’s success, or simply mindlessly doing what you feel comfortable with.

Second, you might have noticed that my replacement-level data scientist was preternaturally passive. They received requests: for A/B test analysis, for SQL queries, for dashboarding. They were not an active participant in their own work. They did not generate hypotheses; they simply confirmed them. They did not think independently; their thinking was dependent on, and downstream of, someone else’s. To me, this “service”/”help-desk” model of data science is why data science often provides so little value, and, relatedly, why so many product and engineering teams demonstrate little to no impact on company metrics. At least at Spotify, engineers and PMs have an almost unlimited number of ideas and, correspondingly, a laundry list of A/B tests to run. It takes someone with strong product sense, quantitative acumen, and knowledge of experimentation strategy to help determine which ideas are truly worth investing in and which should be deprioritized. Someone like… a product data scientist!

Third, it takes a lot of investment, from both the data scientist and the team they work with, to make the data scientist a genuine “thought partner” rather than a help-desk statistician. Data scientists should attend sprint planning meetings. They should understand the problem space deeply. They should be in frequent conversation with the PM and the tech lead. They should advise on feature development and A/B tests before those features are developed and those tests are run, not after. They should be one of the primary contributors to the written specification that constitutes the A/B test plan (the “test spec”).

The earlier a data scientist can enter the cycle of “hypothesis generation → feature development → experimentation”, the more impact they can have. And the more business knowledge and problem-space understanding they bring, the more effective they can be. All of this is costly. It requires, if not a “fully embedded” model, at least a partially embedded one. (By embedding, I mean an arrangement in which the data scientist sits within a product team rather than in a cross-functional service organization.) It increases the communication burden and places even more importance on sharing information. It introduces more roadblocks to “moving fast” and firing off as many experiments as possible (although I’d argue that moving fast is often overrated). And, worst of all, it takes time away from the “broader scope” activities a data scientist might want to do.

I want to spend some time on that last point. At most companies, individual contributors are incentivized away from doing “local” work — work focused on their immediate team — and towards doing “global” work — work focused on making other practitioners in their field better, or helping their department or the company at large. In fact, it is often a hard requirement for a “staff-level” promotion that the individual contributor has demonstrated broad impact and developed some tool that people across the organization use. Naturally, as a data scientist trying to get promoted, you might gravitate towards building such tools instead of helping your immediate team run better tests and make better decisions. After all, it is rare that a well-run A/B test (or, even better, an A/B test you avoided running) makes it into your promotion packet. But, in my view, that kind of work is one of the preeminent signs of a “high-WAR” product data scientist. There is often no good way to resolve the tension between the work you should be doing, for the sake of the business, and the work you feel compelled to do, for the sake of your career. My (dismal) advice is that, if you find yourself in such a situation and feel at a dead end, you should leave for another job.

Above replacement-level

If data scientists shouldn’t be mired in replacement-level activities, what should they be doing instead, or in addition? Below, I share three ideas from my experience at Spotify.

Inform product and testing strategy

As I argued above, data scientists are uniquely equipped to contribute to the product and testing strategy. They should not simply run post-hoc test analyses; they should also advocate for which tests should be run, and which ones shouldn’t be. Data scientists should also remember that they are not simply another voice in the room: between product managers, mid-level managers, product directors, and company leadership, there are already enough (too many?) of those. Instead, a data scientist should use their unique combination of skills — business/domain knowledge, quantitative research, and experimentation strategy — to influence the product and testing roadmap.

Many tests that are run, and feature iterations that are worked on, are unlikely to make a big difference to metrics. In my experience, this is often obvious in hindsight, but, with careful thinking, such tests can also be identified in advance. There are three techniques I’ve found especially useful.

  1. Opportunity sizing/funnel analysis. Identify the user funnel in advance of the test, and compute the drop-off at each stage. Understand which stage of the funnel the feature improvement affects, and whether, even if it meets its goals, the overall metrics will improve. A large (relative) impact at the bottom of the funnel might translate into a small (absolute) impact overall (see the opportunity-sizing sketch after this list). Even better is to do this analysis during quarterly planning, where it can inform longer-term strategy as opposed to week-to-week sprint work.
    1. The same idea holds for other sorts of changes, like internationalization, improvements for new users, improvements for power users, etc. What percentage of the userbase is being affected? Is it enough to care?
  2. Thinking in distributions. Many metrics in tech follow highly skewed, power-law-type distributions. A small percentage of users can contribute disproportionately to an “uncapped” (unbounded) metric such as “streaming time” (see the distribution sketch after this list). Understand which part of the distribution your feature is expected to impact. Often, large changes to uncapped metrics come from improvements affecting a small slice of the user base, but, perversely, these are also the users who derive the most value from the service and are therefore the least likely to churn, so a big metric movement can overstate the business impact.
  3. Offline analysis. One fascinating area of data science involves trying to predict the outcomes of tests before they are run; it marries statistics with machine learning. The basic idea is to create a domain-specific ML model for some user outcome we care about. Suppose we are trying to predict listening time for a song in a personalized playlist. Our features might be things like “whether the user has listened to the song before” and “how popular the song is”. If we make a change to, say, boost lesser-known artists on the platform (which might be good for the business, for separate reasons), we can estimate, at least roughly, the kind of impact this change will have on “listening time” by comparing the ML model’s predictions for the boosted and unboosted playlists (see the offline-analysis sketch after this list).
    1. If we know that certain pairs of metrics are causally related — for example, if we understand the relationship between listening time and retention, or lifetime value — then we might be able to make even stronger statements about how a particular change will affect business goals (again, before the test is run). We might realize, for example, that the current level of boosting will unduly harm our revenue, and recalibrate it instead of wasting a cycle of testing.
    2. Data scientists can play a role both in creating the ML model and in establishing the relationships between metrics (using techniques like “causal meta-mediation analysis”).
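
Here is a minimal opportunity-sizing sketch. The funnel stages, conversion rates, and the assumed 10% lift are all invented; the point is only to show how a respectable relative lift at one stage translates into a small absolute change overall.

```python
# Hypothetical opportunity-sizing sketch: how much can a lift at one
# funnel stage move the overall metric? Stage names and rates are invented.
funnel = {                      # conversion rate from the previous stage
    "saw_playlist": 0.60,
    "clicked_playlist": 0.25,
    "streamed_30s": 0.55,
}

def reach(rates):
    """Fraction of all users who make it through every stage."""
    total = 1.0
    for rate in rates.values():
        total *= rate
    return total

baseline = reach(funnel)

# Suppose the feature improves the click-through stage by 10% (relative).
improved = dict(funnel, clicked_playlist=funnel["clicked_playlist"] * 1.10)

print(f"baseline reach:  {baseline:.2%}")                     # ~8.3% of users
print(f"improved reach:  {reach(improved):.2%}")              # ~9.1% of users
print(f"absolute change: {reach(improved) - baseline:.2%}")   # under one point
```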
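
Next, a sketch of thinking in distributions: simulate a heavy-tailed engagement metric and ask what share of the total the heaviest users account for. The lognormal parameters are made up, not Spotify numbers.

```python
# Hypothetical sketch: under a heavy-tailed metric, what share of the total
# comes from the heaviest users? The lognormal parameters are invented.
import numpy as np

rng = np.random.default_rng(0)
streaming_minutes = rng.lognormal(mean=3.0, sigma=1.5, size=1_000_000)

sorted_minutes = np.sort(streaming_minutes)[::-1]
total = sorted_minutes.sum()

for pct in (0.01, 0.05, 0.20):
    top_n = int(len(sorted_minutes) * pct)
    share = sorted_minutes[:top_n].sum() / total
    print(f"top {pct:.0%} of users account for {share:.0%} of streaming time")
```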
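
Finally, a toy version of the offline analysis. Everything here is synthetic: the two features match the ones named above, the model is an off-the-shelf gradient-boosted regressor, and the “boosted” playlist is simulated by shifting the feature distributions. A real version would use logged features and a carefully validated model.

```python
# Toy "offline" test-outcome estimate: score two candidate playlists with a
# listening-time model and compare predictions. All data here is synthetic.
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)

# Synthetic history: per (user, track) features and observed listening minutes.
n = 50_000
history = pd.DataFrame({
    "heard_before": rng.integers(0, 2, n),
    "track_popularity": rng.beta(2, 5, n),
})
history["minutes"] = (
    2.0
    + 3.0 * history["heard_before"]
    + 4.0 * history["track_popularity"]
    + rng.normal(0, 1, n)
)

features = ["heard_before", "track_popularity"]
model = GradientBoostingRegressor().fit(history[features], history["minutes"])

# Two candidate playlists for the same users: the current one, and one that
# boosts lesser-known artists (lower popularity, fewer familiar tracks).
current = pd.DataFrame({
    "heard_before": rng.integers(0, 2, 10_000),
    "track_popularity": rng.beta(2, 5, 10_000),
})
boosted = pd.DataFrame({
    "heard_before": np.zeros(10_000, dtype=int),
    "track_popularity": rng.beta(1.5, 6, 10_000),
})

print("predicted minutes per track, current:", model.predict(current[features]).mean())
print("predicted minutes per track, boosted:", model.predict(boosted[features]).mean())
```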

Own data, end-to-end

Someone at the company should understand how a metric is computed, from end to end. Often this knowledge exists only in patchwork. Client or backend engineers instrument the raw events; data engineers ingest these into a data lake and perhaps lightly massage them; analytics engineers create domain-specific aggregations, like dimension and fact tables; and data scientists stand atop these efforts and generate the final metric. But this leads to what I’ll call the “Rumsfeld problem”: you do the analysis with the data you have, not the data you might want or wish to have at a later time. A related problem is that, if your data begins to look suspicious, you have no idea why or how to fix it.

I had a great deal of success at Spotify inserting myself further upstream in this “data flow”. I took on the work of an analytics engineer, and built my own domain-specific aggregations on top of raw events. I knew which measures and dimensions I wanted to analyze and use for A/B testing, and I built the datasets required to compute those quantities simply and straightforwardly.
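
To make “domain-specific aggregations” concrete, here is a hypothetical sketch of the kind of table I mean: raw stream events rolled up into a daily fact table keyed by user and playlist. The column names, the 30-second threshold, and the use of pandas are illustrative; a real version would live in a scheduled pipeline, not a notebook.

```python
# Hypothetical sketch of a domain-specific aggregation: rolling raw stream
# events up into a daily fact table keyed by user and playlist. All column
# names and values are invented for illustration.
import pandas as pd

raw_events = pd.DataFrame({
    "user_id":     ["u1", "u1", "u2", "u2", "u2"],
    "playlist_id": ["daily_mix", "daily_mix", "daily_mix", "discover", "discover"],
    "event_date":  pd.to_datetime(["2024-03-01"] * 5),
    "ms_played":   [183_000, 41_000, 9_000, 212_000, 65_000],
    "track_id":    ["t1", "t2", "t3", "t4", "t5"],
})

fact_playlist_day = (
    raw_events
    .assign(stream_30s=lambda df: df["ms_played"] >= 30_000)  # a common "counted stream" rule of thumb
    .groupby(["user_id", "playlist_id", "event_date"], as_index=False)
    .agg(
        streams_30s=("stream_30s", "sum"),
        minutes_played=("ms_played", lambda s: s.sum() / 60_000),
        distinct_tracks=("track_id", "nunique"),
    )
)
print(fact_playlist_day)
```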

Data scientists should get into the habit of thinking of datasets as one of their primary work outputs. A well-architected, well-tested, and well-maintained set of tables pertaining to a particular problem space is invaluable. It helps dashboards, analytics, metric development, experiment analysis, and even ML modeling all go faster. It is particularly useful for self-serve analytics, and can save you some late-evening DMs from your product manager and other partners asking for a “quick data pull”. The process of building a dataset invariably involves interrogating the data — asking if it means what you think it means — and I have always found that this uncovers bugs, whether minor or major.

Data scientists should also be interested in data even further “upstream”. At Spotify, I helped modify the event specification/schema for some of the raw events that I used in my aggregations. (To touch on the technical details: there was no reliable join key across music stream events, client interaction events, and playlist recommendation events. I worked with others to add such a field for our particular playlist, and I confirmed that it was working properly.) Instead of suffering from the Rumsfeld problem, I overcame it. I built the right raw data, and from there everything else flowed naturally.

Once again, none of this is costless. Analytics and data engineering are full-time jobs, and “normal” data science work is already a full plate. So why do I think data scientists should invest in these activities? Quite simply, it is difficult to be confident in the rigor and quality of analyses, A/B test results, and dashboards without having confidence in their underlying datasets. And, as I mentioned previously, the tables that will help you do your analysis most efficiently are rarely just there, fully formed, like Venus on the half-shell. You will likely have to make them.

Understand the user and advocate for them

No one who works at a company is a dispassionate observer. Product managers want their ideas to succeed, and their trajectory at the company often hinges on the success of the features they foster. Engineers want the features they build to be adopted into the product, as opposed to unceremoniously canned after a failed A/B test. “Researchers” — a group that includes data scientists — are by no means unbiased, but we should aspire to be advocates for the user, not the feature. (We are often the only ones.)

Advocating for the user requires understanding them and how they interact with the product. This sounds obvious, but it often gets lost in the week-to-week grind of sprints and A/B tests. It is, for better or worse, possible to improve user metrics without having any idea of how users use the feature. We can simply run a bunch of tests, and see which ones “work”. This is “hill climbing” in its most basic (and idiotic) form.

A better approach is to partner with user research to develop a mental model of our users. What do users think of the feature? What are their common interaction pathways? What aspects do they feel are broken? How do they define success? How sticky is the feature? What does a typical “session” look like? And so on. The challenge is that much of this research is open-ended and, if not done carefully, not terribly actionable. You might end up making banal statements like “users want Spotify to help them find more new music”, and everyone will nod their heads and go back to doing what they had been doing. But, done properly, this research can generate hypotheses for future feature development, inform instrumentation and data collection, help decide on success metrics, and align the mental models of the people making the feature (PMs and engineers) with those of the people using it (usually, ordinary Janes and Joes).

Most importantly, research can illuminate the gap between the grand ambitions of PMs and directors and the harsh realities of new feature development. At Spotify, I witnessed the launch of several features that would supposedly transform the app: the home redesign, the AI DJ, audiobooks, and so on. These launches often carried outsized expectations and correspondingly outrageous metrics targets. Research was usually the first discipline to highlight the disparity between expectation and reality, and to show that what we had built was not as “revolutionary” as we had intended.

Conclusion

This has been a long post, but hopefully not too long-winded. What I want to leave you with is this. Product data science, much like Spotify feature launches, has suffered from its own “expectations gap”. The ambition for product data science is that it suffuses and transforms product strategy. The reality is that we spend much of our time analyzing tests that should not have been run and generating reports that merely confirm everyone’s pre-existing thinking. Bridging this gap requires a lot of work: becoming a co-equal partner with product (which typically requires at least partial embedding), investing in data quality and usability, and understanding what the user really thinks.

That’s the “carrot”. The “stick” is that if we don’t embrace this new way of working, we risk the same sort of catastrophe that user research has undergone in the last year. As Judd Antin writes, “Layoffs are the worst, but they’re no accident. When companies lay off workers, they’re making a statement about business value. When a discipline gets the disproportionate axe, as UX Research has, the meaning is pretty clear.” At Spotify, for example, the most recent round of layoffs decimated user research, and severed the tight connection between qualitative and quantitative research disciplines.

I am not suggesting, a la Lake Wobegon, that we all become “above-replacement” product data scientists. Instead, we should redefine what it means to be replacement-level. It is not simply a matter of running statistical tests and writing SQL queries; it means being a voice in the room that people respect and listen to — one that the team cannot do without.