Using GitHub’s Scientist library to refactor with confidence

Dylan Irlbeck
Flexport Engineering
Apr 6, 2023

For the last decade, Flexport has been powered mostly by a Ruby on Rails monolith. In that time, we’ve written millions of lines of Ruby code and refactored core modules and business logic more times than we can count. As time went on, though, we kept asking ourselves: how can we refactor with more confidence?

The success of our refactors is by no means guaranteed. Particularly with older code, the engineer making the refactor is usually not the original author, leading to unknown unknowns that cause bugs and demotivate engineers from attempting future refactors. In late 2020, we came across GitHub Scientist, an open-source Ruby library for “carefully refactoring critical code paths.”

Using Scientist’s simple and strong foundation, we’ve given our engineers the exact tool they need to refactor our Ruby code with confidence. In this post, we want to share what we’ve learned by using Scientist in production in the hopes that we all can refactor better.

Usage

Scientist, as the name suggests, empowers Flexport to run experiments on its Ruby code. Each experiment defines a control code block (our old code) and a candidate code block (our refactor). When run, the experiment executes both code blocks, keeping track of each block’s result and execution time. The experiment then returns the result of the control block, and our program proceeds as normal.

require "scientist"

class MyWidget
def allows?(user)
experiment = Scientist::Default.new "widget-permissions"
experiment.use { model.check_user?(user).valid? } # old way, "control"
experiment.try { user.can?(:read, model) } # refactored way, "candidate"
experiment.run
end
end

In the context of refactors, we’d expect that, at a minimum, our candidate block returns the same result as our control block. (Otherwise, we aren’t refactoring!) But we’d also prefer our candidate to be at least as fast as the original code, if not faster. At Flexport, these are the two dimensions we care deeply about when refactoring: preserving correctness and maintaining, or improving, performance.
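By default, Scientist considers an experiment a match when the two results are equal under ==. When a refactor legitimately changes the result’s representation, say, returning the same records in a different order, the library’s compare hook lets an experiment define its own notion of equivalence. A minimal sketch (old_user_ids and new_user_ids are hypothetical):

experiment = Scientist::Default.new "user-ids"
experiment.use { old_user_ids } # control
experiment.try { new_user_ids } # candidate

# Treat the results as equivalent if they contain the same ids,
# regardless of ordering.
experiment.compare do |control, candidate|
  control.sort == candidate.sort
end

experiment.run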

With no additions to the Scientist code, however, experiments will just burn CPU time, akin to a scientist doing great work in their laboratory but never sharing their results! The results of experiments need to be published to become valuable to the world. At Flexport, we accomplish this by “reporting” our results into dashboards and logs, and it is precisely this reporting that allows us to refactor our Ruby code so effectively.

FlexportExperiment

Our implementation replaces the Scientist library’s default Experiment with a custom FlexportExperiment class that overrides three methods: initialize, enabled?, and publish. We’ll provide the (stripped-down) code here, then go into more detail about each piece in later sections.

### config/initializers/scientist.rb

# Override the default Experiment implementation with ours in a Rails initializer.
Scientist::Experiment.set_default(FlexportExperiment)

### helpers/flexport_experiment.rb

class FlexportExperiment
  extend T::Generic
  extend T::Sig

  # Use Sorbet generics to allow callers to generically type their control/candidate
  # block return types.
  Elem = type_member

  # Include all the methods we need to satisfy Scientist.
  include Scientist::Experiment

  sig { params(name: String).void }
  def initialize(name)
    @name = name
  end

  sig { returns(T::Boolean) }
  def enabled?
    # If we're in a test environment or someone's configured the experiment to start running,
    # the experiment is enabled.
    Rails.env.test? || FeatureFlagService.should_run_experiment?("experiment_#{@name}")
  end

  sig { params(result: Scientist::Result).void }
  def publish(result)
    control = result.control
    candidate = result.candidates.first
    if result.matched?
      # Results matched.
      report_match
    elsif result.ignored?
      # Result ignored.
      report_ignored
    elsif candidate.exception
      # Candidate block raised an exception.
      report_exception(candidate.exception.class.name)
    else
      # Control and candidate blocks mismatched.
      report_mismatch
      report_durations(control.duration, candidate.duration)
    end
  end
end

Usage of FlexportExperiment is straightforward, with publishing complexity abstracted away from the caller. Callers need only define a return type for the control/candidate blocks, a name for the experiment, and the two pieces of code they’d like to experiment with. (And with Rails autoloading, FlexportExperiment is freely available in all of our Rails classes.)

FlexportExperiment[Integer].new("experiment-foobar").run do |e|
  e.use do
    control_logic
  end

  e.try do
    candidate_logic
  end
end

Integrations

As mentioned above, the real power of our tool is in its integrations. Let’s dive into each one to see how they contribute to our refactoring confidence.

Experiment results

The publish method in FlexportExperiment translates experimental results into formats that we can understand and draw insight from: metrics and logs. We use observability tools heavily at Flexport, and we’ve fine-tuned their integration with FlexportExperiment to surface the information that our engineers care most about.
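The publish method shown earlier elides the reporting helpers. As a sketch of the shape such a helper might take, here is a hypothetical mismatch reporter; the Metrics client and helper name are illustrative, while Rails.logger and the Scientist::Result and Scientist::Observation APIs are real:

# Hypothetical reporting helper; the real ones are elided above.
sig { params(result: Scientist::Result).void }
def report_mismatch_details(result)
  control = result.control
  candidate = result.candidates.first

  # Count the mismatch so the dashboard can plot it over time.
  # (Metrics stands in for a StatsD-style client.)
  Metrics.increment("experiment.#{@name}.mismatch")

  # Record both durations so the dashboard can compare performance.
  Metrics.timing("experiment.#{@name}.control_duration", control.duration)
  Metrics.timing("experiment.#{@name}.candidate_duration", candidate.duration)

  # Log the observed values for granular debugging. Observation#cleaned_value
  # is the observed value, passed through the experiment's clean block if one
  # is configured.
  Rails.logger.warn(
    "experiment=#{@name} mismatch " \
    "control=#{control.cleaned_value.inspect} " \
    "candidate=#{candidate.cleaned_value.inspect}"
  )
end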

Every experiment that we run gets its own dashboard to track correctness and performance. From this bird’s-eye view, engineers can quickly determine whether their new code is giving the results they expect. When mismatches occur (cases where the control and candidate do not give the same result), or when exceptions are raised during execution of the candidate code, engineers can jump over to the logs for more granular detail about the error(s).

A dashboard showing the results of an experiment. Shown are metrics about execution time and number of correct/incorrect results, plotted over time.

Note the “Mismatched by run order” panel on the dashboard. To mitigate side effects in our experiments, FlexportExperiment runs the control and candidate blocks twice, each time in a different order. If one of the orders, say candidate_control, gives significantly more mismatches than control_candidate, a data inter-dependency likely exists between the code blocks, and we need to rework our experiment setup to get useful results.
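Stock Scientist already shuffles execution order on each run, and Scientist::Result#observations preserves the order the blocks actually ran in, which is what makes this breakdown possible. A hypothetical helper that tags each mismatch with its run order (again assuming an illustrative Metrics client):

sig { params(result: Scientist::Result).void }
def report_run_order(result)
  return unless result.mismatched?

  # result.observations is in execution order, so this yields either
  # "control_candidate" or "candidate_control".
  order = result.observations.map(&:name).join("_")
  Metrics.increment("experiment.#{@name}.mismatch.#{order}")
end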

Controlling experiments

There are many cases where our engineers don’t want to run their experiments right away or want the ability to quickly disable their experiment if it runs amok. To solve this problem, we’ve integrated our internal feature flag system with FlexportExperiment.enabled?. When an experiment is created in code, it will not actually commence in production until a feature flag exists that corresponds to that experiment (obeying a consistent flag naming convention).

def enabled?
  # If we're in a test environment or someone's configured the experiment to start running,
  # the experiment is enabled.
  Rails.env.test? || FeatureFlagService.should_run_experiment?("experiment_#{@name}")
end

Type safety

Flexport has been using Sorbet, Stripe’s open-source Ruby typechecker, to type our monolith’s Ruby code for several years. Having types on variables and method signatures has been invaluable for developers looking to move quickly without breaking the system. In the early iterations of FlexportExperiment, though, we noticed that our control and candidate blocks were not being typechecked. But the types of the control and candidate must be the same; otherwise, they couldn’t possibly give the same result!

foobar = FlexportExperiment.new("experiment-foobar").run do |e|
  e.use do
    5
  end
  e.try do
    "5" # a different type that's not caught by Sorbet
  end
end

T.reveal_type(foobar) # T.untyped

To preserve type safety when running experiments, we leveraged Sorbet generics. With our change, experiment creators simply provide the return type of their code blocks; FlexportExperiment ensures that the types flow throughout the program.

foobar = FlexportExperiment[Integer].new("experiment-foobar").run do |e|
  e.use do
    5
  end
  e.try do
    "5" # causes Sorbet to raise an error when typechecking
  end
end

T.reveal_type(foobar) # Integer
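Under the hood, the generic flows through the block-taking methods. A simplified sketch of what the signatures might look like, assuming the wrapper yields itself from run and delegates to Scientist (an illustration, not Flexport’s actual implementation):

class FlexportExperiment
  extend T::Sig
  extend T::Generic

  include Scientist::Experiment

  # The shared return type of the control and candidate blocks.
  Elem = type_member

  # Both blocks must return an Elem...
  sig { params(blk: T.proc.returns(Elem)).void }
  def use(&blk)
    super(&blk)
  end

  sig { params(blk: T.proc.returns(Elem)).void }
  def try(&blk)
    super(&blk)
  end

  # ...and run yields the experiment for configuration, then returns the
  # control block's result, which is therefore also an Elem.
  sig { params(blk: T.proc.params(e: FlexportExperiment[Elem]).void).returns(Elem) }
  def run(&blk)
    blk.call(self)         # let the caller attach its use/try blocks
    T.cast(super(), Elem)  # Scientist's run is untyped; narrow its return
  end
end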

Extending to other languages

The Scientist pattern is a powerful one: it lends empirical support — in the form of correctness and performance guarantees — to refactors of complex code. Using FlexportExperiment, I feel more confident when I refactor, and as a result, I look to do so more often. There’s only one problem: I can only use the tool when writing Ruby code!

In recent years, Flexport has been gradually moving to a service-oriented architecture to complement our Rails monolith. The overwhelming majority of these new services are written in Java and Kotlin, but we haven’t yet invested in supporting FlexportExperiment in these new languages. And we’re not alone in noticing this problem: Scientist’s README lists alternatives for over 20 languages and runtimes.

Given the success of FlexportExperiment in our Rails monolith, I don’t think it will be long before Flexport engineers bring science to bear in every system we build. I look forward to the day that we have hundreds of experiments running concurrently across our systems, with every engineer feeling empowered to refactor — regardless of the complexity or criticality of the code.

I’d like to thank Patrick Auld and Alek Storm for reading drafts of this post.
