June 24, 2022

An Inverse Turing Test

This blog post is a bit ... different. Normally, a blog post is a static document, and the direction of communication is from the screen to the user. But this piece requires you to interact. You're not just reading the content; you'll actually change the story of the blog post as you interact with it. This makes the piece a bit more experimental, but hopefully, also much more interesting!

A small warning: the experience is way better on desktop. It's certainly possible to play around while on a mobile phone, but it's a much better experience to explore this piece when you have access to a keyboard.

Experiment

We will try to figure out if you are a human or a bot. To figure this out, we will ask you to give us a sequence of random numbers.

You're going to use maths, aren't you?

In a very interactive experience, yes.

There are two buttons below. One for heads. One for tails.

Please click the heads/tails button as randomly as possible. If you're using a laptop, you may also use the 1 (heads) or 0 (tails) keys on your keyboard. That's usually a whole lot faster.

The rest of the article will appear once you've done 10 virtual coinflips, but the results may be more impressive if you generate 100. Give it a try!

A counter here shows how many of the 100 coin tosses you've sampled so far.

An Inverse Turing Test

We now have coin tosses. So what does this say about you?

Technically, we now have data that allows us to do an "inverse Turing" test. During a normal Turing test, you would test if a machine is indistinguishable from a human. In this case, we're doing the opposite. We're going to be testing if the input is indistinguishable from something a machine might generate. As you may learn, humans are usually pretty bad at generating random numbers ...

So given our sequence of coin flips ...
when are the numbers not random enough to be made by a machine?

We could calculate some statistics from our long 0/1 series. One place to start is to just count how many ones and zeros we have. If we have way more zeros than ones, then we might be able to claim that we weren't using a good random sampling method.

In our case, we have heads and tails. If there's a big difference between these two numbers, then that might indicate that we're dealing with a human. But when is the difference so big that it starts getting suspicious?

You could argue that, for example, four times heads and six times tails isn't that strange. But then again, 400 heads and 600 tails would feel fishy. So it's not just the ratio of heads and tails that matters; the total numbers need to be taken into account too!
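To put rough numbers on that intuition, here is a minimal Python sketch. It assumes a fair coin and measures the gap between the observed and expected number of heads in standard deviations. The function name `heads_z_score` is made up for this illustration, and this is not necessarily how the interactive chart works.

```python
import math

def heads_z_score(heads, tails):
    """How many standard deviations the heads count sits from a fair coin's expectation."""
    n = heads + tails
    expected = n / 2
    # Standard deviation of the heads count for n fair flips: sqrt(n * 0.5 * 0.5).
    std = math.sqrt(n) / 2
    return (heads - expected) / std

print(round(heads_z_score(4, 6), 2))      # a 40/60 split over 10 flips
print(round(heads_z_score(400, 600), 2))  # the same split over 1000 flips
```

The first split lands well within one standard deviation of the expectation, while the second is more than six standard deviations away, even though the ratio is identical.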

This is where we can use a bit of probability theory to help us. If the numbers are random, then we have maths that can tell us what we might expect. The chart below will do just that. Note that it's an interactive chart that updates when you press 0/1 on your keyboard. Feel free to play around!

But wait. Where does this curve come from?

Because we're dealing with binary data, we can describe our belief about the probability of heads with a beta distribution. It's an amazing probability distribution with many applications in A/B testing. If you'd like to learn more, we recommend checking out the 3b1b series on YouTube.

The blue line represents the belief about the "heads" probability if the numbers were generated randomly. The dotted line is what we actually saw. The further the dotted line is from the center, the stronger the case that it's statistically different. You'll notice that the more numbers we have, the thinner the peak will be. That's because the more numbers we have, the closer the share of heads needs to be to 0.5.

If this data was generated by a machine, with its random number generator, then we can expect that it wouldn't stray too far from the middle of the distribution. At the moment, the line is at the % quantile of the distribution. Whether or not this is fine depends on your appetite for risk, but this article will assume that we want to remain within the most likely 90% of the distribution. That means in this case, we have a test that does not pass.
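A test like this can be sketched in plain Python. The parameterization is an assumption, since the article does not spell it out: the reference curve is taken to be Beta(n/2 + 1, n/2 + 1) for n flips, and "the most likely 90%" is read as the band between the 5% and 95% quantiles. The helper names `beta_pdf`, `beta_cdf` and `heads_test` are hypothetical.

```python
import math

def beta_pdf(x, a, b):
    """Density of Beta(a, b) at x; assumes a > 1 and b > 1, so both endpoints are 0."""
    if x <= 0.0 or x >= 1.0:
        return 0.0
    log_norm = math.lgamma(a + b) - math.lgamma(a) - math.lgamma(b)
    return math.exp(log_norm + (a - 1) * math.log(x) + (b - 1) * math.log(1 - x))

def beta_cdf(x, a, b, steps=2000):
    """Approximate P(X <= x) for X ~ Beta(a, b) with Simpson's rule (steps must be even)."""
    if x <= 0.0:
        return 0.0
    if x >= 1.0:
        return 1.0
    h = x / steps
    total = beta_pdf(0.0, a, b) + beta_pdf(x, a, b)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def heads_test(heads, n, keep=0.90):
    """Return (quantile, passed) for the observed share of heads in n >= 1 flips."""
    # Assumed reference curve: the belief a perfectly balanced sequence would produce.
    a = b = n / 2 + 1
    quantile = beta_cdf(heads / n, a, b)
    tail = (1 - keep) / 2
    return quantile, tail <= quantile <= 1 - tail

print(heads_test(4, 10))      # small sample, mild imbalance
print(heads_test(400, 1000))  # large sample, same imbalance
```

Under these assumptions, 4 heads out of 10 stays inside the band, while 400 heads out of 1000 falls far outside it.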

Enough?

We hope the example so far is pretty interesting ... but you might be thinking ...

But this is a single test ...
is that enough to account for all human bias?

After all, let's pretend that we have the following sequence:

00000001111111

Theoretically, this is exactly 7 "heads" and 7 "tails", so it would pass our test. But look at it, it doesn't look random at all! Sure, the total number of ones and zeros is equal, but if this series were random, we'd expect the next number to be unpredictable. And here, there seems to be a clear pattern.

So maybe we need to extend our original test by also taking the order of the sequence into account. So let's consider the pairs in the sequence in order.

We can count all of these pairs of numbers and put them in a table.

Pattern  Count
0,0      0
0,1      0
1,0      0
1,1      0

This is similar to what we had before ...
but instead of one number, we're tracking four!
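Counting the pairs takes only a few lines of Python. This sketch assumes the pairs overlap (each flip is paired with its right-hand neighbour) and that the sequence is a string of "0" and "1" characters; `count_pairs` is a name invented for this example.

```python
from collections import Counter

def count_pairs(flips):
    """Count every overlapping pair of neighbours in a string of '0'/'1' characters."""
    # Start every pattern at zero so pairs that never occur still show up in the table.
    counts = Counter({"0,0": 0, "0,1": 0, "1,0": 0, "1,1": 0})
    for left, right in zip(flips, flips[1:]):
        counts[f"{left},{right}"] += 1
    return dict(counts)

print(count_pairs("00000001111111"))
```

For the sequence 00000001111111 from earlier, this gives six "0,0" pairs, six "1,1" pairs, a single "0,1" pair and no "1,0" pair at all, which is exactly the kind of imbalance the single heads/tails count could not see.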

We are now tracking each pair of numbers, which might allow us to detect that there are more 0-1 pairs than there are 1-1 pairs. That also means that we can use the same method of testing again! We can again use a beta distribution to describe what we would expect from the sequence and we can once again perform a test.

Distribution for 0,0

Distribution for 0,1

Distribution for 1,0

Distribution for 1,1

But wait. Why do these charts look different?

Before, we were just comparing the ones against the zeros. Now we are comparing "0,0" versus the rest, then "0,1" versus the rest, etc. That means that we no longer expect the center of mass to be in the middle at 50%; instead, we expect it to be around the 25% mark. If you're curious about more details, we recommend watching the 3b1b series on YouTube about the Beta distribution.
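The shift from 50% to 25% can be written down directly. The parameterization below is an assumption, since the article does not state one: N is the number of pairs, and the reference curve uses the expected count of the pattern and the expected count of everything else, each plus one.

```latex
\mathbb{E}[X] = \frac{\alpha}{\alpha + \beta}
\quad \text{for } X \sim \mathrm{Beta}(\alpha, \beta),
\qquad
\alpha = \tfrac{N}{4} + 1,\;
\beta = \tfrac{3N}{4} + 1
\;\Longrightarrow\;
\mathbb{E}[X] = \frac{N/4 + 1}{N + 2}
\xrightarrow[N \to \infty]{} \frac{1}{4}
```

With only two outcomes, the same formula with equal parameters gives a mean of exactly 1/2, which is why the first chart was centered.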

Given that we can use a similar trick, let's add some more numbers to our table.

Pattern  Count  Expected  Quantile  Test Pass
0,0      0      0
0,1      0      0
1,0      0      0
1,1      0      0

Do all the tests pass? If not, feel free to cheat a little here and try to make the entire series "more random" by adding some more numbers now.

Random Enough?

In the previous segment, we saw that we might be able to come up with better tests by looking at the pairs in the sequence. However, you may have been able to play around with the numbers such that the tests pass anyway. So it's only natural if you're thinking ...

But hang on ...
can't we fool these tests too? Just like before?

But here comes a cool insight. Maybe we shouldn't just look at the pairs in a sequence. Maybe we should look at the triplets too!

That means we can make another table with more tests!

Pattern  Count  Expected  Quantile  Test Pass
0,0,0    0      0
0,0,1    0      0
0,1,0    0      0
0,1,1    0      0
1,0,0    0      0
1,0,1    0      0
1,1,0    0      0
1,1,1    0      0

We are now counting over eight groups. Again, you can try to cheat, but you'll notice it's starting to get harder.
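The first three columns of such a table can be produced for any window size. This is a sketch, assuming overlapping windows over a string of "0" and "1" characters and a fair coin for the expected count; `pattern_table` is a hypothetical helper, and the quantile and pass columns are left out.

```python
from itertools import product

def pattern_table(flips, size):
    """Return (pattern, count, expected) rows for every 0/1 pattern of the given size."""
    windows = [flips[i:i + size] for i in range(len(flips) - size + 1)]
    # With fair, independent flips each of the 2**size patterns is equally likely.
    expected = len(windows) / 2 ** size
    rows = []
    for bits in product("01", repeat=size):
        pattern = "".join(bits)
        rows.append((",".join(bits), windows.count(pattern), expected))
    return rows

for pattern, count, expected in pattern_table("0101101001", 3):
    print(pattern, count, expected)
```

Because `product` enumerates every possible pattern, triplets that never occur still get a row with a count of zero, and those are often the most telling rows.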

Repeat

You may recognize something at this point.

I'm starting to see a pattern here ...
We can keep looking at bigger sequences if we want to be more critical!

To make this point clear, let's have a look at a histogram of counts for sequences of size 4.

What to expect

The chart on the left is a histogram that counts how often certain sequences have occurred. There is also a dashed line that shows where the bar is expected to be if the sequence were randomly generated.

When you look at this chart, you can wonder what you should expect to see. If the data was really random, then each bar should be about the same length. There should still be some variance, but any large deviations from the dashed lines would be suspicious.

For completeness, let's have a look at sequences of size 5.

What to expect

Just like before, if this was a random sequence, we'd expect to see the uniform distribution. The further a bar is away from the uniform expectation, the higher the odds that one of our tests would catch it!

Many Many Tests

We're getting close to the moment of truth. When we look at your sequence of random numbers ... are they random enough that they could have been made by a machine?

So what do all of the tests say?

Let's make an inventory of all the tests that we can generate for sequences up to size 5.

Tests that pass.
Tests that fail.

Does it feel like a lot of tests have failed? At the moment tests fail while tests pass.

You can compare against what an actual random number generator would do. You can either press the buttons or use the "q"/"w" keyboard shortcuts. You'll notice that some of the tests fail for the randomly generated sequence too. This is totally normal! There's always a chance that a test fails because of randomness, but usually the human sequence will have more failing tests.
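The same comparison can be sketched offline. Note the swapped-in technique: instead of the beta distribution used in this piece, the sketch uses a normal approximation to each pattern count, which is only rough because overlapping windows are not independent. The `inventory` function, the cut-off of 1.645 standard deviations (a two-sided band of roughly 90%) and the example sequences are all assumptions made for illustration.

```python
import math
import random
from itertools import product

def inventory(flips, max_size=5, z_limit=1.645):
    """Return (passed, failed) over every pattern of size 1..max_size.

    Assumes flips is a string of '0'/'1' characters longer than max_size.
    """
    passed = failed = 0
    for size in range(1, max_size + 1):
        windows = [flips[i:i + size] for i in range(len(flips) - size + 1)]
        p = 1 / 2 ** size  # chance of any single pattern under fair flips
        mean = len(windows) * p
        std = math.sqrt(len(windows) * p * (1 - p))
        for bits in product("01", repeat=size):
            z = (windows.count("".join(bits)) - mean) / std
            if abs(z) <= z_limit:  # inside the central band of roughly 90%
                passed += 1
            else:
                failed += 1
    return passed, failed

alternating = "01" * 50   # a sequence that alternates far too neatly
rng = random.Random(42)   # fixed seed so the run is repeatable
machine = "".join(rng.choice("01") for _ in range(100))

print("alternating:", inventory(alternating))
print("generator:  ", inventory(machine))
```

There are 2 + 4 + 8 + 16 + 32 = 62 tests in total. The strictly alternating sequence passes only the two heads/tails tests and fails the other 60, while a sequence from the generator typically fails just a handful.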

As before, the entire piece updates when you hit the button. So you can scroll back up to compare.

Conclusion

I hope that by playing around with the random sequence you've gotten some intuition out of it. Maybe you even feel like:

This was pretty interesting.

You may have noticed that it's pretty hard to generate a properly random sequence of numbers. Typically, but not always, humans generate too many alternating patterns like "0,1" and "1,0" and too few long runs of the same value. By making sure that we look at sliding windows over the sequence, we may be able to detect these non-random patterns.

But maybe if we take a step back, we can also appreciate what probability theory has allowed us to do here. Because we have probability theory, we have methods at our disposal to help describe when something is random. That also means that we have tools that help us detect when something **non-random** is happening. That is why probability theory is such a popular topic to study in data science! If we can detect a non-random pattern, then we may be able to re-use that pattern for a prediction!

Appendix

This piece was part of the second Summer of Math Exposition. I made this to challenge myself to build something I normally wouldn't make, and if you have the time, I highly recommend you give it a try yourself. I taught myself alpine.js while building this, and I very much feel like it's been a great learning experience.
