Creating sample data for SQL databases

Hi friends!

Whatever the reason may be, there comes a time in most developers’ lives, that we need to show off something we’ve done with a database back-end. While throwing up a database is now much easier than it once was, there still remains the problem of inserting a bunch of fake data. You might be asking, “Why do this! There’s plenty of ways already to do this…” There are online services that will do this for you, but you need to upload your schema, they often aren’t even https sites, and just feel…insecure. There are also I’m sure scripts out there that do this, but the quick search I did turned up a bunch of gibberish data, and/or just ids and not something that looks real-ish. So here we are.

I was at PASS Summit a couple months ago with my teammate Doug and we were showing off some fun demos around our SQL Server beta for Google Cloud SQL and he ran a Python script which faked a bunch of employee-like data. His code lives here if you want to work with SQL Server.

I then got the bug to expand it to work with MySQL. The repo with the script is here. The script is There’s some other stuff in there for another blog post I’m writing about running an app in Kubernetes against a Cloud SQL server, but for this script, you just need, and the requirements.txt. It works with both Python 3.x and 2.x, but because of some string formatting changes to get it working in 2.x means changing some of the string formatting in the script. Search for f" and you’ll find the places you need to switch over to using the .format() method of string formatting.

This script needs a SQL instance to connect to, whether you stand up your own instance, or setup a Cloud SQL instance. If you wanted to setup a Cloud SQL instance, I’ve got a few blog posts on how to connect in different ways depending on what you need. The intro blog, with links to the various ways to set things up in the Cloud, is here.

The script, by default, creates 8 different office locations, with 8 employees per location, but these values can be overridden with flags. You can run python --help to see all the various options.

There are three required pieces of information to run: db username, db password, and db name. These can either be environment variables, or flags passed in on command line.

The employee table schema looks like:

The IP addresses and the social security numbers are all absolutely fake as generated by the Faker Python library. If there’s concern even just faking it, look at around line 190 in the script to find the INSERT command to remove the IP address and SS# parameters.

The locations table schema (which maps to the office_id value in the employee table) looks like:

The Faker library can generate basically whatever you need, so the tables don’t have to look like this, it’s just what I did.

Thing to note: The script by default connects to localhost. This is because, generally speaking, I’m either running a local-hosted demo instance, or I’m working against a Cloud instance. And if I’m running against a Cloud instance, I’m using the Cloud SQL Proxy which listens on localhost. Using the proxy is best practice if you’re using a Cloud instance.

The script works fine against remote instances too, you just have to either set the SQL_HOST environmental variable or pass in an address with the -h flag.

To setup the proxy, I’ve got a section in a previous post here that runs down using it with the public IP enabled instance. Once you do setup the proxy, then you don’t have to pass the SQL_HOST variable or pass in an IP (don’t forget to unset the SQL_HOST environment variable if you set it previously).

To run the script:

python will run against either a locally running MySQL server or a Cloud SQL instance with a locally running Cloud SQL Proxy. It will pull your DB credentials (user, password and database name) from the environment, or error out telling you that you need to set them.

python -u <db user> -p <password> -d <database name> will also run against a locally running MySQL server or a Cloud SQL instance with a local Cloud SQL Proxy, but now you’re passing in the DB credentials directly rather than using the environment.

python -h <public ip> will run against an instance running on the public ip address specified. It pulls DB creds from the local environment where you’re running the script.

All of the above will create a total of 64 users in your database. Because by default it creates 8 locations, with 8 employees per location (arbitrary hardcoded values in the script).

python -l 10 -e 100 would create 1000 employees.

That’s all there is to it! Happy fake data generating.

Run into any problems? Want more functionality out of the script? Please let me know! Respond in comments below, or reach out to me on Twitter. My DMs are open!

Husband, father, actor, sword fighter, musician, gamer, developer advocate at Google. Making things that talk to the Cloud. Pronouns: He/Him