Generating Realistic Random Datasets with Trumania

Why do data scientists and data engineers work with synthetic data? How do they obtain it? Discover Trumania, a scenario-based random dataset generator library.

Editor's note: this post was written in collaboration with Milan van der Meer. Both authors of this post are on the Real Impact Analytics team, an innovative Belgian big data startup that captures the value in telecom data by "appifying big data".

This tutorial gives you a small taste of why you might want to generate random datasets and what to expect from them. It also walks you through some first examples of how to use Trumania, a Python library for data generation.

For more information, you can visit Trumania's GitHub!

Why generate random datasets?

Generating random datasets is relevant for both data engineers and data scientists.

As a data engineer, after you have written your new awesome data processing application, you think it is time to start testing end-to-end and you therefore need some input data.

As a data scientist, you can benefit from data generation because it lets you experiment with various ways of exploring datasets, algorithms, and data visualization techniques, or validate assumptions about the behaviour of some method against many different datasets of your choosing.

In both cases, a tempting option is just to use real data. One small problem though is that production data is typically hard to obtain, even partially, and it is not getting easier with new European laws about privacy and security.

Maybe you are one of the lucky ones and you have a nice, complete production dataset on your laptop. Even then, you would like more variability in the dataset so that you can validate against various situations: different population sizes, different data distributions, and various data quality issues such as missing or invalid data.

For a more extensive read on why generating random datasets is useful, head towards 'Why synthetic data is about to become a major competitive advantage'.

Schema-Based Random Data Generation: We Need Good Relationships!

This section tries to illustrate schema-based random data generation and show its shortcomings.

Many tools already exist to generate random datasets. A common approach among those tools is schema-based generation, which lets you define a blueprint and use it to generate entities. Khermes and LogSynth are two examples of such tools.

A schema-based configuration might include a person schema like this one:

[
  {
    "field": "Name",
    "class": NameGenerator
  },
  {
    "field": "Age",
    "class": RandomInt(18, 65)
  },
  {
    "field": "Gender",
    "class": RandomPicker(["Male", "Female", "Other"])
  }
]

This schema defines how to generate some data about a person: a name, an age, and a gender. It is simple and quick, but it has limitations. A major one is that it is hard to fake relationships between different entities, or dependencies on time.

For example, how to make some names more likely based on gender? Or how to create a log of user actions in which actions are more likely to happen during weekends?
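To make the limitation concrete, here is a minimal Python sketch of what a schema-based generator boils down to. The field names mirror the person schema above; the name list and the generate_person helper are hypothetical and are not taken from Khermes or LogSynth. Each field is drawn independently, so nothing ties the name to the gender:

```python
import random

# Hypothetical stand-ins for the generators named in the schema above.
NAMES = ["Amy", "Michael", "Erica", "Robert"]
GENDERS = ["Male", "Female", "Other"]


def generate_person(rng):
    """Draw every field independently, as a plain schema does."""
    return {
        "Name": rng.choice(NAMES),
        "Age": rng.randint(18, 65),  # bounds are inclusive
        "Gender": rng.choice(GENDERS),
    }


rng = random.Random(12345)
people = [generate_person(rng) for _ in range(5)]
for person in people:
    print(person)
```

Making names depend on gender, or actions depend on the day of the week, would require logic that lives outside the per-field blueprint.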

The schema-based approach does not make it easy to expose your new application to real-world specificities or difficulties such as:

  • Imbalanced datasets (difficulties with group-by kind of operations);
  • Clustered patterns in the network of entities, such as social groups that can be discovered through user interactions (for example, phone calls);
  • Various time activity profiles: actions are more likely to happen at some times of the day or of the week;
  • Causal relationships between actions: for example, many purchase actions at a shop that ultimately trigger an "out-of-stock" event.

Trumania is based on scenarios in order to address these shortcomings and generate more realistic datasets. As the scenario unfolds, various populations interact with each other, update their properties and emit logs.

An example of a scenario would be people calling their friends. When a person calls, they pay in minutes; when the minutes are exhausted, they have to go to a shop, and so on. The resulting dataset could be the log of user calls, the log of shopping actions and the stock level at each hour. Actions in a scenario can either be random or deterministic.

In Trumania, the generated datasets are typically time-series because they result from the execution of a scenario which unfolds over time.
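The calling scenario can be pictured with a tiny hand-written simulation. This is a sketch in plain Python, not Trumania code: the two persons, the 50% chance of calling and the top-up of 3 minutes are all made-up values chosen for the illustration.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(42)
clock = datetime(2017, 1, 1)
step = timedelta(hours=1)

# State that the scenario updates as it unfolds (hypothetical values).
minutes_left = {"PERSON_0": 3, "PERSON_1": 3}
call_log, shop_log = [], []

for _ in range(24):  # 24 one-hour steps
    for person in minutes_left:
        if rng.random() < 0.5:  # random action: the person decides to call
            if minutes_left[person] == 0:
                # deterministic consequence: no minutes left, go to the shop
                minutes_left[person] = 3
                shop_log.append((clock, person, "TOP_UP"))
            minutes_left[person] -= 1
            call_log.append((clock, person))
    clock += step

print(len(call_log), "calls,", len(shop_log), "shop visits")
```

Both logs are time series, and the shop log is causally linked to the call log: a shop visit only happens after a person has used up their minutes.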

In the example below, you'll see how you can elaborate a basic scenario in which various people send messages to each other. The resulting dataset will be the time series of messages exchanged between people.

The sequence of steps we'll go through is as follows:

  • creating a Trumania circus, which is the world inside which the scenario will execute
  • adding a "hello world" story that will produce basic hard-coded logs
  • adding a relationship to assign some favourite quotes to each person, which will be used to create the content of the messages
  • parameterizing the timing of the story to obtain a more realistic distribution of messages throughout the day
  • adding another relationship to define a social network, so that the social graph that emerges from the message log is more realistic

Creating a Trumania Circus

Purpose

The first step consists in creating a circus, which is the world in which all the elements of the scenario will exist. We are also going to create a basic population of persons. Since many aspects of Trumania are random, this step will also introduce the concept of generators, which are used to control many random behaviors.

We're not going to add any scenario at this point, so everything will be static.

Let's get started!

How-to

A Trumania circus is simply created as follows:

import pandas as pd

from trumania.core import circus

example_circus = circus.Circus(name="example1",
                               master_seed=12345,
                               start=pd.Timestamp("1 Jan 2017 00:00"),
                               step_duration=pd.Timedelta("1h"))

In Trumania, all time-related elements are controlled by a central clock. The most important part in the code snippet above is step_duration=pd.Timedelta("1h"), which defines that the clock will be incremented by steps of 1 hour.

Next, you add the person population to the circus. A population is essentially a set of agents having an id and some attributes.

It's possible to specify the values of each attribute manually, but in most cases, you want to generate them randomly, so let's first define a few generators for that:

from trumania.core.random_generators import SequencialGenerator, FakerGenerator, NumpyRandomGenerator

id_gen = SequencialGenerator(prefix="PERSON_")
# loc is the mean age; 30 is in line with the sample output shown in the Result section
age_gen = NumpyRandomGenerator(method="normal", loc=30, scale=5,
                               seed=next(example_circus.seeder))
name_gen = FakerGenerator(method="name", seed=next(example_circus.seeder))

A Trumania generator is responsible for providing data when its generate() method is called. In the example above, id_gen will generate strings like PERSON_0000000000, PERSON_0000000001, and so on; age_gen will repeatedly sample from a normal distribution; and name_gen will provide random people's names.

Note that any statistical distribution from numpy as well as any Faker provider is available in Trumania, and Trumania can easily be extended with new ones.
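If the generator concept feels abstract, the following sketch may help. It is a toy re-creation in plain Python, not Trumania's implementation: the class names SequentialIdSketch and NormalSketch are hypothetical, and the 10-digit padding is only inferred from the IDs shown in the result tables of this tutorial. It illustrates the two ideas that matter here: a generate(size) method that hands out values on demand, and a seed that makes the random output reproducible.

```python
import random


class SequentialIdSketch:
    """Toy generator: a prefix followed by a zero-padded counter."""

    def __init__(self, prefix, width=10):
        self.prefix = prefix
        self.width = width
        self.counter = 0

    def generate(self, size):
        start = self.counter
        self.counter += size
        return [f"{self.prefix}{i:0{self.width}d}"
                for i in range(start, start + size)]


class NormalSketch:
    """Toy generator: seeded draws from a normal distribution."""

    def __init__(self, loc, scale, seed):
        self.loc = loc
        self.scale = scale
        self.rng = random.Random(seed)

    def generate(self, size):
        return [self.rng.gauss(self.loc, self.scale) for _ in range(size)]


ids = SequentialIdSketch(prefix="PERSON_").generate(size=3)
print(ids)

# Two generators built with the same seed produce the same values.
a = NormalSketch(loc=0, scale=1, seed=1).generate(size=3)
b = NormalSketch(loc=0, scale=1, seed=1).generate(size=3)
print(a == b)
```

Seeding every generator from one master seed, as the circus seeder does above, is what makes a whole scenario repeatable from run to run.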

With this in place, you can add a population of 1000 persons with sequential IDs and two basic random attributes:

person = example_circus.create_population(name="person", size=1000, ids_gen=id_gen)
person.create_attribute("NAME", init_gen=name_gen)
person.create_attribute("AGE", init_gen=age_gen)

Result

You can already have a look at the generated attributes of all members of the person population by calling person.to_dataframe().

+-------------------+-----------------+---------+
|                   | NAME            |     age |
|-------------------+-----------------+---------|
| PERSON_0000000000 | Amy Berger      | 28.588  |
| PERSON_0000000001 | Michael Curry   | 28.7499 |
| PERSON_0000000002 | Robert Ramirez  | 35.9242 |
| PERSON_0000000003 | Derek Gonzalez  | 34.7091 |
| PERSON_0000000004 | Gregory Fischer | 25.6009 |
| PERSON_0000000005 | Erica Walker    | 33.9758 |
| PERSON_0000000006 | Bradley Collins | 24.4428 |
| PERSON_0000000007 | James Rodriguez | 34.7835 |
| PERSON_0000000008 | Brandy Padilla  | 34.6296 |
| PERSON_0000000009 | Mark Taylor     | 38.2286 |
+-------------------+-----------------+---------+

Tip: if you want to see the entire script for this basic user population, check out this snippet on GitHub!

Hello World Statements: Creating Stories

Purpose

Let's make this a bit more interesting by adding a story. A Trumania story is essentially a piece of a scenario that encapsulates some dynamic aspect. A story is executed every time the simulated clock goes forward by one step.

Recall that you set step_duration=pd.Timedelta("1h") in the circus above, so this implies that every logical hour, each story will get a chance to be executed.

You're going to create a simple hello-world story in which all members of the person population above will regularly emit the message "hello world".

How-to

The first main aspect to specify when creating a story is the population that initiates it. In this case, this is going to be the person population. The second aspect is the timer, which defines when each member of the population triggers the story. For now, you are just hard-coding it to 1, which implies that all members will execute the story at every clock step (you'll build more elaborate timers in a later step):

from trumania.core.random_generators import ConstantDependentGenerator

hello_world = example_circus.create_story(
    name="hello_world",
    initiating_population=example_circus.populations["person"],
    member_id_field="PERSON_ID",
    timer_gen=ConstantDependentGenerator(value=1)
)

So far, the story is empty so it doesn't do anything: you need to add some operations to it. Operations can be random or deterministic, they can read and update any population's attribute or other piece of state in the circus, and even have side effects.

For now, let's add two simple deterministic operations and a logger to actually get some logs:

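from trumania.core.operations import FieldLogger
from trumania.core.random_generators import ConstantGenerator

hello_world.set_operations(
    example_circus.clock.ops.timestamp(named_as="TIME"),
    ConstantGenerator(value="hello world").ops.generate(named_as="MESSAGE"),
    FieldLogger(log_id="hello")
)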

The example_circus.clock.ops.timestamp operation generates a random timestamp inside the current 1-hour clock interval, and ConstantGenerator simply produces a constant, hard-coded value.
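To picture what "a random timestamp inside the current time interval" means, here is a small stand-alone sketch. It uses only the standard library and does not reproduce how Trumania's clock is implemented; the helper random_timestamp_in_step is hypothetical, and drawing the offset uniformly is an assumption of this sketch.

```python
import random
from datetime import datetime, timedelta

rng = random.Random(12345)
start = datetime(2017, 1, 1, 0, 0)
step = timedelta(hours=1)


def random_timestamp_in_step(step_start):
    """Pick a moment, to the second, inside [step_start, step_start + step)."""
    offset = rng.randrange(int(step.total_seconds()))
    return step_start + timedelta(seconds=offset)


current = start
stamps = []
for _ in range(48):  # 48 steps of 1 hour = 48 hours of simulated time
    stamps.append(random_timestamp_in_step(current))
    current += step

print(stamps[0], stamps[-1])
```

This is why the TIME values in the logs are spread within each hour rather than all falling on the hour.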

Result

Let's run the circus for 48h of simulated time:

example_circus.run(
    duration=pd.Timedelta("48h"),
    log_output_folder="example_scenario",
    delete_existing_logs=True
)

This should produce the following basic dataset, in which each of the 1000 persons said "hello world" 48 times over 2 days.

The first field of the resulting dataset, PERSON_ID, corresponds to the member_id_field of the story. The other two fields, TIME and MESSAGE, correspond to the two main operations that you put in your story.

+-------+-------------------+---------------------+-------------+
|       | PERSON_ID         | TIME                | MESSAGE     |
|-------+-------------------+---------------------+-------------|
|  0    | PERSON_0000000000 | 2017-01-01 01:14:12 | hello world |
|  1    | PERSON_0000000001 | 2017-01-01 01:23:51 | hello world |
|  2    | PERSON_0000000002 | 2017-01-01 01:37:11 | hello world |
|  3    | PERSON_0000000003 | 2017-01-01 01:33:12 | hello world |
|  4    | PERSON_0000000004 | 2017-01-01 01:08:02 | hello world |
|  5    | PERSON_0000000005 | 2017-01-01 01:13:36 | hello world |
|  6    | PERSON_0000000006 | 2017-01-01 01:35:08 | hello world |
|  7    | PERSON_0000000007 | 2017-01-01 01:24:30 | hello world |
|  8    | PERSON_0000000008 | 2017-01-01 01:01:51 | hello world |
|  9    | PERSON_0000000009 | 2017-01-01 01:41:41 | hello world |
| ...                                                           |
| 23990 | PERSON_0000000990 | 2017-01-02 23:33:28 | hello world |
| 23991 | PERSON_0000000991 | 2017-01-02 23:15:26 | hello world |
| 23992 | PERSON_0000000992 | 2017-01-02 23:45:56 | hello world |
| 23993 | PERSON_0000000993 | 2017-01-02 23:25:16 | hello world |
| 23994 | PERSON_0000000994 | 2017-01-02 23:33:55 | hello world |
| 23995 | PERSON_0000000995 | 2017-01-02 23:41:45 | hello world |
| 23996 | PERSON_0000000996 | 2017-01-02 23:39:23 | hello world |
| 23997 | PERSON_0000000997 | 2017-01-02 23:45:53 | hello world |
| 23998 | PERSON_0000000998 | 2017-01-02 23:29:25 | hello world |
| 23999 | PERSON_0000000999 | 2017-01-02 23:12:24 | hello world |
+-------+-------------------+---------------------+-------------+

Tip: check out the entire snippet on GitHub

Someone to say Hello World to

Purpose

Let's improve this a bit by adding another random person to the conversation. For now, you're going to say that any person can speak to any other person with equal probability, which is not very realistic (you'll add a social network in a later step). Also, you're going to enrich the dataset by adding people's names based on their IDs.

How-to

Let's say the other person is going to be stored in a new OTHER_PERSON field. In this simplistic case, you just want to put in that field the result of a random unweighted selection from the person population. You can simply use the select_one random operation of that population:

hello_world.set_operations(
    example_circus.clock.ops.timestamp(named_as="TIME"),
    ConstantGenerator(value="hello world").ops.generate(named_as="MESSAGE"),

    example_circus.populations["person"].ops.select_one(named_as="OTHER_PERSON"),

    example_circus.populations["person"]
        .ops.lookup(id_field="PERSON_ID", select={"NAME": "EMITTER_NAME"}),

    example_circus.populations["person"]
        .ops.lookup(id_field="OTHER_PERSON", select={"NAME": "RECEIVER_NAME"}),

    FieldLogger(log_id="hello")
)
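Conceptually, the operations above do two things to every row: pick a counterpart uniformly at random, then look up attributes by ID. The sketch below mimics that with plain dictionaries. It is not Trumania code, and the three-person population is made up for the illustration.

```python
import random

rng = random.Random(7)

# Hypothetical miniature population: id -> NAME attribute.
names = {"PERSON_0": "Amy Berger", "PERSON_1": "Mark Taylor", "PERSON_2": "Erica Walker"}
ids = list(names)

rows = [{"PERSON_ID": pid} for pid in ids]
for row in rows:
    # unweighted selection: every member is equally likely
    # (in this sketch, a person may even pick themselves)
    row["OTHER_PERSON"] = rng.choice(ids)
    # lookups: enrich the row with attributes found through an id field
    row["EMITTER_NAME"] = names[row["PERSON_ID"]]
    row["RECEIVER_NAME"] = names[row["OTHER_PERSON"]]

for row in rows:
    print(row)
```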

Result

When you run the circus again, you end up with a dataset with the new OTHER_PERSON field:

+-------+-------------------+---------------------+-------------+-------------------+---------------------+--------------------+
|       | PERSON_ID         | TIME                | MESSAGE     | OTHER_PERSON      | EMITTER_NAME        | RECEIVER_NAME      |
|-------+-------------------+---------------------+-------------+-------------------+---------------------+--------------------|
|  0    | PERSON_0000000000 | 2017-01-01 01:14:12 | hello world | PERSON_0000000852 | Ann Cruz            | Sophia Black       |
|  1    | PERSON_0000000001 | 2017-01-01 01:23:51 | hello world | PERSON_0000000429 | Kimberly Sanchez    | Jeffrey Ryan       |
|  2    | PERSON_0000000002 | 2017-01-01 01:37:11 | hello world | PERSON_0000000925 | Bethany Smith       | Regina Brown       |
|  3    | PERSON_0000000003 | 2017-01-01 01:33:12 | hello world | PERSON_0000000347 | Frank Middleton     | Jacob Ross         |
|  4    | PERSON_0000000004 | 2017-01-01 01:08:02 | hello world | PERSON_0000000211 | Cheryl Decker       | Joshua Miller      |
|  5    | PERSON_0000000005 | 2017-01-01 01:13:36 | hello world | PERSON_0000000779 | Thomas Rodriguez    | Nicole Tanner      |
|  6    | PERSON_0000000006 | 2017-01-01 01:35:08 | hello world | PERSON_0000000331 | James Peters        | Melissa Rogers     |
|  7    | PERSON_0000000007 | 2017-01-01 01:24:30 | hello world | PERSON_0000000234 | Allison Hansen      | Taylor Smith       |
|  8    | PERSON_0000000008 | 2017-01-01 01:01:51 | hello world | PERSON_0000000678 | Candice Sellers     | James Smith        |
|  9    | PERSON_0000000009 | 2017-01-01 01:41:41 | hello world | PERSON_0000000108 | Maria White         | James Nguyen       |
| ...                                                                                                                          | 
| 23990 | PERSON_0000000990 | 2017-01-02 23:33:28 | hello world | PERSON_0000000689 | Natasha Brown       | Bailey Ramirez DDS |
| 23991 | PERSON_0000000991 | 2017-01-02 23:15:26 | hello world | PERSON_0000000250 | Shelly Ponce        | Jordan Johnson     |
| 23992 | PERSON_0000000992 | 2017-01-02 23:45:56 | hello world | PERSON_0000000413 | Steven Mendez       | Crystal Duffy      |
| 23993 | PERSON_0000000993 | 2017-01-02 23:25:16 | hello world | PERSON_0000000670 | Morgan Rice         | Jonathan Obrien    |
| 23994 | PERSON_0000000994 | 2017-01-02 23:33:55 | hello world | PERSON_0000000411 | Crystal Vincent     | George Mathis      |
| 23995 | PERSON_0000000995 | 2017-01-02 23:41:45 | hello world | PERSON_0000000563 | Dawn Kim            | Jennifer Martinez  |
| 23996 | PERSON_0000000996 | 2017-01-02 23:39:23 | hello world | PERSON_0000000668 | Jimmy Franco        | Dr. Mark Ruiz MD   |
| 23997 | PERSON_0000000997 | 2017-01-02 23:45:53 | hello world | PERSON_0000000268 | Christian Christian | Victoria Donovan   |
| 23998 | PERSON_0000000998 | 2017-01-02 23:29:25 | hello world | PERSON_0000000829 | David Hernandez     | Margaret Anderson  |
| 23999 | PERSON_0000000999 | 2017-01-02 23:12:24 | hello world | PERSON_0000000944 | Tamara Ramirez      | Andrew Curtis      |
+-------+-------------------+---------------------+-------------+-------------------+---------------------+--------------------+

Tip: you can see the entire code here

You always say that

Purpose

Let's improve further by generating a random sentence in the MESSAGE field instead of a boring hello world hard-coded string. This will allow you to illustrate a first basic usage of Trumania's Relationship.

You are going to associate each person with 4 quotes. Any time a person emits a message, the content will be picked among the 4 quotes of that person. Moreover, you are going to associate each quote with a different weight that defines how likely that quote is: a quote with a high weight will be emitted more often by its owner than a quote with a low weight.

Now how do you get started?

How-to

The first step is to create a quote generator, which is similar to the other Faker generators you've seen above:

quote_generator = FakerGenerator(method="sentence", nb_words=6, variable_nb_words=True,
                                 seed=next(example_circus.seeder))

You could use that generator as-is in the story as before, but let's make things a bit more interesting by constraining each person to always use one of their 4 favorite quotes. You do this by creating a Relationship between the members of the person population and some sentence values.

First, you create an empty relationship:

quotes_rel = example_circus.populations["person"].create_relationship("quotes")

Then, you populate this relationship with 4 quotes for each person. In the code below person.ids is a pandas Series of size 1000 with all the ids of the person population, and quote_generator.generate(size=person.size) provides another pandas Series also of size 1000, with random quotes.

This means each pass of the for-loop below adds 1000 relations to the relationship: one quote for each of the 1000 users. The first 1000 quotes are associated with weight=1, the next 1000 are made more frequent by associating them with weight=2, and so on up to weight=4.

# start at 1: a weight of 0 would mean the first quote is never selected
for w in range(1, 5):
    quotes_rel.add_relations(
        from_ids=person.ids,
        to_ids=quote_generator.generate(size=person.size),
        weights=w
    )

You can now replace the ConstantGenerator in your story with a select_one operation on that relationship. This reads as follows: for every value currently in the PERSON_ID field, look up the relations in quotes that start from that ID and randomly select one related value. Since no weights are overridden in the operation below, the default relationship weights defined above in add_relations are used:

hello_world.set_operations(
    example_circus.clock.ops.timestamp(named_as="TIME"),

    example_circus.populations["person"].get_relationship("quotes")
        .ops.select_one(from_field="PERSON_ID", named_as="MESSAGE"),

    example_circus.populations["person"].ops.select_one(named_as="OTHER_PERSON"),

    example_circus.populations["person"]
        .ops.lookup(id_field="PERSON_ID", select={"NAME": "EMITTER_NAME"}),

    example_circus.populations["person"]
        .ops.lookup(id_field="OTHER_PERSON", select={"NAME": "RECEIVER_NAME"}),

    FieldLogger(log_id="hello")
)

Result

When you run the circus, the MESSAGE field should now contain the random quote:

+-------+-------------------+---------------------+-----------------------------------------------------+-------------------+---------------------+---------------------+
|       | PERSON_ID         | TIME                | MESSAGE                                             | OTHER_PERSON      | EMITTER_NAME        | RECEIVER_NAME       |
|-------+-------------------+---------------------+-----------------------------------------------------+-------------------+---------------------+---------------------|
|  0    | PERSON_0000000000 | 2017-01-01 01:14:12 | Become period risk wait now toward less.            | PERSON_0000000158 | Ann Cruz            | Victoria Washington |
|  1    | PERSON_0000000001 | 2017-01-01 01:23:51 | Month push down.                                    | PERSON_0000000366 | Kimberly Sanchez    | Steven Williams     |
|  2    | PERSON_0000000002 | 2017-01-01 01:37:11 | Blue general carry else deep problem area.          | PERSON_0000000613 | Bethany Smith       | Frances Davis       |
|  3    | PERSON_0000000003 | 2017-01-01 01:33:12 | Phone sister pretty suddenly allow conference.      | PERSON_0000000086 | Frank Middleton     | Ashley Fernandez    |
|  4    | PERSON_0000000004 | 2017-01-01 01:08:02 | Each discussion several send wide process.          | PERSON_0000000972 | Cheryl Decker       | Latoya Flynn        |
|  5    | PERSON_0000000005 | 2017-01-01 01:13:36 | Guess can issue writer.                             | PERSON_0000000030 | Thomas Rodriguez    | Lindsay Bailey      |
|  6    | PERSON_0000000006 | 2017-01-01 01:35:08 | Boy step claim camera common but our.               | PERSON_0000000548 | James Peters        | Thomas Ward         |
| ...                                                                                                                                                                   |
| 23993 | PERSON_0000000993 | 2017-01-02 23:25:16 | May course and kill from news.                      | PERSON_0000000202 | Morgan Rice         | Patricia Williams   |
| 23994 | PERSON_0000000994 | 2017-01-02 23:33:55 | Water door live hospital together safe.             | PERSON_0000000268 | Crystal Vincent     | Victoria Donovan    |
| 23995 | PERSON_0000000995 | 2017-01-02 23:41:45 | Stage control management read although thousand.    | PERSON_0000000466 | Dawn Kim            | Ashley Baxter       |
| 23996 | PERSON_0000000996 | 2017-01-02 23:39:23 | Their that region majority break article.           | PERSON_0000000620 | Jimmy Franco        | Alex Dominguez      |
| 23997 | PERSON_0000000997 | 2017-01-02 23:45:53 | Lead simple audience response eye.                  | PERSON_0000000319 | Christian Christian | David Smith         |
| 23998 | PERSON_0000000998 | 2017-01-02 23:29:25 | Vote man son yeah child.                            | PERSON_0000000487 | David Hernandez     | Joshua Park         |
| 23999 | PERSON_0000000999 | 2017-01-02 23:12:24 | Answer region wind condition someone bed cover.     | PERSON_0000000237 | Tamara Ramirez      | Beverly Rodriguez   |
+-------+-------------------+---------------------+-----------------------------------------------------+-------------------+---------------------+---------------------+

The point of the relationship used for the quotes is that now the fields of the generated dataset are related: some sentences are more frequently used by some users.
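The effect of the weights can be checked with a small experiment. The sketch below is plain Python rather than Trumania code: the relationship is modelled as a dictionary, the quotes are placeholders, and select_one is a hypothetical helper that only borrows the name of the operation used above.

```python
import random
from collections import Counter

rng = random.Random(2017)

# Hypothetical relationship: person id -> list of (quote, weight) pairs.
quotes_rel = {
    "PERSON_0": [("quote A", 1), ("quote B", 2), ("quote C", 3), ("quote D", 4)],
}


def select_one(rel, from_id):
    """Pick one related value with probability proportional to its weight."""
    quotes, weights = zip(*rel[from_id])
    return rng.choices(quotes, weights=weights, k=1)[0]


counts = Counter(select_one(quotes_rel, "PERSON_0") for _ in range(10_000))
for quote, n in sorted(counts.items()):
    print(quote, n)
```

With weights of 1, 2, 3 and 4, the last quote should come up roughly four times as often as the first one.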

Tip: view the entire code snippet on GitHub

It ain't what you do, it's the time that you do it

Purpose

So far, every member of the person population is producing exactly one message every hour of the day. Let's make that a bit more realistic.

Trumania uses a combination of 2 concepts to control the time aspect of the execution of a story: activity levels and timer profiles. Together, they determine the probability of triggering the story given a specific user and time of the day.
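One way to picture how the two concepts combine is the toy model below. It is only a mental model, not a description of Trumania's internal mechanism: the 24-value profile is invented for the illustration, and the idea of multiplying an activity level by the share of each hour is an assumption of this sketch.

```python
# Hypothetical daily profile: relative weight of each hour, from 00h to 23h.
raw_profile = [1, 1, 1, 1, 1, 2, 4, 6, 8, 9, 9, 8,
               8, 8, 8, 8, 9, 10, 10, 9, 7, 5, 3, 2]
total = sum(raw_profile)
profile = [w / total for w in raw_profile]  # shares that sum to 1


def expected_triggers_per_hour(activity_per_day):
    """Spread a daily activity level over the hours of the day."""
    return [activity_per_day * share for share in profile]


quiet_user = expected_triggers_per_hour(3)
busy_user = expected_triggers_per_hour(20)

for hour in (3, 12, 18):
    print(hour, round(quiet_user[hour], 3), round(busy_user[hour], 3))
```

In this picture, the profile decides when activity happens and is shared by everyone, while the activity level decides how much each member does.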

How-to

You'll update the story and parameterize those two aspects.

First, let's create the timer profile of the story, which controls the moments when more or fewer actions are triggered. This profile is the same for all members of the population that trigger the story.

Trumania comes with some interesting built-in defaults, like DefaultDailyTimerGenerator, which corresponds to the daily phone usage we have measured on an actual production dataset:

[Figure: daily activity curve of DefaultDailyTimerGenerator]

Let's instantiate one in your scenario:

from trumania.components.time_patterns.profilers import DefaultDailyTimerGenerator

story_timer_gen = DefaultDailyTimerGenerator(
    clock=example_circus.clock, 
    seed=next(example_circus.seeder))

Second, there are the population members' activity levels for that story, which define how active each user is within that story. You are free to set a different level for each member, so that some members trigger the story more often than others.

You could specify each level manually if you felt like it, though here you're just going to use a random generator that will assign an activity level to each population member.

First, you define three activity levels of 3, 10 and 20 triggers per day on average:

low_activity = story_timer_gen.activity(n=3, per=pd.Timedelta("1 day"))
med_activity = story_timer_gen.activity(n=10, per=pd.Timedelta("1 day"))
high_activity = story_timer_gen.activity(n=20, per=pd.Timedelta("1 day"))

Then, you can create the actual activity generator, which is going to be yet another Numpy generator, this time based on the choice method. The one below will assign a low activity to 20% of the population, a medium activity to 70% and a high activity to 10%:

activity_gen = NumpyRandomGenerator(
    method="choice", 
    a=[low_activity, med_activity, high_activity],
    p=[.2, .7, .1],
    seed=next(example_circus.seeder))

And you can now use this timer profile and activity generator as part of your story:

hello_world = example_circus.create_story(
    name="hello_world",
    initiating_population=example_circus.populations["person"],
    member_id_field="PERSON_ID",

    timer_gen=story_timer_gen,
    activity_gen=activity_gen
)

Result

The resulting dataset has the same schema as before:

+-------+-------------------+---------------------+---------------------------------------------------------+-------------------+---------------------+----------------------+
|       | PERSON_ID         | TIME                | MESSAGE                                                 | OTHER_PERSON      | EMITTER_NAME        | RECEIVER_NAME        |
|-------+-------------------+---------------------+---------------------------------------------------------+-------------------+---------------------+----------------------|
|  0    | PERSON_0000000016 | 2017-01-01 00:14:12 | Still try sex sure.                                     | PERSON_0000000739 | Johnny Moore        | Adam Barrett         |
|  1    | PERSON_0000000063 | 2017-01-01 00:23:51 | Resource magazine wide.                                 | PERSON_0000000511 | Cody Pham           | Tina Simmons         |
|  2    | PERSON_0000000064 | 2017-01-01 00:37:11 | Heat sure simple letter better forget.                  | PERSON_0000000044 | Katherine Fleming   | Angela Gray          |
|  3    | PERSON_0000000076 | 2017-01-01 00:33:12 | Star our conference always place ball.                  | PERSON_0000000959 | Benjamin Reese II   | Barbara Alexander    |
|  4    | PERSON_0000000088 | 2017-01-01 00:08:02 | Hot event five left become time commercial.             | PERSON_0000000000 | Zachary Cook        | Ann Cruz             |
| ...                                                                                                                                                                        |
| 28556 | PERSON_0000000962 | 2017-01-04 00:07:45 | Then include rock rule scientist condition.             | PERSON_0000000448 | Johnny Washington   | Melissa Adams        |
| 28557 | PERSON_0000000980 | 2017-01-04 00:04:44 | Collection full argue interview property pattern never. | PERSON_0000000731 | James Ramirez       | David Morrow         |
| 28558 | PERSON_0000000983 | 2017-01-04 00:23:21 | High day prepare see.                                   | PERSON_0000000509 | Evan Horne          | Robert Sims          |
| 28559 | PERSON_0000000988 | 2017-01-04 00:43:56 | Likely south school result case federal seat.           | PERSON_0000000819 | Benjamin Campbell   | Michelle Jackson DDS |
+-------+-------------------+---------------------+---------------------------------------------------------+-------------------+---------------------+----------------------+

Tip: check out the snippet on GitHub

Result Analytics

The time patterns that you included in the configuration of your scenario should now be visible in the inner structure of the resulting dataset. You can demonstrate that with a basic data analysis of resulting_dataset, the DataFrame that holds the generated logs:

  • the number of messages at each hour of the day should follow the specified distribution (note how similar it is to the curve of DefaultDailyTimerGenerator above):
time_profile = (
    resulting_dataset[["MESSAGE", "TIME"]]
    .groupby(by=resulting_dataset.TIME.dt.hour)["MESSAGE"]
    .count()
)
time_profile.plot()

[Figure: number of messages per hour of the day]

  • the histogram of the number of messages per user should also reflect your configuration:
    • you configured low_activity to 3 messages per day on average, med_activity to 10 and high_activity to 20, and ran the simulation over 5 days. This should yield 3 groups of persons with roughly 15, 50 and 100 total messages, which correspond to the 3 centers of the histogram below
    • you configured the probabilities of these activity levels to .2, .7 and .1, and this again roughly corresponds to the respective area of each "bump" of the histogram:
usage_per_user = resulting_dataset[["MESSAGE", "PERSON_ID"]].groupby("PERSON_ID")["MESSAGE"].count()
usage_per_user.plot(kind="hist")

Histogram of message frequency per user
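The arithmetic behind those expectations can be checked with a few lines of Python. This is only a back-of-the-envelope sketch that reuses the numbers quoted above (activity levels, probabilities, 1000 persons, 5 days); the variable names are made up for the illustration.

```python
levels = {"low": 3, "medium": 10, "high": 20}       # messages per day, on average
shares = {"low": 0.2, "medium": 0.7, "high": 0.1}   # share of the population
days = 5
population = 1000

# Expected center of each "bump" and rough number of persons in it.
centers = {name: per_day * days for name, per_day in levels.items()}
group_sizes = {name: round(share * population) for name, share in shares.items()}
for name in levels:
    print(name, "center:", centers[name], "persons:", group_sizes[name])

# Average daily activity over the whole population, and the implied log size.
mean_per_person_per_day = sum(levels[name] * shares[name] for name in levels)
expected_total = mean_per_person_per_day * population * days
print(round(mean_per_person_per_day, 2), round(expected_total))
```

These are averages: since the story is triggered at random, the actual counts scatter around each center, which is what gives the histogram its bumps rather than three sharp spikes.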

The Social Network

Purpose

You can improve the scenario even further by connecting the people such that they send messages to their friends during the execution of your story, instead of to a random person. This will introduce a lot of structure to the resulting dataset since now the social graph should be discoverable from the messages log.

How-to

You can model the social network with a Relationship again, which is the same concept as you used above to model quotes. You may recall that the quotes relationship had a weight. That weight defined which quote was more or less likely, given a person. You can do the same thing here and use weights to define which friends are more or less likely to be called than others.

Note that here you are defining a relationship from person to person, although in general Trumania enables you to define relationships between any pair of populations.

There are many ways to generate the edges of a social graph. One basic and classic one is simply to rely on the Erdos-Renyi algorithm. This is particularly easy in Trumania since it is already built in the framework. You just need to inject it into your scenario to create a relationship.
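For intuition, the Erdos-Renyi idea can be sketched in a few lines: every possible pair of persons becomes an edge independently, with the same probability p. The code below is a stand-alone illustration, not Trumania's built-in implementation; erdos_renyi_edges is a hypothetical helper, and choosing p as average_degree / (n - 1) is the standard way to target a given average degree.

```python
import random
from itertools import combinations


def erdos_renyi_edges(node_ids, average_degree, rng):
    """Keep each possible pair as an edge with probability p."""
    n = len(node_ids)
    p = average_degree / (n - 1)  # each node has n - 1 potential neighbours
    return [(a, b) for a, b in combinations(node_ids, 2) if rng.random() < p]


rng = random.Random(12345)
ids = [f"PERSON_{i:010d}" for i in range(1000)]
edges = erdos_renyi_edges(ids, average_degree=20, rng=rng)

# Every edge contributes to the degree of two nodes.
mean_degree = 2 * len(edges) / len(ids)
print(len(edges), round(mean_degree, 2))
```

The number of friends varies from person to person; only the average is controlled.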

Here, you're going to define a social graph between the members of the person population in which each person has 20 friends on average. Note that a weight will automatically be added to that relationship, following a Pareto distribution.

from trumania.components.social_networks.erdos_renyi import WithErdosRenyi

# self is the circus here, see example code on github for details
self.add_er_social_network_relationship(
    self.populations["person"],
    relationship_name="friends",
    average_degree=20)
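
To get an intuition of what such a graph looks like, here is a minimal standalone sketch of an Erdős-Rényi graph with Pareto-distributed weights. It is not Trumania's implementation: the function name, the Pareto shape parameter, the population size of 500 and the choice of drawing one independent weight per direction are all assumptions made for this illustration.

```python
import random

def erdos_renyi_friends(n_persons, average_degree, pareto_shape=1.2, seed=0):
    """Return {person: {friend: weight}} for a G(n, p) random graph.

    Hypothetical helper, not part of Trumania. pareto_shape is an assumed value.
    """
    rng = random.Random(seed)
    # In G(n, p), the expected degree is p * (n - 1), so solve for p
    p_edge = average_degree / (n_persons - 1)
    friends = {person: {} for person in range(n_persons)}
    for a in range(n_persons):
        for b in range(a + 1, n_persons):
            # Each possible pair is connected independently with probability p_edge
            if rng.random() < p_edge:
                # Assumption: one independent Pareto weight per direction
                friends[a][b] = rng.paretovariate(pareto_shape)
                friends[b][a] = rng.paretovariate(pareto_shape)
    return friends

friends = erdos_renyi_friends(n_persons=500, average_degree=20)
mean_degree = sum(len(f) for f in friends.values()) / len(friends)
print(round(mean_degree, 1))
```

The degree of each individual person varies around the requested average, so 20 is the mean number of friends and not a fixed number.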

With this relationship in place, you can revisit your story once again. This time, the OTHER_PERSON will no longer be picked at random from the whole population. Instead, it will be selected among the friends of each person, with a probability proportional to the weight defined in the social network:

hello_world.set_operations(
    self.clock.ops.timestamp(named_as="TIME"),

    self.populations["person"].get_relationship("quotes")
        .ops.select_one(from_field="PERSON_ID", named_as="MESSAGE"),

    self.populations["person"]
        .get_relationship("friends")
        .ops.select_one(from_field="PERSON_ID", named_as="OTHER_PERSON"),

    self.populations["person"]
        .ops.lookup(id_field="PERSON_ID", select={"NAME": "EMITTER_NAME"}),

    self.populations["person"]
        .ops.lookup(id_field="OTHER_PERSON", select={"NAME": "RECEIVER_NAME"}),

    operations.FieldLogger(log_id="hello_6")
)
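
The idea of selecting a friend with a probability proportional to a weight can be sketched in plain Python. This is only an illustration of the sampling principle, not of how Trumania implements select_one: the helper pick_weighted_friend and the weights below are hypothetical.

```python
import random
from collections import Counter

# Hypothetical friends of one person, with made-up weights
friend_weights = {
    "PERSON_0000000321": 6.0,
    "PERSON_0000000697": 3.0,
    "PERSON_0000000989": 1.0,
}

def pick_weighted_friend(weights_by_friend, rng):
    """Pick one friend with a probability proportional to its weight."""
    friend_ids = list(weights_by_friend)
    weights = list(weights_by_friend.values())
    return rng.choices(friend_ids, weights=weights, k=1)[0]

rng = random.Random(1)
n_messages = 10_000
picks = Counter(pick_weighted_friend(friend_weights, rng) for _ in range(n_messages))

total_weight = sum(friend_weights.values())
for friend_id, weight in friend_weights.items():
    expected = weight / total_weight
    observed = picks[friend_id] / n_messages
    print(friend_id, "expected", expected, "observed", observed)
```

Over many messages, the share of messages sent to each friend approaches that friend's weight divided by the sum of the weights.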

Result

Once again, the resulting dataset has the same data schema as before, although this time it has even more structure than before thanks to your basic social network:

+-------+-------------------+---------------------+----------------------------------------------------+-------------------+-------------------+------------------+
|       | PERSON_ID         | TIME                | MESSAGE                                            | OTHER_PERSON      | EMITTER_NAME      | RECEIVER_NAME    |
|-------+-------------------+---------------------+----------------------------------------------------+-------------------+-------------------+------------------|
|  0    | PERSON_0000000004 | 2017-01-01 00:14:12 | Clearly see sure.                                  | PERSON_0000000321 | Cheryl Decker     | Ian White        |
|  1    | PERSON_0000000012 | 2017-01-01 00:23:51 | Offer or interview clear structure watch capital.  | PERSON_0000000697 | Kelly Green       | Ashley Turner    |
|  2    | PERSON_0000000018 | 2017-01-01 00:37:11 | Recently race draw thousand around ahead.          | PERSON_0000000989 | Laura Stephenson  | Hannah James     |
|  3    | PERSON_0000000040 | 2017-01-01 00:33:12 | To protect man image power beyond.                 | PERSON_0000000418 | Jeffrey Miller    | Scott Collins    |
| ...
| 28966 | PERSON_0000000985 | 2017-01-04 00:09:23 | However agreement fear door land hotel.            | PERSON_0000000211 | Laura Mason       | Joshua Miller    |
| 28967 | PERSON_0000000995 | 2017-01-04 00:48:08 | Top heat window quite forward friend somebody.     | PERSON_0000000171 | Dawn Kim          | Sandra Phillips  |
| 28968 | PERSON_0000000999 | 2017-01-04 00:58:46 | Answer region wind condition someone bed cover.    | PERSON_0000000252 | Tamara Ramirez    | Jordan Collins   |
+-------+-------------------+---------------------+----------------------------------------------------+-------------------+-------------------+------------------+

Tip: you can take a look at the snippet on GitHub.

Result Analytics

A very crude way of looking into that social network is simply to count the number of unique friends contacted by each person. This gives you a lower bound on the actual number of people in each person's social network, and the bound gets tighter the longer the simulation runs. Since the weights follow a Pareto distribution, many people have friends who are very unlikely to be contacted, so the simulation needs to run for a very long time before you observe every possible recipient.

Executing the scenario for 15 simulated days yields the following histogram, which is centered slightly below the average degree of 20 you set in your social graph:

# count the distinct recipients observed for each sender
social_group_size = (
    resulting_dataset
        .groupby("PERSON_ID")["OTHER_PERSON"]
        .nunique()
)
social_group_size.plot(kind="hist")

Histogram of observed social group size
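
The lower-bound effect can be reproduced with a small standalone sketch. It assumes a single person with exactly 20 friends and Pareto weights with a shape parameter of 1.2 (both assumptions for this illustration, not Trumania defaults), and counts how many distinct friends have been contacted after 45, 150 and 300 messages, which is about what the low, medium and high activity levels send in 15 days.

```python
import random

rng = random.Random(7)
n_friends = 20

# Assumption: Pareto weights with shape 1.2, a few friends dominate the others
weights = [rng.paretovariate(1.2) for _ in range(n_friends)]

# One long sequence of weighted picks; prefixes stand for shorter simulations
draws = rng.choices(range(n_friends), weights=weights, k=300)

# Distinct friends observed after 45, 150 and 300 messages
observed = {n: len(set(draws[:n])) for n in (45, 150, 300)}
print(observed)
```

The observed count can never exceed the true number of friends and can only grow as more messages are sent, which is why the histogram above sits a bit below the configured average degree.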

There is more!

Several features of Trumania have not been described in this post:

  • having several populations in a scenario
  • having several stories in a scenario
  • adding probabilistic dependencies between stories (e.g. a person calling a friend after having been called)
  • state update (e.g. decreasing balance accounts based on phone call duration)
  • re-usable pieces of scenario, like geographies, more timer profiles, ...
  • persistence of pieces of scenario

Conclusion

We introduced Trumania as a scenario-based data generator library in Python. The generated datasets can be used for a wide range of applications such as testing, learning, and benchmarking.

We explained that in order to properly test an application or algorithm, we need datasets that respect some expected statistical properties. We illustrated that Trumania is capable of doing that with an example in which we generated a basic "message log" dataset that respects a distribution of the number of messages per user, a distribution of the time of day, and a distribution of messages across a social graph.

We hope we also demonstrated that the flexibility of Trumania's components allows the creation of a wide range of scenarios.

A Trumania scenario can include relationships between entities and can be configured with theoretical random distributions, empirical distributions observed in a real dataset, or even real production data (e.g. a dataset of location ids). We demonstrated this by reusing a statistical distribution of phone usage timestamps observed in a real dataset to configure the random timer of our story.

With this in mind, go crazy! Go to Trumania's GitHub and write your own scenario ;)

Otherwise, feel free to reach out to either Svend Vanderveken or co-author Milan van der Meer:

  • Svend Vanderveken is a freelance software engineer with expertise in data-processing solutions, mostly in Kafka, Scala, SQL, Python... Follow him via his blog svend.kelesia.com and twitter @sv3ndk.
  • Milan van der Meer currently works as a software engineer for Real Impact Analytics. You can always connect with him on LinkedIn.