Synthetic Data — YData Fabric

Caio Gasparine
7 min read · Nov 28, 2023


Practical exercise with Synthetic data generation

Photo by Stephen Phillips - Hostreviews.co.uk on Unsplash

Synthetic data in 6 easy steps!

Step 1 — Choose your dataset and upload it

Step 2 — Explore your data

Step 3 — Labs

Step 4 — Training your model

Step 5 — Generate your Synthetic Data

Step 6 — Compare the results

Step 1 — Choose your dataset and upload it

Go to the YData website and create a FREE trial account using the Fabric Community version.

With the account created, we can start our project. First, we need a dataset to ingest into the YData Fabric platform. Please download the following dataset from Kaggle:

You should now have a file called bankloan.CSV, which we are going to use with YData. We are not exploring the data or performing any exploratory data analysis (EDA) yet, because we will use that feature inside the YData Fabric platform.

In the YData platform select +Add Dataset.

Upload the dataset and choose a display name.

After that just hit Create Connector.

Note: For this exercise we are using option 1, Upload file, with a .CSV (comma-separated values) file, but the platform also offers other data sources you can connect to.

Confirm your dataset information and select Create Dataset.

Wait a couple of seconds… the duration of this task depends on the size of the dataset you are ingesting, so expect more time for bigger datasets.

OK! Now your dataset is ready to be used!

Step 2 — Explore your data

Once your dataset is ready to use, simply select your dataset name (yes, click on the dataset name). This opens several options: Overview, Profiling, Metadata, and Connection Details.

Overview

This is where you find the big picture of your dataset: the platform presents the headline numbers that summarize it.

Profiling

Here you can explore your data in more depth, looking at fields, correlations, missing values, outliers, etc.
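To make two of these profiling checks concrete, here is a minimal standard-library sketch of counting missing values per column and flagging outliers with the common 1.5 × IQR rule. It is not how YData Fabric computes its profile; the sample data and column names are hypothetical and are not taken from bankloan.CSV.

```python
import csv
import io
import statistics

# Hypothetical sample standing in for a few rows of a loan dataset;
# the column names are illustrative, not taken from bankloan.CSV.
SAMPLE_CSV = """age,income,loan_amount
25,40,0
31,,120
45,85,0
52,90,300
38,60,
29,48,80
61,400,150
"""


def load_rows(text):
    """Parse CSV text into a list of dicts (one per row)."""
    return list(csv.DictReader(io.StringIO(text)))


def missing_counts(rows):
    """Count empty cells per column."""
    columns = rows[0].keys()
    return {col: sum(1 for r in rows if r[col].strip() == "") for col in columns}


def iqr_outliers(values):
    """Return the values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4)
    spread = q3 - q1
    low, high = q1 - 1.5 * spread, q3 + 1.5 * spread
    return [v for v in values if v < low or v > high]


rows = load_rows(SAMPLE_CSV)
incomes = [float(r["income"]) for r in rows if r["income"].strip()]

print(missing_counts(rows))   # one empty cell each in income and loan_amount
print(iqr_outliers(incomes))  # the 400 income is flagged
```

A quick check like this is useful mainly as a cross-reference: if the platform reports missing values or outliers in a column, you should be able to reproduce roughly the same picture from the raw file.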

Metadata

In this option you can explore all the metadata, such as column names, data types, formats, etc.

Metadata means "data about data". It is data that provides information about one or more aspects of other data, and it is used to summarize basic information that makes tracking and working with specific data easier.
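As an illustration of what "data about data" looks like in practice, the sketch below infers a simple type and a missing-value count for each column of a CSV file. It uses only the Python standard library, the sample columns are hypothetical, and the inference rule (try integer, then float, else string) is a simplification rather than the platform's own logic.

```python
import csv
import io

# Hypothetical sample; column names are illustrative only.
SAMPLE_CSV = """customer_id,age,income,defaulted
C001,25,40.5,no
C002,31,,yes
C003,45,85.0,no
"""


def infer_type(values):
    """Guess a column type from its non-empty string values."""
    present = [v for v in values if v.strip() != ""]
    if not present:
        return "unknown"
    for name, cast in (("integer", int), ("float", float)):
        try:
            for v in present:
                cast(v)
            return name
        except ValueError:
            continue
    return "string"


def describe_columns(text):
    """Build a small metadata record (type, missing count) per column."""
    rows = list(csv.DictReader(io.StringIO(text)))
    meta = {}
    for col in rows[0].keys():
        values = [r[col] for r in rows]
        meta[col] = {
            "type": infer_type(values),
            "missing": sum(1 for v in values if v.strip() == ""),
        }
    return meta


meta = describe_columns(SAMPLE_CSV)
for column, info in meta.items():
    print(column, info)
```

Confirming the inferred types matters later on, because the synthesizer treats numerical and categorical columns differently.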

and Connection Details…

Step 3 — Labs

On the left menu select the Labs option and then select (+) Create Lab.

Select your IDE:

Select your bundle (in our case, Community).

You can customize your specs (within the limits of the Community bundle). Hit Next.

Give a name to your Lab.

Confirm and select Create.

Just wait a couple of seconds and your Lab will be ready to be used.

With Labs you get a complete environment (hardware and software) to run your code.

Step 4 — Training your model

On the left menu select the Synthetic Data option and then select (+) Create Synthetic Data.

Select your dataset, select the columns to generate, confirm the metadata, and then choose Save.

In this step, you can also choose an Anonymize method to be applied for some of the columns (Task 3).

Now your synthetic data model is being trained. ;-)

If you click on the name of the model, you can see some results for the newly deployed model. From there you can generate a new dataset using the button Go to Generation >.

Step 5 — Generate your Synthetic Data

After you train your model, you can choose the option Go to Generation > to create a new dataset based on the deployed model.

All you have to do is define the number of rows (records) to be generated by the model, and then hit Generate.

Now, a new dataset will be generated based on the model trained using the original dataset.
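The train-then-generate workflow can be sketched with a deliberately simple toy model. This is not YData Fabric's synthesizer: it only fits an independent normal distribution to each numeric column and samples from it, so it ignores correlations between columns. The function names and the sample rows are hypothetical.

```python
import random
import statistics


def fit_gaussian_columns(rows):
    """'Train': store the mean and standard deviation of each numeric column."""
    columns = rows[0].keys()
    model = {}
    for col in columns:
        values = [r[col] for r in rows]
        model[col] = (statistics.mean(values), statistics.stdev(values))
    return model


def generate(model, n_rows, seed=0):
    """'Generate': sample n_rows new records, one column at a time."""
    rng = random.Random(seed)
    return [
        {col: rng.gauss(mu, sigma) for col, (mu, sigma) in model.items()}
        for _ in range(n_rows)
    ]


# Hypothetical original data; values are illustrative only.
original = [
    {"age": 25.0, "income": 40.0},
    {"age": 31.0, "income": 55.0},
    {"age": 45.0, "income": 85.0},
    {"age": 52.0, "income": 90.0},
    {"age": 38.0, "income": 60.0},
]

model = fit_gaussian_columns(original)
synthetic = generate(model, n_rows=1000)
print(len(synthetic), synthetic[0])
```

The point of the sketch is the shape of the workflow: the model is fitted once on the original data, and the number of rows to generate is a free choice that does not have to match the original size.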

Step 6 — Compare the results

Now you can hit Compare > to visualize a comparative view, presenting the original dataset vs. the new dataset created.

This is very helpful to validate the new data and make sure it is aligned with the original dataset and business assumptions.

Make sure to check the Alerts automatically generated by the platform to identify possible issues with the generated data.
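If you download both datasets, one way to double-check the comparison outside the platform is a per-column two-sample Kolmogorov-Smirnov statistic: the largest gap between the empirical cumulative distributions of the original and synthetic values (0 means identical distributions, 1 means no overlap). The sketch below is a plain-Python assumption of how such a check could look, not the metric the platform itself reports, and the data is hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    gap = 0.0
    # The largest gap is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def compare_columns(original, synthetic):
    """Compute the KS statistic for every column present in the original."""
    return {col: ks_statistic(original[col], synthetic[col]) for col in original}


# Hypothetical columns; the synthetic income is deliberately off.
original = {
    "age": [25, 31, 45, 52, 38],
    "income": [40, 55, 85, 90, 60],
}
synthetic = {
    "age": [27, 30, 44, 50, 39],
    "income": [400, 420, 390, 410, 430],
}

report = compare_columns(original, synthetic)
for column, score in report.items():
    print(f"{column}: KS = {score:.2f}")
```

In this toy example the age column scores low while the income column scores 1.0, the kind of mismatch that should send you back to the training step or to the metadata settings.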

Conclusion

Synthetic data emerges as a powerful and versatile solution in the realm of data science and artificial intelligence. As organizations grapple with the challenges of privacy, security, and the need for large and diverse datasets, synthetic data provides a promising alternative. Its ability to mimic the statistical properties of real-world data while preserving individual privacy makes it a valuable resource for training and testing machine learning models.

Moreover, the cost-effectiveness and scalability of synthetic data generation contribute to its growing appeal across various industries. By reducing the dependency on scarce and sensitive real-world data, organizations can accelerate innovation and model development. The flexibility to generate diverse datasets can also help mitigate bias in machine learning, for example by rebalancing under-represented groups, although it does not guarantee fair outcomes on its own.

However, challenges such as ensuring the quality and representativeness of synthetic data remain. Continuous advancements in synthetic data generation techniques, coupled with rigorous validation processes, are essential to overcome these hurdles.

In essence, synthetic data stands at the forefront of a data-driven future, offering a pragmatic solution to the complex interplay between data availability, privacy concerns, and the relentless demand for advanced machine learning models. As technology evolves, the integration of synthetic data into mainstream data science practices is poised to reshape how organizations approach data analytics and artificial intelligence, fostering a more ethical, efficient, and inclusive data landscape.


Written by Caio Gasparine

Project Manager | Data & AI | Professor
