Data Generation in Python for Model Regression with sklearn
Introduction:
This paper introduces the generation of sample data for model regression, which will then be used in two subsequent papers for Gaussian Process and Neural Network regressor training in scikit-learn (sklearn). The focus of this paper is to provide an introduction to Python scripts for data generation and visualization rather than an exhaustive treatment.
Methods:
The paper reviews three methods for generating random data for computer experiments: Latin Hypercube sampling, random sampling, and sphere random sampling. These data are divided into two parts: one for training and one for testing the regression model. This split will later allow us to demonstrate the importance of the standard deviation of the predicted data in Gaussian Process modeling.
Latin Hypercube Sampling:
The Latin Hypercube samples are generated using the SciPy library, which is more efficient than random sampling at covering the parameter space. The number of parameters/variables is 3, and the number of samples can range from 100 to 1000. The response is a modified version of the Matyas function extended to 3 variables.
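For reference, the standard two-variable Matyas test function (see the test-functions reference at the end) is usually written as
f(x, y) = 0.26*(x^2 + y^2) - 0.48*x*y
while the modified 3-variable response actually used in this paper is defined later in the code.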
Random Sampling:
Random samples are generated directly using Python's built-in random module, without any additional third-party libraries.
Sphere Random Sampling:
Sphere random samples are generated with Python's built-in random module on a sphere of radius 1, and then scaled to the boundaries of the 3 variables.
Code Implementation: The libraries used in the implementation include SciPy’s “qmc” module (scipy.stats.qmc), the standard mathematical library “math”, the numerical library “numpy”, and the built-in “random” module. The number of parameters and the number of samples are defined as integers, together with the boundary values of the variables.
from scipy.stats import qmc
import math
import numpy
from random import random
npara = int(3)             # number of parameters/variables
nsamp = int(250)           # number of samples per data set
l_bound = [-10,-10,-10]    # lower bounds of the 3 variables
u_bound = [10,10,10]       # upper bounds of the 3 variables
sampler = qmc.LatinHypercube(d=npara)
sample = sampler.random(n=nsamp)    # nsamp points in the unit hypercube [0,1)^npara
The implementation of the Latin Hypercube sampling is straightforward using the defined parameters, “npara” and “nsamp”. The generated samples, which lie in the unit hypercube, are then scaled to the boundary values. The centre and half-range (“radius”) of the domain are also computed below, as they will be needed later for the sphere random sampling.
centerp = numpy.multiply(numpy.add(l_bound,u_bound),[0.5]*npara)                      # centre of the domain
radiusp = numpy.multiply(numpy.add(u_bound,numpy.multiply(l_bound,-1)),[0.5]*npara)   # half-range of the domain
print(centerp)    # [0. 0. 0.] for the bounds above
print(radiusp)    # [10. 10. 10.] for the bounds above
Two Latin Hypercube data sets are created, one used for training and one for testing: the same Latin Hypercube sample is scaled to the original bounds and to bounds enlarged by 5 on each side. The two sets are combined using the numpy “vstack” command.
sampmethod = 'Latin'
if 'Latin' == sampmethod:
    lub = qmc.scale(sample,l_bound,u_bound)            # training data scaled to the original bounds
    l_boundnew = [x-5 for x in l_bound]                # bounds enlarged by 5 on each side for the testing data
    u_boundnew = [x+5 for x in u_bound]
    lubb = qmc.scale(sample,l_boundnew,u_boundnew)     # testing data scaled to the enlarged bounds
    lub = numpy.vstack((lub,lubb))                     # combined array of shape (2*nsamp, npara)
Random sampling data are generated in the same manner as the Latin Hypercube data: two lists are generated, with the testing bounds enlarged by 15 on each side, and then concatenated using list addition.
if 'randsamp' == sampmethod:
    lub = []    # training data
    lubb = []   # testing data
    l_boundnew = [x-15 for x in l_bound]   # testing bounds enlarged by 15 on each side
    u_boundnew = [x+15 for x in u_bound]
    for i in range(nsamp):
        x = l_bound[0] + random()*(u_bound[0] - l_bound[0])
        y = l_bound[1] + random()*(u_bound[1] - l_bound[1])
        z = l_bound[2] + random()*(u_bound[2] - l_bound[2])
        lub.append([x, y, z])
        xn = l_boundnew[0] + random()*(u_boundnew[0] - l_boundnew[0])
        yn = l_boundnew[1] + random()*(u_boundnew[1] - l_boundnew[1])
        zn = l_boundnew[2] + random()*(u_boundnew[2] - l_boundnew[2])
        lubb.append([xn, yn, zn])
    lub = lub + lubb   # concatenate the training and testing lists
Sphere random sampling data are generated by defining the centre and radius of the sphere from the boundary values, and then using these values to generate samples on the sphere.
The generation of the sampling data inside a sphere is based on random samples between -1 and 1. In this paper, the points are not placed inside the entire sphere but inside a crown (spherical shell) defined by the radii 0.5 and 1. The training points lie inside this crown, and the testing points lie outside it, with either a radius larger than 1 or a radius smaller than 0.5. The testing data can also share some of the variable space with the training data. The command lines below show 2 ways to define the testing points (a short check after the code block confirms the resulting radii):
• sc = [[2.75+(2*random()-1)*0.01 for i in range(npara)] for j in range(nsamp)]
in that case the testing points, after being centred at zero, are scaled by sc, randomly picked between 2.74 and 2.76, and then recentred on the original data. As the original data lie in a crown defined by the radii 0.5 and 1, the new data lie inside a crown of radii approximately 1.375 and 2.75, and do not share any common point with the training samples. If the scaling factor were 1.5, the radii of the crown of the second data set would be 0.75 and 1.5, and the two sets of data would share some of the variable space.
• sc = [[random()*0.55 for i in range(npara)] for j in range(nsamp)]
in that case, the testing points lie inside (closer to the centre than) the training points; as long as the scaling factor stays below 0.5 the two sets share no data points, while a scaling factor above 0.5 would let them share some of the variable space.
if 'sphere' == sampmethod:
    i = 0
    lub = []
    lubb = []
    while (i < nsamp):
        x, y, z = [2*random()-1 for i in range(npara)]
        # keep only the points inside the crown (spherical shell) of radii 0.5 and 1
        if (math.sqrt(x*x+y*y+z*z) < 1 and math.sqrt(x*x+y*y+z*z) > 0.5):
            lub.append([centerp[0]+x*radiusp[0], centerp[1]+y*radiusp[1], centerp[2]+z*radiusp[2]])
            i += 1
    ##sc = [[2.75+(2*random()-1)*0.01 for i in range(npara)] for j in range(nsamp)]
    sc = [[random()*0.35 for i in range(npara)] for j in range(nsamp)]
    # centre the training points, scale them by sc, then recentre them to build the testing points
    lubp = [[(x-centerp[0]), (y-centerp[1]), (z-centerp[2])] for x, y, z in lub]
    lubp = numpy.multiply(lubp, sc)
    lubp = [[(x+centerp[0]), (y+centerp[1]), (z+centerp[2])] for x, y, z in lubp]
    lub = lub + lubp   # concatenate the training and testing lists
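As a quick sanity check of the crown construction (a hypothetical snippet, not part of the original script, meant to be run right after the block above when sampmethod is 'sphere'), the normalised radii of the two sets can be printed:
rad_train = [math.sqrt(sum(((v-c)/r)**2 for v, c, r in zip(p, centerp, radiusp))) for p in lub[:nsamp]]
rad_test = [math.sqrt(sum(((v-c)/r)**2 for v, c, r in zip(p, centerp, radiusp))) for p in lub[nsamp:]]
print('training radii:', round(min(rad_train), 3), 'to', round(max(rad_train), 3))   # inside the crown of radii 0.5 and 1
print('testing radii up to', round(max(rad_test), 3))                                # below 0.35 for sc = random()*0.35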
Once the two sets of data have been defined using Latin Hypercube, random or sphere random sampling, it may be necessary to round the values, as they are generated as floats.
First, the number of samples is redefined, since the two sets of data have been merged together. Then the data list/array is flattened so that the required resolution can be applied to each value. The rounding is performed by the Ror function below: if reso is set to 1, the data are rounded to integers; if reso is set to 0.25, the data are rounded to the nearest quarter; and if reso is 0.5, the rounding is to the nearest half.
nsamp = int(len(lub))                # total number of samples after merging the two sets
print('new nsamp',nsamp)
ntot = npara*nsamp
lub1 = numpy.reshape(lub,(1,ntot))   # flatten the data into a single row
lub11 = lub1[0]
lub111 = list(lub11.flatten())
def Ror(value, reso): return reso*round(value/reso)   # round value to the nearest multiple of reso
lub2 = [Ror(x, 0.25) for x in lub111]                 # round every value to the nearest quarter
lub22 = numpy.reshape(lub2,(nsamp,npara))             # back to the (nsamp, npara) shape
print(len(lub22))
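As a small illustration of the Ror function (hypothetical calls, with the expected values worked out from the definition above):
print(Ror(3.14159, 1))      # 3    : rounded to the nearest integer
print(Ror(3.14159, 0.25))   # 3.25 : rounded to the nearest quarter
print(Ror(3.6, 0.5))        # 3.5  : rounded to the nearest half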
Finally, we use the modified Matyas function to generate the response. The first term of the function is made anisotropic in the z variable, which will help with the visualization and the modelling later. The response data are reshaped to be consistent with the shape of the variable data.
def feval(co):
    x, y, z = co
    # modified Matyas function: the 100*z*z term makes the first term anisotropic in z
    return 0.26*math.sqrt(x*x+y*y+100*z*z) - 0.18*math.pow(abs(x*y*z),1/3)
fr = [feval(l) for l in lub22]        # evaluate the response for every sample
fr = numpy.reshape(fr,(nsamp,1))      # column vector consistent with the variable data
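To illustrate this anisotropy (a hypothetical check, not part of the original script), the function can be evaluated at a point offset along x and at a point with the same offset along z:
print(feval([5, 0, 0]))   # about 1.3  = 0.26*sqrt(25); the second term vanishes since y = z = 0
print(feval([0, 0, 5]))   # about 13.0 = 0.26*sqrt(100*25); the same offset along z gives a 10x larger response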
The final step is to write the data to a file so that they can be used for plotting and later for regression modelling in sklearn. The first 3 lines of the file are comments which will not be used, but they are included to illustrate the Python commands for writing and reading files.
with open("doe.txt","w") as doe1:
doe1.write("#n \t x \t y \t z \n")
doe1.write("#n \t unit \t unit \t unit \n")
doe1.write("#n \t Float \t Float \t Float \n")
for item,it in zip(lub22,fr):
doe1.write("%s \t %s \t %s \t %s \n" %(item[0],item[1],item[2],it[0]))
The next task is to write a Python script to plot these data. As we have 3 variables and 1 response, we would need to plot in 4 dimensions. In the next paragraphs, we will see how to use a trick to plot 4D data in a 3D graph. This will also be useful for plotting the results of the regression modelling.
First the libraries needed are listed below:
import matplotlib as mpl
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Then, the file created in the previous section is read, skipping the first 3 lines of comments, and the remaining data are written to a new file.
f = open("doe.txt")
f1 = open("doe1.txt","w")
lines=f.readlines()[3:]
for line in lines:
f1.write("%s \n" %line)
f.close()
f1.close()
This new file is read using the pandas library, and the data are split into the training data and the test data defined in the previous section. This is purely for plotting, so that the 2 sets of data can be displayed separately.
data = pd.read_csv('doe1.txt',sep='\t',header=None)
ldata = int(len(data)*0.5)    # number of training samples: the first half of the file
data1 = data[0:ldata]         # training data
data2 = data[ldata:]          # testing data
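As a side note, and purely as an alternative sketch rather than the approach used here, pandas can also read the original doe.txt directly by treating lines starting with '#' as comments, which avoids the intermediate doe1.txt file and gives an equivalent DataFrame:
data = pd.read_csv('doe.txt', sep='\t', header=None, comment='#')   # the '#' comment lines are skipped automatically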
Once the data1 and data2 lists have been created from the file, we can define x, y, z and t as the 3 variables of the function and the value of the function. The x, y and z variables are used for the 3 axes of the 3D plot. For the fourth dimension of the plot, assigned to the value t, we use the size of the scatter plot markers. The marker size can be defined as a float or as an array with one entry per sample. This array can be scaled by a float or modified by functions such as the absolute value. The values of the size array must be positive, otherwise the corresponding points are not displayed correctly. That is a limitation of this pseudo-4D visualization, as negative values cannot be shown directly. One possibility would be to split the data between positive and negative response values and use a different colour for each, as sketched after the plotting script below.
mpl.rcParams['legend.fontsize'] = 10
fig = plt.figure()
ax = fig.add_subplot(projection='3d')   # fig.gca(projection='3d') was removed in recent matplotlib versions
##ax.set_xlim(xmin=-10,xmax=10)
##ax.set_ylim(-10,10)
##ax.set_zlim(-10,10)
x, y, z, t = data1[0], data1[1], data1[2], data1[3]
ax.scatter(x, y, z, color='blue', s=1*t)    # training data, marker size proportional to the response
x, y, z, t = data2[0], data2[1], data2[2], data2[3]
ax.scatter(x, y, z, color='red', s=1*t)     # testing data
ax.set_xlabel('x')
ax.set_ylabel('y')
ax.set_zlabel('z')
plt.savefig('3Ddata.png', dpi=500)
plt.show()
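Following the idea mentioned above for negative response values (a minimal sketch, not part of the original script, assuming the same data1 DataFrame and a response that could become negative), the data could be split by sign, using the absolute value for the marker size and a different colour for the negative responses:
fig2 = plt.figure()
ax2 = fig2.add_subplot(projection='3d')
pos = data1[data1[3] >= 0]    # samples with a non-negative response
neg = data1[data1[3] < 0]     # samples with a negative response
ax2.scatter(pos[0], pos[1], pos[2], color='blue', s=pos[3])
ax2.scatter(neg[0], neg[1], neg[2], color='green', s=-neg[3])   # flip the sign so the marker sizes stay positive
plt.show()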
Now that the two Python scripts are ready, we can check the outcomes.
First, with Latin Hypercubes (LHC), we define 2 LHC sets and display them using the same bounds as in the first section. As can be seen in the figure, the training set is in blue and the test set is in red. The marker sizes of both sets grow in the z direction, as the function was modified to be anisotropic in z.
The second set of samples will be the random samples with the same parameters as in the first section.
As the range of the test data is larger than in the LHC case, we can clearly see the training data in blue in the centre of the graph.
The third set of samples will be the sphere random samples, with 1000 samples instead of the 500 used previously, and we will check a few cases.
The first case uses a scaling factor of 0.25, and the test data can be seen inside the training data (radii 1 and 0.75).
The second case applies a scaling factor of about 2.75, as discussed earlier. This time the test data lie outside the training data sphere, and the large response values along z can be seen.
The last case uses a scaling factor of 1.15, which brings a few test data points slightly outside the training sphere, but most of the training and test data share the same variable space.
Now that we can generate random data, the next step will be to use them in sklearn to train Gaussian Process and Neural Network models for regression. This is shown in the following paper: Exploring the Capabilities of Gaussian Process Regression with sklearn: Visualizing Mean and Standard Deviation Predictions (link.medium.com/QqC5mk57ixb).
References:
· scipy.stats.qmc.LatinHypercube — SciPy v1.9.3 Manual
· Latin Hypercube Sampling (LHS) — EVOCD (msstate.edu)
· Test functions for optimization — HandWiki
· numpy.vstack — NumPy v1.23 Manual
· matplotlib.axes.Axes.scatter — Matplotlib 3.6.2 documentation