Read Numpy files in Jupyter Lab

Good afternoon. I am developing an artificial intelligence project. My neural networks are implemented and I am now training them. I started training on my own computer, but I have since gained access to a server where I can train through Jupyter Lab, which speeds up the training process. The problem is that when reading some NumPy files on the server, I get encoding errors such as this:

InvalidArgumentError:  UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 64: ordinal not in range(128)
Traceback (most recent call last):

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 236, in __call__
    ret = func(*args)

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 789, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/data_adapter.py", line 975, in generator_fn
    yield x[i]

  File "/home/jfm-castilho/Chargrid/dataset_generator.py", line 26, in __getitem__
    batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True,encoding = 'latin1'))

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load
    fid = open(os_fspath(file), "rb")

UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 64: ordinal not in range(128)


     [[{{node PyFunc}}]]
     [[IteratorGetNext]] [Op:__inference_distributed_function_13003]

Function call stack:
distributed_function

On my computer the NumPy files are read without problems; the error only occurs when I read them through Jupyter Lab. How can I fix this error? The line on which the error appears is the np.load call shown in the traceback above (dataset_generator.py, line 26).

Some considerations:

  • The NumPy version is the same on my computer and in Jupyter Lab: 1.18.1

  • The files read on my computer and in Jupyter Lab are the same: I uploaded them to the server where Jupyter Lab runs, and the relative path to the files is the same in both environments.

  • I have tested several approaches to solve the problem, such as:
    • np.load(self.representation_path + file + ".npy", allow_pickle=True,encoding = 'bytes')
    • np.load(self.representation_path + file + ".npy", allow_pickle=True,encoding = 'ascii')
    • np.load(self.representation_path + file + ".npy", allow_pickle=True,encoding = 'utf-8')
    • np.load(self.representation_path + file + ".npy", allow_pickle=True)
    • with open(self.representation_path + file + ".npy", 'rb') as file: arr = pickle.load(file)

None of these attempts changed the result; in every case I got a UnicodeEncodeError.
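The traceback above stops at open(os_fspath(file), "rb") inside np.load, that is, before any array data is read, and the encoding argument of np.load only affects how Python 2 pickles are decoded. This suggests (as an assumption, not something the post confirms) that the character '\xe7' ("ç") at position 64 sits in the file path rather than in the file contents. A minimal sketch for checking a path follows; the function name and the example path are hypothetical.

```python
def first_non_ascii(path):
    """Return (index, character) of the first non-ASCII character in path, or None."""
    for index, char in enumerate(path):
        if ord(char) > 127:
            return index, char
    return None


# Hypothetical path, used only to illustrate the check.
example = "data/representa\u00e7\u00e3o/file_001.npy"
print(first_non_ascii(example))          # (15, 'ç')
print(first_non_ascii("data/file.npy"))  # None
```

Running the same check on self.representation_path + file + ".npy" for each entry of the file list would show whether any path carries a non-ASCII character at the reported position.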

In case it helps the analysis, this is the line where I save the NumPy array to a file:

np.save(repr_path_pad + simple_img_name[idx], data_padded)

This is the class that reads the files. It is a generator used when training the neural network. The batch size is 7, so it reads 7 files at a time.

# Imports assumed; they are not shown in the original snippet.
import numpy as np
from tensorflow.keras.utils import Sequence


class RepresentationGenerator(Sequence):

    def __init__(self, representation_path, target_path, filenames, batch_size):
        self.filenames = np.array(filenames)
        self.batch_size = batch_size
        self.representation_path = representation_path
        self.target_path = target_path

    def __len__(self):
        # Number of batches, counting a final partial batch if there is one
        length = len(self.filenames) // self.batch_size
        if len(self.filenames) % self.batch_size > 0:
            length += 1

        return length

    def __getitem__(self, idx):
        # File names that belong to batch number idx
        files_to_batch = self.filenames[idx * self.batch_size: (idx + 1) * self.batch_size]
        batch_x = []
        batch_SS = []
        for file in files_to_batch:
            batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True))
            batch_SS.append(np.load(self.target_path + 'semantic segmentation/' + file + ".npy", allow_pickle=True))
        batch_x = np.array(batch_x).astype(np.float16)
        batch_SS = np.array(batch_SS).astype(np.float16)

        return batch_x, batch_SS
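The batching arithmetic used by __len__ and __getitem__ above can be checked in isolation, without TensorFlow or any files on disk. The sketch below is only an illustration; the helper names and the file names are hypothetical.

```python
import math


def num_batches(num_files, batch_size):
    """Number of batches, counting a final partial batch (same result as __len__)."""
    return math.ceil(num_files / batch_size)


def batch_slice(filenames, idx, batch_size):
    """File names that belong to batch number idx (same slice as __getitem__)."""
    return filenames[idx * batch_size:(idx + 1) * batch_size]


# Hypothetical file names; 16 files with a batch size of 7 give 3 batches.
filenames = [f"doc_{i}" for i in range(16)]
print(num_batches(len(filenames), 7))  # 3
print(batch_slice(filenames, 2, 7))    # ['doc_14', 'doc_15']
```

Because the last batch may hold fewer than batch_size files, an error tied to one particular file can surface in the middle of an epoch rather than at the start.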

The snippet below shows where the class above is used:

train_generator = RepresentationGenerator(representation_path=repr_path_pad, target_path=target_path_pad,
                                          filenames=training_filenames, batch_size=self.batch_size)
val_generator = RepresentationGenerator(representation_path=representations_path, target_path=target_path,
                                        filenames=validation_filenames, batch_size=self.batch_size)
cp_callback = tf.keras.callbacks.ModelCheckpoint(filepath=model_name + '.h5',
                                                 save_weights_only=True,
                                                 verbose=1)
plot_history = PlotHistory(history_fit, model_name, self.model, model_path=model_path,
                           load_previous=load_previous)
self.model.fit(train_generator,
               steps_per_epoch=len(train_generator),
               verbose=1,
               epochs=num_epochs_train,
               validation_data=val_generator,
               validation_steps=len(val_generator),
               callbacks=[cp_callback, plot_history])


The full error log follows:

InvalidArgumentError                      Traceback (most recent call last)
<ipython-input-1-9a8acfabebd2> in <module>
    212          split_dataset_file=split_dataset, ocr_filename=ocr_file, annotated_filename=annotated_files,
    213          num_epochs_trainning=num_epochs_train, history_fit=history_fit_image, width_padding=w_padding,
--> 214          upsample_path=original_repr_path, upsample_target_path=original_target_path)

<ipython-input-1-9a8acfabebd2> in main(images_path, representation_path, targets_path, repr_pad_path, target_padded_path, prefix, make_new_representation, train, use_previous_weights, split_dataset_file, model_filename, model_path, downsample, ocr_filename, annotated_filename, num_epochs_trainning, history_fit, width_padding, predict, upsample_path, upsample_target_path, update_dicts, num_chars)
     94                  split_dataset=split_dataset_file,
     95                  validation_filenames=data['val_imgs'], history_fit=history_fit,
---> 96                  model_name=model_filename, num_epochs_train=num_epochs_trainning)
     97     if predict:  # if want to predict
     98         if not train:  # if neural network wasn't trained, load model

~/Chargrid/neural_network.py in train(self, representations_path, target_path, repr_path_pad, target_path_pad, training_filenames, validation_filenames, model_path, model_name, num_epochs_train, history_fit, split_dataset, batch_size)
     85                            epochs=num_epochs_train,
     86                            validation_data=val_generator,
---> 87                            validation_steps=len(val_generator)
     88                            )
     89         except KeyboardInterrupt:

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training.py in fit(self, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    817         max_queue_size=max_queue_size,
    818         workers=workers,
--> 819         use_multiprocessing=use_multiprocessing)
    820 
    821   def evaluate(self,

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py in fit(self, model, x, y, batch_size, epochs, verbose, callbacks, validation_split, validation_data, shuffle, class_weight, sample_weight, initial_epoch, steps_per_epoch, validation_steps, validation_freq, max_queue_size, workers, use_multiprocessing, **kwargs)
    340                 mode=ModeKeys.TRAIN,
    341                 training_context=training_context,
--> 342                 total_epochs=epochs)
    343             cbks.make_logs(model, epoch_logs, training_result, ModeKeys.TRAIN)
    344 

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2.py in run_one_epoch(model, iterator, execution_function, dataset_size, batch_size, strategy, steps_per_epoch, num_samples, mode, training_context, total_epochs)
    126         step=step, mode=mode, size=current_batch_size) as batch_logs:
    127       try:
--> 128         batch_outs = execution_function(iterator)
    129       except (StopIteration, errors.OutOfRangeError):
    130         # TODO(kaftan): File bug about tf function and errors.OutOfRangeError?

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/training_v2_utils.py in execution_function(input_fn)
     96     # `numpy` translates Tensors to values in Eager mode.
     97     return nest.map_structure(_non_none_constant_value,
---> 98                               distributed_function(input_fn))
     99 
    100   return execution_function

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py in __call__(self, *args, **kwds)
    566         xla_context.Exit()
    567     else:
--> 568       result = self._call(*args, **kwds)
    569 
    570     if tracing_count == self._get_tracing_count():

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/eager/def_function.py in _call(self, *args, **kwds)
    597       # In this case we have created variables on the first call, so we run the
    598       # defunned version which is guaranteed to never create variables.
--> 599       return self._stateless_fn(*args, **kwds)  # pylint: disable=not-callable
    600     elif self._stateful_fn is not None:
    601       # Release the lock early so that multiple threads can perform the call

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py in __call__(self, *args, **kwargs)
   2361     with self._lock:
   2362       graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
-> 2363     return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
   2364 
   2365   @property

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py in _filtered_call(self, args, kwargs)
   1609          if isinstance(t, (ops.Tensor,
   1610                            resource_variable_ops.BaseResourceVariable))),
-> 1611         self.captured_inputs)
   1612 
   1613   def _call_flat(self, args, captured_inputs, cancellation_manager=None):

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py in _call_flat(self, args, captured_inputs, cancellation_manager)
   1690       # No tape is watching; skip to running the function.
   1691       return self._build_call_outputs(self._inference_function.call(
-> 1692           ctx, args, cancellation_manager=cancellation_manager))
   1693     forward_backward = self._select_forward_and_backward_functions(
   1694         args,

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/eager/function.py in call(self, ctx, args, cancellation_manager)
    543               inputs=args,
    544               attrs=("executor_type", executor_type, "config_proto", config),
--> 545               ctx=ctx)
    546         else:
    547           outputs = execute.execute_with_cancellation(

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/eager/execute.py in quick_execute(op_name, num_outputs, inputs, attrs, ctx, name)
     65     else:
     66       message = e.message
---> 67     six.raise_from(core._status_to_exception(e.code, message), None)
     68   except TypeError as e:
     69     keras_symbolic_tensors = [

/opt/conda/envs/csw-aii/lib/python3.6/site-packages/six.py in raise_from(value, from_value)

InvalidArgumentError:  UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 64: ordinal not in range(128)
Traceback (most recent call last):

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/ops/script_ops.py", line 236, in __call__
    ret = func(*args)

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/data/ops/dataset_ops.py", line 789, in generator_py_func
    values = next(generator_state.get_iterator(iterator_id))

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/tensorflow_core/python/keras/engine/data_adapter.py", line 975, in generator_fn
    yield x[i]

  File "/home/jfm-castilho/Chargrid/dataset_generator.py", line 26, in __getitem__
    batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True,encoding = 'latin1'))

  File "/opt/conda/envs/csw-aii/lib/python3.6/site-packages/numpy/lib/npyio.py", line 428, in load
    fid = open(os_fspath(file), "rb")

UnicodeEncodeError: 'ascii' codec can't encode character '\xe7' in position 64: ordinal not in range(128)


     [[{{node PyFunc}}]]
     [[IteratorGetNext]] [Op:__inference_distributed_function_13003]

Function call stack:
distributed_function
  • Is the encoding of your input file really Latin-1? Have you tried encoding="utf-8"?

  • How can I find out which encoding is used? I do not pass that information when I save the NumPy array to the file. I never considered utf-8 because the NumPy documentation (https://numpy.org/devdocs/reference/generated/numpy.load.htmlhighlight=load#numpy) says that values other than 'latin1', 'ASCII', and 'bytes' should not be used.

  • I have now tested with 'utf-8' and it gave the same error again.

1 answer

Good afternoon!

I believe the error is caused by the operation in question trying to allocate a very large amount of memory. Jupyter Lab is a shared development environment, so it does not offer the same level of hardware availability.

To help you further, we need to know what you are trying to do with this batch_x.append call; knowing only that you are calling this function does not let us narrow down the possible errors.

I’ll be waiting for more information.


I believe your problem lies in the way you are handling the data. To help you properly I would need more information; since I think it unlikely that you will provide it, I will show below how I prepare the data for a neural network (a CNN in this case). Perhaps this will help you.

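import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import tensorflow as tf
from sklearn.model_selection import train_test_split
from tensorflow.keras.callbacks import (EarlyStopping, ModelCheckpoint,
                                        ReduceLROnPlateau, TerminateOnNaN)
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.utils import to_categorical

# open the files with the data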
mnist = tf.keras.datasets.mnist

train = pd.read_csv("../input/train.csv")
test = pd.read_csv("../input/test.csv")

Y_train = train["label"]
X_train = train.drop(labels = ["label"], axis = 1) 


(x_train1, y_train1), (x_test1, y_test1) = mnist.load_data()

train1 = np.concatenate([x_train1, x_test1], axis=0)
y_train1 = np.concatenate([y_train1, y_test1], axis=0)

Y_train1 = y_train1
X_train1 = train1.reshape(-1, 28*28)

X_train = X_train / 255.0
test = test / 255.0

X_train1 = X_train1 / 255.0

X_train = np.concatenate((X_train.values, X_train1))
Y_train = np.concatenate((Y_train, Y_train1))

X_train = X_train.reshape(-1,28,28,1)
test = test.values.reshape(-1,28,28,1)

Y_train = to_categorical(Y_train, num_classes = 10)

X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.1, random_state=2)

# Insert your CNN model here and then compile it

# The code below helps prevent overfitting on large volumes of data (images, in this case)
datagen = ImageDataGenerator(
        featurewise_center=False,  # set input mean to 0 over the dataset
        samplewise_center=False,  # set each sample mean to 0
        featurewise_std_normalization=False,  # divide inputs by std of the dataset
        samplewise_std_normalization=False,  # divide each input by its std
        zca_whitening=False,  # apply ZCA whitening
        rotation_range=10,  # randomly rotate images in the range (degrees, 0 to 180)
        zoom_range = 0.1, # Randomly zoom image 
        width_shift_range=0.1,  # randomly shift images horizontally (fraction of total width)
        height_shift_range=0.1,  # randomly shift images vertically (fraction of total height)
        horizontal_flip=False,  # randomly flip images
        vertical_flip=False)  # randomly flip images


datagen.fit(X_train)

epochs = 50

batch_size = 128

callbacks = [
    TerminateOnNaN(),
    ReduceLROnPlateau(monitor='val_accuracy',
                      patience=5, 
                      verbose=1, 
                      factor=0.5, 
                      min_lr=0.00001),
    EarlyStopping(monitor='val_loss',
                 patience=5,
                 mode='min',
                 verbose=1,
                 restore_best_weights=True),
    ModelCheckpoint(h5_path,
                   monitor='val_loss',
                   verbose=1,
                   save_best_only=True,
                   mode='min')
]

# Train on augmented batches produced by the generator fitted above
history = model.fit_generator(datagen.flow(X_train, Y_train, batch_size=batch_size),
                              validation_data=(X_val, Y_val),
                              verbose=1,
                              steps_per_epoch=(X_train.shape[0] // batch_size),
                              epochs=epochs,
                              callbacks=callbacks)

# Plot the training history of the network trained above
fig, ax = plt.subplots(2,1)
ax[0].plot(history.history['loss'], color='b', label="Training loss")
ax[0].plot(history.history['val_loss'], color='r', label="validation loss",axes=ax[0])
legend = ax[0].legend(loc='best', shadow=True)

ax[1].plot(history.history['accuracy'], color='b', label="Training accuracy")
ax[1].plot(history.history['val_accuracy'], color='r',label="Validation accuracy")
legend=ax[1].legend(loc='best', shadow=True)

As an addendum: if you are working purely with data such as stock values, house prices, etc., a statistical method will probably give a much more efficient result than a neural network model. I hope I was able to help at least a little.


@Joãocastilho now things have become a little clearer. Are you trying to extract data or text from a file and have it read as if it were an image? (Please correct me if I am wrong.)

From what I can understand from the error and the context, your training algorithm reads the images (files) and processes them correctly until it encounters a "ç". The error occurs because the reading is done with an American-standard encoding (ASCII). I would recommend using Tesseract to extract the text data and then process it; alternatively, you can try replacing the character with another one, or try forcing the file to be read as 'latin-1' using the code below.

batch_x.append(np.load(self.representation_path + file + ".npy", allow_pickle=True, encoding='latin-1'))
batch_SS.append(np.load(self.target_path + 'semantic segmentation/' + file + ".npy", allow_pickle=True, encoding='latin-1'))

Note: the machine where Jupyter runs probably has English as its default language and your computer Portuguese, which would explain why the error occurs on one and not on the other.
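The note above attributes the difference to the language settings of the two machines. One way to check this (a sketch, not a confirmed diagnosis) is to print the encodings Python uses for file names on each machine and compare them. The helper names below are hypothetical; the standard-library calls are real. The second helper reproduces the wording of the error in the question by encoding a "ç" as ASCII.

```python
import locale
import os
import sys


def describe_encoding_setup():
    """Collect the settings Python uses to turn a str path into bytes."""
    return {
        "filesystem_encoding": sys.getfilesystemencoding(),
        "preferred_encoding": locale.getpreferredencoding(False),
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
    }


def ascii_encode_error(text):
    """Return the error message raised when text is encoded as ASCII, or None."""
    try:
        text.encode("ascii")
    except UnicodeEncodeError as exc:
        return str(exc)
    return None


print(describe_encoding_setup())
print(ascii_encode_error("\u00e7"))  # message mentions '\xe7' and range(128)
print(ascii_encode_error("abc"))     # None
```

On Python 3.6, which the traceback shows, a C/POSIX locale typically results in an ASCII filesystem encoding. If that is what the server reports, starting Jupyter Lab under a UTF-8 locale (for example with LC_ALL set to a UTF-8 locale that is installed on the server) is one thing to try, as is renaming the files or directories so that the paths contain only ASCII characters.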

  • Thanks for your availability :) I updated the post with new information.

  • @Joãocastilho I updated my answer; I hope it helps you solve your problem after reading it.

  • Great, I’m also working on a CNN, but not with images. What I use as input for the CNN is a character-grid representation of an image, based on the article https://arxiv.org/pdf/1809.08799.pdf. I took your feedback about excessive memory use into account and decreased the batch size and steps per epoch to 1, and it still gives the error. The error does not appear at the beginning of training but in the middle, so I think the problem may be memory rather than reading. Still, I could not figure out the problem. I should point out that it works properly on my computer.

  • I updated the post with the complete error log.

  • I’ll give you a practical example. I have an image of size (255,355). In the preprocessing phase I create an array of zeros of size (255,355) and, at each pixel where a character is located, assign a value that is unique to that character. At the end I one-hot encode this matrix to size (255,355,50) [if there are only 50 characters] and write it to a file. When training the model, it reads this (255,355,50) matrix.

  • @Joãocastilho I unintentionally posted what I was writing, please read my reply again.

  • Thanks for the feedback, but I don’t think that’s the problem. The matrix I feed into the neural network contains no characters, only an encoding of them (for example, 'a' -> 1, 'b' -> 2, 'ç' -> 34), so the matrix contains only numbers.

  • Preprocessing is done on my computer; in Jupyter Lab I only read the representation files.
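The preprocessing described in the comments above (a grid of zeros, one integer per character, then a one-hot encoding) can be sketched as follows. This is an illustration under stated assumptions, not the asker's actual code: the character mapping, the tiny grid size and the number of classes are hypothetical stand-ins for the (255,355) image and 50 characters mentioned in the comments.

```python
import numpy as np

# Hypothetical character-to-index mapping; index 0 is reserved for "no character".
char_to_index = {"a": 1, "b": 2, "\u00e7": 3}
num_classes = 4  # background + 3 characters (the comments mention 50)

# Hypothetical tiny grid of height 2 and width 3 instead of (255, 355).
grid = np.zeros((2, 3), dtype=np.int64)
grid[0, 1] = char_to_index["a"]
grid[1, 2] = char_to_index["\u00e7"]

# Row i of the identity matrix is the one-hot vector for class i,
# so indexing it with the grid one-hot encodes every pixel at once.
one_hot = np.eye(num_classes, dtype=np.float16)[grid]

print(one_hot.shape)  # (2, 3, 4)
```

The resulting array holds only numbers, which is consistent with the asker's point that a "ç" in the source text ends up as an integer class and not as a character inside the saved file.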
