Help! Unable to One Hot Encode last Column

I am trying to OneHotEncode the last column of my Excel table which has categorical data with three categories. All the other columns have numeric data except the last. This is my code:

```python
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.__version__)

# Part 1 - Data Preprocessing

# Importing the dataset
dataset = pd.read_csv('ANN_1_APP.csv')
X = dataset.iloc[:, 0:-1].values
y = dataset.iloc[:, -1].values
print(X)
print(y)

# One Hot Encoding the "Geography" column
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [-1])], remainder='passthrough')
y = np.array(ct.fit_transform(y))
print(y)
```

The last print(y) gives the values in the table column e.g. Home, Away, Draw instead of encoding them into binary representation.

Kindly help . . .!

I’ve edited your post for readability. When you enter a code block into a forum post, please precede it with a separate line of three backticks and follow it with a separate line of three backticks to make it easier to read.

You can also use the “preformatted text” tool in the editor (</>) to add backticks around text.

See this post to find the backtick on your keyboard.
Note: Backticks (`) are not single quotes (’).

You use a lot of transformations that seem unnecessary.
Like why use .values? Why add the np.array?
For only one column you don’t need a ColumnTransformer.
So the first test would be whether just using the OneHotEncoder on its own works. Only then bundle it into a ColumnTransformer (which is only really useful when you have more than one transformer to begin with).
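For instance, with some made-up stand-in data (assuming the real CSV has numeric columns plus a categorical last column, like the original post describes), that two-step check might look like this:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# made-up stand-in for the CSV: numeric columns plus a categorical last column
data = np.array([[1, 10, 'Home'],
                 [2, 20, 'Away'],
                 [3, 30, 'Draw']], dtype=object)

# Step 1: the encoder on its own (note it wants a 2D input, hence [[-1]])
enc = OneHotEncoder()
encoded = enc.fit_transform(data[:, [-1]]).toarray()
print(encoded.shape)  # (3, 3): one column per category

# Step 2: only once that works, bundle it into a ColumnTransformer
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [-1])],
                       remainder='passthrough')
full = ct.fit_transform(data)
print(full.shape)  # (3, 5): 3 encoded columns plus 2 passthrough columns
```

Note the encoder is fit on a 2D slice of the array; feeding it the 1D column directly is exactly what raises the shape error discussed below.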

In case the problem is still relevant: I think you cannot use the “ColumnTransformer” for OneHotEncoding because the CT expects to return the same number of columns as it gets. However, OHE creates one column per unique entry.

I remembered struggling with a FeaturePreprocessingPipeline with that…
The go-to class should be “FeatureUnion” or “Pipeline” instead of “ColumnTransformer”.

Hi, thank you so much for your comments. However, for some reason I was unable to use OneHotEncoder.

The one that worked for me is get_dummies, as shown below:

```python
y = pd.get_dummies(dataset.iloc[:, -1])
```

Thank you so much.
Can you kindly type some sample code . . . maybe I can get my head around that because I am new to Python and ML

I just looked at my old pipeline and turns out I used ColumnTransformer. BUT I used OneHotEncoder(sparse=True) so maybe that’s why it didn’t work?

Anyway, for that sample code… I can give you my new pipeline - it’s using something called DataFrameMapper (from the sklearn-pandas package), which allows Pandas and Sklearn to work together better, by replacing the ColumnTransformer and FeatureUnion in a way that does return DataFrames and thus keeps column names.

That said, it’s quite a complex thing (and technically only doing basic feature preprocessing), but if you are interested, here is the Notebook:

When you try using OneHotEncoder without the transformer what error did you get?

If I recall, that encoder expects a 2D array as input to its fit function, while it looks like you’re feeding it a 1D array. Try feeding it something like y.reshape(-1,1) so that the data is in a 2D format.

Also, that one-hot encoder returns a sparse matrix object by default. I’m guessing your final np.array call on the transformer results is an attempt to deal with that, but I’m not sure it’s going to do what you expect. You can use

```python
ct.fit_transform(y).toarray()
```

OR tell the encoder not to return a sparse matrix in the first place:

```python
OneHotEncoder(sparse=False)
```
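Putting those two fixes together on a made-up label array (a hedged sketch; it uses .toarray() rather than the sparse flag, since that works the same way across scikit-learn versions):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

y = np.array(['Home', 'Away', 'Draw', 'Home'])

enc = OneHotEncoder()
y_2d = y.reshape(-1, 1)                        # fix 1: make the input 2D
y_encoded = enc.fit_transform(y_2d).toarray()  # fix 2: densify the sparse matrix

print(enc.categories_[0])  # ['Away' 'Draw' 'Home'] (sorted alphabetically)
print(y_encoded[0])        # [0. 0. 1.]  ('Home' is the third category)
```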

Hi thank you so much for your valuable suggestions. Finally I was able to figure out some errors I was making listed as follows:

  1. I was initializing the variables y and X too early, which had the effect of declaring the to-be-encoded values in y as a 1D array, yet after encoding y assumes a 2D array structure.
  2. My fit_transform call was on y instead of the entire data frame, in this case named dataset.

The running code looks as pasted below:

```python
# Importing the libraries
import numpy as np
import pandas as pd
import tensorflow as tf
print(tf.__version__)

# Part 1 - Data Preprocessing

# Importing the dataset
dataset = pd.read_csv('ANN_1_APP.csv')

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [-1])], remainder='passthrough')
dataset = np.array(ct.fit_transform(dataset))
print(dataset)
y = dataset[:, 0:-1]
inputval = dataset[:, -1]
X = inputval.reshape(-1, 1)
```

Now, even though the code is running, I have a new problem: the data frame named dataset has interchanged the columns of the .CSV file immediately after OneHotEncoding.

Column 0 in the CSV has become column -1, and the one I have encoded, which was -1 in the CSV, has become column 0 in the dataset data frame.

Kindly any suggestions how to correct that?

The reordering you’re seeing is a product of how the sklearn ColumnTransformer works. By default it drops all untransformed columns, but when you set remainder='passthrough' it takes the output of the transformers first and joins the untransformed columns on the right.

You can reorder the “columns” in a 2D numpy array using standard indexing notation:

```python
desired_order = [3, 4, 5, 0, 1, 2]
dataset[:, desired_order]
```
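As a concrete toy run (made-up three-row data, not your actual CSV), you can watch the encoded columns land first and then get moved back behind the passthrough ones:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

data = np.array([[1, 10, 'Home'],
                 [2, 20, 'Away'],
                 [3, 30, 'Draw']], dtype=object)

ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [-1])],
                       remainder='passthrough')
out = ct.fit_transform(data)
print(out[0])  # the 3 encoded columns first, then the 2 untransformed ones

# put the two passthrough columns back in front
desired_order = [3, 4, 0, 1, 2]
reordered = out[:, desired_order]
print(reordered[0])  # original column order restored, encoding at the end
```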

Honestly though, like @Jagaya pointed out, for only one column you don’t need ColumnTransformer at all. Given that the column order here seems to be important, I would add other transforms and only feed it the features you wish to do transformations on then join the results into your dataset.
I think you may find it easier if you do these transforms while keeping the data in a pandas DataFrame. That way you can reference the columns by name and the order won’t matter. When you’re finished augmenting the data, you can create variables that specify the features/targets and their order to be fed into a model.

```python
# Import data. CSV header may already include names
df = pd.read_csv('example.csv', names=['f1', 'f2', 'target'])
print(df.columns.to_list())  # ['f1', 'f2', 'target']

# One hot encode the 'f2' column and
# join the output columns to the dataframe
df = df.join(pd.get_dummies(df['f2'], prefix='f2_enc'))
print(df.columns.to_list())  # ['f1', 'f2', 'target', 'f2_enc_A', 'f2_enc_B', 'f2_enc_C']

# Specify feature and target columns
target_columns = ['target']
feature_columns = df.columns[~df.columns.isin(['target', 'f2'])]  # Select all that are not 'target' or 'f2'
print(feature_columns.to_list())  # ['f1', 'f2_enc_A', 'f2_enc_B', 'f2_enc_C']

X = df[feature_columns].values
y = df[target_columns].values
```

If you’re using TensorFlow downstream you can potentially skip creating X & y numpy arrays entirely. Using tf.data.Dataset.from_tensor_slices(dict(df)) allows you to preserve the column names (as dictionary keys) when dataloading. This makes it a lot easier to ensure data is fed into the model during predictions in the same way it was during training (so you won’t have the column ordering issue all over again).
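A minimal sketch of that idea, on a tiny hypothetical frame standing in for the preprocessed dataset (column names here are made up):

```python
import pandas as pd
import tensorflow as tf

# toy stand-in for the preprocessed dataframe
df = pd.DataFrame({'f1': [1.0, 2.0, 3.0],
                   'f2_enc_A': [1, 0, 0],
                   'target': [0, 1, 1]})

# each dataset element is a dict keyed by column name,
# so downstream code never depends on column order
ds = tf.data.Dataset.from_tensor_slices(dict(df))
first = next(iter(ds))
print(sorted(first.keys()))  # ['f1', 'f2_enc_A', 'target']
```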