The reordering you’re seeing is a consequence of how sklearn’s ColumnTransformer
works. By default it drops all untransformed columns, but when you set remainder='passthrough'
it places the output of the transformers first and appends the untransformed columns on the right.
You can reorder the “columns” in a 2D numpy array using standard indexing notation:
desired_order = [3,4,5,0,1,2]
dataset = dataset[:, desired_order]  # indexing returns a new array, so assign it back
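To make the passthrough behaviour concrete, here is a minimal sketch. The data is made up (two numeric columns followed by an integer-coded categorical column), and sparse_threshold=0 is only there to force a dense array so the result is easy to inspect.

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical data: columns 0 and 1 are numeric, column 2 is a category code.
dataset = np.array([
    [1.0, 10.0, 0],
    [2.0, 20.0, 1],
    [3.0, 30.0, 2],
])

ct = ColumnTransformer(
    [("onehot", OneHotEncoder(), [2])],  # only column 2 is transformed
    remainder="passthrough",             # keep columns 0 and 1
    sparse_threshold=0,                  # always return a dense array
)
transformed = ct.fit_transform(dataset)

# Layout is now [onehot_0, onehot_1, onehot_2, col0, col1]:
# the transformer output comes first, the passthrough columns last.
print(transformed[0])

# Move the passthrough columns back to the front with plain indexing.
desired_order = [3, 4, 0, 1, 2]
restored = transformed[:, desired_order]
print(restored[0])
```

The index list has to be written against the transformed layout, so it needs updating whenever the number of one-hot columns changes.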
Honestly though, as @Jagaya pointed out, you don’t need ColumnTransformer at all
for only one column. Given that the column order seems to matter here, I would apply the transforms separately, feeding each one only the features you want transformed, and then join the results back into your dataset.
I think you may find it easier if you do these transforms while keeping the data in a pandas DataFrame. That way you can reference the columns by name and the order won’t matter. When you’re finished augmenting the data, you can create variables that specify the features/targets and their order to be fed into a model.
import pandas as pd

# Import data. Pass names only if the CSV has no header row of its own
df = pd.read_csv('example.csv', names=['f1', 'f2', 'target'])
print(df.columns.to_list()) # ['f1', 'f2', 'target']
# One hot encode the 'f2' column and
# join the output columns to the dataframe
df = df.join(pd.get_dummies(df['f2'], prefix='f2_enc'))
print(df.columns.to_list()) # ['f1', 'f2', 'target', 'f2_enc_A', 'f2_enc_B', 'f2_enc_C']
# Specify feature and target columns
target_columns = ['target']
feature_columns = df.columns[~df.columns.isin(['target', 'f2'])]  # Select all that are not 'target' or 'f2'
print(feature_columns.to_list()) # ['f1', 'f2_enc_A', 'f2_enc_B', 'f2_enc_C']
X = df[feature_columns].values
y = df[target_columns].values
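One way to reuse that stored column list at prediction time is sketched below. The frames and values are invented for illustration, and reindex with fill_value=0 is my own suggestion for handling categories that are missing from the new data, not something the snippet above does.

```python
import pandas as pd

# Hypothetical training data, encoded the same way as above.
train = pd.DataFrame({"f1": [1.0, 2.0, 3.0],
                      "f2": ["A", "B", "C"],
                      "target": [0, 1, 0]})
train = train.join(pd.get_dummies(train["f2"], prefix="f2_enc", dtype=int))
feature_columns = [c for c in train.columns if c not in ("target", "f2")]

# New data arrives with its columns in a different order and without category C.
new = pd.DataFrame({"f2": ["B", "A"], "f1": [5.0, 4.0]})
new = new.join(pd.get_dummies(new["f2"], prefix="f2_enc", dtype=int))

# Select by name in the training order; absent dummy columns are filled with 0.
X_new = new.reindex(columns=feature_columns, fill_value=0).to_numpy(dtype=float)
print(feature_columns)
print(X_new)
```

Because the selection is done by name, the order of columns in the incoming frame no longer matters.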
If you’re using TensorFlow downstream you can potentially skip creating the X and y numpy arrays entirely. Using tf.data.Dataset.from_tensor_slices(dict(df)) allows you to preserve the column names (as dictionary keys) when dataloading. This makes it a lot easier to ensure data is fed into the model during predictions in the same way it was during training (so you won’t have the column ordering issue all over again).
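A rough sketch of that idea, assuming TensorFlow 2.x in eager mode and an all-numeric frame. The column names and values are placeholders, and splitting the target off into a (features, label) tuple is one common variation rather than the only way to do it.

```python
import pandas as pd
import tensorflow as tf

# Hypothetical, already-encoded data.
df = pd.DataFrame({
    "f1": [0.5, 1.5, 2.5],
    "f2_enc_A": [1.0, 0.0, 0.0],
    "f2_enc_B": [0.0, 1.0, 0.0],
    "target": [0, 1, 0],
})

features = df.drop(columns=["target"])
labels = df["target"].to_numpy()

# Each element is ({column_name: value, ...}, label), so features stay
# addressable by name instead of by position.
ds = tf.data.Dataset.from_tensor_slices((dict(features), labels))

first_features, first_label = next(iter(ds))
print(sorted(first_features.keys()))
print(first_features["f1"].numpy(), first_label.numpy())
```

A model that takes named inputs can then consume ds.batch(...) directly, with each feature matched by its key.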