[WideAndDeep] Improved Spark preprocessing scripts performance
This commit is contained in:
parent 9df464f277
commit 15ba45666d
@ -71,7 +71,7 @@ The examples are organized first by framework, such as TensorFlow, PyTorch, etc.
| [BERT](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT) |PyTorch | N/A | Yes | Yes | Yes | - | - | [Yes](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/BERT/triton) | - |
| [Transformer-XL](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/LanguageModeling/Transformer-XL) |PyTorch | N/A | Yes | Yes | Yes | - | - | - | - |
| [Neural Collaborative Filtering](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF) |PyTorch | N/A | Yes | Yes | - | - |- | - | - |
| [DLRM](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF) |PyTorch | N/A | Yes | - | - | - |- | - | - |
| [DLRM](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF) |PyTorch | N/A | Yes | Yes | - | - |- | - | - |
| [Mask R-CNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Segmentation/MaskRCNN) |PyTorch | N/A | Yes | Yes | - | - | - | - | - |
| [Jasper](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper) |PyTorch | N/A | Yes | Yes | - | Yes | Yes | [Yes](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechRecognition/Jasper/trtis) | - |
| [Tacotron 2 And WaveGlow v1.10](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2) | PyTorch | N/A | Yes | Yes | - | Yes | Yes | [Yes](https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/SpeechSynthesis/Tacotron2/notebooks/trtis) | - |
@ -91,6 +91,7 @@ The examples are organized first by framework, such as TensorFlow, PyTorch, etc.
| [Mask R-CNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN) |TensorFlow | N/A | Yes | Yes | - | - | - | - | - |
| [GNMT v2](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/Translation/GNMT) | TensorFlow | N/A | Yes | Yes | - | - | - | - | - |
| [Faster Transformer](https://github.com/NVIDIA/DeepLearningExamples/tree/master/FasterTransformer) | Tensorflow | N/A | - | - | - | Yes | - | - | - |
| [Transformer-XL](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow/LanguageModeling/Transformer-XL) |TensorFlow | N/A | Yes | Yes | - | - | - | - | - |
| [U-Net Medical](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/UNet_Medical) | TensorFlow-2 | N/A | Yes | Yes | - | Yes |- | - | Yes |
| [Mask R-CNN](https://github.com/NVIDIA/DeepLearningExamples/tree/master/TensorFlow2/Segmentation/MaskRCNN) |TensorFlow-2 | N/A | Yes | Yes | - | - |- | - | - |
| [ResNet50 v1.5](https://github.com/NVIDIA/DeepLearningExamples/tree/master/MxNet/Classification/RN50v1.5) | MXNet | Yes | Yes | Yes | - | - | - | - | - |
@ -12,7 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.

ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.02-tf1-py3
ARG FROM_IMAGE_NAME=nvcr.io/nvidia/tensorflow:20.03-tf1-py3

FROM ${FROM_IMAGE_NAME}
@ -24,6 +24,7 @@ This repository provides a script and recipe to train the Wide and Deep Recommen
* [Command-line options](#command-line-options)
* [Getting the data](#getting-the-data)
* [Dataset guidelines](#dataset-guidelines)
* [Spark preprocessing](#spark-preprocessing)
* [Training process](#training-process)
- [Performance](#performance)
* [Benchmarking](#benchmarking)
@ -195,7 +196,7 @@ docker build . -t wide_deep
4. Start an interactive session in the NGC container to run preprocessing/training/inference.

```bash
docker run --runtime=nvidia --rm -ti -v ${HOST_OUTBRAIN_PATH}:/outbrain wide_deep /bin/bash
docker run --runtime=nvidia --privileged --rm -ti -v ${HOST_OUTBRAIN_PATH}:/outbrain wide_deep /bin/bash
```
5. Start preprocessing.
@ -294,17 +295,54 @@ The Outbrain dataset can be downloaded from [Kaggle](https://www.kaggle.com/c/ou

#### Dataset guidelines

The dataset contains a sample of users’ page views and clicks, as observed on multiple publisher sites. Viewed pages and clicked recommendations have further semantic attributes of the documents.
The dataset contains a sample of users’ page views and clicks, as observed on multiple publisher sites. Viewed pages and clicked recommendations have additional semantic attributes of the documents.
The dataset contains sets of content recommendations served to a specific user in a specific context. Each context (i.e. a set of recommended ads) is given a `display_id`. In each such recommendation set, the user has clicked on exactly one of the ads.

The dataset contains sets of content recommendations served to a specific user in a specific context. Each context (i.e. a set of recommendations) is given a display_id. In each such set, the user has clicked on at least one recommendation. The page view logs originally has more than 2 billion rows (around 100 GB uncompressed).
The original data is stored in several separate files:
- `page_views.csv` - log of users visiting documents (2B rows, ~100GB uncompressed)
- `clicks_train.csv` - data showing which ad was clicked in each recommendation set (87M rows)
- `clicks_test.csv` - used only for the submission in the original Kaggle contest
- `events.csv` - metadata about the context of each recommendation set (23M rows)
- `promoted_content.csv` - metadata about the ads
- `document_meta.csv`, `document_topics.csv`, `document_entities.csv`, `document_categories.csv` - metadata about the documents

During the preprocessing stage, the data is transformed into tabular data of 54 features with 55 million training rows and is eventually saved in a pre-batched TFRecord format.

The data within the preprocessing stage are transferred into tabular data of 54 features, for training having 55 million rows.
#### Spark preprocessing

The original dataset is preprocessed using Spark scripts from the `preproc` directory. The preprocessing consists of the following operations:
- separating out the validation set for cross-validation
- filling missing data with the most frequent value
- generating the user profiles from the page views data
- joining the tables for the ad clicks data
- computing click-through rates (CTR) for ads grouped by different contexts (see the sketch below)
- computing cosine similarity between the features of the clicked ads and the viewed ads
- math transformations of the numeric features (taking the logarithm, scaling, binning)
- storing the resulting set of features in TFRecord format
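
To make the grouped CTR step concrete, here is a minimal, self-contained PySpark sketch of a click-through-rate computation grouped by one context column. The column names and toy data are illustrative only and are not taken from the actual preprocessing scripts.

```python
import pyspark.sql.functions as F
from pyspark.sql import SparkSession

spark = SparkSession.builder.master('local[*]').getOrCreate()

# Hypothetical input: one row per (display_id, ad_id) with a 0/1 click label.
clicks_df = spark.createDataFrame(
    [(1, 10, 1), (1, 11, 0), (2, 10, 0), (2, 12, 1)],
    ['display_id', 'ad_id', 'clicked'])

# CTR per ad: clicks divided by impressions for that ad.
ctr_df = (clicks_df
          .groupBy('ad_id')
          .agg(F.sum('clicked').alias('clicks'),
               F.count(F.lit(1)).alias('views'))
          .withColumn('ctr', F.col('clicks') / F.col('views')))
ctr_df.show()
```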

The `preproc1-4.py` preprocessing scripts use PySpark.
In the Docker image, Spark 2.3.1 is installed as a standalone Spark cluster.
The `preproc1.py` script splits the data into a training set and a validation set.
The `preproc2.py` script generates the user profiles from the page views data.
The `preproc3.py` script computes the click-through rates (CTR) and cosine similarities between the features.
The `preproc4.py` script performs the math transformations and generates the final TFRecord files.
The data in the output files is pre-batched (with a default batch size of 4096) to avoid the overhead
of the TFRecord format, which otherwise is not well suited to tabular data:
it stores a separate dictionary with every feature name in plain text for every data entry.
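
The pre-batching can be pictured as follows: instead of one `tf.train.Example` per row, each `Example` stores a whole batch of values for every feature, so the feature-name dictionary is amortized over 4096 rows. Below is a minimal sketch with a single, hypothetical integer feature (TF 1.x API, as used by this model); it is not the exact code from `preproc4.py`.

```python
import numpy as np
import tensorflow as tf  # TF 1.x

BATCH_SIZE = 4096
values = np.random.randint(0, 100, size=BATCH_SIZE)

# One Example carries the whole batch for the feature, so feature names are
# serialized once per 4096 rows instead of once per row.
example = tf.train.Example(features=tf.train.Features(feature={
    'some_int_feature': tf.train.Feature(
        int64_list=tf.train.Int64List(value=values.tolist())),
}))

with tf.python_io.TFRecordWriter('/tmp/prebatched_000.tfrecord') as writer:
    writer.write(example.SerializeToString())
```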

The preprocessing includes some very resource-intensive operations, such as joining tables with more than 2 billion rows.
Such operations may not fit into RAM, which is why we decided to use Spark, a tool well suited
to handling tabular operations on large data.
Note that the Spark job requires about 1 TB of disk space and 500 GB of RAM to perform the preprocessing.
For more information about Spark, please refer to the
[Spark documentation](https://spark.apache.org/docs/2.3.1/).

### Training process

The training can be started by running the `trainer/task.py` script. By default the script is in train mode. Other training related
configs are also present in the `trainer/task.py` and can be seen using the command `python -m trainer.task --help`. Training happens for `--num_epochs` epochs with custom estimator for the model. The model has a wide linear part and a deep feed forward network, and the networks are built according to the default configuration.
configs are also present in the `trainer/task.py` and can be seen using the command `python -m trainer.task --help`. Training happens for `--num_epochs` epochs with a custom estimator for the model. The model has a wide linear part and a deep feed forward network, and the networks are built according to the default configuration.

Two separate optimizers are used to optimize the wide and the deep part of the network:
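
The optimizer list itself is outside this hunk, but as a rough, hypothetical illustration of the two-optimizer wide & deep pattern (the actual optimizers and the 54 feature columns are configured in `trainer/task.py`), a canned-estimator sketch could look like this:

```python
import tensorflow as tf  # TF 1.x

# Illustrative feature columns only -- not the real feature configuration.
wide_columns = [tf.feature_column.categorical_column_with_hash_bucket(
    'ad_id', hash_bucket_size=250000)]
deep_columns = [tf.feature_column.numeric_column('ad_views')]

estimator = tf.estimator.DNNLinearCombinedClassifier(
    linear_feature_columns=wide_columns,
    linear_optimizer=tf.train.FtrlOptimizer(learning_rate=0.1),           # wide part
    dnn_feature_columns=deep_columns,
    dnn_optimizer=tf.train.ProximalAdagradOptimizer(learning_rate=0.05),  # deep part
    dnn_hidden_units=[1024, 512, 256])
```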
@ -399,6 +437,8 @@ This section needs to include the date of the release and the most important cha
March 2020
- Initial release

May 2020
- Improved Spark preprocessing scripts performance

### Known issues

- Limited tf.feature_column support
@ -1,150 +0,0 @@
|
|||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import argparse
|
||||
import datetime
|
||||
import sys
|
||||
|
||||
import outbrain_transform
|
||||
|
||||
import tensorflow as tf
|
||||
import glob
|
||||
|
||||
import pandas as pd
|
||||
|
||||
import trainer.features
|
||||
|
||||
|
||||
def parse_arguments(argv):
|
||||
"""Parse command line arguments.
|
||||
|
||||
Args:
|
||||
argv: list of command line arguments including program name.
|
||||
Returns:
|
||||
The parsed arguments as returned by argparse.ArgumentParser.
|
||||
"""
|
||||
parser = argparse.ArgumentParser(
|
||||
description='Runs Transformation on the Outbrain Click Prediction model data.')
|
||||
|
||||
parser.add_argument(
|
||||
'--training_data',
|
||||
default='',
|
||||
help='Data to analyze and encode as training features.')
|
||||
parser.add_argument(
|
||||
'--eval_data',
|
||||
default='',
|
||||
help='Data to encode as evaluation features.')
|
||||
parser.add_argument(
|
||||
'--output_dir',
|
||||
default=None,
|
||||
required=True,
|
||||
help=('Google Cloud Storage or Local directory in which '
|
||||
'to place outputs.'))
|
||||
parser.add_argument('--batch_size', default=None, type=int, help='Size of batches to create.')
|
||||
parser.add_argument('--submission', default=False, action='store_true', help='Use real test set for submission')
|
||||
|
||||
args, _ = parser.parse_known_args(args=argv[1:])
|
||||
|
||||
return args
|
||||
|
||||
# a version of this method that prefers pandas methods
|
||||
def local_transform_chunk(nr, csv, output_prefix, min_logs, max_logs, batch_size=None, remainder=None):
|
||||
# put any remainder at the front of the line, with the new rows after
|
||||
if remainder is not None:
|
||||
csv = remainder.append(csv)
|
||||
|
||||
# for each batch, slice into the datafrom to retrieve the corresponding data
|
||||
print(str(datetime.datetime.now()) + '\tWriting rows...')
|
||||
num_rows = len(csv.index)
|
||||
with tf.python_io.TFRecordWriter(output_prefix + str(nr).zfill(3) + '.tfrecord') as writer:
|
||||
for start_ind in range(0,num_rows,batch_size if batch_size is not None else 1): # for each batch
|
||||
if start_ind + batch_size - 1 > num_rows: # if we'd run out of rows
|
||||
return csv.iloc[start_ind:] # return remainder for use with the next file
|
||||
# otherwise write this batch to TFRecord
|
||||
csv_slice = csv.iloc[start_ind:start_ind+(batch_size if batch_size is not None else 1)]
|
||||
example = outbrain_transform.create_tf_example(csv_slice, min_logs, max_logs)
|
||||
writer.write(example.SerializeToString())
|
||||
|
||||
# calculate min and max stats for the given dataframes all in one go
|
||||
def compute_min_max_logs(dataframes):
|
||||
print(str(datetime.datetime.now()) + '\tComputing min and max')
|
||||
min_logs = {}
|
||||
max_logs = {}
|
||||
df = pd.concat(dataframes) # concatenate all dataframes, to process at once
|
||||
for name in trainer.features.FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
|
||||
feature_series = df[name]
|
||||
min_logs[name + '_log_01scaled'] = outbrain_transform.log2_1p(feature_series.min(axis=0)*1000)
|
||||
max_logs[name + '_log_01scaled'] = outbrain_transform.log2_1p(feature_series.max(axis=0)*1000)
|
||||
for name in trainer.features.INT_COLUMNS:
|
||||
feature_series = df[name]
|
||||
min_logs[name + '_log_01scaled'] = outbrain_transform.log2_1p(feature_series.min(axis=0))
|
||||
max_logs[name + '_log_01scaled'] = outbrain_transform.log2_1p(feature_series.max(axis=0))
|
||||
return min_logs, max_logs
|
||||
|
||||
|
||||
def main(argv=None):
|
||||
args = parse_arguments(sys.argv if argv is None else argv)
|
||||
|
||||
# Retrieve and sort training and eval data (to ensure consistent order)
|
||||
# Order is important so that the right data will get sorted together for MAP
|
||||
training_data = sorted(glob.glob(args.training_data))
|
||||
eval_data = sorted(glob.glob(args.eval_data))
|
||||
print('Training data:\n{}\nFound:\n{}'.format(args.training_data,training_data))
|
||||
print('Evaluation data:\n{}\nFound:\n{}'.format(args.eval_data,eval_data))
|
||||
|
||||
outbrain_transform.make_spec(args.output_dir + '/transformed_metadata', batch_size=args.batch_size)
|
||||
|
||||
# read all dataframes
|
||||
print('\n' + str(datetime.datetime.now()) + '\tReading input files')
|
||||
eval_dataframes = [pd.read_csv(filename, header=None, names=outbrain_transform.CSV_ORDERED_COLUMNS)
|
||||
for filename in eval_data]
|
||||
train_dataframes = [pd.read_csv(filename, header=None, names=outbrain_transform.CSV_ORDERED_COLUMNS)
|
||||
for filename in training_data]
|
||||
|
||||
# calculate stats once over all records given
|
||||
min_logs, max_logs = compute_min_max_logs(eval_dataframes + train_dataframes)
|
||||
|
||||
if args.submission:
|
||||
train_output_string = '/sub_train_'
|
||||
eval_output_string = '/test_'
|
||||
else:
|
||||
train_output_string = '/train_'
|
||||
eval_output_string = '/eval_'
|
||||
|
||||
# process eval files
|
||||
print('\n' + str(datetime.datetime.now()) + '\tWorking on evaluation data')
|
||||
eval_remainder = None # remainder when a file's records don't divide evenly into batches
|
||||
for i, df in enumerate(eval_dataframes):
|
||||
print(eval_data[i])
|
||||
eval_remainder = local_transform_chunk(i, df, args.output_dir + eval_output_string, min_logs, max_logs,
|
||||
batch_size=args.batch_size, remainder=eval_remainder)
|
||||
if eval_remainder is not None:
|
||||
print('Dropping {} records (eval) on the floor'.format(len(eval_remainder)))
|
||||
|
||||
# process train files
|
||||
print('\n' + str(datetime.datetime.now()) + '\tWorking on training data')
|
||||
train_remainder = None
|
||||
for i, df in enumerate(train_dataframes):
|
||||
print(training_data[i])
|
||||
train_remainder = local_transform_chunk(i, df, args.output_dir + train_output_string, min_logs, max_logs,
|
||||
batch_size=args.batch_size, remainder=train_remainder)
|
||||
if train_remainder is not None:
|
||||
print('Dropping {} records (train) on the floor'.format(len(train_remainder)))
|
||||
|
||||
if __name__ == '__main__':
|
||||
main()
|
|
@ -1,180 +0,0 @@
|
|||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
|
||||
from __future__ import absolute_import
|
||||
from __future__ import division
|
||||
from __future__ import print_function
|
||||
|
||||
import tensorflow as tf
|
||||
from tensorflow_transform.tf_metadata import dataset_schema
|
||||
from tensorflow_transform.tf_metadata import dataset_metadata
|
||||
from tensorflow_transform.tf_metadata import metadata_io
|
||||
import numpy as np
|
||||
|
||||
from trainer.features import LABEL_COLUMN, DISPLAY_ID_COLUMN, IS_LEAK_COLUMN, DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN, CATEGORICAL_COLUMNS, DOC_CATEGORICAL_MULTIVALUED_COLUMNS, BOOL_COLUMNS, INT_COLUMNS, FLOAT_COLUMNS, FLOAT_COLUMNS_LOG_BIN_TRANSFORM, FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM
|
||||
|
||||
RENAME_COLUMNS = False
|
||||
|
||||
CSV_ORDERED_COLUMNS = ['label','display_id','ad_id','doc_id','doc_event_id','is_leak','event_weekend',
|
||||
'user_has_already_viewed_doc','user_views','ad_views','doc_views',
|
||||
'doc_event_days_since_published','doc_event_hour','doc_ad_days_since_published',
|
||||
'pop_ad_id','pop_ad_id_conf',
|
||||
'pop_ad_id_conf_multipl','pop_document_id','pop_document_id_conf',
|
||||
'pop_document_id_conf_multipl','pop_publisher_id','pop_publisher_id_conf',
|
||||
'pop_publisher_id_conf_multipl','pop_advertiser_id','pop_advertiser_id_conf',
|
||||
'pop_advertiser_id_conf_multipl','pop_campain_id','pop_campain_id_conf',
|
||||
'pop_campain_id_conf_multipl','pop_doc_event_doc_ad','pop_doc_event_doc_ad_conf',
|
||||
'pop_doc_event_doc_ad_conf_multipl','pop_source_id','pop_source_id_conf',
|
||||
'pop_source_id_conf_multipl','pop_source_id_country','pop_source_id_country_conf',
|
||||
'pop_source_id_country_conf_multipl','pop_entity_id','pop_entity_id_conf',
|
||||
'pop_entity_id_conf_multipl','pop_entity_id_country','pop_entity_id_country_conf',
|
||||
'pop_entity_id_country_conf_multipl','pop_topic_id','pop_topic_id_conf',
|
||||
'pop_topic_id_conf_multipl','pop_topic_id_country','pop_topic_id_country_conf',
|
||||
'pop_topic_id_country_conf_multipl','pop_category_id','pop_category_id_conf',
|
||||
'pop_category_id_conf_multipl','pop_category_id_country','pop_category_id_country_conf',
|
||||
'pop_category_id_country_conf_multipl','user_doc_ad_sim_categories',
|
||||
'user_doc_ad_sim_categories_conf','user_doc_ad_sim_categories_conf_multipl',
|
||||
'user_doc_ad_sim_topics','user_doc_ad_sim_topics_conf','user_doc_ad_sim_topics_conf_multipl',
|
||||
'user_doc_ad_sim_entities','user_doc_ad_sim_entities_conf','user_doc_ad_sim_entities_conf_multipl',
|
||||
'doc_event_doc_ad_sim_categories','doc_event_doc_ad_sim_categories_conf',
|
||||
'doc_event_doc_ad_sim_categories_conf_multipl','doc_event_doc_ad_sim_topics',
|
||||
'doc_event_doc_ad_sim_topics_conf','doc_event_doc_ad_sim_topics_conf_multipl',
|
||||
'doc_event_doc_ad_sim_entities','doc_event_doc_ad_sim_entities_conf',
|
||||
'doc_event_doc_ad_sim_entities_conf_multipl','ad_advertiser','doc_ad_category_id_1',
|
||||
'doc_ad_category_id_2','doc_ad_category_id_3','doc_ad_topic_id_1','doc_ad_topic_id_2',
|
||||
'doc_ad_topic_id_3','doc_ad_entity_id_1','doc_ad_entity_id_2','doc_ad_entity_id_3',
|
||||
'doc_ad_entity_id_4','doc_ad_entity_id_5','doc_ad_entity_id_6','doc_ad_publisher_id',
|
||||
'doc_ad_source_id','doc_event_category_id_1','doc_event_category_id_2','doc_event_category_id_3',
|
||||
'doc_event_topic_id_1','doc_event_topic_id_2','doc_event_topic_id_3','doc_event_entity_id_1',
|
||||
'doc_event_entity_id_2','doc_event_entity_id_3','doc_event_entity_id_4','doc_event_entity_id_5',
|
||||
'doc_event_entity_id_6','doc_event_publisher_id','doc_event_source_id','event_country',
|
||||
'event_country_state','event_geo_location','event_hour','event_platform','traffic_source']
|
||||
|
||||
def make_spec(output_dir, batch_size=None):
|
||||
fixed_shape = [batch_size,1] if batch_size is not None else []
|
||||
spec = {}
|
||||
spec[LABEL_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
spec[DISPLAY_ID_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
spec[IS_LEAK_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
spec[DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
|
||||
for name in BOOL_COLUMNS:
|
||||
spec[name] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM+FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
|
||||
spec[name] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
|
||||
for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
|
||||
spec[name + '_binned'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
|
||||
spec[name + '_binned'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
spec[name + '_log_01scaled'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
|
||||
for name in INT_COLUMNS:
|
||||
spec[name + '_log_int'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
spec[name + '_log_01scaled'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
|
||||
for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
|
||||
spec[name] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
|
||||
|
||||
for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
|
||||
#spec[multi_category] = tf.VarLenFeature(dtype=tf.int64)
|
||||
shape = fixed_shape[:-1]+[len(DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category])]
|
||||
spec[multi_category] = tf.FixedLenFeature(shape=shape, dtype=tf.int64)
|
||||
|
||||
metadata = dataset_metadata.DatasetMetadata(dataset_schema.from_feature_spec(spec))
|
||||
|
||||
metadata_io.write_metadata(metadata, output_dir)
|
||||
|
||||
def tf_log2_1p(x):
|
||||
return tf.log1p(x) / tf.log(2.0)
|
||||
|
||||
def log2_1p(x):
|
||||
return np.log1p(x) / np.log(2.0)
|
||||
|
||||
def compute_min_max_logs(rows):
|
||||
min_logs = {}
|
||||
max_logs = {}
|
||||
|
||||
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM + INT_COLUMNS:
|
||||
min_logs[name + '_log_01scaled'] = float("inf")
|
||||
max_logs[name + '_log_01scaled'] = float("-inf")
|
||||
|
||||
for row in rows:
|
||||
names = CSV_ORDERED_COLUMNS
|
||||
columns_dict = dict(zip(names, row))
|
||||
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
|
||||
nn = name + '_log_01scaled'
|
||||
min_logs[nn] = min(min_logs[nn], log2_1p(columns_dict[name] * 1000))
|
||||
max_logs[nn] = max(max_logs[nn], log2_1p(columns_dict[name] * 1000))
|
||||
for name in INT_COLUMNS:
|
||||
nn = name + '_log_01scaled'
|
||||
min_logs[nn] = min(min_logs[nn], log2_1p(columns_dict[name]))
|
||||
max_logs[nn] = max(max_logs[nn], log2_1p(columns_dict[name]))
|
||||
|
||||
return min_logs, max_logs
|
||||
|
||||
def scale_to_0_1(val, minv, maxv):
|
||||
return (val - minv) / (maxv - minv)
|
||||
|
||||
def create_tf_example(df, min_logs, max_logs):
|
||||
result = {}
|
||||
result[LABEL_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[LABEL_COLUMN].to_list()))
|
||||
result[DISPLAY_ID_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[DISPLAY_ID_COLUMN].to_list()))
|
||||
result[IS_LEAK_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[IS_LEAK_COLUMN].to_list()))
|
||||
#is_leak = df[IS_LEAK_COLUMN].to_list()
|
||||
encoded_value = df[DISPLAY_ID_COLUMN].multiply(10).add(df[IS_LEAK_COLUMN].clip(lower=0)).to_list()
|
||||
# * 10 + (0 if is_leak < 0 else is_leak)
|
||||
result[DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=encoded_value))
|
||||
|
||||
for name in FLOAT_COLUMNS:
|
||||
result[name] = tf.train.Feature(float_list=tf.train.FloatList(value=df[name].to_list()))
|
||||
for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
|
||||
#[int(columns_dict[name] * 10)]
|
||||
value = df[name].multiply(10).astype('int64').to_list()
|
||||
result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
|
||||
for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
|
||||
# [int(log2_1p(columns_dict[name] * 1000))]
|
||||
value_prelim = df[name].multiply(1000).apply(np.log1p).multiply(1./np.log(2.0))
|
||||
value = value_prelim.astype('int64').to_list()
|
||||
result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
|
||||
nn = name + '_log_01scaled'
|
||||
#val = log2_1p(columns_dict[name] * 1000)
|
||||
#val = scale_to_0_1(val, min_logs[nn], max_logs[nn])
|
||||
value = value_prelim.add(-min_logs[nn]).multiply(1./(max_logs[nn]-min_logs[nn])).to_list()
|
||||
result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
|
||||
for name in INT_COLUMNS:
|
||||
#[int(log2_1p(columns_dict[name]))]
|
||||
value_prelim = df[name].apply(np.log1p).multiply(1./np.log(2.0))
|
||||
value = value_prelim.astype('int64').to_list()
|
||||
result[name + '_log_int'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
|
||||
nn = name + '_log_01scaled'
|
||||
#val = log2_1p(columns_dict[name])
|
||||
#val = scale_to_0_1(val, min_logs[nn], max_logs[nn])
|
||||
value = value_prelim.add(-min_logs[nn]).multiply(1./(max_logs[nn]-min_logs[nn])).to_list()
|
||||
result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
|
||||
|
||||
for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
|
||||
result[name] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[name].to_list()))
|
||||
|
||||
for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
|
||||
values = []
|
||||
for category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category]:
|
||||
values = values + [df[category].to_numpy()]
|
||||
# need to transpose the series so they will be parsed correctly by the FixedLenFeature
|
||||
# we can pass in a single series here; they'll be reshaped to [batch_size, num_values]
|
||||
# when parsed from the TFRecord
|
||||
value = np.stack(values, axis=1).flatten().tolist()
|
||||
result[multi_category] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
|
||||
|
||||
tf_example = tf.train.Example(features=tf.train.Features(feature=result))
|
||||
|
||||
return tf_example
|
|
@ -1,88 +0,0 @@
|
|||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from __future__ import print_function
|
||||
|
||||
import pandas as pd
|
||||
import os
|
||||
import glob
|
||||
import tqdm
|
||||
import argparse
|
||||
from joblib import Parallel, delayed
|
||||
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
parser.add_argument('--train_files_pattern', default='train_feature_vectors_integral_eval.csv/part-*')
|
||||
parser.add_argument('--valid_files_pattern', default='validation_feature_vectors_integral.csv/part-*')
|
||||
parser.add_argument('--train_dst_dir', default='train_feature_vectors_integral_eval_imputed.csv')
|
||||
parser.add_argument('--valid_dst_dir', default='validation_feature_vectors_integral_imputed.csv')
|
||||
parser.add_argument('--header_path', default='train_feature_vectors_integral_eval.csv.header')
|
||||
parser.add_argument('--num_workers', type=int, default=4)
|
||||
args = parser.parse_args()
|
||||
|
||||
header = pd.read_csv(args.header_path, header=None)
|
||||
columns = header[0].to_list()
|
||||
|
||||
train_files = glob.glob(args.train_files_pattern)
|
||||
print('train files: ', train_files)
|
||||
|
||||
def get_counts(f):
|
||||
df = pd.read_csv(f, header=None, dtype=object, names=columns, na_values='None')
|
||||
counts = {}
|
||||
for c in df:
|
||||
counts[c] = df[c].value_counts()
|
||||
return counts
|
||||
|
||||
all_counts = Parallel(n_jobs=args.num_workers)(delayed(get_counts)(f) for f in train_files)
|
||||
cols = len(all_counts[0])
|
||||
imputation_dict = {}
|
||||
for c in tqdm.tqdm(columns):
|
||||
temp = None
|
||||
for i in range(len(all_counts)):
|
||||
if temp is None:
|
||||
temp = pd.Series(all_counts[i][c])
|
||||
else:
|
||||
temp += pd.Series(all_counts[i][c])
|
||||
if len(temp) == 0:
|
||||
imputation_dict[c] = 0
|
||||
else:
|
||||
imputation_dict[c] = temp.index[0]
|
||||
|
||||
print('imputation_dict: ', imputation_dict)
|
||||
|
||||
if not os.path.exists(args.train_dst_dir):
|
||||
os.mkdir(args.train_dst_dir)
|
||||
|
||||
def impute_part(src_path, dst_dir):
|
||||
print('imputing: ', src_path, ' to: ', dst_dir)
|
||||
filename = os.path.basename(src_path)
|
||||
dst_path = os.path.join(dst_dir, filename)
|
||||
|
||||
df = pd.read_csv(src_path, header=None, dtype=object, names=columns, na_values='None')
|
||||
df2 = df.fillna(imputation_dict)
|
||||
df2.to_csv(dst_path, header=None, index=None)
|
||||
|
||||
|
||||
print('launching imputation for train CSVs')
|
||||
Parallel(n_jobs=args.num_workers)(delayed(impute_part)(f, args.train_dst_dir) for f in train_files)
|
||||
|
||||
valid_files = glob.glob(args.valid_files_pattern)
|
||||
|
||||
if not os.path.exists(args.valid_dst_dir):
|
||||
os.mkdir(args.valid_dst_dir)
|
||||
|
||||
print('launching imputation for validation CSVs')
|
||||
Parallel(n_jobs=args.num_workers)(delayed(impute_part)(f, args.valid_dst_dir) for f in valid_files)
|
||||
|
||||
print('Done!')
|
Binary file not shown.
|
@ -19,21 +19,20 @@ OUTPUT_BUCKET_FOLDER = "/outbrain/preprocessed/"
|
|||
DATA_BUCKET_FOLDER = "/outbrain/orig/"
|
||||
SPARK_TEMP_FOLDER = "/outbrain/spark-temp/"
|
||||
|
||||
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
|
||||
from pyspark.sql.types import IntegerType, StringType, StructType, StructField
|
||||
import pyspark.sql.functions as F
|
||||
|
||||
from pyspark.context import SparkContext, SparkConf
|
||||
from pyspark.sql.session import SparkSession
|
||||
from pyspark.sql.functions import col
|
||||
|
||||
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '256g').set('spark.driver.memory', '126g').set("spark.local.dir", SPARK_TEMP_FOLDER)
|
||||
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set("spark.local.dir", SPARK_TEMP_FOLDER)
|
||||
|
||||
sc = SparkContext(conf=conf)
|
||||
spark = SparkSession(sc)
|
||||
|
||||
print('Loading data...')
|
||||
|
||||
truncate_day_from_timestamp_udf = F.udf(lambda ts: int(ts / 1000 / 60 / 60 / 24), IntegerType())
|
||||
|
||||
events_schema = StructType(
|
||||
[StructField("display_id", IntegerType(), True),
|
||||
StructField("uuid_event", StringType(), True),
|
||||
|
@ -46,7 +45,7 @@ events_schema = StructType(
events_df = spark.read.schema(events_schema) \
    .options(header='true', inferschema='false', nullValue='\\N') \
    .csv(DATA_BUCKET_FOLDER + "events.csv") \
    .withColumn('day_event', truncate_day_from_timestamp_udf('timestamp_event')) \
    .withColumn('day_event', (col('timestamp_event') / 1000 / 60 / 60 / 24).cast("int")) \
    .alias('events')

events_df.count()
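# Illustration (not part of the committed file): the change above replaces the Python
# UDF truncate_day_from_timestamp_udf with built-in column arithmetic plus a cast,
# which keeps the computation inside the JVM/Catalyst engine instead of shipping every
# row through a Python worker -- one of the performance improvements in this commit.
# A minimal before/after sketch on a hypothetical one-row DataFrame:
example_df = spark.createDataFrame([(3 * 86400000 + 123,)], ['timestamp_event'])

# Before: Python UDF -- every row is serialized to a Python worker process.
slow_df = example_df.withColumn(
    'day_event', truncate_day_from_timestamp_udf('timestamp_event'))

# After: native column expressions -- evaluated entirely inside the JVM.
fast_df = example_df.withColumn(
    'day_event', (col('timestamp_event') / 1000 / 60 / 60 / 24).cast('int'))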
|
|
|
@ -25,9 +25,6 @@ import pyspark.sql.functions as F
|
|||
import math
|
||||
import time
|
||||
|
||||
import random
|
||||
random.seed(42)
|
||||
|
||||
from pyspark.context import SparkContext, SparkConf
|
||||
from pyspark.sql.session import SparkSession
|
||||
|
||||
|
@ -43,7 +40,7 @@ args = parser.parse_args()
|
|||
|
||||
evaluation = not args.submission
|
||||
|
||||
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '256g').set('spark.driver.memory', '126g').set("spark.local.dir", SPARK_TEMP_FOLDER)
|
||||
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set("spark.local.dir", SPARK_TEMP_FOLDER)
|
||||
|
||||
sc = SparkContext(conf=conf)
|
||||
spark = SparkSession(sc)
|
||||
|
|
|
@ -28,7 +28,7 @@ from pyspark.ml.linalg import SparseVector, VectorUDT
|
|||
from pyspark.context import SparkContext, SparkConf
|
||||
from pyspark.sql.session import SparkSession
|
||||
|
||||
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '256g').set('spark.driver.memory', '126g').set("spark.local.dir", SPARK_TEMP_FOLDER)
|
||||
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set("spark.local.dir", SPARK_TEMP_FOLDER)
|
||||
|
||||
sc = SparkContext(conf=conf)
|
||||
spark = SparkSession(sc)
|
||||
|
@ -59,7 +59,6 @@ args = parser.parse_args()
|
|||
|
||||
evaluation = not args.submission
|
||||
|
||||
|
||||
# ## UDFs
|
||||
def date_time_to_unix_epoch(date_time):
|
||||
return int(time.mktime(date_time.timetuple()))
|
||||
|
@ -98,7 +97,6 @@ def convert_odd_timestamp(timestamp_ms_relative):
|
|||
TIMESTAMP_DELTA=1465876799998
|
||||
return datetime.datetime.fromtimestamp((int(timestamp_ms_relative)+TIMESTAMP_DELTA)//1000)
|
||||
|
||||
|
||||
# # Loading Files
|
||||
|
||||
# ## Loading UTC/BST for each country and US / CA states (local time)
|
||||
|
@ -870,9 +868,9 @@ len(entities_docs_counts)
|
|||
documents_total = documents_meta_df.count()
|
||||
documents_total
|
||||
|
||||
|
||||
# ## Exploring Publish Time
|
||||
publish_times_df = train_set_df.filter('publish_time is not null').select('document_id_promo','publish_time').distinct().select(F.col('publish_time').cast(IntegerType()))
|
||||
|
||||
publish_time_percentiles = get_percentiles(publish_times_df, 'publish_time', quantiles_levels=[0.5], max_error_rate=0.001)
|
||||
publish_time_percentiles
|
||||
|
||||
|
@ -1962,42 +1960,6 @@ else:
|
|||
|
||||
train_set_feature_vectors_df.write.parquet(OUTPUT_BUCKET_FOLDER+train_feature_vector_gcs_folder_name, mode='overwrite')
|
||||
|
||||
|
||||
# ## Exporting integral feature vectors to CSV
|
||||
train_feature_vectors_exported_df = spark.read.parquet(OUTPUT_BUCKET_FOLDER+train_feature_vector_gcs_folder_name)
|
||||
train_feature_vectors_exported_df.take(3)
|
||||
|
||||
if evaluation:
|
||||
train_feature_vector_integral_csv_folder_name = 'train_feature_vectors_integral_eval.csv'
|
||||
else:
|
||||
train_feature_vector_integral_csv_folder_name = 'train_feature_vectors_integral.csv'
|
||||
|
||||
integral_headers = ['label', 'display_id', 'ad_id', 'doc_id', 'doc_event_id', 'is_leak'] + feature_vector_labels_integral
|
||||
|
||||
with open(OUTPUT_BUCKET_FOLDER+train_feature_vector_integral_csv_folder_name+".header", 'w') as output:
|
||||
output.writelines('\n'.join(integral_headers))
|
||||
|
||||
def sparse_vector_to_csv_with_nulls_row(additional_column_values, vec, num_columns):
|
||||
def format_number(x):
|
||||
if int(x) == x:
|
||||
return str(int(x))
|
||||
else:
|
||||
return '{:.3}'.format(x)
|
||||
|
||||
return ','.join([str(value) for value in additional_column_values] +
|
||||
list([ format_number(vec[x]) if x in vec.indices else '' for x in range(vec.size) ])[:num_columns]) \
|
||||
.replace('.0,',',')
|
||||
|
||||
train_feature_vectors_integral_csv_rdd = train_feature_vectors_exported_df.select(
|
||||
'label', 'display_id', 'ad_id', 'document_id', 'document_id_event', 'feature_vector') \
|
||||
.withColumn('is_leak', F.lit(-1)) \
|
||||
.rdd.map(lambda x: sparse_vector_to_csv_with_nulls_row([x['label'], x['display_id'],
|
||||
x['ad_id'], x['document_id'], x['document_id_event'], x['is_leak']],
|
||||
x['feature_vector'], len(integral_headers)))
|
||||
|
||||
train_feature_vectors_integral_csv_rdd.saveAsTextFile(OUTPUT_BUCKET_FOLDER+train_feature_vector_integral_csv_folder_name)
|
||||
|
||||
|
||||
# # Export Validation/Test set feature vectors
|
||||
def is_leak(max_timestamp_pv_leak, timestamp_event):
|
||||
return max_timestamp_pv_leak >= 0 and max_timestamp_pv_leak >= timestamp_event
|
||||
|
@ -2009,7 +1971,6 @@ if evaluation:
|
|||
else:
|
||||
data_df = test_set_df
|
||||
|
||||
|
||||
test_validation_set_enriched_df = data_df.select(
|
||||
'display_id','uuid_event','event_country','event_country_state','platform_event',
|
||||
'source_id_doc_event', 'publisher_doc_event','publish_time_doc_event',
|
||||
|
@ -2099,35 +2060,5 @@ else:
|
|||
|
||||
test_validation_set_feature_vectors_df.write.parquet(OUTPUT_BUCKET_FOLDER+test_validation_feature_vector_gcs_folder_name, mode='overwrite')
|
||||
|
||||
|
||||
# ## Exporting integral feature vectors to CSV
|
||||
test_validation_feature_vectors_exported_df = spark.read.parquet(OUTPUT_BUCKET_FOLDER+test_validation_feature_vector_gcs_folder_name)
|
||||
test_validation_feature_vectors_exported_df.take(3)
|
||||
|
||||
if evaluation:
|
||||
test_validation_feature_vector_integral_csv_folder_name = \
|
||||
'validation_feature_vectors_integral.csv'
|
||||
else:
|
||||
test_validation_feature_vector_integral_csv_folder_name = \
|
||||
'test_feature_vectors_integral.csv'
|
||||
|
||||
integral_headers = ['label', 'display_id', 'ad_id', 'doc_id', 'doc_event_id', 'is_leak'] \
|
||||
+ feature_vector_labels_integral
|
||||
|
||||
with open(OUTPUT_BUCKET_FOLDER + test_validation_feature_vector_integral_csv_folder_name \
|
||||
+".header", 'w') as output:
|
||||
output.writelines('\n'.join(integral_headers))
|
||||
|
||||
test_validation_feature_vectors_integral_csv_rdd = \
|
||||
test_validation_feature_vectors_exported_df.select(
|
||||
'label', 'display_id', 'ad_id', 'document_id', 'document_id_event',
|
||||
'is_leak', 'feature_vector') \
|
||||
.rdd.map(lambda x: sparse_vector_to_csv_with_nulls_row([x['label'],
|
||||
x['display_id'], x['ad_id'], x['document_id'], x['document_id_event'], x['is_leak']],
|
||||
x['feature_vector'], len(integral_headers)))
|
||||
|
||||
test_validation_feature_vectors_integral_csv_rdd.saveAsTextFile(
|
||||
OUTPUT_BUCKET_FOLDER+test_validation_feature_vector_integral_csv_folder_name)
|
||||
|
||||
spark.stop()
|
||||
|
||||
|
|
568
TensorFlow/Recommendation/WideAndDeep/preproc/preproc4.py
Normal file
@ -0,0 +1,568 @@
|
|||
#!/usr/bin/env python
|
||||
# coding: utf-8
|
||||
|
||||
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
evaluation = True
|
||||
evaluation_verbose = False
|
||||
|
||||
OUTPUT_BUCKET_FOLDER = "/outbrain/preprocessed/"
|
||||
DATA_BUCKET_FOLDER = "/outbrain/orig/"
|
||||
SPARK_TEMP_FOLDER = "/outbrain/spark-temp/"
|
||||
LOCAL_DATA_TFRECORDS_DIR="/outbrain/tfrecords"
|
||||
|
||||
TEST_SET_MODE = False
|
||||
|
||||
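# Hadoop/Spark connector jar for the TFRecord format, so Spark can write TFRecord files directly.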
TENSORFLOW_HADOOP="preproc/data/tensorflow-hadoop-1.5.0.jar"
|
||||
|
||||
from IPython.display import display
|
||||
|
||||
import pyspark.sql.functions as F
|
||||
from pyspark.ml.linalg import Vectors, SparseVector, VectorUDT
|
||||
|
||||
from pyspark.context import SparkContext, SparkConf
|
||||
from pyspark.sql.session import SparkSession
|
||||
|
||||
conf = SparkConf().setMaster('local[*]').set('spark.executor.memory', '40g').set('spark.driver.memory', '200g').set("spark.local.dir", SPARK_TEMP_FOLDER)
|
||||
conf.set("spark.jars", TENSORFLOW_HADOOP)
|
||||
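# 805306368 bytes = 768 MB per input partition (up from the 128 MB default), so input is read in fewer, larger tasks.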
conf.set("spark.sql.files.maxPartitionBytes", 805306368)
|
||||
|
||||
sc = SparkContext(conf=conf)
|
||||
spark = SparkSession(sc)
|
||||
|
||||
from pyspark.sql import Row
|
||||
from pyspark.sql.types import ArrayType, BinaryType, DoubleType, LongType, StringType, StructField, StructType
|
||||
from pyspark.sql.functions import col, when, log1p, udf
|
||||
|
||||
import numpy as np
|
||||
import scipy.sparse
|
||||
|
||||
import math
|
||||
import datetime
|
||||
import time
|
||||
import itertools
|
||||
|
||||
import pickle
|
||||
|
||||
import pandas as pd
|
||||
import tensorflow as tf
|
||||
from tensorflow_transform.tf_metadata import dataset_schema
|
||||
from tensorflow_transform.tf_metadata import dataset_metadata
|
||||
from tensorflow_transform.tf_metadata import metadata_io
|
||||
|
||||
import trainer
|
||||
from trainer.features import LABEL_COLUMN, DISPLAY_ID_COLUMN, AD_ID_COLUMN, IS_LEAK_COLUMN, DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN, CATEGORICAL_COLUMNS, DOC_CATEGORICAL_MULTIVALUED_COLUMNS, BOOL_COLUMNS, INT_COLUMNS, FLOAT_COLUMNS, FLOAT_COLUMNS_LOG_BIN_TRANSFORM, FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM
|
||||
|
||||
import argparse
|
||||
|
||||
parser = argparse.ArgumentParser()
|
||||
|
||||
parser.add_argument(
|
||||
'--prebatch_size',
|
||||
help='Prebatch size in created tfrecords',
|
||||
type=int,
|
||||
default=4096)
|
||||
|
||||
parser.add_argument(
|
||||
'--submission',
|
||||
action='store_true',
|
||||
default=False
|
||||
)
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
batch_size = args.prebatch_size
|
||||
|
||||
# # Feature Vector export
|
||||
bool_feature_names = ['event_weekend',
|
||||
'user_has_already_viewed_doc']
|
||||
|
||||
int_feature_names = ['user_views',
|
||||
'ad_views',
|
||||
'doc_views',
|
||||
'doc_event_days_since_published',
|
||||
'doc_event_hour',
|
||||
'doc_ad_days_since_published',
|
||||
]
|
||||
|
||||
float_feature_names = [
|
||||
'pop_ad_id',
|
||||
'pop_ad_id_conf',
|
||||
'pop_ad_id_conf_multipl',
|
||||
'pop_document_id',
|
||||
'pop_document_id_conf',
|
||||
'pop_document_id_conf_multipl',
|
||||
'pop_publisher_id',
|
||||
'pop_publisher_id_conf',
|
||||
'pop_publisher_id_conf_multipl',
|
||||
'pop_advertiser_id',
|
||||
'pop_advertiser_id_conf',
|
||||
'pop_advertiser_id_conf_multipl',
|
||||
'pop_campain_id',
|
||||
'pop_campain_id_conf',
|
||||
'pop_campain_id_conf_multipl',
|
||||
'pop_doc_event_doc_ad',
|
||||
'pop_doc_event_doc_ad_conf',
|
||||
'pop_doc_event_doc_ad_conf_multipl',
|
||||
'pop_source_id',
|
||||
'pop_source_id_conf',
|
||||
'pop_source_id_conf_multipl',
|
||||
'pop_source_id_country',
|
||||
'pop_source_id_country_conf',
|
||||
'pop_source_id_country_conf_multipl',
|
||||
'pop_entity_id',
|
||||
'pop_entity_id_conf',
|
||||
'pop_entity_id_conf_multipl',
|
||||
'pop_entity_id_country',
|
||||
'pop_entity_id_country_conf',
|
||||
'pop_entity_id_country_conf_multipl',
|
||||
'pop_topic_id',
|
||||
'pop_topic_id_conf',
|
||||
'pop_topic_id_conf_multipl',
|
||||
'pop_topic_id_country',
|
||||
'pop_topic_id_country_conf',
|
||||
'pop_topic_id_country_conf_multipl',
|
||||
'pop_category_id',
|
||||
'pop_category_id_conf',
|
||||
'pop_category_id_conf_multipl',
|
||||
'pop_category_id_country',
|
||||
'pop_category_id_country_conf',
|
||||
'pop_category_id_country_conf_multipl',
|
||||
'user_doc_ad_sim_categories',
|
||||
'user_doc_ad_sim_categories_conf',
|
||||
'user_doc_ad_sim_categories_conf_multipl',
|
||||
'user_doc_ad_sim_topics',
|
||||
'user_doc_ad_sim_topics_conf',
|
||||
'user_doc_ad_sim_topics_conf_multipl',
|
||||
'user_doc_ad_sim_entities',
|
||||
'user_doc_ad_sim_entities_conf',
|
||||
'user_doc_ad_sim_entities_conf_multipl',
|
||||
'doc_event_doc_ad_sim_categories',
|
||||
'doc_event_doc_ad_sim_categories_conf',
|
||||
'doc_event_doc_ad_sim_categories_conf_multipl',
|
||||
'doc_event_doc_ad_sim_topics',
|
||||
'doc_event_doc_ad_sim_topics_conf',
|
||||
'doc_event_doc_ad_sim_topics_conf_multipl',
|
||||
'doc_event_doc_ad_sim_entities',
|
||||
'doc_event_doc_ad_sim_entities_conf',
|
||||
'doc_event_doc_ad_sim_entities_conf_multipl'
|
||||
]
|
||||
|
||||
# ### Configuring feature vector
|
||||
category_feature_names_integral = ['ad_advertiser',
|
||||
'doc_ad_category_id_1',
|
||||
'doc_ad_category_id_2',
|
||||
'doc_ad_category_id_3',
|
||||
'doc_ad_topic_id_1',
|
||||
'doc_ad_topic_id_2',
|
||||
'doc_ad_topic_id_3',
|
||||
'doc_ad_entity_id_1',
|
||||
'doc_ad_entity_id_2',
|
||||
'doc_ad_entity_id_3',
|
||||
'doc_ad_entity_id_4',
|
||||
'doc_ad_entity_id_5',
|
||||
'doc_ad_entity_id_6',
|
||||
'doc_ad_publisher_id',
|
||||
'doc_ad_source_id',
|
||||
'doc_event_category_id_1',
|
||||
'doc_event_category_id_2',
|
||||
'doc_event_category_id_3',
|
||||
'doc_event_topic_id_1',
|
||||
'doc_event_topic_id_2',
|
||||
'doc_event_topic_id_3',
|
||||
'doc_event_entity_id_1',
|
||||
'doc_event_entity_id_2',
|
||||
'doc_event_entity_id_3',
|
||||
'doc_event_entity_id_4',
|
||||
'doc_event_entity_id_5',
|
||||
'doc_event_entity_id_6',
|
||||
'doc_event_publisher_id',
|
||||
'doc_event_source_id',
|
||||
'event_country',
|
||||
'event_country_state',
|
||||
'event_geo_location',
|
||||
'event_hour',
|
||||
'event_platform',
|
||||
'traffic_source']
|
||||
|
||||
|
||||
feature_vector_labels_integral = bool_feature_names + int_feature_names + float_feature_names + category_feature_names_integral
|
||||
|
||||
if args.submission:
|
||||
train_feature_vector_gcs_folder_name = 'train_feature_vectors_integral'
|
||||
else:
|
||||
train_feature_vector_gcs_folder_name = 'train_feature_vectors_integral_eval'
|
||||
|
||||
# ## Exporting integral feature vectors to CSV
|
||||
train_feature_vectors_exported_df = spark.read.parquet(OUTPUT_BUCKET_FOLDER+train_feature_vector_gcs_folder_name)
|
||||
train_feature_vectors_exported_df.take(3)
|
||||
|
||||
integral_headers = ['label', 'display_id', 'ad_id', 'doc_id', 'doc_event_id', 'is_leak'] + feature_vector_labels_integral
|
||||
|
||||
CSV_ORDERED_COLUMNS = ['label','display_id','ad_id','doc_id','doc_event_id','is_leak','event_weekend',
|
||||
'user_has_already_viewed_doc','user_views','ad_views','doc_views',
|
||||
'doc_event_days_since_published','doc_event_hour','doc_ad_days_since_published',
|
||||
'pop_ad_id','pop_ad_id_conf',
|
||||
'pop_ad_id_conf_multipl','pop_document_id','pop_document_id_conf',
|
||||
'pop_document_id_conf_multipl','pop_publisher_id','pop_publisher_id_conf',
|
||||
'pop_publisher_id_conf_multipl','pop_advertiser_id','pop_advertiser_id_conf',
|
||||
'pop_advertiser_id_conf_multipl','pop_campain_id','pop_campain_id_conf',
|
||||
'pop_campain_id_conf_multipl','pop_doc_event_doc_ad','pop_doc_event_doc_ad_conf',
|
||||
'pop_doc_event_doc_ad_conf_multipl','pop_source_id','pop_source_id_conf',
|
||||
'pop_source_id_conf_multipl','pop_source_id_country','pop_source_id_country_conf',
|
||||
'pop_source_id_country_conf_multipl','pop_entity_id','pop_entity_id_conf',
|
||||
'pop_entity_id_conf_multipl','pop_entity_id_country','pop_entity_id_country_conf',
|
||||
'pop_entity_id_country_conf_multipl','pop_topic_id','pop_topic_id_conf',
|
||||
'pop_topic_id_conf_multipl','pop_topic_id_country','pop_topic_id_country_conf',
|
||||
'pop_topic_id_country_conf_multipl','pop_category_id','pop_category_id_conf',
|
||||
'pop_category_id_conf_multipl','pop_category_id_country','pop_category_id_country_conf',
|
||||
'pop_category_id_country_conf_multipl','user_doc_ad_sim_categories',
|
||||
'user_doc_ad_sim_categories_conf','user_doc_ad_sim_categories_conf_multipl',
|
||||
'user_doc_ad_sim_topics','user_doc_ad_sim_topics_conf','user_doc_ad_sim_topics_conf_multipl',
|
||||
'user_doc_ad_sim_entities','user_doc_ad_sim_entities_conf','user_doc_ad_sim_entities_conf_multipl',
|
||||
'doc_event_doc_ad_sim_categories','doc_event_doc_ad_sim_categories_conf',
|
||||
'doc_event_doc_ad_sim_categories_conf_multipl','doc_event_doc_ad_sim_topics',
|
||||
'doc_event_doc_ad_sim_topics_conf','doc_event_doc_ad_sim_topics_conf_multipl',
|
||||
'doc_event_doc_ad_sim_entities','doc_event_doc_ad_sim_entities_conf',
|
||||
'doc_event_doc_ad_sim_entities_conf_multipl','ad_advertiser','doc_ad_category_id_1',
|
||||
'doc_ad_category_id_2','doc_ad_category_id_3','doc_ad_topic_id_1','doc_ad_topic_id_2',
|
||||
'doc_ad_topic_id_3','doc_ad_entity_id_1','doc_ad_entity_id_2','doc_ad_entity_id_3',
|
||||
'doc_ad_entity_id_4','doc_ad_entity_id_5','doc_ad_entity_id_6','doc_ad_publisher_id',
|
||||
'doc_ad_source_id','doc_event_category_id_1','doc_event_category_id_2','doc_event_category_id_3',
|
||||
'doc_event_topic_id_1','doc_event_topic_id_2','doc_event_topic_id_3','doc_event_entity_id_1',
|
||||
'doc_event_entity_id_2','doc_event_entity_id_3','doc_event_entity_id_4','doc_event_entity_id_5',
|
||||
'doc_event_entity_id_6','doc_event_publisher_id','doc_event_source_id','event_country',
|
||||
'event_country_state','event_geo_location','event_hour','event_platform','traffic_source']
|
||||
|
||||
FEAT_CSV_ORDERED_COLUMNS = ['event_weekend',
|
||||
'user_has_already_viewed_doc','user_views','ad_views','doc_views',
|
||||
'doc_event_days_since_published','doc_event_hour','doc_ad_days_since_published',
|
||||
'pop_ad_id','pop_ad_id_conf',
|
||||
'pop_ad_id_conf_multipl','pop_document_id','pop_document_id_conf',
|
||||
'pop_document_id_conf_multipl','pop_publisher_id','pop_publisher_id_conf',
|
||||
'pop_publisher_id_conf_multipl','pop_advertiser_id','pop_advertiser_id_conf',
|
||||
'pop_advertiser_id_conf_multipl','pop_campain_id','pop_campain_id_conf',
|
||||
'pop_campain_id_conf_multipl','pop_doc_event_doc_ad','pop_doc_event_doc_ad_conf',
|
||||
'pop_doc_event_doc_ad_conf_multipl','pop_source_id','pop_source_id_conf',
|
||||
'pop_source_id_conf_multipl','pop_source_id_country','pop_source_id_country_conf',
|
||||
'pop_source_id_country_conf_multipl','pop_entity_id','pop_entity_id_conf',
|
||||
'pop_entity_id_conf_multipl','pop_entity_id_country','pop_entity_id_country_conf',
|
||||
'pop_entity_id_country_conf_multipl','pop_topic_id','pop_topic_id_conf',
|
||||
'pop_topic_id_conf_multipl','pop_topic_id_country','pop_topic_id_country_conf',
|
||||
'pop_topic_id_country_conf_multipl','pop_category_id','pop_category_id_conf',
|
||||
'pop_category_id_conf_multipl','pop_category_id_country','pop_category_id_country_conf',
|
||||
'pop_category_id_country_conf_multipl','user_doc_ad_sim_categories',
|
||||
'user_doc_ad_sim_categories_conf','user_doc_ad_sim_categories_conf_multipl',
|
||||
'user_doc_ad_sim_topics','user_doc_ad_sim_topics_conf','user_doc_ad_sim_topics_conf_multipl',
|
||||
'user_doc_ad_sim_entities','user_doc_ad_sim_entities_conf','user_doc_ad_sim_entities_conf_multipl',
|
||||
'doc_event_doc_ad_sim_categories','doc_event_doc_ad_sim_categories_conf',
|
||||
'doc_event_doc_ad_sim_categories_conf_multipl','doc_event_doc_ad_sim_topics',
|
||||
'doc_event_doc_ad_sim_topics_conf','doc_event_doc_ad_sim_topics_conf_multipl',
|
||||
'doc_event_doc_ad_sim_entities','doc_event_doc_ad_sim_entities_conf',
|
||||
'doc_event_doc_ad_sim_entities_conf_multipl','ad_advertiser','doc_ad_category_id_1',
|
||||
'doc_ad_category_id_2','doc_ad_category_id_3','doc_ad_topic_id_1','doc_ad_topic_id_2',
|
||||
'doc_ad_topic_id_3','doc_ad_entity_id_1','doc_ad_entity_id_2','doc_ad_entity_id_3',
|
||||
'doc_ad_entity_id_4','doc_ad_entity_id_5','doc_ad_entity_id_6','doc_ad_publisher_id',
|
||||
'doc_ad_source_id','doc_event_category_id_1','doc_event_category_id_2','doc_event_category_id_3',
|
||||
'doc_event_topic_id_1','doc_event_topic_id_2','doc_event_topic_id_3','doc_event_entity_id_1',
|
||||
'doc_event_entity_id_2','doc_event_entity_id_3','doc_event_entity_id_4','doc_event_entity_id_5',
|
||||
'doc_event_entity_id_6','doc_event_publisher_id','doc_event_source_id','event_country',
|
||||
'event_country_state','event_geo_location','event_hour','event_platform','traffic_source']
|
||||
|
||||
def to_array(col):
|
||||
def to_array_(v):
|
||||
return v.toArray().tolist()
|
||||
# Important: asNondeterministic requires Spark 2.3 or later
|
||||
# It can be safely removed i.e.
|
||||
# return udf(to_array_, ArrayType(DoubleType()))(col)
|
||||
# but at the cost of decreased performance
|
||||
|
||||
return udf(to_array_, ArrayType(DoubleType())).asNondeterministic()(col)
|
||||
|
||||
|
||||
CONVERT_TO_INT = ['doc_ad_category_id_1',
|
||||
'doc_ad_category_id_2','doc_ad_category_id_3','doc_ad_topic_id_1','doc_ad_topic_id_2',
|
||||
'doc_ad_topic_id_3','doc_ad_entity_id_1','doc_ad_entity_id_2','doc_ad_entity_id_3',
|
||||
'doc_ad_entity_id_4','doc_ad_entity_id_5','doc_ad_entity_id_6',
|
||||
'doc_ad_source_id','doc_event_category_id_1','doc_event_category_id_2','doc_event_category_id_3',
|
||||
'doc_event_topic_id_1','doc_event_topic_id_2','doc_event_topic_id_3','doc_event_entity_id_1',
|
||||
'doc_event_entity_id_2','doc_event_entity_id_3','doc_event_entity_id_4','doc_event_entity_id_5', 'doc_event_entity_id_6']
|
||||
|
||||
|
||||
def format_number(element, name):
|
||||
if name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
|
||||
return element.cast("int")
|
||||
elif name in CONVERT_TO_INT:
|
||||
return element.cast("int")
|
||||
else:
|
||||
return element
|
||||
|
||||
def to_array_with_none(col):
|
||||
def to_array_with_none_(v):
|
||||
tmp= np.full((v.size,), fill_value=None, dtype=np.float64)
|
||||
tmp[v.indices] = v.values
|
||||
return tmp.tolist()
|
||||
# Important: asNondeterministic requires Spark 2.3 or later
|
||||
# It can be safely removed i.e.
|
||||
# return udf(to_array_, ArrayType(DoubleType()))(col)
|
||||
# but at the cost of decreased performance
|
||||
|
||||
return udf(to_array_with_none_, ArrayType(DoubleType())).asNondeterministic()(col)
|
||||
|
||||
@udf
|
||||
def count_value(x):
|
||||
from collections import Counter
|
||||
tmp = Counter(x).most_common(2)
|
||||
if not tmp or np.isnan(tmp[0][0]):
|
||||
return 0
|
||||
return float(tmp[0][0])
|
||||
|
||||
def replace_with_most_frequent(most_value):
|
||||
return udf( lambda x: most_value if not x or np.isnan(x) else x)
|
||||
|
||||
|
||||
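# Illustration (not part of the committed file): the two helpers above support
# most-frequent-value imputation inside Spark, mirroring what the standalone
# pandas-based imputation script removed in this commit used to do. The same idea
# with plain DataFrame operations, on a hypothetical column:
_example_df = spark.createDataFrame([(1.0,), (1.0,), (2.0,), (None,)], ['some_feature'])

# Most frequent non-null value of the column.
_most_frequent = (_example_df.where(F.col('some_feature').isNotNull())
                  .groupBy('some_feature').count()
                  .orderBy(F.desc('count'))
                  .first()['some_feature'])

# Fill missing entries with it.
_imputed_df = _example_df.fillna({'some_feature': _most_frequent})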
train_feature_vectors_integral_csv_rdd_df = train_feature_vectors_exported_df.select(
    'label', 'display_id', 'ad_id', 'document_id', 'document_id_event', 'feature_vector') \
    .withColumn('is_leak', F.lit(-1)).withColumn("featvec", to_array("feature_vector")) \
    .select(['label'] + ['display_id'] + ['ad_id'] + ['document_id'] + ['document_id_event'] + ['is_leak'] +
            [format_number(element, FEAT_CSV_ORDERED_COLUMNS[index]).alias(FEAT_CSV_ORDERED_COLUMNS[index])
             for index, element in enumerate([col("featvec")[i] for i in range(len(feature_vector_labels_integral))])]) \
    .replace(float('nan'), 0)

if args.submission:
    test_validation_feature_vector_gcs_folder_name = 'test_feature_vectors_integral'
else:
    test_validation_feature_vector_gcs_folder_name = 'validation_feature_vectors_integral'

# ## Exporting integral feature vectors
test_validation_feature_vectors_exported_df = spark.read.parquet(
    OUTPUT_BUCKET_FOLDER + test_validation_feature_vector_gcs_folder_name)
test_validation_feature_vectors_exported_df.take(3)

test_validation_feature_vectors_integral_csv_rdd_df = test_validation_feature_vectors_exported_df.select(
    'label', 'display_id', 'ad_id', 'document_id', 'document_id_event',
    'is_leak', 'feature_vector').withColumn("featvec", to_array("feature_vector")) \
    .select(['label'] + ['display_id'] + ['ad_id'] + ['document_id'] + ['document_id_event'] + ['is_leak'] +
            [format_number(element, FEAT_CSV_ORDERED_COLUMNS[index]).alias(FEAT_CSV_ORDERED_COLUMNS[index])
             for index, element in enumerate([col("featvec")[i] for i in range(len(feature_vector_labels_integral))])]) \
    .replace(float('nan'), 0)

def make_spec(output_dir, batch_size=None):
    fixed_shape = [batch_size, 1] if batch_size is not None else []
    spec = {}
    spec[LABEL_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
    spec[DISPLAY_ID_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
    spec[IS_LEAK_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
    spec[DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)

    for name in BOOL_COLUMNS:
        spec[name] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
    for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM + FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
        spec[name] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
    for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
        spec[name + '_binned'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
    for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
        spec[name + '_binned'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
        spec[name + '_log_01scaled'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
    for name in INT_COLUMNS:
        spec[name + '_log_int'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
        spec[name + '_log_01scaled'] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.float32, default_value=None)
    for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
        spec[name] = tf.FixedLenFeature(shape=fixed_shape, dtype=tf.int64, default_value=None)
    for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
        shape = fixed_shape[:-1] + [len(DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category])]
        spec[multi_category] = tf.FixedLenFeature(shape=shape, dtype=tf.int64)

    metadata = dataset_metadata.DatasetMetadata(dataset_schema.from_feature_spec(spec))
    metadata_io.write_metadata(metadata, output_dir)


# write out tfrecords meta
make_spec(LOCAL_DATA_TFRECORDS_DIR + '/transformed_metadata', batch_size=batch_size)

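As a reading aid (editor's sketch, TF 1.x API as used above): every serialized Example written later holds one whole prebatch, so parsing a single record against the spec built by make_spec yields tensors of shape [prebatch_size, 1] for the scalar features. `serialized_record` and `feature_spec` below are hypothetical names standing in for one TFRecord payload and the dict built inside make_spec:

parsed = tf.parse_single_example(serialized_record, features=feature_spec)
label_batch = parsed[LABEL_COLUMN]  # int64 tensor of shape [prebatch_size, 1]
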
def log2_1p(x):
    return np.log1p(x) / np.log(2.0)

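A quick numeric check of the helper (editor's note):

assert abs(log2_1p(3.0) - 2.0) < 1e-9  # log2(1 + 3) == 2
assert log2_1p(0.0) == 0.0             # zero maps to zero, so the scale is anchored
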
# calculate min and max stats for the given dataframes all in one go
def compute_min_max_logs(df):
    print(str(datetime.datetime.now()) + '\tComputing min and max')
    min_logs = {}
    max_logs = {}
    all_dict = {}
    float_expr = []
    for name in trainer.features.FLOAT_COLUMNS_LOG_BIN_TRANSFORM + trainer.features.INT_COLUMNS:
        float_expr.append(F.min(name))
        float_expr.append(F.max(name))
    floatDf = all_df.agg(*float_expr).collect()
    for name in trainer.features.FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
        minAgg = floatDf[0]["min(" + name + ")"]
        maxAgg = floatDf[0]["max(" + name + ")"]
        min_logs[name + '_log_01scaled'] = log2_1p(minAgg * 1000)
        max_logs[name + '_log_01scaled'] = log2_1p(maxAgg * 1000)
    for name in trainer.features.INT_COLUMNS:
        minAgg = floatDf[0]["min(" + name + ")"]
        maxAgg = floatDf[0]["max(" + name + ")"]
        min_logs[name + '_log_01scaled'] = log2_1p(minAgg)
        max_logs[name + '_log_01scaled'] = log2_1p(maxAgg)

    return min_logs, max_logs

all_df = test_validation_feature_vectors_integral_csv_rdd_df.union(train_feature_vectors_integral_csv_rdd_df)
min_logs, max_logs = compute_min_max_logs(all_df)

if args.submission:
    train_output_string = '/sub_train'
    eval_output_string = '/test'
else:
    train_output_string = '/train'
    eval_output_string = '/eval'

path = LOCAL_DATA_TFRECORDS_DIR

def create_tf_example_spark(df, min_logs, max_logs):
    result = {}
    result[LABEL_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[LABEL_COLUMN].to_list()))
    result[DISPLAY_ID_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[DISPLAY_ID_COLUMN].to_list()))
    result[IS_LEAK_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=df[IS_LEAK_COLUMN].to_list()))
    encoded_value = df[DISPLAY_ID_COLUMN].multiply(10).add(df[IS_LEAK_COLUMN].clip(lower=0)).to_list()
    result[DISPLAY_ID_AND_IS_LEAK_ENCODED_COLUMN] = tf.train.Feature(int64_list=tf.train.Int64List(value=encoded_value))
    for name in FLOAT_COLUMNS:
        value = df[name].to_list()
        result[name] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
    for name in FLOAT_COLUMNS_SIMPLE_BIN_TRANSFORM:
        value = df[name].multiply(10).astype('int64').to_list()
        result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
    for name in FLOAT_COLUMNS_LOG_BIN_TRANSFORM:
        value_prelim = df[name].multiply(1000).apply(np.log1p).multiply(1. / np.log(2.0))
        value = value_prelim.astype('int64').to_list()
        result[name + '_binned'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
        nn = name + '_log_01scaled'
        value = value_prelim.add(-min_logs[nn]).multiply(1. / (max_logs[nn] - min_logs[nn])).to_list()
        result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
    for name in INT_COLUMNS:
        value_prelim = df[name].apply(np.log1p).multiply(1. / np.log(2.0))
        value = value_prelim.astype('int64').to_list()
        result[name + '_log_int'] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
        nn = name + '_log_01scaled'
        value = value_prelim.add(-min_logs[nn]).multiply(1. / (max_logs[nn] - min_logs[nn])).to_list()
        result[nn] = tf.train.Feature(float_list=tf.train.FloatList(value=value))
    for name in BOOL_COLUMNS + CATEGORICAL_COLUMNS:
        value = df[name].fillna(0).astype('int64').to_list()
        result[name] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
    for multi_category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS:
        values = []
        for category in DOC_CATEGORICAL_MULTIVALUED_COLUMNS[multi_category]:
            values = values + [df[category].to_numpy()]
        # need to transpose the series so they will be parsed correctly by the FixedLenFeature
        # we can pass in a single series here; they'll be reshaped to [batch_size, num_values]
        # when parsed from the TFRecord
        value = np.stack(values, axis=1).flatten().tolist()
        result[multi_category] = tf.train.Feature(int64_list=tf.train.Int64List(value=value))
    tf_example = tf.train.Example(features=tf.train.Features(feature=result))
    return tf_example

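For reference (editor's note): the encoded column above packs both values into one int64, assuming is_leak is in {-1, 0, 1}, so a consumer can recover them with integer arithmetic:

# encoded = display_id * 10 + max(is_leak, 0)
display_id = encoded_value[0] // 10
is_leak_flag = encoded_value[0] % 10
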
def _transform_to_tfrecords(rdds):
    csv = pd.DataFrame(list(rdds), columns=CSV_ORDERED_COLUMNS)
    num_rows = len(csv.index)
    examples = []
    for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1):  # for each batch
        if start_ind + batch_size - 1 > num_rows:  # if we'd run out of rows
            csv_slice = csv.iloc[start_ind:]
            # drop the remainder
            print("last Example has: ", len(csv_slice))
            examples.append((create_tf_example_spark(csv_slice, min_logs, max_logs), len(csv_slice)))
            return examples
        else:
            csv_slice = csv.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
            examples.append((create_tf_example_spark(csv_slice, min_logs, max_logs), batch_size))
    return examples

from pyspark import TaskContext

max_partition_num = 30


def _transform_to_slices(rdds):
    taskcontext = TaskContext.get()
    partitionid = taskcontext.partitionId()
    csv = pd.DataFrame(list(rdds), columns=CSV_ORDERED_COLUMNS)
    num_rows = len(csv.index)
    print("working with partition: ", partitionid, max_partition_num, num_rows)
    examples = []
    for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1):  # for each batch
        if start_ind + batch_size - 1 > num_rows:  # if we'd run out of rows
            csv_slice = csv.iloc[start_ind:]
            print("last Example has: ", len(csv_slice), partitionid)
            examples.append((csv_slice, len(csv_slice)))
            return examples
        else:
            csv_slice = csv.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
            examples.append((csv_slice, len(csv_slice)))
    return examples

def _transform_to_tfrecords_from_slices(rdds):
    examples = []
    for slice in rdds:
        if len(slice[0]) != batch_size:
            print("slice size is not correct, dropping: ", len(slice[0]))
        else:
            examples.append((bytearray((create_tf_example_spark(slice[0], min_logs, max_logs)).SerializeToString()), None))
    return examples

def _transform_to_tfrecords_from_reslice(rdds):
    examples = []
    all_dataframes = pd.DataFrame([])
    for slice in rdds:
        all_dataframes = all_dataframes.append(slice[0])
    num_rows = len(all_dataframes.index)
    examples = []
    for start_ind in range(0, num_rows, batch_size if batch_size is not None else 1):  # for each batch
        if start_ind + batch_size - 1 > num_rows:  # if we'd run out of rows
            csv_slice = all_dataframes.iloc[start_ind:]
            if TEST_SET_MODE:
                remain_len = batch_size - len(csv_slice)
                (m, n) = divmod(remain_len, len(csv_slice))
                print("remainder: ", len(csv_slice), remain_len, m, n)
                if m:
                    for i in range(m):
                        csv_slice = csv_slice.append(csv_slice)
                csv_slice = csv_slice.append(csv_slice.iloc[:n])
                print("after fill remainder: ", len(csv_slice))
                examples.append((bytearray((create_tf_example_spark(csv_slice, min_logs, max_logs)).SerializeToString()), None))
                return examples
            # drop the remainder
            print("dropping remainder: ", len(csv_slice))
            return examples
        else:
            csv_slice = all_dataframes.iloc[start_ind:start_ind + (batch_size if batch_size is not None else 1)]
            examples.append((bytearray((create_tf_example_spark(csv_slice, min_logs, max_logs)).SerializeToString()), None))
    return examples

TEST_SET_MODE = False
train_features = train_feature_vectors_integral_csv_rdd_df.coalesce(30).rdd.mapPartitions(_transform_to_slices)
cached_train_features = train_features.cache()
cached_train_features.count()
train_full = cached_train_features.filter(lambda x: x[1] == batch_size)
# split out slices where we don't have a full batch so that we can reslice them and only drop a minimal number of rows
train_not_full = cached_train_features.filter(lambda x: x[1] < batch_size)
train_examples_full = train_full.mapPartitions(_transform_to_tfrecords_from_slices)
train_left = train_not_full.coalesce(1).mapPartitions(_transform_to_tfrecords_from_reslice)
all_train = train_examples_full.union(train_left)

TEST_SET_MODE = True
valid_features = test_validation_feature_vectors_integral_csv_rdd_df.coalesce(30).rdd.mapPartitions(_transform_to_slices)
cached_valid_features = valid_features.cache()
cached_valid_features.count()
valid_full = cached_valid_features.filter(lambda x: x[1] == batch_size)
valid_not_full = cached_valid_features.filter(lambda x: x[1] < batch_size)
valid_examples_full = valid_full.mapPartitions(_transform_to_tfrecords_from_slices)
valid_left = valid_not_full.coalesce(1).mapPartitions(_transform_to_tfrecords_from_reslice)
all_valid = valid_examples_full.union(valid_left)

all_train.saveAsNewAPIHadoopFile(LOCAL_DATA_TFRECORDS_DIR + train_output_string,
                                 "org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
                                 keyClass="org.apache.hadoop.io.BytesWritable",
                                 valueClass="org.apache.hadoop.io.NullWritable")

all_valid.saveAsNewAPIHadoopFile(LOCAL_DATA_TFRECORDS_DIR + eval_output_string,
                                 "org.tensorflow.hadoop.io.TFRecordFileOutputFormat",
                                 keyClass="org.apache.hadoop.io.BytesWritable",
                                 valueClass="org.apache.hadoop.io.NullWritable")

spark.stop()
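A minimal read-back sketch (editor's illustration, TF 1.x; the shard name part-r-00000 is an assumption about how saveAsNewAPIHadoopFile names its output) confirming that the files written above are ordinary TFRecord files of prebatched Examples:

count = 0
for record in tf.python_io.tf_record_iterator(LOCAL_DATA_TFRECORDS_DIR + train_output_string + '/part-r-00000'):
    count += 1
print('prebatched examples in first shard:', count)
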
@@ -1,24 +0,0 @@ (whole file removed)
#!/bin/bash

# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

input=$1
output=$2
mkdir -p ${output}
for f in ${input}/*
do
    filename=${f##*/}
    sort -n -k2 -t ',' $f > ${output}/${filename}
done

@@ -17,4 +17,4 @@
 set -x
 set -e

-python -m trainer.task --model_dir . --transformed_metadata_path "/outbrain/tfrecords" --eval_data_pattern "/outbrain/tfrecords/eval_*" --train_data_pattern "/outbrain/tfrecords/train_*" --save_checkpoints_secs 600 --linear_l1_regularization 0.0 --linear_l2_regularization 0.0 --linear_learning_rate 0.2 --deep_l1_regularization 0.0 --deep_l2_regularization 0.0 --deep_learning_rate 1.0 --deep_dropout 0.0 --deep_hidden_units 1024 1024 1024 1024 1024 --prebatch_size 4096 --global_batch_size 131072 --eval_batch_size 32768 --eval_steps 8 --model_type wide_n_deep --gpu --benchmark --amp
+python -m trainer.task --model_dir . --benchmark_warmup_steps 50 --benchmark_steps 200 --gpu --benchmark --amp

@@ -17,4 +17,4 @@
 set -x
 set -e

-mpiexec --allow-run-as-root --bind-to socket -np 4 python -m trainer.task --hvd --model_dir . --transformed_metadata_path "/outbrain/tfrecords" --eval_data_pattern "/outbrain/tfrecords/eval_*" --train_data_pattern "/outbrain/tfrecords/train_*" --save_checkpoints_secs 600 --linear_l1_regularization 0.0 --linear_l2_regularization 0.0 --linear_learning_rate 0.2 --deep_l1_regularization 0.0 --deep_l2_regularization 0.0 --deep_learning_rate 1.0 --deep_dropout 0.0 --deep_hidden_units 1024 1024 1024 1024 1024 --prebatch_size 4096 --global_batch_size 131072 --eval_batch_size 32768 --eval_steps 8 --model_type wide_n_deep --gpu --benchmark --amp
+mpiexec --allow-run-as-root --bind-to socket -np 4 python -m trainer.task --hvd --model_dir . --benchmark_warmup_steps 50 --benchmark_steps 200 --gpu --benchmark --amp

@@ -17,4 +17,4 @@
 set -x
 set -e

-mpiexec --allow-run-as-root --bind-to socket -np 8 python -m trainer.task --hvd --model_dir . --transformed_metadata_path "/outbrain/tfrecords" --eval_data_pattern "/outbrain/tfrecords/eval_*" --train_data_pattern "/outbrain/tfrecords/train_*" --save_checkpoints_secs 600 --linear_l1_regularization 0.0 --linear_l2_regularization 0.0 --linear_learning_rate 0.2 --deep_l1_regularization 0.0 --deep_l2_regularization 0.0 --deep_learning_rate 1.0 --deep_dropout 0.0 --deep_hidden_units 1024 1024 1024 1024 1024 --prebatch_size 4096 --global_batch_size 131072 --eval_batch_size 32768 --eval_steps 8 --model_type wide_n_deep --gpu --benchmark --amp
+mpiexec --allow-run-as-root --bind-to socket -np 8 python -m trainer.task --hvd --model_dir . --benchmark_warmup_steps 50 --benchmark_steps 200 --gpu --benchmark --amp

@@ -17,4 +17,4 @@
 set -x
 set -e

-python -m trainer.task --model_dir . --transformed_metadata_path "/outbrain/tfrecords" --eval_data_pattern "/outbrain/tfrecords/eval_*" --train_data_pattern "/outbrain/tfrecords/train_*" --save_checkpoints_secs 600 --linear_l1_regularization 0.0 --linear_l2_regularization 0.0 --linear_learning_rate 0.2 --deep_l1_regularization 0.0 --deep_l2_regularization 0.0 --deep_learning_rate 1.0 --deep_dropout 0.0 --deep_hidden_units 1024 1024 1024 1024 1024 --prebatch_size 4096 --global_batch_size 131072 --eval_batch_size 32768 --eval_steps 8 --model_type wide_n_deep --gpu --benchmark
+python -m trainer.task --model_dir . --benchmark_warmup_steps 50 --benchmark_steps 200 --gpu --benchmark

@@ -17,4 +17,4 @@
 set -x
 set -e

-mpiexec --allow-run-as-root --bind-to socket -np 4 python -m trainer.task --hvd --model_dir . --transformed_metadata_path "/outbrain/tfrecords" --eval_data_pattern "/outbrain/tfrecords/eval_*" --train_data_pattern "/outbrain/tfrecords/train_*" --save_checkpoints_secs 600 --linear_l1_regularization 0.0 --linear_l2_regularization 0.0 --linear_learning_rate 0.2 --deep_l1_regularization 0.0 --deep_l2_regularization 0.0 --deep_learning_rate 1.0 --deep_dropout 0.0 --deep_hidden_units 1024 1024 1024 1024 1024 --prebatch_size 4096 --global_batch_size 131072 --eval_batch_size 32768 --eval_steps 8 --model_type wide_n_deep --gpu --benchmark
+mpiexec --allow-run-as-root --bind-to socket -np 4 python -m trainer.task --hvd --model_dir . --benchmark_warmup_steps 50 --benchmark_steps 200 --gpu --benchmark

@@ -17,4 +17,4 @@
 set -x
 set -e

-mpiexec --allow-run-as-root --bind-to socket -np 8 python -m trainer.task --hvd --model_dir . --transformed_metadata_path "/outbrain/tfrecords" --eval_data_pattern "/outbrain/tfrecords/eval_*" --train_data_pattern "/outbrain/tfrecords/train_*" --save_checkpoints_secs 600 --linear_l1_regularization 0.0 --linear_l2_regularization 0.0 --linear_learning_rate 0.2 --deep_l1_regularization 0.0 --deep_l2_regularization 0.0 --deep_learning_rate 1.0 --deep_dropout 0.0 --deep_hidden_units 1024 1024 1024 1024 1024 --prebatch_size 4096 --global_batch_size 131072 --eval_batch_size 32768 --eval_steps 8 --model_type wide_n_deep --gpu --benchmark
+mpiexec --allow-run-as-root --bind-to socket -np 8 python -m trainer.task --hvd --model_dir . --benchmark_warmup_steps 50 --benchmark_steps 200 --gpu --benchmark

@@ -21,34 +21,15 @@ else
 PREBATCH_SIZE=4096
 fi

-python preproc/preproc1.py
-python preproc/preproc2.py
-python preproc/preproc3.py
-
-export CUDA_VISIBLE_DEVICES=
-LOCAL_DATA_DIR=/outbrain/preprocessed
-LOCAL_DATA_TFRECORDS_DIR=/outbrain/tfrecords
-
-TRAIN_DIR=train_feature_vectors_integral_eval.csv
-VALID_DIR=validation_feature_vectors_integral.csv
-TRAIN_IMPUTED_DIR=train_feature_vectors_integral_eval_imputed.csv
-VALID_IMPUTED_DIR=validation_feature_vectors_integral_imputed.csv
-HEADER_PATH=train_feature_vectors_integral_eval.csv.header
-
-cd ${LOCAL_DATA_DIR}
-python /wd/preproc/csv_data_imputation.py --num_workers 40 \
-  --train_files_pattern 'train_feature_vectors_integral_eval.csv/part-*' \
-  --valid_files_pattern 'validation_feature_vectors_integral.csv/part-*' \
-  --train_dst_dir ${TRAIN_IMPUTED_DIR} \
-  --valid_dst_dir ${VALID_IMPUTED_DIR} \
-  --header_path ${HEADER_PATH}
-cd -
-
-time preproc/sort_csv.sh ${LOCAL_DATA_DIR}/${VALID_IMPUTED_DIR} ${LOCAL_DATA_DIR}/${VALID_IMPUTED_DIR}_sorted
-
-python dataflow_preprocess.py \
-  --eval_data "${LOCAL_DATA_DIR}/${VALID_IMPUTED_DIR}_sorted/part-*" \
-  --training_data "${LOCAL_DATA_DIR}/${TRAIN_IMPUTED_DIR}/part-*" \
-  --output_dir ${LOCAL_DATA_TFRECORDS_DIR} \
-  --batch_size ${PREBATCH_SIZE}
-
+echo "Starting preprocessing 1/4..."
+time python -m preproc.preproc1
+echo "Preprocessing 1/4 done.\n"
+echo "Starting preprocessing 2/4..."
+time python -m preproc.preproc2
+echo "Preprocessing 2/4 done.\n"
+echo "Starting preprocessing 3/4..."
+time python -m preproc.preproc3
+echo "Preprocessing 3/4 done.\n"
+echo "Starting preprocessing 4/4..."
+time python -m preproc.preproc4 --prebatch_size ${PREBATCH_SIZE}
+echo "Preprocessing 4/4 done.\n"

TensorFlow/Recommendation/WideAndDeep/trainer/dataset_utils.py (new file, 100 lines)
@@ -0,0 +1,100 @@
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================

import tensorflow as tf
from tensorflow.compat.v1 import logging


def separate_input_fn(
        tf_transform_output,
        transformed_examples,
        create_batches,
        mode,
        reader_num_threads=1,
        parser_num_threads=2,
        shuffle_buffer_size=10,
        prefetch_buffer_size=1,
        print_display_ids=False):
    """
    A version of the training + eval input function that uses dataset operations.
    (For more straightforward tweaking.)
    """

    logging.warn('Shuffle buffer size: {}'.format(shuffle_buffer_size))

    filenames_dataset = tf.data.Dataset.list_files(
        transformed_examples,
        shuffle=False
    )

    raw_dataset = tf.data.TFRecordDataset(
        filenames_dataset,
        num_parallel_reads=reader_num_threads
    )

    if mode == tf.estimator.ModeKeys.TRAIN and shuffle_buffer_size > 1:
        raw_dataset = raw_dataset.shuffle(shuffle_buffer_size)

    raw_dataset = raw_dataset.repeat()
    raw_dataset = raw_dataset.batch(create_batches)

    # this function appears to require each element to be a vector
    # batching should mean that this is always true
    # one possible alternative for any problematic case is tf.io.parse_single_example
    parsed_dataset = raw_dataset.apply(
        tf.data.experimental.parse_example_dataset(
            tf_transform_output.transformed_feature_spec(),
            num_parallel_calls=parser_num_threads
        )
    )

    # a function mapped over each dataset element
    # will separate label, ensure that elements are two-dimensional (batch size, elements per record)
    # adds print_display_ids injection
    def consolidate_batch(elem):
        label = elem.pop('label')
        reshaped_label = tf.reshape(label, [-1, label.shape[-1]])
        reshaped_elem = {
            key: tf.reshape(elem[key], [-1, elem[key].shape[-1]])
            for key in elem
        }

        if print_display_ids:
            elem['ad_id'] = tf.Print(input_=elem['ad_id'],
                                     data=[tf.reshape(elem['display_id'], [-1])],
                                     message='display_id', name='print_display_ids',
                                     summarize=FLAGS.eval_batch_size)
            elem['ad_id'] = tf.Print(input_=elem['ad_id'],
                                     data=[tf.reshape(elem['ad_id'], [-1])],
                                     message='ad_id', name='print_ad_ids',
                                     summarize=FLAGS.eval_batch_size)
            elem['ad_id'] = tf.Print(input_=elem['ad_id'],
                                     data=[tf.reshape(elem['is_leak'], [-1])],
                                     message='is_leak', name='print_is_leak',
                                     summarize=FLAGS.eval_batch_size)

        return reshaped_elem, reshaped_label

    if mode == tf.estimator.ModeKeys.EVAL:
        parsed_dataset = parsed_dataset.map(
            consolidate_batch,
            num_parallel_calls=None
        )
    else:
        parsed_dataset = parsed_dataset.map(
            consolidate_batch,
            num_parallel_calls=parser_num_threads
        )
    parsed_dataset = parsed_dataset.prefetch(prefetch_buffer_size)

    return parsed_dataset

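A usage sketch for the new module (editor's illustration; it mirrors how trainer/task.py wires it up later in this commit, with hypothetical values standing in for the parsed flags):

import tensorflow as tf
import tensorflow_transform as tft

from trainer import dataset_utils

# The directory holding the transformed_metadata written during preprocessing.
tf_transform_output = tft.TFTransformOutput('/outbrain/tfrecords')
train_input_fn = lambda: dataset_utils.separate_input_fn(
    tf_transform_output,
    '/outbrain/tfrecords/train/part*',
    create_batches=131072 // 4096,      # global batch size / prebatch size (hypothetical)
    mode=tf.estimator.ModeKeys.TRAIN,
    shuffle_buffer_size=8)
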
@@ -24,6 +24,7 @@ import tensorflow as tf
 import tensorflow_transform as tft

 from trainer import features
+from trainer import dataset_utils
 from utils.hooks.benchmark_hooks import BenchmarkLoggingHook

 import horovod.tensorflow as hvd

@@ -59,7 +60,7 @@ def create_parser():
     help='Pattern of training file names. For example if training files are train_000.tfrecord, \
         train_001.tfrecord then --train_data_pattern is train_*',
     type=str,
-    default='/outbrain/tfrecords/train_*',
+    default='/outbrain/tfrecords/train/part*',
     nargs='+'
 )
 parser.add_argument(
@@ -67,7 +68,7 @@ def create_parser():
     help='Pattern of eval file names. For example if eval files are eval_000.tfrecord, \
         eval_001.tfrecord then --eval_data_pattern is eval_*',
     type=str,
-    default='/outbrain/tfrecords/eval_*',
+    default='/outbrain/tfrecords/eval/part*',
     nargs='+'
 )
 parser.add_argument(

@@ -331,70 +332,6 @@ def get_feature_columns(use_all_columns=False, force_subset=None):
     tf.compat.v1.logging.warn('wide&deep intersection: {}'.format(len(set(wide_columns).intersection(set(deep_columns)))))
     return wide_columns, deep_columns

-def separate_input_fn(
-        tf_transform_output,
-        transformed_examples,
-        create_batches,
-        mode,
-        reader_num_threads=1,
-        parser_num_threads=2,
-        shuffle_buffer_size=10,
-        prefetch_buffer_size=1,
-        print_display_ids=False):
-    """
-    A version of the training + eval input function that uses dataset operations.
-    (For more straightforward tweaking.)
-    """
-
-    tf.compat.v1.logging.warn('Shuffle buffer size: {}'.format(shuffle_buffer_size))
-
-    filenames_dataset = tf.data.Dataset.list_files(transformed_examples, shuffle=False)
-
-    raw_dataset = tf.data.TFRecordDataset(filenames_dataset,
-                                          num_parallel_reads=reader_num_threads)
-
-    raw_dataset = raw_dataset.shuffle(shuffle_buffer_size) \
-        if (mode == tf.estimator.ModeKeys.TRAIN and shuffle_buffer_size > 1) \
-        else raw_dataset
-    raw_dataset = raw_dataset.repeat()
-    raw_dataset = raw_dataset.batch(create_batches)
-
-    # this function appears to require each element to be a vector
-    # batching should mean that this is always true
-    # one possible alternative for any problematic case is tf.io.parse_single_example
-    parsed_dataset = raw_dataset.apply(tf.data.experimental.parse_example_dataset(
-        tf_transform_output.transformed_feature_spec(),
-        num_parallel_calls=parser_num_threads))
-
-    # a function mapped over each dataset element
-    # will separate label, ensure that elements are two-dimensional (batch size, elements per record)
-    # adds print_display_ids injection
-    def consolidate_batch(elem):
-        label = elem.pop('label')
-        reshaped_label = tf.reshape(label, [-1, label.shape[-1]])
-        reshaped_elem = {key: tf.reshape(elem[key], [-1, elem[key].shape[-1]]) for key in elem}
-        if print_display_ids:
-            elem['ad_id'] = tf.Print(input_=elem['ad_id'],
-                                     data=[tf.reshape(elem['display_id'], [-1])],
-                                     message='display_id', name='print_display_ids', summarize=FLAGS.eval_batch_size)
-            elem['ad_id'] = tf.Print(input_=elem['ad_id'],
-                                     data=[tf.reshape(elem['ad_id'], [-1])],
-                                     message='ad_id', name='print_ad_ids', summarize=FLAGS.eval_batch_size)
-            elem['ad_id'] = tf.Print(input_=elem['ad_id'],
-                                     data=[tf.reshape(elem['is_leak'], [-1])],
-                                     message='is_leak', name='print_is_leak', summarize=FLAGS.eval_batch_size)
-
-        return reshaped_elem, reshaped_label
-
-    if mode == tf.estimator.ModeKeys.EVAL:
-        parsed_dataset = parsed_dataset.map(consolidate_batch, num_parallel_calls=None)
-    else:
-        parsed_dataset = parsed_dataset.map(consolidate_batch,
-                                            num_parallel_calls=parser_num_threads)
-    parsed_dataset = parsed_dataset.prefetch(prefetch_buffer_size)
-
-    return parsed_dataset
-
 # rough approximation for MAP metric for measuring ad quality
 # roughness comes from batch sizes falling between groups of
 # display ids

@@ -694,11 +631,8 @@ def main(FLAGS):
         wide_optimizer = hvd.DistributedOptimizer(wide_optimizer)
         deep_optimizer = hvd.DistributedOptimizer(deep_optimizer)

-    stats_filename = os.path.join(FLAGS.transformed_metadata_path, 'stats.json')
-    embed_columns = None
-
     # input functions to read data from disk
-    train_input_fn = lambda : separate_input_fn(
+    train_input_fn = lambda : dataset_utils.separate_input_fn(
         tf_transform_output,
         FLAGS.train_data_pattern,
         create_batches,
@@ -708,7 +642,7 @@ def main(FLAGS):
         shuffle_buffer_size=int(FLAGS.shuffle_percentage*create_batches),
         prefetch_buffer_size=FLAGS.prefetch_buffer_size,
         print_display_ids=FLAGS.print_display_ids)
-    eval_input_fn = lambda : separate_input_fn(
+    eval_input_fn = lambda : dataset_utils.separate_input_fn(
         tf_transform_output,
         FLAGS.eval_data_pattern,
         (FLAGS.eval_batch_size // FLAGS.prebatch_size),