Training content extraction models
- Download the training data (see above). In what follows, ROOTDIR contains the root of the dragnet_data repo, or another directory with a similar structure (HTML and Corrected sub-directories).
- Create the block corrected files needed to do supervised learning on the block level. First make a sub-directory $ROOTDIR/block_corrected/ for the output files, then run:

      from dragnet.data_processing import extract_all_gold_standard_data

      rootdir = '/path/to/dragnet_data/'
      extract_all_gold_standard_data(rootdir)

  This solves the longest common sub-sequence problem to determine which blocks were extracted in the gold standard. Occasionally this will fail if lxml (libxml2) cannot parse an HTML document. In this case, remove the offending document and restart the process.
- Use k-fold cross validation in the training set to do model selection and set any hyperparameters. Make decisions about the following:
  - Whether to use just article content or content and comments.
  - The features to use.
  - The machine learning model to use.
  If an output_dir is specified, training writes a pickled version of the model, along with some block-level classification errors, to a file in the specified output_dir. If no output_dir is specified, the block-level performance is printed to stdout.
- Once you have decided on a final model, train it on the entire training data using dragnet.model_training.train_model.
- As a last step, test the performance of the model on the test set (see below).
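
The gold standard step above relies on the longest common sub-sequence (LCS) problem. The following is a minimal, standard-library sketch of that idea, not dragnet's actual implementation: the function names (lcs_mask, label_blocks), the whitespace-style token lists, and the 0.5 threshold are all assumptions made for illustration. It aligns the page's tokens against the gold standard tokens and marks a block as content when most of its tokens fall on the LCS.

```python
from typing import List, Sequence


def lcs_mask(page_tokens: Sequence[str], gold_tokens: Sequence[str]) -> List[bool]:
    """Mark which page tokens lie on one longest common sub-sequence."""
    n, m = len(page_tokens), len(gold_tokens)
    # table[i][j] holds the LCS length of page_tokens[i:] and gold_tokens[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if page_tokens[i] == gold_tokens[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk forward through the table to recover one optimal alignment.
    mask = [False] * n
    i = j = 0
    while i < n and j < m:
        if page_tokens[i] == gold_tokens[j]:
            mask[i] = True
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return mask


def label_blocks(blocks: List[List[str]], gold_tokens: Sequence[str],
                 threshold: float = 0.5) -> List[int]:
    """Label a block 1 if more than `threshold` of its tokens are in the gold standard.

    The threshold value is a hypothetical choice for this sketch.
    """
    flat = [token for block in blocks for token in block]
    mask = lcs_mask(flat, gold_tokens)
    labels = []
    start = 0
    for block in blocks:
        end = start + len(block)
        matched = sum(mask[start:end])
        fraction = matched / len(block) if block else 0.0
        labels.append(1 if fraction > threshold else 0)
        start = end
    return labels


# Toy page: navigation, article body, footer.
blocks = [
    ["Home", "About", "Contact"],
    ["Dragnet", "extracts", "article", "text"],
    ["Copyright", "2020"],
]
gold = ["Dragnet", "extracts", "article", "text"]
print(label_blocks(blocks, gold))  # [0, 1, 0]
```

The dynamic programming table costs O(n * m) time and memory for n page tokens and m gold standard tokens, which is one reason this step can be slow on long documents.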
Evaluating content extraction models
Use evaluate_model_predictions in model_training to compute the token-level accuracy, precision, recall, and F1. For example, to evaluate a trained model, run:

    from dragnet.compat import train_test_split
    from dragnet.data_processing import prepare_all_data
    from dragnet.model_training import evaluate_model_predictions

    # `extractor` is assumed to be an extractor instance that was configured beforehand.
    rootdir = '/path/to/dragnet_data/'
    data = prepare_all_data(rootdir)
    training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

    # Split each data set into HTML documents, block labels, and block weights.
    test_html, test_labels, test_weights = extractor.get_html_labels_weights(test_data)
    train_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)

    # Fit on the training split, then score predictions on the held-out split.
    extractor.fit(train_html, train_labels, weights=train_weights)
    predictions = extractor.predict(test_html)
    scores = evaluate_model_predictions(test_labels, predictions, test_weights)

Note that this is the same evaluation that is run and printed in train_model.
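
To illustrate how block-level predictions can produce token-level scores, here is a small standard-library sketch. It assumes each block's weight is its token count, so a mistake on a long block costs more than a mistake on a short one. The function name weighted_binary_scores and the toy numbers are hypothetical, and this is not necessarily how evaluate_model_predictions is implemented internally.

```python
from typing import Dict, Sequence


def weighted_binary_scores(y_true: Sequence[int], y_pred: Sequence[int],
                           weights: Sequence[float]) -> Dict[str, float]:
    """Compute weighted accuracy, precision, recall, and F1 for binary labels."""
    tp = fp = fn = tn = 0.0
    for truth, pred, weight in zip(y_true, y_pred, weights):
        if truth == 1 and pred == 1:
            tp += weight  # content tokens correctly kept
        elif truth == 0 and pred == 1:
            fp += weight  # boilerplate tokens wrongly kept
        elif truth == 1 and pred == 0:
            fn += weight  # content tokens wrongly dropped
        else:
            tn += weight  # boilerplate tokens correctly dropped

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


# Four blocks with assumed token counts of 30, 10, 50, and 10.
labels = [1, 1, 0, 0]
preds = [1, 0, 0, 1]
token_counts = [30, 10, 50, 10]
scores = weighted_binary_scores(labels, preds, token_counts)
print(scores)
```

In this toy case two of the four blocks are misclassified, but they hold only 20 of the 100 tokens, so the weighted accuracy is 0.8 rather than the unweighted 0.5.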