Training content extraction models
- Download the training data (see above). In what follows, ROOTDIR contains the root of the dragnet_data repo, or another directory with similar structure (
- Create the block-corrected files needed for supervised learning at the block level. First make a sub-directory $ROOTDIR/block_corrected/ for the output files, then run:

      from dragnet.data_processing import extract_all_gold_standard_data

      rootdir = '/path/to/dragnet_data/'
      extract_all_gold_standard_data(rootdir)

  This solves the longest common subsequence problem to determine which blocks were extracted in the gold standard. Occasionally this fails because lxml (libxml2) cannot parse an HTML document. In that case, remove the offending document and restart the process.
- Use k-fold cross validation in the training set to do model selection and set any hyperparameters. Make decisions about the following:
  - Whether to use just article content or content and comments.
  - The features to use.
  - The machine learning model to use.
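As a rough illustration of the model-selection step, the sketch below implements plain k-fold cross validation in pure Python. The names `k_fold_indices`, `cross_validate`, and the `train_and_score` callback are hypothetical and are not part of dragnet; in practice you would plug in whichever training and scoring routine you are comparing.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross validation."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most one.
    folds = [indices[i::k] for i in range(k)]
    for i, valid_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, valid_idx


def cross_validate(data, k, train_and_score, seed=0):
    """Average the score returned by train_and_score over k folds.

    train_and_score(train_items, valid_items) is a hypothetical callback that
    fits a candidate model on train_items and returns its score on valid_items.
    """
    scores = []
    for train_idx, valid_idx in k_fold_indices(len(data), k, seed):
        train_items = [data[i] for i in train_idx]
        valid_items = [data[i] for i in valid_idx]
        scores.append(train_and_score(train_items, valid_items))
    return sum(scores) / len(scores)
```

Running `cross_validate` once per candidate (feature set, model, or hyperparameter value) and comparing the averaged scores keeps the held-out test set untouched until the final evaluation.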
When an output_dir is given, training writes a pickled version of the model, along with some block-level classification errors, to a file in the specified output_dir. If no output_dir is specified, the block-level performance is printed to stdout.
- Once you have decided on a final model, train it on the entire training data using
- As a last step, test the performance of the model on the test set (see below).
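The block-matching step in the list above relies on a longest common subsequence (LCS) computation. The sketch below shows a textbook dynamic-programming LCS over token lists and uses it to flag which tokens of a document also appear, in order, in the gold standard. It is only an illustration: `lcs_mask`, the sample strings, and the whitespace tokenization are assumptions, not dragnet's actual implementation.

```python
def lcs_mask(doc_tokens, gold_tokens):
    """Mark the doc tokens that belong to one LCS of doc_tokens and gold_tokens."""
    n, m = len(doc_tokens), len(gold_tokens)
    # table[i][j] holds the LCS length of doc_tokens[i:] and gold_tokens[j:].
    table = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if doc_tokens[i] == gold_tokens[j]:
                table[i][j] = table[i + 1][j + 1] + 1
            else:
                table[i][j] = max(table[i + 1][j], table[i][j + 1])

    # Walk the table from the start to recover which doc tokens were matched.
    mask = [False] * n
    i = j = 0
    while i < n and j < m:
        if doc_tokens[i] == gold_tokens[j]:
            mask[i] = True
            i += 1
            j += 1
        elif table[i + 1][j] >= table[i][j + 1]:
            i += 1
        else:
            j += 1
    return mask


# Hypothetical page text with navigation chrome around the article body.
doc = "Home About The cat sat on the mat Share".split()
gold = "The cat sat on the mat".split()
mask = lcs_mask(doc, gold)
extracted = [tok for tok, keep in zip(doc, mask) if keep]
```

A block could then be labelled as content when a large enough share of its tokens is marked in the mask; the threshold for that decision is a further assumption left out of this sketch.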
Use evaluate_model_predictions in dragnet.model_training to compute the token-level accuracy, precision, recall, and F1. For example, to evaluate a trained model, run:

    from dragnet.compat import train_test_split
    from dragnet.data_processing import prepare_all_data
    from dragnet.model_training import evaluate_model_predictions

    # NOTE: `extractor` is assumed to be an already-constructed dragnet
    # extractor; it is not defined in this snippet.

    rootdir = '/path/to/dragnet_data/'
    data = prepare_all_data(rootdir)
    training_data, test_data = train_test_split(data, test_size=0.2, random_state=42)

    # Convert each split into raw HTML, block labels, and block weights.
    test_html, test_labels, test_weights = extractor.get_html_labels_weights(test_data)
    train_html, train_labels, train_weights = extractor.get_html_labels_weights(training_data)

    # Fit on the training split, then score predictions on the test split.
    extractor.fit(train_html, train_labels, weights=train_weights)
    predictions = extractor.predict(test_html)
    scores = evaluate_model_predictions(test_labels, predictions, test_weights)
Note that this is the same evaluation that is run/printed in
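To make the token-level metrics concrete, the sketch below computes weighted accuracy, precision, recall, and F1 from binary block labels, assuming each block's weight is its token count. The function `token_level_scores` is hypothetical and is not dragnet's `evaluate_model_predictions`; it only illustrates how block-level predictions can be weighted up to token-level scores.

```python
def token_level_scores(labels, predictions, weights):
    """Compute weighted accuracy, precision, recall, and F1 for binary labels.

    labels and predictions hold 0/1 values per block; weights holds the
    (assumed) number of tokens in each block.
    """
    tp = fp = fn = tn = 0.0
    for label, pred, weight in zip(labels, predictions, weights):
        if label == 1 and pred == 1:
            tp += weight
        elif label == 0 and pred == 1:
            fp += weight
        elif label == 1 and pred == 0:
            fn += weight
        else:
            tn += weight

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}
```

Weighting by token count means a misclassified long block costs more than a misclassified short one, which is the sense in which block predictions yield token-level scores here.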