Upload Data to Document Classification

You will learn:
- How to install the client library for Document Classification
- How to generate an authentication token
- How to create a dataset and upload data to your Document Classification service instance
To try out Document Classification, the first step is to upload data that will be used to train a machine learning model. For more information, see Document Classification.
Step 1
To get started, make sure your local JupyterLab instance is running. In the JupyterLab interface, use the navigation pane on the left to navigate into the folder that contains the notebook train_and_evaluate_custom_model.ipynb. Open the notebook by double-clicking it; its content now appears on the right. If you are not familiar with this, review the previous tutorial: Set Up Jupyter Notebook and Client Library for Document Classification.

The first step is to install the client library and to clone the repository so that you have the example dataset ready to use.
To do so, click the first cell, indicated by its grey background. Once you have clicked the cell, a blue bar appears next to it, indicating your current position in the notebook. Run the cell by clicking Run at the top, or use the shortcut Shift + Enter.
Step 2
First, you will use the service keys that you created in Create Service Instance for Document Classification with Customer Account.
Scroll down in your notebook and click the cell that is shown in the image below. Paste your service key into the corresponding area, and pay attention to the additional comments given in the notebook.
Once you have filled in the service key, click Run.
Next, click in the cell indicated in the image below. Here, enter a name for your machine learning model, for example tutorial-language-model. Then, click Run.

Now these variables are set and are used throughout the notebook.
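For reference, the two cells roughly look like the following sketch. The service key structure shown is the typical SAP BTP service key layout; your key may contain additional fields, and the placeholder values must be replaced with your own.

```python
import json

# Paste your service key between the triple quotes (placeholders shown;
# the structure is the typical SAP BTP service key layout).
service_key = json.loads("""
{
  "url": "<service-instance-url>",
  "uaa": {
    "clientid": "<client-id>",
    "clientsecret": "<client-secret>",
    "url": "<uaa-server-url>"
  }
}
""")

# The name under which your machine learning model will be trained later on.
model_name = "tutorial-language-model"
```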
Step 3
In this step, you initialize the client library, which automatically creates an access token used for communication with your service instance.
Scroll down in your notebook and click the next cell below the heading Initialize Demo. Click Run to import the client library and to create an instance of it.
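A minimal sketch of such a cell is shown below. The module and constructor arguments are assumptions based on the SAP client library; follow the exact code in your notebook.

```python
# Import the client library and create an instance of it. Instantiating the
# client fetches an OAuth access token using the service key credentials.
from sap_document_classification_client import DCApiClient  # module name: assumption

my_dc_client = DCApiClient(
    service_key["url"],                  # service instance URL
    service_key["uaa"]["clientid"],      # OAuth client id
    service_key["uaa"]["clientsecret"],  # OAuth client secret
    service_key["uaa"]["url"],           # UAA server URL used to fetch the token
)
```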
Step 4

Now you have to create a dataset. A dataset holds the actual data that is used to train a machine learning model later on.
For that, scroll down in your notebook and click the first cell below the heading Create Dataset for training of a new model. The code in this cell creates a new dataset and prints out the datasetId below.
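The cell is, in essence, a sketch like this (method and response field names are assumptions; see the notebook for the exact call):

```python
# Create an empty dataset; the response contains its generated id
# (method and field names are assumptions).
response = my_dc_client.create_dataset()
dataset_id = response["datasetId"]
print(f"Dataset created with datasetId: {dataset_id}")
```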
You can now upload data into your newly created dataset.
Step 5
After you have created the dataset, the cursor automatically moves to the next cell. This cell contains code that uploads all the documents of the example dataset to the service.
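Conceptually, the cell performs an upload along these lines; the method name and arguments are assumptions, so rely on the notebook's exact code:

```python
# Upload all documents and their ground-truth JSON files from the example
# dataset folder to the newly created dataset (method name: assumption).
my_dc_client.upload_data_to_dataset(dataset_id, "data/training_data")
```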
At this point, it makes sense to take a look at the example dataset. In the pane on the left, navigate into the folder data > training_data, which contains all documents.

In the folder, you find multiple documents containing texts in different languages. The service can also process file formats other than PDF; review the documentation for further information.

Additionally, a corresponding JSON file with the same file name exists for every document. These files are only necessary in the training dataset, as they contain the classification of the training documents from which the service then trains the machine learning model.
Go ahead and open any of the JSON files by double-clicking it. The structure of the file opens on the right. Expand the structure using the arrows to see all characteristics and the corresponding values assigned to the training document.
In this tutorial, you train a model to identify the languages of documents. Thus, each JSON file contains the characteristic Language and a value. In this dataset, the possible values are English, German, Both (if a document contains both English and German text), and Other in case none of the other values applies.
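To inspect a ground-truth file programmatically, you can load it with Python, as in the following sketch. The file name is a placeholder and the field names in the comment are an assumption; open a real file in JupyterLab to see the exact structure.

```python
import json

# Load one ground-truth file; replace the file name with any JSON file
# from the data/training_data folder.
with open("data/training_data/example_document.json") as f:
    ground_truth = json.load(f)
print(json.dumps(ground_truth, indent=2))
# Illustrative shape (field names are an assumption):
# {"classification": [{"characteristic": "Language", "value": "English"}]}
```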
Now that you understand how a training dataset is structured, go ahead and upload the data. Close any other tabs and go back to your notebook. Make sure you are in the right cell and click Run.
The training documents are now uploaded. This may take a few minutes. Once done, a message is printed below which notifies you about the successful upload.
The service splits the uploaded data internally, so that 80% of the documents are later used as training documents. The remaining 20% of the documents are used for testing. Because the split preserves the distribution of the characteristic values across both sets, this process is called stratification.
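To illustrate what a stratified split means, here is a small, purely illustrative sketch using scikit-learn. It is not part of the tutorial notebook, and the per-language counts are made up so that they sum to the 202 documents of the example dataset.

```python
from sklearn.model_selection import train_test_split

# Made-up, roughly equal label counts summing to 202 documents.
labels = ["English"] * 51 + ["German"] * 51 + ["Both"] * 50 + ["Other"] * 50
docs = [f"doc_{i}.pdf" for i in range(len(labels))]

# stratify=labels keeps the share of each language roughly the same
# in both the training portion and the test portion.
train_docs, test_docs = train_test_split(docs, test_size=0.2, stratify=labels)
print(len(train_docs), len(test_docs))  # 161 41, roughly an 80/20 split
```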
Step 6
Instead of opening every document in your dataset, you can easily access the statistics and distribution of the dataset.
The cursor should have automatically moved to the next cell. Click Run to print out the dataset statistics.
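The cell boils down to a call like the following (the method name is an assumption based on the SAP client library; follow the notebook):

```python
# Retrieve and print the dataset statistics (method name: assumption).
dataset_info = my_dc_client.get_dataset_info(dataset_id)
print(dataset_info)
```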
Most importantly, the statistics show the upload status, for example, how many uploads have succeeded, failed, or are still running, as well as the ground truths. The latter include all characteristics, their respective values, and their distribution across the dataset.
As mentioned earlier, the dataset's only characteristic is language, with the four possible values. The values are almost equally distributed among the 202 documents in the dataset.

Last but not least, these statistics can also be shown graphically. The cursor should have moved automatically again. Click Run to create a bar chart of the distribution of the values in the dataset. You may need to scroll to see the whole chart.
The chart shows the possible languages on the x-axis; the y-axis gives the number of documents per language in the dataset.
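A chart like this can be produced with matplotlib, as in the following sketch; the counts are illustrative and would come from the dataset statistics in the notebook.

```python
import matplotlib.pyplot as plt

# Illustrative counts; in the notebook they come from the dataset statistics.
counts = {"English": 51, "German": 51, "Both": 50, "Other": 50}

plt.bar(list(counts), list(counts.values()))
plt.xlabel("Language")
plt.ylabel("Number of documents")
plt.title("Distribution of values in the dataset")
plt.show()
```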
By now you have successfully created a dataset, inspected the training data and uploaded it to the service.
Step 7
Select the characteristic and its possible values used in the training data in this tutorial: