One more way to ingest data in CluedIn

CluedIn Contrib Submitter is an unofficial implementation of a clue submitter for CluedIn.

It has pros and cons compared to the standard CluedIn data ingestion methods. A more detailed comparison follows the description of how the Submitter works.

How it works

In short, the Submitter is a pod (deployment) that runs in the same AKS cluster as CluedIn and has access to CluedIn's message queue.

A user or an automated process sends a POST request to the Submitter. The request must have:

minimal required information about mapping
API token
data

The Submitter parses the mapping information and generates clues from the request's data. The generated clues are submitted directly to the message queue.

Request

A typical request may look like:

curl --location 'https://app.mycluedindomain.com/submitter/data?entity_type=%2FPerson&origin_code=Salesforce%3Aid&vocab_prefix=organization.sample&name=name&codes=Dynamics%3Aid%2CSharepoint%3Aname&incoming_edges=%2FManager%7C%2FEmployee%23Sharepoint%3AManagerId%2C%2FCustomer%7C%2FContact%23CRM%3Aid&outgoing_edges=%2FWorks%7C%2FOrganization%23Salesforce%3Aorg_id%2C%2FManages%7C%2FEmployee%23Sharepoint%3Aemployee_id' \
--header 'Authorization: Bearer {JWT}' \
--header 'Content-Type: application/json' \
--data '[
    {
        "id": 1,
        "name": "John Doe",
        "city": "London"
    },
    {
        "id": 2,
        "name": "Mary Jane",
        "city": "Paris"
    }
]'

Or, if you use a UI tool like Postman, it can be even more user-friendly.

The query string parameters define the mapping configuration the Submitter will apply to the request's data.
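The percent-encoded query string in the curl example can be produced from readable values. The following is a minimal sketch using Python's standard `urllib.parse.urlencode`; the helper name `build_submitter_url` is hypothetical, and the base URL is the sample domain from the example above.

```python
from urllib.parse import urlencode

# Sample base URL from the curl example above; replace it with your own domain.
BASE_URL = "https://app.mycluedindomain.com/submitter/data"


def build_submitter_url(base_url, mapping):
    """Return base_url followed by the percent-encoded mapping parameters."""
    # urlencode escapes '/', ':', '#', '|' and ',' inside the values.
    return f"{base_url}?{urlencode(mapping)}"


# Readable mapping values, in the same order as in the curl example.
mapping = {
    "entity_type": "/Person",
    "origin_code": "Salesforce:id",
    "vocab_prefix": "organization.sample",
    "name": "name",
    "codes": "Dynamics:id,Sharepoint:name",
    "incoming_edges": "/Manager|/Employee#Sharepoint:ManagerId,/Customer|/Contact#CRM:id",
    "outgoing_edges": "/Works|/Organization#Salesforce:org_id,/Manages|/Employee#Sharepoint:employee_id",
}

url = build_submitter_url(BASE_URL, mapping)
print(url)
```

Building the URL this way keeps the mapping readable in your code and avoids encoding mistakes, for example an unescaped `#` that would otherwise be treated as a URL fragment.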

Here's the full list of currently supported parameters:

`entity_type`

This parameter is mandatory because you can't create an Entity without specifying its Entity Type. Example: &entity_type=/Person.

`origin_code`

The Origin Entity Code parameter is also mandatory because each entity must have exactly one Origin Entity Code. The Origin Entity Code depends on the Entity Type parameter: the Submitter can't parse the Entity Code without knowing its Entity Type. Because both parameters are mandatory, this dependency only affects the error messages you get.

Example: &origin_code=Salesforce:id.

Let's look closer at this example: the Entity Type is omitted because it is specified in the entity_type parameter. Salesforce is the Entity Code's Origin, and id is the name of the property where the Submitter must find the value to generate an Origin Entity Code for a particular clue.

For example, if the record is:

    {
        "id": 1,
        "name": "John Doe",
        "city": "London"
    }

and the configuration is &entity_type=/Person&origin_code=Salesforce:id, then the Submitter takes the id property of the record and generates a Clue with Entity Type /Person and Origin Entity Code /Person#Salesforce:1.

Hence, you can think of most parameters as templates that the Submitter translates into real Clue properties.
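To make the template idea concrete, here is a minimal sketch of how an `Origin:property` template could be resolved against a record. It illustrates the described behavior only and is not the Submitter's actual code; the function name `render_entity_code` is hypothetical.

```python
def render_entity_code(entity_type, code_template, record):
    """Resolve an 'Origin:property' template against a single record.

    Returns None when the record does not contain the property.
    """
    # Split "Salesforce:id" into the origin and the property name.
    origin, _, prop = code_template.partition(":")
    if prop not in record:
        return None
    # The Entity Type comes from the entity_type parameter.
    return f"{entity_type}#{origin}:{record[prop]}"


record = {"id": 1, "name": "John Doe", "city": "London"}
print(render_entity_code("/Person", "Salesforce:id", record))
# /Person#Salesforce:1
```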

`vocab_prefix`

The Vocabulary Prefix is the last of three mandatory parameters. An example value is: &vocab_prefix=person.sample.

So each property from the aforementioned record will be mapped to a vocabulary key like:

person.sample.id: 1
person.sample.name: John Doe
person.sample.city: London

You don't have to create anything like Vocabulary or Entity Type in advance, but you may want to configure the mapping later in CluedIn.
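The vocabulary key mapping amounts to prefixing each property name. A minimal sketch of that step, with a hypothetical function name and no claim about how the Submitter implements it internally:

```python
def to_vocabulary_keys(vocab_prefix, record):
    """Prefix every property name of a record with the vocabulary prefix."""
    return {f"{vocab_prefix}.{key}": value for key, value in record.items()}


record = {"id": 1, "name": "John Doe", "city": "London"}
for key, value in to_vocabulary_keys("person.sample", record).items():
    print(f"{key}: {value}")
```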

`codes`

The codes parameter is optional. Use it if you need additional Entity Codes. The format is the same as for origin_code, but the value is a comma-separated list of codes. For example, &codes=Dynamics:id,Sharepoint:name generates the codes /Person#Dynamics:1 and /Person#Sharepoint:John Doe.

If a record doesn't have a given property, the corresponding code is not created, and this is not considered an error.
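The behavior described for additional codes, including the silent skip of missing properties, can be sketched as follows. This is an illustration under the stated rules, not the Submitter's implementation; `render_entity_codes` is a hypothetical name.

```python
def render_entity_codes(entity_type, codes_param, record):
    """Resolve a comma-separated list of 'Origin:property' templates.

    Templates whose property is missing from the record are skipped.
    """
    codes = []
    for template in codes_param.split(","):
        origin, _, prop = template.partition(":")
        if prop in record:
            codes.append(f"{entity_type}#{origin}:{record[prop]}")
    return codes


full_record = {"id": 1, "name": "John Doe", "city": "London"}
print(render_entity_codes("/Person", "Dynamics:id,Sharepoint:name", full_record))

# A record without "name" only yields the Dynamics code.
print(render_entity_codes("/Person", "Dynamics:id,Sharepoint:name", {"id": 2}))
```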

`name`

Entity Name. This parameter is optional but highly recommended. It's important to remember that this is not the actual name but the name of the property where the Submitter will find it.

`incoming_edges` and `outgoing_edges`

These are two optional parameters for Entity Edges. If you don't specify them, nothing happens. If you specify them incorrectly, you will get an error.

The format is a comma-separated list of edges, with each edge being /EdgeType|/EntityType#Origin:property.

Say you specified an Outgoing Entity Edge /Works|/Organization#Salesforce:org_id.

This means that if a record has the org_id property, the Submitter creates a Clue with an Outgoing Entity Edge from the Clue's Origin Entity Code to /Organization#Salesforce:5, assuming that the value of org_id in the record is 5.
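The edge format can be illustrated with a small parser. The sketch below splits each edge into its edge type, target Entity Type, origin, and property name, and then resolves the target code for a record. All names are hypothetical, and the strictness of the validation is an assumption; the Submitter's own parser may accept or reject different inputs.

```python
from typing import NamedTuple


class EdgeTemplate(NamedTuple):
    edge_type: str    # e.g. "/Works"
    entity_type: str  # e.g. "/Organization"
    origin: str       # e.g. "Salesforce"
    prop: str         # record property that holds the value


def parse_edges(param):
    """Parse a comma-separated list of '/EdgeType|/EntityType#Origin:property'."""
    templates = []
    for spec in param.split(","):
        edge_type, _, code = spec.partition("|")
        entity_type, _, origin_and_prop = code.partition("#")
        origin, _, prop = origin_and_prop.partition(":")
        # Every part must be present; otherwise the edge is malformed.
        if not (edge_type and entity_type and origin and prop):
            raise ValueError(f"Malformed edge: {spec!r}")
        templates.append(EdgeTemplate(edge_type, entity_type, origin, prop))
    return templates


def resolve_target(template, record):
    """Return the target Entity Code, or None if the record lacks the property."""
    if template.prop not in record:
        return None
    return f"{template.entity_type}#{template.origin}:{record[template.prop]}"


edges = parse_edges("/Works|/Organization#Salesforce:org_id")
print(resolve_target(edges[0], {"id": 1, "org_id": 5}))
# /Organization#Salesforce:5
```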

Payload

The payload must be a valid JSON array.

Mapping

You need practically no data modeling on CluedIn's side because all mapping is provided in your request, and it can be as simple as:

Entity Type
Origin Entity Code
Vocabulary Prefix

The Submitter creates clues from the data you sent, and then you can always map the vocabulary keys to other keys, or create additional Entity Codes or Edges with Rules.

Response

A typical response looks like this:

{
    "submission": {
        "id": "8cfe4503-b38e-481d-ae92-0ae985ccf533",
        "timestamp": 1713503423257
    },
    "query_string": "?entity_type=/Person&vocab_prefix=organization.sample&codes=Dynamics:id,Sharepoint:name&origin_code=Salesforce:id&incoming_edges=/Manager|/Employee%23Sharepoint:ManagerId,/Customer|/Contact%23CRM:id&outgoing_edges=/Works|/Organization%23Salesforce:org_id,/Manages|/Employee%23Sharepoint:employee_id&name=name",
    "configuration": {
        "name_template": "name",
        "origin_entity_code_template": "/Person#Salesforce:id",
        "entity_type": "/Person",
        "vocabulary_prefix": "organization.sample",
        "entity_code_templates": "/Person#Dynamics:id,/Person#Sharepoint:name",
        "incoming_edges_templates": "EdgeType: /Manager; From: C:/Employee#Sharepoint:ManagerId; To: C:/Person#Salesforce:id; Properties: 0,EdgeType: /Customer; From: C:/Contact#CRM:id; To: C:/Person#Salesforce:id; Properties: 0",
        "outgoing_edges_templates": "EdgeType: /Works; From: C:/Person#Salesforce:id; To: C:/Organization#Salesforce:org_id; Properties: 0,EdgeType: /Manages; From: C:/Person#Salesforce:id; To: C:/Employee#Sharepoint:employee_id; Properties: 0"
    },
    "status": {
        "code": 200,
        "description": "OK"
    },
    "errors": [],
    "records": {
        "received": 2,
        "accepted": 2,
        "rejected": 0
    }
}

It may look like too much information, but it's only helpful information. For example, you can correlate the submission ID and timestamp to your posted data. You can also easily check the response code and the description.

Also, two very useful fields are query_string and configuration. The query_string shows you what you sent in the request's query string parameters. Logging these activities is very useful. The configuration shows you how this query string was parsed by the Submitter. Hence, if you accidentally made a typo in a parameter's name, you will see that it was not picked up by the Submitter, so you can spot the problem early.

The record counts show how many records were parsed from the payload (received), how many records were successfully transformed into Clues and published to the message queue (accepted), and how many records were not accepted (rejected).
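When you automate ingestion, you may want to reduce the response to the few fields you log or act on. The following sketch assumes the response shape shown above; the function name `summarize_submission` and the definition of "ok" are illustrative choices, not part of the Submitter.

```python
def summarize_submission(body):
    """Extract the fields worth logging from a parsed Submitter response."""
    records = body["records"]
    return {
        "submission_id": body["submission"]["id"],
        "timestamp": body["submission"]["timestamp"],
        # Treat the submission as successful only if nothing was rejected.
        "ok": body["status"]["code"] == 200 and records["rejected"] == 0,
        "received": records["received"],
        "accepted": records["accepted"],
        "rejected": records["rejected"],
        "errors": body["errors"],
    }


# A trimmed copy of the sample response above.
sample = {
    "submission": {"id": "8cfe4503-b38e-481d-ae92-0ae985ccf533", "timestamp": 1713503423257},
    "status": {"code": 200, "description": "OK"},
    "errors": [],
    "records": {"received": 2, "accepted": 2, "rejected": 0},
}
print(summarize_submission(sample))
```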

The Submitter fails fast. For example, if you post 100 thousand records in one batch and record number 255 lacks a property needed to create an Origin Entity Code, the Submitter will not check the remaining 99,745 records and will return a "Bad Request" with received: 100000, accepted: 254, rejected: 99746. The errors collection will contain information about the problematic record, so you can easily locate the problem.
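The counts in the fail-fast example follow from simple arithmetic: every record before the first bad one is accepted, and the bad record plus everything after it is rejected. A small sketch of that calculation (the function is purely illustrative):

```python
def fail_fast_counts(total_records, first_bad_record):
    """Compute the record counts for a batch that fails at a given record.

    first_bad_record is the 1-based position of the first invalid record.
    """
    accepted = first_bad_record - 1
    # The bad record and all records after it are rejected.
    rejected = total_records - accepted
    return {"received": total_records, "accepted": accepted, "rejected": rejected}


print(fail_fast_counts(100_000, 255))
# {'received': 100000, 'accepted': 254, 'rejected': 99746}
```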

Request compression

You can speed up network communication with request compression. A 1 GB CSV file can usually be compressed to 200-250 MB, so you stay under the request size limit and send your data faster.

Here's a code sample of how to post compressed content using Python:

import gzip
import json
import os

import requests

# Assumption: the API token is provided in an environment variable.
CLUEDIN_TOKEN = os.environ["CLUEDIN_TOKEN"]

def post_batch(batch):
    # Serialize the batch to JSON and gzip the UTF-8 bytes.
    compressed_data = gzip.compress(json.dumps(batch).encode('utf-8'))

    response = requests.post(
        url='https://app.mycluedindomain.com/submitter/data?entity_type=/Test&origin_code=Test:UUID&vocab_prefix=test.demo&name=Name',
        data=compressed_data,
        headers={
            "Authorization": f"Bearer {CLUEDIN_TOKEN}",
            "Content-Encoding": "gzip",
            "Content-Type": "application/json"
        },
        timeout=600)

    # Print the parsed JSON body, or the raw text if the body is not valid JSON.
    try:
        print(response.status_code, response.json())
    except ValueError:
        print(response.status_code, response.text)
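Because the Submitter fails fast, it can be practical to split a large dataset into smaller batches before posting, so that one bad record only affects its own batch. The sketch below shows one way to do this and to check how well a batch compresses; the helper names and the batch size are hypothetical, and a suitable batch size depends on your data and environment.

```python
import gzip
import json


def chunked(records, batch_size):
    """Yield consecutive slices of records with at most batch_size items."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def compressed_size(batch):
    """Return (raw_bytes, gzip_bytes) for the JSON encoding of a batch."""
    raw = json.dumps(batch).encode("utf-8")
    return len(raw), len(gzip.compress(raw))


# Synthetic records, for illustration only.
records = [{"UUID": i, "Name": f"Record {i}"} for i in range(25)]

for batch in chunked(records, 10):
    raw_bytes, gzip_bytes = compressed_size(batch)
    print(len(batch), raw_bytes, gzip_bytes)
    # With the function above, you would now call: post_batch(batch)
```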

Pros and cons

As often happens, the same feature can be an advantage or a disadvantage, depending on who uses it.

The main difference of the unofficial Submitter is that it simplifies data ingestion by skipping every step that can be skipped on the way to ingesting data in CluedIn. Once the Submitter is configured, you have just one URL (apart from the query string parameters), and you can send any of your data to the same endpoint in almost the same way as you would send it to a standard CluedIn Ingestion Endpoint.

This approach makes life easier for a data engineer who needs to automate the ingestion of multiple datasets in CluedIn. You don't have to create endpoints in the UI first, and you don't have to do the mapping in the UI.

For the same reason, a business user will prefer the standard CluedIn Ingestion Endpoints approach because all configuration happens in the CluedIn UI, with the ability to preview data, automatically map properties to existing vocabularies, create missing vocabularies, and so on.