Roman Klimenko

CluedIn Data Enrichment

March 30, 2024

cluedin, data, python, microsoft, azure, azure-functions

Here's a video tutorial on how to enrich data in CluedIn with the help of the Enrich rule action, and a simple Azure Function:

The Enrich Rule Action is a flexible way to enrich data in CluedIn with external data sources.

The Rule Action requires an API that accepts a list of vocabulary keys and returns a list of properties. The returned properties are saved with a specified vocabulary prefix. The API can be implemented as an Azure Function, a REST API, or any other service that can be accessed via HTTP.

The API is also responsible for getting the data from the external source, processing it, and returning the properties to CluedIn.

The Rule Action takes the following parameters:

  • URL: The URL of the enriching API.
  • Payload: Comma-separated list of vocabulary keys to send to the API.
  • Vocabulary Prefix: The vocabulary prefix used to save properties returned by the API.

When the action is invoked, it sends the payload to the API (HTTP POST) and saves the returned properties with the specified vocabulary prefix.
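To make the request side of this contract concrete, here is a minimal sketch of what the action appears to send, judging by the form-urlencoded curl request used later in this post. The helper names `buildEnrichBody` and `parseEnrichBody` and the `customer.name` key are hypothetical, and skipping keys that the entity lacks is an assumption of the sketch, not documented CluedIn behavior.

```javascript
// Hypothetical helper: turns the comma-separated Payload setting plus an
// entity's properties into a form-urlencoded request body.
function buildEnrichBody(payloadSetting, entityProperties) {
  const keys = payloadSetting
    .split(",")
    .map((key) => key.trim())
    .filter((key) => key.length > 0);

  const params = new URLSearchParams();
  for (const key of keys) {
    const value = entityProperties[key];
    // Assumption: keys missing from the entity are simply left out.
    if (value !== undefined && value !== null) {
      params.append(key, String(value));
    }
  }
  return params.toString();
}

// What the receiving API does with that body: parse it back into an object.
function parseEnrichBody(body) {
  return Object.fromEntries(new URLSearchParams(body));
}

const body = buildEnrichBody("customer.postcode, customer.name", {
  "customer.postcode": "CA2 6PJ",
  "customer.name": "Jane",
});
console.log(body); // customer.postcode=CA2+6PJ&customer.name=Jane
console.log(parseEnrichBody(body));
```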

Example

Vocabulary key with UK postcodes: customer.postcode.

Open API that returns postcodes data: https://postcodes.io/.

For example:

curl --location 'https://api.postcodes.io/postcodes/CA2%206PJ'

Response:

{
  "status": 200,
  "result": {
    "postcode": "CA2 6PJ",
    "quality": 1,
    "eastings": 338102,
    "northings": 554389,
    "country": "England",
    "nhs_ha": "North West",
    "longitude": -2.966283,
    "latitude": 54.880414,
    "european_electoral_region": "North West",
    "primary_care_trust": "Cumbria Teaching",
    "region": "North West",
    "lsoa": "Carlisle 009D",
    "msoa": "Carlisle 009",
    "incode": "6PJ",
    "outcode": "CA2",
    "parliamentary_constituency": "Carlisle",
    "parliamentary_constituency_2024": "Carlisle",
    "admin_district": "Cumberland",
    "parish": null,
    "admin_county": null,
    "date_of_introduction": "198001",
    "admin_ward": "Morton",
    "ced": "Morton",
    "ccg": "NHS North East and North Cumbria",
    "nuts": "Carlisle",
    "pfa": "Cumbria",
    "codes": {
      "admin_district": "E06000063",
      "admin_county": "E99999999",
      "admin_ward": "E05014205",
      "parish": "E43000295",
      "parliamentary_constituency": "E14000620",
      "parliamentary_constituency_2024": "E14001152",
      "ccg": "E38000215",
      "ccg_id": "01H",
      "ced": "E58000165",
      "nuts": "TLD12",
      "lsoa": "E01019231",
      "msoa": "E02003995",
      "lau2": "E07000028",
      "pfa": "E23000002"
    }
  }
}

We implement an API, for example, as a JavaScript Azure Function that accepts the customer.postcode key, gets the postcode data from the postcodes.io API, and returns the properties to CluedIn:

const { app } = require("@azure/functions");
const querystring = require("node:querystring");

app.http("postcodes", {
  methods: ["POST"],
  authLevel: "anonymous",
  handler: async (request, context) => {
    // Parse the form-urlencoded request body into key/value pairs.
    const parsedBody = querystring.parse(await request.text());
    const postcode = parsedBody["customer.postcode"];

    // Look up the postcode. It is URL-encoded because UK postcodes contain a space.
    const response = await fetch(
      `https://api.postcodes.io/postcodes/${encodeURIComponent(postcode)}`,
      {
        method: "GET",
        headers: {
          "Content-Type": "application/json",
        },
      }
    );

    // Pass the postcodes.io JSON and status code straight back to the caller.
    return {
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify(await response.json()),
      status: response.status,
    };
  },
});

You can then send the following request to the Azure Function:

curl --location 'https://postcodes.azurewebsites.net/api/postcodes' \
  --header 'Content-Type: application/x-www-form-urlencoded' \
  --data-urlencode 'customer.postcode=CA2 6PJ'

It returns the same response as the postcodes.io API itself.

Placing an API between CluedIn and the external source keeps the enrichment process simple: the Rule Action only has to call one endpoint, and the API takes care of talking to the external source. The API can be implemented in any language and hosted on any platform that supports HTTP requests.

Now you can create a Data Part or a Golden Record Rule in CluedIn and configure the Enrich action with the following parameters:

  • URL: The URL of the Azure Function.
  • Payload: customer.postcode. It is sent in the request body.
  • Vocabulary Prefix: postcode. The returned properties are saved with this prefix.

When the Rule Action is invoked, it sends the payload to the Azure Function and saves the returned properties with the specified vocabulary prefix. The JSON response properties are flattened, so the response above will be saved into the enriched entity as:

{
  "postcode.status": 200,
  "postcode.result.postcode": "CA2 6PJ",
  "postcode.result.quality": 1,
  "postcode.result.eastings": 338102,
  "postcode.result.northings": 554389,
  "postcode.result.country": "England",
  "postcode.result.nhs_ha": "North West",
  "postcode.result.longitude": -2.966283,
  "postcode.result.latitude": 54.880414,
  "postcode.result.european_electoral_region": "North West",
  "postcode.result.primary_care_trust": "Cumbria Teaching",
  "postcode.result.region": "North West",
  "postcode.result.lsoa": "Carlisle 009D",
  "postcode.result.msoa": "Carlisle 009",
  "postcode.result.incode": "6PJ",
  "postcode.result.outcode": "CA2",
  "postcode.result.parliamentary_constituency": "Carlisle",
  "postcode.result.parliamentary_constituency_2024": "Carlisle",
  "postcode.result.admin_district": "Cumberland",
  "postcode.result.parish": null,
  "postcode.result.admin_county": null,
  "postcode.result.date_of_introduction": "198001",
  "postcode.result.admin_ward": "Morton",
  "postcode.result.ced": "Morton",
  "postcode.result.ccg": "NHS North East and North Cumbria",
  "postcode.result.nuts": "Carlisle",
  "postcode.result.pfa": "Cumbria",
  "postcode.codes.admin_district": "E06000063",
  "postcode.codes.admin_county": "E99999999",
  "postcode.codes.admin_ward": "E05014205",
  "postcode.codes.parish": "E43000295",
  "postcode.codes.parliamentary_constituency": "E14000620",
  "postcode.codes.parliamentary_constituency_2024": "E14001152",
  "postcode.codes.ccg": "E38000215",
  "postcode.codes.ccg_id": "01H",
  "postcode.codes.ced": "E58000165",
  "postcode.codes.nuts": "TLD12",
  "postcode.codes.lsoa": "E01019231",
  "postcode.codes.msoa": "E02003995",
  "postcode.codes.lau2": "E07000028",
  "postcode.codes.pfa": "E23000002"
}
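The flattening itself happens inside CluedIn, but the general idea can be sketched in a few lines. The `flatten` function below is a hypothetical illustration, not CluedIn's implementation: it joins every level of nesting with a dot and treats null values and arrays as plain values. How CluedIn names deeper levels such as `codes` is up to CluedIn and may differ from what this sketch produces.

```javascript
// Hypothetical illustration of dot-notation flattening with a prefix.
function flatten(value, prefix, out = {}) {
  const isPlainObject =
    value !== null && typeof value === "object" && !Array.isArray(value);

  if (!isPlainObject) {
    // Leaf value (string, number, boolean, null, or array): store as-is.
    out[prefix] = value;
    return out;
  }

  for (const [key, child] of Object.entries(value)) {
    flatten(child, `${prefix}.${key}`, out);
  }
  return out;
}

const sample = {
  status: 200,
  result: { postcode: "CA2 6PJ", country: "England", parish: null },
};
console.log(flatten(sample, "postcode"));
```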

Additional Scenarios and Best Practices

Data Normalization

If the API returns a key that already exists in the enriched entity, the Rule Action overwrites the value in the entity. You can use this to normalize data such as phone numbers or addresses: implement an API that returns the normalized data under the same keys that were sent for enrichment.
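As a rough sketch of the normalizing part of such an API, the functions below tidy up UK postcodes and return them under the same keys they arrived in. The names `normalizePostcode` and `normalizeResponse` are hypothetical, the formatting rule (upper case, one space before the last three characters) is a simplification, and how the returned keys combine with the Vocabulary Prefix is something to verify in your own CluedIn instance.

```javascript
// Hypothetical normalizer: "ca26pj" or " ca2  6pj " becomes "CA2 6PJ".
function normalizePostcode(raw) {
  const compact = String(raw).replace(/\s+/g, "").toUpperCase();
  // Too short to split into outward and inward code: return the cleaned value.
  if (compact.length < 5) {
    return compact;
  }
  return `${compact.slice(0, -3)} ${compact.slice(-3)}`;
}

// Returns the normalized values under the same keys that were received.
function normalizeResponse(parsedBody) {
  const normalized = {};
  for (const [key, value] of Object.entries(parsedBody)) {
    normalized[key] = normalizePostcode(value);
  }
  return normalized;
}

console.log(normalizeResponse({ "customer.postcode": " ca2  6pj " }));
```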

Data Validation

You can use the Rule Action to validate data. For example, you can implement an API that checks if the email address is valid and returns a boolean value. The Rule Action will save the boolean value with the specified vocabulary prefix.
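A minimal sketch of such a validation API's core logic might look like this. The function names, the `customer.email` key, and the `is_valid` property are hypothetical, and the check is purely syntactic: it does not prove that the mailbox exists.

```javascript
// Hypothetical syntactic check: something@something.something, no spaces.
function isPlausibleEmail(value) {
  return (
    typeof value === "string" &&
    /^[^\s@]+@[^\s@]+\.[^\s@]+$/.test(value.trim())
  );
}

// Builds the JSON body the API would return to the Rule Action.
function validationResponse(parsedBody) {
  return { is_valid: isPlausibleEmail(parsedBody["customer.email"]) };
}

console.log(validationResponse({ "customer.email": "jane@example.com" }));
console.log(validationResponse({ "customer.email": "not-an-email" }));
```

With a Vocabulary Prefix of, say, `email`, the boolean would then be stored under that prefix.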

Caching

If the external source is slow or has rate limits, you can implement a caching mechanism in the API. The API can cache the responses and return the cached data if the same request is sent within a specified time frame.
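One simple way to do this is an in-memory cache with a time-to-live, sketched below. `createTtlCache` is a hypothetical helper; the clock is injectable only to make the behavior easy to test. Keep in mind that an in-memory cache lives only as long as the function instance that holds it, so a shared store may be a better fit if you run several instances.

```javascript
// Hypothetical in-memory cache where entries expire after ttlMs milliseconds.
function createTtlCache(ttlMs, now = () => Date.now()) {
  const entries = new Map();
  return {
    get(key) {
      const entry = entries.get(key);
      if (!entry) {
        return undefined;
      }
      if (now() >= entry.expiresAt) {
        // Expired: drop the entry so the caller fetches fresh data.
        entries.delete(key);
        return undefined;
      }
      return entry.value;
    },
    set(key, value) {
      entries.set(key, { value, expiresAt: now() + ttlMs });
    },
  };
}

// Usage: cache lookups by postcode for ten minutes.
const postcodeCache = createTtlCache(10 * 60 * 1000);
postcodeCache.set("CA2 6PJ", { status: 200 });
console.log(postcodeCache.get("CA2 6PJ")); // served from the cache
console.log(postcodeCache.get("SW1A 1AA")); // not cached yet
```

In the handler, you would check the cache before calling the external source and store the response after a successful call.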

Rate Limiting

If the external source has rate limits, you can implement a rate limiting mechanism in the API. The API can return an error response if the rate limit is exceeded. In this case, you can later reprocess the entities that received the error response.

You can also return the 429 Too Many Requests status code together with the Retry-After header to specify when the entity should be reprocessed.

Moreover, your API can still accept data and send it to a queue for processing later.
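The 429 response with a Retry-After header can be sketched as follows. This is a simple fixed-window counter, chosen for brevity rather than because CluedIn requires it; `createFixedWindowLimiter`, `tooManyRequests`, and the field names in the JSON body are hypothetical, and the counter is per function instance.

```javascript
// Hypothetical fixed-window limiter: at most `limit` calls per `windowMs`.
function createFixedWindowLimiter(limit, windowMs, now = () => Date.now()) {
  let windowStart = now();
  let count = 0;

  return function tryAcquire() {
    const current = now();
    if (current - windowStart >= windowMs) {
      // A new window has started: reset the counter.
      windowStart = current;
      count = 0;
    }
    if (count < limit) {
      count += 1;
      return { allowed: true, retryAfterSeconds: 0 };
    }
    const waitMs = windowStart + windowMs - current;
    return { allowed: false, retryAfterSeconds: Math.ceil(waitMs / 1000) };
  };
}

// Builds a 429 response in the { status, headers, body } shape used by the handler.
function tooManyRequests(retryAfterSeconds) {
  return {
    status: 429,
    headers: {
      "Content-Type": "application/json",
      "Retry-After": String(retryAfterSeconds),
    },
    body: JSON.stringify({
      error: "rate_limit_exceeded",
      retry_after_seconds: retryAfterSeconds,
    }),
  };
}

// Usage inside a handler: reject early when the budget is used up.
const tryAcquire = createFixedWindowLimiter(100, 60 * 1000);
const decision = tryAcquire();
console.log(decision.allowed ? "proceed" : tooManyRequests(decision.retryAfterSeconds));
```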

Error Handling

Always return JSON, even if an error occurs. The Rule Action will save the error message in the entity, and you can use it for debugging or monitoring purposes.
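One way to guarantee this is to wrap the handler so that any unexpected exception is converted into a JSON response. The sketch below assumes the same `{ status, headers, body }` return shape as the postcodes function; `jsonError`, `withJsonErrors`, and the `error` field name are hypothetical.

```javascript
// Builds a JSON error response.
function jsonError(status, message) {
  return {
    status,
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ error: message }),
  };
}

// Wraps a handler so that a thrown exception still produces JSON.
function withJsonErrors(handler) {
  return async (request, context) => {
    try {
      return await handler(request, context);
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      return jsonError(500, message);
    }
  };
}

// Usage: a handler that fails, for example because the upstream call threw.
const safeHandler = withJsonErrors(async () => {
  throw new Error("postcodes.io is unreachable");
});
safeHandler().then((response) => console.log(response.status, response.body));
```

In the Azure Function you would pass the wrapped function as the `handler` option.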

Security

The Rule Action sends an API key configured in the CLUEDIN_RULE_ACTION_API_KEY variable. The API can validate the key before processing the request.
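The validation step could look roughly like the sketch below. This post does not say how the key is transmitted, so the `x-api-key` header name in the comment is purely an assumption, as are the `ENRICH_API_KEY` setting and the `isValidApiKey` helper; check what your CluedIn instance actually sends. Both values are hashed first so that the constant-time comparison always sees buffers of equal length.

```javascript
const crypto = require("node:crypto");

// Hypothetical check: compares the presented key with the expected one
// in constant time to avoid leaking information through timing.
function isValidApiKey(presentedKey, expectedKey) {
  if (
    typeof presentedKey !== "string" ||
    typeof expectedKey !== "string" ||
    expectedKey === ""
  ) {
    return false;
  }
  const presented = crypto.createHash("sha256").update(presentedKey).digest();
  const expected = crypto.createHash("sha256").update(expectedKey).digest();
  return crypto.timingSafeEqual(presented, expected);
}

// Inside the handler (header and setting names are assumptions):
//   const presented = request.headers.get("x-api-key");
//   if (!isValidApiKey(presented, process.env.ENRICH_API_KEY)) {
//     return { status: 401, body: JSON.stringify({ error: "unauthorized" }) };
//   }
console.log(isValidApiKey("secret", "secret"));
console.log(isValidApiKey("guess", "secret"));
```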

Preview

When you invoke the Preview interface of the Rule Action, it sends a request to the API with the is_preview key set to true. The API can return a preview response that will be shown in the CluedIn UI.
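Detecting a preview call in the API can be as small as the sketch below. It assumes `is_preview` arrives alongside the other payload keys and accepts both the boolean `true` and the string "true", since form-urlencoded bodies carry only strings; `isPreviewRequest` is a hypothetical helper, and the exact format CluedIn uses should be confirmed against a real request.

```javascript
// Hypothetical helper: true when the parsed request body marks a preview call.
function isPreviewRequest(parsedBody) {
  const flag = parsedBody["is_preview"];
  return flag === true || String(flag).toLowerCase() === "true";
}

// A preview call might skip side effects such as queueing or cache writes.
const previewBody = { "customer.postcode": "CA2 6PJ", is_preview: "true" };
const liveBody = { "customer.postcode": "CA2 6PJ" };
console.log(isPreviewRequest(previewBody));
console.log(isPreviewRequest(liveBody));
```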