Identity Resolution using Spark 2.4 and DSE GraphFrames (DSE v6.8) with BYOS
- Identity resolution (IDR) is the process of matching identifiers across devices and touchpoints to a single profile. This helps build a cohesive, omnichannel view of a consumer, enabling brands to deliver relevant messaging throughout the customer journey.
- The underlying data source for IDR is generally an Identity Graph (IDG), a profile database housing all known identifiers correlated to individuals. The IDG also stores metadata about the identifiers.
- In an IDG, a single user is represented by multiple identifiers, all connected to each other directly or transitively.
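To make "connected directly or transitively" concrete, here is a minimal sketch in plain Python. It is not the DSE GraphFrames code used by this project: the identifier strings and the `resolve_profiles` helper are hypothetical, and the grouping is done with a simple union-find structure rather than a distributed graph algorithm.

```python
from collections import defaultdict


def resolve_profiles(edges):
    """Group identifiers into profiles: two identifiers share a profile
    if a chain of edges links them (directly or transitively)."""
    parent = {}

    def find(x):
        # Walk up to the root of x's set, shortening the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        # Merge the sets that contain a and b.
        parent[find(a)] = find(b)

    profiles = defaultdict(set)
    for identifier in list(parent):
        profiles[find(identifier)].add(identifier)
    return list(profiles.values())


# Hypothetical identifiers, for illustration only.
edges = [
    ("cookie:c1", "email:e1"),
    ("email:e1", "device:d1"),
    ("cookie:c2", "device:d2"),
]
profiles = resolve_profiles(edges)
```

In this example `cookie:c1` and `device:d1` end up in the same profile even though no edge joins them directly, because both are linked to `email:e1`.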
The steps below use Apache Spark 2.4 with DSE 6.8 via BYOS (Bring Your Own Spark) support:
- Populating the IDG using JSON identifier data sets (id-graph-loader)
- Computing the count of identifiers that are connected to a set of input identifiers via the IDG (id-graph-resolver)
- Exporting the matched identifiers that are connected to a set of input identifiers via the IDG (id-graph-resolver)
- Data can be generated using this Google Sheet
- To generate JSON from the Google Sheet, use this link
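The count and export steps listed above can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the id-graph-resolver implementation: `connected_identifiers` and `export_csv` are hypothetical helpers, the identifiers are made up, and excluding the input identifiers from the result is an assumption about the desired output.

```python
import csv
import io
from collections import defaultdict, deque


def connected_identifiers(edges, seeds):
    """Return every identifier reachable from any seed, excluding the seeds.
    (Excluding the seeds is an assumption made for this sketch.)"""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)

    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        # Breadth-first traversal over the undirected identity graph.
        current = queue.popleft()
        for neighbour in adjacency[current]:
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append(neighbour)
    return seen - set(seeds)


def export_csv(identifiers, out):
    """Write one matched identifier per row, sorted for stable output."""
    writer = csv.writer(out, lineterminator="\n")
    for identifier in sorted(identifiers):
        writer.writerow([identifier])


# Hypothetical identifiers, for illustration only.
edges = [
    ("cookie:c1", "email:e1"),
    ("email:e1", "device:d1"),
    ("cookie:c2", "device:d2"),
]
matched = connected_identifiers(edges, ["cookie:c1"])
count = len(matched)

buffer = io.StringIO()
export_csv(matched, buffer)
```

Here `count` covers the identifiers reached from `cookie:c1`, and `buffer` stands in for the output CSV file.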
id-graph-loader
- Build:
cd id-graph-loader
./gradlew build
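- Generate the byos.properties file, for example by running dse client-tool configuration byos-export ~/dse-6.8.18/byos.properties on a DSE node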
- Run:
cd spark-2.4.8-bin-hadoop2.7/
bin/spark-submit \
  --jars ~/dse-6.8.18/clients/dse-byos_2.11-6.8.18.jar \
  --properties-file ~/dse-6.8.18/byos.properties \
  --class com.datastax.examples.dsegf.Loader \
  id-graph-loader-1.0-SNAPSHOT.jar <vertexJsonPath> <edgeJsonPath>
id-graph-resolver
- Build:
cd id-graph-resolver
./gradlew build
- Run:
cd spark-2.4.8-bin-hadoop2.7/
bin/spark-submit \
  --jars ~/dse-6.8.18/clients/dse-byos_2.11-6.8.18.jar \
  --properties-file ~/dse-6.8.18/byos.properties \
  --class com.datastax.examples.dsegf.IDGResolver \
  <resolverJarPath> <toBeMatchedIDsPath.csv> <resolvedIDsOutputPath.csv>
- The sample data for loading and resolution is located inside the resources folder of the loader and resolver modules.