Highly available Snowplow pipeline on kubernetes and GCP


TL;DR: If you are in a hurry, feel free to skip the first chapters and jump directly to the implementation guide, starting at the chapter Prerequisites.

Knowing how our websites perform is critical for the success of virtually any business in this internet-focused time. We design and create websites and web apps to satisfy our users, engage them and - at the end - make money. For these reasons it's of elementary importance to know how our users behave on our sites, where they experience struggles and bottlenecks, and where and why they churn. To answer all of these - and many more - questions, there are numerous web trackers on the market, from Google Analytics to Adobe Analytics to Snowplow. The latter product is unique in many ways:

  • Snowplow is Open Source
  • Snowplow - in contrast to Google Analytics - is, if used correctly, fully GDPR compliant and respects users' privacy
  • Snowplow allows for full control of the data flow - from the tracker to the data pipeline and data modelling. All elements run on your own infrastructure - allowing you to be truly GDPR compliant
  • Snowplow provides satisfactory tracking results for web, mobile and even AMP pages
  • Snowplow allows for real-time tracking, analytics and dashboarding. The latency of the pipeline is generally below 1 second

For a more in-depth introduction on what Snowplow is - and why it is superior especially to Google Analytics - have a look at my previous post about "What is Snowplow".

Why run Snowplow on kubernetes?

As outlined in my previous post about "What is Snowplow", Snowplow consists of multiple components:

  • The trackers which collect/generate user behavior data
  • The collector server, which acts as receiving server for the tracker data
  • An enrich server, which checks the incoming events for validity and optionally enriches the data
  • An iglu-server which - together with a database - acts as schema registry (the schemas define what a valid event needs to look like)
  • The BigQuery streamloader which ultimately inserts the data into BigQuery
  • A mutator server which listens for changes in incoming data types and creates new BigQuery columns on demand
  • A repeater server which simply retries failed inserts
  • A load balancer for serving the events to the collector servers (most probably you want to have multiple collector servers for HA reasons)
  • A Google Cloud Storage (GCS) bucket for storing deadletter events

The "glue" between all these components is Google Pub/Sub - acting as a convenient message queue between the servers. Pub/Sub also perfectly decouples the components from each other, allowing for very good horizontal scalability and high availability options.

NOTE: While this guide is using Google Pub/Sub as "glue" between all the Snowplow pipeline components, Snowplow provides a bunch of different options which don't rely on the Google Cloud, e.g. AWS Kinesis or Kafka - with the latter resulting in a somewhat different setup. I'll discuss this in a later post.

See the following reference architecture for more details

Snowplow reference architecture on GCP

As the above sketch shows, there are quite a lot of components involved. Having run Pub/Sub for multiple years now, I can say that this technology requires next to zero management. Set it up correctly, and it'll work. The same is true for GCS. To be fair, these things are quite "dumb" in the sense that the first one "simply" is a (great!) message queue and the latter is just blob storage.

However, it's a different story for all the compute instances as well as the iglu postgres database. Google Cloud does a lot to keep management efforts low - however, there are still a lot of server components which at times take a lot of heat (meaning events and load). Even with all the advances in serverless cloud technology, one needs to monitor and manage these components in production environments.

And for monitoring and operating production workloads there is - in my honest opinion - still no better option available than kubernetes. The managed kubernetes solutions on all major cloud providers are cheap or free, provide good availability and take away most of the hassle with running kubernetes - allowing us to focus on deploying and running applications - in this case the Snowplow pipeline.

To summarize, Snowplow consists of 6 server components + 1 database which need management and monitoring. Kubernetes is a great option to deploy these servers and adds the k8s benefits to Snowplow: great tooling, monitoring and self-healing.

Cost savings

Well, my favorite reason for running Snowplow on an existing kubernetes cluster is simply cost. Snowplow only requires very limited resources. For my 100 million events per month pipeline, a single e2-small compute instance per server component would actually be sufficient. However, you most probably want two or, better, three instances of the collector server and the iglu server running due to high availability demands (more on that later). Also, the BigQuery streamloader might need 2 or more instances, as the BigQuery insert latency can be quite high - meaning that a single streamloader - even if only utilized at 50% - might not be able to cope with the incoming events, as it's constantly waiting for BigQuery to respond.

Because Snowplow only requires very little CPU and memory, I was able to add all the server components to my existing cluster, adding only very little "background noise" load. I'm currently running Snowplow with zero additional compute costs.

This advantage however is obviously only valid if you already have other workloads running on a kubernetes cluster and can spare 5% additional CPU and memory resources.

Snowplow architecture on kubernetes

The architecture I currently use for most of my production pipelines utilizes kubernetes to host the main Snowplow server components and looks - in a slightly simplified version - as follows:

Snowplow servers running on kubernetes

(Click on the image to open it with better resolution)

Looking at the sketch, we have the following components running on the Google Cloud Platform:

  • Several Pub/Sub topics and subscriptions
  • Our BigQuery data warehouse
  • A GCS storage bucket, where our failed events are stored

The following components are running on our kubernetes cluster:

  • Kubernetes ingress (Load Balancer)
  • Collector server (Deployment)
  • Enrich server (Deployment)
  • Streamloader (Deployment)
  • Mutator (Deployment)
  • Repeater (Deployment)
  • Iglu Server (Deployment)
  • Iglu PostgreSQL cluster (CloudNativePG cluster (more on that later))

In addition to these main components, we need some typical kubernetes components to make this architecture work:

  • A ClusterIP service for the collector as well as the iglu server - as we need to actively send data/requests to them
  • Configmaps for all the servers - for their configuration
  • A BackendConfig for configuring our Ingress Load Balancer (Note: this is due to my cluster running on the Google Cloud. BackendConfigs are a Google Cloud CRD, allowing for Load Balancer backend configuration. There are similar options for most other k8s supported load balancers)

Note: My kubernetes cluster is a GKE managed kubernetes running on the GCP platform - which is most probably the most sensible option when using Google Cloud services like BigQuery and Pub/Sub. If you are on AWS, for example, consider using the managed AWS kubernetes option (EKS) with Kinesis instead of Pub/Sub and S3 instead of GCS.

In the next chapters we'll use this theoretical knowledge and implement a working Snowplow pipeline, running the servers on kubernetes and using pub/sub as the "glue" between them. The setup I'm proposing below is in general production ready. Depending on how your cluster is set up and depending on your company policies, you might add Security Contexts as well as an IDS system like Falco. This however is conventional kubernetes operations - therefore I'll focus on the Snowplow main components here.

High availability

High availability is one of the most important aspects for me when designing systems. Not even due to business criticality - but because I don't want to get calls in the middle of the night due to downed servers.

Due to Snowplow's very decoupled nature and perfect horizontal scalability, it's rather easy to design it in a way that it's highly available.

If we look at the individual components of the pipeline, we can assume that Pub/Sub, the load balancer, BigQuery and GCS are already highly available (check the google documentation for SLAs).

So we only need to look at which of the server components are potential breaking points in the pipeline.

First, let's define what breaking point means: In my case - and many web tracking use-cases - a breaking point means that we miss or lose messages. It's not so critical that all messages are inserted into our data warehouse immediately, all the time.

Looking at the architecture sketch above, we can see that we have three components which are critical for message loss:

  1. Collector server: The obvious point of failure - if the collector server is down, we miss tracked messages. The worst case!
  2. Iglu server: The enrich server validates messages against a predefined schema, available at the iglu server. Therefore, if the iglu server is down, the enrich server will mark the message as invalid and forward it to the bad-rows pub/sub topic. While the message is not lost, it's still quite annoying if we need to reinsert it from the bad-rows topic.
  3. Iglu database: As the iglu server needs to read the schemas from somewhere, the iglu database itself should also be highly available - otherwise we risk events getting sent to the bad-rows topic.

What about the enrich, streamloader, mutator and repeater servers? Actually, we don't need to worry about them too much. If the enrich server is down, the collector server will still get all events and forward them to Pub/Sub. There they wait patiently for the enrich server to come up again. The same is true for the other server components. This means that if any of the other servers is down, we simply don't get near-realtime inserts anymore, but we do not lose messages - which is a problem that can wait until I've had my morning coffee.

To make sure the aforementioned three components are highly available, I suggest the following:

  • Have at least 3 worker nodes in your kubernetes cluster
  • Have 3 pods of your collector server as well as your iglu server running. Snowplow will not mind if there are multiple instances running - and you simply get the resiliency you need.
  • For the iglu database you also want a highly available option. Either you already have a PostgreSQL/MySQL server as part of your other business systems, or you use one of the managed cloud databases (like CloudSQL for Postgres). However, my favorite option: use the incredible CloudNativePG postgres cluster for iglu. It's relatively lightweight and it really looks and feels like a cloud-native postgres cluster. It provides high availability options, automated backups and is all in all a wonderful experience. See my introduction to CloudNativePG for details on what it is and how to set it up.

Note: If you need low-latency event inserts into your data warehouse at all times (so you can't "wait" for e.g. the enrich server to be restored), follow the same multiple-instance principle as for the collector server for your enrich server, mutator, streamloader and repeater as well.
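
As a minimal sketch - assuming the deployment names and the snowplow namespace used later in this guide - bumping the replica count and checking that the pods are spread across your worker nodes could look like this:

# Scale the collector and iglu deployments to 3 replicas each
kubectl -n snowplow scale deployment collector-server iglu-server --replicas=3

# Verify that the pods landed on different worker nodes
kubectl -n snowplow get pods -l app=collector -o wide
kubectl -n snowplow get pods -l app=iglu -o wide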

Prerequisites

For this guide to work, we assume the following:

  • You have a working kubernetes cluster to deploy the main snowplow workload to. For this guide, preferably on the Google Cloud, but any k8s cluster will do.

  • You have a Google Cloud Project with billing enabled.

  • You have IAM permissions to create and view:

    • GCS storage buckets
    • BigQuery datasets
    • Pub/Sub topics and subscriptions
    • Service Accounts and assign IAM roles for them
  • You have a Debian-derivative Linux operating system as your local development environment. A link will be provided if the installation steps differ for other operating systems.

NOTE: Depending on the amount of events you send to your Snowplow pipeline, you might end up with some costs. This will mainly be Google Pub/Sub costs. See the Pub/Sub pricing overview for more details. That being said: I currently run a pipeline with 100 million events per month - and my costs are at approx. 2.8€ per day. Depending on your site's workloads, there is a good chance that running the Pub/Sub instances is free for you.

Preparation

  1. Make sure you have a version of Python between 3.5 and 3.9 installed (yes - unfortunately, gcloud still does not support Python 3.10+)

  2. Install the gcloud cli (Ubuntu/Debian):

    # Install the system dependencies
    sudo apt-get install apt-transport-https ca-certificates gnupg

    # Add the gcloud CLI distribution as package source
    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee -a /etc/apt/sources.list.d/google-cloud-sdk.list

    # Import the Google cloud public key
    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo tee /usr/share/keyrings/cloud.google.gpg

    # Finally, install the gcloud cli
    sudo apt-get update && sudo apt-get install google-cloud-cli

    For other operating systems, have a look in the official gcloud CLI installation manual.

  3. Run gcloud init and follow the steps in the console to authenticate your user against GCP.

  4. Enable the following Google Cloud APIs by following the links and clicking the "Enable" button. Make sure to select the correct project at the top-left of the screen.

    Enable the required Google Cloud APIs

    If you don't see the "Enable" button, the API is already enabled.
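
If you prefer the command line over the console, you can also enable the APIs with gcloud. The exact list of services below is my assumption of what this guide relies on - double-check it against the links above:

# Enable the Google Cloud APIs used in this guide
gcloud services enable \
  pubsub.googleapis.com \
  bigquery.googleapis.com \
  storage.googleapis.com \
  compute.googleapis.com \
  iam.googleapis.com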

How to deploy Snowplow on kubernetes

The following sections describe the steps required to set up the basic Snowplow infrastructure on Google Cloud as well as deploy the individual Snowplow servers on your kubernetes cluster.

Note: In the sections below you'll encounter the term <your-project-id> quite often. Replace this term with your Google Cloud project id. You can find it by navigating to the GCP console.

Get your GCP project id
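
Alternatively, you can read the project id directly from the gcloud CLI:

# Print the currently configured project id
gcloud config get-value project

# Or list all projects you have access to
gcloud projects list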

Service Accounts

As our servers need to publish and subscribe to Pub/Sub topics, insert into BigQuery and list storage buckets, we need to create the following service accounts with the following roles:

  • spprefix-bq-loader-server: BigQuery Data Editor, Logs Writer, Pub/Sub Publisher, Pub/Sub Subscriber, Pub/Sub Viewer, Storage Object Viewer

  • spprefix-enrich-server: Logs Writer, Pub/Sub Publisher, Pub/Sub Subscriber, Pub/Sub Viewer, Storage Object Viewer

  • spprefix-collector-server: Logs Writer, Pub/Sub Publisher, Pub/Sub Viewer

NOTE: Feel free to change the names of the service accounts to match your own naming scheme

Run these commands using the gcloud CLI to create the service accounts with their roles:

# spprefix-bq-loader-server
gcloud iam service-accounts create spprefix-bq-loader-server \
--description="Snowplow bq loader service account" \
--display-name="spprefix-bq-loader-server"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/bigquery.dataEditor"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/logging.logWriter"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.publisher"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.subscriber"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.viewer"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"

# spprefix-enrich-server
gcloud iam service-accounts create spprefix-enrich-server \
--description="Snowplow enrich service account" \
--display-name="spprefix-enrich-server"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-enrich-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/logging.logWriter"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-enrich-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.publisher"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-enrich-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.subscriber"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-enrich-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.viewer"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-enrich-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"

# spprefix-collector-server
gcloud iam service-accounts create spprefix-collector-server \
--description="Snowplow collector service account" \
--display-name="spprefix-collector-server"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-collector-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/logging.logWriter"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-collector-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.publisher"

gcloud projects add-iam-policy-binding <your-project-id> \
--member="serviceAccount:spprefix-collector-server@<your-project-id>.iam.gserviceaccount.com" \
--role="roles/pubsub.viewer"
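
If you want to double-check that the role bindings were applied, you can inspect the project IAM policy for one of the service accounts, for example:

# List all roles granted to the bq-loader service account
gcloud projects get-iam-policy <your-project-id> \
  --flatten="bindings[].members" \
  --filter="bindings.members:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" \
  --format="table(bindings.role)"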

Create and download the keyfile JSONs by running these commands. They will be downloaded to ~/Downloads/sa-keys.

mkdir -p ~/Downloads/sa-keys
gcloud iam service-accounts keys create ~/Downloads/sa-keys/bq-loader-server-sa.json \
 --iam-account="spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com"
gcloud iam service-accounts keys create ~/Downloads/sa-keys/enrich-server-sa.json \
 --iam-account="spprefix-enrich-server@<your-project-id>.iam.gserviceaccount.com"
gcloud iam service-accounts keys create ~/Downloads/sa-keys/collector-server-sa.json \
 --iam-account="spprefix-collector-server@<your-project-id>.iam.gserviceaccount.com"

NOTE: Please delete the service account keyfiles after you are finished setting up the cluster.

Pub/Sub

For "glueing" our server components together, we'll use Google Pub/Sub. We'll need the following topics:

  • spprefix-bad-1-topic
  • spprefix-bq-bad-rows-topic
  • spprefix-bq-loader-server-failed-inserts-topic
  • spprefix-bq-loader-server-types-topic
  • spprefix-enriched-topic
  • spprefix-raw-topic

Run these commands to create the topics

gcloud pubsub topics create spprefix-bad-1-topic
gcloud pubsub topics create spprefix-bq-bad-rows-topic
gcloud pubsub topics create spprefix-bq-loader-server-failed-inserts-topic
gcloud pubsub topics create spprefix-bq-loader-server-types-topic
gcloud pubsub topics create spprefix-enriched-topic
gcloud pubsub topics create spprefix-raw-topic

Additionally, we need subscriptions on these topics.

  • spprefix_bad_1: spprefix-bad-1-topic
  • spprefix_bad_rows: spprefix-bq-bad-rows-topic
  • spprefix-bq-loader-server-failed-inserts: spprefix-bq-loader-server-failed-inserts-topic
  • spprefix-bq-loader-server-types: spprefix-bq-loader-server-types-topic
  • spprefix-bq-loader-server-input: spprefix-enriched-topic
  • spprefix-enrich-server: spprefix-raw-topic

Create them by running these commands:

gcloud pubsub subscriptions create spprefix_bad_1 --topic=spprefix-bad-1-topic --expiration-period=never
gcloud pubsub subscriptions create spprefix_bad_rows --topic=spprefix-bq-bad-rows-topic --expiration-period=never
gcloud pubsub subscriptions create spprefix-bq-loader-server-failed-inserts --topic=spprefix-bq-loader-server-failed-inserts-topic --expiration-period=never
gcloud pubsub subscriptions create spprefix-bq-loader-server-types --topic=spprefix-bq-loader-server-types-topic --expiration-period=never
gcloud pubsub subscriptions create spprefix-bq-loader-server-input --topic=spprefix-enriched-topic --expiration-period=never
gcloud pubsub subscriptions create spprefix-enrich-server --topic=spprefix-raw-topic --expiration-period=never
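
To verify that everything was created as expected, list the topics and subscriptions:

# Both lists should show the six spprefix-* resources created above
gcloud pubsub topics list --format="value(name)"
gcloud pubsub subscriptions list --format="value(name)"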

Google Cloud Storage

Why do we need Google Cloud Storage? Well, we don't desperately need to provision cloud storage for Snowplow to work - however, Snowplow provides a great deadlettering feature. Meaning: if - for whatever reason - Snowplow can't insert an incoming event, it will add the data as a file to a blob storage like Google Cloud Storage. This allows for debugging why the event was not inserted - and even better - for later re-inserting the data once you have resolved the cause of the failed insertion.

Run the following command to create a storage bucket for Snowplow deadletters (events which can't be inserted):

gcloud storage buckets create gs://spprefix-bq-loader-dead-letter

Note: Google bucket names need to be globally unique - therefore please change the name of your storage bucket to a different one, e.g. replace the term spprefix with something different.

To allow our BigQuery loader service to insert bad events into the newly created dead-letter-bucket, we want to add the storage object admin permission to this bucket:

gcloud storage buckets add-iam-policy-binding gs://spprefix-bq-loader-dead-letter --member="serviceAccount:spprefix-bq-loader-server@<your-project-id>.iam.gserviceaccount.com" --role=roles/storage.objectAdmin

Note: Those are finally all the bootstrapping steps. Let's continue with setting up our server components on kubernetes.

IP-Address

To allow our trackers to send data to our collector, we need a public IP address which we can bind our cluster ingress to. If you run your workloads on a GKE cluster, follow this step. Otherwise, make sure that the ingress we deploy can be reached via a public IP or URL.

# Create a Google Cloud global static IP-Address
gcloud compute addresses create snowplow-ingress-ip --global --ip-version IPV4

NOTE: If running on GKE, the ingress IP-Address needs to be of type global!
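
You can retrieve the reserved IP address right away - you will need it later to create the DNS record for your collector hostname (e.g. spcollector.example.com, as used in the ingress manifest further below):

# Print the reserved global IP address
gcloud compute addresses describe snowplow-ingress-ip \
  --global \
  --format="get(address)"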

Set up Iglu server

First, let's set up our iglu server.
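
All manifests in this guide live in a dedicated snowplow namespace. If it does not exist yet in your cluster, create it first:

# Create the namespace all Snowplow components are deployed into
kubectl create namespace snowplow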

Prepare Iglu Server database

The iglu server requires a database to store its schemas. This guide assumes that you provide this database. If not, my recommendation is to use either:

  • CloudSQL for Postgres, a managed Postgres server from Google, or - even better -
  • CloudNativePG, a highly available postgres cluster. Follow the link and set up the cluster as in the description.

Please note down your postgres database name, user and password for the iglu server - you will need them in the next steps.
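
If you still need to create the database and user, a minimal sketch could look like the following. The names igludb and iglu_user are hypothetical placeholders for the <iglu-database> and <iglu-database-user> literals used in the configmap below, and how you connect (psql directly, kubectl exec into a CloudNativePG pod, a CloudSQL proxy, ...) depends on your setup:

# Create an iglu database and user on your Postgres instance (adjust host and names to your environment)
psql -h <your-postgres-host> -U postgres \
  -c "CREATE USER iglu_user WITH PASSWORD '<your-postgres-password>';" \
  -c "CREATE DATABASE igludb OWNER iglu_user;"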

Set up the iglu deployment

  1. Create a file called iglu-configmap.yaml with the following content. Change the following literals:

    • <iglu-database>: Postgres database name
    • <iglu-database-user>: Postgres user for this iglu database
    • <your-postgres-password>: Postgres password of your iglu postgres user
    • <your-iglu-api-key>: Create and add a randomly generated API key for accessing iglu here. Keep this secret. You will need this for later steps. This can be any string.
    kind: ConfigMap
    apiVersion: v1
    metadata:
      name: iglu-configmap
      namespace: snowplow
    data:
      iglu-server.hocon: |
        {
          "repoServer": {
            "interface": "0.0.0.0"
            "port": 8080
            "threadPool": "cached"
            "maxConnections": 2048
          }
          "database": {
            "type": "postgres"
            "host": "pgcluster-iglu-rw"
            "port": 5432
            "dbname": "<iglu-database>"
            "username": "<iglu-database-user>"
            "password": "<your-postgres-password>"
            "driver": "org.postgresql.Driver"
            pool: {
              "type": "hikari"
              "maximumPoolSize": 5
              connectionPool: {
                "type": "fixed"
                "size": 4
              }
              "transactionPool": "cached"
            }
          }
          "debug": false
          "patchesAllowed": true
          "superApiKey": "<your-iglu-api-key>"
        }
  2. Create a file iglu-deployment.yaml with the following content.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: iglu-server
      namespace: snowplow
    spec:
      selector:
        matchLabels:
          app: iglu
      replicas: 2
      template:
        metadata:
          labels:
            app: iglu
        spec:
          securityContext:
            runAsUser: 1000
            runAsGroup: 3000
            fsGroup: 2000
          containers:
          - name: iglu-server
            image: snowplow/iglu-server:0.9.0
            command: ["/home/snowplow/bin/iglu-server", "--config", "/snowplow/config/iglu-server.hocon"]
            imagePullPolicy: "IfNotPresent"
            env:
            - name: JAVA_OPTS
              value: -Dorg.slf4j.simpleLogger.defaultLogLevel=info
            volumeMounts:
            - name: iglu-config-volume
              mountPath: /snowplow/config
            resources:
              requests:
                memory: "256Mi"
                cpu: "100m"
              limits:
                memory: "1.5Gi"
          volumes:
          - name: iglu-config-volume
            configMap:
              name: iglu-configmap
              items:
              - key: iglu-server.hocon
                path: iglu-server.hocon
  3. As our servers need to reach the iglu server, we want to add a ClusterIP service. Add a file iglu-service.yaml with the following content:

    apiVersion: v1
    kind: Service
    metadata:
      name: iglu-server-service
      namespace: snowplow
    spec:
      selector:
        app: iglu
      ports:
      - name: http
        protocol: TCP
        port: 80
        targetPort: 8080

    NOTE: For unknown reasons, the iglu service (iglu/iglu-service.yaml) needs to listen on port 80 or 443 - otherwise the enrich server can't access iglu.

Apply all manifests with kubectl apply -f <filename>. Your iglu server is now ready to consume and serve your event schemas.
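
Before moving on, it's worth checking that the iglu pods are healthy. In the versions I used, Iglu Server exposes a health endpoint under /api/meta/health which should simply answer with OK; you can reach it locally via a port-forward:

# Check that both iglu pods are running
kubectl -n snowplow get pods -l app=iglu

# Forward the ClusterIP service to localhost and hit the health endpoint
kubectl -n snowplow port-forward svc/iglu-server-service 8080:80 &
curl http://localhost:8080/api/meta/health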

Set up BigQuery StreamLoader

  1. Create a file streamloader-configmap.yaml and add the following content. Change <your-google-project-id> to your Google project id and <your-iglu-api-key> to your iglu API key - set in the step above.

    1kind: ConfigMap
    2apiVersion: v1
    3metadata:
    4 name: streamloader-configmap
    5 namespace: snowplow
    6data:
    7 config.hocon: |
    8 {
    9 "projectId": <your-google-project-id>
    10
    11 "loader": {
    12 "input": {
    13 "subscription": spprefix-bq-loader-server-input
    14 }
    15 "output": {
    16 "good": {
    17 "datasetId": spprefix_pipeline_db
    18 "tableId": events
    19 }
    20 "bad": {
    21 "topic": spprefix-bq-bad-rows-topic
    22 }
    23 "types": {
    24 "topic": spprefix-bq-loader-server-types-topic
    25 }
    26 "failedInserts": {
    27 "topic": spprefix-bq-loader-server-failed-inserts-topic
    28 }
    29 }
    30 }
    31
    32 "mutator": {
    33 "input": {
    34 "subscription": spprefix-bq-loader-server-types
    35 }
    36 "output": {
    37 "good": ${loader.output.good}
    38 }
    39 }
    40
    41 "repeater": {
    42 "input": {
    43 "subscription": spprefix-bq-loader-server-failed-inserts
    44 }
    45 "output": {
    46 "good": ${loader.output.good}
    47 "deadLetters": {
    48 "bucket": "gs://spprefix-bq-loader-dead-letter"
    49 }
    50 }
    51 }
    52 }
    53
    54
    55 iglu-config.json: |
    56 {
    57 "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3",
    58 "data": {
    59 "cacheSize": 500,
    60 "cacheTtl": 600,
    61 "repositories":
    62 [
    63 {
    64 "connection": {
    65 "http": {
    66 "uri": "http://iglucentral.com"
    67 }
    68 },
    69 "name": "Iglu Central",
    70 "priority": 10,
    71 "vendorPrefixes": []
    72 },
    73 {
    74 "connection": {
    75 "http": {
    76 "uri": "http://mirror01.iglucentral.com"
    77 }
    78 },
    79 "name": "Iglu Central - Mirror 01",
    80 "priority": 20,
    81 "vendorPrefixes": []
    82 },
    83 {
    84 "connection": {
    85 "http": {
    86 "apikey": "<your-iglu-api-key>",
    87 "uri": "http://iglu-server-service/api"
    88 }
    89 },
    90 "name": "Iglu Server",
    91 "priority": 0,
    92 "vendorPrefixes": []
    93 }
    94 ]
    95 }
    96 }
    97
    98
  2. Run the following command to create a base64 encoded string of your streamloader service account:

    base64 ~/Downloads/sa-keys/bq-loader-server-sa.json
  3. Create a file bqloader_google_application_credentials.yaml with the following content - and add the output of the base64 step instead of <base64-string-of-bqloader-service-account-key-file>

    apiVersion: v1
    kind: Secret
    metadata:
      name: bqloader-serviceaccount-creds
      namespace: snowplow
    type: Opaque
    data:
      sa_json: |
        <base64-string-of-bqloader-service-account-key-file>
  4. Create a file streamloader-deployment.yaml

    1apiVersion: apps/v1
    2kind: Deployment
    3metadata:
    4 name: streamloader-server
    5 namespace: snowplow
    6spec:
    7 selector:
    8 matchLabels:
    9 app: streamloader
    10 replicas: 1
    11 template:
    12 metadata:
    13 labels:
    14 app: streamloader
    15 spec:
    16 containers:
    17 - name: streamloader-server
    18 image: snowplow/snowplow-bigquery-streamloader:1.6.4
    19 command:
    20 - "/home/snowplow/bin/snowplow-bigquery-streamloader"
    21 - "--config"
    22 - "/snowplow/config/config.hocon"
    23 - "--resolver"
    24 - "/snowplow/config/iglu-config.json"
    25 imagePullPolicy: "IfNotPresent"
    26 env:
    27 - name: JAVA_OPTS
    28 value: -Dorg.slf4j.simpleLogger.defaultLogLevel=info
    29 - name: GOOGLE_APPLICATION_CREDENTIALS
    30 value: /etc/gcp/sa_credentials.json
    31 volumeMounts:
    32 - name: streamloader-config-volume
    33 mountPath: /snowplow/config
    34 - name: service-account-credentials-volume
    35 mountPath: /etc/gcp
    36 readOnly: true
    37 resources:
    38 requests:
    39 memory: "256Mi"
    40 cpu: "250m"
    41 limits:
    42 memory: "2Gi"
    43 volumes:
    44 - name: streamloader-config-volume
    45 configMap:
    46 name: streamloader-configmap
    47 items:
    48 - key: iglu-config.json
    49 path: iglu-config.json
    50 - key: config.hocon
    51 path: config.hocon
    52 - name: service-account-credentials-volume
    53 secret:
    54 secretName: bqloader-serviceaccount-creds
    55 items:
    56 - key: sa_json
    57 path: sa_credentials.json
  5. Apply the three manifests using kubectl apply.
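
The streamloader won't produce visible output in BigQuery until the collector and enrich server are up, but you can already check its logs to confirm that it starts and connects to Pub/Sub without errors:

# Check that the streamloader pod is running and inspect its logs
kubectl -n snowplow get pods -l app=streamloader
kubectl -n snowplow logs deploy/streamloader-server --tail=50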

Set up the Mutator

  1. The mutator server conveniently uses exactly the same configuration as the streamloader - therefore we will reuse its configuration. So we only need to create a file mutator-deployment.yaml here.

    1apiVersion: apps/v1
    2kind: Deployment
    3metadata:
    4 name: mutator-server
    5 namespace: snowplow
    6spec:
    7 selector:
    8 matchLabels:
    9 app: mutator
    10 replicas: 1
    11 template:
    12 metadata:
    13 labels:
    14 app: mutator
    15 spec:
    16 containers:
    17 - name: mutator-server
    18 image: snowplow/snowplow-bigquery-mutator:1.6.4
    19 command:
    20 - "/home/snowplow/bin/snowplow-bigquery-mutator"
    21 - "listen"
    22 - "--config"
    23 - "/snowplow/config/config.hocon"
    24 - "--resolver"
    25 - "/snowplow/config/iglu-config.json"
    26 imagePullPolicy: "IfNotPresent"
    27 env:
    28 - name: JAVA_OPTS
    29 value: -Dorg.slf4j.simpleLogger.defaultLogLevel=info
    30 - name: GOOGLE_APPLICATION_CREDENTIALS
    31 value: /etc/gcp/sa_credentials.json
    32 volumeMounts:
    33 - name: mutator-config-volume
    34 mountPath: /snowplow/config
    35 - name: service-account-credentials-volume
    36 mountPath: /etc/gcp
    37 readOnly: true
    38 resources:
    39 requests:
    40 memory: "128Mi"
    41 cpu: "150m"
    42 limits:
    43 memory: "512Mi"
    44 volumes:
    45 - name: mutator-config-volume
    46 configMap:
    47 name: streamloader-configmap
    48 items:
    49 - key: iglu-config.json
    50 path: iglu-config.json
    51 - key: config.hocon
    52 path: config.hocon
    53 - name: service-account-credentials-volume
    54 secret:
    55 secretName: bqloader-serviceaccount-creds
    56 items:
    57 - key: sa_json
    58 path: sa_credentials.json
  2. Apply this manifest using kubectl apply

Initialize the BigQuery database

The mutator server provides a convenient little script which sets up the BigQuery database. For this to happen, run the following steps:

  1. Create the BigQuery dataset by running:

    bq --location=EU mk -d --description "Snowplow event dataset" spprefix_pipeline_db

    (Change the location to your preferred BigQuery location)

  2. Create a file mutator-init-deployment.yaml with this content:

    1apiVersion: v1
    2kind: Pod
    3metadata:
    4 name: mutator-init-server
    5 namespace: snowplow
    6spec:
    7 containers:
    8 - name: mutator-init-server
    9 image: snowplow/snowplow-bigquery-mutator:1.6.4
    10 command:
    11 - "/home/snowplow/bin/snowplow-bigquery-mutator"
    12 - "create"
    13 - "--config"
    14 - "/snowplow/config/config.hocon"
    15 - "--resolver"
    16 - "/snowplow/config/iglu-config.json"
    17 - "--partitionColumn=collector_tstamp"
    18 - "--requirePartitionFilter"
    19 imagePullPolicy: "IfNotPresent"
    20 env:
    21 - name: JAVA_OPTS
    22 value: -Dorg.slf4j.simpleLogger.defaultLogLevel=info
    23 - name: GOOGLE_APPLICATION_CREDENTIALS
    24 value: /etc/gcp/sa_credentials.json
    25 volumeMounts:
    26 - name: mutator-config-volume
    27 mountPath: /snowplow/config
    28 - name: service-account-credentials-volume
    29 mountPath: /etc/gcp
    30 readOnly: true
    31 resources:
    32 requests:
    33 memory: "128Mi"
    34 cpu: "150m"
    35 limits:
    36 memory: "512Mi"
    37 volumes:
    38 - name: mutator-config-volume
    39 configMap:
    40 name: streamloader-configmap
    41 items:
    42 - key: iglu-config.json
    43 path: iglu-config.json
    44 - key: config.hocon
    45 path: config.hocon
    46 - name: service-account-credentials-volume
    47 secret:
    48 secretName: bqloader-serviceaccount-creds
    49 items:
    50 - key: sa_json
    51 path: sa_credentials.json
  3. Apply this manifest. This will create a pod which initializes the BigQuery table structure to match the Snowplow canonical/atomic event structure.
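
Once the pod has finished, you can verify that the events table was created with the expected partitioning - for example with the bq CLI:

# Inspect the pod output and the freshly created events table
kubectl -n snowplow logs mutator-init-server
bq show --format=prettyjson spprefix_pipeline_db.events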

Set up the Repeater

The repeater server conveniently uses exactly the same configuration as the streamloader - therefore we will reuse its configuration again. So we only need to create a file repeater-deployment.yaml - and apply it using kubectl apply.

1apiVersion: apps/v1
2kind: Deployment
3metadata:
4 name: repeater-server
5 namespace: snowplow
6spec:
7 selector:
8 matchLabels:
9 app: repeater
10 replicas: 1
11 template:
12 metadata:
13 labels:
14 app: repeater
15 spec:
16 containers:
17 - name: repeater-server
18 image: snowplow/snowplow-bigquery-repeater:1.6.4
19 command:
20 - "/home/snowplow/bin/snowplow-bigquery-repeater"
21 - "--config"
22 - "/snowplow/config/config.hocon"
23 - "--resolver"
24 - "/snowplow/config/iglu-config.json"
25 - "--bufferSize=20"
26 - "--timeout=20"
27 - "--backoffPeriod=900"
28
29 imagePullPolicy: "IfNotPresent"
30 env:
31 - name: JAVA_OPTS
32 value: -Dorg.slf4j.simpleLogger.defaultLogLevel=info
33 - name: GOOGLE_APPLICATION_CREDENTIALS
34 value: /etc/gcp/sa_credentials.json
35 volumeMounts:
36 - name: repeater-config-volume
37 mountPath: /snowplow/config
38 - name: service-account-credentials-volume
39 mountPath: /etc/gcp
40 readOnly: true
41 resources:
42 requests:
43 memory: "128Mi"
44 cpu: "150m"
45 limits:
46 memory: "512Mi"
47 volumes:
48 - name: repeater-config-volume
49 configMap:
50 name: streamloader-configmap
51 items:
52 - key: iglu-config.json
53 path: iglu-config.json
54 - key: config.hocon
55 path: config.hocon
56 - name: service-account-credentials-volume
57 secret:
58 secretName: bqloader-serviceaccount-creds
59 items:
60 - key: sa_json
61 path: sa_credentials.json

Set up the Enrich server

For the enrich server, we need to configure the connection to the iglu server as well as each individual enrichment we want to enable. We can do all of that in a configmap.

  1. Run the following command to create a base64 encoded string of your enrich service account:

    base64 ~/Downloads/sa-keys/enrich-server-sa.json
  2. Create a file enrich_google_application_credentials.yaml with the following content - and add the output of the base64 step instead of <base64-string-of-enrich-service-account-key-file>

    apiVersion: v1
    kind: Secret
    metadata:
      name: enrich-serviceaccount-creds
      namespace: snowplow
    type: Opaque
    data:
      sa_json: |
        <base64-string-of-enrich-service-account-key-file>
  3. Create a file enrich-configmap.yaml with the following content, but change <your-iglu-api-key> to your iglu API key and <your-project-id> to your Google project id. Also change <my-salt> to a randomly generated salt used for pseudonymizing user IP addresses. This configmap enables the campaign attribution, PII pseudonymization, event fingerprint, referer parser, UA parser and YAUAA enrichments (see Snowplow's enrichment docs for more details):

    1kind: ConfigMap
    2apiVersion: v1
    3metadata:
    4 name: enrich-configmap
    5 namespace: snowplow
    6data:
    7 enrichment_campaigns.json: |
    8 {
    9 "schema": "iglu:com.snowplowanalytics.snowplow/campaign_attribution/jsonschema/1-0-1",
    10 "data": {
    11 "name": "campaign_attribution",
    12 "vendor": "com.snowplowanalytics.snowplow",
    13 "enabled": true,
    14 "parameters": {
    15 "mapping": "static",
    16 "fields": {
    17 "mktMedium": ["utm_medium", "medium"],
    18 "mktSource": ["utm_source", "source"],
    19 "mktTerm": ["utm_term", "legacy_term"],
    20 "mktContent": ["utm_content"],
    21 "mktCampaign": ["utm_campaign", "cid", "legacy_campaign"]
    22 }
    23 }
    24 }
    25 }
    26
    27 enrichment_pii.json: |
    28 {
    29 "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/pii_enrichment_config/jsonschema/2-0-0",
    30 "data": {
    31 "vendor": "com.snowplowanalytics.snowplow.enrichments",
    32 "name": "pii_enrichment_config",
    33 "emitEvent": true,
    34 "enabled": true,
    35 "parameters": {
    36 "pii": [
    37 {
    38 "pojo": {
    39 "field": "user_ipaddress"
    40 }
    41 }
    42 ],
    43 "strategy": {
    44 "pseudonymize": {
    45 "hashFunction": "MD5",
    46 "salt": "<my-salt>"
    47 }
    48 }
    49 }
    50 }
    51 }
    52
    53 enrichment_event_fingerprint.json: |
    54 {
    55 "schema": "iglu:com.snowplowanalytics.snowplow/event_fingerprint_config/jsonschema/1-0-1",
    56 "data": {
    57 "name": "event_fingerprint_config",
    58 "vendor": "com.snowplowanalytics.snowplow",
    59 "enabled": true,
    60 "parameters": {
    61 "excludeParameters": ["cv", "eid", "nuid", "stm"],
    62 "hashAlgorithm": "MD5"
    63 }
    64 }
    65 }
    66
    67 enrichment_referrer_parser.json: |
    68 {
    69 "schema": "iglu:com.snowplowanalytics.snowplow/referer_parser/jsonschema/2-0-0",
    70 "data": {
    71 "name": "referer_parser",
    72 "vendor": "com.snowplowanalytics.snowplow",
    73 "enabled": true,
    74 "parameters": {
    75 "database": "referers-latest.json",
    76 "uri": "https://snowplow-hosted-assets.s3.eu-west-1.amazonaws.com/third-party/referer-parser/",
    77 "internalDomains": []
    78 }
    79 }
    80 }
    81
    82 enrichment_ua_parser.json: |
    83 {
    84 "schema": "iglu:com.snowplowanalytics.snowplow/ua_parser_config/jsonschema/1-0-1",
    85 "data": {
    86 "name": "ua_parser_config",
    87 "vendor": "com.snowplowanalytics.snowplow",
    88 "enabled": true,
    89 "parameters": {
    90 "uri": "https://snowplow-hosted-assets.s3.eu-west-1.amazonaws.com/third-party/ua-parser",
    91 "database": "regexes-latest.yaml"
    92 }
    93 }
    94 }
    95
    96 enrichment_yauaa.json: |
    97 {
    98 "schema": "iglu:com.snowplowanalytics.snowplow.enrichments/yauaa_enrichment_config/jsonschema/1-0-0",
    99 "data": {
    100 "enabled": true,
    101 "vendor": "com.snowplowanalytics.snowplow.enrichments",
    102 "name": "yauaa_enrichment_config"
    103 }
    104 }
    105
    106 config.hocon: |
    107 {
    108 "auth": {
    109 "type": "Gcp"
    110 }
    111 "input": {
    112 "type": "PubSub"
    113 "subscription": "projects/<your-project-id>/subscriptions/spprefix-enrich-server"
    114 }
    115 "output":
    116 {
    117 "good": {
    118 "type": "PubSub"
    119 "topic": "projects/<your-project-id>/topics/spprefix-enriched-topic"
    120 "attributes": [ "app_id", "event_name" ]
    121 }
    122 "bad": {
    123 "type": "PubSub"
    124 "topic": "projects/<your-project-id>/topics/spprefix-bad-1-topic"
    125 }
    126 }
    127 "assetsUpdatePeriod": "10080 minutes"
    128 }
    129
    130 iglu-config.json: |
    131 {
    132 "schema": "iglu:com.snowplowanalytics.iglu/resolver-config/jsonschema/1-0-3",
    133 "data": {
    134 "cacheSize": 500,
    135 "cacheTtl": 600,
    136 "repositories":
    137 [
    138 {
    139 "connection": {
    140 "http": {
    141 "uri": "http://iglucentral.com"
    142 }
    143 },
    144 "name": "Iglu Central",
    145 "priority": 10,
    146 "vendorPrefixes": []
    147 },
    148 {
    149 "connection": {
    150 "http": {
    151 "uri": "http://mirror01.iglucentral.com"
    152 }
    153 },
    154 "name": "Iglu Central - Mirror 01",
    155 "priority": 20,
    156 "vendorPrefixes": []
    157 },
    158 {
    159 "connection": {
    160 "http": {
    161 "apikey": "<your-iglu-api-key>",
    162 "uri": "http://iglu-server-service/api"
    163 }
    164 },
    165 "name": "Iglu Server",
    166 "priority": 0,
    167 "vendorPrefixes": []
    168 }
    169 ]
    170 }
    171 }
  4. Create the file enrich-deployment.yaml and add the following content:

    1apiVersion: apps/v1
    2kind: Deployment
    3metadata:
    4 name: enrich-server
    5 namespace: snowplow
    6spec:
    7 selector:
    8 matchLabels:
    9 app: enrich
    10 replicas: 1
    11 template:
    12 metadata:
    13 labels:
    14 app: enrich
    15 spec:
    16 containers:
    17 - name: enrich-server
    18 image: snowplow/snowplow-enrich-pubsub:3.7.0
    19 command:
    20 - "/home/snowplow/bin/snowplow-enrich-pubsub"
    21 - "--config"
    22 - "/snowplow/config/config.hocon"
    23 - "--iglu-config"
    24 - "/snowplow/config/iglu-config.json"
    25 - "--enrichments"
    26 - "/snowplow/config/enrichments"
    27 imagePullPolicy: "IfNotPresent"
    28 env:
    29 - name: JAVA_OPTS
    30 value: -Dorg.slf4j.simpleLogger.defaultLogLevel=info -Dorg.slf4j.simpleLogger.log.InvalidEnriched=debug
    31 - name: GOOGLE_APPLICATION_CREDENTIALS
    32 value: /etc/gcp/sa_credentials.json
    33 volumeMounts:
    34 - name: enrich-config-volume
    35 mountPath: /snowplow/config
    36 - name: service-account-credentials-volume
    37 mountPath: /etc/gcp
    38 readOnly: true
    39 resources:
    40 requests:
    41 memory: "256Mi"
    42 cpu: "350m"
    43 limits:
    44 memory: "1.5Gi"
    45 cpu: 2
    46 volumes:
    47 - name: enrich-config-volume
    48 configMap:
    49 name: enrich-configmap
    50 items:
    51 - key: iglu-config.json
    52 path: iglu-config.json
    53 - key: config.hocon
    54 path: config.hocon
    55 - key: enrichment_campaigns.json
    56 path: enrichments/enrichment_campaigns.json
    57 - key: enrichment_pii.json
    58 path: enrichments/enrichment_pii.json
    59 - key: enrichment_event_fingerprint.json
    60 path: enrichments/enrichment_event_fingerprint.json
    61 - key: enrichment_referrer_parser.json
    62 path: enrichments/enrichment_referrer_parser.json
    63 - key: enrichment_ua_parser.json
    64 path: enrichments/enrichment_ua_parser.json
    65 - name: service-account-credentials-volume
    66 secret:
    67 secretName: enrich-serviceaccount-creds
    68 items:
    69 - key: sa_json
    70 path: sa_credentials.json
  5. Apply all of these manifests with kubectl apply
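
As with the other components, a quick look at the logs tells you whether the enrich server started correctly and can reach the iglu service:

# Check the enrich pod and its startup logs
kubectl -n snowplow get pods -l app=enrich
kubectl -n snowplow logs deploy/enrich-server --tail=50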

Set up the Collector server and ingress load balancer

For the final component we are going to set up the following resources:

  • Service account secret
  • collector configmap
  • collector deployment
  • backend config for configuring the load balancer backend
  • collector service
  • ingress to allow our trackers to send data
  1. Run the following command to create a base64 encoded string of your collector service account:

    base64 ~/Downloads/sa-keys/collector-server-sa.json
  2. Create a file collector_google_application_credentials.yaml with the following content - and add the output of the base64 step instead of <base64-string-of-collector-service-account-key-file>

    apiVersion: v1
    kind: Secret
    metadata:
      name: collector-serviceaccount-creds
      namespace: snowplow
    type: Opaque
    data:
      sa_json: |
        <base64-string-of-collector-service-account-key-file>
  3. Create a file collector-configmap.yaml with the following content, replace <your-project-id> with your Google project id, and apply it using kubectl apply

    NOTE: The configuration file contains the following config option: "/customdtsp/tp2" = "/com.snowplowanalytics.snowplow/tp2". This configures a custom collector path. By default, all trackers send their events to https://<collector-url>/com.snowplowanalytics.snowplow/tp2. However, this is often blocked by AdBlockers. Therefore, it is advised to change the path to a custom one. In the below configuration, we use /customdtsp/tp2 as our tracking path. Make sure to also add this setting to your tracker configuration.

    Very important: The custom path needs to have exactly two levels. /customdtsp/tp2 works; /customdtsp does not, and neither does /customdtsp/tp2/other

    Very important: While this setting allows us to prevent AdBlockers from interfering with our tracking - please make sure to respect your users' privacy. Do not ever track data which can identify specific users.

    1kind: ConfigMap
    2apiVersion: v1
    3metadata:
    4 name: collector-configmap
    5 namespace: snowplow
    6data:
    7 config.hocon: |
    8 collector {
    9 interface = "0.0.0.0"
    10 port = 8080
    11 ssl {
    12 enable = false
    13 redirect = false
    14 port = 8443
    15 }
    16 paths {
    17 "/customdtsp/tp2" = "/com.snowplowanalytics.snowplow/tp2"
    18 }
    19 p3p {
    20 policyRef = "/w3c/p3p.xml"
    21 CP = "NOI DSP COR NID PSA OUR IND COM NAV STA"
    22 }
    23 crossDomain {
    24 enabled = false
    25 domains = [ "*" ]
    26 secure = true
    27 }
    28 cookie {
    29 enabled = true
    30 expiration = "365 days"
    31 name = sp
    32 domains = []
    33 fallbackDomain = ""
    34 secure = true
    35 httpOnly = false
    36 sameSite = "None"
    37 }
    38 doNotTrackCookie {
    39 enabled = false
    40 name = ""
    41 value = ""
    42 }
    43 cookieBounce {
    44 enabled = false
    45 name = "n3pc"
    46 fallbackNetworkUserId = "00000000-0000-4000-A000-000000000000"
    47 forwardedProtocolHeader = "X-Forwarded-Proto"
    48 }
    49 enableDefaultRedirect = false
    50 redirectMacro {
    51 enabled = false
    52 placeholder = "[TOKEN]"
    53 }
    54 rootResponse {
    55 enabled = false
    56 statusCode = 302
    57 headers = {}
    58 body = "302, redirecting"
    59 }
    60 cors {
    61 accessControlMaxAge = "5 seconds"
    62 }
    63 prometheusMetrics {
    64 enabled = false
    65 }
    66 streams {
    67 good = spprefix-raw-topic
    68 bad = spprefix-bad-1-topic
    69 useIpAddressAsPartitionKey = false
    70 sink {
    71 enabled = google-pub-sub
    72 googleProjectId = "<your-project-id>"
    73 backoffPolicy {
    74 minBackoff = 1000
    75 maxBackoff = 1000
    76 totalBackoff = 10000
    77 multiplier = 1
    78 }
    79 }
    80 buffer {
    81 byteLimit = 1000000
    82 recordLimit = 500
    83 timeLimit = 500
    84 }
    85 }
    86 telemetry {
    87 disable = false
    88 url = "telemetry-g.snowplowanalytics.com"
    89 userProvidedId = ""
    90 moduleName = "collector-pubsub-ce"
    91 moduleVersion = "0.2.2"
    92 autoGeneratedId = "329042380932sdjfiosdfo"
    93 }
    94 }
    95 akka {
    96 loglevel = WARNING
    97 loggers = ["akka.event.slf4j.Slf4jLogger"]
    98 http.server {
    99 remote-address-header = on
    100 raw-request-uri-header = on
    101 parsing {
    102 max-uri-length = 32768
    103 uri-parsing-mode = relaxed
    104 }
    105 max-connections = 2048
    106 }
    107 }
  4. Create a file collector-deployment.yaml and apply it

    1apiVersion: apps/v1
    2kind: Deployment
    3metadata:
    4 name: collector-server
    5 namespace: snowplow
    6spec:
    7 selector:
    8 matchLabels:
    9 app: collector
    10 replicas: 2
    11 template:
    12 metadata:
    13 labels:
    14 app: collector
    15 spec:
    16 # Prevent the scheduler to place two pods on the same node
    17 affinity:
    18 podAntiAffinity:
    19 preferredDuringSchedulingIgnoredDuringExecution:
    20 - weight: 10
    21 podAffinityTerm:
    22 labelSelector:
    23 matchExpressions:
    24 - key: app
    25 operator: In
    26 values:
    27 - collector
    28 topologyKey: "kubernetes.io/hostname"
    29 containers:
    30 - name: collector-server
    31 image: snowplow/scala-stream-collector-pubsub:2.8.2
    32 command:
    33 - "/opt/snowplow/bin/snowplow-stream-collector"
    34 - "--config"
    35 - "/snowplow/config/config.hocon"
    36 imagePullPolicy: "IfNotPresent"
    37 env:
    38 - name: JAVA_OPTS
    39 value: -Dorg.slf4j.simpleLogger.defaultLogLevel=info
    40 - name: GOOGLE_APPLICATION_CREDENTIALS
    41 value: /etc/gcp/sa_credentials.json
    42 volumeMounts:
    43 - name: collector-config-volume
    44 mountPath: /snowplow/config
    45 - name: service-account-credentials-volume
    46 mountPath: /etc/gcp
    47 readOnly: true
    48 resources:
    49 requests:
    50 memory: "128Mi"
    51 cpu: "250m"
    52 limits:
    53 memory: "1Gi"
    54 volumes:
    55 - name: collector-config-volume
    56 configMap:
    57 name: collector-configmap
    58 items:
    59 - key: config.hocon
    60 path: config.hocon
    61 - name: service-account-credentials-volume
    62 secret:
    63 secretName: collector-serviceaccount-creds
    64 items:
    65 - key: sa_json
    66 path: sa_credentials.json
  5. Create a file collector-be-config.yaml and apply it:

    Note: This step is only relevant if you run your workloads on a GKE cluster. If you have a different kubernetes distribution, configure the load balancer we are going to deploy in the next steps to use an HTTP healthcheck on port 8080 with request path /health on the collector server pods.

    apiVersion: cloud.google.com/v1
    kind: BackendConfig
    metadata:
      name: collector-backendconfig
      namespace: snowplow
    spec:
      timeoutSec: 60
      healthCheck:
        checkIntervalSec: 10
        timeoutSec: 10
        healthyThreshold: 3
        unhealthyThreshold: 5
        type: HTTP
        requestPath: /health
        port: 8080
      logging:
        enable: false
  6. Create a file collector-service.yaml and apply it:

    apiVersion: v1
    kind: Service
    metadata:
      name: collector-server-service
      namespace: snowplow
      annotations:
        cloud.google.com/backend-config: '{"default": "collector-backendconfig"}' # this is only required, if you run on GKE. See note in the step above.
    spec:
      selector:
        app: collector
      type: ClusterIP
      ports:
      - protocol: TCP
        port: 8080
        targetPort: 8080
  7. Create a file collector-ingress.yaml and apply it:

    Note: This next manifest assumes that you have cert-manager installed and a ClusterIssuer configured for issuing Let's Encrypt certificates. If not, please follow this HowToGeek guide.

    The file also assumes you are running your workloads on a GKE cluster and want to deploy a GCE load balancer for our ingress. However, you might use any ingress/load balancer you want. Just make sure that the load balancer directs the traffic to the collector-server-service on port 8080.

    apiVersion: networking.k8s.io/v1
    kind: Ingress
    metadata:
      name: snowplow-ingress
      namespace: snowplow
      annotations:
        kubernetes.io/ingress.global-static-ip-name: snowplow-ingress-ip
        cert-manager.io/cluster-issuer: letsencrypt-prod
        kubernetes.io/ingress.class: gce
        acme.cert-manager.io/http01-edit-in-place: "true"
    spec:
      tls: # < placing a host in the TLS config will indicate a certificate should be created
      - hosts:
        - spcollector.example.com
        secretName: snowplow-collector-cert-secret # < cert-manager will store the created certificate in this secret
      rules:
      - host: spcollector.example.com
        http:
          paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: collector-server-service
                port:
                  number: 8080
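
Once the DNS record for your collector hostname points to the reserved static IP address and the certificate has been issued, you can test the chain from the outside. The collector answers on /health - the same endpoint the load balancer health check uses:

# Expect an HTTP 200 response from the collector's health endpoint
curl -i https://spcollector.example.com/health

You can then point a Snowplow tracker at https://spcollector.example.com, with the custom post path /customdtsp/tp2 configured as described above.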

Summary

In this rather extensive post we covered what a Snowplow deployment on kubernetes might look like. We walked through all the required steps for setting up the Google Cloud resources as well as deploying our main server workloads on a kubernetes cluster.

While this involves a lot of steps, in the end it's worth it:

  • It's easier to maintain all the server components on kubernetes than it is with other cloud offerings.
  • It's most probably more cost efficient.

Furthermore, this whole setup is operationally very stable:

  • The Google Cloud services (Pub/Sub, Storage, BigQuery, LoadBalancer) are very stable on their own.
  • Running our server workloads on kubernetes also makes them very resilient - especially as Snowplow scales horizontally extremely well. If you need more resiliency (or throughput), simply add another replica.
  • Both of these points make Snowplow rather a "set up and forget" type of solution with very little operational effort.

NOTE: As these are arguably quite a lot of steps to take, I created a convenient bootstrap script which sets up basically all of these components in one run. If you want access to this script, please contact me via the contact form at the bottom of the page.
