Troubleshooting Kafka

This section is aimed at those who are running into Kafka problems but are not yet familiar with Kafka. At a high level, Kafka is a message broker that stores messages in a log format, much like an append-only array. It receives messages from producers that write to a specific topic, and delivers them to consumers subscribed to that topic. The consumers can then process the messages.
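
To make the flow concrete, here is a minimal produce/consume round trip using the console tools bundled inside the Kafka container (the same image that the commands later in this guide rely on). This is just a sketch for experimenting: test-topic is a throwaway name, not a topic Sentry uses.

# Write one message to a throwaway topic (-T disables the TTY so the pipe works):
echo "hello kafka" | docker compose exec -T kafka kafka-console-producer \
  --bootstrap-server kafka:9092 --topic test-topic

# Read it back from the beginning of the log, exiting after one message:
docker compose exec kafka kafka-console-consumer \
  --bootstrap-server kafka:9092 --topic test-topic \
  --from-beginning --max-messages 1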

Internally, when a message enters a topic, it is written to a certain partition. You can think of a partition as a physical box that stores messages for a specific topic. In a distributed Kafka setup, each partition may be stored on a different machine/node, but if you only have a single Kafka instance, all the partitions are stored on the same machine.
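
You can inspect how a topic is split into partitions with kafka-topics --describe; for instance, for the events topic that self-hosted Sentry uses:

docker compose exec kafka kafka-topics \
  --bootstrap-server kafka:9092 --describe --topic events

# On a single-node setup the output lists every partition with the same
# leader broker, e.g.:
#   Topic: events  Partition: 0  Leader: 1  Replicas: 1  Isr: 1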

When a producer sends a message to a topic, the message either lands on a fixed partition derived from its partition key (example: partition 1, partition 2, etc.), or, if no key is set, on a partition chosen in a round-robin manner. A consumer subscribes to a topic and is automatically assigned one or more partitions by Kafka, then starts receiving messages from those partitions. Important to note: the number of consumers in a consumer group cannot usefully exceed the number of partitions. If you have more consumers than partitions, the extra consumers will receive no messages.
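
As a sketch of key-based routing, the console producer can send keyed messages when you enable its parse.key and key.separator properties; messages sharing a key always land on the same partition (test-topic is again just an example topic):

# Both messages carry the key "user-42", so both land on the same partition:
printf 'user-42:first event\nuser-42:second event\n' | \
  docker compose exec -T kafka kafka-console-producer \
    --bootstrap-server kafka:9092 --topic test-topic \
    --property parse.key=true --property key.separator=: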

Each message in a topic has an "offset" (a number), which you can think of as an index into an array. The consumer uses the offset to track where it is in the log and which message it consumed last. Offsets are scoped to a partition, so two partitions in the same topic can contain the same offset numbers. If a consumer is not able to keep up with the producer, it will start to lag behind. Most of the time, you want lag to be as low as possible. The easiest fix for lagging is to add more partitions and increase the number of consumers.
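
Lag is simply the distance between the end of the log and a consumer group's current position. The same kafka-consumer-groups command used in the recovery steps below shows it per partition (snuba-consumers is one of the groups self-hosted Sentry uses):

docker compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 --describe --group snuba-consumers

# CURRENT-OFFSET is where the group is, LOG-END-OFFSET is where the log ends,
# and LAG = LOG-END-OFFSET - CURRENT-OFFSET. Ideally LAG stays near 0.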

One difference from other queues and brokers, such as RabbitMQ or Redis, is that Kafka has a concept called "retention time". Messages stored in Kafka and consumed by consumers won't be deleted immediately. Instead, they are kept for a certain period of time. By default, self-hosted Sentry runs Kafka with a retention time of 24 hours, which means messages older than 24 hours are deleted. If you want to change the retention time, you can do so by modifying the KAFKA_LOG_RETENTION_HOURS environment variable on the kafka service.
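
KAFKA_LOG_RETENTION_HOURS sets the broker-wide default. If you only want to change retention for a single topic, a per-topic retention.ms override is a possible alternative; a sketch using the events topic (172800000 ms = 48 hours):

# Show the current per-topic overrides (empty means the broker default applies):
docker compose exec kafka kafka-configs --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name events --describe

# Override retention for this one topic to 48 hours:
docker compose exec kafka kafka-configs --bootstrap-server kafka:9092 \
  --entity-type topics --entity-name events --alter \
  --add-config retention.ms=172800000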

You can visualize the Kafka consumers and their offsets by bringing an additional container, such as Kafka UI or Redpanda Console, into your Docker Compose setup.

Kafka UI:

kafka-ui:
  image: provectuslabs/kafka-ui:latest
  restart: on-failure
  environment:
    KAFKA_CLUSTERS_0_NAME: "local"
    KAFKA_CLUSTERS_0_BOOTSTRAPSERVERS: "kafka:9092"
    DYNAMIC_CONFIG_ENABLED: "true"
  ports:
    - "8080:8080"
  depends_on:
    - kafka

Or, you can use Redpanda Console:

redpanda-console:
  image: docker.redpanda.com/redpandadata/console:latest
  restart: on-failure
  entrypoint: /bin/sh
  command: -c "echo \"$$CONSOLE_CONFIG_FILE\" > /tmp/config.yml; /app/console"
  environment:
    CONFIG_FILEPATH: "/tmp/config.yml"
    CONSOLE_CONFIG_FILE: |
      kafka:
        brokers: ["kafka:9092"]
        sasl:
          enabled: false
      schemaRegistry:
        enabled: false
      kafkaConnect:
        enabled: false
  ports:
    - "8080:8080"
  depends_on:
    - kafka

It's recommended to put this in docker-compose.override.yml rather than modifying your docker-compose.yml directly. The UI can then be accessed via http://localhost:8080/ (or http://<your-ip>:8080/ if you're using a reverse proxy).

Offset Out Of Range Error

You may see an exception like the following in your consumer logs:

Exception: KafkaError{code=OFFSET_OUT_OF_RANGE,val=1,str="Broker: Offset out of range"}

This happens when Kafka and the consumers get out of sync. Possible reasons are:

  1. Running out of disk space or memory
  2. Having a sustained event spike that causes very long processing times, causing Kafka to drop messages as they go past the retention time
  3. Date/time out of sync issues due to a restart or suspend/resume cycle

Ideally, you want to have zero lag for all consumer groups. If a consumer group has a lot of lag, you need to investigate whether it's caused by a disconnected consumer (e.g., a Sentry/Snuba container that's disconnected from Kafka) or a consumer that's stuck processing a certain message. If it's a disconnected consumer, you can either restart the container or reset the Kafka offsets to "earliest". Otherwise, you can reset the Kafka offsets to "latest". The commands below can help you tell the two cases apart.
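
A quick way to check (using events-consumer as an example container; substitute the one behind your lagging group):

# Is the container even running? A restart loop shows up in the STATUS column:
docker compose ps events-consumer

# Tail the logs: connection errors suggest a disconnected consumer, while the
# same message being retried over and over suggests a stuck one:
docker compose logs --tail 100 events-consumer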

The proper solution is as follows (reported by @rmisyurev). This example uses the snuba-consumers consumer group with the events topic; your consumer group name and topic name may differ.

  1. Shut down the corresponding Sentry/Snuba containers that use the consumer group (you can find them by inspecting the docker-compose.yml file):
    docker compose stop snuba-errors-consumer snuba-outcomes-consumer snuba-outcomes-billing-consumer
    
  2. List the consumer groups:
    docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --list
    
  3. Get the consumer group info:
    docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --describe
    
  4. Preview what will happen to the offsets using a dry run (optional):
    docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --dry-run
    
  5. Set the offsets to latest and execute:
    docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --group snuba-consumers --topic events --reset-offsets --to-latest --execute
    
  6. Start the previously stopped Sentry/Snuba containers:
    docker compose start snuba-errors-consumer snuba-outcomes-consumer snuba-outcomes-billing-consumer
    

Another option is as follows (reported by @gabn88):

  1. Set the offsets of all groups and all topics to latest and execute:
    docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 --all-groups --all-topics --reset-offsets --to-latest --execute
    

Unlike the proper solution, this resets the offsets of every consumer group and every topic at once.
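
As with the proper solution, offsets can only be reset while the consumer groups have no active members, so stop the consumer containers first. If you want to preview the effect, the same --dry-run flag from step 4 above works here too:

docker compose exec kafka kafka-consumer-groups --bootstrap-server kafka:9092 \
  --all-groups --all-topics --reset-offsets --to-latest --dry-run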

If neither of those options resolves the error, the last resort is to wipe the Kafka volume and recreate it. This causes data loss: any messages still sitting in Kafka will be gone.

  1. Stop the instance:

    docker compose down --volumes
    
  2. Remove the Kafka volume:

    docker volume rm sentry-kafka
    
  3. Run the install script again:

    ./install.sh
    
  4. Start the instance:

    docker compose up --wait
    

If you notice a very slow ingestion speed and consumers lagging behind, the consumers are likely unable to keep up with the rate at which messages are being produced. To fix this, increase the number of partitions on the affected topic, then increase the number of consumers reading from it.

  1. For example, if you see that the ingest-consumer consumer group has a lot of lag and it's subscribed to the ingest-events topic, first increase the number of partitions for that topic:
    docker compose exec kafka kafka-topics --bootstrap-server kafka:9092 --alter --partitions 3 --topic ingest-events
    
  2. Validate that the topic now has 3 partitions:
    docker compose exec kafka kafka-topics --bootstrap-server kafka:9092 --describe --topic ingest-events
    
  3. Then, increase the number of consumers in the consumer group. From docker-compose.yml you can see that the container consuming the ingest-events topic with the ingest-consumer consumer group is the events-consumer container. Rather than modifying docker-compose.yml directly, create a new file called docker-compose.override.yml and add the following:
    services:
      events-consumer:
        deploy:
          replicas: 3
    
    This will increase the number of consumers for the ingest-consumer consumer group to 3.
  4. Finally, recreate the events-consumer container by running:
    docker compose up -d --wait events-consumer
    
  5. Observe the logs of events-consumer; you should not see any consumer errors. Let it run for a while (usually a few minutes to a few hours) and watch the Kafka topic lag. You can re-check partition assignment and lag with the command shown after this list.
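
With 3 partitions and 3 container replicas, you should now see up to three active members in the consumer group, each owning one partition:

docker compose exec kafka kafka-consumer-groups \
  --bootstrap-server kafka:9092 --describe --group ingest-consumer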

If you want to reduce the disk space used by Kafka, you'll need to carefully calculate how much data you're ingesting and how much data loss you can tolerate, and then follow the recommendations in this awesome StackOverflow post or this post on our community forum.
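
Before tuning anything, it may help to see how much disk each topic actually occupies. The kafka-log-dirs tool that ships with Kafka prints a JSON report with a size in bytes per partition; the topic list here is just an example:

docker compose exec kafka kafka-log-dirs --bootstrap-server kafka:9092 \
  --describe --topic-list events,ingest-events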

You could, however, add these to the Kafka container's environment variables as a starting point (suggested by @csvan):

services:
  kafka:
    # ...
    environment:
      KAFKA_LOG_RETENTION_HOURS: "24"
      KAFKA_LOG_CLEANER_ENABLE: "true"
      KAFKA_LOG_CLEANUP_POLICY: "delete"
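
For the new variables to take effect, recreate the kafka container. Note that segment deletion happens in a periodic background task, so disk usage shrinks gradually rather than instantly:

# Recreate the kafka container so it picks up the new environment:
docker compose up -d kafka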