Scylla Open Source Release 3.0.2


Scylla Release

The Scylla team announces the release of Scylla Open Source 3.0.2, a bugfix release of the Scylla Open Source 3.0 stable branch. Scylla Open Source 3.0.2, like all past and future 3.x.y releases, is backward compatible and supports rolling upgrades.

Related links:

Issues solved in this release:

  • CQL: ALLOW FILTERING returned the wrong result when the SELECT statement listed specific columns #4121
  • Hinted Handoff: in some cases, a subset of updates stored on the coordinator for hinted handoff were not sent, so the data was only synchronized at the next repair #4122
  • A large number of SSTables caused streaming to hold the CPU for too long, causing stalls #4005
  • In some rare cases, during service stop, Scylla exited #4107

The post Scylla Open Source Release 3.0.2 appeared first on ScyllaDB.


Scylla Manager Now Available for Scylla Open Source Users


Scylla Manager

Today we announce the release of Scylla Manager 1.3.1, a management system that automates maintenance tasks on a Scylla cluster. This release provides support for encrypted CQL communication with Scylla nodes and improves SSH stability. We are also shipping an official, publicly available, Scylla Manager Docker image. Most importantly, while past versions of Scylla Manager supported only Scylla Enterprise clusters, Scylla Manager 1.3.1 now also supports management of Scylla Open Source clusters of up to 5 nodes.

Scylla Manager joins a growing list of Scylla management and monitoring software, including the open source Scylla Monitoring Stack, which provides cluster metrics and graphical dashboards via Prometheus and Grafana.

Scylla Manager provides a robust suite of tools aligned with Scylla’s shard-per-core architecture for the easy and efficient management of Scylla clusters. Scylla Manager performs a regular health check on the nodes of a cluster, ensuring awareness of any node degradation or downtime. Scylla Manager can also control cluster repairs on a per-shard basis, optimizing repair speed while limiting the impact on performance. This provides maximum repair parallelism on a node and shortens the overall repair time without generating unnecessary load.

Scylla Manager Block Diagram

Helpful links

New license

Scylla Open Source users now have the ability to automate tasks and other work using Scylla Manager. Previously, Scylla Manager was available only to Scylla Enterprise users. With the new license, you can use all the Scylla Manager features with clusters running Scylla Open Source. The only restriction is that the clusters you connect to are limited to 5 nodes. The full license agreement can be read at: https://www.scylladb.com/scylla-manager-software-license-agreement/.

Docker

With the introduction of the new license, and to make it easier for open source users, we are now shipping an official Scylla Manager Docker image. The image is available on Docker Hub: https://hub.docker.com/r/scylladb/scylla/scylla-manager. It is based on a CentOS 7 image and contains both the Scylla Manager server and sctool. Running Scylla Manager has never been easier: you can start it with just a few lines of docker-compose.
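For illustration, a minimal docker-compose.yml might look like the sketch below. The service layout, image tag, and resource flags here are assumptions rather than an official file; check the Docker Hub page and the Scylla Manager documentation for the exact setup (Scylla Manager also needs a Scylla instance to store its own data, which is what the first service provides).

version: "3"
services:
  scylla-manager-db:
    # Small Scylla instance used by Scylla Manager as its own data store (assumed layout).
    image: scylladb/scylla
    command: --smp 1 --memory 1G
  scylla-manager:
    # Scylla Manager server plus the sctool CLI (image name assumed from the announcement above).
    image: scylladb/scylla-manager
    depends_on:
      - scylla-manager-db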

SSL/TLS support

Scylla Manager 1.3 introduced Health Check functionality that monitors CQL connectivity on cluster nodes. This release extends that support to clusters with client-server encryption enabled. Below is a sample Scylla configuration that can be used to enable client-server encryption on a Scylla node. For more details on enabling encryption, refer to our security documentation: https://docs.scylladb.com/operating-scylla/security/client_node_encryption/

# Enable or disable client/server encryption.
client_encryption_options:
    enabled: true
    certificate: /etc/scylla/db.crt
    keyfile: /etc/scylla/db.key
    # truststore: <none, use system trust>
    # require_client_auth: False
    # priority_string: <none, use default>

Scylla Manager Health Check works out of the box with encrypted CQL, provided that require_client_auth is not set on the node. You can check whether encryption is enabled using the sctool status command, which now has a new SSL column indicating whether a given node has encryption enabled.

sctool status -c prod-cluster
Datacenter: dc1
╭──────────┬─────┬────────────────╮
│ CQL      │ SSL │ Host           │
├──────────┼─────┼────────────────┤
│ UP (2ms) │ ON  │ 192.168.100.11 │
│ UP (3ms) │ ON  │ 192.168.100.12 │
│ UP (5ms) │ ON  │ 192.168.100.13 │
╰──────────┴─────┴────────────────╯

Datacenter: dc2
╭───────────┬─────┬────────────────╮
│ CQL       │ SSL │ Host           │
├───────────┼─────┼────────────────┤
│ UP (12ms) │ ON  │ 192.168.100.21 │
│ UP (11ms) │ ON  │ 192.168.100.22 │
│ UP (13ms) │ ON  │ 192.168.100.23 │
╰───────────┴─────┴────────────────╯

If Scylla requires the client to present a certificate (client_encryption_options.require_client_auth set to true), then the certificate and key must be provided to Scylla Manager. This is typically done when adding a cluster; you may also update an existing cluster with this information.

sctool cluster update -c=prod-cluster --ssl-user-cert-file <path to client certificate> --ssl-user-key-file <path to key associated with the certificate>

SSL/TLS for Scylla Manager’s data

When storing Scylla Manager data on a remote cluster, you may also enable encryption. To do so, find the database section in the configuration file and set the ssl parameter to true.

# Scylla Manager database, used to store management data.
database:
# Enable or disable client/server encryption.
  ssl: true

If you want to specify a client certificate, disable validation, or use a custom certificate authority to validate the node’s certificate, the ssl configuration section, located just below the database section, contains the relevant options.

# Optional custom client/server encryption options.
#ssl:
# CA certificate used to validate server cert. If not set will use the host's root CA set.
# cert_file:
#
# Verify the hostname and server cert.
#  validate: true
#
# Client certificate and key in PEM format. It has to be provided when
# client_encryption_options.require_client_auth=true is set on server.
#  user_cert_file:
#  user_key_file:

The database configuration section also gained a local_dc parameter that should be set when working with a multi-datacenter cluster. It instructs the database driver to prioritize hosts in that datacenter when selecting a host to query.
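For example, a minimal sketch of that section (dc1 is a placeholder for the name of the datacenter local to the Scylla Manager server):

database:
  ssl: true
  # Prefer hosts in this datacenter when querying Scylla Manager's own data store.
  local_dc: dc1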

SSH keepalive

Scylla Manager communicates with the Scylla API using HTTP over an SSH tunnel. If a connection from Scylla Manager to a Scylla node goes through a proxy and the connection is idle, the proxy may decide to drop it. To avoid this situation we are introducing a keepalive mechanism. The keepalive is controlled by two parameters: server_alive_interval, which specifies the interval at which keepalive messages are sent, and server_alive_count_max, which specifies the maximum number of failed attempts. This is very similar to how you can configure the Unix ssh command with the ServerAliveInterval and ServerAliveCountMax options. The keepalive settings can be modified globally in the new ssh section of the Scylla Manager configuration file.

# SSH global configuration, SSH is used to access scylla nodes. Username and
# identity file are specified per cluster with sctool.
#ssh:
# Alternative default SSH port.
#  port: 22
#
# Interval to send keepalive message through the encrypted channel and
# request a response from the server.
#  server_alive_interval: 15s
#
# The number of server keepalive messages which may be sent without receiving
# any messages back from the server. If this threshold is reached while server
# keepalive messages are being sent, ssh will disconnect from the server,
# terminating the session.
#  server_alive_count_max: 3

If you are part of the Scylla open source community and install Scylla Manager, we’d love to hear from you! Let us know by contacting us directly, or join the discussion on Slack.

The post Scylla Manager Now Available for Scylla Open Source Users appeared first on ScyllaDB.

Moving from Cassandra to Scylla via Apache Spark: The Scylla Migrator


Scylla and Spark

Welcome to a whole new chapter in our Spark and Scylla series! This post will introduce the Scylla Migrator project – a Spark-based application that will easily and efficiently migrate existing Cassandra tables into Scylla.

Over the last few years, ScyllaDB has helped many customers migrate from existing Cassandra installations to a Scylla deployment. The migration approach is detailed in this document. Briefly, the process consists of several phases:

  1. Create an identical schema in Scylla to hold the data;
  2. Configure the application to perform dual writes;
  3. Snapshot the historical data from Cassandra and load it into Scylla;
  4. Configure the application to perform dual reads and verify data loaded from Scylla;
  5. Decommission Cassandra.

The Scylla Migrator project is meant to considerably speed up step 3; let’s see how it works!

1. An overview of the Migrator

The Scylla Migrator project is a Spark-based application that does one simple task: it reads data from a table on a live Cassandra instance and writes it to Scylla. It does so by running a full scan on the source table in parallel; the source table is divided into partitions, and each task in the Spark stage (refer back to Part 1 of the series to review these terms) copies a partition to the destination table.

By using Spark to copy the data, we gain the ability to distribute the data transfer between several processes on different machines, leading to much improved performance. The migrator also includes several other handy features:

  • it is highly resilient to failures, and will retry reads and writes throughout the job;
  • it will continuously write savepoint files that describe which token ranges have already been transferred: should the Spark job be stopped, these savepoint files can be used to resume the transfer from the point at which it stopped;
  • it can be configured to preserve the WRITETIME and TTL attributes of the fields that are copied;
  • it can handle simple column renames as part of the transfer (and can be extended to handle more complex transformations – stay tuned!).

Hopefully, these features will convince you to use the migrator to load data into your next Scylla deployment. Let’s see how you’d set it up.

There’s one downside to using the migrator: since it reads the data directly from the Cassandra database, it will compete for resources with other workloads running on Cassandra. Our benchmarks, however, show that since there’s much more work to be performed on the Spark and Scylla side while copying the data, Cassandra remains relatively undisturbed.

2. Using the Migrator

First off, since the migrator runs on Spark, you need a Spark deployment.

2.1 Deploying Spark

Luckily, Spark is pretty simple to set up. For a basic setup on a cloud provider, you can provision a few instances, unpack the Spark distribution on them, and start the Spark components from the command line.
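The exact commands depend on your Spark version and installation path; a sketch using the standard Spark standalone scripts (the master address is a placeholder) looks like this:

# On the master node:
./sbin/start-master.sh

# On each slave node: 8 worker instances, each allowed to run 2 tasks in parallel
SPARK_WORKER_INSTANCES=8 SPARK_WORKER_CORES=2 \
  ./sbin/start-slave.sh spark://<spark-master-ip>:7077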

Note that this configuration starts 8 Spark workers on each slave node, where each worker can execute 2 tasks in parallel. We’ll discuss these configuration options in further detail in the “Benchmarking the Migrator” section.

There are a few other options available for deploying Spark – you could use a managed solution, like Amazon EMR or Google Cloud Dataproc. You could also deploy Spark and the migrator directly onto a Kubernetes cluster. These are beyond the scope of this article, but if you’d like to contribute instructions for running the Migrator in these setups, please feel free to send a pull request.

The capacity required for the cluster depends on the size of the Scylla cluster. We can offer the following rules of thumb:

  • allocate 1 Spark CPU core for every 2 Scylla cores;
  • allocate 2GB of RAM for every Spark CPU core.

So, for example, if your Scylla cluster has a total of 48 CPU cores, you’d need a total of 24 CPU cores and 48GB of RAM for Spark. You could use 3 c5.2xlarge instances for that cluster on AWS.

IMPORTANT: verify that the Spark nodes are added to a security group that can access port 9042 on Cassandra and Scylla.

2.2 Running the Migrator

Now, once you have your Spark nodes up and running, ssh into the Spark master node and clone the Migrator repository. Install sbt (Scala’s build tool) on the machine, and run the build.sh script in the Migrator’s repository.

Next, duplicate the config.yaml.example file to configure your migration. The file is heavily commented, and we’ll discuss how to tune the parameters shortly, but for now, change the host, keyspace and table settings under source and target to match your environment.

NOTE: the table and keyspace you’re copying into must already exist on Scylla before the migration starts, with the same schema.

With the config file created, you can now start the migrator. But wait! Don’t just start it in the SSH session; if you do that, and your session gets disconnected, your job will stop. We recommend that you install tmux (and read this excellent tutorial) and run the migrator inside it. This way, you can detach and reattach the shell running the migrator job.

Once inside tmux, launch the job:
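The launch is a spark-submit invocation; the sketch below assumes the assembly jar produced by build.sh and the Migrator's main class name as documented in the repository, so double-check both against your checkout:

spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://<spark-master-ip>:7077 \
  --conf spark.scylla.config=config.yaml \
  <path to the scylla-migrator assembly jar>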

2.3 Keeping track of the transfer process

By default, Spark is pretty chatty in its logs, so you’ll need to skip past some of the lines that start scrolling through. The migrator will print out some diagnostics when it starts up; for example:

This output will help you make sure that the schema is being retrieved correctly from Cassandra and being processed correctly for writing to Scylla. When the actual transfers start, you’ll see output similar to this:

You can keep track of the progress using the Spark application UI, conveniently located at http://<spark master IP>:4040. The progress of the job in the UI reflects the actual progress of the transfer.

We also recommend monitoring Scylla closely during the transfer using the Scylla Monitoring dashboards. The Requests Served per Shard metric should stay constant; if it fluctuates, this may indicate that Scylla is not being saturated and more parallelism is required (see the tuning section below). The Write Timeouts per Second per Shard metric should be mostly zero. If it starts to rise, it may mean that you’re using too much parallelism for the migrator.

2.4 Stopping and resuming the job

As the migrator works, you’ll see it periodically log lines similar to these:

Every 5 minutes, the migrator will save a snapshot of the config file to the directory specified by the savepoints.path setting. Apart from your settings, the config file will also contain a list of token ranges that have already been transferred.

If you need to temporarily stop the job, or in the unlikely event that the migrator crashes, you could use these savepoint files to resume the migrator from where it stopped. This is done by copying one of these savepoint files and using it as the config for the job:
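For example (the savepoint file name below is a placeholder; the invocation otherwise mirrors the original launch command):

cp /path/to/savepoints/savepoint_<timestamp>.yaml resume.yaml
spark-submit --class com.scylladb.migrator.Migrator \
  --master spark://<spark-master-ip>:7077 \
  --conf spark.scylla.config=resume.yaml \
  <path to the scylla-migrator assembly jar>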

2.5 Timestamp preservation

By default, the Migrator will preserve the WRITETIME and TTL attributes for migrated rows. This is great for creating exact copies of your data if you rely on these timestamps in your application. There’s one caveat: this option is incompatible with tables containing collection data types (lists, maps, sets), so in that case you’ll need to disable it with preserveTimestamps: false in the config file.

This restriction might be lifted in the future.

2.6 Renaming fields

As mentioned, you could also rename columns while transferring the data. For example, if you’d like to copy the column called orig_field into the column dest_field, you can use this setting in the config file:
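A sketch of the relevant config section (the exact key layout should be checked against config.yaml.example):

renames:
  - from: orig_field
    to: dest_field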

You could of course add more such renames by adding more items to the list:
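For example (the additional column names here are purely illustrative):

renames:
  - from: orig_field
    to: dest_field
  - from: another_orig_field
    to: another_dest_field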

Note, again, that the table on Scylla must already be defined with the new column names.

2.7 Tuning the migrator

Lastly, here are a few tips for tuning the migrator’s performance.

We recommend starting 2 Spark workers on each node, but to start them with SPARK_WORKER_CORES=8. This will allow Spark to run 8 transfers in parallel on each worker, which will improve the throughput of the migration.

To take advantage of the above, it is extremely important to use a high source.splitCount value. This value determines how the source table is divided between the Spark tasks. If there are fewer splits than worker cores, some cores will be idle. On the other hand, if there are so many splits that each one contains less than 1MB of data, performance will suffer due to the scheduling overhead.

The default setting of source.splitCount: 256 should be sufficient for migrations of tables larger than 10GB. However, for larger transfers and larger Spark cluster sizes, you may need to increase it.

And finally, if the Scylla table you’re copying into is not in use, it is possible to gain a nice speed up (about 25%) by disabling compaction on it. You should of course re-enable it when the migration is finished. Do note that disabling compaction may cause Scylla to use a larger amount of disk space should any write retries occur, so keep an eye on the free disk space.

Disabling and re-enabling compaction is done using the following CQL statements:
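For example, assuming a keyspace ks and table tbl (names are placeholders) that use the default SizeTieredCompactionStrategy:

-- Disable compaction for the duration of the migration
ALTER TABLE ks.tbl WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': false};

-- Re-enable it once the migration has finished
ALTER TABLE ks.tbl WITH compaction = {'class': 'SizeTieredCompactionStrategy', 'enabled': true};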

3. Benchmarking the Migrator

So let’s see some numbers – how does the migrator fare against the sstableloader on an identical setup? We’ll start with the migrator first. We used the following infrastructure for the benchmarks:

  • 6 Cassandra 3.11 nodes, running on i3.4xlarge instances (a total of 96 cores and 732GB of RAM);
  • 3 Scylla 3.0-RC3 nodes, running on i3.4xlarge instances (a total of 48 cores and 366GB of RAM);
  • 3 Spark 2.4 nodes, running on r4.4xlarge instances (a total of 48 cores and 366GB of RAM).

The test data consisted of 1TB of random data, generated as 1KB rows with a replication factor of 3.

The first run used 23 Spark workers, started with SPARK_WORKER_CORES=2, for a total of 46 parallel transfers. This configuration achieved an average transfer rate of 2GB/minute. We can see, in the screenshots from the Scylla Monitoring dashboards, that Scylla is not being saturated:

2GB/minute

Restarting the workers with SPARK_WORKER_CORES=8, for a total of 184 parallel transfers, achieved a transfer rate of 3GB/minute – a 50% increase! We can see from the same graphs that Scylla is now being saturated much more consistently:

3GB/minute

Disabling compaction also helped squeeze out some more performance, for a transfer rate of 3.73GB/minute:

3.73GB/minute

Increasing the concurrency to keep the load steadier did not help, unfortunately, and only resulted in write timeouts.

4. Summary

In this post, we’ve seen how you could deploy and configure Spark and the Scylla Migrator and use them to easily transfer data from Cassandra to a Scylla deployment. We encourage you to kick the tires on this project, use it to perform your migrations and report back if you hit any problems.

The post Moving from Cassandra to Scylla via Apache Spark: The Scylla Migrator appeared first on ScyllaDB.

Scylla Users Share Their Stories


 

Customer Stories: Comcast, Zenly, Grab

“It worked out of the box. We didn’t have to tune anything. It was all just easy.”

— Derek Ramsey, Sensaphone

Our favorite and most important discussions are always with our users. They tell us about their experiences with our database, how they came about using our technology, what they like about it, where we could improve, and much more. These conversations take place any number of ways — on our busy Slack channel, Zoom meetings, emails, phone calls, you name it. Of course, our most customer-centric event is our Scylla Summit user conference.

“What Cassandra can do with 600 nodes, Scylla can do with 60.”

— Murukesh Muhanan, Yahoo! Japan

At our Scylla Summit last November, we took the opportunity to interview a number of users with a video camera rolling. We recently added these interview clips to our site — you’ll see a carousel of these videos on our Users page. The fun part for us isn’t just hearing all the great things they have to say about Scylla; it’s seeing the innovative solutions our users are able to create with our technology, from world-renowned entertainment apps to taxi hailing, social apps, tech industry leaders, and a wide variety of IoT systems.

Customer Stories: Yahoo Japan, Numberly, Sensaphone

“With Scylla, there’s no JVM so cutting out those problems from the list and just having to configure things properly for a Cassandra-type system or a Dynamo-type system is great.”

— Keith Lohnes, Software Engineer, IBM

We encourage you to watch what some of our users have to say in these videos. And have a look at our library of written case studies as well — we now have almost 40 use cases collected on our Users page.

“It’s critical we have a system we can trust, that’s efficient and that maintains consistent low latencies.”

— Alexys Jacob, CTO, Numberly

Interested in sharing your use of Scylla? Please let us know. We’d be glad to share your Scylla story with the world!

The post Scylla Users Share Their Stories appeared first on ScyllaDB.

Scylla Open Source Release 3.0.3


Scylla Release

The Scylla team announces the release of Scylla Open Source 3.0.3, a bugfix release of the Scylla Open Source 3.0 stable branch. Scylla 3.0.3, like all past and future 3.x.y releases, is backward compatible and supports rolling upgrades.

Related links:

Issues solved in this release:

  • Counters: Scylla rejects SSTables that contain counters created by Cassandra 2.0 and earlier. Due to #4206, Scylla mistakenly rejected some SSTables created by Cassandra 2.1 as well.
  • TLS: Scylla now disables TLS 1.0 by default and enforces a minimum of 128-bit ciphers #4010. More on encryption in transit (client to server) here.
  • Core: In very rare cases, the commit log replay fails. Commit log replay is used after a node was unexpectedly restarted #4187
  • Streaming: in some cases, Scylla exits due to a failed streaming operation #4124
  • A rare race condition between a node restart and schema updates may cause Scylla to exit #4148

The post Scylla Open Source Release 3.0.3 appeared first on ScyllaDB.

The Complex Path for a Simple Portable Python Interpreter, or Snakes on a Data Plane


We needed a Python interpreter that can be shipped everywhere. You won’t believe what happened next!

Snakes on a Data Plane

“When I said I wanted portable Python, this is NOT what I meant!”

In theory, Python is a portable language. You can write your script locally and distribute it to other machines with the Python interpreter. In practice, things can go wrong for a variety of reasons.

The first and simpler problem is the module system: for a script to run, all of the modules it uses must be installed. For Python-savvy users, installing them is not a problem. But for a software vendor that wants to guarantee a hassle-free experience for all sorts of users the module dependency is not always easily met.

Second, Python is not a language, but actually two languages: under the name Python, there are Python2 and Python3. And while Python2 is set to be deprecated soon (which has been a true statement for the past decade, btw!) and bleeding-edge organizations will deal with that nicely, the situation is much different in old-school enterprise: RHEL7, for instance, does not ship with a Python3 interpreter at all and will still be supported under Red Hat policies for years to come.

Using Python3 in such distributions is possible: third-party organizations like EPEL produce RHEL-compatible binaries for the interpreter. But once more, old-school enterprise usually means security policies are in place that either disallow installing packages from untrusted sources, or require installing the Python scripts on machines with no connection to the internet. EPEL would require internet connectivity to update dependencies if some packages fall a few minor versions behind, which makes this a no-go.

Scylla is the Real Time Big Data Database. In organizations with strict security policies, the database nodes tend to be even stricter than average. While our main database codebase is written in native C++, we adopt higher-level languages like Python for our configuration and deployment scripts. In dealing with enterprise-grade organizations, we have dealt with the problem of how to ship our Python scripts to our customers in a way that works across a wide variety of setups and security policies.

Harry Potter, a Python Interpreter

One of many approaches to a Python interpreter. Requires broomstick package for portability.

We considered many approaches: should we just rewrite everything in C++? Should we make sure that everything we do works with Python2 as well, and uses as few modules as possible? Should we compile the Python binaries to C with Cython or Nuitka? Should we rely on virtualenv or PyInstaller to ship a complete environment? All of those solutions had pros and cons, and ultimately none of them would work across the wide variety of scenarios in which our database seeks to be installed.

In this article we will very briefly discuss those alternatives, and describe in detail the solution we ended up employing: we now distribute a full Python3 interpreter together with our scripts, one that can be executed on any Linux distribution.

“Why not X?” Or, The Top Three Reasons We Didn’t Use Your Pet Suggestion!

Ophidiophobia or Fear of Snakes; courtesy SantiMB.Photos

Image courtesy SantiMB.Photos; used with permission

Computer science is hard. While cosmologists tackle hard questions like “where do we come from?”, anyone working with code has to deal with much harder questions like “why didn’t you use X instead?”, where X is any other approach that the reader has in mind. So let’s get that out of the way and discuss the alternatives!

1. Rewrite our scripts

ScyllaDB is an Open Source database mainly written in C++. Since we already have to deploy our C++ code in a portable manner, it wouldn’t be a stretch to just rewrite the Python scripts. We know what you are thinking and sure, we could also have rewritten it in Go, Lua, Rust, Bash, Fortran or Cobol.

However, we wrote those scripts in Python initially for a reason: configuration and deployment are not performance critical, writing that kind of code in Python is easy, and Python is well known among developers from many backgrounds.

Rewriting something that already does its job just fine is even worse, since this is time we could be spending somewhere else. Not a chance.

2. Write everything to also work with Python2

This would be like coding with shackles: many modules that we already use are not even available for Python2 anymore, and soon the entire Python2 language will be no more. We would rather be free to code as we will, without having to worry about that. Making sure that changes to the scripts don’t break Python2 compatibility would also require testing infrastructure (we still use human coders, the kind that every now and then forgets something).

3. Just compile it with Nuitka, Cython, use PyInstaller, or whatever!

This is where things get interesting: we did very seriously consider those alternatives. The problem is that all of them will generate some standalone installer that ships with its dependencies. But such installer cannot be shipped everywhere.

Let’s take a look, for instance, at what Cython generates. Consider the following script:
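The original script is not reproduced here; a minimal stand-in consistent with the discussion below (it imports the YAML module, which becomes relevant shortly) could be:

# hello.py
import yaml

print(yaml.dump({'hello': 'world'}))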

We can tell Cython to compile it with the environment embedded into a single binary, and in theory that’s what we would want: we could then distribute the resulting binary. But what does it look like in the end? Cython allows us to compile the Python script and generate an ELF executable binary:
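A sketch of such an invocation (exact flags vary with the Python version; on newer releases python3-config needs an extra --embed flag to emit -lpython):

# Generate C code with an embedded main() and compile it against libpython
cython --embed -3 hello.py -o hello.c
gcc hello.c -o hello $(python3-config --cflags) $(python3-config --ldflags)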

ELF binaries in Linux can consume shared libraries in two ways: they can be loaded at program startup time, or dynamically loaded during execution. The ldd tool can be used to inspect which libraries the program will need during startup. Let’s see what ldd reports for the Cython-generated binary:

There are two problems with the output above. First, the list is deceptively small. Since hello uses the YAML library, we would expect it to depend on it. The strace utility can be used to inspect all calls to the Operating System being issued by a program. If we use this to see which files are being opened we can confirm our suspicion:

That is because libyaml is being loaded during execution time. Cython has no knowledge of which libraries will be loaded during execution time and will just trust that those are found in the system.

Another problem is that the resulting binary depends on system libraries from the host system (like the GNU libc, libpython, etc), on their specific versions. So while it can be transferred to a system similar to the host system, it can’t be transferred to an arbitrary system.

The situation with PyInstaller is a bit better. The shared libraries the script uses during execution time are discovered and added to the final bundle:

But we still have the issue that the resulting binary depends on the basic system libraries needed during startup, as the ldd program will tell us:

This is actually discussed in the PyInstaller FAQ, from which we quote the relevant part for simplicity (highlight is ours):

The executable that PyInstaller builds is not fully static, in that it still depends on the system libc. Under Linux, the ABI of GLIBC is backward compatible, but not forward compatible. So if you link against a newer GLIBC, you can’t run the resulting executable on an older system. The supplied binary bootloader should work with older GLIBC. However, the libpython.so and other dynamic libraries still depends on the newer GLIBC. The solution is to compile the Python interpreter with its modules (and also probably bootloader) on the oldest system you have around, so that it gets linked with the oldest version of GLIBC.

As the PyInstaller FAQ notes, a fully static binary (without using shared libraries at all) is usually one way to overcome shared library dependencies. However, such a method presents problems of its own, the most obvious being the final size of the application: if we ship 10 scripts, each script has to be compiled into its own multi-MB bundle separately. This solution quickly becomes non-scalable.

The proposed solution, building on the oldest available system, also doesn’t quite work for our use case: the oldest available systems are exactly the ones in which installing Python3 can be a challenge and tools are not up to date. We rely on modern tools to build, so we often want to do it the other way around.

We tried Nuitka as well, which is an awesome project and can operate as a mixture of what Cython and PyInstaller offer with its --standalone mode. While we won’t detail our efforts here for brevity, the end result has drawbacks similar to PyInstaller. On a side note, both these tools seem to have issues with syntax like __import__("some-string") (since it is not possible to know what will be imported until this is called during program execution), and modules may have to be passed explicitly on the command line in that case. You never know when a dependency-of-your-dependency may do that, so that’s an added risk for our deployments.

Virtualenv also has similar issues. It solves the module-packaging issue nicely, but it will still create symlinks to core Python functionality in the base system. It is just not intended for across-system portable deployments.

Taming the wild Python — Our solution:

Animalist: Fear of Pythons

At this point in our exploration we realized that since our requirements look a bit unique, maybe that means we should invest in our own solution. And while we could invest in enhancing some of the existing solutions (the reader will notice that some of the techniques we used could address some of the shortcomings of PyInstaller and Nuitka as well), we realized that for the same effort we could ship the entire Python interpreter in a way that doesn’t depend on anything in the system and then just use it to execute the scripts. This means not having to worry about any compatibility issue: every single piece of Python syntax will work, there is no need to compile the scripts statically, and we don’t have to compile the code and lose access to the source on the destination machine.

We did that by creating a relocatable interpreter: in simple terms, the interpreter we ship will also ship with all the libraries that we need in relative paths in the filesystem, and everything that is done by the interpreter will refer to those paths. This is similar to what PyInstaller does, with the exception that we also handle glibc, the dynamic loader and the system libraries.

Another advantage of our solution is that if we package 30 scripts with PyInstaller, each of them will have its own bundle of libraries. This is because each resulting bundle will have its own copy of the interpreter and dependencies. In our solution, because we are relocating the interpreter, and all scripts will share that interpreter, all the needed libraries are naturally present only once.

What to include in the interpreter?

The code to generate the relocatable interpreter is now part of the Scylla git tree and is available under the AGPL like the rest of Scylla (although we would be open to moving it to its own project under a more permissive license if there is community interest). It can be found in our GitHub repository. The script used to generate the relocatable interpreter works on any modern Fedora system (since we rely on Fedora tools to map dependencies). Although it has to be built on Fedora, it generates an archive that can then be copied to any distribution. We pass as input to the script the list of modules we will use, for example:

We then use standard rpm management utilities (which is why the generation of the relocatable interpreter is confined to Fedora) to obtain a list of all the files that these modules need, together with their dependencies:

We then copy the resulting files, with some heuristic to skip things like documentation and configuration files, to a temporary location. We organize it so the binaries go to libexec/, and the libraries go to lib/.

At this point, the Python binary still refers to hard coded system library paths. People familiar with the low level inner workings of Linux shared objects will by now certainly think of the usual crude trick to get around this: setting the environment variable LD_LIBRARY_PATH, which tells the ELF loader to search for shared objects in an alternative path first.

The problem with that is that as an environment variable it will be inherited by child processes. Any call to an external program would try to find its libraries in that same path. So code like this:

output = subprocess.check_output(['ls'])

wouldn’t work, since the system’s `ls` needs to use its own shared libraries, not the ones we ship with the Python interpreter.

A better approach is to patch the interpreter binary so as not to depend on environment variables. The ELF format specifies that lookup directories can be specified in the DT_RUNPATH or DT_RPATH dynamic section attributes. And this being 2019, thankfully we have an app for that. The patchelf utility can be used to add that attribute to an existing binary where there was none, so that’s our next step ($ORIGIN is an internal variable in the ELF loader that refers to the directory where the application binary lives):

patchelf --set-rpath $ORIGIN/../lib <python_binary>

Things are starting to take shape: now all the libraries will be taken from ../lib and we can move things around. This works both for libraries loaded at startup and for those loaded at execution time. The remaining problem is that the ELF loader itself has to be replaced, as in practice it has to match the libc in use by the application (which is where the API to load shared libraries during execution time lives).

To solve that, we ship the ELF loader as well! We place the ELF loader (here called ld.so) in libexec/, like the actual Python binary, and inside our bin/ directory, instead of the Python binary itself, we have a trampoline-like shell script that looks like this:
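The script itself is not reproduced here; the following sketch is consistent with the description that follows (the variable names x, b, and d match the explanation below, but the exact script in the Scylla tree may differ):

#!/bin/bash
x="$(readlink -f "$0")"           # real path of this trampoline, resolving symlinks
b="$(basename "$x")"              # name of the binary being invoked
d="$(dirname "$(dirname "$x")")"  # root of the relocatable tree, one level up from bin/
export PYTHONPATH="$d/lib${PYTHONPATH:+:$PYTHONPATH}"
realexe="$d/libexec/$b"           # the actual Python binary we ship
exec "$d/libexec/ld.so" "$realexe" "$@"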

This shell script finds the real path of the executable (stored in “x”), in case you are calling it from a symlink, and then splits it into its basename (“b”) and the root of the relocatable interpreter (“d”). Note that because the binary will be inside ./bin/, we need to descend one level from its dirname.

In the last two lines, once we find out the real location of the script (“d”), we find the location of the dynamic loader (ld.so) and the Python binary ($realexe), which we know will always be in libexec/, and force the invocation to happen using the ELF loader we provided, while setting PYTHONPATH in a way that makes sure the interpreter will find its dependencies. Does it work? Oh yes!

But how do we know it’s really taking its libraries from the right place? Well, aside from trying it on an old target system, we can just look at the output of the ldd tool again. We installed an interpreter into /tmp/python/, and this is what it says:

All of them are coming from their relocatable location. Well, all except for ld-linux-x86_64.so.2, which is the ELF loader. But remember we will force the override of the ELF loader by executing ours explicitly, so we are fine here. The libraries loaded at execution time are fine as well:

And with that, we now have a Python interpreter that can be installed with no dependencies whatsoever, in any Linux distribution!

But wait… no way this works with unmodified scripts

OMG That's Not a Python!

“OMG! That’s not a Python!”

If you are thinking that there is no way this works with unmodified scripts, you are technically correct. The default shebang will point to /usr/bin/python3, and we still have to change every single script to point to the new location. But nothing else aside from that has to be changed. Still, is it even worth the trouble?

It is, if everything can happen automatically. We wrote a script that, given a set of Python scripts, will modify their shebang to point to `/usr/bin/env python3`. The actual script is then replaced by a bash trampoline script much like the one we used for the interpreter itself. That guarantees that the relocatable interpreter precedes everything else in the PATH (so env ends up picking our interpreter, without having to mess with the system’s PATH), and then calls the real script, which now lives in the `./libexec/` directory.

We set the PYTHONPATH as well to make sure we’re looking for imports inside libexec, to allow local imports to keep working. See for instance what happens with scylla_blocktune.py, one of the scripts we ship, after relocating it like this:

And there we have it: now we can distribute the entire directory /tmp/test with as many Python3 scripts as we want and unpack it anywhere we want: they will all run using the interpreter that ships inside it, that in turn can run in any Linux distribution. Total Python freedom!

But how extensible is it, really?

An example of fully extensible Python

For people used to very flexible Python environments with user modules installed via pip, our method of specifying up front which modules go into the relocatable package may not seem very flexible or extensible. But what if, instead of packaging those modules, we were to package pip itself?

To demonstrate that, we ran our script again with the following invocation:

Note that since the pip binary in the base system is itself a python script, we have to modify it as well before we copy it to the destination environment. Once we unpack it, the environment is now ready. PyYAML was not included in the modules list and is not available, so our previous `hello.py` example won’t work:

But once we install it via pip, will we be luckier?

Where did it go? We can see that it is now installed in a relative path inside our relocatable directory:

Which due to the way the interpreter is started through the trampoline scripts, is included in the search path:

And now what? Just run it!

Conclusion

In this article we have detailed ScyllaDB’s uncommon approach to a common problem: how to distribute software, in particular software written in Python, without having to worry about the destination environment. We did that by shipping a relocatable interpreter that has everything it needs in its relative paths and can be passed around without depending on the system libraries. The scripts can then just be executed with that interpreter, with just a minor indirection.

The core of the solution we adopted could have been used with other third-party tools like PyInstaller and Nuitka as well. But in our analysis, at that point it would just be simpler to provide the entire Python interpreter and its environment. It makes for a more robust solution (such as easier handling of execution-time dependencies) without restricting access to the source code, and it is fully extensible: we demonstrated that it is even possible to run pip in the destination environment and from there install whatever one needs.

Since we believe the full depth of this solution is very specific to our needs, we wrote it in a way that plays well with our build system and stopped there. In particular, we only support creating the relocatable interpreter on Fedora. Extending the script to also support Ubuntu would not be very hard, though. The relocatable interpreter is also part of the Scylla repository and not a standalone solution at the moment. If you think we’re wrong in our analysis that this is a narrow and specialized use case, and you could benefit from this too, we would love to hear from you. We are certainly open to making changes to accommodate the wider needs of the community. Reach out to us on Twitter, @ScyllaDB, or drop us a message via our web site.

No actual Pythons were harmed in the writing of this blog.

The post The Complex Path for a Simple Portable Python Interpreter, or Snakes on a Data Plane appeared first on ScyllaDB.

Scylla Enterprise Release 2018.1.10


Scylla Enterprise Release

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.10, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.10 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.10 in coordination with the Scylla support team.

The major fix in this release avoids reactor stalls while merging memtables into the cache. This improvement was gradually added to the open source release, and is now backported to Scylla Enterprise. Related open source issues include: #2012, #2576, #2715, #3053, #3093, #3139, #3186, #3215, #3402, #3526, #3532, #3608, #4030

Other issues fixed in this release are listed below, with open source references where present:

  • Counters: Scylla rejects SSTables that contain counters created by Cassandra 2.0 and earlier. Due to #4206, Scylla mistakenly rejected some SSTables created by Cassandra 2.1 as well.
  • TLS: Scylla now disables TLS 1.0 by default and enforces a minimum of 128-bit ciphers #4010. More on encryption in transit (client to server) here
  • Core: In very rare cases, the commit log replay fails. Commit log replay is used after a node was unexpectedly restarted #4187
  • In rare cases, and under heavy load (for example, during repair), Scylla Enterprise might run out of memory and exit with an error such as “compaction_manager - compaction failed: std::bad_alloc (std::bad_alloc)”. #3717, #3716
  • In some rare cases, during service stop, Scylla exited #4107
  • scylla_setup: An option to select the server NIC was added #3658
  • Scylla Enterprise 2018.1 fails to run the scylla_setup script, with or without the `--nic` flag, on Ubuntu 16.04 when the NIC is not eth0 (the --nic flag is ignored when passed)
  • CQL: Selecting from a partition with no clustering restrictions (single partition scan) might have resulted in a temporary loss of writes #3608

Related Links

The post Scylla Enterprise Release 2018.1.10 appeared first on ScyllaDB.

ValuStor — a memcached alternative built on Scylla


Derek Ramsey, Software Engineering Manager at Sensaphone, gave an overview of ValuStor at Scylla Summit 2018. Sensaphone is a maker of remote monitoring solutions for the Industrial Internet of Things (IIoT). Their products are designed to watch over your physical plant and equipment — such as HVAC systems, oil and gas infrastructure, livestock facilities, greenhouses, food, beverage and medical cold storage. Yet there is a lot of software behind the hardware of IIoT. ValuStor is an example of the ostensible “hardware guys” teaching the software guys a thing or two.

Overview and Origins of ValuStor

Derek began his Scylla Summit talk with an overview: what is ValuStor? It is a NoSQL memory cache and a persistent database utilizing a Scylla back-end, designed for key-value and document store data models. It is implemented as a single header-only database abstraction layer and comprises three components: the Scylla database, a ValuStor client, and the Cassandra driver (which Scylla can use since it is CQL-compliant).

ValuStor is released as free open source under the MIT license. The MIT license is extremely permissive, allowing anyone to “use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the software.” Which means that you can bake it into your own software, services and infrastructure freely and without reservation.

Derek then went back a bit into history to describe how Sensaphone began down this path. What circumstances gave rise to the origins of ValuStor? It all started when Sensaphone launched a single-node MySQL database. As their system scaled to hundreds of simultaneous connections, they added memcached to help offload their disk write load on the backend database. Even so, they kept scaling, and their system was insufficient to handle the growing load. It wasn’t the MySQL database that was the issue. It was memcached that was unable to handle all the requests.

For the short term, they began by batching requests. Yet Sensaphone needed to address the fundamental architectural issues, including the need for redundancy and scalability. There was also a cold cache performance risk (also known as the “thundering herd” problem) if they ever needed to restart.

No one in big data takes the decision to replace infrastructure lightly. Yet when they did the cutover it was surprisingly quick. It took only three days from inception to get Scylla running in production. And, as of last November, Sensaphone had already been in production for a year.

Since its initial implementation, Sensaphone has added two additional use cases: managing web sessions, and a distributed message-queuing fan-out for a producer/consumer application (a publish-subscribe design pattern akin to an in-house RabbitMQ or ActiveMQ). Derek recommended that anyone interested check out the published Usage Guide on GitHub.

Comparisons to memcached

Derek made a fair but firm assessment of the limitations of memcached for Sensaphone’s environment. First he cited the memcached FAQ itself, where it says it is not recommended for sessions, since if the cache is ever lost you may lock users out of your site. While the PHP manual had a section on sessions in memcached, there is simply no guarantee of survivability of user data.

Second, Derek cited the minimal security implementation (e.g., SASL authentication). There has been a significant amplification of attacks on memcached in recent years (such as DDoS attacks), and while there are ways to minimize risks, there is no substitution for built-in end-to-end encryption.

Derek listed the basic, fundamental architectural limitations: “it has no encryption, no failover, no replication, and no persistence. Of course it has no persistence — it’s a RAM cache.” That latter point, while usually counted in memcached’s favor for performance, is exactly what leads to the cold cache problem when your server inevitably crashes or has to be restarted.

Sensaphone was resorting to batching to maintain a semblance of performance, whereas batching is an antipattern for Scylla.

ValuStor: How does it work?

ValuStor Client

Derek described the client design, which keeps ease-of-use first and foremost. There are only two API functions: get and store. (Deletes are not done directly; instead, setting a time-to-live (TTL) of 1 second on the data is effectively a delete.)

Because it is implemented as an abstraction layer, you can use your data in a native programming language with native data types: integers, floating point, strings, JSON, blobs, bytes, and UUIDs.

For fault tolerance, ValuStor also added a client-side write queue for a backlog function, and automatic adaptive consistency (more on this later).

Cassandra Driver

The Cassandra driver supports thread safety and is multi-threaded. “Practically speaking, that means you can throw requests at it and it will automatically scale to use more CPU resources as required and you don’t need to do any special locking,” Derek explained. “Unlike memcached… [where] the C driver is not thread-safe.”

ValuStor also offers connection control, so if a Scylla node goes down it will automatically re-establish the connection with a different node. It is also datacenter-aware and will choose your datacenters intelligently.

Scylla Database Server

The Scylla server at the heart of ValuStor offers various architectural advantages. “First and foremost is performance. With the obvious question, ‘How in the world can a persistent database compete with RAM-only caching?’”

Derek then described how Scylla offers its own async and userspace I/O schedulers. Such architectural features can, at times, result in Scylla responsiveness with sub-millisecond latencies.

Scylla also has its own cache and separate memtable, which acts as a sort of cache. “In our use case at Sensaphone we have 100% cache hits all the time. We never have to hit the disk, even though it has one, and since our database has never actually gone down we’ve never actually even had to load it from disk except for maintenance periods.”

In terms of cache warming, Derek provided some advice, “The cold cache penalty is actually less severe for Scylla if you use heat-weighted load balancing because Scylla will automatically warm up your cache for you for the nodes you restart.”

Derek then turned to the issues of security. His criticisms were sobering: “Memcached is what I call ‘vulnerable by design.’” In the latest major issue, “their solution was simply to disable UDP by default rather than fix the problem.”

“By contrast, ValuStor comes with complete TLS support right out of the box.” That includes client authentication and server certified verification by domain or IP, over-the-wire encryption within and across datacenters, and of course basic password authentication and access control. You can read more about TLS setup in the ValuStor documentation.

“Inevitably, though, the database is going to go offline from the client perspective. Either you have network outage or you’ll have hardware issues on your database server.” Derek then dove down into a key additional feature for fault tolerance: a client-side write queue on the producer side. It buffers up and performs automatic retries. When the database is back up, it clears its backlog. The client keeps the requests serialized, so that data is not written in the wrong order. “Producers keep on producing and your writes are simply delayed. They aren’t lost.”

Derek then noted “Scylla has great redundancy. You can set your custom data replication factor per keyspace. It can be changed on the fly. And the client driver is aware of this and will route your traffic to the nodes that actually have your data.” You can also set different replication factors per datacenter, and the client is also aware of your multi-datacenter topology.

In terms of availability, Derek reminded the audience of the CAP theorem: “it states you can have Consistency, Availability or Partition tolerance. Pick any two.” This leads to the quorum problem (where you require n/2 + 1 nodes to be available), which can cause fragility issues in multi-datacenter deployments.

To illustrate, Derek showed the following series of graphics:

Quorum Problem: Example 1

Let’s say you have a primary datacenter with three nodes, and a secondary datacenter with two nodes. The outage of any two nodes will not cause a problem in quorum.

Quorum Problem: Example 1 (Primary Datacenter Failure)

However, if your primary datacenter goes offline, your secondary datacenter would not work if it required a strict adherence to quorum being set at n/2 +1.

Quorum Problem: Example 2

In a second example Derek put forth, if you had a primary site with three nodes plus two secondary sites, you could keep operating even if the primary site went offline, since there would still be four nodes available, which meets the n/2 + 1 requirement.

Quorum Problem: Example 2 (Secondary site failures)

However, if both of your secondary datacenters went offline, Derek observed this failure would have the unfortunate effect of bringing your primary datacenter down with it, even if there was nothing wrong with your main cluster.

“The solution to this problem is automatic adaptive consistency. This is done on the client side.” Since Scylla is an eventually consistent database with tunable consistency, this buys ValuStor “the ability to adaptively downgrade the consistency on a retry of the requests.” This dramatically reduces the likelihood of inconsistency. It also works well with Hinted Handoffs, which further reduce problems when individual nodes go offline.

Derek took the audience on a brief refresher on consistency levels, including ALL, QUORUM, and ONE/ANY. You can learn more about these by reading the Scylla documentation, and even teach yourself more by going through our console demo.
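To make the idea concrete, here is a rough sketch of downgrade-on-retry using the Cassandra C/C++ driver that ValuStor builds on. This is illustrative only, not ValuStor’s actual implementation; error handling and backoff are omitted.

#include <cassandra.h>

// Try the strongest consistency level first, falling back to weaker ones on failure.
// Note: CASS_CONSISTENCY_ANY applies to writes only.
CassError execute_with_adaptive_consistency(CassSession* session, CassStatement* statement) {
  const CassConsistency levels[] = {CASS_CONSISTENCY_ALL, CASS_CONSISTENCY_QUORUM,
                                    CASS_CONSISTENCY_ONE, CASS_CONSISTENCY_ANY};
  CassError rc = CASS_OK;
  for (CassConsistency level : levels) {
    cass_statement_set_consistency(statement, level);   // downgrade on each retry
    CassFuture* future = cass_session_execute(session, statement);
    rc = cass_future_error_code(future);                // waits for the request to complete
    cass_future_free(future);
    if (rc == CASS_OK) {
      break;                                            // succeeded at this level
    }
  }
  return rc;
}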

Next, Derek covered the scalability of Scylla. “The Scylla architecture itself is nearly infinitely scalable. Due to the shard-per-core design you can keep throwing new cores and new machines at it and it’s perfectly happy to scale up. With the driver shard-aware, it will automatically route traffic to the appropriate location.” This is contrasted with memcached, which requires manual sharding. “This is not ideal.”

Using ValuStor

Configuring ValuStor is accomplished in a C++ template class. Once you’ve created the table in the database, you don’t even need to write any other CQL queries.

ValuStor Configuration

This is an example of a minimal configuration. There are more options for SSL.

ValuStor Usage

Here is an example of taking a key-value pair and storing it, checking to see whether the operation was successful, performing automatic retries if it was not, and handling errors if the operation still fails.
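Since the code itself appears only as a slide image, here is a hypothetical C++ sketch of that pattern. The class template parameters, method names, and result fields are assumptions for illustration; the exact API is documented in the ValuStor README.

#include <cstdint>
#include <iostream>
#include <string>
#include "ValuStor.hpp"  // header-only; exact header name per the ValuStor repository

int main() {
  // Assumed construction: key/value types as template parameters, configuration from a file.
  ValuStor::ValuStor<int64_t, std::string> store("valustor.conf");

  // Store a key-value pair; the client performs automatic retries before giving up.
  auto result = store.store(1234, "hello");
  if (!result) {
    std::cerr << "store failed after retries" << std::endl;  // handle the error
    return 1;
  }

  // Read it back; the 'data' field name is also an assumption.
  auto retrieval = store.retrieve(1234);
  if (retrieval) {
    std::cout << retrieval.data << std::endl;
  }
  return 0;
}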

Comparisons

ValuStor Comparisons

When designing ValuStor Derek emphasized, “We wanted all of the things on the left-hand side. In evaluating some of the alternatives, none of them really met our needs.”

In particular, Derek took a look at the complexity of Redis. It has dozens of commands. It has master-slave replication. And for Derek’s bottom line, it’s not going to perform as well as Scylla. He cited the recent change in licensing to Commons Clause, which has caused some confusion and consternation in the market. He also pointed out that if you do need the complexity of Redis, you can move to Pedis, which uses the Seastar engine at its heart for better performance.

What about MongoDB or CouchDB?

Derek also made comparisons to MongoDB and CouchDB, since ValuStor has full native JSON support and can also be used as a document store. “It’s not as full-featured, but depending on your needs, it might actually be a good solution.” He cited how Mongo also recently went through a widely-discussed licensing change (which we covered in a detailed blog).

Derek Ramsey at Scylla Summit 2018

What’s Next for ValuStor

Derek finished by outlining the feature roadmap for ValuStor.

  • SWIG bindings will allow it to connect to a wide variety of languages
  • Improvements to the command line will allow scripts to use ValuStor
  • Expose underlying Futures, to process multiple requests from a single thread for better performance
  • A non-template configuration option

To learn more, you can watch Derek’s presentation below, check out Derek’s slides, or peruse the ValuStor repository on Github.

The post ValuStor — a memcached alternative built on Scylla appeared first on ScyllaDB.


Scylla Open Source Release 2.3.3


Scylla Software Release

The Scylla team announces the release of Scylla Open Source 2.3.3, a bugfix release of the Scylla Open Source 2.3 stable branch. Release 2.3.3, like all past and future 2.3.y releases, is backward compatible and supports rolling upgrades.

Note that the latest stable release of Scylla Open Source is release 3.0 and you are encouraged to upgrade to it.

Related links:

Issues solved in this release:

  • A race condition between the “nodetool snapshot” command and Scylla running compactions may result in a nodetool error: Scylla API server HTTP POST to URL ‘/storage_service/snapshots’ failed: filesystem error: link failed: No such file or directory #4051
  • A bootstrapping node doesn’t wait for the schema before joining the ring, which may result in the node failing to bootstrap, with the error “storage_proxy – Failed to apply mutation”. In particular, this error manifests when a user-defined type is used. #4196
  • Counters: Scylla rejects SSTables that contain counters that were created by Cassandra 2.0 and earlier. Due to #4206, Scylla mistakenly rejected some SSTables that were created by Cassandra 2.1 as well.
  • Core: In very rare cases, the commit log replay fails. Commit log replay is used after a node was unexpectedly restarted #4187
  • In some rare cases, during service stop, Scylla exited #4107

The post Scylla Open Source Release 2.3.3 appeared first on ScyllaDB.

In-Memory Scylla, or Racing the Red Queen


Racing the Red Queen

“Now, here, you see, it takes all the running you can do, to keep in the same place. If you want to get somewhere else, you must run at least twice as fast as that!” — The Red Queen to Alice, Alice Through the Looking Glass

In the world of Big Data, if you are not constantly evolving you are already falling behind. This is at the heart of the Red Queen syndrome, which was first applied to the evolution of natural systems. It applies just as much to the evolution of technology. ‘Now! Now!’ cried the Queen. ‘Faster! Faster!’ And so it is with Big Data.

Over the past decade, many databases have shifted from storing all their data on Hard Disk Drives (HDDs) to Solid State Drives (SSDs) to drop latencies to just a few milliseconds. To get ever-closer to “now.” The whole industry continues to run “twice as fast” just to stay in place.

So as fast storage NVMe drives become commonplace in the industry, they practically relegate SATA SSDs to legacy status; they are becoming “the new HDDs”.

For some use cases even NVMe is still too slow, and users need to move their data to in-memory deployments instead, where speeds for Random Access Memory (RAM) are measured in nanoseconds. Maybe not in-memory for everything — first, because in-memory isn’t persistent, and also because it can be expensive! — but at least for their most speed-intensive data.

All this acceleration is certainly good news for any I/O-intensive, latency-sensitive application, which will now be able to use those storage devices as a substrate for workloads that used to be kept in memory for performance reasons. However, does the speed of access on those devices really match what they advertise? And what workloads are most likely to need the extra speed provided by hosting their data in-memory?

In this article we will examine the performance claims of latency-bound access on real NVMe devices and show that there is still a place for in-memory solutions for extremely latency-sensitive applications. To address those workloads, ScyllaDB added an in-memory option to Scylla Enterprise 2018.1.7. We will discuss how that all ties together in a real database like Scylla and how users can benefit from the new addition to the Scylla product.

Storage Speed Hierarchy

Various storage devices have different access speeds. Faster devices are usually more expensive and have less capacity. The table below shows a brief summary of devices in broad use in modern servers and their access latencies.

Device Latency
Register 1 cycle
Cache 2-10ns
DRAM 100-200ns
NVMe 10-100μs
SATA SSD 400μs
Hard Disk Drive (HDD)  10ms

It would be great, of course, to have all your data in the fastest storage available: registers or cache. But if your data fits there, it is probably not a Big Data environment. On the other hand, if the workload is backed by a spinning disk, it is hard to expect good latencies for requests that need to access the underlying storage. Considering the size vs. speed tradeoff, NVMe does not look so bad here. Moreover, in real-life situations the workload needs to fetch data from various places in the storage array to compose a request. In a hypothetical scenario in which two files are accessed for every storage-bound request, with an access time of around 50μs, the cost of a storage-bound access is around 100μs, which is not too bad at all. But how reliable are those access numbers in real life?

Real World Latencies

In practice, though, we see that NVMe latencies may be much higher than that, even larger than what spinning disks provide. There are a couple of reasons for this. First, there is a technology limitation: an SSD becomes slower as it fills up and data is written and rewritten. The reason is that an SSD has an internal Garbage Collection (GC) process that looks for free blocks, and it becomes more time consuming the less free space there is. We saw that some disks may have latencies of hundreds of milliseconds in worst-case scenarios. To avoid this problem, freed blocks have to be explicitly discarded by the operating system to make GC unnecessary. This is done by running the fstrim utility periodically (which we absolutely recommend doing), but ironically fstrim running in the background may cause latencies by itself. Another reason for larger-than-promised latencies is that a query does not run in isolation. In a real I/O-intensive system like a database, there are usually a lot of latency-sensitive accesses, such as queries, that run in parallel and consume disk bandwidth concurrently with high-throughput patterns like bulk writes and data reorganization (like compactions in Scylla). As a result, latency-sensitive requests may end up in a device queue, resulting in increased tail latency.

It is possible to observe all those scenarios in practice with the ioping utility. ioping is very similar to the well-known networking ping utility, but instead of sending requests over the network it sends them to a disk. Here are the results of the tests we did on AWS:

No other IO:
99 requests completed in 8.54 ms, 396 KiB read, 11.6 k iops, 45.3 MiB/s
generated 100 requests in 990.3 ms, 400 KiB, 100 iops, 403.9 KiB/s
min/avg/max/mdev = 59.6 us / 86.3 us / 157.8 us / 27.2 us

Read/Write fio benchmark:
99 requests completed in 34.2 ms, 396 KiB read, 2.90 k iops, 11.3 MiB/s
generated 100 requests in 990.3 ms, 400 KiB, 100 iops, 403.9 KiB/s
min/avg/max/mdev = 73.0 us / 345.2 us / 5.74 ms / 694.3 us

fstrim:
99 requests completed in 300.3 ms, 396 KiB read, 329 iops, 1.29 MiB/s
generated 100 requests in 1.24 s, 400 KiB, 80 iops, 323.5 KiB/s
min/avg/max/mdev = 62.2 us / 3.03 ms / 83.4 ms / 14.5 ms

As we can see, under normal conditions the disk provides latencies in the promised range, but when the disk is under load, the maximum latency can be very high.

Scylla Node Storage Model

To understand what benefit one will have from keeping all the data in memory we need to consider how Scylla storage model works. Here is a schematic describing the storage model of a single node.

Scylla Storage Model

When the database is queried, a node tries to locate the requested data in the cache and memtables, both of which reside in RAM. If the data is in the cache – good; all that is needed is to combine the data from the cache with the data from the memtable (if any) and a reply can be sent right away. But what if the cache has no data (and no indication that the data is absent from permanent storage as well)? In this case, the bottom part of the diagram has to be invoked and storage has to be contacted.

The format the data is stored in is called an sstable. Depending on the configured compaction strategy, on how recently the queried data was written, and on other factors, multiple sstables may have to be read to satisfy a request. Let’s take a closer look at the sstable format.

Very Brief Description Of the SSTable Format

Each sstable consists of multiple files. Here is a list of files for a hypothetical non-compressed sstable.

la-1-big-CRC.db
la-1-big-Data.db
la-1-big-Digest.sha1
la-1-big-Filter.db
la-1-big-Index.db
la-1-big-Scylla.db
la-1-big-Statistics.db
la-1-big-Summary.db
la-1-big-TOC.txt

Most of those files are very small and their content is kept in memory while the sstable is open. But there are two exceptions: the Data and Index files. Let’s take a closer look at what those two contain.

SSTable Data Format

The Data file stores the actual data. It is sorted according to partition keys, which makes binary search possible. But searching for a specific key in a large file may require a lot of disk access, so to make the task more efficient there is another file, Index, that holds a sorted list of keys and offsets into the Data file where data for those keys can be found.

As one can see, each access to an sstable requires at least two reads from disk (it may be even more depending on the size of the data that has to be read and the place of the key in the index file).
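To make the two-step lookup concrete, here is a purely illustrative Python sketch (not Scylla's actual on-disk format or code): the Index is modeled as a sorted list of (key, offset) pairs, and the Data file is read at the offset found by binary search.

import bisect

# Toy model: the Index file is a sorted list of (partition key, offset into Data),
# held here as an in-memory list; the Data file holds the rows themselves.
index = [("apple", 0), ("banana", 4096), ("cherry", 9216)]
index_keys = [k for k, _ in index]

def lookup(data_file, key, block_size=4096):
    """Access #1: binary-search the Index; access #2: seek into the Data file."""
    i = bisect.bisect_left(index_keys, key)
    if i == len(index_keys) or index_keys[i] != key:
        return None                  # key is not in this sstable
    _, offset = index[i]
    data_file.seek(offset)           # this is the second disk access
    return data_file.read(block_size)

with open("la-1-big-Data.db", "rb") as data:
    row_bytes = lookup(data, "banana")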

Benchmarking Scylla

Let’s look at how those maximum latencies can affect the behaviour of Scylla. The benchmark was run on a cluster in Google Compute Engine (GCE) with one NVMe disk. We have found that NVMe on GCE is somewhat slow, so in a way it helps to emphasize the in-memory benefits. Below is a graph of the 99th percentile latency for access to two different tables. One is a regular table on an NVMe disk (red) and the other is in memory (green).

p99 Latencies In-Memory vs. SSD

The 99th percentile latency for the on-disk table is much higher and has much more variation in it. There is another line in the graph (in blue) that plots the number of compactions running in the system. It can be seen that the blue graph matches the red one, which means that the 99th percentile latency of an on-disk table is greatly affected by the compaction process. The high on-disk latencies here are a direct result of tail latencies that occurred because user reads were queued behind compaction reads.

A P99 of 20ms is not much for Scylla, but in this case a single, not-so-fast NVMe disk was used. Adding more NVMe drives in a RAID0 setup will allow for more parallelism and will mitigate the negative effects of queuing, but doing so also increases the price of the setup and at some point will erase all the price benefits of using NVMe, while not necessarily achieving the same performance as an in-memory setup. The in-memory setup allows you to get low and consistent latency at a reasonable price point.

Configuration

Two new configuration steps are needed to make use of the feature. First, one needs to specify how much memory should be left for in-memory sstable storage. This can be done by adding in_memory_storage_size_mb to the scylla.yaml file or by specifying --in-memory-storage-size-mb on the command line. After memory is reserved, an in-memory table can be created by executing:

CREATE TABLE ks.cf (
     key blob PRIMARY KEY,
     "C0" blob,

) WITH compression = {}
  AND read_repair_chance = '0'
  AND speculative_retry = 'ALWAYS'
  AND in_memory = 'true'
  AND compaction = {'class':'InMemoryCompactionStrategy'};

Note the new in_memory property, which is set to true, and the new compaction strategy. Strictly speaking, it is not required to use InMemoryCompactionStrategy with in-memory tables, but this compaction strategy compacts much more aggressively to get rid of data duplication as fast as possible and save memory.

Note that a mix of in-memory and regular tables is supported.

Conclusion

Despite what the vendors may say, real-world storage devices can present high tail latencies in the face of competing requests, even for newer technology like NVMe. Workloads that cannot tolerate a jump in latencies under any circumstances can benefit greatly from the new Scylla Enterprise in-memory feature. If, on the other hand, a workload can cope with occasionally higher latency for a small number of requests, it is beneficial to let Scylla manage what data is held in memory with its usual caching mechanism and to use regular on-disk tables with fast NVMe storage.

Find out more about Scylla Enterprise

The post In-Memory Scylla, or Racing the Red Queen appeared first on ScyllaDB.

Introducing Scylla University


Scylla University

So you’re thinking of running your applications with Scylla? You’ve probably heard it’s a lightning fast, self-optimizing, highly available Apache Cassandra drop-in replacement. Yet you may still have questions like:

  • How do I upgrade from my current system?
  • How many nodes do I need?
  • How do I ensure that my data is consistent across the database?
  • How should applications be written to maximize database performance? How do I scale up or scale out?

To make it easier to find answers to these questions and many more, we have launched Scylla University. Anyone in your organization can now take advantage of our valuable new self-paced online training at no cost.

Visit Scylla University

Our goal at Scylla University is to improve the learning process by creating a destination for engaging, easy-to-use training, advice and best practices. We want to make sure that everyone has access to the knowledge needed to master the fastest database in the market.

The first course serves as a background for anyone who needs hands-on knowledge of Scylla — from developers to DBAs. Each lesson is chunked into small learning segments, allowing you to learn at your own pace. Each module contains quizzes to check your understanding as well as interactive exercises and hands-on labs so you can practice what you’re learning.

Consistency Level: Write Operations

Learn about consistency levels and replication factors in Scylla

The overview course explains the basics of Scylla. By the end of this course, you will understand the fundamental concepts of NoSQL databases and gain knowledge of Scylla’s features and advantages. This includes topics such as installation, the Scylla architecture, data model, and how to configure the system for high availability.

It sets the model for all Scylla University courses to come, including videos, slides, and hands-on labs where you get a chance to put theories into practice.

According to recent research, combining theoretical understanding and practical hands-on learning, with incremental testing for comprehension, is the most effective way to study.

Theory and Practice Make Perfect

New Courses in the Works

We’re currently developing new courses and learning material, including in-depth Scylla Administration, advanced architecture, troubleshooting, best practices and advanced data modeling. When you register for Scylla University we’ll notify you about new content as it becomes available. You’ll also be the first to know about software releases and other updates.

Our future plans include:

  • Courses specific to user levels ranging from novice to advanced
  • More content, case studies, videos, webinars, quizzes and hands-on labs to help you improve your skills
  • Certification paths for Scylla and NoSQL
  • Ongoing refinements to the user interface

Scylla University Sea Monster Honor Roll Challenge

Scylla University Mascot

Got what it takes to be the best of the best? Each completed quiz and lesson will earn you points. Think you can earn enough to be on the Honor Roll? How about the Head of the Class? The top 5 students will get a Scylla t-shirt and more cool swag. This challenge ends on March 20th, so get going!

Get Started Today!

Scylla Essentials

“You don’t have to be great to start, but you have to start to be great.” – Zig Ziglar

Our first online course, Scylla Essentials – Overview of Scylla, is now available. It’s free. All you have to do to get started is register.

Is there a course you would like us to add to our curriculum? Or perhaps you’ve got suggestions on how we can improve the University site? Please let us know.

The post Introducing Scylla University appeared first on ScyllaDB.

Tantan Connects with Scylla to Help Singles Find Their Closest Match

Tantan and Scylla

About Tantan

Tantan is a locality-based dating app that provides a venue for people to connect and expand their social circles. More than 160 million men and women have used Tantan to chat, form new friendships, and, ideally, find their perfect match. We have more than 10 million daily active users.

Our app delivers profiles of people that might be interesting to our users. Those profiles incorporate photos, likes, and interests. After evaluating prospective matches, users can ‘like’ favorites with a swipe. If the swipe is reciprocated, a match is struck.

Once matched up, users can interact via texts, voice messages, pictures, and videos. Our app provides social privacy by blocking and filtering other users. Our new ‘ice-breaker’ feature lets our users get a better understanding of a potential match with a brief quiz.

tantan logo

“By adopting Scylla, we’ve completely side-stepped garbage collection and other issues that were interfering with the real-time nature of our application.”

Ken Peng, Backend Specialist, Tantan

The Challenge

The Tantan app periodically collects user locations, which are used to identify, in real-time, other users within a range of about 300 meters. Our peak usage hits about 50,000 writes per second.

Because we have so many geographically distributed users, we put a lot of strain on our database. Due to the real-time nature of our app, our number-one database requirement is low latency for mixed read and write workloads.

Initially we ran two 16-node Cassandra clusters: one cluster for passby and one for storing user data. Yet even with this many nodes, we found ourselves running into serious latency issues. These issues were often caused by a common culprit in Cassandra: garbage collection. However, we also saw latency spikes resulting from other slow maintenance tasks, such as repair. Even after tuning Cassandra to our specific workloads, we saw little improvement.

The Solution

Scylla turned out to be just the right solution for our problems. We found that Scylla meets our needs perfectly. Today, we run the passby function against an 8-node Scylla cluster, a reduction in nodes that has been a big cost savings for us. More importantly, by adopting Scylla, we’ve completely side-stepped garbage collection and other issues that were interfering with the real-time nature of our application. We also spend much less time managing our database infrastructure.

We found the Scylla teams and the broader Scylla community to be very responsive in getting us quickly up to speed and helping us to optimize our Scylla deployment.

The post Tantan Connects with Scylla to Help Singles Find Their Closest Match appeared first on ScyllaDB.

Scylla and Elasticsearch, Part Two: Practical Examples to Support Full-Text Search Workloads

Scylla and Elasticsearch

We covered the basics of Elasticsearch and how Scylla is a perfect complement for it in part one of this blog. Today we want to give you specific how-tos on connecting Scylla and Elasticsearch, including use cases and sample code.

Use Case #1

If combining a persistent, highly available datastore with a full-text search engine is a market requirement, then implementing a single, integrated solution is an ultimate goal that requires time and resources. To answer this challenge, we describe below a way for users to combine best-of-breed solutions that support full-text search workloads. We chose open source Elasticsearch together with open source Scylla to showcase the solution.

In this use case we start with a fresh clean setup, meaning that you need to ingest your data into both Scylla and Elasticsearch using dual writes and then perform the textual search.

The following example creates an apparel catalog consisting of 160 items, stored in Scylla and searchable using Elasticsearch (ES). The catalog.csv file is comprised of 7 columns: Brand, Group, Sub_Group, Color, Size, Gender and SKU. Let’s describe them:

  • Brand: companies who manufacture the clothing items (5 options)
  • Group: clothing type (4 options)
  • Sub_Group: an attribute that correlates to the clothing group type (9 options)
  • Color: pretty much self explanatory (7 options) – very common query filter
  • Size: ranging from Small to XLarge (4 options) – very common query filter
  • Gender: People like to see the results for the relevant gender group (4 options) – very common query filter
  • SKU: a unique product ID, usually made of the initials of the other product attributes
Brand | Group | Sub_Group | Color | Size | Gender
North Face | Shirts | short sleeve | black | Small | men
Patagonia | Pants | long sleeve | red | Medium | women
Mammut | Shoes | gore-tex | grey | Large | boys
Garmont | Jackets | leather | green | XLarge | girls
Columbia | | softshell | yellow | |
 | | hiking | blue | |
 | | jeans | white | |
 | | bermuda | | |
 | | UV_protection | | |
We will be using two python scripts (see Appendix-B) to demonstrate the integration of Scylla with Elasticsearch.
  • Dual writes using the insert_data script for data ingestion, in our case an apparel catalog csv file.
  • Textual search using the query_data script, which is basically a 2-hop query that will retrieve the unique product_id (SKU) from Elasticsearch and then use the retrieved SKU to query other product attributes from Scylla.
Scylla and Elasticsearch Block Diagram
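The full scripts are listed in Appendix B. As a minimal sketch of the dual-write step, the following Python fragment shows the idea using the catalog example above; the exact Scylla column names are an assumption (the real schema is in Appendix A), and "group" is quoted only to stay clear of the CQL keyword.

import csv
from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

scylla = Cluster(["127.0.0.1"]).connect("catalog")
es = Elasticsearch(["127.0.0.1"])

insert = scylla.prepare(
    'INSERT INTO apparel (sku, brand, "group", sub_group, color, size, gender) '
    'VALUES (?, ?, ?, ?, ?, ?, ?)')

with open("catalog.csv") as f:
    for item in csv.DictReader(f):
        # Write #1: Scylla remains the system of record.
        scylla.execute(insert, (item["SKU"], item["Brand"], item["Group"],
                                item["Sub_Group"], item["Color"], item["Size"],
                                item["Gender"]))
        # Write #2: the same document is indexed in Elasticsearch, keyed by the
        # SKU so a later search result can be joined back to the Scylla row.
        es.index(index="catalog", doc_type="apparel", id=item["SKU"], body=item)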

Prerequisites

  • python installed
  • pip installed
  • Java 8 installed
  • Scylla cluster installed (see here)
  • Node for Elasticsearch and python scripts (can be separate nodes)

Procedure

  1. Install the python drivers on the node to be used for the scripts

$ sudo pip install cassandra-driver
$ sudo pip install elasticsearch

  2. Install Elasticsearch (see here)

$ sudo apt-get update
$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.3.deb
$ sudo dpkg -i elasticsearch-6.2.3.deb

  3. Start Elasticsearch, verify status and health state

$ sudo /etc/init.d/elasticsearch start
[ ok ] Starting elasticsearch (via systemctl): elasticsearch.service.
$ curl http://127.0.0.1:9200/_cluster/health?pretty

{
  "cluster_name" : "elasticsearch",
  "status" : "green",
  "timed_out" : false,
  "number_of_nodes" : 1,
  "number_of_data_nodes" : 1,
  "active_primary_shards" : 0,
  "active_shards" : 0,
  "relocating_shards" : 0,
  "initializing_shards" : 0,
  "unassigned_shards" : 0,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 100.0
}

  4. Copy the following files to the location from which you will run them, and make them executable
    • catalog.csv (the SKU is the unique product_id, made of other product attributes initials)
    • insert_data_into_scylla+elastic.py
    • query_data_from_scylla+elastic.py
  5. Run the insert_data script. The script will perform the following:
    • Create the Schema on Scylla (see Appendix-A)
    • Create the Elasticsearch index (see Appendix-A)
    • Dual write: insert the catalog items (csv file) to both DBs (using prepared statement for Scylla)
    Use the -s / -e flags to insert a comma-separated list of IPs for the Scylla and/or Elasticsearch (ES) nodes. If you are running Elasticsearch (ES) on the same node as the Python scripts, there is no need to enter an IP; 127.0.0.1 will be used as the default.

    $ python insert_data_into_scylla+elastic.py -h
    usage: insert_data_into_scylla+elastic.py [-h] [-s SCYLLA_IP] [-e ES_IP]

    optional arguments:
      -h, --help show this help message and exit
      -s SCYLLA_IP
      -e ES_IP
  6. Once the “insert_data” script completes, you will find 160 entries in both Scylla and Elasticsearch
Elasticsearch:
$ curl http://127.0.0.1:9200/catalog/_count/?pretty
  {
  "count" : 160,
  "_shards" : {
  "total" : 5,
  "successful" : 5,
  "skipped" : 0,
  "failed" : 0
  }
}
Scylla:

cqlsh> SELECT COUNT (*) FROM catalog.apparel ;

 count
-------
   160

(1 rows)
  7. Run the query_data script. The script will perform the following:
      • It will execute a textual search in Elasticsearch per the flag you provide, either searching for a single word (single filter) OR searching for multiple words (multiple filters), OR without any filter, which is basically a “match_all” query.

      • It will then use the SKU value retrieved from the textual search to query Scylla, while using prepared statements. As mentioned, there are 3 query types; use the -n flag to select the query type. Optional values are:

        • “single”: using a single filter (by group) to query for “pants”
        • “multiple” (default): using multiple filters (by color AND sub_group) to query for “white softshell”
        • “none”: query without any filter = “match_all”

        Use the -s / -e flags to insert a comma-separated list of IPs for the Scylla and/or Elasticsearch (ES) nodes. If you are running Elasticsearch (ES) on the same node as the Python scripts, there is no need to enter an IP; 127.0.0.1 will be used as the default. Note: Elasticsearch returns only the first 10 results by default. To overcome this we set the size limit to 1000 results. When retrieving a large set of results, we recommend using pagination (read more here: elasticsearch-py helpers).

        $ python query_data_from_scylla+elastic.py -h
        usage: query_data_from_scylla+elastic.py [-h] [-s SCYLLA_IP] [-e ES_IP] [-n NUM_FILTERS]


        optional arguments:
          -h, –help show this help message and exit
          -s SCYLLA_IP
          -e ES_IP
          -n NUM_FILTERS

  8. To delete the Elasticsearch index and the keyspace in Scylla, run the following commands

      • Elasticsearch: $ curl -X DELETE "127.0.0.1:9200/catalog"
      • Scylla: cqlsh> DROP KEYSPACE catalog ;
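To recap the read path of Use Case #1, here is a hedged Python sketch of the 2-hop query performed by the query_data script, using the “multiple filters” example (white softshell); the real script in Appendix B also parses the -s / -e / -n flags and handles errors.

from cassandra.cluster import Cluster
from elasticsearch import Elasticsearch

scylla = Cluster(["127.0.0.1"]).connect("catalog")
es = Elasticsearch(["127.0.0.1"])
by_sku = scylla.prepare("SELECT * FROM apparel WHERE sku = ?")

# Hop 1: full-text search in Elasticsearch (the "multiple filters" case,
# color AND sub_group); size=1000 overrides the default of 10 results.
body = {"query": {"bool": {"must": [
    {"match": {"color": "white"}},
    {"match": {"sub_group": "softshell"}},
]}}}
hits = es.search(index="catalog", body=body, size=1000)["hits"]["hits"]

# Hop 2: each hit's _id is the SKU, used to read the full row from Scylla.
for hit in hits:
    for row in scylla.execute(by_sku, (hit["_id"],)):
        print(row)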

Use Case #2

In this use case we assume you already have your data in Scylla and want to import it into Elasticsearch, to be indexed for textual search purposes. To accomplish this, the first thing you will need to do is export the relevant table and its content from Scylla into a .csv data file; this can be done using the cqlsh COPY TO command.

The second thing to do is export the table schema into a .cql schema file, one file for each table. This can be accomplished by running the following command: cqlsh [IP] -e "DESCRIBE TABLE [table name]" > [name].cql

Once you have your .csv and .cql files ready, you just need to have an Elasticsearch node installed and you’re good to go.

The following script (see Appendix-C) will use the .cql schema file and the .csv data file as inputs to create an index in Elasticsearch (ES) and insert the data.

The ES index name will be created based on the .csv file name

The index _id field (index partition key) is based on the PRIMARY KEY taken from the .cql schema (simple/composite/compound).

The index _type field will represent the partition key (PK), in case of a compound key it will use `-` to concatenate the column names.

The script will print progress for every 1000 rows processed and total rows processed in its output.
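The full script is listed in Appendix C. Its core loop, reduced to the essentials, might look like the sketch below; the file name and the primary-key column list are hypothetical stand-ins for what the real script derives from the .csv and .cql inputs.

import csv
import os
from elasticsearch import Elasticsearch, helpers

ES_IP = "127.0.0.1"
CSV_FILE = "cp_prod.product_all.csv"          # the index name is derived from this
PK_COLUMNS = ["account_id", "product_id"]     # assumed: parsed from the .cql schema

es = Elasticsearch([ES_IP])
index_name = os.path.splitext(os.path.basename(CSV_FILE))[0]
es.indices.create(index=index_name, ignore=400)   # ignore "index already exists"

def actions():
    with open(CSV_FILE) as f:
        for row in csv.DictReader(f):
            yield {
                "_index": index_name,
                "_type": "-".join(PK_COLUMNS),                 # _type mirrors the PK
                "_id": "-".join(row[c] for c in PK_COLUMNS),   # compound key -> _id
                "_source": row,
            }

# Bulk insert; helpers.bulk batches the generator's documents for us.
helpers.bulk(es, actions())
es.indices.refresh(index=index_name)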

Prerequisites

  • python installed
  • pip installed
  • Java 8 installed
  • Scylla cluster installed (see here)
  • Node for Elasticsearch and python scripts (can be separate nodes)

Procedure

  1. Install the python drivers on the node to be used for the scripts
    $ sudo pip install cassandra-driver
    $ sudo pip install elasticsearch
  2. Install Elasticsearch (see here)
    $ sudo apt-get update
    $ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-6.2.3.deb
    $ sudo dpkg -i elasticsearch-6.2.3.deb

  3. Start Elasticsearch, verify status and health state
    $ sudo /etc/init.d/elasticsearch start
    [ ok ] Starting elasticsearch (via systemctl): elasticsearch.service.
    $ curl http://127.0.0.1:9200/_cluster/health?pretty


    {

    "cluster_name" : "elasticsearch",
    "status" : "green",
    "timed_out" : false,
    "number_of_nodes" : 1,
    "number_of_data_nodes" : 1,
    "active_primary_shards" : 0,
    "active_shards" : 0,
    "relocating_shards" : 0,
    "initializing_shards" : 0,
    "unassigned_shards" : 0,
    "delayed_unassigned_shards" : 0,
    "number_of_pending_tasks" : 0,
    "number_of_in_flight_fetch" : 0,
    "task_max_waiting_in_queue_millis" : 0,
    "active_shards_percent_as_number" : 100.0
    }

  4. Copy the python file to the location from which you will run it, and make it executable. Place your .csv and .cql files in an accessible location (can be same dir as the python script)

  5. Run the script (see below usage, important details and examples)
    • Usage
      $ python ES_insert_data_per_schema.py -h
      usage: ES_insert_data_per_schema.py [-h] [-e ES_IP] [-c CSV_FILE_NAME]
                                          [-s CQL_SCHEMA_FILE_NAME]
                                          [-i IGNORE_CQL_SCHEMA]
      optional arguments:
        -h, --help show this help message and exit
        -e ES_IP
        -c CSV_FILE_NAME
        -s CQL_SCHEMA_FILE_NAME
        -i IGNORE_CQL_SCHEMA
    • Important Details
      • Use the -e flag to insert a comma-separated list of IPs for Elasticsearch (ES) nodes. If ES is running locally, no need for this flag; the default 127.0.0.1 will be used
      • -i ignore_cql_schema -> default: True, meaning the script will use the 1st column from the .csv file for the ES index _id field. If you have a compound PK, use -i no so as not to ignore the .cql schema
      • -c csv_file_name -> requires the full path to the file. Needs to be in the format described in the prerequisites
      • -s cql_schema_file_name -> requires the full path to the file. The script checks the schema for a compound PK; if it does not find one, it checks for a simple PK
      • If a .cql file is not provided (but a .csv file was), the script will fall back to ignoring the .cql schema and use the 1st column from the .csv file for the ES index _id field
      • If neither a .cql nor a .csv file is provided, an error is printed and the script exits.
    • Output Example Using Compound PK
      ubuntu@ip-172-16-0-124:~/scylla_elastic$ python ES_insert_data_per_schema.py -c
      ./cp_prod.product_all.csv -s
      ./cp_prod_product_all.cql -i no

      ## Check schema (./cp_prod_product_all.cql) for compound primary key to be used as index id
      ## Did not find a compound primary key, checking for regular primary key to be used as index id
      ## Connecting to ES -> Creating 'cp_prod.product_all' index, if not exist
      ## Write csv file (./cp_prod.product_all.csv) content into Elasticsearch
      ## Update every 1000 rows processed ##
      Rows processed: 1000
      Rows processed: 2000
      Rows processed: 3000
      Rows processed: 4000
      Rows processed: 5000
      Rows processed: 6000
      Rows processed: 7000
      Rows processed: 8000
      Rows processed: 9000
      ## After all inserts, refresh index (just in case)
      ### Total Rows Processed: 9715 ###

Next Steps

We’ve given the case for using Scylla and Elasticsearch together. And above we’ve shown you step-by-step model for how to implement it. The next step is up to you! Download Scylla and Elasticsearch and try it out yourself.

If you do, we’d love to hear your feedback and experience. Either by joining our Slack channel, or dropping us a line.

Appendix A (Schema and Index for Use Case #1)

Scylla schema Elasticsearch Index

Appendix B (Python Code for Use Case #1)

Insert_Data Query_Data

Appendix C (Python Code for Use Case #2)

The post Scylla and Elasticsearch, Part Two: Practical Examples to Support Full-Text Search Workloads appeared first on ScyllaDB.

Deep Dive into the Scylla Spark Migrator


Scylla and Spark

Another week, another Spark and Scylla post! This time, we’re back again with the Scylla Spark Migrator; we’ll take a short tour through its innards to see how it is implemented.

  • Read why we implemented the Scylla Spark Migrator in this blog.

Overview

When developing the Migrator, we had several design goals in mind. First, the Migrator should be highly efficient in terms of resource usage. Resource efficiency in the land of Spark applications usually translates to avoiding data shuffles between nodes. Data shuffles are destructive to Spark’s performance, as they incur more I/O costs. Moreover, shuffles usually get slower as more nodes are added (which is the opposite of the scaling model we like!).

Beyond resource efficiency, the Migrator was designed to perform decently out of the box with relatively little tuning: the default configuration splits the source table into 256 ranges that are transferred in parallel; on each executor, 8 connections are opened to Cassandra and 16 to Scylla; rows are fetched in batches of 1000 and TTL/WRITETIME timestamps are preserved. Of course, these parameters can be tuned using the configuration file.

With these goals in mind, let’s recap how the Migrator works:

  • When launched, the Migrator reads the schema definition for the source table from Cassandra;
  • The schema is used to create the CQL selection;
  • If timestamp preservation is enabled, the TTL and WRITETIME timestamps for non-key columns are also added to the CQL projection;
  • The rows are fetched in chunks from Cassandra, and each chunk is written to Scylla;
  • As rows are written, the token ranges that have been processed are tracked and periodically saved.

Sounds pretty straightforward! We’ll dive into how these steps are implemented in the following sections.

Using the table definition to create the DataFrame schema

As we’ve mentioned in the post about DataFrames, every DataFrame in Spark has an associated schema. The schema is used to validate the queries against the DataFrame, optimize them and so forth. When the data source for the DataFrame is structured, it is very sensible to infer the schema from the data source.

When creating a DataFrame from a Cassandra table, the Cassandra connector will happily infer the schema for us. However, creating DataFrames is limited to using the table itself. The Migrator needs to add additional expressions to the table scan, so we’re going to build the schema manually using the connector’s infrastructure.

We start off by creating an instance of CassandraConnector. This class bundles the management of connections to Cassandra for the driver and the executors:

The connector’s configuration can be derived from Spark’s configuration. The actual initialization in the Migrator is slightly different, as we extract the configuration parameters from a YAML file rather than through Spark’s configuration mechanism. With the connector defined, we can use it to build the schema.

The connector provides a data type called TableDef:

The TableDef data type represents everything we need to know about the structure of a Cassandra/Scylla table. Most importantly, it includes several sequences of ColumnDef instances that describe the columns of the table.

Now, TableDef is a plain case class, so we could construct it manually (and that could be useful when creating new tables from DataFrames not originating in Cassandra), but in this case, we’d like to infer it from an existing table. There’s a handy method for this:

tableFromCassandra will use the connector to fetch the schema data from Cassandra and create the TableDef instance. There’s actually another method, Schema.fromCassandra, that can fetch the definitions for all keyspaces and all tables on Cassandra, but we won’t use it.

Using the TableDef instance, we should be able to construct the DataFrame schema. These schemas are specified as a StructType; this is a data type representing a record:

So essentially, each ColumnDef should map to a StructField. There’s one last missing piece of this puzzle: how do we actually do that? How do we convert the ColumnType associated with a ColumnDef into a StructField?

Luckily, the DataTypeConverter object has what we need:

This function will take any ColumnDef and spit back a StructField. So we can just do the following to create our StructType schema:

Great! So now, we can move forward with actually modifying the schema for our needs.

Building a custom query for a table scan

The motivation for building the schema manually is to include in it the expressions for TTL and WRITETIME timestamps for the individual columns; we need to copy over those timestamps along with the column values. Let’s start by creating the custom query itself. We’re going to use the more low-level RDD interface, as only that interface supports specifying custom selections.

We can modify the colums included in the CQL projection by using the select method on the CassandraRDD. This is the definition for select:

We haven’t talked about ColumnRef yet – this is a data type describing column references in CQL projections. We can build the basic projection by using the ref field present on the ColumnDef:

Now, to include the TTL and WRITETIME timestamps of each column in the projection, we could do something like the following:

There’s one problem: we can’t retrieve these timestamps for columns that are part of the primary key for the table; so we need to only add the timestamp extraction for regular columns. The TableDef data type differentiates between the columns (allColumns is just a concatenation of all the ColumnDef instances), so we can write the projection as follows:

And this projection can be fed to the select method on the CassandraRDD:

What’s CassandraSQLRow, you ask? This is a data type, provided by the Cassandra connector, that implements the interface for Spark’s internal Row: a sequence of heterogeneously-typed named fields that backs DataFrames. The rows from the Cassandra driver for Java are converted into this data type.

Finally, we need to take care of the DataFrame schema as well: it too must be modified to include the fields for the TTL and WRITETIME timestamps. This is done similarly by transforming the sequence of StructFields:

Note that each field is checked against the names of regular columns; only if it is indeed a non-key column, the schema is modified to include its timestamps.

Excellent! We now have an RDD that uses a custom projection to query the columns and their timestamps, and a DataFrame schema. We need to put them together and create the DataFrame; we don’t actually want to work with the RDD. To do so, we’ll use the createDataset method available on the SparkSession:

The RowEncoder is (yet another) data type that tells Spark’s Catalyst engine how to convert the RDD’s data to Spark’s internal binary representation.

Preserving ttls and writetimes for individual cells

Our next challenge is to figure out how to actually write different WRITETIME/TTL timestamps to each column of the transferred rows. See, the problem is that when issuing a CQL insert, you can only specify one timestamp that is applied to all non-key columns being written:

This statement would apply the 24 hour TTL value to both reg1 and reg2 values. The solution to this is to issue a separate CQL statement for each group of values within a row with the same TTL and WRITETIME. For example, assuming we need to copy a row that consists of the following values:

To copy such a row, we would need to issue the following CQL statements:

Note how the insertions of reg3 and reg4 ended up in the same CQL statement; this is because they have the same TTL and WRITETIME values, while the other values create unique combinations of TTL and WRITETIME. When Scylla processes these CQL statements, it’ll merge the values into the same row determined by the key.
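The Migrator performs this splitting in Scala with flatMap, as shown in the next section. As a language-neutral illustration of the grouping rule itself, here is a small Python sketch that takes a row whose regular columns carry their own (TTL, WRITETIME) pair and emits one group of columns per distinct pair; all names here are hypothetical.

from collections import defaultdict

def split_row(pk, regular):
    """pk: dict of primary-key columns.
    regular: dict mapping column name -> (value, ttl, writetime).
    Returns one (columns, ttl, writetime) group per distinct timestamp pair;
    each group becomes a separate CQL INSERT ... USING TTL ? AND TIMESTAMP ?."""
    groups = defaultdict(dict)
    for name, (value, ttl, writetime) in regular.items():
        groups[(ttl, writetime)][name] = value
    return [(dict(pk, **cols), ttl, writetime)
            for (ttl, writetime), cols in groups.items()]

# reg3 and reg4 share a (ttl, writetime) pair, so they end up in one statement.
row = {
    "reg1": ("a", 86400, 1550000000),
    "reg2": ("b", 3600, 1550000001),
    "reg3": ("c", 7200, 1550000002),
    "reg4": ("d", 7200, 1550000002),
}
for columns, ttl, writetime in split_row({"id": 1}, row):
    print(columns, ttl, writetime)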

The DataFrame that contains the rows must use a fixed schema; the CQL statements we’ve shown contain the columns of the primary key, and a subset of (or possibly the entire set of) regular columns. We’ll need to represent the fact that some columns are unset on the rows. To do so, we’ll use the CassandraOption[T] data type provided by the Cassandra connector.

This data type is similar to Scala’s built-in Option[T] data type, but contains another term in the sum type. Here’s the definition:

We can use this data type to represent the 3 states a column in a CQL DML statement can be: set with a value, unset, or set with null (which means it’ll be cleared in the row).

To implement this row splitting operation, we’ll use the flatMap function present on DataFrames, that allows the use of plain Scala functions to process the rows in the DataFrame. This is the definition of flatMap:

Recall that DataFrames are a type alias for Dataset[Row]. TraversableOnce is a trait from the Scala collections library representing collections that can be processed once (or more). We can use flatMap to transform the rows in the DataFrame to collections of elements for which a Spark encoder exists. The collections will be flattened into the resulting DataFrame.

So, we’ll write a function of the form Row => List[Row] that will expand each original row into rows that use the same TTL and WRITETIME values. Since access to rows is by index, we’ll first create maps that contain the indices for the primary keys and the regular columns.

This function will create the maps we need:

Note that the index map for the regular columns contains the index for the original column, the TTL column and the WRITETIME column. To actually perform the row splitting operation, we’ll use this function:

This one is slightly more involved, so let’s describe each of the transformations it performs. We start by checking if there are any regular columns; if there aren’t, and the row is composed entirely of primary key columns, we just return the row in a list.

When we do have regular columns, we first transform the index map to a list of column name, value, TTL value and WRITETIME value. Next, we group the fields by their TTL and WRITETIME values and discard the timestamps from the resulting map’s value type. Lastly, we construct a row from each field group by adding the primary key values, the regular column values wrapped in CassandraOption and finally adding the TTL and WRITETIME values.

To actually use these functions, we first need to use Spark’s broadcasting mechanism to send a copy of the field index maps to each executor. This is done as follows:

The broadcasting mechanism is an optimization for when the bodies of DataFrame transformation functions need to reference read-only values. To transform the DataFrame, we call flatMap as follows:

Finally, we need to tell the connector to save the DataFrame to Scylla. We’ll drop down again to the RDD API as that offers more control over how the CQL statements are constructed. We create a ColumnSelector value that describes which columns should be written; the columns are created from the original schema that we loaded, before adding the colName_ttl columns:

We also create a WriteConf value that describes how the connector should perform the writes. Critically, we tell the connector which column should be used for the TTL and WRITETIME values for each row:

And we pass all of this to the saveToCassandra method:

Whew! That’s the beefiest part for the Migrator.

Tracking token ranges that have been transferred

The next part we’ll discuss is how the Migrator keeps track of which token ranges have already been transferred. This functionality required some modification of the Cassandra connector to propagate this information to the Migrator, which is why the Migrator relies on a fork of the connector.

There are two sides to this feature: first, we must keep track of token ranges that have been transferred, and periodically write them to a savepoint file; second, when resuming a migration, we must use the savepoint file to skip token ranges that have already been transferred. Let’s start with the first part.

We’ve seen Spark’s broadcast variables in the previous section; these are immutable, read-only values that are sent to all executors from the driver. Spark also contains accumulators: variables that are readable and writable from both executors and the driver.

We’ve modified the connector to store transferred token ranges in an accumulator. This accumulator is then periodically read on the driver and its contents is stored in a savepoint file. To implement an accumulator, we can inherit from Spark’s AccumulatorV2 abstract class:

The contract is that the accumulator can read values of type IN, update an internal state in a thread-safe way, and output values of type OUT, which would typically be an aggregation of IN values. Our accumulator will set IN = Set[CqlTokenRange[_, _]] and OUT = Set[CqlTokenRange[_, _]].

These are the important parts of the implementation:

The core idea is that we use an AtomicReference to store the current set of transferred token ranges. AtomicReference is a mutable reference for storing immutable values. Scala’s Set is immutable, so it can be safely stored there. Whenever we need to add another set of token ranges that have been transferred, we use getAndUpdate to atomically update the set. To extract the set, we can use the get method on the reference.

To use the accumulator, we’ve modified the connector’s TableWriter class; specifically, when writing one of the RDD’s partitions, the writer tests if the partition is a CassandraPartition (this is true for the Migrator, but not always, and thus is not reflected in the types), and extracts its ranges:

We call this function in the writeInternal method and add the transferred ranges into the accumulator after writing them:

The full implementation can be seen here.

Before starting the DataFrame write, which blocks the calling thread until it completes, a scheduled thread is set up which reads the accumulator periodically and writes its contents to the savepoint file:

Now, the second part of this feature is actually skipping token ranges that have already been transferred. The connector operates by dividing the token ranges between the executors, and then fetching the rows corresponding to those token ranges. We can filter the token ranges by passing in a function of the form (Long, Long) => Boolean, which determines if a given range should be transferred.

This is the relevant implementation in CassandraTableScanRDD#compute:

These token ranges are then fetched as they were before our modifications. This is all that’s needed to skip the token ranges that were already transferred!
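As a rough mental model of that predicate (the real version is Scala code operating on the connector's CqlTokenRange values, and the savepoint contents below are hypothetical), the skip test amounts to a membership check against the ranges recorded in the savepoint file:

# Token ranges already transferred, as recorded in the savepoint file
# (hypothetical contents; the real file is written periodically by the driver).
transferred = {(-9100000000000000000, -4000000000000000000),
               (1000000000000000000, 3000000000000000000)}

def should_transfer(start, end):
    """(Long, Long) => Boolean equivalent: skip ranges already in the savepoint."""
    return (start, end) not in transferred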

Summary

The source code for the Migrator can be found here, and the source code for the modified Cassandra connector here (see the latest commits to see the changes made). Some of the snippets shown here differ slightly from the code in Github, but the core principles remain the same.

The post Deep Dive into the Scylla Spark Migrator appeared first on ScyllaDB.

From SAP to Scylla: Tracking the Fleet at GPS Insight


From SAP to Scylla: Tracking the Fleet at GPS Insight

“Scylla is the ideal database for IoT in the industry right now. Especially without the garbage collection that Cassandra has. Small footprint. It just does what you need it to do.”

— Doug Stuns, GPS Insight

Doug Stuns began his presentation at Scylla Summit 2018 by laying out his company’s goals. Founded in 2004, GPS Insight now tracks more than 140,000 vehicles and assets. The company collects a wide variety of data for every one of those vehicles. Battery levels, odometer readings, hard stops, acceleration, vehicle performance, emissions, and GPS data to determine route efficiency.

By the time Doug was brought onboard, they knew they needed to move away from their exclusively SQL SAP Adaptive Server Enterprise (ASE) architecture to include NoSQL for real-time vehicle data management. “They had effectively too much machine data for their relational database environment,” which had been built 8 to 10 years prior. All the GPS data, all the machine data, and all the diagnostic data for monitoring their fleet was being ingested into their SQL system.

Over time, their SQL database was simply inadequate to scale to the task. “As the data got larger and larger, it became harder and harder to keep the performance up on all this machine data in a relational database. So they were looking at NoSQL options, and that’s where I came in on the scene for GPS.”

GPS Insight was, at that point, growing their data by a terabyte a year just for diagnostics. Trying to create JOINs against these large tables was creating terrible performance. An additional problem was deleting old data. “With no TTL in a relational database, what are you stuck with? A bunch of custom purge scripts,” which also impacted performance.

“My task was to find a NoSQL-based solution that would handle this. When I originally came to GPS Insight we were 100% going down the Cassandra road… I thought Hadoop was a little too ‘Gen One’ and a little too cumbersome. So I started off heading down this Cassandra road and I stumbled upon Scylla. And from there we decided to do a POC, since we had the opportunity to go with the best technology.” For example, one major issue Doug did not like about Cassandra was having to maintain an oversized cluster mainly because of Java Virtual Machine (JVM) garbage collection.

Proof of Concept

In the POC, Doug took a year’s worth of data and stood up two 3-node clusters on AWS, consisting of i3.2xlarge machines. They fired off a loading utility and began to populate both clusters.

“We loaded the Scylla data in, like, four or five hours, while it took us close to two days to load the Cassandra data.”

For production, which would scale to a ten-node cluster, Scylla solution engineers advised creating a stand-alone Scylla loader, consisting of a “beefed-up” m4.10xlarge instance with 40 processors on it. On that node, Doug installed the client libraries and created sixteen child processes.

After loading the cluster, Doug created stress tests: one in PHP and one in golang using the respective language drivers. Both were quick to deploy. “No issues there.”

When they first ran the tests, there was no difference between Scylla and Cassandra. But as they kept increasing load, Scylla scaled while Cassandra quickly topped out. Scylla could execute reads at 75,000 rows per second, while Cassandra was limited to 2,000 rows per second.

Getting to Production

With the POC complete and knowing Scylla could handle the load, it was time to implement. The team’s concerns turned to what business logic (stored procedures, etc.) were already in SAP, and what they would put in Scylla. They decided to focus exclusively on GPS data, diagnostic data, and hardware-machine interface (HMI) data. This would come from the device GPS Insight installed in every vehicle, sending updates on fifteen to twenty data points every second.

For production, GPS Insight decided to deploy a hybrid topology, with five nodes on premises and five nodes in the cloud. The on-premises servers were Dell 720s. “Definitely not state-of-the-art.” While they had 192 GB of RAM, they had spinning HDDs for storage. For the cloud, Doug said they used i3.4xlarge nodes, each with 4TB NVMe storage.

GPS Insight Production Topology

Next, Doug set up a Spark 2.3 node for the developers. Looking at the set of twenty queries they needed to run, they worked together to define a diagnostic Data Definition Language (DDL) to be able to communicate between their Scylla cluster and the SQL system. It included both non-expiring data that needed to be “kept forever” for contractual reasons (such as fuel consumption) as well as data that had the exact same schema, but had a five-month TTL expiration. “Each table was exactly the same minus the partitioning clause, and the primary key and other ordering constraints so you could get the data back to support the queries.” It also solved various data purging problems.

They partitioned one table using a compound key of account ID and vehicle identification number (VIN). The next table was identical, but partitioned only by account ID. The next was by account ID and category. They had recently added some time series data as well.

“The beauty of that for the developers was we barely had to change our PHP code and golang code. We were just able to add the drivers in and the reporting mechanism worked almost without any changes.”

They audited and validated the reports from Scylla against SAP ASE. Audit reports would expire after five months.
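As a hedged sketch of the kind of table layout Doug described (same columns, different partition keys, and a TTL on the expiring copy), the following Python snippet is illustrative only; every keyspace, table, and column name is invented for the example.

from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect("fleet")   # hypothetical keyspace

COLUMNS = "account_id int, vin text, ts timestamp, metric text, value double"

# Non-expiring copy, partitioned by account and vehicle (kept "forever").
session.execute(
    "CREATE TABLE IF NOT EXISTS diag_by_vehicle (" + COLUMNS + ", "
    "PRIMARY KEY ((account_id, vin), ts))")

# Same columns, partitioned by account only, expiring after roughly five months.
session.execute(
    "CREATE TABLE IF NOT EXISTS diag_by_account (" + COLUMNS + ", "
    "PRIMARY KEY ((account_id), ts, vin)) "
    "WITH default_time_to_live = 12960000")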

GPS Insight DevOps Methodology

To support the developers, Doug created a three keyspace software development lifecycle environment:

  • A dev keyspace, where the developers can do what they wish.
  • A keyspace that matches production, to which developers can commit their changes, and Doug can promote to production, and
  • A production keyspace, which can take any unplanned changes back to dev.

As a side benefit of offloading the raw data management to Scylla, Doug explained that this approach allowed the SAP ASE team to upgrade the SQL system, since the load was nowhere near as large, nor uptime as critical, as it had been.

“Scylla has allowed GPS Insight an inexpensive and efficient way of offloading machine data from SAP ASE while having even better visibility and performance.”

— Doug Stuns

Want to learn more? Watch the full video below:

If after watching you have any more questions or ideas regarding your own Big Data projects, make sure to join our Slack channel, or contact us directly.

The post From SAP to Scylla: Tracking the Fleet at GPS Insight appeared first on ScyllaDB.


Scylla Enterprise Release 2018.1.11


Scylla Enterprise Release Notes

The ScyllaDB team announces the release of Scylla Enterprise 2018.1.11, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.11 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. As always, Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.11 in coordination with the Scylla support team.

The major fix in this release is for a rare race condition between a schema update (adding or removing a column) and an initiated compaction, which may cause the compaction to write values to the wrong columns. The short time window where this may occur is just after compaction was initiated but before it starts writing to storage. Should the race condition occur, there may be a crash or a validation/run-time error on the driver side. #4304

Other issues fixed in this release are listed below, with open source references where present:

  • Using Scylla shard aware drivers may cause unbalanced connections, where one shard has tens of thousands of connections, while the others do not, resulting in a temporary slow down #4269
  • Reads from system tables do not time out as they should, which might create large queues of reads if connections were opened too often.
  • Rebuilding a node may cause long reactor stalls on large clusters #3639
  • A bootstrapping node doesn’t wait for a schema before joining the ring, which may result in the node failing to bootstrap, with the error storage_proxy - Failed to apply mutation. In particular, this error manifests when a user-defined type is used. #4196
  • cqlsh: Make TLSv1.2 default

Related Links

The post Scylla Enterprise Release 2018.1.11 appeared first on ScyllaDB.

Scylla Open Source Release 3.0.4


Scylla Open Source Release Notes

The ScyllaDB team announces the release of Scylla Open Source 3.0.4, a bugfix release of the Scylla 3.0 stable branch. Scylla 3.0.4, like all past and future 3.x.y releases, is backward compatible and supports rolling upgrades.

The major fix in this release is for a rare race condition between a schema update (adding or removing a column) and an initiated compaction, which may cause the compaction to write values to the wrong columns. The short time window where this may occur is just after compaction was initiated but before it starts writing to storage. Should the race condition occur, there may be a crash or validation/run-time error on the driver side. #4304

Related links:

Other issues solved in this release:

  • Commit log: In rare cases, when two chunks of the commit log or hinted handoff are written simultaneously, they might overlap, causing one to write over the header of the other and resulting in a “commitlog files are corrupted: Checksum error in segment chunk at XXXX” message. #4231
  • Materialized Views: a mistake in flow control algorithm result with Scylla fail to apply back pressure to the client #4143
  • Scylla Driver: using Scylla shard-aware drivers may cause unbalanced connections, where one shard has tens of thousands of connections while the others do not, resulting in a temporary slowdown #4269
  • MC file format: large promoted indexes might require large allocations and cause performance problems or fail with bad_alloc. This release does not fix the problem completely, but it makes it less severe. #4217
  • MC format: in some cases, writing an uncompressed MC SSTable in debug mode fails with EINVAL due to an unaligned DMA address #4262
  • CQL: Scylla does not support insertion of null data through JSON inserts #4256 (a workaround sketch follows this list). For example, the following statement fails:
    INSERT INTO mytable JSON '{
        "myid" : "id2",
        "mytext" : "text234",
        "mytext1" : "text235",
        "mytext2" : null
    }';
  • Housekeeping: scylla-housekeeping fails on servers with a Python version older than 3.6, reporting “TypeError: the JSON object must be str, not 'bytes'” #4239
  • Streaming: rebuilding a node may cause long reactor stalls on large clusters #3639
  • Bootstrap: a bootstrapping node doesn’t wait for a schema before joining the ring, which may result in the node failing to bootstrap, with the error “storage_proxy - Failed to apply mutation”. In particular, this error manifests when a user-defined type is used. #4196
  • Authentication: possible access to uninitialized memory in password_authenticator #4168
  • fragmented_temporary_buffer::read_exactly() allocates huge amounts of memory on a premature end-of-stream, resulting in bad_alloc when reading from CQL (for example, if the connection is closed in the middle of a frame) and in other places #4233
  • cqlsh now uses TLSv1.2 by default
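
Until #4256 is resolved, one possible workaround is to write the null value with a regular CQL statement instead of a JSON insert. This is a minimal sketch only, reusing the hypothetical mytable schema from the example above and assuming myid is the primary key:

-- A plain INSERT accepts an explicit null outside of the JSON code path
INSERT INTO mytable (myid, mytext, mytext1, mytext2)
VALUES ('id2', 'text234', 'text235', null);

-- Alternatively, clear the single column with a separate statement
UPDATE mytable SET mytext2 = null WHERE myid = 'id2';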

The post Scylla Open Source Release 3.0.4 appeared first on ScyllaDB.

Scylla Monitoring Stack Release 2.2

Scylla Monitoring Stack Release Notes

The Scylla team is pleased to announce the release of Scylla Monitoring Stack 2.2.

Scylla Monitoring Stack is an open source stack for monitoring Scylla Enterprise and Scylla Open Source, based on Prometheus and Grafana. Scylla Monitoring Stack 2.2 supports:

  • Scylla Open Source versions 2.3 and 3.0
  • Scylla Enterprise versions 2017.x and 2018.x
  • Scylla Manager 1.3.x

Related Links

New in Scylla Monitoring Stack 2.2

  • CQL optimization dashboard (#471)
    The CQL optimization dashboard helps identify issues when developing an application with Scylla, such as non-prepared statements, queries that are not token aware, non-paged queries, and requests from a remote DC. Before using the new dashboard, make sure you have correctly defined the data center names (see Align Data Center Names below). A blog post with more on the new Optimization Dashboard and how to use it will be published soon.

Scylla Monitor 2.2 Dashboard

  • Unified target files for Scylla and node_exporter (#378)
    To simplify the Prometheus configuration, you now only need to configure the Scylla targets. Prometheus assumes that a node_exporter is running on each of the Scylla servers and uses the same IPs as those set in the Scylla targets. It is still possible to configure a specific node_exporter target file.
  • Per machine (node_exporter related) dashboard added to Enterprise (#495)
    The per-machine dashboard shows information about the host disk and network. It is now available for Enterprise.
  • Prometheus container uses the current user ID and group (#487)
    There is an ongoing issue with the volume the Prometheus container uses to store its data. From Scylla Monitoring Stack version 2.2, the container runs with the current user and group ID. This means that the data directory should have the current user’s permissions. While this does not require any changes, it is recommended to check your Docker installation and make sure you are not running Docker as root.
  • kill-all.sh kills Prometheus instances gracefully (#438)
    The kill-all.sh script now attempts to stop Prometheus gracefully, which lets Prometheus start quickly after a shutdown. This means that shutdowns can take longer than before: kill-all.sh waits up to two minutes for Prometheus to shut down, and once that time has elapsed, it forcefully kills the container.
  • start-all.sh now supports --version flag (#374)
    To verify your Monitoring Stack version, you can now run ./start-all.sh --version
  • Remove the version from the dashboard names (#486)
    Following the move to Grafana 5 and the use of the dashboard folders, the version was removed from the dashboard names.
  • Dashboards loaded from the API have the overwrite property set to true (#474)
    For users who upload dashboards with the API, the dashboards now have the overwrite flag set to true, so you can upload the same dashboard twice.
  • Update Alertmanager to 0.16 (#478)
    Alertmanager was updated to version 0.16. For details on the changes in Alertmanager, see its changelog.

Align Data Center Names

The new Optimization Dashboard (above) relies on the per-data-center definition of nodes in the Monitoring Stack matching the data center names used in the Scylla cluster. For example:

nodetool status
Datacenter: DC1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address     Load       Tokens   Owns Host ID                              Rack
UN 172.20.0.4  108.83 KB  256      ?    fae7039a-21ad-4e94-9474-430abcf48158 Rack1
UN 172.20.0.2  108.86 KB  256      ?    fe2986de-9c8a-44bb-8b3b-923519095a23 Rack1
UN 172.20.0.3  108.84 KB  256      ?    2a10a36f-365f-455a-85d4-18cd40b6b765 Rack1
Datacenter: DC2
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
-- Address     Load       Tokens   Owns Host ID                              Rack
UN 172.20.0.5  108.8 KB   256      ?    edbc46cb-d948-4745-a90b-28d3bc90c034 Rack1
UN 172.20.0.6  108.76 KB  256      ?    58a5a43c-8ec1-4369-91d9-6bd79d1d706a Rack1
UN 172.20.0.7  108.27 KB  256      ?    c5886895-be75-4e18-8fa5-3633a10f9ee8 Rack1

The data center names in the output, in this case DC1 and DC2, should match those defined in scylla-grafana-monitoring/prometheus/scylla_servers.yml, such as:

- targets:
    - 172.20.0.2:9180
    - 172.20.0.3:9180
    - 172.20.0.4:9180
  labels:
    cluster: my-cluster
    dc: DC1

- targets:
    - 172.20.0.5:9180
    - 172.20.0.6:9180
    - 172.20.0.7:9180
  labels:
    cluster: my-cluster
    dc: DC2

Bug Fixes

  • Moved the node_exporter relabeling to metric_relabeling (#497)
  • Fixed units in foreground write (#463)
  • Manager dashboard was missing UUID (#505)

The post Scylla Monitoring Stack Release 2.2 appeared first on ScyllaDB.

Discord, on the Joy of Opinionated Systems

Discord and Scylla

Mark Smith, Director of Engineering at Discord, manages the infrastructure team for the company that powers community across the gaming industry. Last May, when the company turned three years old, Discord boasted a user base of 130 million registered users. That compared to 45 million the year before, marking a three-fold increase in users in a single year. As of this writing, in March 2019, the provider of free voice and text chat focused on Internet gaming has grown its user base to over 250 million users worldwide, and is thus well on its way to doubling in size again within a year.

Discord has risen to a global website rank of 126 on Alexa, and on Similarweb it is ranked the 100th most popular site in the world. Every day somewhere around fourteen million people use Discord to send over three hundred million messages.

Which means that scalability and uptime are not optional.

Productionization

Mark did not begin by talking about Scylla, nor about Discord. Instead, Mark harkened back to his experience with Apache Kafka at Dropbox. There, he was part of a team of four engineers that launched the technology, where it scaled to 15 million messages per second on 100 brokers organized into multiple clusters. He said it took about a year to “productionize.”

Mark’s definition of the term was slightly different than what one would find in a dictionary. To him it was more than just getting into production. “This is sort of the road from running it, turning it on, to the point where you feel comfortable running it. Where you can recover from outages. Where you can deal with whatever problems arise. Where you can train people. And all of those good things.” And this process took a year.

“It turns out Kafka is actually kind of hard. And this is a story we see repeated in the open source world in a lot of the projects that we use like Cassandra, MongoDB, etc. It’s very easy to download and start them. It’s a lot harder to run them at scale.”

Taking Kafka as a further example, Mark noted there are different Kafka distributions available, from open source to proprietary offerings. Plus, configuration is often a black art. While you can study how Kafka is used in different environments, those configurations often depend on unique hardware characteristics and other use case factors particular to that installation. There are over 150 tunables for a Kafka broker, and the permutations lead to huge complexity that is incredibly hard to manage.

Beyond that is the issue of monitoring. “If you Google ‘How to monitor Kafka,’ they say ‘Here is how you use JMX.’ That just gives you stats. That gives you a lot of stats. That doesn’t actually tell you how to monitor a system. That doesn’t tell you what matters.”

“When you are paged at 3 AM, and you have ten thousand metrics to look through…? What matters? It’s really hard to tell.”

“This also makes getting community help very difficult. Because you have to spend the first hour of every conversation: ‘Here’s my config file. Here’s my hardware. Here’s my use case. Here’s my load. Here’s how everything fits together,’ before they can get to the point of understanding what you are doing to try to help you out.”

“We followed the road that most of you have probably followed at one time or another. You start by being very reactive with incidents. You write post-mortems. You do root cause analysis. You do all those sorts of things to understand what has gone wrong in your cluster. And then you make betterments out of that. You make playbooks. You improve your monitoring. You create dashboards. You learn about stuff that you don’t have today that you need to have.

“And then over time, hopefully, you move to a more proactive story. With chaos engineering, disaster readiness testing [DRT]. Stuff like that. But it’s never easy. And if you’ve deployed, in this case, Kafka, or Cassandra, or Mongo, or any of these systems at scale, you’ve been through this story before. It’s never as easy as we’d hope.”

Opinionated Systems

Mark took a small digression to talk about flight lessons. He compared the complexity of the cockpit of a 747, and the training needed for a multi-person crew, and how we are still able to end up with a very safe method of transportation. Standard Operating Procedures (SOPs), Pilot’s Operating Handbooks (POHs). Checklists, checklists, checklists.

“It’s regimented, and it’s fully documented and trained. You know exactly what you’re doing. And you’re following the list.”

Such systems are designed by “competent stewards of the system who understand it deeply.” Whereas in aviation you have the FAA and NTSB, you don’t have the same regulating forces in open source. It’s up to the community and the users.

“The software is good. But the ecosystem around it. How you run it in production. It’s Wild West. You can ask LinkedIn what they do. And they will help you out and show you what’s been successful for them. You can talk to Twitter. You can talk to your friends at wherever and you will get a hundred stories from a hundred people. But that’s not really going to help you in running your system.”

Scylla at Discord

When you launch the app, you land in the Activity Tab. It includes news, a quick launcher for your favorite games, a list of what your friends are currently playing (or recently played), plus a section for your friends’ listening parties, streams and Xbox status. “We use Scylla to power this and we trust it to be the data source for the homepage of Discord.”

Discord Activity Tab

Mark talked about the surety that comes with Scylla’s autoconfiguration. “Scylla just says, ‘You know what? This is our system. We’ve spent years building this and understanding it. Do it like this. Here’s how you install it. Here’s how you configure it.’”

“In fact, you don’t choose how to configure Scylla for the most part. You run some tools. They write out config files. Yes, you can go in after the fact and change it, if you wish, but they give you the baseline of how to do it. And for monitoring, they are very opinionated. Here’s your Docker image of Prometheus. Here’s your image of Grafana. Here are your dashboards. Use these.”

Such opinionated systems are, to Mark, a critical advantage. “If you go out into the community and you ask for help with Scylla, everybody knows what you’re going to be looking at. They know what graphs are important. People who are good stewards of the system have already figured out what is important and how to present it. So you don’t actually have a lot of decisions to make when it comes to deploying Scylla.”

“In our experience at Discord we found this to be true. Setup was trivial. The operations have been easy and well-understood. We also have a background in Cassandra which has helped in some of that.”

He did add a caution: “We have had a production incident,” with the acknowledgement that systems will always fail. In Discord’s case it was a capacity issue. However, because of its strongly-opinionated methods, the ScyllaDB team was able to immediately pinpoint the issue. The solution was simply to expand the cluster.

“The multiplicative factor of having shared knowledge, shared language, shared ability to understand what these things look like and how they work is very powerful for your business. It’s very powerful for running these systems at scale.”

To Mark, it is okay, even preferable, to have strong opinions. “It’s okay to say, ‘This is how you run something. This is how you do something.’”

This advice applied beyond open source vendors to the services offered by organizations themselves. “This applies to each of us too, in the things that we are delivering to our customers. There is a cost to infinite configuration. There is a cost to the mental overhead of having to make all these choices. Of understanding how to put stuff together. Your customers may ask for it. They may say, ‘Give us all these knobs and whistles and buttons.’ But it is up to you to figure out: is that actually the right thing for them? Am I actually adding value to their use case and what they’re doing?”

In conclusion, Mark offered a pretty definitive assessment: “Scylla has exemplified this sort of principle and it has ended up saving Discord time, money and downtime.”

The post Discord, on the Joy of Opinionated Systems appeared first on ScyllaDB.

Scylla Open Source Release 2.3.4

Scylla Open Source Release Notes

The Scylla team announces the release of Scylla Open Source 2.3.4, a bugfix release of the Scylla Open Source 2.3 stable branch. Scylla Open Source Release 2.3.4, like all past and future 2.3.y releases, is backward compatible and supports rolling upgrades.

Note that the latest stable branch of Scylla Open Source is Scylla 3.0; you are encouraged to upgrade to it.

The major fix in this release is for a rare race condition between a schema update (adding or removing a column) and an initiated compaction, which may cause the compaction to write values to the wrong columns. The short time window where this may occur is just after compaction was initiated but before it starts writing to storage. Should the race condition occur, there may be a crash or validation/run-time error on the driver side. #4304

Related links:

Other issues solved in this release:

  • Scylla node crashes upon a prepare request with a multi-column IN restriction (see the sketch after this list) #3692, #3204
  • Scylla Driver: using Scylla shard-aware drivers may cause unbalanced connections, where one shard has tens of thousands of connections while the others do not, resulting in a temporary slowdown #4269
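
For context, a multi-column IN restriction is a query that groups several clustering columns into a single tuple predicate. The following is a minimal, hypothetical sketch of such a query (the table, with partition key pk and clustering columns ck1 and ck2, is illustrative and not taken from the issues themselves):

-- Tuple-style IN over two clustering columns; preparing a statement of this
-- shape is the kind of request described in #3692 and #3204
SELECT * FROM mytable
WHERE pk = 'id1'
  AND (ck1, ck2) IN (('a', 1), ('b', 2));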

The post Scylla Open Source Release 2.3.4 appeared first on ScyllaDB.
