
SaaS vs OSS – Fight or flight, round #2


Inspecting Software Licensing (AGPL, SSPL, Confluent Community)

To quote Bob Dylan, “the times they are a changin’.” Microsoft loves Linux, IBM buys Red Hat, RedisLabs changes their module license to Commons Clause, Mongo invents Server Side Public License (SSPL) and moves from AGPL, AWS open sources Firecracker and releases a Kafka service, and the hot news from Friday, Confluent changes its license for components of the Confluent Platform from Apache 2.0 to the Confluent Community License.

A few weeks ago I wrote about MongoDB’s SSPL, which is similar to Confluent’s new license. You could say the Confluent Community License is to the Apache license as MongoDB SSPL is to AGPL. Both of these new licenses take the same position in order to protect their assets from cloud providers.

It is hard to blame Confluent for responding to pressure from the AWS Kafka service, and perhaps to fear that other cloud vendors will follow. Rightfully, Confluent wishes to enjoy the fruits of its investment in KSQL and, in parallel, still keep it open source. It’s important to note that KSQL was fully developed by Confluent and isn’t part of the Apache Kafka project, which remains under the ASL2 license. To the best of my knowledge, AWS has not provided KSQL as a service, but perhaps this is a means of preventing future abuse. AWS’s adoption of Kafka also signals a victory of OSS Kafka over the proprietary Kinesis technology.

Although more and more OSS vendors are going down the path of more restrictive licenses, and they might even eventually make these standard, overall this is a step in the wrong direction for our industry. These more restrictive licenses have the regrettable consequence of creating silos where once there was sharing.

Remember that just the opposite was true for large community projects like Linux, KVM, and Hadoop, which saw contributors line up by the hundreds. Unfortunately, pressure from SaaS/Cloud providers is turning OSS vendors toward the undesirable (but perhaps logical) path of more restrictive usage licenses. This is a lose-lose scenario, one that goes against the very spirit of open source.

In a perfect world, customers would not use services based on OSS software from vendors who did not participate in creating that software. This is the flag that OSS vendors should carry. On the other hand, it’s hard to compete with an 800-pound gorilla that owns much of the industry’s mindshare.

My prediction is that vendors like AWS will receive more criticism and, in response, will lean more and more toward an open source play. Were this a game of chess, a streaming SQL query functionality contribution to Apache Kafka by AWS would result in a call of check. Let’s stay tuned and see how it evolves.

Another class of companies that will be hurt by this licensing trend is the smaller as-a-service vendors that provide MongoDB/Elastic/Kafka/etc. Despite their contributions back to the OSS, these smaller companies will be hampered by the more restrictive licenses, which will keep them from running the very technologies they’ve aided. Examples include mLab with MongoDB, Instaclustr with Kafka/Confluent, and IBM Compose with a variety of offerings.

This trend is neither healthy nor ideal. We should all want to see OSS vendors and IaaS vendors complement each other — either by contributing to the same OSS project and sharing the commercial benefits or by allowing the OSS vendor to monetize on top of the IaaS marketplace. As an end user, you should support this cause by directing your buying power not only toward the vendors providing the best services, of course, but also toward the ones making a difference by aligning themselves with the overall long-term value you receive from the ecosystem.

Dor is the co-founder and CEO of ScyllaDB, which develops the AGPL-licensed Scylla database and Seastar, its Apache-licensed core engine.



Scylla and Confluent Integration for IoT Deployments


 

The Internet is not just connecting people around the world. Through the Internet of Things (IoT), it is also connecting humans to the machines all around us and directly connecting machines to other machines. In this blog post we’ll share an emerging machine-to-machine (M2M) architecture pattern in which MQTT, Apache Kafka and Scylla all work together to provide an end-to-end IoT solution. We’ll also provide demo code so you can try it out for yourself.

 

IoT Scale

IoT is a fast-growing market, already valued at over $1.2 trillion in 2017 and anticipated to grow to over $6.5 trillion by 2024. The explosive number of devices generating, tracking, and sharing data across a variety of networks is overwhelming to most data management solutions. With more than 25 billion connected devices in 2018 and internet penetration having grown by a staggering 1,066% since 2000, the opportunity in the IoT market is significant.

There’s a wide variety of IoT applications, like data center and physical plant monitoring, manufacturing (a multibillion-dollar sub-category known as Industrial IoT, or IIoT), smart meters, smart homes, security monitoring systems and public safety, emergency services, smart buildings (both commercial and industrial), healthcare, logistics & cargo tracking, retail, self-driving cars, ride sharing, navigation and transport, gaming and entertainment… the list goes on.

A significant dependency for this growth is the overall reliability and scalability of IoT deployments. As Internet of Things projects go from concepts to reality, one of the biggest challenges is how the data created by devices will flow through the system. How many devices will create information? What protocols do the devices use to communicate? How will they send that information back? Will you need to capture that data in real time, or in batches? What role will analytics play in the future? What follows is an example of such a system, using existing best-in-class technologies.

An End-to-End Architecture for the Internet of Things

IoT-based applications (both B2C and B2B) are typically built in the cloud as microservices with similar characteristics. It is helpful to think about the data created by the devices and the applications in three stages:

  • Stage one is the initial creation — where data is created on the device and then sent over the network.
  • Stage two is how the central system collects and organizes that data.
  • Stage three is the ongoing use of that data stored in a persistent storage system.

Typically, when sensors/smart-devices get actuated they create data. This information can then be sent over the network back to the central application. At this point, one must decide which standard the data will be created in and how it will be sent over the network.

One widely used protocol for delivering this data is the Message Queuing Telemetry Transport (MQTT) protocol. MQTT is a lightweight pub-sub messaging protocol typically used for M2M communication. Apache Kafka® is not a replacement for MQTT; rather, since MQTT is not built for high scalability, long-term storage, or easy integration with legacy systems, it complements Apache Kafka well.

In an IoT solution, devices can be classified into sensors and actuators. Sensors generate data points, while actuators are mechanical components that may be controlled through commands; for example, the ambient lighting in a room may be used to adjust the brightness of an LED bulb. MQTT is a protocol optimized for sensor networks and M2M. Since MQTT is designed for low-power, coin-cell-operated devices, it cannot handle the ingestion of massive datasets.

On the other hand, Apache Kafka can handle high-velocity data ingestion but is not designed for M2M communication. Scalable IoT solutions therefore use MQTT for explicit device communication while relying on Apache Kafka for ingesting sensor data. Although it is possible to bridge Kafka and MQTT for ingestion, it is recommended to keep them separate, configuring the devices or gateways as Kafka producers while they still participate in the M2M network managed by an MQTT broker.

At stage two, data typically lands as streams in Kafka and is arranged in the corresponding topics that various IoT applications consume for real-time decision making. Various options like KSQL and Single Message Transforms (SMT) are available at this stage.

At stage three this data, which typically has a shelf life, is streamed into a long-term store like Scylla using the Kafka Connect framework. A scalable, distributed, peer-to-peer NoSQL database, Scylla is a perfect fit for consuming the variety, velocity and volume of data (often time-series) coming directly from users, devices and sensors spread across geographic locations.

What is Apache Kafka?

Apache Kafka is an open source distributed message queuing and streaming platform capable of handling a high volume and velocity of events. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from a message queuing system to a full-fledged streaming platform.

Enterprises typically accumulate large amounts of data over time from different sources and data types such as IoT devices and microservices applications. Traditionally, for businesses to derive insights from this data they used data warehousing strategies to perform Extract, Transform, Load (ETL) operations that are batch-driven and run at a specific cadence. This leads to an unmanageable situation as custom scripts move data from their sources to destinations as one-offs. It also creates many single points of failure and does not permit analysis of the data in real time.

Kafka provides a platform that can arrange all these messages by topics and streams. Kafka is enterprise-ready, with features like high availability (HA) and replication on commodity hardware. Kafka resolves the impedance mismatch between the sources and the downstream systems that need to perform business-driven actions on the data.

Apache Kafka Integrations

What is Scylla?

Scylla is a scalable, distributed, peer-to-peer NoSQL database. It is a drop-in replacement for Apache Cassandra™ that delivers as much as 10X better throughput and more consistent low latencies. It also provides better cluster resource utilization while building upon the existing Apache Cassandra ecosystem and APIs.

Most microservices developed in the cloud prefer a distributed database native to the cloud that can scale linearly. Scylla fits that use case well by harnessing modern multi-core/multi-CPU architecture and producing low, predictable latency response times. Scylla is written in C++, which results in significant improvements in TCO and ROI, and an overall better user experience.

Scylla is a perfect complement to Kafka because it leverages the best from Apache Cassandra in high availability, fault tolerance, and its rich ecosystem. Kafka is not an end data store itself, but a system to serve a number of downstream storage systems that depend on sources generating the data.

Demo of Scylla and Confluent Integration

The goal of this demo is to show an end-to-end use case in which sensors emit temperature and brightness readings to Kafka and the messages are then processed and stored in Scylla. To do this, we use the Kafka MQTT proxy (part of the Confluent Enterprise package), which acts as a broker for all the sensors that are emitting the readings.

We also use the Kafka Connect Cassandra connector, which spins up the necessary consumers to stream the messages into Scylla. Scylla supports both the data format (SSTable) and all relevant external interfaces, which is why we can use the out-of-the-box Kafka Connect Cassandra connector.

The load from various sensors is simulated as MQTT messages via the MQTT Client (Mosquitto), which will publish to the Kafka MQTT broker proxy. All the generated messages are then published to the corresponding topics and then a Scylla consumer picks up the messages and stores them into Scylla.

Steps

  1. Download Confluent Enterprise

  2. Once the tarball is downloaded – then:
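    For example (the archive URL and version here are assumptions; adjust them to the tarball you actually downloaded):

    curl -O https://packages.confluent.io/archive/5.0/confluent-5.0.0-2.11.tar.gz
    tar -xzf confluent-5.0.0-2.11.tar.gz
    cd confluent-5.0.0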

  3. Set the $PATH variable
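    For example (the install location below is hypothetical):

    export CONFLUENT_HOME=~/confluent-5.0.0
    export PATH=$PATH:$CONFLUENT_HOME/bin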

    For the demo we choose to run the Kafka cluster locally, but to run this in production we would have to modify a few files to include the actual IP addresses of the cluster:

    • Zookeeper – /etc/kafka/zookeeper.properties
    • Kafka – /etc/kafka/server.properties
    • Schema Registry – /etc/schema-registry/schema-registry.properties

  4. Now we need to start the Kafka and ZooKeeper services. The Confluent CLI can start both at once; to start them manually, you have to pass each service its properties file, as shown below.
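    A sketch of both options, run from the Confluent installation directory:

    # one-shot, using the bundled Confluent CLI (development use only)
    confluent start

    # or manually, passing each service its properties file
    zookeeper-server-start ./etc/kafka/zookeeper.properties &
    kafka-server-start ./etc/kafka/server.properties &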

  5. Configuring the MQTT proxy

    Inside the directory /etc/confluent-kafka-mqtt there is a file, kafka-mqtt-dev.properties, that comes with the Confluent distribution and lists all the available configuration options for MQTT Proxy. Modify these parameters:
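    A sketch of the relevant settings (the listener and bootstrap values assume a local setup; the topic mapping matches the topics created in the next step):

    topic.regex.list=temperature:.*temperature,brightness:.*brightness
    listeners=0.0.0.0:1883
    bootstrap.servers=PLAINTEXT://localhost:9092
    confluent.topic.replication.factor=1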

  6. Create Kafka topics

    The simulated MQTT devices will be publishing to the topics temperature and brightness, so let’s create those topics in Kafka manually.
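    For example, on a single-node local cluster:

    kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic temperature
    kafka-topics --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic brightness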

  7. Start the MQTT proxy

    This is how we start the configured MQTT proxy:
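    For example (assuming the start script bundled with the proxy and the properties file edited above):

    kafka-mqtt-start ./etc/confluent-kafka-mqtt/kafka-mqtt-dev.properties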

  8. Installing the Mosquitto framework
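    For example, using the system package manager:

    sudo apt-get install mosquitto mosquitto-clients   # Debian/Ubuntu
    # or: brew install mosquitto                       # macOS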

  9. Publish MQTT messages

    We are going to be publishing messages with QoS 2, the highest quality of service supported by the MQTT protocol:
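    For example (the JSON payload shape is an assumption for this demo):

    mosquitto_pub -h localhost -p 1883 -t temperature -q 2 -m '{"sensor_id": "sensor-1", "value": 22.5}'
    mosquitto_pub -h localhost -p 1883 -t brightness -q 2 -m '{"sensor_id": "sensor-1", "value": 310}'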

  10. Verify messages in Kafka

    Make sure that the messages are published into the Kafka topic:
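    For example, with the console consumer:

    kafka-console-consumer --bootstrap-server localhost:9092 --topic temperature --from-beginning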

  11. To produce a continuous feed of MQTT messages (optional)

    Run this in the terminal:
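    A minimal sketch that publishes a random temperature reading every second:

    while true; do
        mosquitto_pub -h localhost -p 1883 -t temperature -q 2 \
          -m "{\"sensor_id\": \"sensor-1\", \"value\": $((RANDOM % 40))}"
        sleep 1
    done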

  12. Let’s start a Scylla cluster and make it a Kafka Connect sink

    Note: If you choose to run Scylla in a different environment, start from here: https://www.scylladb.com/download/

    Once the cluster comes up with 3 nodes, ssh into each node, uncomment the broadcast address in /etc/scylla/scylla.yaml, and change it to the public address of the node. This is needed if we are running the demo locally on a laptop, or if the Kafka Connect framework runs in a different data center from the Scylla cluster.
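    For example, on one node (the address below is illustrative):

    # /etc/scylla/scylla.yaml
    broadcast_address: 203.0.113.10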

  13. Let’s create a file cassandra-sink.properties

    This will enable us to start the connect framework with the necessary properties.

    Add these lines to the properties file:
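    A sketch of the sink configuration (the connector class and connect.cassandra.* property names follow the stream reactor’s Cassandra sink; the keyspace, credentials, and KCQL mappings are assumptions that match the schema created in step 16):

    name=cassandra-sink
    connector.class=com.datamountaineer.streamreactor.connect.cassandra.sink.CassandraSinkConnector
    tasks.max=1
    topics=temperature,brightness
    connect.cassandra.contact.points=localhost
    connect.cassandra.port=9042
    connect.cassandra.key.space=iot
    connect.cassandra.username=cassandra
    connect.cassandra.password=cassandra
    connect.cassandra.kcql=INSERT INTO temperature SELECT * FROM temperature;INSERT INTO brightness SELECT * FROM brightness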

  14. Next we need to download the binaries for the stream reactor, which provides the Kafka Connect Cassandra connector.

    Now change the plugin.path property in
    /confluent-5.0.0/etc/schema-registry/connect-avro-distributed.properties
    to ABSOLUTE_PATH/confluent-5.0.0/lib/stream-reactor-1.1.0-1.1.0/libs/, so that the property reads:
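    plugin.path=ABSOLUTE_PATH/confluent-5.0.0/lib/stream-reactor-1.1.0-1.1.0/libs/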

  15. Now let’s start the Connect framework in distributed mode:
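    For example:

    connect-distributed ./etc/schema-registry/connect-avro-distributed.properties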
  16. Make sure that the cassandra-sink.properties file is updated with the necessary contact points of the Scylla nodes, i.e., their external IP addresses.

    Make sure that the necessary keyspace and tables with the appropriate schema are created after you connect to the Scylla nodes with cqlsh. For example:
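    A hypothetical schema matching the demo topics (adjust names and types to your payloads and KCQL mappings):

    CREATE KEYSPACE iot WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 3 };
    CREATE TABLE iot.temperature (sensor_id text, ts timestamp, value double, PRIMARY KEY (sensor_id, ts));
    CREATE TABLE iot.brightness (sensor_id text, ts timestamp, value double, PRIMARY KEY (sensor_id, ts));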

    Then, to start the sink connector:
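    One way to do this is with the Confluent CLI, which posts the configuration to the Connect REST API:

    confluent load cassandra-sink -d cassandra-sink.properties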

    After you run the above command, you should be able to see Scylla as a Cassandra sink, and any messages published using the instructions in step 9 will get written to Scylla as a downstream system.

  17. Now, let’s try to run a script that simulates the activity of an MQTT device. You can do this by cloning this repo: https://github.com/mailmahee/MQTTKafkaConnectScyllaDB

    And then running
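    For example (the script name is a placeholder; see the repo’s README for the actual entry point):

    git clone https://github.com/mailmahee/MQTTKafkaConnectScyllaDB
    cd MQTTKafkaConnectScyllaDB
    python mqtt_device_simulator.py   # placeholder name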

    This script simulates MQTT sensor activity and publishes messages to the corresponding topics. The Connect framework then drains the messages from the topics into the corresponding tables in Scylla.

You did it!

If you followed the instructions above, you should now be able to connect Kafka and Scylla using the Connect framework. In addition, you should be able to generate MQTT workloads that publish messages to the corresponding Kafka topics, which are then used for both real-time and batch analytics via Scylla.

Given that applications in IoT are by and large based on streaming data, the alignment between MQTT, Kafka and Scylla makes a great deal of sense. With the new Scylla connector, application developers can easily build solutions that harness IoT-scale fleets of devices, as well as store the data from them in Scylla tables for real-time as well as analytic use cases.

Many of ScyllaDB’s IoT customers like General Electric, Grab, Nauto and Meshify use Scylla and Kafka as the backend for handling their application workloads. Whether a customer is rolling out an IoT deployment for commercial fleets, consumer vehicles, remote patient monitoring or a smart grid, our single-minded focus on the IoT market has led to scalable service offerings that are unmatched in cost efficiency, quality of service and reliability.

 

Try It Yourself


Scylla Summit 2018 Keynote: Four Years of Scylla

Dor Laor at Scylla Summit 2018
Now that the dust has settled from our Scylla Summit 2018 user conference, we’re glad for the chance to share the content with those who couldn’t make the trip to the Bay Area. We’ll start with the keynote from our CEO and Co-founder, Dor Laor, who kicked off the event with his talk about the past, present and future of Scylla.

Watch the video in full:

Browse the slides:

Dor began with an overview of the trends in the industry and first and foremost, digital transformation. Sticking to the down-to-earth, practical culture at ScyllaDB, Dor covered real life customer examples from all around us. Starting with space, with satellite-enabled services like GPS Insight and TellusLabs, to the rain forest plant-based beauty products of Brazil’s Natura, to the GE Predix-powered Industrial Internet of Things (IIoT) platform, Comcast X1’s on-demand services, as well as automotive applications from Faraday Future and Nauto.

In the beginning…

Scylla and Charybdis

Dor shared our company’s origins. He recalled first announcing the Seastar framework in February 2015, and leaving stealth mode in September of that year. ScyllaDB CTO Avi Kivity presented at that year’s Cassandra Summit on how a new database, Scylla, could deliver 1,000,000 CQL operations per server.

Scylla 1.0 Release Graphic

Over the ensuing three years we made a great deal of progress. We released Scylla 1.0 at the end of March 2016. That year also saw the first Scylla Summit in September. The following year, in March 2017, Scylla unveiled its first Enterprise software release.

Scylla Enterprise Release Graphic

While Scylla was blazing its own path in the world of NoSQL, Dor also remarked on the successes of others in the industry, including MongoDB’s public offering in October 2017 and the September 2018 IPO of Elastic. These events serve as validation of the growing Big Data market, as the hunger for data increases, fed by the growing appetite of modern, planet-scale software. Not only do most enterprises now trust the operational capabilities of distributed NoSQL databases, but the new world’s requirements cannot be met by traditional relational models.

State of the Art

Moving to the present, Dor announced Scylla Open Source 3.0. With this release, Scylla was finally achieving feature parity with Cassandra, and, in some cases, it was taking the lead. For storage, SSTable format 3.0 (mc) would reduce data footprint on disk. Production-ready Materialized Views (MV) and Global Secondary Indexes (GSI) will help users access only the data they need. Lightweight Transactions (LWT) remains the last major feature to achieve full feature parity with Cassandra.

Dor also announced that our cloud managed database, Scylla Cloud, was available as early access. Running on Amazon Web Services (AWS), Scylla Cloud lets users launch a fully managed, single-tenant, self-service Scylla cluster in minutes.

Scylla Cloud Graphic

As much as we talk about Cassandra, we are shifting gears and aim to be competitive with the best-of-breed NoSQL databases, with DynamoDB the leading example.

Scylla vs. Dynamo Graphic

Dor shared results from a head-to-head YCSB comparison of Scylla versus Amazon DynamoDB. We just recently published the comparative benchmark results. Our test results show that, for throughput similar to DynamoDB’s, you can achieve 1/4th the latency while spending only 1/7th the cost with Scylla. (Scylla Cloud is 4-6X less expensive than DynamoDB.)

However, the real performance difference occurred in Zipfian distributions. You can read the blog in full as to why this is an important real-world consideration. Analogous test results were found for Bigtable, and CosmosDB was expected to perform similarly.

OLTP vs. OLAP Graphic

Another key feature introduced for the first time at Scylla Summit 2018 was our unique ability to support per-user SLAs, allowing system managers to limit database resource utilization. With this, Scylla customers can use the same Scylla cluster to service both transaction processing (mixed read-write, or write-heavy loads) as well as analytics (read-only/mostly) requests. Glauber Costa would host a full session on this, entitled OLAP or OLTP: Why not both?

The per-user SLA capability builds on three years of development guaranteeing SLAs for real-time operations over distributed database background operations such as compaction, repair, and streaming. It is a point-in-time step in the evolution toward a perfectly multi-tenant database.

Dor then enumerated a list of noteworthy accomplishments and the challenges we still have before us. For example, while he was proud of our Mutant Monitoring System (MMS), there is still work to be done on our Knowledgebase, as well as our upcoming launch of Scylla University. And while performance is good, and compactions are relatively smooth compared to other offerings, there are still more optimizations to be done. And while he was proud of the work we’ve done to integrate with Apache Spark, there’s a lot more to do to align Scylla with Kubernetes.

The Shape of Things to Come

 

To conclude, Dor gave a glimpse into the future of Scylla. Finishing up Cassandra parity features, especially Lightweight Transactions. Fleshing out Scylla Cloud. Making Scylla itself a stronger offering, with new tiered storage options, improvements in performance and additional drivers. And finally, making Scylla even easier to manage.

It has been a remarkable journey over the past four years. From all of us at ScyllaDB, thank you for following us on our journey, and for a wonderful 2018. 

Looking ahead, 2019 is sure to be another amazing year of pioneering achievements in the world of Big Data, both for Scylla as well as our users and customers. We’re looking forward to all that we will accomplish together!


Scylla Enterprise Release 2018.1.8


Scylla Enterprise Release

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.8, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.8 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.8 in coordination with the Scylla support team.

This release includes memory management improvements, first introduced in the Scylla Open Source 2.3 release and graduated into Scylla Enterprise. These changes allow Scylla to free large contiguous areas of memory more reliably and with less effort, improving the performance of workloads that have large blobs or collections. #2480

Related Links     

Additional fixed issues in this release, with open source references where available:

  • In extreme out-of-memory situations, failure to allocate an internal data structure used for synchronizing memtable flushes can result in a segmentation fault and crash. #3931

Auditing: the audit configuration syntax in scylla.yaml was changed from audit-categories and audit-tables to audit_categories and audit_tables. Note that the upgrade procedure does *not* update an existing scylla.yaml automatically. More on Scylla Enterprise auditing and auditing configuration here.


Scylla Summit 2018 Tech Talks Now Online


There was so much happening at our Scylla Summit 2018 late last fall. We held more than three dozen sessions, including multiple keynotes and concurrent breakout tracks. Our Tech Talks page has been updated with the videos and slides from Scylla Summit 2018. Now you can see what you missed — whether or not you were able to attend our user conference.

Dor Laor at Scylla Summit 2018

In the weeks ahead we’ll showcase some of the best talks from our conference, but there’s no need to wait. You can browse the whole catalog of talks from Scylla Summit 2018 today!

You can see the YouTube videos and all the SlideShare presentations in one place. All the keynotes, both by ScyllaDB executives and some of our most prominent customers. All the use case presentations by our incredible groundbreaking community members. All the tech talks from our engineers on current and upcoming features, best practices, tips and tricks.

We’ll point out a good intro to many of the engineering talks at Scylla Summit 2018: ScyllaDB CTO and Co-Founder Avi Kivity presented on our near-term and longer-term initiatives. This is quite timely, too, as the release of Scylla Open Source 3.0 is right around the corner!

If you have any questions or comments after watching these presentations, or if you’d like to share your own experience with Scylla, please feel free to contact us!


JSON Support in Scylla


Beginning with version 2.3, Scylla Open Source supports the JavaScript Object Notation (JSON) format. That includes inserting JSON documents, retrieving data in JSON, and providing helper functions to transform native CQL types into JSON and vice versa.

Also note that schemas are still enforced for all operations — one cannot just insert random JSON documents into a table. The new API is simply a convenient way of working with JSON without having to convert everything back and forth client-side.

JSON support consists of CQL statements and functions, described here, one by one, with examples.

You can use the following code snippet to build a sample restaurant menu. This example will serve as a basis in the following sections. This snippet also contains a second table based on collections, which contains additional information about served dishes.

CREATE KEYSPACE restaurant WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
use restaurant;

CREATE TABLE menu (category text, position int, name text, price float, PRIMARY KEY(category, position));

INSERT INTO menu (category, position, name, price) VALUES ('starters', 1, 'foie gras', 10.50);
INSERT INTO menu (category, position, name, price) VALUES ('starters', 2, 'steak tartare', 9.50);
INSERT INTO menu (category, position, name, price) VALUES ('starters', 3, 'taco de pulpo', 8.00);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 1, 'sour rye soup', 12);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 2, 'sorrel soup', 8);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 3, 'beef tripe soup', 11.20);
INSERT INTO menu (category, position, name, price) VALUES ('main courses', 1, 'red-braised pork belly', 24.90);
INSERT INTO menu (category, position, name, price) VALUES ('main courses', 2, 'boknafisk', 19);


CREATE TABLE info (category text PRIMARY KEY, calories map<text, int>, vegan set<text>, ranking list<text>);
INSERT INTO info (category, calories, vegan, ranking) VALUES ('soups', {'sour rye soup': 500, 'sorrel soup': 290}, {'sorrel soup'}, ['sour rye soup', 'sorrel soup']);

SELECT JSON

Selecting data in JSON format can be performed with the SELECT JSON statement. Its syntax is almost identical to a regular CQL SELECT.

In order to extract all data and see what the restaurant serves, try:

SELECT JSON * from menu;

Named columns can also be specified to narrow down the results. So, if we’re only interested in names and prices:

SELECT JSON name, price from menu;

As in regular CQL SELECT, it’s of course possible to restrict the query. Extracting soup info from the database can be achieved like this:

SELECT JSON name, price from menu WHERE category='soups';

Since data underneath is still structured with our schema, it’s possible to apply filtering too. So, if our meal is reimbursed anyway and we don’t want to ruin it by spending too little money:

SELECT JSON name, price from menu WHERE price > 10 ALLOW FILTERING;

Note that the results always consist of one column named [json]. This column contains the requested information in JSON format, properly typed – to string, int, float or boolean. Of course, (nested) collections are supported too!

SELECT JSON * FROM info;

INSERT JSON

Inserting JSON data is also very similar to a regular INSERT statement. Still, note that even though JSON documents can contain lots of arbitrary columns, the ones inserted into Scylla will be validated against the table’s schema. Let’s add another soup to the menu:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11}';

That’s it – not complicated at all. What happens if we try to sneak some out-of-schema data to the statement?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11, "comment": "filling and delicious"}';

Not possible – schema rules cannot be ignored. What if some columns are missing from our JSON?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}';

SELECT * from menu;

Works fine; the omitted column just defaults to null. But there’s more to the topic.

DEFAULT NULL/DEFAULT UNSET

By default, omitted columns are treated as null values. If, instead, the user wants to leave an existing value unchanged, the DEFAULT UNSET flag can be used. So, if our red borscht sells well and we want to boost the price in order to increase revenue:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}' DEFAULT UNSET;

We can see that our soup name was left intact, but the price changed:

SELECT * FROM menu WHERE category='soups';

fromJson()

fromJson() is a functional equivalent of INSERT JSON for a single value. The easiest way to explain its usage is with an example:

INSERT INTO menu (category, position, name, price) VALUES (fromJson('"soups"'), fromJson('1'), 'sour rye soup', 12);

The function works fine with collections too.

INSERT INTO info (category, calories) VALUES ('starters', fromJson('{"foie gras": 550}'));

SELECT * FROM info WHERE category = 'starters';

toJson()

toJson() is a counterpart of the fromJson() function (yes, really!) and can be used to convert single values to JSON format.

SELECT toJson(category), toJson(name) FROM menu;

SELECT category, toJson(calories), toJson(vegan), toJson(ranking) FROM info;

Types

Mapping of CQL types to JSON is well defined and usually intuitive. A full reference table of corresponding types can be found below. Note that some CQL types (e.g. decimal) will be implicitly converted to others, with possibly different precision (e.g. float), when returning JSON values.

CQL type   | INSERT JSON accepted types | SELECT JSON returned type
-----------+----------------------------+--------------------------
ascii      | string                     | string
bigint     | integer, string            | integer
blob       | string                     | string
boolean    | boolean, string            | boolean
date       | string                     | string
decimal    | integer, string, float     | float
double     | integer, string, float     | float
float      | integer, string, float     | float
inet       | string                     | string
int        | integer, string            | integer
list       | list, string               | list
map        | map, string                | map
smallint   | integer, string            | integer
set        | list, string               | list
text       | string                     | string
time       | string                     | string
timestamp  | integer, string            | string
timeuuid   | string                     | string
tinyint    | integer, string            | integer
tuple      | list, string               | list
uuid       | string                     | string
varchar    | string                     | string
varint     | integer, string            | integer

We do JSON. How about you?

JSON support in Scylla permits a variety of novel designs and implementations. If you are currently using JSON in your own Scylla deployment or planning to use this feature in your own development, we’d love to hear from you.


Scylla Enterprise Release 2018.1.9

Scylla Enterprise Release

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.9, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.9 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.9 in coordination with the Scylla support team.

This release fixes the two issues listed below, with open source references where available:

  • Scylla aborted with an “Assertion `end >= _stream_position’ failed” exception. This occurred when querying a partition with no clustering ranges (which can happen on counter tables with no live rows) that also didn’t have static columns. #3304
  • Monitoring: latency values reported by Prometheus might be wrong #3827

Related Links     


Introducing Scylla Open Source 3.0

Scylla Open Source Release 3.0

Scylla is an open source NoSQL database that offers the horizontal scale-out and fault-tolerance of Apache Cassandra, but delivers 10X the throughput and consistent, low single-digit latencies. Implemented from scratch in C++, Scylla’s close-to-the-hardware design significantly reduces the number of database nodes you require and self-optimizes to dynamic workloads and various hardware combinations.

With the release of Scylla Open Source 3.0, we’ve introduced a rich set of new features for more efficient querying, reduced storage requirements, lower repair times, and better overall database performance. Already the industry’s most performant NoSQL database, Scylla now includes production-ready features that surpass the capabilities of Apache Cassandra.

Scylla Open Source 3.0 is now available for download.

Materialized Views

Materialized Views automate the tedious and inefficient chores created when an application maintains several tables with the same data organized differently. Data is divided into partitions that can be found by a partition key. Sometimes the application needs to find a partition or partitions by the value of another column. Doing this efficiently without scanning all of the partitions requires indexing.

People have been using Materialized Views, also known as denormalization, for years as a client-side implementation. In those days, the application maintained two or more views and two or more separate tables with the same data but under different partition keys. Every time the application wanted to write data, it needed to write to both tables, and reads were done directly (and efficiently) from the desired table. However, ensuring any level of consistency between the data in the two or more views required complex and slow application logic.

Scylla’s Materialized Views feature moves this complexity out of the application and into the servers. The implementation is faster (fewer round trips to the applications) and more reliable. This approach makes it much easier for applications to begin using multiple views into their data. The application just declares the additional views, Scylla creates the new view tables, and on every update to the base table the view tables are automatically updated as well. Writes are executed only on the base table directly and are automatically propagated to the view tables. Reads go directly to the view tables.
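Here is a minimal sketch of declaring a view; the table and column names are illustrative, not from this post:

CREATE TABLE users (id uuid PRIMARY KEY, name text, email text);

CREATE MATERIALIZED VIEW users_by_email AS
    SELECT * FROM users
    WHERE email IS NOT NULL AND id IS NOT NULL
    PRIMARY KEY (email, id);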

As usual, the Scylla version is compatible – in features and CQL syntax – with the Apache Cassandra version (where it is still in experimental mode).

Materialized View

Global Secondary Indexes

Scylla Open Source 3.0 introduces production-ready global secondary indexes that can scale to any size distributed cluster — unlike the local-indexing approach adopted by Apache Cassandra. The secondary index uses a Materialized View index under the hood in order to make the index independent from the amount of nodes in the cluster. Secondary Indexes are (mostly) transparent to the application. Queries have access to all the columns in the table and you can add and remove indexes without changing the application. Secondary Indexes can also have less storage overhead than Materialized Views because Secondary Indexes need to duplicate only the indexed column and primary key, not the queried columns like with a Materialized View. For the same reason, updates can be more efficient with Secondary Indexes because only changes to the primary key and indexed column cause an update in the index view. In the case of a Materialized View, an update to any of the columns that appear in the view requires the backing view to be updated.
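Continuing the illustrative users table from the sketch above, an index can be created and dropped without touching the base table or the application’s writes:

CREATE INDEX ON users (name);

SELECT * FROM users WHERE name = 'Ada';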

Global Secondary Indexes

As always, the decision whether to use Secondary Indexes or Materialized Views really depends on the requirements of your application. If you need maximum performance and are likely to query a specific set of columns, you should use Materialized Views. However, if the application needs to query different sets of columns, Secondary Indexes are a better choice because they can be added and removed with less storage overhead depending on application needs.

Global secondary indexes minimize the amount of data retrieved from the database, providing many benefits:

  • Results are paged and customizable
  • Filtering is supported to narrow result sets
  • Keys, rather than data, are denormalized
  • Supports more general-purpose use cases than Materialized Views

Allow Filtering

Allow filtering is a way to make a more complex query, returning only a subset of matching results. Because the filtering is done on the server, this feature also reduces the amount of data transferred over the network between the cluster and the application. Such filtering may incur processing impacts to the Scylla cluster. For example, a query might require the database to filter an extremely large data set before returning a response. By default, such queries are prevented from execution, returning the following message:

Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.

Unpermitted queries include those that restrict:

  • Non-partition key fields
  • Parts of primary keys that are not prefixes
  • Partition keys with something other than an equality relation (though you can combine SI with ALLOW FILTERING to support inequalities; >= or <=; see below)
  • Clustering keys with a range restriction and then by other conditions (see this blog)

However, in some cases (usually due to data modeling decisions), applications need to make queries that violate these basic rules. Starting with Scylla Open Source 3.0, queries can be appended with the ALLOW FILTERING keyword to bypass this restriction and utilize server-side filtering.
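A minimal illustrative example (again on the hypothetical users table), restricting a non-key, non-indexed column:

SELECT * FROM users WHERE email = 'ada@example.com' ALLOW FILTERING;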

The benefits of filtering include:

  • Cassandra query compatibility
  • Spark-Cassandra connector query compatibility
  • Query flexibility against legacy data sets

New File Format

Scylla Open Source 3.0 introduces support for a more performant storage format (SSTable), which is not only compatible with Apache Cassandra 3.x but also reduces storage volume by as much as 3X. The older 2.x format used to duplicate the column name next to each cell on disk. The new format eliminates the duplication and the column names are stored once, within the schema.

The newly introduced format is identical to that used by Apache Cassandra 3.x, while remaining backward-compatible with prior Scylla SSTable formats. New deployments of Scylla Open Source 3.0 will automatically use the new format, while existing files remain unchanged.

This new storage format delivers important benefits, including:

  • Can read existing Apache Cassandra 3.x files when migrating
  • Faster than previous versions
  • Reduced storage footprint of up to 66%, depending on the data model used
  • Range delete support

Hinted Handoff

Hinted handoffs are designed to help when any individual node is temporarily unresponsive due to heavy write load, network weather, hardware failure, or any other factor. Hinted handoffs also help in the event of short-term network issues or node restarts, reducing the time for scheduled repairs, and resulting in higher overall performance for distributed deployments. Originally introduced as an experimental feature in Scylla Open Source 2.1, hinted handoffs are another production-ready feature in Scylla Open Source 3.0.

Hinted Handoff

Technically, a ‘hint’ is a record of a write request held by the coordinator until an unresponsive replica node comes back online. When a write is deemed successful but one or more replica nodes fail to acknowledge it, Scylla will write a hint that is replayed to those nodes when they recover. Once the node becomes available again, the write request data in the hint is written to the replica node.
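Hinted handoff is controlled in scylla.yaml; a minimal sketch (option names as in Scylla’s configuration, values illustrative):

hinted_handoff_enabled: true
max_hint_window_in_ms: 10800000   # stop collecting hints for a node after it has been down 3 hours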

Hinted handoffs deliver the following benefits:

  • Minimizes the difference between data in the nodes when nodes are down — whether for scheduled upgrades or for all-too-common intermittent network issues.
  • Reduces the amount of data transferred during repair.
  • Reduces the chances of checksum mismatch (during read-repair) and thus improves overall latency.

Full, Multi-partition Scan Improvements

Scylla Open Source 3.0 builds on earlier improvements by extending stateful paging to support range scans as well. As opposed to other partition queries, which read a single partition or a list of distinct partitions, range scans read all of the partitions that fall into the range specified by the client. Since the precise number and identity of partitions in a given range cannot be determined in advance, the query must read data from all nodes containing data for the range.

Range Scans Performance

To improve range scan paging, Scylla Open Source 3.0 introduces a new control algorithm for reading all data belonging to a range from all shards, which caches the intermediate streams on each of the shards and directs paged queries to the matching, previously used, cached results. The new algorithm is essentially a multiplexer that combines the output of readers opened on affected shards into a single stream. The readers are created on-demand when the partition scan attempts to read from the shard. To ensure that the read won’t stall, the algorithm uses buffering and read-ahead.

Benefits include:

  • Improved system responsiveness
  • Throughput of range scans improved by as much as 30%
  • Amount of data read from the disk reduced by as much as 40%
  • Disk operations lowered by as much as 75%

Streaming Improvements

Streaming is used during node recovery to populate restored nodes with data replicated from running nodes. The Scylla streaming model reads data on one node, transmits it to another node, and then writes to disk. The sender creates SSTable readers to read the rows from SSTables on disk and sends them over the network. The receiver receives the rows from the network and writes them to a memtable. The rows in memtable are flushed into SSTables periodically or when the memtable is full.

Improved Streaming

In Scylla Open Source 3.0, stream synchronization between nodes bypasses memtables, significantly reducing the time to repair, add and remove nodes. These improvements result in higher performance when there is a change in the cluster topology, improving streaming bandwidth by as much as 240% and reducing the time it takes to perform a “rebuild” operation by 70%.

Streaming Improvement Measurements

Scylla’s new streaming improvements provide the following benefits:

  • Lower memory consumption. The saved memory can be used to handle your CQL workload instead.
  • Better CPU utilization. No CPU cycles are used to insert and sort memtables.
  • Bigger SSTables and fewer compactions.

Release Notes

You can read more details about Scylla Open Source 3.0 in the Release Notes.



Scylla Summit 2018 Keynote: Four Years of Scylla

$
0
0
Dor Laor at Scylla Summit 2018
Now that the dust has settled from our Scylla Summit 2018 user conference, we’re glad for the chance to share the content with those who couldn’t make the trip to the Bay Area. We’ll start with the keynote from our CEO and Co-founder, Dor Laor, who kicked off the event with his talk about the past, present and future of Scylla.

Watch the video in full:

Browse the slides:

Dor began with an overview of the trends in the industry and first and foremost, digital transformation. Sticking to the down-to-earth, practical culture at ScyllaDB, Dor covered real life customer examples from all around us. Starting with space, with satellite-enabled services like GPS Insight and TellusLabs, to the rain forest plant-based beauty products of Brazil’s Natura, to the GE Predix-powered Industrial Internet of Things (IIoT) platform, Comcast X1’s on-demand services, as well as automotive applications from Faraday Future and Nauto.

In the beginning…

Scylla and Charybdis

Dor shared our company’s origins. He recalled first announcing the Seastar framework in February 2015, and leaving stealth mode in September of that year. ScyllaDB CTO Avi Kivity. presented at that year’s Cassandra Summit on how a new database, Scylla, could deliver 1,000,000 CQL operations per server.

Scylla 1.0 Release Graphic

Over the ensuing three years we made a great deal of progress. We released Scylla 1.0 at the end of March 2016. That year also saw the first Scylla Summit in September. The following year, in March 2017, Scylla unveiled its first Enterprise software release.

Scylla Enterprise Release Graphic

While Scylla was blazing its own path in the world of NoSQL, Dor also remarked on the successes of others in the industry, including MongoDB’s public offering in October 2017, and the September 2018 IPO of Elastic. These events serve as validation of the growing Big Data market as the hunger for data increases, fed by the growing appetite of modern, planet scale software. Not only most enterprises now trust in the operational capabilities of NoSQL distributed databases, the new world requirements cannot be met by traditional relational models.

State of the Art

Moving to the present, Dor announced Scylla Open Source 3.0. With this release, Scylla was finally achieving feature parity with Cassandra, and, in some cases, it was taking the lead. For storage, SSTable format 3.0 (mc) would reduce data footprint on disk. Production-ready Materialized Views (MV) and Global Secondary Indexes (GSI) will help users access only the data they need. Lightweight Transactions (LWT) remains the last major feature to achieve full feature parity with Cassandra.

Dor also announced that our cloud managed database, Scylla Cloud, was available as early access. running on Amazon Web Services (AWS), Scylla Cloud lets users launch a fully managed single-tenant, self-service Scylla cluster in minutes.

Scylla Cloud Graphic

As much as we talk about Cassandra, we are shifting gears and wish to be competitive with the best of breed NoSQL databases, led by DynamoDB as an example.

Scylla vs. Dynamo Graphic

Dor shared results from a head-to-head YCSB comparison of Scylla versus Amazon DynamoDB. We just recently published the comparative benchmark results. Our test results show you can achieve 1/4th the latency and spend only 1/7th the cost with Scylla for similar throughput on DynamoDB. (Scylla Cloud is 4-6X less expensive than DynamoDB.)

However, the real performance difference occurred in Zipfian distributions. You can read the blog in full as to why this is an important real-world consideration. Analogous test results were found for Bigtable, and CosmosDB was expected to perform similarly.

OLTP vs. OLAP Graphic

Another key feature introduced for the first time at Scylla Summit 2018 was our unique ability to support per-user SLAs, allowing system managers to limit database resource utilization. With this, Scylla customers can use the same Scylla cluster to service both transaction processing (mixed read-write, or write-heavy loads) as well as analytics (read-only/mostly) requests. Glauber Costa would host a full session on this, entitled OLAP or OLTP: Why not both?

Per-user-SLA utilizes 3 years of development of SLA guarantee for real time operations over distributed database background operations such as compaction, repair and streaming. This is a point in time evolution towards perfect multi tenant database.

Dor then enumerated a list of noteworthy accomplishments and the challenges we still have before us. For example, while he was proud of our Mutant Monitoring System (MMS), there is still work to be done on our Knowledgebase, as well as our upcoming launch of Scylla University. And while performance is good, and compactions are relatively smooth compared to other offerings, there are still more optimizations to be done. And while he was proud of the work we’ve done to integrate with Apache Spark, there’s a lot more to do to align Scylla with Kubernetes.

The Shape of Things to Come

 

To conclude, Dor gave a glimpse into the future of Scylla. Finishing up Cassandra parity features, especially Lightweight Transactions. Fleshing out Scylla Cloud. Making Scylla itself a stronger offering, with new tiered storage options, improvements in performance and additional drivers. And finally, making Scylla even easier to manage.

It has been a remarkable journey over the past four years. From all of us at ScyllaDB, thank you for following us on our journey, and for a wonderful 2018. 

Looking ahead, 2019 is sure to be another amazing year of pioneering achievements in the world of Big Data, both for Scylla as well as our users and customers. We’re looking forward to all that we will accomplish together!

The post Scylla Summit 2018 Keynote: Four Years of Scylla appeared first on ScyllaDB.

Scylla Enterprise Release 2018.1.8

$
0
0

Scylla Enterprise Release

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.8, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.8 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.8 in coordination with the Scylla support team.

This release includes memory management improvements, first introduced in Scylla Open Source 2.3 release and graduated into Scylla Enterprise. These changes allow Scylla to free large contiguous areas of memory more reliably with less effort and improves the performance of workloads that have large blobs or collections. #2480

Related Links     

Additional fixed issues in this release, with open source references, if exists:

  • In extreme out-of-memory situations, failure to allocate an internal data structure used for synchronizing memtable flushes can result in a segmentation fault and crash. #3931

Auditing: audit configuration syntax in scylla.yaml was changed from audit-categories and audit-tables to audit_categories and audit_tables. Note that the upgrade procedure does *not* update existing scylla.yaml automatically. More on Scylla Enterprise auditing and auditing configuration here

The post Scylla Enterprise Release 2018.1.8 appeared first on ScyllaDB.

Scylla Summit 2018 Tech Talks Now Online

$
0
0

There was so much happening at our Scylla Summit 2018 late last fall. We held more than three dozen sessions, including multiple keynotes and concurrent breakout tracks. Our Tech Talks page has been updated with the videos and slides from Scylla Summit 2018. Now you can see what you missed — whether or not you were able to attend our user conference.

Dor Laor at Scylla Summit 2018

In the weeks ahead we’ll showcase some of the best talks from our conference, but there’s no need to wait. You can browse the whole catalog of talks from Scylla Summit 2018 today!

You can see the YouTube videos and all the SlideShare presentations in one place. All the keynotes, both by ScyllaDB executives and some of our most prominent customers. All the use case presentations by our incredible groundbreaking community members. All the tech talks from our engineers on current and upcoming features, best practices, tips and tricks.

We’ll point out a good intro to many of the engineering talks at Scylla Summit 2018. ScyllaDB CTO and Co-Founder Avi Kivity presented on our near-term and longer-term initiatives. This is quite timely, too, as the release of Scylla Open Source 3.0 is right around the corner!

If you have any questions or comments after watching these presentations, or if you’d like to share your own experience with Scylla, please feel free to contact us!

The post Scylla Summit 2018 Tech Talks Now Online appeared first on ScyllaDB.

JSON Support in Scylla

$
0
0

Beginning with version 2.3, Scylla Open Source supports the Javascript Object Notation (JSON) format. That includes inserting JSON documents, retrieving data in JSON and providing helper functions to transform native CQL types into JSON and vice versa.

Also note that schemas are still enforced for all operations — one cannot just insert random JSON documents into a table. The new API is simply a convenient way of working with JSON without having to convert everything back and forth client-side.

JSON support consists of CQL statements and functions, described here, one by one, with examples.

You can use the following code snippet to build a sample restaurant menu. This example will serve as a basis in the following sections. This snippet also contains a second table based on collections, which contains additional information about served dishes.

CREATE KEYSPACE restaurant WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 };
use restaurant;

CREATE TABLE menu (category text, position int, name text, price float, PRIMARY KEY(category, position));

INSERT INTO menu (category, position, name, price) VALUES ('starters', 1, 'foie gras', 10.50);
INSERT INTO menu (category, position, name, price) VALUES ('starters', 2, 'steak tartare', 9.50);
INSERT INTO menu (category, position, name, price) VALUES ('starters', 3, 'taco de pulpo', 8.00);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 1, 'sour rye soup', 12);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 2, 'sorrel soup', 8);
INSERT INTO menu (category, position, name, price) VALUES ('soups', 3, 'beef tripe soup', 11.20);
INSERT INTO menu (category, position, name, price) VALUES ('main courses', 1, 'red-braised pork belly', 24.90);
INSERT INTO menu (category, position, name, price) VALUES ('main courses', 2, 'boknafisk', 19);


CREATE TABLE info (category text PRIMARY KEY, calories map<text, int>, vegan set, ranking list);
INSERT INTO info (category, calories, vegan, ranking) VALUES ('soups', {'sour rye soup': 500, 'sorrel soup': 290}, {'sorrel soup'}, ['sour rye soup', 'sorrel soup']);

SELECT JSON

Selecting data in JSON format can be performed with SELECT JSON statement. It’s syntax is almost identical to regular CQL SELECT.

In order to extract all data and see what the restaurant serves, try:

SELECT JSON * from menu;

Named columns can also be specified to narrow down the results. So, if we’re only interested in names and prices:

SELECT JSON name, price from menu;

As in regular CQL SELECT, it’s of course possible to restrict the query. Extracting soup info from the database can be achieved like this:

SELECT JSON name, price from menu WHERE category='soups';

Since data underneath is still structured with our schema, it’s possible to apply filtering too. So, if our meal is reimbursed anyway and we don’t want to ruin it by spending too little money:

SELECT JSON name, price from menu WHERE price > 10 ALLOW FILTERING;

Note that the results always consist of one column named [json]. This column contains the requested information in JSON format, properly typed – to string, int, float or boolean. Of course, (nested) collections are supported too!

SELECT JSON * FROM info;

INSERT JSON

Inserting JSON data is also very similar to a regular INSERT statement. Still, note that even though JSON documents can contain lots of arbitrary columns, the ones inserted into Scylla will be validated with table’s schema. Let’s add another soup to the menu:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11}';

That’s it – not complicated at all. What happens if we try to sneak some out-of-schema data into the statement?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11, "comment": "filling and delicious"}';

Not possible – schema rules cannot be ignored. What if some columns are missing from our JSON?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}';

SELECT * from menu;

It works fine; the omitted column simply defaults to null. But there’s more to the topic.

DEFAULT NULL/DEFAULT UNSET

By default, omitted columns are treated as null values. If, instead, the user wants to leave an existing value unchanged, the DEFAULT UNSET flag can be used. So, if our red borscht sells well and we want to boost the price in order to increase revenue:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}' DEFAULT UNSET;

We can see that our soup name was left intact, but the price changed:

SELECT * FROM menu WHERE category='soups';
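The result should look roughly like this (illustrative cqlsh-style output; float rendering may differ):

 category | position | name            | price
----------+----------+-----------------+-------
    soups |        1 | sour rye soup   |    12
    soups |        2 | sorrel soup     |     8
    soups |        3 | beef tripe soup |  11.2
    soups |        4 | red borscht     |    16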

fromJson()

fromJson() is a functional equivalent of INSERT JSON for a single value. The easiest way to explain its usage is with an example:

INSERT INTO menu (category, position, name, price) VALUES (fromJson('"soups"'), fromJson('1'), 'sour rye soup', 12);

The function works fine with collections too.

INSERT INTO info (category, calories) VALUES ('starters', fromJson('{"foie gras": 550}'));

SELECT * FROM info WHERE category = 'starters';

toJson()

toJson() is a counterpart of the fromJson() function (yes, really!) and can be used to convert single values to JSON format.

SELECT toJson(category), toJson(name) FROM menu;

SELECT category, toJson(calories), toJson(vegan), toJson(ranking) FROM info;

Types

The mapping of CQL types to JSON is well defined and usually intuitive. A full reference table of corresponding types can be found below. Note that some CQL types (e.g. decimal) will be implicitly converted to others, with possibly different precision (e.g. float), when returning JSON values.

CQL type  | INSERT JSON accepted types | SELECT JSON returned type
ascii     | string                     | string
bigint    | integer, string            | integer
blob      | string                     | string
boolean   | boolean, string            | boolean
date      | string                     | string
decimal   | integer, string, float     | float
double    | integer, string, float     | float
float     | integer, string, float     | float
inet      | string                     | string
int       | integer, string            | integer
list      | list, string               | list
map       | map, string                | map
smallint  | integer, string            | integer
set       | list, string               | list
text      | string                     | string
time      | string                     | string
timestamp | integer, string            | string
timeuuid  | string                     | string
tinyint   | integer, string            | integer
tuple     | list, string               | list
uuid      | string                     | string
varchar   | string                     | string
varint    | integer, string            | integer
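For instance, the price column in the menu table is a float, so INSERT JSON accepts a plain JSON integer for it, while SELECT JSON always returns a float (a sketch; the "gazpacho" row is just an illustrative addition):

INSERT INTO menu JSON '{"category": "starters", "position": 4, "name": "gazpacho", "price": 9}';
SELECT JSON price FROM menu WHERE category = 'starters' AND position = 4;
-- expected result, give or take formatting: {"price": 9.0}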

We do JSON. How about you?

JSON support in Scylla permits a variety of novel designs and implementations. If you are currently using JSON in your own Scylla deployment or are planning to use this feature in your own development, we’d love to hear from you.

The post JSON Support in Scylla appeared first on ScyllaDB.

Scylla Enterprise Release 2018.1.9

Scylla Enterprise Release

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.9, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.9 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.9 in coordination with the Scylla support team.

This release fixes the two issues listed below, with open source references where they exist:

  • Scylla aborted with an “Assertion `end >= _stream_position’ failed” exception. This occurred when querying a partition with no clustering ranges (happened on counter tables with no live rows) which also didn’t have static columns. #3304
  • Monitoring: latency values reported by Prometheus might be wrong #3827

Related Links     

The post Scylla Enterprise Release 2018.1.9 appeared first on ScyllaDB.

Introducing Scylla Open Source 3.0

Scylla Open Source Release 3.0

Scylla is an open source NoSQL database that offers the horizontal scale-out and fault-tolerance of Apache Cassandra, but delivers 10X the throughput and consistent, low single-digit latencies. Implemented from scratch in C++, Scylla’s close-to-the-hardware design significantly reduces the number of database nodes you require and self-optimizes to dynamic workloads and various hardware combinations.

With the release of Scylla Open Source 3.0, we’ve introduced a rich set of new features for more efficient querying, reduced storage requirements, lower repair times, and better overall database performance. Already the industry’s most performant NoSQL database, Scylla now includes production-ready features that surpass the capabilities of Apache Cassandra.

Scylla Open Source 3.0 is now available for download.

Materialized Views

Materialized Views automate the tedious and inefficient chores created when an application maintains several tables with the same data organized differently. Data is divided into partitions that can be found by a partition key. Sometimes the application needs to find a partition or partitions by the value of another column. Doing this efficiently without scanning all of the partitions requires indexing.

People have been using Materialized Views, also known as denormalization, for years as a client-side implementation: the application maintained two or more views and two or more separate tables with the same data but under a different partition key. Every time the application wanted to write data, it needed to write to both tables, and reads were done directly (and efficiently) from the desired table. However, ensuring any level of consistency between the data in the two or more views required complex and slow application logic.

Scylla’s Materialized Views feature moves this complexity out of the application and into the servers. The implementation is faster (fewer round trips to the applications) and more reliable. This approach makes it much easier for applications to begin using multiple views into their data. The application just declares the additional views, Scylla creates the new view tables, and on every update to the base table the view tables are automatically updated as well. Writes are executed only on the base table directly and are automatically propagated to the view tables. Reads go directly to the view tables.
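For example, a view keyed by a different column can be declared in CQL like this (a minimal sketch; the buildings table and view name are hypothetical):

CREATE TABLE buildings (name text PRIMARY KEY, city text, height int);

CREATE MATERIALIZED VIEW buildings_by_city AS
    SELECT * FROM buildings
    WHERE city IS NOT NULL AND name IS NOT NULL
    PRIMARY KEY (city, name);

-- Writes go only to the base table; reads can use either:
INSERT INTO buildings (name, city, height) VALUES ('Burj Khalifa', 'Dubai', 828);
SELECT * FROM buildings_by_city WHERE city = 'Dubai';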

As usual, the Scylla version is compatible – in features and CQL syntax – with the Apache Cassandra version (where Materialized Views remain an experimental feature).

Materialized View

Global Secondary Indexes

Scylla Open Source 3.0 introduces production-ready global secondary indexes that can scale to any size distributed cluster — unlike the local-indexing approach adopted by Apache Cassandra. The secondary index uses a Materialized View index under the hood in order to make the index independent of the number of nodes in the cluster. Secondary Indexes are (mostly) transparent to the application. Queries have access to all the columns in the table, and you can add and remove indexes without changing the application. Secondary Indexes can also have less storage overhead than Materialized Views, because Secondary Indexes need to duplicate only the indexed column and primary key, not the queried columns as a Materialized View does. For the same reason, updates can be more efficient with Secondary Indexes, because only changes to the primary key and indexed column cause an update in the index view. In the case of a Materialized View, an update to any of the columns that appear in the view requires the backing view to be updated.

Global Secondary Indexes
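To make this concrete, here is a minimal sketch, reusing the hypothetical buildings table from the Materialized Views example above:

CREATE INDEX ON buildings (height);

-- The index makes this query possible without scanning every partition:
SELECT * FROM buildings WHERE height = 828;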

As always, the decision whether to use Secondary Indexes or Materialized Views really depends on the requirements of your application. If you need maximum performance and are likely to query a specific set of columns, you should use Materialized Views. However, if the application needs to query different sets of columns, Secondary Indexes are a better choice because they can be added and removed with less storage overhead depending on application needs.

Global secondary indexes minimize the amount of data retrieved from the database, providing many benefits:

  • Results are paged and customizable
  • Filtering is supported to narrow result sets
  • Keys, rather than data, are denormalized
  • Supports more general-purpose use cases than Materialized Views

Allow Filtering

Allow filtering is a way to make a more complex query, returning only a subset of matching results. Because the filtering is done on the server, this feature also reduces the amount of data transferred over the network between the cluster and the application. Such filtering may impose a processing impact on the Scylla cluster. For example, a query might require the database to filter an extremely large data set before returning a response. By default, such queries are prevented from execution, returning the following message:

Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.

Unpermitted queries include those that restrict:

  • Non-partition key fields
  • Parts of primary keys that are not prefixes
  • Partition keys with something other than an equality relation (though you can combine SI with ALLOW FILTERING to support inequalities; >= or <=; see below)
  • Clustering keys with a range restriction and then by other conditions (see this blog)

However, in some cases (usually due to data modeling decisions), applications need to make queries that violate these basic rules. Starting with Scylla Open Source 3.0, queries can be appended with the ALLOW FILTERING keyword to bypass this restriction and utilize server-side filtering.
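For example, a query restricted on a non-indexed, non-key column is rejected by default, but runs once the keyword is appended (again a sketch against the hypothetical buildings table above):

SELECT * FROM buildings WHERE height > 500 ALLOW FILTERING;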

The benefits of filtering include:

  • Cassandra query compatibility
  • Spark-Cassandra connector query compatibility
  • Query flexibility against legacy data sets

New File Format

Scylla Open Source 3.0 introduces support for a more performant storage format (SSTable), which is not only compatible with Apache Cassandra 3.x but also reduces storage volume by as much as 3X. The older 2.x format used to duplicate the column name next to each cell on disk. The new format eliminates the duplication and the column names are stored once, within the schema.

The newly introduced format is identical to that used by Apache Cassandra 3.x, while remaining backward-compatible with prior Scylla SSTable formats. New deployments of Scylla Open Source 3.0 will automatically use the new format, while existing files remain unchanged.

This new storage format delivers important benefits, including:

  • Can read existing Apache Cassandra 3.x files when migrating
  • Faster than previous versions
  • Reduced storage footprint of up to 66%, depending on the data model used 
  • Range delete support

Hinted Handoff

Hinted handoffs are designed to help when any individual node is temporarily unresponsive due to heavy write load, network weather, hardware failure, or any other factor. Hinted handoffs also help in the event of short-term network issues or node restarts, reducing the time for scheduled repairs, and resulting in higher overall performance for distributed deployments. Originally introduced as an experimental feature in Scylla Open Source 2.1, hinted handoffs are another production-ready feature in Scylla Open Source 3.0.

Hinted Handoff

Technically, a ‘hint’ is a record of a write request held by the coordinator until an unresponsive replica node comes back online. When a write is deemed successful but one or more replica nodes fail to acknowledge it, Scylla will write a hint that is replayed to those nodes when they recover. Once the node becomes available again, the write request data in the hint is written to the replica node.

Hinted handoffs deliver the following benefits:

  • Minimizes the difference between data in the nodes when nodes are down — whether for scheduled upgrades or for all-too-common intermittent network issues.
  • Reduces the amount of data transferred during repair.
  • Reduces the chances of checksum mismatch (during read-repair) and thus improves overall latency.

Full, Multi-partition Scan Improvements

Scylla Open Source 3.0 builds on earlier improvements by extending stateful paging to support range scans as well. As opposed to other partition queries, which read a single partition or a list of distinct partitions, range scans read all of the partitions that fall into the range specified by the client. Since the precise number and identity of partitions in a given range cannot be determined in advance, the query must read data from all nodes containing data for the range.

Range Scans Performance

To improve range scan paging, Scylla Open Source 3.0 introduces a new control algorithm for reading all data belonging to a range from all shards, which caches the intermediate streams on each of the shards and directs paged queries to the matching, previously used, cached results. The new algorithm is essentially a multiplexer that combines the output of readers opened on affected shards into a single stream. The readers are created on-demand when the partition scan attempts to read from the shard. To ensure that the read won’t stall, the algorithm uses buffering and read-ahead.
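For clarity, a range scan is simply a query that covers a token range rather than a single partition, for example (a sketch against the hypothetical buildings table above; the token bounds are arbitrary):

-- A full-table scan; the driver pages through it transparently:
SELECT * FROM buildings;

-- A scan restricted to part of the token ring, as done by tools like Spark:
SELECT * FROM buildings WHERE token(name) >= -9223372036854775808 AND token(name) < 0;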

Benefits include:

  • Improved system responsiveness
  • Throughput of range scans improved by as much as 30%
  • Amount of data read from the disk reduced by as much as 40%
  • Disk operations lowered by as much as 75%

Streaming Improvements

Streaming is used during node recovery to populate restored nodes with data replicated from running nodes. The Scylla streaming model reads data on one node, transmits it to another node, and then writes to disk. The sender creates SSTable readers to read the rows from SSTables on disk and sends them over the network. The receiver receives the rows from the network and writes them to a memtable. The rows in memtable are flushed into SSTables periodically or when the memtable is full.

Improved Streaming

In Scylla Open Source 3.0, stream synchronization between nodes bypasses memtables, significantly reducing the time to repair, add and remove nodes. These improvements result in higher performance when there is a change in the cluster topology, improving streaming bandwidth by as much as 240% and reducing the time it takes to perform a “rebuild” operation by 70%.

Streaming Improvement Measurements

Scylla’s new streaming improvements provide the following benefits:

  • Lower memory consumption. The saved memory can be used to handle your CQL workload instead.
  • Better CPU utilization. No CPU cycles are used to insert and sort memtables.
  • Bigger SSTables and fewer compactions.

Release Notes

You can read more details about Scylla Open Source 3.0 in the Release Notes.

The post Introducing Scylla Open Source 3.0 appeared first on ScyllaDB.

Scylla Summit Video: Grab and Scylla Driving Southeast Asia Forward


Grab and Scylla
Grab is a powerhouse in Southeast Asia. Its mobile app services cover a broad swath of everyday needs, from acting as a mobile wallet, to arranging affordable ridesharing, food and package delivery. Imagine if Apple Pay, Lyft, and Doordash were all bundled in one app.

Grab has exploded in use across Southeast Asia, from its origins in Malaysia and Singapore all across Cambodia, Vietnam, Thailand, Myanmar, Indonesia, and the Philippines.

The audience at last fall’s Scylla Summit had a chance to hear about Grab’s use of Scylla in detail, directly from Grab’s Engineering Lead, Aravind Srinivasan.

Aravind first presented the background on how Grab had designed microservices written in Golang and used Apache Kafka for stream processing, and then implemented Scylla to act as a persistent database to store all of that data from Kafka. He went on to describe in detail Grab’s fraud detection system implemented with Scylla.

Watch the video below to hear the details for yourself. We’ve also posted Aravind’s slides in case you’d like to zoom in on those. Plus make sure to check out all the other Scylla Summit 2018 presentations under our Tech Talks page.

The post Scylla Summit Video: Grab and Scylla Driving Southeast Asia Forward appeared first on ScyllaDB.


Improved Performance in Scylla Open Source 3.0: Streaming and Hinted Handoffs


Improved Performance in Scylla 3.0

Scylla Open Source 3.0 is a landmark release for ScyllaDB: Materialized Views and Secondary Indexes are production-ready, and Scylla Open Source 3.0 can now read and write the Cassandra 3.x SSTable (“mc”) format. In addition, Scylla Open Source 3.0 provides a variety of performance improvements to existing functionality.

In this article we will explore the nature of a pair of those performance improvements, and the scenarios in which Scylla users can expect to see a significant performance gain.

Streaming

When one Scylla node needs to transfer data to another, it undertakes a process called streaming. This happens when a new node joins the cluster, when a node leaves a cluster, when data needs to be repaired, and so on.

In releases up to Scylla Open Source 2.3, streaming was built on top of Scylla’s Remote Procedure Call (RPC) mechanism. The RPC server is already present for other functionality, and building data streaming on top of it was an easy way to get started. However, the RPC paradigm introduces a fair amount of overhead, since the data has to be split into smaller request/response messages.

In Scylla Open Source 3.0, a specialized streaming interface was introduced that is transparent to the developer and user. A stream is now opened between the nodes that will exchange data, and data is then sent continuously, without the RPC request/response overhead. To evaluate the effect of this change, we populated two nodes with 2.8TB of data and a replication factor of 2 (RF=2) in both Scylla Open Source 2.3 and Scylla Open Source 3.0, on the same hardware. The schema is a key-value pair with 4kB values.
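A hypothetical CQL equivalent of that schema might look like this (names are illustrative, not the actual benchmark schema):

CREATE TABLE keyvalue (
    key text PRIMARY KEY,
    value blob  -- each value is roughly 4kB
);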

Once the ingestion quiesces, we add a third node and measure the time it takes for the node addition to complete (data rebalancing) while the cluster is otherwise idle. The new node is then decommissioned, so it transfers its ranges back to the original two nodes, with the cluster still otherwise idle. After that, it is added again, this time with a constant mixed workload running in tandem.

Figure 1: Network interface receive bandwidth in the node being added. Scylla 3.0 (in blue) achieves faster network throughput in this operation and is therefore faster.

Results

Operation                                 | Scylla Open Source 2.3 | Scylla Open Source 3.0 | Difference
Time to decommission a node, cluster idle | 895 s                  | 695 s                  | 22%
Time to add a node, cluster idle          | 1472 s                 | 1236 s                 | 16%
Time to add a node, during load           | 1834 s                 | 1592 s                 | 13%

Table 1: Results of streaming operations in Scylla Open Source 2.3 and Scylla Open Source 3.0 (in seconds). Scylla Open Source 3.0 is faster than Scylla Open Source 2.3 when streaming a total of 2.8TB of data, as a result of the new, specialized streaming interface.

Hinted Handoff

Hinted Handoff is not a performance feature per se, but it can have an impact on performance, aside from its general utility. When a write is deemed successful but one or more nodes did not acknowledge it, Scylla will write a hint that will be replayed to those nodes when they are back online.

This feature is useful for reducing the difference between data in the nodes when nodes are down — whether due to scheduled upgrades or all-too-common intermittent network issues. The biggest impact of this feature is reducing (though not eliminating) the amount of data transferred during repair. However, when requests are read from the database with QUORUM consistency level, the database, upon finding differences, will have to reconcile them on the spot.

Therefore, by having less differences between nodes, some performance improvement is also expected even on standard foreground workloads, outside of repair.

To analyze the impact of Hinted Handoff on performance, we inserted 120,000,000 keys into a 3-node cluster and ran a script in one of them that uses the iptables Linux utility to simulate intermittent network failures. The script can be found in the appendix at the end of this article.

After the insertion phase ends, we issue a fixed-throughput, QUORUM-consistency, read-only workload, reading at 50,000 operations per second. Because those are QUORUM reads, if Scylla finds a discrepancy between the two nodes being queried, it has to fix it right away. This slows down some of the reads for which a difference is found.

This behavior can be clearly seen in Figure 2. Scylla Open Source 3.0 (in yellow) shows no reconciliations, meaning that the data can be served right away, while Scylla 2.3, in the absence of Hinted Handoff, detects differences for the rows that failed to propagate during the time the node was down. Note that as time passes these differences get lower, as another background process, probabilistic read repair, manages to fix some of them.

The results are summarized in Table 2 below: the tail latencies get up to 60% better as the extra work of repairing the difference is avoided.

 

Figure 2: The rate, in requests/s, at which we see reconciliation (read repair) operations during a read-only QUORUM consistency workload. Scylla 2.3, in green, finds differences in the data during the QUORUM read workload, and the process of reconciling them slows down the reads. In Scylla 3.0, due to Hinted Handoff, there are no differences in the data.

Results

Latency percentile | Scylla Open Source 2.3 | Scylla Open Source 3.0 | Improvement
95th percentile    | 5 ms                   | 2 ms                   | 60%
99th percentile    | 12 ms                  | 4.5 ms                 | 62%
99.9th percentile  | 38 ms                  | 34 ms                  | 10%

Table 2: Improvements in tail latencies for a read-only QUORUM consistency workload in Scylla Open Source 3.0 over Scylla Open Source 2.3. Scylla Open Source 3.0 ships with Hinted Handoff enabled, which, among other benefits, reduces the need for reconciliation in QUORUM queries.

Conclusion

Scylla Open Source 3.0 introduces many new features and improved core functionality such as ALLOW FILTERING, SSTable 3.0 format support, and range (multi-partition) scan improvements. It also promotes other features from experimental status to production-ready, such as Materialized Views, Secondary Indexes, and Hinted Handoff. Collectively, these improvements unlock new types of data modeling and result in better-performing workloads.

In this article, we have discussed two of those improvements, which improve Scylla’s performance in the face of intermittent or permanent failures. Look out for more blogs featuring the improvements found in Scylla Open Source 3.0 in the coming days.

Appendix

Streaming

1. Populating Nodes

We used 4 nodes to populate data into the cluster, splitting the work equally between them: each node wrote 175M keys from a distinct sequence, adding up to 700M keys. The populating phase used the following logic to split the data:

2. Load Simulation

We ran read and write operations simultaneously, each limited to 50k requests per second, adding up to 100k requests per second against the cluster. This procedure put the cluster at 50% CPU usage.

Hinted Handoff

1. Populating nodes

2. Simulating network intermittent failures for Hinted Handoff

3. Hinted Handoff, reading from the nodes

The post Improved Performance in Scylla Open Source 3.0: Streaming and Hinted Handoffs appeared first on ScyllaDB.

New Maintenance Releases: Scylla Open Source 3.0.1 and 2.3.2


Scylla Release

Today we are announcing two software maintenance releases for Scylla Open Source:

Scylla 3.0.1

The Scylla team announces the release of Scylla Open Source 3.0.1, a bugfix release of the Scylla Open Source 3.0 stable branch. Scylla Open Source 3.0.1, like all past and future 3.x.y releases, is backward compatible and supports rolling upgrades.

Related links:

Issues solved in this release:

  • In rare cases on large machines, Scylla may not start. The last log message will be “Completed migration of legacy schema tables”. #4096
  • Using the new Filtering feature in combination with LIMIT applies the limit per page instead of globally. This means a request might get more values in the response than requested. #4100

Scylla 2.3.2

The Scylla team announces the release of Scylla Open Source 2.3.2, a bugfix release of the Scylla Open Source 2.3 stable branch. Release 2.3.2, like all past and future 2.3.y releases, is backward compatible and supports rolling upgrades.

Note that the latest stable release of Scylla Open Source is Scylla 3.0 and you are encouraged to upgrade to it.

Related links:

Issues solved in this release:

  • iotune may report incorrect write bandwidth #4064
  • After a Linux kernel upgrade, Scylla may stop serving requests. The root cause is a bug in the Linux RWF_NOWAIT feature that can cause it to return EIO under some conditions, even though everything is fine. #3996
  • In extreme out-of-memory situations, failure to allocate an internal data structure used for synchronizing memtable flushes can result in a segmentation fault and crash. #3931
  • In some cases, a Scylla streaming failure stops the bootstrap of a new node joining the cluster #3732
  • Wrong values in size_estimates. size_estimates is a field in a Scylla system table which holds statistics on table size per node. Tools like Spark use this field to calculate how large their reads should be #3916

The post New Maintenance Releases: Scylla Open Source 3.0.1 and 2.3.2 appeared first on ScyllaDB.

Scylla SSTable 3.0 Can Decrease File Sizes 50% or More


Scylla SSTable 3.0 Can Decrease File Sizes 50% or More

Scylla Open Source 3.0 ships with a new format for on-disk representation, SSTable 3.0. In this article, we will discuss some of the benefits that emerge from the adoption of this format and the scenarios in which they apply. We will discuss the differences between the old and new formats, and demonstrate use cases in which the new format has significant advantages, and others where the advantages are much smaller.

This is truly a situation of “Your Mileage May Vary.” For example, in one test result below, we were able to show a 53% reduction in table size. Other use cases with extremely wide rows may see a savings of up to 80%. Yet in a second use case also shown below, we describe a situation with disk savings of only a few percentage points. Let’s explain why these results diverge so greatly starting with what’s changed in the SSTable format itself.

The SSTable Format

Sorted Strings Table (SSTable) is the persistent file format used by Scylla and Apache Cassandra (as well as many other databases). Scylla has always tried to maintain compatibility with Apache Cassandra, and file formats are no exception. SSTable is saved as a persistent, ordered, immutable set of files on disk. Immutable means SSTables are never modified; they are created by a MemTable flush and are deleted by a compaction.

Up to Scylla 2.x, the supported formats were “ka” and “la“, where the first letter stands for the major version and the second letter stands for the minor version. The new SSTables 3.0 format is named “mc“, “m” being the next after “l” and “c” being the third minor version of the format.

SSTables 3.0 is a complete overhaul of the way data is stored and represented in the binary files, and in order to understand the advantages, we will first briefly go over the older formats and their disadvantages.

Old format structure

SSTable 2 Format

In the old format, each file contains a set of cells, with no notion of rows.

SSTable 2 cell

Each cell has a name and a value: the cell name consists of the clustering key value and the column name, and the cell value holds the column value.

Some of the more significant problems with this method of storing data:

First, there is a lot of data duplication. Cell name (which consists of the clustering key value and column name) is repeated across cells. This can cause very significant bloat, especially if the cell name is a long, meaningful set of values. Full TTLs and timestamps are also present in every cell, also increasing the on-disk space used. These are the most prominent issues with the old SSTables format.

Second, the old format is not aligned with CQL and how it represents data. Same-row cells are not grouped, so in order to read a row, we have to read irrelevant cells until we actually reach the end of the row, which results in redundant IO and wasted CPU cycles.

The old SSTable format kept an index inside each partition if the partition was over a certain size. Such an index was designed to make it more efficient to find a row inside a partition, but the format was such that only linear searches were supported through this intra-partition index. In the mc file format, this index is represented as a binary tree, allowing for even faster searches within a wide partition.

What is added in SSTables 3.0 or the “mc” format

SSTable 3

From the start, one major advantage of adopting the “mc” file format is that files from Apache Cassandra 3.x can now be directly imported into Scylla, which means that Scylla is fully compatible at the format level with Apache Cassandra 3.x. But beyond compatibility, the new file format design brings in a host of changes that directly affect users.

The new format introduces the notion of rows. Each partition consists of rows, and each row contains cells. This is aligned with CQL’s data representation: in order to read a row, we simply need to find and read it, instead of assembling it from disparate cells. Data is easier to read, analyze, and parse. Inherent row attributes have also been added: additional row metadata makes it directly possible to find out whether a row is alive, expiring, and so on. This is done in a specialized structure, not as another cell within the row.

Users will see significant disk space savings, depending on their schema, which can reach anywhere from just a few percent to 80%. We have taken two extreme examples to show these differences in the benchmark presented below. In the case of wide rows with long column names, the space savings will increase with the row count and column name length. The disk space savings introduced by the new format can be significant enough that we can now consider disabling compression in order to save some cycles and improve latency.

In large part, the space savings come from the fact that column metadata is stored separately from the data file and no longer needs to be repeated. The benefits of this change go beyond space saving: In the old format, the SSTables files could not be made sense of without the table schema as well. The new format makes it possible now to read the data in SSTables files without consulting the schema, with all the data required available from the SSTables files directly. This allows easier parsing of the SSTables themselves when decoupled from the database in operations like backups/restores and migrations.

There are other improvements that contribute to space savings as well. Variable-length integers are introduced: a typical integer value uses 8 bytes, which is wasteful when the actual value is small. A variable-length integer can be stored in only 1-2 bytes, depending on its size.

Delta-based encoding has also been added. When we want to store, for example, timestamps for a particular row, it isn’t necessary to store a full timestamp for every cell. Instead, the smallest timestamp is taken as a base, and the rest of the cells store small delta values from it. The deltas take up far fewer bytes than a full timestamp, which saves additional disk space.

The new format supports binary searching through the index of rows inside the partition, instead of linear searches. This means the search will take O(log n) instead of O(n), which can be a huge performance gain for large partitions.

Learn More about the New SSTable 3.0 Format:

Scylla’s Roadmap for the SSTables 3 Format

In the current version Scylla Open Source 3.0:

To be conservative in rolling out this new feature, and to ease the transition from Scylla Open Source 2.x to 3.0, SSTables 3.0 are disabled by default. In release 3.0, SSTables 3.0 need to be enabled by adding the following statement to /etc/scylla/scylla.yaml:

enable_sstables_mc_format: true

This configures the cluster to read and write SSTables 3.0, with reading the old formats supported as well. Note that this feature is set only once per cluster, and is not configurable per keyspace or per table.

Old SSTables will be upgraded to the new format through compaction. Over time and with use, you will see your files converted to the “mc” format, as compactions are performed. New files will be created in the new format.

In a future Scylla Open Source 3.x release:

Binary searching through the promoted index on sliced reads will be supported.

A future version will also support caching of promoted index blocks, for additional performance improvement in wide partitions.

Lastly, SSTable 3.0 will be enabled by default. Read operations to tables in old formats will still be supported.

Testing the New Format

To provide a practical example of the space savings of the new SSTable 3.0 format, we have created a simple benchmark, which simulates two typical use-cases: a) a simple key-value schema designed to cause minimal bloat with “la” format SSTables (very short column names, no clustering key) and b) an IoT use case with a set of sensors storing data.

The goal is to demonstrate how space savings happen as a function of the schema. Since each cell in an “la”-formatted file contains the column name and clustering key values, having longer column names and using clustering keys (also with long names) will cause each cell to take up more space. Using the old “la” format, the more numerous and longer-named the columns, the bigger the bloat, and thus, by comparison, the larger the savings the new “mc” format should present. Conversely, the fewer and shorter-named the columns, the less “la” format bloat is present, and the smaller the difference between the disk space used by the two formats.

In terms of disk space use, our two use cases should represent the opposite poles of the data stored by a Scylla user. In both cases, compression has been disabled, in order to see actual data sizes on-disk.

Use case 1: Simple key-value data

This use case consists of a simple key-value schema, with the “key” column as the primary key, no clustering key, and fixed key and value sizes. We populated the table with 15 million key-value pairs.

CREATE TABLE kvexample (
    key text,
    val text,
    PRIMARY KEY (key))
WITH compression = {};

Full yaml available here

After generating the data, we saw disk space savings of less than one percent:

Results:

la format: 283GB
mc format: 281GB

Total disk space used "la" format (283 GB) vs. "mc" format (281 GB)

Use case 2: IoT sensors data

This is a more complex schema, with a clustering key and long, meaningful column names, representing a large set of sensors storing several columns of data. This is, of course, not the widest possible row we could come up with, but simply a reasonable approximation of a standard use case. As the schema grows, the disk space savings will also increase.

This is the cassandra-stress yaml used for generating the IoT sensor data. It consists of 20 million sensors, over 4000 data points, with the sensor as the primary key and the timestamp as the clustering key:

CREATE TABLE iotexample (
    sensor uuid,
    temperature int,
    humidity int,
    pressure float,
    weathersource text,
    timestamp timestamp,
    PRIMARY KEY (sensor, timestamp))
WITH compression = {};

Full yaml available here

In this case, each cell in the old format contains the sensor UUID and the timestamp in addition to the cell value; in the new format, all that redundant data is omitted. After generating the data, we saw 53% in disk space savings:

Results:

la format: total of 373GB
mc format: total of 195GB

Total disk space used "la" format (373 GB) vs. "mc" format (195 GB)

In conclusion, the new SSTables 3.0 format comes with significant performance, compatibility, and disk space advantages. However, it is important to keep in mind that not all workloads will benefit equally from its introduction. The main improvement is in disk space savings for wide rows with long, meaningful column names (the need to keep column names very short is now gone). Keep in mind that the more complex the schema and the more columns there are, the higher the space savings are expected to become. In this post, we have demonstrated a rather small schema with 53% total disk-space savings; with other schemas we expect even higher figures, closer to the 80% mark and beyond.

Now that Scylla is Cassandra 3.0 compatible, our SSTable 3.0 work is not over. There will be additional performance improvements coming in Scylla 3.x, so stay tuned!

Additional SSTables 3.0 format information:

The post Scylla SSTable 3.0 Can Decrease File Sizes 50% or More appeared first on ScyllaDB.

Scylla Monitoring Stack 2.1


Scylla Release

The Scylla team is pleased to announce the release of Scylla Monitoring Stack 2.1.

Scylla Monitoring is an open source stack for monitoring Scylla Enterprise and Scylla Open Source, based on Prometheus and Grafana. The Scylla Monitoring 2.1 stack supports:

  • Scylla Open Source versions 2.3, 3.0
  • Scylla Enterprise versions 2017.x and 2018.x
  • Scylla Manager 1.3

Related Links

New in Monitoring 2.1

  • Upgrade to Grafana 5.4.2 (Upgrade to the latest, stable, Grafana, Prometheus version #324)
    Grafana made some big changes when moving from version 4 to 5; most of them are under the hood. There are some obvious changes, like the overall layout and the use of folders for the dashboards. We use folders for versions, so all the dashboards of a specific version will be under the same folder.
    Note that Grafana changed the URL path of the dashboards, so if you bookmarked a dashboard (in version 4.x) you will need to update it following the upgrade.
  • Using local files (Make all plugins local #222, Mount Grafana dashboard directory #48)
    From version 2.1, everything is done using local files. Instead of uploading the dashboards with an API, they are placed under grafana/build/{version} and loaded using a dashboard provisioning file placed under grafana/provisioning/dashboards. Plugins are no longer downloaded from the web; they are part of the project, located in grafana/plugins. Data sources are configured from a file placed under grafana/provisioning/datasources/. Moving to file-based configuration and provisioning is safer and faster.
  • Clickable alerts (Allows jumping from an alert to a dashboard #457)
    You can now jump directly from an event to the in-depth view. Click the time of the event to set the view time to the event occurrence. Click the server to stay in the same time period.

Customized Dashboard

Notes for users who use their own dashboards.

If you are uploading your files with the API, Grafana 5.x is backward compatible, but some attributes like row height are not supported.

If you are using our templating and the make_dashboard.py script, it will now generate, by default, Grafana 5-format dashboards with enhancements.

Rows will be translated to absolute x, y positions (the Grafana 5.x way).

Also, you can use "gridPos", which is used by panels to set the coordinates, width, and height, to specify dimensions. For example, to set a row height to 2 units: "gridPos": {"h": 2}

The make_dashboard.py script will fill in the missing information and produce a valid "gridPos" object.

If you would like to upload your dashboards from a file, note that the file format is a bit different from the one the API uses: you only need the part that is inside the "dashboard" tag.

If you are using the make_dashboard.py script, you can use the --as-file flag to generate the files in the right format.

The post Scylla Monitoring Stack 2.1 appeared first on ScyllaDB.

Meshify and Scylla: an Industrial-Strength IoT Solution


Sam Kenkel, Meshify

This is a story about the Industrial Internet of Things (IIoT). But one that began long before the Internet was invented.

It was early afternoon on March 2nd, 1854, when a careless accident led to the explosion of the main boiler at the Fales & Gray Car Factory in Hartford, Connecticut. Nineteen of the factory’s workers died immediately, and twenty-three others were injured.

Destruction left in the aftermath of the Fales & Gray Car Works boiler explosion in Hartford, Connecticut, 1854. This disaster led to the foundation of the Hartford Hospital and formation of the Hartford Steam Boiler Company – photo courtesy of the Connecticut Historical Society

Right after the end of the American Civil War, the horrific explosion of the Sultana on the Mississippi River killed somewhere between 1,500 and 1,800 people, many of whom were Union soldiers heading home. Sensing the need for industry-wide transformation, a dozen years after the Fales & Gray disaster, a number of Hartford Polytechnic Club members decided to take action, and in 1866 founded the Hartford Steam Boiler Inspection and Insurance Company.

 

Hartford Steam Boiler logo

The idea was to combine proactive inspection with insurance. To prevent disasters as much as to underwrite businesses for insurable losses. They became such a prominent influencer of the industry that their “Hartford standards” became quality specifications for boiler design, manufacture and maintenance.

Munich RE logo

Over the past century-and-a-half Hartford Steam Boiler diversified to cover other related infrastructure: water pipes, pumps, HVAC systems, refrigerators and various manners of industrial equipment. In 2016, Hartford Steam Boiler, now owned by the German reinsurance company Munich RE, acquired Meshify. Founded in 2010, Meshify’s goal was to bring the latest of Internet and Big Data technologies to a one hundred and fifty year old industry. To stay ahead of problems and, where possible, stave off disaster.

Meshify logo

Fast Forward: Meshify at Scylla Summit 2018

Scylla Summit was held in the San Francisco Bay Area in October 2018. At his session, entitled “Meshify: A Case Study, or Petshop Sea Monsters,” Sam Kenkel, DevOps lead at Meshify, began by introducing Meshify to the audience, and its relations to Hartford Steam Boiler and Munich RE.

Sam pulled out a water sensor and a temperature probe from his pocket, and explained how, given these two devices, Meshify could issue a warning to a customer of a temperature drop which could mean the imminent bursting of a water pipe. If that warning was ignored, further, more urgent notices could send warning of an actually frozen pipe, or of water on the floor.

Even if a failure is not averted, they can provide diagnostic information for how it may have occurred for insurance purposes.

Apart from dire warnings and catastrophic equipment failures, these sensors can also prevent needless “truck rolls” (driving out to a site to do a manual reading) during nominal operating conditions.

So how does it work?

The time series data from every sensor is sent back to Scylla, where it can be compared to a set of user-defined alarms. If those alarms are triggered, notifications can be sent out via SMS, email, or webhook.

Meshify’s application runs stateless via containers. Sam alluded to the famous analogy for cloud computing of “pets vs. cattle,” and said “You want servers to be cattle.” By this, he was referring to servers, as Randy Bias described “designed for failure, where no one, two, or even three servers are irreplaceable. Typically, during failure events no human intervention is required.” Scylla’s high availability scheme allows it to act like “cattle.”

Yet while many organizations choose to containerize their systems or make them serverless, Meshify does neither of these. The reason for this is their adherence to vendor neutrality, to maintain “cattle”-like replaceability. So Meshify only uses cloud services that have drop-in replacements. For example, Sam pointed out, if you migrate to DynamoDB, what are your options for migrating away?

Sam then expressed a philosophical axiom: “There is no cloud; only someone else’s server.” For example, Sam said that using an Amazon Machine Image (AMI) means that he can answer with confidence what region his data is in. There is still software on a server somewhere, and that location can have legal implications. Which is a vital issue given the requirements of GDPR.

What is more, Sam correctly pointed out that Scylla’s performance comes from a more direct access to, and knowledge of, the underlying hardware it is running on. He spoke about how Scylla’s pre-tuned AWS AMI allows for a rapid, consistent node deployment. Time to deploy is five minutes. Scylla’s self-tuning means that there is no variance from misconfiguration. You get the consistency of a container, but all the performance benefit of a tuned EC2 instance.

So, going back to the “pets vs. cattle” analogy, Scylla provides all the love you’d give a pet, with all the replaceability of cattle. Hence the “petshop seamonster.”


Sam Kenkel (right), accepting the “Fastest Time to Production” award on behalf of Meshify, is seen here shaking the hand of ScyllaDB CEO Dor Laor (left) at Scylla Summit 2018.

Just as Meshify’s core business is to watch when industrial machinery fails, it also watches for when its data infrastructure fails. When a node dies, it triggers an alarm. Within five minutes, a replacement node is started, and within another five, it is joined to the cluster and is ready to have data streaming to it. Migration of data takes about two hours thereafter. While this is a manual process now (by Meshify’s choice), Sam made clear to say that nothing precludes this from being an automated task.

Beyond node failure, Meshify’s disaster recovery plan can deploy an entire new cluster within 10-15 minutes, and then start streaming data using sstableloader. Within 30 minutes they can get their real-time monitoring systems streaming into the new database. (Restoring historical data from S3 takes longer, but can be done in the background.)

Unlike the response to the Fales & Gray disaster, in the modern world, organizations and communities do not have a dozen years to respond to systemic failures. Fulfilling the vision of the Hartford Steam Boiler founders, Meshify’s job is to stay on top of rapidly changing conditions in real-time, and, where possible, to avert disasters proactively.

It is unsurprising then that their deployment to Scylla was accomplished with alacrity. When we say “Fast Forward” we meant it. Meshify was awarded the “Fastest Time to Production Award” at Scylla Summit 2018.

If you’d like to learn more about Meshify’s use case, you can watch Sam’s presentation at Scylla Summit below, check out his slides, and make sure to read the Meshify case study.

 

The post Meshify and Scylla: an Industrial-Strength IoT Solution appeared first on ScyllaDB.
