SaaS vs OSS – Fight or flight, round #2

December 15, 2018, 8:08 am

≫ Next: Scylla and Confluent Integration for IoT Deployments

≪ Previous: Going Head-to-Head: Scylla vs Amazon DynamoDB

Inspecting Software Licensing (APGL, SSPL, Confluent Community)

To quote Bob Dylan, “the times they are a changin’.” Microsoft loves Linux, IBM buys Red Hat, RedisLabs changes their module license to Commons Clause, Mongo invents Server Side Public License (SSPL) and moves from AGPL, AWS open sources Firecracker and releases a Kafka service, and the hot news from Friday, Confluent changes its license for components of the Confluent Platform from Apache 2.0 to the Confluent Community License.

A few weeks ago I wrote about MongoDB’s SSPL, which is similar to Confluent’s new license. You could say the Confluent Community License is to the Apache license as MongoDB SSPL is to AGPL. Both of these new licenses take the same position in order to protect their assets from cloud providers.

It is hard to blame Confluent for responding to pressure from the AWS Kafka service–and perhaps fear that other cloud vendors will follow. Rightfully, Confluent wishes to enjoy the fruits of their investment in KSQL and, in parallel, still keep it open source. It’s important to note that KSQL isn’t part of the Apache Kafka project, which remains under the ASL2 license and was fully developed by Confluent. To the best of my knowledge, AWS has not provided KSQL as a service but perhaps this is a means of preventing future abuse. AWS adoption of Kafka also signals a victory for of OSS Kafka over the proprietary Kinesis technology.

Although more and more OSS vendors are going down the path of more restrictive licenses, and they might even eventually make these standard, overall this is a step in the wrong direction for our industry. These more restrictive licenses have the regrettable consequence of creating silos where once there was sharing.

Remember that just the opposite was true for large community projects like Linux, KVM, and Hadoop, which saw contributors line up by the hundreds. Unfortunately, pressure from SaaS/Cloud providers is turning OSS vendors toward the undesirable (but perhaps logical) path of more restrictive usage licenses. This is a lose-lose scenario, one that goes against the very spirit of open source.

In a perfect world, customers would not use OSS software-based services from vendors who did not participate in the creation of that software. This is the flag that OSS vendors should carry. On the other hand, it’s hard to compete with an 800-pound gorilla that owns a lot of the mindshare of the industry.

My prediction is that vendors like AWS will receive more criticism and, in response, will lean more and more toward an open source play. Were this a game of chess, a streaming SQL query functionality contribution to Apache Kafka by AWS would result in a call of check. Let’s stay tuned and see how it evolves.

Another class of companies who will be hurt by this licensing trend will be smaller as-a-service vendors that provide MongoDB/Elastic/Kafka/etc. Despite their contributions back to the OSS, these smaller companies will be hampered by the more restrictive licenses, which will keep them from running the very technologies they’ve aided. Examples include mLabs with MongoDB, Instaclustr with Kafka/Confluent, and IBM Compose with a variety of offerings.

This trend is neither healthy nor ideal. We should all want to see OSS vendors and IaaS vendors complement each other — either by contributing to the same OSS project and sharing the commercial benefits or by allowing the OSS vendor to monetize on top of the IaaS market place. As an end user, you should support this cause by directing your buying power toward not only the vendors providing the best services, of course, but also the ones making a difference by aligning themselves with the overall long term value you receive from the ecosystem.

Dor is the co-founder and CEO of ScyllaDB, who develop the AGPL’ed Scylla database and Seastar, its Apache-licensed core engine.

The post SaaS vs OSS – Fight or flight, round #2 appeared first on ScyllaDB.

↧

Scylla and Confluent Integration for IoT Deployments

December 19, 2018, 4:00 am

≫ Next: Scylla Summit 2018 Keynote: Four Years of Scylla

≪ Previous: SaaS vs OSS – Fight or flight, round #2

The Internet is not just connecting people around the world. Through the Internet of Things (IoT), it is also connecting humans to the machines all around us and directly connecting machines to other machines. In this blog post we’ll share an emerging machine-to-machine (M2M) architecture pattern in which MQTT, Apache Kafka and Scylla all work together to provide an end-to-end IoT solution. We’ll also provide demo code so you can try it out for yourself.

IoT Scale

IoT is a fast-growing market, already known to be over $1.2 trillion in 2017 and anticipated to grow to over $6.5 trillion by 2024. The explosive number of devices generating, tracking, and sharing data across a variety of networks is overwhelming to most data management solutions. With more than 25 billion connected devices in 2018 and internet penetration increasing at a staggering 1066% since 2000, the opportunity in the IOT market is significant.

There’s a wide variety of IoT applications, like data center and physical plant monitoring, manufacturing (a multibillion-dollar sub-category known as Industrial IoT, or IIoT), smart meters, smart homes, security monitoring systems and public safety, emergency services, smart buildings (both commercial and industrial), healthcare, logistics & cargo tracking, retail, self-driving cars, ride sharing, navigation and transport, gaming and entertainment… the list goes on.

Original interactive M2M/IoT Sector Map is available on BeechamResearch.com

A significant dependency for this growth is the overall reliability and scalability of IoT deployments. As Internet of Things projects go from concepts to reality, one of the biggest challenges is how the data created by devices will flow through the system. How many devices will create information? What protocols do the devices use to communicate? How will they send that information back? Will you need to capture that data in real time, or in batches? What role will analytics play in the future? What follows is an example of such a system, using existing best-in-class technologies.

An End-to-End Architecture for the Internet of Things

IOT-based applications (both B2C and B2B) are typically built in the cloud as microservices with similar characteristics. It is helpful to think about the data created by the devices and the applications in three stages:

Stage one is the initial creation — where data is created on the device and then sent over the network.
Stage two is how the central system collects and organizes that data.
Stage three is the ongoing use of that data stored in a persistent storage system.

Typically, when sensors/smart-devices get actuated they create data. This information can then be sent over the network back to the central application. At this point, one must decide which standard the data will be created in and how it will be sent over the network.

One widely used protocol for delivering this data is the Message Queuing Telemetry Transport (MQTT) protocol. MQTT is a lightweight messaging protocol for pub-sub communication typically used for M2M communication. Apache Kafka® is not a replacement to MQTT, but since MQTT is not built for high scalability, longer storage or easy integration to legacy systems, it complements Apache Kafka well.

In an IoT solution, devices can be classified into sensors and actuators. Sensors generate data points while actuators are mechanical components that may be controlled through commands. For example, the ambient lighting in a room may be used to adjust the brightness of an LED bulb and MQTT is the protocol optimized for sensor networks and M2M. Since MQTT is designed for low-power and coin-cell-operated devices, it cannot handle the ingestion of massive datasets.

On the other hand, Apache Kafka may deal with high-velocity data ingestion but not with M2M. Scalable IoT solutions use MQTT as an explicit device communication while relying on Apache Kafka for ingesting sensor data. It is also possible to bridge Kafka and MQTT for ingestion. It is recommended to keep them separate by configuring the devices or gateways as Kafka producers while still participating in the M2M network managed by an MQTT broker.

At stage two, data typically lands as streams in Kafka and is arranged in the corresponding topics that various IoT applications consume for real-time decision making. Various options like KSQL and Single Message Transforms (SMT) are available at this stage.

At stage three this data, which typically has a shelf life, is streamed into a long-term store like Scylla using the Kafka Connect framework. A scalable, distributed, peer-to-peer NoSQL database, Scylla is a perfect fit for consuming the variety, velocity and volume of data (often time-series) coming directly from users, devices and sensors spread across geographic locations.

What is Apache Kafka?

Apache Kafka is an open source distributed message queuing and streaming platform capable of handling a high volume and velocity of events. Since being created and open sourced by LinkedIn in 2011, Kafka has quickly evolved from a message queuing system to a full-fledged streaming platform.

Enterprises typically accumulate large amounts of data over time from different sources and data types such as IoT devices and microservices applications. Traditionally, for businesses to derive insights from this data they used data warehousing strategies to perform Extract, Transform, Load (ETL) operations that are batch-driven and run at a specific cadence. This leads to an unmanageable situation as custom scripts move data from their sources to destinations as one-offs. It also creates many single points of failure and does not permit analysis of the data in real time.

Kafka provides a platform that can arrange all these messages by topics and streams. Kafka is enterprise ready and has features like high availability (HA) and replication on commodity hardware. Kafka decouples the impedance mismatch between the sources and the downstream systems that need to perform business-driven actions on the data.

What is Scylla?

Scylla is a scalable, distributed, peer-to-peer NoSQL database. It is a drop-in replacement for Apache Cassandra that delivers as much as 10X better throughput and more consistent low latencies. It also provides better cluster resource utilization while building upon the existing Apache Cassandra ecosystem and APIs.

Most microservices developed in the cloud prefer to have a distributed database native to the cloud that can linearly scale. Scylla fits that use case well by harnessing modern multi-core/multi-CPU architecture, and producing low, predictable latency response times. Scylla is written in C++, which results in significant improvements of TCO, ROI and an overall better user experience.

Scylla is a perfect complement to Kafka because it leverages the best from Apache Cassandra in high availability, fault tolerance, and its rich ecosystem. Kafka is not an end data store itself, but a system to serve a number of downstream storage systems that depend on sources generating the data.

Demo of Scylla and Confluent Integration

The goal of this demo is to demonstrate an end-to-end use case where sensors emit temperature and brightness readings to Kafka and the messages are then processed and stored in Scylla. To demonstrate this, we are using Kafka MQTT proxy (part of the Confluent Enterprise package), which acts as a broker for all the sensors that are emitting the readings.

We also use the Kafka Connect Cassandra connector, which spins up the necessary consumers to stream the messages into Scylla. Scylla supports both the data format (SSTable) and all relevant external interfaces, which is why we can use the out of the box Kafka Connect Cassandra connector.

The load from various sensors is simulated as MQTT messages via the MQTT Client (Mosquitto), which will publish to the Kafka MQTT broker proxy. All the generated messages are then published to the corresponding topics and then a Scylla consumer picks up the messages and stores them into Scylla.

Steps

Download Confluent Enterprise
Once the tarball is downloaded – then:
Set the $PATH variable

For the demo we choose to run the KAFKA Cluster locally but if we want to run this in production we would have to modify a few files to include the actual IP addresses of the cluster:
- Zookeeper – /etc/kafka/zookeeper.properties
- Kafka – /etc/kafka/server.properties
- Schema Registry – /etc/schema-registry/schema-registry.properties
Now we need to start the services Kafka and Zookeeper This should start both zookeeper and Kafka. To do this manually you have to provide these two parameters
Configuring the MQTT proxy
Inside the directory /etc/confluent-kafka-mqtt there is a file kafka-mqtt-dev.properties file that comes with the confluent distribution and this lists all the available configuration options for MQTT Proxy. Modify these parameters
Create Kafka topics
The simulated MQTT devices will be publishing to the topics temperature and brightness, so let’s create those topics in Kafka manually.
Start the MQTT proxy
This is how we start the configured MQTT proxy
Installing the Mosquitto framework
Publish MQTT messages
We are going to be publishing messages with QoS2, that is the highest quality of service supported by MQTT protocol
Verify messages in Kafka
Make sure that the messages are published into the kafka topic
To produce a continuous feed of MQTT messages (optional)
Run this on the terminal
Let’s start a scylla cluster and make it a kafka connect sink

Note: If you are choosing to use Scylla in a different environment - then start from here https://www.scylladb.com/download/

Once the cluster comes up with 3 nodes then ssh into each node and uncomment the broadcast address in /etc/scylla/scylla.yaml, change it to the public address of the node. If we are running the demo locally on a laptop or if we are running the Kafka connect framework in another Data Center compared to where the Scylla cluster is running.
Let’s create a file cassandra-sink.properties
This will enable us to start the connect framework with the necessary properties.

Add these lines to the properties file
Next we need to download the binaries for the stream reactor
Now change the plugin.path property in
/confluent-5.0.0/etc/schema-registry/connect-avro-distributed.properties

To ABSOLUTE_PATH/confluent-5.0.0/lib/stream-reactor-1.1.0-1.1.0/libs/
Now let’s start the connect framework in distributed mode
Make sure that the cassandra-sink.properties file is updated with the necessary contact points of scylla nodes i.e the external IP addresses.

Make sure that the necessary keyspace and tables with the appropriate schema are created after you CQLSH into the scylla nodes.

Then to start the sink connector

After you run the above command, then you should be able to see Scylla as a Cassandra sink and any messages published using the instructions in step-9 will get written to scylla as a downstream system.
Now, let’s try to run a script which can simulate the activity of a MQTT device - you can do this by cloning this repo https://github.com/mailmahee/MQTTKafkaConnectScyllaDB
And then running

This script simulates MQTT sensor activity and publish messages to the corresponding topics. Then the connect frameworks drains the messages from the topics into the corresponding tables in Scylla.

You did it!

If you follow the instructions above, you should now be able to connect Kafka and Scylla using the Connect framework. In addition, You should be able to generate MQTT workloads that publish the messages to the corresponding Kafka topics, which are then used for both real-time as well as batch analytics via Scylla.

Given that applications in IoT are by and large based on streaming data, the alignment between MQTT, Kafka and Scylla makes a great deal of sense. With the new Scylla connector, application developers can easily build solutions that harness IoT-scale fleets of devices, as well as store the data from them in Scylla tables for real-time as well as analytic use cases.

Many of ScyllaDB’s IoT customers like General Electric, Grab, Nauto and Meshify use Scylla and Kafka as the backend for handling their application workloads. Whether a customer is rolling out an IoT deployment for commercial fleets, consumer vehicles, remote patient monitoring or a smart grid, our single-minded focus on the IoT market has led to scalable service offerings that are unmatched in cost efficiency, quality of service and reliability.

Try It Yourself

to use MQTT Proxy

The post Scylla and Confluent Integration for IoT Deployments appeared first on ScyllaDB.

↧

Scylla Summit 2018 Keynote: Four Years of Scylla

December 21, 2018, 8:55 am

≫ Next: Scylla Enterprise Release 2018.1.8

≪ Previous: Scylla and Confluent Integration for IoT Deployments

Now that the dust has settled from our Scylla Summit 2018 user conference, we’re glad for the chance to share the content with those who couldn’t make the trip to the Bay Area. We’ll start with the keynote from our CEO and Co-founder, Dor Laor, who kicked off the event with his talk about the past, present and future of Scylla.

Watch the video in full:

Browse the slides:

See our related Tech Talks page

Dor began with an overview of the trends in the industry and first and foremost, digital transformation. Sticking to the down-to-earth, practical culture at ScyllaDB, Dor covered real life customer examples from all around us. Starting with space, with satellite-enabled services like GPS Insight and TellusLabs, to the rain forest plant-based beauty products of Brazil’s Natura, to the GE Predix-powered Industrial Internet of Things (IIoT) platform, Comcast X1’s on-demand services, as well as automotive applications from Faraday Future and Nauto.

In the beginning…

Dor shared our company’s origins. He recalled first announcing the Seastar framework in February 2015, and leaving stealth mode in September of that year. ScyllaDB CTO Avi Kivity. presented at that year’s Cassandra Summit on how a new database, Scylla, could deliver 1,000,000 CQL operations per server.

Over the ensuing three years we made a great deal of progress. We released Scylla 1.0 at the end of March 2016. That year also saw the first Scylla Summit in September. The following year, in March 2017, Scylla unveiled its first Enterprise software release.

While Scylla was blazing its own path in the world of NoSQL, Dor also remarked on the successes of others in the industry, including MongoDB’s public offering in October 2017, and the September 2018 IPO of Elastic. These events serve as validation of the growing Big Data market as the hunger for data increases, fed by the growing appetite of modern, planet scale software. Not only most enterprises now trust in the operational capabilities of NoSQL distributed databases, the new world requirements cannot be met by traditional relational models.

State of the Art

Moving to the present, Dor announced Scylla Open Source 3.0. With this release, Scylla was finally achieving feature parity with Cassandra, and, in some cases, it was taking the lead. For storage, SSTable format 3.0 (mc) would reduce data footprint on disk. Production-ready Materialized Views (MV) and Global Secondary Indexes (GSI) will help users access only the data they need. Lightweight Transactions (LWT) remains the last major feature to achieve full feature parity with Cassandra.

Dor also announced that our cloud managed database, Scylla Cloud, was available as early access. running on Amazon Web Services (AWS), Scylla Cloud lets users launch a fully managed single-tenant, self-service Scylla cluster in minutes.

As much as we talk about Cassandra, we are shifting gears and wish to be competitive with the best of breed NoSQL databases, led by DynamoDB as an example.

Dor shared results from a head-to-head YCSB comparison of Scylla versus Amazon DynamoDB. We just recently published the comparative benchmark results. Our test results show you can achieve 1/4th the latency and spend only 1/7th the cost with Scylla for similar throughput on DynamoDB. (Scylla Cloud is 4-6X less expensive than DynamoDB.)

However, the real performance difference occurred in Zipfian distributions. You can read the blog in full as to why this is an important real-world consideration. Analogous test results were found for Bigtable, and CosmosDB was expected to perform similarly.

Another key feature introduced for the first time at Scylla Summit 2018 was our unique ability to support per-user SLAs, allowing system managers to limit database resource utilization. With this, Scylla customers can use the same Scylla cluster to service both transaction processing (mixed read-write, or write-heavy loads) as well as analytics (read-only/mostly) requests. Glauber Costa would host a full session on this, entitled OLAP or OLTP: Why not both?

Per-user-SLA utilizes 3 years of development of SLA guarantee for real time operations over distributed database background operations such as compaction, repair and streaming. This is a point in time evolution towards perfect multi tenant database.

Dor then enumerated a list of noteworthy accomplishments and the challenges we still have before us. For example, while he was proud of our Mutant Monitoring System (MMS), there is still work to be done on our Knowledgebase, as well as our upcoming launch of Scylla University. And while performance is good, and compactions are relatively smooth compared to other offerings, there are still more optimizations to be done. And while he was proud of the work we’ve done to integrate with Apache Spark, there’s a lot more to do to align Scylla with Kubernetes.

The Shape of Things to Come

To conclude, Dor gave a glimpse into the future of Scylla. Finishing up Cassandra parity features, especially Lightweight Transactions. Fleshing out Scylla Cloud. Making Scylla itself a stronger offering, with new tiered storage options, improvements in performance and additional drivers. And finally, making Scylla even easier to manage.

It has been a remarkable journey over the past four years. From all of us at ScyllaDB, thank you for following us on our journey, and for a wonderful 2018.

Looking ahead, 2019 is sure to be another amazing year of pioneering achievements in the world of Big Data, both for Scylla as well as our users and customers. We’re looking forward to all that we will accomplish together!

The post Scylla Summit 2018 Keynote: Four Years of Scylla appeared first on ScyllaDB.

↧

Scylla Enterprise Release 2018.1.8

January 2, 2019, 8:16 am

≫ Next: Scylla Summit 2018 Tech Talks Now Online

≪ Previous: Scylla Summit 2018 Keynote: Four Years of Scylla

Scylla Enterprise Release

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.8, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.8 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.8 in coordination with the Scylla support team.

This release includes memory management improvements, first introduced in Scylla Open Source 2.3 release and graduated into Scylla Enterprise. These changes allow Scylla to free large contiguous areas of memory more reliably with less effort and improves the performance of workloads that have large blobs or collections. #2480

Scylla Summit 2018 Tech Talks Now Online

January 8, 2019, 11:52 am

≫ Next: JSON Support in Scylla

≪ Previous: Scylla Enterprise Release 2018.1.8

There was so much happening at our Scylla Summit 2018 late last fall. We held more than three dozen sessions, including multiple keynotes and concurrent breakout tracks. Our Tech Talks page has been updated with the videos and slides from Scylla Summit 2018. Now you can see what you missed — whether or not you were able to attend our user conference.

Dor Laor at Scylla Summit 2018

In the weeks ahead we’ll showcase some of the best talks from our conference, but there’s no need to wait. You can browse the whole catalog of talks from Scylla Summit 2018 today!

You can see the YouTube videos and all the SlideShare presentations in one place. All the keynotes, both by ScyllaDB executives and some of our most prominent customers. All the use case presentations by our incredible groundbreaking community members. All the tech talks from our engineers on current and upcoming features, best practices, tips and tricks.

We’ll point out a good intro to many of engineering talks at Scylla Summit 2018. ScyllaDB CTO and Co-Founder Avi Kivity presented on our near-term and longer-term initiatives. This is quite timely, too, as the release of Scylla Open Source 3.0 is right around the corner!

If you have any questions or comments after watching these presentations, or if you’d like to share your own experience with Scylla, please feel free to contact us!

The post Scylla Summit 2018 Tech Talks Now Online appeared first on ScyllaDB.

↧

JSON Support in Scylla

January 10, 2019, 8:10 am

≫ Next: Scylla Enterprise Release 2018.1.9

≪ Previous: Scylla Summit 2018 Tech Talks Now Online

Beginning with version 2.3, Scylla Open Source supports the Javascript Object Notation (JSON) format. That includes inserting JSON documents, retrieving data in JSON and providing helper functions to transform native CQL types into JSON and vice versa.

Also note that schemas are still enforced for all operations — one cannot just insert random JSON documents into a table. The new API is simply a convenient way of working with JSON without having to convert everything back and forth client-side.

JSON support consists of CQL statements and functions, described here, one by one, with examples.

You can use the following code snippet to build a sample restaurant menu. This example will serve as a basis in the following sections. This snippet also contains a second table based on collections, which contains additional information about served dishes.

CREATE KEYSPACE restaurant WITH REPLICATION = { 'class' : 'SimpleStrategy', 'replication_factor' : 1 }; use restaurant;

CREATE TABLE menu (category text, position int, name text, price float, PRIMARY KEY(category, position));

INSERT INTO menu (category, position, name, price) VALUES ('starters', 1, 'foie gras', 10.50); INSERT INTO menu (category, position, name, price) VALUES ('starters', 2, 'steak tartare', 9.50); INSERT INTO menu (category, position, name, price) VALUES ('starters', 3, 'taco de pulpo', 8.00); INSERT INTO menu (category, position, name, price) VALUES ('soups', 1, 'sour rye soup', 12); INSERT INTO menu (category, position, name, price) VALUES ('soups', 2, 'sorrel soup', 8); INSERT INTO menu (category, position, name, price) VALUES ('soups', 3, 'beef tripe soup', 11.20); INSERT INTO menu (category, position, name, price) VALUES ('main courses', 1, 'red-braised pork belly', 24.90); INSERT INTO menu (category, position, name, price) VALUES ('main courses', 2, 'boknafisk', 19);

CREATE TABLE info (category text PRIMARY KEY, calories map<text, int>, vegan set, ranking list); INSERT INTO info (category, calories, vegan, ranking) VALUES ('soups', {'sour rye soup': 500, 'sorrel soup': 290}, {'sorrel soup'}, ['sour rye soup', 'sorrel soup']);

SELECT JSON

Selecting data in JSON format can be performed with SELECT JSON statement. It’s syntax is almost identical to regular CQL SELECT.

In order to extract all data and see what the restaurant serves, try:

SELECT JSON * from menu;

Named columns can also be specified to narrow down the results. So, if we’re only interested in names and prices:

SELECT JSON name, price from menu;

As in regular CQL SELECT, it’s of course possible to restrict the query. Extracting soup info from the database can be achieved like this:

SELECT JSON name, price from menu WHERE category='soups';

Since data underneath is still structured with our schema, it’s possible to apply filtering too. So, if our meal is reimbursed anyway and we don’t want to ruin it by spending too little money:

SELECT JSON name, price from menu WHERE price > 10 ALLOW FILTERING;

Note that the results always consist of one column named [json]. This column contains the requested information in JSON format, properly typed – to string, int, float or boolean. Of course, (nested) collections are supported too!

SELECT JSON * FROM info;

INSERT JSON

Inserting JSON data is also very similar to a regular INSERT statement. Still, note that even though JSON documents can contain lots of arbitrary columns, the ones inserted into Scylla will be validated with table’s schema. Let’s add another soup to the menu:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11}';

That’s it – not complicated at all. What happens if we try to sneak some out-of-schema data to the statement?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "name": "red borscht", "price": 11, "comment": "filling and delicious"}';

Not possible – schema rules cannot be ignored. What if some columns are missing from our JSON?

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}'

SELECT * from menu;

Works fine, the omitted column just defaults to null. But, there’s more to the topic.

DEFAULT NULL/DEFAULT UNSET

By default, omitted columns are going to be treated as null values. If, instead, the user wants to omit changing the value in case it already exists, DEFAULT UNSET flag can be used. So, if our red borscht sells well and we want to boost the price in order to increase revenue:

INSERT INTO menu JSON '{"category": "soups", "position": 4, "price": 16}' DEFAULT UNSET;

We can see that our soup name was left intact, but the price changed:

SELECT * FROM menu WHERE category='soups';

fromJson()

fromJson() is a functional equivalent of INSERT JSON for a single value. The easiest way to explain its usage is with an example:

INSERT INTO menu (category, position, name, price) VALUES (fromJson('”soups”'), fromJson(‘1’), 'sour rye soup', 12);

The function works fine with collections too.

INSERT INTO info (category, calories) VALUES ('starters', fromJson('{"foie gras": 550}'));

SELECT * FROM info WHERE category = 'starters';

toJson()

toJson() is a counterpart of the fromJson() function (yes, really!) and can be used to convert single values to JSON format.

SELECT toJson(category), toJson(name) FROM menu;

SELECT category, toJson(calories), toJson(vegan), toJson(ranking) FROM info;

Types

Mapping of CQL types to JSON is well defined and usually intuitive. Full reference table of corresponding types can be found below. Note that some CQL types (e.g. decimal) will be implicitly converted to others, with possibly different precision (e.g. float) when returning JSON values.

CQL type	INSERT JSON accepted type	SELECT JSON returned type
ascii	string	string
bigint	integer, string	integer
blob	string	string
boolean	boolean, string	boolean
date	string	string
decimal	integer, string, float	float
double	integer, string, float	float
float	integer, string, float	float
inet	string	string
int	integer, string	integer
list	list, string	list
map	map, string	map
smallint	integer, string	integer
set	list, string	list
text	string	string
time	string	string
timestamp	integer, string	string
timeuuid	string	string
tinyint	integer, string	integer
tuple	list, string	list
uuid	string	string
varchar	string	string
varint	integer, string	integer

We do JSON. How about you?

JSON support in Scylla permits a variety of new novel designs and implementations. If you are currently using JSON in your own Scylla deployment or planning to use this feature in your own development, we’d love to hear from you.

The post JSON Support in Scylla appeared first on ScyllaDB.

↧

Scylla Enterprise Release 2018.1.9

January 15, 2019, 10:37 am

≫ Next: Introducing Scylla Open Source 3.0

≪ Previous: JSON Support in Scylla

The Scylla team is pleased to announce the release of Scylla Enterprise 2018.1.9, a production-ready Scylla Enterprise minor release. Scylla Enterprise 2018.1.9 is a bug fix release for the 2018.1 branch, the latest stable branch of Scylla Enterprise. Scylla Enterprise customers are encouraged to upgrade to Scylla Enterprise 2018.1.9 in coordination with the Scylla support team.

This release fixes two issues listed below, with open source references, if exists:

Scylla aborted with an “Assertion `end >= _stream_position’ failed” exception. This occurred when querying a partition with no clustering ranges (happened on counter tables with no live rows) which also didn’t have static columns. #3304
Monitoring: latency values reported by Prometheus might be wrong #3827

Introducing Scylla Open Source 3.0

January 17, 2019, 6:00 am

≫ Next: Scylla Summit 2018 Keynote: Four Years of Scylla

≪ Previous: Scylla Enterprise Release 2018.1.9

Scylla is an open source NoSQL database that offers the horizontal scale-out and fault-tolerance of Apache Cassandra, but delivers 10X the throughput and consistent, low single-digit latencies. Implemented from scratch in C++, Scylla’s close-to-the-hardware design significantly reduces the number of database nodes you require and self-optimizes to dynamic workloads and various hardware combinations.

With the release of Scylla Open Source 3.0, we’ve introduced a rich set of new features for more efficient querying, reduced storage requirements, lower repair times, and better overall database performance. Already the industry’s most performant NoSQL database, Scylla now includes production-ready features that surpass the capabilities of Apache Cassandra.

Scylla Open Source 3.0 is now available for download.

Materialized Views

Material Views automate the tedious and inefficient chores created when an application maintains several tables with the same data organized differently. Data is divided into partitions that can be found by a partition key. Sometimes the application needs to find a partition or partitions by the value of another column. Doing this efficiently without scanning all of the partitions requires indexing.

People have been using Materialized Views, also calling them denormalization, for years as a client-side implementation. In those days, the application maintained two or more views and two or more separate tables with the same data but under a different partition key. Every time the application wanted to write data, it needed to write to both tables, and reads were done directly (and efficiently) from the desired table. However, ensuring any level of consistency between the data in the two or more views required complex and slow application logic.

Scylla’s Materialized Views feature moves this complexity out of the application and into the servers. The implementation is faster (fewer round trips to the applications) and more reliable. This approach makes it much easier for applications to begin using multiple views into their data. The application just declares the additional views, Scylla creates the new view tables, and on every update to the base table the view tables are automatically updated as well. Writes are executed only on the base table directly and are automatically propagated to the view tables. Reads go directly to the view tables.

As usual, the Scylla version is compatible – in features and CQL syntax – with the Apache Cassandra version (where it is still in experimental mode).

Global Secondary Indexes

Scylla Open Source 3.0 introduces production-ready global secondary indexes that can scale to any size distributed cluster — unlike the local-indexing approach adopted by Apache Cassandra. The secondary index uses a Materialized View index under the hood in order to make the index independent from the amount of nodes in the cluster. Secondary Indexes are (mostly) transparent to the application. Queries have access to all the columns in the table and you can add and remove indexes without changing the application. Secondary Indexes can also have less storage overhead than Materialized Views because Secondary Indexes need to duplicate only the indexed column and primary key, not the queried columns like with a Materialized View. For the same reason, updates can be more efficient with Secondary Indexes because only changes to the primary key and indexed column cause an update in the index view. In the case of a Materialized View, an update to any of the columns that appear in the view requires the backing view to be updated.

As always, the decision whether to use Secondary Indexes or Materialized Views really depends on the requirements of your application. If you need maximum performance and are likely to query a specific set of columns, you should use Materialized Views. However, if the application needs to query different sets of columns, Secondary Indexes are a better choice because they can be added and removed with less storage overhead depending on application needs.

Global secondary indexes minimize the amount of data retrieved from the database, providing many benefits:

Results are paged and customizable
Filtering is supported to narrow result sets
Keys, rather than data, are denormalized Supports more general-purpose use cases than Materialized Views

Allow Filtering

Allow filtering is a way to make a more complex query, returning only a subset of matching results. Because the filtering is done on the server, this feature also reduces the amount of data transferred over the network between the cluster and the application. Such filtering may incur processing impacts to the Scylla cluster. For example, a query might require the database to filter an extremely large data set before returning a response. By default, such queries are prevented from execution, returning the following message:

Bad Request: Cannot execute this query as it might involve data filtering and thus may have unpredictable performance.

Unpermitted queries include those that restrict:

Non-partition key fields
Parts of primary keys that are not prefixes
Partition keys with something other than an equality relation (though you can combine SI with ALLOW FILTERING to support inequalities; >= or <=; see below)
Clustering keys with a range restriction and then by other conditions (see this blog)

However, in some cases (usually due to data modeling decisions), applications need to make queries that violate these basic rules. Starting with Scylla Open Source 3.0, queries can be appended with the ALLOW FILTERING keyword to bypass this restriction and utilize server-side filtering.

The benefits of filtering include:

Cassandra query compatibility
Spark-Cassandra connector query compatibility
Query flexibility against legacy data sets

New File Format

Scylla Open Source 3.0 introduces support for a more performant storage format (SSTable), which is not only compatible with Apache Cassandra 3.x but also reduces storage volume by as much as 3X. The older 2.x format used to duplicate the column name next to each cell on disk. The new format eliminates the duplication and the column names are stored once, within the schema.

The newly introduced format is identical to that used by Apache Cassandra 3.x, while remaining backward-compatible with prior Scylla SSTable formats. New deployments of Scylla Open Source 3.0 will automatically use the new format, while existing files remain unchanged.

This new storage format delivers important benefits, including:

Can read existing Apache Cassandra 3.x files when migrating
Faster than previous versions
Reduced storage footprint of up to 66%, depending on the data model used Range delete support

Hinted Handoff

Hinted handoffs are designed to help when any individual node is temporarily unresponsive due to heavy write load, network weather, hardware failure, or any other factor. Hinted handoffs also help in the event of short-term network issues or node restarts, reducing the time for scheduled repairs, and resulting in higher overall performance for distributed deployments. Originally introduced as an experimental feature in Scylla Open Source 2.1, hinted handoffs are another production-ready feature in Scylla Open Source 3.0.

Technically, a ‘hint’ is a record of a write request held by the coordinator until an unresponsive replica node comes back online. When a write is deemed successful but one or more replica nodes fail to acknowledge it, Scylla will write a hint that is replayed to those nodes when they recover. Once the node becomes available again, the write request data in the hint is written to the replica node.

Hinted handoffs deliver the following benefits:

Minimizes the difference between data in the nodes when nodes are down — whether for scheduled upgrades or for all-too-common intermittent network issues.
Reduces the amount of data transferred during repair.
Reduces the chances of checksum mismatch (during read-repair) and thus improves overall latency.

Full, Multi-partition Scan Improvements

Scylla Open Source 3.0 builds on earlier improvements by extending stateful paging to support range scans as well. As opposed to other partition queries, which read a single partition or a list of distinct partitions, range scans read all of the partitions that fall into the range specified by the client. Since the precise number and identity of partitions in a given range cannot be determined in advance, the query must read data from all nodes containing data for the range.

To improve range scan paging, Scylla Open Source 3.0 introduces a new control algorithm for reading all data belonging to a range from all shards, which caches the intermediate streams on each of the shards and directs paged queries to the matching, previously used, cached results. The new algorithm is essentially a multiplexer that combines the output of readers opened on affected shards into a single stream. The readers are created on-demand when the partition scan attempts to read from the shard. To ensure that the read won’t stall, the algorithm uses buffering and read-ahead.

Benefits include:

Improved system responsiveness
Throughput of range scans improved by as much as 30%
Amount of data read from the disk reduced by as much as 40%
Disk operations lowered by as much as 75%

Streaming Improvements

Streaming is used during node recovery to populate restored nodes with data replicated from running nodes. The Scylla streaming model reads data on one node, transmits it to another node, and then writes to disk. The sender creates SSTable readers to read the rows from SSTables on disk and sends them over the network. The receiver receives the rows from the network and writes them to a memtable. The rows in memtable are flushed into SSTables periodically or when the memtable is full.

In Scylla Open Source 3.0, stream synchronization between nodes bypasses memtables, significantly reducing the time to repair, add and remove nodes. These improvements result in higher performance when there is a change in the cluster topology, improving streaming bandwidth by as much as 240% and reducing the time it takes to perform a “rebuild” operation by 70%.

Scylla’s new streaming improvements provide the following benefits:

Lower memory consumption. The saved memory can be used to handle your CQL workload instead.
Better CPU utilization. No CPU cycles are used to insert and sort memtables.
Bigger SSTables and fewer compactions.

Release Notes

You can read more details about Scylla Open Source 3.0 in the Release Notes.

The post Introducing Scylla Open Source 3.0 appeared first on ScyllaDB.

↧