


Our use case with Cassandra is to show the 10 most recent visitors of a blogpost. The following is the Cassandra table definition:

CREATE TABLE blogs_by_visitor (
             blogposturl text,
             visitor text,
             visited_ts timestamp,
             PRIMARY KEY (blogposturl, visitor)
);


Now, in order to show the 10 most recent visitors for a given blogpost, there needs to be an explicit "order by" clause on timestamp desc. Since visited_ts isn't a clustering column in Cassandra, we aren't able to do this. The reason visited_ts is not a clustering column is to avoid recording repeat (read: duplicate) visitors. The primary key is designed to upsert the latest timestamp for a repeat visitor.
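To make the upsert behaviour concrete, here is a minimal in-memory sketch. It is only an analogy: the dict stands in for the table, and `record_visit` is a hypothetical helper, not a driver call. Real Cassandra resolves conflicting writes by write timestamp rather than call order.

```python
from datetime import datetime
from typing import Dict, Tuple

# In-memory stand-in for blogs_by_visitor:
# primary key (blogposturl, visitor) -> visited_ts
table: Dict[Tuple[str, str], datetime] = {}

def record_visit(blogposturl: str, visitor: str, visited_ts: datetime) -> None:
    """Hypothetical helper: writing an existing key overwrites the old row."""
    table[(blogposturl, visitor)] = visited_ts

record_visit("/post-1", "alice", datetime(2020, 1, 1, 9, 0))
record_visit("/post-1", "bob", datetime(2020, 1, 1, 9, 5))
record_visit("/post-1", "alice", datetime(2020, 1, 2, 8, 0))  # repeat visit
```

After the three calls there are only two rows, and alice's row carries her latest timestamp. If visited_ts were part of the key, the repeat visit would create a third row instead.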


In the RDBMS world, the query would look like the following, and a secondary index could be created on the blogposturl and timestamp columns.

Select visitor from blog_table
where blogposturl = ?
and rownum <= 10
order by timestamp desc


The alternative currently followed in our Cassandra application is to fetch the results and then sort them by timestamp on the app side. But what if a particular blogpost becomes very popular and has more than 100,000 visitors? The query becomes really slow for those blogposts.
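For reference, the app-side approach can be sketched roughly as below. The row shape and the function name `top_recent_visitors` are assumptions made for illustration; the rows would come from whatever driver call reads the partition.

```python
import heapq
from datetime import datetime
from typing import Iterable, List, Tuple

# Assumed row shape: (visitor, visited_ts), as read from blogs_by_visitor
Row = Tuple[str, datetime]

def top_recent_visitors(rows: Iterable[Row], n: int = 10) -> List[str]:
    """Return the n most recently seen visitors, newest first."""
    # nlargest keeps only n candidates in memory instead of sorting everything
    newest = heapq.nlargest(n, rows, key=lambda row: row[1])
    return [visitor for visitor, _ in newest]
```

Even with a heap instead of a full sort, every row of the partition still has to be read from Cassandra, which is where the time goes for a very popular blogpost.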


I think a secondary index wouldn't be useful here, since I don't need to filter on the timestamp; I only need it for sorting, which a secondary index can't do.


Any idea on how we could model the table differently?




These types of jobs are usually done with Apache Spark or Hadoop: a scheduled job computes the unique visitors ordered by timestamp for each URL and stores the result in Cassandra.


Or you can create a materialized view on top of blogs_by_visitor. The base table ensures each visitor is unique, and the materialized view orders the results by the visited_ts timestamp.


    CREATE MATERIALIZED VIEW unique_visitor AS
    SELECT *
    FROM blogs_by_visitor
    WHERE blogposturl IS NOT NULL AND visitor IS NOT NULL AND visited_ts IS NOT NULL
    PRIMARY KEY (blogposturl, visited_ts, visitor)
    WITH CLUSTERING ORDER BY (visited_ts DESC, visitor ASC);


Now you can select the 10 most recent unique visitors of a blogpost.

SELECT * FROM unique_visitor WHERE blogposturl = ? LIMIT 10;


Note that I haven't specified a sort order in the select query, because the materialized view schema already specifies the default clustering order visited_ts DESC.



Or you could change your table schema as below:

CREATE TABLE blogs_by_visitor (
     blogposturl text,
     year int,
     month int,
     day int,
     visitor text,
     visited_ts timestamp,
     PRIMARY KEY ((blogposturl, year, month, day), visitor)
);


Now each partition holds only a small amount of data, so you can sort the visitors by visited_ts within that single partition on the client side. If you think the number of visitors in a day could be huge, add hour to the partition key as well.
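One caveat with day buckets: the same visitor can appear in several day partitions, and the newest day may hold fewer than 10 visitors. A rough client-side sketch of walking back day by day could look like this. `fetch_day` is a hypothetical callback standing in for a query on one (blogposturl, year, month, day) partition, and `max_days` is an arbitrary cut-off chosen for illustration.

```python
from datetime import date, datetime, timedelta
from typing import Callable, Iterable, List, Set, Tuple

Row = Tuple[str, datetime]  # assumed shape: (visitor, visited_ts)

def recent_unique_visitors(
    fetch_day: Callable[[date], Iterable[Row]],  # hypothetical per-day query
    start: date,
    n: int = 10,
    max_days: int = 30,  # assumed limit on how far back to look
) -> List[str]:
    """Collect up to n distinct visitors, newest first, across day partitions."""
    result: List[str] = []
    seen: Set[str] = set()
    for offset in range(max_days):
        day = start - timedelta(days=offset)
        # Sort one small partition on the client side, newest visit first
        rows = sorted(fetch_day(day), key=lambda row: row[1], reverse=True)
        for visitor, _ts in rows:
            if visitor not in seen:  # skip visitors already seen on a newer day
                seen.add(visitor)
                result.append(visitor)
                if len(result) == n:
                    return result
    return result
```

Each call to `fetch_day` touches a single partition, so the cost depends on how many days have to be scanned rather than on the total number of visitors.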

