mysql - How to optimize data aggregation in day of week, time of day for a period of one month back -
here situation:
i have saas application simple rss feed reader. think people know - users subscribing rss feeds , reading items them. nothing new. 1 feed can have many subscribers.
i've implemented statistics users, don't think i've chosen right approach, because things getting slower hour number of users , feeds grows.
here's i'm doing now:
at every hour total number of articles each feed:
select count(*) articles feed_id=?
get previous value calculate delta (this getting little slow):
select value feeds_stats feed_id=? , name='total_articles' order date desc limit 1
insert new value , delta:
insert feeds_stats (date,feed_id,name,value,delta) values ('".date("y-m-d h:i:s",$global_timestamp)."','".$feed_id','total_articles','".$value."','".($value-$old_value)."')
for every user feeds , each feed number of articles has read:
select count(*) users_articles ua join articles on a.id=ua.article_id a.feed_id='%s' , ua.user_id='%s' , ua.read=1
users_articles table holds read state of each article per user
then again delta:
select value users_feeds_stats user_id='?' , feed_id='?' , name='total_reads' order date desc limit 1
and insert new value + delta:
insert users_feeds_stats (date,user_id,feed_id,name,value,delta) values ('".date("y-m-d h:i:s",$global_timestamp)."','".$user_id."','".$feed_id."','total_reads','".$value."','".($value-$old_value)."')
when feeds user has been processed comes aggregation part:
this bit tricky , think there should lot of room optimization here. here actual aggregation function in php:
<?php function aggregate_user_stats($user_id=false,$feed_id=false){ global $global_timestamp; // defined dimensions $feed_types[0] = array("days_back" => 31, "group_by" => "date_format(date, '%y-%m-%d')"); $feed_types[1] = array("days_back" => 31, "group_by" => "weekday(date)+1"); $feed_types[2] = array("days_back" => 31, "group_by" => "hour(date)"); if($user_id){ $where = " id=".$user_id; } $feed_where = ""; $getusers = mysql_query("select id users".$where)or die(__line__." ".mysql_error()); while($user = mysql_fetch_assoc($getusers)){ if($feed_id){ $feed_where = " , feed_id=".$feed_id; } $user_feeds = array(); $getfeeds = mysql_query("select feed_id subscriptions user_id='".$user["id"]."' , active=1".$feed_where)or die(__line__." ".mysql_error()); while($row = mysql_fetch_assoc($getfeeds)){ foreach($feed_types $tab => $type){ $getdata = mysql_query(" select ".$type["group_by"]." date, name, sum(delta) delta feeds_stats feed_id = '".$row["feed_id"]."' , name='total_articles' , date > date_sub(now(), interval ".$type["days_back"]." day) group name, ".$type["group_by"]." union select ".$type["group_by"]." date, name, sum(delta) delta users_feeds_stats user_id = '".$user["id"]."' , feed_id = '".$row["feed_id"]."' , name='total_reads' , date > date_sub(now(), interval ".$type["days_back"]." day) group name, ".$type["group_by"]." ")or die(__line__." ".mysql_error()); $data = array(); while($row = mysql_fetch_assoc($getdata)){ $data[$row["date"]][$row["name"]] = $row["delta"]; } if(count($data)){ db_start_trx(); mysql_query("delete stats_feeds_over_time feed_id='".$row["feed_id"]."' , user_id='".$user["id"]."' , tab='".$tab."'")or die(__line__." ".mysql_error()); foreach($data $time => $keys){ mysql_query("replace stats_feeds_over_time (feed_id,user_id,tab,date,total_articles,total_reads,total_favs) values ('".$row["feed_id"]."','".$user["id"]."','".$tab."','".$time."','".$keys["total_articles"]."','".$keys["total_reads"]."','".$keys["total_favs"]."')")or die(__line__." ".mysql_error()); } db_commit_trx(); } } } } }
some notes:
edit: here ddl's of involved tables:
create table `articles` ( `id` integer(11) unsigned not null auto_increment, `feed_id` integer(11) unsigned not null, `date` integer(10) unsigned not null, `date_updated` integer(11) unsigned not null, `title` varchar(1000) collate utf8_general_ci not null default '', `url` varchar(2000) collate utf8_general_ci not null default '', `author` varchar(200) collate utf8_general_ci not null default '', `hash` char(32) collate utf8_general_ci not null default '', primary key (`id`), unique key `feed_id_hash` (`feed_id`, `hash`), key `date` (`date`), key `url` (`url`(255)) )engine=innodb auto_increment=0 character set 'utf8' collate 'utf8_general_ci' comment=''; create table `users_articles` ( `id` bigint(20) not null auto_increment, `user_id` integer(11) unsigned not null, `article_id` integer(11) unsigned not null, `subscription_id` integer(11) unsigned not null, `read` tinyint(4) unsigned not null default '0', primary key (`id`), unique key `user_id` (`user_id`, `article_id`), key `article_id` (`article_id`), key `subscription_id` (`subscription_id`) )engine=innodb checksum=1 auto_increment=0 character set 'utf8' collate 'utf8_general_ci' comment=''; create table `feeds_stats` ( `id` integer(11) unsigned not null auto_increment, `feed_id` integer(11) unsigned not null, `date` datetime not null, `name` varchar(50) collate utf8_general_ci not null default '', `value` integer(11) not null, `delta` integer(11) not null, primary key (`id`), key `name` (`name`), key `feed_id` (`feed_id`), key `date` (`date`) )engine=innodb auto_increment=0 character set 'utf8' collate 'utf8_general_ci' comment=''; create table `users_feeds_stats` ( `id` integer(11) unsigned not null auto_increment, `user_id` integer(11) unsigned not null default '0', `feed_id` integer(11) unsigned not null, `date` datetime not null, `name` varchar(50) collate utf8_general_ci not null default '', `value` integer(11) not null, `delta` integer(11) not null, primary key (`id`), key `name` (`name`), key `feed_id` (`feed_id`), key `user_id` (`user_id`), key `date` (`date`) )engine=innodb auto_increment=0 character set 'utf8' collate 'utf8_general_ci' comment=''; create table `stats_feeds_over_time` ( `feed_id` integer(11) unsigned not null, `user_id` integer(11) not null, `tab` integer(11) not null, `date` varchar(30) collate utf8_general_ci not null default '', `total_articles` double(9,2) unsigned not null, `total_reads` double(9,2) unsigned not null, `total_favs` double(9,2) unsigned not null, primary key (`feed_id`, `user_id`, `tab`, `date`) )engine=innodb auto_increment=0 character set 'utf8' collate 'utf8_general_ci' comment='';
in end of aggregation function there replace in table stats_feeds_over_time. table holds records, displayed on graph, actual graphing process not involve heavy queries.
finally here graphs produced this:
i glad if point me in right direction on , how optimize solution, if means ditch mysql statistics.
i have long experience rrdtool, here situation different, because of "time of day", "day of week" aggregations.
i don't know how important queries wish optimize regard other queries might run on same set of tables. assume wish first have these queries optimized.
seeing queries made feed_id
where
predicates, try partition articles
table on column:
create table `articles` ( `id` integer(11) unsigned not null auto_increment, `feed_id` integer(11) unsigned not null, -- etc. )engine=innodb auto_increment=0 character set 'utf8' collate 'utf8_general_ci' comment='' partition key(feed_id) partitions 10;
the number of partitions (10
above) can tuned according needs, yet must above 1 have impact. might want use larger number make select queries faster. query non dependent on feed_id
slowered device.
the same process can applied other tables columns used discriminant in queries.
however, first 2 queries executed feeds, rewrite them follow:
select feed_id, count(feed_id) articles group feed_id select feed_id, value feeds_stats name='total_articles' group feed_id order date desc
both these retrieve results feeds, frees having run queries each individual feed. using these queries makes partitioning counter productive, have choose between two.
the point of partitioning: query discriminating against 1 particular value of feed_id
(or other column used partition) see significant boost. bad point regular queries slowed down.
the point of second solution not have impact on other queries.
Comments
Post a Comment