[01:53:38] akosiaris, seems i don't have admin cassandra access, and maps cluster is running out of space
[02:16:18] Blocked-on-Operations, operations, Access Policy, Discovery, Maps: Urgent: Maps disk overfill, please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905117 (Yurik) NEW
[02:16:38] any admins around?
[02:16:49] maps is running out of the disk space
[02:17:06] Blocked-on-Operations, operations, Discovery, Maps: Urgent: Maps disk overfill, please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905125 (Krenair)
[02:21:25] Blocked-on-Operations, operations, Discovery, Maps: Urgent: Maps disk overfill, please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905128 (Yurik)
[02:22:23] robh, around?
[02:22:46] * yurik is paging admins...
[02:23:31] !log mwdeploy@tin sync-l10n completed (1.27.0-wmf.9) (duration: 09m 35s)
[02:23:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:29:04] _joe_, here?
[02:30:29] !log l10nupdate@tin ResourceLoader cache refresh completed at Sun Dec 27 02:30:29 UTC 2015 (duration 6m 59s)
[02:30:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:42:38] !log run drop keyspace v1; on csql on maps-test1001 for yurik
[02:42:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[02:45:14] !log run drop keyspace v3; on csql on maps-test1001 for yurik
[02:45:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:05:14] !log run nodetool clearsnapshot -- v3 and nodetool clearsnapshot -- v1 on maps-test2001
[03:05:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log, Master
[03:08:50] Blocked-on-Operations, operations, Discovery, Maps: Urgent: Maps disk overfill, please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905143 (yuvipanda) p:Unbreak!>Triage I worked with Yurik and nodetool clearsnapshot on v1 and v3 keyspaces got us enou...
[03:09:59] Blocked-on-Operations, operations, Discovery, Maps: Urgent: Maps disk overfill, please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905146 (yuvipanda) I also found a keyspace named `yuvi` on the hosts. wut.
[03:35:00] Blocked-on-Operations, operations, Discovery, Maps: Maps disk overfill, please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905156 (yuvipanda)
[03:35:13] PROBLEM - puppet last run on mw1257 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:36:42] PROBLEM - puppet last run on eeden is CRITICAL: CRITICAL: Puppet has 2 failures
[03:36:43] PROBLEM - puppet last run on mw2038 is CRITICAL: CRITICAL: Puppet has 2 failures
[03:38:17] Blocked-on-Operations, operations, Discovery, Maps: Urgent: Please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905157 (Yurik)
[03:38:54] Blocked-on-Operations, operations, Discovery, Maps: Please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905117 (Yurik)
[03:50:52] PROBLEM - puppet last run on db1021 is CRITICAL: CRITICAL: puppet fail
[03:53:32] PROBLEM - puppet last run on mw2090 is CRITICAL: CRITICAL: Puppet has 1 failures
[04:01:58] RECOVERY - puppet last run on eeden is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:03:09] RECOVERY - puppet last run on mw1257 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:04:49] RECOVERY - puppet last run on mw2038 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[04:16:59] RECOVERY - puppet last run on db1021 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[04:19:30] RECOVERY - puppet last run on mw2090 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:06:40] (CR) Glaisher: "I wonder if anything's changed since https://gerrit.wikimedia.org/r/#/c/243921/ ?" [mediawiki-config] - https://gerrit.wikimedia.org/r/228618 (https://phabricator.wikimedia.org/T90612) (owner: Legoktm)
[06:30:34] PROBLEM - puppet last run on ms-fe1004 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:02] PROBLEM - puppet last run on cp3008 is CRITICAL: CRITICAL: puppet fail
[06:31:24] PROBLEM - puppet last run on mw2208 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:34] PROBLEM - puppet last run on mw2036 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:31:42] PROBLEM - puppet last run on mw2158 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:02] PROBLEM - puppet last run on analytics1047 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:12] PROBLEM - puppet last run on mw2021 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:32:23] PROBLEM - puppet last run on holmium is CRITICAL: CRITICAL: Puppet has 2 failures
[06:32:23] PROBLEM - puppet last run on cp3048 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:33:22] PROBLEM - puppet last run on mw1119 is CRITICAL: CRITICAL: Puppet has 2 failures
[06:34:31] PROBLEM - puppet last run on mw2050 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:41] PROBLEM - puppet last run on mw2073 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:35:41] PROBLEM - puppet last run on mw2207 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:37:01] PROBLEM - puppet last run on mw1061 is CRITICAL: CRITICAL: Puppet has 1 failures
[06:52:58] (PS1) Ori.livneh: Restart HHVM on the jobrunners daily, as temp. workaround for T122069 [puppet] - https://gerrit.wikimedia.org/r/261104
[06:56:12] RECOVERY - puppet last run on mw2036 is OK: OK: Puppet is currently enabled, last run 42 seconds ago with 0 failures
[06:56:21] RECOVERY - puppet last run on analytics1047 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures
[06:56:22] RECOVERY - puppet last run on ms-fe1004 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:31] RECOVERY - puppet last run on mw2208 is OK: OK: Puppet is currently enabled, last run 47 seconds ago with 0 failures
[06:56:32] RECOVERY - puppet last run on cp3008 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw1119 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures
[06:57:12] RECOVERY - puppet last run on mw1061 is OK: OK: Puppet is currently enabled, last run 38 seconds ago with 0 failures
[06:57:21] RECOVERY - puppet last run on holmium is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:22] RECOVERY - puppet last run on cp3048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:26] operations, Performance-Team, Patch-For-Review: jobrunner memory leaks - https://phabricator.wikimedia.org/T122069#1905241 (ori) >>! In T122069#1904624, @faidon wrote: > I've restarted HHVM on jobrunners thrice now, to avoid further OOMs (one cut it real close too). I'd like to revert the two commits th...
[06:57:32] RECOVERY - puppet last run on mw2158 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:52] RECOVERY - puppet last run on mw2073 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[06:57:52] RECOVERY - puppet last run on mw2207 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:58:41] RECOVERY - puppet last run on mw2050 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[07:00:33] (CR) Ori.livneh: [C: 2] Restart HHVM on the jobrunners daily, as temp. workaround for T122069 [puppet] - https://gerrit.wikimedia.org/r/261104 (owner: Ori.livneh)
[07:57:47] PROBLEM - puppet last run on mw1045 is CRITICAL: CRITICAL: Puppet has 1 failures
[08:23:20] RECOVERY - puppet last run on mw1045 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures
[08:24:30] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0]
[08:26:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0]
[08:36:56] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[08:37:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[09:41:31] PROBLEM - puppet last run on db2043 is CRITICAL: CRITICAL: puppet fail
[10:09:56] RECOVERY - puppet last run on db2043 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[10:18:32] Hallo.
[10:19:19] Is the centralauth database supposed to be accessible on terbium?
[10:34:53] aharoni: yes, it is
[10:35:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 701
[10:36:01] what's that? fundraising?
[10:40:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1000
[10:45:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1301
[10:50:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1485
[10:53:37] Blocked-on-Operations, operations, Discovery, Maps: Please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905306 (akosiaris) I am pretty sure @yurik has the admin password to cassandra. Definite about it. Otherwise the creation of those keyspaces (v2,v3,v4,v...
[10:55:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1348
[11:00:15] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1649
[11:05:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1529
[11:10:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1366
[11:15:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1667
[11:15:26] hoo: I have a global user id number (gu_id in the CentralAuth database), and I the simplest way to find the username
[11:16:41] I ^need^ the simplest way
[11:18:03] hm... if you have database access (as you seem to have) that is probably the simplest way to go
[11:18:24] meta=globaluserinfo api doesn't even support fetching by user id
[11:18:32] so guess you need to go over the DB in some way
[11:20:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1553
[11:22:32] hoo: I have access to enwiki, dewiki, etc., and also to wikishared, but I wonder where is the centralauth database with the globaluser table;
[11:22:43] I don't see it if I do show databases
[11:22:57] s7
[11:22:57] just use sql centralauth
[11:23:32] what a funny thing
[11:23:49] I just tried it a second before you wrote it, and it worked
[11:24:17] I wonder why don't I see it in show tables, though
[11:25:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1429
[11:29:51] operations, Analytics, MediaWiki-extensions-ContentTranslation: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. - https://phabricator.wikimedia.org/T122474#1905337 (Amire80) NEW
[11:30:10] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 1729
[11:33:47] operations, Analytics: the centralauth databases is accessible form the mysql shell on terbium only in some cases - https://phabricator.wikimedia.org/T122475#1905344 (Amire80) NEW
[11:35:17] PROBLEM - check_mysql on db1008 is CRITICAL: SLOW_SLAVE CRITICAL: Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 664
[11:40:07] RECOVERY - check_mysql on db1008 is OK: Uptime: 500601 Threads: 42 Questions: 20613185 Slow queries: 6091 Opens: 43350 Flush tables: 2 Open tables: 410 Queries per second avg: 41.176 Slave IO: Yes Slave SQL: Yes Seconds Behind Master: 8
[12:11:03] operations, Analytics, MediaWiki-extensions-ContentTranslation: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. - https://phabricator.wikimedia.org/T122474#1905369 (KartikMistry) Try and see if --wikidb=wikishared works?
[12:17:53] operations, Analytics, MediaWiki-extensions-ContentTranslation: Make the command `sql wikishared` work on terbium like `sql enwiki`, `sql centralauth`, etc. - https://phabricator.wikimedia.org/T122474#1905372 (Amire80) >>! In T122474#1905369, @KartikMistry wrote: > Try and see if --wikidb=wikishared w...
[12:21:13] Blocked-on-Operations, operations, Discovery, Maps: Please grant admin Cassandra access to maps-admins - https://phabricator.wikimedia.org/T122465#1905373 (Yurik) @akosiaris, I create new keyspace by using tileratorui account. But that same account does not have drop perms. Please paste the creds t...
[12:55:47] PROBLEM - High load average on labstore1001 is CRITICAL: CRITICAL: 77.78% of data above the critical threshold [24.0]
[13:02:30] PROBLEM - puppet last run on cp3036 is CRITICAL: CRITICAL: puppet fail
[13:02:34] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1905388 (Amire80) NEW
[13:04:49] RECOVERY - High load average on labstore1001 is OK: OK: Less than 50.00% above the threshold [16.0]
[13:13:08] PROBLEM - puppet last run on mw2151 is CRITICAL: CRITICAL: puppet fail
[13:28:39] RECOVERY - puppet last run on cp3036 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[13:40:31] RECOVERY - puppet last run on mw2151 is OK: OK: Puppet is currently enabled, last run 11 seconds ago with 0 failures
[13:52:08] (PS1) Alexandros Kosiaris: elasticsearch: move dependencies at the correct level [puppet] - https://gerrit.wikimedia.org/r/261119
[13:53:51] (CR) Alexandros Kosiaris: [C: 2] elasticsearch: move dependencies at the correct level [puppet] - https://gerrit.wikimedia.org/r/261119 (owner: Alexandros Kosiaris)
[14:12:53] PROBLEM - puppet last run on cp3004 is CRITICAL: CRITICAL: puppet fail
[14:38:12] RECOVERY - puppet last run on cp3004 is OK: OK: Puppet is currently enabled, last run 28 seconds ago with 0 failures
[14:57:12] PROBLEM - puppet last run on db2067 is CRITICAL: CRITICAL: puppet fail
[15:02:03] (PS8) Alexandros Kosiaris: Add shinken module/roles [puppet] - https://gerrit.wikimedia.org/r/259008
[15:12:01] (PS9) Alexandros Kosiaris: Add shinken module/roles [puppet] - https://gerrit.wikimedia.org/r/259008
[15:12:20] (CR) jenkins-bot: [V: -1] Add shinken module/roles [puppet] - https://gerrit.wikimedia.org/r/259008 (owner: Alexandros Kosiaris)
[15:22:08] RECOVERY - puppet last run on db2067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[15:30:11] (PS10) Alexandros Kosiaris: Add shinken module/roles [puppet] - https://gerrit.wikimedia.org/r/259008
[15:30:31] (CR) jenkins-bot: [V: -1] Add shinken module/roles [puppet] - https://gerrit.wikimedia.org/r/259008 (owner: Alexandros Kosiaris)
[15:32:42] (PS11) Alexandros Kosiaris: Add shinken module/roles [puppet] - https://gerrit.wikimedia.org/r/259008
[16:42:38] PROBLEM - salt-minion processes on cygnus is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/salt-minion
[17:05:47] RECOVERY - salt-minion processes on cygnus is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion
[20:59:24] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1905531 (yuvipanda) This should probably run on one of the stat* boxes rather than terbium.
[21:00:40] operations, Analytics: the centralauth databases is accessible form the mysql shell on terbium only in some cases - https://phabricator.wikimedia.org/T122475#1905534 (yuvipanda) This works fine if you are using the analytics slaves from stat*, rather than the production masters/slaves. I highly doubt we w...
[21:18:05] operations, Analytics, ContentTranslation-Analytics, MediaWiki-extensions-ContentTranslation: schedule a daily run of ContentTranslation analytics scripts on terbium - https://phabricator.wikimedia.org/T122479#1905536 (Amire80) Whatever works, as long as I get the data :)
[21:19:07] operations, Analytics: the centralauth databases is accessible form the mysql shell on terbium only in some cases - https://phabricator.wikimedia.org/T122475#1905538 (Amire80) As with T122479, whatever works. Can I access them?
[23:10:06] PROBLEM - puppet last run on mw1242 is CRITICAL: CRITICAL: Puppet has 1 failures
[23:35:52] RECOVERY - puppet last run on mw1242 is OK: OK: Puppet is currently enabled, last run 25 seconds ago with 0 failures