[00:00:04] twentyafterfour: Respected human, time to deploy Phabricator update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T0000). Please do the needful. [00:00:10] AaronSchulz: I have struck https://gerrit.wikimedia.org/r/#/c/355174/ from the SWAT as a no-show, if you would still like it to be deployed, please move it to a different window and be present during it (or arrange for someone else to be) [00:00:57] RoanKattouw: I got stuck dealing with really annoying phpunit test [00:01:19] OK, I can unstrike and do it now if you have time to test once it's on mwdebug [00:01:36] ok [00:02:09] (03PS2) 10Dzahn: dnsrecursor: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353119 [00:02:11] (03PS2) 10Catrope: Switch Swift URLs to HTTPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616) (owner: 10Aaron Schulz) [00:02:16] (03CR) 10Catrope: [C: 032] Switch Swift URLs to HTTPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616) (owner: 10Aaron Schulz) [00:03:23] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/6518/" [puppet] - 10https://gerrit.wikimedia.org/r/353119 (owner: 10Dzahn) [00:03:55] (03Merged) 10jenkins-bot: Switch Swift URLs to HTTPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616) (owner: 10Aaron Schulz) [00:05:44] AaronSchulz: OK, live on mwdebug1002, please test [00:05:46] (03CR) 10jenkins-bot: Switch Swift URLs to HTTPs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616) (owner: 10Aaron Schulz) [00:06:13] !log upgrading phabricator, expect momentary downtime [00:06:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:06:25] RoanKattouw: just scap on mw1001? [00:07:23] well scap pull, ok [00:07:32] I ran that on mwdebug1002 yes [00:07:51] Do you want me to pull it somewhere else? 
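
For context on the exchange above: after a mediawiki-config change is merged, the deployer stages it on a debug host with scap pull and the patch author verifies it there before it is synced everywhere. A minimal sketch of both halves, assuming the 2017-era hostnames and the X-Wikimedia-Debug header syntax documented on Wikitech:

    # Deployer: fetch the just-merged code/config onto the debug host only.
    ssh mwdebug1002.eqiad.wmnet 'scap pull'

    # Tester: pin a request to that host so the change is exercised before the
    # full sync; any page on the affected wiki works as a smoke test.
    curl -sI -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
        'https://commons.wikimedia.org/wiki/Special:Version' | head -n 1
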
[00:09:45] * AaronSchulz browses commons [00:09:51] so I guess its on 1001 and 1002 [00:10:01] with x-wmf-debug of course [00:10:01] (03PS2) 10Dzahn: authdns::server: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353123 [00:10:34] !log phabricator upgrade complete, service is online [00:10:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:13:46] (03PS3) 10Dzahn: authdns::server: move 'include standard' to role [puppet] - 10https://gerrit.wikimedia.org/r/353123 [00:13:56] logs and browsing around seem fine [00:14:02] * AaronSchulz advances [00:15:23] !log aaron@tin Synchronized wmf-config/filebackend.php: Enable HTTPs for Swift usage (duration: 00m 41s) [00:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:18:04] (03CR) 10Dzahn: [C: 032] "no-op http://puppet-compiler.wmflabs.org/6520/" [puppet] - 10https://gerrit.wikimedia.org/r/353123 (owner: 10Dzahn) [00:18:20] !log aaron@tin Synchronized wmf-config/ProductionServices.php: Enable HTTPs for Swift usage (duration: 00m 41s) [00:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [01:00:46] (03PS1) 10Dzahn: phab: comment out include of exim4::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/355572 [01:05:52] (03PS2) 10Dzahn: phab: comment out include of exim4::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/355572 [01:06:21] (03PS3) 10Dzahn: phab: comment out include of exim4::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/355572 [01:06:27] (03CR) 10Dzahn: [C: 032] phab: comment out include of exim4::ganglia [puppet] - 10https://gerrit.wikimedia.org/r/355572 (owner: 10Dzahn) [01:59:02] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [01:59:04] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [01:59:12] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [01:59:22] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [01:59:22] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [02:00:52] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [02:01:02] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [02:01:02] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [02:01:12] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [02:01:12] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [02:26:57] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 09m 46s) [02:27:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:45:15] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.2) (duration: 07m 18s) [02:45:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:51:53] !log l10nupdate@tin ResourceLoader cache refresh completed at Thu May 25 02:51:53 UTC 2017 (duration 6m 38s) [02:52:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:21:12] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native 
scraper) timed out before a response was received [03:21:12] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:21:32] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:21:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:22:02] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:23:23] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [03:23:52] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:23:53] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:24:02] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:24:02] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:24:02] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:24:02] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:24:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [03:24:12] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [03:24:22] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [03:24:32] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:24:42] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [03:24:42] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [03:24:52] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [03:24:52] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [03:24:53] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [03:24:53] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [03:24:53] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [03:25:22] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [03:25:28] "The wiki is currently in read-only mode." - known issue? [03:26:05] Seems okay now. 
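
The flapping citoid alerts above come from automated endpoint probes driven by the service's spec. A rough manual equivalent of the failing check, assuming citoid's internal service port is 1970 (its usual value in this era) and using example.com as a stand-in target URL:

    # Hit the same /api endpoint the "open graph via native scraper" check uses;
    # a timeout reproduces the CRITICAL, a citation blob means healthy.
    curl -sm 10 -o /dev/null -w 'HTTP %{http_code} in %{time_total}s\n' \
        'http://citoid.svc.eqiad.wmnet:1970/api?format=mediawiki&search=https%3A%2F%2Fexample.com'
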
[03:32:02] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:32:02] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:32:02] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:32:33] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:32:52] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:32:52] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:33:04] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [03:34:22] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy [03:34:42] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy [03:34:42] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [03:34:52] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy [03:34:52] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [03:34:52] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [03:34:52] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy [04:34:12] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:34:22] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:34:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:34:32] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:34:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:36:12] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [04:36:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [04:36:32] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [04:36:32] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [04:37:02] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [04:52:32] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/1/0: down - Core: cr2-eqiad:xe-4/2/0 (Telia, IC-314533, 24ms) {#11371} [10Gbps wave]BR [04:52:32] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 212, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-4/2/0: down - Core: cr1-eqord:xe-1/0/0 (Telia, IC-314533, 29ms) {#3658} [10Gbps wave]BR [04:53:22] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:53:32] PROBLEM - citoid 
endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:53:32] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:54:12] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:54:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [04:56:12] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [04:56:32] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [04:56:32] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [04:57:02] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [04:57:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [05:13:32] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 [05:13:32] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 214, down: 0, dormant: 0, excluded: 0, unused: 0 [05:24:22] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [05:24:22] PROBLEM - citoid endpoints health on scb1003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [05:24:32] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [05:24:32] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [05:25:13] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [05:27:12] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy [05:27:22] RECOVERY - citoid endpoints health on scb1003 is OK: All endpoints are healthy [05:27:32] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy [05:27:32] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy [05:28:02] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy [05:58:28] !log Start pt-table-checksum on s1 - T162807 [05:58:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:58:40] T162807: Run pt-table-checksum on s1 (enwiki) - https://phabricator.wikimedia.org/T162807 [06:05:02] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [06:05:02] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received [06:05:03] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received [06:05:52] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy [06:07:02] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy [06:07:02] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy [06:10:34] (03PS1) 10Marostegui: db-eqiad.php: Depool db1097 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/355579 (https://phabricator.wikimedia.org/T166206) [06:15:52] PROBLEM - puppet last run on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:16:02] PROBLEM - MD RAID on ms-be1020 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [06:16:02] PROBLEM - SSH on ms-be1020 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [06:16:12] PROBLEM - Router interfaces on cr1-codfw is CRITICAL: CRITICAL: host 208.80.153.192, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/0/2: down - Core: cr2-ulsfo:xe-1/3/0 (Zayo, OGYX/124337//ZYO, 38.8ms) {#11541} [10Gbps wave]BR [06:16:32] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [06:16:42] RECOVERY - puppet last run on ms-be1020 is OK: OK: Puppet is currently enabled, last run 23 minutes ago with 0 failures [06:16:52] RECOVERY - MD RAID on ms-be1020 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [06:16:52] RECOVERY - SSH on ms-be1020 is OK: SSH OK - OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8 (protocol 2.0) [06:18:53] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355579 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:19:54] (03Merged) 10jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355579 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:20:06] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1097 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355579 (https://phabricator.wikimedia.org/T166206) (owner: 10Marostegui) [06:21:31] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1097 - T166206 (duration: 00m 55s) [06:21:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:21:41] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [06:21:52] (03PS1) 10Marostegui: s4.hosts: Add db1097 to the list of hosts [software] - 10https://gerrit.wikimedia.org/r/355580 [06:24:54] (03PS2) 10Marostegui: s4.hosts: Add db1097 to the list of hosts [software] - 10https://gerrit.wikimedia.org/r/355580 [06:25:47] (03PS3) 10Marostegui: s4.hosts: Add db1097 to the list of hosts [software] - 10https://gerrit.wikimedia.org/r/355580 [06:26:33] (03PS4) 10Marostegui: s4.hosts: Add db1097 to the list of hosts [software] - 10https://gerrit.wikimedia.org/r/355580 [06:29:12] RECOVERY - Router interfaces on cr1-codfw is OK: OK: host 208.80.153.192, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [06:29:32] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [06:32:22] RECOVERY - HP RAID on ms-be2029 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [06:52:22] PROBLEM - HP RAID on ms-be2029 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. 
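
The db1097 depool just logged is the standard mediawiki-config dance: edit the load-balancer weights, push the change through Gerrit and CI, then sync the single file from the deploy master (tin). A sketch of the deployer's side, assuming scap's sync-file subcommand as it existed in 2017:

    # In wmf-config/db-eqiad.php, set db1097's weight to 0 (or comment it out),
    # get the change reviewed and merged, then on tin:
    scap sync-file wmf-config/db-eqiad.php 'Depool db1097 - T166206'
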
[07:05:02] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 61960 MB (12% inode=99%) [07:06:29] <_joe_> gehel: ^^ [07:06:52] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] [07:07:45] (03PS2) 10Giuseppe Lavagetto: calico: add new version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/355392 (https://phabricator.wikimedia.org/T165024) [07:14:52] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] [07:15:07] (03CR) 10Giuseppe Lavagetto: [C: 032] calico: add new version 2.2.0 [puppet] - 10https://gerrit.wikimedia.org/r/355392 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [07:16:02] PROBLEM - Disk space on elastic1023 is CRITICAL: DISK CRITICAL - free space: /srv 62253 MB (12% inode=99%) [07:28:40] !log roll-restart jessie ms-be2* for linux 4.9 update - T162029 [07:28:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:28:50] T162029: Migrate all jessie hosts to Linux 4.9 - https://phabricator.wikimedia.org/T162029 [07:32:22] RECOVERY - HP RAID on ms-be2029 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [07:32:44] (03CR) 10Marostegui: [C: 032] s4.hosts: Add db1097 to the list of hosts [software] - 10https://gerrit.wikimedia.org/r/355580 (owner: 10Marostegui) [07:34:00] (03Merged) 10jenkins-bot: s4.hosts: Add db1097 to the list of hosts [software] - 10https://gerrit.wikimedia.org/r/355580 (owner: 10Marostegui) [07:34:14] _joe_: gehel is out today, there's a reindex process running that is likely to be causing these space alerts [07:34:41] (03PS1) 10Marostegui: db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355581 (https://phabricator.wikimedia.org/T166278) [07:35:10] shards are moving around [07:36:50] (03CR) 10Marostegui: [C: 032] db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355581 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [07:38:09] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355581 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [07:38:18] (03CR) 10jenkins-bot: db-codfw.php: Depool db2057 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355581 (https://phabricator.wikimedia.org/T166278) (owner: 10Marostegui) [07:40:00] !log marostegui@tin Synchronized wmf-config/db-codfw.php: Depool db2057 - T166278 (duration: 00m 41s) [07:40:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:40:09] T166278: Unify revision table on s3 - https://phabricator.wikimedia.org/T166278 [07:40:35] <_joe_> dcausse: ok good to know [07:40:40] <_joe_> :) [07:51:33] (03CR) 10Elukey: [C: 031] Update kafka.sh wrapper script for Kafka 0.10+ [puppet] - 10https://gerrit.wikimedia.org/r/355259 (https://phabricator.wikimedia.org/T166164) (owner: 10Ottomata) [07:52:29] (03PS1) 10Jcrespo: mariadb: Depool db1077 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355591 [07:54:42] (03CR) 10Marostegui: [C: 031] mariadb: Depool db1077 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355591 (owner: 10Jcrespo) [07:55:04] (03CR) 10Jcrespo: [C: 032] mariadb: Depool
db1077 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355591 (owner: 10Jcrespo) [07:56:34] (03Merged) 10jenkins-bot: mariadb: Depool db1077 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355591 (owner: 10Jcrespo) [07:56:43] (03CR) 10jenkins-bot: mariadb: Depool db1077 for maintenance and upgrade [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355591 (owner: 10Jcrespo) [07:58:37] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Depool db1077 for maintenance and upgrade (duration: 00m 41s) [07:58:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:04] !log resuming slow upgrade of facter across the fleet checking is a noop T166203 [08:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:13] T166203: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203 [08:25:54] !log stopping and restarting db1077 [08:26:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:27:02] RECOVERY - Disk space on elastic1023 is OK: DISK OK [08:38:46] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3291170 (10jcrespo) This was totally my fault for a misleading title- I should have pinged you and not only put it on ops-codfw. I will be doing that for now on. CC @Maro... [08:41:52] PROBLEM - nova-compute process on labvirt1003 is CRITICAL: PROCS CRITICAL: 2 processes with regex args ^/usr/bin/python /usr/bin/nova-compute [08:42:52] RECOVERY - nova-compute process on labvirt1003 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/nova-compute [08:44:44] 06Operations, 10Mail: Exim panics when spamd reaches maxchildren - https://phabricator.wikimedia.org/T166291#3291183 (10fgiunchedi) [08:51:47] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3291214 (10fgiunchedi) @Cmjohnson sure let's try ms-be2021 today, ping me on IRC [08:52:47] 06Operations, 05Prometheus-metrics-monitoring, 15User-fgiunchedi: Upgrade mysqld_exporter to 0.10.0 - https://phabricator.wikimedia.org/T161296#3291215 (10fgiunchedi) @jcrespo yeah manually testing (i.e. stop puppet) on a test host sounds good, then we can progressively rollout via puppet by changing the mys... [08:53:13] (03PS1) 10Framawiki: Add www.defenceimagery.mod.uk to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355594 (https://phabricator.wikimedia.org/T166271) [08:53:28] 06Operations, 10Phabricator, 10Traffic: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3291219 (10fgiunchedi) I saw there's indeed a phabricator redirect map (in php), would that be the place @mmodell or some other place like apache? 
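
The facter !log above mentions "checking is a noop": before rolling the upgrade across the fleet, each host is checked to confirm the new facter version leaves the puppet catalog unchanged. A sketch of that per-host check; the commands are standard apt/puppet, the pairing is an assumption about the workflow:

    # Upgrade facter on one host, then do a dry run: zero pending changes in the
    # --noop output means the upgrade is a no-op for that host's catalog.
    apt-get install -y facter
    puppet agent --test --noop
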
[08:55:01] (03PS1) 10Jcrespo: jynus-vimrc: Disable mouse input & enable syntax highlighting [puppet] - 10https://gerrit.wikimedia.org/r/355595 [08:55:52] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=542.50 Read Requests/Sec=619.90 Write Requests/Sec=28.50 KBytes Read/Sec=41190.00 KBytes_Written/Sec=5084.00 [08:56:53] 06Operations, 10MediaWiki-General-or-Unknown, 06Security-Team, 10Traffic: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3291220 (10fgiunchedi) doh :( I'm for changing that to 400, we log those anyways on our side, thoughts @Anomie @Bawolff ? [08:58:51] (03PS1) 10Jcrespo: mariadb: Repool db1077 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355596 [08:59:10] (03CR) 10Marostegui: [C: 031] mariadb: Repool db1077 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355596 (owner: 10Jcrespo) [09:00:29] (03CR) 10Jcrespo: [C: 032] mariadb: Repool db1077 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355596 (owner: 10Jcrespo) [09:01:53] (03Merged) 10jenkins-bot: mariadb: Repool db1077 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355596 (owner: 10Jcrespo) [09:02:04] (03CR) 10jenkins-bot: mariadb: Repool db1077 with low weight after maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355596 (owner: 10Jcrespo) [09:02:37] (03PS14) 10Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) [09:03:31] (03CR) 10jerkins-bot: [V: 04-1] [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [09:05:52] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=0.40 Read Requests/Sec=1.00 Write Requests/Sec=14.00 KBytes Read/Sec=10.40 KBytes_Written/Sec=96.80 [09:07:06] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1077 after maintenance with low weight (duration: 00m 41s) [09:07:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:14:20] 06Operations, 10ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3291264 (10fgiunchedi) 05Open>03Resolved Disk rebuilding [09:19:13] akosiaris: Ping? [09:19:39] Please kill https://upload.wikimedia.org/wikipedia/commons/3/34/Xxvid(Portal-edman-news-musik).webm (NSFW) as a deleted copyvio that is still visible [09:20:15] Revent: is it already deleted from commons? [09:20:19] Yes [09:21:15] Some of the WP0 abusers are using links to upload to try to take advantage of caching. [09:21:26] (03PS1) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355597 [09:21:35] (03CR) 10jerkins-bot: [V: 04-1] Revert "mariadb: Depool db1077 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355597 (owner: 10Jcrespo) [09:24:54] Revent: that url is already giving me a 404 [09:24:55] Revent: ok I'm taking a look [09:25:16] godog: is there something more that needs to be done that I am not aware of? [09:25:36] godog: For me upload is 198.35.26.112 [09:25:39] akosiaris: no just a ban in frontend varnish is enough [09:25:49] (if that helps) [09:25:56] doesn't a regular delete purge things automatically?
[09:26:19] jynus: maybe the purge has been lost .. it's udp multicast after all [09:26:27] ah, true [09:26:52] especially if it is under load [09:26:56] so esams seems to be clear already.. looking into the other 3 dcs [09:27:30] Don’t remember who wrote it up, but the huge numbers of deletions on Commons from wp0 abuse are causing drama. [09:27:52] akosiaris: ulsfo should be based on the ip [09:28:20] !log ban commons object on request in ulsfo [09:28:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:31] ah godog was faster [09:28:37] eqiad + codfw are already fine btw [09:29:09] nice, I'm adding another example for this to the wikitech page [09:29:31] jynus: the other funny thing that may happen is that with udp multicast there's a race condition [09:30:01] the purge for some reason may reach and be served faster on say ulsfo than eqiad [09:30:09] not due to network but rather load [09:30:20] so eqiad is not purged yet, but ulsfo is [09:30:39] and a request a split second later makes ulsfo query eqiad [09:30:51] finding the resource once more and caching it [09:31:41] although that's a rare race condition. I doubt we saw this. Most probably the purge was lost for some reason [09:33:09] godog: thanks for handling it btw [09:34:02] akosiaris: np! [09:34:22] PROBLEM - puppet last run on ms-be2027 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 23 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdb3],Exec[mkfs-/dev/sdc1] [09:34:25] I still seem to be able to see it on esams though, looking [09:34:40] ? [09:34:42] I can't [09:34:42] PROBLEM - puppet last run on ms-be2037 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 42 seconds ago with 2 failures. Failed resources (up to 3 shown): Exec[xfs_label-/dev/sdb3],Exec[mkfs-/dev/sdc1] [09:35:05] !log restart Yarn nodemanager daemons on all the hadoop worker nodes for jvm upgrades [09:35:11] It’s a bit sad, actually (that particular abuser) in that a filter for “Portal Edman News Musik” in filenames would be useful, but at the same time it would make the problematic uploads less obvious (presumably he would notice they were being filtered) [09:35:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:36:20] akosiaris: yeah I got that at the beginning too, then noticed my terminal didn't include '.webm' in the url highlight -.- [09:36:24] Revent: if that is an ongoing issue (and not a one-off occurrence), I would definitely try contacting people in charge of that program or community tech to see if there could be some tools to help with that [09:37:57] jynus: It’s related to the WP0 abuse, in a way… some guy that is sharing copyvio stuff (movies, music, and porn) on Commons for WP0 people, and using several different facebook accounts to share the links. [09:39:20] Thing is, he’s dumb enough to always put the same string in the filenames, while socking persistently, so some kind of technical solution might teach him to hide the crap better.
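
The "ban commons object" step logged above is what backstops a lost multicast purge: an operator tells the cache frontends to invalidate the object directly. A sketch of what that likely amounts to on a Varnish 4 cache host; the instance name and the exact ban expression are assumptions, not the command actually run:

    # Lazily invalidate one object from the frontend varnishd instance:
    varnishadm -n frontend ban \
        'req.http.host == "upload.wikimedia.org" && req.url == "/wikipedia/commons/3/34/Xxvid(Portal-edman-news-musik).webm"'
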
[09:39:44] !log reimage analytics1030 to Debian Jessie - T165529 [09:39:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:51] T165529: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529 [09:40:48] mmmm too fast, mgmt console doesn't work [09:40:56] (the host is down) [09:41:45] 06Operations, 10ops-eqiad: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3291304 (10elukey) @Cmjohnson I can't connect with `ssh root@analytics1030.mgmt.eqiad.wmnet` :( [09:41:56] 06Operations, 10ops-eqiad, 15User-Elukey: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3291305 (10elukey) [09:42:28] akosiaris: trying with a smaller regex, not sure my first one matched as I can still see it, do you? [09:43:17] it could also be that the ban takes a bit, lots of objects in upload [09:44:07] godog: I am definitely not seeing the object in esams.. I consistently get 404 [09:44:22] curl -I --resolve upload.wikimedia.org:443:91.198.174.208 https://upload.wikimedia.org/wikipedia/commons/3/34/Xxvid%28Portal-edman-news-musik%29.webm [09:44:23] HTTP/2 404 [09:45:08] yep same here now [09:45:10] Revent: ^ [09:45:57] taking a break, bbiab [09:46:32] PROBLEM - swift-container-updater on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:32] PROBLEM - swift-object-server on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:32] PROBLEM - swift-object-updater on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:32] PROBLEM - swift-container-server on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:32] PROBLEM - dhclient process on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:33] PROBLEM - MD RAID on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:33] PROBLEM - swift-object-auditor on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:34] PROBLEM - DPKG on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:52] PROBLEM - salt-minion processes on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:46:52] PROBLEM - SSH on ms-be2039 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [09:47:02] PROBLEM - puppet last run on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:02] PROBLEM - Check systemd state on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:02] PROBLEM - configured eth on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:03] PROBLEM - swift-account-replicator on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:12] PROBLEM - very high load average likely xfs on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:12] PROBLEM - swift-account-auditor on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:12] PROBLEM - swift-container-replicator on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:12] PROBLEM - swift-container-auditor on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:12] PROBLEM - swift-object-replicator on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:22] PROBLEM - swift-account-server on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. 
[09:47:22] PROBLEM - swift-account-reaper on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:22] PROBLEM - Check whether ferm is active by checking the default input chain on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:32] PROBLEM - Check size of conntrack table on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:32] PROBLEM - Disk space on ms-be2039 is CRITICAL: CHECK_NRPE: Socket timeout after 10 seconds. [09:47:36] akosiaris: Yes, 404 now [09:47:45] (03CR) 10Alexandros Kosiaris: [C: 031] admins: add kaldari to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355568 (https://phabricator.wikimedia.org/T166165) (owner: 10Dzahn) [09:47:45] cannot ssh. Load, maintenance or crash? [09:47:58] jynus: reboot from me to upgrade to 4.9 and expired downtime [09:48:05] sorry about that, fixing [09:48:05] oh, sorry [09:48:15] Via: 1.1 varnish-v4, 1.1 varnish-v4, 1.1 varnish-v4 [09:48:16] Age: 214 [09:48:16] X-Cache: cp1062 miss, cp3047 hit/6, cp3049 miss [09:49:22] RECOVERY - Disk space on ms-be2039 is OK: DISK OK [09:49:22] RECOVERY - Check size of conntrack table on ms-be2039 is OK: OK: nf_conntrack is 0 % full [09:49:22] RECOVERY - swift-object-server on ms-be2039 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [09:49:22] RECOVERY - MD RAID on ms-be2039 is OK: OK: Active: 6, Working: 6, Failed: 0, Spare: 0 [09:49:22] RECOVERY - swift-object-updater on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [09:49:23] RECOVERY - swift-container-updater on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater [09:49:23] RECOVERY - dhclient process on ms-be2039 is OK: PROCS OK: 0 processes with command name dhclient [09:49:24] RECOVERY - swift-object-auditor on ms-be2039 is OK: PROCS OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [09:49:24] RECOVERY - swift-container-server on ms-be2039 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-container-server [09:49:25] RECOVERY - DPKG on ms-be2039 is OK: All packages OK [09:49:42] RECOVERY - SSH on ms-be2039 is OK: SSH OK - OpenSSH_6.7p1 Debian-5+deb8u3 (protocol 2.0) [09:49:42] RECOVERY - salt-minion processes on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/salt-minion [09:49:52] RECOVERY - puppet last run on ms-be2039 is OK: OK: Puppet is currently enabled, last run 12 minutes ago with 0 failures [09:49:52] RECOVERY - Check systemd state on ms-be2039 is OK: OK - running: The system is fully operational [09:49:52] RECOVERY - swift-account-replicator on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-replicator [09:49:52] RECOVERY - configured eth on ms-be2039 is OK: OK - interfaces up [09:50:03] RECOVERY - very high load average likely xfs on ms-be2039 is OK: OK - load average: 12.00, 3.64, 1.26 [09:50:03] RECOVERY - swift-container-replicator on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-replicator [09:50:03] RECOVERY - swift-account-auditor on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor [09:50:03] RECOVERY - swift-container-auditor on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-auditor [09:50:03] RECOVERY - 
swift-object-replicator on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [09:50:12] RECOVERY - Check whether ferm is active by checking the default input chain on ms-be2039 is OK: OK ferm input default policy is set [09:50:12] RECOVERY - swift-account-server on ms-be2039 is OK: PROCS OK: 49 processes with regex args ^/usr/bin/python /usr/bin/swift-account-server [09:50:12] RECOVERY - swift-account-reaper on ms-be2039 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-reaper [09:52:06] godog: are you rebooting many other hosts too? [09:54:54] (03CR) 10Alexandros Kosiaris: [C: 032] admins: add kaldari to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355568 (https://phabricator.wikimedia.org/T166165) (owner: 10Dzahn) [09:55:00] (03PS2) 10Alexandros Kosiaris: admins: add kaldari to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355568 (https://phabricator.wikimedia.org/T166165) (owner: 10Dzahn) [09:55:04] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] admins: add kaldari to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/355568 (https://phabricator.wikimedia.org/T166165) (owner: 10Dzahn) [09:56:31] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3291336 (10akosiaris) 05Open>03Resolved a:03akosiaris I don't think the 3 day waiting period applies to the addition of extra privileges, so in the interest of... [09:57:56] volans: no, why? [09:58:39] to avoid to have conflict with my upgrade of facter [09:58:45] that is slowly progressing ;) [09:59:17] (03PS1) 10Alexandros Kosiaris: Add jdittrich to analytics-users [puppet] - 10https://gerrit.wikimedia.org/r/355599 (https://phabricator.wikimedia.org/T165943) [09:59:43] ah, no I'm done for today volans [09:59:52] great, thanks [10:01:28] !log restart HDFS datanode daemons on all the hadoop worker nodes for jvm upgrades [10:01:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:06:52] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3291363 (10akosiaris) Actually conf1XXX are for conftool, NOT kubernetes. etcd1XXX are for kubernetes, but are going to be renamed to kubetcd1XXX (codfw is already naming them... [10:15:54] 06Operations, 15User-fgiunchedi: Some swift disks wrongly mounted on 5 ms-be hosts - https://phabricator.wikimedia.org/T163673#3291381 (10fgiunchedi) The same thing happened on ms-be2037, it looks like the non-deterministic order of scsi devices when the kernel boots. Sometimes the LDs might not be detected in... [10:19:55] (03PS2) 10Alexandros Kosiaris: role::kubernetes::worker: upgrade calico on one host [puppet] - 10https://gerrit.wikimedia.org/r/355393 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [10:23:32] (03Abandoned) 10Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [10:25:40] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1008 - https://phabricator.wikimedia.org/T166177#3291410 (10fgiunchedi) a:03Cmjohnson @Cmjohnson please replace ! 
thanks [10:25:46] (03CR) 10Alexandros Kosiaris: "I 've update to use an eqiad host where we want to see whether the new version fixes a couple of bugs we 've witnessed" [puppet] - 10https://gerrit.wikimedia.org/r/355393 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [10:26:21] (03CR) 10Alexandros Kosiaris: [C: 032] role::kubernetes::worker: upgrade calico on one host [puppet] - 10https://gerrit.wikimedia.org/r/355393 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [10:26:27] (03PS3) 10Alexandros Kosiaris: role::kubernetes::worker: upgrade calico on one host [puppet] - 10https://gerrit.wikimedia.org/r/355393 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [10:26:35] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] role::kubernetes::worker: upgrade calico on one host [puppet] - 10https://gerrit.wikimedia.org/r/355393 (https://phabricator.wikimedia.org/T165024) (owner: 10Giuseppe Lavagetto) [10:50:53] (03CR) 10Dereckson: [C: 04-1] "Files require an access token. We should first assert this token is not session dependant." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355594 (https://phabricator.wikimedia.org/T166271) (owner: 10Framawiki) [10:52:39] 06Operations, 06Operations-Software-Development, 07Technical-Debt: Remove Salt from wmf-auto-reimage / wmf-reimage - https://phabricator.wikimedia.org/T166300#3291472 (10Volans) [10:52:41] 06Operations, 06Operations-Software-Development, 07Technical-Debt: Sunset our use of Salt - https://phabricator.wikimedia.org/T164780#3291485 (10Volans) [10:53:25] (03PS1) 10Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) [12:00:32] (03PS6) 10Elukey: role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [12:00:42] _joe_ had to rebase modules/profile/manifests/cassandra.pp for a conflict but should be ok now [12:00:46] --^ [12:01:59] running pcc again just in case [12:02:06] and then I'll merge to aqs1004 [12:22:23] (03PS12) 10Madhuvishy: sge: Revamp queue,rqs configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [12:23:27] (03CR) 10jerkins-bot: [V: 04-1] sge: Revamp queue,rqs configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 (owner: 10Madhuvishy) [12:28:18] (03PS13) 10Madhuvishy: sge: Revamp queue,rqs configuration puppet [puppet] - 10https://gerrit.wikimedia.org/r/352895 [12:41:14] !log cordon kubernetes100{2,3,4} for testing calico-node on kubernetes1001 [12:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:44:16] (03CR) 10Filippo Giunchedi: [C: 031] role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [12:51:22] (03PS1) 10Giuseppe Lavagetto: Split up pybal.pybal in smaller files [debs/pybal] - 10https://gerrit.wikimedia.org/r/355609 [12:51:56] <_joe_> elukey: let me know what you need to do :) [12:52:48] I am about to merge, was doing other things :) [12:53:36] 06Operations, 10MediaWiki-General-or-Unknown, 06Security-Team, 10Traffic: Mediawiki replies with 500 on wrongly formatted CSP report - https://phabricator.wikimedia.org/T166229#3291687 (10Anomie) Yes, 400 would seem to be more semantically appropriate than 500. I have no idea whether a 400 response also sh... 
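
On the CSP ticket quoted just above, the 500-versus-400 question is easy to reproduce from a shell. A sketch, assuming MediaWiki's action=cspreport API module is the report target (the parameter names beyond that are standard):

    # POST a deliberately malformed report; before the proposed fix this returns
    # HTTP 500, after it the server should answer 400.
    curl -s -o /dev/null -w '%{http_code}\n' \
        -H 'Content-Type: application/csp-report' \
        --data 'this is not json' \
        'https://en.wikipedia.org/w/api.php?action=cspreport&format=json'
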
[12:56:22] (03CR) 10Elukey: [C: 032] role::aqs: use profile::cassandra [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto) [12:57:26] all right if puppet breaks on restbase or aqs it is me [12:57:33] going to apply the change to aqs1004 [13:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T1300). [13:00:50] no swat patches... someone call guinness [13:02:01] https://upload.wikimedia.org/wikipedia/commons/9/92/Guinness.jpg [13:02:26] jynus: well not that guinness... [13:02:31] but i guess it works [13:02:34] jynus; great minds think alike :-D [13:02:38] lovely day for a guinness [13:03:03] I perfer guiness mixed with white wine [13:03:09] Zppix: actually, it is the same guinness [13:03:15] https://en.wikipedia.org/wiki/Guinness_World_Records [13:03:27] "On 10 November 1951, Sir Hugh Beaver, then the managing director of the Guinness Breweries..." [13:03:52] jynus: of course i forgot everyone owns an alcholic beverage [13:04:46] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3291750 (10Ottomata) Just nocied this ticket again! Can/should we move forward with this? [13:04:57] !log restart cassandra-a on aqs1004 to test https://gerrit.wikimedia.org/r/354107 [13:05:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:43] No patches for SWAT has happened at least a few times in the past. [13:07:16] anomie: but eu swat? It's rare [13:07:24] especially now we offer 3 windows per day [13:07:43] Zppix: The times I'm thinking of were back when we had only two windows. [13:07:57] anomie: thats before my time as a dev here [13:08:33] * anomie hasn't SWATted much if at all since the change to 3 windows [13:09:50] then again theres a lot of swatters anomie xD [13:09:50] _joe_ everything good, rolling out the change to all the aqs nodes.. thanks! [13:10:09] anyway I'm going to let everyone get back to work and stop clogging up the chat [13:10:21] <_joe_> elukey: cool [13:10:42] <_joe_> elukey: remember to add the aqs nodes to the list for the scap targets :) [13:10:48] <_joe_> like twcs [13:10:56] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3291758 (10BBlack) From my perspective, where we last stalled out is waiting for Analytics to say it's ok to merge https://gerrit.wikimedia.org/r/#/c/2... 
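
The aqs1004 restart logged above touches only one Cassandra instance: AQS hosts run multi-instance Cassandra, each instance with its own systemd unit. A sketch of bouncing instance "a" and confirming it rejoined; the cassandra-a unit and per-instance nodetool wrapper names are assumptions based on WMF's multi-instance convention:

    # Restart only instance "a", then check it is back Up/Normal in the ring
    # before touching the next instance.
    systemctl restart cassandra-a.service
    nodetool-a status | grep '^UN'
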
[13:11:02] <_joe_> it could be a good time to move that to use the lists from conftool [13:11:25] _joe_ we don't use twcs so I think it is fine if we skip [13:11:41] <_joe_> elukey: still, it's not a good way to set things up :P [13:11:53] <_joe_> but I'll talk with urandom when he's around [13:12:28] well I guess we can deploy it, it doesn't matter much, we'll not risk to enable it in any way and it will bring more consistency [13:12:55] (03CR) 10Filippo Giunchedi: prometheus: report puppet agent stats (037 comments) [puppet] - 10https://gerrit.wikimedia.org/r/354007 (owner: 10Filippo Giunchedi) [13:13:25] (03PS5) 10Filippo Giunchedi: prometheus: report puppet agent stats [puppet] - 10https://gerrit.wikimedia.org/r/354007 [13:13:27] (03PS4) 10Filippo Giunchedi: base: report prometheus agent stats [puppet] - 10https://gerrit.wikimedia.org/r/354457 [13:13:29] (03PS4) 10Filippo Giunchedi: prometheus: add alertmanager_url to prometheus server [puppet] - 10https://gerrit.wikimedia.org/r/354459 [13:13:31] (03PS4) 10Filippo Giunchedi: role: use alertmanager in beta prometheus [puppet] - 10https://gerrit.wikimedia.org/r/354460 [13:13:33] (03PS3) 10Filippo Giunchedi: role: set external url for prometheus beta [puppet] - 10https://gerrit.wikimedia.org/r/354975 [13:13:35] (03PS4) 10Filippo Giunchedi: WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 [13:16:37] (03CR) 10jerkins-bot: [V: 04-1] WIP prometheus::alertmanager [puppet] - 10https://gerrit.wikimedia.org/r/354976 (owner: 10Filippo Giunchedi) [13:17:02] (03PS2) 10Jcrespo: Revert "mariadb: Depool db1077 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355597 [13:18:28] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db1077 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355597 (owner: 10Jcrespo) [13:19:09] (03PS1) 10Jcrespo: mariadb: Depool db2055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355613 [13:20:05] (03CR) 10Marostegui: [C: 031] mariadb: Depool db2055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355613 (owner: 10Jcrespo) [13:20:51] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db1077 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355597 (owner: 10Jcrespo) [13:21:00] (03CR) 10jenkins-bot: Revert "mariadb: Depool db1077 for maintenance and upgrade" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355597 (owner: 10Jcrespo) [13:21:09] (03CR) 10Jcrespo: [C: 032] mariadb: Depool db2055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355613 (owner: 10Jcrespo) [13:22:52] !log jynus@tin Synchronized wmf-config/db-eqiad.php: Repool db1077 back to high load after maintenance (duration: 00m 41s) [13:22:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:15] 06Operations, 10Pybal, 10Traffic, 10netops: Frequent RST returned by appservers to LVS hosts - https://phabricator.wikimedia.org/T163674#3291785 (10elukey) Sorry for the late response! >>! In T163674#3274227, @BBlack wrote: > So from the above, apache really has 3 different modes of operation: > > 1) def... 
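
_joe_'s suggestion a few lines up, generating scap target lists from conftool instead of a static file, would key deployment targets off the same etcd data that pooling already uses. A sketch of the read side with confctl; the selector fields for the aqs cluster are assumptions:

    # List aqs hosts and their pooled state straight from conftool/etcd:
    confctl select 'dc=eqiad,cluster=aqs,service=aqs' get
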
[13:25:00] (03Merged) 10jenkins-bot: mariadb: Depool db2055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355613 (owner: 10Jcrespo) [13:26:10] (03CR) 10jenkins-bot: mariadb: Depool db2055 for maintenance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355613 (owner: 10Jcrespo) [13:29:22] !log jynus@tin Synchronized wmf-config/db-codfw.php: Depool db2055 for maintenance (duration: 00m 41s) [13:29:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:35:28] 06Operations, 10MediaWiki-Export-or-Import, 10Wikimedia-General-or-Unknown: Special:Import error: "Import failed: Could not open import file" - https://phabricator.wikimedia.org/T17000#3291824 (10Krinkle) [13:36:48] (03PS1) 10Jcrespo: Revert "mariadb: Depool db2055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355615 [13:37:00] (03CR) 10Jcrespo: [C: 04-2] "Not yet ready." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355615 (owner: 10Jcrespo) [13:37:23] (03CR) 10Marostegui: [C: 031] Revert "mariadb: Depool db2055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355615 (owner: 10Jcrespo) [13:39:32] !log restarting and upgrading db2055 for maintenance [13:39:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:59] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3291845 (10Cmjohnson) @akosiaris Would it be possible to shorten kubernetes-staging1XXX, I can't fit that all on a label. If you prefer it's fine I will should abbreviate it o... [13:44:18] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3291846 (10Cmjohnson) Also, I am assuming you want these in 2 separate rows? [13:44:58] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/instal (2)l kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3291847 (10akosiaris) kubestage1xxxx ? It's one letter shorter than kubernetes1xxx which is the production boxes and still conveys the meaning IMHO. And yes if possible 2 sepa... [14:08:19] !log shut ms-be1021 for BBU replacement - T163777 [14:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:28] T163777: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777 [14:08:48] (03PS3) 10Ottomata: Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 [14:10:20] (03CR) 10jerkins-bot: [V: 04-1] Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 (owner: 10Ottomata) [14:17:26] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292040 (10jcrespo) We believe this could be important, based on the above error output: https://wikitech.wikimedia.org/wiki/Platform-specific_documentation/HP_DL3N0_Gen9... 
[14:28:06] (03PS2) 10Andrew Bogott: Labs: Add wmcs-roots admin group to NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/355463 (owner: 10BryanDavis) [14:31:06] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2078489 [14:36:26] PROBLEM - Host mw2140 is DOWN: PING CRITICAL - Packet loss = 100% [14:40:03] (03Abandoned) 10Merlijn van Deen: Add Fabric deploy helper [debs/adminbot] - 10https://gerrit.wikimedia.org/r/231151 (owner: 10Merlijn van Deen) [14:40:06] (03Abandoned) 10Merlijn van Deen: Move to virtualenv based system [debs/adminbot] - 10https://gerrit.wikimedia.org/r/242926 (owner: 10Merlijn van Deen) [14:40:23] (03Abandoned) 10Merlijn van Deen: [WIP DO NOT MERGE] toollabs: replace package{} by require_package() [puppet] - 10https://gerrit.wikimedia.org/r/236616 (owner: 10Merlijn van Deen) [14:40:35] !log restart cp1074 backend (mailbox) [14:40:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:04] (03PS2) 10Elukey: [WIP] Add the eventlogging_cleaner script and base package [software/analytics-eventlogging-maintenance] - 10https://gerrit.wikimedia.org/r/355604 (https://phabricator.wikimedia.org/T156933) [14:42:41] (03Abandoned) 10Merlijn van Deen: Puppet: refactor puppet-enc include [puppet] - 10https://gerrit.wikimedia.org/r/325046 (owner: 10Merlijn van Deen) [14:42:48] 06Operations, 10MediaWiki-Cache, 10MediaWiki-JobQueue, 06Performance-Team, and 2 others: Investigate massive increase in htmlCacheUpdate jobs in Dec/Jan - https://phabricator.wikimedia.org/T124418#3292114 (10Krinkle) >>! In T124418#3276610, @BBlack wrote: > Not resolved, as the purge graphs can attest! Wh... [14:43:11] (03Abandoned) 10Merlijn van Deen: ops/puppet CI: check for non-ascii .pp files [puppet] - 10https://gerrit.wikimedia.org/r/303564 (owner: 10Merlijn van Deen) [14:43:18] (03Abandoned) 10Merlijn van Deen: [DO NOT SUBMIT] test for tool labs puppet compiler [puppet] - 10https://gerrit.wikimedia.org/r/254183 (owner: 10Merlijn van Deen) [14:43:38] (03Abandoned) 10Merlijn van Deen: toollabs: Allow HBA login to all hosts [puppet] - 10https://gerrit.wikimedia.org/r/296932 (https://phabricator.wikimedia.org/T104613) (owner: 10Merlijn van Deen) [14:43:59] (03Abandoned) 10Merlijn van Deen: [puppet] DO NOT SUBMIT -- auth.conf for resource_types search [puppet] - 10https://gerrit.wikimedia.org/r/283696 (owner: 10Merlijn van Deen) [14:44:03] (03PS4) 10Ottomata: Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 [14:44:14] (03Abandoned) 10Merlijn van Deen: [ssh, WIP] allow login from tools-login [puppet] - 10https://gerrit.wikimedia.org/r/220214 (https://phabricator.wikimedia.org/T103552) (owner: 10Merlijn van Deen) [14:44:57] (03Abandoned) 10Merlijn van Deen: toollabs: install 'fastapt' provider for packages [puppet] - 10https://gerrit.wikimedia.org/r/249489 (https://phabricator.wikimedia.org/T116813) (owner: 10Merlijn van Deen) [14:45:00] (03CR) 10jerkins-bot: [V: 04-1] Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 (owner: 10Ottomata) [14:45:20] (03Abandoned) 10Merlijn van Deen: toollabs: install openjdk-8-headless on trusty [puppet] - 10https://gerrit.wikimedia.org/r/258113 (https://phabricator.wikimedia.org/T121020) (owner: 10Merlijn van Deen) [14:45:47] (03Abandoned) 10Merlijn van Deen: puppet/apt: run unattended-upgrades before puppet 
(hiera-configurable) [puppet] - 10https://gerrit.wikimedia.org/r/254295 (owner: 10Merlijn van Deen) [14:46:14] (03PS5) 10Ottomata: Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 [14:48:16] 06Operations, 10Ops-Access-Requests: Grant sudo access for bdavis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3292124 (10madhuvishy) [14:51:05] 06Operations, 10ops-eqiad, 15User-fgiunchedi: Debug HP raid cache disabled errors on ms-be1019/20/21 - https://phabricator.wikimedia.org/T163777#3292142 (10fgiunchedi) After BBU replacement the error seems to be gone from ms-be1021: ``` root@ms-be1021:~# hpssacli controller slot=3 show detail Smart Array P... [14:51:06] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 0 [14:53:17] 06Operations, 10Ops-Access-Requests: Grant sudo access for bdavis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3292124 (10jcrespo) Giving root to labsdbs would be equivalent to giving root to all mysql servers for many reasons. No problem with that, but he should be added to the paging... [15:00:14] (03PS1) 10Andrew Bogott: Horizon: Add some local_settings for ldap access [puppet] - 10https://gerrit.wikimedia.org/r/355620 [15:00:29] (03PS2) 10Andrew Bogott: Horizon: Add some local_settings for ldap access [puppet] - 10https://gerrit.wikimedia.org/r/355620 [15:01:13] (03PS3) 10Andrew Bogott: Horizon: Add some local_settings for ldap access [puppet] - 10https://gerrit.wikimedia.org/r/355620 [15:02:45] (03CR) 10Andrew Bogott: [C: 032] Horizon: Add some local_settings for ldap access [puppet] - 10https://gerrit.wikimedia.org/r/355620 (owner: 10Andrew Bogott) [15:02:46] 06Operations, 06DC-Ops: Lots of hosts with hyperthreading disabled - https://phabricator.wikimedia.org/T156140#3292176 (10fgiunchedi) [15:02:48] 06Operations, 10Cassandra, 06Services (blocked): Hyperthreading disabled on restbase2002.codfw.wmnet & restbase1015.codfw.wmnet - https://phabricator.wikimedia.org/T162735#3292175 (10fgiunchedi) [15:04:10] (03PS6) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) [15:04:12] (03PS3) 10Andrew Bogott: Horizon sudo panel: Better distinguish between 'create' and 'modify' actions [puppet] - 10https://gerrit.wikimedia.org/r/355452 (https://phabricator.wikimedia.org/T162097) [15:04:14] (03PS3) 10Andrew Bogott: Horizon sudo panel: Add policy checks [puppet] - 10https://gerrit.wikimedia.org/r/355459 [15:08:28] 06Operations, 10Ops-Access-Requests: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3292205 (10bd808) [15:12:30] 06Operations, 10ops-eqiad, 06DC-Ops, 06Services: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3287641 (10fgiunchedi) We've discussed this in the ops/services syncup today. Since the SSDs will be moved as-is from old hardware the simplest plan I proposed is to reimage... [15:14:43] 06Operations, 10Ops-Access-Requests: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3292231 (10bd808) >>! In T166310#3292144, @jcrespo wrote: > No problem with that, but he should be added to the paging system (if he is not already there) and respond to...
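The `hpssacli` output quoted in T163777 above comes from HP's CLI for the Smart Array controllers; a short sketch of the checks involved (the slot number varies per host and is taken from the quoted output):

```
# Overall controller health, then battery/cache detail for slot 3.
hpssacli controller all show status
hpssacli controller slot=3 show detail | grep -iE 'battery|capacitor|cache'
```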
[15:16:23] (03PS3) 10BryanDavis: Labs: Add wmcs-roots admin group to NFS servers [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) [15:16:56] (03CR) 10BryanDavis: [C: 04-1] "Needs approval by techops, see T166310" [puppet] - 10https://gerrit.wikimedia.org/r/355463 (https://phabricator.wikimedia.org/T166310) (owner: 10BryanDavis) [15:18:48] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3292253 (10jcrespo) > I do not have a Foundation provided phone or phone contract I do not have either. Maybe cloud team can create a dedicated co... [15:26:42] 06Operations, 10Ops-Access-Requests, 13Patch-For-Review: Grant sudo access for Bryan Davis for labstore* and labsdb* - https://phabricator.wikimedia.org/T166310#3292282 (10jcrespo) In fact, I want to remember that the group dba used to receive db-only pages, but not sure if it was soft-disabled because it wa... [15:28:58] !log restarting and upgrading db2055 for kernel downgrade [15:29:05] jynus: Last I remember poking the paging, we only had 1 group that could do SMS paging (the ops one). All other groups were e-mail only [15:29:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:29:14] RainbowSprinkles: yes [15:29:19] Releng had expressed interest in a SMS group for some of our services before. [15:29:23] that and one group that could go to irc [15:29:40] which was the problem, and the group was mostly left disabled [15:29:44] * RainbowSprinkles nods [15:32:32] (03PS5) 10Amire80: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) [15:32:59] I think the best option is for him to get root everywhere and paging [15:33:10] (03PS1) 10Dereckson: Add Wikipedia wordmark in Serbian [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355625 (https://phabricator.wikimedia.org/T165896) [15:34:22] (03PS1) 10BBlack: r::c::perf - +rxring and +txqueuelen [puppet] - 10https://gerrit.wikimedia.org/r/355626 [15:34:56] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292318 (10Papaul) a:03Papaul [15:35:49] (03CR) 10jerkins-bot: [V: 04-1] r::c::perf - +rxring and +txqueuelen [puppet] - 10https://gerrit.wikimedia.org/r/355626 (owner: 10BBlack) [15:36:14] 06Operations, 10Traffic, 13Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#3292319 (10Nuria) [15:36:19] 06Operations, 10ops-eqiad, 06DC-Ops, 06Services: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3292320 (10Eevans) To summarize some discussion of this that took place in the ops-services-syncup meeting today: * Services has a couple weeks worth of data sampled from p... [15:36:31] 06Operations, 10Analytics, 10Traffic, 13Patch-For-Review: Replace Analytics XFF/client.ip data with X-Client-IP - https://phabricator.wikimedia.org/T118557#1803407 (10Nuria) 05duplicate>03Open [15:37:10] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292325 (10Papaul) Please put the server in maintenance mode so i can take over.
Thanks [15:38:04] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292326 (10jcrespo) All alerts are disabled for this host and it contains no running service, it can be put down at any time. [15:42:13] (03CR) 10Amire80: [C: 031] Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355114 (owner: 10DCausse) [15:43:56] !log delete thumbnails with > 2000px for wikivoyage / wikiversity / wikisource / wikiquote - T162796 [15:44:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:06] T162796: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796 [15:46:25] 06Operations, 10ops-eqiad, 06DC-Ops, 06Services: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3292363 (10Eevans) >>! In T166181#3292221, @fgiunchedi wrote: > We've discussed this in the ops/services syncup today. Since the SSDs will be moved as-is from old hardware t... [15:46:27] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292364 (10Papaul) Thanks. [15:47:23] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292366 (10jcrespo) I haven't said thank you, btw, for being able to do this so quickly! Thank you! [15:52:07] (03PS2) 10BBlack: r::c::perf - +rxring and +txqueuelen [puppet] - 10https://gerrit.wikimedia.org/r/355626 [15:52:38] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/install (2) kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3292380 (10RobH) +1 to kubestage [15:53:59] (03CR) 10BBlack: [C: 032] r::c::perf - +rxring and +txqueuelen [puppet] - 10https://gerrit.wikimedia.org/r/355626 (owner: 10BBlack) [15:54:53] jouncebot: next [15:54:54] In 0 hour(s) and 5 minute(s): Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T1600) [15:55:45] (03PS5) 10Chad: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971) [15:55:57] some query errors on zh_min_nanwiki [15:56:15] starting from 15:44 [15:58:17] I think they are category membership changes [16:00:03] 06Operations, 06Performance-Team, 10Traffic, 13Patch-For-Review: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569#3292408 (10BBlack) So I've stared at NavTiming graphs, and honestly it's hard to read any notable difference in the tea leaves. I'm still invest... [16:00:06] godog, moritzm, and _joe_: Respected human, time to deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T1600). Please do the needful. [16:01:22] errors may be going down [16:01:42] it is the jobqueue, so no end users affected for now [16:01:51] 06Operations, 10Phabricator: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3292437 (10Dzahn) [16:02:26] 06Operations, 10Phabricator: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3292449 (10Dzahn) ^ This is why i disabled puppet on the "phabricator" instance last night.
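The thumbnail deletion !logged at 15:43 works because thumbnail object names encode the width as a `<N>px-` prefix on the basename. A minimal sketch with the `swift` CLI, where the container name is a hypothetical placeholder:

```
# List a thumb container, keep objects whose basename starts with a width
# greater than 2000 pixels, and delete them one by one.
CONTAINER=wikisource-xx-local-thumb   # placeholder container name
swift list "$CONTAINER" \
  | awk -F/ '$NF ~ /^[0-9]+px-/ { w = $NF; sub(/px-.*/, "", w); if (w + 0 > 2000) print }' \
  | while read -r obj; do swift delete "$CONTAINER" "$obj"; done
```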
[16:03:12] doing that in a ticket so that people not on ops-list can see it, they asked [16:07:11] 06Operations, 10Phabricator: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3292486 (10Paladox) @Dzahn yep thanks for disabling puppet. I've removed the package and replaced it with the lighter one without mysql and then re-enabled puppet now. [16:16:54] 06Operations, 10ops-eqiad, 06DC-Ops, 06Services: rack/setup/install restbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3292515 (10fgiunchedi) [16:19:16] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 120, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-5/3/1: down - Transit: Zayo (IPYX/125449/003/ZYO) {#11542} [10Gbps]BR [16:21:16] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 122, down: 0, dormant: 0, excluded: 0, unused: 0 [16:22:30] 06Operations, 06Analytics-Kanban, 10DBA: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3292531 (10Nuria) p:05Triage>03High [16:22:46] 06Operations, 10Phabricator, 10Traffic: phab.wmfusercontent.org "homepage" yields a 500 - https://phabricator.wikimedia.org/T166120#3292533 (10mmodell) Well apache would probably be a better choice but we could do it in php land for sure. [16:27:11] !log T164865: RESTBase dev, re-enable revision range deletes [16:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:27:20] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [16:40:33] (03PS1) 10Jdrewniak: Updating wikipedia.org stats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355632 (https://phabricator.wikimedia.org/T128546) [16:43:25] that is manually being updated now? the stats numbers? [16:43:59] that's what made us write the original stats thing, to get autogenerated wiki code, heh [16:45:45] mutante: did you block out the horror of all the comments from when Discovery took over that portal and separated it from the pages on meta? ;) [16:46:16] i saw all of that, yea. that's kind of why :) [16:46:30] many many years ago i wrote stuff to generate the content of that page [16:47:53] i just thought this is going to be automatic and pull the numbers from db [16:49:00] *nod* that would be nice if there is a reasonable way to get to the data. [16:49:36] PROBLEM - puppet last run on lvs3003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:49:40] there are many ways to do that, what I wonder is why that is in that repo? [16:49:50] we talked about the challenges of automating the submodule bump at some point. As I recall we decided it was too spooky to script that [16:49:59] It can be done with gerrit automatically [16:50:10] But AIUI, we don't want it to bump automatically *every* time [16:50:23] (sometimes we commit things as WIP?) [16:50:49] Which makes it a special snowflake, I love things that don't assume master is usable!
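On the question of pulling the portal numbers automatically instead of hand-editing them: the per-wiki article counts are already exposed through the public API, so a job could regenerate the stats; a minimal sketch with curl and jq:

```
# Live article count for one wiki; loop over the language list as needed.
curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=statistics&format=json' \
  | jq -r '.query.statistics.articles'
```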
[16:50:50] ;-) [16:51:18] (03CR) 1020after4: [C: 031] phabricator: set alt_host in redirector to "phab" instead of "fab" [puppet] - 10https://gerrit.wikimedia.org/r/355455 (owner: 10Dzahn) [16:53:52] (03PS2) 10Dzahn: phabricator: set alt_host in redirector to "phab" instead of "fab" [puppet] - 10https://gerrit.wikimedia.org/r/355455 [16:54:30] (03CR) 10Dzahn: [C: 032] "it doesn't make a direct difference because this was overridden in another place, but nevertheless it's correct" [puppet] - 10https://gerrit.wikimedia.org/r/355455 (owner: 10Dzahn) [16:59:36] PROBLEM - HP RAID on ms-be1036 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds. [16:59:59] mutante: yeah, there's more automation in Meta (thanks to the bots from your wikistats) [17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Dear anthropoid, the time has come. Please deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T1700). [17:00:53] 06Operations, 10ops-codfw, 06DC-Ops, 06Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3292676 (10Papaul) Bad disk has been shipped to Intel. Please see below for shipping tracking information. {F8181441} [17:00:53] Presumably at some point people will get tired of the additional maintenance burden [17:03:06] (03CR) 10Jcrespo: [C: 032] Revert "mariadb: Depool db2055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355615 (owner: 10Jcrespo) [17:04:08] (03Merged) 10jenkins-bot: Revert "mariadb: Depool db2055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355615 (owner: 10Jcrespo) [17:05:18] *nod*, Nemo_bis [17:05:47] (03CR) 10jenkins-bot: Revert "mariadb: Depool db2055 for maintenance" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355615 (owner: 10Jcrespo) [17:07:38] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2055 after maintenance (duration: 02m 43s) [17:07:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:07:56] mw2140.codfw.wmnet timed out [17:08:55] seems down [17:09:09] icinga reports it down... seems nobody noticed? :) [17:09:49] I don't see alerting [17:09:52] 06Operations, 13Patch-For-Review, 15User-fgiunchedi: Delete non-used and/or non-requested thumbnail sizes periodically - https://phabricator.wikimedia.org/T162796#3292691 (10fgiunchedi) I started deleting today all thumbnails with widths > 2000px in small containers (i.e. non-commons) With a container at a... [17:10:16] oh, I see it now [17:10:24] yes before [17:10:47] 14:36:26 UTC [17:10:53] !log bsitzmann@tin Started deploy [mobileapps/deploy@614d752]: Update mobileapps to 946fe1f [17:11:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:16] seems dead, I am going to power-reset [17:11:54] uf, even console is naughty [17:12:56] !log powercycling mw2140 [17:13:03] ERROR: Timeout while waiting for server to perform requested power action.
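The power-action timeouts above are from the serial/management console; the same actions can be attempted out-of-band with ipmitool. A sketch, assuming the usual `<host>.mgmt.<site>.wmnet` naming and the password in the IPMI_PASSWORD environment variable:

```
# Query, then force, the chassis power state over the LAN interface.
ipmitool -I lanplus -H mw2140.mgmt.codfw.wmnet -U root -E chassis power status
ipmitool -I lanplus -H mw2140.mgmt.codfw.wmnet -U root -E chassis power cycle
```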
[17:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:13:55] not sure I will be able to put it up [17:14:58] !log bsitzmann@tin Finished deploy [mobileapps/deploy@614d752]: Update mobileapps to 946fe1f (duration: 04m 04s) [17:15:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:16:10] Server power status: OFF [17:16:32] 06Operations, 07Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3275388 (10Dzahn) Unfortunately RT ticket numbers don't have a fixed offset to the Phab numbers, the way Bugzilla tickets do. So the first step here is to find out the right Phab numbers. that should be... [17:16:43] (03CR) 10Zfilipin: [C: 031] "@hashar is this something you would still like to get merged? Or should it be abandoned?" [puppet] - 10https://gerrit.wikimedia.org/r/338633 (owner: 10Hashar) [17:16:52] and doesn't want to get powered on? [17:17:00] I am trying [17:17:03] :-) [17:17:09] console not very cooperative [17:17:15] bad mw2140 [17:17:35] we might need to have pa.paul have a look [17:17:36] RECOVERY - puppet last run on lvs3003 is OK: OK: Puppet is currently enabled, last run 35 seconds ago with 0 failures [17:18:18] 06Operations: Request about redirect from Wikipedie.cz - https://phabricator.wikimedia.org/T81228#3292713 (10Dzahn) [17:18:22] ERROR: Timeout while waiting for server to perform requested power action. [17:18:28] I think this requires one of those [17:18:35] "drain flea power" [17:18:41] assuming it is not toasted [17:18:59] I will try one hardreset [17:19:19] yeah, if it doesn't work let's put it inactive in confctl to remove it from dsh [17:19:58] ERROR: Timeout while waiting for server to perform requested power action. [17:20:11] Server power status: OFF [17:20:39] how to depool from outside the host? [17:21:02] any "central" host, puppetmasters and neodymium/sarin can do it [17:21:04] confctl [17:21:05] let me grab the command [17:21:08] I see [17:21:21] I got lazy and only used depool [17:21:24] :-) [17:21:31] 06Operations, 07Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3292718 (10Dzahn) a:03Dzahn [17:21:45] 06Operations, 07Technical-Debt: Supersede RT tickets references - https://phabricator.wikimedia.org/T165733#3275388 (10Dzahn) p:05Triage>03Normal [17:22:25] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292723 (10Papaul) Firmware update complete. [17:22:46] 06Operations, 10ops-codfw: db2049 cannot install jessie - let's try upgrading the firmware first - https://phabricator.wikimedia.org/T165739#3292725 (10Papaul) a:05Papaul>03jcrespo [17:22:57] confctl --quiet select 'dc=codfw,name=mw2140.codfw.wmnet' get [17:23:00] confctl --quiet --find --action set/pooled=no 'mw2140.codfw.wmnet' [17:23:01] to get current value [17:23:17] sudo -i confctl select 'name=mw2140.codfw.wmnet' set/pooled=inactive [17:23:19] (03CR) 10Zfilipin: "@hashar is this something you would still like to get merged? Or should it be abandoned?" [puppet] - 10https://gerrit.wikimedia.org/r/331856 (https://phabricator.wikimedia.org/T78342) (owner: 10Hashar) [17:23:20] to set it inactive [17:23:27] it auto-SALs it [17:23:54] jynus: ^^^ [17:23:56] inactive vs. no?
[17:24:07] no stays in dsh, inactive gets removed [17:24:29] same for LVS IIRC, inactive removes it completely [17:25:11] !log jynus@neodymium conftool action : set/pooled=inactive; selector: name=mw2140.codfw.wmnet [17:25:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:24] https://wikitech.wikimedia.org/wiki/Conftool#Modify_the_state_of_a_server_in_a_pool [17:25:41] I will try to resync just to confirm it works ok [17:25:55] you need to run puppet on tin first [17:26:03] that recreates the dsh group file [17:26:05] without it [17:26:30] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/install (2) kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3292736 (10Cmjohnson) [17:26:32] at least last time I have played with those things it was needed, not sure if we added additional magic to avoid it :) [17:26:41] 06Operations, 10ops-eqiad, 07kubernetes: rack/setup/install (2) kubernetes staging hosts - https://phabricator.wikimedia.org/T166264#3290332 (10Cmjohnson) Racked in A4 and B4 [17:27:18] not 100% sure it will work, I think we have to retire it [17:27:27] but I may be wrong [17:28:26] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [17:28:41] nice^, it works [17:28:51] the alert, not the depool [17:28:59] the depool needs no, I think [17:29:26] RECOVERY - HP RAID on ms-be1036 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor [17:30:17] !log jynus@tin Synchronized wmf-config/db-codfw.php: Repool db2055 after maintenance, 2nd try (duration: 02m 42s) [17:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:44] volans: so no in this case, because we want it out of dsh, too [17:31:13] !log jynus@neodymium conftool action : set/pooled=no; selector: name=mw2140.codfw.wmnet [17:31:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:31:53] jynus: it's the opposite, "no" is in the config of pybal and dsh, just not pooled, 'inactive' is out of the config [17:32:06] both pybal and dsh [17:32:17] it didn't work [17:32:23] did you run puppet? [17:32:25] on tin [17:32:25] yes [17:32:41] mmmh [17:32:53] puppet didn't do anything either [17:33:06] maybe I have to run it on puppetmaster? [17:33:28] not that I'm aware of [17:33:48] in the old times, we had to take stuff out of dsh in hiera [17:35:06] (03CR) 10Jdlrobson: [C: 031] "This just needs sign off from Nirzar on Phabricator before submitting" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355625 (https://phabricator.wikimedia.org/T165896) (owner: 10Dereckson) [17:37:01] " (only in pybal) present in the config " [17:37:11] I think conftool doesn't handle dsh at all [17:37:20] so in /etc/confd it seems that it's watching and does if ne $data.pooled "inactive" [17:39:30] (03CR) 10Chad: "This is getting complicated actually.
I'm thinking the best thing to do is just work on masters, then scap pull everywhere" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [17:41:07] 06Operations, 10ops-codfw: mw2140.codfw.wmnet unresponsive, cannot be powercycled with serial console - https://phabricator.wikimedia.org/T166328#3292768 (10jcrespo) [17:42:13] anyway I'm sure the right state is inactive in this case, so let's put it that way, then I'll take a look if/why the dsh group is not following [17:42:18] (03CR) 10Chad: "Actually, that makes way more sense. We'll become an AbstractSync that basically does all the master work first. Easy!" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [17:42:24] db1016 is the m1 master [17:42:41] if librenms or rt start to go slow, call me [17:43:06] why did it change write policy? [17:43:19] I do not know, but I have to go [17:43:30] ok [17:43:55] 06Operations, 10ops-eqiad, 15User-Elukey: analytics1030 failed bbu - https://phabricator.wikimedia.org/T165529#3292784 (10Cmjohnson) @elukey sorry about that didn't plug the cable back in...all should be good now [17:44:16] PROBLEM - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [17:44:16] PROBLEM - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [17:44:16] PROBLEM - puppet last run on restbase-dev1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 1 minute ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] [17:44:31] the battery seems no good [17:44:39] it may resolve itself or not [17:45:00] but I do not think it will suffer much with lower performance [17:45:19] It says Learn Cycle Requested : Yes [17:45:34] ok, I'll take a look, you have to go now ;) [17:45:34] but it is disabled [17:45:46] (03CR) 10Chad: "Duh. It already is an abstractsync. I just stupidly overrode main()" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 (owner: 10Chad) [17:45:50] maybe the bbu failed and it did one on its own [17:46:14] which means it will fix on its own or will be dead [17:46:34] I have also filed T166328 [17:46:34] T166328: mw2140.codfw.wmnet unresponsive, cannot be powercycled with serial console - https://phabricator.wikimedia.org/T166328 [17:46:49] the change I deployed is to codfw, so not critical at all if it is not on all hosts [17:46:58] urandom: the restbase-dev failure seems to have been there for a while, did an icinga downtime just expire? ^^^ [17:47:01] E: Version '3.7.3-instaclustr' for 'cassandra' was not found [17:47:21] ok thanks jy [17:51:58] !log volans@sarin conftool action : set/pooled=inactive; selector: name=mw2140.codfw.wmnet [17:52:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Morning SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T1800). [18:00:04] aharoni and jan_drewniak: A patch you scheduled for Morning SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
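To summarise the pooled-state semantics settled in the exchange above, the three conftool values differ in what the generated pybal and dsh configs contain; as shell commands:

```
# Inspect the current state for the host.
confctl select 'name=mw2140.codfw.wmnet' get

# pooled=yes      -> pooled and present in the generated pybal/dsh configs
# pooled=no       -> depooled, but still listed in the generated configs
# pooled=inactive -> removed from the generated configs entirely
confctl select 'name=mw2140.codfw.wmnet' set/pooled=inactive
```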
[18:00:12] (03PS1) 10Chad: Scap clean: Rewrite to just do stuff on masters then sync [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355635 [18:02:42] hallo [18:02:45] is SWAT now? [18:03:01] indeed, I can SWAT [18:03:34] aharoni: it looks like you linked to the wrong patch for Remove special Math extension settings for hewiki [18:03:42] thcipriani: checking... [18:04:31] (03PS6) 10Thcipriani: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: 10Amire80) [18:04:39] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: 10Amire80) [18:05:18] thcipriani: corrected [18:05:22] thanks [18:05:38] (03Merged) 10jenkins-bot: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: 10Amire80) [18:05:53] (03CR) 10jenkins-bot: Add namespace aliases for Hebrew Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352889 (https://phabricator.wikimedia.org/T164858) (owner: 10Amire80) [18:07:02] volans: ah, so it did [18:07:49] expired downtime? [18:07:57] volans: yeah [18:08:20] volans: i'll renew it [18:08:21] ok [18:08:34] thanks! :) [18:08:46] aharoni: namespace aliases for hewiki is now live on mwdebug1002, check please and then I'll sync and run namespaceDupes [18:10:26] thcipriani: works [18:10:32] aharoni: ok, syncing now [18:11:06] 06Operations, 10ops-eqiad, 10Analytics, 10Analytics-Cluster: rack/setup/install replacement to stat1005 (stat1002 replacement) - https://phabricator.wikimedia.org/T165368#3264256 (10Cmjohnson) @Ottomata It is in the rack as of today and I am getting through several orders and will do my best. I don't have... [18:11:27] ACKNOWLEDGEMENT - Host mw2140 is DOWN: PING CRITICAL - Packet loss = 100% Volans Host went down and is not restarting: https://phabricator.wikimedia.org/T166328 [18:12:30] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:352889|Add namespace aliases for Hebrew Wikipedia]] T164858 (duration: 00m 47s) [18:12:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:39] T164858: Create namespace aliases for hewiki - https://phabricator.wikimedia.org/T164858 [18:12:44] ACKNOWLEDGEMENT - puppet last run on restbase-dev1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] eevans T160570: Testing Cassandra 3.10 (currently an unsupported version option) [18:12:44] ACKNOWLEDGEMENT - puppet last run on restbase-dev1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 30 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] eevans T160570: Testing Cassandra 3.10 (currently an unsupported version option) [18:12:44] ACKNOWLEDGEMENT - puppet last run on restbase-dev1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 29 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[cassandra] eevans T160570: Testing Cassandra 3.10 (currently an unsupported version option) [18:12:57] !log mwscript namespaceDupes.php hewiki --fix [18:13:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:11] oh good. Error: 1062 Duplicate entry '4-CC' for key 'name_title' [18:15:03] aharoni: could you check ויקיפדיה:CC ?
I think that's what's messing up in the script [18:15:37] somehow there's a duplicate there [18:15:44] thcipriani: what exactly is wrong about it? It is a redirect [18:16:07] 06Operations, 10ops-codfw: mw2140.codfw.wmnet unresponsive, cannot be powercycled with serial console - https://phabricator.wikimedia.org/T166328#3292768 (10RobH) Once the powercycle is done, before the server is returned to service, full testing of the ilom needs to also happen. IE: @Papaul, once you reset th... [18:16:24] thcipriani: and there are no links to it, so if I delete it, it's not a problem [18:16:32] and I can restore it after the script runs [18:17:44] aharoni: would you please? For some reason the namespaceDupes script is getting tripped up there: https://phabricator.wikimedia.org/P5486 [18:20:40] thcipriani: sorry, Firefox crashed [18:20:47] np [18:21:04] thcipriani: can you please try again? [18:21:07] I deleted it [18:21:12] sure, thank you [18:23:18] aharoni_: done! seems to have worked that time, thank you for your help! [18:23:43] thcipriani: cool, now Math? [18:23:45] yup [18:24:05] (03PS3) 10Thcipriani: Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355114 (owner: 10DCausse) [18:24:15] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355114 (owner: 10DCausse) [18:25:55] (03Merged) 10jenkins-bot: Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355114 (owner: 10DCausse) [18:26:04] (03CR) 10jenkins-bot: Remove special Math extension settings for hewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355114 (owner: 10DCausse) [18:26:21] (03PS2) 10Thcipriani: Remove UseMathJax from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353971 (https://phabricator.wikimedia.org/T165475) (owner: 10Amire80) [18:26:27] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353971 (https://phabricator.wikimedia.org/T165475) (owner: 10Amire80) [18:27:10] !log T164865: RESTBase dev, re-enable render range deletes [18:27:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:18] T164865: Prototype and test range delete-based current revision storage - https://phabricator.wikimedia.org/T164865 [18:27:36] PROBLEM - Router interfaces on cr2-ulsfo is CRITICAL: CRITICAL: host 198.35.26.193, interfaces up: 76, down: 1, dormant: 0, excluded: 0, unused: 0BRxe-1/3/0: down - Core: cr1-codfw:xe-5/0/2 (Zayo, OGYX/124337//ZYO, 38.8ms) {#?} [10Gbps wave]BR [18:28:05] (03Merged) 10jenkins-bot: Remove UseMathJax from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353971 (https://phabricator.wikimedia.org/T165475) (owner: 10Amire80) [18:28:14] (03CR) 10jenkins-bot: Remove UseMathJax from CommonSettings.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/353971 (https://phabricator.wikimedia.org/T165475) (owner: 10Amire80) [18:29:55] aharoni: sorry for the delay, both of your changes are staged on mwdebug1002, check please [18:31:41] thcipriani: tested, it works [18:31:49] ok, syncing live [18:34:19] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:355114|Remove special Math extension settings for hewiki]] [[gerrit:353971|Remove UseMathJax from CommonSettings.php]] T165475 (duration: 00m 43s) [18:34:25] ^ aharoni live everywhere [18:34:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:34:28]
T165475: Remove traces of MathJax from the Math extension - https://phabricator.wikimedia.org/T165475 [18:34:40] jan_drewniak: ping for SWAT [18:37:11] thcipriani: thank you, everything is perfect. [18:37:26] yw, glad to hear it :) [18:39:09] !log forcing BBU learn on db1016 [18:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:42:20] thcipriani: what was the issue around math with all the revert commits? [18:42:34] aharoni: thanks for this [18:44:58] matanya: to which revert commits are you referring? [18:44:59] thcipriani: matanya dcausse - yeah, I just came back from the hackathon + vacation today and I wondered, too... [18:45:57] thcipriani: https://gerrit.wikimedia.org/r/#/c/353970/ and https://gerrit.wikimedia.org/r/#/c/355112/ [18:46:15] and https://gerrit.wikimedia.org/r/#/c/355114/ too [18:47:39] !log completed upgrade of facter across the fleet T166203 (apart from a few hosts that are down) [18:47:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:48] T166203: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203 [18:48:26] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy [18:48:46] ah, the ones I just merged :) seems like there was discussion around it (http://bit.ly/2r63zec) although I can't say for sure what the specific arguments were [18:50:43] thcipriani: i am not asking about the discussion :) i do read Hebrew. I was asking why Amir's change in gerrit had so many revert commits [18:51:20] (03PS3) 10Ottomata: Update kafka.sh wrapper script for Kafka 0.10+ [puppet] - 10https://gerrit.wikimedia.org/r/355259 (https://phabricator.wikimedia.org/T166164) [18:51:50] (03CR) 10Ottomata: [V: 032 C: 032] Update kafka.sh wrapper script for Kafka 0.10+ [puppet] - 10https://gerrit.wikimedia.org/r/355259 (https://phabricator.wikimedia.org/T166164) (owner: 10Ottomata) [18:58:00] 06Operations: Upgrade facter to version 2.4.6 - https://phabricator.wikimedia.org/T166203#3292947 (10Volans) Facter was upgraded and verified to be a no-op across the fleet. The only remaining hosts are a few that are currently offline: `analytics1030.eqiad.wmnet,cp3003.esams.wmnet,labstore[1001-1002].eqiad.wmnet` I'm inve... [18:59:54] (03PS1) 10Dzahn: phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 [19:00:05] thcipriani: Respected human, time to deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T1900). Please do the needful.
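For reference, the `Duplicate entry '4-CC'` failure at 18:14 means the alias remap would collide with an existing row on the page table's unique (page_namespace, page_title) index; namespace 4 is the project namespace. A hypothetical way to inspect the offending row from a maintenance host, assuming the `sql` wrapper passes stdin through to the mysql client:

```
# Look at the row that already occupies the (4, 'CC') slot.
sql hewiki <<'EOF'
SELECT page_id, page_namespace, page_title, page_is_redirect
FROM page
WHERE page_namespace = 4 AND page_title = 'CC';
EOF
```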
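The BBU learn forced at 18:39 (and the WriteThrough/WriteBack flapping on db1016) can be driven and watched with MegaCli; a sketch, assuming adapter 0 as in the RAID output elsewhere in this log:

```
# Battery status first, then force a manual learn cycle.
megacli -AdpBbuCmd -GetBbuStatus -a0 | grep -iE 'charge|learn|state'
megacli -AdpBbuCmd -BbuLearn -a0
# The cache policy should return to WriteBack once the battery is healthy:
megacli -LDInfo -Lall -a0 | grep -i 'cache policy'
```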
[19:00:11] * thcipriani does [19:05:39] (03PS1) 10Thcipriani: all wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355641 [19:05:41] (03CR) 10Thcipriani: [C: 032] all wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355641 (owner: 10Thcipriani) [19:06:18] RECOVERY - MegaRAID on ms-be1008 is OK: OK: optimal, 14 logical, 14 physical [19:06:47] (03Merged) 10jenkins-bot: all wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355641 (owner: 10Thcipriani) [19:06:55] (03CR) 10jenkins-bot: all wikis to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355641 (owner: 10Thcipriani) [19:06:59] (03PS2) 10Chad: Scap clean: Rewrite to just do stuff on masters then sync [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355635 [19:07:31] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1008 - https://phabricator.wikimedia.org/T166177#3292989 (10Cmjohnson) 05Open>03Resolved added a new disk and added back cmjohnson@ms-be1008:~$ sudo megacli -CfgEachDskRaid0 WB RA Direct CachedBadBBU -a0 Adapter 0: Crea... [19:09:36] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: all wikis to 1.30.0-wmf.2 [19:09:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:10:46] (03CR) 10Paladox: [C: 031] phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [19:11:29] Dereckson: that was fast, thank you :) [19:11:45] yw [19:12:48] thcipriani: it would be nice to enable https://gerrit.wikimedia.org/r/#/c/354612/ today. I can put it in SWAT if you don't have time/energy to roll it out. [19:13:17] (03PS2) 10Dzahn: phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 [19:13:59] bd808: I can roll that out now if you've got a few [19:14:10] sure! [19:15:11] (03PS4) 10Thcipriani: Add Code of Conduct footer links to wikitech and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354612 (owner: 10BryanDavis) [19:15:21] matanya: there was an issue during swat on monday, it was not related to the patch itself but since SWAT was cancelled and the patch got merged I had to revert it and re-added it as https://gerrit.wikimedia.org/r/#/c/355114/ [19:15:25] (03CR) 10Thcipriani: [C: 032] Add Code of Conduct footer links to wikitech and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354612 (owner: 10BryanDavis) [19:15:51] thanks for clarifying dcausse [19:17:29] (03Merged) 10jenkins-bot: Add Code of Conduct footer links to wikitech and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354612 (owner: 10BryanDavis) [19:17:38] (03CR) 10jenkins-bot: Add Code of Conduct footer links to wikitech and mw.o [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354612 (owner: 10BryanDavis) [19:19:21] bd808: live on mwdebug1002 if you want to give it a check [19:19:49] (03CR) 10Legoktm: "Instead of registering a new hook, we should just check the globals inside the same hook."
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/354612 (owner: 10BryanDavis) [19:20:18] thcipriani: looks like it works for mw.o [19:20:33] bd808: ok, going live [19:22:05] legoktm: I think adding globals inside the hook is just going to make it less readable personally, but as always patches welcome [19:22:53] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:354612|Add Code of Conduct footer links to wikitech and mw.o]] Part I (duration: 00m 39s) [19:22:56] PROBLEM - HHVM rendering on mw2168 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:56] PROBLEM - HHVM rendering on mw2131 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:56] PROBLEM - HHVM rendering on mw2195 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:56] PROBLEM - HHVM rendering on mw2120 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:56] PROBLEM - HHVM rendering on mw2112 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:56] PROBLEM - HHVM rendering on mw2116 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:56] PROBLEM - HHVM rendering on mw2102 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:57] PROBLEM - HHVM rendering on mw2104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:57] PROBLEM - HHVM rendering on mw2123 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:58] ok [19:22:58] PROBLEM - HHVM rendering on mw2105 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:58] PROBLEM - HHVM rendering on mw2172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:22:59] PROBLEM - HHVM rendering on mw2109 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:23:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:11] also uh [19:23:39] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:354612|Add Code of Conduct footer links to wikitech and mw.o]] Part II (duration: 00m 39s) [19:23:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:58] ^ bd808 should be live [19:24:01] why did hhvm barf there? 
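The flapping checks above are icinga's HHVM rendering probe, which fetches a page through the local web stack with a 10-second timeout; a rough manual equivalent from an affected appserver (URL and Host header are illustrative):

```
# Time a render the way the monitoring check does.
curl -s -o /dev/null -m 10 -w '%{http_code} %{time_total}s\n' \
  -H 'Host: en.wikipedia.org' 'http://localhost/wiki/Special:BlankPage'
```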
[19:24:02] and also uh [19:24:22] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:22] PROBLEM - HHVM rendering on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:26] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.092 second response time [19:24:26] RECOVERY - HHVM rendering on mw2148 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.105 second response time [19:24:26] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 20269 bytes in 0.100 second response time [19:24:26] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.103 second response time [19:24:26] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.106 second response time [19:24:26] RECOVERY - HHVM rendering on mw2107 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.116 second response time [19:24:26] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.120 second response time [19:24:36] PROBLEM - HHVM rendering on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:36] PROBLEM - HHVM rendering on mw2211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:36] PROBLEM - HHVM rendering on mw2103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:36] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:36] PROBLEM - HHVM rendering on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:36] PROBLEM - HHVM rendering on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:36] PROBLEM - HHVM rendering on mw2171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:37] PROBLEM - HHVM rendering on mw2191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:39] PROBLEM - HHVM rendering on mw2199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:39] PROBLEM - HHVM rendering on mw2182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:39] PROBLEM - HHVM rendering on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:39] PROBLEM - HHVM rendering on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:24:43] link is live on wikitech [19:24:57] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.717 second response time [19:24:57] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.820 second response time [19:24:57] those are all codfw [19:25:05] not awesome but less scary I guess [19:25:07] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.736 second response time [19:25:16] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 6.415 second response time [19:25:17] PROBLEM - HHVM rendering on mw2184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:17] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.685 second response time [19:25:36] PROBLEM - HHVM rendering on mw2170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:36] PROBLEM - HHVM rendering on mw2207 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:25:56] PROBLEM - HHVM rendering on mw2225 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:26:06] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 3.150 
second response time [19:26:06] RECOVERY - HHVM rendering on mw2204 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.103 second response time [19:26:06] RECOVERY - HHVM rendering on mw2169 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.105 second response time [19:26:06] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.121 second response time [19:26:06] RECOVERY - HHVM rendering on mw2121 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.122 second response time [19:26:07] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.122 second response time [19:26:07] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.117 second response time [19:26:07] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.118 second response time [19:26:07] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.122 second response time [19:26:08] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.125 second response time [19:26:09] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 3.523 second response time [19:26:26] (03PS5) 10Krinkle: varnish: Convert errorpage into re-usable template [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) [19:26:33] (03PS5) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257 [19:26:36] RECOVERY - HHVM rendering on mw2126 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.129 second response time [19:27:26] RECOVERY - HHVM rendering on mw2170 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.118 second response time [19:27:26] RECOVERY - HHVM rendering on mw2211 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.122 second response time [19:27:26] RECOVERY - HHVM rendering on mw2191 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.117 second response time [19:27:26] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.118 second response time [19:27:26] RECOVERY - HHVM rendering on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.115 second response time [19:27:26] RECOVERY - HHVM rendering on mw2199 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.118 second response time [19:27:27] RECOVERY - HHVM rendering on mw2260 is OK: HTTP OK: HTTP/1.1 200 OK - 20272 bytes in 0.113 second response time [19:27:27] RECOVERY - HHVM rendering on mw2182 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.116 second response time [19:27:28] RECOVERY - HHVM rendering on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.127 second response time [19:27:28] RECOVERY - HHVM rendering on mw2214 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.121 second response time [19:27:28] RECOVERY - HHVM rendering on mw2197 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.114 second response time [19:27:29] RECOVERY - HHVM rendering on mw2185 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.113 second response time [19:28:06] PROBLEM - HHVM rendering on mw2137 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:16] PROBLEM - HHVM rendering on mw2259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:16] PROBLEM - HHVM rendering on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:36] PROBLEM - HHVM rendering on mw2150 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [19:28:36] PROBLEM - HHVM rendering on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:28:36] PROBLEM - HHVM rendering on mw2099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:16] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:16] PROBLEM - HHVM rendering on mw2204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:16] PROBLEM - HHVM rendering on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:16] PROBLEM - HHVM rendering on mw2169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:16] PROBLEM - HHVM rendering on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:17] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:17] PROBLEM - HHVM rendering on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:17] PROBLEM - HHVM rendering on mw2121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:17] PROBLEM - HHVM rendering on mw2174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:29:26] RECOVERY - HHVM rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.080 second response time [19:29:26] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.349 second response time [19:29:27] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.778 second response time [19:29:27] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 6.794 second response time [19:29:27] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.313 second response time [19:29:27] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.829 second response time [19:29:36] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.342 second response time [19:29:36] RECOVERY - HHVM rendering on mw2241 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 8.872 second response time [19:29:36] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.845 second response time [19:29:46] PROBLEM - HHVM rendering on mw2126 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:07] I think the codfw dbs may be causing that alert storm. 
Seeing a lot of maxlag warnings in logstash [19:30:28] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 20269 bytes in 0.100 second response time [19:30:28] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.106 second response time [19:30:28] RECOVERY - HHVM rendering on mw2099 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.121 second response time [19:30:28] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:28] PROBLEM - HHVM rendering on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:36] PROBLEM - HHVM rendering on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:36] PROBLEM - HHVM rendering on mw2228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:36] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:36] PROBLEM - HHVM rendering on mw2170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:36] PROBLEM - HHVM rendering on mw2211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:36] PROBLEM - HHVM rendering on mw2148 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:37] PROBLEM - HHVM rendering on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:37] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:37] PROBLEM - HHVM rendering on mw2191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:38] PROBLEM - HHVM rendering on mw2171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:38] PROBLEM - HHVM rendering on mw2260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:39] PROBLEM - HHVM rendering on mw2185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:30:58] weirdly, I see a ton of things listening on 443 via netstat on 2 of ^ hosts [19:31:20] on 443? 
that is bizarre [19:31:26] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.656 second response time [19:31:27] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 74669 bytes in 9.800 second response time [19:31:36] PROBLEM - HHVM rendering on mw2113 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:36] PROBLEM - HHVM rendering on mw2106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:31:47] netstat -tlnp | grep 443 | wc -l [19:31:49] 96 [19:31:55] (03PS6) 10Ottomata: Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 [19:32:06] RECOVERY - HHVM rendering on mw2193 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.752 second response time [19:32:07] RECOVERY - HHVM rendering on mw2189 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.765 second response time [19:32:26] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:27] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:27] PROBLEM - HHVM rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:27] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:27] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:36] PROBLEM - HHVM rendering on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:36] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:36] PROBLEM - HHVM rendering on mw2241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:36] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:11] AFAIK there was an alter table on codfw DBs on s4 today, and is the only shard lagging in codfw T166206 [19:33:14] T166206: Convert unique keys into primary keys for some wiki tables on s4 - https://phabricator.wikimedia.org/T166206 [19:33:24] and from SAL too [19:33:26] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.226 second response time [19:33:26] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 5.699 second response time [19:33:27] RECOVERY - HHVM rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.379 second response time [19:33:27] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.829 second response time [19:33:27] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.387 second response time [19:33:27] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.828 second response time [19:33:27] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.408 second response time [19:33:28] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 7.856 second response time [19:33:28] RECOVERY - HHVM rendering on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.328 second response time [19:33:29] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 8.781 second response time [19:33:29] RECOVERY - HHVM rendering on mw2244 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.295 second response time [19:33:30] RECOVERY - HHVM rendering on mw2252 is OK: HTTP OK: HTTP/1.1 200 
OK - 74616 bytes in 9.765 second response time [19:33:30] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.888 second response time [19:33:30] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.417 second response time [19:33:36] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.906 second response time [19:33:36] RECOVERY - HHVM rendering on mw2228 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.998 second response time [19:33:36] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.440 second response time [19:33:36] RECOVERY - HHVM rendering on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.481 second response time [19:33:36] RECOVERY - HHVM rendering on mw2230 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.885 second response time [19:33:36] PROBLEM - HHVM rendering on mw2182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:33:48] RECOVERY - HHVM rendering on mw2102 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.669 second response time [19:33:48] RECOVERY - HHVM rendering on mw2123 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.934 second response time [19:33:49] RECOVERY - HHVM rendering on mw2104 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.294 second response time [19:33:49] RECOVERY - HHVM rendering on mw2105 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.878 second response time [19:33:56] RECOVERY - HHVM rendering on mw2109 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.638 second response time [19:33:56] RECOVERY - HHVM rendering on mw2122 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.639 second response time [19:33:56] RECOVERY - HHVM rendering on mw2172 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.060 second response time [19:33:57] RECOVERY - HHVM rendering on mw2175 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.590 second response time [19:33:57] RECOVERY - HHVM rendering on mw2145 is OK: HTTP OK: HTTP/1.1 200 OK - 74614 bytes in 0.527 second response time [19:33:57] RECOVERY - HHVM rendering on mw2136 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.516 second response time [19:33:57] RECOVERY - HHVM rendering on mw2142 is OK: HTTP OK: HTTP/1.1 200 OK - 74594 bytes in 0.916 second response time [19:33:57] RECOVERY - HHVM rendering on mw2208 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.884 second response time [19:33:57] RECOVERY - HHVM rendering on mw2144 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 1.320 second response time [19:33:58] RECOVERY - HHVM rendering on mw2167 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.346 second response time [19:34:06] RECOVERY - HHVM rendering on mw2201 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.788 second response time [19:34:06] RECOVERY - HHVM rendering on mw2253 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 2.005 second response time [19:34:06] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.314 second response time [19:34:06] RECOVERY - HHVM rendering on mw2186 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.754 second response time [19:34:06] RECOVERY - HHVM rendering on mw2242 is OK: HTTP OK: HTTP/1.1 200 OK - 74598 bytes in 2.878 second response time [19:34:06] RECOVERY - HHVM rendering on mw2257 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 3.312 second response time [19:34:06] RECOVERY - HHVM rendering on mw2141 is OK: HTTP OK: HTTP/1.1 200 OK - 
74596 bytes in 3.336 second response time [19:34:13] bd808: see above my message [19:34:52] (03PS7) 10Ottomata: Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 [19:35:05] volans: *nod* a big alter might cause some issues for sure [19:35:23] is this user-impacting or just alert spam? [19:35:26] RECOVERY - HHVM rendering on mw2188 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.011 second response time [19:35:27] RECOVERY - HHVM rendering on mw2166 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.512 second response time [19:35:27] RECOVERY - HHVM rendering on mw2163 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.020 second response time [19:35:27] * volans having dinner [19:36:16] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:16] PROBLEM - HHVM rendering on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:36:16] RECOVERY - HHVM rendering on mw2121 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.915 second response time [19:36:26] RECOVERY - HHVM rendering on mw2178 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.733 second response time [19:36:27] RECOVERY - HHVM rendering on mw2217 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.744 second response time [19:36:27] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.665 second response time [19:36:27] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.683 second response time [19:36:36] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.617 second response time [19:36:36] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.074 second response time [19:36:36] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 8.102 second response time [19:36:36] RECOVERY - HHVM rendering on mw2170 is OK: HTTP OK: HTTP/1.1 200 OK - 74599 bytes in 8.608 second response time [19:36:36] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.625 second response time [19:36:36] RECOVERY - HHVM rendering on mw2211 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.034 second response time [19:36:37] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.048 second response time [19:36:38] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.446 second response time [19:36:38] RECOVERY - HHVM rendering on mw2107 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.595 second response time [19:36:38] volans: looks like alert spam.
almost all of the hosts alerting are in codfw [19:36:38] RECOVERY - HHVM rendering on mw2148 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.889 second response time [19:37:17] PROBLEM - HHVM rendering on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:17] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:17] PROBLEM - HHVM rendering on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:17] PROBLEM - HHVM rendering on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:17] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:17] PROBLEM - HHVM rendering on mw2184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:17] PROBLEM - HHVM rendering on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:26] RECOVERY - HHVM rendering on mw2185 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.124 second response time [19:37:26] RECOVERY - HHVM rendering on mw2171 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.107 second response time [19:37:26] RECOVERY - HHVM rendering on mw2191 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.122 second response time [19:37:27] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.137 second response time [19:37:27] RECOVERY - HHVM rendering on mw2099 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.128 second response time [19:37:27] RECOVERY - HHVM rendering on mw2113 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.130 second response time [19:37:27] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.136 second response time [19:37:36] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:36] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:36] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:36] PROBLEM - HHVM rendering on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:36] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:36] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:37:36] PROBLEM - HHVM rendering on mw2190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:38:07] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 74647 bytes in 3.426 second response time [19:38:13] (03PS2) 10Chad: Also clean up ExtensionMessages files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355475 [19:38:16] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 4.702 second response time [19:38:16] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.688 second response time [19:38:16] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 4.864 second response time [19:38:16] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.181 second response time [19:38:16] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.110 second response time [19:38:16] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 5.434 second response time [19:38:16] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 2.520 second 
response time [19:38:17] RECOVERY - HHVM rendering on mw2184 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 2.682 second response time [19:38:26] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.631 second response time [19:38:27] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.632 second response time [19:38:27] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 1.114 second response time [19:38:27] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.188 second response time [19:38:27] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.642 second response time [19:38:27] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.697 second response time [19:38:27] RECOVERY - HHVM rendering on mw2190 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.078 second response time [19:38:27] RECOVERY - HHVM rendering on mw2164 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.199 second response time [19:38:27] RECOVERY - HHVM rendering on mw2103 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.667 second response time [19:38:28] RECOVERY - HHVM rendering on mw2182 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.094 second response time [19:38:28] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 2.565 second response time [19:38:29] RECOVERY - HHVM rendering on mw2260 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 2.628 second response time [19:39:16] PROBLEM - HHVM rendering on mw2259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:20] (03PS8) 10Ottomata: Add kafka_version parameter, s/java_packaage/java_home/ in confluent::kafka::client [puppet] - 10https://gerrit.wikimedia.org/r/354100 [19:39:26] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:26] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - HHVM rendering on mw2173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - HHVM rendering on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - HHVM rendering on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:27] PROBLEM - HHVM rendering on mw2181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:28] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:28] PROBLEM - HHVM rendering on mw2165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:36] PROBLEM - HHVM rendering on mw2180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:36] PROBLEM - HHVM rendering on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:36] PROBLEM - HHVM rendering on mw2235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:36] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:39:36] PROBLEM - HHVM rendering on mw2228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:01] (03CR) 10Dzahn: [C: 032] "no-op in prod: http://puppet-compiler.wmflabs.org/6526/" [puppet] - 
10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [19:41:06] (03PS3) 10Dzahn: phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 [19:41:16] RECOVERY - HHVM rendering on mw2259 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.385 second response time [19:41:27] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.017 second response time [19:41:27] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 7.046 second response time [19:41:27] RECOVERY - HHVM rendering on mw2111 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.423 second response time [19:41:33] (03CR) 10Ottomata: [C: 032] "Mostly a no-op in prod: https://puppet-compiler.wmflabs.org/6532/" [puppet] - 10https://gerrit.wikimedia.org/r/354100 (owner: 10Ottomata) [19:41:36] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.182 second response time [19:41:36] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.480 second response time [19:41:36] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.687 second response time [19:41:36] PROBLEM - HHVM rendering on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:36] PROBLEM - HHVM rendering on mw2103 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:36] PROBLEM - HHVM rendering on mw2191 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:36] PROBLEM - HHVM rendering on mw2171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:37] PROBLEM - HHVM rendering on mw2185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:38] PROBLEM - HHVM rendering on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:38] PROBLEM - HHVM rendering on mw2260 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:38] PROBLEM - HHVM rendering on mw2182 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:41:39] PROBLEM - HHVM rendering on mw2150 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:06] PROBLEM - HHVM rendering on mw2127 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:26] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.116 second response time [19:42:26] RECOVERY - HHVM rendering on mw2106 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.113 second response time [19:42:26] RECOVERY - HHVM rendering on mw2099 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.137 second response time [19:42:36] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:36] PROBLEM - HHVM rendering on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:42:36] RECOVERY - Router interfaces on cr2-ulsfo is OK: OK: host 198.35.26.193, interfaces up: 78, down: 0, dormant: 0, excluded: 0, unused: 0 [19:42:47] (03PS1) 10Krinkle: [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 [19:42:49] (03PS4) 10Dzahn: phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 [19:43:26] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: revert SWAT: [[gerrit:354612|Add Code of Conduct footer links to wikitech and mw.o]] Part I (duration: 00m 39s) [19:43:26] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.637 second response 
time [19:43:26] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.669 second response time [19:43:26] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.163 second response time [19:43:26] RECOVERY - HHVM rendering on mw2241 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 1.194 second response time [19:43:26] RECOVERY - HHVM rendering on mw2178 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.592 second response time [19:43:26] RECOVERY - HHVM rendering on mw2228 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 1.755 second response time [19:43:26] RECOVERY - HHVM rendering on mw2218 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 2.145 second response time [19:43:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:53] (03CR) 10Faidon Liambotis: [C: 04-1] "No, this shouldn't be done with Hiera, and it's definitely not just about renaming root@wikimedia.org to root@wmflabs.org. The Labs exim c" [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [19:44:31] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: revert SWAT: [[gerrit:354612|Add Code of Conduct footer links to wikitech and mw.o]] Part II (duration: 00m 38s) [19:44:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:41] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:41] PROBLEM - HHVM rendering on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:41] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:41] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:41] PROBLEM - HHVM rendering on mw2190 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:41] (03CR) 10jerkins-bot: [V: 04-1] [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 (owner: 10Krinkle) [19:44:42] PROBLEM - HHVM rendering on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:42] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:43] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:43] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:44:44] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:07] (03PS2) 10Krinkle: [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 [19:45:16] (03CR) 10Dzahn: "yes, there are 2 separate things though. one is exim config that comes from standard and one that comes from the phabricator role itself. 
" [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [19:45:16] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.874 second response time [19:45:24] (03PS1) 10Chad: Scap clean: Also drop old patch files [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355653 [19:45:26] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.377 second response time [19:45:26] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.825 second response time [19:45:26] RECOVERY - HHVM rendering on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.165 second response time [19:45:26] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 7.259 second response time [19:45:26] RECOVERY - HHVM rendering on mw2252 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 7.706 second response time [19:45:27] RECOVERY - HHVM rendering on mw2111 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.226 second response time [19:45:27] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.243 second response time [19:45:28] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 7.982 second response time [19:45:29] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 8.087 second response time [19:45:36] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.502 second response time [19:45:36] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.593 second response time [19:45:36] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 8.973 second response time [19:45:36] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.083 second response time [19:45:36] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.526 second response time [19:45:37] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.639 second response time [19:45:37] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.971 second response time [19:45:38] RECOVERY - HHVM rendering on mw2097 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.485 second response time [19:45:38] RECOVERY - HHVM rendering on mw2197 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.911 second response time [19:45:39] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.034 second response time [19:45:39] PROBLEM - HHVM rendering on mw2099 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:45:40] RECOVERY - HHVM rendering on mw2214 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.342 second response time [19:45:40] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.612 second response time [19:45:41] RECOVERY - HHVM rendering on mw2114 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.894 second response time [19:46:07] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 74647 bytes in 9.679 second response time [19:46:18] (03CR) 10jerkins-bot: [V: 04-1] [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 (owner: 10Krinkle) [19:46:36] PROBLEM - HHVM rendering on mw2151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:36] PROBLEM 
- HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:36] PROBLEM - HHVM rendering on mw2228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:36] PROBLEM - HHVM rendering on mw2239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:46:36] PROBLEM - HHVM rendering on mw2241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:04] (03CR) 10Dzahn: "i know there is definitely more to fix, but can't we start with the low hanging fruit like not hard coding an email address. If we want to" [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [19:48:26] (03Abandoned) 10Dzahn: phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [19:48:36] PROBLEM - HHVM rendering on mw2097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:36] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:37] PROBLEM - HHVM rendering on mw2197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:37] PROBLEM - HHVM rendering on mw2214 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:37] PROBLEM - HHVM rendering on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:37] PROBLEM - HHVM rendering on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:37] PROBLEM - HHVM rendering on mw2117 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:48:59] !log restart redises on rdb2003 [19:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:16] PROBLEM - HHVM rendering on mw2121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:16] PROBLEM - HHVM rendering on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:16] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:16] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:49:20] (03PS3) 10Krinkle: [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 [19:50:07] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:07] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:50:16] RECOVERY - HHVM rendering on mw2121 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.391 second response time [19:50:16] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.609 second response time [19:50:16] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.929 second response time [19:50:16] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.151 second response time [19:50:26] RECOVERY - HHVM rendering on mw2206 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 6.437 second response time [19:50:27] RECOVERY - HHVM rendering on mw2196 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 6.515 second response time [19:50:27] RECOVERY - HHVM rendering on mw2188 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 6.947 second response time [19:50:27] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 6.978 second response time [19:50:27] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 74727 bytes in 7.444 second response time [19:50:27] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: 
HTTP/1.1 200 OK - 74696 bytes in 6.780 second response time [19:50:27] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 7.169 second response time [19:50:28] RECOVERY - HHVM rendering on mw2148 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 7.204 second response time [19:50:29] some of these are eqiad hosts it seems thcipriani, mw1298 mw1295? [19:50:36] RECOVERY - HHVM rendering on mw2211 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 7.625 second response time [19:50:36] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 7.805 second response time [19:50:36] RECOVERY - HHVM rendering on mw2164 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.071 second response time [19:50:36] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.179 second response time [19:50:36] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.514 second response time [19:50:37] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.663 second response time [19:50:37] RECOVERY - HHVM rendering on mw2170 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.967 second response time [19:50:38] RECOVERY - HHVM rendering on mw2103 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 9.245 second response time [19:50:40] RECOVERY - HHVM rendering on mw2107 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 9.476 second response time [19:50:40] RECOVERY - HHVM rendering on mw2191 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.724 second response time [19:50:40] RECOVERY - HHVM rendering on mw2260 is OK: HTTP OK: HTTP/1.1 200 OK - 74696 bytes in 9.081 second response time [19:50:40] RECOVERY - HHVM rendering on mw2185 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 9.201 second response time [19:50:40] RECOVERY - HHVM rendering on mw2199 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 9.547 second response time [19:50:41] RECOVERY - HHVM rendering on mw2182 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 9.660 second response time [19:50:49] (03CR) 10jerkins-bot: [V: 04-1] [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 (owner: 10Krinkle) [19:51:16] PROBLEM - HHVM rendering on mw2184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:17] PROBLEM - HHVM rendering on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:26] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6380 [19:51:26] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL ERROR - Redis Library - can not ping 127.0.0.1 on port 6381 [19:51:26] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1495741878 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9675592 keys, up 2 minutes 40 seconds - replication_delay is 1495741878 [19:51:26] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 20375 bytes in 0.106 second response time [19:51:26] RECOVERY - HHVM rendering on mw2197 is OK: HTTP OK: HTTP/1.1 200 OK - 20375 bytes in 0.109 second response time [19:51:27] RECOVERY - HHVM rendering on mw2114 is OK: HTTP OK: HTTP/1.1 200 OK - 20375 bytes in 0.108 second response time [19:51:27] RECOVERY - HHVM rendering on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 20376 bytes in 0.125 second response time
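
[editor's note] The rdb2003 alerts above follow directly from the "!log restart redises on rdb2003" entry at 19:48:59: while an instance is restarting the probe cannot ping it, and immediately afterwards the replica has not yet re-synced, so replication_delay comes back as a raw Unix timestamp (1495741878 is 2017-05-25 19:51 UTC) rather than a small number of seconds. A minimal sketch of how such a probe could work, assuming the stock redis-py client and the fields Redis reports under INFO replication; this is an illustration of the idea, not Wikimedia's actual check script, and the warn/crit thresholds are guesses based on the "600" in the alert text:

    import time
    import redis  # stock redis-py client, assumed available

    def replication_delay(host="127.0.0.1", port=6379, warn=60, crit=600):
        """Rough replica-lag probe returning an Icinga-style status and a delay."""
        info = redis.Redis(host=host, port=port).info("replication")
        if info.get("role") == "master":
            return "OK", 0
        # master_last_io_seconds_ago is -1 while the link to the master is
        # down; falling back to the current epoch time then produces a huge
        # value, which is one plausible way a figure like 1495741878 ends up
        # in the alert line above.
        last_io = info.get("master_last_io_seconds_ago", -1)
        delay = int(time.time()) if last_io < 0 else last_io
        if delay >= crit:
            return "CRITICAL", delay
        if delay >= warn:
            return "WARNING", delay
        return "OK", delay

Once the replica catches up, a probe like this flips back to OK with replication_delay 0, which is exactly what the 19:52:26 recoveries below report.
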
[19:51:28] RECOVERY - HHVM rendering on mw2106 is OK: HTTP OK: HTTP/1.1 200 OK - 20376 bytes in 0.118 second response time [19:51:28] RECOVERY - HHVM rendering on mw2214 is OK: HTTP OK: HTTP/1.1 200 OK - 20376 bytes in 0.115 second response time [19:51:29] RECOVERY - HHVM rendering on mw2099 is OK: HTTP OK: HTTP/1.1 200 OK - 20376 bytes in 0.121 second response time [19:51:29] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 20376 bytes in 0.126 second response time [19:51:29] Hmm wikipedia is taking a long time for me to load [19:51:30] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 20376 bytes in 0.120 second response time [19:51:36] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:36] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:36] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:36] PROBLEM - HHVM rendering on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:36] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:37] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:37] PROBLEM - HHVM rendering on mw2230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:37] mediawiki loads fine. [19:51:38] PROBLEM - HHVM rendering on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:38] PROBLEM - HHVM rendering on mw2239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:51:39] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:52:06] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.005 second response time [19:52:06] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 74573 bytes in 5.196 second response time [19:52:16] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 4.156 second response time [19:52:16] RECOVERY - HHVM rendering on mw2184 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 4.595 second response time [19:52:26] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 9637473 keys, up 3 minutes 35 seconds - replication_delay is 0 [19:52:26] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 9541436 keys, up 3 minutes 32 seconds - replication_delay is 0 [19:52:26] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 74694 bytes in 0.728 second response time [19:52:26] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.228 second response time [19:52:26] RECOVERY - HHVM rendering on mw2230 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.269 second response time [19:52:27] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.787 second response time [19:52:27] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 74696 bytes in 1.781 second response time [19:52:28] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74696 bytes in 2.287 second response time [19:52:28] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.325 second response time [19:52:29] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: 
HTTP/1.1 200 OK - 74695 bytes in 2.790 second response time [19:52:29] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.850 second response time [19:52:30] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 3.314 second response time [19:52:30] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 3.305 second response time [19:52:31] RECOVERY - HHVM rendering on mw2205 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 3.490 second response time [19:53:06] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:07] PROBLEM - HHVM rendering on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:16] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:16] PROBLEM - HHVM rendering on mw2259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:26] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 9642825 keys, up 4 minutes 40 seconds - replication_delay is 0 [19:53:35] is it ok to quiet the bot until this hhvm thing is fixed? [19:53:36] PROBLEM - HHVM rendering on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:36] PROBLEM - HHVM rendering on mw2218 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:36] PROBLEM - HHVM rendering on mw2235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:36] PROBLEM - HHVM rendering on mw2241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:53:36] PROBLEM - HHVM rendering on mw2192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:54:40] Sagan: I'm not sure what's up or if things are resolved so better to not, thcipriani any thoughts on current storm? 
seems like it matched up with a deploy but idk, volans all I've done so far is restart redis on rdb2003 which may or may not have had an effect [19:54:56] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 2.689 second response time [19:55:06] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.681 second response time [19:55:16] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:26] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:36] PROBLEM - HHVM rendering on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:36] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:36] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:36] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:36] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:37] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:37] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:38] PROBLEM - HHVM rendering on mw2230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:38] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:39] PROBLEM - HHVM rendering on mw2239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:39] PROBLEM - HHVM rendering on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:41] PROBLEM - HHVM rendering on mw2114 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:41] PROBLEM - HHVM rendering on mw2197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:41] PROBLEM - HHVM rendering on mw2205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:43] chasemp: I don't have much access to these machines to be able to see what's happening there. It did match up with the deploy, but I've since reverted and deployed the revert to no avail. 
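
[editor's note] For what this storm is actually measuring: each "HHVM rendering" line is an HTTP probe of a single app server with a 10-second socket timeout, so CRITICAL means "no complete response within 10 s" rather than a confirmed crash, which is why hosts flap straight back to RECOVERY seconds later. A rough sketch of such a probe, with a hypothetical /wiki/Main_Page target URL standing in for whatever the real check requests:

    import time
    import urllib.request

    def check_rendering(host, timeout=10):
        """Mimic the OK / CRITICAL lines above for one app server."""
        url = "http://%s/wiki/Main_Page" % host  # placeholder page, an assumption
        start = time.time()
        try:
            # urlopen only returns on a successful (2xx) response
            body = urllib.request.urlopen(url, timeout=timeout).read()
        except OSError:
            # socket timeouts and urllib's URLError/HTTPError all subclass
            # OSError; this sketch collapses every failure into the timeout
            # message, whereas the real check distinguishes failure modes.
            return "CRITICAL - Socket timeout after %d seconds" % timeout
        return "HTTP OK: HTTP/1.1 200 OK - %d bytes in %.3f second response time" % (
            len(body), time.time() - start)

A probe like this keys only on getting a 200 and a body, which would also explain the much smaller ~20 kB "OK" responses scattered through the storm alongside the usual ~74 kB page.
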
[19:55:56] PROBLEM - HHVM rendering on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:55:57] the logs aren't telling me a whole lot either [19:56:05] chasemp: what I mean: I can quiet the bot in IRC until it's fixed, but I don't want to decide that, just offering [19:56:06] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad is CRITICAL: CRITICAL - failed 279 probes of 279 (alerts on 19) - https://atlas.ripe.net/measurements/1790947/#!map [19:56:16] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 6.468 second response time [19:56:26] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.060 second response time [19:56:27] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74664 bytes in 0.669 second response time [19:56:27] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 74693 bytes in 0.660 second response time [19:56:27] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 74696 bytes in 1.206 second response time [19:56:27] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 74696 bytes in 1.195 second response time [19:56:27] RECOVERY - HHVM rendering on mw2230 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.692 second response time [19:56:27] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.759 second response time [19:56:28] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.232 second response time [19:56:28] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.249 second response time [19:56:28] Sagan: I think we'd rather have it yell at us [19:56:29] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.787 second response time [19:56:29] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.815 second response time [19:56:30] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 74727 bytes in 3.368 second response time [19:56:30] RECOVERY - HHVM rendering on mw2150 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 3.612 second response time [19:56:31] RECOVERY - HHVM rendering on mw2114 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 3.645 second response time [19:57:16] PROBLEM - HHVM rendering on mw2144 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:16] PROBLEM - HHVM rendering on mw2142 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:16] PROBLEM - HHVM rendering on mw2128 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:16] PROBLEM - HHVM rendering on mw2139 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:16] PROBLEM - HHVM rendering on mw2124 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:17] PROBLEM - HHVM rendering on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:17] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:18] PROBLEM - HHVM rendering on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:26] PROBLEM - HHVM rendering on mw2203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:27] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:27] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:27] PROBLEM - HHVM rendering on mw2111 is CRITICAL: CRITICAL - 
Socket timeout after 10 seconds [19:57:27] PROBLEM - HHVM rendering on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:36] PROBLEM - HHVM rendering on mw2177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:36] PROBLEM - HHVM rendering on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:36] PROBLEM - HHVM rendering on mw2192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:36] PROBLEM - HHVM rendering on mw2228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:57:36] PROBLEM - HHVM rendering on mw2180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:59:53] RECOVERY - HHVM rendering on mw1297 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.209 second response time [19:59:53] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.255 second response time [19:59:56] RECOVERY - HHVM rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.178 second response time [19:59:56] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.188 second response time [19:59:56] RECOVERY - HHVM rendering on mw1176 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.226 second response time [19:59:56] RECOVERY - HHVM rendering on mw1172 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.250 second response time [19:59:56] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 20269 bytes in 0.199 second response time [19:59:57] RECOVERY - HHVM rendering on mw1263 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.211 second response time [19:59:57] RECOVERY - HHVM rendering on mw1187 is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.219 second response time [19:59:57] RECOVERY - HHVM rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 20281 bytes in 0.248 second response time [19:59:57] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.317 second response time [20:00:16] RECOVERY - HHVM rendering on mw2204 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 7.875 second response time [20:00:18] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.237 second response time [20:00:18] RECOVERY - HHVM rendering on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.353 second response time [20:00:18] RECOVERY - HHVM rendering on mw2169 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.693 second response time [20:00:18] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 8.889 second response time [20:00:18] RECOVERY - HHVM rendering on mw2184 is OK: HTTP OK: HTTP/1.1 200 OK - 74689 bytes in 5.479 second response time [20:00:18] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 9.436 second response time [20:00:18] (03CR) 10Dzahn: "this is in phabricator/main.pp and i don't see how it's possible to solve this without using hiera and without using "if $realm"." 
[puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [20:00:18] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 6.060 second response time [20:00:26] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.729 second response time [20:00:26] RECOVERY - HHVM rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.282 second response time [20:00:26] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.490 second response time [20:00:26] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.735 second response time [20:00:26] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.050 second response time [20:00:26] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.335 second response time [20:00:26] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.568 second response time [20:00:28] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 2.809 second response time [20:00:28] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 74696 bytes in 1.072 second response time [20:00:28] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.132 second response time [20:00:29] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.621 second response time [20:00:29] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 74695 bytes in 1.692 second response time [20:00:46] RECOVERY - HHVM rendering on mw1268 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 2.140 second response time [20:00:46] RECOVERY - HHVM rendering on mw1214 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 1.681 second response time [20:00:46] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.024 second response time [20:00:46] RECOVERY - HHVM rendering on mw1275 is OK: HTTP OK: HTTP/1.1 200 OK - 74594 bytes in 1.001 second response time [20:00:46] RECOVERY - HHVM rendering on mw1185 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.373 second response time [20:00:46] RECOVERY - HHVM rendering on mw1184 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.502 second response time [20:00:47] RECOVERY - HHVM rendering on mw1211 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 1.841 second response time [20:00:47] RECOVERY - HHVM rendering on mw1231 is OK: HTTP OK: HTTP/1.1 200 OK - 74628 bytes in 1.874 second response time [20:00:48] RECOVERY - HHVM rendering on mw1273 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 2.182 second response time [20:00:48] RECOVERY - HHVM rendering on mw1186 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.218 second response time [20:00:49] RECOVERY - HHVM rendering on mw1188 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.586 second response time [20:00:49] RECOVERY - HHVM rendering on mw1209 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.564 second response time [20:01:00] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 6.833 second response time [20:01:00] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.852 second response time [20:01:02] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 7.140 second response time [20:01:02] RECOVERY - HHVM rendering on 
mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.554 second response time [20:01:02] RECOVERY - HHVM rendering on mw1249 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.218 second response time [20:01:03] RECOVERY - HHVM rendering on mw1234 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.545 second response time [20:01:03] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.647 second response time [20:01:04] RECOVERY - HHVM rendering on mw1285 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.967 second response time [20:01:04] RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 6.340 second response time [20:01:07] RECOVERY - HHVM rendering on mw1224 is OK: HTTP OK: HTTP/1.1 200 OK - 74628 bytes in 6.761 second response time [20:01:07] RECOVERY - HHVM rendering on mw1173 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.351 second response time [20:01:07] RECOVERY - HHVM rendering on mw1213 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.862 second response time [20:01:07] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 6.175 second response time [20:01:07] RECOVERY - HHVM rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 6.187 second response time [20:01:07] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 6.602 second response time [20:01:08] RECOVERY - HHVM rendering on mw1222 is OK: HTTP OK: HTTP/1.1 200 OK - 74628 bytes in 7.062 second response time [20:01:08] RECOVERY - HHVM rendering on mw1182 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.379 second response time [20:01:08] RECOVERY - HHVM rendering on mw1225 is OK: HTTP OK: HTTP/1.1 200 OK - 74628 bytes in 7.883 second response time [20:01:20] PROBLEM - HHVM rendering on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:20] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:21] PROBLEM - HHVM rendering on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:22] PROBLEM - HHVM rendering on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:22] PROBLEM - HHVM rendering on mw2017 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:23] PROBLEM - HHVM rendering on mw2125 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:23] PROBLEM - HHVM rendering on mw2134 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:23] PROBLEM - HHVM rendering on mw2121 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:24] PROBLEM - HHVM rendering on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:24] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:26] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:26] PROBLEM - HHVM rendering on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:27] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:27] PROBLEM - HHVM rendering on mw2165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:27] PROBLEM - HHVM rendering on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:28] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:28] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:01:31] 
PROBLEM - HHVM rendering on mw2173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:06] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.819 second response time [20:03:16] PROBLEM - HHVM rendering on mw2138 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:16] PROBLEM - HHVM rendering on mw2167 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:16] PROBLEM - HHVM rendering on mw2204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:16] PROBLEM - HHVM rendering on mw2176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:16] PROBLEM - HHVM rendering on mw2174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:16] PROBLEM - HHVM rendering on mw2169 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:16] PROBLEM - HHVM rendering on mw2132 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:17] PROBLEM - HHVM rendering on mw2129 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] PROBLEM - HHVM rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] PROBLEM - HHVM rendering on mw2181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] PROBLEM - HHVM rendering on mw2233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:27] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:36] PROBLEM - HHVM rendering on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:36] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:36] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:36] PROBLEM - HHVM rendering on mw2239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:36] PROBLEM - HHVM rendering on mw2230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:36] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:36] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:37] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:37] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:38] (03PS4) 10Krinkle: [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 [20:03:38] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:38] PROBLEM - HHVM rendering on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:39] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:56] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:56] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:03:57] PROBLEM - HHVM rendering on mw1253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:06] PROBLEM - HHVM rendering on mw1232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:06] PROBLEM - HHVM 
rendering on mw1272 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:06] PROBLEM - HHVM rendering on mw1266 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:06] PROBLEM - HHVM rendering on mw1233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:06] PROBLEM - HHVM rendering on mw1173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:06] PROBLEM - HHVM rendering on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:06] PROBLEM - HHVM rendering on mw1172 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:07] PROBLEM - HHVM rendering on mw1262 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:07] PROBLEM - HHVM rendering on mw1187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:08] PROBLEM - HHVM rendering on mw1216 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:08] PROBLEM - HHVM rendering on mw1263 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:09] PROBLEM - HHVM rendering on mw1222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:36] PROBLEM - HHVM rendering on mw2106 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:04:36] PROBLEM - HHVM rendering on mw2097 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:26] RECOVERY - HHVM rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 5.244 second response time [20:05:26] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 5.382 second response time [20:05:26] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.643 second response time [20:05:26] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.012 second response time [20:05:27] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.245 second response time [20:05:27] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.479 second response time [20:05:27] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.669 second response time [20:05:36] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.047 second response time [20:05:36] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 9.346 second response time [20:05:36] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.569 second response time [20:05:36] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.854 second response time [20:05:36] PROBLEM - HHVM rendering on mw2187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:36] PROBLEM - HHVM rendering on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:36] PROBLEM - HHVM rendering on mw2179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:37] PROBLEM - HHVM rendering on mw2222 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:37] PROBLEM - HHVM rendering on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:56] PROBLEM - HHVM rendering on mw1193 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:56] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:56] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:56] PROBLEM - HHVM rendering on mw1190 is CRITICAL: CRITICAL - Socket timeout 
after 10 seconds [20:05:56] PROBLEM - HHVM rendering on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:05:56] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 9.054 second response time [20:05:57] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 9.410 second response time [20:05:57] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.762 second response time [20:06:06] RECOVERY - HHVM rendering on mw1233 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 9.644 second response time [20:06:06] RECOVERY - HHVM rendering on mw1232 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 9.913 second response time [20:06:06] PROBLEM - HHVM rendering on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:06] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:16] RECOVERY - HHVM rendering on mw2138 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.676 second response time [20:06:26] PROBLEM - HHVM rendering on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:26] PROBLEM - HHVM rendering on mw2184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:26] PROBLEM - HHVM rendering on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:06:46] RECOVERY - HHVM rendering on mw1193 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 3.321 second response time [20:06:56] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 3.546 second response time [20:06:56] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 4.162 second response time [20:06:56] RECOVERY - HHVM rendering on mw1179 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.200 second response time [20:06:56] RECOVERY - HHVM rendering on mw1190 is OK: HTTP OK: HTTP/1.1 200 OK - 74565 bytes in 6.784 second response time [20:07:06] RECOVERY - HHVM rendering on mw1266 is OK: HTTP OK: HTTP/1.1 200 OK - 74580 bytes in 5.424 second response time [20:07:06] RECOVERY - HHVM rendering on mw1272 is OK: HTTP OK: HTTP/1.1 200 OK - 74590 bytes in 5.660 second response time [20:07:06] RECOVERY - HHVM rendering on mw1173 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.209 second response time [20:07:06] RECOVERY - HHVM rendering on mw1172 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.416 second response time [20:07:06] RECOVERY - HHVM rendering on mw1176 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.777 second response time [20:07:06] RECOVERY - HHVM rendering on mw1270 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 6.072 second response time [20:07:06] RECOVERY - HHVM rendering on mw1182 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.104 second response time [20:07:07] RECOVERY - HHVM rendering on mw1221 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 6.486 second response time [20:07:07] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.540 second response time [20:07:08] RECOVERY - HHVM rendering on mw1218 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.860 second response time [20:07:08] RECOVERY - HHVM rendering on mw1216 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.927 second response time [20:07:09] RECOVERY - HHVM rendering on mw1262 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.255 second response time [20:07:26] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 
9.306 second response time [20:07:26] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 6.479 second response time [20:07:27] RECOVERY - HHVM rendering on mw2223 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.570 second response time [20:07:27] RECOVERY - HHVM rendering on mw2230 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.011 second response time [20:07:27] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 7.056 second response time [20:07:27] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.513 second response time [20:07:27] RECOVERY - HHVM rendering on mw2221 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.567 second response time [20:07:28] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.956 second response time [20:07:28] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.040 second response time [20:07:36] RECOVERY - HHVM rendering on mw2196 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.426 second response time [20:07:36] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 7.568 second response time [20:07:36] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.912 second response time [20:07:40] RECOVERY - HHVM rendering on mw2206 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.990 second response time [20:07:40] RECOVERY - HHVM rendering on mw2207 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.430 second response time [20:07:40] RECOVERY - HHVM rendering on mw2103 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.498 second response time [20:07:40] RECOVERY - HHVM rendering on mw2179 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.953 second response time [20:07:40] RECOVERY - HHVM rendering on mw2107 is OK: HTTP OK: HTTP/1.1 200 OK - 74647 bytes in 9.006 second response time [20:07:40] RECOVERY - HHVM rendering on mw2211 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.401 second response time [20:07:40] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.511 second response time [20:07:40] RECOVERY - HHVM rendering on mw2200 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.810 second response time [20:07:40] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.926 second response time [20:07:50] PROBLEM - HHVM rendering on mw1275 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:50] PROBLEM - HHVM rendering on mw1185 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:51] PROBLEM - HHVM rendering on mw1231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:51] PROBLEM - HHVM rendering on mw1211 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:56] PROBLEM - HHVM rendering on mw1188 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:56] PROBLEM - HHVM rendering on mw1209 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:56] PROBLEM - HHVM rendering on mw1212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:56] PROBLEM - HHVM rendering on mw1184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:56] PROBLEM - HHVM rendering on mw1186 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:07:56] PROBLEM - HHVM rendering on mw1261 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:06] RECOVERY - HHVM rendering on mw1170 
is OK: HTTP OK: HTTP/1.1 200 OK - 20270 bytes in 0.210 second response time [20:08:06] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.256 second response time [20:08:16] RECOVERY - HHVM rendering on mw2139 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.361 second response time [20:08:16] RECOVERY - HHVM rendering on mw2110 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.754 second response time [20:08:16] RECOVERY - HHVM rendering on mw2216 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.914 second response time [20:08:16] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.198 second response time [20:08:16] RECOVERY - HHVM rendering on mw2253 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 9.378 second response time [20:08:16] RECOVERY - HHVM rendering on mw2130 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.779 second response time [20:08:26] PROBLEM - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough [20:08:27] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:27] PROBLEM - HHVM rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:27] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:27] PROBLEM - HHVM rendering on mw2173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:27] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:27] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:27] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:28] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:28] PROBLEM - HHVM rendering on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:29] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:29] PROBLEM - HHVM rendering on mw2233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:30] PROBLEM - HHVM rendering on mw2165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:30] PROBLEM - HHVM rendering on mw2203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:08:42] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:09:06] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 4.101 second response time [20:09:06] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 74603 bytes in 5.626 second response time [20:09:06] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.044 second response time [20:09:06] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.257 second response time [20:09:07] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 7.230 second response time [20:09:07] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 7.708 second response time [20:09:15] (03PS5) 10Krinkle: [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 [20:09:16] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.297 second response time [20:09:16] RECOVERY - HHVM 
rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.384 second response time [20:09:16] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.824 second response time [20:09:17] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 8.416 second response time [20:09:26] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 8.606 second response time [20:09:26] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.875 second response time [20:09:26] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.150 second response time [20:09:26] RECOVERY - HHVM rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.360 second response time [20:09:27] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.588 second response time [20:09:27] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.804 second response time [20:09:27] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.658 second response time [20:09:27] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.901 second response time [20:09:27] RECOVERY - HHVM rendering on mw2252 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.149 second response time [20:09:28] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.414 second response time [20:09:28] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.645 second response time [20:09:29] RECOVERY - HHVM rendering on mw2244 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.948 second response time [20:09:36] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.562 second response time [20:09:36] RECOVERY - HHVM rendering on mw2258 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.964 second response time [20:09:46] RECOVERY - IPv4 ping to eqiad on ripe-atlas-eqiad is OK: OK - failed 3 probes of 415 (alerts on 19) - https://atlas.ripe.net/measurements/1790945/#!map [20:09:46] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 4.664 second response time [20:09:46] RECOVERY - HHVM rendering on mw1226 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 4.682 second response time [20:09:46] RECOVERY - HHVM rendering on mw1271 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.018 second response time [20:09:46] RECOVERY - HHVM rendering on mw1279 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.039 second response time [20:09:46] RECOVERY - HHVM rendering on mw1241 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.365 second response time [20:09:46] RECOVERY - HHVM rendering on mw1278 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.370 second response time [20:09:47] RECOVERY - HHVM rendering on mw1237 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 5.763 second response time [20:09:47] RECOVERY - HHVM rendering on mw1230 is OK: HTTP OK: HTTP/1.1 200 OK - 74630 bytes in 5.898 second response time [20:09:48] RECOVERY - HHVM rendering on mw1276 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 6.129 second response time [20:09:48] RECOVERY - HHVM rendering on mw1264 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.331 second response time [20:09:49] RECOVERY - HHVM rendering on mw1214 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 
5.395 second response time [20:10:26] PROBLEM - HHVM rendering on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] PROBLEM - HHVM rendering on mw2230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] PROBLEM - HHVM rendering on mw2223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] PROBLEM - HHVM rendering on mw2198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] PROBLEM - HHVM rendering on mw2221 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:36] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:37] PROBLEM - HHVM rendering on mw2256 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:37] PROBLEM - HHVM rendering on mw2245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:38] PROBLEM - HHVM rendering on mw2196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:38] PROBLEM - HHVM rendering on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:10:39] PROBLEM - HHVM rendering on mw2151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:16] PROBLEM - HHVM rendering on mw2110 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:16] PROBLEM - HHVM rendering on mw2135 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:11:16] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.514 second response time [20:11:27] RECOVERY - HHVM rendering on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.909 second response time [20:11:36] RECOVERY - HHVM rendering on mw2192 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.213 second response time [20:11:36] RECOVERY - HHVM rendering on mw2239 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.734 second response time [20:12:06] PROBLEM - HHVM rendering on mw1206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:06] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:06] RECOVERY - HHVM rendering on mw2167 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.768 second response time [20:12:07] RECOVERY - HHVM rendering on mw2213 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.296 second response time [20:12:07] RECOVERY - HHVM rendering on mw2137 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.196 second response time [20:12:16] RECOVERY - HHVM rendering on mw2149 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.650 second response time [20:12:16] RECOVERY - HHVM rendering on mw2108 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.208 second response time [20:12:16] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:16] PROBLEM - HHVM rendering on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:16] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:16] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:16] RECOVERY - HHVM rendering on mw2127 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.499 second response time [20:12:17] RECOVERY - HHVM rendering on mw2145 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 9.693 second response time [20:12:17] PROBLEM - HHVM 
rendering on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:18] PROBLEM - HHVM rendering on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:18] PROBLEM - HHVM rendering on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:26] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:26] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:26] PROBLEM - HHVM rendering on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:26] PROBLEM - HHVM rendering on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:26] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:26] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:27] PROBLEM - HHVM rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:27] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:27] PROBLEM - HHVM rendering on mw2233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:28] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:12:28] PROBLEM - HHVM rendering on mw2173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:13:06] RECOVERY - HHVM rendering on mw1206 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 2.907 second response time [20:13:06] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 2.098 second response time [20:13:06] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 74594 bytes in 0.592 second response time [20:13:06] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.679 second response time [20:13:06] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 1.010 second response time [20:13:06] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 1.195 second response time [20:13:06] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.729 second response time [20:13:07] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.804 second response time [20:13:07] RECOVERY - HHVM rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.922 second response time [20:13:08] RECOVERY - HHVM rendering on mw2142 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 6.036 second response time [20:13:16] RECOVERY - HHVM rendering on mw2110 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.833 second response time [20:13:17] RECOVERY - HHVM rendering on mw2135 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.206 second response time [20:13:17] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 74594 bytes in 0.528 second response time [20:13:17] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 74594 bytes in 0.568 second response time [20:13:17] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 74625 bytes in 0.976 second response time [20:13:17] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 1.137 second response time [20:13:26] RECOVERY - HHVM rendering on mw2236 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.449 second response time [20:13:26] RECOVERY - HHVM rendering on mw2259 is OK: HTTP 
OK: HTTP/1.1 200 OK - 74616 bytes in 9.628 second response time [20:13:26] RECOVERY - HHVM rendering on mw2184 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.879 second response time [20:13:27] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.680 second response time [20:14:06] PROBLEM - HHVM rendering on mw1203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:06] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:06] PROBLEM - HHVM rendering on mw1189 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:06] PROBLEM - HHVM rendering on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:07] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:07] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:07] RECOVERY - HHVM rendering on mw2144 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 7.285 second response time [20:14:07] RECOVERY - HHVM rendering on mw2128 is OK: HTTP OK: HTTP/1.1 200 OK - 74647 bytes in 8.415 second response time [20:14:16] RECOVERY - HHVM rendering on mw2136 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.154 second response time [20:14:16] RECOVERY - HHVM rendering on mw2124 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.672 second response time [20:14:16] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:26] PROBLEM - HHVM rendering on mw2203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:36] PROBLEM - HHVM rendering on mw2192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:14:36] PROBLEM - HHVM rendering on mw2239 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:15:03] marostegui: see for example mw1204 or mw2192 just now again going critical for hhvm [20:15:06] RECOVERY - HHVM rendering on mw1203 is OK: HTTP OK: HTTP/1.1 200 OK - 74629 bytes in 5.913 second response time [20:15:06] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.705 second response time [20:15:06] RECOVERY - HHVM rendering on mw1189 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 6.830 second response time [20:15:06] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.359 second response time [20:15:06] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.736 second response time [20:15:07] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.893 second response time [20:15:14] maybe icinga is behind [20:15:16] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 5.610 second response time [20:15:22] and mw1204 recovery [20:15:26] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.949 second response time [20:15:27] RECOVERY - HHVM rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.427 second response time [20:15:27] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.870 second response time [20:15:27] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 8.630 second response time [20:15:27] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.777 second response time [20:15:27] RECOVERY - HHVM rendering on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 
bytes in 9.075 second response time [20:15:27] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.327 second response time [20:15:28] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.544 second response time [20:15:28] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 74609 bytes in 9.846 second response time [20:15:36] RECOVERY - HHVM rendering on mw2230 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.442 second response time [20:15:36] RECOVERY - HHVM rendering on mw2192 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.643 second response time [20:15:36] RECOVERY - HHVM rendering on mw2235 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.949 second response time [20:15:44] icinga is always behind :D [20:15:50] :) [20:16:02] hi ops! [20:16:09] (03PS13) 10Krinkle: dynamicproxy: Centralise error page template and use it [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) [20:16:16] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:16] PROBLEM - HHVM rendering on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:16] PROBLEM - HHVM rendering on mw1298 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:17] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:17] PROBLEM - HHVM rendering on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:17] PROBLEM - HHVM rendering on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:25] (03Abandoned) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257 (owner: 10Krinkle) [20:16:26] PROBLEM - HHVM rendering on mw1171 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:26] PROBLEM - HHVM rendering on mw2236 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:26] PROBLEM - HHVM rendering on mw2184 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:26] PROBLEM - HHVM rendering on mw2259 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:26] RECOVERY - HHVM rendering on mw2100 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.134 second response time [20:16:26] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:16:28] <_joe_> is the problem still ongoing? [20:16:32] fr-tech deleted some i18n strings in DonationInterface which are used on donatewiki [20:16:33] <_joe_> wtf is going on? 
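For readers following along: the "HHVM rendering" alerts flooding the channel are HTTP checks against individual appservers with a 10-second socket timeout. A rough way to reproduce one by hand is sketched below; the exact URL path and options of the production Icinga check are assumptions, and the host list is illustrative:

    # Fetch a page from specific appservers with a 10s cap, mimicking the
    # flapping "HHVM rendering" check.
    for host in mw1204.eqiad.wmnet mw2192.codfw.wmnet; do
      curl -s -o /dev/null -m 10 -H 'Host: en.wikipedia.org' \
        -w "$host: HTTP %{http_code}, %{size_download} bytes in %{time_total}s\n" \
        "http://$host/wiki/Main_Page" || echo "$host: timed out or refused"
    done

Responses hovering near the 10-second mark explain why the same hosts keep bouncing between PROBLEM and RECOVERY.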
[20:16:42] we reverted the change [20:16:43] <_joe_> ejegg: not now please [20:16:57] _joe_: well we thought maybe not but it seems possible and we are not entirely sure, rashes of hhvm failures and recoveries for about an hour [20:17:03] ok, sorry [20:17:06] RECOVERY - HHVM rendering on mw1298 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 4.261 second response time [20:17:06] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.351 second response time [20:17:06] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.770 second response time [20:17:06] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.848 second response time [20:17:07] RECOVERY - HHVM rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.016 second response time [20:17:07] RECOVERY - HHVM rendering on mw1175 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.387 second response time [20:17:16] we thought it was a bad redis node or related and then possibly some db maint (ping marostegui) but that may not be actually explaining the issue [20:17:16] RECOVERY - HHVM rendering on mw1171 is OK: HTTP OK: HTTP/1.1 200 OK - 74593 bytes in 0.967 second response time [20:17:20] <_joe_> 4 second time for a response? [20:17:26] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.494 second response time [20:17:26] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.977 second response time [20:17:26] RECOVERY - HHVM rendering on mw2173 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.994 second response time [20:17:26] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.508 second response time [20:17:27] RECOVERY - HHVM rendering on mw2111 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.533 second response time [20:17:27] RECOVERY - HHVM rendering on mw2252 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.013 second response time [20:17:27] RECOVERY - HHVM rendering on mw2244 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.990 second response time [20:17:27] RECOVERY - HHVM rendering on mw2254 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 7.477 second response time [20:17:28] RECOVERY - HHVM rendering on mw2228 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.560 second response time [20:17:28] RECOVERY - HHVM rendering on mw2177 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 7.913 second response time [20:17:29] RECOVERY - HHVM rendering on mw2258 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 8.062 second response time [20:17:36] RECOVERY - HHVM rendering on mw2237 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.448 second response time [20:17:36] RECOVERY - HHVM rendering on mw2241 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 8.576 second response time [20:17:36] RECOVERY - HHVM rendering on mw2251 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 8.956 second response time [20:17:36] RECOVERY - HHVM rendering on mw2255 is OK: HTTP OK: HTTP/1.1 200 OK - 74616 bytes in 9.072 second response time [20:17:36] RECOVERY - HHVM rendering on mw2231 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.462 second response time [20:17:36] RECOVERY - HHVM rendering on mw2227 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.615 second response time [20:17:36] RECOVERY - HHVM rendering on mw2180 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.870 second response time [20:17:41] <_joe_> codfw
also [20:17:46] <_joe_> that doesn't make any sense [20:17:54] <_joe_> unless it's a db maintenance, yes [20:18:01] <_joe_> or a bad code deploy [20:18:02] yes, s4 is under maintenance in codw [20:18:07] *codfw [20:18:16] PROBLEM - HHVM rendering on mw1204 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:26] <_joe_> ok so why do we see timeouts in eqiad on depooled servers like mw1170? [20:18:26] PROBLEM - HHVM rendering on mw1248 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:26] PROBLEM - HHVM rendering on mw1219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:26] PROBLEM - HHVM rendering on mw1290 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:26] PROBLEM - HHVM rendering on mw2202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:27] PROBLEM - HHVM rendering on mw2232 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:27] PROBLEM - HHVM rendering on mw2210 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:27] PROBLEM - HHVM rendering on mw2226 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:27] PROBLEM - HHVM rendering on mw2233 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:27] PROBLEM - HHVM rendering on mw2212 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:28] PROBLEM - HHVM rendering on mw2203 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:28] PROBLEM - HHVM rendering on mw2146 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:29] PROBLEM - HHVM rendering on mw2165 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:30] <_joe_> it's the db for sure [20:18:35] _joe_: there was a deploy and revert I believe but unsure if related [20:18:36] PROBLEM - HHVM rendering on mw2235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:36] PROBLEM - HHVM rendering on mw2230 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:36] PROBLEM - HHVM rendering on mw2192 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:18:36] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 299.14 seconds [20:18:42] that was before for lag on s1 api DBs [20:18:49] but now that seems recovered [20:19:16] RECOVERY - HHVM rendering on mw1290 is OK: HTTP OK: HTTP/1.1 200 OK - 74602 bytes in 8.091 second response time [20:19:26] (03CR) 10Krinkle: "Rebased to not depend on the Varnish changes." 
[puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [20:19:26] RECOVERY - HHVM rendering on mw1219 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 8.283 second response time [20:19:26] RECOVERY - HHVM rendering on mw1248 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 8.417 second response time [20:19:36] mw1174 mw1174 mw1174 just now threw issue so I'm not sure [20:19:36] PROBLEM - HHVM rendering on mw2100 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:06] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74629 bytes in 2.971 second response time [20:20:26] RECOVERY - HHVM rendering on mw2115 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.132 second response time [20:20:27] RECOVERY - HHVM rendering on mw2101 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.138 second response time [20:20:27] RECOVERY - HHVM rendering on mw2133 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.140 second response time [20:20:27] PROBLEM - HHVM rendering on mw2215 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:27] PROBLEM - HHVM rendering on mw2252 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:27] PROBLEM - HHVM rendering on mw2181 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:27] PROBLEM - HHVM rendering on mw2173 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:28] PROBLEM - HHVM rendering on mw2111 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:28] PROBLEM - HHVM rendering on mw2244 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:29] PROBLEM - HHVM rendering on mw2219 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:36] PROBLEM - HHVM rendering on mw2228 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:36] PROBLEM - HHVM rendering on mw2258 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:36] PROBLEM - HHVM rendering on mw2254 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:36] PROBLEM - HHVM rendering on mw2237 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:36] PROBLEM - HHVM rendering on mw2255 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:36] PROBLEM - HHVM rendering on mw2177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:36] PROBLEM - HHVM rendering on mw2227 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:37] PROBLEM - HHVM rendering on mw2251 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:40] PROBLEM - HHVM rendering on mw2241 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:40] PROBLEM - HHVM rendering on mw2180 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:40] PROBLEM - HHVM rendering on mw2231 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:56] PROBLEM - HHVM rendering on mw1179 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:56] PROBLEM - HHVM rendering on mw1205 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:56] PROBLEM - HHVM rendering on mw1202 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:56] PROBLEM - HHVM rendering on mw1199 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:56] PROBLEM - HHVM rendering on mw1178 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:56] PROBLEM - HHVM rendering on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:20:57] PROBLEM - HHVM rendering on mw1281 is CRITICAL: CRITICAL - Socket 
timeout after 10 seconds [20:20:57] PROBLEM - HHVM rendering on mw1253 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:06] PROBLEM - HHVM rendering on mw1176 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:06] (03PS5) 10Krinkle: varnish: Make errorpage.html balanced and use placeholder [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [20:21:06] PROBLEM - HHVM rendering on mw1197 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:07] PROBLEM - HHVM rendering on mw1196 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:07] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:07] PROBLEM - HHVM rendering on mw1198 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:21:16] RECOVERY - HHVM rendering on mw2174 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.362 second response time [20:21:26] RECOVERY - HHVM rendering on mw2206 is OK: HTTP OK: HTTP/1.1 200 OK - 20269 bytes in 0.102 second response time [20:21:26] RECOVERY - HHVM rendering on mw2187 is OK: HTTP OK: HTTP/1.1 200 OK - 20269 bytes in 0.096 second response time [20:21:26] RECOVERY - HHVM rendering on mw2164 is OK: HTTP OK: HTTP/1.1 200 OK - 20269 bytes in 0.100 second response time [20:21:26] RECOVERY - HHVM rendering on mw2245 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.118 second response time [20:21:26] RECOVERY - HHVM rendering on mw2151 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.111 second response time [20:21:26] RECOVERY - HHVM rendering on mw2107 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.113 second response time [20:21:56] RECOVERY - HHVM rendering on mw1179 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.055 second response time [20:21:56] RECOVERY - HHVM rendering on mw1178 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 7.322 second response time [20:21:56] RECOVERY - HHVM rendering on mw1202 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.601 second response time [20:22:16] RECOVERY - HHVM rendering on mw2169 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.130 second response time [20:22:16] RECOVERY - HHVM rendering on mw2176 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 8.343 second response time [20:22:16] RECOVERY - HHVM rendering on mw2204 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 8.474 second response time [20:22:16] RECOVERY - HHVM rendering on mw2132 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.089 second response time [20:22:16] RECOVERY - HHVM rendering on mw2129 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.391 second response time [20:22:16] RECOVERY - HHVM rendering on mw2121 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.224 second response time [20:22:21] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3293176 (10Ottomata) [20:22:26] RECOVERY - HHVM rendering on mw2256 is OK: HTTP OK: HTTP/1.1 200 OK - 20272 bytes in 0.111 second response time [20:22:26] RECOVERY - HHVM rendering on mw2188 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.113 second response time [20:22:26] RECOVERY - HHVM rendering on mw2211 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.120 second response time [20:22:26] RECOVERY - HHVM rendering on mw2222 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.120 second response time [20:22:26] RECOVERY - HHVM rendering on mw2103 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 
bytes in 0.136 second response time [20:22:26] RECOVERY - HHVM rendering on mw2198 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.138 second response time [20:22:26] RECOVERY - HHVM rendering on mw2170 is OK: HTTP OK: HTTP/1.1 200 OK - 20271 bytes in 0.121 second response time [20:22:56] RECOVERY - HHVM rendering on mw1199 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 3.906 second response time [20:22:56] RECOVERY - HHVM rendering on mw1205 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 4.718 second response time [20:22:56] RECOVERY - HHVM rendering on mw1253 is OK: HTTP OK: HTTP/1.1 200 OK - 74596 bytes in 4.611 second response time [20:22:56] RECOVERY - HHVM rendering on mw1281 is OK: HTTP OK: HTTP/1.1 200 OK - 74574 bytes in 4.814 second response time [20:22:56] RECOVERY - HHVM rendering on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.064 second response time [20:23:06] RECOVERY - HHVM rendering on mw1176 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 3.203 second response time [20:23:06] RECOVERY - HHVM rendering on mw1196 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.099 second response time [20:23:06] RECOVERY - HHVM rendering on mw1197 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 5.360 second response time [20:23:06] RECOVERY - HHVM rendering on mw1198 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.858 second response time [20:23:06] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 74623 bytes in 5.914 second response time [20:23:36] PROBLEM - HHVM rendering on mw2115 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:36] PROBLEM - HHVM rendering on mw2133 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:23:36] PROBLEM - HHVM rendering on mw2101 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:06] PROBLEM - HHVM rendering on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:17] PROBLEM - HHVM rendering on mw1177 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:19] PROBLEM - HHVM rendering on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:19] PROBLEM - HHVM rendering on mw1170 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:19] PROBLEM - HHVM rendering on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:19] PROBLEM - HHVM rendering on mw1175 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:19] PROBLEM - HHVM rendering on mw1174 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:19] RECOVERY - HHVM rendering on mw2017 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 9.305 second response time [20:24:19] RECOVERY - HHVM rendering on mw2125 is OK: HTTP OK: HTTP/1.1 200 OK - 74615 bytes in 9.638 second response time [20:24:19] RECOVERY - HHVM rendering on mw2134 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 9.875 second response time [20:24:36] PROBLEM - HHVM rendering on mw2164 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:36] PROBLEM - HHVM rendering on mw2206 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:36] PROBLEM - HHVM rendering on mw2187 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:36] PROBLEM - HHVM rendering on mw2245 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:36] PROBLEM - HHVM rendering on mw2151 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:36] PROBLEM - HHVM rendering on mw2107 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:24:57] RECOVERY - HHVM 
rendering on mwdebug1001 is OK: HTTP OK: HTTP/1.1 200 OK - 74613 bytes in 0.898 second response time [20:25:05] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: wikipedias back to 1.30.0-wmf.1 [20:25:07] RECOVERY - HHVM rendering on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.446 second response time [20:25:08] RECOVERY - HHVM rendering on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 5.559 second response time [20:25:08] RECOVERY - HHVM rendering on mw1177 is OK: HTTP OK: HTTP/1.1 200 OK - 74595 bytes in 6.026 second response time [20:25:08] RECOVERY - HHVM rendering on mw1170 is OK: HTTP OK: HTTP/1.1 200 OK - 74627 bytes in 6.602 second response time [20:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:25:16] RECOVERY - HHVM rendering on mw1174 is OK: HTTP OK: HTTP/1.1 200 OK - 74565 bytes in 8.184 second response time [20:25:16] RECOVERY - HHVM rendering on mw2202 is OK: HTTP OK: HTTP/1.1 200 OK - 74503 bytes in 0.126 second response time [20:25:17] RECOVERY - HHVM rendering on mw2232 is OK: HTTP OK: HTTP/1.1 200 OK - 74530 bytes in 0.108 second response time [20:25:17] RECOVERY - HHVM rendering on mw2215 is OK: HTTP OK: HTTP/1.1 200 OK - 74530 bytes in 0.106 second response time [20:25:17] RECOVERY - HHVM rendering on mw2210 is OK: HTTP OK: HTTP/1.1 200 OK - 74531 bytes in 0.117 second response time [20:25:17] RECOVERY - HHVM rendering on mw2146 is OK: HTTP OK: HTTP/1.1 200 OK - 74503 bytes in 0.107 second response time [20:25:17] RECOVERY - HHVM rendering on mw2203 is OK: HTTP OK: HTTP/1.1 200 OK - 74502 bytes in 0.105 second response time [20:25:17] RECOVERY - HHVM rendering on mw2233 is OK: HTTP OK: HTTP/1.1 200 OK - 74502 bytes in 0.110 second response time [20:25:18] RECOVERY - HHVM rendering on mw2212 is OK: HTTP OK: HTTP/1.1 200 OK - 74502 bytes in 0.112 second response time [20:25:18] RECOVERY - HHVM rendering on mw2226 is OK: HTTP OK: HTTP/1.1 200 OK - 74502 bytes in 0.105 second response time [20:25:19] RECOVERY - HHVM rendering on mw2181 is OK: HTTP OK: HTTP/1.1 200 OK - 74503 bytes in 0.124 second response time [20:25:19] RECOVERY - HHVM rendering on mw2165 is OK: HTTP OK: HTTP/1.1 200 OK - 74501 bytes in 0.109 second response time [20:25:20] RECOVERY - HHVM rendering on mw2244 is OK: HTTP OK: HTTP/1.1 200 OK - 74503 bytes in 0.118 second response time [20:25:20] RECOVERY - HHVM rendering on mw2219 is OK: HTTP OK: HTTP/1.1 200 OK - 74502 bytes in 0.105 second response time [20:26:26] (03PS6) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [20:26:42] (03Abandoned) 10Krinkle: [WIP] varnish: Try to somehow embed errorpage template in VCL directly [puppet] - 10https://gerrit.wikimedia.org/r/355652 (owner: 10Krinkle) [20:27:31] (03Abandoned) 10Krinkle: varnish: Convert errorpage into re-usable template [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle) [20:28:13] (03PS5) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) [20:30:17] !log T164865: RESTBase dev, disable revision range deletes [20:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:26] T164865: Prototype and test range delete-based current revision storage -
https://phabricator.wikimedia.org/T164865 [20:30:56] PROBLEM - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: connect to address 10.192.48.50 and port 9042: Connection refused [20:31:56] PROBLEM - cassandra-b SSL 10.192.48.50:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused [20:32:58] PROBLEM - cassandra-b service on restbase2006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [20:33:16] PROBLEM - Check systemd state on restbase2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:33:23] urandom: if you can take a look ^^^ we're a bit busy with the other issue :) [20:35:02] (03PS14) 10Krinkle: dynamicproxy: Centralise error page template and use it [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) [20:35:04] (03PS7) 10Krinkle: varnish: Avoid std.fileread() and use new errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [20:35:06] (03PS6) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) [20:35:56] RECOVERY - cassandra-b service on restbase2006 is OK: OK - cassandra-b is active [20:36:16] RECOVERY - Check systemd state on restbase2006 is OK: OK - running: The system is fully operational [20:37:56] RECOVERY - cassandra-b SSL 10.192.48.50:7001 on restbase2006 is OK: SSL OK - Certificate restbase2006-b valid until 2017-09-12 15:35:45 +0000 (expires in 109 days) [20:37:56] RECOVERY - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is OK: TCP OK - 0.001 second response time on 10.192.48.50 port 9042 [20:39:59] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3293256 (10RobH) SSDs in this timeline aren't possible, not if we want them under warranty with the system vendor. [20:40:29] 06Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3293259 (10Marostegui) p:05Triage>03Normal [20:40:38] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: SSDs for main Kafka clusters - https://phabricator.wikimedia.org/T166341#3293262 (10RobH) a:03Ottomata Is this something that you want done in next year's budget, or is it now invalid? Please advise. [20:41:06] PROBLEM - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [20:41:36] PROBLEM - cassandra-b SSL 10.192.48.50:7001 on restbase2006 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer [20:43:56] PROBLEM - cassandra-b service on restbase2006 is CRITICAL: CRITICAL - Expecting active but unit cassandra-b is failed [20:44:16] PROBLEM - Check systemd state on restbase2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [20:44:56] @thcipriani: Any update on the train status? We just announced a new feature to the community, but it was rolled back with the train. We just need to know if we need to explain that or if the train will be redeployed shortly.
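A quick way to answer "what is this wiki running right now" questions like the one above is to ask the API which branch is being served (a sketch; assumes jq is installed on the client):

    # The siteinfo "generator" field reports the live MediaWiki version,
    # e.g. "MediaWiki 1.30.0-wmf.1" after the rollback logged earlier.
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&format=json' \
      | jq -r '.query.general.generator'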
[20:45:34] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3293278 (10RobH) [20:46:10] 06Operations, 10Analytics, 10EventBus, 10hardware-requests, and 2 others: New SCB nodes - https://phabricator.wikimedia.org/T166342#3293284 (10Ottomata) Thanks @RobH [20:46:25] kaldari: still figuring out train status, for the moment there is no short-term plan to roll forward again if that's helpful, that's about all I know at the moment. [20:46:38] Thanks, that's helpful! [20:47:55] thcipriani: what change was rolled back? i'm a tech ambassador and i need to know in case people ask [20:48:42] Zppix: we've rolled back from wmf.2 -> wmf.1 for wikipedias only. [20:49:46] <_joe_> kaldari: it won't be deployed I guess [20:50:08] thcipriani: is that due to that one ext (i forget the name) [20:55:24] (03PS1) 10Hashar: Fix spec for various modules [puppet] - 10https://gerrit.wikimedia.org/r/355695 [20:56:01] Does anyone know if the person who organises the tech news is on IRC? [20:57:32] (03PS1) 10Dzahn: wikistats: ensure Apache PHP module is installed before site [puppet] - 10https://gerrit.wikimedia.org/r/355706 [20:58:35] (03CR) 10jerkins-bot: [V: 04-1] wikistats: ensure Apache PHP module is installed before site [puppet] - 10https://gerrit.wikimedia.org/r/355706 (owner: 10Dzahn) [21:00:41] Zppix: i just learned you mean the thing on meta wiki, not the blog. so yea. it's Johan https://meta.wikimedia.org/wiki/User:Johan_%28WMF%29 [21:00:51] on IRC as "JohanJ" [21:01:14] Ok thanks dzahn [21:04:11] (03PS2) 10Dzahn: wikistats: ensure Apache PHP module is installed before site [puppet] - 10https://gerrit.wikimedia.org/r/355706 [21:04:13] 06Operations, 10Phabricator, 13Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3293326 (10Paladox) p:05Triage>03High Setting as high as i do not want to spam ops. [21:05:55] 06Operations, 06Labs, 10Phabricator, 13Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3293330 (10Zppix) [21:05:56] RECOVERY - cassandra-b service on restbase2006 is OK: OK - cassandra-b is active [21:06:16] RECOVERY - Check systemd state on restbase2006 is OK: OK - running: The system is fully operational [21:06:56] RECOVERY - cassandra-b CQL 10.192.48.50:9042 on restbase2006 is OK: TCP OK - 0.000 second response time on 10.192.48.50 port 9042 [21:07:14] ACKNOWLEDGEMENT - MegaRAID on db1016 is CRITICAL: CRITICAL: 1 LD(s) must have write cache policy WriteBack, currently using: WriteThrough Volans https://phabricator.wikimedia.org/T166344 [21:07:16] RECOVERY - cassandra-b SSL 10.192.48.50:7001 on restbase2006 is OK: SSL OK - Certificate restbase2006-b valid until 2017-09-12 15:35:45 +0000 (expires in 109 days) [21:12:52] 06Operations, 10ops-eqiad, 10DBA: db1016 m1 master: Possibly faulty BBU - https://phabricator.wikimedia.org/T166344#3293244 (10Volans) I've ack'ed the Icinga alarm with this task. I've also forced a BBU learn cycle on db1016, it was looking good during the cycle, and as soon as the battery was having some c...
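For the db1016 MegaRAID alert above (WriteThrough instead of WriteBack, typically a battery problem), the usual spot checks look roughly like the following; the binary is often installed as MegaCli or MegaCli64, and the flag spelling here follows common usage rather than what was actually run for T166344:

    # Show the current write-cache policy of the logical drives (run as root).
    megacli -LDInfo -Lall -aALL | grep -i 'cache policy'
    # Kick off a BBU learn cycle, as mentioned in the task.
    megacli -AdpBbuCmd -BbuLearn -aALL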
[21:13:45] (03PS1) 10Thcipriani: Revert "Add Code of Conduct footer links to wikitech and mw.o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355718 [21:14:02] (03CR) 10Thcipriani: [C: 032] Revert "Add Code of Conduct footer links to wikitech and mw.o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355718 (owner: 10Thcipriani) [21:14:25] (03PS1) 10Thcipriani: Revert "all wikis to 1.30.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355719 [21:15:37] (03CR) 10Thcipriani: [C: 032] Revert "all wikis to 1.30.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355719 (owner: 10Thcipriani) [21:16:37] hi [21:16:43] are people debugging somewhere? [21:16:44] (03Merged) 10jenkins-bot: Revert "Add Code of Conduct footer links to wikitech and mw.o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355718 (owner: 10Thcipriani) [21:16:53] (03CR) 10jenkins-bot: Revert "Add Code of Conduct footer links to wikitech and mw.o" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355718 (owner: 10Thcipriani) [21:18:08] !log arlolra@tin Started deploy [parsoid/deploy@4a2c3f4]: Updating Parsoid to 5b52d07b [21:18:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:18:56] (03CR) 10Paladox: [C: 031] wikistats: ensure Apache PHP module is installed before site [puppet] - 10https://gerrit.wikimedia.org/r/355706 (owner: 10Dzahn) [21:19:00] (03Merged) 10jenkins-bot: Revert "all wikis to 1.30.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355719 (owner: 10Thcipriani) [21:19:02] legoktm: debugging train issues? not as far as I'm aware. I'm trying to find more info for the task at the moment https://phabricator.wikimedia.org/T166345 all I can say right now is that there was some performance regression that affected, at least, main_page of enwiki but doesn't appear to affect other wikis? [21:19:41] there were also other confounding issues at that time that make this hard to pinpoint. [21:19:53] ok [21:20:10] is it possible to enable wmf.2 on one of the x-wikimedia-debug servers? [21:20:36] legoktm: yes [21:20:44] I can do that on mwdebug1002 [21:22:03] (03CR) 10jenkins-bot: Revert "all wikis to 1.30.0-wmf.2" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355719 (owner: 10Thcipriani) [21:22:15] thanks [21:23:25] legoktm: FYI it was taking between 6s and 15s to answer Main_Page (from apache logs) [21:23:29] (03Draft1) 10Paladox: Phabricator: Install the lighter version of exim4 on labs only [puppet] - 10https://gerrit.wikimedia.org/r/355717 [21:23:33] (03PS2) 10Paladox: Phabricator: Install the lighter version of exim4 on labs only [puppet] - 10https://gerrit.wikimedia.org/r/355717 [21:23:38] ok wow [21:23:39] legoktm: mwdebug1002 should have enwiki on wmf.2 now [21:25:51] !log arlolra@tin Finished deploy [parsoid/deploy@4a2c3f4]: Updating Parsoid to 5b52d07b (duration: 07m 43s) [21:25:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:26:40] I'm not able to reproduce slow times [21:26:52] logged out on mwdebug1002 I'm getting 115ms backend response time [21:28:20] (03PS3) 10Paladox: Phabricator: Install the lighter version of exim4 on labs only [puppet] - 10https://gerrit.wikimedia.org/r/355717 [21:29:26] hmm [21:29:31] there was a bunch of s1 database lag? 
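Since s1 replication lag keeps coming up, a one-off way to spot-check lag on a single replica is sketched below. The host name is only an example (it appears in the dashboards linked later), Icinga/Grafana remain the canonical sources, and credentials are assumed to come from the usual client config:

    # Seconds_Behind_Master is a rough lag indicator; heartbeat-based
    # measurement is more precise, but this is fine for a spot check.
    mysql -h db1052.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' \
      | awk -F': ' '/Seconds_Behind_Master/ {print "lag: " $2 "s"}'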
[21:29:43] there was at that time [21:29:51] yes, but later on, between 19:48 and 20:05 [21:30:11] probably due to a checksum that was going [21:30:20] s/going/ongoing/ [21:30:34] but the issue started before the lag and ended after the lag [21:31:03] !log Updated Parsoid to 5b52d07b (T166068) [21:31:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:31:12] T166068: German Wikipedia's main page shows significantly stale content - https://phabricator.wikimedia.org/T166068 [21:31:13] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293395 (10greg) Adding #operations for visibility while this is investigated. We're (RelEng) seeking more eyeballs. [21:31:45] (03PS4) 10Paladox: Phabricator: Install the lighter version of exim4 on labs only [puppet] - 10https://gerrit.wikimedia.org/r/355717 (https://phabricator.wikimedia.org/T166322) [21:31:53] (03PS3) 10Dzahn: wikistats: ensure Apache PHP module is installed before site [puppet] - 10https://gerrit.wikimedia.org/r/355706 [21:32:06] the timing definitely matches the train. As to any kind of explanation for what in the train could have caused that kind of regression on enwiki but not on enwikibooks: that's more confusing to me. [21:32:48] enwikibooks is much smaller and receives way less traffic? [21:32:51] thcipriani: is the evening SWAT proceeding as normal? i have a patch for the Commons UploadWizard breakage from yesterday's deployment [21:33:52] MatmaRex: for wmf.2? [21:33:59] yes [21:34:12] https://phabricator.wikimedia.org/T166298 [21:34:15] * greg-g looks [21:37:19] (this is when a "deploy for 10% of users" would be super useful) [21:37:47] (03Restored) 10Dzahn: phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [21:37:54] (03CR) 10Dzahn: [C: 04-2] phabricator: avoid root@wm.org mail alias in labs [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [21:38:43] MatmaRex: I'd prefer not, but since wmf.2 is on commons and that is a bad breakage, I'm OK with just that. I want to keep the delta as small as possible while this is investigated [21:38:47] (03CR) 10Dzahn: [C: 032] wikistats: ensure Apache PHP module is installed before site [puppet] - 10https://gerrit.wikimedia.org/r/355706 (owner: 10Dzahn) [21:38:55] thcipriani: should we just revert to wmf.1 everywhere? [21:39:20] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293416 (10kaldari) Can we get this assigned to someone in Performance? @Gilles @aaron [21:41:07] (03CR) 10Paladox: [C: 031] "@Faidon Liambotis i've cherry picked this onto phabricators puppet master." [puppet] - 10https://gerrit.wikimedia.org/r/355640 (owner: 10Dzahn) [21:41:35] greg-g: I can revert everywhere, group1 or just commons [21:42:55] volans: do we have an alternate theory about the s1 db lag? [21:43:10] and was any other wiki besides enwiki slow? [21:43:14] we have the debug servers to test wmf.2 on. But maybe it'll make testing the enwiki mainpage theory later?
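The mwdebug test above can be reproduced with a pinned request. The header value format here is an assumption about how X-Wikimedia-Debug routing is configured, so treat it as a sketch rather than the exact incantation used:

    # Route the request to the debug backend and report timing, matching
    # the ~115ms observation above.
    curl -s -o /dev/null \
      -H 'X-Wikimedia-Debug: backend=mwdebug1002.eqiad.wmnet' \
      -w 'HTTP %{http_code} in %{time_total}s\n' \
      'https://en.wikipedia.org/wiki/Main_Page'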
[21:43:50] legoktm: so there was a pt-table-checksum running on s1 and was running on the logging table
[21:43:56] !log OS install on ores200[1-9]
[21:44:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:44:09] and the times match with the increase of errors on logstash for lag
[21:44:53] thcipriani: was any other wiki besides enwiki slow? and it started as soon as the train went out?
[21:45:21] the other thing that increased was errors of the type: Unexpected general module "{module}" in styles queue.
[21:45:32] see https://logstash.wikimedia.org/goto/4f0d8164f833b5f6ed80029edd829785
[21:45:47] 06Operations, 06Labs, 10Labs-project-Phabricator, 13Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3293439 (10Aklapper) [ wmflabs → not #Phabricator ]
[21:47:33] legoktm: it seems the wikipedias more generally were slow; this was the query used to determine that timing was slow generally: fgrep de.wikipedia.org /var/log/apache2/other_vhosts_access.log | awk '{if ($2 > 5000000) print $1, $2, $7 }' Alerts didn't start coming in until shortly after the train went out, 10 minutes or so
[21:47:38] 06Operations, 06Labs, 10Labs-project-Phabricator, 13Patch-For-Review: spam from phabricator in labs - https://phabricator.wikimedia.org/T166322#3293445 (10Dzahn) It's also an issue of the Phabricator Puppet module in general.
[21:50:17] thcipriani: and you picked a random host for the apache access log?
[21:51:57] legoktm: I'm not sure where that was run, copying and pasting from _joe_ on that one. I don't have access to those logs afaict.
[21:53:24] legoktm: I've seen those on mwdebug too
[21:53:34] 2017-05-25T20:24:03 15417043 http://en.wikipedia.org/wiki/Main_Page
[21:54:07] without the initial fgrep, directly awk of /var/log/apache2/other_vhosts_access.log
[21:55:22] 06Operations, 10ops-eqiad, 10netops: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3293469 (10Cmjohnson) 05Open>03Resolved Faidon got this back up and running today.
[21:57:27] I've hit the main page a few times while logged out and in using mwdebug and can't trigger a slow response
[21:58:08] well then...
[21:58:26] RECOVERY - MegaRAID on db1016 is OK: OK: optimal, 1 logical, 2 physical, WriteBack policy
[21:59:09] legoktm i experienced the slowdown when the problem happened earlier.
[21:59:10] it was slow loading the main page but everything else was fast
[21:59:15] so is someone waiting for someone else to say "should we try again?" ;)
[21:59:31] paladox: yeah, we have logs that confirm that fact
[21:59:34] greg-g: pretty much
[21:59:38] ok
[21:59:49] if it's slow, we can grab a forceprofile traceback and see if the problem is in the PHP code
[21:59:58] how quick was recovery last time after the revert?
[22:00:15] s/traceback/report/
[22:00:28] seemingly recovery was almost immediate
[22:00:57] legoktm: can we do that or do we need an opsen?
[22:01:04] the actual slowdown took a while to bubble up
[22:01:18] anyone can do that, you just add ?forceprofile=1 and it adds the report as an HTML comment
[22:01:23] oh duh
[22:01:27] some kinda logjam with lotsa traffic, I guess
[22:02:08] but
[22:02:21] the other weird factor is this affected codfw mostly initially
[22:02:37] legoktm: do you need any other info? I'm about to log off
[22:02:52] I investigated a few things that I saw in logs/changelogs but none of them seem to be the issue: 1) LuaSandbox issues? I saw exceptions on a few wikis but if there were problems it would have been way more widespread 2) parser cache invalidation after the mw-parser-output change: I checked the keys on both wmf.1 and wmf.2 and they were the same
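The log query and the forceprofile trick quoted above combine into a simple two-step check. A sketch, assuming the access log's second field is the request duration in microseconds, which is what the 5000000 cutoff and the 6s-15s figures imply:

    # 1) Requests to a given vhost slower than 5s (5,000,000 us):
    fgrep de.wikipedia.org /var/log/apache2/other_vhosts_access.log \
      | awk '{ if ($2 > 5000000) print $1, $2, $7 }'

    # 2) For a slow page, fetch the profiling report MediaWiki appends as an
    #    HTML comment when ?forceprofile=1 is set (it sits near the end):
    curl -s 'https://en.wikipedia.org/wiki/Main_Page?forceprofile=1' | tail -n 60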
[22:03:23] volans: what grafana dashboard would you recommend to monitor/check?
[22:03:38] I was using the "MySQL aggregated" one earlier
[22:04:24] so regarding codfw... there is some maintenance ongoing on s4 shard, so the slaves are lagging; if the deploy forces some reload it might explain the alarms in codfw, although it was not proven
[22:05:02] but those codfw alarms... they aren't user-impacting, right? can the dbs handle that extra lag?
[22:05:24] (03PS1) 10BryanDavis: Revert "Revert "Add Code of Conduct footer links to wikitech and mw.o"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355722
[22:05:26] I also think the slave lag on s1 is suspect, as to whether MW caused slave lag or suffered as an effect of it I don't know
[22:05:27] legoktm: so from one side kibana for the errors, the awk/grep on the logs on mw hosts and icinga that starts complaining like above
[22:05:45] for the DB you can look at:
[22:06:58] max lag per shard: https://grafana.wikimedia.org/dashboard/db/mysql-aggregated?panelId=6&fullscreen&orgId=1
[22:07:23] or https://tendril.wikimedia.org/tree to have a detail of all the DBs in all the shards
[22:08:26] oh I forgot about tendril
[22:08:28] thanks :)
[22:08:29] for a single DB there is the mysql dashboard: https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1052
[22:08:37] that gives you much more details
[22:08:52] * legoktm nods
[22:09:41] greg-g: yes they should not be user impacting in codfw, what do you mean by handle the extra lag?
[22:09:46] thcipriani: greg-g: do we want to just try it again? I'm not sure how else to debug a performance regression like this without more traffic...
[22:09:58] volans: I guess I mean: is that lag reason enough to revert?
[22:10:20] legoktm: that's where I'm leaning as well. iff thcipriani is up for it though
[22:10:37] sure
[22:10:48] greg-g: no that's absolutely expected, there is an alter table going on on s4 on codfw; all the slaves have been lagging since this morning
[22:10:51] thcipriani: that didn't inspire confidence ;)
[22:10:53] and doesn't affect the normal checks
[22:10:59] kk, thanks volans
[22:11:01] as long as we're monitoring all the things we can give it a shot
[22:11:11] so it's weird that as soon as the second patch was deployed they started alarming in codfw but not in eqiad
[22:11:26] and it's not clear if the alarms were for s4 wikis or not either
[22:11:30] we concentrate on eqiad ofc
[22:12:15] (03PS1) 10Thcipriani: Revert "Revert "all wikis to 1.30.0-wmf.2"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355723
[22:12:23] legoktm: ready?
[22:13:09] yep
[22:13:23] (03CR) 10Thcipriani: [C: 032] Revert "Revert "all wikis to 1.30.0-wmf.2"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355723 (owner: 10Thcipriani)
[22:13:27] okie doke
[22:13:45] we really don't have any other way to find the bug than test in production?
[22:14:23] volans: it's not showing up with the mwdebug hosts
[22:14:44] volans: suggestions welcome? :/
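Alongside the dashboards volans lists, lag can be spot-checked on a single replica from the command line. A sketch assuming direct MySQL access from an admin host; db1052 is just the host taken from the dashboard URL above:

    # Sustained Seconds_Behind_Master > 0 here would match the lag window
    # described above; the thread states show whether replication is running.
    mysql -h db1052.eqiad.wmnet -e 'SHOW SLAVE STATUS\G' \
      | grep -E 'Seconds_Behind_Master|Slave_(IO|SQL)_Running'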
[22:14:44] (03Merged) 10jenkins-bot: Revert "Revert "all wikis to 1.30.0-wmf.2"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355723 (owner: 10Thcipriani)
[22:14:48] which are also in production, but this smells like either a conflation of events issue or a "needs enwiki traffic" issue
[22:14:51] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293501 (10kaldari) @thcipriani: Do you know if this was affecting all Wikipedias, or just certain ones?
[22:15:08] kaldari: afawct: just enwiki main page
[22:15:11] could it be deployed only on one host with production traffic?
[22:15:27] to understand if the issue is generated by the load on the host
[22:15:36] or only when applied to the whole fleet to our infrastructure
[22:15:39] (03CR) 10jenkins-bot: Revert "Revert "all wikis to 1.30.0-wmf.2"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355723 (owner: 10Thcipriani)
[22:15:45] * volans throwing random idea
[22:15:52] thcipriani: worth the try? ^
[22:16:27] we could do that fairly easily: scap pull && scap wikiversions-compile would do it right now
[22:17:18] !log ores200[1-9] - signing puppet certs, salt-key, initial run
[22:17:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:17:57] doit
[22:19:02] ok mw1161.eqiad.wmnet
[22:19:45] is now running wmf.2 for all wikis
[22:20:21] !log mw1161 running wmf.2 for all wikis for troubleshooting T166345
[22:20:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:20:29] T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345
[22:23:01] except it's all weird because it's a jobrunner.
[22:23:25] bah, undo that and pick a non-jobrunner :)
[22:23:43] alright, let's do mw1170.eqiad.wmnet
[22:23:49] * thcipriani reverts 1161 first
[22:24:51] !log mw1161 wikipedias back to running wmf.1
[22:24:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:08] !log mw1170 running wmf.2 for all wikis for troubleshooting T166345
[22:26:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:26:16] T166345: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345
[22:28:25] curl is still pretty peppy afaict
[22:29:46] legoktm: anything from your side?
[22:30:07] nope, I'm guessing I ran the same curl commands as thcipriani :P
[22:30:27] :)
[22:30:40] (curl -i -H "Host: en.wikipedia.org" "localhost/wiki/Main_Page")
[22:31:08] I could try ab maybe
[22:35:24] ab -H "Host: en.wikipedia.org" -l -n 100 localhost/wiki/Main_Page
[22:35:31] no slow requests tbh
[22:36:05] well then
[22:38:03] 06Operations, 10MediaWiki-General-or-Unknown, 06Performance-Team, 10Wikimedia-General-or-Unknown: wmf/1.30.0-wmf.2 performance issue for Wikipedias - https://phabricator.wikimedia.org/T166345#3293569 (10greg) ```lang=irc 22:28 < thciprian> curl is still pretty peppy afaict 22:29 < greg-g> legoktm: anyth...
[22:38:29] (bah, sorry for pinging ya'll in three channels)
[22:38:50] I'm seeing 2017-05-25T21:08:57 3806223 http://en.wikipedia.org/wiki/Main_Page
[22:38:51] so, consensus is: try everywhere?
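The single-host canary being set up here is just the pair of scap commands quoted above run on the target appserver, plus a direct request that bypasses the load balancer. A minimal sketch of that sequence, with host and wiki names taken from the discussion:

    # On the canary host (mw1161, later mw1170): pull the staged code and
    # rebuild the wikiversions lookup so every wiki resolves to wmf.2.
    scap pull && scap wikiversions-compile

    # Then hit the host directly, exactly as quoted above:
    curl -i -H "Host: en.wikipedia.org" "localhost/wiki/Main_Page"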
[22:38:51] lol
[22:38:56] that is 3.8s
[22:39:01] on mw1170
[22:39:20] I started playing with ab and increasing concurrency
[22:39:22] ab -H "Host: en.wikipedia.org" -l -n 1000 -c 200 localhost/wiki/Main_Page
[22:39:27] also Special:BlankPage and Special:Version seemed pretty slow
[22:39:29] so that's 1k total requests, and 200 at a time
[22:39:37] hrm, I am starting to see runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT pop back up in logs on mwlog1001
[22:40:06] how many requests do servers normally handle concurrently?
[22:41:25] much less than that ;)
[22:41:32] https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-instance=mw1170
[22:42:01] thcipriani: none of those are from mw1170
[22:44:58] using just curl (not ab) under cumin, I can request the enwiki main page from all of role::mediawiki::appserver hosts and none of them are slow
[22:45:08] so it's certainly not the case that just 1x of those servers is persistently slow
[22:45:32] bblack: but they have the old version right now AFAICT
[22:45:51] ^ all except mw1170 at the moment
[22:45:55] ah ok
[22:46:27] so given that they are not able to repro on one host, seems more of a problem that manifests itself only when applied to the whole fleet
[22:46:40] we are probably hitting some bottleneck/bug somewhere in the stack
[22:47:12] at least that's my wild guess
[22:47:32] alright, should we try to push it out again and try to pinpoint where in the stack the pain point is?
[22:48:22] do you need me anymore? It's getting late and if not needed I'd head to bed... :)
[22:49:19] it seems like from the icinga alert patterns that it's more likely to be the appserver host slowing down than some remote bottleneck?
[22:49:35] but that's just guessing
[22:50:15] any vetos to thcipriani's proposal?
[22:51:04] looking at some sample appservers though, there wasn't really a CPU spike or anything, hmm
[22:51:43] greg-g: I haven't caught up on all the backscroll. basically we still don't know what the problem is, and we want to try a full deploy again to find it?
[22:52:10] yeah, after trying on mwdebug and then mw1170 for real traffic
[22:52:48] I'm not fond of that as a plan
[22:53:17] it's 4pm in CA, 6pm where I'm at, it's *late* for most of ops (europe)
[22:53:22] in the shower of alarms before there was also some redis-related alarm, not sure if cause or effect though, chasemp took a look IIRC at that time
[22:53:41] postpone until tomorrow?
[22:53:42] why push a known failure out when we could have people spend an evening digging through some changes and theories first?
[22:54:12] I'll not be around more than 10m unless everything is on fire, but putting everything on fire on purpose doesn't seem the best plan of action ;)
[22:54:29] mostly just to try to cut through all the confounding factors during the initial deployment to try to get a better view of what happened.
[22:54:40] someone should look for a change to core that introduced a new lag check.
[22:55:10] AaronSchulz might be a good person to ping for ideas about what would cause that
[22:55:31] it is late enough that I would be fine postponing. Given that, if we postpone, we need to figure out (a) how far to roll back and (b) whether to continue with swat in 5 minutes.
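The ab invocations above scale from a smoke test up to a concurrency probe, and the fleet-wide check bblack describes needs nothing more exotic than curl in a loop. In practice cumin was used for the fan-out; a plain-ssh stand-in, with the host list being an assumption:

    # 100 sequential requests, then 1000 at concurrency 200 (-l tolerates
    # responses of varying length, e.g. pages with per-request comments).
    ab -H "Host: en.wikipedia.org" -l -n 100 localhost/wiki/Main_Page
    ab -H "Host: en.wikipedia.org" -l -n 1000 -c 200 localhost/wiki/Main_Page

    # Rough fleet-wide latency survey (host list here is illustrative):
    for h in mw1161 mw1170; do
      ssh "$h.eqiad.wmnet" \
        "curl -s -o /dev/null -w '$h %{time_total}s\n' -H 'Host: en.wikipedia.org' localhost/wiki/Main_Page"
    done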
[22:56:07] bblack: for the record, recovery was practically instantaneous after a revert, which surely influenced my opinion of how to proceed now
[22:56:10] bd808: that's something to be investigated for sure, although the lag was real for s1 in the 19:48-20:05 timeframe but the alarms started way before and the logs were showing other warnings
[22:56:15] we haven't seen any problems on group1 right? Just something that shows up at high volume?
[22:56:40] jouncebot: next
[22:56:40] In 0 hour(s) and 3 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T2300)
[22:56:56] thcipriani: in any case I'd revert mw1170
[22:57:00] yes
[22:57:02] doing
[22:57:11] * bd808 wants to SWAT the footer link for code of conduct
[22:57:29] there's no indication of bad behavior on group1 or group0
[22:57:38] bblack: what makes me think it was something in the stack is that we can see the slow Main_Page also on mwdebug hosts at the time it happened
[22:57:55] right, so something like dbs or mc?
[22:58:22] !log mw1170 wikipedias back to 1.30.0-wmf.1
[22:58:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:58:49] I'll brb in 15-20 min, I'm switching locations
[22:58:50] (03PS1) 10Thcipriani: Revert "Revert "Revert "all wikis to 1.30.0-wmf.2""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355727
[22:59:02] (03CR) 10Thcipriani: [C: 032] Revert "Revert "Revert "all wikis to 1.30.0-wmf.2""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355727 (owner: 10Thcipriani)
[22:59:28] I'm out for the next few hours, barring someone dragging me back for an emergency :P
[22:59:47] bblack: DB had that lag issue that was matching the pt-table-checksum on logging table that was running since this afternoon but we didn't see any particular queries, apart maybe one:
[22:59:54] alright, so current plan: can we leave group0 and group1 where they are, SWAT in 1 minute or so, and let that task marinate overnight
[23:00:02] sound reasonable?
[23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Respected human, time to deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170525T2300). Please do the needful.
[23:00:04] SMalyshev, ejegg, MatmaRex, and bd808: A patch you scheduled for Evening SWAT (Max 8 patches) is about to be deployed. Please be available during the process.
[23:00:05] so yes mc/redis could be a possibility
[23:00:11] (03Merged) 10jenkins-bot: Revert "Revert "Revert "all wikis to 1.30.0-wmf.2""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355727 (owner: 10Thcipriani)
[23:00:20] (03CR) 10jenkins-bot: Revert "Revert "Revert "all wikis to 1.30.0-wmf.2""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355727 (owner: 10Thcipriani)
[23:00:25] thcipriani: +1
[23:00:26] thcipriani: works for me
[23:00:41] okie doke. Since I'm here: I guess I can take SWAT :)
[23:00:48] hi all! is SWAT still on? I gather things have been interesting
[23:01:40] things have indeed been interesting.
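Before and after reverts like the ones above, the branch a wiki is actually serving can be confirmed from the outside via the siteinfo API. A small sketch; jq is assumed to be available:

    # "generator" carries the deployed branch, e.g. "MediaWiki 1.30.0-wmf.1"
    # or "MediaWiki 1.30.0-wmf.2".
    curl -s 'https://en.wikipedia.org/w/api.php?action=query&meta=siteinfo&siprop=general&format=json' \
      | jq -r '.query.general.generator'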
[23:01:49] * volans off
[23:01:55] I will go ahead and SWAT unless greg-g has objections
[23:02:01] for the record, wmf.2 is only on group0 and group1, not the 'pedias
[23:02:07] no, it's ok
[23:02:08] thanks everyone
[23:02:18] (looked at the patches just now)
[23:02:42] thanks for trying thcipriani and legoktm and volans
[23:03:35] SMalyshev: ping for SWAT
[23:04:42] I'll help look into it, is there a task?
[23:07:49] ejegg: I'm not clear what's going on with https://gerrit.wikimedia.org/r/#/c/355660/1; usually gerrit bumps the submodule automatically if you bump the branch that is being tracked as a submodule i.e. the wmf/1.30.0-wmf.2 branch of mediawiki/extensions/DonationInterface
[23:09:04] and that repo on that branch is at 7eb79df which is what is live it looks like
[23:09:21] thcipriani: here, sorry
[23:09:43] was looking into some code and lost track of time
[23:09:59] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Transit: LibertyGlobal (BB00088, donated) {#017391} [10Gbps]
[23:10:42] ejegg: could you patch DonationInterface wmf/1.30.0-wmf.2 with the changes you need and we can get it merged then? I don't see the changes from https://gerrit.wikimedia.org/r/#/c/355660/1 even though it says that it merged(?)
[23:10:46] SMalyshev: np :)
[23:11:20] thcipriani: it's just a submodule update
[23:11:38] is that not showing up?
[23:12:21] oops, that was just in core
[23:12:35] sorry, let me fix the DI branch
[23:12:55] thanks :)
[23:12:59] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 57, down: 0, dormant: 0, excluded: 0, unused: 0
[23:15:11] (03CR) 10Thcipriani: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355722 (owner: 10BryanDavis)
[23:15:40] thcipriani: I'll need it on terbium to check, since it's a maint script
[23:15:51] SMalyshev: ok
[23:16:15] (03Merged) 10jenkins-bot: Revert "Revert "Add Code of Conduct footer links to wikitech and mw.o"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355722 (owner: 10BryanDavis)
[23:16:24] (03CR) 10jenkins-bot: Revert "Revert "Add Code of Conduct footer links to wikitech and mw.o"" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355722 (owner: 10BryanDavis)
[23:16:54] ^ bd808 I'm going to sync out since we've checked this once already :)
[23:17:03] cool beans
[23:17:57] thcipriani: sorry, I can't seem to cherry-pick https://gerrit.wikimedia.org/r/355352 to 1.30.0-wmf.2
[23:19:19] and ffwding locally, then git-review wmf/1.30.0-wmf.2 is giving me this:
[23:19:33] !log thcipriani@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:355722|Revert "Revert "Add Code of Conduct footer links to wikitech and mw.o""]] PART I (duration: 00m 43s)
[23:19:40] No changes between HEAD and origin/wmf/1.30.0-wmf.2
[23:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:20:23] !log thcipriani@tin Synchronized wmf-config/CommonSettings.php: SWAT: [[gerrit:355722|Revert "Revert "Add Code of Conduct footer links to wikitech and mw.o""]] PART II (duration: 00m 41s)
[23:20:29] ^ bd808 should be live
[23:20:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:21:08] thcipriani: looks good. thanks
[23:21:45] ejegg: looks like there's a merge conflict in gateway_common/DonationData.php
[23:22:13] doesn't it have the exact same parent commit?
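For reference, the backport dance being worked through here follows a standard shape under the usual git-review workflow. An outline; the local branch name and the sha placeholder are illustrative:

    git fetch origin wmf/1.30.0-wmf.2
    git checkout -b my-backport origin/wmf/1.30.0-wmf.2  # branch name is made up
    git cherry-pick <master-commit-sha>   # a conflict here means the branches diverged
    git review wmf/1.30.0-wmf.2           # upload; the +2 is left to whoever runs SWAT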
[23:22:15] hmm
[23:22:51] ah, sorry, I see the problem
[23:22:53] one sec
[23:24:25] just merging the parent patch
[23:25:01] thcipriani: fyi, I just emailed ops/wikitech-l/engineering about the revert
[23:25:55] greg-g: cool, thanks
[23:28:22] SMalyshev: your patch is live on terbium, check please
[23:28:31] thcipriani: ok, parent merged, cherry-picked the one i need, just waiting on gate&submit
[23:29:36] thcipriani: ok, the DI branch has the commit i need
[23:30:06] the core patch I merged earlier is pointing to a revision a couple ahead of it
[23:30:15] but those are totally irrelevant
[23:30:22] thcipriani: checking
[23:30:33] ejegg: okie doke. As a rule of thumb +2ing on the wmf branches is generally something the person SWATing does, since they have to figure out the order of deployment. Not a big deal in this instance, but sometimes patch ordering can get tricky.
[23:30:53] ooh, sorry about that
[23:31:27] np, works fine in this instance, just for future reference :)
[23:32:51] I'll put up another core PS pointing to the actual wmf.2 DI branch revision
[23:33:12] ejegg: gerrit bumps submodules in core automagically
[23:33:18] for the wmf branches
[23:33:58] so if you +2 a commit to the DI wmf branch, after it merges, gerrit will bump the submodule reference in core to point at the new tip of the DI branch
[23:34:08] thcipriani: seems to be working. wmf.2 is only enabled for group0/1 wikis now though, right?
[23:34:18] SMalyshev: that is correct
[23:34:56] oh, cool!
[23:35:00] thcipriani: wait, nope, this patch is not good actually
[23:35:14] will have to revert it...
[23:35:15] SMalyshev: so revert?
[23:35:18] ok, doing
[23:35:24] thcipriani: yeah, revert it.
[23:35:25] thcipriani: great, core wmf.2 is looking right as far as DonationInterface is concerned
[23:35:47] there's a bug I didn't notice before which produces a fatal error at the end of the run :(
[23:36:11] I'll have to find a different way... :(
[23:37:02] thcipriani: I made a revert here: https://gerrit.wikimedia.org/r/#/c/355730/
[23:37:13] ah, nice thank you!
[23:38:16] btw, anybody here know about statsd code in core?
[23:39:24] SMalyshev: I code reviewed some of it for ori back in the day
[23:40:42] MatmaRex: your change is live on mwdebug1002, check please
[23:41:05] bd808: do you remember from then if there's some way to disable data collection?
[23:42:04] thcipriani: looks good
[23:42:14] MatmaRex: ok, going live
[23:42:16] SMalyshev: you can swap in the NullStatsdDataFactory
[23:42:45] bd808: you'd think so. But no: Fatal error: Call to undefined method NullStatsdDataFactory::getBuffer() in /srv/mediawiki/php-1.30.0-wmf.2/includes/GlobalFunctions.php on line 1203
[23:43:14] SMalyshev: sounds like a bug
[23:43:35] the upstream interface probably changed and the null factory didn't get updated
[23:44:01] bd808: that's what I am trying to figure out... I don't think this function is part of the interface
[23:44:02] hmmm.. or not
[23:44:46] !log thcipriani@tin Synchronized php-1.30.0-wmf.2/resources/src/jquery/jquery.makeCollapsible.js: SWAT: [[gerrit:355721|jquery.makeCollapsible: Restore considering empty as part of toggle]] T166298 (duration: 00m 42s)
[23:44:54] ^ MatmaRex should be live now
[23:44:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:44:55] T166298: Dropdown options of Upload Wizard are broken (copy metadata, add location and more, non-own-work license options) - https://phabricator.wikimedia.org/T166298
[23:44:58] thanks thcipriani
[23:45:08] SMalyshev: yeah. that looks like a bug that ori introduced. GlobalFunctions is assuming the buffering factory
[23:45:20] in wfLogProfilingData
[23:45:30] bd808: yup
[23:46:29] and adding that missing method will probably cause issues of sending empty stuff...
[23:46:37] even worse, getStatsdDataFactory is declared as returning StatsdDataFactory not StatsdDataFactoryInterface
[23:46:57] and the null one implements StatsdDataFactoryInterface
[23:46:59] yeah. looks like a bit of a dog's breakfast
[23:47:34] looks like some fixes are needed there...
[23:48:02] ohh and that code is in vendor...
[23:48:22] dammit so I can't just add a method like hasData() and avoid the problem
[23:48:29] right
[23:49:27] ejegg: ok, so I have a concern about this patch: our deployment tooling can't guarantee the order in which the individual files from this patch will arrive at the target hosts and this patch looks like a lot of changes across a lot of files. I don't mind updating l10n stuff, but I worry this might explode everywhere.
[23:49:32] well.. you can add that method in our classes, but I think wfLogProfilingData() would just break in a slightly different way. It needs some changes to check things better.
[23:49:34] yeah it looks like a proper mess...
[23:49:51] I wonder who owns it now? mw core team?
[23:49:55] heh
[23:50:09] "owns" is a loose concept here, you know that ;)
[23:50:20] PROBLEM - Check systemd state on ores2009 is CRITICAL: Return code of 255 is out of bounds
[23:50:31] PROBLEM - Check the NTP synchronisation status of timesyncd on ores2009 is CRITICAL: Return code of 255 is out of bounds
[23:50:32] thcipriani: no need to worry, none of that code executes outside of payments
[23:50:50] bd808: yup... I can make a patch probably, but I need somebody to review it...
[23:50:51] it's strictly used for i18n strings everywhere else
[23:50:58] SMalyshev: if you, addshore, and the perf team sign off you can probably change any of that any way that is needed.
[23:50:59] and ensure it doesn't break something else
[23:51:04] I'd be glad to review for you
[23:51:16] bd808: ok, thanks
[23:51:20] RECOVERY - Check systemd state on ores2009 is OK: OK - running: The system is fully operational
[23:51:30] I'll start with filing a bug
[23:51:35] good plan
[23:52:31] ejegg: ah, gotcha, so the php is not executed on the prod setup? Just fundraising's private stuff?
[23:52:40] yep, exactly
[23:52:54] it's just that donatewiki needs to say a lot of the same things
[23:53:48] ah, gotcha, ok, I'll start scap in a minute.
[23:54:59] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3293642 (10Papaul)
[23:55:32] thanks!
[23:55:54] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10Papaul) a:05Papaul>03akosiaris @akosiaris This is complete at my end. It is all yours.
[23:56:54] !log thcipriani@tin Started scap: SWAT: [[gerrit:355660|Fix version of DonationInterface deployed to donatewiki]] T166302
[23:57:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:57:03] T166302: Missing interface messages on donatewiki - https://phabricator.wikimedia.org/T166302
[23:58:00] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 55, down: 1, dormant: 0, excluded: 0, unused: 0; xe-1/3/0: down - Transit: LibertyGlobal (BB00088, donated) {#017391} [10Gbps]
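The getBuffer() mismatch being filed as a bug above can be mapped out starting from nothing but the fatal's own path. A rough sketch for a deploy host; the grep targets come from the error and the class names in the discussion, and the deploy root from the fatal itself:

    # Where wfLogProfilingData() assumes the buffering factory:
    grep -n 'getBuffer' /srv/mediawiki/php-1.30.0-wmf.2/includes/GlobalFunctions.php

    # Which of the factory classes live in core vs. vendor, and therefore
    # which ones can be patched locally (the "that code is in vendor" issue):
    grep -rln 'StatsdDataFactoryInterface\|NullStatsdDataFactory' \
      /srv/mediawiki/php-1.30.0-wmf.2/includes \
      /srv/mediawiki/php-1.30.0-wmf.2/vendor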