[00:01:16] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:26] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:27] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:01:36] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[00:04:06] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[00:04:16] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[00:04:26] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[00:04:26] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[00:05:05] ACKNOWLEDGEMENT - puppet last run on ms-be2029 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 17 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[parted-/dev/sdf] daniel_zahn https://phabricator.wikimedia.org/T166021
[00:05:33] (CR) Dzahn: [C: +2] graphite: move 'standard' and 'base::firewall' to role [puppet] - https://gerrit.wikimedia.org/r/353364 (owner: Dzahn)
[00:07:11] (CR) Dzahn: "no-op" [puppet] - https://gerrit.wikimedia.org/r/353364 (owner: Dzahn)
[00:09:10] (PS1) Aaron Schulz: Switch Swift URLs to HTTPs [mediawiki-config] - https://gerrit.wikimedia.org/r/355174 (https://phabricator.wikimedia.org/T160616)
[00:18:30] (PS4) Dzahn: contint: role/profile conversion [puppet] - https://gerrit.wikimedia.org/r/355156
[00:24:54] (CR) Dzahn: [C: -1] "more questions, are we really going to use hieradata/profile/ ?" [puppet] - https://gerrit.wikimedia.org/r/355156 (owner: Dzahn)
[00:40:58] Operations, ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3285335 (faidon) a: faidon→Cmjohnson I don't see a process for refetching that, so we'll likely need to go and someone at the RIPE NCC. That said, if it was the board that was problematic, aren't we reusing...
[01:02:36] PROBLEM - citoid endpoints health on scb1002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[01:02:36] PROBLEM - citoid endpoints health on scb1001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[01:04:26] RECOVERY - citoid endpoints health on scb1002 is OK: All endpoints are healthy
[01:04:26] RECOVERY - citoid endpoints health on scb1001 is OK: All endpoints are healthy
[01:31:26] PROBLEM - HHVM rendering on mw1204 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.073 second response time
[01:32:26] RECOVERY - HHVM rendering on mw1204 is OK: HTTP OK: HTTP/1.1 200 OK - 74145 bytes in 0.675 second response time
[02:23:01] !log l10nupdate@tin scap sync-l10n completed (1.30.0-wmf.1) (duration: 07m 25s)
[02:23:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:29:18] !log l10nupdate@tin ResourceLoader cache refresh completed at Tue May 23 02:29:17 UTC 2017 (duration 6m 16s)
[02:29:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[02:31:32] Operations, ops-eqiad: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3285362 (Cmjohnson) We did reuse the storage card. I believe the mac address is the only change. @Faidon it's connected via console.
[04:09:56] PROBLEM - mailman I/O stats on fermium is CRITICAL: CRITICAL - I/O stats: Transfers/Sec=3826.90 Read Requests/Sec=1578.30 Write Requests/Sec=4.80 KBytes Read/Sec=35127.60 KBytes_Written/Sec=568.40
[04:19:56] RECOVERY - mailman I/O stats on fermium is OK: OK - I/O stats: Transfers/Sec=9.20 Read Requests/Sec=0.20 Write Requests/Sec=0.60 KBytes Read/Sec=1.20 KBytes_Written/Sec=6.40
[04:42:16] PROBLEM - Disk space on elastic1028 is CRITICAL: DISK CRITICAL - free space: /srv 60972 MB (12% inode=99%)
[05:27:36] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:27:36] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:27:36] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:27:36] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:28:26] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:28:36] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[05:28:36] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:28:36] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[05:29:26] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[05:29:27] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[05:29:27] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[05:29:27] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[05:30:36] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[05:31:16] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[05:40:16] RECOVERY - Disk space on elastic1028 is OK: DISK OK
[06:09:03] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185
[06:09:10] (PS2) Marostegui: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185
[06:16:08] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185 (owner: Marostegui)
[06:19:11] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185 (owner: Marostegui)
[06:19:21] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1021" [mediawiki-config] - https://gerrit.wikimedia.org/r/355185 (owner: Marostegui)
[06:20:10] !log Deploy alter table on s2 eqiad master db1054 - T162611
[06:20:19] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1021 - T162611 (duration: 00m 38s)
[06:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:20:20] T162611: Unify revision table on s2 - https://phabricator.wikimedia.org/T162611
[06:20:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:28:59] (PS2) Elukey: Remove any reference of mc1001->mc1018 for decom [puppet] - https://gerrit.wikimedia.org/r/354453 (https://phabricator.wikimedia.org/T164341)
[06:29:05] !log Deploy alter table on s7.frwiktionary db2040 and db1034 - T165743
[06:29:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:29:14] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[06:30:22] (Abandoned) Muehlenhoff: Make HHVM depend on nutcracker service [puppet] - https://gerrit.wikimedia.org/r/353556 (https://phabricator.wikimedia.org/T163795) (owner: Muehlenhoff)
[06:49:07] Operations, ops-eqiad, Patch-For-Review, User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3285429 (elukey)
[06:50:35] Operations, ops-eqiad, Patch-For-Review, User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3230583 (elukey) @Cmjohnson: The hosts are ready for the non interruptible steps, including https://gerrit.wikimedia.org/r/354453, so I haven't merge...
[06:50:48] Operations, ops-eqiad, Patch-For-Review, User-Elukey: Decommission old memcached hosts - mc1001->mc1018 - https://phabricator.wikimedia.org/T164341#3285432 (elukey) a: Cmjohnson
[07:07:02] !log Rename gather_list gather_list_flag gather_list_item on db1078 db1094 and db1089 - T166097
[07:07:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:07:09] T166097: Drop Gather tables from wmf wikis - https://phabricator.wikimedia.org/T166097
[07:13:06] !log Deploy schema change on ruwiki.ores_classification directly on codfw master (db2028) - T164530
[07:13:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:13:14] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[07:17:14] Operations, ops-codfw, DC-Ops, Discovery, and 3 others: elastic2020 is powered off and does not want to restart - https://phabricator.wikimedia.org/T149006#3285476 (Gehel) We can keep elastic2020 down for a few more weeks if needed. The cluster is able to sustain the current load with -1 node.
[07:25:04] !log installing openjdk security updates on maps and wdqs clusters
[07:25:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:25:26] PROBLEM - puppet last run on labvirt1013 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[07:44:46] PROBLEM - DPKG on kafka1003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:44:46] PROBLEM - DPKG on kafka1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:06] PROBLEM - DPKG on kafka2001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:07] PROBLEM - DPKG on kafka1002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:26] PROBLEM - DPKG on kafka2003 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:45:27] PROBLEM - DPKG on kafka2002 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[07:48:16] !log addshore@terbium:~$ ~/mymwscriptwikiset extensions/Cognate/maintenance/purgeDeletedCognatePages.php et+wiktionary.dblist --batch-size=1000 >> ~/purge.201705161230.log T164407
[07:48:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:48:24] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[07:50:46] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:51:16] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
[07:51:38] * elukey stares to addshore for the 5xx
[07:51:41] :D
[07:51:43] checking them
[07:51:45] O_o
[07:51:53] should not be me
[07:52:17] stopped my script anyway
[07:52:25] ahhahah I know I know
[07:52:34] from https://logstash.wikimedia.org/app/kibana#/dashboard/Varnish-Webrequest-50X it seems a temp spike
[07:53:42] (CR) Hashar: HHVM: Fix puppet on trusty (1 comment) [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[07:54:26] RECOVERY - puppet last run on labvirt1013 is OK: OK: Puppet is currently enabled, last run 19 seconds ago with 0 failures
[07:54:30] Well, could be totally unrelated, but looking at https://grafana.wikimedia.org/dashboard/db/mysql?orgId=1&from=now-48h&to=now&var-dc=eqiad%20prometheus%2Fops&var-server=db1031 regarding T164407 i start to see things odd happening again
[07:54:31] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[07:54:33] seems cp3033 related
[07:55:12] (PS10) Hashar: contint: skip hhvm experimental pin on Trusty [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[07:55:28] a lot of errors in the spike end up with cp3033 int in x-cache
[07:56:06] PROBLEM - puppet last run on kafka2003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk]
[07:57:06] RECOVERY - DPKG on kafka2001 is OK: All packages OK
[07:57:26] RECOVERY - DPKG on kafka2002 is OK: All packages OK
[07:57:41] yeah there is a big fetch failed event https://grafana.wikimedia.org/dashboard/db/varnish-machine-stats?orgId=1&var-server=cp3033
[07:58:16] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:58:22] ema: ---^
[07:58:32] afaics seems cp3033 related
[07:58:46] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[07:58:56] Varnish backends main threads are really high, the rest looks good
[08:00:27] !log the last script I started is now stopped
[08:00:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:01:06] PROBLEM - puppet last run on kafka1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk]
[08:08:46] RECOVERY - DPKG on kafka1001 is OK: All packages OK
[08:09:06] RECOVERY - DPKG on kafka1002 is OK: All packages OK
[08:09:47] RECOVERY - DPKG on kafka1003 is OK: All packages OK
[08:10:26] PROBLEM - puppet last run on kafka1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[openjdk-7-jdk]
[08:13:00] !log Force WB as a default policy on db1031 because of degraded BBU
[08:13:06] RECOVERY - puppet last run on kafka1003 is OK: OK: Puppet is currently enabled, last run 34 seconds ago with 0 failures
[08:13:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:13:26] RECOVERY - puppet last run on kafka1002 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:14:46] PROBLEM - DPKG on contint1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:15:39] !log apply manually https://gerrit.wikimedia.org/r/#/c/351854/2/wmf-config/jobqueue.php (persistent connections between hhvm and redis) to mw1161 as production test
[08:15:46] PROBLEM - DPKG on stat1004 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[08:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:45] ^stat1004/contint1001 are expected
[08:24:26] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 2 minutes ago with 2 failures. Failed resources (up to 3 shown): Package[initramfs-tools],Package[openjdk-7-jdk]
[08:26:34] (CR) Hashar: "We can most probably merge this one. Note I have removed HHVM from the CI Trusty images via https://gerrit.wikimedia.org/r/#/c/355187/" [puppet] - https://gerrit.wikimedia.org/r/353964 (https://phabricator.wikimedia.org/T165462) (owner: Paladox)
[08:27:40] Operations, ops-eqiad, DBA: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285567 (Marostegui)
[08:28:46] RECOVERY - DPKG on stat1004 is OK: All packages OK
[08:28:56] no changes to TIME_WAITs for mw1161
[08:29:09] (manually changed /srv/mediawiki/wmf-config/jobqueue.php)
[08:29:46] RECOVERY - DPKG on contint1001 is OK: All packages OK
[08:30:26] RECOVERY - DPKG on kafka2003 is OK: All packages OK
[08:35:58] Operations, ops-eqiad, DBA: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285604 (Marostegui) After a long while the BBU shows `Optimal` again, so looks like the manual relearn worked (the same way it did on db1048 - T160731#3109104 ) Setting the policy back to its default...
[08:47:57] (CR) Multichill: [C: +1] "Yes please" [mediawiki-config] - https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) (owner: Hoo man)
[08:53:26] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[08:53:46] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 19 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[initramfs-tools]
[08:54:06] RECOVERY - puppet last run on kafka2003 is OK: OK: Puppet is currently enabled, last run 32 seconds ago with 0 failures
[08:54:36] mutante: no profiles should NOT have a system::role. Only real roles, so yes profiles should lose them when being converted.
[08:57:16] akosiaris: (curious) - so roles should theoretically only contain system::role and profile includes ?
[09:05:46] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 23 seconds ago with 0 failures
[09:11:04] (PS4) Volans: Puppet compiler: automatically sync from all masters [puppet] - https://gerrit.wikimedia.org/r/354105 (https://phabricator.wikimedia.org/T165583)
[09:13:34] (CR) Volans: [C: +2] Puppet compiler: automatically sync from all masters [puppet] - https://gerrit.wikimedia.org/r/354105 (https://phabricator.wikimedia.org/T165583) (owner: Volans)
[09:13:39] Operations, Traffic, netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3285703 (ayounsi) a: ayounsi
[09:15:39] !log reverted manual hack on mw1161 with scap pull
[09:15:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:16:00] !log Restarting Jenkins on contint1001
[09:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:19:38] elukey: not theoretically, in practice (after the migration is done)
[09:19:46] but yes
[09:19:52] super thanks
[09:20:20] elukey: https://wikitech.wikimedia.org/wiki/Puppet_coding#Roles
[09:22:05] akosiaris: for some reason I didn't remember 1. :)
[09:22:14] but it makes sense
[09:24:31] !log restarting cassandra on restbase1007, restbase1009, restbase1012 to pick up Java security updates
[09:24:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:52] Operations, Operations-Software-Development, Patch-For-Review: Puppet compiler: sync facts from all workers - https://phabricator.wikimedia.org/T165583#3285745 (Volans) Open→Resolved Documentation updated on https://wikitech.wikimedia.org/wiki/Nova_Resource:Puppet3-diffs/Documentation
[09:46:18] !log addshore@terbium:/srv/mediawiki/php-1.30.0-wmf.1$ mwscriptwikiset extensions/Cognate/maintenance/purgeDeletedCognatePages.php wiktionary.dblist --batch-size=1000 >> ~/purge.201705161230.log T164407
[09:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:46:26] T164407: Cognate has been disabled from WMF because it caused an outage on x1 by overtaking 10000 concurrent connections - https://phabricator.wikimedia.org/T164407
[09:49:24] !log swift eqiad-prod: ms-be1028/ms-be1039 object weight 3500 - T160640
[09:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:49:33] T160640: Rack and Setup ms-be1028-ms-1039 - https://phabricator.wikimedia.org/T160640
[09:50:35] (PS6) Filippo Giunchedi: logstash: build http_request from webrequest fields [puppet] - https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451)
[09:52:17] (PS1) Jcrespo: raid-check: Return critical in WriteThough mode for megacli [puppet] - https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[09:52:39] (PS2) Filippo Giunchedi: logstash: move 'hostname' to 'host' for webrequest [puppet] - https://gerrit.wikimedia.org/r/353853 (https://phabricator.wikimedia.org/T149451)
[09:54:16] (PS2) Jcrespo: raid-check: Return critical in WriteThough mode for megacli [puppet] - https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[09:55:11] (CR) Filippo Giunchedi: [C: +2] logstash: build http_request from webrequest fields [puppet] - https://gerrit.wikimedia.org/r/353282 (https://phabricator.wikimedia.org/T149451) (owner: Filippo Giunchedi)
[09:55:15] (CR) Filippo Giunchedi: [C: +2] logstash: move 'hostname' to 'host' for webrequest [puppet] - https://gerrit.wikimedia.org/r/353853 (https://phabricator.wikimedia.org/T149451) (owner: Filippo Giunchedi)
[09:56:08] !log restarting cassandra on restbase1013, restbase1014, restbase1015, restbase1017 to pick up Java security updates
[09:56:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:03:09] Operations, ops-eqiad, DBA, Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285788 (Marostegui) And this happened again: ``` root@db1031:~# megacli -AdpBbuCmd -a0 BBU status for Adapter: 0 BatteryType: BBU Voltage: 3830 mV Current: -685 mA Temperatur...
[10:06:37] Operations, ops-eqiad, DBA, Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285791 (Marostegui)
[10:08:16] (PS1) Giuseppe Lavagetto: Debianization fixups [calico-containers] - https://gerrit.wikimedia.org/r/355191
[10:08:39] (CR) Giuseppe Lavagetto: [V: +2 C: +2] Debianization fixups [calico-containers] - https://gerrit.wikimedia.org/r/355191 (owner: Giuseppe Lavagetto)
[10:10:45] (PS1) Giuseppe Lavagetto: profile::calico::builder: fix branch to check out [puppet] - https://gerrit.wikimedia.org/r/355192
[10:12:09] (CR) Giuseppe Lavagetto: [C: +2] profile::calico::builder: fix branch to check out [puppet] - https://gerrit.wikimedia.org/r/355192 (owner: Giuseppe Lavagetto)
[10:14:42] !log Run pt-table-checksum on s7.frwiktionary - T165743
[10:14:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:14:52] T165743: frwiktionary on s7 still needs fixing on the revision table - https://phabricator.wikimedia.org/T165743
[10:15:29] Operations, DBA: Investigate slow servermon updating queries on db1016 - https://phabricator.wikimedia.org/T165674#3285795 (akosiaris) Per https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=7680e3d95eee2aa98b1c461dbc0dcc5c&host=db1016&user=&schema=puppet&hours=24 this seems to be occurin...
[10:16:25] (PS6) Volans: Puppet: run-puppet-agent, add --failed-only option [puppet] - https://gerrit.wikimedia.org/r/349416
[10:23:56] (PS11) Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:24:33] (CR) Volans: "Addressed comment" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/349416 (owner: Volans)
[10:24:51] (CR) jerkins-bot: [V: -1] [WIP] First prototype of the EventLogging purge script [puppet] - https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: Elukey)
[10:26:45] (PS1) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530)
[10:27:18] !log upload kafkatee 0.1.5 to jessie-wikimedia, remove unused kafkatee 0.1.4 from trusty-wikimedia - T149451
[10:27:23] (PS2) Marostegui: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530)
[10:27:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:27:25] T149451: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451
[10:29:33] (PS12) Elukey: [WIP] First prototype of the EventLogging purge script [puppet] - https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933)
[10:33:30] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[10:34:40] (PS1) Filippo Giunchedi: role: don't install kafkacat for statistics::private [puppet] - https://gerrit.wikimedia.org/r/355196
[10:35:36] PROBLEM - restbase endpoints health on restbase2012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase2010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:36] PROBLEM - restbase endpoints health on restbase1016 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:37] PROBLEM - restbase endpoints health on restbase1008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2003 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2008 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2006 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:46] PROBLEM - restbase endpoints health on restbase2001 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:47] PROBLEM - restbase endpoints health on restbase2005 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:47] PROBLEM - restbase endpoints health on restbase1018 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:48] PROBLEM - restbase endpoints health on restbase1009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:48] PROBLEM - restbase endpoints health on restbase1007 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:35:49] PROBLEM - restbase endpoints health on restbase1010 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:36:36] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[10:37:46] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[10:37:46] PROBLEM - restbase endpoints health on restbase1012 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:38:51] (CR) Elukey: [C: +2] role: don't install kafkacat for statistics::private [puppet] - https://gerrit.wikimedia.org/r/355196 (owner: Filippo Giunchedi)
[10:39:07] (PS2) Elukey: role: don't install kafkacat for statistics::private [puppet] - https://gerrit.wikimedia.org/r/355196 (owner: Filippo Giunchedi)
[10:39:30] (Merged) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[10:39:46] PROBLEM - restbase endpoints health on restbase2004 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:40:46] PROBLEM - restbase endpoints health on restbase2009 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:40:48] moritzm: --^ is that you upgrading ?
[10:41:36] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[10:41:46] elukey: I don't think so. cassandra on restbase1014 was restarted with c-foreach-restart, but I don't think that's related
[10:41:55] ^godog, any idea about the error message?
[10:42:28] mhh no I haven't seen that before
[10:42:30] checking one of the hosts
[10:42:31] (CR) jenkins-bot: db-eqiad.php: Depool db1085 [mediawiki-config] - https://gerrit.wikimedia.org/r/355193 (https://phabricator.wikimedia.org/T164530) (owner: Marostegui)
[10:43:32] seems Error: Operation timed out - received only 1 responses. from restbase
[10:43:34] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Depool db1085 - T164530 (duration: 00m 38s)
[10:43:36] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[10:43:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:43:42] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[10:44:36] PROBLEM - restbase endpoints health on restbase2011 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:46:05] (PS1) Marostegui: db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530)
[10:46:36] PROBLEM - restbase endpoints health on restbase2002 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received
[10:47:09] godog: could it be that the instances on restbase1014 going down have caused some read to quorum to fail ?
[10:48:26] RECOVERY - restbase endpoints health on restbase2012 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase2011 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase2010 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase2002 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase1016 is OK: All endpoints are healthy
[10:48:27] RECOVERY - restbase endpoints health on restbase1008 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase2004 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase1018 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase2006 is OK: All endpoints are healthy
[10:48:36] RECOVERY - restbase endpoints health on restbase1009 is OK: All endpoints are healthy
[10:48:37] RECOVERY - restbase endpoints health on restbase2005 is OK: All endpoints are healthy
[10:48:37] RECOVERY - restbase endpoints health on restbase1007 is OK: All endpoints are healthy
[10:48:37] RECOVERY - restbase endpoints health on restbase1012 is OK: All endpoints are healthy
[10:48:38] RECOVERY - restbase endpoints health on restbase2008 is OK: All endpoints are healthy
[10:48:38] RECOVERY - restbase endpoints health on restbase1010 is OK: All endpoints are healthy
[10:48:39] RECOVERY - restbase endpoints health on restbase2007 is OK: All endpoints are healthy
[10:48:39] RECOVERY - restbase endpoints health on restbase2009 is OK: All endpoints are healthy
[10:48:40] RECOVERY - restbase endpoints health on restbase1011 is OK: All endpoints are healthy
[10:48:40] RECOVERY - restbase endpoints health on restbase2001 is OK: All endpoints are healthy
[10:48:41] RECOVERY - restbase endpoints health on restbase2003 is OK: All endpoints are healthy
[10:48:51] elukey: that might be yeah, in fact it just recovered, did restbase say what hosts was expecting a reply but didn't?
[10:49:37] I can see a lot of Error: Operation timed out - received only 1 responses
[10:50:00] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[10:53:10] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[10:53:19] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1085, depool db1088 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355197 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[10:54:27] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1085, depool db1088 - T164530 (duration: 00m 38s)
[10:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:36] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[10:55:45] (03PS1) 10Marostegui: db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530)
[10:56:40] <_joe_> elukey, godog that error means that service-checker could not obtain a response within 5 seconds for that request
[10:57:04] <_joe_> so it seems a backend timeout for restbase
[10:58:47] indeed
[10:58:59] yep
[11:00:18] but I was wondering if the cassandra restarts caused some reads to fail the quorum
[11:00:22] returning errors
[11:02:03] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:06:43] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:06:53] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1088, depool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355198 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:07:36] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1088, depool db1093 - T164530 (duration: 00m 38s)
[11:07:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:07:44] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[11:09:09] (03PS1) 10Marostegui: db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530)
[11:09:44] (03PS3) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:10:29] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:10:33] (03PS4) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:11:18] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:11:33] (03PS5) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:12:13] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:12:21] (03CR) 10Marostegui: [C: 032] db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:18:40] (03Merged) 10jenkins-bot: db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:18:52] (03CR) 10jenkins-bot: db-eqiad.php: Repool db1093 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355202 (https://phabricator.wikimedia.org/T164530) (owner: 10Marostegui)
[11:19:32] !log marostegui@tin Synchronized wmf-config/db-eqiad.php: Repool db1093 - T164530 (duration: 00m 38s)
[11:19:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:19:41] T164530: Deploy uniqueness constraints on ores_classification table - https://phabricator.wikimedia.org/T164530
[11:34:19] (03PS6) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:35:12] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:42:19] (03CR) 10Filippo Giunchedi: [C: 031] Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416 (owner: 10Volans)
[11:42:50] <_joe_> !log pushed calico/kube-policy-controller:0.6.0 to the docker registry
[11:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:46] (03PS7) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:44:01] <_joe_> !log pushed calico/node:1.2.0 to the docker registry
[11:44:07] (03PS8) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:44:57] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:48:48] (03PS9) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli Failing to a different write policy happens silently in megacli checks (for example, if BBU is flat, damaged, too hot, etc.). In some hosts (databases), a policy change means horrible perf [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:49:40] (03CR) 10jerkins-bot: [V: 04-1] raid-check: Return critical when not in WriteBack mode for megacli Failing to a different write policy happens silently in megacli checks (for example, if BBU is flat, damaged, too hot, etc.). In some hosts (databases), a policy change means horrible perf [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:51:25] <_joe_> !log uploaded calicoctl 1.2.0-1~wmf1 to jessie-wikimedia
[11:51:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:51:52] <_joe_> !log uploaded calico-cni 1.8.3-1~wmf1 to jessie-wikimedia
[11:51:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:53:19] (03PS10) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli Failing to a different write policy happens silently in megacli checks (for example, if BBU is flat, damaged, too hot, etc.). In some hosts (databases), a policy change means horrible perf [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[11:55:44] (03CR) 10Jcrespo: [C: 031] "root@db1015:~$ megacli -LDSetProp WT -L0 -a0" [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:56:04] !log set vm.dirty_backround_bytes=25165824 on aqs1004 as part of testing for https://gerrit.wikimedia.org/r/#/c/354107 (Rollback: set vm.dirty_backround_ratio=10)
[11:56:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:56:54] (03CR) 10Elukey: "Set vm.dirty_backround_bytes=25165824 on aqs1004 as test. Will come back to this code review in a day or two." [puppet] - 10https://gerrit.wikimedia.org/r/354107 (owner: 10Giuseppe Lavagetto)
[11:57:11] (03CR) 10Marostegui: [C: 031] "Thanks for putting this together so quickly!" [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:57:41] (03CR) 10Jcrespo: [C: 031] "I checked analytics hosts and swift hosts with dell, and all seem to be using writeback, and will benefit from this." [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[11:58:11] (03PS11) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[12:09:27] !log uploaded hhvm 3.18.2+dfsg-1+wmf4 to apt.wikimedia.org (contains extended upstream fix for XML reader crash) (T162586)
[12:09:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:09:37] T162586: HHVM segfault in memory cleanup - https://phabricator.wikimedia.org/T162586
[12:09:56] !log joal@tin Started deploy [analytics/refinery@222d0c0]: (no justification provided)
[12:10:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:13:52] !log joal@tin Finished deploy [analytics/refinery@222d0c0]: (no justification provided) (duration: 03m 56s)
[12:14:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:12] !log joal@tin Started deploy [analytics/refinery@679aeea]: Weekly deploy (with 2 weeks late, big deploy)h
[12:20:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:55] !log upgrading mw1261-mw1265 to hhvm 3.18.2+dfsg-1+wmf4
[12:21:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:22:46] PROBLEM - Disk space on stat1002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=91%)
[12:23:20] joal: ill fix that deploy log to remove the extra h for you :)
[12:23:38] Thanks a lot Zppix - Sorry for the mess
[12:24:01] joal: hey i do it all the time no worries
[12:24:08] :)
[12:24:36] !log joal@tin Finished deploy [analytics/refinery@679aeea]: Weekly deploy (with 2 weeks late, big deploy)h (duration: 04m 24s)
[12:24:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:01] Will fix ^
[12:25:06] Give me a min
[12:25:30] fixing stat1002 :)
[12:27:47] RECOVERY - Disk space on stat1002 is OK: DISK OK
[12:31:00] Is this week a norm deploy schedule... i know what wikitech says but i want human verification
[12:31:26] yes
[12:31:55] Alright thanks MatmaRex o/
[12:32:47] Zppix: you've something to deploy at the next EU SWAT?
[12:33:29] Dereckson: nope just verifying incase something comes up at all this week
[12:33:46] Dereckson: thanks for the concern however
[12:33:48] ok
[12:35:01] (03PS1) 10Dereckson: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208
[12:36:59] (03PS2) 10Dereckson: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121)
[12:38:13] !log joal@tin Started deploy [analytics/refinery@679aeea]: Weekly deploy (2 weeks late, big deploy)-2
[12:38:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:39:48] !log joal@tin Finished deploy [analytics/refinery@679aeea]: Weekly deploy (2 weeks late, big deploy)-2 (duration: 01m 35s)
[12:39:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:10] (03PS2) 10Dereckson: Apache: add techconduct.wm.o to remnant sites [puppet] - 10https://gerrit.wikimedia.org/r/354959 (https://phabricator.wikimedia.org/T165977)
[12:43:13] (03CR) 10Dereckson: Apache: add techconduct.wm.o to remnant sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/354959 (https://phabricator.wikimedia.org/T165977) (owner: 10Dereckson)
[12:45:55] !log elukey@tin Started deploy [analytics/refinery@679aeea]: (no justification provided)
[12:45:56] !log elukey@tin Finished deploy [analytics/refinery@679aeea]: (no justification provided) (duration: 00m 01s)
[12:46:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:08] jouncebot: refresh
[12:46:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:46:24] I refreshed my knowledge about deployments.
[12:46:46] !log elukey@tin Started deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment
[12:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:28] !log elukey@tin Finished deploy [analytics/refinery@679aeea]: Updated stat1002 with the last refinery deployment (duration: 00m 42s)
[12:47:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:47:46] PROBLEM - Disk space on elastic1021 is CRITICAL: DISK CRITICAL - free space: /srv 61487 MB (12% inode=99%)
[12:49:33] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3286077 (10ayounsi) >>! In T165614#3271670, @BBlack wrote: > 1. Probably the reason for a lack of neighbors is that some (most?) of the switches don't blanket-enable LLDP for all ports. They explicitly list ce...
[12:51:39] (03PS1) 10Mark Bergsma: Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213
[12:51:41] (03PS1) 10Mark Bergsma: Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214
[12:52:36] joal: http://wikitech.wikimedia.org/wiki/special:Diff/1760329
[12:53:17] Thanks Zppix :)
[12:53:44] joal: hey no problem, it gives me something to do :)
[12:54:09] I don't know why but I can't imagine you have nothing to do ;)
[12:54:11] Zppix: --^
[12:54:31] joal: i mean 90% of time i dont
[13:00:05] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1300).
[13:00:05] Urbanecm and Dereckson: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be available during the process.
[13:00:15] Okay I can SWAT.
[13:00:24] Urbanecm: ping?
[13:00:28] I'm here
[13:01:43] I am idling around if needed
[13:01:53] (03Abandoned) 10Dereckson: Set wgSemiprotectedRestrictionLevels for de.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/282471 (https://phabricator.wikimedia.org/T132249) (owner: 10Dereckson)
[13:02:25] (03PS1) 10BBlack: RPS Cleanup 1/4: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216
[13:02:27] (03PS1) 10BBlack: RPS cleanup 2/4: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217
[13:02:29] (03PS1) 10BBlack: RPS cleanup 3/4: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218
[13:02:31] (03PS1) 10BBlack: RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219
[13:02:34] (03PS2) 10Dereckson: Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:02:57] (03CR) 10Dereckson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:06:46] PROBLEM - MariaDB Slave Lag: s7 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 852.26 seconds
[13:06:56] PROBLEM - HP RAID on ms-be1035 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:06:56] PROBLEM - HP RAID on ms-be1038 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:07:16] PROBLEM - HP RAID on ms-be1037 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:07:29] yeah yeah, rebalance
[13:07:33] silencing
[13:07:46] PROBLEM - HP RAID on ms-be1032 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:10:27] hashar: we're blocked waiting an operations-mw-config-composer-hhvm-jessie if the 'idling around' was for expected CI congestion
[13:12:30] (03CR) 10jerkins-bot: [V: 04-1] RPS cleanup 3/4: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218 (owner: 10BBlack)
[13:12:46] (03CR) 10jerkins-bot: [V: 04-1] RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 (owner: 10BBlack)
[13:16:46] RECOVERY - Disk space on elastic1021 is OK: DISK OK
[13:18:56] PROBLEM - HP RAID on ms-be1033 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[13:19:09] (03Merged) 10jenkins-bot: Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:19:18] (03CR) 10jenkins-bot: Add *.esa.int to CopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/352620 (https://phabricator.wikimedia.org/T164643) (owner: 10Urbanecm)
[13:19:29] Here we are.
[13:20:00] live on mwdebug1002
[13:23:15] Okay, works fine.
[13:23:58] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Add *.esa.int to CopyUploadsDomains (T164643) (duration: 00m 39s)
[13:24:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:24:05] T164643: Please add esamultimedia.esa.int to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T164643
[13:27:33] (03PS1) 10Filippo Giunchedi: hieradata: move webrequest 5xx to logstash.svc [puppet] - 10https://gerrit.wikimedia.org/r/355220 (https://phabricator.wikimedia.org/T149451)
[13:29:04] godog, talking about raid checks: https://gerrit.wikimedia.org/r/355190
[13:29:53] we had numerous cases where a policy change caused a performance hit
[13:30:16] you and analytics are the main other user of this kind of raid
[13:31:37] jynus: nice, so essentially we always want WB
[13:31:48] I would assume
[13:32:01] and it will ping us id a BBU is faulty
[13:32:04] *if
[13:32:14] and the policy is wt if bbu error
[13:32:32] later we can have better BBU monitoring, but that is hard
[13:32:51] this will avoid me (us) thinking "why mysql is slow?"
[13:33:11] not sure if you suffered "why swift is slow"?
[13:34:39] and I will look into doing the same for hp
[13:34:42] not in the same pattern no, there aren't many writes anyways, the write intensive dbs are on ssd
[13:34:50] ah, ok
[13:35:01] for us, it is the difference betwenn working and lagging
[13:35:14] and happened very often for older hosts
[13:36:12] tell me if you want to test it more, or I can try deploying it and monitoring the results
[13:37:12] !log Run CleanDuplicateScores script to clean up possible duplicates on fawiki before starting to create the UNIQUE keys - https://phabricator.wikimedia.org/T164530
[13:37:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:52] (03CR) 10Dereckson: [C: 032] Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[13:41:41] (03CR) 10Filippo Giunchedi: [C: 031] "LGTM, doesn't look like this is rebased on production yet though" [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[13:41:57] jynus: ^
[13:42:03] jynus: not a big expert so I can't really give you any good review, but it seems good from what I can read. We principally use raid 0 with single disks on the analytics hosts, so it might affect us as well
[13:44:06] what do you mean rebased on production?
[13:44:35] elukey: you want wb, I would assume, too?
[13:45:10] (03PS1) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223
[13:45:22] basically, caching writes based on the promise that the memory with the battery will assure no writes are lost
[13:45:50] increasing performance of the underlying disks
[13:46:32] (03CR) 10jerkins-bot: [V: 04-1] interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 (owner: 10BBlack)
[13:47:16] 06Operations, 10netops, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3286251 (10ayounsi) I went ahead and ignored those hosts for this specific alert (using T133852#3251556 ) Please reopen the task if needed t...
[13:47:26] 06Operations, 10netops, 13Patch-For-Review: analytics hosts frequently tripping 'port utilization threshold' librenms alerts - https://phabricator.wikimedia.org/T133852#3286252 (10ayounsi) 05Open>03Resolved
[13:47:59] it is a genouine question, I rebased from head, maybe I did something wrong?
[13:48:23] yep yep I think this is what we need, write through doesn't seem to be ideal, checking the current config
[13:49:03] I can revert this easily if there is any issue, and it doesn't page, so it is "ok"
[13:50:03] 06Operations, 10netops: JSNMP flood of errors across multiple switches - https://phabricator.wikimedia.org/T83898#3286267 (10ayounsi) 05Open>03Resolved a:03ayounsi That's not happening anymore.
[13:50:14] maybe it is the topic? I used mysql because it is in direct response for T166108 incident
[13:50:14] T166108: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108
[13:50:47] (03CR) 10Gehel: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/355220 (https://phabricator.wikimedia.org/T149451) (owner: 10Filippo Giunchedi)
[13:50:52] Current Cache Policy: WriteThrough
[13:50:56] oh
[13:51:00] where do you have that?
[13:51:10] this is spot checking on one analytics host..
[13:51:14] I did not find any host with that
[13:51:19] elukey@analytics1033:~$ sudo megacli -LDPDInfo -aAll | grep Cache
[13:51:23] maybe the wrong command
[13:51:28] no, no, it is ok
[13:51:30] (I mean, from my side)
[13:52:00] but I checked other analytics hosts and had WB on them
[13:52:14] so either it is a mistake, and this patch will detect that
[13:52:30] or it is intended, and the patch may be not what it is intended
[13:52:55] So afaik our set up is a bit weird (at lesast for me), we have multiple Virtual Drives (12) with one disk in raid0 each..
[13:53:27] I always assumed it was a way to have a sort of JBOD
[13:53:46] (03PS2) 10BBlack: RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219
[13:53:48] (03PS2) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223
[13:54:05] but now I don't understand why we don't use WB, that should be a good thing even for this weird use case
[13:54:06] (03CR) 10Filippo Giunchedi: [C: 032] hieradata: move webrequest 5xx to logstash.svc [puppet] - 10https://gerrit.wikimedia.org/r/355220 (https://phabricator.wikimedia.org/T149451)
[13:54:22] nice godog ---^ \o/
[13:54:30] python check-raid.py says WARNING: no known controller found
[13:55:08] (on analytics1040 I see WB set)
[13:55:15] but I see OK: optimal, 13 logical, 14 physical
[13:55:26] let me review the execution parameters
[13:55:42] you should not vote +1 in this case
[13:55:50] vote -1 and we should research more
[13:56:04] elukey: \o/ but yeah each disk is indeed in raid0 to do sort of JBOD
[13:56:16] (03CR) 10jerkins-bot: [V: 04-1] RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 (owner: 10BBlack)
[13:56:23] this is not a popularity contest, it is just coordination!
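The spot-checking being done by hand above amounts to grepping `megacli -LDPDInfo` output for the "Current Cache Policy" line of each logical drive and flagging anything that is not WriteBack. A minimal illustrative sketch of that idea (hypothetical code, not the actual check-raid.py):

```python
# Count logical drives whose current write policy is not WriteBack
# in the output of `megacli -LDPDInfo -aAll`. A flat or faulty BBU
# typically flips the controller to WriteThrough silently.

def non_writeback_lds(megacli_output: str) -> int:
    bad = 0
    for line in megacli_output.splitlines():
        if line.startswith("Current Cache Policy:"):
            policy = line.split(":", 1)[1].strip()
            if not policy.startswith("WriteBack"):
                bad += 1
    return bad

sample = """\
Current Cache Policy: WriteBack, ReadAdaptive, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteThrough, ReadAdaptive, Direct, No Write Cache if Bad BBU
"""
# One LD has fallen back to WriteThrough, so a check built on this
# would report it as CRITICAL.
assert non_writeback_lds(sample) == 1
```

Only the first comma-separated item (the write policy) is compared, matching the point made later in the conversation that ReadAdaptive/ReadAhead and the Bad-BBU fallback setting are not write policies.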
[13:56:30] ahahhaha
[13:56:43] sure sure but I think that it is good even for analytics to have a WB check
[13:56:45] (03CR) 10jerkins-bot: [V: 04-1] interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 (owner: 10BBlack)
[13:56:52] oh, I was wrong
[13:56:58] I lacked root permissions
[13:57:01] answer is
[13:57:27] P5477
[13:57:33] will check "sudo megacli -LDPDInfo -aAll | grep "Current Cache Policy" | uniq -c" in the meantime across the analytics hosts
[13:57:40] oh, no expansion by bugbot
[13:57:45] https://phabricator.wikimedia.org/P5477
[13:58:07] the important thing here is to know what that *should* be
[13:58:23] and adjust the cleck accordingly
[13:58:26] *check
[13:59:21] (03CR) 10Ema: [C: 031] Add BGPUpdateMessage test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 (owner: 10Mark Bergsma)
[13:59:41] with analytics hosts, are they hadoop?
[13:59:42] (03CR) 10Ema: [C: 031] Add GPLv2 header to bgp/ip.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/355152 (owner: 10Mark Bergsma)
[13:59:49] or what, mostly?
[14:00:35] yeah hadoop workers
[14:00:37] sorry to bother you with this, but this check is quite imporatant for us- to get right, not to rush
[14:00:47] nono sorry to slow you down, I am currently checking
[14:00:51] this is a good thing
[14:00:54] not at all
[14:00:58] you are helping
[14:01:02] not slowing down
[14:01:10] I thank you for that
[14:01:23] and if this detects a misconfig for you, you also win something :-)
[14:01:59] so WT seems only on an1033 so I was really lucky to find it :D
[14:02:01] if it is intended, we can add an option so that certain hosts
[14:02:17] (03PS2) 10BBlack: RPS Cleanup 1/4: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216
[14:02:17] do not check this, or check a different policy
[14:02:19] (03PS2) 10BBlack: RPS cleanup 2/4: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217
[14:02:21] (03PS2) 10BBlack: RPS cleanup 3/4: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218
[14:02:23] (03PS3) 10BBlack: RPS cleanup 4/4: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219
[14:02:25] (03PS3) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223
[14:03:29] The only inconsistency that I see now is ReadAdaptive/ReadAhead, Write Cache OK if Bad BBU / No Write Cache if Bad BBU
[14:03:46] do you have a host list?
[14:03:50] !log installing nutcracker update in codfw (T163795)
[14:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:03:58] T163795: Nutcracker doesn't start at boot - https://phabricator.wikimedia.org/T163795
[14:04:04] sure, let me grab the cumin command
[14:04:16] wait, those are not write policies
[14:04:21] I do not check for those
[14:04:25] only the first item
[14:04:45] super, so I'd say that only an1033 is misconfigured
[14:04:52] I'll fix it straight away
[14:05:03] now I am going to review the other non hadoop workers
[14:05:13] to see if I have "creative" configurations
[14:05:25] (03CR) 10Ema: [C: 031] Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213 (owner: 10Mark Bergsma)
[14:05:33] (03CR) 10Ema: [C: 031] Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214 (owner: 10Mark Bergsma)
[14:07:19] alternatively, we can deploy this- if we get errors, I revert, if we get few errors, we can check only those
[14:07:40] I didn't check every single host, but I checked most of them
[14:08:37] (03CR) 10Elukey: [C: 031] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[14:08:44] there you go :)
[14:09:07] I will monitor this closely
[14:09:13] me too, thanks!
[14:09:16] if there is something weird, I will revert
[14:09:26] !log re-enabling BGP session to Init7 - T165288
[14:09:29] (03PS12) 10Jcrespo: raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108)
[14:09:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:09:34] T165288: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288
[14:09:35] (03CR) 10Ema: [C: 031] RPS Cleanup 1/4: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216 (owner: 10BBlack)
[14:10:03] I prefer having something, even imperfect
[14:12:20] (03PS3) 10Dereckson: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121)
[14:12:25] (03CR) 10Dereckson: [V: 032 C: 032] Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:12:52] (03CR) 10Dereckson: [C: 032] "SWAT, take two" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:14:13] 06Operations, 10netops: Report of esams unreachable from Fastweb/Init7 - https://phabricator.wikimedia.org/T165288#3286326 (10ayounsi) 05Open>03Resolved From Init7: ``` Update: 2017.05.17 09:00 (CEST) First link to AMS-IX has been enabled. Update: 2017.05.18 09:40 (CEST) Second link to AMS-IX has been enab...
[14:14:21] (03CR) 10Jcrespo: [C: 032] raid-check: Return critical when not in WriteBack mode for megacli [puppet] - 10https://gerrit.wikimedia.org/r/355190 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[14:15:36] (03CR) 10Hashar: [C: 031] ci: Docker registry for container builds [puppet] - 10https://gerrit.wikimedia.org/r/345422 (https://phabricator.wikimedia.org/T161657) (owner: 10Dduvall)
[14:15:51] (03CR) 10Hashar: [C: 031] [DO NOT MERGE] ci: Experimental k8s cluster for ci [puppet] - 10https://gerrit.wikimedia.org/r/345192 (https://phabricator.wikimedia.org/T159864) (owner: 10Dduvall)
[14:17:18] (03Merged) 10jenkins-bot: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:17:27] (03CR) 10jenkins-bot: Enable NewUserMessage on dty.wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355208 (https://phabricator.wikimedia.org/T166121) (owner: 10Dereckson)
[14:21:30] !log dereckson@tin Synchronized wmf-config/InitialiseSettings.php: Enable NewUserMessage on dty.wikipedia (T166121) (duration: 00m 38s)
[14:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:21:38] T166121: Enable Extension:NewUserMessage on Doteli Wikipedia - https://phabricator.wikimedia.org/T166121
[14:21:59] !log EU SWAT done
[14:22:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:22] !log deploying new check_raid monitoring write policy for megacli T166108
[14:25:28] \o/
[14:25:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:25:29] T166108: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108
[14:27:37] already rolled on some hosts, looking good so far
[14:27:59] tin is failing
[14:28:36] failing with what?
[14:28:40] lol [14:28:50] CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:29:14] ah XD [14:29:20] not necessarily a check error [14:29:27] (03CR) 10Mark Bergsma: [C: 032] Add BGPUpdateMessage test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 (owner: 10Mark Bergsma) [14:31:23] 06Operations, 10ops-eqiad, 10netops: ripe-atlas-eqiad is down - https://phabricator.wikimedia.org/T163243#3286368 (10faidon) OK, I see the prompt in the console: ``` CentOS release 6.9 (Final) Kernel 2.6.32-696.1.1.el6.x86_64 on an x86_64 us-qas-as14907.anchors.atlas.ripe.net login: ``` We don't have the... [14:31:25] it could make sense there [14:31:58] there, disk performance is not that important, and consistency is a must [14:32:41] dataset1001 has the same issue, although there it could be more doubtful [14:33:31] not reverting yet, as it will probably only hit 2-3 hosts [14:33:42] (for now) [14:33:54] yeah, let's wait a bit more [14:34:10] is apergos around? [14:34:49] (03Merged) 10jenkins-bot: Add BGPUpdateMessage test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355123 (owner: 10Mark Bergsma) [14:34:56] (03PS8) 10Fdans: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) [14:35:35] I think tin makes sense (there is no BBU present there - or I cannot see it) [14:35:47] no BBU? [14:36:01] Either no BBU or I cannot see it [14:36:03] then we should be able to change the policy [14:36:06] PROBLEM - Check Varnish expiry mailbox lag on cp1074 is CRITICAL: CRITICAL: expiry mailbox lag is 2026842 [14:36:14] with no effect, wouldn't we?
[14:36:37] sorry, I was thinking of the cache [14:36:42] not the battery [14:36:47] forget what I just said [14:37:06] PROBLEM - MegaRAID on tin is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:37:07] ACKNOWLEDGEMENT - MegaRAID on tin is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166136 [14:37:10] 06Operations, 10ops-eqiad: Degraded RAID on tin - https://phabricator.wikimedia.org/T166136#3286375 (10ops-monitoring-bot) [14:37:19] lol [14:37:20] jynus: hehe :) [14:37:27] ahahah [14:37:37] these bots are taking over our jobs [14:37:39] automation to the rescue :-P [14:38:17] so, because we have 2 different scripts [14:38:26] we have to change the other, too [14:38:34] :-( [14:38:36] I always imagine the ops bot with volans face saying: This RAID isn't good, amigo! [14:38:57] PROBLEM - MegaRAID on dataset1001 is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:38:58] ACKNOWLEDGEMENT - MegaRAID on dataset1001 is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166137 [14:39:01] 06Operations, 10ops-eqiad: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T166137#3286381 (10ops-monitoring-bot) [14:39:15] can we temporarily kill that bot? [14:39:16] btw jynus one of the things I was trying to say before but failed to explain myself, was that we should check that there is physically a BBU (not faulty) [14:39:40] it's not a bot it's called by icinga as a handler [14:39:53] who is the active one, tegmen or einsteinium?
[14:40:33] it used to be tegmen, but double check [14:40:54] analytics1039 is next [14:41:13] db1046, but that is a genuine problem [14:41:31] helium [14:41:37] which is backups, so makes sense [14:41:37] and restbase hosts [14:41:41] let's revert [14:41:50] sounds good [14:41:51] and check conditionally [14:42:07] (03PS1) 10Jcrespo: Revert "raid-check: Return critical when not in WriteBack mode for megacli" [puppet] - 10https://gerrit.wikimedia.org/r/355228 [14:42:17] (03CR) 10Jcrespo: [V: 032 C: 032] Revert "raid-check: Return critical when not in WriteBack mode for megacli" [puppet] - 10https://gerrit.wikimedia.org/r/355228 (owner: 10Jcrespo) [14:42:24] dataset1001 looks like it has a healthy BBU though, so maybe the configuration is simply wrong there :) [14:42:48] !log temporarily disabled raid_handler and puppet on tegmen [14:42:55] it is ok [14:42:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:44:05] so we create a parameter and use it only on db, es, pc, ms-be and analytics hosts? [14:44:32] or you detect if a BBU is physically present, not sure if it's tricky [14:44:46] no [14:44:51] even if BBU is there [14:45:07] there are some hosts where I don't see a problem with having the slowest option [14:45:14] think tin and helium [14:45:19] probably others [14:45:26] Yeah, I think the parameter is a better approach indeed.
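[Editor's note] The flood of `CRITICAL: N LD(s) not in WriteBack policy (WriteThrough)` alerts above comes down to scraping the write policy out of megacli's per-logical-drive output. A minimal sketch of that parsing step, in the Nagios exit-code convention the check uses (the `Current Cache Policy:` line format is an assumption based on typical `megacli -LDInfo` output; the production `check-raid.py` may differ):

```python
import re

def check_write_policy(ldinfo_output, wanted="WriteBack"):
    """Count logical drives whose current cache policy differs from
    the wanted write policy. Returns (exit_code, message) where
    0 = OK and 2 = CRITICAL, following the Nagios plugin convention.
    """
    # Assumed line format, one per logical drive:
    #   Current Cache Policy: WriteThrough, ReadAheadNone, ...
    policies = re.findall(r"Current Cache Policy:\s*(\w+)", ldinfo_output)
    bad = [p for p in policies if p != wanted]
    if bad:
        return 2, "CRITICAL: %d LD(s) not in %s policy (%s)" % (
            len(bad), wanted, ", ".join(bad))
    return 0, "OK: all %d LD(s) in %s policy" % (len(policies), wanted)
```

With one WriteThrough drive in the input, this reproduces the exact message icinga was spamming during the incident.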
like dataset1001 it has WT, and it might be a good idea there (I don't know) [14:45:46] PROBLEM - MegaRAID on db1046 is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:45:46] PROBLEM - MegaRAID on analytics1039 is CRITICAL: CRITICAL: 13 LD(s) not in WriteBack policy (WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough) [14:45:47] PROBLEM - MegaRAID on prometheus2003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:45:56] PROBLEM - MegaRAID on analytics1033 is CRITICAL: CRITICAL: 13 LD(s) not in WriteBack policy (WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough, WriteThrough) [14:46:13] prometheus is probably a mistake, assuming the right hardware? [14:46:35] (03CR) 10Mforns: [C: 031] "LGTM! :]" [puppet] - 10https://gerrit.wikimedia.org/r/353265 (https://phabricator.wikimedia.org/T156933) (owner: 10Elukey) [14:46:56] PROBLEM - DPKG on thorium is CRITICAL: DPKG CRITICAL dpkg reports broken packages [14:47:16] PROBLEM - MegaRAID on helium is CRITICAL: CRITICAL: 1 LD(s) not in WriteBack policy (WriteThrough) [14:47:22] marostegui: if you have free time, can you give a look at db1046? 
[14:47:27] yes [14:47:31] I will run puppet to recover [14:47:46] PROBLEM - MegaRAID on restbase-test2002 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:48:13] 06Operations, 10Analytics, 15User-Elukey: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140#3286412 (10elukey) [14:48:46] PROBLEM - MegaRAID on restbase-test2003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:49:14] (03PS9) 10Ottomata: Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [14:49:29] jynus: I am here now [14:49:49] the thing here is to create some tickets to review config [14:49:54] PROBLEM - MegaRAID on rdb1003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:50:17] and differentiate between intended configurations and mistaken configurations [14:50:21] jynus: I am creating the task for db1046 [14:50:48] db1046 for sure is not intended [14:51:03] the other thing is if we can change it, due to BBU issues [14:51:31] (03CR) 10Hashar: "That is a good first pass!!"
(032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/344798 (https://phabricator.wikimedia.org/T161416) (owner: 10Dereckson) [14:52:14] ^ thorium is fine, jdk update in progress [14:52:41] or if I am a jerk and lazy, we enable it on databases, and "Every man for himself" [14:53:20] I would bet most of those, with few exceptions, are misconfigs [14:55:34] PROBLEM - MegaRAID on rdb1004 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:55:44] RECOVERY - MegaRAID on analytics1039 is OK: OK: optimal, 13 logical, 14 physical [14:56:04] RECOVERY - MegaRAID on analytics1033 is OK: OK: optimal, 13 logical, 14 physical, WB policy [14:56:05] PROBLEM - MegaRAID on prometheus1003 is CRITICAL: CRITICAL: 2 LD(s) not in WriteBack policy (WriteThrough, WriteThrough) [14:56:23] (03PS1) 10Ema: bgp.ip: do not crash on unicode strings [debs/pybal] - 10https://gerrit.wikimedia.org/r/355229 [14:56:42] !log otto@tin Started deploy [eventlogging/analytics@25f8096]: (no justification provided) [14:56:46] !log otto@tin Finished deploy [eventlogging/analytics@25f8096]: (no justification provided) (duration: 00m 04s) [14:56:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:59] jynus: in my case you discovered a big misconfig mess in analytics* hosts, so thanks :) [14:57:04] RECOVERY - MegaRAID on tin is OK: OK: optimal, 1 logical, 2 physical [14:57:15] (03CR) 10Ottomata: [C: 032] Add bot filter to mysql consumer [puppet] - 10https://gerrit.wikimedia.org/r/352582 (https://phabricator.wikimedia.org/T67508) (owner: 10Fdans) [14:57:54] RECOVERY - DPKG on thorium is OK: All packages OK [14:59:04] RECOVERY - MegaRAID on dataset1001 is OK: OK: optimal, 3 logical, 36 physical [14:59:09] elukey: what would be the best way to proceed, do I create a megaticket, one ticket per group? 
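[Editor's note] The alternative discussed above to a per-role parameter was to detect whether a healthy BBU is physically present before demanding WriteBack. A hedged sketch of that heuristic, assuming the field names of typical `MegaCli -AdpBbuCmd -GetBbuStatus` output (the real output varies by controller and firmware, which is part of why the parameter was preferred):

```python
def has_healthy_bbu(bbu_status_output):
    """Heuristic BBU presence/health check.

    Assumptions (not verified against production hardware):
    - a present BBU reports a "Battery State" field;
    - a faulty one reports "Battery Replacement required : Yes".
    """
    text = bbu_status_output.lower()
    if "battery state" not in text:
        # No battery reported at all (e.g. tin in this incident).
        return False
    return "battery replacement required : yes" not in text
```

Note this still would not solve hosts like helium or tin, where WriteThrough is intended even when a BBU exists, hence the parameter approach.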
[15:00:00] 06Operations, 10Analytics, 10DBA: db1046 BBU looks faulty - https://phabricator.wikimedia.org/T166141#3286450 (10Marostegui) [15:00:14] (03CR) 10Paladox: "Needs uploading to apt.wikimedia.org :)" [debs/gerrit] - 10https://gerrit.wikimedia.org/r/355155 (owner: 10Chad) [15:01:13] jynus: I already created a task, it might be better to collect them in a tracking one.. what do you think? [15:01:28] please share the number [15:01:29] T166140 [15:01:29] T166140: Review Megacli Analytics Hadoop workers settings - https://phabricator.wikimedia.org/T166140 [15:01:32] thanks [15:01:44] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: 
/{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:44] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:45] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:45] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:46] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:46] PROBLEM - mobileapps endpoints 
health on scb1002 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:47] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: NoneType object has no attribute __getitem__) [15:01:58] that is not me [15:01:58] !log otto@tin Started deploy [eventlogging/analytics@UNKNOWN]: (no justification provided) [15:02:00] !log otto@tin Finished deploy [eventlogging/analytics@UNKNOWN]: (no justification provided) (duration: 00m 02s) [15:02:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:34] (03CR) 10Filippo Giunchedi: "LGTM, it could also include Bug: T165043" [puppet] - 10https://gerrit.wikimedia.org/r/355110 (owner: 10Muehlenhoff) [15:05:18] (03PS1) 10Jcrespo: Revert "Revert "raid-check: Return critical when not in WriteBack mode for megacli"" [puppet] - 10https://gerrit.wikimedia.org/r/355231 [15:05:34] RECOVERY - MegaRAID on rdb1004 is OK: OK: optimal, 2 logical, 4 physical [15:05:44] RECOVERY - MegaRAID on db1046 is OK: OK: optimal, 1 logical, 2 physical [15:05:45] RECOVERY - MegaRAID on prometheus2003 is OK: OK: optimal, 2 logical, 6 physical [15:07:14] RECOVERY - MegaRAID on helium is OK: OK: optimal, 1 logical, 12 physical [15:07:44] RECOVERY - MegaRAID on restbase-test2002 is OK: OK: optimal, 2 logical, 2 physical [15:08:44] RECOVERY - MegaRAID on restbase-test2003 is OK: OK: optimal, 2 logical, 2 physical [15:09:54] RECOVERY - MegaRAID on rdb1003 is OK: OK: optimal, 2 logical, 4 physical [15:13:19] 
06Operations, 10ops-eqiad, 15User-Joe: rack/setup/install conf1004-conf1006 - https://phabricator.wikimedia.org/T166081#3286474 (10Joe) Racking request is just that these new machines go in different rows. They can even be in the racks of the other conf* systems as those old systems will be eventually decom... [15:16:04] RECOVERY - MegaRAID on prometheus1003 is OK: OK: optimal, 2 logical, 6 physical [15:17:26] (03PS3) 10Krinkle: varnish: Make errorpage.html balanced and use placeholder [puppet] - 10https://gerrit.wikimedia.org/r/350966 (https://phabricator.wikimedia.org/T113114) [15:17:33] (03PS4) 10Krinkle: varnish: Convert errorpage into re-usable template [puppet] - 10https://gerrit.wikimedia.org/r/350493 (https://phabricator.wikimedia.org/T113114) [15:17:35] 06Operations, 10ops-eqiad: Degraded RAID on tin - https://phabricator.wikimedia.org/T166136#3286485 (10Volans) 05Open>03Invalid This was a raid check false positive [15:17:54] 06Operations, 10ops-eqiad: Degraded RAID on dataset1001 - https://phabricator.wikimedia.org/T166137#3286488 (10Volans) 05Open>03Invalid This was a raid check false positive [15:18:04] gtk [15:21:58] volans: I think it can be enabled now [15:22:36] (03PS7) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114) [15:22:45] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3286505 (10Papaul) a:05akosiaris>03Papaul [15:22:57] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3286506 (10akosiaris) Racking proposal sounds fine! 
[15:23:33] !log re-enabled raid_handler and puppet on tegmen [15:23:37] jynus: done, thanks [15:23:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:23:56] 06Operations, 10ops-codfw: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3286507 (10Papaul) [15:24:07] this was not worthless, it found some interesting things [15:24:21] my question is if the script should be executed every time? [15:24:35] which script? [15:24:48] the one creating tasks [15:25:21] it acknowledges all errors, and that would be the opposite of what we want [15:25:42] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#3286512 (10fgiunchedi) [15:25:48] (03PS1) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:26:10] 06Operations, 10Wikimedia-Logstash, 13Patch-For-Review, 15User-Elukey, 15User-fgiunchedi: Get 5xx logs into kibana/logstash - https://phabricator.wikimedia.org/T149451#2753028 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi All done! Kibana dashboard https://logstash.wikimedia.org/app/kibana#/dashb... [15:26:13] or I should change it to create a task with different wording [15:26:19] jynus: it skips some errors, like timeout and connection refused [15:26:28] (03PS2) 10Gehel: logstash - apifeature indices need to be cleaned up [puppet] - 10https://gerrit.wikimedia.org/r/353560 [15:26:28] so if you prefer we could just make it skip this one too [15:26:31] as you want [15:26:32] so we can make it skip that one [15:26:35] that would work for me [15:27:38] is 'not in WriteBack policy' specific enough to match?
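[Editor's note] The handler-side fix volans proposes here is just a substring filter in `raid_handler.py`: check outputs matching any entry in SKIP_STRINGS never get an auto-filed Phabricator task. A simplified sketch of the idea (the entries and function below are illustrative, not the production list):

```python
# Hypothetical skip list; the real one lives near the top of
# raid_handler.py and is matched against the icinga check output.
SKIP_STRINGS = (
    "timed out",
    "connection refused",
    "not in WriteBack policy",  # the string added after this incident
)

def should_create_task(check_output):
    """Only file a task for outputs that match none of the skip strings."""
    return not any(s in check_output for s in SKIP_STRINGS)
```

The match string has to be specific enough not to appear in genuine degraded-RAID output, which is exactly the question asked above.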
[15:27:50] wait [15:28:09] don't change things yet, I have to redo the other script [15:28:38] I will ping you or add it myself once this is done [15:28:41] (03CR) 10Gehel: [C: 032] logstash - apifeature indices need to be cleaned up [puppet] - 10https://gerrit.wikimedia.org/r/353560 (owner: 10Gehel) [15:28:42] I'm not changing anything, you can do it yourself, line 22 in raid_handler.py, in SKIP_STRINGS [15:28:51] thanks [15:28:57] has to be something specific that doesn't appear in the other errors :D [15:29:11] yes, it should be easy [15:29:25] PROBLEM - HHVM rendering on mw2104 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [15:30:15] RECOVERY - HHVM rendering on mw2104 is OK: HTTP OK: HTTP/1.1 200 OK - 74095 bytes in 0.227 second response time [15:32:23] 06Operations, 10ops-eqiad, 15User-Elukey, 15User-Joe: rack and setup mw1307-1348 - https://phabricator.wikimedia.org/T165519#3286535 (10elukey) [15:32:38] anybody checking mobile apps? [15:32:48] 06Operations, 10Traffic, 10netops: LLDP on cache hosts - https://phabricator.wikimedia.org/T165614#3286536 (10ayounsi) LLDP added to all the interfaces in asw-ulsfo/eqiad. Already configured in codfw and esams. Which solves the first part of the issue above (minus the devices where lldp crashes). About the... [15:34:05] (03CR) 10DCausse: [C: 04-1] elasticsearch - deploy elasticsearch-curator along with elasticsearch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:34:38] mobrovac: around?
[15:35:03] /{domain}/v1/page/media/{title} (retrieve images and videos of en.wp Cat page via media route) is CRITICAL: Test retrieve images and videos of en.wp Cat page via media route responds with malformed body (TypeError: 'NoneType' object has no attribute '__getitem__'): [15:35:06] {} [15:36:24] (03PS1) 10Ottomata: Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced) [puppet] - 10https://gerrit.wikimedia.org/r/355238 (https://phabricator.wikimedia.org/T67508) [15:37:30] (03PS2) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:39:29] (03CR) 10DCausse: [C: 031] "lgtm, thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:39:46] (03PS3) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:40:21] (03CR) 10Ottomata: [C: 032] Remove use of is_not_bot filter in eventlogging mysql until code is fixed and change is cleared (announced) [puppet] - 10https://gerrit.wikimedia.org/r/355238 (https://phabricator.wikimedia.org/T67508) (owner: 10Ottomata) [15:40:49] (03CR) 10Filippo Giunchedi: [C: 031] elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:41:54] (03CR) 10Mark Bergsma: [C: 032] bgp.ip: do not crash on unicode strings [debs/pybal] - 10https://gerrit.wikimedia.org/r/355229 (owner: 10Ema) [15:42:06] (03PS4) 10Gehel: elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 [15:42:08] (03CR) 10Mark Bergsma: [C: 032] Add GPLv2 header to bgp/ip.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/355152 (owner: 10Mark Bergsma) [15:42:45] (03CR) 10Mark Bergsma: [C: 032] Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213 (owner: 10Mark 
Bergsma) [15:43:02] (03Merged) 10jenkins-bot: Add GPLv2 header to bgp/ip.py [debs/pybal] - 10https://gerrit.wikimedia.org/r/355152 (owner: 10Mark Bergsma) [15:43:05] (03CR) 10Mark Bergsma: [C: 032] Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214 (owner: 10Mark Bergsma) [15:43:16] (03Merged) 10jenkins-bot: Fix IPPrefix IPv6 string padding bug [debs/pybal] - 10https://gerrit.wikimedia.org/r/355213 (owner: 10Mark Bergsma) [15:43:34] (03CR) 10Gehel: [C: 032] elasticsearch - deploy elasticsearch-curator along with elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/355234 (owner: 10Gehel) [15:43:51] (03Merged) 10jenkins-bot: Add IPv4IP, IPv6IP and IPPrefix test cases [debs/pybal] - 10https://gerrit.wikimedia.org/r/355214 (owner: 10Mark Bergsma) [15:43:53] (03Merged) 10jenkins-bot: bgp.ip: do not crash on unicode strings [debs/pybal] - 10https://gerrit.wikimedia.org/r/355229 (owner: 10Ema) [15:50:05] !log Stop replication on dbstore1002 s7 thread for maintenance - T163190 [15:50:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:50:14] T163190: Run pt-table-checksum on s7 - https://phabricator.wikimedia.org/T163190 [15:55:15] urandom: you there? [15:55:58] elukey: ya [15:57:47] urandom: hello! There are a lot of CRITICALs for mobile apps but I don't get what's wrong [15:58:18] I always confuse where to check for service-checker-swagger test [16:00:04] godog, moritzm, and _joe_: Dear anthropoid, the time has come. Please deploy Puppet SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1600). [16:00:06] elukey: where to check? 
[16:00:31] 06Operations, 10ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3286585 (10fgiunchedi) [16:00:36] urandom: yeah where the tests are defined [16:01:27] elukey: oh, ttbmk, those are part of the swagger spec, and are tested by a script "locally" [16:01:41] yep but where are the defs ? [16:01:43] where "locally" here means the cluster the service is running on [16:01:59] I mean, it runs some tests but where are those defined ? [16:02:10] in any case, something seems wrong with mobile apps [16:02:12] in the swagger spec for that endpoint, let me dig [16:02:21] (03PS5) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [16:02:27] elukey: yeah, i know very little about this but have pinged Pchelolo [16:03:19] super thanks [16:03:20] * Pchelolo is looking into it [16:03:24] o/ [16:03:35] no puppet swat patches btw [16:03:37] https://i.imgur.com/h11ujSy.gifv [16:03:47] I can see some 500s with query error on /srv/log/mobileapps/main.log [16:03:51] on scb1004 for example [16:04:22] godog: I've been working on that vhost stuff for scap ^^ [16:04:32] (it's probably not swat-ready, but could use another bit of review) [16:04:48] (03CR) 10jerkins-bot: [V: 04-1] Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 (owner: 10Chad) [16:05:39] gdi jenkins [16:05:58] elukey: https://en.wikipedia.org/w/index.php?title=Cat&oldid=781841025 [16:06:19] Derp. [16:06:52] Pchelolo: --verbose :D [16:07:09] heh [16:07:16] should fix itself in a bit, the vandalism was reverted. The endpoint uses the page images page prop that's updated asynchronously via the jobqueue [16:08:00] elukey: heh sorry. The problem's that that endpoint should fetch the images associated with the page and obviously after that vandalism there's no images [16:08:13] the checker script though could've done a better job reporting the issue [16:08:30] RainbowSprinkles: ok!
thanks I'll take a look likely tomorrow [16:08:34] Pchelolo: what are the steps to determining this? [16:08:37] Pchelolo: what is the workflow to follow in these cases? I mean, where did you check? Just to learn a bit [16:08:42] (03PS6) 10Chad: Setup apache vhost on scap proxies as well [puppet] - 10https://gerrit.wikimedia.org/r/344221 [16:08:58] godog: Ok thx [16:09:31] Pchelolo: what elukey said [16:09:33] :) [16:10:20] urandom: elukey: step 1 - ssh to scb node and run `check-mobileapps`, step 2 - verify that the check is indeed failing. step 3 - go to mobileapps source code and see what the check is doing step 4 - go look at what happened with the wiki page the check is using recently [16:10:31] i assume you can mine the mobileapps source for the swagger spec to determine what the test is, but is there an easier way? [16:10:44] Oh. [16:11:08] so that would be a No. :) [16:11:23] urandom: no easier way, but it's a good idea to make the checker script log the x-amples spec for a failed check [16:11:40] Worth creating a ticket [16:12:08] I remember we ran into this before, I think it was with Obama [16:12:17] see if I can find the task [16:12:38] Pchelolo: yeah I followed up to 3.
but then it was a bit difficult [16:13:01] especially since I don't know what pages it was testing [16:13:08] if godog doesn't find the task I'll create one [16:13:17] (03PS3) 10BBlack: RPS Cleanup 1/5: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216 [16:13:20] (03PS3) 10BBlack: RPS cleanup 3/5: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217 [16:13:21] (03PS3) 10BBlack: RPS cleanup 4/5: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218 [16:13:23] (03PS4) 10BBlack: RPS cleanup 5/5: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 [16:13:24] https://phabricator.wikimedia.org/T150560 [16:13:25] (03PS4) 10BBlack: interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 [16:13:28] (03PS1) 10BBlack: RPS cleanup 2/5: remove irqbalance module [puppet] - 10https://gerrit.wikimedia.org/r/355243 [16:13:28] Pchelolo: ^ [16:13:34] And it takes quite some time for the JobQueue to update the page properties now.. [16:13:44] (03PS1) 10Framawiki: Create a new namespace "Vikiproje" for trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355244 (https://phabricator.wikimedia.org/T166102) [16:13:59] (03CR) 10BBlack: [V: 032 C: 032] RPS Cleanup 1/5: remove unused upstart file [puppet] - 10https://gerrit.wikimedia.org/r/355216 (owner: 10BBlack) [16:14:11] cool, thank you godog. I'll write a comment there indicating we've run into this again and raise the priority [16:14:31] np Pchelolo !
[16:14:41] (03CR) 10BBlack: [V: 032 C: 032] RPS cleanup 2/5: remove irqbalance module [puppet] - 10https://gerrit.wikimedia.org/r/355243 (owner: 10BBlack) [16:14:44] ah no I was misremembering, it was the Dog page, (retrieve page preview of Dog page) [16:14:48] elukey: https://github.com/wikimedia/mediawiki-services-mobileapps/blob/master/spec.yaml#L534 [16:15:11] (03CR) 10BBlack: [C: 032] RPS cleanup 3/5: pattern not necc for LVS [puppet] - 10https://gerrit.wikimedia.org/r/355217 (owner: 10BBlack) [16:15:25] I think in this case just syslog'ing more info would be enough [16:15:26] !log cp1074: enable prometheus node_exporter qdisc collector T147569 [16:15:29] godog: https://giphy.com/gifs/obama-mic-drop-out-3o7qDSOvfaCO9b3MlO [16:15:31] (03CR) 10BBlack: [C: 032] RPS cleanup 4/5: Add config file to script, use for rss_pattern [puppet] - 10https://gerrit.wikimedia.org/r/355218 (owner: 10BBlack) [16:15:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:15:34] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569 [16:15:45] thanks urandom [16:15:49] elukey: haha yeah that's me in 30 min or so [16:15:49] (03CR) 10BBlack: [C: 032] RPS cleanup 5/5: use new config file in puppet [puppet] - 10https://gerrit.wikimedia.org/r/355219 (owner: 10BBlack) [16:15:55] (03CR) 10BBlack: [C: 032] interface-rps: add mq subqueue qdisc setup [puppet] - 10https://gerrit.wikimedia.org/r/355223 (owner: 10BBlack) [16:16:03] elukey: i realize that's not very helpful... 
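[Editor's note] The `check-mobileapps` flow debugged above is driven by `x-amples` entries in the service's swagger spec (the `spec.yaml` link given earlier): each entry names a request and the response shape it must match. A rough sketch of the idea with the fetcher injected so it stays network-free; the dict layout follows the x-amples convention but this is illustrative, not the real service-checker tool:

```python
def run_xample(xample, fetch):
    """Run one x-amples check.

    `xample` is a dict like the entries in spec.yaml:
      {"request": {"params": {...}},
       "response": {"status": 200, "body": {...}}}
    `fetch` performs the request and returns (status, body);
    the real checker would issue an HTTP call here.
    """
    status, body = fetch(xample["request"])
    expected = xample.get("response", {})
    if status != expected.get("status", 200):
        return "CRITICAL: expected status %s, got %s" % (
            expected.get("status", 200), status)
    # Require every key named in the expected body to be present,
    # which is roughly where the "malformed body" errors come from.
    for key in expected.get("body", {}):
        if key not in body:
            return "CRITICAL: malformed body, missing %r" % key
    return "OK"
```

Logging the failing x-ample itself (page title included) is precisely the improvement requested in T150560: the "Cat" reference would then be obvious from the alert text.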
[16:16:21] ahhhh now I got the error message from the check [16:16:40] !log disabled puppet on all lvs* for RPS-related deployments [16:16:41] sorry Pchelolo I am dumb, EOD hit me [16:16:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:16:55] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [16:16:55] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [16:16:56] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [16:16:56] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [16:16:57] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [16:16:57] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [16:16:58] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [16:16:58] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy [16:16:58] a bit more info would have helped but I didn't read the error message correctly [16:16:59] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [16:17:05] (the Cat reference) [16:17:13] !log disabled puppet on all cp* for RPS-related deployments (just in case!) [16:17:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:22] here we go ^^^ [16:17:45] actually I didn't notice the Cat in the error description as well.. [16:17:51] 06Operations, 10ops-codfw: Degraded RAID on ms-be2029 - https://phabricator.wikimedia.org/T166021#3286648 (10fgiunchedi) a:03Papaul @papaul please replace, thanks!
Also note that this drive has very few power on hours, almost DOA ``` => pd 1I:1:8 show array f show Smart Array P840 in Slot 3 array F... [16:18:10] (03Abandoned) 10Jcrespo: Revert "Revert "raid-check: Return critical when not in WriteBack mode for megacli"" [puppet] - 10https://gerrit.wikimedia.org/r/355231 (owner: 10Jcrespo) [16:22:38] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286656 (10Pchelolo) Same thing happened today for one of the mobile-apps checks: ``` /{domain}/v1/page/media/{title} (... [16:23:53] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286659 (10elukey) Today we received some CRITICALs from all the scb hosts and the error message was the following: ```... [16:23:58] Pchelolo: --^ [16:24:10] argh at the same time [16:24:11] sorry [16:24:27] removed [16:24:30] !log demon@tin Synchronized README: testing (duration: 00m 38s) [16:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:24:39] (03PS1) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) [16:25:04] elukey: hehe :) I've also added the idea that we might log a `curl` command that could be used to repeat the request - would be really easy to follow up in no time [16:25:17] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286676 (10fgiunchedi) Since icinga space for reporting the check output is limited anyway I think in this case addition... 
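The idea floated at 16:25:04 — have the checker emit a copy-pasteable `curl` command for the request that failed — can be sketched as a tiny shell helper. This is an illustration only (`build_repro_curl` is a made-up name, not part of the actual service-checker-swagger code):

```shell
# Hypothetical helper: turn a failed check's method, URL and headers into a
# curl command an operator can paste to reproduce the request.
build_repro_curl() {
  local method="$1" url="$2"
  shift 2
  local cmd="curl -sS -X ${method}"
  local h
  for h in "$@"; do          # remaining arguments are "Header: value" pairs
    cmd="${cmd} -H '${h}'"
  done
  printf '%s %s\n' "$cmd" "$url"
}
```

For example, `build_repro_curl GET https://example.org/api 'Accept: application/json'` prints a one-line curl invocation that could be appended to the syslog'd error.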
[16:25:25] (03CR) 10jerkins-bot: [V: 04-1] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:25:44] (03PS2) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:25:52] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#2789590 (10Eevans) >>! In T150560#3286656, @Pchelolo wrote: > > [ ... ] > > Even more neat feature would be to convert t...
[16:26:26] Pchelolo: +1, thanks!
[16:26:34] !log puppet re-enables on caches
[16:26:35] (03CR) 10jerkins-bot: [V: 04-1] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:26:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:27:43] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3285567 (10jcrespo) The previous patch was reverted, I am creating a separate one to allow to enable or disable the extra check at will (for megacli first).
[16:28:55] RECOVERY - HP RAID on ms-be1033 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[16:29:27] 06Operations, 10ops-eqiad, 10DBA, 13Patch-For-Review: x1 master db1031: Faulty BBU - https://phabricator.wikimedia.org/T166108#3286698 (10jcrespo) ``` root@prometheus1003:~$ python check-raid.py OK: optimal, 2 logical, 6 physical OK root@prometheus1003:~$ python check-raid.py --policy=WriteBack CRITICAL:...
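The optional `--policy=WriteBack` check jcrespo demonstrates above boils down to scanning the controller's cache-policy lines and going CRITICAL when a logical drive has fallen out of WriteBack (which happens when the BBU fails). A minimal sketch of that logic — illustrative only, the real `check-raid.py` parses megacli output directly:

```shell
# Hypothetical write-policy check: reads "Current Cache Policy: ..." lines
# (one per logical drive, megacli-style) on stdin; CRITICAL if any drive is
# not running WriteBack.
check_write_policy() {
  if grep -q 'WriteThrough' -; then
    echo "CRITICAL: cache policy is not WriteBack"
    return 2    # nagios/icinga CRITICAL exit code
  fi
  echo "OK: all logical drives in WriteBack"
}
```

With a healthy BBU the controller reports `Current Cache Policy: WriteBack, ...` and the check stays OK; when the controller degrades to WriteThrough the check alerts, matching the behaviour shown in the phab paste.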
[16:32:31] (03CR) 10Framawiki: "Hello Hoo man, please add this patch in the list of deployments at wikitech . Thanks !" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/354246 (https://phabricator.wikimedia.org/T164191) (owner: 10Hoo man)
[16:32:50] * elukey off
[16:35:24] (03PS3) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:36:18] (03CR) 10jerkins-bot: [V: 04-1] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:41:10] (03PS4) 10Chad: Move mwdeploy home to /var/lib where it belongs, it's a system user [puppet] - 10https://gerrit.wikimedia.org/r/323867 (https://phabricator.wikimedia.org/T86971)
[16:43:11] !log BBR: enabling mq+fq on cp1074 - T147569
[16:43:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:43:18] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[16:43:47] (03PS5) 10Chad: Create hourly backup schedule, modeled on weekly and use for Gerrit [puppet] - 10https://gerrit.wikimedia.org/r/341371
[16:45:25] (03PS2) 10Chad: Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729
[16:45:51] bblack: 341729 is for you, btw :)
[16:46:56] (03PS3) 10Chad: Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729
[16:47:23] It's the last of the stuff in files/* :D
[16:47:37] (03PS4) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:47:51] RainbowSprinkles: you might want to check this out from quick git grep:
[16:47:54] utils/create_ecdsa_cert: "${_dir}/../files/ssl/${name}.crt"
[16:47:56] utils/create_ecdsa_cert:git add "${_dir}/../files/ssl/${name}.crt"
[16:48:01] looks probably-relevant
[16:48:05] Ahhh
[16:48:08] Missed those
[16:48:13] utils/* confuses me :p
[16:49:12] !log BBR: enabling bbr on cp1074 - T147569
[16:49:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:49:20] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[16:50:10] (03PS4) 10Chad: Move all ssl certs to the module and out of files/ [puppet] - 10https://gerrit.wikimedia.org/r/341729
[16:50:27] !log upgrading facter on mw[2250-2259] as a test batch
[16:50:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[16:52:31] (03CR) 10Jcrespo: [C: 031] "I would like to deploy this, as it would be backwards compatible with the current behaviour, and then enable the new check selectively. Ye" [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:54:24] (03CR) 10Jcrespo: [C: 031] "Example usage:" [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[16:55:59] (03PS5) 10Jcrespo: raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108)
[16:59:25] 06Operations, 06Discovery-Search (Current work): replace es-tool with elasticsearch-curator for standard elasticsearch operations - https://phabricator.wikimedia.org/T166154#3286827 (10Gehel)
[17:00:05] gwicke, cscott, arlolra, subbu, halfak, and Amir1: Respected human, time to deploy Services – Graphoid / Parsoid / OCG / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1700). Please do the needful.
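At the sysctl level, the "enabling mq+fq" and "enabling bbr" steps being !logged for cp1074 amount to the following on a 4.9+ kernel. This is a sketch of the mechanism, not the actual puppet change (the real rollout also manages per-subqueue qdisc setup via interface-rps):

```shell
# Sketch: enable the fq qdisc (BBR's recommended pacing qdisc) and switch
# TCP congestion control to BBR. Requires kernel 4.9+ and root.
sysctl -w net.core.default_qdisc=fq
sysctl -w net.ipv4.tcp_congestion_control=bbr

# Verify; should now report bbr:
sysctl net.ipv4.tcp_congestion_control
```

Note the default_qdisc only applies to interfaces/queues (re)created afterwards, which is one reason the deploy here pairs the sysctl with explicit qdisc setup on the multiqueue NICs.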
[17:00:38] Deployment of services happen too soon in respect to my old timezone
[17:00:49] anyway, I want to try deploying again
[17:02:45] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0]
[17:03:00] !log ladsgroup@tin Started deploy [ores/deploy@4874809]: Trying again with deploying ores
[17:03:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:04:02] (03PS1) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249
[17:05:42] 06Operations, 10ops-codfw, 10netops: ores200[1-9] switch port configuration - https://phabricator.wikimedia.org/T166156#3286863 (10Papaul)
[17:09:04] !log starting branch cut for 1.30.0-wmf.2 T163512
[17:09:05] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3286899 (10Ottomata) ping @akosiaris @Joe
[17:09:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:09:13] T163512: MW-1.30.0-wmf.2 deployment blockers - https://phabricator.wikimedia.org/T163512
[17:09:22] ^ jdlrobson anomie FYI
[17:24:17] 06Operations, 07Puppet, 13Patch-For-Review, 07RfC: RFC: New puppet code organization paradigm/coding standards - https://phabricator.wikimedia.org/T147718#3286942 (10Ottomata) Also, if configuration of profiles can only be done via hiera, doesn't that mean any module parameter that we may want to override...
[17:24:29] !log ladsgroup@tin Finished deploy [ores/deploy@4874809]: Trying again with deploying ores (duration: 21m 30s)
[17:24:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[17:25:30] it all looks okay
[17:25:48] tell me if any alarm started to fire off
[17:26:58] Amir1: well, actually, we've an hypthesis for this alarm: a IFTTT component
[17:27:35] Amir1: nevermind, I mixed two conversations
[17:27:44] Dereckson: oh, okay
[17:27:45] :)
[17:27:53] Nice meeting you by the way
[17:28:05] Yes:)
[17:28:42] 06Operations, 06Operations-Software-Development, 13Patch-For-Review, 06Services (watching): More verbose messages from service-checker-swagger - https://phabricator.wikimedia.org/T150560#3286944 (10Volans) a:05Volans>03None
[17:30:55] RECOVERY - MariaDB Slave Lag: s7 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 248.57 seconds
[17:45:06] 06Operations, 10DBA, 10Monitoring, 10media-storage: icinga hp raid check timeout on busy ms-be and db machines - https://phabricator.wikimedia.org/T141252#3286990 (10jcrespo) I've checked, and the currently in use check does too much, probably we do not need such a thorough check every time icinga runs, wh...
[17:45:57] Hi ops!
[17:46:13] I'm asking about this in releng too, but it might be something you can help with
[17:46:31] CI is failing trying to connect to packagist.org
[17:47:24] The
[17:47:26] "https://packagist.org/p/provider-2017-04%2486c7cf8d14faebc894d0a52237f6873b18b913101a4d6b8f50fbac618900cd15.json" file could not be downloaded: Failed to enable crypto
[17:47:45] what is that, curl?
[17:47:58] or alternately Failed to decode response: zlib_decode(): data error
[17:48:21] jynus: it's from composer install, probably using curl internally
[17:48:43] "Ok I have found a solution. The problem is that the site uses SSLv3"
[17:48:46] maybe an IP address update / firewall issue?
[17:48:53] seems an error by openssl
[17:48:59] ooh, and nothing better?
[17:49:05] that's bizarre
[17:49:21] curl should negotiate up if both sides support it, right?
[17:50:21] Oh, following up from other channel: we haven't seen this on other jobs. Is it possible it was transient?
[17:50:22] well, over in -releng, RainbowSprinkles says it's probably something that team should be able to solve
[17:50:53] RainbowSprinkles: could be, but it's happened at least 4 times in the past half hour
[17:51:16] It is?
[17:51:17] https://integration.wikimedia.org/ci/job/wikimedia-fundraising-crm-composer-php55-trusty/
[17:51:17] it is an external site-anything could be
[17:51:20] that message or "Failed to decode response: zlib_decode(): data error", followed by half an hour of nothing
[17:51:21] I only see the one failure
[17:51:29] https://integration.wikimedia.org/ci/job/mwgate-composer-php55-trusty/3110/console
[17:51:50] the zlib_decode() is because it's fetching bogus data. The real error is the ssl/tls issue with packagist.
[17:51:55] https://www.ssllabs.com/ssltest/analyze.html?d=packagist.org&s=144.217.203.53&latest seems ok
[17:52:11] Hmm
[17:52:17] huh, some succeed though
[17:52:17] TLS1.2, modern compression, etc.
[17:52:23] *cipher
[17:52:35] but remember I am not your traffic guy
[17:52:45] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0]
[17:52:59] Yeah, this is weird.
[17:53:29] sites can fail
[17:53:49] (03PS1) 10Thcipriani: Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255
[17:55:20] jynus: Yeah that's why I suggested maybe transient
[17:59:39] maybe that can be cached for increased reliability?
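For the "curl should negotiate up" question above: yes, a TLS client offers its highest supported protocol version and the server picks, so intermittent "Failed to enable crypto" usually points at a flaky endpoint behind the hostname rather than a version mismatch. A hedged first-pass check from the failing CI host might look like this (plain diagnostics, not something that was actually run here):

```shell
# See what protocol/cipher actually negotiates against packagist from this host
openssl s_client -connect packagist.org:443 -servername packagist.org </dev/null 2>/dev/null \
  | grep -E '^ *(Protocol|Cipher) *:'

# And whether a forced-TLS1.2 fetch of the failing provider file succeeds
curl -svo /dev/null --tlsv1.2 https://packagist.org/packages.json 2>&1 \
  | grep -iE 'SSL connection|error'
```

Running these a few times would also distinguish a per-backend problem (packagist.org resolves to multiple IPs) from a local openssl/composer one.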
[18:02:09] https://packagist.org could not be fully loaded, package information was loaded from the local cache and may be out of date
[18:02:13] Hehe, it does ;-)
[18:03:13] RainbowSprinkles: but after that message, it just stalls for half an hour
[18:03:26] so something's not quite working with the caching
[18:04:33] ejegg RainbowSprinkles, i think that is being caused by https://github.com/composer/packagist/commit/e77ad7072b7c545d447c5c9d269a3682f90fb0b7
[18:06:05] RECOVERY - Check Varnish expiry mailbox lag on cp1074 is OK: OK: expiry mailbox lag is 291790
[18:07:49] (03PS1) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:08:17] paladox: hmm, I don't see how that would break like that
[18:08:36] It's caching the api requests for longer.
[18:08:52] Though when did the problem start? In the last 3 hours?
[18:08:56] (03CR) 10jerkins-bot: [V: 04-1] mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257 (owner: 10Krinkle)
[18:10:11] paladox yeah, it started recently. But the issue isn't stale packages, it's connection failures followed by half-hour waits when it tries to fall back to locally cached packages
[18:11:14] ok
[18:11:50] (03PS2) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:13:39] ejegg an old report, https://github.com/composer/composer/issues/4212
[18:13:47] but maybe the workaround will work for you?
[18:14:36] Oh, that is composers site not packagist.
[18:15:27] hmm, well, some builds are working
[18:15:40] I guess I'll just keep recheck-ing till mine goes through
[18:17:03] thcipriani: hi! Mmm looks like you already cut the wmf.2 branch? Mmm somehow the CentralNotice update to the deploy branch didn't get automatically pushed to core as it used to......
[18:17:23] (03PS1) 10Ottomata: Update kafka.sh wrapper script for Kafka 0.10+ [puppet] - 10https://gerrit.wikimedia.org/r/355259 (https://phabricator.wikimedia.org/T166162)
[18:17:34] (03CR) 10Jcrespo: [C: 032] raid-check: optionally return critical when not in a write policy [puppet] - 10https://gerrit.wikimedia.org/r/355246 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[18:18:18] (03PS2) 10Ottomata: Update kafka.sh wrapper script for Kafka 0.10+ [puppet] - 10https://gerrit.wikimedia.org/r/355259 (https://phabricator.wikimedia.org/T166164)
[18:20:53] AndyRussG: I'm not sure what you mean, afaict CentralNotice is pinned to the wmf_deploy branch: https://github.com/wikimedia/mediawiki-tools-release/blob/master/make-wmf-branch/config.json#L178
[18:22:45] AndyRussG: what do you mean when you say "automatically pushed to core"?
[18:22:53] https://github.com/wikimedia/mediawiki/blob/wmf/1.30.0-wmf.2/.gitmodules#L1-L4 seems right based on config.json
[18:23:01] thcipriani: I was wrong, now it looks OK.... (????)
[18:23:08] Dunno what I'm doing wrong on my local repo
[18:23:36] AndyRussG: kk, let me know if you find anything further amiss with that extension.
[18:29:41] thcipriani: thx!!! yeah just double-checked again, all good. yes that config.son is also great.. :)
[18:29:53] cool :)
[18:30:05] * AndyRussG browses bash history to try to identify origin of silly git-submodule-confusion
[18:38:52] (03PS2) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[18:39:11] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287147 (10kaldari)
[18:39:38] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287159 (10kaldari)
[18:40:10] (03PS3) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:40:12] (03PS8) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[18:40:40] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287147 (10kaldari)
[18:40:57] (03PS4) 10Krinkle: mediawiki: Define 'mediawiki::errorpage' to simplify usage [puppet] - 10https://gerrit.wikimedia.org/r/355257
[18:41:05] (03PS9) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[18:41:21] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287147 (10kaldari) @Tnegrin: Could you approve this access request?
[18:42:30] (03PS10) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[18:47:05] PROBLEM - puppet last run on cp3040 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:00:04] thcipriani: Dear anthropoid, the time has come. Please deploy MediaWiki train (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T1900).
[19:00:10] * thcipriani does
[19:05:46] (03PS3) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[19:06:22] !log thcipriani@tin Started scap: testwiki to php-1.30.0-wmf.2 and rebuild l10n cache
[19:06:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:11:15] (03PS1) 10Papaul: DNs: Add mgmt and production DNS entries for ores200[1-9] [dns] - 10https://gerrit.wikimedia.org/r/355270
[19:14:04] (03CR) 10Dzahn: [C: 032] DNs: Add mgmt and production DNS entries for ores200[1-9] [dns] - 10https://gerrit.wikimedia.org/r/355270 (owner: 10Papaul)
[19:15:25] (03PS4) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[19:16:05] (03CR) 10Dzahn: "[bast1001:~] $ for orescodfw in $(seq 1 9); do host ores200${orescodfw}.codfw.wmnet; done" [dns] - 10https://gerrit.wikimedia.org/r/355270 (owner: 10Papaul)
[19:16:05] RECOVERY - puppet last run on cp3040 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures
[19:16:59] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10Dzahn) ``` [bast1001:~] $ for orescodfw in $(seq 1 9); do host ores200${orescodfw}.codfw.wmnet; done ores2001.codfw.wmnet has address 10.192.0.12 ores2002.codfw.wmnet has add...
[19:21:32] (03PS5) 10Jcrespo: [WIP]raid: Implement the option to check write cache policies [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108)
[19:25:35] PROBLEM - HHVM rendering on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:26:07] (03CR) 10Jcrespo: "This now does what it is supposed to do, but the style is not very good, can you give me some constructive criticism about that?" [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[19:27:35] RECOVERY - HHVM rendering on mw1201 is OK: HTTP OK: HTTP/1.1 200 OK - 74094 bytes in 0.415 second response time
[19:28:15] PROBLEM - Apache HTTP on mw1201 is CRITICAL: CRITICAL - Socket timeout after 10 seconds
[19:29:05] RECOVERY - Apache HTTP on mw1201 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 613 bytes in 0.203 second response time
[19:30:21] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287255 (10Tnegrin) approved
[19:31:04] 06Operations, 10Traffic, 10fundraising-tech-ops: Fix nits in HTTPS/HSTS configs in externally-hosted fundraising domains - https://phabricator.wikimedia.org/T137161#3287257 (10Jgreen) @BBlack ok I upgraded nginx and *ssl, and civicrm and the other frack-hosted sites should be fixed to include the HSTS header...
[19:31:41] 06Operations, 10Ops-Access-Requests: Requesting access to stat1002 for kaldari - https://phabricator.wikimedia.org/T166165#3287260 (10kaldari)
[19:34:14] !log thcipriani@tin Finished scap: testwiki to php-1.30.0-wmf.2 and rebuild l10n cache (duration: 27m 52s)
[19:34:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:39:36] (03CR) 10Jcrespo: "I forgot to paste the compilation results: https://puppet-compiler.wmflabs.org/6506/" [puppet] - 10https://gerrit.wikimedia.org/r/355249 (https://phabricator.wikimedia.org/T166108) (owner: 10Jcrespo)
[19:42:31] (03CR) 10Thcipriani: [C: 032] Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255 (owner: 10Thcipriani)
[19:45:25] PROBLEM - puppet last run on mw1285 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[19:48:35] (03Merged) 10jenkins-bot: Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255 (owner: 10Thcipriani)
[19:48:48] (03CR) 10jenkins-bot: Group0 to 1.30.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355255 (owner: 10Thcipriani)
[19:52:06] !log thcipriani@tin rebuilt wikiversions.php and synchronized wikiversions files: group0 to 1.30.0-wmf.2
[19:52:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[19:55:56] (03PS1) 10BBlack: caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569)
[19:57:37] (03CR) 10jerkins-bot: [V: 04-1] caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569) (owner: 10BBlack)
[20:03:24] (03PS2) 10BBlack: caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569)
[20:06:22] !log disabling puppet on all caches for BBR deploy control
[20:06:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:06:39] (03CR) 10BBlack: [C: 032] caches: enable BBR + tuned mq+fq qdiscs [puppet] - 10https://gerrit.wikimedia.org/r/355276 (https://phabricator.wikimedia.org/T147569) (owner: 10BBlack)
[20:10:37] !log enable BBR for all caches @ ulsfo - T147569
[20:10:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:10:44] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[20:14:25] RECOVERY - puppet last run on mw1285 is OK: OK: Puppet is currently enabled, last run 1 second ago with 0 failures
[20:20:35] !log enable BBR for all caches @ codfw - T147569
[20:20:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:20:44] T147569: Evaluate/Deploy TCP BBR when available (kernel 4.9+) - https://phabricator.wikimedia.org/T147569
[20:21:45] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:21:55] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:22:35] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[20:23:25] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[20:23:26] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[20:23:35] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[20:23:35] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[20:23:35] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[20:23:36] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[20:23:45] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are healthy
[20:24:32] (03PS1) 10Smalyshev: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282
[20:24:55] !log enable BBR for all caches - T147569
[20:25:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:26:42] (03CR) 10Chad: [C: 031] "This is fine, minus a nit" (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:28:03] (03CR) 10Smalyshev: Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:28:33] (03PS2) 10Smalyshev: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282
[20:29:46] (03CR) 10Daniel Kinzler: [C: 031] Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:31:20] (03CR) 10Daniel Kinzler: [C: 031] Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:32:41] (03CR) 10Krinkle: [C: 031] Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:33:28] (03CR) 10Smalyshev: Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:33:42] (03CR) 10Krinkle: [C: 04-1] Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:34:38] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3287404 (10RobH)
[20:34:41] 06Operations, 10ops-codfw, 10netops: ores200[1-9] switch port configuration - https://phabricator.wikimedia.org/T166156#3287401 (10RobH) 05Open>03Resolved a:03RobH Done!
[20:34:45] (03PS3) 10Smalyshev: Allow absolute script path for getMediaWikiCli() [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282
[20:34:59] (03CR) 10Smalyshev: Allow absolute script path for getMediaWikiCli() (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355282 (owner: 10Smalyshev)
[20:35:59] 06Operations, 10ops-codfw, 13Patch-For-Review: rack/setup/install ores2001-2009 - https://phabricator.wikimedia.org/T165170#3258942 (10RobH)
[20:36:13] (03PS7) 10Volans: Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416
[20:38:35] PROBLEM - HP RAID on ms-be1036 is CRITICAL: CHECK_NRPE: Socket timeout after 50 seconds.
[20:42:56] 06Operations, 10ops-eqiad, 10fundraising-tech-ops, 10netops: rack/setup/wire/deploy msw2-c1-eqiad - https://phabricator.wikimedia.org/T166171#3287418 (10RobH)
[20:43:15] PROBLEM - Disk space on ms-be1008 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sda1 is not accessible: Input/output error
[20:53:02] (03CR) 10Bearloga: "@gehel Thanks for uploading shiny-server to the apt repo! I don't think there's anything else this patch depends on. Is there a way to tes" [puppet] - 10https://gerrit.wikimedia.org/r/353571 (https://phabricator.wikimedia.org/T161354) (owner: 10Bearloga)
[20:54:55] PROBLEM - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sda1]
[20:55:32] 06Operations, 10Monitoring, 06Multimedia: Create grafana dashboard for video scaler job runners - https://phabricator.wikimedia.org/T163033#3184054 (10Krinkle)
[20:56:15] PROBLEM - MegaRAID on ms-be1008 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline)
[20:56:26] ACKNOWLEDGEMENT - MegaRAID on ms-be1008 is CRITICAL: CRITICAL: 1 failed LD(s) (Offline) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T166177
[20:56:29] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1008 - https://phabricator.wikimedia.org/T166177#3287545 (10ops-monitoring-bot)
[20:57:13] (03CR) 10Volans: [C: 032] Puppet: run-puppet-agent, add --failed-only option [puppet] - 10https://gerrit.wikimedia.org/r/349416 (owner: 10Volans)
[20:59:41] RECOVERY - HP RAID on ms-be1036 is OK: OK: Slot 3: OK: 2I:4:1, 2I:4:2, 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 2I:2:1, 2I:2:2, 2I:2:3, 2I:2:4, Controller, Battery/Capacitor
[21:00:20] 06Operations, 10ops-eqiad: Degraded RAID on ms-be1008 - https://phabricator.wikimedia.org/T166177#3287584 (10Volans)
[21:01:16] RECOVERY - Disk space on ms-be1008 is OK: DISK OK
[21:02:33] (03PS11) 10Krinkle: dynamicproxy: Make use of errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/350494 (https://phabricator.wikimedia.org/T113114)
[21:04:33] 06Operations, 10Monitoring: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3287592 (10RobH)
[21:04:45] 06Operations, 10ops-codfw, 10Monitoring: rack/setup/install netmon2001 - https://phabricator.wikimedia.org/T166180#3287609 (10RobH)
[21:07:21] (03PS1) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114)
[21:09:48] 06Operations, 10ops-eqiad, 06DC-Ops, 06Services: rack/setup/install resetbase-dev100[456] - https://phabricator.wikimedia.org/T166181#3287641 (10RobH)
[21:10:00] (03CR) 10jerkins-bot: [V: 04-1] varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[21:11:07] (03PS2) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114)
[21:11:34] (03PS3) 10Krinkle: varnish: Switch browsersec to use errorpage template [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114)
[21:11:39] (03CR) 10Krinkle: "(whitespace)" [puppet] - 10https://gerrit.wikimedia.org/r/355338 (https://phabricator.wikimedia.org/T113114) (owner: 10Krinkle)
[21:16:36] PROBLEM - puppet last run on elastic1052 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:26:47] (03PS1) 10BBlack: r::c::perf - raise fq flow_limit to 300 [puppet] - 10https://gerrit.wikimedia.org/r/355356 (https://phabricator.wikimedia.org/T147569)
[21:27:12] (03CR) 10BBlack: [V: 032 C: 032] r::c::perf - raise fq flow_limit to 300 [puppet] - 10https://gerrit.wikimedia.org/r/355356 (https://phabricator.wikimedia.org/T147569) (owner: 10BBlack)
[21:36:37] (03PS2) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097)
[21:37:23] (03CR) 10jerkins-bot: [V: 04-1] Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097) (owner: 10Andrew Bogott)
[21:40:49] (03PS3) 10Andrew Bogott: Horizon: Add sudo policy panel [puppet] - 10https://gerrit.wikimedia.org/r/353156 (https://phabricator.wikimedia.org/T162097)
[21:44:36] RECOVERY - puppet last run on elastic1052 is OK: OK: Puppet is currently enabled, last run 58 seconds ago with 0 failures
[21:47:46] PROBLEM - puppet last run on sarin is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[21:49:46] RECOVERY - puppet last run on sarin is OK: OK: Puppet is currently enabled, last run 33 seconds ago with 0 failures
[21:50:09] ACKNOWLEDGEMENT - puppet last run on ms-be1008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 26 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[mkfs-/dev/sda1] Volans failed disk, https://phabricator.wikimedia.org/T166177
[22:04:16] PROBLEM - puppet last run on install2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[22:12:17] (03PS1) 10Volans: Puppet: run-puppet-agent improvements [puppet] - 10https://gerrit.wikimedia.org/r/355363
[22:13:06] PROBLEM - citoid endpoints health on scb1004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:13:16] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:13:56] RECOVERY - citoid endpoints health on scb1004 is OK: All endpoints are healthy
[22:14:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy
[22:15:27] (03CR) 10Volans: [C: 032] "@godog: I'm merging this to fix the bug, feel free to comment it also later and I'll address the comments tomorrow." [puppet] - 10https://gerrit.wikimedia.org/r/355363 (owner: 10Volans)
[22:29:40] (03PS1) 10Chad: scap clean: Some docs, minor pylint fixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/355365
[22:33:18] RECOVERY - puppet last run on install2002 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures
[22:41:06] PROBLEM - citoid endpoints health on scb2002 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:16] PROBLEM - citoid endpoints health on scb2006 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:46] PROBLEM - citoid endpoints health on scb2003 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:46] PROBLEM - citoid endpoints health on scb2005 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:47] PROBLEM - citoid endpoints health on scb2001 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:47] PROBLEM - citoid endpoints health on scb2004 is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:41:56] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (open graph via native scraper) timed out before a response was received
[22:43:46] RECOVERY - citoid endpoints health on scb2003 is OK: All endpoints are healthy
[22:43:46] RECOVERY - citoid endpoints health on scb2005 is OK: All endpoints are healthy
[22:43:46] RECOVERY - citoid endpoints health on scb2001 is OK: All endpoints are healthy
[22:43:46] RECOVERY - citoid endpoints health on scb2004 is OK: All endpoints are healthy
[22:43:46] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy
[22:43:56] RECOVERY - citoid endpoints health on scb2002 is OK: All endpoints are healthy
[22:44:06] RECOVERY - citoid endpoints health on scb2006 is OK: All endpoints are
healthy [23:00:04] addshore, hashar, anomie, RainbowSprinkles, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, and thcipriani: Dear anthropoid, the time has come. Please deploy Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20170523T2300). [23:25:41] 06Operations: investigate shared inbox options - https://phabricator.wikimedia.org/T146746#3287937 (10Dzahn) new docs on how to use the Google group: https://wikitech.wikimedia.org/wiki/Ops_Clinic_Duty#Maintain_the_.27maint-announce.27_mails_and_calendar [23:30:32] (03PS2) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:34:21] (03PS3) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:51:51] (03PS4) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 [23:52:51] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599 (owner: 10Dzahn) [23:58:15] (03PS5) 10Dzahn: bastionhost: convert to role/profile structure [puppet] - 10https://gerrit.wikimedia.org/r/353599