[00:09:58] * Krinkle staging on mwdebug2001 [00:14:24] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/tests/phpunit/includes/utils/: T94522 - I2a0c51bea58 (duration: 01m 02s) [00:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:28] T94522: Some requests fail with UIDGenerator error "Process clock is outdated or drifted" - https://phabricator.wikimedia.org/T94522 [00:15:48] !log krinkle@deploy1001 sync-file aborted: T205567 - I75f1eb6dc2cb (duration: 00m 01s) [00:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:51] T205567: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 [00:15:57] PROBLEM - HHVM rendering on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:16:17] PROBLEM - Apache HTTP on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:06] RECOVERY - HHVM rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 75559 bytes in 8.703 second response time [00:17:17] RECOVERY - Apache HTTP on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.129 second response time [00:19:43] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/utils/UIDGenerator.php: T94522 - I2a0c51bea58 (duration: 00m 56s) [00:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:46] T94522: Some requests fail with UIDGenerator error "Process clock is outdated or drifted" - https://phabricator.wikimedia.org/T94522 [00:24:37] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:25:16] PROBLEM - swift-object-auditor on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [00:25:57] PROBLEM - 
swift-object-server on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:43:52] (03PS4) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [00:44:31] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [00:48:33] (03PS5) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [00:49:30] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [00:52:47] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:54:34] (03PS6) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [00:55:12] (03Abandoned) 10Imarlier: sitemaps: Generalize varnish rule for sitemaps, to apply to all domains [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) (owner: 10Imarlier) [00:55:25] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [01:00:21] (03PS1) 10Imarlier: Add sitemaps rewrite for additional domains [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) [01:01:26] RECOVERY - swift-object-server on ms-be2040 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:01:37] RECOVERY - swift-object-auditor on ms-be2040 is OK: PROCS 
OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [01:02:07] RECOVERY - swift-object-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [01:02:17] RECOVERY - swift-object-updater on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [01:03:36] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:05:20] (03CR) 10Herron: smarthost: create mail smarthost role/profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [01:07:28] So some people at enwiki have been complaining that special page updates have been behind schedule for the last couple of weeks [01:07:53] given the timing I'm wondering if it's DC switchover related [01:18:07] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:18:27] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 33796 MB (3% inode=99%) [02:10:01] (03PS4) 10Krinkle: profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) [02:33:17] RECOVERY - Memory correctable errors -EDAC- on wtp2011 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [03:14:21] Fun fact, running updateSpecialPages.php spends roughly 52 hours just on commons [03:26:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 787.54 seconds [03:57:46] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 268.13 seconds [05:05:12] 10Operations, 
10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) 05Open>03Resolved All good - thank you ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name :Virtual Disk 0 RAID Level : Primary-1, Secondary-0, RAID... [05:10:07] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad Marostegui T206254 - The acknowledgement expires at: 2018-11-07 05:09:43. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [05:10:54] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) 05Open>03Resolved The RAID got rebuilt fine. The disk came with some errors, but let's ignore that rather than waste more disks; let's wait till it fails for real to replace it. ``` Number... [05:12:20] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) Thanks for the update Chris - unbelievable! [05:25:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:55] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) 05stalled>03Resolved Thanks Papaul! This looks good - we will take it from here! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target... 
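The resolved RAID tasks above each close with a truncated MegaCli paste. A self-contained sketch of the kind of health check a script could run over such output — the heredoc below is a hypothetical stand-in for real controller output (roughly what `MegaCli -LDInfo -Lall -aALL` prints; exact fields and flags vary by controller and install):

```shell
# Hypothetical MegaCli virtual-drive output, modeled on the pastes in the
# tasks above; a real check would capture the command's output instead.
ldinfo=$(cat <<'EOF'
Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :Virtual Disk 0
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
State               : Optimal
EOF
)

# Extract the State field; anything other than "Optimal" means the array
# is degraded or rebuilding.
state=$(printf '%s\n' "$ldinfo" | sed -n 's/^State[[:space:]]*: //p')

if [ "$state" = "Optimal" ]; then
    echo "RAID OK"
else
    echo "RAID degraded: $state"
fi
```

This is the same check the icinga "Degraded RAID" alerts boil down to: compare the reported state against the one healthy value.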
[05:29:09] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [05:31:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:12:26] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:46] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:39:37] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 33646 MB (3% inode=99%) [06:55:55] (03PS3) 10Giuseppe Lavagetto: parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 [06:58:38] (03CR) 10Ema: [C: 031] parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 (owner: 10Giuseppe Lavagetto) [06:59:22] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 (owner: 10Giuseppe Lavagetto) [07:01:14] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) [07:04:21] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:05:10] _joe_: ok to merge your patch along? 
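The "ok to merge your patch along?" exchange above is the shared puppet-merge step: a change submitted in Gerrit only takes effect once someone merges it into the puppetmaster's local checkout, and monitoring counts commits in HEAD..origin/production. A self-contained sketch of that count — repository layout, paths, and identities here are stand-ins, not the production setup:

```shell
set -e
tmp=$(mktemp -d)

# Stand-in for the canonical repo (the role Gerrit plays)
git init -q --bare "$tmp/origin.git"

# The puppetmaster's local checkout
git clone -q "$tmp/origin.git" "$tmp/puppetmaster"
( cd "$tmp/puppetmaster"
  git config user.email ops@example.org
  git config user.name ops
  git commit -q --allow-empty -m 'base'
  git push -q origin HEAD:production
)

# Someone submits a change: origin/production moves ahead of the checkout
git clone -q -b production "$tmp/origin.git" "$tmp/committer"
( cd "$tmp/committer"
  git config user.email dev@example.org
  git config user.name dev
  git commit -q --allow-empty -m 'some merged change'
  git push -q origin production
)

# The check behind the "Unmerged changes" alert: fetch, then count commits
# the local checkout has not merged yet
cd "$tmp/puppetmaster"
git fetch -q origin
behind=$(git rev-list --count HEAD..origin/production)
echo "There are $behind unmerged changes in puppet"
```

When `behind` drops back to 0 after a merge-and-puppet-run, the alert recovers.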
[07:14:01] (03CR) 10Mathew.onipe: "Jenkins dry run result seems good:" [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [07:15:16] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [07:18:03] <_joe_> moritzm: sorry, yes [07:18:26] <_joe_> I auto-nerd-sniped myself in searching for unreferenced files in our puppet tree [07:18:59] ok, now merged :-) [07:19:19] (03PS1) 10Elukey: Add interface::add_ip6_mapped to analytics-tool* hosts [puppet] - 10https://gerrit.wikimedia.org/r/465560 [07:19:37] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [07:20:17] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to analytics-tool* hosts [puppet] - 10https://gerrit.wikimedia.org/r/465560 (owner: 10Elukey) [07:27:21] (03PS1) 10Elukey: profile::prometheus::alerts: raise warning for EL throughput alarm [puppet] - 10https://gerrit.wikimedia.org/r/465563 [07:28:18] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: raise warning for EL throughput alarm [puppet] - 10https://gerrit.wikimedia.org/r/465563 (owner: 10Elukey) [07:28:45] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [07:35:30] (03PS2) 10Filippo Giunchedi: debian: ship systemd service [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 [07:35:30] (03PS2) 10Filippo Giunchedi: debian: use standard rules for Prometheus packages [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465351 [07:35:32] (03PS2) 10Filippo Giunchedi: debian: update changelog [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465352 [07:35:34] (03PS2) 10Filippo Giunchedi: debian: add patch for inline udp usage [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465414 (https://phabricator.wikimedia.org/T205870) [07:38:04] (03CR) 10Filippo Giunchedi: "Running under ded" (031 comment) [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [07:38:42] (03PS1) 10Elukey: Add IPv6 PTR records for analytics-tool* hosts [dns] - 10https://gerrit.wikimedia.org/r/465565 [07:40:00] (03CR) 10Elukey: [C: 032] Add IPv6 PTR records for analytics-tool* hosts [dns] - 10https://gerrit.wikimedia.org/r/465565 (owner: 10Elukey) [07:41:56] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor 'use_git_deploy' in wdqs puppet module to cater for scap3 and autodeployment modes - https://phabricator.wikimedia.org/T206597 (10Mathew.onipe) [07:46:54] (03PS1) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) [07:50:57] RECOVERY - Disk space on eventlog1002 is OK: DISK OK [07:51:33] !log cleaned up some log files from eventlog1002 [07:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:46] we have a task about --^, should be fixed today [07:52:02] nice [07:53:38] (03CR) 
10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [07:54:46] (03CR) 10Gehel: [C: 04-1] "A few issues, see comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [07:55:00] (03CR) 10Gilles: [C: 031] WIP: define haproxy service for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [07:55:05] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) [08:00:44] (03PS1) 10Elukey: role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) [08:01:14] (03CR) 10Filippo Giunchedi: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:01:59] !log rolling out debdeploy 0.0.99.6 [08:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:53] (03PS3) 10ArielGlenn: fix misc dumps generation when some previous runs are missing [dumps] - 10https://gerrit.wikimedia.org/r/465415 (https://phabricator.wikimedia.org/T206306) [08:04:54] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [08:05:26] (03CR) 10Filippo Giunchedi: [C: 031] "I've looped in WMCS folks too" [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:06:09] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet error on deployment-mwmaint01 - https://phabricator.wikimedia.org/T206598 (10Krenair) 
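The eventlog1002 disk alert above was cleared by hand ("cleaned up some log files"), and the follow-up patch reduces the archive retention so it stops recurring. A minimal sketch of that style of time-based pruning — the directory, file names, and the 90-day cutoff are all invented for illustration:

```shell
set -e
archive=$(mktemp -d)

# Simulate an archive directory: one recent file, two past the cutoff
touch "$archive/events.log-recent.gz"
touch -d '120 days ago' "$archive/events.log-old.gz"
touch -d '200 days ago' "$archive/events.log-older.gz"

# Drop anything older than the retention window (GNU find)
find "$archive" -name '*.gz' -mtime +90 -delete

ls "$archive"
```

Run from cron or a systemd timer, a one-liner like the `find` above keeps /srv from filling up between manual cleanups.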
[08:06:18] (03CR) 10ArielGlenn: [C: 032] fix misc dumps generation when some previous runs are missing [dumps] - 10https://gerrit.wikimedia.org/r/465415 (https://phabricator.wikimedia.org/T206306) (owner: 10ArielGlenn) [08:07:40] !log ariel@deploy1001 Started deploy [dumps/dumps@0714a93]: fix adds/changes dumps generation when prev run is missing [08:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:46] !log ariel@deploy1001 Finished deploy [dumps/dumps@0714a93]: fix adds/changes dumps generation when prev run is missing (duration: 00m 06s) [08:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10JAllemandou) @Ottomata - After standup please, as today is kids-day for me :) [08:15:17] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet error on deployment-mwmaint01 - https://phabricator.wikimedia.org/T206598 (10Krenair) [08:27:49] !log installing fuse security updates [08:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:04] (03PS1) 10Elukey: statistics::rsync::eventlogging: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) [08:33:20] (03CR) 10Giuseppe Lavagetto: [C: 031] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/458807 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:40:06] (03Abandoned) 10Ema: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/458807 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:48:11] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/465164 (https://phabricator.wikimedia.org/T206454) (owner: 
10Filippo Giunchedi) [08:48:16] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) @Cmjohnson quick question (might be wrong since I am a n00b with Juniper): is stat1007 in the analytics VLAN? ``` elukey@asw2-b-... [08:50:25] (03Abandoned) 10Filippo Giunchedi: logstash: add ipv6 to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465164 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [08:56:47] (03CR) 10Giuseppe Lavagetto: [C: 031] Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:57:27] <_joe_> akosiaris: should we talk here, or is there a dedicated channel? [08:57:29] (03PS3) 10Ema: Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:57:51] _joe_: I 've avoided the dedicated channel just for this, let's do it here [08:58:04] <_joe_> ack [08:58:15] we can always ignore the bots if we end up having some serious talk [08:58:21] T-1m [08:59:05] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:59:26] starting with the inter-cache routing patch [08:59:32] ok [08:59:39] (03CR) 10Ema: [C: 032] Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:00:04] Deploy window Datacenter switchback - Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T0900) [09:00:04] !log Traffic: route esams caches back to eqiad T203777 [09:00:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:08] T203777: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 [09:00:24] running puppet [09:00:39] <_joe_> on all caches in esams, correct? [09:00:48] correct [09:01:16] (03CR) 10Giuseppe Lavagetto: [C: 031] "we might decide to leave restbase a/a in the future, but for now let's just revert to the status quo" [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:02:17] puppet run on esams caches finished [09:03:25] <_joe_> network traffic is going up on cache_text and cache_upload in eqiad, which is an expected observable [09:04:25] (03CR) 10Marostegui: [C: 031] site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - 10https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:04:47] (03PS2) 10Ema: cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:04:57] manually rebased the patch above, there was a conflict ^ [09:05:25] <_joe_> ema: lemme look just to be sure [09:05:30] _joe_: yes please [09:06:16] (03CR) 10Alexandros Kosiaris: [C: 031] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:06:19] LGTM [09:06:42] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:07:07] ok to proceed with setting services active/active? 
[09:07:11] yes [09:07:24] (03CR) 10Ema: [C: 032] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:07:24] <_joe_> +1 [09:07:47] !log Traffic: set services active/active T203777 [09:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:50] T203777: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 [09:08:06] running puppet on text and upload caches in eqiad and codfw [09:08:58] (03PS2) 10Ema: cache::text: Switch restbase to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:09:32] <_joe_> ema: this is simple enough you don't need a check, right? [09:10:25] puppet run on all eqiad/codfw caches finished [09:10:34] (03CR) 10Alexandros Kosiaris: [C: 031] cache::text: Switch restbase to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:11:25] switching restbase to eqiad only now [09:11:59] (03CR) 10Ema: [C: 032] cache::text: Switch restbase to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:12:43] !log Traffic: move restbase back to eqiad T203777 [09:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:14] running puppet on text caches in eqiad/codfw [09:14:26] traffic switchback done [09:14:42] <_joe_> ema: cool! [09:14:55] <3 [09:15:48] nice! [09:16:05] congratulations! [09:16:14] (03CR) 10Gehel: "hieradata/role/common/wdqs.yaml should also be changed (and hieradata/role/common/wdqs/autodeploy.yaml once this patch is rebased)." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [09:16:16] \o/ \o/ [09:16:24] (03CR) 10Gehel: [C: 04-1] wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [09:16:50] I have a few puppet patches to merge, probably best to wait a bit until everything is settled ? [09:18:38] gehel: yes please. give it 10mins or os [09:18:40] so* [09:18:55] kool! no emergency! [09:19:07] I might take a lunch break by that time :) [09:19:39] <_joe_> gehel: remember we're going to switch back mediawiki at 4 pm your time [09:19:49] yep [09:20:55] (03Abandoned) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: deduce parameters from number of cpus [puppet] - 10https://gerrit.wikimedia.org/r/345817 (owner: 10Giuseppe Lavagetto) [09:36:49] (03PS11) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [09:47:32] (03PS1) 10Muehlenhoff: Clean up removed rsyncd configs [puppet] - 10https://gerrit.wikimedia.org/r/465583 (https://phabricator.wikimedia.org/T205618) [09:47:34] (03PS1) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [09:48:37] (03CR) 10jerkins-bot: [V: 04-1] rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 (owner: 10Muehlenhoff) [09:56:22] (03PS2) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [09:58:08] (03CR) 10jerkins-bot: [V: 04-1] rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 (owner: 10Muehlenhoff) [10:00:08] (03PS3) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [10:10:15] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for coal 
[puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) [10:19:40] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for coal [puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:22:17] (03PS12) 10Gehel: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [10:23:40] (03PS1) 10Muehlenhoff: Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590 [10:24:08] (03CR) 10Gehel: [C: 032] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [10:24:55] (03CR) 10Marostegui: [C: 032] mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [10:27:24] (03Merged) 10jenkins-bot: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [10:28:07] (03PS1) 10Gehel: wdqs: fix path in autodeploy exec resources [puppet] - 10https://gerrit.wikimedia.org/r/465591 (https://phabricator.wikimedia.org/T197187) [10:29:08] (03CR) 10jenkins-bot: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [10:30:07] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10LarsWirzenius) [10:30:38] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:31:13] ^ wdqs1009 is me, patch coming up [10:31:51] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10zeljkofilipin) [10:32:20] (03CR) 10Gehel: [C: 032] wdqs: fix path in autodeploy exec resources [puppet] - 10https://gerrit.wikimedia.org/r/465591 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [10:36:18] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10MoritzMuehlenhoff) Looking at existing group ownerships this means being added to the deployment, contint-admins, labnet-users and contint-docker gro... [10:36:48] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10zeljkofilipin) As far as I can see, that's `deployment` group in [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modu... [10:44:03] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-coal [puppet] - 10https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991) [10:44:28] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10zeljkofilipin) @MoritzMuehlenhoff correct, after further inspection of the file, looking for my groups, looks like that are the correct groups. 
[10:45:56] (03PS1) 10Gehel: wdqs: missing dependency for ordering of git::clone [puppet] - 10https://gerrit.wikimedia.org/r/465594 (https://phabricator.wikimedia.org/T197187) [10:48:23] !log marostegui@deploy1001 Synchronized dblists/s3.dblist: Update s3.dblist to reflect the wikis moved to s5 - T184805 (duration: 00m 58s) [10:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:26] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [10:49:34] !log marostegui@deploy1001 Synchronized dblists/s5.dblist: Update s5.dblist to reflect the wikis moved from s3 - T184805 (duration: 00m 56s) [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:54] (03CR) 10Mathew.onipe: [C: 031] wdqs: missing dependency for ordering of git::clone [puppet] - 10https://gerrit.wikimedia.org/r/465594 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [10:51:17] (03CR) 10Gehel: [C: 032] wdqs: missing dependency for ordering of git::clone [puppet] - 10https://gerrit.wikimedia.org/r/465594 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [10:54:29] !log Set a replication filter on db1075 (s3 eqiad) to ignore enwikivoyage, cebwiki, shwiki, srwiki & mgwiktionary - T184805 [10:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:31] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [10:55:19] (03PS1) 10Muehlenhoff: Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) [10:55:58] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:07:49] o/ [11:07:55] I'm around but patches are not :D [11:08:19] (03PS1) 10Gehel: wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) [11:09:04] (03CR) 10jerkins-bot: [V: 04-1] wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:09:46] (03PS2) 10Gehel: wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) [11:10:34] (03PS1) 10Muehlenhoff: Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) [11:10:43] (03PS1) 10Muehlenhoff: Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) [11:12:29] (03CR) 10Mathew.onipe: [C: 031] wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:13:10] (03CR) 10Gehel: [C: 032] wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:16:28] (03PS1) 10Urbanecm: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) [11:16:51] zeljkof, now one patch is around as well :D [11:16:57] (https://gerrit.wikimedia.org/r/465602) [11:17:11] Urbanecm: let me see... 
:) [11:17:26] (03PS2) 10Zfilipin: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:17:56] thank you :) [11:18:08] (03PS1) 10Gehel: wdqs: run git-fat commands from the package directory [puppet] - 10https://gerrit.wikimedia.org/r/465603 (https://phabricator.wikimedia.org/T197187) [11:18:19] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wdqs_git_fat_pull] [11:18:36] Urbanecm: ok, on it [11:19:40] (03CR) 10Mathew.onipe: [C: 031] wdqs: run git-fat commands from the package directory [puppet] - 10https://gerrit.wikimedia.org/r/465603 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:19:50] (03CR) 10Gehel: [C: 032] wdqs: run git-fat commands from the package directory [puppet] - 10https://gerrit.wikimedia.org/r/465603 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:21:34] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:23:54] (03Merged) 10jenkins-bot: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:24:28] Urbanecm: it's at mwdebug2001 [11:24:34] ack [11:25:29] zeljkof, its working, please deploy it to whole universe [11:25:41] Urbanecm: ok [11:26:42] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:465602|Permissions changes on itwikibooks (T206447)]] (duration: 00m 57s) [11:26:43] (03PS4) 10Urbanecm: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) [11:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:45] T206447: 
changes to manage user group "confirmed" and "accountcreator" on it.wikibooks - https://phabricator.wikimedia.org/T206447 [11:26:52] (03PS5) 10Urbanecm: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) [11:27:01] Urbanecm: deployed! anything else? [11:27:04] all done? [11:28:04] herron: scap said this during deployment [11:28:22] `11:26:15 Check 'Check endpoints for mwdebug2002.codfw.wmnet' failed: /wiki/{title} (Main Page) is WARNING: Test Main Page responds with unexpected body: array(59) {` [11:28:25] ... [11:28:41] (it's a long warning, I can paste it if needed) [11:29:24] well, as of now, I have no open patches. Thank you!7 [11:29:38] !log EU SWAT finished [11:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:44] Urbanecm: see you next time! :D [11:29:51] :) [11:30:09] zeljkof: probably create a ticket (I got the same thing and I was planning to create a ticket, but I got busy with something else) [11:30:17] marostegui: will do! [11:30:21] cheers [11:31:59] (03CR) 10jenkins-bot: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:33:15] (03PS4) 10Muehlenhoff: mediawiki::web::prod_sites: convert wikisource.org [puppet] - 10https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:34:08] (03PS1) 10Mathew.onipe: wdqs: onlyif condition should exit with 1 for git fat to run [puppet] - 10https://gerrit.wikimedia.org/r/465605 (https://phabricator.wikimedia.org/T197187) [11:34:19] marostegui: T206620 [11:34:20] T206620: Check 'Check endpoints for mwdebug2002.codfw.wmnet' failed: /wiki/{title} (Main Page) is WARNING: Test Main Page responds with unexpected body - https://phabricator.wikimedia.org/T206620 [11:34:34] zeljkof: thank you! 
[11:35:49] (03CR) 10Gehel: [C: 032] wdqs: onlyif condition should exit with 1 for git fat to run [puppet] - 10https://gerrit.wikimedia.org/r/465605 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [11:37:10] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM in terms of decommissioning, but please see the comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [11:41:10] (03PS2) 10Filippo Giunchedi: WIP: statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [11:41:12] (03PS1) 10Filippo Giunchedi: WIP: add statsd-exporter to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [11:41:26] !log renaming some s3 wiki tables on eqiad master to prevent split brain T184805 [11:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:30] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [11:43:58] (03PS3) 10Filippo Giunchedi: WIP: statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [11:43:58] (03PS2) 10Filippo Giunchedi: WIP: add statsd-exporter to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [11:47:03] (03PS1) 10Gehel: wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) [11:47:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/12850/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [11:48:18] (03CR) 10Mathew.onipe: [C: 031] wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) 
[11:49:46] (03PS2) 10Gehel: wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) [11:52:12] (03PS2) 10Elukey: Decommission conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) [11:55:22] (03CR) 10Mathew.onipe: [C: 031] wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:56:22] (03CR) 10Gehel: [C: 032] wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:58:44] (03CR) 10Muehlenhoff: mediawiki::web::prod_sites: convert wikisource.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:58:58] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:01:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:02:26] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12853/" [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [12:02:46] (03PS3) 10Elukey: Decommission conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) [12:05:49] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:15:45] (03PS1) 10Volans: Tests: refactor puppetdb tests with parametrize 
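[Editor's note: the git-fat patches above revolve around Puppet `Exec` guard semantics: an `onlyif` command must exit 0 for the exec to run, and exit non-zero to skip it, which is what the "onlyif condition should exit with 1" fix addresses. The sketch below emulates that rule in Python for illustration only.]

```python
# Sketch of Puppet's Exec onlyif semantics, relevant to the wdqs git-fat
# fixes above: the managed command runs only when the onlyif check exits 0.
import subprocess

def should_run(onlyif_cmd):
    # Mirror Puppet's behavior: exit status 0 means "go ahead".
    return subprocess.run(onlyif_cmd, shell=True).returncode == 0

print(should_run("true"))   # exec would run
print(should_run("false"))  # exec is skipped
```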
[software/cumin] - 10https://gerrit.wikimedia.org/r/465611 [12:15:47] (03PS1) 10Volans: PuppetDB: fix regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/465612 [12:18:53] <_joe_> !log decommissioning conf1001-1003: stopping etcd, nginx, and masking both [12:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:46] \o/ [12:38:35] PROBLEM - Nginx local proxy to apache on mw2217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time [12:39:44] RECOVERY - Nginx local proxy to apache on mw2217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.223 second response time [12:40:19] 10Operations, 10ops-eqiad, 10decommission: Decommission conf100[1-3] - https://phabricator.wikimedia.org/T206626 (10elukey) p:05Triage>03Normal [12:40:34] 10Operations, 10Traffic, 10Patch-For-Review: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (10ema) 05Open>03Resolved a:03ema Done, `profile::trafficserver::backend` now installs and configures vhtcpd. [12:42:55] 10Operations, 10Patch-For-Review: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 (10elukey) Opened https://phabricator.wikimedia.org/T206626 to fully decom conf100[1-3] (not in service anymore and with role::spare::system). The last step is to switch bac... 
[12:45:19] (03CR) 10Muehlenhoff: wmcs: add prometheus-memcached-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [12:51:22] (03CR) 10Muehlenhoff: Stop the diamond service when removing Diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [12:53:28] (03PS3) 10Muehlenhoff: Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) [13:04:07] (03PS1) 10Jgreen: update icinga IP for frbast2001.frack.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/465620 [13:05:45] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1146 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [13:06:54] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [13:07:08] looks like it was memcache logs from mw btw [13:09:15] (03CR) 10Jgreen: [C: 032] update icinga IP for frbast2001.frack.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/465620 (owner: 10Jgreen) [13:11:05] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10Gehel) With some trial an error, it looks like the `smp_affinity` = `00ff00ff` would allow the IRQ to be managed by any CP... [13:16:26] 10Operations, 10IRCecho, 10Patch-For-Review, 10User-fgiunchedi: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10fgiunchedi) With the latest patch in to log exceptions I think we're good to resolve this? 
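[Editor's note: T206105 above discusses setting the NIC IRQ `smp_affinity` mask to `00ff00ff` so interrupts can be serviced by more CPUs. The mask is a hex bitmap with one bit per CPU; the helper below decodes it, purely as an illustration of what that value means.]

```python
# Sketch: decode a /proc/irq/<n>/smp_affinity bitmask like the 00ff00ff
# value discussed for the WDQS NICs. Each set bit marks a CPU that is
# allowed to service the interrupt.
def cpus_from_mask(mask_hex):
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]

print(cpus_from_mask("00ff00ff"))  # CPUs 0-7 and 16-23
```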
[13:18:48] (03PS1) 10Gehel: wdqs: spread IRQ from NIC over multiple CPUs [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) [13:37:29] 10Operations, 10fundraising-tech-ops, 10netops: Qualys scans causing problematic pfw logspam - https://phabricator.wikimedia.org/T206431 (10Jgreen) 05Open>03Resolved a:03Jgreen The underlying problem was that bellatrix was logging to the root partition rather than the /srv data partition as it should.... [13:39:14] (03CR) 10Ottomata: [C: 031] profile::prometheus::alerts: raise warning for EL throughput alarm [puppet] - 10https://gerrit.wikimedia.org/r/465563 (owner: 10Elukey) [13:39:46] (03CR) 10Ottomata: [C: 031] role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [13:40:39] (03PS2) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) [13:41:16] (03CR) 10Ottomata: "Fine with me! These should all be in Hadoop now anyway. 
I'd mostly use these for recovery from client-side raw logs if something goes wr" [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [13:41:45] (03CR) 10jerkins-bot: [V: 04-1] wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [13:42:01] (03CR) 10Ottomata: [C: 031] Clean up removed rsyncd configs [puppet] - 10https://gerrit.wikimedia.org/r/465583 (https://phabricator.wikimedia.org/T205618) (owner: 10Muehlenhoff) [13:42:28] (03CR) 10Ottomata: [C: 031] rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 (owner: 10Muehlenhoff) [13:42:51] jouncebot: next [13:42:51] In 0 hour(s) and 17 minute(s): Datacenter switchback - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1400) [13:43:12] \o/ [13:43:35] marostegui: all good on dba-land for it? [13:43:44] volans: yeeep [13:43:55] great! 
thx [13:46:54] volans: we did downtime the new read only check on masters [13:47:06] as it would change asynchronouysly with puppet [13:48:13] ack [13:48:43] I think "all wikis are in read only" was a fair check to have under normal circumstances [13:50:30] I 'll rebase the required puppet changes [13:50:56] (03PS1) 10Elukey: Refactor type Systemd::Timer::DateTime to include more normal forms [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) [13:51:49] FWIW the steps that will be done in read-only and right up to the point of enabling RW again will be pretty sequential without syncups between us [13:52:32] if there is any blocker or anomaly that deserve a pause shout [13:54:18] (03PS3) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) [13:54:30] here [13:55:09] <_joe_> time to sound general quarters akosiaris :P [13:55:57] (03PS2) 10Alexandros Kosiaris: cache::text: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458773 (https://phabricator.wikimedia.org/T203777) [13:56:44] done [13:56:47] lol [13:57:15] (03CR) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [13:57:41] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) EV certs do seem to have lost almost all their value. That said the cost difference over an OV cert is under $100. Also, I'm not sure whose d... 
[13:58:07] find the 7 differences [13:58:11] jouncebot: next [13:58:12] In 0 hour(s) and 1 minute(s): Datacenter switchback - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1400) [13:59:10] akosiaris: check the tmux command please [13:59:52] LGTM [14:00:04] Deploy window Datacenter switchback - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1400) [14:00:05] it's T time [14:00:09] * marostegui ready [14:00:10] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::text: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458773 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [14:00:14] ready to liftoff! [14:00:19] volans: you are good to go [14:00:27] akosiaris: ack, starting! [14:00:33] !log START - Cookbook sre.switchdc.mediawiki.00-disable-puppet (volans@neodymium) [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:37] the note about disabling cron on mwmaint? [14:00:41] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) (volans@neodymium) [14:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:46] nevermind, answered elsewhere! 
[14:01:04] !log START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (volans@neodymium) [14:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) (volans@neodymium) [14:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:22] "you are go at throttle up" [14:01:27] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium) [14:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:35] volans: I got the merge the traffic change [14:01:40] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) >>! In T204931#4648564, @Liuxinyu970226 wrote: > @krenair please, no more DV certs, that's the reason why jawiki, ugwiki, wuuwiki, zhwiki, z... [14:01:44] akosiaris: go ahead [14:01:49] puppet is disabled [14:01:55] (03CR) 10Alexandros Kosiaris: [C: 032] cache::text: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458773 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [14:01:59] and we also have to be sure 5 minutes passed since the TTL lowering [14:02:09] before going RO [14:02:18] i thought spicerack waited on that? [14:02:21] <_joe_> no, we don't [14:02:25] was on purpose [14:02:28] <_joe_> mark: only for services [14:02:30] ok [14:02:32] to do the maintenance in the middle [14:02:34] <_joe_> where we don't have other operations to do [14:02:39] s/maintenance/warmup [14:02:42] ok merged [14:02:48] <_joe_> volans: you can proceed until we get out of the readonly [14:02:52] akosiaris: how many runs for the warmup? [14:02:56] 2 or 3? [14:02:58] 3 ? 
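[Editor's note: the "be sure 5 minutes passed since the TTL lowering before going RO" remark above reflects standard DNS cache behavior: resolvers may keep the old record for up to the previous TTL, so the earliest safe moment to rely on the lowered TTL is the lowering time plus the old TTL. The timestamp and 300-second value below are illustrative, taken from the log.]

```python
# Sketch of the TTL-wait constraint: after 00-reduce-ttl completes, wait
# out the old TTL before depending on the new record everywhere.
from datetime import datetime, timedelta

def earliest_safe_switch(ttl_lowered_at, old_ttl_seconds=300):
    # Resolvers can serve the cached record until the old TTL expires.
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)

lowered = datetime(2018, 10, 10, 14, 1, 11)  # 00-reduce-ttl END, per the log
print(earliest_safe_switch(lowered))
```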
[14:03:04] ack [14:03:05] 3 sounds ok [14:03:08] cool [14:03:18] <_joe_> 3 is the magic number [14:03:19] note to self to actually document that [14:04:06] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10BBlack) >>! In T204931#4654860, @Krenair wrote: >>>! In T204931#4648564, @Liuxinyu970226 wrote: >> @krenair please, no more DV certs, that's the reas... [14:04:18] I can see rows being read on eqiad now, cool [14:04:30] *on eqiad DBs [14:05:03] (03PS4) 10Alexandros Kosiaris: db: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) [14:05:12] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [14:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:20] 2nd run [14:05:24] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium) [14:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:26] 10Operations, 10monitoring, 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) p:05Triage>03Normal [14:06:15] 10Operations, 10monitoring, 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) p:05Normal>03High [14:06:56] o/ [14:07:06] o/ [14:07:41] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [14:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:53] 3rd and last run of warmup [14:07:57] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium) [14:07:57] ok [14:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] Hmm didn't it used to say at End how long 
it took? [14:08:26] 10Operations, 10Product-Analytics, 10monitoring, 10Discovery-Analysis (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) [14:08:31] Anyway, maybe for later [14:08:51] Krinkle: the warmup script output? [14:09:03] it's in the tmux, I can paste something if you need it [14:09:33] volans: for each of the steps logged yeah [14:09:40] No it's okay [14:10:02] the warmup would be skewed a bit as it ask for manual confirmation [14:10:03] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] akosiaris: all gerrit patches merged already? [14:10:17] yup [14:10:23] !log START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (volans@neodymium) [14:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:26] 10Operations, 10monitoring, 10Discovery-Analysis (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) [14:10:37] !log END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) (volans@neodymium) [14:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:46] confirmed the crons just disappeared on mwmaint2001 [14:10:57] mutante: also any running process? [14:11:03] MWScript.php still running [14:11:04] MostlinkedPage::reallyDoQuery is running [14:11:14] on enwiki, meta, zh [14:11:14] ----- OUTPUT of '! pgrep -c php' ----- │··· [14:11:17] 0 [14:11:17] <_joe_> volans: there are quite a few scripts still running [14:11:21] how? 
[14:11:25] hhvm [14:11:30] <_joe_> yes [14:11:31] hhvm -vEval.Jit=1 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --verbose [14:11:33] or at least, on db, not sure if on mw [14:11:34] volans: ack, ok [14:11:36] <_joe_> they were changed to run hhvm directly [14:11:36] I 'll kill them [14:11:38] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) [14:11:43] <_joe_> akosiaris: you do it? ok [14:11:58] ok gone [14:12:00] do you want me to kill the slow queries on db2*? [14:12:06] <_joe_> no [14:12:12] MostlinkedPage::reallyDoQuery ? [14:12:14] next step is going RO [14:12:22] <_joe_> jynus: it's up to you [14:12:25] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) a:03Mathew.onipe [14:12:32] they will fail when they try to write on codfw [14:12:37] we are clear of maintenance [14:12:41] we can decide later :-) [14:12:52] <_joe_> let's start the RO? [14:13:00] yes [14:13:03] +1 [14:13:07] volans: let's go in RO [14:13:12] warp 9 :D [14:13:18] I'll stop briefly before 06-set-db-readwrite, unless someone shouts before [14:13:19] (03PS1) 10Alex Monk: Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 [14:13:23] remember to stop right before the RW [14:13:28] ok [14:13:32] that, thanks! 
[14:13:34] ack starting [14:13:36] good [14:13:39] <_joe_> volans: If I have doubts I'll shout [14:13:40] (03PS1) 10Banyek: mariadb: depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465634 [14:13:45] !log START - Cookbook sre.switchdc.mediawiki.02-set-readonly (volans@neodymium) [14:13:46] !log MediaWiki read-only period starts at: 2018-10-10 14:13:46.068081 (volans@neodymium) [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:07] !log END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) (volans@neodymium) [14:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:12] !log START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (volans@neodymium) [14:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:14] I cannot edit [14:14:29] codfw masters confirmed read only = ON [14:14:32] on mysql [14:14:32] cool [14:14:33] (03PS2) 10Alex Monk: Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 [14:14:39] !log END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) (volans@neodymium) [14:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:44] !log START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (volans@neodymium) [14:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:03] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) (volans@neodymium) [14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:09] !log START - Cookbook sre.switchdc.mediawiki.04-switch-traffic (volans@neodymium) [14:15:10] <_joe_> WMFMasterDC changed correctly [14:15:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:18] puppet in flight for traffic [14:15:30] * volans hopes the diff will still match the regex [14:15:38] Hi. [14:15:39] ema: be ready to check the diff in case it doesn't [14:15:56] <_joe_> ShakespeareFan00: we're in the middle of the datacenter switchover, all sites are in read-only [14:15:58] ShakespeareFan00: We are on maintenance mode due to the DC failover [14:16:16] _joe_: ETA for completion of test? [14:16:27] no test and it's up to 1h [14:16:30] so far so good, message matched in one DC [14:16:34] running puppet on the other DC [14:16:46] ShakespeareFan00: questions to #wikimedia-tech please [14:16:59] (03PS2) 10Banyek: mariadb: depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465634 [14:17:20] akosiaris: we're close to the pause/decision point for RW [14:17:20] I can see reads traffic increasing on MySQLs in eqiad [14:17:29] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-traffic (exit_code=0) (volans@neodymium) [14:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:35] !log START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (volans@neodymium) [14:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:38] !log END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) (volans@neodymium) [14:17:39] volans: yup [14:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:45] ready for RW [14:17:45] <_joe_> volans: all good here, go on! 
[14:17:45] (03PS1) 10Jgreen: Update data.yaml for frack-(administration|bastion)-codfw subnet changes [puppet] - 10https://gerrit.wikimedia.org/r/465635 (https://phabricator.wikimedia.org/T204271) [14:17:49] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:17:49] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:17:53] dammit [14:17:57] go or wait? [14:17:59] RECOVERY - Check health of redis instance on 6381 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2865 keys, up 15 days 23 hours [14:18:00] RECOVERY - Check health of redis instance on 6381 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 19 days 23 hours [14:18:04] <_joe_> go. [14:18:09] RECOVERY - Check health of redis instance on 6378 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 19 days 23 hours [14:18:11] !log START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (volans@neodymium) [14:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:13] !log END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) (volans@neodymium) [14:18:13] yeah go [14:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:18] !log START - Cookbook sre.switchdc.mediawiki.07-set-readwrite (volans@neodymium) [14:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:19] RECOVERY - Check health of redis instance on 6378 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705606 keys, up 16 days 1 hours [14:18:20] it was like that last time too [14:18:24] yep [14:18:24] mysql read only off on eqiad MySQL masters 
[14:18:25] <_joe_> it's mobileapps everywhere [14:18:27] !log MediaWiki read-only period ends at: 2018-10-10 14:18:26.908958 (volans@neodymium) [14:18:27] !log END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) (volans@neodymium) [14:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:29] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:29] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:29] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:30] RECOVERY - Check health of redis instance on 6380 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4705813 keys, up 16 days 1 hours [14:18:31] back in RW [14:18:31] RECOVERY - Check health of redis instance on 6380 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 19 days 23 hours [14:18:31] RECOVERY - Check health of redis instance on 6379 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 19 days 23 hours [14:18:38] confirmed on mysql level [14:18:39] RECOVERY - Check health of redis instance on 6379 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3186 keys, up 15 days 23 hours [14:18:44] (03CR) 10Jgreen: [C: 032] Update data.yaml for 
frack-(administration|bastion)-codfw subnet changes [puppet] - 10https://gerrit.wikimedia.org/r/465635 (https://phabricator.wikimedia.org/T204271) (owner: 10Jgreen) [14:18:45] * akosiaris ignoring icinga-wm for a few [14:18:47] edits happening on enwiki [14:18:49] checking others [14:18:50] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:18:58] <_joe_> yeah mcs is unrelated [14:18:59] RECOVERY - Check health of redis instance on 6378 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4 keys, up 15 days 23 hours [14:19:00] RECOVERY - Check health of redis instance on 6381 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604373 keys, up 16 days 1 hours [14:19:08] wikivoyage I can edit [14:19:09] RECOVERY - Check health of redis instance on 6379 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705883 keys, up 16 days 1 hours [14:19:10] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:19] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:20] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:19:21] great [14:19:22] dewiki I can edit [14:19:29] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:29] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:30] PROBLEM - mobileapps endpoints health on 
scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:19:30] updating tendril, is safe [14:19:34] !log START - Cookbook sre.switchdc.mediawiki.08-update-tendril (volans@neodymium) [14:19:34] <_joe_> uh [14:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:36] volans: go [14:19:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:19:45] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) (volans@neodymium) [14:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] <_joe_> PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams [14:19:51] <_joe_> this is worrying [14:20:00] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:20:03] akosiaris, _joe_ I'll wait a GO for the restore-TTL and start-maintenance, just in case [14:20:19] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:20:20] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:20:20] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:20:21] <_joe_> we have an alarm on lvs at esams for cache_text [14:20:26] I can edit on eswiki [14:20:29] <_joe_> yes, we're in an outage [14:20:31] esams is just more-sensitive [14:20:36] it's all of them [14:20:37] volans: yeah ok good call [14:20:38] I confirm search have switched to eqiad (correctly this time), cache hit ratio is down (known issue) [14:20:39] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [14:20:41] and probably mobileapps [14:20:42] <_joe_> can someone check the 5xxs ? [14:21:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:21:07] checking [14:21:20] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:21:20] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:21:22] Notice: Undefined index: error in /srv/mediawiki/php-1.32.0-wmf.24/extensions/CirrusSearch/includes/BaseInterwikiResolver.php on line 223 [14:21:27] yep, looking [14:21:29] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&from=now-1h&to=now&var-site=All&var-cache_type=text&var-status_type=5 [14:21:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:21:30] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181286 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 6 seconds - 
replication_delay is 1539181286 [14:21:30] PROBLEM - Check health of redis instance on 6378 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181288 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 9 seconds - replication_delay is 1539181288 [14:21:43] and calling dcausse for help! [14:21:48] a skipe but seems going down [14:21:50] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386539 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604373 keys, up 99 days 12 hours - replication_delay is 1386539 [14:21:51] spike [14:22:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:22:10] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386558 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4705813 keys, up 99 days 12 hours - replication_delay is 1386558 [14:22:10] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [14:22:19] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181333 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 53 seconds - replication_delay is 1539181333 [14:22:20] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181338 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 58 seconds - replication_delay is 1539181338 
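The replication_delay figures in the rdb2001 alerts above (e.g. 1539181286) are raw Unix timestamps rather than second counts: those instances had just restarted ("up 6 seconds") and had no last-sync marker yet. A minimal sketch of how such a check can degenerate, assuming (hypothetically, this is not the real check's code) that the delay is computed as "now" minus a last-sync timestamp that defaults to zero:

```python
CRIT_THRESHOLD = 600  # seconds, matching the "600" in the alert output

def replication_delay(last_sync_ts, now):
    """Seconds since the replica last confirmed sync with its master.

    On a freshly restarted instance no sync has been recorded yet; a
    naive implementation falls back to 0, so the "delay" becomes the
    current Unix epoch itself.
    """
    return now - (last_sync_ts or 0)

now = 1539181286  # 2018-10-10, roughly when the alerts above fired

assert replication_delay(now - 5, now) == 5           # healthy replica: small delay
assert replication_delay(None, now) == 1539181286     # fresh restart: "delay" == epoch
assert replication_delay(None, now) > CRIT_THRESHOLD  # so the check goes CRITICAL
```

This matches why the values on long-running rdb2003/rdb2005 instances look like plausible stale delays (around 1.38M seconds, roughly 16 days) while the just-restarted rdb2001 instances report the full epoch.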
[14:22:26] jynus: it's the error handling code that is broken, it's fixed on master but not yet deployed [14:22:29] PROBLEM - Check health of redis instance on 6378 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386573 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705606 keys, up 99 days 12 hours - replication_delay is 1386573 [14:22:29] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386573 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705883 keys, up 99 days 12 hours - replication_delay is 1386573 [14:22:35] edits count going up [14:22:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:22:40] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:22:41] _joe_: why is redis so noisy? can I ignore it for now? [14:22:43] <_joe_> godog: any news on the 5xxs? 
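The graphite 5xx alerts above report lines like "33.33% of data above the critical threshold [1000.0]": the check fires when more than a configured fraction of recent datapoints exceed the threshold. A simplified sketch of that style of check (not the actual check_graphite implementation):

```python
def percent_over(datapoints, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(v > threshold for v in values) / len(values)

# Ten one-minute 5xx rates; three of the nine non-null points exceed
# the critical threshold of 1000 req/min.
series = [150, 200, 1800, 2500, 1200, 300, 250, 180, None, 90]

assert round(percent_over(series, 1000), 2) == 33.33  # "33.33% of data above [1000.0]"
```

Treating nulls as absent (rather than as zeros) is one reason these alerts can look jumpy: a few missing datapoints change the denominator.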
[14:22:46] I don't see major errors on mediawiki [14:22:46] <_joe_> volans: ignore [14:22:49] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:22:49] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:22:50] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:22:51] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:22:57] at the moment [14:22:59] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [14:22:59] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380624 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 2854 keys, up 99 days 12 hours - replication_delay is 1380624 [14:23:06] _joe_: the 5xx's already ended, it was a short spike [14:23:10] PROBLEM - Check health of redis instance on 6378 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382550 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 99 days 12 hours - replication_delay is 1382550 [14:23:10] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380634 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 2865 keys, up 99 days 12 hours - replication_delay is 1380634 [14:23:10] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380635 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3186 keys, up 99 days 12 hours - replication_delay is 1380635 [14:23:10] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:23:10] PROBLEM - Check health of redis instance on 6478 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380637 600 - REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 99 days 12 hours - replication_delay is 1380637 [14:23:17] _joe_: yeah looks like cache hosts talking to cache hosts in eqiad [14:23:19] <_joe_> the redis alerts, sigh [14:23:19] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382560 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 99 days 12 hours - replication_delay is 1382560 [14:23:29] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:23:30] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382570 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3088 keys, up 99 days 12 hours - replication_delay is 1382570 [14:23:30] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382571 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2888 keys, up 99 days 12 hours - replication_delay is 1382571 [14:23:33] I don't think the cache layer is at fault here :P [14:23:38] regarding 5XX, mostly POST https://en.wikipedia.org/w/api.php [14:23:39] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:23:40] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:23:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:23:49] people are probably trying anyway to save [14:23:50] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:23:52] ? [14:23:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:00] but why 5xx ? [14:24:05] but doesn't see anymore [14:24:08] timeouts/drops [14:24:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:10] just a spike [14:24:13] the 5xx.log on oxygen doesn't seem particularly active [14:24:20] I agree [14:24:30] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:24:32] Some are php time outs [14:24:37] are we in good enough shape to restart maintenance? [14:24:39] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 49.08, 29.21, 13.62 [14:24:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:50] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:24:53] in terms of db errors I see fewer than I expected, none ongoing [14:25:03] <_joe_> volans: I still see nginx avail alerts [14:25:09] volans: give it another 2-3 mins [14:25:09] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 62.11, 32.35, 14.41 [14:25:11] Some from language converter on page views. So not blocking but something to look at later for better warm up [14:25:13] ack [14:25:19] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 60.80, 31.14, 13.38 [14:25:22] it's a known issue that many of our alerts have bad/conflicting timing [14:25:33] <_joe_> mediawiki fatals are firing an alert right now [14:25:39] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:25:41] for the same underlying problem X, alert1 can come and go, then alert2 fires a minute later and persists a while, etc [14:25:43] literally no db errors at the moment [14:25:50] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.85, 32.53, 15.40 [14:25:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:26:04] some high cpu load on various api appservers [14:26:08] is that our hhvm api high cpu load issue again? 
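The "High CPU load" alerts above print the classic 1/5/15-minute load average triple (e.g. "49.08, 29.21, 13.62"): a rising 1-minute value against lower 5/15-minute values means the load is new, which is what a post-switchover traffic surge looks like. A sketch of a Nagios-style load check (illustrative only; the production check likely scales its thresholds by CPU count):

```python
import os

def check_load(warn, crit):
    """Compare the 1-minute load average against warn/crit thresholds
    and return a (nagios_status, message) pair, mimicking the alert
    format seen above."""
    one, five, fifteen = os.getloadavg()
    msg = f"load average: {one:.2f}, {five:.2f}, {fifteen:.2f}"
    if one >= crit:
        return 2, f"CRITICAL - {msg}"
    if one >= warn:
        return 1, f"WARNING - {msg}"
    return 0, f"OK - {msg}"

status, message = check_load(warn=24.0, crit=48.0)
assert status in (0, 1, 2)
```

With warn=24/crit=48, a host reporting "49.08, 29.21, 13.62" would return status 2 (CRITICAL), matching the alerts in the log.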
[14:26:19] grafana says the rate of 5xx is now "green" again https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=5m&orgId=1 [14:26:20] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 48.39, 34.05, 17.31 [14:26:22] there's some smaller 5xx spikes now, but they're dwarfed by the first one in graphing [14:26:33] <_joe_> mark: no I think it's some unbalance [14:26:36] 0 db connection errors [14:26:48] The fatals are no longer happening, I believe [14:26:57] very low rate though, hardly a "spike", on the newer ones [14:26:59] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:27:25] I only see the usual rate of 500s [14:27:32] <_joe_> depooling mw1226 to check [14:27:43] <_joe_> can someone else look at general api appserver health? [14:27:57] _joe_: I am doing it from a high overview [14:27:57] yeah I'll do [14:28:05] logs, etc. [14:28:09] <_joe_> I think it's just load though [14:28:10] My view is mainly https://logstash.wikimedia.org/goto/ad52476578b0bb4ad9fbcd818c479a1f [14:28:20] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [14:28:20] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:28:24] (03CR) 10Mathew.onipe: base::monitoring::host: added prometheus check for network receive drops (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [14:28:50] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:29:10] Krinkle: isn't the worst thing the known cirrus error? 
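The "known cirrus error" referenced here is the PHP notice seen earlier ("Undefined index: error" in BaseInterwikiResolver.php): error-handling code indexing an "error" field that is not always present in the response. A language-neutral sketch of the bug pattern and its defensive fix, in Python (the response shape here is hypothetical, not the actual CirrusSearch structure):

```python
def describe_failure_naive(response):
    # Buggy pattern: assumes the backend always returns an "error" field.
    # In PHP this raises "Notice: Undefined index: error"; here, a KeyError.
    return response["error"]["reason"]

def describe_failure_safe(response):
    # Defensive fix: tolerate responses where the field is missing entirely.
    error = response.get("error") or {}
    return error.get("reason", "unknown failure")

assert describe_failure_safe({"error": {"reason": "timeout"}}) == "timeout"
assert describe_failure_safe({}) == "unknown failure"
```

As jynus noted above, the fix for the real bug was already on master at this point, just not yet deployed.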
[14:29:20] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 48.04, 32.43, 17.07 [14:29:20] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:29:34] (03PS2) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) [14:29:45] the overall number of alerts in icinga has gone down quite a bit [14:29:47] (03PS1) 10Vgutierrez: Add discovery alias for certcentral [dns] - 10https://gerrit.wikimedia.org/r/465636 (https://phabricator.wikimedia.org/T199711) [14:30:18] Yeah, looks like Cirrus code for handling errors.. has an error [14:30:23] none of the high load api servers have recovered yet though [14:30:33] that's the only thing I am worried about [14:30:38] what is? [14:30:39] * bearND is unclear about why the /page/random endpoint latency spiked. All we're doing is requesting 12 random articles from the MW API (using the random generator). 
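The request bearND describes, fetching a batch of random article titles from the MediaWiki API, can be done in a single call with the standard random query module. A sketch of building that request (parameter names per the public Action API; the exact call mobileapps makes may differ):

```python
from urllib.parse import urlencode

def random_titles_query(domain, count=12):
    """Build a MediaWiki Action API URL that returns `count` random
    article titles in one request, as a /page/random-style endpoint
    could use."""
    params = {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnnamespace": 0,   # main (article) namespace only
        "rnlimit": count,
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"

url = random_titles_query("en.wikipedia.org")
assert url.startswith("https://en.wikipedia.org/w/api.php?")
assert "rnlimit=12" in url
```

This also illustrates why the endpoint is sensitive to API appserver load: it is a thin wrapper, so its latency is essentially the API's latency, which is the point made a few lines below ("mobileapps complaining is probably a symptom").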
[14:30:40] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:30:46] the api servers not recovering [14:30:50] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 39.27, 35.86, 22.12 [14:31:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:31:01] bearND: the API is under heavy load currently [14:31:05] probably related [14:31:06] I'm seeing lots of StashEdit cache failures [14:31:10] More than usual [14:31:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:31:31] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 39.03, 35.27, 20.20 [14:31:37] could be a "replay" issue of things that were held during the RO period and were sent to the API hosts all at once, once back RW? 
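The "replay" theory above describes a thundering herd: clients that kept retrying during the read-only window all resend the moment the site is writable again. The standard client-side mitigation is jittered exponential backoff, sketched here (a generic pattern, not what any particular bot actually does):

```python
import random

def backoff_schedule(attempts, base=1.0, cap=60.0, rng=random.random):
    """Jittered exponential backoff ("full jitter"): each retry waits a
    random amount up to min(cap, base * 2**attempt), so a fleet of
    clients does not resend simultaneously when the service recovers."""
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]

# With rng pinned to 1.0 we see the deterministic upper bounds:
delays = backoff_schedule(6, rng=lambda: 1.0)
assert delays == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# With real jitter, every delay stays under the cap:
assert all(d <= 60.0 for d in backoff_schedule(10))
```

Without the jitter term, every client computes the same deterministic schedule and the retries still arrive in synchronized waves.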
And lots of memc errors as well, which is so noisy I have no ability to tell what is and isn't new or random [14:31:40] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [14:31:40] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:31:40] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:31:50] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 11.94, 22.69, 16.38 [14:31:54] !log oblivian@puppetmaster1001 conftool action : set/weight=15; selector: cluster=api_appserver,service=apache2,dc=eqiad,name=mw122.* [14:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:56] there's the first recovery [14:32:00] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:32:10] so mobile apps is only complaining because api is high, which means more latency [14:32:18] (guessing)? 
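The conftool !log entry above lowers the pool weight of the mw122.* API appservers to 15; in a weighted pool, each backend's share of traffic is its weight divided by the sum of all weights. A tiny sketch of that arithmetic (illustrative only, not conftool or pybal internals; the host names and the weight of 20 for the newer hosts are assumptions):

```python
def traffic_share(weights):
    """Fraction of requests each backend receives in a weighted pool."""
    total = sum(weights.values())
    return {host: w / total for host, w in weights.items()}

# Older mw122x hosts dropped to weight 15; newer hosts assumed at 20.
pool = {"mw1226": 15, "mw1227": 15, "mw1230": 20, "mw1240": 20}
shares = traffic_share(pool)

assert round(shares["mw1226"], 4) == round(15 / 70, 4)  # ~21.4% each for old hosts
assert abs(sum(shares.values()) - 1.0) < 1e-9
```

Lowering the weight of the struggling older batches shifts proportionally more traffic onto the newer hardware, which is the rebalance _joe_ describes later in the log.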
[14:32:19] I see load dropping on most of the unhappy api servers [14:32:25] yeah mobileapps complaining is probably a symptom [14:32:29] yeah I do too [14:32:30] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:32:31] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:32:35] <_joe_> no I think it's not related to that [14:32:40] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:32:56] still going up on mw1233 [14:33:12] REMINDER: we still need to restart maintenance [14:33:16] !log oblivian@puppetmaster1001 conftool action : set/weight=15; selector: cluster=api_appserver,service=apache2,dc=eqiad,name=mw123.* [14:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:33:26] <_joe_> volans: go on please [14:33:30] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:33:33] <_joe_> as far as I'm concerned [14:33:39] is it going to hit the API ? [14:33:41] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:33:43] let alex decide :) [14:33:46] I am a bit concerned [14:33:55] <_joe_> akosiaris: what do you mean? [14:34:00] the maintenance jobs [14:34:05] yeah, I would also wait for that [14:34:06] <_joe_> maintenance? 
shouldn't [14:34:07] I 'd like to not add load to the API currently [14:34:17] <_joe_> they should call the db directly [14:34:20] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.79, 24.34, 19.09 [14:34:21] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:34:39] (03Abandoned) 10Vgutierrez: Add discovery alias for certcentral [dns] - 10https://gerrit.wikimedia.org/r/465636 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [14:34:48] <_joe_> the apis are ok now [14:34:49] mw1233 says [Wed Oct 10 14:26:33 2018] mce: [Hardware Error]: Machine check events logged [14:34:51] cpu usage is slowly falling but still is pretty high [14:34:56] <_joe_> I mean the appservers [14:35:01] <_joe_> jijiki: oh nice [14:35:05] api appservers or just appservers ? [14:35:21] <_joe_> api [14:35:38] mw1233 also getting better now [14:35:41] well the appservers are in a similar state [14:35:42] how about mw1231, jijiki? [14:35:44] [1301116.526667] CPU12: Core temperature above threshold, cpu clock throttled (total events = 118014) [14:35:51] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 15.38, 24.26, 21.01 [14:35:58] apergos: let me login there as well [14:36:15] <_joe_> akosiaris: it's normal when you pour all the traffic all of a sudden [14:36:19] I'm concerned as well, about the memc and stash failures [14:36:24] those are pretty old problems: https://phabricator.wikimedia.org/T149287 [14:36:30] _joe_: didn't happen in codfw though did it [14:36:32] <_joe_> Krinkle: what is failing? [14:36:33] unrelated (or simply exposed by the current load) [14:36:36] <_joe_> mark: it happened [14:36:43] apergos: same [14:36:46] 5XX stopped apparently at :28 [14:36:50] <_joe_> mark: I had to rebalance the apis [14:36:51] codfw is slightly more powerful than eqiad fwiw [14:36:56] lots of posts to commons were failing too [14:36:58] <_joe_> Krinkle: what are you referring to? 
[14:37:13] thanks, jijiki, gtk [14:37:20] but indeed load is going down [14:37:29] I think we can indeed restart maintenance [14:37:30] all the affected API servers (122/123) are the old batches, less powerful than what we have in codfw [14:37:32] akosiaris: sadly, dbs are more powerful on eqiad, so that could cause issues both ways [14:37:48] volans: please go on with restarting maintenance [14:37:50] jynus: lol [14:37:52] they are complaining about temp as well ofc [14:37:54] oh the irony [14:37:59] <_joe_> jijiki: ofc [14:38:01] akosiaris: ack [14:38:03] yeah that's expected with all that load [14:38:06] !log START - Cookbook sre.switchdc.mediawiki.08-start-maintenance (volans@neodymium) [14:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:15] load == throttle == more load [14:38:17] <_joe_> jynus: less appserver power and more db power, eeek [14:38:19] <_joe_> :) [14:38:28] _joe_: logstash. I'm looking now to see how it was before [14:38:48] <_joe_> Krinkle: the memcached error rates you mean [14:39:13] !log END (FAIL) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=99) (volans@neodymium) [14:39:14] Krinkle: isn't edit count worrying? [14:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:28] <_joe_> Krinkle: they were very high but went away afaict [14:39:29] fail ? [14:39:34] confirmed the maint crons appeared on mwmaint1002 (and not 1001) as they should [14:39:36] I don't know if those core temp alerts should be "expected", I think our previous take on that issue was that it warranted some hw investigation (e.g. 
replace thermal paste on CPUs) [14:39:42] checking puppet failure [14:39:47] https://grafana.wikimedia.org/dashboard/db/edit-stash?orgId=1&from=now-30m&to=now [14:39:50] <_joe_> akosiaris: I bet it's running on both mwmaint1001 and 1002 [14:39:50] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:39:51] ah phew [14:40:05] mwmaint1001.eqiad.wmnet [14:40:09] <_joe_> bblack: yes, they will need thermal paste probably [14:40:17] This one seems to have recovered, which is not easy to see in logstash but clear on Graham's [14:40:24] Grafana [14:40:27] bblack: yeah, but that is blocked on procurement of thermal paste [14:40:33] Stash is Ok [14:40:35] https://phabricator.wikimedia.org/T149287#4319579 [14:40:39] akosiaris: I'll re-run it [14:40:40] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 14.43, 24.15, 23.36 [14:40:43] ok [14:40:48] load on mw1233 is going down, we could put it back [14:40:51] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:40:54] Krinkle: absolute count worries me https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1&from=now-24h&to=now [14:40:55] !log START - Cookbook sre.switchdc.mediawiki.08-start-maintenance (volans@neodymium) [14:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:53] <_joe_> jynus: the switchover kills some bots/external programs [14:41:59] I guess [14:42:00] <_joe_> I think we saw the same last month [14:42:01] !log END (FAIL) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=99) (volans@neodymium) [14:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:02] <_joe_> let me check [14:42:21] jynus: https://grafana.wikimedia.org/dashboard/db/backend-save-timing-breakdown?refresh=5m&orgId=1&from=now-3h&to=now [14:42:26] _joe_: fwiw, 
mwmaint1001 does not have any maint jobs running [14:42:28] <_joe_> volans: not like I didn't tell you the two mwmaint were gonna be an issue [14:42:28] Scroll down to see breakdown [14:42:29] akosiaris: no sorry my bad [14:42:30] yeah I wouldn't be surprised if some of them will just throw an exception and give up in response to read-only mode [14:42:31] that's expected [14:42:35] yeah [14:42:37] Looks like it is mostly bots [14:42:40] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:42:43] !log START - Cookbook sre.switchdc.mediawiki.08-restore-ttl (volans@neodymium) [14:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:45] That haven't come back yet [14:42:51] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) (volans@neodymium) [14:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:12] Krinkle: save time is high? [14:43:18] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui) This was done successfully and new wikis are now live on eqiad. What is pending now is: - Run the DNS changes for wikireplicas: T206623: - Re-... [14:43:40] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:44:04] akosiaris: all done on the cookbooks [14:44:20] <_joe_> the cpu usage for the api servers is still a bit high [14:44:28] jynus: indeed. 
But bottom right per counts by group user and entry point etc [14:44:32] volans: ok good [14:45:00] Krinkle: yes, I saw those [14:45:10] jynus: marostegui last thing is https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458790/ but it's not urgent at all [14:45:25] not urgent at all [14:45:27] The perf issue I suspect is mostly due to memcached [14:45:38] akosiaris: I can merge and deploy that tomorrow morning even if you want [14:45:40] _joe_: what do you mean? mwmaint1001 has no crons and mwmaint1002 has them. i merged a hack for that. we can now (later) revert it to normal where it's only based on active_dc and not hostname [14:45:45] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel) Coming back to this discussion, I'll try to make my point more clear: wdqs public... [14:45:55] <_joe_> mutante: the cookbooks assume we have 1 server per dc [14:46:02] marostegui: yeah, that's fine [14:46:05] mutante: that the cookbook checks that the mwmaint host in the new active dc has crontabs [14:46:11] that's not true for 1001, just for 1002 [14:46:14] akosiaris: Do we still want the banner up? [14:46:16] akosiaris: ok, I will take care of that then tomorrow. [14:46:18] _joe_: volans: gotcha! ok [14:46:29] JohanJ: not any longer. Feel free to remove it [14:46:43] thanks! [14:46:44] so to clarify, mwmaint is ok as in "working"? [14:46:53] jynus: yes, since the first run [14:46:57] thanks [14:46:57] yup, it's just mwmaint1002, not mwmaint1001 [14:46:57] <_joe_> mediawiki fatals are still very high [14:47:02] <_joe_> anyone verifying? [14:47:11] I misread the output, at first I thought it was a puppet failure, sorry [14:47:22] akosiaris: we should take down the banners [14:47:29] who's in charge of those? 
[14:47:43] volans: 10 lines up :) [14:47:47] volans: read backlog :-) [14:47:51] already being done :-) [14:47:53] rotfl [14:48:01] cannot get distracted for one sec here [14:48:01] <_joe_> so I still see 2 alerts that are really worrisome [14:48:01] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 16.79, 21.24, 23.57 [14:48:01] :D [14:48:03] _joe_: not anymore as per https://logstash.wikimedia.org/goto/4d8333a71a19993fe8cae7923a846dd3 [14:48:15] there are some long running http queries [14:48:23] <_joe_> marostegui: ok [14:48:31] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 16.82, 18.10, 23.94 [14:48:48] Apache/HHVM: shows elevated levels of req queuing. Going down slowly but still in hundreds whereas normally <5 [14:49:02] until we have a better working theory, I think the general issue is that dbs are a bottleneck on codfw on first pool, and mw servers are on eqiad first pool [14:49:08] <_joe_> Krinkle: yes, I need to do a rolling restart of the appservers [14:49:26] we should probably plan to expand capacity in eqiad early next fiscal [14:49:29] <_joe_> jynus: no, I think this has to do with hhvm running semi-idle for too long [14:49:43] _joe_: but that didn't happen for codfw? [14:49:53] <_joe_> jynus: it did, just less sensitive [14:49:57] <_joe_> because more capacity [14:50:00] _joe_: hmm restart why? [14:50:10] ok, so you are agreeing with me, _joe_ in a way [14:50:10] <_joe_> also we didn't restart hhvm, which we did there [14:50:16] Is it asleep or something and needs to be zapped? [14:50:32] moar machines == less problems [14:50:44] Can I quote you on that? 
[14:50:52] <_joe_> Krinkle: as soon as I restarted hhvm on one server, it went from using 75% of cpu to 30% [14:50:59] you can bash it if you want [14:51:05] K :) [14:51:16] <_joe_> but let me try to understand the situation better first [14:51:16] moar machines == more hw problems [14:51:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [14:51:26] mark: agree [14:51:27] so we need to restart HHVM in the end :-P [14:51:30] I meant capacity [14:51:41] or just get rid of it [14:51:45] or just move to php7... :) [14:51:46] <_joe_> so requests queued typically are due to db latencies or high cpu usage on the appservers [14:51:57] _joe_: hm the restart sounds like it might be killing the problem incl the reqs for users [14:52:13] _joe_: is there a way we can drain? Or do we do that already around restart? [14:52:25] <_joe_> Krinkle: we do depool, wait a few secs, restart [14:52:28] yeah the restart script does that [14:52:32] 60s? [14:52:44] <_joe_> typically hhvm won't die until it has finished answering old requests [14:52:53] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Setup rsyslog to be able to produce logs to Kafka - https://phabricator.wikimedia.org/T206633 (10fgiunchedi) p:05Triage>03Normal [14:53:04] <_joe_> anyways, sorry, can we postpone explanations? [14:53:12] <_joe_> I'm trying to assess if I need to do something [14:53:16] this isn't explanations [14:53:21] this is reasoning about the best course of action [14:53:57] so does the restart script wait until min(60s, all requests drained) or not? 
[14:54:09] Thx, I'll also turn down the curiosity dial a bit [14:54:56] <_joe_> mark: no, hhvm does not stop until it has finished responding to its requests, or gets killed by systemd at its default timeout [14:55:00] _joe_: I can check logs for potential fall out. Which one did we restart? [14:55:06] <_joe_> which is - IIRC - 180 seconds [14:55:09] ok [14:55:18] <_joe_> mark: I'm not sure about the 180 seconds [14:55:24] Edit Counts is almost at the same level as it was before the failover now [14:55:25] <_joe_> but it's in that ballpark [14:55:35] <_joe_> Krinkle: mw1246 for instance [14:55:45] <_joe_> ok one thing is really baffling [14:55:46] RefreshLinksJob::run now being the main complainer job [14:56:09] <_joe_> we had a super large spike in network traffic both on api and appservers [14:56:15] <_joe_> which has gone down significantly since [14:56:23] <_joe_> I bet it's populating memcached [14:56:30] <_joe_> that's taking so much toll [14:56:32] in or out or both? [14:56:38] <_joe_> jynus: both [14:56:52] I see es2/3 the most stressed [14:56:57] which would agree with that [14:56:58] 90s is the timeout when systemd kicks in [14:57:00] <_joe_> network is going the same way as the CPU load [14:57:40] *I saw [14:57:42] at the peak of the spike the network usage was about twice the regular one for eqiad [14:57:42] <_joe_> Krinkle: on second thoughts, it seems the cpu load is consistent with the network traffic, it's slowly going down from a figure that was ~ 2x what it usually is [14:57:52] now it's about 20-25% more [14:57:52] <_joe_> akosiaris: similarly the cpu [14:57:58] <_joe_> sane [14:58:03] <_joe_> *same [14:58:22] mw1246 is showing a lot of errors in the last 2 min rising [14:58:24] <_joe_> ok so, on one side in the last few months we increased memcached traffic a lot [14:58:33] <_joe_> Krinkle: uhm lemme see [14:58:39] emm [14:58:40] <_joe_> you mean in logstash?
[14:58:44] Yeah [14:58:50] this is very strange, job queue errors on mw2* [14:59:01] And gone again [14:59:03] Was memc [14:59:08] <_joe_> jynus: oh? [14:59:10] timeout memc [14:59:21] <_joe_> wait, did we switch back jobrunners.discovery.wmnet? [14:59:36] Might be the usual mcrouter cascading failure that we see sometimes [14:59:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:59:47] I'd say move on for now, will look at later [14:59:55] _joe_: that would be tomorrow IIRC [15:00:04] the actual DNS entry that is [15:00:09] <_joe_> akosiaris: nope, luckily it was done now [15:00:31] <_joe_> Krinkle: yes, it's the memcached servers not coping with the traffic, and mcrouter makes you notice that [15:00:44] <_joe_> while nutcracker would silently fail over to another node [15:00:44] HHVM queuing has also recovered on all eqiad app servers [15:00:55] <_joe_> yes, which means we're out of the woods [15:00:57] Yeah [15:01:08] ah yes, sorry I got confused with the services discovery records [15:01:11] * akosiaris sigh [15:01:19] The mcrouter really needs rethinking on how we configure it and how sensitive it is [15:01:36] edit rate is also going up, bots seem to reconnect [15:01:39] <_joe_> Krinkle: we configure it in a way that, according to their docs, should work differently [15:01:45] <_joe_> than what we observe [15:01:52] It's currently optimized for read only, but mw is RW.
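The depool, drain, restart cycle discussed earlier in this log (the restart script depools, waits for in-flight requests, then restarts HHVM, with systemd killing it after a timeout) can be sketched roughly as follows. The `drain_wait` helper, its probe interface, and the stub probe are illustrative assumptions, not the actual scap/conftool restart script:

```shell
#!/bin/sh
# Sketch of the drain step: poll a probe command that prints the number
# of in-flight requests, returning once it reaches zero or the timeout
# expires. Echoes the number of seconds actually waited.
drain_wait() {
    probe=$1 timeout=$2 waited=0
    while [ "$waited" -lt "$timeout" ] && [ "$("$probe")" -gt 0 ]; do
        sleep 1
        waited=$((waited + 1))
    done
    echo "$waited"
}

# Stub probe for an already-idle server; in production this would have
# to query the server's request counter somehow (an assumption here).
probe_idle() { echo 0; }

drain_wait probe_idle 60   # → 0 (nothing in flight, returns at once)
```

The real flow would then be depool, `drain_wait`, `systemctl restart hhvm`, repool; the point of the min(timeout, drained) logic is that a wedged server cannot hold the pool hostage forever.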
[15:02:47] So where were we [15:03:53] I noticed that conn_yields in all the shards went up to ~700/900, possibly mcrouter was throttled, but I am not sure if the counter went up during the warm up or later on [15:04:51] not sure if it's related to switchover, but https://noc.wikimedia.org/conf/ 503's from here (and from bast4001 in ulsfo), but seems to be fine when requested from elsewhere [15:05:23] <_joe_> ebernhardson: define "here" :P [15:05:39] _joe_: my house. That's why I added bast4001 for ulsfo, where my requests go through :P [15:05:44] <_joe_> ebernhardson: eheh ok [15:05:53] <_joe_> just noc? [15:06:00] <_joe_> that's pretty strange [15:06:04] indeed [15:06:13] <_joe_> ema, bblack any idea? [15:06:19] (03CR) 10Vgutierrez: [C: 032] Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 (owner: 10Alex Monk) [15:06:31] <_joe_> ebernhardson: can you check your X-cache headers? [15:07:01] _joe_: < X-Cache: cp2013 pass, cp4029 miss, cp4031 miss [15:07:05] Noc is active active right? Maybe one of the two backends isn't good [15:07:27] <_joe_> Krinkle: I don't even know that, I'll have to check the varnish configs [15:07:49] (03PS2) 10Elukey: role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) [15:08:02] <_joe_> so it's trying to go via cp2013, which makes me think yes, codfw [15:08:11] <_joe_> ebernhardson: can you try from bast2* [15:08:12] <_joe_> ?
[15:08:18] sure sec [15:08:24] yes it is [15:08:34] (03CR) 10Elukey: [C: 032] role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [15:08:37] bast4002 replaced bast4001 [15:08:41] <_joe_> oh right mwmaint2001 [15:08:45] Runs on mwmaint* right [15:08:55] ebernhardson: ^ [15:09:08] (03CR) 10jenkins-bot: Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 (owner: 10Alex Monk) [15:09:46] I can confirm noc.wikimedia.org failure in codfw too [15:09:56] fails from 4002 as well [15:09:56] https://noc.wikimedia.org/conf/ -> cp2xxx -> 503 [15:10:14] <_joe_> yes [15:10:46] (03PS2) 10Elukey: statistics::rsync::eventlogging: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) [15:10:50] The cache hasn't cleared, so I can still see the banner which Seddon removed more than 20 minutes ago. This is longer than normal. 
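For context on the debugging above: the `X-Cache` header ebernhardson pasted lists each cache hop the request crossed, and the first (leftmost) entry is the cache that talked to the application layer, which is how `cp2013` pointed the 503 at codfw. A tiny helper to pull that hop out of a header value (the helper name and parsing are my own illustration, not a Wikimedia tool):

```shell
#!/bin/sh
# backend_hop: print the backend-most cache host from an X-Cache value.
backend_hop() {
    # "cp2013 pass, cp4029 miss, ..." -> "cp2013"
    printf '%s\n' "$1" | cut -d',' -f1 | awk '{print $1}'
}

backend_hop 'cp2013 pass, cp4029 miss, cp4031 miss'   # → cp2013
```

In a live check you would feed it the header from something like `curl -sI https://noc.wikimedia.org/conf/`.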
[15:11:23] hmm [15:11:24] noc.wm.o's varnish backends are currently A/A, I believe intended [15:11:25] noc: [15:11:25] backends: [15:11:25] eqiad: 'mwmaint1001.eqiad.wmnet' [15:11:25] codfw: 'mwmaint2001.codfw.wmnet' [15:11:26] <_joe_> ebernhardson: lol, it should work now [15:11:28] there's perhaps something wrong with mwmaint2001 when it comes to serving noc.w.o [15:11:33] that's the wrong maint server [15:11:37] <_joe_> !log started again hhvm on mwmaint2001 [15:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:41] (03CR) 10Elukey: [C: 032] statistics::rsync::eventlogging: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [15:11:45] sigh [15:11:46] <_joe_> akosiaris: you killed hhvm there [15:11:47] <_joe_> :P [15:12:04] _joe_: it was restarted by systemd though [15:12:12] I did notice it in ps output right after that [15:12:17] bblack: 1001 not 1002? [15:12:18] akosiaris: which one is the right maint server? [15:12:25] mwmaint1002 [15:12:27] volans: I'm just pasting from the puppet repo! [15:12:38] hieradata/role/common/cache/text.yaml [15:12:43] <_joe_> akosiaris: uhm... it seems it doesn't work tbh, I can't connect to hhvm [15:13:16] if you want me to i can remove the mw_maintenance role from mwmaint1001 altogether. patch is already waiting [15:13:48] <_joe_> ebernhardson: confirmed it works again from codfw [15:13:52] noc.w.o seems to be working again now yes [15:14:00] _joe_: so what ? [15:14:08] the hhvm process was foobar or what ? [15:14:12] <_joe_> akosiaris: hhvm was not responding to requests [15:14:15] I stil see the maintenance banner as well [15:14:20] still* [15:14:25] works from 4002 as well now [15:14:27] mutante: should noc.wm.o move to s/1001/1002/ as well?
[15:15:01] bblack: yes, either 1002 or 1002/2001 [15:15:10] ok [15:15:23] got a patch already for that [15:15:44] ok, go for it [15:16:19] mutante: should we make sure people don't end up using mwmaint1001 ? [15:16:23] (03PS1) 10Alexandros Kosiaris: noc: Switch varnish backend to mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/465644 [15:16:32] (03PS1) 10Dzahn: Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645 [15:16:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] noc: Switch varnish backend to mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/465644 (owner: 10Alexandros Kosiaris) [15:16:59] akosiaris: yes, we can either turn it into a role(spare) now or we can add the warning banner [15:17:12] definitely the warning banner [15:17:30] at least for a couple of days [15:18:14] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Gehel) [15:20:18] (03PS1) 10Dzahn: mw_maintenance: ensure motd warning banner on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465646 [15:20:28] ^ this is to get the warning motd [15:20:35] $ensure and $motd_ensure were separate [15:20:38] that's why that didn't happen [15:20:43] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:21:33] (03PS2) 10Dzahn: mw_maintenance: ensure motd warning banner on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465646 [15:22:32] mutante: probably !$ensure ? [15:22:38] akosiaris: fixed [15:23:02] akosiaris, JohanJ I still see the banner in enwiki, itwiki, etc.
[15:23:21] there was a separate 'primary_dc' lookup for $ensure (crons) and $motd_ensure (banner) [15:23:28] (thanks jijiki for letting me know ;) ) [15:23:44] Still the banner as well here. [15:23:46] we can just use $ensure for all [15:24:09] Yeah. Me too. According to Seddon, he removed it at 4:47 PM and expected up to ten minutes for the cache to clear. [15:24:11] which has the special hack already to differ between 1001 and 1002 [15:24:17] damn that's bad [15:24:24] ok so what ? CN is misbehaving ? [15:24:25] bblack: can we do anything to clear the banner? [15:24:44] it's not the cache [15:24:49] the varnish caches I mean [15:25:00] if you do cache busting the thing is still there [15:25:02] I was going to say, I think that's not part of article content caching [15:25:16] banners are fetched via js stuff and short and/or nonexistent TTLs [15:26:41] JohanJ: We don't think it's the caches [15:26:48] OK. [15:28:11] I'll reach out to people to see who can help debug that [15:29:17] is there some pointer to where the CN stuff is configured (i.e. where the 4:47pm removal shows up at?)... I know I've seen such a page before [15:29:17] (03PS3) 10Dzahn: mw_maintenance: use $ensure, not $motd_ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:29:41] 10Operations, 10fundraising-tech-ops, 10netops: icinga reports frbast2001.frack.eqiad.wmnet as host down - https://phabricator.wikimedia.org/T206637 (10Jgreen) [15:29:57] https://meta.wikimedia.org/wiki/Special:CentralNotice ? [15:30:36] yeah it's that it [15:30:43] (03CR) 10Dzahn: "this is a planned revert to go back to normal state.. but only after mwmaint1001 is actually turned into role(spare)!" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn) [15:30:44] https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=Sept2018Maintenance [15:31:08] bblack: ?
[15:31:19] I'm guessing it's disabled in the sense that all projects are de-selected? [15:31:26] the enabled flag is off [15:31:27] https://meta.wikimedia.org/wiki/Special:CentralNoticeLogs [15:31:29] but the timer window for it still runs another ~30 mins to 16:00 [15:31:35] 10 October 2018 15:30 Vogone (talk) modified Sept2018Maintenance 15 UTC is in the past, already [15:31:35] Enabled: Changed from on to off [15:32:06] (03CR) 10Dzahn: [C: 04-1] "hold on.. need to amend" [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:32:41] the on/off times were also edited when it was enabled in the previous change (edited to same values, according to log) [15:32:46] maybe they have to be null-edited again? [15:33:31] (03CR) 10Filippo Giunchedi: [C: 031] Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:33:32] I have no idea. I don't have access to the CN banner interface, so I run these things through Seddon, who isn't responding and might have gone off his shift once his part was done. [15:33:51] I don't think I have any special access [15:33:57] neither do I [15:33:58] bblack: can I help? note there is some cache delay [15:33:59] Do we need someone with access? [15:34:07] (03CR) 10Filippo Giunchedi: [C: 031] Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:34:20] AndyRussG: Yes please. 
The CN maintenance banner is still up and, well, it shouldn't be [15:34:28] (03CR) 10Filippo Giunchedi: [C: 031] Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:34:31] akosiaris: ok one sec [15:34:36] https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=Sept2018Maintenance [15:34:36] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Cmjohnson) Reseated the disk....let's see what happens [15:34:58] (03CR) 10Filippo Giunchedi: [C: 031] Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:35:02] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10Cmjohnson) @bblack is there any action item for me? [15:35:25] fwiw I'm not seeing the banner [15:35:37] Me neither. [15:35:39] I am [15:35:42] sjoerddebruin, do you see the banner? [15:35:43] !log uploaded jenkins 2.138.2 security release to apt.wikimedia.org (jessie/stretch) (T206234) [15:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:46] I no longer am as well [15:35:49] thcipriani: ^ [15:35:50] oh, it just went away on a fresh reload for me [15:35:58] It's gone now [15:36:00] 10Operations, 10MediaWiki-General-or-Unknown, 10Wikidata, 10wikidata-tech-focus, and 2 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) [15:36:02] OK, good. [15:36:05] moritzm: thank you!
[15:36:10] okay so [15:36:11] I have no idea what just happened [15:36:12] probably AndyRussG fixed it :) [15:36:16] no [15:36:20] (03PS4) 10Dzahn: mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:36:36] When people complained about still seeing the banner [15:36:41] It was still enabled [15:36:44] bblack: akosiaris Krenair there's some caching [15:37:00] heh no I didn't do anything [15:37:05] but I'm not seeing the banner anymore [15:37:10] It got disabled by Vogone around the time that bblack found Special:CentralNotice [15:37:17] ah, ok [15:37:22] ah that explains it [15:37:22] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:37:34] OK, so might simply have been human error. [15:37:45] 10 October 2018 15:30 [15:38:07] It was set to end automatically at 16:00 [15:38:08] ok that's good to know [15:38:19] Just fyi btw, CN banner and campaign settings are cached with RL modules [15:38:24] but that's in the future in UTC [15:38:24] yeah the 2 entries in https://meta.wikimedia.org/wiki/Special:CentralNoticeLogs [15:38:27] (03PS5) 10Dzahn: mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:38:40] Vogone set it to off [15:38:43] ok [15:38:44] changes don't happen instantly [15:39:07] AndyRussG: yeah it was just reported to us (seems wrongly) it was set to off >50 mins ago [15:39:11] hence the panic mode [15:39:20] ah heh okok [15:39:36] anyway fixed now. 
Thanks for looking into it [15:39:42] glad there weren't worse things to worry about :) [15:39:44] right and RL caches for ~5 mins max, IIRC [15:39:52] sometimes less, depending on your timing in the cycle [15:40:03] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Smalyshev) p:05Triage>03High [15:40:07] yeah something like that... It was once 10 min, maybe it's 5 min now [15:40:23] (03CR) 10Dzahn: [C: 031] "now it's ok -> https://puppet-compiler.wmflabs.org/compiler1002/12856/" [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:40:30] maybe it is 10, I haven't looked in a while [15:40:30] akosiaris: ^ this should be good now [15:40:40] 10Operations, 10Maps: Switch to unix socket connections for osmupdater / osmimporter for postgresql on maps - https://phabricator.wikimedia.org/T206639 (10Gehel) [15:41:41] just looked, 5 minutes [15:41:53] (03CR) 10Alexandros Kosiaris: [C: 031] mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:41:56] mutante: indeed! Thanks. 
+1ed [15:42:07] (03CR) 10Dzahn: [C: 032] mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:42:08] all the resourceloader outputs I looked at, they emit: "cache-control: public, max-age=300, s-maxage=300" [15:42:25] (03PS6) 10Dzahn: mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:42:43] 10Operations, 10MediaWiki-General-or-Unknown, 10Wikidata, 10wikidata-tech-focus, and 2 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) >>! In T205865#4639922, @hoo wrote: > It indeed is the lockmanager. I ran t... [15:43:50] bblack: ah okok good to know :) [15:46:11] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4654860, @Krenair wrote: > > Presumably whoever would be responsible for purchasing a renewal has to consider this. It's one... [15:46:43] mwmaint1001 has the "this is not the active server" warning now (and 2001 as well) [15:46:55] cool [15:46:56] thanks! [15:48:35] you're welcome. i still have this planned revert for the special case, but it has to wait a few days until we apply role(spare) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465645/ [15:49:13] and the change above should not affect it and can stay forever [15:49:14] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10BBlack) The kicker probably wouldn't be the monetary cost. It would be that if you didn't require EV, you could auto-issue certs from LetsEncrypt an... 
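The `cache-control: public, max-age=300, s-maxage=300` output quoted above is what caps banner propagation at roughly five minutes. A quick sketch for extracting that value from a Cache-Control header string (the helper and its parsing approach are my own, just for illustration):

```shell
#!/bin/sh
# max_age: print the max-age directive (in seconds) from a
# Cache-Control header value.
max_age() {
    # Split directives onto their own lines, then keep only the digits
    # after "max-age="; "s-maxage" does not match the hyphenated form.
    printf '%s\n' "$1" | tr ',' '\n' \
        | sed -n 's/.*max-age=\([0-9]*\).*/\1/p' | head -1
}

max_age 'public, max-age=300, s-maxage=300'   # → 300
```

Against a live ResourceLoader URL you would take the header from `curl -sI` output first.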
[15:49:50] mutante: now that it's no longer a maint server, can you rename it back to mw1297? then we can re-add it as mw server via https://phabricator.wikimedia.org/T192457 [15:50:12] RECOVERY - DPKG on contint2001 is OK: All packages OK [15:50:33] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10greg) [15:50:39] moritzm: yes, though for now it still uses the mw_maintenance role [15:50:52] after that is gone, i will reinstall it [15:51:29] i will take that ticket to remind me [15:51:43] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) a:03Dzahn [15:52:52] mutante: It's fine to simply rename back to mw1297 and simply use role::spare for now, all the other former image scalers are also spares for now [15:53:30] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) mwmaint1001 should be reinstalled as mw1297 and go back into the pool. but this is after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461492/ (and https://gerrit.wikimedia.org/r/... [15:54:21] moritzm: ok! 
i was just under the impression akosiaris would like me to wait a few days in case something is wrong with mwmaint1002 [15:54:32] (before removing the role from 1001 that is) [15:55:06] yes I would, but not for that reason [15:55:08] !log scheduled downtime for host cloudvirt1019 swap raid card T196507 [15:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:11] T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 [15:55:13] but rather people learning about the switch [15:55:25] ah, ok [15:55:38] I don't feel too strongly about it though [15:56:26] maybe tomorrow then [15:58:18] fine with me to wait a few days, but an inaccessible mwmaint1001 (as role::spare only allows SREs to login) will make them notice as well :-) [15:59:25] true. And they'll ask for info which they could have gotten anyway. [15:59:52] !log Uploaded certcentral 0.1 to apt.wikimedia.org (stretch) - T199711 [15:59:53] anyway I honestly don't feel too strongly about it. Feel free to kill the server now [15:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:56] T199711: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 [15:59:58] Krenair: ^^ done [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:11] ok [16:00:42] vgutierrez, cool. puppet tomorrow?
[16:00:47] ACKNOWLEDGEMENT - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T196477#4652673 [16:01:12] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:01:18] (03CR) 10Vgutierrez: [C: 031] "Certcentral 0.1 package uploaded to apt.wm.o, this looks good to me, but of course reviews from more seasoned puppet reviewers are very we" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:02:23] Krenair: yup [16:04:03] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2018), 10User-Johan: Lessons learned - https://phabricator.wikimedia.org/T206649 (10Johan) p:05Triage>03Normal [16:04:07] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Andrew) @Cmjohnson, is there a card for 1024 as well and you're waiting to hear whether 1023 is a success? [16:04:29] moritzm: are you planning to make another install attempt on cloudvirt1023 or is that in my court now? [16:07:01] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2018), 10User-Johan: Lessons learned: Communicating the server switch 2018 - https://phabricator.wikimedia.org/T206649 (10Johan) [16:07:02] (03PS1) 10Gehel: wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) [16:07:33] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) >>! In T199125#4655279, @Andrew wrote: > @Cmjohnson, is there a card for 1024 as well and you're waiting to hear whether 102... 
[16:07:55] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8 - OK: 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206651 [16:07:58] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T206651 (10ops-monitoring-bot) [16:09:59] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Smalyshev) I think update lag is not the biggest issue. Endpoint availability and respons... [16:10:11] (03PS2) 10Jcrespo: site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - 10https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) [16:12:53] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:13:13] RECOVERY - Check health of redis instance on 6378 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 1 hours 51 minutes - replication_delay is 5 [16:13:32] andrewbogott: this needs additional work to make it work with jessie; an installer image based on 4.9 for jessie. 
I talked to Arturo and I'll write up the steps and he'll create a netboot image with that [16:13:32] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 1 hours 52 minutes - replication_delay is 2 [16:14:02] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [16:14:23] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 1 hours 52 minutes - replication_delay is 2 [16:15:43] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 1 hours 54 minutes - replication_delay is 10 [16:16:02] jessie ? [16:16:22] RECOVERY - Check health of redis instance on 6378 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705606 keys, up 99 days 14 hours - replication_delay is 3 [16:16:25] <_joe_> !log restart of now-unused jobqueue redises for stopping the alerts post-switchover [16:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:02] RECOVERY - Check health of redis instance on 6378 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 99 days 14 hours - replication_delay is 10 [16:17:03] RECOVERY - Check health of redis instance on 6478 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 99 days 14 hours - replication_delay is 2 [16:17:35] akosiaris: openstack stuff is migrated to jessie... 
[16:17:40] ok [16:18:02] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604373 keys, up 99 days 14 hours - replication_delay is 2 [16:18:02] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4705813 keys, up 99 days 14 hours - replication_delay is 5 [16:18:33] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705883 keys, up 99 days 14 hours - replication_delay is 1 [16:19:03] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 99 days 14 hours - replication_delay is 10 [16:19:13] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 2865 keys, up 99 days 13 hours - replication_delay is 6 [16:19:32] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3088 keys, up 99 days 14 hours - replication_delay is 7 [16:19:33] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2888 keys, up 99 days 14 hours - replication_delay is 7 [16:19:40] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4655176, @BBlack wrote: > The kicker probably wouldn't be the monetary cost. It would be that if you didn't require EV, you c... 
[16:20:12] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 2854 keys, up 99 days 14 hours - replication_delay is 2 [16:20:22] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3186 keys, up 99 days 14 hours - replication_delay is 3 [16:21:20] (03PS1) 10Ayounsi: Diffscan, don't scan the WMCS public range [puppet] - 10https://gerrit.wikimedia.org/r/465654 (https://phabricator.wikimedia.org/T206653) [16:22:56] (03PS1) 10Elukey: Revert "statistics::rsync::eventlogging: reduce retention for archive" [puppet] - 10https://gerrit.wikimedia.org/r/465655 [16:23:04] (03PS2) 10Elukey: Revert "statistics::rsync::eventlogging: reduce retention for archive" [puppet] - 10https://gerrit.wikimedia.org/r/465655 [16:23:42] (03CR) 10Elukey: [V: 032 C: 032] Revert "statistics::rsync::eventlogging: reduce retention for archive" [puppet] - 10https://gerrit.wikimedia.org/r/465655 (owner: 10Elukey) [16:23:49] moritzm: thanks — can you add that status to T199125 (or refer me to the related task?) [16:23:50] T199125: rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 [16:25:20] !log LDAP - added isaacj to wmf group (for SWAP access, existing shell user since recently) (T206631) (T205840) [16:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:25] T205840: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 [16:25:25] T206631: LDAP group access request for Isaac Johnson - https://phabricator.wikimedia.org/T206631 [16:27:48] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Dzahn) ^ We talked on IRC about SWAP access. Membership in the "wmf" LDAP group was missing. (T206631) So i added that and now all should work. P.S. The docs... 
[16:29:39] andrewbogott: I don't have a task yet, but will create one tomorrow and add you and Arturo as subscribers, OK? [16:30:06] sounds good — thanks again! We're at an offsite this week so won't be very responsive. [16:30:27] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) received the new raid controller and installed, updating the firmware now. Initially it is showing as failed raid [16:35:54] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10cwdent) @BBlack I have been exploring options and it sounds like the DNS TXT record challenge would allow us to issue certs without disturbing the hos... [16:37:17] (03CR) 10Ayounsi: [C: 032] Diffscan, don't scan the WMCS public range [puppet] - 10https://gerrit.wikimedia.org/r/465654 (https://phabricator.wikimedia.org/T206653) (owner: 10Ayounsi) [16:37:33] (03PS2) 10Ayounsi: Diffscan, don't scan the WMCS public range [puppet] - 10https://gerrit.wikimedia.org/r/465654 (https://phabricator.wikimedia.org/T206653) [16:44:24] (03PS2) 10Dzahn: site: turn mwmaint1001 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) [16:48:26] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) That is being set up for prod at the moment actually, but it relies on trusted servers SSHing to prod auth DNS machines. I'm not sure frack... 
[16:58:14] (03CR) 10Mathew.onipe: wdqs: increase throttling limits for internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [16:58:39] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) [16:58:57] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) [17:01:22] (03PS1) 10Alexandros Kosiaris: Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) [17:02:06] (03CR) 10jerkins-bot: [V: 04-1] Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [17:02:25] (03PS2) 10Alexandros Kosiaris: Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) [17:02:58] (03CR) 10Gehel: wdqs: increase throttling limits for internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [17:03:17] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [17:08:44] (03PS1) 10Cmjohnson: Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) [17:10:13] (03PS2) 10Cmjohnson: Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) [17:10:19] (03CR) 10Mathew.onipe: wdqs: increase throttling limits for 
internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [17:11:17] (03PS3) 10Cmjohnson: Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) [17:11:57] (03CR) 10Cmjohnson: [C: 032] Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) (owner: 10Cmjohnson) [17:12:07] (03PS1) 10Elukey: Add stat1007 to analytics-1-b [dns] - 10https://gerrit.wikimedia.org/r/465666 (https://phabricator.wikimedia.org/T203852) [17:12:14] (03CR) 10jerkins-bot: [V: 04-1] Add stat1007 to analytics-1-b [dns] - 10https://gerrit.wikimedia.org/r/465666 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey) [17:13:02] (03Abandoned) 10Elukey: Add stat1007 to analytics-1-b [dns] - 10https://gerrit.wikimedia.org/r/465666 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey) [17:17:46] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10greg) >>! In T191183#4653436, @thcipriani wrote: > This is probably something we should enforce somehow (jenkins? some tool to be created to upload?) before exposing this feature b... [17:21:32] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4655430, @Krenair wrote: > How complex is the payments site? Is it possible to do http challenges there? Off the top of my he... 
[17:23:52] 10Operations, 10procurement: eqiad: (5) elastic systems - https://phabricator.wikimedia.org/T206681 (10RobH) p:05Triage>03High [17:23:54] 10Operations, 10procurement: eqiad: (5) elastic systems - https://phabricator.wikimedia.org/T206681 (10RobH) [17:44:07] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) F/W updated and now I am getting new issues...missing several of the disks. I have to get another AHS report and send to HP....the saga continues [17:47:23] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) >>! In T204931#4655504, @Jgreen wrote: >>>! In T204931#4655430, @Krenair wrote: >> How complex is the payments site? Is it possible to do ht... [17:48:49] (03CR) 10Smalyshev: [C: 031] wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [17:49:34] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4655857, @Krenair wrote: > > Oh wow, okay - I was expecting you to say it was behind LVS or something but not that. Ha, well... [17:49:49] !log replace 10.195.0.0/25 with 10.195.0.0/24 in prefix-list fundraising-codfw4 on cr1/2-codfw - T206637 [17:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:54] T206637: icinga reports frbast2001.frack.eqiad.wmnet as host down - https://phabricator.wikimedia.org/T206637 [17:54:56] 10Operations, 10fundraising-tech-ops, 10netops: icinga reports frbast2001.frack.eqiad.wmnet as host down - https://phabricator.wikimedia.org/T206637 (10Jgreen) 05Open>03Resolved a:03Jgreen This is fixed. 
- fix nagios_nsca.conf in prod puppet for frbast2001's new IP - fix modules/network/data/data.yaml... [18:01:15] (03PS1) 10Gergő Tisza: Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) [18:05:22] !log otto@deploy1001 Started deploy [analytics/refinery@4e2d956]: Add accept header to webrequest logs - T170606 [18:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:25] T170606: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 [18:09:58] !log otto@deploy1001 Finished deploy [analytics/refinery@4e2d956]: Add accept header to webrequest logs - T170606 (duration: 04m 35s) [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:03] !log otto@deploy1001 Started deploy [analytics/refinery@28bbee8]: Add accept header to webrequest logs - T170606 [18:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:43] !log delete sessions to AS6805 on cr2-esams (left AMS-IX) [18:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:37] !log otto@deploy1001 Finished deploy [analytics/refinery@28bbee8]: Add accept header to webrequest logs - T170606 (duration: 10m 34s) [18:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:40] T170606: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 [18:21:33] 10Operations, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) [18:22:21] 10Operations, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) [18:23:23] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:23:43] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:23:43] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:23:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:02] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:24:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:12] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:24:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:32] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:24:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:24:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:33] we're aware and looking [18:24:42] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:24:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:53] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:26:02] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:26:07] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) [18:26:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:27:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:27:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:27:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:42] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:27:42] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [18:27:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:02] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:22] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:29:23] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:29:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:33:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:33:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:33:33] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:34:03] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:34:03] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [18:34:42] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:34:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:35:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:35:17] !log disable VC port 1/2 on asw2-c-eqiad:fpc3 (to fpc8) [18:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:27] (03CR) 10Mathew.onipe: [C: 031] wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [18:54:13] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:54:33] (03PS3) 10Dzahn: site: turn mwmaint1001 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) [18:55:02] RECOVERY - MegaRAID on db1067 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy [18:55:35] (03CR) 10Dzahn: [C: 032] site: turn mwmaint1001 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [19:03:46] (03PS1) 10Dzahn: scap/tcpircbot: remove mwmaint1001 from scap and allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/465681 (https://phabricator.wikimedia.org/T201343) [19:04:19] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [19:04:22] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [19:04:52] (03PS2) 10Dzahn: scap/tcpircbot: remove mwmaint1001 from scap and allowed hosts 
[puppet] - 10https://gerrit.wikimedia.org/r/465681 (https://phabricator.wikimedia.org/T201343) [19:05:22] 10Operations, 10netops: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10ayounsi) There are 2 parallel issues here. 1/ IPv6 neighbor discovery randomly broken when igmp-snooping is enabled. This has been worked-around by disabling igmp-snooping yesterday T201039... [19:05:44] (03CR) 10Dzahn: [C: 032] "it's not a mw maintenance server anymore now" [puppet] - 10https://gerrit.wikimedia.org/r/465681 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [19:09:58] If no one else is deploying anything (nothing is on the calendar), I'd like to run a scap sync to rebuild the i18n cache. [19:10:35] as there is an old message stuck in the cache for some reason. (I've verified it's updated on the deployment server.) [19:11:40] kaldari: should be good [19:11:45] thanks [19:11:53] !log scap sync to rebuild i18n cache [19:11:54] i just removed mwmaint1001 from scap "dsh" [19:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:01] it has been replaced by mwmaint1002 [19:12:20] hopefully you dont see any warnings about that [19:13:36] !log kaldari@deploy1001 Started scap: (no justification provided) [19:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:57] actually.. now that happened [19:15:38] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sde1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [19:15:41] if you do see something about mwmaint1001 timing out then it's because it doesn't use the mw_maintenance role anymore .. 
but it should be gone next time for sure [19:15:58] thanks mutante [19:17:31] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10CCogdill_WMF) @Jgreen @BBlack thanks for bumping this and continuing to check. It does look like the SSL rating has bumped up to an A: https://www.ss... [19:22:18] (03PS1) 10Dzahn: mariadb: remove mwmaint1001 from prod-m5 SQL grants [puppet] - 10https://gerrit.wikimedia.org/r/465685 (https://phabricator.wikimedia.org/T201343) [19:25:58] (03PS1) 10Dzahn: network::constants: remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465686 (https://phabricator.wikimedia.org/T201343) [19:26:54] (03PS2) 10Dzahn: Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645 [19:28:18] (03PS2) 10Gergő Tisza: Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) [19:30:00] (03PS1) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 [19:30:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (owner: 10Dzahn) [19:30:19] (03CR) 10Dzahn: "planned revert after it has done its job as a temp replacement" [dns] - 10https://gerrit.wikimedia.org/r/465689 (owner: 10Dzahn) [19:31:48] PROBLEM - HHVM rendering on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:15] (03PS2) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) [19:32:23] (03CR) 10jerkins-bot: [V: 04-1] Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [19:33:53] i see scap 
activity on deploy2001 [19:33:57] RECOVERY - HHVM rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 75710 bytes in 1.521 second response time [19:34:02] ah :) [19:34:10] we are back in eqiad though [19:35:41] !log kaldari@deploy1001 Finished scap: (no justification provided) (duration: 22m 05s) [19:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:01] !log restarted ORES celery workers on ores2003 (~17:00), ores200* (17:05) [19:40:14] awight: Failed to log message to wiki. Somebody should check the error logs. [19:41:43] you don't say [19:42:21] haha [19:42:39] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10BBlack) SOA Serial values only have meaning to the administrators of a zone, and to servers with which they authorize legacy zone transfers. The registrar is nei... [19:43:08] awight: i wonder if it is related to ~, ( or * characters in the message [19:43:55] !log awight restarted ORES celery workers on ores2003 (~17:00), ores200* (17:05) [19:43:56] it could be because of a space between !log [19:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:01] heh [19:44:16] what did you do? [19:44:34] nothing, i first repeated the message to confirm it [19:44:37] /o\ [19:45:05] stashbot: why do you dislike awight ? [19:45:05] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [19:45:21] I'll try not to take it personally! 
[19:46:09] try it again, just !log test or so [19:49:15] (03PS3) 10Cwhite: icinga: enable icinga service on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:50:26] (03PS1) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [19:51:14] (03CR) 10jerkins-bot: [V: 04-1] Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [19:51:33] (03Abandoned) 10Mforns: Add druid_load.pp to refinery jobs [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [19:54:13] (03CR) 10Dzahn: [C: 032] icinga: enable icinga service on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:58:21] !log icinga - enabled icinga service on icinga1001 (stretch), but all notifications are disabled [19:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:22] volans hi, https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/icinga/files/raid_handler.py#L155 is using deprecated phabricator api. 
It should be using "search" not "query" [20:01:42] 10Operations, 10Phabricator: raid_handler.py using deprecated conduit api - https://phabricator.wikimedia.org/T206697 (10Paladox) [20:02:40] paladox: ack, thanks, we already have T159045 btw, I should find the time to update it [20:02:40] T159045: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 [20:02:57] oh i forgot about that heh [20:03:27] 10Operations, 10Operations-Software-Development, 10Phabricator, 10Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 (10Paladox) [20:03:31] 10Operations, 10Phabricator: raid_handler.py using deprecated conduit api - https://phabricator.wikimedia.org/T206697 (10Paladox) [20:03:54] 10Operations, 10Operations-Software-Development, 10Phabricator, 10Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 (10Paladox) Need to also migrate from "project.query" to "project.search" [20:08:13] (03PS1) 10Dzahn: base/nrpe: add icinga1001 to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465697 (https://phabricator.wikimedia.org/T202782) [20:09:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:11:30] (03PS2) 10Dzahn: base/nrpe: add icinga1001 to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465697 (https://phabricator.wikimedia.org/T202782) [20:13:14] 10Operations, 10hardware-requests, 10monitoring: hardware request - replacement for tegmen (icinga2001) - https://phabricator.wikimedia.org/T206563 (10faidon) 05Open>03declined We generally keep our servers for 1-2 more years past their warranty expiration (this puts 
their lifetime at ~5 years, rather th... [20:13:17] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10faidon) [20:15:11] (03CR) 10Dzahn: [C: 032] base/nrpe: add icinga1001 to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465697 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:17:16] !log upgrading releases-jenkins jenkins install on releases2001 [20:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:17] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) {F26497155} cloudstore1008 appears to be stuck here. That's quite interesting, since it seems to have the... [20:19:35] (03CR) 10Cwhite: [C: 031] debian: ship systemd service [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [20:19:43] !log upgrading releases-jenkins jenkins install on releases1001 [20:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 59 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:32:20] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Paladox) We won't be able to use git lfs on cobalt as it is using jessie whereas the git-lfs package is in stretch-backports+. We could enforce it so users can only upload there i... 
[20:32:29] 10Operations, 10MediaWiki-extensions-CodeReview, 10Wikimedia-production-error: Exec error "Possibly missing executable file: svn diff" from Special:Code - https://phabricator.wikimedia.org/T204801 (10Krinkle) p:05Normal>03Low
[20:34:13] !log upgrading ci jenkins install on contint1001
[20:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:40] (03CR) 10Cwhite: [C: 031] Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:44:13] (03CR) 10Cwhite: [C: 031] Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:45:28] (03CR) 10Cwhite: [C: 031] Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:45:57] (03CR) 10Cwhite: [C: 031] Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:47:37] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid]
[20:51:40] RoanKattouw: Is https://phabricator.wikimedia.org/T204291 patch okay to backport?
[20:52:38] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[20:53:02] Krinkle: Yes, feel free
[20:56:57] RoanKattouw: k, will do :)
[20:58:43] (03CR) 10Ottomata: [C: 031] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns)
[21:00:59] 10Operations, 10MediaWiki-File-management, 10MediaWiki-Uploading, 10Multimedia, and 4 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 (10Krinkle)
[21:13:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:20:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 74 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:25:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:32:51] 10Operations, 10netops: Enable access from icinga1001 to mgmt interfaces - https://phabricator.wikimedia.org/T206704 (10colewhite)
[21:32:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 42 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:33:29] 10Operations, 10Operations-Software-Development, 10Phabricator, 10Technical-Debt: Update Puppet repo code that uses deprecated maniphest.update/.createtask/.query Conduit API - https://phabricator.wikimedia.org/T159045 (10Aklapper)
[21:42:09] * Krinkle staging on mwdebug2001
[21:42:36] Krinkle: we're on eqiad now ;)
[21:42:41] codfw db are RO
[21:42:42] right
[21:42:44] * Krinkle staging on mwdebug1001
[21:42:57] :)
[21:43:00] :)
[21:45:41] !log Add icinga1001 to mr* security policies - T206704
[21:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:45] T206704: Enable access from icinga1001 to mgmt interfaces - https://phabricator.wikimedia.org/T206704
[21:48:50] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/Echo/includes/DiscussionParser.php: T204291 - Ia5323b401b94 (duration: 00m 51s)
[21:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:53] T204291: Fatal error "request has exceeded memory limit" from Echo DiscussionParser - https://phabricator.wikimedia.org/T204291
[21:49:54] XioNoX: wow, thank you !
[21:50:06] i was about to make a ticket for that
[21:50:55] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/ContentTranslation/specials/SpecialContentTranslation.php: T205433 - Ib34b28c5bb114c (duration: 00m 49s)
[21:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:05] T205433: Exception "User account is not global" on Special:ContentTranslation with lang/target params - https://phabricator.wikimedia.org/T205433
[21:51:55] mutante: https://phabricator.wikimedia.org/T206704 :)
[21:52:07] i just saw:) yay
[21:54:27] it's working. our new icinga can now talk to mgmt interfaces
[21:54:43] at least the counter is going down
[21:55:03] mutante: should be pushed everywhere now
[21:55:49] ok, thanks, it will take a little while until it catches up but it's happening :)
[21:58:29] 10Operations, 10netops: Enable access from icinga1001 to mgmt interfaces - https://phabricator.wikimedia.org/T206704 (10ayounsi) 05Open>03Resolved a:03ayounsi Management firewall policies updates.
[21:59:49] (03CR) 10Gehel: "This looks reasonable to me. But let's see get more feedback before moving forward with a fleet wide change" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[22:05:56] (03CR) 10Gehel: [C: 031] "LGTM, puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/12857/" [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[22:10:01] (03CR) 10Gehel: "Definitely more compact, not sure it is more readable (at least not to me)." [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans)
[22:10:03] (03CR) 10Volans: "my 2 cents inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[22:10:06] (03CR) 10Krinkle: [C: 032] profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) (owner: 10Krinkle)
[22:11:16] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 3 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) >>! In T205865#4655143, @Addshore wrote: > Reversing this experiment now that we have switched b...
[22:13:08] (03Merged) 10jenkins-bot: profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) (owner: 10Krinkle)
[22:14:46] * Krinkle staging on mwdebug1001
[22:16:14] !log krinkle@deploy1001 Synchronized wmf-config/arclamp.php: T206092 - If607ad111a (duration: 00m 48s)
[22:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:17] T206092: profiler.php sometimes emits RedisException "read error on connection" during request shutdown - https://phabricator.wikimedia.org/T206092
[22:16:55] (03CR) 10jenkins-bot: profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) (owner: 10Krinkle)
[22:20:37] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10greg) >>! In T191183#4656413, @Paladox wrote: > We won't be able to use git lfs on cobalt as it is using jessie whereas the git-lfs package is in stretch-backports+. That's simply...
[22:25:37] !log icinga1001 - chmod 2710 /var/lib/icinga/rw
[22:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:35:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 39 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:38:54] Amir1: rolling out now
[22:39:04] on it
[22:39:37] staged on mwdebug1001 for sanity check but presumably nothing to verify
[22:39:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:41:09] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/ORES/includes/FetchScoreJob.php: T204753 - Icc28230585bc (duration: 00m 49s)
[22:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:12] T204753: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753
[22:43:36] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/specials/SpecialDeletedContributions.php: T187619 - Ic6b0d8020553 (duration: 00m 48s)
[22:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:39] T187619: Extra trailing spaces after IP in Special:DeletedContributions trigger MediaWiki internal error - https://phabricator.wikimedia.org/T187619
[22:48:37] PROBLEM - DPKG on ores1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:51:32] halfak: ^
[22:51:48] RECOVERY - DPKG on ores1001 is OK: All packages OK
[22:52:22] Krinkle: it's still happening
[22:56:38] Amir1: Yeah, JobExecutor still shows "Failed ORESFetchScoreJob"
[22:56:45] but I assume that's not surprising because the patch returned false.
[22:56:55] Which means mark as failed, and (if allowed) let it retry later.
[22:57:05] But it no longer has an error attached, and no exception channel message
[22:57:12] right?
[22:59:01] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/MultimediaViewer/: T206099 - I53dbce0a (duration: 00m 49s)
[22:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:05] T206099: MultimediaViewer should not use deprecated jquery.hidpi module - https://phabricator.wikimedia.org/T206099
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:08:05] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/maintenance/resources/foreign-resources.yaml: Ic865e7077d (duration: 00m 49s)
[23:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:44] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) UX-wise a single central place for profile images is obviously preferable, so using Phabricator makes a lot of sense. (Having some way to store a profile image in your Wikimed...
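[editor's note] The retry semantics discussed at 22:56 — a job whose run() returns false is marked failed and, if the queue allows it, re-enqueued for a later attempt, without logging an exception — can be sketched as a generic model. This is a hypothetical illustration of that queue behavior, not MediaWiki's actual JobQueue or the ORES FetchScoreJob code; the function and parameter names are invented for the example.

```python
from collections import deque

def run_queue(jobs, max_retries=3):
    """Hypothetical job-queue model: a job returning False is marked
    failed and retried later, up to max_retries attempts; no exception
    is raised or logged for a plain False return."""
    queue = deque((job, 0) for job in jobs)  # (callable, attempts so far)
    results = []
    while queue:
        job, attempts = queue.popleft()
        if job():
            results.append(("done", attempts))
        elif attempts + 1 < max_retries:
            queue.append((job, attempts + 1))  # failed: re-enqueue quietly
        else:
            results.append(("failed", attempts + 1))  # retries exhausted
    return results
```

Under this model, a job that fails transiently (e.g. a backend timeout) eventually completes on a retry, while a persistently failing job surfaces only as a "failed" status rather than an exception-channel message, matching what Krinkle describes above.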
[23:25:11] 10Operations, 10ORES, 10Scap, 10Epic, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (10awight)
[23:25:19] 10Operations, 10ORES, 10Scap, 10Epic, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (10awight)
[23:30:59] Krinkle: oh good
[23:44:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1