[00:09:58] * Krinkle staging on mwdebug2001 [00:14:24] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/tests/phpunit/includes/utils/: T94522 - I2a0c51bea58 (duration: 01m 02s) [00:14:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:14:28] T94522: Some requests fail with UIDGenerator error "Process clock is outdated or drifted" - https://phabricator.wikimedia.org/T94522 [00:15:48] !log krinkle@deploy1001 sync-file aborted: T205567 - I75f1eb6dc2cb (duration: 00m 01s) [00:15:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:15:51] T205567: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 [00:15:57] PROBLEM - HHVM rendering on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:16:17] PROBLEM - Apache HTTP on mwdebug2001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [00:17:06] RECOVERY - HHVM rendering on mwdebug2001 is OK: HTTP OK: HTTP/1.1 200 OK - 75559 bytes in 8.703 second response time [00:17:17] RECOVERY - Apache HTTP on mwdebug2001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.129 second response time [00:19:43] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/utils/UIDGenerator.php: T94522 - I2a0c51bea58 (duration: 00m 56s) [00:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:46] T94522: Some requests fail with UIDGenerator error "Process clock is outdated or drifted" - https://phabricator.wikimedia.org/T94522 [00:24:37] PROBLEM - Router interfaces on cr2-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 57, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [00:25:16] PROBLEM - swift-object-auditor on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [00:25:57] PROBLEM - 
swift-object-server on ms-be2040 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [00:43:52] (03PS4) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [00:44:31] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [00:48:33] (03PS5) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [00:49:30] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [00:52:47] PROBLEM - puppet last run on mw1244 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:54:34] (03PS6) 10Herron: smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) [00:55:12] (03Abandoned) 10Imarlier: sitemaps: Generalize varnish rule for sitemaps, to apply to all domains [puppet] - 10https://gerrit.wikimedia.org/r/456169 (https://phabricator.wikimedia.org/T198965) (owner: 10Imarlier) [00:55:25] (03CR) 10jerkins-bot: [V: 04-1] smarthost: create mail smarthost role/profile [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [01:00:21] (03PS1) 10Imarlier: Add sitemaps rewrite for additional domains [puppet] - 10https://gerrit.wikimedia.org/r/465538 (https://phabricator.wikimedia.org/T206496) [01:01:26] RECOVERY - swift-object-server on ms-be2040 is OK: PROCS OK: 101 processes with regex args ^/usr/bin/python /usr/bin/swift-object-server [01:01:37] RECOVERY - swift-object-auditor on ms-be2040 is OK: PROCS 
OK: 3 processes with regex args ^/usr/bin/python /usr/bin/swift-object-auditor [01:02:07] RECOVERY - swift-object-replicator on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-replicator [01:02:17] RECOVERY - swift-object-updater on ms-be2040 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-object-updater [01:03:36] RECOVERY - Router interfaces on cr2-knams is OK: OK: host 91.198.174.246, interfaces up: 59, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [01:05:20] (03CR) 10Herron: smarthost: create mail smarthost role/profile (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/463143 (https://phabricator.wikimedia.org/T41785) (owner: 10Herron) [01:07:28] So some people at enwiki have been complaining that special page updates have been behind schedule for the last couple of weeks [01:07:53] given the timing I'm wondering if it's DC switchover related [01:18:07] RECOVERY - puppet last run on mw1244 is OK: OK: Puppet is currently enabled, last run 40 seconds ago with 0 failures [01:18:27] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 33796 MB (3% inode=99%) [02:10:01] (03PS4) 10Krinkle: profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) [02:33:17] RECOVERY - Memory correctable errors -EDAC- on wtp2011 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2011&var-datasource=codfw%2520prometheus%252Fops [03:14:21] Fun fact, running updateSpecialPages.php spends roughly 52 hours just on commons [03:26:46] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 787.54 seconds [03:57:46] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 268.13 seconds [05:05:12] 10Operations, 
10ops-eqiad, 10DBA: Degraded RAID on db1072 - https://phabricator.wikimedia.org/T206313 (10Marostegui) 05Open>03Resolved All good - thank you ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target Id: 0) Name :Virtual Disk 0 RAID Level : Primary-1, Secondary-0, RAID... [05:10:07] ACKNOWLEDGEMENT - Device not healthy -SMART- on db1073 is CRITICAL: cluster=mysql device=megaraid,3 instance=db1073:9100 job=node site=eqiad Marostegui T206254 - The acknowledgement expires at: 2018-11-07 05:09:43. https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db1073&var-datasource=eqiad%2520prometheus%252Fops [05:10:54] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1073 - https://phabricator.wikimedia.org/T206254 (10Marostegui) 05Open>03Resolved The RAID got rebuilt fine. The disk came with some errors, but let's ignore that rather than waste more disks; let's wait till it fails for real to replace it. ``` Number... [05:12:20] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review, 10User-Banyek: db1092 crashed - BBU broken - https://phabricator.wikimedia.org/T205514 (10Marostegui) Thanks for the update Chris - unbelievable! [05:25:07] PROBLEM - Router interfaces on cr1-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 37, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:25:47] PROBLEM - Router interfaces on cr2-codfw is CRITICAL: CRITICAL: host 208.80.153.193, interfaces up: 122, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [05:28:55] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) 05stalled>03Resolved Thanks Papaul! This looks good - we will take it from here! ``` Number of Virtual Disks: 1 Virtual Drive: 0 (Target... 
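The resolved RAID tasks above each close with a truncated MegaCli paste. A self-contained sketch of the kind of health check a script could run over such output — the heredoc below is a hypothetical stand-in for real controller output (roughly what `MegaCli -LDInfo -Lall -aALL` prints; exact fields and flags vary by controller and install):

```shell
# Hypothetical MegaCli virtual-drive output, modeled on the pastes in the
# tasks above; a real check would capture the command's output instead.
ldinfo=$(cat <<'EOF'
Number of Virtual Disks: 1
Virtual Drive: 0 (Target Id: 0)
Name                :Virtual Disk 0
RAID Level          : Primary-1, Secondary-0, RAID Level Qualifier-0
State               : Optimal
EOF
)

# Extract the State field; anything other than "Optimal" means the array
# is degraded or rebuilding.
state=$(printf '%s\n' "$ldinfo" | sed -n 's/^State[[:space:]]*: //p')

if [ "$state" = "Optimal" ]; then
    echo "RAID OK"
else
    echo "RAID degraded: $state"
fi
```

This is the same check the icinga "Degraded RAID" alerts boil down to: compare the reported state against the one healthy value.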
[05:29:09] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [05:31:00] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2096 (x1 codfw expansion host) - https://phabricator.wikimedia.org/T206191 (10Marostegui) [06:12:26] RECOVERY - Router interfaces on cr2-codfw is OK: OK: host 208.80.153.193, interfaces up: 124, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:12:46] RECOVERY - Router interfaces on cr1-eqord is OK: OK: host 208.80.154.198, interfaces up: 39, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [06:39:37] PROBLEM - Disk space on eventlog1002 is CRITICAL: DISK CRITICAL - free space: /srv 33646 MB (3% inode=99%) [06:55:55] (03PS3) 10Giuseppe Lavagetto: parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 [06:58:38] (03CR) 10Ema: [C: 031] parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 (owner: 10Giuseppe Lavagetto) [06:59:22] (03CR) 10Giuseppe Lavagetto: [C: 032] parsoid: remove unused template [puppet] - 10https://gerrit.wikimedia.org/r/463490 (owner: 10Giuseppe Lavagetto) [07:01:14] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) [07:04:21] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for Druid Prometheus exporter [puppet] - 10https://gerrit.wikimedia.org/r/465144 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [07:05:10] _joe_: ok to merge your patch along? 
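The "ok to merge your patch along?" exchange above is the shared puppet-merge step: a change submitted in Gerrit only takes effect once someone merges it into the puppetmaster's local checkout, and monitoring counts commits in HEAD..origin/production. A self-contained sketch of that count — repository layout, paths, and identities here are stand-ins, not the production setup:

```shell
set -e
tmp=$(mktemp -d)

# Stand-in for the canonical repo (the role Gerrit plays)
git init -q --bare "$tmp/origin.git"

# The puppetmaster's local checkout
git clone -q "$tmp/origin.git" "$tmp/puppetmaster"
( cd "$tmp/puppetmaster"
  git config user.email ops@example.org
  git config user.name ops
  git commit -q --allow-empty -m 'base'
  git push -q origin HEAD:production
)

# Someone submits a change: origin/production moves ahead of the checkout
git clone -q -b production "$tmp/origin.git" "$tmp/committer"
( cd "$tmp/committer"
  git config user.email dev@example.org
  git config user.name dev
  git commit -q --allow-empty -m 'some merged change'
  git push -q origin production
)

# The check behind the "Unmerged changes" alert: fetch, then count commits
# the local checkout has not merged yet
cd "$tmp/puppetmaster"
git fetch -q origin
behind=$(git rev-list --count HEAD..origin/production)
echo "There are $behind unmerged changes in puppet"
```

When `behind` drops back to 0 after a merge-and-puppet-run, the alert recovers.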
[07:14:01] (03CR) 10Mathew.onipe: "Jenkins dry run result seems good:" [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [07:15:16] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [07:18:03] <_joe_> moritzm: sorry, yes [07:18:26] <_joe_> I auto-nerd-sniped myself in searching for unreferenced files in our puppet tree [07:18:59] ok, now merged :-) [07:19:19] (03PS1) 10Elukey: Add interface::add_ip6_mapped to analytics-tool* hosts [puppet] - 10https://gerrit.wikimedia.org/r/465560 [07:19:37] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [07:20:17] (03CR) 10Elukey: [C: 032] Add interface::add_ip6_mapped to analytics-tool* hosts [puppet] - 10https://gerrit.wikimedia.org/r/465560 (owner: 10Elukey) [07:27:21] (03PS1) 10Elukey: profile::prometheus::alerts: raise warning for EL throughput alarm [puppet] - 10https://gerrit.wikimedia.org/r/465563 [07:28:18] (03CR) 10Elukey: [C: 032] profile::prometheus::alerts: raise warning for EL throughput alarm [puppet] - 10https://gerrit.wikimedia.org/r/465563 (owner: 10Elukey) [07:28:45] (03CR) 10Muehlenhoff: [C: 031] "Looks good to me!" 
[puppet] - 10https://gerrit.wikimedia.org/r/462480 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [07:35:30] (03PS2) 10Filippo Giunchedi: debian: ship systemd service [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 [07:35:30] (03PS2) 10Filippo Giunchedi: debian: use standard rules for Prometheus packages [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465351 [07:35:32] (03PS2) 10Filippo Giunchedi: debian: update changelog [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465352 [07:35:34] (03PS2) 10Filippo Giunchedi: debian: add patch for inline udp usage [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465414 (https://phabricator.wikimedia.org/T205870) [07:38:04] (03CR) 10Filippo Giunchedi: "Running under ded" (031 comment) [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [07:38:42] (03PS1) 10Elukey: Add IPv6 PTR records for analytics-tool* hosts [dns] - 10https://gerrit.wikimedia.org/r/465565 [07:40:00] (03CR) 10Elukey: [C: 032] Add IPv6 PTR records for analytics-tool* hosts [dns] - 10https://gerrit.wikimedia.org/r/465565 (owner: 10Elukey) [07:41:56] 10Operations, 10Discovery, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Search (Current work): Refactor 'use_git_deploy' in wdqs puppet module to cater for scap3 and autodeployment modes - https://phabricator.wikimedia.org/T206597 (10Mathew.onipe) [07:46:54] (03PS1) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) [07:50:57] RECOVERY - Disk space on eventlog1002 is OK: DISK OK [07:51:33] !log cleaned up some log files from eventlog1002 [07:51:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:51:46] we have a task about --^, should be fixed today [07:52:02] nice [07:53:38] (03CR) 
10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [07:54:46] (03CR) 10Gehel: [C: 04-1] "A few issues, see comments inline." (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [07:55:00] (03CR) 10Gilles: [C: 031] WIP: define haproxy service for thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465185 (https://phabricator.wikimedia.org/T187765) (owner: 10Filippo Giunchedi) [07:55:05] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) [08:00:44] (03PS1) 10Elukey: role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) [08:01:14] (03CR) 10Filippo Giunchedi: "> Patch Set 5:" [puppet] - 10https://gerrit.wikimedia.org/r/464366 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:01:59] !log rolling out debdeploy 0.0.99.6 [08:02:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:03:53] (03PS3) 10ArielGlenn: fix misc dumps generation when some previous runs are missing [dumps] - 10https://gerrit.wikimedia.org/r/465415 (https://phabricator.wikimedia.org/T206306) [08:04:54] (03CR) 10Muehlenhoff: [C: 031] "Looks good" [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [08:05:26] (03CR) 10Filippo Giunchedi: [C: 031] "I've looped in WMCS folks too" [puppet] - 10https://gerrit.wikimedia.org/r/465456 (https://phabricator.wikimedia.org/T183454) (owner: 10Cwhite) [08:06:09] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet error on deployment-mwmaint01 - https://phabricator.wikimedia.org/T206598 (10Krenair) 
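The eventlog1002 disk alert above was cleared by hand ("cleaned up some log files"), and the follow-up patch reduces the archive retention so it stops recurring. A minimal sketch of that style of time-based pruning — the directory, file names, and the 90-day cutoff are all invented for illustration:

```shell
set -e
archive=$(mktemp -d)

# Simulate an archive directory: one recent file, two past the cutoff
touch "$archive/events.log-recent.gz"
touch -d '120 days ago' "$archive/events.log-old.gz"
touch -d '200 days ago' "$archive/events.log-older.gz"

# Drop anything older than the retention window (GNU find)
find "$archive" -name '*.gz' -mtime +90 -delete

ls "$archive"
```

Run from cron or a systemd timer, a one-liner like the `find` above keeps /srv from filling up between manual cleanups.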
[08:06:18] (03CR) 10ArielGlenn: [C: 032] fix misc dumps generation when some previous runs are missing [dumps] - 10https://gerrit.wikimedia.org/r/465415 (https://phabricator.wikimedia.org/T206306) (owner: 10ArielGlenn) [08:07:40] !log ariel@deploy1001 Started deploy [dumps/dumps@0714a93]: fix adds/changes dumps generation when prev run is missing [08:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:07:46] !log ariel@deploy1001 Finished deploy [dumps/dumps@0714a93]: fix adds/changes dumps generation when prev run is missing (duration: 00m 06s) [08:07:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:14:31] 10Operations, 10Analytics, 10Analytics-Kanban, 10Traffic, and 2 others: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 (10JAllemandou) @Ottomata - After standup please, as today is kids-day for me :) [08:15:17] 10Puppet, 10Beta-Cluster-Infrastructure: Puppet error on deployment-mwmaint01 - https://phabricator.wikimedia.org/T206598 (10Krenair) [08:27:49] !log installing fuse security updates [08:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:28:04] (03PS1) 10Elukey: statistics::rsync::eventlogging: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) [08:33:20] (03CR) 10Giuseppe Lavagetto: [C: 031] Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/458807 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:40:06] (03Abandoned) 10Ema: Revert "traffic: Depool eqiad from user traffic for switchover" [dns] - 10https://gerrit.wikimedia.org/r/458807 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:48:11] (03CR) 10Filippo Giunchedi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/465164 (https://phabricator.wikimedia.org/T206454) (owner: 
10Filippo Giunchedi) [08:48:16] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) @Cmjohnson quick question (might be wrong since I am a n00b with Juniper): is stat1007 in the analytics VLAN? ``` elukey@asw2-b-... [08:50:25] (03Abandoned) 10Filippo Giunchedi: logstash: add ipv6 to elasticsearch [puppet] - 10https://gerrit.wikimedia.org/r/465164 (https://phabricator.wikimedia.org/T206454) (owner: 10Filippo Giunchedi) [08:56:47] (03CR) 10Giuseppe Lavagetto: [C: 031] Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:57:27] <_joe_> akosiaris: should we talk here, or is there a dedicated channel? [08:57:29] (03PS3) 10Ema: Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:57:51] _joe_: I 've avoided the dedicated channel just for this, let's do it here [08:58:04] <_joe_> ack [08:58:15] we can always ignore the bots if we end up having some serious talk [08:58:21] T-1m [08:59:05] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [08:59:26] starting with the inter-cache routing patch [08:59:32] ok [08:59:39] (03CR) 10Ema: [C: 032] Revert "traffic: route esams via codfw" [puppet] - 10https://gerrit.wikimedia.org/r/458809 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:00:04] Deploy window Datacenter switchback - Traffic (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T0900) [09:00:04] !log Traffic: route esams caches back to eqiad T203777 [09:00:07] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:08] T203777: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 [09:00:24] running puppet [09:00:39] <_joe_> on all caches in esams, correct? [09:00:48] correct [09:01:16] (03CR) 10Giuseppe Lavagetto: [C: 031] "we might decide to leave restbase a/a in the future, but for now let's just revert to the status quo" [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:02:17] puppet run on esams caches finished [09:03:25] <_joe_> network traffic is going up on cache_text and cache_upload in eqiad, which is an expected observable [09:04:25] (03CR) 10Marostegui: [C: 031] site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - 10https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [09:04:47] (03PS2) 10Ema: cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:04:57] manually rebased the patch above, there was a conflict ^ [09:05:25] <_joe_> ema: lemme look just to be sure [09:05:30] _joe_: yes please [09:06:16] (03CR) 10Alexandros Kosiaris: [C: 031] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:06:19] LGTM [09:06:42] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:07:07] ok to proceed with setting services active/active? 
[09:07:11] yes [09:07:24] (03CR) 10Ema: [C: 032] cache::text, cache::upload: Switch services to a/a [puppet] - 10https://gerrit.wikimedia.org/r/458804 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:07:24] <_joe_> +1 [09:07:47] !log Traffic: set services active/active T203777 [09:07:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:50] T203777: Successfully switch backend traffic (MediaWiki, Swift, RESTBase, Parsoid and services) to be served from eqiad - https://phabricator.wikimedia.org/T203777 [09:08:06] running puppet on text and upload caches in eqiad and codfw [09:08:58] (03PS2) 10Ema: cache::text: Switch restbase to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:09:32] <_joe_> ema: this is simple enough you don't need a check, right? [09:10:25] puppet run on all eqiad/codfw caches finished [09:10:34] (03CR) 10Alexandros Kosiaris: [C: 031] cache::text: Switch restbase to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:11:25] switching restbase to eqiad only now [09:11:59] (03CR) 10Ema: [C: 032] cache::text: Switch restbase to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458805 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [09:12:43] !log Traffic: move restbase back to eqiad T203777 [09:12:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:13:14] running puppet on text caches in eqiad/codfw [09:14:26] traffic switchback done [09:14:42] <_joe_> ema: cool! [09:14:55] <3 [09:15:48] nice! [09:16:05] congratulations! [09:16:14] (03CR) 10Gehel: "hieradata/role/common/wdqs.yaml should also be changed (and hieradata/role/common/wdqs/autodeploy.yaml once this patch is rebased)." 
(034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [09:16:16] \o/ \o/ [09:16:24] (03CR) 10Gehel: [C: 04-1] wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [09:16:50] I have a few puppet patches to merge, probably best to wait a bit until everything is settled ? [09:18:38] gehel: yes please. give it 10mins or os [09:18:40] so* [09:18:55] kool! no emergency! [09:19:07] I might take a lunch break by that time :) [09:19:39] <_joe_> gehel: remember we're going to switch back mediawiki at 4 pm your time [09:19:49] yep [09:20:55] (03Abandoned) 10Giuseppe Lavagetto: role::mediawiki::videoscaler: deduce parameters from number of cpus [puppet] - 10https://gerrit.wikimedia.org/r/345817 (owner: 10Giuseppe Lavagetto) [09:36:49] (03PS11) 10Mathew.onipe: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) [09:47:32] (03PS1) 10Muehlenhoff: Clean up removed rsyncd configs [puppet] - 10https://gerrit.wikimedia.org/r/465583 (https://phabricator.wikimedia.org/T205618) [09:47:34] (03PS1) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [09:48:37] (03CR) 10jerkins-bot: [V: 04-1] rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 (owner: 10Muehlenhoff) [09:56:22] (03PS2) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [09:58:08] (03CR) 10jerkins-bot: [V: 04-1] rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 (owner: 10Muehlenhoff) [10:00:08] (03PS3) 10Muehlenhoff: rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 [10:10:15] (03PS3) 10Muehlenhoff: Enable base::service_auto_restart for coal 
[puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) [10:19:40] (03CR) 10Muehlenhoff: [C: 032] Enable base::service_auto_restart for coal [puppet] - 10https://gerrit.wikimedia.org/r/465180 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [10:22:17] (03PS12) 10Gehel: wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [10:23:40] (03PS1) 10Muehlenhoff: Remove obsolete mediawiki-firejail-rsvg-convert [puppet] - 10https://gerrit.wikimedia.org/r/465590 [10:24:08] (03CR) 10Gehel: [C: 032] wdqs: auto deployment of wdqs on wdqs1009 [puppet] - 10https://gerrit.wikimedia.org/r/464659 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [10:24:55] (03CR) 10Marostegui: [C: 032] mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [10:27:24] (03Merged) 10jenkins-bot: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [10:28:07] (03PS1) 10Gehel: wdqs: fix path in autodeploy exec resources [puppet] - 10https://gerrit.wikimedia.org/r/465591 (https://phabricator.wikimedia.org/T197187) [10:29:08] (03CR) 10jenkins-bot: mariadb: Update dblists to move some wikis from s3 to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464164 (https://phabricator.wikimedia.org/T184805) (owner: 10Jcrespo) [10:30:07] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10LarsWirzenius) [10:30:38] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [10:31:13] ^ wdqs1009 is me, patch coming up [10:31:51] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10zeljkofilipin) [10:32:20] (03CR) 10Gehel: [C: 032] wdqs: fix path in autodeploy exec resources [puppet] - 10https://gerrit.wikimedia.org/r/465591 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [10:36:18] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10MoritzMuehlenhoff) Looking at existing group ownerships this means being added to the deployment, contint-admins, labnet-users and contint-docker gro... [10:36:48] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10zeljkofilipin) As far as I can see, that's `deployment` group in [[ https://phabricator.wikimedia.org/source/operations-puppet/browse/production/modu... [10:44:03] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for uwsgi-coal [puppet] - 10https://gerrit.wikimedia.org/r/465593 (https://phabricator.wikimedia.org/T135991) [10:44:28] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10zeljkofilipin) @MoritzMuehlenhoff correct, after further inspection of the file, looking for my groups, looks like that are the correct groups. 
[10:45:56] (03PS1) 10Gehel: wdqs: missing dependency for ordering of git::clone [puppet] - 10https://gerrit.wikimedia.org/r/465594 (https://phabricator.wikimedia.org/T197187) [10:48:23] !log marostegui@deploy1001 Synchronized dblists/s3.dblist: Update s3.dblist to reflect the wikis moved to s5 - T184805 (duration: 00m 58s) [10:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:48:26] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [10:49:34] !log marostegui@deploy1001 Synchronized dblists/s5.dblist: Update s5.dblist to reflect the wikis moved from s3 - T184805 (duration: 00m 56s) [10:49:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:54] (03CR) 10Mathew.onipe: [C: 031] wdqs: missing dependency for ordering of git::clone [puppet] - 10https://gerrit.wikimedia.org/r/465594 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [10:51:17] (03CR) 10Gehel: [C: 032] wdqs: missing dependency for ordering of git::clone [puppet] - 10https://gerrit.wikimedia.org/r/465594 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [10:54:29] !log Set a replication filter on db1075 (s3 eqiad) to ignore enwikivoyage, cebwiki, shwiki, srwiki & mgwiktionary - T184805 [10:54:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:54:31] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [10:55:19] (03PS1) 10Muehlenhoff: Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) [10:55:58] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor My software never has bugs. It just develops random features. Rise for European Mid-day SWAT(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1100). [11:00:05] No GERRIT patches in the queue for this window AFAICS. [11:07:49] o/ [11:07:55] I'm around but patches are not :D [11:08:19] (03PS1) 10Gehel: wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) [11:09:04] (03CR) 10jerkins-bot: [V: 04-1] wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:09:46] (03PS2) 10Gehel: wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) [11:10:34] (03PS1) 10Muehlenhoff: Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) [11:10:43] (03PS1) 10Muehlenhoff: Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) [11:12:29] (03CR) 10Mathew.onipe: [C: 031] wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:13:10] (03CR) 10Gehel: [C: 032] wdqs: pull files from git-fat after initial checkout [puppet] - 10https://gerrit.wikimedia.org/r/465599 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:16:28] (03PS1) 10Urbanecm: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) [11:16:51] zeljkof, now one patch is around as well :D [11:16:57] (https://gerrit.wikimedia.org/r/465602) [11:17:11] Urbanecm: let me see... 
:) [11:17:26] (03PS2) 10Zfilipin: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:17:56] thank you :) [11:18:08] (03PS1) 10Gehel: wdqs: run git-fat commands from the package directory [puppet] - 10https://gerrit.wikimedia.org/r/465603 (https://phabricator.wikimedia.org/T197187) [11:18:19] PROBLEM - puppet last run on wdqs1009 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[wdqs_git_fat_pull] [11:18:36] Urbanecm: ok, on it [11:19:40] (03CR) 10Mathew.onipe: [C: 031] wdqs: run git-fat commands from the package directory [puppet] - 10https://gerrit.wikimedia.org/r/465603 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:19:50] (03CR) 10Gehel: [C: 032] wdqs: run git-fat commands from the package directory [puppet] - 10https://gerrit.wikimedia.org/r/465603 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:21:34] (03CR) 10Zfilipin: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:23:54] (03Merged) 10jenkins-bot: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:24:28] Urbanecm: it's at mwdebug2001 [11:24:34] ack [11:25:29] zeljkof, its working, please deploy it to whole universe [11:25:41] Urbanecm: ok [11:26:42] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:465602|Permissions changes on itwikibooks (T206447)]] (duration: 00m 57s) [11:26:43] (03PS4) 10Urbanecm: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) [11:26:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:26:45] T206447: 
changes to manage user group "confirmed" and "accountcreator" on it.wikibooks - https://phabricator.wikimedia.org/T206447 [11:26:52] (03PS5) 10Urbanecm: Initial configuration for liwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/463479 (https://phabricator.wikimedia.org/T205710) [11:27:01] Urbanecm: deployed! anything else? [11:27:04] all done? [11:28:04] herron: scap said this during deployment [11:28:22] `11:26:15 Check 'Check endpoints for mwdebug2002.codfw.wmnet' failed: /wiki/{title} (Main Page) is WARNING: Test Main Page responds with unexpected body: array(59) {` [11:28:25] ... [11:28:41] (it's a long warning, I can paste it if needed) [11:29:24] well, as of now, I have no open patches. Thank you!7 [11:29:38] !log EU SWAT finished [11:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:44] Urbanecm: see you next time! :D [11:29:51] :) [11:30:09] zeljkof: probably create a ticket (I got the same thing and I was planning to create a ticket, but I got busy with something else) [11:30:17] marostegui: will do! [11:30:21] cheers [11:31:59] (03CR) 10jenkins-bot: Permissions changes on itwikibooks [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465602 (https://phabricator.wikimedia.org/T206447) (owner: 10Urbanecm) [11:33:15] (03PS4) 10Muehlenhoff: mediawiki::web::prod_sites: convert wikisource.org [puppet] - 10https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:34:08] (03PS1) 10Mathew.onipe: wdqs: onlyif condition should exit with 1 for git fat to run [puppet] - 10https://gerrit.wikimedia.org/r/465605 (https://phabricator.wikimedia.org/T197187) [11:34:19] marostegui: T206620 [11:34:20] T206620: Check 'Check endpoints for mwdebug2002.codfw.wmnet' failed: /wiki/{title} (Main Page) is WARNING: Test Main Page responds with unexpected body - https://phabricator.wikimedia.org/T206620 [11:34:34] zeljkof: thank you! 
[11:35:49] (03CR) 10Gehel: [C: 032] wdqs: onlyif condition should exit with 1 for git fat to run [puppet] - 10https://gerrit.wikimedia.org/r/465605 (https://phabricator.wikimedia.org/T197187) (owner: 10Mathew.onipe) [11:37:10] (03CR) 10Giuseppe Lavagetto: [C: 031] "LGTM in terms of decommissioning, but please see the comment" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [11:41:10] (03PS2) 10Filippo Giunchedi: WIP: statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [11:41:12] (03PS1) 10Filippo Giunchedi: WIP: add statsd-exporter to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [11:41:26] !log renaming some s3 wiki tables on eqiad master to prevent split brain T184805 [11:41:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:30] T184805: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 [11:43:58] (03PS3) 10Filippo Giunchedi: WIP: statsd_exporter [puppet] - 10https://gerrit.wikimedia.org/r/465428 (https://phabricator.wikimedia.org/T205870) [11:43:58] (03PS2) 10Filippo Giunchedi: WIP: add statsd-exporter to thumbor [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) [11:47:03] (03PS1) 10Gehel: wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) [11:47:46] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1002/12850/thumbor1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/465608 (https://phabricator.wikimedia.org/T205870) (owner: 10Filippo Giunchedi) [11:48:18] (03CR) 10Mathew.onipe: [C: 031] wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) 
[11:49:46] (03PS2) 10Gehel: wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) [11:52:12] (03PS2) 10Elukey: Decommission conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) [11:55:22] (03CR) 10Mathew.onipe: [C: 031] wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:56:22] (03CR) 10Gehel: [C: 032] wdqs: correct condition for initializing git fat [puppet] - 10https://gerrit.wikimedia.org/r/465609 (https://phabricator.wikimedia.org/T197187) (owner: 10Gehel) [11:58:44] (03CR) 10Muehlenhoff: mediawiki::web::prod_sites: convert wikisource.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/462486 (https://phabricator.wikimedia.org/T196968) (owner: 10Giuseppe Lavagetto) [11:58:58] RECOVERY - puppet last run on wdqs1009 is OK: OK: Puppet is currently enabled, last run 27 seconds ago with 0 failures [12:01:38] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:02:26] (03CR) 10Giuseppe Lavagetto: [C: 032] "https://puppet-compiler.wmflabs.org/compiler1002/12853/" [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) (owner: 10Elukey) [12:02:46] (03PS3) 10Elukey: Decommission conf100[1-3] [puppet] - 10https://gerrit.wikimedia.org/r/465389 (https://phabricator.wikimedia.org/T205814) [12:05:49] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [12:15:45] (03PS1) 10Volans: Tests: refactor puppetdb tests with parametrize 
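[Editor's note: the git-fat patches above revolve around Puppet `Exec` guard semantics: an `onlyif` command must exit 0 for the exec to run, and exit non-zero to skip it, which is what the "onlyif condition should exit with 1" fix addresses. The sketch below emulates that rule in Python for illustration only.]

```python
# Sketch of Puppet's Exec onlyif semantics, relevant to the wdqs git-fat
# fixes above: the managed command runs only when the onlyif check exits 0.
import subprocess

def should_run(onlyif_cmd):
    # Mirror Puppet's behavior: exit status 0 means "go ahead".
    return subprocess.run(onlyif_cmd, shell=True).returncode == 0

print(should_run("true"))   # exec would run
print(should_run("false"))  # exec is skipped
```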
[software/cumin] - 10https://gerrit.wikimedia.org/r/465611 [12:15:47] (03PS1) 10Volans: PuppetDB: fix regex matching [software/cumin] - 10https://gerrit.wikimedia.org/r/465612 [12:18:53] <_joe_> !log decommissioning conf1001-1003: stopping etcd, nginx, and masking both [12:18:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:46] \o/ [12:38:35] PROBLEM - Nginx local proxy to apache on mw2217 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.151 second response time [12:39:44] RECOVERY - Nginx local proxy to apache on mw2217 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.223 second response time [12:40:19] 10Operations, 10ops-eqiad, 10decommission: Decommission conf100[1-3] - https://phabricator.wikimedia.org/T206626 (10elukey) p:05Triage>03Normal [12:40:34] 10Operations, 10Traffic, 10Patch-For-Review: puppetize http purging for ATS backends - https://phabricator.wikimedia.org/T204208 (10ema) 05Open>03Resolved a:03ema Done, `profile::trafficserver::backend` now installs and configures vhtcpd. [12:42:55] 10Operations, 10Patch-For-Review: Switch the main etcd cluster in eqiad to use conf1004-1006 - https://phabricator.wikimedia.org/T205814 (10elukey) Opened https://phabricator.wikimedia.org/T206626 to fully decom conf100[1-3] (not in service anymore and with role::spare::system). The last step is to switch bac... 
[12:45:19] (03CR) 10Muehlenhoff: wmcs: add prometheus-memcached-exporter (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/431595 (https://phabricator.wikimedia.org/T147326) (owner: 10Filippo Giunchedi) [12:51:22] (03CR) 10Muehlenhoff: Stop the diamond service when removing Diamond (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [12:53:28] (03PS3) 10Muehlenhoff: Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) [13:04:07] (03PS1) 10Jgreen: update icinga IP for frbast2001.frack.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/465620 [13:05:45] PROBLEM - Packet loss ratio for UDP on logstash1008 is CRITICAL: 0.1146 ge 0.1 https://grafana.wikimedia.org/dashboard/db/logstash [13:06:54] RECOVERY - Packet loss ratio for UDP on logstash1008 is OK: (C)0.1 ge (W)0.05 ge 0 https://grafana.wikimedia.org/dashboard/db/logstash [13:07:08] looks like it was memcache logs from mw btw [13:09:15] (03CR) 10Jgreen: [C: 032] update icinga IP for frbast2001.frack.codfw.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/465620 (owner: 10Jgreen) [13:11:05] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Optimize networking configuration for WDQS - https://phabricator.wikimedia.org/T206105 (10Gehel) With some trial an error, it looks like the `smp_affinity` = `00ff00ff` would allow the IRQ to be managed by any CP... [13:16:26] 10Operations, 10IRCecho, 10Patch-For-Review, 10User-fgiunchedi: ircecho / icinga-wm crashlooping - https://phabricator.wikimedia.org/T205522 (10fgiunchedi) With the latest patch in to log exceptions I think we're good to resolve this? 
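[Editor's note: T206105 above discusses setting the NIC IRQ `smp_affinity` mask to `00ff00ff` so interrupts can be serviced by more CPUs. The mask is a hex bitmap with one bit per CPU; the helper below decodes it, purely as an illustration of what that value means.]

```python
# Sketch: decode a /proc/irq/<n>/smp_affinity bitmask like the 00ff00ff
# value discussed for the WDQS NICs. Each set bit marks a CPU that is
# allowed to service the interrupt.
def cpus_from_mask(mask_hex):
    mask = int(mask_hex, 16)
    return [cpu for cpu in range(mask.bit_length()) if mask >> cpu & 1]

print(cpus_from_mask("00ff00ff"))  # CPUs 0-7 and 16-23
```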
[13:18:48] (03PS1) 10Gehel: wdqs: spread IRQ from NIC over multiple CPUs [puppet] - 10https://gerrit.wikimedia.org/r/465624 (https://phabricator.wikimedia.org/T206105) [13:37:29] 10Operations, 10fundraising-tech-ops, 10netops: Qualys scans causing problematic pfw logspam - https://phabricator.wikimedia.org/T206431 (10Jgreen) 05Open>03Resolved a:03Jgreen The underlying problem was that bellatrix was logging to the root partition rather than the /srv data partition as it should.... [13:39:14] (03CR) 10Ottomata: [C: 031] profile::prometheus::alerts: raise warning for EL throughput alarm [puppet] - 10https://gerrit.wikimedia.org/r/465563 (owner: 10Elukey) [13:39:46] (03CR) 10Ottomata: [C: 031] role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [13:40:39] (03PS2) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) [13:41:16] (03CR) 10Ottomata: "Fine with me! These should all be in Hadoop now anyway. 
I'd mostly use these for recovery from client-side raw logs if something goes wr" [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [13:41:45] (03CR) 10jerkins-bot: [V: 04-1] wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [13:42:01] (03CR) 10Ottomata: [C: 031] Clean up removed rsyncd configs [puppet] - 10https://gerrit.wikimedia.org/r/465583 (https://phabricator.wikimedia.org/T205618) (owner: 10Muehlenhoff) [13:42:28] (03CR) 10Ottomata: [C: 031] rsyncd: Remove xinetd remnants [puppet] - 10https://gerrit.wikimedia.org/r/465584 (owner: 10Muehlenhoff) [13:42:51] jouncebot: next [13:42:51] In 0 hour(s) and 17 minute(s): Datacenter switchback - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1400) [13:43:12] \o/ [13:43:35] marostegui: all good on dba-land for it? [13:43:44] volans: yeeep [13:43:55] great! 
thx [13:46:54] volans: we did downtime the new read only check on masters [13:47:06] as it would change asynchronouysly with puppet [13:48:13] ack [13:48:43] I think "all wikis are in read only" was a fair check to have under normal circumstances [13:50:30] I 'll rebase the required puppet changes [13:50:56] (03PS1) 10Elukey: Refactor type Systemd::Timer::DateTime to include more normal forms [puppet] - 10https://gerrit.wikimedia.org/r/465630 (https://phabricator.wikimedia.org/T172532) [13:51:49] FWIW the steps that will be done in read-only and right up to the point of enabling RW again will be pretty sequential without syncups between us [13:52:32] if there is any blocker or anomaly that deserve a pause shout [13:54:18] (03PS3) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) [13:54:30] here [13:55:09] <_joe_> time to sound general quarters akosiaris :P [13:55:57] (03PS2) 10Alexandros Kosiaris: cache::text: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458773 (https://phabricator.wikimedia.org/T203777) [13:56:44] done [13:56:47] lol [13:57:15] (03CR) 10Mathew.onipe: wdqs: refactor use_git_deploy to include scap3 and autodeploy options (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe) [13:57:41] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) EV certs do seem to have lost almost all their value. That said the cost difference over an OV cert is under $100. Also, I'm not sure whose d... 
[13:58:07] find the 7 differences [13:58:11] jouncebot: next [13:58:12] In 0 hour(s) and 1 minute(s): Datacenter switchback - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1400) [13:59:10] akosiaris: check the tmux command please [13:59:52] LGTM [14:00:04] Deploy window Datacenter switchback - MediaWiki (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1400) [14:00:05] it's T time [14:00:09] * marostegui ready [14:00:10] (03CR) 10Giuseppe Lavagetto: [C: 031] cache::text: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458773 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [14:00:14] ready to liftoff! [14:00:19] volans: you are good to go [14:00:27] akosiaris: ack, starting! [14:00:33] !log START - Cookbook sre.switchdc.mediawiki.00-disable-puppet (volans@neodymium) [14:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:37] the note about disabling cron on mwmaint? [14:00:41] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-disable-puppet (exit_code=0) (volans@neodymium) [14:00:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:46] nevermind, answered elsewhere! 
[14:01:04] !log START - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (volans@neodymium) [14:01:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:11] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-reduce-ttl (exit_code=0) (volans@neodymium) [14:01:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:22] "you are go at throttle up" [14:01:27] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium) [14:01:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:35] volans: I got the merge the traffic change [14:01:40] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) >>! In T204931#4648564, @Liuxinyu970226 wrote: > @krenair please, no more DV certs, that's the reason why jawiki, ugwiki, wuuwiki, zhwiki, z... [14:01:44] akosiaris: go ahead [14:01:49] puppet is disabled [14:01:55] (03CR) 10Alexandros Kosiaris: [C: 032] cache::text: switch mediawiki to eqiad [puppet] - 10https://gerrit.wikimedia.org/r/458773 (https://phabricator.wikimedia.org/T203777) (owner: 10Alexandros Kosiaris) [14:01:59] and we also have to be sure 5 minutes passed since the TTL lowering [14:02:09] before going RO [14:02:18] i thought spicerack waited on that? [14:02:21] <_joe_> no, we don't [14:02:25] was on purpose [14:02:28] <_joe_> mark: only for services [14:02:30] ok [14:02:32] to do the maintenance in the middle [14:02:34] <_joe_> where we don't have other operations to do [14:02:39] s/maintenance/warmup [14:02:42] ok merged [14:02:48] <_joe_> volans: you can proceed until we get out of the readonly [14:02:52] akosiaris: how many runs for the warmup? [14:02:56] 2 or 3? [14:02:58] 3 ? 
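[Editor's note: the "be sure 5 minutes passed since the TTL lowering before going RO" remark above reflects standard DNS cache behavior: resolvers may keep the old record for up to the previous TTL, so the earliest safe moment to rely on the lowered TTL is the lowering time plus the old TTL. The timestamp and 300-second value below are illustrative, taken from the log.]

```python
# Sketch of the TTL-wait constraint: after 00-reduce-ttl completes, wait
# out the old TTL before depending on the new record everywhere.
from datetime import datetime, timedelta

def earliest_safe_switch(ttl_lowered_at, old_ttl_seconds=300):
    # Resolvers can serve the cached record until the old TTL expires.
    return ttl_lowered_at + timedelta(seconds=old_ttl_seconds)

lowered = datetime(2018, 10, 10, 14, 1, 11)  # 00-reduce-ttl END, per the log
print(earliest_safe_switch(lowered))
```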
[14:03:04] ack [14:03:05] 3 sounds ok [14:03:08] cool [14:03:18] <_joe_> 3 is the magic number [14:03:19] note to self to actually document that [14:04:06] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10BBlack) >>! In T204931#4654860, @Krenair wrote: >>>! In T204931#4648564, @Liuxinyu970226 wrote: >> @krenair please, no more DV certs, that's the reas... [14:04:18] I can see rows being read on eqiad now, cool [14:04:30] *on eqiad DBs [14:05:03] (03PS4) 10Alexandros Kosiaris: db: Switch dns master alias to eqiad [dns] - 10https://gerrit.wikimedia.org/r/458790 (https://phabricator.wikimedia.org/T203777) [14:05:12] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [14:05:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:20] 2nd run [14:05:24] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium) [14:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:26] 10Operations, 10monitoring, 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) p:05Triage>03Normal [14:06:15] 10Operations, 10monitoring, 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) p:05Normal>03High [14:06:56] o/ [14:07:06] o/ [14:07:41] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [14:07:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:07:53] 3rd and last run of warmup [14:07:57] !log START - Cookbook sre.switchdc.mediawiki.00-warmup-caches (volans@neodymium) [14:07:57] ok [14:07:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:08:25] Hmm didn't it used to say at End how long 
it took? [14:08:26] 10Operations, 10Product-Analytics, 10monitoring, 10Discovery-Analysis (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) [14:08:31] Anyway, maybe for later [14:08:51] Krinkle: the warmup script output? [14:09:03] it's in the tmux, I can paste something if you need it [14:09:33] volans: for each of the steps logged yeah [14:09:40] No it's okay [14:10:02] the warmup would be skewed a bit as it ask for manual confirmation [14:10:03] !log END (PASS) - Cookbook sre.switchdc.mediawiki.00-warmup-caches (exit_code=0) (volans@neodymium) [14:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:12] akosiaris: all gerrit patches merged already? [14:10:17] yup [14:10:23] !log START - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (volans@neodymium) [14:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:26] 10Operations, 10monitoring, 10Discovery-Analysis (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) [14:10:37] !log END (PASS) - Cookbook sre.switchdc.mediawiki.01-stop-maintenance (exit_code=0) (volans@neodymium) [14:10:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:46] confirmed the crons just disappeared on mwmaint2001 [14:10:57] mutante: also any running process? [14:11:03] MWScript.php still running [14:11:04] MostlinkedPage::reallyDoQuery is running [14:11:14] on enwiki, meta, zh [14:11:14] ----- OUTPUT of '! pgrep -c php' ----- │··· [14:11:17] 0 [14:11:17] <_joe_> volans: there are quite a few scripts still running [14:11:21] how? 
[14:11:25] hhvm [14:11:30] <_joe_> yes [14:11:31] hhvm -vEval.Jit=1 /srv/mediawiki-staging/multiversion/MWScript.php extensions/Wikibase/repo/maintenance/dispatchChanges.php --wiki wikidatawiki --verbose [14:11:33] or at least, on db, not sure if on mw [14:11:34] volans: ack, ok [14:11:36] <_joe_> they were changed to run hhvm directly [14:11:36] I 'll kill them [14:11:38] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) [14:11:43] <_joe_> akosiaris: you do it? ok [14:11:58] ok gone [14:12:00] do you want me to kill the slow queries on db2*? [14:12:06] <_joe_> no [14:12:12] MostlinkedPage::reallyDoQuery ? [14:12:14] next step is going RO [14:12:22] <_joe_> jynus: it's up to you [14:12:25] 10Operations, 10monitoring, 10Discovery-Search (Current work), 10Patch-For-Review: Create an Icinga check to alert on packet dropped - https://phabricator.wikimedia.org/T206114 (10Mathew.onipe) a:03Mathew.onipe [14:12:32] they will fail when they try to write on codfw [14:12:37] we are clear of maintenance [14:12:41] we can decide later :-) [14:12:52] <_joe_> let's start the RO? [14:13:00] yes [14:13:03] +1 [14:13:07] volans: let's go in RO [14:13:12] warp 9 :D [14:13:18] I'll stop briefly before 06-set-db-readwrite, unless someone shouts before [14:13:19] (03PS1) 10Alex Monk: Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 [14:13:23] remember to stop right before the RW [14:13:28] ok [14:13:32] that, thanks! 
[14:13:34] ack starting [14:13:36] good [14:13:39] <_joe_> volans: If I have doubts I'll shout [14:13:40] (03PS1) 10Banyek: mariadb: depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465634 [14:13:45] !log START - Cookbook sre.switchdc.mediawiki.02-set-readonly (volans@neodymium) [14:13:46] !log MediaWiki read-only period starts at: 2018-10-10 14:13:46.068081 (volans@neodymium) [14:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:13:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:07] !log END (PASS) - Cookbook sre.switchdc.mediawiki.02-set-readonly (exit_code=0) (volans@neodymium) [14:14:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:12] !log START - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (volans@neodymium) [14:14:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:14] I cannot edit [14:14:29] codfw masters confirmed read only = ON [14:14:32] on mysql [14:14:32] cool [14:14:33] (03PS2) 10Alex Monk: Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 [14:14:39] !log END (PASS) - Cookbook sre.switchdc.mediawiki.03-set-db-readonly (exit_code=0) (volans@neodymium) [14:14:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:14:44] !log START - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (volans@neodymium) [14:14:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:03] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-mediawiki (exit_code=0) (volans@neodymium) [14:15:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:09] !log START - Cookbook sre.switchdc.mediawiki.04-switch-traffic (volans@neodymium) [14:15:10] <_joe_> WMFMasterDC changed correctly [14:15:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:18] puppet in flight for traffic [14:15:30] * volans hopes the diff will still match the regex [14:15:38] Hi. [14:15:39] ema: be ready to check the diff in case it doesn't [14:15:56] <_joe_> ShakespeareFan00: we're in the middle of the datacenter switchover, all sites are in read-only [14:15:58] ShakespeareFan00: We are on maintenance mode due to the DC failover [14:16:16] _joe_: ETA for completion of test? [14:16:27] no test and it's up to 1h [14:16:30] so far so good, message matched in one DC [14:16:34] running puppet on the other DC [14:16:46] ShakespeareFan00: questions to #wikimedia-tech please [14:16:59] (03PS2) 10Banyek: mariadb: depool db1092 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465634 [14:17:20] akosiaris: we're close to the pause/decision point for RW [14:17:20] I can see reads traffic increasing on MySQLs in eqiad [14:17:29] !log END (PASS) - Cookbook sre.switchdc.mediawiki.04-switch-traffic (exit_code=0) (volans@neodymium) [14:17:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:35] !log START - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (volans@neodymium) [14:17:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:38] !log END (PASS) - Cookbook sre.switchdc.mediawiki.05-invert-redis-sessions (exit_code=0) (volans@neodymium) [14:17:39] volans: yup [14:17:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:17:45] ready for RW [14:17:45] <_joe_> volans: all good here, go on! 
[14:17:45] (03PS1) 10Jgreen: Update data.yaml for frack-(administration|bastion)-codfw subnet changes [puppet] - 10https://gerrit.wikimedia.org/r/465635 (https://phabricator.wikimedia.org/T204271) [14:17:49] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:17:49] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:17:53] dammit [14:17:57] go or wait? [14:17:59] RECOVERY - Check health of redis instance on 6381 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 2865 keys, up 15 days 23 hours [14:18:00] RECOVERY - Check health of redis instance on 6381 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 19 days 23 hours [14:18:04] <_joe_> go. [14:18:09] RECOVERY - Check health of redis instance on 6378 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 19 days 23 hours [14:18:11] !log START - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (volans@neodymium) [14:18:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:13] !log END (PASS) - Cookbook sre.switchdc.mediawiki.06-set-db-readwrite (exit_code=0) (volans@neodymium) [14:18:13] yeah go [14:18:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:18] !log START - Cookbook sre.switchdc.mediawiki.07-set-readwrite (volans@neodymium) [14:18:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:19] RECOVERY - Check health of redis instance on 6378 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705606 keys, up 16 days 1 hours [14:18:20] it was like that last time too [14:18:24] yep [14:18:24] mysql read only off on eqiad MySQL masters 
[14:18:25] <_joe_> it's mobileapps everywhere [14:18:27] !log MediaWiki read-only period ends at: 2018-10-10 14:18:26.908958 (volans@neodymium) [14:18:27] !log END (PASS) - Cookbook sre.switchdc.mediawiki.07-set-readwrite (exit_code=0) (volans@neodymium) [14:18:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:29] PROBLEM - mobileapps endpoints health on scb1003 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:18:29] PROBLEM - mobileapps endpoints health on scb2006 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:29] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:30] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:18:30] RECOVERY - Check health of redis instance on 6380 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4705813 keys, up 16 days 1 hours [14:18:31] back in RW [14:18:31] RECOVERY - Check health of redis instance on 6380 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 19 days 23 hours [14:18:31] RECOVERY - Check health of redis instance on 6379 on rdb1001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 19 days 23 hours [14:18:38] confirmed on mysql level [14:18:39] RECOVERY - Check health of redis instance on 6379 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3186 keys, up 15 days 23 hours [14:18:44] (03CR) 10Jgreen: [C: 032] Update data.yaml for 
frack-(administration|bastion)-codfw subnet changes [puppet] - 10https://gerrit.wikimedia.org/r/465635 (https://phabricator.wikimedia.org/T204271) (owner: 10Jgreen) [14:18:45] * akosiaris ignoring icinga-wm for a few [14:18:47] edits happening on enwiki [14:18:49] checking others [14:18:50] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:18:58] <_joe_> yeah mcs is unrelated [14:18:59] RECOVERY - Check health of redis instance on 6378 on rdb1007 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4 keys, up 15 days 23 hours [14:19:00] RECOVERY - Check health of redis instance on 6381 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604373 keys, up 16 days 1 hours [14:19:08] wikivoyage I can edit [14:19:09] RECOVERY - Check health of redis instance on 6379 on rdb1003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705883 keys, up 16 days 1 hours [14:19:10] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:19] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:20] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:19:21] great [14:19:22] dewiki I can edit [14:19:29] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:29] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:30] PROBLEM - mobileapps endpoints health on 
scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:19:30] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:19:30] updating tendril, is safe [14:19:34] !log START - Cookbook sre.switchdc.mediawiki.08-update-tendril (volans@neodymium) [14:19:34] <_joe_> uh [14:19:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:36] volans: go [14:19:40] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:19:45] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-update-tendril (exit_code=0) (volans@neodymium) [14:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:48] <_joe_> PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster=cache_text site=esams [14:19:51] <_joe_> this is worrying [14:20:00] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:20:03] akosiaris, _joe_ I'll wait a GO for the restore-TTL and start-maintenance, just in case [14:20:19] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:20:20] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job=varnish-text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:20:20] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster=cache_text site=eqiad 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:20:21] <_joe_> we have an alarm on lvs at esams for cache_text [14:20:26] I can edit on eswiki [14:20:29] <_joe_> yes, we're in an outage [14:20:31] esams is just more-sensitive [14:20:36] it's all of them [14:20:37] volans: yeah ok good call [14:20:38] I confirm search have switched to eqiad (correctly this time), cache hit ratio is down (known issue) [14:20:39] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy [14:20:41] and probably mobileapps [14:20:42] <_joe_> can someone check the 5xxs ? [14:21:00] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:21:07] checking [14:21:20] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:21:20] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:21:22] Notice: Undefined index: error in /srv/mediawiki/php-1.32.0-wmf.24/extensions/CirrusSearch/includes/BaseInterwikiResolver.php on line 223 [14:21:27] yep, looking [14:21:29] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=2&fullscreen&orgId=1&from=now-1h&to=now&var-site=All&var-cache_type=text&var-status_type=5 [14:21:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:21:30] PROBLEM - Check health of redis instance on 6381 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181286 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 6 seconds - 
replication_delay is 1539181286 [14:21:30] PROBLEM - Check health of redis instance on 6378 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181288 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 9 seconds - replication_delay is 1539181288 [14:21:43] and calling dcausse for help! [14:21:48] a skipe but seems going down [14:21:50] PROBLEM - Check health of redis instance on 6381 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386539 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604373 keys, up 99 days 12 hours - replication_delay is 1386539 [14:21:51] spike [14:22:10] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:22:10] PROBLEM - Check health of redis instance on 6380 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386558 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4705813 keys, up 99 days 12 hours - replication_delay is 1386558 [14:22:10] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [14:22:19] PROBLEM - Check health of redis instance on 6379 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181333 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 53 seconds - replication_delay is 1539181333 [14:22:20] PROBLEM - Check health of redis instance on 6380 on rdb2001 is CRITICAL: CRITICAL: replication_delay is 1539181338 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 58 seconds - replication_delay is 1539181338 
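The replication_delay figures in the rdb2001 alerts above (e.g. 1539181286) are raw Unix timestamps rather than second counts: those instances had just restarted ("up 6 seconds") and had no last-sync marker yet. A minimal sketch of how such a check can degenerate, assuming (hypothetically, this is not the real check's code) that the delay is computed as "now" minus a last-sync timestamp that defaults to zero:

```python
CRIT_THRESHOLD = 600  # seconds, matching the "600" in the alert output

def replication_delay(last_sync_ts, now):
    """Seconds since the replica last confirmed sync with its master.

    On a freshly restarted instance no sync has been recorded yet; a
    naive implementation falls back to 0, so the "delay" becomes the
    current Unix epoch itself.
    """
    return now - (last_sync_ts or 0)

now = 1539181286  # 2018-10-10, roughly when the alerts above fired

assert replication_delay(now - 5, now) == 5           # healthy replica: small delay
assert replication_delay(None, now) == 1539181286     # fresh restart: "delay" == epoch
assert replication_delay(None, now) > CRIT_THRESHOLD  # so the check goes CRITICAL
```

This matches why the values on long-running rdb2003/rdb2005 instances look like plausible stale delays (around 1.38M seconds, roughly 16 days) while the just-restarted rdb2001 instances report the full epoch.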
[14:22:26] jynus: it's the error handling code that is broken, it's fixed on master but not yet deployed [14:22:29] PROBLEM - Check health of redis instance on 6378 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386573 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705606 keys, up 99 days 12 hours - replication_delay is 1386573 [14:22:29] PROBLEM - Check health of redis instance on 6379 on rdb2003 is CRITICAL: CRITICAL: replication_delay is 1386573 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705883 keys, up 99 days 12 hours - replication_delay is 1386573 [14:22:35] edits count going up [14:22:39] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 33.33% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:22:40] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:22:41] _joe_: why is redis so noisy? can I ignore it for now? [14:22:43] <_joe_> godog: any news on the 5xxs? 
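The graphite 5xx alerts above report lines like "33.33% of data above the critical threshold [1000.0]": the check fires when more than a configured fraction of recent datapoints exceed the threshold. A simplified sketch of that style of check (not the actual check_graphite implementation):

```python
def percent_over(datapoints, threshold):
    """Percentage of non-null datapoints strictly above the threshold."""
    values = [v for v in datapoints if v is not None]
    if not values:
        return 0.0
    return 100.0 * sum(v > threshold for v in values) / len(values)

# Ten one-minute 5xx rates; three of the nine non-null points exceed
# the critical threshold of 1000 req/min.
series = [150, 200, 1800, 2500, 1200, 300, 250, 180, None, 90]

assert round(percent_over(series, 1000), 2) == 33.33  # "33.33% of data above [1000.0]"
```

Treating nulls as absent (rather than as zeros) is one reason these alerts can look jumpy: a few missing datapoints change the denominator.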
[14:22:46] I don't see major errors on mediawiki [14:22:46] <_joe_> volans: ignore [14:22:49] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:22:49] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:22:50] RECOVERY - mobileapps endpoints health on scb1003 is OK: All endpoints are healthy [14:22:51] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:22:57] at the moment [14:22:59] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [14:22:59] PROBLEM - Check health of redis instance on 6480 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380624 600 - REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 2854 keys, up 99 days 12 hours - replication_delay is 1380624 [14:23:06] _joe_: the 5xx's already ended, it was a short spike [14:23:10] PROBLEM - Check health of redis instance on 6378 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382550 600 - REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 99 days 12 hours - replication_delay is 1382550 [14:23:10] PROBLEM - Check health of redis instance on 6481 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380634 600 - REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 2865 keys, up 99 days 12 hours - replication_delay is 1380634 [14:23:10] PROBLEM - Check health of redis instance on 6479 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380635 600 - REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3186 keys, up 99 days 12 hours - replication_delay is 1380635 [14:23:10] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [5000.0] 
https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [14:23:10] PROBLEM - Check health of redis instance on 6478 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1380637 600 - REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 99 days 12 hours - replication_delay is 1380637 [14:23:17] _joe_: yeah looks like cache hosts talking to cache hosts in eqiad [14:23:19] <_joe_> the redis alerts, sigh [14:23:19] PROBLEM - Check health of redis instance on 6381 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382560 600 - REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 99 days 12 hours - replication_delay is 1382560 [14:23:29] PROBLEM - mobileapps endpoints health on scb2003 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:23:30] PROBLEM - Check health of redis instance on 6379 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382570 600 - REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3088 keys, up 99 days 12 hours - replication_delay is 1382570 [14:23:30] PROBLEM - Check health of redis instance on 6380 on rdb2005 is CRITICAL: CRITICAL: replication_delay is 1382571 600 - REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2888 keys, up 99 days 12 hours - replication_delay is 1382571 [14:23:33] I don't think the cache layer is at fault here :P [14:23:38] regarding 5XX, mostly POST https://en.wikipedia.org/w/api.php [14:23:39] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:23:40] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:23:40] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:23:49] people are probably trying anyway to save [14:23:50] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [14:23:52] ? [14:23:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:00] but why 5xx ? [14:24:05] but doesn't see anymore [14:24:08] timeouts/drops [14:24:09] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:10] just a spike [14:24:13] the 5xx.log on oxygen doesn't seem particularly active [14:24:20] I agree [14:24:30] PROBLEM - mobileapps endpoints health on scb1002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:24:32] Some are php time outs [14:24:37] are we in good enough shape to restart maintenance? [14:24:39] PROBLEM - High CPU load on API appserver on mw1231 is CRITICAL: CRITICAL - load average: 49.08, 29.21, 13.62 [14:24:49] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [14:24:50] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:24:53] in terms of db errors I see fewer than I expected, none ongoing [14:25:03] <_joe_> volans: I still see nginx avail alerts [14:25:09] volans: give it another 2-3 mins [14:25:09] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 62.11, 32.35, 14.41 [14:25:11] Some from language converter on page views. So not blocking but something to look at later for better warm up [14:25:13] ack [14:25:19] PROBLEM - High CPU load on API appserver on mw1228 is CRITICAL: CRITICAL - load average: 60.80, 31.14, 13.38 [14:25:22] it's a known issue that many of our alerts have bad/conflicting timing [14:25:33] <_joe_> mediawiki fatals are firing an alert right now [14:25:39] RECOVERY - mobileapps endpoints health on scb2003 is OK: All endpoints are healthy [14:25:41] for the same underlying problem X, alert1 can come and go, then alert2 fires a minute later and persists a while, etc [14:25:43] literally no db errors at the moment [14:25:50] PROBLEM - High CPU load on API appserver on mw1226 is CRITICAL: CRITICAL - load average: 50.85, 32.53, 15.40 [14:25:50] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 100.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:26:04] some high cpu load on various api appservers [14:26:08] is that our hhvm api high cpu load issue again? 
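The "High CPU load" alerts above print the classic 1/5/15-minute load average triple (e.g. "49.08, 29.21, 13.62"): a rising 1-minute value against lower 5/15-minute values means the load is new, which is what a post-switchover traffic surge looks like. A sketch of a Nagios-style load check (illustrative only; the production check likely scales its thresholds by CPU count):

```python
import os

def check_load(warn, crit):
    """Compare the 1-minute load average against warn/crit thresholds
    and return a (nagios_status, message) pair, mimicking the alert
    format seen above."""
    one, five, fifteen = os.getloadavg()
    msg = f"load average: {one:.2f}, {five:.2f}, {fifteen:.2f}"
    if one >= crit:
        return 2, f"CRITICAL - {msg}"
    if one >= warn:
        return 1, f"WARNING - {msg}"
    return 0, f"OK - {msg}"

status, message = check_load(warn=24.0, crit=48.0)
assert status in (0, 1, 2)
```

With warn=24/crit=48, a host reporting "49.08, 29.21, 13.62" would return status 2 (CRITICAL), matching the alerts in the log.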
[14:26:19] grafana says the rate of 5xx is now "green" again https://grafana.wikimedia.org/dashboard/db/varnish-http-errors?refresh=5m&orgId=1 [14:26:20] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 48.39, 34.05, 17.31 [14:26:22] there's some smaller 5xx spikes now, but they're dwarfed by the first one in graphing [14:26:33] <_joe_> mark: no I think it's some unbalance [14:26:36] 0 db connection errors [14:26:48] The fatals are no longer happening, I believe [14:26:57] very low rate though, hardly a "spike", on the newer ones [14:26:59] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:27:25] I only see the usual rate of 500s [14:27:32] <_joe_> depooling mw1226 to check [14:27:43] <_joe_> can someone else look at general api appserver health? [14:27:57] _joe_: I am doing it from a high overview [14:27:57] yeah I'll do [14:28:05] logs, etc. [14:28:09] <_joe_> I think it's just load though [14:28:10] My view is mainly https://logstash.wikimedia.org/goto/ad52476578b0bb4ad9fbcd818c479a1f [14:28:20] RECOVERY - mobileapps endpoints health on scb2006 is OK: All endpoints are healthy [14:28:20] PROBLEM - mobileapps endpoints health on scb2005 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:28:24] (03CR) 10Mathew.onipe: base::monitoring::host: added prometheus check for network receive drops (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe) [14:28:50] RECOVERY - mobileapps endpoints health on scb1002 is OK: All endpoints are healthy [14:29:10] Krinkle: isn't the worst thing the known cirrus error? 
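The "known cirrus error" referenced here is the PHP notice seen earlier ("Undefined index: error" in BaseInterwikiResolver.php): error-handling code indexing an "error" field that is not always present in the response. A language-neutral sketch of the bug pattern and its defensive fix, in Python (the response shape here is hypothetical, not the actual CirrusSearch structure):

```python
def describe_failure_naive(response):
    # Buggy pattern: assumes the backend always returns an "error" field.
    # In PHP this raises "Notice: Undefined index: error"; here, a KeyError.
    return response["error"]["reason"]

def describe_failure_safe(response):
    # Defensive fix: tolerate responses where the field is missing entirely.
    error = response.get("error") or {}
    return error.get("reason", "unknown failure")

assert describe_failure_safe({"error": {"reason": "timeout"}}) == "timeout"
assert describe_failure_safe({}) == "unknown failure"
```

As jynus noted above, the fix for the real bug was already on master at this point, just not yet deployed.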
[14:29:20] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 48.04, 32.43, 17.07 [14:29:20] RECOVERY - mobileapps endpoints health on scb2005 is OK: All endpoints are healthy [14:29:34] (03PS2) 10Mathew.onipe: base::monitoring::host: added prometheus check for network drops [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) [14:29:45] the overall number of alerts in icinga has gone down quite a bit [14:29:47] (03PS1) 10Vgutierrez: Add discovery alias for certcentral [dns] - 10https://gerrit.wikimedia.org/r/465636 (https://phabricator.wikimedia.org/T199711) [14:30:18] Yeah, looks like Cirrus code for handling errors.. has an error [14:30:23] none of the high load api servers have recovered yet though [14:30:33] that's the only thing I am worried about [14:30:38] what is? [14:30:39] * bearND is unclear about why the /page/random endpoint latency spiked. All we're doing is requesting 12 random articles from the MW API (using the random generator). 
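The request bearND describes, fetching a batch of random article titles from the MediaWiki API, can be done in a single call with the standard random query module. A sketch of building that request (parameter names per the public Action API; the exact call mobileapps makes may differ):

```python
from urllib.parse import urlencode

def random_titles_query(domain, count=12):
    """Build a MediaWiki Action API URL that returns `count` random
    article titles in one request, as a /page/random-style endpoint
    could use."""
    params = {
        "action": "query",
        "format": "json",
        "list": "random",
        "rnnamespace": 0,   # main (article) namespace only
        "rnlimit": count,
    }
    return f"https://{domain}/w/api.php?{urlencode(params)}"

url = random_titles_query("en.wikipedia.org")
assert url.startswith("https://en.wikipedia.org/w/api.php?")
assert "rnlimit=12" in url
```

This also illustrates why the endpoint is sensitive to API appserver load: it is a thin wrapper, so its latency is essentially the API's latency, which is the point made a few lines below ("mobileapps complaining is probably a symptom").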
[14:30:40] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:30:46] the api servers not recovering [14:30:50] PROBLEM - High CPU load on API appserver on mw1234 is CRITICAL: CRITICAL - load average: 39.27, 35.86, 22.12 [14:31:00] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [14:31:01] bearND: the API is under heavy load currently [14:31:05] probably related [14:31:06] I'm seeing lots of StashEdit cache failures [14:31:10] More than usual [14:31:21] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:31:31] PROBLEM - High CPU load on API appserver on mw1233 is CRITICAL: CRITICAL - load average: 39.03, 35.27, 20.20 [14:31:37] could be a "replay" issue of things that were held during the RO period and were sent to the API hosts all at once, once back RW? 
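The "replay" theory above describes a thundering herd: clients that kept retrying during the read-only window all resend the moment the site is writable again. The standard client-side mitigation is jittered exponential backoff, sketched here (a generic pattern, not what any particular bot actually does):

```python
import random

def backoff_schedule(attempts, base=1.0, cap=60.0, rng=random.random):
    """Jittered exponential backoff ("full jitter"): each retry waits a
    random amount up to min(cap, base * 2**attempt), so a fleet of
    clients does not resend simultaneously when the service recovers."""
    return [rng() * min(cap, base * (2 ** n)) for n in range(attempts)]

# With rng pinned to 1.0 we see the deterministic upper bounds:
delays = backoff_schedule(6, rng=lambda: 1.0)
assert delays == [1.0, 2.0, 4.0, 8.0, 16.0, 32.0]

# With real jitter, every delay stays under the cap:
assert all(d <= 60.0 for d in backoff_schedule(10))
```

Without the jitter term, every client computes the same deterministic schedule and the retries still arrive in synchronized waves.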
And lots of memc errors as well, which is so noisy I have no ability to tell what is and isn't new or random [14:31:40] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [14:31:40] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:31:40] PROBLEM - mobileapps endpoints health on scb2004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:31:50] RECOVERY - High CPU load on API appserver on mw1228 is OK: OK - load average: 11.94, 22.69, 16.38 [14:31:54] !log oblivian@puppetmaster1001 conftool action : set/weight=15; selector: cluster=api_appserver,service=apache2,dc=eqiad,name=mw122.* [14:31:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:31:56] there's the first recovery [14:32:00] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [14:32:10] so mobile apps is only complaining because api is high, which means more latency [14:32:18] (guessing)? 
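The conftool !log entry above lowers the pool weight of the mw122.* API appservers to 15; in a weighted pool, each backend's share of traffic is its weight divided by the sum of all weights. A tiny sketch of that arithmetic (illustrative only, not conftool or pybal internals; the host names and the weight of 20 for the newer hosts are assumptions):

```python
def traffic_share(weights):
    """Fraction of requests each backend receives in a weighted pool."""
    total = sum(weights.values())
    return {host: w / total for host, w in weights.items()}

# Older mw122x hosts dropped to weight 15; newer hosts assumed at 20.
pool = {"mw1226": 15, "mw1227": 15, "mw1230": 20, "mw1240": 20}
shares = traffic_share(pool)

assert round(shares["mw1226"], 4) == round(15 / 70, 4)  # ~21.4% each for old hosts
assert abs(sum(shares.values()) - 1.0) < 1e-9
```

Lowering the weight of the struggling older batches shifts proportionally more traffic onto the newer hardware, which is the rebalance _joe_ describes later in the log.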
[14:32:19] I see load dropping on most of the unhappy api servers [14:32:25] yeah mobileapps complaining is probably a symptom [14:32:29] yeah I do too [14:32:30] PROBLEM - mobileapps endpoints health on scb1001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:32:31] RECOVERY - mobileapps endpoints health on scb2004 is OK: All endpoints are healthy [14:32:35] <_joe_> no I think it's not related to that [14:32:40] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy [14:32:56] still going up on mw1233 [14:33:12] REMINDER: we still need to restart maintenance [14:33:16] !log oblivian@puppetmaster1001 conftool action : set/weight=15; selector: cluster=api_appserver,service=apache2,dc=eqiad,name=mw123.* [14:33:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:21] RECOVERY - mobileapps endpoints health on scb1001 is OK: All endpoints are healthy [14:33:26] <_joe_> volans: go on please [14:33:30] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:33:33] <_joe_> as far as I'm concerned [14:33:39] is it going to hit the API ? [14:33:41] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:33:43] let alex decide :) [14:33:46] I am a bit concerned [14:33:55] <_joe_> akosiaris: what do you mean? [14:34:00] the maintenance jobs [14:34:05] yeah, I would also wait for that [14:34:06] <_joe_> maintenance? 
shouldn't [14:34:07] I 'd like to not add load to the API currently [14:34:17] <_joe_> they should call the db directly [14:34:20] RECOVERY - High CPU load on API appserver on mw1226 is OK: OK - load average: 18.79, 24.34, 19.09 [14:34:21] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:34:39] (03Abandoned) 10Vgutierrez: Add discovery alias for certcentral [dns] - 10https://gerrit.wikimedia.org/r/465636 (https://phabricator.wikimedia.org/T199711) (owner: 10Vgutierrez) [14:34:48] <_joe_> the apis are ok now [14:34:49] mw1233 says [Wed Oct 10 14:26:33 2018] mce: [Hardware Error]: Machine check events logged [14:34:51] cpu usage is slowly falling but still is pretty high [14:34:56] <_joe_> I mean the appservers [14:35:01] <_joe_> jijiki: oh nice [14:35:05] api appservers or just appservers ? [14:35:21] <_joe_> api [14:35:38] mw1233 also getting better now [14:35:41] well the appservers are in a similar state [14:35:42] how about mw1231, jijiki? [14:35:44] [1301116.526667] CPU12: Core temperature above threshold, cpu clock throttled (total events = 118014) [14:35:51] RECOVERY - High CPU load on API appserver on mw1234 is OK: OK - load average: 15.38, 24.26, 21.01 [14:35:58] apergos: let me login there as well [14:36:15] <_joe_> akosiaris: it's normal when you pour all the traffic all of a sudden [14:36:19] I'm concerned as well, about the memc and stash failures [14:36:24] those are pretty old problems: https://phabricator.wikimedia.org/T149287 [14:36:30] _joe_: didn't happen in codfw though did it [14:36:32] <_joe_> Krinkle: what is failing? [14:36:33] unrelated (or simply exposed by the current load) [14:36:36] <_joe_> mark: it happened [14:36:43] apergos: same [14:36:46] 5XX stopped apparently at :28 [14:36:50] <_joe_> mark: I had to rebalance the apis [14:36:51] codfw is slightly more powerful than eqiad fwiw [14:36:56] lots of posts to commons were failing too [14:36:58] <_joe_> Krinkle: what are you referring to? 
[14:37:13] thanks, jijiki, gtk [14:37:20] but indeed load is going down [14:37:29] I think we can indeed restart maintenance [14:37:30] all the affected API servers (122/123) are the old batches, less powerful than what we have in codfw [14:37:32] akosiaris: sadly, dbs are more powerful on eqiad, so that could cause issues both ways [14:37:48] volans: please go on with restarting maintenance [14:37:50] jynus: lol [14:37:52] they are complaining about temp as well ofc [14:37:54] oh the irony [14:37:59] <_joe_> jijiki: ofc [14:38:01] akosiaris: ack [14:38:03] yeah that's expected with all that load [14:38:06] !log START - Cookbook sre.switchdc.mediawiki.08-start-maintenance (volans@neodymium) [14:38:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:38:15] load == throttle == more load [14:38:17] <_joe_> jynus: less appserver power and more db power, eeek [14:38:19] <_joe_> :) [14:38:28] _joe_: logstash. I'm looking now to see how it was before [14:38:48] <_joe_> Krinkle: the memcached error rates you mean [14:39:13] !log END (FAIL) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=99) (volans@neodymium) [14:39:14] Krinkle: isn't edit count worrying? [14:39:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:39:28] <_joe_> Krinkle: they were very high but went away afaict [14:39:29] fail ? [14:39:34] confirmed the maint crons appeared on mwmaint1002 (and not 1001) as they should [14:39:36] I don't know if those core temp alerts should be "expected", I think our previous take on that issue was that it warranted some hw investigation (e.g. 
replace thermal paste on CPUs) [14:39:42] checking puppet failure [14:39:47] https://grafana.wikimedia.org/dashboard/db/edit-stash?orgId=1&from=now-30m&to=now [14:39:50] <_joe_> akosiaris: I bet it's running on both mwmaint1001 and 1002 [14:39:50] PROBLEM - mobileapps endpoints health on scb2001 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:39:51] ah phew [14:40:05] mwmaint1001.eqiad.wmnet [14:40:09] <_joe_> bblack: yes, they will need thermal paste probably [14:40:17] This one seems to have recovered, which is not easy to see in logstash but clear on Graham's [14:40:24] Grafana [14:40:27] bblack: yeah, but that is blocked on procurement of thermal paste [14:40:33] Stash is Ok [14:40:35] https://phabricator.wikimedia.org/T149287#4319579 [14:40:39] akosiaris: I'll re-run it [14:40:40] RECOVERY - High CPU load on API appserver on mw1233 is OK: OK - load average: 14.43, 24.15, 23.36 [14:40:43] ok [14:40:48] load on mw1233 is going down, we could put it back [14:40:51] RECOVERY - mobileapps endpoints health on scb2001 is OK: All endpoints are healthy [14:40:54] Krinkle: absolute count worries me https://grafana.wikimedia.org/dashboard/db/edit-count?refresh=5m&orgId=1&from=now-24h&to=now [14:40:55] !log START - Cookbook sre.switchdc.mediawiki.08-start-maintenance (volans@neodymium) [14:40:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:53] <_joe_> jynus: the switchover kills some bots/external programs [14:41:59] I guess [14:42:00] <_joe_> I think we saw the same last month [14:42:01] !log END (FAIL) - Cookbook sre.switchdc.mediawiki.08-start-maintenance (exit_code=99) (volans@neodymium) [14:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:02] <_joe_> let me check [14:42:21] jynus: https://grafana.wikimedia.org/dashboard/db/backend-save-timing-breakdown?refresh=5m&orgId=1&from=now-3h&to=now [14:42:26] _joe_: fwiw, 
mwmaint1001 does not have any maint jobs running [14:42:28] <_joe_> volans: not like I didn't tell you the two mwmaint were gonna be an issue [14:42:28] Scroll down to see breakdown [14:42:29] akosiaris: no sorry my bad [14:42:30] yeah I wouldn't be surprised if some of them will just throw an exception and give up in response to read-only mode [14:42:31] that's expected [14:42:35] yeah [14:42:37] Looks like it is mostly bots [14:42:40] PROBLEM - mobileapps endpoints health on scb1004 is CRITICAL: /{domain}/v1/page/random/title (retrieve a random article title) timed out before a response was received [14:42:43] !log START - Cookbook sre.switchdc.mediawiki.08-restore-ttl (volans@neodymium) [14:42:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:42:45] That haven't come back yet [14:42:51] !log END (PASS) - Cookbook sre.switchdc.mediawiki.08-restore-ttl (exit_code=0) (volans@neodymium) [14:42:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:12] Krinkle: save time is high? [14:43:18] 10Operations, 10DBA, 10cloud-services-team, 10wikitech.wikimedia.org, and 2 others: Move some wikis to s5 - https://phabricator.wikimedia.org/T184805 (10Marostegui) This was done successfully and new wikis are now live on eqiad. What is pending now is: - Run the DNS changes for wikireplicas: T206623: - Re-... [14:43:40] RECOVERY - mobileapps endpoints health on scb1004 is OK: All endpoints are healthy [14:44:04] akosiaris: all done on the cookbooks [14:44:20] <_joe_> the cpu usage for the api servers is still a bit high [14:44:28] jynus: indeed. 
But bottom right per counts by group user and entry point etc [14:44:32] volans: ok good [14:45:00] Krinkle: yes, I saw those [14:45:10] jynus: marostegui last thing is https://gerrit.wikimedia.org/r/#/c/operations/dns/+/458790/ but it's not urgent at all [14:45:25] not urgent at all [14:45:27] The perf issue I suspect is mostly due to memcached [14:45:38] akosiaris: I can merge and deploy that tomorrow morning even if you want [14:45:40] _joe_: what do you mean? mwmaint1001 has no crons and mwmaint1002 has them. i merged a hack for that. we can now (later) revert it to normal where it's only based on active_dc and not hostname [14:45:45] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Gehel) Coming back to this discussion, I'll try to make my point more clear: wdqs public... [14:45:55] <_joe_> mutante: the cookbooks assume we have 1 server per dc [14:46:02] marostegui: yeah, that's fine [14:46:05] mutante: that the cookbook checks that the mwmaint host in the new active dc has crontabs [14:46:11] that's not true for 1001, just for 1002 [14:46:14] akosiaris: Do we still want the banner up? [14:46:16] akosiaris: ok, I will take care of that then tomorrow. [14:46:18] _joe_: volans: gotcha! ok [14:46:29] JohanJ: not any longer. Feel free to remove it [14:46:43] thanks! [14:46:44] so to clarify, mwmaint is ok as in "working"? [14:46:53] jynus: yes, since the first run [14:46:57] thanks [14:46:57] yup, it's just mwmaint1002, not mwmaint1001 [14:46:57] <_joe_> mediawiki fatals are still very high [14:47:02] <_joe_> anyone verifying? [14:47:11] I misread the output, at first I thought it was a puppet failure, sorry [14:47:22] akosiaris: we should take down the banners [14:47:29] who's in charge of those? 
[14:47:43] volans: 10 lines up :) [14:47:47] volans: read backlog :-) [14:47:51] already being done :-) [14:47:53] rotfl [14:48:01] cannot get distracted for one sec here [14:48:01] <_joe_> so I still see 2 alerts that are really worrisome [14:48:01] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 16.79, 21.24, 23.57 [14:48:01] :D [14:48:03] _joe_: not anymore as per https://logstash.wikimedia.org/goto/4d8333a71a19993fe8cae7923a846dd3 [14:48:15] there are some long running http queries [14:48:23] <_joe_> marostegui: ok [14:48:31] RECOVERY - High CPU load on API appserver on mw1231 is OK: OK - load average: 16.82, 18.10, 23.94 [14:48:48] Apache/HHVM: shows elevated levels of req queuing. Going down slowly but still in hundreds whereas normally <5 [14:49:02] until we have a better working theory, I think the general issue is that dbs are a bottleneck on codfw on first pool, and mw servers are on eqiad first pool [14:49:08] <_joe_> Krinkle: yes, I need to do a rolling restart of the appservers [14:49:26] we should probably plan to expand capacity in eqiad early next fiscal [14:49:29] <_joe_> jynus: no, I think this has to do with hhvm running semi-idle for too long [14:49:43] _joe_: but that didn't happen for codfw? [14:49:53] <_joe_> jynus: it did, just less sensitive [14:49:57] <_joe_> because more capacity [14:50:00] _joe_: hmm restart why? [14:50:10] ok, so you are agreeing with me, _joe_ in a way [14:50:10] <_joe_> also we didn't restart hhvm, which we did there [14:50:16] Is it asleep or something and needs to be zapped? [14:50:32] moar machines == less problems [14:50:44] Can I quote you on that? 
[14:50:52] <_joe_> Krinkle: as soon as I restarted hhvm on one server, it went from using 75% of cpu to 30% [14:50:59] you can bash it if you want [14:51:05] K :) [14:51:16] <_joe_> but let me try to understand the situation better first [14:51:16] moar machines == more hw problems [14:51:20] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [14:51:26] mark: agree [14:51:27] so we need to restart HHVM in the end :-P [14:51:30] I meant capacity [14:51:41] or just get rid of it [14:51:45] or just move to php7... :) [14:51:46] <_joe_> so requests queued typically are due to db latencies or high cpu usage on the appservers [14:51:57] _joe_: hm the restart sounds like it might be killing the problem incl the reqs for users [14:52:13] _joe_: is there a way we can drain? Or do we do that already around restart? [14:52:25] <_joe_> Krinkle: we do depool, wait a few secs, restart [14:52:28] yeah the restart script does that [14:52:32] 60s? [14:52:44] <_joe_> typically hhvm won't die until it has finished answering old requests [14:52:53] 10Operations, 10Wikimedia-Logstash, 10User-fgiunchedi: Setup rsyslog to be able to produce logs to Kafka - https://phabricator.wikimedia.org/T206633 (10fgiunchedi) p:05Triage>03Normal [14:53:04] <_joe_> anyways, sorry, can we postpone explanations? [14:53:12] <_joe_> I'm trying to assess if I need to do something [14:53:16] this isn't explanations [14:53:21] this is reasoning about the best course of action [14:53:57] so does the restart script wait until min(60s, all requests drained) or not? 
[14:54:09] Thx, I'll also turn down the curiosity dial a bit [14:54:56] <_joe_> mark: no, hhvm does not stop until it has finished responding to its requests, or gets killed by systemd at its default timeout [14:55:00] _joe_: I can check logs for potential fall out. Which one did we restart? [14:55:06] <_joe_> which is - IIRC - 180 seconds [14:55:09] ok [14:55:18] <_joe_> mark: I'm not sure about the 180 seconds [14:55:24] Edit Counts is almost at the same level as it was before the failover now [14:55:25] <_joe_> but it's in that ballpark [14:55:35] <_joe_> Krinkle: mw1246 for instance [14:55:45] <_joe_> ok one thing is really baffling [14:55:46] RefreshLinksJob::run now being the main complainer job [14:56:09] <_joe_> we had a super large spike in network traffic both on api and appservers [14:56:15] <_joe_> which has gone down significantly since [14:56:23] <_joe_> I bet it's populating memcached [14:56:30] <_joe_> that's taking so much toll [14:56:32] in or out or both? [14:56:38] <_joe_> jynus: both [14:56:52] I see es2/3 the most stressed [14:56:57] which would agree with that [14:56:58] 90s is the timeout when systemd kicks in [14:57:00] <_joe_> network is going the same way as the CPU load [14:57:40] *I saw [14:57:42] at the peak of the spike the network usage was about twice the regular one for eqiad [14:57:42] <_joe_> Krinkle: on second thoughts, it seems the cpu load is consistent with the network traffic, it's slowly going down from a figure that was ~ 2x what it usually is [14:57:52] now it's about 20-25% more [14:57:52] <_joe_> akosiaris: similarly the cpu [14:57:58] <_joe_> sane [14:58:03] <_joe_> *same [14:58:22] mw1246 is showing a lot of errors in the last 2 min rising [14:58:24] <_joe_> ok so, on one side in the last few months we increased memcached traffic a lot [14:58:33] <_joe_> Krinkle: uhm lemme see [14:58:39] emm [14:58:40] <_joe_> you mean in logstash?
[14:58:44] Yeah [14:58:50] this is very strange, job queue errors on mw2* [14:59:01] And gone again [14:59:03] Was memc [14:59:08] <_joe_> jynus: oh? [14:59:10] timeout memc [14:59:21] <_joe_> wait, did we switch back jobrunners.discovery.wmnet? [14:59:36] Might be the usual mcrouter cascading failure that we see sometimes [14:59:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [14:59:47] I'd say move on for now, will look at later [14:59:55] _joe_: that would be tomorrow IIRC [15:00:04] the actual DNS entry that is [15:00:09] <_joe_> akosiaris: nope, luckily it was done now [15:00:31] <_joe_> Krinkle: yes, it's the memcached servers not coping with the traffic, and mcrouter makes you notice that [15:00:44] <_joe_> while nutcracker would silently fail over to another node [15:00:44] HHVM queuing has also recovered on all eqiad app servers [15:00:55] <_joe_> yes, which means we're out of the woods [15:00:57] Yeah [15:01:08] ah yes, sorry I got confused with the services discovery records [15:01:11] * akosiaris sigh [15:01:19] The mcrouter really needs rethinking on how we configure it and how sensitive it is [15:01:36] edit rate is also going up, bots seem to reconnect [15:01:39] <_joe_> Krinkle: we configure it in a way that, according to their docs, should work differently [15:01:45] <_joe_> than what we observe [15:01:52] It's currently optimized for read only, but mw is RW.
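The depool, drain, restart cycle discussed earlier in this log (the restart script depools, waits for in-flight requests, then restarts HHVM, with systemd killing it after a timeout) can be sketched roughly as follows. The `drain_wait` helper, its probe interface, and the stub probe are illustrative assumptions, not the actual scap/conftool restart script:

```shell
#!/bin/sh
# Sketch of the drain step: poll a probe command that prints the number
# of in-flight requests, returning once it reaches zero or the timeout
# expires. Echoes the number of seconds actually waited.
drain_wait() {
    probe=$1 timeout=$2 waited=0
    while [ "$waited" -lt "$timeout" ] && [ "$("$probe")" -gt 0 ]; do
        sleep 1
        waited=$((waited + 1))
    done
    echo "$waited"
}

# Stub probe for an already-idle server; in production this would have
# to query the server's request counter somehow (an assumption here).
probe_idle() { echo 0; }

drain_wait probe_idle 60   # → 0 (nothing in flight, returns at once)
```

The real flow would then be depool, `drain_wait`, `systemctl restart hhvm`, repool; the point of the min(timeout, drained) logic is that a wedged server cannot hold the pool hostage forever.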
[15:02:47] So where were we [15:03:53] I noticed that conn_yields in all the shards went up to ~700/900, possibly mcrouter was throttled, but I am not sure if the counter went up during the warm up or later on [15:04:51] not sure if it's related to switchover, but https://noc.wikimedia.org/conf/ 503's from here (and from bast4001 in ulsfo), but seems to be fine when requested from elsewhere [15:05:23] <_joe_> ebernhardson: define "here" :P [15:05:39] _joe_: my house. That's why I added bast4001 for ulsfo, where my requests go through :P [15:05:44] <_joe_> ebernhardson: eheh ok [15:05:53] <_joe_> just noc? [15:06:00] <_joe_> that's pretty strange [15:06:04] indeed [15:06:13] <_joe_> ema, bblack any idea? [15:06:19] (03CR) 10Vgutierrez: [C: 032] Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 (owner: 10Alex Monk) [15:06:31] <_joe_> ebernhardson: can you check your X-cache headers? [15:07:01] _joe_: < X-Cache: cp2013 pass, cp4029 miss, cp4031 miss [15:07:05] Noc is active active right? Maybe one of the two backends isn't good [15:07:27] <_joe_> Krinkle: I don't even know that, I'll have to check the varnish configs [15:07:49] (03PS2) 10Elukey: role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) [15:08:02] <_joe_> so it's trying to go via cp2013, which makes me think yes, codfw [15:08:11] <_joe_> ebernhardson: can you try from bast2* [15:08:12] <_joe_> ?
[15:08:18] sure sec [15:08:24] yes it is [15:08:34] (03CR) 10Elukey: [C: 032] role::eventlogging::analytics::files: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465569 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [15:08:37] bast4002 replaced bast4001 [15:08:41] <_joe_> oh right mwmaint2001 [15:08:45] Runs on mwmaint* right [15:08:55] ebernhardson: ^ [15:09:08] (03CR) 10jenkins-bot: Merge branch 'master' into debian [software/certcentral] (debian) - 10https://gerrit.wikimedia.org/r/465633 (owner: 10Alex Monk) [15:09:46] I can confirm noc.wikimedia.org failure in codfw too [15:09:56] fails from 4002 as well [15:09:56] https://noc.wikimedia.org/conf/ -> cp2xxx -> 503 [15:10:14] <_joe_> yes [15:10:46] (03PS2) 10Elukey: statistics::rsync::eventlogging: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) [15:10:50] The cache hasn't cleared, so I can still see the banner which Seddon removed more than 20 minutes ago. This is longer than normal. 
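For context on the debugging above: the `X-Cache` header ebernhardson pasted lists each cache hop the request crossed, and the first (leftmost) entry is the cache that talked to the application layer, which is how `cp2013` pointed the 503 at codfw. A tiny helper to pull that hop out of a header value (the helper name and parsing are my own illustration, not a Wikimedia tool):

```shell
#!/bin/sh
# backend_hop: print the backend-most cache host from an X-Cache value.
backend_hop() {
    # "cp2013 pass, cp4029 miss, ..." -> "cp2013"
    printf '%s\n' "$1" | cut -d',' -f1 | awk '{print $1}'
}

backend_hop 'cp2013 pass, cp4029 miss, cp4031 miss'   # → cp2013
```

In a live check you would feed it the header from something like `curl -sI https://noc.wikimedia.org/conf/`.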
[15:11:23] hmm [15:11:24] noc.wm.o's varnish backends are currently A/A, I believe intended [15:11:25] noc: [15:11:25] backends: [15:11:25] eqiad: 'mwmaint1001.eqiad.wmnet' [15:11:25] codfw: 'mwmaint2001.codfw.wmnet' [15:11:26] <_joe_> ebernhardson: lol, it should work now [15:11:28] there's perhaps something wrong with mwmaint2001 when it comes to serving noc.w.o [15:11:33] that's the wrong maint server [15:11:37] <_joe_> !log started again hhvm on mwmaint2001 [15:11:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:41] (03CR) 10Elukey: [C: 032] statistics::rsync::eventlogging: reduce retention for archive [puppet] - 10https://gerrit.wikimedia.org/r/465573 (https://phabricator.wikimedia.org/T206542) (owner: 10Elukey) [15:11:45] sigh [15:11:46] <_joe_> akosiaris: you killed hhvm there [15:11:47] <_joe_> :P [15:12:04] _joe_: it was restarted by systemd though [15:12:12] I did notice it in ps output right after that [15:12:17] bblack: 1001 not 1002? [15:12:18] akosiaris: which one is the right maint server? [15:12:25] mwmaint1002 [15:12:27] volans: I'm just pasting from the puppet repo! [15:12:38] hieradata/role/common/cache/text.yaml [15:12:43] <_joe_> akosiaris: uhm... it seems it doesn't work tbh, I can't connect to hhvm [15:13:16] if you want me to i can remove the mw_maintenance role from mwmaint1001 altogether. patch is already waiting [15:13:48] <_joe_> ebernhardson: confirmed it works again from codfw [15:13:52] noc.w.o seems to be working again now yes [15:14:00] _joe_: so what ? [15:14:08] the hhvm process was foobar or what ? [15:14:12] <_joe_> akosiaris: hhvm was not responding to requests [15:14:15] I stil see the maintenance banner as well [15:14:20] still* [15:14:25] works from 4002 as well now [15:14:27] mutante: should noc.wm.o move to s/1001/1002/ as well?
[15:15:01] bblack: yes, either 1002 or 1002/2001 [15:15:10] ok [15:15:23] got a patch already for that [15:15:44] ok, go for it [15:16:19] mutante: should we make sure people don't end up using mwmaint1001 ? [15:16:23] (03PS1) 10Alexandros Kosiaris: noc: Switch varnish backend to mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/465644 [15:16:32] (03PS1) 10Dzahn: Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645 [15:16:48] (03CR) 10Alexandros Kosiaris: [V: 032 C: 032] noc: Switch varnish backend to mwmaint1002 [puppet] - 10https://gerrit.wikimedia.org/r/465644 (owner: 10Alexandros Kosiaris) [15:16:59] akosiaris: yes, we can either turn it into a role(spare) now or we can add the warning banner [15:17:12] definitely the warning banner [15:17:30] at least for a couple of days [15:18:14] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Gehel) [15:20:18] (03PS1) 10Dzahn: mw_maintenance: ensure motd warning banner on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465646 [15:20:28] ^ this is to get the warning motd [15:20:35] $ensure and $motd_ensure were separate [15:20:38] that's why that didn't happen [15:20:43] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [15:21:33] (03PS2) 10Dzahn: mw_maintenance: ensure motd warning banner on mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465646 [15:22:32] mutante: probably !$ensure ? [15:22:38] akosiaris: fixed [15:23:02] akosiaris, JohanJ I still see the banner in enwiki, itwiki, etc.
[15:23:21] there was a separate 'primary_dc' lookup for $ensure (crons) and $motd_ensure (banner) [15:23:28] (thanks jijiki for letting me know ;) ) [15:23:44] Still the banner as well here. [15:23:46] we can just use $ensure for all [15:24:09] Yeah. Me too. According to Seddon, he removed it at 4:47 PM and expected up to ten minutes for the cache to clear. [15:24:11] which has the special hack already to differ between 1001 and 1002 [15:24:17] damn that's bad [15:24:24] ok so what ? CN is misbehaving ? [15:24:25] bblack: can we do anything to clear the banner? [15:24:44] it's not the cache [15:24:49] the varnish caches I mean [15:25:00] if you do cache busting the thing is still there [15:25:02] I was going to say, I think that's not part of article content caching [15:25:16] banners are fetched via js stuff and short and/or nonexistent TTLs [15:26:41] JohanJ: We don't think it's the caches [15:26:48] OK. [15:28:11] I'll reach out to people to see who can help debug that [15:29:17] is there some pointer to where the CN stuff is configured (i.e. where the 4:47pm removal shows up at?)... I know I've seen such a page before [15:29:17] (03PS3) 10Dzahn: mw_maintenance: use $ensure, not $motd_ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:29:41] 10Operations, 10fundraising-tech-ops, 10netops: icinga reports frbast2001.frack.eqiad.wmnet as host down - https://phabricator.wikimedia.org/T206637 (10Jgreen) [15:29:57] https://meta.wikimedia.org/wiki/Special:CentralNotice ? [15:30:36] yeah it's that it [15:30:43] (03CR) 10Dzahn: "this is a planned revert to go back to normal state.. but only after mwmaint1001 is actually turned into role(spare)!" [puppet] - 10https://gerrit.wikimedia.org/r/465645 (owner: 10Dzahn) [15:30:44] https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=Sept2018Maintenance [15:31:08] bblack: ?
[15:31:19] I'm guessing it's disabled in the sense that all projects are de-selected? [15:31:26] the enabled flag is off [15:31:27] https://meta.wikimedia.org/wiki/Special:CentralNoticeLogs [15:31:29] but the timer window for it still runs another ~30 mins to 16:00 [15:31:35] 10 October 2018 15:30 Vogone (talk) modified Sept2018Maintenance 15 UTC is in the past, already [15:31:35] Enabled: Changed from on to off [15:32:06] (03CR) 10Dzahn: [C: 04-1] "hold on.. need to amend" [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:32:41] the on/off times were also edited when it was enabled in the previous change (edited to same values, according to log) [15:32:46] maybe they have to be null-edited again? [15:33:31] (03CR) 10Filippo Giunchedi: [C: 031] Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:33:32] I have no idea. I don't have access to the CN banner interface, so I run these things through Seddon, who isn't responding and might have gone off his shift once his part was done. [15:33:51] I don't think I have any special access [15:33:57] neither do I [15:33:58] bblack: can I help? note there is some cache delay [15:33:59] Do we need someone with access? [15:34:07] (03CR) 10Filippo Giunchedi: [C: 031] Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:34:20] AndyRussG: Yes please. 
The CN maintenance banner is still up and, well, it shouldn't be [15:34:28] (03CR) 10Filippo Giunchedi: [C: 031] Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:34:31] akosiaris: ok one sec [15:34:36] https://meta.wikimedia.org/w/index.php?title=Special:CentralNotice&subaction=noticeDetail&notice=Sept2018Maintenance [15:34:36] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1067 - https://phabricator.wikimedia.org/T206500 (10Cmjohnson) Reseated the disk....let's see what happens [15:34:58] (03CR) 10Filippo Giunchedi: [C: 031] Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff) [15:35:02] 10Operations, 10ops-eqiad, 10Traffic: cp1076 hardware failure - https://phabricator.wikimedia.org/T206394 (10Cmjohnson) @bblack is there any action item for me? [15:35:25] fwiw I'm not seeing the banner [15:35:37] Me neither. [15:35:39] I am [15:35:42] sjoerddebruin, do you see the banner? [15:35:43] !log uploaded jenkins 2.138.2 security release to apt.wikimedia.org (jessie/stretch) (T206234) [15:35:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:46] I no longer am as well [15:35:49] thcipriani: ^ [15:35:50] oh, it just went away on a fresh reload for me [15:35:58] It's gone now [15:36:00] 10Operations, 10MediaWiki-General-or-Unknown, 10Wikidata, 10wikidata-tech-focus, and 2 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) [15:36:02] OK, good. [15:36:05] moritzm: thank you!
[15:36:10] okay so [15:36:11] I have no idea what just happened [15:36:12] probably AndyRussG fixed it :) [15:36:16] no [15:36:20] (03PS4) 10Dzahn: mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:36:36] When people complained about still seeing the banner [15:36:41] It was still enabled [15:36:44] bblack: akosiaris Krenair there's some caching [15:37:00] heh no I didn't do anything [15:37:05] but I'm not seeing the banner anymore [15:37:10] It got disabled by Vogone around the time that bblack found Special:CentralNotice [15:37:17] ah, ok [15:37:22] ah that explains it [15:37:22] (03CR) 10jerkins-bot: [V: 04-1] mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:37:34] OK, so might simply have been human error. [15:37:45] 10 October 2018 15:30 [15:38:07] It was set to end automatically at 16:00 [15:38:08] ok that's good to know [15:38:19] Just fyi btw, CN banner and campaign settings are cached with RL modules [15:38:24] but that's in the future in UTC [15:38:24] yeah the 2 entries in https://meta.wikimedia.org/wiki/Special:CentralNoticeLogs [15:38:27] (03PS5) 10Dzahn: mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:38:40] Vogone set it to off [15:38:43] ok [15:38:44] changes don't happen instantly [15:39:07] AndyRussG: yeah it was just reported to us (seems wrongly) it was set to off >50 mins ago [15:39:11] hence the panic mode [15:39:20] ah heh okok [15:39:36] anyway fixed now. 
Thanks for looking into it [15:39:42] glad there weren't worse things to worry about :) [15:39:44] right and RL caches for ~5 mins max, IIRC [15:39:52] sometimes less, depending on your timing in the cycle [15:40:03] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10cloud-services-team: Provide a way to have test servers on real hardware, isolated from production - https://phabricator.wikimedia.org/T206636 (10Smalyshev) p:05Triage>03High [15:40:07] yeah something like that... It was once 10 min, maybe it's 5 min now [15:40:23] (03CR) 10Dzahn: [C: 031] "now it's ok -> https://puppet-compiler.wmflabs.org/compiler1002/12856/" [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:40:30] maybe it is 10, I haven't looked in a while [15:40:30] akosiaris: ^ this should be good now [15:40:40] 10Operations, 10Maps: Switch to unix socket connections for osmupdater / osmimporter for postgresql on maps - https://phabricator.wikimedia.org/T206639 (10Gehel) [15:41:41] just looked, 5 minutes [15:41:53] (03CR) 10Alexandros Kosiaris: [C: 031] mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:41:56] mutante: indeed! Thanks. 
+1ed [15:42:07] (03CR) 10Dzahn: [C: 032] mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [15:42:08] all the resourceloader outputs I looked at, they emit: "cache-control: public, max-age=300, s-maxage=300" [15:42:25] (03PS6) 10Dzahn: mw_maintenance: let $motd_ensure be based on $ensure for warning motd [puppet] - 10https://gerrit.wikimedia.org/r/465646 (https://phabricator.wikimedia.org/T201343) [15:42:43] 10Operations, 10MediaWiki-General-or-Unknown, 10Wikidata, 10wikidata-tech-focus, and 2 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10Addshore) >>! In T205865#4639922, @hoo wrote: > It indeed is the lockmanager. I ran t... [15:43:50] bblack: ah okok good to know :) [15:46:11] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4654860, @Krenair wrote: > > Presumably whoever would be responsible for purchasing a renewal has to consider this. It's one... [15:46:43] mwmaint1001 has the "this is not the active server" warning now (and 2001 as well) [15:46:55] cool [15:46:56] thanks! [15:48:35] you're welcome. i still have this planned revert for the special case, but it has to wait a few days until we apply role(spare) https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/465645/ [15:49:13] and the change above should not affect it and can stay forever [15:49:14] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10BBlack) The kicker probably wouldn't be the monetary cost. It would be that if you didn't require EV, you could auto-issue certs from LetsEncrypt an... 
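The `cache-control: public, max-age=300, s-maxage=300` output quoted above is what caps banner propagation at roughly five minutes. A quick sketch for extracting that value from a Cache-Control header string (the helper and its parsing approach are my own, just for illustration):

```shell
#!/bin/sh
# max_age: print the max-age directive (in seconds) from a
# Cache-Control header value.
max_age() {
    # Split directives onto their own lines, then keep only the digits
    # after "max-age="; "s-maxage" does not match the hyphenated form.
    printf '%s\n' "$1" | tr ',' '\n' \
        | sed -n 's/.*max-age=\([0-9]*\).*/\1/p' | head -1
}

max_age 'public, max-age=300, s-maxage=300'   # → 300
```

Against a live ResourceLoader URL you would take the header from `curl -sI` output first.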
[15:49:50] mutante: now that it's no longer a maint server, can you rename it back to mw1297? then we can re-add it as mw server via https://phabricator.wikimedia.org/T192457 [15:50:12] RECOVERY - DPKG on contint2001 is OK: All packages OK [15:50:33] 10Operations, 10SRE-Access-Requests: Requesting access to servers for Release Engineering tasks for Lars Wirzenius - https://phabricator.wikimedia.org/T206612 (10greg) [15:50:39] moritzm: yes, though for now it still uses the mw_maintenance role [15:50:52] after that is gone, i will reinstall it [15:51:29] i will take that ticket to remind me [15:51:43] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) a:03Dzahn [15:52:52] mutante: It's fine to simply rename back to mw1297 and simply use role::spare for now, all the other former image scalers are also spares for now [15:53:30] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) mwmaint1001 should be reinstalled as mw1297 and go back into the pool. but this is after https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/461492/ (and https://gerrit.wikimedia.org/r/... [15:54:21] moritzm: ok! 
i was just under the impression akosiaris would like me to wait a few days in case something is wrong with mwmaint1002 [15:54:32] (before removing the role from 1001 that is) [15:55:06] yes I would, but not for that reason [15:55:08] !log scheduled downtime for host cloudvirt1019 swap raid card T196507 [15:55:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:11] T196507: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 [15:55:13] but rather people learning about the switch [15:55:25] ah, ok [15:55:38] I don't feel too strongly about it though [15:56:26] maybe tomorrow then [15:58:18] fine with me to wait a few days, but an inaccessible mwmaint1001 (as role::spare only allows SREs to login) will make them notice as well :-) [15:59:25] true. And they'll ask for info which they could have gotten anyway. [15:59:52] !log Uploaded certcentral 0.1 to apt.wikimedia.org (stretch) - T199711 [15:59:53] anyway I honestly don't feel too strongly about it. Feel free to kill the server now [15:59:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:59:56] T199711: Deploy a scalable service for ACME (LetsEncrypt) certificate management - https://phabricator.wikimedia.org/T199711 [15:59:58] Krenair: ^^ done [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. [16:00:11] ok [16:00:42] vgutierrez, cool. puppet tomorrow?
[16:00:47] ACKNOWLEDGEMENT - Host backup2001 is DOWN: PING CRITICAL - Packet loss = 100% daniel_zahn https://phabricator.wikimedia.org/T196477#4652673 [16:01:12] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [16:01:18] (03CR) 10Vgutierrez: [C: 031] "Certcentral 0.1 package uploaded to apt.wm.o, this looks good to me, but of course reviews from more seasoned puppet reviewers are very we" [puppet] - 10https://gerrit.wikimedia.org/r/441991 (https://phabricator.wikimedia.org/T194962) (owner: 10Alex Monk) [16:02:23] Krenair: yup [16:04:03] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2018), 10User-Johan: Lessons learned - https://phabricator.wikimedia.org/T206649 (10Johan) p:05Triage>03Normal [16:04:07] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10Andrew) @Cmjohnson, is there a card for 1024 as well and you're waiting to hear whether 1023 is a success? [16:04:29] moritzm: are you planning to make another install attempt on cloudvirt1023 or is that in my court now? [16:07:01] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2018), 10User-Johan: Lessons learned: Communicating the server switch 2018 - https://phabricator.wikimedia.org/T206649 (10Johan) [16:07:02] (03PS1) 10Gehel: wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) [16:07:33] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 (10RobH) >>! In T199125#4655279, @Andrew wrote: > @Cmjohnson, is there a card for 1024 as well and you're waiting to hear whether 102... 
[16:07:55] ACKNOWLEDGEMENT - HP RAID on cloudvirt1019 is CRITICAL: CRITICAL: Slot 1: Failed: 1I:1:5, 1I:1:6, 1I:1:7, 1I:1:8 - OK: 2I:1:1, 2I:1:2, 2I:1:3, 2I:1:4, 2I:2:1, 2I:2:2 - Controller: OK - Battery count: 0 nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T206651 [16:07:58] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T206651 (10ops-monitoring-bot) [16:09:59] 10Operations, 10Wikidata, 10Wikidata-Query-Service, 10Discovery-Wikidata-Query-Service-Sprint: Define an SLO for Wikidata Query Service public endpoint and communicate it - https://phabricator.wikimedia.org/T199228 (10Smalyshev) I think update lag is not the biggest issue. Endpoint availability and respons... [16:10:11] (03PS2) 10Jcrespo: site.pp: Comment fixes due to dewiki no longer being the only s5 wiki [puppet] - 10https://gerrit.wikimedia.org/r/464797 (https://phabricator.wikimedia.org/T184805) [16:12:53] PROBLEM - HHVM jobrunner on mw1306 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time [16:13:13] RECOVERY - Check health of redis instance on 6378 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 15 keys, up 1 hours 51 minutes - replication_delay is 5 [16:13:32] andrewbogott: this needs additional work to make it work with jessie; an installer image based on 4.9 for jessie. 
I talked to Arturo and I'll write up the steps and he'll create a netboot image with that [16:13:32] RECOVERY - Check health of redis instance on 6380 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 3240 keys, up 1 hours 52 minutes - replication_delay is 2 [16:14:02] RECOVERY - HHVM jobrunner on mw1306 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [16:14:23] RECOVERY - Check health of redis instance on 6381 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3463 keys, up 1 hours 52 minutes - replication_delay is 2 [16:15:43] RECOVERY - Check health of redis instance on 6379 on rdb2001 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705935 keys, up 1 hours 54 minutes - replication_delay is 10 [16:16:02] jessie ? [16:16:22] RECOVERY - Check health of redis instance on 6378 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 4705606 keys, up 99 days 14 hours - replication_delay is 3 [16:16:25] <_joe_> !log restart of now-unused jobqueue redises for stopping the alerts post-switchover [16:16:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:02] RECOVERY - Check health of redis instance on 6378 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6378 has 1 databases (db0) with 6 keys, up 99 days 14 hours - replication_delay is 10 [16:17:03] RECOVERY - Check health of redis instance on 6478 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6478 has 1 databases (db0) with 4 keys, up 99 days 14 hours - replication_delay is 2 [16:17:35] akosiaris: openstack stuff is migrated to jessie... 
[16:17:40] ok [16:18:02] RECOVERY - Check health of redis instance on 6381 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 4604373 keys, up 99 days 14 hours - replication_delay is 2 [16:18:02] RECOVERY - Check health of redis instance on 6380 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 4705813 keys, up 99 days 14 hours - replication_delay is 5 [16:18:33] RECOVERY - Check health of redis instance on 6379 on rdb2003 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 4705883 keys, up 99 days 14 hours - replication_delay is 1 [16:19:03] RECOVERY - Check health of redis instance on 6381 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6381 has 1 databases (db0) with 3178 keys, up 99 days 14 hours - replication_delay is 10 [16:19:13] RECOVERY - Check health of redis instance on 6481 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6481 has 1 databases (db0) with 2865 keys, up 99 days 13 hours - replication_delay is 6 [16:19:32] RECOVERY - Check health of redis instance on 6379 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6379 has 1 databases (db0) with 3088 keys, up 99 days 14 hours - replication_delay is 7 [16:19:33] RECOVERY - Check health of redis instance on 6380 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6380 has 1 databases (db0) with 2888 keys, up 99 days 14 hours - replication_delay is 7 [16:19:40] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4655176, @BBlack wrote: > The kicker probably wouldn't be the monetary cost. It would be that if you didn't require EV, you c... 
[16:20:12] RECOVERY - Check health of redis instance on 6480 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6480 has 1 databases (db0) with 2854 keys, up 99 days 14 hours - replication_delay is 2 [16:20:22] RECOVERY - Check health of redis instance on 6479 on rdb2005 is OK: OK: REDIS 2.8.17 on 127.0.0.1:6479 has 1 databases (db0) with 3186 keys, up 99 days 14 hours - replication_delay is 3 [16:21:20] (03PS1) 10Ayounsi: Diffscan, don't scan the WMCS public range [puppet] - 10https://gerrit.wikimedia.org/r/465654 (https://phabricator.wikimedia.org/T206653) [16:22:56] (03PS1) 10Elukey: Revert "statistics::rsync::eventlogging: reduce retention for archive" [puppet] - 10https://gerrit.wikimedia.org/r/465655 [16:23:04] (03PS2) 10Elukey: Revert "statistics::rsync::eventlogging: reduce retention for archive" [puppet] - 10https://gerrit.wikimedia.org/r/465655 [16:23:42] (03CR) 10Elukey: [V: 032 C: 032] Revert "statistics::rsync::eventlogging: reduce retention for archive" [puppet] - 10https://gerrit.wikimedia.org/r/465655 (owner: 10Elukey) [16:23:49] moritzm: thanks — can you add that status to T199125 (or refer me to the related task?) [16:23:50] T199125: rack/setup/install cloudvirt102[34] - https://phabricator.wikimedia.org/T199125 [16:25:20] !log LDAP - added isaacj to wmf group (for SWAP access, existing shell user since recently) (T206631) (T205840) [16:25:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:25:25] T205840: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 [16:25:25] T206631: LDAP group access request for Isaac Johnson - https://phabricator.wikimedia.org/T206631 [16:27:48] 10Operations, 10Research, 10SRE-Access-Requests, 10Patch-For-Review: Server Access for Isaac Johnson - https://phabricator.wikimedia.org/T205840 (10Dzahn) ^ We talked on IRC about SWAP access. Membership in the "wmf" LDAP group was missing. (T206631) So i added that and now all should work. P.S. The docs... 
[16:29:39] andrewbogott: I don't have a task yet, but will create one tomorrow and add you and Arturo as subscribers, OK? [16:30:06] sounds good — thanks again! We're at an offsite this week so won't be very responsive. [16:30:27] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) received the new raid controller and installed, updating the firmware now. Initially it is showing as failed raid [16:35:54] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10cwdent) @BBlack I have been exploring options and it sounds like the DNS TXT record challenge would allow us to issue certs without disturbing the hos... [16:37:17] (03CR) 10Ayounsi: [C: 032] Diffscan, don't scan the WMCS public range [puppet] - 10https://gerrit.wikimedia.org/r/465654 (https://phabricator.wikimedia.org/T206653) (owner: 10Ayounsi) [16:37:33] (03PS2) 10Ayounsi: Diffscan, don't scan the WMCS public range [puppet] - 10https://gerrit.wikimedia.org/r/465654 (https://phabricator.wikimedia.org/T206653) [16:44:24] (03PS2) 10Dzahn: site: turn mwmaint1001 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) [16:48:26] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) That is being set up for prod at the moment actually, but it relies on trusted servers SSHing to prod auth DNS machines. I'm not sure frack... 
[16:58:14] (03CR) 10Mathew.onipe: wdqs: increase throttling limits for internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [16:58:39] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) [16:58:57] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: rack/setup/install stat1007.eqiad.wmnet (stat1005 user replacement) - https://phabricator.wikimedia.org/T203852 (10elukey) [17:01:22] (03PS1) 10Alexandros Kosiaris: Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) [17:02:06] (03CR) 10jerkins-bot: [V: 04-1] Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [17:02:25] (03PS2) 10Alexandros Kosiaris: Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) [17:02:58] (03CR) 10Gehel: wdqs: increase throttling limits for internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [17:03:17] (03CR) 10Alexandros Kosiaris: [C: 032] Revert "scap: use mediawiki canaries from codfw" [puppet] - 10https://gerrit.wikimedia.org/r/465661 (https://phabricator.wikimedia.org/T204907) (owner: 10Alexandros Kosiaris) [17:08:44] (03PS1) 10Cmjohnson: Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) [17:10:13] (03PS2) 10Cmjohnson: Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) [17:10:19] (03CR) 10Mathew.onipe: wdqs: increase throttling limits for 
internal cluster (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [17:11:17] (03PS3) 10Cmjohnson: Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) [17:11:57] (03CR) 10Cmjohnson: [C: 032] Adding production dns stat1007 [dns] - 10https://gerrit.wikimedia.org/r/465664 (https://phabricator.wikimedia.org/T203852) (owner: 10Cmjohnson) [17:12:07] (03PS1) 10Elukey: Add stat1007 to analytics-1-b [dns] - 10https://gerrit.wikimedia.org/r/465666 (https://phabricator.wikimedia.org/T203852) [17:12:14] (03CR) 10jerkins-bot: [V: 04-1] Add stat1007 to analytics-1-b [dns] - 10https://gerrit.wikimedia.org/r/465666 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey) [17:13:02] (03Abandoned) 10Elukey: Add stat1007 to analytics-1-b [dns] - 10https://gerrit.wikimedia.org/r/465666 (https://phabricator.wikimedia.org/T203852) (owner: 10Elukey) [17:17:46] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10greg) >>! In T191183#4653436, @thcipriani wrote: > This is probably something we should enforce somehow (jenkins? some tool to be created to upload?) before exposing this feature b... [17:21:32] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4655430, @Krenair wrote: > How complex is the payments site? Is it possible to do http challenges there? Off the top of my he... 
[17:23:52] 10Operations, 10procurement: eqiad: (5) elastic systems - https://phabricator.wikimedia.org/T206681 (10RobH) p:05Triage>03High [17:23:54] 10Operations, 10procurement: eqiad: (5) elastic systems - https://phabricator.wikimedia.org/T206681 (10RobH) [17:44:07] 10Operations, 10ops-eqiad: Degraded RAID on cloudvirt1019 - https://phabricator.wikimedia.org/T196507 (10Cmjohnson) F/W updated and now I am getting new issues...missing several of the disks. I have to get another AHS report and send to HP....the saga continues [17:47:23] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Krenair) >>! In T204931#4655504, @Jgreen wrote: >>>! In T204931#4655430, @Krenair wrote: >> How complex is the payments site? Is it possible to do ht... [17:48:49] (03CR) 10Smalyshev: [C: 031] wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [17:49:34] 10Operations, 10Fundraising-Backlog, 10Traffic, 10HTTPS: Re-evaluate use of EV certificates for payments.wm.o? - https://phabricator.wikimedia.org/T204931 (10Jgreen) >>! In T204931#4655857, @Krenair wrote: > > Oh wow, okay - I was expecting you to say it was behind LVS or something but not that. Ha, well... [17:49:49] !log replace 10.195.0.0/25 with 10.195.0.0/24 in prefix-list fundraising-codfw4 on cr1/2-codfw - T206637 [17:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:49:54] T206637: icinga reports frbast2001.frack.eqiad.wmnet as host down - https://phabricator.wikimedia.org/T206637 [17:54:56] 10Operations, 10fundraising-tech-ops, 10netops: icinga reports frbast2001.frack.eqiad.wmnet as host down - https://phabricator.wikimedia.org/T206637 (10Jgreen) 05Open>03Resolved a:03Jgreen This is fixed. 
- fix nagios_nsca.conf in prod puppet for frbast2001's new IP - fix modules/network/data/data.yaml... [18:01:15] (03PS1) 10Gergő Tisza: Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) [18:05:22] !log otto@deploy1001 Started deploy [analytics/refinery@4e2d956]: Add accept header to webrequest logs - T170606 [18:05:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:25] T170606: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 [18:09:58] !log otto@deploy1001 Finished deploy [analytics/refinery@4e2d956]: Add accept header to webrequest logs - T170606 (duration: 04m 35s) [18:10:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:03] !log otto@deploy1001 Started deploy [analytics/refinery@28bbee8]: Add accept header to webrequest logs - T170606 [18:10:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:19:43] !log delete sessions to AS6805 on cr2-esams (left AMS-IX) [18:19:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:37] !log otto@deploy1001 Finished deploy [analytics/refinery@28bbee8]: Add accept header to webrequest logs - T170606 (duration: 10m 34s) [18:20:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:20:40] T170606: Add Accept header to webrequest logs - https://phabricator.wikimedia.org/T170606 [18:21:33] 10Operations, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) [18:22:21] 10Operations, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) [18:23:23] PROBLEM - HTTP availability for Varnish at codfw on einsteinium is CRITICAL: job=varnish-text site=codfw 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:23:43] PROBLEM - puppet last run on ms-be1015 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:23:43] PROBLEM - HTTP availability for Varnish at ulsfo on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:23:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:02] PROBLEM - HTTP availability for Varnish at eqiad on einsteinium is CRITICAL: job=varnish-upload site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:24:02] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:12] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:24:12] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:32] RECOVERY - HTTP availability for Varnish at codfw on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:24:33] PROBLEM - Esams HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:24:33] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:33] we're aware and looking [18:24:42] PROBLEM - HTTP availability for Varnish at esams on einsteinium is CRITICAL: job={varnish-text,varnish-upload} site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:24:52] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is CRITICAL: cluster={cache_text,cache_upload} site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:24:53] PROBLEM - HTTP availability for Varnish at eqsin on einsteinium is CRITICAL: job=varnish-text site=eqsin https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:26:02] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:26:07] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10Dzahn) [18:26:13] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:27:03] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:27:13] PROBLEM - MediaWiki memcached error rate on graphite1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:27:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:27:42] PROBLEM - Codfw HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:27:42] PROBLEM - Upload HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [18:27:52] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:02] RECOVERY - HTTP availability for Varnish at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Varnish at ulsfo on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:28:12] RECOVERY - HTTP availability for Varnish at eqsin on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:28:22] RECOVERY - HTTP availability for Varnish at eqiad on einsteinium is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:29:23] RECOVERY - MediaWiki memcached error rate on graphite1001 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [18:29:32] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on einsteinium is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:33:02] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 45 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [18:33:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:33:33] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:34:03] RECOVERY - Codfw HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=codfw&var-cache_type=All&var-status_type=5 [18:34:03] RECOVERY - Upload HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [18:34:42] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [18:34:53] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:35:02] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:35:17] !log disable VC port 1/2 on asw2-c-eqiad:fpc3 (to fpc8) [18:35:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:27] (03CR) 10Mathew.onipe: [C: 031] wdqs: increase throttling limits for internal cluster [puppet] - 10https://gerrit.wikimedia.org/r/465652 (https://phabricator.wikimedia.org/T206648) (owner: 10Gehel) [18:54:13] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [18:54:33] (03PS3) 10Dzahn: site: turn mwmaint1001 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) [18:55:02] RECOVERY - MegaRAID on db1067 is OK: OK: optimal, 1 logical, 6 physical, WriteBack policy [18:55:35] (03CR) 10Dzahn: [C: 032] site: turn mwmaint1001 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/461492 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [19:03:46] (03PS1) 10Dzahn: scap/tcpircbot: remove mwmaint1001 from scap and allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/465681 (https://phabricator.wikimedia.org/T201343) [19:04:19] 10Operations, 10Patch-For-Review: Reallocate former image scalers - https://phabricator.wikimedia.org/T192457 (10Dzahn) [19:04:22] 10Operations, 10ops-eqiad, 10Datacenter-Switchover-2018, 10Patch-For-Review: rack/setup/install mwmaint1002.eqiad.wmnet - https://phabricator.wikimedia.org/T201343 (10Dzahn) [19:04:52] (03PS2) 10Dzahn: scap/tcpircbot: remove mwmaint1001 from scap and allowed hosts 
[puppet] - 10https://gerrit.wikimedia.org/r/465681 (https://phabricator.wikimedia.org/T201343) [19:05:22] 10Operations, 10netops: Intermittent connectivity issues in eqiad's row C - https://phabricator.wikimedia.org/T201139 (10ayounsi) There are 2 parallel issues here. 1/ IPv6 neighbor discovery randomly broken when igmp-snooping is enabled. This has been worked-around by disabling igmp-snooping yesterday T201039... [19:05:44] (03CR) 10Dzahn: [C: 032] "it's not a mw maintenance server anymore now" [puppet] - 10https://gerrit.wikimedia.org/r/465681 (https://phabricator.wikimedia.org/T201343) (owner: 10Dzahn) [19:09:58] If no one else is deploying anything (nothing is on the calendar), I'd like to run a scap sync to rebuild the i18n cache. [19:10:35] as there is an old message stuck in the cache for some reason. (I've verified it's updated on the deployment server.) [19:11:40] kaldari: should be good [19:11:45] thanks [19:11:53] !log scap sync to rebuild i18n cache [19:11:54] i just removed mwmaint1001 from scap "dsh" [19:11:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:12:01] it has been replaced by mwmaint1002 [19:12:20] hopefully you dont see any warnings about that [19:13:36] !log kaldari@deploy1001 Started scap: (no justification provided) [19:13:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:57] actually.. now that happened [19:15:38] PROBLEM - Filesystem available is greater than filesystem size on ms-be2041 is CRITICAL: cluster=swift device=/dev/sde1 fstype=xfs instance=ms-be2041:9100 job=node mountpoint=/srv/swift-storage/sde1 site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=ms-be2041&var-datasource=codfw%2520prometheus%252Fops [19:15:41] if you do see something about mwmaint1001 timing out then it's because it doesn't use the mw_maintenance role anymore .. 
but it should be gone next time for sure [19:15:58] thanks mutante [19:17:31] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10CCogdill_WMF) @Jgreen @BBlack thanks for bumping this and continuing to check. It does look like the SSL rating has bumped up to an A: https://www.ss... [19:22:18] (03PS1) 10Dzahn: mariadb: remove mwmaint1001 from prod-m5 SQL grants [puppet] - 10https://gerrit.wikimedia.org/r/465685 (https://phabricator.wikimedia.org/T201343) [19:25:58] (03PS1) 10Dzahn: network::constants: remove mwmaint1001 [puppet] - 10https://gerrit.wikimedia.org/r/465686 (https://phabricator.wikimedia.org/T201343) [19:26:54] (03PS2) 10Dzahn: Revert "mw_maintenance: temp hack to avoid duplicate crons on switch to eqiad" [puppet] - 10https://gerrit.wikimedia.org/r/465645 [19:28:18] (03PS2) 10Gergő Tisza: Fix Sentry DSN setting [mediawiki-config] - 10https://gerrit.wikimedia.org/r/465672 (https://phabricator.wikimedia.org/T206589) [19:30:00] (03PS1) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 [19:30:08] (03CR) 10jerkins-bot: [V: 04-1] Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (owner: 10Dzahn) [19:30:19] (03CR) 10Dzahn: "planned revert after it has done its job as a temp replacement" [dns] - 10https://gerrit.wikimedia.org/r/465689 (owner: 10Dzahn) [19:31:48] PROBLEM - HHVM rendering on mwdebug2002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [19:32:15] (03PS2) 10Dzahn: Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) [19:32:23] (03CR) 10jerkins-bot: [V: 04-1] Revert "rename wmf6936 from mw1297 to mwmaint1001" [dns] - 10https://gerrit.wikimedia.org/r/465689 (https://phabricator.wikimedia.org/T192457) (owner: 10Dzahn) [19:33:53] i see scap 
activity on deploy2001 [19:33:57] RECOVERY - HHVM rendering on mwdebug2002 is OK: HTTP OK: HTTP/1.1 200 OK - 75710 bytes in 1.521 second response time [19:34:02] ah :) [19:34:10] we are back in eqiad though [19:35:41] !log kaldari@deploy1001 Finished scap: (no justification provided) (duration: 22m 05s) [19:35:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:01] !log restarted ORES celery workers on ores2003 (~17:00), ores200* (17:05) [19:40:14] awight: Failed to log message to wiki. Somebody should check the error logs. [19:41:43] you don't say [19:42:21] haha [19:42:39] 10Operations, 10Domains, 10Traffic: SOA serial numbers returned by authoritative nameservers differ - https://phabricator.wikimedia.org/T206688 (10BBlack) SOA Serial values only have meaning to the administrators of a zone, and to servers with which they authorize legacy zone transfers. The registrar is nei... [19:43:08] awight: i wonder if it is related to ~, ( or * characters in the message [19:43:55] !log awight restarted ORES celery workers on ores2003 (~17:00), ores200* (17:05) [19:43:56] it could be because of a space between !log [19:43:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:01] heh [19:44:16] what did you do? [19:44:34] nothing, i first repeated the message to confirm it [19:44:37] /o\ [19:45:05] stashbot: why do you dislike awight ? [19:45:05] See https://wikitech.wikimedia.org/wiki/Tool:Stashbot for help. [19:45:21] I'll try not to take it personally! 
[19:46:09] try it again, just !log test or so [19:49:15] (03PS3) 10Cwhite: icinga: enable icinga service on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:50:26] (03PS1) 10Mforns: Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) [19:51:14] (03CR) 10jerkins-bot: [V: 04-1] Add druid_load jobs to analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns) [19:51:33] (03Abandoned) 10Mforns: Add druid_load.pp to refinery jobs [puppet] - 10https://gerrit.wikimedia.org/r/464833 (owner: 10Mforns) [19:54:13] (03CR) 10Dzahn: [C: 032] icinga: enable icinga service on icinga1001 [puppet] - 10https://gerrit.wikimedia.org/r/464088 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [19:58:21] !log icinga - enabled icinga service on icinga1001 (stretch), but all notifications are disabled [19:58:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:22] volans hi, https://github.com/wikimedia/puppet/blob/b347052863d4d2e87b37d6c2d9f44f833cfd9dc2/modules/icinga/files/raid_handler.py#L155 is using deprecated phabricator api. 
It should be using "search" not "query" [20:01:42] 10Operations, 10Phabricator: raid_handler.py using deprecated conduit api - https://phabricator.wikimedia.org/T206697 (10Paladox) [20:02:40] paladox: ack, thanks, we already have T159045 btw, I should find the time to update it [20:02:40] T159045: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 [20:02:57] oh i forgot about that heh [20:03:27] 10Operations, 10Operations-Software-Development, 10Phabricator, 10Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 (10Paladox) [20:03:31] 10Operations, 10Phabricator: raid_handler.py using deprecated conduit api - https://phabricator.wikimedia.org/T206697 (10Paladox) [20:03:54] 10Operations, 10Operations-Software-Development, 10Phabricator, 10Technical-Debt: Update Puppet repo code that uses maniphest.update and maniphest.createtask conduit api - https://phabricator.wikimedia.org/T159045 (10Paladox) Need to also migrate from "project.query" to "project.search" [20:08:13] (03PS1) 10Dzahn: base/nrpe: add icinga1001 to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465697 (https://phabricator.wikimedia.org/T202782) [20:09:17] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:11:30] (03PS2) 10Dzahn: base/nrpe: add icinga1001 to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465697 (https://phabricator.wikimedia.org/T202782) [20:13:14] 10Operations, 10hardware-requests, 10monitoring: hardware request - replacement for tegmen (icinga2001) - https://phabricator.wikimedia.org/T206563 (10faidon) 05Open>03declined We generally keep our servers for 1-2 more years past their warranty expiration (this puts 
their lifetime at ~5 years, rather th... [20:13:17] 10Operations, 10monitoring, 10Patch-For-Review: upgrade icinga server to stretch and replace einsteinium - https://phabricator.wikimedia.org/T202782 (10faidon) [20:15:11] (03CR) 10Dzahn: [C: 032] base/nrpe: add icinga1001 to allowed_hosts [puppet] - 10https://gerrit.wikimedia.org/r/465697 (https://phabricator.wikimedia.org/T202782) (owner: 10Dzahn) [20:17:16] !log upgrading releases-jenkins jenkins install on releases2001 [20:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:17] 10Operations, 10ops-eqiad, 10Cloud-VPS, 10Patch-For-Review, 10cloud-services-team (Kanban): rack/setup/install cloudstore1008 & cloudstore1009 - https://phabricator.wikimedia.org/T193655 (10Bstorm) {F26497155} cloudstore1008 appears to be stuck here. That's quite interesting, since it seems to have the... [20:19:35] (03CR) 10Cwhite: [C: 031] debian: ship systemd service [debs/prometheus-statsd-exporter] - 10https://gerrit.wikimedia.org/r/465350 (owner: 10Filippo Giunchedi) [20:19:43] !log upgrading releases-jenkins jenkins install on releases1001 [20:19:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:38] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 59 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts [20:32:20] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Paladox) We won't be able to use git lfs on cobalt as it is using jessie whereas the git-lfs package is in stretch-backports+. We could enforce it so users can only upload there i... 
[20:32:29] 10Operations, 10MediaWiki-extensions-CodeReview, 10Wikimedia-production-error: Exec error "Possibly missing executable file: svn diff" from Special:Code - https://phabricator.wikimedia.org/T204801 (10Krinkle) p:05Normal>03Low
[20:34:13] !log upgrading ci jenkins install on contint1001
[20:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:43:40] (03CR) 10Cwhite: [C: 031] Stop the diamond service when removing Diamond [puppet] - 10https://gerrit.wikimedia.org/r/465440 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:44:13] (03CR) 10Cwhite: [C: 031] Remove obsolete Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465601 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:45:28] (03CR) 10Cwhite: [C: 031] Remove all absented Diamond collectors [puppet] - 10https://gerrit.wikimedia.org/r/465600 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:45:57] (03CR) 10Cwhite: [C: 031] Remove now obsolete Diamond collector [puppet] - 10https://gerrit.wikimedia.org/r/465596 (https://phabricator.wikimedia.org/T183454) (owner: 10Muehlenhoff)
[20:47:37] PROBLEM - puppet last run on ores1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[set debconf flag seen for wireshark-common/install-setuid]
[20:51:40] RoanKattouw: Is https://phabricator.wikimedia.org/T204291 patch okay to backport?
[20:52:38] RECOVERY - puppet last run on ores1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[20:53:02] Krinkle: Yes, feel free
[20:56:57] RoanKattouw: k, will do :)
[20:58:43] (03CR) 10Ottomata: [C: 031] "Nice!" [puppet] - 10https://gerrit.wikimedia.org/r/465692 (https://phabricator.wikimedia.org/T206342) (owner: 10Mforns)
[21:00:59] 10Operations, 10MediaWiki-File-management, 10MediaWiki-Uploading, 10Multimedia, and 4 others: PHP Warning "Unable to delete stat cache" from file uploads - https://phabricator.wikimedia.org/T205567 (10Krinkle)
[21:13:18] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 33 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:20:37] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 74 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:25:47] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 35 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:32:51] 10Operations, 10netops: Enable access from icinga1001 to mgmt interfaces - https://phabricator.wikimedia.org/T206704 (10colewhite)
[21:32:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 42 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[21:33:29] 10Operations, 10Operations-Software-Development, 10Phabricator, 10Technical-Debt: Update Puppet repo code that uses deprecated maniphest.update/.createtask/.query Conduit API - https://phabricator.wikimedia.org/T159045 (10Aklapper)
[21:42:09] * Krinkle staging on mwdebug2001
[21:42:36] Krinkle: we're on eqiad now ;)
[21:42:41] codfw db are RO
[21:42:42] right
[21:42:44] * Krinkle staging on mwdebug1001
[21:42:57] :)
[21:43:00] :)
[21:45:41] !log Add icinga1001 to mr* security policies - T206704
[21:45:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:45:45] T206704: Enable access from icinga1001 to mgmt interfaces - https://phabricator.wikimedia.org/T206704
[21:48:50] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/Echo/includes/DiscussionParser.php: T204291 - Ia5323b401b94 (duration: 00m 51s)
[21:48:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:48:53] T204291: Fatal error "request has exceeded memory limit" from Echo DiscussionParser - https://phabricator.wikimedia.org/T204291
[21:49:54] XioNoX: wow, thank you !
[21:50:06] i was about to make a ticket for that
[21:50:55] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/ContentTranslation/specials/SpecialContentTranslation.php: T205433 - Ib34b28c5bb114c (duration: 00m 49s)
[21:51:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:05] T205433: Exception "User account is not global" on Special:ContentTranslation with lang/target params - https://phabricator.wikimedia.org/T205433
[21:51:55] mutante: https://phabricator.wikimedia.org/T206704 :)
[21:52:07] i just saw:) yay
[21:54:27] it's working. our new icinga can now talk to mgmt interfaces
[21:54:43] at least the counter is going down
[21:55:03] mutante: should be pushed everywhere now
[21:55:49] ok, thanks, it will take a little while until it catches up but it's happening :)
[21:58:29] 10Operations, 10netops: Enable access from icinga1001 to mgmt interfaces - https://phabricator.wikimedia.org/T206704 (10ayounsi) 05Open>03Resolved a:03ayounsi Management firewall policies updates.
[21:59:49] (03CR) 10Gehel: "This looks reasonable to me. But let's see get more feedback before moving forward with a fleet wide change" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[22:05:56] (03CR) 10Gehel: [C: 031] "LGTM, puppet compiler agrees this is a noop: https://puppet-compiler.wmflabs.org/compiler1002/12857/" [puppet] - 10https://gerrit.wikimedia.org/r/465566 (https://phabricator.wikimedia.org/T206597) (owner: 10Mathew.onipe)
[22:10:01] (03CR) 10Gehel: "Definitely more compact, not sure it is more readable (at least not to me)." [software/cumin] - 10https://gerrit.wikimedia.org/r/465611 (owner: 10Volans)
[22:10:03] (03CR) 10Volans: "my 2 cents inline" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/465450 (https://phabricator.wikimedia.org/T206114) (owner: 10Mathew.onipe)
[22:10:06] (03CR) 10Krinkle: [C: 032] profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) (owner: 10Krinkle)
[22:11:16] 10Operations, 10DBA, 10MediaWiki-General-or-Unknown, 10Wikidata, and 3 others: Investigate decrease in wikidata dispatch times due to eqiad -> codfw DC switch - https://phabricator.wikimedia.org/T205865 (10hoo) >>! In T205865#4655143, @Addshore wrote: > Reversing this experiment now that we have switched b...
[22:13:08] (03Merged) 10jenkins-bot: profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) (owner: 10Krinkle)
[22:14:46] * Krinkle staging on mwdebug1001
[22:16:14] !log krinkle@deploy1001 Synchronized wmf-config/arclamp.php: T206092 - If607ad111a (duration: 00m 48s)
[22:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:16:17] T206092: profiler.php sometimes emits RedisException "read error on connection" during request shutdown - https://phabricator.wikimedia.org/T206092
[22:16:55] (03CR) 10jenkins-bot: profiler: Prevent flush from fataling a request [mediawiki-config] - 10https://gerrit.wikimedia.org/r/464178 (https://phabricator.wikimedia.org/T206092) (owner: 10Krinkle)
[22:20:37] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10greg) >>! In T191183#4656413, @Paladox wrote: > We won't be able to use git lfs on cobalt as it is using jessie whereas the git-lfs package is in stretch-backports+. That's simply...
[22:25:37] !log icinga1001 - chmod 2710 /var/lib/icinga/rw
[22:25:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:28:38] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 28 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:35:57] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 39 probes of 320 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23RIPE_alerts
[22:38:54] Amir1: rolling out now
[22:39:04] on it
[22:39:37] staged on mwdebug1001 for sanity check but presumably nothing to verify
[22:39:57] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 60.00% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[22:41:09] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/ORES/includes/FetchScoreJob.php: T204753 - Icc28230585bc (duration: 00m 49s)
[22:41:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:41:12] T204753: Failed executing job: ORESFetchScoreJob - https://phabricator.wikimedia.org/T204753
[22:43:36] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/includes/specials/SpecialDeletedContributions.php: T187619 - Ic6b0d8020553 (duration: 00m 48s)
[22:43:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:39] T187619: Extra trailing spaces after IP in Special:DeletedContributions trigger MediaWiki internal error - https://phabricator.wikimedia.org/T187619
[22:48:37] PROBLEM - DPKG on ores1001 is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[22:51:32] halfak: ^
[22:51:48] RECOVERY - DPKG on ores1001 is OK: All packages OK
[22:52:22] Krinkle: it's still happening
[22:56:38] Amir1: Yeah, JobExecutor still shows "Failed ORESFetchScoreJob"
[22:56:45] but I assume that's not surprising because the patch returned false.
[22:56:55] Which means mark as failed, and (if allowed) let it retry later.
[22:57:05] But it no longer has an error attached, and no exception channel message
[22:57:12] right?
[22:59:01] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/extensions/MultimediaViewer/: T206099 - I53dbce0a (duration: 00m 49s)
[22:59:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:59:05] T206099: MultimediaViewer should not use deprecated jquery.hidpi module - https://phabricator.wikimedia.org/T206099
[23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20181010T2300).
[23:00:04] No GERRIT patches in the queue for this window AFAICS.
[23:08:05] !log krinkle@deploy1001 Synchronized php-1.32.0-wmf.24/maintenance/resources/foreign-resources.yaml: Ic865e7077d (duration: 00m 49s)
[23:08:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:15:44] 10Operations, 10Gerrit, 10Traffic, 10Patch-For-Review: Enable avatars in gerrit - https://phabricator.wikimedia.org/T191183 (10Tgr) UX-wise a single central place for profile images is obviously preferable, so using Phabricator makes a lot of sense. (Having some way to store a profile image in your Wikimed...
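[editor's note] The retry semantics discussed at 22:56 — a job whose run() returns false is marked failed and, if the queue allows it, re-enqueued for a later attempt, without logging an exception — can be sketched as a generic model. This is a hypothetical illustration of that queue behavior, not MediaWiki's actual JobQueue or the ORES FetchScoreJob code; the function and parameter names are invented for the example.

```python
from collections import deque

def run_queue(jobs, max_retries=3):
    """Hypothetical job-queue model: a job returning False is marked
    failed and retried later, up to max_retries attempts; no exception
    is raised or logged for a plain False return."""
    queue = deque((job, 0) for job in jobs)  # (callable, attempts so far)
    results = []
    while queue:
        job, attempts = queue.popleft()
        if job():
            results.append(("done", attempts))
        elif attempts + 1 < max_retries:
            queue.append((job, attempts + 1))  # failed: re-enqueue quietly
        else:
            results.append(("failed", attempts + 1))  # retries exhausted
    return results
```

Under this model, a job that fails transiently (e.g. a backend timeout) eventually completes on a retry, while a persistently failing job surfaces only as a "failed" status rather than an exception-channel message, matching what Krinkle describes above.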
[23:25:11] 10Operations, 10ORES, 10Scap, 10Epic, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (10awight)
[23:25:19] 10Operations, 10ORES, 10Scap, 10Epic, and 2 others: [Epic] ORES should use a git large file plugin for storing serialized binaries - https://phabricator.wikimedia.org/T171619 (10awight)
[23:30:59] Krinkle: oh good
[23:44:08] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 35.71% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1