[00:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T0000).
[00:07:15] Operations, SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (Dzahn) After further discussion: - I removed the user block again. - I re-reversed the email address in SQL. - Somebody/something else blanked the password field, i did not, it existed just a little while...
[00:12:04] !log deploying phabricator upgrade
[00:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:22] Operations, netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (ayounsi) p:Triage→Normal
[00:29:02] Prod clear? Want to fix an UBN train blocker tonight.
[00:41:39] James_F: sure
[00:42:23] RECOVERY - tools project instance distribution on cloudcontrol1003 is OK: OK: All critical instances are spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:42:24] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Socket timeout on wdqs.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T217557 (Smalyshev) @Gehel any input on this?
[00:45:54] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/PageTriage/: UBN Fix for pageTriage and ORES T220649 (duration: 01m 04s)
[00:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:58] T220649: Testwiki pagetriage throwing "Class undefined: ORES\ORESServices" - https://phabricator.wikimedia.org/T220649
[00:46:52] twentyafterfour: Thanks. All yours (or whomever's).
[00:51:48] (CR) Krinkle: profile::mediawiki::php: conform error reporting levels to HHVM (1 comment) [puppet] - https://gerrit.wikimedia.org/r/486485 (https://phabricator.wikimedia.org/T211488) (owner: Giuseppe Lavagetto)
[01:03:43] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:07:18] anyone here with +staff globally?
[01:07:35] someone needs to edit gadget namespace to fix some stuff on enwiki
[01:07:45] and only staff can edit the namespace; not even stewards
[01:09:11] Are you sure you need to edit the Gadget namespace on enwiki Vermont?
[01:09:24] oh hi krenair :)
[01:09:26] https://meta.wikimedia.org/wiki/Steward_requests/Miscellaneous#Edits_to_gadget_namespace
[01:09:28] It only appears to contain a single page which is a redirect
[01:09:35] See this request.
[01:09:47] Neither stews nor GS's, the people who tend to monitor that page, can handle it.
[01:10:27] hmm
[01:11:16] Looks like I can't even do it logged in with GEI
[01:11:41] might be easiest for a steward to just add the right to the global steward group
[01:12:32] it seems to be a separate thing to simple interface editing
[01:12:45] hm, forgot stews can do that
[01:13:18] thanks :)
[01:13:52] good luck
[01:14:01] I don't remember where stuff ended up with the gadget namespace
[01:14:57] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:17:46] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Krinkle)
[01:19:26] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Krinkle) Per @aaron, the Redis settings don't affect us. Most of phpredis isn't INI-configurable, it's passed at run-time. The only parts of...
[01:45:07] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:02:59] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:04:03] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:14:01] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:36:41] PROBLEM - ensure kvm processes are running on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [03:04:05] RECOVERY - ensure kvm processes are running on labvirt1007 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [03:10:29] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:28:33] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:39:41] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:45:13] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:53:13] 10Operations, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Cosine02) I wonder if this function is feasible, as I saw on wikimedia commons, [[https://commons.wikimedia.org/wiki/Commons:Upload... [04:03:48] PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: paws class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:05:24] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:09:52] (03PS3) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) [04:10:05] (03PS3) 10Marostegui: mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) [04:14:11] !log Disable GTID on s3 hosts - https://phabricator.wikimedia.org/T219115 [04:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:49] !log Start topology changes to move s3 slaves under db1075 T219115 [04:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:53] T219115: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 [04:32:22] !log Disable puppet on db1078 and db1075 T219115 [04:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:32:27] T219115: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 [04:32:44] (03CR) 10Marostegui: mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:32:59] (03CR) 10Marostegui: [C: 03+2] 
mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:36:54] (03PS5) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) [04:41:23] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) I will check if the raid is on sda, because the host is correctly set to be allowed to be re-imaged: ` db1114|db... [04:48:02] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:50:59] s3 failover is happening in 10 minutes, we will take over puppet and mediawiki deployments, please coordinate with us before deploying anything. We will communicate when it is fine to deploy normally again [04:51:25] (03CR) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:52:27] Going to +2 but not deploy, so I can create the revert, rebase the other change etc [04:52:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:53:18] (03PS3) 10Marostegui: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) [04:53:38] (03Merged) 10jenkins-bot: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:53:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502940 [04:54:09] 
(03CR) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:54:14] (03PS6) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) [04:54:56] Going to +2 the promotion of db1075 but NOT merge on deploy1001 [04:56:29] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:57:45] (03Merged) 10jenkins-bot: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:57:57] (03PS2) 10Marostegui: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502940 [04:59:44] (03CR) 10jenkins-bot: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:59:46] (03CR) 10jenkins-bot: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [05:00:04] marostegui and jynus: How many deployers does it take to do s3 database master failover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T0500). [05:00:07] jynus: ready? 
[05:00:11] yeah
[05:00:14] let's go
[05:00:18] !log Starting s3 failover from db1078 to db1075 - T219115
[05:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:24] T219115: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115
[05:01:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s3 on read-only T219115 (duration: 00m 37s)
[05:01:06] we are on RO
[05:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:21] confirmed^
[05:01:29] failing over
[05:01:32] done
[05:01:59] looks good
[05:02:05] confirmed switch
[05:02:05] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/502940 (owner: Marostegui)
[05:02:09] ^ not merging
[05:02:21] promoting db1078 on mediawiki
[05:02:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover s3 master eqiad from db1078 to db1075 T219115 (duration: 00m 36s)
[05:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:52] everything looking good so far
[05:02:55] let's remove read only?
[05:03:09] +1
[05:03:11] (Merged) jenkins-bot: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/502940 (owner: Marostegui)
[05:03:13] can see the change on noc
[05:03:20] deploying
[05:03:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove s3 ready only T219115 (duration: 00m 36s)
[05:03:55] we are RW
[05:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:58] checking
[05:03:58] very few errors on log
[05:04:10] I can edit
[05:05:02] there was a spike of open connections, seems now gone
[05:05:05] yeah
[05:05:20] I can also see stuff going on db1075 just fine
[05:05:31] reads doesn't seem recovered yet
[05:05:37] maybe jobqueue?
[05:05:49] but it is going up
[05:05:58] I see reads on the new slave (db1078)
[05:06:47] yeah, just not as many as before
[05:07:02] they are recovering
[05:07:49] I see no hard errors, though
[05:08:27] yeah, it is the jobqueue not knowing what to do with replication
[05:08:51] it is stuck on the old master?
[05:09:47] (CR) Marostegui: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:09:49] (CR) Marostegui: [C: +2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:10:57] (CR) jenkins-bot: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/502940 (owner: Marostegui)
[05:11:44] Deployments on puppet and mediawiki-config can now proceed as normal
[05:12:20] I think db1078 got a bit overloaded at first
[05:12:25] yeah
[05:12:29] I saw it spiking a lot
[05:12:35] just even from the processlist
[05:12:38] like: omg!
[05:13:02] we need to change query killer
[05:13:08] the event I mean
[05:13:12] maybe next time we need to spread load more
[05:13:23] yeah, maybe it was too cold
[05:13:52] and definitely not to the old master
[05:14:08] Operations, ops-eqiad, DBA, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui)
[05:14:28] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:14:35] there were also statistics ongoing
[05:14:42] those may need to change
[05:15:14] jobqueue seems happier now
[05:16:03] I guess it rereads the config from time to time
[05:16:15] I am changing the query killer event
[05:17:02] jynus: your switchover script is <3 :)
[05:17:54] I am checking edit rates and performance metrics
[05:18:27] there was some slowdown
[05:19:59] tendril and zarcillo updated
[05:21:39] Operations, ops-eqiad, DBA, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) @Cmjohnson can we schedule the BBU replacement for Monday 15th? db1078 is no longer a master. The failover was performed successfully: Times in UTC:...
[05:21:48] I don't see performance issues, only a deployment-related regression some days ago
[05:22:23] yeah, I think we are good
[05:22:32] Next time we have to either warm up the old master or give it less load
[05:27:36] PROBLEM - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 67.01 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops
[05:28:34] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (Marostegui) The raid is sdb and we need it to be sda for db.cfg to work: ` Disk /dev/sdb: 3.5 TiB, 3840699359232 bytes, 7501365936 s...
[05:33:07] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (jcrespo) We of course could make sdb work, but that would make this servers special, compared to the rest. Maybe a disk was not adde...
[05:37:29] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) >>! In T211488#5102911, @Krinkle wrote: > Exported from mwdebug1001 in plain text and sorted. Full dumps at P8387 and P8386. > > ###...
[05:37:43] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe)
[05:41:48] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (Marostegui) So, I have been checking out the RAID menu on the controller, but unfortunately over `vsp` it doesn't show most of the o...
[05:48:50] PROBLEM - EDAC syslog messages on wtp2013 is CRITICAL: 17 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops
[05:57:52] (CR) Marostegui: mariadb: Allow new option --stop-slave for xtrabackup transfers (2 comments) [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: Jcrespo)
[06:06:42] (PS1) Elukey: turnilo: remove tbayer_popups from config [puppet] - https://gerrit.wikimedia.org/r/502944 (https://phabricator.wikimedia.org/T220575)
[06:20:53] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) >>! In T211488#5102911, @Krinkle wrote: > * [ ] Filesytem > > Much lower. Don't know if it matters? > > `lang=diff > -default_socket...
[06:21:24] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe)
[06:24:38] !log upgrading remaining API Servers to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069)
[06:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:24:44] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069
[06:31:06] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R]
[06:32:30] !log uploaded jenkins 2.164.2 to apt.wikimedia.org (jessie-wikimedia / thirdparty)
[06:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:33:13] !log uploaded jenkins 2.164.2 to apt.wikimedia.org (stretch-wikimedia / thirdparty/ci)
[06:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:34:17] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) Error handling was discussed in https://phabricator.wikimedia.org/T211488#4908305 and the followups. The most notable difference is HHV...
[06:34:58] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe)
[06:57:26] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:14:10] Operations, Patch-For-Review, cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (MoritzMuehlenhoff)
[07:19:48] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (Marostegui) I have been trying to check if there is something else defined on a storage level but it is impossible to see anything w...
[07:26:11] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Socket timeout on wdqs.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T217557 (Gehel) Open→Resolved a:Gehel I don't think there is anything actionable at this point. Let's close.
[07:29:59] (PS2) Gehel: Enable revisions support on internal clusters [puppet] - https://gerrit.wikimedia.org/r/502909 (https://phabricator.wikimedia.org/T217897) (owner: Smalyshev)
[07:40:16] (CR) Gehel: [C: +2] Enable revisions support on internal clusters [puppet] - https://gerrit.wikimedia.org/r/502909 (https://phabricator.wikimedia.org/T217897) (owner: Smalyshev)
[07:43:04] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) Also: - PCRE JIT is enabled by default in HHVM and we definitely want it in php7 as well. - `include_path` in php's ini is still set...
[07:55:42] (03PS9) 10Muehlenhoff: Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) [07:59:14] (03CR) 10Gehel: [C: 03+2] elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [07:59:57] (03CR) 10Gehel: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/502220 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [08:00:10] (03CR) 10jenkins-bot: elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [08:04:45] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:10:18] (03PS1) 10Mathew.onipe: icinga: add remote cluster check for elastic [puppet] - 10https://gerrit.wikimedia.org/r/502950 (https://phabricator.wikimedia.org/T218932) [08:12:28] (03PS3) 10Elukey: role::druid::public::worker: set stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/502814 (https://phabricator.wikimedia.org/T219910) [08:15:05] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: set stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/502814 (https://phabricator.wikimedia.org/T219910) (owner: 10Elukey) [08:19:03] !log roll restart of druid-broker/historical on druid100[4-6] to pick up new settings [08:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:25] !log upgrading remaining job runners to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [08:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:38] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [08:22:25] 
(03PS1) 10Elukey: role::druid::public::worker: raise max direct mem for historical [puppet] - 10https://gerrit.wikimedia.org/r/502951 (https://phabricator.wikimedia.org/T219910) [08:22:34] (03PS3) 10Mathew.onipe: maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) [08:23:45] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: raise max direct mem for historical [puppet] - 10https://gerrit.wikimedia.org/r/502951 (https://phabricator.wikimedia.org/T219910) (owner: 10Elukey) [08:25:13] (03CR) 10Mathew.onipe: maps migrate maps2002 to stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [08:27:59] (03CR) 10Gehel: [C: 04-1] "There are still a few changes to maps2001: https://puppet-compiler.wmflabs.org/compiler1002/15686/maps2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [08:33:23] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [08:43:14] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Also, the value of `doc_root` for HHVM doesn't seem to be set from ini settings, so I'll have to dig up where that happens. [08:50:20] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Regarding `enable_dl`: i just verified `dl()` does work under HHVM - although I suspect it does nothing. So I would be careful in disab... 
[08:50:22] Operations, Wikimedia-Mailing-lists: Change ownership of wikimania-program@lists.wikimedia.org - https://phabricator.wikimedia.org/T220641 (fgiunchedi) We certainly can! (I'm on SRE clinic duty this week, hence handing ML requests too) Which email address should we be adding to wikimania-program list? A...
[08:52:24] Operations, PHP 7.0 support, Performance-Team (Radar): Set `enable_dl` to 0 in php.ini - https://phabricator.wikimedia.org/T220681 (Joe)
[08:53:15] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe)
[09:01:25] !log deployment servers to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069)
[09:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:05] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069
[09:03:46] (PS4) Mathew.onipe: maps migrate maps2002 to stretch [puppet] - https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622)
[09:06:24] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) >>! In T211488#5102911, @Krinkle wrote: > * [ ] File Uploads & Data Input > > `lang=diff > -upload_tmp_dir = /tmp > +upload_tmp_dir =...
[09:06:56] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [09:07:45] (03CR) 10Mathew.onipe: "@gehel seems we are good now: https://puppet-compiler.wmflabs.org/compiler1002/15687/maps2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:10:12] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10MoritzMuehlenhoff) [09:17:16] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [09:20:22] (03CR) 10Jcrespo: [C: 03+1] mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:21:28] (03PS1) 10Arturo Borrero Gonzalez: labtestnet2002: cleanup [dns] - 10https://gerrit.wikimedia.org/r/502955 (https://phabricator.wikimedia.org/T220426) [09:22:10] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/15691/weblog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/493243 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [09:22:18] (03PS4) 10Filippo Giunchedi: logging: move webrequest-5xx to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/493243 (https://phabricator.wikimedia.org/T213899) [09:22:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2002: cleanup [dns] - 10https://gerrit.wikimedia.org/r/502955 (https://phabricator.wikimedia.org/T220426) (owner: 10Arturo Borrero Gonzalez) [09:24:35] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) [09:27:57] (03PS1) 10Jbond: offboard: remove wikimedia email [puppet] - 
10https://gerrit.wikimedia.org/r/502957 [09:28:35] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) 05Open→03Resolved [09:33:00] (03CR) 10Jbond: [C: 03+2] offboard: remove wikimedia email [puppet] - 10https://gerrit.wikimedia.org/r/502957 (owner: 10Jbond) [09:33:08] (03PS2) 10Jbond: offboard: remove wikimedia email [puppet] - 10https://gerrit.wikimedia.org/r/502957 [09:35:10] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Regarding mail: both paths actually link to `/usr/sbin/exim4` on our systems. [09:35:23] (03PS8) 10Muehlenhoff: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 [09:36:36] (03PS9) 10Muehlenhoff: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 [09:37:47] (03CR) 10Muehlenhoff: [C: 03+2] mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:38:19] (03PS5) 10Gehel: maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:39:08] (03CR) 10Gehel: [C: 03+2] maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:41:53] (03PS1) 10Elukey: role::druid::public::worker: reduce heap size for coord/overlord [puppet] - 10https://gerrit.wikimedia.org/r/502958 (https://phabricator.wikimedia.org/T219910) [09:41:56] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:42:32] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2002.codfw.wmn... [09:42:38] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:42:58] (03PS1) 10Muehlenhoff: Fix up installation mariadb-backup [puppet] - 10https://gerrit.wikimedia.org/r/502959 [09:43:10] PROBLEM - puppet last run on db1110 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:10] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:22] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:28] PROBLEM - puppet last run on db2079 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:28] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:38] PROBLEM - puppet last run on db1119 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:46] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:48] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:45:03] (03PS2) 10Muehlenhoff: Fix up installation of mariadb-backup [puppet] - 10https://gerrit.wikimedia.org/r/502959 [09:45:26] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:45:55] (03CR) 10Muehlenhoff: [C: 03+2] Fix up installation of mariadb-backup [puppet] - 10https://gerrit.wikimedia.org/r/502959 (owner: 10Muehlenhoff) [09:46:18] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:28] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:38] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:44] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:48] PROBLEM - puppet last run on db2083 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:54] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:02] PROBLEM - puppet last run on db1109 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:06] PROBLEM - puppet last run on db1103 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:22] PROBLEM - puppet last run on db1115 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:26] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:50] PROBLEM - puppet last run on db1098 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:58] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:04] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:06] PROBLEM - puppet last run on db1116 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:06] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:12] PROBLEM - puppet last run on db1081 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:26] PROBLEM - puppet last run on db2093 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:26] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on pc2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:02] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:02] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:12] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:20] PROBLEM - puppet last run on db2091 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:24] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:28] PROBLEM - puppet last run on db2078 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:38] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:50:04] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:50:34] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:50:38] (03CR) 10Volans: "To make the review process a bit less abstract of this and the Kernels refactor patch, I've applied this series up to this commit to the t" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [09:50:42] PROBLEM - puppet last run on dbprov2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:51:34] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:51:38] PROBLEM - puppet last run on db2077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:02] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:52:06] PROBLEM - puppet last run on db2092 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:10] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:16] PROBLEM - puppet last run on es2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:24] PROBLEM - puppet last run on es2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:30] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:02] PROBLEM - puppet last run on db2073 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:53:10] (03PS2) 10Elukey: role::druid::public::worker: reduce heap size for coord/overlord [puppet] - 10https://gerrit.wikimedia.org/r/502958 (https://phabricator.wikimedia.org/T219910) [09:53:10] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:53:22] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:30] RECOVERY - puppet last run on db1081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:44] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: reduce heap size for coord/overlord [puppet] - 10https://gerrit.wikimedia.org/r/502958 (https://phabricator.wikimedia.org/T219910) (owner: 10Elukey) [09:53:48] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:53:48] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:08] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) As far as serialisation goes: * Igbinary is installed because the memcached extension and (IIRC) the apcu extension have it as a requi... 
[09:54:18] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:54:19] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [09:55:20] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:55:20] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:55:22] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:55:50] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:55:58] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:56:56] RECOVERY - puppet last run on db2077 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:56:58] (03CR) 10Vgutierrez: [C: 03+1] cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [09:57:04] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:57:04] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:57:06] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:57:07] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:57:19] !log roll restart druid-coordinator/overlord on druid100[4-6] to pick up new jvm settings 
[09:57:26] RECOVERY - puppet last run on db2083 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:57:30] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:57:30] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:22] RECOVERY - puppet last run on db2073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:58:52] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:59:02] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:59:46] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:00:04] RECOVERY - puppet last run on db2078 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:00:18] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [10:00:18] RECOVERY - puppet last run on db2079 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:00:20] <_joe_> gehel: should I depool maps in codfw? [10:00:28] RECOVERY - puppet last run on db1119 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:00:45] _joe_: give me 1 minute [10:00:50] <_joe_> sure! 
[10:00:57] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1326 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:01:03] <_joe_> lol [10:01:05] <_joe_> :D [10:01:10] <_joe_> literally 7 seconds [10:01:19] :) [10:01:50] I'm not entirely sure that we are out of trouble yet :/ [10:02:17] <_joe_> ok [10:02:21] <_joe_> lmk if I can help [10:02:28] sure, I'll shout if needed [10:02:42] RECOVERY - puppet last run on db2092 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:02:58] RECOVERY - puppet last run on db1109 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:03:02] RECOVERY - puppet last run on db1103 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:03:18] RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:04:02] RECOVERY - puppet last run on db1116 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:04:20] (03PS1) 10Muehlenhoff: Also fix mariadb-backup installation for mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/502960 [10:04:22] RECOVERY - puppet last run on db2093 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:04:22] RECOVERY - puppet last run on db1110 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:04:35] I am going to upgrade Jenkins [10:05:06] RECOVERY - puppet last run on db2091 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:05:19] (03CR) 10Muehlenhoff: [C: 03+2] Also fix mariadb-backup installation for mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/502960 (owner: 10Muehlenhoff) [10:05:27] waiting for some job to finish [10:06:17] (03CR) 10Jbond: "> Patch Set 5:" [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [10:08:22] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:08:50] RECOVERY - puppet last run on db1098 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:09:00] _joe_: we're having an issue with cassandra, can you depool maps codfw while I dig more into it? [10:09:02] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:09:12] <_joe_> gehel: sure [10:09:17] thanks! [10:09:26] RECOVERY - puppet last run on es2015 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [10:09:54] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [10:10:01] <_joe_> it will take a few minutes for the public endpoints though [10:10:12] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:10:45] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [10:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:34] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:12:54] RECOVERY - puppet last run on es2014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:13:04] RECOVERY - puppet last run on es2011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:13:52] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:13:56] RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago 
with 0 failures [10:13:58] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:14:30] RECOVERY - puppet last run on pc2008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:14:41] (03PS1) 10Volans: Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 [10:14:45] !log remove maps2001 from new cassandra cluster -T198622 [10:14:52] RECOVERY - puppet last run on mwmaint1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:14:58] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:03] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [10:15:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 (owner: 10Volans) [10:15:18] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:15:48] _joe_: do we need to disable puppet on all cp* and then apply in the right order right? [10:15:59] ema: too [10:16:06] <_joe_> volans: not in this case, no [10:16:11] <_joe_> we're just depooling one site [10:16:18] <_joe_> not switching between them [10:16:20] ack [10:16:22] right [10:16:24] RECOVERY - puppet last run on dbprov2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [10:16:27] <_joe_> we basically only need to run puppet in codfw [10:16:30] * volans paranoid on that [10:17:19] waiting for ema's confirmation? 
[10:17:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:17:29] volans: what's up [10:17:42] ema can you check https://gerrit.wikimedia.org/r/c/operations/puppet/+/502961 plese? [10:17:46] sure [10:17:55] goal depool maps from codfw [10:18:08] due to issues ge.hel is working on to fix [10:18:11] (03CR) 10Ema: [C: 03+1] Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 (owner: 10Volans) [10:18:21] (03CR) 10Volans: [C: 03+2] Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 (owner: 10Volans) [10:18:29] volans: looks good, we'll need to also depool karto codfw in ats [10:18:33] <_joe_> https://puppet-compiler.wmflabs.org/compiler1002/15692/cp2002.codfw.wmnet/ [10:18:36] ema: I need to just run puppet on codfw upload cp* right? [10:18:41] <_joe_> ema: HAH, plot twist [10:18:43] ahahah [10:18:44] right [10:18:51] but that's via discovery? [10:18:54] yes [10:18:55] <_joe_> ema: didn't ATS use discovery recods? [10:18:56] nice [10:19:07] <_joe_> if so, it's already depooled [10:19:08] I'm merging this one [10:19:15] kartotherian.discovery.wmnet [10:19:22] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:19:49] <_joe_> ema: that now points to eqiad [10:19:56] perfect [10:20:00] <_joe_> ema: you don't have tls termination there though, right? [10:20:09] ema ok to run sudo cumin -b 8 'A:cp-upload_codfw' 'run-puppet-agent' [10:20:12] ? 
[10:20:38] volans: +1 [10:20:59] !log forcing puppet run on A:cp-upload_codfw [10:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:15] _joe_: yes we use https://kartotherian.discovery.wmnet as the origin [10:22:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:23:00] (03CR) 10Arturo Borrero Gonzalez: "Adding Andrew and Alex as reviewers since they may have more knowledge of the status of puppetmasters within Cloud VPS." [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:23:38] <_joe_> ema: oh great [10:23:45] so... there was a timing issue in our upgrade procedure on maps codfw, the old and new cassandra clusters were not isolated during a few minutes and discovered each others [10:23:49] puppet run completed [10:24:16] this is going to take a while to repair, but there should be no long lasting damage [10:24:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:30:12]
