[00:00:04] twentyafterfour: Your horoscope predicts another unfortunate Phabricator update deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T0000).
[00:07:15] Operations, SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (Dzahn) After further discussion: - I removed the user block again. - I re-reversed the email address in SQL. - Somebody/something else blanked the password field, i did not, it existed just a little while...
[00:12:04] !log deploying phabricator upgrade
[00:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:22] Operations, netops: RPKI Validation - https://phabricator.wikimedia.org/T220669 (ayounsi) p:Triage→Normal
[00:29:02] Prod clear? Want to fix an UBN train blocker tonight.
[00:41:39] James_F: sure
[00:42:23] RECOVERY - tools project instance distribution on cloudcontrol1003 is OK: OK: All critical instances are spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting
[00:42:24] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Socket timeout on wdqs.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T217557 (Smalyshev) @Gehel any input on this?
[00:45:54] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/PageTriage/: UBN Fix for pageTriage and ORES T220649 (duration: 01m 04s)
[00:45:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:45:58] T220649: Testwiki pagetriage throwing "Class undefined: ORES\ORESServices" - https://phabricator.wikimedia.org/T220649
[00:46:52] twentyafterfour: Thanks. All yours (or whomever's).
[00:51:48] (CR) Krinkle: profile::mediawiki::php: conform error reporting levels to HHVM (1 comment) [puppet] - https://gerrit.wikimedia.org/r/486485 (https://phabricator.wikimedia.org/T211488) (owner: Giuseppe Lavagetto)
[01:03:43] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:07:18] anyone here with +staff globally?
[01:07:35] someone needs to edit gadget namespace to fix some stuff on enwiki
[01:07:45] and only staff can edit the namespace; not even stewards
[01:09:11] Are you sure you need to edit the Gadget namespace on enwiki Vermont?
[01:09:24] oh hi krenair :)
[01:09:26] https://meta.wikimedia.org/wiki/Steward_requests/Miscellaneous#Edits_to_gadget_namespace
[01:09:28] It only appears to contain a single page which is a redirect
[01:09:35] See this request.
[01:09:47] Neither stews nor GS's, the people who tend to monitor that page, can handle it.
[01:10:27] hmm
[01:11:16] Looks like I can't even do it logged in with GEI
[01:11:41] might be easiest for a steward to just add the right to the global steward group
[01:12:32] it seems to be a separate thing to simple interface editing
[01:12:45] hm, forgot stews can do that
[01:13:18] thanks :)
[01:13:52] good luck
[01:14:01] I don't remember where stuff ended up with the gadget namespace
[01:14:57] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1
[01:17:46] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Krinkle)
[01:19:26] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Krinkle) Per @aaron, the Redis settings don't affect us. Most of phpredis isn't INI-configurable, it's passed at run-time. The only parts of...
[01:45:07] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:02:59] PROBLEM - Nginx local proxy to apache on mw1313 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [02:04:03] RECOVERY - Nginx local proxy to apache on mw1313 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 618 bytes in 0.112 second response time https://wikitech.wikimedia.org/wiki/Application_servers [02:14:01] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [02:36:41] PROBLEM - ensure kvm processes are running on labvirt1007 is CRITICAL: PROCS CRITICAL: 0 processes with regex args /usr/bin/kvm https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [03:04:05] RECOVERY - ensure kvm processes are running on labvirt1007 is OK: PROCS OK: 1 process with regex args /usr/bin/kvm https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [03:10:29] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:28:33] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:39:41] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:45:13] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [03:53:13] 10Operations, 10Wikimedia-Site-requests, 10Chinese-Sites, 10Community-consensus-needed: Enable "upload by url" feature at zhwiki - https://phabricator.wikimedia.org/T142991 (10Cosine02) I wonder if this function is feasible, as I saw on wikimedia commons, [[https://commons.wikimedia.org/wiki/Commons:Upload... [04:03:48] PROBLEM - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: paws class instances not spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [04:05:24] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [04:09:52] (03PS3) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) [04:10:05] (03PS3) 10Marostegui: mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) [04:14:11] !log Disable GTID on s3 hosts - https://phabricator.wikimedia.org/T219115  [04:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:49] !log Start topology changes to move s3 slaves under db1075 T219115 [04:18:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:18:53] T219115: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 [04:32:22] !log Disable puppet on db1078 and db1075 T219115 [04:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:32:27] T219115: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 [04:32:44] (03CR) 10Marostegui: mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:32:59] (03CR) 10Marostegui: [C: 03+2] 
mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:36:54] (03PS5) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) [04:41:23] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) I will check if the raid is on sda, because the host is correctly set to be allowed to be re-imaged: ` db1114|db... [04:48:02] PROBLEM - puppet last run on mw1296 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:50:59] s3 failover is happening in 10 minutes, we will take over puppet and mediawiki deployments, please coordinate with us before deploying anything. We will communicate when it is fine to deploy normally again [04:51:25] (03CR) 10Marostegui: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:52:27] Going to +2 but not deploy, so I can create the revert, rebase the other change etc [04:52:31] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:53:18] (03PS3) 10Marostegui: wmnet: Update s3-master alias [dns] - 10https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) [04:53:38] (03Merged) 10jenkins-bot: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:53:58] (03PS1) 10Marostegui: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502940 [04:54:09] 
(03CR) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:54:14] (03PS6) 10Marostegui: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) [04:54:56] Going to +2 the promotion of db1075 but NOT merge on deploy1001 [04:56:29] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:57:45] (03Merged) 10jenkins-bot: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:57:57] (03PS2) 10Marostegui: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502940 [04:59:44] (03CR) 10jenkins-bot: db-eqiad.php: Set s3 to read only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501480 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [04:59:46] (03CR) 10jenkins-bot: db-eqiad.php: Promote db1075 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501481 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [05:00:04] marostegui and jynus: How many deployers does it take to do s3 database master failover deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T0500). [05:00:07] jynus: ready? 
[05:00:11] yeah
[05:00:14] let's go
[05:00:18] !log Starting s3 failover from db1078 to db1075 - T219115
[05:00:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:00:24] T219115: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115
[05:01:05] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Set s3 on read-only T219115 (duration: 00m 37s)
[05:01:06] we are on RO
[05:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:01:21] confirmed^
[05:01:29] failing over
[05:01:32] done
[05:01:59] looks good
[05:02:05] confirmed switch
[05:02:05] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/502940 (owner: Marostegui)
[05:02:09] ^ not merging
[05:02:21] promoting db1078 on mediawiki
[05:02:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Switchover s3 master eqiad from db1078 to db1075 T219115 (duration: 00m 36s)
[05:02:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:02:52] everything looking good so far
[05:02:55] let's remove read only?
[05:03:09] +1
[05:03:11] (Merged) jenkins-bot: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/502940 (owner: Marostegui)
[05:03:13] can see the change on noc
[05:03:20] deploying
[05:03:55] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove s3 ready only T219115 (duration: 00m 36s)
[05:03:55] we are RW
[05:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:03:58] checking
[05:03:58] very few errors on log
[05:04:10] I can edit
[05:05:02] there was a spike of open connections, seems now gone
[05:05:05] yeah
[05:05:20] I can also see stuff going on db1075 just fine
[05:05:31] reads doesn't seem recovered yet
[05:05:37] maybe jobqueue?
[05:05:49] but it is going up
[05:05:58] I see reads on the new slave (db1078)
[05:06:47] yeah, just not as many as before
[05:07:02] they are recovering
[05:07:49] I see no hard errors, though
[05:08:27] yeah, it is the jobqueue not knowing what to do with replication
[05:08:51] it is stuck on the old master?
[05:09:47] (CR) Marostegui: wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:09:49] (CR) Marostegui: [C: +2] wmnet: Update s3-master alias [dns] - https://gerrit.wikimedia.org/r/501483 (https://phabricator.wikimedia.org/T219115) (owner: Marostegui)
[05:10:57] (CR) jenkins-bot: Revert "db-eqiad.php: Set s3 to read only" [mediawiki-config] - https://gerrit.wikimedia.org/r/502940 (owner: Marostegui)
[05:11:44] Deployments on puppet and mediawiki-config can now proceed as normal
[05:12:20] I think db1078 got a bit overloaded at first
[05:12:25] yeah
[05:12:29] I saw it spiking a lot
[05:12:35] just even from the processlist
[05:12:38] like: omg!
[05:13:02] we need to change query killer
[05:13:08] the event I mean
[05:13:12] maybe next time we need to spread load more
[05:13:23] yeah, maybe it was too cold
[05:13:52] and definitely not to the old master
[05:14:08] Operations, ops-eqiad, DBA, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui)
[05:14:28] RECOVERY - puppet last run on mw1296 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[05:14:35] there were also statistics ongoing
[05:14:42] those may need to change
[05:15:14] jobqueue seems happier now
[05:16:03] I guess it rereads the config from time to time
[05:16:15] I am changing the query killer event
[05:17:02] jynus: your switchover script is <3 :)
[05:17:54] I am checking edit rates and performance metrics
[05:18:27] there was some slowdown
[05:19:59] tendril and zarcillo updated
[05:21:39] Operations, ops-eqiad, DBA, Patch-For-Review: db1078 s3 primary DB master BBU pre-failure - https://phabricator.wikimedia.org/T219115 (Marostegui) @Cmjohnson can we schedule the BBU replacement for Monday 15th? db1078 is no longer a master. The failover was performed successfully: Times in UTC:...
[05:21:48] I don't see performance issues, only a deployment-related regression some days ago
[05:22:23] yeah, I think we are good
[05:22:32] Next time we have to either warm up the old master or give it less load
[05:27:36] PROBLEM - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 67.01 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops
[05:28:34] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (Marostegui) The raid is sdb and we need it to be sda for db.cfg to work: ` Disk /dev/sdb: 3.5 TiB, 3840699359232 bytes, 7501365936 s...
[05:33:07] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10jcrespo) We of course could make sdb work, but that would make this servers special, compared to the rest. Maybe a disk was not adde... [05:37:29] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) >>! In T211488#5102911, @Krinkle wrote: > Exported from mwdebug1001 in plain text and sorted. Full dumps at P8387 and P8386. > > ###... [05:37:43] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [05:41:48] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) So, I have been checking out the RAID menu on the controller, but unfortunately over `vsp` it doesn't show most of the o... [05:48:50] PROBLEM - EDAC syslog messages on wtp2013 is CRITICAL: 17 ge 4 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [05:57:52] (03CR) 10Marostegui: mariadb: Allow new option --stop-slave for xtrabackup transfers (032 comments) [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/502828 (https://phabricator.wikimedia.org/T206203) (owner: 10Jcrespo) [06:06:42] (03PS1) 10Elukey: turnilo: remove tbayer_popups from config [puppet] - 10https://gerrit.wikimedia.org/r/502944 (https://phabricator.wikimedia.org/T220575) [06:20:53] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) >>! 
In T211488#5102911, @Krinkle wrote: > * [ ] Filesytem > > Much lower. Don't know if it matters? > > `lang=diff > -default_socket... [06:21:24] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [06:24:38] !log upgrading remaining API Servers to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [06:24:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:24:44] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [06:31:06] PROBLEM - puppet last run on analytics1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/biocLite.R] [06:32:30] !log uploaded jenkins 2.164.2 to apt.wikimedia.org (jessie-wikimedia / thirdparty) [06:32:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:33:13] !log uploaded jenkins 2.164.2 to apt.wikimedia.org (stretch-wikimedia / thirdparty/ci) [06:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:34:17] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Error handling was discussed in https://phabricator.wikimedia.org/T211488#4908305 and the followups. The most notable difference is HHV... 
[06:34:58] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe)
[06:57:26] RECOVERY - puppet last run on analytics1071 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[07:14:10] Operations, Patch-For-Review, cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (MoritzMuehlenhoff)
[07:19:48] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (Marostegui) I have been trying to check if there is something else defined on a storage level but it is impossible to see anything w...
[07:26:11] Operations, Wikidata, Wikidata-Query-Service, Discovery-Wikidata-Query-Service-Sprint: Socket timeout on wdqs.svc.eqiad.wmnet - https://phabricator.wikimedia.org/T217557 (Gehel) Open→Resolved a:Gehel I don't think there is anything actionable at this point. Let's close.
[07:29:59] (PS2) Gehel: Enable revisions support on internal clusters [puppet] - https://gerrit.wikimedia.org/r/502909 (https://phabricator.wikimedia.org/T217897) (owner: Smalyshev)
[07:40:16] (CR) Gehel: [C: +2] Enable revisions support on internal clusters [puppet] - https://gerrit.wikimedia.org/r/502909 (https://phabricator.wikimedia.org/T217897) (owner: Smalyshev)
[07:43:04] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) Also: - PCRE JIT is enabled by default in HHVM and we definitely want it in php7 as well. - `include_path` in php's ini is still set...
[07:55:42] (03PS9) 10Muehlenhoff: Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) [07:59:14] (03CR) 10Gehel: [C: 03+2] elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [07:59:57] (03CR) 10Gehel: "recheck" [cookbooks] - 10https://gerrit.wikimedia.org/r/502220 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [08:00:10] (03CR) 10jenkins-bot: elasticsearch: reset all indices to read/write [software/spicerack] - 10https://gerrit.wikimedia.org/r/502218 (https://phabricator.wikimedia.org/T219799) (owner: 10Gehel) [08:04:45] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for rsyslog [puppet] - 10https://gerrit.wikimedia.org/r/496719 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [08:10:18] (03PS1) 10Mathew.onipe: icinga: add remote cluster check for elastic [puppet] - 10https://gerrit.wikimedia.org/r/502950 (https://phabricator.wikimedia.org/T218932) [08:12:28] (03PS3) 10Elukey: role::druid::public::worker: set stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/502814 (https://phabricator.wikimedia.org/T219910) [08:15:05] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: set stricter timeouts [puppet] - 10https://gerrit.wikimedia.org/r/502814 (https://phabricator.wikimedia.org/T219910) (owner: 10Elukey) [08:19:03] !log roll restart of druid-broker/historical on druid100[4-6] to pick up new settings [08:19:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:25] !log upgrading remaining job runners to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [08:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:20:38] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [08:22:25] 
(03PS1) 10Elukey: role::druid::public::worker: raise max direct mem for historical [puppet] - 10https://gerrit.wikimedia.org/r/502951 (https://phabricator.wikimedia.org/T219910) [08:22:34] (03PS3) 10Mathew.onipe: maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) [08:23:45] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: raise max direct mem for historical [puppet] - 10https://gerrit.wikimedia.org/r/502951 (https://phabricator.wikimedia.org/T219910) (owner: 10Elukey) [08:25:13] (03CR) 10Mathew.onipe: maps migrate maps2002 to stretch (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [08:27:59] (03CR) 10Gehel: [C: 04-1] "There are still a few changes to maps2001: https://puppet-compiler.wmflabs.org/compiler1002/15686/maps2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [08:33:23] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [08:43:14] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Also, the value of `doc_root` for HHVM doesn't seem to be set from ini settings, so I'll have to dig up where that happens. [08:50:20] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Regarding `enable_dl`: i just verified `dl()` does work under HHVM - although I suspect it does nothing. So I would be careful in disab... 
[08:50:22] Operations, Wikimedia-Mailing-lists: Change ownership of wikimania-program@lists.wikimedia.org - https://phabricator.wikimedia.org/T220641 (fgiunchedi) We certainly can! (I'm on SRE clinic duty this week, hence handing ML requests too) Which email address should we be adding to wikimania-program list? A...
[08:52:24] Operations, PHP 7.0 support, Performance-Team (Radar): Set `enable_dl` to 0 in php.ini - https://phabricator.wikimedia.org/T220681 (Joe)
[08:53:15] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe)
[09:01:25] !log deployment servers to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069)
[09:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:05] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069
[09:03:46] (PS4) Mathew.onipe: maps migrate maps2002 to stretch [puppet] - https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622)
[09:06:24] Operations, PHP 7.0 support, Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (Joe) >>! In T211488#5102911, @Krinkle wrote: > * [ ] File Uploads & Data Input > > `lang=diff > -upload_tmp_dir = /tmp > +upload_tmp_dir =...
[09:06:56] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [09:07:45] (03CR) 10Mathew.onipe: "@gehel seems we are good now: https://puppet-compiler.wmflabs.org/compiler1002/15687/maps2001.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:10:12] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10MoritzMuehlenhoff) [09:17:16] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [09:20:22] (03CR) 10Jcrespo: [C: 03+1] mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:21:28] (03PS1) 10Arturo Borrero Gonzalez: labtestnet2002: cleanup [dns] - 10https://gerrit.wikimedia.org/r/502955 (https://phabricator.wikimedia.org/T220426) [09:22:10] (03CR) 10Filippo Giunchedi: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1002/15691/weblog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/493243 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [09:22:18] (03PS4) 10Filippo Giunchedi: logging: move webrequest-5xx to logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/493243 (https://phabricator.wikimedia.org/T213899) [09:22:32] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2002: cleanup [dns] - 10https://gerrit.wikimedia.org/r/502955 (https://phabricator.wikimedia.org/T220426) (owner: 10Arturo Borrero Gonzalez) [09:24:35] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) [09:27:57] (03PS1) 10Jbond: offboard: remove wikimedia email [puppet] - 
10https://gerrit.wikimedia.org/r/502957 [09:28:35] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) 05Open→03Resolved [09:33:00] (03CR) 10Jbond: [C: 03+2] offboard: remove wikimedia email [puppet] - 10https://gerrit.wikimedia.org/r/502957 (owner: 10Jbond) [09:33:08] (03PS2) 10Jbond: offboard: remove wikimedia email [puppet] - 10https://gerrit.wikimedia.org/r/502957 [09:35:10] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Regarding mail: both paths actually link to `/usr/sbin/exim4` on our systems. [09:35:23] (03PS8) 10Muehlenhoff: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 [09:36:36] (03PS9) 10Muehlenhoff: mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 [09:37:47] (03CR) 10Muehlenhoff: [C: 03+2] mariadb: Remove support for trusty [puppet] - 10https://gerrit.wikimedia.org/r/492321 (owner: 10Muehlenhoff) [09:38:19] (03PS5) 10Gehel: maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:39:08] (03CR) 10Gehel: [C: 03+2] maps migrate maps2002 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502768 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:41:53] (03PS1) 10Elukey: role::druid::public::worker: reduce heap size for coord/overlord [puppet] - 10https://gerrit.wikimedia.org/r/502958 (https://phabricator.wikimedia.org/T219910) [09:41:56] PROBLEM - puppet last run on db1071 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:42:32] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2002.codfw.wmn... [09:42:38] PROBLEM - puppet last run on es1016 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:42:58] (03PS1) 10Muehlenhoff: Fix up installation mariadb-backup [puppet] - 10https://gerrit.wikimedia.org/r/502959 [09:43:10] PROBLEM - puppet last run on db1110 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:10] PROBLEM - puppet last run on db2038 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:22] PROBLEM - puppet last run on db1067 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:28] PROBLEM - puppet last run on db2079 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:28] PROBLEM - puppet last run on db2061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:38] PROBLEM - puppet last run on db1119 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:46] PROBLEM - puppet last run on db1066 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:44:48] PROBLEM - puppet last run on db2068 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:45:03] (03PS2) 10Muehlenhoff: Fix up installation of mariadb-backup [puppet] - 10https://gerrit.wikimedia.org/r/502959 [09:45:26] PROBLEM - puppet last run on db1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:45:55] (03CR) 10Muehlenhoff: [C: 03+2] Fix up installation of mariadb-backup [puppet] - 10https://gerrit.wikimedia.org/r/502959 (owner: 10Muehlenhoff) [09:46:18] PROBLEM - puppet last run on db1064 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:28] PROBLEM - puppet last run on db2057 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:38] PROBLEM - puppet last run on matomo1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:44] PROBLEM - puppet last run on db1068 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:48] PROBLEM - puppet last run on db2083 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:46:54] PROBLEM - puppet last run on db2047 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:02] PROBLEM - puppet last run on db1109 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:06] PROBLEM - puppet last run on db1103 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:22] PROBLEM - puppet last run on db1115 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:26] PROBLEM - puppet last run on cumin2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:50] PROBLEM - puppet last run on db1098 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:47:58] PROBLEM - puppet last run on es1018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:04] PROBLEM - puppet last run on db2040 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:06] PROBLEM - puppet last run on db1116 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:06] PROBLEM - puppet last run on db1093 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:12] PROBLEM - puppet last run on db1081 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:26] PROBLEM - puppet last run on db2093 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:26] PROBLEM - puppet last run on db2058 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on pc2008 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on db1061 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on es2015 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:48:32] PROBLEM - puppet last run on db2046 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:02] PROBLEM - puppet last run on mwmaint1002 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:02] PROBLEM - puppet last run on db2054 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:12] PROBLEM - puppet last run on db2066 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:20] PROBLEM - puppet last run on db2091 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:24] PROBLEM - puppet last run on db1125 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:28] PROBLEM - puppet last run on db2078 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:49:38] RECOVERY - puppet last run on db1067 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:50:04] PROBLEM - puppet last run on db1065 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:50:34] PROBLEM - puppet last run on db2049 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:50:38] (03CR) 10Volans: "To make the review process a bit less abstract of this and the Kernels refactor patch, I've applied this series up to this commit to the t" [software/debmonitor] - 10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [09:50:42] PROBLEM - puppet last run on dbprov2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:51:34] RECOVERY - puppet last run on db1064 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:51:38] PROBLEM - puppet last run on db2077 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:02] RECOVERY - puppet last run on db1068 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:52:06] PROBLEM - puppet last run on db2092 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:10] PROBLEM - puppet last run on db2041 is CRITICAL: CRITICAL: Puppet has 1 failures. 
Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:16] PROBLEM - puppet last run on es2014 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:24] PROBLEM - puppet last run on es2011 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:52:30] RECOVERY - puppet last run on db1071 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:02] PROBLEM - puppet last run on db2073 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:53:10] (03PS2) 10Elukey: role::druid::public::worker: reduce heap size for coord/overlord [puppet] - 10https://gerrit.wikimedia.org/r/502958 (https://phabricator.wikimedia.org/T219910) [09:53:10] PROBLEM - puppet last run on es2018 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. 
Failed resources (up to 3 shown): Package[mariadb-backup] [09:53:22] RECOVERY - puppet last run on db2040 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:30] RECOVERY - puppet last run on db1081 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:53:44] (03CR) 10Elukey: [C: 03+2] role::druid::public::worker: reduce heap size for coord/overlord [puppet] - 10https://gerrit.wikimedia.org/r/502958 (https://phabricator.wikimedia.org/T219910) (owner: 10Elukey) [09:53:48] RECOVERY - puppet last run on db1061 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:53:48] RECOVERY - puppet last run on db2046 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [09:54:08] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) As far as serialisation goes: * Igbinary is installed because the memcached extension and (IIRC) the apcu extension have it as a requi... 
[09:54:18] RECOVERY - puppet last run on db2054 is OK: OK: Puppet is currently enabled, last run 7 seconds ago with 0 failures [09:54:19] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [09:55:20] RECOVERY - puppet last run on db1065 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:55:20] RECOVERY - puppet last run on db1066 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:55:22] RECOVERY - puppet last run on db2068 is OK: OK: Puppet is currently enabled, last run 12 seconds ago with 0 failures [09:55:50] RECOVERY - puppet last run on db2049 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [09:55:58] RECOVERY - puppet last run on db1084 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:56:56] RECOVERY - puppet last run on db2077 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [09:56:58] (03CR) 10Vgutierrez: [C: 03+1] cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [09:57:04] RECOVERY - puppet last run on db2057 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:57:04] PROBLEM - Maps HTTPS on maps2003 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:57:06] PROBLEM - Maps HTTPS on maps2004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Maps/RunBook [09:57:07] PROBLEM - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [09:57:19] !log roll restart druid-coordinator/overlord on druid100[4-6] to pick up new jvm settings 
[09:57:26] RECOVERY - puppet last run on db2083 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [09:57:30] RECOVERY - puppet last run on db2041 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:57:30] RECOVERY - puppet last run on db2047 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [09:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:58:22] RECOVERY - puppet last run on db2073 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [09:58:52] PROBLEM - puppet last run on mwmaint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [09:59:02] RECOVERY - puppet last run on db2058 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [09:59:46] RECOVERY - puppet last run on db2066 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:00:04] RECOVERY - puppet last run on db2078 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:00:18] RECOVERY - puppet last run on db2061 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [10:00:18] RECOVERY - puppet last run on db2079 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:00:20] <_joe_> gehel: should I depool maps in codfw? [10:00:28] RECOVERY - puppet last run on db1119 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [10:00:45] _joe_: give me 1 minute [10:00:50] <_joe_> sure! 
[10:00:57] RECOVERY - LVS HTTP IPv4 on kartotherian.svc.codfw.wmnet is OK: HTTP OK: HTTP/1.1 200 OK - 1326 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/LVS%23Diagnosing_problems [10:01:03] <_joe_> lol [10:01:05] <_joe_> :D [10:01:10] <_joe_> literally 7 seconds [10:01:19] :) [10:01:50] I'm not entirely sure that we are out of trouble yet :/ [10:02:17] <_joe_> ok [10:02:21] <_joe_> lmk if I can help [10:02:28] sure, I'll shout if needed [10:02:42] RECOVERY - puppet last run on db2092 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures [10:02:58] RECOVERY - puppet last run on db1109 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:03:02] RECOVERY - puppet last run on db1103 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:03:18] RECOVERY - puppet last run on db1115 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:04:02] RECOVERY - puppet last run on db1116 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:04:20] (03PS1) 10Muehlenhoff: Also fix mariadb-backup installation for mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/502960 [10:04:22] RECOVERY - puppet last run on db2093 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:04:22] RECOVERY - puppet last run on db1110 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:04:35] I am going to upgrade Jenkins [10:05:06] RECOVERY - puppet last run on db2091 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:05:19] (03CR) 10Muehlenhoff: [C: 03+2] Also fix mariadb-backup installation for mariadb::packages_client [puppet] - 10https://gerrit.wikimedia.org/r/502960 (owner: 10Muehlenhoff) [10:05:27] waiting for some job to finish [10:06:17] (03CR) 10Jbond: "> Patch Set 5:" [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/443368 (https://phabricator.wikimedia.org/T198592) (owner: 10Volans) [10:08:22] RECOVERY - puppet last run on cumin2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:08:50] RECOVERY - puppet last run on db1098 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [10:09:00] _joe_: we're having an issue with cassandra, can you depool maps codfw while I dig more into it? [10:09:02] RECOVERY - puppet last run on db1093 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:09:12] <_joe_> gehel: sure [10:09:17] thanks! [10:09:26] RECOVERY - puppet last run on es2015 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [10:09:54] PROBLEM - puppet last run on cumin1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[mariadb-backup] [10:10:01] <_joe_> it will take a few minutes for the public endpoints though [10:10:12] RECOVERY - puppet last run on db1125 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [10:10:45] !log oblivian@puppetmaster1001 conftool action : set/pooled=false; selector: dnsdisc=kartotherian,name=codfw [10:10:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:12:34] RECOVERY - puppet last run on matomo1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:12:54] RECOVERY - puppet last run on es2014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:13:04] RECOVERY - puppet last run on es2011 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [10:13:52] RECOVERY - puppet last run on es2018 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:13:56] RECOVERY - puppet last run on es1016 is OK: OK: Puppet is currently enabled, last run 4 minutes ago 
with 0 failures [10:13:58] RECOVERY - puppet last run on es1018 is OK: OK: Puppet is currently enabled, last run 14 seconds ago with 0 failures [10:14:30] RECOVERY - puppet last run on pc2008 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [10:14:41] (03PS1) 10Volans: Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 [10:14:45] !log remove maps2001 from new cassandra cluster -T198622 [10:14:52] RECOVERY - puppet last run on mwmaint1002 is OK: OK: Puppet is currently enabled, last run 2 seconds ago with 0 failures [10:14:58] RECOVERY - puppet last run on cumin1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:15:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:15:03] T198622: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 [10:15:14] (03CR) 10Giuseppe Lavagetto: [C: 03+1] Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 (owner: 10Volans) [10:15:18] RECOVERY - puppet last run on db2038 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:15:48] _joe_: do we need to disable puppet on all cp* and then apply in the right order right? [10:15:59] ema: too [10:16:06] <_joe_> volans: not in this case, no [10:16:11] <_joe_> we're just depooling one site [10:16:18] <_joe_> not switching between them [10:16:20] ack [10:16:22] right [10:16:24] RECOVERY - puppet last run on dbprov2001 is OK: OK: Puppet is currently enabled, last run 16 seconds ago with 0 failures [10:16:27] <_joe_> we basically only need to run puppet in codfw [10:16:30] * volans paranoid on that [10:17:19] waiting for ema's confirmation? 
[10:17:20] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 56 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:17:29] volans: what's up [10:17:42] ema can you check https://gerrit.wikimedia.org/r/c/operations/puppet/+/502961 please? [10:17:46] sure [10:17:55] goal depool maps from codfw [10:18:08] due to issues ge.hel is working on to fix [10:18:11] (03CR) 10Ema: [C: 03+1] Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 (owner: 10Volans) [10:18:21] (03CR) 10Volans: [C: 03+2] Temporary depool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/502961 (owner: 10Volans) [10:18:29] volans: looks good, we'll need to also depool karto codfw in ats [10:18:33] <_joe_> https://puppet-compiler.wmflabs.org/compiler1002/15692/cp2002.codfw.wmnet/ [10:18:36] ema: I need to just run puppet on codfw upload cp* right? [10:18:41] <_joe_> ema: HAH, plot twist [10:18:43] ahahah [10:18:44] right [10:18:51] but that's via discovery? [10:18:54] yes [10:18:55] <_joe_> ema: didn't ATS use discovery records? [10:18:56] nice [10:19:07] <_joe_> if so, it's already depooled [10:19:08] I'm merging this one [10:19:15] kartotherian.discovery.wmnet [10:19:22] RECOVERY - puppet last run on mwmaint2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:19:49] <_joe_> ema: that now points to eqiad [10:19:56] perfect [10:20:00] <_joe_> ema: you don't have tls termination there though, right? [10:20:09] ema ok to run sudo cumin -b 8 'A:cp-upload_codfw' 'run-puppet-agent' [10:20:12] ?
[10:20:38] volans: +1 [10:20:59] !log forcing puppet run on A:cp-upload_codfw [10:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:22:15] _joe_: yes we use https://kartotherian.discovery.wmnet as the origin [10:22:32] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 16 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [10:23:00] (03CR) 10Arturo Borrero Gonzalez: "Adding Andrew and Alex as reviewers since they may have more knowledge of the status of puppetmasters within Cloud VPS." [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:23:38] <_joe_> ema: oh great [10:23:45] so... there was a timing issue in our upgrade procedure on maps codfw, the old and new cassandra clusters were not isolated during a few minutes and discovered each others [10:23:49] puppet run completed [10:24:16] this is going to take a while to repair, but there should be no long lasting damage [10:24:37] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/502785 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [10:30:12] RECOVERY - Maps HTTPS on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.193 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:30:56] RECOVERY - Maps HTTPS on maps2004 is OK: HTTP OK: HTTP/1.1 200 OK - 1323 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Maps/RunBook [10:31:43] (03PS1) 10Arturo Borrero Gonzalez: clouddb2001-dev: add mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/502963 (https://phabricator.wikimedia.org/T220096) [10:32:25] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clouddb2001-dev: add mariadb role [puppet] - 10https://gerrit.wikimedia.org/r/502963 (https://phabricator.wikimedia.org/T220096) (owner: 10Arturo 
Borrero Gonzalez) [10:39:38] !log Upgrading CI Jenkins [10:39:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:07] (03PS1) 10Vgutierrez: acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) [10:44:58] (03CR) 10jerkins-bot: [V: 04-1] acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [10:46:21] !log upgrading remaining app servers to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [10:46:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:46:34] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [10:48:50] (03PS2) 10Vgutierrez: acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) [10:49:08] PROBLEM - Apache HTTP on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:22] PROBLEM - HHVM rendering on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:49:40] PROBLEM - Nginx local proxy to apache on mw1289 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 1308 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:10] RECOVERY - Apache HTTP on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.057 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:24] RECOVERY - HHVM rendering on mw1289 is OK: HTTP OK: 
HTTP/1.1 200 OK - 75895 bytes in 0.283 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:50:42] RECOVERY - Nginx local proxy to apache on mw1289 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 617 bytes in 0.052 second response time https://wikitech.wikimedia.org/wiki/Application_servers [10:54:42] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) 05Stalled→03Open See https://phabricator.wikimedia.org/T220096#5103616, I just reallocated the striker dat... [10:55:54] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review, 10cloud-services-team (Kanban): decommmision: labtestweb2001.wikimedia.org - https://phabricator.wikimedia.org/T218024 (10aborrero) [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1100). [11:00:04] greta: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:01:22] greta: around for SWAT? [11:02:37] (03CR) 10Zfilipin: "This was scheduled for EU SWAT but will not be deployed unless a developer comes to #wikimedia-operations." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [11:07:14] (03PS1) 10Arturo Borrero Gonzalez: labtestweb2001: decommission [puppet] - 10https://gerrit.wikimedia.org/r/502966 (https://phabricator.wikimedia.org/T218024) [11:09:06] PROBLEM - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is CRITICAL: CRITICAL - failed 38 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:09:44] @zeljkof im here, and I will be around for the deployment. Thank you [11:13:17] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) Looking at memcached settings with @elukey the only one that really seems potentially problematic is `memcached.store_retry_count = 2... [11:15:08] 10Operations, 10hardware-requests, 10User-Elukey: eqiad: (3) - zookeeper cluster for Analytics - https://phabricator.wikimedia.org/T220687 (10elukey) [11:16:07] (03CR) 10Greta WMDE: "> This was scheduled for EU SWAT but will not be deployed unless a" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [11:19:25] RECOVERY - IPv6 ping to eqsin on ripe-atlas-eqsin IPv6 is OK: OK - failed 18 probes of 405 (alerts on 35) - https://atlas.ripe.net/measurements/11645088/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:22:17] PROBLEM - Disk space on contint1001 is CRITICAL: DISK CRITICAL - free space: / 2490 MB (5% inode=63%) [11:22:34] hmm [11:23:00] Greta_WMDE: sorry, just saw your reply, still around? [11:23:50] Yes [11:23:59] zeljkof :) [11:24:33] Greta_WMDE: ok, I'll let you know (in a few minutes) when it's at mwdebug1002, ready for testing [11:24:47] do you know how to test there, or do you need help? 
[11:25:30] I do, thank you [11:25:58] hashar ^ does the contint1001 have anything to do with the upgrade? [11:26:00] (03PS6) 10Arturo Borrero Gonzalez: wmcs: Add profiles for oidentd proxy and client modes [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [11:26:15] jijiki: unlikely [11:26:20] it is known to fill up disk :/ [11:26:52] this page doesn't look right o.O https://tools.wmflabs.org/versions/ [11:27:00] SAL entries are missing... [11:27:18] zeljkof: I know how to test it :) [11:27:47] Greta_WMDE: ok, great [11:29:17] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jcrespo) Re mysqli, we should care about: ` +mysqlnd.net_cmd_buffer_size = 4096 +mysqlnd.net_read_buffer_size = 32768 +mysqlnd.net_read_time... [11:29:18] hashar: should I open a task ? [11:29:23] this might page soon [11:29:29] there is one already [11:30:02] !log removing maps2002 from cassandra cluster due to dead node error [11:30:04] ok, so the alert can be ACKed ? [11:30:06] (03PS4) 10Zfilipin: Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [11:30:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:30:19] (03CR) 10Zfilipin: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [11:31:23] (03Merged) 10jenkins-bot: Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [11:31:54] hashar: the alert can be ack'd then/ [11:31:55] ? 
[11:32:18] (03PS7) 10Arturo Borrero Gonzalez: wmcs: Add profiles for oidentd proxy and client modes [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [11:32:20] Greta_WMDE: the patch is at mwdebug1002, please test and let me know if I can deploy it [11:32:28] maps codfw seems to be back on track, but let's keep it depooled for now and do some more checks [11:32:39] gehel: It's not [11:32:45] zeljkof: ok, will let you know [11:32:48] jijiki, hashar: is it ok to proceed with swat? [11:33:04] I am in a meeting [11:36:27] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [11:36:29] !log akosiaris@deploy1001 scap-helm cxserver cluster codfw completed [11:36:29] !log akosiaris@deploy1001 scap-helm cxserver finished [11:36:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:01] (03CR) 10jenkins-bot: Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [11:37:03] RECOVERY - Disk space on contint1001 is OK: DISK OK [11:37:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:37:39] zeljkof: It's good to go, I tested it and Amir1 double-checked it for us :) [11:37:56] Greta_WMDE: ok, deploying [11:38:14] zeljkof: thank you [11:39:05] !log zfilipin@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:500692|Increase musical notation datatype string length limit (T218767)]] (duration: 01m 02s) [11:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:18] T218767: Increase musical notation datatype string length limit to 1500 characters - https://phabricator.wikimedia.org/T218767 [11:39:38] 
Greta_WMDE: it's deployed! please test and thanks for deploying with #releng :) [11:39:50] !log EU SWAT finished [11:39:59] (03PS1) 10Gilles: Expose haproxy total request time via mtail [puppet] - 10https://gerrit.wikimedia.org/r/502972 (https://phabricator.wikimedia.org/T220499) [11:40:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:19] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27514 MB (5% inode=99%) [11:43:55] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:45:07] (03PS8) 10Arturo Borrero Gonzalez: wmcs: Add profiles for oidentd proxy and client modes [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [11:47:01] PROBLEM - Request latencies on acrab is CRITICAL: instance=10.192.16.26:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:47:02] zeljkof: tested, looks good. Thank you [11:47:45] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:48:12] elastic1017 should have shards relocating. I will confirm. It should not be an issue [11:48:17] RECOVERY - Request latencies on acrab is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [11:49:50] (03PS9) 10Arturo Borrero Gonzalez: wmcs: Add profiles for oidentd proxy and client modes [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [11:50:54] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10Rosalie_WMDE) [11:51:46] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10Rosalie_WMDE) [11:55:05] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10WMDE-leszek) [11:55:39] RECOVERY - Disk space on elastic1017 is OK: DISK OK [11:56:02] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected:" [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [11:57:55] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10WMDE-leszek) As an Engineering Manager at WMDE, I endorse this request. For NDA please email address as registered with LDAP account, as seen at... 
[12:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1200) [12:01:39] (03PS1) 10Mathew.onipe: migrate maps2001 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502977 (https://phabricator.wikimedia.org/T198622) [12:05:42] (03PS1) 10QChris: Add .gitreview [debs/helmfile] - 10https://gerrit.wikimedia.org/r/502978 [12:05:44] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/helmfile] - 10https://gerrit.wikimedia.org/r/502978 (owner: 10QChris) [12:06:06] (03CR) 10Arturo Borrero Gonzalez: "At this point I would suggest we simply deploy sssd everywhere. This can still be merged thought. Let me know." [puppet] - 10https://gerrit.wikimedia.org/r/496991 (https://phabricator.wikimedia.org/T217280) (owner: 10BryanDavis) [12:07:38] (03PS1) 10QChris: Add .gitreview [debs/helmfile] - 10https://gerrit.wikimedia.org/r/502979 [12:07:40] (03CR) 10QChris: [V: 03+2 C: 03+2] Add .gitreview [debs/helmfile] - 10https://gerrit.wikimedia.org/r/502979 (owner: 10QChris) [12:08:14] (03PS2) 10Arturo Borrero Gonzalez: labs: remove references to other deleted hosts [puppet] - 10https://gerrit.wikimedia.org/r/502633 (owner: 10Alex Monk) [12:09:10] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs: remove references to other deleted hosts [puppet] - 10https://gerrit.wikimedia.org/r/502633 (owner: 10Alex Monk) [12:10:03] (03PS2) 10Arturo Borrero Gonzalez: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502634 (owner: 10Alex Monk) [12:10:40] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502634 (owner: 10Alex Monk) [12:12:11] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [12:12:12] !log kartik@deploy1001 scap-helm cxserver cluster staging completed [12:12:12] !log kartik@deploy1001 scap-helm 
cxserver finished [12:12:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:12:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:14:39] (03PS1) 10Arturo Borrero Gonzalez: Revert "labs: Remove nova_dnsmasq_aliases stuff" [puppet] - 10https://gerrit.wikimedia.org/r/502980 [12:15:18] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] Revert "labs: Remove nova_dnsmasq_aliases stuff" [puppet] - 10https://gerrit.wikimedia.org/r/502980 (owner: 10Arturo Borrero Gonzalez) [12:15:39] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [12:15:41] !log kartik@deploy1001 scap-helm cxserver cluster codfw completed [12:15:41] !log kartik@deploy1001 scap-helm cxserver finished [12:15:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:16:14] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['maps2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['maps2002.c... [12:16:17] PROBLEM - puppet last run on labnet1002 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [12:16:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:18:43] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:19:12] !log kartik@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [12:19:13] !log kartik@deploy1001 scap-helm cxserver cluster eqiad completed [12:19:13] !log kartik@deploy1001 scap-helm cxserver finished [12:19:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:29] PROBLEM - puppet last run on labnet1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:19:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:20:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:21:17] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:21:32] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) >>! In T211488#5103689, @jcrespo wrote: > Re mysqli, we should care about: > ` > +mysqlnd.net_cmd_buffer_size = 4096 > +mysqlnd.net_rea... [12:22:31] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:25:05] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:25:35] PROBLEM - Request latencies on argon is CRITICAL: instance=10.64.32.133:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:26:39] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10Joe) [12:26:55] RECOVERY - Request latencies on argon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [12:29:34] 10Operations, 10PHP 7.0 support, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10jcrespo) > For those settings, I'd frankly use the defaults for now, and we can iteratively improve doing some benchmarking later, unless yo... [12:30:37] PROBLEM - puppet last run on dbproxy1010 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [12:32:29] a proxy host having a temporary (puppet) proxy error, isn't it ironic? 
[12:32:31] (03PS1) 10Filippo Giunchedi: webrequest: set logger max message size [puppet] - 10https://gerrit.wikimedia.org/r/502985 (https://phabricator.wikimedia.org/T213899) [12:33:55] (03CR) 10Filippo Giunchedi: [C: 03+2] webrequest: set logger max message size [puppet] - 10https://gerrit.wikimedia.org/r/502985 (https://phabricator.wikimedia.org/T213899) (owner: 10Filippo Giunchedi) [12:35:48] (03PS1) 10Giuseppe Lavagetto: profile::mediawiki::php: tweak ini settings [puppet] - 10https://gerrit.wikimedia.org/r/502986 (https://phabricator.wikimedia.org/T211488) [12:35:53] RECOVERY - puppet last run on dbproxy1010 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:41:46] 10Operations, 10Wikimedia-Logstash, 10Patch-For-Review, 10User-herron: Migrate at least 3 existing Logstash inputs and associated producers to the new Kafka-logging pipeline, and remove the associated non-Kafka Logstash inputs - https://phabricator.wikimedia.org/T213899 (10fgiunchedi) [12:45:53] RECOVERY - puppet last run on labnet1001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:46:12] (03PS2) 10Gehel: migrate maps2001 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502977 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [12:47:42] RECOVERY - puppet last run on labnet1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:48:14] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10fgiunchedi) a:03RStallman-legalteam @RStallman-legalteam I'm assigning this task to you for NDA processing, thanks! 
[12:48:51] !log gehel@puppetmaster1001 conftool action : set/pooled=inactive; selector: dc=codfw,cluster=maps,name=maps2001.codfw.wmnet [12:49:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:49:06] (03CR) 10Gehel: [C: 03+2] migrate maps2001 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502977 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [12:52:06] (03PS1) 10Gehel: maps: maps2001 is now a slave after migrating to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502988 (https://phabricator.wikimedia.org/T198622) [12:52:32] (03CR) 10jerkins-bot: [V: 04-1] maps: maps2001 is now a slave after migrating to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502988 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [12:53:18] (03PS2) 10Gehel: maps: maps2001 is now a slave after migrating to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502988 (https://phabricator.wikimedia.org/T198622) [12:57:58] (03PS3) 10Gehel: maps: maps2001 is now a slave after migrating to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502988 (https://phabricator.wikimedia.org/T198622) [12:58:57] (03CR) 10Mathew.onipe: [C: 03+1] maps: maps2001 is now a slave after migrating to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502988 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [12:59:11] (03CR) 10Gehel: [C: 03+2] maps: maps2001 is now a slave after migrating to stretch [puppet] - 10https://gerrit.wikimedia.org/r/502988 (https://phabricator.wikimedia.org/T198622) (owner: 10Gehel) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1300) [13:00:48] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on 
cumin2001.codfw.wmnet for hosts: ` ['maps2001.codfw.wmn... [13:02:28] (03CR) 10Filippo Giunchedi: [C: 03+1] Upgrade logstash plugins to 5.6.15 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/502613 (https://phabricator.wikimedia.org/T219571) (owner: 10Herron) [13:04:13] (03CR) 10Filippo Giunchedi: [C: 03+1] cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:08:21] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2001.codfw.wmn... [13:20:16] (03PS1) 10Alex Monk: labs: Retry Ief7d536cdc: Remove nova_dnsmasq_aliases [puppet] - 10https://gerrit.wikimedia.org/r/502991 [13:20:40] (03CR) 10Alex Monk: "right, sorry. Icd4c2c46" [puppet] - 10https://gerrit.wikimedia.org/r/502634 (owner: 10Alex Monk) [13:20:49] (03CR) 10Alex Monk: "Icd4c2c46" [puppet] - 10https://gerrit.wikimedia.org/r/502980 (owner: 10Arturo Borrero Gonzalez) [13:25:35] (03PS4) 10Ema: cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) [13:27:47] (03CR) 10Ema: [C: 03+2] cache: move varnish storage config to varnish-be profile [puppet] - 10https://gerrit.wikimedia.org/r/502806 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:29:19] (03CR) 10Alex Monk: acme_requests: Validate dns-01 challenges against all the DNS servers (032 comments) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [13:29:25] (03CR) 10Gehel: "LGTM, minor comment inline. I'm sure volans has a few more things to say!" 
(031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [13:32:55] (03CR) 10Gehel: "I'm wondering if we should clean the previous releases (we never did so far, but it looks wrong). Maybe instead of improving this flawed p" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/502613 (https://phabricator.wikimedia.org/T219571) (owner: 10Herron) [13:40:21] (03CR) 10Gehel: Add basic Ganeti RAPI module and tests (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [13:46:01] (03CR) 10Vgutierrez: acme_requests: Validate dns-01 challenges against all the DNS servers (031 comment) [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [13:46:47] (03PS1) 10Jbond: icinga: Add a new script and configuration to send prowl notifications [puppet] - 10https://gerrit.wikimedia.org/r/502993 [13:55:43] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) @jcrespo @Marostegui I disabled the SD card and it is working [13:55:50] (03PS3) 10Vgutierrez: acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) [14:00:04] Amir1: It is that lovely time of the day again! You are hereby commanded to deploy Deploy Url Shortener. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1400). 
[14:00:18] 10Operations, 10ops-eqiad, 10Analytics, 10User-Elukey: Check if a GPU fits in any of the remaining stat or notebook hosts - https://phabricator.wikimedia.org/T220698 (10elukey) p:05Triage→03Normal [14:00:27] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) Please look if the configuration looks right so i can do the same on the other 5 servers ` root@db2102:~# fdisk -l Disk /d... [14:03:52] on it [14:03:58] (03PS1) 10Ladsgroup: Deploy UrlShortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502998 (https://phabricator.wikimedia.org/T108557) [14:04:44] (03CR) 10Mforns: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/502944 (https://phabricator.wikimedia.org/T220575) (owner: 10Elukey) [14:06:29] (03CR) 10Alex Monk: [C: 03+2] acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:07:44] (03CR) 10Ladsgroup: [C: 03+2] Deploy UrlShortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502998 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [14:08:02] (03Merged) 10jenkins-bot: acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:08:46] (03Merged) 10jenkins-bot: Deploy UrlShortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502998 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [14:09:35] (03CR) 10jenkins-bot: acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] - 10https://gerrit.wikimedia.org/r/502965 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:10:12] 10Operations, 10ops-eqiad, 10Analytics, 
10hardware-requests, and 2 others: Upgrade kafka-jumbo100[1-6] to 10G NICs (if possible) - https://phabricator.wikimedia.org/T220700 (10elukey) p:05Triage→03Normal [14:10:57] (03CR) 10Bearloga: "@Nuria: done with the move to Hadoop! :D LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) (owner: 10Nuria) [14:11:56] (03PS2) 10Elukey: turnilo: remove tbayer_popups from config [puppet] - 10https://gerrit.wikimedia.org/r/502944 (https://phabricator.wikimedia.org/T220575) [14:12:40] (03PS1) 10Vgutierrez: Release 0.15 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/503000 (https://phabricator.wikimedia.org/T207461) [14:13:02] (03CR) 10jenkins-bot: Deploy UrlShortener [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502998 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [14:13:25] (03CR) 10Elukey: [C: 03+2] turnilo: remove tbayer_popups from config [puppet] - 10https://gerrit.wikimedia.org/r/502944 (https://phabricator.wikimedia.org/T220575) (owner: 10Elukey) [14:14:26] (03PS1) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:14:34] (03CR) 10jerkins-bot: [V: 04-1] puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [14:14:46] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Deploy UrlShortener to metawiki, let's get the party started (T108557, T44085) (duration: 01m 00s) [14:15:02] (03PS2) 10Vgutierrez: Release 0.16 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/503000 (https://phabricator.wikimedia.org/T207461) [14:15:22] (03PS2) 10Fsero: puppet exec { } doesn't like a bash builtin. 
[puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:15:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:15:27] T108557: Review and deploy UrlShortener extension to Wikimedia wikis - https://phabricator.wikimedia.org/T108557 [14:15:28] T44085: Wikimedia needs a URL shortener (tracking) - https://phabricator.wikimedia.org/T44085 [14:15:47] (03CR) 10jerkins-bot: [V: 04-1] puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [14:16:51] (03PS3) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:17:04] (03CR) 10jerkins-bot: [V: 04-1] puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [14:18:21] !log Deployment of Url shortener is done now [14:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:20:38] (03PS4) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:21:04] (03CR) 10jerkins-bot: [V: 04-1] puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [14:21:42] (03PS5) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:23:39] (03PS6) 10Fsero: puppet exec { } doesn't like a bash builtin. 
[puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:23:57] (03CR) 10Vgutierrez: [C: 03+2] Release 0.16 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/503000 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:24:07] (03PS7) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:25:31] (03CR) 10jenkins-bot: Release 0.16 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/503000 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:26:15] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) So the SD disablement did the trick! :-) The server looks good now: ` root@db2102:~# df -hT Filesystem Type... [14:26:23] (03CR) 10Filippo Giunchedi: WIP elastalert module (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [14:26:33] (03PS3) 10Filippo Giunchedi: WIP elastalert module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) [14:27:32] (03PS8) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [14:28:49] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [14:30:34] (03PS3) 10C. 
Scott Ananian: Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) [14:30:48] (03PS1) 10Vgutierrez: acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503005 (https://phabricator.wikimedia.org/T207461) [14:30:50] (03PS1) 10Vgutierrez: Release 0.16 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503006 (https://phabricator.wikimedia.org/T207461) [14:31:29] (03CR) 10Vgutierrez: [C: 03+2] acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503005 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:31:38] (03CR) 10Vgutierrez: [C: 03+2] Release 0.16 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503006 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:31:47] (03CR) 10Volans: "Thanks for all the improvements, few minor things inline." (0318 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [14:32:01] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) [14:33:06] (03PS1) 10Lucas Werkmeister (WMDE): Update comment for Wikibase monolingual text languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503008 [14:34:17] (03CR) 10Lucas Werkmeister (WMDE): "Just a minor thing I noticed when looking at Ib21ce0dbbe." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/503008 (owner: 10Lucas Werkmeister (WMDE)) [14:35:04] (03PS1) 10Marostegui: db2102: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/503009 (https://phabricator.wikimedia.org/T219461) [14:35:04] (03CR) 10jenkins-bot: acme_requests: Validate dns-01 challenges against all the DNS servers [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503005 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:35:06] (03CR) 10jenkins-bot: Release 0.16 [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503006 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:35:08] (03PS1) 10Vgutierrez: debian: Add release 0.16 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503010 (https://phabricator.wikimedia.org/T207461) [14:35:55] (03CR) 10Vgutierrez: [C: 03+2] debian: Add release 0.16 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503010 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:36:09] (03CR) 10Marostegui: [C: 03+2] db2102: Disable notifications [puppet] - 10https://gerrit.wikimedia.org/r/503009 (https://phabricator.wikimedia.org/T219461) (owner: 10Marostegui) [14:37:07] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) @Marostegui or @jcrespo you are free to take the task [14:37:11] (03Merged) 10jenkins-bot: debian: Add release 0.16 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503010 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:38:44] (03CR) 10jenkins-bot: debian: Add release 0.16 to changelog [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503010 (https://phabricator.wikimedia.org/T207461) (owner: 10Vgutierrez) [14:40:17] (03CR) 10Herron: "> I'm wondering if 
we should clean the previous releases (we never" [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/502613 (https://phabricator.wikimedia.org/T219571) (owner: 10Herron) [14:40:17] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) [14:40:26] !log cxserver Add garbage collection graphs under saturation. T205911 [14:40:47] (03PS1) 10Vgutierrez: debian: Fix changelog 0.16 entry [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503012 [14:41:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [14:41:20] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) 05Open→03Resolved Thanks @Papaul! This server is ready to be productionized at: {T220572} ` root@db2102:~# lsb_rele... 
[14:41:28] (03CR) 10Vgutierrez: [C: 03+1] "looks nice, maybe some logging could be helpful" [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [14:42:06] !log decommissioning cassandra-a, restbase2008 -- T208087 [14:42:26] !log upgrading mwmaint1002 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [14:42:47] (03CR) 10Vgutierrez: [C: 03+2] debian: Fix changelog 0.16 entry [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503012 (owner: 10Vgutierrez) [14:44:16] (03PS3) 10Giuseppe Lavagetto: apt: remove redundant Install-Recommends [puppet] - 10https://gerrit.wikimedia.org/r/501565 [14:44:26] (03Merged) 10jenkins-bot: debian: Fix changelog 0.16 entry [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503012 (owner: 10Vgutierrez) [14:44:28] (03CR) 10Giuseppe Lavagetto: [C: 03+2] apt: remove redundant Install-Recommends [puppet] - 10https://gerrit.wikimedia.org/r/501565 (owner: 10Giuseppe Lavagetto) [14:45:57] (03CR) 10jenkins-bot: debian: Fix changelog 0.16 entry [software/acme-chief] (debian) - 10https://gerrit.wikimedia.org/r/503012 (owner: 10Vgutierrez) [14:46:05] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Marostegui) [14:46:48] !log decommissioning cassandra-a, restbase2008 -- T208087 [14:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:47:11] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [14:48:59] !log uploaded acme-chief 0.16 to apt.wikimedia.org (buster) - T207461 [14:49:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:42] T207461: Validate DNS-01 challenges against every DNS server - https://phabricator.wikimedia.org/T207461 [14:49:46] (03CR) 10Vgutierrez: [C: 03+1] wikimediafoundation.org: add spf record [dns] - 
10https://gerrit.wikimedia.org/r/502589 (https://phabricator.wikimedia.org/T220412) (owner: 10Herron) [14:53:34] !log rebooting labnet1002 [14:53:39] (03PS1) 10Filippo Giunchedi: aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 [14:53:41] arturo, andrewbogott ^ fyi [14:53:41] (03PS1) 10Filippo Giunchedi: aptrepo: add component/elastalert [puppet] - 10https://gerrit.wikimedia.org/r/503014 (https://phabricator.wikimedia.org/T213933) [14:53:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:28] PROBLEM - Host labnet1002 is DOWN: PING CRITICAL - Packet loss = 100% [14:55:41] oh this pages? [14:55:43] oops, sorry [14:55:44] ignore! [14:56:25] RECOVERY - Host labnet1002 is UP: PING OK - Packet loss = 0%, RTA = 0.25 ms [14:56:46] <_joe_> paravoid: all cloud hosts do [14:56:59] <_joe_> but on the upside, you received the page immediately [14:57:24] I did too [14:57:31] And I am now off, long day o/ [14:57:31] :-) [14:57:35] (03CR) 10Ladsgroup: "It's a noop change, you can deploy it at anytime (just make sure it's not at middle of deployment of someone else)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503008 (owner: 10Lucas Werkmeister (WMDE)) [14:59:59] (03PS1) 10CDanis: Repool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/503015 [15:00:33] (03PS9) 10Fsero: puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [15:00:59] (03CR) 10jerkins-bot: [V: 04-1] puppet exec { } doesn't like a bash builtin. [puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [15:01:30] (03PS10) 10Fsero: puppet exec { } doesn't like a bash builtin. 
[puppet] - 10https://gerrit.wikimedia.org/r/503001 (https://phabricator.wikimedia.org/T214289) [15:02:26] (03CR) 10Gehel: [C: 03+1] "LGTM, kartotherian is under control again" [puppet] - 10https://gerrit.wikimedia.org/r/503015 (owner: 10CDanis) [15:02:47] (03CR) 10Arturo Borrero Gonzalez: "I don't like the commit title line. Would you please reference the other commit in the commit message body instead?" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [15:03:58] (03CR) 10CDanis: [C: 03+2] Repool kartotherian in codfw [puppet] - 10https://gerrit.wikimedia.org/r/503015 (owner: 10CDanis) [15:04:42] (03CR) 10Jgreen: [C: 03+1] wikimediafoundation.org: add spf record [dns] - 10https://gerrit.wikimedia.org/r/502589 (https://phabricator.wikimedia.org/T220412) (owner: 10Herron) [15:06:16] !log cdanis@puppetmaster1001 conftool action : set/pooled=true; selector: dnsdisc=kartotherian,name=codfw [15:06:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:00] FYI because of some work in Toolforge stashbot and the SAL tool will be messed up for an hour or so [15:07:02] (03PS2) 10Herron: wikimediafoundation.org: add spf record [dns] - 10https://gerrit.wikimedia.org/r/502589 (https://phabricator.wikimedia.org/T220412) [15:07:25] writes to the wiki should be working, but not to the elasticsearch store [15:07:37] can confirm writes to the wiki are working :) [15:10:50] PROBLEM - Hue Server on analytics1039 is CRITICAL: PROCS CRITICAL: 0 processes with command name python2.7, args /usr/lib/hue/build/env/bin/hue https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [15:11:10] this is me, testing host --^ [15:11:18] need to add permanent downtime [15:13:39] ahha okay [15:14:54] (03PS1) 10Elukey: role::analytics_test_cluster::hadoop::ui: silence alarms [puppet] - 10https://gerrit.wikimedia.org/r/503021 [15:14:57] (03CR) 10BryanDavis: [C: 03+1] striker: factor out common code to a shared profile 
[puppet] - 10https://gerrit.wikimedia.org/r/502472 (owner: 10Arturo Borrero Gonzalez) [15:15:42] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [15:16:00] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::hadoop::ui: silence alarms [puppet] - 10https://gerrit.wikimedia.org/r/503021 (owner: 10Elukey) [15:17:33] cdanis: thanks for the repool! [15:18:14] jouncebot: now [15:18:14] For the next 0 hour(s) and 41 minute(s): Deploy Url Shortener (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1400) [15:18:34] anyone deploying right now? [15:19:45] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+2] "let’s deploy it now then" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503008 (owner: 10Lucas Werkmeister (WMDE)) [15:20:27] gehel: np! [15:21:00] (03Merged) 10jenkins-bot: Update comment for Wikibase monolingual text languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503008 (owner: 10Lucas Werkmeister (WMDE)) [15:21:07] (03PS4) 10Filippo Giunchedi: WIP elastalert module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) [15:21:09] (03PS2) 10Filippo Giunchedi: aptrepo: reflow and sort distributions-wikimedia [puppet] - 10https://gerrit.wikimedia.org/r/503013 [15:21:11] (03PS2) 10Filippo Giunchedi: aptrepo: add component/elastalert [puppet] - 10https://gerrit.wikimedia.org/r/503014 (https://phabricator.wikimedia.org/T213933) [15:21:13] (03PS1) 10Filippo Giunchedi: aptrepo: validate debian-rfc822 files [puppet] - 10https://gerrit.wikimedia.org/r/503025 [15:23:27] !log lucaswerkmeister-wmde@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [[gerrit:503008|no-op comment update]] (duration: 01m 00s) [15:23:33] (03PS1) 10Muehlenhoff: Pull in buster udebs from unstable [puppet] - 10https://gerrit.wikimedia.org/r/503027 (https://phabricator.wikimedia.org/T213527) [15:23:50] 
Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:24:22] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet infrastructure for Debian buster - https://phabricator.wikimedia.org/T213546 (10MoritzMuehlenhoff) I think we can close this one? [15:25:38] (03CR) 10Volans: [C: 03+1] "LTGM, just an optional nit inline." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [15:26:04] 10Operations, 10monitoring, 10User-fgiunchedi: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10fgiunchedi) [15:26:06] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15701/scb1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/456316 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [15:26:11] ottomata: ^ [15:27:16] danke [15:27:59] <_joe_> godog: ouch [15:28:18] <_joe_> this reminds us how abstracting and centralizing logic makes the blast radius of bugs larger [15:28:47] 10Operations, 10Analytics, 10EventBus, 10monitoring, 10User-fgiunchedi: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10Ottomata) [15:29:11] (03CR) 10Herron: [C: 03+2] wikimediafoundation.org: add spf record [dns] - 10https://gerrit.wikimedia.org/r/502589 (https://phabricator.wikimedia.org/T220412) (owner: 10Herron) [15:30:17] (03CR) 10jenkins-bot: Update comment for Wikibase monolingual text languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503008 (owner: 10Lucas Werkmeister (WMDE)) [15:30:28] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good" [puppet] - 10https://gerrit.wikimedia.org/r/503013 (owner: 10Filippo Giunchedi) [15:31:38] 10Operations, 10Cassandra, 10serviceops, 10Core Platform Team (Session Management Service (CDP2)), and 4 others: Credentials needed for session storage Cassandra cluster - https://phabricator.wikimedia.org/T219560 (10Eevans) >>! 
In T219560#5102752, @gerritbot wrote: > Change 502890 **merged** by Dzahn: > [... [15:32:02] _joe_: not sure I understand what you mean, like in library bugs? [15:32:09] <_joe_> yes [15:32:25] <_joe_> we're shielding all services with sidecars [15:32:31] <_joe_> (sorry, in a meeting) [15:32:33] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [15:33:50] !log onimisionipe@deploy1001 Started deploy [kartotherian/deploy@13d9ebb] (stretch): Update stretch instance with latest code [15:34:00] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 (10herron) [15:34:11] (03CR) 10Filippo Giunchedi: [C: 03+1] Pull in buster udebs from unstable [puppet] - 10https://gerrit.wikimedia.org/r/503027 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [15:34:11] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Identify appropriate SPF record for domain wikimediafoundation.org - https://phabricator.wikimedia.org/T220412 (10herron) 05Open→03Resolved a:03herron The below SPF record is now active ` wikimediafoundation.o... 
[15:34:12] !log onimisionipe@deploy1001 Finished deploy [kartotherian/deploy@13d9ebb] (stretch): Update stretch instance with latest code (duration: 00m 22s) [15:34:32] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 (10herron) [15:34:42] 10Operations, 10Mail, 10Patch-For-Review: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 (10herron) 05Open→03Resolved a:03herron [15:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:36:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:33] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:39:42] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:39:45] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:39:48] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:39:48] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:40:25] (03CR) 10Andrew Bogott: [C: 03+1] "This seems fine. 
I'm sure that things will break in labtest and labtestn from this but since we're rebuilding everything that seems fine " [puppet] - 10https://gerrit.wikimedia.org/r/502966 (https://phabricator.wikimedia.org/T218024) (owner: 10Arturo Borrero Gonzalez) [15:41:33] (03CR) 10Muehlenhoff: [C: 03+1] "Looks good, although in my quick tests grep-dctrl seems to not bail out even on very malformed files :-)" [puppet] - 10https://gerrit.wikimedia.org/r/503025 (owner: 10Filippo Giunchedi) [15:41:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:30] (03CR) 10Muehlenhoff: [C: 03+1] aptrepo: add component/elastalert [puppet] - 10https://gerrit.wikimedia.org/r/503014 (https://phabricator.wikimedia.org/T213933) (owner: 10Filippo Giunchedi) [15:42:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:42:51] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [15:42:53] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [15:42:53] !log otto@deploy1001 scap-helm eventgate-analytics finished [15:42:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:04] herron: T193408 \o/ [15:44:42] woohoo [15:46:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [15:47:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:47:34] (03PS1) 10Arturo Borrero Gonzalez: Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) [15:47:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:05] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:49:14] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:49:16] T193408: SPF record for canonical domains - https://phabricator.wikimedia.org/T193408 [15:53:25] (03CR) 10Umherirrender: [C: 03+1] Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) (owner: 10C. Scott Ananian) [15:54:04] (03CR) 10Bstorm: [C: 03+1] "That's a lot all at once! However, it looks pretty thorough :)" [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [15:54:26] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [15:57:12] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet infrastructure for Debian buster - https://phabricator.wikimedia.org/T213546 (10Krenair) The only strange things I see in puppet runs on the buster instances are this: ` Notice: /Stage[main]/Nrpe/Package[nagios-plugins]/ensure: create... [15:59:17] (03CR) 10Alex Monk: [C: 03+1] redirects.dat: Remove wikisource.gr [puppet] - 10https://gerrit.wikimedia.org/r/500716 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [16:00:05] godog and _joe_: How many deployers does it take to do Puppet SWAT(Max 6 patches) deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:49] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: don't use old gridengine code [puppet] - 10https://gerrit.wikimedia.org/r/503040 (https://phabricator.wikimedia.org/T219362) [16:00:51] (03PS2) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502991 [16:00:59] (03CR) 10Alex Monk: "oh, sure, done" [puppet] - 10https://gerrit.wikimedia.org/r/502991 (owner: 10Alex Monk) [16:01:08] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:01:58] (03PS3) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:03:08] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: Prepare puppet infrastructure for Debian buster - https://phabricator.wikimedia.org/T213546 (10MoritzMuehlenhoff) >>! In T213546#5104344, @Krenair wrote: > The only strange things I see in puppet runs on a couple of buster instances I deal with are... [16:04:09] (03PS1) 10Volans: PuppetDB report: fix presence check and tenants [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 [16:04:42] (03CR) 10jerkins-bot: [V: 04-1] PuppetDB report: fix presence check and tenants [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 (owner: 10Volans) [16:05:04] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "According to PCC:" [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [16:06:25] (03CR) 10Bstorm: [C: 03+1] "Sorry!" 
[puppet] - 10https://gerrit.wikimedia.org/r/503040 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [16:07:25] (03PS2) 10Volans: PuppetDB report: fix presence check and tenants [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 [16:08:10] RECOVERY - Hue Server on analytics1039 is OK: PROCS OK: 1 process with command name python2.7, args /usr/lib/hue/build/env/bin/hue https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Hadoop/Administration [16:08:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] sonofgridengine: don't use old gridengine code [puppet] - 10https://gerrit.wikimedia.org/r/503040 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [16:10:39] (03PS2) 10Arturo Borrero Gonzalez: Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) [16:12:49] (03PS4) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:15:31] (03PS5) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:15:54] (03PS1) 10Arturo Borrero Gonzalez: sonofgridengine: don't use old gridengine code [puppet] - 10https://gerrit.wikimedia.org/r/503045 (https://phabricator.wikimedia.org/T219362) [16:16:57] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "PCC: https://puppet-compiler.wmflabs.org/compiler1002/15705/" [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [16:20:41] (03CR) 10CRusnov: "LGTM, a minor comment inline, that requires no resolution." 
(031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 (owner: 10Volans) [16:22:26] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: don't use old gridengine code [puppet] - 10https://gerrit.wikimedia.org/r/503045 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [16:22:36] 10Operations, 10netops: Juniper security advisories (April 2019) - https://phabricator.wikimedia.org/T220716 (10MoritzMuehlenhoff) [16:22:46] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:22:48] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:22:48] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:24:54] (03CR) 10Volans: PuppetDB report: fix presence check and tenants (031 comment) [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 (owner: 10Volans) [16:28:56] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:29:08] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [16:30:04] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [16:34:59] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503027 (https://phabricator.wikimedia.org/T213527) (owner: 10Muehlenhoff) [16:35:54] (03PS6) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:36:54] (03PS1) 10Jbond: debdeploy: add zsh autocompletion script [puppet] - 10https://gerrit.wikimedia.org/r/503058 [16:37:15] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) Putting this here for reference debian-installer For some reason, and I heard some rumours that this is a known bug, I had to disable USB support... [16:40:12] lvs2010 has been down for over a day [16:42:09] (03PS3) 10Volans: PuppetDB report improvements [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 [16:43:46] (03CR) 10Alex Monk: [C: 03+1] "Well that's what I was doing in Ie01982af but okay." [puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [16:45:01] (03Abandoned) 10Alex Monk: Move maintenance_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502622 (owner: 10Alex Monk) [16:45:13] (03PS23) 10Ayounsi: Bird anycast: add anycast_healthchecker [puppet] - 10https://gerrit.wikimedia.org/r/397723 (https://phabricator.wikimedia.org/T186550) [16:47:33] (03PS12) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [16:47:53] (03PS4) 10Volans: PuppetDB report improvements [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 [16:49:12] (03CR) 10Alexandros Kosiaris: "Whoops, I did not notice, sorry about that." 
[puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [16:49:54] (03CR) 10Alex Monk: [C: 03+1] "no worries" [puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [16:53:04] (03PS1) 10CDanis: fix tests and flake8 [software/conftool] - 10https://gerrit.wikimedia.org/r/503061 [16:54:06] (03CR) 10Bstorm: [C: 03+1] "Merged the dependent change and running a compiler run again to see if we missed anything else." [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [16:55:38] (03PS3) 10Bstorm: Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [16:55:44] (03PS7) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [16:57:07] (03PS5) 10Dzahn: apertium: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456316 (https://phabricator.wikimedia.org/T194724) [16:57:20] (03PS13) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [16:58:20] (03PS5) 10Volans: PuppetDB report improvements [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1700). 
[17:03:13] no parsoid deploy today [17:04:13] gehel: not sure if expected, but maps2004 has been using its uplink quite a bit: https://librenms.wikimedia.org/device/device=94/tab=port/port=8556/ [17:04:36] not an issue on its own, but can be the sign of an issue if not expected [17:04:40] XioNoX: yep, expected, we're reimaging the rest of the cluster, and it is the master [17:04:46] cool! [17:04:54] (03CR) 10Dzahn: [C: 03+2] apertium: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456316 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:04:59] so the whole dataset is transferred to the newly reimaged slaves [17:05:32] XioNoX: it should go back to normal by tomorrow latest [17:08:18] (03CR) 10Giuseppe Lavagetto: [C: 03+2] fix tests and flake8 [software/conftool] - 10https://gerrit.wikimedia.org/r/503061 (owner: 10CDanis) [17:09:37] (03CR) 10Dzahn: "noop on scb1001/scb1004" [puppet] - 10https://gerrit.wikimedia.org/r/456316 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [17:10:08] (03Merged) 10jenkins-bot: fix tests and flake8 [software/conftool] - 10https://gerrit.wikimedia.org/r/503061 (owner: 10CDanis) [17:13:53] 10Operations, 10Patch-For-Review: Decommission servermon - https://phabricator.wikimedia.org/T198939 (10Dzahn) >>! In T198939#4413445, @faidon wrote: > I'm using servermon for fact query regularly, but I think I'm one of the very few :) I admit I haven't played around much with puppetboard to adjust my use cas... 
[17:16:52] (03PS5) 10Dzahn: confd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) [17:17:05] 10Operations: HP Gen9 onboard controller review - https://phabricator.wikimedia.org/T216175 (10Marostegui) P408i and Gen 10 might be biting us: T220572#5104134 T220572#5104345 T220572#5104585 [17:20:52] !log mbsantos@deploy1001 Started deploy [proton/deploy@5cb8bbe]: Update chromium-renderer to 8988283 (T213362, T216191, T212322) [17:21:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:21:18] T216191: Replace Istanbul with nyc - https://phabricator.wikimedia.org/T216191 [17:21:18] T213362: Limit what URLs Proton can access - https://phabricator.wikimedia.org/T213362 [17:21:19] T212322: Verify Proton can handle Queue timeouts properly - https://phabricator.wikimedia.org/T212322 [17:22:25] !log mbsantos@deploy1001 Finished deploy [proton/deploy@5cb8bbe]: Update chromium-renderer to 8988283 (T213362, T216191, T212322) (duration: 01m 33s) [17:22:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:23:24] (03PS8) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [17:23:41] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T220691 (10RStallman-legalteam) Got it. Will update the thread once the NDA is signed and filed. Thanks! 
[17:24:21] (03PS14) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [17:33:07] (03PS9) 10Ema: cache: add profile::cache::varnish::frontend [puppet] - 10https://gerrit.wikimedia.org/r/502833 (https://phabricator.wikimedia.org/T219967) [17:34:36] 10Operations, 10Analytics, 10EventBus, 10monitoring, and 3 others: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 (10mobrovac) [17:39:07] 10Operations, 10PHP 7.0 support, 10Patch-For-Review, 10Performance-Team (Radar): Audit and sync INI settings as needed between HHVM and PHP 7 - https://phabricator.wikimedia.org/T211488 (10aaron) >>! In T211488#5103435, @Joe wrote: >> * [ ] Session >> >> We don't use the default session storage in PHP. Bu... [17:39:56] (03PS15) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [17:40:56] (03CR) 10CRusnov: Netbox module for Spicerack (0317 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:42:19] (03CR) 10Nuria: [C: 03+1] "Let's merge this!" 
[puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) (owner: 10Nuria) [17:42:53] !log onimisionipe@deploy1001 Started deploy [kartotherian/deploy@5394b59] (stretch): Insert maps2001 into stretch environment [17:43:15] !log onimisionipe@deploy1001 Finished deploy [kartotherian/deploy@5394b59] (stretch): Insert maps2001 into stretch environment (duration: 00m 22s) [17:45:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:46:48] (03PS4) 10Ottomata: Removing TestSearchSatisfaction from it being persisted to MySQL [puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) (owner: 10Nuria) [17:46:51] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [17:46:56] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Removing TestSearchSatisfaction from it being persisted to MySQL [puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) (owner: 10Nuria) [17:47:18] (03CR) 10Bstorm: "I see one more bit of old junk in the sonofgridengine module around the grid master. I'll clean that up." 
[puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [17:48:03] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [17:48:11] (03PS1) 10Bearloga: profile::product_analytics: fix dependencies [puppet] - 10https://gerrit.wikimedia.org/r/503069 [17:49:45] (03PS1) 10Bstorm: sonofgridengine: cleaning up old intermodule dep [puppet] - 10https://gerrit.wikimedia.org/r/503070 (https://phabricator.wikimedia.org/T219362) [17:50:48] (03CR) 10Bstorm: "Put up Id5bf7458, and the compiler should be happy after that is merged. I ran a bunch of searches and cannot find any that the compiler " [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [17:51:10] (03PS2) 10Bstorm: sonofgridengine: cleaning up old intermodule dep [puppet] - 10https://gerrit.wikimedia.org/r/503070 (https://phabricator.wikimedia.org/T219362) [17:52:47] (03PS6) 10Volans: PuppetDB report improvements [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 [17:55:47] (03CR) 10CRusnov: [C: 03+1] "LGTM" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 (owner: 10Volans) [17:56:17] (03CR) 10Volans: [C: 03+2] PuppetDB report improvements [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 (owner: 10Volans) [17:57:05] (03CR) 10Bstorm: [C: 03+2] sonofgridengine: cleaning up old intermodule dep [puppet] - 10https://gerrit.wikimedia.org/r/503070 (https://phabricator.wikimedia.org/T219362) (owner: 10Bstorm) [17:57:57] ottomata: ok to merge your change? 
[17:58:14] It seems I merged just when you did [17:59:28] (03PS7) 10Volans: PuppetDB report improvements [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 [17:59:33] PROBLEM - Unmerged changes on repository puppet on puppetmaster1001 is CRITICAL: There are 2 unmerged changes in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [18:00:03] !log increase replication factor on maps codfw cluster [18:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1800). [18:00:04] Pchelolo: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:11] bstorm_: is that yours? ^^^ (puppet change) [18:00:27] I'm here [18:00:31] Which one? [18:00:45] I asked ottomata if I can merge theirs [18:00:50] oh [18:00:52] sorry yes [18:00:58] missed that ping [18:00:58] Ok ;) [18:01:03] lol [18:01:07] the prompt is sitting there but i didn't type Yes [18:01:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:12] danke [18:01:15] * bstorm_ merging [18:01:30] Done it many times [18:02:07] RECOVERY - Unmerged changes on repository puppet on puppetmaster1001 is OK: No changes to merge. [18:02:19] thanks! 
[18:02:23] (03CR) 10Volans: [C: 03+2] PuppetDB report improvements [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503041 (owner: 10Volans) [18:02:44] \o/ thanks volans and chaomodus [18:02:51] :D [18:03:09] fwiw I fixed labnet1002/elastic2040/2043 earlier today [18:03:59] labnet1002 needed an apt install smbios-utils; getSystemId --service-tag --set=...; reboot [18:04:11] (and the elastic hosts had the serials swapped with each other) [18:04:19] thanks [18:04:52] aarghh.. [18:05:12] * paravoid suspects it didn't go very well [18:05:23] (03PS4) 10Bstorm: Toolforge: cleanup unused puppet code [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [18:05:32] good intuition ;) [18:05:33] PROBLEM - puppet last run on eventlog1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [18:05:40] and netbox is crappy at reporting errors in reports [18:07:37] !log depooling maps200[1-4] to set the correct cassandra replication factor for system auth [18:07:47] (03PS1) 10Volans: PuppetDB: fix typo [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 [18:09:46] (03CR) 10CRusnov: [C: 03+1] "lgtm" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 (owner: 10Volans) [18:09:52] wait a sec [18:10:51] RECOVERY - puppet last run on eventlog1002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [18:11:31] PROBLEM - HTTP availability for Varnish at codfw on icinga1001 is CRITICAL: job=varnish-upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:12:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:12:03] PROBLEM - HTTP availability for Nginx -SSL 
terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_upload site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:12:05] (03PS2) 10Volans: PuppetDB: fix typos [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 [18:12:25] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:12:43] (03CR) 10CRusnov: [C: 03+1] PuppetDB: fix typos [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 (owner: 10Volans) [18:13:30] (03PS3) 10Volans: PuppetDB: fix typos [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 [18:13:34] damn I keep finding typos, sorry [18:13:49] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [18:14:00] any ideas why VM objects are not hyperlinked? [18:14:25] yes [18:14:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:14:38] hehe [18:14:42] that is fixed in recent patch set [18:14:52] (03CR) 10CRusnov: [C: 03+1] "Yep :)" [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 (owner: 10Volans) [18:14:58] (03PS4) 10Volans: PuppetDB: fix typos [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 [18:15:01] paravoid: fixed in PS4 [18:15:21] ack [18:15:53] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:15:55] go volans [18:16:15] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:16:39] RECOVERY - HTTP availability for Varnish at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:16:39] this report is already finding tons of errors :) [18:17:11] (03CR) 10CRusnov: [C: 03+1] "Okay reviewed carefully. Looks good." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 (owner: 10Volans) [18:17:14] I spot-checked a few, most of them are true [18:17:33] oh wow yeah, this is nice [18:17:35] I've already run PS4 so if you refresh should be ok [18:17:37] already [18:17:44] (03CR) 10Volans: [C: 03+2] PuppetDB: fix typos [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503073 (owner: 10Volans) [18:18:05] volans: in the homepage I see both a "puppetdbserials.PuppetDB" and a "puppetdb.PuppetDB" [18:18:07] running puppet to make all clean [18:18:12] eh we renamed it [18:19:07] where do you see it? 
[18:19:16] I don't here https://netbox.wikimedia.org/extras/reports/ [18:19:19] https://netbox.wikimedia.org/ [18:19:51] interesting [18:20:08] I'll have a look [18:20:41] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [18:21:33] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [18:24:56] chaomodus: can you have a look at the test_netbox_vms_in_puppetdb section? [18:25:00] that looks a bit weird [18:25:18] as they are in puppetdb [18:25:33] at least the first ones [18:25:44] they are the busters I think [18:25:53] true [18:26:00] schema* I don't know though [18:26:08] not in puppetboard [18:26:18] new I think [18:26:38] (03PS1) 10CDanis: kubelet operational latencies: increase thresholds by 10x [puppet] - 10https://gerrit.wikimedia.org/r/503079 [18:26:40] T219556 [18:26:51] cdanis: <3 [18:27:32] didn't we had a bot that given the Tnumber was giving us the title and link? [18:28:00] yeah, it's lagging [18:28:02] stashbot: you there? [18:28:07] :) [18:28:11] cdanis: i enjoy that my patch was super conservative based on graphs and you're just like okay 10x [18:28:12] I think there's some ES work going on, and it's just slooow [18:28:22] Reedy: thanks [18:28:40] it's taking 2-3 mins to reply to !log too [18:28:49] (03PS2) 10CDanis: kubelet operational latencies: increase thresholds by 10x [puppet] - 10https://gerrit.wikimedia.org/r/503079 (https://phabricator.wikimedia.org/T219556) [18:29:43] chaomodus: haha I just saw your patch; had remembered the issue and wanted to find it (but did not remember the patch) [18:29:53] why is test_netbox_*vms* checking for serials? 
[18:30:02] I don't think 10x of what's there is unreasonable [18:30:14] but I'll let Alex comment before merging [18:30:28] paravoid: is not checking for serials, is getting the puppetdb hostlist from the API [18:30:33] that returns hostname: serial [18:30:59] we can improve the API and add an endpoint to return VMs though [18:31:02] based on is_virtual [18:31:16] that's probably what's messing it up [18:31:20] yeah [18:31:21] that's probably the culprit [18:31:22] y7u[ [18:31:24] yup [18:32:22] volans: should we make some endpoints that return vm list and hostlist? [18:33:14] maybe hosts/physical and hosts/virtual ? [18:33:23] yeah [18:33:26] that's what i was thinking [18:34:50] paravoid: for the old report showing is because it queries ReportResult in the home [18:34:54] for latest results [18:35:14] i'll clean it if it's needed [18:35:14] it will go away by itself or we can manually delete it from the db (meh) [18:35:22] it's not exposed in the admin console [18:35:33] yah you have to use dbshell [18:35:37] it's not *too* bad [18:35:38] but I can from nbshell [18:35:39] but annoying [18:35:43] yeah [18:35:43] :D [18:36:04] (03CR) 10Bstorm: [C: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/15711/" [puppet] - 10https://gerrit.wikimedia.org/r/503035 (https://phabricator.wikimedia.org/T219362) (owner: 10Arturo Borrero Gonzalez) [18:37:39] PROBLEM - Citoid LVS codfw on citoid.svc.codfw.wmnet is CRITICAL: /api (Ensure Zotero is working) timed out before a response was received https://wikitech.wikimedia.org/wiki/Citoid [18:38:53] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [18:40:00] akosiaris, fsero: citoid flapping ^, looks like zotero's fault? [18:44:46] Afk right now give me 10 mins [18:45:57] (03PS1) 10CRusnov: profile.puppetdb: Add is_virtual to the list of acceptable facts in uservice. 
[puppet] - 10https://gerrit.wikimedia.org/r/503082 [18:46:49] (03PS1) 10Krinkle: profiler: Increase max stack depth for sampling profiler to 250 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503083 (https://phabricator.wikimedia.org/T176916) [18:51:01] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/503082 (owner: 10CRusnov) [18:51:42] (03CR) 10CRusnov: [C: 03+2] profile.puppetdb: Add is_virtual to the list of acceptable facts in uservice. [puppet] - 10https://gerrit.wikimedia.org/r/503082 (owner: 10CRusnov) [18:52:10] (03PS1) 10Elukey: profile::analytics::cluster::packages::common: add libsasl2-dev [puppet] - 10https://gerrit.wikimedia.org/r/503084 [18:53:53] !log reindexing Greek, Turkish, and Irish wikis on elastic@eqiad and elastic@codfw (T217806) [18:54:19] !log disabling puppet on hosts using class 'confd' to safely deploy gerrit:456317 [18:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:15] T217806: Reindex Greek, Turkish, and Irish wikis to keep lang-specific lowercasing & enable empty-token filtering (Greek) - https://phabricator.wikimedia.org/T217806 [18:55:18] (03PS6) 10Dzahn: confd: base::service_unit -> systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) [18:55:29] (03PS1) 10Volans: Puppetdb: use the is_virtual fact [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503087 [18:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:55:59] (03CR) 10Ottomata: [C: 03+1] profile::analytics::cluster::packages::common: add libsasl2-dev [puppet] - 10https://gerrit.wikimedia.org/r/503084 (owner: 10Elukey) [18:56:02] (03CR) 10Dzahn: [C: 03+2] "puppet disabled on hosts using this, first deploying on single host deploy2001" [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [18:56:04] (03CR) 10jerkins-bot: [V: 04-1] Puppetdb: 
use the is_virtual fact [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503087 (owner: 10Volans) [18:56:10] (03CR) 10Ottomata: [C: 03+2] profile::analytics::cluster::packages::common: add libsasl2-dev [puppet] - 10https://gerrit.wikimedia.org/r/503084 (owner: 10Elukey) [18:56:18] (03PS2) 10Ottomata: profile::analytics::cluster::packages::common: add libsasl2-dev [puppet] - 10https://gerrit.wikimedia.org/r/503084 (owner: 10Elukey) [18:56:20] (03CR) 10CDanis: [C: 03+1] "Looks good! And I learned some things. Just a couple nits." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/503058 (owner: 10Jbond) [18:56:22] (03CR) 10Ottomata: [V: 03+2 C: 03+2] profile::analytics::cluster::packages::common: add libsasl2-dev [puppet] - 10https://gerrit.wikimedia.org/r/503084 (owner: 10Elukey) [18:56:30] (03PS3) 10Ottomata: profile::analytics::cluster::packages::common: add libsasl2-dev [puppet] - 10https://gerrit.wikimedia.org/r/503084 (owner: 10Elukey) [18:56:32] (03CR) 10Ottomata: [V: 03+2 C: 03+2] profile::analytics::cluster::packages::common: add libsasl2-dev [puppet] - 10https://gerrit.wikimedia.org/r/503084 (owner: 10Elukey) [18:57:07] (03PS2) 10Volans: Puppetdb: use the is_virtual fact [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503087 [18:58:29] (03PS3) 10Volans: Puppetdb: use the is_virtual fact [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503087 [18:59:17] 10Operations, 10Wikimedia-Mailing-lists: Change ownership of wikimania-program@lists.wikimedia.org - https://phabricator.wikimedia.org/T220641 (10Quiddity) a:05Dzahn→03fgiunchedi Add: icueva@wikimedia.org Remove: itai@wikimedia.org Thanks! [19:00:04] twentyafterfour: Time to snap out of that daydream and deploy MediaWiki train - Americas version. Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T1900). 
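(editor's note) The hosts/physical versus hosts/virtual split discussed above, keyed on the `is_virtual` fact, could look roughly like the sketch below. This is only an illustration, not the code from the Gerrit changes. The function name `split_hosts_by_virtual`, the shape of the input mapping, and the hostnames in the usage note are all assumptions. The string handling is there because a fact may arrive as either a boolean or the text "true"/"false", depending on how it is serialized.

```python
from typing import Dict, List, Tuple


def split_hosts_by_virtual(facts_by_host: Dict[str, dict]) -> Tuple[List[str], List[str]]:
    """Split hostnames into (physical, virtual) lists using the is_virtual fact.

    facts_by_host is a hypothetical mapping of hostname -> dict of facts.
    Hosts with no is_virtual fact are treated as physical (an assumption).
    """
    physical: List[str] = []
    virtual: List[str] = []
    for hostname, facts in sorted(facts_by_host.items()):
        raw = facts.get("is_virtual", False)
        # Accept a real boolean or a "true"/"false" string.
        if isinstance(raw, bool):
            is_virtual = raw
        else:
            is_virtual = str(raw).strip().lower() == "true"
        (virtual if is_virtual else physical).append(hostname)
    return physical, virtual
```

For example, `split_hosts_by_virtual({"host1001": {"is_virtual": False}, "vm1001": {"is_virtual": "true"}})` puts `host1001` in the physical list and `vm1001` in the virtual list, so a VM report no longer has to infer anything from serial numbers.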
[19:01:11] !log deploy2001 - stopping confd service and letting puppet restart it to confirm things are fine after switching to systemd::service class on confd hosts [19:01:55] (03CR) 10Dzahn: "noop on deploy2001/1001, rhodium..re-enabling puppet" [puppet] - 10https://gerrit.wikimedia.org/r/456317 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [19:02:29] !log re-enabling puppet on hosts using class confd [19:03:40] mobrovac: indeed there is a pod with huge memory usage [19:03:47] zotero pod [19:04:44] (03CR) 10CRusnov: [C: 03+1] "Looks good. Test looks good too." [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503087 (owner: 10Volans) [19:04:52] (03CR) 10Volans: [C: 03+2] Puppetdb: use the is_virtual fact [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/503087 (owner: 10Volans) [19:05:11] PROBLEM - puppet last run on mc1021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:05:43] PROBLEM - puppet last run on mc1026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:05:49] ah, crap [19:05:55] that looks like me [19:06:07] PROBLEM - puppet last run on mc2029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:06:15] a relationship issue with base::service_unit [19:06:20] paravoid: https://netbox.wikimedia.org/extras/reports/puppetdb.PuppetDB/ fixed, all yours [19:06:37] PROBLEM - puppet last run on mc2022 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:06:45] \o/ [19:06:51] that is awesome, thanks guys [19:07:27] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [19:08:33] (03PS1) 10Dzahn: Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 [19:09:06] (03CR) 10jerkins-bot: [V: 04-1] Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 (owner: 10Dzahn) [19:09:23] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:09:53] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:09:55] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:10:31] PROBLEM - puppet last run on mc2028 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:10:31] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [19:10:50] (03PS2) 10Dzahn: Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 [19:11:00] i am taking care of the mc issue, but i dont know about the puppetmaster unmerged changed [19:11:05] change [19:11:16] (03CR) 10jerkins-bot: [V: 04-1] Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 (owner: 10Dzahn) [19:11:42] that might be elukey ? 
[19:11:55] PROBLEM - puppet last run on mc2034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:12:06] (03PS3) 10Dzahn: Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 [19:12:34] (03CR) 10jerkins-bot: [V: 04-1] Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 (owner: 10Dzahn) [19:12:43] PROBLEM - puppet last run on mc2036 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:12:45] PROBLEM - puppet last run on mc1034 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:13:22] (03CR) 10Dzahn: [C: 03+2] Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 (owner: 10Dzahn) [19:13:37] PROBLEM - puppet last run on mc1033 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:14:03] (03CR) 10Dzahn: [C: 03+2] "the jenkins vote is because i am "adding" base::service_unit in this revert.. the point is of course to get rid of that after relationship" [puppet] - 10https://gerrit.wikimedia.org/r/503089 (owner: 10Dzahn) [19:14:04] cdanis: o/ [19:14:18] cdanis: is there an issue? [19:14:20] (03CR) 10Dzahn: [V: 03+2 C: 03+2] Revert "confd: base::service_unit -> systemd::service" [puppet] - 10https://gerrit.wikimedia.org/r/503089 (owner: 10Dzahn) [19:14:27] PROBLEM - puppet last run on mc2020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:14:29] PROBLEM - puppet last run on mc1020 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:14:31] PROBLEM - puppet last run on mc2030 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:14:39] PROBLEM - puppet last run on mc1032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:15:09] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [19:15:13] elukey: maybe you know about the unmerged change on a master ^ ? [19:15:13] just trying to figure out the unmerged changes alert [19:15:17] but looks like it is done now? [19:15:45] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. [19:15:47] i think it's done because i ran puppet-merge again for my revert [19:15:47] RECOVERY - puppet last run on mc1021 is OK: OK: Puppet is currently enabled, last run 17 seconds ago with 0 failures [19:15:49] yes [19:15:51] PROBLEM - puppet last run on mc1023 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:15:51] (03PS1) 1020after4: all wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503090 [19:15:53] (03CR) 1020after4: [C: 03+2] all wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503090 (owner: 1020after4) [19:15:57] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [19:15:59] mutante: did it prompt you about merging someone else's change? [19:15:59] so I filed a code review and andrew merged it, maybe it didn't propage the change everywhere [19:16:04] and you puppet change fixed it [19:16:08] mutante: --^ [19:16:25] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. 
[19:16:27] elukey: yeah the alert is about pushing a change and not running puppet-merge, AIUI [19:16:29] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Miriam) HI All, I quickly tested a simple training task on stat1005,... [19:16:47] cdanis: yes, it did. i also merged this: [19:16:47] modules/confd/manifests/init.pp | 8 ++++---- [19:16:53] modules/profile/manifests/analytics/cluster/packages/common.pp | 2 ++ [19:16:56] ^ [19:16:57] PROBLEM - puppet last run on mc1029 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:17:01] (03Merged) 10jenkins-bot: all wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503090 (owner: 1020after4) [19:17:09] PROBLEM - puppet last run on mc1022 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:17:09] PROBLEM - puppet last run on mc2021 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:17:14] but it did NOT make me type "multiple" [19:17:31] so https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/503084/ is the analytics stuff [19:17:37] that andrew +2ed and merged [19:17:40] should I wait for this puppet problem to be resolved before I sync the train for wmf.25? [19:17:47] PROBLEM - puppet last run on mc1019 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:17:50] i always used puppetmaster1001 [19:17:56] and the alert showed on all others [19:18:12] (I know they are not in conflict but maybe it's better not to have two things happening at once?) 
[19:18:20] 10Operations, 10Release Pipeline, 10Core Platform Team Kanban (Done with CPT), 10Release-Engineering-Team (Watching / External), 10Services (done): Track and install additional npm packages for all service container images - https://phabricator.wikimedia.org/T205911 (10mobrovac) [19:18:28] twentyafterfour: should be over in a moment [19:18:32] cool [19:18:46] mutante: I think that your merge and Andrew's might have raced [19:18:59] it is the only explanation if you didn't type 'multiple' [19:19:02] it has already happened [19:19:13] elukey: yes, it looks like that. the part that i find odd is that i was able to merge multiple without having to type "multiple" [19:19:22] ack [19:19:27] PROBLEM - puppet last run on mc2026 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:19:45] RECOVERY - puppet last run on mc2020 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:19:47] PROBLEM - puppet last run on mc2032 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:19:56] i am running puppet on 'failed-only' [19:19:59] the lack of synchronization wrt puppet-merge continues to scare me :) [19:20:11] PROBLEM - puppet last run on mc1025 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [19:20:56] !log running puppet on mc* hosts with --failed-only [19:21:09] PROBLEM - puppet last run on mc1031 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [19:21:27] !log twentyafterfour@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.33.0-wmf.25 refs T206679 [19:21:33] RECOVERY - puppet last run on mc1026 is OK: OK: Puppet is currently enabled, last run 21 seconds ago with 0 failures [19:21:59] RECOVERY - puppet last run on mc2029 is OK: OK: Puppet is currently enabled, last run 54 seconds ago with 0 failures [19:22:13] RECOVERY - puppet last run on mc1029 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:22:25] RECOVERY - puppet last run on mc1022 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:22:27] RECOVERY - puppet last run on mc2021 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:27] RECOVERY - puppet last run on mc2022 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:22:27] RECOVERY - puppet last run on mc2034 is OK: OK: Puppet is currently enabled, last run 44 seconds ago with 0 failures [19:22:29] twentyafterfour: good to go [19:22:41] anyone know how to clean up the hhvm cache on mwdebug1001? Several files on that host are corrupt because the disk was full during previous deployments ... now I'm getting the following: [19:22:41] PROBLEM - Apache HTTP on mwdebug1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:22:43] sqlite3_step() returned SQLITE_CORRUPT. Path: '/var/cache/hhvm/fcgi.hhbc.sq3' [19:22:55] is it as simple as just deleting that file? 
[19:23:05] RECOVERY - puppet last run on mc1019 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [19:23:19] RECOVERY - puppet last run on mc1034 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:23:19] RECOVERY - puppet last run on mc2036 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:23:27] PROBLEM - Apache HTTP on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:23:27] PROBLEM - Apache HTTP on mw1288 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:23:29] PROBLEM - HHVM rendering on mw1314 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:23:35] uh oh [19:23:36] what [19:23:41] PROBLEM - HHVM rendering on mw1235 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [19:23:51] RECOVERY - Apache HTTP on mwdebug1001 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 621 bytes in 0.062 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:24:13] RECOVERY - puppet last run on mc1033 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:24:18] oh just timeouts normal during recent trains I guess [19:24:35] RECOVERY - Apache HTTP on mw1288 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.044 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:24:35] RECOVERY - Apache HTTP on mw1314 is OK: HTTP OK: HTTP/1.1 301 Moved Permanently - 616 bytes in 0.051 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:24:37] RECOVERY - HHVM rendering on mw1314 is OK: HTTP OK: HTTP/1.1 200 OK - 76087 bytes in 0.405 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:24:45] 
RECOVERY - puppet last run on mc2026 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:24:51] RECOVERY - HHVM rendering on mw1235 is OK: HTTP OK: HTTP/1.1 200 OK - 76088 bytes in 1.468 second response time https://wikitech.wikimedia.org/wiki/Application_servers [19:25:02] (03CR) 10jenkins-bot: all wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/503090 (owner: 1020after4) [19:25:05] RECOVERY - puppet last run on mc1020 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:25:05] RECOVERY - puppet last run on mc2030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:25:05] RECOVERY - puppet last run on mc2032 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:25:15] RECOVERY - puppet last run on mc1032 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:25:27] RECOVERY - puppet last run on mc1025 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:26:21] RECOVERY - puppet last run on mc2028 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [19:26:25] RECOVERY - puppet last run on mc1023 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:26:27] RECOVERY - puppet last run on mc1031 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:26:55] so, mwdebug1001 has some corrupt files, specifically '/var/cache/hhvm/fcgi.hhbc.sq3' and I'm not sure what else [19:27:14] anyone know if we can just delete that file and let it regen? 
[19:31:47] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:34:25] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:36:02] twentyafterfour: judging by https://wikitech.wikimedia.org/wiki/HHVM#Rebuild_the_packages I think it's safe to delete that file (with HHVM stopped) [19:36:38] !log mediawiki error rate seems to be back to normal after deploying 1.33.0-wmf.25, the new branch looks stable refs T206679 [19:38:16] not sure what happened with the error rate looking at logstash; lots of timeouts (mostly in the new release) from 19:20-19:30 [19:39:08] cdanis: as far as I understand it, that is hhvm timing out while it tries to rebuild the bytecode cache [19:39:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:39:11] T206679: 1.33.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T206679 [19:39:16] ahh [19:39:19] it's happened with most recent train deployments afaik [19:39:36] er every one I have conducted anyway [19:39:53] 10Operations, 10Puppet, 10Cloud-Services, 10Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10Dzahn) I switched confd and it seemed to be just fine on all tested hosts.. except.. it wasn't on mc* hosts because of this puppet relationship issue:... [19:39:59] got it [19:40:01] I think it always happened but previously the 60 second timeouts weren't getting logged [19:40:12] do you need help fixing the file on mwdebug1001 or can you do it yourself? 
[19:40:25] cdanis: I don't think I have permission but I will try [19:40:27] 10Operations, 10Puppet, 10Cloud-Services, 10Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10Dzahn) [19:41:07] nope I don't have permission [19:41:17] cdanis: I guess I do need help [19:41:30] ok I will attempt [19:41:56] 10Operations, 10Puppet, 10Cloud-Services, 10Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10Dzahn) Checking the box for adding a style guide violation. That is already happening; as i could confirm when my revert above got downvoted for just that. [19:43:39] hhvm taking over a minute to stop gracefully [19:44:26] hmm [19:44:34] ok it's back [19:46:24] !log cdanis@mwdebug1001.eqiad.wmnet ~ % sudo systemctl stop hhvm && sudo rm /var/cache/hhvm/fcgi.hhbc.sq3 && sudo systemctl start hhvm [19:47:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:51:52] (03PS1) 10Andrew Bogott: tools-redis: move service from 1001 (in eqiad) to 1002 (in eqiad1-r) [puppet] - 10https://gerrit.wikimedia.org/r/503096 [19:52:09] thanks cdanis [19:54:49] np [19:55:21] glad to learn something about HHVM hopefully just in time for that knowledge to be no longer relevant to my work :D [19:59:28] (03PS1) 10Dzahn: confd/redis::multidc: switch to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/503097 (https://phabricator.wikimedia.org/T194724) [20:03:21] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:11:11] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [20:28:36] wikibugs fell off the world? 
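(editor's note) The fix logged above at 19:46:24 stops HHVM, removes the corrupt bytecode cache, and starts HHVM again, with each step gated on the previous one succeeding. A minimal sketch of the same sequence is below. The service name and cache path come from the log; the helper names `rebuild_steps` and `run_steps` are made up for illustration. It uses `subprocess.check_call`, which raises on a non-zero exit status, so a failed step aborts the rest much like `&&` does in the shell.

```python
import subprocess
from typing import Callable, List, Sequence

# Path of the corrupt cache file, as quoted in the log.
CACHE_PATH = "/var/cache/hhvm/fcgi.hhbc.sq3"


def rebuild_steps(service: str = "hhvm", cache_path: str = CACHE_PATH) -> List[List[str]]:
    """Return the commands in order: stop the service, delete the cache, start it."""
    return [
        ["sudo", "systemctl", "stop", service],
        ["sudo", "rm", cache_path],
        ["sudo", "systemctl", "start", service],
    ]


def run_steps(
    steps: Sequence[Sequence[str]],
    runner: Callable[[Sequence[str]], object] = subprocess.check_call,
) -> None:
    """Run each command in turn; the default runner raises on the first failure."""
    for step in steps:
        runner(step)
```

Passing a different `runner` (for instance `print`) gives a dry run that shows the commands without executing them. HHVM is stopped before the file is deleted, as suggested at 19:36:02.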
[20:32:02] James_F: tried to follow docs but cant ssh to tools-bastion or tools-bastion-03 [20:32:09] with my labs key [20:34:07] Yeah, neither can I. [20:34:30] i can ssh to restricted-bastion.wmflabs but that doesnt help to fix this [20:34:51] Bryan mentioned earlier that there's some toolforge maintenance today, which will affect some tools [20:34:54] is there maybe a shinken alert about tools bastion going down? [20:34:56] aha! [20:35:43] [17:06] FYI because of some work in Toolforge stashbot and the SAL tool will be messed up for an hour or so [20:36:06] 17 CET, so 5.5 hrs ago [20:36:14] gotcha, thanks [20:37:33] mutante: does login.tools.wmflabs.org work for you (with a non-root ssh key)? [20:37:59] Yes, for me. [20:38:21] tools-bastion-03 was deleted a few weeks ago. tools-sgebastion-07 is the instance that login.tools.wmflabs.org is pointing at now [20:43:53] bd808: got distracted. yes, it does work. thanks [20:44:28] uses sudo -i to become wikibugs without being in the group for it [20:46:11] !log deleting job of wikibugs-phab-listener in an attempt to restart it [20:46:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:47:00] (also my "hour or so" estimate ended up being badly off) [20:47:07] docs say: "Then, run python3 manage.py start_jobs" [20:47:16] python3: can't open file 'manage.py': [Errno 2] No such file or directory [20:48:02] did you cd to the wikibugs2 directory? [20:48:56] Krenair: no, but i did now. and ran the command [20:49:00] (03CR) 10Andrew Bogott: [C: 03+2] tools-redis: move service from 1001 (in eqiad) to 1002 (in eqiad1-r) [puppet] - 10https://gerrit.wikimedia.org/r/503096 (owner: 10Andrew Bogott) [20:49:09] adding that to the docs [20:49:18] there it is, James_F [20:49:21] huh, wonder how that bit got left out [20:49:34] thought it used to be in there [20:49:41] thanks, mutante. 
[20:50:18] thanks Krenair , added [20:50:28] yep, thanks mutante [20:55:02] (03PS2) 10Dzahn: confluence::kafka::broker: update outdated comment on base::service_unit [puppet] - 10https://gerrit.wikimedia.org/r/503108 (https://phabricator.wikimedia.org/T194724) [20:55:40] (03CR) 10Dzahn: [C: 03+2] "just a comment but makes it look like it wasn't converted yet" [puppet] - 10https://gerrit.wikimedia.org/r/503108 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [21:01:35] 10Operations, 10Puppet, 10Cloud-Services, 10Traffic, and 3 others: Deprecate `base::service_unit` in puppet - https://phabricator.wikimedia.org/T194724 (10Dzahn) https://gerrit.wikimedia.org/r/c/operations/puppet/+/499769 [21:04:11] (03PS1) 10Dzahn: toollabs::kube2proxy: switch to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/503116 (https://phabricator.wikimedia.org/T194724) [21:04:14] (03PS1) 10Dzahn: toollabs::maintain_kubeusers: switch to systemd::service [puppet] - 10https://gerrit.wikimedia.org/r/503117 (https://phabricator.wikimedia.org/T194724) [21:05:55] PROBLEM - HHVM rendering on mw1223 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers [21:05:56] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Nuria) >Next, I would like to test a more complex task, and measure ho... [21:07:05] RECOVERY - HHVM rendering on mw1223 is OK: HTTP OK: HTTP/1.1 200 OK - 75751 bytes in 0.366 second response time https://wikitech.wikimedia.org/wiki/Application_servers [21:07:49] (03PS1) 10Dzahn: toollabs::proxy: switch to systemd::service, delete upstart template [puppet] - 10https://gerrit.wikimedia.org/r/503119 (https://phabricator.wikimedia.org/T194724) [21:28:00] https://tools.wmflabs.org/sal/ is working again. 
!log events between 12:48UTC and 20:46UTC are missing from the index there. [21:30:19] (03PS1) 10Dzahn: acmechief: add test_host parameter, disable some monitoring on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/503122 [21:36:05] (03PS1) 10Bstorm: labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) [21:36:53] (03CR) 10jerkins-bot: [V: 04-1] labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:43:41] (03CR) 10Dzahn: [V: 03+1] "compiler output looks good, like what is intended to happen: https://puppet-compiler.wmflabs.org/compiler1002/15712/" [puppet] - 10https://gerrit.wikimedia.org/r/503122 (owner: 10Dzahn) [21:43:51] (03PS2) 10Bstorm: labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) [21:44:45] (03CR) 10jerkins-bot: [V: 04-1] labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [21:45:08] (03PS2) 10Dzahn: acmechief: add test_host parameter, disable some monitoring on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/503122 [21:52:33] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [22:15:08] (03PS2) 10Dzahn: sessionstore: debug super_password lookup issue [puppet] - 10https://gerrit.wikimedia.org/r/502912 [22:15:31] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.25/extensions/WikibaseMediaInfo/resources/: Hot-deploy fix for WBMI variable cache miss T220665 (duration: 00m 55s) [22:15:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:15:35] T220665: Caption edit does not save - https://phabricator.wikimedia.org/T220665 [22:17:41] ACKNOWLEDGEMENT - EDAC syslog messages on wtp2013 is CRITICAL: 17 ge 4 daniel_zahn in this state since May 2018 https://phabricator.wikimedia.org/T194174 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [22:17:41] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on wtp2013 is CRITICAL: 67.01 ge 4 daniel_zahn in this state since May 2018 https://phabricator.wikimedia.org/T194174 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=wtp2013&var-datasource=codfw+prometheus/ops [22:19:32] (03PS3) 10Bstorm: labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) [22:20:19] (03CR) 10jerkins-bot: [V: 04-1] labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:21:25] (03PS4) 10Bstorm: labstore: refactor the backup roles so they will match the main roles [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) [22:23:01] ACKNOWLEDGEMENT - Long running screen/tmux on sessionstore1001 is CRITICAL: CRIT: Long running SCREEN process. (user: eevans PID: 220733, 1924019s 1728000s). 
daniel_zahn new host being setup [22:25:26] (03CR) 10Bstorm: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/503123 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [22:31:16] (03PS3) 10Dzahn: sessionstore: debug super_password lookup issue [puppet] - 10https://gerrit.wikimedia.org/r/502912 [22:32:08] (03CR) 10jerkins-bot: [V: 04-1] sessionstore: debug super_password lookup issue [puppet] - 10https://gerrit.wikimedia.org/r/502912 (owner: 10Dzahn) [22:34:21] ACKNOWLEDGEMENT - tools project instance distribution on cloudcontrol1003 is CRITICAL: CRITICAL: paws class instances not spread out enough daniel_zahn https://phabricator.wikimedia.org/T220773 https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:38:55] RECOVERY - tools project instance distribution on cloudcontrol1003 is OK: OK: All critical instances are spread out enough https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [22:51:22] (03PS4) 10Dzahn: sessionstore: debug super_password lookup issue [puppet] - 10https://gerrit.wikimedia.org/r/502912 [22:53:39] Reedy: BTW – scary, or insanely scary: https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/501003 :-) [22:54:37] I wonder if we could do with a puppet-compiler type thing but for mediawiki-config James_F [22:55:28] Krenair: Hmm. How would that work? [22:55:41] dump $GLOBALS before and after for a specific wiki? [22:55:47] well I guess you'd put a change number in and a list of database names [22:55:52] check GLOBALS [22:55:54] run some diffs [22:56:01] that sort of thing [22:56:20] yeah like Reedy said [22:56:46] (03PS5) 10Dzahn: sessionstore: create profile class to fix password lookups [puppet] - 10https://gerrit.wikimedia.org/r/502912 (https://phabricator.wikimedia.org/T219560) [22:56:57] Right. [22:57:03] Could be interesting. 
[22:57:40] (03CR) 10Dzahn: [V: 03+1 C: 03+1] "this fixes the password lookup issue finally -> https://puppet-compiler.wmflabs.org/compiler1002/15717/sessionstore1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502912 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [22:57:47] It'd help with some of the "will this result in the changes I expect" [22:59:15] We'd have to mask out the bits of $GLOBALS we don't want public, though. [22:59:28] well it wouldn't run with the live prod secrets James_F [22:59:36] puppet-compiler uses labs/private.git [22:59:44] Right. [22:59:56] and in mediawiki-config we have private/PrivateSettings.php.example [23:00:02] which is not actually complete IIRC [23:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190411T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS. [23:00:13] but that would be simple enough for someone to fix if they wanted [23:00:48] > echo count($GLOBALS); [23:00:48] 2325 [23:01:13] So… we'd have 1000 wikis x 2300 globals. The diff will be pretty. [23:01:25] wouldn't always have to run it against all wikis [23:01:39] If we wanted to spot that you're not breaking random other wikis, you really would, surely? [23:01:41] puppet-compiler will let you do all hosts but it will also let you do one or a list or whatever [23:01:59] well you could do all wikis [23:02:16] but at that point dealing with the diff that comes out is your problem :p [23:02:21] `foreach wiki in all.dblist do  … ` [23:02:54] Krenair: File a quick Phab task so we don't forget? 
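A minimal sketch of the "dump $GLOBALS before and after, run some diffs" idea discussed above, assuming the per-wiki globals have already been exported as plain dictionaries (the export step is not shown). The function names, the `private_keys` mask and the `wgExample*` setting names are all hypothetical illustrations, not part of any existing tool:

```python
from typing import Any, Dict, FrozenSet, Tuple

_MISSING = object()  # sentinel for a setting absent on one side

Globals = Dict[str, Any]


def diff_globals(
    before: Globals,
    after: Globals,
    private_keys: FrozenSet[str] = frozenset(),
) -> Dict[str, Tuple[Any, Any]]:
    """Return {name: (old, new)} for each setting that differs.

    Names listed in private_keys are skipped, so values that should
    not be public never reach the report.
    """
    changes = {}
    for key in sorted(set(before) | set(after)):
        if key in private_keys:
            continue
        old = before.get(key, _MISSING)
        new = after.get(key, _MISSING)
        if old != new:
            changes[key] = (
                "<unset>" if old is _MISSING else old,
                "<unset>" if new is _MISSING else new,
            )
    return changes


def diff_wikis(
    before_by_wiki: Dict[str, Globals],
    after_by_wiki: Dict[str, Globals],
    private_keys: FrozenSet[str] = frozenset(),
) -> Dict[str, Dict[str, Tuple[Any, Any]]]:
    """Diff every wiki, keeping only the wikis where something changed."""
    report = {}
    for wiki in sorted(set(before_by_wiki) | set(after_by_wiki)):
        changes = diff_globals(
            before_by_wiki.get(wiki, {}),
            after_by_wiki.get(wiki, {}),
            private_keys,
        )
        if changes:
            report[wiki] = changes
    return report


if __name__ == "__main__":
    # Hypothetical data: one wiki changes, the other does not.
    before = {
        "testwiki": {"wgExampleFlag": False, "wgExampleSecret": "a"},
        "otherwiki": {"wgExampleFlag": False},
    }
    after = {
        "testwiki": {"wgExampleFlag": True, "wgExampleSecret": "b"},
        "otherwiki": {"wgExampleFlag": False},
    }
    print(diff_wikis(before, after, frozenset({"wgExampleSecret"})))
```

Reporting only the wikis with changes is one way to keep a "1000 wikis x 2300 globals" run readable; running against a single wiki or a short list would just mean passing smaller dictionaries.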
[23:02:59] yeah [23:03:18] mediawiki-compiler [23:04:54] wondering what to tag it with [23:05:04] feels a bit like releng's type of thing [23:05:50] https://phabricator.wikimedia.org/T220775 [23:06:30] It also fits with the "Reliability" bit of SRE's name. [23:06:58] (03PS6) 10Dzahn: sessionstore: create profile class to fix password lookups [puppet] - 10https://gerrit.wikimedia.org/r/502912 (https://phabricator.wikimedia.org/T219560) [23:07:28] There, splatted a few appropriate tags on it. [23:08:07] ty [23:08:10] (03PS7) 10Dzahn: sessionstore: create profile class to fix password lookups [puppet] - 10https://gerrit.wikimedia.org/r/502912 (https://phabricator.wikimedia.org/T219560) [23:09:00] (03CR) 10Dzahn: [C: 03+2] sessionstore: create profile class to fix password lookups [puppet] - 10https://gerrit.wikimedia.org/r/502912 (https://phabricator.wikimedia.org/T219560) (owner: 10Dzahn) [23:10:41] i thought before it would be a nice bot feature to just do something like "!ticket 20" and it makes a phab ticket with the last 20 IRC lines on it [23:11:05] huh, does anyone notice icon's have gone missing on phab? [23:11:16] paladox: no? [23:11:38] When i went to https://phabricator.wikimedia.org/F28620640 it shows icons as squares (grey) [23:11:42] paladox: are they on wmfusercontent.org ? [23:12:07] paladox: gerrit or phab? [23:12:14] phab [23:12:20] see
10Operations, 10User-ArielGlenn: missed pages from kafka outage on July 11 2018 - https://phabricator.wikimedia.org/T199890 (10Dzahn) ` Why has my message been rejected? A message can be rejected by the network for a number of reasons. If you have set the originator to a value more than 11 characters then so... [23:39:40] (03PS3) 10Alex Monk: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) [23:40:06] (03CR) 10Alex Monk: "Have not tested this PS" [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [23:40:08] (03CR) 10jerkins-bot: [V: 04-1] openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [23:40:36] (03PS1) 10Volans: flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 [23:40:55] (03PS4) 10Alex Monk: openstack::puppet::master::encapi: Avoid nginx-apache conflict [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) [23:42:33] 10Operations, 10User-ArielGlenn: missed pages from kafka outage on July 11 2018 - https://phabricator.wikimedia.org/T199890 (10Dzahn) This all sounds like some kind of rate limiting happens on the network of the local provider and AQL can't do much about it, though we could try and go the recommended route to... 
[23:44:36] (03CR) 10jerkins-bot: [V: 04-1] flake8: enforce import order and adopt W504 [software/spicerack] - 10https://gerrit.wikimedia.org/r/503147 (owner: 10Volans) [23:48:30] (03CR) 10Alex Monk: "nor this one" [puppet] - 10https://gerrit.wikimedia.org/r/501587 (https://phabricator.wikimedia.org/T171188) (owner: 10Alex Monk) [23:51:07] mutante: oh? [23:51:09] \o/ [23:54:01] mutante: auuuuhh [23:56:15] mutante: does this mean we can remove the include in the restbase role? [23:56:35] mutante: btw, thanks for taking care of this! [23:57:28] !log decommissioning cassandra-b, restbase2008 -- T208087 [23:57:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:32] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087