[00:00:35] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (RobH)
[00:02:09] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (RobH) a:RobH→Papaul @papaul, Please note I cannot see any server on the switch stack with this label, so I was...
[00:02:42] Operations, ops-codfw, decommission, Patch-For-Review, cloud-services-team (Kanban): decommission: cloudnet2001-dev.codfw.wmnet - https://phabricator.wikimedia.org/T218025 (RobH)
[00:10:20] Operations, Traffic, Wikidata, Wikidata-Query-Service, User-Smalyshev: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Smalyshev) Looking at the distribution of Special:EntityData fetches, if we cache entities under 10K, we wi...
[00:25:00] (CR) Alex Monk: "-> Ie711940f" [puppet] - https://gerrit.wikimedia.org/r/498796 (owner: Alex Monk)
[00:55:24] !log repooled wdq1006
[00:55:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:56:37] !log depooled wdq1004 to catch up
[00:56:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:05:24] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 6370 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[01:06:23] (Abandoned) CRusnov: Create a spicerack module for accessing RAPI [software/spicerack] - https://gerrit.wikimedia.org/r/497656 (owner: CRusnov)
[01:13:14] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:23:20] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[01:31:46] (PS1) Volans: check_icinga: fix retry logic [software/external-monitoring] - https://gerrit.wikimedia.org/r/499368
[02:02:58] (CR) CRusnov: "Based on our conversation I've addressed the changes we discussed. I've also separated the instance and cluster tests into their own tests" (17 comments) [software/spicerack] - https://gerrit.wikimedia.org/r/499032 (owner: CRusnov)
[02:05:56] (PS11) CRusnov: Port MakeVM to cookbook. [cookbooks] - https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963)
[02:06:42] (CR) CRusnov: "Okay minor changes made. Ready to go once we've converged on spicerack module." (2 comments) [cookbooks] - https://gerrit.wikimedia.org/r/496527 (https://phabricator.wikimedia.org/T203963) (owner: CRusnov)
[02:18:37] (CR) CRusnov: "Since I can't find an example in production of setting logging_enabled to false, and I'm certain that this change will move the logs to th" [puppet] - https://gerrit.wikimedia.org/r/499288 (owner: CRusnov)
[02:31:16] Reading backlogs
[02:31:32] Damn! I missed a lot
[02:46:06] PROBLEM - Mjolnir bulk update failure check - codfw on icinga1001 is CRITICAL: 3.247e+06 gt 2 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen
[02:52:30] i feel like that alert is wrong, clearly there have been no failures in the last 24 hours
[02:53:33] oh, actually nevermind it is failing on codfw ... more problems from mixed cluster :S
[02:54:02] basically we have to send different update scripts to 5.x and 6.x, and that actually depends on the version of the node holding the primary shard ...
[02:54:42] ebernhardson: can you silence it pls
[02:54:51] onimisionipe: i dont know how :(
[02:55:14] :)
[02:55:19] Ok will do
[02:55:30] silence for 5 days should be enough
[02:55:41] it will then be active for the next round of bulk updates next week
[02:56:34] * ebernhardson relatedly wishes grafana could put eqiad and codfw sourced data on the same graph...
[02:59:07] hmm, maybe just need to figure out how. Our repo says we have grafana 5.4.2, and i found a post on grafana website that says you can put multiple datasources in same graph in 2.5.4
[02:59:55] ebernhardson: yes you can. There are graphs on elastic with multiple data sources
[03:01:27] trying now, would be much better for these failure graphs to show both clusters at once
[03:01:59] Ok
[03:08:10] done, much better, i hope :) https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1
[03:09:40] Looks good to me
[03:09:59] I will go back to sleeping
[03:10:11] doh, did the alert wake you? sorry :(
[03:10:45] Oh..no it didn't. :)
[04:06:28] PROBLEM - puppet last run on maps1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:32:44] RECOVERY - puppet last run on maps1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[05:52:27] (PS1) Marostegui: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499379
[05:55:59] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499379 (owner: Marostegui)
[05:56:14] (CR) Marostegui: ">" (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:56:58] (Merged) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499379 (owner: Marostegui)
[05:58:23] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1103:3312 (duration: 01m 10s)
[05:58:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:00:22] (CR) Marostegui: [C: +2] db-codfw.php: Change parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[06:01:19] (Merged) jenkins-bot: db-codfw.php: Change parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[06:02:43] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Change one parsercache key on codfw - T210725 (duration: 00m 57s)
[06:02:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:02:46] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725
[06:06:00] (CR) jenkins-bot: db-eqiad.php: Depool db1103:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499379 (owner: Marostegui)
[06:06:03] (CR) jenkins-bot: db-codfw.php: Change parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/498322 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[06:30:12] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_keyholder]
[06:38:50] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499380
[06:40:24] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499380 (owner: Marostegui)
[06:41:24] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499380 (owner: Marostegui)
[06:42:33] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1103:3312 (duration: 00m 58s)
[06:42:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:49:28] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1103:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499380 (owner: Marostegui)
[06:56:08] !log repooled wdqs1004
[06:56:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:56:30] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:17] !log depooled wdqs1005 to catch up
[06:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:03:44] (PS1) Marostegui: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/499383
[07:05:46] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/499383 (owner: Marostegui)
[07:07:00] (CR) jerkins-bot: [V: -1] db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/499383 (owner: Marostegui)
[07:08:17] what?
[07:08:51] (CR) Marostegui: [C: +2] "recheck" [mediawiki-config] - https://gerrit.wikimedia.org/r/499383 (owner: Marostegui)
[07:09:49] (Merged) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/499383 (owner: Marostegui)
[07:11:18] (CR) jenkins-bot: db-eqiad.php: Depool db1122 [mediawiki-config] - https://gerrit.wikimedia.org/r/499383 (owner: Marostegui)
[07:11:20] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1122 (duration: 00m 58s)
[07:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:19:44] (PS1) Elukey: Add analytics-admins to role::common::statistics::cruncher [puppet] - https://gerrit.wikimedia.org/r/499398
[07:24:32] Operations, Analytics-Kanban, SRE-Access-Requests, Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (elukey) In the interest of allowing minimal access in the Analytics cluster I'd like to review again this access request and establish what level of acc...
[07:24:58] (CR) Elukey: [C: +2] Add analytics-admins to role::common::statistics::cruncher [puppet] - https://gerrit.wikimedia.org/r/499398 (owner: Elukey)
[07:25:54] this group --^ is deployed across nodes in the Analytics cluster, stat1006 was for some reason still missing it
[07:27:00] (PS1) Smalyshev: Load WikibaseLexemeCirrusSearch on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/499399 (https://phabricator.wikimedia.org/T216206)
[07:27:02] (PS1) Smalyshev: Load WikibaseLexemeCirrusSearch on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499400 (https://phabricator.wikimedia.org/T216206)
[07:41:54] PROBLEM - Disk space on elastic1017 is CRITICAL: DISK CRITICAL - free space: /srv 27427 MB (5% inode=99%)
[07:49:32] RECOVERY - Disk space on elastic1017 is OK: DISK OK
[07:51:24] PROBLEM - Mjolnir bulk update failure check - eqiad on icinga1001 is CRITICAL: 5.473e+05 gt 2 https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen
[07:55:10] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/499404
[08:05:05] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/499404 (owner: Marostegui)
[08:06:07] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/499404 (owner: Marostegui)
[08:06:31] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1122" [mediawiki-config] - https://gerrit.wikimedia.org/r/499404 (owner: Marostegui)
[08:07:14] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1122 (duration: 00m 57s)
[08:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:09:25] !log bounce rsyslog on cobalt - apache access logs stopped at ~6.30 today
[08:09:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:17:31] !log bounce rsyslog on phab* - apache access logs stopped at ~6.30 today
[08:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:21:40] (PS1) Marostegui: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/499408
[08:21:49] (PS1) Joal: Update sqoop timers configuration [puppet] - https://gerrit.wikimedia.org/r/499409 (https://phabricator.wikimedia.org/T215550)
[08:22:50] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/499408 (owner: Marostegui)
[08:23:55] (Merged) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/499408 (owner: Marostegui)
[08:25:11] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1076 (duration: 00m 57s)
[08:25:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:28:29] (CR) jenkins-bot: db-eqiad.php: Depool db1076 [mediawiki-config] - https://gerrit.wikimedia.org/r/499408 (owner: Marostegui)
[08:33:02] !log disabling puppet in acme-chief clients to get rid safely of old TLS material - T207295
[08:33:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:05] T207295: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295
[08:33:50] (PS2) Elukey: Update sqoop timers configuration [puppet] - https://gerrit.wikimedia.org/r/499409 (https://phabricator.wikimedia.org/T215550) (owner: Joal)
[08:34:08] gilles: o/
[08:34:25] yes, what's up?
[08:34:59] (CR) Vgutierrez: [C: +2] acme_chief: Clean old file based certificate files (1/2) [puppet] - https://gerrit.wikimedia.org/r/498920 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[08:35:08] (PS2) Vgutierrez: acme_chief: Clean old file based certificate files (1/2) [puppet] - https://gerrit.wikimedia.org/r/498920 (https://phabricator.wikimedia.org/T207295)
[08:35:30] gilles: nothing super important, there is some cron output from your user on stat1007, do you mind to add a MAILTO?
[08:35:52] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/15374/an-coord1001.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/499409 (https://phabricator.wikimedia.org/T215550) (owner: Joal)
[08:35:54] I've already changed it to output to a file
[08:36:03] ack then thanks
[08:38:52] (PS3) Vgutierrez: acme_chief: Clean old file based certificate files (1/2) [puppet] - https://gerrit.wikimedia.org/r/498920 (https://phabricator.wikimedia.org/T207295)
[08:44:15] (CR) Marostegui: [C: +1] mariadb: Prepare dbprov2001/2 and future dbprov1001/2 for production [puppet] - https://gerrit.wikimedia.org/r/499163 (https://phabricator.wikimedia.org/T218336) (owner: Jcrespo)
[08:44:36] (PS2) Jcrespo: mariadb: Prepare dbprov2001/2 and future dbprov1001/2 for production [puppet] - https://gerrit.wikimedia.org/r/499163 (https://phabricator.wikimedia.org/T218336)
[08:50:42] (CR) Jcrespo: [C: +2] mariadb: Prepare dbprov2001/2 and future dbprov1001/2 for production [puppet] - https://gerrit.wikimedia.org/r/499163 (https://phabricator.wikimedia.org/T218336) (owner: Jcrespo)
[08:51:10] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/499414
[08:54:23] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/499414 (owner: Marostegui)
[08:55:22] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/499414 (owner: Marostegui)
[08:56:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1076 (duration: 00m 54s)
[08:56:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:58:17] (PS1) Marostegui: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/499415
[08:59:30] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/499415 (owner: Marostegui)
[09:00:02] Operations, cloud-services-team (Kanban): apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (fgiunchedi) This also affects `package_builder` role on boron when trying to build packages for jessie-backports
[09:00:30] (Merged) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/499415 (owner: Marostegui)
[09:01:47] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1076" [mediawiki-config] - https://gerrit.wikimedia.org/r/499414 (owner: Marostegui)
[09:01:48] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1074 (duration: 00m 57s)
[09:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:01:52] (CR) jenkins-bot: db-eqiad.php: Depool db1074 [mediawiki-config] - https://gerrit.wikimedia.org/r/499415 (owner: Marostegui)
[09:01:59] !log Deploy schema change on db1074, this will generate lag on labsdb hosts for s2
[09:02:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:17] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov2001.codfw.wmnet'] ` The...
[09:06:50] !log puppet reenabled in acme-chief clients - T207295
[09:06:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:06:53] T207295: Expose not-yet-live certs to clients so they can handle OCSP stapling - https://phabricator.wikimedia.org/T207295
[09:06:56] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 165.9 ge 130 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[09:10:54] (PS1) Dzahn: rsync home dirs from bast2001 to bast2002 [puppet] - https://gerrit.wikimedia.org/r/499416 (https://phabricator.wikimedia.org/T196665)
[09:16:08] (PS2) Jcrespo: DHCP: Add MAC address entries for dpprov200[1-2] [puppet] - https://gerrit.wikimedia.org/r/498212 (owner: Papaul)
[09:16:53] (CR) jerkins-bot: [V: -1] DHCP: Add MAC address entries for dpprov200[1-2] [puppet] - https://gerrit.wikimedia.org/r/498212 (owner: Papaul)
[09:17:03] (CR) Vgutierrez: [C: +2] acme_chief: Clean old file based certificate files (2/2) [puppet] - https://gerrit.wikimedia.org/r/498921 (https://phabricator.wikimedia.org/T207295) (owner: Vgutierrez)
[09:17:09] (PS3) Jcrespo: DHCP: Add MAC address entries for dpprov200[1-2] [puppet] - https://gerrit.wikimedia.org/r/498212 (https://phabricator.wikimedia.org/T218336) (owner: Papaul)
[09:17:13] (PS2) Vgutierrez: acme_chief: Clean old file based certificate files (2/2) [puppet] - https://gerrit.wikimedia.org/r/498921 (https://phabricator.wikimedia.org/T207295)
[09:19:55] Operations, Acme-chief, Traffic, Goal, Patch-For-Review: Deploy managed LetsEncrypt certs for all public use-cases - https://phabricator.wikimedia.org/T213705 (Vgutierrez)
[09:21:47] (PS1) Dzahn: network::constants: add bast2002 [puppet] - https://gerrit.wikimedia.org/r/499417 (https://phabricator.wikimedia.org/T196665)
[09:22:15] (PS1) Elukey: hadoop::ssl_config: fix permissions for xml files following CDH guidelines [puppet/cdh] - https://gerrit.wikimedia.org/r/499418 (https://phabricator.wikimedia.org/T217412)
[09:22:20] (PS4) Jcrespo: DHCP: Add MAC address entries for dpprov200[1-2] [puppet] - https://gerrit.wikimedia.org/r/498212 (https://phabricator.wikimedia.org/T218336) (owner: Papaul)
[09:22:46] (CR) Elukey: [V: +2 C: +2] hadoop::ssl_config: fix permissions for xml files following CDH guidelines [puppet/cdh] - https://gerrit.wikimedia.org/r/499418 (https://phabricator.wikimedia.org/T217412) (owner: Elukey)
[09:23:40] (PS5) Jcrespo: DHCP: Add MAC address entries for dbprov200[1-2] [puppet] - https://gerrit.wikimedia.org/r/498212 (https://phabricator.wikimedia.org/T218336) (owner: Papaul)
[09:24:28] (PS1) Elukey: Update the cdh module to its latest version [puppet] - https://gerrit.wikimedia.org/r/499419
[09:25:17] (CR) ArielGlenn: [C: +1] network::constants: add bast2002 [puppet] - https://gerrit.wikimedia.org/r/499417 (https://phabricator.wikimedia.org/T196665) (owner: Dzahn)
[09:25:56] (CR) Jcrespo: [C: +2] DHCP: Add MAC address entries for dbprov200[1-2] [puppet] - https://gerrit.wikimedia.org/r/498212 (https://phabricator.wikimedia.org/T218336) (owner: Papaul)
[09:26:26] (CR) Dzahn: [C: +2] network::constants: add bast2002 [puppet] - https://gerrit.wikimedia.org/r/499417 (https://phabricator.wikimedia.org/T196665) (owner: Dzahn)
[09:26:39] (PS2) Dzahn: network::constants: add bast2002 [puppet] - https://gerrit.wikimedia.org/r/499417 (https://phabricator.wikimedia.org/T196665)
[09:28:25] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov2001.codfw.wmnet'] ` The...
[09:30:17] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/15376/" [puppet] - https://gerrit.wikimedia.org/r/499419 (owner: Elukey)
[09:30:26] (PS2) Elukey: Update the cdh module to its latest version [puppet] - https://gerrit.wikimedia.org/r/499419
[09:30:39] (PS1) Dzahn: smokeping: replace bast2001 (A5) with bast2002 (B5) target [puppet] - https://gerrit.wikimedia.org/r/499421 (https://phabricator.wikimedia.org/T196665)
[09:31:57] (PS2) Dzahn: smokeping: replace bast2001 (A5) with bast2002 (B5) target [puppet] - https://gerrit.wikimedia.org/r/499421 (https://phabricator.wikimedia.org/T196665)
[09:31:59] (PS3) Elukey: Update the cdh module to its latest version [puppet] - https://gerrit.wikimedia.org/r/499419
[09:32:01] (CR) Elukey: [V: +2 C: +2] Update the cdh module to its latest version [puppet] - https://gerrit.wikimedia.org/r/499419 (owner: Elukey)
[09:33:19] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/499422
[09:34:51] (PS2) Dzahn: rsync home dirs from bast2001 to bast2002 [puppet] - https://gerrit.wikimedia.org/r/499416 (https://phabricator.wikimedia.org/T196665)
[09:37:27] (CR) Dzahn: [C: +2] rsync home dirs from bast2001 to bast2002 [puppet] - https://gerrit.wikimedia.org/r/499416 (https://phabricator.wikimedia.org/T196665) (owner: Dzahn)
[09:37:41] (CR) Dzahn: [C: +2] "this will be reverted again together with bast2001 decom change" [puppet] - https://gerrit.wikimedia.org/r/499416 (https://phabricator.wikimedia.org/T196665) (owner: Dzahn)
[09:38:51] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/499422 (owner: Marostegui)
[09:40:14] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/499422 (owner: Marostegui)
[09:40:32] PROBLEM - Check systemd state on logstash1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:41:18] !log Upgrade db2092
[09:41:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:41:22] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1074 (duration: 00m 57s)
[09:41:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:42:06] (PS18) ArielGlenn: add wikidata entity dumps settings to its config file, make labs version [puppet] - https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825)
[09:42:11] (PS3) Vgutierrez: acme_chief: Issue the global unified wildcard certificate [puppet] - https://gerrit.wikimedia.org/r/499185 (https://phabricator.wikimedia.org/T213705)
[09:42:13] (PS1) Vgutierrez: acme_chief: Add a LE ACMEv2 staging environment account [puppet] - https://gerrit.wikimedia.org/r/499426 (https://phabricator.wikimedia.org/T213705)
[09:45:27] (CR) ArielGlenn: [C: +2] add wikidata entity dumps settings to its config file, make labs version [puppet] - https://gerrit.wikimedia.org/r/499203 (https://phabricator.wikimedia.org/T205825) (owner: ArielGlenn)
[09:47:16] (PS1) Dzahn: add shutdown warning in MOTD on bast2001 [puppet] - https://gerrit.wikimedia.org/r/499428 (https://phabricator.wikimedia.org/T196665)
[09:47:52] PROBLEM - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:48:30] (PS2) Dzahn: add shutdown warning in MOTD on bast2001 [puppet] - https://gerrit.wikimedia.org/r/499428 (https://phabricator.wikimedia.org/T196665)
[09:49:51] (CR) Dzahn: [C: +2] "just updated from last time it was used for bast1001->1002" [puppet] - https://gerrit.wikimedia.org/r/499428 (https://phabricator.wikimedia.org/T196665) (owner: Dzahn)
[09:50:40] PROBLEM - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:51:19] (PS2) Vgutierrez: acme_chief: Add a LE ACMEv2 staging environment account [puppet] - https://gerrit.wikimedia.org/r/499426 (https://phabricator.wikimedia.org/T213705)
[09:51:21] (PS4) Vgutierrez: acme_chief: Issue the global unified wildcard certificate [puppet] - https://gerrit.wikimedia.org/r/499185 (https://phabricator.wikimedia.org/T213705)
[09:54:25] (PS1) Marostegui: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499430
[09:55:00] PROBLEM - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[09:55:36] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1074" [mediawiki-config] - https://gerrit.wikimedia.org/r/499422 (owner: Marostegui)
[09:55:57] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1090:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499430 (owner: Marostegui)
[09:56:58] re: logstash, there is an issue with a DNS lookup for logstash1004
[09:57:06] (Merged) jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - https://gerrit.wikimedia.org/r/499430 (owner: Marostegui)
[09:58:16] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1090:3312 (duration: 00m 56s)
[09:58:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:58:26] Operations, cloud-services-team (Kanban): apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (jbond) a:jbond
[09:59:02] Operations, Traffic, Wikidata, Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (Addshore) >>! In T217897#5056213, @Smalyshev wrote: >> I'm still a bit confused about this logic inside the updat...
[09:59:33] (PS1) ArielGlenn: set up for testing misc dumps crons in beta [puppet] - https://gerrit.wikimedia.org/r/499432
[09:59:42] PROBLEM - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:00:44] (CR) jerkins-bot: [V: -1] set up for testing misc dumps crons in beta [puppet] - https://gerrit.wikimedia.org/r/499432 (owner: ArielGlenn)
[10:02:00] (CR) Dzahn: "this caused failures on several logstash hosts today because when the ferm service was reload (due to an unrelated addition to firewall ru" [dns] - https://gerrit.wikimedia.org/r/495143 (https://phabricator.wikimedia.org/T217556) (owner: RobH)
[10:02:58] PROBLEM - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:03:16] Operations, Patch-For-Review, cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (aborrero)
[10:04:04] Operations, ops-eqiad, DC-Ops, Wikimedia-Logstash, and 2 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (Dzahn) BEWARE. These hosts have not been removed from all places in puppet yet, though they are already gone from DNS. This...
[10:05:56] (03PS2) 10ArielGlenn: set up for testing misc dumps crons in beta [puppet] - 10https://gerrit.wikimedia.org/r/499432 (https://phabricator.wikimedia.org/T205825) [10:06:41] (03CR) 10jenkins-bot: db-eqiad.php: Depool db1090:3312 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499430 (owner: 10Marostegui) [10:07:45] (03CR) 10ArielGlenn: [C: 03+2] set up for testing misc dumps crons in beta [puppet] - 10https://gerrit.wikimedia.org/r/499432 (https://phabricator.wikimedia.org/T205825) (owner: 10ArielGlenn) [10:08:13] (03PS1) 10Dzahn: logstash: remove logstash1004,1005,1006 from Hiera [puppet] - 10https://gerrit.wikimedia.org/r/499433 (https://phabricator.wikimedia.org/T217556) [10:08:54] (03CR) 10Dzahn: "https://gerrit.wikimedia.org/r/c/operations/puppet/+/499433" [dns] - 10https://gerrit.wikimedia.org/r/495143 (https://phabricator.wikimedia.org/T217556) (owner: 10RobH) [10:10:30] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Addshore) >>! In T217897#5060748, @Smalyshev wrote: > Looking at the distribution of Special:EntityData fetches,... [10:10:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "a bit late, but making +1 explicit :-) thanks for adding this." [puppet] - 10https://gerrit.wikimedia.org/r/499267 (owner: 10Andrew Bogott) [10:15:06] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "The commit message doesn't explain why you are doing this." 
[puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [10:17:13] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [10:17:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (10Dzahn) a:05Cmjohnson→03herron [10:19:35] ACKNOWLEDGEMENT - Check systemd state on logstash1007 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T217556#5061227 [10:19:35] ACKNOWLEDGEMENT - Check systemd state on logstash1008 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T217556#5061227 [10:19:35] ACKNOWLEDGEMENT - Check systemd state on logstash1009 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T217556#5061227 [10:19:35] ACKNOWLEDGEMENT - Check systemd state on logstash1010 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T217556#5061227 [10:19:35] ACKNOWLEDGEMENT - Check systemd state on logstash1011 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn https://phabricator.wikimedia.org/T217556#5061227 [10:19:35] ACKNOWLEDGEMENT - Check systemd state on logstash1012 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
daniel_zahn https://phabricator.wikimedia.org/T217556#5061227
[10:20:44] !log upgrade rsyslog to 8.1903.0-3~bpo8+wmf1 on phab1001 to test imfile file rotation fix - T214176
[10:20:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:21:03] Operations, ops-codfw, Patch-For-Review: rack/setup/install bast2002.wikimedia.org - https://phabricator.wikimedia.org/T196665 (Dzahn)
[10:21:23] Operations, ops-codfw, DBA, Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (ops-monitoring-bot) Script wmf-auto-reimage was launched by jynus on cumin1001.eqiad.wmnet for hosts: ` ['dbprov2001.codfw.wmnet'] ` The...
[10:25:44] (CR) Zfilipin: "Is there a related phab task? Or, can you provide link to the request?" [mediawiki-config] - https://gerrit.wikimedia.org/r/499210 (owner: Odder)
[10:28:11] (PS1) ArielGlenn: add config setting to use lbzip2 for compression of pages-meta-history files [dumps] - https://gerrit.wikimedia.org/r/499438 (https://phabricator.wikimedia.org/T214293)
[10:36:26] (CR) Giuseppe Lavagetto: [C: +1] "This LGTM, not sure if we need to stagger application of this change to account for elasticsearch restarts." [puppet] - https://gerrit.wikimedia.org/r/499433 (https://phabricator.wikimedia.org/T217556) (owner: Dzahn)
[10:36:45] (CR) Arturo Borrero Gonzalez: [C: +1] "This is a long standing issue / pending feature. See T97081 for example.
I'm sure we can find more phabricator tasks mentioning the lack o" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499026 (owner: Andrew Bogott)
[10:39:30] (PS1) Marostegui: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499441
[10:40:17] (CR) Filippo Giunchedi: [C: +1] "LGTM, better apply this to one host and see the effect, then on the rest" [puppet] - https://gerrit.wikimedia.org/r/499433 (https://phabricator.wikimedia.org/T217556) (owner: Dzahn)
[10:42:45] (PS3) Vgutierrez: acme_chief: Add a LE ACMEv2 staging environment account [puppet] - https://gerrit.wikimedia.org/r/499426 (https://phabricator.wikimedia.org/T213705)
[10:42:47] (PS5) Vgutierrez: acme_chief: Issue the global unified wildcard certificate [puppet] - https://gerrit.wikimedia.org/r/499185 (https://phabricator.wikimedia.org/T213705)
[10:43:50] (CR) Marostegui: [C: +2] Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499441 (owner: Marostegui)
[10:45:05] (Merged) jenkins-bot: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499441 (owner: Marostegui)
[10:45:53] (PS3) Giuseppe Lavagetto: arclamp: make arclamp-grep work with excimer logs as well [puppet] - https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916)
[10:47:18] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Repool db1090:3312 (duration: 00m 58s)
[10:47:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:47:50] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs
[10:48:58] PROBLEM - Check systemd state on lithium is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:49:10] PROBLEM - DPKG on lithium is CRITICAL: DPKG CRITICAL dpkg reports broken packages
[10:49:10] godog: is it you --^ ?
[10:49:21] or the recurrent issue that we had with the TLS listener?
[10:49:29] elukey: yeah that's me, messing with rsyslog
[10:49:34] <3
[10:49:50] !log disabling puppet on logstash* via cumin
[10:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:51:11] (PS2) Dzahn: logstash: remove logstash1004,1005,1006 from Hiera [puppet] - https://gerrit.wikimedia.org/r/499433 (https://phabricator.wikimedia.org/T217556)
[10:51:38] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 941 days) https://wikitech.wikimedia.org/wiki/Logs
[10:51:42] RECOVERY - DPKG on lithium is OK: All packages OK
[10:52:09] (CR) Dzahn: [C: +2] logstash: remove logstash1004,1005,1006 from Hiera [puppet] - https://gerrit.wikimedia.org/r/499433 (https://phabricator.wikimedia.org/T217556) (owner: Dzahn)
[10:52:19] (CR) jenkins-bot: Revert "db-eqiad.php: Depool db1090:3312" [mediawiki-config] - https://gerrit.wikimedia.org/r/499441 (owner: Marostegui)
[10:52:25] (CR) Dzahn: [C: +2] "puppet disabled on logstash*, re-enabling a single one" [puppet] - https://gerrit.wikimedia.org/r/499433 (https://phabricator.wikimedia.org/T217556) (owner: Dzahn)
[10:52:46] RECOVERY - Check systemd state on lithium is OK: OK - running: The system is fully operational
[10:53:31] !log enabling and running puppet on logstash1007
[10:53:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:54:16] RECOVERY - Check systemd state on logstash1007 is OK: OK - running: The system is fully operational
[10:55:17] (CR) Elukey: "Yep yep we are planning to shutoff the instances after a few days if nothing pops up, and then we'll merge only by then.
The plan is then " [puppet] - https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: Mforns)
[10:57:31] !log upgrade rsyslog to 8.1903.0-3~bpo8+wmf1 on cobalt to test imfile file rotation fix - T214176
[10:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:58:19] (CR) Dzahn: "i was only added by reviewer bot because the 1 line change in wikistats module. that comment was there because these projects have often b" [puppet] - https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: Mforns)
[10:59:53] (CR) Elukey: "Looks good from what I can see! Left a question as comment." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: Mforns)
[11:00:04] RECOVERY - Check systemd state on logstash1008 is OK: OK - running: The system is fully operational
[11:00:05] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1100).
[11:00:05] Urbanecm, odder, _joe_, and razesoldier: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[11:00:36] I can SWAT today
[11:00:45] _joe_: you'll be deploying your change?
[11:00:54] RECOVERY - Check systemd state on logstash1009 is OK: OK - running: The system is fully operational
[11:00:59] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.e6-upgrade
[11:01:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:01:02] <_joe_> zeljkof: however you prefer
[11:01:19] <_joe_> I'm happy not to have to do it, but it's ok
[11:01:27] _joe_: I prefer people deploying their commits :)
[11:01:59] _joe_: want to go first, last? in calendar order? your patch might take a while to merge
[11:02:11] <_joe_> I can wait
[11:02:23] <_joe_> I mean I can +2 it in the meanwhile
[11:02:34] RECOVERY - Check systemd state on logstash1010 is OK: OK - running: The system is fully operational
[11:02:36] <_joe_> it shouldn't conflict with the other ones
[11:02:57] _joe_: as far as I can see, the rest of the patches are config, so there should be no problems
[11:03:07] <_joe_> ack, merging then
[11:03:13] <_joe_> I'll wait my turn
[11:03:21] Urbanecm, odder, razesoldier: around for swat?
[11:03:33] I'm here
[11:03:47] (PS4) Vgutierrez: acme_chief: Add a LE ACMEv2 staging environment account [puppet] - https://gerrit.wikimedia.org/r/499426 (https://phabricator.wikimedia.org/T213705)
[11:03:49] (PS6) Vgutierrez: acme_chief: Issue the global unified wildcard certificate [puppet] - https://gerrit.wikimedia.org/r/499185 (https://phabricator.wikimedia.org/T213705)
[11:04:08] !log re-enabled puppet on logstash1007 through 1011 - then on logstash*
[11:04:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:04:10] RECOVERY - Check systemd state on logstash1011 is OK: OK - running: The system is fully operational
[11:05:34] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499231 (https://phabricator.wikimedia.org/T219291) (owner: Urbanecm)
[11:05:48] Operations, ops-eqiad, DC-Ops, Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (Dzahn) ` 2019-03-27 11:04 mutante: re-enabled puppet on logstash1007 through 1011 - then on logstash* 10:53 mutant...
[11:06:13] !log elasticsearch search cluster: setting "index.blocks.read_only_allow_delete" to null on all indices in omega/psi/chi@omega (T219364)
[11:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:06:16] T219364: Wikidata search lagging behind - https://phabricator.wikimedia.org/T219364
[11:06:37] (Merged) jenkins-bot: Add throttle rule for Czech editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/499231 (https://phabricator.wikimedia.org/T219291) (owner: Urbanecm)
[11:08:00] RECOVERY - Check systemd state on logstash1012 is OK: OK - running: The system is fully operational
[11:08:41] Operations, hardware-requests: Rename labtestmetal2001 - https://phabricator.wikimedia.org/T161265 (Dzahn) Has it been renamed back to the old name?
https://icinga.wikimedia.org/cgi-bin/icinga/status.cgi?host=labtestmetal2001
[11:08:54] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:499231|Add throttle rule for Czech editathon (T219291)]] (duration: 00m 58s)
[11:08:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:08:57] T219291: Throttle rule for Czech Editathon - https://phabricator.wikimedia.org/T219291
[11:09:08] Urbanecm: 499231 deployed
[11:10:21] !log elasticsearch search cluster: setting cluster.routing.allocation.disk.watermark.flood_stage to 100% on omega/psi/chi@eqiad (T219364)
[11:10:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:10:51] <_joe_> zeljkof: my patch is merged, FTR
[11:10:52] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499287 (https://phabricator.wikimedia.org/T219311) (owner: Urbanecm)
[11:11:07] <_joe_> I don't see the next deployer btw
[11:11:44] _joe_: I have a couple more patches, probably just the throttle rule (last entry), those can be deployed without a developer present
[11:11:59] (Merged) jenkins-bot: Clean the throttles up [mediawiki-config] - https://gerrit.wikimedia.org/r/499287 (https://phabricator.wikimedia.org/T219311) (owner: Urbanecm)
[11:12:01] <_joe_> ack
[11:12:05] ACKNOWLEDGEMENT - puppet last run on labtestmetal2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 32 seconds ago with 1 failures. Failed resources (up to 3 shown): Package[tzdata] daniel_zahn https://phabricator.wikimedia.org/T217891
[11:14:03] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:499287|Clean the throttles up (T219311)]] (duration: 00m 57s)
[11:14:05] (CR) Zfilipin: "This was scheduled for EU SWAT but was not deployed because the developer was not in #wikimedia-operations."
[mediawiki-config] - https://gerrit.wikimedia.org/r/499210 (owner: Odder)
[11:14:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:14:07] T219311: Clean the throttle rules up - https://phabricator.wikimedia.org/T219311
[11:14:38] (CR) jenkins-bot: Add throttle rule for Czech editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/499231 (https://phabricator.wikimedia.org/T219291) (owner: Urbanecm)
[11:14:40] (CR) jenkins-bot: Clean the throttles up [mediawiki-config] - https://gerrit.wikimedia.org/r/499287 (https://phabricator.wikimedia.org/T219311) (owner: Urbanecm)
[11:15:09] (PS3) Zfilipin: Add new throttle rule [mediawiki-config] - https://gerrit.wikimedia.org/r/498770 (https://phabricator.wikimedia.org/T219113) (owner: 星耀晨曦)
[11:15:20] Urbanecm: 499287 deployed
[11:16:22] (PS4) Zfilipin: Throttle rule for The Art and Feminism Edit-a-thon in Taiwan [mediawiki-config] - https://gerrit.wikimedia.org/r/498770 (https://phabricator.wikimedia.org/T219113) (owner: 星耀晨曦)
[11:16:59] (PS4) Ladsgroup: Deprecate statsd hiera config in favor of statsd_host and statsd_port [puppet] - https://gerrit.wikimedia.org/r/497316
[11:17:07] (CR) Zfilipin: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/498770 (https://phabricator.wikimedia.org/T219113) (owner: 星耀晨曦)
[11:18:28] (Merged) jenkins-bot: Throttle rule for The Art and Feminism Edit-a-thon in Taiwan [mediawiki-config] - https://gerrit.wikimedia.org/r/498770 (https://phabricator.wikimedia.org/T219113) (owner: 星耀晨曦)
[11:20:16] !log zfilipin@deploy1001 Synchronized wmf-config/throttle.php: SWAT: [[gerrit:498770|Throttle rule for The Art and Feminism Edit-a-thon in Taiwan (T219113)]] (duration: 00m 59s)
[11:20:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:20:19] T219113: Requesting temporary lift of IP cap on 2019-03-31 - https://phabricator.wikimedia.org/T219113
[11:20:21]
razesoldier: 498770 is deployed
[11:20:33] <_joe_> zeljkof: can I go?
[11:20:48] _joe_: yes, swat is yours :)
[11:20:55] zeljkof: thanks for your swat :)
[11:21:16] odder: 499210 was not deployed because you were not around
[11:21:29] razesoldier: you're welcome :)
[11:23:34] (PS5) Ladsgroup: Deprecate statsd hiera config in favor of statsd_host and statsd_port [puppet] - https://gerrit.wikimedia.org/r/497316 (https://phabricator.wikimedia.org/T218567)
[11:24:43] !log oblivian@deploy1001 Synchronized php-1.33.0-wmf.22/extensions/WikimediaEvents: SWAT: Backport Use a cookie to persist the seed for php7 a/b test to .22 T216676 (duration: 00m 58s)
[11:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:24:46] T216676: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676
[11:25:03] <_joe_> zeljkof: I'm done I think
[11:25:23] Operations, ops-eqiad, DC-Ops, Wikimedia-Logstash, and 3 others: Decommission old eqiad logstash hardware hosts logstash100[456] - https://phabricator.wikimedia.org/T217556 (Dzahn) a:herron→Cmjohnson now it should be ok to continue. at least i don't see the hosts in puppet repo anymore an...
[11:25:38] (CR) jenkins-bot: Throttle rule for The Art and Feminism Edit-a-thon in Taiwan [mediawiki-config] - https://gerrit.wikimedia.org/r/498770 (https://phabricator.wikimedia.org/T219113) (owner: 星耀晨曦)
[11:27:42] _joe_: cool, will you close the window?
[11:27:52] <_joe_> zeljkof: what do you mean?
[11:28:11] <_joe_> log "SWAT done" ?
[11:28:32] yes
[11:28:34] <_joe_> mind, I'm not a deployer, I'm someone who has deployment rights
[11:28:43] <_joe_> so I never read the fucking manual of SWAT :D
[11:28:45] just checking if it's in the docs, but it's not :)
[11:28:50] <_joe_> ahah
[11:28:53] <_joe_> ok so even then
[11:28:57] <_joe_> !log SWAT done
[11:28:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:29:05] that's it!
:)
[11:29:27] just a note for anybody that needs to deploy, marking the rest of the window free
[11:32:08] zeljkof: just added it to https://wikitech.wikimedia.org/wiki/SWAT_deploys/Deployers#General_advice :)
[11:32:22] <_joe_> Lucas_WMDE: thanks
[11:32:38] Lucas_WMDE: thanks!
[11:32:53] I was sure it was documented, since I didn't make it up (I think)
[11:33:03] but then I could not find it in the docs
[11:34:47] I don’t think I ever saw it in the docs, I just picked it up from others (mostly you probably) doing it in IRC
[11:36:40] zeljkof: Hi, sorry I'm late. Any chance you could still deploy that Gujarati Wikipedia logo during this window?
[11:37:15] (PS1) Dzahn: network::constants: remove bast2001 [puppet] - https://gerrit.wikimedia.org/r/499449
[11:37:47] !log created wikimedia_editor_tasks_entity_description_exists table on testwikidatawiki
[11:37:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:43:36] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation
[11:43:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:45:28] Operations, Analytics, Analytics-Kanban, Wikimedia-Mailing-lists, Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (ema)
[11:47:09] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[11:47:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:24] (PS3) Dzahn: add Icinga notes_url to various NRPE monitor checks, pt 2 [puppet] - https://gerrit.wikimedia.org/r/499148
[11:50:41] (CR) Dzahn: [C: +2] add Icinga notes_url to various NRPE monitor checks, pt 2 [puppet] - https://gerrit.wikimedia.org/r/499148 (owner: Dzahn)
[11:52:54] odder: sorry, just saw your message
[11:53:36] odder: can you move the commit to another swat window?
[11:55:04] zeljkof: Yes, I'll move it to Monday
[11:55:32] (PS2) Odder: Correct logos for the Gujarati Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/499210 (https://phabricator.wikimedia.org/T219373)
[11:56:00] (PS1) Marostegui: 10.in-addr.arpa: Fix typo [dns] - https://gerrit.wikimedia.org/r/499452 (https://phabricator.wikimedia.org/T218336)
[11:56:13] jynus: ^
[11:56:17] odder: thanks
[11:56:50] thanks I can deploy
[11:56:55] jynus: thanks
[11:57:18] (CR) Jcrespo: [C: +2] 10.in-addr.arpa: Fix typo [dns] - https://gerrit.wikimedia.org/r/499452 (https://phabricator.wikimedia.org/T218336) (owner: Marostegui)
[11:59:32] (PS1) Jbond: jessie-backports: Remove unsued pins [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333)
[12:00:05] Amir1: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for New wikis . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1200).
[12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1200)
[12:02:00] (CR) jerkins-bot: [V: -1] jessie-backports: Remove unsued pins [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: Jbond)
[12:05:21] (PS2) Jbond: jessie-backports: Remove unsued pins [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333)
[12:06:08] I'm going to create some wikis now
[12:09:17] (CR) Arturo Borrero Gonzalez: jessie-backports: Remove unsued pins (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: Jbond)
[12:12:59] (CR) Volans: [C: -1] "Here some other thought:" [puppet] - https://gerrit.wikimedia.org/r/499026 (owner: Andrew Bogott)
[12:15:38] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99)
[12:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:20:10] (PS10) Ladsgroup: Initial configuration for hyw.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597) (owner: MarcoAurelio)
[12:23:37] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation
[12:23:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:32] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[12:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:25:35] (CR) Ladsgroup: [C: +2] "Creating the wiki now."
[mediawiki-config] - https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597) (owner: MarcoAurelio)
[12:27:28] (Merged) jenkins-bot: Initial configuration for hyw.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597) (owner: MarcoAurelio)
[12:27:44] (CR) Volans: "Does this means that existing hosts that have any of those packages from jessie (no backports) now are seeing them as upgradable?" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: Jbond)
[12:28:58] PROBLEM - puppet last run on deploy1001 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:30:21] !log mwscript extensions/WikimediaMaintenance/addWiki.php --wiki=mediawikiwiki hyw wikipedia hywwiki hyw.wikipedia.org
[12:30:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:31:14] (CR) jenkins-bot: Initial configuration for hyw.wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/481943 (https://phabricator.wikimedia.org/T212597) (owner: MarcoAurelio)
[12:34:18] PROBLEM - puppet last run on analytics1053 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[12:35:53] mutante: puppet failing to fetch catalog on deploy1001?
[12:38:52] PROBLEM - HHVM rendering on mwdebug1002 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[12:39:55] (CR) Jbond: "> Patch Set 2:" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: Jbond)
[12:41:08] <_joe_> hauskatze: puppet server issues
[12:41:12] !log scap sync-file dblists
[12:41:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:41:27] <_joe_> as in, some backend is throwing errors from time to time
[12:42:14] !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 00m 58s)
[12:42:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[12:43:48] RECOVERY - HHVM rendering on mwdebug1002 is OK: HTTP OK: HTTP/1.1 200 OK - 75691 bytes in 0.191 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[12:43:58] (PS1) Ladsgroup: Add hywwiki to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/499464 (https://phabricator.wikimedia.org/T212597)
[12:44:20] (CR) Ladsgroup: [C: +2] Add hywwiki to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/499464 (https://phabricator.wikimedia.org/T212597) (owner: Ladsgroup)
[12:44:58] _joe_: ack, thanks for taking a look :)
[12:45:26] (Merged) jenkins-bot: Add hywwiki to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/499464 (https://phabricator.wikimedia.org/T212597) (owner: Ladsgroup)
[12:48:37] (CR) Arturo Borrero Gonzalez: [C: +1] jessie-backports: Remove unsued pins (4 comments) [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: Jbond)
[12:50:49] Operations, Beta-Cluster-Infrastructure, Mathoid, Core Platform Team Backlog (Watching / External), and 2 others: remove mathoid from scb - https://phabricator.wikimedia.org/T200832
(akosiaris) >>! In T200832#5051312, @Krenair wrote: > deployment-mathoid still exists and has been failing puppet r...
[12:52:58] (CR) jenkins-bot: Add hywwiki to wikiversions.json [mediawiki-config] - https://gerrit.wikimedia.org/r/499464 (https://phabricator.wikimedia.org/T212597) (owner: Ladsgroup)
[12:55:41] hyw.wikipedia.org redirects to incubator even though the gerrit patch for it is on mwdebug1002 and addWiki ran successfully, what's missing here?
[12:56:02] the apache redirect rules are not updated?
[12:56:09] idk - Reedy might know
[12:56:21] apache config was uploaded and merged I think
[12:56:39] but once the db exists I think the redirect is gone
[12:56:48] maybe the puppet issues?
[12:57:04] (CR) Alexandros Kosiaris: openldap: spruce up the anti-memory-leak cron for replicas (1 comment) [puppet] - https://gerrit.wikimedia.org/r/498902 (owner: Andrew Bogott)
[12:58:04] in fact https://phabricator.wikimedia.org/rODNSdae4e19a292fca7578c4c6751055de728a75679a indeed was merged
[12:59:22] Operations, cloud-services-team, serviceops, Core Platform Team Backlog (Watching / External), and 2 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki)
[12:59:46] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[13:00:04] Deploy window Bastion Server Reboot (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1300)
[13:00:05] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1300)
[13:01:11] Operations, cloud-services-team, serviceops, Core Platform Team Backlog (Watching / External), and 2 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (jijiki) Adding cloud services team in the loop
[13:01:53] no, I checked the dblists on mwdebug1002, it's there
[13:03:47] (CR) Jbond:
jessie-backports: Remove unsued pins (6 comments) [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: Jbond)
[13:04:14] (PS3) Jbond: jessie-backports: Remove unsued pins [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333)
[13:04:34] (CR) Jbond: jessie-backports: Remove unsued pins (1 comment) [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: Jbond)
[13:05:03] Operations, cloud-services-team, serviceops, Core Platform Team Backlog (Watching / External), and 2 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (aborrero) >>! In T195392#5061736, @jijiki wrote: > Adding cloud services team in the loop labweb1...
[13:05:10] RECOVERY - puppet last run on analytics1053 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[13:07:17] Can't figure out what's going on now, going to revert patches
[13:07:24] RECOVERY - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is OK: (C)130 ge (W)110 ge 105.2 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen
[13:07:48] (PS1) Ladsgroup: Revert "Add hywwiki to wikiversions.json" [mediawiki-config] - https://gerrit.wikimedia.org/r/499473
[13:07:51] (PS1) Ladsgroup: Revert "Initial configuration for hyw.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/499474
[13:07:57] (CR) Ladsgroup: [C: +2] Revert "Add hywwiki to wikiversions.json" [mediawiki-config] - https://gerrit.wikimedia.org/r/499473 (owner: Ladsgroup)
[13:08:01] (CR) Ladsgroup: [C: +2] Revert "Initial configuration for hyw.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/499474 (owner: Ladsgroup)
[13:09:15] (Merged) jenkins-bot: Revert "Add hywwiki to wikiversions.json" [mediawiki-config] -
https://gerrit.wikimedia.org/r/499473 (owner: Ladsgroup)
[13:09:19] (Merged) jenkins-bot: Revert "Initial configuration for hyw.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/499474 (owner: Ladsgroup)
[13:11:29] !log ladsgroup@deploy1001 Synchronized dblists: (no justification provided) (duration: 00m 57s)
[13:11:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:15:57] (CR) jenkins-bot: Revert "Add hywwiki to wikiversions.json" [mediawiki-config] - https://gerrit.wikimedia.org/r/499473 (owner: Ladsgroup)
[13:15:59] (CR) jenkins-bot: Revert "Initial configuration for hyw.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/499474 (owner: Ladsgroup)
[13:17:11] I'm done with the deployment, give it another try later
[13:19:12] (PS4) Jbond: jessie-backports: Remove unsued pins [puppet] - https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333)
[13:23:37] Operations, Patch-For-Review, cloud-services-team (Kanban): apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (jbond) once the jessie-backport repos has been remove i suggest running the following on cumin `sudo cumin '...
[13:28:34] (PS1) MarcoAurelio: Reinstate "Initial configuration for hyw.wikipedia"" [mediawiki-config] - https://gerrit.wikimedia.org/r/499477
[13:28:53] (PS2) MarcoAurelio: Reinstate "Initial configuration for hyw.wikipedia" [mediawiki-config] - https://gerrit.wikimedia.org/r/499477
[13:29:33] hauskatze: do you know what's the reason? I'm looking at the code rn
[13:30:16] for hyw not displaying?
[13:30:44] I /think/ it could be apache, but I added hyw there.
[13:32:01] let me take a look at dns just in case
[13:33:16] hyw is there: https://phabricator.wikimedia.org/diffusion/ODNS/browse/master/templates/helpers/langlist.tmpl
[13:33:17] I checked dns and apache config in puppet, nothing was problematic there
[13:33:31] (PS11) Andrew Bogott: puppet-merge: merge to wmcs puppetmasters as well [puppet] - https://gerrit.wikimedia.org/r/497069
[13:33:51] afk for lunch, will be back soon and dig more
[13:34:45] (CR) jerkins-bot: [V: -1] puppet-merge: merge to wmcs puppetmasters as well [puppet] - https://gerrit.wikimedia.org/r/497069 (owner: Andrew Bogott)
[13:35:53] Operations, CirrusSearch, Discovery-Search, Wikidata: Wikidata search lagging behind - https://phabricator.wikimedia.org/T219364 (dcausse) p:Unbreak!→High The backlog of updates is being processed, once we catch up on these updates we will run a maint script to reindex lost updates. Lower...
[13:35:54] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation
[13:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:35:58] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0)
[13:35:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:37:44] (CR) Andrew Bogott: [V: +2 C: +2] puppet-merge: merge to wmcs puppetmasters as well [puppet] - https://gerrit.wikimedia.org/r/497069 (owner: Andrew Bogott)
[13:38:01] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.e6-upgrade
[13:38:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[13:39:35] Operations: cronspam: cross-validate-accounts - https://phabricator.wikimedia.org/T219274 (Dzahn) I would say the team responsible is the entire SRE which historically was the same as people receiving root mail.
[13:44:36] (PS1) Andrew Bogott: installer: prepare to rebuild labvirt1008 as cloudvirt1009 [puppet] - https://gerrit.wikimedia.org/r/499480 (https://phabricator.wikimedia.org/T216661)
[13:46:09] (CR) Andrew Bogott: [C: +2] installer: prepare to rebuild labvirt1008 as cloudvirt1009 [puppet] - https://gerrit.wikimedia.org/r/499480 (https://phabricator.wikimedia.org/T216661) (owner: Andrew Bogott)
[13:50:31] Operations, netops: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (Dzahn) p:Triage→Normal
[13:51:43] Operations, netops: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (Dzahn)
[13:52:01] (PS7) Filippo Giunchedi: Create node-specific logstash filters for syslog. [puppet] - https://gerrit.wikimedia.org/r/498417 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko)
[13:53:16] mutante: any idea why a wiki that exists is still redirecting to Incubator?
[13:53:48] (PS3) Andrew Bogott: osm: Add a cloud-internal address for the osmdb cluster [puppet] - https://gerrit.wikimedia.org/r/494771 (https://phabricator.wikimedia.org/T193264) (owner: Bstorm)
[13:54:38] (CR) Filippo Giunchedi: [C: +2] Create node-specific logstash filters for syslog. [puppet] - https://gerrit.wikimedia.org/r/498417 (https://phabricator.wikimedia.org/T211125) (owner: Ppchelko)
[13:54:55] (PS3) Zfilipin: Add new WMCS IP range to $wgRateLimitsExcludedIps [mediawiki-config] - https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: Hashar)
[13:55:59] (CR) Zfilipin: "@hashar: I've rebased the patch. Is it ready to be merged?"
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/482640 (https://phabricator.wikimedia.org/T167432) (owner: 10Hashar) [13:56:29] hauskatze: no, unless it's just cached [13:57:18] andrewbogott: FYI puppet-merge on puppetmaster1001 mentioned that it failed on labpuppetmaster [13:57:32] mutante: could it be the puppet not fetching the catalog the bot reported and jo-e saying it was looking at it? [13:57:32] godog: dang, I thought I just now fixed that [13:57:42] godog: can you paste me the output from that section? [13:57:51] (maybe you got in before my fix) [13:58:14] the good news is, we've tested the failure case now! [13:58:23] hauskatze: on deployment1001? nah, that wouldnt be related [13:58:26] andrewbogott: could be, I ran puppet-merge 30s ago, I'll do a phaste [13:58:38] ty [13:59:06] andrewbogott: https://phabricator.wikimedia.org/P8290 [13:59:07] mutante: yup on deploy1002 but okay - I was thinking on some missing authdns-update as well but probably not either [14:00:21] godog: ok, looks like the git-clean failed? I'll fix that [14:01:30] 10Operations, 10CirrusSearch, 10Discovery-Search, 10Wikidata: Elasticsearch indices went read-only causing huge lag - https://phabricator.wikimedia.org/T219364 (10dcausse) [14:01:38] hauskatze: either it exists in DNS at all or it does not. but i am not aware of any DNS change between incubator status and being a real wiki [14:01:42] (03PS1) 10Hashar: contint: add more repositories to the gitcache [puppet] - 10https://gerrit.wikimedia.org/r/499482 [14:01:51] (03CR) 10Thiemo Kreuz (WMDE): [C: 03+1] Update Daniel Kinzler’s email address [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE)) [14:02:47] hauskatze: so that was re "missing authdns-update".. 
not unless you mean it is missing in langlist.tmpl completely and if that was the case it could not have been created afaict [14:02:47] (03PS1) 10Jcrespo: test [dns] - 10https://gerrit.wikimedia.org/r/499484 [14:03:06] (03Abandoned) 10Jcrespo: test [dns] - 10https://gerrit.wikimedia.org/r/499484 (owner: 10Jcrespo) [14:03:08] huh, actually, it's running "git clean -dffx -e /private/" on all the puppet hosts and that dir doesn't exist on any of them [14:03:46] (03CR) 10Hashar: "Tested on the CI puppet master and that populated the repositories :]" [puppet] - 10https://gerrit.wikimedia.org/r/499482 (owner: 10Hashar) [14:03:47] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10MarcoAurelio) @Marostegui Although the wiki config had to be reverted the addWiki script was run so //I guess// the tables are now in place? [14:04:11] oh, -e is 'exclude' [14:04:38] godog: I think I fixed it (or at least fixed the next issue). Got any more patches to merge? 
[14:04:42] mutante: no, it's on langlist ofc: https://phabricator.wikimedia.org/rODNSdae4e19a292fca7578c4c6751055de728a75679a [14:05:00] 10Operations, 10CirrusSearch, 10Wikidata, 10Discovery-Search (Current work): Elasticsearch indices went read-only causing huge lag - https://phabricator.wikimedia.org/T219364 (10dcausse) [14:05:14] andrewbogott: not immediately no [14:05:20] 'k [14:05:31] (03CR) 10Dzahn: "while this looks correct i also note that the LDAP user this refers to uses neither of these 2 email addresses but a personal one associat" [puppet] - 10https://gerrit.wikimedia.org/r/499230 (owner: 10Lucas Werkmeister (WMDE)) [14:05:32] !log roll-restart logstash to apply https://gerrit.wikimedia.org/r/c/operations/puppet/+/498417 [14:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:06:01] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Marostegui) So those tables will not be dropped + created again as part of any other process and will remain there until the wiki config is in place? [14:06:33] hauskatze: ok. 
yea, then it is not related to authdns-update [14:06:49] danke [14:06:52] also the other commands once needed to add new langs dont have to run anymore [14:06:57] that was fixed [14:07:23] (03PS1) 10Andrew Bogott: Trivial comment change: s/labs/cloud [puppet] - 10https://gerrit.wikimedia.org/r/499486 [14:08:11] (03PS2) 10Andrew Bogott: Trivial comment change: s/labs/cloud [puppet] - 10https://gerrit.wikimedia.org/r/499486 [14:09:22] (03CR) 10Andrew Bogott: [C: 03+2] Trivial comment change: s/labs/cloud [puppet] - 10https://gerrit.wikimedia.org/r/499486 (owner: 10Andrew Bogott) [14:11:15] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [14:11:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:00] !log Sanitize hywwiki on db1124:3313 T212625 [14:12:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:05] T212625: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 [14:13:26] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [14:13:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:16:15] (03PS1) 10Ema: varnish: do not allow PURGE requests from the outside [puppet] - 10https://gerrit.wikimedia.org/r/499488 [14:17:13] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Ladsgroup) >>! In T212625#5061983, @Marostegui wrote: > So those tables will not be dropped + created again as part of any other process and will remain th... 
[14:17:44] PROBLEM - MariaDB Slave Lag: s5 on dbstore2001 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 384.86 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [14:21:03] ^most likely due to backups [14:21:14] yep, backups are running there [14:21:16] 10Operations, 10DBA, 10Data-Services, 10cloud-services-team: Prepare and check storage layer for hywwiki - https://phabricator.wikimedia.org/T212625 (10Marostegui) a:03Marostegui I have sanitized `hywwiki` on db1124:3313 and triggers are now in place and filtered tables deleted. ` mysql.py -h db1124:3313... [14:22:37] gtk [14:22:41] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Upgrade metrics monitoring infrastructure core components (Q3 2018/19 goal) - https://phabricator.wikimedia.org/T213288 (10fgiunchedi) [14:24:08] (03CR) 10Filippo Giunchedi: "+Ema for heads up" [puppet] - 10https://gerrit.wikimedia.org/r/498467 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [14:25:17] (03PS5) 10Filippo Giunchedi: prometheus: collect session storage Cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) (owner: 10Eevans) [14:25:23] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: collect session storage Cassandra metrics [puppet] - 10https://gerrit.wikimedia.org/r/497848 (https://phabricator.wikimedia.org/T209108) (owner: 10Eevans) [14:26:15] (03PS2) 10Ema: varnish: do not allow PURGE requests from the outside [puppet] - 10https://gerrit.wikimedia.org/r/499488 [14:31:47] 10Puppet, 10cloud-services-team (Kanban): Have puppet-merge on puppetmaster1001 publish the official sha1 after merging - https://phabricator.wikimedia.org/T219390 (10Aklapper) [14:32:32] (03PS3) 10CRusnov: netbox ganeti sync: Fix path to logfiles. [puppet] - 10https://gerrit.wikimedia.org/r/499288 [14:33:39] (03CR) 10jerkins-bot: [V: 04-1] netbox ganeti sync: Fix path to logfiles. 
[puppet] - 10https://gerrit.wikimedia.org/r/499288 (owner: 10CRusnov) [14:34:24] (03CR) 10CRusnov: "After investigating the operation live, it doesn't touch or manipulate the extra log files at all (it logs to syslog). I think it's safe t" [puppet] - 10https://gerrit.wikimedia.org/r/499288 (owner: 10CRusnov) [14:36:47] (03PS4) 10CRusnov: netbox ganeti sync: Fix path to logfiles. [puppet] - 10https://gerrit.wikimedia.org/r/499288 [14:39:48] ACKNOWLEDGEMENT - Mjolnir bulk update failure check - eqiad on icinga1001 is CRITICAL: 5.59e+05 gt 2 daniel_zahn known - per IRC with ebernhardson https://grafana.wikimedia.org/d/000000591/elasticsearch-mjolnir-bulk-updates?orgId=1&from=now-7d&to=now&panelId=1&fullscreen [14:41:48] PROBLEM - puppet last run on kafka-jumbo1003 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [14:42:12] hm [14:42:26] (03PS4) 10Giuseppe Lavagetto: arclamp: make arclamp-grep work with excimer logs as well [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) [14:44:06] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [14:44:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:36] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [14:45:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:45:48] (03CR) 10Giuseppe Lavagetto: [C: 03+2] arclamp: make arclamp-grep work with excimer logs as well [puppet] - 10https://gerrit.wikimedia.org/r/499219 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [14:47:04] RECOVERY - puppet last run on kafka-jumbo1003 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [14:52:55] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [14:52:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[14:53:24] 10Operations, 10ops-eqiad, 10DBA: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10Marostegui) p:05Triage→03Normal [14:54:29] 10Operations, 10ops-eqiad, 10DBA: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10RobH) [14:56:18] 10Operations, 10ops-eqiad, 10DBA: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10jcrespo) [14:57:20] (03PS4) 10Elukey: role::memcached: apply interface::rps to all the hosts [puppet] - 10https://gerrit.wikimedia.org/r/472099 (https://phabricator.wikimedia.org/T203786) [14:59:03] (03CR) 10Elukey: [C: 03+2] role::memcached: apply interface::rps to all the hosts [puppet] - 10https://gerrit.wikimedia.org/r/472099 (https://phabricator.wikimedia.org/T203786) (owner: 10Elukey) [15:00:03] 10Operations, 10ops-eqiad, 10DBA: rack/setup/deploy eqiad dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T219399 (10Cmjohnson) [15:00:23] 10Operations, 10Traffic: Make authdns-update compatible with local emergency changes - https://phabricator.wikimedia.org/T219400 (10Volans) [15:00:32] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (10Nuria) @Vojtech.dostal As of this month event metrics supports couple additional projects: https://phabricator.wikimedia.org/T217058 (wiktionary... 
[15:00:35] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15383/mwlog1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/499220 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [15:00:37] (03PS2) 10Santhosh: ExternalGuidance: Allow google translate hosts as known services [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498913 (https://phabricator.wikimedia.org/T218948) [15:00:56] (03PS4) 10Giuseppe Lavagetto: arclamp: abstract arclamp::instance out of arclamp [puppet] - 10https://gerrit.wikimedia.org/r/499220 (https://phabricator.wikimedia.org/T176916) [15:06:04] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Backlog (Watching / External), and 2 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) @aborrero I though that this is related to T218615, that's why I added you all :) [15:07:56] (03CR) 10BBlack: [C: 03+1] "Looks right, but also scary, please be careful :)" [puppet] - 10https://gerrit.wikimedia.org/r/499488 (owner: 10Ema) [15:08:29] (03CR) 10Dzahn: [C: 03+1] "just did not merge yet because of the current puppet status on cobalt" [puppet] - 10https://gerrit.wikimedia.org/r/499289 (owner: 10Paladox) [15:09:58] (03PS1) 10Jbond: jessie-backports: warn users if the try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) [15:10:00] RECOVERY - MariaDB Slave Lag: s5 on dbstore2001 is OK: OK slave_sql_lag Replication lag: 59.66 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [15:10:12] bblack, ema: shouldn't PURGEs be accepted from mediawiki hosts and such? 
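Editor's note on the PURGE question above: in the reply that follows, application hosts send multicast HTCP purge packets, and a daemon on each cache node turns every one into an HTTP PURGE over a localhost connection. Below is a minimal sketch of that last hop only. It assumes the HTCP packet has already been decoded into a plain URL; the function names, the 127.0.0.1:80 target and the header set are illustrative assumptions, not the daemon actually deployed.

```python
import socket
from urllib.parse import urlsplit


def build_purge_request(url: str) -> bytes:
    """Render an HTTP/1.1 PURGE request for an already-decoded URL.

    Hypothetical helper: HTCP decoding is deliberately left out, and the
    URL is expected to be percent-encoded ASCII.
    """
    parts = urlsplit(url)
    if not parts.netloc:
        raise ValueError(f"URL has no host: {url!r}")
    # Request target is the path plus the query string, if any.
    target = parts.path or "/"
    if parts.query:
        target += "?" + parts.query
    lines = [
        f"PURGE {target} HTTP/1.1",
        f"Host: {parts.netloc}",
        "Connection: close",
        "",  # blank line ends the header block
        "",
    ]
    return "\r\n".join(lines).encode("ascii")


def send_purge(url: str, host: str = "127.0.0.1", port: int = 80,
               timeout: float = 2.0) -> bytes:
    """Send the PURGE to a cache on the local host (address is an assumption)."""
    with socket.create_connection((host, port), timeout=timeout) as conn:
        conn.sendall(build_purge_request(url))
        return conn.recv(4096)  # first chunk of the response, for logging
```

Because the request only ever travels over loopback in this model, a cache can refuse PURGE from any other source address without affecting application hosts, which appears to be the point of the change under review.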
[15:11:55] (03PS5) 10Dzahn: openldap/offboard-user: add wikitech user deactivation [puppet] - 10https://gerrit.wikimedia.org/r/498429 [15:13:19] Krenair: no, nothing's supposed to send an HTTP PURGE from outside of the host [15:13:59] mediawiki/jobqueue/mwscript/etc are supposed to end up sending a multicast HTCP purge packet, and there's a daemon on each cache node that listens for those and translates them into HTTP PURGE over a localhost HTTP connection. [15:14:44] interesting [15:15:55] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [15:16:42] (03CR) 10Alex Monk: [C: 04-1] "Now being shifted around in Ie711940f instead" [puppet] - 10https://gerrit.wikimedia.org/r/498797 (owner: 10Arturo Borrero Gonzalez) [15:17:45] (03PS1) 10Catrope: Set $wgGEHomepageTutorialTitle on beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499508 (https://phabricator.wikimedia.org/T217105) [15:18:43] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "overall LGTM, I have a couple doubts, expressed in comments." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [15:18:49] 10Operations, 10Analytics, 10Analytics-Kanban, 10Wikimedia-Mailing-lists, 10Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (10mforns) > @mforns the mailing list can be removed as mentioned on wikitech: https://wikitech.wikimedia.org/wiki/Mailman#Remove_a_mailing_list > L... [15:19:19] (03PS3) 10CRusnov: Add synchronizing nodes to ganeti-netbox sync. 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 [15:19:22] (03CR) 10Kosta Harlan: [C: 03+1] Set $wgGEHomepageTutorialTitle on beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499508 (https://phabricator.wikimedia.org/T217105) (owner: 10Catrope) [15:19:25] !log slowly rolling out interface::rps to all the mcXXXX nodes - T203786 [15:19:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:19:28] T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [15:21:15] (03CR) 10Giuseppe Lavagetto: [C: 03+1] service::uwsgi: Allow instances to disable logging config (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498516 (https://phabricator.wikimedia.org/T217932) (owner: 10BryanDavis) [15:21:16] Cc: Krinkle,AaronSchulz --^ - this is a perf improvement for the kernel handling of the mc hosts' NICs. It has been running fine on mc1035 for months, I am now slowly rolling it out to the other nodes. 
It should be un-eventful but please ping me if you notice anything weird [15:21:20] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [15:21:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:21:46] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) p:05Triage→03Normal [15:22:32] (03PS1) 10Mathew.onipe: elasticsearch: use standard resources for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/499511 (https://phabricator.wikimedia.org/T214921) [15:22:40] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [15:22:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:22:47] (03CR) 10Alex Monk: [C: 03+1] redirects.dat: Get rid of domains non controlled by WMF [puppet] - 10https://gerrit.wikimedia.org/r/499239 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:23:33] (03CR) 10Alex Monk: "+1 for the idea, but I believe there was a problem with YAML and the default key here" [puppet] - 10https://gerrit.wikimedia.org/r/499426 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:23:56] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: use standard resources for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/499511 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:24:48] !log rebooting iron.wikimedia.org in 5 minutes [15:24:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:25:07] (03PS2) 10Mathew.onipe: elasticsearch: use standard resources for icinga checks [puppet] - 10https://gerrit.wikimedia.org/r/499511 (https://phabricator.wikimedia.org/T214921) [15:25:37] (03CR) 10Alex Monk: [C: 03+1] Allow LE issue the non-canonical redirects service certificate [dns] - 
10https://gerrit.wikimedia.org/r/499156 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:28:49] (03CR) 10Dzahn: "@volans alright, i merged it into this change" [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [15:29:17] (03Abandoned) 10Dzahn: openldap::management: install python-mysqldb package [puppet] - 10https://gerrit.wikimedia.org/r/497788 (owner: 10Dzahn) [15:30:23] !log rebooting bast5001.wikimedia.org in 5 minutes [15:30:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:44] (03CR) 10Mathew.onipe: "PCC output is Ok: https://puppet-compiler.wmflabs.org/compiler1001/15384/" [puppet] - 10https://gerrit.wikimedia.org/r/499511 (https://phabricator.wikimedia.org/T214921) (owner: 10Mathew.onipe) [15:31:20] Prod clear? UBN train blocker to deploy. [15:31:51] Oh, bastion reboots still going on. I'll wait. [15:32:27] (03CR) 10Alex Monk: [C: 03+1] "Also depends on Ie0426af1" [puppet] - 10https://gerrit.wikimedia.org/r/499185 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:32:34] James_F: i can wait if you need me to, got tied up with jessie backports so started late but plan to have them done by 16:00 [15:32:57] jbond42: No no, please go ahead. [15:33:02] ok thanks [15:33:09] if you coordinate such that the deployer isn't using a particular host that is in the process of being rebooted you should be fine..? 
[15:34:53] !log rebooting bast4002.wikimedia.org in 5 minutes [15:34:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:16] (03CR) 10Alex Monk: [C: 04-1] acme_chief: Issue wikiba.se certificate (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499189 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:36:56] (03CR) 10Catrope: [C: 03+2] Set $wgGEHomepageTutorialTitle on beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499508 (https://phabricator.wikimedia.org/T217105) (owner: 10Catrope) [15:37:07] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) a:03Eevans IRC Update: 08:36 < mobrovac> : robh: heh, distributed over the 4 racks,... [15:38:07] (03Merged) 10jenkins-bot: Set $wgGEHomepageTutorialTitle on beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499508 (https://phabricator.wikimedia.org/T217105) (owner: 10Catrope) [15:38:42] !log rebooting bast1002.wikimedia.org in 5 minutes [15:38:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:56] PROBLEM - Host bast4002 is DOWN: PING CRITICAL - Packet loss = 100% [15:41:07] (03CR) 10ArielGlenn: [C: 03+2] add config setting to use lbzip2 for compression of pages-meta-history files [dumps] - 10https://gerrit.wikimedia.org/r/499438 (https://phabricator.wikimedia.org/T214293) (owner: 10ArielGlenn) [15:41:11] (03CR) 10Alex Monk: [C: 03+1] "The wikimedia.org SPF record is probably allowing more than it needs to for these domains (unlikely to be any google or silverpop stuff he" [dns] - 10https://gerrit.wikimedia.org/r/499255 (https://phabricator.wikimedia.org/T193408) (owner: 10Vgutierrez) [15:41:40] RECOVERY - Host bast4002 is UP: PING OK - Packet loss = 0%, RTA = 74.26 ms [15:42:16] !log rebooting 
bast2002.wikimedia.org in 5 minutes [15:42:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:05] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [15:43:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:43:36] (03PS1) 10CRusnov: Take a code-quality pass [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/499515 [15:44:05] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [15:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:44:11] (03CR) 10Alex Monk: "Task T133548 too?" [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:44:53] !log rebooting bast2001.wikimedia.org in 5 minutes [15:44:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:46:33] (03CR) 10jenkins-bot: Set $wgGEHomepageTutorialTitle on beta labs for testing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499508 (https://phabricator.wikimedia.org/T217105) (owner: 10Catrope) [15:50:24] (03PS1) 10GTirloni: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) [15:50:26] (03CR) 10Filippo Giunchedi: [C: 03+1] logstash: add tcp json_lines localhost compatability endpoint [puppet] - 10https://gerrit.wikimedia.org/r/496021 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [15:51:32] (03CR) 10jerkins-bot: [V: 04-1] shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [15:51:37] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) For now lets wait for @eevans to comment with the plan 
on what new servers are replacing... [15:51:50] (03CR) 10Alex Monk: [C: 03+1] "lgtm otherwise" [puppet] - 10https://gerrit.wikimedia.org/r/499201 (https://phabricator.wikimedia.org/T213705) (owner: 10Vgutierrez) [15:53:46] (03PS2) 10GTirloni: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) [15:55:10] (03PS8) 10MSantos: Pass flag use_nodejs10 for maps services [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) [15:55:31] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, idea LGTM" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [15:56:15] !log ariel@deploy1001 Started deploy [dumps/dumps@88ddd76]: ability to use lbzip2 for meta-history compression [15:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:18] !log ariel@deploy1001 Finished deploy [dumps/dumps@88ddd76]: ability to use lbzip2 for meta-history compression (duration: 00m 03s) [15:56:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:25] !log bastion reboots complete [15:56:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:56:42] James_F: all finished [15:57:23] Thanks! [15:57:30] jouncebot: next [15:57:30] In 0 hour(s) and 2 minute(s): Morning SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1600) [15:57:35] Psh. [16:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: #bothumor I � Unicode. All rise for Morning SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1600). [16:00:05] No GERRIT patches in the queue for this window AFAICS. 
[16:00:51] (03PS2) 10Mforns: Remove all references to Wikimetrics [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) [16:01:10] PROBLEM - puppet last run on mw1303 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [16:01:48] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): apt-get update broken on jessie: jessie-updates and jessie-backports removed by Debian - https://phabricator.wikimedia.org/T219333 (10fgiunchedi) >>! In T219333#5061125, @fgiunchedi wrote: > This also affects `package_builder` role on boron wh... [16:02:11] (03PS1) 10ArielGlenn: enable use of lbzip2 for compressing revison history dumps [puppet] - 10https://gerrit.wikimedia.org/r/499519 (https://phabricator.wikimedia.org/T214293) [16:02:19] 10Operations, 10netops: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10RobH) FYI: Please note that even when the ACL is setup, bastions allow SSH proxy but not HTTPS proxy. Alternatively, you can setup proxy via cumin servers to get both. This should be fixed (maki... [16:03:01] (03PS6) 10Dzahn: openldap/offboard-user: add wikitech user deactivation [puppet] - 10https://gerrit.wikimedia.org/r/498429 [16:03:04] (03CR) 10Mforns: "@Arturo, sorry for the vague commit message. I tried to specify a little more." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [16:04:14] (03CR) 10jerkins-bot: [V: 04-1] openldap/offboard-user: add wikitech user deactivation [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [16:04:50] (03PS3) 10Ema: varnish: do not allow PURGE requests from the outside [puppet] - 10https://gerrit.wikimedia.org/r/499488 [16:06:31] (03CR) 10Ema: [C: 03+2] varnish: do not allow PURGE requests from the outside [puppet] - 10https://gerrit.wikimedia.org/r/499488 (owner: 10Ema) [16:08:48] 10Operations, 10Developer-Advocacy, 10Discourse, 10Epic: Bring a discourse instance for technical questions to production - https://phabricator.wikimedia.org/T180853 (10Aklapper) For the records, https://www.mediawiki.org/wiki/Wikimedia_Technology/Annual_Plans/FY2019/TEC1:_Reliability,_Performance,_and_Mai... [16:08:52] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen [16:09:30] (03PS1) 10Jforrester: SDC: Add feature flag for enabling depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499520 [16:09:32] (03PS1) 10Jforrester: SDC: Use feature flag for enabling depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499521 [16:09:34] (03PS1) 10Jforrester: [BETA] SDC: Enable feature flag for depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499522 [16:09:36] (03PS1) 10Jforrester: SDC: Enable feature flag for depicts in UW on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499523 [16:09:55] (03CR) 10Jforrester: [C: 03+2] SDC: Add feature flag for enabling depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499520 (owner: 10Jforrester) [16:10:04] I am confused by the mw memcached error rate recovery [16:10:09] when did it alarm in the first placE? 
[16:10:16] (03CR) 10Dzahn: "> Patch Set 6:" [puppet] - 10https://gerrit.wikimedia.org/r/495735 (https://phabricator.wikimedia.org/T215523) (owner: 10MSantos) [16:10:55] (03CR) 10jerkins-bot: [V: 04-1] [BETA] SDC: Enable feature flag for depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499522 (owner: 10Jforrester) [16:11:12] (03CR) 10jerkins-bot: [V: 04-1] SDC: Enable feature flag for depicts in UW on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499523 (owner: 10Jforrester) [16:11:30] (03Merged) 10jenkins-bot: SDC: Add feature flag for enabling depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499520 (owner: 10Jforrester) [16:11:42] (03PS2) 10Jforrester: [BETA] SDC: Enable feature flag for depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499522 [16:11:45] (03PS2) 10Jforrester: SDC: Enable feature flag for depicts in UW on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499523 [16:13:46] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SDC: Add feature flag for enabling depicts in UW (duration: 00m 57s) [16:13:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:13:50] (03CR) 10Jforrester: [C: 03+2] SDC: Use feature flag for enabling depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499521 (owner: 10Jforrester) [16:13:56] (03CR) 10Jforrester: [C: 03+2] [BETA] SDC: Enable feature flag for depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499522 (owner: 10Jforrester) [16:14:39] (03PS7) 10Dzahn: openldap/offboard-user: add wikitech user deactivation [puppet] - 10https://gerrit.wikimedia.org/r/498429 [16:14:47] (03CR) 10Elukey: ">" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: 10Mforns) [16:15:02] (03Merged) 10jenkins-bot: SDC: Use feature flag for enabling depicts in UW [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/499521 (owner: 10Jforrester) [16:15:07] (03Merged) 10jenkins-bot: [BETA] SDC: Enable feature flag for depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499522 (owner: 10Jforrester) [16:15:41] (03CR) 10ArielGlenn: [C: 04-1] "Cannot be merged until group1 moves to 1.33.0-wmf.23, so likely Thur Mar 28" [puppet] - 10https://gerrit.wikimedia.org/r/499519 (https://phabricator.wikimedia.org/T214293) (owner: 10ArielGlenn) [16:15:50] (03CR) 10jerkins-bot: [V: 04-1] openldap/offboard-user: add wikitech user deactivation [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [16:17:26] (03CR) 10jenkins-bot: SDC: Add feature flag for enabling depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499520 (owner: 10Jforrester) [16:18:16] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: SDC: Use feature flag for enabling depicts in UW (duration: 00m 57s) [16:18:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:20:17] (03PS8) 10Dzahn: openldap/offboard-user: add wikitech user deactivation [puppet] - 10https://gerrit.wikimedia.org/r/498429 [16:23:30] (03PS1) 10Jcrespo: mariadb-backups: Setup dbprov2001 as the backup server [puppet] - 10https://gerrit.wikimedia.org/r/499528 (https://phabricator.wikimedia.org/T218336) [16:24:23] (03PS1) 10Mholloway: Update WikimediaEditorTasks config for DB location split [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499529 [16:24:59] (03CR) 10jerkins-bot: [V: 04-1] openldap/offboard-user: add wikitech user deactivation [puppet] - 10https://gerrit.wikimedia.org/r/498429 (owner: 10Dzahn) [16:27:28] RECOVERY - puppet last run on mw1303 is OK: OK: Puppet is currently enabled, last run 52 seconds ago with 0 failures [16:28:07] (03CR) 10jenkins-bot: SDC: Use feature flag for enabling depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499521 (owner: 10Jforrester) [16:28:11] (03CR) 10jenkins-bot: [BETA] 
SDC: Enable feature flag for depicts in UW [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499522 (owner: 10Jforrester) [16:28:53] !log jforrester@deploy1001 Synchronized php-1.33.0-wmf.23/extensions/GlobalPreferences/includes/GlobalPreferencesFactory.php: Hot-fix T219380 GlobalPreferences: Allow modifiedPrefs to be set even if no UI control (duration: 00m 58s) [16:28:54] 10Operations, 10netops: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10ayounsi) a:05Dzahn→03ayounsi [16:28:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:28:56] T219380: [regression] API-only preferences can't be set - https://phabricator.wikimedia.org/T219380 [16:30:00] (03PS3) 10Jforrester: SDC: Enable feature flag for depicts in UW on Test Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499523 (https://phabricator.wikimedia.org/T217024) [16:30:30] (03CR) 10Jforrester: [C: 04-1] "Not yet." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499523 (https://phabricator.wikimedia.org/T217024) (owner: 10Jforrester) [16:33:54] (03PS1) 10Tulsi Bhagat: Add namespace "Lampiran" at id.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 [16:34:58] (03CR) 10Ayounsi: [C: 03+1] "It's fine, ideally we should also have an host from row A. authdns2001 looks like a good one. Feel free to add it or let me know if I shou" [puppet] - 10https://gerrit.wikimedia.org/r/499421 (https://phabricator.wikimedia.org/T196665) (owner: 10Dzahn) [16:35:21] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Query-Service, and 2 others: Reduce / remove the aggessive cache busting behaviour of wdqs-updater - https://phabricator.wikimedia.org/T217897 (10Smalyshev) > the cache we are talking about there would be unnecessary if the wdqs just hit varnish. It is probl... 
[16:38:31] !log mc20XX and mc1022 have interface::rps enabled - T203786 [16:38:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:34] T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [16:39:32] (03PS1) 10Jforrester: SDC: Enable both new-style and old-style Wikibase federation on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499531 (https://phabricator.wikimedia.org/T214075) [16:40:28] (03PS7) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:40:53] (03CR) 10Jforrester: [C: 03+2] SDC: Enable both new-style and old-style Wikibase federation on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499531 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [16:43:19] (03CR) 10Lucas Werkmeister (WMDE): SDC: Enable both new-style and old-style Wikibase federation on Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499531 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [16:44:24] (03Merged) 10jenkins-bot: SDC: Enable both new-style and old-style Wikibase federation on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499531 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [16:45:45] (03CR) 10Gergő Tisza: [C: 03+1] Update WikimediaEditorTasks config for DB location split [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499529 (owner: 10Mholloway) [16:46:29] (03CR) 10Jforrester: SDC: Enable both new-style and old-style Wikibase federation on Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499531 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [16:46:46] (03PS8) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:46:48] (03PS1) 10Jforrester: SDC: Wikibase likes using HTTP for URI resolution for whatever reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499534 [16:47:19] 10Operations, 10netops: allow bast2002 to connect to mgmt network - https://phabricator.wikimedia.org/T219384 (10Dzahn) @Robh I can't confirm this. I can proxy via bast2002 just like i can via bast2001. Using "`ssh -D 8081 bast2002.wikimedia.org` and setting my browser's proxy settings to SOCKS5 and "localhost... [16:48:14] (03PS1) 10Lucas Werkmeister (WMDE): Fix Wikidata base URI in foreign repositories config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499535 [16:48:25] (03CR) 10Lucas Werkmeister (WMDE): SDC: Enable both new-style and old-style Wikibase federation on Commons (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499531 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [16:49:10] (03Abandoned) 10Jforrester: SDC: Wikibase likes using HTTP for URI resolution for whatever reason [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499534 (owner: 10Jforrester) [16:49:15] (03CR) 10Jforrester: [C: 03+2] Fix Wikidata base URI in foreign repositories config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499535 (owner: 10Lucas Werkmeister (WMDE)) [16:49:35] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10Eevans) >>! In T219404#5062456, @RobH wrote: > For now lets wait for @eevans to comment with th... 
[16:49:41] (03PS1) 10Gilles: Make caching of static performance site explicit [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) [16:49:59] (03CR) 10Lucas Werkmeister (WMDE): "sorry, I didn’t see you’d already uploaded this" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499534 (owner: 10Jforrester) [16:50:01] (03CR) 10jenkins-bot: SDC: Enable both new-style and old-style Wikibase federation on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499531 (https://phabricator.wikimedia.org/T214075) (owner: 10Jforrester) [16:50:15] (03Merged) 10jenkins-bot: Fix Wikidata base URI in foreign repositories config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499535 (owner: 10Lucas Werkmeister (WMDE)) [16:50:28] (03CR) 10jenkins-bot: Fix Wikidata base URI in foreign repositories config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499535 (owner: 10Lucas Werkmeister (WMDE)) [16:51:29] (03PS2) 10Tulsi Bhagat: Add namespace "Lampiran" at id.wiktionary [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 (https://phabricator.wikimedia.org/T218796) [16:51:30] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499534 (owner: 10Jforrester) [16:51:43] (03PS2) 10Jcrespo: mariadb-backups: Setup dbprov2001 as the backup server [puppet] - 10https://gerrit.wikimedia.org/r/499528 (https://phabricator.wikimedia.org/T218336) [16:53:24] (03PS1) 10Bstorm: kube2proxy: Set a 10 sec wait between service restarts on failure [puppet] - 10https://gerrit.wikimedia.org/r/499538 (https://phabricator.wikimedia.org/T219377) [16:53:26] (03CR) 10Tulsi Bhagat: "Requires `namespaceDupes.php --wiki=idwiktionary --fix` after deployment." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/499530 (https://phabricator.wikimedia.org/T218796) (owner: 10Tulsi Bhagat) [16:53:50] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [16:53:59] (03PS3) 10Jcrespo: mariadb-backups: Setup dbprov2001 as the backup server [puppet] - 10https://gerrit.wikimedia.org/r/499528 (https://phabricator.wikimedia.org/T218336) [16:54:36] Lucas_WMDE: Deploying now. [16:54:42] Lucas_WMDE: Thanks for the help. :-) [16:54:53] (03PS9) 10Jforrester: SDC: Enable Depicts functionality on Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498145 (https://phabricator.wikimedia.org/T218913) [16:55:05] (03Abandoned) 10Jforrester: [DNM] SDC: Point TestCommons at TestWikidata, not real Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/498168 (owner: 10Jforrester) [16:55:12] !log jforrester@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T214075 SDC: Enable Wikidata federation on Commons (duration: 00m 57s) [16:55:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:55:15] T214075: Enable federated access to entities and properties from Wikidata to Commons - https://phabricator.wikimedia.org/T214075 [16:55:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) To clarify: We can live without this if it can't happen right away. We do, however, need a timeline for when it might (or might not)... 
[16:56:03] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) [16:56:19] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) a:05Eevans→03Cmjohnson [16:58:22] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), 10Services (watching): rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10RobH) These are also replacing some existing systems once the new systems are fully online: row... [16:58:51] (03CR) 10Andrew Bogott: "This adds a bunch of files but seems to not remove anything... is that because Shinken was previously un-puppetized, or because it leaves " [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [16:59:34] (03Abandoned) 10Jforrester: Initial configuration of depict property for WBMI on Commons, TestCommons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494895 (https://phabricator.wikimedia.org/T217153) (owner: 10Jforrester) [17:03:18] (03CR) 10Andrew Bogott: openldap: spruce up the anti-memory-leak cron for replicas (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498902 (owner: 10Andrew Bogott) [17:04:27] (03PS4) 10Jforrester: SDC: Stop using wgMediaInfoEnable, we're scrapping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494896 [17:04:29] (03PS4) 10Jforrester: SDC: Drop use of temporary wgMediaInfoEnable, being removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494897 [17:05:11] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Only thing missing is declaring the class in modules/profile/manifests/mediawiki/maintenance.pp, rest LGTM" [puppet] - 
10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [17:06:41] (03CR) 10GTirloni: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [17:08:01] (03CR) 10Jcrespo: [C: 03+2] mariadb-backups: Setup dbprov2001 as the backup server [puppet] - 10https://gerrit.wikimedia.org/r/499528 (https://phabricator.wikimedia.org/T218336) (owner: 10Jcrespo) [17:08:07] (03CR) 10BryanDavis: [C: 03+1] kube2proxy: Set a 10 sec wait between service restarts on failure [puppet] - 10https://gerrit.wikimedia.org/r/499538 (https://phabricator.wikimedia.org/T219377) (owner: 10Bstorm) [17:08:53] (03CR) 10Jforrester: [C: 03+2] SDC: Drop use of temporary wgMediaInfoEnable, being removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494897 (owner: 10Jforrester) [17:08:59] (03CR) 10Jforrester: [C: 03+2] SDC: Stop using wgMediaInfoEnable, we're scrapping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494896 (owner: 10Jforrester) [17:09:27] (03CR) 10Andrew Bogott: [C: 03+1] "ok then! I'm interested in what the puppet compiler says but will ask you about that on IRC." 
[puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [17:10:49] !log fermium: /usr/local/sbin/disable_list wikimetrics T211835 [17:10:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:52] T211835: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 [17:12:40] (03PS14) 10KartikMistry: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) [17:13:51] (03PS4) 10Giuseppe Lavagetto: arclamp: remove previously absented files [puppet] - 10https://gerrit.wikimedia.org/r/499221 [17:14:02] (03CR) 10Giuseppe Lavagetto: [C: 03+2] arclamp: remove previously absented files [puppet] - 10https://gerrit.wikimedia.org/r/499221 (owner: 10Giuseppe Lavagetto) [17:14:28] PROBLEM - Backup of s4 in codfw on db1115 is CRITICAL: Backup for s4 at codfw taken more than 8 days ago: Most recent backup 2019-03-19 17:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [17:15:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] kube2proxy: Set a 10 sec wait between service restarts on failure [puppet] - 10https://gerrit.wikimedia.org/r/499538 (https://phabricator.wikimedia.org/T219377) (owner: 10Bstorm) [17:15:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] kube2proxy: Set a 10 sec wait between service restarts on failure [puppet] - 10https://gerrit.wikimedia.org/r/499538 (https://phabricator.wikimedia.org/T219377) (owner: 10Bstorm) [17:16:16] (03PS3) 10ArielGlenn: use MediaWiki maintenance script to get db user and password [dumps] - 10https://gerrit.wikimedia.org/r/498245 (https://phabricator.wikimedia.org/T218923) [17:16:48] <_joe_> 'oh ffs gerrit [17:16:51] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] arclamp: remove previously absented files [puppet] - 10https://gerrit.wikimedia.org/r/499221 (owner: 10Giuseppe Lavagetto) [17:16:54] <_joe_> err jenkins [17:17:04] orilly [17:17:18] PROBLEM - proton 
endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [17:17:24] PROBLEM - Disk space on dbprov2001 is CRITICAL: DISK CRITICAL - /srv/backups/ongoing is not accessible: Permission denied [17:23:08] (03PS2) 10Bstorm: kube2proxy: Set a 10 sec wait between service restarts on failure [puppet] - 10https://gerrit.wikimedia.org/r/499538 (https://phabricator.wikimedia.org/T219377) [17:23:11] (03Merged) 10jenkins-bot: SDC: Stop using wgMediaInfoEnable, we're scrapping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494896 (owner: 10Jforrester) [17:23:16] (03Merged) 10jenkins-bot: SDC: Drop use of temporary wgMediaInfoEnable, being removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494897 (owner: 10Jforrester) [17:24:10] (03CR) 10KartikMistry: "> Only thing missing is declaring the class in modules/profile/manifests/mediawiki/maintenance.pp," [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [17:25:06] (03PS1) 10Alex Monk: Follow-up Id61b31ad: Make dynamicproxy instances not error [puppet] - 10https://gerrit.wikimedia.org/r/499547 [17:25:34] (03CR) 10Alex Monk: "Error: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resour" [puppet] - 10https://gerrit.wikimedia.org/r/485104 (owner: 10Dzahn) [17:25:43] (03CR) 10Bstorm: [C: 03+2] kube2proxy: Set a 10 sec wait between service restarts on failure [puppet] - 10https://gerrit.wikimedia.org/r/499538 (https://phabricator.wikimedia.org/T219377) (owner: 10Bstorm) [17:26:58] PROBLEM - Backup of s7 in codfw on db1115 
is CRITICAL: Backup for s7 at codfw taken more than 8 days ago: Most recent backup 2019-03-19 17:00:02 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [17:29:18] (03CR) 10BryanDavis: [C: 03+1] "The catalog errors this is fixing:" [puppet] - 10https://gerrit.wikimedia.org/r/499547 (owner: 10Alex Monk) [17:30:02] (03CR) 10Alex Monk: "Er, missed one in that case. One moment" [puppet] - 10https://gerrit.wikimedia.org/r/499547 (owner: 10Alex Monk) [17:30:26] (03PS2) 10Bstorm: Follow-up Id61b31ad: Make dynamicproxy instances not error [puppet] - 10https://gerrit.wikimedia.org/r/499547 (owner: 10Alex Monk) [17:30:51] bstorm_, one moment please [17:31:00] 👍🏻 [17:31:14] (03PS3) 10Alex Monk: Follow-up Id61b31ad: Make dynamicproxy instances not error [puppet] - 10https://gerrit.wikimedia.org/r/499547 [17:32:37] (03CR) 10Krinkle: [C: 04-1] "I'm not seeing where it adjusts the filename for the excimer files. We can do that here, in 'format', or in arclamp-log. I'd propose here " [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [17:34:45] bstorm_, see edit compared to bryan's comment :) [17:35:16] heh :) [17:35:36] (03CR) 10Bstorm: [C: 03+2] Follow-up Id61b31ad: Make dynamicproxy instances not error [puppet] - 10https://gerrit.wikimedia.org/r/499547 (owner: 10Alex Monk) [17:35:42] Merging it [17:35:55] My other change today cannot deploy without it as well :) [17:36:03] I just happened to run into it [17:36:17] picked random cert to check public cert, proxy-01 [17:36:23] logged in and it said puppet hadn't run since yesterday [17:36:25] thought that was odd [17:36:28] This explains why the proxy didn't start kube2proxy without my help [17:36:33] puppet should have done that [17:36:35] picked random instance* [17:36:37] but it was broken [17:37:54] Thanks for that :) [17:38:04] other random thing I noticed [17:38:21] (03CR) 10Volans: "One additional comment inline." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/498386 (https://phabricator.wikimedia.org/T126989) (owner: 10Filippo Giunchedi) [17:38:57] (03PS1) 10Alex Monk: labs puppetmaster backend: set second region designate host correctly [puppet] - 10https://gerrit.wikimedia.org/r/499550 [17:38:58] ^ that [17:41:15] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [17:41:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:43] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [17:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:50] but based on the git log that's an andrew thing probably [17:44:03] (03CR) 10Alexandros Kosiaris: [C: 03+2] "PCC says ok at https://puppet-compiler.wmflabs.org/compiler1002/15388/, merging." [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [17:44:14] (03PS15) 10Alexandros Kosiaris: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [17:46:34] Thanks akosiaris ! [17:53:27] (03CR) 10Andrew Bogott: [C: 03+2] "This looks right. 
Be warned that pretty much everything under 'main' is going to be ripped out in a few weeks, so any new deployments sho" [puppet] - 10https://gerrit.wikimedia.org/r/499550 (owner: 10Alex Monk) [17:53:37] (03PS2) 10Andrew Bogott: labs puppetmaster backend: set second region designate host correctly [puppet] - 10https://gerrit.wikimedia.org/r/499550 (owner: 10Alex Monk) [17:54:05] (03CR) 10Giuseppe Lavagetto: "> Patch Set 3: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) (owner: 10Giuseppe Lavagetto) [17:54:33] (03CR) 10Alex Monk: "There is no modules/profile/manifests/openstack/eqiad1/puppetmaster/ directory though :)" [puppet] - 10https://gerrit.wikimedia.org/r/499550 (owner: 10Alex Monk) [17:55:34] (03CR) 10Andrew Bogott: [C: 03+2] "Hm... that will have to be fixed!" [puppet] - 10https://gerrit.wikimedia.org/r/499550 (owner: 10Alex Monk) [17:56:12] (03PS3) 10Arturo Borrero Gonzalez: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [17:57:44] (03CR) 10jerkins-bot: [V: 04-1] shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1800) [18:01:09] (03PS4) 10Giuseppe Lavagetto: arclamp: add a second instance for excimer logs [puppet] - 10https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) [18:01:11] (03PS1) 10Giuseppe Lavagetto: arclamp: fix arclamp-grep file format [puppet] - 10https://gerrit.wikimedia.org/r/499554 [18:01:32] (03PS16) 10Alexandros Kosiaris: Cron to run script to purge old CX drafts [puppet] - 10https://gerrit.wikimedia.org/r/486454 (https://phabricator.wikimedia.org/T189091) (owner: 10KartikMistry) [18:04:07] (03PS4) 10Arturo Borrero Gonzalez: 
shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [18:04:57] (03PS5) 10Jbond: jessie-backports: Remove unsued pins [puppet] - 10https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) [18:05:25] (03CR) 10jerkins-bot: [V: 04-1] shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [18:06:07] (03PS2) 10Andrew Bogott: puppet-compiler: restore the ability to export facts without puppetdb [puppet] - 10https://gerrit.wikimedia.org/r/499007 (https://phabricator.wikimedia.org/T219430) [18:06:12] (03PS2) 10Andrew Bogott: puppet compiler: collect facts from cloud VMs as well as prod hosts [puppet] - 10https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) [18:06:16] (03CR) 10Krinkle: Make caching of static performance site explicit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/499537 (https://phabricator.wikimedia.org/T219417) (owner: 10Gilles) [18:10:16] !log interface::rps applied to all the mc10XX hosts - T203786 [18:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:20] T203786: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 [18:10:38] PROBLEM - Backup of s4 in eqiad on db1115 is CRITICAL: Backup for s4 at eqiad taken more than 8 days ago: Most recent backup 2019-03-19 18:05:11 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [18:10:40] 10Operations, 10serviceops, 10User-jijiki: Ramp up percentage of users on php7.2 to 100% on both API and appserver clusters - https://phabricator.wikimedia.org/T219150 (10jijiki) [18:12:16] (03PS5) 10Arturo Borrero Gonzalez: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) 
(owner: 10GTirloni) [18:12:24] (03CR) 10Jbond: jessie-backports: Remove unsued pins (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499453 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [18:12:52] !log update grants on db1115 for new provisioning hosts on codfw T218336 [18:12:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:55] T218336: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 [18:14:08] 10Operations, 10cloud-services-team, 10serviceops, 10Core Platform Team Backlog (Watching / External), and 2 others: Switch cronjobs on maintenance hosts to PHP7 - https://phabricator.wikimedia.org/T195392 (10jijiki) [18:14:36] 10Operations, 10Patch-For-Review, 10User-Elukey: Apply interface::rps to all the mc hosts - https://phabricator.wikimedia.org/T209489 (10elukey) 05Open→03Resolved Done today after merging https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/472099/ More info about timings in the parent task. [18:14:39] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.24; 2019-04-02), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1022 (was mc1035) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) [18:16:15] (03CR) 10DannyS712: [C: 03+1] Add editcontentmodel right to the templateeditor group on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494016 (https://phabricator.wikimedia.org/T217499) (owner: 10Ammarpad) [18:20:04] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) I've setup dbprov2001 and sent snapshots for codfw there. We may have to think a way to coordinate dbprov2001 and dbprov2002. I... 
[18:20:31] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10jcrespo) [18:27:09] (03PS2) 10Jbond: jessie-backports: warn users if the try to use backports on jessie [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) [18:29:02] (03CR) 10Jbond: "thanks filippo, providing more guidance is still outstanding" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499505 (https://phabricator.wikimedia.org/T219333) (owner: 10Jbond) [18:30:08] PROBLEM - exim queue on mx1001 is CRITICAL: CRITICAL: 3336 mails in exim queue. [18:30:26] oh lovely [18:30:36] (03CR) 10jenkins-bot: SDC: Stop using wgMediaInfoEnable, we're scrapping it [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494896 (owner: 10Jforrester) [18:30:40] (03CR) 10jenkins-bot: SDC: Drop use of temporary wgMediaInfoEnable, being removed [mediawiki-config] - 10https://gerrit.wikimedia.org/r/494897 (owner: 10Jforrester) [18:31:57] (03CR) 10Volans: [C: 03+1] "LGTM, do monitor it after deploying to make sure it DTRT" [puppet] - 10https://gerrit.wikimedia.org/r/499288 (owner: 10CRusnov) [18:39:12] gerrit erroring for anyone else? [18:39:42] oh gerrit seems down. 
[18:40:14] PROBLEM - Gerrit Health Check on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [18:40:18] yup, down for me [18:40:34] PROBLEM - Gerrit JSON on gerrit.wikimedia.org is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [18:42:50] ugh [18:48:38] !log restarting gerrit process [18:48:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:17] (03CR) 10GTirloni: "I've added this in Horizon Hiera to fix an error that said this was missing:" [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [18:51:36] RECOVERY - Gerrit Health Check on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 950 bytes in 0.046 second response time https://gerrit.wikimedia.org/r/config/server/healthcheck%7Estatus [18:51:58] RECOVERY - Gerrit JSON on gerrit.wikimedia.org is OK: HTTP OK: HTTP/1.1 200 OK - 26393 bytes in 0.040 second response time https://wikitech.wikimedia.org/wiki/Gerrit%23Monitoring [18:53:06] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas] [18:54:00] PROBLEM - puppet last run on stat1004 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 4 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki] [18:54:04] PROBLEM - puppet last run on stat1006 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 5 minutes ago with 3 failures. 
Failed resources (up to 3 shown): Exec[git_pull_mediawiki/event-schemas],Exec[git_pull_statistics_mediawiki],Exec[git_pull_analytics/reportupdater] [18:54:11] (03PS6) 10Arturo Borrero Gonzalez: shinken: Convert to role/profile [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [18:54:20] PROBLEM - puppet last run on stat1007 is CRITICAL: CRITICAL: Puppet has 7 failures. Last run 5 minutes ago with 7 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_wmde/scripts],Exec[git_pull_wmde/toolkit-analyzer-build],Exec[git_pull_mediawiki/event-schemas] [18:55:25] presumably those are from gerrit issue that recovered above? [18:55:34] since it's all git pulls [18:55:56] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 5:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [18:56:50] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 2 failures. Last run 7 minutes ago with 2 failures. Failed resources (up to 3 shown): Exec[git_pull_operations/mediawiki-config],Exec[git_pull_mediawiki/event-schemas] [18:57:21] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10Eevans) [18:57:41] almost certainly yes chaomodus [19:00:04] marxarelli: Your horoscope predicts another unfortunate MediaWiki train - Americas version deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T1900). [19:01:26] (03PS1) 10Alex Monk: Follow-up Id61b31ad: Don't error by default in mediawiki::errorpage [puppet] - 10https://gerrit.wikimedia.org/r/499567 [19:01:34] Is the train going ahead? 
[19:01:39] thcipriani, ^ [19:01:58] ACKNOWLEDGEMENT - exim queue on mx1001 is CRITICAL: CRITICAL: 5673 mails in exim queue. Effie Mouzeli https://phabricator.wikimedia.org/T216445 I believe the queue will be slowly processed like last time this happened - The acknowledgement expires at: 2019-03-28 23:01:05. [19:02:02] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:02:15] (03CR) 10Alex Monk: "I832c93bb" [puppet] - 10https://gerrit.wikimedia.org/r/485104 (owner: 10Dzahn) [19:03:34] RECOVERY - puppet last run on notebook1004 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:03:56] (03PS1) 10Bstorm: dynamicproxy: expressly convert to integers for error page sizes [puppet] - 10https://gerrit.wikimedia.org/r/499569 (https://phabricator.wikimedia.org/T219377) [19:08:54] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/deploy codfw dedicated backup recovery/provisioning hosts - https://phabricator.wikimedia.org/T218336 (10Marostegui) I would prefer option #1 because it scales better for the future if we need more hosts and it looks cleaner in general, a cen... [19:09:32] (03CR) 10GTirloni: "I've removed clientpackages." 
[puppet] - 10https://gerrit.wikimedia.org/r/499516 (https://phabricator.wikimedia.org/T218146) (owner: 10GTirloni) [19:09:45] (03CR) 10Bstorm: [C: 03+2] Follow-up Id61b31ad: Don't error by default in mediawiki::errorpage [puppet] - 10https://gerrit.wikimedia.org/r/499567 (owner: 10Alex Monk) [19:09:51] Krenair: about to roll the train, yes [19:13:20] (03PS1) 10Dduvall: group1 wikis to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499572 [19:13:22] (03CR) 10Dduvall: [C: 03+2] group1 wikis to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499572 (owner: 10Dduvall) [19:14:33] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.e6-upgrade [19:14:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:14:40] (03Merged) 10jenkins-bot: group1 wikis to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499572 (owner: 10Dduvall) [19:15:06] RECOVERY - puppet last run on stat1004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [19:15:10] RECOVERY - puppet last run on stat1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [19:15:23] (03CR) 10jenkins-bot: group1 wikis to 1.33.0-wmf.23 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499572 (owner: 10Dduvall) [19:16:00] (03Abandoned) 10Bstorm: dynamicproxy: expressly convert to integers for error page sizes [puppet] - 10https://gerrit.wikimedia.org/r/499569 (https://phabricator.wikimedia.org/T219377) (owner: 10Bstorm) [19:16:36] !log dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.23 [19:18:22] !log dduvall@deploy1001 Synchronized php: group1 wikis to 1.33.0-wmf.23 (duration: 01m 45s) [19:18:37] dduvall@deploy1001: Failed to log message to wiki. Somebody should check the error logs. 
[19:18:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:44] RECOVERY - puppet last run on stat1007 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [19:23:08] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [19:23:12] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [19:23:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:39] !log (resent; originally @ 1916) dduvall@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.33.0-wmf.23 [19:27:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:28:18] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [19:29:40] PROBLEM - pdfrender on scb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [19:30:23] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [19:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:32:25] !log restarting pdfrender on scb1001 [19:32:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:22] RECOVERY - pdfrender on scb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 0.009 second response time https://phabricator.wikimedia.org/T174916 [19:37:12] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton 
[19:43:28] !log removed queued wikidata notification messages for a***a@w**gm**ster.** on mx1001 to address gmail excessive volume rate limiting [19:43:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:43:44] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.force-shard-allocation [19:43:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:48:30] !log gehel@cumin2001 END (PASS) - Cookbook sre.elasticsearch.force-shard-allocation (exit_code=0) [19:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:50:48] (PS1) Ottomata: [WIP] POST test event to service for readinessProbe [deployment-charts] - https://gerrit.wikimedia.org/r/499576 (https://phabricator.wikimedia.org/T218680) [19:51:01] (CR) Ottomata: [V: +2 C: +2] eventgate-analytics - default monitoring.enabled to true [deployment-charts] - https://gerrit.wikimedia.org/r/499320 (owner: Ottomata) [19:54:58] marxarelli: is the train still on-going? [19:55:11] train is done-zo [19:56:18] thnx! [19:56:27] np [19:58:32] oh so group1 has .23 now? [20:00:54] RECOVERY - exim queue on mx1001 is OK: OK: Less than 1000 mails in exim queue. [20:01:21] apergos: indeed: https://tools.wmflabs.org/versions/ [20:01:34] I always forget that link! thanks (again) [20:01:50] woo hoo! [20:03:01] it would perhaps be a good idea to add that link to the bot msgs automatically for train deployments [20:03:37] !log gehel@cumin2001 END (ERROR) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=97) [20:03:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:57] !log gehel@cumin2001 START - Cookbook sre.elasticsearch.e6-upgrade [20:04:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:58] er? 
[20:07:12] Operations, Toolforge, Traffic, HTTPS, Patch-For-Review: Migrate tools.wmflabs.org to https only (and set HSTS) - https://phabricator.wikimedia.org/T102367 (bd808) Live config from tools-proxy-03.tools.eqiad.wmflabs shows only the `add_header Strict-Transport-Security "max-age=86400";` in the... [20:09:34] (PS3) Andrew Bogott: puppet-compiler: restore the ability to export facts without puppetdb [puppet] - https://gerrit.wikimedia.org/r/499007 (https://phabricator.wikimedia.org/T219430) [20:09:36] (PS3) Andrew Bogott: puppet compiler: collect facts from cloud VMs as well as prod hosts [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) [20:09:40] PROBLEM - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 626 threshold =0.15 breach: active_shards_percent_as_number: 80.75030750307504, number_of_data_nodes: 15, cluster_name: production-search-psi-codfw, number_of_pending_tasks: 1, task_max_waiting_in_queue_millis: 0, active_shards: 2626, number_of_in_flight_fetch: 0, initializing_shards: 46, unassig [20:09:40] active_primary_shards: 1084, number_of_nodes: 15, relocating_shards: 0, delayed_unassigned_shards: 0, status: yellow, timed_out: False https://wikitech.wikimedia.org/wiki/Search%23Administration [20:09:56] PROBLEM - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 556 threshold =0.15 breach: relocating_shards: 0, number_of_nodes: 15, active_primary_shards: 1108, number_of_in_flight_fetch: 0, status: yellow, number_of_pending_tasks: 0, unassigned_shards: 546, number_of_data_nodes: 15, initializing_shards: 10, active_shards_percent_as_number: 83.268131206740 [20:09:56] ing_in_queue_millis: 0, timed_out: False, cluster_name: production-search-omega-codfw, active_shards: 2767, delayed_unassigned_shards: 0 
https://wikitech.wikimedia.org/wiki/Search%23Administration [20:10:00] PROBLEM - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 607 threshold =0.15 breach: relocating_shards: 0, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, active_shards: 2465, number_of_pending_tasks: 0, timed_out: False, number_of_nodes: 30, unassigned_shards: 569, number_of_in_flight_fetch: 0, cluster_name: production-search-codfw, [20:10:00] rds: 38, active_primary_shards: 1028, status: yellow, number_of_data_nodes: 30, active_shards_percent_as_number: 80.24088541666666 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:10:00] ^^ expected [20:10:20] not entirely expected, but recovering [20:10:44] not unexpected either though :) [20:11:07] check is a bit too sensitive, and I had a heavy hand with the number of servers restarted at once [20:11:27] yeah, half expected, half unexpected, but nothing to worry about [20:12:42] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 594 threshold =0.15 breach: status: yellow, cluster_name: production-search-codfw, active_shards_percent_as_number: 80.6640625, unassigned_shards: 556, number_of_pending_tasks: 0, relocating_shards: 0, timed_out: False, number_of_in_flight_fetch: 0, number_of_data_nodes: 30, number_of_nod [20:12:42] imary_shards: 1028, delayed_unassigned_shards: 0, initializing_shards: 38, active_shards: 2478, task_max_waiting_in_queue_millis: 0 Gehel cluster restart in progress https://wikitech.wikimedia.org/wiki/Search%23Administration [20:12:42] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 556 threshold =0.15 breach: delayed_unassigned_shards: 0, number_of_data_nodes: 15, task_max_waiting_in_queue_millis: 0, initializing_shards: 10, 
timed_out: False, unassigned_shards: 546, number_of_nodes: 15, active_shards: 2767, number_of_pending_tasks: 0, active_shards_percent_as_number [20:12:42] 89, active_primary_shards: 1108, number_of_in_flight_fetch: 0, relocating_shards: 0, cluster_name: production-search-omega-codfw, status: yellow Gehel cluster restart in progress https://wikitech.wikimedia.org/wiki/Search%23Administration [20:12:42] ACKNOWLEDGEMENT - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is CRITICAL: CRITICAL - elasticsearch inactive shards 620 threshold =0.15 breach: unassigned_shards: 575, delayed_unassigned_shards: 0, number_of_nodes: 15, active_shards_percent_as_number: 80.93480934809348, status: yellow, active_shards: 2632, timed_out: False, number_of_data_nodes: 15, relocating_shards: 0, cluster_name: production-search [20:12:42] max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, number_of_in_flight_fetch: 0, active_primary_shards: 1084, initializing_shards: 45 Gehel cluster restart in progress https://wikitech.wikimedia.org/wiki/Search%23Administration [20:14:00] !log mholloway-shell@deploy1001 Synchronized php-1.33.0-wmf.23/extensions/WikimediaEditorTasks: Fix: Use READ_LOCKING when evaluating whether to update targets_passed (duration: 00m 58s) [20:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:14:11] !log gehel@cumin2001 END (FAIL) - Cookbook sre.elasticsearch.e6-upgrade (exit_code=99) [20:14:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:15:05] (CR) Mforns: [C: -1] "We Analytics decided to freeze this change for a couple months, because we want to keep Wikimetrics cloud instance alive, in case we have " [puppet] - https://gerrit.wikimedia.org/r/499304 (https://phabricator.wikimedia.org/T211835) (owner: Mforns) [20:16:44] Operations, Analytics, Analytics-Kanban, Wikimedia-Mailing-lists: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 
(mforns) [20:17:16] Operations, Analytics, Wikimedia-Mailing-lists: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (mforns) a:mforns→None [20:17:37] Operations, Analytics, Wikimedia-Mailing-lists: Terminate Wikimetrics - https://phabricator.wikimedia.org/T219446 (mforns) [20:18:03] (PS4) Andrew Bogott: puppet compiler: collect facts from cloud VMs as well as prod hosts [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) [20:18:05] (PS1) Andrew Bogott: puppet compiler: add more puppet masters to the fact-collection stage [puppet] - https://gerrit.wikimedia.org/r/499584 [20:19:20] Operations, Analytics, Analytics-Kanban, Wikimedia-Mailing-lists, Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (mforns) [20:20:39] Operations, Analytics, Analytics-Kanban, Wikimedia-Mailing-lists, Patch-For-Review: Sunset Wikimetrics - https://phabricator.wikimedia.org/T211835 (mforns) Added a subtask with the last remaining action items, see: T219446 [20:22:50] (PS2) Mholloway: Update WikimediaEditorTasks config for DB location split [mediawiki-config] - https://gerrit.wikimedia.org/r/499529 [20:23:16] !log mholloway-shell@deploy1001 Synchronized php-1.33.0-wmf.23/extensions/WikimediaEditorTasks: Update DB utils to handle counts and suggestion DBs in different locations (duration: 00m 58s) [20:23:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:20] (CR) Mholloway: [C: +2] Update WikimediaEditorTasks config for DB location split [mediawiki-config] - https://gerrit.wikimedia.org/r/499529 (owner: Mholloway) [20:26:33] (Merged) jenkins-bot: Update WikimediaEditorTasks config for DB location split [mediawiki-config] - https://gerrit.wikimedia.org/r/499529 (owner: Mholloway) [20:27:46] RECOVERY - ElasticSearch health check for shards on 9443 on search.svc.codfw.wmnet is OK: OK - 
elasticsearch status production-search-omega-codfw: timed_out: False, number_of_nodes: 15, initializing_shards: 6, relocating_shards: 0, number_of_data_nodes: 15, delayed_unassigned_shards: 0, task_max_waiting_in_queue_millis: 0, number_of_pending_tasks: 0, active_primary_shards: 1108, active_shards: 2838, number_of_in_flight_fetch: 0, [20:27:46] duction-search-omega-codfw, active_shards_percent_as_number: 85.40475473969305, status: yellow, unassigned_shards: 479 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:29:28] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Update WikimediaEditorTasks config for DB location split (duration: 00m 57s) [20:29:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:29:52] (PS5) Andrew Bogott: puppet compiler: collect facts from cloud VMs as well as prod hosts [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) [20:29:54] (PS2) Andrew Bogott: puppet compiler: add more puppet masters to the fact-collection stage [puppet] - https://gerrit.wikimedia.org/r/499584 [20:30:26] RECOVERY - ElasticSearch health check for shards on 9243 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-codfw: task_max_waiting_in_queue_millis: 0, relocating_shards: 0, initializing_shards: 31, unassigned_shards: 429, active_shards_percent_as_number: 85.02604166666666, number_of_in_flight_fetch: 0, number_of_nodes: 30, number_of_pending_tasks: 0, number_of_data_nodes: 30, cluster_name: production-se [20:30:26] ed_unassigned_shards: 0, status: yellow, timed_out: False, active_shards: 2612, active_primary_shards: 1028 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:33:54] RECOVERY - ElasticSearch health check for shards on 9643 on search.svc.codfw.wmnet is OK: OK - elasticsearch status production-search-psi-codfw: active_shards: 2775, relocating_shards: 0, number_of_nodes: 15, status: yellow, 
timed_out: False, cluster_name: production-search-psi-codfw, number_of_in_flight_fetch: 0, delayed_unassigned_shards: 4, number_of_pending_tasks: 0, active_shards_percent_as_number: 85.33210332103322, number_ [20:33:54] , active_primary_shards: 1084, unassigned_shards: 471, task_max_waiting_in_queue_millis: 0, initializing_shards: 6 https://wikitech.wikimedia.org/wiki/Search%23Administration [20:43:18] (CR) Krinkle: [C: +1] "LGTM. Would mildly prefer without a loop for readability and mutability, but don't mind that much (one will be removed "soon")." [puppet] - https://gerrit.wikimedia.org/r/499222 (https://phabricator.wikimedia.org/T176916) (owner: Giuseppe Lavagetto) [20:43:33] (CR) Krinkle: [C: +1] arclamp: fix arclamp-grep file format [puppet] - https://gerrit.wikimedia.org/r/499554 (owner: Giuseppe Lavagetto) [20:43:44] (CR) jenkins-bot: Update WikimediaEditorTasks config for DB location split [mediawiki-config] - https://gerrit.wikimedia.org/r/499529 (owner: Mholloway) [20:45:45] (CR) Herron: puppet compiler: collect facts from cloud VMs as well as prod hosts (3 comments) [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott) [20:46:45] (CR) Herron: puppet compiler: collect facts from cloud VMs as well as prod hosts (1 comment) [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) (owner: Andrew Bogott) [20:51:17] (PS1) DCausse: [cirrus] only activate wikibase entitySearch with Cirrus in WikibaseSearchSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/499596 (https://phabricator.wikimedia.org/T219448) [20:55:31] !log milimetric@deploy1001 Started deploy [analytics/refinery@fdd21a4]: non-deploy changes and two new oozie jobs [20:55:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:58:05] (CR) Smalyshev: [C: +1] [cirrus] only activate wikibase entitySearch 
with Cirrus in WikibaseSearchSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/499596 (https://phabricator.wikimedia.org/T219448) (owner: DCausse) [20:59:00] need someone to deploy https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/499596 to unbreak broken Commons search [20:59:08] (PS1) Jcrespo: mariadb-snapshots: Require wmf mariadb package present [puppet] - https://gerrit.wikimedia.org/r/499599 (https://phabricator.wikimedia.org/T218336) [20:59:29] SMalyshev: I can deploy it [20:59:53] I'll first test on mwdebug to see if it fixes the issue [21:00:02] there seems to be a deploy in progress dcausse [21:00:06] jouncebot: next [21:00:07] In 1 hour(s) and 59 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T2300) [21:00:19] jouncebot: refresh [21:00:20] I refreshed my knowledge about deployments. [21:00:24] jouncebot: next [21:00:25] In 1 hour(s) and 59 minute(s): Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T2300) [21:00:36] Wait. I had a deploy. [21:00:51] somebody stole a deploy :) [21:00:54] Oh. [21:00:57] jouncebot: now [21:00:57] For the next 0 hour(s) and 59 minute(s): Wikimania scholarships app deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T2100) [21:01:01] That's me. [21:01:04] jouncebot: now [21:01:04] For the next 0 hour(s) and 58 minute(s): Wikimania scholarships app deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T2100) [21:01:26] and milimetri-c is deploying on deploy1001 [21:01:37] Niharika: you want to deploy that first? we've got search completely broken on commons [21:01:44] i.e. no search at all [21:02:07] SMalyshev: I did not know. You can go ahead. Let me know when you're done. [21:02:17] hauskatze: do you know what's being deployed? 
[21:02:53] analytics/refinery.git [21:03:00] PROBLEM - etcd request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 operation=list https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:03:13] "Started deploy [analytics/refinery@fdd21a4]: non-deploy changes and two new oozie jobs" [21:03:18] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=PATCH https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:03:38] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb={PATCH,PUT} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:03:56] hauskatze: thanks it seems unrelated as it touches only the analytics cluster [21:04:08] (PS6) Andrew Bogott: puppet compiler: collect facts from cloud VMs as well as prod hosts [puppet] - https://gerrit.wikimedia.org/r/499026 (https://phabricator.wikimedia.org/T219430) [21:04:10] (PS3) Andrew Bogott: puppet compiler: add more puppet masters to the fact-collection stage [puppet] - https://gerrit.wikimedia.org/r/499584 [21:04:16] dcausse: perfect then :) [21:04:24] hauskatze: thanks for the heads up! [21:04:30] np [21:05:02] Niharika: thanks, dcausse will do it then [21:05:31] (CR) Jforrester: "This'll break it for Wikidata instead. One second." 
[mediawiki-config] - https://gerrit.wikimedia.org/r/499596 (https://phabricator.wikimedia.org/T219448) (owner: DCausse) [21:05:56] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [21:05:58] PROBLEM - SSH on proton1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:06:50] James_F: it should be set in WikibaseSearchSettings.php which is only loaded for wikidata and testwikidata [21:07:06] PROBLEM - etcd request latencies on neon is CRITICAL: instance=10.64.0.40:6443 operation={compareAndSwap,get,list} https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:07:10] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:07:12] RECOVERY - SSH on proton1001 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u3 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [21:07:19] !log milimetric@deploy1001 Finished deploy [analytics/refinery@fdd21a4]: non-deploy changes and two new oozie jobs (duration: 11m 48s) [21:07:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:08:18] PROBLEM - configured eth on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [21:08:36] PROBLEM - proton endpoints health on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:09:52] dcausse: OK. [21:10:04] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:10:05] (CR) Jforrester: [C: +2] "Let's try it out in prod?" [mediawiki-config] - https://gerrit.wikimedia.org/r/499596 (https://phabricator.wikimedia.org/T219448) (owner: DCausse) [21:10:21] James_F: I'll quickly test this on mwdebug1002 [21:10:42] dcausse: Cool. Want to deploy or should I? [21:10:44] PROBLEM - Check systemd state on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [21:10:44] RECOVERY - etcd request latencies on chlorine is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:10:50] PROBLEM - DPKG on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [21:11:00] RECOVERY - etcd request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:11:01] James_F: I just opened my shell to deploy I can do it np [21:11:02] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [21:11:08] Kk. [21:11:12] PROBLEM - Check whether ferm is active by checking the default input chain on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [21:11:21] (My shell is permanently open, because I'm too keen. 
;-)) [21:11:22] PROBLEM - dhclient process on proton1001 is CRITICAL: connect to address 10.64.0.20 port 5666: Connection refused [21:11:24] (Merged) jenkins-bot: [cirrus] only activate wikibase entitySearch with Cirrus in WikibaseSearchSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/499596 (https://phabricator.wikimedia.org/T219448) (owner: DCausse) [21:12:12] RECOVERY - Check systemd state on proton1001 is OK: OK - running: The system is fully operational [21:12:14] RECOVERY - DPKG on proton1001 is OK: All packages OK [21:12:14] RECOVERY - configured eth on proton1001 is OK: OK - interfaces up [21:12:30] RECOVERY - Check whether ferm is active by checking the default input chain on proton1001 is OK: OK ferm input default policy is set [21:13:46] SMalyshev: no luck it still uses the wikibase setup on commons [21:13:56] RECOVERY - dhclient process on proton1001 is OK: PROCS OK: 0 processes with command name dhclient [21:14:00] dcausse: hmm... weird [21:14:17] dcausse: wait what's the default? [21:14:31] I hope it's false :) [21:14:37] yes it's false... hmm [21:14:41] I'm going to revert [21:14:50] so why it's still there? let me see [21:15:37] it does check the config [21:16:12] so maybe it loads WikibaseSearchSettings? [21:16:16] PROBLEM - proton endpoints health on proton1002 is CRITICAL: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Foo page from en.wp.org in letter format) timed out before a response was received: /{domain}/v1/pdf/{title}/{format}/{type} (Print the Bar page from en.wp.org in A4 format using optimized for reading on mobile devices) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/pro [21:16:20] RECOVERY - proton endpoints health on proton1001 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:16:46] SMalyshev: wait I was using mwdebug1001.... [21:16:50] it works [21:16:50] nope doesn't look like it... 
[21:16:59] dcausse: oh... :) [21:17:06] SMalyshev: can you test on mwdebug1002 as well? [21:17:11] sure doing [21:17:18] I'm going to test wikidata to see if nothing breaks [21:17:21] !log restarted proton on proton1001 in response to memory exhaustion and cpu peg [21:17:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:17:27] anything wrong with Wikidata? A couple of users cannot add any kind of link anywhere [21:17:28] RECOVERY - proton endpoints health on proton1002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/proton [21:18:33] hmm weird now search works for me but CirrusDumpQuery still produces the same query [21:18:36] how can it be? [21:19:58] SMalyshev: namespace choices? [21:20:06] hmm [21:20:07] ebernhardson: same in both cases [21:20:38] ebernhardson: try this on mwdebug1002 - what do you see: https://commons.wikimedia.org/w/index.php?search=duck&title=Special%3ASearch&profile=advanced&fulltext=1&advancedSearch-current=%7B%7D&ns0=1&ns4=1&ns5=1&ns6=1&ns100=1&cirrusDumpQuery=1 [21:21:13] SMalyshev: using the browser plugin, search works and cirrusDumpQuery no longer has sitelink_count [21:21:29] hmm I still see one. some caching probably.... [21:21:58] it also brought back the expected commons rescore profile [21:22:41] is wikidata ok as well? 
[21:23:13] dcausse: works fine for me [21:23:26] ok let's deploy this then [21:23:44] yeah sounds good [21:25:55] !log dcausse@deploy1001 Synchronized wmf-config/Wikibase.php: T219448 (duration: 00m 55s) [21:25:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:25:58] T219448: Commons search is broken - https://phabricator.wikimedia.org/T219448 [21:26:12] (PS1) Smalyshev: Use new WBCS on Commons too [mediawiki-config] - https://gerrit.wikimedia.org/r/499608 (https://phabricator.wikimedia.org/T218715) [21:30:36] (CR) jenkins-bot: [cirrus] only activate wikibase entitySearch with Cirrus in WikibaseSearchSettings.php [mediawiki-config] - https://gerrit.wikimedia.org/r/499596 (https://phabricator.wikimedia.org/T219448) (owner: DCausse) [21:36:50] Sorry for the breakage. Thanks, ebernhardson dcausse SMalyshev for the fixes. :-( [21:37:28] James_F: np! [21:38:24] Also https://gerrit.wikimedia.org/r/c/mediawiki/extensions/WikibaseCirrusSearch/+/498459 has an… interesting unit test failure. Is that WBMI's fault? [21:40:36] James_F: possible [21:40:50] even probably [21:40:59] I haven't looked yet but was going to [21:41:47] `WikibaseMediaInfoHooksTest::tearDown() must call parent::tearDown()` is nice and all, except WikibaseMediaInfoHooksTest::tearDown() doesn't exist. [21:45:22] dcausse: do you know what might be the cause of T219452? [21:45:23] T219452: Cannot add a Wikidata sitelink [2019-03-27] - https://phabricator.wikimedia.org/T219452 [21:46:22] hauskatze: no clue :( [21:46:41] (CR) Jforrester: [C: +1] "Fine by me." [mediawiki-config] - https://gerrit.wikimedia.org/r/499608 (https://phabricator.wikimedia.org/T218715) (owner: Smalyshev) [21:46:44] somebody should check logstash just in case [21:46:48] I cannot do that [21:48:46] hauskatze: No error text, no fatal ID, how would I find it in logstash? 
:-( [21:49:08] James_F: the wiki doesn't give me anything [21:49:29] I guess filtering by wikidatawiki will give a lot of results [21:49:39] many INFO or DEBUG [21:49:52] Millions. [21:49:55] hauskatze: no errors for wikidatawiki in the last 15minutes [21:52:28] hauskatze: I'm digging through but seeing nothing. [21:53:02] well, I don't want to waste your time really [21:53:10] Lots of the usual AbuseFilter spam, the normal security/CentralAuth logging. [21:53:11] sb should be able to figure it out [21:54:12] !log restarting proton2002 in order to upgrade ram [21:54:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:36] (Can we relinquish control so Niharika can deploy?) [21:54:39] (CR) DLynch: VE section editing: Enable mobile AB test on remaining target wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/498084 (https://phabricator.wikimedia.org/T218851) (owner: Esanders) [21:56:32] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [21:56:46] Luckily the next deploy window is free. [21:57:15] SMalyshev: Are you folks done? 
[21:57:31] !log restarting proton2001 in order to upgrade ram [21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:58:54] !log restarting proton1002 to upgrade ram [21:58:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:59:47] !log restarting proton1001 to upgrade ram [21:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:01:34] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 3.084 second response time https://phabricator.wikimedia.org/T174916 [22:02:54] Niharika: yes we're done [22:03:13] Operations: Add ram to Proton* - https://phabricator.wikimedia.org/T219456 (Krenair) [22:05:07] Operations, ops-codfw, DBA, procurement: rack/setup/install (5) dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (RobH) p:Triage→Normal [22:05:12] Operations, ops-codfw, DBA, procurement: rack/setup/install (5) dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (RobH) [22:05:32] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:05:51] Operations, ops-codfw, DBA, procurement: rack/setup/install (5) dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (RobH) Any member of #dba team can provide feedback (@jcrespo or @Marostegui) and please then assign to @papaul for followup. [22:06:08] Operations, ops-codfw, DBA, procurement: rack/setup/install (1) testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (RobH) [22:07:40] PROBLEM - puppet last run on ms-be1047 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [22:09:50] Niharika: I assume you're deploying is that right? 
(PS1) Cwhite: prometheus: clean up node exporter transition code [puppet] - https://gerrit.wikimedia.org/r/499667 (https://phabricator.wikimedia.org/T213708) [22:11:29] (Abandoned) Cwhite: prometheus: clean up node exporter transition code [puppet] - https://gerrit.wikimedia.org/r/493269 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite) [22:12:15] (PS4) Jforrester: VE section editing: Enable mobile AB test on remaining target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/498084 (https://phabricator.wikimedia.org/T218851) (owner: Esanders) [22:12:23] (CR) Jforrester: VE section editing: Enable mobile AB test on remaining target wikis (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/498084 (https://phabricator.wikimedia.org/T218851) (owner: Esanders) [22:12:58] (CR) DLynch: [C: +1] VE section editing: Enable mobile AB test on remaining target wikis [mediawiki-config] - https://gerrit.wikimedia.org/r/498084 (https://phabricator.wikimedia.org/T218851) (owner: Esanders) [22:14:07] I see it was delayed by the UBN, np. Had a low prio patch to roll out, but can try later. 
[22:14:50] (PS1) BryanDavis: dynamicproxy: Prevent STS header from non-TLS connections [puppet] - https://gerrit.wikimedia.org/r/499669 (https://phabricator.wikimedia.org/T102367) [22:15:13] (Abandoned) Cwhite: hiera: install node exporter 0.17 in beta [puppet] - https://gerrit.wikimedia.org/r/488593 (https://phabricator.wikimedia.org/T213708) (owner: Cwhite) [22:17:38] (Abandoned) Cwhite: prometheus: make rules and alerts configuration backwards compatible in beta [puppet] - https://gerrit.wikimedia.org/r/488530 (owner: Cwhite) [22:18:20] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 8.234 second response time https://phabricator.wikimedia.org/T174916 [22:20:57] (Abandoned) Cwhite: httpd: subconfig for client handling [puppet] - https://gerrit.wikimedia.org/r/497946 (owner: Cwhite) [22:21:07] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns [22:21:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:21:37] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns (duration: 00m 31s) [22:21:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:22:10] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:25:47] Krinkle: I wasn't deploying yet. I missed the message. Going to deploy now unless there's something more urgent? [22:27:18] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 276 bytes in 9.660 second response time https://phabricator.wikimedia.org/T174916 [22:27:57] Niharika: nope, you've got half an hour or so before the next thing [22:28:08] Plenty. 
[22:30:23] !log niharika29@deploy1001 Started deploy [scholarships/scholarships@9db232d]: Update wikimania-scholarships; includes fix for broken privacy policy link [22:30:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:30:25] !log niharika29@deploy1001 Finished deploy [scholarships/scholarships@9db232d]: Update wikimania-scholarships; includes fix for broken privacy policy link (duration: 00m 02s) [22:30:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:08] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916 [22:34:02] RECOVERY - puppet last run on ms-be1047 is OK: OK: Puppet is currently enabled, last run 50 seconds ago with 0 failures [22:35:39] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns [22:35:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:36:13] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns (duration: 00m 34s) [22:36:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:41:38] (PS5) Krinkle: profiler: Fix stack frame mangling for Excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/499011 (https://phabricator.wikimedia.org/T176916) [22:41:46] (CR) Krinkle: [C: +2] profiler: Fix stack frame mangling for Excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/499011 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle) [22:42:56] (Merged) jenkins-bot: profiler: Fix stack frame mangling for Excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/499011 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle) [22:43:04] RECOVERY - Backup of s4 in eqiad on db1115 is OK: Backup for s4 at eqiad taken less than 8 days ago and larger than 10 GB: Last one 2019-03-27 20:04:49 from db1102.eqiad.wmnet:3314 (111 GB) 
https://wikitech.wikimedia.org/wiki/MariaDB/Backups [22:43:20] * Krinkle stages on mwdebug1002 [22:45:06] !log krinkle@deploy1001 Synchronized wmf-config/profiler.php: I8c7f8c58313d227a6d9959b9f3a1c / T176916 (duration: 00m 59s) [22:45:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:45:10] T176916: Set up sampling profiler for PHP 7 (alternative to HHVM Xenon) - https://phabricator.wikimedia.org/T176916 [22:45:11] * Krinkle finished staging on mwdebug1002 [22:47:21] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns [22:47:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:47:24] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns (duration: 00m 04s) [22:47:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:49:57] (CR) jenkins-bot: profiler: Fix stack frame mangling for Excimer [mediawiki-config] - https://gerrit.wikimedia.org/r/499011 (https://phabricator.wikimedia.org/T176916) (owner: Krinkle) [22:51:27] !log smalyshev@deploy1001 Started deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns [22:51:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:51:58] !log smalyshev@deploy1001 Finished deploy [wdqs/wdqs@4fd1022]: Deploy new bot patterns (duration: 00m 31s) [22:52:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:54:06] Operations, Analytics-Kanban, SRE-Access-Requests, Patch-For-Review: Add Tilman to analytics-admins - https://phabricator.wikimedia.org/T178802 (Nuria) I think @Tbayer does not need access to this data in the immediate future so permits can be withdrawn (and added later should need arise). @Tbaye... [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. 
Time for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190327T2300). [23:00:04] SMalyshev: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:51] * SMalyshev here [23:07:25] I can SWAT [23:07:38] thcipriani: cool, thanks! [23:07:53] (PS2) Thcipriani: Use new WBCS on Commons too [mediawiki-config] - https://gerrit.wikimedia.org/r/499608 (https://phabricator.wikimedia.org/T218715) (owner: Smalyshev) [23:08:15] (CR) Thcipriani: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499608 (https://phabricator.wikimedia.org/T218715) (owner: Smalyshev) [23:09:25] (Merged) jenkins-bot: Use new WBCS on Commons too [mediawiki-config] - https://gerrit.wikimedia.org/r/499608 (https://phabricator.wikimedia.org/T218715) (owner: Smalyshev) [23:10:04] SMalyshev: your change is live on mwdebug1002, check please [23:10:25] well, your "WBCS on Commons too" change, that is :) [23:10:42] thcipriani: checking [23:12:37] (CR) jenkins-bot: Use new WBCS on Commons too [mediawiki-config] - https://gerrit.wikimedia.org/r/499608 (https://phabricator.wikimedia.org/T218715) (owner: Smalyshev) [23:13:21] thcipriani: hm something is not working properly there... [23:13:55] thcipriani: let's revert it for now... 
we've had trouble with search on commons today, so maybe there's still something missing [23:14:16] SMalyshev: ok [23:14:18] don't want to break it for the second time in one day [23:14:19] * thcipriani reverts [23:14:32] PROBLEM - Backup of s1 in codfw on db1115 is CRITICAL: Backup for s1 at codfw taken more than 8 days ago: Most recent backup 2019-03-19 22:50:48 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [23:17:10] (PS1) Thcipriani: Revert "Use new WBCS on Commons too" [mediawiki-config] - https://gerrit.wikimedia.org/r/499679 [23:17:41] (CR) Thcipriani: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499679 (owner: Thcipriani) [23:18:41] (Merged) jenkins-bot: Revert "Use new WBCS on Commons too" [mediawiki-config] - https://gerrit.wikimedia.org/r/499679 (owner: Thcipriani) [23:18:59] (PS2) Thcipriani: Load WikibaseLexemeCirrusSearch on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/499399 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:19:17] (CR) Thcipriani: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499399 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:20:29] (Merged) jenkins-bot: Load WikibaseLexemeCirrusSearch on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/499399 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:21:49] SMalyshev: Load WikibaseLexemeCirrusSearch on test.wikidata.org is live on mwdebug1002, please check [23:22:05] checking [23:23:10] thcipriani: seems to be fine [23:23:22] SMalyshev: cool, going live [23:23:40] (CR) jenkins-bot: Revert "Use new WBCS on Commons too" [mediawiki-config] - https://gerrit.wikimedia.org/r/499679 (owner: Thcipriani) [23:23:42] (CR) jenkins-bot: Load WikibaseLexemeCirrusSearch on test.wikidata.org [mediawiki-config] - https://gerrit.wikimedia.org/r/499399 (https://phabricator.wikimedia.org/T216206) 
(owner: Smalyshev) [23:25:09] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:499399|Load WikibaseLexemeCirrusSearch on test.wikidata.org]] T216206 (duration: 00m 59s) [23:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:25:13] T216206: Set up WikibaseLexemeCirrusSearch extension for Elastic code in WikibaseLexeme - https://phabricator.wikimedia.org/T216206 [23:25:27] ^ SMalyshev live now [23:25:33] great thanks [23:26:32] (PS2) Thcipriani: Load WikibaseLexemeCirrusSearch on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499400 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:26:46] (CR) Thcipriani: [C: +2] "SWAT" [mediawiki-config] - https://gerrit.wikimedia.org/r/499400 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:27:52] (Merged) jenkins-bot: Load WikibaseLexemeCirrusSearch on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499400 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:29:40] SMalyshev: Load WikibaseLexemeCirrusSearch on Wikidata is live on mwdebug1002, please check [23:29:54] checking [23:31:23] thcipriani: everything seems to be good [23:31:56] SMalyshev: great, going live [23:33:58] !log thcipriani@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:499400|Load WikibaseLexemeCirrusSearch on Wikidata]] T216206 (duration: 00m 58s) [23:34:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:34:03] T216206: Set up WikibaseLexemeCirrusSearch extension for Elastic code in WikibaseLexeme - https://phabricator.wikimedia.org/T216206 [23:34:04] ^ SMalyshev live now [23:34:12] thank you! 
[23:34:20] yw :) [23:34:53] (CR) jenkins-bot: Load WikibaseLexemeCirrusSearch on Wikidata [mediawiki-config] - https://gerrit.wikimedia.org/r/499400 (https://phabricator.wikimedia.org/T216206) (owner: Smalyshev) [23:46:48] !log mholloway-shell@deploy1001 Synchronized php-1.33.0-wmf.23/extensions/WikimediaEditorTasks: Fix: Pass database name to the NameTableStore constructor (duration: 00m 57s) [23:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:54:20] RECOVERY - pdfrender on scb1004 is OK: HTTP OK: HTTP/1.1 200 OK - 275 bytes in 1.929 second response time https://phabricator.wikimedia.org/T174916 [23:58:20] PROBLEM - pdfrender on scb1004 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://phabricator.wikimedia.org/T174916