[00:00:09] (CR) Dzahn: "oh look, there was actually an existing typo here "lookup('profiler:" with an -r at the end. that lookup probably never worked" [puppet] - https://gerrit.wikimedia.org/r/624322 (owner: Dzahn)
[00:04:00] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/24934/" [puppet] - https://gerrit.wikimedia.org/r/624322 (owner: Dzahn)
[00:07:14] (PS1) Dzahn: webperf::arclamp: fix typos in lookup [puppet] - https://gerrit.wikimedia.org/r/624352
[00:09:37] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/24935/" [puppet] - https://gerrit.wikimedia.org/r/624352 (owner: Dzahn)
[00:11:07] Operations, ops-codfw, serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (Papaul)
[00:16:20] (PS1) Dzahn: installserver::web_server: switch from nginx full to light variant [puppet] - https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962)
[00:16:47] (CR) jerkins-bot: [V: -1] installserver::web_server: switch from nginx full to light variant [puppet] - https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962) (owner: Dzahn)
[00:18:04] (PS2) Dzahn: installserver::web_server: switch from nginx full to light variant [puppet] - https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962)
[00:21:07] (CR) Dzahn: "https://puppet-compiler.wmflabs.org/compiler1003/24936/apt1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962) (owner: Dzahn)
[00:22:06] (CR) Dave Pifke: [C: +1] "Hmmm... I've noticed this before and not done anything to fix it, because I thought the hierdata keys matched the typo'd names here.
But " [puppet] - https://gerrit.wikimedia.org/r/624352 (owner: Dzahn)
[00:23:32] Operations, ops-codfw, serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (Papaul)
[00:28:16] (PS1) Dzahn: nginx: add data types [puppet] - https://gerrit.wikimedia.org/r/624357
[00:33:10] (CR) Dzahn: "yea, i searched the repo for "profiler::webperf" and there was nothing. It must have just used the defaults from class arclamp. (127.0.0." [puppet] - https://gerrit.wikimedia.org/r/624352 (owner: Dzahn)
[00:34:33] (CR) Dzahn: "there are Hiera keys without the typo. they just set the same value as the default. so it happened to be no difference. though.. actually " [puppet] - https://gerrit.wikimedia.org/r/624352 (owner: Dzahn)
[00:36:49] (PS1) Dzahn: arclamp: add data types [puppet] - https://gerrit.wikimedia.org/r/624364
[00:40:05] (CR) Dave Pifke: [C: +1] "This was also done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/622904, which is well-tested in beta and ready to merge.
:)" [puppet] - https://gerrit.wikimedia.org/r/624364 (owner: Dzahn)
[00:42:10] (PS1) Dzahn: monitoring: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624369
[00:43:14] (CR) jerkins-bot: [V: -1] monitoring: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624369 (owner: Dzahn)
[00:43:37] (PS6) Dave Pifke: arclamp: provide Swift credentials to cron jobs [puppet] - https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776)
[00:43:39] (PS5) Dave Pifke: [WIP] arclamp: serve SVGs from Swift [puppet] - https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776)
[00:45:35] (PS7) Dave Pifke: arclamp: provide Swift credentials to cron jobs [puppet] - https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776)
[00:48:09] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@95d6432]: AQS: Deploying new geoeditors endpoints
[00:48:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:50:16] (PS1) Dzahn: nrpe: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624376
[00:50:37] Operations, cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (Dzahn)
[00:50:48] (CR) jerkins-bot: [V: -1] nrpe: replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/624376 (owner: Dzahn)
[00:52:14] Puppet: Allow variables without hiera calls as lookup() default parameters - https://phabricator.wikimedia.org/T234459 (Dzahn) I don't think this is the case anymore. I have lately been converting a lot of hiera() to lookup() and never got this. Instead it resolves violations to convert. example: https:/...
[00:54:07] Operations, cloud-services-team (Kanban): Use lookup() instead of hiera() in Puppet - https://phabricator.wikimedia.org/T209953 (Dzahn) If you (wmcs) could remove any from this list of openstack related ones that would be appreciated: ` modules/profile/manifests/wmcs/nfs/secondary.pp: $observer_pass...
[01:16:28] !log Glancing at https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1599170628749&to=1599182011243, looks like `wdqs2003`'s blazegraph isn't happy based off the null data entries. Restarting blazegraph: `ryankemper@wdqs2003:~$ sudo systemctl restart wdqs-blazegraph`
[01:16:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:23:12] Operations, ops-codfw, serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (Papaul)
[01:23:14] !log (Following the restart of blazegraph, service has been restored to `wdqs2003`.
See https://grafana.wikimedia.org/d/000000489/wikidata-query-service?orgId=1&var-cluster_name=wdqs&from=1599182219699&to=1599182547699)
[01:23:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:30:01] !log pt1979@cumin2001 START - Cookbook sre.dns.netbox
[01:30:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:35:55] !log pt1979@cumin2001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0)
[01:35:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:38:32] Operations, ops-codfw, serviceops: decommission mw2135-mw2147, mw2187-mw2214 - physical / datacenter part - https://phabricator.wikimedia.org/T261524 (Papaul)
[01:43:10] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul) ` papaul@asw-a-codfw# show | compare [edit interfaces interface-range vlan-private1-a-codfw] member ge-6/0/13 { ... } + member ge-1/0/7...
[01:43:27] Operations, ops-codfw, DBA, DC-Ops: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[01:48:10] (PS1) Papaul: DNS: Add prod DNS for es2026 [dns] - https://gerrit.wikimedia.org/r/624411
[01:49:27] (CR) Papaul: [C: +2] DNS: Add prod DNS for es2026 [dns] - https://gerrit.wikimedia.org/r/624411 (owner: Papaul)
[01:51:27] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@95d6432]: AQS: Deploying new geoeditors endpoints (duration: 63m 18s)
[01:51:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[01:52:14] Operations, ops-codfw, DBA, DC-Ops, Patch-For-Review: (Need By: 2020-08-31) rack/setup/install es20[26-34].codfw.wmnet - https://phabricator.wikimedia.org/T260373 (Papaul)
[02:00:56] (PS1) Papaul: DHCP: Add MAC address for es2026 [puppet] - https://gerrit.wikimedia.org/r/624416 (https://phabricator.wikimedia.org/T260373)
[02:04:09] Initial incident report for the Wednesday (2020/09/02) outage: https://wikitech.wikimedia.org/wiki/Incident_documentation/20200902-wdqs-outage
[02:04:33] s/outage/wdqs outage
[02:04:56] (CR) Papaul: [C: +2] DHCP: Add MAC address for es2026 [puppet] - https://gerrit.wikimedia.org/r/624416 (https://phabricator.wikimedia.org/T260373) (owner: Papaul)
[02:28:11] (CR) Ryan Kemper: [C: +2] "Looks great. I have some nits about some not-good shell practice in this file, but that's for lines that you didn't touch at all." [puppet] - https://gerrit.wikimedia.org/r/623783 (https://phabricator.wikimedia.org/T260986) (owner: DCausse)
[02:50:33] (PS1) Ryan Kemper: Improve shell script compatibility, double quoting [puppet] - https://gerrit.wikimedia.org/r/624420
[02:55:07] (CR) Ryan Kemper: "Just some shell quality of life stuff."
[puppet] - https://gerrit.wikimedia.org/r/624420 (owner: Ryan Kemper)
[05:13:16] !log Deploy MCR schema change on s4 eqiad master T238966
[05:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:13:24] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966
[05:59:53] (CR) ArielGlenn: [C: +1] "OK, although the single = syntax makes my skin crawl :-D" [puppet] - https://gerrit.wikimedia.org/r/624420 (owner: Ryan Kemper)
[06:08:59] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[06:51:45] (CR) JMeybohm: [C: -1] "> Patch Set 6:" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/623464 (owner: Dzahn)
[06:52:03] (CR) JMeybohm: [C: +2] configmaster: add helm-charts to disc_desired_state.py [puppet] - https://gerrit.wikimedia.org/r/623958 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm)
[06:53:08] (CR) Muehlenhoff: [C: +1] "Looks good, thanks for picking this up! In addition after merging the following excess nginx packages need to be removed manually:" [puppet] - https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962) (owner: Dzahn)
[07:00:04] Deploy window No deploys all day! See Deployments/Emergencies if things are broken. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200904T0700)
[07:02:52] (CR) JMeybohm: [C: -1] "Ports are still wrong I guess (see PS6)" [puppet] - https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[07:04:27] (PS3) Muehlenhoff: Turnilo: Remove exception for OPTIONS [puppet] - https://gerrit.wikimedia.org/r/615461
[07:06:04] (CR) ArielGlenn: "It's used as an ad hoc "is this the primary webserving host" flag.
Something else will have to replace it, for the stat1007 case. I'd bett" [puppet] - https://gerrit.wikimedia.org/r/624328 (owner: Dzahn)
[07:06:27] (CR) JMeybohm: [C: +1] lvs::configuration: add push-notifications patch 3/4 [puppet] - https://gerrit.wikimedia.org/r/624014 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[07:07:59] (CR) JMeybohm: [C: +1] "This LGTM. No idea about the linter, though" [dns] - https://gerrit.wikimedia.org/r/623544 (https://phabricator.wikimedia.org/T256973) (owner: Effie Mouzeli)
[07:10:36] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) is CRITICAL: Test Get site-specific CSS returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:12:31] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:27:00] (PS3) Muehlenhoff: profile::java: Add support to deploy debug packages [puppet] - https://gerrit.wikimedia.org/r/617079
[07:30:37] !log installing 4.9.228 kernel on stretch systems (only installing the deb, reboots separately)
[07:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:37:29] (PS14) Kormat: mariadb: simplify mariadb::service [puppet] - https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972)
[07:43:40] (CR) JMeybohm: [C: +2] helmfile: refactor eventgate-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[07:45:03] (Merged) jenkins-bot: helmfile: refactor eventgate-analytics [deployment-charts] - https://gerrit.wikimedia.org/r/621286 (https://phabricator.wikimedia.org/T258572) (owner: JMeybohm)
[07:51:15] (CR) JMeybohm: [C: +2] Convert cxserver to the new layout
[deployment-charts] - https://gerrit.wikimedia.org/r/623749 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[07:52:30] (Merged) jenkins-bot: Convert cxserver to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/623749 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[07:59:36] (PS2) JMeybohm: Convert echostore to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624007 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[07:59:43] (CR) jerkins-bot: [V: -1] Convert echostore to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624007 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[08:03:28] (CR) Alexandros Kosiaris: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/624096 (https://phabricator.wikimedia.org/T261207) (owner: Hashar)
[08:06:09] (PS3) JMeybohm: Convert echostore to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624007 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[08:08:30] !log installing 4.19.132 kernel on buster systems (only installing the deb, reboots separately)
[08:08:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:16:04] (CR) JMeybohm: [C: +2] "* Fixed typo in commit message" [deployment-charts] - https://gerrit.wikimedia.org/r/624007 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[08:16:35] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[08:17:03] (Merged) jenkins-bot: Convert echostore to the new layout [deployment-charts] -
https://gerrit.wikimedia.org/r/624007 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[08:17:07] (CR) Alexandros Kosiaris: [C: +1] push-notif: drop support to statsd-exporter (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/624012 (https://phabricator.wikimedia.org/T260807) (owner: MSantos)
[08:17:21] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) (owner: Andrew Bogott)
[08:18:29] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[08:19:57] (CR) Kormat: [C: +2] Actually pin black/isort versions this time. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/623825 (owner: Kormat)
[08:20:58] (CR) DCausse: [C: +1] Improve shell script compatibility, double quoting [puppet] - https://gerrit.wikimedia.org/r/624420 (owner: Ryan Kemper)
[08:21:56] (Merged) jenkins-bot: Actually pin black/isort versions this time. [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/623825 (owner: Kormat)
[08:29:35] !log roll restart of the hadoop workers (test and analytics cluster) for openjdk upgrades
[08:29:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:31:06] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers
[08:31:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:33:23] (PS2) JMeybohm: Convert proton to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624008 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[08:35:50] (CR) Hashar: "> That's true and actually a bit worse. latest always changes, but it does not even need to reference an existing tag.
It's a tag on its o" [puppet] - https://gerrit.wikimedia.org/r/624096 (https://phabricator.wikimedia.org/T261207) (owner: Hashar)
[08:48:23] PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[08:52:15] Operations, Data-Services, SRE-Access-Requests, Patch-For-Review, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (MoritzMuehlenhoff) >>! In T261145#6417384, @bd808 wrote: > Perhaps one path...
[08:58:42] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
[08:58:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:02:22] (PS1) Kormat: WIP: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. [puppet] - https://gerrit.wikimedia.org/r/624606
[09:03:23] RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[09:05:59] RECOVERY - Long running screen/tmux on an-launcher1002 is OK: OK: No SCREEN or tmux processes detected.
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens
[09:10:06] (CR) Jbond: "LGTM added comments as you may be able to simplify this more" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) (owner: Kormat)
[09:11:00] !log elukey@cumin1001 START - Cookbook sre.hadoop.roll-restart-workers
[09:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:11:10] Operations, Traffic, User-ArielGlenn, User-MoritzMuehlenhoff, User-jbond: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (jbond)
[09:23:22] (PS3) JMeybohm: Convert proton to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624008 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[09:30:35] (PS4) Hashar: Run integration tests on CI [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098)
[09:30:37] (PS1) Hashar: test: TestOnlineSchemaChanger missed analyze config [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/624622 (https://phabricator.wikimedia.org/T261098)
[09:31:36] (CR) jerkins-bot: [V: -1] Run integration tests on CI [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) (owner: Hashar)
[09:31:38] (CR) Hashar: "Rebased:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) (owner: Hashar)
[09:33:52] (CR) JMeybohm: [C: +2] "* Removed unused main release from helmfile.yaml" [deployment-charts] - https://gerrit.wikimedia.org/r/624008 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[09:34:51] (PS2) JMeybohm: Convert wikifeeds to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624009 (owner: Giuseppe Lavagetto)
[09:35:21] (Merged) jenkins-bot: Convert proton to the new
layout [deployment-charts] - https://gerrit.wikimedia.org/r/624008 (https://phabricator.wikimedia.org/T258572) (owner: Giuseppe Lavagetto)
[09:39:40] (PS1) Marostegui: wmnet: Revert m5-master TTL from 1M to 5M [dns] - https://gerrit.wikimedia.org/r/624628 (https://phabricator.wikimedia.org/T260324)
[09:41:53] (CR) Marostegui: [C: +1] "This looks good, let's get a PCC to double check this is a NOOP" [puppet] - https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) (owner: Kormat)
[09:42:22] (PS1) Kormat: wmflib: Add 2 types for service resources. [puppet] - https://gerrit.wikimedia.org/r/624629
[09:42:52] (CR) jerkins-bot: [V: -1] wmflib: Add 2 types for service resources. [puppet] - https://gerrit.wikimedia.org/r/624629 (owner: Kormat)
[09:43:03] (PS2) Kormat: wmflib: Add 2 types for service resources. [puppet] - https://gerrit.wikimedia.org/r/624629
[09:43:37] (CR) jerkins-bot: [V: -1] wmflib: Add 2 types for service resources. [puppet] - https://gerrit.wikimedia.org/r/624629 (owner: Kormat)
[09:44:22] (PS1) JMeybohm: Revert "Convert proton to the new layout" [deployment-charts] - https://gerrit.wikimedia.org/r/624290
[09:44:55] (PS3) Kormat: wmflib: Add 2 types for service resources.
[puppet] - https://gerrit.wikimedia.org/r/624629
[09:44:57] (PS2) JMeybohm: Revert "Convert proton to the new layout" [deployment-charts] - https://gerrit.wikimedia.org/r/624290 (https://phabricator.wikimedia.org/T244843)
[09:46:43] (CR) JMeybohm: [C: +2] Revert "Convert proton to the new layout" [deployment-charts] - https://gerrit.wikimedia.org/r/624290 (https://phabricator.wikimedia.org/T244843) (owner: JMeybohm)
[09:48:12] (Merged) jenkins-bot: Revert "Convert proton to the new layout" [deployment-charts] - https://gerrit.wikimedia.org/r/624290 (https://phabricator.wikimedia.org/T244843) (owner: JMeybohm)
[09:48:42] !log Restart prometheus-mysqld-exporter on db2125
[09:48:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:51:08] (PS24) ZPapierski: Multiple instances of msearch_daemon [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305)
[09:54:10] (CR) Alexandros Kosiaris: "> Patch Set 1:" [puppet] - https://gerrit.wikimedia.org/r/624096 (https://phabricator.wikimedia.org/T261207) (owner: Hashar)
[09:55:13] (CR) Kormat: "> Patch Set 14: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) (owner: Kormat)
[09:58:10] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) >>! In T187984#6434102, @NoFWDaddress wrote: > Look good. thank you Akosiaris for all you work. > > It might be worth it to no...
[09:58:29] (CR) Marostegui: [C: +1] "> Patch Set 14:" [puppet] - https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) (owner: Kormat)
[10:10:04] (CR) Kormat: mariadb: simplify mariadb::service (2 comments) [puppet] - https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) (owner: Kormat)
[10:10:29] (CR) Kormat: [C: +2] mariadb: simplify mariadb::service [puppet] - https://gerrit.wikimedia.org/r/622569 (https://phabricator.wikimedia.org/T256972) (owner: Kormat)
[10:12:03] (CR) Kormat: [C: +1] wmnet: Revert m5-master TTL from 1M to 5M [dns] - https://gerrit.wikimedia.org/r/624628 (https://phabricator.wikimedia.org/T260324) (owner: Marostegui)
[10:12:43] (CR) Marostegui: [C: +2] wmnet: Revert m5-master TTL from 1M to 5M [dns] - https://gerrit.wikimedia.org/r/624628 (https://phabricator.wikimedia.org/T260324) (owner: Marostegui)
[10:27:56] (PS2) Kormat: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. [puppet] - https://gerrit.wikimedia.org/r/624606
[10:28:27] (PS3) Kormat: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. [puppet] - https://gerrit.wikimedia.org/r/624606
[10:28:46] !log Deploy MCR schema change on db1087 (sanitarium master), this will generate lag (probably a few days) on s8 labsdb hosts T238966
[10:28:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:28:53] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis.
- https://phabricator.wikimedia.org/T238966
[10:29:19] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (Cmjohnson)
[10:29:53] (PS1) Marostegui: db1087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/624641
[10:29:55] !log marostegui@cumin1001 dbctl commit (dc=all): 'Depool db1087 for MCR schema change', diff saved to https://phabricator.wikimedia.org/P12492 and previous config saved to /var/cache/conftool/dbconfig/20200904-102955-marostegui.json
[10:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:29] (CR) Marostegui: [C: +2] db1087: Disable notifications [puppet] - https://gerrit.wikimedia.org/r/624641 (owner: Marostegui)
[10:31:28] !log elukey@cumin1001 END (PASS) - Cookbook sre.hadoop.roll-restart-workers (exit_code=0)
[10:31:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:32:06] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Due By: 2020-07-11) rack/setup/install an-worker[1096-1101] - https://phabricator.wikimedia.org/T254892 (Cmjohnson) Open→Resolved resolving
[10:35:23] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (Cmjohnson)
[10:36:14] Operations, ops-eqiad, DC-Ops, Patch-For-Review: (Need By:2020-08-17) label/setup/install pki1001 - https://phabricator.wikimedia.org/T259826 (Cmjohnson) Open→Resolved This server is ready
[10:37:08] (CR) Kormat: "PCC run for cumin hosts (NOOP): https://puppet-compiler.wmflabs.org/compiler1002/24942/" [puppet] - https://gerrit.wikimedia.org/r/624606 (owner: Kormat)
[10:41:01] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (jcrespo) @akosiaris
This is my proposal, db-wise: * Disable https://ticket-test.wikimedia.org so it no longer can query db1077 db * At so...
[10:50:43] (CR) Jbond: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/624606 (owner: Kormat)
[10:52:51] Operations, OTRS, serviceops, Patch-For-Review, User-notice: Update OTRS to the latest stable version (6.0.x) - https://phabricator.wikimedia.org/T187984 (akosiaris) >>! In T187984#6435616, @jcrespo wrote: > @akosiaris This is my proposal, db-wise: > > * Disable https://ticket-test.wikimedia...
[10:54:09] (CR) Jcrespo: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. (2 comments) [puppet] - https://gerrit.wikimedia.org/r/624606 (owner: Kormat)
[11:07:26] (CR) Jbond: "Thanks although in this instance im not sure its worth it see inline" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/624629 (owner: Kormat)
[11:08:55] (PS11) Jbond: prometheus: replace remaining hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[11:09:24] (CR) Jbond: [C: +1] "LGTM thx" [puppet] - https://gerrit.wikimedia.org/r/623666 (owner: Dzahn)
[11:10:17] (CR) Jbond: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/623662 (owner: Dzahn)
[11:10:55] (CR) Jbond: [C: +1] "lgtm" [puppet] - https://gerrit.wikimedia.org/r/623079 (owner: Dzahn)
[11:32:42] (CR) Jbond: [C: +1] "LGTM merging" [puppet] - https://gerrit.wikimedia.org/r/624217 (owner: Southparkfan)
[11:32:59] (CR) Jbond: [C: +2] os_version: remove wheezy, add bullseye [puppet] - https://gerrit.wikimedia.org/r/624217 (owner: Southparkfan)
[11:40:07] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:43:27] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops
[11:50:56] (CR) Kormat: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624606 (owner: Kormat)
[11:51:41] Operations, Data-Services, SRE-Access-Requests, Patch-For-Review, cloud-services-team (Kanban): Enable access for wmcs-admins to run wmcs-prefixed cookbooks on cumin hosts - https://phabricator.wikimedia.org/T261145 (jbond) Further to the comment from moritz it would be useful to know what th...
[11:55:47] (CR) Gehel: "Added Filippo, I'm always confused by the prometheus magic we have here, I'd like him to also have a look." [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[12:05:43] (CR) Gehel: [C: -1] "Looks mostly good, added a few potential improvements." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/621988 (https://phabricator.wikimedia.org/T260305) (owner: ZPapierski)
[12:07:00] (CR) Jcrespo: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624606 (owner: Kormat)
[12:07:57] (CR) Kormat: wmflib: Add 2 types for service resources.
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/624629 (owner: Kormat)
[12:15:04] (CR) Gehel: [C: +1] "LGTM and innocent enough, but I'm far from a bash expert 😊" [puppet] - https://gerrit.wikimedia.org/r/624420 (owner: Ryan Kemper)
[12:15:06] (CR) Gehel: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/617079 (owner: Muehlenhoff)
[12:20:35] PROBLEM - k8s API server requests latencies on chlorine is CRITICAL: instance=10.64.0.45 verb=PATCH https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:23:28] (CR) JMeybohm: [C: +2] Convert wikifeeds to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624009 (owner: Giuseppe Lavagetto)
[12:26:03] (CR) Kormat: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. (1 comment) [puppet] - https://gerrit.wikimedia.org/r/624606 (owner: Kormat)
[12:26:37] (Merged) jenkins-bot: Convert wikifeeds to the new layout [deployment-charts] - https://gerrit.wikimedia.org/r/624009 (owner: Giuseppe Lavagetto)
[12:26:39] (PS4) Kormat: wmflib: Add wmflib::Enable_Service [puppet] - https://gerrit.wikimedia.org/r/624629
[12:28:05] RECOVERY - k8s API server requests latencies on chlorine is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Kubernetes https://grafana.wikimedia.org/dashboard/db/kubernetes-api
[12:30:48] (PS5) Kormat: wmflib: Add wmflib::Enable_Service [puppet] - https://gerrit.wikimedia.org/r/624629
[12:38:03] (CR) Hashar: "Apparently it tries to add a column twice:" [software/wmfmariadbpy] - https://gerrit.wikimedia.org/r/621762 (https://phabricator.wikimedia.org/T261098) (owner: Hashar)
[12:38:05] (CR) Jcrespo: mariadb: Upgrade mariadb::wmfmariadbpy to a profile.
(1 comment) [puppet] - https://gerrit.wikimedia.org/r/624606 (owner: Kormat)
[12:38:07] (PS1) Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/624669
[12:39:05] (CR) jerkins-bot: [V: -1] wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/624669 (owner: Jbond)
[12:40:56] (PS2) Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/624669
[12:46:10] (CR) Jcrespo: "Remember also my previous comment about the template left behind." [puppet] - https://gerrit.wikimedia.org/r/624669 (owner: Jbond)
[12:46:12] (CR) Jcrespo: "Ignore my previous comment, for some reason, git/diff considers one moved and the other deleted and recreated, and that confused me." [puppet] - https://gerrit.wikimedia.org/r/624669 (owner: Jbond)
[12:46:16] (PS3) Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/624669
[12:48:48] (PS4) Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - https://gerrit.wikimedia.org/r/624669
[12:49:07] (CR) Jcrespo: "I am ok with making the config dir called whatever- it will need I think code changes deployed at the same time."
[puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [12:52:24] (03PS5) 10Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 [12:55:20] (03CR) 10Jcrespo: "nitpicking" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [12:55:22] (03CR) 10Kormat: wmfmariadbpy: create seperate module/profile for wmfmariadbpy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [12:55:24] (03CR) 10Jbond: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [12:56:21] (03Abandoned) 10Kormat: mariadb: Upgrade mariadb::wmfmariadbpy to a profile. [puppet] - 10https://gerrit.wikimedia.org/r/624606 (owner: 10Kormat) [13:02:40] (03PS6) 10Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 [13:02:42] (03CR) 10jerkins-bot: [V: 04-1] wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:02:44] (03CR) 10Kormat: [C: 03+1] wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:03:40] (03PS7) 10Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 [13:03:42] (03CR) 10Jbond: "updated pcc (although need to use a better hosts selection)" (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:07:36] (03CR) 10jerkins-bot: [V: 04-1] wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:11:07] (03CR) 10Jcrespo: "Do we want to split this in 2 parts to make the deployment easier- just for practical reasons? (up to you) Eg. 
module + profile creation a" [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:11:09] (03CR) 10Jcrespo: wmfmariadbpy: create seperate module/profile for wmfmariadbpy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:11:11] (03CR) 10Jbond: [C: 03+1] "LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624629 (owner: 10Kormat) [13:11:47] (03CR) 10Jcrespo: wmfmariadbpy: create seperate module/profile for wmfmariadbpy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:14:22] (03CR) 10Kormat: [C: 03+2] wmflib: Add wmflib::Enable_Service [puppet] - 10https://gerrit.wikimedia.org/r/624629 (owner: 10Kormat) [13:15:18] wikibugs is feeling sloow today. it's multiple minutes behind [13:15:23] (03CR) 10Jbond: "> Patch Set 7:" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:15:25] (03PS8) 10Jbond: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 [13:15:27] (03PS9) 10Kormat: wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:15:28] e.g. that last CR i +2'd 6mins before it showed up here [13:15:51] yes the comments from me that just came in where from about 5 mins ago as well [13:16:25] (03CR) 10Kormat: "> Patch Set 7:" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:16:27] (03CR) 10Jcrespo: [C: 03+1] wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:17:29] (03CR) 10Jcrespo: [C: 03+1] "Remember to delete manually the old file on cumin hosts!" 
[puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:20:29] (03CR) 10Kormat: "PCC run is a ~NOOP: https://puppet-compiler.wmflabs.org/compiler1003/24949/" [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:21:29] (03CR) 10Kormat: "And a PCC run for cumin: https://puppet-compiler.wmflabs.org/compiler1001/24950/" [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:23:38] (03CR) 10Kormat: [C: 03+2] wmfmariadbpy: create seperate module/profile for wmfmariadbpy [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:23:42] (03PS1) 10Jcrespo: WMFMariaDB: Update default section -> port assignment path [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624687 [13:24:37] (03CR) 10Jcrespo: [C: 03+1] "That's totally my fault and lazyness." [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622555 (owner: 10Kormat) [13:25:29] RECOVERY - Check systemd state on snapshot1007 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:28:10] (03PS1) 10Kormat: mariadb: Add types to mariadb::service parameters [puppet] - 10https://gerrit.wikimedia.org/r/624688 [13:28:12] (03CR) 10Jcrespo: "For context, the issue here is that mediawiki hosts use binary collation, while many other use a more reasonable utf8mb4 and that sometime" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622554 (owner: 10Kormat) [13:28:14] (03CR) 10Jcrespo: [C: 03+1] Remove unused 'charset' attribute. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622554 (owner: 10Kormat) [13:28:16] (03CR) 10Kormat: [C: 03+2] Remove unused 'charset' attribute. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622554 (owner: 10Kormat) [13:29:11] (03CR) 10Kormat: [C: 03+2] Move WMFMariaDB.__init__() to the top of the class. 
[software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622555 (owner: 10Kormat) [13:29:13] (03CR) 10Kormat: [C: 03+2] WMFMariaDB: Update default section -> port assignment path [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624687 (owner: 10Jcrespo) [13:29:15] (03Merged) 10jenkins-bot: Remove unused 'charset' attribute. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622554 (owner: 10Kormat) [13:31:08] (03Merged) 10jenkins-bot: Move WMFMariaDB.__init__() to the top of the class. [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/622555 (owner: 10Kormat) [13:35:16] (03Merged) 10jenkins-bot: WMFMariaDB: Update default section -> port assignment path [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/624687 (owner: 10Jcrespo) [13:35:22] (03CR) 10Jcrespo: "Not sure how dbprov should be handled, but not a worry here. I will have to solve that when I setup the installation of the backup package" [puppet] - 10https://gerrit.wikimedia.org/r/624606 (owner: 10Kormat) [13:37:35] (03CR) 10Jcrespo: "I commented this on the wrong patch, pasting it here for tracking purposes:" [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:37:39] (03CR) 10Kormat: "> Patch Set 9:" [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:38:17] (03CR) 10Gehel: [C: 03+1] "@apergos: as discussed on IRC, can you merge this when you have time to keep an eye on it?" [puppet] - 10https://gerrit.wikimedia.org/r/624420 (owner: 10Ryan Kemper) [13:38:37] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/624688 (owner: 10Kormat) [13:38:39] (03CR) 10Jcrespo: "Thanks for your work, we are closer and closer to a saner environment." [puppet] - 10https://gerrit.wikimedia.org/r/624669 (owner: 10Jbond) [13:38:41] (03CR) 10ArielGlenn: [C: 03+1] "Ah, I didn't explicitly +1 so doing that now." 
[puppet] - 10https://gerrit.wikimedia.org/r/622342 (https://phabricator.wikimedia.org/T261204) (owner: 10DCausse) [13:39:36] (03CR) 10Gehel: [C: 03+1] "As discussed with apergos:" [puppet] - 10https://gerrit.wikimedia.org/r/622342 (https://phabricator.wikimedia.org/T261204) (owner: 10DCausse) [13:40:40] (03CR) 10Effie Mouzeli: "> Patch Set 8: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:40:42] (03PS9) 10Effie Mouzeli: lvs::configuration: add push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) [13:41:38] (03CR) 10jerkins-bot: [V: 04-1] lvs::configuration: add push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [13:46:52] (03PS2) 10Kormat: mariadb: Add types to mariadb::service parameters [puppet] - 10https://gerrit.wikimedia.org/r/624688 [13:46:54] (03CR) 10Kormat: "New PS, this time using the right type names. Probably. 
:)" [puppet] - 10https://gerrit.wikimedia.org/r/624688 (owner: 10Kormat) [13:53:00] (03CR) 10Jbond: [C: 03+1] mariadb: Add types to mariadb::service parameters [puppet] - 10https://gerrit.wikimedia.org/r/624688 (owner: 10Kormat) [13:53:41] (03CR) 10Kormat: "PCC run passes this time, so i must have gotten it right ;) https://puppet-compiler.wmflabs.org/compiler1001/24952/" [puppet] - 10https://gerrit.wikimedia.org/r/624688 (owner: 10Kormat) [13:56:55] (03CR) 10Kormat: [C: 03+2] mariadb: Add types to mariadb::service parameters [puppet] - 10https://gerrit.wikimedia.org/r/624688 (owner: 10Kormat) [13:59:09] (03CR) 10Andrew Bogott: [C: 03+2] eqiad1 openstack: replace missing monitor classes [puppet] - 10https://gerrit.wikimedia.org/r/624097 (https://phabricator.wikimedia.org/T260200) (owner: 10Andrew Bogott) [14:03:28] (03CR) 10Kormat: "My quest to refactor the puppet code for mariadb got sucked into the quagmire of details. I haven't given up hope in the long-run, but the" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/622444 (https://phabricator.wikimedia.org/T260843) (owner: 10Bstorm) [14:20:21] (03PS2) 10JMeybohm: Convert zotero to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624010 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:22:28] (03CR) 10JMeybohm: [C: 03+2] Convert zotero to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624010 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:22:45] (03PS2) 10JMeybohm: Convert sessionstore to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624017 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:24:26] (03Merged) 10jenkins-bot: Convert zotero to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624010 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:27:55] (03CR) 10JMeybohm: [C: 03+1] lvs::configuration: add 
push-notifications patch 1/4 [puppet] - 10https://gerrit.wikimedia.org/r/623631 (https://phabricator.wikimedia.org/T256973) (owner: 10Effie Mouzeli) [14:29:21] (03CR) 10JMeybohm: [C: 03+2] "Some diff because of removed snakeoil cert.pam. Looks fine" [deployment-charts] - 10https://gerrit.wikimedia.org/r/624017 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:29:27] (03PS3) 10Effie Mouzeli: Add entries for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) [14:30:00] (03PS4) 10Effie Mouzeli: Add entries for push-notifications [dns] - 10https://gerrit.wikimedia.org/r/623541 (https://phabricator.wikimedia.org/T256973) [14:30:36] (03Merged) 10jenkins-bot: Convert sessionstore to the new layout [deployment-charts] - 10https://gerrit.wikimedia.org/r/624017 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [14:31:10] (03PS1) 10Muehlenhoff: Yarn: Remove exception for OPTIONS [puppet] - 10https://gerrit.wikimedia.org/r/624712 [14:33:38] (03CR) 10Elukey: [C: 03+1] Yarn: Remove exception for OPTIONS [puppet] - 10https://gerrit.wikimedia.org/r/624712 (owner: 10Muehlenhoff) [14:38:13] (03CR) 10DCausse: "pcc output: https://puppet-compiler.wmflabs.org/compiler1002/24953/" [puppet] - 10https://gerrit.wikimedia.org/r/624704 (https://phabricator.wikimedia.org/T258835) (owner: 10DCausse) [14:40:57] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:41:26] ^ looking [14:49:15] effie: could it be that this is still a consequence of logging changes (https://phabricator.wikimedia.org/T256459#6264351) and no one raised the threshold? 
[14:49:44] https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?orgId=1&viewPanel=2&var-datasource=eqiad%20prometheus%2Fops&from=1590969600000&to=now [14:51:02] possibly! [14:52:05] AIUI we're counting more stuff as MW fatal now (parsoid) so it *should* be reasonable to just increase the threshold a bit [14:52:45] I wanted to ask someone if it makes sense to add a different channel for parsoid errors [14:52:58] but I don't know how this would be possible though [14:53:22] I will look into ir [14:53:25] it* [14:53:38] cool [14:54:03] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [14:55:50] (03PS8) 10JMeybohm: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [15:01:41] (03PS2) 10CRusnov: toolforge/gridscripts/runninggridtasks.py: Fix Python3 PEP8 Warning [puppet] - 10https://gerrit.wikimedia.org/r/624122 (https://phabricator.wikimedia.org/T247364) [15:03:53] (03CR) 10CRusnov: "Thanks! I have changed based on Xqt's suggestion." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/624122 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:05:36] (03PS2) 10CRusnov: modules/service/files/logstash_checker.py: Fix Python3 PEP8 errors [puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) [15:06:04] (03CR) 10CRusnov: "Thanks for the review, just added a little tag in the top comment so we can add these to the normal python3 tox and winnow the list down." 
[puppet] - 10https://gerrit.wikimedia.org/r/624116 (https://phabricator.wikimedia.org/T247364) (owner: 10CRusnov) [15:07:28] (03PS2) 10CRusnov: modules/admin/data/nda_audit.py: Fix Python3 pep8 errors [puppet] - 10https://gerrit.wikimedia.org/r/624112 (https://phabricator.wikimedia.org/T247364) [15:14:37] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:44:33] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:52:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [15:55:04] (03PS1) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 [15:55:41] (03CR) 10jerkins-bot: [V: 04-1] api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (owner: 10Hnowlan) [15:55:43] (03PS2) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [15:56:20] (03CR) 10jerkins-bot: [V: 04-1] api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [15:56:49] (03PS3) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [15:57:40] (03CR) 10jerkins-bot: [V: 04-1] api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [16:01:49] (03PS4) 10Hnowlan: api-portal: required extended configuration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) [16:14:40] (03CR) 10Reedy: [C: 04-1] "I'd also remove the 'if ( $wgDBname == 'apiportalwiki' ) {' block from CommonSettings-labs.php at the same time" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [16:14:59] (03PS2) 10Andrew Bogott: designate: stop creating 'legacy' entries (that is, things under wmflabs) [puppet] - 10https://gerrit.wikimedia.org/r/620937 (https://phabricator.wikimedia.org/T260614) [16:16:57] (03PS2) 10Andrew Bogott: Nova/Neutron: set dhcp_domain and tld to 
eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614) [16:19:42] (03PS7) 10Effie Mouzeli: systemd: fixes in coredump class [puppet] - 10https://gerrit.wikimedia.org/r/545558 (https://phabricator.wikimedia.org/T236253) [16:20:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:27:55] (03PS1) 10Andrew Bogott: OpenStack Neutron config: remove the 'tld' variable [puppet] - 10https://gerrit.wikimedia.org/r/624763 [16:31:47] (03CR) 10Andrew Bogott: "Mild pcc results here: https://puppet-compiler.wmflabs.org/compiler1002/24954/" [puppet] - 10https://gerrit.wikimedia.org/r/624763 (owner: 10Andrew Bogott) [16:31:56] (03CR) 10JMeybohm: "Found quite some diff here:" [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [16:32:06] (03PS9) 10JMeybohm: helmfile.d: refactor eventgate-main [deployment-charts] - 10https://gerrit.wikimedia.org/r/619437 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [16:33:57] (03PS3) 10Andrew Bogott: Nova/Neutron: set dhcp_domain and tld to eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614) [16:35:17] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:39:05] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:40:16] (03PS4) 10Andrew Bogott: Nova/Neutron: set dhcp_domain to eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/620936 (https://phabricator.wikimedia.org/T260614) [16:40:18] (03PS1) 10Andrew Bogott: wmcs nova fullstack test: expect new VMs under .eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/624772 (https://phabricator.wikimedia.org/T260614) [17:03:18] (03CR) 10Dzahn: "is it still -1?" [puppet] - 10https://gerrit.wikimedia.org/r/623464 (owner: 10Dzahn) [17:03:38] (03CR) 10Dzahn: [C: 03+2] arclamp: add data types [puppet] - 10https://gerrit.wikimedia.org/r/624364 (owner: 10Dzahn) [17:05:33] (03CR) 10Dzahn: "noop on webperf1002 - rebasing the previous change" [puppet] - 10https://gerrit.wikimedia.org/r/624364 (owner: 10Dzahn) [17:08:10] (03CR) 10Cicalese: [C: 04-1] "> Patch Set 4: Code-Review-1" (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/624750 (https://phabricator.wikimedia.org/T261425) (owner: 10Hnowlan) [17:19:27] (03PS8) 10Dzahn: arclamp: provide Swift credentials to cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [17:19:36] (03CR) 10Dzahn: "rebased and added "Optional" to parameters because they have "undef" as default values which isn't a string" [puppet] - 10https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [17:20:32] (03CR) 10jerkins-bot: [V: 04-1] arclamp: provide Swift credentials to cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [17:21:50] (03PS9) 10Dzahn: arclamp: provide Swift credentials to cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776) 
(owner: 10Dave Pifke) [17:27:01] (03CR) 10Dave Pifke: "> Patch Set 7:" [puppet] - 10https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [17:27:42] (03CR) 10Ryan Kemper: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/624420 (owner: 10Ryan Kemper) [17:29:58] (03PS1) 10Dzahn: deployment-prep: add profile::prometheus::memcached_exporter::arguments: '' to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/624795 (https://phabricator.wikimedia.org/T244776) [17:30:35] (03CR) 10jerkins-bot: [V: 04-1] deployment-prep: add profile::prometheus::memcached_exporter::arguments: '' to Hiera [puppet] - 10https://gerrit.wikimedia.org/r/624795 (https://phabricator.wikimedia.org/T244776) (owner: 10Dzahn) [17:31:33] (03PS2) 10Dzahn: deployment-prep: add profile::prometheus::memcached_exporter::arguments [puppet] - 10https://gerrit.wikimedia.org/r/624795 (https://phabricator.wikimedia.org/T244776) [17:32:15] (03CR) 10Dzahn: [C: 03+2] arclamp: provide Swift credentials to cron jobs [puppet] - 10https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [17:34:23] (03CR) 10Dzahn: "on webperf1002 - file in /etc/swift/ has been created" [puppet] - 10https://gerrit.wikimedia.org/r/622904 (https://phabricator.wikimedia.org/T244776) (owner: 10Dave Pifke) [17:48:37] (03CR) 10Dzahn: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [17:52:49] (03CR) 10Dzahn: "and when you said stat1007 you meant labstore1007?" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [17:53:43] (03CR) 10Dzahn: "ignore the last comment. 
compiling on all of them" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [17:55:32] (03CR) 10Dzahn: "ok, there IS a difference in hieradata/hosts files, i see it now:" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [17:58:25] (03CR) 10Dave Pifke: [C: 03+1] "After this is merged, I'll clean up the hack I made in the local Git repo on deployment-puppetmaster04." [puppet] - 10https://gerrit.wikimedia.org/r/624795 (https://phabricator.wikimedia.org/T244776) (owner: 10Dzahn) [18:02:31] !log milimetric@deploy1001 Started deploy [analytics/aqs/deploy@95d6432]: AQS: new editors by country endpoint, low risk so trying on a Friday with SRE blessing [18:02:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:04:15] (03PS6) 10Dave Pifke: [WIP] arclamp: serve SVGs, compressed logs from Swift [puppet] - 10https://gerrit.wikimedia.org/r/623068 (https://phabricator.wikimedia.org/T244776) [18:06:20] (03PS1) 10Nskaggs: Convert wmcs-novastats scripts to python3 [puppet] - 10https://gerrit.wikimedia.org/r/624805 (https://phabricator.wikimedia.org/T218426) [18:06:57] (03CR) 10jerkins-bot: [V: 04-1] Convert wmcs-novastats scripts to python3 [puppet] - 10https://gerrit.wikimedia.org/r/624805 (https://phabricator.wikimedia.org/T218426) (owner: 10Nskaggs) [18:07:03] (03CR) 10Dzahn: [C: 03+2] deployment-prep: add profile::prometheus::memcached_exporter::arguments [puppet] - 10https://gerrit.wikimedia.org/r/624795 (https://phabricator.wikimedia.org/T244776) (owner: 10Dzahn) [18:09:06] (03PS2) 10Nskaggs: Convert wmcs-novastats scripts to python3 [puppet] - 10https://gerrit.wikimedia.org/r/624805 (https://phabricator.wikimedia.org/T218426) [18:10:06] !log milimetric@deploy1001 Finished deploy [analytics/aqs/deploy@95d6432]: AQS: new editors by country endpoint, low risk so trying on a Friday with SRE blessing (duration: 07m 35s) [18:10:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[18:10:45] (03PS1) 10Ssingh: wikidough: update hiera lookups (enclose keys in quotes) [puppet] - 10https://gerrit.wikimedia.org/r/624806 [18:14:08] (03CR) 10Ssingh: "noop on malmok: https://puppet-compiler.wmflabs.org/compiler1003/24960/" [puppet] - 10https://gerrit.wikimedia.org/r/624806 (owner: 10Ssingh) [18:18:29] 10Operations, 10Beta-Cluster-Infrastructure, 10observability: Beta puppet patch "prometheus: make ferm DNS record type configurable" - https://phabricator.wikimedia.org/T244624 (10dpifke) This local patch no longer merges, due to cleanups of the use of `hiera` and type hints. I dropped it when rebasing to u... [18:20:42] (03CR) 10Dave Pifke: "> Patch Set 11:" [puppet] - 10https://gerrit.wikimedia.org/r/381073 (https://phabricator.wikimedia.org/T153468) (owner: 10Hashar) [18:20:51] (03PS1) 10Ssingh: cescout: update hiera lookups (enclose keys in quotes) [puppet] - 10https://gerrit.wikimedia.org/r/624807 [18:21:59] (03PS2) 10Dzahn: dumps: partially remove/rename the do_acme parameter and lookup [puppet] - 10https://gerrit.wikimedia.org/r/624328 [18:22:30] (03CR) 10Ssingh: "noop on cescout1001: https://puppet-compiler.wmflabs.org/compiler1002/24961/" [puppet] - 10https://gerrit.wikimedia.org/r/624807 (owner: 10Ssingh) [18:22:32] (03CR) 10Dzahn: [C: 03+1] cescout: update hiera lookups (enclose keys in quotes) [puppet] - 10https://gerrit.wikimedia.org/r/624807 (owner: 10Ssingh) [18:23:24] (03CR) 10Dzahn: [C: 03+1] "btw: postgres_version is actually a Float instead of a string. if you want and remove the '" [puppet] - 10https://gerrit.wikimedia.org/r/624807 (owner: 10Ssingh) [18:26:53] (03CR) 10Dzahn: [C: 03+1] wikidough: update hiera lookups (enclose keys in quotes) [puppet] - 10https://gerrit.wikimedia.org/r/624806 (owner: 10Ssingh) [18:29:48] (03CR) 10Dzahn: "alright, amended. see new commit message. 
Now just renaming the parameter and hiera key to make it obvious (for the rsync part) and removi" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [18:31:52] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/24962/" [puppet] - 10https://gerrit.wikimedia.org/r/624328 (owner: 10Dzahn) [18:32:23] (03CR) 10Ssingh: [C: 03+2] wikidough: update hiera lookups (enclose keys in quotes) [puppet] - 10https://gerrit.wikimedia.org/r/624806 (owner: 10Ssingh) [18:34:51] (03CR) 10Ssingh: [C: 03+2] cescout: update hiera lookups (enclose keys in quotes) [puppet] - 10https://gerrit.wikimedia.org/r/624807 (owner: 10Ssingh) [18:41:16] (03CR) 10Nskaggs: [C: 03+1] "No issues with python3 compatibility; seems to run as expected on cloud bastion." [puppet] - 10https://gerrit.wikimedia.org/r/622846 (https://phabricator.wikimedia.org/T218426) (owner: 10Bstorm) [18:52:36] 10Operations, 10DBA: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (10Dzahn) [18:56:10] 10Operations, 10DBA: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (10jcrespo) Dzahn, this is ticket is 100% accurate, but you may not be aware of the why of this- which is explained on T224589. I would suggest to add your comments t... [19:01:30] 10Operations, 10DBA: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (10jcrespo) tl;tr: If we want to make tendril work, we need to revert dbmonitor2001 back to jessie to have the php-mysql extension, which would be a huge security con... [19:02:06] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Dzahn) >>! In T224589#5597691, @jcrespo wrote: > I ran manually `a2dismod mpm_event` and now it worked. Confirmed. 
This happens to me all the time and the fix is manually running a2dismod mpm_event. I once made a cha... [19:03:10] 10Operations, 10DBA: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (10Dzahn) [19:03:26] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Dzahn) [19:03:53] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Dzahn) merging in T262085 as a duplicate. I just wish it would actually merge the content like RT did [19:03:56] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) Just to be clear- work on this is stalled because the expected solution is to kill tendril, not to fix it. Manuel is right now working on that but it will take time. [19:04:32] 10Operations, 10DBA: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (10Dzahn) Was about to paste the relevant part and ask more questions about this when I saw your comment. Ack, merged it in as a duplicate. ` 5 class role::tend... [19:05:17] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10jcrespo) >>! In T224589#6436923, @Dzahn wrote: > merging in T262085 as a duplicate. I just wish it would actually merge the content like RT did Feel free to add more context- the 500 were known but the description her... [19:05:19] 10Operations, 10DBA: dbmonitor2001 is lacking mysql_connect(), usage of do_acme for https monitoring - https://phabricator.wikimedia.org/T262085 (10Dzahn) >>! In T262085#6436919, @jcrespo wrote: > tl;tr: If we want to make tendril work, we need to revert dbmonitor2001 back to jessie to have the php-mysql exten... [19:07:53] 10Operations: Migrate dbmonitor hosts to Buster - https://phabricator.wikimedia.org/T224589 (10Dzahn) >>! 
In T224589#6436930, @jcrespo wrote: > Just to be clear- work on this is stalled because the expected solution is to kill tendril, not to fix it. Manuel is right now working on that but it will take time. Th... [19:08:35] (03CR) 10Jcrespo: "Maybe this can be added to wikitech documentation at least?" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [19:09:05] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/transform/html/to/mobile-html/{title} (Get preview mobile HTML for test page) is CRITICAL: Test Get preview mobile HTML for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:09:31] (03Restored) 10Dzahn: httpd: fix mpm_event module conflict with mpm_prefork [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [19:10:40] (03CR) 10Dzahn: "restored to either serve as a reminder to do that or give it another attempt" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [19:12:07] (03CR) 10Andrew Bogott: [C: 03+1] move puppet_alert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/622846 (https://phabricator.wikimedia.org/T218426) (owner: 10Bstorm) [19:12:46] (03CR) 10Andrew Bogott: [C: 03+2] wmcs nova fullstack test: expect new VMs under .eqiad1.wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/624772 (https://phabricator.wikimedia.org/T260614) (owner: 10Andrew Bogott) [19:12:49] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:14:57] (03CR) 10Dzahn: "@Muehlenhoff this would be re T224589#5597603 and other places" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [19:18:58] dpifke: 
there is an alert about "too long since latest timing beacon" on webperf1001. is it already known? [19:19:20] (03CR) 10Jcrespo: [C: 03+1] "This was clearly the fix I needed, as demonstrated on task. I will leave to serviceops the decision if it is safe to deploy it (it looks s" [puppet] - 10https://gerrit.wikimedia.org/r/451206 (https://phabricator.wikimedia.org/T196968) (owner: 10Dzahn) [19:19:25] related to dwswitch maybe [19:21:25] 10Operations: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (10CDanis) [19:21:32] Yup, T261919. [19:21:33] T261919: webperf1001 alert "Service: too long since latest timing beacon" when switched over - https://phabricator.wikimedia.org/T261919 [19:21:50] thanks:) [19:21:53] 10Operations, 10Analytics: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (10Ottomata) [19:22:04] Is there a standard way for doing silences when we flip DCs? I have some ideas on how to engineer this to not fire in that case, but maybe I'm reinventing the wheel. [19:22:30] !log Icinga - ACKing with sticky - alerts on test and dev hosts [19:22:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:22:49] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [19:23:51] (03CR) 10MSantos: [C: 03+2] push-notif: drop support to statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/624012 (https://phabricator.wikimedia.org/T260807) (owner: 10MSantos) [19:24:16] Cool. I'll probably just add $::site to the alert title and document that we should silence it when we flip. (It's going to fire again in codfw when we flip back.) [19:24:49] dpifke: to a certain extent. so.. 
do you already have some notion of a "primary server" in Hiera or elsewhere? [19:25:07] (03Merged) 10jenkins-bot: push-notif: drop support to statsd-exporter [deployment-charts] - 10https://gerrit.wikimedia.org/r/624012 (https://phabricator.wikimedia.org/T260807) (owner: 10MSantos) [19:25:12] a couple services define in common.yaml what is the primary and what is a failover [19:25:14] The daemon queries etcd and shuts itself down if it's not primary. [19:25:34] So I could report that as an additional label on the Prometheus metrics. [19:26:00] 10Operations, 10Analytics: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (10CDanis) [19:27:48] (03CR) 10Bstorm: [C: 03+2] move puppet_alert script to python3 [puppet] - 10https://gerrit.wikimedia.org/r/622846 (https://phabricator.wikimedia.org/T218426) (owner: 10Bstorm) [19:28:01] dpifke: i have an idea as well, i'll suggest something directly on gerrit in a few [19:28:07] I would just need to figure out which values to replace w/ NaN so that Prometheus doesn't extrapolate from previous scrape, and confirm that NaN doesn't trigger the alert. [19:28:08] Ah, cool. [19:28:30] it won't be related to prometheus, just to tell Icinga to not monitor if not in the active site [19:28:38] Oh, that's a clean way to do it. [19:28:51] Much less complicated than what I was proposing. :) [19:36:12] well.. there's the solution that actually reads the active DC from conftool so when there is a switch this just happens. but that requires including profile::conftool::state and conftool::client and that is so far only used by mw maintenance servers, puppetmasters. 
and there are some comments to use it with caution because it's "kind of an antipattern" [19:36:55] so for now i'm falling back to the usual way misc services do it, you get the same thing but still have to flip it in Hiera [19:37:26] or you might see it as advantage or disadvantage that you can control separately which perf server is active [19:39:09] 10Operations, 10Analytics: Deploy an updated eventgate-logging-external with NEL patches - https://phabricator.wikimedia.org/T262087 (10CDanis) Example request/responses of both preflight and actual request are in NDA'd paste P12494 (has my own PII in it) Chrome sends an OPTIONS request to the endpoint URL wi... [19:39:25] or some hack where we parse if there is eqiad or codfw in the name of the graphite_host or prometheus_nodes hostnames it already looks up, but ugly :) [19:42:41] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [19:43:55] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1102.eqiad.w... [19:44:37] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1103.eqiad.w... 
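(Editor's note on the 19:22-19:39 exchange above.) The approach settled on is to have Icinga monitor the timing beacon only on the host in the active site, with the active site still flipped by hand in Hiera. A minimal Python sketch of that gating logic follows. The key name `webperf_active_site`, the check name, and the function itself are hypothetical and only illustrate the idea; they are not taken from the actual Puppet change.

```python
from typing import Dict, List


def checks_for_host(host_site: str, settings: Dict[str, str]) -> List[str]:
    """Return the names of the monitoring checks to define for one host.

    `settings` stands in for Hiera data. The key `webperf_active_site`
    is an assumption made for this sketch, not a real Hiera key.
    """
    active_site = settings.get("webperf_active_site", "eqiad")  # assumed default
    if host_site != active_site:
        # Inactive site: define no beacon check, so nothing can alert.
        return []
    # Putting the site in the check name mirrors the idea of adding
    # $::site to the alert title.
    return ["timing_beacon_freshness_" + host_site]


if __name__ == "__main__":
    hiera_like = {"webperf_active_site": "codfw"}
    for site in ("eqiad", "codfw"):
        print(site, checks_for_host(site, hiera_like))
```

The trade-off mentioned in the chat still applies to this sketch: the setting has to be flipped manually on a datacenter switch, but it also allows choosing the active perf server independently of the global switch.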
[19:45:20] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1104.eqiad.w... [19:45:59] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1105.eqiad.w... [19:46:10] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) == Rollout planning braindump There's three degrees of freedom to play wit... [19:46:27] (03PS12) 10Jeena Huneidi: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [19:46:40] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1106.eqiad.w... [19:47:39] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1107.eqiad.w... 
[19:53:36] (03PS13) 10Jeena Huneidi: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) [19:57:08] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:57:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:57:45] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1104.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1... [19:57:49] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:16] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [19:59:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:59:23] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [19:59:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:00:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:00:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:23] (03PS1) 10Dzahn: webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 [20:01:30] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:01:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:01:38] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:01:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:02:27] (03CR) 10jerkins-bot: [V: 04-1] webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [20:03:37] !log cmjohnson@cumin1001 
END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:03:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:03:51] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1104.eqiad.w... [20:04:23] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:04:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:51] (03PS2) 10Dzahn: webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 [20:05:42] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:05:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:05:53] (03CR) 10jerkins-bot: [V: 04-1] webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [20:07:04] (03PS3) 10Dzahn: webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 [20:10:33] (03PS4) 10Dzahn: webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 [20:11:38] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/24966/" [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [20:13:55] (03CR) 10Dzahn: [V: 03+1] "at least it's not based on individual host names and just eqiad..." 
[puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [20:15:33] (03PS3) 10Dzahn: installserver::web_server: switch from nginx full to light variant [puppet] - 10https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962) [20:16:16] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1104.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1... [20:17:23] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1113.eqiad.w... [20:17:26] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1112.eqiad.w... [20:17:34] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1111.eqiad.w... [20:17:38] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1110.eqiad.w... 
[20:17:43] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1109.eqiad.w... [20:17:47] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1108.eqiad.w... [20:18:48] (03PS14) 10Thcipriani: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [20:20:49] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1102.eqiad.wmnet'] ` and were **ALL** successful. [20:23:00] (03CR) 10Thcipriani: [C: 03+2] "\o/ well done" [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [20:23:28] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1104.eqiad.w... 
[20:24:12] (03Merged) 10jenkins-bot: Script to update image versions [deployment-charts] - 10https://gerrit.wikimedia.org/r/619833 (https://phabricator.wikimedia.org/T255835) (owner: 10Jeena Huneidi) [20:24:35] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1114.eqiad.w... [20:25:06] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1115.eqiad.w... [20:25:44] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1116.eqiad.w... [20:26:52] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by cmjohnson on cumin1001.eqiad.wmnet for hosts: ` an-worker1117.eqiad.w... [20:26:56] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1107.eqiad.wmnet'] ` and were **ALL** successful. 
[20:27:01] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [20:28:37] 10Operations, 10DNS, 10Traffic: 'skip_first' feature flag for gdnsd GeoIP plugin - https://phabricator.wikimedia.org/T261340 (10CDanis) I just had an alternate idea, which wouldn't require any change to gdnsd. The Reporting API allows you to specify a whole group of endpoint URLs, successful delivery to any... [20:29:11] (03CR) 10Dave Pifke: [C: 03+1] "LGTM as an intermediate fix; it's definitely better than letting the alert hang out in a silenced state." (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [20:30:27] ^ Also: thanks a lot for looking at this. Appreciate the help. 
[20:30:33] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10CDanis) [20:30:44] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:30:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:30:51] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:30:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:03] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:31:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:31:07] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:31:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:32:54] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:56] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:34:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:35:03] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1106.eqiad.wmnet'] ` and were **ALL** successful. [20:35:16] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1103.eqiad.wmnet'] ` and were **ALL** successful. 
[20:35:51] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:35:52] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1105.eqiad.wmnet'] ` and were **ALL** successful. [20:35:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:08] !log cmjohnson@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [20:36:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:36:54] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:31] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:38:56] !log cmjohnson@cumin1001 START - Cookbook sre.hosts.downtime [20:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:39:14] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:39:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:41:22] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:43:24] !log cmjohnson@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [20:43:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:44:36] (03CR) 10Dzahn: [C: 03+2] "will confirm on apt2001 first which does not get traffic." 
[puppet] - 10https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962) (owner: 10Dzahn) [20:51:41] !log apt2001 - apt-get remove --purge libnginx* and run puppet to replace nginx-full with nginx-light (T261962) [20:51:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:47] T261962: Migrate install_server::web_server (apt*) to nginx-light - https://phabricator.wikimedia.org/T261962 [20:55:02] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1112.eqiad.wmnet'] ` and were **ALL** successful. [20:55:31] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1110.eqiad.wmnet'] ` and were **ALL** successful. [20:55:35] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1108.eqiad.wmnet'] ` and were **ALL** successful. [20:55:37] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1109.eqiad.wmnet'] ` and were **ALL** successful. 
[20:56:06] (03CR) 10Dzahn: "just did "apt-get remove --purge libnginx* " and ran puppet on apt2001 and the only libnginx* package that gets reinstalled is libnginx-mo" [puppet] - 10https://gerrit.wikimedia.org/r/624355 (https://phabricator.wikimedia.org/T261962) (owner: 10Dzahn) [20:59:57] !log apt2001 - sudo apt-get autoremove [21:00:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:00:39] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1104.eqiad.wmnet'] ` and were **ALL** successful. [21:02:26] !log apt1001 - remove all libnginx-mod* packages except libnginx-mod-http-echo [21:02:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:04:58] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1115.eqiad.wmnet'] ` and were **ALL** successful. [21:05:31] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1116.eqiad.wmnet'] ` and were **ALL** successful. 
[21:06:50] !log apt1001 - removed all libnginx-mod* packages except libnginx-mod-http-echo ; sudo apt-get autoremove ; run puppet ; restarted nginx - apt.wikimedia.org switched to nginx-light (T261962) [21:06:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:06:56] T261962: Migrate install_server::web_server (apt*) to nginx-light - https://phabricator.wikimedia.org/T261962 [21:09:28] 10Operations, 10Traffic, 10Patch-For-Review, 10User-ArielGlenn: Migrate install_server::web_server (apt*) to nginx-light - https://phabricator.wikimedia.org/T261962 (10Dzahn) ` [apt1001:~] $ dpkg -l | grep nginx ii libnginx-mod-http-echo 1.14.2-2+deb10u3 amd64 Bring echo a... [21:09:48] 10Operations, 10Traffic, 10User-ArielGlenn, 10User-MoritzMuehlenhoff, 10User-jbond: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10Dzahn) [21:09:51] 10Operations, 10Traffic, 10Patch-For-Review, 10User-ArielGlenn: Migrate install_server::web_server (apt*) to nginx-light - https://phabricator.wikimedia.org/T261962 (10Dzahn) 05Open→03Resolved [21:11:56] 10Operations, 10Traffic, 10User-ArielGlenn, 10User-MoritzMuehlenhoff, 10User-jbond: Migrate to nginx-light - https://phabricator.wikimedia.org/T164456 (10Dzahn) apt.wikimedia.org (apt1001/apt2001) switched to nginx-light today [21:17:38] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1113.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1... [21:18:22] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1111.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1... 
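(Editor's note on the nginx-light cleanup logged at 21:02 and 21:06.) The manual step was "remove all libnginx-mod* packages except libnginx-mod-http-echo". A small Python sketch of that selection is below, assuming the installed package names are already available as a list. The function name and the example package list are illustrative and do not reflect the actual state of apt1001.

```python
from fnmatch import fnmatch
from typing import Iterable, List, Tuple


def packages_to_remove(
    installed: Iterable[str],
    pattern: str = "libnginx-mod*",
    keep: Tuple[str, ...] = ("libnginx-mod-http-echo",),
) -> List[str]:
    """Select packages matching `pattern`, minus the ones we want to keep."""
    return sorted(p for p in installed if fnmatch(p, pattern) and p not in keep)


if __name__ == "__main__":
    # Illustrative package list only.
    example = [
        "nginx-light",
        "libnginx-mod-http-echo",
        "libnginx-mod-http-geoip",
        "libnginx-mod-mail",
    ]
    print(packages_to_remove(example))
```

The sketch only computes the list; the actual removal in the log was done by hand with apt-get, followed by `apt-get autoremove` and a puppet run.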
[21:18:42] (03PS5) 10Dzahn: webperf: add parameter to disable timing beacon monitoring in eqiad [puppet] - 10https://gerrit.wikimedia.org/r/624873 [21:19:41] (03CR) 10Dzahn: webperf: add parameter to disable timing beacon monitoring in eqiad (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [21:21:37] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1001/24969/" [puppet] - 10https://gerrit.wikimedia.org/r/624873 (owner: 10Dzahn) [21:25:16] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1114.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1... [21:31:17] PROBLEM - Query Service HTTP Port on wdqs2002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 Bad Gateway - 380 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:31:32] !log `ryankemper@wdqs2002:~$ sudo systemctl restart wdqs-blazegraph` [21:31:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:32:10] 10Operations, 10ops-eqiad, 10Analytics-Radar, 10DC-Ops, 10Patch-For-Review: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['an-worker1117.eqiad.wmnet'] ` Of which those **FAILED**: ` ['an-worker1... 
[21:32:41] RECOVERY - Query Service HTTP Port on wdqs2002 is OK: HTTP OK: HTTP/1.1 200 OK - 448 bytes in 0.019 second response time https://wikitech.wikimedia.org/wiki/Wikidata_query_service [21:36:57] !log ryankemper@deploy1001 Started deploy [wdqs/wdqs@c7e6b35]: 0.3.47 [21:37:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:37:57] !log Tests on canary `wdqs1003` passing, beginning full wdqs deploy [21:38:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:44:05] 10Operations, 10Product-Infrastructure-Data, 10Epic, 10Goal, 10Patch-For-Review: automatically collect network error reports from users' browsers (Network Error Logging API) - https://phabricator.wikimedia.org/T257527 (10Nuria) Super cool work! >We could probably do some analysis to figure out the per-... [21:47:05] (03PS12) 10Dzahn: prometheus: replace remaining hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/623666 [21:49:52] !log ryankemper@deploy1001 Finished deploy [wdqs/wdqs@c7e6b35]: 0.3.47 (duration: 12m 55s) [21:49:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:52:46] !log `sudo -E cumin -b 4 'A:wdqs-all' 'systemctl restart wdqs-updater'` [21:52:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:54:18] !log `sudo -E cumin -b 1 'A:wdqs-all and not A:wdqs-test' 'depool && sleep 60 && systemctl restart wdqs-categories && sleep 30 && pool'` [21:54:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:41] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:12:35] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy 
https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:15:19] !log wdqs deploy complete, service is healthy [22:15:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:23:47] ryankemper: 👍 [22:33:39] PROBLEM - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is CRITICAL: /{domain}/v1/page/summary/{title} (Get summary for test page) is CRITICAL: Test Get summary for test page returned the unexpected status 504 (expecting: 200) https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [22:35:35] RECOVERY - Mobileapps LVS eqiad on mobileapps.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [23:34:12] (03PS1) 10Jeena Huneidi: Make update_version.py work with python 3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/624963 (https://phabricator.wikimedia.org/T255835) [23:37:55] (03PS2) 10Jeena Huneidi: Make update_version.py work with python 3.4 [deployment-charts] - 10https://gerrit.wikimedia.org/r/624963 (https://phabricator.wikimedia.org/T255835) [23:42:11] (03PS3) 10Jeena Huneidi: Make update_version.py work with python 3.5 [deployment-charts] - 10https://gerrit.wikimedia.org/r/624963 (https://phabricator.wikimedia.org/T255835) [23:48:46] (03PS1) 10Jeena Huneidi: ci/pipeline/builder.pp: Add ruamel package [puppet] - 10https://gerrit.wikimedia.org/r/624972 (https://phabricator.wikimedia.org/T255835)
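(Editor's note on the wdqs restart logged at 21:54:18.) That command ran `depool && sleep 60 && systemctl restart wdqs-categories && sleep 30 && pool` with a batch size of 1, i.e. one host at a time. The Python sketch below models only the control flow of that pattern: steps chained with `&&` stop at the first failure on a host, and hosts are processed sequentially. The `run` callable and the host names are hypothetical stand-ins. Nothing here calls cumin, and cumin's real scheduling may differ from this simplified loop.

```python
from typing import Callable, List, Sequence

STEPS = [
    "depool",
    "sleep 60",
    "systemctl restart wdqs-categories",
    "sleep 30",
    "pool",
]


def run_chain(host: str, steps: Sequence[str], run: Callable[[str, str], bool]) -> bool:
    """Emulate `a && b && c`: stop at the first failing step."""
    for step in steps:
        if not run(host, step):
            return False
    return True


def rolling(hosts: Sequence[str], steps: Sequence[str],
            run: Callable[[str, str], bool]) -> List[str]:
    """Process one host at a time; return the hosts whose chain failed."""
    return [h for h in hosts if not run_chain(h, steps, run)]


if __name__ == "__main__":
    def dry_run(host: str, step: str) -> bool:
        # Stand-in executor: print instead of running anything.
        print(host + ": " + step)
        return True

    print("failed:", rolling(["host-a", "host-b"], STEPS, dry_run))
```

One property the sketch makes visible: if the restart step fails, `pool` never runs on that host, so it stays depooled until someone intervenes.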