[00:03:03] (03CR) 10Dzahn: "I can just confirm that replacing hiera() with lookup() when there are no default parameters and nothing else is changed has always been n" [puppet] - 10https://gerrit.wikimedia.org/r/634387 (https://phabricator.wikimedia.org/T256972) (owner: 10Dzahn) [00:05:06] (03CR) 10Dzahn: [C: 03+2] "new host is not set up yet but we already know the IP and can start here" [puppet] - 10https://gerrit.wikimedia.org/r/635107 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [00:05:17] (03PS2) 10Dzahn: tcpircbot: add deploy1002 to allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635107 (https://phabricator.wikimedia.org/T265963) [00:05:42] (03PS15) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [00:07:09] (03PS2) 10Dzahn: tcpircbot: remove "tin" IP from allowed hosts [puppet] - 10https://gerrit.wikimedia.org/r/635106 [00:09:40] (03CR) 10Dzahn: [C: 03+2] "This IP is not tin, it's a random appserver now." [puppet] - 10https://gerrit.wikimedia.org/r/635106 (owner: 10Dzahn) [00:12:20] (03PS1) 10Dzahn: site: add deployment_server role on deploy1002 [puppet] - 10https://gerrit.wikimedia.org/r/635404 (https://phabricator.wikimedia.org/T265963) [00:14:07] (03CR) 10Dzahn: [C: 04-1] "needs mcrouter cert first" [puppet] - 10https://gerrit.wikimedia.org/r/635404 (https://phabricator.wikimedia.org/T265963) (owner: 10Dzahn) [00:28:15] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [00:29:53] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets
[00:41:56] (PS1) Dzahn: base/labs: add systemd timer to clean puppet client bucket [puppet] - https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885)
[00:45:13] (CR) Dzahn: "@Paladox still interested in this fix for your ticket from 2017?" [puppet] - https://gerrit.wikimedia.org/r/635406 (https://phabricator.wikimedia.org/T165885) (owner: Dzahn)
[01:25:43] PROBLEM - rpki grafana alert on alert1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: total VRPs alert, valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[01:35:37] RECOVERY - rpki grafana alert on alert1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[01:37:13] (CR) Cicalese: [C: +1] "Looks good to me. OK to self-merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/635382 (https://phabricator.wikimedia.org/T263579) (owner: Ppchelko)
[01:37:58] (CR) Cicalese: [C: +1] "Looks good to me. OK to self-merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/635071 (https://phabricator.wikimedia.org/T264394) (owner: Ppchelko)
[01:42:45] (CR) Cicalese: [C: +1] "Looks good to me. OK to self-merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/635095 (https://phabricator.wikimedia.org/T265954) (owner: Ppchelko)
[01:44:06] (CR) Cicalese: [C: +1] "Looks good to me. OK to self-merge." [mediawiki-config] - https://gerrit.wikimedia.org/r/635086 (https://phabricator.wikimedia.org/T265954) (owner: Ppchelko)
[01:44:57] Operations, Discovery, Traffic, Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (Huji) It seems like since 8/25, WD maxlag has rarely approached 5 seconds ([[ https://grafana.wikimedi...
[03:42:43] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[03:44:07] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[03:54:41] Operations, ops-eqiad, DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (RobH) They emailed me and required I upload the AHS log via a https drop box utility, so I did so along with the IML log file. Awaiting reply from HP support.
[03:57:43] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [04:00:45] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [04:04:57] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:11:31] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [04:35:03] !log re-enabled icinga notifications on all wdqs hosts now that `wdqs-updater` is healthy [04:35:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [04:46:04] (03PS1) 10Ryan Kemper: cirrus: hardcode more_like to codfw cirrus cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635411 [04:47:19] (03CR) 10Ryan Kemper: "Will deploy Weds" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635411 (owner: 10Ryan Kemper) [04:48:09] RECOVERY - Check systemd state on ldap-replica2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:50:08] (03PS2) 10Ryan Kemper: cirrus: Hardcode more_like to codfw cirrus cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635411 [04:53:07] PROBLEM - Check systemd state on ldap-replica2003 is CRITICAL: CRITICAL - degraded: The system is 
operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [04:55:48] 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Nuria) [04:56:52] 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Nuria) [05:11:56] 10Operations, 10DBA, 10User-Kormat: orchestrator: Get packages into WMF apt - https://phabricator.wikimedia.org/T266023 (10Marostegui) p:05Triage→03Medium [05:12:07] 10Operations, 10serviceops, 10vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (10Marostegui) p:05Triage→03Medium [05:12:15] 10Operations, 10Puppet, 10cloud-services-team (Kanban): Using $facts['networking']['ip'] breaks puppet on cloud hosts - https://phabricator.wikimedia.org/T266075 (10Marostegui) p:05Triage→03Medium [05:14:02] 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Marostegui) p:05Triage→03Medium @MoritzMuehlenhoff can you advise on what is the process for handling this? I guess we need to follow https://wikitech.wikimedia.org/wiki/Volunteer_NDA ? [05:26:37] PROBLEM - Ensure hosts are not performing a change on every puppet run on puppetdb1002 is CRITICAL: CRITICAL: the following (8) node(s) change every puppet run: testvm1001.eqiad.wmnet, wtp2012.codfw.wmnet, analytics-tool1004.eqiad.wmnet, wtp2013.codfw.wmnet, ldap-replica2003.wikimedia.org, wtp2010.codfw.wmnet, wtp2014.codfw.wmnet, an-tool1005.eqiad.wmnet https://wikitech.wikimedia.org/wiki/Puppet%23check_puppet_run_changes [05:30:11] ^ this seems to be https://phabricator.wikimedia.org/P13039 [05:34:03] 10Operations, 10Data-Persistence-Backup, 10SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (10Marostegui) >>! 
In T265323#6564767, @jcrespo wrote: > @Marostegui 2 questions: > > * When you said: > >> the disk went full > > only full in activity, not... [06:21:53] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 109 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:23:31] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 27 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [06:29:07] PROBLEM - ores on ores1002 is CRITICAL: connect to address 10.64.0.52 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:31:56] 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10elukey) /me cries but supports the request [06:39:18] (03CR) 10Ayounsi: [C: 03+1] diffscan: switch to new refactored diffscan [puppet] - 10https://gerrit.wikimedia.org/r/634566 (owner: 10Jbond) [06:41:41] RECOVERY - ores on ores1002 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.018 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:44:37] (03PS2) 10Muehlenhoff: acmechief: Also allow ldap-replica2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/634974 (https://phabricator.wikimedia.org/T264388) [06:44:54] marostegui: hola! 
The analytics hosts failing puppet are my fault, taking care of them now [06:46:21] (03PS1) 10Marostegui: site.pp: Add task to clouddb future hosts [puppet] - 10https://gerrit.wikimedia.org/r/635494 [06:46:28] elukey: :**** [06:47:07] (03CR) 10Marostegui: [C: 03+2] site.pp: Add task to clouddb future hosts [puppet] - 10https://gerrit.wikimedia.org/r/635494 (owner: 10Marostegui) [06:47:45] (03PS1) 10Elukey: superset: remove presto TLS config (not needed anymore) [puppet] - 10https://gerrit.wikimedia.org/r/635495 (https://phabricator.wikimedia.org/T253957) [06:48:50] (03CR) 10Elukey: [C: 03+2] superset: remove presto TLS config (not needed anymore) [puppet] - 10https://gerrit.wikimedia.org/r/635495 (https://phabricator.wikimedia.org/T253957) (owner: 10Elukey) [06:51:33] PROBLEM - puppet last run on an-tool1007 is CRITICAL: CRITICAL: Puppet last ran 1 day ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:51:41] fixing also --^ [06:54:26] (03CR) 10Ayounsi: netbox/puppet: Add machinery to get Puppet facts from Netbox (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [06:56:07] 10Operations, 10SRE-Access-Requests: Nuria's volunteer account - https://phabricator.wikimedia.org/T266086 (10Marostegui) Thanks Luca! @Nuria can we get a manager to approve this as well? @faidon or @mark maybe? Further, can you sign https://phabricator.wikimedia.org/L2 Thanks! [06:57:09] RECOVERY - puppet last run on an-tool1007 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:58:35] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10ayounsi) Would another day in the October 26th week be possible otherwise? I want to make sure we have time to schedule a followup work if something doesn't go as planned... 
[06:59:14] (03PS1) 10Giuseppe Lavagetto: admin: remove old rsa key, add new ed25519 one [puppet] - 10https://gerrit.wikimedia.org/r/635496 [06:59:16] (03PS1) 10Giuseppe Lavagetto: admin: also remove the old ed25519 key for the time being [puppet] - 10https://gerrit.wikimedia.org/r/635497 [07:06:11] (03CR) 10Muehlenhoff: [C: 03+1] "Ship it :-)" [puppet] - 10https://gerrit.wikimedia.org/r/634566 (owner: 10Jbond) [07:06:18] (03CR) 10Marostegui: [C: 03+2] admin: remove old rsa key, add new ed25519 one [puppet] - 10https://gerrit.wikimedia.org/r/635496 (owner: 10Giuseppe Lavagetto) [07:07:11] (03CR) 10Muehlenhoff: [C: 03+2] acmechief: Also allow ldap-replica2003/2004 [puppet] - 10https://gerrit.wikimedia.org/r/634974 (https://phabricator.wikimedia.org/T264388) (owner: 10Muehlenhoff) [07:29:09] (03CR) 10Muehlenhoff: "That seems fine, but is it a complete replacement, like are there commands necessary beyond simply being able to log in?" [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [07:30:40] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for frwiki nominators - https://phabricator.wikimedia.org/T265835 (10Marostegui) a:05Marostegui→03Kvardek_du [07:48:52] (03CR) 10Ayounsi: [C: 03+1] Add Thanos Swift endpoints to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/635319 (https://phabricator.wikimedia.org/T246004) (owner: 10Elukey) [07:58:17] (03CR) 10Elukey: [C: 03+2] Add Thanos Swift endpoints to the analytics-in4 filter [homer/public] - 10https://gerrit.wikimedia.org/r/635319 (https://phabricator.wikimedia.org/T246004) (owner: 10Elukey) [07:58:47] update analytics-in4 filter on cr1/cr2-eqiad for https://gerrit.wikimedia.org/r/635319 [07:58:50] uff [07:58:55] !log update analytics-in4 filter on cr1/cr2-eqiad for https://gerrit.wikimedia.org/r/635319 [07:59:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:01:29] 10Operations, 
10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10hashar) Looks good to me now. thank you! [08:02:11] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:03:09] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:04:00] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Marostegui) Same here, I am not being rate limited anymore. [08:09:32] !log add Routinator 3000 0.8.0 to apt - T266001 [08:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:09:39] T266001: Upgrade Routinator 3000 to 0.8.0 - https://phabricator.wikimedia.org/T266001 [08:09:53] (03PS1) 10Muehlenhoff: acmechief: Add ldap-replica1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/635499 (https://phabricator.wikimedia.org/T264388) [08:10:51] !log Upgrade Routinator 3000 to 0.8.0 on rpki1001 - T266001 [08:10:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:11:04] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Joe) Regarding the apache httpd container, I am approaching layering as follows: - one base image, which uses the a... 
[08:11:08] (03CR) 10Filippo Giunchedi: "For easier comparison, an example host with the standard recipe is alert1001:" [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [08:13:10] 10Operations, 10netops: Upgrade Routinator 3000 to 0.8.0 - https://phabricator.wikimedia.org/T266001 (10ayounsi) 05Open→03Resolved All done. [08:18:15] (03PS1) 10Filippo Giunchedi: hieradata: add Swift account for wqds [puppet] - 10https://gerrit.wikimedia.org/r/635501 (https://phabricator.wikimedia.org/T246004) [08:20:59] (03CR) 10Ema: [C: 03+1] install_server: use standard partman recipe for nvme cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [08:21:30] PROBLEM - Check systemd state on ldap-replica2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [08:21:54] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 117 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:23:54] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 44 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [08:25:03] (03PS2) 10Filippo Giunchedi: install_server: use standard partman recipe for nvme cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) [08:25:44] (03PS1) 10Marostegui: site.pp: Add dborch1001 node [puppet] - 10https://gerrit.wikimedia.org/r/635502 (https://phabricator.wikimedia.org/T265982) [08:27:25] (03CR) 10Kormat: [C: 03+1] "> Patch Set 2:" [puppet] - 
10https://gerrit.wikimedia.org/r/634387 (https://phabricator.wikimedia.org/T256972) (owner: 10Dzahn) [08:30:24] (03CR) 10Kormat: [C: 03+1] site.pp: Add dborch1001 node [puppet] - 10https://gerrit.wikimedia.org/r/635502 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [08:30:59] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10JMeybohm) If I got this right you are purposing to put apache and php-fpm in the same container, correct (talking ab... [08:33:34] !log swift codfw-prod: bump object weight for ms-be2057 - T261633 [08:33:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:33:41] T261633: Put ms-be2057 (Dell R740xd2) in service - https://phabricator.wikimedia.org/T261633 [08:36:00] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10ema) >>! In T265324#6567095, @Joe wrote: > - one base image, which uses the apache2-bin debian package and just modi... 
[08:37:47] (03CR) 10Marostegui: [C: 03+2] site.pp: Add dborch1001 node [puppet] - 10https://gerrit.wikimedia.org/r/635502 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [08:38:38] !log root@cumin1001 START - Cookbook sre.ganeti.makevm [08:38:38] !log root@cumin1001 END (FAIL) - Cookbook sre.ganeti.makevm (exit_code=99) [08:38:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:53] !log root@cumin1001 START - Cookbook sre.ganeti.makevm [08:38:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:39:36] 10Operations, 10Wikimedia-Mailing-lists: Create a mailing list for frwiki nominators - https://phabricator.wikimedia.org/T265835 (10Kvardek_du) 05Open→03Resolved Thanks a lot @Marostegui! Everything is working perfectly and I added the list to the page. [08:43:35] (03PS2) 10Filippo Giunchedi: prometheus: add Pushgateway profile and module [puppet] - 10https://gerrit.wikimedia.org/r/635295 (https://phabricator.wikimedia.org/T249311) [08:43:37] (03PS2) 10Filippo Giunchedi: role: add Pushgateway to Prometheus ops [puppet] - 10https://gerrit.wikimedia.org/r/635296 (https://phabricator.wikimedia.org/T249311) [08:46:35] !log [urbanecm@mwmaint2001 ~/updateVarDumps/output/group2-medium/output]$ mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=apiportalwiki # T246539 [08:46:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:46:41] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [08:48:01] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Joe) >>! In T265324#6567154, @JMeybohm wrote: > If I got this right you are purposing to put apache and php-fpm in t... 
[08:49:06] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) Sorry for the late response, it was very late on our TZ. Apologies also for not using the template, I was not aware of it existence, at least I've never seen it used before. I kn... [08:50:06] !log mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log # wiki=cebwiki; T246539 [08:50:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:50:27] 10Operations, 10Puppet, 10cloud-services-team (Kanban): Using $facts['networking']['ip'] breaks puppet on cloud hosts - https://phabricator.wikimedia.org/T266075 (10jbond) tl;dr this is fixed by refreshing the pcc facts: ` lang=shell PUPPET_MASTER=toolsbeta-puppetmaster-04.toolsbeta.eqiad.wmflabs ./modules/p... [08:51:32] (03PS1) 10Marostegui: dns: Add dns entries for dborch1001 [dns] - 10https://gerrit.wikimedia.org/r/635504 (https://phabricator.wikimedia.org/T265982) [08:51:59] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=viwiki; T246539) [08:52:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:52:05] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [08:52:19] Daimona: fyi ^^ [08:52:39] (03PS2) 10Marostegui: dns: Add dns entries for dborch1001 [dns] - 10https://gerrit.wikimedia.org/r/635504 (https://phabricator.wikimedia.org/T265982) [08:53:54] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10dcausse) * timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00 * host: mw2252 * message: `[2b171d8b-48ec-480d-b7a4-187dd3af259c] /w/api.php?t... [08:54:50] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:54:58] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [08:55:07] (03PS3) 10Filippo Giunchedi: prometheus: add Pushgateway profile and module [puppet] - 10https://gerrit.wikimedia.org/r/635295 (https://phabricator.wikimedia.org/T249311) [08:55:09] (03PS3) 10Filippo Giunchedi: role: add Pushgateway to Prometheus ops [puppet] - 10https://gerrit.wikimedia.org/r/635296 (https://phabricator.wikimedia.org/T249311) [08:56:18] (03CR) 10Muehlenhoff: "We still have a number of jessie hosts; these support UsePrivilegeSeparation=sandbox, but don't enable it by default. I.e. merging the pat" [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: 10Jcrespo) [08:56:28] (03CR) 10Elukey: [C: 03+1] dns: Add dns entries for dborch1001 [dns] - 10https://gerrit.wikimedia.org/r/635504 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [08:56:54] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10JMeybohm) >>! In T265324#6567213, @Joe wrote: Oh, well. Sorry then. I guess I just misread the "one image configured... 
[08:57:05] (03CR) 10Marostegui: [C: 03+2] dns: Add dns entries for dborch1001 [dns] - 10https://gerrit.wikimedia.org/r/635504 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [08:58:28] (03CR) 10Filippo Giunchedi: "PCC full diff for prometheus1003: https://puppet-compiler.wmflabs.org/compiler1001/26031/prometheus1003.eqiad.wmnet/fulldiff.html" [puppet] - 10https://gerrit.wikimedia.org/r/635296 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [08:59:24] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Pablo-WMDE) > must be an instance of **V**ikibase\Search\ [...] Is this some sort of copying glitch? [09:00:40] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Joe) >>! In T265324#6567163, @ema wrote: >>>! In T265324#6567095, @Joe wrote: >> - one base image, which uses the ap... [09:02:59] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10dcausse) >>! In T245183#6567295, @Pablo-WMDE wrote: >> must be an instance of **V**ikibase\Search\ [...] > > Is this some sort of copying glit... 
[09:04:01] (03CR) 10Jcrespo: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: 10Jcrespo) [09:05:10] (03PS1) 10Elukey: Remove analytics1052 from Hadoop HDFS Journal nodes [puppet] - 10https://gerrit.wikimedia.org/r/635507 (https://phabricator.wikimedia.org/T255140) [09:05:56] 10Operations, 10DBA, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) [09:06:02] 10Operations, 10DBA, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) p:05Triage→03Medium [09:06:41] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Pablo-WMDE) >>! In T245183#6567307, @dcausse wrote: >>>! In T245183#6567295, @Pablo-WMDE wrote: >>> must be an instance of **V**ikibase\Search\... [09:09:18] (03CR) 10Jbond: cassandra: add data types, hiera->lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634363 (owner: 10Dzahn) [09:09:24] (03PS2) 10Vgutierrez: vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/635302 (https://phabricator.wikimedia.org/T258405) [09:10:44] Urbanecm: thank you! [09:13:55] (03CR) 10Abijeet Patro: [C: 03+1] Disable registrations stat on Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635237 (https://phabricator.wikimedia.org/T264158) (owner: 10Nikerabbit) [09:17:29] 10Operations, 10SRE-tools: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 (10jcrespo) [09:17:31] 10Operations, 10Data-Persistence-Backup, 10SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (10jcrespo) 05Open→03Declined I am going to decline this, not because it is a bad suggestion, but because the fix is not really a fix, as much as a "way to avo... 
[09:17:45] 10Operations, 10DBA, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) Adding `profile::idp::client::httpd`, and configuring orchestrator appropriately should work. [09:21:07] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime [09:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:22:17] !log root@cumin1001 END (PASS) - Cookbook sre.ganeti.makevm (exit_code=0) [09:22:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:07] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:23:28] 10Operations, 10DBA, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10Kormat) 11:21:49 kormat: if thats that case i would use the header X-CAS-CN (environment variable HTTP_X_CAS_CN) as the default CAS-User header suffers from the case insensetive issue that i... [09:25:56] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10aborrero) Ok, new proposal: 2020-10:-29 [09:26:17] (03CR) 10Elukey: "https://puppet-compiler.wmflabs.org/compiler1003/26032/" [puppet] - 10https://gerrit.wikimedia.org/r/635507 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [09:28:31] (03CR) 10Jbond: [C: 03+2] diffscan: switch to new refactored diffscan [puppet] - 10https://gerrit.wikimedia.org/r/634566 (owner: 10Jbond) [09:29:03] (03PS7) 10ArielGlenn: get revision info from stubs file and use to generate page range info [dumps] - 10https://gerrit.wikimedia.org/r/633567 (https://phabricator.wikimedia.org/T263319) [09:29:13] godog: fyi mergeing your priv repo change [09:29:35] jbond42: thank you! 
[09:29:50] still can't remember I have to do that too [09:30:31] !log End of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=viwiki; T246539) [09:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:30:37] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [09:31:10] 10Operations, 10netops, 10cloud-services-team (Kanban): Enable L3 routing on cloudsw nodes - https://phabricator.wikimedia.org/T265288 (10ayounsi) WFM, thanks! [09:33:31] (03PS1) 10Jbond: diffscan: fix type [puppet] - 10https://gerrit.wikimedia.org/r/635510 [09:34:11] (03CR) 10Jbond: [C: 03+2] diffscan: fix type [puppet] - 10https://gerrit.wikimedia.org/r/635510 (owner: 10Jbond) [09:37:32] !log mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log # wiki=warwiki; T246539 [09:37:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:37:39] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [09:38:17] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=shwiki; T246539) [09:38:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:39:42] (03PS1) 10Marostegui: install_server: Add dborch1001 to DHCP files [puppet] - 10https://gerrit.wikimedia.org/r/635512 (https://phabricator.wikimedia.org/T265982) [09:41:16] (03PS2) 10Marostegui: install_server: Add dborch1001 to DHCP files [puppet] - 10https://gerrit.wikimedia.org/r/635512 (https://phabricator.wikimedia.org/T265982) [09:42:17] !log End of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php 
--wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=shwiki; T246539) [09:42:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:40] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=nowiki; T246539) [09:42:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:42:45] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [09:42:58] (03CR) 10Marostegui: [C: 03+2] install_server: Add dborch1001 to DHCP files [puppet] - 10https://gerrit.wikimedia.org/r/635512 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [09:43:34] PROBLEM - SSH on ms-be2016 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:45:08] (03CR) 10Elukey: [C: 03+2] Remove analytics1052 from Hadoop HDFS Journal nodes [puppet] - 10https://gerrit.wikimedia.org/r/635507 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [09:45:31] (03CR) 10Ema: [C: 03+1] vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/635302 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [09:46:20] RECOVERY - SSH on ms-be2016 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [09:51:33] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: fix unset variable in interfaces template [puppet] - 10https://gerrit.wikimedia.org/r/635514 (https://phabricator.wikimedia.org/T261724) [09:53:52] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets 
[09:55:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [09:55:56] (03PS1) 10Marostegui: netboot.cfg: Add naming scheme for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/635515 (https://phabricator.wikimedia.org/T265982) [09:56:17] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: fix unset variable in interfaces template [puppet] - 10https://gerrit.wikimedia.org/r/635514 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [09:56:45] (03PS1) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [09:56:48] (03PS1) 10Jbond: service-auto-restart: clean up cron [puppet] - 10https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138) [09:57:36] (03CR) 10Vgutierrez: [C: 03+2] vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/635302 (https://phabricator.wikimedia.org/T258405) (owner: 10Vgutierrez) [09:57:42] (03PS2) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [09:57:44] (03CR) 10Muehlenhoff: netboot.cfg: Add naming scheme for dborch1001 (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635515 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [09:57:46] (03PS3) 10Vgutierrez: vcl: Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/635302 (https://phabricator.wikimedia.org/T258405) [09:58:11] (03CR) 10jerkins-bot: [V: 04-1] service-auto-restart: clean up cron [puppet] - 10https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138) 
(owner: 10Jbond) [09:58:12] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 52, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:58:12] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 75, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:58:49] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [09:59:11] (03PS2) 10Marostegui: netboot.cfg: Add naming scheme for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/635515 (https://phabricator.wikimedia.org/T265982) [09:59:17] !log Bump ECDHE-ECDSA-AES128-SHA pageview replacement to 100% - T258405 [09:59:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:59:23] T258405: Deprecate TLSv1.2 weak ciphersuites - https://phabricator.wikimedia.org/T258405 [09:59:52] RECOVERY - Router interfaces on cr2-eqord is OK: OK: host 208.80.154.198, interfaces up: 54, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [09:59:54] RECOVERY - Router interfaces on cr3-ulsfo is OK: OK: host 198.35.26.192, interfaces up: 77, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [10:00:23] (03PS3) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [10:00:29] !log End of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > 
$wiki.log` (wiki=nowiki; T246539) [10:00:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:00:35] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [10:00:36] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:01:28] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [10:01:33] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635515 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [10:01:46] (03CR) 10Marostegui: [C: 03+2] netboot.cfg: Add naming scheme for dborch1001 [puppet] - 10https://gerrit.wikimedia.org/r/635515 (https://phabricator.wikimedia.org/T265982) (owner: 10Marostegui) [10:01:46] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=srwiki; T246539) [10:01:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:02:20] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:06:14] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6564517, @gerritbot wrote: > Change 635298 **merged** by Ema: > [operations/puppet@production] varnish: fix websockets on... 
[10:10:42] (03PS1) 10Arturo Borrero Gonzalez: admin: aborrero: add .screenrc file to my home [puppet] - 10https://gerrit.wikimedia.org/r/635518 [10:11:44] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] admin: aborrero: add .screenrc file to my home [puppet] - 10https://gerrit.wikimedia.org/r/635518 (owner: 10Arturo Borrero Gonzalez) [10:13:58] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:14:24] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:15:36] PROBLEM - BGP status on cr3-eqsin is CRITICAL: BGP CRITICAL - No response from remote host 103.102.166.131 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BGP_status [10:17:54] PROBLEM - Check systemd state on ms-be2042 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [10:19:10] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [10:20:44] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: force installation of latest kernel [puppet] - 10https://gerrit.wikimedia.org/r/635519 (https://phabricator.wikimedia.org/T261724) [10:21:46] 10Operations, 10DBA, 10CAS-SSO, 10User-Kormat: orchestrator: Support SSO - https://phabricator.wikimedia.org/T266106 (10MoritzMuehlenhoff) [10:22:41] (03PS1) 10Muehlenhoff: Add IDP service definition for orchestrator.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/635520 (https://phabricator.wikimedia.org/T266106) [10:25:01] (03CR) 10Kormat: Add IDP service definition for orchestrator.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635520 (https://phabricator.wikimedia.org/T266106) (owner: 10Muehlenhoff) [10:26:05] (03PS2) 10Muehlenhoff: Add IDP service definition for orchestrator.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/635520 (https://phabricator.wikimedia.org/T266106) [10:26:10] (03CR) 10Muehlenhoff: Add IDP service definition for orchestrator.wikimedia.org (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635520 (https://phabricator.wikimedia.org/T266106) (owner: 10Muehlenhoff) [10:29:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: force installation of latest kernel [puppet] - 10https://gerrit.wikimedia.org/r/635519 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [10:29:44] (03CR) 10Faidon Liambotis: "I find the IP->hostname mapping quite useful, so I wouldn't want to drop it entirely. 
I do agree that perhaps in the future it belongs int" [puppet] - 10https://gerrit.wikimedia.org/r/634946 (https://phabricator.wikimedia.org/T254332) (owner: 10Faidon Liambotis) [10:32:28] 10Operations, 10vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (10Marostegui) [10:37:40] !log End of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=srwiki; T246539) [10:37:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:37:47] T246539: Dry-run, then actually run updateVarDumps - https://phabricator.wikimedia.org/T246539 [10:38:22] PROBLEM - Long running screen/tmux on puppetmaster1001 is CRITICAL: CRIT: Long running tmux process. (user: jayme PID: 16826, 1735667s 1728000s). https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [10:38:26] !log Start of `mwscript extensions/AbuseFilter/maintenance/updateVarDumps.php --wiki=$wiki --print-orphaned-records-to=/tmp/urbanecm/$wiki-orphaned.log --progress-markers > $wiki.log` (wiki=rowiki; T246539) [10:38:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:24] (03PS1) 10Elukey: Decommission analytics1052 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/635521 (https://phabricator.wikimedia.org/T255140) [10:39:46] 10Operations, 10Traffic, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10ema) Instead of using hfp vs hfm, I think we might want to distinguish between requests that definitely cannot be cached at the ats-be layer (eg: those with `req.htt... 
[10:40:54] (03CR) 10Elukey: [C: 03+2] Decommission analytics1052 from the Hadoop cluster [puppet] - 10https://gerrit.wikimedia.org/r/635521 (https://phabricator.wikimedia.org/T255140) (owner: 10Elukey) [10:44:10] (03PS1) 10Kormat: admin: Update kormat's my() bash function. [puppet] - 10https://gerrit.wikimedia.org/r/635523 [10:45:21] (03PS1) 10Elukey: Fix analytics1056 role in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/635524 [10:45:50] (03CR) 10Elukey: [C: 03+2] Fix analytics1056 role in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/635524 (owner: 10Elukey) [10:46:23] 10Operations, 10serviceops, 10vm-requests, 10Patch-For-Review: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Marostegui) 05Open→03Resolved a:03Marostegui Finally this VM is up and running. ` [10:43:41] marostegui@dborch1001:~$ uptime 10:43:4... [10:46:26] 10Operations, 10DBA, 10User-Kormat: orchestrator: Puppetize - https://phabricator.wikimedia.org/T265990 (10Marostegui) [10:47:27] (03CR) 10Marostegui: [C: 03+1] admin: Update kormat's my() bash function. [puppet] - 10https://gerrit.wikimedia.org/r/635523 (owner: 10Kormat) [10:50:41] (03CR) 10Klausman: admin: Update kormat's my() bash function. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635523 (owner: 10Kormat) [10:52:03] (03CR) 10Kormat: admin: Update kormat's my() bash function. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635523 (owner: 10Kormat) [10:54:20] 10Operations, 10Traffic, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10BBlack) That is what I was thinking too, but I'm not sure if the VCL state diagram allows us to see that at the right point in time to make the decision or not. We'... 
[10:56:12] !log aborrero@cumin2001 START - Cookbook sre.hosts.downtime [10:56:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:12] !log aborrero@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [10:58:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:58:47] (03CR) 10Klausman: admin: Update kormat's my() bash function. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635523 (owner: 10Kormat) [10:59:26] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10Joe) >>! In T245183#6567261, @dcausse wrote: > * timestamp: 2020-10-20T18:10:00 to 2020-10-20T21:15:00 > * host: mw2252 Please note: this was... [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: #bothumor I � Unicode. All rise for European mid-day backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1100). [11:00:04] kart_, matthiasmullie, and abijeet: A patch you scheduled for European mid-day backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:16] I can deploy today [11:00:39] (03CR) 10Klausman: [C: 03+1] admin: Update kormat's my() bash function. (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/635523 (owner: 10Kormat) [11:00:56] !log Upgrade db2093's mariadb version T266003 [11:01:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:02] T266003: orchestrator: Select backend database solution - https://phabricator.wikimedia.org/T266003 [11:01:06] (03CR) 10Kormat: [C: 03+2] admin: Update kormat's my() bash function. [puppet] - 10https://gerrit.wikimedia.org/r/635523 (owner: 10Kormat) [11:01:22] Urbanecm: I'm here :) [11:01:28] thanks [11:01:50] abijeet: are you here? 
[11:02:00] abijeet is also here and I can also test his change if needed. [11:02:04] (03CR) 10Urbanecm: [C: 03+2] Enable ContentTranslation in 5 Wikipedias as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635294 (https://phabricator.wikimedia.org/T264737) (owner: 10KartikMistry) [11:02:09] good, thanks [11:02:44] RECOVERY - Check systemd state on ms-be2042 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:02:54] (03Merged) 10jenkins-bot: Enable ContentTranslation in 5 Wikipedias as a default tool [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635294 (https://phabricator.wikimedia.org/T264737) (owner: 10KartikMistry) [11:03:00] I'm around now [11:03:11] thanks [11:03:23] kart_: your change is at mwdebug2001 [11:03:47] (03CR) 10Urbanecm: [C: 03+2] Disable registrations stat on Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635237 (https://phabricator.wikimedia.org/T264158) (owner: 10Nikerabbit) [11:03:48] Urbanecm: testing.. [11:04:50] (03Merged) 10jenkins-bot: Disable registrations stat on Special:TranslationStats [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635237 (https://phabricator.wikimedia.org/T264158) (owner: 10Nikerabbit) [11:07:18] Urbanecm: looks good. Please deploy. 
[11:07:35] thanks [11:07:38] doing [11:10:31] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 11567427c3f7d2908b29046ee56a7b0c0da32c09: Enable ContentTranslation in 5 Wikipedias as a default tool (T264737; T264738; T264739; T264740; T264741) (duration: 01m 30s) [11:10:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:10:43] T264737: Enable Content Translation in Esperanto Wikipedia as a default tool - https://phabricator.wikimedia.org/T264737 [11:10:43] T264741: Enable Content Translation in Oriya Wikipedia as a default tool - https://phabricator.wikimedia.org/T264741 [11:10:44] T264739: Enable Content Translation in Irish Wikipedia as a default tool - https://phabricator.wikimedia.org/T264739 [11:10:44] T264738: Enable Content Translation in Belarusian Wikipedia as a default tool - https://phabricator.wikimedia.org/T264738 [11:10:44] T264740: Enable Content Translation in Somali Wikipedia as a default tool - https://phabricator.wikimedia.org/T264740 [11:11:03] kart_: done [11:11:22] kart_: abijeet: your patch is available at mwdebug2001, can you test, please? [11:11:42] Urbanecm: thanks! [11:11:45] no problem [11:13:06] Urbanecm, checking [11:13:08] thanks [11:15:18] Urbanecm, looks good, please proceed with deployment. [11:15:21] doing [11:17:39] !log urbanecm@deploy1001 Synchronized wmf-config/CommonSettings.php: 785404fa2b998947d236aebe481ee1abcbd14220: Disable registrations stat on Special:TranslationStats (T264158) (duration: 01m 05s) [11:17:42] abijeet: done [11:17:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:45] T264158: Stat type registrations is too slow for Wikimedia production - https://phabricator.wikimedia.org/T264158 [11:18:04] Urbanecm, thank you. 
[11:18:07] no problem [11:18:23] I can't find Matthias at IRC, so I'm skipping their patch [11:18:32] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10jijiki) >>! In T265324#6567095, @Joe wrote: > Regarding the apache httpd container, I am approaching layering as fol... [11:20:54] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [11:21:36] o/ [11:21:40] matthiasmullie: hello! [11:21:48] should I deploy your patch, or will you? [11:21:51] (I'm done with the rest) [11:21:55] Sorry I'm late for backports & config - had some trouble connecting [11:22:08] I can take care of it myself [11:22:12] cool [11:22:24] Ok if I start right away? [11:22:29] matthiasmullie: yes, go ahead please [11:22:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_citoid_cluster_codfw site=codfw https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:22:32] Ok thanks [11:22:38] Apologies again for joining late :) [11:22:50] (03PS6) 10Matthias Mullie: [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) [11:23:32] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [11:23:53] (03CR) 10Matthias Mullie: [C: 03+2] [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [11:24:37] (03Merged) 10jenkins-bot: [WikibaseMediaInfo] Add config for related terms API [mediawiki-config] - 10https://gerrit.wikimedia.org/r/630896 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [11:24:50] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:26:43] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Joe) >>! In T265324#6567694, @jijiki wrote: >>>! In T265324#6567095, @Joe wrote: >> Regarding the apache httpd conta... [11:29:16] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:31:38] PROBLEM - Check systemd state on ms-be2029 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:32:21] ^ checking that [11:33:20] !log mlitn@deploy1001 Synchronized wmf-config/InitialiseSettings.php: [WikibaseMediaInfo] Add config for related terms API (duration: 01m 04s) [11:33:23] Ah, another case of T199911 [11:33:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:26] T199911: Systemd session creation fails under I/O load - https://phabricator.wikimedia.org/T199911 [11:33:42] (03PS1) 10Muehlenhoff: Add ldap-replica2003/2004 to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/635528 (https://phabricator.wikimedia.org/T264388) [11:33:49] (03PS2) 10KartikMistry: WIP: Remove wgContentTranslationRESTBase config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634956 [11:34:08] RECOVERY - Check systemd state on ms-be2029 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:37:43] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10RhinosF1) [11:40:11] (03CR) 10Alexandros Kosiaris: [C: 03+1] admin: deploy eventrouter to all clusters [deployment-charts] - 10https://gerrit.wikimedia.org/r/635259 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [11:40:37] !log EU B&C done [11:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:41:06] PROBLEM - Check systemd state on ms-be2052 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:45:47] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10jijiki) >>! In T265324#6567723, @Joe wrote: >>>! 
In T265324#6567694, @jijiki wrote: >> How are we planning to solve... [11:46:47] (03CR) 10Alexandros Kosiaris: [C: 04-1] Initial commit of eventrouter docker image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634985 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [11:47:16] (03PS1) 10BBlack: partman: document cacheproxy exceptions [puppet] - 10https://gerrit.wikimedia.org/r/635530 (https://phabricator.wikimedia.org/T156955) [11:48:29] (03CR) 10BBlack: [C: 04-1] "Per IRC discussion, let's not standardize this one. Maybe Ibe7540b90bfa09195cfcc58fa41f3723216df9e4 instead?" [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [11:49:31] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Joe) >>! In T265324#6567761, @jijiki wrote: > Overall, I think we may need to take one step back and consider if an... [11:52:32] PROBLEM - Check systemd state on ms-be2016 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:52:43] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Joe) Also I want to clarify: we can reduce the pain as much as possible, but for the duration of the transition phas... 
[11:54:12] RECOVERY - Check systemd state on ms-be2016 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:02] RECOVERY - Check systemd state on ms-be2052 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [11:55:34] (03PS4) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [11:55:36] (03PS1) 10Jbond: etcd: use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635531 [11:56:57] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [11:57:00] (03CR) 10jerkins-bot: [V: 04-1] etcd: use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635531 (owner: 10Jbond) [11:59:23] 10Operations, 10User-MoritzMuehlenhoff: Revisit use of swap and related kernel settings - https://phabricator.wikimedia.org/T266118 (10MoritzMuehlenhoff) [11:59:46] 10Operations, 10User-MoritzMuehlenhoff: Revisit use of swap and related kernel settings - https://phabricator.wikimedia.org/T266118 (10Marostegui) p:05Triage→03Medium [12:02:07] (03PS2) 10Jbond: etcd: use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635531 [12:03:43] (03PS3) 10Jbond: etcd: use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635531 [12:05:47] 10Operations, 10User-MoritzMuehlenhoff: Revisit use of swap and related kernel settings - https://phabricator.wikimedia.org/T266118 (10BBlack) Recording from IRC for posterity: ` 11:07 < bblack> so I was checking out https://gerrit.wikimedia.org/r/c/operations/puppet/+/633704 (which is one of the partman clean... 
[12:07:16] (03CR) 10Jbond: [C: 03+2] etcd: use shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635531 (owner: 10Jbond) [12:14:02] 10Operations, 10DBA, 10User-Kormat, 10User-jbond: mariadb::config: parameterize event_scheduler - https://phabricator.wikimedia.org/T266119 (10Kormat) [12:14:19] 10Operations, 10DBA, 10User-Kormat: mariadb::config: parameterize event_scheduler - https://phabricator.wikimedia.org/T266119 (10Kormat) p:05Triage→03Medium [12:20:21] (03CR) 10Muehlenhoff: "I think we can simply hold back the patch a few more months until jessie is fully gone, I'll merge the patch once we're ready for it, ok?" [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: 10Jcrespo) [12:21:07] (03CR) 10Jcrespo: "+1 to me" [puppet] - 10https://gerrit.wikimedia.org/r/635288 (https://phabricator.wikimedia.org/T170298) (owner: 10Jcrespo) [12:21:16] (03Abandoned) 10Filippo Giunchedi: install_server: use standard partman recipe for nvme cp hosts [puppet] - 10https://gerrit.wikimedia.org/r/633704 (https://phabricator.wikimedia.org/T156955) (owner: 10Filippo Giunchedi) [12:21:30] (03CR) 10Filippo Giunchedi: [C: 03+1] partman: document cacheproxy exceptions [puppet] - 10https://gerrit.wikimedia.org/r/635530 (https://phabricator.wikimedia.org/T156955) (owner: 10BBlack) [12:24:10] (03PS1) 10Kormat: WIP mariadb: Make innodb pool size configurable. [puppet] - 10https://gerrit.wikimedia.org/r/635533 [12:25:34] PROBLEM - Check systemd state on ms-be2051 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:25:35] (03CR) 10Kormat: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/635533 (owner: 10Kormat) [12:25:49] (03PS1) 10Marostegui: global_tables: Empty table schemas for global tables [software/tendril] - 10https://gerrit.wikimedia.org/r/635535 [12:27:04] (03CR) 10Jcrespo: [C: 03+1] global_tables: Empty table schemas for global tables [software/tendril] - 10https://gerrit.wikimedia.org/r/635535 (owner: 10Marostegui) [12:27:08] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:27:17] (03CR) 10Marostegui: [V: 03+2 C: 03+2] global_tables: Empty table schemas for global tables [software/tendril] - 10https://gerrit.wikimedia.org/r/635535 (owner: 10Marostegui) [12:28:48] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [12:32:22] (03PS2) 10Kormat: WIP mariadb: Make innodb pool size configurable. [puppet] - 10https://gerrit.wikimedia.org/r/635533 [12:36:00] (03PS3) 10Kormat: WIP mariadb: Make innodb pool size configurable. [puppet] - 10https://gerrit.wikimedia.org/r/635533 [12:37:27] 10Operations, 10Data-Persistence-Backup, 10SRE-tools: Add toil::systemd_scope_cleanup to dbprov hosts - https://phabricator.wikimedia.org/T265323 (10jcrespo) I did a manual `systemctl reset-failed` once. 
[12:37:56] RECOVERY - Check systemd state on dbprov1003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [12:38:13] 10Operations, 10Traffic, 10Patch-For-Review: Large text objects are randomized to cache backends - https://phabricator.wikimedia.org/T266040 (10ema) >>! In T266040#6567585, @BBlack wrote: > 1. User requests /foo/bar -> frontend cache cp1234 miss -> chash to cp9999 > 2. Response from cp9999 indicates CL:500KB... [12:39:31] (03PS4) 10Kormat: WIP mariadb: Make innodb pool size configurable. [puppet] - 10https://gerrit.wikimedia.org/r/635533 [12:40:42] (03CR) 10Muehlenhoff: [C: 03+2] Add ldap-replica2003/2004 to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/635528 (https://phabricator.wikimedia.org/T264388) (owner: 10Muehlenhoff) [12:40:58] (03PS2) 10Muehlenhoff: Add ldap-replica2003/2004 to conftool-data [puppet] - 10https://gerrit.wikimedia.org/r/635528 (https://phabricator.wikimedia.org/T264388) [12:41:42] (03PS5) 10Kormat: mariadb: Make innodb pool size configurable. 
[puppet] - 10https://gerrit.wikimedia.org/r/635533 [12:42:04] (03CR) 10Kormat: "PCC: https://puppet-compiler.wmflabs.org/compiler1003/26039/" [puppet] - 10https://gerrit.wikimedia.org/r/635533 (owner: 10Kormat) [12:45:35] (03PS1) 10Filippo Giunchedi: wmnet: record for prometheus-pushgateway [dns] - 10https://gerrit.wikimedia.org/r/635536 (https://phabricator.wikimedia.org/T249311) [12:46:46] (03CR) 10Ayounsi: "> Patch Set 3:" [puppet] - 10https://gerrit.wikimedia.org/r/634946 (https://phabricator.wikimedia.org/T254332) (owner: 10Faidon Liambotis) [12:53:42] (03CR) 10Vgutierrez: [C: 03+1] acmechief: Add ldap-replica1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/635499 (https://phabricator.wikimedia.org/T264388) (owner: 10Muehlenhoff) [12:56:32] (03CR) 10Muehlenhoff: [C: 03+2] acmechief: Add ldap-replica1001/1002 [puppet] - 10https://gerrit.wikimedia.org/r/635499 (https://phabricator.wikimedia.org/T264388) (owner: 10Muehlenhoff) [12:58:01] (03PS5) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [12:58:03] (03PS1) 10Jbond: nrpe: update to use shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/635537 [12:59:17] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:00:04] longma and liw: It is that lovely time of the day again! You are hereby commanded to deploy Mediawiki train - American+European Version (secondary timeslot). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1300). 
[13:00:21] 10Operations, 10Release-Engineering-Team-TODO, 10serviceops, 10Continuous-Integration-Config, and 2 others: Add pcov PHP extension to wikimedia apt so it can be used in Wikimedia CI - https://phabricator.wikimedia.org/T243847 (10Daimona) Memo (for myself or whoever is interested): xdebug 3 should supposedl... [13:00:26] (03PS1) 10Lars Wirzenius: group1 wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635538 [13:00:28] (03CR) 10Lars Wirzenius: [C: 03+2] group1 wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635538 (owner: 10Lars Wirzenius) [13:01:09] (03Merged) 10jenkins-bot: group1 wikis to 1.36.0-wmf.14 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635538 (owner: 10Lars Wirzenius) [13:02:34] (03CR) 10Jbond: [C: 03+2] nrpe: update to use shared spec helper [puppet] - 10https://gerrit.wikimedia.org/r/635537 (owner: 10Jbond) [13:03:05] !log liw@deploy1001 rebuilt and synchronized wikiversions files: group1 wikis to 1.36.0-wmf.14 [13:03:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:22] (03PS6) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [13:04:10] !log liw@deploy1001 Synchronized php: group1 wikis to 1.36.0-wmf.14 (duration: 01m 04s) [13:04:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:04:29] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:05:22] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 630 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:10:24] 
RECOVERY - Check systemd state on ms-be2051 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [13:11:20] (03PS7) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [13:11:22] (03PS1) 10Jbond: install_server: switch to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635539 [13:12:11] (03CR) 10Jbond: [C: 03+2] install_server: switch to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635539 (owner: 10Jbond) [13:12:28] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:18:20] (03PS8) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [13:18:22] (03PS1) 10Jbond: service: switch to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635540 [13:19:02] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) I found that there's a significant difference between the [[https://grafana.wikimedia.org/d/EiAVq3FGz/t264398?viewPanel=13&orgId=1&from=... 
[13:19:06] (03CR) 10Jbond: [C: 03+2] service: switch to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635540 (owner: 10Jbond) [13:19:32] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:19:42] PROBLEM - SSH on ms-be2025 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:21:12] !log pooling ldap-replica2003 T264388 [13:21:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:19] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 [13:25:35] (03CR) 10Marostegui: [C: 03+1] "Thank you for this!" [puppet] - 10https://gerrit.wikimedia.org/r/635533 (owner: 10Kormat) [13:26:10] RECOVERY - SSH on ms-be2025 is OK: SSH OK - OpenSSH_7.4p1 Debian-10+deb9u7 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [13:26:48] (03CR) 10Kormat: [C: 03+2] mariadb: Make innodb pool size configurable. [puppet] - 10https://gerrit.wikimedia.org/r/635533 (owner: 10Kormat) [13:27:19] 10Operations, 10serviceops, 10vm-requests: eqiad: New ganeti instance for orchestrator installation - https://phabricator.wikimedia.org/T265982 (10Nintendofan885) [13:31:53] (03CR) 10Elukey: "Is there agreement on proceeding or not? 
I am ok with the change, let me know :)" [puppet] - 10https://gerrit.wikimedia.org/r/634946 (https://phabricator.wikimedia.org/T254332) (owner: 10Faidon Liambotis) [13:37:48] (03PS6) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [13:38:47] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:39:01] (03PS1) 10Muehlenhoff: Install ldap-replica100[12] as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/635542 (https://phabricator.wikimedia.org/T264388) [13:42:01] (03PS9) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [13:42:03] (03PS1) 10Jbond: systemd: switch to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635543 [13:42:48] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10Arrbee) This is an approved request for Stephane. Thanks. 
[13:43:21] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:43:28] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 5 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:44:20] (03PS7) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [13:45:20] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:46:44] (03PS8) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [13:46:45] (03CR) 10Jbond: [C: 03+2] systemd: switch to shared spec_helper [puppet] - 10https://gerrit.wikimedia.org/r/635543 (owner: 10Jbond) [13:47:45] (03CR) 10jerkins-bot: [V: 04-1] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:49:50] (03PS9) 10Ottomata: camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) [13:53:36] (03CR) 10Ottomata: [C: 03+2] camus::job - replace check_whitelist_topics with check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/634266 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [13:55:13] (03PS10) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron 
[puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [13:55:15] (03PS1) 10Jbond: bacula: add profile::base dependency [puppet] - 10https://gerrit.wikimedia.org/r/635544 [13:55:57] (03CR) 10jerkins-bot: [V: 04-1] bacula: add profile::base dependency [puppet] - 10https://gerrit.wikimedia.org/r/635544 (owner: 10Jbond) [13:56:26] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [13:57:50] (03PS2) 10Muehlenhoff: Install ldap-replica100[12] as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/635542 (https://phabricator.wikimedia.org/T264388) [13:58:28] (03PS11) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [13:58:35] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10CDanis) [13:59:35] (03CR) 10jerkins-bot: [V: 04-1] service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [14:00:27] (03PS2) 10Jbond: bacula: add profile::base dependency [puppet] - 10https://gerrit.wikimedia.org/r/635544 [14:00:52] (03CR) 10jerkins-bot: [V: 04-1] bacula: add profile::base dependency [puppet] - 10https://gerrit.wikimedia.org/r/635544 (owner: 10Jbond) [14:02:25] (03PS3) 10Jbond: bacula: add profile::base dependency [puppet] - 10https://gerrit.wikimedia.org/r/635544 [14:03:27] (03CR) 10Jbond: [C: 03+2] bacula: add profile::base dependency [puppet] - 10https://gerrit.wikimedia.org/r/635544 (owner: 10Jbond) [14:04:59] (03PS1) 10Ottomata: camus::job - don't fail if dynamic_stream_configs is set without kafka.whitelist.topics [puppet] - 10https://gerrit.wikimedia.org/r/635545 (https://phabricator.wikimedia.org/T251609) [14:05:23] (03PS2) 10Ottomata: camus::job - don't fail if dynamic_stream_configs is set without kafka.whitelist.topics [puppet] - 10https://gerrit.wikimedia.org/r/635545 (https://phabricator.wikimedia.org/T251609) [14:05:28] (03CR) 10jerkins-bot: [V: 04-1] camus::job - don't fail if dynamic_stream_configs is set without kafka.whitelist.topics [puppet] - 10https://gerrit.wikimedia.org/r/635545 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:05:46] (03CR) 10jerkins-bot: [V: 04-1] camus::job - don't fail if dynamic_stream_configs is set without kafka.whitelist.topics [puppet] - 10https://gerrit.wikimedia.org/r/635545 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:06:12] (03CR) 10Muehlenhoff: [C: 03+2] Install ldap-replica100[12] as additional LDAP replicas [puppet] - 10https://gerrit.wikimedia.org/r/635542 (https://phabricator.wikimedia.org/T264388) (owner: 10Muehlenhoff) [14:10:19] (03PS3) 10Ottomata: camus::job - don't fail if 
dynamic_stream_configs is set [puppet] - 10https://gerrit.wikimedia.org/r/635545 (https://phabricator.wikimedia.org/T251609) [14:10:28] (03PS4) 10Ottomata: camus::job - don't fail if dynamic_stream_configs is set [puppet] - 10https://gerrit.wikimedia.org/r/635545 (https://phabricator.wikimedia.org/T251609) [14:12:33] (03CR) 10Ottomata: [C: 03+2] camus::job - don't fail if dynamic_stream_configs is set [puppet] - 10https://gerrit.wikimedia.org/r/635545 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:17:36] (03PS12) 10Jbond: service_auto_restart: update to use systemd::timer:job instead of cron [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) [14:17:38] (03PS1) 10Jbond: profile: fix spec test [puppet] - 10https://gerrit.wikimedia.org/r/635548 [14:19:19] (03CR) 10Jbond: [C: 03+2] profile: fix spec test [puppet] - 10https://gerrit.wikimedia.org/r/635548 (owner: 10Jbond) [14:19:59] (03PS2) 10Jbond: service-auto-restart: clean up cron [puppet] - 10https://gerrit.wikimedia.org/r/635517 (https://phabricator.wikimedia.org/T265138) [14:20:26] (03CR) 10Jbond: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [14:22:01] (03PS1) 10Ottomata: camus::job - eg-analytics-external: add missing check_java_opts for kafka.whitelist.topics [puppet] - 10https://gerrit.wikimedia.org/r/635549 (https://phabricator.wikimedia.org/T251609) [14:22:37] (03PS2) 10Ottomata: camus::job - eg-analytics-external: add missing check_java_opts for kafka.whitelist.topics [puppet] - 10https://gerrit.wikimedia.org/r/635549 (https://phabricator.wikimedia.org/T251609) [14:23:39] (03CR) 10jerkins-bot: [V: 04-1] camus::job - eg-analytics-external: add missing check_java_opts for kafka.whitelist.topics [puppet] - 10https://gerrit.wikimedia.org/r/635549 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:24:16] (03PS3) 10Ottomata: camus::job - 
eg-analytics-external: add missing check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/635549 (https://phabricator.wikimedia.org/T251609) [14:24:48] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/26040/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/635549 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:27:38] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:28:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [14:30:17] 10Operations, 10Analytics, 10puppet-compiler, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Puppet CI idea: Add a PCC-Nodes tag to commit message to launch PCC job on new patch in gerrit - https://phabricator.wikimedia.org/T266139 (10Ottomata) [14:30:55] (03CR) 10Ottomata: [C: 03+2] camus::job - eg-analytics-external: add missing check_java_opts [puppet] - 10https://gerrit.wikimedia.org/r/635549 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:34:30] !log restarting blazegraph on codfw servers (T263952) [14:34:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:34:38] T263952: mwapi calls rarely return results - https://phabricator.wikimedia.org/T263952 [14:35:14] RECOVERY - Check systemd state on ldap-replica2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:38:46] RECOVERY - Long running screen/tmux on puppetmaster1001 is OK: OK: No SCREEN or tmux processes detected. 
https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [14:39:20] PROBLEM - Check systemd state on ldap-replica2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:41:32] ^ fixing [14:41:53] jouncebot: now [14:41:53] For the next 0 hour(s) and 18 minute(s): Mediawiki train - American+European Version (secondary timeslot) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1300) [14:41:55] jouncebot: next [14:41:56] In 3 hour(s) and 18 minute(s): Train log triage with CPT (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1800) [14:41:56] In 3 hour(s) and 18 minute(s): Morning backport window (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1800) [14:42:22] (03PS1) 10Jbond: systemd::spec Add a note about using the systemd-analyze hack on mac [puppet] - 10https://gerrit.wikimedia.org/r/635552 [14:42:27] (03PS2) 10Reedy: wikitech.php: Set CURLOPT_RETURNTRANSFER true in gerrit handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634663 (https://phabricator.wikimedia.org/T242554) [14:42:32] (03CR) 10Reedy: [C: 03+2] wikitech.php: Set CURLOPT_RETURNTRANSFER true in gerrit handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634663 (https://phabricator.wikimedia.org/T242554) (owner: 10Reedy) [14:43:18] (03Merged) 10jenkins-bot: wikitech.php: Set CURLOPT_RETURNTRANSFER true in gerrit handler [mediawiki-config] - 10https://gerrit.wikimedia.org/r/634663 (https://phabricator.wikimedia.org/T242554) (owner: 10Reedy) [14:43:53] (03CR) 10Jbond: "ready for review" [puppet] - 10https://gerrit.wikimedia.org/r/635516 (https://phabricator.wikimedia.org/T265138) (owner: 10Jbond) [14:44:25] (03CR) 10Jbond: [C: 03+2] systemd::spec Add a note about using the systemd-analyze hack on mac [puppet] - 10https://gerrit.wikimedia.org/r/635552 (owner: 10Jbond) [14:44:44] 10Operations, 
10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10jijiki) >>! In T265324#6567779, @Joe wrote: > Also I want to clarify: we can reduce the pain as much as possible, bu... [14:44:55] !log reedy@deploy1001 Synchronized wmf-config/wikitech.php: Set CURLOPT_RETURNTRANSFER true in gerrit handler T242554 (duration: 01m 07s) [14:45:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:19] (03PS1) 10Ottomata: camus::job - use double quotes around java opts value [puppet] - 10https://gerrit.wikimedia.org/r/635553 (https://phabricator.wikimedia.org/T251609) [14:49:46] PROBLEM - Check systemd state on ldap-replica1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:50:04] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10dancy) Noting for the record in all cases where there's a bogus changed character in a string, the bad character is always one less than what i... [14:50:59] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) 
- https://phabricator.wikimedia.org/T245183 (10dancy) [14:51:29] (03CR) 10Ottomata: [C: 03+2] camus::job - use double quotes around java opts value [puppet] - 10https://gerrit.wikimedia.org/r/635553 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [14:51:32] (03PS1) 10Arturo Borrero Gonzalez: cloudgw: install latest nftables package [puppet] - 10https://gerrit.wikimedia.org/r/635555 (https://phabricator.wikimedia.org/T261724) [14:51:44] 10Operations, 10LDAP: Port prometheus-openldap-exporter to Python 3 - https://phabricator.wikimedia.org/T266147 (10MoritzMuehlenhoff) [14:51:52] PROBLEM - Check systemd state on ldap-replica1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:15] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) ` papaul@cr2-eqdfw> show interfaces xe-0/1/2 descriptions Interface Admin Link Description xe-0/1/2 up up Reserved for Facebook PNI - no-mon [14:52:28] RECOVERY - Check systemd state on ldap-replica2004 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [14:52:37] (03PS1) 10Ppchelko: Api-Gateway: Correctly log client-id in access logs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/635556 [14:53:28] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudgw: install latest nftables package [puppet] - 10https://gerrit.wikimedia.org/r/635555 (https://phabricator.wikimedia.org/T261724) (owner: 10Arturo Borrero Gonzalez) [14:54:18] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) ` Physical interface: xe-0/1/2 Laser bias current : 37.980 mA Laser output power : 0.3080 mW / -5.11 dBm Module temperatur... 
[14:55:44] (03CR) 10jerkins-bot: [V: 04-1] Api-Gateway: Correctly log client-id in access logs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/635556 (owner: 10Ppchelko) [14:55:57] (03PS1) 10Muehlenhoff: Adding missing dependency on python-yaml [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/635557 [14:56:27] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) [14:56:37] !log crusnov@cumin1001 START - Cookbook sre.dns.netbox [14:56:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:48] (03CR) 10Muehlenhoff: [C: 03+2] Adding missing dependency on python-yaml [debs/prometheus-openldap-exporter] - 10https://gerrit.wikimedia.org/r/635557 (owner: 10Muehlenhoff) [14:58:17] (03CR) 10Ppchelko: "recheck" [deployment-charts] - 10https://gerrit.wikimedia.org/r/635556 (owner: 10Ppchelko) [15:00:03] !log otto@deploy1001 Started deploy [analytics/refinery@e4d16f0] (hadoop-test): deploying with updated camus to test cluster [15:00:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:08] 10Operations, 10ops-codfw, 10netops: patch in FB peering into cr2-eqdfw - https://phabricator.wikimedia.org/T265412 (10Papaul) 05Open→03Resolved complete [15:00:16] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [15:00:49] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff This is complete [15:01:19] (03CR) 10Ppchelko: [C: 03+2] Api-Gateway: Correctly log client-id in access logs. 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/635556 (owner: 10Ppchelko) [15:01:45] !log crusnov@cumin1001 END (PASS) - Cookbook sre.dns.netbox (exit_code=0) [15:01:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:02:59] !log otto@deploy1001 Finished deploy [analytics/refinery@e4d16f0] (hadoop-test): deploying with updated camus to test cluster (duration: 02m 56s) [15:03:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:03:13] 10Operations, 10LDAP, 10Python3-Porting: Port prometheus-openldap-exporter to Python 3 - https://phabricator.wikimedia.org/T266147 (10Peachey88) [15:04:46] (03Merged) 10jenkins-bot: Api-Gateway: Correctly log client-id in access logs. [deployment-charts] - 10https://gerrit.wikimedia.org/r/635556 (owner: 10Ppchelko) [15:05:20] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/635520 (https://phabricator.wikimedia.org/T266106) (owner: 10Muehlenhoff) [15:07:58] 10Operations, 10serviceops, 10Wikimedia-production-error: PHP7 corruption reports in 2020 (Call on wrong object, etc.) - https://phabricator.wikimedia.org/T245183 (10LarsWirzenius) [15:07:59] !log imported prometheus-openldap-exporter 0+git20171128-3 to buster-wikimedia T264388 [15:08:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:05] T264388: Migrate LDAP replicas to Buster - https://phabricator.wikimedia.org/T264388 [15:10:24] 10Operations, 10Scap, 10serviceops, 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (10LarsWirzenius) While this won't block me starting the release process of the next Scap release, I would like to get thi... 
[15:10:28] RECOVERY - Check systemd state on ldap-replica1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:50] RECOVERY - Check systemd state on ldap-replica1002 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:10:55] (03PS1) 10Filippo Giunchedi: WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 [15:11:15] (03CR) 10jerkins-bot: [V: 04-1] WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [15:12:42] (03PS1) 10Ottomata: camus - Bump check_jar version to refinery 0.0.137 [puppet] - 10https://gerrit.wikimedia.org/r/635561 (https://phabricator.wikimedia.org/T251609) [15:12:56] RECOVERY - Check systemd state on ldap-replica2003 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:13:23] (03PS2) 10Filippo Giunchedi: WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 [15:13:25] (03PS1) 10Filippo Giunchedi: tox: move grafana tests to python3 [puppet] - 10https://gerrit.wikimedia.org/r/635562 (https://phabricator.wikimedia.org/T265712) [15:13:53] 10Operations, 10Analytics, 10puppet-compiler, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Puppet CI idea: Add a PCC-Nodes tag to commit message to launch PCC job on new patch in gerrit - https://phabricator.wikimedia.org/T266139 (10jbond) @Ottomata This is already possible however you... 
[15:13:57] (03CR) 10jerkins-bot: [V: 04-1] WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [15:14:04] (03CR) 10Ottomata: [C: 03+2] camus - Bump check_jar version to refinery 0.0.137 [puppet] - 10https://gerrit.wikimedia.org/r/635561 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [15:14:06] (03PS1) 10Elukey: role::analytics_test_cluster::client: uplaad hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/635563 (https://phabricator.wikimedia.org/T255139) [15:14:52] 10Operations, 10Analytics, 10puppet-compiler, 10Jenkins, 10Release-Engineering-Team (CI & Testing services): Puppet CI idea: Add a PCC-Nodes tag to commit message to launch PCC job on new patch in gerrit - https://phabricator.wikimedia.org/T266139 (10jbond) [15:14:56] 10Operations, 10Puppet, 10Release-Engineering-Team-TODO, 10puppet-compiler, 10Release-Engineering-Team (CI & Testing services): Integrate the puppet compiler in the puppet CI pipeline - https://phabricator.wikimedia.org/T166066 (10jbond) [15:15:08] (03PS6) 10CRusnov: netbox: Move eqiad private to automation [dns] - 10https://gerrit.wikimedia.org/r/634302 (https://phabricator.wikimedia.org/T258729) [15:15:10] (03PS7) 10CRusnov: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) [15:15:12] (03CR) 10CRusnov: netbox: Move eqiad public to automation (038 comments) [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:15:34] (03CR) 10jerkins-bot: [V: 04-1] netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) (owner: 10CRusnov) [15:15:44] (03PS2) 10Elukey: role::analytics_test_cluster::client: upload hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/635563 (https://phabricator.wikimedia.org/T255139) [15:17:22] (03CR) 10Elukey: [C: 03+2] 
role::analytics_test_cluster::client: upload hive-site.xml [puppet] - 10https://gerrit.wikimedia.org/r/635563 (https://phabricator.wikimedia.org/T255139) (owner: 10Elukey) [15:17:55] ottomata: do you want me to puppet-merge your change too? [15:18:12] (03PS3) 10Filippo Giunchedi: WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 [15:18:28] uhh elukey yes [15:18:30] thought it did already... [15:18:33] yes please! [15:18:58] ack! [15:19:31] (03CR) 10jerkins-bot: [V: 04-1] WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [15:20:00] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/634050 (owner: 10Arturo Borrero Gonzalez) [15:20:26] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10Gilles) As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was returning hits on things that it shoul... 
[15:21:42] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [15:23:00] !log upgrade puppetlabs-stdlib to 6.5.0 https://gerrit.wikimedia.org/r/c/operations/puppet/+/634278 [15:23:03] (03CR) 10Jbond: [C: 03+2] stdlib: update to v6.5.0 [puppet] - 10https://gerrit.wikimedia.org/r/634278 (owner: 10Jbond) [15:23:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:17] (03PS4) 10Filippo Giunchedi: WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 [15:28:22] !log updating prometheus-openldap-exporter to 0+git20171128-3 to buster-wikimedia [15:28:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:28:29] (03CR) 10Filippo Giunchedi: "Build failures are due to missing python3-ldap, to be fixed by I31e06de392f" [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [15:29:34] (03CR) 10jerkins-bot: [V: 04-1] WIP ldap/grafana user sync [puppet] - 10https://gerrit.wikimedia.org/r/635559 (owner: 10Filippo Giunchedi) [15:29:56] 10Operations, 10Traffic, 10Performance-Team (Radar): 8-10% response start regression (Varnish 5.1.3-1wm15 -> 6.0.6-1wm1) - https://phabricator.wikimedia.org/T264398 (10ema) >>! In T264398#6568588, @Gilles wrote: > As @dpifke pointed out in our team meeting yesterday, there's also the possibility that v5 was... [15:31:21] (03PS1) 10Matthias Mullie: [WikibaseMediaInfo] Fix concept chips array nesting structure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635567 (https://phabricator.wikimedia.org/T256431) [15:31:41] !log ppchelko@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'api-gateway' for release 'staging' . 
[15:31:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:33:10] !log ppchelko@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:33:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:34:11] !log ppchelko@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'api-gateway' for release 'production' . [15:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:35:53] 10Operations, 10Scap, 10serviceops, 10Release-Engineering-Team-TODO (2020-10-01 to 2020-12-31 (Q2)): Make a way to build Scap .deb in Docker - https://phabricator.wikimedia.org/T265501 (10jijiki) @LarsWirzenius after discussing it, we decided that for the time being we can't adopt this solution, given that... [15:44:21] is it okay if I deploy a security change or is anyone busy with deployments right now? [15:45:41] (03PS16) 10Jbond: netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) [15:46:42] (03CR) 10Jbond: "updated" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [15:46:49] (03PS8) 10CRusnov: netbox: Move eqiad public to automation [dns] - 10https://gerrit.wikimedia.org/r/634303 (https://phabricator.wikimedia.org/T258729) [15:48:11] 10Operations, 10ops-codfw, 10DC-Ops, 10fundraising-tech-ops: (Need By: 2020-09-30) rack/setup/install frmx2001.frack.codfw.wmnet, frdata2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T260183 (10Jgreen) a:05Jgreen→03None [15:52:40] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:53:55] (03PS1) 10Jbond: wmflib: drop 
Wmflib::UserIpPort as Stdlib::Port::User is now avalible [puppet] - 10https://gerrit.wikimedia.org/r/635570 [15:54:18] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [15:58:49] !log Deployed patch for T260349 [15:58:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:01:00] (03PS1) 10Arturo Borrero Gonzalez: nftables: change ensure_package parameter datatype to String [puppet] - 10https://gerrit.wikimedia.org/r/635571 [16:02:34] (03PS1) 10Ottomata: camus - Refactor http proxy envs [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) [16:10:16] 10Operations, 10MW-on-K8s, 10serviceops, 10Patch-For-Review: Create the base container images for running MediaWiki in a production environment - https://phabricator.wikimedia.org/T265324 (10Jdforrester-WMF) [16:11:01] 10Operations, 10Wikimedia-Etherpad, 10Patch-For-Review: rate limited etherpad - https://phabricator.wikimedia.org/T265490 (10Dzahn) 05Open→03Resolved a:03Dzahn Thank you for confirming. [16:11:26] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:11:50] PROBLEM - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1015.09 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:12:28] ^ looking [16:12:38] 10Operations, 10Puppet, 10cloud-services-team (Kanban): Using $facts['networking']['ip'] breaks puppet on cloud hosts - https://phabricator.wikimedia.org/T266075 (10Dzahn) 05Open→03Resolved a:03Dzahn Oh.. thank you @jbond for the detailed explanation and the fix. 
I assumed it was actually broken on th... [16:13:06] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:13:10] ACKNOWLEDGEMENT - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1015.09 seconds Kormat looking https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:13:13] (03CR) 10Bstorm: [C: 03+2] Create wiki replica views for MachineVision extension tables [puppet] - 10https://gerrit.wikimedia.org/r/623775 (https://phabricator.wikimedia.org/T238574) (owner: 10Cparle) [16:14:07] (03CR) 10Ottomata: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [16:14:23] oh, ffs. dbstore [16:14:48] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Jonesey95) The disabling of these score tags appears to have some negative interaction with the rest... [16:15:10] RECOVERY - MariaDB Replica Lag: s5 on db2099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [16:21:24] (03PS1) 10Elukey: role::analytics_test_cluster::client: change description of the system role [puppet] - 10https://gerrit.wikimedia.org/r/635573 [16:21:41] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) a:05jcrespo→03RobH Jaime: I didn't realize the DB systems hardware repair cadence was different then the other systems (with DBA team only taking it offline immediately before wo... 
[16:22:17] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) Oh, if it is a mainboard replacement, the host will need reimage. I assume if that is the case, it can come offline well in advance as its basically re-entering service as a new hos... [16:24:19] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10jcrespo) > the host will need reimage A reimage is not a problem, even with data loss- the problem is being down for an extended amount of time (e.g. ~1 week). [16:24:55] (03CR) 10jerkins-bot: [V: 04-1] camus - Refactor http proxy envs [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [16:25:00] (03CR) 10Elukey: [C: 03+2] role::analytics_test_cluster::client: change description of the system role [puppet] - 10https://gerrit.wikimedia.org/r/635573 (owner: 10Elukey) [16:29:01] (03CR) 10Dzahn: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [16:30:04] (03CR) 10Dzahn: [C: 03+2] allow secteam-users access to nodes with role(peek) [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [16:30:09] (03CR) 10Muehlenhoff: [C: 03+1] "Fair enough :-)" [puppet] - 10https://gerrit.wikimedia.org/r/635090 (https://phabricator.wikimedia.org/T265922) (owner: 10Dzahn) [16:33:26] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) secteam members should now be able to ssh to peek2001.codfw.wmnet [16:33:54] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Dzahn) 05Open→03Resolved [16:34:48] PROBLEM - Prometheus jobs reduced 
availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:35:08] 10Operations, 10SRE-Access-Requests, 10Security-Team, 10Patch-For-Review: Access to peek2001.codfw.wmnet - https://phabricator.wikimedia.org/T265922 (10Reedy) >>! In T265922#6568972, @Dzahn wrote: > secteam members should now be able to ssh to peek2001.codfw.wmnet Confirmed! Thanks [16:35:22] (03CR) 10Ayounsi: [C: 03+1] netbox/puppet: Add machinery to get Puppet facts from Netbox [puppet] - 10https://gerrit.wikimedia.org/r/563186 (https://phabricator.wikimedia.org/T229397) (owner: 10Jbond) [16:38:14] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [16:39:47] (03CR) 10Legoktm: [C: 03+2] "I did test rebuilds locally to verify that this is a no-op, but I'm not going to rebuild this stack of images for no benefit." 
[docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/631998 (owner: 10Legoktm) [16:39:57] (03CR) 10Legoktm: [C: 03+2] Add buildpack images ("stacks") [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) (owner: 10Legoktm) [16:40:30] (03Merged) 10jenkins-bot: Don't install apt-transport-https for buster [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/631998 (owner: 10Legoktm) [16:40:33] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) [16:40:48] (03Merged) 10jenkins-bot: Add buildpack images ("stacks") [docker-images/toollabs-images] - 10https://gerrit.wikimedia.org/r/634349 (https://phabricator.wikimedia.org/T265686) (owner: 10Legoktm) [16:41:02] 10Operations, 10ops-eqiad, 10DBA: db1139 memory errors on boot 2020-08-27 - https://phabricator.wikimedia.org/T261405 (10RobH) [16:43:45] (03PS1) 10Jbond: profile::mariadb: make use of Stdlib::Datasize [puppet] - 10https://gerrit.wikimedia.org/r/635575 [16:46:01] !log restart php-fpm and pool mw2252 and mw2328 [16:46:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:20] https://logstash-beta.wmflabs.org/goto/30c59e0fce4143c9d15c8071f1641c38 is giving me "502 Bad Gateway nginx/1.13.6" [16:51:55] I imagine that should be mentioned in -releng instead [16:52:02] okay, sorry [16:52:47] DannyS712: fwiw from yesterday: " P.chelolo> I think I broke beta sites." [16:52:50] (03PS1) 10Jbond: stdlib: Switch to new Stdlib::Yes_no type where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/635577 [16:53:00] it was working earlier today though... 
[16:54:03] (03CR) 10jerkins-bot: [V: 04-1] stdlib: Switch to new Stdlib::Yes_no type where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/635577 (owner: 10Jbond) [16:54:57] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Ladsgroup) @Lydia_Pintscher with the {T258354} being done, Can this be called done now? [16:56:12] (03CR) 10Dzahn: [C: 03+1] "lgtm, both ranges end at 49151 as expected and not just "over 1024" as well." [puppet] - 10https://gerrit.wikimedia.org/r/635570 (owner: 10Jbond) [16:56:47] (03PS2) 10Jbond: stdlib: Switch to new Stdlib::Yes_no type where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/635577 [16:57:27] (03CR) 10Jbond: [C: 03+2] wmflib: drop Wmflib::UserIpPort as Stdlib::Port::User is now avalible [puppet] - 10https://gerrit.wikimedia.org/r/635570 (owner: 10Jbond) [16:57:33] !log scandium - disabling puppet so that Parsoid team can make some tests on testreduce1001 today [16:57:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:57:59] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [16:58:00] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [16:58:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:58:46] (03PS2) 10JMeybohm: Initial commit of eventrouter docker image [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634985 (https://phabricator.wikimedia.org/T262675) [16:59:30] (03CR) 10JMeybohm: Initial commit of eventrouter docker image (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/634985 (https://phabricator.wikimedia.org/T262675) (owner: 10JMeybohm) [17:01:39] (03CR) 10Herron: [C: 03+1] "LGTM!" 
[puppet] - 10https://gerrit.wikimedia.org/r/635296 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [17:01:50] (03CR) 10Herron: [C: 03+1] prometheus: add Pushgateway profile and module [puppet] - 10https://gerrit.wikimedia.org/r/635295 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [17:02:56] (03PS2) 10Ladsgroup: Add _ to the allowed list of short url characters [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) [17:04:08] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Add _ to the allowed list of short url characters [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: 10Ladsgroup) [17:05:09] (03CR) 10Ladsgroup: "> Patch Set 1:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: 10Ladsgroup) [17:05:45] (03CR) 10Anne Tomasevich: [C: 03+1] [WikibaseMediaInfo] Fix concept chips array nesting structure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635567 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [17:08:48] anybody would object if I deploy a few MW config changes now? [17:09:49] (03PS4) 10Ppchelko: Enable warn+ logging for ParserCache channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635071 (https://phabricator.wikimedia.org/T264394) [17:12:14] PROBLEM - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 986.14 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:13:13] 10Operations, 10Discovery, 10Traffic, 10Wikidata, and 3 others: Wikidata maxlag repeatedly over 5s since Jan 20, 2020 (primarily caused by the query service) - https://phabricator.wikimedia.org/T243701 (10Lydia_Pintscher) 05Open→03Resolved a:03Lydia_Pintscher Yeah let's call this resolved. Additional... 
[17:13:54] !log dzahn@cumin1001 START - Cookbook sre.hosts.downtime [17:13:54] !log dzahn@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [17:13:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:49] (03CR) 10Ppchelko: [C: 03+2] Enable warn+ logging for ParserCache channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635071 (https://phabricator.wikimedia.org/T264394) (owner: 10Ppchelko) [17:17:17] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10Cmjohnson) The asset tag was my fault, since this was in the spare list the asset tag was not in the location that I normally would put it and did not see it, I ended up adding what is n... [17:17:25] (03Merged) 10jenkins-bot: Enable warn+ logging for ParserCache channel [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635071 (https://phabricator.wikimedia.org/T264394) (owner: 10Ppchelko) [17:18:58] RECOVERY - MariaDB Replica Lag: s5 on db2099 is OK: OK slave_sql_lag Replication lag: 0.00 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:21:07] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable ParserCache logger for warn+, gerrit:635071 (duration: 01m 06s) [17:21:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:54] (03PS2) 10Ppchelko: Enable ParserCache JSON serialization on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635382 (https://phabricator.wikimedia.org/T263579) [17:24:40] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable ParserCache logger for warn+, gerrit:635071 (duration: 01m 08s) [17:24:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:14] (03CR) 10Ppchelko: [C: 03+2] Enable ParserCache JSON 
serialization on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635382 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [17:26:02] (03Merged) 10jenkins-bot: Enable ParserCache JSON serialization on testwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635382 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [17:31:30] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (10wiki_willy) [17:33:17] 10Operations, 10ops-eqiad, 10DC-Ops, 10serviceops: eqiad: Physical Moves for MediaWiki Servers - https://phabricator.wikimedia.org/T266164 (10wiki_willy) [17:33:27] (03PS8) 10Jbond: diffscan: pyhotnify [puppet] - 10https://gerrit.wikimedia.org/r/634572 [17:34:17] !log ppchelko@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Switch ParserCache to JSON on testwiki gerrit:635382 (duration: 01m 05s) [17:34:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:35:17] (03CR) 10Ppchelko: [C: 03+2] Components: Handle missing special pages [skins/WikimediaApiPortal] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635329 (https://phabricator.wikimedia.org/T266021) (owner: 10Cicalese) [17:38:57] (03Merged) 10jenkins-bot: Components: Handle missing special pages [skins/WikimediaApiPortal] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635329 (https://phabricator.wikimedia.org/T266021) (owner: 10Cicalese) [17:43:08] !log ppchelko@deploy1001 Synchronized php-1.36.0-wmf.14/skins/WikimediaApiPortal: Backport gerrit:635329, T266021 (duration: 01m 06s) [17:43:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:43:15] T266021: PHP Fatal Error: Uncaught Error: Call to a member function getDescription() on null - https://phabricator.wikimedia.org/T266021 [17:50:50] (03PS1) 10Ppchelko: Switch ParserCache to JSON for group0 wikis [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/635607 (https://phabricator.wikimedia.org/T263579) [17:56:28] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10Trizek-WMF) [17:56:46] !log configure FB PNI in eqdfw [17:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:57:56] PROBLEM - MariaDB Replica Lag: s5 on db2099 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 1007.50 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [17:58:10] that's me, ignore [18:00:04] longma and liw: #bothumor My software never has bugs. It just develops random features. Rise for Train log triage with CPT. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1800). [18:00:04] RoanKattouw, Niharika, and Urbanecm: Your horoscope predicts another unfortunate Morning backport window deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1800). [18:00:04] ryankemper, annet, and matthiasmullie: A patch you scheduled for Morning backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[18:00:16] \o/ [18:00:22] I can deploy today [18:00:24] I'm around [18:01:30] PROBLEM - Router interfaces on cr2-eqdfw is CRITICAL: CRITICAL: host 208.80.153.198, interfaces up: 66, down: 2, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:02:38] (03PS3) 10Urbanecm: cirrus: Hardcode more_like to codfw cirrus cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635411 (owner: 10Ryan Kemper) [18:02:54] (03CR) 10Urbanecm: [C: 03+2] cirrus: Hardcode more_like to codfw cirrus cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635411 (owner: 10Ryan Kemper) [18:04:11] (03Merged) 10jenkins-bot: cirrus: Hardcode more_like to codfw cirrus cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635411 (owner: 10Ryan Kemper) [18:04:42] RECOVERY - MariaDB Replica Lag: s5 on db2099 is OK: OK slave_sql_lag Replication lag: 0.29 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:04:55] ryankemper: pulled onto mwdebug2001 if you're able to test it there [18:05:12] Urbanecm: ack, taking a look now [18:05:55] thanks [18:07:17] Urbanecm: okay, you're clear to proceed to the rest of the fleet [18:07:22] thanks, syncing [18:07:41] (03PS2) 10Urbanecm: [WikibaseMediaInfo] Fix concept chips array nesting structure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635567 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [18:07:45] (03CR) 10Urbanecm: [C: 03+2] [WikibaseMediaInfo] Fix concept chips array nesting structure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635567 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [18:08:43] (03Merged) 10jenkins-bot: [WikibaseMediaInfo] Fix concept chips array nesting structure [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635567 (https://phabricator.wikimedia.org/T256431) (owner: 10Matthias Mullie) [18:09:06] !log urbanecm@deploy1001 Synchronized 
wmf-config/InitialiseSettings.php: d94e33ff39b300c74fcaf08d1746c089fb1af783: cirrus: Hardcode more_like to codfw cirrus cluster (duration: 01m 05s) [18:09:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:09:24] annet: hi, I pulled your patch onto mwdebug2001 - can you test it there, please? [18:09:35] Urbanecm: looking... [18:10:44] Urbanecm: all looks good! [18:10:50] annet: thanks, syncing! [18:10:53] thanks :) [18:12:34] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 45312d359442d274e83deb7be80f86e12fb9e864: [WikibaseMediaInfo] Fix concept chips array nesting structure (T256431) (duration: 01m 05s) [18:12:38] annet: should be live [18:12:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:40] T256431: [L] Implement superclass "concept chips" in the MediaSearch interface - https://phabricator.wikimedia.org/T256431 [18:12:41] anything else? :) [18:13:02] Urbanecm: yep, looks good on prod, thanks! [18:13:06] no problem! [18:13:22] !log Morning B&C window done [18:13:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:14] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:15:54] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:16:43] (03CR) 10Ottomata: [C: 03+1] Add sbisson to analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/635227 (https://phabricator.wikimedia.org/T265969) (owner: 10Elukey) [18:17:15] 10Operations, 10Analytics, 10SRE-Access-Requests, 10Patch-For-Review: Add sbisson to analytics-privatedata-users and create a kerberos identity - https://phabricator.wikimedia.org/T265969 (10Ottomata) I am the Analytics approver now. APPROVED. [18:21:28] (03PS1) 10Bstorm: cloud nfs: close all unnecessary ports [puppet] - 10https://gerrit.wikimedia.org/r/635612 (https://phabricator.wikimedia.org/T265588) [18:23:28] (03PS1) 10Subramanya Sastry: Update update_parsoid.sh script for use on testreduce1001 [puppet] - 10https://gerrit.wikimedia.org/r/635613 (https://phabricator.wikimedia.org/T257906) [18:23:30] (03PS2) 10Ottomata: camus - Refactor http proxy envs [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) [18:24:32] (03CR) 10Subramanya Sastry: "Maybe we don't need this script anymore and this can all be folded into https://gerrit.wikimedia.org/r/c/mediawiki/services/parsoid/+/6350" [puppet] - 10https://gerrit.wikimedia.org/r/635613 (https://phabricator.wikimedia.org/T257906) (owner: 10Subramanya Sastry) [18:24:50] (03CR) 10Subramanya Sastry: "In any case, wait on merging this till we have run a successful test on testreduce1001." 
[puppet] - 10https://gerrit.wikimedia.org/r/635613 (https://phabricator.wikimedia.org/T257906) (owner: 10Subramanya Sastry) [18:25:25] (03CR) 10Ottomata: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:28:00] (03CR) 10Bstorm: "I think I'm going to stop puppet on labstore1004/5 and test this on cloudstore1008/9 because the impact should be the same and those serve" [puppet] - 10https://gerrit.wikimedia.org/r/635612 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [18:28:54] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10kaldari) @patilise - Because of the nature of the problem, we unfortunately can't share many details... [18:29:17] (03PS3) 10Ottomata: camus - Refactor http proxy envs [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) [18:33:54] (03CR) 10Ottomata: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:39:58] (03PS4) 10Ottomata: camus - Refactor http proxy envs [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) [18:40:17] (03CR) 10Ottomata: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:44:28] (03CR) 10Bstorm: [C: 03+2] cloud nfs: close all unnecessary ports [puppet] - 10https://gerrit.wikimedia.org/r/635612 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [18:46:25] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/587/" [puppet] - 10https://gerrit.wikimedia.org/r/635572 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [18:49:18] 10Operations, 10MediaWiki-extensions-Score, 
10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Tgr) [18:50:26] 10Operations, 10MediaWiki-extensions-Score, 10Security-Team, 10Wikimedia-General-or-Unknown, and 3 others: Extension:Score / Lilypond is disabled on all wikis - https://phabricator.wikimedia.org/T257066 (10Tgr) >>! In T257066#6568853, @Jonesey95 wrote: > The disabling of these score tags appears to have so... [18:54:14] PROBLEM - MariaDB Replica SQL: s4 on db2099 is CRITICAL: CRITICAL slave_sql_state Slave_SQL_Running: No, Errno: 1712, Errmsg: Error Index globalimagelinks is corrupted on query. Default database: commonswiki. [Query snipped] https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_replica [18:55:52] RECOVERY - Router interfaces on cr2-eqdfw is OK: OK: host 208.80.153.198, interfaces up: 68, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [18:56:30] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [18:57:28] 10Operations, 10CX-cxserver, 10Product-Infrastructure-Team-Backlog, 10serviceops-radar, and 2 others: service-runner apps running on kubernetes emit logs with log level 50 - https://phabricator.wikimedia.org/T239459 (10Mholloway) [18:58:12] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:00:04] longma and liw: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Mediawiki train - American+European Version deploy. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T1900). [19:00:44] 10Operations, 10Product-Infrastructure-Team-Backlog, 10Wikifeeds, 10serviceops: [Bug] The feed/featured endpoint is broken - https://phabricator.wikimedia.org/T263043 (10Mholloway) 05Open→03Resolved a:03Pchelolo [19:08:53] (03PS1) 10Ottomata: camus::job - use \s for spaces in systemd unit Environment var [puppet] - 10https://gerrit.wikimedia.org/r/635618 (https://phabricator.wikimedia.org/T251609) [19:09:05] (03CR) 10jerkins-bot: [V: 04-1] camus::job - use \s for spaces in systemd unit Environment var [puppet] - 10https://gerrit.wikimedia.org/r/635618 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:09:09] (03PS2) 10Ottomata: camus::job - use \s for spaces in systemd unit Environment var [puppet] - 10https://gerrit.wikimedia.org/r/635618 (https://phabricator.wikimedia.org/T251609) [19:09:39] (03CR) 10Ottomata: [C: 03+2] camus::job - use \s for spaces in systemd unit Environment var [puppet] - 10https://gerrit.wikimedia.org/r/635618 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:09:44] (03CR) 10Fdans: [C: 03+1] "Thanks for doing this Luca!" 
[puppet] - 10https://gerrit.wikimedia.org/r/635227 (https://phabricator.wikimedia.org/T265969) (owner: 10Elukey) [19:12:44] (03PS1) 10Ottomata: camus::job - only use \s in systemd Environment, regular CLI needs regular spaces [puppet] - 10https://gerrit.wikimedia.org/r/635619 (https://phabricator.wikimedia.org/T251609) [19:13:08] (03CR) 10jerkins-bot: [V: 04-1] camus::job - only use \s in systemd Environment, regular CLI needs regular spaces [puppet] - 10https://gerrit.wikimedia.org/r/635619 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:14:02] (03PS2) 10Ottomata: camus::job - only use \s in systemd Environment, regular CLI needs spaces [puppet] - 10https://gerrit.wikimedia.org/r/635619 (https://phabricator.wikimedia.org/T251609) [19:15:29] (03CR) 10Ottomata: [C: 03+2] camus::job - only use \s in systemd Environment, regular CLI needs spaces [puppet] - 10https://gerrit.wikimedia.org/r/635619 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:15:34] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T264963) [19:15:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:15:42] T264963: Add Wikidata support for smnwiki - https://phabricator.wikimedia.org/T264963 [19:18:58] (03PS1) 10Ottomata: Include profile::analytics::refinery::event_service_config in camus test [puppet] - 10https://gerrit.wikimedia.org/r/635620 (https://phabricator.wikimedia.org/T251609) [19:20:22] (03CR) 10Ottomata: [C: 03+2] Include profile::analytics::refinery::event_service_config in camus test [puppet] - 10https://gerrit.wikimedia.org/r/635620 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:22:44] (03PS1) 10Razzi: Add an-testl-client1001 mac address [puppet] - 10https://gerrit.wikimedia.org/r/635624 (https://phabricator.wikimedia.org/T266064) [19:25:29] (03PS2) 10Razzi: Add an-test-client1001 mac address [puppet] - 
10https://gerrit.wikimedia.org/r/635624 (https://phabricator.wikimedia.org/T266064) [19:27:37] (03PS1) 10Ottomata: camus test - use proper eventstreamconfig.stream_names property [puppet] - 10https://gerrit.wikimedia.org/r/635626 (https://phabricator.wikimedia.org/T251609) [19:28:31] (03CR) 10Ottomata: [C: 03+1] Add an-test-client1001 mac address [puppet] - 10https://gerrit.wikimedia.org/r/635624 (https://phabricator.wikimedia.org/T266064) (owner: 10Razzi) [19:28:43] (03CR) 10Razzi: [C: 03+2] Add an-test-client1001 mac address [puppet] - 10https://gerrit.wikimedia.org/r/635624 (https://phabricator.wikimedia.org/T266064) (owner: 10Razzi) [19:30:01] (03CR) 10Ottomata: [C: 03+2] camus test - use proper eventstreamconfig.stream_names property [puppet] - 10https://gerrit.wikimedia.org/r/635626 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [19:30:10] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:31:46] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:40:49] (03PS1) 10Bstorm: dumps nfs: remove probably-unused firewall ports and services [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) [19:44:47] !log end of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T264963) [19:44:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:44:54] T264963: Add Wikidata support for smnwiki - https://phabricator.wikimedia.org/T264963 [19:49:30] (03CR) 10Bstorm: "PCC: https://puppet-compiler.wmflabs.org/compiler1002/26049/" [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [19:49:49] (03PS1) 10Ahmon Dancy: Add --force flag to safe-service-restart.py [puppet] - 10https://gerrit.wikimedia.org/r/635630 (https://phabricator.wikimedia.org/T243009) [19:51:06] (03PS3) 10Razzi: superset: use envoy instead of nginx for tls [puppet] - 10https://gerrit.wikimedia.org/r/634662 (https://phabricator.wikimedia.org/T240439) [19:58:06] (03PS1) 10Ottomata: camus - use eventstreamconfig for eventgate-analytics-external streams [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) [19:59:21] (03CR) 10jerkins-bot: [V: 04-1] camus - use eventstreamconfig for eventgate-analytics-external streams [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [20:00:04] chrisalbon and accraze: My dear minions, it's time we take the moon! Just kidding. Time for Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T2000). 
[20:00:32] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:00:43] (03CR) 10Razzi: [C: 03+2] piwik: use envoy instead of nginx for tls [puppet] - 10https://gerrit.wikimedia.org/r/634664 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [20:01:07] (03CR) 10Bstorm: "One thing to check here is that all clients mount as nfsv4. All cloud clients mount as nfsv4, so this would only be clients like stat serv" [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [20:03:15] (03CR) 10Razzi: [C: 03+2] superset: use envoy instead of nginx for tls [puppet] - 10https://gerrit.wikimedia.org/r/634662 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [20:03:34] (03PS2) 10Ottomata: camus - use eventstreamconfig for eventgate-analytics-external streams [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) [20:03:44] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:04:47] (03CR) 10jerkins-bot: [V: 04-1] camus - use eventstreamconfig for eventgate-analytics-external streams [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [20:05:41] (03CR) 10Razzi: [C: 03+2] turnilo: use envoy instead of nginx for tls [puppet] - 10https://gerrit.wikimedia.org/r/634661 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [20:06:27] (03PS3) 10Ottomata: camus - use eventstreamconfig for eventgate-analytics-external streams [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) [20:06:54] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=pdu_sentry4 site=eqsin https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:08:30] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:09:23] (03CR) 10Razzi: [C: 03+2] hue: switch from nginx to envoy for tls [puppet] - 10https://gerrit.wikimedia.org/r/634660 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [20:13:46] PROBLEM - Check systemd state on analytics-tool1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:15:19] (03CR) 10Bstorm: "stat1005 looks ok" [puppet] - 10https://gerrit.wikimedia.org/r/635628 (https://phabricator.wikimedia.org/T265588) (owner: 10Bstorm) [20:18:34] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:24:39] (03CR) 10Razzi: [C: 03+2] stats: Add envoy on port 8443 alongside nginx [puppet] - 10https://gerrit.wikimedia.org/r/634667 (https://phabricator.wikimedia.org/T240439) (owner: 10Razzi) [20:25:30] (03PS1) 10Ahmon Dancy: Add --force flag to php-check-and-restart.sh [puppet] - 10https://gerrit.wikimedia.org/r/635635 (https://phabricator.wikimedia.org/T243009) [20:25:50] (03CR) 10Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/26051/an-launcher1002.eqiad.wmnet/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/635632 (https://phabricator.wikimedia.org/T251609) (owner: 10Ottomata) [20:29:10] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy) Thanks for getting that fixed @Cmjohnson - it looks like there's just one more remaining action item with the [20:35:13] 10Operations, 10ops-eqiad, 10DC-Ops: eqiad: Netbox Error for asw2-d4-eqiad - https://phabricator.wikimedia.org/T265393 (10wiki_willy) Thanks for getting that fixed @Cmjohnson - it looks like there's just one more remaining action item with the console port needing to be added: https://netbox.wikimedia.org/e... [20:35:33] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:42:09] (03CR) 10Razzi: [C: 03+2] geoip: move archive timer from stat1007 to an-launcher1002 [puppet] - 10https://gerrit.wikimedia.org/r/633032 (https://phabricator.wikimedia.org/T264152) (owner: 10Razzi) [20:46:33] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job={swagger_check_cxserver_cluster_eqiad,swagger_check_restbase_esams} site={eqiad,esams} https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:48:05] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [20:48:14] (03PS1) 10Razzi: Revert "geoip: move archive timer from stat1007 to an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/635590 [20:49:25] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [20:49:52] (03PS1) 10Bstorm: toolforge k8s: add a PodSecurityPolicy to be used by buildpacks [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) [20:50:28] (03CR) 10Razzi: [C: 03+2] Revert "geoip: move archive timer from stat1007 to an-launcher1002" [puppet] - 10https://gerrit.wikimedia.org/r/635590 (owner: 10Razzi) [20:52:30] (03CR) 10Bstorm: "This is almost exactly the PSP we tested in toolsbeta, except that it uses the correct UID we've chosen (that one used 1000). 
It disallowe" [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [20:54:42] (03PS1) 10Andrew Bogott: wmcs puppetmasters: replace cloud-puppetmaster-04 with -05 [puppet] - 10https://gerrit.wikimedia.org/r/635645 [20:55:53] (03CR) 10Andrew Bogott: [C: 03+2] wmcs puppetmasters: replace cloud-puppetmaster-04 with -05 [puppet] - 10https://gerrit.wikimedia.org/r/635645 (owner: 10Andrew Bogott) [20:57:56] (03CR) 10CDanis: [C: 03+1] turnilo: fix retainMissingValue misconfig [puppet] - 10https://gerrit.wikimedia.org/r/634948 (owner: 10Faidon Liambotis) [21:03:21] (03PS1) 10Andrew Bogott: wmcs puppetmasters: the new -05 puppetmaster is only under .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/635646 [21:04:04] (03CR) 10Andrew Bogott: [C: 03+2] wmcs puppetmasters: the new -05 puppetmaster is only under .wikimedia.cloud [puppet] - 10https://gerrit.wikimedia.org/r/635646 (owner: 10Andrew Bogott) [21:12:05] 10Operations, 10Traffic, 10observability, 10Performance-Team (Radar), 10Sustainability (Incident Followup): Document and/or improve navigation of the various HTTP frontend Grafana dashboards - https://phabricator.wikimedia.org/T253655 (10Ladsgroup) Varnish http requests and "Prometheus varnish http reque... 
[21:18:40] (03CR) 10Ebernhardson: [C: 03+1] [cirrus] A/B test perfield build on spaceless languages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635313 (https://phabricator.wikimedia.org/T266027) (owner: 10DCausse) [21:22:55] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1025 with 10G interfaces - https://phabricator.wikimedia.org/T266187 (10Andrew) [21:23:41] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, 10cloud-services-team (Kanban): Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [21:27:05] (03PS1) 10Dzahn: testreduce: re-enable vd/rt services [puppet] - 10https://gerrit.wikimedia.org/r/635647 [21:28:03] (03CR) 10Dzahn: [C: 03+2] testreduce: re-enable vd/rt services [puppet] - 10https://gerrit.wikimedia.org/r/635647 (owner: 10Dzahn) [21:32:23] (03PS1) 10Dzahn: parsoid/testing: disable vd client and server [puppet] - 10https://gerrit.wikimedia.org/r/635648 (https://phabricator.wikimedia.org/T257906) [21:33:03] (03CR) 10Dzahn: [C: 03+2] parsoid/testing: disable vd client and server [puppet] - 10https://gerrit.wikimedia.org/r/635648 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [21:38:23] !log testreduce1001 assigned 2 more GBs of RAM - rebooting (T257940, T257906) [21:38:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:38:31] T257940: eqiad: 1 VM request for testreduce - https://phabricator.wikimedia.org/T257940 [21:38:31] T257906: Move testreduce away from scandium to a separate Buster Ganeti VM - https://phabricator.wikimedia.org/T257906 [21:40:57] (03PS1) 10Dzahn: Revert "parsoid/testing: disable vd client and server" [puppet] - 10https://gerrit.wikimedia.org/r/635591 [21:43:06] (03CR) 10Dzahn: [C: 03+2] Revert "parsoid/testing: disable vd client and server" [puppet] - 10https://gerrit.wikimedia.org/r/635591 (owner: 10Dzahn) [21:43:21] (03PS1) 10Dzahn: parsoid/testreduce: disable vd server/client 
[puppet] - 10https://gerrit.wikimedia.org/r/635653 (https://phabricator.wikimedia.org/T257906) [21:43:52] (03CR) 10Dzahn: [C: 03+2] parsoid/testreduce: disable vd server/client [puppet] - 10https://gerrit.wikimedia.org/r/635653 (https://phabricator.wikimedia.org/T257906) (owner: 10Dzahn) [21:53:46] (03CR) 10Dzahn: [C: 03+1] stdlib: Switch to new Stdlib::Yes_no type where appropriate [puppet] - 10https://gerrit.wikimedia.org/r/635577 (owner: 10Jbond) [21:56:01] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Connect cloudstore1008 and cloudstore1009 directly via second 10G interface similar to labstore1004/5 - https://phabricator.wikimedia.org/T266192 (10Bstorm) [21:57:14] (03CR) 10Dzahn: [C: 03+1] Add _ to the allowed list of short url characters [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: 10Ladsgroup) [22:00:40] (03CR) 10Dzahn: "another issuer here that is unrelated: Error creating type specialization of a Variant-Type, Cannot use String where Any-Type is expected" [puppet] - 10https://gerrit.wikimedia.org/r/634363 (owner: 10Dzahn) [22:02:11] (03PS3) 10Dzahn: cassandra: add data types, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/634363 [22:04:03] (03PS4) 10Dzahn: cassandra: add data types, hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/634363 [22:06:16] (03CR) 10Dzahn: cassandra: add data types, hiera->lookup (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/634363 (owner: 10Dzahn) [22:06:24] (03CR) 10Dzahn: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1003/26054/" [puppet] - 10https://gerrit.wikimedia.org/r/634363 (owner: 10Dzahn) [22:07:33] (03CR) 10Legoktm: "One quibble about the naming, see inline." 
(032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [22:14:19] (03CR) 10Bstorm: toolforge k8s: add a PodSecurityPolicy to be used by buildpacks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [22:15:22] (03CR) 10CDanis: [C: 03+1] Add _ to the allowed list of short url characters [puppet] - 10https://gerrit.wikimedia.org/r/634937 (https://phabricator.wikimedia.org/T230685) (owner: 10Ladsgroup) [22:17:06] (03CR) 10Legoktm: toolforge k8s: add a PodSecurityPolicy to be used by buildpacks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [22:23:58] (03PS2) 10Bstorm: toolforge k8s: add a PodSecurityPolicy to be used by buildpacks [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) [22:25:53] (03PS1) 10Dzahn: puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 [22:26:29] (03CR) 10jerkins-bot: [V: 04-1] puppetmaster: add data types to all remaining parameters [puppet] - 10https://gerrit.wikimedia.org/r/635656 (owner: 10Dzahn) [22:30:28] (03PS1) 10Dzahn: wmflib: add data type for SSLVerifyClient and use it [puppet] - 10https://gerrit.wikimedia.org/r/635658 [22:31:59] (03CR) 10jerkins-bot: [V: 04-1] wmflib: add data type for SSLVerifyClient and use it [puppet] - 10https://gerrit.wikimedia.org/r/635658 (owner: 10Dzahn) [22:34:42] (03PS1) 10Dzahn: wmflib:: add data type for puppetmaster server type and use it [puppet] - 10https://gerrit.wikimedia.org/r/635660 [22:36:25] (03CR) 10jerkins-bot: [V: 04-1] wmflib:: add data type for puppetmaster server type and use it [puppet] - 10https://gerrit.wikimedia.org/r/635660 (owner: 10Dzahn) [22:38:05] (03PS1) 10Dzahn: dns::auth::acmechief_target: hiera->lookup, data type [puppet] - 
10https://gerrit.wikimedia.org/r/635661 [22:38:42] (03PS1) 10Razzi: Add analytics_test_cluster::client role to an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/635662 (https://phabricator.wikimedia.org/T255139) [22:40:20] (03PS1) 10Dzahn: rpkivalidator: hiera->lookup [puppet] - 10https://gerrit.wikimedia.org/r/635664 [22:42:42] (03PS3) 10Bstorm: toolforge k8s: add a PodSecurityPolicy to be used by buildpacks [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) [22:43:33] 10Operations, 10CommRel-Specialists-Support (Oct-Dec-2020), 10User-notice: CommRel support for FY2020-2021 Q2 DC switchback - https://phabricator.wikimedia.org/T264364 (10RLazarus) [22:43:39] (03PS1) 10Dzahn: debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 [22:44:49] (03CR) 10jerkins-bot: [V: 04-1] debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [22:47:29] (03PS1) 10Dzahn: ntp: hiera->lookup, data types [puppet] - 10https://gerrit.wikimedia.org/r/635666 [22:47:59] (03CR) 10Legoktm: [C: 03+1] toolforge k8s: add a PodSecurityPolicy to be used by buildpacks [puppet] - 10https://gerrit.wikimedia.org/r/635641 (https://phabricator.wikimedia.org/T265557) (owner: 10Bstorm) [22:48:39] (03PS2) 10Dzahn: debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 [22:50:06] (03CR) 10jerkins-bot: [V: 04-1] debmonitor::client: hiera->lookup, add data types [puppet] - 10https://gerrit.wikimedia.org/r/635665 (owner: 10Dzahn) [22:51:17] (03PS1) 10Catrope: StartEditingDialog: Add padding between difficulty rows [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635594 (https://phabricator.wikimedia.org/T266033) [22:51:19] (03CR) 10Razzi: [C: 03+2] Add analytics_test_cluster::client role to an-test-client1001 [puppet] - 10https://gerrit.wikimedia.org/r/635662 
(https://phabricator.wikimedia.org/T255139) (owner: 10Razzi) [22:51:58] (03CR) 10Catrope: [C: 03+2] "This change is ready for review." [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635592 (https://phabricator.wikimedia.org/T266033) (owner: 10Catrope) [22:52:05] (03CR) 10Catrope: [C: 03+2] StartEditingDialog: Add padding between difficulty rows [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635594 (https://phabricator.wikimedia.org/T266033) (owner: 10Catrope) [22:52:18] (03CR) 10Catrope: [C: 03+2] StartEditingDialog: Prevent scrolling in non-modal mode [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635330 (https://phabricator.wikimedia.org/T265751) (owner: 10Catrope) [22:52:22] (03CR) 10Catrope: [C: 03+2] Show homepage discovery popup in variant C/D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635331 (https://phabricator.wikimedia.org/T265754) (owner: 10Catrope) [22:54:09] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:54:46] (03PS1) 10Catrope: Revert "Revert "Make variant D the default, and remove variant A"" [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635595 (https://phabricator.wikimedia.org/T265372) [22:56:11] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:56:41] PROBLEM - Uncommitted DNS changes in Netbox on netbox1001 is CRITICAL: An error occurred checking if Netbox has uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [22:59:06] (03PS2) 10Catrope: GrowthExperiments: Remove variant setting override [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635371 (https://phabricator.wikimedia.org/T265556) [22:59:08] (03PS1) 10Catrope: GrowthExperiments: Make variant D the default on all wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/635669 (https://phabricator.wikimedia.org/T265556) [23:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Evening backport window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20201021T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening backport window is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:11] I'll deoy [23:00:13] *deploy [23:01:52] (03Merged) 10jenkins-bot: StartEditingDialog: Work around CSSJanus flipping bug [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635592 (https://phabricator.wikimedia.org/T266033) (owner: 10Catrope) [23:02:06] (03CR) 10Ppchelko: [C: 04-2] "Found a bunch of blockers." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/635607 (https://phabricator.wikimedia.org/T263579) (owner: 10Ppchelko) [23:04:53] (03Merged) 10jenkins-bot: StartEditingDialog: Add padding between difficulty rows [extensions/GrowthExperiments] (wmf/1.36.0-wmf.14) - 10https://gerrit.wikimedia.org/r/635594 (https://phabricator.wikimedia.org/T266033) (owner: 10Catrope) [23:05:02] (03Merged) 10jenkins-bot: StartEditingDialog: Prevent scrolling in non-modal mode [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635330 (https://phabricator.wikimedia.org/T265751) (owner: 10Catrope) [23:05:05] (03Merged) 10jenkins-bot: Show homepage discovery popup in variant C/D [extensions/GrowthExperiments] (wmf/1.36.0-wmf.13) - 10https://gerrit.wikimedia.org/r/635331 (https://phabricator.wikimedia.org/T265754) (owner: 10Catrope) [23:05:57] 10Operations, 10vm-requests: Site: 1 VM request for Analytics test cluster - https://phabricator.wikimedia.org/T266064 (10razzi) @elukey Getting closer, able to ssh in to an-test-client1001.eqiad.wmnet, however `puppet agent` is throwing an error: ` razzi@an-test-client1001:~$ sudo -i puppet agent -tv Info: U... 
[23:06:14] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) p:05Triage→03High [23:12:19] RECOVERY - Uncommitted DNS changes in Netbox on netbox1001 is OK: Netbox has zero uncommitted DNS changes https://wikitech.wikimedia.org/wiki/Monitoring/Netbox_DNS_uncommitted_changes [23:14:37] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:14:55] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.13/extensions/GrowthExperiments/: T265751 T265754 (duration: 01m 08s) [23:15:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:03] T265754: Variant C/D: inconsistencies with welcome survey and discovery popups - https://phabricator.wikimedia.org/T265754 [23:15:03] T265751: Variant C/D: autoscroll on Variant D - https://phabricator.wikimedia.org/T265751 [23:16:01] !log catrope@deploy1001 Synchronized php-1.36.0-wmf.14/extensions/GrowthExperiments/: T266033 (duration: 01m 05s) [23:16:03] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:16:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:16:07] T266033: SE intro overlay - labels length and text positioning - https://phabricator.wikimedia.org/T266033 [23:29:15] PROBLEM - varnish-http-requests grafana alert on alert1001 is CRITICAL: CRITICAL: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is alerting: 70% GET drop in 30min alert. 
https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [23:30:51] RECOVERY - varnish-http-requests grafana alert on alert1001 is OK: OK: Varnish HTTP Requests ( https://grafana.wikimedia.org/d/000000180/varnish-http-requests ) is not alerting. https://phabricator.wikimedia.org/project/view/1201/ https://grafana.wikimedia.org/d/000000180/ [23:31:39] PROBLEM - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [23:35:03] PROBLEM - Prometheus prometheus2003/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2003 is CRITICAL: instance=127.0.0.1 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [23:35:08] (03CR) 10Cwhite: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/635295 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [23:35:33] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Bstorm) [23:35:51] PROBLEM - MediaWiki exceptions and fatals per minute on alert1001 is CRITICAL: 130 gt 100 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:36:02] (03CR) 10Cwhite: [C: 03+1] role: add Pushgateway to Prometheus ops [puppet] - 10https://gerrit.wikimedia.org/r/635296 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [23:36:09] 10Operations, 
10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move labstore1005 to 10Gbps rack and ethernet - https://phabricator.wikimedia.org/T266199 (10Bstorm) [23:36:22] (03CR) 10Cwhite: [C: 03+1] wmnet: record for prometheus-pushgateway [dns] - 10https://gerrit.wikimedia.org/r/635536 (https://phabricator.wikimedia.org/T249311) (owner: 10Filippo Giunchedi) [23:37:28] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) [23:37:29] RECOVERY - MediaWiki exceptions and fatals per minute on alert1001 is OK: (C)100 gt (W)50 gt 40 https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [23:41:20] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move or recable labstore1004 to 10Gbps rack (if needed) and ethernet - https://phabricator.wikimedia.org/T266202 (10Bstorm) [23:42:26] 10Operations, 10ops-eqiad, 10Data-Services, 10cloud-services-team (Kanban): Move labstore1004 and labstore1005 to 10G Ethernet - https://phabricator.wikimedia.org/T266198 (10Bstorm) [23:46:01] (03PS4) 10Ryan Kemper: Bring 3 new eqiad wdqs nodes into service [puppet] - 10https://gerrit.wikimedia.org/r/634381 (https://phabricator.wikimedia.org/T246345) [23:48:11] PROBLEM - Prometheus jobs reduced availability on alert1001 is CRITICAL: job=swagger_check_restbase_esams site=esams https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:49:11] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp1077 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [23:49:47] RECOVERY - Prometheus jobs reduced availability on alert1001 is OK: All metrics within 
thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [23:57:27] RECOVERY - Prometheus prometheus2004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus2004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global