[00:00:28] RECOVERY - Check systemd state on netflow2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:32] RECOVERY - Check systemd state on netflow1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:34] RECOVERY - Check systemd state on netflow3001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:46] RECOVERY - Check systemd state on netflow4001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:01:58] RECOVERY - Check systemd state on netflow5001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:49:58] PROBLEM - Check systemd state on an-launcher1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [02:02:39] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [02:08:20] PROBLEM - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is CRITICAL: /api (bad URL) is CRITICAL: Test bad URL returned the unexpected status 200 (expecting: 404) https://wikitech.wikimedia.org/wiki/Citoid [03:27:34] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [03:31:22] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [04:03:19] (PS1) AntiCompositeNumber: Add tests for MPEG-1 and MPEG-2 thumbnailing [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/617874 (https://phabricator.wikimedia.org/T244570) [04:59:55] (PS1) Marostegui: db1106: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/617875 (https://phabricator.wikimedia.org/T254462) [05:00:40] (CR) Marostegui: [C: +2] db1106: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/617875 (https://phabricator.wikimedia.org/T254462) (owner: Marostegui) [05:01:48] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106 after compression', diff saved to https://phabricator.wikimedia.org/P12137 and previous config saved to /var/cache/conftool/dbconfig/20200803-050148-marostegui.json [05:01:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:02:39] Operations: FY2020-2021 Q1 codfw -> eqiad switchback - https://phabricator.wikimedia.org/T243318 (Marostegui) [05:04:38] !log Remove db1108:3321 and db1108:3322 from tendril and add db1108:3351 and db1108:3352 T254462 [05:04:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:04:41] T254462: Compress enwiki InnoDB tables - https://phabricator.wikimedia.org/T254462 [05:21:14] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:22:26] (PS1) Marostegui: db-eqiad,db-codfw.php: Update s5 description [mediawiki-config] - https://gerrit.wikimedia.org/r/617876 (https://phabricator.wikimedia.org/T259437) [05:27:15]
!log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106 after compression', diff saved to https://phabricator.wikimedia.org/P12138 and previous config saved to /var/cache/conftool/dbconfig/20200803-052715-marostegui.json [05:27:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [05:28:48] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:32:40] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [05:33:58] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:37:50] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [05:41:24] (CR) Giuseppe Lavagetto: envoyproxy::tls_terminator: update tls definitions (2 comments) [puppet] - https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) (owner: Giuseppe Lavagetto) [05:43:40] (PS2) Giuseppe Lavagetto: envoyproxy::tls_terminator: update tls definitions [puppet] - https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) [05:48:37] RECOVERY - puppet last run on otrs1001 is OK: OK: Puppet is currently disabled (switching deprectated envoy constructs --joe), not alerting. Last run 2 days ago with 0 failures https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [05:51:26] (PS3) Giuseppe Lavagetto: envoyproxy::tls_terminator: update tls definitions [puppet] - https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) [05:53:00] (CR) Giuseppe Lavagetto: [C: +2] envoyproxy::tls_terminator: update tls definitions [puppet] - https://gerrit.wikimedia.org/r/617086 (https://phabricator.wikimedia.org/T258140) (owner: Giuseppe Lavagetto) [06:10:22] (CR) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 (1 comment) [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto) [06:13:19] PROBLEM - puppet last run on otrs1001 is CRITICAL: CRITICAL: Puppet last ran 2 days ago https://wikitech.wikimedia.org/wiki/Monitoring/puppet_checkpuppetrun [06:28:49] PROBLEM - ores on ores2001 is CRITICAL: connect to address 10.192.0.12 and port 8081: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [06:31:13] (PS9) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792
(https://phabricator.wikimedia.org/T244843) [06:33:39] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (ayounsi) Indeed, miss-configuration from my side, the vlan was configured as `access` instead of... [06:42:07] RECOVERY - ores on ores2001 is OK: HTTP OK: HTTP/1.0 200 OK - 6397 bytes in 0.088 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/ores [07:00:05] Operations, Traffic, netops, Patch-For-Review: Anycast: consistent routers->servers routing - https://phabricator.wikimedia.org/T253666 (ayounsi) Open→Stalled p:Medium→Low [07:00:10] Operations, Traffic, netops, Patch-For-Review, Performance-Team (Radar): Anycast AuthDNS - https://phabricator.wikimedia.org/T98006 (ayounsi) [07:01:02] (PS1) Muehlenhoff: toolforge: Remove jessie conditionals [puppet] - https://gerrit.wikimedia.org/r/617995 [07:02:22] (PS1) Elukey: role::aqs: update druid mediawiki snapshot settings [puppet] - https://gerrit.wikimedia.org/r/617996 [07:02:27] (CR) Giuseppe Lavagetto: [C: +1] helm: Switch stable chart repository to chartmuseum [puppet] - https://gerrit.wikimedia.org/r/617701 (https://phabricator.wikimedia.org/T25384) (owner: JMeybohm) [07:03:55] (CR) JMeybohm: [C: +2] helm: Switch stable chart repository to chartmuseum [puppet] - https://gerrit.wikimedia.org/r/617701 (https://phabricator.wikimedia.org/T25384) (owner: JMeybohm) [07:07:02] !log marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106 after compression', diff saved to https://phabricator.wikimedia.org/P12139 and previous config saved to /var/cache/conftool/dbconfig/20200803-070702-marostegui.json [07:07:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:04] !log Deploy MCR change on s7 codfw, lag will appear on codfw T238966
[07:09:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:07] T238966: Apply updates for MCR, actor migration, and content migration, to production wikis. - https://phabricator.wikimedia.org/T238966 [07:10:05] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:10:36] !log Remove revision triggers from db2095:3317 for MCR changes T238966 [07:10:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:13:53] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [07:17:20] (PS1) Marostegui: mariadb: Promote db1107 to m2 master [puppet] - https://gerrit.wikimedia.org/r/617997 (https://phabricator.wikimedia.org/T257540) [07:17:22] (PS1) Muehlenhoff: Remove access for thephp.cc people [puppet] - https://gerrit.wikimedia.org/r/617998 [07:18:33] (CR) Marostegui: [C: -2] "Wait for the failover day" [puppet] - https://gerrit.wikimedia.org/r/617997 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui) [07:19:12] (CR) Muehlenhoff: [C: +2] Remove access for thephp.cc people [puppet] - https://gerrit.wikimedia.org/r/617998 (owner: Muehlenhoff) [07:22:04] (PS1) Marostegui: dbproxy1018: Increase labsdb1011 weight [puppet] - https://gerrit.wikimedia.org/r/617999 [07:24:10] (CR) Marostegui: [C: +2] dbproxy1018: Increase labsdb1011 weight [puppet] - https://gerrit.wikimedia.org/r/617999 (owner: Marostegui) [07:24:35] RECOVERY - Check systemd state on an-launcher1002 is OK: OK - running: The system is fully operational
https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [07:24:40] moritzm: ok to merge your change? [07:26:09] (CR) Marostegui: "I have re-created this host with the new ports in tendril, as it was red." [puppet] - https://gerrit.wikimedia.org/r/617077 (https://phabricator.wikimedia.org/T234826) (owner: Jcrespo) [07:27:30] (CR) Muehlenhoff: Revoke all remaining group memberships, etc (1 comment) [puppet] - https://gerrit.wikimedia.org/r/617749 (owner: Chad) [07:27:39] marostegui: sorry! please do [07:27:43] doing [07:27:59] moritzm: merged [07:28:18] (CR) Gilles: [C: +2] Add tests for MPEG-1 and MPEG-2 thumbnailing [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/617874 (https://phabricator.wikimedia.org/T244570) (owner: AntiCompositeNumber) [07:28:29] thx [07:28:52] (PS4) Gilles: Support MPEG-1 and MPEG-2 video files with .mpg or .mpeg extension [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: Brion VIBBER) [07:33:16] (CR) Gilles: [C: +2] Support MPEG-1 and MPEG-2 video files with .mpg or .mpeg extension [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: Brion VIBBER) [07:33:53] (Merged) jenkins-bot: Support MPEG-1 and MPEG-2 video files with .mpg or .mpeg extension [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/569341 (https://phabricator.wikimedia.org/T166024) (owner: Brion VIBBER) [07:34:14] (PS1) JMeybohm: helm: Allow to update existing repositories with new URL [puppet] - https://gerrit.wikimedia.org/r/618000 (https://phabricator.wikimedia.org/T25384) [07:36:27] (CR) Giuseppe Lavagetto: [C: +1] helm: Allow to update existing repositories with new URL [puppet] - https://gerrit.wikimedia.org/r/618000 (https://phabricator.wikimedia.org/T25384) (owner: JMeybohm) [07:36:30] (PS2) Gilles: Add tests for MPEG-1 and MPEG-2
thumbnailing [software/thumbor-plugins] - https://gerrit.wikimedia.org/r/617874 (https://phabricator.wikimedia.org/T244570) (owner: AntiCompositeNumber) [07:38:56] (CR) JMeybohm: [C: +2] helm: Allow to update existing repositories with new URL [puppet] - https://gerrit.wikimedia.org/r/618000 (https://phabricator.wikimedia.org/T25384) (owner: JMeybohm) [07:46:20] (PS1) JMeybohm: helm: Fix path to grep [puppet] - https://gerrit.wikimedia.org/r/618002 (https://phabricator.wikimedia.org/T25384) [07:47:23] (CR) JMeybohm: [C: +2] helm: Fix path to grep [puppet] - https://gerrit.wikimedia.org/r/618002 (https://phabricator.wikimedia.org/T25384) (owner: JMeybohm) [07:55:18] (CR) Fdans: [C: +1] role::aqs: update druid mediawiki snapshot settings [puppet] - https://gerrit.wikimedia.org/r/617996 (owner: Elukey) [07:56:26] (CR) Elukey: [C: +2] role::aqs: update druid mediawiki snapshot settings [puppet] - https://gerrit.wikimedia.org/r/617996 (owner: Elukey) [08:05:35] (CR) Ladsgroup: db-eqiad,db-codfw.php: Update s5 description (1 comment) [mediawiki-config] - https://gerrit.wikimedia.org/r/617876 (https://phabricator.wikimedia.org/T259437) (owner: Marostegui) [08:07:31] !log roll restart aqs on aqs* to pick up new druid settings [08:07:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:08:31] (PS2) Marostegui: db-eqiad,db-codfw.php: Update s5 description [mediawiki-config] - https://gerrit.wikimedia.org/r/617876 (https://phabricator.wikimedia.org/T259437) [08:11:35] (CR) Ladsgroup: [C: +1] db-eqiad,db-codfw.php: Update s5 description [mediawiki-config] - https://gerrit.wikimedia.org/r/617876 (https://phabricator.wikimedia.org/T259437) (owner: Marostegui) [08:18:25] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Update s5 description [mediawiki-config] - https://gerrit.wikimedia.org/r/617876 (https://phabricator.wikimedia.org/T259437) (owner: Marostegui)
[08:19:09] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Update s5 description [mediawiki-config] - https://gerrit.wikimedia.org/r/617876 (https://phabricator.wikimedia.org/T259437) (owner: Marostegui) [08:21:21] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Clarify s5 wikis T259437 (duration: 01m 40s) [08:21:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:21:25] T259437: Update shard descriptions in db-eqiad/db-codfw - https://phabricator.wikimedia.org/T259437 [08:21:26] (PS10) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [08:22:33] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify s5 wikis T259437 (duration: 01m 05s) [08:22:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:22:37] (CR) jerkins-bot: [V: -1] Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: Giuseppe Lavagetto) [08:25:25] (CR) Kormat: [C: +1] mariadb: Promote db1107 to m2 master [puppet] - https://gerrit.wikimedia.org/r/617997 (https://phabricator.wikimedia.org/T257540) (owner: Marostegui) [08:25:34] !log marostegui@cumin1001 dbctl commit (dc=all): 'Fully repool db1106 after compression', diff saved to https://phabricator.wikimedia.org/P12140 and previous config saved to /var/cache/conftool/dbconfig/20200803-082533-marostegui.json [08:25:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:52] !log installing qemu security updates on stretch [08:25:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:41] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1089 into API', diff saved to https://phabricator.wikimedia.org/P12141 and previous config saved
to /var/cache/conftool/dbconfig/20200803-082641-marostegui.json [08:26:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:56] Operations, MassMessage, MediaWiki-JobQueue: Same MassMessage is being sent more than once - https://phabricator.wikimedia.org/T93049 (Elitre) >>! In T93049#6328739, @Elitre wrote: > For more quirkiness, [[ https://www.mediawiki.org/wiki/User_talk:Krinkle#GUC_Tool_error,_or? | I had recently brought... [08:28:07] (PS1) Elukey: role::druid::analytics::worker: upgrade druid to 0.19 [puppet] - https://gerrit.wikimedia.org/r/618005 (https://phabricator.wikimedia.org/T244482) [08:29:58] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1001/24276/druid1001.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/618005 (https://phabricator.wikimedia.org/T244482) (owner: Elukey) [08:34:28] (PS11) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [08:39:44] (PS12) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [08:44:28] (PS1) Ema: atskafka: tune librdkafka settings one by one [puppet] - https://gerrit.wikimedia.org/r/618007 (https://phabricator.wikimedia.org/T254317) [08:44:44] (PS1) JMeybohm: common_templates: Sort values for checksum/tls-certs [deployment-charts] - https://gerrit.wikimedia.org/r/618008 [08:45:22] (CR) Vgutierrez: [C: +2] api: Exclude not valid parts from get_directory_metadata output [software/acme-chief] - https://gerrit.wikimedia.org/r/617680 (https://phabricator.wikimedia.org/T259338) (owner: Vgutierrez) [08:48:18] (Merged) jenkins-bot: api: Exclude not valid parts from get_directory_metadata output [software/acme-chief] - https://gerrit.wikimedia.org/r/617680
(https://phabricator.wikimedia.org/T259338) (owner: Vgutierrez) [08:49:46] (CR) Ema: [C: +2] atskafka: tune librdkafka settings one by one [puppet] - https://gerrit.wikimedia.org/r/618007 (https://phabricator.wikimedia.org/T254317) (owner: Ema) [08:51:14] Operations, User-MoritzMuehlenhoff: Review of ferm services without srange - https://phabricator.wikimedia.org/T149804 (MoritzMuehlenhoff) [08:54:05] Operations, observability: db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (fgiunchedi) [08:59:17] !log installing ffmpeg security updates on jobrunners/video scalers (3.2.15 rebuilt with VP9/row-mt patches) [08:59:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:00:39] (PS13) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [09:00:40] (PS1) Giuseppe Lavagetto: Add validation for envoy config files [deployment-charts] - https://gerrit.wikimedia.org/r/618010 [09:11:40] (PS1) Ema: atskafka: set queue.buffering.max.ms to 1s [puppet] - https://gerrit.wikimedia.org/r/618011 (https://phabricator.wikimedia.org/T254317) [09:13:19] Operations, observability: db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (Marostegui) Is there a way to send reminders about not resolved incidents? Ideally not via a page :) I am not sure about options #1 and #2. Opt...
[09:13:55] (CR) Ema: "check experimental" [puppet] - https://gerrit.wikimedia.org/r/618011 (https://phabricator.wikimedia.org/T254317) (owner: Ema) [09:17:03] (PS1) Kormat: mariadb: Drop mysql.py in favour of wmfmariadbpy package [puppet] - https://gerrit.wikimedia.org/r/618012 (https://phabricator.wikimedia.org/T259021) [09:17:08] (CR) Ema: [C: +2] atskafka: set queue.buffering.max.ms to 1s [puppet] - https://gerrit.wikimedia.org/r/618011 (https://phabricator.wikimedia.org/T254317) (owner: Ema) [09:17:50] (PS2) Kormat: mariadb: Drop mysql.py in favour of wmfmariadbpy package [puppet] - https://gerrit.wikimedia.org/r/618012 (https://phabricator.wikimedia.org/T259021) [09:18:25] Operations, serviceops: citoid /api LVS check reports HTTP 404 instead of HTTP 200 - https://phabricator.wikimedia.org/T259469 (elukey) [09:30:00] (CR) Jcrespo: [C: +1] "I would like to know details of new deployment and testing." [puppet] - https://gerrit.wikimedia.org/r/618012 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:42:01] !log installing curl security updates on stretch [09:42:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:43:44] (CR) JMeybohm: [C: +2] eventgate: Update repository URL in requirements [deployment-charts] - https://gerrit.wikimedia.org/r/617695 (https://phabricator.wikimedia.org/T253843) (owner: JMeybohm) [09:46:26] !log restarting mw1261-mw1265 to pick up curl security updates [09:46:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:46:56] (PS2) Giuseppe Lavagetto: Add validation for envoy config files [deployment-charts] - https://gerrit.wikimedia.org/r/618010 [09:46:58] (PS14) Giuseppe Lavagetto: Add local service proxy to the tls terminator v0.2 [deployment-charts] - https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) [09:47:00] (PS1) Giuseppe Lavagetto: Fix deprecated
constructs in the envoy configuration [deployment-charts] - https://gerrit.wikimedia.org/r/618016 (https://phabricator.wikimedia.org/T258140) [09:47:14] (CR) Kormat: "> Patch Set 2: Code-Review+1" [puppet] - https://gerrit.wikimedia.org/r/618012 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [09:49:03] (CR) Giuseppe Lavagetto: [C: +2] Add validation for envoy config files [deployment-charts] - https://gerrit.wikimedia.org/r/618010 (owner: Giuseppe Lavagetto) [09:49:05] (CR) Elukey: [C: +2] Set spark deploy-mode client for all the Analytics Hive to Druid jobs [puppet] - https://gerrit.wikimedia.org/r/617735 (https://phabricator.wikimedia.org/T254493) (owner: Elukey) [09:49:34] (CR) Giuseppe Lavagetto: [C: +2] Fix deprecated constructs in the envoy configuration [deployment-charts] - https://gerrit.wikimedia.org/r/618016 (https://phabricator.wikimedia.org/T258140) (owner: Giuseppe Lavagetto) [09:49:53] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [09:57:21] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [10:15:56] (PS1) Elukey: Fix config file path in Eventlogging to Druid jobs [puppet] - https://gerrit.wikimedia.org/r/618017 (https://phabricator.wikimedia.org/T254493) [10:17:36] (CR) Elukey: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1001/24278/an-launcher1002.eqiad.wmnet/index.html" [puppet] - https://gerrit.wikimedia.org/r/618017 (https://phabricator.wikimedia.org/T254493) (owner: Elukey) [10:19:25] (PS1) Muehlenhoff: Also exclude /mnt/hdfs on analytics_test_cluster::coordinator from debdeploy [puppet] - https://gerrit.wikimedia.org/r/618018 [10:19:36] !log restarting wtp1025 (parsoid canary) to pick up curl security updates [10:19:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:21:58] Operations, serviceops, Patch-For-Review: Update deprecated extension names in envoy config - https://phabricator.wikimedia.org/T258140 (Joe) Open→Resolved I still didn't deploy the code to k8s, but the idea is it will be picked up next time the chart gets updated. [10:26:45] !log restarting Apache on puppetboard to pick up curl security updates [10:26:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:29:20] !log installing NSS security updates on buster [10:29:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:30:04] jan_drewniak: How many deployers does it take to do Wikimedia Portals Update deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T1030).
[10:45:13] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/618020 (https://phabricator.wikimedia.org/T128546) [10:47:10] (PS1) Muehlenhoff: Also add Cumin aliases for Ganeti in cache pops [puppet] - https://gerrit.wikimedia.org/r/618021 [10:47:12] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/618020 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:47:14] (PS1) Kormat: mariadb: Fix variable value in replication_lag [puppet] - https://gerrit.wikimedia.org/r/618022 [10:48:07] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/618020 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak) [10:49:17] (CR) Muehlenhoff: [C: +2] Also add Cumin aliases for Ganeti in cache pops [puppet] - https://gerrit.wikimedia.org/r/618021 (owner: Muehlenhoff) [10:50:10] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:618020| Bumping portals to master (T128546)]] (duration: 01m 08s) [10:50:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:50:13] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:51:17] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:618020| Bumping portals to master (T128546)]] (duration: 01m 06s) [10:51:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:52:20] (CR) Kormat: "PCC run looks good: https://gerrit.wikimedia.org/r/c/operations/puppet/+/618022/" [puppet] - https://gerrit.wikimedia.org/r/618022 (owner: Kormat) [10:52:35] gehel: we just pushed a new build for the query-gui-deploy repo, can you deploy it during the WDQS window later today?
:) [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: I, the Bot under the Fountain, allow thee, The Deployer, to do European mid-day backport window(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T1100). [11:00:04] Urbanecm: A patch you scheduled for European mid-day backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:14] ohoho [11:00:17] I'll do that :) [11:00:31] (CR) Urbanecm: [C: +2] New throttle rule for Czech editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/617717 (https://phabricator.wikimedia.org/T259352) (owner: Urbanecm) [11:00:32] noooo don’t break the wikis :< [11:00:39] I'm not going to break them [11:00:43] or, at least I hope :) [11:00:52] D [11:00:54] * :D [11:01:01] * Amir1 is here for emotional support [11:01:14] (Merged) jenkins-bot: New throttle rule for Czech editathon [mediawiki-config] - https://gerrit.wikimedia.org/r/617717 (https://phabricator.wikimedia.org/T259352) (owner: Urbanecm) [11:01:18] thanks both :) [11:01:19] !log removing cloudcephmon100[1-3].wikimedia.org from debmonitor (these eventually got re-installed as cloudcephmon100[1-3].eqiad.wmnet) [11:01:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:01:57] (unrelated q) btw, does anyone know how does throttling behave with IPv6? [11:02:13] If implemented properly, each participant should have their own IPv6, does that mean each is allowed to create 6 accs?
[11:03:11] !log installing ruby2.5 security updates [11:03:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:51] !log urbanecm@deploy1001 Synchronized wmf-config/throttle.php: ead6b9eb699594583b06b8f5c23d40d9add2eb49: New throttle rule for Czech editathon (T259352) (duration: 01m 06s) [11:03:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:03:54] Lucas_WMDE: I'm on vacation, but dcausse and ryankemper should be able to help [11:03:54] T259352: Throttle request for Czech editathon - https://phabricator.wikimedia.org/T259352 [11:04:16] gehel: ok thanks (your name was in the calendar, enjoy the vacation then!) [11:09:23] (PS1) Urbanecm: Add gpophotoeng.gov.il to the wgCopyUploadsDomains allowlist for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/618025 (https://phabricator.wikimedia.org/T258857) [11:09:33] (CR) Urbanecm: [C: +2] Add gpophotoeng.gov.il to the wgCopyUploadsDomains allowlist for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/618025 (https://phabricator.wikimedia.org/T258857) (owner: Urbanecm) [11:10:24] (Merged) jenkins-bot: Add gpophotoeng.gov.il to the wgCopyUploadsDomains allowlist for commonswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/618025 (https://phabricator.wikimedia.org/T258857) (owner: Urbanecm) [11:12:06] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 8c2a2b2187fcc33c37b9d0f6bafcc963afd0b74b: Add gpophotoeng.gov.il to the wgCopyUploadsDomains allowlist for commonswiki (T258857) (duration: 01m 07s) [11:12:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:12:09] T258857: Add gpophotoeng.gov.il to the wgCopyUploadsDomains allowlist of Wikimedia Commons - https://phabricator.wikimedia.org/T258857 [11:13:45] (PS1) Urbanecm: Add extra namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/618046
(https://phabricator.wikimedia.org/T258913) [11:14:31] (CR) Urbanecm: [C: +2] Add extra namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/618046 (https://phabricator.wikimedia.org/T258913) (owner: Urbanecm) [11:15:20] (PS2) Urbanecm: Add extra namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/618046 (https://phabricator.wikimedia.org/T258913) [11:15:26] (CR) Urbanecm: [C: +2] Add extra namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/618046 (https://phabricator.wikimedia.org/T258913) (owner: Urbanecm) [11:16:13] (Merged) jenkins-bot: Add extra namespaces for yuewiktionary [mediawiki-config] - https://gerrit.wikimedia.org/r/618046 (https://phabricator.wikimedia.org/T258913) (owner: Urbanecm) [11:19:23] !log urbanecm@deploy1001 Synchronized wmf-config/InitialiseSettings.php: 346138d95721c274d450568388fb2ad1803dba9e: Add extra namespaces for yuewiktionary (T258913) (duration: 01m 06s) [11:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:19:26] T258913: Add import sources and namespaces for yuewiktionary - https://phabricator.wikimedia.org/T258913 [11:19:29] !log EU B&C done [11:19:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:20:53] !log installing ruby-rack security updates [11:20:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:39:55] (CR) Marostegui: [C: +1] mariadb: Fix variable value in replication_lag [puppet] - https://gerrit.wikimedia.org/r/618022 (owner: Kormat) [11:40:38] (CR) Marostegui: [C: +1] "Let's test switchover script and replication_tree too, just in case" [puppet] - https://gerrit.wikimedia.org/r/618012 (https://phabricator.wikimedia.org/T259021) (owner: Kormat) [11:47:44] (CR) Kormat: [C: +2] mariadb: Fix variable value in replication_lag [puppet] - https://gerrit.wikimedia.org/r/618022
(owner: 10Kormat) [11:49:01] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:52:49] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [11:55:12] !log installing luajit security updates [11:55:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:25] James_F, https://logstash.wikimedia.org/app/kibana#/doc/logstash-*/logstash-deploy-2020.08.03/mediawiki?id=AXO0LgXAMQ_08tQaVamm&_g=h@44136fa -- this references you? [12:00:24] liw: Yeah, just noise, sorry about that. [12:03:40] (03PS1) 10CDanis: Revert "Revert "puppetmaster: clearer edit message when you can't rewrite history"" [puppet] - 10https://gerrit.wikimedia.org/r/618032 [12:04:04] (03PS2) 10CDanis: Revert "Revert "puppetmaster: clearer edit message when you can't rewrite history"" [puppet] - 10https://gerrit.wikimedia.org/r/618032 [12:07:07] 10Operations, 10observability: db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (10fgiunchedi) A reminder might work! We'll be inquiring VO about that possibility e.g. via email when an incident stays open for more than X hours. 
[12:08:27] (03PS1) 10Muehlenhoff: Add library hint for luajit [puppet] - 10https://gerrit.wikimedia.org/r/618049 [12:13:13] (03CR) 10Muehlenhoff: [C: 03+2] Add library hint for luajit [puppet] - 10https://gerrit.wikimedia.org/r/618049 (owner: 10Muehlenhoff) [12:13:43] !log disabling puppet on cumin hosts T259021 [12:13:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:46] T259021: Package wmfmariadbpy as a .deb - https://phabricator.wikimedia.org/T259021 [12:14:30] (03CR) 10Kormat: [C: 03+2] mariadb: Drop mysql.py in favour of wmfmariadbpy package [puppet] - 10https://gerrit.wikimedia.org/r/618012 (https://phabricator.wikimedia.org/T259021) (owner: 10Kormat) [12:15:10] moritzm: is it safe to merge your puppet CR too? [12:15:34] I was about to, please go ahead [12:15:51] grand, done. [12:16:43] thx [12:18:46] James_F, I'll just ignore it? [12:18:57] Thanks. [12:20:41] !log restarting nginx on francium to pick up luajit update [12:20:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:24:42] (03PS1) 10Marostegui: section: Change mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618052 (https://phabricator.wikimedia.org/T259021) [12:24:48] (03CR) 10jerkins-bot: [V: 04-1] section: Change mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618052 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:24:51] kormat: ^ [12:25:24] marostegui: any chance we could stop hard-coding paths, and trust in `$PATH`? 
[12:25:40] sure, we can too [12:26:02] my repo is broken, but I will amend [12:26:10] !log installing apache-log4j1.2 security updates [12:26:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:12] there's no backport window open at this time, but train is at group1 [12:26:22] I'm going to promote train to group2 [12:26:29] (03Abandoned) 10Marostegui: section: Change mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618052 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:27:26] (03PS1) 10Lars Wirzenius: all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618054 [12:27:28] (03CR) 10Lars Wirzenius: [C: 03+2] all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618054 (owner: 10Lars Wirzenius) [12:27:30] (03PS1) 10Ema: prometheus: import atskafka batch metrics [puppet] - 10https://gerrit.wikimedia.org/r/618055 (https://phabricator.wikimedia.org/T254317) [12:27:32] (03PS1) 10Marostegui: section: Do not hardcode mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618056 (https://phabricator.wikimedia.org/T259021) [12:27:42] (03CR) 10jerkins-bot: [V: 04-1] section: Do not hardcode mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618056 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:27:46] uh? [12:28:15] marostegui: did you not ❤️ [12:28:16] (03Merged) 10jenkins-bot: all wikis to 1.36.0-wmf.2 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618054 (owner: 10Lars Wirzenius) [12:28:21] that CR enough? it feels abandoned [12:28:38] I just checked out the repo, how can it ask me to rebase?
[12:29:16] (03CR) 10Kormat: [C: 03+1] section: Do not hardcode mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618056 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:31:54] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:32:15] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.2 [12:32:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:58] [53482bff-abbf-4e59-bd16-5694e0b4dfce] 2020-08-03 12:32:03: Fatal exception of type "WMFTimeoutException" [12:33:15] (on enwiki) [12:35:06] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:37:30] (03PS2) 10Jcrespo: section: Do not hardcode mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618056 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:38:39] (03CR) 10Jcrespo: "retry" [software] - 10https://gerrit.wikimedia.org/r/618056 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:39:26] (03CR) 10Marostegui: [C: 03+2] section: Do not hardcode mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618056 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:39:56] (03Merged) 10jenkins-bot: section: Do not hardcode mysql.py path [software] - 10https://gerrit.wikimedia.org/r/618056 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [12:42:33] stwalkerster: does that still happen? 
[12:43:40] this is the traceback, fwiw https://www.irccloud.com/pastebin/6pEohTsw/ [12:45:50] liw: ^, can reproduce [12:52:56] (03PS1) 10ZPapierski: Set up WCQS test server [puppet] - 10https://gerrit.wikimedia.org/r/618059 [12:53:25] !log move VRRP master to cr3-eqsin [12:53:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:54:06] Urbanecm, hmmm [12:54:49] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: import atskafka batch metrics [puppet] - 10https://gerrit.wikimedia.org/r/618055 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [12:54:51] Urbanecm, luckily for me that's wmf.1 (last week's train), but it's still a bug. is there a Phab task? [12:55:43] oh, didn't see it's a wmf.1 entry, I saw you promoted group2 recently and somehow blindly assumed it's new :) [12:55:47] not sure about a task, will have a look [12:57:44] . o O (conducting the train is a great lesson in "correlation is not causation" :) [12:57:56] * James_F grins. [12:58:18] there's T257002 [12:58:18] T257002: Internal error on Special:Contributions in Wikidata - https://phabricator.wikimedia.org/T257002 [12:58:27] * Urbanecm grins back at James_F [13:05:13] !log installing json-c security updates [13:05:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:27] * Urbanecm adding some more details to the task [13:07:06] (03PS1) 10Ema: atskafka: lower request.required.acks [puppet] - 10https://gerrit.wikimedia.org/r/618061 (https://phabricator.wikimedia.org/T254317) [13:07:15] (03CR) 10Giuseppe Lavagetto: helmfile: strawman refactoring (034 comments) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:07:59] (03CR) 10Giuseppe Lavagetto: helmfile: strawman refactoring (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/615498 (https://phabricator.wikimedia.org/T258572) (owner: 10Giuseppe Lavagetto) [13:08:37] (03CR)
10JMeybohm: [C: 04-1] Add local service proxy to the tls terminator v0.2 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:08:47] (03PS2) 10Ema: atskafka: lower request.required.acks [puppet] - 10https://gerrit.wikimedia.org/r/618061 (https://phabricator.wikimedia.org/T254317) [13:09:45] (03CR) 10JMeybohm: [C: 04-1] Add local service proxy to the tls terminator v0.2 (031 comment) [deployment-charts] - 10https://gerrit.wikimedia.org/r/582792 (https://phabricator.wikimedia.org/T244843) (owner: 10Giuseppe Lavagetto) [13:10:08] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [13:10:21] (03CR) 10Ema: [C: 03+2] prometheus: import atskafka batch metrics [puppet] - 10https://gerrit.wikimedia.org/r/618055 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [13:10:54] (03CR) 10Ema: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/618061 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [13:11:52] !log remove nonstop-bridging from asw-a-codfw - T191667 [13:11:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:56] T191667: Juniper HA audit - https://phabricator.wikimedia.org/T191667 [13:12:58] !log remove nonstop-bridging from asw-b-codfw - T191667 [13:13:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:13:38] (03PS1) 10Marostegui: report_users: Remove mysql.py PATH [software] - 10https://gerrit.wikimedia.org/r/618062 (https://phabricator.wikimedia.org/T259021) [13:14:25] !log remove nonstop-bridging from asw-c-codfw - T191667 [13:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:09] (03CR) 10Kormat: [C: 03+1] report_users: Remove mysql.py PATH [software] - 10https://gerrit.wikimedia.org/r/618062 
(https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [13:15:40] !log remove nonstop-bridging from asw-d-codfw - T191667 [13:15:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:06] (03CR) 10Marostegui: [C: 03+2] report_users: Remove mysql.py PATH [software] - 10https://gerrit.wikimedia.org/r/618062 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [13:16:07] (03Merged) 10jenkins-bot: report_users: Remove mysql.py PATH [software] - 10https://gerrit.wikimedia.org/r/618062 (https://phabricator.wikimedia.org/T259021) (owner: 10Marostegui) [13:16:53] (03CR) 10Ema: [C: 03+2] atskafka: lower request.required.acks [puppet] - 10https://gerrit.wikimedia.org/r/618061 (https://phabricator.wikimedia.org/T254317) (owner: 10Ema) [13:18:50] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [13:21:04] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=swagger_check_cxserver_cluster_eqiad site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:24:58] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [13:27:18] !log installing libopenmpt security updates [13:27:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:04] (03PS1) 10Kormat: switchover: Fix import path [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618063 [13:31:40] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:32:24] (03PS1) 10Muehlenhoff: Add libopenmpt library hint [puppet] - 10https://gerrit.wikimedia.org/r/618064 [13:35:50] (03PS2) 10Muehlenhoff: Add libopenmpt library hint [puppet] - 10https://gerrit.wikimedia.org/r/618064 [13:40:49] (03CR) 10Muehlenhoff: [C: 03+2] Add libopenmpt library hint [puppet] - 10https://gerrit.wikimedia.org/r/618064 (owner: 10Muehlenhoff) [13:43:08] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [13:50:00] (03PS1) 10C. 
Scott Ananian: Bump wikimedia/parsoid to v0.13.0-a3 [vendor] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/618068 (https://phabricator.wikimedia.org/T259311) [13:54:20] (03PS1) 10Filippo Giunchedi: grafana: temp disable grafana db sync ahead of upgrade [puppet] - 10https://gerrit.wikimedia.org/r/618069 (https://phabricator.wikimedia.org/T259143) [13:55:46] (03CR) 10CDanis: [C: 03+2] Revert "Revert "puppetmaster: clearer edit message when you can't rewrite history"" [puppet] - 10https://gerrit.wikimedia.org/r/618032 (owner: 10CDanis) [13:56:12] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin A:puppetmaster 'disable-puppet "cdanis deploying I92e9a05"' [13:56:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:00:28] !log ✔️ cdanis@cumin1001.eqiad.wmnet ~ 🕙☕ sudo cumin A:puppetmaster 'enable-puppet "cdanis deploying I92e9a05"' [14:00:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:51] (03Abandoned) 10CDanis: WIP: enforce match between LVS & conftool pools [puppet] - 10https://gerrit.wikimedia.org/r/615877 (https://phabricator.wikimedia.org/T258648) (owner: 10CDanis) [14:03:53] !log filippo@deploy1001 Started deploy [librenms/librenms@413e006]: Upgrade LibreNMS to 1.66 - T257017 [14:03:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:16] !log filippo@deploy1001 Finished deploy [librenms/librenms@413e006]: Upgrade LibreNMS to 1.66 - T257017 (duration: 00m 23s) [14:04:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:18] (03CR) 10Jcrespo: [C: 03+1] "+1 if you tested this works. 
Only 2 questions:" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618063 (owner: 10Kormat) [14:05:36] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... [14:05:47] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [14:06:38] (03CR) 10RLazarus: [C: 03+1] Revert "Revert "puppetmaster: clearer edit message when you can't rewrite history"" [puppet] - 10https://gerrit.wikimedia.org/r/618032 (owner: 10CDanis) [14:13:24] (03CR) 10Lars Wirzenius: "I'm afraid I don't know PHP or MediaWiki and cannot provide a useful review of this change." [vendor] (wmf/1.36.0-wmf.2) - 10https://gerrit.wikimedia.org/r/618068 (https://phabricator.wikimedia.org/T259311) (owner: 10C. Scott Ananian) [14:17:23] (03CR) 10Kormat: "> Patch Set 1: Code-Review+1" [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/618063 (owner: 10Kormat) [14:20:03] 10Operations, 10CommRel-Specialists-Support (Jul-Sep-2020), 10User-notice: CommRel support for FY2020-2021 Q1 DC switchover - https://phabricator.wikimedia.org/T244808 (10Trizek-WMF) 05Stalled→03Open Thank you! I'm starting the pre-announces. 
[14:20:06] 10Operations, 10Goal: FY2020-2021 Q1 DC switchover and switchback - https://phabricator.wikimedia.org/T243314 (10Trizek-WMF) [14:20:37] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [14:20:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:22:44] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [14:22:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:23:50] (03PS12) 10Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - 10https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [14:25:13] 10Operations, 10DBA, 10User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (10Kormat) [14:25:22] 10Operations, 10DBA, 10User-Kormat: DBA python layout - https://phabricator.wikimedia.org/T259516 (10Kormat) p:05Triage→03Medium [14:27:45] !log remove nonstop-bridging from fasw-c-codfw - T191667 [14:27:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:27:49] !log marostegui@cumin1001 dbctl commit (dc=all): 'Repool db1089 into dump and depool db1106', diff saved to https://phabricator.wikimedia.org/P12142 and previous config saved to /var/cache/conftool/dbconfig/20200803-142749-marostegui.json [14:27:49] T191667: Juniper HA audit - https://phabricator.wikimedia.org/T191667 [14:27:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:47] !log remove IGMP and PIM from pfw3-codfw security zones [14:28:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:28:49] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:29:06] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) p:05Triage→03Medium [14:31:34] (03PS1) 10Filippo Giunchedi: librenms: update .env file with db connection parameters [puppet] - 
10https://gerrit.wikimedia.org/r/618073 (https://phabricator.wikimedia.org/T257017) [14:33:14] !log disable all ALGs from pfw3-codfw [14:33:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:38] (03CR) 10Filippo Giunchedi: "PCC https://puppet-compiler.wmflabs.org/compiler1003/24281/netmon1002.wikimedia.org/index.html" [puppet] - 10https://gerrit.wikimedia.org/r/618073 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [14:36:08] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1031.eqiad.wmnet'] ` Of which... [14:40:22] !log update Buster netboot images to Buster 10.5 T259519 [14:40:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:40:25] T259519: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 [14:41:01] 10Operations: Integrate Buster 10.5 point release - https://phabricator.wikimedia.org/T259519 (10MoritzMuehlenhoff) [14:41:58] (03CR) 10Cwhite: [C: 03+1] grafana: temp disable grafana db sync ahead of upgrade [puppet] - 10https://gerrit.wikimedia.org/r/618069 (https://phabricator.wikimedia.org/T259143) (owner: 10Filippo Giunchedi) [14:47:44] 10Operations: Integrate Stretch 9.13 point update - https://phabricator.wikimedia.org/T258407 (10MoritzMuehlenhoff) [14:47:47] (03PS1) 10ArielGlenn: comma-separate large counts for dump stats email [puppet] - 10https://gerrit.wikimedia.org/r/618080 [14:48:11] (03PS1) 10Vgutierrez: Release 0.28 [software/acme-chief] - 10https://gerrit.wikimedia.org/r/618081 (https://phabricator.wikimedia.org/T259338) [14:48:44] (03CR) 10ArielGlenn: [C: 03+2] comma-separate large counts for dump stats email [puppet] - 10https://gerrit.wikimedia.org/r/618080 (owner: 10ArielGlenn) [14:51:12] !log 
marostegui@cumin1001 dbctl commit (dc=all): 'Slowly repool db1106', diff saved to https://phabricator.wikimedia.org/P12143 and previous config saved to /var/cache/conftool/dbconfig/20200803-145111-marostegui.json [14:51:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:52:35] (03PS2) 10Filippo Giunchedi: librenms: update .env file with db connection parameters [puppet] - 10https://gerrit.wikimedia.org/r/618073 (https://phabricator.wikimedia.org/T257017) [14:52:37] (03PS1) 10Filippo Giunchedi: librenms: fix python3 dependency [puppet] - 10https://gerrit.wikimedia.org/r/618083 (https://phabricator.wikimedia.org/T257017) [14:53:11] (03CR) 10Ayounsi: [C: 03+1] "LGTM!" [puppet] - 10https://gerrit.wikimedia.org/r/618073 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [14:53:49] (03CR) 10Ayounsi: [C: 03+1] librenms: fix python3 dependency [puppet] - 10https://gerrit.wikimedia.org/r/618083 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [14:54:26] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/618083 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [14:55:16] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by andrew on cumin1001.eqiad.wmnet for... 
[14:56:10] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10Andrew) Networking looks better -- thanks @ayounsi [14:58:17] (03CR) 10Alexandros Kosiaris: [C: 03+1] "❤️" [puppet] - 10https://gerrit.wikimedia.org/r/617706 (owner: 10Jbond) [14:59:30] (03CR) 10Alexandros Kosiaris: [C: 03+1] Enable printBackground to fix style issues [deployment-charts] - 10https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: 10MSantos) [15:05:27] (03PS1) 10Michael Große: Create dispatch lag alerts for test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) [15:07:20] (03PS4) 10Ahmon Dancy: zuul::server: Replace hiera() with lookup() [puppet] - 10https://gerrit.wikimedia.org/r/617762 [15:08:12] (03PS2) 10Michael Große: Create dispatch lag alerts for test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) [15:08:31] (03CR) 10Ahmon Dancy: "> Patch Set 3:" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/617762 (owner: 10Ahmon Dancy) [15:08:47] (03PS9) 10Ebernhardson: Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: 10Elukey) [15:08:52] (03CR) 10jerkins-bot: [V: 04-1] Create dispatch lag alerts for test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [15:09:17] 10Operations, 10Citoid, 10serviceops: citoid /api LVS check reports HTTP 404 instead of HTTP 200 - https://phabricator.wikimedia.org/T259469 (10akosiaris) p:05Triage→03High @Mvolz any ideas? [15:09:22] (03CR) 10Michael Große: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [15:09:26] (03CR) 10Ebernhardson: [C: 03+1] Move mjolnir's daemons to search-loader hosts [puppet] - 10https://gerrit.wikimedia.org/r/616101 (https://phabricator.wikimedia.org/T258245) (owner: 10Elukey) [15:10:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:10:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:11:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:11:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:11:18] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:11:19] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:11:19] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:20] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [15:11:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:22] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:12:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:12:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:12:31] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime 
(exit_code=99) [15:12:32] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:12:32] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:12:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:12:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:28] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [15:14:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:00] (03CR) 1020after4: [C: 03+2] Selenium: Update to WebdriverIO v6 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [15:16:07] (03CR) 1020after4: [V: 03+2 C: 03+2] Selenium: Update to WebdriverIO v6 [phabricator/deployment] (wmf/stable) - 10https://gerrit.wikimedia.org/r/615801 (https://phabricator.wikimedia.org/T255471) (owner: 10Vidhi-Mody) [15:16:20] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [15:16:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:29] (03CR) 10Guergana Tzatchkova: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [15:18:27] 10Operations, 10netops, 10Patch-For-Review, 10Sustainability (Incident Followup): Juniper HA audit - https://phabricator.wikimedia.org/T191667 (10ayounsi) [15:24:16] 10Operations, 10Wikimedia-Mailing-lists, 10User-Ladsgroup: Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10akosiaris) I am guessing this isn't related to 
production DNS stuff, so removing #operations. But feel free to re-add. [15:27:35] (03PS3) 10Guergana Tzatchkova: Create dispatch lag alerts for test.wikidata.org [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [15:27:36] !log Change PK on frwiktionary.revision on db2087:3317, db2129, db2121 db2086:3317 T259524 [15:27:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:27:40] T259524: Review revision table and make sure that the PK is always rev_id - https://phabricator.wikimedia.org/T259524 [15:27:54] 10Operations, 10ops-eqiad, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudvirt1032.eqiad.wmnet', 'cloudvirt1... [15:28:09] (03CR) 10Guergana Tzatchkova: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: 10Michael Große) [15:28:52] akosiaris: seems there's a herald rule set to add #Operations to #Wikimedia-Mailing-Lists [15:28:56] https://phabricator.wikimedia.org/T259444#6356559 [15:29:44] (03PS16) 10Ayounsi: Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 [15:33:27] RhinosF1: looks like it. I guess I 'll just accept it and move on :-) [15:33:43] !log standardize all routers routing-options config [15:33:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:39:06] akosiaris: seems the simplest option. 
[15:45:46] (03CR) 10Ayounsi: [C: 03+2] Initial templating for CR routing-options [homer/public] - 10https://gerrit.wikimedia.org/r/547587 (owner: 10Ayounsi) [15:48:11] (03PS1) 10Ayounsi: Remove leftover eqord bgp_out [homer/public] - 10https://gerrit.wikimedia.org/r/618085 [15:49:12] (03CR) 10Ayounsi: [C: 03+2] Remove leftover eqord bgp_out [homer/public] - 10https://gerrit.wikimedia.org/r/618085 (owner: 10Ayounsi) [15:52:40] 10Puppet, 10Beta-Cluster-Infrastructure: MIssing hiera settings for deployment-parsoid11.deployment-prep.eqiad.wmflabs - https://phabricator.wikimedia.org/T259533 (10bd808) [15:55:22] <_joe_> !log regenerating the TLS certs for blubberoid [15:55:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:55:50] (03PS9) 10Ahmon Dancy: Add mtail program for monitoring the Zuul error log [puppet] - 10https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) [15:58:15] 10Operations, 10VPS-Projects, 10Wikimedia-Mailing-lists, 10User-Ladsgroup, 10cloud-services-team (Kanban): Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10bd808) [16:02:14] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: update .env file with db connection parameters [puppet] - 10https://gerrit.wikimedia.org/r/618073 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [16:02:27] (03CR) 10Filippo Giunchedi: [C: 03+2] librenms: fix python3 dependency [puppet] - 10https://gerrit.wikimedia.org/r/618083 (https://phabricator.wikimedia.org/T257017) (owner: 10Filippo Giunchedi) [16:02:32] !log oblivian@deploy1001 helmfile [staging] Ran 'sync' command on namespace 'blubberoid' for release 'staging' . 
[16:02:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:03:40] 10Operations, 10VPS-Projects, 10Wikimedia-Mailing-lists, 10User-Ladsgroup, 10cloud-services-team (Kanban): Request for creating a DNS record for lists.wmcloud.org to 185.15.56.28 - https://phabricator.wikimedia.org/T259444 (10bd808) >>! In T259444#6354494, @Ladsgroup wrote: >>>! In T259444#6354493, @Kren... [16:06:20] <_joe_> akosiaris: k8s staging seems to hang up indefinitely [16:06:42] <_joe_> oh it just returned, uhm [16:07:04] (03PS1) 10Filippo Giunchedi: librenms: hide diff for files with passwords [puppet] - 10https://gerrit.wikimedia.org/r/618088 (https://phabricator.wikimedia.org/T257017) [16:08:23] _joe_: it is behaving kind of weird indeed [16:09:07] (03PS1) 10Urbanecm: Turn muswiki and mhwiktionary to read-only [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618089 (https://phabricator.wikimedia.org/T259004) [16:09:09] (03PS1) 10Urbanecm: Point muswiki and mhwiktionary to s5 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) [16:09:11] (03PS1) 10Urbanecm: Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/618091 (https://phabricator.wikimedia.org/T259004) [16:09:36] 10Operations, 10Citoid, 10serviceops: citoid /api LVS check reports HTTP 404 instead of HTTP 200 - https://phabricator.wikimedia.org/T259469 (10Mvolz) >>! In T259469#6356479, @akosiaris wrote: > @Mvolz any ideas? A problem of our own making, I think! The test looks at https://wikimediafoundation.org/404 a... 
[16:11:39] (PS2) Urbanecm: Point muswiki and mhwiktionary to s5 [mediawiki-config] - https://gerrit.wikimedia.org/r/618090 (https://phabricator.wikimedia.org/T259004) [16:12:09] (PS2) Urbanecm: Revert "Turn muswiki and mhwiktionary to read-only" [mediawiki-config] - https://gerrit.wikimedia.org/r/618091 (https://phabricator.wikimedia.org/T259004) [16:12:10] Operations, ops-eqiad, Analytics-Radar, DC-Ops: (Need By: TBD) rack/setup/install an-worker11[02-17] - https://phabricator.wikimedia.org/T259071 (fdans) [16:13:06] (PS5) Dzahn: zuul::server: Replace hiera() with lookup() [puppet] - https://gerrit.wikimedia.org/r/617762 (owner: Ahmon Dancy) [16:13:40] Operations, Citoid, serviceops: citoid /api LVS check reports HTTP 200 instead of HTTP 404 - https://phabricator.wikimedia.org/T259469 (Mvolz) [16:16:24] !log oblivian@deploy1001 helmfile [eqiad] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [16:16:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:17:18] Operations, Citoid, serviceops, Patch-For-Review: citoid /api LVS check reports HTTP 200 instead of HTTP 404 - https://phabricator.wikimedia.org/T259469 (Dzahn) >>! In T259469#6356754, @Mvolz wrote: > https://wikimediafoundation.org/404 and expects it to 404 I would recommend to use wikimedia.or... [16:17:53] akosiaris, _joe_: nothing to be worried about but https://gerrit.wikimedia.org/r/c/mediawiki/services/citoid/+/618093 should fix the annoying alerts ^-^ [16:18:30] yea, i would avoid wikimediafoundation.org as a test URL.
that could always change without notice [16:18:56] let's use a domain on the wmf cluster, like en.wp as you already suggested [16:20:02] (PS1) Andrew Bogott: Galera: increase max allowed connections [puppet] - https://gerrit.wikimedia.org/r/618094 [16:20:25] <_joe_> mvolz: thanks :) [16:21:22] !log oblivian@deploy1001 helmfile [codfw] Ran 'sync' command on namespace 'blubberoid' for release 'production' . [16:21:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:20] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:25:48] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/24283/" [puppet] - https://gerrit.wikimedia.org/r/617762 (owner: Ahmon Dancy) [16:26:07] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds.
https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [16:31:25] (PS2) Andrew Bogott: Galera: increase max allowed connections [puppet] - https://gerrit.wikimedia.org/r/618094 [16:37:16] (PS10) Dzahn: Add mtail program for monitoring the Zuul error log [puppet] - https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: Ahmon Dancy) [16:41:34] (CR) Dzahn: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1003/24284/contint1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: Ahmon Dancy) [16:45:25] (CR) Ladsgroup: "Puppet compiler output: https://puppet-compiler.wmflabs.org/compiler1003/24286/icinga1001.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: Michael Große) [16:46:37] (CR) Andrew Bogott: [C: +2] Galera: increase max allowed connections [puppet] - https://gerrit.wikimedia.org/r/618094 (owner: Andrew Bogott) [16:48:27] Puppet, Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (bd808) Another restart on 2020-08-03: https://sal.toolforge.org/log/n4IAtXMBj_Bg1xd3rkvT [16:49:03] (CR) Dzahn: "on contint1001/2001:" [puppet] - https://gerrit.wikimedia.org/r/617271 (https://phabricator.wikimedia.org/T258821) (owner: Ahmon Dancy) [16:51:17] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/618088 (https://phabricator.wikimedia.org/T257017) (owner: Filippo Giunchedi) [16:53:54] I raised the ticket to UBN and will now roll back [16:54:43] erm, wrong channel [16:54:55] anyway, train is going back to group [16:54:59] group1 [16:55:55] PROBLEM - SSH on stat1008 is CRITICAL: CRITICAL - Socket
timeout after 10 seconds https://wikitech.wikimedia.org/wiki/SSH/monitoring [16:56:18] (CR) Filippo Giunchedi: [C: +2] librenms: hide diff for files with passwords [puppet] - https://gerrit.wikimedia.org/r/618088 (https://phabricator.wikimedia.org/T257017) (owner: Filippo Giunchedi) [16:56:29] (PS2) Filippo Giunchedi: librenms: hide diff for files with passwords [puppet] - https://gerrit.wikimedia.org/r/618088 (https://phabricator.wikimedia.org/T257017) [16:58:36] !log liw@deploy1001 rebuilt and synchronized wikiversions files: Revert "group2 wikis to 1.36.0-wmf.1" [16:58:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:59:34] (PS1) Lars Wirzenius: Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/618096 [16:59:36] (CR) Lars Wirzenius: [C: +2] Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/618096 (owner: Lars Wirzenius) [17:00:04] gehel and onimisionipe: May I have your attention please! Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T1700) [17:00:04] Lucas_WMDE: A patch you scheduled for Wikidata Query Service weekly deploy is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[17:00:20] (Merged) jenkins-bot: Revert "all wikis to 1.36.0-wmf.2" [mediawiki-config] - https://gerrit.wikimedia.org/r/618096 (owner: Lars Wirzenius) [17:01:07] PROBLEM - DPKG on stat1008 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:01:20] Operations, Analytics, SRE-Access-Requests: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (herron) [17:01:25] (PS13) Ottomata: Initial debian commit [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) [17:01:34] o/ [17:01:39] RECOVERY - SSH on stat1008 is OK: SSH OK - OpenSSH_7.9p1 Debian-10+deb10u2 (protocol 2.0) https://wikitech.wikimedia.org/wiki/SSH/monitoring [17:02:04] PROBLEM - WMCS Galera Database #page on cloudcontrol1004 is CRITICAL: Error during connection: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:02:05] PROBLEM - WMCS Galera Cluster #page on cloudcontrol1004 is CRITICAL: Error during connection: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:02:30] paged [17:02:32] :wave [17:02:32] andrewbogott^ [17:02:34] wmcs? [17:02:36] 👋 that is [17:02:39] hey [17:02:43] 👋 [17:02:43] not sure if WMCS get pages? [17:02:48] they do [17:02:55] however...
[17:03:00] o7 [17:03:07] We just adjusted some settings, not sure why it paged everyone [17:03:22] unless puppet managed to restart every node at once [17:03:24] looks like it's being discussed in #-cloud-admin [17:05:00] yeah, the WMCS SREs are on the Galera page (and actually caused it with a config change) [17:05:28] We get loads of pages :) [17:05:52] Quite surprised any of that paged anyone else, actually [17:06:03] I guess we have some alerts to review in puppet [17:06:38] (PS1) Andrew Bogott: Galera: don't restart db service on config change [puppet] - https://gerrit.wikimedia.org/r/618097 [17:06:51] heh yes [17:07:33] (CR) Bstorm: [C: +1] "Yeah, we should do that by hand, if needed." [puppet] - https://gerrit.wikimedia.org/r/618097 (owner: Andrew Bogott) [17:07:35] ACKNOWLEDGEMENT - WMCS Galera Cluster #page on cloudcontrol1004 is CRITICAL: Error during connection: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out andrew bogott investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:07:36] ACKNOWLEDGEMENT - WMCS Galera Database #page on cloudcontrol1004 is CRITICAL: Error during connection: Lost connection to MySQL server at waiting for initial communication packet, system error: 110 Connection timed out andrew bogott investigating https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:08:25] (CR) Andrew Bogott: [C: +2] Galera: don't restart db service on config change [puppet] - https://gerrit.wikimedia.org/r/618097 (owner: Andrew Bogott) [17:09:39] Operations, Analytics, SRE-Access-Requests: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (herron) @Nuria could you please review and give a thumbs up/down on the request for `analytics-privatedata-users` membership? @DVrandecic could you pleas...
[17:16:13] (PS1) Bstorm: galera: don't page prod SRE for this cluster [puppet] - https://gerrit.wikimedia.org/r/618100 [17:17:15] (CR) Dzahn: [C: +1] galera: don't page prod SRE for this cluster [puppet] - https://gerrit.wikimedia.org/r/618100 (owner: Bstorm) [17:17:41] (CR) Bstorm: [C: +2] galera: don't page prod SRE for this cluster [puppet] - https://gerrit.wikimedia.org/r/618100 (owner: Bstorm) [17:18:32] RECOVERY - WMCS Galera Database #page on cloudcontrol1004 is OK: Database seems up and running... https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:18:33] RECOVERY - WMCS Galera Cluster #page on cloudcontrol1004 is OK: OK wsrep_cluster_size: 3, wsrep_cluster_status: Primary https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [17:28:08] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@20dcff3]: (no justification provided) [17:28:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:28:44] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@20dcff3]: (no justification provided) (duration: 00m 35s) [17:28:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:29:06] (CR) Dzahn: "having 2 separate monitoring::service's that are the same besides a threshold is an unusual pattern.
can't we have one check that is WARN " [puppet] - https://gerrit.wikimedia.org/r/618084 (https://phabricator.wikimedia.org/T258374) (owner: Michael Große) [17:29:22] jouncebot: now [17:29:23] For the next 0 hour(s) and 0 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T1700) [17:30:02] that deploy window closed, I'm going to roll train forward again [17:30:06] well timed :) [17:30:41] (PS1) Lars Wirzenius: all wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/618104 [17:30:43] (CR) Lars Wirzenius: [C: +2] all wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/618104 (owner: Lars Wirzenius) [17:31:24] (Merged) jenkins-bot: all wikis to 1.36.0-wmf.2 [mediawiki-config] - https://gerrit.wikimedia.org/r/618104 (owner: Lars Wirzenius) [17:32:01] RECOVERY - DPKG on stat1008 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:32:42] jouncebot: now [17:32:42] No deployments scheduled for the next 0 hour(s) and 27 minute(s) [17:32:53] jouncebot: next [17:32:53] In 0 hour(s) and 27 minute(s): Morning backport window(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T1800) [17:33:26] !log liw@deploy1001 rebuilt and synchronized wikiversions files: all wikis to 1.36.0-wmf.2 [17:33:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:33:44] we're back at group2 again [17:33:52] (CR) Ottomata: [V: +2 C: +2] Initial debian commit [debs/anaconda-wmf] (debian) - https://gerrit.wikimedia.org/r/610880 (https://phabricator.wikimedia.org/T251006) (owner: Ottomata) [17:35:12] Operations, observability: db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (herron) [17:36:58] Operations, observability: db1082 failed on Jul 18th and 25th, however on the
25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (herron) [17:39:35] Operations, observability: db1082 failed on Jul 18th and 25th, however on the 25th pages didn't go out to VO/phones - https://phabricator.wikimedia.org/T259465 (herron) I've updated the description to outline the two auto-retrigger and auto-resolve options as available by VO today. IMO a good near-term... [17:40:15] (PS1) Ottomata: Install anaconda-wmf on stat nodes [puppet] - https://gerrit.wikimedia.org/r/618106 (https://phabricator.wikimedia.org/T251006) [17:40:46] (PS2) Ottomata: Install anaconda-wmf on stat nodes [puppet] - https://gerrit.wikimedia.org/r/618106 (https://phabricator.wikimedia.org/T251006) [17:45:07] PROBLEM - DPKG on stat1005 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [17:48:52] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (Andrew) [17:49:14] Operations, ops-eqiad, DC-Ops, Patch-For-Review, cloud-services-team (Hardware): (Need By: 2020-06-20) rack/setup/install cloudvirt10[31-39]eqiad.wmnet - https://phabricator.wikimedia.org/T251627 (Andrew) Open→Resolved All hosts are up and running canary VMs. I've marked them as 'act...
[17:50:59] (CR) Dzahn: [C: +2] "This is the https://en.wikipedia.org/wiki/Ladin_language" [dns] - https://gerrit.wikimedia.org/r/617860 (https://phabricator.wikimedia.org/T259432) (owner: Urbanecm) [17:52:48] I'm going to finish the wdqs deploy [17:53:19] !log dcausse@deploy1001 Started deploy [wdqs/wdqs@20dcff3]: deploy 0.3.43 and gui update [17:53:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:09] (PS1) Andrew Bogott: cloudvirt1031, 1032 -> buster [puppet] - https://gerrit.wikimedia.org/r/618124 (https://phabricator.wikimedia.org/T259399) [17:56:16] (CR) Andrew Bogott: [C: +2] cloudvirt1031, 1032 -> buster [puppet] - https://gerrit.wikimedia.org/r/618124 (https://phabricator.wikimedia.org/T259399) (owner: Andrew Bogott) [18:00:04] RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Morning backport window(Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T1800).
[18:02:50] (PS1) Ssingh: dnsdist: update value for IP rate-limiting [puppet] - https://gerrit.wikimedia.org/r/618127 (https://phabricator.wikimedia.org/T252132) [18:04:54] (CR) Ssingh: "https://puppet-compiler.wmflabs.org/compiler1002/24287/malmok.wikimedia.org/index.html" [puppet] - https://gerrit.wikimedia.org/r/618127 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [18:05:44] (CR) Ssingh: [C: +2] dnsdist: update value for IP rate-limiting [puppet] - https://gerrit.wikimedia.org/r/618127 (https://phabricator.wikimedia.org/T252132) (owner: Ssingh) [18:09:12] !log dcausse@deploy1001 Finished deploy [wdqs/wdqs@20dcff3]: deploy 0.3.43 and gui update (duration: 15m 53s) [18:09:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:10:14] (CR) Herron: [C: +2] exim4: move daily paniclog rotate from exim4-base to exim4-paniclog [puppet] - https://gerrit.wikimedia.org/r/617529 (https://phabricator.wikimedia.org/T257016) (owner: Herron) [18:13:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:13:42] !log andrew@cumin1001 START - Cookbook sre.hosts.downtime [18:13:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:45] !log andrew@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [18:13:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:52] !log andrew@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [18:15:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:27:24] RECOVERY - DPKG on stat1005 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:28:20] (CR) Mholloway: "> Patch Set 1:" [deployment-charts] - https://gerrit.wikimedia.org/r/617728 (https://phabricator.wikimedia.org/T52178) (owner: MSantos) [19:28:43] (PS1) Herron: dns:
rename kibana-next.svc to kibana7.svc [dns] - https://gerrit.wikimedia.org/r/618140 [19:30:35] (PS1) Ottomata: Bump refine job refinery version to 0.0.132 to fix $schema field bug [puppet] - https://gerrit.wikimedia.org/r/618141 (https://phabricator.wikimedia.org/T255818) [19:31:50] (PS2) Ottomata: Bump refine job refinery version to 0.0.132 to fix $schema field bug [puppet] - https://gerrit.wikimedia.org/r/618141 (https://phabricator.wikimedia.org/T255818) [19:36:03] (CR) Ottomata: [C: +2] Bump refine job refinery version to 0.0.132 to fix $schema field bug [puppet] - https://gerrit.wikimedia.org/r/618141 (https://phabricator.wikimedia.org/T255818) (owner: Ottomata) [19:41:27] (CR) Herron: "Hey Chris, Joe, Valentin, could I ask you to give this patch and the approach a sanity check and lmk what's missing? Haven't been through" [puppet] - https://gerrit.wikimedia.org/r/616124 (owner: Herron) [19:48:47] (CR) Dzahn: [C: -1] logstash-next: change backend naming from kibana-next to kibana7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/616124 (owner: Herron) [19:49:32] (CR) Dzahn: [C: -1] logstash-next: change backend naming from kibana-next to kibana7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/616124 (owner: Herron) [20:00:05] halfak and accraze: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Graphoid / ORES deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T2000).
[20:01:02] (CR) Herron: logstash-next: change backend naming from kibana-next to kibana7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/616124 (owner: Herron) [20:01:47] (CR) Herron: logstash-next: change backend naming from kibana-next to kibana7 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/616124 (owner: Herron) [20:03:32] RECOVERY - Citoid LVS codfw on citoid.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:04:06] RECOVERY - Citoid LVS eqiad on citoid.svc.eqiad.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Citoid [20:04:18] (CR) Herron: [C: +1] grafana: temp disable grafana db sync ahead of upgrade [puppet] - https://gerrit.wikimedia.org/r/618069 (https://phabricator.wikimedia.org/T259143) (owner: Filippo Giunchedi) [20:06:02] (PS7) Jdlrobson: Switch test wikis to new version of vector by default (3/3) [mediawiki-config] - https://gerrit.wikimedia.org/r/614891 (https://phabricator.wikimedia.org/T254227) [20:08:26] Operations, ops-eqiad, Analytics-Clusters, DC-Ops: analytics1050 host + mgmt down - https://phabricator.wikimedia.org/T258370 (Cmjohnson) @elukey This may need a hard power reset. Can I take it down? [20:09:44] PROBLEM - WMCS Galera Database on cloudcontrol1004 is CRITICAL: Error during connection: Cant connect to MySQL server on 208.80.154.132 (110 Connection timed out) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:10:18] PROBLEM - WMCS Galera Cluster on cloudcontrol1004 is CRITICAL: Error during connection: Cant connect to MySQL server on 208.80.154.132 (110 Connection timed out) https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:10:22] (CR) Herron: [C: +1] "Thanks for this!
LGTM" [puppet] - https://gerrit.wikimedia.org/r/617842 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [20:17:29] PROBLEM - mysql -galera- process #page on cloudcontrol1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:17:42] hmm. [20:17:56] paged [20:18:04] same as earlier the day? [20:18:20] andrewbogott: ^^ [20:18:35] hm, bstorm tried to fix those pages so you don't get them [20:18:43] I guess we missed one [20:18:46] 👋 [20:19:01] kk, under control then? [20:19:29] largely yeah [20:19:30] thanks [20:19:31] ACKNOWLEDGEMENT - WMCS Galera Cluster on cloudcontrol1004 is CRITICAL: Error during connection: Cant connect to MySQL server on 208.80.154.132 (111 Connection refused) andrew bogott this is wmcs-specific were working on it https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:31] ACKNOWLEDGEMENT - WMCS Galera Database on cloudcontrol1004 is CRITICAL: Error during connection: Cant connect to MySQL server on 208.80.154.132 (111 Connection refused) andrew bogott this is wmcs-specific were working on it https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:32] ACKNOWLEDGEMENT - mysql -galera- process #page on cloudcontrol1004 is CRITICAL: PROCS CRITICAL: 0 processes with command name mysqld andrew bogott this is wmcs-specific were working on it https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [20:19:45] gotcha ok, thanks [20:19:54] The host might still be marked critical or something. I'll check a few things. Sometimes it just needs more puppet runs [20:28:31] (CR) Herron: [C: +1] "Nice, looking forward to working with this! Optional nitpick in line."
(2 comments) [puppet] - https://gerrit.wikimedia.org/r/617688 (https://phabricator.wikimedia.org/T258948) (owner: Filippo Giunchedi) [20:35:11] (PS1) Bstorm: galera: remove "critical" tag from another monitor spot [puppet] - https://gerrit.wikimedia.org/r/618147 [20:40:16] (CR) Herron: prometheus: puppetized install of prometheus-es-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite) [20:45:29] (PS1) Andrew Bogott: Mariadb/galera: override TimeoutStartSec/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/618148 [20:45:53] (CR) jerkins-bot: [V: -1] Mariadb/galera: override TimeoutStartSec/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/618148 (owner: Andrew Bogott) [20:46:22] PROBLEM - Host mr1-eqsin IPv6 is DOWN: PING CRITICAL - Packet loss = 100% [20:47:38] (PS2) Andrew Bogott: Mariadb/galera: override TimeoutStartSec/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/618148 [20:47:54] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:49:34] (CR) Subramanya Sastry: "I suppose this is not needed anymore since the train rolled ahead with the fix?" [vendor] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/618068 (https://phabricator.wikimedia.org/T259311) (owner: C. Scott Ananian) [20:51:40] (CR) Subramanya Sastry: "> Patch Set 1:" [vendor] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/618068 (https://phabricator.wikimedia.org/T259311) (owner: C.
Scott Ananian) [20:51:46] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:53:12] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [20:54:22] PROBLEM - Host mr1-eqsin.oob IPv6 is DOWN: PING CRITICAL - Packet loss = 71%, RTA = 3852.16 ms [20:55:34] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.8 with snmp version 2 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:56:35] (PS3) Andrew Bogott: Mariadb/galera: override TimeoutStartSec/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/618148 [20:58:10] (CR) Andrew Bogott: [C: +1] galera: remove "critical" tag from another monitor spot [puppet] - https://gerrit.wikimedia.org/r/618147 (owner: Bstorm) [20:58:28] RECOVERY - Host mr1-eqsin IPv6 is UP: PING WARNING - Packet loss = 33%, RTA = 1866.04 ms [20:59:08] (CR) Cwhite: prometheus: puppetized install of prometheus-es-exporter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite) [20:59:20] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 38, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [21:00:05] Reedy and sbassett: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Weekly Security deployment window deploy.
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T2100). [21:00:20] RECOVERY - Host mr1-eqsin.oob IPv6 is UP: PING OK - Packet loss = 0%, RTA = 224.54 ms [21:00:56] RECOVERY - MediaWiki exceptions and fatals per minute on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [21:02:19] (PS4) Andrew Bogott: Mariadb/galera: override TimeoutStartSec/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/618148 [21:03:40] (CR) Bstorm: [C: +1] "Let's try it!" [puppet] - https://gerrit.wikimedia.org/r/618148 (owner: Andrew Bogott) [21:04:12] (CR) Andrew Bogott: [C: +2] Mariadb/galera: override TimeoutStartSec/TimeoutStopSec [puppet] - https://gerrit.wikimedia.org/r/618148 (owner: Andrew Bogott) [21:04:42] (CR) Bstorm: [C: +2] galera: remove "critical" tag from another monitor spot [puppet] - https://gerrit.wikimedia.org/r/618147 (owner: Bstorm) [21:07:25] (PS1) Andrew Bogott: Galera: move init script to the right module [puppet] - https://gerrit.wikimedia.org/r/618149 [21:08:00] (CR) Andrew Bogott: [C: +2] Galera: move init script to the right module [puppet] - https://gerrit.wikimedia.org/r/618149 (owner: Andrew Bogott) [21:10:56] (PS1) Andrew Bogott: Galera: split out the running/stopped logic into a service define [puppet] - https://gerrit.wikimedia.org/r/618150 [21:11:27] (CR) Andrew Bogott: [C: +2] Galera: split out the running/stopped logic into a service define [puppet] - https://gerrit.wikimedia.org/r/618150 (owner: Andrew Bogott) [21:14:11] Hey all - two sec patches going out during the window: T115888 and T86738.
[21:14:53] !log sbassett@deploy1001 Synchronized php-1.36.0-wmf.2/resources/src/mediawiki.jqueryMsg/mediawiki.jqueryMsg.js: (no justification provided) (duration: 01m 00s) [21:14:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:15:43] (PS1) Andrew Bogott: Galera: another attempt to get the ensure of the service right [puppet] - https://gerrit.wikimedia.org/r/618151 [21:16:15] (PS2) Andrew Bogott: Galera: another attempt to get the ensure of the service right [puppet] - https://gerrit.wikimedia.org/r/618151 [21:17:24] Operations, Citoid, Services: Bind Citoid service to a static IP address - https://phabricator.wikimedia.org/T259040 (kaldari) [21:17:48] (CR) Andrew Bogott: [C: +2] Galera: another attempt to get the ensure of the service right [puppet] - https://gerrit.wikimedia.org/r/618151 (owner: Andrew Bogott) [21:19:12] Operations, Citoid, Services: Bind Citoid service to a static IP address - https://phabricator.wikimedia.org/T259040 (kaldari) [21:19:58] (CR) Brennen Bearnes: "> I had previously reviewed and merged this on vendor master. I think Scott cherry-picked this onto the branch for train rollout since the" [vendor] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/618068 (https://phabricator.wikimedia.org/T259311) (owner: C.
Scott Ananian) [21:22:07] Operations, Citoid, Services: Bind Citoid service to a static IP address (or addresses) - https://phabricator.wikimedia.org/T259040 (kaldari) [21:28:19] (PS1) Andrew Bogott: Galera: fix our override file [puppet] - https://gerrit.wikimedia.org/r/618152 [21:32:18] Operations, Analytics, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (Nuria) @paravoid says this is quite useful for DOS prevention/troubleshooting so putting it on our next up kanban for this quarter [21:32:29] Operations, Analytics, Analytics-Kanban, netops: Add more dimensions in the netflow/pmacct/Druid pipeline - https://phabricator.wikimedia.org/T254332 (Nuria) [21:35:34] (CR) Subramanya Sastry: "> Patch Set 1:" [vendor] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/618068 (https://phabricator.wikimedia.org/T259311) (owner: C. Scott Ananian) [21:35:46] (Abandoned) Subramanya Sastry: Bump wikimedia/parsoid to v0.13.0-a3 [vendor] (wmf/1.36.0-wmf.2) - https://gerrit.wikimedia.org/r/618068 (https://phabricator.wikimedia.org/T259311) (owner: C. Scott Ananian) [21:35:49] !log Deployed mitigations for T115888 [21:35:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:41:51] RECOVERY - mysql -galera- process #page on cloudcontrol1004 is OK: PROCS OK: 1 process with command name mysqld https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:42:10] RECOVERY - WMCS Galera Cluster on cloudcontrol1004 is OK: OK wsrep_cluster_size: 3, wsrep_cluster_status: Primary https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:43:30] RECOVERY - WMCS Galera Database on cloudcontrol1004 is OK: Database seems up and running...
https://wikitech.wikimedia.org/wiki/Portal:Cloud_VPS/Admin/Troubleshooting [21:43:55] (CR) Andrew Bogott: [C: +2] Galera: fix our override file [puppet] - https://gerrit.wikimedia.org/r/618152 (owner: Andrew Bogott) [22:01:15] (PS1) Arlolra: Be explicit about disabling nativeGallery [mediawiki-config] - https://gerrit.wikimedia.org/r/618155 [22:03:19] (CR) Dzahn: lists: Use hiera value instead of hard-coded value "lists.wikimedia.org" (2 comments) [puppet] - https://gerrit.wikimedia.org/r/617842 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [22:05:12] (CR) Dzahn: "compiler output looks good https://puppet-compiler.wmflabs.org/compiler1001/24289/" [puppet] - https://gerrit.wikimedia.org/r/617842 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [22:08:08] (CR) Dzahn: [C: +1] "seems good, i'd just double-check the interface::alias doesn't mess up and leaves us with 2 aliases or something" [puppet] - https://gerrit.wikimedia.org/r/617842 (https://phabricator.wikimedia.org/T256536) (owner: Ladsgroup) [22:15:17] PROBLEM - Router interfaces on cr3-knams is CRITICAL: CRITICAL: host 91.198.174.246, interfaces up: 77, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:25:11] (CR) Cwhite: [C: +2] debianization [debs/prometheus-es-exporter] (debian/sid) - https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite) [22:25:15] (CR) Cwhite: [V: +2 C: +2] debianization [debs/prometheus-es-exporter] (debian/sid) - https://gerrit.wikimedia.org/r/617250 (https://phabricator.wikimedia.org/T222826) (owner: Cwhite) [22:30:28] RECOVERY - Router interfaces on cr3-knams is OK: OK: host 91.198.174.246, interfaces up: 79, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [22:34:04] Operations, Mail, OTRS,
Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (Dzahn) Open→Stalled p:Medium→Low [22:34:40] (PS2) Dzahn: mediawiki::maintenance: load mod_security2 also on mwmaint*, not just mw* [puppet] - https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629) [22:35:26] (CR) Dzahn: [C: +2] mediawiki::maintenance: load mod_security2 also on mwmaint*, not just mw* [puppet] - https://gerrit.wikimedia.org/r/607848 (https://phabricator.wikimedia.org/T255629) (owner: Dzahn) [22:40:54] Operations, Analytics, SRE-Access-Requests: Requesting access to production shell for Denny Vrandecic - https://phabricator.wikimedia.org/T259388 (Nuria) @herron: approved on my end. @DVrandecic please be so kind to read the quite important data access guidelines (to summarize: data cannot leave WMF... [22:44:53] !loading apache mod_security2 on mwmaint* servers as it is on regular mw* appservers already [22:47:34] Operations, Mail, OTRS, Trust-and-Safety, and 2 others: Forward emails addressed to privacy@wikidata to privacy@wikimedia - https://phabricator.wikimedia.org/T255733 (DannyS712) >>! At https://www.wikidata.org/wiki/Wikidata:Living_people#Requests_for_the_removal_of_private_information it currentl... [22:49:46] PROBLEM - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [22:50:30] PROBLEM - HTTPS-noc on mwmaint2001 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 244 bytes in 1.155 second response time https://wikitech.wikimedia.org/wiki/Noc.wikimedia.org [23:00:04] RoanKattouw, Niharika, and Urbanecm: It is that lovely time of the day again! You are hereby commanded to deploy Evening backport window(Max 6 patches). 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20200803T2300). [23:00:04] RoanKattouw: A patch you scheduled for Evening backport window(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:18] I'll self-serve [23:02:08] (PS2) Catrope: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/617442 (https://phabricator.wikimedia.org/T253291) [23:02:40] (CR) Catrope: [C: +2] Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/617442 (https://phabricator.wikimedia.org/T253291) (owner: Catrope) [23:03:27] (Merged) jenkins-bot: Enable GrowthExperiments on Persian Wikipedia [mediawiki-config] - https://gerrit.wikimedia.org/r/617442 (https://phabricator.wikimedia.org/T253291) (owner: Catrope) [23:04:06] RECOVERY - HTTPS-noc on mwmaint2001 is OK: HTTP OK: HTTP/1.1 200 OK - 3604 bytes in 2.572 second response time https://wikitech.wikimedia.org/wiki/Noc.wikimedia.org [23:04:20] ACKNOWLEDGEMENT - Check systemd state on mwmaint2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn WIP https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:05:14] RECOVERY - Check systemd state on mwmaint2001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [23:10:07] Can someone check the stack trace for [93456897-fb74-4692-b516-73455a431166] please? 
[23:13:07] DannyS712: sure https://www.irccloud.com/pastebin/EMBYrbgi/ [23:14:11] (PS4) Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) [23:14:19] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Enable GrowthExperiments on fawiki (T253291) (duration: 00m 59s) [23:14:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:14:23] T253291: Deploy Growth features on Persian Wikipedia - https://phabricator.wikimedia.org/T253291 [23:14:52] Urbanecm does that warrant reporting? [23:15:05] DannyS712: depending if you can safely reproduce [23:15:33] (CR) jerkins-bot: [V: -1] prometheus: puppetized install of prometheus-es-exporter [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) (owner: Cwhite) [23:15:52] nope! it was trying to use JWB to take a page for deletion cleaning up after https://ace.wikipedia.org/wiki/Kusuih:BeuneuriUreu%C3%ABngNgui/159.146.18.31 and the page was since deleted (might have been deleted during the transaction?) [23:18:59] might be [23:22:00] (PS5) Cwhite: prometheus: puppetized install of prometheus-es-exporter [puppet] - https://gerrit.wikimedia.org/r/617260 (https://phabricator.wikimedia.org/T256418) [23:43:11] !log mwdebug1001 - temp installing apt-file for debugging an issue on mwmaint [23:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:57:08] Puppet, Beta-Cluster-Infrastructure: puppetdb on deployment-puppetdb03 keeps getting OOMKilled - https://phabricator.wikimedia.org/T248041 (Krenair) a:Krenair replacing with a medium instance, deployment-puppetdb04