[00:10:20] (PS3) DannyS712: General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178)
[00:11:18] (CR) jerkins-bot: [V: -1] General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178) (owner: DannyS712)
[00:13:44] (PS4) DannyS712: General cleanup of initialise settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178)
[01:07:43] (CR) Alex Monk: [C: +2] ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] - https://gerrit.wikimedia.org/r/530464 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[01:10:15] (PS5) DannyS712: General cleanup of initialize settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178)
[01:10:58] (Merged) jenkins-bot: ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] - https://gerrit.wikimedia.org/r/530464 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[01:11:00] (Merged) jenkins-bot: ocsp: Provide basic test coverage [software/acme-chief] - https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[01:12:02] (CR) Alex Monk: [C: +2] api: Allow acme-chief clients to fetch OCSP responses [software/acme-chief] - https://gerrit.wikimedia.org/r/530806 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[01:13:44] (CR) jenkins-bot: ocsp: Provide basic test coverage [software/acme-chief] - https://gerrit.wikimedia.org/r/530548 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[01:13:49] (CR) jenkins-bot: ocsp: Allow to load an existing OCSPResponse from disk [software/acme-chief] - https://gerrit.wikimedia.org/r/530464
(https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[01:24:24] (CR) Alex Monk: [C: +1] "ready to go, minor nitpick inline" (1 comment) [software/acme-chief] - https://gerrit.wikimedia.org/r/530465 (https://phabricator.wikimedia.org/T219765) (owner: Vgutierrez)
[01:28:22] (PS6) DannyS712: General cleanup of initialize settings [mediawiki-config] - https://gerrit.wikimedia.org/r/532280 (https://phabricator.wikimedia.org/T231178)
[03:18:57] RECOVERY - snapshot of s5 in codfw on db1115 is OK: snapshot for s5 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-26 02:05:58 from db2099.codfw.wmnet:3315 (640 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[03:59:09] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 28027 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[04:03:51] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[04:32:07] RECOVERY - snapshot of s6 in codfw on db1115 is OK: snapshot for s6 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-26 03:43:43 from db2097.codfw.wmnet:3316 (500 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[04:39:33] PROBLEM - Device not healthy -SMART- on db2055 is CRITICAL: cluster=mysql device=cciss,3 instance=db2055:9100 job=node site=codfw https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2055&var-datasource=codfw+prometheus/ops
[04:47:51] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 24351 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[04:57:01] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2055 is CRITICAL: cluster=mysql device=cciss,3 instance=db2055:9100 job=node site=codfw Marostegui will be decommissioned https://wikitech.wikimedia.org/wiki/SMART%23Alerts https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2055&var-datasource=codfw+prometheus/ops
[04:57:54] Operations, ops-codfw: Degraded RAID on db2035 - https://phabricator.wikimedia.org/T231176 (Marostegui) Open→Declined No need to replace these disks, as this host is ready for #dc-ops to decommission {T229784}
[05:00:55] Operations, ops-codfw, DC-Ops, decommission: Decommission db2035 - https://phabricator.wikimedia.org/T229784 (Marostegui)
[05:03:14] (PS1) Marostegui: db-codfw.php: Depool pc2009. [mediawiki-config] - https://gerrit.wikimedia.org/r/532287 (https://phabricator.wikimedia.org/T210725)
[05:04:34] (CR) Marostegui: [C: +2] db-codfw.php: Depool pc2009. [mediawiki-config] - https://gerrit.wikimedia.org/r/532287 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:05:33] (Merged) jenkins-bot: db-codfw.php: Depool pc2009. [mediawiki-config] - https://gerrit.wikimedia.org/r/532287 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:06:24] (CR) jenkins-bot: db-codfw.php: Depool pc2009.
[mediawiki-config] - https://gerrit.wikimedia.org/r/532287 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[05:08:13] !log Optimize tables on pc2009 - T210725
[05:08:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:20] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725
[05:09:04] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool pc2009 for optimize T210725 (duration: 02m 53s)
[05:09:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:12:59] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[05:25:18] !log Upload new mariadb 10.3 packages to repo
[05:25:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:25:51] (CR) Marostegui: [C: +2] control-mariadb-10.3*: Upgrade version [software] - https://gerrit.wikimedia.org/r/531902 (owner: Marostegui)
[05:26:18] (Merged) jenkins-bot: control-mariadb-10.3*: Upgrade version [software] - https://gerrit.wikimedia.org/r/531902 (owner: Marostegui)
[05:26:34] Operations, cloud-services-team, netops: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (Marostegui) Thanks @ayounsi!
Let me know when you are around today so we can get this going
[05:37:07] (CR) Vgutierrez: [C: +2] cache: Allow setting an arbitrary port for incoming TLS connections [puppet] - https://gerrit.wikimedia.org/r/531824 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[05:37:16] (PS3) Vgutierrez: cache: Allow setting an arbitrary port for incoming TLS connections [puppet] - https://gerrit.wikimedia.org/r/531824 (https://phabricator.wikimedia.org/T221594)
[05:52:39] (PS5) Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - https://gerrit.wikimedia.org/r/531510
[05:54:02] (CR) jerkins-bot: [V: -1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - https://gerrit.wikimedia.org/r/531510 (owner: Giuseppe Lavagetto)
[05:56:33] Operations, Packaging, serviceops, CPT Initiatives (Session Management Service (CDP2)), Core Platform Team Workboards (Green): Need help to create and deploy Debian-packaged Python 3 app - https://phabricator.wikimedia.org/T229980 (greg) #Packaging is primarily handled by #serviceops / #opera...
[05:58:28] Operations, Packaging, serviceops, CPT Initiatives (Session Management Service (CDP2)), Core Platform Team Workboards (Green): Need help to create and deploy Debian-packaged Python 3 app - https://phabricator.wikimedia.org/T229980 (Joe) >>! In T229980#5436945, @greg wrote: > #Packaging is pri...
[05:59:48] Operations, Packaging, serviceops, CPT Initiatives (Session Management Service (CDP2)), Core Platform Team Workboards (Green): Need help to create and deploy Debian-packaged Python 3 app - https://phabricator.wikimedia.org/T229980 (Joe) BTW I see the patch is still under review, and @Volans...
[06:04:05] (CR) Vgutierrez: "pcc is happy https://puppet-compiler.wmflabs.org/compiler1001/18008/" [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[06:15:45] PROBLEM - Disk space on elastic1018 is CRITICAL: DISK CRITICAL - free space: /srv 26629 MB (5% inode=99%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[06:23:04] (PS6) Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - https://gerrit.wikimedia.org/r/531510
[06:24:06] (CR) jerkins-bot: [V: -1] conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - https://gerrit.wikimedia.org/r/531510 (owner: Giuseppe Lavagetto)
[06:27:02] (CR) Ema: [C: +1] Release 8.0.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/531332 (owner: Vgutierrez)
[06:28:17] RECOVERY - Disk space on elastic1018 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=elastic1018&var-datasource=eqiad+prometheus/ops
[06:30:41] (CR) Ema: [C: +1] ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[06:30:57] (CR) Ema: [C: +1] ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[06:33:34] (CR) Vgutierrez: [C: +1] ATS: enable compress.so everywhere [puppet] - https://gerrit.wikimedia.org/r/531895 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[06:34:47] (PS7) Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] -
https://gerrit.wikimedia.org/r/531510
[06:35:38] (PS2) Ema: ATS: enable compress.so everywhere [puppet] - https://gerrit.wikimedia.org/r/531895 (https://phabricator.wikimedia.org/T227432)
[06:36:22] (CR) Ema: [C: +2] ATS: enable compress.so everywhere [puppet] - https://gerrit.wikimedia.org/r/531895 (https://phabricator.wikimedia.org/T227432) (owner: Ema)
[06:40:18] (PS1) Marostegui: tendril: Disable insert on innodb_trx and innodb_trx_log [software/tendril] - https://gerrit.wikimedia.org/r/532296
[06:40:57] (PS2) Marostegui: tendril: Disable insert on innodb_trx and innodb_trx_log [software/tendril] - https://gerrit.wikimedia.org/r/532296 (https://phabricator.wikimedia.org/T231182)
[06:41:02] (CR) Vgutierrez: [C: +2] Release 8.0.5-1wm1 [debs/trafficserver] - https://gerrit.wikimedia.org/r/531332 (owner: Vgutierrez)
[06:41:47] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/18011/mw1270.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/531510 (owner: Giuseppe Lavagetto)
[06:41:57] (PS8) Giuseppe Lavagetto: conftool::scripts::safe_service_restart: add pool/depool scripts [puppet] - https://gerrit.wikimedia.org/r/531510
[06:48:18] (PS3) Marostegui: tendril: Disable insert on innodb_trx and innodb_trx_log [software/tendril] - https://gerrit.wikimedia.org/r/532296 (https://phabricator.wikimedia.org/T231182)
[06:51:09] !log cp-upload: rolling ats-backend-restart to enable compress plugin
[06:51:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:51:21] (PS4) Marostegui: tendril: Disable insert on innodb_trx and innodb_trx_log [software/tendril] - https://gerrit.wikimedia.org/r/532296 (https://phabricator.wikimedia.org/T231182)
[06:51:28] Operations, Commons, Internet-Archive, serviceops: Uploading a big PDF file failed - https://phabricator.wikimedia.org/T231119 (Joe) A file of 473 MB surely goes over the large file limits
unless something changed recently. https://commons.wikimedia.org/wiki/Help:Server-side_upload still seems t...
[06:54:15] RECOVERY - snapshot of s7 in codfw on db1115 is OK: snapshot for s7 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-08-26 03:55:08 from db2100.codfw.wmnet:3317 (853 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups
[06:55:23] (PS2) Vgutierrez: ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594)
[06:55:25] (PS2) Vgutierrez: ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594)
[06:55:27] (PS3) Vgutierrez: ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594)
[06:55:29] (PS1) Vgutierrez: cache: Add missing tls_port parameter [puppet] - https://gerrit.wikimedia.org/r/532297 (https://phabricator.wikimedia.org/T221594)
[06:55:31] (PS1) Vgutierrez: cache: Allow setting an arbitrary redir_port [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594)
[06:58:50] (CR) Vgutierrez: "expected noop in pcc: https://puppet-compiler.wmflabs.org/compiler1002/18012/" [puppet] - https://gerrit.wikimedia.org/r/532297 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[07:01:04] (CR) Vgutierrez: "PCC shows almost a NOOP (redir_port parameter being added to the profile class): https://puppet-compiler.wmflabs.org/compiler1002/18013/" [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[07:06:18] (PS1) Giuseppe Lavagetto: conftool: fix templates [puppet] - https://gerrit.wikimedia.org/r/532308
[07:07:47] (CR) Giuseppe Lavagetto: [C: +2] conftool: fix templates [puppet] -
https://gerrit.wikimedia.org/r/532308 (owner: Giuseppe Lavagetto)
[07:20:53] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/featured/{year}/{month}/{day} (retrieve title of the featured article for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:22:21] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29
[07:25:39] (CR) Ema: [C: +1] cache: Add missing tls_port parameter [puppet] - https://gerrit.wikimedia.org/r/532297 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[07:25:45] (CR) Ema: [C: +1] cache: Allow setting an arbitrary redir_port [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[07:27:57] (PS2) Vgutierrez: cache: Allow setting an arbitrary redir_port [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594)
[07:27:59] (PS3) Vgutierrez: ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594)
[07:28:01] (PS3) Vgutierrez: ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594)
[07:28:03] (PS4) Vgutierrez: ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594)
[07:33:56] (PS3) Vgutierrez: cache: Allow setting an arbitrary redir_port [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594)
[07:33:58] (PS4) Vgutierrez: ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872
(https://phabricator.wikimedia.org/T221594)
[07:34:00] (PS4) Vgutierrez: ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594)
[07:34:02] (PS5) Vgutierrez: ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594)
[07:34:17] (CR) Vgutierrez: [C: +2] cache: Add missing tls_port parameter [puppet] - https://gerrit.wikimedia.org/r/532297 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[07:34:29] (PS2) Vgutierrez: cache: Add missing tls_port parameter [puppet] - https://gerrit.wikimedia.org/r/532297 (https://phabricator.wikimedia.org/T221594)
[07:35:10] (PS3) Giuseppe Lavagetto: parsoid: use safe service restarts [puppet] - https://gerrit.wikimedia.org/r/518669
[07:38:29] (CR) Giuseppe Lavagetto: [C: +2] parsoid: use safe service restarts [puppet] - https://gerrit.wikimedia.org/r/518669 (owner: Giuseppe Lavagetto)
[07:38:39] (PS4) Giuseppe Lavagetto: parsoid: use safe service restarts [puppet] - https://gerrit.wikimedia.org/r/518669
[07:38:46] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/18014/wtp1030.eqiad.wmnet/" [puppet] - https://gerrit.wikimedia.org/r/518669 (owner: Giuseppe Lavagetto)
[07:39:39] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[07:43:59] (PS1) Mobrovac: Scandium: Add the protocol to the rt-client config [puppet] - https://gerrit.wikimedia.org/r/532331 (https://phabricator.wikimedia.org/T230166)
[07:45:16] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE -
https://phabricator.wikimedia.org/T204056 (tramm) For me the lesson learned is that if talking in technical terms about name servers or similar one should be super precise and not fall back to shortcut...
[07:52:53] (PS4) Vgutierrez: cache: Add use_trafficserver_tls parameter to unified profile [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594)
[07:52:55] (PS5) Vgutierrez: ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594)
[07:52:57] (PS5) Vgutierrez: ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594)
[07:52:59] (PS6) Vgutierrez: ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594)
[07:55:04] (CR) Vgutierrez: "pcc shows almost a NOOP as expected (use_trafficserver_tls parameter being added) https://puppet-compiler.wmflabs.org/compiler1002/18016/" [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[07:58:39] (PS3) Ema: VCL: add support for blacklisting IPs [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063)
[08:03:20] (PS1) Filippo Giunchedi: Point statsd.codfw to statsd.eqiad [dns] - https://gerrit.wikimedia.org/r/532332
[08:03:22] (PS4) Ema: VCL: add support for blacklisting IPs [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063)
[08:06:10] (PS7) Vgutierrez: ATS: Provide websocket support [puppet] - https://gerrit.wikimedia.org/r/531885 (https://phabricator.wikimedia.org/T221594)
[08:06:12] (PS1) Vgutierrez: ATS: Fix port definition on trafficserver::monitoring [puppet] - https://gerrit.wikimedia.org/r/532333
(https://phabricator.wikimedia.org/T221594)
[08:07:40] (CR) Ema: [C: +2] VCL: add support for blacklisting IPs [puppet] - https://gerrit.wikimedia.org/r/531873 (https://phabricator.wikimedia.org/T231063) (owner: Ema)
[08:14:00] Operations, Commons, MediaWiki-File-management, Traffic, and 2 others: Picture from Commons not found from Singapore - https://phabricator.wikimedia.org/T231086 (Joe) p:Triage→Normal
[08:19:38] Operations, Release Pipeline, Maps (Kartotherian): Make jobprocessor's test not depend on external files - https://phabricator.wikimedia.org/T231009 (Joe) p:Triage→Normal
[08:21:22] Operations, Traffic: Allow blocking requests from specific networks on the edge - https://phabricator.wikimedia.org/T231063 (ema) Feature implemented and documented on https://wikitech.wikimedia.org/wiki/Varnish#HOWTO
[08:23:46] (CR) Ema: [C: +1] ATS: Fix port definition on trafficserver::monitoring [puppet] - https://gerrit.wikimedia.org/r/532333 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:24:37] Operations, Release Pipeline, Maps (Kartotherian): Make jobprocessor's test not depend on external files - https://phabricator.wikimedia.org/T231009 (Joe) @Mathew.onipe can I ask further details on the error you get? It should definitely not be an issue if the test works in a docker image locally.
[08:25:46] (CR) Ema: [C: +1] cache: Add use_trafficserver_tls parameter to unified profile [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:26:13] (CR) Vgutierrez: [C: +2] cache: Add use_trafficserver_tls parameter to unified profile [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:26:25] (PS5) Vgutierrez: cache: Add use_trafficserver_tls parameter to unified profile [puppet] - https://gerrit.wikimedia.org/r/532298 (https://phabricator.wikimedia.org/T221594)
[08:26:39] Operations, Continuous-Integration-Config, Patch-For-Review: Add CI to all operations/* repositories and archive obsolete ones - https://phabricator.wikimedia.org/T180330 (hashar) Open→Resolved a:hashar Low hanging fruits had been resolved at time. Then it is a never ending task to add CI...
[08:26:39] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1
[08:32:01] (CR) Vgutierrez: [C: +2] ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:32:17] (PS6) Vgutierrez: ATS: Allow specifying timeouts to TTFB in connections to origin servers [puppet] - https://gerrit.wikimedia.org/r/531872 (https://phabricator.wikimedia.org/T221594)
[08:34:44] (CR) Vgutierrez: [C: +2] ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] - https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:34:57] (PS6) Vgutierrez: ATS: Set origin TTFB timeout to 180 secs for TLS instance [puppet] -
https://gerrit.wikimedia.org/r/531875 (https://phabricator.wikimedia.org/T221594)
[08:38:38] (CR) Vgutierrez: [C: +2] "expected noop in pcc: https://puppet-compiler.wmflabs.org/compiler1002/18019/" [puppet] - https://gerrit.wikimedia.org/r/532333 (https://phabricator.wikimedia.org/T221594) (owner: Vgutierrez)
[08:38:42] (CR) Filippo Giunchedi: [C: +2] Point statsd.codfw to statsd.eqiad [dns] - https://gerrit.wikimedia.org/r/532332 (owner: Filippo Giunchedi)
[08:38:50] (PS2) Vgutierrez: ATS: Fix port definition on trafficserver::monitoring [puppet] - https://gerrit.wikimedia.org/r/532333 (https://phabricator.wikimedia.org/T221594)
[08:39:35] (PS5) Giuseppe Lavagetto: parsoid: use safe service restarts [puppet] - https://gerrit.wikimedia.org/r/518669
[08:39:56] <_joe_> let's see how many minutes I'll lose today to ff-only!
[08:41:21] <_joe_> why can't I merge ffs
[08:41:37] (PS6) Giuseppe Lavagetto: parsoid: use safe service restarts [puppet] - https://gerrit.wikimedia.org/r/518669
[08:41:48] <_joe_> this is all time wasted to no use.
[08:43:03] (CR) Filippo Giunchedi: "> Patch Set 1:" [dns] - https://gerrit.wikimedia.org/r/531965 (https://phabricator.wikimedia.org/T200209) (owner: RobH)
[08:45:18] (CR) Marostegui: [C: +2] "https://phabricator.wikimedia.org/T231182#5437126" [software/tendril] - https://gerrit.wikimedia.org/r/532296 (https://phabricator.wikimedia.org/T231182) (owner: Marostegui)
[08:46:15] (CR) Marostegui: [V: +2 C: +2] tendril: Disable insert on innodb_trx and innodb_trx_log [software/tendril] - https://gerrit.wikimedia.org/r/532296 (https://phabricator.wikimedia.org/T231182) (owner: Marostegui)
[08:46:54] <_joe_> !log hard powercycle of mw2231, down with a blank console
[08:46:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:55:22] (PS1) Filippo Giunchedi: monitoring: alert on availability over two minutes [puppet] - https://gerrit.wikimedia.org/r/532335 (https://phabricator.wikimedia.org/T228379)
[09:19:24] (PS20) Mathew.onipe: Add maps reboot cookbook [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072)
[09:20:40] Operations, Release Pipeline, Maps (Kartotherian): Make jobprocessor's test not depend on external files - https://phabricator.wikimedia.org/T231009 (Mathew.onipe) @Joe here is the error: https://integration.wikimedia.org/ci/blue/organizations/jenkins/service-pipeline-test/detail/service-pipeline-tes...
[09:20:47] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui)
[09:21:01] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui)
[09:21:54] (CR) Marostegui: [C: +1] mariadb::parsercache - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531263 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:22:06] (CR) Marostegui: [C: +1] mariadb::core_multiinstance - eqiad: add ipv6 mapped address [puppet] - https://gerrit.wikimedia.org/r/531173 (https://phabricator.wikimedia.org/T102099) (owner: Jbond)
[09:25:52] Operations: expand list of those who have permissions to edit the #wikimedia-operations topic - https://phabricator.wikimedia.org/T231016 (Joe) I did some cleanup removing non-sres and adding a few people from the US TZ.
[09:25:59] Operations: expand list of those who have permissions to edit the #wikimedia-operations topic - https://phabricator.wikimedia.org/T231016 (Joe) Open→Resolved
[09:33:54] (CR) Mathew.onipe: Add maps reboot cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/511819 (https://phabricator.wikimedia.org/T224072) (owner: Mathew.onipe)
[09:34:53] (PS3) Alaa Sarhan: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - https://gerrit.wikimedia.org/r/531162
[09:38:45] !log Enable partial blocks on test2wiki and mwdebug1001 to test something
[09:38:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:29] Operations, DBA, observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - https://phabricator.wikimedia.org/T231190 (Marostegui)
[09:40:44] Operations, DBA, observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 -
https://phabricator.wikimedia.org/T231190 (Marostegui) p:Triage→Normal
[09:41:00] (PS1) Vgutierrez: Release 8.0.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/532342
[09:41:37] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/531808 (owner: Muehlenhoff)
[09:42:24] Operations, DBA: Disable/remove unused features on Tendril - https://phabricator.wikimedia.org/T231185 (Marostegui)
[09:42:31] (PS2) Vgutierrez: Release 8.0.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/532342
[09:43:08] !log Run scap pull on mwdebug1001, test ended
[09:43:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:48:43] (PS2) Giuseppe Lavagetto: eventbus: use safe restart scripts [puppet] - https://gerrit.wikimedia.org/r/518670
[09:52:30] Operations, ops-codfw, DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (Joe)
[09:52:38] Operations, ops-codfw, DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (Joe) p:Triage→Normal
[09:53:20] Operations, MediaWiki-Core-Testing, HHVM: Re-add complete URL parsing fix from 3.18.7 release - https://phabricator.wikimedia.org/T185024 (hashar)
[09:54:19] <_joe_> !log codfw/appserver/*/mw2231.codfw.wmnet: pooled changed yes => inactive T231192
[09:54:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:54:24] T231192: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192
[09:57:00] (PS3) Giuseppe Lavagetto: eventbus: use safe restart scripts [puppet] - https://gerrit.wikimedia.org/r/518670
[09:58:21] ACKNOWLEDGEMENT - Host mw2231 is DOWN: PING CRITICAL - Packet loss = 100% Giuseppe Lavagetto T231192
[09:59:58] (CR) Giuseppe Lavagetto: [C: +2] "https://puppet-compiler.wmflabs.org/compiler1002/18021/kafka-main2001.codfw.wmnet/" [puppet] -
https://gerrit.wikimedia.org/r/518670 (owner: Giuseppe Lavagetto)
[10:08:37] (CR) Ema: [C: +1] Release 8.0.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/532342 (owner: Vgutierrez)
[10:08:57] (CR) Vgutierrez: [C: +2] Release 8.0.5-1wm2 [debs/trafficserver] - https://gerrit.wikimedia.org/r/532342 (owner: Vgutierrez)
[10:16:11] (PS1) Vgutierrez: ATS: Add @reboot and @swap to the list of restricted syscalls groups [puppet] - https://gerrit.wikimedia.org/r/532348
[10:23:04] Operations, serviceops, Continuous-Integration-Infrastructure (phase-out-jessie): Upload docker-ce 18.06.3 upstream package for Stretch - https://phabricator.wikimedia.org/T226236 (hashar) I am Back from vacations! CI currently runs 18.06. 18.09 introduces a bunch of changes I am not comfortable to...
[10:26:04] !log uploaded trafficserver-8.0.5-1wm2 to apt.wikimedia.org (stretch) - T221594
[10:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:26:10] T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594
[10:28:22] (PS1) Jakob: Termbox: improve logging for invalid requests [deployment-charts] - https://gerrit.wikimedia.org/r/532351 (https://phabricator.wikimedia.org/T230921)
[10:28:24] (PS1) Jakob: Termbox: improve logging for invalid requests [deployment-charts] - https://gerrit.wikimedia.org/r/532352 (https://phabricator.wikimedia.org/T230921)
[10:28:26] (PS1) Jakob: Termbox: improve logging for invalid requests [deployment-charts] - https://gerrit.wikimedia.org/r/532353 (https://phabricator.wikimedia.org/T230921)
[10:28:28] (PS1) Jakob: Termbox: improve logging for invalid requests [deployment-charts] - https://gerrit.wikimedia.org/r/532354 (https://phabricator.wikimedia.org/T230921)
[10:29:14] (CR) Jakob: "This change is ready for review."
[deployment-charts] - 10https://gerrit.wikimedia.org/r/532351 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [10:29:21] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/532352 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [10:29:26] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/532353 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [10:29:31] (03CR) 10Jakob: "This change is ready for review." [deployment-charts] - 10https://gerrit.wikimedia.org/r/532354 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [10:29:54] (03PS2) 10Giuseppe Lavagetto: profile::lvs::realserver: only use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518672 [10:30:04] jan_drewniak: #bothumor Q:How do functions break up? A:They stop calling each other. Rise for Wikimedia Portals Update deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T1030). 
[10:30:54] (03CR) 10jerkins-bot: [V: 04-1] profile::lvs::realserver: only use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518672 (owner: 10Giuseppe Lavagetto) [10:34:25] (03PS1) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) [10:34:27] (03PS1) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 [puppet] - 10https://gerrit.wikimedia.org/r/532356 [10:36:57] (03PS2) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532356 [10:41:09] (03PS1) 10Jdrewniak: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532358 (https://phabricator.wikimedia.org/T128546) [10:42:23] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/18023/" [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [10:43:54] (03CR) 10Jdrewniak: [C: 03+2] Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532358 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:44:25] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/18024/" [puppet] - 10https://gerrit.wikimedia.org/r/532356 (owner: 10Vgutierrez) [10:44:49] (03Merged) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532358 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:46:18] (03PS3) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532356 [10:46:26] (03CR) 10jenkins-bot: Bumping portals to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532358 (https://phabricator.wikimedia.org/T128546) (owner: 10Jdrewniak) [10:47:10] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals 
Update: [[gerrit:528092| Bumping portals to master (T128546)]] (duration: 00m 46s) [10:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:47:16] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:47:56] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:528092| Bumping portals to master (T128546)]] (duration: 00m 46s) [10:48:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] Amir1, Lucas_WMDE, awight, and Urbanecm: (Dis)respected human, time to deploy European Mid-day SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T1100). Please do the needful. [11:00:04] alaa_wmde: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:54] I can SWAT today [11:01:23] whoop thanks Amir1 [11:01:29] I'm ready when you are [11:01:41] sure [11:06:58] ^ it might also be a good pair-deployment day? [11:07:34] 10Operations, 10DBA, 10observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - https://phabricator.wikimedia.org/T231190 (10fgiunchedi) Interesting! Since on buster there's an implicit upgrade of mysqld-exporter to 0.11, some of the innodb-related performance schema options... [11:07:56] awight would've been indeed .. but I'm not in the office .. guess doing it remote is possible, but should've probably prepared earlier .. thoughtful of you to remind thanks! 
[11:08:13] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 (owner: 10Alaa Sarhan) [11:09:12] 10Operations, 10DBA, 10observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - https://phabricator.wikimedia.org/T231190 (10fgiunchedi) re: "monitoring queries latency" the expression needs to be changed like this (i.e. to handle multiple handlers) ` http_request_duration_... [11:09:14] (03Merged) 10jenkins-bot: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 (owner: 10Alaa Sarhan) [11:09:15] marostegui: jynus ^ [11:09:19] This is going in [11:09:49] (03CR) 10jenkins-bot: Revert "Revert "Revert "Revert "Switch property terms migration to WRITE_NEW on client wikis"""" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531162 (owner: 10Alaa Sarhan) [11:10:54] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:527087|Switch property terms migration to WRITE_NEW on client wikis (T225053)]] (duration: 00m 46s) [11:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:11:00] T225053: Switch `tmpPropertyTermsMigrationStage` to MIGRATION_WRITE_NEW - https://phabricator.wikimedia.org/T225053 [11:11:05] Revert "Revert "Revert "Revert "XXX"""" ... [11:11:08] too many revert! [11:12:11] we actually had to fork otherwise it would have been around eight or ten "revert" there [11:12:12] :P [11:12:56] Taking the double negative thing to a nadir :p [11:13:43] Amir1: re the backport we are deploying now [11:13:43] hmm I can already see Serbian (if I have specified it in babel for my user) on this item https://www.wikidata.org/wiki/Q1631745 [11:13:43] so the bug is only affecting if language was detected through IP .. guess won't be able to test that one and we have to wait for someone to report to us that it was fixed .. 
once we deploy I'll post a comment on the task asking for someone to check [11:13:43] any different thoughts? [11:16:52] yeah, it's really hard to test it [11:17:06] you basically need to be in Serbia, you can use proxy [11:17:28] but then proxies are blocked... [11:17:36] Do we detect language via IP? I wasn't aware of that. [11:17:48] awight: ULS does it :P [11:18:01] What about visiting the srwiki anonymously? [11:18:25] ah nvm this is wikidata. [11:18:36] ?uselang=sr maybe? [11:19:12] yeah, and it's also used for showing interwiki links (compact language links) [11:21:51] $wgULSGeoService = false; <-- surprising [11:22:41] > but then proxies are blocked... [11:22:41] we block them? [11:23:35] alaa_wmde: yeah they are globally blocked from editing [11:23:38] > $wgULSGeoService = false; <-- surprising [11:23:38] awight where? on Wikidata? [11:25:01] Amir1: that is very interesting [11:25:12] alaa_wmde: it's merged now, can you test it on mwdebug1002? [11:25:15] ^ that's the only time the config appears in the mediawiki/config repo, but maybe Wikidata uses a different config repo? [11:25:26] alaa_wmde: It's done to avoid sockpuppets bypassing CU [11:26:08] awight: This can't be true, at least in case of wikidata, I was in Greece and everything I got as anonymous user was in Greek [11:26:15] testing on 1002 [11:26:35] Amir1: Possibly acceptlanguage magic? If you figure it out, I'm curious to learn. [11:27:01] eh I assume you were on your own laptop, so not acceptlanguage in that case. [11:27:02] Sure [11:27:43] FYI, I just accessed wikidata anonymously from the WMDE office and got an English UI [11:27:52] works I can enter in terms Serbian and Serbian Croatian [11:28:32] okie dokie [11:29:44] > FYI, I just accessed wikidata anonymously from the WMDE office and got an English UI [11:29:44] awight: guess request preferred language from browser is the reason? [11:30:09] I also get English UI when I access through VPN in different countries [11:30:39] I'm dumbfounded. 
Tried setting my browser's accept-languages to [es, en] and still got the English UI. [11:31:04] that is weird now haha [11:31:11] ULS reported "English (same as content)" [11:32:02] (ULS considering germany a US territory :P #worstjokeattemptoftheday:) [11:32:10] no, this is not about language of UI, it's about the languages in the termbox [11:32:17] do you get German there? [11:32:23] I do [11:32:29] but the UI thing is still quite interesting [11:33:47] going live now [11:34:25] !log ladsgroup@deploy1001 Synchronized php-1.34.0-wmf.19/extensions/UniversalLanguageSelector: SWAT: [[gerrit:532341|Revert "Return target of redirect languages in mw.uls.getFrequentLanguageList" (T217770 T121747)]] (duration: 00m 46s) [11:34:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:34:32] T121747: Serbian language does not appear for people from Serbia in Wikidata item pages - https://phabricator.wikimedia.org/T121747 [11:34:32] T217770: [Story] Never show anything from a language that is not currently Wikidata conform - https://phabricator.wikimedia.org/T217770 [11:35:27] the deployment is done [11:36:00] !log EU SWAT is done [11:36:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:36:17] Amir1: are we not switching to new store for properties today? [11:36:48] alaa_wmde: already done [11:37:46] oh I missed the log message indeed .. great thanks! 
let's monitor it a bit for next 15 mins [11:54:45] (03CR) 10WMDE-leszek: [C: 03+1] Whitelist jenkins for edit rate limits on beta [mediawiki-config] - 10https://gerrit.wikimedia.org/r/530144 (https://phabricator.wikimedia.org/T230481) (owner: 10Jakob) [11:55:43] !log mobrovac@deploy1001 Started deploy [cpjobqueue/deploy@e742ecf]: Increase the concurrency of cirusSearchCheckerJobs to 20 - T231194 [11:55:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:55:49] T231194: Increase concurrency of the cirrusCheckerJob - https://phabricator.wikimedia.org/T231194 [11:56:53] mobrovac: thanks! [11:57:13] !log mobrovac@deploy1001 Finished deploy [cpjobqueue/deploy@e742ecf]: Increase the concurrency of cirusSearchCheckerJobs to 20 - T231194 (duration: 01m 31s) [11:57:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:28] (03PS9) 10Mathew.onipe: lvs: allow access to wdqs lvs on port 8888 [puppet] - 10https://gerrit.wikimedia.org/r/529053 (https://phabricator.wikimedia.org/T176875) [12:00:28] (03PS3) 10Mathew.onipe: elasticsearch: ship logs to syslog server [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) [12:10:29] 10Operations, 10DBA, 10observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - https://phabricator.wikimedia.org/T231190 (10Marostegui) The innodb variables, as kinda expected, are failing due to the fact that they've been removed upstream: https://jira.mariadb.org/browse/M... 
[12:11:16] 10Operations, 10DBA, 10observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - https://phabricator.wikimedia.org/T231190 (10Marostegui) [12:34:48] (03PS1) 10Marostegui: db2114: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/532370 (https://phabricator.wikimedia.org/T230106) [12:39:02] (03PS2) 10Marostegui: db2114: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/532370 (https://phabricator.wikimedia.org/T230106) [12:39:12] (03PS1) 10Marostegui: db-codfw.php: Specify db2114 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532371 (https://phabricator.wikimedia.org/T230106) [12:43:47] (03CR) 10Marostegui: [C: 03+2] db2114: Change binlog format to STATEMENT [puppet] - 10https://gerrit.wikimedia.org/r/532370 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [12:45:13] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Specify db2114 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532371 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [12:46:54] (03Merged) 10jenkins-bot: db-codfw.php: Specify db2114 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532371 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [12:47:12] (03CR) 10jenkins-bot: db-codfw.php: Specify db2114 status [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532371 (https://phabricator.wikimedia.org/T230106) (owner: 10Marostegui) [12:48:07] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Clarify db2114 status (duration: 00m 45s) [12:48:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:48:25] !log Restart MySQL on db2114 to pick up binlog format change [12:48:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:55:41] 10Operations, 10DBA, 10observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - 
https://phabricator.wikimedia.org/T231190 (10fgiunchedi) >>! In T231190#5437489, @Marostegui wrote: > The innodb variables, as kinda expected, are failing due to the fact that they've been remove... [13:06:05] (03PS1) 10Ema: prometheus: add trafficserver prefix to global metrics [puppet] - 10https://gerrit.wikimedia.org/r/532377 [13:06:05] !log mobrovac@deploy1001 Started deploy [restbase/deploy@38c313d]: Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 [13:06:09] !log mobrovac@deploy1001 deploy aborted: Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 (duration: 00m 04s) [13:06:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:10] T223953: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 [13:06:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:26] !log mobrovac@deploy1001 Started deploy [restbase/deploy@38c313d] (dev-cluster): Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 [13:06:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:09:48] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@38c313d] (dev-cluster): Bring the dev cluster up to date and expose RB on both 7231 and 7233 in it - T223953 (duration: 03m 22s) [13:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:15:58] !log mobrovac@deploy1001 Started deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953 [13:16:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:16:04] T223953: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 [13:20:45] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: add trafficserver prefix to global metrics [puppet] - 10https://gerrit.wikimedia.org/r/532377 (owner: 
10Ema) [13:21:37] !log Change MySQL.monitoring queries latency graph parameters to support buster+mariadb 10.3 - T231190 [13:21:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:43] T231190: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - https://phabricator.wikimedia.org/T231190 [13:25:04] (03PS1) 10Ema: Revert "webserver_misc_apps: do not install envoy" [puppet] - 10https://gerrit.wikimedia.org/r/532380 (https://phabricator.wikimedia.org/T210411) [13:26:30] (03CR) 10Ema: [C: 03+1] hiera: Move nginx from port 443 to 4443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [13:26:36] (03CR) 10Ema: [C: 03+1] hiera: Move ats-tls from port 8443 to port 443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532356 (owner: 10Vgutierrez) [13:28:19] PROBLEM - Restbase LVS codfw on restbase.svc.codfw.wmnet is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/RESTBase [13:28:32] !log Replacing nginx with ats-tls in cp5001 - T221594 [13:28:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:28:39] T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 [13:29:27] PROBLEM - restbase endpoints health on restbase2020 is CRITICAL: /en.wikipedia.org/v1/feed/featured/{yyyy}/{mm}/{dd} (Retrieve aggregated feed content for April 29, 2016) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:29:45] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move nginx from port 443 to 4443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [13:29:47] RECOVERY - Restbase LVS codfw on restbase.svc.codfw.wmnet is OK: 
All endpoints are healthy https://wikitech.wikimedia.org/wiki/RESTBase [13:29:55] (03PS2) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) [13:31:01] PROBLEM - mobileapps endpoints health on scb2002 is CRITICAL: /{domain}/v1/page/most-read/{year}/{month}/{day} (retrieve the most-read articles for January 1, 2016 (with aggregated=true)) timed out before a response was received https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:31:23] (03PS3) 10Giuseppe Lavagetto: profile::lvs::realserver: only use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518672 [13:31:58] (03CR) 10Giuseppe Lavagetto: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/18025/krypton.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/518672 (owner: 10Giuseppe Lavagetto) [13:32:23] (03CR) 10Ema: [C: 03+2] prometheus: add trafficserver prefix to global metrics [puppet] - 10https://gerrit.wikimedia.org/r/532377 (owner: 10Ema) [13:32:31] RECOVERY - mobileapps endpoints health on scb2002 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/mobileapps [13:32:31] RECOVERY - restbase endpoints health on restbase2020 is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Services/Monitoring/restbase [13:32:41] 10Operations, 10DBA, 10observability: Investigate with Prometheus doesn't report on some graphs on MariaDB 10.3 - https://phabricator.wikimedia.org/T231190 (10Marostegui) 05Open→03Resolved a:03Marostegui Thanks @fgiunchedi for the explanation and guidance to get it changed. I have replaced it on the da... [13:33:27] PROBLEM - Check the Netbox report-s- librenms for fail status. 
on netmon1002 is CRITICAL: librenms.LibreNMS CRITICAL https://wikitech.wikimedia.org/wiki/Netbox%23Reports [13:33:33] PROBLEM - MegaRAID on db1063 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:33:34] ACKNOWLEDGEMENT - MegaRAID on db1063 is CRITICAL: CRITICAL: 1 failed LD(s) (Degraded) nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T231199 https://wikitech.wikimedia.org/wiki/MegaCli%23Monitoring [13:33:37] 10Operations, 10ops-eqiad: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T231199 (10ops-monitoring-bot) [13:35:58] 10Operations, 10ops-eqiad, 10DBA: Degraded RAID on db1063 - https://phabricator.wikimedia.org/T231199 (10Marostegui) p:05Triage→03Normal a:03Cmjohnson Can we get this disk replaced? This is m1 master. And old host that will get decommissioned soonish (I need to schedule a master failover for it), but a... [13:37:07] (03PS4) 10Giuseppe Lavagetto: profile::lvs::realserver: only use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518672 [13:38:57] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@38c313d]: Expose RB on both 7231 and 7233 - T223953 (duration: 23m 00s) [13:39:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:03] T223953: Deploy the RESTBase front-end service (RESTRouter) to Kubernetes - https://phabricator.wikimedia.org/T223953 [13:42:53] (03CR) 10Giuseppe Lavagetto: [V: 03+2 C: 03+2] profile::lvs::realserver: only use safe restart scripts [puppet] - 10https://gerrit.wikimedia.org/r/518672 (owner: 10Giuseppe Lavagetto) [13:43:48] (03PS1) 10Filippo Giunchedi: Revert swiftrepl refactor [software] - 10https://gerrit.wikimedia.org/r/532381 [13:44:18] cdanis: ^ [13:45:32] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] hiera: Move nginx from port 443 to 4443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [13:45:33] 
(03PS3) 10Vgutierrez: hiera: Move nginx from port 443 to 4443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) [13:45:41] (03CR) 10Vgutierrez: [V: 03+2 C: 03+2] hiera: Move nginx from port 443 to 4443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532355 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [13:45:47] (03PS1) 10Mobrovac: RESTBase: Temporarily allow access to port 7233 as well [puppet] - 10https://gerrit.wikimedia.org/r/532382 (https://phabricator.wikimedia.org/T223953) [13:45:59] (03CR) 10CDanis: [C: 03+1] Revert swiftrepl refactor [software] - 10https://gerrit.wikimedia.org/r/532381 (owner: 10Filippo Giunchedi) [13:46:02] godog: ty <3 [13:47:54] (03CR) 10Vgutierrez: [C: 03+2] hiera: Move ats-tls from port 8443 to port 443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532356 (owner: 10Vgutierrez) [13:48:02] (03PS4) 10Vgutierrez: hiera: Move ats-tls from port 8443 to port 443 on cp5001 [puppet] - 10https://gerrit.wikimedia.org/r/532356 [13:48:05] (03PS2) 10Mobrovac: RESTBase: Temporarily allow access to port 7233 as well [puppet] - 10https://gerrit.wikimedia.org/r/532382 (https://phabricator.wikimedia.org/T223953) [13:49:25] !log Rename table filejournal on enwiki on db2112 - T51195 [13:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:49:34] T51195: Drop filejournal table from WMF - https://phabricator.wikimedia.org/T51195 [13:49:46] !log upgraded trafficserver to version 8.0.5-1wm2 in cp5001 [13:49:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:23] (03CR) 10Filippo Giunchedi: [C: 03+2] Revert swiftrepl refactor [software] - 10https://gerrit.wikimedia.org/r/532381 (owner: 10Filippo Giunchedi) [13:50:27] (03CR) 10Filippo Giunchedi: [V: 03+2 C: 03+2] Revert swiftrepl refactor [software] - 10https://gerrit.wikimedia.org/r/532381 (owner: 10Filippo Giunchedi) [13:51:51] PROBLEM - HTTPS Unified RSA 
on cp5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:51:53] PROBLEM - HTTPS Unified ECDSA on cp5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/HTTPS [13:52:04] yey, that's expecte [13:52:07] *expected [13:52:56] (03PS1) 10Giuseppe Lavagetto: hhvm: fix restart script path [puppet] - 10https://gerrit.wikimedia.org/r/532383 [13:53:38] (03CR) 10Tarrow: [C: 03+1] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532351 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [13:53:54] (03CR) 10Tarrow: [C: 03+1] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532352 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [13:54:02] (03CR) 10Tarrow: [C: 03+1] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532353 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [13:54:12] (03CR) 10Tarrow: [C: 03+1] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532354 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [13:55:01] RECOVERY - HTTPS Unified RSA on cp5001 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345533 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (RSA) valid until 2019-11-22 07:59:59 +0000 (expires in 87 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:55:01] RECOVERY - HTTPS Unified ECDSA on cp5001 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 345532 seconds left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 87 days) https://wikitech.wikimedia.org/wiki/HTTPS [13:55:12] (03CR) 10Giuseppe Lavagetto: 
[C: 03+2] hhvm: fix restart script path [puppet] - 10https://gerrit.wikimedia.org/r/532383 (owner: 10Giuseppe Lavagetto) [13:55:15] nice :) [13:56:40] (03CR) 10Jakob: [V: 03+2 C: 03+2] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532351 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [13:57:05] PROBLEM - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5001 is CRITICAL: connect to address 10.132.0.101 and port 9322: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:57:21] PROBLEM - Ensure traffic_manager binds on 8443 and responds to HTTP requests on cp5001 is CRITICAL: connect to address 10.132.0.101 and port 8443: Connection refused https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [13:58:09] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'test' . [13:58:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:58:23] RECOVERY - Ensure traffic_exporter binds on port 9322 and responds to HTTP requests on cp5001 is OK: HTTP OK: HTTP/1.0 200 OK - 11003 bytes in 0.460 second response time https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server [14:00:21] (03CR) 10Mobrovac: [C: 03+1] "This can be deployed, RESTBase is already listening to both ports in prod. PCC looks good - https://puppet-compiler.wmflabs.org/compiler10" [puppet] - 10https://gerrit.wikimedia.org/r/532382 (https://phabricator.wikimedia.org/T223953) (owner: 10Mobrovac) [14:00:23] (03CR) 10Jakob: [V: 03+2 C: 03+2] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532352 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [14:00:33] !log @ helmfile [STAGING] Ran 'apply' command on namespace 'termbox' for release 'staging' . 
[14:00:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:02:23] !log uploaded prometheus-ipsec-exporter-0.3.1-1 package to stretch-wikimedia and buster-wikimedia [14:02:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:03:00] (03CR) 10Jakob: [V: 03+2 C: 03+2] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532353 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [14:05:14] !log repooling cp5001 using trafficserver as TLS termination layer - T221594 [14:05:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:21] T221594: Puppetize ATS TLS configuration for incoming traffic - https://phabricator.wikimedia.org/T221594 [14:05:36] !log @ helmfile [CODFW] Ran 'apply' command on namespace 'termbox' for release 'production' . [14:05:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:12:53] /buffer 12 [14:15:11] (03CR) 10Jakob: [V: 03+2 C: 03+2] Termbox: improve logging for invalid requests [deployment-charts] - 10https://gerrit.wikimedia.org/r/532354 (https://phabricator.wikimedia.org/T230921) (owner: 10Jakob) [14:16:30] !log @ helmfile [EQIAD] Ran 'apply' command on namespace 'termbox' for release 'production' . [14:17:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:21:49] 10Operations, 10Analytics, 10Core Platform Team Legacy (Watching / External), 10Patch-For-Review, and 2 others: Replace and expand kafka main hosts (kafka[12]00[123]) with kafka-main[12]00[12345] - https://phabricator.wikimedia.org/T225005 (10Ottomata) Heya! Yes, that link from Petr is the right one, just... 
[14:26:14] 10Operations, 10Elasticsearch, 10Wikimedia-Logstash, 10observability, and 2 others: Migrate Elasticsearch from deprecated Gelf logstash input to rsyslog Kafka logging pipeline - https://phabricator.wikimedia.org/T225125 (10Mathew.onipe) [14:28:01] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10Ottomata) > We are testing in https://phabricator.wikimedia.org/T22934... [14:47:05] 10Operations, 10Gerrit, 10Release-Engineering-Team-TODO, 10Continuous-Integration-Config, 10Release-Engineering-Team (Development services): Fix operations/puppet.git "rebase hell" - https://phabricator.wikimedia.org/T224033 (10hashar) [14:49:15] PROBLEM - Host mw2231.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [14:49:53] 10Operations, 10ops-codfw, 10decommission: Decommission db2034 - https://phabricator.wikimedia.org/T223216 (10Papaul) ge-5/0/32 up down db2034 [14:54:51] (03PS1) 10Giuseppe Lavagetto: aptrepo: the envoy repo for jessie has no InRelease file [puppet] - 10https://gerrit.wikimedia.org/r/532387 [14:57:23] PROBLEM - HTTPS Unified ECDSA on cp5001 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/HTTPS [14:57:35] that's not good [14:58:04] (03CR) 10Giuseppe Lavagetto: [C: 03+2] aptrepo: the envoy repo for jessie has no InRelease file [puppet] - 10https://gerrit.wikimedia.org/r/532387 (owner: 10Giuseppe Lavagetto) [14:59:18] !log Change min_replicas to 3 on s5 for eqiad and codfw T231019 [14:59:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:59:25] T231019: set min_replicas on database sections in dbctl - https://phabricator.wikimedia.org/T231019 [15:00:31] RECOVERY - HTTPS Unified ECDSA on cp5001 is OK: SSL OK - OCSP staple validity for en.wikipedia.org has 341602 seconds 
left:Certificate *.wikipedia.org contains all required SANs:Certificate *.wikipedia.org (ECDSA) valid until 2019-11-22 07:59:59 +0000 (expires in 87 days) https://wikitech.wikimedia.org/wiki/HTTPS [15:02:26] !log depooling cp5001 [15:02:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:17:52] (03CR) 10Arlolra: [C: 03+1] Scandium: Add the protocol to the rt-client config [puppet] - 10https://gerrit.wikimedia.org/r/532331 (https://phabricator.wikimedia.org/T230166) (owner: 10Mobrovac) [15:18:13] RECOVERY - Host mw2231.mgmt is UP: PING OK - Packet loss = 0%, RTA = 31.20 ms [15:22:22] (03PS1) 10Paladox: Revert "Gerrit: Set base url for commitlink" [puppet] - 10https://gerrit.wikimedia.org/r/532391 [15:23:06] (03CR) 10Paladox: [C: 04-1] "Waiting for a gerrit 2.15.16 release which will contain this fix if change is merged in time." [puppet] - 10https://gerrit.wikimedia.org/r/532391 (owner: 10Paladox) [15:23:15] (03PS2) 10Paladox: Revert "Gerrit: Set base url for commitlink" [puppet] - 10https://gerrit.wikimedia.org/r/532391 [15:30:31] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) System Event Log Severity Date/Time Description Instructions: The System Event Log contains information about the managed system. To sort the log by column, click a column header.... 
[15:31:05] 10Operations, 10Analytics, 10SRE-Access-Requests: Access to HUE for cchen - https://phabricator.wikimedia.org/T231111 (10fdans) p:05Triage→03High [15:31:46] (03PS1) 10Jforrester: Load the Translate extension via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532392 (https://phabricator.wikimedia.org/T228051) [15:32:17] (03PS2) 10Jforrester: Load the Translate extension via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532392 (https://phabricator.wikimedia.org/T228051) [15:33:50] jouncebot: next [15:33:50] In 1 hour(s) and 26 minute(s): Wikidata Query Service weekly deploy (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T1700) [15:34:14] (03CR) 10Jforrester: [C: 03+2] Load the Translate extension via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532392 (https://phabricator.wikimedia.org/T228051) (owner: 10Jforrester) [15:35:51] (03Merged) 10jenkins-bot: Load the Translate extension via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532392 (https://phabricator.wikimedia.org/T228051) (owner: 10Jforrester) [15:36:42] (03CR) 10jenkins-bot: Load the Translate extension via static extension registration [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532392 (https://phabricator.wikimedia.org/T228051) (owner: 10Jforrester) [15:38:19] !log jforrester@deploy1001 Synchronized wmf-config/CommonSettings.php: T228051 Load the Translate extension via static extension registration (duration: 00m 46s) [15:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:38:27] T228051: Update Wikimedia production config to use extension registration for Translate - https://phabricator.wikimedia.org/T228051 [16:02:13] (03CR) 10Lucas Werkmeister (WMDE): "there’s some whitespace changes in IS-labs.php that I assume are not intended" [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) (owner: 10Smalyshev) [16:04:17] (03PS3) 10Lucas Werkmeister (WMDE): Setup RDF configuration for Commons Beta with correct prefixes [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) (owner: 10Smalyshev) [16:04:47] (03CR) 10Lucas Werkmeister (WMDE): "I rebased the change and reverted the whitespace changes. I’m not familiar enough with the relevant settings to review the actual changes " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) (owner: 10Smalyshev) [16:07:24] (03CR) 10Mathew.onipe: "rsyslog server on elastic nodes must be up before this logging can take effect." [puppet] - 10https://gerrit.wikimedia.org/r/531922 (https://phabricator.wikimedia.org/T225125) (owner: 10Mathew.onipe) [16:15:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler={proxy:fcgi://127.0.0.1:9000,proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluste [16:15:43] ethod=GET [16:17:29] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [1000.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:21:55] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:22:42] elastic1027 is struggling, looking [16:25:36] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) The server wouldn't boot, it goes through the DELL logo screen. Then we get the message "stuck on initializing intel quickpath interconnect" after a couple of minutes it reboots a... [16:25:49] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) a:03Papaul [16:30:21] 10Operations, 10Release-Engineering-Team: Requesting access to Puppet for Viztor[S] - https://phabricator.wikimedia.org/T229894 (10greg) >>! In T229894#5398333, @Dzahn wrote: > We talked on IRC about this and agreed this ticket should be re-purposed away from "production access to puppetmaster" and to "add to... [16:30:45] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 46.67% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:33:03] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1004 is OK: OK: Less than 20.00% above the threshold [500.0] https://wikitech.wikimedia.org/wiki/Search%23Health/Activity_Monitoring https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:35:53] (03PS1) 10BBlack: Add TXT verify for brave.com for wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/532400 [16:37:36] is jerkins ok? 
[16:38:05] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10colewhite) [16:38:44] 10Operations, 10ORES, 10serviceops, 10Scoring-platform-team (Current): celery-ores-worker service failed on ores100[2,4,5] without any apparent reason or significant log - https://phabricator.wikimedia.org/T230917 (10Halfak) [16:38:49] 10Operations, 10ops-eqiad, 10DC-Ops: b6-eqiad pdu refresh (Tuesday 9/10 @11am UTC) - https://phabricator.wikimedia.org/T227541 (10colewhite) [16:39:26] (03CR) 10jerkins-bot: [V: 04-1] Add TXT verify for brave.com for wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/532400 (owner: 10BBlack) [16:41:01] bblack: it has been extremely slow today: https://phabricator.wikimedia.org/T231200 [16:41:45] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:42:37] (03PS2) 10BBlack: Add TXT verify for brave.com for wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/532400 [16:43:23] <_joe_> onimisionipe: I shall pass you the clinic duties now [16:43:28] we can call it a developer quality improvement program. the longer CI takes to respond to each PS, the more incentive we have not to make mistakes in the first place and cycle through another -1 :) [16:44:07] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:46:58] _joe_: Thanks! 
[16:47:00] looks like a eqiad/esams flap btw [16:47:03] the 5xx [16:47:13] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 https://logstash.wikimedia.org/goto/60aa05b6e1129b475fbf4e7be868c67d [16:47:55] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Cache_TLS_termination https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [16:49:41] godog: yeah, does look like another link flap :/ [16:50:34] (03CR) 10BBlack: [C: 03+2] Add TXT verify for brave.com for wikipedia.org [dns] - 10https://gerrit.wikimedia.org/r/532400 (owner: 10BBlack) [16:51:46] (03PS1) 10Ottomata: Use oozie spark sharelib instead of one from spark 1 package [puppet] - 10https://gerrit.wikimedia.org/r/532403 (https://phabricator.wikimedia.org/T229347) [16:53:08] (03CR) 10Smalyshev: "> Patch Set 2:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/531317 (https://phabricator.wikimedia.org/T230840) (owner: 10Smalyshev) [16:55:08] bblack: indeed, although from librenms event log I can only find "bgp session flap" [16:55:39] XioNoX: FYI ^ [16:56:01] looking [16:56:03] PROBLEM - Host mw2231.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [16:59:02] Last flapped : 2019-08-26 16:39:57 UTC (00:18:35 ago) [16:59:31] seems like it's still https://phabricator.wikimedia.org/T228827 [16:59:45] RECOVERY - Host mw2231.mgmt is UP: PING OK - Packet loss = 0%, RTA = 30.88 ms [17:00:04] gehel and onimisionipe: #bothumor My software never has bugs. It just develops random features. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T1700). 
[17:00:09] our account rep is not cooperating much, but we have a thread with them [17:00:20] no deployment today ^ [17:01:57] XioNoX: uughh, thanks for checking! I couldn't find anything re: the link flapping itself on librenms on e.g. cr2-esams syslog or eventlog, do you see the same? [17:02:10] only the bgp flapping messages that is [17:02:42] godog: I assume the brief burst of HTTP 503s for MW urls was due to the esams routing issue? [17:02:45] https://grafana.wikimedia.org/d/000000066/resourceloader?refresh=5m&panelId=56&fullscreen&orgId=1&from=1566837312159&to=1566838218225 [17:03:00] godog: it's because librenms only polls devices every 5 minutes, so if it's in the same state at each poll, whatever happens in the middle is not going to be noticed [17:03:09] Krinkle: that's correct yeah [17:03:11] you can check the logs though, or the interface itself [17:03:26] 10Operations, 10ops-codfw, 10DC-Ops: mw2231 is down and unable to reboot - https://phabricator.wikimedia.org/T231192 (10Papaul) Get this after swapping CPU 1 with CPU 2 Clear Log Save As Mon Aug 26 2019 16:58:15 CPU 2 has an internal error (IERR). Mon Aug 26 2019 16:57:52 CPU 1 has an internal erro... [17:04:10] XioNoX: I see, I'll poke kibana [17:05:41] godog: if you know what to look at (my main guess was the level3 link) a `show interfaces ` also mentions the last flap [17:05:46] godog: OK. I've documented it at https://grafana.wikimedia.org/d/000000402/resourceloader-alerts?orgId=1&from=1566835572054&to=1566839122739 with an annotation for {operations} and {performance} [17:05:59] will show up in any dashes that show annotations for either of those tags [17:06:14] it's also possible that the link goes down only on one side, etc... 
[17:07:33] Krinkle: nice, relevant task would be T228827 fwiw [17:07:34] T228827: Instability of the Level3 link between cr2-eqiad and cr2-esams - https://phabricator.wikimedia.org/T228827 [17:08:21] XioNoX: yeah I was expecting at least an entry about the interface in eventlog, indeed kibana does have what I was looking for [17:10:03] godog: k, link added [17:17:05] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 49.03 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:21:20] (03PS1) 10Ottomata: Keep daily backups of analytics-meta MySQL instance in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/532406 (https://phabricator.wikimedia.org/T231208) [17:24:55] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 59.76 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:24:59] (03PS2) 10Ottomata: Keep daily backups of analytics-meta MySQL instance in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/532406 (https://phabricator.wikimedia.org/T231208) [17:27:33] (03PS3) 10Ottomata: Keep daily backups of analytics-meta MySQL instance in HDFS [puppet] - 10https://gerrit.wikimedia.org/r/532406 (https://phabricator.wikimedia.org/T231208) [17:29:35] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 81.46 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [17:29:46] (03CR) 10Ottomata: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1001/18033/" [puppet] - 10https://gerrit.wikimedia.org/r/532406 (https://phabricator.wikimedia.org/T231208) (owner: 10Ottomata) [17:29:53] 
(03PS8) 10Herron: prometheus: add prometheus ipsec exporter service & config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [17:31:29] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [17:34:48] !log beginning roll out of prometheus-ipsec-exporter in ulsfo T230236 [17:34:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:56] T230236: De-noise ipsec alerts (Reduce Icinga alert noise goal) - https://phabricator.wikimedia.org/T230236 [17:35:39] (03CR) 10Herron: [C: 03+2] prometheus: add prometheus ipsec exporter service & config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) (owner: 10Herron) [17:35:46] (03PS9) 10Herron: prometheus: add prometheus ipsec exporter service & config in ulsfo [puppet] - 10https://gerrit.wikimedia.org/r/530616 (https://phabricator.wikimedia.org/T230236) [17:36:13] (03PS1) 10Ottomata: Make analytics-meta hdfs backup executable by analytics user [puppet] - 10https://gerrit.wikimedia.org/r/532411 (https://phabricator.wikimedia.org/T231208) [17:37:11] (03CR) 10jerkins-bot: [V: 04-1] Make analytics-meta hdfs backup executable by analytics user [puppet] - 10https://gerrit.wikimedia.org/r/532411 (https://phabricator.wikimedia.org/T231208) (owner: 10Ottomata) [17:40:54] (03PS2) 10Ottomata: Make analytics-meta hdfs backup executable by analytics user [puppet] - 10https://gerrit.wikimedia.org/r/532411 (https://phabricator.wikimedia.org/T231208) [17:42:49] (03CR) 10Ottomata: [C: 03+2] Make analytics-meta hdfs backup executable by analytics user [puppet] - 10https://gerrit.wikimedia.org/r/532411 (https://phabricator.wikimedia.org/T231208) (owner: 10Ottomata) [17:42:57] (03PS3) 10Ottomata: 
Make analytics-meta hdfs backup executable by analytics user [puppet] - 10https://gerrit.wikimedia.org/r/532411 (https://phabricator.wikimedia.org/T231208) [17:43:00] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Make analytics-meta hdfs backup executable by analytics user [puppet] - 10https://gerrit.wikimedia.org/r/532411 (https://phabricator.wikimedia.org/T231208) (owner: 10Ottomata) [17:48:19] (03PS1) 10Ottomata: Run analytics-meta-backup-to-hdfs as root to be able to read source files [puppet] - 10https://gerrit.wikimedia.org/r/532413 (https://phabricator.wikimedia.org/T231208) [17:50:18] (03CR) 10Ottomata: [C: 03+2] Run analytics-meta-backup-to-hdfs as root to be able to read source files [puppet] - 10https://gerrit.wikimedia.org/r/532413 (https://phabricator.wikimedia.org/T231208) (owner: 10Ottomata) [17:53:42] !log add new IP to labsdb-tcp4 on cr1/2-eqiad - T230980 [17:54:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:54:03] T230980: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 [17:54:31] (03PS1) 10Ottomata: Own analytics_meta_hdfs_backup_dir as root [puppet] - 10https://gerrit.wikimedia.org/r/532415 (https://phabricator.wikimedia.org/T231208) [17:55:17] 10Operations, 10cloud-services-team, 10netops: Review switches ACL to connect from tools-bastion to dbproxy1019 - https://phabricator.wikimedia.org/T230980 (10ayounsi) 05Open→03Resolved Pushed as it's a very low risk change. Please reopen if it doesn't work. [17:59:09] (03CR) 10Ayounsi: [C: 03+1] setup.py: add missing PyYAML dependency [software/homer] - 10https://gerrit.wikimedia.org/r/532223 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:00:04] MaxSem, RoanKattouw, Niharika, and Urbanecm: Time to snap out of that daydream and deploy Morning SWAT (Max 6 patches). Get on with it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T1800). 
[18:00:05] Isarra and stephanebisson: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:14] Hello [18:00:51] A sticker?! [18:02:35] (03CR) 10Ayounsi: [C: 03+1] doc: add configuration example in documentation [software/homer] - 10https://gerrit.wikimedia.org/r/532224 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [18:03:48] I can SWAT (when I finish my meeting in a few minutes) [18:05:12] stephanebisson: Thanks! Let me know what you need from me. [18:05:53] (03PS3) 10Sbisson: Enable Related Article cards in Timeless across all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528506 (https://phabricator.wikimedia.org/T181242) (owner: 10Isarra) [18:06:11] Oh, wait, you just have another? [18:06:17] Sorry, I've never... done this before. [18:06:57] Isarra: I'll start with your patch [18:07:10] Thanks. >.< [18:07:33] Isarra: Are you familiar with the browser extension to test your change on a test server before it goes live? [18:07:46] (03CR) 10Sbisson: [C: 03+2] Enable Related Article cards in Timeless across all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528506 (https://phabricator.wikimedia.org/T181242) (owner: 10Isarra) [18:07:48] Yeah, I do have that somewhere. 
[18:08:48] (03Merged) 10jenkins-bot: Enable Related Article cards in Timeless across all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528506 (https://phabricator.wikimedia.org/T181242) (owner: 10Isarra) [18:08:53] (03PS1) 10Pmiazga: Drop MobileWebUIActionsTracking sampling rate to 0.01% [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532422 (https://phabricator.wikimedia.org/T220016) [18:09:05] (03CR) 10jenkins-bot: Enable Related Article cards in Timeless across all projects [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528506 (https://phabricator.wikimedia.org/T181242) (owner: 10Isarra) [18:09:18] stephanebisson: When and where do I test it? [18:10:01] Isarra: Your change is now live on mwdebug1002.eqiad.wmnet [18:10:11] You can test now and let me know how it goes [18:10:37] (03PS2) 10Sbisson: lvwiki damaging model adjustment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528546 (https://phabricator.wikimedia.org/T221871) [18:13:34] stephanebisson: Yeah, seems to work... [18:14:09] Isarra: ok, going live... [18:14:33] Yeah, showing up in timeless and only timeless. I think we're good. [18:14:49] (03CR) 10Sbisson: [C: 03+2] lvwiki damaging model adjustment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528546 (https://phabricator.wikimedia.org/T221871) (owner: 10Sbisson) [18:15:00] stephanebisson: Thank you so much! [18:15:15] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:528506]] Enable Related Article cards in Timeless across all projects (duration: 00m 46s) [18:15:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:28] Now I just need to arm myself for the inevitable angry mobs. [18:16:03] Good luck! 
[18:16:19] (03Merged) 10jenkins-bot: lvwiki damaging model adjustment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528546 (https://phabricator.wikimedia.org/T221871) (owner: 10Sbisson) [18:16:35] (03CR) 10jenkins-bot: lvwiki damaging model adjustment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/528546 (https://phabricator.wikimedia.org/T221871) (owner: 10Sbisson) [18:16:58] Thanks! Probably won't actually need it. [18:17:14] But you never know... [18:18:42] 10Operations, 10Analytics, 10SRE-Access-Requests: Access to HUE for cchen - https://phabricator.wikimedia.org/T231111 (10mforns) @cchen You should be able to access Hue now. Please, reach out if you have any problems. Cheers! [18:19:07] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Access to HUE for cchen - https://phabricator.wikimedia.org/T231111 (10mforns) [18:24:29] PROBLEM - Check correctness of the icinga configuration on icinga1001 is CRITICAL: Icinga configuration contains errors https://wikitech.wikimedia.org/wiki/Icinga [18:27:41] hey ottomata icinga is complaining about: [18:27:47] https://www.irccloud.com/pastebin/KZckTti6/ [18:28:52] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Access to HUE for cchen - https://phabricator.wikimedia.org/T231111 (10cchen) Thank you @mforns! I am able to access Hue now! 
[18:30:31] !log sbisson@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:528546]] lvwiki damaging model adjustment (duration: 00m 46s) [18:30:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:31:58] 10Operations, 10serviceops, 10Core Platform Team (Needs Cleaning - Services Operations): Migrate node-based services in production to node10 - https://phabricator.wikimedia.org/T210704 (10Jdforrester-WMF) [18:37:22] (03PS1) 10Herron: change user to root [debs/prometheus-ipsec-exporter] - 10https://gerrit.wikimedia.org/r/532426 (https://phabricator.wikimedia.org/T230236) [18:48:55] 10Operations, 10Analytics, 10Analytics-Kanban, 10SRE-Access-Requests: Access to HUE for cchen - https://phabricator.wikimedia.org/T231111 (10mforns) Cool! [18:50:08] (03CR) 10Ottomata: [C: 03+2] Own analytics_meta_hdfs_backup_dir as root [puppet] - 10https://gerrit.wikimedia.org/r/532415 (https://phabricator.wikimedia.org/T231208) (owner: 10Ottomata) [18:51:34] (03PS2) 10Ottomata: Check that oozie is installed (not spark 1) for installing sharelib [puppet] - 10https://gerrit.wikimedia.org/r/532403 (https://phabricator.wikimedia.org/T229347) [18:51:42] (03PS3) 10Ottomata: Check that oozie is installed (not spark 1) for installing sharelib [puppet] - 10https://gerrit.wikimedia.org/r/532403 (https://phabricator.wikimedia.org/T229347) [18:59:24] (03PS1) 10Herron: set analytics-database-meta-snapshot-copy-to-hdfs to analytics [puppet] - 10https://gerrit.wikimedia.org/r/532430 [19:01:07] (03PS2) 10Herron: set analytics-database-meta-snapshot-copy-to-hdfs contact to analytics [puppet] - 10https://gerrit.wikimedia.org/r/532430 [19:02:32] (03CR) 10Herron: "this should address the icinga config correctness warning that fired via IRC" [puppet] - 10https://gerrit.wikimedia.org/r/532430 (owner: 10Herron) [19:05:09] (03CR) 10Ayounsi: [C: 03+1] "Only one comment, more like a nitpick :)" (033 comments) [software/homer] - 
10https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [19:08:53] (03CR) 10Ottomata: "Oops, thank you!" [puppet] - 10https://gerrit.wikimedia.org/r/532430 (owner: 10Herron) [19:13:53] PROBLEM - DPKG on an-tool1006 is CRITICAL: DPKG CRITICAL dpkg reports broken packages https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:14:24] (03CR) 10Herron: [C: 03+2] "np!" [puppet] - 10https://gerrit.wikimedia.org/r/532430 (owner: 10Herron) [19:14:30] ^^ might be me, looking. [19:15:25] RECOVERY - DPKG on an-tool1006 is OK: All packages OK https://wikitech.wikimedia.org/wiki/Monitoring/dpkg [19:17:12] (03CR) 10Ayounsi: [C: 03+1] "Couldn't test it using the cli, but the unit tests look good and pass." [software/homer] - 10https://gerrit.wikimedia.org/r/532226 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [19:18:43] herron: thanks, sorry about that! copy/paste error I think! [19:18:45] thank you [19:19:15] hehe sure np! [19:19:38] welcome back btw [19:33:27] (03PS1) 10Urbanecm: Whitelist *.wikimedia.cz in wgCopyUploadsDomains for commonswiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/532438 (https://phabricator.wikimedia.org/T231247) [19:36:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [19:43:01] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/references/{title} (Get references of a test page) timed out before a response was received: /{domain}/v1/page/mobile-sections/{title} (retrieve test page via mobile-sections) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:44:38] (03PS1) 10Ottomata: Comment fixes [debs/spark2] (debian) - 
10https://gerrit.wikimedia.org/r/532439 [19:46:09] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:48:22] RECOVERY - Check correctness of the icinga configuration on icinga1001 is OK: Icinga configuration is correct https://wikitech.wikimedia.org/wiki/Icinga [19:54:58] PROBLEM - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is CRITICAL: /{domain}/v1/page/news (get In the News content) timed out before a response was received: /{domain}/v1/data/css/mobile/site (Get site-specific CSS) timed out before a response was received https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [19:56:02] RECOVERY - Mobileapps LVS codfw on mobileapps.svc.codfw.wmnet is OK: All endpoints are healthy https://wikitech.wikimedia.org/wiki/Mobileapps_%28service%29 [20:00:05] cscott, arlolra, subbu, bearND, halfak, and accraze: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Services – Parsoid / Citoid / Mobileapps / ORES / … . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T2000). [20:02:36] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10Eevans) Let's just do this already. 
[20:07:34] (03Abandoned) 10Ottomata: Comment fixes [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/532439 (owner: 10Ottomata) [20:09:23] (03CR) 10Volans: [C: 03+2] setup.py: add missing PyYAML dependency [software/homer] - 10https://gerrit.wikimedia.org/r/532223 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:09:29] (03CR) 10Volans: [C: 03+2] doc: add configuration example in documentation [software/homer] - 10https://gerrit.wikimedia.org/r/532224 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:11:53] (03Merged) 10jenkins-bot: setup.py: add missing PyYAML dependency [software/homer] - 10https://gerrit.wikimedia.org/r/532223 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:12:00] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@d9042a1]: Update mobileapps to fbe3cc6 [20:12:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:06] (03Merged) 10jenkins-bot: doc: add configuration example in documentation [software/homer] - 10https://gerrit.wikimedia.org/r/532224 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:21:52] PROBLEM - Logstash rate of ingestion percent change compared to yesterday on icinga1001 is CRITICAL: 138.8 ge 130 https://phabricator.wikimedia.org/T202307 https://grafana.wikimedia.org/dashboard/db/logstash?orgId=1&panelId=2&fullscreen [20:23:23] 10Operations, 10Analytics, 10Analytics-Wikistats, 10Traffic, 10Performance-Team (Radar): Piwik JS isn't cached - https://phabricator.wikimedia.org/T230772 (10kchapman) [20:24:12] 10Operations, 10Cassandra, 10Core Platform Team Workboards (Clinic Duty Team), 10User-Eevans: Revisit default settings for c-foreach-restart - https://phabricator.wikimedia.org/T198787 (10WDoranWMF) @Eevans Who could look at this, would this be a good task for @Clarakosi? Not sure if you've done deb packag... 
[20:25:08] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@d9042a1]: Update mobileapps to fbe3cc6 (duration: 13m 08s) [20:25:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:34:10] (03PS2) 10Volans: Configuration: load and merge private config [software/homer] - 10https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) [20:34:12] (03PS2) 10Volans: devices: add query capability [software/homer] - 10https://gerrit.wikimedia.org/r/532226 (https://phabricator.wikimedia.org/T228388) [20:34:14] (03PS2) 10Volans: cli: rename action compile to generate [software/homer] - 10https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388) [20:34:16] (03PS1) 10Volans: devices: add logging [software/homer] - 10https://gerrit.wikimedia.org/r/532452 (https://phabricator.wikimedia.org/T228388) [20:34:18] (03PS1) 10Volans: templates: add rendering of templates [software/homer] - 10https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) [20:34:20] (03PS1) 10Volans: actions: add generate action [software/homer] - 10https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) [20:34:31] (03CR) 10Volans: "replies inline" (033 comments) [software/homer] - 10https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:36:32] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://www.mediawiki.org/wiki/Continuous_integration/Zuul https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [20:37:32] (03CR) 10Ottomata: [C: 03+2] Check that oozie is installed (not spark 1) for installing sharelib [puppet] - 10https://gerrit.wikimedia.org/r/532403 (https://phabricator.wikimedia.org/T229347) (owner: 10Ottomata) [20:37:39] (03CR) 10jerkins-bot: [V: 04-1] templates: add rendering of templates [software/homer] - 
10https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:37:44] (03PS4) 10Ottomata: Check that oozie is installed (not spark 1) for installing sharelib [puppet] - 10https://gerrit.wikimedia.org/r/532403 (https://phabricator.wikimedia.org/T229347) [20:37:46] (03PS1) 10Ottomata: Release Spark 2.4.3 [debs/spark2] (debian) - 10https://gerrit.wikimedia.org/r/532455 (https://phabricator.wikimedia.org/T222253) [20:37:48] (03CR) 10jerkins-bot: [V: 04-1] actions: add generate action [software/homer] - 10https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) (owner: 10Volans) [20:39:08] (03PS1) 10CRusnov: Separate mgmt interface addresses into appropriately included files [dns] - 10https://gerrit.wikimedia.org/r/532456 (https://phabricator.wikimedia.org/T228387) [20:39:45] (03CR) 10jerkins-bot: [V: 04-1] Separate mgmt interface addresses into appropriately included files [dns] - 10https://gerrit.wikimedia.org/r/532456 (https://phabricator.wikimedia.org/T228387) (owner: 10CRusnov) [20:44:36] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@0463394]: Update mobileapps to 6bdc333 [20:44:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:45:54] (03PS2) 10Volans: templates: add rendering of templates [software/homer] - 10https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) [20:45:56] (03PS2) 10Volans: actions: add generate action [software/homer] - 10https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) [20:50:54] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@0463394]: Update mobileapps to 6bdc333 (duration: 06m 18s) [20:50:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:16] (03CR) 10Ayounsi: Configuration: load and merge private config (031 comment) [software/homer] - 10https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) 
(owner: Volans) [20:55:49] (CR) Volans: [C: +2] devices: add query capability [software/homer] - https://gerrit.wikimedia.org/r/532226 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [20:55:59] (CR) Volans: [C: +2] cli: rename action compile to generate [software/homer] - https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [20:56:44] (CR) jenkins-bot: setup.py: add missing PyYAML dependency [software/homer] - https://gerrit.wikimedia.org/r/532223 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [20:57:05] (CR) Volans: [C: +2] Configuration: load and merge private config [software/homer] - https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [20:57:29] (CR) jenkins-bot: doc: add configuration example in documentation [software/homer] - https://gerrit.wikimedia.org/r/532224 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [20:59:19] (Merged) jenkins-bot: Configuration: load and merge private config [software/homer] - https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [20:59:26] (Merged) jenkins-bot: devices: add query capability [software/homer] - https://gerrit.wikimedia.org/r/532226 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [20:59:28] (Merged) jenkins-bot: cli: rename action compile to generate [software/homer] - https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [21:00:04] Reedy and sbassett: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T2100).
[21:03:02] (CR) jenkins-bot: Configuration: load and merge private config [software/homer] - https://gerrit.wikimedia.org/r/532225 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [21:03:43] (CR) jenkins-bot: devices: add query capability [software/homer] - https://gerrit.wikimedia.org/r/532226 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [21:04:25] (CR) jenkins-bot: cli: rename action compile to generate [software/homer] - https://gerrit.wikimedia.org/r/532227 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [21:05:14] (CR) DannyS712: [C: +1] "Looks good to me" [mediawiki-config] - https://gerrit.wikimedia.org/r/532438 (https://phabricator.wikimedia.org/T231247) (owner: Urbanecm) [21:20:48] Operations, ops-eqiad, DC-Ops: a1-eqiad pdu refresh (Thursday 9/12 @11am UTC) - https://phabricator.wikimedia.org/T226782 (ayounsi) As this rack has one of our 2 most important routers I'd like to be around for the maintenance. 11am UTC is 4am pacific. It would be ideal if it could be pushed at least... [21:21:26] Operations, ops-eqiad, DC-Ops: a8-eqiad pdu refresh (Thursday 9/19 @11am UTC) - https://phabricator.wikimedia.org/T227133 (ayounsi) As this rack has one of our 2 most important routers I'd like to be around for the maintenance. 11am UTC is 4am pacific. It would be ideal if it could be pushed at least...
[21:26:40] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Services Operations): Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Eevans) [21:26:43] Operations, Cassandra, RESTBase, RESTBase-Cassandra, Core Platform Team Workboards (Clinic Duty Team): Migrate remaining Restbase servers to Stretch - https://phabricator.wikimedia.org/T224553 (Eevans) [21:27:44] Operations, Cassandra, RESTBase, Core Platform Team (Needs Cleaning - Services Operations): Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Eevans) [21:28:20] Operations, Cassandra, RESTBase, Core Platform Team Workboards (Clinic Duty Team): Migrate Restbase-dev cluster to Stretch - https://phabricator.wikimedia.org/T224554 (Eevans) [21:31:42] (PS1) Mforns: modules::turnilo::templates::config.yaml.erb add edit_hourly [puppet] - https://gerrit.wikimedia.org/r/532467 (https://phabricator.wikimedia.org/T230963) [21:32:04] (CR) Ayounsi: [C: +1] devices: add logging (1 comment) [software/homer] - https://gerrit.wikimedia.org/r/532452 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [21:32:41] (PS2) Mforns: modules::turnilo::templates::config.yaml.erb add edit_hourly [puppet] - https://gerrit.wikimedia.org/r/532467 (https://phabricator.wikimedia.org/T230963) [21:40:53] (PS2) Volans: devices: add logging [software/homer] - https://gerrit.wikimedia.org/r/532452 (https://phabricator.wikimedia.org/T228388) [21:40:55] (PS3) Volans: templates: add rendering of templates [software/homer] - https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) [21:40:57] (PS3) Volans: actions: add generate action [software/homer] - https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) [21:41:06] (CR) Volans: "addressed comment" (1 comment) [software/homer] -
https://gerrit.wikimedia.org/r/532452 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [21:43:30] (PS14) Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [21:49:01] Operations, Cassandra, Core Platform Team (Needs Cleaning - Cassandra Operational), Patch-For-Review, User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (Eevans) OK, time for our yearly update of this ticket! >>! In T92471#4018640, @Eevans wr... [21:58:45] Operations, Cassandra, Core Platform Team (Needs Cleaning - Cassandra Operational), Patch-For-Review, User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (Eevans) >>! In T92471#5439574, @Eevans wrote: > > [ ... ] > > This is no longer the cas... [21:59:57] (CR) jerkins-bot: [V: -1] table-properties: Initial commit [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: Holger Knust) [22:00:01] Operations, Cassandra, Core Platform Team Workboards (Clinic Duty Team), Patch-For-Review, User-Eevans: enable authenticated access to Cassandra JMX - https://phabricator.wikimedia.org/T92471 (Eevans) p:Normal→Triage [22:01:15] Operations, Cassandra, RESTBase, RESTBase-Cassandra, Core Platform Team (Needs Cleaning - Cassandra Operational): secure Cassandra/RESTBase cluster - https://phabricator.wikimedia.org/T94329 (Eevans) Open→Resolved a:Eevans [22:06:51] (PS15) Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [22:17:24] PROBLEM - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is CRITICAL: 35.61 le 60
https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:19:32] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 59.72 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:22:36] Operations, ops-eqiad, vm-requests: rack/setup/install ganeti10([09]|1[0-8[).eqiad.wmnet - https://phabricator.wikimedia.org/T228924 (Jclark-ctr) entered ip addresses in IDRAC and set password ganeti10([09]|1[0-8[) [22:25:46] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 71.91 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:27:54] Operations, ops-eqiad: rack/setup/instal (4) CI ganeti nodes - https://phabricator.wikimedia.org/T228926 (Jclark-ctr) entered ip addresses in IDRAC and set password ganeti10[19...22] [22:28:04] (PS8) Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [22:28:06] PROBLEM - Check the NTP synchronisation status of timesyncd on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/NTP [22:28:35] (CR) Volans: [C: +2] devices: add logging [software/homer] - https://gerrit.wikimedia.org/r/532452 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [22:30:20] PROBLEM - IPMI Sensor Status on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Power_Supply_Failures [22:30:33] (Merged)
jenkins-bot: devices: add logging [software/homer] - https://gerrit.wikimedia.org/r/532452 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [22:31:22] (CR) jenkins-bot: devices: add logging [software/homer] - https://gerrit.wikimedia.org/r/532452 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [22:33:00] RECOVERY - Varnish traffic drop between 30min ago and now at eqsin on icinga1001 is OK: (C)60 le (W)70 le 78.79 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [22:40:11] (PS9) Jhedden: openstack: Add codfw1dev nova API and metadata to haproxy [puppet] - https://gerrit.wikimedia.org/r/530580 (https://phabricator.wikimedia.org/T223907) [22:42:16] (PS16) Holger Knust: table-properties: Initial commit [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) [22:42:33] (CR) Cwhite: [C: +1] monitoring: alert on availability over two minutes [puppet] - https://gerrit.wikimedia.org/r/532335 (https://phabricator.wikimedia.org/T228379) (owner: Filippo Giunchedi) [22:44:24] (CR) Holger Knust: "I addressed all open issues and refactored the code a bit to improve readability." (10 comments) [software/cassandra-table-properties] - https://gerrit.wikimedia.org/r/524921 (https://phabricator.wikimedia.org/T220246) (owner: Holger Knust) [23:00:05] MaxSem, RoanKattouw, Niharika, and Urbanecm: #bothumor My software never has bugs. It just develops random features. Rise for Evening SWAT (Max 6 patches). (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190826T2300). [23:00:05] No GERRIT patches in the queue for this window AFAICS.
[23:38:40] (CR) Ayounsi: [C: +1] templates: add rendering of templates [software/homer] - https://gerrit.wikimedia.org/r/532453 (https://phabricator.wikimedia.org/T228388) (owner: Volans) [23:41:38] (PS1) Bstorm: tools-prometheus: add an allowance for ssh monitoring [puppet] - https://gerrit.wikimedia.org/r/532487 [23:42:48] PROBLEM - Long running screen/tmux on elastic1046 is CRITICAL: connect to address 10.64.16.70 port 5666: Connection refused https://wikitech.wikimedia.org/wiki/Monitoring/Long_running_screens [23:42:52] (CR) jerkins-bot: [V: -1] tools-prometheus: add an allowance for ssh monitoring [puppet] - https://gerrit.wikimedia.org/r/532487 (owner: Bstorm) [23:57:37] (PS2) Bstorm: tools-prometheus: add an allowance for ssh monitoring [puppet] - https://gerrit.wikimedia.org/r/532487 [23:59:16] (CR) Ayounsi: [C: +1] actions: add generate action [software/homer] - https://gerrit.wikimedia.org/r/532454 (https://phabricator.wikimedia.org/T228388) (owner: Volans)