[00:02:25] (03PS1) 10CRusnov: netbox reports alerts: fix notes_url from variable rename error [puppet] - 10https://gerrit.wikimedia.org/r/552365 [00:02:54] (03CR) 10CRusnov: [C: 03+2] "quick fix - no breakage possible" [puppet] - 10https://gerrit.wikimedia.org/r/552365 (owner: 10CRusnov) [00:03:02] (03PS1) 10Dzahn: xhgui: disable automatic rsync, keep it manual [puppet] - 10https://gerrit.wikimedia.org/r/552366 [00:03:59] (03CR) 10Dzahn: [C: 03+2] xhgui: disable automatic rsync, keep it manual [puppet] - 10https://gerrit.wikimedia.org/r/552366 (owner: 10Dzahn) [00:05:17] PROBLEM - Check systemd state on tungsten is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:08:50] ^ me, missing some IPv6 records.. fixing [00:09:44] ACKNOWLEDGEMENT - Check systemd state on tungsten is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. daniel_zahn xhgui1001 needs IPv6 records for ferm https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:18:56] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/GrowthExperiments/: Make non-remote titles work in RemotePageConfigurationLoader (T237301) (duration: 00m 54s) [00:19:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:19:01] T237301: Newcomer tasks: fix and migrate JSON config pages - https://phabricator.wikimedia.org/T237301 [00:20:15] (03PS1) 10Dzahn: add IPv6 records for xhgui1001/xhgui2001 [dns] - 10https://gerrit.wikimedia.org/r/552368 (https://phabricator.wikimedia.org/T238098) [00:20:39] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Move newcomer tasks JSON config from mw.org to local wikis (T237301) (duration: 00m 52s) [00:20:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:25:02] (03CR) 10Dzahn: [C: 03+2] add IPv6 records for xhgui1001/xhgui2001 [dns] - 
10https://gerrit.wikimedia.org/r/552368 (https://phabricator.wikimedia.org/T238098) (owner: 10Dzahn) [00:25:06] (03PS2) 10Dzahn: add IPv6 records for xhgui1001/xhgui2001 [dns] - 10https://gerrit.wikimedia.org/r/552368 (https://phabricator.wikimedia.org/T238098) [00:37:23] !log tungsten - starting ferm service [00:37:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:37:35] RECOVERY - Check systemd state on tungsten is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [00:46:45] !log xhgui1001/xhgui2001 - rsyncing /srv/mongod from tungsten to /srv/tungsten/mongod/ on both new machines (T158837) [00:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:46:50] T158837: Consolidate performance website and related software - https://phabricator.wikimedia.org/T158837 [00:48:42] (03PS1) 10Dzahn: xhgui: also copy tungsten mongodb data to xhgui2001 [puppet] - 10https://gerrit.wikimedia.org/r/552369 [00:49:21] (03PS2) 10Dzahn: xhgui: also copy tungsten mongodb data to xhgui2001 [puppet] - 10https://gerrit.wikimedia.org/r/552369 [00:49:23] (03CR) 10Dzahn: [C: 03+2] xhgui: also copy tungsten mongodb data to xhgui2001 [puppet] - 10https://gerrit.wikimedia.org/r/552369 (owner: 10Dzahn) [00:57:31] (03PS2) 10Dzahn: xhgui: disable automatic rsync, keep it manual [puppet] - 10https://gerrit.wikimedia.org/r/552366 [01:38:01] Anyone from ops around? 
[01:38:07] or security or cloud services [01:39:26] * Platonides doesn't like what these questions seem to imply [02:25:31] (03CR) 10Eevans: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552257 (https://phabricator.wikimedia.org/T237143) (owner: 10Mobrovac) [02:43:07] (03PS1) 10DannyS712: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) [02:45:10] (03PS2) 10DannyS712: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) [02:45:48] (03CR) 10jerkins-bot: [V: 04-1] Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) (owner: 10DannyS712) [03:03:16] (03PS3) 10DannyS712: Remove `move-rootuserpages` from user on svwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552374 (https://phabricator.wikimedia.org/T238842) [03:23:27] (03CR) 10Vgutierrez: "looking good :) see the inline comments" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [03:49:27] !log restart prometheus@ops on prometheus1003 T238807 [03:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:49:34] T238807: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 [03:58:07] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [04:21:59] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [04:57:08] (03PS1) 10Vgutierrez: ATS: Enable log rotation via logrotate [puppet] - 10https://gerrit.wikimedia.org/r/552379 (https://phabricator.wikimedia.org/T238724) [05:00:55] (03CR) 10Vgutierrez: "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1003/19543/" [puppet] - 10https://gerrit.wikimedia.org/r/552379 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez) [06:07:35] PROBLEM - MariaDB Slave Lag: s8 on db2083 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 86444.17 seconds https://wikitech.wikimedia.org/wiki/MariaDB/troubleshooting%23Depooling_a_slave [06:08:26] ^ downtime expired [06:15:53] (03PS1) 10Marostegui: install_server: Do not reimage db213[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/552380 (https://phabricator.wikimedia.org/T238183) [06:18:25] (03CR) 10Marostegui: [C: 03+2] install_server: Do not reimage db213[2-5] [puppet] - 10https://gerrit.wikimedia.org/r/552380 (https://phabricator.wikimedia.org/T238183) (owner: 10Marostegui) [06:24:10] (03PS1) 10Marostegui: mariadb: Promote db1086 to s7 primary master [puppet] - 10https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044) [06:25:00] (03PS1) 10Marostegui: wmnet: Update s7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044) [06:25:15] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [puppet] - 10https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044) (owner: 10Marostegui) [06:25:30] (03CR) 10Marostegui: [C: 04-2] "Wait for the failover day" [dns] - 10https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044) (owner: 10Marostegui) [06:31:46] !log marostegui@cumin1001 dbctl commit (dc=all): 'Rebalance weights on s7 in preparation for s7 failover on Tuesday T238044', 
diff saved to https://phabricator.wikimedia.org/P9722 and previous config saved to /var/cache/conftool/dbconfig/20191122-063145-marostegui.json [06:31:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:31:51] T238044: Switchover s7 primary database master db1062 -> db1086 - 26th Nov 06:00 - 06:30 UTC - https://phabricator.wikimedia.org/T238044 [06:50:37] PROBLEM - snapshot of s3 in codfw on db1115 is CRITICAL: snapshot for s3 at codfw taken more than 4 days ago: Most recent backup 2019-11-18 06:38:42 https://wikitech.wikimedia.org/wiki/MariaDB/Backups [07:16:57] (03PS1) 10MaxSem: admin: Remove myself [puppet] - 10https://gerrit.wikimedia.org/r/552389 [07:18:33] PROBLEM - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [07:23:23] <_joe_> MaxSem: :/ [07:23:48] * MaxSem hugs _joe_ [07:24:36] <_joe_> MaxSem: I'll think of you every time I need to spell kartotherian correctly :P [07:25:38] I blame Yuri, my original "kartotherion" was so much easier :P [07:25:52] <_joe_> ahahah [07:37:29] (03CR) 10Muehlenhoff: [C: 03+1] "Thanks :-) I'll merge this after your last work day" [puppet] - 10https://gerrit.wikimedia.org/r/552389 (owner: 10MaxSem) [07:40:28] (03CR) 10Muehlenhoff: Add image submission mode to debmonitor client (031 comment) [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [07:43:39] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 54.11 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:47:03] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on 
icinga1001 is OK: (C)60 le (W)70 le 71.03 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [07:51:35] (03PS1) 10Elukey: role::prometheus::analytics: remove old burrow configuration [puppet] - 10https://gerrit.wikimedia.org/r/552393 (https://phabricator.wikimedia.org/T238794) [08:11:17] (03PS1) 10Vgutierrez: acme_chief: Add smokeping certificate [puppet] - 10https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) [08:14:32] (03CR) 10Ema: [C: 03+1] acme_chief: Add smokeping certificate [puppet] - 10https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [08:22:01] PROBLEM - Wikitech-static main page has content on labweb1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Wikitech-static [08:22:04] (03CR) 10Elukey: [C: 03+2] role::prometheus::analytics: remove old burrow configuration [puppet] - 10https://gerrit.wikimedia.org/r/552393 (https://phabricator.wikimedia.org/T238794) (owner: 10Elukey) [08:22:12] (03CR) 10Ema: [C: 04-1] "As discussed on irc with Valentin, there's a bit of confusion in hieradata/role/common/acme_chief.yaml when it comes to librenms, netbox, " [puppet] - 10https://gerrit.wikimedia.org/r/552398 (https://phabricator.wikimedia.org/T238900) (owner: 10Vgutierrez) [08:22:13] PROBLEM - Wikitech-static main page has content on labweb1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 500 Internal Server Error - 1866 bytes in 1.500 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:23:35] RECOVERY - Wikitech-static main page has content on labweb1001 is OK: HTTP OK: HTTP/1.1 200 OK - 28421 bytes in 0.179 second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:23:57] RECOVERY - Wikitech-static main page has content on labweb1002 is OK: HTTP OK: HTTP/1.1 200 OK - 28534 bytes in 0.180 
second response time https://wikitech.wikimedia.org/wiki/Wikitech-static [08:24:22] (03CR) 10Ema: [C: 03+1] ATS: Enable log rotation via logrotate [puppet] - 10https://gerrit.wikimedia.org/r/552379 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez) [08:30:04] (03CR) 10Ema: ATS: enable reload for global Lua script (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [08:32:07] (03PS2) 10ArielGlenn: redact possible password entries in dumps log exceptions emailer [puppet] - 10https://gerrit.wikimedia.org/r/552328 [08:33:04] (03CR) 10Vgutierrez: [C: 03+1] ATS: enable reload for global Lua script [puppet] - 10https://gerrit.wikimedia.org/r/552201 (https://phabricator.wikimedia.org/T233274) (owner: 10Ema) [08:33:36] (03PS1) 10Muehlenhoff: Add support for PDNS 4 [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552467 (https://phabricator.wikimedia.org/T227411) [08:35:32] (03CR) 10ArielGlenn: [C: 03+2] redact possible password entries in dumps log exceptions emailer [puppet] - 10https://gerrit.wikimedia.org/r/552328 (owner: 10ArielGlenn) [08:38:05] (03PS2) 10Vgutierrez: ATS: Enable log rotation via logrotate [puppet] - 10https://gerrit.wikimedia.org/r/552379 (https://phabricator.wikimedia.org/T238724) [08:40:58] (03PS7) 10Muehlenhoff: Add image submission mode to debmonitor client [software/debmonitor] - 10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) [08:41:12] (03CR) 10Vgutierrez: [C: 03+2] ATS: Enable log rotation via logrotate [puppet] - 10https://gerrit.wikimedia.org/r/552379 (https://phabricator.wikimedia.org/T238724) (owner: 10Vgutierrez) [08:46:24] (03PS1) 10Ema: cache: reimage cp1081 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552468 (https://phabricator.wikimedia.org/T227432) [08:46:35] (03CR) 10Muehlenhoff: [C: 03+2] Add image submission mode to debmonitor client [software/debmonitor] - 
10https://gerrit.wikimedia.org/r/551220 (https://phabricator.wikimedia.org/T237978) (owner: 10Muehlenhoff) [08:49:06] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp1081 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552468 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:49:52] !log depool cp1081 and reimage as text_ats T227432 [08:49:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:49:58] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [08:50:26] (03CR) 10Ema: [C: 03+2] cache: reimage cp1081 as text_ats [puppet] - 10https://gerrit.wikimedia.org/r/552468 (https://phabricator.wikimedia.org/T227432) (owner: 10Ema) [08:54:05] !log restarting blazegraph and updater on edqs1007 [08:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:54:09] !log restarting blazegraph and updater on wdqs1007 [08:54:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:04:58] !log remove blazegraph 2.1.5-wmf.11 from archiva, broken upload [09:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:05:03] (03CR) 10Elukey: "Just created the kerberos keytabs and uploaded them to the puppet private repo, the change can be merged anytime. 
Going to wait for an exp" [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [09:05:19] !log ema@cumin1001 START - Cookbook sre.hosts.downtime [09:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:07:27] !log ema@cumin1001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0) [09:07:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:17:26] (03CR) 10Reedy: Add PoolCounter configuration for Special:Contributions (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552228 (https://phabricator.wikimedia.org/T234450) (owner: 10Reedy) [09:17:30] (03PS2) 10Reedy: Add PoolCounter configuration for Special:Contributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552228 (https://phabricator.wikimedia.org/T234450) [09:17:34] (03CR) 10Reedy: [C: 03+2] Add PoolCounter configuration for Special:Contributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552228 (https://phabricator.wikimedia.org/T234450) (owner: 10Reedy) [09:17:45] (03CR) 10Elukey: "created https://phabricator.wikimedia.org/T238905 and tagged for sre-access-request since this change needs the SRE team's approval to be " [puppet] - 10https://gerrit.wikimedia.org/r/552304 (https://phabricator.wikimedia.org/T236180) (owner: 10EBernhardson) [09:18:02] (03CR) 10Vgutierrez: "yeah, actually getting rid of that subprocess would be really nice." 
[puppet] - 10https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [09:18:20] (03Merged) 10jenkins-bot: Add PoolCounter configuration for Special:Contributions [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552228 (https://phabricator.wikimedia.org/T234450) (owner: 10Reedy) [09:19:53] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: T234450 (duration: 00m 55s) [09:19:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:21:02] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM overall, see bikeshe^Wnit on job name" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/549871 (https://phabricator.wikimedia.org/T237234) (owner: 10Giuseppe Lavagetto) [09:23:10] !log reedy@deploy1001 Synchronized php-1.35.0-wmf.5/includes/specials/pagers/ContribsPager.php: Remove live hack of limit for T234450 (duration: 00m 54s) [09:23:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:39] !log depool wdqs1007 to allow it to catch up on lag - T238229 [09:27:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:43] T238229: WDQS is having high update lag for the last week - https://phabricator.wikimedia.org/T238229 [09:28:10] !log pool cp1081 with ATS backend T227432 [09:28:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:28:14] T227432: Replace Varnish backends with ATS on cache text nodes - https://phabricator.wikimedia.org/T227432 [09:28:41] jouncebot now [09:28:41] No deployments scheduled for the foreseeable future! 
[09:31:11] (03PS1) 10Addshore: wgWikidataOrgQueryServiceMaxLagFactor 60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552474 (https://phabricator.wikimedia.org/T221774) [09:39:55] (03PS1) 10Ladsgroup: Add gcr and shy-latn to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552476 (https://phabricator.wikimedia.org/T238104) [09:40:47] (03CR) 10Ladsgroup: [C: 03+2] Add gcr and shy-latn to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552476 (https://phabricator.wikimedia.org/T238104) (owner: 10Ladsgroup) [09:41:28] (03Merged) 10jenkins-bot: Add gcr and shy-latn to langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552476 (https://phabricator.wikimedia.org/T238104) (owner: 10Ladsgroup) [09:44:52] !log ladsgroup@deploy1001 Synchronized langlist: T238104 T238104 (duration: 00m 52s) [09:44:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:44:58] T238104: Create Guianan Creole Wikipedia - https://phabricator.wikimedia.org/T238104 [09:45:37] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552477 [09:45:39] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552477 (owner: 10Ladsgroup) [09:45:56] <_joe_> wikibugs: hey [09:46:04] <_joe_> you're not working [09:46:36] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552477 (owner: 10Ladsgroup) [09:46:53] _joe_: anything in particular? [09:47:08] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.14 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:47:10] Phab? 
[09:47:15] <_joe_> Reedy: yeah [09:47:23] <_joe_> no phab task updates [09:47:24] probably just needs that service restarting [09:47:25] moment [09:47:36] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 20s) [09:47:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:49:57] should be coming back [09:51:56] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.91 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:53:05] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add fix for tclap position (#9702) [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/552311 (owner: 10Giuseppe Lavagetto) [09:53:15] (03PS1) 10Ladsgroup: Rename shy-latn to shy in langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552478 (https://phabricator.wikimedia.org/T238105) [09:53:24] I'm deploying some fixes for the new wikis [09:53:29] :) [09:53:35] Amir1: I have one for when you're done ;) [09:53:35] * Reedy throws stuff at wikibugs [09:53:40] addshore: no [09:53:48] Reedy: shhhh [09:53:50] (03CR) 10Ladsgroup: [C: 03+2] Rename shy-latn to shy in langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552478 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [09:54:09] addshore: Sam is sitting next to me if you need me to persuade him [09:54:31] (03Merged) 10jenkins-bot: Rename shy-latn to shy in langlist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552478 (https://phabricator.wikimedia.org/T238105) (owner: 10Ladsgroup) [09:54:52] _joe_: Should hopefully work now [09:55:18] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 80.27 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts 
https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [09:56:37] Reedy: Amir1 https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/552474/ [09:56:46] !log ladsgroup@deploy1001 Synchronized langlist: T238105 (duration: 00m 51s) [09:56:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:56:51] T238105: Create Shawiya Wiktionary - https://phabricator.wikimedia.org/T238105 [09:57:55] (03PS1) 10Ladsgroup: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552479 [09:57:57] (03CR) 10Ladsgroup: [C: 03+2] Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552479 (owner: 10Ladsgroup) [09:58:41] (03Merged) 10jenkins-bot: Update interwiki cache [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552479 (owner: 10Ladsgroup) [09:59:41] !log ladsgroup@deploy1001 Synchronized wmf-config/interwiki.php: Update interwiki cache (duration: 02m 10s) [09:59:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:05:51] (03PS1) 10Filippo Giunchedi: profile: add esams/eqsin snmp_exporter configs [puppet] - 10https://gerrit.wikimedia.org/r/552480 [10:09:31] (03CR) 10Filippo Giunchedi: [C: 03+2] profile: add esams/eqsin snmp_exporter configs [puppet] - 10https://gerrit.wikimedia.org/r/552480 (owner: 10Filippo Giunchedi) [10:09:35] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) I will document the graph when it is "finished" (WIP), but for now: * Backup time: end_time - start_time of the last backup * Backup level: if... 
[10:15:07] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10jbond) @Ottomata you are correct the script just needs to read the public certificates however the directory with the public certifi... [10:17:33] 10Puppet, 10User-jbond, 10cloud-services-team (Kanban): Prevent catalog breakage on cloud instances by decoupling core cloud puppetmaster from custom puppetmasters - https://phabricator.wikimedia.org/T227029 (10aborrero) p:05Triage→03Normal [10:18:05] 10Puppet, 10Cloud-Services, 10cloud-services-team (Kanban): Consider ways to make puppetmaster CA changes smoother on the puppet client end - https://phabricator.wikimedia.org/T220268 (10aborrero) p:05Triage→03Normal [10:18:39] (03CR) 10Jcrespo: [C: 03+1] "Looks good to me: https://phab.wmfusercontent.org/file/data/h7ttcuxgdhhyh2h6orv6/PHID-FILE-n7zfiugajujgrf4zxwrc/Screenshot_20191122_111632" [puppet] - 10https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044) (owner: 10Marostegui) [10:19:26] (03CR) 10Jcrespo: [C: 03+1] "Same suggestion as before." [dns] - 10https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044) (owner: 10Marostegui) [10:20:46] (03PS2) 10Marostegui: mariadb: Promote db1086 to s7 primary master [puppet] - 10https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044) [10:20:48] (03CR) 10Marostegui: [C: 04-2] "> Looks good to me: https://phab.wmfusercontent.org/file/data/h7ttcuxgdhhyh2h6orv6/PHID-FILE-n7zfiugajujgrf4zxwrc/Screenshot_20191122_1116" [puppet] - 10https://gerrit.wikimedia.org/r/552381 (https://phabricator.wikimedia.org/T238044) (owner: 10Marostegui) [10:20:51] 10Operations, 10observability: The "logstash-*" index pattern does not contain any of the following field types: ip - https://phabricator.wikimedia.org/T238795 (10fgiunchedi) Looks good! 
I won't have time to look into this in depth but I'm happy to help if patches need review [10:21:16] (03PS2) 10Marostegui: wmnet: Update s7-master CNAME [dns] - 10https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044) [10:21:22] 10Operations, 10Epic, 10cloud-services-team (Kanban): CloudVPS: our ideal future model - https://phabricator.wikimedia.org/T209460 (10aborrero) [10:21:29] (03CR) 10Marostegui: [C: 04-2] "> Same suggestion as before." [dns] - 10https://gerrit.wikimedia.org/r/552382 (https://phabricator.wikimedia.org/T238044) (owner: 10Marostegui) [10:32:49] (03PS1) 10Jbond: icinga::cas: update bool_2_on_off function [puppet] - 10https://gerrit.wikimedia.org/r/552481 [10:33:36] (03PS1) 10Giuseppe Lavagetto: Correct debian/format to quilt [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/552483 [10:35:41] (03CR) 10Jbond: [C: 03+2] icinga::cas: update bool_2_on_off function [puppet] - 10https://gerrit.wikimedia.org/r/552481 (owner: 10Jbond) [10:36:39] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Correct debian/format to quilt [debs/envoyproxy] (wikimedia-stretch) - 10https://gerrit.wikimedia.org/r/552483 (owner: 10Giuseppe Lavagetto) [10:54:25] (03PS1) 10Filippo Giunchedi: WIP: move to Debian packaging [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/552486 (https://phabricator.wikimedia.org/T217340) [10:56:02] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 56.44 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [10:58:16] (03PS3) 10BBlack: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 4% [puppet] - 10https://gerrit.wikimedia.org/r/550868 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [10:58:18] (03PS3) 10BBlack: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 10% [puppet] - 10https://gerrit.wikimedia.org/r/550869 
(https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [10:58:20] (03PS3) 10BBlack: vcl: Bump TLSv1/TLSv1.1 pageview replacement to 100% [puppet] - 10https://gerrit.wikimedia.org/r/550870 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [10:58:22] (03PS1) 10BBlack: browsersec: cover bot traffic better [puppet] - 10https://gerrit.wikimedia.org/r/552488 (https://phabricator.wikimedia.org/T238038) [10:59:48] Creative code review [11:05:14] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10Marostegui) Just brainstorming here about the dashboard, feel free to ignore, I know it is WIP. - It would be nice to include the date on the "last day... [11:09:13] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10akosiaris) Just so that you aren't caught off guard ` file {'/srv/private/secret/secrets/certificate': ` a form of this kind of a... 
[11:11:12] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 72.94 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [11:29:44] !log upload wikidiff2 1.10.0-1 - T236963 [11:29:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:29:50] T236963: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 [11:34:25] !log Roll out wikidiff2 1.10.0-1 to canaries - T236963 [11:34:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:38:05] (03CR) 10BBlack: [C: 03+2] browsersec: cover bot traffic better [puppet] - 10https://gerrit.wikimedia.org/r/552488 (https://phabricator.wikimedia.org/T238038) (owner: 10BBlack) [11:41:12] PROBLEM - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is CRITICAL: CRITICAL - failed 47 probes of 490 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:41:45] (03CR) 10BBlack: [C: 03+2] vcl: Bump TLSv1/TLSv1.1 pageview replacement to 4% [puppet] - 10https://gerrit.wikimedia.org/r/550868 (https://phabricator.wikimedia.org/T238038) (owner: 10Vgutierrez) [11:46:52] RECOVERY - IPv6 ping to eqiad on ripe-atlas-eqiad IPv6 is OK: OK - failed 26 probes of 490 (alerts on 35) - https://atlas.ripe.net/measurements/1790947/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [11:52:00] 10Operations, 10Traffic, 10Wikidata, 10Wikidata-Campsite, 10User-DannyS712: 502 errors on ATS/8.0.5 - https://phabricator.wikimedia.org/T237319 (10WMDE-leszek) Thanks @elukey and @Joe for translating from leet speak! I've filed T238901 about the problem in Wikibase, and we'll be looking into fixing the b... 
[11:59:40] !log reload php7 on canaries [11:59:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:03] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Parallelize bconsole executions [puppet] - 10https://gerrit.wikimedia.org/r/552490 (https://phabricator.wikimedia.org/T234900) [12:04:23] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Parallelize bconsole executions [puppet] - 10https://gerrit.wikimedia.org/r/552490 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [12:04:35] (03PS2) 10Jcrespo: prometheus-bacula-exporter: Parallelize bconsole executions [puppet] - 10https://gerrit.wikimedia.org/r/552490 (https://phabricator.wikimedia.org/T234900) [12:09:43] 10Operations, 10MediaWiki-REST-API, 10serviceops, 10wikidiff2, and 2 others: Deploy version 1.10.0 of wikidiff2 to production - https://phabricator.wikimedia.org/T236963 (10jijiki) Version 1.10.0 has been deployed to the canaries, we can roll out to production on Monday [12:10:29] 10Operations, 10observability, 10Availability, 10Goal, 10Patch-For-Review: Setup bacula backup monitoring - https://phabricator.wikimedia.org/T234900 (10jcrespo) >>! In T234900#5683712, @jcrespo wrote: > As I feared, the exported during peak hours gets too slow: https://grafana.wikimedia.org/d/413r2vbWk/... 
[12:15:04] (03PS1) 10Jcrespo: prometheus-bacula-exporter: Restart service on code change [puppet] - 10https://gerrit.wikimedia.org/r/552491 (https://phabricator.wikimedia.org/T234900) [12:17:18] (03PS1) 10Filippo Giunchedi: prometheus: lower threshold for logstash indexing failures [puppet] - 10https://gerrit.wikimedia.org/r/552492 (https://phabricator.wikimedia.org/T236343) [12:17:40] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: lower threshold for logstash indexing failures [puppet] - 10https://gerrit.wikimedia.org/r/552492 (https://phabricator.wikimedia.org/T236343) (owner: 10Filippo Giunchedi) [12:18:40] (03PS3) 10BBlack: acme-chief: parallelize gdnsd-sync [puppet] - 10https://gerrit.wikimedia.org/r/552336 (https://phabricator.wikimedia.org/T98006) [12:18:42] (03PS2) 10BBlack: authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) [12:21:04] (03CR) 10BBlack: "Seems clean?" [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [12:26:20] (03CR) 10BBlack: "Better run with icinga as well: https://puppet-compiler.wmflabs.org/compiler1003/19547/" [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [12:28:36] jouncebot now [12:28:36] No deployments scheduled for the foreseeable future! 
[12:30:24] (03PS2) 10Addshore: wgWikidataOrgQueryServiceMaxLagFactor 60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552474 (https://phabricator.wikimedia.org/T221774) [12:30:26] (03CR) 10Addshore: [C: 03+2] wgWikidataOrgQueryServiceMaxLagFactor 60 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552474 (https://phabricator.wikimedia.org/T221774) (owner: 10Addshore) [12:30:33] wikibugs is slooooow [12:31:49] heh [12:32:27] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 60 (duration: 00m 53s) [12:32:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:32:33] T221774: Add Wikidata query service lag to Wikidata maxlag - https://phabricator.wikimedia.org/T221774 [12:33:30] aaand resync because IS.php is lame [12:34:11] !log addshore@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T221774 - wgWikidataOrgQueryServiceMaxLagFactor 60 RESYNC (duration: 00m 51s) [12:34:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:17] * addshore is done [12:45:48] I'm backporting this bug thingy [12:46:19] sounds like the story of my life [12:46:24] (03PS3) 10BBlack: authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) [12:46:29] I prefer you backport bug fixes [12:46:32] But each to their own [12:47:41] lol [12:49:44] (03PS1) 10Jbond: profile::idp::client: add a profile for configuering apache sites [puppet] - 10https://gerrit.wikimedia.org/r/552494 [12:52:45] (03PS4) 10BBlack: authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) [12:54:25] PROBLEM - MediaWiki memcached error rate on icinga1001 is CRITICAL: 8429 gt 5000 https://wikitech.wikimedia.org/wiki/Memcached 
https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:55:35] RECOVERY - MediaWiki memcached error rate on icinga1001 is OK: (C)5000 gt (W)1000 gt 4 https://wikitech.wikimedia.org/wiki/Memcached https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=1&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops [12:55:51] (03CR) 10Jbond: [C: 03+2] profile::idp::client: add a profile for configuering apache sites [puppet] - 10https://gerrit.wikimedia.org/r/552494 (owner: 10Jbond) [12:56:54] (03PS5) 10BBlack: authdns: refactor role/profile/hieradata bits [puppet] - 10https://gerrit.wikimedia.org/r/552346 (https://phabricator.wikimedia.org/T98006) [12:59:20] (03CR) 10Muehlenhoff: profile::idp::client: add a profile for configuering apache sites (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552494 (owner: 10Jbond) [12:59:22] 10Operations, 10Traffic, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) >>! In T237687#5679746, @Krinkle wrote: > The issue - When `X-Wikimedia-Debug` is enabled (e.g. via the WikimediaDebug browser extension), I am no longer able to brow... 
[12:59:30] 10Operations, 10Traffic, 10Performance-Team (Radar): ATS doesn't support X-Wikimedia-Debug - https://phabricator.wikimedia.org/T237687 (10ema) p:05High→03Normal [13:04:16] it's in mwdebug1002, worked fine, syncing [13:05:12] (03PS1) 10Jbond: profile::idp::client: remove SAML validation, fix trailing slash [puppet] - 10https://gerrit.wikimedia.org/r/552496 [13:06:06] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikibase/lib/includes/Store/Sql/SqlEntityInfoBuilder.php: T238473 (duration: 00m 52s) [13:06:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:11] T238473: Label for unit isn't displayed correctly, just Q-number - https://phabricator.wikimedia.org/T238473 [13:06:55] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/552496 (owner: 10Jbond) [13:08:02] (03CR) 10Jbond: [C: 03+2] "PCC https://puppet-compiler.wmflabs.org/compiler1003/19550/icinga1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/552496 (owner: 10Jbond) [13:11:02] (03CR) 10Alexandros Kosiaris: [C: 03+1] prometheus-bacula-exporter: Restart service on code change [puppet] - 10https://gerrit.wikimedia.org/r/552491 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:11:23] !log start of foreachwikiindblist wikidataclient extensions/Wikibase/lib/maintenance/populateSitesTable.php --force-protocol https (T238119 T238524 T237375 T238120) [13:11:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:32] T238119: Add Wikidata support for gcrwiki - https://phabricator.wikimedia.org/T238119 [13:11:32] T238524: Add Wikidata support for minwiktionary - https://phabricator.wikimedia.org/T238524 [13:11:33] T237375: Add Wikidata support for szywiki - https://phabricator.wikimedia.org/T237375 [13:11:33] T238120: Add Wikidata support for shywiktionary - https://phabricator.wikimedia.org/T238120 [13:13:07] (03PS1) 10Lucas Werkmeister (WMDE): Wikibase 
(beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) [13:18:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:28:09] (03PS1) 10Kosta Harlan: Beta labs: Remove unused GrowthExperiments config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552501 [13:28:11] (03PS1) 10Kosta Harlan: GrowthExperiments: Remove unused config var [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552502 [13:28:29] (03PS1) 10Jbond: cas-puppetboard.wikimedia.org: add record [dns] - 10https://gerrit.wikimedia.org/r/552503 [13:30:24] 10Operations, 10User-jbond: Add cas authentication to puppetboard - https://phabricator.wikimedia.org/T238924 (10jbond) [13:31:35] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10Aklapper) Setting #SRE-Access-Requests as per https://wikitech.wikimedia.org/wiki/Google_Search_Console_access (and removing #Operations as subscriber). 
[13:36:36] (03CR) 10Alexandros Kosiaris: prometheus-bacula-exporter: Parallelize bconsole executions (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552490 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:38:16] (03CR) 10Thiemo Kreuz (WMDE): Wikibase (beta-only): Update wmgWikibaseClientDataBridgeHrefRegExp (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552498 (https://phabricator.wikimedia.org/T238918) (owner: 10Lucas Werkmeister (WMDE)) [13:39:16] (03PS1) 10Ema: vcl: move XWD logic to text_common_recv/misc_recv_pass [puppet] - 10https://gerrit.wikimedia.org/r/552504 (https://phabricator.wikimedia.org/T233768) [13:42:13] 10Operations, 10Wikimedia-Mailing-lists: Spam from a non-registered email address coming non-moderated to a restricted mailing list - https://phabricator.wikimedia.org/T238871 (10Aklapper) What does "restricted mailing list" mean exactly when it comes to the settings? [13:47:34] (03PS1) 10BBlack: dnsrecursor: modernize notrack for udp:53 [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) [13:49:38] (03CR) 10jerkins-bot: [V: 04-1] dnsrecursor: modernize notrack for udp:53 [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [13:51:50] (03Abandoned) 10Ema: vcl: move XWD logic to text_common_recv/misc_recv_pass [puppet] - 10https://gerrit.wikimedia.org/r/552504 (https://phabricator.wikimedia.org/T233768) (owner: 10Ema) [13:52:44] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus-bacula-exporter: Restart service on code change [puppet] - 10https://gerrit.wikimedia.org/r/552491 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [13:59:27] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [13:59:27] RECOVERY - snapshot of s3 in codfw on db1115 is OK: snapshot for s3 at codfw taken less than 4 days ago and larger than 90 GB: Last one 2019-11-22 10:34:25 from db2098.codfw.wmnet:3313 (811 GB) https://wikitech.wikimedia.org/wiki/MariaDB/Backups [14:00:00] (03PS1) 10Ema: Revert "vcl: move XWD pass logic to wm_common" [puppet] - 10https://gerrit.wikimedia.org/r/552507 (https://phabricator.wikimedia.org/T233768) [14:00:02] (03PS1) 10Ema: cache: do not cache noc.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/552508 (https://phabricator.wikimedia.org/T233768) [14:18:37] <_joe_> uh what's going on with appservers [14:20:09] <_joe_> hah I think it's Amir1's script [14:20:48] oh, where is the error [14:20:58] my script has finished (not the term store) [14:21:01] <_joe_> Amir1: https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=appserver&var-method=GET [14:21:15] <_joe_> the degradation started when you started the script [14:21:22] <_joe_> only correlation I found [14:21:31] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:21:48] <_joe_> ok lemme ack that alert for now [14:22:01] but it ended an hour ago [14:22:26] that seems worrying [14:22:33] <_joe_> indeed [14:22:46] <_joe_> and none of the usual suspects seems to be the issue then [14:23:24] <_joe_> 
webpagetest agrees btw [14:24:03] I can't find anything in https://grafana.wikimedia.org/d/000000278/mysql-aggregated?orgId=1 [14:24:20] when did it start more or less? [14:24:26] <_joe_> 14:11 [14:24:34] <_joe_> err sorry, 13:11 [14:25:02] ack thanks, I see now [14:25:12] There's huge increase in read in s5 but I don't know if it's related [14:25:33] it could be dumps, let me check [14:25:54] <_joe_> possibly? [14:25:54] I noticed the memcached alerts a while before, there were two spikes that auto-resolved, but then I noticed mc1021's tx bandwidth usage that was higher than the rest (https://grafana.wikimedia.org/d/000000316/memcache?panelId=56&fullscreen&orgId=1&from=now-12h&to=now). It doesn't correlate though [14:26:44] https://grafana.wikimedia.org/d/000000548/wikibase-wb_terms?refresh=30s&orgId=1&from=now-3h&to=now nothing is doing on wikidata, otherwise this would explode [14:27:01] It doesn't seem to be dumps, but a script running on the vslow host [14:27:08] https://grafana.wikimedia.org/d/000000273/mysql?panelId=3&fullscreen&orgId=1&var-dc=eqiad%20prometheus%2Fops&var-server=db1113&var-port=13315&from=now-24h&to=now [14:27:15] The rest of the slaves do not have that increase [14:27:30] SELECT /* SpecialGadgetUsage::reallyDoQuery [14:27:31] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 58.49 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:28:13] yep, confirmed, just the vslow host [14:28:26] * apergos peeks back in (sorry, I was trying to shovel food in my mouth) [14:30:02] <_joe_> does this justify the current slowness? [14:30:26] _joe_: it should not, it is just a dewiki slave with a script running, but other than that it is not causing lag or anything [14:30:51] <_joe_> now lemme see if we have per-wiki data somewhere [14:31:07] is this updateSpecialPages.php? 
[14:31:20] apergos: No, see above [14:31:28] apergos: At least on s5 vslow host [14:31:32] ah ha [14:32:37] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 70.98 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [14:33:21] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:34:10] <_joe_> I am trying to look at perf data, but it's not clear at all where this time is spent [14:34:26] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10BBlack) [14:36:50] 10Operations, 10ops-esams: rack/setup/install ganeti300[123] - https://phabricator.wikimedia.org/T236216 (10BBlack) **IMPORTANT NOTE** `ganeti3003` is temporarily repurposed as a critical authdns server and is in live production use for that role (see also: T236479 ). Do not reimage or touch `ganeti3003`. Th... [14:38:54] (03PS1) 10Filippo Giunchedi: prometheus: record job/site availability [puppet] - 10https://gerrit.wikimedia.org/r/552511 [14:38:57] _joe_: https://grafana.wikimedia.org/d/2kP3FjAZk/webpagereplay-enwiki-alerts?orgId=1 ? 
[14:39:06] Nothing is exploding on enwiki [14:39:13] <_joe_> yeah [14:39:30] <_joe_> it's going to be some scraper [14:39:35] <_joe_> messing with our stats [14:39:55] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: record job/site availability [puppet] - 10https://gerrit.wikimedia.org/r/552511 (owner: 10Filippo Giunchedi) [14:40:05] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:40:07] https://grafana.wikimedia.org/d/000000326/navigation-timing-alerts?refresh=5m&orgId=1&from=now-6h&to=now [14:40:38] We are at this hackathon, maybe people are doing crazy things right now [14:40:48] <_joe_> ahah [14:41:33] why not median? [14:41:44] average can be messed up so easily [14:42:27] <_joe_> median, when you use buckets, is usually way less accurate [14:42:36] <_joe_> the effect is there on the 95th percentile too [14:42:48] <_joe_> 75th percentile is not [14:42:58] <_joe_> so I guess it's possible it's just a long tail [14:44:22] (03PS1) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [14:44:24] (03PS1) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [14:44:45] (03PS1) 10Alexandros Kosiaris: otrs: Switch from X-Real-IP to X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/552514 [14:44:47] (03PS1) 10Alexandros Kosiaris: Switch from X-Real-IP to X-Client-IP [puppet] - 10https://gerrit.wikimedia.org/r/552515 [14:47:23] (03CR) 10Alexandros Kosiaris: "I guess I need a task, and this was done with a git grep -l | xargs sed incantation, so some review required. 
I also maybe wrong at the ap" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris) [14:47:30] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 (owner: 10Jbond) [14:47:48] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [14:48:34] <_joe_> !log uploaded envoyproxy 1.12.1 to {buster,stretch} T237235 [14:48:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:48:41] T237235: Build and upload envoy 1.12.0 package. - https://phabricator.wikimedia.org/T237235 [14:49:48] <_joe_> !log disabling puppet on restbase2018, testing envoy upgrade T238050 [14:49:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:49:53] T238050: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 [14:50:39] (03PS1) 10Muehlenhoff: Add image tracking support [software/debmonitor] - 10https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) [14:52:41] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) Confirmed the upgrade fixes the Server: header output: ` restbase2018:~$ curl -k https://restbase2018:7443/de.wikipedia.org/v1/page/references/Der_Junge_mit_dem_gro%C3%9Fen_schwarzen_H... [14:53:31] 10Operations, 10RESTBase, 10Traffic: envoy overwrites the server header - https://phabricator.wikimedia.org/T238050 (10Joe) @Vgutierrez I think you can just upgrade envoy across the fleet when you feel confident enough. 
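Note on the 14:41-14:42 exchange about average versus median versus percentiles: the reasoning there (the mean and the 95th percentile move, the 75th percentile does not, so it is probably a long tail) can be illustrated with a small sketch. Everything below is an assumption made for illustration only: the bucket bounds, the sample latencies, and the function names are invented and are not taken from the appserver dashboards or alert definitions. The quantile estimate uses linear interpolation inside a bucket, which is similar in spirit to how bucketed latency histograms are usually queried, and which is also why a bucket-based median is only approximate.

```python
from bisect import bisect_left

# Hypothetical bucket upper bounds in seconds (NOT the real appserver config).
BOUNDS = [0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0, 10.0]


def to_buckets(samples, bounds=BOUNDS):
    """Count samples per bucket; the extra last slot counts values above bounds[-1]."""
    counts = [0] * (len(bounds) + 1)
    for s in samples:
        # bisect_left gives the first bound >= s, i.e. "less or equal" buckets.
        counts[bisect_left(bounds, s)] += 1
    return counts


def bucket_quantile(q, counts, bounds=BOUNDS):
    """Estimate the q-quantile by linear interpolation inside the target bucket."""
    rank = q * sum(counts)
    seen = 0
    for i, c in enumerate(counts):
        if c and seen + c >= rank:
            if i == len(bounds):
                # Overflow bucket has no upper bound; report the highest bound.
                return bounds[-1]
            lower = bounds[i - 1] if i else 0.0
            return lower + (bounds[i] - lower) * (rank - seen) / c
        seen += c
    return bounds[-1]


def summarize(samples):
    """Return mean plus bucket-estimated p50, p75 and p95 for a list of latencies."""
    counts = to_buckets(samples)
    return {
        "mean": sum(samples) / len(samples),
        "p50": bucket_quantile(0.50, counts),
        "p75": bucket_quantile(0.75, counts),
        "p95": bucket_quantile(0.95, counts),
    }


# Invented traffic: 1000 requests, mostly fast.
baseline = [0.08] * 700 + [0.2] * 250 + [0.4] * 50
# Same traffic, but 80 of the mid-speed requests now take 8 seconds (a long tail).
long_tail = [0.08] * 700 + [0.2] * 170 + [0.4] * 50 + [8.0] * 80

base_stats = summarize(baseline)
tail_stats = summarize(long_tail)

for name in ("mean", "p50", "p75", "p95"):
    print(f"{name}: baseline={base_stats[name]:.3f}s long_tail={tail_stats[name]:.3f}s")
```

With these made-up numbers, the mean and the estimated 95th percentile rise sharply once 8% of requests become slow, while the estimated median is unchanged and the 75th percentile barely moves. That matches the pattern described in the channel, though it does not show what actually caused the alert.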
[14:55:02] (03PS2) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [14:55:23] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [14:58:02] (03CR) 10Effie Mouzeli: "We are missing a bit of context here. Can you please elaborate or create a task ?" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris) [14:58:11] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 (owner: 10Jbond) [15:00:25] (03PS1) 10Filippo Giunchedi: prometheus: alert on low job availability [puppet] - 10https://gerrit.wikimedia.org/r/552521 (https://phabricator.wikimedia.org/T187708) [15:00:50] 10Operations, 10observability, 10Patch-For-Review: prometheus-pdns-exporter log noise about unexpected metrics - https://phabricator.wikimedia.org/T227411 (10Andrew) >>! In T227411#5683732, @MoritzMuehlenhoff wrote: > @Andrew : I created an (untested) patch which should fix this, can you take it from here?... 
[15:01:20] (03PS3) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [15:02:29] (03CR) 10Muehlenhoff: [C: 03+1] "LGTM, this is equivalent to the notrack parameter of ferm::service (which in turn relies on the NO_TRACK definition from modules/ferm/file" [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) (owner: 10BBlack) [15:04:24] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 (owner: 10Jbond) [15:06:24] (03PS2) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [15:08:00] (03PS4) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [15:09:36] (03CR) 10jerkins-bot: [V: 04-1] puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) (owner: 10Jbond) [15:10:10] 10Operations: Integrate Buster 10.2 point update - https://phabricator.wikimedia.org/T238519 (10MoritzMuehlenhoff) [15:11:02] (03CR) 10jerkins-bot: [V: 04-1] profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 (owner: 10Jbond) [15:12:08] (03CR) 10Jcrespo: [C: 03+2] prometheus-bacula-exporter: Restart service on code change [puppet] - 10https://gerrit.wikimedia.org/r/552491 (https://phabricator.wikimedia.org/T234900) (owner: 10Jcrespo) [15:12:58] (03PS5) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [15:14:01] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:16:47] !log ladsgroup@deploy1001 Synchronized php-1.35.0-wmf.5/extensions/Wikibase/repo/: Stop outputting anything in case of 304 responses in Special:EntityData (T238901) (duration: 00m 57s) [15:16:49] (03PS6) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [15:16:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:16:53] T238901: Wikibase's Special:EntityData should not emit when responding with HTTP code 304 - https://phabricator.wikimedia.org/T238901 [15:19:21] (03PS7) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [15:27:36] (03CR) 10Alexandros Kosiaris: "I am as well, hence the lack of a task for now. I am still trying to fully figure out if we deprecated X-Real-IP or not" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris) [15:30:10] (03PS8) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [15:33:41] (03PS9) 10Jbond: profile::idp::client::httpd: refactor [puppet] - 10https://gerrit.wikimedia.org/r/552512 [15:36:10] (03PS3) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [15:36:11] 10Operations, 10Discovery-Search, 10SRE-Access-Requests: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (10Ottomata) Since this instance is maintained by the search team, I think re-using analytics-search-users makes sense to me. We can re-evalua... 
[15:37:14] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10Ottomata) Could we make the cergen script itself modify the permissions after it creates the files? It won't ensure things are right... [15:37:45] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:38:04] (03PS4) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [15:39:25] 10Operations, 10observability, 10Patch-For-Review: prometheus-pdns-exporter log noise about unexpected metrics - https://phabricator.wikimedia.org/T227411 (10Andrew) That patch seems to quiet the alerts; I'll see about building and deploying [15:40:23] !log renumber AS17639 sessions in eqsin [15:40:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:40:40] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] Add support for PDNS 4 [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552467 (https://phabricator.wikimedia.org/T227411) (owner: 10Muehlenhoff) [15:41:05] (03PS5) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [15:43:49] (03PS1) 10Andrew Bogott: Bump changelog for pdns4 support [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552531 (https://phabricator.wikimedia.org/T227411) [15:43:55] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one 
or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [15:45:35] (03CR) 10Andrew Bogott: [C: 03+2] Bump changelog for pdns4 support [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552531 (https://phabricator.wikimedia.org/T227411) (owner: 10Andrew Bogott) [15:46:32] ACKNOWLEDGEMENT - Memory correctable errors -EDAC- on mw1239 is CRITICAL: 4.001 ge 4 Ayounsi still https://phabricator.wikimedia.org/T238018 https://wikitech.wikimedia.org/wiki/Monitoring/Memory%23Memory_correctable_errors_-EDAC- https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=mw1239&var-datasource=eqiad+prometheus/ops [15:46:45] (03CR) 10Muehlenhoff: Bump changelog for pdns4 support (031 comment) [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552531 (https://phabricator.wikimedia.org/T227411) (owner: 10Andrew Bogott) [15:47:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:51:20] 10Operations, 10MediaWiki-Cache, 10Performance-Team (Radar), 10User-Elukey: mcrouter does not remove a memcached shard from consistent hashing when timeouts happen - https://phabricator.wikimedia.org/T208934 (10elukey) [15:52:01] (03PS1) 10Andrew Bogott: reformat changelog line [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552535 [15:52:21] (03CR) 10Andrew Bogott: [V: 03+2 C: 03+2] reformat changelog line [debs/prometheus-pdns-exporter] - 10https://gerrit.wikimedia.org/r/552535 (owner: 10Andrew Bogott) [15:52:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: 
cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [15:53:00] (03PS6) 10Jbond: puppetboard: Add cas authentication [puppet] - 10https://gerrit.wikimedia.org/r/552513 (https://phabricator.wikimedia.org/T238924) [15:53:41] ^ looking at mw requests latency [15:58:05] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:03:52] 10Operations, 10observability, 10Patch-For-Review: prometheus-pdns-exporter log noise about unexpected metrics - https://phabricator.wikimedia.org/T227411 (10Andrew) 05Open→03Resolved a:03Andrew done -- logs are nice and quiet now. 
[16:04:31] (03PS1) 10Jbond: cas-puppetboard.wikimedia.org: add new cas protected puppetboard site [puppet] - 10https://gerrit.wikimedia.org/r/552536 (https://phabricator.wikimedia.org/T238924) [16:06:33] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:07:21] (03CR) 10ArielGlenn: "Someone else needs to make the call on which heder to use; if X-Client-IP turns out to be the choice, the dumps-related changes are good t" [puppet] - 10https://gerrit.wikimedia.org/r/552515 (owner: 10Alexandros Kosiaris) [16:09:15] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state [16:13:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:15:01] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:20:07] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:22:06] !log clean tombstones on prometheus1003 - T238807 [16:22:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:22:11] T238807: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 [16:22:49] PROBLEM - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is CRITICAL: 59.84 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:23:04] 10Operations, 10Wikimedia-Mailing-lists: Spam from a non-registered email address coming non-moderated to a restricted mailing list - https://phabricator.wikimedia.org/T238871 (10Quiddity) 05Open→03Invalid You've got that address listed in the "List of non-member addresses whose postings should be automati... 
[16:24:59] (03PS1) 10Papaul: DNS: Remove mgnt DNS for db2048 and db2061 [dns] - 10https://gerrit.wikimedia.org/r/552539 [16:25:13] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:25:22] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (10Papaul) [16:25:35] 10Operations, 10DC-Ops, 10decommission, 10fundraising-tech-ops: decommission alnilam.frack.codfw.wmnet - https://phabricator.wikimedia.org/T238233 (10Papaul) 05Open→03Resolved Complete [16:27:53] RECOVERY - Varnish traffic drop between 30min ago and now at codfw on icinga1001 is OK: (C)60 le (W)70 le 75.14 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [16:28:10] (03PS1) 10Gehel: wdqs: remove the ban of Guzzle user agent. 
[puppet] - 10https://gerrit.wikimedia.org/r/552540 [16:28:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:30:56] I'm going to deploy the security thingy [16:32:08] It's just postponed due to foooooooood [16:33:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:36:33] (03Abandoned) 10Papaul: DNS: Remove mgnt DNS for db2048 and db2061 [dns] - 10https://gerrit.wikimedia.org/r/552539 (owner: 10Papaul) [16:37:26] 10Operations, 10Dumps-Generation, 10Patch-For-Review: Migrate dumpsdata hosts to Stretch/Buster - https://phabricator.wikimedia.org/T224563 (10ArielGlenn) Given that the wikidata entity dumps are still finishing up the truthy gz files, and after that there will be bz2 recompression and the Lexemes, I'll be m... 
[16:42:11] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:45:37] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:47:11] (03PS1) 10Papaul: DNS: Remove mgmt DNS for db2048 and db2061 [dns] - 10https://gerrit.wikimedia.org/r/552542 [16:53:34] 10Operations, 10observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (10colewhite) 05Open→03Resolved [16:55:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [16:58:57] FYI, someone’s asking for an IP exception config change in #wikimedia-tech (for an event starting in an hour) [16:59:10] I’m not going to deploy that on a Friday evening, but if anyone else feels sufficiently adventurous… [16:59:41] (03PS2) 10Phamhi: labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) [17:02:48] (03PS3) 
10Phamhi: labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) [17:04:17] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:05:52] (03CR) 10Phamhi: labmon: add compatibility in buster (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [17:07:43] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:07:45] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10RobH) Please note this has had all the RAM/riser/cards reseated and continues to pass all Dell ePSA tests. @bblack: With the reseating of everything, shall we reimage and try using this sys... [17:07:49] 10Operations, 10Cloud-VPS (Debian Jessie Deprecation), 10Patch-For-Review, 10cloud-services-team (Kanban): Migrate labmon* to Stretch (or Buster, better yet!) - https://phabricator.wikimedia.org/T224585 (10Phamhi) As per suggestion, I have created different python files (no longer template) for different r... 
[17:08:37] 10Operations, 10observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (10colewhite) 05Resolved→03Open [17:09:13] !log restart prometheus on prometheus1004 - T238807 [17:09:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:09:18] T238807: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 [17:15:06] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Leighanna Mixter - https://phabricator.wikimedia.org/T238933 (10Slaporte) [17:17:25] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [17:17:28] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911221... [17:17:41] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3056.esams.wmnet'] ` Of which those **FAILED**: ` ['cp3056.esams.wmnet'] ` [17:18:01] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by bblack on cumin1001.eqiad.wmnet for hosts: ` ['cp3056.esams.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201911221... [17:19:23] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) a:05RobH→03BBlack Attempting reimage (see above). 
If it fails like before, it won't get very far (certainly not into production use). [17:21:19] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:23:17] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10CDanis) >>! In T238833#5684837, @Ottomata wrote: > Could we make the cergen script itself modify the permissions after it creates th... [17:24:43] (03PS2) 10BBlack: dnsrecursor: modernize notrack for udp:53 [puppet] - 10https://gerrit.wikimedia.org/r/552506 (https://phabricator.wikimedia.org/T98006) [17:27:55] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10Ottomata) Hm, would running cergen --generate with --force be enough? ` -F --force If not provied --force,... 
[17:28:09] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:28:16] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10Ottomata) I'll try to find some time soon to make cergen chmod after creating files. [17:30:33] !log clean tombstones on prometheus1004 - T238807 [17:30:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:30:39] T238807: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 [17:31:15] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10Ottomata) I will also update that doc for --force who wrote that!? ò_ô [17:32:01] (03CR) 10Arturo Borrero Gonzalez: "Looks better than before!" 
(033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [17:32:20] back to deploying the security thingy [17:34:10] (03PS1) 10BBlack: late_command: remove cpNNNN mkfs stuff [puppet] - 10https://gerrit.wikimedia.org/r/552547 (https://phabricator.wikimedia.org/T227432) [17:34:49] !log bblack@cumin1001 START - Cookbook sre.hosts.downtime [17:34:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:34:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:36:50] (03CR) 10BBlack: [C: 03+2] late_command: remove cpNNNN mkfs stuff [puppet] - 10https://gerrit.wikimedia.org/r/552547 (https://phabricator.wikimedia.org/T227432) (owner: 10BBlack) [17:36:54] !log bblack@cumin1001 END (FAIL) - Cookbook sre.hosts.downtime (exit_code=99) [17:36:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:33] (03PS2) 10Arturo Borrero Gonzalez: toolforge: new k8s: specify default backend for nginx-ingress [puppet] - 10https://gerrit.wikimedia.org/r/550347 (https://phabricator.wikimedia.org/T234032) [17:39:37] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [17:39:42] (03PS1) 10BBlack: cp3056: re-enable cache::nodes entry [puppet] - 10https://gerrit.wikimedia.org/r/552548 (https://phabricator.wikimedia.org/T236497) [17:40:12] (03CR) 10Bstorm: [C: 03+1] "I think we have the service in toolsbeta? I didn't check, but I seem to remember Bryan did that. If so we can test it there :)" [puppet] - 10https://gerrit.wikimedia.org/r/550347 (https://phabricator.wikimedia.org/T234032) (owner: 10Arturo Borrero Gonzalez) [17:40:49] (03CR) 10Filippo Giunchedi: "LGTM overall, see nit inline" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [17:42:00] (03CR) 10BBlack: [C: 03+2] cp3056: re-enable cache::nodes entry [puppet] - 10https://gerrit.wikimedia.org/r/552548 (https://phabricator.wikimedia.org/T236497) (owner: 10BBlack) [17:43:31] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:45:55] (03CR) 10Arturo Borrero Gonzalez: "> Patch Set 2: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/550347 (https://phabricator.wikimedia.org/T234032) (owner: 10Arturo Borrero Gonzalez) [17:46:15] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] toolforge: new k8s: specify default backend for nginx-ingress [puppet] - 10https://gerrit.wikimedia.org/r/550347 (https://phabricator.wikimedia.org/T234032) (owner: 10Arturo Borrero Gonzalez) [17:49:20] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic, 10Patch-For-Review: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp3056.esams.wmnet'] ` and were **ALL** successful. [17:51:19] 10Operations, 10serviceops: upgrade and rename krypton & create its codfw equivalent - https://phabricator.wikimedia.org/T224247 (10Dzahn) That's true, just had one last little todo here for the one in codfw. Doing that now. [17:52:18] <_joe_> !log repooling restbase2018 [17:52:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:31] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, 10Core Platform Team (Needs Cleaning - Services Operations): Move graphoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219923 (10Pchelolo) 05Stalled→03Declined Graphoid is likely going away, so we shouldn'... 
[17:52:33] 10Operations, 10Wikimedia-Logstash, 10observability, 10service-runner, and 2 others: Move service-runner to new logging infrastructure - https://phabricator.wikimedia.org/T211125 (10Pchelolo) [17:52:34] PROBLEM - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is CRITICAL: cluster=api_appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-m [17:53:49] 10Operations, 10CX-cxserver, 10Citoid, 10RESTBase, and 3 others: Decom legacy ex-parsoidcache cxserver, citoid, and restbase service hostnames - https://phabricator.wikimedia.org/T133001 (10Pchelolo) Nothing to do here for the core platform team anymore. [17:54:40] 10Operations, 10ops-esams, 10DC-Ops, 10Traffic, 10Patch-For-Review: cp3056 hardware issue - https://phabricator.wikimedia.org/T236497 (10BBlack) So far so good - it has completed all the initial puppetization stuff, which is much further than it got before. Given it's Friday and this node has a fishy hi... 
[17:55:07] (03CR) 10Bstorm: [C: 03+1] "I'll be around for a bit, so please merge soon in case of issues :)" [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [17:56:00] (03PS4) 10Elukey: role::dumps::distribution::server: add kerberos [puppet] - 10https://gerrit.wikimedia.org/r/550466 (https://phabricator.wikimedia.org/T234229) [17:56:48] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [17:56:51] The security thingy is over now [17:57:04] RECOVERY - High average GET latency for mw requests on api_appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=api_appserver&var-method=GET [17:57:33] 10Operations, 10Traffic, 10fixcopyright.wikimedia.org, 10Core Platform Team Workboards (Clinic Duty Team), and 3 others: Retire fixcopyright.wikimedia.org - https://phabricator.wikimedia.org/T238803 (10Jdforrester-WMF) [17:58:19] 10Operations, 10Analytics, 10ChangeProp, 10Core Platform Team, and 2 others: Consider the possibility of separating ChangeProp and JobQueue on Kafka level - https://phabricator.wikimedia.org/T199431 (10Pchelolo) It's still a viable idea, but I don't think we have the capacity to work on it now. Icebox. [17:59:14] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:00:27] 10Operations, 10observability: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 (10colewhite) 05Open→03Resolved [18:00:50] (03CR) 10Bstorm: [C: 03+1] role::dumps::distribution::server: add analytics refinery [puppet] - 10https://gerrit.wikimedia.org/r/550816 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [18:01:44] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:02:06] 10Operations, 10Discovery-Search, 10SRE-Access-Requests: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (10elukey) >>! In T238905#5684836, @Ottomata wrote: > Since this instance is maintained by the search team, I think re-using analytics-search-u... 
[18:02:57] !log restore prometheus services default settings - T238807 [18:03:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:02] T238807: Clean up ORES metrics - https://phabricator.wikimedia.org/T238807 [18:03:47] 10Operations, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10jijiki) [18:04:20] (03PS1) 10Jforrester: Delete fixcopyrightwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552549 (https://phabricator.wikimedia.org/T238803) [18:04:22] (03PS1) 10Jforrester: Drop ability to load SkinPerPage, EUCopyrightCampaign, and EUCopyrightCampaignSkin [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552550 (https://phabricator.wikimedia.org/T238803) [18:09:28] PROBLEM - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [18:10:10] (03PS1) 10Dzahn: wikimania_scholarships app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552551 (https://phabricator.wikimedia.org/T224247) [18:10:16] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:12:31] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10CGlenn) I checked the SRE Clinic Duty. Should I assign this ticket to most recent person on the rotation roster? 
[18:12:50] PROBLEM - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is CRITICAL: instance=127.0.0.1:9900 job=prometheus site=eqiad https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [18:15:59] (03PS1) 10Dzahn: iegreview app: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552552 (https://phabricator.wikimedia.org/T224247) [18:17:04] (03CR) 10Cicalese: "Thank you for working on this! SkinPerPage already was available prior to the creation of the fixcopyright wiki. It was an existing extens" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552550 (https://phabricator.wikimedia.org/T238803) (owner: 10Jforrester) [18:18:54] (03CR) 10Jforrester: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552550 (https://phabricator.wikimedia.org/T238803) (owner: 10Jforrester) [18:19:15] (03PS1) 10Dzahn: racktables: use codfw database when in codfw [puppet] - 10https://gerrit.wikimedia.org/r/552553 (https://phabricator.wikimedia.org/T224247) [18:19:49] PROBLEM - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is CRITICAL: instance=127.0.0.1:9900 job=prometheus https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [18:23:11] elukey: stil around? did you see brooke's +1's on the kerb patches (plus the comment: "merge soon"?) [18:23:26] although hm then that's merge fancy stuff ona friday... eh well [18:28:01] (03CR) 10Dzahn: "It is following how it was done for other services on https://phabricator.wikimedia.org/T210411 to stay consistent. 
Then you can change th" [dns] - 10https://gerrit.wikimedia.org/r/551938 (owner: 10Dzahn) [18:29:26] apergos: o/ - yes we synced and decided to postpone to next week, changes are harmless but friday etc.. [18:29:35] thanks for the ping :) [18:29:51] (03PS4) 10Dzahn: ATS/varnish: rename thorium director to analytics-web [puppet] - 10https://gerrit.wikimedia.org/r/551939 [18:30:04] (03CR) 10Dzahn: ATS/varnish: rename thorium director to analytics-web (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/551939 (owner: 10Dzahn) [18:34:33] RECOVERY - Prometheus prometheus1004/ops restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [18:35:13] (03PS6) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [18:38:14] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [18:38:57] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code={200,204} handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method= [18:39:03] PROBLEM - High average POST latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link 
https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [18:40:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [18:40:35] RECOVERY - High average POST latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=POST [18:42:04] (03CR) 10Cicalese: "> Patch Set 1:" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552550 (https://phabricator.wikimedia.org/T238803) (owner: 10Jforrester) [18:42:58] (03PS7) 10Dzahn: monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 [18:44:01] RECOVERY - Prometheus prometheus1004/global -or a Prometheus it scrapes- was restarted: beware possible monitoring artifacts on prometheus1004 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [18:45:01] RECOVERY - Prometheus prometheus1003/ops restarted: beware possible monitoring artifacts on prometheus1003 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_was_restarted https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [18:45:50] (03CR) 10jerkins-bot: [V: 04-1] monitoring: add data types to monitoring::service [puppet] - 10https://gerrit.wikimedia.org/r/551882 (owner: 10Dzahn) [18:46:08] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10Dzahn) [18:47:25] 10Operations, 10Wikimedia-Mailing-lists: Create OpenGLAM mailing list - https://phabricator.wikimedia.org/T238759 (10crusnov) 05Open→03Resolved a:03crusnov Hello! I have created the mailing list as requested. Listinfo: https://lists.wikimedia.org/mailman/listinfo/open-glam List admin page: https://lists... [18:51:42] 10Operations, 10Puppet, 10Patch-For-Review, 10User-jbond: Create NRPE check to alert when cergen certificates are due to expire - https://phabricator.wikimedia.org/T238833 (10Volans) >>! In T238833#5685019, @Ottomata wrote: > I'll try to find some time soon to make cergen chmod after creating files. FYI W... [18:52:04] 10Operations, 10Product-Analytics, 10SRE-Access-Requests: Search Console access for he.wikisource.org - https://phabricator.wikimedia.org/T238090 (10crusnov) a:05crusnov→03None Giving this to the next person on clinic duty. We still need to know the time limits and I believe some other information to com... 
[19:06:25] 04Critical Alert for device asw2-esams.mgmt.esams.wmnet - Primary outbound port utilisation over 80% [19:09:13] * Reedy squints [19:14:51] :/ [19:16:26] 04̶C̶r̶i̶t̶i̶c̶a̶l Device asw2-esams.mgmt.esams.wmnet recovered from Primary outbound port utilisation over 80% [19:16:36] (03PS2) 10CRusnov: admin: add cglenn to researchers and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321) (owner: 10Cwhite) [19:17:00] (03CR) 10Phamhi: labmon: add compatibility in buster (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [19:17:08] (03CR) 10CRusnov: "This change is ready for review." [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321) (owner: 10Cwhite) [19:18:57] (03CR) 10jerkins-bot: [V: 04-1] admin: add cglenn to researchers and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321) (owner: 10Cwhite) [19:22:13] (03PS2) 10Jforrester: Remove wgTorLoadNodes as it was removed in b5ccbee in 1.340-wmf.15+ [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550055 (owner: 10Reedy) [19:22:23] (03PS3) 10CRusnov: admin: add cglenn to researchers and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321) (owner: 10Cwhite) [19:22:34] (03CR) 10Jforrester: [C: 03+1] "Good to deploy whenever." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/550055 (owner: 10Reedy) [19:22:51] (03CR) 10Jforrester: [C: 03+1] "Good to deploy whenever." 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/552361 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [19:24:33] (03CR) 10Dzahn: [C: 03+1] "key and groups match the information on the ticket" [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321) (owner: 10Cwhite) [19:30:13] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10CDanis) At ~18:36 there was another spike in long-tail latency, but then, latency seemed to return to 'normal': https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red... [19:40:18] 10Operations, 10Cloud-VPS, 10Traffic, 10HTTPS, 10cloud-services-team (Kanban): add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486 (10bd808) >>! In T120486#5680210, @Krenair wrote: > done in https://gerrit.wikimedia.org/r/c/operations/puppet/+/482142 ? My guess is that @D... [19:41:14] 10Operations, 10serviceops: dropped packets to phab1003 22280/tcp - https://phabricator.wikimedia.org/T238781 (10ayounsi) 05Open→03Resolved Confirmed! [19:41:22] 10Operations, 10Phabricator, 10Traffic, 10serviceops: Phabricator downtime due to aphlict and websockets (aphlict current disabled) - https://phabricator.wikimedia.org/T238593 (10ayounsi) [19:44:07] 10Operations, 10Cloud-VPS, 10Traffic, 10HTTPS, 10cloud-services-team (Kanban): add a https-only option to dynamicproxy - https://phabricator.wikimedia.org/T120486 (10Dzahn) Yea, that's true. It's been a long time since i wrote that and i had a per-proxy feature in mind. I am ok with closing this ticket i... 
[19:45:21] (03CR) 10CRusnov: [C: 03+2] admin: add cglenn to researchers and analytics-privatedata-users [puppet] - 10https://gerrit.wikimedia.org/r/545974 (https://phabricator.wikimedia.org/T236321) (owner: 10Cwhite) [19:55:53] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: Requesting Access to Stat1004, Stat1006, Stat1007, notebook1003 and notebook1004 - https://phabricator.wikimedia.org/T236321 (10crusnov) 05Open→03Resolved Hello I have added the key above to the patch and merged it. This means that shortly (within... [20:03:03] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10crusnov) p:05Triage→03Normal [20:07:42] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10crusnov) According to the procedure for this request, end-dates for rechecking access are needed. Do you have an end-date in mind? Otherwise we should be able to a... [20:09:11] 10Operations, 10Discovery-Search, 10SRE-Access-Requests: Allow analytics-search-users members to sudo as the airflow user - https://phabricator.wikimedia.org/T238905 (10crusnov) p:05Triage→03Normal [20:15:54] (03PS1) 10Daniel Kinzler: Ping XML dump schema version at 0.10 for now. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552565 (https://phabricator.wikimedia.org/T238921) [20:16:34] (03PS2) 10Daniel Kinzler: Pin XML dump schema version at 0.10 for now. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/552565 (https://phabricator.wikimedia.org/T238921) [20:41:32] (03PS4) 10Phamhi: labmon: add compatibility in buster [puppet] - 10https://gerrit.wikimedia.org/r/552107 (https://phabricator.wikimedia.org/T224585) [20:45:02] (03PS1) 10Herron: add forwad/reverse entries for logstash 7 collector hosts [dns] - 10https://gerrit.wikimedia.org/r/552567 (https://phabricator.wikimedia.org/T234854) [20:45:24] (03CR) 10jerkins-bot: [V: 04-1] add forwad/reverse entries for logstash 7 collector hosts [dns] - 10https://gerrit.wikimedia.org/r/552567 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [20:49:50] (03PS2) 10Herron: add forwad/reverse entries for logstash 7 collector hosts [dns] - 10https://gerrit.wikimedia.org/r/552567 (https://phabricator.wikimedia.org/T234854) [20:58:01] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10wiki_willy) [20:58:22] (03PS1) 10RLazarus: pristine-tar data for poolcounter-prometheus-exporter_0.0~git20181011.d5cca4f.orig.tar.xz [debs/poolcounter-prometheus-exporter] (pristine-tar) - 10https://gerrit.wikimedia.org/r/552568 [20:59:16] 10Operations, 10ops-eqiad: (Need By 8/15/19) rack/setup/install ms-be105[7-9].eqiad.wmnet - https://phabricator.wikimedia.org/T237438 (10wiki_willy) [20:59:52] (03PS1) 10RLazarus: New upstream version 0.0~git20181011.d5cca4f [debs/poolcounter-prometheus-exporter] (upstream) - 10https://gerrit.wikimedia.org/r/552569 [20:59:54] (03PS1) 10RLazarus: Re-adding vendor directory [debs/poolcounter-prometheus-exporter] (upstream) - 10https://gerrit.wikimedia.org/r/552570 [20:59:56] 10Operations, 10ops-eqiad: (No Need By Date Provided) rack/setup/install cloudvirt-wdqs100[123].eqiad.wmnet - https://phabricator.wikimedia.org/T235685 (10wiki_willy) [21:00:36] 10Operations, 10ops-eqiad, 10fundraising-tech-ops: (No Need By Date Provided) rack/setup/install 
frban1001.eqiad.wmnet - https://phabricator.wikimedia.org/T234068 (10wiki_willy) [21:00:44] (03PS1) 10RLazarus: Ignore quilt dir .pc via .gitignore [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/552571 [21:00:46] (03PS1) 10RLazarus: Initial debianization [debs/poolcounter-prometheus-exporter] - 10https://gerrit.wikimedia.org/r/552572 [21:01:39] 10Operations, 10ops-eqiad: (No Need By Date Provided) replace scs-a8-eqiad - https://phabricator.wikimedia.org/T228919 (10wiki_willy) [21:02:12] 10Operations, 10Parsoid-PHP, 10serviceops, 10Patch-For-Review: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10ssastry) @Joe @Dzahn can that memory bump patch ( https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/548944 ) be deployed next week? Or, are we waiting... [21:02:35] (03CR) 10RLazarus: [C: 03+2] pristine-tar data for poolcounter-prometheus-exporter_0.0~git20181011.d5cca4f.orig.tar.xz [debs/poolcounter-prometheus-exporter] (pristine-tar) - 10https://gerrit.wikimedia.org/r/552568 (owner: 10RLazarus) [21:03:25] (03CR) 10RLazarus: [V: 03+2 C: 03+2] pristine-tar data for poolcounter-prometheus-exporter_0.0~git20181011.d5cca4f.orig.tar.xz [debs/poolcounter-prometheus-exporter] (pristine-tar) - 10https://gerrit.wikimedia.org/r/552568 (owner: 10RLazarus) [21:06:05] (03CR) 10Herron: [C: 03+2] add forwad/reverse entries for logstash 7 collector hosts [dns] - 10https://gerrit.wikimedia.org/r/552567 (https://phabricator.wikimedia.org/T234854) (owner: 10Herron) [21:06:10] (03PS1) 10Andrew Bogott: Remove puppetpanel.pp -- unused [puppet] - 10https://gerrit.wikimedia.org/r/552574 [21:07:15] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10Iflorez) Hello @crusnov, Thank you for your feedback and help to get access. >>! In T238868#5685490, @crusnov wrote: > According to the procedure for this reques... 
[21:08:57] (03CR) 10Andrew Bogott: [C: 03+2] Remove puppetpanel.pp -- unused [puppet] - 10https://gerrit.wikimedia.org/r/552574 (owner: 10Andrew Bogott) [21:28:01] 10Operations, 10ops-codfw, 10fundraising-tech-ops: (No Need By Date Provided) rack/setup/install frban2001.frack.codfw.wmnet - https://phabricator.wikimedia.org/T234069 (10wiki_willy) [21:33:11] (03PS2) 10Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [21:36:09] (03CR) 10jerkins-bot: [V: 04-1] wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott) [21:43:16] (03PS3) 10Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [21:46:20] (03CR) 10jerkins-bot: [V: 04-1] wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) (owner: 10Andrew Bogott) [21:49:26] (03PS4) 10Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [21:50:59] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:51:43] (03PS1) 10Reedy: Add webservices.picturae.com to $wgCopyUploadsDomains [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/552579 (https://phabricator.wikimedia.org/T238955) [21:52:43] (03CR) 10Reedy: [C: 03+2] Add webservices.picturae.com to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552579 (https://phabricator.wikimedia.org/T238955) (owner: 10Reedy) [21:53:35] (03Merged) 10jenkins-bot: Add webservices.picturae.com to $wgCopyUploadsDomains [mediawiki-config] - 10https://gerrit.wikimedia.org/r/552579 (https://phabricator.wikimedia.org/T238955) (owner: 10Reedy) [21:55:41] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T238955 (duration: 00m 53s) [21:55:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:55:47] T238955: Please add webservices.picturae.com to $wgCopyUploadsDomains - https://phabricator.wikimedia.org/T238955 [21:55:53] (03PS5) 10Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [21:57:47] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET [21:58:05] (03PS6) 10Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [22:00:27] (03PS7) 10Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [22:04:32] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, and 2 others: Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) [22:04:46] (03PS8) 10Andrew Bogott: wmf_sink: delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [22:07:17] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10jijiki) @wiki_willy I will provide racking instructions on Monday for you, sorry we have delayed you this much. 
[22:08:08] 10Operations, 10ops-eqiad, 10serviceops: (No Need By Date Provided) rack/setup/install mw13[49-84].eqiad.wmnet - https://phabricator.wikimedia.org/T236437 (10wiki_willy) Thanks @jijiki , much appreciated [22:10:06] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10jijiki) [22:10:27] (03PS9) 10Andrew Bogott: wmf_sink: Prepare to delete instance puppet config from git on instance deletion [puppet] - 10https://gerrit.wikimedia.org/r/552348 (https://phabricator.wikimedia.org/T238708) [22:10:29] (03PS1) 10Andrew Bogott: wmf_sink: remove instance-puppet git entries for deleted VMs [puppet] - 10https://gerrit.wikimedia.org/r/552583 (https://phabricator.wikimedia.org/T238708) [22:10:48] 10Operations, 10Traffic, 10serviceops: Increased latency in appservers - 22 Nov 2019 - https://phabricator.wikimedia.org/T238939 (10jijiki) [22:11:04] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) [22:11:15] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) 05Open→03Resolved [22:11:23] 10Operations, 10Phabricator, 10serviceops, 10Patch-For-Review, and 3 others: Apache on phab1001 is gradually leaking worker processes which are stuck in "Gracefully finishing" state - https://phabricator.wikimedia.org/T182832 (10Dzahn) [22:11:46] 10Operations, 10hardware-requests, 10serviceops: requesting WMF7426 as phabricator system in eqiad - https://phabricator.wikimedia.org/T215335 (10Dzahn) We will give this back in T238957. 
[22:16:21] 10Operations, 10Phabricator, 10hardware-requests, 10serviceops, 10Release-Engineering-Team (Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (10Dzahn) After further discussion with Mukunda a... [22:16:34] 10Operations, 10Phabricator, 10hardware-requests, 10serviceops, 10Release-Engineering-Team (Development services): The phabricator server, WMF7426, was given to us temporarily, we would like to make it permanent - https://phabricator.wikimedia.org/T232887 (10Dzahn) 05Open→03Declined [22:16:39] 10Operations, 10Phabricator, 10Release-Engineering-Team-TODO, 10serviceops, 10Release-Engineering-Team (Development services): Reimage both phab1001 and phab2001 to stretch / buster - https://phabricator.wikimedia.org/T190568 (10Dzahn) [22:16:42] (03CR) 10Jforrester: "Please don't write (or merge) patches that fail the requirements for gerrit changes. In particular, it should be impossible to use the sam" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/546369 (https://phabricator.wikimedia.org/T231178) (owner: 10DannyS712) [22:23:23] 10Operations, 10GLOW, 10SRE-Access-Requests: Requesting access to sites from Google Search Console - https://phabricator.wikimedia.org/T238868 (10crusnov) Until september 2020 seems a reasonable timeframe (the docs say "typically aronud one year"). Listing all of the sites now would likely be easiest, yes,... 
[22:29:01] (03CR) 10Jforrester: "> Patch Set 2: -Code-Review" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [22:29:08] (03CR) 10Jforrester: [C: 03+1] Drop HHVMRequestInit, never called [mediawiki-config] - 10https://gerrit.wikimedia.org/r/542184 (https://phabricator.wikimedia.org/T235142) (owner: 10Jforrester) [22:37:19] (03PS1) 10Dzahn: phabricator/conftool: switch phab-vcs (git-ssh) service to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552589 (https://phabricator.wikimedia.org/T238956) [22:39:31] (03PS1) 10Dzahn: phabricator: switch "active server" from phab1003 to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552591 (https://phabricator.wikimedia.org/T238956) [22:40:51] (03PS1) 10Dzahn: phabricator: remove phab1003 from list of phab servers [puppet] - 10https://gerrit.wikimedia.org/r/552592 (https://phabricator.wikimedia.org/T238957) [22:43:50] (03PS1) 10Dzahn: dumps/phabricator: switch dumps host from phab1003 to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552593 (https://phabricator.wikimedia.org/T238956) [22:44:11] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Aklapper) [22:45:00] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) a:03Dzahn [22:45:56] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) @RStallman-legalteam Please let Max sign the volunteer NDA docs. [22:58:27] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Dzahn) @MaxSem Wanna sign L2 as well? 
[23:00:33] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10crusnov) p:05Triage→03Normal [23:01:00] 10Operations, 10LDAP-Access-Requests, 10SRE-Access-Requests: Conversion to volunteer NDA for MaxSem - https://phabricator.wikimedia.org/T238960 (10Krenair) As far as I know, no NDA is required for beta cluster access. [23:07:22] (03PS1) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [23:08:59] (03PS2) 10Dzahn: admins: add Max Semenik as ldap_only_admin [puppet] - 10https://gerrit.wikimedia.org/r/552594 (https://phabricator.wikimedia.org/T238960) [23:13:14] (03CR) 10Papaul: [C: 03+2] DNS: Remove mgmt DNS for db2048 and db2061 [dns] - 10https://gerrit.wikimedia.org/r/552542 (owner: 10Papaul) [23:14:32] (03PS1) 10Dzahn: varnish: switch phabricator backend to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552595 (https://phabricator.wikimedia.org/T238956) [23:16:04] (03PS1) 10Dzahn: phabricator: switch mail destination to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552597 (https://phabricator.wikimedia.org/T238956) [23:16:33] (03PS2) 10Papaul: DNS: Remove mgmt DNS for db2048 and db2061 [dns] - 10https://gerrit.wikimedia.org/r/552542 [23:16:51] (03CR) 10Papaul: [V: 03+2 C: 03+2] DNS: Remove mgmt DNS for db2048 and db2061 [dns] - 10https://gerrit.wikimedia.org/r/552542 (owner: 10Papaul) [23:17:57] (03PS1) 10Dzahn: switch discovery record for phabricator to 1001 for ATS [dns] - 10https://gerrit.wikimedia.org/r/552598 (https://phabricator.wikimedia.org/T238956) [23:18:26] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 (10Papaul) [23:18:41] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2061.codfw.wmnet - https://phabricator.wikimedia.org/T238526 
(10Papaul) 05Open→03Resolved Complete [23:18:44] 10Operations, 10DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (10Papaul) [23:19:09] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (10Papaul) [23:19:32] 10Operations, 10ops-codfw, 10decommission, 10Patch-For-Review: Decommission db2048.codfw.wmnet - https://phabricator.wikimedia.org/T237913 (10Papaul) 05Open→03Resolved Complete [23:19:34] 10Operations, 10DBA: Decommission db2043-db2070 - https://phabricator.wikimedia.org/T228258 (10Papaul) [23:22:25] (03PS2) 10Dzahn: admin: Remove myself (MaxSem) [puppet] - 10https://gerrit.wikimedia.org/r/552389 (https://phabricator.wikimedia.org/T238960) (owner: 10MaxSem) [23:22:50] (03PS1) 10Dzahn: remove service IPs and IPv6 for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/552599 (https://phabricator.wikimedia.org/T238957) [23:24:44] 10Operations, 10Parsoid-PHP, 10serviceops, 10Patch-For-Review: wt2html: Out of memory crashers - https://phabricator.wikimedia.org/T236833 (10Dzahn) For my part it is blocked on first merging https://gerrit.wikimedia.org/r/c/operations/puppet/+/546448 because it uses that. 
[23:28:16] (03PS1) 10Dzahn: remove production IPs for phab1003 [dns] - 10https://gerrit.wikimedia.org/r/552601 (https://phabricator.wikimedia.org/T238957) [23:30:23] (03PS1) 10Dzahn: site: turn phab1003 into a spare::system [puppet] - 10https://gerrit.wikimedia.org/r/552603 (https://phabricator.wikimedia.org/T238957) [23:33:04] (03PS1) 10Dzahn: mtail: stop using phab1003 for tests, use phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) [23:36:10] (03PS1) 10Dzahn: mariadb: remove grants for users on phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/552607 (https://phabricator.wikimedia.org/T238957) [23:36:14] (03CR) 10jerkins-bot: [V: 04-1] mtail: stop using phab1003 for tests, use phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552604 (https://phabricator.wikimedia.org/T238957) (owner: 10Dzahn) [23:37:31] (03PS2) 10Dzahn: dumps/phabricator: switch dumps host from phab1003 to phab1001 [puppet] - 10https://gerrit.wikimedia.org/r/552593 (https://phabricator.wikimedia.org/T238956) [23:48:52] (03PS1) 10Dzahn: install_server: remove phab1003 [puppet] - 10https://gerrit.wikimedia.org/r/552609 (https://phabricator.wikimedia.org/T238957) [23:56:45] PROBLEM - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is CRITICAL: 56.61 le 60 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:58:27] RECOVERY - Varnish traffic drop between 30min ago and now at ulsfo on icinga1001 is OK: (C)60 le (W)70 le 71.12 https://wikitech.wikimedia.org/wiki/Varnish%23Diagnosing_Varnish_alerts https://grafana.wikimedia.org/dashboard/db/varnish-http-requests?panelId=6&fullscreen&orgId=1 [23:59:12] 10Operations, 10LDAP-Access-Requests: LDAP access to the wmf group for Leighanna Mixter - https://phabricator.wikimedia.org/T238933 (10Dzahn) 05Open→03Resolved a:03Dzahn Hi @Slaporte, this is already the case. 
LDAP user "lmixter" is already a member of the WMF group. Let us know if something specific...