[00:00:04] RoanKattouw, Niharika, and Urbanecm: Dear deployers, time to do the Evening SWAT(Max 6 patches) deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20191213T0000).
[00:00:04] Ammarpad and RoanKattouw: A patch you scheduled for Evening SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[00:00:11] I'll do the SWAT
[00:00:42] (PS3) Catrope: GrowthExperiments: Align help panel new account enabling with homepage [mediawiki-config] - https://gerrit.wikimedia.org/r/553222 (https://phabricator.wikimedia.org/T232396)
[00:00:47] (CR) Catrope: [C: +2] GrowthExperiments: Align help panel new account enabling with homepage [mediawiki-config] - https://gerrit.wikimedia.org/r/553222 (https://phabricator.wikimedia.org/T232396) (owner: Catrope)
[00:01:27] (CR) Bstorm: toolforge-k8s: harden the init config a bit (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) (owner: Bstorm)
[00:01:45] (Merged) jenkins-bot: GrowthExperiments: Align help panel new account enabling with homepage [mediawiki-config] - https://gerrit.wikimedia.org/r/553222 (https://phabricator.wikimedia.org/T232396) (owner: Catrope)
[00:10:58] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Align help panel new account enabling with homepage (T232396) (duration: 00m 56s)
[00:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:11:03] T232396: Variant tests: align treatment groups - https://phabricator.wikimedia.org/T232396
[00:21:58] !log catrope@deploy1001 Synchronized php-1.35.0-wmf.10/extensions/GrowthExperiments/: GrowthExperiments: record suggestededits pre-activation as a preference (T238888) (duration: 00m 55s)
[00:22:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:22:04] T238888: Variant tests: "initiation" test - https://phabricator.wikimedia.org/T238888
[00:25:41] (PS4) Catrope: GrowthExperiments: Begin "initiation test" for suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/553225 (https://phabricator.wikimedia.org/T238888)
[00:26:04] (CR) Catrope: [C: +2] GrowthExperiments: Begin "initiation test" for suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/553225 (https://phabricator.wikimedia.org/T238888) (owner: Catrope)
[00:27:00] (Merged) jenkins-bot: GrowthExperiments: Begin "initiation test" for suggested edits [mediawiki-config] - https://gerrit.wikimedia.org/r/553225 (https://phabricator.wikimedia.org/T238888) (owner: Catrope)
[00:27:46] +/26
[00:29:58] (PS1) BBlack: dotls: add OCSP stapling support [puppet] - https://gerrit.wikimedia.org/r/556836
[00:32:04] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: GrowthExperiments: Begin "initiation test" for suggested edits (T238888) (duration: 00m 55s)
[00:32:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[00:32:10] T238888: Variant tests: "initiation" test - https://phabricator.wikimedia.org/T238888
[00:33:25] Critical Alert for device cr2-eqiad.wikimedia.org - Primary outbound port utilisation over 80%
[00:35:54] (CR) BBlack: [C: +2] dotls: add OCSP stapling support [puppet] - https://gerrit.wikimedia.org/r/556836 (owner: BBlack)
[00:53:24] C̶r̶i̶t̶i̶c̶a̶l Device cr2-eqiad.wikimedia.org recovered from Primary outbound port utilisation over 80%
[01:06:52] (PS1) Ammarpad: Add namespace aliases for zhwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/556847 (https://phabricator.wikimedia.org/T240428)
[01:07:35] (CR) jerkins-bot: [V: -1] Add namespace aliases for zhwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/556847 (https://phabricator.wikimedia.org/T240428) (owner: Ammarpad)
[01:14:25] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:19:49] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:20:45] (PS2) Ammarpad: Add namespace aliases for zhwikiquote [mediawiki-config] - https://gerrit.wikimedia.org/r/556847 (https://phabricator.wikimedia.org/T240428)
[01:23:27] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:25:46] Operations, Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (Anomie) For the Action API (api.php), it would be sufficient to return a 4xx or 5xx rather than a 200. But it sounds like tha...
[01:29:35] (PS1) Krinkle: [Beta Cluster] profiler: Enable XHGui backend [mediawiki-config] - https://gerrit.wikimedia.org/r/556850 (https://phabricator.wikimedia.org/T180761)
[01:32:13] (CR) Krinkle: [C: +2] [Beta Cluster] profiler: Enable XHGui backend [mediawiki-config] - https://gerrit.wikimedia.org/r/556850 (https://phabricator.wikimedia.org/T180761) (owner: Krinkle)
[01:32:29] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[01:33:04] (Merged) jenkins-bot: [Beta Cluster] profiler: Enable XHGui backend [mediawiki-config] - https://gerrit.wikimedia.org/r/556850 (https://phabricator.wikimedia.org/T180761) (owner: Krinkle)
[01:52:11] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: valid ROAs alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[02:25:16] Operations, DNS, Research, Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (BBlack) All of this is irregular and outside of policies we like to adhere to, but I'll push a zonefile to our nameservers which supports the bare minimum (existing Stanfo...
[02:26:38] (PS1) BBlack: wikiworkshop.org: new domain [dns] - https://gerrit.wikimedia.org/r/556853 (https://phabricator.wikimedia.org/T240303)
[02:27:12] (CR) BBlack: [C: +2] wikiworkshop.org: new domain [dns] - https://gerrit.wikimedia.org/r/556853 (https://phabricator.wikimedia.org/T240303) (owner: BBlack)
[02:28:06] (PS1) Krinkle: [Beta Cluster] mediawiki: install tideways on beta app servers [puppet] - https://gerrit.wikimedia.org/r/556854 (https://phabricator.wikimedia.org/T180761)
[02:50:47] PROBLEM - Backup freshness on backup1001 is CRITICAL: Stale-full only: 11 (etcd1006, ...), Fresh: 86 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring
[03:08:07] (PS1) Krinkle: profiler: Switch production xhgui destination from tungsten to xhgui1001 [mediawiki-config] - https://gerrit.wikimedia.org/r/556857 (https://phabricator.wikimedia.org/T180761)
[03:11:30] (CR) Krinkle: [C: -2] "Needs to be coordinated with data sync. Order of operations:" [mediawiki-config] - https://gerrit.wikimedia.org/r/556857 (https://phabricator.wikimedia.org/T180761) (owner: Krinkle)
[03:12:04] (CR) Krinkle: [C: +1] "Now unlocked. Beta has gone through all three steps. Now time for prod:" [puppet] - https://gerrit.wikimedia.org/r/552357 (https://phabricator.wikimedia.org/T158837) (owner: Dzahn)
[03:29:37] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[04:20:05] PROBLEM - Check systemd state on an-coord1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[04:36:29] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:45:29] (PS7) Andrew Bogott: nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375)
[04:46:10] (CR) Andrew Bogott: [C: +2] nova: add nova-api middleware to inject a default user_data file [puppet] - https://gerrit.wikimedia.org/r/556135 (https://phabricator.wikimedia.org/T181375) (owner: Andrew Bogott)
[04:50:55] PROBLEM - High average GET latency for mw requests on appserver in eqiad on icinga1001 is CRITICAL: cluster=appserver code=200 handler=proxy:unix:/run/php/fpm-www.sock https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[04:59:57] RECOVERY - High average GET latency for mw requests on appserver in eqiad on icinga1001 is OK: All metrics within thresholds. https://wikitech.wikimedia.org/wiki/Monitoring/Missing_notes_link https://grafana.wikimedia.org/d/RIA1lzDZk/application-servers-red-dashboard?panelId=9&fullscreen&orgId=1&from=now-3h&to=now&var-datasource=eqiad+prometheus/ops&var-cluster=appserver&var-method=GET
[05:08:43] PROBLEM - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[05:09:15] PROBLEM - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[06:01:30] <[1997kB]> cp5007 frontend, Varnish XID 275919026
[06:01:31] <[1997kB]> Error: 503, Backend fetch failed at Fri, 13 Dec 2019 06:00:35 GMT
[06:10:58] (PS1) TechneSiyam: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/556862
[06:11:00] (PS1) TechneSiyam: Added betawikiversity,hiwikibooks,ukwikinews hd logos [mediawiki-config] - https://gerrit.wikimedia.org/r/556863
[06:18:30] (PS1) TechneSiyam: Modified IS.php with hiwikibooks,betawikiversity,ukwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/556864
[06:19:32] (CR) jerkins-bot: [V: -1] Modified IS.php with hiwikibooks,betawikiversity,ukwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/556864 (owner: TechneSiyam)
[06:22:06] (CR) Muehlenhoff: "Two comments inline, you mentioned a Docker container which builds this, where should this be imported? component/ci?" (2 comments) [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: Hashar)
[06:30:36] !log installing libice security updates
[06:30:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:41:55] (PS1) Muehlenhoff: Update status message now that it's no longer a test cluster [puppet] - https://gerrit.wikimedia.org/r/556865
[06:46:09] Operations, Maps, Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Mathew.onipe)
[07:07:57] PROBLEM - Check systemd state on netbox1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:39:03] RECOVERY - Check systemd state on netbox1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:43:23] RECOVERY - Check systemd state on an-coord1001 is OK: OK - running: The system is fully operational https://wikitech.wikimedia.org/wiki/Monitoring/check_systemd_state
[07:54:33] RECOVERY - BFD status on cr2-eqdfw is OK: OK: UP: 10 AdminDown: 0 Down: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23BFD_status
[07:55:55] !log execute clear bfd session address fe80::ee38:7300:17e8:a04e on cr3-knams to restore BFD session with eqdfw (OSPF3 status ok on cr3-knams)
[07:55:59] !log depool maps1002 for postgres init. - T239728
[07:56:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:56:06] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728
[07:56:23] PROBLEM - Disk space on netflow2001 is CRITICAL: DISK CRITICAL - free space: / 302 MB (3% inode=91%): https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[08:03:01] some daemons are spamming logs --^
[08:05:19] RECOVERY - Disk space on netflow2001 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=netflow2001&var-datasource=codfw+prometheus/ops
[08:06:03] !log restart kafkatee-webrequest.service on netflow2001 (spamming logs about not being able to bind to address:port)
[08:06:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:06] !log restart fastmon on netflow2001 as attempt to stop spamming logs (failed)
[08:07:08] (CR) ArielGlenn: "Can we in fact hardcode their ipv4 and ipv6 addresses in? (I'd like to verify that the specific address will be good for years and years b" [puppet] - https://gerrit.wikimedia.org/r/556216 (owner: Bstorm)
[08:07:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:07:49] (CR) ArielGlenn: "Soooo... is the thought still to try this out?" [puppet] - https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349) (owner: Bstorm)
[08:07:52] !log restart kafkatee-webrequest.service on netflow1001 (spamming logs about not being able to bind to address:port)
[08:07:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:08:14] interesting, both netflow hosts are spamming at the same rate
[08:10:05] !log rm /var/log user.log.1 messages.1 daemon.log.1 kafkatee.log.1 syslog.1 on netflow2001 to free space (logs spammed with the same error message over and over)
[08:10:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:10:59] Operations, Maps, Discovery-Search (Current work): Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 (Mathew.onipe)
[08:21:45] (PS2) Hashar: Backports for Buster [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482)
[08:24:36] (CR) Hashar: Backports for Buster (2 comments) [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: Hashar)
[08:31:02] (CR) Giuseppe Lavagetto: [C: +2] prometheus::k8s: drop envoy metrics about the admin interface [puppet] - https://gerrit.wikimedia.org/r/553246 (owner: Giuseppe Lavagetto)
[08:34:18] Operations, netops: fastnetmon spamming /var/log on netflow hosts leading to disk saturation - https://phabricator.wikimedia.org/T240658 (elukey)
[08:40:17] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 79069968 and 5 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:41:29] Operations, Continuous-Integration-Infrastructure, Gerrit, Release-Engineering-Team (CI & Testing services), and 2 others: Upload zuul_2.5.1-wmf11 to apt.wikimedia.org - https://phabricator.wikimedia.org/T240570 (hashar) @jcrespo yes I lack write access :] We do not rebuild it: the package requi...
[08:43:17] Operations, observability, serviceops: rsyslogd: omkafka: action will suspended due to kafka error -187: Local: All broker connections are down - https://phabricator.wikimedia.org/T240560 (akosiaris) p:High→Normal Lowering priority per the comments above. Plus, there is a new theory in T214734.
[08:44:24] Operations, netops: BFD session alerts due to inconsistent status on cr3-knams - https://phabricator.wikimedia.org/T240659 (elukey)
[08:47:24] (CR) Elukey: [C: +2] Update status message now that it's no longer a test cluster [puppet] - https://gerrit.wikimedia.org/r/556865 (owner: Muehlenhoff)
[08:48:21] !log rebooting cloudvirt1023 to investigate some nova things
[08:48:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[08:56:09] (PS1) Ema: ATS: enable xdebug plugin on 3 hosts [puppet] - https://gerrit.wikimedia.org/r/556973 (https://phabricator.wikimedia.org/T238494)
[08:56:11] (PS1) Ema: ATS: log origin Transfer-Encoding [puppet] - https://gerrit.wikimedia.org/r/556974 (https://phabricator.wikimedia.org/T238494)
[08:56:25] PROBLEM - Postgres Replication Lag on maps1001 is CRITICAL: POSTGRES_HOT_STANDBY_DELAY CRITICAL: DB template1 (host:localhost) 18295736 and 1 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[08:56:52] (PS1) Elukey: Analytics refine: blacklist fetchGoogleCloudVisionAnnotations [puppet] - https://gerrit.wikimedia.org/r/556976
[08:58:10] (CR) Giuseppe Lavagetto: [C: +2] contint: remove role ci::slave::saucelabs [puppet] - https://gerrit.wikimedia.org/r/556677 (https://phabricator.wikimedia.org/T240575) (owner: Hashar)
[08:59:24] (CR) Giuseppe Lavagetto: [C: +2] contint: remove role ci::slave::browsertests [puppet] - https://gerrit.wikimedia.org/r/556695 (https://phabricator.wikimedia.org/T220035) (owner: Hashar)
[08:59:47] (CR) Joal: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/556976 (owner: Elukey)
[09:00:51] (CR) Ema: [C: +2] ATS: enable xdebug plugin on 3 hosts [puppet] - https://gerrit.wikimedia.org/r/556973 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[09:01:24] (CR) Elukey: [C: +2] Analytics refine: blacklist fetchGoogleCloudVisionAnnotations [puppet] - https://gerrit.wikimedia.org/r/556976 (owner: Elukey)
[09:01:46] elukey: ok to puppet-merge your change?
[09:01:57] yep!
[09:02:32] done!
[09:03:09] thanks!
[09:03:33] RECOVERY - Postgres Replication Lag on maps1001 is OK: POSTGRES_HOT_STANDBY_DELAY OK: DB template1 (host:localhost) 3128 and 103 seconds https://wikitech.wikimedia.org/wiki/Postgres%23Monitoring
[09:03:58] jbond42: I got a correct message that a lock was held, all good! Also really informative, thanks a lot for the work that you have done
[09:25:23] (CR) Muehlenhoff: Backports for Buster (1 comment) [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: Hashar)
[09:27:03] (PS3) Hashar: Backports for Buster [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482)
[09:35:47] (CR) Muehlenhoff: [C: +2] Backports for Buster [debs/doxygen] (debian/buster-backports) - https://gerrit.wikimedia.org/r/554942 (https://phabricator.wikimedia.org/T239482) (owner: Hashar)
[09:42:20] (CR) Ema: [C: +2] ATS: log origin Transfer-Encoding [puppet] - https://gerrit.wikimedia.org/r/556974 (https://phabricator.wikimedia.org/T238494) (owner: Ema)
[09:44:55] (PS1) Elukey: cumin: add analytics aliases [puppet] - https://gerrit.wikimedia.org/r/556980
[09:49:37] (CR) Muehlenhoff: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/556980 (owner: Elukey)
[09:50:01] (CR) Elukey: [C: +2] cumin: add analytics aliases [puppet] - https://gerrit.wikimedia.org/r/556980 (owner: Elukey)
[09:50:58] !log rzl@conf1006:~$ sudo systemctl restart etcd.service
[09:51:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:53:41] PROBLEM - PyBal connections to etcd on lvs3005 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[09:53:54] ^ me
[09:55:17] PROBLEM - PyBal connections to etcd on lvs3007 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[09:55:41] <_joe_> !log restarting pybal on lvs in esams (3007, then 3006 and 3005)
[09:55:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:56:11] PROBLEM - PyBal connections to etcd on lvs3006 is CRITICAL: CRITICAL: 0 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[10:00:37] RECOVERY - PyBal connections to etcd on lvs3007 is OK: OK: 12 connections established with conf1006.eqiad.wmnet:4001 (min=12) https://wikitech.wikimedia.org/wiki/PyBal
[10:01:33] RECOVERY - PyBal connections to etcd on lvs3006 is OK: OK: 4 connections established with conf1006.eqiad.wmnet:4001 (min=4) https://wikitech.wikimedia.org/wiki/PyBal
[10:02:17] Operations, Puppet, serviceops, Patch-For-Review, User-jbond: Rolling restart of etcd to pick up the renewed CA public certificate. - https://phabricator.wikimedia.org/T237362 (Joe) Open→Resolved a:Joe→RLazarus
[10:02:19] Operations, Puppet, Patch-For-Review, User-jbond: Document all uses of the puppetCA certificate - https://phabricator.wikimedia.org/T237259 (Joe)
[10:04:25] RECOVERY - PyBal connections to etcd on lvs3005 is OK: OK: 8 connections established with conf1006.eqiad.wmnet:4001 (min=8) https://wikitech.wikimedia.org/wiki/PyBal
[10:06:51] Operations, Pybal, Traffic: pybal fails to reconnect cleanly to etcd when etcd is restarted - https://phabricator.wikimedia.org/T240665 (Joe)
[10:09:32] (PS1) Muehlenhoff: Add component/ci for buster-wikimedia [puppet] - https://gerrit.wikimedia.org/r/556984 (https://phabricator.wikimedia.org/T239482)
[10:10:15] Operations, Pybal, Traffic: pybal fails to reconnect cleanly to etcd when etcd is restarted - https://phabricator.wikimedia.org/T240665 (ema) This is tracked in T169765, and it seems that this patch from @Vgutierrez should address the issue? https://gerrit.wikimedia.org/r/#/c/operations/debs/pybal/+...
[10:12:24] Operations, DNS, Research, Traffic: Add wikiworkshop.org to the Foundation's DNS - https://phabricator.wikimedia.org/T240303 (jcrespo) Allow me to suggest a meeting after Christmas between #Research and #Traffic, so there is mutual understanding of what needs correction for this and future cases,...
[10:15:02] (CR) Muehlenhoff: [C: +2] Add component/ci for buster-wikimedia [puppet] - https://gerrit.wikimedia.org/r/556984 (https://phabricator.wikimedia.org/T239482) (owner: Muehlenhoff)
[10:17:36] !log cp4028: restart ats-be to enable xdebug plugin
[10:17:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:20:11] Operations, Continuous-Integration-Infrastructure, Gerrit, Release-Engineering-Team (CI & Testing services), and 2 others: Upload zuul_2.5.1-wmf11 to apt.wikimedia.org - https://phabricator.wikimedia.org/T240570 (jcrespo) a:jcrespo
[10:20:45] RECOVERY - traffic_server backend process restarted on cp4028 is OK: (C)2 ge (W)2 ge 1 https://wikitech.wikimedia.org/wiki/Apache_Traffic_Server https://grafana.wikimedia.org/d/6uhkG6OZk/ats-instance-drilldown?orgId=1&var-site=ulsfo+prometheus/ops&var-instance=cp4028&var-layer=backend
[10:21:21] Operations, Diffusion, Packaging, Patch-For-Review, and 4 others: Cannot connect to vcs@git-ssh.wikimedia.org (since move from phab1001 to phab1003) - https://phabricator.wikimedia.org/T224677 (zeljkofilipin) I guess [[ https://wikitech.wikimedia.org/wiki/Help:SSH_Fingerprints/git-ssh.wikimedia.o...
[10:30:16] !log uploaded doxygen 1.8.16-1~exp4~deb10+wmf1 to buster-wikimedia/component/ci T239482
[10:30:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:30:22] T239482: Update Doxygen in CI to 1.8.15 or greater - https://phabricator.wikimedia.org/T239482
[10:32:07] (CR) Arturo Borrero Gonzalez: [C: +1] "LGTM. Non-blocking comment inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) (owner: Bstorm)
[10:33:47] (CR) Arturo Borrero Gonzalez: [C: -1] "You likely need the change in site.pp as well for the new server name to be assigned a role." [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[10:35:09] (CR) Arturo Borrero Gonzalez: [C: -1] cloudvps: rename+reimage labmon1001 as cloudmetrics1001 (1 comment) [dns] - https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[10:40:19] (PS4) Elukey: Allow labstore hosts to contact Hadoop [puppet] - https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: Ottomata)
[10:41:38] Operations, Wikimedia-Logstash, User-fgiunchedi: Ingestion errors for production logs on ELK7 - https://phabricator.wikimedia.org/T240667 (fgiunchedi)
[10:42:26] (CR) Muehlenhoff: "Also note that there's a host-specific Hiera setting which needs to be renamed along: In hieradata/hosts/ (or maybe this is actually not n" [puppet] - https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[10:43:07] Operations, Continuous-Integration-Infrastructure, Gerrit, Release-Engineering-Team (CI & Testing services), and 2 others: Upload zuul_2.5.1-wmf11 to apt.wikimedia.org - https://phabricator.wikimedia.org/T240570 (jcrespo) @hashar Sorry for the delay, I wasn't aware of the procedure and issues you...
[10:46:03] Operations, Continuous-Integration-Infrastructure, Gerrit, Release-Engineering-Team (CI & Testing services), and 2 others: Upload zuul_2.5.1-wmf11 to apt.wikimedia.org - https://phabricator.wikimedia.org/T240570 (hashar) Open→Resolved > @hashar Sorry for the delay, I wasn't aware of the p...
[10:46:20] jynus: all good thank you ;] (re: zuul.deb)
[10:47:48] (PS1) RLazarus: httpbb: Add a test verifying query string behavior in the TLS redirect. [puppet] - https://gerrit.wikimedia.org/r/556989 (https://phabricator.wikimedia.org/T57857)
[10:49:20] (CR) Giuseppe Lavagetto: [C: +2] httpbb: Add a test verifying query string behavior in the TLS redirect. [puppet] - https://gerrit.wikimedia.org/r/556989 (https://phabricator.wikimedia.org/T57857) (owner: RLazarus)
[10:52:08] !log jmm@cumin2001 START - Cookbook sre.hosts.downtime
[10:52:09] !log jmm@cumin2001 END (PASS) - Cookbook sre.hosts.downtime (exit_code=0)
[10:52:10] !log rebooting mw2164 for microcode tests
[10:52:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:52:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:53:05] Operations, observability: Make grafana-next.wm.o HTTP 302 redirect to grafana.wm.o - https://phabricator.wikimedia.org/T240048 (fgiunchedi) >>! In T240048#5737123, @jcrespo wrote: > CC @fgiunchedi , although maybe it was someone else from Foundations that worked on this? Indeed, @CDanis will be taking...
[10:55:37] Operations, Pybal, Traffic: pybal fails to reconnect cleanly to etcd when etcd is restarted - https://phabricator.wikimedia.org/T240665 (jcrespo) Should I merge this into T169765 @Joe as per ema comment?
[11:03:47] (PS1) Elukey: Add fake analytics user keytab to analytics1028 [labs/private] - https://gerrit.wikimedia.org/r/556992
[11:03:56] (CR) Elukey: [V: +2 C: +2] Add fake analytics user keytab to analytics1028 [labs/private] - https://gerrit.wikimedia.org/r/556992 (owner: Elukey)
[11:06:01] Operations, netops, cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (aborrero)
[11:06:08] Operations, netops, cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (aborrero) p:Triage→Normal
[11:08:19] (CR) Elukey: "Needed to fix a rebase conflict, nothing really changed." [puppet] - https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: Ottomata)
[11:08:42] (PS5) Elukey: Enable Kerberos in Hadoop Analytics and Druid Analytics/Public [puppet] - https://gerrit.wikimedia.org/r/549566 (https://phabricator.wikimedia.org/T237269)
[11:12:26] (PS1) Arturo Borrero Gonzalez: network: data: cleanup unused WMCS ranges [puppet] - https://gerrit.wikimedia.org/r/556994 (https://phabricator.wikimedia.org/T240670)
[11:17:27] (PS1) Arturo Borrero Gonzalez: networks: cleanup unused WMCS ranges [dns] - https://gerrit.wikimedia.org/r/556995 (https://phabricator.wikimedia.org/T240670)
[11:26:50] (PS2) Arturo Borrero Gonzalez: network: data: cleanup unused WMCS ranges [puppet] - https://gerrit.wikimedia.org/r/556994 (https://phabricator.wikimedia.org/T240670)
[11:27:38] Operations, Puppet, Packaging, User-jbond: Create a resources for installing components - https://phabricator.wikimedia.org/T240324 (MoritzMuehlenhoff) a:MoritzMuehlenhoff
[11:37:31] (Abandoned) Urbanecm: Merge branch 'master' of https://gerrit.wikimedia.org/r/operations/mediawiki-config [mediawiki-config] - https://gerrit.wikimedia.org/r/556862 (owner: TechneSiyam)
[11:37:45] (PS2) Urbanecm: Added betawikiversity,hiwikibooks,ukwikinews hd logos [mediawiki-config] - https://gerrit.wikimedia.org/r/556863 (owner: TechneSiyam)
[11:37:59] (PS2) Urbanecm: Modified IS.php with hiwikibooks,betawikiversity,ukwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/556864 (owner: TechneSiyam)
[11:38:56] (CR) jerkins-bot: [V: -1] Modified IS.php with hiwikibooks,betawikiversity,ukwikinews [mediawiki-config] - https://gerrit.wikimedia.org/r/556864 (owner: TechneSiyam)
[11:42:12] (CR) Alexandros Kosiaris: [C: +1] network: data: cleanup unused WMCS ranges [puppet] - https://gerrit.wikimedia.org/r/556994 (https://phabricator.wikimedia.org/T240670) (owner: Arturo Borrero Gonzalez)
[11:46:34] !log installing tiff security updates
[11:46:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[11:47:26] (CR) Alexandros Kosiaris: [C: +2] [Beta Cluster] mediawiki: install tideways on beta app servers [puppet] - https://gerrit.wikimedia.org/r/556854 (https://phabricator.wikimedia.org/T180761) (owner: Krinkle)
[11:53:14] Operations, netops, Patch-For-Review, cloud-services-team (Kanban): WMCS: cleanup network allocations - https://phabricator.wikimedia.org/T240670 (aborrero) a:aborrero
[12:06:54] (PS4) Jbond: etcd::client::globalconfig: add ca_cert [puppet] - https://gerrit.wikimedia.org/r/556647 (https://phabricator.wikimedia.org/T237362)
[12:11:04] Operations, service-runner, serviceops, CPT Initiatives (RESTBase Split (CDP2)), and 5 others: RESTBase/RESTRouter/service-runner rate limiting plans - https://phabricator.wikimedia.org/T235437 (mobrovac) a:mobrovac→None
[12:11:19] Operations, Core Platform Team, TechCom, User-mobrovac: Service Ownership and Maintenance - https://phabricator.wikimedia.org/T122825 (mobrovac) a:mobrovac→None
[12:11:39] Operations, Core Platform Team Legacy (Later), Documentation, Service-Architecture, Services (later): Create a doc explaining the SLA between services and the monitoring tool - https://phabricator.wikimedia.org/T105780 (mobrovac) a:mobrovac→None
[12:40:39] Operations, cloud-services-team: tmpreaper possible race condition - https://phabricator.wikimedia.org/T151304 (jijiki) a:jijiki→None
[12:58:07] Operations, Traffic: API Querying for XML/JSON, you might get the Browser Connection Security warning HTML page (which is invalid XML) - https://phabricator.wikimedia.org/T240497 (jcrespo) @DavidBrooks Based on your last comment, it seems like the right actionables here would be to close this as "Invalid...
[13:16:45] (PS3) Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585)
[13:20:33] (PS1) Aklapper: Phabricator monthly email: Make 'Tasks Closed' query more performant [puppet] - https://gerrit.wikimedia.org/r/557008
[13:27:31] (PS1) Phamhi: cloudvps: cleanup labmon1002 dns records [dns] - https://gerrit.wikimedia.org/r/557017 (https://phabricator.wikimedia.org/T224585)
[13:34:34] (CR) Arturo Borrero Gonzalez: [C: +1] cloudvps: cleanup labmon1002 dns records [dns] - https://gerrit.wikimedia.org/r/557017 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[13:34:42] (CR) Phamhi: [C: +2] cloudvps: cleanup labmon1002 dns records [dns] - https://gerrit.wikimedia.org/r/557017 (https://phabricator.wikimedia.org/T224585) (owner: Phamhi)
[14:10:50] (PS3) Giuseppe Lavagetto: Add parsoid-php to the discovery records to switchover [cookbooks] - https://gerrit.wikimedia.org/r/545167
[14:14:36] (PS1) Jhedden: update ceph keydata key names [labs/private] - https://gerrit.wikimedia.org/r/557022
[14:14:58] (PS1) Jhedden: ceph: add libvirt rbd configuration [puppet] - https://gerrit.wikimedia.org/r/557023 (https://phabricator.wikimedia.org/T239918)
[14:15:34] (CR) Jhedden: [V: +2 C: +2] update ceph keydata key names [labs/private] - https://gerrit.wikimedia.org/r/557022 (owner: Jhedden)
[14:21:55] (PS2) Jhedden: ceph: add libvirt rbd configuration [puppet] - https://gerrit.wikimedia.org/r/557023 (https://phabricator.wikimedia.org/T239918)
[14:24:44] (CR) Jhedden: [C: +2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/19967/" [puppet] - https://gerrit.wikimedia.org/r/557023 (https://phabricator.wikimedia.org/T239918) (owner: Jhedden)
[14:26:34] (PS1) Giuseppe Lavagetto: Reset waitIndex on etcd error 401 [debs/pybal] (1.15-stretch) - https://gerrit.wikimedia.org/r/557024 (https://phabricator.wikimedia.org/T169765)
[14:37:06] !log pool maps1002 after postgres init - T239728
[14:37:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[14:37:12] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728
[14:38:34] (PS1) Ema: ATS: assign 8G instead of 2G to RAM caches on ats-be [puppet] - https://gerrit.wikimedia.org/r/557031 (https://phabricator.wikimedia.org/T238494)
[14:40:23] (PS9) Volans: Add image tracking support [software/debmonitor] - https://gerrit.wikimedia.org/r/552517 (https://phabricator.wikimedia.org/T237978) (owner: Muehlenhoff)
[14:41:23] (PS1) Elukey: cumin: add the presto alias to analytics-all-eqiad [puppet] - https://gerrit.wikimedia.org/r/557032
[14:48:27] (PS1) Jhedden: ceph: fix rbd client keyring format [puppet] - https://gerrit.wikimedia.org/r/557036 (https://phabricator.wikimedia.org/T239918)
[14:50:57] (CR) Elukey: [C: +2] cumin: add the presto alias to analytics-all-eqiad [puppet] - https://gerrit.wikimedia.org/r/557032 (owner: Elukey)
[14:51:12] !log depool maps1003 after postgres init - T239728
[14:51:14] (PS1) Jbond: puppet-compiler: test binary_file function [puppet] - https://gerrit.wikimedia.org/r/557038
[14:51:16] (PS1) Ema: ATS:
log Connection response header from origins [puppet] - 10https://gerrit.wikimedia.org/r/557039 (https://phabricator.wikimedia.org/T238494) [14:51:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:51:17] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728 [14:54:02] (03Abandoned) 10Jbond: puppet-compiler: test binary_file function [puppet] - 10https://gerrit.wikimedia.org/r/557038 (owner: 10Jbond) [15:00:59] (03CR) 10Jhedden: [C: 03+2] ceph: fix rbd client keyring format [puppet] - 10https://gerrit.wikimedia.org/r/557036 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [15:01:15] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: eqiad rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [15:02:04] (03PS1) 10Jbond: puppet-compiler: test binary_file function [puppet] - 10https://gerrit.wikimedia.org/r/557041 (https://phabricator.wikimedia.org/T236481) [15:09:29] (03CR) 10Ema: [C: 03+2] ATS: log Connection response header from origins [puppet] - 10https://gerrit.wikimedia.org/r/557039 (https://phabricator.wikimedia.org/T238494) (owner: 10Ema) [15:14:10] (03CR) 10Bstorm: toolforge-k8s: harden the init config a bit (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [15:15:19] (03CR) 10Bstorm: [C: 04-1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/556216 (owner: 10Bstorm) [15:16:25] (03PS2) 10Bstorm: toolforge-k8s: harden the init config a bit [puppet] - 10https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) [15:17:50] (03CR) 10Bstorm: "Sorry, been bogged down. We could at least try this patch to do something." 
[puppet] - 10https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349) (owner: 10Bstorm) [15:21:44] the RPKI alert has been flapping the past few days, not sure what is up with it, seems like rsyncs are failing from time to time? and then recovering on their own? [15:22:18] I was wondering the same [15:22:41] I'm hoping it keeps harmlessly flapping until Arzhel is back ;) [15:30:28] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: harden the init config a bit [puppet] - 10https://gerrit.wikimedia.org/r/556767 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [15:32:53] (03CR) 10ArielGlenn: "> > Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/556216 (owner: 10Bstorm) [15:33:15] yeah looks harmless [15:36:49] 👀 [15:42:46] (03CR) 10Alexandros Kosiaris: [C: 03+2] "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler-test/332/console says noop, merging" [puppet] - 10https://gerrit.wikimedia.org/r/556713 (https://phabricator.wikimedia.org/T239835) (owner: 10Alexandros Kosiaris) [15:47:01] (03PS1) 10Jbond: puppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 [15:47:26] (03CR) 10jerkins-bot: [V: 04-1] puppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 (owner: 10Jbond) [15:49:27] (03PS2) 10Jbond: uppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 [15:51:24] (03PS3) 10Jbond: uppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 [15:53:45] (03PS4) 10Jbond: puppet_compiler: add rich_data support [software/puppet-compiler] - 10https://gerrit.wikimedia.org/r/557050 [15:54:53] (03PS1) 10Ottomata: [WIP] Deploy analytics/hdfs-tools/deploy to hadoop clients [puppet] - 10https://gerrit.wikimedia.org/r/557052 (https://phabricator.wikimedia.org/T238326) [15:55:08] (03PS1) 10TechneSiyam: Added missing comma 
in line 1712 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053 [15:56:37] (03CR) 10jerkins-bot: [V: 04-1] Added missing comma in line 1712 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053 (owner: 10TechneSiyam) [15:56:46] (03CR) 10jerkins-bot: [V: 04-1] [WIP] Deploy analytics/hdfs-tools/deploy to hadoop clients [puppet] - 10https://gerrit.wikimedia.org/r/557052 (https://phabricator.wikimedia.org/T238326) (owner: 10Ottomata) [16:17:27] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [16:25:08] (03PS1) 10Jbond: yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) [16:25:10] (03PS1) 10Jbond: yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 [16:26:00] (03CR) 10jerkins-bot: [V: 04-1] yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 (owner: 10Jbond) [16:27:19] (03PS2) 10Jbond: yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) [16:29:04] (03PS2) 10Jbond: yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 [16:29:54] (03CR) 10jerkins-bot: [V: 04-1] yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 (owner: 10Jbond) [16:30:11] (03CR) 10jerkins-bot: [V: 04-1] yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [16:31:54] (03PS3) 10Jbond: yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) [16:32:25] (03PS3) 10Jbond: yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 
[16:33:19] (03CR) 10jerkins-bot: [V: 04-1] yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 (owner: 10Jbond) [16:34:51] (03CR) 10jerkins-bot: [V: 04-1] yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [16:35:46] (03PS4) 10Jbond: yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) [16:35:56] (03PS4) 10Jbond: yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 [16:37:31] (03PS1) 10Jhedden: ceph: add libvirt uuid and keydata [puppet] - 10https://gerrit.wikimedia.org/r/557064 (https://phabricator.wikimedia.org/T239918) [16:38:36] (03CR) 10jerkins-bot: [V: 04-1] yamllint: First stab at adding yamllint CI tests [puppet] - 10https://gerrit.wikimedia.org/r/557060 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [16:39:00] (03PS5) 10Jbond: yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 [16:39:16] (03PS6) 10Jbond: yamllint: tes yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 [16:40:23] (03PS7) 10Jbond: yamllint: test yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 (https://phabricator.wikimedia.org/T236954) [16:40:25] (03CR) 10jerkins-bot: [V: 04-1] yamllint: test yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [16:41:28] (03CR) 10jerkins-bot: [V: 04-1] yamllint: test yamllint CI [puppet] - 10https://gerrit.wikimedia.org/r/557061 (https://phabricator.wikimedia.org/T236954) (owner: 10Jbond) [16:44:46] (03CR) 10Jhedden: [C: 03+2] "PCC results: https://puppet-compiler.wmflabs.org/compiler1002/19970/" [puppet] - 10https://gerrit.wikimedia.org/r/557064 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [16:49:23] (03PS1) 10Bstorm: toolforge-k8s: set required kernel 
params for current kubelet options [puppet] - 10https://gerrit.wikimedia.org/r/557065 (https://phabricator.wikimedia.org/T240009) [16:51:39] (03PS2) 10Bstorm: toolforge-k8s: set required kernel params for current kubelet options [puppet] - 10https://gerrit.wikimedia.org/r/557065 (https://phabricator.wikimedia.org/T240009) [16:52:30] (03PS1) 10Ema: ATS: add SystemTap probe to trace session teardown [puppet] - 10https://gerrit.wikimedia.org/r/557066 (https://phabricator.wikimedia.org/T238494) [16:52:40] (03CR) 10Bstorm: [C: 03+2] toolforge-k8s: set required kernel params for current kubelet options [puppet] - 10https://gerrit.wikimedia.org/r/557065 (https://phabricator.wikimedia.org/T240009) (owner: 10Bstorm) [16:54:48] (03PS1) 10Jhedden: ceph: disable show_diff on keyring files [puppet] - 10https://gerrit.wikimedia.org/r/557067 (https://phabricator.wikimedia.org/T239918) [16:57:24] (03CR) 10Jhedden: [C: 03+2] ceph: disable show_diff on keyring files [puppet] - 10https://gerrit.wikimedia.org/r/557067 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [17:00:51] (03PS2) 10TechneSiyam: Added missing comma in line 1712,1782 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053 [17:03:00] (03CR) 10jerkins-bot: [V: 04-1] Added missing comma in line 1712,1782 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557053 (owner: 10TechneSiyam) [17:03:09] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:03:09] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:06:20] (03PS1) 10Jhedden: ceph: disable log output on virsh define [puppet] - 10https://gerrit.wikimedia.org/r/557070 (https://phabricator.wikimedia.org/T239918) [17:09:28] (03PS2) 10Jhedden: ceph: disable log output on 
virsh define [puppet] - 10https://gerrit.wikimedia.org/r/557070 (https://phabricator.wikimedia.org/T239918) [17:11:33] (03CR) 10Jhedden: [C: 03+2] ceph: disable log output on virsh define [puppet] - 10https://gerrit.wikimedia.org/r/557070 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [17:14:40] (03PS1) 10Bstorm: toolforge-k8s: Add a service and notify around kubelet [puppet] - 10https://gerrit.wikimedia.org/r/557071 [17:20:34] (03PS1) 10Jhedden: ceph: Fix libvirt uuid in key template [puppet] - 10https://gerrit.wikimedia.org/r/557072 (https://phabricator.wikimedia.org/T239918) [17:23:01] (03CR) 10Jhedden: [C: 03+2] ceph: Fix libvirt uuid in key template [puppet] - 10https://gerrit.wikimedia.org/r/557072 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [17:32:42] (03PS1) 1020after4: Install pygerrit2 on releases server [puppet] - 10https://gerrit.wikimedia.org/r/557075 (https://phabricator.wikimedia.org/T196517) [17:36:58] ACKNOWLEDGEMENT - OSPF status on cr1-codfw is CRITICAL: OSPFv2: 5/6 UP : OSPFv3: 5/6 UP CDanis overdue Telia planned maintenance PWIC104146 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:36:58] ACKNOWLEDGEMENT - Router interfaces on cr1-eqiad is CRITICAL: CRITICAL: host 208.80.154.196, interfaces up: 269, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis overdue Telia planned maintenance PWIC104146 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:37:38] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis Zayo unplanned repair TTN-0003768396 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status [17:37:38] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo unplanned repair TTN-0003768396 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [17:40:15] (03CR) 
10Thcipriani: [C: 03+1] "lgtm! https://puppet-compiler.wmflabs.org/compiler1003/19973/" [puppet] - 10https://gerrit.wikimedia.org/r/557075 (https://phabricator.wikimedia.org/T196517) (owner: 1020after4) [18:03:25] PROBLEM - rpki grafana alert on icinga1001 is CRITICAL: CRITICAL: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is alerting: rsync status alert. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/ [18:05:59] (03PS5) 10Elukey: Allow labstore hosts to contact Hadoop [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [18:06:01] (03PS1) 10Elukey: Move Analytics systemd timers on labstore nodes to local /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/557083 (https://phabricator.wikimedia.org/T234229) [18:06:03] (03PS1) 10Elukey: Enable kerberos on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/557084 (https://phabricator.wikimedia.org/T234229) [18:13:18] (03PS1) 10CDanis: dbctl: diffs nrpe: allow extra time to execute [puppet] - 10https://gerrit.wikimedia.org/r/557085 [18:16:05] (03CR) 10CDanis: [C: 03+2] "PCC looks good https://puppet-compiler.wmflabs.org/compiler1001/19975/" [puppet] - 10https://gerrit.wikimedia.org/r/557085 (owner: 10CDanis) [18:16:15] RECOVERY - Backup freshness on backup1001 is OK: Fresh: 97 jobs https://wikitech.wikimedia.org/wiki/Backups%23Monitoring [18:19:40] (03PS1) 10Jhedden: openstack: add ceph rbd support for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) [18:21:05] (03PS2) 10CDanis: dbctl: diffs nrpe: allow extra time to execute [puppet] - 10https://gerrit.wikimedia.org/r/557085 [18:28:42] (03CR) 10Ottomata: [C: 03+1] Move Analytics systemd timers on labstore nodes to local /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/557083 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [18:30:47] (03PS2) 10Jhedden: 
openstack: add ceph rbd support for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) [18:34:23] (03PS3) 10Jhedden: openstack: add ceph rbd support for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) [18:34:25] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 1.894e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [18:35:45] looks like a big cirrusSearchElasticaWrite job is queued up [18:35:49] (03CR) 10Jhedden: "PCC results: https://puppet-compiler.wmflabs.org/compiler1001/19978/" [puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [18:35:56] i expect lag will settle after it stops queueing so many [18:36:32] ottomata: hmm, can you find it? I rewrote all that to stop making big ElasticaWrite jobs [18:36:47] ottomata: i mean can you paste it somewhere, or copy into file on some machine [18:37:06] i guess it might queue lots of them, but they should be a few kb [18:37:38] hm, i think it's lots of them [18:37:40] not a big single one [18:37:48] there are just a LOT of messages in that topic [18:38:13] (03CR) 10Bstorm: [C: 03+1] "This almost seems like a time we could deprecate using the label labstore for things like this (switching to "dumps_dist_hosts" instead or" [puppet] - 10https://gerrit.wikimedia.org/r/546189 (https://phabricator.wikimedia.org/T234229) (owner: 10Ottomata) [18:38:28] so, this is kind of the new default for cirrussearch. we used to have 1 job that wrote to 3 clusters. 
Now we have 3 jobs that each write to one cluster. So you have the original jobs that frontload some heavy work, then 3x those jobs that actually load content from DBs and ship to elastic [18:38:42] ottomata: do we perhaps need some rate limiting, partitioning, or something else? [18:38:59] (03CR) 10Bstorm: [C: 03+1] Move Analytics systemd timers on labstore nodes to local /mnt/hdfs [puppet] - 10https://gerrit.wikimedia.org/r/557083 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [18:39:06] ebernhardson: maybe just more throughput? the lag is from mirror maker replicating to codfw [18:39:32] hmm [18:39:35] (03CR) 10Bstorm: [C: 03+1] Enable kerberos on labstore nodes [puppet] - 10https://gerrit.wikimedia.org/r/557084 (https://phabricator.wikimedia.org/T234229) (owner: 10Elukey) [18:39:37] hm, but the jobqueue workers execute in codfw? is that right? (don't really know much about jobqueue) [18:39:53] no workers should actually be operating in codfw, it should be basically unread replicas [18:40:11] i thought i remembered something a while ago about using the mostly unused app servers in codfw to do jobqueue work [18:40:16] but maybe not [18:40:23] hmm, maybe they do and i missed that [18:41:13] ebernhardson: [18:41:16] i think that topic needs more partitions [18:41:17] i think that would help [18:41:23] it only has one [18:41:37] i don't know if anything special needs to happen for cp jobqueue to make that work [18:41:41] ping Pchelolo ^^^ [18:41:47] ottomata: oh, i just realized as well because of how job queue works, you get 2x of these replicating to codfw [18:41:54] ? 
[18:42:10] (oh petr might be on vaca) [18:42:10] ottomata: job queue works by mw -> unpartitioned topic -> jobqueue reads in and rewrites to partitioned topic -> job runners run partitioned jobs [18:42:19] (03PS1) 10MSantos: WIP: Proton charts first draft [deployment-charts] - 10https://gerrit.wikimedia.org/r/557090 (https://phabricator.wikimedia.org/T238830) [18:42:29] ottomata: there is a kafka partition per-cluster in the partitioned topic, but not the source topic that mw writes to [18:42:45] why not just have mw write to partition topic? [18:42:50] you need specific keys? [18:42:53] i don't know, thats how this system works :P [18:42:53] partitions keys [18:42:54] ? [18:43:04] it is eventgate doing it [18:43:13] basically we already had per-sql db partitioning. We reused that but used per-cirrus cluster [18:43:31] lemme double check which repo [18:43:33] hmm, oo would be nice to have key based partitioning in stream config, so eventgate can be configured per stream/topic to do the right thing [18:44:29] ottomata: config is in mediawiki/services/change-propagation/jobqueue-deploy, so i guess changeprop [18:46:24] hm well the lag is on the unpartitioned topic [18:46:25] https://grafana.wikimedia.org/d/000000521/kafka-mirrormaker?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-lag_datasource=eqiad%20prometheus%2Fops&var-mirror_name=main-eqiad_to_main-codfw&refresh=5m&fullscreen&panelId=5 [18:46:32] it might be on the way down? 
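The flow described above (mw -> unpartitioned topic -> repartitioner -> partitioned topic -> job runners) can be sketched roughly as below. This is only an illustration under stated assumptions, not the change-propagation code: the `cluster` field, the partition count, the function names, and the use of CRC32 as the key hash are all hypothetical stand-ins.

```python
import itertools
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical; the chat only says the source topic has one


def partition_for_key(key: str, num_partitions: int) -> int:
    """Map a key to a partition in a stable way (CRC32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def repartition(jobs, num_partitions=NUM_PARTITIONS):
    """Read jobs from one unpartitioned stream and bucket them per partition.

    Jobs carrying a (hypothetical) "cluster" key always land in the same
    partition; jobs without a key are spread round-robin.
    """
    partitions = defaultdict(list)
    round_robin = itertools.cycle(range(num_partitions))
    for job in jobs:
        key = job.get("cluster")
        if key is None:
            target = next(round_robin)
        else:
            target = partition_for_key(key, num_partitions)
        partitions[target].append(job)
    return dict(partitions)


if __name__ == "__main__":
    source_topic = [
        {"cluster": "cluster-a", "id": 1},
        {"cluster": "cluster-a", "id": 2},
        {"id": 3},
        {"id": 4},
        {"id": 5},
    ]
    for partition, jobs in sorted(repartition(source_topic).items()):
        print(partition, [job["id"] for job in jobs])
```

With a single source partition every message goes through one bucket, which is the bottleneck being discussed; spreading unkeyed messages over more partitions spreads that load, and the repartitioner can still apply the right keys afterwards.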
[18:47:29] ottomata: likely that will spike every 2 hours [18:47:53] basically these spikes: https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?orgId=1&fullscreen&panelId=35 [18:48:30] aye [18:48:33] we just need more partitions [18:48:40] ebernhardson: if there is already a repartitioner in cp [18:48:47] then probably just adding more partitions to this topic will help [18:48:50] even without key partitioning [18:49:03] the cp consumer should just consume from the topic as it is now, and repartition with the right keys [18:49:08] if this might cause problems over the weekend we can turn off those spikes, they are a background thing that keeps everything in sync. They aren't critical on any short term [18:49:15] ah, no i think this is not urgent [18:49:18] ok [18:49:39] it would be good to look at soon, but we can wait for next week and/or petr if/when he comes back [18:49:43] wherever he is :p [18:49:58] :) [18:50:05] it might mean that the lag alert will fire every 2 hours [18:50:08] is this brand new ebernhardson ? [18:50:12] this change you made? [18:50:28] ottomata: the rewrite of the jobs just shipped with wmf.8, and that's what created what i was expecting to be 3x more jobs [18:50:39] which is.... this week? [18:51:08] hmm, either monday or thursday not sure. 
Lemme double check [18:51:18] but yes, this week [18:51:22] looks like it [18:51:23] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=now-7d&to=now&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-main&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All&fullscreen&panelId=54 [18:51:42] https://grafana.wikimedia.org/d/ePFPOkqiz/eventgate?orgId=1&refresh=1m&from=1575658297350&to=1576263097351&var-dc=eqiad%20prometheus%2Fk8s&var-service=eventgate-main&var-kafka_topic=All&var-kafka_broker=All&var-kafka_producer_type=All [18:52:20] so, the reason for lag is because a single kafka broker has to handle receiving these messages [18:52:26] and also mirrormaker has to consume from that broker [18:52:29] so more partitions will help for sure [18:52:55] alright, hopefully whatever mediawiki does to get jobs into the queue can support that. I'm not actually sure how that works...probably posting to some api [18:53:08] also https://issues.apache.org/jira/browse/KAFKA-8443 will help too [18:53:12] if we ever upgrade to kafka 2.4.0 [18:53:15] i'm almost certain mw isn't talking to kafka directly [18:53:24] it was a pain before :P [18:53:24] well mediawiki is posting to eventgate [18:53:26] via eventbus extension [18:53:32] ahh, ok [18:53:42] the kafka producer should roundrobin (or random?) 
the partition used [18:53:49] if no key or partitioner is specified [18:54:03] sounds good [18:54:26] so likely just adding partitions will be ok (i dunno if eventgate will need a bump, it might...probably not though i think the librdkafka producer is smart enough to update its partition metadata) [18:54:31] but, we probably should wait for petr [18:54:34] i'll make a task [18:54:48] (03PS1) 10Jhedden: ceph: add rbd secret to libvirt keystore [puppet] - 10https://gerrit.wikimedia.org/r/557092 (https://phabricator.wikimedia.org/T239918) [18:57:57] (03PS4) 10Jhedden: openstack: add ceph rbd support for nova-compute [puppet] - 10https://gerrit.wikimedia.org/r/557086 (https://phabricator.wikimedia.org/T239918) [18:58:44] (03CR) 10Jhedden: [C: 03+2] ceph: add rbd secret to libvirt keystore [puppet] - 10https://gerrit.wikimedia.org/r/557092 (https://phabricator.wikimedia.org/T239918) (owner: 10Jhedden) [18:59:01] (03PS4) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) [18:59:37] (03CR) 10jerkins-bot: [V: 04-1] cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [19:02:19] (03PS5) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) [19:03:16] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 524 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw [19:24:28] (03PS1) 10Mholloway: MachineVision: Fix typo in 'wgMachineVisionShowUploadWizardCallToAction' [mediawiki-config] - 
10https://gerrit.wikimedia.org/r/557094 [19:26:55] (03CR) 10Mholloway: [C: 03+2] MachineVision: Fix typo in 'wgMachineVisionShowUploadWizardCallToAction' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557094 (owner: 10Mholloway) [19:27:50] (03Merged) 10jenkins-bot: MachineVision: Fix typo in 'wgMachineVisionShowUploadWizardCallToAction' [mediawiki-config] - 10https://gerrit.wikimedia.org/r/557094 (owner: 10Mholloway) [19:33:13] !log mholloway-shell@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Fix typo in 'wgMachineVisionShowUploadWizardCallToAction' (duration: 01m 00s) [19:33:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:35:54] Hey all - need to sec-deploy T240487 now. [19:45:22] (03CR) 10Phamhi: "> Patch Set 2: Code-Review-1" (031 comment) [dns] - 10https://gerrit.wikimedia.org/r/555570 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [19:46:05] (03PS3) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [19:49:12] PROBLEM - Prometheus jobs reduced availability on icinga1001 is CRITICAL: job=pdu_sentry4 site=ulsfo https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:50:03] (03Abandoned) 10Pmiazga: Add History to article toolbar for all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556440 (https://phabricator.wikimedia.org/T232652) (owner: 10Pmiazga) [19:50:46] (03Abandoned) 10Pmiazga: Enable Article and Discussion tabs for all logged-in users [mediawiki-config] - 10https://gerrit.wikimedia.org/r/556439 (https://phabricator.wikimedia.org/T232594) (owner: 10Pmiazga) [19:51:00] RECOVERY - Prometheus jobs reduced availability on icinga1001 is OK: All metrics within thresholds. 
https://wikitech.wikimedia.org/wiki/Prometheus%23Prometheus_job_unavailable https://grafana.wikimedia.org/d/NEJu05xZz/prometheus-targets [19:57:06] (03PS4) 10Phamhi: cloudvps: rename+reimage labmon1001 as cloudmetrics1001 [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) [19:57:47] (03CR) 10Phamhi: "> Patch Set 2: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [19:58:04] (03CR) 10Phamhi: "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/555565 (https://phabricator.wikimedia.org/T224585) (owner: 10Phamhi) [20:07:08] !log Deployed security patch (via gerrit 557097) for T240487 to wmf.10 [20:07:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:25] (03PS1) 10Phamhi: wmcs: monitoring: remove rssh [puppet] - 10https://gerrit.wikimedia.org/r/557103 [20:21:11] (03PS2) 10Phamhi: wmcs: monitoring: remove rssh [puppet] - 10https://gerrit.wikimedia.org/r/557103 (https://phabricator.wikimedia.org/T224585) [20:22:01] (03PS1) 10Ottomata: Bump eventgate-logging-external eventgate image to 2019-12-13-200604-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/557104 (https://phabricator.wikimedia.org/T236386) [20:23:16] (03CR) 10Ottomata: [C: 03+2] Bump eventgate-logging-external eventgate image to 2019-12-13-200604-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/557104 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [20:23:48] (03Merged) 10jenkins-bot: Bump eventgate-logging-external eventgate image to 2019-12-13-200604-production [deployment-charts] - 10https://gerrit.wikimedia.org/r/557104 (https://phabricator.wikimedia.org/T236386) (owner: 10Ottomata) [20:26:34] !log otto@deploy1001 helmfile [STAGING] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' . 
[20:26:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:29:38] !log otto@deploy1001 helmfile [EQIAD] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' .
[20:29:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:31:44] !log otto@deploy1001 helmfile [CODFW] Ran 'apply' command on namespace 'eventgate-logging-external' for release 'logging-external' .
[20:31:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[20:33:56] PROBLEM - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is CRITICAL: 3.985e+05 gt 1e+05 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[21:11:46] RECOVERY - Kafka MirrorMaker main-eqiad_to_main-codfw max lag in last 10 minutes on icinga1001 is OK: (C)1e+05 gt (W)1e+04 gt 0 https://wikitech.wikimedia.org/wiki/Kafka/Administration https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=codfw+prometheus/ops&var-lag_datasource=eqiad+prometheus/ops&var-mirror_name=main-eqiad_to_main-codfw
[21:17:48] (PS2) Ottomata: Deploy analytics/hdfs-tools/deploy to hadoop clients [puppet] - https://gerrit.wikimedia.org/r/557052 (https://phabricator.wikimedia.org/T238326)
[21:21:27] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler1002/19980/" [puppet] - https://gerrit.wikimedia.org/r/557052 (https://phabricator.wikimedia.org/T238326) (owner: Ottomata)
[21:21:29] (CR) Ottomata: [C: +2] Deploy analytics/hdfs-tools/deploy to hadoop clients [puppet] - https://gerrit.wikimedia.org/r/557052 (https://phabricator.wikimedia.org/T238326) (owner: Ottomata)
[21:29:47] !log otto@deploy1001 Started deploy [hdfs-tools-deploy@c71e63a]: (no justification provided)
[21:29:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:29:54] !log disabled tilerator on maps200[1-3] - T239728
[21:29:55] !log otto@deploy1001 Finished deploy [hdfs-tools-deploy@c71e63a]: (no justification provided) (duration: 00m 08s)
[21:29:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:30:00] T239728: Re-import OSM data at eqiad and codfw to temporarily fix current OSM replication issues. - https://phabricator.wikimedia.org/T239728
[21:30:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:31:52] !log depool maps2004 for osm initial import - T239728
[21:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:34:26] RECOVERY - Router interfaces on cr2-eqiad is OK: OK: host 208.80.154.197, interfaces up: 242, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:34:54] RECOVERY - OSPF status on cr2-codfw is OK: OSPFv2: 5/5 UP : OSPFv3: 5/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:48:58] PROBLEM - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[21:49:24] PROBLEM - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[21:50:52] !log otto@deploy1001 Started deploy [analytics/hdfs-tools/deploy@06e5f42]: (no justification provided)
[21:50:55] !log otto@deploy1001 Finished deploy [analytics/hdfs-tools/deploy@06e5f42]: (no justification provided) (duration: 00m 03s)
[21:50:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:51:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:57:49] !log testing I0e0de86d by hand on mwdebug1001
[21:57:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:00] !log testing I0e0de86d by hand on mwdebug1001 T229686
[21:58:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:58:06] T229686: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686
[21:58:07] (PS1) Ottomata: Deploy hdfs-tools to profile::analytics::cluster::client hosts [puppet] - https://gerrit.wikimedia.org/r/557133 (https://phabricator.wikimedia.org/T234229)
[22:00:08] (PS1) Ottomata: Set up hdfs-rsync symlink in /usr/local/bin [puppet] - https://gerrit.wikimedia.org/r/557134
[22:00:48] (CR) Ottomata: [V: +2 C: +2] Deploy hdfs-tools to profile::analytics::cluster::client hosts [puppet] - https://gerrit.wikimedia.org/r/557133 (https://phabricator.wikimedia.org/T234229) (owner: Ottomata)
[22:01:17] (PS2) Ottomata: Set up hdfs-rsync symlink in /usr/local/bin [puppet] - https://gerrit.wikimedia.org/r/557134
[22:01:35] (PS3) Ottomata: Set up hdfs-rsync symlink in /usr/local/bin [puppet] - https://gerrit.wikimedia.org/r/557134
[22:02:15] (CR) jerkins-bot: [V: -1] Set up hdfs-rsync symlink in /usr/local/bin [puppet] - https://gerrit.wikimedia.org/r/557134 (owner: Ottomata)
[22:03:34] (CR) Ottomata: "https://puppet-compiler.wmflabs.org/compiler1001/19981/labstore1006.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/557134 (owner: Ottomata)
[22:03:41] (CR) Ottomata: [C: +2] Set up hdfs-rsync symlink in /usr/local/bin [puppet] - https://gerrit.wikimedia.org/r/557134 (owner: Ottomata)
[22:05:36] RECOVERY - rpki grafana alert on icinga1001 is OK: OK: RPKI ( https://grafana.wikimedia.org/d/UwUa77GZk/rpki ) is not alerting. https://wikitech.wikimedia.org/wiki/RPKI%23Grafana_alerts https://grafana.wikimedia.org/d/UwUa77GZk/
[22:17:36] (PS1) Bstorm: dumpsdistribution: get around unreliable DNS with an IP hardcode [puppet] - https://gerrit.wikimedia.org/r/557137
[22:18:06] ACKNOWLEDGEMENT - OSPF status on cr2-codfw is CRITICAL: OSPFv2: 4/5 UP : OSPFv3: 4/5 UP CDanis Zayo unplanned repair TTN-0003768396 https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[22:18:06] ACKNOWLEDGEMENT - Router interfaces on cr2-eqiad is CRITICAL: CRITICAL: host 208.80.154.197, interfaces up: 240, down: 1, dormant: 0, excluded: 0, unused: 0: CDanis Zayo unplanned repair TTN-0003768396 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[22:35:41] !log T229686 ✔️ cdanis@mwdebug1001.eqiad.wmnet /srv/mediawiki 🕠🍺 scap pull
[22:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:35:48] T229686: #dbctl: manage 'externalLoads' data - https://phabricator.wikimedia.org/T229686
[22:38:46] (PS1) Bstorm: toolforge-k8s: catching init config up to live config [puppet] - https://gerrit.wikimedia.org/r/557140 (https://phabricator.wikimedia.org/T240009)
[22:40:44] (CR) Bstorm: [C: +2] toolforge-k8s: catching init config up to live config [puppet] - https://gerrit.wikimedia.org/r/557140 (https://phabricator.wikimedia.org/T240009) (owner: Bstorm)
[22:50:41] (PS1) Bstorm: toolforge-k8s: enable encryption at rest [puppet] - https://gerrit.wikimedia.org/r/557144 (https://phabricator.wikimedia.org/T240009)
[23:06:35] (Abandoned) Bstorm: Revert "comment out sagres.c3sl.ufpr.br from dumps mirrors list" [puppet] - https://gerrit.wikimedia.org/r/556216 (owner: Bstorm)
[23:07:13] (PS2) Bstorm: dumps distribution: increase the rate limit to 5MBps [puppet] - https://gerrit.wikimedia.org/r/555632 (https://phabricator.wikimedia.org/T222349)
[23:15:50] RECOVERY - Router interfaces on cr1-eqiad is OK: OK: host 208.80.154.196, interfaces up: 271, down: 0, dormant: 0, excluded: 0, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[23:15:56] RECOVERY - OSPF status on cr1-codfw is OK: OSPFv2: 6/6 UP : OSPFv3: 6/6 UP https://wikitech.wikimedia.org/wiki/Network_monitoring%23OSPF_status
[23:58:18] PROBLEM - MediaWiki exceptions and fatals per minute on icinga1001 is CRITICAL: cluster=logstash job=statsd_exporter level=ERROR site=eqiad https://wikitech.wikimedia.org/wiki/Application_servers https://grafana.wikimedia.org/d/000000438/mediawiki-alerts?panelId=2&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops