[00:36:59] PROBLEM - puppet last run on wtp1044 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[01:03:35] RECOVERY - puppet last run on wtp1044 is OK: OK: Puppet is currently enabled, last run 13 seconds ago with 0 failures
[02:25:31] ACKNOWLEDGEMENT - Check systemd state on cloudcontrol1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. andrew bogott I probably caused this!
[02:56:01] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[02:57:05] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[02:58:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[02:59:21] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[02:59:57] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[03:03:03] PROBLEM - puppet last run on scandium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:03:41] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[03:03:47] PROBLEM - puppet last run on ms-be1030 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:04:55] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[03:05:57] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[03:29:13] PROBLEM - puppet last run on db1092 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[03:29:39] RECOVERY - puppet last run on scandium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[03:30:25] RECOVERY - puppet last run on ms-be1030 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[03:55:51] RECOVERY - puppet last run on db1092 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[04:04:17] (PS1) ArielGlenn: reduce further the sleep between wikis for addds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/508164
[04:07:41] (CR) ArielGlenn: [C: +2] reduce further the sleep between wikis for addds-changes dumps [dumps] - https://gerrit.wikimedia.org/r/508164 (owner: ArielGlenn)
[04:08:42] !log ariel@deploy1001 Started deploy [dumps/dumps@b4b7733]: reduce sleep time more between wikis for incrs
[04:08:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:08:48] !log ariel@deploy1001 Finished deploy [dumps/dumps@b4b7733]: reduce sleep time more between wikis for incrs (duration: 00m 05s)
[04:08:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[04:32:55] PROBLEM - puppet last run on db1082 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[04:54:38] (PS2) Marostegui: db1093: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127)
[04:54:51] (PS2) Marostegui: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127)
[04:59:27] RECOVERY - puppet last run on db1082 is OK: OK: Puppet is currently enabled, last run 36 seconds ago with 0 failures
[05:02:18] (CR) Marostegui: [C: +2] db1093: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/507264 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:03:55] (CR) Marostegui: [C: +2] db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:04:56] (Merged) jenkins-bot: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:06:08] Operations, ops-eqiad, DBA: SMART alerts on db1069 - https://phabricator.wikimedia.org/T222507 (Marostegui) Thanks @jijiki for creating the task. We are no longer creating tasks for predictive failures, we let them fail so the task gets created automatically. We track the predictive failures at {T208...
[05:06:19] (CR) jenkins-bot: db-eqiad.php: Give some traffic to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/507263 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[05:06:58] Operations, ops-eqiad, DBA: SMART alerts on db1069 - https://phabricator.wikimedia.org/T222507 (Marostegui)
[05:07:01] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:08:24] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some weight to db1093 (duration: 00m 58s)
[05:08:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:08:34] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui)
[05:09:50] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) >>! In T208323#5158076, @jcrespo wrote: > T222526 db2049 (again?) You might be confused with db2047, I don't recall db2049 having a disk replaced lately
[05:14:08] (CR) Marostegui: [C: +1] mariadb: Reenable notifications on db1139 and db1140 [puppet] - https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572) (owner: Jcrespo)
[05:46:30] PROBLEM - puppet last run on mw1228 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues
[05:47:05] (CR) Marostegui: [C: -1] backups: Decommission dbstore1001, dbstore2001 and dbstore2002 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) (owner: Jcrespo)
[05:53:55] (PS1) Marostegui: db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127)
[05:57:49] (PS1) Marostegui: mariadb: Promote db2045 to codfw x1 master [puppet] - https://gerrit.wikimedia.org/r/508168 (https://phabricator.wikimedia.org/T219493)
[05:59:36] (CR) Marostegui: [C: +2] db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[06:00:38] (Merged) jenkins-bot: db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[06:00:54] (CR) jenkins-bot: db-eqiad.php: Give some API weight to db1093 [mediawiki-config] - https://gerrit.wikimedia.org/r/508167 (https://phabricator.wikimedia.org/T222127) (owner: Marostegui)
[06:01:57] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Give some API traffic to db1093 (duration: 00m 52s)
[06:01:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:13:06] RECOVERY - puppet last run on mw1228 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:18:45] (PS1) Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725)
[06:29:02] PROBLEM - netbox HTTPS on netmon1002 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 547 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:29:17] Operations, serviceops, User-Joe: SRE FY2019 Q3 goal: Ramp-up serving traffic to PHP 7 - https://phabricator.wikimedia.org/T212828 (Joe)
[06:29:20] Operations, serviceops, Patch-For-Review, User-Joe: Set up A/B testing mechanism for PHP7, - https://phabricator.wikimedia.org/T216676 (Joe) Open→Resolved
[06:29:56] PROBLEM - puppet last run on dbproxy1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/profile.d/field.sh]
[06:30:16] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[06:30:22] PROBLEM - puppet last run on mw1289 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/cgroup-mediawiki-clean]
[06:30:34] PROBLEM - puppet last run on an-worker1084 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/R/update-library.R]
[06:31:56] PROBLEM - puppet last run on acmechief1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/local/bin/apt2xml]
[06:32:14] PROBLEM - puppet last run on ms-be1035 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/usr/lib/nagios/plugins/check_sysctl]
[06:32:58] PROBLEM - puppet last run on authdns2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/etc/rsyslog.d/20-confd.conf]
[06:33:08] (PS1) Elukey: service::uwsgi: add the core_limit parameter [puppet] - https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697)
[06:33:10] (PS1) Elukey: netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697)
[06:34:52] (CR) Elukey: "https://puppet-compiler.wmflabs.org/compiler1002/16338/netmon1002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[06:35:18] (CR) Marostegui: "The commit message says from db2103 to db2120, but I only see from db2103 to db2111, is that expected? I guess this patchset is still work" [dns] - https://gerrit.wikimedia.org/r/507613 (owner: Papaul)
[06:37:46] Operations, ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (Marostegui) p:Triage→Normal
[06:37:58] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync
[06:38:29] running puppet on netmon1002
[06:42:10] RECOVERY - netbox HTTPS on netmon1002 is OK: HTTP OK: HTTP/1.1 302 Found - 346 bytes in 0.012 second response time https://wikitech.wikimedia.org/wiki/Netbox
[06:43:56] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync
[06:44:29] in 5 mins those --^ will be re-executed, uwsgi was down
[06:48:53] (PS3) Luca Mauri: Create new http://www.mediawiki.org/xml/sitelist-1.1/ to reference sitelist-1.1.xsd [mediawiki-config] - https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516)
[06:49:18] (CR) Luca Mauri: "> This new file needs adding to xml/index.html too" [mediawiki-config] - https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: Luca Mauri)
[06:51:22] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational
[06:54:14] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync
[06:55:07] (PS1) Giuseppe Lavagetto: Disable the PHP7 beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128)
[06:55:20] (CR) jerkins-bot: [V: -1] Disable the PHP7 beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128) (owner: Giuseppe Lavagetto)
[06:56:20] (PS2) Giuseppe Lavagetto: Disable the PHP7 beta feature [mediawiki-config] - https://gerrit.wikimedia.org/r/508177 (https://phabricator.wikimedia.org/T219128)
[06:56:28] RECOVERY - puppet last run on dbproxy1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:56:52] RECOVERY - puppet last run on mw1289 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[06:57:06] RECOVERY - puppet last run on an-worker1084 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[06:58:28] RECOVERY - puppet last run on acmechief1001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:58:36] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync
[06:58:42] RECOVERY - puppet last run on ms-be1035 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[06:59:30] RECOVERY - puppet last run on authdns2001 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[07:01:01] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) Added a couple of code reviews as attempt to add the LimitCore to the netbox's systemd unit. If this is not the idea that you guys had, please feel free to d...
[07:17:10] (CR) Muehlenhoff: [C: +1] "Looks good" [puppet] - https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: Dzahn)
[07:19:03] (CR) Muehlenhoff: [C: +1] "Looks good" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[07:19:42] (CR) Muehlenhoff: [C: +1] netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[07:23:12] (CR) Muehlenhoff: [C: -1] "These are getting decommisioned, but it's currently blocked on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/466833/ getting merg" [puppet] - https://gerrit.wikimedia.org/r/507948 (https://phabricator.wikimedia.org/T222443) (owner: Jbond)
[07:32:25] (PS1) Muehlenhoff: Switch neodymium/sarin to spares [puppet] - https://gerrit.wikimedia.org/r/508277
[07:54:54] (PS1) Filippo Giunchedi: hieradata: labmon1001 to prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/508280 (https://phabricator.wikimedia.org/T187987)
[08:03:20] (PS5) Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946
[08:03:21] (PS4) Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578
[08:03:23] (PS5) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947
[08:07:14] (CR) jerkins-bot: [V: -1] confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946 (owner: Giuseppe Lavagetto)
[08:07:16] (CR) jerkins-bot: [V: -1] confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578 (owner: Giuseppe Lavagetto)
[08:07:30] (CR) jerkins-bot: [V: -1] Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947 (owner: Giuseppe Lavagetto)
[08:07:52] <_joe_> jeez what's up with sphinx
[08:17:43] (CR) Elukey: service::uwsgi: add the core_limit parameter (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[08:19:09] (PS2) Elukey: netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697)
[08:19:40] (CR) Muehlenhoff: [C: +1] "Seems fine, but needs meeting approval." [puppet] - https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) (owner: Elukey)
[08:20:21] (PS2) Muehlenhoff: Drop trusty from debdeploy config [puppet] - https://gerrit.wikimedia.org/r/507327
[08:21:54] (CR) Muehlenhoff: [C: +2] Drop trusty from debdeploy config [puppet] - https://gerrit.wikimedia.org/r/507327 (owner: Muehlenhoff)
[08:22:34] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/508277 (owner: Muehlenhoff)
[08:23:25] (PS1) Ema: prometheus: add upload_ats target [puppet] - https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967)
[08:28:16] (CR) Marostegui: [C: -2] "Wait for the 14th May" [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) (owner: Marostegui)
[08:30:28] (PS2) Ema: prometheus: add upload_ats mtail targets [puppet] - https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967)
[08:35:59] (PS1) Elukey: role::deployment_server: remove analytics-admins [puppet] - https://gerrit.wikimedia.org/r/508286
[08:37:58] (CR) Volans: [C: +1] "LGTM, one optional comment inline." (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066 (owner: CRusnov)
[08:39:01] (CR) Volans: [C: +1] "LGTM, one nit inline." (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508067 (owner: CRusnov)
[08:39:36] (PS2) Muehlenhoff: Switch neodymium/sarin to spares [puppet] - https://gerrit.wikimedia.org/r/508277
[08:40:45] (CR) Muehlenhoff: [C: +2] Switch neodymium/sarin to spares [puppet] - https://gerrit.wikimedia.org/r/508277 (owner: Muehlenhoff)
[08:42:40] (CR) Volans: [C: +1] "LGTM, thanks for pushing for it" [puppet] - https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[08:42:42] (PS2) Elukey: role::deployment_server: remove analytics-admins [puppet] - https://gerrit.wikimedia.org/r/508286
[08:44:28] (CR) Elukey: [C: +2] role::deployment_server: remove analytics-admins [puppet] - https://gerrit.wikimedia.org/r/508286 (owner: Elukey)
[08:44:38] (CR) Filippo Giunchedi: [C: +1] prometheus: add upload_ats mtail targets [puppet] - https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[08:44:40] (CR) Volans: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[08:45:50] (PS3) Ema: prometheus: add upload_ats mtail targets [puppet] - https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967)
[08:47:35] (CR) Ema: [C: +2] prometheus: add upload_ats mtail targets [puppet] - https://gerrit.wikimedia.org/r/508284 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[08:48:05] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (jcrespo) > You might be confused with db2047, I don't recall db2049 having a disk replaced lately //Marostegui updated the task description. Feb 12 2019, 07:40:// https://phabricator.wikimedia.or...
[08:48:34] (CR) Filippo Giunchedi: [C: +2] hieradata: labmon1001 to prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/508280 (https://phabricator.wikimedia.org/T187987) (owner: Filippo Giunchedi)
[08:48:46] (PS2) Filippo Giunchedi: hieradata: labmon1001 to prometheus v2 [puppet] - https://gerrit.wikimedia.org/r/508280 (https://phabricator.wikimedia.org/T187987)
[08:50:09] Operations, DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (Marostegui) >>! In T208323#5159099, @jcrespo wrote: >> You might be confused with db2047, I don't recall db2049 having a disk replaced lately > > //Marostegui updated the task description. Feb 12...
[08:55:32] (PS2) Elukey: service::uwsgi: add the core_limit parameter [puppet] - https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697)
[08:56:56] (CR) Elukey: [C: +2] service::uwsgi: add the core_limit parameter [puppet] - https://gerrit.wikimedia.org/r/508173 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[08:57:12] (PS3) Elukey: netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697)
[08:57:56] (CR) Elukey: [C: +2] netbox: set the uwsgi's systemd LimitCore setting to 30G [puppet] - https://gerrit.wikimedia.org/r/508174 (https://phabricator.wikimedia.org/T212697) (owner: Elukey)
[09:00:10] (PS3) Jcrespo: mariadb: Reenable notifications on db1139 and db1140 [puppet] - https://gerrit.wikimedia.org/r/507925 (https://phabricator.wikimedia.org/T220572)
[09:00:12] (PS2) Jcrespo: backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002)
[09:03:16] (PS8) Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150)
[09:03:18] Operations, observability, Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (fgiunchedi)
[09:03:20] !log upgrade labmon1001 to prometheus 2 - T187987
[09:03:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:03:25] T187987: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987
[09:03:54] RECOVERY - ElasticSearch unassigned shard check - 9243 on search.svc.eqiad.wmnet is OK: OK - All good https://wikitech.wikimedia.org/wiki/Search%23Administration
[09:05:28] PROBLEM - Router interfaces on mr1-esams is CRITICAL: CRITICAL: host 91.198.174.247, interfaces up: 32, down: 1, dormant: 0, excluded: 1, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:05:35] Operations, serviceops, Beta-Feature, Patch-For-Review, User-jijiki: Remove php7 beta feature - https://phabricator.wikimedia.org/T219128 (jijiki)
[09:07:26] PROBLEM - Host mr1-esams.oob is DOWN: PING CRITICAL - Packet loss = 100%
[09:08:08] RECOVERY - Router interfaces on mr1-esams is OK: OK: host 91.198.174.247, interfaces up: 36, down: 0, dormant: 0, excluded: 1, unused: 0 https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down
[09:10:32] (CR) Marostegui: [C: +1] backups: Decommission dbstore1001, dbstore2001 and dbstore2002 [puppet] - https://gerrit.wikimedia.org/r/507944 (https://phabricator.wikimedia.org/T220002) (owner: Jcrespo)
[09:12:12] Operations, Analytics, Analytics-Kanban, Traffic, Patch-For-Review: Add prometheus metrics for varnishkafka instances running on caching hosts - https://phabricator.wikimedia.org/T196066 (fgiunchedi) >>! In T196066#5155909, @Ottomata wrote: > I don't think Magnus would build it into librdkafk...
[09:12:50] RECOVERY - Host mr1-esams.oob is UP: PING OK - Packet loss = 0%, RTA = 98.56 ms
[09:18:39] (PS31) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[09:18:41] (PS1) Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[09:31:55] (CR) Filippo Giunchedi: prometheus: Identify trafficserver instances using the layer label (2 comments) [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[09:35:52] !log restart netbox on netmon1002 (trying to reproduce the segfault) - T212697
[09:35:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:35:56] T212697: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697
[09:37:13] Operations, cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (MoritzMuehlenhoff) Resolved→Open Looks good, with the removal of udev from the component, can we please a...
[09:42:09] (PS4) Effie Mouzeli: cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195)
[09:43:06] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) A systemctl restart triggered a segfault, and a core was available under /var/tmp/core. This is what gdb says: ` Core was generated by `/usr/bin/uwsgi --die...
[09:43:11] Operations, Continuous-Integration-Infrastructure, Release-Engineering-Team (Backlog): contint1001 store docker images on separate partition or disk - https://phabricator.wikimedia.org/T207707 (hashar) Eventually I have unzipped them, the reason is the log rotation is handled by python logging not by...
[09:54:35] (PS1) Ema: prometheus: add glob for ATS to file_sd_configs [puppet] - https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967)
[09:57:52] (CR) Ema: "pcc looks good https://puppet-compiler.wmflabs.org/compiler1002/16342/bast4002.wikimedia.org/" [puppet] - https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[09:59:12] !log T222148 upgrade udev & libudev1 on cloudvirt[1001-1003,1005].eqiad.wmnet
[09:59:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:59:16] T222148: labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148
[10:01:07] (CR) Ema: "recheck" [puppet] - https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:03:40] Operations, cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (aborrero) Open→Resolved cloudvirt[1014,1016-1017,1021-1023].eqiad.wmnet,cloudvirtan[1001-1005].eqiad.wmne...
[10:05:20] !log upgrade udev in cloudservices2002-dev
[10:05:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:09:38] (PS2) Ema: prometheus: add glob for ATS to file_sd_configs [puppet] - https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967)
[10:11:43] (PS9) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[10:12:15] Operations, cloud-services-team (Kanban): labstore100[45]/labpuppetmaster/labtestpuppetmaster2001: Broken package state; would upgrade to sysvinit - https://phabricator.wikimedia.org/T222148 (MoritzMuehlenhoff) Looks good, thanks
[10:12:45] (CR) Volans: [C: -1] "Missing return inline" (1 comment) [puppet] - https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: Mathew.onipe)
[10:14:09] (CR) Filippo Giunchedi: [C: +1] prometheus: add glob for ATS to file_sd_configs [puppet] - https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:14:40] (PS6) Giuseppe Lavagetto: confctl: Add filter_objects and update_objects [software/spicerack] - https://gerrit.wikimedia.org/r/503946
[10:14:42] (PS5) Giuseppe Lavagetto: confctl: add change_and_revert contextmanager [software/spicerack] - https://gerrit.wikimedia.org/r/504578
[10:14:44] (PS6) Giuseppe Lavagetto: Add the LBRemoteCluster class. [software/spicerack] - https://gerrit.wikimedia.org/r/503947
[10:16:54] (PS1) Ladsgroup: Set $wgOresFrontendBaseUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396)
[10:17:59] (CR) Volans: Add postgres slave init cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[10:20:59] (PS2) Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217)
[10:25:46] (CR) Effie Mouzeli: [C: +2] cxserver: Clean up old scb cluster stanzas [puppet] - https://gerrit.wikimedia.org/r/506366 (https://phabricator.wikimedia.org/T213195) (owner: Effie Mouzeli)
[10:26:27] (PS3) Ema: prometheus: add glob for ATS to file_sd_configs [puppet] - https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967)
[10:27:06] (CR) Vgutierrez: prometheus: Identify trafficserver instances using the layer label (2 comments) [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[10:27:53] (CR) Ema: [C: +2] prometheus: add glob for ATS to file_sd_configs [puppet] - https://gerrit.wikimedia.org/r/508296 (https://phabricator.wikimedia.org/T219967) (owner: Ema)
[10:30:04] jan_drewniak: Dear deployers, time to do the Wikimedia Portals Update deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1030).
[10:30:10] (CR) Filippo Giunchedi: elastalert: new module (2 comments) [puppet] - https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi)
[10:30:31] (PS8) Filippo Giunchedi: elastalert: new module [puppet] - https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933)
[10:30:33] (PS8) Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933)
[10:31:40] (CR) Filippo Giunchedi: [C: +1] "Nice!" [puppet] - https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) (owner: Vgutierrez)
[10:32:35] (PS10) Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225)
[10:33:13] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546)
[10:34:18] (CR) Giuseppe Lavagetto: [C: -1] "a couple minor nitpicks, but LGTM." (2 comments) [puppet] - https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: Effie Mouzeli)
[10:34:43] PROBLEM - puppet last run on lvs3001 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[10:36:24] (CR) Volans: [C: -1] "Some minor things inline and one potentially major if we want to be safe with the depooling." (6 comments) [cookbooks] - https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: Mathew.onipe)
[10:36:59] (CR) Jdrewniak: [C: +2] Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:38:03] (Merged) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:38:19] (CR) jenkins-bot: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/508302 (https://phabricator.wikimedia.org/T128546) (owner: Jdrewniak)
[10:38:25] PROBLEM - Check systemd state on scb2004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:29] PROBLEM - Check systemd state on scb1001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:31] ema: I'm having a look at the lvs3001 puppet failure
[10:38:42] the scb* alert is me
[10:38:48] ack
[10:38:49] PROBLEM - Check systemd state on scb2002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:51] will fix
[10:38:51] PROBLEM - Check systemd state on scb1003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:55] PROBLEM - Check systemd state on scb2005 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:38:59] arg
[10:39:05] PROBLEM - Check systemd state on scb2006 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:39:15] PROBLEM - Check systemd state on scb2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:39:29] PROBLEM - Check systemd state on scb1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[10:39:31] PROBLEM - Check systemd state on scb1004 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:39:33] PROBLEM - Check systemd state on scb2003 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:40:22] (03CR) 10Giuseppe Lavagetto: [C: 04-1] hhvm: base::service_unit -> systemd::service (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/456319 (https://phabricator.wikimedia.org/T194724) (owner: 10Dzahn) [10:40:31] (03PS11) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) [10:40:39] PROBLEM - Unmerged changes on repository puppet on puppetmaster2002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:40:57] PROBLEM - Unmerged changes on repository puppet on puppetmaster2001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:41:47] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1001 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). [10:41:57] PROBLEM - Unmerged changes on repository puppet on labpuppetmaster1002 is CRITICAL: There is one unmerged change in puppet (dir /var/lib/git/operations/puppet, ref HEAD..origin/production). 
[10:42:46] (03PS9) 10Filippo Giunchedi: elastalert: new module [puppet] - 10https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) [10:42:48] (03PS9) 10Filippo Giunchedi: elastalert: enable on logstash1007 [puppet] - 10https://gerrit.wikimedia.org/r/505762 (https://phabricator.wikimedia.org/T213933) [10:43:11] !log jdrewniak@deploy1001 Synchronized portals/wikipedia.org/assets: Wikimedia Portals Update: [[gerrit:508302| Bumping portals to master (T128546)]] (duration: 00m 52s) [10:43:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:43:16] T128546: [Recurring Task] Update Wikipedia and sister projects portals statistics - https://phabricator.wikimedia.org/T128546 [10:43:26] ema: did puppetmerge fail on all the other hosts? [10:43:34] volans: looking [10:43:35] it seems your last merged change was not propagated [10:44:03] !log jdrewniak@deploy1001 Synchronized portals: Wikimedia Portals Update: [[gerrit:508302| Bumping portals to master (T128546)]] (duration: 00m 51s) [10:44:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:44:29] volans: yes it seems puppet-merge had issues [10:44:33] https://phabricator.wikimedia.org/P8477 [10:44:42] error: cannot lock ref 'refs/remotes/origin/production': is at 73820b7f34685628be58c2166da5baf16c3830fe but expected da491ce740b81ddfdd43166ab3583dc646c5a89e [10:45:01] RECOVERY - Check systemd state on scb1001 is OK: OK - running: The system is fully operational [10:45:23] RECOVERY - Check systemd state on scb2002 is OK: OK - running: The system is fully operational [10:45:23] RECOVERY - Check systemd state on scb1003 is OK: OK - running: The system is fully operational [10:45:27] RECOVERY - Check systemd state on scb2005 is OK: OK - running: The system is fully operational [10:45:37] RECOVERY - Check systemd state on scb2006 is OK: OK - running: The system is fully operational [10:45:39] _joe_: for the conftool missing key ^^^ (ema's 
phaste) [10:45:47] RECOVERY - Check systemd state on scb2001 is OK: OK - running: The system is fully operational [10:45:58] <_joe_> what? [10:45:59] RECOVERY - Check systemd state on scb1002 is OK: OK - running: The system is fully operational [10:46:03] <_joe_> I miss all context [10:46:03] RECOVERY - Check systemd state on scb1004 is OK: OK - running: The system is fully operational [10:46:05] RECOVERY - Check systemd state on scb2003 is OK: OK - running: The system is fully operational [10:46:40] _joe_: puppet-merge failed, log here: https://phabricator.wikimedia.org/P8477 [10:46:55] <_joe_> ema: yeah scb2005/cxserver [10:46:55] and apart the puppet failure, also confctl failed at the end [10:47:06] <_joe_> I guess you had some race condition? [10:47:54] <_joe_> frankly, I don't think it has to do with confctl [10:47:56] <_joe_> but lemme check [10:48:57] <_joe_> that object doesn't exist indeed [10:49:07] <_joe_> what it looks like from here is [10:49:20] <_joe_> someone launched puppet-merge from two servers at the same time [10:49:32] I was merging for sure [10:49:42] but my changes went to all servers [10:49:53] my change also got applied [10:50:01] lovely [10:50:30] but I didn't get any errors [10:50:33] I puppet-merged on puppetmaster1001 [10:50:40] me too [10:50:58] ema: your changes are not on all the other puppetmasters [10:51:00] only on 1001 [10:51:08] <_joe_> so yes [10:51:11] <_joe_> a race condition [10:51:25] :/ [10:51:28] <_joe_> so, 1 - don't let your changed unmerged for long [10:51:29] sorry ema [10:51:33] <_joe_> 2 - we need a global lock [10:51:48] I doubt we left unmerged changes for log [10:51:53] long [10:52:03] :-S [10:52:43] yeah I puppet-merged right away, I think the issue is just (2) [10:53:12] btw is someone fixing it? [10:53:47] I guess puppet-merging something new would be a fix? 
[10:54:04] <_joe_> no [10:54:16] <_joe_> puppet-merge $sha1 would be [10:54:18] <_joe_> maybe from 2001 [10:54:25] <_joe_> and on the other failed servers [10:54:31] ok, doing! [10:54:54] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add support for OpenAPI 3.0 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507489 (https://phabricator.wikimedia.org/T218218) (owner: 10Clarakosi) [10:55:58] RECOVERY - Unmerged changes on repository puppet on puppetmaster2001 is OK: No changes to merge. [10:56:43] 10Operations, 10Puppet, 10puppet-compiler: Frequent puppet failures - https://phabricator.wikimedia.org/T221529 (10akosiaris) >>! In T221529#5143984, @jbond wrote: > The error happened as puppet-merge was rolling out changes. I have not looked at how puppet-merge works but this looks like it is caused by a... [10:56:44] RECOVERY - Unmerged changes on repository puppet on puppetmaster2002 is OK: No changes to merge. [10:59:20] RECOVERY - Check systemd state on scb2004 is OK: OK - running: The system is fully operational [10:59:37] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "the code is correct but I'd prefer another format." (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [10:59:53] !log manual puppet-merge $sha on failed puppetmasters https://phabricator.wikimedia.org/P8477 [10:59:54] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1001 is OK: No changes to merge. [10:59:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] MaxSem, RoanKattouw, and Niharika: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1100). [11:00:04] Amir1: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. 
Note: If you break AND fix the wikis, you will be rewarded with a sticker. [11:00:10] RECOVERY - Unmerged changes on repository puppet on labpuppetmaster1002 is OK: No changes to merge. [11:00:38] thanks for fixing [11:01:42] volans, _joe_: done, thanks! [11:05:50] RECOVERY - puppet last run on lvs3001 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:06:13] (03PS1) 10Alaa Sarhan: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) [11:06:41] volans: have you found out anything about lvs3001's puppetfail? [11:06:58] ema: just the usual AH01102 :( [11:07:34] <_joe_> a 503 [11:07:43] not a 502 [11:07:46] not a 503 [11:07:52] anyone Swatting now? [11:07:54] <_joe_> is it me or puppet (the passenger app) crashes more often than before? [11:08:17] <_joe_> since a few months [11:08:33] it seems so from the data chris was gathering, although that data is skewed at each puppet-merge that breaks puppet on a large number of hosts [11:08:35] 10Operations, 10Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (10elukey) Highlight to: ` #7 0x00005645dd3aaf83 in uwsgi_segfault (signum=11) at core/uwsgi.c:1839 #8 #9 0x00007ffa725bac99 in uwsgi_socket... 
[11:08:43] we have 2 open tasks for that [11:09:09] (03PS1) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [11:09:11] (03CR) 10Giuseppe Lavagetto: "recheck" [software/spicerack] - 10https://gerrit.wikimedia.org/r/504578 (owner: 10Giuseppe Lavagetto) [11:09:20] <_joe_> "recheck" [11:09:22] <_joe_> sigh [11:14:51] (03CR) 10Michael Große: [C: 03+1] Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan) [11:16:16] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] "jenkins bot is not verifying this change because some issue with zuul apparently." [puppet] - 10https://gerrit.wikimedia.org/r/507376 (https://phabricator.wikimedia.org/T221225) (owner: 10Arturo Borrero Gonzalez) [11:16:37] (03CR) 10jerkins-bot: [V: 04-1] Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan) [11:17:29] (03CR) 10Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler1002/16347/" [puppet] - 10https://gerrit.wikimedia.org/r/500413 (owner: 10Muehlenhoff) [11:17:30] !log merging puppet change to the sudo module https://gerrit.wikimedia.org/r/c/operations/puppet/+/507376 [11:17:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:17:36] (03PS5) 10Muehlenhoff: dnsrecursor: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500413 [11:17:42] (03PS2) 10Alaa Sarhan: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) [11:17:57] hey, we just added patch 508303 to current SWAT, if there's any chance for it to be deployed that would be so great :) [11:19:00] ( 
https://gerrit.wikimedia.org/r/c/operations/mediawiki-config/+/508303 ) [11:19:07] arturo: that patch changes things to some prod hosts too [11:19:17] (03CR) 10Muehlenhoff: [C: 03+2] dnsrecursor: Remove support for Ubuntu/trusty [puppet] - 10https://gerrit.wikimedia.org/r/500413 (owner: 10Muehlenhoff) [11:19:19] volans: I know [11:19:31] volans: see https://puppet-compiler.wmflabs.org/compiler1002/16343/ [11:20:07] on mwmaint it uses sudoldap though [11:20:12] is that intended? [11:20:25] they include the ldap::client::includes class [11:20:37] https://puppet-compiler.wmflabs.org/compiler1002/16343/mwmaint1002.eqiad.wmnet/ ? [11:21:12] oh I see what you mean [11:22:02] volans: I guess nobody should be using sudoldap outside cloudVPS, but I may be wrong [11:22:14] also, I don't see any review on that patch, and it's obviously not trivial or emergency bugfix [11:23:49] not sure if you are suggesting to revert it [11:24:37] it's not clear to me the impact of it, I just had a quick look and saw that it changes some behaviour of some classes applied to prod hosts and I'm not sure if they are intended [11:25:18] yes, it is intended, we are adding a new parameter to the sudo class, sudo::user and sudo::group as well [11:25:44] <_joe_> is jenkins working for puppet? [11:25:57] this patch was in fact already applied to prod without issues. 
It got reverted because we had issues inside cloudvps VMs [11:26:05] and this is the second attempt to merge it [11:26:08] volans: ^^^ [11:26:32] _joe_: sporadically I think, see the logs on https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508296/ [11:27:09] (03PS1) 10Muehlenhoff: Remove obsolete openstack::nova::compute::audit [puppet] - 10https://gerrit.wikimedia.org/r/508308 [11:27:30] volans: I'm ready/open to revert if you think we should do so [11:28:44] <_joe_> arturo: I'll be frank: you should've asked - and waited - for a review for a modification to the sudo module, it's quite fundamental [11:29:13] ok, give me a second, I will revert it [11:29:23] arturo: as I said, I don't know, but would have expected at least a review from moritz or john [11:29:28] <_joe_> I don't think that's needed [11:29:39] and the part that puzzles me for mwmaint hosts is [11:29:41] https://gerrit.wikimedia.org/r/c/operations/puppet/+/507376/11/modules/ldap/manifests/client/includes.pp#1 [11:29:42] <_joe_> but for the future, keep that in mind please :) [11:29:43] (03PS1) 10Arturo Borrero Gonzalez: Revert "sudo: decouple sudo from sudo-ldap" [puppet] - 10https://gerrit.wikimedia.org/r/508309 [11:29:56] in the sense that that file is included on those hosts [11:30:08] and I'm not sure if that change in behaviour there is intended [11:30:31] I prefer to do things right and without disturbing anybody, so I'm reverting the patch right now [11:30:49] (03PS2) 10Arturo Borrero Gonzalez: Revert "sudo: decouple sudo from sudo-ldap" [puppet] - 10https://gerrit.wikimedia.org/r/508309 [11:31:54] (03CR) 10Arturo Borrero Gonzalez: [V: 03+2 C: 03+2] Revert "sudo: decouple sudo from sudo-ldap" [puppet] - 10https://gerrit.wikimedia.org/r/508309 (owner: 10Arturo Borrero Gonzalez) [11:32:17] !log reverting puppet change to the sudo module [11:32:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:33:50] (03PS2) 10Ema: cache: reimage cp4026 as upload_ats [puppet] 
- 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [11:34:02] I'm deploying those now [11:34:05] sorry for being late [11:34:12] (03CR) 10Ladsgroup: [C: 03+2] Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan) [11:34:21] thank you @Amir1 [11:35:15] (03Merged) 10jenkins-bot: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan) [11:35:55] PROBLEM - puppet last run on lvs2006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:36:04] (03CR) 10jenkins-bot: Enable Suggestion Constraint Status on Wikidata [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508303 (https://phabricator.wikimedia.org/T204439) (owner: 10Alaa Sarhan) [11:36:15] PROBLEM - puppet last run on ores2008 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:37:09] PROBLEM - puppet last run on db1096 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:37:09] PROBLEM - puppet last run on db2099 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [11:37:13] alaa_wmde: it's live in mwdebug1002 [11:37:47] (03PS23) 10Mathew.onipe: icinga: create and apply cirrus config check [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [11:38:55] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:39:26] <_joe_> it seems it was a very small spike [11:39:33] <_joe_> but I'll keep an eye on that [11:39:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:40:03] PROBLEM - puppet last run on es1014 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [11:40:03] <_joe_> May 6 11:32:59 ores2008 puppet-agent[35832]: Could not retrieve catalog from remote server: Error 500 on SERVER: Server Error: Evaluation Error: Error while evaluating a Resource Statement, Could not find declared class sudo::sudoersfile at /etc/puppet/modules/sudo/manifests/init.pp:6:5 on node ores2008.codfw.wmnet [11:40:06] <_joe_> ij, [11:40:09] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:40:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:40:34] <_joe_> uhm [11:40:34] _joe_: taking a look at ores2008 [11:40:39] !log ladsgroup@deploy1001 Synchronized wmf-config/InitialiseSettings.php: SWAT: [[gerrit:508303|Enable Suggestion Constraint Status on Wikidata]] (duration: 00m 52s) [11:40:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:40:57] <_joe_> arturo: no it's ok [11:41:04] <_joe_> just a race condition in puppet-merge I guess [11:41:16] <_joe_> I just re-run puppet and it was ok [11:41:22] ok then, sorry for the noise [11:41:35] RECOVERY - puppet last run on ores2008 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures [11:42:13] (03CR) 10Ladsgroup: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup) [11:42:19] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:42:29] RECOVERY - puppet last run on db1096 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [11:42:29] RECOVERY - puppet last run on db2099 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [11:44:29] (03CR) 10Ladsgroup: "jenkins seems to have problems with my patch. Will try again later" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: 10Ladsgroup) [11:44:43] !log EU SWAT is done [11:44:45] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:44:58] <_joe_> Amir1: CI stuck for you too? 
[11:45:05] yeah [11:45:24] _joe_: it doesn't even start the job: https://integration.wikimedia.org/zuul/ [11:45:34] (03PS1) 10Muehlenhoff: openstack: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/508310 [11:45:56] hmm, it was just slow [11:46:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [11:47:15] <_joe_> https://grafana.wikimedia.org/d/000000321/zuul?panelId=8&fullscreen&orgId=1&from=now-3h&to=now up to two hours :D [11:47:57] (03PS1) 10Arturo Borrero Gonzalez: sudo: decouple sudo from sudo-ldap [puppet] - 10https://gerrit.wikimedia.org/r/508311 (https://phabricator.wikimedia.org/T221225) [11:48:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [11:48:11] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [11:50:40] (03PS2) 10Muehlenhoff: openstack: Remove support for Ubuntu/Upstart [puppet] - 10https://gerrit.wikimedia.org/r/508310 [11:52:27] (03CR) 10Arturo Borrero Gonzalez: "Do you have a PCC run for this?" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [11:52:44] 10Operations, 10LDAP-Access-Requests, 10WMF-Legal, 10WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (10Tobi_WMDE_SW) As the Engineering Manager at WMDE responsible for Adam's team, I endorse this request. 
[11:52:59] PROBLEM - Check Varnish expiry mailbox lag on cp3038 is CRITICAL: CRITICAL: expiry mailbox lag is 2059198 https://wikitech.wikimedia.org/wiki/Varnish [12:01:58] (03CR) 10Muehlenhoff: "PCC at https://puppet-compiler.wmflabs.org/compiler1001/16351/" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [12:06:39] RECOVERY - puppet last run on es1014 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:07:21] (03CR) 10Effie Mouzeli: mediawiki: if guard php72_only blocks (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [12:07:47] RECOVERY - puppet last run on lvs2006 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [12:11:47] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:14:57] PROBLEM - Check systemd state on ms-be1036 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[12:17:53] (03CR) 10Arturo Borrero Gonzalez: "I just ran a more wider PCC: https://puppet-compiler.wmflabs.org/compiler1001/16352/" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [12:19:01] (03PS4) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) [12:21:18] (03CR) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 (031 comment) [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [12:24:08] !log installing golang security updates on jessie [12:24:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:34:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:34:13] (03PS1) 10Mathew.onipe: maps: remove cassandra metric blacklist [puppet] - 10https://gerrit.wikimedia.org/r/508313 [12:34:31] (03CR) 10Mathew.onipe: icinga: create and apply cirrus config check (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [12:35:54] (03PS24) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [12:43:46] !log installing rsync security updates [12:43:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:33] RECOVERY - Check systemd state on ms-be1036 is OK: OK - running: The system is fully operational [12:51:53] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:52:09] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:52:19] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:52:36] Hi, is there anything wrong with the "test" pipeline? "recheck" doesn't work, rebasing a patch doesn't run tests etc. Cc paladox [12:52:59] apparently _joe_ filed a task about ci not working. [12:53:05] https://phabricator.wikimedia.org/T222605 [12:53:07] Okay, thanks. [12:53:53] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [12:53:57] paladox: And we subscribed same time [12:54:01] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:54:01] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [12:54:02] heh [12:54:31] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:55:23] PROBLEM - Eqsin HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [12:56:05] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:56:15] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [12:58:30] (03PS7) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) [12:59:15] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [12:59:17] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [13:00:27] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [13:00:37] RECOVERY - Eqsin HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqsin&var-cache_type=All&var-status_type=5 [13:02:09] (03CR) 10Alexandros Kosiaris: [C: 03+1] "PCC is fine at https://puppet-compiler.wmflabs.org/compiler1002/16349/mc2026.codfw.wmnet/. There will only be ferm reload on most hosts, w" [puppet] - 10https://gerrit.wikimedia.org/r/505407 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [13:09:35] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1099 memory issues - https://phabricator.wikimedia.org/T221502 (10Marostegui) 05Open→03Resolved This host recovered itself, so closing for now as nothing is to be done. [13:11:31] (03CR) 10Gehel: [C: 03+1] "LGTM, appart from the failing build :)" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507561 (owner: 10Jbond) [13:14:34] both 503 spikes seem to have been triggered by api.php requests [13:15:01] (03CR) 10Alexandros Kosiaris: [C: 03+2] network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk) [13:15:03] varnish backends in eqiad look fine [13:15:09] (03PS2) 10Alexandros Kosiaris: network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk) [13:15:22] (03CR) 10Alexandros Kosiaris: [C: 03+2] "thanks!" [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk) [13:16:16] (03CR) 10Gehel: [C: 04-1] "relforge and cloudelastic should also be configured." 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [13:16:43] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] network: remove old labs public range [puppet] - 10https://gerrit.wikimedia.org/r/507907 (https://phabricator.wikimedia.org/T193496) (owner: 10Alex Monk) [13:17:11] (03CR) 10Gehel: [C: 03+1] "Do we even need to keep 3 versions of Python supported?" [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans) [13:18:41] (03PS8) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) [13:20:02] (03CR) 10Volans: "> Patch Set 1: Code-Review+1" [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans) [13:20:13] (03PS3) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:20:18] !log installing unzip security updates [13:20:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:20:21] 10Operations, 10Cloud-Services, 10netops, 10Patch-For-Review, 10cloud-services-team (Kanban): Allocate public v4 IPs for Neutron setup in eqiad - https://phabricator.wikimedia.org/T193496 (10chasemp) I was 6 months off on my estimate for this :) [13:20:29] (03PS9) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) [13:22:21] !log installing audiofile security updates [13:22:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:24:23] (03PS4) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:24:29] (03CR) 10Papaul: "> Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul) [13:25:13] (03PS1) 10Muehlenhoff: Add library hint for 
audiofile [puppet] - 10https://gerrit.wikimedia.org/r/508316 [13:26:01] (03CR) 10Gehel: [C: 04-1] icinga: create and apply cirrus config check(recheck) (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [13:26:26] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add library hint for audiofile [puppet] - 10https://gerrit.wikimedia.org/r/508316 (owner: 10Muehlenhoff) [13:26:30] (03PS10) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) [13:29:53] (03CR) 10Gehel: [C: 04-1] wdqs: add WDQS restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [13:30:43] (03PS5) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:30:52] (03CR) 10CDanis: "PCC looks reasonable: https://puppet-compiler.wmflabs.org/compiler1002/16361/" [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis) [13:31:22] (03CR) 10Gehel: [C: 04-1] Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [13:31:53] (03CR) 10Marostegui: "> > Patch Set 2:" [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul) [13:32:33] (03CR) 10Gehel: [C: 03+1] "> Patch Set 1:" [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans) [13:33:39] (03PS6) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:33:46] (03CR) 10Volans: [C: 03+2] Drop support for Python 3.4 [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans) [13:37:37] 10Operations: Integrate Stretch 9.9 point 
update - https://phabricator.wikimedia.org/T222053 (10MoritzMuehlenhoff) [13:37:58] (03PS7) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:39:55] (03Merged) 10jenkins-bot: Drop support for Python 3.4 [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans) [13:40:10] (03CR) 10jenkins-bot: Drop support for Python 3.4 [software/cumin] - 10https://gerrit.wikimedia.org/r/508078 (owner: 10Volans) [13:41:33] (03PS1) 10Muehlenhoff: Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318 [13:42:26] (03PS8) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:44:34] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318 (owner: 10Muehlenhoff) [13:44:42] (03PS2) 10Muehlenhoff: Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318 [13:45:11] (03PS9) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:45:30] (03PS4) 10Paladox: Gerrit: Enable gerrit.disableReverseDnsLookup [puppet] - 10https://gerrit.wikimedia.org/r/508127 [13:46:12] (03CR) 10Muehlenhoff: [V: 03+2 C: 03+2] Add library hints for vips and zziplib [puppet] - 10https://gerrit.wikimedia.org/r/508318 (owner: 10Muehlenhoff) [13:48:16] (03PS10) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [13:48:59] (03PS1) 10Marostegui: db-codfw.php: Promote db2045 to master [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) [13:49:30] (03CR) 10Marostegui: [C: 04-2] "This needs to be submitted after the topology changes and after 
https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508168/" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [13:51:55] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [13:52:09] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [13:52:14] !log CI does not run sometime for some reason ... https://phabricator.wikimedia.org/T222614 :( [13:52:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:54:50] !log installing zziplib security updates [13:54:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:51] !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/staging-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: staging] [13:56:52] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [13:56:52] !log otto@deploy1001 scap-helm eventgate-analytics finished [13:56:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:56:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:57:22] (03PS11) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [14:03:07] !log cp3038: restart varnish-be [14:03:10] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:04:50] (03CR) 10Volans: [C: 04-1] wdqs: add WDQS restart cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/507347 (https://phabricator.wikimedia.org/T221832) (owner: 10Mathew.onipe) [14:05:07] (03CR) 10Filippo Giunchedi: [C: 03+1] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis) [14:05:34] (03PS1) 10Ottomata: eventgate - fix duplicate config error_stream in config.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/508324 (https://phabricator.wikimedia.org/T218346) [14:06:26] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate - fix duplicate config error_stream in config.yaml template [deployment-charts] - 10https://gerrit.wikimedia.org/r/508324 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:06:31] (03CR) 10Volans: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:06:44] (03CR) 10Hashar: "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis) [14:07:01] RECOVERY - Check Varnish expiry mailbox lag on cp3038 is OK: OK: expiry mailbox lag is 0 https://wikitech.wikimedia.org/wiki/Varnish [14:07:29] (03PS5) 10Paladox: Gerrit: Enable gerrit.disableReverseDnsLookup [puppet] - 10https://gerrit.wikimedia.org/r/508127 [14:09:38] !log CI workflow fixed by reverting a change deployed around 10:00 UTC # T222614 [14:09:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:42] T222614: CI no more triggers for some/all? repositories! 
- https://phabricator.wikimedia.org/T222614 [14:11:14] !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/staging-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: staging] [14:11:15] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [14:11:15] !log otto@deploy1001 scap-helm eventgate-analytics finished [14:11:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:11:47] (03CR) 10Filippo Giunchedi: initial attempt at a varnishkafka exporter (032 comments) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [14:12:05] (03CR) 10Volans: [C: 03+1] "LGTM, it requires a patch to Puppet to add the timeout to the configuration before deploying this." 
[software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [14:13:03] (03PS12) 10Ema: cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) [14:13:09] (03CR) 10Volans: [C: 03+1] "Forgot to add a note" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [14:15:13] (03CR) 10Marostegui: [C: 04-2] "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508321 (https://phabricator.wikimedia.org/T219493) (owner: 10Marostegui) [14:15:15] (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1001/16369/" [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:15:39] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [14:15:47] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [14:16:23] (03PS1) 10Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) [14:18:07] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:18:29] (03CR) 10Vgutierrez: [C: 03+1] cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:19:27] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [14:19:52] !log depool cp4026 and reimage as upload_ats T219967 [14:19:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:19:56] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [14:19:57] 10Operations, 10Puppet, 10Patch-For-Review, 10User-herron: custom fact interface_primary breaks under newer versions of facter - https://phabricator.wikimedia.org/T182819 (10MoritzMuehlenhoff) We can close this, given that the patch by @jbond was merged, right? [14:20:31] (03CR) 10Ema: [C: 03+2] cache: reimage cp4026 as upload_ats [puppet] - 10https://gerrit.wikimedia.org/r/508304 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:23:38] (03CR) 10Filippo Giunchedi: initial attempt at a varnishkafka exporter (031 comment) [debs/prometheus-varnishkafka-exporter] - 10https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: 10Cwhite) [14:25:08] !log installing vips security updates [14:25:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:25:33] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (10Papaul) a:05Papaul→03jcrespo Done [14:26:00] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by ema on cumin2001.codfw.wmnet for hosts: ` ['cp4026.ulsfo.wmnet'] ` The log can be... 
[14:27:29] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 57.14% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:29:31] (03PS2) 10Vgutierrez: prometheus: Toggle SSL certificate verification for trafficserver-exporter [puppet] - 10https://gerrit.wikimedia.org/r/508327 (https://phabricator.wikimedia.org/T221217) [14:29:33] (03PS7) 10Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [14:29:35] (03PS32) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:29:37] (03PS3) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) [14:29:55] hashar: is the icinga notification above a concern? [14:35:35] !log swift eqiad-prod: finish decom ms-be101[45] - T220590 [14:35:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:35:39] T220590: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 [14:36:16] (03PS6) 10Ema: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles) [14:37:51] PROBLEM - Host cp1083 is DOWN: PING CRITICAL - Packet loss = 100% [14:39:38] ema: working on cp1083? 
--^ [14:39:41] nope [14:40:03] depooling [14:40:19] RECOVERY - EDAC syslog messages on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [14:40:46] (03PS1) 10Giuseppe Lavagetto: role::deployment_server: depend on base role [puppet] - 10https://gerrit.wikimedia.org/r/508334 [14:40:48] (03PS1) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335 [14:41:00] (03CR) 10Andrew Bogott: "I believe we still have exactly one VM still running Trusty -- labstore1003, which is waiting on T209527." [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [14:41:05] !log ema@puppetmaster1001 conftool action : set/pooled=no; selector: name=cp1083.eqiad.wmnet [14:41:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:41:22] thanks for the ping elukey [14:42:58] !log powercycle cp1083 [14:43:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:43:49] ema: <3 [14:43:54] (03CR) 10Andrew Bogott: openstack: Remove support for Ubuntu/Upstart (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [14:43:56] (03CR) 10Muehlenhoff: "But it doesn't use any of the classes touched here, see https://puppet-compiler.wmflabs.org/compiler1001/16352/labstore1003.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [14:44:09] PROBLEM - IPsec on cp4031 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:13] PROBLEM - IPsec on cp2023 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6 [14:44:15] PROBLEM - IPsec on cp2001 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6 [14:44:17] PROBLEM - IPsec on cp5007 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:26] we know icinga 
we know [14:44:33] PROBLEM - IPsec on cp2007 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6 [14:44:35] PROBLEM - IPsec on cp3041 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:35] PROBLEM - IPsec on cp3030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:39] PROBLEM - IPsec on cp2010 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6 [14:44:41] PROBLEM - IPsec on cp4030 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:43] PROBLEM - IPsec on cp3042 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:51] PROBLEM - IPsec on cp4027 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:51] PROBLEM - IPsec on cp3032 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:57] PROBLEM - IPsec on cp5011 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:57] PROBLEM - IPsec on cp5010 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:44:59] PROBLEM - IPsec on cp2013 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6 [14:44:59] PROBLEM - IPsec on cp2006 is CRITICAL: Strongswan CRITICAL - ok: 50 connecting: cp1083_v4, cp1083_v6 [14:45:07] PROBLEM - IPsec on cp4029 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:45:07] PROBLEM - IPsec on cp4028 is CRITICAL: Strongswan CRITICAL - ok: 34 connecting: cp1083_v4, cp1083_v6 [14:45:17] (03PS2) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335 [14:45:19] (03PS11) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) [14:46:07] RECOVERY - Host cp1083 is UP: PING OK - Packet loss = 0%, RTA = 0.18 ms [14:46:11] RECOVERY - IPsec on 
cp4027 is OK: Strongswan OK - 36 ESP OK [14:46:13] RECOVERY - IPsec on cp3032 is OK: Strongswan OK - 36 ESP OK [14:46:17] RECOVERY - IPsec on cp2013 is OK: Strongswan OK - 52 ESP OK [14:46:17] RECOVERY - IPsec on cp5011 is OK: Strongswan OK - 36 ESP OK [14:46:17] RECOVERY - IPsec on cp2006 is OK: Strongswan OK - 52 ESP OK [14:46:17] RECOVERY - IPsec on cp5010 is OK: Strongswan OK - 36 ESP OK [14:46:27] RECOVERY - IPsec on cp4029 is OK: Strongswan OK - 36 ESP OK [14:46:27] RECOVERY - IPsec on cp4028 is OK: Strongswan OK - 36 ESP OK [14:46:44] (03PS12) 10CDanis: prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) [14:46:49] RECOVERY - IPsec on cp4031 is OK: Strongswan OK - 36 ESP OK [14:46:53] RECOVERY - IPsec on cp2023 is OK: Strongswan OK - 52 ESP OK [14:46:55] RECOVERY - IPsec on cp2001 is OK: Strongswan OK - 52 ESP OK [14:46:59] RECOVERY - IPsec on cp5007 is OK: Strongswan OK - 36 ESP OK [14:47:13] RECOVERY - IPsec on cp2007 is OK: Strongswan OK - 52 ESP OK [14:47:15] RECOVERY - IPsec on cp3041 is OK: Strongswan OK - 36 ESP OK [14:47:15] RECOVERY - IPsec on cp3030 is OK: Strongswan OK - 36 ESP OK [14:47:19] RECOVERY - IPsec on cp2010 is OK: Strongswan OK - 52 ESP OK [14:47:23] RECOVERY - IPsec on cp4030 is OK: Strongswan OK - 36 ESP OK [14:47:23] RECOVERY - IPsec on cp3042 is OK: Strongswan OK - 36 ESP OK [14:47:56] (03PS3) 10Giuseppe Lavagetto: role::deployment_server: reorganize code, step 1 [puppet] - 10https://gerrit.wikimedia.org/r/508335 [14:47:59] (03CR) 10Andrew Bogott: [C: 03+1] "True :)" [puppet] - 10https://gerrit.wikimedia.org/r/508310 (owner: 10Muehlenhoff) [14:49:11] (03CR) 10Andrew Bogott: [C: 03+1] Remove obsolete openstack::nova::compute::audit [puppet] - 10https://gerrit.wikimedia.org/r/508308 (owner: 10Muehlenhoff) [14:49:35] (03PS3) 10Andrew Bogott: nova: pool cloudvirt1001, 1002, 1003, 1004 [puppet] - 10https://gerrit.wikimedia.org/r/506715 
(https://phabricator.wikimedia.org/T221141) [14:50:10] (03PS8) 10Vgutierrez: trafficserver: Provide a unified monitoring define [puppet] - 10https://gerrit.wikimedia.org/r/506986 (https://phabricator.wikimedia.org/T221217) [14:50:12] (03PS33) 10Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - 10https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594) [14:50:14] (03PS4) 10Vgutierrez: prometheus: Identify trafficserver instances using the layer label [puppet] - 10https://gerrit.wikimedia.org/r/508289 (https://phabricator.wikimedia.org/T221217) [14:54:19] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [14:54:25] ACKNOWLEDGEMENT - HP RAID on db2049 is CRITICAL: CRITICAL: Slot 0: OK: 1I:1:1, 1I:1:2, 1I:1:3, 1I:1:4, 1I:1:6, 1I:1:7, 1I:1:8, 1I:1:9, 1I:1:10, 1I:1:11, 1I:1:12 - Failed: 1I:1:5 - Controller: OK - Battery/Capacitor: OK nagiosadmin RAID handler auto-ack: https://phabricator.wikimedia.org/T222622 https://wikitech.wikimedia.org/wiki/Dc-operations/Hardware_Troubleshooting_Runbook%23Hardware_Raid_Information_Gathering [14:54:30] 10Operations, 10Fundraising-Backlog, 10Traffic, 10fundraising-tech-ops: SSL cert for links.email.wikimedia.org - https://phabricator.wikimedia.org/T188561 (10Jgreen) @bblack circling back on this, do you still see any issue now after the Silverpop SSL improvements? 
[14:54:33] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222622 (10ops-monitoring-bot) [14:54:53] (03CR) 10CRusnov: "> Patch Set 2:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [14:55:41] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 901 days) https://wikitech.wikimedia.org/wiki/Logs [14:57:07] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222526 (10Marostegui) [14:57:09] 10Operations, 10ops-codfw: Degraded RAID on db2049 - https://phabricator.wikimedia.org/T222622 (10Marostegui) [14:57:25] !log capture strace / core for rsyslog on wezen / lithium and restart - T199406 [14:57:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:57:30] T199406: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 [14:58:02] (03PS7) 10Ema: Normalize thumbnail URLs to avoid cachebusting [puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles) [14:58:03] RECOVERY - Device not healthy -SMART- on db2049 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops [14:58:13] (03CR) 10Andrew Bogott: [C: 03+2] nova: pool cloudvirt1001, 1002, 1003, 1004 [puppet] - 10https://gerrit.wikimedia.org/r/506715 (https://phabricator.wikimedia.org/T221141) (owner: 10Andrew Bogott) [14:58:39] (03PS9) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [14:59:24] (03CR) 10Effie Mouzeli: "Puppet compiler https://puppet-compiler.wmflabs.org/compiler1001/16353/mw1222.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) (owner: 10Effie Mouzeli) [14:59:30] (03CR) 10Mathew.onipe: Add postgres slave init cookbook (031 comment) [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: 10Mathew.onipe) [14:59:47] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [15:01:07] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Andrew) [15:01:19] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [15:01:23] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [15:01:25] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1001 with 10G interfaces - https://phabricator.wikimedia.org/T221141 (10Andrew) 05Open→03Resolved Thank you for working on all these, @Cmjohnson ! 
[15:01:35] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 838 days) https://wikitech.wikimedia.org/wiki/Logs [15:01:59] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [15:02:01] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1002 with 10G interfaces - https://phabricator.wikimedia.org/T221140 (10Andrew) 05Open→03Resolved [15:02:03] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1003 with 10G interfaces - https://phabricator.wikimedia.org/T221139 (10Andrew) 05Open→03Resolved [15:02:06] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [15:02:11] 10Operations, 10Epic, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1004 with 10G interfaces - https://phabricator.wikimedia.org/T221138 (10Andrew) 05Open→03Resolved [15:02:14] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [15:02:39] (03PS2) 10Mathew.onipe: maps: remove cassandra metric blacklist [puppet] - 10https://gerrit.wikimedia.org/r/508313 [15:02:44] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [15:03:51] (03PS3) 10Papaul: DNS: Add mgmt and production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613 [15:04:25] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [15:04:40] (03PS13) 10CDanis: prometheus: one-shot alert on 
restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) [15:04:48] (03CR) 10Ema: [C: 03+1] prometheus: Support several instances of the trafficserver exporter [puppet] - 10https://gerrit.wikimedia.org/r/506659 (https://phabricator.wikimedia.org/T221217) (owner: 10Vgutierrez) [15:07:01] (03CR) 10Ema: [C: 03+1] "Great stuff!" [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [15:07:05] (03PS4) 10Papaul: DNS: Add mgmt and production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613 [15:07:17] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://wikitech.wikimedia.org/wiki/Logs [15:08:12] moritzm: related to ongoing upgrades? ^^^ [15:08:29] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cp4026.ulsfo.wmnet'] ` and were **ALL** successful. [15:09:02] rsyslog on lithium is me [15:09:14] ack, sorry for the ping mor.itz [15:09:32] I missed your ! 
log, my bad [15:09:43] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 901 days) https://wikitech.wikimedia.org/wiki/Logs [15:09:52] (03CR) 10Mathew.onipe: elasticsearch: add new attribute (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [15:10:02] (03CR) 10CDanis: [C: 03+2] prometheus: one-shot alert on restarts [puppet] - 10https://gerrit.wikimedia.org/r/508011 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis) [15:10:35] (03PS4) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [15:11:09] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [15:12:30] I found the issue with rsyslog itself bizarre, taking a break and then I'll look into it [15:12:45] PROBLEM - Confd template for /etc/varnish/directors.backend.vcl on cp4026 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined https://wikitech.wikimedia.org/wiki/Confd [15:12:57] that's me ^ [15:12:59] PROBLEM - Varnish traffic logger - varnishstatsd on cp4026 is CRITICAL: NRPE: Command check_varnishstatsd not defined https://wikitech.wikimedia.org/wiki/Varnish [15:13:07] jouncebot: now [15:13:07] No deployments scheduled for the next 1 hour(s) and 46 minute(s) [15:13:19] ACKNOWLEDGEMENT - Check Varnish expiry mailbox lag on cp4026 is CRITICAL: NRPE: Command check_check_varnish_expiry_mailbox_lag not defined Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Varnish [15:13:19] ACKNOWLEDGEMENT - Check the NTP synchronisation status of timesyncd on cp4026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.126: Connection reset by peer Ema reimaged w/ 
ATS [15:13:19] ACKNOWLEDGEMENT - Confd template for /etc/varnish/directors.backend.vcl on cp4026 is CRITICAL: NRPE: Command check_confd_etc_varnish_directors.backend.vcl not defined Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Confd [15:13:19] ACKNOWLEDGEMENT - IPMI Sensor Status on cp4026 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.128.0.126: Connection reset by peer Ema reimaged w/ ATS [15:13:19] ACKNOWLEDGEMENT - IPsec on cp4026 is CRITICAL: NRPE: Command check_IPsec not defined Ema reimaged w/ ATS [15:13:19] ACKNOWLEDGEMENT - Varnish HTTP upload-backend - port 3128 on cp4026 is CRITICAL: HTTP CRITICAL: HTTP/1.1 502 connect failed - 1982 bytes in 0.747 second response time Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Varnish [15:13:19] ACKNOWLEDGEMENT - Varnish traffic logger - varnishstatsd on cp4026 is CRITICAL: NRPE: Command check_varnishstatsd not defined Ema reimaged w/ ATS https://wikitech.wikimedia.org/wiki/Varnish [15:14:03] (03PS5) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [15:14:05] (03PS25) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [15:14:31] !log pool cp4026 w/ ATS backend T219967 [15:14:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:36] T219967: Replace Varnish backends with ATS on cache upload nodes in ulsfo - https://phabricator.wikimedia.org/T219967 [15:14:41] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [15:16:25] (03PS5) 10Marostegui: DNS: Add mgmt and production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul) [15:17:10] (03CR) 10Marostegui: [C: 03+2] DNS: Add mgmt and 
production DNS for db2[103-120] [dns] - 10https://gerrit.wikimedia.org/r/507613 (owner: 10Papaul) [15:20:23] vgutierrez: I still want to merge https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/474272/ - I'm not clear on whether that still affects lvs hosts or if they've all moved to Buster already? [15:21:26] andrewbogott: lvs hosts are on stretch and partly jessie [15:23:45] moritzm: ok, so some things would be touched by that patch then [15:23:50] pcc :) [15:23:55] * andrewbogott wonders how they are working now [15:24:01] but AFAIK that won't affect existing lvs servers [15:24:09] at least not those that are in production right now [15:24:15] vgutierrez: great, I'll give the pcc a try [15:24:23] (03CR) 10Volans: "Looks good in general, some minor possible improvements inline." (0311 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [15:24:23] Assuming you don't object in principle :) [15:24:48] https://puppet-compiler.wmflabs.org/compiler1002/15648/ --> this is an old PCC run against that CR [15:25:10] it shows a NOOP for all existing lvs [15:25:13] (03CR) 10Andrew Bogott: "pcc run in progress: https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/16377/" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [15:25:51] (03CR) 10Ema: [C: 04-1] "18-normalize-thumbnail-url.vtc is failing. Please fix that." 
[puppet] - 10https://gerrit.wikimedia.org/r/495643 (https://phabricator.wikimedia.org/T216339) (owner: 10Gilles) [15:28:37] (03PS6) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [15:28:38] (03PS26) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [15:29:17] (03CR) 10jerkins-bot: [V: 04-1] elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) (owner: 10Mathew.onipe) [15:30:04] !log updating base-files from recent stretch point release [15:30:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:30:13] (03PS1) 10RobH: setting production dns entries for db11[26-38].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/508354 (https://phabricator.wikimedia.org/T211613) [15:30:15] (03CR) 10Vgutierrez: [C: 03+2] trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [15:30:28] (03PS25) 10Vgutierrez: trafficserver: Provide support for inbound TLS traffic [puppet] - 10https://gerrit.wikimedia.org/r/506159 (https://phabricator.wikimedia.org/T221594) [15:31:18] (03CR) 10RobH: [C: 03+2] setting production dns entries for db11[26-38].eqiad.wmnet [dns] - 10https://gerrit.wikimedia.org/r/508354 (https://phabricator.wikimedia.org/T211613) (owner: 10RobH) [15:31:29] 10Operations, 10ops-codfw, 10DBA, 10Goal, 10Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (10Marostegui) [15:32:39] 10Operations, 10ops-eqiad, 10DBA, 10Goal, 10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH) [15:33:00] 10Operations, 10ops-eqiad, 10DBA, 10Goal, 
10User-Marostegui: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10RobH) a:05RobH→03Marostegui All set! [15:35:45] !log shutting down elastic2038 for DIMM swap [15:35:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:07] !log updating firmware-bnx2 (from stretch point release, this is a NOP, the source package firmware-nonfree was updated for various Wifi chipsets we don't use, doublechecked by comparing check sums for old and new bnx2 firmware) [15:37:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:12] 10Operations: rsyslog's in:imtcp thread stuck on old sockets - https://phabricator.wikimedia.org/T199406 (10fgiunchedi) Same issue today, rsyslog was stuck on lithium and wezen, strace shows a whole lot of this: ` 37672 recvfrom(837, 0x7f092c765c30, 55, 0, 0, 0) = -1 EAGAIN (Resource temporarily unavailable) 37... [15:37:30] (03PS7) 10Mathew.onipe: elasticsearch: add new attribute [puppet] - 10https://gerrit.wikimedia.org/r/507950 (https://phabricator.wikimedia.org/T218932) [15:37:32] (03PS27) 10Mathew.onipe: icinga: create and apply cirrus config check(recheck) [puppet] - 10https://gerrit.wikimedia.org/r/507045 (https://phabricator.wikimedia.org/T218932) [15:41:34] PROBLEM - Host elastic2038.mgmt is DOWN: PING CRITICAL - Packet loss = 100% [15:41:34] PROBLEM - Check systemd state on ms-be1030 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. 
[15:44:05] ^ known [15:45:02] (03PS1) 10RobH: mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613) [15:46:28] (03PS2) 10RobH: mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613) [15:47:22] (03CR) 10jerkins-bot: [V: 04-1] mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613) (owner: 10RobH) [15:47:36] yeah yeah jenkinsbot i know [15:47:46] (03CR) 10Ema: [C: 03+1] trafficserver: Allow disabling caching requests [puppet] - 10https://gerrit.wikimedia.org/r/506390 (https://phabricator.wikimedia.org/T221594) (owner: 10Vgutierrez) [15:47:56] (03PS3) 10RobH: mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613) [15:48:29] !log updating firmware-bnx2x (from stretch point release, this is a NOP, the source package firmware-nonfree was updated for various Wifi chipsets we don't use, doublechecked by comparing check sums for old and new bnx2x firmware) [15:48:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:48:39] (03CR) 10Volans: [C: 03+1] "> Patch Set 2:" [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [15:49:04] (03CR) 10RobH: [C: 03+2] mac addresses for db11[26-38].eqiad.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/508355 (https://phabricator.wikimedia.org/T211613) (owner: 10RobH) [15:50:23] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 CPU/memory errors - https://phabricator.wikimedia.org/T217398 (10Papaul) @Gehel DIMM swap complete [15:52:01] RECOVERY - Host elastic2038.mgmt is UP: PING OK - Packet loss = 0%, RTA = 36.95 ms [15:52:07] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - 
https://phabricator.wikimedia.org/T211613 (10Marostegui) Thanks, now that ^ has been merged I will take over Note: db1127 is still not present on the netboot.cfg because it is not accessible yet via idrac s... [15:52:18] (03PS3) 10CRusnov: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 [15:53:05] PROBLEM - Prometheus prometheus1003.eqiad.wmnet/analytics was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [15:53:05] PROBLEM - Prometheus prometheus1004.eqiad.wmnet/k8s-staging was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [15:53:24] sigh, fixing [15:54:12] 10Operations, 10ops-eqiad, 10DBA, 10Goal, and 2 others: rack/setup/install db11[26-38].eqiad.wmnet - https://phabricator.wikimedia.org/T211613 (10Marostegui) [15:54:58] 10Operations, 10ops-eqiad, 10RESTBase, 10Core Platform Team Backlog (Watching / External), and 2 others: rack/setup/install restbase10[19-27].eqiad.wmnet - https://phabricator.wikimedia.org/T219404 (10Cmjohnson) @moborvac I haven't had a chance to get to them until this week. I should be able to get them... 
[15:54:59] PROBLEM - Prometheus prometheus1004.eqiad.wmnet/ops was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [15:54:59] PROBLEM - Prometheus prometheus1003.eqiad.wmnet/global was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [15:56:52] (03CR) 10CRusnov: [C: 03+2] Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [15:56:55] PROBLEM - Prometheus prometheus1004.eqiad.wmnet/services was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [15:56:55] PROBLEM - Prometheus prometheus1003.eqiad.wmnet/k8s was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [15:57:53] !log CI / Zuul is being slowed down and being investigated [15:57:55] RECOVERY - Check systemd state on ms-be1030 is OK: OK - running: The system is fully operational [15:57:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:57:55] :( [15:58:51] PROBLEM - Prometheus prometheus1003.eqiad.wmnet/k8s-staging was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging 
[15:59:41] (03PS1) 10CDanis: prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 [16:00:31] PROBLEM - Prometheus bast5001.wikimedia.org/ops was restarted on bast5001 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [16:01:12] (03CR) 10jerkins-bot: [V: 04-1] prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (owner: 10CDanis) [16:01:16] PROBLEM - Prometheus prometheus1003.eqiad.wmnet/ops was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [16:01:17] PROBLEM - Prometheus prometheus2004.codfw.wmnet/analytics was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [16:01:44] (03PS2) 10CDanis: prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (https://phabricator.wikimedia.org/T222108) [16:01:46] (03Merged) 10jenkins-bot: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [16:02:26] (03PS3) 10CDanis: prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (https://phabricator.wikimedia.org/T222108) [16:02:43] PROBLEM - Prometheus prometheus1003.eqiad.wmnet/services was restarted on prometheus1003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [16:02:43] PROBLEM - Prometheus 
prometheus2004.codfw.wmnet/global was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global [16:04:41] PROBLEM - Prometheus prometheus2004.codfw.wmnet/k8s was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:05:18] (03CR) 10CDanis: [C: 03+2] prometheus uptime alert: fix query [puppet] - 10https://gerrit.wikimedia.org/r/508356 (https://phabricator.wikimedia.org/T222108) (owner: 10CDanis) [16:06:35] PROBLEM - Prometheus prometheus2003.codfw.wmnet/analytics was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [16:06:35] PROBLEM - Prometheus prometheus2004.codfw.wmnet/ops was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [16:08:11] PROBLEM - Prometheus bast4002.wikimedia.org/ops was restarted on bast4002 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops [16:08:29] PROBLEM - Prometheus prometheus2003.codfw.wmnet/global was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/global 
[16:08:31] PROBLEM - Prometheus prometheus2004.codfw.wmnet/services was restarted on prometheus2004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [16:08:38] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review: access for foks to labweb (in one way or another) (or make changePassword.php work on mwmaint hosts) - https://phabricator.wikimedia.org/T220860 (10Dzahn) 05Open→03Stalled [16:08:45] PROBLEM - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,11 instance=db2049:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops [16:10:06] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2049 is CRITICAL: cluster=mysql device=cciss,11 instance=db2049:9100 job=node site=codfw Marostegui being worked by papaul https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2049&var-datasource=codfw+prometheus/ops [16:10:25] PROBLEM - Prometheus prometheus2003.codfw.wmnet/k8s was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:11:53] !log CI queue drained. 
Should be working fine again now [16:11:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:12:21] PROBLEM - Prometheus prometheus1004.eqiad.wmnet/analytics was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [16:12:21] PROBLEM - Prometheus prometheus2003.codfw.wmnet/ops was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [16:13:43] (03PS1) 10EBernhardson: cloudelastic: Don't write to private wikis on cloudelastic [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508357 [16:13:47] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (10RobH) This is only under warranty until later this month, and was brought up in the SRE weekly meeting. This needs to be high priority! Supposedly warra... 
[16:14:13] PROBLEM - Prometheus prometheus1004.eqiad.wmnet/global was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/global [16:14:13] PROBLEM - Prometheus prometheus2003.codfw.wmnet/services was restarted on prometheus2003 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [16:16:05] PROBLEM - Prometheus prometheus1004.eqiad.wmnet/k8s was restarted on prometheus1004 is CRITICAL: bad_data: parse error at char 48: unexpected identifier prometheus in label matching, expected string https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [16:20:36] (03PS1) 10CDanis: prometheus uptime: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/508359 [16:21:24] (03CR) 10jenkins-bot: Ganeti module: Add timeout support [software/spicerack] - 10https://gerrit.wikimedia.org/r/507390 (owner: 10CRusnov) [16:21:33] (03CR) 10CDanis: [C: 03+2] prometheus uptime: fix typo [puppet] - 10https://gerrit.wikimedia.org/r/508359 (owner: 10CDanis) [16:26:21] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [16:28:09] (03PS1) 10CRusnov: profile::spicerack: Add timeout parameter for ganeti module. 
[puppet] - 10https://gerrit.wikimedia.org/r/508361 [16:28:11] (03PS1) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) [16:28:55] (03CR) 10jerkins-bot: [V: 04-1] Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [16:28:56] elukey: -- if you have a minute --^ [16:28:58] Arf [16:29:21] RECOVERY - Prometheus prometheus2004.codfw.wmnet/analytics was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.258e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [16:29:37] RECOVERY - Prometheus prometheus1003.eqiad.wmnet/services was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.317e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [16:29:39] RECOVERY - Prometheus prometheus1003.eqiad.wmnet/k8s was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.317e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [16:29:39] RECOVERY - Prometheus prometheus1004.eqiad.wmnet/services was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.309e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/services [16:29:43] RECOVERY - Prometheus prometheus2003.codfw.wmnet/services was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.263e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [16:29:45] RECOVERY - Prometheus bast4002.wikimedia.org/ops was restarted on bast4002 is OK: (C)600 lt (W)1800 lt 5.251e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=ulsfo+prometheus/ops [16:29:51] RECOVERY - Prometheus prometheus1004.eqiad.wmnet/k8s was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.322e+05 
https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s [16:29:57] RECOVERY - Prometheus prometheus1003.eqiad.wmnet/analytics was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.317e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [16:29:57] RECOVERY - Prometheus prometheus1004.eqiad.wmnet/k8s-staging was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.322e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [16:30:07] RECOVERY - Prometheus prometheus1003.eqiad.wmnet/ops was restarted on prometheus1003 is OK: (C)600 lt (W)1800 lt 5.219e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [16:30:07] RECOVERY - Prometheus prometheus2004.codfw.wmnet/k8s was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.259e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:30:09] RECOVERY - Prometheus prometheus2004.codfw.wmnet/ops was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.259e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [16:30:09] RECOVERY - Prometheus prometheus1004.eqiad.wmnet/analytics was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.322e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/analytics [16:30:09] RECOVERY - Prometheus prometheus2003.codfw.wmnet/ops was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.263e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/ops [16:30:35] (03PS2) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) [16:30:41] RECOVERY - Prometheus prometheus1003.eqiad.wmnet/k8s-staging was restarted on prometheus1003 
is OK: (C)600 lt (W)1800 lt 5.318e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/k8s-staging [16:30:57] first try \o/ [16:31:18] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: db1093 (s6 candidate master) went down - broken BBU - https://phabricator.wikimedia.org/T222127 (10Cmjohnson) I created a task for this with HPE. Case ID: 5338390467 Case title: Failed BBU Severity 3-Normal Product serial number: MXQ616071T Product... [16:31:22] (03CR) 10Elukey: Update analytics sqoop scheduling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [16:32:42] (03CR) 10Andrew Bogott: "That run shows everything as no-op except for a couple of logstash which seem to always produce false positives. Rechecking those, they'r" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [16:33:33] RECOVERY - Prometheus prometheus1004.eqiad.wmnet/ops was restarted on prometheus1004 is OK: (C)600 lt (W)1800 lt 5.294e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqiad+prometheus/ops [16:33:41] (03PS2) 10CRusnov: profile::spicerack: Add timeout parameter for ganeti module. 
[puppet] - 10https://gerrit.wikimedia.org/r/508361 [16:34:05] (03PS10) 10Andrew Bogott: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [16:34:39] RECOVERY - Prometheus prometheus2004.codfw.wmnet/services was restarted on prometheus2004 is OK: (C)600 lt (W)1800 lt 5.261e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/services [16:34:49] (03CR) 10Andrew Bogott: [C: 03+2] lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [16:34:55] !log restart db2102 mysql for upgrade testing [16:34:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:18] (03CR) 10CRusnov: "Build looks good:" [puppet] - 10https://gerrit.wikimedia.org/r/508361 (owner: 10CRusnov) [16:38:34] !log re-imaging cloudvirt1024 [16:38:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:39:03] RECOVERY - Prometheus prometheus2003.codfw.wmnet/analytics was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.269e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/analytics [16:39:10] 10Operations, 10observability, 10Patch-For-Review, 10Wikimedia-Incident: prometheus: some sort of IRC alerts on restarts? - https://phabricator.wikimedia.org/T222108 (10CDanis) 05Open→03Resolved a:03CDanis We now have IRC alerting based on scraping each prometheus for its `process_start_time_seconds`... [16:39:28] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/508361 (owner: 10CRusnov) [16:39:42] (03PS3) 10CRusnov: profile::spicerack: Add timeout parameter for ganeti module. 
[puppet] - 10https://gerrit.wikimedia.org/r/508361 [16:40:37] RECOVERY - Prometheus bast5001.wikimedia.org/ops was restarted on bast5001 is OK: (C)600 lt (W)1800 lt 5.252e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=eqsin+prometheus/ops [16:40:57] RECOVERY - Prometheus prometheus2003.codfw.wmnet/k8s was restarted on prometheus2003 is OK: (C)600 lt (W)1800 lt 5.27e+05 https://grafana.wikimedia.org/d/000000271/prometheus-stats?var-datasource=codfw+prometheus/k8s [16:41:34] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) @elukey @Ottomata what do you guys want to do with this? [16:42:06] !log restart db1114 mysql for upgrade testing [16:42:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:59] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10elukey) @Marostegui sorry I was under the impression that we'd have needed to wait for a feedback from Chris/Rob about how to proc... [16:46:54] (03CR) 10CRusnov: [C: 03+2] profile::spicerack: Add timeout parameter for ganeti module. 
[puppet] - 10https://gerrit.wikimedia.org/r/508361 (owner: 10CRusnov) [16:47:48] (03PS2) 10Elukey: admin: allow analytics-admins to sudo as the analytics user [puppet] - 10https://gerrit.wikimedia.org/r/507812 (https://phabricator.wikimedia.org/T222368) [16:47:59] (03CR) 10Elukey: [C: 03+2] "Approved by the SRE team meeting" [puppet] - 10https://gerrit.wikimedia.org/r/507812 (https://phabricator.wikimedia.org/T222368) (owner: 10Elukey) [16:49:25] (03PS3) 10Elukey: admin: allow analytics-admins to use systemctl for all units [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) [16:49:37] (03CR) 10Elukey: [C: 03+2] "Approved by the SRE team meeting." [puppet] - 10https://gerrit.wikimedia.org/r/507813 (https://phabricator.wikimedia.org/T222368) (owner: 10Elukey) [16:49:44] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) @elukey sorry, I realised that I didn't sent the first sentence: "The errors corrected themselves and Icinga is now al... [16:50:03] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Cmjohnson) I now have h/w log entries. I will need the server to be taken offline so I can relocate the DIMM and check to see if t... [16:50:45] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) @elukey can you coordinate with Chris? 
^ [16:51:47] (03CR) 10Joal: Update analytics sqoop scheduling (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [16:51:57] (03PS3) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) [16:52:14] Gone for dinner - Back in while [16:52:27] (03CR) 10jerkins-bot: [V: 04-1] Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [16:52:29] milimetric: Shall we deploy new aqs datasource? [16:52:34] oops wrong chan sorry [16:53:55] (03CR) 10Dmaza: [C: 03+1] Enable Partial Blocks on Bengali Wikipedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508122 (https://phabricator.wikimedia.org/T222258) (owner: 10Ammarpad) [16:54:06] 10Operations, 10ops-eqiad, 10Analytics, 10Analytics-EventLogging, 10DBA: db1107 (eventlogging db master) possibly memory issues - https://phabricator.wikimedia.org/T222050 (10Marostegui) ` [18:50:55] marostegui i am confused over db1007...is there an issue or not an issue? There is a h/w l... 
[16:54:17] (03PS4) 10Joal: Update analytics sqoop scheduling [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) [16:56:02] (03PS3) 10Dzahn: Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper) [16:57:32] (03CR) 10Dzahn: [C: 03+2] Phab: Allow greg and aklapper to convert projects to subprojects/milestones [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper) [16:57:40] (03CR) 10Dzahn: [C: 03+2] "approved in today's SRE meeting" [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper) [16:58:27] (03CR) 10Dzahn: "@Aklapper: just making sure, does the move_project script really not need any parameters, like the project name to be moved?" [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper) [17:00:04] gehel and onimisionipe: Your horoscope predicts another unfortunate Wikidata Query Service weekly deploy deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1700). [17:01:03] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Dzahn) @Aklapper @mmodell The request has been app... 
[17:01:36] (03CR) 10Jforrester: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri) [17:02:01] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10mmodell) @dzahn: it has a bunch of parameters :-/ [17:02:04] (03CR) 10Jforrester: "Presumably this should only get deployed just before I6d0215082f?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508130 (https://phabricator.wikimedia.org/T222516) (owner: 10Luca Mauri) [17:02:12] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Dzahn) a:03Aklapper Puppet ran on phab1001. If y... [17:02:39] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10mmodell) see T221112#5121800 [17:03:33] 10Operations, 10Phabricator, 10Project-Admins, 10SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (10Dzahn) >>! In T221112#5160984, @mmodell wrote: > @... 
[17:07:04] (03CR) 10Dzahn: "this is probably not enough because you won't be allowed to add parameters" [puppet] - 10https://gerrit.wikimedia.org/r/505667 (https://phabricator.wikimedia.org/T221112) (owner: 10Aklapper) [17:11:00] !log restart dbprov* hosts, in sequence, for kernel upgrade [17:11:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:31] (03CR) 10Nuria: "Nice. Super concise! Let's please make sure to test this actually works as intended." [puppet] - 10https://gerrit.wikimedia.org/r/508362 (https://phabricator.wikimedia.org/T222378) (owner: 10Joal) [17:16:09] mutante: if you have a minute, would appreciate a review on this comment-only fix https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/504058/ [17:16:45] (03PS2) 10Krinkle: mediawiki: remove comment about 'enable_profiling' [puppet] - 10https://gerrit.wikimedia.org/r/504058 [17:17:22] (03CR) 10Dzahn: [C: 03+2] mediawiki: remove comment about 'enable_profiling' [puppet] - 10https://gerrit.wikimedia.org/r/504058 (owner: 10Krinkle) [17:18:24] Krinkle: no problem merged. in general just add me to Gerrit for those. 
i will see requests in my queue [17:19:11] !log restart netbox on netmon1002 as test [17:19:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:25:23] (PS10) Alex Monk: elastalert: new module [puppet] - https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi) [17:26:24] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) Opened https://github.com/unbit/uwsgi/issues/2010 [17:28:00] Operations, Traffic: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (Vgutierrez) [17:28:56] Operations, Traffic: false positives in check_trafficserver_config_status - https://phabricator.wikimedia.org/T222642 (Vgutierrez) p:Triage→Normal [17:29:27] (PS11) Alex Monk: elastalert: new module [puppet] - https://gerrit.wikimedia.org/r/502773 (https://phabricator.wikimedia.org/T213933) (owner: Filippo Giunchedi) [17:30:23] (CR) Dzahn: [C: +1] "seems to make sense since upstream does disable it by default in 3.0" [puppet] - https://gerrit.wikimedia.org/r/508127 (owner: Paladox) [17:31:41] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:31:47] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:33:31] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:36:58] (PS2) CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - https://gerrit.wikimedia.org/r/508067 [17:37:06] (CR) CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508067 (owner: CRusnov) [17:37:22] (CR) jerkins-bot: [V: -1] profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - https://gerrit.wikimedia.org/r/508067 (owner: CRusnov) [17:39:35] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [17:39:41] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [17:41:27] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [17:51:18] Operations, ops-eqiad, Gerrit, serviceops, Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (CDanis) cc @mark who I know is about to start looking at hardware requests for the coming FY [17:51:39] (CR) CRusnov: "rebuild" [puppet] - https://gerrit.wikimedia.org/r/508067 (owner: CRusnov) [17:52:13] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: 
eventgate-main, clusters: staging] [17:52:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:52:36] Operations, ops-eqiad, Gerrit, serviceops, Release-Engineering-Team (Watching / External): Gerrit Hardware Upgrade - https://phabricator.wikimedia.org/T222391 (Dzahn) I expect this to be a topic in our (DP - SRE) meeting this Wednesday. [17:53:48] !log otto@deploy1001 scap-helm eventgate-main install -n main -f main/staging-values.yaml stable/eventgate [namespace: eventgate-main, clusters: staging] [17:53:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:55:00] (PS3) CRusnov: profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - https://gerrit.wikimedia.org/r/508067 [17:59:59] @seen andre_ [17:59:59] mutante: Last time I saw andre_ they were changing the nickname to Guest17533, but Guest17533 is no longer in channel #wikimedia-dev at 10/18/2018 4:53:24 AM (200d13h6m35s ago) [18:00:04] MaxSem, RoanKattouw, and Niharika: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T1800). [18:00:05] kostajh, raynor, and Amir1: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:21] I'm here [18:00:23] o/ [18:00:40] I can SWAT [18:01:07] (PS1) Ottomata: Add eventgate-main to profile::kubernetes::deployment_server::services [puppet] - https://gerrit.wikimedia.org/r/508371 (https://phabricator.wikimedia.org/T218346) [18:01:45] o/ [18:02:03] (CR) Ottomata: "I believe this also needs a patch to ops/private. I see how to add them, can I generate a token or does this come from somewhere specific?" 
[puppet] - https://gerrit.wikimedia.org/r/508371 (https://phabricator.wikimedia.org/T218346) (owner: Ottomata) [18:02:37] (PS2) Catrope: Set $wgOresFrontendBaseUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: Ladsgroup) [18:02:48] (CR) Catrope: [C: +2] Set $wgOresFrontendBaseUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: Ladsgroup) [18:04:01] (Merged) jenkins-bot: Set $wgOresFrontendBaseUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: Ladsgroup) [18:04:16] (CR) jenkins-bot: Set $wgOresFrontendBaseUrl [mediawiki-config] - https://gerrit.wikimedia.org/r/508298 (https://phabricator.wikimedia.org/T219396) (owner: Ladsgroup) [18:06:13] Amir1: Your patch is on mwdebug1002, please test [18:06:45] on it [18:06:51] it's a little bit complex to test [18:08:42] Operations, Phabricator, Project-Admins, SRE-Access-Requests, and 2 others: Document how to convert projects into subprojects/milestones etc (sudo privileges for phab admins to run move_project script) - https://phabricator.wikimedia.org/T221112 (Dzahn) >>! In T221112#5160991, @Dzahn wrote: >>>!... 
[18:11:29] (PS1) Dzahn: admins: simplify sudo privs for phab-admin group [puppet] - https://gerrit.wikimedia.org/r/508373 [18:11:32] (CR) CRusnov: [C: +2] profile::netbox: Move ganeti sync config to /etc/netbox [puppet] - https://gerrit.wikimedia.org/r/508067 (owner: CRusnov) [18:11:43] RoanKattouw: It fixes the issue [18:11:46] please process [18:11:49] *proceed [18:14:26] !log catrope@deploy1001 Synchronized wmf-config/CommonSettings.php: Set $wgOresFrontendBaseUrl (T219396) (duration: 00m 51s) [18:14:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:14:33] T219396: Fix oresBaseUrl config variable in frontend - https://phabricator.wikimedia.org/T219396 [18:15:29] (PS3) CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066 [18:17:31] Operations, WMF-Legal, Wikimedia-General-or-Unknown, Documentation, and 2 others: Default license for operations/puppet - https://phabricator.wikimedia.org/T67270 (Dzahn) p:Low→Normal The [[ https://www.mediawiki.org/wiki/Wikimedia_Engineering_Architecture_Principles | Wikimedia Engineeri... 
[18:18:41] (PS1) Andrew Bogott: cloudvirt1024: update the (now truncated) interface name [puppet] - https://gerrit.wikimedia.org/r/508378 (https://phabricator.wikimedia.org/T216724) [18:19:38] (CR) Andrew Bogott: [C: +2] cloudvirt1024: update the (now truncated) interface name [puppet] - https://gerrit.wikimedia.org/r/508378 (https://phabricator.wikimedia.org/T216724) (owner: Andrew Bogott) [18:21:01] (PS2) CDanis: swift-object-replicator: nice & ionice it [puppet] - https://gerrit.wikimedia.org/r/506321 [18:21:37] (PS2) Dzahn: admins: simplify sudo privs for phab-admin group [puppet] - https://gerrit.wikimedia.org/r/508373 [18:24:58] !log restart and upgrade db1116 [18:25:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:28:43] (PS3) CDanis: swift-object-replicator: nice & ionice it [puppet] - https://gerrit.wikimedia.org/r/506321 [18:28:44] raynor: Your patch is now (finally) on mwdebug1002, please test [18:28:56] RoanKattouw: thx, testing [18:29:34] (CR) CDanis: [C: +2] swift-object-replicator: nice & ionice it [puppet] - https://gerrit.wikimedia.org/r/506321 (owner: CDanis) [18:29:57] RoanKattouw: my patch was easy/quick to test -> it works [18:30:01] please deploy to prod [18:30:20] (PS4) Dzahn: hhvm: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/456319 (https://phabricator.wikimedia.org/T194724) [18:30:44] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin 'ms-be*' 'disable-puppet "cdanis rollout I369f9b29"' [18:30:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:30:58] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (Volans) @elukey thanks a lot for deep dive and the bug upstream! 
[18:32:19] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.3/skins/MinervaNeue/includes/menu/Definitions.php: Harden Definitions::insertCommunityPortal() method (T222407) (duration: 00m 53s) [18:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:32:25] T222407: [1.34.0-wmf.3] Regression: Definitions.php: Call to a member function exists() on a non-object (null) - https://phabricator.wikimedia.org/T222407 [18:35:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:35:49] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:35:51] PROBLEM - HTTP availability for Varnish at esams on icinga1001 is CRITICAL: job=varnish-text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:36:03] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:36:27] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:36:29] PROBLEM - HTTP availability for Varnish at eqiad on icinga1001 is CRITICAL: job=varnish-text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:37:35] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:37:37] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:37:43] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:37:48] (CR) Ayounsi: "Thanks, reply inline." (11 comments) [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: Ayounsi) [18:38:09] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:38:10] (PS10) Ayounsi: Icinga: Add OSPF check to routers [puppet] - https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [18:38:29] RECOVERY - HTTP availability for Varnish at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:39:45] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:39:59] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:40:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:40:27] RECOVERY - HTTP availability for Varnish at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [18:40:43] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [18:43:55] kostajh: Your null pageviews link patch is now on mwdebug1002, please test [18:44:00] RoanKattouw: looking [18:44:19] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [18:45:23] RoanKattouw: yep, looks good [18:45:31] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [18:46:03] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [18:46:51] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] 
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [18:47:10] !log catrope@deploy1001 Synchronized php-1.34.0-wmf.3/extensions/GrowthExperiments/: Remove link to pageviews tool when no data available (T222405) (duration: 00m 52s) [18:47:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:16] vgutierrez: are you still about? [18:47:17] T222405: Homepage: null pageviews should not be a link - https://phabricator.wikimedia.org/T222405 [18:47:28] err.... [18:47:35] cooking dinner as we speak :) [18:47:45] ok — I'll wait and bug you tomorrow then [18:53:00] (CR) Dzahn: hhvm: base::service_unit -> systemd::service (1 comment) [puppet] - https://gerrit.wikimedia.org/r/456319 (https://phabricator.wikimedia.org/T194724) (owner: Dzahn) [18:53:45] RoanKattouw: for the config patch, I think you can just sync it when it's ready. I have stat1004 open with kafkacat and can look at some events to verify it's working properly [18:53:53] Oh yes that's right [18:53:57] I thought I was done, oops [18:55:31] (PS5) Cwhite: initial attempt at a varnishkafka exporter [debs/prometheus-varnishkafka-exporter] - https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) [18:55:50] (PS2) Catrope: GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/507115 (https://phabricator.wikimedia.org/T221266) (owner: Kosta Harlan) [18:56:33] (CR) Cwhite: initial attempt at a varnishkafka exporter (2 comments) [debs/prometheus-varnishkafka-exporter] - https://gerrit.wikimedia.org/r/507632 (https://phabricator.wikimedia.org/T196066) (owner: Cwhite) [18:57:48] (CR) Catrope: [C: +2] GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/507115 
(https://phabricator.wikimedia.org/T221266) (owner: Kosta Harlan) [18:58:51] (Merged) jenkins-bot: GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/507115 (https://phabricator.wikimedia.org/T221266) (owner: Kosta Harlan) [18:58:53] (PS4) CRusnov: ganeti-netbox sync: Sync host status also [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066 [18:59:05] (CR) jenkins-bot: GrowthExperiments: Begin experiment for Homepage with cs/kowiki [mediawiki-config] - https://gerrit.wikimedia.org/r/507115 (https://phabricator.wikimedia.org/T221266) (owner: Kosta Harlan) [19:01:31] (CR) CRusnov: ganeti-netbox sync: Sync host status also (1 comment) [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066 (owner: CRusnov) [19:01:36] !log catrope@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Begin homepage experiment on cswiki and kowiki (T221266) (duration: 00m 51s) [19:01:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:01:40] T221266: Homepage: Deploy to target wikis in production - https://phabricator.wikimedia.org/T221266 [19:02:03] (CR) CRusnov: [V: +2] "Latest patchset with some changes due to testing. Seems to work, mirroring the status." 
[software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066 (owner: CRusnov) [19:05:37] Operations, MediaWiki-Cache, Patch-For-Review, Performance-Team (Radar), and 2 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (aaron) a:aaron→None [19:13:06] Operations, ops-codfw, media-storage, observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (CDanis) [19:20:21] (PS3) Dzahn: admins: remove ability to run commands as user 'apache' [puppet] - https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) [19:20:57] PROBLEM - IPv4 ping to eqsin on ripe-atlas-eqsin is CRITICAL: CRITICAL - failed 84 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:22:11] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -m async -b4 'ms-be2*' 'run-puppet-agent --enable "cdanis rollout I369f9b29"' 'systemctl systemctl restart swift-object-replicator' [19:22:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:27:16] Warning Alert for device cr1-codfw.wikimedia.org - Inbound interface errors [19:27:20] (CR) Dzahn: [C: +2] admins: remove ability to run commands as user 'apache' [puppet] - https://gerrit.wikimedia.org/r/506750 (https://phabricator.wikimedia.org/T78076) (owner: Dzahn) [19:31:25] XioNoX: the atlas.ripe net map looks kind of bad. 
84 fails seems more than common [19:31:54] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -m async -b4 'ms-be1*' 'run-puppet-agent --enable "cdanis rollout I369f9b29"' 'systemctl restart swift-object-replicator' [19:31:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:33:16] W̶a̶r̶n̶i̶n̶g Device cr1-codfw.wikimedia.org recovered from Inbound interface errors [19:35:50] (PS3) Dzahn: vagrant::mediawiki: base::service_unit -> systemd::service [puppet] - https://gerrit.wikimedia.org/r/508009 (https://phabricator.wikimedia.org/T194724) [19:37:03] RECOVERY - IPv4 ping to eqsin on ripe-atlas-eqsin is OK: OK - failed 20 probes of 454 (alerts on 35) - https://atlas.ripe.net/measurements/11645085/#!map https://wikitech.wikimedia.org/wiki/Network_monitoring%23Atlas_alerts [19:37:47] 20 is about the baseline IIRC [19:38:29] (PS3) Gehel: maps: remove cassandra metric blacklist [puppet] - https://gerrit.wikimedia.org/r/508313 (owner: Mathew.onipe) [19:39:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:40:10] (CR) Gehel: [C: +2] maps: remove cassandra metric blacklist [puppet] - https://gerrit.wikimedia.org/r/508313 (owner: Mathew.onipe) [19:40:23] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:40:30] (PS4) Dzahn: phragile: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [19:40:47] (CR) Dzahn: "labs only - affects https://tools.wmflabs.org/openstack-browser/server/phragile-pro.phragile.eqiad.wmflabs - getting access to that insta" [puppet] - 
https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [19:41:09] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:41:11] (CR) Dzahn: [C: +2] phragile: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [19:41:19] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:43:01] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:43:37] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:44:21] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [19:45:08] (CR) Dzahn: "on phragile-pro.phragile: rm -rf /var/lib/phragile/composer and then let puppet recreate and reclone" [puppet] - https://gerrit.wikimedia.org/r/507084 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [19:45:19] thanks mutante ! 
[19:46:12] (PS1) Paladox: Gerrit: Configure logging in json to error_log.json [puppet] - https://gerrit.wikimedia.org/r/508391 [19:47:52] !log Running recomputeNotifCounts.php --notif-types=login-success on all Echo wikis for T220762 [19:47:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:47:57] T220762: [betalabs] Stuck cross-wiki notification - https://phabricator.wikimedia.org/T220762 [19:48:32] !log rolling restart of cassandra on maps* for config change [19:48:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:49:07] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [19:49:17] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [19:49:39] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [19:50:21] (PS6) Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844) [19:51:02] (PS2) Paladox: Gerrit: Configure logging in json to error_log.json [puppet] - https://gerrit.wikimedia.org/r/508391 [19:51:45] (PS3) Paladox: Gerrit: Configure logging in json to error_log.json [puppet] - https://gerrit.wikimedia.org/r/508391 (https://phabricator.wikimedia.org/T141324) [19:52:49] (CR) Dzahn: [C: +1] authdns: Stop cloning over /p/ [puppet] - 
https://gerrit.wikimedia.org/r/507076 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [19:53:37] (CR) Dzahn: [C: +1] "merging this means having to delete the git repo checkout dir and letting puppet re-clone it.. or editing the .git/config and replacing th" [puppet] - https://gerrit.wikimedia.org/r/507076 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [19:54:26] (PS6) Dzahn: openldap: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507088 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [20:00:04] cscott, arlolra, subbu, bearND, and halfak: That opportune time is upon us again. Time for a Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T2000). [20:03:29] (CR) Dzahn: [C: +2] openldap: Stop cloning over /p/ [puppet] - https://gerrit.wikimedia.org/r/507088 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [20:06:18] (PS2) Hashar: zuul: stop pinning python-pbr [puppet] - https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) [20:06:23] (CR) Dzahn: "confirmed that the python scripts using this use tempfile which should be deleted right away anyways.. and tested that cross-validate-acco" [puppet] - https://gerrit.wikimedia.org/r/507088 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [20:08:04] (CR) Dzahn: [C: +1] "on merge the existing checkout dirs should be deleted so that puppet re-creates them" [puppet] - https://gerrit.wikimedia.org/r/507074 (https://phabricator.wikimedia.org/T218844) (owner: Paladox) [20:08:36] (CR) Hashar: [C: +1] "That should be good for deployment. I have patched Zuul to no more rely on python-pbr to get its version." 
[puppet] - https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: Hashar) [20:10:33] (CR) Dzahn: [C: -1] "bump, see my last question" [puppet] - https://gerrit.wikimedia.org/r/494811 (owner: Paladox) [20:13:19] (CR) Dzahn: [C: +2] zuul: stop pinning python-pbr [puppet] - https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: Hashar) [20:13:21] (CR) Paladox: Gerrit: Support switching ldap servers (1 comment) [puppet] - https://gerrit.wikimedia.org/r/494811 (owner: Paladox) [20:13:28] (CR) Krinkle: "As next step - can we switch this to async first, and then commit to monitoring before we serve multi-dc traffic?" [puppet] - https://gerrit.wikimedia.org/r/492948 (owner: Aaron Schulz) [20:14:07] (CR) Dzahn: "> Once this puppet patch is merged, I can handle the upgrade of python-pbr and confirm that production works fine." [puppet] - https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: Hashar) [20:16:06] (CR) Dzahn: [C: -1] "ok! could we just drop the word "custom" from all of these. it's kind of implied with any parameter that it's custom if you change it" [puppet] - https://gerrit.wikimedia.org/r/494811 (owner: Paladox) [20:17:03] (CR) Hashar: [C: +1] "Danke :)" [puppet] - https://gerrit.wikimedia.org/r/502207 (https://phabricator.wikimedia.org/T218559) (owner: Hashar) [20:18:56] mutante: I will do the upgrade tomorrow morning [20:19:03] too late for me to deal with potential aftermath [20:19:28] (CR) Dzahn: "any further comments or links why we don't need / want it anymore? adding Tyler to reviewers" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/507990 (owner: Paladox) [20:19:36] hashar: sure! 
ok [20:20:46] mutante see the other channel :) [20:25:42] (CR) Dzahn: "per https://phabricator.wikimedia.org/T162070 which comes after https://phabricator.wikimedia.org/T165625 there is only 1 class left using" [puppet] - https://gerrit.wikimedia.org/r/354131 (owner: Paladox) [20:27:07] (CR) Dzahn: [C: -1] "we really want to delete the entire module instead" [puppet] - https://gerrit.wikimedia.org/r/354131 (owner: Paladox) [20:50:28] (CR) Thcipriani: "> any further comments or links why we don't need / want it anymore?" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/507990 (owner: Paladox) [20:51:09] (CR) Dzahn: [C: +1] "gotcha, thanks" [software/gerrit] (wmf/stable-2.16) - https://gerrit.wikimedia.org/r/507990 (owner: Paladox) [20:54:45] (PS1) Hashar: contint: bump git-daemon max connections 32 -> 48 [puppet] - https://gerrit.wikimedia.org/r/508408 (https://phabricator.wikimedia.org/T222661) [21:00:04] bawolff and Reedy: I, the Bot under the Fountain, allow thee, The Deployer, to do Weekly Security deployment window deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T2100). [21:02:58] Operations, Gerrit, Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (crusnov) Just to +1 the idea of shipping javamelody to prometheus. Let me know if I can help at all. [21:08:36] Operations, Gerrit, Release-Engineering-Team (Watching / External): Add prometheus exporter to Gerrit - https://phabricator.wikimedia.org/T184086 (Paladox) @crusnov we could use your help, yup. We need to create a prometheusBearerToken [plugin.javamelody.prometheusBearerToken] https://gerrit.googleso... 
[21:15:59] !log swift codfw-prod: push up-to-date rings, mistakenly pushed earlier an older version
[21:16:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:23:33] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (Dzahn) Open→Stalled
[21:24:31] !log experimenting with different disk scheduler on ms-be2014 -- cdanis@ms-be2014.codfw.wmnet ~ % for D in /sys/block/sd*/queue/scheduler ; echo cfq | sudo tee $D
[21:24:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:25:41] (CR) Dzahn: "per "I'm not sure this is still valid when the ongoing work is completed to allow wikitech user registration to be opened up again." ... a" [puppet] - https://gerrit.wikimedia.org/r/498429 (owner: Dzahn)
[21:30:59] (PS1) CDanis: swift: mid-line comments are not a thing, apparently. [puppet] - https://gerrit.wikimedia.org/r/508419
[21:31:33] (CR) jerkins-bot: [V: -1] swift: mid-line comments are not a thing, apparently. [puppet] - https://gerrit.wikimedia.org/r/508419 (owner: CDanis)
[21:31:46] (PS2) CDanis: swift: mid-line comments are not a thing, apparently. [puppet] - https://gerrit.wikimedia.org/r/508419
[21:32:25] (CR) CDanis: [C: +2] swift: mid-line comments are not a thing, apparently.
[puppet] - https://gerrit.wikimedia.org/r/508419 (owner: CDanis)
[21:44:49] (CR) Volans: [C: +1] "LGTM" [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066 (owner: CRusnov)
[21:59:02] !log LDAP - remove 'sukhe' from 'nda' and add to 'wmf' instead (T221990)
[21:59:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[21:59:08] T221990: LDAP access to the nda group for sukhe - https://phabricator.wikimedia.org/T221990
[22:06:37] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (Dzahn) Hi @RStallman-legalteam Do we still have an NDA on file for Adam Wight or does it need a new one now that he is WMDE employee?
[22:10:50] Operations, Patch-For-Review, cloud-services-team (Kanban): Track remaining trusty servers in production - https://phabricator.wikimedia.org/T212772 (chasemp)
[22:12:07] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (RStallman-legalteam) Thanks, I am running this by our contracts attorney. I think we'll just do a quick amendment to the existing NDA to reflect t...
[22:17:28] Operations, netops: cr4-ulsfo rebooted unexpectedly - https://phabricator.wikimedia.org/T221156 (ayounsi) > After checking the core the engineering team has an update on what happened > “The thread that is holding the lock seem to have corrupted stack and is holding the lock for a very long time. Other t...
[22:18:55] Operations, Traffic, Wikimedia-General-or-Unknown, User-DannyS712, Wikimedia-Incident: 503 errors for several Wikipedia pages - https://phabricator.wikimedia.org/T222418 (Dzahn) p:Triage→High
[22:19:14] Operations, LDAP-Access-Requests, WMF-Legal, WMF-NDA-Requests: Request to be added to the ldap/wmde group - https://phabricator.wikimedia.org/T222538 (Dzahn) p:Triage→Normal
[22:22:55] Operations, ops-codfw, media-storage, observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (Dzahn) p:Triage→Normal
[22:32:34] Operations, ops-codfw, media-storage, observability: ms-be2043 'sdd' throwing lots of errors - https://phabricator.wikimedia.org/T222654 (Dzahn) other tickets where ms-be disks died with "blk_update _request: I/O error" or similar. T184053 , T183896, T218544, T136395, T163690, T166021 Afaict, w...
[22:37:47] Operations, Beta-Cluster-Infrastructure, Patch-For-Review: Make www-data the web-serving user (is currently apache) - https://phabricator.wikimedia.org/T78076 (Dzahn) The ability to run commands as the 'apache' user has been removed from prod admins module sudo privileges today.
[22:38:19] (CR) CRusnov: [V: +2 C: +2] ganeti-netbox sync: Sync host status also [software/netbox-deploy] - https://gerrit.wikimedia.org/r/508066 (owner: CRusnov)
[22:42:51] !log crusnov@deploy1001 Started deploy [netbox/deploy@0061190]: Deploy new version of ganeti-netbox sync.
[22:42:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:05] !log Running refreshMessageBlobs.php on all wikis for T222539
[22:43:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:43:09] T222539: Special:RecentChanges on no.wp ignores localized MediaWiki:Rcfilters-show-new-changes message - https://phabricator.wikimedia.org/T222539
[22:46:45] !log crusnov@deploy1001 Finished deploy [netbox/deploy@0061190]: Deploy new version of ganeti-netbox sync. (duration: 03m 53s)
[22:46:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[22:47:37] PROBLEM - Check systemd state on netmon2001 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[23:00:05] MaxSem, RoanKattouw, and Niharika: #bothumor I ♥ Unicode. All rise for Evening SWAT (Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190506T2300).
[23:00:05] MaxSem: A patch you scheduled for Evening SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker.
[23:00:24] I'm here
[23:00:42] Since there's only one patch, I'll deploy it in 30 mins
[23:01:20] (CR) Dzahn: [C: +2] Fix typo in comments [software/service-checker] - https://gerrit.wikimedia.org/r/495237 (owner: Alexandros Kosiaris)
[23:03:06] (PS2) MaxSem: LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780)
[23:03:29] (PS1) CRusnov: puppetdb report: Exclude OFFLINE VMs from report [software/netbox-reports] - https://gerrit.wikimedia.org/r/508456
[23:16:27] Operations, ops-codfw, DBA, Goal, Patch-For-Review: rack/setup/install db2[103-120].codfw.wmnet (18 hosts) - https://phabricator.wikimedia.org/T221532 (Papaul)
[23:20:40] (CR) MaxSem: [C: +2] LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780) (owner: MaxSem)
[23:21:42] (Merged) jenkins-bot: LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780) (owner: MaxSem)
[23:21:56] (CR) jenkins-bot: LoginNotify: remove setting that was moved to the extension itself [mediawiki-config] - https://gerrit.wikimedia.org/r/503546 (https://phabricator.wikimedia.org/T220780) (owner: MaxSem)
[23:25:57] !log maxsem@deploy1001 Synchronized wmf-config/CommonSettings.php: https://gerrit.wikimedia.org/r/#/c/operations/mediawiki-config/+/503546/ (duration: 00m 50s)
[23:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[23:27:22] (PS7) Paladox: Gerrit: Redirect /p/(.+)/info/(.+) to /$1/info/$2 [puppet] - https://gerrit.wikimedia.org/r/507787 (https://phabricator.wikimedia.org/T218844)
[23:33:44] Operations, Gerrit, cloud-services-team, serviceops: Change /r/p/ to /r/ on all hosts (where
https://gerrit.wikimedia.org/r/p/ exists) - https://phabricator.wikimedia.org/T222093 (Dzahn)
[23:36:50] (PS1) Papaul: DHCP: Add MAC address entries for db2[103-120] [puppet] - https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532)
[23:37:32] (CR) jerkins-bot: [V: -1] DHCP: Add MAC address entries for db2[103-120] [puppet] - https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532) (owner: Papaul)
[23:41:06] (PS19) CDanis: Add a WMF-specific tool for managing db config in MediaWiki [software/conftool] - https://gerrit.wikimedia.org/r/441396 (https://phabricator.wikimedia.org/T197126) (owner: Giuseppe Lavagetto)
[23:52:08] (PS2) Papaul: DHCP: Add MAC address entries for db2[103-120] [puppet] - https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532)
[23:55:26] (PS3) Dzahn: DHCP: Add MAC address entries for db2103 through db2120 [puppet] - https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532) (owner: Papaul)
[23:55:40] (CR) Dzahn: [C: +2] DHCP: Add MAC address entries for db2103 through db2120 [puppet] - https://gerrit.wikimedia.org/r/508472 (https://phabricator.wikimedia.org/T221532) (owner: Papaul)
[23:59:11] MaxSem: All clear?
[23:59:22] Yup!