[00:36:59] Operations, Domains, Traffic, WMF-Legal, Patch-For-Review: Move wikimedia.ee under WM-EE - https://phabricator.wikimedia.org/T204056 (tramm) Stalled→Open Please change wikimedia.ee DNS record to refer to 185.7.252.114 (test page: http://wikimedia.ee.klient.veebimajutus.ee/).
[01:29:29] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:29:39] PROBLEM - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is CRITICAL: cluster=cache_text site=esams https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:29:41] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:29:59] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:30:45] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:30:47] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[01:32:25] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is
OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:33:35] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:33:47] RECOVERY - HTTP availability for Nginx -SSL terminators- at esams on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:36:13] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5
[01:36:15] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[01:36:49] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5
[01:39:17] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:43:23] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds.
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[01:55:54] (CR) Zoranzoki21: [C: +1] "Looks good now" [mediawiki-config] - https://gerrit.wikimedia.org/r/509739 (https://phabricator.wikimedia.org/T220752) (owner: Vladis13)
[01:56:48] (CR) jerkins-bot: [V: -1] Enable webfonts for ru,uk,be of wiki,wikisource, and for sourceswiki [mediawiki-config] - https://gerrit.wikimedia.org/r/509739 (https://phabricator.wikimedia.org/T220752) (owner: Vladis13)
[02:47:04] Operations, ops-codfw, Discovery-Search (Current work): elastic2038 DOWN (CPU/memory errors ) - https://phabricator.wikimedia.org/T217398 (Papaul) a:Gehel→Papaul
[02:53:53] Operations, ops-codfw, Discovery-Search (Current work): elastic2038 DOWN (CPU/memory errors ) - https://phabricator.wikimedia.org/T217398 (Papaul) It looks like the error is showing now on DIMM B2, so we have a bad DIMM. I will go ahead and request a replacement. Description Date and Time Correctab...
[03:56:39] PROBLEM - Check systemd state on ms-be1033 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed.
[04:19:31] (CR) Giuseppe Lavagetto: "> joe: can we do this now?"
[puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[04:21:21] RECOVERY - Check systemd state on ms-be1033 is OK: OK - running: The system is fully operational
[04:27:14] (CR) Giuseppe Lavagetto: [C: -2] "While the change is technically correct, and this first change would only add and not change URLs, I'm not convinced such an important cha" [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[04:56:10] (PS1) Marostegui: db1130,db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/509748 (https://phabricator.wikimedia.org/T222682)
[04:57:25] (CR) Marostegui: [C: +2] db1130,db1138: Enable notifications [puppet] - https://gerrit.wikimedia.org/r/509748 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:03:37] (PS1) Marostegui: db-eqiad,db-codfw.php: Pool db1130, db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509749 (https://phabricator.wikimedia.org/T222682)
[05:13:54] (PS1) Marostegui: mariadb: db2106,db2110,db2119 into s4 [puppet] - https://gerrit.wikimedia.org/r/509750 (https://phabricator.wikimedia.org/T222772)
[05:14:44] (CR) Marostegui: [C: +2] db-eqiad,db-codfw.php: Pool db1130, db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509749 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:15:48] (Merged) jenkins-bot: db-eqiad,db-codfw.php: Pool db1130, db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509749 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:16:08] (CR) jenkins-bot: db-eqiad,db-codfw.php: Pool db1130, db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509749 (https://phabricator.wikimedia.org/T222682) (owner: Marostegui)
[05:17:26] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Pool db1130 into s5 and db1138 into s4 T222682 (duration: 00m 51s)
[05:17:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:17:32] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[05:18:26] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Pool db1130 into s5 and db1138 into s4 T222682 (duration: 00m 49s)
[05:18:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:20:17] (CR) Marostegui: [C: +2] mariadb: db2106,db2110,db2119 into s4 [puppet] - https://gerrit.wikimedia.org/r/509750 (https://phabricator.wikimedia.org/T222772) (owner: Marostegui)
[05:27:05] (PS2) ArielGlenn: enable use of lbzip2 for revision history dumps for all big wikis [puppet] - https://gerrit.wikimedia.org/r/505441
[05:30:52] (CR) ArielGlenn: [C: +2] enable use of lbzip2 for revision history dumps for all big wikis [puppet] - https://gerrit.wikimedia.org/r/505441 (owner: ArielGlenn)
[05:41:07] !log Optimize tables on pc2007
[05:41:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:45:18] (PS1) Marostegui: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509752
[05:48:07] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509752 (owner: Marostegui)
[05:48:49] (PS20) ArielGlenn: dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400 (https://phabricator.wikimedia.org/T116986)
[05:49:09] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509752 (owner: Marostegui)
[05:49:26] (CR) jenkins-bot: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509752 (owner: Marostegui)
[05:50:12] (CR) ArielGlenn: [C: +2] dump url shorteners for wiki projects [puppet] - https://gerrit.wikimedia.org/r/278400
(https://phabricator.wikimedia.org/T116986) (owner: ArielGlenn)
[05:50:35] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic for db1130 (s5) and db1138 (s4) T222682 (duration: 00m 49s)
[05:50:39] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[05:50:39] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[06:00:35] (PS1) Elukey: librenms: fix logrotate cronspam [puppet] - https://gerrit.wikimedia.org/r/509753
[06:00:38] (CR) Giuseppe Lavagetto: [C: -1] "I meant to add a -1 not a -2 meh." [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[06:01:21] (CR) Elukey: [C: +2] librenms: fix logrotate cronspam [puppet] - https://gerrit.wikimedia.org/r/509753 (owner: Elukey)
[06:09:00] !log Compress s2, s6 and s7 on labsdb1012 - T222978
[06:09:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:09:05] T222978: Compress and defragment tables on labsdb hosts - https://phabricator.wikimedia.org/T222978
[06:11:48] (CR) Giuseppe Lavagetto: [C: +1] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - https://gerrit.wikimedia.org/r/509426 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[06:14:58] (CR) Giuseppe Lavagetto: [C: +1] "LGTM" [puppet] - https://gerrit.wikimedia.org/r/508854 (owner: Muehlenhoff)
[06:18:15] (CR) Giuseppe Lavagetto: [C: +1] "FWIW, I find having non-essential config dangling around is extremely confusing."
[deployment-charts] - https://gerrit.wikimedia.org/r/488800 (owner: Alexandros Kosiaris)
[06:19:18] (PS1) Marostegui: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509754
[06:29:21] (PS2) Elukey: ores::worker: allow celery to emit a core dump upon segfault [puppet] - https://gerrit.wikimedia.org/r/509060 (https://phabricator.wikimedia.org/T222866)
[06:30:16] PROBLEM - puppet last run on mw2258 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:32:20] PROBLEM - puppet last run on cp2024 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle.
[06:36:23] (CR) Elukey: [C: +2] ores::worker: allow celery to emit a core dump upon segfault [puppet] - https://gerrit.wikimedia.org/r/509060 (https://phabricator.wikimedia.org/T222866) (owner: Elukey)
[06:38:32] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509754 (owner: Marostegui)
[06:39:41] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509754 (owner: Marostegui)
[06:39:53] information box for files is not shown on other projects
[06:39:55] (CR) jenkins-bot: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509754 (owner: Marostegui)
[06:40:01] was this already reported?
[06:40:14] someone asked in -tech
[06:40:28] https://en.wikisource.org/wiki/File:The_cutters%27_practical_guide_to_the_cutting_of_ladies%27_garments.djvu
[06:40:28] Why is it not showing the FULL contents of the Commons page as it did previously?
[06:42:52] I have noticed that too on enwiki
[06:44:04] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic for db1130 (s5) and db1138 (s4) T222682 (duration: 00m 49s)
[06:44:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:44:15] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[06:45:01] looks like https://phabricator.wikimedia.org/source/mediawiki/browse/master/includes/page/ImagePage.php$695 controls all that
[06:45:15] https://en.wikipedia.org/wiki/File:Le_Jugement_de_P%C3%A2ris,_par_Paul_C%C3%A9zanne.jpg
[06:45:20] doesn't look like anyone touched it recently though, might be buried somewhere under that getDescriptionText call
[06:46:18] doesn't appear to be a language thing as I tried it with uselang=en on both sites
[06:48:16] idem on fr.wikisource https://fr.wikisource.org/wiki/Fichier:Flaubert_-_Salammb%C3%B4.djvu
[06:48:31] https://gerrit.wikimedia.org/r/#/c/mediawiki/core/+/504012/ did get merged a week ago and touched ForeignDBFile
[06:50:53] (PS9) Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725)
[06:53:46] !log installing ghostscript security updates
[06:53:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[06:55:10] yannf, yeah I think this is the cause
[06:55:17] reverting that change on beta seems to resolve the issue?
[06:56:03] ok, good
[06:57:06] RECOVERY - puppet last run on mw2258 is OK: OK: Puppet is currently enabled, last run 59 seconds ago with 0 failures
[06:57:30] actually now I'm not so sure
[06:58:51] (PS1) Marostegui: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509756
[06:59:12] RECOVERY - puppet last run on cp2024 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures
[07:01:44] Operations, Analytics, Analytics-EventLogging, Analytics-Kanban, and 2 others: Upgrade python-kafka - https://phabricator.wikimedia.org/T221848 (elukey) Update: we were running 1.4.1.-1~stretch1, I have rolled back eventlogging to it and all instabilities went away. 1.4.3 seems a broken version f...
[07:03:35] (CR) Marostegui: [C: +2] db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509756 (owner: Marostegui)
[07:04:35] (Merged) jenkins-bot: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509756 (owner: Marostegui)
[07:05:51] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: More traffic for db1130 (s5) and db1138 (s4) T222682 (duration: 00m 50s)
[07:05:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:05:56] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[07:06:21] (CR) jenkins-bot: db-eqiad.php: More traffic to db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509756 (owner: Marostegui)
[07:08:24] !log slow roll restart of celery on ores* nodes to allow cores to be generated upon segfault - T222866
[07:08:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:08:29] T222866: Ores hosts: mwparserfromhell tokenizer random segfault - https://phabricator.wikimedia.org/T222866
[07:10:36] (CR) Fomafix: "Currently https://sr.wikipedia.org/sr-latn/Главна_страна
redirects to https://sr.wikipedia.org/wiki/Главна_страна . Other strings with the" [puppet] - https://gerrit.wikimedia.org/r/368248 (https://phabricator.wikimedia.org/T117845) (owner: Fomafix)
[07:17:02] (CR) Vgutierrez: [C: +2] Add SPF record to toolserver.org [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T220786) (owner: Mschon)
[07:17:10] (PS7) Vgutierrez: Add SPF record to toolserver.org [dns] - https://gerrit.wikimedia.org/r/283870 (https://phabricator.wikimedia.org/T220786) (owner: Mschon)
[07:17:36] (PS1) Marostegui: db-eqiad.php: Fully pool db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509757
[07:19:21] (CR) Vgutierrez: [C: +2] Add SPF record for wikiba.se [dns] - https://gerrit.wikimedia.org/r/504241 (https://phabricator.wikimedia.org/T220786) (owner: Vgutierrez)
[07:19:32] (PS2) Vgutierrez: Add SPF record for wikiba.se [dns] - https://gerrit.wikimedia.org/r/504241 (https://phabricator.wikimedia.org/T220786)
[07:21:28] (CR) Vgutierrez: [C: -2] "blocked by T204056" [dns] - https://gerrit.wikimedia.org/r/504242 (https://phabricator.wikimedia.org/T220786) (owner: Vgutierrez)
[07:22:35] Operations, DNS, Traffic, Patch-For-Review: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (Vgutierrez)
[07:22:44] Operations, Cloud-VPS, DNS, Mail, and 3 others: Set SPF (...
-all) for toolserver.org - https://phabricator.wikimedia.org/T131930 (Vgutierrez) Open→Resolved a:herron→Vgutierrez
[07:23:36] (CR) Vgutierrez: [C: +2] Add SPF record for wmftest.org [dns] - https://gerrit.wikimedia.org/r/504244 (https://phabricator.wikimedia.org/T220786) (owner: Vgutierrez)
[07:23:44] (PS2) Vgutierrez: Add SPF record for wmftest.org [dns] - https://gerrit.wikimedia.org/r/504244 (https://phabricator.wikimedia.org/T220786)
[07:24:08] (CR) Marostegui: [C: +2] db-eqiad.php: Fully pool db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509757 (owner: Marostegui)
[07:25:47] https://phabricator.wikimedia.org/T222935#5175464
[07:25:56] (Merged) jenkins-bot: db-eqiad.php: Fully pool db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509757 (owner: Marostegui)
[07:26:06] Operations, DNS, Traffic, Patch-For-Review: Add SPF record for non-canonical domains that are not parked - https://phabricator.wikimedia.org/T220786 (Vgutierrez) Open→Stalled
[07:26:15] (CR) jenkins-bot: db-eqiad.php: Fully pool db1130,db1138 [mediawiki-config] - https://gerrit.wikimedia.org/r/509757 (owner: Marostegui)
[07:27:03] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Fully pool db1130 (s5) and db1138 (s4) T222682 (duration: 00m 50s)
[07:27:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[07:27:07] T222682: Productionize db11[26-38] - https://phabricator.wikimedia.org/T222682
[07:35:24] PROBLEM - puppet last run on mw2182 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures.
Failed resources (up to 3 shown): Package[tzdata]
[07:42:58] (CR) Marostegui: [C: -1] mariadb: set some more Icinga notes URLs for nrpe checks (3 comments) [puppet] - https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: Dzahn)
[07:45:46] (CR) Mathew.onipe: [C: -1] Add postgres slave init cookbook (1 comment) [cookbooks] - https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) (owner: Mathew.onipe)
[07:50:02] Operations, ops-eqiad, media-storage: ms-be1015 - sdb1 failed - https://phabricator.wikimedia.org/T222991 (fgiunchedi)
[07:50:06] Operations, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi)
[07:50:33] Operations, ops-eqiad, media-storage: ms-be1015 - sdb1 failed - https://phabricator.wikimedia.org/T222991 (fgiunchedi) Thanks Daniel! We're decommissioning these hosts and are not in service anymore, resolving in favor of T220590
[07:58:30] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff)
[07:59:11] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff)
[08:01:42] (CR) Hashar: "That is T184435 "Puppet tox: properly lint both Py2 and Py3 files"" [puppet] - https://gerrit.wikimedia.org/r/509444 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[08:02:14] RECOVERY - puppet last run on mw2182 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures
[08:06:41] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff)
[08:08:38] Operations, Operations-Software-Development, Patch-For-Review: Puppet tox: properly lint both Py2 and
Py3 files - https://phabricator.wikimedia.org/T184435 (hashar) T144169 is/was to detect extension less files that might be python, though that never has been worked on, potentially it could lead to a...
[08:09:46] Operations, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi) Hosts are out of swift rings now, ms-be1013 is still off the network and I'll take care of it before hand over. Some filesystems report "input/output" error...
[08:10:08] RECOVERY - Disk space on ms-be1015 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[08:13:13] (CR) Hashar: "Probably. My only concern is that we do not have any data about client/servers ciphers usage, so that is really a blind shot :-(" [puppet] - https://gerrit.wikimedia.org/r/505410 (https://phabricator.wikimedia.org/T221499) (owner: Alex Monk)
[08:21:11] (CR) Hashar: [C: +1] Gerrit: Disable DNS reverse lookup [puppet] - https://gerrit.wikimedia.org/r/508127 (owner: Paladox)
[08:22:27] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=928927
[08:22:39] Operations, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi) I found some bug reports searching for "reservation ran out. Need to up reservation", e.g. https://bugzilla.redhat.com/show_bug.cgi?id=1092853 and it looks l...
[08:23:10] RECOVERY - puppet last run on ms-be1015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures
[08:23:17] Operations, Patch-For-Review: uwsgi's logsocket_plugin.so causes segfaults during log rotation - https://phabricator.wikimedia.org/T212697 (elukey) p:High→Normal
[08:26:18] RECOVERY - Disk space on ms-be1014 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[08:32:51] Operations, cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (MoritzMuehlenhoff)
[08:32:56] PROBLEM - MediaWiki memcached error rate on graphite1004 is CRITICAL: CRITICAL: 40.00% of data above the critical threshold [5000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:34:50] this is due to mcrouter
[08:34:50] https://grafana.wikimedia.org/d/000000549/mcrouter?panelId=9&fullscreen&orgId=1&var-source=eqiad%20prometheus%2Fops&var-cluster=api_appserver&var-instance=All
[08:35:36] that is the usual (sigh) mc1029.eqiad.wmnet problem (tx bandwidth saturation)
[08:35:40] RECOVERY - MediaWiki memcached error rate on graphite1004 is OK: OK: Less than 40.00% above the threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=1&fullscreen
[08:37:36] RECOVERY - puppet last run on ms-be1014 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures
[08:39:53] (CR) Elukey: [C: +1] httpd::mod_conf: Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/508854 (owner: Muehlenhoff)
[08:40:46] Operations, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi) Ditto on ms-be1014: ` /dev/sdl1 2.8T 29G 2.7T 2% /srv/swift-storage/sdl1 /dev/sdj1 2.8T 34G 2.7T 2% /srv/swift-storage/sdj1 /dev/sdc1...
[08:45:00] (CR) Filippo Giunchedi: [C: +2] prometheus: remove v2 feature flag [puppet] - https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987) (owner: Filippo Giunchedi)
[08:45:07] (PS3) Filippo Giunchedi: prometheus: remove v2 feature flag [puppet] - https://gerrit.wikimedia.org/r/509052 (https://phabricator.wikimedia.org/T187987)
[08:45:23] (PS1) Mathew.onipe: postgresql: relocate .pgpass file [puppet] - https://gerrit.wikimedia.org/r/509766 (https://phabricator.wikimedia.org/T220946)
[08:52:11] Operations, observability, Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (fgiunchedi)
[08:53:25] Operations, observability, Goal, User-fgiunchedi: TEC6: Metrics monitoring infrastructure (Q4 2018/19 goal) - https://phabricator.wikimedia.org/T220104 (fgiunchedi)
[08:53:31] Operations, observability, Patch-For-Review: 100% of Prometheus traffic served by Prometheus v2 - https://phabricator.wikimedia.org/T187987 (fgiunchedi) Open→Resolved a:fgiunchedi Completed!
All production and wmcs Prometheus fleet migrated to Prometheus 2
[09:01:12] (CR) Muehlenhoff: [C: +1] rsync: add a bwlimit option for quickdatacopy (1 comment) [puppet] - https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: Bstorm)
[09:01:24] (PS2) Muehlenhoff: httpd::mod_conf: Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/508854
[09:03:23] (CR) Muehlenhoff: [C: +2] httpd::mod_conf: Remove support for Ubuntu [puppet] - https://gerrit.wikimedia.org/r/508854 (owner: Muehlenhoff)
[09:09:21] (PS36) Vgutierrez: trafficserver: Provide a TLS terminator profile and backend+TLS role [puppet] - https://gerrit.wikimedia.org/r/506398 (https://phabricator.wikimedia.org/T221594)
[09:09:23] (PS1) Vgutierrez: trafficserver: Ensure that server's cipher suites preference is being honored [puppet] - https://gerrit.wikimedia.org/r/509771 (https://phabricator.wikimedia.org/T221594)
[09:24:22] (CR) Arturo Borrero Gonzalez: [C: +1] Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - https://gerrit.wikimedia.org/r/508668 (owner: Vgutierrez)
[09:27:20] (CR) Arturo Borrero Gonzalez: [C: -1] lvs: Toggle VLAN legacy naming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: Vgutierrez)
[09:28:01] (CR) Arturo Borrero Gonzalez: [C: -1] openstack: Disable legacy vlan naming for cloudvirt1024 (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: Vgutierrez)
[09:33:27] !log Upgrading Zuul 2.5.1-wmf7 -> 2.5.1-wmf9 T105474
[09:33:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:33:33] T105474: 'recheck' on a CR+2 patch should trigger gate-and-submit, not test - https://phabricator.wikimedia.org/T105474
[09:35:49] (PS1) Marostegui: db-eqiad.php: Depool db1064 [mediawiki-config] -
https://gerrit.wikimedia.org/r/509778 (https://phabricator.wikimedia.org/T217396)
[09:36:00] (CR) Vgutierrez: "oh feel free to adjust this CR to your needs, this was just an example using the legacy naming to validate it via PCC and show Andrew how " [puppet] - https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: Vgutierrez)
[09:37:54] (CR) Marostegui: [C: +2] db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/509778 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[09:38:59] (Merged) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/509778 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[09:39:13] (CR) jenkins-bot: db-eqiad.php: Depool db1064 [mediawiki-config] - https://gerrit.wikimedia.org/r/509778 (https://phabricator.wikimedia.org/T217396) (owner: Marostegui)
[09:40:07] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Depool db1064 T217396 (duration: 00m 49s)
[09:40:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[09:40:11] T217396: Decommission db1061-db1073 - https://phabricator.wikimedia.org/T217396
[09:40:47] (CR) Vgutierrez: lvs: Toggle VLAN legacy naming (1 comment) [puppet] - https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: Vgutierrez)
[09:44:54] (PS1) Muehlenhoff: Remove LDAP access for jmatazzoni [puppet] - https://gerrit.wikimedia.org/r/509780
[09:46:34] (CR) Muehlenhoff: [C: +2] Remove LDAP access for jmatazzoni [puppet] - https://gerrit.wikimedia.org/r/509780 (owner: Muehlenhoff)
[09:48:34] so
[09:48:48] MediaWiki core http requests are terribly broken due to a faulty patch
[09:48:56] and I don't even know how mediawiki still runs on production :/
[09:59:48] PROBLEM - Disk space on ms-be1015 is CRITICAL: DISK CRITICAL - /srv/swift-storage/sdb1 is not accessible:
Input/output error https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[10:01:10] RECOVERY - Disk space on ms-be1015 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space
[10:05:12] (CR) Arturo Borrero Gonzalez: "removing -1 per conversation" [puppet] - https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: Vgutierrez)
[10:06:02] (CR) Volans: [C: -1] "I think we should spread the runs." (3 comments) [puppet] - https://gerrit.wikimedia.org/r/509445 (owner: CRusnov)
[10:07:49] Operations, media-storage, Patch-For-Review, User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (fgiunchedi) `ms-be1014` has finished swift decom, what's left is zero-bytes old quarantined files ` root@ms-be1014:~# find /srv/swift-storage/ -type f -ls 242077616...
[10:17:38] !log rebooting cloudvirt1024 - T209707
[10:17:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log
[10:17:43] T209707: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707
[10:18:22] (PS5) Alexandros Kosiaris: Revert "Revert "Revert "sshd_config: Increase MaxAuthTries""" [puppet] - https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333)
[10:19:05] (CR) Alexandros Kosiaris: [C: +2] Revert "Revert "Revert "sshd_config: Increase MaxAuthTries""" [puppet] - https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: Alexandros Kosiaris)
[10:20:20] (CR) Filippo Giunchedi: [C: +1] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - https://gerrit.wikimedia.org/r/509426 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[10:22:44] PROBLEM - HHVM rendering on mw1280 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Application_servers
[10:22:58] (CR) Filippo Giunchedi: [C: +1] flake8:
puppetmaster - Add python extension so scripts can be picked up via CI [puppet] - https://gerrit.wikimedia.org/r/509484 (https://phabricator.wikimedia.org/T144169) (owner: Jbond)
[10:23:37] (CR) Vgutierrez: "Manually tested on cloudvirt1024 successfully: https://phabricator.wikimedia.org/P8519" [puppet] - https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: Vgutierrez)
[10:23:58] RECOVERY - HHVM rendering on mw1280 is OK: HTTP OK: HTTP/1.1 200 OK - 82299 bytes in 0.159 second response time https://wikitech.wikimedia.org/wiki/Application_servers
[10:24:01] (PS1) Jdrewniak: Bumping portals to master [mediawiki-config] - https://gerrit.wikimedia.org/r/509785 (https://phabricator.wikimedia.org/T128546)
[10:24:26] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_text site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:25:00] PROBLEM - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is CRITICAL: cluster=cache_text site=codfw https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:25:14] PROBLEM - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is CRITICAL: cluster=cache_text site=eqiad https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1
[10:25:22] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5
[10:25:34] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0]
https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:25:40] (03CR) 10Alexandros Kosiaris: [C: 03+2] "Merged on 13:20. Should make it the all of the fleet before today's SWAT starts. Should be reverted if SWAT fails." [puppet] - 10https://gerrit.wikimedia.org/r/377269 (https://phabricator.wikimedia.org/T172333) (owner: 10Alexandros Kosiaris) [10:27:08] (03PS4) 10Arturo Borrero Gonzalez: openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [10:27:44] RECOVERY - HTTP availability for Nginx -SSL terminators- at codfw on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:27:49] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "LGTM after tests in our affected server (cloudvirt1024)" [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [10:27:54] jouncebot: now [10:27:54] No deployments scheduled for the next 0 hour(s) and 2 minute(s) [10:27:59] jouncebot: next [10:27:59] In 0 hour(s) and 2 minute(s): Wikimedia Portals Update (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T1030) [10:28:06] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:28:13] I'm stealing the deploy conch for an UBN. 
[10:28:31] FYI jan_drewniak [10:28:40] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:29:22] RECOVERY - HTTP availability for Nginx -SSL terminators- at eqiad on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:29:26] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:29:39] James_F: np [10:30:04] jan_drewniak: #bothumor My software never has bugs. It just develops random features. Rise for Wikimedia Portals Update. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T1030). 
[10:30:18] (03PS6) 10Vgutierrez: Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 [10:31:08] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:31:24] (03CR) 10Filippo Giunchedi: [C: 04-1] "See inline, we should add a job to existing 'ops' prometheus instance" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [10:32:16] RECOVERY - Eqiad HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [10:33:32] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:33:36] RECOVERY - Esams HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [10:33:50] RECOVERY - Text HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [10:34:06] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:35:28] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:36:18] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:37:46] (03CR) 10Vgutierrez: [C: 03+2] Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [10:38:56] 10Operations, 10netops: cr2-esams: BGP flapping for AS 61955 (ipv4 and ipv6) - https://phabricator.wikimedia.org/T222424 (10faidon) [10:38:57] !log update puppet5 and facter3 in eqiad [10:38:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:18] James_F: let me know when you're done. 
[10:39:18] (03PS1) 10Ema: ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) [10:39:25] 10Operations, 10Core Platform Team (PHP7 (TEC4)), 10Core Platform Team Kanban (Doing), 10HHVM, and 3 others: Migrate to PHP 7 in WMF production - https://phabricator.wikimedia.org/T176370 (10Krinkle) [10:39:53] !log jforrester@deploy1001 Synchronized php-1.34.0-wmf.4/includes/http/HttpRequestFactory.php: T222935 Hot-deploy fix for HttpRequestFactory (duration: 00m 50s) [10:39:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:39:58] T222935: HttpRequestFactory get method always return null (was: All local file description pages pointing to Commons do not display description locally from 1.34.0-wmf.4) - https://phabricator.wikimedia.org/T222935 [10:40:28] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:40:58] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:41:16] (03CR) 10Vgutierrez: [C: 03+1] openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [10:42:18] PROBLEM - debmonitor.wikimedia.org on debmonitor1001 is CRITICAL: CRITICAL - Socket timeout after 10 seconds https://wikitech.wikimedia.org/wiki/Debmonitor [10:43:32] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:44:54] PROBLEM - puppet last run on mw1225 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 4 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet] [10:44:56] RECOVERY - debmonitor.wikimedia.org on debmonitor1001 is OK: HTTP OK: Status line output matched HTTP/1.1 301 - 274 bytes in 0.002 second response time https://wikitech.wikimedia.org/wiki/Debmonitor [10:45:12] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:46:34] PROBLEM - puppet last run on mendelevium is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:48:38] ^^ i have checked the two puppet failures above and they were transient [10:49:14] 10Operations, 10media-storage, 10Patch-For-Review, 10User-fgiunchedi: Decom ms-be101[345] - https://phabricator.wikimedia.org/T220590 (10fgiunchedi) Ditto for `ms-be1015` A bunch of old/zero byte files and a container database in `tmp` that has been replicated but left behind afaics ` root@ms-be1015:~# f... 
[10:50:12] * volans checking debmonitor [10:50:14] RECOVERY - puppet last run on mw1225 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:50:26] PROBLEM - puppet last run on wtp1036 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 7 minutes ago with 1 failures. Failed resources (up to 3 shown): Package[puppet] [10:51:56] RECOVERY - puppet last run on mendelevium is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:51:57] (03CR) 10Vgutierrez: [C: 03+2] Revert "lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit" [puppet] - 10https://gerrit.wikimedia.org/r/508668 (owner: 10Vgutierrez) [10:54:14] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [10:54:46] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [10:55:03] ulsfo having problems? 
[10:55:46] RECOVERY - puppet last run on wtp1036 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [10:56:12] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [10:56:27] I don't see anything easily identifiable on cr3,4-ulsfo [10:56:45] (03PS4) 10Jbond: raid: update check_raid to detect missing disk [puppet] - 10https://gerrit.wikimedia.org/r/508855 (https://phabricator.wikimedia.org/T218544) [10:57:07] ah, no scratch that [10:57:16] PROBLEM - Upload HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 22.22% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [10:57:28] (External AS 38861): received unexpected EOF [10:58:17] (03PS2) 10Jbond: role::spare::system: replace standard with profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/507066 (owner: 10Alex Monk) [10:59:44] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:00:04] MaxSem, RoanKattouw, and Niharika: Your horoscope predicts another unfortunate European Mid-day SWAT(Max 6 patches) deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T1100). [11:00:04] akosiaris and Pchelolo: A patch you scheduled for European Mid-day SWAT(Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. 
[11:00:06] (03CR) 10Jbond: [C: 03+2] role::spare::system: replace standard with profile::standard [puppet] - 10https://gerrit.wikimedia.org/r/507066 (owner: 10Alex Monk) [11:00:16] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:00:37] here. is SWAT happening? [11:01:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:02:28] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:02:39] so, one of our BGP peerings in the INX in palo alto is indeed down [11:02:53] I'm on the upload@ulsfo alerts [11:04:04] !log cp-ats rolling restart to apply https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/509456/ [11:04:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:04:26] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [11:04:39] Pchelolo: I don't see why not. Although I am not sure any of the deployers are around [11:05:13] akosiaris: ye, that's my question as well. 
I'm very new to EU mid-day swat [11:05:30] RECOVERY - Upload HTTP 5xx reqs/min on graphite1004 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=upload&var-status_type=5 [11:05:32] It kinda looks like someone has selected the wrong people [11:05:36] (03CR) 10Jbond: lvs: Toggle VLAN legacy naming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [11:05:39] Why would normally PST based people be up at 4am to do a SWAT? [11:06:36] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:07:08] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:09:35] Pchelolo: Is it testable? Or just want it pushing out? [11:09:44] Reedy: ye I can test it [11:09:52] waiting on jerkins then [11:10:13] I'll need ~10 mins to do so when it's on debug [11:11:43] 10Operations, 10service-runner, 10serviceops, 10Services (later): Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (10Pchelolo) a:03holger.knust Thank you for an impressive level of details :) There's a bunch of other places whe... 
[11:12:51] (03PS5) 10Vgutierrez: lvs: Toggle VLAN legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) [11:15:41] (03CR) 10Jbond: [C: 03+1] "> Patch Set 2:" [puppet] - 10https://gerrit.wikimedia.org/r/506672 (https://phabricator.wikimedia.org/T220894) (owner: 10Alex Monk) [11:15:47] (03CR) 10Vgutierrez: lvs: Toggle VLAN legacy naming (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [11:22:43] (03CR) 10Jbond: [C: 03+2] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509426 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [11:22:51] (03PS2) 10Jbond: flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509426 (https://phabricator.wikimedia.org/T144169) [11:25:18] 18 minutes and counting [11:25:20] gj jerkins [11:26:20] hashar: ^ swat patches taking nearly 20 minutes, master patches being merged in 6 or so [11:27:38] PROBLEM - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is CRITICAL: cluster=cache_upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:28:06] PROBLEM - HTTP availability for Varnish at ulsfo on icinga1001 is CRITICAL: job=varnish-upload site=ulsfo https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:29:12] PROBLEM - puppet last run on labmon1002 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 3 minutes ago with 3 failures. Failed resources (up to 3 shown): File[/usr/local/bin/archive-instances],File[/usr/local/sbin/graphite-index],File[/usr/local/sbin/graphite-auth] [11:29:40] Pchelolo: 22 minutes to merge the patch. :/ [11:30:28] RECOVERY - HTTP availability for Varnish at ulsfo on icinga1001 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=3&fullscreen&refresh=1m&orgId=1 [11:30:52] yup Reedy, I've been watching it too.. [11:31:14] 9 minutes per line of diff [11:31:27] oh, s/9/11 [11:32:24] RECOVERY - HTTP availability for Nginx -SSL terminators- at ulsfo on icinga1001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/frontend-traffic?panelId=4&fullscreen&refresh=1m&orgId=1 [11:32:50] (03CR) 10Jbond: [C: 03+2] flake8: puppetmaster - Add python extension so scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509484 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [11:33:00] (03PS3) 10Jbond: flake8: puppetmaster - Add python extension so scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509484 (https://phabricator.wikimedia.org/T144169) [11:34:58] Pchelolo: It's on mwdebug1002 [11:35:12] ok, gimme 5, I'll test it [11:39:44] (03Abandoned) 10Jbond: puppet5/facter3: ensure puppet master infrastructre is not upgraded [puppet] - 10https://gerrit.wikimedia.org/r/509040 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:40:19] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: change default versions [puppet] - 10https://gerrit.wikimedia.org/r/507304 (owner: 10Jbond) [11:40:21] Reedy: perfect! all works as a charm [11:40:31] (03PS2) 10Jbond: facter3/puppet5: change default versions [puppet] - 10https://gerrit.wikimedia.org/r/507304 [11:41:45] yay [11:43:54] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/VisualEditor/: T222639 (duration: 00m 52s) [11:43:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:43:58] T222639: VisualEditor should request Parsoid HTML with ?stash=true query parameter - https://phabricator.wikimedia.org/T222639 [11:44:28] Reedy: Wanna push out https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/Wikibase/+/509801/ as well after? I can do it after if not. 
[11:44:47] Krinkle: Sure. Jenkins is being slow AF [11:46:35] (03CR) 10Alexandros Kosiaris: [C: 03+2] deployment-prep: Move to working Mathoid service [puppet] - 10https://gerrit.wikimedia.org/r/509595 (https://phabricator.wikimedia.org/T221654) (owner: 10Alex Monk) [11:46:42] (03PS2) 10Alexandros Kosiaris: deployment-prep: Move to working Mathoid service [puppet] - 10https://gerrit.wikimedia.org/r/509595 (https://phabricator.wikimedia.org/T221654) (owner: 10Alex Monk) [11:51:11] 10Operations, 10Continuous-Integration-Infrastructure, 10Patch-For-Review, 10Wikimedia-Incident, 10Zuul: Upload zuul_2.5.1-wmf9 to apt.wikimedia.org - https://phabricator.wikimedia.org/T222689 (10hashar) Upgraded and it seems to work fine. Thank you! [11:55:50] RECOVERY - puppet last run on labmon1002 is OK: OK: Puppet is currently enabled, last run 37 seconds ago with 0 failures [11:58:22] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.4/vendor/: T215746 (duration: 01m 30s) [11:58:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:58:27] T215746: Checkup on cssjanus PHP 7 compat - https://phabricator.wikimedia.org/T215746 [11:59:33] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.4/composer.json: T215746 (duration: 00m 49s) [11:59:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:00:24] (03CR) 10Muehlenhoff: [C: 03+1] facter3/puppet5: clean up old config [puppet] - 10https://gerrit.wikimedia.org/r/507305 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:01:00] (03PS1) 10Awight: Update ssh keys and email for awight [puppet] - 10https://gerrit.wikimedia.org/r/509804 [12:08:05] !log reedy@deploy1001 Synchronized php-1.34.0-wmf.4/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbeddingHandler.php: T223085 (duration: 00m 50s) [12:08:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:08:10] T223085: ErrorException from line 155 of 
/srv/mediawiki/php-1.34.0-wmf.4/extensions/Wikibase/lib/includes/Formatters/CachingKartographerEmbeddingHandler.php: PHP Notice: Undefined variable: rlModulesArr - https://phabricator.wikimedia.org/T223085 [12:11:07] Reedy: thx [12:11:10] np [12:13:47] rolling out an RL prep patch as well to avoid breakage in the next train [12:16:50] (03CR) 10Alexandros Kosiaris: [C: 04-1] "The commit message does not reflect the content of the patch. The patch also touches production" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [12:18:40] !log cdanis@ms-be2015.codfw.wmnet /var/log % sudo mount /srv/swift-storage/sda1 [12:18:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:19:01] (03PS2) 10Hashar: contint: bump git-daemon max connections 32 -> 48 [puppet] - 10https://gerrit.wikimedia.org/r/508408 (https://phabricator.wikimedia.org/T222661) [12:19:08] (03PS3) 10Hashar: contint: bump git-daemon max connections 32 -> 48 [puppet] - 10https://gerrit.wikimedia.org/r/508408 (https://phabricator.wikimedia.org/T222661) [12:19:15] (03CR) 10Hashar: "check experimental" [puppet] - 10https://gerrit.wikimedia.org/r/508408 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar) [12:22:52] (03CR) 10Urbanecm: [C: 03+1] "LGTM" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506892 (https://phabricator.wikimedia.org/T222024) (owner: 10DannyS712) [12:25:05] !log cdanis@ms-be2015.codfw.wmnet ~ % sudo umount /srv/swift-storage/sdf1 && sudo mount /srv/swift-storage/sdf1 [12:25:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:25:58] !log cdanis@ms-be2015.codfw.wmnet ~ % sudo umount /srv/swift-storage/sdl1 && sudo mount /srv/swift-storage/sdl1 [12:26:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:26:17] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/509804 (owner: 10Awight) 
[12:26:18] 10Operations, 10service-runner, 10serviceops, 10Services (later): Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (10akosiaris) >>! In T222795#5176181, @Pchelolo wrote: > Thank you for an impressive level of details :) There's a... [12:26:52] RECOVERY - Disk space on ms-be2015 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [12:28:33] (03PS3) 10Jbond: facter3/puppet5: change default versions [puppet] - 10https://gerrit.wikimedia.org/r/507304 [12:29:22] RECOVERY - puppet last run on ms-be2015 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [12:30:52] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: clean up old config [puppet] - 10https://gerrit.wikimedia.org/r/507305 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:32:25] * Krinkle staging on mwdebug1002 [12:33:21] jbond42: nice :) [12:33:23] !log root@ms-be2013.codfw.wmnet ~ # mount /srv/swift-storage/sdf1 [12:33:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:33:52] 10Operations, 10service-runner, 10serviceops, 10Services (later): Re-evaluate service-runner's (ab)use of statsd timing metric for nodejs GC stats - https://phabricator.wikimedia.org/T222795 (10Pchelolo) > The caveat would be we would have to update all dashboards for all services residing on the same host... [12:33:54] paravoid: thanks :) [12:34:42] just have to finish the puppetmaster stuff now; however, mor.itz has already done the hard part, so hopefully just some testing [12:35:11] what was the hard part? 
[12:35:36] well not that hard, but morit.z has already build new packages with the dependencies updated [12:35:44] ah, yes, I saw that [12:36:27] (03PS1) 10Alexandros Kosiaris: cxserver: Switch GC stats back to microseconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/509811 (https://phabricator.wikimedia.org/T222795) [12:36:40] !log krinkle@deploy1001 Synchronized php-1.34.0-wmf.4/resources/src/startup/startup.js: I76a2c8d52fa (duration: 00m 51s) [12:36:43] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:48] !log root@ms-be2013.codfw.wmnet ~ # umount /srv/swift-storage/sda1 && mount /srv/swift-storage/sda1 && umount /srv/swift-storage/sdb1 && mount /srv/swift-storage/sdb1 [12:36:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:37:04] (03PS2) 10Jbond: facter3/puppet5: clean up old config [puppet] - 10https://gerrit.wikimedia.org/r/507305 (https://phabricator.wikimedia.org/T219803) [12:38:42] (03CR) 10Jbond: [C: 03+2] facter3/puppet5: clean up old config [puppet] - 10https://gerrit.wikimedia.org/r/507305 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [12:40:10] (03CR) 10Hashar: [C: 03+1] "The update to /lib/systemd/system/git-daemon.service seems fine:" [puppet] - 10https://gerrit.wikimedia.org/r/508408 (https://phabricator.wikimedia.org/T222661) (owner: 10Hashar) [12:41:04] RECOVERY - Disk space on ms-be2013 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [12:41:27] if anyone has some spare time for a puppet merge, I could use a configuration tweak for the CI git-daemon services [12:41:42] we need more concurrent connections > https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/508408/ :) [12:42:28] (03PS1) 10Arturo Borrero Gonzalez: openstack: in stretch don't use libjs-jquery from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/509812 (https://phabricator.wikimedia.org/T222862) [12:43:00] (03CR) 10jerkins-bot: [V: 04-1] 
openstack: in stretch don't use libjs-jquery from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/509812 (https://phabricator.wikimedia.org/T222862) (owner: 10Arturo Borrero Gonzalez) [12:44:45] (03PS2) 10Arturo Borrero Gonzalez: openstack: in stretch don't use libjs-jquery from openstack-mitaka-jessie [puppet] - 10https://gerrit.wikimedia.org/r/509812 (https://phabricator.wikimedia.org/T222862) [12:49:59] !log updating puppetdb on deployment-puppetdb02 to 4.4.0-1~wmf2 (T219803) [12:50:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:50:03] T219803: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 [12:51:20] RECOVERY - puppet last run on ms-be2013 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [12:55:33] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] "PCC as expected https://puppet-compiler.wmflabs.org/compiler1002/16483/" [puppet] - 10https://gerrit.wikimedia.org/r/509812 (https://phabricator.wikimedia.org/T222862) (owner: 10Arturo Borrero Gonzalez) [12:58:28] (03PS2) 10Alexandros Kosiaris: cxserver: Switch GC stats back to microseconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/509811 (https://phabricator.wikimedia.org/T222795) [13:02:15] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] cxserver: Switch GC stats back to microseconds [deployment-charts] - 10https://gerrit.wikimedia.org/r/509811 (https://phabricator.wikimedia.org/T222795) (owner: 10Alexandros Kosiaris) [13:02:59] !log enable puppet in cloudvirt1024 to refresh some apt config T222862 [13:03:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:03:04] T222862: jquery included in openstack-mitaka-jessie component, leads to downgrades on stretch hosts - https://phabricator.wikimedia.org/T222862 [13:03:18] (03PS12) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 
(https://phabricator.wikimedia.org/T220946) [13:04:18] !log install libjs-jquery from stretch in cloudnet servers T222862 [13:04:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:05:33] (03PS2) 10KartikMistry: Decrease idwiki MT threshold for publishing [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508818 (https://phabricator.wikimedia.org/T222782) (owner: 10Petar.petkovic) [13:05:57] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-codfw-values.yaml production stable/cxserver [namespace: cxserver, clusters: codfw] [13:05:59] !log akosiaris@deploy1001 scap-helm cxserver cluster codfw completed [13:05:59] !log akosiaris@deploy1001 scap-helm cxserver finished [13:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:09] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-eqiad-values.yaml production stable/cxserver [namespace: cxserver, clusters: eqiad] [13:06:10] !log akosiaris@deploy1001 scap-helm cxserver cluster eqiad completed [13:06:11] !log akosiaris@deploy1001 scap-helm cxserver finished [13:06:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:24] !log akosiaris@deploy1001 scap-helm cxserver upgrade -f cxserver-staging-values.yaml staging stable/cxserver [namespace: cxserver, clusters: staging] [13:06:25] !log akosiaris@deploy1001 scap-helm cxserver cluster staging completed [13:06:25] !log akosiaris@deploy1001 scap-helm cxserver finished [13:06:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:31] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:06:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:32] 10Operations, 10ops-eqiad, 10DC-Ops: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad - https://phabricator.wikimedia.org/T223100 (10faidon) [13:07:34] !log bump cxserver chart to 0.0.7. Renames nodejs GC stats to microseconds and bumps the biggest bucket to 100ms. T220709 [13:07:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:38] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [13:08:07] (03CR) 10Jbond: [C: 03+2] Hiera backend: update the hiera configuration to remove the role backend [puppet] - 10https://gerrit.wikimedia.org/r/506167 (owner: 10Jbond) [13:08:16] (03PS2) 10Jbond: Hiera backend: update the hiera configuration to remove the role backend [puppet] - 10https://gerrit.wikimedia.org/r/506167 [13:08:27] (03PS13) 10Mathew.onipe: Add postgres slave init cookbook [cookbooks] - 10https://gerrit.wikimedia.org/r/504570 (https://phabricator.wikimedia.org/T220946) [13:09:50] (03PS1) 10Jbond: Revert "Hiera backend: update the hiera configuration to remove the role backend" [puppet] - 10https://gerrit.wikimedia.org/r/509820 [13:11:10] (03PS1) 10Filippo Giunchedi: prometheus: remove v1 rules files [puppet] - 10https://gerrit.wikimedia.org/r/509822 (https://phabricator.wikimedia.org/T187987) [13:11:26] (03CR) 10jerkins-bot: [V: 04-1] prometheus: remove v1 rules files [puppet] - 10https://gerrit.wikimedia.org/r/509822 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [13:12:14] (03PS2) 10Filippo Giunchedi: prometheus: remove v1 rules files [puppet] - 10https://gerrit.wikimedia.org/r/509822 (https://phabricator.wikimedia.org/T187987) [13:14:06] (03CR) 10Ottomata: "CI won't let me add the entry in LabsSettings.php without a corresponding setting in ProductionSettings.php" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 
(https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [13:15:22] (03CR) 10Ottomata: "Oh sorry should have noted here. Going to do this instead: https://gerrit.wikimedia.org/r/c/operations/puppet/+/509145" [puppet] - 10https://gerrit.wikimedia.org/r/507077 (https://phabricator.wikimedia.org/T218844) (owner: 10Paladox) [13:21:23] (03PS1) 10Alexandros Kosiaris: eventgate: Switch GC metric to microseconds, update buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/509836 (https://phabricator.wikimedia.org/T220709) [13:24:45] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10Andrew) Thanks arturo! I worked on this a bit last week but didn't make a whole lot of progress. [] https://grafana.wikimedia.org/d/000000339/labs-nova-fullstack []... [13:25:40] (03PS1) 10CDanis: swift codfw-prod: touch *.builder to finish decom [software/swift-ring] - 10https://gerrit.wikimedia.org/r/509838 (https://phabricator.wikimedia.org/T221068) [13:25:54] !log uploaded puppetdb 4.4.0-1~wmf2 to component/puppetdb4 for apt.wikimedia.org/stretch-wikimedia (T219803) [13:25:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:25:59] T219803: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 [13:26:02] (03CR) 10Filippo Giunchedi: [C: 03+2] prometheus: remove v1 rules files [puppet] - 10https://gerrit.wikimedia.org/r/509822 (https://phabricator.wikimedia.org/T187987) (owner: 10Filippo Giunchedi) [13:27:01] (03PS3) 10Alexandros Kosiaris: Add initialize_service.sh tool [deployment-charts] - 10https://gerrit.wikimedia.org/r/492269 [13:27:03] (03CR) 10Alexandros Kosiaris: "Indeed. Done. Thanks!" 
[deployment-charts] - 10https://gerrit.wikimedia.org/r/492269 (owner: 10Alexandros Kosiaris) [13:27:36] (03CR) 10CDanis: [V: 03+2 C: 03+2] swift codfw-prod: touch *.builder to finish decom [software/swift-ring] - 10https://gerrit.wikimedia.org/r/509838 (https://phabricator.wikimedia.org/T221068) (owner: 10CDanis) [13:28:02] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Add initialize_service.sh tool [deployment-charts] - 10https://gerrit.wikimedia.org/r/492269 (owner: 10Alexandros Kosiaris) [13:28:49] (03Abandoned) 10Jbond: Revert "Hiera backend: update the hiera configuration to remove the role backend" [puppet] - 10https://gerrit.wikimedia.org/r/509820 (owner: 10Jbond) [13:29:17] (03PS3) 10Andrew Bogott: quarry: nginx conf for custom 50x error pages [puppet] - 10https://gerrit.wikimedia.org/r/509608 (https://phabricator.wikimedia.org/T223018) (owner: 10Framawiki) [13:29:23] !log swift codfw-prod: deploy I1035824d [13:29:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:19] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -b8 'ms-be2*' 'run-puppet-agent' [13:30:19] (03CR) 10Andrew Bogott: [C: 03+2] quarry: nginx conf for custom 50x error pages [puppet] - 10https://gerrit.wikimedia.org/r/509608 (https://phabricator.wikimedia.org/T223018) (owner: 10Framawiki) [13:30:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:30:34] (03PS10) 10Marostegui: db-eqiad,db-codfw.php: Change second parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/508170 (https://phabricator.wikimedia.org/T210725) [13:32:10] (03CR) 10Ottomata: [C: 03+1] eventgate: Switch GC metric to microseconds, update buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/509836 (https://phabricator.wikimedia.org/T220709) (owner: 10Alexandros Kosiaris) [13:32:27] (03PS1) 10ArielGlenn: use lbzip2 for decompression in 7z page content recompress step [dumps] - 10https://gerrit.wikimedia.org/r/509843 [13:32:41] 
(03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] eventgate: Switch GC metric to microseconds, update buckets [deployment-charts] - 10https://gerrit.wikimedia.org/r/509836 (https://phabricator.wikimedia.org/T220709) (owner: 10Alexandros Kosiaris) [13:33:07] (03PS1) 10Anomie: Set ActorTableSchemaMigrationStage => write-both/read-new on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509844 (https://phabricator.wikimedia.org/T188327) [13:33:55] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509844 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:34:13] (03PS3) 10Ottomata: beta - Configure eventgate-* services with new deployment-eventgate-1 instance [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) [13:34:57] (03Merged) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509844 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:35:31] (03PS4) 10Ottomata: beta - Configure eventgate-* services with new deployment-eventgate-1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) [13:36:17] (03CR) 10jenkins-bot: Set ActorTableSchemaMigrationStage => write-both/read-new on remaining wikis [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509844 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [13:36:37] !log anomie@deploy1001 Synchronized wmf-config/InitialiseSettings.php: Setting actor migration to write-both/read-new on all wikis (T188327) (duration: 00m 50s) [13:36:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:41] T188327: Deploy refactored actor storage - https://phabricator.wikimedia.org/T188327 [13:37:42] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): elastic2038 DOWN 
(CPU/memory errors ) - https://phabricator.wikimedia.org/T217398 (10Papaul) Create Dispatch: Success You have successfully submitted request SR990577292. [13:38:21] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-codfw-values.yaml production stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [13:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:24] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster codfw completed [13:38:24] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [13:38:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:29] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-eqiad-values.yaml production stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [13:38:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:32] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [13:38:32] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [13:38:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:49] !log akosiaris@deploy1001 scap-helm eventgate-analytics upgrade -f eventgate-analytics-staging-values.yaml staging stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [13:38:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:51] !log akosiaris@deploy1001 scap-helm eventgate-analytics cluster staging completed [13:38:52] !log akosiaris@deploy1001 scap-helm eventgate-analytics finished [13:38:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:38:57] Logged the message at 
https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:22] !log bump eventgate-analytics chart to 0.0.36. Renames nodejs GC stats to microseconds and bumps the biggest bucket to 100ms. T220709 [13:39:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:39:27] T220709: Upgrade statsd_exporter to 0.9 - https://phabricator.wikimedia.org/T220709 [13:40:28] PROBLEM - Request latencies on chlorine is CRITICAL: instance=10.64.0.45:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:42:26] (03PS2) 10Elukey: role::common::aqs: update mediawiki's druid datasource to 2019-04 [puppet] - 10https://gerrit.wikimedia.org/r/509150 (owner: 10Mforns) [13:43:06] (03CR) 10Elukey: [C: 03+2] role::common::aqs: update mediawiki's druid datasource to 2019-04 [puppet] - 10https://gerrit.wikimedia.org/r/509150 (owner: 10Mforns) [13:43:18] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [13:43:36] (03CR) 10jerkins-bot: [V: 04-1] tox: refactor configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [13:43:56] (03PS1) 10Alexandros Kosiaris: Actually rename the GC metric for eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/509846 [13:44:32] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] Actually rename the GC metric for eventgate [deployment-charts] - 10https://gerrit.wikimedia.org/r/509846 (owner: 10Alexandros Kosiaris) [13:44:32] RECOVERY - Request latencies on chlorine is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-api [13:45:39] (03PS2) 10ArielGlenn: use lbzip2 for decompression in 7z page content recompress step [dumps] - 10https://gerrit.wikimedia.org/r/509843 [13:46:04] (03PS9) 10Effie Mouzeli: mediawiki: if guard php72_only blocks [puppet] - 10https://gerrit.wikimedia.org/r/507939 (https://phabricator.wikimedia.org/T219150) [13:46:06] (03PS1) 10Effie Mouzeli: debug_proxy: force http/1.1 when proxying [puppet] - 10https://gerrit.wikimedia.org/r/509848 (https://phabricator.wikimedia.org/T217846) [13:46:56] !log updating puppet on deployment-puppetmaster03 to 4.8.2-5+wmf1 (T219803) [13:47:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:47:00] T219803: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 [13:48:50] PROBLEM - swift-account-auditor on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [13:49:00] PROBLEM - very high load average likely xfs on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [13:49:12] PROBLEM - Disk space on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:49:16] PROBLEM - swift-container-updater on ms-be2016 is CRITICAL: CHECK_NRPE: Error - Could not connect to 10.192.0.136: Connection reset by peer https://wikitech.wikimedia.org/wiki/Swift [13:50:06] RECOVERY - swift-account-auditor on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-account-auditor https://wikitech.wikimedia.org/wiki/Swift [13:50:13] log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -b8 'ms-fe2*' 'run-puppet-agent' [13:50:16] RECOVERY - very high load average likely xfs on ms-be2016 is OK: OK - load average: 42.67, 
35.14, 26.95 https://wikitech.wikimedia.org/wiki/Swift [13:50:16] !log cdanis@cumin1001.eqiad.wmnet ~ % sudo cumin -p95 -b8 'ms-fe2*' 'run-puppet-agent' [13:50:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:50:28] RECOVERY - Disk space on ms-be2016 is OK: DISK OK https://wikitech.wikimedia.org/wiki/Monitoring/Disk_space [13:50:34] RECOVERY - swift-container-updater on ms-be2016 is OK: PROCS OK: 1 process with regex args ^/usr/bin/python /usr/bin/swift-container-updater https://wikitech.wikimedia.org/wiki/Swift [13:52:47] (03CR) 10Alexandros Kosiaris: [C: 03+1] beta - Configure eventgate-* services with new deployment-eventgate-1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [13:53:29] 10Operations, 10Patch-For-Review, 10Performance-Team (Radar): PHP fatal error handler not working on mwdebug servers - https://phabricator.wikimedia.org/T217846 (10jijiki) @Krinkle After a little more digging, looks like nginx in hassium is using HTTP/1.0 when forwarding requests to mwdebug*. My theory is th... [13:54:36] PROBLEM - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is CRITICAL: CRITICAL: 10.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:55:32] PROBLEM - Mediawiki Cirrussearch update rate - codfw on icinga1001 is CRITICAL: CRITICAL: 20.00% of data under the critical threshold [50.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [13:58:09] (03PS3) 10Alexandros Kosiaris: Add LVS DNS for eventgate-main [dns] - 10https://gerrit.wikimedia.org/r/509104 (https://phabricator.wikimedia.org/T222899) (owner: 10Ottomata) [13:59:17] (03CR) 10Krinkle: [C: 03+1] "Nice catch." 
[puppet] - 10https://gerrit.wikimedia.org/r/509848 (https://phabricator.wikimedia.org/T217846) (owner: 10Effie Mouzeli) [14:00:15] !log roll restart of aqs on aqs1* to pick up new druid settings [14:00:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:01:20] (03Abandoned) 10ArielGlenn: enable lbzip2 use for all decompression parts of recompression jobs [dumps] - 10https://gerrit.wikimedia.org/r/428156 (https://phabricator.wikimedia.org/T179059) (owner: 10ArielGlenn) [14:05:47] !log uploaded puppet 4.8.2-5+wmf1 to component/puppetdb4 for apt.wikimedia.org/stretch-wikimedia (T219803) [14:05:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:05:52] T219803: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 [14:05:57] (03CR) 10Alexandros Kosiaris: [C: 03+2] Add LVS DNS for eventgate-main [dns] - 10https://gerrit.wikimedia.org/r/509104 (https://phabricator.wikimedia.org/T222899) (owner: 10Ottomata) [14:08:13] (03CR) 10Eevans: "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/509422 (https://phabricator.wikimedia.org/T219404) (owner: 10Filippo Giunchedi) [14:09:33] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - https://phabricator.wikimedia.org/T219803 (10MoritzMuehlenhoff) puppet-common is a transitional package and no longer needed, we currently have it installed on 282 hosts, it's probably best to simply remov... 
[14:11:41] (03CR) 10Eevans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/509423 (https://phabricator.wikimedia.org/T219404) (owner: 10Filippo Giunchedi) [14:14:25] (03PS1) 10Muehlenhoff: Switch deployment-prep to facter 3 / puppet 5 [puppet] - 10https://gerrit.wikimedia.org/r/509852 (https://phabricator.wikimedia.org/T219803) [14:19:03] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/509852 (https://phabricator.wikimedia.org/T219803) (owner: 10Muehlenhoff) [14:21:40] RECOVERY - Mediawiki Cirrussearch update rate - codfw on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:22:00] (03CR) 10Vgutierrez: "pcc seems almost happy in the ~62 nodes (1 FAIL) using the tagged_interface resource: https://puppet-compiler.wmflabs.org/compiler1002/164" [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [14:22:04] RECOVERY - Mediawiki Cirrussearch update rate - eqiad on icinga1001 is OK: OK: Less than 1.00% under the threshold [80.0] https://grafana.wikimedia.org/d/JLK3I_siz/elasticsearch-indexing?panelId=44&fullscreen&orgId=1 [14:24:43] 10Operations, 10DC-Ops, 10netops: Juniper network device audit - all sites - https://phabricator.wikimedia.org/T213843 (10ayounsi) This is almost done. They added all the missing devices and almost fixed the "installed at" addresses, got some of them wrong. I followed up with the correct ones. 
[14:25:51] (03PS2) 10Alexandros Kosiaris: LVS for eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/509106 (https://phabricator.wikimedia.org/T222899) (owner: 10Ottomata) [14:26:02] (03CR) 10Alexandros Kosiaris: [C: 03+2] LVS for eventgate-main [puppet] - 10https://gerrit.wikimedia.org/r/509106 (https://phabricator.wikimedia.org/T222899) (owner: 10Ottomata) [14:27:18] (03PS5) 10Ottomata: beta - Configure eventgate-* services with new deployment-eventgate-1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) [14:28:43] (03CR) 10Ayounsi: [C: 03+1] librenms: fix logrotate cronspam [puppet] - 10https://gerrit.wikimedia.org/r/509753 (owner: 10Elukey) [14:28:45] (03CR) 10Volans: "recheck" [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [14:29:03] (03CR) 10jerkins-bot: [V: 04-1] tox: refactor configuration [software/cumin] - 10https://gerrit.wikimedia.org/r/508079 (owner: 10Volans) [14:29:51] (03CR) 10Ottomata: [C: 03+2] beta - Configure eventgate-* services with new deployment-eventgate-1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:30:12] (03CR) 10jenkins-bot: beta - Configure eventgate-* services with new deployment-eventgate-1 node [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509449 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [14:32:04] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210 (10hashar) On contint1001, using docker-pkg, I have created a new image `docker-registry.wikimedia.org/releng/tox:0.4.0`. On WMCS instance, I am... 
[14:32:54] !log otto@deploy1001 Synchronized wmf-config/LabsServices.php: no-op in prod - Configure eventgate services in beta (duration: 00m 49s) [14:32:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:33:57] !log otto@deploy1001 Synchronized wmf-config/ProductionServices.php: no-op in prod - Configure eventgate services in beta (duration: 00m 49s) [14:34:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:36:47] (03CR) 10CRusnov: profile::netbox: stop using icinga as remote cron (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509445 (owner: 10CRusnov) [14:37:24] PROBLEM - PyBal connections to etcd on lvs2006 is CRITICAL: CRITICAL: 39 connections established with conf2001.codfw.wmnet:2379 (min=40) https://wikitech.wikimedia.org/wiki/PyBal [14:37:50] PROBLEM - PyBal IPVS diff check on lvs2006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.45:32192]) https://wikitech.wikimedia.org/wiki/PyBal [14:39:51] (03PS1) 10Ottomata: Temporarily disable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509866 (https://phabricator.wikimedia.org/T222962) [14:41:47] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add dummy gsheets.cfg for netbox reports. 
[labs/private] - 10https://gerrit.wikimedia.org/r/508624 (owner: 10CRusnov) [14:42:07] (03CR) 10Vgutierrez: "cloudvirt1007 shows the same kind of changes on PCC after getting the facts updated: https://puppet-compiler.wmflabs.org/compiler1002/1648" [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [14:42:46] RECOVERY - PyBal connections to etcd on lvs2006 is OK: OK: 40 connections established with conf2001.codfw.wmnet:2379 (min=40) https://wikitech.wikimedia.org/wiki/PyBal [14:43:12] RECOVERY - PyBal IPVS diff check on lvs2006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:44:12] PROBLEM - PyBal IPVS diff check on lvs1006 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.2.45:32192]) https://wikitech.wikimedia.org/wiki/PyBal [14:45:07] (03CR) 10CRusnov: "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/508625 (owner: 10CRusnov) [14:45:11] (03CR) 10Vgutierrez: [C: 03+2] lvs: Toggle VLAN legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [14:45:24] (03PS6) 10Vgutierrez: lvs: Toggle VLAN legacy naming [puppet] - 10https://gerrit.wikimedia.org/r/508770 (https://phabricator.wikimedia.org/T209707) [14:46:18] PROBLEM - PyBal connections to etcd on lvs1006 is CRITICAL: CRITICAL: 48 connections established with conf1004.eqiad.wmnet:4001 (min=49) https://wikitech.wikimedia.org/wiki/PyBal [14:47:31] expected ^. Should recover within the next 5 mins [14:47:44] akosiaris: you're going to give me a heart attack :) [14:48:05] vgutierrez: :D [14:49:05] and/or a tshirt! 
[14:49:12] RECOVERY - PyBal IPVS diff check on lvs1006 is OK: OK: no difference between hosts in IPVS/PyBal https://wikitech.wikimedia.org/wiki/PyBal [14:49:44] godog: somehow that t-shirt is proving to be pretty hard to get [14:49:50] vgutierrez: when you enable+run puppet on lvs2003 it's going to bring in the eventgate-main LVS change. It's fine. I vetted it already on lvs2006 [14:50:00] thx for the heads up :) [14:50:06] :) [14:50:13] <_joe_> vgutierrez: you really want that t-shirt? [14:50:26] <_joe_> I have a few apache refactors to send your way [14:50:40] <_joe_> those might earn you a whole collection [14:50:42] <_joe_> :D [14:51:11] _joe_: apache? that's pretty far away from traffic scope ;P [14:51:18] RECOVERY - PyBal connections to etcd on lvs1006 is OK: OK: 49 connections established with conf1004.eqiad.wmnet:4001 (min=49) https://wikitech.wikimedia.org/wiki/PyBal [14:51:21] * vgutierrez hides like a coward [14:51:44] <_joe_> vgutierrez: just one hop [14:56:53] E_TTL [14:57:17] 10Operations, 10ops-eqiad, 10DC-Ops: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad - https://phabricator.wikimedia.org/T223100 (10Cmjohnson) @faidon @RobH asw2-a6-eqiad Asset tag mismatch for s/n PE3717440136: WMF7322 (Accounting) vs. WMF7232 (Netbox) - Correct asset tag is WMF7322 asw2-a7-eqiad Asset tag... [14:57:26] 10Operations, 10RESTBase-API, 10TechCom, 10serviceops, and 2 others: Decide whether to keep violating OpenAPI/Swagger specification in our REST services - https://phabricator.wikimedia.org/T217881 (10CCicalese_WMF) [14:57:29] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 4 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10CCicalese_WMF) 05Open→03Stalled a:05Clarakosi→03Pchelolo This work is stalled until other RESTBase patches are merged. 
[14:58:53] 10Operations, 10ops-eqiad, 10DC-Ops: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad - https://phabricator.wikimedia.org/T223100 (10Cmjohnson) netbox updated [15:01:32] (03PS5) 10Vgutierrez: openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) [15:01:46] PROBLEM - PyBal connections to etcd on lvs2003 is CRITICAL: CRITICAL: 39 connections established with conf2001.codfw.wmnet:2379 (min=40) https://wikitech.wikimedia.org/wiki/PyBal [15:02:33] ^^ expected as reported by akosiaris [15:02:57] (03CR) 10Andrew Bogott: [C: 03+2] openstack: Disable legacy vlan naming for cloudvirt1024 [puppet] - 10https://gerrit.wikimedia.org/r/508796 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [15:03:20] (03PS1) 10Ema: Make "disable_configuration_modification" work [debs/trafficserver] - 10https://gerrit.wikimedia.org/r/509871 [15:03:36] (03CR) 10Elukey: [C: 03+1] Remove unused profile::analytics::refinery::{job::guard,source} [puppet] - 10https://gerrit.wikimedia.org/r/509143 (https://phabricator.wikimedia.org/T218844) (owner: 10Ottomata) [15:04:52] PROBLEM - PyBal IPVS diff check on lvs2003 is CRITICAL: CRITICAL: Services known to PyBal but not to IPVS: set([10.2.1.45:32192]) https://wikitech.wikimedia.org/wiki/PyBal [15:07:19] (03CR) 10Elukey: [C: 03+1] Remove unused statistics::aggregator [puppet] - 10https://gerrit.wikimedia.org/r/509145 (https://phabricator.wikimedia.org/T218844) (owner: 10Ottomata) [15:08:05] (03PS1) 10Alexandros Kosiaris: profile::redis::master: Switch hiera() to lookup() [puppet] - 10https://gerrit.wikimedia.org/r/509873 [15:09:50] (03PS1) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) [15:10:40] (03CR) 10Alexandros Kosiaris: "noop per https://puppet-compiler.wmflabs.org/compiler1002/16489/rdb1005.eqiad.wmnet/" 
[puppet] - 10https://gerrit.wikimedia.org/r/509873 (owner: 10Alexandros Kosiaris) [15:18:01] 10Operations, 10ops-eqiad, 10DC-Ops: Confirm asset tags for asw2-a6/a7/a8/b5-eqiad - https://phabricator.wikimedia.org/T223100 (10faidon) 05Open→03Resolved Perfect, thank you! [15:18:52] RECOVERY - Memory correctable errors -EDAC- on elastic1029 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=elastic1029&var-datasource=eqiad+prometheus/ops [15:23:14] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-Logging, 10Wikimedia-Logstash, and 6 others: Port mediawiki/php/wmerrors to PHP7 and deploy - https://phabricator.wikimedia.org/T187147 (10herron) [15:31:57] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Vgutierrez) [15:32:15] 10Operations, 10Traffic: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) 05Open→03Resolved a:03Vgutierrez [15:33:35] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10CDanis) Here's my tentative plan for moving forward with this, including a rollout procedure: [ ]... 
[15:36:26] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10CDanis) a:05Joe→03CDanis [15:39:15] (03CR) 10BBlack: [C: 03+1] ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [15:39:40] (03PS2) 10Michael Große: Add EntitySchema to extension-list [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509437 (https://phabricator.wikimedia.org/T221650) [15:39:42] (03PS1) 10Michael Große: Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 [15:42:09] (03PS1) 10Fsero: registryha,traffic: feat: docker-registry.w.o should only hit codfw [puppet] - 10https://gerrit.wikimedia.org/r/509879 (https://phabricator.wikimedia.org/T221101) [15:44:22] (03PS2) 10Michael Große: Add configuration for EntitySchema ShExSimpleUrl [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 [15:45:13] (03CR) 10Fsero: "PCC seems happy" [puppet] - 10https://gerrit.wikimedia.org/r/509879 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [15:45:18] (03CR) 10Fsero: "https://puppet-compiler.wmflabs.org/compiler1001/16490/" [puppet] - 10https://gerrit.wikimedia.org/r/509879 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [15:45:23] (03CR) 10Michael Große: "Configuration for the url, as seen on https://wikidata-shex.wmflabs.org" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 (owner: 10Michael Große) [15:46:18] (03CR) 10CDanis: "did git move detection fail on check_grafana_alert somehow?" 
[puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [15:49:10] (03PS2) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) [15:49:40] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [15:52:33] (03PS1) 10Herron: logstash: add logstash-filter-truncate plugin [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/509880 (https://phabricator.wikimedia.org/T187147) [15:53:55] (03CR) 10Cwhite: [C: 03+1] logstash: add logstash-filter-truncate plugin [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/509880 (https://phabricator.wikimedia.org/T187147) (owner: 10Herron) [15:54:37] (03CR) 10Lucas Werkmeister (WMDE): "> Also, I looked at the configuration for WBQualityConstraintsSparqlEndpoint for orientation. 
Is that the right way to do it, or should I " [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509878 (owner: 10Michael Große) [16:00:23] !log reimaging clouvirt1024 (for the last time I hope) [16:00:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:04:29] (03PS1) 10Anomie: Set actor migration to write-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509883 (https://phabricator.wikimedia.org/T188327) [16:04:55] (03CR) 10Anomie: [C: 03+2] "Deploying planned config change" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509883 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:05:57] (03Merged) 10jenkins-bot: Set actor migration to write-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509883 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:06:24] (03CR) 10jenkins-bot: Set actor migration to write-new on Beta Cluster [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509883 (https://phabricator.wikimedia.org/T188327) (owner: 10Anomie) [16:08:05] (03PS1) 10Jbond: flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509885 (https://phabricator.wikimedia.org/T144169) [16:12:27] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210 (10kostajh) I see something slightly different when I try to pull locally: > docker pull docker-registry.wikimedia.org/releng/quibble-stretch-php... 
[16:14:12] !log removing tokipona language terms from items using maintenance script (T200432) [16:14:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:14:35] T200432: Labels/aliases/descriptions in Toki Pona need to be removed - https://phabricator.wikimedia.org/T200432 [16:19:48] 10Operations, 10DBA, 10MediaWiki-Configuration, 10Patch-For-Review, and 2 others: Create tool to handle the state of database configuration in MediaWiki in etcd - https://phabricator.wikimedia.org/T197126 (10CDanis) >>! In T197126#5177169, @jcrespo wrote: > It would be nice to have a mockup of the API to t... [16:21:05] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) p:05Triage→03High [16:23:05] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [16:25:05] (03CR) 10Jbond: [C: 03+1] "looks good to me" [puppet] - 10https://gerrit.wikimedia.org/r/509873 (owner: 10Alexandros Kosiaris) [16:27:39] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210 (10hashar) @fsero I am afraid we will need some hot fix to make it way faster. Would it be possible to temporarily switch `docker-registry.wikime... [16:28:28] 10Operations, 10Thumbor, 10hardware-requests: reallocate former image scaler to thumbor use - https://phabricator.wikimedia.org/T218323 (10jijiki) 05Open→03Stalled I am stalling this for now until we see how T220811 pans out. [16:28:42] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10Marostegui) Just checked the databases involved. They are easy to depool, we just need a couple of hours heads up. dbproxy1006 is an active proxy but we can fail it over a day before with no issues. dbprox... 
[16:29:33] 10Operations, 10Prod-Kubernetes, 10serviceops, 10Kubernetes, 10User-fsero: placeholder task for migration problems - https://phabricator.wikimedia.org/T222210 (10fsero) @hashar the CR is already there https://gerrit.wikimedia.org/r/c/operations/puppet/+/509879 just need a +1 from Traffic and i´ll merge it [16:32:02] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [16:33:31] (03PS2) 10Ema: ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) [16:34:58] (03CR) 10jerkins-bot: [V: 04-1] ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [16:37:39] (03PS1) 10Arturo Borrero Gonzalez: icinga: check_eth: fix bridge and tap interface regexp [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) [16:38:21] (03CR) 10BBlack: [C: 03+1] registryha,traffic: feat: docker-registry.w.o should only hit codfw [puppet] - 10https://gerrit.wikimedia.org/r/509879 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [16:39:00] (03CR) 10Fsero: [C: 03+2] registryha,traffic: feat: docker-registry.w.o should only hit codfw [puppet] - 10https://gerrit.wikimedia.org/r/509879 (https://phabricator.wikimedia.org/T221101) (owner: 10Fsero) [16:43:58] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [16:44:05] (03PS3) 10Ema: ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) [16:45:18] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10jcrespo) CC @akosiaris @ayounsi @robh for m1 proxy for potential even if unlikely impact on etherpad, bacula, puppet (the mysql database) & librenms, racktables & 
rt. [16:45:43] (03CR) 10Ema: "pcc here https://puppet-compiler.wmflabs.org/compiler1001/16493/" [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) (owner: 10Ema) [16:46:14] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10fgiunchedi) [16:48:14] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10fgiunchedi) Actions for ms-be hosts updated, to be on the safe side I'll stop swift + rsync in case power goes out. If it'll help I can poweroff hosts too. What time is this activity scheduled for ? [16:48:41] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [16:49:25] (03CR) 10Andrew Bogott: [C: 03+1] icinga: check_eth: fix bridge and tap interface regexp [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [16:50:19] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [16:50:35] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) Updated task description with maint window: Proposed Window: Thursday, May 16th @ 0900 AM Eastern / 1300 GMT. [16:52:01] (03CR) 10Jbond: "can you provide a sample host to test the change, would like to understand better how i missed this, Thanks" [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [16:55:41] (03PS4) 10Ema: ATS: require explicit Cache-Control/Expires [puppet] - 10https://gerrit.wikimedia.org/r/509787 (https://phabricator.wikimedia.org/T222937) [16:56:22] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10akosiaris) Bacula & puppet databases are not going to exhibit any problems anyway. 
Puppet is literally used only by servermon and this is to be uninstalled pretty soon and backups don't happen during that t... [16:57:03] 10Operations, 10ops-eqiad: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10ayounsi) >>! In T223126#5177457, @jcrespo wrote: > for m1 proxy for potential even if unlikely impact on etherpad, bacula, puppet (the mysql database) & librenms, racktables & rt. It's fine for LibreNMS (ca... [16:57:18] (03CR) 10Faidon Liambotis: [C: 04-1] "Cool! Couple of minor comments, haven't done an exhaustive review though!" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (owner: 10Ayounsi) [16:58:23] 10Operations, 10Release-Engineering-Team, 10SRE-Access-Requests: Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (10Jdforrester-WMF) [16:58:34] (03PS1) 10Jforrester: admin: add jforrester to contint-{admins,docker} [puppet] - 10https://gerrit.wikimedia.org/r/509891 (https://phabricator.wikimedia.org/T223137) [16:59:25] (03PS1) 10Andrew Bogott: cloudvirt1024: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/509892 (https://phabricator.wikimedia.org/T216724) [17:00:04] gehel and onimisionipe: #bothumor When your hammer is PHP, everything starts looking like a thumb. Rise for Wikidata Query Service weekly deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T1700). [17:00:56] No wdqs deployment today [17:01:01] (03CR) 10Andrew Bogott: [C: 03+1] "An example host is: cloudvirt1001.eqiad.wmnet." 
[puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [17:01:28] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirt1024: enable notifications [puppet] - 10https://gerrit.wikimedia.org/r/509892 (https://phabricator.wikimedia.org/T216724) (owner: 10Andrew Bogott) [17:03:05] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (10greg) Approved on my side, specific thing right now is the node6-node10 migration of CI... [17:03:13] (03Abandoned) 10Ottomata: Change eventgate-analytics LVS port to 33192 [puppet] - 10https://gerrit.wikimedia.org/r/508582 (https://phabricator.wikimedia.org/T218346) (owner: 10Ottomata) [17:03:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (10hashar) [17:03:54] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (10greg) [17:04:25] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker - https://phabricator.wikimedia.org/T223137 (10hashar) +1 we also need you to be added to the LDAP group `ciadmin` which grants write... 
[17:04:49] (03CR) 10Fsero: [C: 03+1] Temporarily disable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509866 (https://phabricator.wikimedia.org/T222962) (owner: 10Ottomata) [17:05:04] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): Grant James Forrester access to contint-admins and contint-docker, and to the ciadmin LDAP group - https://phabricator.wikimedia.org/T223137 (10Jdforrester-WMF) [17:05:12] (03CR) 10Hashar: [C: 03+1] admin: add jforrester to contint-{admins,docker} [puppet] - 10https://gerrit.wikimedia.org/r/509891 (https://phabricator.wikimedia.org/T223137) (owner: 10Jforrester) [17:05:33] (03CR) 10Ppchelko: [C: 03+1] Temporarily disable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509866 (https://phabricator.wikimedia.org/T222962) (owner: 10Ottomata) [17:06:55] (03PS2) 10Ottomata: Temporarily disable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509866 (https://phabricator.wikimedia.org/T222962) [17:09:20] (03CR) 10Ottomata: [C: 03+2] Temporarily disable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509866 (https://phabricator.wikimedia.org/T222962) (owner: 10Ottomata) [17:09:43] (03CR) 10jenkins-bot: Temporarily disable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509866 (https://phabricator.wikimedia.org/T222962) (owner: 10Ottomata) [17:09:54] !log disabling all eventgate-analytics monolog events for eventgate chart migration - T222962 [17:10:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:11] T222962: Use new eventgate chart release analytics for eventgate-analytics service. 
- https://phabricator.wikimedia.org/T222962 [17:10:24] (03PS1) 10Jcrespo: m1 proxy: Switch to use dbproxy1001 in preparation for b5-eqiad maint [dns] - 10https://gerrit.wikimedia.org/r/509894 (https://phabricator.wikimedia.org/T223126) [17:11:25] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: disabling all eventgate-analytics monolog events for eventgate chart migration - T222962 (duration: 00m 50s) [17:11:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:11:46] 10Operations, 10ops-eqiad, 10DC-Ops, 10Epic, and 2 others: Move cloudvirt hosts to 10Gb ethernet - https://phabricator.wikimedia.org/T216195 (10Andrew) [17:11:59] 10Operations, 10Patch-For-Review, 10cloud-services-team (Kanban): relocate/reimage cloudvirt1024 with 10G interfaces - https://phabricator.wikimedia.org/T216724 (10Andrew) 05Open→03Resolved [17:24:40] (03CR) 10Jbond: [C: 03+1] "LGTM added a small comment adding more context around the regression." (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509888 (https://phabricator.wikimedia.org/T223107) (owner: 10Arturo Borrero Gonzalez) [17:25:41] (03CR) 10Effie Mouzeli: [V: 03+1] "https://puppet-compiler.wmflabs.org/compiler1002/16495/hassaleh.codfw.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/509848 (https://phabricator.wikimedia.org/T217846) (owner: 10Effie Mouzeli) [17:33:15] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (10aborrero) [17:36:51] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [17:39:21] (03CR) 10Ayounsi: [C: 04-1] "Using nagiosplugin seem to make that script overly complex for the simple task it's meant to achieve. 
I could be convinced otherwise thoug" [puppet] - 10https://gerrit.wikimedia.org/r/481157 (owner: 10Faidon Liambotis) [17:41:34] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: retry - disabling all eventgate-analytics monolog events for eventgate chart migration - T222962 (duration: 00m 50s) [17:41:44] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:49] T222962: Use new eventgate chart release analytics for eventgate-analytics service. - https://phabricator.wikimedia.org/T222962 [17:44:18] lvs3004 puppet alert is from: [17:44:20] May 13 17:29:37 lvs3004 puppet-agent[17042]: Could not retrieve catalog from remote server: request https://puppet:8140/puppet/v3/catalog/lvs3004.esams.wmnet interrupted after 0.168 seconds [17:44:44] (so not an actual puppet code issue, just some kind of transient issue with contacting the server) [17:45:08] 10Operations, 10cloud-services-team (Kanban): WMCS-related dashboards using Diamond metrics - https://phabricator.wikimedia.org/T210850 (10aborrero) Sorry folks, there are a couple of things that I don't understand. The nova_fullstack_test.py script is sending collected metrics to statsd. Are we deprecating t... [17:47:29] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (10Krenair) ` cloudvirt1028.eqiad.wmnet: af-puppetdb01.automation-framework.eqiad.wmflabs bastion-eqiad1-02.bastion.eqiad.wmflabs fridolin.... [17:50:01] 10Operations, 10ops-eqiad, 10cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (10Krenair) cloudvirt1014 is already depooled and marked for rebuild as it runs Jessie, would be a good opportunity to drain it. guess the other should... 
[17:53:42] (03PS1) 10Alex Monk: Depool cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/509903 (https://phabricator.wikimedia.org/T223148) [17:54:00] 10Operations, 10ops-eqiad, 10Patch-For-Review: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10Andrew) Just to clarify -- best case (normal) scenario is no interruption? And worst case is... brief power interruption? Or no power for hours? [17:55:13] 10Operations, 10ops-eqiad, 10Patch-For-Review, 10cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (10Andrew) I've asked for clarification about what kind of power outage is feared here. Since emptying 1028 will cause downtime... [17:56:13] (03PS2) 10Framawiki: Enable SandboxLink extension on zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509593 (https://phabricator.wikimedia.org/T223006) [17:57:17] (03PS1) 10Gergő Tisza: Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) [17:57:19] !log deleting eventgate-analytics and eventgate-analytics-staging releases on staging [17:57:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:13] oh fsero [17:58:17] (03CR) 10jerkins-bot: [V: 04-1] Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) (owner: 10Gergő Tisza) [17:58:26] you don't need to delete the eventgate release analytics one [17:58:44] just the eventgate-analytics-staging and eventgate-analytics-production ones [17:58:47] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[17:58:49] i guess a delete doesn't hurt, i can reinstall [17:59:32] the port was 33192, not the right one; it could have been updated [17:59:35] but just to be sure [17:59:39] please reinstall on staging [17:59:48] ok [17:59:53] doing now (sorry, we can move back to -serviceops) [18:00:04] MaxSem, RoanKattouw, and Niharika: That opportune time is upon us again. Time for a Morning SWAT (Max 6 patches) deploy. Don't be afraid. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T1800). [18:00:05] framawiki: A patch you scheduled for Morning SWAT (Max 6 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [18:00:08] o/ [18:00:41] * Reedy peers in [18:00:47] 10Operations, 10Release Pipeline, 10Release-Engineering-Team, 10Wikidata, and 5 others: Introduce wikidata termbox SSR to kubernetes - https://phabricator.wikimedia.org/T220402 (10Pablo-WMDE) Hi @akosiaris - thanks for getting back to us. > sending a Host: HTTP for the identification of the exact project.... 
[18:01:05] !log otto@deploy1001 scap-helm eventgate-analytics install -n analytics -f analytics/staging-values.yaml stable/eventgate [namespace: eventgate-analytics, clusters: staging] [18:01:06] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [18:01:06] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:01:08] (03CR) 10Reedy: [C: 03+2] Enable SandboxLink extension on zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509593 (https://phabricator.wikimedia.org/T223006) (owner: 10Framawiki) [18:01:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:01:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:02:23] (03Merged) 10jenkins-bot: Enable SandboxLink extension on zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509593 (https://phabricator.wikimedia.org/T223006) (owner: 10Framawiki) [18:02:37] (03CR) 10jenkins-bot: Enable SandboxLink extension on zhwikiversity [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509593 (https://phabricator.wikimedia.org/T223006) (owner: 10Framawiki) [18:02:46] (03PS2) 10Gergő Tisza: Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) [18:03:06] (03PS2) 10Framawiki: Enable wmgProofreadPageShowHeaders on pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509594 (https://phabricator.wikimedia.org/T222740) [18:03:15] !log deleting eventgate-analytics-production releases on codfw [18:03:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:37] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [18:04:38] framawiki: You don't need to do the rebases, easier to let the deployer do it ;) 
[18:04:58] !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/codfw-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: codfw] [18:04:59] Ok, thanks :) [18:05:00] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [18:05:00] !log otto@deploy1001 scap-helm eventgate-analytics finished [18:05:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:05:17] PROBLEM - Request latencies on neon is CRITICAL: instance=10.64.0.40:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:05:19] (03CR) 10Reedy: [C: 03+2] Enable wmgProofreadPageShowHeaders on pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509594 (https://phabricator.wikimedia.org/T222740) (owner: 10Framawiki) [18:06:25] (03Merged) 10jenkins-bot: Enable wmgProofreadPageShowHeaders on pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509594 (https://phabricator.wikimedia.org/T222740) (owner: 10Framawiki) [18:07:30] framawiki: Those two are on mwdebug1002 [18:07:40] ok, on it [18:07:49] (03PS2) 10Reedy: Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: 10Ammarpad) [18:07:51] (03CR) 10jenkins-bot: Enable wmgProofreadPageShowHeaders on pawikisource [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509594 (https://phabricator.wikimedia.org/T222740) (owner: 10Framawiki) [18:07:52] !log otto@deploy1001 scap-helm eventgate-analytics upgrade analytics -f analytics/eqiad-values.yaml --reset-values stable/eventgate [namespace: eventgate-analytics, clusters: eqiad] [18:07:52] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [18:07:53] !log 
otto@deploy1001 scap-helm eventgate-analytics finished [18:07:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:57] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:08:01] RECOVERY - Request latencies on neon is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:08:03] (03CR) 10Reedy: [C: 03+2] Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: 10Ammarpad) [18:08:13] (03PS1) 10Rush: phab: ban an aggressive spider [puppet] - 10https://gerrit.wikimedia.org/r/509908 [18:08:23] (03PS1) 10Jbond: flake8 - mediawiki: update file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) [18:09:06] (03Merged) 10jenkins-bot: Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: 10Ammarpad) [18:09:41] (03CR) 10jenkins-bot: Set wgArticleCountMethod='any' for bgwikinews [mediawiki-config] - 10https://gerrit.wikimedia.org/r/506943 (https://phabricator.wikimedia.org/T222044) (owner: 10Ammarpad) [18:09:49] (03CR) 10Rush: [C: 03+2] phab: ban an aggressive spider [puppet] - 10https://gerrit.wikimedia.org/r/509908 (owner: 10Rush) [18:10:32] Reedy: both lgtm [18:10:35] ta [18:10:41] (03PS1) 10Ottomata: Re-enable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509911 (https://phabricator.wikimedia.org/T222962) [18:11:00] 10Operations, 10ops-eqiad: wmf7622 wont powercycle (cannot be allocated from spares) - https://phabricator.wikimedia.org/T222922 (10crusnov) Hello, process question about this. 
The current flowchart for states doesn't allow Spare->Failed to happen, so there are some implicit assumptions inside of f or example... [18:11:15] framawiki: Reedy let me know when you are done with swat. (gonna re-enable the eventgate stuff you just helped me with Reedy ) [18:11:47] PROBLEM - Request latencies on acrux is CRITICAL: instance=10.192.0.93:6443 verb=CONNECT https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:12:17] !log reedy@deploy1001 Synchronized wmf-config/InitialiseSettings.php: T223006 T222740 T222044 (duration: 00m 49s) [18:12:23] 10Operations, 10ops-eqiad, 10Patch-For-Review: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) >>! In T223126#5177834, @Andrew wrote: > Just to clarify -- best case (normal) scenario is no interruption? And worst case is... brief power interruption? Or no power for hours... [18:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:25] T223006: Turn on SandboxLink extension on zhwikiversity - https://phabricator.wikimedia.org/T223006 [18:12:25] T222044: Set $wgArticleCountMethod = 'any' on bgwikinews and run updateArticleCount.php - https://phabricator.wikimedia.org/T222044 [18:12:26] T222740: Show header/footer by default (wmgProofreadPageShowHeaders) on Punjabi Wikisource - https://phabricator.wikimedia.org/T222740 [18:13:07] RECOVERY - Request latencies on acrux is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-api [18:13:26] 10Operations, 10ops-eqiad, 10Patch-For-Review: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) [18:13:59] ottomata: That's me done [18:14:07] great! 
[18:14:10] thanks [18:15:37] (03CR) 10Fsero: [C: 03+1] Re-enable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509911 (https://phabricator.wikimedia.org/T222962) (owner: 10Ottomata) [18:16:26] (03CR) 10Ottomata: [C: 03+2] Re-enable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509911 (https://phabricator.wikimedia.org/T222962) (owner: 10Ottomata) [18:16:44] (03CR) 10jenkins-bot: Re-enable eventgate-analytics monolog events [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509911 (https://phabricator.wikimedia.org/T222962) (owner: 10Ottomata) [18:17:13] !log re-enabling all eventgate-analytics monolog events - T222962 [18:17:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:18] T222962: Use new eventgate chart release analytics for eventgate-analytics service. - https://phabricator.wikimedia.org/T222962 [18:17:36] thanks Reedy [18:17:40] :) [18:19:21] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: re-enabling all eventgate-analytics monolog events - T222962 (duration: 00m 50s) [18:19:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:21:05] (03PS3) 10Gergő Tisza: Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) [18:23:13] * Reedy kicks wikibugs [18:23:27] *clank* [18:23:53] (03PS4) 10Reedy: Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) (owner: 10Gergő Tisza) [18:23:55] (03CR) 10Reedy: [C: 03+2] Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) (owner: 10Gergő Tisza) [18:24:12] 10Operations, 10ops-eqiad, 10Patch-For-Review, 
10cloud-services-team (Kanban): Cloud Services: reallocate workload from rack B5-eqiad - https://phabricator.wikimedia.org/T223148 (10Andrew) I think we should risk the slight chance of a multi-hour outage. Three days isn't enough time to give proper notice o... [18:24:20] (03Merged) 10jenkins-bot: Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) (owner: 10Gergő Tisza) [18:24:59] (03CR) 10jenkins-bot: Invalidate CommonsMetadata cache for entries affected by T222935 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509905 (https://phabricator.wikimedia.org/T222954) (owner: 10Gergő Tisza) [18:25:33] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 31 seconds ago with 0 failures [18:25:44] !log reedy@deploy1001 Synchronized wmf-config/CommonSettings.php: T222954 (duration: 00m 49s) [18:25:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:25:49] T222954: Imageinfo queries to non-Commons wikis about Commons files return incomplete extmetadata - https://phabricator.wikimedia.org/T222954 [18:29:19] !log otto@deploy1001 Synchronized wmf-config/InitialiseSettings.php: re-sync: re-enabling all eventgate-analytics monolog events - T222962 (duration: 00m 49s) [18:29:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:29:23] T222962: Use new eventgate chart release analytics for eventgate-analytics service. 
- https://phabricator.wikimedia.org/T222962 [18:31:24] (03Abandoned) 10Alex Monk: Depool cloudvirt1028 [puppet] - 10https://gerrit.wikimedia.org/r/509903 (https://phabricator.wikimedia.org/T223148) (owner: 10Alex Monk) [18:33:36] (03PS1) 10Ottomata: Add eventgate-main.discovery.wmnet [dns] - 10https://gerrit.wikimedia.org/r/509912 (https://phabricator.wikimedia.org/T222899) [18:35:02] 10Operations, 10netops: Emergency syslog messages on asw1-eqsin - https://phabricator.wikimedia.org/T223156 (10ayounsi) p:05Triage→03Normal [18:37:13] (03PS1) 10Gergő Tisza: [DNM until June 15] Revert "Invalidate CommonsMetadata cache for entries affected by T222935" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/509914 [18:37:47] (03PS1) 10Andrew Bogott: Replace git-sync-upstream on labspuppetmasters, remove from puppet-merge [puppet] - 10https://gerrit.wikimedia.org/r/509915 (https://phabricator.wikimedia.org/T171188) [18:38:19] (03PS1) 10Dzahn: labweb/wikitech: set PHP version to 7.2 in Hiera [puppet] - 10https://gerrit.wikimedia.org/r/509916 [18:44:27] (03PS1) 10Jbond: flake8 - rabbitmq: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) [18:45:50] (03CR) 10jerkins-bot: [V: 04-1] flake8 - rabbitmq: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [18:46:29] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10dr0ptp4kt) @elukey thanks for the follow up here. No need to block on... 
[18:54:11] (03PS2) 10Jbond: flake8 - rabbitmq: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) [18:55:17] (03PS5) 10Ayounsi: Puppet, add RPKI validation daemon [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) [18:55:19] (03CR) 10Dzahn: [C: 04-1] "several unrelated issues when trying to compile ... https://puppet-compiler.wmflabs.org/compiler1001/16496/labweb1001.wikimedia.org/change" [puppet] - 10https://gerrit.wikimedia.org/r/509916 (owner: 10Dzahn) [18:58:20] (03CR) 10Ayounsi: "> Patch Set 4: Code-Review-1" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/508928 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [19:00:45] (03PS2) 10Ayounsi: Prometheus, add Routinator endpoint [puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) [19:01:59] (03CR) 10Ayounsi: "This change is ready for review." 
[puppet] - 10https://gerrit.wikimedia.org/r/508956 (https://phabricator.wikimedia.org/T220669) (owner: 10Ayounsi) [19:02:32] (03PS1) 10Jbond: flake8 - varnish: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509921 (https://phabricator.wikimedia.org/T144169) [19:05:19] (03PS1) 10Dzahn: fix puppet compiling on labweb hosts, add missing fake secrets [labs/private] - 10https://gerrit.wikimedia.org/r/509922 [19:05:45] (03PS2) 10Dzahn: fix puppet compiling on labweb hosts, add missing fake secrets [labs/private] - 10https://gerrit.wikimedia.org/r/509922 [19:07:00] (03CR) 10Dzahn: [C: 04-1] "https://gerrit.wikimedia.org/r/#/c/labs/private/+/509922/" [puppet] - 10https://gerrit.wikimedia.org/r/509916 (owner: 10Dzahn) [19:08:22] (03CR) 10Paladox: [C: 03+1] fix puppet compiling on labweb hosts, add missing fake secrets [labs/private] - 10https://gerrit.wikimedia.org/r/509922 (owner: 10Dzahn) [19:08:24] (03CR) 10Dzahn: [V: 03+2 C: 03+2] fix puppet compiling on labweb hosts, add missing fake secrets [labs/private] - 10https://gerrit.wikimedia.org/r/509922 (owner: 10Dzahn) [19:10:35] (03CR) 10Dzahn: "https://puppet-compiler.wmflabs.org/compiler1001/16497/labweb1001.wikimedia.org/" [puppet] - 10https://gerrit.wikimedia.org/r/509916 (owner: 10Dzahn) [19:14:43] (03PS1) 10Jbond: flake8 - arclamp: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) [19:14:45] (03CR) 10jerkins-bot: [V: 04-1] flake8 - arclamp: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:15:17] (03PS2) 10Ayounsi: Add the Juniper to Netbox import script. 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507217 (https://phabricator.wikimedia.org/T223161) [19:15:50] (03PS3) 10Ayounsi: Juniper to Netbox import script [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507217 (https://phabricator.wikimedia.org/T223161) [19:15:52] (03CR) 10Ayounsi: "Opened task https://phabricator.wikimedia.org/T223161" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/507217 (https://phabricator.wikimedia.org/T223161) (owner: 10Ayounsi) [19:18:52] (03PS1) 10Herron: logstash: enforce max length on "message" and "msg" fields [puppet] - 10https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) [19:19:28] (03PS2) 10Jbond: flake8 - arclamp: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) [19:28:17] wikibugs: hey [19:28:20] (03PS1) 10Jbond: flake8 - sslcert: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509925 (https://phabricator.wikimedia.org/T144169) [19:29:28] (03PS1) 10Dzahn: labweb/wikitech: start using PHP-FPM [puppet] - 10https://gerrit.wikimedia.org/r/509926 [19:29:32] (03PS2) 10Jbond: flake8 - sslcert: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509925 (https://phabricator.wikimedia.org/T144169) [19:37:51] (03CR) 10Cwhite: [C: 03+1] logstash: enforce max length on "message" and "msg" fields [puppet] - 10https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) (owner: 10Herron) [19:38:56] (03CR) 10Cwhite: [C: 03+1] flake8: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509467 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:39:10] 10Operations, 10ops-eqiad, 10Patch-For-Review: Install new PDUs into b5-eqiad - https://phabricator.wikimedia.org/T223126 (10RobH) 
[19:40:52] (03PS1) 10Jbond: flake8 - grafana: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509927 (https://phabricator.wikimedia.org/T144169) [19:41:58] (03CR) 10Cwhite: [C: 03+1] flake8: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509885 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:43:39] (03CR) 10Cwhite: [C: 03+1] flake8 - mediawiki: update file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509909 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:47:14] (03CR) 10CDanis: [C: 03+1] flake8 - sslcert: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509925 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:47:16] (03CR) 10CDanis: [C: 03+1] flake8 - varnish: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509921 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:50:17] (03CR) 10Cwhite: [C: 03+1] flake8 - rabbitmq: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509917 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:51:07] (03CR) 10CDanis: [C: 03+1] "nice cleanups!" 
[puppet] - 10https://gerrit.wikimedia.org/r/509875 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:52:37] bstorm_: i am amending your rsync bwlimit change to follow-up to comments [19:52:42] then we can merge it i think [19:55:33] Ok 😃 I called in sick today, but I will catch up tmrw [19:56:08] (03CR) 10Dzahn: [C: 03+1] rsync: add a bwlimit option for quickdatacopy (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [19:56:34] bstorm_: oh, get well soon and don't watch IRC :) thanks, laters [19:56:39] Does anyone know what caused the increase in memcached transmit traffic again since the start of this month? https://grafana.wikimedia.org/d/000000574/t204083-investigation [19:56:41] (03PS1) 10Jbond: flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 [19:56:43] (03PS3) 10Dzahn: rsync: add a bwlimit option for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [19:56:53] * addshoreVacation might have missed something in an email or on phab, still playing catchup [19:57:00] (03CR) 10jerkins-bot: [V: 04-1] flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 (owner: 10Jbond) [19:57:48] (03CR) 10CDanis: "no references to grafana-dashboard anywhere?" 
[puppet] - 10https://gerrit.wikimedia.org/r/509927 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:57:50] (03CR) 10CDanis: [C: 03+1] flake8 - arclamp: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [19:57:58] wikibugs is lagging behind a bit heh [19:59:32] i think that's its flood protection cdanis [20:00:04] cscott, arlolra, subbu, bearND, and halfak: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T2000). [20:00:19] I've got an ORES deployment. [20:00:22] Will kick it off. [20:01:58] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/509927 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [20:03:08] (03CR) 10CDanis: [C: 03+1] flake8 - grafana: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509927 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [20:04:02] !log halfak@deploy1001 Started deploy [ores/deploy@c17a1a2]: T202202 [20:04:05] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:04:06] T202202: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 [20:07:26] Looking good on the canary [20:07:28] Moving on. [20:09:07] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 7 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10Addshore) I just flicked back to https://grafana.wikimedia.org/d/00000... [20:13:47] Hmm. Lost my terminal while deploying. [20:14:00] Any scap folks know what's likely to happen to my in-progress deployment? 
[20:15:02] 10Operations, 10Traffic, 10Performance-Team (Radar): Some load.php requests failing due to "ERR_SPDY_PROTOCOL_ERROR 200" - https://phabricator.wikimedia.org/T220022 (10Krinkle) [20:15:46] halfak: afaik, it will continue? do you still see it running in the process list? [20:15:55] if not, then it has probably stopped [20:16:22] addshoreVacation, no I don't. Looking on the deployment tin for anything running under my username. [20:16:57] I do see "/usr/bin/python2 /usr/bin/scap deploy-log" but that looks unrelated. [20:17:19] I indeed don't see it running, so looks like it stopped :) [20:18:08] Thanks! I'll restart. [20:18:20] In a screen this time [20:18:24] :D [20:19:26] !log ariel@deploy1001 Started deploy [dumps/dumps@941d374]: lbzip2 decompression for 7z file production for big wikis [20:19:27] deploy failed: Failed to acquire lock "/var/lock/scap.ores_deploy.lock"; owner is "halfak"; reason is "T202202" [20:19:28] (03CR) 10ArielGlenn: [C: 03+2] use lbzip2 for decompression in 7z page content recompress step [dumps] - 10https://gerrit.wikimedia.org/r/509843 (owner: 10ArielGlenn) [20:19:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:30] T202202: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 [20:19:30] !log ariel@deploy1001 Finished deploy [dumps/dumps@941d374]: lbzip2 decompression for 7z file production for big wikis (duration: 00m 03s) [20:19:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:19:42] Looks like there's a lock file I might need to delete. 
[20:20:04] halfak: you should own that lockfile and be able to delete [20:20:17] !log halfak@deploy1001 Started deploy [ores/deploy@c17a1a2]: T202202 [20:20:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:20:25] OK here we go again [20:20:28] Thanks thcipriani [20:20:36] (03PS2) 10Jbond: flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 [20:20:59] Oh cool. It picked up where it left off really quickly. [20:21:00] Nice. [20:22:35] (03PS1) 10CRusnov: profile::netbox: Move reports config to /etc/netbox [puppet] - 10https://gerrit.wikimedia.org/r/509932 [20:24:32] !log halfak@deploy1001 Finished deploy [ores/deploy@c17a1a2]: T202202 (duration: 04m 16s) [20:24:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:24:37] T202202: Build article quality model for svwiki - https://phabricator.wikimedia.org/T202202 [20:24:47] (03PS1) 10CRusnov: Move netbox report config to /etc/netbox [software/netbox-reports] - 10https://gerrit.wikimedia.org/r/509934 [20:25:52] (03PS3) 10Jbond: flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 [20:25:54] (03CR) 10CDanis: [C: 03+1] flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 (owner: 10Jbond) [20:26:19] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [20:27:12] OK everything looks good. 
[20:27:22] Thanks thcipriani and addshore for your help :) [20:27:33] np :) [20:30:47] PROBLEM - Router interfaces on cr2-eqord is CRITICAL: CRITICAL: host 208.80.154.198, interfaces up: 53, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:31:21] PROBLEM - Router interfaces on cr3-ulsfo is CRITICAL: CRITICAL: host 198.35.26.192, interfaces up: 66, down: 1, dormant: 0, excluded: 0, unused: 0: https://wikitech.wikimedia.org/wiki/Network_monitoring%23Router_interface_down [20:32:02] (03PS1) 10Jbond: flake8 - mwgrep: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509936 (https://phabricator.wikimedia.org/T144169) [20:35:04] (03PS6) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [20:35:06] (03CR) 10Paladox: Add prometheus server for gerrit javamelody monitoring (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [20:35:54] (03CR) 10CDanis: [C: 03+1] flake8 - mwgrep: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509936 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [20:35:56] (03CR) 10jerkins-bot: [V: 04-1] Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) (owner: 10Paladox) [20:36:37] PROBLEM - puppet last run on lvs3004 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. 
[20:38:52] i think that message has been happening more often lately [20:39:16] (03PS7) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [20:39:45] we only started checking for that recently cdanis, i think previously it just went unnoticed, i.e. green in icinga [20:39:49] (03PS8) 10Paladox: Add prometheus server for gerrit javamelody monitoring [puppet] - 10https://gerrit.wikimedia.org/r/508952 (https://phabricator.wikimedia.org/T184086) [20:40:01] ahh, so my suspicions are definitely correct then ;) [20:40:07] indeed :D [20:40:22] (03PS1) 10Jbond: flake8 - cassandra: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509937 (https://phabricator.wikimedia.org/T144169) [20:41:13] (03CR) 10CDanis: "if it isn't too hard for you, could you make a separate change for renaming rewrite-group-for-memberof (and then rebase this one on top of" [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [20:44:17] (03PS4) 10Jbond: flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 [20:45:16] cdanis: sorry can you redo ^^ that one just added smart/manifests/init.pp smart/files/smart-data-dump [20:45:26] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/509476 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [20:45:57] am I missing the latest patch set for that one? 
I see the diffs to smart-data-dump's contents but not a rename to .py [20:46:15] (03CR) 10CDanis: [C: 03+1] flake8 - cassandra: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509937 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [20:46:47] oh i probably missed it, i have been given a deadline when i have to put down the laptop to watch GoT :D [20:47:10] ahaha [20:48:30] (03PS5) 10Jbond: flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 [20:49:40] (03CR) 10CDanis: [C: 03+1] flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 (owner: 10Jbond) [20:56:52] (03CR) 10Cwhite: [C: 03+1] flake8 - arclamp: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509923 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [20:58:00] (03PS1) 10Jbond: flake8 - acme-setup: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) [20:58:04] (03CR) 10Dzahn: mariadb: set some more Icinga notes URLs for nrpe checks (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [20:58:35] RECOVERY - puppet last run on bast3002 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [21:00:04] bawolff and Reedy: How many deployers does it take to do Weekly Security deployment window deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T2100). 
[21:01:02] (03PS2) 10Dzahn: mariadb: set some more Icinga notes URLs for nrpe checks [puppet] - 10https://gerrit.wikimedia.org/r/509552 (https://phabricator.wikimedia.org/T197873) [21:01:04] (03CR) 10Cwhite: [C: 03+1] flake8 - misc: update python file extensions so they are detected by CI [puppet] - 10https://gerrit.wikimedia.org/r/509929 (owner: 10Jbond) [21:01:55] (03CR) 10jerkins-bot: [V: 04-1] flake8 - acme-setup: Add python extension so theses scripts can be picked up via CI [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) (owner: 10Jbond) [21:03:27] RECOVERY - puppet last run on lvs3004 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [21:05:50] (03PS2) 10Jbond: flake8 - letsencryp: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) [21:09:46] (03PS1) 10Jbond: flake8 - diffscan: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509946 (https://phabricator.wikimedia.org/T144169) [21:10:19] (03PS1) 10Dzahn: icinga: add notes_url for bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/509947 (https://phabricator.wikimedia.org/T197873) [21:12:47] (03PS3) 10Jbond: flake8 - letsencrypt: Add python extension so CI is run [puppet] - 10https://gerrit.wikimedia.org/r/509945 (https://phabricator.wikimedia.org/T144169) [21:13:55] (03CR) 10Krinkle: [C: 03+1] "LGTM. Should it also apply to other fields that are MW specific, such as "exception.trace" and "fatal_exception.trace". Or are they derive" [puppet] - 10https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) (owner: 10Herron) [21:18:51] 'Age of most recent Analytics meta MySQL database backup files', <- is this more DBA or more Analytics :) i guess both.. 
trying to find a wikitech link for it [21:36:18] made new runbook for "bad directory owner" [21:36:30] (03PS2) 10Dzahn: icinga: add notes_url for bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/509947 (https://phabricator.wikimedia.org/T197873) [21:36:32] (03PS3) 10Dzahn: icinga: add notes_url for bad_directory_owner check [puppet] - 10https://gerrit.wikimedia.org/r/509947 (https://phabricator.wikimedia.org/T197873) [21:37:20] (03CR) 10Dzahn: [C: 03+2] "created new runbook https://wikitech.wikimedia.org/wiki/Monitoring/bad_directory_owner" [puppet] - 10https://gerrit.wikimedia.org/r/509947 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:39:19] (03CR) 10Cwhite: [C: 03+1] "> Patch Set 1: Code-Review+1" [puppet] - 10https://gerrit.wikimedia.org/r/509924 (https://phabricator.wikimedia.org/T187147) (owner: 10Herron) [21:39:45] made new runbook for "systemd unit state" [21:41:15] (03PS2) 10Dzahn: nrpe: add Icinga notes_url for systemd_unit_state check [puppet] - 10https://gerrit.wikimedia.org/r/509553 (https://phabricator.wikimedia.org/T197873) [21:42:54] (03CR) 10Dzahn: [C: 03+2] nrpe: add Icinga notes_url for systemd_unit_state check [puppet] - 10https://gerrit.wikimedia.org/r/509553 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:53:27] (03PS1) 10Dzahn: eventlogging: add Icinga notes URLs [puppet] - 10https://gerrit.wikimedia.org/r/510053 (https://phabricator.wikimedia.org/T197873) [21:54:34] (03PS1) 10Dzahn: statsd: add Icinga notes URL [puppet] - 10https://gerrit.wikimedia.org/r/510054 (https://phabricator.wikimedia.org/T197873) [21:55:38] 10Operations, 10DBA, 10Jade, 10TechCom-RFC, and 2 others: Introduce a new namespace for collaborative judgements about wiki entities - https://phabricator.wikimedia.org/T200297 (10Harej) It is my understanding that outstanding issues with Jade have been addressed. As requested I have moved this to the Inbo... 
[21:56:32] (03CR) 10Dzahn: [C: 03+2] statsd: add Icinga notes URL [puppet] - 10https://gerrit.wikimedia.org/r/510054 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:59:34] (03CR) 10Dzahn: [C: 03+2] mirrors: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [21:59:36] (03PS6) 10Dzahn: mirrors: add Icinga notes_urls [puppet] - 10https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873) [22:03:59] PROBLEM - Check systemd state on ms-be2017 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [22:04:02] Leeerooy [22:05:09] (03CR) 10Dzahn: [C: 03+2] "recheck" [puppet] - 10https://gerrit.wikimedia.org/r/509477 (https://phabricator.wikimedia.org/T197873) (owner: 10Dzahn) [22:10:14] hmm.. ok, i will take a break and see if jenkins-bot is less busy / back then [22:10:53] mutante: It's only doing postmerge stuff... [22:10:53] https://integration.wikimedia.org/zuul/ [22:14:32] (03PS2) 10RobH: quotereviewer: support 2019-style Dell EMC quotes [software] - 10https://gerrit.wikimedia.org/r/505640 (owner: 10Faidon Liambotis) [22:14:34] (03PS3) 10Faidon Liambotis: quotereviewer: support 2019-style Dell EMC quotes [software] - 10https://gerrit.wikimedia.org/r/505640 (https://phabricator.wikimedia.org/T223171) [22:15:39] (03CR) 10Faidon Liambotis: [C: 03+2] quotereviewer: support 2019-style Dell EMC quotes [software] - 10https://gerrit.wikimedia.org/r/505640 (https://phabricator.wikimedia.org/T223171) (owner: 10Faidon Liambotis) [22:18:31] 10Operations, 10Operations-Software-Development, 10netbox, 10netops, 10User-crusnov: Netbox report to validate network equipment data - https://phabricator.wikimedia.org/T221507 (10crusnov) After digging and discussing I believe the way forward since the mapping is slightly ... weird between LibreNMS and... 
[22:19:39] (03PS6) 10Mobrovac: Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) [22:21:51] (03CR) 10jerkins-bot: [V: 04-1] Handle application/octet-stream requests properly; release v0.1.5 [software/service-checker] - 10https://gerrit.wikimedia.org/r/507531 (https://phabricator.wikimedia.org/T220401) (owner: 10Mobrovac) [22:26:24] RECOVERY - Check systemd state on ms-be2017 is OK: OK - running: The system is fully operational [22:49:27] 10Operations, 10Performance-Team, 10PHP 7.2 support: Monitoring PHP 7 APC usage - https://phabricator.wikimedia.org/T223180 (10Krinkle) [23:00:04] MaxSem, RoanKattouw, and Niharika: (Dis)respected human, time to deploy Evening SWAT (Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190513T2300). Please do the needful. [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:39:54] https://mailarchive.ietf.org/arch/msg/cfrg/NhiGvOFzcEw108YLwF_ndyfB1k4?fbclid=IwAR2NiuZLlbKK3xu5Vg1EysyZ2Dab7N9mgGYQNPC0p5tGPZOwuJBCQ7R7XQY might be interesting to some people here [23:44:57] (03PS4) 10Dzahn: rsync: add a bwlimit option for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [23:52:56] (03CR) 10Dzahn: "nitpick: if we'd remove this extra space it adds on servers not using the new option it would be noop: https://puppet-compiler.wmflabs.org" [puppet] - 10https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [23:56:42] PROBLEM - Eqiad HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=eqiad&var-cache_type=All&var-status_type=5 [23:57:02] (03PS5) 10Dzahn: rsync: add a bwlimit 
option for quickdatacopy [puppet] - 10https://gerrit.wikimedia.org/r/509458 (https://phabricator.wikimedia.org/T209527) (owner: 10Bstorm) [23:57:18] yea..another spike that is already over when looking at the graph [23:57:28] Hah well over [23:57:31] weird [23:57:46] are these trying to track down the mystery 500 thing? [23:57:49] zooming out it is not even that special [23:58:26] PROBLEM - Esams HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=esams&var-cache_type=All&var-status_type=5 [23:58:36] PROBLEM - puppet last run on bast3002 is CRITICAL: CRITICAL: Failed to apply catalog, zero resources tracked by Puppet. It might be a dependency cycle. [23:58:48] PROBLEM - Text HTTP 5xx reqs/min on graphite1004 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [23:58:49] just a sudden 1 sample spike [23:59:00] yea [23:59:18] hmm. what about bast3002 though.. looking