[00:00:16] ebernhardson: yeah probably... we'll need to check carefully this one though [00:00:27] to ensure I didn't make a typo somewhere [00:00:46] (03PS4) 10EBernhardson: Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) (owner: 10Smalyshev) [00:01:02] SMalyshev: would cirrus-config-dump show that? [00:01:28] oh, actually no it doesn't cover WBCS [00:01:31] ebernhardson: I dunno maybe... which vars does it show? [00:01:37] that's what I thought... [00:02:14] but we probably can see by query dumps [00:02:16] (03CR) 10EBernhardson: [C: 03+2] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) (owner: 10Smalyshev) [00:02:19] if it has the right numbers [00:02:41] PROBLEM - puppet last run on rdb1006 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [00:03:09] (03Merged) 10jenkins-bot: Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) (owner: 10Smalyshev) [00:03:57] SMalyshev: live on mwdebug1002 [00:05:25] don't see anything obviously broken... 
query dump looks ok [00:05:44] yea a few query dumps seem to look the same [00:06:19] (03CR) 10jenkins-bot: Migrate configs to WikibaseCirrusSearch configs [mediawiki-config] - 10https://gerrit.wikimedia.org/r/501435 (https://phabricator.wikimedia.org/T218716) (owner: 10Smalyshev) [00:06:24] alright, syncing out [00:06:28] ok then, let's deploy it and see if anybody else complains ;) [00:07:41] !log ebernhardson@deploy1001 Synchronized wmf-config/: T218716: Migrade configs to WikibaseCirrusSearch (duration: 00m 51s) [00:07:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:07:52] T218716: Remove Elastic/Cirrus code from Wikibase repo - https://phabricator.wikimedia.org/T218716 [00:09:05] now all searches should be running via WikibaseCirrusSearch code [00:09:30] woot, soon can delete all the extra stuff and have less confusion :) [00:09:53] ebernhardson: yep mega-deletion patch is https://gerrit.wikimedia.org/r/c/mediawiki/extensions/Wikibase/+/495159 [00:11:30] we can wait for after .25 cutoff with it if we want to, or merge now... [00:12:05] SMalyshev: will it get much testing on beta cluster? I'm guessing wikidata has a bunch of stuff, so maybe? If so might be worth merging after .25 [00:12:33] yeah let's do that then [00:13:05] not too many people search on beta I assume... so may make sense do some manual testing there [00:13:50] alright everything should be shipped, calling SWAT complete [00:17:54] ebernhardson: thanks! 
[00:29:01] RECOVERY - puppet last run on rdb1006 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [00:34:25] PROBLEM - tilerator on maps2003 is CRITICAL: connect to address 10.192.32.146 and port 6534: Connection refused https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [00:45:52] !log bootstrapping cassandra-c, restbase2019 -- T208087 [00:45:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [00:45:56] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [00:47:01] RECOVERY - Check systemd state on restbase2019 is OK: OK - running: The system is fully operational [02:25:58] * Krinkle staging on mwdebug1002 [02:40:06] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/ExternalGuidance/extension.json: I8614f63960bc763 / T219841 (duration: 00m 53s) [02:40:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:40:10] T219841: Unknown dependency: mw.externalguidance.init - https://phabricator.wikimedia.org/T219841 [02:47:21] 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10Mathew.onipe) @MSantos ack However, this same error occurred again with maps2003. ` {"name":"tilerator","hostname":"maps2003","pid":4,"level":50,"message":"worker died, restarting","... [02:47:26] !log restarting tilerator on maps2003 - T219849 [02:47:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:47:30] T219849: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 [02:50:09] @Krinkle thanks for that SWAT but did that change sync properly? I'm still seeing the exception "Error: Unknown dependency: mw.externalguidance.init" on https://en.m.wikipedia.org/wiki/Loyalist_Collegiate_and_Vocational_Institute [02:50:55] or is cached HTML the problem here? [02:51:08] jdlrobson: Logged-in or out? 
[02:51:15] incognito on view-source:https://en.m.wikipedia.org/wiki/International_Women%27s_Day [02:51:16] I'm not seeing it both ways. [02:51:33] mw.externalguidance.init is in RLPAGEMODULES [02:51:39] OK, seeing it on that one [02:52:12] Yes, the commit we synced changes the startup module response. Which will roll over within 5min. For some wiki-languages and DCs likely already has. [02:52:24] The page still queus it like before. [02:52:58] Got it. Thanks for taking care of this Krinkle, I was not sure who to turn to given I only got alerted to it around 5pm. Really appreciate it! [02:53:10] yw [02:53:27] jdlrobson: space in "mobile " [02:53:30] that's why it's not working. [02:53:53] It's not loaded on all pages, so I must've gotten lucky when trying random pages. [02:54:00] but the startup moduel still doesn't register it. [02:54:02] on mobile [03:01:53] aggg [03:03:37] https://gerrit.wikimedia.org/r/#/c/mediawiki/extensions/ExternalGuidance/+/502385 Remove space in targets [03:04:16] Krinkle: guess that explains why the graph isn't changing https://grafana.wikimedia.org/d/000000566/overview?panelId=15&fullscreen&orgId=1 [03:04:41] OK. One more time. I'm sining off in 30min or so as well. [03:06:45] PROBLEM - puppet last run on dns2002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [03:13:45] 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10Mathew.onipe) This took another turn today. Restating tilerator did not work. syslog showed a permission error: ` onimisionipe@maps2003:/srv/log/tilerator$ tail syslog.log Apr 9 03:... 
[03:15:08] 10Operations, 10Maps (Tilerator): Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 (10Mathew.onipe) I'm going to depool maps2003 until this is fixed [03:16:20] !log depooled maps2003 - T219849 [03:16:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:16:24] T219849: Tilerator crashed on maps200[1-3].codfw.wmnet - https://phabricator.wikimedia.org/T219849 [03:18:21] * Krinkle staging on mwdebug1002 [03:22:09] PROBLEM - HHVM jobrunner on mw1336 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:23:07] !log krinkle@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/ExternalGuidance/extension.json: Id04a3a4f40a884 / T219841 (duration: 00m 52s) [03:23:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [03:23:11] T219841: Unknown dependency: mw.externalguidance.init - https://phabricator.wikimedia.org/T219841 [03:23:27] RECOVERY - HHVM jobrunner on mw1336 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.008 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [03:38:25] RECOVERY - puppet last run on dns2002 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [04:05:10] Krinkle: back to normal! phew! I love a downward spike [04:38:47] PROBLEM - puppet last run on mw1322 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [04:51:51] PROBLEM - puppet last run on wtp1048 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [05:00:28] (03PS3) 10Marostegui: db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) [05:01:27] PROBLEM - puppet last run on cloudvirt1029 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [05:05:13] RECOVERY - puppet last run on mw1322 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:18:13] RECOVERY - puppet last run on wtp1048 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [05:27:51] RECOVERY - puppet last run on cloudvirt1029 is OK: OK: Puppet is currently enabled, last run 3 minutes ago with 0 failures [05:34:27] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Marostegui) 05Stalled→03Open Opening as the hosts arrived [05:34:46] 10Operations, 10ops-eqiad, 10DBA, 10Patch-For-Review: rack/setup/install db1139|db1140.eqiad.wmnet (2 dump slaves) - https://phabricator.wikimedia.org/T218985 (10Marostegui) p:05Triage→03Normal [05:57:21] (03CR) 10Marostegui: db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [06:00:04] marostegui: Your horoscope predicts another unfortunate Change parsercache key deploy. May Zuul be (nice) with you. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T0600). [06:00:15] Thanks for being positive jouncebot! 
[06:00:21] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [06:01:23] (03Merged) 10jenkins-bot: db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [06:01:54] !log Deploy parsercache key change on canaries only - T210725 [06:01:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:01:58] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [06:02:21] (03CR) 10jenkins-bot: db-eqiad.php: Change parsercache key [mediawiki-config] - 10https://gerrit.wikimedia.org/r/499737 (https://phabricator.wikimedia.org/T210725) (owner: 10Marostegui) [06:10:10] <_joe_> jdlrobson, Krinkle: to my untrained eye, this https://grafana.wikimedia.org/d/000000566/overview?panelId=15&fullscreen&orgId=1 looks like something that needs an incident report [06:11:46] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 9 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10elukey) The train was deployed yesterday, I can see an awesome improve... [06:17:31] addshore: thanks! ---^ \o/ [06:28:31] PROBLEM - puppet last run on es1017 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/usr/local/bin/prometheus-puppet-agent-stats] [06:30:14] !log Change parsercache keys on mw[1270-1279] - T210725 [06:30:15] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) [06:30:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:30:18] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [06:36:27] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) that's actually pretty easy to test in lvs2010 (currently a spare system): `vgutierrez@lvs2010:~$ apt-cache policy systemd systemd: Installed: 232-25+deb9u9... [06:39:38] !log Change parsercache keys on mw[1260-1269] - T210725 [06:39:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:39:41] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [06:47:48] !log installing libav security updates [06:47:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:50:24] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) So for the record, in terms of impact: ` anomie> _joe_: A... 
[06:51:42] !log elasticsearch search cluster: reindex all spaceless languages in eqiad and codfw (T219533) [06:51:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:51:47] T219533: Reindex space less languages wikis to use BM25 - https://phabricator.wikimedia.org/T219533 [06:52:30] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) so systemd 241 shows the same behaviour as 232 in lvs2010: `vgutierrez@lvs2010:~$ apt-cache policy systemd systemd: Installed: 241-1~bpo9+1 Candidate: 241-1~... [06:52:37] 10Operations, 10MediaWiki-General-or-Unknown, 10serviceops, 10Core Platform Team (PHP7 (TEC4)), and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) >>! In T219279#5095261, @kchapman wrote: > @Joe did you ge... [06:53:06] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-General-or-Unknown, 10serviceops, and 2 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Joe) [06:54:50] (03PS7) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [07:00:11] RECOVERY - puppet last run on es1017 is OK: OK: Puppet is currently enabled, last run 4 minutes ago with 0 failures [07:03:47] !log Change parsercache keys on mw[1280-1289] - T210725 [07:03:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:03:51] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [07:09:50] !log Change parsercache keys on mw[1221-1229] - T210725 [07:09:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:09:53] T210725: Replace 
parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [07:10:31] !log Depool thumbor1004 for testing - T187765 [07:10:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:10:35] T187765: Replace the Nginx fronting Thumbor with a reverse proxy capable of queuing requests - https://phabricator.wikimedia.org/T187765 [07:11:38] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) so this is currently a blocker on cloudvirt1024.eqiad.wmnet for @Andrew. The suggested approach by @faidon of using systemd >= 239 doesn't seem to work. I've reb... [07:12:59] (03CR) 10Vgutierrez: [C: 03+1] "I've rebased the change and ran PCC again: https://puppet-compiler.wmflabs.org/compiler1002/15648/" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [07:21:53] !log Change parsercache keys on mw[1230-1235,1238-1239] - T210725 [07:21:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:21:57] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [07:24:16] (03CR) 10Vgutierrez: [C: 04-1] wikiba.se TLS: Make support for different certificate sources clearer (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/501461 (owner: 10Alex Monk) [07:37:20] !log installing samba security updates [07:37:48] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:42:56] (03CR) 10Vgutierrez: [C: 03+2] tendril: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502263 (https://phabricator.wikimedia.org/T220359) (owner: 10Vgutierrez) [07:43:24] (03CR) 10Vgutierrez: [C: 03+2] mirrors: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502260 
(https://phabricator.wikimedia.org/T220359) (owner: 10Vgutierrez) [07:43:38] (03PS2) 10Vgutierrez: mirrors: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502260 (https://phabricator.wikimedia.org/T220359) [07:46:02] (03PS1) 10Muehlenhoff: Remove now obsolete Cumin alias for OSM servers [puppet] - 10https://gerrit.wikimedia.org/r/502436 [07:46:32] (03PS2) 10Muehlenhoff: Remove now obsolete Cumin alias for OSM servers [puppet] - 10https://gerrit.wikimedia.org/r/502436 [07:48:59] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Deploy parsercache key change everywhere T210725 (duration: 00m 53s) [07:49:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:49:02] T210725: Replace parsercache keys to something more meaningful on db-XXXX.php - https://phabricator.wikimedia.org/T210725 [07:49:21] (03PS2) 10Vgutierrez: tendril: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502263 (https://phabricator.wikimedia.org/T220359) [07:50:56] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete Cumin alias for OSM servers [puppet] - 10https://gerrit.wikimedia.org/r/502436 (owner: 10Muehlenhoff) [07:51:31] yet another rebase :) [07:51:33] (03PS3) 10Vgutierrez: tendril: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502263 (https://phabricator.wikimedia.org/T220359) [07:51:42] sorry :-) [07:51:45] np! 
[07:52:10] !log upgrading app servers mw1319-mw1333 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [07:52:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:52:38] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [07:55:17] (03CR) 10Legoktm: Enable UrlShortener in mediawikiwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [07:57:03] that HHVM version number :o [07:57:08] (03PS1) 10Marostegui: db-eqiad.php: Depool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502438 [07:59:37] (03Abandoned) 10Marostegui: db-eqiad.php: Depool es1019 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502438 (owner: 10Marostegui) [08:06:24] (03PS1) 10Marostegui: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502439 [08:07:55] (03CR) 10Marostegui: [C: 03+2] db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502439 (owner: 10Marostegui) [08:08:55] (03Merged) 10jenkins-bot: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502439 (owner: 10Marostegui) [08:09:09] (03CR) 10jenkins-bot: db-codfw.php: Depool db2069 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502439 (owner: 10Marostegui) [08:10:11] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Depool db2069 (duration: 00m 51s) [08:10:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:15] !log Upgrade db2069 [08:10:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:10:17] (03PS3) 10Muehlenhoff: Remove now obsolete tools-checker-grid-start-trusty monitoring service [puppet] - 10https://gerrit.wikimedia.org/r/502167 [08:12:11] (03CR) 10Muehlenhoff: [C: 03+2] Remove now obsolete tools-checker-grid-start-trusty monitoring service [puppet] - 
10https://gerrit.wikimedia.org/r/502167 (owner: 10Muehlenhoff) [08:13:02] (03PS1) 10Marostegui: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502440 [08:16:30] (03PS8) 10Vgutierrez: lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) [08:19:08] (03PS14) 10Fsero: Enabling docker registry swift cross dc replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [08:19:38] (03PS15) 10Fsero: Enabling docker registry swift cross dc replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [08:20:51] (03CR) 10Vgutierrez: [C: 03+1] "pcc is still happy after getting rid of the unused $iface_length: https://puppet-compiler.wmflabs.org/compiler1002/15649/" [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [08:24:54] (03PS2) 10Gilles: Treat temp containers as private [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502206 (https://phabricator.wikimedia.org/T220265) [08:25:02] (03CR) 10Gilles: Treat temp containers as private (031 comment) [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502206 (https://phabricator.wikimedia.org/T220265) (owner: 10Gilles) [08:28:47] 10Operations, 10Wikimedia-Mailing-lists: Reset password for wikimedia-gh mailing list - https://phabricator.wikimedia.org/T220416 (10fgiunchedi) @Nkansahrexford you should have received the new password via email, please confirm! 
[08:31:14] (03CR) 10Marostegui: [C: 03+2] Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502440 (owner: 10Marostegui) [08:31:17] (03CR) 10Filippo Giunchedi: [C: 03+1] Treat temp containers as private [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502206 (https://phabricator.wikimedia.org/T220265) (owner: 10Gilles) [08:32:40] (03Merged) 10jenkins-bot: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502440 (owner: 10Marostegui) [08:33:55] !log marostegui@deploy1001 Synchronized wmf-config/db-codfw.php: Repool db2069 (duration: 00m 50s) [08:33:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:34:44] (03PS1) 10Marostegui: db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502442 [08:35:02] 10Operations, 10LDAP-Access-Requests, 10Patch-For-Review: LDAP access to the wmf group for Evan Prodromou - https://phabricator.wikimedia.org/T220226 (10fgiunchedi) p:05Triage→03Normal [08:35:46] (03CR) 10Marostegui: [C: 03+2] db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502442 (owner: 10Marostegui) [08:36:47] (03Merged) 10jenkins-bot: db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502442 (owner: 10Marostegui) [08:37:36] 10Operations, 10Wikimedia-Mailing-lists: Reset password for wikimedia-gh mailing list - https://phabricator.wikimedia.org/T220416 (10fgiunchedi) p:05Triage→03Normal [08:37:38] 10Operations: Establish guideline documentation for Kafka cluster use cases (main, jumbo, logging, etc.) 
- https://phabricator.wikimedia.org/T220391 (10fgiunchedi) p:05Triage→03Normal [08:37:40] 10Operations: Review current architecture/capacity and establish plan for Kafka main cluster upgrade/refresh to cover needs for next 2-3 years - https://phabricator.wikimedia.org/T220389 (10fgiunchedi) p:05Triage→03Normal [08:37:42] 10Operations: Audit existing Kafka main producers/consumers and document their configuration and use cases - https://phabricator.wikimedia.org/T220390 (10fgiunchedi) p:05Triage→03Normal [08:37:44] 10Operations, 10Continuous-Integration-Infrastructure: Upload Zuul 2.5.1-wmf6 package to apt.wikimedia.org - https://phabricator.wikimedia.org/T220380 (10fgiunchedi) p:05Triage→03Normal [08:37:47] 10Operations: Evaluate SSO solutions - https://phabricator.wikimedia.org/T220362 (10fgiunchedi) p:05Triage→03Normal [08:37:49] 10Operations: Audit our infrastructure for authenticated services - https://phabricator.wikimedia.org/T220361 (10fgiunchedi) p:05Triage→03Normal [08:37:53] 10Operations, 10Patch-For-Review: decom netmon1003 - https://phabricator.wikimedia.org/T220355 (10fgiunchedi) p:05Triage→03Normal [08:37:55] 10Operations, 10ops-esams: Degraded RAID on cp3034 - https://phabricator.wikimedia.org/T220194 (10fgiunchedi) p:05Triage→03Normal [08:37:57] 10Operations, 10ops-esams: Degraded RAID on cp3041 - https://phabricator.wikimedia.org/T220193 (10fgiunchedi) p:05Triage→03Normal [08:37:59] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: Migrate all metrics originated by PoPs from statsd to Prometheus - https://phabricator.wikimedia.org/T220116 (10fgiunchedi) p:05Triage→03Normal [08:38:01] 10Operations: netbox: User's groups not updated - https://phabricator.wikimedia.org/T220004 (10fgiunchedi) p:05Triage→03Normal [08:38:01] !log marostegui@deploy1001 Synchronized wmf-config/db-eqiad.php: Remove old comment from db1089 (duration: 00m 51s) [08:38:03] 10Operations, 10monitoring, 10Goal, 10User-fgiunchedi: TEC6: Logging infrastructure (Q4 2018/19 goal) 
- https://phabricator.wikimedia.org/T220103 (10fgiunchedi) p:05Triage→03Normal [08:38:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:38:05] 10Operations, 10Release-Engineering-Team (Backlog): mwdebug2001 and mwdebug2002 "/" almost full - https://phabricator.wikimedia.org/T219989 (10fgiunchedi) p:05Triage→03Normal [08:38:09] 10Operations, 10Parsoid, 10Wikimedia-Logstash, 10service-runner, and 2 others: Move parsoid logging to new logging pipeline - https://phabricator.wikimedia.org/T219927 (10fgiunchedi) p:05Triage→03Normal [08:38:12] 10Operations, 10ops-codfw, 10Patch-For-Review: Broken disk on ms-be2026 - https://phabricator.wikimedia.org/T219854 (10fgiunchedi) p:05Triage→03Normal [08:38:14] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Create cookbook to reset readonly indices on elasticsearch clusters - https://phabricator.wikimedia.org/T219799 (10fgiunchedi) p:05Triage→03Normal [08:38:16] 10Operations, 10CX-cxserver, 10Wikimedia-Logstash, 10service-runner, and 3 others: Move cxserver logging to new logging pipeline - https://phabricator.wikimedia.org/T219921 (10fgiunchedi) p:05Triage→03Normal [08:38:18] 10Operations, 10Puppet: Some jessie instances upset about rsyslog package - https://phabricator.wikimedia.org/T219764 (10fgiunchedi) p:05Triage→03Normal [08:38:21] 10Operations, 10ops-ulsfo: Degraded RAID on cp4032 - https://phabricator.wikimedia.org/T219586 (10fgiunchedi) p:05Triage→03Normal [08:38:24] 10Operations, 10Phabricator, 10Release-Engineering-Team, 10Wikimedia-Incident: Analyze and amend (if necessary) workflow of user reporting and detecting large regressions/outages - https://phabricator.wikimedia.org/T219589 (10fgiunchedi) p:05Triage→03Normal [08:38:27] 10Operations: cronspam: cross-validate-accounts - https://phabricator.wikimedia.org/T219274 (10fgiunchedi) p:05Triage→03Normal [08:38:30] 10Operations, 10Traffic: Make authdns-update compatible with local emergency 
changes - https://phabricator.wikimedia.org/T219400 (10fgiunchedi) p:05Triage→03Normal [08:40:40] gilles: o/ - https://grafana.wikimedia.org/d/000000230/navigation-timing-by-continent seems broken after .wmf24 [08:40:46] Cc: Krinkle [08:41:11] the rest of the dashboards are ok, and I don't see anything weird from the kafka side, so I guess some metric name changed? [08:41:37] (03CR) 10jenkins-bot: Revert "db-codfw.php: Depool db2069" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502440 (owner: 10Marostegui) [08:41:39] (03CR) 10jenkins-bot: db-eqiad.php: Remove old comment [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502442 (owner: 10Marostegui) [08:42:19] (03CR) 10Alexandros Kosiaris: [C: 04-1] "Minor nitpick comments, rest LGTM" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) (owner: 10Fsero) [08:43:13] (03CR) 10Alexandros Kosiaris: [C: 03+1] Updating envoy to 1.9.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/502217 (https://phabricator.wikimedia.org/T220382) (owner: 10Fsero) [08:44:19] RECOVERY - Memory correctable errors -EDAC- on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [08:44:38] (03CR) 10Giuseppe Lavagetto: [C: 04-1] "minor nit, lgtm otherwise." (031 comment) [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/502217 (https://phabricator.wikimedia.org/T220382) (owner: 10Fsero) [08:45:44] (03PS1) 10Muehlenhoff: toollabs: Remove support for trusty/Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/502444 [08:45:55] elukey: only the stacked view? 
[08:46:23] (03CR) 10Alexandros Kosiaris: [C: 03+2] Varnish: serve Swift traffic in active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [08:46:30] (03PS4) 10Alexandros Kosiaris: Varnish: serve Swift traffic in active/active mode [puppet] - 10https://gerrit.wikimedia.org/r/496872 (https://phabricator.wikimedia.org/T204245) (owner: 10Mobrovac) [08:47:05] oh, no, firstPaint [08:47:13] !log switch swift to be accessed from varnish+ats active/active rw [08:47:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:48:01] sigh, yes, thanks [08:48:03] RECOVERY - EDAC syslog messages on thumbor1004 is OK: (C)4 ge (W)2 ge 1 https://grafana.wikimedia.org/dashboard/db/host-overview?orgId=1&var-server=thumbor1004&var-datasource=eqiad+prometheus/ops [08:50:08] 10Operations, 10MediaWiki-Cache, 10MW-1.33-notes (1.33.0-wmf.25; 2019-04-09), 10Patch-For-Review, and 3 others: Mcrouter periodically reports soft TKOs for mc1029 (was mc1035, mc1022) leading to MW Memcached exceptions - https://phabricator.wikimedia.org/T203786 (10elukey) A lot of changes went out with .w... 
[08:52:16] (03CR) 10Alexandros Kosiaris: [C: 03+1] postgresql: set max_wal_senders on slave conf [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) [08:57:44] 10Operations, 10Analytics, 10EventBus: Eventbus errors: Failed processing event: Failed validating at path rev_id - https://phabricator.wikimedia.org/T220477 (10elukey) p:05Triage→03High [08:58:32] 10Operations, 10Analytics, 10EventBus, 10Services: Eventbus errors: Failed processing event: Failed validating at path rev_id - https://phabricator.wikimedia.org/T220477 (10elukey) [08:58:47] PROBLEM - gdnsd checkconf on authdns2001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure [09:00:29] (03CR) 10Arturo Borrero Gonzalez: "Please reference this phab task T219362" [puppet] - 10https://gerrit.wikimedia.org/r/502444 (owner: 10Muehlenhoff) [09:01:28] (03PS2) 10Muehlenhoff: toollabs: Remove support for trusty/Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/502444 (https://phabricator.wikimedia.org/T219362) [09:02:50] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] toollabs: Remove support for trusty/Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/502444 (https://phabricator.wikimedia.org/T219362) (owner: 10Muehlenhoff) [09:03:23] PROBLEM - Device not healthy -SMART- on db2037 is CRITICAL: cluster=mysql device=cciss,10 instance=db2037:9100 job=node site=codfw https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2037&var-datasource=codfw+prometheus/ops [09:04:30] nice [09:05:56] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) [09:06:03] 10Operations, 10DBA: Predictive failures on disk S.M.A.R.T. status - https://phabricator.wikimedia.org/T208323 (10Marostegui) db2037, m5 codfw master: ` root@db2037:~# hpssacli controller all show config Smart Array P420i in Slot 0 (Embedded) (sn: 0014380312088E0) Port Name: 1I Port Name: 2I G... 
[09:06:45] ACKNOWLEDGEMENT - Device not healthy -SMART- on db2037 is CRITICAL: cluster=mysql device=cciss,10 instance=db2037:9100 job=node site=codfw Marostegui T208323 https://grafana.wikimedia.org/dashboard/db/host-overview?var-server=db2037&var-datasource=codfw+prometheus/ops [09:07:50] (03CR) 10Giuseppe Lavagetto: "Very nice work, and overall LGTM. A few minor comments inline." (036 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [09:08:11] (03CR) 10Hashar: [C: 03+2] Train: scap clean, feature flag prune branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502316 (https://phabricator.wikimedia.org/T218783) (owner: 10Thcipriani) [09:08:20] (03Abandoned) 10Hashar: scap: add logging to clean > prune-git-branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/497781 (https://phabricator.wikimedia.org/T218783) (owner: 10Hashar) [09:09:13] (03Merged) 10jenkins-bot: Train: scap clean, feature flag prune branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502316 (https://phabricator.wikimedia.org/T218783) (owner: 10Thcipriani) [09:11:49] PROBLEM - gdnsd checkconf on authdns1001 is CRITICAL: CRITICAL: gdnsd -S checkconf failure [09:14:03] (03CR) 10Jcrespo: "Also I would wait a few hours for pc1* hosts to stabilize." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502438 (owner: 10Marostegui) [09:14:07] error: plugin_metafo: Invalid resource name 'disc-swift-rw' detected from zonefile lookup [09:14:18] lemme figure out how to fix that, I think I introduced it [09:14:22] error: Name 'swift-rw.discovery.wmnet.': resolver plugin 'metafo' rejected resource name 'disc-swift-rw' [09:14:29] lol, same time [09:14:30] (03CR) 10Marostegui: "this is abandoned :-)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502438 (owner: 10Marostegui) [09:14:36] need any help? 
[09:15:00] (03CR) 10Jcrespo: [C: 03+1] mariadb: Promote db1075 to master [puppet] - 10https://gerrit.wikimedia.org/r/501479 (https://phabricator.wikimedia.org/T219115) (owner: 10Marostegui) [09:15:03] I have no idea what I am doing, so any pointers would be nice [09:15:07] (03CR) 10jenkins-bot: Train: scap clean, feature flag prune branches [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502316 (https://phabricator.wikimedia.org/T218783) (owner: 10Thcipriani) [09:15:20] ah scratch that [09:15:21] akosiaris: https://wikitech.wikimedia.org/wiki/DNS/Discovery is the general thing [09:15:23] found it I think [09:15:25] I'm having a look [09:16:01] akosiaris: s/datacenters/map/ ? [09:16:05] it's a metafo and not a geo resource? [09:16:10] yeah looks like it [09:16:16] others are like [09:16:16] config-geo-test:20: disc-eventbus => { map => mock, dcmap => { mock => 192.0.2.1 } } [09:16:23] compared to config-geo-test:42: disc-swift-rw => { datacenters => mock, dcmap => { mock => 192.0.2.1 } } [09:16:34] CI didn't catch it? [09:16:48] it wasn't a change in the dns repo [09:16:57] we 've changed active_active: true in the puppet repo [09:17:08] right [09:17:11] so usage was changed before the definition I guess? 
[09:17:19] dunno [09:18:20] akosiaris: but wait a sec [09:18:27] git grep 'datacenters => mock' on the dns repo returns 5 of them [09:18:40] compared to 25 for 'map' [09:18:43] * volans puzzled [09:19:16] (03PS2) 10Arturo Borrero Gonzalez: clouddb2001-dev: cleanup FQDNs [dns] - 10https://gerrit.wikimedia.org/r/502295 (https://phabricator.wikimedia.org/T220129) [09:19:18] (03PS1) 10Arturo Borrero Gonzalez: clouddb2001-dev: fix typo in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/502452 (https://phabricator.wikimedia.org/T220129) [09:20:05] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clouddb2001-dev: cleanup FQDNs [dns] - 10https://gerrit.wikimedia.org/r/502295 (https://phabricator.wikimedia.org/T220129) (owner: 10Arturo Borrero Gonzalez) [09:20:13] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clouddb2001-dev: fix typo in site.pp [puppet] - 10https://gerrit.wikimedia.org/r/502452 (https://phabricator.wikimedia.org/T220129) (owner: 10Arturo Borrero Gonzalez) [09:20:54] akosiaris: ok if it's a geo resources needs map, if it's a metafo resource needs datacenter [09:20:59] (03PS3) 10Gehel: maps: migrate maps2003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/491191 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:21:15] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [09:21:18] (03PS1) 10Alexandros Kosiaris: swift-rw: Mock it as a geo-resource [dns] - 10https://gerrit.wikimedia.org/r/502453 (https://phabricator.wikimedia.org/T204245) [09:21:23] volans: ^ [09:21:33] I am not even sure what metafo stands for right now [09:21:44] (03CR) 10jerkins-bot: [V: 04-1] swift-rw: Mock it as a geo-resource [dns] - 10https://gerrit.wikimedia.org/r/502453 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [09:21:45] PROBLEM - gdnsd 
checkconf on multatuli is CRITICAL: CRITICAL: gdnsd -S checkconf failure [09:21:48] gdnsd-plugin-metafo - gdnsd plugin for address meta-failover [09:21:49] lol [09:22:48] mmmh I'm not sure tbh, I don't recall why discovery records can be in both, let me try to resume from long-term memory [09:23:01] (03CR) 10Gehel: [C: 03+2] maps: migrate maps2003 to stretch [puppet] - 10https://gerrit.wikimedia.org/r/491191 (https://phabricator.wikimedia.org/T198622) (owner: 10Mathew.onipe) [09:23:52] akosiaris: templates/wmnet:swift-rw 300/10 IN DYNA metafo!disc-swift-rw [09:23:55] it's defined as metafo [09:24:18] instead of geoip, but IIRC all discovery should be metafo... that's why I'm puzzled [09:24:28] vgutierrez, ema: maybe you have more recent insight? [09:24:29] ah, I should fix that too I guess [09:24:54] volans: I think only active/active go under geoip [09:24:59] but don't quote me on that [09:25:11] RECOVERY - tilerator on maps2003 is OK: HTTP OK: HTTP/1.1 200 OK - 305 bytes in 0.101 second response time https://wikitech.wikimedia.org/wiki/Services/Monitoring/tilerator [09:25:15] (03PS2) 10Alexandros Kosiaris: swift-rw: Mock it as a geo-resource [dns] - 10https://gerrit.wikimedia.org/r/502453 (https://phabricator.wikimedia.org/T204245) [09:25:27] akosiaris: yeah that's my theory too,but not sure enough to suggest it ;) [09:25:45] (03CR) 10Ladsgroup: Enable UrlShortener in mediawikiwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [09:26:00] so you want swift-rw to be a/a right? [09:26:04] 10Operations, 10Maps, 10Reading-Infrastructure-Team-Backlog, 10Patch-For-Review: migrate maps servers to stretch with the current style - https://phabricator.wikimedia.org/T198622 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['maps2003.codfw.wmn... 
[09:26:42] volans: yeah we are switching to that [09:26:51] overall I should be adding a swift.discovery.wmnet [09:26:57] and not mess with swift-rw [09:26:59] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [09:27:06] but that was meant to be the cleanup step after [09:27:25] actually, now that I think about it better... maybe I should have gone the other way around [09:27:29] my real doubt right now is what's the impact of moving from metafo to geoip [09:27:32] if it breaks stuff [09:27:50] if done in one step, that is [09:27:53] yeah, maybe it's better to introduce the new record and just adjust ats to use that instead [09:28:11] seems safer to me [09:28:14] and leave swift-rw/swift-ro to just be until we delete them [09:28:18] yeah moving on with that now [09:28:21] +1 [09:28:29] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) p:05High→03Normal a:05aborrero→03Papaul Please @Papaul : *... [09:31:55] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 9 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10Addshore) 05Open→03Resolved a:03Addshore Amazing, I think we can... 
[09:32:31] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 9 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10Addshore) [09:33:51] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 5 others: [Story] Implement per property caching for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T218115 (10Addshore) [09:34:04] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 7 others: Read from key-per-property cache - https://phabricator.wikimedia.org/T218124 (10Addshore) 05Open→03Resolved [09:34:24] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 5 others: [Story] Implement per property caching for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T218115 (10Addshore) 05Open→03Resolved a:03Addshore [09:34:29] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 9 others: Fix inefficient CacheAwarePropertyInfoStore memcached access pattern - https://phabricator.wikimedia.org/T97368 (10Addshore) [09:35:09] 10Operations, 10MediaWiki-extensions-WikibaseClient, 10MediaWiki-extensions-WikibaseRepository, 10Wikidata, and 5 others: [Story] Implement per property caching for CacheAwarePropertyInfoStore - https://phabricator.wikimedia.org/T218115 (10Addshore) [09:39:46] (03PS1) 10Alexandros Kosiaris: swift: Revert swift-rw to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/502456 (https://phabricator.wikimedia.org/T204245) [09:39:48] (03PS1) 10Alexandros Kosiaris: swift: Introduce new swift.discovery.wmnet stanza [puppet] - 10https://gerrit.wikimedia.org/r/502457 (https://phabricator.wikimedia.org/T204245) [09:39:50] (03PS1) 10Alexandros Kosiaris: trafficserver: Switch 
to using swift.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/502458 (https://phabricator.wikimedia.org/T204245) [09:40:17] (03CR) 10Lucas Werkmeister (WMDE): Enable UrlShortener in mediawikiwiki (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500910 (https://phabricator.wikimedia.org/T108557) (owner: 10Ladsgroup) [09:43:34] jenkins suffering? [09:43:54] https://grafana.wikimedia.org/d/000000322/zuul-gearman?from=1554799805639&to=1554803017244&panelId=21&fullscreen&orgId=1 [09:44:02] hashar: ^ [09:44:17] I am guessing that spike over there is not good [09:45:49] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [09:46:04] (03CR) 10Alexandros Kosiaris: [V: 03+2 C: 03+2] swift: Revert swift-rw to active/passive [puppet] - 10https://gerrit.wikimedia.org/r/502456 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [09:46:15] !log rebooting stat1005 for some tests [09:46:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:48:38] RECOVERY - gdnsd checkconf on multatuli is OK: OK: gdnsd -S checkconf success [09:49:58] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.24/extensions/NavigationTiming: T220476 Add originCountry to paintTiming context (duration: 00m 54s) [09:50:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:50:01] T220476: firstPaint geo breakdown broken - https://phabricator.wikimedia.org/T220476 [09:52:38] akosiaris: https://wikitech.wikimedia.org/w/index.php?title=DNS%2FDiscovery&type=revision&diff=1822847&oldid=1799771 [09:54:46] volans: thanks! 
[09:56:23] (03CR) 10Volans: "I've little context, left some questions" (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502458 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [09:57:06] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10MoritzMuehlenhoff) >>! In T148843#5090494, @elukey wrote: > The https:... [09:59:14] RECOVERY - gdnsd checkconf on authdns2001 is OK: OK: gdnsd -S checkconf success [09:59:43] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502457 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [10:00:43] (03CR) 10Alexandros Kosiaris: trafficserver: Switch to using swift.discovery.wmnet (033 comments) [puppet] - 10https://gerrit.wikimedia.org/r/502458 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [10:01:27] (03PS2) 10Alexandros Kosiaris: trafficserver: Switch to using swift.discovery.wmnet [puppet] - 10https://gerrit.wikimedia.org/r/502458 (https://phabricator.wikimedia.org/T204245) [10:02:54] RECOVERY - gdnsd checkconf on authdns1001 is OK: OK: gdnsd -S checkconf success [10:02:55] elukey: should be fixed now, thanks for letting me know [10:03:57] thanks! [10:04:11] akosiaris: not sure if only that stanza or the whole file should be renamed to remove thumbs..., I'm not familiar with ATS puppetization [10:04:41] look at hieradata/role/common/trafficserver/backend.yaml [10:04:52] I guess ema will know better :) [10:05:34] PROBLEM - puppet last run on restbase2020 is CRITICAL: CRITICAL: Puppet has 3 failures. Last run 55 seconds ago with 3 failures. 
Failed resources (up to 3 shown): Package[cassandra],Package[cassandra/metrics-collector],Package[restbase/deploy] [10:05:58] that's me ^ [10:07:11] volans: unsure as well [10:07:23] akosiaris: hey [10:07:29] !log rebooting stat1005 for some tests again [10:07:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:07:40] ema: https://gerrit.wikimedia.org/r/#/c/operations/puppet/+/502458/ is the context [10:08:35] the TL;DR is that we met some issues with making swift-rw active_active: true in puppet and decided the more sensible approach was to create a new swift.discovery.wmnet RR and use that instead [10:08:52] it is also more homogeneous as well [10:09:00] akosiaris: damn, I haven't been smart enough to think that we would have made swift.discovery.wmnet one day, so that's not in the certificate alternate names field [10:09:17] lol [10:09:20] PROBLEM - Check systemd state on restbase2020 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [10:09:21] !log bootstrapping cassandra-a, restbase2020 -- T208087 [10:09:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:09:25] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [10:10:05] should we discuss those discovery record changes in the -discovery channel? :-P [10:10:08] * volans hides [10:10:18] volans: go away [10:10:23] hiding is not enough :P [10:10:26] lol [10:10:28] volans: you know you're always welcomed to join our party! [10:10:52] I'm always there, hiding in disguise [10:11:27] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['elastic2048.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201904091011_gehel_8457... 
[10:11:28] akosiaris: so, the patch as is would break ATS<->swift unfortunately [10:11:38] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:11:44] akosiaris: and generating the new certs makes me want to cry [10:11:53] ema: why is that? [10:12:03] the "makes me want to cry" part [10:12:08] not the breakage, that one is clear [10:12:49] akosiaris: mostly because it's a manual and error prone process [10:12:52] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [10:13:21] (03PS1) 10Fdans: Add PrefUpdate as event schema to ingest to druid [puppet] - 10https://gerrit.wikimedia.org/r/502462 (https://phabricator.wikimedia.org/T218964) [10:14:19] (03CR) 10Alexandros Kosiaris: "Adding @godog just to make sure we have the same understanding of the new RR introduced. Docs about swift vs swift-rw/swift-ro in https://" [puppet] - 10https://gerrit.wikimedia.org/r/502457 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [10:15:14] ema: looks like for swift we're using cergen tho, should be simple [10:16:06] akosiaris: I'll take a look [10:16:14] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) [10:16:32] ema: yeah I know what you mean. I have the same issue with the kubernetes certs [10:17:19] akosiaris: I'll give -1 meanwhile to make sure we don't accidentally merge. How urgent is this? [10:17:46] ema: not much. 
We've just switched over varnish to talk to both DCs active/active [10:17:59] but I don't see a path in which ats talking to just eqiad would break things [10:18:14] hopefully I am not missing something [10:21:39] (03CR) 10Ema: [C: 04-1] "Please don't merge this patch yet. The general approach is perfectly fine, but we don't have swift.discovery.wmnet among the subjectAltNam" (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502458 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [10:22:48] (03PS3) 10Greta WMDE: Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) [10:23:41] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] Increase musical notation datatype string length limit [mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [10:23:47] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10aborrero) the shortened interface name `Exec[/sbin/ifup p175s0f1d1.1105]` looks even more confusing, seems to be some random string. Perhaps it makes sense to selectively di... 
[10:26:09] PROBLEM - Restbase root url on restbase2020 is CRITICAL: connect to address 10.192.32.118 and port 7231: Connection refused https://wikitech.wikimedia.org/wiki/RESTBase [10:26:27] (03CR) 10Alexandros Kosiaris: "Full PCC run at https://puppet-compiler.wmflabs.org/compiler1002/15650/" [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [10:28:01] PROBLEM - cassandra-a CQL 10.192.32.119:9042 on restbase2020 is CRITICAL: connect to address 10.192.32.119 and port 9042: Connection refused https://phabricator.wikimedia.org/T93886 [10:29:02] (03PS7) 10Alexandros Kosiaris: Move cumin_masters out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [10:29:57] PROBLEM - cassandra-a SSL 10.192.32.119:7001 on restbase2020 is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection refused https://phabricator.wikimedia.org/T120662 [10:30:07] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10Vgutierrez) So we are effectively stripping the common part of every ethernet interface name: en. We don't lose a bit of information. I don't see the problem to be honest. [10:30:43] (03CR) 10Alexandros Kosiaris: [C: 03+2] "This LGTM. I would have loved it if it also killed the $CUMIN_MASTERS ferm macro, but that part is way more involved." 
[puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [10:30:52] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502457 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [10:31:05] RECOVERY - cassandra-a SSL 10.192.32.119:7001 on restbase2020 is OK: SSL OK - Certificate restbase2020-a valid until 2021-04-07 15:35:52 +0000 (expires in 729 days) https://phabricator.wikimedia.org/T120662 [10:33:01] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:33:09] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [10:35:49] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10BBlack) It's not ideal, but the part that was stripped was the most-predictable part of the name (the `en` prefix), so it's not all that confusing. I think the best way out... [10:37:23] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2048.codfw.wmnet'] ` Of which those **FAILED**: ` ['elastic2048.codfw.wmnet'] ` [10:38:38] akosiaris: actually, I was mistaken. It seems that origin server certificate validation is done by looking at the Host request header, not the hostname mentioned in the remap rule [10:39:20] PROBLEM - kubelet operational latencies on kubernetes2001 is CRITICAL: instance=kubernetes2001.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [10:39:38] akosiaris: which means that things would probably keep on working. Let's do the right thing though and update the certs first, I'd say? 
[10:41:55] ema sure [10:43:14] PROBLEM - puppet last run on mw1224 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [10:48:17] 10Operations: cronspam: cross-validate-accounts - https://phabricator.wikimedia.org/T219274 (10MoritzMuehlenhoff) 05Open→03Resolved a:03MoritzMuehlenhoff >>! In T219274#5061850, @Dzahn wrote: > I would say the team responsible is the entire SRE which historically was the same as people receiving root mail.... [10:48:19] 10Operations, 10Patch-For-Review: Tracking and Reducing cron-spam to root@ - https://phabricator.wikimedia.org/T132324 (10MoritzMuehlenhoff) [10:53:47] (03PS1) 10Arturo Borrero Gonzalez: clouddb2001-dev: add mapped IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/502464 (https://phabricator.wikimedia.org/T220129) [10:54:31] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clouddb2001-dev: add mapped IPv6 [puppet] - 10https://gerrit.wikimedia.org/r/502464 (https://phabricator.wikimedia.org/T220129) (owner: 10Arturo Borrero Gonzalez) [10:57:28] 10Operations, 10Wikidata, 10Wikidata-Termbox-Hike, 10serviceops, and 4 others: New Service Request: Wikidata Termbox SSR - https://phabricator.wikimedia.org/T212189 (10WMDE-leszek) Sounds good, thanks @akosiaris ! We monitor T220402 then. 
[10:57:30] !log updated buster installer to daily build from 9th of April [10:57:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:37] (03PS2) 10Alexandros Kosiaris: swift: Introduce new swift.discovery.wmnet stanza [puppet] - 10https://gerrit.wikimedia.org/r/502457 (https://phabricator.wikimedia.org/T204245) [10:59:42] (03CR) 10Alexandros Kosiaris: [C: 03+2] swift: Introduce new swift.discovery.wmnet stanza [puppet] - 10https://gerrit.wikimedia.org/r/502457 (https://phabricator.wikimedia.org/T204245) (owner: 10Alexandros Kosiaris) [11:00:01] !log rebooting lvs2010 with systemd 241-1~bpo9+1 T209707 [11:00:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: My dear minions, it's time we take the moon! Just kidding. Time for European Mid-day SWAT(Max 6 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1100). [11:00:04] No GERRIT patches in the queue for this window AFAICS. [11:00:04] T209707: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 [11:00:29] no patches, no swat... 
[11:00:40] * zeljkof sings it to the tune of no woman no cry [11:01:24] (03PS1) 10Arturo Borrero Gonzalez: clouddb2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/502466 (https://phabricator.wikimedia.org/T220129) [11:01:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] clouddb2001-dev: add IPv6 [dns] - 10https://gerrit.wikimedia.org/r/502466 (https://phabricator.wikimedia.org/T220129) (owner: 10Arturo Borrero Gonzalez) [11:05:26] (03CR) 10Alex Monk: "deployment-prep's own settings would override this as necessary so it should be fine" [puppet] - 10https://gerrit.wikimedia.org/r/502208 (https://phabricator.wikimedia.org/T213705) (owner: 10Ema) [11:06:42] 10Operations, 10Traffic, 10Patch-For-Review: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 (10ema) >>! In T209707#5095846, @faidon wrote: > So with newer systemd I think there are good chances enp59s0f0 will be named ens2f0 and enp175s0f0 ens3f1 You're right. All sy... [11:07:15] !log akosiaris@puppetmaster1001 conftool action : set/pooled=true; selector: name=.*,dnsdisc=swift [11:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:07:38] !log pool both DCs for newly created swift.discovery.wmnet RR [11:07:40] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:09:26] RECOVERY - puppet last run on mw1224 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [11:16:37] 10Operations, 10Wikimedia-Mailing-lists: Reset password for wikimedia-gh mailing list - https://phabricator.wikimedia.org/T220416 (10Nkansahrexford) 05Open→03Resolved a:03Nkansahrexford Confirming receipt of new password via email. Thanks. 
[11:16:46] (03PS7) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [11:22:13] (03CR) 10Jbond: "Thanks for the review" (035 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [11:26:59] !log upgrading API servers mw1276-mw1290 to HHVM 3.18.5+dfsg-1+wmf8+deb9u2 and wikidiff 1.8.1 (T203069) [11:27:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:27:03] T203069: Deploy wikidiff2 v1.8.1 with changed signature - https://phabricator.wikimedia.org/T203069 [11:28:22] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) [11:28:36] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestservices2002: rename to cloudservices2002-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220101 (10aborrero) 05Open→03Resolved [11:28:37] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:30:01] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:30:11] (03CR) 10Gilles: [V: 03+2 C: 03+2] Treat temp containers as private [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502206 (https://phabricator.wikimedia.org/T220265) (owner: 10Gilles) [11:34:01] (03PS6) 10Gilles: Upgrade to 2.3 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/488060 (https://phabricator.wikimedia.org/T198370) [11:34:43] (03CR) 10Gilles: [C: 03+2] Upgrade to 2.3 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/488060 
(https://phabricator.wikimedia.org/T198370) (owner: 10Gilles) [11:38:21] PROBLEM - kubelet operational latencies on kubernetes2003 is CRITICAL: instance=kubernetes2003.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:39:23] PROBLEM - kubelet operational latencies on kubernetes2004 is CRITICAL: instance=kubernetes2004.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:48:39] PROBLEM - Host logstash1012 is DOWN: PING CRITICAL - Packet loss = 100% [11:51:37] RECOVERY - kubelet operational latencies on kubernetes2001 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:56:23] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [11:57:08] (03PS1) 10Arturo Borrero Gonzalez: striker: factor out common code to a shared profile [puppet] - 10https://gerrit.wikimedia.org/r/502472 [11:57:40] (03PS1) 10Vgutierrez: get rid of certcentral[12]001 IPs [dns] - 10https://gerrit.wikimedia.org/r/502473 (https://phabricator.wikimedia.org/T207389) [12:00:05] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1200) [12:01:59] (03PS3) 10Muehlenhoff: toollabs: Remove support for trusty/Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/502444 (https://phabricator.wikimedia.org/T219362) [12:02:25] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1010 is CRITICAL: 58 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [12:02:29] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1011 is CRITICAL: 43 ge 10 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [12:02:40] (03CR) 10Arturo Borrero Gonzalez: [C: 04-1] "PCC not noop: https://puppet-compiler.wmflabs.org/compiler1002/15651/" [puppet] - 10https://gerrit.wikimedia.org/r/502472 (owner: 10Arturo Borrero Gonzalez) [12:06:18] (03CR) 10Vgutierrez: [C: 03+2] get rid of certcentral[12]001 IPs [dns] - 10https://gerrit.wikimedia.org/r/502473 (https://phabricator.wikimedia.org/T207389) (owner: 10Vgutierrez) [12:06:30] (03CR) 10Jbond: [C: 03+1] "lgtm" [puppet] - 10https://gerrit.wikimedia.org/r/502234 (https://phabricator.wikimedia.org/T219803) (owner: 10Muehlenhoff) [12:07:23] so there seems to be an increase in traffic for the kafka logging cluster [12:07:26] godog, herron [12:08:01] those network bytes in/out is really strange [12:09:26] indeed, checking [12:09:39] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1010 is CRITICAL: 58 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [12:09:41] PROBLEM - Kafka Broker Under Replicated Partitions on logstash1011 is CRITICAL: 43 ge 10 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [12:09:52] godog: mmm might be 1012? [12:10:07] the spikes are about the other two [12:10:14] and I can't ssh into 1012 [12:10:45] see alert from 22 minutes ago [12:10:48] !log remove facter2.4 from wikimedia-buster [12:11:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:11:36] seems frozen from the mgmt console [12:11:52] paravoid: ah snap you are right sigh [12:12:00] godog: ok if I force a powercycle ? 
[12:12:14] elukey: sure go for it [12:12:23] RECOVERY - kubelet operational latencies on kubernetes2004 is OK: All metrics within thresholds. https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:13:12] !log powercycle logstash1012 - no ssh, no mgmt console available, seems completely stuck [12:13:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:13:33] (03CR) 10Muehlenhoff: [C: 03+2] toollabs: Remove support for trusty/Ubuntu [puppet] - 10https://gerrit.wikimedia.org/r/502444 (https://phabricator.wikimedia.org/T219362) (owner: 10Muehlenhoff) [12:13:50] still getting nothing from the console [12:14:01] lovely [12:14:40] ah no there you go [12:16:14] (03PS2) 10Elukey: Add PrefUpdate as event schema to ingest to druid [puppet] - 10https://gerrit.wikimedia.org/r/502462 (https://phabricator.wikimedia.org/T218964) (owner: 10Fdans) [12:16:53] RECOVERY - Host logstash1012 is UP: PING OK - Packet loss = 0%, RTA = 0.29 ms [12:17:28] sweet, I'll check logstash1012 now that it is back up, thanks elukey paravoid ! 
[12:17:29] PROBLEM - ElasticSearch health check for shards on 9200 on logstash1012 is CRITICAL: CRITICAL - elasticsearch http://localhost:9200/_cluster/health error while fetching: HTTPConnectionPool(host=localhost, port=9200): Max retries exceeded with url: /_cluster/health (Caused by NewConnectionError(requests.packages.urllib3.connection.HTTPConnection object at 0x7fc83aba8320: Failed to establish a new connection: [Errno 111] Connection [12:17:29] ://wikitech.wikimedia.org/wiki/Search%23Administration [12:17:35] (03PS2) 10Arturo Borrero Gonzalez: striker: factor out common code to a shared profile [puppet] - 10https://gerrit.wikimedia.org/r/502472 [12:17:59] (03PS3) 10Ema: hieradata/labs: add profile::cache::ssl::wikibase settings [puppet] - 10https://gerrit.wikimedia.org/r/502208 (https://phabricator.wikimedia.org/T213705) [12:18:15] godog: didn't find anything good on racadm getsel other than "an OEM even occurred" [12:18:19] *event [12:18:29] RECOVERY - ElasticSearch health check for shards on 9200 on logstash1012 is OK: OK - elasticsearch status production-logstash-eqiad: unassigned_shards: 36, delayed_unassigned_shards: 0, cluster_name: production-logstash-eqiad, status: yellow, task_max_waiting_in_queue_millis: 53726, number_of_data_nodes: 3, number_of_nodes: 6, number_of_pending_tasks: 4, timed_out: False, number_of_in_flight_fetch: 0, initializing_shards: 4, acti [12:18:29] elocating_shards: 0, active_shards_percent_as_number: 80.19801980198021, active_primary_shards: 86 https://wikitech.wikimedia.org/wiki/Search%23Administration [12:19:13] Codename: sid [12:19:19] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1010 is OK: (C)10 ge (W)1 ge 0 https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1010 [12:19:21] RECOVERY - Kafka Broker Under Replicated Partitions on logstash1011 is OK: (C)10 ge (W)1 ge 0 
https://grafana.wikimedia.org/dashboard/db/kafka?panelId=29&fullscreen&orgId=1&var-datasource=eqiad+prometheus/ops&var-kafka_cluster=logging-eqiad&var-kafka_broker=logstash1011 [12:19:46] (03CR) 10Ema: [C: 03+2] hieradata/labs: add profile::cache::ssl::wikibase settings [puppet] - 10https://gerrit.wikimedia.org/r/502208 (https://phabricator.wikimedia.org/T213705) (owner: 10Ema) [12:20:16] (03CR) 10Elukey: [C: 03+2] "https://puppet-compiler.wmflabs.org/compiler1002/15653/an-coord1001.eqiad.wmnet/" [puppet] - 10https://gerrit.wikimedia.org/r/502462 (https://phabricator.wikimedia.org/T218964) (owner: 10Fdans) [12:20:24] (03PS3) 10Elukey: Add PrefUpdate as event schema to ingest to druid [puppet] - 10https://gerrit.wikimedia.org/r/502462 (https://phabricator.wikimedia.org/T218964) (owner: 10Fdans) [12:20:35] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "I think this PCC diff is mostly a noop, class renames being considered a modification:" [puppet] - 10https://gerrit.wikimedia.org/r/502472 (owner: 10Arturo Borrero Gonzalez) [12:21:06] elukey: ack, thanks! 
opening a task [12:22:42] 10Operations, 10Thumbor, 10serviceops: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10jijiki) [12:23:01] 10Operations, 10Thumbor, 10serviceops: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10jijiki) [12:23:27] (03PS1) 10Arturo Borrero Gonzalez: cloudnet2002-dev: rename and repurpose as cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/502474 (https://phabricator.wikimedia.org/T220426) [12:23:40] 10Operations, 10Thumbor, 10serviceops, 10User-jijiki: Upgrade Thumbor to Buster - https://phabricator.wikimedia.org/T216815 (10jijiki) [12:23:47] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: logstash1012 lock up - https://phabricator.wikimedia.org/T220500 (10fgiunchedi) [12:24:29] (03PS2) 10Muehlenhoff: Don't install facter 2.4 in buster installs [puppet] - 10https://gerrit.wikimedia.org/r/502234 (https://phabricator.wikimedia.org/T219803) [12:25:24] (03CR) 10Muehlenhoff: [C: 03+2] Don't install facter 2.4 in buster installs [puppet] - 10https://gerrit.wikimedia.org/r/502234 (https://phabricator.wikimedia.org/T219803) (owner: 10Muehlenhoff) [12:35:14] !log bounce rsyslog on lithium [12:35:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:35:36] Hello everyone ! 
[12:35:45] (03PS1) 10Arturo Borrero Gonzalez: cloudnet2002-dev: rename to cloudweb2001-dev.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/502476 (https://phabricator.wikimedia.org/T220426) [12:37:13] I have a question to ask for my studies, and maybe you could help if you have the time :x [12:37:25] PROBLEM - kubelet operational latencies on kubernetes2002 is CRITICAL: instance=kubernetes2002.codfw.wmnet https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [12:37:26] (03PS1) 10Gehel: elasticsearch: minor cleanup of directory creation [puppet] - 10https://gerrit.wikimedia.org/r/502477 [12:39:40] MorganNRS: sure! feel free to ask [12:43:31] I am studying Data replication in Large Scale Cloud Computing, and my question is, does the bandwidth between 2 Datacenters or Cluster in different country is higher or lower than two in the same one ? And I ask it about wikipedia, because according to wikimedia, you have datacenters in the US, one in Netherland and now one in Taiwan. So maybe you could help me :x thank you ! [12:43:39] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) After some tests with Moritz we did the following: * install... 
[12:44:55] I m sorry if this is not an appropriate question :x [12:45:00] (03PS1) 10Vgutierrez: acme_chief: Set up staging role for acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) [12:45:42] (03PS1) 10Lucas Werkmeister (WMDE): Fix wgImportSources setting for wikidata dblist [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502479 [12:45:44] (03PS1) 10Lucas Werkmeister (WMDE): Fix WBRepoCanonicalUriProperty setting for testwikidatawiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502480 [12:45:46] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: cloudnet2002-dev: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) p:05Triage→03Normal a:05aborrero→03Papaul >>! In T220426#5097145, @gerritbot wrote: > Change 502476 had a related pa... [12:47:17] PROBLEM - Work requests waiting in Zuul Gearman server on contint1001 is CRITICAL: CRITICAL: 64.29% of data above the critical threshold [140.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [12:50:44] (03CR) 10DCausse: [C: 03+1] elasticsearch: minor cleanup of directory creation [puppet] - 10https://gerrit.wikimedia.org/r/502477 (owner: 10Gehel) [12:51:04] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) [12:52:31] MorganNRS: in general you can get the same bandwidth in terms of link speed, although of course latency will be higher the further the datacenters are, and thus higher BDP (https://en.wikipedia.org/wiki/Bandwidth-delay_product) [12:52:49] (03PS2) 10Gehel: elasticsearch: minor cleanup of directory creation and services [puppet] - 10https://gerrit.wikimedia.org/r/502477 [12:53:05] 10Operations, 10Puppet, 10Packaging, 10Patch-For-Review: upgrade facter and puppet across the fleet - 
https://phabricator.wikimedia.org/T219803 (10jbond) below is a diff between facter2 and facter3. most things are the same but there are a few things which are different, of course there are many more new... [12:53:56] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) There has been a confusion briefly in this phab task (my fault) * the `cloudnet2002-dev.codfw.wmnet` server https://netbox.wik... [12:54:10] 10Operations, 10Traffic, 10Patch-For-Review: Make cp1099 the new pinkunicorn - https://phabricator.wikimedia.org/T202966 (10faidon) According to Netbox, cp1099 is 2 years newer than cp1008, but is still a 6-year old server (purchased Mar 28, 2013). Can we just get rid of it? I'm concerned we're just spending... [12:54:45] (03CR) 10Volans: [C: 03+1] "LGTM, run a pcc to be on the safe side." [puppet] - 10https://gerrit.wikimedia.org/r/502477 (owner: 10Gehel) [12:54:48] godog: Thank you very much for your answer !! [12:55:52] MorganNRS: np! hope that helps [12:56:41] (03PS3) 10Fsero: Updating envoy to 1.9.1 [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/502217 (https://phabricator.wikimedia.org/T220382) [12:57:12] (03PS5) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) [12:57:19] godog: Yes ! Thank you ! [12:57:38] Keep your amazing work everyone !! See you !! 
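Note on godog's answer above: the bandwidth-delay product (BDP) is link speed multiplied by round-trip time, and it indicates how much data must be in flight to keep a link full. Below is a minimal sketch of that arithmetic. The 10 Gbit/s link speed, the 0.5 ms and 100 ms round-trip times, and the 64 KiB window are hypothetical illustration values, not figures from the log or measurements of any Wikimedia link. The function names are also made up for this example.

```python
def bdp_bytes(link_bits_per_second: float, rtt_ms: float) -> float:
    """Bandwidth-delay product in bytes: data in flight needed to fill the link."""
    return link_bits_per_second * rtt_ms / 1000 / 8


def window_limited_throughput_bps(window_bytes: float, rtt_ms: float) -> float:
    """Upper bound on throughput (bits/s) for a sender limited to one window per RTT."""
    return window_bytes * 8 * 1000 / rtt_ms


# Hypothetical values, chosen only to illustrate the effect of distance.
LINK_BPS = 10_000_000_000   # assumed 10 Gbit/s link in both cases
RTT_SAME_DC_MS = 0.5        # assumed round trip inside one datacenter
RTT_FAR_DC_MS = 100.0       # assumed round trip between distant datacenters
WINDOW_BYTES = 64 * 1024    # assumed fixed 64 KiB window, no window scaling

if __name__ == "__main__":
    for label, rtt in (("same DC", RTT_SAME_DC_MS), ("far DC", RTT_FAR_DC_MS)):
        print(label,
              "BDP bytes:", bdp_bytes(LINK_BPS, rtt),
              "window-limited bits/s:", window_limited_throughput_bps(WINDOW_BYTES, rtt))
```

With the same link speed, the longer round trip raises the BDP in proportion, so a sender with a fixed window achieves proportionally less throughput over the distant path. This is the sense in which latency matters for replication between datacenters even when the link speed is the same.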
[12:57:38] (03CR) 10Fsero: [V: 03+2 C: 03+2] "Addressed nitpicks so merging" [docker-images/production-images] - 10https://gerrit.wikimedia.org/r/502217 (https://phabricator.wikimedia.org/T220382) (owner: 10Fsero) [12:57:41] (03CR) 10Gehel: [C: 03+2] elasticsearch: minor cleanup of directory creation and services [puppet] - 10https://gerrit.wikimedia.org/r/502477 (owner: 10Gehel) [12:59:18] (03PS2) 10Arturo Borrero Gonzalez: labtestnet2002: rename to cloudweb2001-dev.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/502476 (https://phabricator.wikimedia.org/T220426) [12:59:21] (03PS16) 10Fsero: Enabling docker registry swift cross dc replication [puppet] - 10https://gerrit.wikimedia.org/r/490073 (https://phabricator.wikimedia.org/T214289) [13:00:04] Deploy window MediaWiki train - European version (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1300) [13:00:58] (03PS6) 10Ema: cache: add profile::cache::varnish::backend [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) [13:00:59] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['elastic2048.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201904091300_gehel_1367... 
[13:01:23] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [13:01:45] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 928 days) https://wikitech.wikimedia.org/wiki/Logs [13:02:03] (03PS2) 10Arturo Borrero Gonzalez: labtestnet2002: rename and repurpose as cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/502474 (https://phabricator.wikimedia.org/T220426) [13:02:44] (03PS4) 10Giuseppe Lavagetto: Fix the nightly build behaviour [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 [13:02:45] (03PS4) 10Giuseppe Lavagetto: Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 [13:02:48] (03PS4) 10Giuseppe Lavagetto: Add dependency chain when pruning images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 [13:02:50] (03PS4) 10Giuseppe Lavagetto: Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 [13:02:53] (03PS1) 10Giuseppe Lavagetto: Make ImageFSM instances unique [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/502481 [13:04:44] elukey: should https://phabricator.wikimedia.org/T207760 be resolved? [13:05:38] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: logstash1012 lock up - https://phabricator.wikimedia.org/T220500 (10fgiunchedi) Additionally it looks like syslog traffic towards central logs (wezen/lithium) dropped at around the same time, which is unexpected as the two destinations (central syslog + kafka)... 
[13:06:14] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) Ok I found a simple and hacky way to test the removal of : `... [13:07:16] (03CR) 10Ema: "pcc noop here: https://puppet-compiler.wmflabs.org/compiler1002/15659/" [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [13:07:39] !log rolling security updates of systemd on canary systems [13:07:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:08:29] (03CR) 10Giuseppe Lavagetto: Add an update action (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [13:08:37] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [13:10:01] (03Merged) 10jenkins-bot: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [13:10:33] 10Operations, 10ops-codfw, 10DC-Ops: codfw: rename/relabel labtestneutron2001 to cloudnet2001-dev - https://phabricator.wikimedia.org/T214181 (10faidon) Given T218025, can we resolve this? 
[13:10:36] (03PS1) 10Jbond: cumin: allow ops to read configuration without sudo [puppet] - 10https://gerrit.wikimedia.org/r/502483 [13:11:06] !log building envoy docker image [13:11:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:11:38] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [13:13:18] (03Merged) 10jenkins-bot: Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [13:16:21] paravoid: yes I think so! [13:16:29] please do :) [13:16:37] PROBLEM - rsyslog TLS listener on port 6514 on lithium is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [13:16:37] PROBLEM - puppet last run on sessionstore1002 is CRITICAL: CRITICAL: Catalog fetch fail. Either compilation failed or puppetmaster has issues [13:17:39] RECOVERY - rsyslog TLS listener on port 6514 on lithium is OK: SSL OK - Certificate lithium.eqiad.wmnet valid until 2021-10-23 19:09:29 +0000 (expires in 928 days) https://wikitech.wikimedia.org/wiki/Logs [13:18:35] that's me ^ T220500 [13:18:35] T220500: logstash1012 lock up - https://phabricator.wikimedia.org/T220500 [13:19:19] PROBLEM - rsyslog TLS listener on port 6514 on wezen is CRITICAL: SSL CRITICAL - failed to connect or SSL handshake:Connection reset by peer https://wikitech.wikimedia.org/wiki/Logs [13:19:29] RECOVERY - rsyslog TLS listener on port 6514 on wezen is OK: SSL OK - Certificate wezen.codfw.wmnet valid until 2021-08-21 20:09:05 +0000 (expires in 865 days) https://wikitech.wikimedia.org/wiki/Logs [13:19:32] !log bounce rsyslog on wezen [13:19:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:21:27] (03CR) 10Volans: "Let me add a bit of context here." 
[puppet] - 10https://gerrit.wikimedia.org/r/502483 (owner: 10Jbond) [13:23:10] (03PS2) 10Vgutierrez: acme_chief: Set up acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) [13:24:13] PROBLEM - Long running screen/tmux on prometheus1004 is CRITICAL: CRIT: Long running SCREEN process. (user: root PID: 5229, 1739149s 1728000s). [13:24:15] (03CR) 10Muehlenhoff: "If this is primarily about the aliases (which are public in git already and don't contain any secrets), we could also simply make /etc/cum" [puppet] - 10https://gerrit.wikimedia.org/r/502483 (owner: 10Jbond) [13:24:52] (03CR) 10Giuseppe Lavagetto: puppet: Refactor of the base::puppet class (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [13:26:24] (03PS2) 10Jbond: cumin: allow ops to read configuration without sudo [puppet] - 10https://gerrit.wikimedia.org/r/502483 [13:27:25] (03CR) 10Jbond: "> Patch Set 1:" [puppet] - 10https://gerrit.wikimedia.org/r/502483 (owner: 10Jbond) [13:29:11] (03PS2) 10Muehlenhoff: Enable base::service_auto_restart for mongodb [puppet] - 10https://gerrit.wikimedia.org/r/501336 (https://phabricator.wikimedia.org/T135991) [13:29:46] (03CR) 10Volans: cumin: allow ops to read configuration without sudo (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502483 (owner: 10Jbond) [13:31:27] (03PS3) 10Vgutierrez: acme_chief: Set up acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) [13:31:52] (03CR) 10Muehlenhoff: [C: 03+2] Enable base::service_auto_restart for mongodb [puppet] - 10https://gerrit.wikimedia.org/r/501336 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [13:32:15] (03PS3) 10Jbond: cumin: allow everyone to read /etc/cumin/aliases.yaml [puppet] - 10https://gerrit.wikimedia.org/r/502483 [13:32:21] (03CR) 10Jbond: cumin: allow everyone to read 
/etc/cumin/aliases.yaml (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/502483 (owner: 10Jbond) [13:32:23] (03PS1) 10DCausse: Add a new extension point SshCommandPreExecutionFilter [software/gerrit] (wmf/stable-2.15) - 10https://gerrit.wikimedia.org/r/502487 [13:35:10] (03CR) 10Volans: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/502483 (owner: 10Jbond) [13:35:41] (03PS4) 10Vgutierrez: acme_chief: Set up acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) [13:40:37] PROBLEM - zuul_merger_service_running on contint1001 is CRITICAL: PROCS CRITICAL: 0 processes with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:41:01] (03CR) 10Giuseppe Lavagetto: "> Patch Set 1: Code-Review-1" [puppet] - 10https://gerrit.wikimedia.org/r/501565 (owner: 10Giuseppe Lavagetto) [13:41:55] RECOVERY - zuul_merger_service_running on contint1001 is OK: PROCS OK: 1 process with regex args ^/usr/share/python/zuul/bin/python /usr/bin/zuul-merger https://www.mediawiki.org/wiki/Continuous_integration/Zuul [13:42:23] RECOVERY - puppet last run on sessionstore1002 is OK: OK: Puppet is currently enabled, last run 53 seconds ago with 0 failures [13:44:40] (03CR) 10Jbond: [C: 03+2] cumin: allow everyone to read /etc/cumin/aliases.yaml [puppet] - 10https://gerrit.wikimedia.org/r/502483 (owner: 10Jbond) [13:44:49] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) [13:44:50] (03PS4) 10Jbond: cumin: allow everyone to read /etc/cumin/aliases.yaml [puppet] - 10https://gerrit.wikimedia.org/r/502483 [13:44:57] (03PS5) 10Vgutierrez: acme_chief: Set up acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) [13:46:56] (03PS6) 10Vgutierrez: acme_chief: Set up acmechief-test[12]001 [puppet] - 
10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) [13:47:15] 10Operations, 10ops-codfw, 10decommission: Decommission sarin - https://phabricator.wikimedia.org/T220504 (10MoritzMuehlenhoff) p:05Triage→03Normal [13:47:26] 10Operations, 10ops-eqiad, 10decommission: Decommission neodymium - https://phabricator.wikimedia.org/T220503 (10MoritzMuehlenhoff) p:05Triage→03Normal [13:49:13] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: logstash1012 lock up caused central logging stuck - https://phabricator.wikimedia.org/T220500 (10fgiunchedi) [13:49:59] (03CR) 10Vgutierrez: "pcc is happy: https://puppet-compiler.wmflabs.org/compiler1002/15663/" [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) (owner: 10Vgutierrez) [13:53:14] (03CR) 10Ema: [C: 03+1] acme_chief: Set up acmechief-test[12]001 [puppet] - 10https://gerrit.wikimedia.org/r/502478 (https://phabricator.wikimedia.org/T220378) (owner: 10Vgutierrez) [13:54:52] (03PS18) 10Daimona Eaytoy: Move all AbuseFilter config to abusefilter.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/477063 (https://phabricator.wikimedia.org/T145931) [13:56:35] (03PS19) 10Daimona Eaytoy: Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 [13:57:47] (03CR) 10jerkins-bot: [V: 04-1] Update AbuseFilter config to keep the status quo [mediawiki-config] - 10https://gerrit.wikimedia.org/r/475772 (owner: 10Daimona Eaytoy) [13:58:13] (03CR) 10Lucas Werkmeister (WMDE): [C: 03+1] "Do you want to add this to a SWAT soon?" 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/500692 (https://phabricator.wikimedia.org/T218767) (owner: 10Greta WMDE) [13:58:33] RECOVERY - cassandra-a CQL 10.192.32.119:9042 on restbase2020 is OK: TCP OK - 0.036 second response time on 10.192.32.119 port 9042 https://phabricator.wikimedia.org/T93886 [13:58:41] (03PS6) 10Daimona Eaytoy: Rename globals and rights in AbuseFilter config [mediawiki-config] - 10https://gerrit.wikimedia.org/r/480074 [14:00:07] 10Operations, 10ops-eqiad, 10Wikimedia-Logstash: logstash1012 lock up caused central logging stuck - https://phabricator.wikimedia.org/T220500 (10fgiunchedi) For exact reasons still unknown it looks like traffic to lithium/wezen also stopped around the same time logstash1012 went offline: {F28598272} {F2859... [14:00:19] (03PS2) 10Giuseppe Lavagetto: apt: remove redundant Install-Recommends [puppet] - 10https://gerrit.wikimedia.org/r/501565 [14:04:08] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 (owner: 10Giuseppe Lavagetto) [14:04:58] (03CR) 10Alexandros Kosiaris: [C: 03+1] "LGTM, My only wish would be a test of successfully failing to re-register the same image" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/502481 (owner: 10Giuseppe Lavagetto) [14:05:45] (03CR) 10Giuseppe Lavagetto: "> Patch Set 1: Code-Review+1" [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/502481 (owner: 10Giuseppe Lavagetto) [14:06:41] RECOVERY - kubelet operational latencies on kubernetes2003 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:06:57] (03CR) 10Alexandros Kosiaris: Add an update action (031 comment) [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [14:09:35] !log bootstrapping cassandra-b, restbase2020 -- T208087 [14:09:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:09:38] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [14:10:50] !log reboot lvs2010 with systemd 232 T209707 [14:10:53] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:10:56] T209707: tagged_interface sometimes exceeds IFNAMSIZ - https://phabricator.wikimedia.org/T209707 [14:11:40] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['elastic2048.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201904091411_gehel_2770... 
[14:11:57] (03CR) 10Vgutierrez: [C: 03+1] "this looks like a nice start <3" [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:14:17] 10Operations: Add yq package to our apt repo - https://phabricator.wikimedia.org/T220509 (10Ottomata) [14:14:29] 10Operations: Add yq package to our apt repo - https://phabricator.wikimedia.org/T220509 (10Ottomata) p:05Triage→03Normal [14:14:44] (03CR) 10Filippo Giunchedi: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501560 (https://phabricator.wikimedia.org/T219967) (owner: 10Ema) [14:14:56] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2048.codfw.wmnet'] ` Of which those **FAILED**: ` ['elastic2048.codfw.wmnet'] ` [14:15:22] (03PS2) 10Giuseppe Lavagetto: Make ImageFSM instances unique [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/502481 [14:17:33] (03PS8) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [14:19:13] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by gehel on cumin2001.codfw.wmnet for hosts: ` ['elastic2048.codfw.wmnet'] ` The log can be found in `/var/log/wmf-auto-reimage/201904091419_gehel_3041... 
[14:21:07] (03CR) 10jenkins-bot: Add an update action [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/487793 (owner: 10Giuseppe Lavagetto) [14:22:24] (03CR) 10jenkins-bot: Move pulling logic to us, away from the docker daemon [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501017 (owner: 10Giuseppe Lavagetto) [14:23:03] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Make ImageFSM instances unique [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/502481 (owner: 10Giuseppe Lavagetto) [14:23:13] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Fix the nightly build behaviour [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 (owner: 10Giuseppe Lavagetto) [14:24:07] (03PS9) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [14:25:35] (03PS1) 10Alex Monk: Move maintenance_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502499 [14:26:24] (03CR) 10jerkins-bot: [V: 04-1] Move maintenance_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502499 (owner: 10Alex Monk) [14:28:14] (03PS5) 10Giuseppe Lavagetto: Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 [14:28:55] 10Operations, 10Traffic: Removal of If-Cached VCL support - https://phabricator.wikimedia.org/T220510 (10ema) [14:29:02] 10Operations, 10Traffic: Removal of If-Cached VCL support - https://phabricator.wikimedia.org/T220510 (10ema) p:05Triage→03Normal [14:31:04] (03PS10) 10Jbond: puppet: Refactor of the base::puppet class [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) [14:31:15] (03PS1) 10Effie Mouzeli: Updated version name and debian release [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/502502 [14:32:23] (03PS1) 10Vgutierrez: archiva: Offer an ECDSA certificate along with the RSA one 
[puppet] - 10https://gerrit.wikimedia.org/r/502503 (https://phabricator.wikimedia.org/T220359) [14:32:38] (03CR) 10Andrew Bogott: [C: 03+1] lvs: Avoid tagged network interfaces to hit IFNAMSIZ (15+\0) limit [puppet] - 10https://gerrit.wikimedia.org/r/474272 (https://phabricator.wikimedia.org/T209707) (owner: 10Vgutierrez) [14:33:18] (03CR) 10Jbond: "https://integration.wikimedia.org/ci/job/operations-puppet-catalog-compiler/15666/console" [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [14:33:46] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add dependency chain when pruning images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 (owner: 10Giuseppe Lavagetto) [14:34:04] (03PS2) 10Alex Monk: Move maintenance_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502499 [14:34:05] 10Operations, 10ops-codfw: Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['elastic2048.codfw.wmnet'] ` and were **ALL** successful. [14:34:15] 10Operations, 10Traffic: Removal of If-Cached VCL support - https://phabricator.wikimedia.org/T220510 (10fgiunchedi) +1 on my side to remove `If-Cached` as we're not using swiftrepl normally. When we do use swiftrepl however it isn't through varnish anyways but swift <-> swift directly. [14:34:48] (03CR) 10Vgutierrez: [C: 03+1] "pcc looks happy: https://puppet-compiler.wmflabs.org/compiler1001/15667/" [puppet] - 10https://gerrit.wikimedia.org/r/502503 (https://phabricator.wikimedia.org/T220359) (owner: 10Vgutierrez) [14:39:58] (03PS5) 10Giuseppe Lavagetto: Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 [14:41:07] RECOVERY - kubelet operational latencies on kubernetes2002 is OK: All metrics within thresholds. 
https://grafana.wikimedia.org/dashboard/db/kubernetes-kubelets?orgId=1 [14:43:27] 10Operations: Add yq package to our apt repo - https://phabricator.wikimedia.org/T220509 (10Ottomata) Am having some difficulty: https://github.com/mikefarah/yq/issues/231 [14:43:44] 10Operations, 10ops-codfw, 10Discovery-Search (Current work): Degraded RAID on elastic2048 - https://phabricator.wikimedia.org/T220038 (10Gehel) Reimage was problematic, with first a puppet failure and then the server not booting over PXE. Manually booting in PXE (F12) finally fixed the issue. Reimage is co... [14:45:13] (03CR) 10Gilles: [C: 03+2] Updated version name and debian release [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/502502 (owner: 10Effie Mouzeli) [14:47:33] (03PS1) 10Vgutierrez: icinga: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502505 [14:49:49] (03CR) 10Giuseppe Lavagetto: [C: 03+2] Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 (owner: 10Giuseppe Lavagetto) [14:51:05] (03CR) 10jenkins-bot: Make ImageFSM instances unique [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/502481 (owner: 10Giuseppe Lavagetto) [14:51:16] (03PS1) 10Vgutierrez: gerrit: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502506 (https://phabricator.wikimedia.org/T220359) [14:53:13] (03CR) 10jenkins-bot: Fix the nightly build behaviour [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501183 (owner: 10Giuseppe Lavagetto) [14:53:53] (03CR) 10jenkins-bot: Depend on docker-py 3.x [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501184 (owner: 10Giuseppe Lavagetto) [14:54:18] (03CR) 10jenkins-bot: Add dependency chain when pruning images [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501185 (owner: 10Giuseppe Lavagetto) [14:55:11] RECOVERY - Work requests waiting in Zuul Gearman server on contint1001 is OK: OK: Less 
than 30.00% above the threshold [90.0] https://grafana.wikimedia.org/dashboard/db/zuul-gearman?panelId=10&fullscreen&orgId=1 [14:55:32] (03CR) 10Alex Monk: "Great. There was some discussion about the ferm macro back on PS3. I'm going to try to deal with network::constants first, lets come back " [puppet] - 10https://gerrit.wikimedia.org/r/499355 (owner: 10Alex Monk) [14:57:00] (03PS1) 10Vgutierrez: install_server: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502509 (https://phabricator.wikimedia.org/T220359) [14:57:47] (03CR) 10Jbond: [C: 03+1] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/501565 (owner: 10Giuseppe Lavagetto) [14:59:28] (03PS1) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/502511 [14:59:48] (03PS1) 10Vgutierrez: dumps: Offer an ECDSA certificate along with the RSA one [puppet] - 10https://gerrit.wikimedia.org/r/502513 (https://phabricator.wikimedia.org/T220359) [15:03:11] 10Operations, 10Analytics, 10Analytics-Kanban, 10Patch-For-Review, 10User-Elukey: Archival of home directories on servers with very large homes - https://phabricator.wikimedia.org/T215171 (10Milimetric) a:03elukey [15:03:55] (03CR) 10jenkins-bot: Add changelog [docker-images/docker-pkg] - 10https://gerrit.wikimedia.org/r/501186 (owner: 10Giuseppe Lavagetto) [15:04:39] !log uploaded jenkins 2.164.1 for jessie-wikimedia/thirdparty [15:04:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:05:58] !log uploaded jenkins 2.164.1 for stretch-wikimedia/thirdparty/ci [15:06:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:14:09] 10Operations, 10Acme-chief, 10Traffic, 10HTTPS: acme-chief: Validate that configured certificates can be actually issued - https://phabricator.wikimedia.org/T220518 (10Vgutierrez) [15:14:24] 10Operations, 10Acme-chief, 10Traffic, 10HTTPS: acme-chief: Validate that configured 
certificates can be actually issued - https://phabricator.wikimedia.org/T220518 (10Vgutierrez) p:05Triage→03Normal [15:19:49] (03PS1) 10Arturo Borrero Gonzalez: ldap: include sssd cleanup in the classic stack [puppet] - 10https://gerrit.wikimedia.org/r/502519 (https://phabricator.wikimedia.org/T218126) [15:25:14] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] ldap: include sssd cleanup in the classic stack [puppet] - 10https://gerrit.wikimedia.org/r/502519 (https://phabricator.wikimedia.org/T218126) (owner: 10Arturo Borrero Gonzalez) [15:29:26] (03PS9) 10Ema: role::cache::upload_ats: mixed Varnish/ATS setup [puppet] - 10https://gerrit.wikimedia.org/r/501360 (https://phabricator.wikimedia.org/T219967) [15:37:18] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-rang... [15:37:30] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10Papaul) [15:38:36] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10Papaul) a:05Papaul→03aborrero @aborrero done [15:47:02] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10Papaul) >>! In T220426#5097165, @aborrero wrote: >>>! In T220426#5097145, @gerritbot wrote: >> Change 502476 had a related patch set upl... 
[15:47:54] (03PS1) 10Gilles: Pass Swift secret in test_temp [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502521 (https://phabricator.wikimedia.org/T220265) [15:49:31] (03PS3) 10Volans: zone_validator: catch parse errors [dns] - 10https://gerrit.wikimedia.org/r/481833 [15:51:24] (03PS3) 10Bstorm: postgresql: set max_wal_senders on slave conf [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) [15:51:58] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10Papaul) ` papaul@asw-b-codfw# show | compare [edit interfaces interface-range vlan-public1-b-codfw] member ge-1/0/18 { ... } +... [15:52:26] (03CR) 10Gilles: [V: 03+2 C: 03+2] Pass Swift secret in test_temp [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502521 (https://phabricator.wikimedia.org/T220265) (owner: 10Gilles) [15:52:55] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10Papaul) a:05Papaul→03aborrero @aborrero switch configuration done [15:53:07] 10Operations, 10ops-codfw, 10DC-Ops, 10Patch-For-Review, 10cloud-services-team (Kanban): labtestmetal2001.codfw.wmnet: rename to clouddb2001-dev and reimage to stretch in codfw1dev - https://phabricator.wikimedia.org/T220129 (10aborrero) 05Open→03Resolved [15:53:21] (03PS1) 10Gilles: Version bump [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502523 (https://phabricator.wikimedia.org/T220265) [15:53:33] (03CR) 10Gilles: [V: 03+2 C: 03+2] Version bump [software/thumbor-plugins] - 10https://gerrit.wikimedia.org/r/502523 (https://phabricator.wikimedia.org/T220265) (owner: 10Gilles) [15:55:44] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - 
https://phabricator.wikimedia.org/T220426 (10aborrero) [15:56:56] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2002: rename to cloudweb2001-dev.wikimedia.org [dns] - 10https://gerrit.wikimedia.org/r/502476 (https://phabricator.wikimedia.org/T220426) (owner: 10Arturo Borrero Gonzalez) [15:57:10] (03PS3) 10Arturo Borrero Gonzalez: labtestnet2002: rename and repurpose as cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/502474 (https://phabricator.wikimedia.org/T220426) [15:57:58] (03PS1) 10Gilles: Upgrade to 2.4 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/502526 (https://phabricator.wikimedia.org/T220265) [15:58:24] (03CR) 10Gilles: [C: 03+2] Upgrade to 2.4 [debs/python-thumbor-wikimedia] - 10https://gerrit.wikimedia.org/r/502526 (https://phabricator.wikimedia.org/T220265) (owner: 10Gilles) [15:58:27] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] labtestnet2002: rename and repurpose as cloudweb2001-dev.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/502474 (https://phabricator.wikimedia.org/T220426) (owner: 10Arturo Borrero Gonzalez) [16:00:04] godog and _joe_: #bothumor Q:Why did functions stop calling each other? A:They had arguments. Rise for Puppet SWAT(Max 6 patches) . (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1600). [16:00:04] No GERRIT patches in the queue for this window AFAICS. 
[16:02:54] (03CR) 10Volans: [C: 03+1] "LGTM" (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 (owner: 10CRusnov) [16:03:23] (03PS1) 10Herron: ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) [16:03:40] (03PS2) 10Muehlenhoff: Initial Kerberos KDC/kadmin server profiles/roles (WIP) [puppet] - 10https://gerrit.wikimedia.org/r/502511 [16:04:48] (03CR) 10Nuria: [C: 03+1] Add dr0ptp4kt to gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/502341 (https://phabricator.wikimedia.org/T148843) (owner: 10Dr0ptp4kt) [16:09:52] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Add synchronizing nodes to ganeti-netbox sync. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/498268 (owner: 10CRusnov) [16:18:20] !log Uploading python-thumbor-wikimedia_2.4-1+deb9u1 to component/thumbor in stretch-wikimedia [16:18:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:18:27] (03CR) 10Volans: "Few comments/replies inline" (037 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [16:19:06] (03CR) 10Arturo Borrero Gonzalez: [C: 03+1] "After team meeting talk, this LGTM +1." [puppet] - 10https://gerrit.wikimedia.org/r/493767 (https://phabricator.wikimedia.org/T151704) (owner: 10BryanDavis) [16:19:51] (03PS1) 10CRusnov: Revert "Add synchronizing nodes to ganeti-netbox sync." [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502531 [16:20:36] (03Abandoned) 10CRusnov: Revert "Add synchronizing nodes to ganeti-netbox sync." 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502531 (owner: 10CRusnov) [16:25:31] (03PS1) 10Elukey: icinga::monitoring::services: remove analytics from eventbus alerts [puppet] - 10https://gerrit.wikimedia.org/r/502533 (https://phabricator.wikimedia.org/T220477) [16:26:07] !log Upgrading thumbor1001 to python-thumbor-wikimedia_2.4-1+deb9u1 [16:26:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:26:23] (03PS1) 10Ottomata: eventgate-analytics - Use new readinessProbe post-events flag to set meta.dt [deployment-charts] - 10https://gerrit.wikimedia.org/r/502534 (https://phabricator.wikimedia.org/T219513) [16:26:26] (03CR) 10Elukey: [C: 03+2] icinga::monitoring::services: remove analytics from eventbus alerts [puppet] - 10https://gerrit.wikimedia.org/r/502533 (https://phabricator.wikimedia.org/T220477) (owner: 10Elukey) [16:27:37] (03CR) 10Volans: "Forgot one..." (031 comment) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [16:28:13] jouncebot: now [16:28:14] For the next 0 hour(s) and 31 minute(s): Puppet SWAT(Max 6 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1600) [16:28:15] jouncebot: next [16:28:16] In 0 hour(s) and 31 minute(s): Services – Graphoid / Parsoid / Citoid / ORES (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1700) [16:28:48] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Use new readinessProbe post-events flag to set meta.dt [deployment-charts] - 10https://gerrit.wikimedia.org/r/502534 (https://phabricator.wikimedia.org/T219513) (owner: 10Ottomata) [16:28:56] !log Restarting thumbor service on thumbor1001 [16:28:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:20] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: 
staging] [16:32:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:22] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:32:23] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:32:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:32:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:38:00] (03CR) 10Ottomata: [C: 03+1] admin: create analytics-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) (owner: 10Elukey) [16:38:27] (03CR) 10Ottomata: [C: 03+1] Rely only on ores::base for common packages deployed to Analytics misc [puppet] - 10https://gerrit.wikimedia.org/r/502233 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [16:39:44] (03PS1) 10Ottomata: eventgate-analytics - Use meta.dt as dt_field for Kafka message timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/502539 (https://phabricator.wikimedia.org/T219513) [16:40:08] (03CR) 10Ottomata: [V: 03+2 C: 03+2] eventgate-analytics - Use meta.dt as dt_field for Kafka message timestamp [deployment-charts] - 10https://gerrit.wikimedia.org/r/502539 (https://phabricator.wikimedia.org/T219513) (owner: 10Ottomata) [16:41:24] !log performing rolling restart of kafka main brokers and eventbus instances in eqiad to pick up security updates [16:41:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:43:05] 10Operations, 10Analytics, 10EventBus, 10Patch-For-Review, 10Services (doing): Eventbus errors: Failed processing event: Failed validating at path rev_id - https://phabricator.wikimedia.org/T220477 (10Pchelolo) Sorry about that. Fixed by above patch. 
[16:45:10] !log otto@deploy1001 scap-helm eventgate-analytics upgrade staging -f eventgate-analytics-staging-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: staging] [16:45:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:12] !log otto@deploy1001 scap-helm eventgate-analytics cluster staging completed [16:45:12] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:45:14] (03CR) 10Cwhite: [C: 03+1] ores: ship to logstash via the kafka logging pipeline [puppet] - 10https://gerrit.wikimedia.org/r/502527 (https://phabricator.wikimedia.org/T213899) (owner: 10Herron) [16:45:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:45:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:48] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-codfw-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: codfw] [16:46:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:51] !log otto@deploy1001 scap-helm eventgate-analytics cluster codfw completed [16:46:51] !log otto@deploy1001 scap-helm eventgate-analytics finished [16:46:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:46:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:48:56] (03CR) 10Ottomata: [C: 03+1] aptrepo: update cloudera-jessie to 5.16.1 [puppet] - 10https://gerrit.wikimedia.org/r/500453 (https://phabricator.wikimedia.org/T218343) (owner: 10Elukey) [16:49:28] !log otto@deploy1001 scap-helm eventgate-analytics upgrade production -f eventgate-analytics-eqiad-values.yaml --reset-values stable/eventgate-analytics [namespace: eventgate-analytics, clusters: eqiad] [16:49:29] !log otto@deploy1001 scap-helm eventgate-analytics cluster eqiad completed [16:49:29] !log otto@deploy1001 
scap-helm eventgate-analytics finished [16:49:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:49:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:52:55] (03PS1) 10Andrew Bogott: cloudvirts: repool cloudvirt1012 and 1015 [puppet] - 10https://gerrit.wikimedia.org/r/502540 [16:53:41] (03PS2) 10Nuria: Removing TestSearchSatisfaction from it being persisted to MySQL [puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) [16:54:08] (03CR) 10Andrew Bogott: [C: 03+2] cloudvirts: repool cloudvirt1012 and 1015 [puppet] - 10https://gerrit.wikimedia.org/r/502540 (owner: 10Andrew Bogott) [16:54:19] (03PS2) 10Elukey: Rely only on ores::base for common packages deployed to Analytics misc [puppet] - 10https://gerrit.wikimedia.org/r/502233 (https://phabricator.wikimedia.org/T148843) [16:55:04] (03CR) 10Volans: [C: 04-1] "Some general comments, mostly on the python script, see inline." (0315 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [16:55:14] (03PS3) 10Ottomata: Removing TestSearchSatisfaction from it being persisted to MySQL [puppet] - 10https://gerrit.wikimedia.org/r/500076 (https://phabricator.wikimedia.org/T216055) (owner: 10Nuria) [16:56:03] (03CR) 10Elukey: [C: 03+2] Rely only on ores::base for common packages deployed to Analytics misc [puppet] - 10https://gerrit.wikimedia.org/r/502233 (https://phabricator.wikimedia.org/T148843) (owner: 10Elukey) [17:00:04] cscott, arlolra, subbu, halfak, and Amir1: #bothumor My software never has bugs. It just develops random features. Rise for Services – Graphoid / Parsoid / Citoid / ORES. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1700). 
[17:00:50] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` labtestnet2002.codfw.wmnet `... [17:01:14] !log T220426 reimaging+renaming labtestnet2002 to cloudweb2001-dev [17:01:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:01:18] T220426: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 [17:03:17] !log deploying https://gerrit.wikimedia.org/r/c/mediawiki/core/+/502538/1 and https://gerrit.wikimedia.org/r/c/mediawiki/core/+/502537/1 [17:03:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:03:24] apergos: ^ [17:03:41] woo hoo! [17:03:57] that is a load off my mind [17:04:16] !log gilles@deploy1001 Synchronized php-1.33.0-wmf.24/includes/specials/SpecialUploadStash.php: T220265 Add support for X-Swift-Secret to upload stash (duration: 00m 53s) [17:04:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:04:20] T220265: mw:thumbor swift user doesn't have access to wikipedia-commons-local-temp.* swift containers - https://phabricator.wikimedia.org/T220265 [17:09:20] !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.24/includes/export/XmlDumpWriter.php: deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/502538/1 (duration: 00m 52s) [17:09:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:10:01] 10Operations, 10Thumbor, 10serviceops: Export useful metrics from haproxy logs for Thumbor - https://phabricator.wikimedia.org/T220499 (10Gilles) For reference: https://medium.com/@tom.fawcett/extracting-useful-duration-metrics-from-haproxy-prometheus-fluentd-2be9832ff702 We can do the same with mtail. 
[17:10:30] err BadMethodCallException from line 639 of /srv/mediawiki/php-1.33.0-wmf.24/includes/Revision.php: Call to a member function getId() on a non-object (null) [17:10:42] (not related to my deployment but bad nonetheless) [17:11:57] also: Canary error check failed for 1 canaries, less than threshold to halt deployment [17:12:07] something is borked on mwdebug1001 [17:12:43] (03PS3) 10Elukey: Add dr0ptp4kt to gpu-testers [puppet] - 10https://gerrit.wikimedia.org/r/502341 (https://phabricator.wikimedia.org/T148843) (owner: 10Dr0ptp4kt) [17:14:18] !log twentyafterfour@deploy1001 Synchronized php-1.33.0-wmf.24/includes/export/WikiExporter.php: deploy https://gerrit.wikimedia.org/r/c/mediawiki/core/+/502537/1 (duration: 00m 51s) [17:14:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:14:28] 10Operations, 10Analytics, 10EventBus, 10Core Platform Team (Security, stability, performance and scalability (TEC1)), and 2 others: Eventbus errors: Failed processing event: Failed validating at path rev_id - https://phabricator.wikimedia.org/T220477 (10mobrovac) 05Open→03Resolved Patch merged, will g... [17:15:26] !log bsitzmann@deploy1001 Started deploy [mobileapps/deploy@b04c397]: Update mobileapps to 3edfcad (T220045 T219411 T219667) [17:15:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:15:32] T219411: External links no longer have arrow icon in Light theme. - https://phabricator.wikimedia.org/T219411 [17:15:32] T219667: Add wikibase entity id for image files to media endpoint - https://phabricator.wikimedia.org/T219667 [17:15:33] T220045: Stop getting base CSS from live ResourceLoader requests - https://phabricator.wikimedia.org/T220045 [17:15:34] (03CR) 10Elukey: [C: 03+2] "Given that the group does not grant any sudo and Nuria already approved with her +1, merging." 
[puppet] - 10https://gerrit.wikimedia.org/r/502341 (https://phabricator.wikimedia.org/T148843) (owner: 10Dr0ptp4kt) [17:15:45] the error on mwdebug1001 is: Fatal error: Class undefined: CirrusSearch\Connection in /srv/mediawiki/php-1.33.0-wmf.24/extensions/CirrusSearch/includes/Hooks.php on line 673 [17:18:01] 10Operations, 10Discovery-Search (Current work), 10Patch-For-Review, 10Wikimedia-Incident: Create cookbook to reset readonly indices on elasticsearch clusters - https://phabricator.wikimedia.org/T219799 (10Gehel) a:03Gehel [17:19:16] !log bsitzmann@deploy1001 Finished deploy [mobileapps/deploy@b04c397]: Update mobileapps to 3edfcad (T220045 T219411 T219667) (duration: 03m 50s) [17:19:51] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:22:53] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10elukey) >>! In T148843#5095613, @dr0ptp4kt wrote: > Hi, I'm requesting... [17:24:18] (03PS1) 10Arturo Borrero Gonzalez: cloudweb2001-dev: missing netboot partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/502552 (https://phabricator.wikimedia.org/T220426) [17:25:11] (03CR) 10Arturo Borrero Gonzalez: [C: 03+2] cloudweb2001-dev: missing netboot partman recipe [puppet] - 10https://gerrit.wikimedia.org/r/502552 (https://phabricator.wikimedia.org/T220426) (owner: 10Arturo Borrero Gonzalez) [17:25:45] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudweb2001-dev.wikimedia.org'] ` Of which those **FAILED**: ` ['cloudweb2001... 
[17:27:49] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` labtestnet2002.codfw.wmnet `... [17:27:53] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['labtestnet2002.codfw.wmnet'] ` Of which those **FAILED**: ` ['labtestnet2002.c... [17:28:22] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10ops-monitoring-bot) Script wmf-auto-reimage was launched by aborrero on cumin1001.eqiad.wmnet for hosts: ` cloudweb2001-dev.wikimedia.or... [17:30:42] (03PS3) 10Elukey: admin: create analytics-deployers group [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) [17:32:23] (03CR) 10Elukey: [C: 03+2] "Since this group doesn't involve any sudo, and a lot of people +1ed it (including Nuria) I am going to merge." [puppet] - 10https://gerrit.wikimedia.org/r/501578 (https://phabricator.wikimedia.org/T220175) (owner: 10Elukey) [17:36:06] PROBLEM - Long running screen/tmux on sessionstore1001 is CRITICAL: CRIT: Long running SCREEN process. (user: eevans PID: 220733, 1736527s 1728000s). [17:36:43] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting ability to scap-deploy on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) Done! 
@Gilles let me know if you can now deploy to stat1007 :) [17:37:16] !log gilles@deploy1001 Started deploy [performance/asoranking@4c83130]: (no justification provided) [17:37:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:37:26] !log gilles@deploy1001 Finished deploy [performance/asoranking@4c83130]: (no justification provided) (duration: 00m 10s) [17:37:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:41:01] jouncebot: next [17:41:01] In 0 hour(s) and 18 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1800) [17:42:19] !log restart keyholder-proxy.service on deploy1001 as attempt to reload perms for the analytics_deploy key [17:42:20] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:29] !log gilles@deploy1001 Started deploy [performance/asoranking@4c83130]: (no justification provided) [17:42:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:42:34] !log gilles@deploy1001 Finished deploy [performance/asoranking@4c83130]: (no justification provided) (duration: 00m 04s) [17:42:36] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:45:08] (03PS13) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [17:46:33] (03CR) 10CRusnov: "> Patch Set 12:" (039 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:49:27] (03CR) 10jerkins-bot: [V: 04-1] Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) (owner: 10CRusnov) [17:50:33] (03PS4) 10Bstorm: postgresql: set max_wal_senders on slave conf [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) 
[17:54:27] (03CR) 10Bstorm: [C: 03+2] postgresql: set max_wal_senders on slave conf [puppet] - 10https://gerrit.wikimedia.org/r/501384 (https://phabricator.wikimedia.org/T219652) (owner: 10Bstorm) [17:56:07] !log restart keyholder-agent on deploy1001 to pick up new settings for analytics (+ arm all the keys) [17:56:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:10] !log gilles@deploy1001 Started deploy [performance/asoranking@4c83130]: (no justification provided) [17:58:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:58:12] !log gilles@deploy1001 Finished deploy [performance/asoranking@4c83130]: (no justification provided) (duration: 00m 02s) [17:58:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1800) [18:04:34] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10ops-monitoring-bot) Completed auto-reimage of hosts: ` ['cloudweb2001-dev.wikimedia.org'] ` and were **ALL** successful. 
[18:04:51] (03PS1) 10Elukey: role::deployment_server: add analytics-deployers to admin groups [puppet] - 10https://gerrit.wikimedia.org/r/502562 (https://phabricator.wikimedia.org/T220175) [18:06:23] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) [18:06:52] (03CR) 10Elukey: [C: 03+2] role::deployment_server: add analytics-deployers to admin groups [puppet] - 10https://gerrit.wikimedia.org/r/502562 (https://phabricator.wikimedia.org/T220175) (owner: 10Elukey) [18:07:13] !log bootstrapping cassandra-c, restbase2020 -- T208087 [18:07:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:07:18] 10Operations, 10ops-codfw, 10Cloud-VPS, 10DC-Ops, and 2 others: labtestnet2002: repurpose as cloudweb2001-dev.wikimedia.org - https://phabricator.wikimedia.org/T220426 (10aborrero) [18:07:32] (03PS1) 10Smalyshev: Enable revision fetching on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/502564 [18:07:40] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [18:09:36] (03PS8) 10CRusnov: Add basic Ganeti RAPI module and tests [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 [18:10:52] (03CR) 10CRusnov: "Woowee." 
(038 comments) [software/spicerack] - 10https://gerrit.wikimedia.org/r/499032 (owner: 10CRusnov) [18:11:24] !log gilles@deploy1001 Started deploy [performance/asoranking@4c83130]: (no justification provided) [18:11:26] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:11:27] !log gilles@deploy1001 Finished deploy [performance/asoranking@4c83130]: (no justification provided) (duration: 00m 03s) [18:11:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:13:02] 10Operations, 10SRE-Access-Requests, 10Patch-For-Review, 10User-Elukey: Requesting ability to scap-deploy on stat1007 for gilles - https://phabricator.wikimedia.org/T220175 (10elukey) 05Open→03Resolved Confirmed with Gilles that now he is able to deploy to stat1007. [18:13:16] 10Operations, 10Core Platform Team Backlog, 10MediaWiki-General-or-Unknown, 10serviceops, and 3 others: Some pages will become completely unreachable after PHP7 update due to Unicode changes - https://phabricator.wikimedia.org/T219279 (10Esanders) > @Esanders is it a huge deal if, in light of the blast rad... [18:13:17] RECOVERY - Check systemd state on restbase2020 is OK: OK - running: The system is fully operational [18:15:04] (03PS1) 10C. Scott Ananian: Default to Preprocessor_Hash on both PHP7 and HHVM [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502567 (https://phabricator.wikimedia.org/T216664) [18:16:39] (03PS1) 10Andrew Bogott: clouddb-services: add some rudimentary monitoring [puppet] - 10https://gerrit.wikimedia.org/r/502568 [18:20:53] 10Operations, 10Operations-Software-Development: Cumin: add backend for Netbox - https://phabricator.wikimedia.org/T205900 (10crusnov) It'd be neat if teh code written for spicerack for this purpose could be reused somehow. 
[18:21:21] (03PS2) 10Bstorm: clouddb-services: add some rudimentary monitoring [puppet] - 10https://gerrit.wikimedia.org/r/502568 (https://phabricator.wikimedia.org/T220531) (owner: 10Andrew Bogott) [18:21:25] (03PS1) 10Ottomata: Remove unused rsync module on thorium for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/502569 (https://phabricator.wikimedia.org/T205113) [18:22:29] (03PS2) 10Ottomata: Remove unused rsync module on thorium for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/502569 (https://phabricator.wikimedia.org/T205113) [18:23:36] (03CR) 10Ottomata: [C: 03+2] Remove unused rsync module on thorium for stats.wikimedia.org [puppet] - 10https://gerrit.wikimedia.org/r/502569 (https://phabricator.wikimedia.org/T205113) (owner: 10Ottomata) [18:23:46] (03CR) 10Bstorm: [C: 03+2] clouddb-services: add some rudimentary monitoring [puppet] - 10https://gerrit.wikimedia.org/r/502568 (https://phabricator.wikimedia.org/T220531) (owner: 10Andrew Bogott) [18:23:59] (03PS3) 10Bstorm: clouddb-services: add some rudimentary monitoring [puppet] - 10https://gerrit.wikimedia.org/r/502568 (https://phabricator.wikimedia.org/T220531) (owner: 10Andrew Bogott) [18:26:21] !log crusnov@deploy1001 Started deploy [netbox/deploy@4aa3e47]: Add node sync to Netbox-Ganeti sync script - T215229 [18:26:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:26:25] T215229: Keep Ganeti VMs synchronized in Netbox - https://phabricator.wikimedia.org/T215229 [18:27:18] !log crusnov@deploy1001 Finished deploy [netbox/deploy@4aa3e47]: Add node sync to Netbox-Ganeti sync script - T215229 (duration: 00m 57s) [18:27:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:37:01] FYI icinga about to be restarted in few minutes [18:38:01] !log T196336 cdanis@icinga1001$ sudo systemctl restart nsca [18:38:04] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:38:05] T196336: Icinga 
passive checks go awol and downtime stops working - https://phabricator.wikimedia.org/T196336 [18:38:37] (03PS1) 10CRusnov: Fix url join operation. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502573 [18:39:42] (03PS1) 10Ottomata: Run published-datasets-rsynced module as stats:wikidev [puppet] - 10https://gerrit.wikimedia.org/r/502574 (https://phabricator.wikimedia.org/T205113) [18:40:37] (03CR) 10Ottomata: [C: 03+2] Run published-datasets-rsynced module as stats:wikidev [puppet] - 10https://gerrit.wikimedia.org/r/502574 (https://phabricator.wikimedia.org/T205113) (owner: 10Ottomata) [18:40:44] (03PS2) 10Ottomata: Run published-datasets-rsynced module as stats:wikidev [puppet] - 10https://gerrit.wikimedia.org/r/502574 (https://phabricator.wikimedia.org/T205113) [18:40:46] (03CR) 10Ottomata: [V: 03+2 C: 03+2] Run published-datasets-rsynced module as stats:wikidev [puppet] - 10https://gerrit.wikimedia.org/r/502574 (https://phabricator.wikimedia.org/T205113) (owner: 10Ottomata) [18:42:12] !log restart icinga on icinga1001 - T196336 [18:42:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:44:18] (03CR) 10Volans: Fix url join operation. (031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502573 (owner: 10CRusnov) [18:45:02] (03PS2) 10CRusnov: Fix url join operation. [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502573 [18:45:24] (03CR) 10CRusnov: Fix url join operation. 
(031 comment) [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502573 (owner: 10CRusnov) [18:46:11] (03CR) 10Volans: [C: 03+1] "LGTM" [software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502573 (owner: 10CRusnov) [18:46:13] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@43d2d2e]: Gerrit update (gerrit2001 only) [18:46:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:46:24] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@43d2d2e]: Gerrit update (gerrit2001 only) (duration: 00m 10s) [18:46:25] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:47:54] !log thcipriani@deploy1001 Started deploy [gerrit/gerrit@43d2d2e]: Gerrit update (cobalt) -- restart incoming [18:47:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:04] !log thcipriani@deploy1001 Finished deploy [gerrit/gerrit@43d2d2e]: Gerrit update (cobalt) -- restart incoming (duration: 00m 10s) [18:48:06] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:48:27] !log gerrit restart [18:48:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:50:28] !log gerrit back [18:50:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:51:16] (03CR) 10CRusnov: [V: 03+2 C: 03+2] Fix url join operation. 
[software/netbox-deploy] - 10https://gerrit.wikimedia.org/r/502573 (owner: 10CRusnov) [18:52:16] PROBLEM - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_eqiad_sync [18:52:26] PROBLEM - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is CRITICAL: CRITICAL: Status of the systemd unit netbox_ganeti_codfw_sync [18:52:34] !log crusnov@deploy1001 Started deploy [netbox/deploy@018d83e]: Minor fix to Netbox-Ganeti sync script [18:52:37] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:53:02] PROBLEM - Check systemd state on netmon1002 is CRITICAL: CRITICAL - degraded: The system is operational but one or more units failed. [18:53:10] 10Operations, 10Analytics, 10EventBus, 10vm-requests, and 3 others: Create schema[12]00[12] (schema.svc.{eqiad,codfw}.wmnet) - https://phabricator.wikimedia.org/T219556 (10Ottomata) @akosiaris, I'd love to do this sooner rather than later. It'd make some configuration/deployment stuff in the Hive/Hadoop w... [18:53:11] ^ fix incoming [18:53:26] !log crusnov@deploy1001 Finished deploy [netbox/deploy@018d83e]: Minor fix to Netbox-Ganeti sync script (duration: 00m 52s) [18:53:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:00:04] twentyafterfour: #bothumor I � Unicode. All rise for MediaWiki train - Americas version deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T1900). 
[19:03:07] (03PS14) 10CRusnov: Netbox module for Spicerack [software/spicerack] - 10https://gerrit.wikimedia.org/r/493138 (https://phabricator.wikimedia.org/T217072) [19:06:30] RECOVERY - Check systemd state on netmon1002 is OK: OK - running: The system is fully operational [19:10:12] !log branching 1.33.0-wmf.25 [19:10:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:13:48] RECOVERY - Check the last execution of netbox_ganeti_eqiad_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_eqiad_sync [19:14:00] RECOVERY - Check the last execution of netbox_ganeti_codfw_sync on netmon1002 is OK: OK: Status of the systemd unit netbox_ganeti_codfw_sync [19:17:04] oooohhh and there's the branch \o/ [19:18:02] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10dr0ptp4kt) Hi all - we had a meeting on this and a question emerged. Is it possible to run Graph without... [19:22:52] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) >>! In T218218#5081897, @mobrovac wrote: > One other thing left to do here: replace optional parameters in the `/sys... [19:23:00] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10mobrovac) [19:26:06] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10mobrovac) >>! In T211881#5098847, @dr0ptp4kt wrote: > Hi all - we had a meeting on this and a question em... 
[19:28:46] (03PS1) 10Herron: wikimediafoundation.org: add spf record [dns] - 10https://gerrit.wikimedia.org/r/502589 (https://phabricator.wikimedia.org/T220412) [19:29:01] (03PS2) 10Gehel: Enable revision fetching on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/502564 (owner: 10Smalyshev) [19:30:34] (03CR) 10Gehel: [C: 03+2] Enable revision fetching on test hosts [puppet] - 10https://gerrit.wikimedia.org/r/502564 (owner: 10Smalyshev) [19:34:02] 10Operations, 10Fundraising-Backlog, 10Mail, 10fundraising-tech-ops, 10Patch-For-Review: Identify appropriate SPF record for domain wikimediafoundation.org - https://phabricator.wikimedia.org/T220412 (10herron) >>! In T220412#5095320, @Jgreen wrote: > As far as I know, fundraising does not send mail usin... [19:41:57] (03PS1) 10Jforrester: [BETA] WBMI: Configure initial qualifiers for Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502592 (https://phabricator.wikimedia.org/T219181) [19:42:31] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yurik) @mobrovac @dr0ptp4kt there is a much larger issue -- performance, and this is identical to maps: w... [19:46:39] !log added myself to ldap group cn=archiva-deployers,ou=groups,dc=wikimedia,dc=org [19:46:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:46:46] twentyafterfour: OK if I sneak out a beta-only config patch? [19:46:58] James_F: sure [19:47:02] (03CR) 10Jforrester: [C: 03+2] [BETA] WBMI: Configure initial qualifiers for Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502592 (https://phabricator.wikimedia.org/T219181) (owner: 10Jforrester) [19:47:02] Thanks. 
[19:47:44] * twentyafterfour is just trying to figure out why https://integration.wikimedia.org/ci/job/train-deploy-notes/ doesn't work anymore [19:48:07] (03Merged) 10jenkins-bot: [BETA] WBMI: Configure initial qualifiers for Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502592 (https://phabricator.wikimedia.org/T219181) (owner: 10Jforrester) [19:49:34] twentyafterfour: Oh, af871ec786c1af02b07357d11bc20565fd92e141 wasn't on deployment yet. Is it OK for me to have pulled it, or should I revert it? [19:49:48] (Scap clean changes.) [19:51:57] 10Operations, 10RESTBase, 10RESTBase-API, 10serviceops, and 3 others: Make RESTBase spec standard compliant and switch to OpenAPI 3.0 - https://phabricator.wikimedia.org/T218218 (10Pchelolo) [19:52:08] James_F: should be fine [19:53:10] Kk. All done. [19:57:34] (03CR) 10jenkins-bot: [BETA] WBMI: Configure initial qualifiers for Beta Commons [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502592 (https://phabricator.wikimedia.org/T219181) (owner: 10Jforrester) [20:08:40] (03PS1) 10Ottomata: published-datasets-sync - don't preserve group attr [puppet] - 10https://gerrit.wikimedia.org/r/502598 [20:11:11] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10dr0ptp4kt) Thanks @mobrovac @Yurik. Totally understood on the user observed performance (at least when i... 
[20:15:25] (03CR) 10Ottomata: [C: 03+2] published-datasets-sync - don't preserve group attr [puppet] - 10https://gerrit.wikimedia.org/r/502598 (owner: 10Ottomata) [20:18:54] 10Operations, 10Analytics, 10Research-management, 10Patch-For-Review, 10User-Elukey: Remove computational bottlenecks in stats machine via adding a GPU that can be used to train ML models - https://phabricator.wikimedia.org/T148843 (10dr0ptp4kt) Thanks @elukey [20:20:27] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Milimetric) Maybe as a quick fix, graphs could initially render as a generic "graph loading" image. Then... [20:43:52] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install db2102.codfw.wmnet as a testing host for codfw backups - https://phabricator.wikimedia.org/T219461 (10Papaul) [20:44:07] (03PS1) 10Alex Monk: Move bastion_hosts out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502607 [20:44:54] 10Operations, 10ops-codfw, 10DBA, 10Patch-For-Review: rack/setup/install (5) codfw dedicated dump slaves - https://phabricator.wikimedia.org/T219463 (10Papaul) [20:54:27] (03PS1) 10Alex Monk: Move bastion_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502612 [20:55:08] (03PS1) 10Herron: Upgrade logstash plugins to 5.6.15 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/502613 (https://phabricator.wikimedia.org/T219571) [21:01:12] (03CR) 10Mathew.onipe: [C: 03+1] Upgrade logstash plugins to 5.6.15 [software/logstash/plugins] - 10https://gerrit.wikimedia.org/r/502613 (https://phabricator.wikimedia.org/T219571) (owner: 10Herron) [21:14:57] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10kaldari) 
Note that current versions of Vega (since 4.3.0) need ES6 support, so we would either need to tr... [21:18:32] (03CR) 10Cwhite: [C: 03+1] "It took some time to review, but this latest iteration looks good from my perspective." [puppet] - 10https://gerrit.wikimedia.org/r/501617 (https://phabricator.wikimedia.org/T219803) (owner: 10Jbond) [21:21:57] twentyafterfour: Any idea if the train is going to run today? I saw it was blocked on T220153, but doesn't look like anyone is responding. [21:27:47] stashbot, next [21:28:06] T220153: Beta Wikidata: Error: 1146 Table 'wikishared.wikimedia_editor_tasks_keys' doesn't exist - https://phabricator.wikimedia.org/T220153 [21:29:08] 10Operations, 10Code-Stewardship-Reviews, 10Graphoid, 10Core Platform Team Backlog (Watching / External), and 2 others: graphoid: Code stewardship request - https://phabricator.wikimedia.org/T211881 (10Yurik) Another, not yet mentioned consideration: There was a significant syntax change between Vega 1.5,... [21:29:11] (03PS1) 10Alex Monk: Move maintenance_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502622 [21:30:14] (03CR) 10jerkins-bot: [V: 04-1] Move maintenance_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502622 (owner: 10Alex Monk) [21:31:27] kaldari: it's also blocked on rebasing security patches so I don't know when it'll happen [21:31:37] ah [21:31:39] thanks [21:32:20] twentyafterfour: Did you try to cut REL1_33? [21:32:30] (What me, keen? ;-)) [21:33:07] James_F: not yet [21:33:12] * James_F nods. 
[21:33:21] I guess I could do that noq [21:33:23] now [21:35:51] (03PS2) 10Alex Monk: Move maintenance_hosts out of ferm macros [puppet] - 10https://gerrit.wikimedia.org/r/502622 [21:37:41] James_F: running maintenance/updateCredits.php isn't working [21:38:03] I think I had this same problem last time, can't remember what fixed it [21:38:10] Fatal error: Uncaught Error: Class 'Collator' not found in /src/mediawiki/core/maintenance/updateCredits.php:72 [21:43:35] !log decommissioning cassandra-a, restbase2007 -- T208087 [21:47:02] !log decommissioning cassandra-a, restbase2007 -- T208087 [21:47:20] o.O [21:47:56] RECOVERY - Restbase root url on restbase2020 is OK: HTTP OK: HTTP/1.1 200 - 16254 bytes in 0.143 second response time https://wikitech.wikimedia.org/wiki/RESTBase [21:48:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:48:50] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [21:49:06] (03PS1) 1020after4: testwikis wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502626 [21:49:09] (03CR) 1020after4: [C: 03+2] testwikis wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502626 (owner: 1020after4) [21:50:21] (03Merged) 10jenkins-bot: testwikis wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502626 (owner: 1020after4) [21:50:35] (03CR) 10jenkins-bot: testwikis wikis to 1.33.0-wmf.25 refs T206679 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/502626 (owner: 1020after4) [21:51:04] !log twentyafterfour@deploy1001 Started scap: testwikis wikis to 1.33.0-wmf.25 refs T206679 [21:51:23] RECOVERY - puppet last run on restbase2020 is OK: OK: Puppet is currently enabled, last run 5 minutes ago with 0 failures [21:52:02] kaldari: looks like the train is moving after all [21:52:08] Yay! [21:52:22] All aboard!! 
[21:57:32] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:57:33] T206679: 1.33.0-wmf.25 deployment blockers - https://phabricator.wikimedia.org/T206679 [22:01:43] (03PS1) 10Alex Monk: Move caches out of network::constants into hieradata [puppet] - 10https://gerrit.wikimedia.org/r/502630 [22:11:20] !log mobrovac@deploy1001 Started deploy [restbase/deploy@c0a2977]: Bring RB on restbase20(19|20) up to date - T208087 [22:11:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:11:24] T208087: Replace remaining Samsung SSDs - https://phabricator.wikimedia.org/T208087 [22:11:33] (03PS1) 10Alex Monk: labs: remove references to deleted project-proxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/502632 [22:13:52] !log mobrovac@deploy1001 Finished deploy [restbase/deploy@c0a2977]: Bring RB on restbase20(19|20) up to date - T208087 (duration: 02m 32s) [22:13:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:06] !log uploaded python-pynetbox to apt.wikimedia.org/stretch-wikimedia (T217072) [22:19:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:19:09] T217072: Spicerack module for Netbox - https://phabricator.wikimedia.org/T217072 [22:22:18] PROBLEM - Disk space on mwdebug2002 is CRITICAL: DISK CRITICAL - free space: / 0 MB (0% inode=67%) [22:23:28] PROBLEM - Disk space on mwdebug2001 is CRITICAL: DISK CRITICAL - free space: / 16 MB (0% inode=67%) [22:26:36] (03PS1) 10Alex Monk: labs: remove references to other deleted hosts [puppet] - 10https://gerrit.wikimedia.org/r/502633 [22:29:44] (03PS1) 10Alex Monk: labs: Remove nova_dnsmasq_aliases stuff [puppet] - 10https://gerrit.wikimedia.org/r/502634 [22:31:04] !log twentyafterfour@deploy1001 Finished scap: testwikis wikis to 1.33.0-wmf.25 refs T206679 (duration: 39m 59s) [22:31:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:31:08] T206679: 1.33.0-wmf.25 
deployment blockers - https://phabricator.wikimedia.org/T206679 [22:34:46] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1004 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:37:40] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1004 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:42:06] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:42:28] uh [22:42:53] all I did was deploy a new branch (wmf 25) to test wikis, and those errors appear to be for wmf.24 [22:43:54] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:53:49] (03PS9) 10Ayounsi: Icinga: Add OSPF check to routers [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) [22:56:30] (03CR) 10Ayounsi: "Thanks! Great feedback as usual!" (0313 comments) [puppet] - 10https://gerrit.wikimedia.org/r/496873 (https://phabricator.wikimedia.org/T83992) (owner: 10Ayounsi) [22:58:30] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [22:59:48] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:00:04] addshore, hashar, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: That opportune time is upon us again. Time for a Evening SWAT (Max 6 patches) deploy. Don't be afraid. 
(https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20190409T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. [23:00:25] (03CR) 10Bstorm: [C: 03+2] labs: remove references to deleted project-proxy hosts [puppet] - 10https://gerrit.wikimedia.org/r/502632 (owner: 10Alex Monk) [23:03:40] (03PS1) 10RobH: remove tbayer shell access [puppet] - 10https://gerrit.wikimedia.org/r/502639 (https://phabricator.wikimedia.org/T220565) [23:05:18] (03CR) 10RobH: [C: 03+2] remove tbayer shell access [puppet] - 10https://gerrit.wikimedia.org/r/502639 (https://phabricator.wikimedia.org/T220565) (owner: 10RobH) [23:05:30] (03PS2) 10RobH: remove tbayer shell access [puppet] - 10https://gerrit.wikimedia.org/r/502639 (https://phabricator.wikimedia.org/T220565) [23:10:06] PROBLEM - HHVM jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:11:06] PROBLEM - Nginx local proxy to videoscaler on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:11:22] PROBLEM - Nginx local proxy to jobrunner on mw1299 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:12:06] PROBLEM - puppet last run on notebook1004 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 2 minutes ago with 1 failures. 
Failed resources (up to 3 shown): User[tbayer] [23:12:22] RECOVERY - Nginx local proxy to videoscaler on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.009 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:12:38] RECOVERY - HHVM jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.006 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:12:38] RECOVERY - Nginx local proxy to jobrunner on mw1299 is OK: HTTP OK: HTTP/1.1 200 OK - 288 bytes in 0.011 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:12:56] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) [23:12:59] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) [23:13:58] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) [23:14:00] !log twentyafterfour@deploy1001 Pruned MediaWiki: 1.33.0-wmf.17 [keeping static files] (duration: 06m 03s) [23:14:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:15:16] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) ` robh@mwmaint1002:~$ sudo offboard-user --list-only -l tbayer User DN: uid=tbayer,ou=people,dc=wikimedia,dc=org Is member of the following unprivileged LDAP groups: cn=project-bastion,ou=groups,dc... 
[23:17:34] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) [23:19:39] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) [23:22:32] PROBLEM - HHVM jobrunner on mw1300 is CRITICAL: HTTP CRITICAL: HTTP/1.1 503 Service Unavailable - 473 bytes in 0.001 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:22:34] PROBLEM - puppet last run on notebook1003 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 6 minutes ago with 1 failures. Failed resources (up to 3 shown): User[tbayer] [23:23:46] RECOVERY - HHVM jobrunner on mw1300 is OK: HTTP OK: HTTP/1.1 200 OK - 270 bytes in 0.007 second response time https://wikitech.wikimedia.org/wiki/Jobrunner [23:27:41] 10Operations, 10SRE-Access-Requests: offboard tilman bayer - https://phabricator.wikimedia.org/T220565 (10RobH) [23:31:58] twentyafterfour: Eurgh, fun. :-( [23:39:19] twentyafterfour: Yeah, seems false negative. Although it wouldn't be the first time presence of new code on a less-active/test wiki caused an issue. It's unlikely. [23:39:44] If there's no SWAT, I'll take this to roll a perf fix that missed the cut [23:39:54] go for it [23:40:16] Krinkle: I got scap errors on mwdebug2001 for out of disk space [23:40:20] I think that's fixed now [23:40:25] but not tested [23:40:45] Woops, okay. I'll keep that in mind. Might go for 1001 instead just to rule it out for now. 
[23:43:40] RECOVERY - puppet last run on notebook1003 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [23:51:16] 10Operations, 10Traffic, 10wikitech.wikimedia.org: Wikitech page views sometimes default to MobileFrontend - https://phabricator.wikimedia.org/T220567 (10Krinkle) [23:51:50] 10Operations, 10Traffic, 10wikitech.wikimedia.org: Wikitech page views sometimes default to MobileFrontend - https://phabricator.wikimedia.org/T220567 (10Krinkle) [23:52:01] 10Operations, 10Traffic, 10wikitech.wikimedia.org: Wikitech page views sometimes default to MobileFrontend - https://phabricator.wikimedia.org/T220567 (10Krinkle)