[00:06:34] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#4086825 (10Krinkle) 05Resolved>03Open It seems the `click_tracking_events` table w... [00:06:54] 10Operations, 10DBA, 10MediaWiki-extensions-ClickTracking: Drop the tables old_growth, hitcounter, click_tracking, click_tracking_user_properties from enwiki, maybe other schemas - https://phabricator.wikimedia.org/T115982#4086828 (10Krinkle) [00:25:22] (03PS2) 10Dzahn: bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) [00:25:38] (03CR) 10jerkins-bot: [V: 04-1] bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [00:31:24] (03PS3) 10Dzahn: bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) [00:42:08] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4086871 (10Dzahn) Was just working on the Bastion related Wikitech pages due to Bast1001 being replaced and i noticed we have 2 bastions in ULSFO, 4001 and 4002. stalled? [00:45:29] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4086872 (10ayounsi) [01:11:47] (03CR) 10Dzahn: [C: 032] bastionhost: add MOTD warning of imminent bast1001 shutdown [puppet] - 10https://gerrit.wikimedia.org/r/422339 (https://phabricator.wikimedia.org/T186623) (owner: 10Dzahn) [02:28:57] !log l10nupdate@deploy1001 scap sync-l10n completed (1.31.0-wmf.26) (duration: 13m 33s) [02:29:03] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [02:32:46] (03PS2) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [02:33:16] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [02:35:49] is it currently possible to add interface::add_ip6_mapped { 'main': } without getting dislikes from wmf-style? [02:36:15] it doesn't like node-level anymore but also not role (as before) [02:37:03] (03CR) 10Dzahn: "22:35 < mutante> is it currently possible to add interface::add_ip6_mapped { 'main': } without getting dislikes from wmf-style?" 
[puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [03:16:41] (03PS1) 10KartikMistry: Update ssh public key for Kartik Mistry [puppet] - 10https://gerrit.wikimedia.org/r/422361 [03:25:53] PROBLEM - MariaDB Slave Lag: s1 on dbstore1002 is CRITICAL: CRITICAL slave_sql_lag Replication lag: 843.70 seconds [03:50:15] 10Operations, 10Cloud-Services, 10hardware-requests, 10Labs-Sprint-101, and 2 others: Kill off virt1000 - https://phabricator.wikimedia.org/T102005#4086985 (10Krinkle) [03:50:27] (03PS1) 10Krinkle: Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) [03:50:51] (03CR) 10jerkins-bot: [V: 04-1] Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [03:51:10] (03PS2) 10Krinkle: Remove outdated references to virt1000 from db-eqiad.php [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) [04:01:03] RECOVERY - MariaDB Slave Lag: s1 on dbstore1002 is OK: OK slave_sql_lag Replication lag: 168.87 seconds [05:34:23] PROBLEM - Router interfaces on mr1-eqsin is CRITICAL: CRITICAL: No response from remote host 103.102.166.128 for 1.3.6.1.2.1.2.2.1.7 with snmp version 2 [05:35:14] RECOVERY - Router interfaces on mr1-eqsin is OK: OK: host 103.102.166.128, interfaces up: 37, down: 0, dormant: 0, excluded: 1, unused: 0 [05:44:59] (03CR) 10Muehlenhoff: [C: 032] Update symbols for 1.1.0h [debs/openssl11] - 10https://gerrit.wikimedia.org/r/422177 (owner: 10Muehlenhoff) [06:31:03] (03PS1) 10Jcrespo: mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) [06:31:16] (03PS2) 10Jcrespo: mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) [06:31:18] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [06:31:48] (03CR) 10jerkins-bot: [V: 04-1] mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [06:32:20] (03CR) 10Muehlenhoff: [C: 032] Update ssh public key for Kartik Mistry [puppet] - 10https://gerrit.wikimedia.org/r/422361 (owner: 10KartikMistry) [06:48:18] (03CR) 10Muehlenhoff: [C: 04-1] Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [06:51:29] !log installing remaining ICU security updates [06:51:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [06:52:02] (03PS3) 10Jcrespo: mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) [07:03:39] (03CR) 10Jcrespo: [C: 032] mariadb backups: Rotate to latest as soon as they finished [puppet] - 10https://gerrit.wikimedia.org/r/422368 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [07:08:51] (03PS1) 10Muehlenhoff: Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/422369 [07:10:39] (03PS2) 10Muehlenhoff: 
Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/422369 [07:17:43] (03CR) 10Muehlenhoff: [C: 032] Update Cumin aliases [puppet] - 10https://gerrit.wikimedia.org/r/422369 (owner: 10Muehlenhoff) [07:27:10] (03PS1) 10Jcrespo: mariadb backups: Start backups earlier [puppet] - 10https://gerrit.wikimedia.org/r/422370 (https://phabricator.wikimedia.org/T189384) [07:29:17] (03CR) 10Jcrespo: [C: 032] mariadb backups: Start backups earlier [puppet] - 10https://gerrit.wikimedia.org/r/422370 (https://phabricator.wikimedia.org/T189384) (owner: 10Jcrespo) [07:49:33] !log uploaded openssl 1.0.2o to apt.wikimedia.org/jessie-wikimedia [07:49:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [07:55:05] (03CR) 10Elukey: Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [07:55:36] moritzm: thanks for --^ !! [07:57:57] sure :-) [08:07:11] I see the l10n upate ran overnight on deploy1001, now to see if it wrked properly or not, and tbh no idea how to check that [08:15:33] !log reboot labstore1001 for T189115 [08:15:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:15:40] (03PS1) 10Volans: Puppetboard: disable listing of static files [puppet] - 10https://gerrit.wikimedia.org/r/422371 [08:17:16] !log reboot labstore1002 for T189115 [08:17:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:23] (03CR) 10Elukey: [C: 031] Puppetboard: disable listing of static files [puppet] - 10https://gerrit.wikimedia.org/r/422371 (owner: 10Volans) [08:18:23] PROBLEM - Ulsfo HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 11.11% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:18:34] !log reboot labstore2001 for T189115 [08:18:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:18:39] hello ulsfo [08:19:10] (03CR) 10Filippo Giunchedi: "Thanks for taking care of this!" 
(031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [08:19:24] PROBLEM - Text HTTP 5xx reqs/min on graphite1001 is CRITICAL: CRITICAL: 10.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:19:36] (03CR) 10Volans: [C: 032] Puppetboard: disable listing of static files [puppet] - 10https://gerrit.wikimedia.org/r/422371 (owner: 10Volans) [08:19:57] so it seems a single spike that now is gone, ints from codfw caches [08:20:34] https://grafana.wikimedia.org/dashboard/db/varnish-failed-fetches?orgId=1&var-datasource=codfw%20prometheus%2Fops&var-cache_type=text&var-server=All&var-layer=backend&from=now-3h&to=now [08:20:38] ema --^ [08:21:52] (03PS1) 10Volans: wmf-auto-reimage: fix retcodes in sequential mode [puppet] - 10https://gerrit.wikimedia.org/r/422372 [08:22:49] hey, looking [08:24:49] <3 [08:24:56] seems gone now, was only a fyi [08:25:30] !log add more weight to ms-be204[0-3] - T189633 [08:25:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:25:35] T189633: rack/setup/install ms-be204[0-3] - https://phabricator.wikimedia.org/T189633 [08:25:49] !log reboot labstore200[2,3,4] for T189115 [08:25:54] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [08:26:23] RECOVERY - Ulsfo HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=ulsfo&var-cache_type=All&var-status_type=5 [08:26:24] RECOVERY - Text HTTP 5xx reqs/min on graphite1001 is OK: OK: Less than 1.00% above the threshold [250.0] https://grafana.wikimedia.org/dashboard/file/varnish-aggregate-client-status-codes.json?panelId=3&fullscreen&orgId=1&var-site=All&var-cache_type=text&var-status_type=5 [08:31:13] (03CR) 10Volans: [C: 032] wmf-auto-reimage: fix retcodes in sequential mode [puppet] - 10https://gerrit.wikimedia.org/r/422372 (owner: 10Volans) [08:42:23] (03PS1) 10Giuseppe Lavagetto: conftool: strawman for a db-server object schema for mwconfig [puppet] - 10https://gerrit.wikimedia.org/r/422373 [08:42:42] (03PS1) 10Giuseppe Lavagetto: Manage slave databases load/presence via etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422374 [08:43:52] (03CR) 10jerkins-bot: [V: 04-1] Manage slave databases load/presence via etcd [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422374 (owner: 10Giuseppe Lavagetto) [08:47:33] PROBLEM - Host scb2005 is DOWN: PING CRITICAL - Packet loss = 100% [08:49:13] RECOVERY - Host scb2005 is UP: PING OK - Packet loss = 0%, RTA = 36.09 ms [09:08:13] PROBLEM - Disk space on elastic1019 is CRITICAL: DISK CRITICAL - free space: /srv 61574 MB (12% inode=99%) [09:15:14] RECOVERY - Disk space on elastic1019 is OK: DISK OK [09:22:32] (03PS3) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) [09:22:47] (03PS4) 10Ema: WIP: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) [09:24:33] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100% [09:25:25] !log disable puppet on icinga servers before merging 
https://gerrit.wikimedia.org/r/c/413142/ [09:25:30] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [09:27:13] PROBLEM - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [09:27:58] (03PS1) 10ArielGlenn: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422376 [09:28:06] (03PS2) 10ArielGlenn: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422376 [09:28:15] (03PS1) 1020after4: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422377 (https://phabricator.wikimedia.org/T190909) [09:28:30] actually no I'm looking into why tegmen has a completely different cpu profile than einsteinium [09:28:31] (03Abandoned) 1020after4: Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422377 (https://phabricator.wikimedia.org/T190909) (owner: 1020after4) [09:28:55] oops sorry [09:28:56] (03CR) 1020after4: [C: 031] "see T190909" [puppet] - 10https://gerrit.wikimedia.org/r/422376 (owner: 10ArielGlenn) [09:29:02] well i'll merge mine [09:29:12] no prob [09:29:31] where is jenkins [09:30:33] now it decides to be slow? [09:30:53] 10Operations, 10Packaging, 10Scap, 10Patch-For-Review: Install git-lfs client (at least on scap targets & masters) - https://phabricator.wikimedia.org/T180628#4087305 (10mmodell) [09:30:56] hah [09:30:56] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#4087301 (10mmodell) 05Resolved>03Open see {T190909} and this patch: [[ https://gerrit.wikimedia.org/r/#/c/422376/ | Revert "switch deployment server from tin to deploy1001" ]] [09:31:46] (03CR) 10ArielGlenn: [C: 032] Revert "switch deployment server from tin to deploy1001" [puppet] - 10https://gerrit.wikimedia.org/r/422376 (owner: 10ArielGlenn) [09:32:33] sweet! icinga restarted by puppet on each puppet run due to file ownership change )o) [09:32:48] lol [09:33:20] twentyafterfour: where does puppet have to run for that change to take effect? [09:33:23] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 30.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [09:34:16] looking (reindex still in progress fwiw) ^ [09:34:33] apergos: deploy1001 and tin [09:34:39] you're not asleep [09:34:42] thank you however [09:34:43] Of course not [09:35:12] Technically everywhere, eventually, but those two kinda matter a bit more :) [09:35:49] there's a big ole motd on tin too [09:36:29] I'm running on those two now, we'll wait the 40 minutes or whatever for it to go around everywhere else, mail should be sent with an update [09:36:43] and someone oughta test on tin after that 40 minutes [09:38:18] ah that took care of the motd, nice [09:38:20] I'll reply to the wikitech thread to note that it's switched back to tin [09:38:30] would you revert the DNS change too? [09:38:40] was the dns change merged? 
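The deployment service name being discussed here is a DNS alias, so once the CNAME revert lands, whether it really points back at tin can be sanity-checked from any production host. A minimal sketch of such a check (a hypothetical helper written for illustration, not an existing WMF script; the hostnames are the ones mentioned in the discussion and only resolve inside the production network):

```python
# Hypothetical convenience check: compare what the deployment.eqiad.wmnet
# service alias resolves to against tin's own addresses. Only meaningful
# from a host that can resolve internal .wmnet names.
import socket

def resolve(host):
    return sorted({ai[4][0] for ai in socket.getaddrinfo(host, None)})

alias = resolve('deployment.eqiad.wmnet')
tin = resolve('tin.eqiad.wmnet')
print('deployment ->', alias)
print('tin        ->', tin)
print('alias points at tin:', alias == tin)
```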
[09:38:59] uh, chad beat me to the email [09:39:04] ok good for that [09:39:26] apergos: I think the dns change was merged, according to the email thread anyway [09:39:38] ah [09:39:39] I didn't think of DNS [09:39:39] "We also just switched the DNS service name for deployment.eqiad/codfw (thanks Andrew Bogott!)" [09:42:20] (03PS1) 10ArielGlenn: Revert "Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/422380 [09:43:59] (03CR) 10ArielGlenn: [C: 032] Revert "Change cname for deployment.eqiad.wmnet and deployment.codfw.wmnet" [dns] - 10https://gerrit.wikimedia.org/r/422380 (owner: 10ArielGlenn) [09:46:43] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 36.08 ms [09:47:50] deployed and live [09:48:00] anything else we may have overlooked? [09:48:37] except actually testing tin in a while [09:49:10] !jouncebot: next [09:49:28] 10Operations, 10Icinga, 10monitoring: icinga restarted on each puppet run on standby server - https://phabricator.wikimedia.org/T190912#4087341 (10fgiunchedi) [09:49:45] jouncebot: next [09:49:45] In 3 hour(s) and 10 minute(s): European Mid-day SWAT(Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1300) [09:50:07] apergos: nope I think that's it [09:50:33] RECOVERY - Host labstore2004 is UP: PING OK - Packet loss = 0%, RTA = 37.02 ms [09:52:04] care to try a test in 15 mins or so? [09:53:15] apergos: ok [09:53:23] thanks [09:53:42] (03CR) 10Filippo Giunchedi: [C: 032] nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) (owner: 10Filippo Giunchedi) [09:53:48] (03PS4) 10Filippo Giunchedi: nagios_common: switch to check_prometheus_metric Python implementation [puppet] - 10https://gerrit.wikimedia.org/r/413142 (https://phabricator.wikimedia.org/T181410) [09:55:20] dammit [09:55:33] l10update will run over there in two minutes. 
before puppet's run everywhere [09:56:00] oh well it will just break [09:56:27] s/two/four/ but you get the idea [10:04:24] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [10:05:58] !log upgrade and restart db2093 [10:06:02] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:11:24] PROBLEM - Host labstore2004 is DOWN: PING CRITICAL - Packet loss = 100% [10:11:44] PROBLEM - Host labstore2003 is DOWN: PING CRITICAL - Packet loss = 100% [10:13:33] akosiaris: since I see you are logged in on deploy1001, no deploys from there, it's back to tin [10:16:14] twentyafterfour: testing time, if you would do the honors [10:16:27] (03PS13) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [10:16:33] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [10:18:03] RECOVERY - Host labstore2003 is UP: PING OK - Packet loss = 0%, RTA = 36.24 ms [10:20:43] RECOVERY - Host labstore2004 is UP: PING OK - Packet loss = 0%, RTA = 36.10 ms [10:21:07] (03CR) 10Muehlenhoff: "https://puppet-compiler.wmflabs.org/compiler03/10706/" [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [10:23:33] (03CR) 10Filippo Giunchedi: [C: 031] Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [10:26:22] (03PS1) 10Vgutierrez: mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) [10:26:34] 10Operations: replace tin (new hardware) - https://phabricator.wikimedia.org/T185275#3911588 (10ArielGlenn) A summary of things as I understand them: - deploy1001 to php7 is needed for git-lfs, which is needed for ORES. - icu collation order with libicu57 (default with php7) is different than with libicu52 (w... 
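The libicu point in the task summary above is worth unpacking: collation sort keys are produced by whatever libicu the runtime links against, so moving to php7 with libicu57 can silently change orderings that were generated under libicu52. A small sketch of the idea, assuming the PyICU bindings are installed (the sample words are arbitrary):

```python
# Sketch (assumes PyICU is installed): sort keys come from the system libicu,
# so an ICU upgrade (e.g. libicu52 -> libicu57 as part of the php7 move) can
# change the keys, and data sorted or stored with the old keys needs rebuilding.
import icu

collator = icu.Collator.createInstance(icu.Locale('de'))
print('ICU version:', icu.ICU_VERSION)
for word in ['Arm', 'Ärger', 'Zebra']:
    print(word, collator.getSortKey(word).hex())
```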
[10:27:18] !log reload icinga on einsteinium after https://gerrit.wikimedia.org/r/c/413142 [10:27:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:33:05] (03PS3) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [10:33:22] (03PS14) 10Muehlenhoff: Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) [10:34:05] (03CR) 10Muehlenhoff: [C: 032] Allow to selectively run time servers on Chrony [puppet] - 10https://gerrit.wikimedia.org/r/393581 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [10:34:52] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [10:37:54] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4087564 (10Vgutierrez) >>! In T184942#4083351, @Krinkle wrote: > The following counters are currently reported to StatsD from `ReqURL ^/w/load.php` ([va... [10:45:38] 10Operations, 10Icinga, 10monitoring: icinga restarted on each puppet run on standby server - https://phabricator.wikimedia.org/T190912#4087584 (10fgiunchedi) The effect is also clear on host dashboards {F16328381} [10:58:50] 10Operations, 10Puppet: Puppet: enable reports to puppetdb - https://phabricator.wikimedia.org/T190918#4087592 (10Volans) p:05Triage>03Normal [10:59:43] !log performing a few minutes live test of reporting Puppet reports to puppetdb too on puppetmaster1001 - T190918 [10:59:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [10:59:49] T190918: Puppet: enable reports to puppetdb - https://phabricator.wikimedia.org/T190918 [11:02:01] (03PS3) 10Mark Bergsma: Fix testRepool test case for previously-down-but-pooled [debs/pybal] - 10https://gerrit.wikimedia.org/r/421051 [11:02:03] (03PS3) 10Mark Bergsma: Fix StubLVSService to use a set instead of a dict for .servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/421052 [11:02:05] (03PS4) 10Mark Bergsma: Introduce server.is_pooled and make server.pooled usage more consistent [debs/pybal] - 10https://gerrit.wikimedia.org/r/421053 [11:03:12] (03CR) 10Mark Bergsma: [C: 032] Fix testRepool test case for previously-down-but-pooled [debs/pybal] - 10https://gerrit.wikimedia.org/r/421051 (owner: 10Mark Bergsma) [11:03:33] (03CR) 10Mark Bergsma: [C: 032] Fix StubLVSService to use a set instead of a dict for .servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/421052 (owner: 10Mark Bergsma) [11:03:42] (03Merged) 10jenkins-bot: Fix testRepool test case for previously-down-but-pooled [debs/pybal] - 10https://gerrit.wikimedia.org/r/421051 (owner: 10Mark Bergsma) [11:04:03] (03Merged) 10jenkins-bot: Fix StubLVSService to use a set instead of a dict for .servers [debs/pybal] - 10https://gerrit.wikimedia.org/r/421052 (owner: 10Mark Bergsma) [11:04:39] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [11:06:06] (03PS1) 10Filippo Giunchedi: icinga: preserve ownership when purging resources [puppet] - 10https://gerrit.wikimedia.org/r/422384 
(https://phabricator.wikimedia.org/T190912) [11:06:48] volans: ^ [11:18:15] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [11:21:24] (03PS13) 10Rduran: Add port of osc_host.sh [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/419725 [11:23:33] (03PS1) 10Jcrespo: haproxy: Remove older templates (haproxy<1.7) [puppet] - 10https://gerrit.wikimedia.org/r/422386 (https://phabricator.wikimedia.org/T183249) [11:24:35] (03PS1) 10Muehlenhoff: Switch time server on dns5001 to Chrony [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) [11:31:49] PROBLEM - CirrusSearch eqiad 95th percentile latency on graphite1001 is CRITICAL: CRITICAL: 20.00% of data above the critical threshold [1000.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [11:34:09] (03PS6) 10Arturo Borrero Gonzalez: [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [11:34:43] (03CR) 10jerkins-bot: [V: 04-1] [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [11:35:14] (03CR) 10Vgutierrez: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [11:36:54] (03PS1) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:37:34] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:39:21] (03PS7) 10Arturo Borrero Gonzalez: [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [11:41:23] (03PS1) 10Muehlenhoff: Enable base::service_auto_restart for cron [puppet] - 10https://gerrit.wikimedia.org/r/422391 (https://phabricator.wikimedia.org/T135991) [11:46:22] (03PS2) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:46:42] (03CR) 10Volans: [C: 031] "LGTM" [puppet] - 10https://gerrit.wikimedia.org/r/422384 (https://phabricator.wikimedia.org/T190912) (owner: 10Filippo Giunchedi) [11:47:00] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:48:05] (03CR) 10Muehlenhoff: "Looks fine to me in general. Is this intended to be uploaded to apt.wikimedia.org or are you aiming to upload this to Debian? 
If it's the " [debs/dynomite] - 10https://gerrit.wikimedia.org/r/421447 (owner: 10Aaron Schulz) [11:48:08] !log twentyafterfour@tin Synchronized README: test deploy from tin.eqiad.wmnet (duration: 03m 35s) [11:48:13] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:48:25] (03PS3) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:49:02] !log twentyafterfour@tin Started scap: test running full scap sync from tin [11:49:03] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:49:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [11:51:56] (03PS4) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [11:52:36] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [11:54:39] (03PS5) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [12:06:58] (03PS8) 10Arturo Borrero Gonzalez: [WIP] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [12:07:00] (03PS6) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [12:07:19] (03PS9) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) [12:07:26] (03CR) 10jerkins-bot: [V: 04-1] wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [12:07:59] (03CR) 10jerkins-bot: [V: 04-1] wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [12:09:22] (03PS7) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [12:16:19] PROBLEM - HHVM jobrunner on mw1293 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:16:39] PROBLEM - HHVM jobrunner on mw1294 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:17:00] PROBLEM - HHVM jobrunner on mw1307 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:09] PROBLEM - HHVM jobrunner on mw1296 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:49] PROBLEM - HHVM jobrunner on mw1338 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:49] PROBLEM - HHVM jobrunner on mw1295 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:18:59] PROBLEM - HHVM jobrunner on mw1318 is CRITICAL: CRITICAL - Socket timeout after 10 seconds [12:24:56] whaaattt [12:26:34] so all videoscalers: 
https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=All&from=now-3h&to=now [12:26:39] load is skyrocketing [12:26:57] so hhvm is up but the health checks are not getting through [12:27:16] I'm on one now and there's a pile of stuff apparently running (mw1293) [12:27:19] PROBLEM - High CPU load on API appserver on mw1282 is CRITICAL: CRITICAL - load average: 61.77, 25.67, 16.13 [12:27:46] yeah I think that somebody triggered a massive re-encode or something similar [12:28:19] RECOVERY - High CPU load on API appserver on mw1282 is OK: OK - load average: 37.50, 25.94, 16.87 [12:28:49] PROBLEM - High CPU load on API appserver on mw1227 is CRITICAL: CRITICAL - load average: 48.14, 26.98, 18.56 [12:29:33] 10Operations, 10ops-ulsfo, 10Traffic: setup bast4002/WMF7218 - https://phabricator.wikimedia.org/T179050#4087858 (10BBlack) Was stalled on my lack of time dealing with the prometheus switchover and then switching peoples' SSH configs, otherwise it's ready for switchover. [12:29:49] RECOVERY - High CPU load on API appserver on mw1227 is OK: OK - load average: 25.76, 24.35, 18.18 [12:29:52] (03PS8) 10Rduran: Create tests skeleton [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/420746 [12:29:54] (03PS8) 10Rduran: Refactor and test the main OSC run method [software/wmfmariadbpy] - 10https://gerrit.wikimedia.org/r/421340 [12:31:21] so https://grafana.wikimedia.org/dashboard/db/jobqueue-eventbus?orgId=1&var-site=eqiad&var-type=webVideoTranscode&var-type=webVideoTranscodePrioritized [12:31:24] apergos: --^ [12:31:30] enqueue rate spiked a lot [12:31:35] https://commons.wikimedia.org/wiki/Special:NewFiles?user=&mediatype%5B%5D=UNKNOWN&mediatype%5B%5D=AUDIO&mediatype%5B%5D=VIDEO&mediatype%5B%5D=MULTIMEDIA&mediatype%5B%5D=ARCHIVE&start=&end=&wpFormIdentifier=specialnewimages&limit=50&offset= [12:31:43] check that out [12:32:41] English: Please subscribe to my channel and my vlog channel! I make new videos here every Wednesday and make vlogs during my majestical daily life. from the description of one of the files [12:32:53] this seems like spam/self adv/needs block [12:32:55] lol [12:32:56] ye [12:33:01] advertising [12:33:03] so iiuc now is change prop that grabs jobs from kafka and then sends to the video scalers [12:33:07] let's see who's in commons channel [12:33:30] so there is also (possibly) and issue with sending too many jobs to the videoscalers fleet [12:33:40] mobrovac,Pchelolo hello :) [12:35:07] !log twentyafterfour@tin Finished scap: test running full scap sync from tin (duration: 46m 05s) [12:35:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [12:36:23] well, I wouldnt say thats a bad neccesairly, better have problems now than later on [12:37:12] Wiki13: ? [12:37:43] yeah these are the source off the transcodes most likely [12:37:55] each one of these has a pile of transcodes per file [12:38:08] so checking metrics for each host (like https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?orgId=1&from=now-3h&to=now&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=mw1293) it seems that something is not ok in our config, namely too many HHVM threads used [12:38:36] this prevents health checks of course, but things are processing.. 
not sure how changeprop reacts when the cluster is overloaded [12:39:57] (03CR) 10Jcrespo: [C: 032] "https://puppet-compiler.wmflabs.org/compiler03/10708/" [puppet] - 10https://gerrit.wikimedia.org/r/422386 (https://phabricator.wikimedia.org/T183249) (owner: 10Jcrespo) [12:39:59] https://commons.wikimedia.org/w/index.php?title=Special:Contributions/MennasDosbin&offset=&limit=100&target=MennasDosbin [12:40:10] looks like we will get no joy from commons admins [12:40:40] I can tag them as speedy if you guys wantr [12:40:51] maybe then it gets picked up [12:41:00] at this point the jobs are already queued [12:41:20] okay, so that wouldnt make any difference then [12:41:36] no, it's more about stopping more of it [12:42:54] I've deployed a change to profile/mariadb/proxy/master, I should be the only one using that [12:43:12] I heard you need Commons admin [12:43:13] sup [12:43:46] hey revi [12:43:54] I pinged him ^^ [12:44:00] we just have a spike of video transcode jobs [12:44:05] turns out they are likely all from this: [12:44:11] https://commons.wikimedia.org/w/index.php?title=Special:Contributions/MennasDosbin&offset=&limit=100&target=MennasDosbin [12:44:11] summoned during setting up new router lol [12:44:12] hmm [12:44:16] oh! [12:44:52] anyways the job queued now are queued but it would be nice to head off any more of that (have a look at a few of the descriptions) [12:45:02] uh-uh [12:45:03] yeah [12:45:07] basically I'm just lobbing it over the wall to you fols [12:45:08] not that 'in scope' [12:45:10] (03PS1) 10BBlack: eqsin: turn-up HK + PH + JP [dns] - 10https://gerrit.wikimedia.org/r/422394 (https://phabricator.wikimedia.org/T189252) [12:45:16] we gotta clean up on our side [12:45:30] (03PS1) 10BBlack: eqsin: turn-up India [dns] - 10https://gerrit.wikimedia.org/r/422395 (https://phabricator.wikimedia.org/T189252) [12:45:33] (03PS1) 10BBlack: eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) [12:46:13] so what exactly do you need from me? [12:46:29] (just to be clear - I'm still setting up my new internet so I may be out of connect for awhile) [12:46:30] (03CR) 10Filippo Giunchedi: [C: 032] icinga: preserve ownership when purging resources [puppet] - 10https://gerrit.wikimedia.org/r/422384 (https://phabricator.wikimedia.org/T190912) (owner: 10Filippo Giunchedi) [12:46:33] right [12:46:45] (03PS2) 10Filippo Giunchedi: icinga: preserve ownership when purging resources [puppet] - 10https://gerrit.wikimedia.org/r/422384 (https://phabricator.wikimedia.org/T190912) [12:46:50] so we set, in videoscalers.yaml, thread_count: 15 [12:47:11] that is exactly how busy hhvm is right now on each scaler, so possibly it is a misconfig from our side [12:47:29] revi: yeah I hear you [12:47:56] it would be nice not to get another flood of those, whether that means communication with the user or whatever else [12:48:09] RECOVERY - HHVM jobrunner on mw1338 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:09] RECOVERY - HHVM jobrunner on mw1294 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:19] RECOVERY - HHVM jobrunner on mw1307 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:19] RECOVERY - HHVM jobrunner on mw1318 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.003 second response time [12:48:28] elukey: so how does 15 wind up nailing us against the wall? 
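The arithmetic behind that question is spelled out just below: jobrunner is allowed to launch more concurrent transcode runners per host than HHVM has worker threads configured, so a deep enough queue fills every thread and the Icinga health check cannot get a slot. A back-of-the-envelope sketch using the numbers quoted in this incident (15 HHVM threads; 17 + 12 configured runners on mw1293):

```python
# Back-of-the-envelope capacity check with the numbers from this incident:
# 15 HHVM worker threads per videoscaler vs. 17 webVideoTranscode and
# 12 webVideoTranscodePrioritized jobrunner slots.
hhvm_threads = 15
runners = {'webVideoTranscode': 17, 'webVideoTranscodePrioritized': 12}

max_concurrent_jobs = sum(runners.values())
print(f'runner slots: {max_concurrent_jobs}, hhvm threads: {hhvm_threads}')

# Under a sustained burst every HHVM thread is busy encoding, leaving no free
# thread to answer the Icinga HTTP probe -- hence the "HHVM jobrunner ...
# Socket timeout" alerts even though work keeps progressing.
threads_left_for_checks = max(0, hhvm_threads - max_concurrent_jobs)
print('threads left for health checks under full load:', threads_left_for_checks)
```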
[12:48:29] RECOVERY - HHVM jobrunner on mw1293 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:48:40] (03CR) 10Alexandros Kosiaris: [C: 032] ci: Add kubernetes deployment classes to CI [puppet] - 10https://gerrit.wikimedia.org/r/422100 (https://phabricator.wikimedia.org/T184924) (owner: 10Alexandros Kosiaris) [12:48:47] (03PS3) 10Alexandros Kosiaris: ci: Add kubernetes deployment classes to CI [puppet] - 10https://gerrit.wikimedia.org/r/422100 (https://phabricator.wikimedia.org/T184924) [12:49:09] RECOVERY - HHVM jobrunner on mw1295 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:49:26] apergos: OK, but I think comment from (WMF) would be more trustworthy (or authoritative) than random volunteer admin commenting :P [12:49:34] apergos: so 15 is the number of available HHVM theads that we configure, but we also do some calculations to establish the number of jobrunner "runners" [12:49:46] revi: I dropped a note about it in the admin channel [12:50:05] oh [12:50:06] saw it now [12:50:19] RECOVERY - HHVM jobrunner on mw1296 is OK: HTTP OK: HTTP/1.1 200 OK - 206 bytes in 0.002 second response time [12:50:21] elukey: ok, I didn't see any ridiculous number of them though when looking [12:50:25] in our case, for example, on mw1293 (/etc/jobrunner/jobrunner.conf) we have: 17 runners for transcode, 12 for transcode_prioritized [12:50:40] hmm, deleting it do stop transcoding? [12:50:41] uh huh [12:50:53] apergos: https://grafana.wikimedia.org/dashboard/db/prometheus-apache-hhvm-dc-stats?panelId=17&fullscreen&orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=videoscaler&var-instance=mw1338&from=now-3h&to=now [12:50:56] revi: nope, just deterrence [12:51:02] oh. [12:51:22] it seems that we are getting through them [12:51:27] since queueing is decreasing [12:51:36] we maxed out there? ohdear [12:51:54] yeah, no more hhvm threads == no health checks passing == alarms [12:51:56] wasn’t transcoding queue ‘fixed’ a while ago? [12:51:59] https://commons.wikimedia.org/wiki/File:%22Body_Massage%22_-_Jenna_Marbles.webm that was the latest one in the list, only two left to complete [12:52:29] * apergos spotchecks a coupe others [12:52:31] zhuyifei1999_: it seems a problem of having videoscalers running too many processes at once, that's it [12:52:33] I think proper course for these images are 7-days Deletion Requests since it seem to be scope stuff [12:52:40] k [12:52:54] zhuyifei1999_: I'm not sure if I should just go raid with delete button lol [12:53:05] one one, another has a few left [12:53:06] meh [12:53:09] *one done [12:53:25] meanwhile new router, 3x speed yay [12:53:37] nice [12:53:39] revi: ask jcb to do that, nobody will complain ;) [12:53:55] zhuyifei1999_: I don't want to put myself into Commons Drama Season (x) [12:54:19] PROBLEM - puppet last run on contint1001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 3 minutes ago with 1 failures. 
Failed resources (up to 3 shown): File[/srv/deployment-charts] [12:55:05] apergos: nproc on those scalers is 40 or 48, I think that 15 hhvm threads is a bit low :) [12:55:18] but I don't remember if that value was intended or not [12:55:32] 44 files in 10 minutes, all needing 16 trnscodes each, each one of those lasting anywhere from 4 to 30 minutes depending on the size [12:55:37] that would do it [12:55:58] *6 not 16 [12:55:59] I recall uploading 20~around video last December [12:56:06] yeah but in theory we shouldn't allow this amount of stress on all the scalers, they took too much work at once [12:56:41] and I was kinda wondering wtf is wrong with the speed meh [12:56:48] contint1001 puppet issues is me, fixing [12:56:54] ok [12:57:44] if mw were smart it would queue the transcodes for a file serially: do each size one after another instead of all in parallel [12:57:58] so that other jobs can run if need be [12:58:10] then you'll say, suppose the other threads are idle [12:58:14] revi, zhuyifei1999_: https://commons.wikimedia.org/wiki/Commons:Deletion_requests/Files_uploaded_by_MennasDosbin I just nominated all of them [12:58:17] great [12:58:24] you just saved my time [12:58:28] thanks [12:58:32] Wiki13: thanks! [12:58:40] thanks for wrangling that [12:58:58] I'll remember to kill them by next week [12:59:08] with priority [12:59:11] ^^ [13:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) European Mid-day SWAT(Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1300). [13:00:04] Amir1, RoanKattouw, and tgr: A patch you scheduled for European Mid-day SWAT(Max 8 patches) is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [13:00:10] o/ [13:00:12] I'll SWAT today [13:00:19] cool [13:00:29] twentyafterfour did a full test scap which apparently ran ok [13:00:38] so you should be good to go, I'm here just in case [13:01:15] elukey: how were we on cpu on those boxes? [13:01:19] (03PS3) 10Catrope: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:01:37] (03CR) 10Catrope: [C: 032] Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:01:48] so I'm out, have a nice day (and goodnight!) [13:01:56] thanks again [13:02:07] Wiki13 did the messy thing :) [13:02:25] apergos: https://grafana.wikimedia.org/dashboard/db/prometheus-machine-stats?orgId=1&var-server=mw1293&var-datasource=eqiad%20prometheus%2Fops [13:02:35] hehe :P [13:02:36] RoanKattouw: Don't forget to run the creating database main. 
script [13:02:46] (03CR) 10Filippo Giunchedi: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [13:02:48] (03Merged) 10jenkins-bot: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:02:52] (03CR) 10jenkins-bot: Enable Translate extension in amwikimedia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422181 (https://phabricator.wikimedia.org/T180879) (owner: 10Ladsgroup) [13:02:56] that initial spike isn't too cheery, can they handle much more in the way of work? [13:03:01] https://phabricator.wikimedia.org/T180879#3916960 [13:03:03] yup [13:03:04] elukey: [13:03:12] I forgot and brought down the whole wiki last time [13:03:28] I am so not ready for bringing down the wikis today [13:03:35] let's try to avoid that, shall we [13:03:51] we shall :D [13:04:00] 👍 [13:04:33] Amir1: On mwdebug1002, please test [13:04:40] PROBLEM - puppet last run on contint2001 is CRITICAL: CRITICAL: Puppet has 1 failures. Last run 5 minutes ago with 1 failures. Failed resources (up to 3 shown): File[/srv/deployment-charts] [13:05:03] apergos: we might possibly need to tune down the number of runners via profile::mediawiki::jobrunner [13:05:13] hmm [13:05:14] RoanKattouw: https://am.wikimedia.org/wiki/%D5%8D%D5%BA%D5%A1%D5%BD%D5%A1%D6%80%D5%AF%D5%B8%D5%B2:%D5%8F%D5%A1%D6%80%D5%A2%D5%A5%D6%80%D5%A1%D5%AF says Translate is there [13:05:19] let's move forward [13:06:02] OK [13:07:15] (03CR) 10Catrope: [C: 032] Enable Flow on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421606 (https://phabricator.wikimedia.org/T190500) (owner: 10Urbanecm) [13:07:47] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Translate extension on amwikimedia (T180879) (duration: 01m 22s) [13:07:52] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:07:53] T180879: Install translate extension in amwikimedia - https://phabricator.wikimedia.org/T180879 [13:08:29] (03Merged) 10jenkins-bot: Enable Flow on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421606 (https://phabricator.wikimedia.org/T190500) (owner: 10Urbanecm) [13:08:57] (03PS1) 10Alexandros Kosiaris: Add profile::kubernetes::deployment_server::git_* [puppet] - 10https://gerrit.wikimedia.org/r/422399 (https://phabricator.wikimedia.org/T184924) [13:09:40] PROBLEM - puppet last run on pc1006 is CRITICAL: CRITICAL: Catalog fetch fail. 
Either compilation failed or puppetmaster has issues [13:12:22] (03CR) 10jenkins-bot: Enable Flow on euwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421606 (https://phabricator.wikimedia.org/T190500) (owner: 10Urbanecm) [13:13:00] (03PS2) 10Alexandros Kosiaris: Add profile::kubernetes::deployment_server::git_* [puppet] - 10https://gerrit.wikimedia.org/r/422399 (https://phabricator.wikimedia.org/T184924) [13:15:04] (03CR) 10Alexandros Kosiaris: [C: 032] Add profile::kubernetes::deployment_server::git_* [puppet] - 10https://gerrit.wikimedia.org/r/422399 (https://phabricator.wikimedia.org/T184924) (owner: 10Alexandros Kosiaris) [13:18:29] !log catrope@tin Synchronized dblists/flow.dblist: Enable Flow on euwiki (T190500) (duration: 01m 17s) [13:18:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:18:35] T190500: Enable Extension:StructuredDiscussions in Basque Wikipedia - https://phabricator.wikimedia.org/T190500 [13:19:19] RECOVERY - puppet last run on contint1001 is OK: OK: Puppet is currently enabled, last run 30 seconds ago with 0 failures [13:24:10] (03PS1) 10Imarlier: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) [13:27:04] (03PS1) 10Elukey: role::mediawiki::videoscaler: reduce the number of available runners [puppet] - 10https://gerrit.wikimedia.org/r/422402 [13:29:39] RECOVERY - puppet last run on contint2001 is OK: OK: Puppet is currently enabled, last run 1 minute ago with 0 failures [13:29:59] RECOVERY - Long running screen/tmux on labstore2003 is OK: OK: No SCREEN or tmux processes detected. [13:30:08] (03PS4) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [13:30:49] tgr: Are you here for your SWAT patches? [13:31:00] RoanKattouw: present [13:31:20] twentyafterfour: Did mediawiki.org disappear from group0 somehow? 
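One way to answer that kind of question from the deployment host is to look the wikis up in wikiversions.json. A sketch, assuming the usual layout of that file (a flat dbname → "php-1.31.0-wmf.NN" map kept under /srv/mediawiki-staging):

```python
# Sketch of checking which branch a wiki is pinned to, assuming
# wikiversions.json is a flat {"dbname": "php-1.31.0-wmf.NN", ...} map
# on the deployment host.
import json

with open('/srv/mediawiki-staging/wikiversions.json') as f:
    versions = json.load(f)

for dbname in ('testwiki', 'mediawikiwiki'):
    print(dbname, versions.get(dbname, 'not found'))
```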
[13:31:26] testwiki has wmf.27 but mw.org has 26 [13:31:43] (03PS2) 10Catrope: Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:31:47] (03CR) 10Catrope: [C: 032] Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:31:52] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [13:33:12] (03Merged) 10jenkins-bot: Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:34:49] (03CR) 10jenkins-bot: Enable Wikidata description override on enwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/420227 (https://phabricator.wikimedia.org/T184000) (owner: 10Gergő Tisza) [13:35:02] !log upgrade mariadb client on sarin, neodymium, terbium and wasat [13:35:05] (03CR) 10Elukey: "changes from the pcc perspective: https://puppet-compiler.wmflabs.org/compiler03/10715/" [puppet] - 10https://gerrit.wikimedia.org/r/422402 (owner: 10Elukey) [13:35:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:07] !log catrope@tin Synchronized php-1.31.0-wmf.27/extensions/Echo/modules/nojs/mw.echo.badge.less: Prevent FOUC when loading notification badges (duration: 01m 20s) [13:36:08] tgr: Wikidata description override is on mwdebug1002, please test [13:36:12] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:36:15] (03CR) 10BBlack: [C: 032] eqsin: turn-up HK + PH + JP [dns] - 10https://gerrit.wikimedia.org/r/422394 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [13:36:36] RoanKattouw: group0 didn't go out last night [13:36:50] I intend to fix that but Greg said to wait for the train window today [13:38:01] OK [13:38:24] was this before or after switch to deploy1001? 
just for my info [13:38:26] (03PS5) 10Ema: WIP: VCL: improve handling of uncacheable responses [puppet] - 10https://gerrit.wikimedia.org/r/421542 (https://phabricator.wikimedia.org/T180712) [13:38:27] RoanKattouw: I can see that the feature is enabled, I can't test more than that without writing content (which is probably a bad idea while it's only on one server) [13:38:51] apergos: after [13:38:57] ok, thanks [13:39:40] RECOVERY - puppet last run on pc1006 is OK: OK: Puppet is currently enabled, last run 2 minutes ago with 0 failures [13:39:47] tgr: you can always test other things are not broken, too :-) [13:40:10] well, the site works on mwdebug1002 [13:40:29] for me that would be enough on that context [13:40:29] OK rolling out then [13:40:31] I can't think of anything more specific that would be broken by this [13:41:45] (03PS2) 10Catrope: Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:41:48] (03CR) 10Catrope: [C: 032] Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:41:55] a) it can be seen it got enabled correctly (you will be amazed the times where that doesn't work), b) the site is still up c) related funcionality still works [13:42:06] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable Wikidata description override on enwik (T184000) (duration: 01m 18s) [13:42:11] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:42:12] T184000: Magic word on English WP to override display of Wikidata short description - https://phabricator.wikimedia.org/T184000 [13:43:09] (03Merged) 10jenkins-bot: Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:43:24] (03CR) 10jenkins-bot: Enable TemplateStyle on all Wikivoyages [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422246 (https://phabricator.wikimedia.org/T189838) (owner: 10Gergő Tisza) [13:44:05] (03CR) 10Elukey: [C: 032] role::mediawiki::videoscaler: reduce the number of available runners [puppet] - 10https://gerrit.wikimedia.org/r/422402 (owner: 10Elukey) [13:44:48] (03CR) 10Ottomata: Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:45:01] tgr: TemplateStyles on Wikivoyage is now on mwdebug1002, please test (to the extent practical) [13:47:58] RoanKattouw: again, the only thing that I can test is that it's enabled and the site is up, and those pass [13:48:43] (03PS2) 10Ottomata: Update kafka java.security file with Java 8 u162 changes [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) [13:49:11] (03CR) 10Ottomata: "Hm, btw, I wonder if my certpath.disableAlgorithms has some redundancy in it. 
Some of the default disabledAlgorthims are also listed in m" [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:49:32] OK, deploying [13:50:08] (03CR) 10Ottomata: Update kafka java.security file with Java 8 u162 changes (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/421891 (https://phabricator.wikimedia.org/T190400) (owner: 10Ottomata) [13:51:04] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable TemplateStyles on all Wikivoyages (T189838) (duration: 01m 17s) [13:51:08] !log reduced number of jobrunner runners on the videoscalers after the last burst of jobs that maxed out the cluster [13:51:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:51:11] T189838: Create and deploy configuration change to enable TemplateStyles on Wikivoyages on 2018-03-28 - https://phabricator.wikimedia.org/T189838 [13:51:15] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [13:52:06] I thought I would be able to test it in Special:ExpandTemplates but it seems like that ignores the default content model of the title and always assumes wikitext [13:52:25] not sure if that's a bug or I was just trying to use it for something it wasn't meant for [13:59:16] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4088092 (10fgiunchedi) [13:59:19] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4088090 (10fgiunchedi) 05Open>03Resolved This is completed, added documentation on pooling/depooling frontend/backend at https://wikitech.wik... [14:10:48] (03PS1) 10Ottomata: Replicate everything except change-prop and internal topics from main to jumbo [puppet] - 10https://gerrit.wikimedia.org/r/422408 (https://phabricator.wikimedia.org/T189464) [14:11:45] (03CR) 10Ottomata: [C: 032] Replicate everything except change-prop and internal topics from main to jumbo [puppet] - 10https://gerrit.wikimedia.org/r/422408 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [14:14:57] RoanKattouw: thanks! both changes seem to work fine [14:17:11] 10Operations, 10Puppet, 10Patch-For-Review, 10User-fgiunchedi: Upgrade Puppet Master Infrastructure to Debian Stretch - https://phabricator.wikimedia.org/T184562#4088144 (10fgiunchedi) [14:17:17] 10Operations, 10Puppet, 10Patch-For-Review: Failover puppet ca service from eqiad to codfw - https://phabricator.wikimedia.org/T189891#4088141 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi This is complete. Added documentation to https://wikitech.wikimedia.org/wiki/Puppet#Puppet_CA [14:18:39] (03PS3) 10Ppchelko: Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) [14:18:57] 10Operations, 10Puppet, 10Goal, 10Patch-For-Review: Modernize Puppet Configuration Management (2017-18 Q3 Goal) - https://phabricator.wikimedia.org/T184561#4088146 (10fgiunchedi) [14:20:16] 10Operations, 10ops-codfw, 10Traffic: cp2006, cp2010, cp2017: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088153 (10ema) The memory error situation when it comes to codfw cache hosts is pretty bad. Besides cp2006, cp2010, and cp2017 (found rebooting), I've now checked SEL and the... 
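Checking the SEL for this class of error, as described in the task update above, is easy to script per host. A rough sketch that assumes ipmitool is installed locally and that the controller logs these events with "Memory"/"Uncorrectable" wording (the match pattern is a guess, not a verified format):

```python
# Rough sketch: scan the System Event Log for uncorrectable memory events.
# Assumes ipmitool is installed and can talk to the local BMC; the substring
# match is an assumption about how these controllers word the entries.
import subprocess

sel = subprocess.run(['ipmitool', 'sel', 'list'],
                     capture_output=True, text=True, check=True).stdout

hits = [line for line in sel.splitlines()
        if 'Memory' in line and 'Uncorrectable' in line]
print(f'{len(hits)} uncorrectable memory events')
for line in hits:
    print(line)
```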
[14:21:29] (03CR) 10DCausse: Disable redis queue for cirrusSearch jobs for test wikis. (031 comment) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:24:34] 10Operations, 10Icinga, 10monitoring, 10Patch-For-Review: icinga restarted on each puppet run on standby server - https://phabricator.wikimedia.org/T190912#4088161 (10fgiunchedi) 05Open>03Resolved a:03fgiunchedi The change above fixed the problem so we're down to one restart per hour driven by `sync_... [14:25:21] (03CR) 10Mobrovac: "LGTM modulo the duplicate line David pointed out." [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:27:35] (03PS2) 10Jcrespo: mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) [14:27:37] (03PS4) 10Ppchelko: Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) [14:27:50] (03CR) 10Ppchelko: "Removed the duplicate line" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:28:01] (03CR) 10jerkins-bot: [V: 04-1] mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [14:28:07] (03Abandoned) 10Jcrespo: mariadb: Update socket location of misc services (m1, m2, m5) [puppet] - 10https://gerrit.wikimedia.org/r/413167 (https://phabricator.wikimedia.org/T183470) (owner: 10Jcrespo) [14:28:43] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088168 (10ema) [14:30:14] (03CR) 10DCausse: [C: 031] Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:31:08] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088172 (10faidon) These seem to be under warranty for another 2 months, so we should hurry up. 7 out of 22 identical hosts having memory errors so... [14:33:35] (03PS3) 10Jcrespo: phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) [14:33:41] 10Operations, 10wikidiff2, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4088178 (10thiemowmde) [14:33:45] (03CR) 10Mobrovac: [C: 031] Disable redis queue for cirrusSearch jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:33:47] (03CR) 10jerkins-bot: [V: 04-1] phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) (owner: 10Jcrespo) [14:34:36] 10Operations, 10wikidiff2, 10WMDE-QWERTY-Team-Board: Update wikidiff2 library on the WMF production cluster - https://phabricator.wikimedia.org/T190717#4088182 (10Lea_WMDE) >So, if I understand this right, the wikidiff extension needs additional changes beyond what is currently deployed on production and bet... [14:36:52] (03PS4) 10Jcrespo: phabricator/mariadb: Update database configuration for stretch/10.1 [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) [14:37:20] (03PS1) 10Gergő Tisza: Enable TemplateStyles on dewiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422414 (https://phabricator.wikimedia.org/T190910) [14:37:57] heads-up, i'll take over deploy1001 for 10 mins or so [14:39:03] mobrovac: in case you didn't know already, deployment server is back to tin [14:39:21] ah? wow [14:39:22] kk [14:39:24] thnx godog [14:39:53] np mobrovac, deployment.eqiad.wmnet cname does the right thing fwiw [14:40:27] i know but i don't like to use it because of offending keys :P [14:41:11] 10Operations, 10DC-Ops, 10Traffic, 10monitoring, and 2 others: memory errors not showing in icinga - https://phabricator.wikimedia.org/T183177#4088202 (10BBlack) See updates in T190540 , quite a few codfw hosts have SEL entries for uncorrectable ECC errors that went by unnoticed (but we tend to notice on r... [14:41:22] uh, why is wikiversions.json modified locally? [14:41:43] for testwikis [14:42:46] RoanKattouw: know anything about ^ ? [14:43:06] /srv/mediawiki-staging/wikiversions.json [14:43:20] (03CR) 10Jcrespo: "Manuel: Most of these changes have been done on other commits already, but the ones pending should be interesting to merge, maybe." [puppet] - 10https://gerrit.wikimedia.org/r/377693 (https://phabricator.wikimedia.org/T175679) (owner: 10Jcrespo) [14:44:20] (03Abandoned) 10Jcrespo: Revert "mariadb: Redo mariadb::backup class into role/profile style" [puppet] - 10https://gerrit.wikimedia.org/r/410131 (owner: 10Jcrespo) [14:46:56] ok, i'll just proceed, it won't interfere with what i want to do [14:48:28] (03CR) 10Mobrovac: [C: 032] Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:48:35] (03PS5) 10Mobrovac: Disable redis queue for cirrusSearch jobs for test wikis. [mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:49:35] (03PS6) 10Jcrespo: [WIP]Remove $::mw_primary variable from puppet [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) [14:49:57] (03CR) 10Filippo Giunchedi: [C: 031] Switch time server on dns5001 to Chrony [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [14:50:41] (03CR) 10jenkins-bot: Disable redis queue for cirrusSearch jobs for test wikis. 
[mediawiki-config] - 10https://gerrit.wikimedia.org/r/416992 (https://phabricator.wikimedia.org/T189137) (owner: 10Ppchelko) [14:52:29] (03CR) 10Jcrespo: "Joe, Volans: with the primary master setup being (complete?), maybe you have suggestion on how to complete the script ('# TODO: get the pr" [puppet] - 10https://gerrit.wikimedia.org/r/345346 (https://phabricator.wikimedia.org/T156924) (owner: 10Jcrespo) [14:52:39] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.wikibase-addUsagesForPage/p0 - lag:480 offset:4578205881 [14:53:03] (03CR) 10Ema: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:53:57] (03CR) 10Alexandros Kosiaris: [C: 04-1] "IIRC, we 've settled on having them at the node level for now." [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [14:54:02] !log ppchelko@tin Started deploy [cpjobqueue/deploy@c84880a]: Switch CirrusSearch jobs to kafka for test wikis [14:54:08] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:54:45] !log ppchelko@tin Finished deploy [cpjobqueue/deploy@c84880a]: Switch CirrusSearch jobs to kafka for test wikis (duration: 00m 44s) [14:54:50] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:55:18] (03CR) 10Filippo Giunchedi: [C: 031] "I was a little on the fence initially because cron not running is potentially harmful, though a failed restart should (!) trigger the "fai" [puppet] - 10https://gerrit.wikimedia.org/r/422391 (https://phabricator.wikimedia.org/T135991) (owner: 10Muehlenhoff) [14:56:15] !log mobrovac@tin Synchronized wmf-config/InitialiseSettings.php: Disable redis queue for cirrusSearch jobs for test wikis, file 1/2 - T189137 (duration: 01m 17s) [14:56:15] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [14:56:21] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:56:21] T189137: Migrate CirrusSearch jobs to Kafka queue - https://phabricator.wikimedia.org/T189137 [14:56:51] (03CR) 10Jcrespo: [C: 04-2] "Is thjs still relevant or should it be abandoned?" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/399792 (https://phabricator.wikimedia.org/T134476) (owner: 10Jcrespo) [14:57:19] mobrovac: https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=change-prop-wikibase-addUsagesForPage [14:57:23] expected? 
[14:57:27] (I saw the new alarms firing) [14:57:50] ah went down, the alarms are probably too sensitive [14:58:02] mobrovac: No idea, ask twentyafterfour [14:58:05] !log mobrovac@tin Synchronized wmf-config/jobqueue.php: Disable redis queue for cirrusSearch jobs for test wikis, file 2/2 - T189137 (duration: 01m 17s) [14:58:09] elukey: known [14:58:10] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [14:58:28] mobrovac: sure, let's discuss how to tune the alarms when you are less busy :) [15:00:05] !log stopping nodepool on labnodepool1001 [15:00:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:00:27] !log stopping nova-fullstack on labnet1001 for T189115 [15:00:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:01:52] ok, i'm done with tin [15:02:09] !log rebooting labservices1001 and labcontrol1001 for T189115 [15:02:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:07:03] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4088271 (10Papaul) switch port information asw-b6-codfw ge-6/0/13 [15:07:38] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4088272 (10Papaul) [15:07:39] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 1434.3123149425287 10.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [15:07:54] !log restarting nova-network on labnet1001 in case it's upset by the rabbit outage [15:07:59] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:30] !log restarting nova-fullstack on labnet1001 [15:08:34] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:08:40] so the Kafka mirror maker alarm's graph is wrong, since the issue is main -> jumbo [15:09:23] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker-new-consumer?refresh=5m&orgId=1 [15:09:38] the alarm needs to be updated [15:09:45] buuut it seems that mirror maker is doing a ton more work [15:09:55] possibly due to the new events flowing in? [15:10:28] which new events? [15:10:51] mobrovac: I saw Disable redis queue for cirrusSearch jobs and I thought there were changes :) [15:11:11] that only switched on testwiki though, should be tiny # of events [15:11:22] oh no no, these events have been there for a while, we are just processing them now instead of ignoring them [15:11:46] ottomata: hello :) [15:11:47] https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker-new-consumer?refresh=5m&orgId=1 [15:11:59] oh yeahhhhh [15:12:32] ottomata: might be related to https://grafana.wikimedia.org/dashboard/db/kafka-consumer-lag?orgId=1&var-datasource=eqiad%20prometheus%2Fops&var-cluster=main-eqiad&var-topic=All&var-consumer_group=change-prop-wikibase-addUsagesForPage ? [15:12:50] elukey: are you talking about the extra volume? [15:12:54] yeah [15:12:58] i re-added job topics just now [15:13:02] logged it in analytics ! :) [15:13:02] ahhhh [15:13:09] so it's good!
[15:13:14] !log restarting nodepool on labnodepool1001 (cleanup from T189115) [15:13:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:13:23] ottomata: scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 1 etc.. [15:13:27] argh nope [15:13:35] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m [15:13:38] etc.. [15:13:40] interesting! [15:13:43] an alert! [15:13:45] there are some icinga alerts firing [15:13:48] oh that [15:13:49] huhhh! [15:13:54] cool well the alerts work! that's cool [15:14:13] dropped 5 messages interesting indeed [15:14:19] a couple of comments though: 1) we'd need to update the dashboard link with the new-consumer stuff [15:14:33] 2) this one also fired [15:14:34] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.wikibase-addUsagesForPage/p0 - lag:480 offset:4578205881 [15:14:56] that is a bit confusing for people if they don't know what that is :D [15:15:32] hm, the consumer lag i don't think should have fired for this [15:15:34] it was too short [15:15:37] i should figure out how to adjust that [15:15:40] After a chat with Marko it seems to be a known issue, so we might need to tune the alarms to be less sensitive [15:15:43] yeah [15:15:47] ok :) [15:15:53] but the dropped messages one isn't good [15:15:57] that one I want to be very sensitive [15:16:05] that shouldn't happen [15:16:22] (03PS1) 10Arturo Borrero Gonzalez: labs: monitoring: fix permissions of /var/log/graphite [puppet] - 10https://gerrit.wikimedia.org/r/422417 (https://phabricator.wikimedia.org/T189871) [15:16:28] how can this happen? [15:16:34] (03PS3) 10Andrew Bogott: keystone-paste.ini: Remove deprecated extension filters [puppet] - 10https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) [15:17:01] (03CR) 10Arturo Borrero Gonzalez: [C: 032] labs: monitoring: fix permissions of /var/log/graphite [puppet] - 10https://gerrit.wikimedia.org/r/422417 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [15:17:25] well, i'm not totally sure what that metric is [15:17:36] i'd hope that it would not update committed offsets for that partition [15:17:41] and it would just reconsume or something [15:19:12] ottomata: ack [15:19:32] 10Operations, 10ops-codfw, 10DBA, 10hardware-requests: Decommission db2030 - https://phabricator.wikimedia.org/T187768#4088298 (10Papaul) [15:21:47] (03Abandoned) 10Arturo Borrero Gonzalez: wmcs: introduce hiera support for multiple labmon servers [puppet] - 10https://gerrit.wikimedia.org/r/422126 (https://phabricator.wikimedia.org/T189871) (owner: 10Arturo Borrero Gonzalez) [15:22:09] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.htmlCacheUpdate/p0 - lag:1069 offset:887305497 [15:24:22] ok, looking at that lag alert... [15:24:25] cool that it works though!
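For context on the dropped-message alert quoted above: it is an Icinga check built on the Prometheus expression shown in the alert text (the increase of kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages over 30 minutes). The sketch below is illustrative only — the Prometheus URL and the warning/critical thresholds are assumptions, not the production values — but it shows how such an expression can be evaluated against the Prometheus query API and mapped to Nagios-style exit codes:

```python
#!/usr/bin/env python3
# Illustrative sketch only; the Prometheus URL and thresholds are assumptions.
import math
import sys
import requests

PROMETHEUS = "http://prometheus.example.org/ops"  # hypothetical endpoint
QUERY = ('scalar(sum(increase('
         'kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages'
         '{mirror_name="main-eqiad_to_jumbo-eqiad"}[30m])))')
WARN, CRIT = 10.0, 1000.0  # illustrative thresholds

def main():
    r = requests.get(f"{PROMETHEUS}/api/v1/query",
                     params={"query": QUERY}, timeout=10)
    r.raise_for_status()
    # a scalar() query returns a [timestamp, "value"] pair
    value = float(r.json()["data"]["result"][1])
    if math.isnan(value):
        print("UNKNOWN - no data for dropped message count")
        return 3
    if value >= CRIT:
        print(f"CRITICAL - {value:.1f} messages dropped in the last 30m")
        return 2
    if value >= WARN:
        print(f"WARNING - {value:.1f} messages dropped in the last 30m")
        return 1
    print(f"OK - {value:.1f} messages dropped in the last 30m")
    return 0

if __name__ == "__main__":
    sys.exit(main())
```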
[15:24:42] (03PS1) 10Imarlier: wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) [15:25:08] hmmm, interesting [15:25:19] (03CR) 10BBlack: [C: 031] "Please, with all haste, we're waiting on the IN data :)" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:25:36] (03PS8) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) [15:26:04] elukey: so that alert is coming directly from burrow [15:26:09] (03CR) 10jerkins-bot: [V: 04-1] wmf-config: Enable oversampling for remaining countries in Asia [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:26:12] (03CR) 10Filippo Giunchedi: "While reviewing this it occurred to me that for less data loss upon unplanned failover you can send metrics to the slave via carbon-c-rela" (032 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [15:27:14] ottomata: yeah but it needs to be less sensitive since there are known lag that we shouldn't alert on [15:27:19] (03CR) 10Imarlier: "recheck" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:27:20] i actually dont' think i can make it less sensitive, [15:27:27] this is something burrow would email for [15:27:36] its using the /status endpoint [15:27:39] and burrow decides [15:28:13] there is a critical lag threshold in this check, but all it does is add extra alerting if burrow says things are warning, but we want to do critical earlier [15:28:16] so [15:28:33] i might need to switch the lag alert to be prometheus based instead of using the burrow checker [15:28:47] burrow -> prometheus -> icinga [15:28:50] rather than burrow -> icinga [15:29:11] ottomata: sure but we can tell to the nagios monitor to wait for say 3/4 times with X time between them before alerting no? [15:29:32] if burrow clears itself in the meantime, no alert fired [15:29:33] bblack: If you want https://gerrit.wikimedia.org/r/422401 to be deployed Right Now (ish), I'd be happy to do that (cc marlier) [15:30:16] can we do that? [15:30:46] oh! [15:30:49] # $retries [15:30:49] # Defaults to 3. The number of times a service will be retried before [15:30:49] # notifying [15:30:50] we can! [15:31:06] let's try that [15:31:18] defaults to 3 though [15:31:19] hm [15:31:33] there must be also a time between retries [15:31:39] the default should be very low [15:31:56] yeah, wonder how often nrpe checks get run [15:31:58] marlier: RoanKattouw above offers to shove the first update now, if you're ok with how it looks presently [15:32:23] ottomata: iirc it was a minute, there are some defaults [15:32:32] RoanKattouw: bblack: works for me, if you don't mind [15:32:36] hm, ok, then let's set retries to 30? 
[15:32:47] oh retry_interval [15:32:48] hm [15:32:49] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is OK: OK - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [15:32:51] ottomata: yeah [15:32:51] ooo ok [15:33:15] (03CR) 10Filippo Giunchedi: [C: 031] mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:33:16] Alright, on its way [15:33:23] thanks! [15:33:26] (03CR) 10Catrope: [C: 032] wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:33:30] (03PS2) 10Catrope: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:33:35] (03CR) 10Catrope: [C: 032] wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:33:49] ottomata: let's also set the contact group for 'analytics' for the moment to avoid false positives in here that might confuse people [15:34:27] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4088342 (10ayounsi) [15:34:50] (03Merged) 10jenkins-bot: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:35:26] hmm ok [15:37:25] !log catrope@tin Synchronized wmf-config/InitialiseSettings.php: Enable oversampling for IN, GU, MP in preparation for eqsin (T189252) (duration: 01m 18s) [15:37:31] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [15:37:31] T189252: Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252 [15:37:49] (03CR) 10Filippo Giunchedi: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [15:38:09] Alright there you go [15:38:20] (03CR) 10jenkins-bot: wmf-config/InitialiseSettings.php: Enable oversample for additional countries [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422401 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [15:38:36] Brilliant, thanks RoanKattouw! 
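To make the lag-alert discussion above easier to follow: the consumer-group lag check does not compare a raw number against a threshold — it asks Burrow's consumer status endpoint for its verdict, which is why the check itself cannot simply be made "less sensitive"; the knobs that help are the Icinga service's retries and retry_interval, as worked out above. A minimal sketch of such a Burrow-backed check follows; the Burrow host, port and API path are assumptions (the path and field names differ between Burrow releases), not the production configuration:

```python
#!/usr/bin/env python3
# Minimal sketch, not the production checker; host, port and API path are assumptions.
import sys
import requests

BURROW = "http://kafkamon1001.example.org:8000"  # hypothetical
CLUSTER = "main-eqiad"
GROUP = "kafka-mirror-main-eqiad_to_jumbo-eqiad"

def main():
    # Burrow (pre-1.0) exposes /v2/kafka/<cluster>/consumer/<group>/status;
    # newer releases use a /v3 path with slightly different response fields.
    url = f"{BURROW}/v2/kafka/{CLUSTER}/consumer/{GROUP}/status"
    body = requests.get(url, timeout=10).json()
    state = body.get("status", {}).get("status", "UNKNOWN")
    if state in ("ERR", "STOP", "STALL"):
        print(f"CRITICAL: consumer group {GROUP} is in state {state}")
        return 2
    if state == "WARN":
        print(f"WARNING: consumer group {GROUP} is in state {state}")
        return 1
    if state == "OK":
        print(f"OK: consumer group {GROUP} is in state {state}")
        return 0
    print(f"UNKNOWN: consumer group {GROUP} reported state {state}")
    return 3

if __name__ == "__main__":
    sys.exit(main())
```

On the Icinga side, retries and retry_interval control how many consecutive soft failures are needed before a notification goes out, so raising the retry count lets short-lived lag spikes like the htmlCacheUpdate one above clear on their own before anyone is paged.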
[15:38:53] (03PS1) 10Ottomata: Alert on lag in last 30 minutes, alert mirror maker lag for analytics [puppet] - 10https://gerrit.wikimedia.org/r/422424 (https://phabricator.wikimedia.org/T189611) [15:39:16] lmk if you need more deployed, I'm eating dinner but am still pingable [15:39:35] (03CR) 10Ottomata: [C: 032] Alert on lag in last 30 minutes, alert mirror maker lag for analytics [puppet] - 10https://gerrit.wikimedia.org/r/422424 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [15:39:52] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088357 (10RobH) [15:41:19] PROBLEM - Kafka main-eqiad consumer group lag for kafka-mirror-main-eqiad_to_jumbo-eqiad on kafkamon1001 is CRITICAL: CRITICAL: Group is in an error state. Worst Lag: eqiad.mediawiki.job.htmlCacheUpdate/p0 - lag:1 offset:887358124 [15:43:44] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088365 (10RobH) [15:51:43] (03PS1) 10Jrdnch: Update to glibc >=2.19 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) [15:56:13] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088416 (10RobH) [15:56:58] (03CR) 10Bstorm: "Just a note: I am refactoring this to more correctly match standards as well as make the linter happier." [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) (owner: 10Bstorm) [16:05:43] (03CR) 10Arturo Borrero Gonzalez: [C: 032] dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:05:53] (03CR) 10Arturo Borrero Gonzalez: [C: 031] dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:07:41] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088458 (10RobH) a:03Papaul After reviewing with traffic team, we're goign to test memory in all of these. I've updated the task description with... 
[16:08:22] (03PS2) 10BBlack: eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) [16:08:24] (03PS2) 10BBlack: eqsin: turn-up India [dns] - 10https://gerrit.wikimedia.org/r/422395 (https://phabricator.wikimedia.org/T189252) [16:12:49] (03CR) 10Rush: [C: 031] "the only thing I'm not sure of is if there are specific package logrotate directives that are counting on daily runs to do the right thing" [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:13:07] 10Operations, 10ops-codfw, 10DC-Ops, 10hardware-requests: Decommission restbase-test200[123] - https://phabricator.wikimedia.org/T187447#4088465 (10Papaul) switch port information asw-b5-codfw restbase-test2001 ge-5/0/19 restbase-test2002 ge-5/0/16 restbase-test2003 ge-5/0/20 [16:15:08] (03CR) 10Rush: [C: 031] "Part of me says we should put this under teh toollabs modules as it will make sense on execs and k8s workers too but good next step if we " [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) (owner: 10Bstorm) [16:15:32] (03CR) 10Arturo Borrero Gonzalez: ">" [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [16:24:08] (03PS1) 10Chad: WIP: Initial crappy implementation of Github repo creation [software/gerrit/plugins/wikimedia] - 10https://gerrit.wikimedia.org/r/422429 [16:26:22] (03CR) 10Rush: [C: 04-1] wmcs: monitoring: rsync whisper files between mon servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [16:26:25] (03PS1) 10Ottomata: Be more lenient about MirrorMaker numDroppedMessages alert [puppet] - 10https://gerrit.wikimedia.org/r/422430 (https://phabricator.wikimedia.org/T189611) [16:27:34] (03CR) 10Ottomata: [C: 032] Be more lenient about MirrorMaker numDroppedMessages alert [puppet] - 10https://gerrit.wikimedia.org/r/422430 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [16:28:07] (03CR) 10Alexandros Kosiaris: [C: 032] lttoolbox: Update to latest upstream release [debs/contenttranslation/lttoolbox] - 10https://gerrit.wikimedia.org/r/419346 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:28:55] (03CR) 10Bstorm: "I looked through it. I was surprised to find that there wasn't any. 
You can tell it to rotate hourly, but it won't actually do it unless" [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:29:34] (03PS1) 10Ottomata: Increase MirrorMaker main -> jumbo heap size [puppet] - 10https://gerrit.wikimedia.org/r/422431 (https://phabricator.wikimedia.org/T189464) [16:29:58] (03PS2) 10Bstorm: dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) [16:30:07] (03PS5) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [16:30:14] (03CR) 10Ottomata: [C: 032] Increase MirrorMaker main -> jumbo heap size [puppet] - 10https://gerrit.wikimedia.org/r/422431 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [16:31:47] (03CR) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:32:41] (03PS3) 10Bstorm: dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) [16:33:11] (03PS2) 10Vgutierrez: mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) [16:33:36] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088571 (10Papaul) All those systems are running outdated IDRAC and BIOS version. I will like to update the IDRAC and BIOS first before running the... [16:33:48] (03CR) 10Bstorm: [C: 032] dynamicproxy: run logrotate hourly [puppet] - 10https://gerrit.wikimedia.org/r/422197 (https://phabricator.wikimedia.org/T190218) (owner: 10Bstorm) [16:33:50] (03CR) 10Vgutierrez: [C: 032] mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:37:00] (03PS4) 10Andrew Bogott: keystone-paste.ini: Remove deprecated extension filters [puppet] - 10https://gerrit.wikimedia.org/r/422352 (https://phabricator.wikimedia.org/T187954) [16:37:02] (03PS1) 10Andrew Bogott: nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 [16:37:04] (03PS1) 10Andrew Bogott: nova.conf: remove memcached setting [puppet] - 10https://gerrit.wikimedia.org/r/422433 (https://phabricator.wikimedia.org/T187954) [16:37:09] (03CR) 10Arturo Borrero Gonzalez: wmcs: monitoring: rsync whisper files between mon servers (034 comments) [puppet] - 10https://gerrit.wikimedia.org/r/422389 (https://phabricator.wikimedia.org/T190512) (owner: 10Arturo Borrero Gonzalez) [16:37:37] (03CR) 10jerkins-bot: [V: 04-1] nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 (owner: 10Andrew Bogott) [16:37:41] !log T189075 upload lttoolbox_3.4.0~r84331-1+wmf1 to apt.wikimedia.org/jessie-wikimedia/main [16:37:43] (03PS3) 10Vgutierrez: mtail: Add varnish_resourceloader_resp in varnishrls [puppet] - 10https://gerrit.wikimedia.org/r/422381 (https://phabricator.wikimedia.org/T184942) [16:37:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [16:37:49] T189075: Package apertium-separable and dependencies - https://phabricator.wikimedia.org/T189075 
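On the logrotate point discussed above: logrotate only acts when it is invoked, so an hourly directive in a config has no effect unless logrotate itself is run at least hourly (for example from /etc/cron.hourly or a systemd timer). A small sanity-check sketch, assuming Debian-style paths chosen here purely for illustration:

```python
#!/usr/bin/env python3
# Sketch, assuming Debian-style paths: warn if a config asks for 'hourly'
# rotation while no hourly invocation of logrotate appears to be scheduled.
import glob
import os

CONFIGS = ["/etc/logrotate.conf"] + sorted(glob.glob("/etc/logrotate.d/*"))
HOURLY_RUNNERS = ["/etc/cron.hourly/logrotate"]  # a systemd timer would also count

def wants_hourly(path):
    try:
        with open(path) as f:
            # ignore comments, look for a bare 'hourly' directive
            return any(line.split("#", 1)[0].strip() == "hourly" for line in f)
    except OSError:
        return False

hourly_configs = [p for p in CONFIGS if wants_hourly(p)]
has_hourly_run = any(os.path.exists(p) for p in HOURLY_RUNNERS)

if hourly_configs and not has_hourly_run:
    print("WARNING: 'hourly' requested but logrotate is not scheduled hourly:")
    for p in hourly_configs:
        print("  ", p)
else:
    print("OK: hourly configs:", len(hourly_configs),
          "- hourly invocation present:", has_hourly_run)
```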
[16:40:44] (03CR) 10Elukey: "Thanks a lot for this work! Tested on a Jessie system with glibc 2.19, works fine. Left a comment for the documentation :)" (031 comment) [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) (owner: 10Jrdnch) [16:46:27] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium] - 10https://gerrit.wikimedia.org/r/419351 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:46:32] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:46:34] (03PS2) 10Andrew Bogott: nova.conf: use entry point name for scheduler_driver [puppet] - 10https://gerrit.wikimedia.org/r/422432 (https://phabricator.wikimedia.org/T187954) [16:46:36] (03PS2) 10Andrew Bogott: nova.conf: remove memcached setting [puppet] - 10https://gerrit.wikimedia.org/r/422433 (https://phabricator.wikimedia.org/T187954) [16:46:38] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:46:42] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:46:46] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:46:50] (03CR) 10Filippo Giunchedi: mtail: Provide ttfb histogram for varnishbackend (031 comment) [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) (owner: 10Vgutierrez) [16:46:52] (03CR) 10Alexandros Kosiaris: "recheck" [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:47:00] (03CR) 10jerkins-bot: [V: 04-1] apertium-separable: Initial Debian packaging [debs/contenttranslation/apertium-separable] - 10https://gerrit.wikimedia.org/r/421808 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:47:39] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra-cat: New upstream release [debs/contenttranslation/apertium-fra-cat] - 10https://gerrit.wikimedia.org/r/421859 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:47:42] (03CR) 10jerkins-bot: [V: 04-1] apertium-lex-tools: New upstream release [debs/contenttranslation/apertium-lex-tools] - 10https://gerrit.wikimedia.org/r/419356 (https://phabricator.wikimedia.org/T189075) (owner: 10KartikMistry) [16:47:56] (03CR) 10jerkins-bot: [V: 04-1] apertium-cat: New upstream release [debs/contenttranslation/apertium-cat] - 10https://gerrit.wikimedia.org/r/421825 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:48:00] (03CR) 10jerkins-bot: [V: 04-1] apertium-fra: New upstream release [debs/contenttranslation/apertium-fra] - 10https://gerrit.wikimedia.org/r/421813 (https://phabricator.wikimedia.org/T189076) (owner: 10KartikMistry) [16:48:02] RECOVERY - CirrusSearch eqiad 95th percentile latency on graphite1001 is OK: OK: Less than 20.00% above the threshold [500.0] https://grafana.wikimedia.org/dashboard/db/elasticsearch-percentiles?panelId=19&fullscreen&orgId=1&var-cluster=eqiad&var-smoothing=1 [16:54:01] PROBLEM - Host 
cp2002 is DOWN: PING CRITICAL - Packet loss = 100% [16:54:11] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5004 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5002 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5003 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp5001 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on cp3036 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:11] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:54:12] PROBLEM - IPsec on cp4024 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:12] PROBLEM - IPsec on cp4021 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:13] PROBLEM - IPsec on cp4022 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:13] PROBLEM - IPsec on cp4025 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:54:14] PROBLEM - IPsec on cp3049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:03] 10Operations, 10Traffic, 10Patch-For-Review, 10Performance-Team (Radar): Define turn-up process and scope for eqsin service to regional countries - https://phabricator.wikimedia.org/T189252#4088630 (10ayounsi) [16:58:21] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:22] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:22] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:22] PROBLEM - IPsec on cp1064 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:31] PROBLEM - IPsec on cp1073 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:31] PROBLEM - IPsec on cp4023 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:32] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:32] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:32] PROBLEM - IPsec on cp4026 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp1072 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp1074 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp3034 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp3044 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:41] PROBLEM - IPsec on cp3046 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:42] PROBLEM - IPsec on cp1050 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 133 no-child-sa: cp3007_v6 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 
[16:58:51] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on cp1049 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:51] PROBLEM - IPsec on cp1099 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on cp3045 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on cp3047 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on cp3048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:58:52] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2002_v4, cp2002_v6 [16:59:02] PROBLEM - IPsec on cp3037 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [16:59:48] hello cp2001 [16:59:51] err 2002 [17:00:05] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Morning SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1700). [17:00:05] No GERRIT patches in the queue for this window AFAICS. [17:00:51] PROBLEM - IPsec on cp1048 is CRITICAL: Strongswan CRITICAL - ok: 64 not-conn: cp2002_v4, cp2002_v6 [17:00:52] RECOVERY - Host cp2002 is UP: PING WARNING - Packet loss = 37%, RTA = 36.07 ms [17:01:01] RECOVERY - IPsec on cp3048 is OK: Strongswan OK - 66 ESP OK [17:01:01] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [17:01:02] RECOVERY - IPsec on cp3037 is OK: Strongswan OK - 66 ESP OK [17:01:11] RECOVERY - IPsec on cp5001 is OK: Strongswan OK - 66 ESP OK [17:01:11] RECOVERY - IPsec on cp5002 is OK: Strongswan OK - 66 ESP OK [17:01:11] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [17:01:21] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [17:01:21] RECOVERY - IPsec on cp5003 is OK: Strongswan OK - 66 ESP OK [17:01:21] RECOVERY - IPsec on cp5004 is OK: Strongswan OK - 66 ESP OK [17:01:21] RECOVERY - IPsec on cp4021 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp4022 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp4024 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp3036 is OK: Strongswan OK - 66 ESP OK [17:01:22] RECOVERY - IPsec on cp4025 is OK: Strongswan OK - 66 ESP OK [17:01:24] RECOVERY - IPsec on cp3049 is OK: Strongswan OK - 66 ESP OK [17:01:24] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [17:01:24] RECOVERY - IPsec on cp5005 is OK: Strongswan OK - 66 ESP OK [17:01:24] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [17:01:25] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [17:01:25] RECOVERY - IPsec on cp3038 is OK: Strongswan OK - 66 ESP OK [17:01:41] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [17:01:41] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [17:01:41] RECOVERY - IPsec on cp4026 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp1072 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp1074 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp1050 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp3034 is OK: Strongswan OK - 66 ESP OK [17:01:42] RECOVERY - IPsec on cp3044 is OK: Strongswan OK - 66 ESP OK [17:01:42] 
RECOVERY - IPsec on cp3046 is OK: Strongswan OK - 66 ESP OK [17:01:51] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [17:01:51] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [17:01:52] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [17:01:52] RECOVERY - IPsec on cp1048 is OK: Strongswan OK - 66 ESP OK [17:01:52] RECOVERY - IPsec on cp1049 is OK: Strongswan OK - 66 ESP OK [17:01:52] RECOVERY - IPsec on cp1099 is OK: Strongswan OK - 66 ESP OK [17:02:01] RECOVERY - IPsec on cp3045 is OK: Strongswan OK - 66 ESP OK [17:02:01] RECOVERY - IPsec on cp3047 is OK: Strongswan OK - 66 ESP OK [17:02:31] PROBLEM - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is CRITICAL: CRITICAL - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))): 2097.0972689655173 1000.0 https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [17:02:31] RECOVERY - IPsec on cp1064 is OK: Strongswan OK - 66 ESP OK [17:03:31] RECOVERY - Kafka MirrorMaker main-eqiad_to_jumbo-eqiad dropped message count in last 30m on einsteinium is OK: OK - scalar(sum(increase(kafka_tools_MirrorMaker_MirrorMaker_numDroppedMessages{mirror_name=main-eqiad_to_jumbo-eqiad} [30m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kafka-mirrormaker?var-datasource=eqiad%2520prometheus%252Fops&var-mirror_name=main-eqiad_to_jumbo-eqiad [17:06:44] (03PS6) 10Vgutierrez: mtail: Provide ttfb histogram for varnishbackend [puppet] - 10https://gerrit.wikimedia.org/r/422155 (https://phabricator.wikimedia.org/T184942) [17:06:54] hm [17:09:31] PROBLEM - Host cp2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:16:11] RECOVERY - Host cp2003 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [17:23:25] (03PS2) 10Jrdnch: Update to glibc >=2.19 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) [17:25:12] PROBLEM - Host cp2003 is DOWN: PING CRITICAL - Packet loss = 100% [17:28:47] Hi ops team - I'm about to deploy analytics-refinery (scheduled hadoop jobs conf) [17:29:07] No impact whatsoever on mediawiki side [17:29:25] (03PS9) 10Elukey: coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [17:30:06] jouncebot: next [17:30:06] In 0 hour(s) and 29 minute(s): Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1800) [17:30:14] (03PS3) 10Jrdnch: Update to glibc >=2.19 [software/varnish/varnishkafka] - 10https://gerrit.wikimedia.org/r/422425 (https://phabricator.wikimedia.org/T186250) [17:30:19] joal: FYI --^ (useful) [17:30:43] (03CR) 10Elukey: [C: 032] coal: be smarter about consuming from Kafka [puppet] - 10https://gerrit.wikimedia.org/r/421933 (https://phabricator.wikimedia.org/T110903) (owner: 10Imarlier) [17:31:48] elukey: I'll ask you more on this tomorrow I think :) [17:32:57] !log joal@tin Started deploy [analytics/refinery@7135d44]: Regular weekly analytics deploy - Scheduled hadoop jobs updates [17:33:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:17] !log joal@tin Finished deploy [analytics/refinery@7135d44]: Regular weekly analytics deploy - Scheduled hadoop jobs updates (duration: 05m 21s) [17:38:22] Logged the 
message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [17:38:47] (03PS2) 10Andrew Bogott: toolforge: Add wikimedia.org to the CSP allowed list [puppet] - 10https://gerrit.wikimedia.org/r/422064 (https://phabricator.wikimedia.org/T130748) (owner: 10BryanDavis) [17:39:24] (03CR) 10Andrew Bogott: [C: 032] toolforge: Add wikimedia.org to the CSP allowed list [puppet] - 10https://gerrit.wikimedia.org/r/422064 (https://phabricator.wikimedia.org/T130748) (owner: 10BryanDavis) [17:41:03] PROBLEM - etcd request latencies on chlorine is CRITICAL: CRITICAL - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.45:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.45:6443}[5m]))): 116328.3594890511 = 50000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:41:10] 10Operations, 10TemplateStyles, 10Traffic, 10Wikimedia-Extension-setup, and 4 others: Deploy TemplateStyles to WMF production - https://phabricator.wikimedia.org/T133410#4088764 (10Tgr) [17:41:54] RECOVERY - etcd request latencies on chlorine is OK: OK - scalar( sum(rate(etcd_request_latencies_summary_sum{ job=k8s-api,instance=10.64.0.45:6443}[5m]))/ sum(rate(etcd_request_latencies_summary_count{ job=k8s-api,instance=10.64.0.45:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:43:35] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4088779 (10Cmjohnson) [17:48:44] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4088796 (10Mholloway) 05Open>03Resolved Ah, no need to worry about reading lists, then. Sorry for the partially... 
[17:49:39] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for RI-maintained services - https://phabricator.wikimedia.org/T189524#4088802 (10Mholloway) [17:50:01] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4044468 (10Mholloway) [17:52:37] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#4088811 (10Cmjohnson) [17:52:51] 10Operations, 10ops-eqiad, 10hardware-requests, 10Patch-For-Review: Decommission mw1196 - https://phabricator.wikimedia.org/T170441#3431403 (10Cmjohnson) 05Open>03Resolved [17:54:03] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [17:55:00] 10Operations, 10ops-eqiad, 10DC-Ops, 10hardware-requests, and 3 others: Decommission restbase-test environment - https://phabricator.wikimedia.org/T186755#4088822 (10Cmjohnson) 05Open>03Resolved a:03Cmjohnson This was done awhile ago resolving [17:56:53] PROBLEM - Request latencies on neon is CRITICAL: CRITICAL - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))): 36140855.404015064 = 100000.0 https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:57:33] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:34] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:43] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:53] RECOVERY - Request latencies on neon is OK: OK - scalar( sum(rate(apiserver_request_latencies_summary_sum{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))/ sum(rate(apiserver_request_latencies_summary_count{ job=k8s-api,verb!=WATCH,verb!=WATCHLIST,instance=10.64.0.40:6443}[5m]))) within thresholds https://grafana.wikimedia.org/dashboard/db/kubernetes-api [17:57:53] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:53] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:57:54] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:57:54] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:58:04] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [17:58:13] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:14] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:58:14] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:14] PROBLEM - 
IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [17:58:23] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:00:04] Deploy window Pre MediaWiki train sanity break (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1800) [18:00:04] No GERRIT patches in the queue for this window AFAICS. [18:00:34] twentyafterfour: now? [18:01:06] greg-g: cool [18:03:13] 10Operations, 10ops-eqiad, 10hardware-requests: decom iridium - https://phabricator.wikimedia.org/T172487#4088853 (10Cmjohnson) [18:03:53] !log deploying 1.31.0-wmf.27 to group0. group1 in an hour. See T183966 for blockers. [18:03:58] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:03:59] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [18:06:07] (03PS1) 1020after4: group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 [18:06:09] (03CR) 1020after4: [C: 032] group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 (owner: 1020after4) [18:07:27] (03Merged) 10jenkins-bot: group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 (owner: 1020after4) [18:08:03] RECOVERY - Host cp2006 is UP: PING WARNING - Packet loss = 61%, RTA = 36.10 ms [18:08:13] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 40 ESP OK [18:08:13] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [18:08:13] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [18:08:14] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [18:08:14] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [18:08:23] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [18:08:23] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [18:08:23] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [18:08:24] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 40 ESP OK [18:08:33] (03CR) 10jenkins-bot: group0 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422448 (owner: 1020after4) [18:08:33] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [18:08:34] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [18:08:43] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [18:08:43] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [18:08:44] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [18:08:44] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 40 ESP OK [18:08:53] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [18:08:54] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [18:08:54] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [18:08:54] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [18:09:13] PROBLEM - Host cp2009 is DOWN: PING CRITICAL - Packet loss = 100% [18:09:41] (03PS17) 10ArielGlenn: Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 
(https://phabricator.wikimedia.org/T189657) [18:10:03] PROBLEM - Host cp2006 is DOWN: PING CRITICAL - Packet loss = 100% [18:10:42] (03CR) 10ArielGlenn: [C: 032] Store all dataset/dumps mirrors info in one hiera structure, and use it [puppet] - 10https://gerrit.wikimedia.org/r/419390 (https://phabricator.wikimedia.org/T189657) (owner: 10ArielGlenn) [18:11:42] (03Abandoned) 10Sbisson: kartotherian/tilerator: set Last-Modified header [puppet] - 10https://gerrit.wikimedia.org/r/421522 (https://phabricator.wikimedia.org/T187300) (owner: 10Sbisson) [18:12:13] !log upgrading restbase-dev1004-a to cassandra 3.11.2 (canary) -- T178905 [18:12:16] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group0 wikis to 1.31.0-wmf.27 [18:12:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:12:20] T178905: Evaluate new upstream Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:12:24] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:15:13] PROBLEM - IPsec on kafka-jumbo1006 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:14] PROBLEM - IPsec on cp1061 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:15:14] PROBLEM - IPsec on cp3007 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [18:15:15] PROBLEM - IPsec on kafka-jumbo1004 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:15] PROBLEM - IPsec on kafka-jumbo1001 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:23] PROBLEM - IPsec on cp1045 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:15:23] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:24] PROBLEM - IPsec on kafka1013 is CRITICAL: Strongswan CRITICAL - ok: 133 no-child-sa: cp3040_v6 not-conn: cp2006_v4, cp2006_v6 [18:15:33] PROBLEM - IPsec on cp3008 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [18:15:34] PROBLEM - IPsec on kafka1014 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:34] PROBLEM - IPsec on kafka-jumbo1005 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:43] PROBLEM - IPsec on kafka-jumbo1002 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:44] PROBLEM - IPsec on kafka1020 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:44] PROBLEM - IPsec on kafka-jumbo1003 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:53] PROBLEM - IPsec on cp3010 is CRITICAL: Strongswan CRITICAL - ok: 38 not-conn: cp2006_v4, cp2006_v6 [18:15:54] PROBLEM - IPsec on kafka1012 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:54] PROBLEM - IPsec on kafka1022 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:15:54] PROBLEM - IPsec on cp1058 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:16:03] PROBLEM - IPsec on cp1051 is CRITICAL: Strongswan CRITICAL - ok: 12 not-conn: cp2006_v4, cp2006_v6 [18:16:23] RECOVERY - Host cp2009 is UP: PING OK - Packet loss = 0%, RTA = 36.94 ms [18:16:48] anything we need to be worried about wrt the train? 
^^ [18:17:12] cc mutante XioNoX ^ [18:17:17] !log upgrading restbase-dev1004-b to cassandra 3.11.2 (canary) -- T178905 [18:17:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:17:23] T178905: Evaluate new upstream Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:18:00] (03PS1) 10Cmjohnson: Removing mgmt dns db1030 [dns] - 10https://gerrit.wikimedia.org/r/422451 (https://phabricator.wikimedia.org/T184397) [18:18:27] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns db1030 [dns] - 10https://gerrit.wikimedia.org/r/422451 (https://phabricator.wikimedia.org/T184397) (owner: 10Cmjohnson) [18:18:31] greg-g: I'm pretty sure ipsec is unrelated to the train [18:19:25] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1030 - https://phabricator.wikimedia.org/T184397#4088932 (10Cmjohnson) [18:19:27] bblack: ^ [18:20:31] twentyafterfour: yeah, it's at a different layer, just making sure nothing larger is going on :) [18:21:43] (03PS1) 10Cmjohnson: Removing mgmt dns for db1001 [dns] - 10https://gerrit.wikimedia.org/r/422452 (https://phabricator.wikimedia.org/T190262) [18:22:26] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for db1001 [dns] - 10https://gerrit.wikimedia.org/r/422452 (https://phabricator.wikimedia.org/T190262) (owner: 10Cmjohnson) [18:23:46] (03PS2) 10MusikAnimal: Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) [18:25:00] (03PS1) 10ArielGlenn: clean up internal rsync client list for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/422454 [18:25:46] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1001 - https://phabricator.wikimedia.org/T190262#4088955 (10Cmjohnson) [18:25:57] greg-g: the various cp2NNNN / ipsec alerts are of no operational consequence you should worry about. papaul's doing some hardware reboots in codfw to investigate memory issues. unfortunately at least the ipsec-spam part of it is relatively-unavoidable. [18:26:04] (03PS1) 10Cmjohnson: Removing mgmt dns for db1011 [dns] - 10https://gerrit.wikimedia.org/r/422455 (https://phabricator.wikimedia.org/T184703) [18:26:29] greg-g: (sorry for the noise!) 
[18:26:55] (03PS2) 10Cmjohnson: Removing mgmt dns for db1011 [dns] - 10https://gerrit.wikimedia.org/r/422455 (https://phabricator.wikimedia.org/T184703) [18:27:26] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns for db1011 [dns] - 10https://gerrit.wikimedia.org/r/422455 (https://phabricator.wikimedia.org/T184703) (owner: 10Cmjohnson) [18:28:21] 10Operations, 10ops-eqiad, 10DBA, 10hardware-requests, 10Patch-For-Review: Decommission db1011 - https://phabricator.wikimedia.org/T184703#4088963 (10Cmjohnson) [18:28:24] PROBLEM - IPsec on kafka1023 is CRITICAL: Strongswan CRITICAL - ok: 134 not-conn: cp2006_v4, cp2006_v6 [18:30:00] (03PS1) 10Cmjohnson: Removing mgmt dns from db1016 [dns] - 10https://gerrit.wikimedia.org/r/422459 (https://phabricator.wikimedia.org/T190179) [18:31:51] (03CR) 10Cmjohnson: [C: 032] Removing mgmt dns from db1016 [dns] - 10https://gerrit.wikimedia.org/r/422459 (https://phabricator.wikimedia.org/T190179) (owner: 10Cmjohnson) [18:35:28] bblack: s'ok, I just like double checking :) [18:36:43] (03PS3) 10BBlack: eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) [18:37:34] (03CR) 10BBlack: [C: 032] eqsin: turn-up BD, LK, NP, PK [dns] - 10https://gerrit.wikimedia.org/r/422396 (https://phabricator.wikimedia.org/T189252) (owner: 10BBlack) [18:39:51] (03PS3) 10Bstorm: wiki replicas: refactor and record grants and set user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [18:43:05] 10Operations, 10ops-codfw, 10Traffic: cp[2006,2008,2010-2011,2017-2018,2022].codfw.wmnet: Uncorrectable Memory Error - https://phabricator.wikimedia.org/T190540#4088993 (10Papaul) Cp2003 result {F16367386} [18:43:33] RECOVERY - Host cp2003 is UP: PING OK - Packet loss = 0%, RTA = 36.06 ms [18:44:20] 10Operations, 10Deployments, 10Beta-Cluster-reproducible, 10HHVM, and 2 others: Switch mwscript from Zend PHP5 to default php alternative (e.g. HHVM or PHP7) - https://phabricator.wikimedia.org/T146285#4089001 (10mmodell) [18:52:33] !log upgrading restbase-dev1005-{a,b} to cassandra 3.11.2 -- T178905 [18:52:38] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [18:52:38] T178905: Evaluate new upstream Cassandra release: 3.11.2 - https://phabricator.wikimedia.org/T178905 [18:54:45] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4089049 (10RobH) [18:55:24] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4072956 (10RobH) @Lucas_Werkmeister_WMDE: I'll go ahead and prepare the patchsets, however we're still lacking a WMF staff sponsorship on this request. Is there a particul... [18:56:05] (03PS1) 10Rush: openstack: add nbd kernel module to compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/422465 [18:58:03] (03CR) 10Rush: [C: 032] openstack: add nbd kernel module to compute nodes [puppet] - 10https://gerrit.wikimedia.org/r/422465 (owner: 10Rush) [19:00:04] twentyafterfour: How many deployers does it take to do MediaWiki train deploy? (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T1900). [19:00:04] No GERRIT patches in the queue for this window AFAICS. 
[19:01:15] 10Operations, 10Ops-Access-Requests: Requesting access to stats machines for Lucas Werkmeister - https://phabricator.wikimedia.org/T190415#4072956 (10Nuria) Can you be a bit more explicit on your request? >I want to run long-running queries, e. g. to analyze usage of the WikibaseQualityConstraints extension Y... [19:02:16] jouncebot: one deployer and a whole posse of bots [19:02:51] !log restore elasticsearch eqiad disk high/low watermarks to 75/80% with all large reindexes complete [19:02:52] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 [19:02:54] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 (owner: 1020after4) [19:02:56] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:03:30] * twentyafterfour imagines that jouncebot is listening. [19:03:43] learning even [19:04:12] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 (owner: 1020after4) [19:04:46] (03PS1) 10Ottomata: check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) [19:05:14] (03CR) 10jerkins-bot: [V: 04-1] check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:05:23] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.27 [19:05:47] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:05:58] (03PS1) 10Chad: Adding zuul for building [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422469 [19:06:00] (03CR) 10Chad: [C: 032] Adding zuul for building [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422469 (owner: 10Chad) [19:06:21] (03CR) 10BBlack: wmf-config: Enable oversampling for remaining countries in Asia (032 comments) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422419 (https://phabricator.wikimedia.org/T189252) (owner: 10Imarlier) [19:06:41] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.27 (duration: 01m 17s) [19:07:00] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:08:16] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422466 (owner: 1020after4) [19:09:10] !log milimetric@tin Started deploy [analytics/refinery@c22fd1e]: (no justification provided) [19:09:16] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:24] !log milimetric@tin Started deploy [analytics/refinery@c22fd1e]: Fixing python import bug [19:09:29] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:09:30] (03PS2) 10Ottomata: check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) [19:10:34] (03CR) 10Ottomata: [C: 032] check_kafka_consumer_log - STOP != alert, just bursty topics [puppet] - 10https://gerrit.wikimedia.org/r/422467 (https://phabricator.wikimedia.org/T189611) (owner: 10Ottomata) [19:12:12] !log milimetric@tin Finished deploy [analytics/refinery@c22fd1e]: Fixing python import bug (duration: 02m 48s) [19:12:18] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log 
[19:17:42] I'm seeing quite a few "[{exception_id}] {exception_url} Wikimedia\Rdbms\DBExpectedError from line 924 of /srv/mediawiki/php-1.31.0-wmf.27/includes/libs/rdbms/database/DatabaseMysqlBase.php: Replication wait failed: Lost connection to MySQL server during query (10.64.48.172) [19:17:43] 15 [19:17:46] " [19:18:23] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 80.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:18:32] 30+ in the last 5 minutes which is quite a bit more than the error rate prior to the train [19:18:36] and then there is that [19:18:40] ^ [19:19:06] rolling back to wmf.26 [19:20:04] !log Rolling back to wmf.26 due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" [19:20:09] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:20:19] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 [19:20:21] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 (owner: 1020after4) [19:21:47] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 (owner: 1020after4) [19:22:05] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422472 (owner: 1020after4) [19:22:49] (03PS1) 10Ottomata: Increase main -> jumbo MirrorMaker num.streams to 12 [puppet] - 10https://gerrit.wikimedia.org/r/422473 (https://phabricator.wikimedia.org/T189464) [19:22:50] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26 [19:22:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:23:35] any dbas around to help me figure out what's wrong with wmf.27? [19:23:47] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:24:04] ^ those are all the same "Replication wait failed: lost connection to MySQL server during query" error [19:24:08] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 17s) [19:24:14] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:26:09] twentyafterfour: they're out (time of day && holidays) [19:27:19] (03CR) 10Ottomata: [C: 032] Increase main -> jumbo MirrorMaker num.streams to 12 [puppet] - 10https://gerrit.wikimedia.org/r/422473 (https://phabricator.wikimedia.org/T189464) (owner: 10Ottomata) [19:28:54] greg-g: great [19:29:09] (03PS1) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:29:43] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:29:44] looks like the errors are all from RefreshLinksJob: /srv/mediawiki/php-1.31.0-wmf.27/includes/jobqueue/jobs/RefreshLinksJob.php [19:30:16] twentyafterfour: Where in tendril does it show the queries which are causing the fatals that show up in logstash? I'm trying to get the hang of tendril. 
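[Editor's aside on the error quoted at 19:17] The "Replication wait failed: Lost connection to MySQL server during query" fatal comes from the commit-then-wait-for-replication pattern: MediaWiki commits a write transaction, then blocks on a replica until it has caught up, which on MariaDB is the `SELECT MASTER_GTID_WAIT(...)` slow query that shows up later in this log at 22:13. The sketch below is not the actual Wikimedia\Rdbms code; host names, credentials and the 10-second timeout are hypothetical stand-ins, and only the use of `MASTER_GTID_WAIT()` and the replica IP 10.64.48.172 are taken from the log itself.

```php
<?php
// Minimal sketch of the commit-then-wait pattern behind the error above.
// NOT the real DatabaseMysqlBase/LoadBalancer code; hosts and credentials
// are placeholders.

// 1. Write on the master inside a short transaction.
$master = new mysqli( 'db-master.example.net', 'wikiuser', 'secret', 'enwiki' );
$master->begin_transaction();
$master->query( "UPDATE page SET page_touched = NOW() WHERE page_id = 12345" );
$master->commit();

// 2. Record the master's replication position (GTID on MariaDB).
$pos = $master->query( "SELECT @@gtid_binlog_pos AS pos" )->fetch_assoc()['pos'];

// 3. Block on a replica until it has applied that position, or give up
//    after 10 seconds. MASTER_GTID_WAIT() returns 0 on success, -1 on
//    timeout and NULL on error; if the replica connection drops while the
//    wait is still running, the client instead sees the quoted
//    "Lost connection to MySQL server during query" failure.
$replica = new mysqli( '10.64.48.172', 'wikiuser', 'secret', 'enwiki' );
$result = $replica->query(
	"SELECT MASTER_GTID_WAIT('" . $replica->real_escape_string( $pos ) . "', 10) AS waited"
);
var_dump( $result->fetch_assoc() );
```

As the discussion below notes, the wait failing is the symptom rather than the cause: the interesting question is why the replica fell behind or dropped the connection in the first place.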
[19:30:23] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [19:30:46] Niharika: I'm not sure, I'm coming at this from the opposite direction - from kibana [19:30:57] twentyafterfour: Ah, okay, makes sense. [19:31:14] I followed the php stack traces back to RefreshLinksJob line 258 [19:32:32] (03PS2) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:32:53] commitAndWaitForReplication [19:33:21] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:33:38] so refreshlinks opens a transaction for "runForTitle" and then the commitAndWaitForReplication times out [19:36:45] 10Operations, 10Traffic, 10Patch-For-Review: Investigate Chrony as a replacement for ISC ntpd - https://phabricator.wikimedia.org/T177742#4089162 (10BBlack) Looking at ntp::chrony now as I've noticed the above dns5001 switch. There seems to be nothing in there about local peering, or about clock consistency... [19:38:02] (03CR) 10BBlack: [C: 04-1] "https://phabricator.wikimedia.org/T177742#4089162 ?" [puppet] - 10https://gerrit.wikimedia.org/r/422387 (https://phabricator.wikimedia.org/T177742) (owner: 10Muehlenhoff) [19:38:19] (03PS3) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:38:53] (03CR) 10jerkins-bot: [V: 04-1] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:42:43] niharika: I just went to the server by finding it's ip .. there I can see that there was a big jump in implicit temporary tables [19:42:47] https://tendril.wikimedia.org/host/view/db1109.eqiad.wmnet/3306 [19:43:09] twentyafterfour: And how did you pick the server? [19:43:42] twentyafterfour: And how do we see which queries caused that jump? [19:43:45] In tendril. [19:44:03] Niharika: the error message mentions 10.64.48.172 so I searched the page to find the db server with that IP. as for what query caused that jump, I'm not sure [19:44:08] the php code is opaque [19:44:22] (03PS1) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:44:24] (03PS4) 10Rush: openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) [19:45:08] I'd have expected tendril to be able to show me queries for a given timestamp but that doesn't seem so. [19:45:21] nope [19:45:29] it has a slow query log [19:45:36] https://tendril.wikimedia.org/report/slow_queries?host=family%3Adb1109&hours=1 [19:45:46] Yeah but that's not very useful, is it? [19:45:52] For cases like these. 
[19:45:53] not really [19:46:33] 11k implicit temp tables is pretty extreme (with the baseline < 3k) [19:47:03] but I can't figure out what query is involved or anything else that might help pinpoint the cause [19:48:56] (03PS2) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:49:24] (03PS3) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:49:40] (03CR) 10Rush: [C: 032] openstack: neutron l3-agent custom iptables behavior [puppet] - 10https://gerrit.wikimedia.org/r/422474 (https://phabricator.wikimedia.org/T167357) (owner: 10Rush) [19:50:03] (03CR) 10Dzahn: [C: 032] install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [19:50:09] (03PS4) 10Dzahn: install_server: set deploy1001 to use jessie [puppet] - 10https://gerrit.wikimedia.org/r/422479 (https://phabricator.wikimedia.org/T175288) [19:52:41] twentyafterfour: https://tendril.wikimedia.org/report/slow_queries_checksum?checksum=2d03f574e8b789aec61dc623f2b45ad2&host=family%3Adb1109&user=&schema=&hours=1%2F32 [19:52:46] This might be the one? [19:53:10] But it's not useful much. [19:54:29] !log deploy1001 - schedule downtime for reinstall with jessie, reinstalling (T175288) [19:54:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [19:54:35] T175288: setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288 [20:00:04] Niharika: likely yes [20:00:04] cscott, arlolra, subbu, bearND, halfak, and Amir1: Dear deployers, time to do the Services – Parsoid / Citoid / Mobileapps / ORES / … deploy. Dont look at me like that. You signed up for it. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2000). [20:00:05] No GERRIT patches in the queue for this window AFAICS. [20:00:28] Niharika: it does look like locking code is where the problem surfaced [20:05:19] Niharika: I created a task https://phabricator.wikimedia.org/T190960 [20:05:47] greg-g: should this be high or ubn? 
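[Editor's aside on the implicit-temporary-table spike above] Tendril's graph is fed by the server-side `Created_tmp_*` status counters, and the usual way to pin the spike on a specific statement is to `EXPLAIN` candidates from the slow query log and look for "Using temporary". A rough sketch follows; the host, credentials and the example query are placeholders, not the actual offender.

```php
<?php
// Sketch of chasing an implicit-temp-table spike like the one on db1109.
// Hypothetical host/credentials; the EXPLAINed query is just an example.
$db = new mysqli( 'db1109.example.net', 'wikiuser', 'secret', 'enwiki' );

// 1. The counters behind the "implicit temporary tables" graph:
//    Created_tmp_tables / Created_tmp_disk_tables increase once for every
//    statement that needed an internal temporary table.
$status = $db->query( "SHOW GLOBAL STATUS LIKE 'Created_tmp%'" );
while ( $row = $status->fetch_assoc() ) {
	printf( "%s = %s\n", $row['Variable_name'], $row['Value'] );
}

// 2. For a candidate statement (e.g. pulled from the slow query log),
//    EXPLAIN shows whether it is an offender: "Using temporary" in the
//    Extra column means it builds an implicit temp table on every run.
$explain = $db->query(
	"EXPLAIN SELECT pl_namespace, COUNT(*) FROM pagelinks GROUP BY pl_namespace"
);
while ( $row = $explain->fetch_assoc() ) {
	echo $row['Extra'] . "\n";
}
```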
[20:06:02] it's definitely a critical train blocker but it's not an outage [20:06:14] UBN as it's blocking the train [20:09:36] I pinged AaronSchulz on the task since there's no DBAs around [20:09:36] !log mlitn@tin Started deploy [3d2png/deploy@c447488]: Updating 3d2png [20:09:41] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:12:02] !log mlitn@tin Finished deploy [3d2png/deploy@c447488]: Updating 3d2png (duration: 02m 26s) [20:12:07] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:22:08] (03PS1) 10Rush: openstack: bootstrapping neutron l3 agent for floating ip [puppet] - 10https://gerrit.wikimedia.org/r/422489 (https://phabricator.wikimedia.org/T188266) [20:22:40] (03PS2) 10Rush: openstack: bootstrapping neutron l3 agent for floating ip [puppet] - 10https://gerrit.wikimedia.org/r/422489 (https://phabricator.wikimedia.org/T188266) [20:24:23] (03CR) 10Rush: [C: 032] openstack: bootstrapping neutron l3 agent for floating ip [puppet] - 10https://gerrit.wikimedia.org/r/422489 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [20:44:42] !log bsitzmann@tin Started deploy [mobileapps/deploy@6a0d877]: Update mobileapps to a5833a0 [20:44:46] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:50:18] !log bsitzmann@tin Finished deploy [mobileapps/deploy@6a0d877]: Update mobileapps to a5833a0 (duration: 05m 36s) [20:50:22] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [20:51:41] (03CR) 10Chad: [V: 032 C: 032] Adding zuul for building [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422469 (owner: 10Chad) [20:55:49] (03PS1) 10Chad: Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 [20:55:51] (03CR) 10Chad: [C: 032] Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 (owner: 10Chad) [20:56:21] (03CR) 10Paladox: [C: 031] Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 (owner: 10Chad) [21:00:53] jouncebot: next [21:01:07] I swear I never get that command right [21:01:10] In 1 hour(s) and 58 minute(s): Evening SWAT (Max 8 patches) (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2300) [21:01:21] slow bots get beaten [21:02:15] when skynet comes to kill us all, it will be due to this throwaway remark of mine in a publically logged channel [21:04:35] (03PS1) 10Dzahn: Revert "mwscript: Detect php across distros" [puppet] - 10https://gerrit.wikimedia.org/r/422554 [21:05:51] (03PS1) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:06:23] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:06:27] (03CR) 10Chad: [V: 032 C: 032] Use stable-2.14 for zuul [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422550 (owner: 10Chad) [21:07:08] (03PS2) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:07:32] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 
(https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:08:40] Krinkle: Do you remember why we have 404.html for secure.wm.o and why it can't use 404.php like the other wikis? [21:08:47] I can't find any other users of 404.html [21:09:28] (03PS3) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:09:51] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:11:19] (03PS4) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:11:42] (03CR) 10jerkins-bot: [V: 04-1] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:12:21] no_justification: https://secure.wikimedia.org/404.hml vs https://meta.wikimedia.org/404.hml [21:12:28] I suppose the main difference is that secure isn't a wiki. [21:12:39] (03PS1) 10Chad: Forgot to add zuul to custom_plugins [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422557 [21:12:41] (03CR) 10Chad: [C: 032] Forgot to add zuul to custom_plugins [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422557 (owner: 10Chad) [21:12:43] (03CR) 10Chad: [V: 032 C: 032] Forgot to add zuul to custom_plugins [software/gerrit/gerrit] (wmf/stable-2.14) - 10https://gerrit.wikimedia.org/r/422557 (owner: 10Chad) [21:12:54] (03PS5) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:13:03] no_justification: and unlike other non-wiki domains on apaches (like www.wikimedia, www.wikipedia etc) there isn [21:13:14] isn't an obvious wiki for /wiki/ to redirect to [21:13:15] (yet) [21:13:25] Given the "Try /wiki/X" is part of the 404.php thing [21:13:42] Meh, fair nuff. [21:13:43] If we make /wiki/ redirect on secure thehn I'd be +1 for killing it [21:13:49] https://phabricator.wikimedia.org/T113114 [21:14:27] It also used to be used on bits [21:14:33] But yeah I thnk now it's just secure [21:14:45] (03PS6) 10Rush: openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) [21:15:20] I wanna remove the symlink and put it straight in the secure docroot if nothing else uses it [21:15:37] I can't find any pointers to it outside the apache config for secure-only [21:15:43] no_justification: Hm.. that may be tricky from Apache config perspective. It's in the default right? 
We'd need another fallback [21:15:51] Unless we want to inverse it and make 404.php the fallback [21:15:58] (03CR) 10Rush: [C: 032] openstack: add manual bridges for linux bridge agent [puppet] - 10https://gerrit.wikimedia.org/r/422555 (https://phabricator.wikimedia.org/T188266) (owner: 10Rush) [21:15:59] also https://wikitech.wikimedia.org/d4 should use it [21:16:21] and https://noc.wikimedia.org/1313 should arguably use 404.html [21:16:23] but doesn't right now [21:16:30] Sorry [21:17:03] wikitech shouldn't as it's becoming a normal wiki [21:20:11] (03PS2) 10Dzahn: Revert "mwscript: Detect php across distros" [puppet] - 10https://gerrit.wikimedia.org/r/422554 [21:22:25] (03CR) 10Dzahn: [C: 032] "we are currently back on tin, so everything is php5 like before and deploy1001 is reinstalled with jessie for now.." [puppet] - 10https://gerrit.wikimedia.org/r/422554 (owner: 10Dzahn) [21:23:42] no_justification: Yeah, I meant wikitech should use 404.php [21:23:45] rght now it's apache default [21:37:59] (03CR) 10Dzahn: "@Alex that's exactly what i thought and did first, but then the stylecheck voted me down.. and it didn't in the past.. which is what made " [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [21:42:22] !log getting the train back on track, group1 wikis to 1.31.0-wmf.27 [21:42:27] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:42:40] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 [21:42:42] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 (owner: 1020after4) [21:44:09] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 (owner: 1020after4) [21:44:13] (03PS1) 10Rush: openstack: neutron bridge set default to undef instead of '' [puppet] - 10https://gerrit.wikimedia.org/r/422564 [21:46:40] (03CR) 10Rush: [C: 032] openstack: neutron bridge set default to undef instead of '' [puppet] - 10https://gerrit.wikimedia.org/r/422564 (owner: 10Rush) [21:46:45] (03PS2) 10Rush: openstack: neutron bridge set default to undef instead of '' [puppet] - 10https://gerrit.wikimedia.org/r/422564 [21:48:27] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.27 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422563 (owner: 1020after4) [21:49:14] (03PS1) 10Rush: Revert "openstack: neutron bridge set default to undef instead of ''" [puppet] - 10https://gerrit.wikimedia.org/r/422568 [21:49:38] (03CR) 10Rush: [C: 032] Revert "openstack: neutron bridge set default to undef instead of ''" [puppet] - 10https://gerrit.wikimedia.org/r/422568 (owner: 10Rush) [21:49:47] (03CR) 10Rush: [V: 032 C: 032] Revert "openstack: neutron bridge set default to undef instead of ''" [puppet] - 10https://gerrit.wikimedia.org/r/422568 (owner: 10Rush) [21:50:29] 10Operations, 10Beta-Cluster-Infrastructure, 10User-Ladsgroup: Remove uca-fa from beta cluster - https://phabricator.wikimedia.org/T190965#4089423 (10Ladsgroup) p:05Triage>03High [21:52:08] (03PS3) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [21:52:39] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [21:52:41] (03PS1) 10Ladsgroup: labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 
(https://phabricator.wikimedia.org/T190965) [21:52:46] (03PS1) 10Chad: Apache: Move all private wikis to a single vhost block [puppet] - 10https://gerrit.wikimedia.org/r/422571 [21:52:52] Krinkle: ^^^ <3 [21:53:14] !log deploy1001 - revoking old puppet certs and signing new ones [21:53:19] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [21:53:57] (03PS2) 10Chad: Apache: Move all private wikis to a single vhost block [puppet] - 10https://gerrit.wikimedia.org/r/422571 [21:54:03] PROBLEM - nutcracker process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:03] PROBLEM - nutcracker port on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:04] PROBLEM - Check whether ferm is active by checking the default input chain on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:13] PROBLEM - DPKG on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:13] PROBLEM - Confd template for /etc/dsh/group/jobrunner on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:13] PROBLEM - configured eth on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:14] PROBLEM - Check size of conntrack table on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:23] PROBLEM - Unmerged changes on repository mediawiki_config on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:23] PROBLEM - MD RAID on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:24] PROBLEM - Confd template for /etc/dsh/group/cassandra on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:43] PROBLEM - Confd template for /etc/dsh/group/ores on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:43] PROBLEM - Confd template for /etc/dsh/group/maps on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:43] PROBLEM - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:44] PROBLEM - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:44] PROBLEM - Disk space on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:44] PROBLEM - Keyholder SSH agent on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:47] (03CR) 10Ladsgroup: [C: 032] labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [21:54:53] PROBLEM - confd service on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:53] PROBLEM - dhclient process on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:53] PROBLEM - Confd template for /etc/dsh/group/parsoid on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:54] PROBLEM - Check systemd state on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:54:54] PROBLEM - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:56:01] (03Merged) 10jenkins-bot: labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [21:56:12] woo [21:56:33] PROBLEM - puppet last run on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:57:25] group1 wikis to 1.31.0-wmf.27 is merged but not 
rebased on tin, is it okay if I rebase tin anyway? [21:57:31] twentyafterfour: ^ [21:57:56] Amir1: still mid-deploy [21:58:24] I'm waiting on jenkins to merge https://gerrit.wikimedia.org/r/#/c/422565/' [21:58:37] (03CR) 10jenkins-bot: labs: Change category collataion of fawiki back to default [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422570 (https://phabricator.wikimedia.org/T190965) (owner: 10Ladsgroup) [21:58:46] okay, I thought it's ended (the deployment calendar was like it) [21:59:01] train got delayed today [21:59:02] just keep in mind that is mine labs: Change category collataion of fawiki back to default [21:59:09] PROBLEM - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is CRITICAL: Return code of 255 is out of bounds [21:59:43] don't be surprised if you see changes related to that. sorry for interrupting [21:59:48] it's labs only change [21:59:54] Amir1: ok [22:00:22] Thanks [22:01:42] (03CR) 10Krinkle: [C: 031] "LGTM. Confirmed all the same hosts are still in there (order changed slightly)" [puppet] - 10https://gerrit.wikimedia.org/r/422571 (owner: 10Chad) [22:03:43] !log syncing https://gerrit.wikimedia.org/r/#/c/422565/ refs T190960 T183966 [22:03:49] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:03:50] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [22:03:50] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [22:05:00] Amir1: would you like me to sync https://gerrit.wikimedia.org/r/#/c/422570/ ? [22:05:29] !log twentyafterfour@tin Synchronized php-1.31.0-wmf.27/includes/: sync https://gerrit.wikimedia.org/r/#/c/422565/ (duration: 02m 15s) [22:05:33] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:07:19] PROBLEM - Check the NTP synchronisation status of timesyncd on deploy1001 is CRITICAL: CHECK_NRPE: Error - Could not complete SSL handshake. 
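[Editor's aside on the 404.html / 404.php thread from around 21:15] The "Try /wiki/X" behaviour that makes 404.php unsuitable for secure.wikimedia.org is just a hint pointing at a wiki page that might exist. The sketch below is a simplified stand-in for the real 404.php in mediawiki-config, not a copy of it; it only assumes the host and path that the web server passes in, and illustrates why a domain with no wiki behind it (secure) falls back to a static 404.html instead.

```php
<?php
// Simplified "Try /wiki/X" 404 handler; illustrative only, not the real
// mediawiki-config 404.php.
$host = isset( $_SERVER['HTTP_HOST'] ) ? $_SERVER['HTTP_HOST'] : 'meta.wikimedia.org';
$path = isset( $_SERVER['REQUEST_URI'] ) ? $_SERVER['REQUEST_URI'] : '/';

http_response_code( 404 );
header( 'Content-Type: text/html; charset=utf-8' );

// Offer a /wiki/<title> guess when the request looks like a bare page name.
// A non-wiki domain has nowhere sensible to point this hint, which is why
// secure.wikimedia.org keeps a plain 404.html instead.
$title = trim( $path, '/' );
$hint = '';
if ( $title !== '' && strpos( $title, 'wiki/' ) !== 0 ) {
	$hint = '<p>Try <a href="https://' . htmlspecialchars( $host ) . '/wiki/' .
		htmlspecialchars( rawurlencode( $title ) ) . '">/wiki/' .
		htmlspecialchars( $title ) . '</a></p>';
}
echo "<!DOCTYPE html><html><body><h1>Not Found</h1>{$hint}</body></html>";
```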
[22:08:49] !log rolling forward group1 to 1.31.0-wmf.27 refs T183966 T190960 [22:08:55] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:08:55] T183966: 1.31.0-wmf.27 deployment blockers - https://phabricator.wikimedia.org/T183966 [22:08:55] T190960: 1.31.0-wmf.27 rolled back due to increase in fatals: "Replication wait failed: lost connection to MySQL server during query" - https://phabricator.wikimedia.org/T190960 [22:09:21] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: sync https://gerrit.wikimedia.org/r/#/c/422563/ group1 wikis to 1.31.0-wmf.27 refs T183966 T190960 [22:09:28] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:10:10] (03PS4) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [22:10:38] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [22:10:49] now I'm seeing [{exception_id}] {exception_url} Wikimedia\Rdbms\DBTransactionSizeError from line 1293 of /srv/mediawiki/php-1.31.0-wmf.26/includes/libs/rdbms/loadbalancer/LoadBalancer.php: Transaction spent 6.4027202129364 second(s) in writes, exceeding the limit of 3 [22:10:57] greg-g AaronSchulz ^ [22:12:25] rolling back again [22:12:31] twentyafterfour: It doesn't need sync [22:12:38] AFAIK [22:13:01] (03PS1) 1020after4: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 [22:13:03] (03CR) 1020after4: [C: 032] group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 (owner: 1020after4) [22:13:37] twentyafterfour: :( [22:13:56] !log deploy of 1.31.0-wmf.27 resulted in a lot of SlowTimer errors for SlowTimer [10000ms] at runtime/ext_mysql: slow query: SELECT MASTER_GTID_WAIT(...) [22:14:01] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:14:22] (03Merged) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 (owner: 1020after4) [22:15:40] (03PS5) 10Dzahn: site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 [22:15:41] these errors are the symptom not the cause [22:16:09] (03CR) 10jerkins-bot: [V: 04-1] site: enable mapped IPv6 on bromine/vega [puppet] - 10https://gerrit.wikimedia.org/r/420143 (owner: 10Dzahn) [22:16:12] !log twentyafterfour@tin rebuilt and synchronized wikiversions files: group1 wikis to 1.31.0-wmf.26 [22:16:17] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:16:19] something else is causing database slowness [22:16:29] probably in unrelated part of the codebase [22:16:42] sorry for the icinga alerts that shouldnt have been here [22:16:57] no reason to worry about deploy1001. i got it [22:17:09] mutante: I'm also getting a bunch of scap errors from deploy1001: Permission denied (publickey,keyboard-interactive). [22:17:28] twentyafterfour: why would that be if we are back to tin? [22:17:30] !log twentyafterfour@tin Synchronized php: group1 wikis to 1.31.0-wmf.26 (duration: 01m 18s) [22:17:35] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [22:17:41] because scap is still trying to connect to the co-masters and the ssh keys changed? 
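[Editor's aside on the DBTransactionSizeError quoted at 22:10] That error is a guard rail rather than a database failure: the write phase of a transaction is timed, and a commit that spent longer writing than the configured limit (3 seconds in the message) is reported. The class below is only a sketch of that idea, not the real LoadBalancer.php logic; the 3-second default mirrors the error message and everything else is illustrative.

```php
<?php
// Sketch of a transaction write-time guard, similar in spirit to the check
// behind DBTransactionSizeError. Not the actual MediaWiki implementation.
class WriteTimeGuard {
	private $db;
	private $limit;
	private $writeSeconds = 0.0;

	public function __construct( mysqli $db, $limitSeconds = 3.0 ) {
		$this->db = $db;
		$this->limit = $limitSeconds;
	}

	public function write( $sql ) {
		// Accumulate wall-clock time spent in write statements.
		$start = microtime( true );
		$this->db->query( $sql );
		$this->writeSeconds += microtime( true ) - $start;
	}

	public function commit() {
		$this->db->commit();
		if ( $this->writeSeconds > $this->limit ) {
			throw new RuntimeException( sprintf(
				'Transaction spent %.3f second(s) in writes, exceeding the limit of %.0f',
				$this->writeSeconds, $this->limit
			) );
		}
	}
}
```

Long-running jobs avoid tripping this kind of limit by batching their writes and committing (and waiting for replication) between batches, which is exactly the commitAndWaitForReplication path discussed earlier.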
[22:17:55] oh, that,yes [22:18:02] it's still on the puppet run [22:18:04] or the scap user isn't authorized to connect to deploy1001 anymore for whatever reason [22:18:15] ok no big deal scap continues and ignores the errors [22:18:18] it's setting up the things right now [22:18:21] (03CR) 10jenkins-bot: group1 wikis to 1.31.0-wmf.26 [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422576 (owner: 1020after4) [22:18:48] it is just kind of slow, i was hoping to have it done earlier [22:21:59] PROBLEM - MediaWiki exceptions and fatals per minute on graphite1001 is CRITICAL: CRITICAL: 70.00% of data above the critical threshold [50.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:22:54] wow, that looks bad [22:23:10] twentyafterfour: cc ^ [22:23:27] but looks like it dropped again [22:23:56] heh, you haven't seen how it's when it's really bad :P [22:23:59] RECOVERY - MediaWiki exceptions and fatals per minute on graphite1001 is OK: OK: Less than 70.00% above the threshold [25.0] https://grafana.wikimedia.org/dashboard/db/mediawiki-graphite-alerts?orgId=1&panelId=2&fullscreen [22:24:11] pheew. nice to see that follow-up [22:24:59] that alert was a bit delayed :) [22:25:02] MaxSem: oh, I zoomed out and found the all time high score: 74k :D [22:25:29] yeah icinga is slow to report fatals per minute. both times I spotted the problem in fatalmonitor before icinga-wm could alert [22:25:44] we might want to increase the sensitivity of that alert? [22:26:59] sounds like a good idea [22:27:04] (03CR) 10Chad: "Should probably alphabetize them tbh" [puppet] - 10https://gerrit.wikimedia.org/r/422571 (owner: 10Chad) [22:27:22] not just icinga config, also also how long it takes to be in graphite and to get enough data to diff it afaict [22:27:36] also also also [22:27:38] but yea, sure possible [22:27:42] :P [22:28:08] the usual worry is signal/noise of course, but we can always experiment [22:28:21] yea. "let icinga check graphite" also can have downsides [22:34:10] twentyafterfour: is your deployment done? I had a little window:) [22:35:05] (03PS3) 10Bstorm: toolforge: Add tmpreaper with a custom config to web nodes [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) [22:35:08] yeah :( we're rolled back [22:35:11] MaxSem: yes the train is rolled back and probably not going to be resolved too soon [22:35:40] @jouncebot: reload [22:35:44] (03CR) 10Bstorm: [C: 032] toolforge: Add tmpreaper with a custom config to web nodes [puppet] - 10https://gerrit.wikimedia.org/r/422186 (https://phabricator.wikimedia.org/T190185) (owner: 10Bstorm) [22:35:51] jouncebot: refresh [22:35:52] I refreshed my knowledge about deployments. [22:35:57] @jouncebot: last [22:36:02] jouncebot: last [22:36:11] bleh [22:37:05] 10Operations, 10Traffic, 10Goal, 10Patch-For-Review, 10User-fgiunchedi: Deprecate python varnish cachestats - https://phabricator.wikimedia.org/T184942#4089633 (10Krinkle) Thanks @Vgutierrez ! [22:38:26] jouncebot: now [22:38:26] No deployments scheduled for the next 0 hour(s) and 21 minute(s) [22:41:46] so, just a little warning [22:41:54] deploy1001 is still not done with an initial puppet run [22:42:07] but it's a scap host [22:42:22] not the deployment server for sure, just a host like others [22:42:37] so scap will say it cant connect to it.. but then continue [22:42:51] i am watching it finish the install thoguh.. 
so soon it should be fixed [22:43:05] (03CR) 10Madhuvishy: [C: 031] clean up internal rsync client list for dumps servers [puppet] - 10https://gerrit.wikimedia.org/r/422454 (owner: 10ArielGlenn) [22:43:07] hopefully within 21 min [22:43:28] bleh [22:46:33] if not i can remove it from the dsh group too [22:49:24] (03CR) 10Dzahn: [C: 031] "+1 but DBAs have to approve" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422362 (https://phabricator.wikimedia.org/T102005) (owner: 10Krinkle) [22:50:52] (03CR) 10Dzahn: "just wanted to let you know recently i removed another use-case of this module in mw-deployment" [puppet] - 10https://gerrit.wikimedia.org/r/391849 (https://phabricator.wikimedia.org/T162070) (owner: 10Jcrespo) [22:51:30] (03CR) 10Dzahn: [C: 032] "also see https://gerrit.wikimedia.org/r/391849" [puppet] - 10https://gerrit.wikimedia.org/r/421197 (https://phabricator.wikimedia.org/T175288) (owner: 10Dzahn) [22:53:50] 10Operations, 10Wikimedia-Apache-configuration, 10Performance-Team (Radar): VirtualHost for mod_status breaks debugging Apache/MediaWiki from localhost - https://phabricator.wikimedia.org/T190111#4089711 (10Krinkle) Forgot to say: The aforementioned workaround is not actually a workaround (sorry). The hostna... [22:55:39] 10Operations, 10hardware-requests, 10Release-Engineering-Team (Watching / External): eqiad: replacement tin/deployment server - https://phabricator.wikimedia.org/T174452#4089715 (10Dzahn) [22:55:48] 10Operations, 10Patch-For-Review, 10Release-Engineering-Team (Watching / External): setup/install/deploy deploy1001 as deployment server - https://phabricator.wikimedia.org/T175288#4089714 (10Dzahn) 05Resolved>03Open [22:57:03] (03PS4) 10Bstorm: wiki replicas: refactor and record grants and set user [puppet] - 10https://gerrit.wikimedia.org/r/422199 (https://phabricator.wikimedia.org/T181650) [22:58:03] (03PS1) 10EBernhardson: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) [22:58:15] (03PS1) 10Bstorm: wiki replicas: trying moving hieradata around [labs/private] - 10https://gerrit.wikimedia.org/r/422586 [22:58:19] 10Operations, 10monitoring, 10Patch-For-Review, 10Services (watching): Add Reading Infrastructure engineers to contacts for mobileapps - https://phabricator.wikimedia.org/T189524#4089727 (10Dzahn) I think the only thing left would have been to test if you can also execute commands like "schedule downtime"... [22:58:41] (03CR) 10Bstorm: [V: 032 C: 032] wiki replicas: trying moving hieradata around [labs/private] - 10https://gerrit.wikimedia.org/r/422586 (owner: 10Bstorm) [23:00:04] MaxSem: I seem to be stuck in Groundhog week. Sigh. Time for (yet another) Can't make it to SWAT deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2300). [23:00:04] musikanimal and ebernhardson: A patch you scheduled for Can't make it to SWAT is about to be deployed. Please be around during the process. Note: If you break AND fix the wikis, you will be rewarded with a sticker. [23:00:04] addshore, hashar, anomie, aude, MaxSem, twentyafterfour, RoanKattouw, Dereckson, thcipriani, Niharika, and zeljkof: I, the Bot under the Fountain, allow thee, The Deployer, to do Evening SWAT (Max 8 patches) deploy. (https://wikitech.wikimedia.org/wiki/Deployments#deploycal-item-20180328T2300). [23:00:04] No GERRIT patches in the queue for this window AFAICS. 
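[Editor's aside on the 22:25–22:28 discussion of alert sensitivity and "let icinga check graphite"] A Graphite-backed check has to wait for the metric to land in Graphite and then for enough datapoints to evaluate a percentage-over-threshold rule, which is why it reacts later than watching fatalmonitor directly. The sketch below shows the general shape of such a check; the Graphite host and metric path are placeholders (the log does not name the exact metric behind the mediawiki-graphite-alerts dashboard), while the 50.0 threshold and "70% of data above" wording mirror the alert text seen earlier.

```php
<?php
// Sketch of a Graphite-backed threshold check; hostname and metric name are
// hypothetical, thresholds mirror the alert quoted in the log.
$url = 'https://graphite.example.org/render?' . http_build_query( [
	'target' => 'MediaWiki.errors.fatal.rate',   // hypothetical metric path
	'from'   => '-10min',
	'format' => 'json',
] );
$series = json_decode( file_get_contents( $url ), true )[0]['datapoints'];

// Drop empty buckets: the newest datapoints are often still null, one of the
// reasons this style of check lags behind the live error stream.
$values = array_filter( array_column( $series, 0 ), 'is_numeric' );
$above  = count( array_filter( $values, function ( $v ) { return $v > 50; } ) );
$pct    = $values ? 100 * $above / count( $values ) : 0;

if ( $pct >= 70 ) {
	echo "CRITICAL: {$pct}% of data above the critical threshold [50.0]\n";
} elseif ( $pct > 0 ) {
	echo "WARNING: some datapoints above the threshold\n";
} else {
	echo "OK\n";
}
```

Tightening the alert means trading the percentage, window length and threshold against noise from short bursts, which is the signal-to-noise concern raised in the conversation above.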
[23:00:31] I'm here [23:00:48] RECOVERY - IPsec on kafka-jumbo1002 is OK: Strongswan OK - 136 ESP OK [23:00:48] RECOVERY - Host cp2006 is UP: PING OK - Packet loss = 0%, RTA = 36.14 ms [23:00:48] RECOVERY - IPsec on cp3007 is OK: Strongswan OK - 40 ESP OK [23:00:49] RECOVERY - IPsec on kafka-jumbo1003 is OK: Strongswan OK - 136 ESP OK [23:00:49] RECOVERY - IPsec on kafka1012 is OK: Strongswan OK - 136 ESP OK [23:00:49] RECOVERY - IPsec on kafka1022 is OK: Strongswan OK - 136 ESP OK [23:00:59] \o [23:00:59] RECOVERY - IPsec on cp3008 is OK: Strongswan OK - 40 ESP OK [23:00:59] RECOVERY - IPsec on cp1061 is OK: Strongswan OK - 14 ESP OK [23:01:03] these are due to cp2006 coming back [23:01:08] RECOVERY - IPsec on cp1045 is OK: Strongswan OK - 14 ESP OK [23:01:18] RECOVERY - IPsec on kafka-jumbo1006 is OK: Strongswan OK - 136 ESP OK [23:01:19] RECOVERY - IPsec on cp3010 is OK: Strongswan OK - 40 ESP OK [23:01:19] RECOVERY - IPsec on kafka-jumbo1004 is OK: Strongswan OK - 136 ESP OK [23:01:19] RECOVERY - IPsec on kafka-jumbo1001 is OK: Strongswan OK - 136 ESP OK [23:01:19] RECOVERY - IPsec on kafka1013 is OK: Strongswan OK - 136 ESP OK [23:01:28] RECOVERY - IPsec on kafka1020 is OK: Strongswan OK - 136 ESP OK [23:01:28] RECOVERY - IPsec on kafka1023 is OK: Strongswan OK - 136 ESP OK [23:01:38] RECOVERY - IPsec on kafka1014 is OK: Strongswan OK - 136 ESP OK [23:01:48] RECOVERY - IPsec on cp1051 is OK: Strongswan OK - 14 ESP OK [23:01:48] RECOVERY - IPsec on cp1058 is OK: Strongswan OK - 14 ESP OK [23:01:48] RECOVERY - IPsec on kafka-jumbo1005 is OK: Strongswan OK - 136 ESP OK [23:01:51] (03CR) 10EBernhardson: [C: 032] Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) (owner: 10MusikAnimal) [23:03:00] (03Merged) 10jenkins-bot: Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) (owner: 10MusikAnimal) [23:03:14] (03CR) 10jenkins-bot: Enable PageAssessments on trwiki [mediawiki-config] - 10https://gerrit.wikimedia.org/r/421080 (https://phabricator.wikimedia.org/T184969) (owner: 10MusikAnimal) [23:03:27] MaxSem: deploy1001 is cloning mw-config as we speak.. i can take it out of scap list.. just unfortunate timing.. so close but not yet [23:03:39] musikanimal: you're up on mwdebug1001 [23:03:45] mutante: we could delay like 30 minutes i suppose? [23:04:31] ebernhardson: let's say 15 and i prepare a patch to remove it in case we need it? [23:04:38] ok [23:06:29] (03PS1) 10Dzahn: remove deploy1001 from dsh hosts and scap masters [puppet] - 10https://gerrit.wikimedia.org/r/422587 [23:08:12] last time it was 12,000 seconds.. meh. ok [23:10:33] (03CR) 10Dzahn: [C: 032] remove deploy1001 from dsh hosts and scap masters [puppet] - 10https://gerrit.wikimedia.org/r/422587 (owner: 10Dzahn) [23:11:33] really? wow..thats 3.5 hours [23:11:47] yea, but i started hours ago too [23:12:01] and of course it is adding the needed user and keyholder about 5 seconds after i merged , lol [23:12:07] lol [23:12:48] ebernhardson: would you be able to sync a single host to the rest? [23:13:18] !log created PageAssessments tables on trwiki [23:13:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:13:43] musikanimal: test again [23:14:41] ebernhardson: we are good to sync! [23:15:13] mutante: hmm, i think that's how it works? deployment syncs to canarys, then app servers sync from canarys? 
[23:16:02] ebernhardson: ok, so you should just do what you were plannning to do [23:16:09] and not be blocked by me [23:16:11] it's removed on tin [23:16:31] it can be re-added and synced later [23:16:44] mutante: ok [23:17:18] RECOVERY - Confd template for /etc/dsh/group/jobrunner on deploy1001 is OK: No errors detected [23:17:18] RECOVERY - configured eth on deploy1001 is OK: OK - interfaces up [23:17:18] RECOVERY - DPKG on deploy1001 is OK: All packages OK [23:17:18] RECOVERY - MD RAID on deploy1001 is OK: OK: Active: 4, Working: 4, Failed: 0, Spare: 0 [23:17:19] RECOVERY - Check size of conntrack table on deploy1001 is OK: OK: nf_conntrack is 0 % full [23:17:25] hahaha, this is hilarious [23:17:29] RECOVERY - Unmerged changes on repository mediawiki_config on deploy1001 is OK: No changes to merge. [23:17:29] RECOVERY - Confd template for /etc/dsh/group/ores on deploy1001 is OK: No errors detected [23:17:29] the timing parts [23:17:38] RECOVERY - Confd template for /etc/dsh/group/zotero-translation-server on deploy1001 is OK: No errors detected [23:17:38] RECOVERY - Disk space on deploy1001 is OK: DISK OK [23:17:38] RECOVERY - Confd template for /etc/dsh/group/mediawiki-installation on deploy1001 is OK: No errors detected [23:17:39] RECOVERY - Confd template for /etc/dsh/group/maps on deploy1001 is OK: No errors detected [23:17:39] RECOVERY - confd service on deploy1001 is OK: OK - confd is active [23:17:48] RECOVERY - dhclient process on deploy1001 is OK: PROCS OK: 0 processes with command name dhclient [23:17:48] RECOVERY - Confd template for /etc/dsh/group/parsoid on deploy1001 is OK: No errors detected [23:17:58] RECOVERY - Check systemd state on deploy1001 is OK: OK - running: The system is fully operational [23:17:58] RECOVERY - Confd template for /etc/dsh/group/zotero-translators on deploy1001 is OK: No errors detected [23:17:59] RECOVERY - nutcracker process on deploy1001 is OK: PROCS OK: 1 process with UID = 114 (nutcracker), command name nutcracker [23:17:59] RECOVERY - nutcracker port on deploy1001 is OK: TCP OK - 0.000 second response time on 127.0.0.1 port 11212 [23:18:08] RECOVERY - Check whether ferm is active by checking the default input chain on deploy1001 is OK: OK ferm input default policy is set [23:18:08] RECOVERY - Confd template for /etc/dsh/group/cassandra on deploy1001 is OK: No errors detected [23:18:37] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T184969: Enable PageAssessments on trwiki (duration: 01m 09s) [23:18:42] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:18:43] T184969: Deploy PageAssessments to Turkish Wikipedia - https://phabricator.wikimedia.org/T184969 [23:18:47] musikanimal: ^ please test [23:19:03] (03PS1) 10Dzahn: Revert "remove deploy1001 from dsh hosts and scap masters" [puppet] - 10https://gerrit.wikimedia.org/r/422588 [23:19:06] (03CR) 10EBernhardson: [C: 032] "SWAT" [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:19:08] RECOVERY - Improperly owned -0:0- files in /srv/mediawiki-staging on deploy1001 is OK: Files ownership is ok. [23:22:51] (03PS1) 10Jdlrobson: Rollout VirtualPageViews (final stage) [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422589 (https://phabricator.wikimedia.org/T189906) [23:23:45] ebernhardson: looks good :) [23:24:02] blank as it should be https://tr.wikipedia.org/wiki/%C3%96zel:PageAssessments [23:24:28] musikanimal: great! 
[23:25:25] (03PS2) 10EBernhardson: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) [23:25:37] (03CR) 10EBernhardson: [C: 032] Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:26:29] RECOVERY - puppet last run on deploy1001 is OK: OK: Puppet is currently enabled, last run 10 seconds ago with 0 failures [23:26:52] (03Merged) 10jenkins-bot: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:27:16] thank you! [23:28:39] (03CR) 10jenkins-bot: Configure next Cirrus AB test [mediawiki-config] - 10https://gerrit.wikimedia.org/r/422585 (https://phabricator.wikimedia.org/T187148) (owner: 10EBernhardson) [23:28:47] (03PS3) 10Madhuvishy: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) [23:29:12] (03CR) 10jerkins-bot: [V: 04-1] nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) (owner: 10Madhuvishy) [23:33:13] (03PS4) 10Madhuvishy: nfsclient: Setup dumps mounts from new servers [puppet] - 10https://gerrit.wikimedia.org/r/403767 (https://phabricator.wikimedia.org/T188643) [23:37:24] RECOVERY - Check the NTP synchronisation status of timesyncd on deploy1001 is OK: OK: synced at Wed 2018-03-28 23:37:16 UTC. [23:38:17] !log ebernhardson@tin Synchronized wmf-config/InitialiseSettings.php: SWAT: T187148: Configure next Cirrus AB test (duration: 01m 16s) [23:38:23] Logged the message at https://wikitech.wikimedia.org/wiki/Server_Admin_Log [23:38:23] T187148: Evaluate features provided by `query_explorer` functionality of ltr plugin - https://phabricator.wikimedia.org/T187148 [23:39:01] SWAT complete [23:40:39] ok, re-adding deploy1001 [23:40:42] (03CR) 10Dzahn: [C: 032] Revert "remove deploy1001 from dsh hosts and scap masters" [puppet] - 10https://gerrit.wikimedia.org/r/422588 (owner: 10Dzahn) [23:42:20] eh.. puppet run is broken on tin.. what [23:44:51] no, it's not, i was just confused [23:58:40] 10Operations: build new version of mcrouter package - https://phabricator.wikimedia.org/T190979#4089852 (10Dzahn)